
Measurement of Health and Disease

For Medicine and Midwifery students

Mekonnen Assefa (BSc, MPH, Epidemiology & Biostatistics)

June 2016

Descriptive statistics
HANDLING OF DATA

Learning Objectives
Understand the basic concepts and terminology of
biostatistics, including the various kinds of variables,
measurement, and measurement scales.
Understand how data can be appropriately organized
and displayed.
Understand how to reduce data sets into a few useful,
descriptive measures.
Be able to calculate and interpret measures of central
tendency, such as the mean, median, and mode.
Be able to calculate and interpret measures of
dispersion, such as the range, variance, and standard
deviation.

Learning Objectives (continued)
Understand classical, relative frequency, and subjective
probability.
Understand the properties of probability and selected
probability rules.
Be able to calculate the probability of an event.
Understand selected discrete distributions and how to use
them to calculate probabilities.
Understand selected continuous distributions and how to
use them to calculate probabilities.
Be able to explain the similarities and differences between
distributions of the discrete type and the continuous type
and when the use of each is appropriate.

Learning Objectives (continued)
Be able to construct a sampling distribution of a statistic.
Understand how to use a sampling distribution to
calculate basic probabilities.
Understand the basic concepts of sampling with
replacement and without replacement.
Understand the importance and basic principles of
estimation.
Understand how to correctly state a null and alternative
hypothesis and carry out a structured hypothesis test.
Understand the concepts of type I error and type II error.
Be able to calculate and interpret z and chi-square test
statistics for making statistical inferences.

Define terms
Statistics?
Biostatistics?
Descriptive statistics?
Inferential statistics?

Variable?

Sample? Statistic?

Population? Parameter?

Variable
Variable: A characteristic which takes different
values in different persons, places, or things.
Any aspect of an individual or object that is
measured (e.g., BP) or recorded (e.g., age,
sex) and takes any value.
There may be one variable in a study or many.
E.g., A study of treatment outcome of TB
patients

Before summarization and organization, we
need to know the types of variables and
measurement scales of our data.
Before displaying or analyzing data, we
should classify the variables into their
different types.

Variables can be broadly classified into:
Categorical (or Qualitative) or
Quantitative (or numerical variables).

Categorical variable: A variable or
characteristic which cannot be measured in
quantitative form but can only be sorted by
name or category.

It cannot be measured in the way we measure height or weight.

The notion of magnitude is absent.

Quantitative variable: A variable that can be
measured (or counted) and expressed
numerically.

E.g., height, weight, number of children.

Has the notion of magnitude.

A quantitative variable is either discrete or continuous:
1. Discrete variable: It can take only a limited number of
distinct values (usually whole numbers).
E.g., the number of episodes of diarrhoea a child has had
in a year. You can't have 12.5 episodes of diarrhoea.
Characterized by gaps or interruptions in the
values.
Both the order and magnitude of the values matter.
The values aren't just labels, but are actual
measurable quantities.

2. Continuous variable: It can have an
infinite number of possible values in any given
interval.
Both the magnitude and the order of the values
matter.
Does not possess gaps or interruptions.
E.g., weight is continuous, since it can take any
value within an interval.

SUMMARY

Variable
  Qualitative or categorical
    Nominal (not ordered), e.g. ethnic group
    Ordinal (ordered), e.g. response to treatment
  Quantitative (measurement)
    Discrete (count data), e.g. number of admissions
    Continuous (real-valued), e.g. height

The first level shows the types of variables; the second level shows the measurement scales.
Measurement and Measurement Scales

Not all measurements are the same.

Measuring weight gives a number, e.g. 40 kg.
Measuring the status of a patient on a scale gives a category, e.g.
improved, stable, not improved.
There are four types of scales of
measurement.

1. Nominal scale:
The simplest type of data, in which the
values fall into unordered categories or
classes
Uses names, labels, or symbols to assign
each measurement.

Example of nominal scale:

Race/Ethnicity:
1. Black
2. White
3. Latino
4. Other

The numbers have NO meaning. They are labels only.

If nominal data can take on only two possible
values, they are called dichotomous or
binary.
So sex is not just nominal, it is dichotomous
(male or female).
Yes/no questions are also dichotomous,
e.g., cured from TB at 6 months of treatment (yes or no).

Other examples of nominal data: blood type, race, marital status.

2. Ordinal scale:
Assigns each measurement to one of a limited
number of categories that are ranked in terms
of order.
Although non-numerical, can be considered to
have a natural ordering

Example of ordinal scale:
Pain level:
1. None
2. Mild
3. Moderate
4. Severe

The numbers have LIMITED meaning: 4 > 3 > 2 > 1 is all we know, apart
from their utility as labels.
Other examples: patient status, cancer stages, social
class.

3. Interval scale:
- Measured on a continuum, and differences
between any two numbers on the scale are of
known size.
Example: Temperature in °F on 4 consecutive days
Days:        A   B   C   D
Temp. (°F):  50  55  60  65
For these data, not only is day A at 50 °F cooler
than day D at 65 °F, it is exactly 15 °F cooler.
- It has no true zero point: 0 is arbitrarily chosen
and doesn't reflect the absence of temperature.

4. Ratio scale:
- Measurement begins at a true zero point and
the scale has equal intervals.
- For example, age is ratio data: someone
who is 40 is twice as old as someone who is
20.

- Other examples: height, weight, etc.
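
The difference between the interval and ratio scales can be checked numerically. The sketch below is only an illustration: the helper name `fahrenheit_to_celsius` is an assumption introduced here, and the temperatures are days A and D from the interval-scale example. It suggests that a ratio of temperatures changes when the unit changes, whereas a ratio of ages is meaningful.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9


# Interval scale: temperatures of day A and day D from the example.
day_a_f, day_d_f = 50, 65
difference_f = day_d_f - day_a_f          # a difference of 15 °F is meaningful
ratio_f = day_d_f / day_a_f               # ratio computed in Fahrenheit
ratio_c = fahrenheit_to_celsius(day_d_f) / fahrenheit_to_celsius(day_a_f)

# Ratio scale: age has a true zero, so the ratio has a real interpretation.
age_ratio = 40 / 20                       # "twice as old"

print(f"Difference: {difference_f} °F")
print(f"Ratio in °F: {ratio_f:.2f}, ratio in °C: {ratio_c:.2f}")
print(f"Age ratio: {age_ratio:.1f}")
```

Because the two temperature ratios disagree, statements such as "day D was 1.3 times as hot as day A" have no fixed meaning on an interval scale.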

Scales of measurement

Degree of precision in measuring


1.5 Exercises
Consider the following Scales of measurement (types of data) and answer
questions A to D.

1. Blood group
2. Temperature (Celsius)
3. Ethnic group
4. Job satisfaction index (1-5)
5. Number of heart attacks
6. Serum uric acid (mg/100ml)
7. Number of accidents in 3 - year period
8. Number of cases of each reportable disease reported by a health worker
9. The average weight gain of 6 one-year-old dogs (with a special diet
supplement) was 950 grams last month.
10. Injury severity (a score between 1 and 3 is allocated depending on the
severity; scores 1 and 3 indicate mild and very severe, respectively).
11. Gender of babies born in a Hospital

A) Identify the types of data which are qualitative
and quantitative.
B) Identify the types of data which are numerical
discrete and numerical continuous.
C) Identify the type of data/ measurement scale
(nominal, ordinal, interval or ratio). Confirm your
answers by giving your own examples.
D) Which nominal scales are dichotomous? Which
ones are multichotomous?

Methods Of Data Collection,
Organization And Presentation

Before any statistical work can be done, data
must be collected.

Depending on the type of variable and the
objective of the study, different data
collection methods can be employed.

1. Data Collection Methods

Observation
Face-to-face and self-administered interviews
Postal or mail method and telephone interviews
Using available information
Focus group discussions (FGD) (qualitative study)
In-depth interview (qualitative study)

1. Observation:

A technique that involves systematically selecting,
watching and recording behaviors of people or other
phenomena and aspects of the setting in which they
occur, for the purpose of getting specified
information.

It includes all methods from simple visual
observations to the use of high level machines and
measurements, sophisticated equipment or
facilities, such as radiographic, biochemical, X-ray
machines, microscope, clinical examinations, and
microbiological examinations.

Advantages:
Generates relatively more accurate data on
behavior and activities.

Disadvantages:
Subject to the investigators' or observers' own biases,
prejudices and desires.
Needs more resources and skilled human power
when high level machines are used.

2. Interviews and self-administered questionnaires

These are probably the most commonly used research
data collection techniques.

Developing good questioning tools is therefore an
important and time-consuming phase of the
research proposal.

Questions to consider before
designing our questioning tools
include:
What exactly do we want to know?
Is questioning the right technique to obtain all
answers?
Do we understand the topic sufficiently to
design a questionnaire?
Are our informants mainly literate or illiterate?
How large is the sample that will be
interviewed?

Unstructured questionnaire:

Is flexible.
Uses open-ended questions.
Useful in a preliminary survey, and in intensive
studies of perceptions, attitudes, motivation and
affective reactions.
Unstructured interviews are characteristic of
qualitative research.

Structured questionnaire:

The wording and order of the questions are
decided in advance.

Preferred in community medicine research.

Advantages and disadvantages
Self-administered questionnaires:
Are simpler and cheaper.
Can be administered to many persons
simultaneously (e.g. to a class of students).
Unlike interviews, can be sent by post.
On the other hand:
They demand a certain level of education and skill
on the part of the respondents.
People of a low socio-economic status are less likely
to respond to a mailed questionnaire.

Interviewing using a questionnaire
Can be face-to-face or telephone interviews.

A good interviewer can stimulate and maintain the
respondent's interest.
If a question is not understood, an interviewer can
repeat it and give an explanation or alternative
wording.
Probing is possible.

In face-to-face interviews, observations can be
made as well.
Interviews are more expensive than self-administered
questionnaires.

Types of Questions
Depending on how questions are asked
and recorded, they can be classified as:

1. Open-ended questions
Permit free responses that should be recorded
in the respondent's own words.

The respondent is not given any possible
answers to choose from.

Such questions are useful to obtain
information on:

Facts with which the researcher is not very
familiar,
Opinions, attitudes, and suggestions of
informants, or
Sensitive issues.

E.g. - Can you describe what the traditional birth
attendant did when your labor started?
- What do you do when you feel sick?

2. Closed Questions
Offer a list of possible options or answers from
which the respondents must choose.
When designing closed questions one should try to:
Offer a list of options that are exhaustive and
mutually exclusive.
Keep the number of options as few as possible.
Closed questions are useful:
if the range of possible responses is known;
if one is only interested in certain aspects of the
issue.
E.g. What is your religion? (Muslim, Orthodox,
Protestant, Others)

Closed questions may also be used to get
the respondents to express their opinions by
choosing rating points on a scale.
For example: How useful would you say the use of
insecticide-treated bed nets is in the prevention of
malaria?
1. Extremely useful
2. Very useful
3. Useful
4. Not very useful
5. Not useful at all

Requirements of questions

Must have face validity
Must be clear and unambiguous
Must not be offensive
Must be fair

Steps in Designing a Questionnaire

Step 1: Content (variables and objectives)
You may add, drop or change some of your
variables and objectives at this stage.

Step 2: Formulating questions
Formulate one or more questions that will
provide the information needed for each
variable.

Step 3: Sequencing of questions
Sequence the questions in a logical order.
Place sensitive questions last.

Step 4: Formatting the questionnaire
Heading, spacing

Step 5: Translation and pre-testing

3. Use of documentary sources:
Examples include:

1. Official publications of the Central Statistical
Authority
2. Publications of the Ministry of Health and other
ministries
3. Newspapers and journals
4. International publications, such as those of
WHO, World Bank, UNICEF
5. Records of hospitals or any health institutions

Though the use of data from documents is
less time consuming and relatively
low cost,
care should be taken over the quality and
completeness of the data.

There could be differences in objectives
between the primary author of the data and
the user.

Common Problems in gathering data
include:
Language barriers
Lack of adequate time
Expense
Inadequately trained and experienced staff
Invasion of privacy
Suspicion
Bias
Cultural norms (e.g. which may preclude men
interviewing women)

The statistical data may be classified under
two categories, depending upon the
sources.

1) Primary data 2) Secondary data

1. Primary Data:
Are data collected by the investigator
himself or herself for the purpose of a specific inquiry or study.

Are original in character and are mostly generated by
surveys.
More reliable and accurate, since the investigator can
extract the correct information.
High response rates might be obtained, since the
answers to various questions are obtained on the
spot.

It permits explanation of questions concerning difficult
subject matter.
Expensive.

2. Secondary Data:
Data that have already been collected by others
and are then used by an investigator.
Such data are primary data for the agency that
collected them, and become secondary for someone
else who uses them for their own purposes.
Can be obtained from journals, reports, government
publications, publications of professionals and
research organizations.
Are less expensive.
Sometimes the quality of such data may be better,
because they might have been collected by persons
who were specially trained for that purpose.

On the other hand, secondary data may also be full of errors
because:
the purpose of the collection of the
data by the primary agency may have been
different from the purpose of the user of these
secondary data,
there may have been bias introduced,
the size of the sample may have been inadequate,
or
there may have been arithmetic or definition errors.
Hence, it is necessary to critically investigate
the validity of the secondary data.

Even though the choice of methods of data
collection is largely based on the accuracy of the
information they yield, it is also based on practical
considerations, such as:
The need for personnel, skills, equipment, etc. in
relation to what is available and the urgency with
which results are needed.
The acceptability of the procedures to the
subjects.
The probability that the method will provide
good coverage.
The investigator's familiarity with a study
procedure.
Methods of data organization and
presentation

The data collected in a survey are called raw data.

For data to be more easily appreciated and to allow
quick comparisons, it is often useful to arrange the
data in the form of a table, or in one of a number of
different graphical forms.
Array (ordered array): a serial arrangement of
numerical data in ascending or descending order.
Very difficult with a large sample size.
12 19 27 36 42 59
15 22 31 39 43 61
17 23 31 41 44 65
18 26 34 41 54 67

1. Frequency Distributions
A table which lists all
observed values of the variable being
studied and how many times each value is
observed.

E.g. A study in which 400 persons were asked how
many full-length movies they had seen on television
during the preceding week.

The following table gives the distribution of the data
collected.

Number of movies   Number of persons   Relative frequency (%)
0                    72                 18.0
1                   106                 26.5
2                   153                 38.3
3                    40                 10.0
4                    18                  4.5
5                     7                  1.8
6                     3                  0.8
7                     0                  0.0
8                     1                  0.3
Total               400                100.0
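
The relative frequency column is each count divided by the total, times 100. A minimal sketch of that calculation follows, assuming the counts are held in a plain dictionary (the variable names are illustrative, not from the source). Percentages are printed with two decimals so that they add to exactly 100; the table rounds each entry to one decimal, which is why its column adds to slightly more than 100.

```python
# Counts from the movie survey: number of movies -> number of persons.
counts = {0: 72, 1: 106, 2: 153, 3: 40, 4: 18, 5: 7, 6: 3, 7: 0, 8: 1}

total = sum(counts.values())

# Relative frequency (%) = 100 * frequency / total number of observations.
relative_pct = {value: 100 * freq / total for value, freq in counts.items()}

for value, freq in counts.items():
    print(f"{value:>2}  {freq:>4}  {relative_pct[value]:6.2f}%")
print(f"Total {total}")
```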

In the above distribution:
"Number of movies" represents the variable
under consideration,
"Number of persons" represents the frequency,
and
the whole distribution is called a frequency
distribution, more specifically a simple frequency
distribution.

A categorical distribution:

Non-numerical information can also be represented
in a frequency distribution.

E.g. Seniors of a high school were interviewed on their
plans after completing high school. The following data
give the plans of 548 seniors of a high school.

Seniors' plan                                  Number of seniors
Plan to attend college                         240
May attend college                             146
Plan to or may attend a vocational school       57
Will not attend any school                     105
Total                                          548

Grouped frequency distribution.
E.g. The age of persons arrested in a country.
Age (years)    Number of persons
Under 18        1,748
18-24           3,325
25-34           3,149
35-44           1,323
45-54             512
55 and over       335
Total          10,392

The construction of a grouped frequency
distribution consists of four steps:
1. Choosing the classes,
2. Sorting (or tallying),
3. Counting the number of items in each class,
and
4. Displaying the results in the form of a chart or
table.

No. of classes and class interval (width)
Choices are arbitrary to some extent, but they
depend on:
The nature of the data,
Its accuracy, and
The purpose of the distribution.

The following are some rules that are
generally observed:

1. The number of classes is usually between 6 and 20 (average
15).
A guide is Sturges' formula, where n is the number of observations,
K the number of classes and w the class interval (width):

$$K = 1 + 3.322\,\log_{10}(n)$$

$$w = \frac{\text{Maximum value} - \text{Minimum value}}{K} = \frac{\text{Range}}{K}$$

NB: Sturges' rule should not be regarded as
final, but should be considered as a guide only.

2. Classes should be mutually exclusive.

3. Determination of class limits:

i) Class limits should be definite and clearly
stated.

ii) The starting point, i.e., the lower limit of the
first class, should be determined in such a manner
that the frequencies of each class are
concentrated near the middle of the class
interval.

Example:
The birth weights (in kilograms) of 30 children
were recorded as follows:

2.0, 2.1, 2.3, 3.0, 2.7, 2.8, 3.5, 3.1, 3.7, 4.0,
2.3, 3.5, 4.2, 3.7, 3.2, 2.7, 2.5, 2.7, 3.8, 3.1,
3.0, 2.6, 2.8, 2.9, 3.5, 4.1, 3.9, 2.8, 2.2, 3.1.

$$K = 1 + 3.322\,\log_{10}(30) = 5.91$$

$$w = \frac{4.2 - 2.0}{5.91} = 0.37 \approx 0.4$$
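
The same numbers can be reproduced with a few lines of code. This is a sketch only: the list `birth_weights_kg` simply re-enters the 30 birth weights given in the example, and the variable names are assumptions made for illustration.

```python
import math

# The 30 birth weights (kg) from the example.
birth_weights_kg = [
    2.0, 2.1, 2.3, 3.0, 2.7, 2.8, 3.5, 3.1, 3.7, 4.0,
    2.3, 3.5, 4.2, 3.7, 3.2, 2.7, 2.5, 2.7, 3.8, 3.1,
    3.0, 2.6, 2.8, 2.9, 3.5, 4.1, 3.9, 2.8, 2.2, 3.1,
]

n = len(birth_weights_kg)

# Sturges' guide for the number of classes: K = 1 + 3.322 * log10(n).
k = 1 + 3.322 * math.log10(n)

# Class width: range of the data divided by K.
data_range = max(birth_weights_kg) - min(birth_weights_kg)
width = data_range / k

print(f"n = {n}, K = {k:.2f}, w = {width:.2f}")
```

In practice K is rounded to a whole number of classes (6 here) and w to a convenient width (0.4 kg here), since the rule is a guide rather than a fixed requirement.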
Birth weight (kg)   Tally mark    No. of children   %       Cumulative freq.
2.0 - 2.3           IIIII          5                16.7     5
2.4 - 2.7           IIIII          5                16.7    10
2.8 - 3.1           IIIII IIII     9                30.0    19
3.2 - 3.5           IIII           4                13.3    23
3.6 - 3.9           IIII           4                13.3    27
4.0 - 4.3           III            3                10.0    30
Total                             30               100.0
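
The tallying and counting steps can be sketched in code. The example below assumes the six classes of width 0.4 kg starting at 2.0 kg shown in the table; weights are converted to whole tenths of a kilogram before binning, a choice made here only to avoid floating-point surprises at the class limits.

```python
# The 30 birth weights (kg) from the example.
birth_weights_kg = [
    2.0, 2.1, 2.3, 3.0, 2.7, 2.8, 3.5, 3.1, 3.7, 4.0,
    2.3, 3.5, 4.2, 3.7, 3.2, 2.7, 2.5, 2.7, 3.8, 3.1,
    3.0, 2.6, 2.8, 2.9, 3.5, 4.1, 3.9, 2.8, 2.2, 3.1,
]

# Class layout in tenths of a kg: first lower limit 2.0, width 0.4, six classes.
first_lower, class_width, n_classes = 20, 4, 6

# Sorting (tallying): drop each observation into its class.
frequencies = [0] * n_classes
for weight in birth_weights_kg:
    tenths = round(weight * 10)
    frequencies[(tenths - first_lower) // class_width] += 1

# Counting and displaying: frequency, percentage and cumulative frequency.
total = sum(frequencies)
rows = []
cumulative = 0
for i, freq in enumerate(frequencies):
    cumulative += freq
    lower = (first_lower + i * class_width) / 10
    upper = (first_lower + (i + 1) * class_width - 1) / 10
    rows.append((f"{lower:.1f} - {upper:.1f}", freq, round(100 * freq / total, 1), cumulative))

for label, freq, pct, cum in rows:
    print(f"{label}  {freq:>2}  {pct:>5}  {cum:>2}")
```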

When the frequencies of two or more classes are added up,
the totals are called cumulative
frequencies.

Two types:
Less than cumulative frequency distribution (most
common)
More than cumulative frequency distribution.

Lower and upper class limits: the smallest and largest values that
can go into any class, respectively.

Mid-point or class mark (Xc):

$$X_c = \frac{\text{Upper class limit} + \text{Lower class limit}}{2}$$

Class boundaries (true class limits):
Are those limits which are determined mathematically
to make an interval of a continuous variable
continuous in both directions, so that no gap exists
between classes.

The true limits are what the tabulated limits would
correspond to if one could measure exactly.
Note: The width of a class is found from the true class
limits by subtracting the true lower limit from the
true upper limit of any particular class.

Example: Frequency distribution of weights (in ounces) of malignant tumors
removed from the abdomen of 57 subjects

Weight            Class          Xc      Freq.   Cum.     Relative
(class limits)    boundaries                     freq.    freq.
10-19              9.5-19.5      14.5     5       5       0.0877
20-29             19.5-29.5      24.5    19      24       0.3333
30-39             29.5-39.5      34.5    10      34       0.1754
40-49             39.5-49.5      44.5    13      47       0.2281
50-59             49.5-59.5      54.5     4      51       0.0702
60-69             59.5-69.5      64.5     4      55       0.0702
70-79             69.5-79.5      74.5     2      57       0.0351
Total                                    57               1.0000
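
Each derived column of this table follows mechanically from the class limits and frequencies. The sketch below assumes weights were recorded to the nearest whole ounce, so each class boundary lies half a unit beyond the stated limit; the variable names are illustrative.

```python
from itertools import accumulate

# Class limits (ounces) and frequencies from the tumor-weight example.
class_limits = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59), (60, 69), (70, 79)]
frequencies = [5, 19, 10, 13, 4, 4, 2]

# Assumption: data recorded to the nearest ounce, so boundaries extend 0.5 either side.
half_unit = 0.5
boundaries = [(lower - half_unit, upper + half_unit) for lower, upper in class_limits]

# Class mark: (upper class limit + lower class limit) / 2.
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

# Class width: true upper limit minus true lower limit.
widths = [upper - lower for lower, upper in boundaries]

total = sum(frequencies)
cumulative = list(accumulate(frequencies))
relative = [round(freq / total, 4) for freq in frequencies]

for limits, bounds, mid, freq, cum, rel in zip(
    class_limits, boundaries, midpoints, frequencies, cumulative, relative
):
    print(limits, bounds, mid, freq, cum, rel)
```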
2. Statistical Tables:

A statistical table is an orderly and systematic
presentation of numerical data in rows (stubs,
horizontal) and columns (captions, vertical
arrangements).

Construction of tables
Although there are no hard and fast rules to follow, the
following general principles should be addressed in constructing
tables.
1. Tables should be as simple as possible.
2. Tables should be self-explanatory:
The title should answer the questions what? when? where?
how classified?, and it should be placed above the table.
Each row and column should be labeled.
Numerical entries of zero should be explicitly written rather
than indicated by a dash. Dashes are reserved for missing
or unobserved data.
Totals should be shown.
3. If data are not original, their source should be given in
a footnote.

A) Simple or one-way table:
Is used when the individual observations
involve only a single variable.

Table 1: Overall immunization status of children in Adami
Tullu Woreda, Feb. 1995

Immunization status    Number   Percent
Not immunized            75      35.7
Partially immunized      57      27.1
Fully immunized          78      37.2
Total                   210     100.0

Source: Fikru T et al. EPI Coverage in Adami Tulu. Eth J
Health Dev 1997;11(2): 109-113

B. Two-way table:

Shows two characteristics/variables.

Table 2: TT immunization by marital status of the women of
childbearing age, Assendabo town, Jimma Zone, 1996

                     Immunization status
Marital status   Immunized        Not immunized     Total
                 No.     %        No.     %
Single            58    24.7      177    75.3        235
Married          156    34.7      294    65.3        450
Divorced          10    35.7       18    64.3         28
Widowed            7    50.0        7    50.0         14
Total            231    31.8      496    68.2        727

Source: Mikael A. et al. Tetanus Toxoid immunization coverage among
women of child bearing age in Assendabo town; Bulletin of JIHS, 1996,
7(1): 13-20
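
The percentages in Table 2 are row percentages: each cell is divided by the total for its marital-status row. A minimal sketch of that calculation follows; the helper `row_percentages` is a name introduced here for illustration.

```python
# Counts from Table 2: marital status -> (immunized, not immunized).
table = {
    "Single": (58, 177),
    "Married": (156, 294),
    "Divorced": (10, 18),
    "Widowed": (7, 7),
}


def row_percentages(immunized, not_immunized):
    """Return the row total and each cell as a percentage of that row total."""
    row_total = immunized + not_immunized
    return (
        row_total,
        round(100 * immunized / row_total, 1),
        round(100 * not_immunized / row_total, 1),
    )


summary = {status: row_percentages(*cells) for status, cells in table.items()}

# Column totals give the overall (marginal) immunization percentages.
total_immunized = sum(cells[0] for cells in table.values())
total_not_immunized = sum(cells[1] for cells in table.values())
summary["Total"] = row_percentages(total_immunized, total_not_immunized)

for status, (row_total, pct_yes, pct_no) in summary.items():
    print(f"{status:<9} {pct_yes:>5}% immunized, {pct_no:>5}% not, n = {row_total}")
```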

C. Higher Order Table:

Used when it is desired to represent three or more
characteristics in a single table.

Table 3: Distribution of Health Professionals by Sex and
Residence

                          Residence
Profession   Sex       Urban          Rural          Total
Doctors      Male       8 (10.0)      35 (21.0)      43 (17.7)
             Female     2 (3.0)       16 (10.0)      18 (7.4)
Nurses       Male      46 (58.0)      36 (22.0)      82 (33.7)
             Female    23 (29.0)      77 (47.0)     100 (41.2)
Total                  79 (100.0)    164 (100.0)    243 (100.0)

3. Diagrammatic Representation of Data:
Importance:
1. They have greater attraction than mere figures.
2. They help in deriving the required information in
less time and without any mental strain.
3. They facilitate comparison.
4. They may reveal unsuspected patterns in a
complex set of data and may suggest directions
in which changes are occurring.
This warns us to take immediate action.

5. They have greater memorizing value.

Limitations:
1. They are used only for purposes of comparison.
2. Diagrammatic representation is not an alternative
to tabulation. It only strengthens the textual
exposition of a subject, and cannot serve as a
complete substitute for statistical data.
3. It can give only an approximate idea and as such
where greater accuracy is needed diagrams will
not be suitable.
4. They fail to bring to light small differences

Specific types of graphs include:
For nominal and ordinal data:
Bar graph
Pie chart

For quantitative data:
Histogram
Frequency polygon
Scatter plot
Line graph
Others

N.B: The choice of a particular form among the
different possibilities will depend on personal
choice and/or the type of data.

The following general rules are commonly accepted
for the construction of graphs.

1. Every graph should be self-explanatory and as simple
as possible.
2. Titles are usually placed below the graph and
should again answer the questions what? where? when? how
classified?
3. Legends or keys should be used to differentiate
variables if more than one is shown.
4. The axis labels should be placed to read from the left
side and from the bottom.
5. The units into which the scale is divided should be
clearly indicated.
6. The numerical scale representing frequency must
start at zero or a break in the line should be shown.

1. Bar Chart
The categories are represented on the base line (X-axis)
at regular intervals and the corresponding values
of frequencies or relative frequencies are represented on
the Y-axis (ordinate) in the case of a vertical bar
diagram, and vice versa in the case of a horizontal bar
diagram.

All the bars must have equal width and the distance
between bars must be equal.

All the bars should rest on the same line, called the base.

A. Simple bar chart:

A one-variable (one-dimensional) diagram.
The height or length of each bar indicates the
size (frequency) of the figure represented.

[Figure: simple bar chart]
Fig 1. Distribution of patients in hospital X by source of
referral, 1999

B. Multiple bar chart:

It depicts the distributional pattern of more than
one variable.
E.g., consider the data on immunization status of
women by marital status.

[Figure: multiple bar chart. Y-axis: number of women (0 to 400). X-axis: marital status (Single, Married, Divorced, Widowed, Separated). Bars: Immunized, Not immunized.]
Fig. 2. TT immunization status by marital status
of women 15-49 years, Town X, 1990

C. Component (or sub-divided) Bar Diagram:

Bars are sub-divided into the component parts of the figure.

These diagrams are constructed when each total
is built up from two or more component figures.

i) Actual Component Bar Diagram:
The overall height of the bars and the individual
component lengths represent actual figures.
The data on immunization status of women by marital
status can also be presented as follows:

[Figure: actual component bar chart. Y-axis: number of women (0 to 500). X-axis: marital status (Single, Married, Divorced, Widowed, Separated). Components: Immunized, Not immunized.]
Fig. 3. TT immunization status by marital
status of women 15-49 years, Town X, 1990

ii) Percentage Component Bar Diagram:

The individual component lengths represent the
percentage each component forms of the overall total.

Note that a series of such bars will all be the same
total height, i.e., 100 percent.

The data above can also be presented as follows:

[Figure: percentage component bar chart. Y-axis: share of women (0% to 100%). X-axis: marital status (Single, Married, Divorced, Widowed, Separated). Components: Immunized, Not immunized.]
Fig. 4. TT immunization status by marital status of
women 15-49 years, Town X, 1990

2) Pie chart (qualitative or quantitative discrete data):
It is a circle divided into sectors so that the areas of the
sectors are proportional to the frequencies.

Steps to construct a pie chart:
Construct a frequency table.
Change the frequencies into percentages (P).
Change the percentages into degrees, where: degrees =
(P/100) × 360°.
Draw a circle and divide it accordingly.

Example: Distribution of cause of death for females,
in England and Wales, 1989.

Cause of death         No. of deaths
Circulatory system     100,000
Neoplasm                70,000
Respiratory system      30,000
Injury & poisoning       6,000
Digestive system        10,000
Others                  20,000
Total                  236,000
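
The pie-chart steps can be applied to this table in a few lines. The sketch below converts each frequency to a percentage and then to a sector angle, taking a full circle as 360 degrees; the dictionary and variable names are illustrative.

```python
# Number of deaths by cause, from the example table.
deaths = {
    "Circulatory system": 100_000,
    "Neoplasm": 70_000,
    "Respiratory system": 30_000,
    "Injury & poisoning": 6_000,
    "Digestive system": 10_000,
    "Others": 20_000,
}

total = sum(deaths.values())

# Step 2: frequency -> percentage. Step 3: percentage -> degrees of the circle.
percentages = {cause: 100 * count / total for cause, count in deaths.items()}
angles = {cause: pct / 100 * 360 for cause, pct in percentages.items()}

for cause in deaths:
    print(f"{cause:<20} {percentages[cause]:5.1f}%  {angles[cause]:6.1f} degrees")
```

The sector angles add up to 360 degrees, which is a quick check that no category has been left out.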

[Figure: pie chart]
Fig 5. Distribution of cause of death for females, in England and
Wales, 1989

3. Histograms (quantitative continuous data)

A histogram is constructed on the basis of the following
principles:
a) The horizontal axis is a continuous scale running from
one extreme end of the distribution to the other.
It should be labeled with the name of the variable and the units
of measurement.
b) For each class in the distribution a vertical rectangle is
drawn with:
Its base on the horizontal axis, extending from one
class boundary of the class to the other class
boundary, so there will never be any gap between the
histogram rectangles.
The bases of all rectangles are determined by the
width of the class intervals.
Example: Distribution of the age of women at the time of
marriage
Age group       15-19  20-24  25-29  30-34  35-39  40-44  45-49
No. of women      11     36     28     13      7      3      2

[Figure: histogram. Y-axis: number of women (0 to 40). X-axis: age group, with class boundaries 14.5-19.5, 19.5-24.5, 24.5-29.5, 29.5-34.5, 34.5-39.5, 39.5-44.5, 44.5-49.5.]
Fig 6. Distribution of the age of women at the time of marriage


4. Frequency polygon:
If we join the midpoints of the tops of the adjacent
rectangles of the histogram with line segments, a
frequency polygon is obtained.
NB: It is not essential to draw a histogram in order to obtain a
frequency polygon.
It can be drawn without erecting the rectangles of a histogram as
follows:
1) The scale should be marked with the numerical values of the
midpoints of the intervals.
2) Erect ordinates on the midpoints of the intervals,
the length or altitude of an ordinate representing the frequency of
the class.
3) Join the tops of the ordinates and extend the connecting lines to
the scale of sizes.

Example: Age of women at the time of marriage

[Figure: frequency polygon. Y-axis: number of women (0 to 40). X-axis: age, marked at the class midpoints 12, 17, 22, 27, 32, 37, 42, 47.]
Fig 7. Distribution of the age of women at the time of marriage

5. Ogive or cumulative frequency curve:
Sometimes it may be necessary to know the number of
items whose values are more or less than a certain
amount.
When the cumulative frequencies of a distribution are
graphed, the resulting curve is called an ogive curve.
To construct an ogive curve:
i) Compute the cumulative frequency of the
distribution.
ii) Prepare a graph with the cumulative frequency
on the vertical axis and the true upper class
limits (class boundaries) of the intervals scaled
along the X-axis (horizontal axis).

Example: Heart rate of patients admitted in hospital Y, 1998

Heart rate      No. of      Cumulative frequency,     Cumulative frequency,
                patients    less than method (LM)     more than method (MM)
54.5 - 59.5      1           1                        54
59.5 - 64.5      5           6                        53
64.5 - 69.5      3           9                        48
69.5 - 74.5      5          14                        45
74.5 - 79.5     11          25                        40
79.5 - 84.5     16          41                        29
84.5 - 89.5      5          46                        13
89.5 - 94.5      5          51                         8
94.5 - 99.5      2          53                         3
99.5 - 104.5     1          54                         1
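
Both cumulative columns are running totals of the same frequencies, taken from opposite ends of the table. A minimal sketch using the standard-library `itertools.accumulate` follows; the variable names are illustrative.

```python
from itertools import accumulate

# Number of patients in each heart-rate class, in table order.
frequencies = [1, 5, 3, 5, 11, 16, 5, 5, 2, 1]

# Less than method: running total from the first class downwards.
less_than = list(accumulate(frequencies))

# More than method: running total from the last class upwards.
more_than = list(accumulate(reversed(frequencies)))[::-1]

for freq, lm, mm in zip(frequencies, less_than, more_than):
    print(f"{freq:>3} {lm:>3} {mm:>3}")
```

The last entry of the less-than column and the first entry of the more-than column both equal the total number of patients.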

[Figure: ogive curves for the less than method (LM) and the more than method (MM). Y-axis: cumulative frequency (0 to 60). X-axis: heart rate, marked at the class boundaries from 54.5 to 105.]
Fig 8. Heart rate of patients admitted in hospital Y, 1998


6. The line diagram:
Used when we have two variables under consideration
(along the X- and Y-axes).
The points are plotted and joined by line
segments in order.
These graphs depict the trend or variability
occurring in the data.
Sometimes two or more graphs are drawn on the
same graph paper using the same scale, so that
the plotted graphs are comparable.

[Figure: line graph with two series, fat malabsorption and normal fat absorption. Y-axis: blood zidovudine concentration (0 to 8). X-axis: time since administration (min), from 10 to 360.]
Fig 9. Line graphs for response to administration of zidovudine in two
groups of AIDS patients in hospital X, 1999

7. Scatter plot

Most studies in medicine involve measuring more
than one characteristic, and graphs displaying
the relationship between two characteristics
are common in the literature.

To illustrate the relationship between two
characteristics when both are quantitative
variables, we use bivariate plots (also called
scatter plots or scatter diagrams).

A scatter diagram is constructed by drawing X-
and Y-axes.
Each observation is represented by a point
or dot.

E.g., a study done to see whether a relationship
existed between age and percentage
supersaturation of bile revealed the following result.

[Figure: scatter plot. Y-axis: saturation of bile (0 to 160). X-axis: age (0 to 80).]
Fig 10. The relationship between age and percentage saturation of bile
for women patients in hospital Z, 1998

Summarizing Data

1. Measures of Central Tendency
2. Measures of Dispersion

1. Measures of Central Tendency:

The tendency of statistical data to get
concentrated at certain values is called
central tendency, and the various
methods of determining the actual value at
which the data tend to concentrate are called
measures of central tendency or
averages.

Characteristics of a good measure of central
tendency:
1. It should be based on all the observations.
2. It should not be affected by extreme values.
3. It should be as close to the maximum number of
values as possible.
4. It should have a definite value.
5. It should not be subject to complicated and tedious
calculations.
6. It should be capable of further algebraic treatment.
7. It should be stable with regard to sampling.

108
108
1. The Arithmetic Mean or simple Mean
The most familiar measure of central tendency (MCT).
It is also popularly known as the average.

a) Ungrouped data:
If x1, x2, ..., xn are n observed values, then

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
b) Grouped data:
In calculating the mean from grouped data, we assume that
all values falling into a particular class interval are
located at the mid-point of the interval. It is calculated
as follows:

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

where,
k = the number of class intervals
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval

EXAMPLE: Find the arithmetic mean of the following
data (age distribution of 100 pregnant mothers, at H X)

Age (years)    fi     mi     mi fi
15-19          11     17      187
20-24          36     22      792
25-29          28     27      756
30-34          13     32      416
35-39           7     37      259
40-44           3     42      126
45-49           2     47       94
Total         100            2630

Solution

$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{\sum_{i=1}^{k} f_i}$$

Mean = 2630/100 = 26.3 years
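
The grouped-mean calculation above can be checked with a short script. This is only a minimal sketch: the function name `grouped_mean` is an illustrative choice, and the mid-points and frequencies are copied from the table in the example.

```python
def grouped_mean(midpoints, frequencies):
    """Mean of grouped data: sum(m_i * f_i) / sum(f_i)."""
    weighted_total = sum(m * f for m, f in zip(midpoints, frequencies))
    return weighted_total / sum(frequencies)


# Mid-points and frequencies of the seven age classes (15-19, ..., 45-49)
midpoints = [17, 22, 27, 32, 37, 42, 47]
frequencies = [11, 36, 28, 13, 7, 3, 2]

mean_age = grouped_mean(midpoints, frequencies)
print(f"Mean age = {mean_age:.1f} years")
```

Because every value in a class is assumed to sit at the class mid-point, the result is an approximation of the mean that would be obtained from the raw ages.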

Characteristics:-
Uniqueness: for a given set of data there is one and only one arithmetic mean.
Simplicity: the arithmetic mean is easily
understood and easy to compute
The value of the arithmetic mean is
determined by every item in the series.
It is greatly affected by extreme values.
The sum of the deviations about it is zero.

Advantages:
It is based on all values given in the distribution.
It is most easily understood.
It is most amenable to algebraic treatment.

Disadvantages
It may be greatly affected by extreme items, and its
usefulness as a summary of the whole may be
considerably reduced.
When the distribution has open-end classes, its
computation would be based on an assumption, and
therefore may not be valid.

2. Median
The median of a finite set of values is that value which
divides the set of values into two equal parts such that
the number of values greater than the median is equal to
the number of values less than the median.
a) Ungrouped data
If the number of values is odd, the median will be the
middle value when all values have been arranged in order
of magnitude, i.e. the [(n+1)/2]th value.
When the number of observations is even, there is no
single middle observation but two middle
observations.
In this case the median is taken to be the mean of these two
middle observations, when all observations have been
arranged in the order of their magnitude, i.e. the
average of the (n/2)th and [(n/2)+1]th values.
Example:
Compute the sample median for the birth weight
data
Solution:
First arrange the sample in ascending order
2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101,
3200, 3245, 3248, 3260, 3265, 3314, 3323, 3484,
3541, 3609, 3649, 4146
Since n=20 is even,
Median = average of the 10th and 11th
observations = (3245 + 3248)/2 = 3246.5 g
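
As a quick check of the worked example, the same median can be obtained with Python's standard `statistics` module. This is a sketch only; the variable names are illustrative and the data are the 20 birth weights listed above.

```python
import statistics

# Birth weights (g) from the example; sorting is done by the library
birth_weights = [
    2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
    3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146,
]

# For an even number of values, statistics.median averages the two middle ones
median_weight = statistics.median(birth_weights)
print(f"n = {len(birth_weights)}, median = {median_weight} g")
```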

b) Grouped data
In calculating the median from grouped data, we
assume that the values within a class interval are
evenly distributed through the interval.

The first step is to locate the class interval in which the median
lies. We use the following procedure.

Find n/2 and take the first class interval whose cumulative frequency
is greater than or equal to n/2.

To find a unique median value, use the following interpolation formula.
$$\tilde{x} = L_m + \left( \frac{\frac{n}{2} - F_c}{f_m} \right) W$$

where,
Lm = lower true class boundary of the interval containing
the median
Fc = cumulative frequency of the interval immediately preceding the
median class interval
fm = frequency of the interval containing the median
W = class interval width
n = total number of observations

Example: Age distribution of 75 malaria cases
attending the OPD of HC X was as follows:

Age (years)    fi    Cum. freq
5-14            5        5
15-24          10       15
25-34          20       35
35-44          22       57
45-54          13       70
55-64           5       75

Solution
n/2 = 75/2 = 37.5

Median class interval = 35-44 (4th class)

Lm = 34.5, Fc = 35, W = 10, n = 75, fm = 22

Therefore, Median = 34.5 + ((37.5 - 35)/22) x 10 = 35.64 years
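
The interpolation steps above can be sketched in code as follows. The function name `grouped_median` and its arguments are illustrative assumptions; the lower true class boundaries (4.5, 14.5, ...) are derived from the class limits in the table, and equal class widths are assumed.

```python
def grouped_median(lower_boundaries, frequencies, width):
    """Median of grouped data: Lm + ((n/2 - Fc) / fm) * W."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("frequency table is empty")
    cumulative = 0  # Fc: cumulative frequency before the current class
    for lower, f in zip(lower_boundaries, frequencies):
        # The median class is the first one whose cumulative
        # frequency reaches n/2.
        if cumulative + f >= n / 2:
            return lower + (n / 2 - cumulative) / f * width
        cumulative += f
    raise ValueError("median class not found")


lower_boundaries = [4.5, 14.5, 24.5, 34.5, 44.5, 54.5]
frequencies = [5, 10, 20, 22, 13, 5]

median_age = grouped_median(lower_boundaries, frequencies, width=10)
print(f"Median age = {median_age:.2f} years")
```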

Characteristics
1) It is an average of position.
2) It is affected more by the number of items than by
extreme values.
Advantages
It is easily calculated.
It is a positional average and hence it is not
affected by extreme values.
The median may be located even when
the data are incomplete.

Disadvantages
The median is not so well suited to algebraic
treatment as the arithmetic, geometric and
harmonic means.
It is not so generally familiar as the arithmetic
mean
It is less sensitive to the actual numerical values
of the remaining data points.

3. Mode
It is a value which occurs most frequently in a
set of values.
Any observation of a variable at which the distribution
reaches a peak is called a mode.
a) Ungrouped data
If all the values are different there is no mode.
On the other hand, a set of values may have
more than one mode.
E.g. a) 22, 66, 69, 70, 73 (no modal value)
     b) 1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2,
        3.5 (modal value = 3.0 kg)
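
The two cases above (no mode, and a single mode) can be illustrated with a small helper. This is a sketch: the function name `modes` is illustrative, and returning an empty list when every value occurs only once is a convention chosen here to match the "no modal value" case.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values; empty list if no value repeats."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # all values are different: no mode
    return sorted(v for v, c in counts.items() if c == highest)


set_a = [22, 66, 69, 70, 73]
set_b = [1.8, 3.0, 3.3, 2.8, 2.9, 3.6, 3.0, 1.9, 3.2, 3.5]

print(modes(set_a))  # no modal value
print(modes(set_b))  # single modal value
```

A data set with two equally frequent values would return both of them, which corresponds to the case of more than one mode.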

b) Grouped data

In designating the mode of grouped data, we usually refer to the modal
class, where the modal class is the class interval with the highest
frequency.

If a single value for the mode of grouped data must be specified, it is
taken as the mid-point of the modal class interval.

Advantages
Since it is the most typical value it is the most
descriptive average
It is not affected by extreme values
It can be calculated for distributions with open
end classes
Disadvantages:
It is not capable of mathematical treatment
In a small number of items the mode may not
exist.

4. Geometric mean (GM)
It is obtained by taking the nth root of the product of n
values, i.e., if the values of the observations are denoted by
x1, x2, ..., xn, then

$$GM = \sqrt[n]{x_1 x_2 \cdots x_n} = \sqrt[n]{\prod_{i=1}^{n} x_i}$$

The geometric mean is preferable to the arithmetic mean if the series
of observations contains one or more unusually large values.
The above method of calculating the geometric mean is
satisfactory only if there are a small number of items.
But if n is a large number, computing the nth
root of the product of these values by simple arithmetic is
tedious.
To facilitate the computation of the geometric mean we make
use of logarithms.
$$GM = \sqrt[n]{\prod_{i=1}^{n} x_i}$$

$$\log GM = \frac{\sum_{i=1}^{n} \log x_i}{n}$$

Example: The geometric mean may be calculated for
the following parasite counts per 100 fields of thick
films.
7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25
12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50
21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237

$$GM = \sqrt[42]{7 \times 8 \times 3 \times \cdots \times 1 \times 237}$$

log GM = 1/42 (log 7 + log 8 + log 3 + ... + log 237)
       = 1/42 (0.8451 + 0.9031 + 0.4771 + ... + 2.3747)
       = 1/42 (41.9985)
       = 0.9999 ≈ 1.0000
The anti-log of 0.9999 is 9.9992 ≈ 10, and this is the
required geometric mean.
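
The logarithm method can be sketched in a few lines of Python. The function name `geometric_mean` is an illustrative choice (not a reference to any particular library), base-10 logarithms are used to mirror the hand calculation, and the 42 parasite counts are copied from the example. The result should come out close to 10.

```python
import math


def geometric_mean(values):
    """Geometric mean via logs: antilog of the mean of log10(x_i)."""
    if any(v <= 0 for v in values):
        # Logs are undefined for zero or negative values
        raise ValueError("geometric mean requires strictly positive values")
    mean_log = sum(math.log10(v) for v in values) / len(values)
    return 10 ** mean_log


parasite_counts = [
    7, 8, 3, 14, 2, 1, 440, 15, 52, 6, 2, 1, 1, 25,
    12, 6, 9, 2, 1, 6, 7, 3, 4, 70, 20, 200, 2, 50,
    21, 15, 10, 120, 8, 4, 70, 3, 1, 103, 20, 90, 1, 237,
]

gm = geometric_mean(parasite_counts)
arithmetic = sum(parasite_counts) / len(parasite_counts)
print(f"Geometric mean  = {gm:.2f}")
print(f"Arithmetic mean = {arithmetic:.2f}")
```

Printing the arithmetic mean alongside shows how strongly the few very large counts pull it upward compared with the geometric mean.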


Characteristics
1. It is a calculated value and depends upon the size of all the
items.
2. It gives less importance to extreme items than does the
arithmetic mean.

Advantages:-
It is less affected by extreme values than the arithmetic mean.
It is capable of algebraic treatment.
It is based on all values given in the distribution.
Disadvantages:-
Its computation is relatively difficult.
It cannot be determined if there is any negative value in the
distribution, or where one of the items has a zero value.
Skewness
If extremely low or extremely high observations are
present in a distribution, then the mean tends to shift
towards those scores. Based on the type of
skewness, distributions can be:
Negatively skewed distribution: Occurs when the
majority of scores are at the right end of the curve and
a few extreme small scores are scattered at the left end.
Positively skewed distribution: Occurs when the
majority of scores are at the left end of the curve and
a few extreme large scores are scattered at the right
end.
Symmetrical distribution: It is neither positively nor
negatively skewed. A curve is symmetrical if one half
of the curve is the mirror image of the other half.

The following guidelines help an investigator decide which
measure of central tendency is best for a given set
of data.
1. The arithmetic mean is used for interval and ratio data
and for symmetric distributions.
2. The median and quartiles are used for ordinal, interval
and ratio data whose distribution is skewed.
3. For nominal data, the mode is the appropriate measure of central tendency.
4. The geometric mean is used primarily for observations
measured on a logarithmic scale.

2. Measures of variation
The measure of central tendency alone is not
enough to have a clear idea about the
distribution of the data.
Moreover, two or more sets may have the same
mean and/or median but they may be quite
different.
Thus, to have a clear picture of the data, one needs
a measure of dispersion or variability
(scatter) amongst the observations in the set.
E.g. A: 1, 2, 3, 4, 5          mean = 3, median = 3
     B: 0, 1, 2, 3, 4, 5, 6    mean = 3, median = 3

1. Range
The range is the difference between the largest and smallest
observations in the data.
Range = Xmax - Xmin
It is the crudest measure of dispersion.
Since it is based upon two extreme cases, it may be
considerably changed if either of the extreme cases
happens to drop out.
The extreme values may be unreliable; that is, they are
the most likely to be faulty.
It is not suitable for the mathematical treatment
required in deriving the techniques of statistical inference.
Example: 60 40 30 50 60 40 70 50
Range = 70 - 30 = 40

2. INTERQUARTILE RANGE
Another approach, which addresses some of the
shortcomings of the range in quantifying the
spread in a data set, is the use of quantiles.
E.g. quartiles, deciles, percentiles, etc.
The pth percentile is the value Vp such that p
percent of the sample points are less than or
equal to Vp.
The median, being the 50th percentile, is a
special case of a quantile.

The calculation of quantiles is not as simple as it might
seem.
The data should be ranked from 1 to n in order of
increasing size.
The kth percentile is obtained by calculating
q = k(n+1)/100 and then interpolating between the two
values with ranks on either side of q.
For example, for the 5th centile of a sample of 145
observations we have q = 5 x 146/100 = 7.3.
We estimate the 5th centile as the value 0.3 of the way
between the 7th and 8th ranked observations.
If these data values are 11.4 and 14.9, the estimated
centile is 11.4 + 0.3 x (14.9 - 11.4) = 12.45.
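
The q = k(n+1)/100 rule with interpolation can be sketched as below. The function name `percentile` and the clamping at the ends of the data are assumptions made for this illustration; the data reused here are the 20 birth weights from the median example, so the 50th percentile should agree with the median found there.

```python
import math


def percentile(values, k):
    """kth percentile using q = k(n + 1)/100 and linear interpolation."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    n = len(ordered)
    q = k * (n + 1) / 100            # fractional rank, counted from 1
    lower_rank = math.floor(q)
    fraction = q - lower_rank
    # Assumption: outside ranks 1..n there is nothing to interpolate
    # with, so the nearest extreme observation is returned.
    if lower_rank < 1:
        return ordered[0]
    if lower_rank >= n:
        return ordered[-1]
    lower = ordered[lower_rank - 1]  # value with rank floor(q)
    upper = ordered[lower_rank]      # value with the next rank
    return lower + fraction * (upper - lower)


birth_weights = [
    2069, 2581, 2759, 2834, 2838, 2841, 3031, 3101, 3200, 3245,
    3248, 3260, 3265, 3314, 3323, 3484, 3541, 3609, 3649, 4146,
]

p25 = percentile(birth_weights, 25)
p50 = percentile(birth_weights, 50)
print(f"25th percentile = {p25} g, 50th percentile = {p50} g")
```

Other software may use slightly different interpolation rules, so small differences between packages are to be expected.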

The quartiles divide the distribution into four
equal parts.
a) The first quartile (Q1): the value such that
25% of all the ranked observations are less than or
equal to Q1.
b) The second quartile (Q2): the value such that
50% of all the ranked observations are less than or
equal to Q2. The second quartile is the median.
c) The third quartile (Q3): the value such that 75%
of all the ranked observations are less than or equal
to Q3.
The inter-quartile range (IQR) is the difference
between the third and the first quartiles: IQR = Q3 - Q1.

E.g. Given the following data set (age of patients):
18, 59, 24, 42, 21, 23, 24, 32

find the inter-quartile range.

Solution:
1. Sort the data from lowest to highest
18 21 23 24 24 32 42 59

2. Find the bottom and the top quarters of the data

1st quartile = the {0.25 x (n+1)}th observation = (2.25)th
observation
= 21 + (23 - 21) x 0.25 = 21.5
3rd quartile = the {0.75 x (n+1)}th observation =
(6.75)th observation
= 32 + (42 - 32) x 0.75 = 39.5

3. Find the difference (inter-quartile range) between the two quartiles
Hence, IQR = 39.5 - 21.5 = 18
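
The quartiles in this example can be reproduced with Python's standard library. This is a minimal sketch: `statistics.quantiles` with `method="exclusive"` places the cut points at ranks i(n+1)/4, which is the same (n+1) rule used in the hand calculation; the variable names are illustrative.

```python
import statistics

ages = [18, 59, 24, 42, 21, 23, 24, 32]

# n=4 asks for the three quartile cut points; the "exclusive" method
# interpolates at ranks i * (n + 1) / 4, as in the worked solution.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="exclusive")
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}, IQR = {iqr}")
```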

The inter-quartile range is a preferable
measure to the range because:
it is less prone to distortion by a single large
or small value,
i.e., outliers in the data have little effect on the inter-quartile
range;
it can be computed when the distribution has
open-end classes.

5. Mean deviation (MD)
Mean deviation is the average of the
absolute deviations taken from a central
value, generally the mean or median.
Consider a set of n observations x1, x2, ..., xn. Then:

$$MD = \frac{1}{n} \sum_{i=1}^{n} \lvert x_i - A \rvert$$

where A is a central value (arithmetic mean or
median).
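
A short sketch of this formula is given below. The function name `mean_deviation` and its optional `center` argument are illustrative assumptions, and the data are simply the eight values reused from the range example earlier in this section.

```python
def mean_deviation(values, center=None):
    """Average absolute deviation from a central value A.

    If no center is supplied, the arithmetic mean is used as A.
    """
    if center is None:
        center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)


data = [60, 40, 30, 50, 60, 40, 70, 50]

md = mean_deviation(data)
print(f"Mean = {sum(data) / len(data)}, mean deviation = {md}")
```

Passing the median as `center` gives the mean deviation about the median instead.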

Properties of mean deviation:

MD removes one main objection to the earlier measures: it involves
every value.

It is not affected much by extreme values.

Its main drawback is that the algebraic negative signs of the
deviations are ignored, which is mathematically unsound.

6. Variance (σ², s²)
The main objection to the mean deviation, that
the negative signs are ignored, is removed by
taking the squares of the deviations from the
mean.

The variance is the average of the squares of the deviations taken
from the mean.

The deviations are squared because the sum of the
deviations of the individual observations of a
sample about the sample mean is always 0:

$$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$$

The variance can be thought of as an average of squared deviations.

Variance is used to measure the dispersion
of values relative to the mean.
When values are close to their mean (narrow
range) the dispersion is less than when there
is scattering over a wide range.
Population variance = σ²
Sample variance = S²

a) Ungrouped data
Let X1, X2, ..., XN be the measurements on N
population units; then:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

where

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

is the population mean.
A sample variance is calculated for a sample of
individual values (X1, X2, ..., Xn) and uses the sample
mean ($\bar{x}$) rather than the population mean $\mu$:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$

b) Grouped data
$$S^2 = \frac{\sum_{i=1}^{k} (m_i - \bar{x})^2 f_i}{\sum_{i=1}^{k} f_i - 1}$$

where
mi = the mid-point of the ith class interval
fi = the frequency of the ith class interval
x̄ = the sample mean
k = the number of class intervals

Properties of Variance:
The main disadvantage of the variance is that
its unit is the square of the unit of the
original measurement values.
The variance gives more weight to the
extreme values than to those
which are near the mean value, because
the differences are squared in the variance.
The drawbacks of the variance are
overcome by the standard deviation.

7. Standard deviation (σ, s)
It is the square root of the variance.
This produces a measure having the same
scale as that of the individual values.

$$\sigma = \sqrt{\sigma^2} \quad \text{and} \quad S = \sqrt{S^2}$$

Following are the survival times of n = 11
patients after heart transplant surgery.

The survival time for the ith patient is represented as Xi for
i = 1, ..., 11.

Calculate the sample variance and SD.

Example. Compute the variance and SD of the age of
169 subjects from the grouped data.

Class interval   mi     fi    mi - Mean   (mi - Mean)²   (mi - Mean)² fi
10-19            14.5     4    -19.88       395.2144        1580.8576
20-29            24.5    66     -9.88        97.6144        6442.5504
30-39            34.5    47      0.12         0.0144           0.6768
40-49            44.5    36     10.12       102.4144        3686.9184
50-59            54.5    12     20.12       404.8144        4857.7728
60-69            64.5     4     30.12       907.2144        3628.8576
Total                   169                1907.2864       20197.6336

Mean = 5810.5/169 = 34.38 years
S² = 20197.6336/(169 - 1) = 120.22
SD = √S² = √120.22 = 10.96 years
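
The grouped variance and SD can be recomputed directly from the class mid-points and frequencies. This is a sketch under the same assumption as the hand calculation (every observation sits at its class mid-point); the function name `grouped_variance` is illustrative. Working with the unrounded mean, the variance should come out close to 120.2 and the SD close to 10.96.

```python
import math


def grouped_variance(midpoints, frequencies):
    """Return (mean, sample variance) for grouped data.

    Variance = sum((m_i - mean)^2 * f_i) / (sum(f_i) - 1)
    """
    n = sum(frequencies)
    mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n
    squared_dev = sum(f * (m - mean) ** 2
                      for m, f in zip(midpoints, frequencies))
    return mean, squared_dev / (n - 1)


midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5]
frequencies = [4, 66, 47, 36, 12, 4]

mean_age, variance = grouped_variance(midpoints, frequencies)
sd = math.sqrt(variance)
print(f"n = {sum(frequencies)}, variance = {variance:.2f}, SD = {sd:.2f}")
```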

Properties of SD
The SD has the advantage of being expressed in
the same units of measurement as the mean.

SD is considered to be the best measure of dispersion and is used
widely because of the properties of the theoretical normal curve.
However, if the units of measurement of the
variables of two data sets are not the same, then
their variability cannot be compared by comparing
the values of the SD.

8. Coefficient of variation (CV)
When two data sets have different units of
measurements, or their means differ
sufficiently in size, the CV should be used
as a measure of dispersion.
It is the best measure to compare the
variability of two series (sets) of
observations.
Data with a smaller coefficient of variation are
considered more consistent.

CV is the ratio of the SD to the mean multiplied by
100.

$$CV = \frac{S}{\bar{x}} \times 100$$

               SD          Mean         CV (%)
SBP            15 mmHg     130 mmHg     11.5
Cholesterol    40 mg/dl    200 mg/dl    20.0

Cholesterol is more variable than systolic blood pressure.
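
The comparison in the table can be reproduced with a one-line function. This is a minimal sketch; the function name `coefficient_of_variation` is illustrative and the SD and mean values are those given in the table.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = SD / mean * 100; unit-free, so different scales compare."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


cv_sbp = coefficient_of_variation(sd=15, mean=130)
cv_cholesterol = coefficient_of_variation(sd=40, mean=200)

print(f"SBP CV         = {cv_sbp:.1f}%")
print(f"Cholesterol CV = {cv_cholesterol:.1f}%")
```

Because the units cancel in the ratio, the two CVs can be compared even though blood pressure and cholesterol are measured on different scales.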

Thank you