
Part I Paper 2

RESEARCH METHODOLOGY AND BIOSTATISTICS


Overall objective:
The student is able to apply the basic concepts of statistics and the
principles of scientific enquiry in planning and evaluating the results of
dental practice; to participate in and conduct descriptive, exploratory and
survey studies in dentistry; and to evaluate and apply the results of research
studies in health, dental medicine and related fields to dental practice.
Behavioral objectives: The student is able to
Design a study, identifying a population and the methods of selecting the
sample required.
Present data in appropriate tables, graphs and diagrams.
Calculate averages, variation, linear correlation and regression.
Calculate confidence intervals and simple tests of significance using the
normal, t, F and chi-square distributions.
Compute commonly used vital and health statistics and estimate
population using arithmetic progression methods.
Construct instruments for eliciting data through questioning, observation
and measurement methods and techniques.
Quantify, analyze, describe and interpret data.
Critique dental studies.
Select and write a clear statement of a researchable problem.
Search and analyze the literature for facts and theory relating to the
problem.
Identify and state relevant assumptions and the methods of selecting the
sample required.
Make recommendations based on the findings for application to nursing and
further research.
Prepare and write a scientific report of the study.
Methods of teaching: Lectures and discussions with PowerPoint presentations;
seminars and practicals with PowerPoint presentations
Methods of evaluation:
Regular attendance, seminars, written test and dissertation
Suggested practical:
Each student will select a dental study and present a critique of it.
Survey and assess selected dental studies with particular reference to the
research process. Individually selected problems are presented at each step
of the research process for independent evaluation and discussion.

QUESTION PATTERN
Time: 1 hour

Short Answer           5 x 2    10 Marks
Short Note             5 x 6    30 Marks
Internal Assessment             10 Marks

UNIT   DESCRIPTION

I      1.1 Introduction and overview of Biostatistics
       1.2 Scope of Biostatistics
       1.3 Biostatistics in Dentistry
       1.4 Applying study results to patient care

II     2.1 Review of descriptive statistics
           (Central tendency, dispersion, plotting)
       2.2 Correlation and regression

III    3.1 Testing of statistical hypotheses
       3.2 Statistical inference with mean, proportion and normal deviate
       3.3 Sampling distributions (t, F, chi-square)

IV     4.1 ANOVA (one-way and two-way classification)
       4.2 Non-parametric tests
           a). Sign test
           b). Wilcoxon signed-rank test
           c). Mann-Whitney U test
           d). Wald-Wolfowitz runs test
           e). Kruskal-Wallis test

V      5.1 Concept of research and the research process
       5.2 Principles and various methods of the research process
       5.3 Utilization of research: the results section of a research report
           and conclusions
       5.4 Checklist for reading the literature
STATISTICS
Different authors have defined statistics differently from time to time.
A definition should, however, lay down the meaning and scope of the
subject. The word statistics is used in two senses, viz., singular and plural.
In the plural sense it denotes numerical facts (data), whereas in the
singular sense it denotes statistical methods.
Among these authors, C. E. Croxton and D. J. Cowden give a precise
definition of statistics, and Prof. Horace Secrist gives the best
definition.
According to C. E. Croxton and D. J. Cowden,
Statistics is the branch of mathematics that deals with the collection,
classification, analysis and interpretation of numerical data.
From this definition, the main divisions of statistics are:
i.   Collection of data,
ii.  Classification of data,
iii. Analysis of data, and
iv.  Interpretation of numerical data.

Statistics is a field of study concerned with
(1) The collection, organization, summarization, and analysis of
data, and
(2) The drawing of inferences about a body of data when only a
part of the data is observed.
Simply put, we may say that data are numbers, numbers contain
information, and the purpose of statistics is to investigate and evaluate the
nature and meaning of this information.
Statistics is the science of compiling, classifying, and tabulating numerical
data and expressing the results in a mathematical or graphical form.

According to Prof. Horace Secrist,
Aggregates of facts affected to a marked extent by a multiplicity
of causes, numerically expressed, enumerated or estimated according to a
reasonable standard of accuracy, collected in a systematic manner for a
predetermined purpose and placed in relation to each other are called
statistics.
This definition gives the characteristics of statistics. The characteristics
of statistics are:
They are aggregates of facts.
They are affected to a marked extent by a multiplicity of causes.
They are numerically expressed.
They should be enumerated or estimated.
They should be collected in a systematic manner for a predetermined
purpose.
They should be collected with a reasonable standard of accuracy.
They should be placed in relation to each other.

BIOSTATISTICS
The tools of statistics are employed in many fields: business, education,
psychology, agriculture, and economics, to mention only a few. When the
data analyzed are derived from the biological sciences and medicine, we
use the term biostatistics to distinguish this particular application of
statistical tools and concepts.
Biostatistics is that branch of statistics concerned with mathematical facts
and data relating to the biological events. Medical statistics is a further
specialty of Biostatistics, when the mathematical facts and data are related
to health, preventive medicine and disease.

Essential Features of Statistics
The essential features of statistics are evident from the various definitions
of statistics:
a) Principles and methods for the collection, presentation,
analysis and interpretation of numerical data of different kinds:
i. Observational data, quantitative data
ii. Data that have been obtained by a repetitive operation
iii. Data affected to a marked degree by a multiplicity of
causes
b) The science and art of dealing with variation in such a way as
to obtain reliable results.
c) Controlled objective methods whereby group trends are
abstracted from observations on many separate individuals.
d) The science of experimentation, which may be regarded as
mathematics applied to experimental data.
The objective of dental science is primarily to improve the oral health of
the individual, and hence relevant knowledge has to be obtained by
observation of groups of individuals. Choosing the best course of treatment
for a patient depends on the overall oral hygiene or health status.
The fundamental processes involved in the organization of oral health
care services are:
Acquisition of information, i.e., monitoring data, from independent
study and systematic enquiry (scientific research).
Dissemination of information, e.g., by teaching, demonstrating,
writing, publishing.
Application of knowledge and skill, i.e., provision of health care and
related services such as environmental control (e.g., fluoride
adjustment, regulation of harmful substances, etc.) and
manufacturing of health products.
Judgment or evaluation by the application of professional ethics,
laws, regulations, policies, guidelines, criteria and standards.
Administration, i.e., the management of personnel, facilities,
materials, funds and other resources to facilitate the four processes
outlined above.
Uses of Biostatistics:
1) To define normalcy.
2) To test whether the difference between two populations regarding a
particular attribute is real or a chance occurrence.
3) To study the correlation or association between two or more
attributes in the same population.
4) To evaluate the efficacy of vaccines, sera, etc. by control studies.
5) To locate, define and measure the extent of morbidity and mortality
in the community.
6) To evaluate the achievements of public health programs.
7) To fix priorities in public health programs.
Uses of Biostatistics in dental science:
1) To assess the state of oral health in the community and to determine
the availability and utilization of dental care facilities.
2) To indicate the basic factors underlying the state of oral health by
diagnosing the community, and to find solutions to such problems.
3) To determine the success or failure of specific oral health programs or
to evaluate the program action.
4) To promote health legislation and to help create administrative
standards for oral health.

Role / Importance / Applications / Uses of Biostatistics in dental
research:
To maintain patient records.
To keep track of each patient's previous treatment and the next or further
treatment procedure.
Records kept over a long period make it possible to review previous
treatment procedures and help in planning the current treatment.
When a new drug is launched in the market, biostatistical analysis
indicates whether it is more effective than other drugs.
Statistical analysis indicates which drug is most commonly, or on
average, used for a particular treatment or for all treatments.
To estimate the number of patients visiting in future (weekly, monthly or
yearly).
To know which age groups, or which sex, have more dental problems.
Dental problems vary by area, culture, habits and water supply, and also
by village, district, city and country.
Dental complaints vary with age, sex, area, culture, habits, etc.
To compare two or more treatments, drugs, surgeons, or times taken for
the same complaint: which is better, or is there no difference?
When either of two drugs may be used for a treatment, to test whether
their effects are the same or not.
To compare and estimate treatment time, cure level, etc.
To compare and estimate students' intelligence.
To estimate when a person will develop a dental complaint, or when a
patient under treatment will be cured.
To analyze people's dental knowledge: what percentage have very poor /
poor / average / good / very good knowledge.
Patient records also give the patient's family history: past, present and
future.
To do basic calculations: total number of patients visiting, and average
number of patient visits by age, sex, treatment, complaint, finished or
ongoing status, etc.
To show whether the patient details recorded are sufficient or need to be
improved.
To test whether the difference before and after treatment is significant.
The number of patient visits varies by department; if it varies, why? To
analyze the data and draw an inference.
Applications of Biostatistics in patient care / applying study results
in patient care:
A patient record gives an overview of the patient's treatment and the
further steps.
To choose a suitable treatment or method to apply to the patient.
To know the maximum, minimum, and average value of any patient
characteristic.
To know whether a characteristic varies from patient to patient and, if it
varies, for what reasons.
Previous analyses show which disease affects which type of
population (age, sex, area, culture, etc.). These analyses are very
helpful in giving instructions to prevent or take care of the disease.
When a large number of drugs are available in the market, to select a
suitable drug that satisfies patient co-operation, cost, time, or any one
or all of these.
To find what percentage of patients a particular treatment cures. If the
cure level of a treatment is very low, medical research is advised to
develop the treatment further, and statistical analysis is then used to
test whether the newly developed treatment is effective.
To estimate patient cure time, next visit, number of visits for a
particular treatment, etc.
To estimate the number of patients in future.
To infer from statistical analysis which particular disease causes major
problems or most affects regular life, so that further steps can be taken
to prevent or control such diseases.
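The basic summaries listed above (maximum, minimum and average of a patient characteristic, and averages by group) can be sketched in a few lines of Python. This is only an illustration: the patient records, the field names and the helper mean_visits_by_sex are hypothetical and are not taken from any real clinic.

```python
from statistics import mean

# Hypothetical patient records, invented purely for illustration.
patients = [
    {"age": 12, "sex": "F", "visits": 2},
    {"age": 35, "sex": "M", "visits": 4},
    {"age": 47, "sex": "F", "visits": 3},
    {"age": 63, "sex": "M", "visits": 6},
    {"age": 28, "sex": "F", "visits": 1},
]

# Maximum, minimum and average of one patient characteristic (age).
ages = [p["age"] for p in patients]
youngest, oldest, average_age = min(ages), max(ages), mean(ages)
print(youngest, oldest, average_age)  # 12 63 37

def mean_visits_by_sex(records):
    """Average number of visits within each sex group."""
    groups = {}
    for record in records:
        groups.setdefault(record["sex"], []).append(record["visits"])
    return {sex: mean(visits) for sex, visits in groups.items()}

visits_by_sex = mean_visits_by_sex(patients)
print(visits_by_sex)  # {'F': 2, 'M': 5}
```

The same grouping pattern applies to any other classification mentioned in the notes, such as treatment, complaint or department.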
Why do we need Statistics?
The objectives of this paper are twofold:
(1) To teach the student to organize and summarize data, and
(2) To teach the student how to reach decisions about a large body of
data by examining only a small part of the data.
The concepts and methods necessary for achieving the first objective are
presented under the heading of descriptive statistics, and the second
objective is reached through the study of what is called inferential
statistics.
Need for quantifying the data: As per the definition of statistics (i.e., a
branch of mathematics that deals with the collection, classification,
analysis and interpretation of numerical data), it mainly deals with
numerical data. Hence statistics can be applied only when we have
numerical data. But in many situations the researcher cannot get purely
numerical data (i.e., the data will be a mixture of numerical and
qualitative characteristics).
So, to draw valid conclusions from qualitative characteristics, it is
essential to convert the qualitative information into quantitative form by
assigning ranks or scale values.
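As a minimal sketch of this quantification step, the Python fragment below assigns scale values to qualitative responses. The five category labels follow the knowledge grades mentioned earlier in these notes, but the numeric scores 1 to 5, the name SCALE and the function quantify are assumptions made only for this example.

```python
from statistics import median

# Assumed scale values for ordered qualitative categories (illustrative only).
SCALE = {"very poor": 1, "poor": 2, "average": 3, "good": 4, "very good": 5}

def quantify(responses):
    """Convert each qualitative response into its numeric scale value."""
    return [SCALE[response.strip().lower()] for response in responses]

# Hypothetical responses from five people.
responses = ["Good", "poor", "Average", "very good", "good"]
scores = quantify(responses)
print(scores)          # [4, 2, 3, 5, 4]
print(median(scores))  # 4
```

Because such scores only express order, a rank-based summary such as the median is a safer choice here than the arithmetic mean.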
While conducting an oral health examination, the investigator makes
observations according to his judgment of the situation. This depends on
his skill, knowledge, experience and temperament.
Grading of plaque scores or malocclusion or the quality of diet of an
individual are situations, which are influenced by the particular investigator
who makes the observations. If the same observer repeats the observation
on the same case after some time lapse, he may or may not agree with his
previous assessment. Similarly, if more than one investigator observes the
same individual, all of them may not agree in their assessment. The
variability in measurement can be handled using statistics.
Epidemiology and biostatistics are sister sciences or disciplines. The
former collects facts relating to groups of the population in places, times
and situations, while the latter converts all facts into figures and at the
end translates them back into facts, interpreting the significance of the
results. Facts are qualitative in nature and do not admit several kinds of
statistical treatment, and hence have to be converted into figures for
statistical analysis.
Both epidemiology and biostatistics deal with facts-figures-facts, which is
termed quantitative methodology.
In community dentistry, the approach is primarily through
epidemiology and the social or behavioral sciences, all of which require
intensive studies, collecting facts, which are qualitative, and later
expressing them as figures, which are quantitative.
Example:
The oral health worker is interested in knowing how many people
have good oral hygiene or otherwise, the circumstances in which this
occurs, the age at which various upsets take place, whether they are
equally distributed between the sexes, which group is at risk of developing
diseases leading to mortality, and which areas of town, rural or urban, are
more or less affected by the diseases. As most of these events are counted,
they are the foundations of dentistry. And because these numbers vary
between people, from place to place and from time to time,
statistics finds its role in dentistry.
Data:
The raw material of statistics is data. For our purposes we may define data
as numbers. The two kinds of numbers that we use in statistics are
numbers that result from the taking, in the usual sense of the term, of
a measurement, and those that result from the process of counting.
Example: When a nurse weighs a patient or takes a patient's
temperature, a measurement consisting of a number such as 150 pounds
or 100 degrees Fahrenheit is obtained.
Quite a different type of number is obtained when a hospital administrator
counts the number of patients, perhaps 20, discharged from the hospital on
a given day. Each of the three numbers is a datum, and the three taken
together are data.
Variable
If, as we observe a characteristic, we find that it takes on different values in
different persons, places, or things, we label the characteristic a variable.
Example: Diastolic blood pressure, heart rate, heights of adult males
Random variable

Whenever we determine the height, weight, or age of an individual, the
result is frequently referred to as a value of the respective variable. When
the values obtained arise as a result of chance factors, so that they cannot
be exactly predicted in advance, the variable is called a random variable.
Example: Adult height. When a child is born, we cannot predict exactly his
or her height at maturity. Attained adult height is the result of numerous
genetic and environmental factors. Values resulting from measurement
procedures are often referred to as observations or measurements.
Population
The average person thinks of a population as a collection of entities,
usually people. A population or collection of entities may, however, consist
of animals, machines, places, or cells.
For our purposes, we define a population of entities as the largest
collection of entities in which we have an interest at a particular time. If
we take a measurement of some variable on each of the entities in a
population, we generate a population of values of that variable. We may,
therefore, define a population of values as the largest collection of values
of a random variable in which we have an interest at a particular time.
Populations may be finite or infinite. If a population of values consists of a
fixed number of these values, the population is said to be finite. If, on the
other hand, a population consists of an endless succession of values, the
population is an infinite one.
Example: If we are interested in the weights of all the children enrolled in
a certain county elementary school system, our population consists of all
these weights. If our interest lies only in the weights of first grade students
in the system, we have a different population: the weights of first grade
students enrolled in the school system. Hence populations are determined
or defined by our sphere of interest.

Sample
A sample may be defined simply as a part of a population. Suppose our
population consists of the weights of all the elementary school children
enrolled in a certain county school system. If we collect for analysis the
weights of only a fraction of these children, we have only a part of our
population of weights; that is, we have a sample.
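The relation between a population of values and a sample can be illustrated with a short Python sketch. Everything here is an assumption made for the example: the population is simulated rather than real, the sizes 500 and 50 are arbitrary, and simple random sampling without replacement is only one possible way of choosing the part of the population to observe.

```python
import random

random.seed(1)  # fixed seed so the illustration is repeatable

# Simulated population: weights (kg) of 500 schoolchildren (not real data).
population = [round(random.gauss(30, 5), 1) for _ in range(500)]

# A sample is a part of the population: here, 50 weights drawn at random
# without replacement.
sample = random.sample(population, 50)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(len(population), len(sample))  # 500 50
print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean will usually be close to, but not exactly equal to, the population mean; this difference is what inferential statistics has to account for.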

TYPES OF VARIABLE
(1). Quantitative variable
A quantitative variable is one that can be measured in the usual sense.
Measurements made on quantitative variables convey information
regarding amount.
Example: Weights of preschool children, age of the patients.
(2). Qualitative variable
Some characteristics are not capable of being measured in the sense that
height, weight, and age are measured. Many characteristics can be
categorized only.
Example: When an ill person is given a medical diagnosis, the person is
categorized rather than measured.
An object is said to possess or not possess some characteristic of interest.
In such cases measuring consists of categorizing.
(3). Discrete random variable
Variables may be characterized further as to whether they are discrete or
continuous.
A discrete random variable is characterized by gaps or interruptions in the
values that it can assume. These gaps or interruptions indicate the
absence of values between particular values that the variable can assume.
Example: The number of daily admissions to a general hospital is a
discrete random variable, since the number of admissions each day must
be represented by a whole number, such as 0, 1, 2, or 3. The number of
admissions on a given day cannot be a number such as 1.5, 2.432, or
3.9009.
The number of decayed, missing, or filled teeth per child in an elementary
school is another example of a discrete random variable.
(4). Continuous random variable
A continuous random variable does not possess the gaps or interruptions
characteristic of a discrete random variable. A continuous random variable
can assume any value within a specified relevant interval of values
assumed by the variable.
Example: Height, weight and age of an individual, water fluoride level

SCALES OF MEASUREMENT OF DATA
It is necessary to express the data measurements clearly, either in units or
as categories. Each level of measurement forms a scale of measurement,
which is defined by the degree of accuracy and sophistication of the
measuring device.
Measurement: This may be defined as the assignment of numbers to
objects or events according to a set of rules. The various measurement
scales result from the fact that measurement may be carried out under
different sets of rules.
The following scales are commonly used:
i). Nominal scale: (By name, label, and tag)
The lowest measurement scale is the nominal scale. As the name implies, it
consists of naming observations or classifying them into various mutually
exclusive and collectively exhaustive categories.
Examples include such dichotomies as:
Outcome of cancer: Dead, alive
Goals of RCT: Achieved, not achieved
ii). Ordinal scale: (With implicit order of relationship)
Whenever observations are not only different from category to category but
can also be ranked according to some criterion, they are said to be measured
on an ordinal scale.
Example: OHI score: Poor, Fair, Good
Students' intelligence: Above average, Average, Below average
iii). Interval Scale: (Number between characters)
The interval scale is a more sophisticated scale than the nominal and
ordinal scales in that with this scale it is not only possible to order
measurements, but the distance between any two measurements is also
known. The interval scale, unlike the nominal and ordinal scales, is a truly
quantitative scale. We know, say, that the difference between a measurement
of 20 and a measurement of 30 is equal to the difference between
measurements of 30 and 40. The ability to do this implies the use of a unit
distance and a zero point, both of which are arbitrary.
Example: Age of the patient, BP, Water fluoride level.
iv). Ratio scale: (Relative magnitude)
The highest level of measurement is the ratio scale. This scale is
characterized by the fact that equality of ratios as well as equality of
intervals may be determined. Fundamental to the ratio scale is a true zero
point.
Example: Gingival bleeding per 1000 people, Height by weight

RELIABILITY OF DATA
Reliability is checked by testing the findings or results from the data.
If the agency has used proper methods to collect the data, the statistics
may be relied upon.
Reliability indicates consistent results in repeated observations. Many
factors determine the reliability of data. The major factors are:
Inherent variation, such as unused reagents being used after a long lapse
of time, or a weighing machine that does not read zero when empty.
Observer variation, as when the same person makes repeated
measurements, e.g., BP recordings, MP smear examination, pulse rate
recording, etc.
Variable fluctuations, such as replies given by respondents according to
their capability of understanding the questions and replying.
Inter-observer variation, as when many people and many instruments are
involved in recording.

VALIDITY OF DATA
Data obtained by measurement should measure what it is supposed to
measure. Concept of validity relies upon the specific situations at data
collection.
Example:
An oral interview is not a valid measure of abortion practice.
Having no children is not a valid measure of infertility.
Fever is not a valid indicator of malaria in a non-malarious area.

Validity is measured by sensitivity and specificity. Sensitivity is the
proportion of truly positive observations that are correctly identified by a
test. Specificity is the proportion of truly negative observations that are
correctly identified by a test.
Notation for test validation of measurements of data

Test result                True picture (e.g. Disease)
(e.g. Screening Test)      Positive     Negative     Total
Positive                   a            b            (a+b)
Negative                   c            d            (c+d)
Total                      (a+c)        (b+d)        (a+b+c+d)

Sensitivity = a / (a + c)
Sensitivity = number positive on both the test result and the true picture /
total number positive on the true picture

Specificity = d / (b + d)
Specificity = number negative on both the test result and the true picture /
total number negative on the true picture

Positive predictive value = a / (a + b)
Positive predictive value = number positive on both the test result and the
true picture / total number positive on the test result

Negative predictive value = d / (c + d)
Negative predictive value = number negative on both the test result and the
true picture / total number negative on the test result
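
The four validity measures can be computed directly from the cell counts of the 2 x 2 table. In the sketch below, a is the number positive on both the test and the true picture, b is test-positive but truly negative, c is test-negative but truly positive, and d is negative on both. The function name validity_measures and the counts used are hypothetical, chosen only to show the arithmetic.

```python
def validity_measures(a, b, c, d):
    """Return sensitivity, specificity and predictive values from 2 x 2 counts.

    a: test positive, truly positive    b: test positive, truly negative
    c: test negative, truly positive    d: test negative, truly negative
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "positive_predictive_value": a / (a + b),
        "negative_predictive_value": d / (c + d),
    }

# Hypothetical screening results for 1000 people (invented counts).
measures = validity_measures(a=80, b=30, c=20, d=870)
for name, value in measures.items():
    print(name, round(value, 3))
# sensitivity 0.8
# specificity 0.967
# positive_predictive_value 0.727
# negative_predictive_value 0.978
```

Note that sensitivity and specificity are calculated down the columns of the table (the true picture), while the predictive values are calculated along the rows (the test result).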

SOURCES OF DATA
The performance of statistical activities is motivated by the need to answer
a question. For example, clinicians may want answers to questions
regarding the relative merits of competing treatment procedures.
Administrators may want answers to questions regarding such areas of
concern as employee morale or facility utilization. When we determine that
the appropriate approach to seeking an answer to a question will require
the use of statistics, we begin to search for suitable data to serve as the
raw material for our investigation.
Before data collection, the type of data, primary or secondary, should be
decided. The choice of data depends on:
Nature and scope of the study,
Availability of finance and time,
The degree of accuracy needed,
Nature of the investigation (individual or government study).
Generally, primary data are preferable for most surveys.
The main sources of data are
1). Routinely kept records
2). Surveys
3). Experiments
Data can be collected through either
a). Primary source
b). Secondary source
1. Routinely kept records
It is difficult to imagine any type of organization that does not keep records
of the day-to-day transactions of its activities. Outpatient (OP) medical
records, for example, contain information on patients and their habits,
while other routinely kept records contain data on the facility's
business activities. When the need for data arises, we should look for them
first among routinely kept records.

2. Surveys
If the data needed to answer a question are not available from routinely
kept records, the logical source may be a survey. Suppose, for example,
that the administrator of a clinic wishes to obtain information regarding the
mode of transportation used by patients to visit the clinic. If admission
forms do not contain a question on mode of transportation, we may
conduct a survey among patients to obtain this information.

3. Experiments
Frequently the data needed to answer a question are available only as the
result of an experiment. A nurse may wish to know which of several
strategies is best for maximizing patient compliance. The nurse might
conduct an experiment in which the different strategies of motivating
compliance are tried with different patients. Subsequent evaluation of the
responses to the different strategies might enable the nurse to decide
which is most effective.

a). Primary Source
The first-hand information that is collected for the first time by the
investigator for the purpose of his study is called primary data.
This is first-hand information.
This data is original in character.
The primary data collection methods: To collect primary data, six
methods are commonly used. They are:
1. Direct personal investigation
2. Oral health examination
3. Indirect oral investigation
4. Questionnaire method
5. Local correspondent method
6. Enumeration method
(1). Direct personal investigation: In this method, the investigator
personally meets the informants and collects the information by asking
them questions. The persons from whom the information is collected are
called informants. This method is intensive rather than extensive. The
investigator must be a keen observer, and tactful and courteous in behavior.
Suitability:
This method can be employed when
High accuracy is needed.
The coverage area is small.
Confidential data is needed.
Intensive study is needed.
Sufficient time is available.
Merits:
Original (first-hand) data is collected.
The collected data are highly reliable.
A high degree of accuracy can be achieved.
Due to the personal approach, the response will be higher.
Correct information can be extracted from the informant.
Cross-examination is possible.
Misinterpretation on the informant's part can be avoided.
Demerits:
This method is not advisable when the coverage area is large and
time and finance are limited.
The possibility of bias is greater.
An untrained investigator cannot produce good results.
It is expensive and time consuming.
(2). Oral health examination:
When information is needed on oral diseases, this method provides
more valid information than a health interview. It is conducted by dentists,
technicians, and trained investigators. This method cannot be
considered for an extensive study because it is expensive and also because
one has to consider providing treatment to people found to be suffering
from certain diseases.
(3). Indirect oral investigation:
If the informant is unwilling (reluctant) to provide information, this
method can be used. In this method the investigator does not meet the
actual informant. Instead, the investigator meets witnesses, third
parties or friends who are in touch with the informant. The investigator
interviews the people who are directly or indirectly connected to the
informant and collects the information.
For example:
When information relating to gambling, drinking or smoking habits is
to be collected, the informant will not provide it and may not even
respond to the study. In such situations the investigator has to approach
friends, neighbors, etc., of the actual informant to collect the information.
Police departments usually adopt this method.
Example: Police department, riots, alliance, etc.
Merits:
It is a simple and convenient method.
It is suitable when the investigation area is large.
It saves time, money and labor.
The information is unbiased.
Adequate information can be collected.
Demerits:
The result depends on the prejudices of third parties.
To get adequate information, a large number of persons may have to be
interviewed.
An interview with an unsuitable person will spoil the result.
Bad information will spoil the result.
(4). Questionnaire method:
In this method, a separate questionnaire consisting of a list of
questions for the enquiry is prepared. There are two ways to collect
information through this method:
(i). Mailed questionnaire
(ii). Direct questionnaire
(i). Mailed questionnaire method
The questionnaire is sent to the informants with a request to extend
their co-operation by filling in the questionnaire correctly and returning
it. To get a quick and better response, the postal expense is
borne by the investigator. After the questionnaires are received back, the
analysis is carried out. The research workers of state and central
governments adopt this method.
(ii). Direct questionnaire method
The investigator directly meets the informants and collects the information
by asking questions from the questionnaire.
Suitability:
This method is advisable if
The coverage area is wide.
There is a legal compulsion to supply information, so that the risk of
non-response is eliminated.
Merits:
This method is the most economical compared with other
methods.
This method of data collection covers a wide area and reduces
money, time and labor.
Bias is less since the data is collected directly from the
respondents.

Demerits:
There is no direct contact between the investigator and
respondent.
The accuracy and reliability are less.
This method is suitable among literate people only.
There is the possibility of delay in receiving questionnaire.
The people may furnish wrong information.
Asking supplementary questions is not possible.
Framing questionnaire:
In

this

mailed

questionnaire

method,

questionnaire

is

the

communication media between the investigator and the informant. Hence,


the success of investigation is based on the questionnaire. So the
questionnaire must be designed with adequate skill, efficiency and
experience.
Characteristics of Good questionnaire:
Number of questions should be minimum
Questions should be short and simple to understand.
Questions should be arranged in logical order.
Questions may have multiple-choice answers.
Personal questions are to be avoided.
The questions that require calculations are to be avoided.
Questions of sensitive and personal type should be avoided.
The wording of the questionnaire should not hurt the feelings of
respondents.
Questionnaire information must be given.
Questionnaire should look attractive.

Pre-test: After the questionnaire is prepared, a pre-test is to be done.
The process of refining and validating the questionnaire by collecting
information from a small number of related respondents with the framed
questionnaire, with a view to overcoming its shortcomings,
is called the pre-test. If any shortcoming is found in the questionnaire, the
necessary correction is incorporated in it. After the required changes are
incorporated, a pilot study is carried out.
Pilot study: Whenever the investigator has to deal with a large survey, he
should not plunge into it directly. After the pre-test is over, a pilot study
is carried out to overcome the shortcomings of the analysis. This is a small-scale
survey with a small number of persons. The data collected through the pilot
study is analyzed. If any technical difficulty is found in the analysis, then the
questionnaire is altered. The main survey is undertaken if the pilot study
does not reveal any analytical difficulties. (See Figure 1.)

Figure 1. Flow chart of questionnaire testing: Questionnaire, then Pre-test.
Is there an error? Yes: Review & Refine. No: Pilot survey.
Is there an analytical difficulty? Yes: Review & Refine. No: Main Survey.

(5). Local Correspondents Method:


In this method, instead of the researcher collecting the information,
local agents are appointed to collect it. They
collect the information from the informants and send the collected data to
the actual researcher or investigator. The data collection is done according
to the local correspondent's own approach. Newspaper agencies, magazines, etc. adopt
this method.
Suitability:
If the data is required regularly from the wide area, this method can
be used.
Merits:
Extensive information is collected.
This is the cheapest and most economical method.
Information will be collected regularly.
Demerits:
Information may be biased.
The degree of accuracy cannot be maintained.
Data may be of duplicate nature.
(6). Enumerator method:
In this method, a number of enumerators are selected and trained to
collect the data. They are provided the questionnaires and trained to fill up
the questionnaire. They meet the informant along with the questionnaire
and collect the data by filling up the questionnaire. The enumerator
explains the object, purpose of the study to the informant.

Merits:
Intensive information is collected.
This method yields reliable and accurate results.
This method is helpful even if the informants are illiterate,
because the enumerator records the information.
Due to personal contact, non-response is less.
Demerits:
This method requires more money and time.
Personal bias of the enumerator may lead to wrong conclusions.

b). Secondary Source


Second-hand information, that is, information collected from already
existing sources for the study, is called secondary data. That is, the
researcher gets the required information from data that has
already been collected by someone else for his own purpose. The sources of secondary
data are,
Published sources:
The data published by the various governments and by local and
international agencies is published data.
International publications:
IMF, IBRD, ICAFE, UNO, etc. publish data at regular time
intervals.
Central and state governments:
Departments of the union and state governments regularly publish
data. Other sources are the RBI Bulletin, the Census of India, the Indian
Trade Journal, etc.
Semi-official publications:

The semi government institutions like district, panchayat, municipal,


corporation etc, publish the statistical data.
Research institutions publication:
The research institutions such as Indian statistical institution (ISI); Indian
agricultural statistics research institute (IASRI) etc., publish the data.
Journals and newspapers:
Some journals like Indian finance, commence etc, publish the
current and important material on statistics and socio-economic problems.

Unpublished sources:
There are various unpublished data sources. Various government
and private offices maintain them. They also include the data collected by
researchers in universities or research institutions.

Precautions in Using Secondary Source


Secondary data may not be reliable, and data collected long
ago may be inadequate. So before using secondary data in the
analysis, some precautions must be taken.
The precautions are,
Suitability of data:
The available data should be suitable for the study. This
characteristic is to be examined by the investigator himself. The data
should be consistent with the scope of the present analysis.
Adequacy of data:
After the suitability is tested, the data must be checked for adequacy for the
study. That is, adequate data must be extracted from the source to carry
out the analysis.
Reliability of data:
Reliability is checked by testing the findings or results from the data.
If the agency has used proper methods to collect the data, the statistics
may be relied upon.

COLLECTION OF DATA
The first and foremost step of the research process is data collection.
Before the statistical investigation, the researcher has to know the nature,
objective and scope of investigation, time and type of investigation and the
desired degree of study.
The two types of investigation are
Census/complete enumeration method.
Sampling method.

Census Method
A data collection method that collects information from each and
every unit of the population is called the census method. That is, in this
method the data is collected from all the population units. For example, to study
the average height of the students of a particular college, the
investigator has to measure the height of every student in that
college.
Population: The collection of individual items with which the
investigation is concerned is called the population.
Merits:
The data is collected from all the items of study. Hence, bias is
minimized and the data is more accurate and reliable.
The highest accuracy can be maintained.
Results drawn from the data collected through this method are
more representative and true.

Demerits:
When the coverage area is wide, this method is not suitable,
because it will take more money, time and energy.
The cost needed is high; hence only an organization that possesses
huge finance and manpower can adopt this method.
If the population size is infinite, this method is not suitable.
If the study involves a destructive type product, this method is not
suitable.
Destructive type product: A product that cannot be used after its initial
use is called a destructive type product.
Types of population: The two types of population are,
Hypothetical population.
Existent population.
The collection of concrete objects or persons under
investigation constitutes the existent population. The existent population
may be finite or infinite. An existent population that consists of a countable
number of individuals or objects is called a finite population.
An existent population that consists of an uncountable number of individuals
or objects is called an infinite population. E.g., in the study of the economic level
of the students of a particular college, all the students of that college constitute
the population, and their number is finite. Hence it is a finite population. E.g., in the study of the
characteristic pattern of the stars in the sky, all the stars in the sky constitute
the population. But they are (practically) uncountable. Hence it is treated as an infinite population.
The collection of non-concrete objects, which exist only in
imagination and are uncountable, constitutes the hypothetical or theoretical
population. For example, in the study of the pattern of results of the coin tossing
experiment, the researcher cannot get the concrete result. He can only
imagine the result as head or tail.
Hence the results of the coin tossing experiment constitute the
hypothetical population.

Sampling Method
The method or technique that is adopted to select the sample from the
population is called the sampling method.
Sample: A finite subset or small part of the population that closely reflects
the characteristics of the population and is used to make valid inferences
about the entire population is called a sample.
Objectives:
To get more information about the population with minimum effort
time and cost.
To estimate the population parameters through its statistic.
To obtain the degree of precision of the drawn result through its
statistic.
To draw valid conclusion about the population.
To give desired result with required precision with the given
minimum cost.
To identify the true representative of the population.
Merits:
It is more economical, i.e., it saves time, money and energy
because of the limited number of investigation units.
It helps to achieve a high degree of accuracy.
It helps to get reliable results for the population.
It serves as an alternative to the census method.
It helps to organize and administer the survey easily.
If an approximate result is needed, this method can
be used.
Demerits:
Careful planning must be followed; otherwise the result will be
incorrect and biased.
The result depends on the investigator. The attitude of the personnel
will affect the result.
There is a possibility of large errors.
Hence,
The sample must be a true representative of the population.
Experienced personnel have to be employed for the fieldwork.
The sample size must be adequate.
The coverage area should be small.
The two types of sampling methods are,
Probability sampling
Non-probability sampling.
Probability sampling: The sampling method that follows some standard
procedure and selects the units with pre-defined probability is called
probability sampling.
The five types of probability sampling method are,
1). Simple (Equal) Random (chance) Sampling.
2). Stratified Random Sampling.
3). Systematic Random Sampling.
4). Cluster Sampling.
5). Multistage Sampling.
(1). Simple random sampling: A sampling procedure that selects
the sample from the population in such a way that each population unit
has an equal and independent chance of being included in the sample is
called simple random sampling.
This is the simplest method of selecting a sample. This method is
applicable when the population is of homogeneous nature. A simple
random sample can be selected in two ways.
(i). Lottery method:
In this method, all the population units are numbered or named. Then the
numbers or the names are written on different slips or cards of the same size and
shape so that one card cannot be distinguished from the others.
These cards are placed in a box and shuffled well so that no particular
card gets any preference in selection. From that box the cards are drawn one by
one until the desired number of units is selected.
The only drawback of this method is that it is not suitable if the population size is
very large.
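The lottery method can be imitated on a computer. The following is only a minimal sketch: the function name lottery_draw, the list of student names and the seed are assumptions made for illustration, and shuffling a list stands in for shuffling the cards in the box.

```python
import random

def lottery_draw(units, sample_size, seed=None):
    """Draw sample_size distinct units, as if pulling slips from a box."""
    rng = random.Random(seed)
    slips = list(units)          # one slip per population unit
    rng.shuffle(slips)           # shuffle the box well
    return slips[:sample_size]   # draw slips without putting them back

# Hypothetical population of 20 students.
students = [f"Student {i}" for i in range(1, 21)]
sample = lottery_draw(students, sample_size=5, seed=42)
print(sample)
```

Because the slips are not put back, no unit can be selected twice.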
(ii). Random number table method:
In this method, the sample is selected from the population by making
use of a random number table. A table that contains random digits
arranged in row and column format is known as a random number table.
Selection process:
A random number table is an arrangement of five-digit numbers in row
and column format.
The selection may proceed row-wise or column-wise.
Assign numbers to the population units.
Decide the sample size.
Count the number of digits in the population size, say k.
Read out a number with k digits from the random number table.
If the number read is greater than the population size, ignore it and
select the next number.
If the number read is less than or equal to the population size, include the
corresponding population unit in the sample.
Continue this process until the required number of sample units is
selected.
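The selection process above can be sketched in code. This is an illustration under stated assumptions, not a standard routine: a seeded pseudo-random generator stands in for the printed table, the function name select_by_random_digits is hypothetical, and the number 0 and repeated numbers are skipped (sampling without replacement), which the steps above do not state explicitly.

```python
import random

def select_by_random_digits(population_size, sample_size, seed=0):
    """Pick sample_size distinct unit numbers from 1..population_size."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    rng = random.Random(seed)        # stands in for a printed random number table
    k = len(str(population_size))    # number of digits to read at a time
    chosen = []
    while len(chosen) < sample_size:
        # Read the next k random digits as one number.
        number = int("".join(str(rng.randrange(10)) for _ in range(k)))
        # Ignore 0, numbers beyond the population size, and repeats.
        if 1 <= number <= population_size and number not in chosen:
            chosen.append(number)
    return chosen

sample = select_by_random_digits(population_size=250, sample_size=10)
print(sample)
```

With a population of 250 units, k is 3, so three-digit numbers from 000 to 999 are read and those above 250 are discarded.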
Several standard random number tables are available. Some
of them are,
L.H.C. Tippett's random number table: 10,400 four-digit numbers.
Fisher and Yates' random number table: 15,000 digits arranged in two-digit numbers.
Kendall and Babington Smith's random number table: 25,000 four-digit numbers.
RAND Corporation's random number table: 200,000 five-digit numbers.

Merits:
There is less chance for personal bias.
As the sample size increases, the selected sample becomes more
representative.
Sampling errors can be measured.
This method saves money, time and labor.
Demerits:
This method requires a complete list of the population, but in many
enquiries this is not possible.
As the sample size decreases, the sample may not represent the
population.
If the population units are of heterogeneous nature, this method
cannot be employed.
(2). Stratified random sampling: A sampling method that selects the sample
from a heterogeneous population by dividing the population into
homogeneous sub-groups called strata is called stratified random
sampling.
Since the population is of heterogeneous nature, the population is
divided into strata that are of homogeneous nature. From each
stratum, a number of sample units is selected, and together these constitute the sample.
The two types of stratified random sampling method are,
(i). Proportional method: If the sample is selected from each stratum
in proportion to its size, then the sample is selected by the proportional
method.
(ii). Optimum method: If the sample is selected from each stratum by
considering the cost, then the sample is selected by the optimum allocation
method. That is, the sample is selected based on the cost.
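As an illustration of the proportional method, the number of units taken from a stratum of size N_h can be computed as n x N_h / N, where n is the total sample size and N the population size. The sketch below assumes hypothetical stratum names and sizes; the function name proportional_allocation is also an assumption.

```python
def proportional_allocation(stratum_sizes, sample_size):
    """Allocate sample_size across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in stratum_sizes.items()
    }

# Hypothetical college with three strata (year of study).
strata = {"first year": 500, "second year": 300, "third year": 200}
allocation = proportional_allocation(strata, sample_size=100)
print(allocation)
```

Here the allocation is 50, 30 and 20 units. When the proportions do not come out as whole numbers, rounding may make the total differ slightly from n, and a small adjustment is then needed.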

Merits:
The sample selected by this method is more representative of
population.
It ensures greater accuracy.
For the heterogeneous population this method is more reliable.
Demerits:
The process of dividing the population into strata requires more
time money and experience.
If the stratification is not proper, then the sampling bias will prevail
in the sample.
(3). Systematic sampling: A probability sampling method that selects the
sample by making use of an up-to-date, complete list of population units is
called systematic sampling. In this method, only the first
sampling unit is selected at random, so it is also known as quasi-random
sampling. After the first unit is selected, the
remaining units of the sample are automatically determined by the random
start and the sampling interval.
If a complete and up-to-date list of population units is available,
then this method can be used.
Selection procedure:
Assume that we have to select n units from N population units.
Arrange the items in numerical, alphabetical, geographical
or any other order.
Find the sampling interval k = N / n such that nk = N.
Select the random start i such that 1 ≤ i ≤ k.
Select the i-th, (i+k)-th, (i+2k)-th, ..., (i+(n-1)k)-th
units to constitute the systematic sample.
Hence the random start determines the whole sample.
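The selection procedure above can be sketched as follows. This is a minimal illustration that assumes N is an exact multiple of n; the function name systematic_sample, the population of 100 numbered units and the seed are hypothetical.

```python
import random

def systematic_sample(units, n, seed=None):
    """Select n units at a fixed interval after a random start."""
    N = len(units)
    if n <= 0 or N % n != 0:
        raise ValueError("this sketch assumes N is an exact multiple of n")
    k = N // n                                # sampling interval
    i = random.Random(seed).randint(1, k)     # random start, 1 <= i <= k
    # Positions i, i+k, ..., i+(n-1)k, counted from 1.
    return [units[pos - 1] for pos in range(i, N + 1, k)]

units = list(range(1, 101))   # hypothetical list of 100 numbered units
sample = systematic_sample(units, n=10, seed=1)
print(sample)
```

Once the random start is fixed, every later unit follows from it, which is why only k different samples are possible.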
Merit:
This method is simple and operationally more convenient.
Time and work involved in selection procedure is less.
Demerit:
This sample may not represent the population.
If the population size is not a multiple of the sample size, one cannot get
the required number of sampling units.
(4). Cluster sampling: A probability sampling method that selects the
sample by grouping the population units into groups called clusters, based on the
similarity of objects, and selects the sampling units through the selection of
clusters is known as cluster sampling.
Cluster sampling is similar to stratified random sampling; the only
difference is that in the former all the units of the selected clusters
constitute the sample, whereas in the latter the sampling units are selected
from within each stratum.
Merits:
It introduces flexibility in the sampling method.
It is suitable in large-scale surveys, where the list preparation is
difficult.
Demerits:
It is less accurate than other methods.
(5). Multistage Sampling: When the available resources have to be
concentrated on a limited number of units for study, multistage sampling
helps a lot. In the national sample survey, multistage sampling is used. For a
total health care programme, the question of which village, which house
and which person is answered by this type of sampling.
I stage: Village selection
II stage: Household selection
III stage: Person selection
The reduction in cost and the concentration of the available resources on
the selected samples are advantages. An increase in sampling error is
expected, since the variation between the final units within a
group is smaller than the variation between groups. Unequal sizes at different stages may pose
analytical difficulties.
Another Example:
I stage: Urine sugar positive cases are selected by screening tests.
II stage: All positive cases from stage I are subjected to PPBS, and those
who are above the critical level of PPBS are selected.
III stage: Among those above the critical level of PPBS, retinoscopy for diabetic
retinopathy is done and positive retinopathy cases are selected.
Non-probability sampling: A sampling method that does not follow any
standard procedure and selects the units with unknown probability is called
non-probability sampling. This method is the direct opposite of the
probability sampling method.
The three types of non-probability sampling methods are,
1. Judgment or purposive sampling.
2. Convenience sampling.
3. Quota sampling.
Judgment/purposive sampling: The sampling method that selects the
sample units to achieve a specific purpose is called judgment or
purposive sampling. In this method the sampler's choice plays a
major role in selecting the sampling units.
For example, to study the cultural activity of the students in a
particular college, the sampler has to select the students who are interested
in cultural activity. Only then does the study lead to a valid conclusion. Otherwise,
the sample does not reflect the population characteristic, the cultural skill
of the college. Hence he has to find the students who are involved in that
activity and collect the information from them.
Merits:
It is a simple method.
The sample collected is more representative.
This method can be adopted for public policy, decision making,
etc.
Demerits:
Due to the sampler's interest, the sample may not be a true representative
of the population.
It is difficult to correct sampling errors.
The estimates will not be accurate.

Quota sampling: This method is similar to stratified random sampling.
In this method the population is divided into various quotas and then
the sample is selected from each quota. The sample size per quota is
decided by personal judgment. This is also known as the stratified purposive sampling
method.
Merits:
This method saves money and time.
Demerits:
The result depends on the investigators.
Personal bias is possible.
Since the sample selection is not based on random sampling, sampling
errors cannot be estimated.
Convenience sampling: The sampling method that selects the sample
units based on the convenience of the investigator is called convenience
sampling. If
The universe is not clearly defined.
The sample unit is not clear.
A complete list is not available.
Then this method can be used.
Demerits:
The sample is not a true representative of the population.
The results are biased.
But this method can be used for a pilot study.
Applications of Sampling Designs

1. Identification of predisposing factors, precipitating factors and
perpetuating factors which influence health and disease.
2. Evaluation of health programmes.
3. Impact studies.
4. Coverage surveys.
5. Planning, administration and implementation of activities.
6. Forecasting the future.
7. Environmental studies.
8. Evaluation of health status.

PRESENTATION OF DATA
After the data collection is over, the researcher has raw data (i.e., the
information prior to proper arrangement is known as raw data). The raw data
are huge and confusing. As such, the researcher cannot carry out analysis on them,
and they do not furnish any useful information. So, to condense and present
the data in a compact manner, we go for presentation of data. There are
three main types of presentation. They are,
1. Classification,
2. Tabulation, and
3. Graphical representation.

Classification: The process of arranging the data into sequences and
groups according to their common characteristics and separating them into
different but related parts is called classification.
Objects:
The raw data are classified,
To condense the mass of data.
To present the data in simpler form.

To differentiate the similarity and dissimilarity among the data.


To facilitate comparison and statistical treatment.
To bring out relation.
To facilitate further analysis.
To eliminate the unnecessary data.
Rules for classification:
The classes should be rigidly defined, i.e., there should not be any
ambiguity in their rules.
The classes should not overlap, i.e., each item of data must have its
place in only one class.
The classification must be flexible enough to adjust to new situations.
The items included in the total and sub-total of a class and its subclasses must
be the same.

Types of classification:
Geographical classification: Classifying the data based on the
area of its occurrence such as states, districts, Taluks etc., is called
as geographical classification.
Chronological classification: Classifying the data based on the
time of its occurrence such as decades, Years, Months, etc., is
called as chronological classification.
Quantitative classification: Classifying the data based on some
characteristics that is capable of quantitative measurement like age,
price, weight etc., is called as quantitative classification.
Qualitative classification: Classifying the data based on
qualitative characteristics such as sex, honesty, literacy, etc., is
called qualitative classification.
That is, the presence or absence of the characteristic is presented
in this type of classification.

Tabulation: The systematic arrangement of numerical data in the form of
rows and columns in accordance with some characteristics is called
tabulation.
Objects:
To simplify complex data.
To clarify characteristics of data.
To facilitate comparison.
To detect errors and omissions in the data.
To facilitate statistical processing.
The parts of table are:
1. Table number,
2. Title,
3. Head note,
4. Caption,
5. Stubs,
6. Body of table,
7. Foot-note,
8. Source-note.
The table number is used for identification of and reference to the table in
future. For reference and explanation, the columns may also be
numbered.
Each table has to be given a suitable title. Suitable in the sense that it
must describe the content of the table.
The head note is a statement about the table that is placed below the
table title within brackets. Usually the units of measurement of the table
are placed here, such as in millions, in crores, etc.
The headings of the columns are called captions. They must be
brief and self-explanatory. A caption may have sub-headings.
The row headings are called stubs.
The most important part of the table, which contains the numerical
information, is called the body of the table. A footnote is used to provide any
explanation about the items in the table.
Types of tabulation:
1. One-way tabulation,
2. Two way tabulation, and
3. Manifold tabulation.
One-way Table: A table that displays information on a single variable is
called a one-way table or univariate table. The variable may be discrete or
categorical.
Two-way Table: A table that displays information on the categories of a
single variable over the categories of another variable is known as a two-way
table or bivariate table.
Manifold table: A table that shows information on the categories of more than two
variables is known as a manifold table.
Frequency Distribution: A type of tabulation that summarizes the raw data
in the form of a table showing the variable values or variable class intervals
and their corresponding frequencies is known as a frequency table. It may
be of one-way, two-way or manifold type.
Moreover, a frequency table
1) Organizes the data in a compact manner without loss of
essential information.
2) Describes how the total frequency is distributed over different
classes or discrete points.
There are three types of frequency tables. They are,
1. Discrete frequency table.
2. Continuous frequency table.
3. Relative frequency table.
Discrete Frequency table: A Frequency table that shows the distribution
of frequencies at different distinct values of variable is known as discrete
frequency table.
Procedure to form discrete frequency table:
1. Draw a table with three columns, namely variable, tally marks and
frequency.
2. Take the first observation.
3. Write down the observation in the variable column and put a tally
mark (|) against the written observation in the tally mark column.
4. Take the next observation.
5. Check whether the observation is already entered in the variable column
or not.
6. If it is entered, put another tally mark against the written
observation. Else, go to step 3.
Repeat steps 4 to 6 until all the observations
are entered in the table.
7. Count the number of tally marks for each value of the variable and put the totals
in the frequency column.
8. The resultant table is called the discrete frequency table.
9. If the row for any value already has four tally marks, then the next
occurrence of that value is marked by putting a cross mark
over the four bars. This facilitates counting.
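The tallying procedure above amounts to counting how often each distinct value occurs. A minimal sketch is given below; the data (number of children per family) and the function name discrete_frequency_table are hypothetical, and the standard library Counter does the tallying.

```python
from collections import Counter

def discrete_frequency_table(observations):
    """Return (value, frequency) pairs sorted by value."""
    counts = Counter(observations)   # one tally per occurrence of each value
    return [(value, counts[value]) for value in sorted(counts)]

# Hypothetical data: number of children in 10 families.
children = [2, 1, 3, 2, 0, 2, 1, 3, 2, 1]
table = discrete_frequency_table(children)
for value, freq in table:
    # Print the value, its tally marks and its frequency.
    print(value, "|" * freq, freq)
```

The frequencies add up to the number of observations, which is a useful check on the tallying.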

Continuous Frequency table: A frequency table that shows the
distribution of frequencies over different class intervals of values is known
as a continuous frequency table.
Procedure to form Continuous frequency table:
1. Draw a table with three columns, namely variable, tally marks and
frequency.
2. Find the smallest and largest observations in the data set.
3. Decide the class interval.
4. Write down the class limits with equal class intervals under the
heading variable.
5. Take the first observation.
6. Decide in which class it falls.
7. Put a tally mark (|) against that class in the tally mark
column.
8. Take the next observation.
9. Repeat steps 6 to 8 until all the
observations are entered in the table.
10. Count the number of tally marks for each class and put
the totals in the frequency column.
11. The resultant table is called the continuous frequency table.
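The procedure above can be sketched in code. The sketch assumes equal class widths and treats each class as including its lower limit and excluding its upper limit (so 155 falls in the class 155-165); this convention, the height data and the function name continuous_frequency_table are assumptions for illustration.

```python
def continuous_frequency_table(data, start, width, n_classes):
    """Count observations in n_classes equal classes beginning at start."""
    freqs = [0] * n_classes
    for x in data:
        index = int((x - start) // width)   # which class the observation falls in
        if not 0 <= index < n_classes:
            raise ValueError(f"{x} falls outside the class limits")
        freqs[index] += 1                   # one tally mark for that class
    return [
        ((start + j * width, start + (j + 1) * width), f)
        for j, f in enumerate(freqs)
    ]

# Hypothetical heights (cm) of 10 students.
heights = [152, 167, 158, 171, 163, 149, 155, 160, 174, 166]
table = continuous_frequency_table(heights, start=145, width=10, n_classes=3)
for (lower, upper), freq in table:
    print(f"{lower}-{upper}", "|" * freq, freq)
```

The smallest and largest observations (149 and 174) decide the first lower limit and the number of classes needed.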
Relative Frequency Table: A frequency distribution in which the
frequencies are expressed as fractions or percentages of the total number of
observations is known as a relative frequency distribution.
Note that the sum of the relative frequencies is equal to one when
the frequencies are expressed as fractions, and the total is 100 when the
frequencies are expressed as percentages.
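As a small illustration, relative frequencies are obtained by dividing each frequency by the total. The class frequencies used below are hypothetical, as is the function name relative_frequencies.

```python
def relative_frequencies(freqs):
    """Return the frequencies as fractions and as percentages of the total."""
    total = sum(freqs)
    fractions = [f / total for f in freqs]
    percentages = [100 * f / total for f in freqs]
    return fractions, percentages

# Hypothetical class frequencies.
fractions, percentages = relative_frequencies([2, 4, 4])
print(fractions)
print(percentages)
```

The fractions add up to one and the percentages add up to 100, as noted above.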

Graphical representation:

Classification and tabulation are used to present the data in a
neat, concise, systematic and understandable manner. But when a large
amount of information extends over a large number of columns, it is
difficult to understand the significance of the data. Hence, statisticians
found it necessary to introduce diagrams and graphs.
Classification is the process of grouping data into homogeneous
groups or categories. Tabulation is the process of presenting the classified
data in tabular form.
The process of highlighting the salient features of a study through
graphs and charts is called graphical representation. This type of
presentation makes the data easy to understand. Moreover, attractive graphs and
charts can be understood at a glance even by a layman.
Merits:
Diagrams are attractive and create interest in the minds of readers.
Diagrams are easily understandable even to the layman.
In interpretation, a diagram saves much time,
i.e., people may not like to go through numerical figures, but they
may like to go through diagrams.
Diagrams make data simple,
i.e., diagrams are remembered at a glance, and readers can
easily understand the pattern of the data.
A diagram facilitates comparison of two or more sets of data.
Diagrams reveal more information than data in a table.
Limitations:
Diagrams cannot be used for further analysis.
Diagrams show approximate values only.
A diagram exposes only limited facts,
i.e., all details cannot be presented in the form of diagrams.
Construction of a diagram needs some intelligence and experience.
A diagram is a supplement to tabulation, not an alternative to it.
Rules for making diagrams:
Every diagram must be given a suitable title in bold letters.
The title conveys the main fact depicted by the diagram.
Sub-headings may also be given.
The title should be brief and self-explanatory.
To allow comparison, the diagram must be drawn accurately and
neatly.
Each diagram should be numbered for further reference.
The type of diagram should be selected according to the nature
of the data.
When many items are shown in the diagram through different
patterns such as dots, crossing, etc., an index must be given.
The diagram must be simple enough to be understood by the layman.
There are two types of graphical representation. They are,
1. Graphs,
a. Frequency curves,
b. Frequency polygon, and
c. Ogives.
i. Less than ogives, and
ii. More than Ogives.
2. Charts/ Diagrams.
a. Bar chart,
i. Simple bar chart,
ii. Multiple bar chart,
iii. Stacked bar chart, and

iv. Percentage bar chart.


b. Pie- chart, and
c. Histogram.
One-dimensional diagram: A diagram that is drawn for a single set of
data is called a one-dimensional diagram. The bar and pie diagrams
belong to this type.
Bar chart: The visual representation of (qualitative, categorical or
discrete numerical) data by bars is called a bar chart. The heights of the bars are
proportional to the frequencies. The bars may be horizontal or vertical. The
distances between the bars are kept uniform. Bar charts are drawn only for
discrete quantitative or categorical variables.
The types of bar diagrams are
Simple bar chart,
Multiple bar chart,
Stacked bar chart, and
Percentage bar chart.
Simple bar chart: A bar diagram that is drawn for a single set of
categorical or numerical data is called a simple bar diagram.
Multiple bar chart: A bar diagram that is drawn for a single variable with
more than one phenomenon is called a multiple bar diagram. This
facilitates comparison. The categories of a single variable are drawn
side by side. The differentiation is shown by different colors or patterns
such as lines, dots, etc.
Stacked bar chart: A type of bar diagram that is drawn for a single variable
with any number of (categorical or numerical) categories is called a
stacked bar diagram. In this diagram the categories of the categorical variable
are placed on the bar by dividing the bar into portions.
Percentage bar chart: A kind of stacked bar
chart drawn for the percentage frequencies of categorical variables, with
bars of equal height, is called a percentage bar diagram. The division of the bars
into categories is made according to the percentages. In this case all bars are of
equal height, 100%, whereas in the stacked bar diagram the heights of the bars
are unequal; that is, the bars are proportional to the frequencies of the base
variable's categories.
Pie diagram: The graphical representation of a single variable's categories
in the form of a circle is called a pie diagram. In this graph the circle is divided into
slices based on the frequencies. This type of diagram can be
understood at a glance. Each slice is obtained by taking
the whole data as equal to 360 degrees.
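Since the whole data corresponds to 360 degrees, the angle of each slice is (frequency / total frequency) x 360. A minimal sketch follows; the blood group frequencies and the function name pie_angles are hypothetical.

```python
def pie_angles(frequencies):
    """Return the angle in degrees of each slice of a pie diagram."""
    total = sum(frequencies.values())
    return {
        category: 360 * freq / total
        for category, freq in frequencies.items()
    }

# Hypothetical frequencies of blood groups among 100 patients.
blood_groups = {"A": 30, "B": 20, "AB": 10, "O": 40}
angles = pie_angles(blood_groups)
print(angles)
```

The angles of all the slices add up to 360 degrees.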
Relative Frequency Histogram: A histogram constructed with the
help of relative frequencies rather than absolute frequencies is known as a
relative frequency histogram.
Histogram: A bar diagram where the bars are constructed continuously
(without leaving space between bars) on the class intervals in such a way
that the heights of the bars are proportional to the frequencies of the respective
classes is known as a histogram.
Frequency polygon: The graph formed by plotting the frequencies
against the mid points of the classes of a continuous frequency distribution and joining the
points by straight lines is known as a frequency polygon.
This can also be obtained from the histogram by joining the top mid
points of the bars with straight lines.
Frequency Curve:
The graph that is formed by plotting the frequencies against the mid
points of the classes of a continuous frequency distribution and joining the points by a
freehand curve is known as a frequency curve.
This can also be obtained from the histogram by joining the top mid
points of the bars with a freehand curve.

Ogives:
The graph obtained by plotting the cumulative frequencies against the class limits of a continuous frequency distribution is known as an ogive.
The two types of ogives are,
1. Less than ogive.
2. More than ogive.
Less than Ogive:
The graph obtained by plotting the less than cumulative frequencies against the upper class limits of a continuous frequency distribution and joining the points by a smooth curve is known as the less than ogive.
More than Ogive:
The graph obtained by plotting the more than cumulative frequencies against the lower class limits of a continuous frequency distribution and joining the points by a smooth curve is known as the more than ogive.
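The points to be plotted for the two ogives can be computed as in the following minimal sketch. The function name ogive_points and the class intervals and frequencies are hypothetical, chosen only to illustrate the definitions above.

```python
from itertools import accumulate


def ogive_points(class_limits, frequencies):
    """Return the points of the less-than and more-than ogives.

    class_limits is a list of (lower, upper) pairs of a continuous
    frequency distribution; frequencies holds the class frequencies.
    """
    less_than_cf = list(accumulate(frequencies))
    total = less_than_cf[-1]
    # More-than cumulative frequency of a class = total minus the
    # frequencies of all classes that come before it.
    more_than_cf = [total - before for before in [0] + less_than_cf[:-1]]
    less = [(upper, cf) for (_, upper), cf in zip(class_limits, less_than_cf)]
    more = [(lower, cf) for (lower, _), cf in zip(class_limits, more_than_cf)]
    return less, more


# Hypothetical distribution, for illustration only.
limits = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]
less, more = ogive_points(limits, freqs)
print(less)
print(more)
```

The less than points are plotted against the upper class limits and the more than points against the lower class limits, as in the definitions.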

DATA ANALYSIS
The process of obtaining representative measures from a large mass of raw data is called data analysis. Statistical methods are used to carry out the analysis; hence it is called statistical data analysis.
The three types of data analysis are
Univariate data analysis.
Bivariate data analysis.
Multivariate data analysis.

Univariate data analysis:
Analyzing or drawing a representative measure for a one-dimensional data set (it may be raw, grouped or ungrouped) is called univariate data analysis. That is, the characteristics of a single data set are studied. The four types of univariate data analysis tools are,
1. Measures of Central Tendency,
2. Measures of Dispersion,
3. Skewness, and
4. Kurtosis.

Bivariate data analysis:
Analyzing or obtaining the representative measure for two sets of variables by considering both the variables simultaneously is called bivariate data analysis. The variables may be quantitative or qualitative.
The two types of bivariate measures are,
Associative measure and
Functional measure
Associative measure: The measure that is used to measure the inter-relationship between the two variables is called an associative measure.
The two types of associative measures are,
Correlation and
Chi-square association
Chi-square association: The bivariate method that is used to measure the relationship between two qualitative variables is called the chi-square association method. This method tests whether the two qualitative variables are dependent or independent.
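The notes do not give the computation, so the following is only a sketch under the assumption that the usual Pearson chi-square statistic, the sum of (observed - expected)^2 / expected over all cells, is intended. The function name and the 2 x 2 table are hypothetical.

```python
def chi_square_statistic(table):
    """Return (chi-square statistic, degrees of freedom) for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical 2 x 2 table (rows: two groups, columns: yes / no).
observed_table = [[20, 30], [30, 20]]
chi2, dof = chi_square_statistic(observed_table)
print(chi2, dof)
```

The computed value is then compared with the tabulated chi-square value for the same degrees of freedom to decide whether the variables are dependent or independent.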
Functional measure: The process of finding relationship between the two
sets of variables in the form of equation is called functional measure. In
this case, variables can be classified as dependent and independent.
The statistical method that finds the functional relation of two sets of
variables is known as regression analysis.

Multivariate analysis:
The simultaneous study of several related and equally important random variables is called multivariate data analysis. That is, multivariate tools are used to deal with a larger number of variables under study.
Multivariate analysis is classified into
Dependence analysis and
Interdependence analysis
Dependence analysis:
The method of studying the association between two sets viz.
dependent and independent variables is called dependence analysis. That
is, the relationship between the dependent set and independent set is
analyzed by this dependence analysis.
The five dependence analysis methods are,
Multiple regression,
Discriminant analysis,
Logit analysis,
Multivariate analysis of variance and
Canonical correlation.
Interdependence methods:
The method of analyzing the mutual association across all the variables is called interdependence analysis. In this study no distinction is made between dependent and independent variables.
The five interdependence methods are,
Principal component analysis,
Factor analysis,
Cluster analysis,
Log linear models and
Multidimensional scaling
Factor analysis: A data reduction technique that studies the inter-relationship among a set of variables by introducing a new set of variables that are fewer in number than the original set is called factor analysis.
Profile analysis: The graphical method of comparing a number of ordinal variables across different groups is called profile analysis. That is, the nature of the common opinion about the ordinal variables is studied.
Friedman test: A non-parametric statistical method that is applied to a ranking data set to find the common agreement of ranking between the respondents about the various factors is called the Friedman test.
Kendall's W test: This procedure is similar to the Friedman test. The merit of this method is that it provides Kendall's concordance value, which represents the amount of common agreement between the respondents.
Logistic regression: This method is used to examine the relationship among a set of variables. That is, the statistical method that is used to study a dichotomous response variable, which is explained by a number of explanatory variables, is called logistic regression. (The explanatory variables may be ordinal, interval or ranking data.)
The assumptions for logistic regression are,
The response variable is binary.
The model for the response and explanatory variables is linear on the log-odds (logit) scale.

DESCRIPTIVE STATISTICS
Measures of Central Tendency:
A measure of central tendency is a single representative measure that describes the characteristics of the entire mass of data.
There are three types of measures of central tendency. They are
Mean,
Arithmetic Mean,
Weighted Mean,
Geometric Mean,
Harmonic Mean.
Median,
Mode,
The characteristics of a good average are:
It should be precisely (rigidly) defined.
It should be
Easy to understand.
Easy (simple) to compute.
Based on all observations.
Capable of further analysis.
Its definition should be in the form of a mathematical formula.
It should not be influenced by extreme values.
It should have sampling stability (least affected by sampling fluctuations).

Merits of averages:
It facilitates quick understanding of complex data:
The purpose of an average is to represent a group of values in a simple and concise manner. That is, an average condenses the mass of data into a single figure.
It facilitates comparison.
It helps to know about the universe from a sample.
It helps in decision-making.
It establishes mathematical relationships.
Mean: A single representative figure of a mass of data, obtained by adding together all the values and dividing the sum by the total number of observations, is called the mean. That is, if the series $x_1, x_2, x_3, \ldots, x_n$ has n observations, then the mean value of this series is,
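(i) For ungrouped data:
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$

(ii) For discrete frequency distribution:
$\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{N}$,
where $N = \sum_{i=1}^{n} f_i$ is the total frequency.

(iii) For continuous frequency distribution:
$\bar{X} = \frac{\sum_{i=1}^{n} f_i x_i}{N}$,
where $N = \sum_{i=1}^{n} f_i$ is the total frequency and
the $x_i$'s are the mid points of the class intervals.

(iv) Deviation formula:
$\bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N}$,
where $N = \sum_{i=1}^{n} f_i$ is the total frequency,
the $x_i$'s are the mid points of the class intervals,
A is the assumed mean from within the series, and
the $d_i$'s are the deviation values, i.e., $d_i = x_i - A$.

(v) Step deviation formula:
$\bar{X} = A + \frac{\sum_{i=1}^{n} f_i d_i}{N} \times h$,
where $N = \sum_{i=1}^{n} f_i$ is the total frequency,
the $d_i$'s are the step deviation values, i.e., $d_i = (x_i - A)/h$,
the $x_i$'s are the mid points of the class intervals,
A is the assumed mean from within the series, and
h is the width of the class interval.

(vi) For combined series:
If $\bar{X}_1$ is the mean of the first sample of size $n_1$, $\bar{X}_2$ is the mean of the second sample of size $n_2$, ..., and $\bar{X}_k$ is the mean of the k-th sample of size $n_k$, then the mean of the combined series (k samples) is
$\bar{X} = \frac{\sum_{i=1}^{k} n_i \bar{X}_i}{\sum_{i=1}^{k} n_i}$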

The mean is the most widely used measure of central tendency.
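As a rough check on the grouped-data formulas, the direct formula and the step deviation formula can be compared on the same distribution. The function names, class mid points and frequencies here are hypothetical and serve only as an illustration.

```python
def grouped_mean(midpoints, freqs):
    """Direct formula: sum(f_i * x_i) / N, with N the total frequency."""
    total = sum(freqs)
    return sum(f * x for f, x in zip(freqs, midpoints)) / total


def grouped_mean_step_deviation(midpoints, freqs, assumed_mean, width):
    """Step deviation formula: A + (sum(f_i * d_i) / N) * h,
    with d_i = (x_i - A) / h."""
    total = sum(freqs)
    deviations = [(x - assumed_mean) / width for x in midpoints]
    return assumed_mean + sum(f * d for f, d in zip(freqs, deviations)) / total * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 (mid points below).
midpoints = [5, 15, 25, 35]
freqs = [5, 8, 12, 5]

direct = grouped_mean(midpoints, freqs)
step = grouped_mean_step_deviation(midpoints, freqs, assumed_mean=25, width=10)
print(direct, step)
```

Both formulas give the same mean; the step deviation form only makes hand calculation lighter.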

Properties:
1. The sum of the deviations taken from the arithmetic mean is zero, i.e.,
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$
2. The sum of the squares of the deviations is a minimum when the deviations are taken from the mean, i.e.,
$\sum_{i=1}^{n} (x_i - \bar{x})^2 \le \sum_{i=1}^{n} (x_i - A)^2$, where A is any value and $\bar{x}$ is the mean of the observations.
Merits:
It is easy to understand and calculate.


It is used in further calculations.
It is based on all the items.
It provides a good basis for comparison.
It is a more stable measure.
It is considered a good or ideal average.
Demerits:
Mean is unduly affected by extreme values.
It is unrealistic.
It may lead to wrong conclusion.
It is not useful for studying qualitative characters.
It is not a suitable measure in the case of a highly skewed distribution.
It gives greater importance to the bigger values and smaller importance to the smaller values in the series.
It cannot be calculated for a frequency distribution with an open-end class.
Median: A measure of location, calculated from the set of values, that divides the series into two equal parts is called the median. That is, one part of the data set contains the items less than the median and the other part contains the items greater than the median, and the number of observations on both sides is equal.
1). For ungrouped data:
a. Arrange the observations in either ascending or descending order of magnitude.
b. Find the number of observations in the data set (i.e., n).
c. If n is odd, then the median of the data set is
$\text{Median} = \left(\frac{n+1}{2}\right)^{\text{th}}$ observation.
d. If n is even, then the median of the data set is
$\text{Median} = \frac{\left(\frac{n}{2}\right)^{\text{th}} \text{ observation} + \left(\frac{n}{2}+1\right)^{\text{th}} \text{ observation}}{2}$

2). For grouped data: (Discrete frequency distribution)
1. Form the cumulative frequencies.
2. Find $\frac{1}{2}\sum_{i=1}^{n} f_i$, where $\sum_{i=1}^{n} f_i$ is the sum of the frequencies.
3. Find the cumulative frequency just greater than $\frac{1}{2}\sum_{i=1}^{n} f_i$.
4. The observation (x value) that corresponds to that cumulative frequency is the median of the set of observations.
3). For grouped data: (Continuous frequency distribution)
1. Form the cumulative frequencies.
2. Find $\frac{1}{2}\sum_{i=1}^{n} f_i$, where $\sum_{i=1}^{n} f_i$ is the sum of the frequencies.
3. Find the cumulative frequency just greater than $\frac{1}{2}\sum_{i=1}^{n} f_i$.
4. Find its corresponding class; it is the median class.
5. Find the median by using the formula,
$\text{Median} = L + \frac{\frac{1}{2}\sum_{i=1}^{n} f_i - m}{f} \times c$
where L is the lower limit of the median class,
m is the cumulative frequency of the class preceding the median class,
f is the frequency of the median class,
c is the width of the class interval.
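The five steps for a continuous frequency distribution can be sketched as follows. The function name and the example distribution are assumptions made for illustration; m is taken as the cumulative frequency of the classes before the median class.

```python
def grouped_median(class_limits, freqs):
    """Median of a continuous frequency distribution.

    class_limits: list of (lower, upper) pairs in ascending order.
    freqs: class frequencies.
    """
    half = sum(freqs) / 2          # step 2: half of the total frequency
    cumulative = 0                 # cumulative frequency before the current class
    for (lower, upper), f in zip(class_limits, freqs):
        if cumulative + f > half:  # steps 3-4: first class whose c.f. exceeds N/2
            width = upper - lower
            return lower + (half - cumulative) / f * width   # step 5
        cumulative += f
    raise ValueError("frequencies must contain at least one positive value")


# Hypothetical distribution, for illustration only.
limits = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]
median = grouped_median(limits, freqs)
print(median)
```

Here the total frequency is 30, so the median class is 20-30 (cumulative frequency 25), with m = 13 and f = 12.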

Merits:
It is easy to understand and compute.
It is quite rigidly defined.
It eliminates the effect of extreme items.
It is amenable to further process.
Median can be calculated for even qualitative phenomenon.
Its value generally lies in the distribution.
It can be calculated for frequency distribution with open-end
class interval.
This can be located graphically.
Demerits:
If the series is of an irregular nature, the median cannot be computed.
It ignores the extreme values.
In the continuous case, and when the number of observations is even, the median is estimated rather than calculated exactly.
It is not based on all observations.
It is not amenable to algebraic treatment.
It is affected by the fluctuations of sampling.

It cannot be calculated directly for a continuous frequency distribution with inclusive type class intervals. To calculate the median, the class intervals have to be converted into exclusive type class intervals by adjusting both the limits (upper and lower).
Mode: The single value that appears more frequently than the other observations in the data set is called the mode.
1). For ungrouped data:
i). Count the frequency of each observation.
ii). The observation that has occurred the greatest number of times is the mode of that data set.
2). For grouped data: (Discrete frequency distribution)
i). From the frequency distribution identify the highest frequency.
ii). The observation corresponding to the highest frequency is the mode of the distribution.
3). For grouped data: (Continuous frequency distribution)
i). From the frequency distribution identify the highest frequency.
ii). The class interval corresponding to the highest frequency is the modal class.
iii). Find the mode by using the formula,
$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times c$
where L is the lower limit of the modal class,
$f_0$ is the frequency of the class preceding the modal class,
$f_1$ is the frequency of the modal class,
$f_2$ is the frequency of the class succeeding the modal class,
c is the width of the class interval.
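A minimal sketch of the mode formula for a continuous frequency distribution is given below. The function name and data are hypothetical; treating a missing neighbouring class as having frequency zero is an assumption of this sketch, not something stated in the notes.

```python
def grouped_mode(class_limits, freqs):
    """Mode of a continuous frequency distribution:
    L + (f1 - f0) / (2*f1 - f0 - f2) * c."""
    # Index of the modal class (first class with the highest frequency).
    k = max(range(len(freqs)), key=freqs.__getitem__)
    lower, upper = class_limits[k]
    f1 = freqs[k]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f0 = freqs[k - 1] if k > 0 else 0
    f2 = freqs[k + 1] if k < len(freqs) - 1 else 0
    return lower + (f1 - f0) / (2 * f1 - f0 - f2) * (upper - lower)


# Hypothetical distribution, for illustration only.
limits = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 8, 12, 5]
mode = grouped_mode(limits, freqs)
print(mode)
```

The formula breaks down when the modal class and both neighbours have the same frequency, because the denominator becomes zero.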

Merits:
It is easy to understand and calculate.
It is not affected by extreme values. It is simple and precise.
It can be located by mere inspection.
It can be determined by the graphic method.
Its value can be determined even for a distribution with open-end class intervals.

Demerits:
It is ill-defined (if two observations occur an equal number of times, as in a bi-modal distribution, we cannot determine a single mode).
It is not amenable to further mathematical treatment.
It is not based on all observations.
It is difficult to compute when there are both positive and negative data in the series.
It is stable only when the sample size is large.
If there are both positive and negative values, or if one or more observations are zero, we cannot find the mode of the distribution.

Comparison of Measures of Central Tendency Tools:

Characteristics             Mean             Median                 Mode
Precise definition          Given            Given                  Not given
Procedure understanding     Easy             Easy                   Easy
Calculation                 Easy             Easy                   Easy
Observations utilization    All observations Not all observations   Not all observations
Further treatment           Amenable         Not amenable           Not amenable
Sampling fluctuations       Least affected   Much affected          Much affected
Effect of extreme values    Much affected    Not affected           Not affected

From the comparison table of the measures of central tendency it is noted that, among the tools, the mean holds many of the characteristics of an ideal average. Hence, the mean is considered a good or ideal average.
Measures of dispersion:
The statistical tool that measures the variation or scatteredness of the values from their representative (central) value is called dispersion.
Properties of a good measure of variation are,
It should be easy to calculate and understand.
It should be rigorously defined.
It should be based on all observations and amenable to further treatment.
It must have sampling stability.
It should not be affected by extreme values.
The types of measures of dispersion are,
Range,
Variance and Standard Deviation,
Mean deviation.
Range: The simplest measure of dispersion, calculated by subtracting the minimum value from the maximum value of the data set, is called the range.
i.e., Range = maximum value - minimum value.

Standard deviation: The most widely used measure of dispersion, defined as the positive square root of the arithmetic mean of the squared deviations from the arithmetic mean, is called the standard deviation. Standard deviation is denoted by $\sigma$.
The deviations are squared so that the negative and positive deviations do not cancel each other.
The formula for calculating the population standard deviation is,
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N}}$
where N = population size.
If we have a sample, then the sample standard deviation (s) is,
$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
where n = sample size.
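The difference between the two divisors (N and n - 1) can be seen in a short Python sketch. The data values are hypothetical; pstdev and stdev are the population and sample standard deviation functions of Python's standard statistics module, used here only as a cross-check.

```python
import math
import statistics


def population_sd(values):
    """Square root of the mean squared deviation (divisor N)."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Same sum of squared deviations, but with divisor n - 1."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))


# Hypothetical observations, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(data), statistics.pstdev(data))
print(sample_sd(data), statistics.stdev(data))
```

For the same data the sample standard deviation is always slightly larger than the population standard deviation, because the divisor is smaller.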


Merits:
It is rigorously defined.
Its value is always definite.
It is based on all observations of the data.
It is amenable to further analysis.
It is less affected by sampling fluctuations.
It serves as the basis for measuring the coefficient of correlation, sampling and statistical inference.
This is the most appropriate measure of the variability of a distribution.
As the best measure of dispersion, it possesses most of the characteristics of an ideal measure of dispersion.
Demerits:
It is not easy to understand and calculate.
It gives more weight to extreme values by squaring them.
It cannot be used for comparison
Co-efficient of variation or relative measure: This is a measure of relative variation rather than absolute variation. In order to decide which of two distributions is more variable, we compare their coefficients of variation. The distribution with the greater C.V. is said to be more variable. The coefficient of variation expresses the standard deviation as a percentage of the mean. The formula is given by
$\text{C.V.} = \frac{\sigma}{\mu} \times 100$
(where $\sigma$ is the population standard deviation and $\mu$ is the population mean) (or)
$\text{C.V.} = \frac{s}{\bar{X}} \times 100$
(where s is the sample standard deviation and $\bar{X}$ is the sample mean)
To compare the variability of data sets, find the coefficient of variation of each. The data set with the greater coefficient of variation has more variability (or is less precise / less consistent / less homogeneous).

Uses of coefficient of variation (C.V):


(i). The standard deviation is useful as a measure of variation within a
given set of data. When one desires to compare the dispersion in two

sets of data, however, comparing the two standard deviations may lead
to fallacious results.
(ii). It is used when the two variables being compared are measured in different units.
Example
We may wish to know, for a certain population, whether serum
cholesterol levels, measured in milligrams per 100ml, are more variable
than body weight, measured in pounds.
(iii). Although the same unit of measurement is used, the two means may be quite different.
Example
If we compare the standard deviation of weights of first grade
children with the standard deviation of weights of high school freshmen,
we may find that the latter standard deviation is numerically larger than
the former, because the weights themselves are larger, not because the
dispersion is greater.
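The weight example can be mimicked with made-up numbers. The two weight lists below are hypothetical and are chosen only so that the larger standard deviation goes with the smaller coefficient of variation; the sample formula C.V. = (s / mean) x 100 is used.

```python
import statistics


def coefficient_of_variation(values):
    """Sample coefficient of variation, as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


# Hypothetical weights in kilograms, for illustration only.
first_grade = [18, 20, 22, 24, 26]
freshmen = [52, 56, 60, 64, 68]

sd_children = statistics.stdev(first_grade)
sd_freshmen = statistics.stdev(freshmen)
cv_children = coefficient_of_variation(first_grade)
cv_freshmen = coefficient_of_variation(freshmen)
print(sd_children, cv_children)
print(sd_freshmen, cv_freshmen)
```

In this made-up data the freshmen have the larger standard deviation but the smaller coefficient of variation, which is the situation the example describes.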
PROBABILITY DISTRIBUTIONS
The relationship between the values of a random variable and the
probabilities of their occurrence may be summarized by means of a
device called a probability distribution. A probability distribution may be
expressed in the form of a table, a graph, or a formula. Knowledge of the
probability distribution of a random variable provides the clinician
researcher with a powerful tool for summarizing and describing a set of
data and for reaching conclusions about a population of data on the basis
of a sample of data drawn from the population.
There are two types of probability distribution
(1). Discrete

(2) Continuous

Probability distribution of a discrete random variable


The probability distribution of a discrete random variable is a table, graph, or other device used to specify all possible values of the random variable along with their respective probabilities.
The following are two essential properties of a probability distribution of a discrete variable
(1) $0 \le P(X = x) \le 1$
(2) $\sum P(X = x) = 1$
The following are examples of discrete probability distributions
1. Binomial
2. Poisson

THE BINOMIAL DISTRIBUTION


The binomial distribution is one of the most widely encountered
probability distributions in applied statistics. The distribution is derived
from a process known as a Bernoulli trial, named in honor of the Swiss
mathematician James Bernoulli (1654-1705), who made significant
contributions in the field of probability, including, in particular, the binomial
distribution. When a random process or experiment, called a trial, can
result in only one of two mutually exclusive outcomes, such as dead or
alive, sick or well, male or female, the trial is called a Bernoulli trial.
The Bernoulli process: A sequence of Bernoulli trials forms a Bernoulli process under the following conditions.
1. Each trial results in one of two mutually exclusive outcomes. One of the possible outcomes is denoted (arbitrarily) as a success, and the other is denoted a failure.
2. The probability of a success, denoted by p, remains constant from trial to trial. The probability of a failure, 1-p, is denoted by q.
3. The trials are independent; that is, the outcome of any particular trial is not affected by the outcome of any other trial.

Example 1:
We are interested in being able to compute the probability of x successes in n Bernoulli trials. For example, suppose that in a certain population 52% of all recorded births are males. We interpret this to mean that the probability of a recorded male birth is 0.52. If we randomly select five birth records from this population, what is the probability that exactly three of the records will be for male births?
Solution: Suppose the five birth records selected result in this sequence of sexes
MFMMF
In coded form we would write this as
10110
The probability of a success is denoted by p, so p = 0.52,
and the probability of a failure is denoted by q, so q = 1-p = 1-0.52 = 0.48.
The probability of the above sequence of outcomes is found by means of the multiplication rule to be
$P(1, 0, 1, 1, 0) = pqppq = q^2p^3$
Three successes and two failures could occur in any of the following additional sequences as well

Number   Sequence   Probability
1        10110      $pqppq = q^2p^3$
2        11100      $pppqq = q^2p^3$
3        10011      $pqqpp = q^2p^3$
4        11010      $ppqpq = q^2p^3$
5        11001      $ppqqp = q^2p^3$
6        10101      $pqpqp = q^2p^3$
7        01110      $qpppq = q^2p^3$
8        00111      $qqppp = q^2p^3$
9        01011      $qpqpp = q^2p^3$
10       01101      $qppqp = q^2p^3$

We may now answer our original question: what is the probability, in a random sample of size 5, drawn from the specified population, of observing three successes (record of a male birth) and two failures (record of a female birth)?
The answer to the question is
$10(0.48)^2(0.52)^3 = 10(0.2304)(0.140608) = 0.32$
General formula:
$f(x) = \binom{n}{x} p^x q^{n-x}$ for $x = 0, 1, 2, \ldots, n$, and $f(x) = 0$ elsewhere, where $0 \le p \le 1$, $0 \le q \le 1$ and $n > 0$.
This expression is called the binomial distribution,
where f(x) = P(X=x)
n = number of trials
x = the number of successes (the value of the random variable)
p = probability of a success
q = probability of a failure = 1-p
This distribution satisfies the properties of a discrete probability distribution:
1. $f(x) \ge 0$ for all real values of x. This follows from the fact that n and p are both nonnegative and, hence, $\binom{n}{x}$, $p^x$ and $q^{n-x} = (1-p)^{n-x}$ are all nonnegative and, therefore, their product is greater than or equal to zero.
2. $\sum f(x) = 1$. This is seen to be true if we recognize that $\sum \binom{n}{x} p^x q^{n-x}$ is the binomial expansion of $[(1-p) + p]^n = 1^n$, which is equal to 1.

Example 2:
Suppose that it is known that 30% of a certain population are immune to some disease. If a random sample of size 10 is selected from this population, what is the probability that it will contain exactly four immune persons?
Solution:
We take the probability of an immune person to be 0.3, i.e., p = 0.3 and q = 1-p = 1-0.3 = 0.7.
$f(4) = \binom{10}{4} (0.3)^4 (0.7)^{10-4} = \frac{10!}{4!\,6!} (0.0081)(0.117649) = 0.2001$
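Both worked examples can be checked with a short Python sketch of the general formula. The function name binomial_pmf is an assumption of this sketch; math.comb from the Python standard library supplies the number of combinations.

```python
from math import comb


def binomial_pmf(x, n, p):
    """f(x) = C(n, x) * p**x * q**(n - x) for x = 0..n, and 0 elsewhere."""
    if not 0 <= x <= n:
        return 0.0
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)


# Example 1: three male births among five records, p = 0.52.
example1 = binomial_pmf(3, 5, 0.52)
# Example 2: four immune persons in a sample of ten, p = 0.3.
example2 = binomial_pmf(4, 10, 0.3)
# Property 2: the probabilities over x = 0..n add up to 1.
total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
print(round(example1, 2), round(example2, 4), total)
```

The factor comb(5, 3) = 10 is the count of the ten sequences listed in the table of Example 1.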

The Binomial Parameters


The binomial distribution has two parameters, n and p. They are parameters in the sense that they are sufficient to specify a binomial distribution. The binomial distribution is really a family of distributions, with each possible value of n and p designating a different member of the family. The mean and variance of the binomial distribution are $\mu = np$ and $\sigma^2 = np(1-p)$, respectively.
Strictly speaking, the binomial distribution is applicable in situations where sampling is from an infinite population or from a finite population with replacement. Since in actual practice samples are usually drawn without replacement from finite populations, the question arises as to the appropriateness of the binomial distribution under these circumstances. Whether or not the binomial is appropriate depends on how drastic the effect of these conditions is on the constancy of p from trial to trial. It is generally agreed that when n is small relative to N, the binomial model is appropriate.
POISSON DISTRIBUTION
The next discrete distribution that we consider is the Poisson distribution, named for the French mathematician Simeon Denis Poisson (1781-1840), who is generally credited for publishing its derivation in 1837. This distribution has been used extensively as a probability model in biology and medicine.
If x is the number of occurrences of some random event in an interval of time or space (or some volume of matter), the probability that x will occur is given by
$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, $x = 0, 1, 2, \ldots$ and $\lambda > 0$
The Greek letter $\lambda$ (lambda) is called the parameter of the distribution and is the average number of occurrences of the random event in the interval (or volume).
The symbol e is the constant (to four decimals) 2.7183.
It can be shown that
1. $f(x) \ge 0$ for every x
2. $\sum_x f(x) = 1$
so that the distribution satisfies the requirements for a discrete probability distribution.
The Poisson Process
We have seen that the binomial distribution results from a set of assumptions about an underlying process yielding a set of numerical observations. Such also is the case with the Poisson distribution. The following statements describe what is known as the Poisson process.
1. The occurrences of the events are independent. The occurrence of an event in an interval of space or time has no effect on the probability of a second occurrence of the event in the same, or any other, interval.
2. Theoretically, an infinite number of occurrences of the event must be possible in the interval.
3. The probability of the single occurrence of the event in a given interval is proportional to the length of the interval.
4. In any infinitesimally small portion of the interval, the probability of more than one occurrence of the event is negligible.


An interesting feature of the Poisson distribution is the fact that the mean
and variance are equal.
When to Use the Poisson Model
The Poisson distribution is employed as a model when counts are made
of events or entities that are distributed at random in space or time. One
may suspect that a certain process obeys the Poisson law, and under this
assumption probabilities of the occurrence of events or entities within
some unit of space or time may be calculated.
Examples:
Number of failed surgeries for experienced doctors.
Number of rainy days in summer from 1947 to last year.
Number of unexpected holidays declared by a hospital.
Number of major accidents on a street.
Example for calculation:
In a study of suicides, the monthly distribution of adolescent suicides in India between 1977 and 1987 closely followed a Poisson distribution with parameter $\lambda = 2.75$. Find the probability that a randomly selected month will be one in which three adolescent suicides occurred.
Solution: The Poisson distribution is given by
$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}$, $x = 0, 1, 2, \ldots$ and $\lambda > 0$
Given $\lambda = 2.75$, then
$f(3) = P(X = 3) = \frac{e^{-2.75} (2.75)^3}{3!} = \frac{(0.063928)(20.796875)}{6} = 0.221583$
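The calculation can be verified with a small Python sketch of the Poisson formula. The function name poisson_pmf is an assumption of this sketch, and the sums are truncated at x = 59, which is an approximation that is more than adequate for a parameter of 2.75.

```python
from math import exp, factorial


def poisson_pmf(x, lam):
    """f(x) = e**(-lam) * lam**x / x!  for x = 0, 1, 2, ..."""
    return exp(-lam) * lam ** x / factorial(x)


lam = 2.75
p_three = poisson_pmf(3, lam)

# Numerical check that the mean and the variance both equal lambda
# (sums truncated at x = 59; the remaining terms are negligible).
xs = range(60)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)
print(round(p_three, 6), mean, variance)
```

The last two values illustrate the feature noted earlier: for a Poisson distribution the mean and the variance are equal.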

CORRELATION
Correlation: The statistical method that discovers the amount and the direction of the relationship between two sets of quantitative variables is called correlation. The correlation provides the nature and extent of the relationship. For example, if the correlation between A and B is -0.48, then the negative sign expresses that the relationship is negative and the value 0.48 expresses the amount of relation between the variables A and B.
The value of the correlation always lies in the interval from -1 to +1 (i.e., $-1 \le r \le +1$).
Assumptions:
a. The regressions of x on y and of y on x are linear.
b. The numbers of x values and y values are the same (i.e., the numbers of samples are equal).
c. The pairs x and y are measured on an interval/ratio scale or are continuous variables.
The nature of correlation:
Positive correlation: The correlation is said to be positive if its value lies in the interval between 0 and +1. Two variables are positively correlated if, for an increase in the value of one variable, there is also an increase in the value of the other variable, or for a decrease in the value of one variable there is also a decrease in the value of the other variable. That is, the two variables change in the same direction.
Examples:
Age and weight of the patient
Weight and blood pressure
Negative correlation: The correlation is said to be negative if its value lies in the interval between -1 and 0. Two variables are negatively correlated if, for an increase in the value of one variable, there is a decrease in the value of the other variable; that is, the two variables change in opposite directions.
Examples:
Number of patients visiting and number of clinics
No correlation: There is said to be no correlation if the value is 0. The two variables are said to be uncorrelated if the change in the value of one variable has no connection with the change in the value of the other variable.
Example: We should expect zero correlation between
Age and tooth color of a person
Height and number of teeth of a person


Simple and multiple correlation
The correlation between two variables is called simple correlation. The
correlation in the case of more than two variables is called multiple
correlation.
Scatter Diagram
Let us consider a set of paired values of the variables x and y. Along the horizontal axis we represent the values of x and along the vertical axis the values of y. Plot the values (x, y) on graph paper. We get a collection of dots. The figure so obtained is called a scatter diagram. From the scatter diagram we can obtain a rough idea of the correlation between the two variables x and y.
If all these dots cluster around a line, the correlation is called linear correlation. If the dots cluster around a curve, the correlation is called non-linear or curvilinear correlation. We can also get an idea of whether the correlation is positive or negative from the scatter diagram.
They are illustrated in the following diagrams
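Formula:
The formula is given by
Formula (1): $r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$
Formula (2): $r = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{\left[n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2\right]\left[n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2\right]}}$
Formula (3): $r = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y} = \frac{\text{Covariance of x and y}}{(\text{SD of x})(\text{SD of y})}$
In words, that is
r = (sum of the products of the deviations of the x and y pairs from their respective means) / square root of [(sum of squares of the deviations of x from the mean of x) x (sum of squares of the deviations of y from the mean of y)]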
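As an illustration, the deviation form and the raw-sums (computational) form of the correlation coefficient can be written in Python and compared. The function names and the paired data are hypothetical; both forms are the standard expressions for Pearson's r.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Deviation form: sum of products of deviations divided by the
    square root of the product of the sums of squared deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pearson_r_raw(xs, ys):
    """Computational form using raw sums only."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r1 = pearson_r(xs, ys)
r2 = pearson_r_raw(xs, ys)
print(r1, r2)
```

The two forms are algebraically identical, so they return the same value up to rounding; the raw-sums form is usually easier for hand calculation.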

Numerical values of the correlation coefficient


The coefficient of correlation r lies between -1 and +1, inclusive of those values.
1) When r is positive, the variables x and y increase or decrease together.
2) When r = +1, there is a perfect positive correlation between variables x and y.
3) When r is negative, the variables x and y move in opposite directions (i.e., as one increases the other decreases).
4) When r = -1, there is a perfect negative correlation between variables x and y.
5) When r = 0, the two variables are uncorrelated.

Uses of $r^2$:
Statisticians customarily note that r is of even more use in its squared form. $r^2$ is also called the coefficient of determination, which measures the proportion of the total variance in y which is associated with, or can be explained by, the variance in x.
The proportion $r^2$ is also converted to a percentage by multiplying by 100.
For example,
Correlation (r) of x and y is 0.9982
Then
$r^2 = 0.99640324$
$r^2 \times 100 = 99.640324\%$
Thus 99.64 percent of the variability in y can be accounted for by the variance of x. The remaining variance, i.e., 100 - 99.64 = 0.36 percent, represents the variance due to otherwise unexplained deviations around the regression line.
$r^2 = 99.64\%$
$S^2_{x.y} = 0.36\%$

When doing statistical inference with the correlation coefficient, we should take care at two stages.
Stage 1: r may be significant yet very weak, so that it is of no practical consequence. It is called important difference.
Stage 2: r may be strong (close to +1), but with a small sample size it may prove to be not significant. It is called significant difference.

Rank correlation or Spearman's rank correlation
Karl Pearson's formula for calculating r is developed on the assumption that the values of the variables are exactly measurable. In some situations it may not be possible to give precise values for the variables. In such cases we can use another measure of correlation called the rank correlation coefficient. We rank the observations in ascending or descending order using the numbers 1, 2, 3, ..., n and measure the degree of relationship between the ranks instead of the actual numerical values. The rank correlation coefficient when there are n ranks in each variable is given by the formula
$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$
where $d_i = x_i - y_i$ is the difference between the ranks of corresponding pairs of x and y,
and n = number of observations.
Note: Tied ranks: when the values of variables x and y are given, we can
rank the values in each of the variables and determine Spearman's
rank correlation coefficient. If two or more observations have the same
rank, we assign them the mean rank. In this case there is a correction
factor in the formula for $\rho$. The formula for $\rho$ is given by

$\rho = 1 - \frac{6 \left[ \sum_{i=1}^{n} d_i^2 + \frac{m(m^2 - 1)}{12} + \dots \right]}{n(n^2 - 1)}$

Where, m = number of times a rank is repeated; one term $m(m^2 - 1)/12$
is added for each group of tied ranks.

For example, if a rank is repeated 2 times in the x-series and 3 times in
the y-series, the correction factor is

$\frac{2(2^2 - 1)}{12} + \frac{3(3^2 - 1)}{12} = 0.5 + 2 = 2.5$
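The rank correlation formula and its tie correction can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the helper names and the small data sets are hypothetical, and the code simply applies the formula as stated above (mean ranks for ties, plus one correction term per group of tied values).

```python
from collections import Counter


def mean_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share the mean rank."""
    positions = {}
    for position, value in enumerate(sorted(values), start=1):
        positions.setdefault(value, []).append(position)
    return [sum(positions[v]) / len(positions[v]) for v in values]


def tie_correction(values):
    """Add m(m^2 - 1)/12 for every value that is repeated m > 1 times."""
    return sum(m * (m * m - 1) / 12 for m in Counter(values).values() if m > 1)


def rank_correlation(x, y):
    """Rank correlation from the formula above, including the tie correction."""
    n = len(x)
    rank_x, rank_y = mean_ranks(x), mean_ranks(y)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    correction = tie_correction(x) + tie_correction(y)
    return 1 - 6 * (sum_d2 + correction) / (n * (n * n - 1))


# Hypothetical scores, not taken from the text.
x_scores = [10, 20, 30, 40]
y_agree = [1, 2, 3, 4]
y_reverse = [4, 3, 2, 1]
print(rank_correlation(x_scores, y_agree))    # identical rankings
print(rank_correlation(x_scores, y_reverse))  # opposite rankings
# Correction factor from the text: one rank repeated twice, another three times.
print(tie_correction([1, 1]) + tie_correction([2, 2, 2]))
```

With identical rankings every $d_i$ is zero and the coefficient is +1; with exactly reversed rankings it is -1, which matches the range stated for the correlation coefficient.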

Properties of correlation coefficient
1) The correlation coefficient is unaffected by a change of origin of
reference and scale of reference.
2) The value of the correlation coefficient lies between -1 and +1
($-1 \le \rho \le +1$ or $-1 \le r \le +1$).
Limitations
1) The formula for the correlation coefficient holds only if there is a
linear correlation between the variables; that is, the relationship
between the variables is linear.
2) Correlation theory does not establish a causal relationship between
the variables. It does not suggest that the variations in y are caused by
variations in x or vice versa. A high correlation between variables x and y
may describe any one of the following situations.
a) Variation in y is caused by variation in x.
b) Variation in x is caused by variation in y.
c) x and y are jointly dependent on some other variable.
d) The correlation between x and y may be due to chance.
Correlations are sometimes observed between variables that could not
conceivably be causally related.
For example, if a high correlation is found between the number of births
and the number of murders in a country, it does not prove that the number
of births is determined by the number of murders. This type of correlation
is called spurious correlation or chance correlation, and it does not
establish any causal relationship between the variables involved.

REGRESSION
Regression analysis is helpful in ascertaining the probable form of the
relationship between variables. The ultimate objective when this method
of analysis is employed usually is to predict or estimate the value of
one variable corresponding to a given value of another variable.
The Regression Model
In the typical regression problem, as in most problems in applied
statistics, researchers have available for analysis a sample of
observations from some real or hypothetical population. Based on the
results of their analysis of the sample data, they are interested in
reaching decisions about the population from which the sample is
presumed to have been drawn. It is important, therefore, that the
researchers understand the nature of the population in which they are
interested. They should know enough about the population to be able
either to construct a mathematical model for its representation or to
determine if it reasonably fits some established model. There are two
types:
(1). Simple linear regression model
(2). Multiple linear regression model
(1). Simple Linear Regression Model
$y = a + bx$ ..............(I)
Where, y = response or dependent variable
x = independent or explanatory variable
a = intercept
b = regression coefficient or slope (i.e., the average change in y per
unit change in x).


Assumptions:
In the simple linear regression model two variables, x and y, are of
interest. The variable x is usually referred to as the independent variable,
since frequently it is controlled by the investigator; that is, values of x may
be selected by the investigator, and corresponding to each preselected
value of x, one or more values of y are obtained. The other variable, y,
accordingly is called the dependent variable, and we speak of the
regression of y on x.
The following are the assumptions underlying the simple linear
regression model.
i. The investigator fixes the values of the independent variable.
ii. The values of the dependent variable are random.
iii. For each value of x there is a set of y values; these sets have a
common variance $\sigma^2_{y.x}$, and their means lie on the true
regression line.

Formulae for estimating a and b:
(i) Using the normal equations to estimate a and b:

$\sum_{i=1}^{n} y_i = na + b \sum_{i=1}^{n} x_i$ ..............(1)
$\sum_{i=1}^{n} x_i y_i = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} x_i^2$ .............(2)

Solving equations (1) and (2) gives the estimates of a and b; substituting
these values into equation (I) gives the fitted regression line.
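Solving the two normal equations (1) and (2) for a and b can be sketched in Python as below. The function name `fit_line` and the data are hypothetical; the code eliminates a between the two equations to obtain b and then back-substitutes into equation (1).

```python
def fit_line(xs, ys):
    """Estimate the intercept a and slope b of y = a + bx from the normal equations."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # Normal equations:
    #   sum_y  = n * a     + b * sum_x     (1)
    #   sum_xy = a * sum_x + b * sum_x2    (2)
    # Eliminating a gives the slope; equation (1) then gives the intercept.
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
a, b = fit_line(xs, ys)
print(f"a = {a}, b = {b}")
```

Because the illustrative points lie exactly on a straight line, the estimates recover the intercept 1 and the slope 2.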
Formulae for the regression lines of y on x and x on y:

y on x: $y - \bar{y} = r \frac{\sigma_y}{\sigma_x} (x - \bar{x})$ ..(3)
x on y: $x - \bar{x} = r \frac{\sigma_x}{\sigma_y} (y - \bar{y})$ ..(4)

Two Regression coefficients
$b_{y.x}$ and $b_{x.y}$ are the two regression coefficients.
Formulae for finding $b_{y.x}$ and $b_{x.y}$:

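Method (1): When means and correlation coefficient are not known

$b_{y.x} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$ and $b_{x.y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}$

These ratios give the least-squares slopes only when $x_i$ and $y_i$ are
measured as deviations from their means. For raw values the equivalent
form is $b_{y.x} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{n \sum x_i^2 - (\sum x_i)^2}$,
with x and y interchanged for $b_{x.y}$.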

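Method (2): When means are known and correlation coefficient is not
known

$b_{y.x} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $b_{x.y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

Method (3): When correlation coefficient is known

$b_{y.x} = r \frac{\sigma_y}{\sigma_x}$ and $b_{x.y} = r \frac{\sigma_x}{\sigma_y}$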

From equations (3) and (4) we get

$r^2 = b_{y.x} \cdot b_{x.y}$
$r = \pm \sqrt{b_{y.x} \cdot b_{x.y}}$

One can easily see that r is positive if the regression coefficients
$b_{y.x}$ and $b_{x.y}$ are positive, and r is negative if the regression
coefficients are negative. In no real situation do we get one regression
coefficient positive and the other negative.
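The relationship between r and the two regression coefficients can be illustrated with a short Python sketch. The data and function names are hypothetical; the coefficients are computed from deviations about the means, and r takes the common sign of the two coefficients.

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy) computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / s_xx, s_xy / s_yy


def r_from_coefficients(b_yx, b_xy):
    """r is the square root of b_yx * b_xy, carrying the sign of the coefficients."""
    if b_yx * b_xy < 0:
        raise ValueError("the two regression coefficients cannot differ in sign")
    return math.copysign(math.sqrt(b_yx * b_xy), b_yx)


# Hypothetical data, not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
b_yx, b_xy = regression_coefficients(xs, ys)
r = r_from_coefficients(b_yx, b_xy)
print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}, r = {r:.4f}")
```

For these illustrative values the result agrees with the correlation computed directly as $\sum (x - \bar{x})(y - \bar{y}) / \sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}$.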
Logistic regression: This method is used to examine the relationship
between a dichotomous response variable and a set of explanatory
variables. That is, the statistical method used to study a dichotomous
response variable that is explained by a number of explanatory variables
is called logistic regression. (The explanatory variables may be ordinal,
interval or ranked data.)
The assumptions for logistic regression are:
i. The response variable is binary.
ii. The model relating the response to the explanatory variables is
linear on the log-odds (logit) scale.
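The log-linear assumption can be written out explicitly. The following is the usual form of the logistic model, given here as a sketch; the symbols $p$, $\beta_0, \dots, \beta_k$ and $x_1, \dots, x_k$ are notation introduced for illustration and do not appear in the text above.

```latex
% p is the probability that the binary response equals 1
% for given values of the explanatory variables x_1, ..., x_k.
\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\qquad\Longleftrightarrow\qquad
p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
```

The left-hand form shows that it is the log of the odds, not the probability itself, that is assumed to be linear in the explanatory variables.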

TESTING OF HYPOTHESIS
INTRODUCTION
The purpose of hypothesis testing is to aid the clinician, researcher, or
administrator in reaching a conclusion concerning a population by
examining a sample from that population.
BASIC CONCEPTS
A hypothesis may be defined simply as a statement about one or more
populations.
Types of hypothesis: Researchers are concerned with two types of
hypotheses: research hypotheses and statistical hypotheses.
The research hypothesis is the conjecture or supposition that motivates
the research.
Statistical hypotheses are hypotheses that are stated in such a way
that they may be evaluated by appropriate statistical techniques.
HYPOTHESIS TESTING STEPS
For convenience, hypothesis testing will be presented as a ten-step
procedure. There is nothing magical or sacred about this particular
format. It merely breaks the process down into a logical sequence of
actions and decisions.

1) Data
The nature of the data that form the basis of the testing procedures must
be understood, since this determines the particular test to be employed.
Whether the data consist of counts or measurements, for example, must
be determined.
2) Assumptions
Different assumptions lead to modifications of the testing procedure. For
example, assumptions may be made about the normality of the population
distribution, equality of variances, and independence of samples.
3) Hypotheses
There are two statistical hypotheses involved in hypothesis testing, and
these should be stated explicitly.
Null hypothesis
The null hypothesis is the hypothesis to be tested. It is designated by the
symbol H0. The null hypothesis is sometimes referred to as a hypothesis
of no difference, since it is a statement of agreement with (no difference
from) conditions presumed to be true in the population of interest. In
general, the null hypothesis is set up for the express purpose of being
discredited. Consequently, the complement of the conclusion that the
researcher is seeking to reach becomes the statement of the null
hypothesis. In the testing process the null hypothesis either is rejected
or is not rejected. If the null hypothesis is not rejected, we will say that
the data on which the test is based do not provide sufficient evidence to
cause rejection.
Alternative hypothesis
If the null hypothesis is rejected, we will say that the data at hand are
not compatible with the null hypothesis, but are supportive of some other
hypothesis. The alternative hypothesis is a statement of what we will
believe is true if our sample data cause us to reject the null hypothesis.
Usually the alternative hypothesis and the research hypothesis are the
same, and in fact the two terms are used interchangeably. We shall
designate the alternative hypothesis by the symbol HA or H1.
Rules for stating statistical hypotheses
When the hypotheses are of the type considered in this chapter, an
indication of equality (either $=$, $\le$, or $\ge$) must appear in the
null hypothesis. Suppose, for example, that we want to answer the
question: can we conclude that a certain population mean is not 50? The
null hypothesis is
H0: $\mu = 50$ and the alternative is HA: $\mu \ne 50$
Suppose we want to know if we can conclude that the population mean is
greater than 50. Our hypotheses are
H0: $\mu \le 50$ Vs HA: $\mu > 50$
If we want to know if we can conclude that the population mean is less
than 50, the hypotheses are
H0: $\mu \ge 50$ Vs HA: $\mu < 50$

In summary, we may state the following rules of thumb for deciding what
statement goes in the null hypothesis and what statement goes in the
alternative hypothesis:
a) What you hope or expect to be able to conclude as a result of
the test usually should be placed in the alternative hypothesis.
b) The null hypothesis should contain a statement of equality,
either $=$, $\le$, or $\ge$.
c) The null hypothesis is the hypothesis to be tested.
d) The null and alternative hypotheses are complementary. That
is, the two together exhaust all possibilities regarding the value that the
hypothesized parameter can assume.
A Precaution
It should be pointed out that neither hypothesis testing nor statistical
inference, in general, leads to the proof of a hypothesis; it merely
indicates whether the hypothesis is supported or is not supported by the
available data. When we fail to reject a null hypothesis, therefore, we do
not say that it is true, but that it may be true. When we speak of accepting
a null hypothesis, we have this limitation in mind and do not wish to
convey the idea that accepting implies proof.
4) Test statistic
The test statistic is some statistic that may be computed from the data of
the sample. As a rule, there are many possible values that the test
statistic may assume, the particular value observed depending on the
particular sample drawn. As we will see, the test statistic serves as a
decision maker, since the decision to reject or not to reject the null
hypothesis depends on the magnitude of the test statistic. An example of
a test statistic is the quantity

$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

Where $\mu_0$ is a hypothesized value of a population mean. This test
statistic is related to the statistic

$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$

General formula for test statistic
The following is a general formula for a test statistic that will be
applicable in many of the hypothesis tests:

Test statistic = (relevant statistic - hypothesized parameter) / (standard error of the relevant statistic)

In the above example, $\bar{x}$ is the relevant statistic, $\mu_0$ is the
hypothesized parameter, and $\sigma / \sqrt{n}$ is the standard error of
$\bar{x}$, the relevant statistic.
5) Distribution of test statistic

It has been pointed out that the key to statistical inference is the sampling
distribution. We are reminded of this again when it becomes necessary to
specify the probability distribution of the test statistic. The distribution of
the test statistic

$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$

for example, follows the standard normal distribution if the null
hypothesis is true and the assumptions are met.
6) Decision rule
All possible values that the test statistic can assume are points on the
horizontal axis of the graph of the distribution of the test statistic and are
divided into two groups; one group constitutes what is known as the
rejection region and the other group makes up the nonrejection region.
The values of the test statistic forming the rejection region are those
values that are less likely to occur if the null hypothesis is true, while the
values making up the nonrejection region are more likely to occur if the
null hypothesis is true. The decision rule tells us to reject the null
hypothesis if the value of the test statistic that we compute from our
sample is one of the values in the rejection region and to not reject the
null hypothesis if the computed value of the test statistic is one of the
values in the nonrejection region.
Significance Level
The decision as to which values go into the rejection region and which ones
go into the nonrejection region is made on the basis of the desired level
of significance, designated by $\alpha$. The term "level of significance"
reflects the fact that hypothesis tests are sometimes called significance
tests, and a computed value of the test statistic that falls in the rejection
region is said to be significant. The level of significance, $\alpha$,
specifies the area under the curve of the distribution of the test statistic
that is above the values on the horizontal axis constituting the rejection
region.
The level of significance $\alpha$ is a probability and, in fact, is the
probability of rejecting a true null hypothesis.
Since to reject a true null hypothesis would constitute an error, it seems
only reasonable that we should make the probability of rejecting a true
null hypothesis small and, in fact, that is what is done. We select a small
value of $\alpha$ in order to make the probability of rejecting a true null
hypothesis small.
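The test statistic, the level of significance and the decision rule can be tied together in a small Python sketch. All numbers below are hypothetical, the population standard deviation is assumed known, and a two-sided alternative is assumed so that the rejection region lies in both tails of the standard normal distribution.

```python
import math
from statistics import NormalDist


def z_statistic(sample_mean, mu0, sigma, n):
    """z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


# Hypothetical values: H0: mu = 50 against HA: mu != 50.
z = z_statistic(sample_mean=52.0, mu0=50.0, sigma=5.0, n=25)

alpha = 0.05
# The two-sided rejection region is |z| > critical, with alpha/2 in each tail.
critical = NormalDist().inv_cdf(1 - alpha / 2)
reject_h0 = abs(z) > critical

print(f"z = {z:.2f}, critical value = {critical:.2f}, reject H0: {reject_h0}")
```

Here the standard error is 5 / sqrt(25) = 1, so z = 2.0, which lies beyond the critical value of about 1.96 and therefore falls in the rejection region.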
Types of Errors
The error committed when a true null hypothesis is rejected is called the
type I error. The type II error is the error committed when a false null
hypothesis is not rejected. The probability of committing a type II error is
designated by $\beta$.
Whenever we reject a null hypothesis there is always the concomitant
risk of committing a type I error, rejecting a true null hypothesis.
Whenever we fail to reject a null hypothesis the risk of failing to reject a
false null hypothesis is always present. We make $\alpha$ small, but we
generally exercise no control over $\beta$, although we know that in most
practical situations it is larger than $\alpha$.
We never know whether we have committed one of these errors when we
reject or fail to reject a null hypothesis, since the true state of affairs is
unknown. If we reject the null hypothesis, we can take comfort from the
fact that we made $\alpha$ small and, therefore, the probability of
committing a type I error was small. If we fail to reject the null
hypothesis, we do not know the concurrent risk of committing a type
II error, since $\beta$ is usually unknown but, as has been pointed out, we
do know that, in most practical situations, it is larger than $\alpha$.
                        Condition of Null Hypothesis (H0)
Possible Action         True               False
Fail to reject H0       Correct action     Type II error
Reject H0               Type I error       Correct action

7) Calculation of test statistic

From the data contained in the sample we compute a value of the test
statistic and compare it with the rejection and nonrejection regions that
have already been specified.
8) Statistical Decision
The statistical decision consists of rejecting or of not rejecting the null
hypothesis. It is rejected if the computed value of the test statistic falls in
the rejection region, and it is not rejected if the computed value of the test
statistic falls in the nonrejection region.
9) Conclusion
If H0 is rejected, we conclude that HA is true. If H0 is not rejected, we
conclude that H0 may be true.
10) P-Value
The p-value is a number that tells us how unusual our sample results are,
given that the null hypothesis is true. A p-value indicating that the
sample results are not likely to have occurred, if the null hypothesis is
true, provides justification for doubting the truth of the null hypothesis.
That is, p is the probability, computed under the null hypothesis, of
obtaining a result at least as extreme as the one observed. With a critical
probability of 0.05, the chance of such an occurrence is 5 in 100; since
this is a rare occurrence under the null hypothesis, the null hypothesis is
rejected when p falls below this level.
We emphasize that when the null hypothesis is not rejected one should
not say that the null hypothesis is accepted. We should say that the null
hypothesis is "not rejected". We avoid using the word "accept" in this
case because we may have committed a type II error. Since, frequently,
the probability of committing a type II error can be quite high, we do not
wish to commit ourselves to accepting the null hypothesis.
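As an illustration of how a p-value is obtained and used, the sketch below computes a two-sided p-value for a z statistic. The value z = 2.0 is hypothetical, and the use of the standard normal distribution assumes that the conditions for the z test are met.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under H0, of a z value at least as extreme as the one observed."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


z_observed = 2.0  # hypothetical computed test statistic
alpha = 0.05
p = two_sided_p_value(z_observed)

if p < alpha:
    decision = "reject H0"
else:
    decision = "do not reject H0"  # not the same as accepting H0
print(f"p = {p:.4f}: {decision}")
```

The wording of the second branch follows the caution above: a large p-value leads to "do not reject", never to "accept".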
Steps involved in testing of hypothesis:
The following are the steps involved in applying a test of significance.
1) State an appropriate null hypothesis for the problem.
2) Calculate the suitable statistic using the standard error: t, $\chi^2$, F, etc.
3) Determine the degrees of freedom for the statistic.
4) Find the probability level p corresponding to the test statistic using
the relevant tables.
5) The null hypothesis is rejected if p is less than 0.05; otherwise it is
not rejected.
If the objective is to conclude whether or not the two samples are from
the same population, without considering the direction of the difference,
a two-tailed test of significance is used. On the other hand, if the
objective is to conclude whether or not the mean of one of the samples is
larger than the other, a one-tailed test of significance is used.
The decision about the choice of test statistic depends on the sample
size and on the type of data, whether qualitative or quantitative.
The test of significance is used when the objective is to compare
a) Sample mean with population mean;
b) Means of two samples;
c) Sample proportion with population proportion;
d) Proportions of two samples;
e) Association between two attributes.
If we have a large sample and the objective is any one of (a) to (d)
mentioned above, we use the normal curve test or the normal test. To
use the normal test, the following assumptions should be satisfied.
i. The sample should be randomly selected.
ii. The variable should follow a normal distribution.
iii. The standard deviations of the two samples should be more or
less the same.
iv. The sample should be large.

General Procedure in Testing of Hypothesis
i). Use a two-tailed test for a non-directional difference.
ii). Use a one-tailed test if the alternative hypothesis specifies the
direction of the difference.
iii). Use $\chi^2$ or Z for enumerative data.
iv). The table value for the given P under the null hypothesis decides
the outcome.
v). That no true difference exists is always assumed in decision making.
vi). The level of significance ($\alpha$) is chosen arbitrarily; 0.05 is
customary.
vii). Chance alone is not a sufficient explanation if P is less than the
predetermined level (reject the null hypothesis H0).
viii). Chance alone may be the explanation if P is more than the
predetermined level (do not reject the null hypothesis).

NON-PARAMETRIC AND DISTRIBUTION FREE STATISTICS


Introduction:
Most of the statistical inference procedures we have discussed up
to this point are classified as parametric statistics.
One exception is our use of chi-square as a
test of goodness of fit and
test of independence.
These uses of chi-square come under the heading of nonparametric
statistics.
Difference between parametric and nonparametric statistics:
The obvious question now is: What is the difference?
In answer to this, let us recall the nature of the inferential procedures that
we have categorized as parametric.
In each case, our interest was focused on estimating or testing a
hypothesis about one or more population parameters. Furthermore, central
to these procedures was a knowledge of the functional form of the
population from which were drawn the samples providing the basis for the
inference.
An example of a parametric statistical test is the widely used t test. The
most common uses of this test are for testing a hypothesis about a single
population mean or the difference between two population means. One of
the assumptions underlying the valid use of this test is that the sampled
population or populations are at least approximately normally distributed.
As we will learn, the procedures that we discuss in this chapter either are
not concerned with population parameters or do not depend on knowledge
of the sampled population. Strictly speaking, only those procedures that
test hypotheses that are not statements about population parameters are
classified as nonparametric, while those that make no assumption about
the sampled population are called distribution-free procedures. Despite
this distinction, it is customary to use the terms nonparametric and
distribution-free interchangeably and to discuss the various procedures of
both types under the heading of nonparametric statistics. We will follow
the convention.

The above discussion implies the following advantages of
nonparametric statistics.
1. They allow for the testing of hypotheses that are not statements
about population parameter values. Some of the chi-square tests of
goodness of fit and tests of independence are examples of tests
possessing this advantage.
2. Nonparametric tests may be used when the form of the sampled
population is unknown.
3. Nonparametric procedures tend to be computationally easier and
consequently more quickly applied than parametric procedures. This
can be a desirable feature in certain cases, but when time is not at a
premium, it merits a low priority as a criterion for choosing a
nonparametric test.
4. Nonparametric procedures may be applied when the data being
analyzed consist merely of rankings or classifications. That is, the
data may not be based on a measurement scale strong enough to
allow the arithmetic operations necessary for carrying out parametric
procedures.
Although nonparametric statistics enjoy a number of advantages, their
disadvantages must also be recognized.
1. The use of nonparametric procedures with data that can be
handled with a parametric procedure results in a waste of data.
2. The application of some of the nonparametric tests may be laborious
for large samples.
1). SIGN TEST:
The familiar t test is not strictly valid for testing
(1). the null hypothesis that a population mean is equal to some particular
value, or
(2). the null hypothesis that the mean of a population of differences
between pairs of measurements is equal to zero
unless the relevant populations are at least approximately normally
distributed. When
a). the normality assumption cannot be made, or
b). the data at hand are ranks rather than measurements on an interval
or ratio scale,
the investigator may wish for an optional procedure. Although the t test is
known to be rather insensitive to violations of the normality assumption,
there are times when an alternative test is desirable.
A frequently used nonparametric test that does not depend on the
assumptions of the t test is the sign test. This test focuses on the
median rather than the mean as a measure of central tendency or location.
The median and mean will be equal in symmetric distributions. The only
assumption underlying the test is that the distribution of the variable of
interest is continuous. This assumption rules out the use of nominal
data.
The sign test gets its name from the fact that pluses and minuses,
rather than numerical values, provide the raw data used in the calculations.
Example for Sign Test:
General appearance scores of 10 mentally retarded girls

Girl    Score    Girl    Score
1       4        6       6
2       5        7       10
3       8        8       7
4       8        9       6
5       9        10      6

We wish to know if we can conclude that the median score of the
population from which we assume this sample to have been drawn is
different from 5.

Solution:
1. Data. See the problem statement.
2. Assumptions. We assume that the measurements are taken on
a continuous variable.
3. Hypotheses.
H0: The population median is 5.
HA: The population median is not 5.
4. Level of significance: let $\alpha$ = 0.05
5. Test statistic. The test statistic for the sign test is either the
observed number of plus signs or the observed number of minus
signs. The nature of the alternative hypothesis determines
which of these test statistics is appropriate. In a given test, any
one of the following alternative hypotheses is possible.
HA: P(+) > P(-) one-sided alternative
HA: P(+) < P(-) one-sided alternative
HA: P(+) $\ne$ P(-) two-sided alternative
If the alternative hypothesis is
HA: P(+) > P(-)
a sufficiently small number of minus signs causes rejection of H0.
The test statistic is the number of minus signs. Similarly, if the alternative
hypothesis is
HA: P(+) < P(-)
a sufficiently small number of plus signs causes rejection of H0. The
test statistic is the number of plus signs. If the alternative hypothesis is
HA: P(+) $\ne$ P(-)
either a sufficiently small number of plus signs or a sufficiently small
number of minus signs causes rejection of the null hypothesis H0. We may
take as the test statistic the less frequently occurring sign.

Scores above (+) and below (-) the hypothesized median, based on the data
of the example:

Girl    Score    Score relative to hypothesized median (5)
1       4        -
2       5        0
3       8        +
4       8        +
5       9        +
6       6        +
7       10       +
8       7        +
9       6        +
10      6        +
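The counting of signs for this example can be reproduced in Python. The sketch below is an assumption-laden illustration: it drops scores equal to the hypothesized median and, for the two-sided alternative, takes the p-value from the binomial distribution with probability 0.5, which is the usual null distribution for the sign test but is not spelled out in the text above. The function name `sign_test` is hypothetical.

```python
from math import comb


def sign_test(values, hypothesized_median):
    """Return (plus, minus, two-sided p-value) for the sign test."""
    plus = sum(v > hypothesized_median for v in values)
    minus = sum(v < hypothesized_median for v in values)
    n = plus + minus            # observations equal to the median are dropped
    k = min(plus, minus)        # the less frequently occurring sign
    # P(K <= k) when K follows a binomial(n, 0.5) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)


scores = [4, 5, 8, 8, 9, 6, 10, 7, 6, 6]  # scores of the 10 girls
plus, minus, p_value = sign_test(scores, hypothesized_median=5)
print(f"plus = {plus}, minus = {minus}, two-sided p = {p_value:.4f}")
```

For these scores there are 8 plus signs, 1 minus sign and one score equal to 5, so the test is based on 9 usable observations and the test statistic is the single minus sign.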

2). THE WILCOXON SIGNED-RANK TEST FOR LOCATION
Purpose and Uses:
Sometimes we wish to test a null hypothesis about a population mean,
but for some reason neither z nor t is an appropriate test statistic. If
we have a small sample (n < 30) from a population that is known to be
grossly nonnormally distributed, and the central limit theorem is not
applicable, the z statistic is ruled out. The t statistic is not appropriate
because the sampled population does not sufficiently approximate a
normal distribution.
The sign test may be used when our data consist of a single sample or
when we have paired data. If, however, the data for analysis are measured
on at least an interval scale, the sign test may be undesirable since it
would not make full use of the information contained in the data. A more
appropriate procedure might be the Wilcoxon signed-rank test, which
makes use of the magnitudes of the differences between measurements
and a hypothesized location parameter rather than just the signs of
the differences.
Assumptions. The Wilcoxon test for location is based on the following
assumptions about the data.
1. The sample is random.
2. The variable is continuous.
3. The population is symmetrically distributed about its mean $\mu$.
4. The measurement scale is at least interval.
Hypotheses. The following are the null hypotheses (along with their
alternatives) that may be tested about some unknown population mean
$\mu$, where $\mu_0$ is the hypothesized value.
(a) H0: $\mu = \mu_0$ Vs HA: $\mu \ne \mu_0$
(b) H0: $\mu \ge \mu_0$ Vs HA: $\mu < \mu_0$
(c) H0: $\mu \le \mu_0$ Vs HA: $\mu > \mu_0$
When we use the Wilcoxon procedure we perform the following calculations.
1. Subtract the hypothesized mean $\mu_0$ from each observation $X_i$
to obtain

$d_i = X_i - \mu_0$

If any $X_i$ is equal to the mean, so that $d_i = 0$, eliminate that $d_i$
from the calculations and reduce n accordingly.
2. Rank the usable $d_i$ from the smallest to the largest without regard to
the sign of $d_i$. That is, consider only the absolute value of the $d_i$,
designated by $|d_i|$, when ranking them. If two or more of the $|d_i|$ are
equal, assign each tied value the mean of the rank positions the tied
values occupy. If, for example, the three smallest $|d_i|$ values are
equal, place them in rank positions 1, 2 and 3 but assign each a rank
of (1+2+3) / 3 = 2.
3. Assign each rank the sign of the $d_i$ that yields that rank.
4. Find T+, the sum of the ranks with positive signs, and T-, the sum of
the ranks with negative signs.
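The four calculation steps can be sketched in Python as follows. The observations and the hypothesized mean are hypothetical, and the function name `signed_rank_sums` is an illustrative choice; the code follows the steps above (drop zero differences, rank absolute differences with mean ranks for ties, then sum the ranks by sign).

```python
def signed_rank_sums(observations, mu0):
    """Return (T_plus, T_minus) for the Wilcoxon signed-rank procedure."""
    # Step 1: differences from the hypothesized mean, dropping zeros.
    diffs = [x - mu0 for x in observations if x != mu0]
    # Step 2: rank the absolute differences; tied values get the mean rank.
    positions = {}
    for position, magnitude in enumerate(sorted(abs(d) for d in diffs), start=1):
        positions.setdefault(magnitude, []).append(position)
    rank = {m: sum(p) / len(p) for m, p in positions.items()}
    # Steps 3 and 4: attach the sign of each difference and sum by sign.
    t_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    t_minus = sum(rank[abs(d)] for d in diffs if d < 0)
    return t_plus, t_minus


# Hypothetical observations and hypothesized mean, not taken from the text.
observations = [12, 9, 15, 10, 8, 14]
t_plus, t_minus = signed_rank_sums(observations, mu0=10)
print(f"T+ = {t_plus}, T- = {t_minus}")
```

A useful check is that T+ and T- together add up to n(n+1)/2, where n is the number of nonzero differences (here 5, giving 15).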

3). MEDIAN TEST:


A nonparametric procedure that may be used to test the null hypothesis
that two independent samples have been drawn from populations
with equal medians is the median test.
4). THE MANN-WHITNEY U TEST
Purpose and Uses:
It is the most widely used alternative to the t-test when we do
not make the t-test assumptions about the parent population. The median
test does not make full use of all the information present in the two
samples when the variable of interest is measured on at least an ordinal
scale. Reducing an observation's information content to merely that of
whether or not it falls above or below the common median is a waste of
information. If, for testing the desired hypothesis, there is available a
procedure that makes use of more of the information inherent in the data,
that procedure should be used if possible. Such a nonparametric
procedure that can often be used instead of the median test is the
Mann-Whitney test, sometimes called the Mann-Whitney-Wilcoxon test. Since
this test is based on the ranks of the observations, it utilizes more
information than does the median test.
Assumptions:
1. The two samples, of size n and m, respectively, available for
analysis have been independently and randomly drawn from their
respective populations.
2. The measurement scale is at least ordinal.
3. The variable of interest is continuous.
4. If the populations differ at all, they differ only with respect to their
medians.

Hypotheses:
When these assumptions are met we may test the null hypothesis
that the two populations have equal medians against any of the three
possible alternatives.
H0: $M_X = M_Y$ Vs
HA: $M_X \ne M_Y$ (Two Sided)
HA: $M_X < M_Y$ (One Sided)
HA: $M_X > M_Y$ (One Sided)

Level of significance.
Let $\alpha$ = 0.05 (or 5%)

Test statistic.
To compute the test statistic we combine the two samples and rank all
observations from smallest to largest while keeping track of the sample to
which each observation belongs. Tied observations are assigned a rank
equal to the mean of the rank positions for which they are tied.
The test statistic is

$T = S - \frac{n(n + 1)}{2}$

Where, n = number of sample X observations
S = the sum of the ranks assigned to the sample observations from the
population of X values.
The choice of which sample's values we label X is arbitrary.


Distribution of test statistic.
Critical values from the distribution of the test statistic MW are given
in the table of Quantiles of the Mann-Whitney test statistic for various n,
m, and p values.
Decision rule.
In general, for the two-sided situation with
H0: $M_X = M_Y$ Vs HA: $M_X \ne M_Y$
computed values of T that are either sufficiently large or sufficiently small
will cause rejection of H0. The decision rule for this case, then, is:
Reject H0: $M_X = M_Y$ if the computed value of T is either less than
$MW_{\alpha/2}$ or greater than $MW_{1-\alpha/2}$, where $MW_{\alpha/2}$
is the critical value of T for n, m, and $\alpha/2$ given in Quantiles of the
Mann-Whitney test statistic, and $MW_{1-\alpha/2} = nm - MW_{\alpha/2}$.
For one-sided tests of H0: $M_X \ge M_Y$ Vs HA: $M_X < M_Y$ the
decision rule is:
Reject H0: $M_X \ge M_Y$ if the computed T is less than $MW_{\alpha}$,
where $MW_{\alpha}$ is the critical value of T obtained by entering
Quantiles of the Mann-Whitney test statistic with
n = the number of X observations,
m = the number of Y observations, and
$\alpha$ = the chosen level of significance.
If we use the Mann-Whitney procedure to test
H0: $M_X \le M_Y$ Vs HA: $M_X > M_Y$
sufficiently large values of T will cause rejection, so that the decision
rule is:
Reject H0: $M_X \le M_Y$ if the computed value of T is greater than
$MW_{1-\alpha}$,
where $MW_{1-\alpha} = nm - MW_{\alpha}$.
Statistical Decision.
When we enter the MW statistic table with n, m, and $\alpha$, we find the
critical value of MW.
If, for example, the test rejects for small values of T and the computed T
value is greater than the critical value of the test statistic MW, we do
not reject H0.
Conclusion.
If H0: $M_X = M_Y$ Vs HA: $M_X \ne M_Y$ (two-tail test) and H0 is not
rejected, we conclude that $M_X$ may be equal to $M_Y$. This leads to the
conclusion that there is no significant difference between X and Y.
If H0: $M_X \le M_Y$ Vs HA: $M_X > M_Y$ (one-tail test) and H0 is
rejected, we conclude that $M_X$ is greater than $M_Y$. This leads to the
conclusion that X is larger than Y.
p value.
If p > $\alpha$, we do not reject H0.
Large-Sample Approximation. When either n or m is greater than 20 we
cannot use the Mann-Whitney test statistic table to obtain critical values
for the Mann-Whitney test. When this is the case we may compute

$z = \frac{T - mn/2}{\sqrt{nm(n + m + 1)/12}}$

and compare the result, for significance, with critical values of the
standard normal distribution.
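The large-sample approximation is a one-line calculation, sketched below in Python. The values of T, n and m are hypothetical and chosen only so that one sample size exceeds 20; the function name `mann_whitney_z` is an illustrative choice.

```python
import math


def mann_whitney_z(t, n, m):
    """Standardize T using its null mean nm/2 and variance nm(n + m + 1)/12."""
    mean_t = n * m / 2
    sd_t = math.sqrt(n * m * (n + m + 1) / 12)
    return (t - mean_t) / sd_t


# Hypothetical values: T = 150 with n = 25 and m = 22 observations.
z = mann_whitney_z(t=150, n=25, m=22)
print(f"z = {z:.3f}")
```

The resulting z is then referred to the standard normal distribution in the same way as any other z statistic.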
Example for Mann-Whitney U Test:
The following table shows the time taken for root canal treatment of a
tooth in Conservative Dentistry. The times taken by junior students and
by senior students are shown.

Sl. no    Junior student     Senior student
          time taken         time taken
1         17.4               14.4
2         17.6               16.5
3         18.0               16.5
4         15.3               14.1
5         16.8               15.9
6         14.2               13.7
7                            16.7
8                            14.0
9                            14.7
10                           18.0

We wish to know if we can conclude that senior students take less time
than junior students.
Solution:
1. Data: See the problem table.
2. Assumptions. We presume that the assumptions of the Mann-Whitney
U test are met.
3. Hypotheses. The null and alternative hypotheses are as follows
H0: $M_X \le M_Y$
HA: $M_X > M_Y$
Where,
$M_X$ = median of the population of times taken by junior students
$M_Y$ = median of the population of times taken by senior students

4. Level of significance. Let $\alpha$ = 0.05 (or 5%)
5. Test statistic. To compute the test statistic we combine the two samples
and rank all observations from smallest to largest while keeping
track of the sample to which each observation belongs. Tied
observations are assigned a rank equal to the mean of the rank
positions for which they are tied. The results of these steps are shown
below.
Table: Original Data and Ranks
Sl. No.   Junior Student (X)   Rank of X            Senior Student (Y)   Rank of Y
1                                                   13.7                 1
2                                                   14.0                 2
3                                                   14.1                 3
4         14.2                 4
5                                                   14.4                 5
6                                                   14.7                 6
7         15.3                 7
8                                                   15.9                 8
9                                                   16.5                 (9+10) / 2 = 9.5
10                                                  16.5                 (9+10) / 2 = 9.5
11                                                  16.7                 11
12        16.8                 12
13        17.4                 13
14        17.6                 14
15,16     18.0                 (15+16) / 2 = 15.5   18.0                 (15+16) / 2 = 15.5
Total                          S = 65.5                                  70.5

The test statistic is
$$T = S - \frac{n(n+1)}{2}$$
Where, n = Number of sample X observations
S = The sum of the ranks assigned to the sample observations
from the population of X values.
The choice of which sample's values we label X is arbitrary.


6. Distribution of test statistic. Critical values from the distribution of
the test statistic MW are given in the table of Quantiles of the Mann-Whitney
test statistic for various n, m, and p values.
7. Decision rule. If the median of the X population is, in fact, smaller than
the median of the Y population, as specified in the alternative
hypothesis, we would expect (for equal sample sizes) the sum of the
ranks assigned to the observations from the X population to be
smaller than the sum of the ranks assigned to the observations from
the Y population. The test statistic is based on this rationale in such
a way that a sufficiently small value of T will cause rejection of
H0: MX ≥ MY.
In general, for one-sided tests of the type illustrated here the decision rule
is:
Reject H0: MX ≥ MY if the computed T is less than MW(α), where MW(α) is the
critical value of T obtained by entering Quantiles of the Mann-Whitney test
statistic with
n = the number of X observations,
m = the number of Y observations, and
α = the chosen level of significance.
If we use the Mann-Whitney procedure to test
H0: MX ≤ MY Vs HA: MX > MY
sufficiently large values of T will cause rejection, so that the decision rule
is:
Reject H0: MX ≤ MY if the computed value of T is greater than MW(1−α),
where MW(1−α) = nm − MW(α).
For the two-sided situation with
H0: MX = MY Vs HA: MX ≠ MY
computed values of T that are either sufficiently large or sufficiently small
will cause rejection of H0. The decision rule for this case, then, is:
Reject H0: MX = MY if the computed value of T is either less than MW(α/2) or
greater than MW(1−α/2), where MW(α/2) is the critical value of T for n, m, and
α/2 given in Quantiles of the Mann-Whitney test statistic, and
MW(1−α/2) = nm − MW(α/2).
For this example the decision rule is:
Reject H0 if the computed value of T is smaller than 15, the critical value of
the test statistic for n = 6, m = 10, and α = 0.05 found in the Quantiles of the
Mann-Whitney test statistic table.

8. Calculation of test statistic. For our present example we have, as
shown in the table of ranks, S = 65.5, so that
$$T = S - \frac{n(n+1)}{2} = 65.5 - \frac{6(6+1)}{2} = 65.5 - 21 = 44.5$$

9. Statistical Decision. When we enter the MW statistic table with
n = 6, m = 10, and α = 0.05, we find the critical value of MW to be 15.
Since computed T value > critical value of the test statistic
(i.e., 44.5 > 15),
we do not reject H0.
10. Conclusion. We cannot conclude that MX is less than MY. This leads to the
conclusion that junior students do not take less time than senior
students.
11. p value. We have for this test p = 0.0519 > 0.05 (i.e., p > α), so we do not
reject H0.
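The ranking and the computation of T in steps 5 to 9 can be sketched in Python as follows. This is a minimal illustration, not a library routine: the helper names `midranks` and `mann_whitney_t` are hypothetical, and the senior values are taken as they appear in the table of ranks (tenth value 18.0), since that is the version that reproduces S = 65.5. The critical value 15 is the tabulated value quoted in step 9.

```python
def midranks(values):
    """Rank values from 1..N, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of observations tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_t(x, y):
    """Return (S, T): the rank sum of x and T = S - n(n+1)/2."""
    ranks = midranks(list(x) + list(y))
    n = len(x)
    s = sum(ranks[:n])
    return s, s - n * (n + 1) / 2


junior = [17.4, 17.6, 18.0, 15.3, 16.8, 14.2]
# Senior values as listed in the table of ranks (assumption: tenth value 18.0).
senior = [14.4, 16.5, 16.5, 14.1, 15.9, 13.7, 16.7, 14.0, 14.7, 18.0]

s, t = mann_whitney_t(junior, senior)
critical_value = 15            # tabulated MW for n = 6, m = 10, alpha = 0.05
reject_h0 = t < critical_value  # lower-tail rule for HA: MX < MY
print(s, t, reject_h0)
```

With these inputs the sketch gives S = 65.5 and T = 44.5, and the lower-tail rule does not reject H0, in agreement with the hand calculation.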

5). Wald-Wolfowitz Run Test


Purpose: Testing the randomness of a given set of observations.

Procedure
Let X1, X2, X3, ..., Xn be the set of observations arranged in the order in which
they occur, where Xi is the i-th observation in the outcome of an experiment. Then
for each of the observations, we see if it is above or below the median of
the observations and write A if the observation is above and B if it is below
the median value. Thus we get a sequence of As and Bs of the type, say,
A | B B B | A A | B B B B | A A A A | B | A A A ...     (1)
1     2      3       4          5      6      7    ...
Here 1, 2, 3, 4, 5, 6, 7, ... are runs.


Null Hypothesis:
H0: That the set of observations is random
Test statistic:
Let U = number of runs in sequence (1); U is a random variable with
$$E(U) = \frac{n+2}{2}, \qquad \mathrm{Var}(U) = \frac{n}{4}\cdot\frac{n-2}{n-1}, \qquad SD(U) = \sqrt{\mathrm{Var}(U)}$$
and we use the normal test
$$Z = \frac{U - E(U)}{SD(U)} \sim N(0,1), \text{ asymptotically} \qquad (2)$$

Example for Wald-Wolfowitz Run Test:
The following data are the teeth sizes of patients; we wish to test whether the
teeth sizes are randomly distributed among the patients. The median size of the
teeth is 3.4.
3.4   4.5   3.1   4.6   2.9   2.8   4.2   4.6   3.9   3.5   3.6
Solution:
Data. See the problem
Assumption. The given data are from a continuous variable. If the given data are
ordinal, no assumption is needed.
Null Hypothesis:
H0: That the set of observations is random
Level of significance. Let α = 0.05 or 5%
Test statistic:
i. Write A (above median) if the value > 3.4, write B (below
median) if the value < 3.4, and put - if the value = 3.4.
ii. An unbroken sequence of As (or Bs) is one run; where the As (or Bs) break
and the Bs (or As) start, the next run begins. Continue this process to cover
all As and Bs.

3.4   4.5   3.1   4.6   2.9   2.8   4.2   4.6   3.9   3.5   3.6
 -     A  |  B  |  A  |  B     B  |  A     A     A     A     A      ..(1)

Let U = number of runs in sequence (1); U is a random variable.
Here U = 5
n = number of observations = 11.
$$E(U) = \frac{n+2}{2} = \frac{11+2}{2} = \frac{13}{2} = 6.5$$
$$\mathrm{Var}(U) = \frac{n}{4}\cdot\frac{n-2}{n-1} = \frac{11}{4}\cdot\frac{11-2}{11-1} = \frac{11}{4}\cdot\frac{9}{10} = 2.75(0.9) \approx 2.48$$
$$SD(U) = \sqrt{\mathrm{Var}(U)} = \sqrt{2.48} \approx 1.57$$
And we use the normal test
$$Z = \frac{U - E(U)}{SD(U)} \sim N(0,1), \text{ asymptotically} \qquad (2)$$
Distribution of test statistic.
Critical values from the distribution of the test statistic Z, the standard
normal distribution, are given in the standard normal table for various
α values.
Decision rule.
Computed values of Z ≥ critical values of Z will cause rejection of H0.
Calculation of test statistic. For our present example we have
$$Z = \frac{U - E(U)}{SD(U)} = \frac{5 - 6.5}{1.573} = \frac{-1.5}{1.573} \approx -0.95$$
Statistical Decision. When we enter the Z statistic table with
α = 0.05, we find the critical value of Z to be 1.64.
Since computed Z value < critical value of the Z test statistic
(i.e., −0.95 < 1.64),
we do not reject H0.
Conclusion. We conclude that the given data are random.
p value. Here p = 0.8531. We have for this test p > 0.05 (i.e., p > α), so we do
not reject H0.
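The counting of runs and the normal statistic of formula (2) can be sketched in Python as below. The function names `count_runs` and `runs_z` are hypothetical. Following the worked example, the value equal to the median receives no letter, while n is kept at 11 as in the text; both choices are assumptions carried over from the example rather than general rules.

```python
import math


def count_runs(values, median):
    """Mark values A (above) or B (below) the median and count the runs."""
    signs = ["A" if v > median else "B" for v in values if v != median]
    if not signs:
        return signs, 0
    # A new run starts wherever the letter changes.
    changes = sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    return signs, changes + 1


def runs_z(u, n):
    """Z = (U - E(U)) / SD(U) with E(U) = (n+2)/2 and Var(U) = n(n-2) / (4(n-1))."""
    mean_u = (n + 2) / 2
    var_u = n * (n - 2) / (4 * (n - 1))
    return (u - mean_u) / math.sqrt(var_u)


sizes = [3.4, 4.5, 3.1, 4.6, 2.9, 2.8, 4.2, 4.6, 3.9, 3.5, 3.6]
signs, u = count_runs(sizes, median=3.4)
z = runs_z(u, n=len(sizes))   # n = 11, as used in the worked example
print("".join(signs), u, round(z, 2))
```

For these data the sketch finds U = 5 runs, and with E(U) = 6.5 and SD(U) ≈ 1.57 the formula evaluates to Z of about −0.95, which is below the critical value 1.64.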
6). Kruskal-Wallis One Way Analysis Of Variance By Ranks:

Purpose:
One-way analysis of variance may be used to test the null hypothesis
that several population means are equal. When the assumptions
underlying this technique are not met, that is,
i. when the populations from which the samples are drawn are not
normally distributed with equal variances, or
ii. when the data for analysis consist only of ranks,
a non-parametric alternative to the one-way analysis of variance may be
used to test the hypothesis of equal location parameters.
Procedure:
1) The n1, n2, n3, ..., nk observations from the k samples are combined
into a single series of size n and arranged in order of magnitude
from smallest to largest. The observations are then replaced by
ranks from 1, which is assigned to the smallest observation, to n,
which is assigned to largest observation. When two or more
observations have the same value, each observation is given the
mean of the ranks for which it is tied.
2) The ranks assigned to observations in each of the k groups are
added separately to give k rank sums.
3) The test statistic is
$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1) \qquad \text{(KW1)}$$
Where
k = the number of samples
nj = the number of observations in the j-th sample
n = the number of observations in all samples combined
Rj = the sum of ranks in the j-th sample
4) When there are three samples and five or fewer observations in
each sample, the significance of the computed H is determined by
consulting the corresponding table. When there are more than five
observations in one or more of the samples, H is compared with
tabulated values of χ² (chi-square) with k−1 degrees of freedom.
Example for Kruskal-Wallis one-way ANOVA test by ranks: Dental
surgery time in minutes of 13 experimental patients
________________
Sample I:     17   20   40   31   35
Sample II:     8    7    9    8
Sample III:    2    5    4    3
Solution:
1. Data: See the problem
2. Assumptions
a. The samples are independent random samples from their
respective populations.
b. The measurement scale employed is at least ordinal.
c. The distributions of the values in the sampled populations are
identical except for the possibility that one or more of the
populations are composed of values that tend to be larger
than those of the other populations.
3. Hypothesis:
H0: The Population centers are equal.
HA: At least one of the populations tends to exhibit larger
values than at least one of the other populations.

4. Level of significance
Let α = 0.01
5. Test statistic.
$$H = \frac{12}{n(n+1)} \sum_{j=1}^{k} \frac{R_j^2}{n_j} - 3(n+1)$$
Distribution of test statistic. Critical values of H for various sample sizes
and α levels are given in the critical values of the Kruskal-Wallis test
statistic table.
6. Decision rule:
The null hypothesis will be rejected if the
computed value of H ≥ critical value of H.
The null hypothesis may be accepted if the
computed value of H < critical value of H.
7. Calculation of test statistic.
When the three samples are combined into a single series and
ranked, the table of ranks is as shown below.
The data of the table replaced by ranks
____________________________
Sample I:     9     10    13    11    12
Sample II:    6.5   5     8     6.5
Sample III:   1     4     3     2

The ascending order of the data is
2, 3, 4, 5, 7, 8, 8, 9, 17, 20, 31, 35, 40
and the corresponding order of ranks is
1, 2, 3, 4, 5, 6.5, 6.5, 8, 9, 10, 11, 12, 13
Note: 8 occurs two times; these values take the mean of their ranks, i.e., the
mean of 6 and 7, which is (6+7)/2 = 6.5.
R1 = 9+10+13+11+12 = 55      (1st row total)
R2 = 6.5+5+8+6.5 = 26        (2nd row total)
R3 = 1+4+3+2 = 10            (3rd row total)

$$H = \frac{12}{13(13+1)} \sum_{j=1}^{3} \frac{R_j^2}{n_j} - 3(13+1)$$
$$= \frac{12}{13(14)} \left( \frac{55^2}{5} + \frac{26^2}{4} + \frac{10^2}{4} \right) - 3(14)$$
$$= \frac{12}{182} \left( \frac{3025}{5} + \frac{676}{4} + \frac{100}{4} \right) - 42$$
$$= \frac{12}{182} (605 + 169 + 25) - 42 = \frac{12}{182}(799) - 42 = 52.68 - 42 = 10.68$$

9. Statistical Decision:
From the Kruskal-Wallis statistic table, when the nj are 5, 4, and 4, the critical
value of H is 7.7604, with tabulated probability 0.009. The null
hypothesis can be rejected at the 0.01 level of significance, i.e.,
computed value of H ≥ critical value of H.
Here 10.68 > 7.7604.

10. Conclusion. We conclude that there is a difference in the average
surgery time among the three populations.
11. p value.
For this test, p < 0.009 (i.e., p < α), so we reject H0.
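The rank sums and the statistic H of (KW1) can be checked with a short Python sketch. The function name `kruskal_wallis_h` is hypothetical, the sample values are those implied by the table of ranks and the ordered data list, and no correction for ties is applied, matching the hand calculation.

```python
def kruskal_wallis_h(samples):
    """Return (rank sums, H) with H = 12/(n(n+1)) * sum(Rj^2/nj) - 3(n+1)."""
    pooled = sorted(v for sample in samples for v in sample)
    n = len(pooled)
    # Mid-rank for each distinct value: mean of the positions it occupies.
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = ((i + 1) + (j + 1)) / 2
        i = j + 1
    rank_sums = [sum(rank_of[v] for v in sample) for sample in samples]
    total = sum(r * r / len(sample) for r, sample in zip(rank_sums, samples))
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    return rank_sums, h


# Surgery times per sample, as implied by the table of ranks (assumption).
samples = [
    [17, 20, 40, 31, 35],  # Sample I
    [8, 7, 9, 8],          # Sample II
    [2, 5, 4, 3],          # Sample III
]
rank_sums, h = kruskal_wallis_h(samples)
print(rank_sums, round(h, 2))
```

The rank sums come out as 55, 26 and 10. Carrying full precision in 12/182, rather than rounding it to 0.0659, gives H ≈ 10.68, which exceeds the critical value 7.7604.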

RESEARCH PROCESS STEPS

The main area of study design is the sampling technique. Hence sampling and
study design are interrelated. Following are the suggested steps to be
followed in a study design (experiment, project, thesis, dissertation,
study, survey, etc.).
i). Area of study is mapped out and a proper title is given, which should
be precise and self-explanatory.
ii). Objectives of the study are sorted out.
iii). Review of literature in the connected or related area.
iv). Sampling design and sample size are determined.
v). Observer error and instrumental error are ruled out.
vi). Specify the type of study being conducted (case control,
prospective, etc.).
vii). Pilot study for proper finalization of format, questionnaire, etc.
viii). Proper recording of data.
ix). Proper statistical analysis.
x). Report preparation.

REFERENCE BOOKS:
1). Wayne W. Daniel (1999) BIOSTATISTICS: A Foundation for Analysis in the
Health Sciences, John Wiley & Sons, INC, New York.
2). G N Prabhakara(2006) BIOSTATISTICS, Jaypee Brothers, New Delhi.
3). P.S.S. Sundar Rao and J. Richard (2006) Introduction To Biostatistics And
Research Methods, Prentice-Hall of India Pvt. Ltd., New Delhi.
4). Dr. Soben Peter (2004) Essentials of Preventive
And Community Dentistry, Arya (Medi) Publishing House, New Delhi.
5). Rebecca G. Knapp and M. Clinton Miller III Clinical Epidemiology and
Biostatistics, Harwal, Malvern, Pennsylvania
And Some Pure Statistics Books:

S.C. Gupta and V.K. Kapoor - Introduction to Mathematical Statistics

S.C. Gupta and V.K. Kapoor - Applied Statistics

Part A

Answer all questions


(Minimum 5 lines for each)

5 x 2 = 10 Marks

UNIT I
1. Define biostatistics.
2. State the uses of biostatistics in dental research.
3. Explain the ratio and interval scales.
4. Distinguish between qualitative data and quantitative data.
5. What are the variables and scales in data collection?
6. Distinguish between nominal scale and ordinal scale.
7. Define frequency distribution.
8. Explain sample and population.
9. Explain individual data and grouped data.
10. Explain the pilot survey.
11. Explain simple random sampling. Give an example.
12. Explain the uses of pie and bar diagrams.
UNIT II
13. How do you determine the consistency of two sets of variables?
14. Explain correlation. Give an example.
15. Explain the use of correlation in dental practice.
16. Explain the uses of regression.
UNIT III
17. What is bias in research?
18. What are the errors in testing of hypothesis?
19. Explain the Chi-square test.
20. What is meant by small sample and large sample tests?
21. Explain the cohort study design.
UNIT IV
22. What is ANOVA? How will you use it in dental research?
23. Explain the Wilcoxon signed rank test.
24. Explain non-parametric tests.
25. What are the assumptions of non-parametric tests?
UNIT V
26. Write about the utilization of dentistry research.
27. Explain the descriptive approach in research.
28. Write about the importance of a bibliography.
29. What is a case study?
30. What is research?
31. What is a literature review?

Part B

(Minimum 15 lines for each) 5 x 6 = 30 Marks

UNIT I
1. Explain the role of biostatistics in dental research.
2. Explain biostatistics. State its applications in patient care.
3. Explain the methods of data collection.
4. Explain the applications of biostatistics in dentistry research.
5. Explain reliability and validity. How will you assess them?
6. Explain how study results are applied in patient care.

UNIT II
7. Explain the various measures of central tendency and illustrate these
with examples.
8. Explain the applications of descriptive statistics in dentistry research.
9. Determine the mean and standard deviation of each of the sets of
analytical measurements. Which is more precise?
A: 29.5   45.3   28.8   42.9   46.6   24.0   32.7   28.0
B: 35.2   34.2   33.0   35.9   33.7   38.2   33.1   34.5
10. Explain the scatter diagram. What do you infer from a scatter diagram?
11. Explain the two regression equations with an example.
UNIT III
12. State the properties and applications of sampling distributions.
13. Explain the procedure for testing a statistical hypothesis in dental research.
14. Two types of treatment were tried for a group of patients with bleeding
teeth disease and their outcome was measured as improvement or no
improvement.
                 Outcome
Details          Improvement   No Improvement
New Treatment    38            7
Conventional     39            17
Check whether the new method is effective.

UNIT IV
15. Explain ANOVA. State its applications in dentistry research.
16. Explain the method of the Mann-Whitney U test. State its importance.
17. Explain the method of Kruskal-Wallis one-way analysis of variance by
ranks, with an example.
18. Explain the procedure for the Wald-Wolfowitz run test.


UNIT V
19. Formation of a hypothesis is key in dentistry research. Discuss,
and identify the role of the researcher towards evidence-based practice.
20. Explain the steps in thesis report writing.
21. Explain the criteria of good research.
22. How do you frame the research study? Explain the various steps in
preparing a scientific report.
23. Explain the research proposal.
24. Explain the research process.
25. How is statistical estimation helpful in clinical trial research?