
Statistics is simply a collection of tools that researchers employ to help answer research questions.

INTRODUCTION

Statistics plays a vitally important role in research.

Health information is very often explained in statistical terms.

Many decisions in the Health Sciences are made on the basis of statistical studies.

Statistics enables you:
o to read and evaluate reports and other literature
o to undertake independent research investigations
o to describe data in meaningful terms

DEFINITIONS
1. Statistics: the study of how to collect, organize, analyze, and interpret data.
2. Data: the values recorded in an experiment or observation.
3. Population: any collection of individual items or units that are the subject of investigation.
4. Sample: a small representative part of a population.
5. Observation: each unit in the sample provides a record, such as a measurement, which is called an observation.
6. Sampling: the process of obtaining a sample from a population.
7. Variable: a characteristic of an item or individual that can take different values.
8. Raw Data: data collected in original form.
9. Frequency: the number of times a certain value or class of values occurs.
10. Tabulation: the logical and systematic arrangement of statistical data in rows and columns.
11. Frequency Distribution: the organization of raw data in table form with classes and frequencies.
12. Class Limits: separate one class in a grouped frequency distribution from another. The limits can actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.
13. Class Boundaries: separate one class in a grouped frequency distribution from another without gaps; they fall between the upper limit of one class and the lower limit of the next and do not appear in the data.
14. Cumulative Frequency: the number of values less than the upper class boundary for the current class; a running total of the frequencies.
15. Histogram: a graph which displays the data by using vertical bars of various heights to represent frequencies.
16. Frequency Polygon: a line graph in which the frequency is placed along the vertical axis and the class midpoints along the horizontal axis; these points are connected with lines.
17. Pie Chart: a graphical depiction of data as slices of a pie. The frequency determines the size of the slice; the number of degrees in any slice is the relative frequency times 360 degrees.
18. Central Tendency: a typical or representative value for a dataset.
VARIABLES

A variable is a characteristic of an item or individual that can take different values.

Variables are of two types:
o Quantitative: a variable with a numeric value, e.g. age, weight.
o Qualitative: a variable with a category or group value, e.g. gender (M/F), religion (H/M/C), qualification (degree/PG).

Quantitative variables are of two types:
o Discrete/categorical variables
o Continuous variables

Variables can also be classified as:
o Independent: not influenced by other variables; they are not influenced by the event, but could influence the event.
o Dependent: the variable which is influenced by the others is often referred to as the dependent variable.

E.g. In an experimental study on a relaxation intervention for reducing hypertension (HTN), blood pressure is the dependent variable, while relaxation training, age, and gender are independent variables.
SAMPLING

Sampling is the process of getting a representative fraction of a population.

Analysis of the sample gives an idea of the population.

Methods of sampling:
o Random (probability) sampling
   Simple random sampling
   Stratified random sampling
   Cluster sampling
o Non-random sampling
   Convenience sampling
   Purposive sampling
   Quota sampling

In simple random sampling, each individual of the population has an equal chance of being included in the sample. Two methods are used in simple random sampling:
o Random numbers method
o Lottery method

In stratified random sampling, the population is divided into groups, or strata, on the basis of certain characteristics.

In cluster sampling, the whole population is divided into a number of relatively small clusters. Then some of the clusters are randomly selected.

Convenience sampling is a type of non-probability sampling in which the sample is drawn from the part of the population that is readily available and convenient.

Purposive sampling is a type of non-probability sampling in which the researcher selects participants based on fulfillment of some criteria, e.g. treatment-naive patients with schizophrenia.
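A minimal Python sketch of simple random and stratified random sampling; the patient identifiers, the two strata, and the sample sizes are hypothetical, used only for illustration.

    import random

    # Hypothetical sampling frame: identifiers for a population of 500 patients.
    population = [f"patient_{i:03d}" for i in range(1, 501)]

    # Simple random sampling: every individual has an equal chance of selection;
    # random.sample draws k units without replacement (the random numbers method).
    simple_random = random.sample(population, k=50)

    # Stratified random sampling (illustrative): split the frame into strata on
    # some characteristic, then draw a simple random sample from each stratum.
    strata = {"stratum_A": population[:250], "stratum_B": population[250:]}
    stratified = {name: random.sample(units, k=25) for name, units in strata.items()}

    print(len(simple_random), {name: len(s) for name, s in stratified.items()})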

SCALES OF MEASUREMENT
Four measurement scales are used: nominal, ordinal, interval, and ratio.

Each level has its own rules and restrictions.

Nominal Scale of Measurement

Nominal variables name categories of people, events, and other phenomena.

Examples: gender, age-class, religion, type of disease, blood groups A, B, AB, and O.

The categories are exhaustive and mutually exclusive.

These categories are discrete and non-continuous.

Permissible statistical operations are: counting of frequencies, percentage, proportion, mode, and the coefficient of contingency.

Ordinal Scale of Measurement

It is second in terms of its refinement as a means of classifying information.

It incorporates the functions of the nominal scale.

The ordinal scale is used to arrange (or rank) individuals into a sequence ranging from the highest to the lowest.

Ordinal implies rank-ordered from highest to lowest.

Examples: grades A+, A, B+, B, C+, C; ranks 1st, 2nd, 3rd, etc.

Interval Scale of Measurement

The interval scale is the third level of measurement in relation to the complexity of statistical techniques used to analyze data.

It is quantitative in nature.

The individual units are equidistant from one point to the other.

Interval data do not have an absolute zero.

E.g. temperature measured in Celsius or Fahrenheit.

Ratio Scale of Measurement

There are equal distances between the increments.

This scale has an absolute zero.

Ratio variables exhibit the characteristics of ordinal and interval measurement.

E.g. variables like time, length, and weight are ratio scales; they can also be measured on a nominal or ordinal scale.

[The mathematical properties of interval and ratio scales are very similar, so the statistical procedures are common to both scales.]

PROCESSING OF DATA

The first step in processing of data is classification and tabulation.

Classification is the process of arranging data on the basis of some common characteristics possessed by them.

Two approaches in analysing data are:
o Descriptive statistics
o Inferential statistics

Descriptive statistics are concerned with describing the characteristics of frequency distributions. The common methods in descriptive analyses are:
o Measures of central tendency
o Measures of dispersion
o Tabulation, cross-tabulation, contingency table
o Line diagram, bar diagram, pie diagram
o Histogram, frequency polygon, frequency curve
o Quantiles, Q-Q plot
o Scatterplot

Inferential statistics help to decide whether the outcome of a study is a result of factors planned within the design of the study or determined by chance. Common inferential statistical tests are:
o t-tests
o Chi-square test
o Pearson correlation

Frequency Distribution

Simple depiction of all the data.

A frequency distribution is a statistical table containing groups of values according to the number of times each value occurs.

The data collected by an investigator are called raw data.

Raw data are ungrouped and not in order.

Raw data arranged in order are called an array: the data are arranged in ascending or descending order.

Frequency Distribution with Classes

It is constructed with class intervals.

It is a frequency distribution of a continuous series.

Raw data are first arranged as array data.

Then the data are divided into groups called classes.

The first class and the last class are fixed by looking at the lowest and highest values.

The lowest and highest numbers of each class are called the class limits (lower and upper).

The class limits may be set by two methods:
1. Inclusive method
2. Exclusive method
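As a sketch only, the Python snippet below builds a grouped frequency distribution by the exclusive method; the raw blood-pressure readings and the class width of 10 are hypothetical.

    # Hypothetical raw data: 20 systolic BP readings (mmHg).
    raw_data = [128, 142, 117, 135, 151, 124, 139, 146, 131, 122,
                158, 144, 137, 129, 148, 133, 141, 126, 154, 138]

    # Step 1: arrange the raw data in ascending order (the "array").
    array = sorted(raw_data)

    # Step 2: divide the data into classes of width 10 (exclusive method:
    # each class includes its lower limit and excludes its upper limit).
    width = 10
    start = (min(array) // width) * width      # lowest class starts at 110
    stop = max(array) + 1                      # highest value falls in 150-160

    frequency_table = {}
    for lower in range(start, stop, width):
        upper = lower + width
        frequency_table[f"{lower}-{upper}"] = sum(lower <= x < upper for x in array)

    for class_interval, frequency in frequency_table.items():
        print(class_interval, frequency)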

PRESENTATION OF DATA
1. Tabular presentation
2. Diagrammatic presentation
3. Graphical presentation

A. Tabular Presentation of Data

Arranging values in columns is called tabulation.

E.g. the amount of oxygen content in water samples:

Water sample    Amount of O2 (mL)
1               4.5
2               6.9
3               6.2
4               5.3

B. Diagrammatic Presentation of Data

It is a visual form of presentation of statistical data in which data are presented in the form of diagrams such as bars, lines, circles, and maps.

Advantages of diagrammatic presentation of data:
1. It is more attractive.
2. It simplifies complex information.
3. It saves time.
4. It helps to make comparisons.

Rules for drawing diagrams:
1. The diagram should have a title.
2. Proper scaling should be used.
3. An index must be given for better understanding of the diagram.

Common Types:
1. Line diagram
2. Pie diagram
3. Bar diagram

Line Diagram

E.g. A traffic survey shows the following vehicles passing a particular bus stop during one hour:

Vehicle         Frequency
Cars            45
Lorries         22
Motor cycles    6
Buses           3
Total           76

Pie Diagram

Example: blood groups of 50 students:

Group    Students
A        5
B        20
AB       10
O        15

Bar Diagram

Example: yield of various vegetables from a garden.
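A possible way to draw the pie and bar diagrams above in Python, assuming matplotlib is available; it uses the blood-group frequencies from the example.

    import matplotlib.pyplot as plt

    # Blood groups of 50 students (from the example above).
    groups = ["A", "B", "AB", "O"]
    students = [5, 20, 10, 15]

    fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 4))

    # Pie diagram: slice size is proportional to frequency
    # (relative frequency x 360 degrees).
    ax_pie.pie(students, labels=groups, autopct="%1.0f%%")
    ax_pie.set_title("Blood groups of 50 students")

    # Bar diagram: one bar per category, height equal to frequency.
    ax_bar.bar(groups, students)
    ax_bar.set_xlabel("Blood group")
    ax_bar.set_ylabel("Number of students")
    ax_bar.set_title("Blood groups of 50 students")

    plt.tight_layout()
    plt.show()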

C. Graphical Presentation of Data

1. Presenting data in the form of graphs prepared on graph paper.
2. The graph has two axes: X and Y.
3. Usually, the independent variable is marked on the X-axis and the dependent variable on the Y-axis.
4. Common types:
o Histogram
o Frequency polygon
o Frequency curve

Histogram

1. A histogram is a graph containing frequencies in the form of vertical rectangles.
2. It is an area diagram.
3. It is the graphical presentation of a frequency distribution.
4. The X-axis is marked with class intervals.
5. The Y-axis is marked with frequencies.
6. A histogram differs from a bar diagram: the bar diagram is one-dimensional, whereas the histogram is two-dimensional.
7. Uses of a histogram:
o It gives a clear picture of the entire data.
o It simplifies complex data.
o The median and mode can be calculated from it.
o It facilitates comparison of two or more frequency distributions on the same graph.

Example: systolic blood pressure of a group of persons.

Systolic BP (mmHg)    Number of persons
100-109               7
110-119               16
120-129               19
130-139               31
140-149               41
150-159               23
160-169               10
170-179
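A minimal matplotlib sketch of the histogram for the table above (matplotlib assumed available). Strictly, the rectangles should span the class boundaries (99.5-109.5, and so on); integer class limits are used here for simplicity, and the 170-179 class is omitted because its frequency is not given.

    import matplotlib.pyplot as plt

    # Lower class limits and frequencies taken from the table above.
    class_edges = [100, 110, 120, 130, 140, 150, 160, 170]
    frequencies = [7, 16, 19, 31, 41, 23, 10]

    # A histogram is an area diagram: adjacent rectangles whose widths are the
    # class intervals and whose heights are the class frequencies.
    plt.bar(class_edges[:-1], frequencies, width=10, align="edge", edgecolor="black")
    plt.xlabel("Systolic BP (mmHg)")
    plt.ylabel("Number of persons")
    plt.title("Histogram of systolic blood pressure")
    plt.show()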

DESCRIPTIVE STATISTICS

Measures of central tendency

Measures of dispersion/variability

MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a single number used to represent the centre of a grouped data set.

The basic measures are:
o Mean, median, and mode

For any symmetrical distribution, the mean, median, and mode will be identical.

Each measure is designed to represent a typical score.

The choice of which measure to use depends on:
o the shape of the distribution (whether normal or skewed), and
o the variable's level of measurement (whether the data are nominal, ordinal, or interval).

Mean

The mean (or average) is found by adding all the numbers and then dividing by how many numbers you added together.

It is the most common measure of central tendency.

Formula for calculation of the mean: Mean = Σx / n, i.e. the sum of all observations divided by the number of observations.

It is best for making predictions.

It is applicable under two conditions:
o scores are measured at the interval level, and
o the distribution is more or less normal (symmetrical).

Example:
3, 4, 5, 6, 7
3 + 4 + 5 + 6 + 7 = 25
25 divided by 5 = 5
The mean is 5.
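A quick check of the example using Python's standard statistics module:

    from statistics import mean

    scores = [3, 4, 5, 6, 7]          # example data from above
    print(sum(scores) / len(scores))  # 25 / 5 = 5.0, computed directly
    print(mean(scores))               # 5, using the standard library helper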

Advantages of the mean:
o It is the mathematical center of a distribution.
o Good for interval and ratio data.
o Does not ignore any information.
o Inferential statistics is based on the mathematical properties of the mean.

Disadvantages of the mean:
o Influenced by extreme scores and skewed distributions.
o May not exist in the data.

Median

When the numbers are arranged in numerical order, the middle one is the median.

50% of observations are above the median, 50% are below it.

Formula: the position of the median in the ordered data is (n + 1) / 2, where n is the number of observations.

Example:
3, 6, 2, 5, 7
Arrange in order: 2, 3, 5, 6, 7
The number in the middle is 5.
The median is 5.
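The same example in Python, showing the (n + 1) / 2 position rule and the standard library helper:

    from statistics import median

    data = [3, 6, 2, 5, 7]
    ordered = sorted(data)                 # [2, 3, 5, 6, 7]
    n = len(ordered)
    position = (n + 1) / 2                 # 3rd value in the ordered list
    print(ordered[int(position) - 1])      # 5
    print(median(data))                    # 5, using the standard library helper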

Advantages:
o Not influenced by extreme scores or a skewed distribution.
o Good with ordinal data.
o Easier to compute than the mean.
o Considered the typical observation.

Disadvantages:
o May not exist in the data.
o Does not take the actual values into account.

Mode

The number that occurs most frequently is the mode.

We usually find the mode by creating a frequency distribution in which we count how often each value occurs.

If we find that every value occurs only once, the distribution has no mode.

If we find that two or more values are tied as the most common, the distribution has more than one mode.

Example:
2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8
The number that occurs most frequently is 7.
The mode is 7.
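The example in Python (statistics.multimode requires Python 3.8 or later):

    from statistics import mode, multimode

    values = [2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8]
    print(mode(values))       # 7, the most frequent value
    print(multimode(values))  # [7]; multimode returns every mode if there is a tie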

Advantages:
o Good with nominal data.
o A bimodal distribution might verify clinical observations (e.g. pre- and post-menopausal breast cancer).
o Easy to compute and understand.
o The score exists in the data set.

Disadvantages:
o Ignores most of the information in a distribution.
o Small samples may not have a mode.
o More than one mode might exist.

Appropriate Measures of Central Tendency

Nominal variables - Mode

Ordinal variables - Median

Interval-level variables - Mean, if the distribution is normal (the median is better with a skewed distribution)

MEASURES OF VARIABILITY

If there were no variability within populations, there would be no need for statistics.

Three indices are used to measure variation or dispersion among scores:
o range,
o variance, and
o standard deviation (Cozby, 2000).

These indices answer the question: how spread out is the distribution?

Dispersion/deviation/spread tells us a lot about how a variable is distributed.

Range

The range is the simplest method of examining variation among scores.

It refers to the difference between the highest and lowest values produced.

For continuous variables, the range is the arithmetic difference between the highest and lowest observations in the sample. In the case of counts or measurements, 1 should be added to the difference because the range is inclusive of the extreme observations.

Another statistic, known as the interquartile range, describes the interval of scores bounded by the 25th and 75th percentile ranks; it covers the middle 50 percent of the distribution.

Percentiles (or Quartiles)

The first quartile is the 25th percentile (denoted Q1), the median is the 50th percentile, and the third quartile is the 75th percentile (denoted Q3).

A percentile is a value at or below which a given percentage or fraction of the variable values lie.

The p-th percentile is the value that has p% of the measurements below it and (100 - p)% above it.

Thus, the 20th percentile is the value such that one fifth of the data lie below it. It is higher than 20% of the data values and lower than 80% of the data values.

E.g. if you are in the 80th percentile on a section of the GMAT, you scored better on that section than 80% of the students taking the GMAT.
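A short Python sketch of the range, quartiles, and interquartile range using the standard library (statistics.quantiles requires Python 3.8 or later); the scores are hypothetical.

    from statistics import quantiles

    scores = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical scores

    # Range: difference between the highest and lowest values.
    value_range = max(scores) - min(scores)

    # Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = quantiles(scores, n=4)
    iqr = q3 - q1                        # interquartile range: middle 50% of the scores

    print(value_range, q1, q2, q3, iqr)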

Standard Deviation

The standard deviation is the most widely applied measure of variability.

It shows how much variation there is from the "average" (mean).

Large standard deviations suggest that scores are probably widely scattered.

Small standard deviations suggest that there is very little difference among scores.

Computational formula for the population standard deviation: SD = √( Σ(x - mean)² / N ), where N is the number of observations.

Example (adapted from Wikipedia):

Consider a population consisting of the values 2, 4, 4, 4, 5, 5, 7, 9.

There are eight data points in total, with a mean (or average) value of 5: (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5.

To calculate the population standard deviation, first compute the difference of each data point from the mean, and square the result: 9, 1, 1, 1, 0, 0, 4, 16.

Next, divide the sum of these values by the number of values and take the square root to give the standard deviation: √((9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8) = √(32 / 8) = √4 = 2.

Therefore, this population has a standard deviation of 2.

Variance

The square of the standard deviation is the variance.
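The worked example reproduced in Python, both step by step and with the standard library helpers pstdev and pvariance:

    from statistics import pstdev, pvariance

    # Population values from the worked example above.
    population = [2, 4, 4, 4, 5, 5, 7, 9]

    mean = sum(population) / len(population)                 # 5.0
    squared_diffs = [(x - mean) ** 2 for x in population]    # 9, 1, 1, 1, 0, 0, 4, 16
    sd = (sum(squared_diffs) / len(population)) ** 0.5       # sqrt(32 / 8) = 2.0

    print(sd, pstdev(population), pvariance(population))     # 2.0  2.0  4 (variance = SD squared)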

INFERENTIAL STATISTICS: COMMON TESTS

Chi-Square Test

The chi-square test is an inferential statistical technique designed to test for significant relationships between two variables organized in a bivariate table.

Chi-square requires no assumptions about the shape of the population distribution from which a sample is drawn. However, like all inferential techniques, it assumes random sampling.

It can be applied to variables measured at a nominal and/or an ordinal level of measurement.

The research hypothesis (H1) proposes that the two variables are related in the population.

The null hypothesis (H0) states that no association exists between the two cross-tabulated variables in the population, and therefore the variables are statistically independent.

Formula for computing the chi-square statistic:

chi-square = Σ (O - E)² / E

where O = observed frequency and E = expected frequency.

The essence of the chi-square test is to compare the observed frequencies with the frequencies expected under independence.

If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.

Determining the degrees of freedom:

df = (r - 1)(c - 1)

where r = the number of rows and c = the number of columns.
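A minimal sketch of the test in Python using SciPy (assumed to be installed); the 2x2 table of observed counts is hypothetical.

    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 cross-tabulation: group vs. outcome.
    observed = [
        [30, 20],   # treatment group: improved, not improved
        [18, 32],   # control group:   improved, not improved
    ]

    # correction=False gives the plain Pearson chi-square from the formula above;
    # by default SciPy applies Yates' continuity correction to 2x2 tables.
    chi2, p_value, df, expected = chi2_contingency(observed, correction=False)

    print(chi2, p_value, df)   # df = (2 - 1)(2 - 1) = 1
    print(expected)            # frequencies expected under independence (H0)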
