
Statistics is simply a collection of tools that researchers employ to help answer research questions.

INTRODUCTION

Statistics plays a vitally important role in research.

Health information is very often explained in statistical terms.

Many decisions in the Health Sciences are made on the basis of statistical studies.

Statistics enables you:
o to read and evaluate reports and other literature
o to undertake independent research investigations
o to describe data in meaningful terms

DEFINITIONS
1. Statistics: the study of how to collect, organize, analyze, and interpret data.
2. Data: the values recorded in an experiment or observation.
3. Population: any collection of individual items or units that are the subject of investigation.
4. Sample: a small representative part of a population.
5. Observation: each unit in the sample provides a record, such as a measurement, which is called an observation.
6. Sampling: the process of obtaining a sample from a population.
7. Variable: a characteristic of an item or individual that can take different values.
8. Raw Data: data collected in original form.
9. Frequency: the number of times a certain value or class of values occurs.
10. Tabulation: the logical and systematic arrangement of statistical data in rows and columns.
11. Frequency Distribution: the organization of raw data in table form with classes and frequencies.
12. Class Limits: separate one class in a grouped frequency distribution from another. The limits can actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.
13. Class Boundaries: separate one class in a grouped frequency distribution from another without gaps; they fall between the upper limit of one class and the lower limit of the next and do not appear in the data.
14. Cumulative Frequency: the number of values less than the upper class boundary for the current class; a running total of the frequencies.
15. Histogram: a graph which displays the data by using vertical bars of various heights to represent frequencies.
16. Frequency Polygon: a line graph in which the frequency is placed along the vertical axis and the class midpoints along the horizontal axis; these points are connected with lines.
17. Pie Chart: a graphical depiction of data as slices of a pie. The frequency determines the size of the slice; the number of degrees in any slice is the relative frequency times 360 degrees.
18. Central Tendency: a typical or representative value for a dataset.
VARIABLES

A variable is a characteristic of an item or individual that can take different values.

Variables are of two types:
o Quantitative: a variable with a numeric value, e.g. age, weight.
o Qualitative: a variable with a category or group value, e.g. gender (M/F), religion (H/M/C), qualification (degree/PG).

Quantitative variables are of two types:
o Discrete/categorical variables
o Continuous variables

Variables can also be classified as:
o Independent: not influenced by other variables; they are not influenced by the event, but could influence the event.
o Dependent: the variable which is influenced by the others is often referred to as the dependent variable.

E.g. In an experimental study on a relaxation intervention for reducing hypertension (HTN), blood pressure is the dependent variable, while relaxation training, age, and gender are independent variables.
SAMPLING

Sampling is the process of getting a representative fraction of a population.

Analysis of the sample gives an idea of the population.

Methods of sampling:
o Random (probability) sampling
   Simple random sampling
   Stratified random sampling
   Cluster sampling
o Non-random sampling
   Convenience sampling
   Purposive sampling
   Quota sampling

In simple random sampling, each individual of the population has an equal chance of being included in the sample. Two methods are used in simple random sampling:
o Random numbers method
o Lottery method

In stratified random sampling, the population is divided into groups, or strata, on the basis of certain characteristics.

In cluster sampling, the whole population is divided into a number of relatively small clusters. Then some of the clusters are randomly selected.

Convenience sampling is a type of non-probability sampling in which the sample is drawn from the part of the population that is readily available and convenient.

Purposive sampling is a type of non-probability sampling in which the researcher selects participants based on fulfillment of some criteria, e.g. treatment-naive patients with schizophrenia.
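A minimal Python sketch of simple random and stratified random sampling; the patient identifiers, the two strata, and the sample sizes are hypothetical, used only for illustration.

    import random

    # Hypothetical sampling frame: identifiers for a population of 500 patients.
    population = [f"patient_{i:03d}" for i in range(1, 501)]

    # Simple random sampling: every individual has an equal chance of selection;
    # random.sample draws k units without replacement (the random numbers method).
    simple_random = random.sample(population, k=50)

    # Stratified random sampling (illustrative): split the frame into strata on
    # some characteristic, then draw a simple random sample from each stratum.
    strata = {"stratum_A": population[:250], "stratum_B": population[250:]}
    stratified = {name: random.sample(units, k=25) for name, units in strata.items()}

    print(len(simple_random), {name: len(s) for name, s in stratified.items()})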

SCALES OF MEASUREMENT
Four measurement scales are used: nominal, ordinal, interval, and ratio.

Each level has its own rules and restrictions.

Nominal Scale of Measurement

Nominal variables name categories of people, events, and other phenomena.

Examples: gender, age-class, religion, type of disease, blood groups A, B, AB, and O.

The categories are exhaustive and mutually exclusive.

These categories are discrete and non-continuous.

Permissible statistical operations are: counting of frequencies, percentage, proportion, mode, and the coefficient of contingency.

Ordinal Scale of Measurement

It is second in terms of its refinement as a means of classifying information.

It incorporates the functions of the nominal scale.

The ordinal scale is used to arrange (or rank) individuals into a sequence ranging from the highest to the lowest.

Ordinal implies rank-ordered from highest to lowest.

Examples: grades A+, A, B+, B, C+, C; ranks 1st, 2nd, 3rd, etc.

Interval Scale of Measurement

The interval scale is the third level of measurement in relation to the complexity of statistical techniques used to analyze data.

It is quantitative in nature.

The individual units are equidistant from one point to the other.

Interval data do not have an absolute zero.

E.g. temperature measured in Celsius or Fahrenheit.

Ratio Scale of Measurement

There are equal distances between the increments.

This scale has an absolute zero.

Ratio variables exhibit the characteristics of ordinal and interval measurement.

E.g. variables like time, length, and weight are ratio scales; they can also be measured on a nominal or ordinal scale.

[The mathematical properties of interval and ratio scales are very similar, so the statistical procedures are common to both scales.]

PROCESSING OF DATA

The first step in processing of data is classification and tabulation.

Classification is the process of arranging data on the basis of some common characteristics possessed by them.

Two approaches in analysing data are:
o Descriptive statistics
o Inferential statistics

Descriptive statistics are concerned with describing the characteristics of frequency distributions. The common methods in descriptive analyses are:
o Measures of central tendency
o Measures of dispersion
o Tabulation, cross-tabulation, contingency table
o Line diagram, bar diagram, pie diagram
o Histogram, frequency polygon, frequency curve
o Quantiles, Q-Q plot
o Scatterplot

Inferential statistics help to decide whether the outcome of a study is a result of factors planned within the design of the study or determined by chance. Common inferential statistical tests are:
o t-tests
o Chi-square test
o Pearson correlation

Frequency Distribution

Simple depiction of all the data.

A frequency distribution is a statistical table containing groups of values according to the number of times each value occurs.

The data collected by an investigator are called raw data.

Raw data are ungrouped and not in order.

Raw data arranged in order are called an array: the data are arranged in ascending or descending order.

Frequency Distribution with Classes

It is constructed with class intervals.

It is a frequency distribution of a continuous series.

Raw data are first arranged as array data.

Then the data are divided into groups called classes.

The first class and the last class are fixed by looking at the lowest and highest values.

The lowest and highest numbers of each class are called the class limits (lower and upper).

The class limits may be set by two methods:
1. Inclusive method
2. Exclusive method
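As a sketch only, the Python snippet below builds a grouped frequency distribution by the exclusive method; the raw blood-pressure readings and the class width of 10 are hypothetical.

    # Hypothetical raw data: 20 systolic BP readings (mmHg).
    raw_data = [128, 142, 117, 135, 151, 124, 139, 146, 131, 122,
                158, 144, 137, 129, 148, 133, 141, 126, 154, 138]

    # Step 1: arrange the raw data in ascending order (the "array").
    array = sorted(raw_data)

    # Step 2: divide the data into classes of width 10 (exclusive method:
    # each class includes its lower limit and excludes its upper limit).
    width = 10
    start = (min(array) // width) * width      # lowest class starts at 110
    stop = max(array) + 1                      # highest value falls in 150-160

    frequency_table = {}
    for lower in range(start, stop, width):
        upper = lower + width
        frequency_table[f"{lower}-{upper}"] = sum(lower <= x < upper for x in array)

    for class_interval, frequency in frequency_table.items():
        print(class_interval, frequency)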

PRESENTATION OF DATA
1. Tabular presentation
2. Diagrammatic presentation
3. Graphical presentation

A. Tabular Presentation of Data

Arranging values in columns is called tabulation.

E.g. the amount of oxygen content in water samples:

Water sample    Amount of O2 (mL)
1               4.5
2               6.9
3               6.2
4               5.3

B. Diagrammatic Presentation of Data

It is a visual form of presentation of statistical data in which data are presented in the form of diagrams such as bars, lines, circles, and maps.

Advantages of diagrammatic presentation of data:
1. It is more attractive.
2. It simplifies complex information.
3. It saves time.
4. It helps to make comparisons.

Rules for drawing diagrams:
1. The diagram should have a title.
2. Proper scaling should be used.
3. An index must be given for better understanding of the diagram.

Common Types:
1. Line diagram
2. Pie diagram
3. Bar diagram

Line Diagram

E.g. A traffic survey shows the following vehicles passing a particular bus stop during one hour:

Vehicle         Frequency
Cars            45
Lorries         22
Motor cycles    6
Buses           3
Total           76

Pie Diagram

Example: blood groups of 50 students:

Group    Students
A        5
B        20
AB       10
O        15

Bar Diagram

Example: yield of various vegetables from a garden.
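A possible way to draw the pie and bar diagrams above in Python, assuming matplotlib is available; it uses the blood-group frequencies from the example.

    import matplotlib.pyplot as plt

    # Blood groups of 50 students (from the example above).
    groups = ["A", "B", "AB", "O"]
    students = [5, 20, 10, 15]

    fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(8, 4))

    # Pie diagram: slice size is proportional to frequency
    # (relative frequency x 360 degrees).
    ax_pie.pie(students, labels=groups, autopct="%1.0f%%")
    ax_pie.set_title("Blood groups of 50 students")

    # Bar diagram: one bar per category, height equal to frequency.
    ax_bar.bar(groups, students)
    ax_bar.set_xlabel("Blood group")
    ax_bar.set_ylabel("Number of students")
    ax_bar.set_title("Blood groups of 50 students")

    plt.tight_layout()
    plt.show()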

C. Graphical Presentation of Data

1. Presenting data in the form of graphs prepared on graph paper.
2. The graph has two axes: X and Y.
3. Usually, the independent variable is marked on the X-axis and the dependent variable on the Y-axis.
4. Common types:
o Histogram
o Frequency polygon
o Frequency curve

Histogram

1. A histogram is a graph containing frequencies in the form of vertical rectangles.
2. It is an area diagram.
3. It is the graphical presentation of a frequency distribution.
4. The X-axis is marked with class intervals.
5. The Y-axis is marked with frequencies.
6. A histogram differs from a bar diagram: the bar diagram is one-dimensional, whereas the histogram is two-dimensional.
7. Uses of a histogram:
o It gives a clear picture of the entire data.
o It simplifies complex data.
o The median and mode can be calculated from it.
o It facilitates comparison of two or more frequency distributions on the same graph.

Example: systolic blood pressure of a group of persons.

Systolic BP (mmHg)    Number of persons
100-109               7
110-119               16
120-129               19
130-139               31
140-149               41
150-159               23
160-169               10
170-179
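A minimal matplotlib sketch of the histogram for the table above (matplotlib assumed available). Strictly, the rectangles should span the class boundaries (99.5-109.5, and so on); integer class limits are used here for simplicity, and the 170-179 class is omitted because its frequency is not given.

    import matplotlib.pyplot as plt

    # Lower class limits and frequencies taken from the table above.
    class_edges = [100, 110, 120, 130, 140, 150, 160, 170]
    frequencies = [7, 16, 19, 31, 41, 23, 10]

    # A histogram is an area diagram: adjacent rectangles whose widths are the
    # class intervals and whose heights are the class frequencies.
    plt.bar(class_edges[:-1], frequencies, width=10, align="edge", edgecolor="black")
    plt.xlabel("Systolic BP (mmHg)")
    plt.ylabel("Number of persons")
    plt.title("Histogram of systolic blood pressure")
    plt.show()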

DESCRIPTIVE STATISTICS

Measures of central tendency

Measures of dispersion/variability

MEASURES OF CENTRAL TENDENCY

A measure of central tendency is a single number used to represent the centre of a grouped data set.

The basic measures are:
o Mean, median, and mode

For any symmetrical distribution, the mean, median, and mode will be identical.

Each measure is designed to represent a typical score.

The choice of which measure to use depends on:
o the shape of the distribution (whether normal or skewed), and
o the variable's level of measurement (whether the data are nominal, ordinal, or interval).

Mean

The mean (or average) is found by adding all the numbers and then dividing by how many numbers you added together.

It is the most common measure of central tendency.

Formula for calculation of the mean: Mean = Σx / n, i.e. the sum of all observations divided by the number of observations.

It is best for making predictions.

It is applicable under two conditions:
o scores are measured at the interval level, and
o the distribution is more or less normal (symmetrical).

Example:
3, 4, 5, 6, 7
3 + 4 + 5 + 6 + 7 = 25
25 divided by 5 = 5
The mean is 5.
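A quick check of the example using Python's standard statistics module:

    from statistics import mean

    scores = [3, 4, 5, 6, 7]          # example data from above
    print(sum(scores) / len(scores))  # 25 / 5 = 5.0, computed directly
    print(mean(scores))               # 5, using the standard library helper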

Advantages of the mean:
o It is the mathematical center of a distribution.
o Good for interval and ratio data.
o Does not ignore any information.
o Inferential statistics is based on the mathematical properties of the mean.

Disadvantages of the mean:
o Influenced by extreme scores and skewed distributions.
o May not exist in the data.

Median

When the numbers are arranged in numerical order, the middle one is the median.

50% of observations are above the median, 50% are below it.

Formula: the position of the median in the ordered data is (n + 1) / 2, where n is the number of observations.

Example:
3, 6, 2, 5, 7
Arrange in order: 2, 3, 5, 6, 7
The number in the middle is 5.
The median is 5.
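The same example in Python, showing the (n + 1) / 2 position rule and the standard library helper:

    from statistics import median

    data = [3, 6, 2, 5, 7]
    ordered = sorted(data)                 # [2, 3, 5, 6, 7]
    n = len(ordered)
    position = (n + 1) / 2                 # 3rd value in the ordered list
    print(ordered[int(position) - 1])      # 5
    print(median(data))                    # 5, using the standard library helper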

Advantages:
o Not influenced by extreme scores or a skewed distribution.
o Good with ordinal data.
o Easier to compute than the mean.
o Considered the typical observation.

Disadvantages:
o May not exist in the data.
o Does not take the actual values into account.

Mode

The number that occurs most frequently is the mode.

We usually find the mode by creating a frequency distribution in which we count how often each value occurs.

If we find that every value occurs only once, the distribution has no mode.

If we find that two or more values are tied as the most common, the distribution has more than one mode.

Example:
2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8
The number that occurs most frequently is 7.
The mode is 7.
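The example in Python (statistics.multimode requires Python 3.8 or later):

    from statistics import mode, multimode

    values = [2, 2, 2, 4, 5, 6, 7, 7, 7, 7, 8]
    print(mode(values))       # 7, the most frequent value
    print(multimode(values))  # [7]; multimode returns every mode if there is a tie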

Advantages:
o Good with nominal data.
o A bimodal distribution might verify clinical observations (e.g. pre- and post-menopausal breast cancer).
o Easy to compute and understand.
o The score exists in the data set.

Disadvantages:
o Ignores most of the information in a distribution.
o Small samples may not have a mode.
o More than one mode might exist.

Appropriate Measures of Central Tendency

Nominal variables - Mode

Ordinal variables - Median

Interval-level variables - Mean, if the distribution is normal (the median is better with a skewed distribution)

MEASURES OF VARIABILITY

If there were no variability within populations, there would be no need for statistics.

Three indices are used to measure variation or dispersion among scores:
o range,
o variance, and
o standard deviation (Cozby, 2000).

These indices answer the question: how spread out is the distribution?

Dispersion/deviation/spread tells us a lot about how a variable is distributed.

Range

The range is the simplest method of examining variation among scores.

It refers to the difference between the highest and lowest values produced.

For continuous variables, the range is the arithmetic difference between the highest and lowest observations in the sample. In the case of counts or measurements, 1 should be added to the difference because the range is inclusive of the extreme observations.

Another statistic, known as the interquartile range, describes the interval of scores bounded by the 25th and 75th percentile ranks; it covers the middle 50 percent of the distribution.

Percentiles (or Quartiles)

The first quartile is the 25th percentile (denoted Q1), the median is the 50th percentile, and the third quartile is the 75th percentile (denoted Q3).

A percentile is a value at or below which a given percentage or fraction of the variable values lie.

The p-th percentile is the value that has p% of the measurements below it and (100 - p)% above it.

Thus, the 20th percentile is the value such that one fifth of the data lie below it. It is higher than 20% of the data values and lower than 80% of the data values.

E.g. if you are in the 80th percentile on a section of the GMAT, you scored better on that section than 80% of the students taking the GMAT.
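A short Python sketch of the range, quartiles, and interquartile range using the standard library (statistics.quantiles requires Python 3.8 or later); the scores are hypothetical.

    from statistics import quantiles

    scores = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical scores

    # Range: difference between the highest and lowest values.
    value_range = max(scores) - min(scores)

    # Quartiles: quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, q2, q3 = quantiles(scores, n=4)
    iqr = q3 - q1                        # interquartile range: middle 50% of the scores

    print(value_range, q1, q2, q3, iqr)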

Standard Deviation

The standard deviation is the most widely applied measure of variability.

It shows how much variation there is from the "average" (mean).

Large standard deviations suggest that scores are probably widely scattered.

Small standard deviations suggest that there is very little difference among scores.

Computational formula for the population standard deviation: SD = √( Σ(x - mean)² / N ), where N is the number of observations.

Example (adapted from Wikipedia):

Consider a population consisting of the values 2, 4, 4, 4, 5, 5, 7, 9.

There are eight data points in total, with a mean (or average) value of 5: (2 + 4 + 4 + 4 + 5 + 5 + 7 + 9) / 8 = 40 / 8 = 5.

To calculate the population standard deviation, first compute the difference of each data point from the mean, and square the result: 9, 1, 1, 1, 0, 0, 4, 16.

Next, divide the sum of these values by the number of values and take the square root to give the standard deviation: √((9 + 1 + 1 + 1 + 0 + 0 + 4 + 16) / 8) = √(32 / 8) = √4 = 2.

Therefore, this population has a standard deviation of 2.

Variance

The square of the standard deviation is the variance.
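The worked example reproduced in Python, both step by step and with the standard library helpers pstdev and pvariance:

    from statistics import pstdev, pvariance

    # Population values from the worked example above.
    population = [2, 4, 4, 4, 5, 5, 7, 9]

    mean = sum(population) / len(population)                 # 5.0
    squared_diffs = [(x - mean) ** 2 for x in population]    # 9, 1, 1, 1, 0, 0, 4, 16
    sd = (sum(squared_diffs) / len(population)) ** 0.5       # sqrt(32 / 8) = 2.0

    print(sd, pstdev(population), pvariance(population))     # 2.0  2.0  4 (variance = SD squared)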

INFERENTIAL STATISTICS: COMMON TESTS

Chi-Square Test

The chi-square test is an inferential statistical technique designed to test for significant relationships between two variables organized in a bivariate table.

Chi-square requires no assumptions about the shape of the population distribution from which a sample is drawn. However, like all inferential techniques, it assumes random sampling.

It can be applied to variables measured at a nominal and/or an ordinal level of measurement.

The research hypothesis (H1) proposes that the two variables are related in the population.

The null hypothesis (H0) states that no association exists between the two cross-tabulated variables in the population, and therefore the variables are statistically independent.

Formula for computing the chi-square statistic:

chi-square = Σ (O - E)² / E

where O = observed frequency and E = expected frequency.

The essence of the chi-square test is to compare the observed frequencies with the frequencies expected under independence.

If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.

Determining the degrees of freedom:

df = (r - 1)(c - 1)

where r = the number of rows and c = the number of columns.
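A minimal sketch of the test in Python using SciPy (assumed to be installed); the 2x2 table of observed counts is hypothetical.

    from scipy.stats import chi2_contingency

    # Hypothetical 2x2 cross-tabulation: group vs. outcome.
    observed = [
        [30, 20],   # treatment group: improved, not improved
        [18, 32],   # control group:   improved, not improved
    ]

    # correction=False gives the plain Pearson chi-square from the formula above;
    # by default SciPy applies Yates' continuity correction to 2x2 tables.
    chi2, p_value, df, expected = chi2_contingency(observed, correction=False)

    print(chi2, p_value, df)   # df = (2 - 1)(2 - 1) = 1
    print(expected)            # frequencies expected under independence (H0)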
