
Descriptive Statistics

Descriptive statistics: reducing a complex
mass of data to a manageable set of
information
Descriptive Statistics: the summary
and presentation of data to:
simplify the data
enable meaningful interpretation
support decision making
Numerical descriptive measures (few
numbers)
Graphical presentations
Descriptive Statistics
Measures of central tendency
Measures of dispersion
Data Presentation
Grouped data
What is the Mean?
The mean is the most common measure of
central tendency.
The arithmetic mean (or average) is
defined as the sum of all the observed
values, divided by the number of
observations.
The mean is a good way to describe the
center of a group of data if the values have
a more or less normal distribution
It may not well describe a group of data if a
few values are far from the rest (the data is
"skewed" or there are many "outliers").
When is the Mean Useful?
The mean is useful when you have a
normal distribution of data.
The mean is not very useful when you
have a non-normal (for example, strongly
skewed) distribution of data.
How is the Mean Calculated?
Let us calculate the mean for the data set below. This
sample data set represents the incubation period in
days for a group of 21 people who contracted
hepatitis. Look at the numbers:
16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
32, 33, 35, 36, 38, 40, 42
Applying the steps described previously, you simply
add up all the numbers and then divide by the
number of observations:
There are 21 data values (n = 21).
The sum of the 21 data values ($X_1 + X_2 + \dots + X_n$)
listed above is 629.
629 / 21 = 29.95. We can round this to 30.0 or 30.
The mean for the data set above, then, is approximately 30.
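The calculation above can be reproduced in a few lines of Python. This is a minimal sketch: the names `incubation_days` and `arithmetic_mean` are illustrative choices, not part of the original material.

```python
# Incubation periods (days) for the 21 hepatitis cases listed above.
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]


def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


n = len(incubation_days)          # number of observations
total = sum(incubation_days)      # X1 + X2 + ... + Xn
mean_days = arithmetic_mean(incubation_days)

print(n, total, round(mean_days, 2))  # prints: 21 629 29.95
```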
Median, What is the Median?
The median is a measure of central
tendency that is useful in representing data
that is skewed. "Skewed" simply means
that there are significantly more data points
with values below the mean than there are
above the mean (or vice-versa). With
skewed data, the normally centered hump
on the frequency distribution curve is offset
to the left or right of center. The median is
the value that divides the distribution of
values into two equal parts.
When is the Median Useful?
The median is useful when you have
a non-normal (or skewed) distribution
of data. A skewed distribution
shows up clearly if you present the
information in a graph.
How is the Median Determined?
The median is determined rather than calculated. That
is, the median is based on its relationship to other
data in the population rather than calculated
algebraically. The median of a set of observations is
the value that falls in the middle position when the
observations are ranked in order from the smallest to
the largest. The rules for determining the median are:
Rank the observations from the smallest to the
largest.
If the number of observations is odd, the median is
the middle number.
If the number of observations is even, the median is
the average of the two middle numbers.
An Exercise in Determining the Median
Let us take a sample of 8 people from this
community, assuming these people are
representative of the general population. Here
are the numbers representing net worth for these
people:
$2,000, $10,000, $25,000, $32,000, $45,000,
$50,000, $80,000, $3,000,000
We have an even number of observations (8).
The two middle values in the ordered list are
$32,000 and $45,000.
Therefore, the median is halfway between the
values of $32,000 and $45,000.
Calculate the average of these two numbers to
determine the median: $38,500.
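The ranking rules can be sketched in Python as follows. The function name `median` and the variable names are assumptions made for illustration. The sketch also computes the mean of the same eight values, to show how strongly the single $3,000,000 observation pulls the mean away from the median.

```python
def median(values):
    """Middle value of the ranked observations (average of the two middle
    values when the number of observations is even)."""
    ordered = sorted(values)              # rank from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                        # odd count: the middle number
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even count


# Net worth (dollars) for the sample of 8 people in the exercise above.
net_worth = [2_000, 10_000, 25_000, 32_000, 45_000, 50_000, 80_000, 3_000_000]

median_worth = median(net_worth)
mean_worth = sum(net_worth) / len(net_worth)

print(median_worth)  # prints: 38500.0
print(mean_worth)    # prints: 405500.0
```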
Mode, What is the Mode?
The mode is the value that occurs
with the greatest frequency in a set
of observations. If no value is
repeated within the set of
observations, then there is no mode.
If two or more values are repeated at
the same frequency, then each of
those observations is a mode. In a
normal or symmetrical distribution of
data, the mean, median, and mode
have the same values (or very close).
When is the Mode Useful?
The mode is useful if you are trying to
focus on the most frequent value for
a certain population. Although the
mode is seldom used in public health
statistics, it could be used to focus
attention on the modal (most
common) age group of a population
in the outbreak of a disease, or
establish some other modal
characteristic for a population
experiencing a disease.
How is the Mode Determined?
The mode of a set of observations is the
value that occurs with the greatest
frequency.
To determine the mode,
Rank the observations from the smallest to the
largest.
Evaluate the ranked data set by counting the
number of times each individual value occurs,
and
determine which value(s) occur with the
greatest frequency.
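A short Python sketch of this counting procedure is given below. It follows the convention stated earlier (no mode when no value repeats, several modes when values tie). The function name `modes` is an illustrative assumption, and the data reuse the incubation periods from the mean example.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.

    Returns an empty list when no value is repeated (no mode)."""
    if not values:
        return []
    counts = Counter(values)              # value -> number of occurrences
    highest = max(counts.values())
    if highest == 1:                      # nothing repeats: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)


incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]

print(modes(incubation_days))    # prints: [30]
print(modes([1, 1, 2, 2, 3]))    # prints: [1, 2]
print(modes([1, 2, 3]))          # prints: []
```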
Measures of Dispersion
Dispersion of a set of observations is the
variety exhibited by the observations
If all values are the same, there is no dispersion
The more the values are spread out, the greater the
dispersion
Many distributions are well described by
measures of location and dispersion
Common Measures of
Dispersion
Range
Variance
Standard deviation
Coefficient of variation (CV)
Standard Deviation of the Mean (SE)
Percentiles and Quartiles
When are measures of dispersion
useful?
If you are evaluating the norm for a
particular characteristic, like weight or
height, you need to establish the extremes
(lowest and highest values) in order to
assess what might be outside the norm. For
example, there are standards for weight in
proportion to height. Some people are very
heavy for their height, whereas others are
much lighter, even if they
are of the same height. The extremes of
this range can describe how far from the
norm a person's weight is when assessed
against their height.
Range
What is the Range?
The range is calculated as the
difference between the smallest and
the largest values in a set of data.
It is heavily influenced by the two most
extreme values and ignores the rest
of the distribution.
Range
When is the Range Useful?
The range is an adequate measure of
variation for a small set of data, like
class scores for a test. Think of other
measures where range might be
useful: Salaries for a particular job
category; or Indoor versus outdoor
temperatures?
Range
How is the Range Calculated?
The range is calculated by subtracting
the smallest value in the data set from
the largest value in the data set:
Range = Largest value - Smallest value
Data set A: 8, 9, 10, 10, 11, 12
Data set B: 5, 6, 10, 10, 14, 15
The range for A: 4, The range for B: 10
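The same subtraction in Python, as a minimal sketch (the name `value_range` is an illustrative choice). Both data sets have the same mean of 10, so the range is what distinguishes them here.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


data_set_a = [8, 9, 10, 10, 11, 12]
data_set_b = [5, 6, 10, 10, 14, 15]

print(value_range(data_set_a))  # prints: 4
print(value_range(data_set_b))  # prints: 10
```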
Variance
Variance measures the spread of values
around their mean
Definition of sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
Degrees of freedom
n - 1 is used because if we know n - 1 deviations,
the nth deviation is known
Deviations from the mean have to sum to zero
Standard Deviation
Definition of sample standard deviation:
$$s = \sqrt{s^2}$$
Standard deviation is in the same units as the mean
Variance is in squared units (units$^2$)
What is the Standard
Deviation?
The standard deviation of a data set
is based on how much each data
value deviates from the mean, and is
equal to the square root of the
variance. The greater the dispersion
of values, the larger the standard
deviation. Much of statistical theory is
based on the standard deviation and
the 'normal' distribution.
When is the Standard Deviation
Useful?
It is a useful measure when your data
distribution is very close to a normal curve.
In this situation, the mean is the best
measure of central tendency, and the
standard deviation is the best measure of
dispersion.
In a normal distribution, if you measure 1
standard deviation to either side of the
mean, you will find that about 68.3% of the
observations fall into this area; about 95.4% of
the observations fall within 2 standard
deviations to either side of the mean; and
about 99.7% of observations fall within 3
standard deviations of the mean.
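These coverage figures can be checked numerically. The sketch below assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`. The helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal curve: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution lying within k standard
    deviations on either side of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 2))
# prints:
# 1 68.27
# 2 95.45
# 3 99.73
```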
Calculation of the Sample Standard Deviation
using the Theoretical (Squared Deviation)
Method
Squared Deviation from the Mean Age for a Sample of 11 Chicken Pox Sufferers

Observation   Child's Age (X)   Child's Age (X) Minus the    Squared Deviation
              in Years          Mean Age ($\bar{X}$) in Years   $(X - \bar{X})^2$
X1            2                 2 - 6 = -4                   (-4)^2 = 16
X2            4                 4 - 6 = -2                   (-2)^2 = 4
X3            5                 5 - 6 = -1                   (-1)^2 = 1
X4            5                 5 - 6 = -1                   (-1)^2 = 1
X5            6                 6 - 6 = 0                    (0)^2 = 0
X6            6                 6 - 6 = 0                    (0)^2 = 0
X7            6                 6 - 6 = 0                    (0)^2 = 0
X8            7                 7 - 6 = 1                    (1)^2 = 1
X9            7                 7 - 6 = 1                    (1)^2 = 1
X10           8                 8 - 6 = 2                    (2)^2 = 4
X11           10                10 - 6 = 4                   (4)^2 = 16

$\sum X = 66$ years; $\sum (X - \bar{X}) = 0$; $\sum (X - \bar{X})^2 = 44$
$\bar{X} = \sum X / n = 6$ years; n = 11; n - 1 = 10
Calculation of the Sample Standard
Deviation Using the Data in Table 5.6 and
the Theoretical Formula:
$$S = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{44}{10}} = \sqrt{4.4} = 2.10 \text{ years}$$
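The squared-deviation method can be traced step by step in Python. This is a sketch using the 11 ages from the chicken pox sample. The variable names are illustrative assumptions.

```python
import math

# Ages (years) of the 11 chicken pox sufferers in the sample.
ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
mean_age = sum(ages) / n                          # 66 / 11 = 6 years

# Deviation of each age from the mean, then each deviation squared.
deviations = [x - mean_age for x in ages]
squared_deviations = [d ** 2 for d in deviations]

sum_of_squares = sum(squared_deviations)          # sum of (X - mean)^2
sample_variance = sum_of_squares / (n - 1)        # divide by n - 1 = 10
sample_sd = math.sqrt(sample_variance)

print(sum(deviations))       # prints: 0.0 (deviations sum to zero)
print(sum_of_squares)        # prints: 44.0
print(round(sample_sd, 2))   # prints: 2.1
```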
Calculation of the Sample Standard
Deviation Using the Computational (Sum of
Squares) Formula:

Child's Age (X)   X^2
in Years
2                 4
4                 16
5                 25
5                 25
6                 36
6                 36
6                 36
7                 49
7                 49
8                 64
10                100

Computation Formula:
$$S = \sqrt{\frac{\sum X^2 - \frac{(\sum X)^2}{n}}{n - 1}} = \sqrt{\frac{440 - \frac{(66)^2}{11}}{10}} = \sqrt{4.4} = 2.10 \text{ years}$$
where $\sum X = 66$, $\sum X^2 = 440$, and n = 11
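As a cross-check, the sum-of-squares formula can be evaluated in Python and compared with the standard library's `statistics.stdev`, which also computes the sample (n - 1) standard deviation. The variable names are assumptions for illustration.

```python
import math
import statistics

ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

n = len(ages)
sum_x = sum(ages)                          # sum of X
sum_x_squared = sum(x * x for x in ages)   # sum of X^2

# Computational formula: no individual deviations are needed.
sd_computational = math.sqrt((sum_x_squared - sum_x ** 2 / n) / (n - 1))

# Library result for comparison (sample standard deviation).
sd_library = statistics.stdev(ages)

print(sum_x, sum_x_squared)          # prints: 66 440
print(round(sd_computational, 2))    # prints: 2.1
```

Both routes give the same value. The computational form is convenient by hand because it needs only two running totals.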
Coefficient of Variation (CV)
What is the Coefficient of Variation?
The coefficient of variation measures variability in relation
to the mean (or average) and is used to compare the
relative dispersion in one type of data with the relative
dispersion in another type of data. The data to be compared
may be in the same units, in different units, with the same
mean, or with different means.
When is the Coefficient of Variation Useful?
Suppose you want to evaluate the relative dispersion of
grades for two classes of students: Class A and Class B. The
coefficient of variation can be used to compare these two
groups and determine how the grade dispersion in Class A
compares to the grade dispersion in Class B. This is one
example of how the coefficient of variation can be applied.
Coefficient of Variation
Relative variation rather than absolute
variation such as standard deviation
Definition of C.V.:
$$C.V. = \frac{s}{\bar{x}} \, (100)$$
Useful in comparing variation between two
distributions
Used particularly in comparing laboratory
measures to identify those determinations with
more variation
Also used in QC analyses for comparing
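To illustrate, the sketch below computes the C.V. for two data sets that appear earlier in these notes: the chicken pox ages (years) and the hepatitis incubation periods (days). Pairing them is purely illustrative, and `coefficient_of_variation` is a hypothetical helper name.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100


ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]
incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]

cv_ages = coefficient_of_variation(ages)
cv_incubation = coefficient_of_variation(incubation_days)

print(round(cv_ages, 1), round(cv_incubation, 1))
```

The ages have the smaller standard deviation in absolute terms but the larger C.V., which is the kind of relative comparison the C.V. is meant to support.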
Standard Deviation of the Mean
(SE)
The standard deviation of the mean (often
called the standard error) is a measure of
the variation in means of repeated
samples. It is defined as the standard
deviation divided by the square root of the
sample size: $SE = s / \sqrt{n}$. To calculate the
standard deviation of the mean, do the
following:
1. Calculate the standard deviation (s).
2. Calculate the square root of the sample size (n).
3. Divide the standard deviation by the result of step 2.
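The three steps map directly onto Python. This sketch reuses the 11 chicken pox ages from the standard deviation example. The variable names are illustrative.

```python
import math
import statistics

ages = [2, 4, 5, 5, 6, 6, 6, 7, 7, 8, 10]

s = statistics.stdev(ages)        # step 1: sample standard deviation
root_n = math.sqrt(len(ages))     # step 2: square root of the sample size
standard_error = s / root_n       # step 3: divide s by the result of step 2

print(round(standard_error, 3))   # prints: 0.632
```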
Percentiles and Quartiles
Definition of Percentiles
Given a set of n observations $x_1, x_2, \dots, x_n$, the
pth percentile P is the value of X such that p
percent or less of the observations are less
than P and (100 - p) percent or less are greater
than P
$P_{10}$ indicates the 10th percentile, etc.
Definition of Quartiles
First quartile is $P_{25}$
Second quartile is the median or $P_{50}$
Third quartile is $P_{75}$
Measures of Position
Quartiles, Deciles,
Percentiles
Quartiles
$Q_1$, $Q_2$, $Q_3$ divide ranked scores into four equal parts
[Figure: ranked scores from (minimum) to (maximum), split at $Q_1$, $Q_2$ (median), and $Q_3$ into four parts of 25% each]
Finding the Percentile of a
Given Score
$$\text{Percentile of score } x = \frac{\text{number of scores less than } x}{\text{total number of scores}} \cdot 100$$
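A minimal Python sketch of this formula follows. The function name `percentile_of_score` is an assumption, and the data are the incubation periods from the mean example.

```python
def percentile_of_score(scores, x):
    """100 * (number of scores less than x) / (total number of scores)."""
    below = sum(1 for s in scores if s < x)   # strictly less than x
    return 100 * below / len(scores)


incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]

# Nine of the 21 values are below 30, so 30 days sits near the 43rd percentile.
print(round(percentile_of_score(incubation_days, 30), 1))  # prints: 42.9
```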
Inter-quartile Range
Better description of distribution
than range
Range of middle 50 percent of the
distribution
Definition of Inter-quartile Range:
$IQR = Q_3 - Q_1$
[Figure: Frequency distributions of values with inter-quartile range of 5 to 9. One distribution spans values 1 to 14 (Range = 14 - 1 = 13), the other spans values 1 to 21 (Range = 21 - 1 = 20); each is marked with the lower 25%, middle 50%, and upper 25% of values.]
Interquartile Range (or IQR): $Q_3 - Q_1$
Semi-interquartile Range: $(Q_3 - Q_1) / 2$
Midquartile: $(Q_1 + Q_3) / 2$
10 - 90 Percentile Range: $P_{90} - P_{10}$
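These four measures can be sketched with Python's `statistics.quantiles` (available from Python 3.8). Note the assumption: with `method="exclusive"` the function places each cut point at position i(n+1)/k in the sorted data and interpolates linearly between neighbours. This is one of several quartile conventions and may differ slightly from other textbooks or software. The data are the incubation periods from the mean example.

```python
import statistics

incubation_days = [16, 20, 22, 24, 25, 28, 28, 29, 29, 30, 30, 30, 31, 31,
                   32, 33, 35, 36, 38, 40, 42]

# Quartiles: three cut points dividing the ranked data into four parts.
q1, q2, q3 = statistics.quantiles(incubation_days, n=4, method="exclusive")

# Deciles: nine cut points; the first is P10 and the last is P90.
deciles = statistics.quantiles(incubation_days, n=10, method="exclusive")
p10, p90 = deciles[0], deciles[-1]

iqr = q3 - q1                       # interquartile range
semi_iqr = (q3 - q1) / 2            # semi-interquartile range
midquartile = (q1 + q3) / 2         # midquartile
percentile_range_10_90 = p90 - p10  # 10 - 90 percentile range

print(q1, q3, iqr, semi_iqr, midquartile)  # prints: 26.5 34.0 7.5 3.75 30.25
```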
Percentiles
A "percentile" shows how a single system may be
compared to all other systems. Percentiles range
from lowest (1) to highest (99), with the middle of
the distribution (the median) at 50.
The pth percentile (with p written as a fraction from 0 to 1) is a value such
that roughly 100p% of the data is smaller and 100(1 - p)%
of the data is larger. Percentiles can be computed for
ordinal, interval, or ratio data.
There are three steps for computing a percentile.
1. Sort the data from low to high;
2. Count the number of values (n);
3. Select the p*(n+1)th observation.
If p*(n+1) is not a whole number, then go halfway
between the two adjacent observations.
If p*(n+1) < 1, select the smallest observation.
If p*(n+1) > n, select the largest observation.
Examples
The following data represents cotinine levels in saliva
(nmol/l) after smoking. We want to compute the 50th
percentile.
73, 58, 67, 93, 33, 18, 147
1. Sorted data: 18, 33, 58, 67, 73, 93, 147
2. There are n=7 observations.
3. Select 0.50*(7+1) = 4th observation.
Therefore, the 50th percentile equals 67. Notice that
there are three observations larger than 67 and three
observations smaller than 67.
Suppose we want to compute the 20th percentile.
Notice that p*(n+1) = 0.20*(7+1) = 1.6. This is not a
whole number, so we select the value halfway between the 1st and
2nd observations (18 and 33), or 25.5. (Some people see the 1.6 and
think they have to go six tenths of the way to the
second value. You can do this if you like, but I think life
is too short to worry about such details.)
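The three-step rule, including the "go halfway" shortcut, can be written as a small Python function. This is a sketch of the rule as stated in these notes, not of any library's percentile method. The name `percentile` and its signature are illustrative assumptions, with p given as a fraction between 0 and 1.

```python
import math


def percentile(values, p):
    """Percentile by the p*(n+1) rule, taking the halfway point between
    neighbours when the position is not a whole number."""
    ordered = sorted(values)              # step 1: sort low to high
    n = len(ordered)                      # step 2: count the values
    position = p * (n + 1)                # step 3: 1-based position
    if position < 1:
        return ordered[0]                 # below the first observation
    if position > n:
        return ordered[-1]                # beyond the last observation
    nearest = round(position)
    if math.isclose(position, nearest):   # whole-number position
        return ordered[nearest - 1]
    lower = math.floor(position)          # halfway between the neighbours
    return (ordered[lower - 1] + ordered[lower]) / 2


cotinine = [73, 58, 67, 93, 33, 18, 147]

print(percentile(cotinine, 0.50))  # prints: 67
print(percentile(cotinine, 0.20))  # prints: 25.5
```

Readers who prefer to go six tenths of the way from the first to the second value would replace the final line of the function with a linear interpolation.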
Suppose we want to compute the 10th percentile. Since
The five number summary
A five number summary uses percentiles to
describe a set of data. The five number
summary consists of
MAX - the maximum value
75% - the 75th percentile (3rd quartile)
50% - the 50th percentile (2nd quartile or
median)
25% - the 25th percentile (1st quartile)
MIN - the minimum value
The five number summary splits the data
into four regions, each of which contains
25% of the data.
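A five number summary for the cotinine data can be sketched with the standard library. The quartiles come from `statistics.quantiles` with `method="exclusive"`, which uses positions of the form i(n+1)/4. For these seven values the positions are whole numbers (2, 4, and 6), so no interpolation is involved. The function name `five_number_summary` is an illustrative assumption.

```python
import statistics


def five_number_summary(values):
    """Minimum, quartiles, and maximum of a data set."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return {
        "MIN": min(values),
        "25%": q1,      # 1st quartile
        "50%": q2,      # 2nd quartile (median)
        "75%": q3,      # 3rd quartile
        "MAX": max(values),
    }


cotinine = [73, 58, 67, 93, 33, 18, 147]
summary = five_number_summary(cotinine)

print(summary)
# prints: {'MIN': 18, '25%': 33.0, '50%': 67.0, '75%': 93.0, 'MAX': 147}
```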
Summary
In practice, descriptive statistics play a
major role
Always the first 1-2 tables/figures in a paper
The statistician needs to know about each variable
before deciding how to analyze the data to answer
research questions
In any analysis, 90% of the effort goes
into setting up the data
Descriptive statistics are part of that 90%
