
Biostatistics I Descriptive Statistics 1

Master Degree Public Health 2016/2017

Descriptive Statistics

Margarida Fonseca Cardoso


The primary objective of a statistical analysis is to infer characteristics of a
group of data by analyzing the characteristics of a small sample of the
group.

This generalization from the part to the whole requires the consideration of
such important concepts as population, sample, parameter, statistic and
sampling.

An observational study observes individuals and measures variables of
interest but does not attempt to influence the responses. The purpose of an
observational study is to describe some group or situation.

An experiment, on the other hand, deliberately imposes some treatment on
individuals in order to observe their responses. The purpose of an
experiment is to study whether the treatment causes a change in the
response.


The data in a biometric study are generally based on individual observations,
which are observations or measurements taken on the smallest sampling
unit.

Example:
• If we measure weight in 100 rats, then the weight of each rat is an individual
observation; the hundred weights together represent the sample of
observations. These smallest sampling units frequently, but not necessarily,
are also individuals in the ordinary biological sense, that is, one rat.
• However, if we had studied weight in a single rat over a period of time, the
sample of individual observations would be all the weights recorded on one
rat at successive times.
• In a study of temperature in ant colonies, where each colony is a basic
sampling unit, each temperature reading for one colony is an individual
observation, and the sample of observations is the temperatures for all the
colonies considered.


Populations
Biologists consider a population to be a defined group of humans or of another
species of organisms.
In statistics, population (or universe) means the totality of individual
observations about which inferences are to be made.

For example:

An investigator may desire to draw conclusions
about the length of a commercially important fish species of the Azores
archipelago. All lengths of this fish species are, therefore, the population
under consideration.

If a study is concerned with the blood-glucose concentration in three-year-old
children, then the blood-glucose levels in all children of that age are the
population of interest.


Biologists may sample a population that does not physically exist.

Suppose an experiment is performed in which a food supplement is
administered to 40 guinea pigs, and the sample data consists of the growth
rates of these 40 animals.
Then the population about which conclusions might be drawn is the growth
rates of all the guinea pigs that conceivably might have been administered
the same food supplement under identical conditions. That population is
said to be “hypothetical” or “potential”.

The actual property measured by the individual observations is the variable.
More than one variable can be measured on each sampling unit.
Length, mass, age, temperature, number of parasites, number of petals are
examples of biological variables.



Any set of data:
• contains information about some group of individuals.
• the information is organized in variables.

The techniques of descriptive statistics apply equally to data sets obtained
from a whole population – the entire group of individuals about which we want
information – or from only those individuals in a smaller sample.


Using SPSS
Example: The data file called highschoolb.sav, high school and beyond,
contains 200 observations from a sample of high school students with demographic
information about the students, such as their gender (female) and type of school. It
also contains a number of scores on standardized tests, including tests of reading
(read), writing (write), mathematics (math) and social studies (socst).

Each column includes data about a different variable. Each row represents data
for one student. Some variables, like gender, simply place individuals into
categories. Others, like test scores, take numerical values with which we can do
arithmetic.


When you plan a statistical study or explore data from someone else’s work,
ask yourself the following questions:

1. Who? What individuals do the data describe? How many individuals
appear in the data?
2. What? How many variables do the data contain? What are the exact
definitions of these variables? In what units of measurement is each
variable recorded? Lengths, for example, might be recorded in inches or
in meters.
3. Why? What purpose do the data have? Do we hope to answer some
specific questions? Do we want to draw conclusions about individuals
other than those we actually have data for? Are the variables suitable for the
intended purpose?


Categorical and quantitative variables

• Categorical variable – places individuals into one of several groups or
categories.
– Nominal variables are purely qualitative and unordered, like flower
colour.
– Ordinal data can be ranked, like socioeconomic status or the Likert
scales commonly used in psychology (for example, “rate from 0 to 5,
with 0 for really dislike and 5 for really like”).
• Quantitative variable – takes numerical values for which arithmetic
operations such as adding and averaging make sense. The values of a
quantitative variable are usually recorded in a unit of measurement such
as seconds or kilograms.
– Some quantitative variables, like weight, are continuous variables that
can take any value over an interval.
– Discrete variables are quantitative variables that can take only a
limited number of values, like the number of petals in a flower.


Exploring data
• Begin by examining each variable by itself. Then move on to study the
relationships among variables.
• Begin with a graph or graphs. Then add numerical summaries of specific
aspects of the data.
The proper choice of graph depends on the nature of the variable.

Distribution of a variable
The distribution of a variable tells us what values it takes and how often it
takes these values.
The values of a categorical variable are labels for the categories. The
distribution of a categorical variable lists the categories and gives either the
count or the percent of individuals that fall in each category.


Qualitative variables

Example: The distribution of students according to their gender and type of
school.

Variable         Category   Number (%)
Gender           Male        91 (45.5%)
                 Female     109 (54.5%)
Type of school   Public     168 (84%)
                 Private     32 (16%)

The data table provides both the count of individuals for each category and the
percent that each category represents in the data set. Counts are also
sometimes referred to as frequencies and percents as relative frequencies.

It is clear that the majority of the students (84%) were enrolled in public
schools.
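
As a rough illustration outside SPSS, the counts and percents in the table can be reproduced with a few lines of Python. This is only a sketch: the function name `frequency_table` is invented here, and the school-type column is rebuilt from the counts reported above rather than read from highschoolb.sav.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

# Rebuild the school-type column from the counts in the table (assumption:
# we only need the category labels, not the original file).
school_type = ["Public"] * 168 + ["Private"] * 32
table = frequency_table(school_type)
for category, (count, percent) in table.items():
    print(f"{category:8s} {count:4d} ({percent:.1f}%)")
```

The counts are the frequencies and the percents are the relative frequencies described above.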


Using SPSS

Choose Analyse → Descriptive Statistics → Frequencies


Example: The distribution of students according to their gender and type of
school.

We could also make a pie chart or a bar graph.

Using SPSS: Choose Graphs → Legacy dialogs → Pie …


Quantitative variables

Quantitative variables often take many values. The distribution of a variable
tells us what values the variable takes and how often it takes these values.
The most common graph of the distribution of one quantitative variable is a
histogram.
Example: Making a histogram: score on reading.

Using SPSS: Choose Graphs → Legacy dialogs → Histogram …

And the graph that is automatically displayed is:

The software’s choice has too many classes,
with many classes having one or no observations.


Number of classes:

There have been several “rules of thumb” proposed to aid in deciding into
how many classes data might reasonably be grouped, for the use of too few
groups will obscure the general shape of the distribution.

But such “rules” or recommendations are rough guides, and the choice is
generally left to good judgement, bearing in mind that from 10 to 20 groups
are useful for most biological work.

Groups should be established so that each covers an interval of equal width
of the variable being measured.


Quantitative variables

Consider classes of width 5 starting at 20.
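
To make the grouping concrete, here is a minimal Python sketch that counts observations in classes of width 5 starting at 20. The function name `bin_counts`, the sample scores, and the convention that each class includes its lower limit but not its upper limit are all assumptions made for illustration; SPSS may treat class boundaries differently.

```python
from collections import Counter

def bin_counts(values, start=20, width=5):
    """Count values per class [lower, lower + width), keyed by lower limit."""
    counts = Counter()
    for v in values:
        lower = start + width * ((v - start) // width)
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Hypothetical scores, for illustration only (not the course data set).
scores = [28, 34, 39, 42, 44, 47, 47, 50, 52, 52, 55, 57, 60, 63, 68, 76]
for lower, count in bin_counts(scores).items():
    print(f"[{lower}, {lower + 5}): {'*' * count}")
```

Each printed row plays the role of one bar of the histogram.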


Figure: Histogram of the reading score for 200 students.

A histogram displays the distribution of one quantitative variable.
A histogram should be drawn with no extra space between consecutive
classes, to indicate that all values of the variable are covered.


Interpreting histograms

• Making a statistical graph is not an end in itself. The purpose of the graph
is to help us understand the data.
• After you make a graph, always ask, “What do I see?”

Examining a histogram
• In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
• You can describe the overall pattern of a histogram by its shape, center,
and spread.
• An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern.


Describing a distribution: reading score.

Figure: Histogram of the reading score for 200 students.

Shape: The distribution is unimodal. That is, it has a single peak, which
represents students with a reading score between 50 and 55.
The distribution is also symmetric. Real data are almost never exactly
symmetric. We are content to describe the histogram of the reading score as
roughly symmetric.
Center: The midpoint of the distribution is about 45 to 55.
Spread: The spread is from 25 to 80.


Describing a distribution: reading score.

Figure: Histogram of the reading score for 200 students.

Outliers:
In the figure, the observations less than 30 and greater than 70 are part of the
continuous range of reading scores and do not stand apart from the overall
distribution.
If you spot possible outliers, look for an explanation. Some outliers
are due to mistakes, such as typing 4.5 instead of 45.5. Other outliers point
to the special nature of some observations. For instance, a score of 4.5 could
simply come from an individual who didn’t finish the reading test.


Symmetric and skewed distributions


A distribution is symmetric if the right and left sides of the histogram are
approximately mirror images of each other.
A distribution is skewed to the right, or positively skewed, if the right side of
the histogram extends much farther out than the left side.
It is skewed to the left, or negatively skewed, if the left side of the histogram
extends much farther out than the right side.


An additional question we could ask is what proportion of students
passed the reading test.
Considering that students pass the test if they score 45 or more, a new variable
can be created.
Using SPSS to create the new variable ReadPass:
Choose Transform → Visual Binning …

And then choose Analyse → Descriptive Statistics → Frequencies
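
The same idea can be sketched in Python: derive a pass/fail indicator from the cutoff of 45 and take its proportion. The function name `pass_proportion` and the example scores are hypothetical and are not taken from highschoolb.sav.

```python
def pass_proportion(scores, cutoff=45):
    """Return the fraction of scores at or above the cutoff."""
    passed = [score >= cutoff for score in scores]  # the new binary variable
    return sum(passed) / len(passed)

# Hypothetical reading scores, for illustration only.
example_scores = [28, 39, 44, 45, 52, 57, 63, 76]
print(f"{pass_proportion(example_scores):.1%} passed")
```

The list `passed` corresponds to the ReadPass variable, and its mean is the proportion that a frequency table would report.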


Example: Guinea pig survival times

The figure displays the survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment.

(Baldi and Moore, 2012). File in SPSS: SurvivalGuineaPigs


• The distribution is single-peaked and skewed to the right.
• Most guinea pigs have a short survival time, between 50 and 150 days.
• However, some animals survive longer, so that the graph extends to the
right of its peak much farther than it extends to the left.
• The survival times range from 0 to 600. However, almost all infected
guinea pigs die within 250 days.
• A few guinea pigs survive much longer, maybe because they had some
immunity to the bacteria used for the experiment.


The overall shape of a distribution is important information about a
variable.
• Many biological measurements on the same species and sex have
symmetric distributions, for example birth weights or heights of young
women.
• Survival times (of patients after an organ transplant, or of lab animals
following an experimental inoculation) have distributions that are typically
strongly skewed to the right.
• Many distributions have irregular shapes that are neither symmetric nor
skewed.
• Use your eyes, describe what you see, and then try to explain it.


Stemplots

For small data sets, stemplots are quicker to make and easier to interpret.
They display the raw data, that is, they show each one of the values in the
data set.
To make a stemplot:
1. Separate each observation into a stem, consisting of all but the final
(rightmost) digit, and a leaf, the final digit. Stems may have as many
digits as needed, but each leaf shows only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and
draw a vertical line at the right of the column.
3. Write each leaf in the row to the right of its stem, in increasing order out
from the stem.

SPSS automatically displays stemplots.
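
The three steps above can be sketched in Python for non-negative whole numbers. This is an illustrative sketch, not the SPSS algorithm: the function name `stemplot` and the data are hypothetical, and unlike the SPSS output it does not split each stem into two rows.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot lines for non-negative integers (leaf = last digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting keeps leaves in order
        leaves[v // 10].append(v % 10)       # step 1: split stem and leaf
    lines = []
    for stem in range(min(leaves), max(leaves) + 1):   # step 2: all stems
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}")   # step 3: leaves beside the stem
    return lines

# Hypothetical data, for illustration only.
data = [28, 31, 34, 34, 47, 52, 50, 63, 68, 60]
print("\n".join(stemplot(data)))
```

Stems with no observations are still printed, so gaps in the data stay visible.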


Using SPSS: Choose Analyse → Descriptive Statistics → Explore …

And the stemplot is automatically displayed:


reading score Stem-and-Leaf Plot

Frequency Stem & Leaf

1,00 2. 8
7,00 3. 1444444
14,00 3. 56667799999999
30,00 4. 112222222222222334444444444444
31,00 4. 5567777777777777777777777777778
34,00 5. 0000000000000000002222222222222234
27,00 5. 555555555555577777777777777
26,00 6. 00000000013333333333333333
21,00 6. 555555555688888888888
7,00 7. 1133333
2,00 7. 66

Stem width: 10,00


Each leaf: 1 case(s)


Describing distributions with numbers – quantitative variables

• In samples one generally finds a preponderance of values somewhere
around the middle of the range of observed values.
• The description of this concentration near the middle is an average or a
measure of central tendency. It is also termed a measure of location, for
it indicates where, along the measurement scale, the sample is located.

• In addition to a description of the central tendency of a set of data, it is
generally desirable to have a description of the variability, or the
dispersion, of the data.
• Measurements that are concentrated around the center of a distribution
of data have low variability (low dispersion), whereas data that are very
spread out along the measurement scale have high variability (high
dispersion).


Measuring center: the mean

The mean
To find the mean of a set of observations, add their values and divide by the
number of observations. If the n observations are X1, X2, …, Xn, their mean is:

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Example: What is the mean length of 24 butterfly wings?


Xi (in centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5

(Zar, 2010). File in SPSS: WingLength
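
As a check on the arithmetic, the mean of the 24 wing lengths can be computed directly in Python. The variable and function names are chosen here for illustration; the data are the values listed above.

```python
# Wing lengths in centimeters, copied from the example above.
wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9,
                3.9, 3.9, 4.0, 4.0, 4.0, 4.0, 4.1, 4.1, 4.1,
                4.2, 4.2, 4.3, 4.3, 4.4, 4.5]

def mean(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)

print(round(mean(wing_lengths), 2))  # 3.96
```

The values add up to 95.0, so the mean is 95.0 / 24, or about 3.96 cm.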


In practice, you can key the data into your calculator or software and ask for
the arithmetic mean.

Using SPSS:
In the Example with a sample of high school students (data file called
highschoolb.sav), the mean of the reading score for the 200 students is 52.23.

There are a lot of procedures for obtaining the arithmetic mean in SPSS.
You can choose:
Analyse → Descriptive Statistics → Descriptives …
Analyse → Descriptive Statistics → Explore …


Measuring center: the median


The median is the midpoint of a distribution, the measurement such that
half the observations are smaller and the other half are larger.
The median is typically defined as the middle measurement in an ordered set
of data.
To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median is the center
observation in the ordered list. Find the location of the median by
counting (n+1)/2 observations up from the smallest observation in the
list.
3. If the number of observations n is even, the median is the mean of the
two center observations in the ordered list. The location of the median is
again (n+1)/2, counting from the smallest observation in the list.

Note that the formula (n+1)/2 does not give the median, just the location of
the median in the ordered list.
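
The procedure above might be sketched in Python as follows. The function name `median` is chosen here for illustration (Python's standard `statistics.median` gives the same result); the point is to show the role of the location (n+1)/2.

```python
def median(values):
    """Median via the location rule (n + 1) / 2 on the ordered list."""
    ordered = sorted(values)                # step 1: order the observations
    n = len(ordered)
    location = (n + 1) / 2                  # 1-based position, may end in .5
    if n % 2 == 1:                          # step 2: odd n, center observation
        return ordered[int(location) - 1]
    lower = ordered[n // 2 - 1]             # step 3: even n, center pair
    upper = ordered[n // 2]
    return (lower + upper) / 2

print(median([7, 1, 4]))       # location 2   -> 4
print(median([7, 1, 4, 10]))   # location 2.5 -> (4 + 7) / 2 = 5.5
```

Note that the location is only a position in the ordered list; the median is the value found there.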


Example: What is the median wing length for our 24 butterflies?


Here are the data again in order, Xi (in centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5

The number of observations is even. There is no center observation, but there
is a center pair: the 12th and 13th values, both 4.0.
When n = 24, the rule for locating the median in the list gives

$$\text{location of } Md = \frac{n + 1}{2} = \frac{25}{2} = 12.5$$

The location 12.5 means “halfway between the 12th and 13th observations in
the ordered list”.

$$Md = \frac{4.0 + 4.0}{2} = 4.0 \text{ cm}$$

(Zar, 2010).


In the example of wing lengths of butterflies we found that the mean and the
median were very similar, at 3.96 and 4.00 cm, respectively. The distribution
of the 24 wing lengths was roughly symmetric and did not have any
outliers.
The mean is the center of gravity of the histogram. That is, if the histogram
were made of solid material, it would balance horizontally with the fulcrum at
the mean. The median divides the histogram into two equal areas.

But what would happen to the relationship between the mean and the
median of a data set with a marked skew or extreme outliers?


The figure displays the survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment:
The distribution is noticeably skewed to the right and has some potential high
outliers.
The mean survival time is 141.9 days, whereas the median survival time is only
102.5 days.
The mean is pulled toward the right tail of this right-skewed distribution.
If the longest survival time were increased, the mean would increase, but the
median would not change at all. The mean uses the actual value of each
observation and is, therefore, very sensitive to any extreme values.
(Baldi and Moore, 2012).
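
A small Python sketch can illustrate this sensitivity. The survival times below are invented for illustration and are not the guinea pig data; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical right-skewed survival times in days (not the real data set).
times = [60, 80, 95, 100, 105, 120, 150, 400]

# Same sample, but with the longest survival time increased to 600 days.
stretched = times[:-1] + [600]

print(mean(times), median(times))
print(mean(stretched), median(stretched))
```

Only the mean moves when the largest value is stretched; the median depends on the middle of the ordered list and stays where it was.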


Comparing the mean and the median


• The mean and median of a symmetric distribution are close together.
• If the distribution is exactly symmetric, the mean and median are exactly
the same.
• In a skewed distribution, the mean is usually farther out in the long tail
than is the median.

Using SPSS:
There are a lot of procedures for obtaining the arithmetic mean and the
median in SPSS.
You can choose:
Analyse → Descriptive Statistics → Explore …


Many biological variables have distributions that are skewed to the right.
Survival times, such as in the guinea pig inoculation experiment, are typically
right skewed.
When dealing with strongly skewed distributions, it is customary to report
the median rather than the mean.
However, a health organization or government agency may need to include all
survival times, and thus calculate the mean, to estimate the costs of medical
care for a given disease and to plan medical staffing appropriately. Relying
only on the median would result in underestimating the medical and
financial needs.

The mean and median measure center in different ways, and both are
useful.


Other measures of central tendency

The Geometric mean


The geometric mean can be calculated as the antilogarithm of the mean of
the logarithms of the data (where the logarithm can be in any base).

$$M_g = \operatorname{antilog}\left(\frac{1}{n}\sum_{i=1}^{n} \log X_i\right)$$

In the particular case of the natural logarithm (base e):

$$M_g = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln X_i\right)$$

The geometric mean is appropriate to use only for quantitative data and only
when all of the data are positive (that is, greater than zero).
Geometric mean is sometimes used as a measure of location when the data
are highly skewed to the right.
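
The natural-logarithm form of the definition translates directly into Python. This is a minimal sketch with a function name chosen for illustration; it rejects non-positive values because their logarithm is undefined.

```python
import math

def geometric_mean(values):
    """exp of the mean of the natural logs; data must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive data")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)

# Illustrative values: the logs of 1, 10 and 100 are evenly spaced,
# so the geometric mean is the middle value, 10 (up to rounding error).
print(geometric_mean([1, 10, 100]))
```

For the same three values the arithmetic mean is 37, which shows how much less the geometric mean is pulled by the largest observation.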


In the example of survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment:
The mean survival time is 141.9 days, whereas the median survival time is only
102.5 days.
The geometric mean is 118.1 days.

We will see later that variables are sometimes transformed to logarithms. If we
compute the mean of such a transformed variable and then change the mean back
into the original scale, the result (the geometric mean) will not be the same as
the arithmetic mean of the original variable.
(Baldi and Moore, 2012).


Using SPSS:
Obtaining the geometric mean in SPSS.
Analyse → Reports → Summarize cases …
The geometric mean (as well as the mean, median, …) can be obtained in the
Statistics sub-dialogue box as shown in the display.


Measures of variability and dispersion.

In addition to a description of the central tendency of a set of data, it is
generally desirable to have a description of the variability, or of the
dispersion, of the data.
A measure of variability is an indication of the spread of measurements
around the center of the distribution.
Measurements that are concentrated around the center of a distribution of data
have low variability (low dispersion).
Data that are very spread out along the measurement scale have high
variability (high dispersion).


Measuring spread: the range


The difference between the highest and lowest measurements in a group of
data is termed the range.
If sample measurements are arranged in increasing order of magnitude, then
Sample range = Xn – X1
that is,
Sample range = largest X – smallest X

In the example with wing length for 24 butterflies, with ordered data, Xi (in
centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
The range may be expressed as 3.3 to 4.5 cm, or as 4.5 – 3.3 = 1.2 cm.


The range is a relatively crude measure of dispersion, inasmuch as it does
not take into account any measurement except the highest and the lowest.

For example, the guinea pig survival times range from 43 to 598 days. These
single observations show the full spread of data, but they may be outliers.

Furthermore, it is unlikely that a sample will contain both the highest and
lowest values in the population, so the sample range usually underestimates
the population range.
Nonetheless, it is considered useful by some to present the sample range
as an estimate (although a poor one) of the population range.
Whenever the range is specified in reporting data, however, it is a good
practice to report another measure of dispersion as well.
The range is applicable to ordinal data and quantitative data.


Measuring spread: the quartiles


We can improve our description of spread by also looking at the spread of the
middle half of the data.
If the data are divided into four equal parts, we speak of quartiles.
• One-fourth (25%) of all the ranked observations are smaller than the first
quartile, one-fourth (25%) lie between the first and second quartile, one-
fourth (25%) lie between the second and third quartile, and one-fourth
(25%) are larger than the third quartile.
• In other words, the first quartile is larger than 25% of the observations,
and the third quartile larger than 75% of the observations. The second
quartile is the median that is larger than 50% of the observations.


The quartiles Q1 and Q3


To calculate the quartiles:
• Arrange the observations in increasing order and locate the median Md in
the ordered list of observations.
• The location of the first quartile Q1:
location = (n+1)/4 or location = 0.25 x (n + 1)
• The location of the third quartile Q3:
location = 0.75 x (n + 1)
If the location is not an integer or half-integer, then it is rounded up to the
nearest integer or half integer.
In the example with wing length for 24 butterflies, with ordered data, Xi (in
centimetres):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
location of Q1 = 0.25 x (24+1) = 6.25 (which we round up to 6.5, halfway
between the 6th and 7th observations) => Q1 = (3.8 + 3.8)/2 = 3.8 cm
location of Q3 = 0.75 x (24+1) = 18.75 (which we round up to 19) => Q3 = 4.2 cm
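
The location rule can be sketched in Python as below. This is only one of several quartile conventions, and other software (including SPSS) may return slightly different values. The function names are invented for illustration, and the sketch reads the rule as "round the location up to the nearest integer or half-integer, then average the two neighbouring observations when the location ends in .5".

```python
import math

def value_at(ordered, location):
    """Value at a 1-based location; a half-integer averages its neighbours."""
    lower = ordered[math.floor(location) - 1]
    upper = ordered[math.ceil(location) - 1]
    return (lower + upper) / 2

def quartile(values, fraction):
    """Quartile at location fraction * (n + 1), rounded up to the nearest 0.5."""
    ordered = sorted(values)
    n = len(ordered)
    location = fraction * (n + 1)
    location = math.ceil(location * 2) / 2      # round up to nearest 0.5
    location = min(max(location, 1), n)         # stay inside the list
    return value_at(ordered, location)

wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9,
                3.9, 3.9, 4.0, 4.0, 4.0, 4.0, 4.1, 4.1, 4.1,
                4.2, 4.2, 4.3, 4.3, 4.4, 4.5]
print(quartile(wing_lengths, 0.25), quartile(wing_lengths, 0.75))  # 3.8 4.2
```

For the wing lengths the locations are 6.5 and 19, which reproduces Q1 = 3.8 cm and Q3 = 4.2 cm.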


The five-number summary and boxplots


The smallest and largest observations tell us little about the distribution as a
whole, but they give information about the tails of the distribution that is
missing if we know only Q1 , Md and Q3.
To get a quick summary of both center and spread, combine all five numbers.
The five-number summary of a distribution consists of the smallest
observation, the first quartile, the median, the third quartile, and the largest
observation, written in order from smallest to largest:
Minimum Q1 Md Q3 Maximum

The five number summary from the wing length example is:
3.3 3.8 4.0 4.2 4.5


The five number summary of a distribution leads to a new graph, the boxplot.
Figure: Boxplot of the 24 wing lengths (cm), marking Min = 3.3, Q1 = 3.8,
Md = 4.0, Q3 = 4.2 and Max = 4.5.

A boxplot is a graph of the five-number summary.
A central box spans the quartiles Q1 and Q3.
A line in the box marks the median Md.
Lines extend from the box out to the smallest and largest observations.


Dispersion measured with quantiles:

• The interquartile range
The interquartile range is the distance between the first and third
quartiles (i.e., the 25th and 75th percentiles).
Interquartile range = Q3 - Q1
• The semi-interquartile range (or quartile deviation)
Semi-interquartile range = (Q3 - Q1)/2

For our data on wing lengths:
Interquartile range = 4.2 – 3.8 = 0.4 cm
Semi-interquartile range = (4.2 – 3.8)/2 = 0.4/2 = 0.2 cm


Using SPSS:
Obtaining the quartiles, boxplot, etc. in SPSS.
Analyse → Descriptive Statistics → Explore …
The boxplot is automatically displayed.
The quartiles can be obtained in the Statistics sub-dialogue box by checking
Percentiles as shown in the display.


Measuring spread: the standard deviation


The standard deviation and its close relative, the variance, measure spread by
looking at how far the observations are from the mean.
The variance s² of a sample set of observations is an average of the squares of
the deviations of the observations from their mean:

$$s^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1}$$

or

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$

The standard deviation s is the square root of the variance s²:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

The most common numerical description of a distribution is the mean to
measure center and the standard deviation to measure spread.


Calculating the standard deviation


Example: A person’s metabolic rate is the rate at which the body
consumes energy. Metabolic rate is important in studies of weight gain,
dieting, and exercise. Here are the metabolic rates of 7 men who took part in
a study of dieting. The units are kilocalories (Cal) for a 24-hour period. These
are the same calories used to describe the energy content of foods.
1792 1666 1362 1614 1460 1867 1439
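First find the mean:

$$\bar{X} = \frac{1792 + 1666 + 1362 + 1614 + 1460 + 1867 + 1439}{7} = \frac{11200}{7} = 1600 \text{ Cal}$$

The variance:

$$s^2 = \frac{(1792 - 1600)^2 + (1666 - 1600)^2 + \dots + (1439 - 1600)^2}{7 - 1} = \frac{214870}{6} = 35811.67 \text{ Cal}^2$$

The standard deviation is the square root of the variance:

$$s = \sqrt{35811.67} = 189.24 \text{ Cal}$$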

The researchers reported the mean, 1600 Cal, and the standard deviation,
189.24 Cal.
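
The calculation can be checked with a short Python sketch that follows the definition of the sample variance with the n − 1 divisor. The function and variable names are chosen here for illustration; the data are the seven metabolic rates listed above.

```python
import math

# Metabolic rates (Cal per 24 hours) of the 7 men in the example.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

variance = sample_variance(rates)
sd = math.sqrt(variance)
print(round(variance, 2), round(sd, 2))  # 35811.67 189.24
```

Python's standard library offers `statistics.variance` and `statistics.stdev`, which use the same n − 1 divisor.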


The standard deviation:


The standard deviation measures spread about the mean and should be
used only when the mean is chosen as the measure of center.
The standard deviation is always zero or greater than zero.
• s = 0 only when there is no spread, that is, all the observations have the
same value.
• s > 0 when not all the observations have the same value. As the
observations become more spread out about their mean, s gets larger.
s has the same units of measurement as the original observations.
Like the mean, the standard deviation is not resistant. A few outliers can
make s very large.


Using SPSS:
In the Example with a sample of high school students (data file called
highschoolb.sav), we can obtain the mean and standard deviation of the
reading score for the 168 students enrolled in public schools and the 32
students enrolled in private schools: 51.85 ±10.42 versus 54.25 ±9.20.
We can also compare the reading score in the form of one graph displaying
each mean with error bars extending on either side to show the standard
deviation in each group.

You can choose in SPSS:


Analyse → Descriptive Statistics → Explore …
Graphs → Legacy dialogs → Error bar …


Describing a single distribution or comparing the distributions of several
groups of observations:
• Numerical summaries can be useful for describing a single distribution as
well as for comparing the distributions of several groups of observations.
• Two important features of a distribution are its center and its spread.
• The mean and standard deviation are excellent numerical summaries for
distributions that are approximately symmetric without outliers.
• If a distribution is not symmetric, has outliers, or both, the five-number
summary provides a better, more comprehensive description.
• If a distribution is complex, with clusters or multiple peaks, for instance,
reducing the distribution to a few numbers would be misleading.

Data should always be graphed before computing and communicating
numerical summaries.
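
As an illustration, the five-number summary (minimum, first quartile, median, third quartile, maximum) can be sketched in Python for the metabolic rates of the seven men in the earlier example. The quartile rule used here is an assumption: each quartile is taken as the median of the lower or upper half of the ordered data, leaving out the overall median when n is odd. SPSS and other software may use slightly different rules and give slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # When n is odd, the overall median is excluded from both halves.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (
        data[0],
        statistics.median(lower),
        statistics.median(data),
        statistics.median(upper),
        data[-1],
    )

# Metabolic rates (Cal per 24 hours) from the dieting example.
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]
summary = five_number_summary(rates)
print(summary)
```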


Sample statistics and Parameters


• Up to now we have calculated statistics from samples.
• Any statistic of location, such as a mean or median, is always a true
measure for the sample on which it is based.
Thus the true mean of the 24 wing lengths of butterflies is 3.96 cm for
this particular sample. Similar considerations will hold for the measures of
dispersion, such as the standard deviation.
• Rarely in biology (or in science in general) are we interested in measures
of location and dispersion only as descriptive summaries of the samples
we have studied.
• Almost always, we are interested in the populations from which the
samples were taken.
We therefore would like to know, for example, not the mean of the
particular 24 wing lengths, but the true mean of wing lengths of the
butterfly population from which the 24 butterflies were sampled. When
studying dispersion, we generally wish to learn the true standard
deviations of the populations, not those of the samples.


Sample statistics and Parameters (cont.)


• The population statistics, however, are unknown and (generally speaking)
are unknowable.
Who would be able to collect all the wings of this particular butterfly
population and measure them?
• Thus, we must use sample statistics as estimators of population statistics,
or parameters.
• It is conventional in statistics to use Greek letters for population
parameters and Roman letters for sample statistics.
Thus the sample mean x̄ estimates μ, the parametric mean of the
population.
The sample variance, s², estimates the parametric variance, symbolized by
σ².
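
The distinction between a parameter and the statistic that estimates it can be sketched with a small simulation. Everything in it is hypothetical: the "population" of 10,000 wing lengths is generated by the computer (assumed normal, centred on 4.0 cm with a spread of 0.3 cm) and has no connection with the butterfly data. In real studies μ and σ are not available; they can be computed here only because the population is simulated.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population of wing lengths (cm); invented values.
population = [random.gauss(4.0, 0.3) for _ in range(10_000)]
mu = statistics.mean(population)       # parameter: population mean
sigma = statistics.pstdev(population)  # parameter: divides by N

# A random sample of 24 individuals, as in the butterfly example.
sample = random.sample(population, 24)
x_bar = statistics.mean(sample)        # statistic: estimates mu
s = statistics.stdev(sample)           # statistic: divides by n - 1

print(f"mu    = {mu:.3f}   x_bar = {x_bar:.3f}")
print(f"sigma = {sigma:.3f}   s     = {s:.3f}")
```

Drawing a different sample (for instance by changing the seed) gives different values of x̄ and s, while μ and σ stay fixed for a given population.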


Two-way tables
• Now we will describe the relationships between two categorical
variables.
• Some variables, such as sex, species, and color, are categorical by nature.
• Other categorical variables are created by grouping values of a
quantitative variable into classes, like age groups, for example.
• To analyse categorical data, we use the counts or percents of individuals
that fall into various categories.


Previously we analysed the distribution of students according to their gender
and type of school, separately, in the example of 200 high school students.

                            Number (%)
Gender            Male      91 (45.5%)
                  Female    109 (54.5%)
Type of school    Public    168 (84%)
                  Private   32 (16%)

But these data may be displayed in what is known as a contingency table (this
presentation of data is also known as a cross tabulation or cross classification):

Type of school    Male    Female    Total
Public              77        91      168
Private             14        18       32
Total               91       109      200


Table: Study participants by type of school and gender.


Type of school Male Female Total
Public 77 91 168
Private 14 18 32
Total 91 109 200

• The “Total” column at the right of the table contains the totals for each of
the rows. These row totals give the distribution of Type of school (the row
variable): 168 participants were enrolled in public schools, 32 in private
schools.
• In the same way, the “Total” row at the bottom of the table gives the
gender distribution: the study included 91 boys and 109 girls.
• Percents are often more informative than counts. We can display the
marginal distribution of type of school in terms of percents by dividing each
row total by the table total and converting to a percent.
• In the sample, 54.5% (109/200) of the students were girls and 84%
(168/200) were enrolled in public schools.
The distributions of gender alone and of type of school alone are called
marginal distributions.
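
The marginal distributions can be reproduced from the body of the table with a few lines of Python. This is a sketch for checking the arithmetic, not the SPSS procedure; the dictionary layout and variable names are chosen here for illustration.

```python
# Counts from the body of the contingency table (type of school by gender).
counts = {
    "Public": {"Male": 77, "Female": 91},
    "Private": {"Male": 14, "Female": 18},
}

# Row totals: marginal distribution of type of school.
row_totals = {school: sum(by_gender.values())
              for school, by_gender in counts.items()}

# Column totals: marginal distribution of gender.
col_totals = {}
for by_gender in counts.values():
    for gender, k in by_gender.items():
        col_totals[gender] = col_totals.get(gender, 0) + k

table_total = sum(row_totals.values())

# Convert each marginal total to a percent of the table total.
school_pct = {k: 100 * v / table_total for k, v in row_totals.items()}
gender_pct = {k: 100 * v / table_total for k, v in col_totals.items()}

print(row_totals, school_pct)
print(col_totals, gender_pct)
```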


Marginal distributions tell us nothing about the relationship between two
variables.
To describe the relationship, we must calculate some well-chosen percents
from the counts given in the body of the table.
We want to compare boys and girls in terms of their distribution by type of
school.
To do this, compare percents for boys alone with percents for girls alone.
Table: Study participants by type of school and gender.
Type of school Male Female Total
Public 77 91 168
Private 14 18 32
Total 91 109 200

To find the percent of boys who were enrolled in public schools, divide the count of
such boys by the total number of boys (the column total):
$$\frac{\text{males enrolled in public schools}}{\text{male column total}} = \frac{77}{91} = 0.846 = 84.6\%$$

Doing this for the two entries in the “male” column gives the distribution of type of
school among boys.


A conditional distribution of a variable is the distribution of values of that
variable among only individuals who have a given value of the other
variable. There is a separate conditional distribution for each value of the
other variable.
Comparing conditional distributions reveals the nature of the association
between type of school and gender.
Table: Comparison of the distribution of participants by type of school
between male and female students.
Type of school    Male n (%)    Female n (%)
Public            77 (84.6%)    91 (83.5%)
Private           14 (15.4%)    18 (16.5%)
Total             91 (100%)     109 (100%)

Only a minority of boys were enrolled in private schools, and a similar
proportion of girls were enrolled in private schools (15.4% versus 16.5%,
respectively). No gender differences were found in the choice of school.
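
The conditional distributions in the table above can be checked with a short Python sketch: each cell count is divided by its column (gender) total. The data layout and names are illustrative choices made here, not output from SPSS.

```python
# Counts from the body of the contingency table (type of school by gender).
counts = {
    "Public": {"Male": 77, "Female": 91},
    "Private": {"Male": 14, "Female": 18},
}
genders = ["Male", "Female"]

# Conditional distribution of type of school within each gender:
# divide each cell by the total of its column.
conditional = {}
for g in genders:
    col_total = sum(counts[school][g] for school in counts)
    conditional[g] = {school: 100 * counts[school][g] / col_total
                      for school in counts}

for g in genders:
    for school, pct in conditional[g].items():
        print(f"{g:6s} {school:8s} {pct:5.1f}%")
```

Within each gender the percents add up to 100%, which is a quick check that the division was done by the column total and not by the row or table total.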


Using SPSS:
Obtaining a contingency table with conditional distributions in SPSS.
Analyse → Descriptive Statistics → Crosstabs…
The conditional distribution can be obtained in the Cells sub-dialogue box by
checking Column (or Row) as shown in the display.
As we want to calculate percentages of students enrolled in private schools
and public schools, for boys alone and for girls alone, choose Percentages by
column (notice that the female variable is in the columns).

Output of the SPSS program:


Displaying relationships: scatterplots


The most useful graph for displaying the relationship between two
quantitative variables is a scatterplot.

Example: An endangered species: the manatee

Manatees are large, herbivorous, aquatic mammals found primarily in the
rivers and estuaries of Florida. This endangered species suffers from
cohabitation with human populations, and many manatees die each year
from collisions with powerboats.
We examine the relationship between the number of manatee deaths from
powerboat collisions and the number of powerboats registered in each
year between 1977 and 2012, as displayed in the next table.

(Baldi and Moore, 2012). File in SPSS: Manatees


Table: Powerboat registrations (in thousands) and manatee deaths from
powerboat collisions in Florida.

year powerboats deaths year powerboats deaths year powerboats deaths


1977 447 13 1989 711 50 2001 944 81
1978 460 21 1990 719 47 2002 962 95
1979 481 24 1991 681 55 2003 978 73
1980 498 16 1992 679 38 2004 983 69
1981 513 24 1993 678 35 2005 1010 79
1982 512 20 1994 696 49 2006 1024 92
1983 526 15 1995 713 42 2007 1027 73
1984 559 34 1996 732 60 2008 1010 90
1985 585 33 1997 755 54 2009 982 97
1986 614 33 1998 809 66 2010 942 83
1987 645 39 1999 830 82 2011 922 87
1988 675 43 2000 880 78 2012 902 81

The number of powerboats registered in Florida varies from year to year.
Does it help to explain the differences from year to year in the number of
manatee deaths from collisions with powerboats?
We suspect that “powerboats registered” will help explain “manatee deaths
from collisions”. So “powerboats registered” is the explanatory variable, and
“manatee deaths from collisions” is the response variable.


Figure: Number of manatee deaths due to powerboat collisions in Florida each
year against the number of powerboats registered (in thousands) the same year.

Scatterplot
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals.
The values of one variable appear on the horizontal axis, and the values of the
other variable appear on the vertical axis. Each individual in the data appears as
the point in the plot fixed by the values of both variables.
Always plot the explanatory variable on the horizontal axis (the x axis) of a
scatterplot. We usually call the explanatory variable x and the response variable y.
If there is no explanatory-response distinction, either variable can go on the
horizontal axis.


Examining a scatterplot
• In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
• You can describe the overall pattern of a scatterplot by the direction,
form, and strength of the relationship.
• An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern of the relationship.

Interpreting the plot in the example with manatees:

• There is a clear direction: the overall pattern moves up, from lower left to
upper right. That is, years in which powerboat registrations were higher
tend to have higher counts of manatee deaths from collisions; there is a
positive association between the two variables.
• The form of the relationship is linear. That is, the overall pattern follows a
straight line.
• The strength of a relationship in a scatterplot is determined by how closely
the points follow a clear form. The overall relationship is strong.
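
The direction read from the plot can also be checked numerically from the table of powerboat registrations and manatee deaths. The sketch below splits the 36 years at the median number of registrations and compares the mean number of deaths in the two halves. This median split is only an illustrative device chosen here; it is not a method presented in these slides, and it says nothing about form or strength.

```python
import statistics

# (year, powerboats registered in thousands, manatee deaths) from the table.
data = [
    (1977, 447, 13), (1978, 460, 21), (1979, 481, 24), (1980, 498, 16),
    (1981, 513, 24), (1982, 512, 20), (1983, 526, 15), (1984, 559, 34),
    (1985, 585, 33), (1986, 614, 33), (1987, 645, 39), (1988, 675, 43),
    (1989, 711, 50), (1990, 719, 47), (1991, 681, 55), (1992, 679, 38),
    (1993, 678, 35), (1994, 696, 49), (1995, 713, 42), (1996, 732, 60),
    (1997, 755, 54), (1998, 809, 66), (1999, 830, 82), (2000, 880, 78),
    (2001, 944, 81), (2002, 962, 95), (2003, 978, 73), (2004, 983, 69),
    (2005, 1010, 79), (2006, 1024, 92), (2007, 1027, 73), (2008, 1010, 90),
    (2009, 982, 97), (2010, 942, 83), (2011, 922, 87), (2012, 902, 81),
]

# Median of the explanatory variable (powerboat registrations).
median_boats = statistics.median(boats for _, boats, _ in data)

# Deaths in years below and above the median number of registrations.
low = [deaths for _, boats, deaths in data if boats < median_boats]
high = [deaths for _, boats, deaths in data if boats > median_boats]

low_mean = statistics.mean(low)
high_mean = statistics.mean(high)

print(f"median registrations: {median_boats} thousand")
print(f"mean deaths, lower half:  {low_mean:.1f}")
print(f"mean deaths, higher half: {high_mean:.1f}")
```

Years with more registered powerboats have, on average, clearly more manatee deaths, which agrees with the positive association seen in the scatterplot.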


Positive association, negative association
• Two variables are positively associated when high values of the two
variables tend to occur together.
• Two variables are negatively associated when high values of one variable
tend to occur with low values of the other variable.

Form
• Linear relationships, where the points show a straight line pattern.
• Curved relationships and clusters are other forms to watch for.

Strength
• The strength of a relationship is determined by how close the points in the
scatterplot lie to a simple form such as a line.

Using SPSS:
Graphs → Legacy dialogs → Scatter


References
The text of these slides may be found in the following references*:

Baldi B., D.S. Moore - The practice of statistics in the life sciences, W.H. Freeman and Company, 2012.
Cadima E.L., A.M. Caramelo, M. Afonso-Dias, P.C. Barros, M.O. Tandstad, J.I. Leiva-Moreno - Sampling methods applied to fisheries science: a manual, FAO Fisheries Technical Paper No. 434, FAO, 2005.
Fowler J., L. Cohen, P. Jarvis - Practical statistics for field biology, 2nd edition, John Wiley & Sons, Inc., 1998.
Ruxton G.D., N. Colegrave - Experimental design for the life sciences, 3rd edition, Oxford University Press, 2011.
Sokal R., F.J. Rohlf - Biometry: The principles and practice of statistics in biological research, 4th edition, W.H. Freeman and Company, 2012.
Zar J.H. - Biostatistical Analysis, 5th edition, Prentice-Hall International Inc., 2010.

* The copy and reproduction of portions created by other authors was done only for
educational use.
