Descriptive Statistics

Biostatistics I Descriptive Statistics 1
Master Degree Public Health 2016/2017
Margarida Fonseca Cardoso

The primary objective of a statistical analysis is to infer characteristics of a
group of data by analyzing the characteristics of a small sample of the
group.
This generalization from the part to the whole requires the consideration of
such important concepts as population, sample, parameter, statistic and
sampling.
An observational study observes individuals and measures variables of
interest but does not attempt to influence the responses. The purpose of an
observational study is to describe some group or situation.
An experiment, on the other hand, deliberately imposes some treatment on
individuals in order to observe their responses. The purpose of an
experiment is to study whether the treatment causes a change in the
response.
The data in a biometric study are generally based on individual observations,
which are observations or measurements taken on the smallest sampling
unit.
Example:
• If we measure weight in 100 rats, then the weight of each rat is an individual
observation; the hundred weights together represent the sample of
observations. These smallest sampling units frequently, but not necessarily,
are also individuals in the ordinary biological sense, that is, one rat.
• However, if we had studied weight in a single rat over a period of time, the
sample of individual observations would be all the weights recorded on one
rat at successive times.
• In a study of temperature in ant colonies, where each colony is a basic
sampling unit, each temperature reading for one colony is an individual
observation, and the sample of observations is the temperatures for all the
colonies considered.
ant - formiga

Populations
Biologists consider a population to be a defined group of humans or of another
species of organisms.
In statistics, population (or universe) means the totality of individual
observations about which inferences are to be made.
For example:
An investigator may desire to draw conclusions about the length of a
commercially important fish species of the Azores archipelago. The lengths of
all fish of this species are, therefore, the population under consideration.
If a study is concerned with the blood-glucose concentration in three-year-old
children, then the blood-glucose levels in all children of that age are the
population of interest.
Biologists may sample a population that does not physically exist.
Suppose an experiment is performed in which a food supplement is
administered to 40 guinea pigs, and the sample data consist of the growth
rates of these 40 animals.
Then the population about which conclusions might be drawn is the growth
rates of all the guinea pigs that conceivably might have been administered
the same food supplement under identical conditions. That population is
said to be “hypothetical” or “potential”.
The actual property measured by the individual observations is the variable.

More than one variable can be measured on each sampling unit.
Length, mass, age, temperature, number of parasites, number of petals are
examples of biological variables.
guinea pig - cobaia

Any set of data:
• contains information about some group of individuals;
• organizes that information in variables.
The techniques of descriptive statistics apply equally to data sets obtained
for a given population – the entire group of individuals about which we want
information – or from only those individuals in a smaller sample.
Using SPSS
Example: The data file called highschoolb.sav, high school and beyond,
contains 200 observations from a sample of high school students with demographic
information about the students, such as their gender (female) and type of school. It
also contains a number of scores on standardized tests, including tests of reading
(read), writing (write), mathematics (math) and social studies (socst).
Each column includes data about a different variable. Each row represents data
for one student. Some variables, like gender, simply place individuals into
categories. Others, like test scores, take numerical values with which we can do
arithmetic.

When you plan a statistical study or explore data from someone else’s work,
ask yourself the following questions:
1. Who? What individuals do the data describe? How many individuals
appear in the data?
2. What? How many variables do the data contain? What are the exact
definitions of these variables? In what units of measurements is each
variable recorded? Lengths, for example, might be recorded in inches or
in meters.
3. Why? What purpose do the data have? Do we hope to answer some
specific questions? Do we want to draw conclusions about individuals
other than those we actually have data for? Are the variables suitable for the
intended purpose?
Categorical and quantitative variables
• Categorical variable – places individuals into one of several groups or
categories.
– Nominal variables are purely qualitative and unordered, like flower
colour.
– Ordinal data can be ranked, like socioeconomic status or the Likert
scales commonly used in psychology (for example, “rate from 0 to 5,
with 0 for really dislike and 5 for really like”).
• Quantitative variable – takes numerical values for which arithmetic
operations such as adding and averaging make sense. The values of a
quantitative variable are usually recorded in a unit of measurement such
as seconds or kilograms.
– Some quantitative variables, like weight, are continuous variables that
can take any value over an interval.
– Discrete variables are quantitative variables that can take only a
limited number of values, like the number of petals in a flower.

Exploring data
• Begin by examining each variable by itself. Then move on to study the
relationships among variables.
• Begin with a graph or graphs. Then add numerical summaries of specific
aspects of the data.
The proper choice of graph depends on the nature of the variable.
Distribution of a variable
The distribution of a variable tells us what values it takes and how often it
takes these values.
The values of a categorical variable are labels for the categories. The
distribution of a categorical variable lists the categories and gives either the
count or the percent of individuals that fall in each category.
Qualitative variables
Example: The distribution of students according to their gender and type of
school.

                             Number (%)
Gender           Male         91 (45.5%)
                 Female      109 (54.5%)
Type of school   Public      168 (84%)
                 Private      32 (16%)

The data table provides both the count of individuals for each category and the
percent that each category represents in the data set. Counts are also
sometimes referred to as frequencies and percents as relative frequencies.
It is clear that the majority of the students (84%) were enrolled in public
schools.
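
The counts and percents in such a table can be reproduced in a few lines of code. The sketch below is only an illustration in Python (the course itself uses SPSS); the list of gender labels is rebuilt from the counts reported in the table, since the raw data file is not reproduced here.

```python
from collections import Counter

# Gender labels rebuilt from the counts in the table (91 male, 109 female).
gender = ["Male"] * 91 + ["Female"] * 109

counts = Counter(gender)      # frequencies (counts per category)
n = sum(counts.values())      # total number of students

# Relative frequencies expressed as percents, rounded to one decimal place.
percents = {category: round(100 * count / n, 1)
            for category, count in counts.items()}

for category in counts:
    print(f"{category:<8}{counts[category]:>5} ({percents[category]}%)")
```

The same two steps (count, then divide by the total) apply to any categorical variable, such as type of school.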

Using SPSS
Choose Analyze → Descriptive Statistics → Frequencies
Example: The distribution of students according to their gender and type of
school.
We could also make a pie chart or a bar graph.
Using SPSS: Choose Graphs → Legacy Dialogs → Pie …

Quantitative variables
Quantitative variables often take many values. The distribution of a variable
tells us what values the variable takes and how often it takes these values.
The most common graph of the distribution of one quantitative variable is a
histogram.
Example: Making a histogram: score on reading.
Using SPSS: Choose Graphs → Legacy Dialogs → Histogram …
And the graph that is automatically displayed is:
The software’s choice has too many classes, with many classes having one or
no observations.
Number of classes:
There have been several “rules of thumb” proposed to aid in deciding into
how many classes data might reasonably be grouped, for the use of too few
groups will obscure the general shape of the distribution.
But such “rules” or recommendations are rough guides, and the choice is
generally left to good judgement, bearing in mind that from 10 to 20 groups
are useful for most biological work.
Groups should be established so that each spans an interval of equal width on
the scale of the variable being measured.

Quantitative variables
Consider classes of width 5 starting at 20.
Figure: Histogram of the reading score for 200 students.
A histogram displays the distribution of one quantitative variable.

A histogram should be drawn with no extra space between consecutive
classes, to indicate that all values of the variable are covered.
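
To see what "classes of width 5 starting at 20" means in practice, the following Python sketch assigns each value to its class and prints a rough text histogram. The scores are hypothetical and serve only to illustrate the binning; the choice of classes closed on the left, [20, 25), [25, 30), ..., is also an assumption, since the slides do not say which limit is included.

```python
# Hypothetical reading scores, used only to illustrate the binning.
scores = [28, 34, 39, 42, 44, 47, 47, 50, 52, 52, 55, 57, 60, 63, 68, 73]

START, WIDTH = 20, 5   # classes [20, 25), [25, 30), ... (assumed left-closed)

def class_counts(values, start, width):
    """Count how many values fall in each class [lower, lower + width)."""
    counts = {}
    for v in values:
        lower = start + width * ((v - start) // width)  # lower limit of v's class
        counts[lower] = counts.get(lower, 0) + 1
    return counts

counts = class_counts(scores, START, WIDTH)

# Text histogram; empty classes are printed too, so no gap is hidden.
for lower in range(min(counts), max(counts) + WIDTH, WIDTH):
    print(f"{lower:>3}-{lower + WIDTH:<3} {'*' * counts.get(lower, 0)}")
```

Because every class has the same width, the heights of the bars can be compared directly.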

Interpreting histograms
• Making a statistical graph is not an end in itself. The purpose of the graph
is to help us understand the data.
• After you make a graph, always ask, “What do I see?”
Examining a histogram
In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
You can describe the overall pattern of a histogram by its shape, center,
and spread.
An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern.
Describing a distribution: reading score.
Shape: The distribution is unimodal. That is, it has a single peak, which
represents students with a reading score between 50 and 55.
The distribution is also symmetric. Real data are almost never exactly
symmetric. We are content to describe the histogram of the reading score as
roughly symmetric.
Center: The midpoint of the distribution is about 45 to 55.
Spread: The spread is from 25 to 80.

Describing a distribution: reading score.
Outliers:
In the figure, the observations less than 30 and greater than 70 are part of the
continuous range of reading scores and do not stand apart from the overall
distribution.
If you spot possible outliers, look for an explanation. Some outliers
are due to mistakes, such as typing 4.5 instead of 45.5. Other outliers point
to the special nature of some observations. For instance, a score of 4.5 could
simply belong to an individual who did not finish the reading test.
Symmetric and skewed distributions

A distribution is symmetric if the right and left sides of the histogram are
approximately mirror images of each other.
A distribution is skewed to the right, or positively skewed, if the right side of
the histogram extends much farther out than the left side.
It is skewed to the left, or negatively skewed, if the left side of the histogram
extends much farther out than the right side.

An additional question we could ask is what proportion of students passed
the reading test.
Considering that students pass the test if they score 45 or more, a new variable
can be created.
Using SPSS to create the new variable ReadPass:
Choose Transform → Visual Binning …
And then choose Analyze → Descriptive Statistics → Frequencies
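
The idea of deriving a pass/fail variable from a score can be sketched as follows. This is an illustration in Python, not the SPSS procedure; the scores are hypothetical, and only the cut-off of 45 comes from the text.

```python
# Hypothetical reading scores; the cut-off of 45 comes from the text.
scores = [28, 34, 39, 42, 44, 45, 47, 50, 52, 55, 57, 60, 63, 68, 73, 76]
PASS_MARK = 45

# New binary variable: True if the student scored 45 or more.
read_pass = [score >= PASS_MARK for score in scores]

n_pass = sum(read_pass)                    # each True counts as 1
proportion_pass = n_pass / len(scores)
print(f"{n_pass} of {len(scores)} passed ({proportion_pass:.1%})")
```

The new variable is categorical, so its distribution is summarized with counts and percents, exactly as for gender or type of school.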
Example: Guinea pig survival times

The figure displays the survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment
(Baldi and Moore, 2012). File in SPSS: SurvivalGuineaPigs

The distribution is single-peaked and skewed to the right.

Most guinea pigs have a short survival time, between 50 and 150 days.
However, some animals survive longer, so that the graph extends to the
right of its peak much farther than it extends to the left.
The survival times range from 0 to 600. However, almost all infected
guinea pigs die within 250 days.
A few guinea pigs survive much longer, maybe because they had some
immunity to the bacteria used for the experiment.
The overall shape of a distribution is important information about a
variable.
Many biological measurements on the same species and sex have
symmetric distributions, for example birth weights or the heights of young
women.
Survival times (of patients after an organ transplant, or of lab animals
following an experimental inoculation) have distributions that are typically
strongly skewed to the right.
Many distributions have irregular shapes that are neither symmetric nor
skewed.
Use your eyes, describe what you see, and then try to explain it.

Stemplots
For small data sets, stemplots are quicker to make and easier to interpret.
They display the raw data, that is, they show each one of the values in the
data set.
To make a stemplot:
1. Separate each observation into a stem, consisting of all but the final
(rightmost) digit, and a leaf, the final digit. Stems may have as many
digits as needed, but each leaf shows only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and
draw a vertical line at the right of the column.
3. Write each leaf in the row to the right of its stem, in increasing order out
from the stem.
SPSS displays stemplots automatically.
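
The three steps for making a stemplot by hand can also be sketched in code. The Python function below is a minimal illustration for non-negative whole numbers; the function name and the scores are hypothetical, and the layout is simpler than the SPSS output.

```python
from collections import defaultdict

def stemplot(values):
    """Return stemplot rows for non-negative integers (leaf = final digit)."""
    leaves = defaultdict(list)
    for v in sorted(values):              # sorting puts leaves in increasing order
        stem, leaf = divmod(v, 10)        # all but the final digit, final digit
        leaves[stem].append(str(leaf))
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):   # smallest stem at the top
        rows.append(f"{stem:>2} | {''.join(leaves.get(stem, []))}")
    return rows

# Hypothetical scores, for illustration only.
scores = [47, 52, 34, 68, 50, 44, 57, 63, 39, 52, 28, 73]
print("\n".join(stemplot(scores)))
```

Stems with no observations are still printed, so that gaps in the data remain visible.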
Using SPSS: Choose Analyze → Descriptive Statistics → Explore …
And the stemplot is automatically displayed:

reading score Stem-and-Leaf Plot

 Frequency    Stem &  Leaf
     1,00        2 .  8
     7,00        3 .  1444444
    14,00        3 .  56667799999999
    30,00        4 .  112222222222222334444444444444
    31,00        4 .  5567777777777777777777777777778
    34,00        5 .  0000000000000000002222222222222234
    27,00        5 .  555555555555577777777777777
    26,00        6 .  00000000013333333333333333
    21,00        6 .  555555555688888888888
     7,00        7 .  1133333
     2,00        7 .  66

 Stem width:  10,00
 Each leaf:   1 case(s)

Describing distributions with numbers – quantitative variables
• In samples one generally finds a preponderance of values somewhere
around the middle of the range of observed values.
• The description of this concentration near the middle is an average or a
measure of central tendency. It is also termed a measure of location, for
it indicates where, along the measurement scale, the sample is located.
• In addition to a description of the central tendency of a set of data, it is
generally desirable to have a description of the variability, or the
dispersion, of the data.
• Measurements that are concentrated around the center of a distribution
of data have low variability (low dispersion), whereas data that are very
spread out along the measurement scale have high variability (high
dispersion).
Measuring center: the mean
The mean
To find the mean of a set of observations, add their values and divide by the
number of observations. If the n observations are X1, X2, …, Xn, their mean is:
\[ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i \]
Example: What is the mean length of 24 butterfly wings?

Xi (in centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
(Zar, 2010). File in SPSS: WingLength

In practice, you can key the data into your calculator or software and ask for
the arithmetic mean.
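
As an illustration outside SPSS, the mean of the 24 wing lengths can be computed in Python directly from its definition. The data are those listed in the example; the variable names are arbitrary.

```python
# Wing lengths (cm) of the 24 butterflies from the example.
wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]

n = len(wing_lengths)
mean = sum(wing_lengths) / n      # add the values, divide by their number
print(f"n = {n}, mean = {mean:.2f} cm")
```

The values sum to 95.0 cm, so the mean is 95.0/24, about 3.96 cm.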
Using SPSS:
In the Example with a sample of high school students (data file called
highschoolb.sav), the mean of the reading score for the 200 students is 52.23.
There are a lot of procedures for obtaining the arithmetic mean in SPSS.
You can choose:
Analyze → Descriptive Statistics → Descriptives …
Analyze → Descriptive Statistics → Explore …
Measuring center: the median

The median is the midpoint of a distribution, the measurement such that
half the observations are smaller and the other half are larger.
The median is typically defined as the middle measurement in an ordered set
of data.
To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median is the center
observation in the ordered list. Find the location of the median by
counting (n+1)/2 observations up from the smallest observation in the
list.
3. If the number of observations n is even, the median is the mean of the
two center observations in the ordered list. The location of the median is
again (n+1)/2, counting from the smallest observation in the list.
Note that the formula (n+1)/2 does not give the median, just the location of
the median in the ordered list.

Example: What is the median wing length for our 24 butterflies?

Here are the data again in order, Xi (in centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
The number of observations is even. There is no center observation, but there
is a center pair: the 12th and 13th observations, both equal to 4.0.
When n = 24, the rule for locating the median in the list gives
\[ \text{location of } Md = \frac{n + 1}{2} = \frac{25}{2} = 12.5 \]
The location 12.5 means “halfway between the 12th and 13th observations in
the ordered list”.
\[ Md = \frac{4.0 + 4.0}{2} = 4.0 \text{ cm} \]
(Zar, 2010).
In the example of wing lengths of butterflies we found that the mean and the
median were very similar, at 3.96 and 4.00 cm, respectively. The distribution
of the 24 wing lengths was roughly symmetric and did not have any
outliers.
The mean is the center of gravity of the histogram. That is, if the histogram
were made of solid material, it would balance horizontally with the fulcrum at
the mean. The median divides the histogram into two equal areas.
But what would happen to the relationship between the mean and the
median of a data set with a marked skew or extreme outliers?

The figure displays the survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment:
The distribution is noticeably skewed to the right and has some potential high
outliers.
The mean survival time is 141.9 days, whereas the median survival time is only
102.5 days.
The mean is pulled toward the right tail of this right-skewed distribution.
If the longest survival time were increased, the mean would increase, but the
median would not change at all. The mean uses the actual value of each
observation and is, therefore, very sensitive to any extreme values.
(Baldi and Moore, 2012).
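
The different behaviour of the mean and the median can be checked with a small numerical experiment. The survival times in this Python sketch are hypothetical (they are not the guinea pig data); the second list is the same sample with its longest time doubled.

```python
from statistics import mean, median

# Hypothetical survival times (days), for illustration only.
times = [50, 70, 80, 100, 110, 130, 160, 420]
stretched = times[:-1] + [840]     # same data, longest survival time doubled

print(mean(times), median(times))            # original sample
print(mean(stretched), median(stretched))    # after stretching the largest value
```

Only the mean moves (from 140 to 192.5 days here); the median stays at 105 days, because the center pair of the ordered list is unchanged.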
Comparing the mean and the median

The mean and median of a symmetric distribution are close together.
If the distribution is exactly symmetric, the mean and median are exactly
the same.
In a skewed distribution, the mean is usually farther out in the long tail
than is the median.
Using SPSS:
There are a lot of procedures for obtaining the arithmetic mean and the
median in SPSS.
You can choose:

Many biological variables have distributions that are skewed to the right.
Survival times, such as in the guinea pig inoculation experiment, are typically
right skewed.
When dealing with strongly skewed distributions, it is customary to report
the median rather than the mean.
However, a health organization or government agency may need to include all
survival times, and thus calculate the mean, to estimate the costs of medical
care for a given disease and to plan medical staffing appropriately. Relying
only on the median would result in underestimating the medical and
financial needs.
The mean and median measure center in different ways, and both are
useful.
Other measures of central tendency
The Geometric mean

The geometric mean can be calculated as the antilogarithm of the mean of
the logarithms of the data (where the logarithm can be in any base).
\[ M_g = \operatorname{antilog}\left( \frac{1}{n} \sum_{i=1}^{n} \log X_i \right) \]
In the particular case of the natural logarithm (base e):
\[ M_g = \exp\left( \frac{1}{n} \sum_{i=1}^{n} \ln X_i \right) \]
The geometric mean is appropriate to use only for quantitative data and only
when all of the data are positive (that is, greater than zero).
Geometric mean is sometimes used as a measure of location when the data
are highly skewed to the right.

In the example of survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment:
The mean survival time is 141.9 days, whereas the median survival time is only
102.5 days.
The geometric mean is 118.1 days.
We will see later that variables are sometimes transformed to logarithms. If we
compute the mean of such a transformed variable and then change the mean back
into the original scale, the mean will not be the same as if we had computed the
arithmetic mean of the original variable.
(Baldi and Moore, 2012).
Using SPSS:
Obtaining the geometric mean in SPSS.
Analyze → Reports → Summarize cases …
The geometric mean (as well as the mean, median, …) can be obtained in the
Statistics sub-dialogue box as shown in the display.

Measures of variability and dispersion.
In addition to a description of the central tendency of a set of data, it is
generally desirable to have a description of the variability, or of the
dispersion of the data.
A measure of variability is an indication of the spread of measurements
around the center of the distribution.
Measurements that are concentrated around the center of a distribution of data
have low variability (low dispersion).
Data that are very spread out along the measurement scale have high
variability (high dispersion).
Measuring spread: the range

The difference between the highest and lowest measurements in a group of
data is termed the range.
If sample measurements are arranged in increasing order of magnitude, then
Sample range = Xn – X1
which is
Sample range = largest X – smallest X
In the example with wing length for 24 butterflies, with ordered data, Xi (in
centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
The range may be expressed as 3.3 to 4.5 cm, or as 4.5 – 3.3 = 1.2 cm.

The range is a relatively crude measure of dispersion, inasmuch as it does
not take into account any measurement except the highest and the lowest.
For example, the guinea pig survival times range from 43 to 598 days. These
single observations show the full spread of data, but they may be outliers.
Furthermore, it is unlikely that a sample will contain both the highest and
lowest values in the population, so the sample range usually underestimates
the population range.
Nonetheless, it is considered useful by some to present the sample range
as an estimate (although a poor one) of the population range.
Whenever the range is specified in reporting data, however, it is a good
practice to report another measure of dispersion as well.
The range is applicable to ordinal data and quantitative data.
Measuring spread: the quartiles

We can improve our description of spread by also looking at the spread of the
middle half of the data.
If the data are divided into four equal parts, we speak of quartiles.
• One-fourth (25%) of all the ranked observations are smaller than the first
quartile, one-fourth (25%) lie between the first and second quartile, one-
fourth (25%) lie between the second and third quartile, and one-fourth
(25%) are larger than the third quartile.
• In other words, the first quartile is larger than 25% of the observations,
and the third quartile larger than 75% of the observations. The second
quartile is the median that is larger than 50% of the observations.

The quartiles Q1 and Q3

To calculate the quartiles:
• Arrange the observations in increasing order and locate the median Md in
the ordered list of observations.
• The location of the first quartile Q1:
location = (n+1)/4 or location = 0.25 x (n + 1)
• The location of the third quartile Q3:
location = 0.75 x (n + 1)
If the location is not an integer or half-integer, then it is rounded up to the
nearest integer or half integer.
In the example with wing length for 24 butterflies, with ordered data, Xi (in
centimetres):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
location of Q1 = 0.25 x (24+1) = 6.25 (which we round up to 6.5, halfway between the 6th and 7th observations) => Q1 = (3.8 + 3.8)/2 = 3.8 cm
location of Q3 = 0.75 x (24+1) = 18.75 (which we round up to 19) => Q3 = 4.2 cm
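
The location rule for the quartiles can be sketched in Python as follows. The functions are hypothetical helpers that implement the rule as stated (location 0.25 or 0.75 times n + 1, rounded up to the nearest integer or half-integer); they assume the resulting location lies between 1 and n. Statistical packages use several different quartile conventions, so their results may differ slightly from this rule.

```python
import math

def value_at(ordered, location):
    """Value at a 1-based location that is an integer or a half-integer."""
    below = ordered[math.floor(location) - 1]
    above = ordered[math.ceil(location) - 1]
    return (below + above) / 2     # equals the single value when location is whole

def quartile(values, fraction):
    """Quartile at location fraction * (n + 1), rounded up to the nearest 0.5."""
    ordered = sorted(values)
    location = fraction * (len(ordered) + 1)
    location = math.ceil(location * 2) / 2     # round up to integer or half-integer
    return value_at(ordered, location)

wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]

q1 = quartile(wing_lengths, 0.25)
q3 = quartile(wing_lengths, 0.75)
print(f"Q1 = {q1} cm, Q3 = {q3} cm")
```

For the wing lengths this gives Q1 = 3.8 cm and Q3 = 4.2 cm, in agreement with the worked example.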
The five-number summary and boxplots

The smallest and largest observations tell us little about the distribution as a
whole, but they give information about the tails of the distribution that is
missing if we know only Q1 , Md and Q3.
To get a quick summary of both center and spread, combine all five numbers.
The five-number summary of a distribution consists of the smallest
observation, the first quartile, the median, the third quartile, and the largest
observation, written in order from smallest to largest:
Minimum Q1 Md Q3 Maximum
The five number summary from the wing length example is:
3.3 3.8 4.0 4.2 4.5

The five number summary of a distribution leads to a new graph, the boxplot.
[Figure: boxplot of the wing lengths, labelled Max = 4.5, Q3 = 4.2, Md = 4.0, Q1 = 3.8, Min = 3.3]
A boxplot is a graph of the five-number summary.

A central box spans the quartiles Q1 and Q3.
A line in the box marks the median Md.
Lines extend from the box out to the smallest and largest observations.
Dispersion measured with quantiles:

The interquartile range
The interquartile range is the distance between the first and third
quartiles (i.e., the 25th and 75th percentiles).
Interquartile range = Q3 - Q1
The semi-interquartile range (or quartile deviation)
Semi-interquartile range = (Q3 - Q1)/2
For our data on wing lengths:

Interquartile range = 4.2 – 3.8 = 0.4
Semi-interquartile range = (4.2 – 3.8)/2 = 0.4/2=0.2
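
The five-number summary and the two quartile-based measures of spread can be put together in a few lines. In this Python sketch the quartiles are simply taken from the worked example rather than recomputed; the rounding step only removes binary floating-point noise from the subtraction.

```python
from statistics import median

wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]
q1, q3 = 3.8, 4.2      # quartiles taken from the worked example

# Minimum, Q1, Md, Q3, Maximum, written in order from smallest to largest.
five_number = (min(wing_lengths), q1, median(wing_lengths), q3, max(wing_lengths))

iqr = round(q3 - q1, 10)           # interquartile range
semi_iqr = round(iqr / 2, 10)      # semi-interquartile range (quartile deviation)

print(five_number)
print(f"IQR = {iqr} cm, semi-IQR = {semi_iqr} cm")
```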

Using SPSS:
Obtaining the quartiles, boxplot, etc in SPSS.
The boxplot is automatically displayed.
The quartiles can be obtained in the Statistics sub-dialogue box by checking
Percentiles as shown in the display.
Measuring spread: the standard deviation

The standard deviation and its close relative, the variance, measure spread by
looking at how far the observations are from the mean.
The variance s² of a sample set of observations is an average of the squares of
the deviations of the observations from their mean:
\[ s^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \dots + (X_n - \bar{X})^2}{n - 1} \]
or
\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
The standard deviation s is the square root of the variance s²:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \]
The most common numerical description of a distribution is the mean to
measure center and the standard deviation to measure spread.

Calculating the standard deviation

Example: A person’s metabolic rate is the rate at which the body
consumes energy. Metabolic rate is important in studies of weight gain,
dieting, and exercise. Here are the metabolic rates of 7 men who took part in
a study of dieting. The units are kilocalories (Cal) for a 24-hour period. These
are the same calories used to describe the energy content of foods.
1792 1666 1362 1614 1460 1867 1439
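First find the mean:
\[ \bar{X} = \frac{1792 + 1666 + \dots + 1439}{7} = \frac{11200}{7} = 1600 \text{ Cal} \]
The variance:
\[ s^2 = \frac{(1792 - 1600)^2 + (1666 - 1600)^2 + \dots + (1439 - 1600)^2}{7 - 1} = \frac{214870}{6} = 35811.67 \text{ Cal}^2 \]
The standard deviation is the square root of the variance:
\[ s = \sqrt{35811.67} = 189.24 \text{ Cal} \]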
The researchers reported the mean, 1600 Cal, and the standard deviation,
189.24 Cal.
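
The same calculation can be followed step by step in code. This Python sketch uses the seven metabolic rates from the example and computes the variance from its definition, dividing by n - 1.

```python
import math

rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]   # Cal per 24 hours

n = len(rates)
mean = sum(rates) / n
squared_deviations = [(x - mean) ** 2 for x in rates]
variance = sum(squared_deviations) / (n - 1)   # divide by n - 1, not n
sd = math.sqrt(variance)

print(f"mean = {mean:.0f} Cal")
print(f"variance = {variance:.2f} Cal^2, sd = {sd:.2f} Cal")
```

The deviations from the mean always add up to zero, which is why they are squared before being averaged.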
The standard deviation:

The standard deviation measures spread about the mean and should be
used only when the mean is chosen as the measure of center.
The standard deviation is always zero or greater than zero.
• s = 0 only when there is no spread, that is, when all the observations have
the same value.
• s > 0 when not all the observations have the same value. As the
observations become more spread out about their mean, s gets larger.
s has the same units of measurement as the original observations.
Like the mean, the standard deviation is not resistant. A few outliers can
make s very large.

Using SPSS:
In the Example with a sample of high school students (data file called
highschoolb.sav), we can obtain the mean and standard deviation of the
reading score for the 168 students enrolled in public schools and the 32
students enrolled in private schools: 51.85 ±10.42 versus 54.25 ±9.20.
We can also compare the reading score in the form of one graph displaying
each mean with error bars extending on either side to show the standard
deviation in each group.
You can choose in SPSS:

Graphs → Legacy dialogs → Error bar …
Describing a single distribution or comparing the distributions of several
groups of observations:
Numerical summaries can be useful for describing a single distribution as
well as for comparing the distributions of several groups of observations.
Two important features of a distribution are its center and its spread.
The mean and standard deviation are excellent numerical summaries for
distributions that are approximately symmetric without outliers.
If a distribution is not symmetric, has outliers, or both, the five-number
summary provides a better, more comprehensive description.
If a distribution is complex, with clusters or multiple peaks, for instance,
reducing the distribution to a few numbers would be misleading.
Data should always be graphed before computing and communicating
numerical summaries.

Sample statistics and Parameters

• Up to now we have calculated statistics from samples.
• Any statistic of location, such as a mean or median, is always a true
measure for the sample on which it is based.
Thus the true mean of the 24 wing lengths of butterflies is 3.96 cm for
this particular sample. Similar considerations will hold for the measures of
dispersion, such as the standard deviation.
• Rarely in biology (or in science in general) are we interested in measures
of location and dispersion only as descriptive summaries of the samples
we have studied.
• Almost always, we are interested in the populations from which the
samples were taken.
We therefore would like to know, for example, not the mean of the
particular 24 wing lengths, but the true mean of wing lengths of the
butterfly population from which the 24 butterflies were sampled. When
studying dispersion, we generally wish to learn the true standard
deviations of the populations, not those of the samples.
Sample statistics and Parameters (cont.)

• The population statistics, however, are unknown and (generally speaking)
are unknowable.
Who would be able to collect all the wings of this particular butterfly
population and measure them?
• Thus, we must use sample statistics as estimators of population statistics,
or parameters.
• It is conventional in statistics to use Greek letters for population
parameters and Roman letters for sample statistics.
Thus the sample mean estimates μ, the parametric mean of the
population.
The sample variance s² estimates a parametric variance, symbolized by
σ².

Two-way tables
• Now we will describe the relationships between two categorical
variables.
• Some variables, such as sex, species, and color, are categorical by nature.
• Other categorical variables are created by grouping values of a
quantitative variable into classes, like age groups, for example.
• To analyse categorical data, we use the counts or percents of individuals
that fall into various categories.
Previously we analysed the distribution of students according to their gender
and type of school, separately, in the example of 200 high school students.

                             Number (%)
Gender           Male         91 (45.5%)
                 Female      109 (54.5%)
Type of school   Public      168 (84%)
                 Private      32 (16%)

But these data may be displayed in what is known as a contingency table (this
presentation of data is also known as a cross tabulation or cross classification).

Type of school   Male   Female   Total
Public             77       91     168
Private            14       18      32
Total              91      109     200

Table: Study participants by type of school and gender.

Type of school   Male   Female   Total
Public             77       91     168
Private            14       18      32
Total              91      109     200
• The “Total” column at the right of the table contains the totals for each of
the rows. These row totals give the distribution of type of school (the row
variable): 168 participants were enrolled in public schools, 32 in private
schools.
• In the same way, the “Total” row at the bottom of the table gives the
gender distribution: the study included 91 boys and 109 girls.
• Percents are often more informative than counts. We can display the
marginal distribution of type of school in terms of percents by dividing each
row total by the table total and converting to a percent.
• In the sample, 54.5% (109/200) of the students were girls and 84%
(168/200) were enrolled in public schools.
The distributions of gender alone and of type of school alone are called
marginal distributions.
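
The marginal distributions are obtained by summing the body of the table along its rows and columns. The Python sketch below stores the four counts of the contingency table in a nested dictionary (a layout chosen only for this illustration) and derives the totals and percents.

```python
# Counts from the two-way table: rows = type of school, columns = gender.
table = {
    "Public":  {"Male": 77, "Female": 91},
    "Private": {"Male": 14, "Female": 18},
}

# Row totals: distribution of type of school.
row_totals = {school: sum(cells.values()) for school, cells in table.items()}

# Column totals: distribution of gender.
col_totals = {}
for cells in table.values():
    for gender, count in cells.items():
        col_totals[gender] = col_totals.get(gender, 0) + count

grand_total = sum(row_totals.values())

# Marginal distributions as percents of the table total.
school_pct = {k: round(100 * v / grand_total, 1) for k, v in row_totals.items()}
gender_pct = {k: round(100 * v / grand_total, 1) for k, v in col_totals.items()}

print(row_totals, school_pct)
print(col_totals, gender_pct)
```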
Marginal distributions tell us nothing about the relationship between two
variables.
To describe the relationship, we must calculate some well-chosen percents
from the counts given in the body of the table.
We want to compare boys and girls in terms of their distribution by type of
school.
To do this, compare percents for boys alone with percents for girls alone.
Table: Study participants by type of school and gender.

Type of school   Male   Female   Total
Public             77       91     168
Private            14       18      32
Total              91      109     200

To find the percent of boys who were enrolled in public schools, divide the count of
such boys by the total number of boys (the column total):
$$\frac{\text{males enrolled in public schools}}{\text{male column total}} = \frac{77}{91} = 0.846 = 84.6\%$$
Doing this for the two entries in the “male” column gives the distribution of type of
school among boys.

A conditional distribution of a variable is the distribution of values of that
variable among only individuals who have a given value of the other
variable. There is a separate conditional distribution for each value of the
other variable.
Comparing conditional distributions reveals the nature of the association
between type of school and gender.
Table: Comparison of the distribution of participants by type of school between male and female students.

Type of school   Male, n (%)   Female, n (%)
Public           77 (84.6%)    91 (83.5%)
Private          14 (15.4%)    18 (16.5%)
Total            91 (100%)     109 (100%)
Only a minority of boys were enrolled in private schools, and a similar
proportion of girls were enrolled in private schools (15.4% versus 16.5%,
respectively). In this sample, the distribution by type of school was almost
the same for boys and girls.
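
As a rough check of the conditional distributions, the column percents can be recomputed from the counts. This is a minimal Python sketch, not the SPSS procedure used in the course; the variable names are chosen here for illustration.

```python
# Counts from the contingency table: rows are type of school, columns are gender.
counts = {
    "Public":  {"Male": 77, "Female": 91},
    "Private": {"Male": 14, "Female": 18},
}
genders = ["Male", "Female"]

# Column totals: the number of boys and the number of girls.
col_totals = {g: sum(counts[school][g] for school in counts) for g in genders}

# Conditional distribution of type of school within each gender:
# each cell count divided by its column total, as a percent.
conditional = {
    g: {school: round(100 * counts[school][g] / col_totals[g], 1) for school in counts}
    for g in genders
}

for g in genders:
    print(g, conditional[g])
```

Each gender's percents add up to 100% (up to rounding), because every conditional distribution is a complete distribution in its own right.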
Using SPSS:
Obtaining a contingency table with conditional distributions in SPSS.
Analyse → Descriptive Statistics → Crosstabs…
The conditional distribution can be obtained in the Cells sub-dialogue box by
checking Column (or Row) as shown in the display.
As we want to calculate percentages of students enrolled in private schools
and public schools, for boys alone and for girls alone, choose Percentages by
column (notice that the female variable is in columns).
Output of the SPSS program:

Displaying relationships: scatterplots

The most useful graph for displaying the relationship between two
quantitative variables is a scatterplot.
Example: An endangered species: the manatee

Manatees are large, herbivorous, aquatic mammals found primarily in the
rivers and estuaries of Florida. This endangered species suffers from
cohabitation with human populations, and many manatees die each year
from collisions with powerboats.
We examine the relationship between the number of manatee deaths from
powerboat collisions and the number of powerboats registered in each
year from 1977 to 2012, as displayed in the next table
(Baldi and Moore, 2012). File in SPSS: Manatees
Table: Powerboat registrations (in thousands) and manatee deaths from powerboat collisions in Florida.

year powerboats deaths year powerboats deaths year powerboats deaths
1977 447 13 1989 711 50 2001 944 81
1978 460 21 1990 719 47 2002 962 95
1979 481 24 1991 681 55 2003 978 73
1980 498 16 1992 679 38 2004 983 69
1981 513 24 1993 678 35 2005 1010 79
1982 512 20 1994 696 49 2006 1024 92
1983 526 15 1995 713 42 2007 1027 73
1984 559 34 1996 732 60 2008 1010 90
1985 585 33 1997 755 54 2009 982 97
1986 614 33 1998 809 66 2010 942 83
1987 645 39 1999 830 82 2011 922 87
1988 675 43 2000 880 78 2012 902 81
The number of powerboats registered in Florida varies from year to year.
Does it help to explain the differences from year to year in the number of
manatee deaths from collisions with powerboats?
We suspect that “powerboats registered” will help explain “manatee deaths
from collisions”. So “powerboats registered” is the explanatory variable, and
“manatee deaths from collisions” is the response variable.

Figure: Number of manatee deaths due to powerboat collisions in Florida each year against the number of powerboats registered (in thousands) the same year.
Scatterplot
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals.
The values of one variable appear on the horizontal axis, and the values of the
other variable appear on the vertical axis. Each individual in the data appears as
the point in the plot fixed by the values of both variables.
Always plot the explanatory variable on the horizontal axis (the x axis) of a
scatterplot. We usually call the explanatory variable x and the response variable y.
If there is no explanatory-response distinction, either variable can go on the
horizontal axis.
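
To make the construction concrete, the sketch below draws a coarse text scatterplot in Python. It is only an illustration: the function `text_scatter` is a hypothetical helper written here, and to keep the example short it uses every fourth year from the manatee table rather than all 36 years.

```python
# Every fourth year from the manatee table:
# (year, powerboats registered in thousands, manatee deaths).
rows = [
    (1977, 447, 13), (1981, 513, 24), (1985, 585, 33),
    (1989, 711, 50), (1993, 678, 35), (1997, 755, 54),
    (2001, 944, 81), (2005, 1010, 79), (2009, 982, 97),
]


def text_scatter(points, width=40, height=12):
    """Return a coarse text scatterplot: x runs left to right, y bottom to top.

    Assumes at least two distinct x values and two distinct y values.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    return "\n".join("".join(line) for line in grid)


# The explanatory variable (powerboats) goes on x, the response (deaths) on y.
points = [(boats, deaths) for _, boats, deaths in rows]
plot = text_scatter(points)
print(plot)
```

Each `*` is one year, placed by its pair of values. Statistical software such as SPSS produces the same kind of display at a much finer resolution.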
Examining a scatterplot
In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
You can describe the overall pattern of a scatterplot by the direction,
form, and strength of the relationship.
An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern of the relationship.
Interpreting the plot in the example with manatees:

There is a clear direction: the overall pattern moves up, from lower left to
upper right. That is, years in which powerboat registrations were higher
tend to have higher counts of manatee deaths from collisions – there is a
positive association between the two variables.
The form of the relationship is linear. That is, the overall pattern roughly follows a
straight line.
The strength of a relationship in a scatterplot is determined by how closely
the points follow a clear form. Here the overall relationship is strong.
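
The direction seen in the plot can also be checked numerically. The Python sketch below splits the years at the median number of registrations and compares the average number of deaths in the two halves. The median split is an illustrative choice made here, not a method from the slides, and it says nothing about form or strength.

```python
from statistics import mean, median

# (year, powerboats registered in thousands, manatee deaths) from the table.
data = [
    (1977, 447, 13), (1978, 460, 21), (1979, 481, 24),
    (1980, 498, 16), (1981, 513, 24), (1982, 512, 20),
    (1983, 526, 15), (1984, 559, 34), (1985, 585, 33),
    (1986, 614, 33), (1987, 645, 39), (1988, 675, 43),
    (1989, 711, 50), (1990, 719, 47), (1991, 681, 55),
    (1992, 679, 38), (1993, 678, 35), (1994, 696, 49),
    (1995, 713, 42), (1996, 732, 60), (1997, 755, 54),
    (1998, 809, 66), (1999, 830, 82), (2000, 880, 78),
    (2001, 944, 81), (2002, 962, 95), (2003, 978, 73),
    (2004, 983, 69), (2005, 1010, 79), (2006, 1024, 92),
    (2007, 1027, 73), (2008, 1010, 90), (2009, 982, 97),
    (2010, 942, 83), (2011, 922, 87), (2012, 902, 81),
]

# Split the years at the median of the explanatory variable.
cut = median(boats for _, boats, _ in data)
low = [deaths for _, boats, deaths in data if boats < cut]
high = [deaths for _, boats, deaths in data if boats > cut]

mean_low = mean(low)
mean_high = mean(high)
print(f"mean deaths, fewer registrations: {mean_low:.1f}")
print(f"mean deaths, more registrations:  {mean_high:.1f}")

# High values of one variable tend to occur with high values of the other.
positive_association = mean_high > mean_low
print("positive association:", positive_association)
```

A higher average in the years with more registrations is consistent with the positive association described above.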

Positive association, negative association

Two variables are positively associated when high values of the two
variables tend to occur together.
Two variables are negatively associated when high values of one variable
tend to occur with low values of the other variable.
Form
Linear relationships, where the points show a straight line pattern.
Curved relationships and clusters are other forms to watch for.
Strength
The strength of a relationship is determined by how close the points in the
scatterplot lie to a simple form such as a line.
Using SPSS:
Graphs → Legacy dialogs → Scatter
References
The text of these slides may be found in the following references*:
Baldi B., D.S. Moore - The practice of statistics in the life sciences, W.H. Freeman and
Company, 2012.
Cadima E.L., A.M. Caramelo, M. Afonso-Dias, P.C. Barros, M.O. Tandstad, J.I. Leiva-Moreno -
Sampling methods applied to fisheries science: a manual, FAO Fisheries
Technical Paper No. 434, FAO, 2005.
Fowler J., L. Cohen, P. Jarvis - Practical statistics for field biology, 2nd edition, John
Wiley & Sons, Inc., 1998.
Ruxton G.D., N. Colegrave - Experimental design for the life sciences, 3rd edition,
Oxford University Press, 2011.
Sokal R., F.J. Rohlf - Biometry: the principles and practice of statistics in biological
research, 4th edition, W.H. Freeman and Company, 2012.
Zar J.H. - Biostatistical Analysis, 5th edition, Prentice-Hall International Inc., 2010.
* The copy and reproduction of portions created by other authors was done only for
educational use.