Descriptive Statistics
This generalization from the part to the whole requires the consideration of
such important concepts as population, sample, parameter, statistic and
sampling.
Biostatistics I Descriptive Statistics 3
Master Degree Public Health 2016/2017
Example:
• If we measure weight in 100 rats, then the weight of each rat is an individual
observation; the hundred weights together represent the sample of
observations. These smallest sampling units frequently, but not necessarily,
are also individuals in the ordinary biological sense, that is, one rat.
• However, if we had studied weight in a single rat over a period of time, the
sample of individual observations would be all the weights recorded on one
rat at successive times.
• In a study of temperature in ant colonies, where each colony is a basic
sampling unit, each temperature reading for one colony is an individual
observation, and the sample of observations is the temperatures for all the
colonies considered.
ant - formiga
Populations
Biologists consider a population to be a defined group of humans or of another
species of organisms.
In statistics, population (or universe) means the totality of individual
observations about which inferences are to be made.
For example:
Using SPSS
Example: The data file called highschoolb.sav, high school and beyond,
contains 200 observations from a sample of high school students with demographic
information about the students, such as their gender (female) and type of school. It
also contains a number of scores on standardized tests, including tests of reading
(read), writing (write), mathematics (math) and social studies (socst).
Each column includes data about a different variable. Each row represents data
for one student. Some variables, like gender, simply place individuals into
categories. Others, like scores, take numerical values with which we can do
arithmetic.
When you plan a statistical study or explore data from someone else’s work,
ask yourself the following questions:
Exploring data
• Begin by examining each variable by itself. Then move on to study the
relationships among variables.
• Begin with a graph or graphs. Then add numerical summaries of specific
aspects of the data.
The proper choice of graph depends on the nature of the variable.
Distribution of a variable
The distribution of a variable tells us what values it takes and how often it takes
these values.
The values of a categorical variable are labels for the categories. The
distribution of a categorical variable lists the categories and gives either the
count or the percent of individuals that fall in each category.
Qualitative variables
Number (%)
It is clear that the majority of the students (84%) were enrolled in public
schools.
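The count-and-percent summary of a categorical variable can be sketched in a few lines of Python. This is only an illustration, not the SPSS procedure: the list `school_type` is rebuilt from the counts reported in these slides (168 public, 32 private), and the helper name `distribution` is an assumption made for the example.

```python
from collections import Counter

# School type for the 200 students, rebuilt from the counts in the slides
# (168 public, 32 private). The real data live in highschoolb.sav.
school_type = ["public"] * 168 + ["private"] * 32


def distribution(values):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


for category, (count, percent) in distribution(school_type).items():
    print(f"{category:8s} {count:4d} ({percent:.1f}%)")
```

The output lists each category with its count and percent, which is the "Number (%)" layout used for qualitative variables.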
Using SPSS
Quantitative variables
Number of classes:
There have been several “rules of thumb” proposed to aid in deciding into
how many classes data might reasonably be grouped, for the use of too few
groups will obscure the general shape of the distribution.
But such “rules” or recommendations are rough guides, and the choice is
generally left to good judgement, bearing in mind that from 10 to 20 groups
are useful for most biological work.
Groups should be established with equal-sized intervals of the variable
being measured.
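As a rough sketch of grouping a quantitative variable into equal-width classes, the Python function below counts how many values fall in each class. The function name `frequency_table`, the class limits, and the scores are all hypothetical choices made for illustration; they are not the highschoolb.sav data.

```python
def frequency_table(values, lower, width, n_classes):
    """Count values in n_classes equal-width classes [start, start + width)."""
    counts = [0] * n_classes
    for v in values:
        index = int((v - lower) // width)
        if not 0 <= index < n_classes:
            raise ValueError(f"{v} falls outside the classes")
        counts[index] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical scores, for illustration only.
scores = [28, 34, 39, 42, 44, 47, 47, 50, 52, 52, 55, 57, 60, 63, 68, 73]

for start, end, count in frequency_table(scores, lower=25, width=5, n_classes=10):
    print(f"{start:2d} to <{end:2d}: {count}")
```

Changing `width` and `n_classes` shows how too few classes hide the shape of the distribution, while too many leave most classes nearly empty.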
Quantitative variables
Interpreting histograms
• Making a statistical graph is not an end in itself. The purpose of the graph
is to help us understand the data.
• After you make a graph, always ask, “What do I see?”
Examining a histogram
In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
You can describe the overall pattern of a histogram by its shape, center,
and spread.
An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern.
Shape: The distribution is unimodal. That is, it has a single peak, which
represents students with a reading score between 50 and 55.
The distribution is also symmetric. Real data are almost never exactly
symmetric. We are content to describe the histogram of the reading score as
roughly symmetric.
Center: The midpoint of the distribution is about 45 to 55.
Spread: The spread is from 25 to 80.
Outliers:
In the figure, the observations less than 30 and greater than 70 are part of the
continuous range of reading scores and do not stand apart from the overall
distribution.
If you spot possible outliers, look for an explanation. Some outliers
are due to mistakes, such as typing 4.5 instead of 45.5. Other outliers point
to the special nature of some observations. For instance, a score of 4.5 could
simply belong to an individual who did not finish the reading test.
An additional question we could ask is what proportion of students passed the
reading test.
Considering that students pass the test if they score 45 or more, a new variable
can be created.
Using SPSS to create the new variable ReadPass:
Choose Transform → Visual Binning …
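The same recoding can be sketched outside SPSS. In the Python example below, the pass mark of 45 comes from the slides, while the function name `read_pass` and the eight scores are hypothetical and serve only to illustrate the idea of turning a quantitative score into a two-category variable.

```python
PASS_MARK = 45  # pass threshold stated in the slides


def read_pass(score, pass_mark=PASS_MARK):
    """Return 1 if the student passed (score >= pass_mark), else 0."""
    return 1 if score >= pass_mark else 0


# Hypothetical reading scores, for illustration only.
read = [28, 44, 45, 52, 63, 39, 47, 71]

passed = [read_pass(s) for s in read]
proportion = sum(passed) / len(passed)
print(passed, f"{100 * proportion:.1f}% passed")
```

Note that a score of exactly 45 counts as a pass, because the rule is "45 or more".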
Stemplots
For small data sets, stemplots are quicker to make and easier to interpret.
They display the raw data, that is, they show each one of the values in the
data set.
To make a stemplot:
1. Separate each observation into a stem, consisting of all but the final
(rightmost) digit, and a leaf, the final digit. Stems may have as many
digits as needed, but each leaf shows only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and
draw a vertical line at the right of the column.
3. Write each leaf in the row to the right of its stem, in increasing order out
from the stem.
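The three steps above can be sketched as a short Python function. It assumes non-negative integer observations, so that the leaf is simply the final digit; the function name `stemplot` and the data are hypothetical. Unlike the SPSS display, this sketch does not split each stem into two rows.

```python
from collections import defaultdict


def stemplot(values):
    """Return stemplot rows for non-negative integers (leaf = final digit)."""
    leaves = defaultdict(list)
    # Sorting first means the leaves are appended in increasing order.
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    # Include every stem between the smallest and the largest,
    # even when a stem has no leaves.
    stems = range(min(leaves), max(leaves) + 1)
    return [f"{stem} | {''.join(leaves[stem])}" for stem in stems]


# Hypothetical scores, for illustration only.
for row in stemplot([28, 31, 34, 34, 35, 47, 42, 50, 36]):
    print(row)
```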
Frequency Stem & Leaf
1,00 2. 8
7,00 3. 1444444
14,00 3. 56667799999999
30,00 4. 112222222222222334444444444444
31,00 4. 5567777777777777777777777777778
34,00 5. 0000000000000000002222222222222234
27,00 5. 555555555555577777777777777
26,00 6. 00000000013333333333333333
21,00 6. 555555555688888888888
7,00 7. 1133333
2,00 7. 66
The mean
To find the mean of a set of observations, add their values and divide by the
number of observations. If the n observations are x1, x2, …, xn, their mean is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
In practice, you can key the data into your calculator or software and ask for
the arithmetic mean.
Using SPSS:
In the Example with a sample of high school students (data file called
highschoolb.sav), the mean of the reading score for the 200 students is 52.23.
There are several procedures for obtaining the arithmetic mean in SPSS.
You can choose:
Analyse → Descriptive Statistics → Descriptives …
Analyse → Descriptive Statistics → Explore …
Note that the formula (n+1)/2 does not give the median, just the location of
the median in the ordered list.
(Zar, 2010).
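To make the distinction between the location (n+1)/2 and the median itself concrete, here is a minimal Python sketch. It uses the 24 butterfly wing lengths (in cm) listed elsewhere in these slides; the function names `median_location` and `median` are illustrative choices.

```python
# Wing lengths (cm) of 24 butterflies, as listed in these slides.
wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]


def median_location(n):
    """1-based position of the median in the ordered list."""
    return (n + 1) / 2


def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


print(median_location(len(wing_lengths)))  # position, not the median
print(median(wing_lengths))                # the median itself
```

With n = 24 the location is 12.5, so the median is the average of the 12th and 13th ordered values.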
In the example of wing lengths of butterflies we found that the mean and the
median were very similar, at 3.96 and 4.00 cm, respectively. The distribution
of the 24 wing lengths was roughly symmetric and did not have any
outliers.
The mean is the center of gravity of the histogram. That is, if the histogram
were made of solid material, it would balance horizontally with the fulcrum at
the mean. The median divides the histogram into two equal areas.
But what would happen to the relationship between the mean and the
median of a data set with a marked skew or extreme outliers?
The figure displays the survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment:
The distribution is noticeably skewed to the right and has some potential high
outliers.
The mean survival time is 141.9 days, whereas the median survival time is only
102.5 days.
The mean is pulled toward the right tail of this right-skewed distribution.
If the longest survival time were increased, the mean would increase, but the
median would not change at all. The mean uses the actual value of each
observation and is, therefore, very sensitive to any extreme values.
(Baldi and Moore, 2012).
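The sensitivity of the mean to extreme values can be checked with a small experiment. Because the guinea pig survival times are not listed in these slides, the sketch below substitutes the 24 butterfly wing lengths and replaces the largest value with a hypothetical, much larger one (9.0 cm is an invented value, for illustration only).

```python
import statistics

# Wing lengths (cm) of 24 butterflies, as listed in these slides.
wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]

# Hypothetical change: the largest observation becomes an extreme value.
stretched = wing_lengths[:-1] + [9.0]

for label, data in [("original", wing_lengths), ("stretched", stretched)]:
    print(f"{label:9s} mean = {statistics.mean(data):.2f}  "
          f"median = {statistics.median(data):.2f}")
```

The mean moves toward the extreme value, while the median stays where it was, because the median depends only on the order of the observations around the middle.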
Using SPSS:
There are several procedures for obtaining the arithmetic mean and the
median in SPSS.
You can choose:
Analyse → Descriptive Statistics → Explore …
Many biological variables have distributions that are skewed to the right.
Survival times, such as in the guinea pig inoculation experiment, are typically
right skewed.
When dealing with strongly skewed distributions, it is customary to report
the median rather than the mean.
However, a health organization or government agency may need to include all
survival times, and thus calculate the mean, to estimate the costs of medical
care for a given disease and to plan medical staffing appropriately. Relying
only on the median would result in underestimating the medical and
financial needs.
The mean and median measure center in different ways, and both are
useful.
$$M_g = \exp\left(\frac{1}{n}\sum_{i=1}^{n} \ln X_i\right)$$
The geometric mean is appropriate to use only for quantitative data and only
when all of the data are positive (that is, greater than zero).
Geometric mean is sometimes used as a measure of location when the data
are highly skewed to the right.
In the example of survival times in days of 72 guinea pigs after they were
injected with infectious bacteria in a medical experiment:
The mean survival time is 141.9 days, whereas the median survival time is only
102.5 days.
The geometric mean is 118.1 days.
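The geometric mean formula translates directly into code: average the natural logarithms, then exponentiate. The Python sketch below uses a small hypothetical right-skewed sample, since the guinea pig survival times are not listed in these slides; the function name `geometric_mean` is an illustrative choice.

```python
import math
import statistics


def geometric_mean(values):
    """exp of the mean of the natural logs; requires strictly positive data."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive data")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical right-skewed sample, for illustration only.
sample = [2, 4, 8, 16, 32]

print(f"arithmetic mean = {statistics.mean(sample):.1f}")
print(f"median          = {statistics.median(sample):.1f}")
print(f"geometric mean  = {geometric_mean(sample):.1f}")
```

In this sample, as in the guinea pig example, the arithmetic mean is pulled toward the long right tail more strongly than the geometric mean.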
Using SPSS:
Obtaining the geometric mean in SPSS.
Analyse → Reports → Summarize cases …
The geometric mean (as well as the mean, median, …) can be obtained in the
Statistics sub-dialogue box as shown in the display.
In the example with wing length for 24 butterflies, with ordered data, Xi (in
centimeters):
3.3 3.5 3.6 3.6 3.7 3.8 3.8 3.8 3.9
3.9 3.9 4.0 4.0 4.0 4.0 4.1 4.1 4.1
4.2 4.2 4.3 4.3 4.4 4.5
The range may be expressed as 3.3 to 4.5 cm, or as 4.5 – 3.3 = 1.2 cm.
For example, the guinea pig survival times range from 43 to 598 days. These
single observations show the full spread of data, but they may be outliers.
Furthermore, it is unlikely that a sample will contain both the highest and
lowest values in the population, so the sample range usually underestimates
the population range.
Nonetheless, it is considered useful by some to present the sample range
as an estimate (although a poor one) of the population range.
Whenever the range is specified in reporting data, however, it is a good
practice to report another measure of dispersion as well.
The range is applicable to ordinal data and quantitative data.
The five number summary from the wing length example is:
3.3 3.8 4.0 4.2 4.5
The five number summary of a distribution leads to a new graph, the boxplot.
Max = 4.5
Q3 = 4.2
Md=4.0
Q1 = 3.8
Min = 3.3
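A five-number summary can be sketched with Python's standard library. Quartile conventions differ between textbooks and software, so the choice of `statistics.quantiles` with `method="exclusive"` below is an assumption, not necessarily the rule used to produce the values in these slides. With the wing length data it gives quartiles that agree with the slide values once rounded to one decimal place.

```python
import statistics

# Wing lengths (cm) of 24 butterflies, as listed in these slides.
wing_lengths = [3.3, 3.5, 3.6, 3.6, 3.7, 3.8, 3.8, 3.8, 3.9, 3.9, 3.9, 4.0,
                4.0, 4.0, 4.0, 4.1, 4.1, 4.1, 4.2, 4.2, 4.3, 4.3, 4.4, 4.5]


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'exclusive' quartile rule."""
    q1, md, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return min(values), q1, md, q3, max(values)


labels = ["Min", "Q1", "Md", "Q3", "Max"]
for label, value in zip(labels, five_number_summary(wing_lengths)):
    print(f"{label:3s} = {value:.1f}")
```

A different quartile rule may shift Q1 or Q3 slightly, which is why two programs can report slightly different boxplots for the same data.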
Using SPSS:
Obtaining the quartiles, boxplot, etc. in SPSS.
Analyse → Descriptive Statistics → Explore …
The boxplot is automatically displayed.
The quartiles can be obtained in the Statistics sub-dialogue box by checking
Percentiles as shown in the display.
$$s^2 = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n - 1}$$
or
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$$
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2}$$
The variance:
$$s^2 = \frac{(1792 - 1600)^2 + (1666 - 1600)^2 + \cdots + (1439 - 1600)^2}{7 - 1} = \frac{214870}{6}$$
$$s^2 = 35811.67\ \text{Cal}^2$$
The researchers reported the mean, 1600 Cal, and the standard deviation,
189.24 Cal.
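The arithmetic above can be verified in a few lines of Python. Only the sum of squared deviations (214870) and the sample size (7) are taken from the slides, because the full list of seven values is not shown; the general function `sample_variance` is an illustrative sketch and is applied to a small hypothetical data set.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Check of the slide's arithmetic: 214870 / (7 - 1), then the square root.
sum_of_squares = 214870
n = 7
s2 = sum_of_squares / (n - 1)
s = math.sqrt(s2)
print(f"s^2 = {s2:.2f} Cal^2, s = {s:.2f} Cal")

# Hypothetical data, for illustration only.
example = [2, 4, 4, 6]
print(sample_variance(example), math.sqrt(sample_variance(example)))
```

The standard deviation is reported more often than the variance because it is in the same units as the data (Cal rather than Cal²).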
Using SPSS:
In the Example with a sample of high school students (data file called
highschoolb.sav), we can obtain the mean and standard deviation of the
reading score for the 168 students enrolled in public schools and the 32
students enrolled in private schools: 51.85 ± 10.42 versus 54.25 ± 9.20.
We can also compare the reading score in the form of one graph displaying
each mean with error bars extending on either side to show the standard
deviation in each group.
Two-way tables
• Now we will describe the relationships between two categorical
variables.
• Some variables, such as sex, species, and color, are categorical by nature.
• Other categorical variables are created by grouping values of a
quantitative variable into classes, like age groups, for example.
• To analyse categorical data, we use the counts or percents of individuals
that fall into various categories.
But these data may be displayed in what is known as a contingency table (this
presentation of data is also known as a cross tabulation or cross classification).
• The “Total” column at the right of the table contains the totals for each of
the rows. These row totals give the distribution of type of school (the row
variable): 168 participants were enrolled in public schools, 32 in private
schools.
• In the same way, the “Total” row at the bottom of the table gives the
gender distribution: the study included 91 boys and 109 girls.
• Percents are often more informative than counts. We can display the
marginal distribution of type of school in terms of percents by dividing each
row total by the table total and converting to a percent.
• In the sample, 54.5% (109/200) of the students were girls and 84%
(168/200) were enrolled in public schools.
The distributions of gender alone and of type of school alone are called marginal
distributions.
To find the percent of boys who were enrolled in public schools, divide the count of
such boys by the total number of boys (the column total):
$$\frac{\text{males enrolled in public schools}}{\text{male column total}} = \frac{77}{91} = 0.846 = 84.6\%$$
Doing this for the two entries in the “male” column gives the distribution of type of
school among boys.
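Marginal and conditional distributions can both be computed from the cell counts of a two-way table, as the Python sketch below shows. Only the count of 77 boys in public schools and the row and column totals appear in these slides; the other three cell counts are derived here by subtraction from those totals, and the variable and function names are illustrative.

```python
# Two-way table of counts: rows = type of school, columns = gender.
# 77 and the totals (168, 32, 91, 109) come from the slides;
# the remaining cells are derived by subtraction.
table = {
    "public":  {"male": 77, "female": 168 - 77},
    "private": {"male": 91 - 77, "female": 32 - (91 - 77)},
}

row_totals = {school: sum(cells.values()) for school, cells in table.items()}
col_totals = {sex: sum(table[school][sex] for school in table)
              for sex in ("male", "female")}
grand_total = sum(row_totals.values())


def conditional_on_gender(sex):
    """Percent distribution of type of school among students of one gender."""
    return {school: 100 * table[school][sex] / col_totals[sex]
            for school in table}


print("Marginal (school):",
      {s: f"{100 * t / grand_total:.1f}%" for s, t in row_totals.items()})
print("Marginal (gender):",
      {g: f"{100 * t / grand_total:.1f}%" for g, t in col_totals.items()})
for sex in col_totals:
    print(f"Among {sex}s:",
          {s: f"{p:.1f}%" for s, p in conditional_on_gender(sex).items()})
```

Dividing by the column total gives the distribution of type of school within each gender; dividing by the row total instead would give the gender distribution within each type of school.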
Using SPSS:
Obtaining a contingency table with conditional distributions in SPSS.
Analyse → Descriptive Statistics → Crosstabs…
The conditional distribution can be obtained in the Cells sub-dialogue box by
checking Column (or Row) as shown in the display.
As we want to calculate percentages of students enrolled in private schools
and public schools, for boys alone and for girls alone, choose Percentages by
column (notice that the female variable is in columns).
The number of power boats registered in Florida varies from year to year.
Does it help to explain the differences from year to year in the number of
manatee deaths from collisions with power boats?
We suspect that “powerboats registered” will help explain “manatee deaths
from collisions”. So “powerboats registered” is the explanatory variable, and
“manatee deaths from collisions” is the response variable.
Scatterplot
A scatterplot shows the relationship between two quantitative variables
measured on the same individuals.
The values of one variable appear on the horizontal axis, and the values of the
other variable appear on the vertical axis. Each individual in the data appears as
the point in the plot fixed by the values of both variables.
Always plot the explanatory variable on the horizontal axis (the x axis) of a
scatterplot. We usually call the explanatory variable x and the response variable y.
If there is no explanatory-response distinction, either variable can go on the
horizontal axis.
Examining a scatterplot
In any graph of data, look for the overall pattern and for striking
deviations from that pattern.
You can describe the overall pattern of a scatterplot by the direction,
form, and strength of the relationship.
An important kind of deviation is an outlier, an individual value that falls
outside the overall pattern of the relationship.
Form
Linear relationships, where the points show a straight line pattern.
Curved relationships and clusters are other forms to watch for.
Strength
The strength of a relationship is determined by how close the points in the
scatterplot lie to a simple form such as a line.
Using SPSS:
Graphs → Legacy dialogs → Scatter
References
The text of these slides may be found in the following references*:
Baldi B., D.S. Moore - The practice of statistics in the life sciences, W.H. Freeman and
Company, 2012.
Cadima E.L., A.M. Caramelo, M. Afonso-Dias, P. C. Barros, M.O. Tandstad, J.I. Leiva-
Moreno – Sampling methods applied to fisheries science: a manual, FAO Fisheries
Technical Paper No.434, FAO, 2005.
Fowler J., L. Cohen, P. Jarvis - Practical statistics for field biology, 2nd edition, John
Wiley & Sons, Inc., 1998.
Ruxton, G.D., N. Colegrave - Experimental design for the life sciences, 3rd edition,
Oxford University Press, 2011.
Sokal R. and F.J. Rohlf – Biometry – The principles and practice of statistics in biological
research, W.H. Freeman and Company, 4th edition, 2012.
Zar J.H. - Biostatistical Analysis, 5th edition, Prentice-Hall International Inc., 2010.
* The copy and reproduction of portions created by other authors was done only for
educational use.