
Statistical analysis

Science is concerned with the systematic study of the natural world around us. Biology is particularly concerned with the study of living organisms and includes all levels of life from molecules to ecosystems. In the study of biology, scientists make careful observations and in many cases develop hypotheses, conduct experiments and gather data. Data may be quantitative measurements or qualitative descriptions of phenomena or the results of experiments. Once gathered, data needs to be analysed in an appropriate way to decide whether or not it supports the relevant hypothesis.

Figure 101 Total volume of gas collected against time (x-axis: Time (s); y-axis: Volume of gas (mL))

1.1.1 State that error bars are a graphical representation of the variability of data.
IBO 2007

All measurements are subject to errors and it is important to recognise this in the way the data is recorded and manipulated. Error bars can be used to show either the range of the data or the standard deviation (see Topics 1.1.2 and 1.1.4) on several repeats of one experimental measurement.

Consider a photosynthesis investigation where a student is collecting and measuring the gas released by a plant against time. In the Excel-generated scatter graph in Figure 101, each of the time measurements was measured with an analogue stopwatch and has an error or uncertainty of ±1 s. For example, a time of 40 s has a lower limit of 39 s and an upper limit of 41 s. X error bars were selected with a value of 1. The display was selected to show both the upper and lower limits, so the entire range of the uncertainty or error is displayed.

(Please refer to Chapter 2 - Data collection and processing of The IBID Student Guide Biology for further examples and detailed instructions on adding error bars with Excel).

Figure 102 The mass of marine organisms against their volume (x-axis: Volume of organism (m³); y-axis: Mass of organism (kg))
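The two kinds of error bar described in this topic can be sketched in a few lines of Python. This is only an illustration using the standard library: the repeat readings are hypothetical (they are not data from this chapter), and the helper `uncertainty_limits` is a name invented for the example.

```python
from statistics import mean, stdev

# Hypothetical repeat readings (mL of gas) at one time point; not from the text.
repeats = [5.8, 6.1, 6.0, 6.3, 5.9]

centre = mean(repeats)

# Option 1: the error bar spans the full range of the repeats.
range_bar = (min(repeats), max(repeats))

# Option 2: the error bar spans one sample standard deviation
# either side of the mean.
sd = stdev(repeats)
sd_bar = (centre - sd, centre + sd)


def uncertainty_limits(value, uncertainty):
    """Return the lower and upper limits of a reading with a +/- uncertainty."""
    return value - uncertainty, value + uncertainty


# The 40 s stopwatch reading with an uncertainty of 1 s, as in the text.
time_limits = uncertainty_limits(40, 1)

print(f"mean = {centre:.2f} mL")
print(f"range bar: {range_bar[0]} to {range_bar[1]} mL")
print(f"SD bar: {sd_bar[0]:.2f} to {sd_bar[1]:.2f} mL")
print(f"time limits: {time_limits[0]} s to {time_limits[1]} s")
```

A range bar always encloses every repeat, whereas a standard deviation bar is less affected by a single extreme reading.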

070804 Biol Chapter 1 FINAL.indd1 1 29/11/07 3:49:35 PM



Error bars can be displayed for the values of both variables. In the Excel-generated graph in Figure 102 the mass of the marine organisms is measured using scales with a precision of 2 kg. The volume of the marine organisms is measured with a precision of 0.006 m³. Notice the error bars are small and they all fall on the trend line (see Topic 1.1.6), indicating that the measurements are reliable.

(Please refer to Chapter 2 - Data collection and processing of The IBID Student Guide Biology for a detailed discussion about errors or uncertainties and precision versus accuracy).

1.1.2 Calculate the mean and standard deviation of a set of values.
IBO 2007

When sampled biological data are recorded as a series of values representing variables, it is useful to know their mean and standard deviation. The mean is the sum of all the values divided by the number of values. In everyday language, it is often called the average.

Mean

For example, consider the following data:
2; 4; 8; 11; 12; 13; 14; 14; 23; 24; 25

The mean is calculated manually as follows: $\frac{150}{11} \approx 13.6$.

However, the mean can be calculated more rapidly by means of a calculator or Excel.

Usually, the mean is easily calculated, but described below are two examples where simple means are not appropriate.

Example 1
If the means of samples themselves are meaned, an error can arise if the samples are of different sizes. For example, the mean of the means in Figure 103 is 8, but this does not take into account the different sample sizes. A more accurate mean is a weighted mean:
((7.0 x 5) + (8.0 x 8) + (9.0 x 2))/(5 + 8 + 2) = 7.8
A weighted mean gives weights to different numbers in proportion to their importance.

Mean    Sample size
7.0     5
8.0     8
9.0     2

Figure 103 Table of means and sample sizes

Example 2
When calculating a mean of ratios (for example, percentages) for several groups of different sizes, the ratio for the combined total of all the groups is not the mean of the proportions for the individual groups.

For example, if 40 rats from a batch of 100 are male, this implies 40% are male. If 120 rats from a batch of 240 are male, this implies 50% are male. The mean percentage of males, (50 + 40)/2 = 45%, is not the percentage of males in the two groups combined, because there are 40 + 120 = 160 males in a total of 340, which is approximately 47%.

Standard deviation

Means do not give a complete description of a sample of data. The standard deviation summarises the spread of the data around the mean (see Topic 1.1.3). The standard deviation measures how widely spread the values in a set of data are. If the data points are close to the mean, then the standard deviation is small. Conversely, if many data points are far from the mean, then the standard deviation is large. If all the data values are equal, then the standard deviation is zero.

Standard deviation is often abbreviated to SD, s.d., s or σ.

Figure 104 shows a manually-worked example for a sample of shells where x is the length and f is the frequency.

x       f          fx
35      1          35
37      2          74
38      2          76
30      2          60
42      2          84
44      1          44
        Σf = 10    Σfx = 373

Figure 104 Table with shell data

The mean, $\bar{x} = \frac{\sum fx}{\sum f} = \frac{373}{10} = 37.3$; $\bar{x}^2 = 1391.3$

Standard deviation, $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n}} \approx 4.50$

where $x$ represents each value, $\bar{x}$ represents the mean of the measurements and $n$ represents the number of measurements. The equation shows that the standard deviation is the root mean square deviation of values from their mean.

The formula above strictly applies to an entire population. For calculating the SD of a sample, change the denominator n to n-1. The difference has little effect on the SD value if the number of measurements is very large.
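The arithmetic behind the simple mean and the two cautionary examples (means of samples of different sizes, and means of percentages) can be checked with a short Python sketch. It uses only the standard library and the figures quoted in this topic; the variable names are chosen for the illustration.

```python
from statistics import mean

# Data set used for the simple mean.
values = [2, 4, 8, 11, 12, 13, 14, 14, 23, 24, 25]
simple_mean = mean(values)  # 150 / 11

# Example 1: (sample mean, sample size) pairs from the table of means.
samples = [(7.0, 5), (8.0, 8), (9.0, 2)]
mean_of_means = mean(m for m, _ in samples)
weighted_mean = sum(m * n for m, n in samples) / sum(n for _, n in samples)

# Example 2: (number of males, batch size) for the two batches of rats.
batches = [(40, 100), (120, 240)]
mean_of_percentages = mean(100 * males / size for males, size in batches)
pooled_percentage = (
    100 * sum(males for males, _ in batches) / sum(size for _, size in batches)
)

print(f"simple mean         = {simple_mean:.2f}")
print(f"mean of means       = {mean_of_means:.1f}")
print(f"weighted mean       = {weighted_mean:.1f}")
print(f"mean of percentages = {mean_of_percentages:.1f}%")
print(f"pooled percentage   = {pooled_percentage:.1f}%")
```

In both examples the unweighted figure treats every group as equally important, while the weighted or pooled figure lets larger groups count for more.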



The IB Biology syllabus does not require you to do manual calculations or memorise the mathematical formulas, but it is helpful to observe how the statistics are generated for the simpler statistical functions.

The standard deviation and the mean can be rapidly calculated by means of a graphical or scientific calculator or using Excel software.
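As an alternative to a calculator or Excel, the shell data (lengths 35, 37, 38, 30, 42 and 44 with frequencies 1, 2, 2, 2, 2 and 1) can be processed in a few lines of Python. This is a minimal sketch using only the standard library. It computes the standard deviation both ways, dividing by n and by n - 1. A hand calculation that rounds intermediate values may differ slightly in the last digit.

```python
from math import sqrt

# Shell lengths (x) and their frequencies (f).
lengths = [35, 37, 38, 30, 42, 44]
freqs = [1, 2, 2, 2, 2, 1]

n = sum(freqs)  # total number of shells
mean_x = sum(f * x for x, f in zip(lengths, freqs)) / n

# Sum of squared deviations from the mean, weighted by frequency.
sum_sq_dev = sum(f * (x - mean_x) ** 2 for x, f in zip(lengths, freqs))

population_sd = sqrt(sum_sq_dev / n)    # denominator n
sample_sd = sqrt(sum_sq_dev / (n - 1))  # denominator n - 1

print(f"n = {n}, mean = {mean_x:.1f}")
print(f"SD (divide by n)     = {population_sd:.2f}")
print(f"SD (divide by n - 1) = {sample_sd:.2f}")
```

With only ten shells the two versions differ noticeably; the gap shrinks as the number of measurements grows.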

The screen shots in Figures 105 to 109 and the instructions describe how to enter the data from Figure 104 and perform summary statistics on a TI graphical calculator (one of the recommended calculators for the IB Mathematics courses).

Press STAT.

Figure 105

Press ENTER and enter the numbers (shell sizes or x) into L1 and then enter the respective frequencies (f) into L2.

Figure 106

Press STAT and use the arrow key to select CALC.

Figure 107

Press ENTER to obtain a summary of the data entered.

Figure 108

Figure 109

The top value ($\bar{x}$) is the mean. The fourth value (Sx) is the standard deviation, assuming that the data is from a sample of the population. The fifth value (σx) is the standard deviation assuming that the data represents the entire population.

Figure 111 shows some data from an Excel spreadsheet describing the length of two groups of sea shells, such as those shown in Figure 110. Use the Descriptive Statistics function from Data Analysis in the Tools Menu of the Excel program.

Figure 110 Sea shells

Seashell    Group 1 (Small Shell)    Group 2 (Large Shell)
number      length (mm)              length (mm)
1           10.2                     16.8
2           11.6                     19.7
3           9.7                      18.5
4           13.3                     22.5
5           8.3                      20.7

Figure 111 Data table for small and large seashell length
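Several of the summary values that Excel's Descriptive Statistics reports can also be computed directly. The sketch below is an illustration in Python using only the standard library, applied to the Group 1 (small shell) lengths; the dictionary keys simply borrow Excel's labels.

```python
from math import sqrt
from statistics import mean, median, stdev, variance

# Group 1 (small shell) lengths in mm.
group1 = [10.2, 11.6, 9.7, 13.3, 8.3]

sd = stdev(group1)  # sample standard deviation (n - 1 denominator)

summary = {
    "Mean": mean(group1),
    "Standard Error": sd / sqrt(len(group1)),
    "Median": median(group1),
    "Standard Deviation": sd,
    "Sample Variance": variance(group1),
    "Range": max(group1) - min(group1),
    "Minimum": min(group1),
    "Maximum": max(group1),
    "Sum": sum(group1),
    "Count": len(group1),
}

for label, value in summary.items():
    print(f"{label:<20}{value:.4f}")
```

The standard error is the sample standard deviation divided by the square root of the number of measurements.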


The output is shown in Figure 112 for the Group 1 small shell data. The values of interest are the mean and the standard deviation.

Mean                  10.62
Standard Error        0.852877482
Median                10.2
Mode                  #N/A
Standard Deviation    1.907092027
Sample Variance       3.637
Kurtosis              -0.229700866
Skewness              0.411499631
Range                 5
Minimum               8.3
Maximum               13.3
Sum                   53.1
Count                 5

Figure 112 Output data for shell length (generated in Excel)

1.1.3 State that the term standard deviation is used to summarize the spread of values around the mean, and that 68% of the values fall within one standard deviation of the mean.
IBO 2007

If repeated measurements of continuous biological variables, such as the height or weight of humans from a large population, are plotted, a close approximation to a normal distribution is obtained. The normal distribution has some very special characteristics.

Data which is normally distributed will exhibit a bell-shaped curve which is symmetrical around a centrally located mean. The curve is known as a Gaussian curve. Many statistical functions and tests assume that the data approximates to a normal distribution.

A normal distribution curve is shown in Figure 113. Note that most values are close to the mean and only a few values will be far from the mean.

Figure 113 A normal distribution curve (y-axis: frequency; the mean is marked at the centre of the curve)

The standard deviation provides information on the range within a biological sample (provided it is normally distributed). Consider the following data: the mean height of a human population is 180 cm and its standard deviation is 10 cm.

If the data is normally distributed then approximately 68% (about two thirds) of the sample have heights which are within 10 cm of 180 cm, that is, 68% of the sample have heights between 170 cm and 190 cm.

In addition, approximately 95% of the sample heights lie within two standard deviations of the mean. In this example, two standard deviations = 2 x 10 cm = 20 cm. In other words, 95% of the sample have heights between 160 cm and 200 cm. Figure 114 illustrates these two values.

Figure 114 The normal distribution curve showing values for standard deviation (68% of values lie within 1 x SD of the mean and 95% within 2 x SD; SD represents standard deviation)
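The 68% and 95% figures for the height example (mean 180 cm, standard deviation 10 cm) can be checked numerically. The following is a minimal sketch that assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper `fraction_within` is a name invented for this illustration.

```python
from statistics import NormalDist

# Height example: mean 180 cm, standard deviation 10 cm.
heights = NormalDist(mu=180, sigma=10)


def fraction_within(dist, k):
    """Fraction of a normal distribution lying within k SDs of its mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)


within_1sd = fraction_within(heights, 1)  # 170 cm to 190 cm
within_2sd = fraction_within(heights, 2)  # 160 cm to 200 cm

print(f"within 1 SD: {within_1sd:.1%}")
print(f"within 2 SD: {within_2sd:.1%}")
```

The fractions depend only on the number of standard deviations, not on the particular mean or standard deviation chosen.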


1.1.4 Explain how the standard deviation is useful for comparing the means and the spread of data between two or more samples.
IBO 2007

A small value of standard deviation indicates that the data is clustered closely around the mean value. A large value of standard deviation indicates a wider spread around the mean. Figure 115 shows Gaussian curves for the frequency distributions of two statistical populations with differing standard deviations (spreads).

Figure 115 Normal distribution curves with large and small standard deviations (y-axis: frequency of occurrence; SD1 < SD2; X̄ represents the mean; SD represents standard deviation)

1.1.5 Deduce the significance of the difference between two sets of data using calculated values for t and the appropriate tables.
IBO 2007

Suppose a student measured and recorded the length of leaves from two similar trees. The data is displayed in an Excel spreadsheet (Figure 116).

                      Sample 1       Sample 2
                      7.85           12.50
                      8.51           12.94
                      13.66          6.26
                      11.03          6.10
                      6.59           13.19
                      8.04           10.74
                      14.16          6.06
                      8.13           12.53
                      6.79           15.45
                      11.06          15.64
                      5.83           15.19
                      10.73          14.93
                      6.68           7.94
                      5.02           8.28
                      10.37          12.65
Standard deviation    2.761473899    3.545349066
Mean                  8.963333333    11.36

Figure 116 Lengths of leaves from sampling two similar trees

The processed data can be displayed graphically in the form of a bar chart. Error bars can be added with the standard deviation of the two samples. They will give a visual indication of the variability of the data.

The mean for sample 1 is obviously lower than the mean for sample 2. However, is this difference statistically significant? This depends not only on the difference between the means of the two samples, but also on the spread of the data within each sample, as measured by their standard deviations. The variance is the square of the standard deviation.

The t test compares two sets of data and gives a probability (P) that a difference at least as large as the one observed would arise by random chance if the two sets were essentially the same. P varies from 0 to 1. The higher the probability, the more likely it is that any differences are just due to random chance. The lower the probability, the more likely it is that the two sets are significantly different, and that the differences are real. In Biology, the critical probability to show difference is usually taken as 0.05 (or 5%). A critical value is a value that a statistic must exceed in order to have a hypothesis test result in rejection of the null hypothesis. An example of a t test performed on the data from Figure 116 is shown in Figure 117.

                                Variable 1    Variable 2
Mean                            8.96333       11.36
Variance                        7.625738      12.5695
Observations                    15            15
Pooled Variance                 10.09761      -
Hypothesised Mean Difference    0             -
df                              28            -
t Stat                          -2.065517     -
P(T<=t) one-tail                0.0241207     -
t Critical one-tail             1.7011309     -
P(T<=t) two-tail                0.0482415     -
t Critical two-tail             2.048407      -

Figure 117 Data table for t test
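To show where the pooled variance, degrees of freedom and t statistic for the leaf data come from, here is a minimal Python sketch using only the standard library. It assumes an unpaired t test with equal variances (the pooled-variance form), and it takes the two-tailed critical value 2.048407 for df = 28 at P = 0.05 as given rather than computing it.

```python
from math import sqrt
from statistics import mean, variance

# Leaf lengths from the two trees.
sample1 = [7.85, 8.51, 13.66, 11.03, 6.59, 8.04, 14.16, 8.13,
           6.79, 11.06, 5.83, 10.73, 6.68, 5.02, 10.37]
sample2 = [12.50, 12.94, 6.26, 6.10, 13.19, 10.74, 6.06, 12.53,
           15.45, 15.64, 15.19, 14.93, 7.94, 8.28, 12.65]

n1, n2 = len(sample1), len(sample2)
var1, var2 = variance(sample1), variance(sample2)  # sample variances (SD squared)

# Pooled variance: the two variances weighted by their degrees of freedom.
df = n1 + n2 - 2
pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df

# t statistic: difference in means divided by its standard error.
t_stat = (mean(sample1) - mean(sample2)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# Two-tailed critical value for df = 28 at P = 0.05, taken as given.
T_CRITICAL_TWO_TAIL = 2.048407
significant = abs(t_stat) > T_CRITICAL_TWO_TAIL

print(f"pooled variance = {pooled_var:.5f}")
print(f"df = {df}")
print(f"t = {t_stat:.6f}")
print(f"|t| exceeds the critical value: {significant}")
```

The sign of t only reflects which sample was subtracted from which; for a two-tailed test it is the size of t that is compared with the critical value.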


(Please refer to Chapter 2 - Data collection and processing of The IBID Student Guide Biology for detailed instructions on performing t-tests with cell formulas and the built-in Statistical functions of Excel).

The difference is statistically significant since the two-tailed probability (0.048) is lower than 0.05, although only slightly. This indicates that the leaves sampled from tree 1 are significantly shorter than those from tree 2.

t tests can also be performed on a TI calculator as shown in the screen shots in Figures 118 to 122. For example, a two-tailed t test can be performed on the following values (of bird wing spans): 2, 7, 9, 10, 13, 15, 18 and 20 cm to establish whether the population mean is 10 cm.

Figure 118 Screen shot A

Figure 119 Screen shot B

Figure 120 Screen shot C

Move the cursor over the respective sections and enter ten (for the mean) and indicate that the data is in L1.

Figure 121 Screen shot D

Figure 122 Screen shot E

A two-tailed test is a statistical analysis in which the alternate hypothesis states that a difference exists. The null hypothesis can be rejected in either tail of the theoretical distribution.

TOK What is an objective standard?

An objective standard is one without bias. Natural science claims that it is without bias as its findings are based on observations made objectively. The t test determines if there is a statistical difference between two sets of data in an objective way. Statistical tests reveal probable difference between events or likelihood of an event. However, it is often not well understood that the probability that something will (or will not) occur is not a guarantee that it will (or will not) occur.
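The one-sample t test on the wing-span values given earlier in this section (2, 7, 9, 10, 13, 15, 18 and 20 cm, tested against a population mean of 10 cm) can also be sketched in Python. This is an illustration using only the standard library, not a reproduction of the calculator screens. The critical value 2.365 is an assumption taken from a standard t table (two-tailed, P = 0.05, 7 degrees of freedom); it is not stated in this chapter.

```python
from math import sqrt
from statistics import mean, stdev

# Bird wing spans in cm and the hypothesised population mean.
wing_spans = [2, 7, 9, 10, 13, 15, 18, 20]
hypothesised_mean = 10

n = len(wing_spans)
df = n - 1

# Standard error of the mean uses the sample SD (n - 1 denominator).
standard_error = stdev(wing_spans) / sqrt(n)
t_stat = (mean(wing_spans) - hypothesised_mean) / standard_error

# Assumed value from a standard t table: two-tailed, P = 0.05, df = 7.
T_CRITICAL = 2.365
reject_null = abs(t_stat) > T_CRITICAL

print(f"sample mean = {mean(wing_spans):.2f} cm")
print(f"t = {t_stat:.4f}, df = {df}")
print(f"reject the null hypothesis: {reject_null}")
```

Because the size of t is well below the assumed critical value, this sketch does not reject the null hypothesis that the population mean is 10 cm.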


1.1.6 Explain that the existence of a correlation does not establish that there is a causal relationship between two variables.
IBO 2007

Biological data should always be plotted to show the relationship between two sets of data. A line graph should be plotted if the independent variable is under the control of the student performing the investigation. If both variables are dependent (that is, measured) then the values should be plotted in the form of a scattergram.

Regression and correlation are methods used when testing relationships between samples of variables. If one variable is known or assumed to be dependent on the other in a linear manner then a linear regression technique is used to determine the line of best fit.

A correlation coefficient can then be calculated which indicates how well the experimental data fit the line of best fit. Correlation coefficients are expressed as a number between -1 and 1. A positive coefficient indicates a positive relationship between the two variables while a negative coefficient indicates a negative relationship.

A positive correlation means that as the value of X increases, the value of Y tends to increase as well. A positive correlation could be found between the amount of sugar in a jelly and the number of bacterial colonies growing on the jelly after a fixed amount of time.

A negative correlation means that as the value of X increases, the value of Y tends to decrease. A negative correlation could be found between the amount of salt in a jelly and the number of bacterial colonies growing on the jelly after a fixed amount of time.

The closer the value is to -1 or 1, the stronger the relationship between the variables, that is, the less scatter there would be about a line of best fit. A coefficient of 0 implies that there is no linear relationship between the variables.

Figure 123 shows examples of correlation with linear regression lines. In (i) and (ii) the correlation is good; for (i) the correlation is positive and the correlation coefficient is close to 1; for (ii) the correlation is negative and the correlation coefficient is close to -1; in (iii) there is weak positive correlation and the correlation coefficient would be close to zero.

Care must be taken when interpreting correlation coefficients, because if two variables are highly correlated it does not necessarily mean that one causes the other. In statistical terms, correlation does not imply causation.

There are three possible relationships between two variables, X and Y:
Causation: Changes in X cause changes in Y.
Common response: Both X and Y respond to changes in some unobserved variable.
Confounding: The effect of X on Y is mixed up with the effects of other variables on Y.

An example of a correlation without a causal relationship could be found in the following: since 1950, CO2 levels in the atmosphere have increased. During the same period, crime levels have gone up. The correlation between these data does NOT mean that increased crime levels are caused by carbon dioxide levels going up.
Figure 123 Correlation with linear regression lines (three panels, (i), (ii) and (iii), each plotting f against x)
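The correlation coefficient described in this topic is commonly computed as Pearson's r. The text does not name a particular coefficient, so treating it as Pearson's r is an assumption. The sketch below uses only the Python standard library, and the sugar, salt and colony counts are hypothetical values invented to mimic the jelly examples.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: amount of sugar or salt (g) in a jelly and the number
# of bacterial colonies counted after a fixed time. Not from the text.
sugar = [0, 1, 2, 3, 4, 5]
colonies_with_sugar = [2, 5, 7, 12, 14, 19]
salt = [0, 1, 2, 3, 4, 5]
colonies_with_salt = [20, 16, 13, 9, 6, 2]

r_positive = pearson_r(sugar, colonies_with_sugar)
r_negative = pearson_r(salt, colonies_with_salt)

print(f"sugar vs colonies: r = {r_positive:.3f}")
print(f"salt vs colonies:  r = {r_negative:.3f}")
```

A value of r near 1 or -1 describes how closely the points follow a straight line. It says nothing about whether one variable causes the other.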


Exercises

1. The number of eggs in the nests of a sample of a species of bird is shown below. Find the mean and sample standard deviation of these numbers of eggs. Perform the calculations manually and on a graphical calculator.
5 3 5 3 4 2 0 2 1 2

2. In a clinical trial, a population of patients was given Drug A or B. The mean times in minutes for blood clotting are shown in Figure 124.

Drug A Drug B
61.6 39.3
64.6 26.3
55.6 32.4
45.2 21.5
50.6 60.3
70.5 24.3
67.7 36.4
57.5 47.4
66.5 33.2
42.3 57.2

Figure 124

(a) What is the null hypothesis?
(b) What is the alternate hypothesis?
(c) Perform a two-tailed, unpaired t test for the data and calculate the t statistic.
(d) What is the critical value of t for P<0.05?
(e) Is the t statistic greater or less than the critical value of t?
(f) Can the null hypothesis be rejected?
