Uppsala University
Fall 2014
Table of Contents
INTRODUCTION
3.10 SUMMARY
4 LINEAR MODELS
6.1 GOALS
Introduction
This introduction is directed at advanced BSc and beginning MSc students in Biology at Uppsala University, but it may of course also be helpful for anyone interested. We have kept this script concise and applied in order to allow people to conduct their own statistical analyses in R after a short time. Note that this "quick-start" guide does not replace a full-fledged course. However, we hope that successfully using R for statistical analyses with the help of this script will generate interest in learning more about statistics and R!
We wrote this script for flexible use, such that you can direct your attention to the parts that you
want to focus on, given your background and current interest.
You will get the most out of this enhanced .pdf file if you read it electronically using a pdf reader that provides a content sidebar and allows hyperlinks as well as attachments. Recent versions of the free program Adobe Reader for Macintosh and for PC have these functions (do not use Preview). The script contains internal links and links to webpages. You can navigate between sections and subsections using the bookmarks pane (Figure 0-0-1). Some solutions or data files are provided as attachments to the .pdf document that are accessible through the attachment pane (press the paperclip symbol). Note that Adobe Reader also allows you to add notes, highlight text and set bookmarks of your own.
Figure 0-0-1 Screenshot of this script in Acrobat Reader. Use bookmarks to navigate between sections and subsections. The attachment pane (paperclip symbol) will display a list of included files (datasets and exercise solutions). You can also insert your own marks, such as the yellow "1996", as well as comments using Adobe Reader's tools.
Please browse the overview and summary sections that are present in all chapters in order to find out just where you need to start reading. All sections have exercises with solutions such that you can practice and test your knowledge.
We hope that you will find our script useful and fun!
August 2014,
Sophie Karrenberg, Andres Cortes, Elodie Chapurlat, Xiaodong Liu, Matthew Tye
1 Getting started with statistics
Goals:
learn about why and how statistics are used in Biology
be introduced to basic statistical concepts such as distributions and probabilities
become familiar with the normal distribution
get an idea about how statistical tests work
The problem is that all these units are often different, even though they belong to the same
population. By chance, your random sample may not be very representative of the population.
Thus, even two samples taken from two similar populations may differ greatly, just by chance. It
is also possible that two samples taken from two very different populations may be very similar,
misleading you to conclude that the two populations are similar. Also, the natural variation
among units within your samples may obscure the effect of an experimental treatment. Thus,
working with samples means that we have to deal with all these uncertainties in some way. If
there is really a difference between the samples, you need to know what differences you can
expect by chance, and how to deal with the variation within samples. Statistics help you to make
these decisions. In other words, statistical tests are methods to use samples to make inferences
about the populations.
Figure 1-1 Population and sample
Biological questions such as which genes affect certain traits or how climate change affects the biosphere can only be solved using statistical analyses of massive datasets. But even comparatively small questions, for example to what extent men are taller than women, require statistical treatment. Thus, as soon as you formulate a study question, you should start thinking about statistics.
Statistical analyses have a central place in biological studies and in many other sciences (Figure 1-2 The role of statistical analyses in the biological sciences).
Hypothesis testing
Many statistical tests evaluate a strict null hypothesis H0 against an alternative hypothesis HA.
For example:
H0: mean dispersal distance does NOT differ between male and female butterflies
HA: mean dispersal distance differs between male and female butterflies
For this reason, the next part of this lesson first reviews the concepts of distributions and probability, after which we come back to statistical testing.
Distributions
A distribution refers to how often different values occur in a set of data. In the graph below you
see a common representation of a distribution: a histogram (Figure 1-3 Histogram of normally
distributed data). In histograms, the horizontal x-axis represents the values occurring in the data, separated into groups (columns), and the vertical y-axis shows how many values fall into each group. The y-axis in histograms can also be given as the percentage of the data that each group represents.
[Figure 1-3: histogram of normally distributed data; x-axis: Values (2 to 8), y-axis: Frequency (0 to 30)]
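A histogram like the one in Figure 1-3 can be produced in R with `hist()`; the sketch below uses simulated data (the mean, standard deviation and sample size are assumptions chosen to resemble the figure):

```r
# Simulate 200 normally distributed values (mean 5, sd 1) and draw a histogram
set.seed(1)                               # makes the simulation reproducible
values <- rnorm(200, mean = 5, sd = 1)    # 200 random values
hist(values, xlab = "Values", ylab = "Frequency", main = "")
```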
Nonetheless, for a single throw of the coin you cannot predict where it will land! If you, however, throw the coin very many times you expect it to land on heads about half of the time, corresponding to the probability of 0.5.
The coin example concerned an outcome with two categories, heads and tails. For continuous (measurement) values, probability density functions can be derived (Figure 1-4 Probability density for a standard normal distribution). Note the similarity in shape to the histogram above (Figure 1-3 Histogram of normally distributed data). For each value on the x-axis, the value of the probability density function displayed on the y-axis is the expected probability of that value occurring. The value -1 is thus expected to occur with a probability of 0.242, or in 24 of 100 cases. Probability density functions of test statistics are used for the evaluation of statistical tests.
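You can verify the quoted density in R, where `dnorm()` is the normal density function and `pnorm()` the cumulative distribution function:

```r
# Density of the standard normal distribution at -1
dnorm(-1)     # 0.2419707, the 0.242 quoted above
# Area under the curve below -1, i.e. the probability of a value smaller than -1
pnorm(-1)     # 0.1586553
```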
[Figure 1-4: probability density curve for a standard normal distribution; x-axis: standard deviations (-5 to 5), y-axis: Density (0 to 0.4)]
Figure 1-5 Probability density for normal distributions with various means and standard
deviations (sd)
Mean and standard deviation of the normal distribution are intricately linked to how common
values are. In fact, the probability of obtaining values in a certain range corresponds to the area
under the curve in this range. The entire area under the curve sums to 1.
Values within one standard deviation to either side of the mean represent 34.1% of the data on each side (pink), 13.6% of the values occur between one and two standard deviations from the mean on either side (yellow), 2.1% of the values occur between two and three standard deviations from the mean (green) and 0.1% of the data occur beyond three standard deviations from the mean on either side (white, Figure 1-6).
Figure 1-6 Standard normal distribution with percentage of values occurring in a certain
range indicated
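These areas under the normal curve can be computed with `pnorm()`, the cumulative distribution function of the normal distribution:

```r
# Area under the standard normal curve between the mean and 1 sd (one side)
round(pnorm(1) - pnorm(0), 3)   # 0.341, i.e. about 34.1%
# Between 1 and 2 sd from the mean (one side)
round(pnorm(2) - pnorm(1), 3)   # 0.136
# Between 2 and 3 sd from the mean (one side)
round(pnorm(3) - pnorm(2), 3)   # 0.021
# Beyond 3 sd from the mean (one side)
round(1 - pnorm(3), 3)          # 0.001
```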
H0: mean dispersal distance does NOT differ between male and female butterflies
HA: mean dispersal distance differs between male and female butterflies
Note that by simply saying that dispersal distance differs, we imply that female dispersal distance could be either higher or lower than male dispersal distance. Because one group can differ from the other in either direction, this is called a two-sided hypothesis and is followed by a two-tailed test. We could also phrase the alternative hypothesis in the following way:
HA: female butterfly dispersal distance is longer than that of male butterflies.
Here we hypothesize that the female butterflies differ from the male butterflies in a specific direction. This is referred to as a one-sided hypothesis and is followed by a one-tailed test.
To evaluate either one or two-sided hypotheses, statistical tests calculate a test statistic from the
data to find out how likely the obtained result would be under the null hypothesis. To do so, a
probability distribution of the test statistic is theoretically derived assuming the null hypothesis.
The probability of the test statistic from the data given that the null hypothesis is true is then
found using this theoretical distribution. This probability is termed the P-value. Common
statistical tests usually have an outcome of "significant", meaning that the alternative hypothesis
is accepted, or "not significant", meaning that the alternative hypothesis is discarded and the null
hypothesis accepted.
If the test statistic calculated from the data happens to be a value that is very rare under the null hypothesis, usually occurring with a probability of less than 5% (P-value < 0.05), the null hypothesis is discarded and the alternative hypothesis accepted. If the test statistic happens to take a commonly occurring value under the null hypothesis, the alternative hypothesis is discarded instead. This is illustrated for one- and two-tailed tests in Figure 1-7.
Importantly, all statistical tests make assumptions on the data and are only valid if these are met.
You will come across these assumptions in the section Basic analyses with R.
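As an illustration, P-values for one- and two-tailed tests can be computed from a test statistic with a standard normal distribution using `pnorm()` (the value of the test statistic here is a made-up example):

```r
# Hypothetical test statistic from a z-test
z <- 2.1
# Two-tailed P-value: extreme values in either direction count
2 * (1 - pnorm(abs(z)))   # about 0.036, significant at the 0.05 level
# One-tailed P-value: only values larger than z count
1 - pnorm(z)              # about 0.018
```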
Figure 1-7 Illustration of significance (P< 0.05) ranges in one- and two-tailed tests.
Test outcomes, error types, significance and power
A statistical test can have four potential outcomes, two of which are correct and two of which are false (Table 1-1).
Table 1-1 Possible outcomes of statistical tests with the significance level of 0.05.
                                   Reality
Test result          H0 true                          HA true
H0 accepted          correct                          type II error (false negative)
HA accepted          type I error (false positive)    correct
Note that the P-values correspond to Type I errors (false positives), i.e. accepting the alternative hypothesis when it is not true. The significance level is commonly set to 0.05 in biological studies and P < 0.01 or P < 0.001 are regarded as highly significant.
Importantly, the choice of significance level has direct implications for the two error types.
Let's look at this further. The two probability density curves on the graph below represent
theoretical normal distributions of measurements for our example of dispersal distance: female
(black) and male (red) butterflies. When the significance level is set to 0.05 (P-values < 0.05 taken
as significant, upper graph, Figure 1-8) the black areas under the black curve for females
represent the type I error, i.e. erroneously accepting the alternative hypothesis. The same cut-off
applies to the red curve for males: here the area in red represents the type II error, erroneously
accepting the null hypothesis when the alternative hypothesis is true.
If the significance level, and thus the type I error, is decreased to 0.01 (lower graph) the type II error is inevitably increased! This means that even if you can be more certain that the alternative hypothesis is true when you do accept it, you also have to live with higher chances of missing cases where the alternative hypothesis is true.
[Figure 1-8: two probability density plots (y-axis: Density 0 to 0.5, x-axis: Values -4 to 8); upper panel: significance level = 0.05, lower panel: significance level = 0.01]
Figure 1-8 Trade-off between type I and type II errors, for the example of female (black)
vs. male (red) dispersal distance in butterflies.
Summary
Distributions can be displayed as histograms and show how often different values (or classes of values) occur.
Probabilities express how likely events or outcomes are. Probability density functions show
how likely it is to obtain values under a certain distribution.
The normal distribution is fundamentally linked to many common statistical tests. Normal distributions are described by their mean and standard deviation.
Statistical tests use samples to make inferences about large populations and generally evaluate a null hypothesis (usually no difference) against an alternative hypothesis (a difference). They do so by comparing a test statistic calculated from the data against a theoretical distribution of this statistic under the null hypothesis.
The significance level used in statistical testing is related to both type I errors (false positives)
and type II errors (false negatives).
1.2 Descriptive statistics
Goals
In this section you will learn how to describe your data in terms of
One of the most basic measures of data series is their range. The range refers to the interval
between the smallest value, the minimum, and the largest value, the maximum.
Indeed, looking at the range is highly recommended as it allows you to conduct a first check of the data: are the values actually in the expected (or reasonable) range?

In addition to the range, it is also a good idea to calculate the median, the value that is exactly in the middle of the data: 50% of the values are larger than the median and the other 50% are smaller than the median. Whether the median is more or less in the middle of the range will show you whether your data is distributed symmetrically. For example, data with a range of 1 to 10 and a median of 2 is NOT symmetrically distributed. The data in the illustration below is symmetrically distributed.
Similar to the median, you can also calculate the 25% and 75% quartiles, values that are larger than 25% or 75% of the (ordered) data, respectively. The example data above with a range of 1 to 10 and a median of 2 has a 25% quartile of 2 and a 75% quartile of 3; this would indicate that there probably are some outliers causing the range to extend to 10.
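Range, median and quartiles are computed in R with `range()`, `median()` and `quantile()`. The sketch below uses a made-up data set chosen to match the example above (range 1 to 10, median 2):

```r
# Hypothetical data matching the example in the text
x <- c(1, 2, 2, 2, 2, 3, 3, 10)
range(x)                     # 1 10: minimum and maximum
median(x)                    # 2
quantile(x, c(0.25, 0.75))   # 25% quartile: 2, 75% quartile: 3
```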
The standard deviation s (also sd) is a measure of the spread of the data and is calculated as the square root of the summed squared deviations from the mean, divided by n - 1 (with n the sample size):
s = sqrt( sum( (x_i - mean)^2 ) / (n - 1) )
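In R, the built-in `sd()` gives the same result as computing this formula by hand (the data values below are made up):

```r
# Standard deviation with sd() and "by hand"
x <- c(2, 4, 4, 4, 5, 5, 7, 9)                 # hypothetical measurements
sd(x)                                          # built-in function
sqrt(sum((x - mean(x))^2) / (length(x) - 1))   # same result from the formula
```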
Figure 1-10 Histogram of a normal distribution showing mean, median, quartiles and
standard deviation
Standard error and confidence interval of the mean
The standard error of the mean (SE or se) gives a measure of the precision of the estimate of the mean. It is calculated as the standard deviation divided by the square root of the sample size n:
SE = s / sqrt(n)
The standard error can be used to calculate a confidence interval (CI) for the mean. The 95% confidence interval around the sample mean should contain the mean of the population with 95% probability. It is approximately calculated as:
CI = mean ± 1.96 × SE
Note that standard error and confidence interval of the mean become smaller the larger the
sample is. This reflects the greater trust you can have for a mean calculated from a large sample
as opposed to a mean calculated from a small sample.
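A short sketch of these calculations (the sample here is simulated, and the approximate multiplier 1.96 is used; for small samples a t-distribution quantile would replace it):

```r
# Standard error and approximate 95% CI of the mean for a simulated sample
set.seed(2)
x <- rnorm(50, mean = 10, sd = 2)      # hypothetical sample of 50 measurements
se <- sd(x) / sqrt(length(x))          # standard error of the mean
mean(x) + c(-1.96, 1.96) * se          # lower and upper 95% confidence limits
```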
In this case the mean and standard deviation are not such good measures of center and spread. On the graph below, the mean is rather far from the bulk of the data. A range of one standard deviation around the mean does not contain the same number of measurements on each side.
Median and quartiles are better descriptive statistics for such data: the median indeed is in the
center of the data and the quartiles nicely reflect the asymmetry in the distribution, i.e. the
distance between 25% quartile and median is smaller than the distance between 75% quartile and
the median (Figure 1-11). Alternatively, you can use data transformations and calculate mean and
standard deviation from transformed data (see Basic analyses with R).
Figure 1-11 Descriptive statistics on a right-skewed distribution
Summary
Descriptive statistics are important to check data and are used to summarize data.
Range, quartiles and median are basic descriptive statistics for data with any distribution.
Mean and standard deviation are more useful for symmetrically distributed data.
1.3 Exercises: Getting started with statistics
1-A
Please have a look at the two distributions below, A and B. They correspond to commonly
observed distributions of biological data. Read the statements below and select whether they are
true for distribution A, distribution B, both distributions or neither of the distributions.
1-B
You and a friend wonder if it is "normal" that some bottles of your favourite beer contain more beer than others although the volume is stated as 0.33 L. You find out from the manufacturer that the volume of beer in a bottle has a mean of 0.33 L and a standard deviation of 0.03 L. If you now measure the beer volume in the next 100 bottles that you drink with your friend, how many of those 100 bottles are expected to contain more than 0.39 L, given that the information from the manufacturer is correct?
1-C
Your data is distributed as shown below. Where do you expect the median to be?
Select one:
To the left of the mean.
To the right of the mean.
At the same place as the mean
1-D
Check all answers that are correct.
A P-value of 0.051 for a t-test …
(a) …means that 0.949 % of the data are greater than the mean.
(b) …indicates a 5.1% probability of Type I error
(c) …proves that there is no difference between the groups.
(d) …shows that the difference would be significant if more data were used.
(e) …is regarded significant in biostatistical analyses.
(f) …is regarded non-significant in biostatistical analyses.
Solutions to Exercises Getting started with statistics
1-A
Most values are between 3 and 7 – True for A
Values are between zero and 9 – True for A and B
Corresponds to about 1000 values in total – True for A and B
Symmetric – True for A
Asymmetric – True for B
Corresponds to about 350 values in total – True for neither
Less than 200 values are smaller than 4 – True for A
Many small values occur – True for B
Many large values and few small values – True for neither
Few extreme values occur – True for A
Most values are smaller than 4 – True for B
1-B
Correct answer:
0.39 L corresponds to the mean plus two standard deviations (0.33 + 2 × 0.03), and values larger than 0.39 L are thus expected to occur in 2.3% of the cases. 2.3% of 100 bottles is 2.3, i.e. about 2 bottles.
The correct answer is: 2.3
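This solution can be checked in R with `pnorm()`, using the mean and standard deviation given in the exercise:

```r
# Probability of a bottle containing more than 0.39 L
p <- 1 - pnorm(0.39, mean = 0.33, sd = 0.03)
round(100 * p, 1)    # 2.3, i.e. about 2 of 100 bottles
```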
1-C
The correct answer is: To the right of the mean
1-D
(b), (f)
2 Getting started with R
2.1 What is R?
For more information please check out this list of Webpages and books on R at the end of this
section.
R is available for Macintosh, PC and Linux operating systems and is easy to install. To download and install R, please go to the Comprehensive R Archive Network page (CRAN) and follow the instructions there.
Goal
In this section, you will learn:
Elements of R: Script, Console & Co.
Using R involves mostly writing commands (or “code”) rather than clicking on menus. The
commands are usually assembled in a script that can be saved and reused. The R console
receives the commands either from the script or by direct typing, shows the progress of analyses
and displays the output. Graphical output will open in a separate window. This means that working with R can involve quite a lot of windows and files: the script, the console, graphical output and then, of course, your data files and other user-specified output. You can assign a working directory where all of these files and outputs are saved by default. Nonetheless, it can be difficult to keep track of all these windows on the screen when working with R! (Figure 2-1)
An excellent way of ordering and manipulating your R windows and files is to use the free and powerful interface RStudio (see below). We highly recommend that you use this program.
The RStudio screen is divided into four resizable parts (Figure 2-2). The upper left part contains
a script editor where commands are written and saved. The various tabs in the upper left part
can contain multiple scripts and also data files. Commands are sent to the console in the lower
left part using the key combination cmd + Return (Macintosh) or Control+R (PC). On the right
side, RStudio displays a workspace tab listing all objects in the current analysis and a history tab
providing a recollection of executed commands. The lower right partition hosts a figure tab
where graphical output is stored, a package tab where packages can be viewed and installed, a
file tab to manipulate files and a help tab where R help information can be searched and
displayed.
[Figure 2-2: the RStudio window, with the script editor ("write your commands here") in the upper left and the Workspace/History tabs on the right]
In RStudio, you can bundle your analyses into projects using the project drop-down menu on the
top (through “file” for PC and through “project” for Mac) or the pop-up menu in the top right
corner of RStudio (both versions). Projects will contain all elements of analyses allowing you to
continue a session exactly where you ended the previous time.
You can set a new working directory following the menu options Session > Set working directory
(Figure 2-3).
Figure 2-3 Setting the working directory in RStudio
To create a new script, you can follow “File > New File > R script”, or use the shortcut Ctrl +
Shift + N. Save your scripts regularly. A file that has been modified but not saved again will
show with a red title and a * at the end.
You can navigate between different plots produced during a session using the blue arrows at the
top left corner of the Plots tab. You can save your graphs by clicking on “Export”.
In R, you can create scripts that are saved as an .R file, can be re-used and serve as a
documentation of your analysis. Scripts are written and manipulated in the script editor window,
and should be saved to the working directory with a .R file extension. It is also possible to enter
code directly into the console. Typing in the console can be faster when trying out code.
However, any analysis that you need to keep a record of should be created as a script and saved.
You send commands from the script to the console such that they are executed by
highlighting one or multiple lines and pressing the keyboard shortcuts cmd + Return
(Macintosh) or Control+R (PC).
You can enter comments preceded by #. Everything written after # will be ignored and
not executed in the console (Figure 2-5).
In RStudio, you can use 4 of these symbols #### after a title to organize the scripts and
mark specific points that you can find easily at the bottom left of the script window
indicated by an orange # (Figure 2-5).
Observe the ">" sign in the console. This is the command prompt, the place where you
can type in your commands and execute them by pressing return. The command prompt
indicates that R is ready to receive commands and has finished executing previous
commands. R will display a "+" if a command is longer than a single line or incomplete.
On the console, you can cycle through previous commands using the "arrow up" and
“arrow down” keys on your keyboard.
You assign information to a variable using the assignment operator, an "arrow" composed of a smaller-than sign and a minus sign (<-) pointing to the name of the variable. There are certain rules for naming your variables:
The first character must be an English letter (or a dot not followed by a number).
You can use uppercase or lowercase letters.
Blanks and special characters, except for the underscore and the dot, are not allowed to appear in variable names.
For example, the command
greeting <- "hej"
will assign "hej" to an object named greeting. All text has to be in quotes (" "); otherwise R will look for an object with this name and produce an error message.
To call an object and see what it contains, you enter its name. Type the object name on the command line as below.
greeting
[1] "hej"
The [1] indicates that this is the first (and only) element of this object.
You can check what variables have been created in R using the command
ls()
This will result in a list of the current objects in the workspace. In RStudio you can view and
manipulate current objects in the workspace tab. You can also remove (delete) objects using
rm(), for example
rm(greeting)
Variables are overwritten without notice whenever something else is assigned to the same name.
If you want to delete all current objects, use
rm(list=ls())
Data types
R deals with numbers, characters (entered in quotes, " ", as the “hej” in the example above) and
logical statements (TRUE or FALSE).
Vector: one-dimensional, contains a sequence of one type of data, i.e. numbers OR categories (letters, group names) OR logical statements. Vectors can be created using c(element1, element2, element3, ...), which concatenates (connects, one after another) the different elements into a vector. Note that the elements can themselves be vectors. For example,
c("population1","population2","population3","population4")
Number sequences can be created using the operator ‘:’. For instance,
x <- 1:7
Besides, there are a number of other functions for creating vectors, for example seq() for user-defined sequences and rep() for repeated elements. You can find out about these functions using the R help.
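A sketch of these three ways of creating vectors:

```r
1:7                      # the sequence 1 2 3 4 5 6 7
seq(0, 1, by = 0.25)     # 0.00 0.25 0.50 0.75 1.00
rep("a", times = 3)      # "a" "a" "a"
```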
Factor: similar to vectors but also contains information on levels. Entries of a factor that are equal belong to the same factor level, or in other words, to the same category. Factors can be created from vectors using factor().
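For example, a factor named sex could be created as follows (the entries here are assumptions for illustration):

```r
# A factor with four entries and two levels
sex <- factor(c("female", "male", "male", "female"))
levels(sex)    # "female" "male": the two categories
```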
Data frame: collection of vectors and factors of the same length that can contain different data
types. This is the format commonly used for data analysis, where each row corresponds to an observation and each column corresponds to a variable (vector or factor). The section Getting data into R explains how to create data frames from your data, and the sections Accessing and changing individual entries, Accessing and changing entire rows or columns, and Adding and deleting columns explain how to handle and manipulate the contents of data frames.
List: collection of elements of any format and type, can be created using list( ). Outputs of
statistical analyses are often lists. Other types of objects include matrices and arrays.
Work flow in R
1. Define/create a folder on your computer that is to be used as a working directory.
2. Open R Studio and create a new file in the script editor (see R scripts in RStudio)
The hash sign (#) is used in scripts to identify text that is NOT a command, i.e., titles and
comments, and prevent R from trying to execute such text.
3. Add a title preceded by a hash sign (#), for example, # My first analysis.
4. Set the working directory to your prepared folder (see 1). The working directory can be
changed at any time.
If you want to make sure that the working directory is correct, use
the command getwd() to obtain the path to the current working directory.
5. Save the script file now and regularly later on using the menu or symbol on the script
window or the shortcut Ctrl + S. The script file has the extension .R and will be saved in
the working directory by default. This script can be re-used and shared.
6. Load data from the working directory into an object (see Loading data files).
7. Conduct analyses, produce output and graphs creating further objects and save the script
and outputs /graphs.
8. Quit R using the command q() that you can type into the console or by closing the R
Studio window. This will result in a question whether you want to save the workspace.
You can safely answer no to that; save the script and the data instead. Saving the
workspace is only recommended for analyses that take a long time to complete.
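The first steps of this workflow can be sketched as the start of a script (the working directory below is a stand-in; replace it with the path to your own prepared folder):

```r
# My first analysis ####
setwd(tempdir())   # step 4: replace tempdir() with the path to your own folder
getwd()            # check that the working directory is correct
```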
Basic calculations in R
R is sometimes referred to as an "overgrown" calculator. For example,
3 + 4
will result in
[1] 7
numbers.1 * 2
Creating a second vector with 3 elements, numbers.2, and adding these two vectors:
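The code for this example did not survive in full; a minimal reconstruction (the element values are assumptions) could look like this:

```r
# Element-wise arithmetic on vectors
numbers.1 <- c(1, 2, 3)    # hypothetical first vector
numbers.1 * 2              # 2 4 6: each element is multiplied by 2
numbers.2 <- c(10, 20, 30) # hypothetical second vector
numbers.1 + numbers.2      # 11 22 33: elements are added pairwise
```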
You can use this to establish functional relationships of interest, plot and examine them. Try out
this code! Note that c(1:100) creates a vector with the numbers from 1 to 100.
plot(c(1:100)^2)
Summary
In R, you use a script window to enter the commands. Commands are transferred to the R
console for execution. Scripts can be saved and re-used.
Data, output and scripts are saved in a designated working directory.
R stores data and analysis outputs as objects that often are vectors, factors, data frames or
lists. They contain data as numbers, characters or logical statements. Factor levels or other
text must be in quotes "".
Objects are given names with the assignment arrow <-.
The workflow in R involves setting the working directory at the beginning and saving the
script file repeatedly.
In R, you can conduct basic mathematical calculations directly and element-wise, for
example on vectors and on columns of data frames.
2.4 Handling data
Goal
In this section you will learn how to
In R, you can input data either by directly typing within the script or by loading data files.
Below is an example from plant experiments. You measured the width and the length of six
leaves in the plant species Silene dioica in cm. The first three plants were flowering and the last
three were not flowering.
You can enter the data directly as arguments to the function data.frame().
Each column is set by an argument: column.name = c(data). Note that the data entries
within c() must be separated by commas and the arguments (columns in this case) within
data.frame() must also be separated by commas. The function c() produces a vector
from the data (compare data types). You can choose column names freely. Note that data for the
flowering state is entered in quotes because flowering state is a category.
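The data-entry command itself is not shown above; a sketch of what it may look like is below. All measurement values are hypothetical, except the 2.8 cm leaf width of plant 3, which is referred to later in the text; stringsAsFactors = TRUE is added so that flowering.state becomes a factor in recent R versions:

```r
# Entering the leaf measurements directly as a data frame (values hypothetical)
Silene.leaves <- data.frame(
  plant.number    = c(1, 2, 3, 4, 5, 6),
  leaf.width      = c(2.2, 2.5, 2.8, 1.9, 2.1, 2.3),
  leaf.length     = c(4.5, 5.0, 5.6, 3.8, 4.2, 4.6),
  flowering.state = c("flowering", "flowering", "flowering",
                      "non-flowering", "non-flowering", "non-flowering"),
  stringsAsFactors = TRUE   # make flowering.state a factor (R >= 4.0)
)
```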
Alternatively, the data can be assigned to vectors first and then combined into a data frame.
Vectors can also be used for analysis by themselves.
You need this information to ensure correct loading of your data in R, explained on the next
page.
If you want to use a direct command to read data we recommend read.table() because it is
universally applicable. Within the read.table() command you can specify to browse your
computer for the file to load using the argument file = file.choose(). Or you can
enter the path of your file: file = "document_name.txt" for example. You indicate
whether your data contains a header (a row of titles) with header = TRUE (has a header) or
header = FALSE (no header). The table separator is set using sep, for example, sep =
";" (semicolon), sep = "," (comma), or sep = "\t" (tabulator). The decimal
separator is specified by argument dec, for example dec = "," (comma) or dec = "."
(point).
The input file needs to be assigned to an object using the arrow <-. In most cases, this will
automatically be a data frame object.
For a .csv file with header, semicolon-separated entries and decimal commas (as usually used for Swedish settings for .csv files saved from Excel) the command looks like this:
my.data <- read.table(file = file.choose(), header=TRUE, dec=",",
sep=";" )
For a .csv file with header, comma -separated entries and decimal points (as common in North
America) the command looks like this:
my.data <- read.table(file = file.choose(), header=TRUE, dec=".",
sep="," )
Note: when you execute the commands nothing happens! See next page for how to access the
data.
Remedy: copy your data (and only your data!) to a new sheet in Excel, save it as .csv and reload.
Non-English characters and signs. Non-English characters (ä, ö, é, etc.), signs (for example * ! & / | \ + - > < $ ? =) or spaces in the column names will produce an error message.
Remedy: change the names in an Excel file or directly in a .csv file, save it as csv and reload.
Note: Non-English characters elsewhere in the data might also lead to an error message and thus
should be avoided.
str(Silene.leaves)
This indicates that the object Silene.leaves is a data frame with six observations, i.e. six
rows in our input file, and four variables, i.e. four columns in our input file. This matches well
with the six plants and four columns in our input file.
The column names and the type of the columns are also given, together with the first few values (in this case all six values). The columns plant.number, leaf.width and leaf.length are numeric (continuous numbers) and you will be able to do calculations with these numbers. The flowering.state column is a factor with the two levels "flowering" and "non-flowering"; you will be able to use this factor as a grouping variable. All of this appears as expected and correct.
If you want to change the type of a column, for example changing plant.number from numeric to factor (because the number is a "name" in this case), use the following command:
Silene.leaves$plant.number <- as.factor(Silene.leaves$plant.number)
You can check whether this has been successful by calling the structure command again.
To look at the data you can also just type the name of the object. This is advisable only for small
data frames. For example
Silene.leaves
will yield
The vector width.vector has six elements. We can access its third element using the vector name followed by square brackets and the element number. This line will bring up the third element:
width.vector[3]
[1] 2.8
Should we realize that this element needs to be changed from 2.8 to 3.0 we can do that using the
assignment arrow:
width.vector[3] <- 3.0
width.vector
We can access the elements of data frames in the same way, except that data frames have two
dimensions, rows and columns, such that two numbers separated by comma are needed with the
square brackets. The first number always refers to rows, the second to columns. To access the
element in the third row and the second column in our Silene.leaves data frame we use
Silene.leaves[3, 2]
[1] 2.8
Incidentally, this is the same measurement as in the vector example above. To change it to 3.0 in the data frame we use the same kind of assignment operation:
Silene.leaves[3, 2] <- 3.0
Silene.leaves[3, 2]
[1] 3.0
Entire rows and columns of data frames can be accessed by leaving column (or row) number
empty in the square brackets. Note that the comma must always be entered because data frames
have two dimensions. Accessing rows and columns is needed to conduct analyses and to make changes or calculations. For example,
Silene.leaves[ ,2]
brings up the entire second column, and
Silene.leaves[3, ]
brings up the entire third row, for example to check that plant's measurements. The 3 before the comma is the row number.
Column names can be used in place of the numbers. R has a special notation for columns involving the dollar sign, as you may have noticed in the output of the structure command. The following line will also bring up the leaf width column:
Silene.leaves$leaf.width
Alternatively, the column names can be entered in quotes directly within the square brackets (note the comma!):
Silene.leaves[ ,"leaf.width"]
Should you now realize that the width measurements all need to be increased by 0.2 you can do that using
Silene.leaves$leaf.width <- Silene.leaves$leaf.width + 0.2
OR
Silene.leaves[ ,2] <- Silene.leaves[ ,2] + 0.2
OR
Silene.leaves[ ,"leaf.width"] <- Silene.leaves[ ,"leaf.width"] + 0.2
Which of these options is most convenient depends on your column names, the size of your data file and your preferences.
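The command that creates the ratio column is missing above; a likely version is sketched below (the column name width.length.ratio and the stand-in values are assumptions, since the original line is not reproduced):

```r
# Small stand-in for the Silene.leaves data frame (values illustrative only).
Silene.leaves <- data.frame(plant.number = factor(1:3),
                            leaf.width   = c(4.0, 4.7, 3.0),
                            leaf.length  = c(5.3, 4.9, 5.7))

# Assigning to a new column name with $ adds that column to the data frame.
Silene.leaves$width.length.ratio <-
  Silene.leaves$leaf.width / Silene.leaves$leaf.length

str(Silene.leaves)   # the new numeric column appears in the structure
```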
Calling the structure command shows that the column has been added and is numeric.
str(Silene.leaves)
Deleting one or several columns can be done using the minus sign within the square brackets. This only works with column numbers, not with column names. This line removes the newly added width-length ratio column:
Silene.leaves <- Silene.leaves[ ,-5]
str(Silene.leaves)
Removing rows, for example when you realize that the measurements of an entire row are faulty, works in the same way. This line removes the first row (observe the placement of the comma!):
Silene.leaves[-1, ]
If you need to remove more than one row or column, use the c() command within the square brackets. The first line below removes rows 1 to 3, the second removes columns 1 and 4:
Silene.leaves[-c(1:3), ]
Silene.leaves[, -c(1,4)]
Subsetting data
There are many situations where only a specific subset of the data needs to be accessed. In R this is done by entering logical statements into the square brackets for row and column selection. If you want to select, for example, only the flowering plants in the Silene.leaves data frame, you use a Silene.leaves$flowering.state == "flowering" statement for row selection. The line
Silene.flowering <- Silene.leaves[Silene.leaves$flowering.state == "flowering", ]
will produce a new data frame named Silene.flowering containing only the flowering plants.
str(Silene.flowering)
'data.frame': 3 obs. of 4 variables:
$ plant.number : num 1 2 3
$ leaf.width : num 4 4.7 2.8
$ leaf.length : num 5.3 4.9 5.7
$ flowering.state: Factor w/ 2 levels "flowering","vegetative":
1 1 1
Silene.leaves$flowering.state == "flowering":
In words, this statement means something like "check for each element of Silene.leaves$flowering.state whether it reads "flowering" or not". If you execute only the logical statement, you create a vector with six elements; the first three are TRUE (corresponding to the flowering plants) and the last three are FALSE (corresponding to the vegetative plants).
Silene.leaves$flowering.state == "flowering"
When you use such statements for row selection, all rows corresponding to TRUE will be
selected, in this case the first three rows.
Note that R does not assume that you will use only columns of the same data frame for the logical statements; in fact, you can also use columns from other data frames or vectors. For this reason you need to write Silene.leaves$flowering.state == "flowering" and not only flowering.state == "flowering".
== identical
!= not identical
> greater than
< smaller than
Logical statements can be combined, for example with | (OR). The following line selects all rows in which the leaf width or the leaf length is smaller than 3.5:
Silene.leaves[Silene.leaves$leaf.width < 3.5 |
Silene.leaves$leaf.length < 3.5, ]
Further, subset() is a useful function to perform this kind of selection and subset a data set. The first argument specifies the data frame to subset. The second argument is a logical expression, as explained above, used to select specific rows in the data frame, and the third argument indicates the columns to be selected by their names (if several columns are selected, the names have to be in a vector). If you only want to omit one column, use - in front of the column name; for example,
New.Silene.leaves <- subset(Silene.leaves, flowering.state=="flowering", select=-flowering.state)
will create a new data frame containing only the rows concerning the flowering plants, and all
columns except the flowering.state column (which is not needed any longer since we know all the
plants in the data set are flowering).
Since you specify the data frame to subset in the first argument of the subset() function, you
can directly refer to the different variables (columns) by their names, without using the $ sign.
Summary
2.5 Dealing with missing Values
Goal
In this section you will learn how to
> log(-1)
[1] NaN
Warning message:
In log(-1) : NaNs produced
Calculating the mean of a vector that contains NA,
mean(c(1,2,3, NA))
[1] NA
yields NA. Setting the optional argument na.rm (for "NA remove") to TRUE tells R to consider only the non-NA values in the calculation; thus,
mean(c(1,2,3,NA), na.rm=TRUE)
[1] 2
This also works for range(), sd(), var(), sum(), median(), max(), min()and
many other commands.
An exception is the command length(). It gives the number of cases regardless of the presence of NA. Thus,
length(c(NA, NA, NA))
[1] 3
The commands cor() for correlation and cov() for covariance ignore NA when given the argument use="complete.obs":
cor(n.1, n.2, use="complete.obs")
Here, n.1 and n.2 are two vectors of the same length.
Other commands such as lm() for calculating linear models ignore NA in the default setting.
Consult the help files to find out how NA is dealt with for specific commands (see lesson on
interpreting help files).
The command is.na() can be applied to any data structure or part thereof. It returns a logical value for each element of the data, with TRUE for both NA and NaN and FALSE for other entries. To find only NaN use is.nan(). For example,
is.na(c(1,NA,3,NA,5))
returns
[1] FALSE  TRUE FALSE  TRUE FALSE
Vectors of logical values can be summed, because TRUE is automatically converted to 1 and FALSE to 0. This way the number of missing values can be obtained. For example,
sum(is.na(c(1,NA,3,NA,5)))
returns
[1] 2
The command complete.cases(data.name) returns a logical vector with TRUE for every row that contains no missing values.
Note that the function summary(data.name) will also provide the number of NA in each column.
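A small sketch of how complete.cases() can be used to inspect and drop incomplete rows (the toy data frame is illustrative):

```r
# Toy data frame with one NA in each of rows 2 and 3.
d <- data.frame(x = c(1, NA, 3), y = c(4, 5, NA))
complete.cases(d)        # TRUE only for rows without any NA
d[complete.cases(d), ]   # keeps only the complete first row
```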
To find out where the NA are in the data use the command which(), for example
which(is.na(c(1,NA,3,NA,5))==TRUE)
returns
[1] 2 4
Data entries can be set to NA with the assignment arrow, both in vectors and in data frames:
numbers.1[1] <- NA
data.name[2,3] <- NA
Note that these changes are made to the data frame object stored in R's current workspace, NOT to your original data file.
Summary
Missing data types in R are NA (not available, to be used in data tables) and NaN (not a number).
Many commands have optional arguments to deal with missing values, for example
na.rm=TRUE will tell R to ignore missing values in mean(), range(), sum() and
other basic functions.
The command is.na(data.name) is used to identify NA and NaN.
sum(is.na(data.name)) will return the number of missing values in the data and
which(is.na(data.name)) will return subscript numbers of the elements that are
NA or NaN.
Data entries can be set to NA with the assignment arrow as in numbers[1] <- NA.
2.6 Understanding help() functions
Help files for a specific command are opened with a question mark followed by the command name:
?t.test
At the top of the page, the package that the command originates from is given in curly braces. Here, t.test{stats} shows that the command t.test() originates from the package stats. Further sections contain a description, the usage, the arguments and the value or object returned by the command. The help file for t.test() indicates that its arguments include x, y, alternative and mu. It further explains that the t.test() command returns a list object including the value of the t-statistic, the estimated mean or difference in means, the degrees of freedom and the P-value.
The help pages end with references, similar commands (the "see also" section) and, importantly, examples. Example code can be copy-pasted directly into the console and uses only built-in data. Running example code is a very good way to examine how to work with a command.
When looking for a command or term, use two question marks followed by the term you are looking for. Information on correlation analyses, for example, is found by typing
??correlation
This command will open a table of commands related to correlation in some way. The table lists the commands and the packages they originate from, as well as a short description. Clicking on these entries will take you to the help files for these commands.
A number of webpages are dedicated to the use of R. For beginning users a search in the R help archive can be very helpful. It collects questions on R, and answers there are often given by well-known authors of R books and packages.
The Namazu R search page is accessible directly or from the R console using the command RSiteSearch("search.word"); make sure to enter the search word in quotes (" "). This page often leads to newer and more advanced topics. Try it out!
2.7 Exercises
2-B
What is the correct code to load the data in a .csv file that looks like this?
Fertilizer;plant.biomass;plant.height;seed.weight
high;0,180476248;21,31200596;0,029829762
…
Select one:
(a) read.table(file=file.choose(), header=T, sep=",", dec=",")
2-C
B. How many rows (observations) and columns (variables) does the iris dataset have?
C. Which variable of the data frame iris is a factor and how many levels does it have?
Select one:
(a) The variable Species is a factor and it has 5 levels.
2-D Loading and graphical exploration of data
Please download this file and load it into R: sunflower fertilizer file
A. Is there any indication in the graphs that plant height or seed weight differ between plants
subjected to the two fertilizer treatments?
Select one:
(a) Plant height appears to be considerably larger in plants treated with the high nutrient fertilizer, whereas seed mass appears to be similar in plants from both treatments.
(b) Plant height appears to be considerably lower in plants treated with high nutrient fertilizer,
whereas seed mass appears to be similar in plants from both treatments.
(c) Plants from the two treatments do not appear to differ in height or seed weight.
(d) Plant height and seed weight appear to be lower in plants from the low nutrient fertilizer
treatment.
B. Create a new data frame containing only the rows of the “low” treatment.
Solutions:
2-A
A. v1=seq(1,4.9,0.3): you have to create a regular sequence, thus indicating the use of the function seq(), from 1 (first
argument) to 4.9 (second argument) using an increment of 0.3 (third argument)
B. v2=rep(1:4,3): this time you need to repeat the sequence of the 1 to 4 integers three times, so you should use the rep()
function, giving the vector to be repeated (first argument) and the number of repeats (second argument). As the vector to be
repeated is simply the sequence of integers from 1 to 4, you can use the : symbol to save time.
C. v3=c(v2,85): the vector you have to create is basically the same as the vector in question b), with just 85 added at the end.
So you should use the function c() that concatenates several vectors into one, in the order specified.
D. v4=seq(14,0,-2): again a sequence, but this time in decreasing order. You can generate that by using seq() with a negative
increment (-2 here).
E. v5=rep(c(5,12,13,20),each=2): this time we do not want to repeat a whole vector as in b), but we want to repeat each element
of a vector twice. This is done by using the argument each in the function rep(). The vector (5,12,13,20) is not an obvious
sequence so we just use c() to provide the vector to the rep() function.
2-B
d
2-C
A. data.frame
B. 150 observations, 5 columns
C. Species; it is a factor with three levels: setosa, versicolor, virginica.
2-D
A. a
Read the table into R using this command: t <- read.table("Downloads/sunflower.csv",sep=",",header=T)
Make the boxplots by typing: boxplot(t$plant.height ~ t$Fertilizer) and boxplot(t$seed.weight ~ t$Fertilizer)
B. t.low <- t[t$Fertilizer=="low",]
2-E
A. mtcars[mtcars$cyl >= 4, ]
B. mtcars[1:10, ]
2-F
c
2.8 Web resources and books on R
Web resources
CRAN page
R-Studio
Quick R
Useful web page for beginners and slightly more advanced users, related to the book R in action mentioned below
The perfect pocket card. Great for refreshing your memory or pointing you in the right direction
Some of the presentations from Paul H. Geissler's well-known web-based introductory R course (now retired)
Stackoverflow (forum)
Every time you type an R-related question into Google, this is one of the best hits to follow
Questions and answers on R use. Note that the domain stat.ethz.ch often has good information.
R tutor
Books
Click on the title to see the book in the Swedish libraries
Large overview with many biological examples, suitable for beginning users with some
statistical knowledge.
Very good and concise introduction to R for people with experience in statistical analyses
from other programs.
A quite brief (44p) list of basic tips on how to use R for data exploration. Only
recommended for beginners. Most of those tips are found in Quick R
R in action (Kabacoff)
From the author of Quick R; this book follows a case-study approach with many
practical data sets. Previous experience with R is desirable.
A long compendium of case-studies, some of them a bit outdated and quite technical.
Ideal for more advanced students.
R graphics (Murrell)
A popular guide on how to make perfect graphs. Intended for those who want to
improve their artwork in R.
Teaches you how to use R for programming efficiently. Previous experience with R and
programming concepts is required.
Rather technical explanations on how to use the S language (the one R is based on) for
statistics. For more advanced learners.
3 Basic Statistics with R
Continuous data, also known as numeric data, is any form of data in which the data points can take any numeric value within a given range. Common examples include measurements such as height and weight, and derived quantities such as slopes and integrals.
Categorical data, also known as factor data, is any form of data in which data is grouped into
multiple categories. Examples of this include species type, hair color, etc. Binary data is a subset
of categorical data in which the data can only be one of two groups (e.g. dead or alive, heads or
tails, etc.).
Being able to distinguish between these types of data is extremely important, because as we will
see later, the type of data being used is an important factor in deciding the appropriate way to
analyze the data statistically.
One of the simplest diagnostics used in R is the table() command. This command allows for
the creation of contingency tables that report the counts of cases (rows) in different categories of
another variable or of combinations of several variables. These tables are extremely useful for checking that your data is in the correct format.
To demonstrate this we will use the example dataset warpbreaks, which records the number of warp breaks for different wools at different levels of tension. The researchers set up this experiment so that each combination of wool and tension had an equal sample size. To check this we could simply print the data and count the measurements by hand; however, this would become extremely tedious in larger datasets. We can instead use the table() command to answer the question:
data(warpbreaks)
table(warpbreaks$wool,warpbreaks$tension)
L M H
A 9 9 9
B 9 9 9
We can thus quickly confirm that each combination of treatments does indeed have 9 measurements associated with it. Further use of statistics to analyze contingency tables will be discussed in a later section.
Applying the command plot() to an entire data frame produces a matrix of scatterplots for every pair of variables, a useful first graphical overview:
plot(iris)
Histograms
The command hist() produces a histogram displaying data values on the x-axis against their frequencies on the y-axis, allowing you to judge the distribution of the data. The command hist() is applied to individual variables (columns) of the data, which are specified by the name of the data frame followed by a dollar sign and the name of the variable (column). The output is shown in Figure 3-2.
hist(iris$Sepal.Length)
Boxplots
Boxplots display continuous data separated into the levels (groups) of a factor (grouping
variable). In the default settings, the command boxplot() shows medians as thick black lines and quartiles as a box around the median. The t-bars ("whiskers") extend to the most extreme data points within 1.5 times the inter-quartile distance from the box. Data points outside that range are regarded as outliers and are displayed as circles. The main argument in the boxplot() command is a formula statement relating the continuous variable on the left side to the grouping variable on the right side with a tilde symbol (continuous.variable.name ~ factor.variable.name). Boxplots (Figure 3-3) can be used to get an idea on whether
there are large differences between groups, whether the data is distributed symmetrically within
groups and whether there are outliers.
boxplot(Sepal.Length~Species, data=iris)
Figure 3-3 Boxplot of sepal length of Iris
You can learn how to produce nicer looking graphs in R in the section Basic graphs in R.
summary(iris)
You can calculate these and further descriptive statistics directly using the following commands (Table 3), which are all applied to vectors (or data frame columns) of continuous variables:
Mean: mean(variable.name)
Median: median(variable.name)
Range: range(variable.name)
Standard deviation: sd(variable.name)
No. observations: length(variable.name)
Variance: var(variable.name)
The main arguments of tapply() are X, the variable that you want to summarize, INDEX, one or more grouping variable(s), and FUN, the command you want to apply. The INDEX variable is given as a list object, which is typically created within the tapply() command using INDEX=list(variable.name).
tapply(X=variable.name, INDEX=list(variable.name),
FUN=command.name)
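Applied to the built-in iris data, the pattern looks like this (iris is shipped with R, so the lines can be run directly):

```r
# Mean sepal length for each species in the built-in iris data set.
group.means <- tapply(X = iris$Sepal.Length,
                      INDEX = list(iris$Species),
                      FUN = mean)
round(group.means, 2)   # setosa 5.01, versicolor 5.94, virginica 6.59
```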
3.5 Comparing two groups of measurements
One-sample test
Paired-sample test
Paired-sample tests are used when two different measurements were taken on the SAME experimental units. Examples are before-and-after studies of the effect of medical treatments.
Figure 3-4 Workflow for one- and two-group comparisons
You can view the distribution of your data using the command hist()as described above.
The command qqnorm(your.vector) produces a quantile-quantile plot (qq-plot) that is
regarded as the best graphical assessment of whether or not data conforms to the normal
distribution. A quantile is a value of the data that is just larger than a certain percentage of the
data. The median, for example, is the 50% quantile and the quartiles are the 25% and 75% quantiles.
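Quantiles can be computed directly with the quantile() command; a quick sketch:

```r
# Quartiles of the numbers 1 to 9.
x <- 1:9
quantile(x, probs = c(0.25, 0.5, 0.75))   # 25%: 3, 50%: 5, 75%: 7
```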
The qq-plot displays two different types of quantiles. On the y-axis sample quantiles, i.e. each
data point, are indicated. You can check this out by comparing the histogram and the qq-plots
below; the histogram and the y-axis of the qq-plot have the same range.
The x axis of the qq-plot represents the standardized theoretical quantiles for a normal
distribution corresponding to each data point. The qqnorm(your.vector) command first calculates the quantile of each value in the data, i.e. what percentage of the data is smaller than or equal to that value. It then looks up the corresponding quantile (i.e. value) of the standard normal distribution with a mean of 0 and standard deviation of 1. Thus, points with values around zero for the theoretical quantile should be close to the mean of the data on the y-axis.
qq-plots are evaluated with the aid of the command qqline(your.vector), producing a
line from the first to the third quartile of the data. You expect the points to be close to this line if
the data is normally distributed.
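A minimal sketch of the two commands on simulated data (the seed is arbitrary):

```r
set.seed(42)
x <- rnorm(250)   # 250 values drawn from a standard normal distribution
qqnorm(x)         # sample quantiles (y) against theoretical quantiles (x)
qqline(x)         # reference line through the first and third quartiles
```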
Below you see examples of histograms and qq-plots (Figure 3-5) for 250 data points that are normally distributed, left skewed and right skewed. Data that look like the left skewed or right skewed examples below should be transformed before analysis. If this does not work, a non-parametric test should be used.
Figure 3-5 Histogram and QQ plot for normally-distributed, right skewed, left skewed
data
For smaller datasets, considerable deviations from the line are expected even in normally distributed data, especially at the extremes. Below you see examples of histograms and corresponding qq-plots (Figure 3-6) for five and ten values sampled from a normal distribution. If your data looks like this you can still use the parametric tests!
Figure 3-6 Histogram and QQ plot for small datasets
Transformations
To obtain normally distributed data for further analysis the following transformations (Table 3-2
Useful transformations) are recommended:
Remember that most of these commands, with the exception of the exponential transformation
for left-skewed data, are not defined for values smaller than zero. You may need to add a
constant value to all values in order to perform the transformation.
You can transform your data, assign it to a new vector or data frame column and plot it again.
You may need to try out several different transformations. If you are still not satisfied with the
distribution, please use a non-parametric test.
It is also possible to apply the transformations directly with other commands, for example
hist(log(your.data)).
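A quick way to see the effect of a transformation is to plot the data before and after; a sketch with simulated right-skewed data:

```r
set.seed(1)
x <- rlnorm(250)   # log-normal data: strongly right skewed
hist(x)            # skewed histogram
hist(log(x))       # approximately symmetric after log transformation
```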
Once you have established that the distribution of your data is normal, you are ready to conduct the appropriate t-test; otherwise, proceed to the non-parametric alternatives.
T-tests are calculated using the command t.test(). The arguments to this command identify
the type of test to be conducted.
One-sample t-test
We use our example in which we want to test whether average human height is 1.77m. This is
our data:
height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)
Be aware that the (two-sided) alternative hypothesis is that average human height differs from 1.77 m. We declare the test in the following way and obtain the following output:
t.test(height, mu = 1.77)
Therefore, we cannot reject the null hypothesis that average human height is 1.77 m.
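Since the printed output is not reproduced here, note that t.test() returns a list, so single results such as the p-value can be extracted with the $ sign:

```r
height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)
result <- t.test(height, mu = 1.77)
result$p.value   # clearly above 0.05, so the null hypothesis stands
```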
Before running the test it is important to consider your alternative hypothesis, specifically
whether you want to run a one-tailed or two-tailed test. If no alternative hypothesis is specified,
the command will assume a two-tailed test.
The two-sample t-test has a second assumption in addition to the normality of the data: equal variance in the two samples. If the variances can be assumed to be equal, this is specified using the argument var.equal = TRUE; otherwise Welch's t-test, which does not assume equal variances, is used by default. Here we assume equal variances and perform a two-tailed test.
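The original butterfly data set is not reproduced here; the call itself can be sketched with illustrative data chosen to match the group means shown below (the variable names dispersal and sex are assumptions):

```r
# Illustrative dispersal distances with group means 2.2 (female) and 4.4 (male).
dispersal <- c(1.2, 2.0, 2.4, 2.6, 2.8, 3.4, 4.0, 4.6, 5.0, 5.0)
sex <- factor(rep(c("female", "male"), each = 5))

# Formula notation: continuous variable ~ grouping factor.
t.test(dispersal ~ sex, var.equal = TRUE)
```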
mean in group female mean in group male
2.2 4.4
Thus, male butterfly dispersal distance is significantly different from that of female butterflies. We can also specify a one-sided alternative hypothesis by adding the argument alternative = "less" or alternative = "greater", depending on which tail is to be tested:
The results of these tests state that female dispersal distance is not significantly greater than male dispersal distance.
Here you simply add the argument paired=TRUE to the command from the two-sample test above.
Paired t-test
data: sleep.before and sleep.after
t = 0.7906, df = 5, p-value = 0.465
alternative hypothesis: true difference in means is not equal
to 0
95 percent confidence interval:
-1.501038 2.834372
sample estimates:
mean of the differences
0.6666667
Well, this test is NOT significant, and thus the data does not support an effect of exams on students' sleeping time. But maybe you forgot about the party after the exam?
3.7 Non-parametric alternatives
In certain cases it is impossible to meet the assumption of normality required for standard t-tests, even after transformations. In such cases a non-parametric alternative such as the Wilcoxon family of tests may be appropriate. These include one-sample, two-sample (also named the Mann-Whitney U test) and paired alternatives, all available through the command wilcox.test(). The syntax of wilcox.test() is similar to that of t.test(); see ?wilcox.test.
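As a sketch, the one-sample height example from above can be re-run non-parametrically (with ties in the data, R will warn that the p-value is approximate):

```r
height <- c(1.43,1.75,1.85,1.74,1.65,1.83,1.91,1.52,1.92,1.83)
wilcox.test(height, mu = 1.77)   # one-sample Wilcoxon signed-rank test
# Two-sample (Mann-Whitney U) and paired versions use the same syntax as
# t.test(): wilcox.test(y ~ group) and wilcox.test(x, y, paired = TRUE).
```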
Pearson Correlation
One of the most common correlation analyses for parametric data is the Pearson product-moment correlation coefficient, commonly called Pearson's r. This test seeks to determine the level of relatedness between two variables using a score that runs from -1 (perfect negative correlation) to 1 (perfect positive correlation). A value of zero indicates no correlation. Since Pearson's r is parametric, it is advisable to test the assumption of normality before running this test.
In R, Pearson's r can be calculated using the cor.test() command. Here we once again use the sample data set iris to assess the correlation between sepal length and petal length:
cor.test(iris$Sepal.Length, iris$Petal.Length)
This data shows a highly-significant (P-value < 2.2e-16) and strongly positive (0.87) correlation
between these two variables. Note that in this case, the P-value is used to reject the null
hypothesis that the true correlation is equal to zero.
Spearman Correlation
A non-parametric alternative to Pearson’s r is Spearman's rank correlation coefficient, or
Spearman’s rho. Like Pearson’s r, Spearman’s rho determines the level of correlation of two
variables ranging from -1 to 1. The difference between the two measures is that Spearman uses
the rank-order of the data rather than the raw values.
Spearman's rho can also be calculated using the cor.test() command, this time setting the argument method = "spearman". Below we repeat the previous correlation analysis using Spearman's rho:
cor.test(iris$Sepal.Length, iris$Petal.Length, method = "spearman")
Note that this test produces similar, but not identical, results compared with Pearson's r.
Basic contingency tables have two categorical variables. In many cases we may wish to test whether the two grouping variables are independent. One of the most common ways to analyze contingency tables is with the χ²-test (chi-square test). The χ² test works by first calculating the standardized differences between expected and observed values:
X² = Σ (observed − expected)² / expected
The result of this calculation, the so-called X² score, is then compared to the χ² distribution to calculate a p-value that determines whether the observed values differ significantly from the expected values.
In order to demonstrate the usage of χ2 tests we will use an example of eye color counts in two
different groups of flies. The dataset can be found in the attachments as 3.9_flies_eyes_color.csv.
We begin by loading the data and creating a contingency table.
A B
Red 34 41
White 16 9
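If the attachment is not at hand, the table above can also be entered directly as a matrix (a sketch; in the script, tab is produced from the .csv file with the table() command):

```r
# Contingency table of eye color by fly group, entered column by column.
tab <- matrix(c(34, 16, 41, 9), nrow = 2,
              dimnames = list(eye.color = c("Red", "White"),
                              group     = c("A", "B")))
tab
```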
Note that the ratio between red and white eyes differs between group A and group B. We will use a chi-squared test to determine whether the data is more compatible with the null hypothesis that the variables of eye color and group are independent of each other or with the alternative hypothesis that eye color and group are not independent.
chisq.test(tab)
data: tab
X-squared = 1.92, df = 1, p-value = 0.1659
In this case, based on a chi-squared value of 1.92 and 1 degree of freedom, we calculated a P-value of 0.1659. Thus, the data is compatible with the null hypothesis that eye color and group are independent.
The same table can also be analyzed with the command prop.test(), which compares the two proportions directly:
prop.test(tab)
data: tab
X-squared = 1.92, df = 1, p-value = 0.1659
alternative hypothesis: two.sided
95 percent confidence interval:
-0.43264180 0.05930846
sample estimates:
prop 1 prop 2
0.4533333 0.6400000
The first two lines return the same values as we observed with the chi-squared test. However, two additional groups of numbers are present in this output. The first is the 95% confidence interval; these two numbers are the lower and upper estimates of the difference between the two group proportions. Note that because the interval includes zero, we are not confident that there is a difference between the two groups at all. The sample estimates show the estimated proportion of group A individuals among the red-eyed and among the white-eyed flies, respectively.
3.10 Summary
3.11 Exercises
3-A
Below you find a number of descriptions of experiments.
Please assign the appropriate test: one-sample test, two independent groups or paired samples
A. You read that it costs on average 600 SEK to go to the hairdresser in Uppsala and you want to find out whether that is actually true. You walk through the city and obtain prices from 10 hairdressers.
B. You investigate whether the flower color of your grandmother's orchids becomes more intense after applying fertilizer. You score color intensity in 10 orchids before fertilizing and one week after fertilizing.
C. You want to investigate whether arrival time at lectures differs between male and female students. You come 15 minutes early to large lectures and record the arrival time and sex of the students. You obtain data from 60 women and 57 men.
D. You study two species of plants, red and white campion. You want to know which species has
larger flowers and measure flower size in 50 individuals of each species.
E. You want to study whether the hand people write with is stronger than their other hand. You ask 25 people to participate in your experiment and measure how well they can squeeze balloons with either hand.
3-B
A nonparametric test is applied when:
a. There are no parameters
b. The variables are not independent
c. The groups do not have the same sample size
d. The assumptions of parametric tests are not met
3-C
Below you find data on the snow-melt times in two different habitats, snow-beds and ridges.
Use a t-test to find out whether there is a significant difference in snowmelt times between these
two habitats. Assume that the variances are equal. What is the P-value?
Generate the data using this code:
snowmelt <- c(110, 120,109,101,105,99,106,108,95,98)
habitat <- c(rep("snowbed",times=5), rep("ridge", times=5))
3-D
You are investigating two different nature conservation areas (area 1 and area 2). You would like
to know if interactions between poplar trees and leaf eating insects differ between these two
reserves. For this purpose, you measured the leaf area that has been eaten (in %) on 10-20 year
old poplars. 52 trees were sampled in each reserve. The data is available in the
attachments as 3-D.
Does the consumed leaf surface differ significantly between the two reserves?
Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the results critically.
3-E
You want to understand interactions between insects and oaks. Do older trees support more insects, and if so, how much more? For this purpose, you set up insect traps in 20 oak trees of different ages and measure the total dry weight of insects collected in one month (July). The data
is available in the attachments as 3-E.
Hint: conduct the appropriate statistical test, produce a graph and interpret the results.
3-F
You want to study willow shrub responses to herbivory – do willows produce tannins (that are
known to act as defense compounds) in response to herbivory?
For this purpose, you selected 10 willow shrub pairs of similar age and size, growing
close to each other. In May and June you spray one shrub in each pair with insecticide each
week and the other one with water. At the end of June you measure tannin concentration in 50
leaves per shrub. The data is available in the attachments as 3-F.
Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the
results.
3-G
Explore the swiss data frame and indicate whether the following statements are true or false:
A. Fertility positively correlates with religiosity and agriculture, and correlates negatively with
education and examination
B. Education and examination are not strongly positively correlated
C. Swiss cities with better education are usually more catholic
D. More extensive agriculture causes more fertility
E. The more catholic you are the more fertile you will be
3-H
Using Spearman correlation and Pearson correlation, the correlation between fertility and
mortality is:
(a) 0.44 and 0.42 (b) 0.42 and 0.44 (c) 0.43 and 0.41 (d) 0.41 and 0.43
3-I
You count the people coming into a bank on a sunny summer afternoon and record the
color of their clothes: 35 men wore a white t-shirt, 22 a blue one and 7 a black one, while
14 women wore a white dress, 12 a light-blue one and 8 a dark one. Answer the following
questions:
A. Is the color chosen at random?
B. Assuming that the female/male ratio in the city where the bank is situated is 50:50, do men
go to the bank more often than women?
C. Is color selection gender-biased?
3-J
You receive the following data on the colour and behaviour of crabs:
             Blue   Red
Aggressive     36    24
Passive        32    28
Are colour and behaviour independent?
Solutions
3-A
A. one-sample test, B. paired samples, C. two-independent groups, D. two-independent groups, E. paired samples
3-B
d
3-C
P = 0.09112: t.test(snowmelt~habitat). The data indicate that there is no significant difference between the two habitats.
3-D
#Load data, check object and data
poplar <- read.table(file.choose(), sep=";", header=T, dec=","); str(poplar)
#Convert area to a factor
poplar$area <- as.factor(poplar$area)
#Plot data
boxplot(consumed_surface~area, data=poplar, xlab="Area", ylab="Consumed surface")
#Check normality
par(mfrow=c(2,2))
hist(poplar$consumed_surface[poplar$area=="1"], main="Area 1")
hist(poplar$consumed_surface[poplar$area=="2"], main="Area 2")
qqnorm(poplar$consumed_surface[poplar$area=="1"]); qqline(poplar$consumed_surface[poplar$area=="1"])
qqnorm(poplar$consumed_surface[poplar$area=="2"]); qqline(poplar$consumed_surface[poplar$area=="2"])
#The data have only minor deviations from normality, so a two-sample t-test is appropriate
t.test(consumed_surface~area, data=poplar)
#Output and graphs
means <- tapply(poplar$consumed_surface, poplar$area, mean)
se <- tapply(poplar$consumed_surface, poplar$area, function(x) sd(x)/sqrt(length(x)))
par(mfrow=c(1,1))
mp <- barplot(means, ylim=c(0,20), las=1, xlab="Area", ylab="Consumed surface")
arrows(mp, means-se, mp, means+se, angle=90, length=0.2, code=3, col="black", lty=1, lwd=1)
Interpretation: The consumed leaf surface is around 17.5% in both areas and does not differ significantly between them.
3-E
#Load data, check object and data
grazing <- read.table(file.choose(), sep=";", header=T, dec=","); str(grazing)
par(mfrow=c(1,1))
plot(grazing$insect ~ grazing$age, xlab="Age of the tree (year)", ylab="Occurrence of insects")
#Analysis: linear regression; check the assumptions via the residual plots
model <- lm(grazing$insect ~ grazing$age)
par(mfrow=c(2,2)); plot(model)
#Output
summary(model); anova(model)
#This graph is optional: regression line with confidence and prediction intervals
p_conf1 <- predict(model, interval="confidence")
p_pred1 <- predict(model, interval="prediction")
par(mfrow=c(1,1))
plot(grazing$insect ~ grazing$age, xlab="Age of the tree (year)", ylab="Occurrence of insects")
abline(model)
matlines(grazing$age, p_conf1[,c("lwr","upr")], col=2, lty=1, type="b", pch="+")
matlines(grazing$age, p_pred1[,c("lwr","upr")], col=2, lty=2, type="b", pch=1)
Interpretation: Insects occur more frequently on older trees; on average, each additional year of age is associated with an increase of 16 mg in the dry insect biomass supported.
3-F
#Load data, check object and data
tannin <- read.table(file.choose(), sep=";", dec=",", header=T); str(tannin)
#Testing assumptions: the within-pair differences should be roughly normal
tannin$diff <- tannin$water - tannin$insecticide
qqnorm(tannin$diff); qqline(tannin$diff)
#Analysis: the shrubs are paired, so use a paired t-test
t.test(tannin$water, tannin$insecticide, paired=T)
mean(tannin$diff); sd(tannin$diff)/sqrt(length(tannin$diff))
Interpretation: Insecticide application significantly reduced the tannin content compared to the water treatment, suggesting that the presence of insects induces tannin production.
3-G A. True, B. False, C. False, D. False, E. False
3-H a
3-I
A. No, p-value = 0.0001381: chisq.test(c(35+14, 22+12, 7+8))
B. Yes, p-value = 0.002442: chisq.test(c(35+22+7, 14+12+8))
C. No, p-value = 0.2105: summer <- as.table(rbind(c(35,22,7), c(14,12,8)));
dimnames(summer) <- list(gender=c("M","F"), color=c("White","Blue","Black")); chisq.test(summer)
3-J Yes, P = 0.5805
4 Linear models
One-way ANOVA
Tests whether the means of more than two groups are the same, for example
whether fruit production differs among five populations of a plant species.
If there are only two groups, a t-test is the way to proceed. ANOVA relates
variance within groups to variance between groups. The analysis does not,
however, tell you which groups are significantly different from each other.
For this purpose a Tukey test can be applied.
Two-way ANOVA
This analysis assesses the influence of two grouping factors on group
means, for example, whether irrigation and fertilization have an effect on
plant growth. Importantly, two-way ANOVA can also analyze whether the
two factors interact, in the example, whether the effect of irrigation depends
on the fertilizer level (or the other way around). This is called a statistical
interaction. The same methods can also be applied to studies with more
than two grouping factors (multi-way ANOVA).
Linear regression
First, start by exploring the data through a basic plot (check the plot section). Based on this, define
the model and analyze the residuals. If the residuals are normally distributed, obtain and
interpret the results. Otherwise, try transforming the data and re-checking the residuals, or use a
different model (Figure 4-1).
Figure 4-1 Workflow for linear models: plot the data, define the model, analyze the residuals; if the
residuals are OK, obtain and interpret the results; if not, transform the data or try a different model.
4.4 Defining the model
First you must define the linear model that you want to use, using the lm() function. Within
this function a so-called formula statement defines the relationship of the variables to each other.
The response variable is always on the left side of the tilde symbol (~) and the explanatory
variable(s) are on the right side, as in lm(response ~ explanatory). For instance, if one uses the
airquality internal dataset of R and wants to build a model predicting the ozone content of the
atmosphere from wind speed, the model definition would be as follows:
Observe that we are assigning the model to an object, for example My.model <- lm (…,
…). This is a good practice since later you can just use the object for testing assumptions and for
extracting results.
Whether a variable is categorical or continuous is defined in the data itself and can be
checked with str() (see the data section); R will fit the appropriate model accordingly. Thus, the
following formula statement will yield a one-way ANOVA:
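A sketch of such a statement, assuming (consistent with the ANOVA table shown later in this chapter) that ozone is modeled against month treated as a factor:

```r
# Month is categorical here, so lm() fits a one-way ANOVA
My.aov <- lm(Ozone ~ factor(Month), data = airquality)
```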
Formula statements are further used to combine explanatory variables and to define interactions.
If variables should be considered only by themselves (additive effects), for example in a multiple
regression without interactions, you connect the variables by a plus sign, as in:
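For example (Wind and Temp are illustrative explanatory variables from airquality):

```r
# Additive effects of wind and temperature, no interaction
lm(Ozone ~ Wind + Temp, data = airquality)
```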
On the other hand, if you want to consider interactions in addition to the additive effects use an
asterisk (*) between the explanatory variables, as in:
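For example (again with illustrative variables from airquality):

```r
# Main effects of wind and temperature plus their interaction
lm(Ozone ~ Wind * Temp, data = airquality)
```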
3. The residuals, i.e. the differences between the observed values of a response
variable and the values fitted by the model, are normally distributed with a mean
of zero.
Analysis of residuals is thus a key step when conducting linear model analyses.
We can for now concentrate on two diagnostic plots for the analysis of residuals. The first is the
Tukey-Anscombe plot, which displays residuals vs. fitted values; fitted values are those predicted
by the model.
The other diagnostic plot is the qq-plot that was explained in the section basic statistical analyses.
You obtain both of these graphs from your model object using the plot() command.
In the following example, one is interested in whether wind speed is a good predictor
of ozone levels, using the airquality internal dataset of R, as in the previous section
(defining the model). The output is presented in Figure 4-2.
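A sketch of the code producing this output, mirroring the log-transformed version (My.model2) shown after Figure 4-3:

```r
My.model <- lm(Ozone ~ Wind, data = airquality)
plot(My.model, which = c(1, 2))  # Tukey-Anscombe plot and Q-Q plot
```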
Figure 4-2 Residuals and Q-Q plot for a situation when assumptions are not fulfilled.
In the first graph, the Tukey-Anscombe plot, we expect a random scatter of points around zero.
If there is a pattern in the graph, such as a funnel shape, the model fit is not good; in our
example, the residuals at higher fitted values are much larger than those at low values, violating
assumption 2 above. This is very common in measurement data and can be related to
larger variation at higher values. A log-transformation of the response variable often improves
the model fit, as in this case (see Figure 4-3). The corresponding qq-plot, testing assumption 3
above, also improves after log-transformation (see assumption here) and thus this analysis
should definitely be based on log-transformed values.
My.model2 <- lm(log(Ozone) ~ Wind, data = airquality)
plot(My.model2, which = c(1,2))
Figure 4-3 Residuals and Q-Q plot for a situation when assumptions are fulfilled.
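The ANOVA table below can be reproduced with anova(); a sketch, assuming ozone is modeled against month as a factor (the degrees of freedom, 4 and 111, match this model):

```r
# One-way ANOVA of ozone by month; anova() prints the ANOVA table
anova(lm(Ozone ~ factor(Month), data = airquality))
```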
Response: Ozone
Df Sum Sq Mean Sq F value Pr(>F)
Month 4 29438 7359.5 8.5356 4.827e-06 ***
Residuals 111 95705 862.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In the output, Df stands for degrees of freedom, the number of values in the final
calculation of a statistic that are free to vary; the F value is the test statistic, calculated as
the ratio between the explained and the unexplained variance; and the corresponding p-value for
this F statistic is the probability of obtaining an F value at least as large as the observed one if
the null hypothesis (no effect of the explanatory variable on the response variable) were true.
The command summary(model) will show the parameters estimated by the model, for
example the slope of the regression for regression analyses or the difference between group
means for ANOVA.
Residuals:
Min 1Q Median 3Q Max
-3.4219 -0.4662 0.0663 0.5021 1.4035
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.75331 0.21879 21.726 < 2e-16 ***
Wind -0.13726 0.02153 -6.376 4.39e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The first part of the output indicates the model that was run and the distribution of residuals by
means of quantiles (see basic stats section). In the first row of the table under the title
“Coefficients” one can find the estimate of the intercept of the regression line, a t-value, which is
the test statistic testing whether the estimate for the intercept is different from zero, and the
p-value for this test.
In the second row of the same table one can find the slope, which is the effect of the explanatory
variable on the response variable (a.k.a. the magnitude of the effect), and the same test statistic and
p-value as before, but this time testing whether the slope is significantly different from zero. If
the slope differs from zero, there is an appreciable effect, which can be either
positive or negative (gray and black lines, respectively, in the case III plot in figure 1.6.2). Notice
that for these two tests there is a significance coding that is shown at the bottom of the table.
Finally, R2, the proportion of the variance explained by the model, is presented at the
bottom of the output; it indicates how much of the variation in the data is explained by the
model.
Interpreting interactions
Sometimes the effect of an explanatory variable on a response variable depends on another
explanatory variable; this is termed an interaction. Below you will find some examples of what
interactions look like and how to interpret them. In the following graphs, different colors
symbolize different grouping variables. The X-axis is the explanatory variable, categorical in the
case of the bar-plots and continuous in the case of the dot-plots with tendency line. The Y-axis is
the continuous response variable.
In the following graph (Figure 4-4), you can assume that the continuous response variable is seed
productivity, the white and dark-gray colors correspond to watering treatments (irrigation and
drought) and A and B are different populations. A two-way ANOVA is the right test to be
applied in a scenario such as this, a graphical representation of which is shown in figure 1.6.1.
The first two figures show cases where the interaction term of a two-way ANOVA is not
significant, while in the last two cases the interaction term of the ANOVA is significant. In the
first case, the response variable (i.e. seed productivity) differs between populations but not
between treatments. In the second case it also differs between treatments. Observe that in the
fourth case the overall mean between the populations is the same. In this context the interaction
cases (III & IV) can be interpreted as follows: the effect of the treatment (water availability) on
the response variable (i.e. seed productivity) depends on the population.
Figure 4-4 Bar plots for two treatments (white and grey) and two populations (A and B)
Now, assume that instead of having the treatment as a categorical variable, there is a spectrum of
different values. For instance, instead of having drought and watered treatments, we measured
the amount of water naturally available in the soil (x-axis in Figure 4-5) in the two different
populations (black and empty dots). ANCOVA is the test to be applied in this particular case. To
the left of Figure 4-5 you can see a case where the interaction term of an ANCOVA is not
significant (case I), but in the next two cases the interaction term of the ANCOVA is significant.
In other words, if the interaction term is significant then how the response variable (i.e. seed
productivity) varies as a function of the continuous explanatory variable (i.e. water availability)
will depend on the population. As discussed in the section interpreting the model, this effect can
be assessed by looking at the slope. Observe that in the first two cases the response variable also
varies regarding the population, which does not happen in the third case (mentally try projecting
the empty and filled dots into the y-axis).
Figure 4-5 Dot plots for two populations (filled and empty dots); panels show cases I, II and III
The following examples follow the workflow structure (see here): make exploratory plot, define
the model, check assumptions, and analyze and interpret the summary and test statistics.
One-way ANOVA
To test whether fruit production differs between populations of Lythrum, fruits were counted on
10 individuals in each of 3 populations.
fruits <- data.frame(fruits = c(24, 19, 21, 20, 23, 19, 17, 20,
23, 20, 11, 15, 11, 9, 10, 14, 12, 12, 15, 13, 13, 11,
19, 12, 15, 15, 13, 18, 17, 13), pop = c(rep(c(1), 10),
rep(c(2), 10), rep(c(3), 10)))
fruits$pop <- as.factor(fruits$pop)
plot(fruits ~ pop, data = fruits)
(Boxplot of fruits for populations 1, 2 and 3.)
model<-lm(fruits~pop,data=fruits)
par(mfrow=c(1,2));plot(model, which=c(1,2))
(Residuals vs. fitted values and normal Q-Q plot for the model.)
anova(model)
Analysis of Variance Table
Response: fruits
Df Sum Sq Mean Sq F value Pr(>F)
pop 2 420.0 210.000 19.104 6.767e-06 ***
Residuals 27 296.8 10.993
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The analysis shows that fruit production differs among populations. You can also make a bar
plot out of this data. For that, please refer to the section on how to make plots.
Two-way ANOVA
In a study on pea cultivation methods, pea production was assessed in two treatments of
irrigation (normal irrigation and drought) and in three treatments of radiation (low, medium and
high). 10 plants in each of the six combinations were considered.
plants <- data.frame(
  seeds = c(39,39,39,40,40,39,41,42,40,40,39,38,41,41,40,41,40,40,41,40,
            38,40,40,39,42,40,39,41,39,40,39,40,41,40,41,39,40,41,40,39,
            42,40,39,39,42,40,39,39,39,39,41,38,40,39,41,42,40,40,40,41),
  irrigation = c(rep(1, 30), rep(2, 30)),
  radiation = rep(c(1, 2, 3), 20))
plants$irrigation<-as.factor(plants$irrigation)
plants$radiation<-as.factor(plants$radiation)
par(mfrow=c(1,2));plot(seeds~irrigation*radiation,data=plants)
Figure 4-8 Boxplots of seed numbers (response variable) categorized by irrigation and
radiation (explanatory variables)
model<-lm(seeds~irrigation*radiation,data=plants)
par(mfrow=c(1,2));plot(model, which=c(1,2))
(Residuals vs. fitted values and normal Q-Q plot for the model.)
anova(model)
Analysis of Variance Table
Response: seeds
Df Sum Sq Mean Sq F value Pr(>F)
irrigation 1 0.067 0.0667 0.0747 0.785671
radiation 2 2.233 1.1167 1.2510 0.294370
irrigation:radiation 2 11.433 5.7167 6.4046 0.003192 **
Residuals 54 48.200 0.8926
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This analysis shows that the effect of irrigation depends on that of radiation, as the interaction
term is significant. To find out more, this analysis should be followed by separate analyses within
the irrigation or radiation treatments (split the dataset).
Linear regression
In an experiment testing whether the duration of male courtship behavior depends on female size,
16 pairs of earwigs were observed.
(Scatterplot of male courtship duration against female size, fem_size.)
model<-lm(male_court_hrs~ fem_size,data=sex)
par(mfrow=c(1,2));plot(model, which=c(1,2))
(Residuals vs. fitted values and normal Q-Q plot for the model.)
summary(model)
Call:
lm(formula = male_court_hrs ~ fem_size, data = sex)
Residuals:
    Min      1Q  Median      3Q     Max
-1.7679 -0.8837  0.2574  0.6771  1.8334
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.3698 12.8304 2.133 0.0511 .
fem_size -0.2929 0.2152 -1.361 0.1950
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This analysis suggests that larger females do not elicit longer male courtship behavior.
Summary
Linear models specify a linear relationship between a response variable and one or more
explanatory variables.
When working with linear models, the workflow involves exploring the data through a plot,
defining the model, analyzing the residuals and, depending on the outcome of this step,
obtaining and interpreting the results or transforming variables and/or trying another model.
The assumptions of linear models are that the experimental units are independent and
sampled at random, that the residuals have constant variance across the values of the
explanatory variables, and that the residuals are normally distributed with a mean of zero.
Interactions are cases in which the effect of an explanatory variable on a response variable
depends on another explanatory variable.
Use the command model <- lm() to define the model, the command plot(model) to
check the assumptions, and the commands summary(model) and anova(model) to
retrieve the model estimates and an ANOVA table of the analysis.
When defining a model with two or more explanatory variables, use * to include both direct
and interaction effects, + to include only direct (additive) effects.
4.8 Exercises
Exercise 4-A
R has different internal datasets that do not require being uploaded. What sort of linear model is
conducted by the code below? Choose between one-way ANOVA, two-way ANOVA, linear
regression, multiple regression, and ANCOVA. To decide, use the command
str(…) to understand the structure of each internal dataset (e.g. str(ChickWeight)). You can
also plot the data and run the code.
A. lm(weight~Time,ChickWeight)
B. lm(weight~Time*Chick,ChickWeight)
C. lm(GNP.deflator~Unemployed + Population,longley)
D. lm(weight~Diet+Chick,ChickWeight)
E. lm(weight~Diet,ChickWeight)
Exercise 4-B
Seven economic indicators were collected over 16 years (1947 to 1962) and were made
available in the data frame called "longley". GNP.deflator is the GNP implicit price deflator,
a measure of inflation. GNP is the gross national product, Unemployed/Employed is the number
of unemployed/employed people, Armed.Forces is the number of people in the armed forces and
Population is the 'noninstitutionalized' population over 14 years of age. Answer the following
questions:
A. Does unemployment imply a reduction in the gross national product? Does employment
increase the gross national product?
(a) YES / YES (b) NO /YES (c) NO / NO (d) YES / NO
B. What percentage of the variation in the gross national product is explained by the employment
rate? (Hint: refer to the concept of percentage of explained variance covered here)
C. How much (euros) does the gross national product increase for every person that is newly
employed? (Hint: refer to the concept of magnitude of effect and slope covered here)
Exercise 4-C
Are the following models adequate in terms of the residuals’ distribution? (They use the “longley” R
internal dataset)
A. lm(GNP.deflator~Unemployed+Armed.Forces+Population,longley)
B. lm(GNP~Employed,longley)
Exercise 4-D
Forests are sometimes fertilized with nitrogen compounds to increase their growth. However, this
could also change herbivory. In a greenhouse experiment, 42 three-year-old birch trees were
divided into six groups of seven trees each. The trees were subjected to two fertilization
treatments (yes and no) and three herbivory treatments (none, low, high), resulting in six
combinations of treatments. One tree died, so one treatment combination is missing a
replicate. The data is available in the attachments as 4-D. How do trees react to fertilization and
herbivory? Are these effects independent? Does fertilization increase herbivory risk?
Hint: conduct the appropriate statistical test, produce an appropriate graph and interpret the
results.
Exercise 4-E
In this exercise you are going to use linear models in order to perform a selection analysis in the
orchid Gymnadenia conopsea. A selection analysis is actually “simply” a multiple regression
analysis in which the response variable is fitness and the explanatory variables are the
different quantitative phenotypic traits. The idea is that if there is a relationship between a
phenotypic trait and fitness, some values of the trait are favored, i.e. the trait is
under selection. The strength of selection is represented by the slope of the regression line
between fitness and the phenotypic trait (See next figure).
The data and the full exercise are available in the attachments as 4Ea and 4Eb.
Hint: you will have to practice your skills in handling a data set (checking for outliers, subsetting,
computing and adding new variables), obtaining and interpreting descriptive statistics (means,
correlations) and graphics (exploration, bar plots with error bars), fitting linear models (multiple
regression), transforming and standardizing variables, and extracting values from a statistical
output for plotting.
Solutions
4-A
A. Linear regression, B. ANCOVA, C. multiple regression, D. two-way ANOVA, E. one-way ANOVA.
4-B
A. (b):
summary(lm(GNP~Unemployed, longley))$coefficients["Unemployed",]
summary(lm(GNP~Employed, longley))$coefficients["Employed",]
B. 97%: summary(lm(GNP~Employed, longley))$r.squared
C. 28 EUR: summary(lm(GNP~Employed, longley))
4-C
A. True: plot(lm(GNP.deflator~Unemployed+Armed.Forces+Population,longley)), B. True: plot(lm(GNP~Employed,longley))
4-D
#Load data, check object and data
fertilization <- read.table(file.choose(), sep=";", header=T, dec=","); str(fertilization)
#Make sure that the response is numeric
fertilization$growth <- as.numeric(fertilization$growth)
interaction.plot(x.factor=fertilization$fertilization, trace.factor=fertilization$herbivory,
 response=fertilization$growth, cex.axis=1)
#Analysis: two-way ANOVA
lm.fert.lab <- lm(growth~fertilization*herbivory, data=fertilization)
#Testing the assumptions: residual analysis
par(mfrow=c(2,2)); plot(lm.fert.lab)
#The assumptions are met: residuals are normally distributed and the model fit is satisfactory
#Result output and graph
anova(lm.fert.lab)
#Means, se and barplot
means <- tapply(fertilization$growth, list(fertilization$fertilization, fertilization$herbivory), mean)
se <- tapply(fertilization$growth, list(fertilization$fertilization, fertilization$herbivory),
 function(x) sd(x)/sqrt(length(x)))
mp <- barplot(means, beside=T, ylim=c(0,55), las=1, xlab="Herbivory", ylab="Growth (cm)", col=c(0,8))
legend(5, 55, legend=c("No","Yes"), fill=c(0,8), bty="n", title="Fertilization", horiz=T)
arrows(mp, means-se, mp, means+se, angle=90, length=0.05, code=3, col="black", lty=1, lwd=1)
Interpretation: The interaction between herbivory and fertilization has a significant effect on growth, indicating that the effect of fertilizer depends on whether or not there is grazing, and vice versa. The main effects (fertilizer and grazing) are difficult to interpret when the interaction is significant; if this is desired, the data must be split into the levels of one of the factors and reanalyzed with one-way ANOVAs. The graph of means shows that fertilization generally increases growth, and the more intense the grazing, the more pronounced the growth response to fertilization.
4-E
The solution with detailed explanations is available in the attachments as 4Ec.
5 Basic graphs with R
5.1 Bar-plots
Goal
In this section, you will learn how to script a grouped barplot of means with standard errors
indicated as T-bars (Figure 5-1), and how to adjust the layout of this type of graph.
How to do it
We are going to use the internal dataset ToothGrowth (available with the R installation), which
contains measurements of tooth length in guinea pigs that received three dose levels of vitamin C
via two supplement types (Figure 5-1). To explore this dataset you can use ?ToothGrowth,
str(ToothGrowth) and summary(ToothGrowth).
We want to produce a barplot of the mean tooth length for all six combinations of the two
factors (supplement type: 2 levels, dose: 3 levels). We first need to calculate the mean tooth
length for each of the combinations. For this, we use the command tapply(). tapply() can
return a table with mean tooth lengths for all six combinations and this table will be the input for
the barplot. Importantly, tapply() will create a matrix with two rows and three columns
corresponding to the factor levels in the dataset as you can see below. This structure is needed to
produce a grouped barplot.
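The means table can be computed as follows (mean.tg is the object name assumed by the barplot code that follows):

```r
# Mean tooth length for each supplement (rows) x dose (columns) combination
mean.tg <- tapply(ToothGrowth$len,
                  list(ToothGrowth$supp, ToothGrowth$dose),
                  mean)
round(mean.tg, 2)
```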
0.5 1 2
OJ 13.23 22.70 26.06
VC 7.98 16.77 26.14
You are now ready to use the command barplot(). The first argument is the data to use, in
this case our mean.tg matrix. Using the argument beside = T indicates that the bars
should be plotted beside each other instead of on top of each other.
barplot(mean.tg, beside = T)
We can customize the layout of the barplot using further arguments; otherwise, default options
will be used. The labels of the axes can be specified with the arguments xlab and ylab, and the
labels below each group of bars are controlled with the argument names.arg. The font size of
these labels can be changed with cex.lab and cex.names. These arguments are set to 1 by
default and changes are relative to this default. For example, cex.axis = 2 will double the
font size. The limit of the y-axis is specified with ylim; here we use the maximum and the
minimum in the dataset. The orientation of the axis labels can be altered with the argument las
that has four options (0,1,2,3). Here, las = 1 produces horizontal axis labels. The colors of
the bars are determined by col, in our example by a vector with a length of two for the two
groups, specifying 1 (black) and 8 (grey). The color can be specified either with numbers (1 to 8)
or with the color name. To get an overview of all available color names, type
colors(). You can further explore colors at http://research.stowers-
institute.org/efg/R/Color/Chart/.
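Putting these arguments together (the axis labels, limits and colors below are illustrative choices):

```r
# Group means of tooth length by supplement type and dose
mean.tg <- tapply(ToothGrowth$len,
                  list(ToothGrowth$supp, ToothGrowth$dose), mean)
barplot(mean.tg, beside = TRUE,
        xlab = "Vitamin C dose (mg/day)", ylab = "Tooth length",
        ylim = c(0, 30), las = 1, col = c(1, 8),
        cex.lab = 1.2, cex.names = 1.2)
```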
The next step is to add error bars to the barplot. There is no standard command to add error
bars. Instead, we have to draw them ourselves with the command arrows(). First, we need
the standard error of the mean for all six groups. We do this in the same way as calculating
the means: we use tapply() but ask for the calculation of the standard error. Besides the length of
the error bars, we also need the horizontal locations of the bars, such that they end up in the
middle of the bars. These midpoints, in the same matrix format as the means above, can be
extracted from a basic barplot(). We assign the barplot(…) command to an object, here
named midpoints, and use plot = F to suppress the plotting, as we want to use the improved
barplot we produced above.
Now we are ready to draw the error bars using the command arrows(). Within this command
we first state the position of the error bars by two sets of x and y coordinates corresponding to
the start and end of the error bars. The coordinates are given in our matrix format and
correspond to the six bars on the graph. The starting coordinate set, midpoints, mean.tg
- sem.tg identifies the six midpoints as x coordinates and the means minus the standard
errors as the six y coordinates. Likewise, midpoints, mean.tg + sem.tg is used for the
end of the error bars. We further use the argument code = 3 and angle = 90 such that we
get bars with T´s on both ends and not arrows. The arguments length and lwd set the size
of the T´s and line width of the entire error bars.
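These steps can be sketched as follows (the graphical settings are illustrative):

```r
mean.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)
# Standard error of the mean for each of the six groups
sem.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose),
                 function(x) sd(x) / sqrt(length(x)))
# Draw the barplot, then get the bar midpoints without plotting again
barplot(mean.tg, beside = TRUE, ylim = c(0, 30), las = 1, col = c(1, 8))
midpoints <- barplot(mean.tg, beside = TRUE, plot = FALSE)
# Error bars: from mean - SE up to mean + SE, with T-ends on both sides
arrows(midpoints, mean.tg - sem.tg, midpoints, mean.tg + sem.tg,
       code = 3, angle = 90, length = 0.05, lwd = 1.5)
```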
We can further use the command legend() to add a legend for our groups to the graph. We
can specify the place of the legend in the graph either with coordinates (here, 0.75, 30) or
with options such as "topright" or "topleft" (see the help for legend()). The fill
argument produces boxes with the specified colors to place next to the legend text. bty
determines whether a box is drawn around the legend (default: bty = "o", with box);
here, bty = "n" removes the box. The font size in the legend is determined by cex, as
explained above.
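A sketch of such a legend call (coordinates and colors chosen to match the barplot above):

```r
mean.tg <- tapply(ToothGrowth$len, list(ToothGrowth$supp, ToothGrowth$dose), mean)
barplot(mean.tg, beside = TRUE, ylim = c(0, 30), las = 1, col = c(1, 8))
# Legend without a surrounding box; boxes filled with the bar colors
legend(0.75, 30, legend = rownames(mean.tg), fill = c(1, 8),
       bty = "n", cex = 1.2)
```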
There are many more details of the plot that can be controlled and changed. For an overview of
the graphical parameters that can be changed by arguments use ?par.
Summary
A barplot can be made with the command barplot(), a higher-level plotting command
that creates a new graph. Mean values to be plotted should be calculated first with the
command tapply().
Error bars, calculated with tapply(), and a legend can be added with the lower-level
plotting commands arrows() and legend(), which add extra features to an existing graph.
A large number of graphical parameters can be used to customize plots via arguments.
5.2 Grouped scatter plot with regression lines
Goal
In this section, you will learn how to script a grouped scatterplot with regression lines (Figure 5-2)
and to adjust the layout of this type of graph.
How to do it
To produce a scatterplot, we will use the plot() command. plot() is a higher-level plotting
command that creates a new graph.
We are going to use part of the internal dataset Iris (available with the R installation) as an
example (Figure 5-2). Iris contains flower measurements of three different Iris species. You can
explore the dataset by ?iris, summary(iris) and str(iris). To reduce the dataset
o two species and to plot all the datapoints use:
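A minimal way to do this (keeping Iris setosa and Iris versicolor, and naming the reduced data frame iris.short as in the code below) is:

```r
# Keep only setosa and versicolor, and drop the now-unused factor level
iris.short <- iris[iris$Species != "virginica", ]
iris.short$Species <- droplevels(iris.short$Species)

# Quick look at all datapoints
plot(iris.short$Sepal.Length, iris.short$Sepal.Width)
```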
We can now assign two different plotting symbols for the species by creating a new column in
the data frame iris.short, named iris.short$pch, that contains the number of the
plotting symbol to be used. There are 26 different plotting symbols, ranging from 0 to 25. Here
we use symbol 1 for Iris setosa and symbol 16 for Iris versicolor. You can use the same procedure to
assign different colors to the two species (see above). We can then set the axis labels, range and
orientation as well as font size using xlab, ylab, xlim, ylim, las, cex.axis and
cex.lab as explained above.
iris.short$pch[iris.short$Species == "setosa"] <- 1
iris.short$pch[iris.short$Species == "versicolor"] <- 16
plot(iris.short$Sepal.Length, iris.short$Sepal.Width,
     xlab = "Sepal length (mm)", ylab = "Sepal width (mm)",
     xlim = c(4, 7.5), las = 1, cex.axis = 1.2, cex.lab = 1.3,
     pch = iris.short$pch)
The next step is to add a regression line for each species, assuming that sepal length causes
changes in sepal width (which may or may not be reasonable). For this, we first have to fit the
regression models. Subsequently, we plot the lines corresponding to these models with the lower-
level plotting command lines(). Each line is specified by its x and y coordinates, which are
both vectors: the x vector contains sepal lengths and the y vector contains the sepal widths
predicted by the model. We increase the line width using lwd = 1.5.
We should also add a legend to the figure. This works as above, and we can set species
names in italics using expression(italic()) for each legend entry.
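A self-contained sketch of these steps follows; the model names m.set and m.ver and the x ranges for the lines are illustrative choices, not fixed by the text.

```r
# Reduced dataset with one plotting symbol per species
iris.short <- iris[iris$Species != "virginica", ]
iris.short$pch <- ifelse(iris.short$Species == "setosa", 1, 16)

plot(iris.short$Sepal.Length, iris.short$Sepal.Width,
     xlab = "Sepal length (mm)", ylab = "Sepal width (mm)",
     xlim = c(4, 7.5), las = 1, pch = iris.short$pch)

# One linear model per species
m.set <- lm(Sepal.Width ~ Sepal.Length, data = iris.short,
            subset = Species == "setosa")
m.ver <- lm(Sepal.Width ~ Sepal.Length, data = iris.short,
            subset = Species == "versicolor")

# x vectors spanning each species' sepal lengths; y vectors from predict()
x.set <- seq(4.3, 5.8, 0.1)
lines(x.set, predict(m.set, list(Sepal.Length = x.set)), lwd = 1.5)
x.ver <- seq(4.9, 7.0, 0.1)
lines(x.ver, predict(m.ver, list(Sepal.Length = x.ver)), lwd = 1.5)

# Legend with species names in italics
legend("topright", pch = c(1, 16), bty = "n",
       legend = c(expression(italic("Iris setosa")),
                  expression(italic("Iris versicolor"))))
```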
There are many more details of the plot that can be customized. An overview of the graphical
parameters that can be changed can be viewed using ?par.
Summary
Scatter plots can be created with the higher-level plotting command plot().
A new vector in the data frame can be used to specify plotting symbols and colors.
The lower-level plotting command lines() can be used to add regression lines from a linear
model.
5.3 Exercises
Exercise 5-A
Please use the dataset below to produce a scatterplot where each point has a different
color and symbol.
x <- c(2,3,4,5,7,8,9,10)
y <- c(10,14,14,17,18,22,23,26)
Exercise 5-B
Starting from the graph produced by the code below, change the symbols to filled red
triangles that are twice as large.
plot(iris$Sepal.Length, iris$Sepal.Width,
xlab="Sepal length (mm)", ylab="Sepal width (mm)")
Exercise 5-C
Solutions
5-A
x <- c(2, 3, 4, 5, 7, 8, 9, 10)
y <- c(10, 14, 14, 17, 18, 22, 23, 26)
plot(x, y, las = 1, cex.lab = 1.5, cex.axis = 1.5, cex = 1.5, pch = 1:8, col = 1:8)
5-B
plot(iris$Sepal.Length, iris$Sepal.Width,
     xlab = "Sepal length (mm)", ylab = "Sepal width (mm)",
     pch = 17, col = "red", cex = 2)
5-C
means <- tapply(CO2$uptake, INDEX = list(CO2$Type, CO2$Treatment), FUN = mean)
se <- tapply(CO2$uptake, INDEX = list(CO2$Type, CO2$Treatment),
             FUN = function(x) sd(x)/sqrt(length(x)))
uptake <- barplot(means, beside = TRUE, col = c(0, 2), las = 1, ylim = c(0, 50))
arrows(uptake, means - se, uptake, means + se, angle = 90, length = 0.1, code = 3)
legend(x = "top", fill = c(0, 2), legend = c("Quebec", "Mississippi"), bty = "n", horiz = TRUE)
6 Logistic Regression
6.1 Goals
6.2 How to do it
Background
Logistic regression models are used in situations where we want to know how a binary
response variable is affected by one or more continuous variables. Common biological
examples include assessing the probability of survival, the probability of reproducing, or
the probability of an individual possessing a certain allele. On the natural scale, logistic
regression is non-linear and cannot be analyzed using linear models. However, this
problem is circumvented by using the logit transformation to linearize the model.
Model <- glm(probability_data ~ continuous_predictor,
             family = binomial)
To demonstrate this, we will use survival data collected by Quintana-Ascencio et al. on
Hypericum cumulicola, a plant endemic to the southeastern United States. This dataset,
available in the attachments as Hypericum, contains a binary
response variable (survival) as well as continuous predictor variables (log-transformed
number of fruits produced, and height in the previous year). First we create the
generalized linear model and use the summary() function to obtain a summary:
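Since the Hypericum attachment may not be at hand, the sketch below uses simulated stand-in data with the same structure (879 plants, matching the degrees of freedom in the summary below); with the real data you would skip the simulation lines. The model name LModel is reused in the plotting code further down.

```r
# Simulated stand-in for the Hypericum attachment: 879 plants with a
# log-height predictor and a binary survival response.
set.seed(1)
Hypericum <- data.frame(height = runif(879, 0, 4))
Hypericum$survival <- rbinom(879, 1,
                             plogis(3.99 - 2.19 * Hypericum$height))

# Fit the logistic regression and inspect it
LModel <- glm(survival ~ height, family = binomial, data = Hypericum)
summary(LModel)
```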
Call:
glm(formula = survival ~ height, family = binomial)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4470 -1.2380 0.6166 0.8544 1.2199
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.9931 0.4201 9.505 < 2e-16 ***
height -2.1885 0.2912 -7.515 5.71e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1018.68 on 878 degrees of freedom
Residual deviance: 947.12 on 877 degrees of freedom
AIC: 951.12
Number of Fisher Scoring iterations: 4
As can be seen from this summary, height has a significant negative effect on survival.
One thing that should be noted is that the intercept in this case is far larger than 1. This
is because summary presents values after logit transformation, so values are no longer
bound between 0 and 1.
There are several ways to determine the goodness of fit and significance of the overall
model rather than of the individual parameters. One way is to use the G2 statistic (similar to
a chi-squared statistic) to compare the null and residual deviance. This technique compares the
unexplained variation in a null model (that is, one that has no predictive value) to the
unexplained variation in the model being tested (i.e. the residual deviance). A greater
difference between null and residual deviance indicates that the model explains more of the
variation, and hence fits better. This difference is then tested against the chi-squared
distribution to determine a p-value.
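Using the deviances from the summary above, the test can be carried out by hand:

```r
# G2: difference between null and residual deviance
G2 <- 1018.68 - 947.12
# df: difference in degrees of freedom (878 - 877 = 1)
p <- pchisq(G2, df = 1, lower.tail = FALSE)
p   # approximately 2.7e-17
```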
This technique produces a p-value of ~2.7 x 10^-17, so the overall model in this case is
highly significant.
In order to plot on the logit scale, you must first define a sequence using the
seq() function.
Sequence <- seq(0, 4, 0.1)
The first and second numbers define the minimum and maximum values of the sequence
and the third number specifies the interval. Next we generate the predicted
values of the model using the predict() function:
PLMlogit <- predict(LModel, list(height = Sequence))
plot(Sequence, PLMlogit, type = "l", xlab = "Log(height)",
     ylab = "Logit Survival")
Figure 6-1 Relationship between height and survival on the logit scale
In order to plot a logistic curve we again use the predict() and plot() functions,
but we add an additional argument to predict(): type = "response".
PLMcurve <- predict(LModel, list(height = Sequence),
                    type = "response")
plot(Sequence, PLMcurve, type = "l", xlab = "Log(height)",
     ylab = "Survival")
This argument tells the predict function to output the response variable (survival) on its
original scale rather than on the transformed scale. The resulting figure should look like
this (Figure 6-2):
Figure 6-2 Relationship between log(height) and survival on the natural scale.
6.3 Summary
Logistic Regressions are used when you have a probability as a response variable and
a continuous predictor variable
Logistic curves are analyzed as generalized linear models glm() though the use of the
logit transformation.
Logistic regressions can be plotted either as a logistic curve or as a linear function.
6.4 Exercises
6-A
Repeat the example above using fruits as the predictor variable rather than height.
A. Compare the overall significances of both models. Which predictor variable is a better
fit for the data? Why?
B. Compare the graphs of the two predictor variables. How are they similar? How are
they different?
6-B
(Yes/No) Which of the following questions could be answered using logistic regression?
A. Is the probability of getting a head in a coin flip affected by wind speed?
B. Is there a correlation between rate of coffee consumption and hours worked?
C. Is the probability of successfully building a nest related to body size in doves?
D. Are yellow or red crabs more likely to occupy holes at the beach?
E. Does advertising spending affect the proportion of people who receive flu shots?
F. How does increasing the concentration of a drug affect mortality rates in mice?
G. Do different species of bears produce differing number of offspring?
H. Is the ratio of body length to width similar throughout the family Mustelidae?
6-C
You created a logistic model which has a null deviance of 432 and a residual deviance
of 425. Is the overall model significant?
Solutions:
6-A
a) Fruits is the better fit because the difference between null and residual deviance is larger.
b) Both effects are negative and both logit plots are linear. However, height has a stronger negative effect, and the shapes of the logistic plots
are different.
6-B
Y;N;Y;N;Y;Y;N;N
6-C
Yes, P=.008
7 R programming structures
Goal
As a full-fledged programming language, R comes with various looping and
conditional constructs. In this section, we will first discuss iteration and then
conditionals. R provides three basic C-style ways to write explicit loops: for(),
while() and repeat. Conditional evaluation is done with the functions
if() and ifelse().
How to do it
The syntax of the looping functions is listed below.
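In schematic form (VAR, SEQ, COND and EXPR are placeholders, not actual R objects):

```r
for (VAR in SEQ) { EXPR }   # iterate VAR over the components of SEQ
while (COND) { EXPR }       # repeat EXPR as long as COND is TRUE
repeat { EXPR }             # repeat EXPR until a break is reached
```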
The first one, for(), iterates through each component VAR of the sequence SEQ:
in the first iteration VAR = SEQ[1], in the second iteration VAR =
SEQ[2], and so on. The following code uses the for() structure to print the square
of each component of a vector.
for ( i in 1:5 ) {
print( paste('square of', i, '=', i^2) )
}
The other two loop structures, while() and repeat, rely on a change in the state
of an expression, or on break, to leave the loop. The statement break halts the
execution of the innermost loop and passes control to the first statement outside it.
Similarly, next exits the processing of the current iteration of the loop and causes the
execution of the next iteration. When using repeat or while(), special attention
should be paid to avoiding infinite loops, that is, loops that iterate without end. Below
is an example showing two different ways of accomplishing the same job.
i_w <- 1
while ( i_w <= 10 ) {
  i_w <- i_w + 5
}
i_w
[1] 11
i_r <- 1
repeat {
  i_r <- i_r + 1
  if (i_r > 10) { break }
}
i_r
[1] 11
Note that excessive use of loops can make your R code slow and hard to read. Although loops in
R are straightforward and convenient, you should sometimes avoid them because of their
computational cost, especially when working on long vectors.
A better alternative is to use vectorized functions, for example which(),
ifelse(), all(), etc. For matrix computations, you can use rowSums(),
colSums(), and so on.
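As a small illustration of the brevity gained, both versions below compute the same sum of squares, once with an explicit loop and once vectorized:

```r
v <- 1:100000

# Loop version: accumulate the sum of squares element by element
s_loop <- 0
for (x in v) {
  s_loop <- s_loop + x^2
}

# Vectorized version: one call, no explicit loop, much faster on long vectors
s_vec <- sum(v^2)
s_loop == s_vec   # TRUE

# Row sums of a matrix without any loop
m <- matrix(1:6, nrow = 2)
rowSums(m)
```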
Now it is time to move to conditionals. The syntax of the if()statement looks like this:
if ( COND ) {EXPR1} else {EXPR2}
The conditional COND is evaluated first; if it is TRUE, then expression EXPR1 is
executed, and if it evaluates to FALSE, then EXPR2 is executed. Note that when
COND evaluates to the numeric value zero, R treats it as FALSE, and when it evaluates
to any non-zero number, it is treated as TRUE. We can also easily extend or shrink if()
structures by adding or removing one or several else clauses, as they are optional. Note,
however, that the order of the conditional clauses is vital: once a condition
is satisfied, R ignores the rest of the if-else structure and jumps
out of it. Here is a simple example:
x <- 3
if ( !is.numeric(x) ) {
  stop( paste(x, 'is not numeric') )
} else if ( x %% 2 == 1 ) {
  print( paste(x, 'is an odd number') )
} else if ( x == round(x) ) {
  print( paste(x, 'is an integer') )
} else {
  print( paste(x, 'is a number') )
}
[1] "3 is an odd number"
You can assign another value to x, for example x <- 1.3 or x <- 'abc', and then
execute the if-else structure again to check the result.
Goal
In this section, you will learn how to develop your own R function.
How to do it
R provides a convenient way to define custom functions. All
functions read and parse input, referred to as arguments, and then return
output. An R function is actually a first-class object. It can be created using the
keyword function, which is followed by a comma-separated list of formal
arguments enclosed in a pair of parentheses, and then the expressions that form the body
of the function. If the body consists of a single statement, it can be entered directly;
when there are multiple expressions, they have to be enclosed in braces {}. The
value returned by an R function can either be yielded by the built-in function return() or
simply be the value of the last expression.
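As an example, consider a small function expon() that raises an integer x to the power n. The exact original definition is not reproduced here; the version below is a sketch consistent with the calls and the body() output shown further down.

```r
# Raise the integer x to the power n; refuse non-integer x
expon <- function(x, n) {
  if (x %% 1 != 0) {
    stop("x must be an integer!")
  }
  x^n
}
```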
expon(3,4)
[1] 81
The formal and body arguments to function expon() can later be accessed via the R
functions formals() and body(), as following:
formals(expon)
$x
$n
body(expon)
{
if (x%%1 != 0) {
stop("x must be an integer!")
} (…)
Another point worth mentioning is that you can print the source code of a built-in R function when
you are not sure what it does. By looking at the code, you may get a better idea. For
instance, if you are curious about the details of the function cat(), you can glance over its
code by typing its name without parentheses:
cat
page(cat)
However, as many fundamental functions in R are written directly in C, they are not
viewable in this manner. There is a lot more to writing custom functions in R than what is
shown here; however, you won't need it unless you move to the advanced level.
7.3 Summary
R provides the loop constructs for(), while() and repeat, and the conditionals if() and ifelse().
Loops can often be replaced by faster vectorized functions such as which(), rowSums() or colSums().
Custom functions are created with function(); their arguments and body can be inspected with formals() and body().
8 Appendix: Code References
The following appendix provides a list and general descriptions of commands used in
each section of this document. Clicking on the names links to the CRAN reference
pages.
wilcox.test(y~group)
cor.test     tests for correlations among variables     cor.test(x, y)
chisq.test   runs a chi-squared test                    chisq.test(x, y)
prop.test    tests for equal proportions                prop.test(x, y)
Linear Models
lm           constructs a linear model                  lm(x~y)
anova        calculates analysis of variance            x <- lm(x~y); anova(x)
             on a model
Basic Graphs with R
barplot      creates and edits barplots (see link       barplot(x)
             for list of plotting arguments)
arrows       adds arrows to a plot                      arrows(midpoints, mean.tg - sem.tg,
                                                        midpoints, mean.tg + sem.tg, angle = 90,
                                                        length = 0.1, code = 3, lwd = 1.5)