
Genetic Mapping and Marker Assisted Selection
Basics, Practice and Benefits

N. Manikanda Boopathi
Plant Molecular Biology & Bioinformatics
Tamil Nadu Agricultural University
Coimbatore, TN, India

ISBN 978-81-322-0957-7
ISBN 978-81-322-0958-4 (eBook)
DOI 10.1007/978-81-322-0958-4
Springer New Delhi Heidelberg New York Dordrecht London
Library of Congress Control Number: 2012954276
© Springer India 2013
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or
part of the material is concerned, specifically the rights of translation, reprinting, reuse of
illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way,
and transmission or information storage and retrieval, electronic adaptation, computer software,
or by similar or dissimilar methodology now known or hereafter developed. Exempted from this
legal reservation are brief excerpts in connection with reviews or scholarly analysis or material
supplied specifically for the purpose of being entered and executed on a computer system, for
exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is
permitted only under the provisions of the Copyright Law of the Publisher's location, in its
current version, and permission for use must always be obtained from Springer. Permissions for
use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable
to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are
exempt from the relevant protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility
for any errors or omissions that may be made. The publisher makes no warranty, express or
implied, with respect to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Current trends in agricultural biotechnological tools clearly show that the
genes or regulatory elements controlling agronomically important traits
remain unknown and, possibly, will remain mysterious for some time. For the
moment, marker assisted selection (MAS) is considered to be an efficient
supplementary tool to conventional plant breeding since other techniques
such as genetic engineering in crop improvement have limitations in transferring
the large number of genes residing in quantitative trait loci (QTL). Plant
scientists will continue to use QTL maps and markers that tag and manipulate
the genes of interest for many years to come.
Despite the importance of the subject, I have found it difficult, ever since my
graduation, to find a book that explains the basics and procedures of genetic
mapping and MAS. On the other hand, I could always find a large collection of
advanced literature on every aspect of MAS in the latest journals. That is the
reason I started to write this small introductory book. I am very sure that what
I have tried to show in this book is just a single cup of water taken from the
genetic mapping and MAS pond. Further, I am fully aware that it is not possible
to list each and every aspect of MAS and its contributors, even if I were to work
on it for years. Because of the rapid developments in the genetic and statistical
methodologies used in MAS, anyone can easily find missing component(s) in any
index of MAS that claims to be complete, even one prepared by a subject
specialist. The simple idea behind this book is to introduce the basic concepts
and protocols for practising MAS in crop plants, with suitable examples. There
are different roads to reach the destination. I just stand at a junction with a
comprehensive map, trying to explain all the possible routes, their rewards and
restrictions. And of course, you can find your own way. Hence, readers are
requested to refer to the bibliography to get more information on the given
topics and to find an appropriate design of an MAS programme for their targeted
crop and trait.
I further request your feedback, suggestions and critical comments on this
work to improve the quality and usage of this book.
I sincerely apologise for not having cited all the authors who have contributed
a lot to this field. This is mainly due to space limitations and not to any other
intention. I also wish to thank and acknowledge all my teachers, guides,
colleagues and friends whom I have had the good fortune to associate with
during my research period.

I greatly appreciate and thank Springer for publishing this work.


I affectionately dedicate this book to my dearly loved son, Sri Ezhilalan
Boopathi, who has forgone all his quality time with me.
Coimbatore
20th November, 2012

N. Manikanda Boopathi
nmboopathi@tnau.ac.in
www.sites.google.com/sites/drnmboopathi

Contents

1 Germplasm Characterisation: Utilising the Underexploited Resources
Phenotyping for Morphological and Agronomic Characters
Case Study in Rice Germplasm Characterisation for Drought Resistance
Traits Useful for Characterisation
Allele Mining
Genetic Diversity and Clustering
Software
Principle Behind the Genetic Diversity Analysis
Principle of Measuring Goodness of Fit of a Classification
Genetic Diversity Analysis Using Molecular Markers
Parental Selection
Bibliography
Literature Cited
Further Readings

2 Mapping Population Development
Mapping Population and Its Importance in Genetic Mapping
Selfing and Crossing Techniques in Crop Plants
F2 Progenies
F2-Derived F3 (F2:3) Populations
F2 Intermating Populations or Immortalised F2 Populations
DH Lines
BC Progenies
RILs
NILs, Exotic Libraries and Advanced Backcross Populations
Four-Way Cross Populations
Multi-Cross Populations
Nested Association Mapping Populations
Natural Populations
Chromosome-Specific Genetic Stocks for Linkage Mapping
Bulk Segregant Analysis
Combining Markers and Populations
Characterisation of Mapping Populations
Choice of Mapping Populations
Challenges in Mapping Population Development and Solutions to These Challenges
Bibliography
Literature Cited
Further Readings

3 Genotyping of Mapping Population
Markers and Its Importance
Morphological Markers
Biochemical Markers or Isozymes
Principle
Electrophoresis
Chromatography
Gel Filtration
Immunochemistry
Catalysis
Genome Structure and Organisation
Chromosome Structure
Mitochondrial DNA
Chloroplast DNA
Molecular Markers
Restriction Fragment Length Polymorphism (RFLP)
PCR-Based Techniques
Arbitrarily Primed PCR-Based Markers
Random Amplified Polymorphic DNA (RAPD)
Arbitrarily Primed Polymerase Chain Reaction (AP-PCR) and DNA Amplification Fingerprinting (DAF)
Amplified Fragment Length Polymorphism (AFLP)
Sequence-Specific PCR-Based Markers
Microsatellite-Based Marker Technique
Inter-Simple Sequence Repeats (ISSR)
Single-Nucleotide Polymorphism (SNPs)
Single-Feature Polymorphism (SFP)
Sequence-Characterised Amplified Regions (SCAR)
Cleaved Amplified Polymorphic Sequences (CAPS)
Randomly Amplified Microsatellite Polymorphisms (RAMP)
Sequence-Related Amplified Polymorphism (SRAP)
Target Region Amplification Polymorphism (TRAP)
Single-Strand Conformation Polymorphism (SSCP)
Transposable Elements (TE)-Based Molecular Markers
Retrotransposon-Based Molecular Markers
Diversity Array Technology (DArT)
Intron-Targeted Intron-Exon Splice Conjunction (IT-ISJ) Marker
Restriction Site Associated DNA (RAD) Markers
RNA-Based Molecular Markers
cDNA-AFLP
RNA Fingerprinting by Arbitrarily Primed PCR (RAP-PCR)
cDNA-SSCP
Role of Genomics
Selection of Marker Technology
Research Problem
The Number of Loci and/or Alleles
Discrimination Level
Mode of Inheritance
Quality of DNA
Expertise Required
Costs
Speed
Reproducibility
PCR Versus Non-PCR Techniques
Marker Genotyping and Scoring
Analysing the Genotype Score: Chi-Square Test
χ² Test to Analyse the Segregation Ratio Using the Program ANTMAP
Bibliography
Literature Cited
Further Readings

4 Linkage Map Construction
Basics of Genetic/Linkage Mapping: Mendelian Ratios, Meiosis, Crossing Over and Partial Linkage
Mapping Functions
Mapping of Genetic Markers: Practical Considerations
Testing for Linkage: LOD Scores
Grouping, Ordering and Spacing
Sources of Error
Chromosomal Assignment
Allopolyploidy and Autopolyploidy
Bridging Linkage Maps to Develop Unified Linkage Maps
Bibliography
Literature Cited
Further Readings

5 Phenotyping
Phenotyping Versus QTL Mapping
Need for Precise Phenotyping
Phenotyping for Biotic Stress
Phenotyping for Abiotic Stress
Heritability of Phenotypes
Statistical Analysis of Phenotypic Data: Simple Statistics, Heritability Estimation and Correlation
Bibliography
Literature Cited
Further Readings

6 QTL Identification
QTL: A Prelude
Single-Marker Analysis (SMA)
Interval Mapping
Multiple QTL and Methods to Detect Multiple QTL
Composite Interval Mapping
Multiple Trait Mapping
Testing for Linked QTL Versus Pleiotropic QTL
Multiple Interval Mapping (MIM) or Multiple QTL Mapping
Statistical Significance
Permutation Testing
Bootstrapping
Permutation Versus Bootstrapping and Other Methods
QTL × QTL Interaction: Impact of Epistasis
QTL × Environment Interaction
Congruence of QTL: Across the Environments and Across the Genetic Backgrounds Is the Key in MAS
Meta-QTL Analysis
Concluding Remarks on QTL Methods
Alternatives in Classical QTL Mapping
Bulked Segregant Analysis and Selective Genotyping
Genomics-Assisted Breeding
Array Mapping
Association Mapping
Nested Association Mapping
EcoTILLING
Challenges in QTL Mapping
Confronts with Mapping Populations
Markers and Its Implications
Segregation Distortion
Phenotyping
Statistical Issues
Practical Utility
Bibliography
Literature Cited
Further Readings

7 Fine Mapping
Need for Fine Mapping or High-Resolution Mapping
Types of Molecular Markers Suitable for Fine Mapping
Physical Mapping and Its Role in Fine Mapping
Comparative Mapping
Genetical Genomics/eQTL Mapping
Map-Based Cloning
Validation of QTLs
Testing the Markers in Related Germplasm Accessions
Bibliography
Literature Cited
Further Readings

8 Marker-Assisted Selection
Advantages of MAS
Limitations in MAS
Prerequisites for an Efficient Marker-Assisted Selection Program
Procedure for a Generalised MAS Program for Selection from Breeding Lines/Populations
Marker-Assisted Backcross Breeding
Gene Pyramiding or Stacking
Accelerated Methods of Gene Pyramiding
Marker-Assisted Recurrent Selection (MARS)
Advanced Backcross (AB)-QTL Analysis
Mapping-As-You-Go (MAYG)
Application of Markers in Germplasm Storage, Evaluation and Use
Resources for MAS on the Web
Bibliography
Literature Cited
Further Readings

9 Success Stories in MAS
Tomato
Maize
Wheat
Rice
Barley
Soybean
Varieties Released Through MAS
Hybrids Developed Through MAS
MAS in Multinational Companies
Contrasting Stories
Conclusions and Future Prospects
Bibliography
Literature Cited
Further Readings

10 Curtain Raiser to Novel MAS Platforms
Current Techniques in Molecular, Biochemical and Physiological Studies and Its Integration into MAS
Molecular Techniques
Expression Profiling
cDNA Library Construction
Differential Display and Representational Difference Analysis
Subtractive Hybridisation
Microarray
Types of DNA Chips and Their Production
Hybridisation and Detection Methods
1. DNA Sequencing by Hybridisation
2. Single Nucleotide Polymorphisms and Point Mutations
3. Functional Genomics
4. Reverse Genetics
5. Diagnostics and Genetic Mapping
6. Genomic Mismatch Scanning
7. DNA Chips and Agriculture
8. Proteomics
9. Nucleic Acid Sequencing
Second-Generation DNA Sequencing
454 Pyrosequencing
Illumina Genome Analyser
AB SOLiD
Microchip-Based Electrophoretic Sequencing
Sequencing by Hybridisation
Sequencing in Real Time
Targeted Capture of Genomic Subsets
Handling and Storage of Sequence Information
Predicting Function from Sequence
Homology Searches
Other Sequence Comparisons Strategies
Serial Analysis of Gene Expression (SAGE)
cDNA-AFLP
RFLP-Coupled Domain-Directed Differential Display (RC4D)
Gene Tagging by Insertional Mutagenesis
T-DNA Tag
Transposon Tags
Post-transcriptional Gene Silencing
MicroRNAs
Biochemical Techniques
Plant Proteomics
Why Proteomics?
Types of Proteomics
Protein Expression Proteomics
Structural Proteomics
Functional Proteomics
Protein Analysis
One- and Two-Dimensional Gel Electrophoresis
Alternatives to Electrophoresis in Proteomics
Acquisition of Protein Structure Information
Edman Sequencing
Mass Spectrometry
Types of Mass Spectrometers
Peptide Fragmentation
De Novo Peptide Sequence Information
Uninterpreted MS/MS Data Searching
Proteomics Approach to Protein Phosphorylation
Phosphoprotein Enrichment
Phosphorylation Site Determination by Edman Degradation
Phosphorylation Site Determination by Mass Spectrometry
Metabolite Profiling Technologies
Physiological Techniques
Near-Infrared (NIR) Spectroscopy
Canopy Spectral Reflectance (SR) and Infrared Thermography (IRT)
Estimation of Compatible Solutes
Genomics-Assisted Breeding
Functional Markers
Comparative Genomics
Identification of Novel Molecular Networks and Construction of New Metabolic Pathway
Bioinformatics for MAS
Bibliography
Literature Cited
Further Readings

11 Recent Advances in MAS in Major Crops
Rice
Rice and Drought
Mechanisms of Drought Resistance in Rice
Phenology
Root System
Osmotic Adjustment
Dehydration Tolerance
Shoot-Related Drought-Resistance Traits
Genetic Linkage Map in Rice
QTL Mapping of Drought-Resistance Traits in Rice
Rice Subspecies and Habitat
Marker-Aided Selection and Near-Isogenic Lines for Drought-Resistance Improvement
Target Population of Environment and Molecular Breeding

Concluding Remarks on MAS in Rice for Water-Limited Environments
Cotton
Status of Cotton Molecular Marker Technology
Molecular Markers and Polymorphism in Cotton
Simple Sequence Repeats (SSRs) in Cotton
Cotton Linkage Maps
QTL Mapping for Yield and Fibre Quality Traits in Cotton
Specific Challenges in Cotton MAS
Confronts with Mapping Population
QTL × Environment Analysis
Incongruence Among QTL Studies
Complexities in Integration of Functional Genomics with QTL
Alternatives and Future Perspectives
Meta-analysis of QTL: Synergy Through Networks
Map-Based Cloning
Cotton Genome Sequencing
Advances in Functional Genomics
System Quantitative Genetics: Bridging Subdisciplines
Association Mapping and Alternatives
Improved Databases
Concluding Remarks for MAS in Cotton
Mungbean
Genetic Diversity and Linkage Mapping in Mungbean
QTL Mapping in Mungbean
Legume Comparative Genomics and Its Importance in Mungbean MAS
Concluding Remarks for MAS in Mungbean
Tomato
Conventional Breeding and Tomato Improvement
Biotechnology and Tomato Breeding
MAS for Bacterial Spot Resistance
MAS for Tomato Yellow Leaf Curl Virus Resistance
MAS for Other Economic Traits
MAS for Genetic Improvement of Fruit Quality Traits
Fine Mapping and Characterisation of Fruit-Size QTL
Concluding Remarks for MAS in Tomato
Hot Pepper
Progress in MAS in Hot Pepper
Concluding Remarks on MAS in Hot Pepper
Bibliography


Literature Cited
Further Reading

12 Future Perspectives in MAS
MAS in Orphan Crops
MAS in Developing Countries
Community Efforts in Developing Countries and Their Implications in MAS
Field and Laboratory Infrastructure Improvement
Lessons Learnt and Concluding Remarks
Bibliography
Literature Cited
Further Readings


About the Author

Germplasm Characterisation: Utilising the Underexploited Resources

In any given geographical region, farmers tend to cultivate only a small set of crop
varieties over long periods of time. Modern plant breeding programs have also resulted
in a severe genetic bottleneck. As a consequence, reduced genetic diversity is
widespread among crop plants, and it is considered detrimental to the future of
farming. This is because continuous use of the same cultivars usually leads to at least
(1) extensive persistence of existing pests and diseases (as well as the emergence of
new ones) on the given crop species and (2) loss of landraces and wild species of the
given crop (otherwise referred to as genetic erosion). Because of ever-increasing
population growth and the continuous shrinking of farmland, farmers are forced to
cultivate crops under a wide range of latitudes and longitudes. This requires crop
plants that can tolerate variations in light, temperature, water and nutrients, as well
as the particular pests and diseases that challenge crop production in these
environments. Conventional breeding approaches, such as phenotypic selection of
desirable plants among the breeding materials, have contributed considerably to the
genetic improvement of crops. However, only a few genetically improved lines are
available to meet such challenges. The main limitation that prevents further progress
through conventional breeding is the lack of adequate genetic, biochemical and
molecular knowledge of the expression of traits that are beneficial to crop cultivation
and production. Most agronomically and economically important traits are quantitative
in nature and have complex inheritance. Thanks to
the developments in nucleic acid characterisation and manipulation, it is now possible
to analyse and manipulate such quantitative traits genetically using quantitative trait
loci (QTL) mapping and marker-assisted selection (MAS). Thus, advances in molecular
marker technologies have opened the door to new techniques for constructing and
screening breeding populations, increasing the efficiency of selection and accelerating
the rate of genetic gain. Through genetic and QTL mapping, a marker can be identified
that is either located within the gene of interest or linked to a gene determining a
trait of interest. Consequently, MAS can be executed as selection for a trait based on
genotype, using associated markers, rather than on the phenotype of the trait. This
book is designed to describe the basics of genetic and QTL mapping using molecular
markers and the practice of MAS in crop plants with step-by-step procedures. In
general, an MAS scheme for the genetic improvement of a crop for a given trait involves
(1) characterisation of germplasm for the trait of interest, (2) selection of extremely
diverse parents, (3) development of a mapping population, (4) selection of appropriate
combinations of molecular markers and genotyping of the parents and mapping population,
(5) construction of a genetic or linkage map, (6) phenotyping of the mapping population
for the selected trait, (7) QTL analysis by combining the data obtained from steps 5
and 6, (8) fine mapping and validation of QTLs and (9) executing MAS for the target
trait. This first chapter therefore describes the leading step in MAS: characterisation
of germplasm.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_1, Springer India 2013


Traditional collections, exotic accessions and the wild species of crop plants, which
are maintained in germplasm banks, possess excellent tolerance to the biotic and
abiotic stresses that are prevalent in the existing and new crop production
environments described above. Such germplasm collections are potential resources for
future crop improvement programs designed to cope with these stresses. Hence, it is
important to characterise and understand the genetic variation that exists in germplasm
so that it can be used effectively in crop breeding programs using MAS.
Characterisation of germplasm facilitates identification and selection of beneficial
genes or alleles in the related wild species and landraces via MAS. It involves
screening each entry for morphological and agronomic characters using a standard
descriptor list. As many characteristics as possible should be recorded using coded
qualitative scores. Further, gathering passport data (such as country, site and
location of collection) permits selection of germplasm on a geographical basis. In
addition, a range of molecular markers (e.g. isozymes, RAPD, AFLP and microsatellites)
are used to classify germplasm, and these data are useful for more detailed genetic
diversity analysis. Thus, screening thousands of accessions for pest and disease
resistance and for tolerance to different abiotic stresses, together with systematic
studies of the wild species and molecular studies of genetic diversity, provides data
on species taxonomy and genetic relationships. Based on this information, a core set of
germplasm entries can be chosen from which parents are selected. Knowledge of the
genetic diversity and relationships among the elite breeding materials constituting the
germplasm (see below) can have a significant impact on the selection of parents in a
crop improvement program. Selection of parents is also imperative in QTL mapping (see
below).
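As a rough illustration of how molecular marker scores can feed a genetic diversity
analysis, the sketch below computes pairwise Jaccard similarity between accessions
scored for the presence (1) or absence (0) of dominant marker bands, as obtained with
RAPD or AFLP. The accession names and band scores are invented for illustration, and
Jaccard similarity is only one of several coefficients that could be used.

```python
from itertools import combinations


def jaccard_similarity(a, b):
    """Jaccard similarity of two equal-length 0/1 band profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of markers")
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Two profiles with no bands at all are treated as having similarity 0.
    return shared / either if either else 0.0


def pairwise_similarity(profiles):
    """Return {(acc1, acc2): similarity} for every pair of accessions."""
    return {
        (p, q): jaccard_similarity(profiles[p], profiles[q])
        for p, q in combinations(sorted(profiles), 2)
    }


# Hypothetical band scores (1 = band present, 0 = band absent).
band_scores = {
    "Acc1": [1, 1, 0, 1, 0, 0],
    "Acc2": [1, 1, 0, 1, 1, 0],
    "Acc3": [0, 0, 1, 0, 1, 1],
}

similarity = pairwise_similarity(band_scores)
# The least similar pair is one candidate pair of diverse parents.
most_distant = min(similarity, key=similarity.get)
print(similarity)
print("Most distant pair:", most_distant)
```

In practice, such a similarity matrix would normally be combined with phenotypic and
passport data before parents are chosen.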

Phenotyping for Morphological and Agronomic Characters
The most salient hurdle to the effective utilisation of germplasm in the development of
improved crop cultivars is the difficulty of phenotyping the germplasm accurately.
Combining precise phenotyping of germplasm with dissection of the genetic and
functional basis of yield and other agronomically and/or economically important traits
under various biotic and abiotic stresses would give unprecedented ways to characterise
crop germplasm. Thus, precise phenotyping is the first key step, and its successful
completion greatly improves germplasm characterisation. To this end, it is imperative
to know the factors that affect the quality of phenotypic data and to define the
nomenclature and mechanisms of crop productivity under different climatic and stress
conditions. All these limiting factors should be addressed adequately for the target
crop and trait. There is no general procedure that fits all crops and all target
traits; the procedure varies from crop to crop (and even within a species) and from
trait to trait. As an example, a detailed phenotyping procedure in rice for
characterising germplasm for one of the most important abiotic stresses, drought, is
described below. Many of the concepts presented here are equally useful for
drought-resistance screening in other crops.

Case Study in Rice Germplasm Characterisation for Drought Resistance
Realisation of the Essential Requirements
It has long been realised that the release of rice cultivars with enhanced resistance
to drought and with high yield stability is essential to ensure food security in the
twenty-first century, given the frequent occurrence and severity of water stress around
the world. Hence, we need to genetically tailor new cultivars that can withstand
drought and closely related environmental constraints such as high temperature,
salinity and nutrient deficiency. In the past, traditional breeding strategies have
produced several promising achievements. However, progress has been slow on several
occasions, mainly because of a lack of knowledge of drought-resistance mechanisms and
of appropriate screening methods and strategies, the poor heritability of traits under
water stress in the field, and the lack of comprehensive interpretation of results from
molecular, biochemical, physiological, genetic and agronomic perspectives. Hence,
before proceeding further, it is important to set out the long-term and short-term
objectives.
As stated earlier, we should first describe the nomenclature and the mechanism of
expression of the target trait. In agriculture, the term drought generally refers to a
condition in which the amount of water available through rainfall and/or irrigation is
insufficient to meet the transpiration needs of the crop. Plants adopt different
mechanisms to withstand and mitigate the negative effects of such water deficit. In
general, there are traits that (1) help plants to survive under drought stress and (2)
mitigate yield losses in crops exposed to water stress. Therefore, it is essential to
judge the overall phenotypic value of a given germplasm accession in terms of yield
under water stress in the given environment. In other words, any drought-related study
should address, directly or indirectly, the impact of its findings on yield and its
component traits. Several comprehensive reviews, dedicated volumes and book chapters
have addressed the mechanisms underlying drought resistance and the breeding strategies
that can improve yield under water stress (please see further readings). Provided below
is a simple synopsis of this knowledge and its application in characterising rice
germplasm for drought resistance in a laboratory with minimum facilities.
To begin well, the critical first step is to define the environment at which the
breeding program is targeted (sometimes referred to as the target population of
environments). Each crop is grown in a complex set of socio-physical and biological
environments, and no two environments are identical, even on the same farm. The
identification and characterisation of a target environment is facilitated by
historical weather records, past cropping patterns, etc. Simulation models can also be
used to describe the target environment in terms of the frequency of occurrence of
water stress and the soil moisture profile. This helps to shortlist the type (e.g.
early/mid/terminal water stress), severity (e.g. mild/moderate/severe) and duration
(e.g. short/long duration) of water stress in the given environment. It also helps to
describe other associated stresses such as high temperature, dry and high-speed winds
and nutrient deficiency. Another key point in characterising the germplasm within the
given environment is the observation of genotype-by-environment interactions in the
expression of yield traits. This observation may include additional environmental
factors such as rainfall pattern; maximum and minimum temperature; relative humidity;
soil physical (e.g. texture), chemical (e.g. presence of heavy metals or other toxic
elements) and biological factors (e.g. beneficial and harmful microbial load); diseases
(e.g. foliar diseases); pests and beneficial insects (e.g. pollinators); and parasites.
Thus, it is nearly impossible to find a single environment that represents the target
population of environments. An ideal strategy is to phenotype for drought tolerance and
yield stability across a broad range of sites within the given environment, with at
least three replications in a Latin square design, which helps to account for field
heterogeneity. During the past decades, it has been shown repeatedly in several crops
that multi-environment trials are instrumental in increasing yield potential under
drought. Thus, before beginning genetic mapping and MAS, it is essential to define the
set of environments, fields and seasons in which a given germplasm entry is expected to
do well.
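To make the Latin square arrangement mentioned above concrete, the following minimal
sketch lays out a small set of entries so that each one occurs exactly once in every
row and every column of the field, which is how the design controls heterogeneity in
two directions. The entry names, the function name and the seed are hypothetical. In a
Latin square the number of rows, columns and replications equals the number of entries,
so the sketch is practical only for small sets of entries.

```python
import random


def latin_square_layout(entries, seed=None):
    """Return a randomised Latin square (list of rows) for the given entries."""
    rng = random.Random(seed)
    n = len(entries)
    # Start from a cyclic square: cell (r, c) holds index (r + c) mod n.
    square = [[(r + c) % n for c in range(n)] for r in range(n)]
    # Permuting whole rows and whole columns keeps the Latin property.
    rng.shuffle(square)
    col_order = list(range(n))
    rng.shuffle(col_order)
    square = [[row[c] for c in col_order] for row in square]
    # Assign the entries to the indices at random.
    labels = list(entries)
    rng.shuffle(labels)
    return [[labels[i] for i in row] for row in square]


# Hypothetical germplasm entries.
entries = ["G1", "G2", "G3", "G4"]
layout = latin_square_layout(entries, seed=1)
for row in layout:
    print(" ".join(row))
```

Each printed line corresponds to one row of plots in the field; a different seed gives
a different randomisation of the same design.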

Traits Useful for Characterisation


Considering that farmers ultimately harvest grain in rice, it is vital to interpret
cause-effect relationships (usually through correlation studies) between
morpho-physio-agronomical traits and grain yield (or other economic traits in the case
of other crops) under drought conditions. It should be noted that the sign and
magnitude of these relationships are not universal and can change widely according to
the frequency, timing and intensity of water stress periods. Thus, traits with
potential for characterising rice germplasm to improve yield under water-limited
conditions should be genetically (i.e. causally) correlated with yield and should
preferably have higher heritability than yield (see chapter 5 for heritability
calculation). The presence of sufficient genetic variability and the absence of yield
penalties under favourable conditions are additional desirable features of these
traits. Ideally, measurement of such trait(s) must be non-destructive (i.e. require
only a small number of plants or plant samples), rapid (e.g. without lengthy procedures
to calibrate sensors to individual plants), accurate and inexpensive and, finally,
should reflect the long-term ecophysiological performance of the crop. Such traits
should be cheaper and easier to measure than grain yield under stress. The reader will
now realise the difficulty of identifying such traits, since no single trait satisfies
all of the above requirements. Very often, experiments are lost to pest or erratic
weather damage before final yield is recorded; in such situations, these traits are
useful. Based on the peer-reviewed literature, careful testing under different
experimental procedures and personal experience, the following traits are listed as
potential candidates for characterising rice germplasm. As a caution, it should be
noted that this list is not final and that these traits are not suitable for all
water-limited environments. Readers are requested to finalise the traits based on the
target environment, breeding objective, etc. However, the concept and procedure of
characterising germplasm described here are the same for all plants. Sampling bias can
be avoided by ensuring that random, representative plants are selected for trait
measurement in each plot. It is again emphasised that the secondary traits (those other
than grain yield) should always be associated with yield (i.e. show good statistical
correlations), which is essential for drawing any final conclusion on germplasm
characterisation.
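The association between a secondary trait and yield can be checked with a simple
correlation coefficient. The sketch below computes Pearson's r between a leaf rolling
score and grain yield under stress; the scores and yields are invented values, and the
function name is hypothetical. It measures only the phenotypic correlation in one
trial, so it does not by itself establish the genetic (causal) correlation that the
text asks for.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


# Hypothetical plot means for five accessions under water stress.
leaf_rolling_score = [1, 3, 5, 7, 9]           # higher = more rolling
grain_yield_t_ha = [4.0, 3.5, 3.1, 2.2, 1.5]   # tonnes per hectare

r = pearson_r(leaf_rolling_score, grain_yield_t_ha)
print(f"r = {r:.3f}")
```

A strongly negative r in this invented data set would suggest that accessions with
more leaf rolling yielded less under stress; as noted above, the sign and size of such
a relationship can differ between environments.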

Early Vigour
Several physiological and biochemical studies have shown that selecting germplasm
accessions that show early and vigorous establishment leaves more stored soil water
available for later developmental stages, when soil moisture becomes progressively
exhausted and increasingly limiting for yield. On the other hand, excessively vigorous
leaf development could cause early depletion of soil moisture. Thus, an optimal degree
of vigour should be selected; besides genetic potential, it also depends on the
characteristics of the given environment. Keeping all this in mind, each accession in
the rice germplasm should be screened by counting the number of days required to
germinate and to develop a particular leaf area under field conditions.

Flowering Time
Another critical factor that optimises adaptation (and produces better yield) under low
water availability is flowering time. It has been established in almost all crops that
there is a positive association between yield and flowering time across different
levels of water availability. Days to 50% flowering can be phenotyped quite easily and
effectively under both irrigated control and water-stressed experimental conditions,
and it can be used as a valuable trait in drought tolerance breeding programs.
Flowering delay (= days to flowering under stress conditions − days to flowering under
irrigated control) could serve as a potential additional trait to days to 50%
flowering.
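Flowering delay is the number of days to flowering under stress minus the number of
days to flowering under irrigated control. A minimal sketch of the calculation is
given below; the accession names and day counts are invented for illustration.

```python
def flowering_delay(days_stress, days_control):
    """Days to flowering under stress minus days to flowering under control."""
    return days_stress - days_control


# Hypothetical days to 50% flowering: (under stress, under irrigated control).
days_to_flowering = {
    "Acc1": (92, 85),
    "Acc2": (80, 78),
    "Acc3": (101, 88),
}

delays = {
    accession: flowering_delay(stress, control)
    for accession, (stress, control) in days_to_flowering.items()
}
# Rank accessions from the smallest to the largest delay.
ranked = sorted(delays, key=delays.get)
print(delays)
print(ranked)
```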
Chlorophyll Concentration, Leaf Rolling and Leaf Drying
The traits that have been phenotyped to estimate photosynthetic potential indirectly (a
critical element that decides final yield) are chlorophyll concentration, leaf rolling
and leaf drying, all of which are interconnected. Total chlorophyll, its individual
components and the chlorophyll stability index can be measured under both normal and
water-stressed conditions. Similarly, leaf rolling and leaf drying should be scored by
following the established procedures, with the observations made around midday.
Grain Yield
The main objective of a drought tolerance breeding program is to develop a variety that
produces higher yield than currently available varieties in the given environment under
the types of drought stress that occur most frequently. Further, if water stress does
not occur in some years, that variety should also produce high yields in the absence of
stress. Thus, from the farmer's viewpoint, a drought-tolerant variety is one that
produces higher yield than other cultivars under drought stress and sustainable yield
under normal conditions. Hence, all the protocols and strategies that focus on breeding
for drought tolerance should be designed in this light. To increase the efficiency of
direct selection for yield, it is essential to ensure that (1) the testing environment
truly represents the target environments; (2) large numbers of germplasm entries
(usually more than 500) are screened, to increase the selection intensity; (3) drought
stress is managed uniformly across trials, sites and seasons, with reasonable levels of
replication (it has been noticed that increasing the number of locations is more
effective than increasing the number of replications within a location); and (4) the
best experimental design is used to address field variation.
The traits mentioned above are far from exhaustive. Therefore, these and other traits
should be used as selection criteria for yield cautiously and only after the target
environment has been defined. Irrespective of the procedures used and the experimental
designs employed, each phenotyping score has its own specific background, and results
should be interpreted accordingly when characterising the germplasm. A good record of
meteorological parameters (rainfall, temperatures, wind, evapotranspiration, light
intensity and relative humidity) allows meaningful interpretation of the results.
Collection of meaningful phenotypic data in field experiments depends greatly on the
experimental design, the heterogeneity of experimental conditions between and within
experimental units, the size of the experimental unit, the number of replicates, the
number of plants sampled within each experimental unit and genotype × environment ×
management interactions. Further variation due to phenology (the duration of each
developmental phase) and other environmental stresses should also be considered while
evaluating the germplasm. Poor attention to these factors may lead to erroneous
conclusions, particularly in interpreting cause-and-effect relationships between yield
and drought tolerance traits.

Allele Mining
Allele mining refers to the identification of naturally occurring allelic variation at
agronomically important genetic loci (i.e. genes). It can be performed using a variety
of approaches, including mutant screening, QTL and AB-QTL analysis, association mapping
and genome-wide surveys for the signature of artificial selection (each method is
described in detail in subsequent chapters). Though several methods have been
described, efficient extraction and exploitation of the adaptive variation and valuable
traits present in the germplasm is yet to be achieved. For example, several traditional
and improved cultivars from drought-prone areas have some tolerance to
reproductive-stage drought stress, but they have rarely been used in molecular breeding
programs. A more extensive survey of this germplasm may lead to the identification of
new entries carrying superior alleles for agronomic and economic crop traits. Such
unique alleles can be integrated into molecular crop breeding programs that aim to
combat pests and diseases; to promote yield, quality or nutritional properties; or to
improve abiotic stress tolerance.
Thus, successful allele mining depends heavily on the use of diverse germplasm
collections, especially those rich in wild species. This is because the majority of
allelic variation at the gene(s) of interest is assumed to occur in the wild relatives
of a crop (i.e. not in the cultivated varieties), owing to the unavoidable loss of
variation during domestication. Several efforts have been made to identify useful new
alleles present in the wild gene pool of almost all crop plants. Despite those efforts,
entire germplasm collections have unfortunately not yet been characterised efficiently
for novel phenotypes, because of several challenges including the lack of resources for
evaluating huge collections. As an alternative, core collections of germplasm have been
proposed as materials for allele mining. A core collection is a representative subset
of the complete germplasm collection that has been optimised to contain maximal
diversity in a minimal number of accessions. Thus, while maintaining maximum allelic
diversity at loci controlling traits of interest, core collections reduce the number of
accessions to be handled and so help in integrating novel useful alleles into molecular
or conventional breeding programs. This will lead to the development of broad and
diversified elite breeding lines with superior yield and enhanced adaptation to diverse
environments.
The best core collections can be constituted by assembling a wide range of evidence on
diversity and then sampling the accessions that are representative of this diversity.
One simple generic factor is geographic origin. Traditional accessions from different
parts of the world have usually had an independent history of domestication for
thousands of years and are therefore likely to show differences across the genome. A
core collection constructed in this way can reveal at least the majority of new alleles
in a relatively small number of accessions. On the other hand, it should be remembered
that even a carefully constructed core collection will not reveal the complete list of
alleles in all possible combinations. Hence, screening the whole germplasm remains
essential. When cheaper and faster technologies for allele mining are developed, this
effort will no longer be a titanic task.
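The idea of capturing maximal allelic diversity in a minimal number of accessions can
be sketched with a simple greedy heuristic: repeatedly add the accession that
contributes the most alleles not yet represented. This is only one possible approach,
assumed here for illustration; it is not guaranteed to find the smallest possible
subset, and the accession names, loci and alleles below are invented.

```python
def greedy_core_collection(alleles_by_accession, max_size):
    """Pick up to max_size accessions, each adding the most new alleles.

    alleles_by_accession maps an accession name to a set of
    (locus, allele) pairs observed in that accession.
    """
    remaining = dict(alleles_by_accession)
    covered = set()
    core = []
    while remaining and len(core) < max_size:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=lambda acc: len(remaining[acc] - covered))
        gain = remaining[best] - covered
        if not gain:
            break  # no remaining accession adds a new allele
        core.append(best)
        covered |= gain
        del remaining[best]
    return core, covered


# Hypothetical genotypes at three loci (L1, L2, L3).
genotypes = {
    "A": {("L1", "a"), ("L2", "a"), ("L3", "a")},
    "B": {("L1", "a"), ("L2", "a"), ("L3", "a")},  # identical to A
    "C": {("L1", "b"), ("L2", "a"), ("L3", "b")},
    "D": {("L1", "b"), ("L2", "b"), ("L3", "c")},
    "E": {("L1", "a"), ("L2", "b"), ("L3", "a")},
}

core, covered = greedy_core_collection(genotypes, max_size=3)
print("Core collection:", core)
print("Alleles captured:", len(covered))
```

In this invented data set, accession B duplicates A and therefore adds nothing, which
mirrors the argument for sampling distinct accessions rather than working through a
collection in order.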

To this end, large-scale genome sequencing projects and functional genomics efforts on
several major food crops provide a directory of all the genes in the given crop
together with their functions. Though this information has been generated using a
reference cultivar or accession, it can also be extended to other varieties and species
for allele mining. This is possible because of genome synteny and the conservation of
gene sequences among species. Several approaches have been designed to isolate novel
alleles from related species and genera using this sequence information, which would
give direct access to key alleles conferring resistance to biotic stresses, tolerance
to abiotic stresses, greater nutrient use efficiency, enhanced yield and improved
quality and nutrition. One such technique, which employs a simple polymerase chain
reaction (PCR; refer to Box 3.1 in chapter 3) strategy to isolate useful alleles from
rice germplasm, is given in Box 1.1 as an example. It is also worth mentioning here the
role of EcoTILLING in allele mining. A variant of targeting induced local lesions in
genomes (TILLING), known as EcoTILLING, was developed to identify multiple types of
polymorphisms in germplasm collections or breeding materials (Comai et al. 2004).
EcoTILLING allows characterisation of natural alleles at a specific locus across
several germplasm entries in a rapid and affordable way (see chapter x for more
details).

Box 1.1 Rapid and Inexpensive Strategy for Allele Mining in Rice

There are >100,000 germplasm accessions/entries deposited at the International Rice
Gene Bank, IRRI, the Philippines. Each genotype has an estimated ~50,000 genes. Every
gene has an unknown number of alleles, and each allele may change the way the rice
adapts, grows, looks or tastes. Hence, understanding the function of each allele is of
utmost importance for future rice breeding. The publicly available rice genome sequence
database and the physical map location of each rice gene (refer, for example, to the
International Rice Genome Sequencing Project (IRGSP) home page at
http://rgp.dna.affrc.go.jp/IRGSP/download.html or Gramene at
http://www.gramene.org/resources/) form the base for allele mining. The first step in
allele mining is deciding which part of the genome we should explore. In other words,
allele mining can be conducted on specific genes that are involved in the particular
mechanism of phenotypic trait expression. Allelic differences (also called allelic
polymorphism) usually result from differences in intron and exon sequences or in the
regulatory regions of the given gene. For example, the genes involved in abiotic stress
tolerance (such as genes coding for heat-shock proteins, transcription factors and late
embryogenesis abundant proteins) can be fished out from the genome sequence, and
primers that specifically flank the conserved genic regions can be designed. Primer3 is
the most frequently used, freely available online software (http://frodo.wi.mit.edu/)
for primer design. We paste the target sequence in FASTA format into the box provided,
and by clicking the Pick Primers button, we obtain appropriate primers that flank the
target sequence. Since the selected genes are members of multigene families, the
members may share conserved genic sequences. In general, members of a multigene family
are either dispersed around the genome or remain as tandem repeats within a single
genetic locus. Thus, these primers can be used in PCR-based allele mining, which
provides an opportunity to test the evolutionary range across cultivated rice and its
relatives. To increase the efficiency of identifying polymorphic alleles, it is better
to design primers in the 5′ or 3′ untranslated regions of the selected genes, since
these sequences have been shown to vary more among members of a multigene family than
coding sequences do. Thus, it is important to strike a balance between targeting the
conserved genic sequence and retaining the genetic variation. Once the candidate
gene(s) have been chosen, new alleles of the selected candidate gene(s) should be
sought in the germplasm collection. The search should not start with the first
accession and work through the whole collection in order. Such an effort would be
inefficient, since the second accession might be similar to the first at the given
loci, in which case analysing it would not yield any additional information. Instead,
we need to employ a subset of highly distinctive accessions, namely, a core collection
(see the text for more information on core collections).
The PCR product amplified using primers designed on the above principle represents either the entire allele or a functional
component of the allele (depending on the
primer design strategy employed).
If it is only a component of the gene, the full-length
gene should be amplified with the same strategy
explained above. The identified polymorphic
allele needs to be sequenced, and at the end of
this experiment, we can identify, isolate and
characterise the novel alleles of genes that are
candidates for the target trait (in this case,
abiotic stress tolerance). Since we have
field-based phenotyping data for the given rice
germplasm, we can group those accessions
that have similar alleles and tolerance
levels. The strategy that associates alleles or
genomic regions with the given phenotype using
linkage disequilibrium or association mapping
is described separately in detail (see chapter 6).
Briefly, association mapping assumes that
an allele responsible for the expression of a
phenotype, along with the markers that flank
the allelic locus, will be inherited as a block.
Hence, use of such flanking markers or the allelic
sequence itself as a marker will predict the
performance of a progeny that expresses the
favourable phenotype. We can also proceed
further in characterising the key biochemical
and physiological mechanisms of tolerance
using functional genomics tools. Thus, upon
complete characterisation of these alleles, a
molecular backcross breeding strategy can be
employed to transfer the useful allele into an elite
variety. Developing such new combinations of useful alleles from different genes in
different accessions will lead to the breeding of a
novel variety that meets farmers' and consumers' needs. However, this technique has
some drawbacks: (1) lack of specificity during
primer annealing may lead to amplification of
non-specific PCR products, (2) PCR is usually
not successful for distantly related
genera due to poor conservation of primer
sequences and (3) when the length of the gene
sequence is beyond the limit of PCR, it would
be difficult to proceed further with complete
allelic characterisation using this strategy;
alternatively, PCR walking would be useful in
mining such alleles.

Genetic Diversity and Clustering

The study of the genetic diversity that exists in the germplasm
(i.e. investigation of genetic variation among
individuals or groups of individuals) is usually a
collective process. There are several methods and
strategies available to study the germplasm in
terms of genetic diversity, which is essential to
reveal the genetic relationships among the germplasm entries. Precise estimation of genetic
relationships depends on sampling strategies, use
of several data sets, selection of genetic distance
estimators, clustering procedures or other
multivariate methods, etc. Thus, careful combination of these features and use of appropriate
statistical programs and strategies are the key
to these data analyses (refer to Mohammadi and
Prasanna 2003 for further details). In general,
germplasm data comprise numerical measurements and combinations of different types of
variables. Pedigree data, passport data, morphological data, biochemical data, storage protein
data and, more recently, DNA-based marker data
are being used to reliably estimate the genetic
relationships in crop plants (for details on markers
and their application, see chapter 3). The selection
of data sets is decided by the objective of the
experiment, the level of resolution required, the
availability of resources and infrastructure facilities, and operational, cost and time constraints. Each data set provides a specific type of
information. For example, when we use
molecular data, the genetic distance or similarity
among individuals of the given germplasm is usually calculated as a quantitative
measure that differentiates two individuals at the
sequence or allelic frequency level. A wide range
of genetic distance measurement methods is
available, and the choice of method is largely
determined by the software tool we
employ for the analysis. Among the genetic
distance measurement methods, the modified Rogers'
genetic distance (GD_MR) is the most frequently
used measure. There are several constraints on
employing the data for the analysis of genetic
distance. One frequently occurring problem
arises with molecular marker data. When certain
genotypes do not show any amplification for
some marker alleles, it is often difficult to decide
whether such lack of amplification is due to null
alleles or to failure of the molecular experiment. In such
cases (i.e. when we are not sure about the null
status of a genotype at a specific marker locus),
the score should be treated as missing data during
genetic distance measurements; otherwise it will
lead to erroneous inference. It should also be
noted that the use of dominant and co-dominant types
of marker can influence the genetic distance
measurements due to unknown statistical distributions. In order to overcome this limitation,
several alternatives, including the bootstrapping
method, have been implemented in certain statistical
software. When a scientist wishes to use more than
one genetic distance measure to analyse the data
set, it is essential to understand the correspondence between the matrices derived from those
measures. To test this correspondence reliably, the
well-known Mantel test can be applied,
and it has been widely used in crop plants.
Resampling techniques such as the bootstrap
and the jackknife are also used predominantly in
recent publications, particularly in relation to the
application of marker data in genetic diversity
analysis. In particular, resampling techniques have provided
useful measures for finding the smallest set of
markers that can provide an accurate assessment
of genetic relationships among the germplasm
entries. The latest versions of statistical
programs used in genetic diversity analysis
(see below) have these features. Interpreting the
results of resampling is also simple. For example, a simple rule of thumb is that internal tree
branches with >70% bootstrap support are likely to
be correct at the 95% probability level.
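As an illustration of the genetic distance and missing-data points made above, the following is a minimal sketch of the modified Rogers' distance in its commonly cited form, GD_MR = sqrt[(1/2m) Σ_i Σ_j (p_ij − q_ij)²], where p_ij and q_ij are the frequencies of allele j at locus i in the two entries and m is the number of loci. The function name, the data layout (one dictionary of allele frequencies per locus) and the use of None to flag an unscored locus are assumptions made for this example only; they are not taken from any particular software package.

```python
from math import sqrt

def modified_rogers_distance(freqs_a, freqs_b):
    """Modified Rogers' distance between two entries.

    Each argument is a list with one item per locus: a dict mapping
    allele name -> frequency, or None when the locus could not be
    scored (treated as missing data, as recommended in the text).
    """
    total = 0.0
    loci_used = 0
    for locus_a, locus_b in zip(freqs_a, freqs_b):
        if locus_a is None or locus_b is None:
            continue  # skip loci that are missing in either entry
        alleles = set(locus_a) | set(locus_b)
        total += sum(
            (locus_a.get(al, 0.0) - locus_b.get(al, 0.0)) ** 2
            for al in alleles
        )
        loci_used += 1
    if loci_used == 0:
        raise ValueError("no locus was scored in both entries")
    return sqrt(total / (2 * loci_used))

# Hypothetical entries scored at three loci; locus 3 failed in entry 1.
entry1 = [{"A1": 1.0}, {"B1": 0.5, "B2": 0.5}, None]
entry2 = [{"A2": 1.0}, {"B1": 0.5, "B2": 0.5}, {"C1": 1.0}]
print(round(modified_rogers_distance(entry1, entry2), 4))
```

Excluding the unscored locus, rather than coding it as an absent allele, keeps a failed reaction from being counted as a genetic difference. Other treatments of missing data are possible and may be preferred depending on the software used.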
When the sample size of the germplasm increases, it is
important to classify and order the genetic variability
among germplasm entries by using established multivariate statistical algorithms such as cluster analysis,
principal component analysis, principal coordinate
analysis and multidimensional scaling.
Multivariate analytical techniques simultaneously
analyse multiple measurements on each individual
of the germplasm and can assess genetic diversity
irrespective of the data set (i.e. morphological,
biochemical or molecular data can be used).
This book focuses only on the clustering method
(especially on salient statistical methodologies and
other considerations with respect to this method),
which is described in Box 1.2.

Software
Numerous software programs are available for
assessing genetic diversity, such as Arlequin,
DnaSP, PowerMarker, MEGA2, PAUP, TFPGA,
GDA, GENEPOP, NTSYSpc, Structure, Gene
Strut, POPGENE, MacClade, PHYLIP, SITES,
CLUSTALW and MALIGN. Most of them are
freely available on the World Wide Web. Most of
the programs perform similar tasks, with the main
differences being in the user interface, type of
data input and output, and platform. Thus, the choice of program depends largely on individual preference.

Principle Behind the Genetic Diversity Analysis
When a rectangular data matrix X_{n×p} is prepared
(where the n rows correspond to n different
genetic objects and the p columns correspond to
p different types of phenotypic and/or binary
molecular data), the term genetic diversity among
the n genetic objects refers to grouping the n
objects into an appropriate number of classes
(usually fewer than n) such that the objects within
classes are relatively homogeneous with respect
to the p variables. Two statistical techniques,
classification and ordination, are used for grouping the n entities based on the p types of phenotypic and/or binary molecular data. Application
of these techniques requires an a priori selection
of an appropriate quantitative measure of proximity (similarity/dissimilarity/distance) among
the given entities. Once an appropriate proximity measure has been selected, the data matrix
X_{n×p} is converted to a square proximity matrix
M_{n×n} of n rows and n columns corresponding
to the n genetic entities. Applying an
appropriate sequential agglomerative hierarchical nonoverlapping (SAHN) classification technique and an appropriate ordination technique to
the proximity matrix M_{n×n} yields a dendrogram
and a two- or three-dimensional ordination plot,
respectively. The dendrogram and the ordination plot, which are the graphical end products of
classification and ordination, elucidate the underlying structure of genetic diversity among the n
genetic objects. In general, SAHN clustering
takes a dissimilarity matrix D_{n×n} = {d_ij} as input
data. Initially, the two closest objects are joined based
on their d_ij values, giving (n − 1) clusters, one of which
contains two objects while the others have a single
member. In each succeeding step, the two closest
clusters are merged. To do so, we need an appropriate definition of dissimilarity between clusters
based on the dissimilarity between their constituent
objects. This is the point at which the different SAHN
methods differ. There are several SAHN methods,
including the unweighted pair group method using
arithmetic averages (UPGMA; a compromise
between single and complete linkage, preferred
due to its robust nature), the single linkage
method, the complete linkage method, Ward's method (useful
for continuous variables such as plant height and
yield) and weighted average linkage (WPGMA).
Other SAHN methods that are rarely used in
practice are centroid (UPGMC), median
(WPGMC) and flexible. SAHN classification
results are represented by a 2-D diagram known as a
dendrogram. The dendrogram depicts the fusion
of objects/clusters at each step of the analysis
along with a numerical measure of (dis)similarity. Hierarchical clustering methods are
either agglomerative or divisive. Agglomerative methods proceed by a series of successive fusions of the n
objects into groups. Divisive methods proceed by
separating the n objects into successively finer
groups. Groupings or divisions produced by a
hierarchical method are final; thus, defects in
clusters, once introduced, cannot be repaired.
Agglomerative methods are more widely used
than divisive methods. Single linkage, complete
linkage, centroid, Ward's and group average are
the most widely used agglomerative clustering
methods. The group average method, also called
average linkage or UPGMA, has been
widely used for germplasm analysis in plant
breeding. The interaction between clustering method
and data structure can be significant. The aim of cluster
analysis is to find an optimum tree (or phenogram
or dendrogram) or set of clusters. Hierarchical
algorithmic clustering methods are used to represent distance matrices as ultrametric trees. If the
distances are ultrametric, then the fit of the data
to an ultrametric tree is exact; otherwise, it is
not. The reliability of the estimated diversity elucidated by a dendrogram and/
or an ordination plot depends on many factors.
However, the most critical factor is the accuracy
with which the phenotypic and molecular scores
in the data matrix X_{n×p} are recorded and
estimated.
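The SAHN procedure described above can be sketched in a few lines. The example below is an illustrative implementation of UPGMA only, written for clarity rather than speed; the function name, the return format and the four-object dissimilarity matrix are assumptions made for this sketch and do not reproduce the output of NTSYSpc or any other program.

```python
def upgma(dist, labels):
    """Sequential agglomerative clustering with the UPGMA rule.

    dist   : square, symmetric dissimilarity matrix (list of lists)
    labels : names of the n objects, in matrix order
    Returns the merges in the order they occur, each as
    (members_of_cluster_1, members_of_cluster_2, dissimilarity).
    """
    # Start with n clusters, each holding one object (by matrix index).
    clusters = {i: (i,) for i in range(len(labels))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = clusters[ids[x]], clusters[ids[y]]
                # UPGMA: the dissimilarity between two clusters is the
                # mean of all object-to-object dissimilarities.
                d = sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        d, ia, ib = best
        a, b = clusters.pop(ia), clusters.pop(ib)
        merges.append((tuple(labels[i] for i in a),
                       tuple(labels[i] for i in b), d))
        clusters[ia] = a + b  # the fused cluster replaces its two parts
    return merges

# Hypothetical dissimilarities among four germplasm entries.
labels = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 8],
        [10, 10, 8, 0]]
for left, right, height in upgma(dist, labels):
    print(left, "+", right, "at", round(height, 3))
```

Replacing the mean in the inner loop with a minimum or a maximum would give single linkage or complete linkage, respectively, which is exactly the point at which the SAHN methods differ.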

Principle of Measuring Goodness of Fit of a Classification
When genetic diversity analysis is done with
more than one statistical program (see above),
comparison of dendrograms, with each other or
with their proximity matrices, is required to validate the clustering results. For example, we may
like to test whether different subsets of the p variables
or different clustering methods applied to the same
data provide similar results. Statistical measures that address such questions include the cophenetic
correlation and Mantel's permutation test.
These are implemented in the statistical programs
themselves (e.g. in NTSYSpc). There are other measures such as the kappa coefficient, Rand index,
adjusted Rand index and BC coefficient, but they are rarely
employed. To compute the cophenetic correlation, a cophenetic matrix of cophenetic values is generated from the dendrogram.
Values of the cophenetic correlation above 0.80 indicate good agreement
(see Box 1.2). The Mantel test provides a measure
of statistical significance for the observed cophenetic correlation. When the same n objects are
clustered separately using phenotypic and molecular data, the results can be synthesised into a single
consensus dendrogram using strict consensus or
majority consensus rules (refer to the NTSYSpc manual
for performing such analysis). The strict consensus rule
delivers a consensus dendrogram in which each subset
is present in every individual constituent dendrogram.
In a majority consensus dendrogram, each subset
is present in a majority of the individual constituent
dendrograms. Before attempting to obtain a consensus dendrogram, it may be useful to first compute cophenetic correlations to get an idea of the
extent to which the constituent dendrograms represent similar results. The bootstrap can be used to
assess the reliability of results produced by a dendrogram. WinBoot performs the bootstrap on binary
data to determine confidence limits of UPGMA-based dendrograms.
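To make the two measures above concrete, the following sketch computes the correlation between two proximity matrices (the cophenetic correlation when one of them holds cophenetic values) and a Mantel-type permutation p-value. It is a simplified illustration under stated assumptions: the function names, the number of permutations, the fixed random seed and the small example matrices are all chosen for this example, and the cophenetic matrix is entered by hand from a hypothetical UPGMA dendrogram rather than derived by the code.

```python
import random
from math import sqrt

def upper_triangle(m, order=None):
    """Values above the diagonal, optionally after reordering objects."""
    n = len(m)
    order = list(range(n)) if order is None else order
    return [m[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Product moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def mantel_test(m1, m2, permutations=999, seed=0):
    """Matrix correlation and one-sided permutation p-value."""
    rng = random.Random(seed)
    x = upper_triangle(m1)
    r_obs = pearson(x, upper_triangle(m2))
    n = len(m1)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)  # relabel the objects of the second matrix
        if pearson(x, upper_triangle(m2, order)) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)

# Hypothetical input dissimilarities and the cophenetic values read
# from a UPGMA dendrogram of the same four entries.
original = [[0, 2, 6, 10], [2, 0, 6, 10], [6, 6, 0, 8], [10, 10, 8, 0]]
h = 28 / 3
cophenetic = [[0, 2, 6, h], [2, 0, 6, h], [6, 6, 0, h], [h, h, h, 0]]
r, p = mantel_test(original, cophenetic)
print(round(r, 3), p)
```

With only four objects there are just 24 possible relabellings, so the p-value here cannot be small; the example is meant to show the mechanics, and real germplasm sets with many entries give the test far more resolution.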

Genetic Diversity Analysis Using Molecular Markers
Success of any crop breeding program is based on
(1) the knowledge of and (2) availability of genetic
variability for efficient selection. Genetic similarity (or genetic distance) estimates among genotypes are helpful in at least two ways: (1) selecting
parental combinations for creating segregating
populations so as to maintain genetic diversity in
a breeding program and (2) the classification of
germplasm into heterotic groups for hybrid crop
breeding. Establishment of heterotic groups can
be based on geographical origin, agronomical
traits, pedigree data or on molecular marker
data. Before the use of molecular markers,
genetic diversity was estimated from pedigree or
agronomic and morphological characteristics.
However, the estimates based on pedigree information are generally overestimated and often
found unrealistic. Similarly, the morphologically based genetic diversity estimates suffer from
the drawback that morphological characteristics
are limited in number and are influenced by the
environment. Therefore, neither pedigree-based
nor morphologically based estimates may
reflect the actual genetic differences of the studied
populations. On the other hand, molecular markers are not influenced by the environment, are likely to
reflect true genetic similarity (or dissimilarity)
and do not require previous pedigree information,
which is valuable for crops where pedigree information is lacking. Various types of molecular
markers are available for genome analysis. Simple
sequence repeats (SSRs) in particular have been
reported to be very useful to analyse the structure
of germplasm collections as these are abundant,
co-dominant, multi-allelic, highly polymorphic
and chromosome specific. SSR markers have been
extensively used in genetic diversity studies in
many plants including wheat, pearl millet, sorghum, triticale, cotton, rice and maize. There are
also other types of DNA- and RNA-based markers that have shown their potential utility in genetic
diversity analysis (see chapter 3 for more detailed
description of markers). However, molecular
markers should be used with caution when they are
employed in genetic diversity analysis because of
the following issues.
1. There are two approaches that are commonly
used in studies of genetic diversity within and
among populations or groups of individuals
using molecular markers. In the first, allele frequencies over a number of polymorphic loci
are determined, and parameters based on the
allele frequencies are used for partitioning
genetic variation into components for variation
within and between units. This approach may
be chosen when dominant markers (such as
RAPDs, AFLPs and ISSRs) are applied to haploid individuals or when co-dominant markers (such
as allozymes, RFLPs and SSRs) are used with haploid or diploid species, with the assumption of
no linkage between loci. With dominant markers, individuals that are heterozygous for a
DNA band at a specific position cannot be
distinguished with certainty from individuals that
are homozygous for that band (see chapter 3).
In the second approach, a genetic dissimilarity matrix is constructed using molecular data
from all possible pairwise combinations of
individuals and is used for characterising population structure based on the relative affinities of
each tested individual. This approach requires
proper methods for assessing dissimilarity
between individuals, and it is particularly useful in the case of possible linkages between
different loci. The choice of a suitable index of
similarity is a very important and decisive
point for determining true genetic dissimilarity
between individuals, clustering and analysing
diversity within populations and studying relationship between populations. This is because
different dissimilarity indices may yield contrary outcomes. Many researchers have preferred for various well-documented reasons
to use the second approach either alone or in
combination with the first approach. However,
the bases for choosing the most appropriate
coefficient of dissimilarity depending on type
of marker and ploidy of the organism in question have not received sufficient attention in
published research articles.
2. Molecular markers are commonly used to
characterise genetic diversity within or between
populations or groups of individuals because
they typically detect high levels of polymorphism. Furthermore, RAPDs and AFLPs are
efficient in allowing multiple loci to be analysed
for each individual in a single gel run. In
analysing banding patterns of molecular markers, the data typically are coded as (0,1)-vectors,
1 indicating the presence and 0 indicating the
absence of a band at a specific position in the
gel. With diploid organisms and co-dominant
markers, the banding patterns may be translated
to homozygous or heterozygous genotypes at
each locus, and the allelic structure derived is
utilised for comparison between individuals.
Several measures including the Dice (Nei and
Li), Jaccard and simple matching (or the squared
Euclidean distance) coefficients are commonly
employed in the analyses of similarity of individuals (binary patterns) in the absence of
knowledge of the ancestry of all individuals in the
populations. These similarity coefficients are
defined differently and therefore they may yield
different results for both the qualitative and
quantitative relationships between individuals.
Although these coefficients may not yield
identical results, most published studies do not
offer any rationale to support the choice of the
coefficient that was used in relation to the type
of marker evaluated or the ploidy and mating
system of the organism being studied. Each of
these factors may influence how accurately the
direct application of a given similarity coefficient
to the (1,0)-vectors will reflect the true genetic
similarity of any pair of individuals. In most
published studies, the similarity coefficient used
was apparently chosen simply because it was
used in an earlier publication or it is available
in the software package used to analyse the data.
In some cases, two or three similarity coefficients
are used with the same data set with the
expectation that, if the results are robust, the different coefficients should reveal essentially the
same patterns of diversity. If two similarity
coefficients reveal somewhat different patterns
of relationships between individuals, there is
hardly any rationale presented to suggest which
pattern is more valid, and often only one of the
patterns is presented in the publication. As a
general rule, we should expect an appropriate
similarity coefficient to produce a consistent
measure of the proportion of differentiating
factors showing similarity between any pair of
individuals relative to the total number of factors in which differences could have been
detected if the individuals showed no detectable similarity. That is, the similarity coefficient
employed should accurately reflect our best
understanding of the phenotypes observed and
the genetic basis for them.
3. With co-dominant markers, each recognisable
allele at a given locus is ordinarily associated
with a single band at a unique position in the
gel. Thus, in the case of diploid organisms for
a given locus, a homozygote will have one
band and a heterozygote will have two. Null
alleles (no band) are rarely seen. Therefore,
the shared absence of a band at a specific
position should not be considered in measures
of similarity with co-dominant markers.
Clearly with co-dominant markers, the genetic
similarities between pairs of individuals cannot be characterised simply in terms of the
proportion of bands that are shared between
two individuals. Also, if there are multiple
alleles per locus, as expected for SSRs, which
are highly variable, the total number of bands
expressed by all the individuals in a sample
will likely be much greater than the number
of loci involved. Therefore, the banding
profiles should be adjusted to represent the
allelic patterns of individuals across all loci
studied and to represent the total number of
loci and the number of shared alleles rather
than the total number of bands and the number
of shared bands, respectively, and the adjusted
values should be employed for measuring
similarity between individuals.
4. For dominant markers, it is generally assumed
that each band represents a different locus and
that the alternative to a band at the gel position
characteristic of that locus is the absence of a
band anywhere in the gel. Thus, for dominant
markers, there is a direct identity assumed
between the number of unique bands observed
and the number of identifiable loci for the sample of individuals. On the other hand, the interpretation of shared absences of specific bands
by two individuals may depend on the degree
of genetic similarity among individuals within
the sample. That is, the interpretation may be
different when the individuals are drawn from
different taxa in a phylogenetic tree than when
the individuals are all from closely related populations of a single species.
5. The key problem in analysing genetic relationships between individuals with molecular
markers is measuring their dissimilarity. There
is no universally accepted approach for
assessing genetic dissimilarity between individuals based on molecular markers. Different
dissimilarity measures are relevant to, and
should be used with, multi-locus dominant
and co-dominant DNA markers as well as with
diploid (polyploid) and haploid individuals.
The Dice dissimilarity index is suitable for
haploids with co-dominant molecular markers, and it can be applied directly to (0,1)-vectors representing multi-locus multi-allelic
banding profiles of individuals. None of the
Dice, Jaccard and simple mismatch coefficients
is appropriate for diploids (polyploids) with
co-dominant markers, because fingerprint profiles
cannot be processed directly.
Instead, the multi-allelic banding patterns
at each locus need to be transformed into the corresponding homozygous or heterozygous states; a measure of
dissimilarity within loci is then applied and
extended to the
multi-locus states of two individuals
by averaging across all co-dominant loci
tested. The simple mismatch coefficient can
be considered the most suitable measure of
dissimilarity between banding patterns of
closely related haploid forms, whereas for distantly related haploid individuals, the Jaccard
dissimilarity is recommended. In general, no
suitable method for measuring genetic dissimilarity between diploids with dominant
markers can be proposed. Therefore, analyses
of genetic dissimilarity between diploid (polyploid) organisms with dominant markers
should be viewed with caution unless the
organism is highly inbred and therefore highly
homozygous.
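The coefficients discussed in the points above differ mainly in how they treat shared absences of a band. The short sketch below computes the Dice, Jaccard and simple matching similarities for two (0,1)-vectors so that the difference can be seen on the same data. The function names, the use of None for a missing score and the two example profiles are assumptions made for illustration; the corresponding dissimilarities are obtained as one minus each similarity.

```python
def band_counts(u, v):
    """Tally band comparisons for two 0/1 profiles; None = missing."""
    a = b = c = d = 0
    for x, y in zip(u, v):
        if x is None or y is None:
            continue    # positions scored as missing data are skipped
        if x == 1 and y == 1:
            a += 1      # band present in both individuals
        elif x == 1:
            b += 1      # band present only in the first individual
        elif y == 1:
            c += 1      # band present only in the second individual
        else:
            d += 1      # band absent in both (shared absence)
    return a, b, c, d

# Each function assumes at least one band is present in either profile.
def dice(u, v):
    a, b, c, _ = band_counts(u, v)
    return 2 * a / (2 * a + b + c)      # shared absences ignored

def jaccard(u, v):
    a, b, c, _ = band_counts(u, v)
    return a / (a + b + c)              # shared absences ignored

def simple_matching(u, v):
    a, b, c, d = band_counts(u, v)
    return (a + d) / (a + b + c + d)    # shared absences count as matches

# Hypothetical banding profiles of two individuals at eight positions.
ind1 = [1, 1, 0, 0, 1, 0, 0, 0]
ind2 = [1, 0, 0, 0, 1, 1, 0, 0]
print(round(dice(ind1, ind2), 3),
      round(jaccard(ind1, ind2), 3),
      round(simple_matching(ind1, ind2), 3))
```

In this example the four shared absences raise the simple matching value above the other two, which is why the interpretation of shared absences (points 3 and 4 above) matters when choosing a coefficient.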

Box 1.2 Cluster Analysis

Cluster analysis refers to mathematically
grouping (or clustering) the individuals of the
germplasm based on their similar characteristics. Thus, individuals within a cluster show
high internal homogeneity and individuals
in different clusters exhibit high external
heterogeneity. Broadly, there are two types of
clustering strategies. One is the distance-based method (in which a pairwise distance
matrix is used, leading to a graphical
representation such as a tree or dendrogram),
and the other is the model-based
method, which uses parametric models (inferences on each cluster and their relationships are
obtained by maximum likelihood or Bayesian
methods). The latter
method has been reported to be innovative and useful because of the
constraints associated with the former method
with respect to multi-locus genotypic data.
However, at present, the distance-based methods are most frequently used, and a step-by-step
procedure for cluster analysis using this
method is explained hereunder.
Hierarchical and nonhierarchical methods
are commonly used in distance-based clustering analysis, and hierarchical clustering methods are most commonly employed in the analysis
of genetic diversity in crop plants. These
methods proceed either by a series of successive mergers (called the agglomerative hierarchical method) or by successive divisions of a
group of individuals (see above). The most
similar individuals are grouped first, and
these initial groups are merged according to
their similarities. Among the various agglomerative hierarchical methods, the unweighted
paired group method using arithmetic averages (UPGMA) is the most commonly adopted
clustering algorithm, followed by Ward's minimum variance method.
Nonhierarchical clustering procedures do
not involve construction of a dendrogram,
and hence they can be carried out using statistical software such as SAS or SPSS. However, these
methods are not usually followed in crops, primarily because of the lack of prior information about
the optimal number of clusters that is required
for accurate assignment of individual objects.
Among the different types of clustering
methods (such as UPGMA, unweighted paired
group method using centroids (UPGMC),
single linkage, complete linkage and median),
UPGMA dendrograms have been used extensively in published reports since they provide
consistency in grouping germplasm objects
with relationships computed from different
data types. However, despite some advantages
of UPGMA, a single clustering method might
not be effective in uncovering genetic
relationships, and it would be desirable to
analyse the congruence among results obtained
by different clustering procedures. The
efficiency of different clustering algorithms
can be estimated by calculating the cophenetic
correlation coefficient (see above). It is a product moment correlation coefficient measuring
the agreement between the dissimilarity/similarity
indicated by the phenogram/dendrogram that is the
output of the analysis and the distance/similarity
matrix that is the input of the cluster analysis. Using this
coefficient value r, the degree of fit of the dendrogram can be subjectively graded as 0.9 ≤ r,
very good fit; 0.8 ≤ r < 0.9, good fit; 0.7 ≤ r < 0.8,
poor fit; and r < 0.7, very poor fit. At the same
time, it should be kept in mind that a low
coefficient does not mean that the dendrogram is of no use. A poor coefficient value
only indicates that some distortion might have
occurred. It is also worth noting that whatever
algorithm is used for dendrogram construction,
in order to assess the reliability of the nodes, it
is essential to carry out bootstrapping of the
allele frequencies followed by calculation of
genetic distances.
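The idea of bootstrapping followed by recalculation of genetic distances can be sketched as follows. For simplicity, this example resamples marker positions of binary profiles (rather than allele frequencies) and asks only how often a chosen pair of entries remains the closest pair, which corresponds to the first node formed in an agglomerative dendrogram. The function names, the Jaccard distance, the number of replicates, the fixed seed and the three profiles are all assumptions made for illustration; dedicated programs such as WinBoot assess every node of the tree.

```python
import random

def jaccard_distance(u, v):
    """1 - Jaccard similarity for two 0/1 profiles."""
    shared = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(u, v) if x == 1 or y == 1)
    return 0.0 if either == 0 else 1.0 - shared / either

def bootstrap_pair_support(profiles, pair, replicates=200, seed=1):
    """Fraction of bootstrap replicates in which `pair` is strictly
    the closest pair of entries (needs at least three entries)."""
    rng = random.Random(seed)
    names = sorted(profiles)
    n_markers = len(profiles[names[0]])
    target = frozenset(pair)
    wins = 0
    for _ in range(replicates):
        # Resample marker positions with replacement.
        cols = [rng.randrange(n_markers) for _ in range(n_markers)]
        sample = {k: [profiles[k][c] for c in cols] for k in names}
        # Recalculate all pairwise distances on the resampled data.
        dists = {
            frozenset((p, q)): jaccard_distance(sample[p], sample[q])
            for i, p in enumerate(names) for q in names[i + 1:]
        }
        others = [d for key, d in dists.items() if key != target]
        if dists[target] < min(others):
            wins += 1
    return wins / replicates

# Hypothetical profiles: A and B are identical, C differs everywhere.
profiles = {
    "A": [1, 0, 1, 0, 1, 0],
    "B": [1, 0, 1, 0, 1, 0],
    "C": [0, 1, 0, 1, 0, 1],
}
print(bootstrap_pair_support(profiles, ("A", "B")))
```

With real marker data the support values fall between 0 and 1, and nodes with low support should be interpreted cautiously.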
Therefore, while studying the genetic diversity in crop plants, it is vital to decide the following points: (1) careful and effective use of
different types of data variables like continuous,
discrete, ordinal, multistate and binomial; (2) use
of multiple data sets such as morphological, biochemical and molecular data; and (3) appropriate
selection of clustering algorithms. Depending on
the genetic materials being analysed and the objectives of the experiment, different strategies need to be formulated (since
there is no single strategy that addresses all the
issues in genetic diversity analysis), and hence readers are requested to
refer to the bibliography to proceed further with
their crop and materials of interest.
There are many statistical packages available for analysing genetic diversity (see above
and Labate 2000). There is still a need for
developing a comprehensive and easy-to-use
statistical package that provides an integrated
study of genetic diversity at various levels.
However, because of user-friendliness and
availability of several features, NTSYSpc (F.
J. Rohlf, State University of New York, Stony
Brook, USA) and PHYLIP (J. Felsenstein,
University of Washington, Seattle, USA) have
been extensively employed in publications.
The procedure for employing NTSYSpc for
genetic diversity analysis using molecular
marker data is provided below.
Computer software, NTSYSpc (Numerical
Taxonomy and multivariate analysis SYStem),
is a system of program modules used to discover
and describe the patterns of biological diversity
that can be demonstrated in a set of multivariate
data. There are modules in NTSYSpc that
perform cluster analysis. The first crucial step in
genetic diversity analysis using the marker (or
DNA fingerprinting) data is the measurement
of similarity among germplasm entries. When
DNA profiles of two individual plants are compared, certain number of bands will be common
(or shared or monomorphic) between the two
DNA profiles (even by chance). The number or
proportion of common bands is expected to be
larger if the two individuals are biologically
related. It is therefore important to objectively
measure the expected degree of similarity due to
chance or relatedness. Hierarchical clustering
(which is used in the procedure below) provides not only information about the
objects that belong to each cluster but also gives
us an idea about which ones are closest to each
other and how dissimilar they are from the other objects
in the cluster. Subsequently, such analysis is
used for phylogenetic tree estimation, which is
then visualised as a graphical dendrogram. This
entire process involves first computing a matrix
of similarity coefficients for all pairs of OTUs
(operational taxonomic units) and then performing the actual cluster analysis based on the
similarity index by UPGMA. The resulting
dendrogram provides a good estimate of the
phylogeny of a particular group of organisms.
As an example, the modules SIMQUAL (for
similarity matrix construction), SAHN (for
sequential agglomerative, hierarchal and
nested) clustering methods and TREE (displays tree from cluster analysis as dendrogram)
to perform phylogenetic tree (dendrogram)
estimation are explained hereunder. However,
there are several computational modules
included in NTSYSpc. Detailed technical
descriptions of the modules (including equations for the operations and the various
coefficients) are provided in the help file.
NTSYSpc is not limited to just the analyses
mentioned in this box. The modules can be

used in sequence to build many other types of


analyses (e.g. Gowers principal coordinates
analysis can be carried out by using the
SIMINT, DCENTER and EIGEN modules;
CONSENSUS computes a consensus tree for
two or more trees (such as multiple tied trees
from SAHN or between two different methods, and several consensus indices are also
computed to measure the degree of agreement
between trees); COPH produces a cophenetic
value matrix (matrix of ultrametric values)
from a tree matrix produced by the SAHN program; this matrix can be used by the MXCOMP
program to measure the goodness of fit of a
cluster analysis to the similarity or dissimilarity matrix on which it was based).
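The similarity matrix that precedes clustering can be illustrated outside NTSYSpc. The following is a minimal Python sketch, not the NTSYSpc implementation: the function names and the three band profiles are hypothetical, and the two coefficients are the standard simple matching and Dice formulas for 0/1 data.

```python
def simple_matching(a, b):
    """Share of band positions where both profiles agree (1-1 or 0-0)."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def dice(a, b):
    """Dice coefficient: 2 * shared bands / (bands in a + bands in b)."""
    shared = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    total = sum(a) + sum(b)
    return 2 * shared / total if total else 0.0


def similarity_matrix(profiles, coefficient):
    """Pairwise similarity for every pair of entries (OTUs)."""
    names = list(profiles)
    return {p: {q: coefficient(profiles[p], profiles[q]) for q in names}
            for p in names}


# Hypothetical 0/1 band scores for three entries over five bands.
profiles = {
    "Ind1": [1, 0, 1, 1, 0],
    "Ind2": [1, 1, 1, 0, 0],
    "Ind3": [0, 1, 0, 0, 1],
}
sm = similarity_matrix(profiles, simple_matching)
dc = similarity_matrix(profiles, dice)
print(sm["Ind1"]["Ind2"], dc["Ind1"]["Ind2"])
```

Simple matching counts shared absences as agreement, whereas Dice uses only shared bands, so the two coefficients can rank the same pairs differently.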

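Preparation of Input Data File

Scoring of Data from Gel Matrix

[Figure: gel image of a co-dominant marker at a single locus (Locus A); lanes: Ladder, Individual 1, Individual 2, Individual 3; bands A1 and A2]

Scoring by band (Locus A)

Band    Individual 1    Individual 2    Individual 3
A1      1               1               0
A2      0               1               1

Scoring by genotype (Locus A)

            Individual 1    Individual 2    Individual 3
Genotype    A1A1            A1A2            A2A2
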

With a co-dominant marker (see chapter 3), all three genotypic classes can be observed: the two homozygotes and the heterozygote. The drawing above shows a gel image with the banding pattern of a co-dominant marker for a single locus of a diploid organism. The bands in the gel need to be scored and converted to numerical data. To do so, each band size (the bands in the same row) is scored as 1 if the band is present or as 0 if it is absent. Scoring can be done by band or by genotype, as in the table. This is because the analysis of genetic diversity involves quantifying diversity and the relationships within and between populations and/or individuals and displaying those relationships, and for this kind of analysis molecular data are usually handled as binary data. Molecular data can be usefully complemented with morphological or evaluation data; to do so, these types of variables can be transformed to binary variables. In a gel image of a co-dominant marker with three alleles (A1, A2 and A3) or multiple alleles in a diploid sample, each band (each row) needs to be scored independently as 1 if present or 0 if not. It is wise to score co-dominant markers as allele frequencies, since scoring as presence/absence may cause loss of genetic information. Alternatively, using a large number of markers with presence/absence scoring would solve this issue.

With a dominant marker (see chapter 3), only two genotypic classes can be observed: AA + Aa and aa. That is, one of the homozygous classes is confounded with the heterozygote (in the gel picture below, the banding pattern for AA or Aa looks like that of individual 1). Thus, the gel image of a dominant marker for a single locus shows either one band or no band for each individual. The bands are scored as for the co-dominant marker: 1 if present, 0 if not.

[Figure: gel image of a dominant marker at a single locus (Locus A); lanes: Ladder, Individual 1, Individual 2]

Scoring by genotype (Locus A)

            Individual 1    Individual 2
Genotype    AA or Aa        aa
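The scoring rules described above can be written as a short Python sketch. The function names and the genotype encoding (a pair of allele labels per individual) are assumptions made for illustration only; the example genotypes are those shown for Locus A.

```python
def score_by_band(genotype, alleles):
    """Co-dominant marker: 1 if the allele's band is present, else 0."""
    return [1 if allele in genotype else 0 for allele in alleles]


def score_dominant(genotype):
    """Dominant marker: the band shows if at least one 'A' allele is present,
    so AA and Aa cannot be told apart."""
    return 1 if "A" in genotype else 0


alleles = ["A1", "A2"]
codominant = {
    "Individual 1": ("A1", "A1"),
    "Individual 2": ("A1", "A2"),
    "Individual 3": ("A2", "A2"),
}
band_scores = {name: score_by_band(g, alleles) for name, g in codominant.items()}

dominant_scores = [score_dominant(g) for g in (("A", "A"), ("A", "a"), ("a", "a"))]
print(band_scores, dominant_scores)
```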

Creation of Data Files for NTSYSpc

NTSYSpc files are ordinary ASCII files. A file for an initial data matrix may be prepared with an editor or any word processor that can save pure ASCII text. Free format is used for all the entries in the data matrix: at least one blank space is required between numbers, and tab characters will not work. Alternatively, an Excel sheet can be used to prepare the data file, which can then be imported into NTSYSpc using the NTedit program.
For each of the basic file formats (rectangular, symmetric, diagonal, tree and graph), NTedit displays an appropriate arrangement of the cells in the spreadsheet. Although any of these file formats can be employed, use of NTedit ensures that the files are formatted correctly; however, data cannot be exported to an Excel spreadsheet. Start NTedit by clicking on the program icon and then use the drop-down File menu to load an existing data file or files. Once NTedit is started, data can be entered or corrected in any of the cells of the spreadsheet. Rows and columns can be deleted or inserted within the table by clicking on the appropriate menu choices. Rows and columns can also be added or deleted by entering new values in the edit boxes that display the current number of rows and columns. The numerical code used to indicate missing values in the data can be entered or changed. Make sure this field is blank (not zero) if there are no missing values. It is essential to check for missing data, which should not exceed 5%, since missing data can distort analyses.
Tips to Prepare Data File

1. The qualitative or quantitative data pertaining to each individual (or population) may be prepared in an Excel sheet in the following format.

1              12      13      1       9
               SSR1    SSR2    SSR3    SSR4
Individual1    0       1       0       1
Individual2    1       1       0       1

Note:
First column first row: type of matrix (1 for rectangular matrix; 2 for similarity matrix)
Second column first row: number of markers scored in this analysis
Third column first row: number of accessions
Fourth column first row: presence of missing values (0 if there is no missing value; 1 if there is any missing value)
Fifth column first row: the value given for missing values (if any)
First column second row: leave it empty
Second column onwards in the second row: marker (or quantitative trait) names, one in each column
First column third row onwards: names of the accessions, one in each row (it is better to restrict the marker names and accession names to eight characters)
Second column third row onwards: marker score of each accession for the corresponding marker.
2. Save the Excel file as *.txt (text tab delimited file) and import this file through NTedit.
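The sheet layout in the note can also be produced programmatically. The Python sketch below only mirrors the layout described in this box; it is not an official NTSYSpc writer, the helper names are hypothetical, and the missing score for Individual2 at SSR3 is invented to show the 5% check.

```python
def build_rows(markers, accessions, scores, missing_code=9):
    """Lay out the sheet as in the note: header row, marker names, one row per accession.

    scores[accession][marker] holds 0, 1 or None (None = missing score).
    """
    has_missing = any(scores[a][m] is None for a in accessions for m in markers)
    header = [1, len(markers), len(accessions), 1 if has_missing else 0]
    if has_missing:
        header.append(missing_code)
    rows = [header, [""] + list(markers)]
    for a in accessions:
        rows.append([a] + [missing_code if scores[a][m] is None else scores[a][m]
                           for m in markers])
    return rows


def missing_fraction(markers, accessions, scores):
    """Share of missing cells; the box advises keeping this at or below 5%."""
    cells = [scores[a][m] for a in accessions for m in markers]
    return sum(1 for c in cells if c is None) / len(cells)


def to_tab_delimited(rows):
    """Tab-delimited text, as saved from Excel before import through NTedit."""
    return "\n".join("\t".join(str(c) for c in row) for row in rows)


markers = ["SSR1", "SSR2", "SSR3", "SSR4"]
accessions = ["Individual1", "Individual2"]
scores = {
    "Individual1": {"SSR1": 0, "SSR2": 1, "SSR3": 0, "SSR4": 1},
    "Individual2": {"SSR1": 1, "SSR2": 1, "SSR3": None, "SSR4": 1},
}
rows = build_rows(markers, accessions, scores)
text = to_tab_delimited(rows)
share_missing = missing_fraction(markers, accessions, scores)
print(text)
```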

Construction of Dendrogram and Genetic Diversity Analysis
1. Open the NTSYS program.
2. Go to NTedit if you have your file in Excel format.
3. From the File menu, select the option to import Excel using DDE.
4. This opens a pop-up menu in which you browse for your Excel file to open it in the NTedit window.
5. Save this file in *.NTS format, specifying an appropriate file name.
6. Close the NTedit window and open the NTSYSpc window.
7. Select the Similarity icon and, in this window, select SIMQUAL, which calculates a similarity index from qualitative data (zero and one data; e.g. the data file prepared as above). If the data are in allele frequency format, select SIMGEN. If the data file contains quantitative measures, select SIMINT, which calculates a similarity index from interval data (such as plant height).
8. This leads to a new pop-up menu. In the input file pointer, double click to browse for the data file that was saved using the NTedit program.
9. If you have saved the accessions in the rows, select BY ROW. If you have saved the data as per the format described in this exercise, DO NOT SELECT the ROW option.
10. In the next row, you will find the coefficient parameter, for which a range of arguments is given. The default coefficient is SM, which denotes the simple matching coefficient. The coefficient proposed by Dice (DICE) is the preferred argument. Click the help icon for more information on the parameter/arguments and the references therein.
11. Specify the output file (e.g. file number 2) by double clicking the corresponding column and using the browser.
12. Running Compute opens a new pop-up report listing, which contains information on the input data file, the output file, the coefficient selected, the matrix type, etc.
13. Close this and the similarity windows and select the CLUSTERING icon.
14. In the new pop-up menu, select SAHN; specify the input file by double clicking on the argument column and browsing for the file saved in step 11 (file number 2).
15. Specify the new output tree file (e.g. file number 3) in the argument of the next row by double clicking.
16. Select the clustering method, the nature of ties and the maximum number of ties. The rest can be left at their default values if you have no preference.
17. A report listing window similar to that in step 12 appears, containing all the calculations.
18. Close this window.
19. In the clustering window, you can now find the dendrogram symbol (a red-coloured icon) below the Compute button; select that tree icon.
20. This produces a picture of the dendrogram obtained from the input file in a new tree plot window. The dendrogram is usually plotted with distance or similarity on the horizontal axis and germplasm entries on the vertical axis. If the number of individuals shown is low, use the Options menu to increase the number of clusters/individuals per page.
21. You can edit this picture using the plot options or copy the metafile and paste it into a PowerPoint slide. Before editing the picture in PowerPoint, ungroup it.
22. File number 2, which can be opened as a plain text file, contains the coefficient values for each individual with respect to the other individuals, and this can be used for interpretation of results.
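The clustering performed by SAHN with the UPGMA method can be sketched in plain Python to show what the dendrogram encodes. This is an illustrative implementation under stated assumptions, not the NTSYSpc code: the function name, the cluster labels and the four-entry distance matrix (distance = 1 − similarity) are hypothetical.

```python
from itertools import combinations


def upgma(names, distance):
    """Average-linkage (UPGMA) clustering.

    distance[a][b] holds the distance for each pair with a listed before b in names.
    Returns the merges in order as (cluster_a, cluster_b, merge_distance).
    """
    sizes = {n: 1 for n in names}
    d = {frozenset(p): distance[p[0]][p[1]] for p in combinations(names, 2)}
    merges = []
    while len(sizes) > 1:
        # Closest pair of clusters; ties are broken alphabetically.
        pair = min(d, key=lambda p: (d[p], sorted(p)))
        a, b = sorted(pair)
        height = d.pop(pair)
        new = "(" + a + "," + b + ")"
        for c in sizes:
            if c in (a, b):
                continue
            da = d.pop(frozenset((a, c)))
            db = d.pop(frozenset((b, c)))
            # Size-weighted mean keeps every original entry equally weighted.
            d[frozenset((new, c))] = (sizes[a] * da + sizes[b] * db) / (sizes[a] + sizes[b])
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((a, b, height))
    return merges


# Hypothetical distances among four germplasm entries.
names = ["A", "B", "C", "D"]
distance = {
    "A": {"B": 0.2, "C": 0.6, "D": 0.8},
    "B": {"C": 0.6, "D": 0.8},
    "C": {"D": 0.4},
}
merges = upgma(names, distance)
for a, b, height in merges:
    print(a, b, round(height, 3))
```

Each merge corresponds to one node of the dendrogram, and the merge distance is the level at which the two branches join.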

Interpretation of Results

When you have completed clustering with a number of procedures, the obvious next step is finding the consensus clusters. Sometimes, some of the germplasm entries fall in different clusters when different procedures are employed. It is very difficult to assign these entries to a proper cluster; additional information (such as pedigree and region of origin) may be required to assign them to the appropriate cluster. Bootstrapping can be used to check that enough markers were employed to sample the genetic diversity and that the resulting dendrogram is statistically sound. A bootstrapping program (available in WinBoot) can repeat the cluster analysis many times and return a dendrogram in which the clusters are defined by the number of times the individuals within the cluster were found together across the analyses. This number can be used as a confidence limit for the clusters within the dendrogram. It is generally believed that 400 repetitions of the analysis are needed for a bootstrap accuracy of 95% and 2,000 repetitions for an accuracy of 99%.
Often one wishes to test whether one set of relationships among a set of objects is independent of another. For example, one may wish to test whether the degree of morphological difference between samples is related to the geographical distances between the sampled populations. A simple way to do this is the Mantel test. The test assumes that the two matrices have been obtained independently; it cannot be used to compare two or more matrices where one of them has been derived from the other.

Partitioning Variation in the Germplasm

Yet another critical step in a diversity analysis is to investigate the variation present in the germplasm, that is, not to visualise
relationships between individuals but simply to see the overall breakdown of variation in the sample. Usually, analysis of molecular variance (AMOVA), which is very similar to the ANOVA procedure, is used for this analysis. It is also useful to measure the richness of alleles for each marker, or the information that each marker contributes to the study in discriminating between individuals. The usefulness of a marker in such a study is affected by the number of alleles, the frequency of the alleles, etc. To this end, three measures are frequently used: polymorphic information content (PIC), allelic richness and discriminatory power of the markers. Allelic richness can be calculated using the LCDMW package (http://www.cimmyt.cgiar.org/ABC/Protocols/manualABC.html). PIC is calculated from the number of alleles (or bands) that a marker has and the frequency of each of the alleles in the studied germplasm. Since a marker with fewer alleles (or bands) has less power to distinguish the entries that constitute the germplasm, markers possessing higher PIC
values are usually preferred. The formula used to calculate PIC is

$$\mathrm{PIC} = 1 - \sum_{i=1}^{n} P_i^{2},$$

where $P_i$ is the frequency of the ith allele of the marker and $n$ is the number of alleles. This can be calculated by simply using an Excel spreadsheet as shown below.

Data Sheet Preparation and PIC Calculation
Enter the marker allelic data as presence (1) or absence (0) of each allele for each entry of the germplasm. It is important to change the score 1 to 2 if the entry is homozygous for that allele; the score 1 should be retained if the entry is heterozygous, that is, if another allele is present for that marker in the given entry. For example, in the case of SSRs, we can sum over all alleles of each SSR to make sure that the sum is a maximum of 2 in every individual for every SSR (refer to the 6th row of the table below). Thus, we can ensure that the data were not mis-scored in any individual, as every individual will have two alleles for every SSR.
An example gel matrix of an SSR profile (which produced four different alleles (a, b, c and d) in the given five individuals) and its respective data sheet are given below for easy understanding.

[Figure: gel image of the SSR1 profile; lanes: Ladder, Individual1, Individual2, Individual3, Individual4, Individual5]

Allele    Ind1    Ind2    Ind3    Ind4    Ind5    Freq*    Freq2**
SSR1a                                             2/5      (2/5)^2 = 0.16
SSR1b                                             1/5      (1/5)^2 = 0.04
SSR1c                                             2/5      (2/5)^2 = 0.16
SSR1d                                             1/5      (1/5)^2 = 0.04
Sum                                                        0.40
PIC                                                        0.60

Freq*: frequency of allele = number of individuals having this allele/total number of individuals
Freq2**: (frequency of allele)^2
PIC = 1 − sum
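The spreadsheet calculation above can be reproduced with a few lines of Python. This is only a sketch; the function name is hypothetical, and the frequencies are those listed in the data sheet for SSR1, using the box's definition of frequency (individuals carrying the allele divided by the total number of individuals).

```python
def pic(frequencies):
    """PIC = 1 - sum of squared allele frequencies for one marker."""
    return 1.0 - sum(p * p for p in frequencies)


# Frequencies of the four SSR1 alleles as given in the data sheet.
ssr1 = {"SSR1a": 2 / 5, "SSR1b": 1 / 5, "SSR1c": 2 / 5, "SSR1d": 1 / 5}
sum_of_squares = sum(p * p for p in ssr1.values())
ssr1_pic = pic(ssr1.values())
print(round(sum_of_squares, 2), round(ssr1_pic, 2))
```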


Parental Selection
A successful crop breeding program depends on careful selection of parents that complement each other for the given trait and yield. Thus, choosing parents is one of the most important steps in a breeding program. Although breeders have different approaches for parental selection, all the strategies share a common feature: selected parents should be as diverse as possible at the phenotypic and genotypic levels. At least one locally adapted, popular cultivar is used as one parent to ensure the recovery of a high proportion of progenies with adaptation and quality that are acceptable to farmers and end users. Each parent should complement the weaknesses of the other parent. For instance, when we select parents for drought tolerance breeding, it is better to avoid parents that are genetically diverse but highly drought susceptible. In such cases, use of improved modern varieties as one of the parents may offer many disease-, insect- and abiotic stress-tolerance genes. Thus, thorough phenotyping and genetic diversity analysis will help to identify the most appropriate parental lines for biparental or multiparental crosses to produce new segregating populations (discussed in chapter 2) suitable for high-resolution genetic map construction and efficient quantitative trait loci (QTL) discovery.


Mapping Population Development

Mapping Population and Its Importance in Genetic Mapping
The principle of genetic mapping is mainly based on sampling the recombination frequency between the genes (or markers) that segregate in the mapping population. A mapping population consists of individual progenies that originate from two parents of one species or of related species. Hence, the first step in linkage or genetic map construction is the development of a mapping population. Mapping populations are considered key genetic tools/resources in linkage map construction, since they are used to identify genetic loci that influence the expression of phenotypes and to determine the recombination distance between loci.
Within a species, the genes (or markers), represented by alternative allelic forms, are arranged in a fixed linear order on the chromosomes. Linkage values among these gene or marker loci are estimated from recombination events between alleles of different loci, and such linkage relationships along all the chromosomes provide a genetic map of the crop (see chapter 4 for more details). However, genetic maps alone are not sufficient to explain the complexity of genome organisation, since they are based on recombination events, whose frequency varies greatly along the chromosomes. At the same time, knowledge of the genetic map and the cytogenetic map forms the foundation for physical map construction. An integrated map thus provides a detailed view of genome structure and offers efficient ways for positional cloning of genes or genome sequencing. Hence, mapping populations are the basic tools for understanding the effect of selected genetic factors and the organisation of the genome of a species as a whole. They are the backbone of genomics research that aims to decipher large, complex genomes at the nucleotide sequence level.
Generally, in conventional genetic mapping and QTL analysis, the mapping population is developed from parents that are highly homozygous (inbreds are usually homozygous). The key phase in the development of a mapping population is the selection of two genetically divergent parents (see chapter 1) that show clear phenotypic differences for the trait of interest. It is also desirable to choose parents that are as diverse as possible for a number of economically and agronomically important traits, so that the same mapping population can be used to identify QTLs for several traits. In addition, it is essential that the trait has significant heritability. Both monogenic (governed by a single gene) and polygenic (governed by several genes) traits can be mapped when the two parents are extremely different for these traits. It is expected that the more the parental lines differ, the more genetic factors will be described for the trait in the segregating population and the easier their identification will be. Owing to intensive breeding and pedigree selection, genetic variability within the gene pools of the relevant crops is at risk, and hence the contribution of wild species is of high value at this point. At the same time, the parents should not be too genetically distant, because excessive divergence can lead to sterility of the progenies and segregation distortion during linkage analysis.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_2, Springer India 2013

Several types of mapping populations, such as F2 progenies, F2 immortal populations, backcross (BC) progenies, recombinant inbred lines (RILs), doubled haploids (DHs), near isogenic lines (NILs) and nested association mapping (NAM) populations, have been utilised in this regard. Each population type has its own merits and limitations, and hence the choice of population type is critical for successful genetic mapping. F2 and BC populations are the simplest and easiest to construct, but they are highly heterozygous and cannot be propagated indefinitely through seeds. They can be used temporarily to construct a preliminary linkage map. In contrast, RILs, NILs and DHs are permanent populations, since they are homozygous or true-breeding lines that can be multiplied and reproduced without any genetic change. Thus, these populations represent eternal resources for mapping, and seed from individual RI or DH lines can be exchanged among different laboratories for further linkage analysis or for the addition of more markers to existing maps, ensuring that all collaborators examine identical material.
The type of mapping population to be used depends on the reproductive mode of the given crop. For self-pollinating species, F2 progenies and RILs are used; for self-incompatible, highly heterozygous species, F1 populations are mostly the tools of choice. BC progenies and DHs can be employed for both types of plants. If pure lines cannot be generated from a species because of self-incompatibility or inbreeding depression, heterozygous parental plants are used to derive mapping populations such as F1 and BC progenies. This is the case for several tree species (such as apple, pear and grape) and for potato. To maintain the identity of the F1 genotypes of the mapping population, the parental lines and each of their F1 or BC progenies are propagated clonally. In cross-pollinating species, the situation is more complicated, since most of these species do not tolerate inbreeding. Many cross-pollinating plant species are also polyploids (i.e. they contain several sets of chromosomes). Mapping populations for cross-pollinating species may be derived from a cross between a heterozygous parent and a haploid or homozygous parent. For example, in the cross-pollinating species white clover (Trifolium repens L.) and ryegrass (Lolium perenne L.), F1 generation mapping populations were successfully developed by pair crossing heterozygous parental plants that were distinctly different for important traits associated with plant persistence and seed yield.
No specific study pinpoints the ideal number of individuals required in a population to establish an accurate genetic map. The precision with which genetic distance is measured in a genetic map is directly related to the number of individuals that constitute the mapping population. For example, if only 20 individuals are studied and no recombinants are found between two given markers, the distance between these two markers would be recorded as 0 cM (see chapter 4 for details on genetic distance calculation). On the other hand, analysis of another 80 individuals of the same population may reveal recombinants, and hence the distance between the same two markers would be >0 cM, depending on the number of recombinants identified. In general, segregating progenies consisting of 50–250 individuals may be sufficient to construct an initial skeletal linkage map; however, a larger population size (say >1,000) is needed for high-resolution or fine mapping. Several studies have shown that more accurate maps are obtained when large populations and co-dominant markers are employed, whereas small populations yield several fragmented linkage groups and inaccurate locus orders (discussed in chapter 4). Maximum genetic information is obtained from an F2 population using co-dominant markers. Dominant markers supply as much information as co-dominant markers in RILs, NILs and DHs, since all loci in these populations are homozygous or nearly so. It is important to note that RILs, NILs and DHs may be powerful tools for QTL detection on some occasions, but they offer no information on the dominance relationships of QTLs. Characteristics of the major types of mapping populations used in genetic mapping studies are described in Table 2.1.
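The 20-versus-100 individuals example can be made concrete with a small Python sketch. It rests on simplifying assumptions that are not from the text: each individual contributes one informative gamete (as in a BC or DH population), gametes are independent, and r is the true recombination fraction between the two markers. The function names are hypothetical.

```python
def prob_zero_recombinants(r, n):
    """Chance of seeing no recombinant among n gametes when the true fraction is r."""
    return (1.0 - r) ** n


def upper_limit_r(n, confidence=0.95):
    """Largest r still compatible, at the given confidence, with zero recombinants in n gametes."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


# Two markers truly 5% apart can easily look completely linked in a small population.
for n in (20, 100):
    print(n, round(prob_zero_recombinants(0.05, n), 3), round(upper_limit_r(n), 3))
```

Under these assumptions, an observed distance of 0 cM in a small population still leaves a wide range of true distances possible, which is why larger populations are needed for fine mapping.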

Table 2.1 Characteristics of major types of mapping populations used in genetic mapping studies

F2 progenies
  Development procedure: Parent (x) Parent → F1 (s) → F2
  Merits:
    Best population for preliminary mapping
    Requires less time for development
    Can be developed with minimum efforts, when compared to other populations
    The degree of dominance can be estimated
  Demerits:
    Linkage established using F2 population is based on one cycle of meiosis
    F2 populations are of limited use for fine mapping
    Quantitative traits cannot be precisely mapped using F2 population, as each individual is genetically different and cannot be evaluated in replicated trials over locations and years. Thus, the effect of the G x E interaction or epistatic interaction on the expression of quantitative traits cannot be precisely estimated
    Not a long-term population; impossible to construct an exact replica or increase seed amount

BC progenies
  Development procedure: Parent (x) Parent → F1 (x) Parent → BC
  Merits:
    Requires less time to be developed
    The populations can be further utilised for marker-assisted backcross breeding
  Demerits:
    They are not immortal
    The recombination information in case of backcrosses is based on only one parent

DH lines
  Development procedure: Parent (x) Parent → F1 → Anther culture → DH lines
  Merits:
    DHs are a permanent mapping population and hence can be replicated and evaluated over locations and years and maintained without any genotypic change
    Useful for mapping both qualitative and quantitative characters
    Instant production of homozygous lines, thus saving time
    Epistasis can be detected
  Demerits:
    Recombination from the male side alone is accounted
    Since it involves in vitro techniques, relatively more technical skills are required in comparison with the development of other mapping populations
    Often suitable culturing methods/haploid production methods are not available for a number of crops, and different crops differ significantly in their tissue culture response. Further, anther culture-induced variability should be taken care of

RILs
  Development procedure: Parent (x) Parent → F1 (s) → SSD → F6 or more → RILs
  Merits:
    Once homozygosity is achieved, RILs can be propagated indefinitely without further segregation
    Since RILs are an immortal population, they can be replicated over locations and years
    RILs, being obtained after several cycles of meiosis, are very useful in identifying tightly linked markers
    RIL populations obtained by selfing have twice the amount of observed recombination between very closely linked markers as compared to populations derived from a single cycle of meiosis
    Epistasis can be detected
  Demerits:
    Requires many seasons/generations to develop
    Developing RILs is relatively difficult in crops with high inbreeding depression

NILs
  Development procedure: Parent (x) Parent → F1 (x) Parent → BC, continued with Parent up to BC6 (s) two generations → NILs
  Merits:
    NILs are an immortal mapping population
    Suitable for tagging the qualitative and quantitative trait
    NILs are quite useful in functional genomics
    Epistasis can be detected
  Demerits:
    Require many generations for development
    Directly useful only for molecular tagging of the gene concerned, but not for linkage mapping
    Linkage drag is a potential problem in constructing NILs, which has to be taken care of

Particulars                                    F2       BC        DH      RILs    NILs
Number of generations required to make                                    6–8     9
Number of informative gametes per individual            1
Number of recombinant events per gamete                 x                 2x
Number of possible genotypes per locus                  2
Inheritance of dominant markers                3:1      1:0 (a)   1:1     1:1     1:1
Inheritance of co-dominant markers             1:2:1    1:1 (a)   1:1     1:1     1:1

x crossing, s selfing, SSD single seed descent method, BC backcross
(a) However, backcross with recessive parent (B2) or testcross would segregate in a ratio of 1:1 irrespective of the nature of marker
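The expected marker ratios in Table 2.1 are what observed genotype counts are compared against when checking for segregation distortion. A minimal Python sketch of a chi-square goodness-of-fit check follows; the function name and the observed counts are hypothetical, and 5.991 is the standard 5% critical value for two degrees of freedom.

```python
def chi_square(observed, ratio):
    """Chi-square statistic for observed class counts against an expected ratio."""
    total = sum(observed)
    parts = sum(ratio)
    expected = [total * r / parts for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical F2 counts for a co-dominant marker: A1A1, A1A2, A2A2.
observed = [22, 55, 23]
statistic = chi_square(observed, ratio=(1, 2, 1))

# Three classes give 2 degrees of freedom; 5.991 is the 5% critical value.
fits_expected_ratio = statistic < 5.991
print(round(statistic, 2), fits_expected_ratio)
```

For a dominant marker in an F2 population, the same function can be called with two classes and the ratio (3, 1), using the one-degree-of-freedom critical value of 3.841.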


Selfing and Crossing Techniques in Crop Plants
In a crop improvement program, selfing and crossing are the two paramount procedures. Success in mapping population development largely depends on perfect execution of the selfing and crossing procedures. The exact procedures used to ensure self- or cross-pollination of specific plants depend on the floral structure and the method of pollination. Generally, accomplishing cross-pollination in a strictly self-pollinating species is more difficult, because it is not easy to prevent the self-pollination that occurs inside unopened flowers. In contrast, self-pollination of cross-pollinating species is simple. In the selfing of cross-pollinated species, it is essential that the flowers are bagged or otherwise protected to prevent natural cross-pollination. The structure of the flowers in a species determines the manner of pollination. For these reasons, it is always better to become acquainted with the flowering habit of the crop before developing a mapping population.
In the case of wheat, rice, barley, groundnut, etc., the plant is allowed to self-pollinate and the seeds are harvested. It is necessary to know the mode of pollination: if the extent of natural cross-pollination is high, the flowers should be protected by bagging, which prevents foreign pollen from reaching the stigma. Seed set is frequently reduced in ear heads enclosed in bags because of the excessive temperature and humidity inside the bags. In crops such as cotton, which have larger flowers, the petals may be folded down over the sexual organs and fastened so that pollen and pollen-carrying insects are excluded. This is simply achieved by closing the flower bud with cotton lint. In certain legumes that are almost entirely insect-pollinated, the plants may be caged to prevent insect pollination. In maize, a paper bag is placed over the tassel to collect pollen and the cob is bagged to protect it from foreign pollen. The pollen collected from the tassel is then transferred to the cob.
Removal of the stamens or anthers, or killing the pollen of a flower without affecting the female reproductive organ, is known as emasculation. In bisexual flowers, emasculation is essential to prevent self-pollination. In monoecious plants, the male flowers are removed (e.g. castor, coconut) or the male inflorescence is removed (e.g. maize). In species with large flowers (e.g. cotton, pulses), hand emasculation is accurate and adequate. For other crops, several other methods of emasculation are followed: the suction method, hot water or cold water treatment, alcohol treatment, use of genetic or cytoplasmic male sterility lines, employing protogyny (e.g. cumbu) and use of gametocides (e.g. ethrel, sodium methyl arsenate and zinc methyl arsenate in rice; maleic hydrazide in cotton and wheat). Immediately after emasculation, the flower or inflorescence is enclosed in a bag of appropriate size to prevent random cross-pollination. The pollen grains collected from the desired male parent are then transferred to the emasculated flower. This is normally done in the morning hours during anthesis. The flowers are bagged immediately after artificial crossing and tagged in pencil with appropriate information such as the date and the name of the cross combination.

F2 Progenies
Development of F2 progenies is the simplest and most rapid method when compared with other mapping population types. This is the population in which the foundations of Mendelian laws were first established. Usually, two pure lines that result from natural or artificial inbreeding are selected as parents (Fig. 2.1). Alternatively, two doubled haploid lines can be used as parents to avoid any residual heterozygosity. Crossing such parents produces fertile progenies, which are called the F1 generation. If the parental lines are true homozygotes, all individuals of the F1 generation will have the same genotype and a similar phenotype, in accordance with Mendel's law of uniformity. Each F1 plant is then selfed to produce an F2 population that segregates for the given trait. Thus, the F2 population is the outcome of one meiosis, during which the genetic material is recombined. The expected segregation ratio for each co-dominant marker is 1:2:1

[Figure 2.1: flow diagram. A female parent (elite line) is crossed (X) with a male parent (donor parent) to give the F1 hybrid. The branches show: haploids obtained from the F1 by anther culture, followed by chromosome doubling by colchicine treatment, giving doubled haploids; repeated backcrossing to the female parent (elite line) through BC1F1, BC2F1 and BC4F1, followed by selfing (S) to BC4F2, giving near isogenic lines (NILs); and selfing to F2 and F3, followed by single seed descent (SSD; each plant contributes a single offspring to the next generation) to F7, giving recombinant inbred lines (RILs).]

Fig. 2.1 Schematic illustration that explains the development of commonly used mapping populations in genetic mapping. X refers to crossing, S refers to selfing, SSD single seed descent method

(homozygous like the female parent : heterozygous : homozygous like the male parent) (see chapter 3). The main limitation of an F2 population is that it cannot be easily preserved, because F2 plants are not immortal and the F3 plants that result from their selfing are not genetically identical to them. However, crops that can be multiplied as clones using tissue culture can be propagated and regrown whenever needed. Another way is to maintain the F2 population as pools of F3 plants. Traits can also be evaluated in hybrids: testcross plants can be constructed by crossing each F2 individual with a common tester genotype. Ideally, different common testers should produce corresponding results, to exclude the specific effects of one particular tester genotype. As a compromise between the resolution of linked loci and cost, a preliminary genome-wide map can be produced with 200 F2 individuals. However, for the higher resolution required for positional cloning of genes, several thousand F2 progenies are needed (see chapter 7).
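
Whether the marker counts observed in an F2 population agree with the expected 1:2:1 (co-dominant) or 3:1 (dominant) ratio is commonly checked with a chi-square goodness-of-fit test. The following is a minimal sketch of that check; the function names and the plant counts are hypothetical and chosen only for illustration, and the critical values are the standard ones for a significance level of 0.05.

```python
# Chi-square goodness-of-fit test for marker segregation in an F2 population.
# Function names and example counts are illustrative assumptions.

CRITICAL_0_05 = {1: 3.841, 2: 5.991}  # degrees of freedom -> critical value


def chi_square(observed, expected_ratio):
    """Return the chi-square statistic for observed counts against a ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_ratio(observed, expected_ratio):
    """True if the counts do not deviate significantly at alpha = 0.05."""
    df = len(observed) - 1
    return chi_square(observed, expected_ratio) <= CRITICAL_0_05[df]


# Hypothetical co-dominant marker scored on 200 F2 plants:
# 48 like the female parent, 104 heterozygous, 48 like the male parent.
codominant_counts = [48, 104, 48]
print(round(chi_square(codominant_counts, [1, 2, 1]), 3))  # 0.32
print(fits_ratio(codominant_counts, [1, 2, 1]))            # True
```

A marker that fails this test shows segregation distortion and is usually inspected before it is used for map construction.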

F2-Derived F3 (F2:3) Populations


An F2:3 population is obtained by selfing the F2 individuals for a single generation. It is suitable for specific situations, such as mapping recessive genes that underlie the quantitative trait of interest. The F2:3 family can be used to reconstitute the genotype of the respective F2 plant, if needed, by pooling the DNA from the plants in the family. However, the main limitation is that, like an F2 population, it is not immortal and hence cannot be used for replicated experiments to validate the results.

F2 Intermating Populations
or Immortalised F2 Populations
Random intermating of F2 populations has been
suggested for obtaining precise estimates of recombination frequencies between tightly linked loci.
Immortalised F2 populations can be developed by
paired crossing of the randomly chosen RILs
derived from a cross in all possible combinations
excluding reciprocals. The set of RILs used for
crossing along with the F1s produced provides a
true representation of all possible genotype combinations (including the heterozygotes) expected in
the F2 of the cross from which the RILs are derived.
The RILs can be maintained by selfing and required
quantity of F1 seed can be produced at will by fresh
hybridisation. This population therefore provides
an opportunity to map heterotic QTLs and interaction effects from multi-location data.
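
The crossing plan for an immortalised F2 population (all pairwise combinations of the chosen RILs, excluding reciprocals) can be listed programmatically. This is a small sketch with hypothetical line names; for n RILs it yields n(n - 1)/2 crosses.

```python
from itertools import combinations


def immortalised_f2_crosses(ril_names):
    """List all RIL x RIL crosses, excluding reciprocals and selfs."""
    return list(combinations(ril_names, 2))


# Hypothetical set of four randomly chosen RILs.
rils = ["RIL1", "RIL2", "RIL3", "RIL4"]
crosses = immortalised_f2_crosses(rils)
for female, male in crosses:
    print(f"{female} x {male}")
print(len(crosses))  # 6 crosses for 4 RILs
```

Because the number of crosses grows quadratically with the number of RILs, the size of the crossing block is usually the practical limit of this design.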

However, in a simulation study, sampling effects due to small population sizes in the intermating generations were found to abolish the advantages of random intermating that had been reported in previous theoretical studies assuming an infinite population size. Frisch and Melchinger (2008) proposed a mating scheme for intermating with planned crosses that yields more precise estimates than random intermating. Mapping populations generated with a mating scheme with independent recombinations have the same properties as mapping populations derived from large random-mating populations. Hence, such a mating scheme guarantees the maximum possible information content in the mapping population while reducing the effort of employing large intermating populations.

DH Lines
Doubled haploid (DH) lines contain two identical sets of chromosomes in their cells. They are completely homozygous, as only one allele is available for each gene. Usually, DH lines are produced from haploid lines, which either occur spontaneously (e.g. rapeseed and maize) or can be induced artificially (Fig. 2.1). Haploid plants are usually smaller and less vigorous than diploids and are nearly sterile. Haploids can be induced by culturing immature anthers on special media, and haploid plants can later be regenerated from the haploid cells of the gametophyte. Alternatively, microspore culture can be employed. As a rare event, the chromosome number doubles spontaneously in some haploid plants, which leads to DH plants. Such lines can also be obtained artificially by colchicine treatment of haploid plants. Colchicine prevents the formation of the spindle apparatus during mitosis and thus inhibits the separation of chromosomes, leading to DH plants. If callus is induced in haploid plants, a doubling of chromosomes often occurs spontaneously during endomitosis, and DH lines can be regenerated via somatic embryogenesis. On the other hand, in vitro culture conditions may decrease the genetic variability of the regenerated materials to be used for genetic mapping. DH lines are also the product of one meiotic cycle and hence are comparable to F2 populations in terms of recombination information. Despite this, DH lines are used as a permanent resource for genetic mapping and are ideal crossing partners in the production of mapping populations, since they have no residual heterozygosity.

BC Progenies
To analyse specific genes or other regulatory DNA elements derived from one parent (i.e. the donor parent) in the background of another parent (i.e. the recurrent or elite parent), the hybrid F1 plant is backcrossed to the recurrent parent (Fig. 2.1). Two key features best describe BC progenies: unlinked donor fragments are separated by segregation, and linked donor fragments are minimised by recombination with the recurrent parent. In order to reasonably reduce the number and size of donor fragments, backcrossing is repeated. With each round of backcrossing, the proportion of the donor genome is reduced, on average, by 50%. The backcrossing process can sometimes be accelerated by the use of recurrent parent-specific markers (referred to as background markers; discussed in detail in chapter 3). With each round of backcrossing, the number and size of the genomic fragments of the donor parent are reduced until a single gene (or other regulatory DNA element) differentiates the BC progeny from the recurrent parent. That particular progeny is later screened for the trait introduced by the donor. If the trait is dominantly expressed, the progeny can be screened directly; recessive expression of the trait, on the other hand, requires testing of the selfed progeny of each BC progeny. BC progenies that are identical except for a few donor loci are called near isogenic lines (NILs) and are discussed separately (see below). A BC progeny carrying a fragment of genomic DNA from a very distantly related species is called an introgression line, while a BC progeny carrying genetic material from a different variety is referred to as an inter-varietal substitution line. At this point, it should be noted that recombination is reduced in interspecific hybrids relative to intraspecific hybrids, since variations in DNA sequence lead to reduced pairing of the chromosomes during meiosis. This results in linkage drag, which can be explained as the situation in which larger than expected donor fragments are retained during backcross breeding. Thus, linkage drag can cause undesirable effects in addition to the introgression of the trait of interest.
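
The halving of the donor genome with each backcross can be made concrete with a short calculation. The sketch below assumes no selection and no linkage drag, so the values are population averages only: the F1 carries 1/2 donor genome and the BCn generation carries, on average, (1/2)^(n+1). The function name is an illustrative assumption.

```python
def expected_donor_fraction(n_backcrosses):
    """Average donor genome fraction after n backcrosses (0 = F1)."""
    return 0.5 ** (n_backcrosses + 1)


for n in range(5):
    label = "F1" if n == 0 else f"BC{n}F1"
    donor = expected_donor_fraction(n)
    print(f"{label}: donor {donor:.4%}, recurrent parent {1 - donor:.4%}")

# BC4F1 carries on average 3.125% donor genome under these assumptions.
```

Individual plants deviate from these averages, which is why background markers are useful for picking the progeny with the highest recurrent parent content.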

RILs
Recombinant inbred lines (RILs) are the homozygous selfed or sib-mated progeny of the individuals of an F2 population (Fig. 2.1). The RIL concept in genetic mapping was originally developed for the mouse. Nearly 20 generations of sib mating are required to reach useful levels of homozygosity in animals. In plants, however, RILs with more than 98% homozygosity are produced by selfing within eight or nine generations (unless the species is completely self-incompatible). Self-pollination thus allows production of RILs in a relatively short period of time. In fact, in some strictly self-pollinating crops, almost complete homozygosity can be reached within six generations. RILs are usually developed by the single-seed descent method, in which one seed of each line is the source for the next generation during the selfing process. Bulk and pedigree methods without selection can also be used to produce RILs. In RILs, alleles derived from either parent alternate along each chromosome. In each generation, meiotic events lead to further recombination and reduce heterozygosity until completely homozygous RILs carrying fragments of either parental genome are obtained. Since recombination can no longer change the genetic constitution of such RILs, there is no further segregation in their progeny. Because of this, RILs are considered a permanent resource that can be replicated indefinitely and shared among many research groups. Another advantage of RILs is that they can be used to construct a higher-resolution genetic map than F2 populations, and hence the map positions of even tightly linked markers can be determined. This is because the degree of recombination is higher than in F2 populations. Like DH lines, RILs also equalise marker types: the genetic segregation ratio for both dominant and co-dominant markers is 1:1. RILs developed through brother-sister mating require more time than those developed through selfing. The number of inbred lines required is twice as large when they are developed through brother-sister mating rather than selfing, particularly when linkage is not very tight.
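
The approach to homozygosity under continued selfing can be illustrated numerically. The sketch below assumes a single locus that is heterozygous in the F1 and selfing without selection, so that the expected heterozygosity in generation Fg is (1/2)^(g-1); the function names are illustrative assumptions.

```python
def expected_homozygosity(generation):
    """Expected homozygosity at a locus in generation Fg under selfing."""
    if generation < 1:
        raise ValueError("generation must be 1 (F1) or higher")
    return 1 - 0.5 ** (generation - 1)


def first_generation_reaching(threshold):
    """Earliest selfing generation with homozygosity above the threshold."""
    generation = 1
    while expected_homozygosity(generation) <= threshold:
        generation += 1
    return generation


for g in range(2, 10):
    print(f"F{g}: {expected_homozygosity(g):.2%}")

print(first_generation_reaching(0.98))  # 7, i.e. F7 exceeds 98%
```

Under these assumptions the F7 is about 98.4% homozygous, which is in line with the generation numbers quoted for plant RILs.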

NILs, Exotic Libraries and Advanced Backcross Populations
Development of near isogenic lines (NILs) involves several generations of backcrossing. Backcrossing is carried out with the help of molecular markers, since markers can be used to recover the maximum amount of the recurrent parent genome. Two additional rounds of self-fertilisation are required at the end of the backcrossing process in order to fix the donor segments and to visualise traits that are caused by recessive genes (Fig. 2.1). Generally, it is assumed that if two NILs differ in phenotypic performance, the difference is the effect of the alleles carried by the introgressed DNA fragment in the given NIL. Thus, NILs constitute powerful tools for the functional analysis of the underlying genes. They are particularly valuable for species in which no transformation protocol has been established to produce transgenics for the alleles of interest. In addition, genomic rearrangements, which may occur during transformation, are avoided in NILs.
Usually, desirable alleles (e.g. for disease resistance or quality parameters) are found in distantly related or wild species, and those alleles can be introduced into a local elite cultivar through backcrossing. If the trait to be introduced is already known, the backcrossing can be expedited directly via marker-assisted selection. However, the potential of wild species to influence the expression of quantitative traits is often not assessed. To this end, backcross breeding is a method to identify single genomic components contributing to the phenotype. In such cases, NILs are developed by an advanced backcross program, in which mapping population development and QTL identification are carried out simultaneously and the phenotypic effects of the QTLs are assayed (first described by Tanksley and his research team (1996) in tomato; see chapter 8). A collection of introgression lines, each harbouring a different fragment of genomic DNA, can be generated to assess the effects of small chromosomal introgressions at a genome-wide level. Such collections are referred to as exotic libraries, and they are developed through recurrent backcrossing and marker-assisted selection for six generations, followed by self-fertilisation for two more generations to generate plants homozygous for the introgressed DNA fragments. Thus, NILs obtained from an advanced backcross program resemble the cultivated parent, but introgressed fragments with even subtle phenotypic effects can be easily identified. The introgressed fragments can be clearly defined by the use of molecular markers.

Four-Way Cross Populations


The majority of the genetic maps in crops were constructed using mapping populations derived from either interspecific or intraspecific single-cross hybridisation. Owing to the low level of within-species and between-species polymorphism, most of these maps cover only a relatively small portion of the genome. For example, even a joint map from different mapping populations has shown only 31% coverage of the cotton genome. If a genetic map with such poor coverage is used for QTL mapping, only a small portion of the genome will be explored and a large amount of QTL information will remain undiscovered. Use of the four parents of a double cross (otherwise referred to as a four-way cross) has been shown to increase the density of genetic maps (Qin et al. 2008). The F1s derived from two different single-cross hybridisation programs are crossed to generate a four-way cross population. The initial parental polymorphism survey should include all four parents. If a locus screened for polymorphism is homozygous in both F1 parents, it is excluded from linkage analysis because its alleles do not segregate in the four-way cross population. The markers can have Mendelian segregation ratios of 1:1, 1:2:1, 3:1 and 1:1:1:1 in a four-way cross population. Since a four-way cross involves four inbred lines (L1, L2, L3 and L4), the polymorphic markers identified between L1 and L2 or between L3 and L4 can be employed to develop the genetic map. If only two parents were used for mapping, half of the polymorphic markers would be homozygous and could not be used in linkage analysis. Thus, a four-way cross can increase the density of the linkage map, and in some cases, it can counteract the low levels of polymorphism found in certain crops. Further, use of a four-way cross can potentially reduce the type II error caused by random sampling of parents and increase the probability of detecting QTL (see chapter x) that segregate in one single cross but not in the other. In contrast to a single cross, in which only two alleles are involved, a four-way cross can have a maximum of four alleles. Because of this, the additive and dominance effects in a four-way cross are defined differently from those in a single cross, to accommodate the different inbred lines. When only two different alleles exist among the four inbred parents, the additive and dominance effects of the alleles have the same meaning as those of alleles identified in a single-cross population. If the allele of one parent differs from that of the other three parents at a locus, the four-way cross population is analogous to a BC population.
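
The segregation ratios listed above follow directly from the alleles carried by the four inbred parents. The sketch below enumerates the progeny genotypes of the cross (L1 x L2) x (L3 x L4) at a single co-dominant locus; the function name and the allele labels are hypothetical and used only for illustration.

```python
from collections import Counter
from functools import reduce
from itertools import product
from math import gcd


def four_way_segregation(l1, l2, l3, l4):
    """Genotype ratio among (L1 x L2) x (L3 x L4) progeny at one locus.

    Each argument is the single allele carried by an inbred parent.
    """
    f1_first = (l1, l2)    # F1 of the first single cross
    f1_second = (l3, l4)   # F1 of the second single cross
    counts = Counter(
        "".join(sorted(pair)) for pair in product(f1_first, f1_second)
    )
    divisor = reduce(gcd, counts.values())
    return {genotype: n // divisor for genotype, n in counts.items()}


print(four_way_segregation("A", "B", "C", "D"))  # four genotypes, 1:1:1:1
print(four_way_segregation("A", "B", "A", "B"))  # AA:AB:BB = 1:2:1
print(four_way_segregation("A", "B", "A", "A"))  # AA:AB = 1:1, as in a BC
print(four_way_segregation("A", "A", "B", "B"))  # only AB, no segregation
```

With a dominant marker the AA and AB classes of the 1:2:1 case cannot be distinguished, which gives the 3:1 ratio; the last case corresponds to a locus that is homozygous in both F1 parents and is therefore uninformative.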

Multi-Cross Populations
The features of the genetic structure of RILs can be studied using two-, four- and eight-way crosses followed by either selfing or sib mating. Though eight-way cross RILs have been successfully developed in mouse, they are yet to be demonstrated in major crops. Interestingly, there are several contrasting features between the nested association mapping (NAM) strategy (explained below) and eight-way cross RILs. In maize, which has very low linkage disequilibrium and tremendous genetic diversity, the main aims of RIL generation for NAM development are to capture a large array of alleles by using many founders, to produce RILs rapidly and to minimise physiological variation by crossing to a reference line. In contrast, the mouse has low diversity and high linkage disequilibrium, but the eight-way cross produces more recombinations per line, which helps compensate for the high linkage disequilibrium, and the mixing ensures that a fuller range of epistatic interactions is produced. For example, 5,000 maize RILs capture ~200,000 independent recombination breakpoints, compared with 135,000 breakpoints in the 1,000 mouse RILs from an eight-way cross. Thus, previous studies of genetic designs with multiple line crosses have shown improved power and mapping resolution over a single population. Nevertheless, their importance in genetic mapping is yet to be clearly demonstrated in crops.

Nested Association Mapping Populations
Linkage mapping focuses on the development of large families from two inbred lines to detect QTLs. However, slow progress has been made in identifying completely characterised QTLs because of limitations in the scope of allelic diversity and in the resolution of the available genetic resources. In particular, the poor resolution of QTLs is mainly due to the limited number of recombination events that occur during population development. Association mapping takes advantage of the extensive recombination accumulated over a long history, as linkage disequilibrium generally decays within 2 kb (see chapter 6). Nevertheless, because a large number of highly polymorphic molecular markers is required and population structure has confounding effects, whole-genome association analysis is difficult in crop plants. To circumvent these problems, a nested association mapping (NAM) population can be constructed to enable high power and high resolution by capturing the best features of both linkage and association mapping through joint linkage-association analysis. The genetic structure of the NAM population is a reference design of 25 families of 200 RILs per family. NAM has been successfully implemented in maize using the inbred B73 as the reference line (because of its use for the public physical map and for the maize sequencing project). The other 25 parents (called founder lines) were chosen independently of any specific phenotype and represented diverse germplasm lines (collected from all over the world to maximise the genetic diversity of the RIL families). The NAM strategy addresses complex trait dissection at a fundamental level by generating a common mapping resource to efficiently exploit genetic, genomic and systems biology tools. The original procedure proposed by McMullen et al. (2009) involves the following steps: (a) selecting diverse founders and developing a large set of related mapping progenies (preferably RILs for robust phenotypic trait collection), (b) either sequencing completely or densely genotyping the founders, (c) genotyping a smaller number of tagging markers on both the founders and the progenies to define the inheritance of chromosome segments and to project the high-density marker information from the founders to the progenies, (d) phenotyping the progenies for various complex traits and (e) conducting genome-wide association analysis relating the phenotypic traits to the projected high-density markers of the progenies. When compared with the conventional linkage mapping procedure, NAM has the advantages of (1) lower sensitivity to genetic heterogeneity, (2) higher power, (3) higher efficiency in using the genome sequence or dense markers and (4) maintaining high allele richness due to diverse founders.
Thus, NAM aims to create an integrated mapping population specifically designed for a full genome scan with high power for detecting QTL with different effects. In NAM, each RIL represents a mosaic of chromosome segments derived from either one of the diverse founders or the common parent. Using the scores of common parent-specific markers (markers for which the reference line has rare alleles) in the RILs, the marker or sequence information nested between two flanking common parent-specific markers can be predicted for the RILs on the basis of the marker or genome sequence information available for the founders. When diverse founders are chosen, the linkage disequilibrium within these chromosome segments, which results from historical or evolutionary recombination, is mostly preserved in the RILs, owing to the small probability of recombination within the short genetic distances between flanking common parent-specific markers. The potentially confounding effects of genes outside the specific segment being tested are minimised across the whole set of RILs through the reshuffling of the parental genomes by recent recombination during RIL development. The immortal mapping populations reported in publications have a maximum of 400 lines, which limits their mapping power and their coverage of allelic diversity. Further, because of genetic heterogeneity, QTL mapped in a single two-parent population often have little application to QTL segregating in other populations, which limits the scope of inference of QTL studies and the use of MAS in crops. In NAM, the polymorphisms within the tagging molecular markers can be tested more directly, because high-density markers can be obtained for the founders and this information can be projected onto the progeny through flanking common parent-specific markers. Thus, rather than inferring multiple alleles at each testing locus as in previous methods, NAM reduces the testing to exact biallelic contrasts across the whole population. Therefore, the advantages of designed mapping populations from linkage analysis and of high resolution from association mapping are integrated in NAM through the development of a large number of RILs from diverse founders. While common parent-specific markers allow the transmission of chromosome segments in RILs to be predicted, the short range of linkage disequilibrium within these segments across the diverse founders enables improved mapping resolution. The genetic background effect of the parental founders on mapping individual QTL, which is a limiting factor for association mapping, is systematically reduced by reshuffling the genomes of the two parents of each cross during RIL development as well as by the combined analysis of all the RILs across all 25 crosses. At the same time, a balanced design with well-chosen diverse founders in NAM, if possible for a particular species, would provide higher power and finer resolution than exploiting an existing pedigree. Further, as in association mapping, the mapping resolution offered by NAM largely depends on the linkage disequilibrium among the founder individuals. Rapid decay of linkage disequilibrium, within 2 kb, has been noticed across genetically diverse germplasm. Given the diversity of the founders and the rapid decay of linkage disequilibrium within 2 kb, mapping resolution for NAM is expected to be high.
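
The idea of projecting founder marker information onto a RIL through flanking common parent-specific markers can be sketched as follows. This is a deliberately simplified illustration and not the procedure used in the published NAM analyses: it assumes one chromosome, tagging marker positions sorted in ascending order, parent-of-origin calls coded as "R" (reference line) or "F" (founder), and it leaves a dense marker unassigned (None) whenever the two flanking tagging markers disagree. All names and data are hypothetical.

```python
from bisect import bisect_right


def project_snps(tag_positions, tag_origins, snp_positions,
                 ref_alleles, founder_alleles):
    """Project dense founder/reference alleles onto one RIL.

    tag_positions must be sorted; tag_origins holds "R" or "F" per tag.
    """
    projected = []
    for pos, ref, fnd in zip(snp_positions, ref_alleles, founder_alleles):
        right = bisect_right(tag_positions, pos)  # first tag beyond the SNP
        left = right - 1                          # last tag at or before it
        if left < 0 or right >= len(tag_positions):
            projected.append(None)  # SNP is not flanked on both sides
        elif tag_origins[left] != tag_origins[right]:
            projected.append(None)  # crossover between the flanking tags
        else:
            projected.append(ref if tag_origins[left] == "R" else fnd)
    return projected


# Hypothetical RIL with three tagging markers and five dense SNPs.
calls = project_snps(
    tag_positions=[10, 50, 90],
    tag_origins=["R", "R", "F"],
    snp_positions=[5, 20, 40, 70, 95],
    ref_alleles=["A", "C", "G", "T", "A"],
    founder_alleles=["G", "T", "G", "C", "C"],
)
print(calls)  # [None, 'C', 'G', None, None]
```

In practice the intervals containing a crossover would be handled probabilistically rather than discarded; the sketch only shows why short intervals between tagging markers make the projection reliable.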

Natural Populations
The main limitations of experimental mapping populations are that they are laborious and time consuming to construct and require great care and effort. The natural variation existing among individuals of one species can also be exploited for genetic mapping. In the case of crops, germplasm entries consisting of different breeding materials and wild species can fulfil this purpose. It has been shown that such natural populations can be used to map complex traits that are influenced by the action of many genes in a quantitative way. However, it is important that such a collection of germplasm accessions contains the whole range of phenotypes for a given trait. More importantly, the availability of extreme phenotypes of interest is valuable. The basic premise of this idea is that genomic fragments naturally present in a particular genotype are transmitted as non-recombining blocks and that molecular markers can easily follow the inheritance of such blocks. These blocks are called haplotypes, and their existence reveals a state of linkage disequilibrium (LD) among allelic variants of tightly linked genes (explained in detail in chapter 6). An association between a marker and a trait exists if one marker allele or haplotype is significantly associated with a particular phenotype when studied in unrelated genotypes (such as a natural population). The main strength of this approach is that it does not require the construction of mapping populations. Particularly for self-pollinating crops, inbred individuals of natural ecotypes are practically immortal, and phenotyping needs to be performed only once. In addition, natural populations are particularly informative because usually more than two alleles exist for each marker locus. Since unrelated natural populations are genetically separated by many generations, the correspondingly large number of meiotic events leads to a high rate of recombination. Therefore, if LD blocks exist, the loci that influence the expression of a trait can be mapped with high precision (sometimes greatly exceeding the resolution of F2 populations). However, such an association study requires thorough statistical assessment of relatedness and population structure; the reasons for such analysis are given in chapter 6.

Chromosome-Specific Genetic Stocks for Linkage Mapping
Chromosome-specific tools or genetic stocks allow a segregating population to be genotyped in such a way that each chromosome is directly scanned for linkage. There are several such tools; one kind is mutant lines carrying one or more visible, mapped mutations. As stated earlier, the distances in genetic maps are based on recombination frequencies (refer to chapter 4 for details). However, recombination frequencies are not equally distributed over the genome. For example, reduced recombination frequencies are usually noticed in heterochromatic regions such as the centromeres. In such situations, cytogenetic maps can provide complementary information, since they are based on the fine physical structure of chromosomes. The chromosomes are visualised under a (fluorescence or phase contrast) microscope and can be characterised by specific staining patterns (e.g. Giemsa C-banding) or by morphological structures such as the centromeres, the nucleolus-organising regions (NOR), the telomeres and knobs (heritable heterochromatic regions of particular shape). Cytogenetic maps provide information on the association of linkage groups with chromosomes and on the orientation of the linkage groups with respect to chromosome morphology. It is worth mentioning here that anonymous molecular markers (see chapter 3) are assigned to particular chromosomes based on such cytogenetic stocks. In several crops, lines carrying chromosome deletions, translocation breakpoints or monosomics/trisomics/nullisomics have been generated for this purpose. Thus, numerical aberrations in chromosome number, together with marker data, can clearly help in the identification of chromosomes.
Alternatively, defined translocation breakpoints can be used to localise probes to specific regions on the chromosome arms by using techniques that localise nucleic acids in situ on the chromosomes. At the pachytene stage (during meiotic prophase), the chromosomes are generally 20 times longer than at mitotic metaphase. At this stage, chromosomes display a differentiated pattern of brightly fluorescing heterochromatin segments. It is possible to identify all chromosomes based on chromosome length, centromere position, heterochromatin patterns and the positions of repetitive sequences (such as 5S rDNA and 45S rDNA) using fluorescence in situ hybridisation (FISH). Recent refinements in multicolour FISH even allow the mapping of single-copy sequences. Thus, cytogenetic maps developed using FISH can provide complementary information for the assembly of a physical map by positioning bacterial artificial chromosome (BAC) clones and other DNA sequences along the chromosomes (discussed in detail in chapter 7).

Bulk Segregant Analysis


Besides the above-mentioned populations, the bulk
segregant analysis (BSA) approach is frequently used in gene tagging or in identifying
major QTLs. BSA is based on the principle of
isogenic lines; the concept was introduced
by Michelmore et al. (1991) in lettuce for identifying
markers linked to genes for downy mildew resistance.
In BSA, two parents (say, a resistant and a susceptible one) that show a high degree of
molecular polymorphism and contrast for the
target trait are crossed, and the F1 is selfed to generate an F2 population. In the F2, individual plants are
phenotyped for resistance and susceptibility.
Usually, the DNA isolated from ten plants in
each group is pooled to constitute the resistant and
susceptible bulks. The resistant parent, susceptible
parent, resistant bulk and susceptible bulk are
surveyed for polymorphism using molecular
markers. A marker showing polymorphism
between the parents as well as between the bulks is considered
putatively linked to the target trait and is then used for mapping with individual F2
plants. Conceptually, the genetic constitution
of the two bulks is similar except for the genomic
region associated with the target trait. Hence,
they serve the purpose of isogenic lines in principle. It has been observed over experiments
that when ten plants are sampled in each group
for constituting the bulk, the probability of a
polymorphic marker (between parents as well
as bulks) not being linked to the target trait is
extremely low. Hence, usually ten plants are
used for constituting the bulks. However, this
number may vary depending upon the type of
mapping population used. Using BSA, markers can be reliably identified in a 0- to 25-cM
window on either side of the locus of interest.
Further, this method can be applied iteratively,
in the sense that new bulks can be constructed
based on each new marker that is linked more
closely to the gene. The linkage of each marker
with the tagged locus is verified by analysing
single plants of the segregating populations.
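
The claim that ten plants per bulk make a false positive very unlikely can be illustrated with a small calculation. The sketch below is not taken from the text; it assumes a dominant marker that is unlinked to the target locus, an F2 population in which a plant lacks the band with probability 1/4, and bulks drawn at random with respect to that marker. Under those assumptions a bulk of n plants shows no band with probability (1/4)^n, and the marker looks polymorphic between bulks when one bulk lacks the band and the other has it. The function name is hypothetical.

```python
from fractions import Fraction


def false_positive_probability(n: int) -> Fraction:
    """Chance that an UNLINKED dominant marker appears polymorphic
    between two F2 bulks of n plants each (illustrative assumptions:
    F2 population, band absent only in recessive homozygotes)."""
    if n < 1:
        raise ValueError("bulk size must be at least 1")
    p_bulk_without_band = Fraction(1, 4) ** n      # all n plants lack the band
    p_bulk_with_band = 1 - p_bulk_without_band     # at least one plant has it
    # Either bulk may be the one without the band, hence the factor 2.
    return 2 * p_bulk_without_band * p_bulk_with_band


if __name__ == "__main__":
    for bulk_size in (2, 5, 10, 15):
        print(bulk_size, float(false_positive_probability(bulk_size)))
```

With these assumptions the probability falls to roughly two in a million at ten plants per bulk, which is consistent with the practice of pooling about ten individuals.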

Combining Markers and Populations


The genetic segregation ratio at a marker locus is
jointly determined by the nature of the marker (dominant/co-dominant; see chapter 3 for definition
and details) and the type of mapping population
(Table 2.1). Therefore, a thorough understanding
of the nature of markers and mapping populations
is crucial for any mapping project. Mapping
populations such as RILs and DHs equalise
marker types because parental alleles are fixed
at the marker locus in the homozygous condition. These
populations result in a 1:1 segregation ratio at
a marker locus irrespective of the genetic nature of
the marker, while an F2 population segregates in a
1:2:1 ratio for a co-dominant marker and in a 3:1
ratio for a dominant marker. Depending upon the
segregation pattern, the statistical analysis of marker
data will vary.
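
As an illustration of how the marker type and population type fix the expected ratio, the following sketch stores the ratios stated above in a lookup table and checks observed counts against them with a chi-square goodness-of-fit statistic. The function names and the example counts are hypothetical; the critical values are the standard 5% points of the chi-square distribution for 1 and 2 degrees of freedom.

```python
# Expected segregation ratios as stated in the text above.
EXPECTED_RATIOS = {
    ("F2", "co-dominant"): (1, 2, 1),
    ("F2", "dominant"): (3, 1),
    ("RIL", "co-dominant"): (1, 1),
    ("RIL", "dominant"): (1, 1),
    ("DH", "co-dominant"): (1, 1),
    ("DH", "dominant"): (1, 1),
}

# 5% critical values of the chi-square distribution, keyed by degrees of freedom.
CRITICAL_5_PERCENT = {1: 3.841, 2: 5.991}


def chi_square(observed, ratio):
    """Goodness-of-fit statistic of observed class counts against a ratio."""
    total = sum(observed)
    expected = [total * part / sum(ratio) for part in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_expected_ratio(population, marker_type, observed):
    """Return (statistic, True if the counts are consistent with the ratio)."""
    ratio = EXPECTED_RATIOS[(population, marker_type)]
    if len(observed) != len(ratio):
        raise ValueError("number of genotype classes does not match the ratio")
    statistic = chi_square(observed, ratio)
    return statistic, statistic <= CRITICAL_5_PERCENT[len(ratio) - 1]


if __name__ == "__main__":
    # Hypothetical counts for 100 F2 plants scored with a co-dominant marker.
    print(fits_expected_ratio("F2", "co-dominant", [27, 48, 25]))
    # Hypothetical counts for 100 RILs.
    print(fits_expected_ratio("RIL", "dominant", [70, 30]))
```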


Characterisation of Mapping Populations
Precise genotypic and phenotypic characterisation
of the mapping population is vital for the success of any
mapping project. Since the molecular genotype of
any individual is independent of the environment, it is
not influenced by G × E interaction. However, the trait
phenotype can be influenced by the environment, particularly in the case of quantitative characters. Therefore, it becomes important to estimate
the trait value precisely by evaluating the genotypes
in multi-location tests over seasons and/or years
using immortal mapping populations in order to obtain a
valid marker-trait association.

Choice of Mapping Populations


It is evident from the foregoing discussion that
the short-term mapping populations such as F2,
backcross and conceptual near isogenic lines
developed through BSA approach can be a good
starting point in molecular mapping, while long-term mapping populations such as RILs, NILs
and DHs must be developed and characterised
properly with respect to the traits of importance
for global mapping projects. As a matter of fact,
the development and phenotypic characterisation
of mapping populations should become an integral part of the ongoing breeding programs in
important crops. At this point, the role of geneticists and plant breeders becomes crucial to reap
the benefits of genetic mapping.

Challenges in Mapping Population Development and Solutions to These Challenges
As described in chapter 1, a loss in genetic diversity inevitably causes problems in breeding
new varieties, and this has been repeatedly shown
in several crops (well-known examples are tomato
and cotton). This erosion in genetic diversity
created a bottleneck. Breeding methods such as
single-seed descent and pedigree selection also
promote genetic uniformity. In self-compatible
species, an even further decrease in genetic diversity
can be expected, since the mode of reproduction
plays a major role in the maintenance of
genetic variability. In such cases, the use of landraces
that are not genetically uniform is one option to
increase genetic polymorphism and is essential
for introducing new genetic factors into the
breeding pool of the crop.
Another problem that
is often found in genetic mapping is distorted
segregation. A significant deviation from the expected
segregation ratio in a given marker-population
combination is referred to as segregation distortion. There are several reasons for segregation
distortion, including gamete/zygote lethality,
meiotic drive/preferential segregation, sampling/
selection during population development and
differential responses of parental lines to tissue
culture in the case of DHs (see
chapter 4 for details). Segregation distortion can also be
specific to some markers in an otherwise normal mapping population. It is common
in plants that one allelic class is underrepresented due to dysfunction of the gametes concerned. This can occur in pollen, in megaspores or
in both. It can be explained either by the
selective abortion of male and female gametes or
by the selective fertilisation of particular gametic
genotypes. A selection process during seed development, seed germination and plant growth can
also be a causative agent. Gametophyte loci leading to distorted segregation have been identified
in rice and other crops. They are supposed to
be responsible for the partial or total elimination of gametes carrying one of the parental
alleles. Thus, a marker locus linked to a gametophyte locus, also referred to as a gamete eliminator or pollen killer, can also show distorted
segregation. Self-incompatibility loci that prevent self-pollination are another important
direct cause of distorted segregation. Therefore,
breeding programs that aim at the generation of
specific recombinants are directly affected if one
locus is close to a region affected by segregation
distortion.
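
How selective elimination of one class of pollen distorts an F2 ratio can be worked through with a simple model. The sketch below is an illustration under stated assumptions rather than a method from the text: a single locus with alleles A and B, female gametes transmitted 1:1, and pollen carrying B transmitted with a relative fitness t (t = 1 means no selection, t = 0 means complete elimination). The function name and the parameter are hypothetical.

```python
from fractions import Fraction


def f2_frequencies(pollen_fitness_b):
    """Expected F2 genotype frequencies (AA, AB, BB) when pollen carrying
    allele B has relative fitness pollen_fitness_b; female gametes are 1:1."""
    t = Fraction(pollen_fitness_b)
    if t < 0:
        raise ValueError("relative fitness cannot be negative")
    male_a = 1 / (1 + t)          # share of functional pollen carrying A
    male_b = t / (1 + t)          # share of functional pollen carrying B
    female_a = female_b = Fraction(1, 2)
    aa = female_a * male_a
    ab = female_a * male_b + female_b * male_a
    bb = female_b * male_b
    return aa, ab, bb


if __name__ == "__main__":
    for fitness in (1, Fraction(1, 2), 0):
        print(fitness, f2_frequencies(fitness))
```

In this model the heterozygous class stays at one half while the two homozygous classes depart from one quarter, so a skew confined to the homozygotes is the pattern that male gametic selection alone would produce.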
Detection of QTLs is often limited by several
factors such as the genetic properties of QTLs, environmental effects, population size and experimental
error. Hence, it is desirable to independently confirm
QTL-mapping studies. Such confirmation studies
may involve independent mapping populations
constructed from the same parental genotypes or
from genotypes closely related to those used in the primary QTL-mapping study. Sometimes, larger population sizes
may also be used. Furthermore, some recent studies
have proposed that QTL positions and effects should
be evaluated in independent populations, because
QTL mapping based on typical population sizes
results in a low power of QTL detection and a large
bias of QTL effects. Unfortunately, due to constraints such as lack of research funding and time
and perhaps a lack of understanding of the need to
confirm results, QTL-mapping studies are rarely
confirmed. Validation of conserved QTLs across
populations has not been conclusive so far because
the majority of the QTL studies were
based on small and mortal (F2 or BC)
populations. As compared to F2 or BC populations, homozygous immortalised RILs constitute the preferred
material for QTL mapping in many crops. When n
pairs of genes segregate independently, the number
of different gametes is 2^n, while the number of possible genotypes in an F2 is 3^n; that is, with doubled
haploids or RILs, fewer individuals need to be
screened (and this is economically very important
when using molecular markers) to cover a similarly
wide spectrum of recombinants, and more accurate
estimates of the location of the QTL can be obtained
with less variance.
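
The difference between the number of gamete types (2^n, which is also the number of homozygous genotype classes possible in DHs or RILs) and the number of F2 genotypes (3^n) grows quickly with the number of segregating loci. A minimal sketch, with a hypothetical function name, tabulates both:

```python
def genotype_classes(n_loci: int) -> dict:
    """Number of genotype classes for n independently segregating loci."""
    if n_loci < 0:
        raise ValueError("number of loci cannot be negative")
    return {
        "DH_or_RIL": 2 ** n_loci,  # one homozygous class per gamete type
        "F2": 3 ** n_loci,         # two homozygotes plus the heterozygote per locus
    }


if __name__ == "__main__":
    print("loci  DH/RIL  F2")
    for n in (1, 2, 5, 10):
        classes = genotype_classes(n)
        print(f"{n:>4}  {classes['DH_or_RIL']:>6}  {classes['F2']}")
```

For ten loci this gives 1,024 classes in DHs or RILs against 59,049 in an F2.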
For RILs or DHs, the power of detecting a given
quantitative trait locus is clearly related to its relative contribution to the heritability of the character
(refer to chapter 5). The power of the test was about
90% for heritabilities of QTL. To obtain a similar
power for backcrosses, the heritability attributable
to the individual quantitative trait locus should be
around 14%. For a given type of gene action, it
seems that DHs have a similar power to an F2.
However, if dominance is present, DHs or RILs
will only detect the additive component of a
particular quantitative trait locus. This could be
very important for QTL showing overdominant
(or pseudo-overdominant) effects. The major technical advantage for DHs or RILs, independent of
any effect of replication on the required number of
offspring, lies in the fact that the lines can be reproduced
independently and continuously evaluated
with respect to additional quantitative traits and
markers with all the information being cumulative.
If the effect of replication is taken into account,
replicated progenies can bring about a major
reduction in the number of lines that need to be
scored. Reductions are greatest when heritability
of the trait is low, under the assumption of codominance at all QTL. In this situation of low heritability, MAS is much more efficient when
compared with phenotypic selection.
RILs have not been widely utilised in crops
except in some cases, mainly due to long development timelines and difficulties in production
of sufficient seeds. Though there is no clear rule
for the precise population size that is required
for QTL analysis, it is increasingly believed that
sampling limited numbers of progeny (say <200)
in mapping studies tends to cause the skewed
distribution of QTL effects and identification of
limited number of QTLs, even if many genes
with equal and small effects actually control
the trait. Further, in several published reports,
the number of linkage groups exceeds the gametic
chromosome number, and numerous linkage groups
are yet to be associated with specific chromosomes, mainly due to a lack of informative markers and the use of small sample sizes. In most of the
published genetic maps, the markers were not
uniformly spaced over many linkage groups. This
uneven spacing is attributed to regions that may be heterochromatic or gene rich. Clusters of markers with
very limited recombination are frequently
present, which may be indicative of QTL-rich
(gene-rich) regions.
Consideration must be given to the source
of parents (adapted vs. exotic) used in developing a mapping population. Chromosome pairing
and recombination rates can be severely disturbed (suppressed) in wide crosses, which generally yield greatly reduced linkage distances.
Wide crosses will usually provide segregating
populations with a relatively large array of
polymorphism when compared to progeny segregating in a narrow cross (adapted × adapted).
To have significant value in a crop improvement
program, a map made from a wide cross must
be collinear (i.e. the order of loci should be similar) with a map constructed using adapted
parents.
Thus, before starting a mapping population
development program, the points discussed above
need to be critically evaluated depending on the
type of crop pollination, nature of marker types,
availability of resources, genetics of the trait under investigation, etc.

Bibliography
Literature Cited
Broman KW (2005) The genomes of recombinant inbred
lines. Genetics 169:1133-1146
Frisch M, Melchinger EA (2008) Precision of recombination frequency estimates after random intermating
with finite population sizes. Genetics 178:597-600
McMullen MD et al (2009) Genetic properties of maize
nested association mapping population. Science
325:737-740
Michelmore RW, Paran I, Kesseli RV (1991) Identification
of markers linked to disease resistance genes by bulked
segregant analysis: a rapid method to detect markers in
specific regions by using segregating populations.
Proc Natl Acad Sci USA 88:9828-9832
Qin H, Guo W, Zhang YM, Zhang T (2008) QTL mapping
of yield and fiber traits based on a four-way cross population in Gossypium hirsutum L. Theor Appl Genet
117:883-894
Tanksley SD, Nelson JC (1996) Advanced backcross QTL
analysis: a method for the simultaneous discovery and
transfer of valuable QTLs from unadapted germplasm
into elite breeding lines. Theor Appl Genet 92:191-203
Yu J et al (2008) Genetic design and statistical power of
nested association mapping in maize. Genetics
178:539-551

Further Readings
McCouch SR, Kochert G, Yu ZH, Wang ZY, Khush GS,
Tanksley SD, Coffman RW (1988) Molecular mapping
of rice chromosomes. Theor Appl Genet 76:815-829
Rao SQ, Xu SZ (1998) Mapping quantitative trait loci for
ordered categorical traits in four-way crosses. Heredity
81:214-224
Xu S (1996) Mapping quantitative trait loci using four-way crosses. Genet Res 68:175-181

Genotyping of Mapping Population

Markers and Their Importance


The basic principle of plant breeding for genetic
improvement of crop plants relies mainly on the
selection of superior progenies from the available
population based on the traits of interest (such
as higher yield, improved nutritional quality,
appropriate colour or fragrance preferred by the
consumer). In general, such traits are not measured directly on the plants; instead, they are
assessed through other markers or tags that
are closely linked to the trait of interest. For
example, rice yield is determined by a higher number
of productive tillers, number of grains/spikelet,
etc.; other classical examples are traits such as
pea seed size, colour and plant height, used by
Mendel. Such tags, which are used to select the
superior progenies from a heterogeneous
population, are called markers.
These markers are useful in an array of plant
breeding and genetics studies including:
1. Genetic relatedness and diversity
2. Population genetics
3. Studying polymorphism in landraces, cultivars and germplasm
4. Identification of cultivars and taxonomy
5. Phylogenetic studies
6. Studying domestication and evolution
7. Gene flow and introgression
8. Comparative mapping
9. Gene mapping and identification
10. Genetic improvement of crop plants
11. Detecting somaclonal variation
12. Evaluating germplasm for useful genes
13. Pedigree analysis
14. Hybrid identification
The following section describes two different
classes of markers that have been used in plant
breeding from time immemorial, and the later part of
this chapter describes the importance of molecular
markers in characterising or genotyping mapping
populations for genetic and QTL mapping.

Morphological Markers
During the early days of plant breeding, breeders
used to cross and select the progeny based on
certain neutral characteristics, since these easily
recognisable characteristics most probably coincide with the specific expression of agronomically
and economically important traits. Therefore,
those visibly observable characteristics are
used to mark or tag the desired (or sometimes
undesired) progeny among the population, and
they are called phenotypic or morphological
markers. Genetically, their function is based on
linkage between the genes for the characteristics
and the agronomic trait. This concept of using
markers in genetics dates back to as early as the nineteenth century. Gregor Mendel used phenotype-based genetic markers in his experiments. It is also
interesting to note that phenotype-based
genetic markers in Drosophila led to the establishment of the theory of genetic linkage by
Alfred Henry Sturtevant at Dr. Morgan's laboratory
(the details of linkage mapping are discussed in
chapter 4). In 1913, A. H. Sturtevant generated
the first genetic map using six morphological
traits (termed 'factors' at that time) in the
fruit fly (Drosophila melanogaster). Similarly,
Karl Sax produced evidence for genetic linkage
between a qualitative and a quantitative trait (seed
colour and seed size, respectively) in the common bean (Phaseolus vulgaris).

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_3, Springer India 2013

To be useful in genetic analysis, a heritable morphological characteristic has to exist in at least two
alternative forms or phenotypes (e.g. the tall or short
stems studied by Mendel in pea plants). As a
consequence, several such visibly observable
phenotypes were used as markers to construct
genetic maps in earlier days. As stated, the first
genetic map was developed in the fruit fly and
showed the positions of body colour, eye colour,
wing shape and other such traits. However,
it was soon recognised that there were only a limited number of visual phenotypes; such
morphological markers are identified very infrequently
and are mostly not available in every segregating population. Further, in many cases, the genetic analysis
was intricate since a single phenotype could be
affected by more than one gene. Though morphological markers are easily examined, they are frequently affected by the environment. Some of them
appear late in plant development (e.g. flower colour),
making early scoring impossible. In addition, a
given morphological marker can affect other
morphological markers or traits of interest in breeding programs because of pleiotropic gene action.
Because of these features, morphological markers
found limited application in plant breeding. To
make genetic maps more comprehensive, it was
necessary to find phenotypes that were more distinctive and less complex. As innovations in protein
science developed, a new set of markers, namely,
protein or enzyme-based markers (isoenzymes), was
introduced in this context.

Biochemical Markers or Isozymes

Enzymes, a type of protein, usually act as catalysts,
and each enzyme is highly specific to a particular
biochemical reaction. The term isozymes was
coined by Markert and Møller in 1959 to describe
the different molecular forms of proteins that
exhibit the same enzymatic specificity. The terms
isozyme and isoenzyme have been used interchangeably. However, the Standing Committee on
Enzymes of the International Union of Biochemistry
prefers the term isoenzyme. Specific enzymes
isolated from different species may possess wide
variations in their physical and catalytic properties
and are called heteroenzymes, since these enzymes
originate from different species (hence, this term is
distinct from isoenzymes). The term
isoenzyme is restricted to those forms of an
enzyme with similar enzymatic activity occurring within a single species as a result of the presence of more than one structural gene. The multiple
forms of enzymes are also divided into two main
classes according to how they are coded: allozymes
(enzymes coded by different alleles at one gene
locus) and isozymes (enzymes coded by alleles at
more than one gene locus). However, the term
isoenzymes refers to both classes. Allozymes
are controlled by co-dominant alleles, which means
that homozygotes (both alleles at a locus are the same)
can be distinguished from heterozygotes (the parents
of the individual have contributed different alleles
to that locus). For monomeric enzymes (i.e. consisting of a single polypeptide), plants that are
homozygous at a given locus will produce one
band, whereas heterozygous individuals will produce two.
For dimeric enzymes, heterozygous individuals will produce three bands because random association of the
two polypeptides forms a hybrid dimer in addition to the two parental dimers. Multimeric enzymes also exist in which
the polypeptides are specified by different loci.
The formation of isozymic heteromers can thus
considerably complicate gel banding patterns
(discussed below). As per the definition, the following categories are regarded as isozymes: (1) genetically independent proteins arising from the presence
of multiple gene loci, (2) enzyme variants arising from
the occurrence of allelic genes at a particular gene locus
(these isoenzymes are called allelozymes) and
(3) heteropolymers (noncovalent hybrid molecules
of two or more different polypeptide chains).

Principle
The use of isoenzymes (also called protein markers) in plant breeding is based on the principle that allelic
variation exists for many different proteins.
For instance, two alleles of malic dehydrogenase
can perform the same enzymatic function, but the
electrophoretic mobility of the two may differ
(i.e. the proteins encoded by the two alleles would not migrate
to the same location in the gel). Thus, the procedure to identify isoenzyme variation is simple.
A crude protein extract is made from tissues
(such as leaves or flowers). The extracts are then
separated by electrophoresis in a starch or
polyacrylamide gel. The gel is then placed in a
solution that contains the reagents required for the
enzymatic activity of the isoenzyme that is
being investigated. Further, the solution contains
a dye that the isoenzyme converts into a
coloured product that stains the protein, and because
of this the allelic variants of the protein can be
visualised on the gel.
Since isoenzymes catalyse the same reaction,
they should be closely related forms of proteins.
Hence, it is possible to explain the specific
chemical and biological properties of individual
isoenzymes in terms of their physicochemical
structure. The amino acid sequence of an enzyme
is genetically predetermined and is built from the 20 standard amino acid
residues. It is a well-known fact that the amino
acid sequence (primary structure) of the polypeptide chain predetermines the secondary, tertiary
and quaternary structure of the protein. The spatial conformation dictated by the secondary, tertiary and quaternary structures is of paramount
importance in determining the specific and unique
properties of an enzyme. Hence, any change,
which would modify the secondary, tertiary or
quaternary structure, would thereby produce
different but closely related forms of an enzyme.
This may be due to some modification in protein
structure such as small changes in amino acid
sequence, amidation of carboxyl groups, conjugation with small molecules, polymerisation and
folding the same primary structure in different
ways. The spatial conformation of lipoproteins
and glycoproteins may also be modified by small
changes in the covalently linked prosthetic
groups. If isoenzymes are closely related forms
of a protein, it is necessary to propose a mechanism, which will permit structural variation in the
protein but allow retention of enzymatic activity.
It is now well established that there can be vast
regions of the protein unnecessary for enzyme
activity. Therefore, by modifying the nonessential region of the structure, a wide variety of protein molecules could exist with similar enzymatic
activities. It is also known that most of the
enzymes are made up of more than one polypeptide or subunits. Hence, variation in one subunit
may cause structural difference in the enzyme.
Further, random association of subunits may also
lead to isoenzyme formation.
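
The point that random association of subunits can generate several enzyme forms can be made concrete by enumeration. The sketch below is illustrative only: it assumes that subunit order within the enzyme does not matter and, for the proportions, that two subunit types are equally abundant and associate at random. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations_with_replacement
from math import comb


def multimer_forms(subunit_types, subunits_per_enzyme):
    """All distinct subunit compositions of a multimeric enzyme."""
    return ["".join(combo) for combo in
            combinations_with_replacement(subunit_types, subunits_per_enzyme)]


def random_association_proportions(subunits_per_enzyme):
    """Expected share of each form for two equally abundant subunit types
    that associate at random (binomial proportions)."""
    k = subunits_per_enzyme
    return [Fraction(comb(k, i), 2 ** k) for i in range(k + 1)]


if __name__ == "__main__":
    print(multimer_forms("AB", 2))              # forms of a dimer
    print(random_association_proportions(2))    # their expected shares
    print(multimer_forms("AB", 4))              # forms of a tetramer
```

For two subunit types, a dimer gives three forms (AA, AB, BB) in a 1:2:1 proportion, and a tetramer gives five forms.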
Numerous studies have shown that isoenzymes may arise as artefacts during the course
of purification. How, then, can one determine whether the
isoenzymes present in a tissue homogenate are
real or are artefacts that have arisen during isolation
and purification? The usual precaution against artefacts is to establish the existence
of isoenzymes by as many techniques as possible,
that is, different isolation procedures, assorted
purification techniques and different detection
methods. It is also essential to demonstrate that
the isoenzymes do not arise from one another
during the experimental procedures (discussed
below). Clear evidence for isoenzymes occurring
within the tissue may be obtained by demonstrating a definite structural difference or by showing
that the isoenzymes are synthesised under independent genetic control.
Diverse types of experimental protocols are
available to detect and distinguish isoenzymes.
However, each procedure has its own advantages
and limitations, and they are discussed hereunder.

Electrophoresis
Electrophoresis is presently the most powerful analytical technique available to separate isoenzymes.
Its scope of application has been broadened
tremendously in recent years by simplification of
the apparatus and by the development of synthetic
support media, which have shortened the time of
analysis. The theory underlying electrophoresis is
simple. Direct current is used to separate the individual isoenzymes by
taking advantage of the differences in the net charge
of each isoenzyme, which determine its electrophoretic mobility. Changes in electrophoretic
mobility may result from the substitution of a single
amino acid. Thus, the altered electrophoretic mobility reflects a change in the net charge of the protein
molecule, which occurs when the substituted amino
acid carries a charge, different from that of the one
it replaces. The most widely used electrophoretic
technique involves zymogram display of isoenzymes, which utilises zone electrophoresis followed by histochemical staining methods to locate
the zones of enzyme activity directly in the supporting medium (starch or polyacrylamide). While
the zymogram method is very sensitive and convenient, it must be noted that several sources of error
are inherent in the staining techniques. Since isoenzymes frequently differ in catalytic properties, certain isoenzymes may fail to react with the detecting
stain because they are not at optimal conditions
or they may have lower specific activity. Hence,
extreme caution must be exercised in interpreting
the electrophoretic results.
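
The effect of a single substitution on net charge can be sketched with a deliberately simplified model. The code below is an illustration, not a method from the text: it counts only fully charged side chains at neutral pH (aspartate and glutamate as -1, lysine and arginine as +1) and ignores histidine, the chain termini and the influence of the local environment on pKa values. The peptide sequence and the function names are hypothetical.

```python
# Simplified side-chain charges at neutral pH (assumption: all other residues are 0).
SIDE_CHAIN_CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def net_charge(sequence: str) -> int:
    """Approximate net charge of a peptide from its one-letter sequence."""
    return sum(SIDE_CHAIN_CHARGE.get(residue, 0) for residue in sequence.upper())


def charge_shift(sequence: str, position: int, new_residue: str) -> int:
    """Change in net charge when the residue at a 0-based position is replaced."""
    mutated = sequence[:position] + new_residue + sequence[position + 1:]
    return net_charge(mutated) - net_charge(sequence)


if __name__ == "__main__":
    peptide = "MKTEELLK"                     # hypothetical example sequence
    print(net_charge(peptide))               # starting net charge
    print(charge_shift(peptide, 3, "K"))     # glutamate to lysine: charge changes
    print(charge_shift(peptide, 5, "V"))     # leucine to valine: charge unchanged
```

A substitution between two uncharged residues leaves the net charge unchanged in this model, which is why such variants can go undetected by electrophoresis.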

Chromatography
Chromatographic techniques represent the second
most powerful tool available for the separation of
isoenzymes. This approach is desirable when it is
necessary to isolate isoenzymes as a preparative
step. When separating isoenzymes by chromatographic procedures, it is necessary to establish
their validity as distinct chromatographic entities. This
can be achieved by re-chromatographing the isolated peaks under the original conditions; each peak should then emerge
at its original position in the effluent profile.
However, the appearance of false components is
a characteristic feature of protein chromatography, and it is for this reason that most investigators use a variable gradient device for eluting
proteins from a column.

Gel Filtration
Gel filtration or molecular sieving is carried out
on various cross-linked dextran polymers (e.g.
Sephadexes) or cross-linked polyacrylamide
polymers (e.g. Bio-Gel P). The use of dextran
gels offers a convenient way of determining
the molecular weights of many proteins, even
though anomalous results may be obtained if the
protein forms a complex with the gel or contains
an appreciable amount of carbohydrate. It has
also been reported that gel filtration is unsatisfactory for estimating the molecular weight of glycoproteins. A number of cases have been reported
where enzyme fragments produced by proteolytic
digestion still retain part of their enzymatic
activity. These observations illustrate the
possibility that, in some instances, isoenzymes
may arise from endogenous proteolytic action
during enzyme purification. Gel filtration would
appear to provide a convenient technique to
detect this type of artefact. Gel filtration may
also be employed to examine the possibility that
an isoenzyme has arisen as a result of the
dissociation of one or more subunits from the
parent enzyme.

Immunochemistry
When a foreign protein, an antigen, is injected
into a suitable animal, the animal produces a
specific protein called an antibody. The antibody may combine with the antigen to produce
a visible precipitate. Antibodies are highly
specific in their activity. Injection of a homogeneous isoenzyme can result in the formation
of a single type of antibody, which gives no
precipitation reaction with other isoenzymes.
If one isoenzyme is shown to be immunologically different from another, it can be said
unequivocally that the two are structurally different. On the other hand, if two isoenzymes
give the same immunological reaction to a
given antibody, it can only be said that they
may be identical. The immuno-electrophoresis
technique combines the principles of zone
electrophoresis with those of immunochemical
analysis, thus making it possible to establish
the immunochemical relationship between
electrophoretically dissimilar components.
Selective staining may be carried out in the gel
medium, thereby adding a third dimension to
the analysis.

Catalysis
Isoenzymes may differ from one another in a
variety of catalytic and related properties, including affinity
for substrates, behaviour towards coenzyme analogues, sensitivity to inhibitors or denaturing
agents, amino acid composition and sequence,
pH optimum and isoelectric point (pI), thermal stability, Vmax and/or regulatory properties and specific
activity. In order to investigate the catalytic properties of isoenzymes, it is imperative that each
be purified to a state approaching homogeneity. Since this criterion is difficult to achieve,
reliable studies in this area of investigation are
limited.
Obvious limitations of the above procedures for isoenzyme detection are the need to develop
reagent systems (so far, nearly 50 different reagent
systems have been developed, and each plant
species requires a specific modification) and tissue
variability (some enzymes are better expressed in
roots, whereas others are best sampled in leaves).
About 90 isoenzyme systems have been used for
plants, with isozyme loci being mapped in many
cases. As a consequence of this small number
of isozymes, the percentage of genome coverage is
inadequate for a thorough study of genetic diversity. Further, since differential expression of the
genes may occur at different developmental stages
or in different tissues, the same type of material
must be used for all experiments. Other issues
in interpreting isoenzyme banding patterns are
the quaternary structure of the enzymes (whether
monomeric, dimeric, etc.), whether the plant is
homozygous or heterozygous at each gene locus,
the number of gene loci, the number of alleles per
locus and how the genes are inherited. In order to
overcome these limitations, the next-generation
markers, namely, DNA or molecular markers, were
introduced after the 1950s, once the properties of nucleic acids had been elucidated
during this period. Of late, the paramount role of
molecular markers in plant breeding (when compared to the other two types of markers) has been
documented in almost all crop plants. Before
getting into the basics and details of molecular
markers, it is appropriate to introduce genome
structure and organisation in crop plants.

Genome Structure and Organisation

The genome is the sum of the entire DNA of an
individual or a species. It includes the entire
DNA, not just the genes. For simple viruses, with
a single nucleic acid molecule, the genome is
obvious, although of course for RNA viruses, it is
RNA rather than DNA. For haploid prokaryotes,
it is also straightforward, except for the plasmids,
one copy of which is counted in the genome. For
eukaryotes, one haploid copy of the DNA of each
of the diploid pairs of chromosomes (the autosomes) is included, plus one copy of the DNA of
sex chromosomes. Thus, the female and male
genomes will differ if there is a difference in sex
chromosomes. One copy of the DNA from any
organelles other than the nucleus, such as the
mitochondria and chloroplasts, should also be
included. The majority of the DNA of a genome
is not in the genes themselves or their known
associated regulatory sequences. While the phenomenon of gene regulation is beginning to be
understood, little is known of the significance of
the majority of the non-genic DNA or of whether it
has any function other than acting as a spacer
between genes. In most species, a large fraction
of the DNA is repeated sequences that cause
genetic recombination and unequal crossing over
(discussed in chapter 4), resulting in genomic
rearrangements, but their overall significance is
not understood.
The main role of the genome is providing gene
products, but in many genomes, only 1% or so of
the DNA is transcribed and translated during normal cellular activities. Striking evidence suggests
that the actual coding capacity is likely to be relatively constant among plants. For example, when
the genomes of Arabidopsis and maize were
compared with the sequence information obtained
from cDNAs, it indicated that both genomes code
essentially the same number of genes, although
the genome sizes differ by two orders of magnitude. Similarly, maize and sorghum are closely
related plants, and both have ten chromosomes,
but the maize genome is more than three times
the size of that of sorghum. When DNA fragments from maize were used in hybridisation

analyses with sorghum sequences, homology was


shared predominantly by low copy number
sequences and unique sequences. In fact, several
of the genes in sorghum show the same chromosomal arrangement as their counterparts in maize.
From these and similar analyses, the extra DNA
that accounts for the difference in maize and sorghum genome size apparently comprises mostly
non-coding repetitive sequences between genes.
This finding supports the conclusion that the
majority of nuclear DNA may play a supporting
role in the structure and organisation of the
genome but does not contribute directly to its
protein-coding capacity.
The size of the nuclear genome varies among
organisms. The DNA content of haploid eukaryotic cells (referred to as the C value) ranges from
10⁷ to 10¹¹ base pairs (bp). Although it has been
assumed that organism complexity correlates
roughly with genome size (humans have larger
genomes than most insects, and insects have
larger genomes than fungi), this correlation is by
no means universal. For example, some amphibians have genomes almost 50 times larger than
that of humans, and cartilaginous fish generally
have larger genomes than bony fish. The lack of a
direct relationship between genome size and
organism complexity is called the C-value paradox. We have no satisfactory explanation yet for
the C-value paradox, but in plants, at least, we
know that genome size can to some degree be
attributed to repetitive DNA and duplicated
genomes (due to polyploidy).
The general nature of the eukaryotic genome
can be assessed by the kinetics with which denatured DNA reassociates. The progress of the reassociation reaction depends on the product of the initial DNA concentration (C₀)
and the time of incubation (t), usually described
simply as the C₀t value. A useful parameter is
derived by considering the conditions when the
reaction is half complete, at time t₁/₂. The value
required for half reassociation is called the C₀t₁/₂.
Since the C₀t₁/₂ is the product of the concentration and the time required to proceed halfway, a
greater C₀t₁/₂ implies a slower reaction, because
each sequence takes longer to find a complementary partner. The C₀t₁/₂
of a reaction therefore indicates the total length
of different sequences that are present. This is
described as the complexity, usually given in
base pairs. The renaturation of the DNA of any
genome (or part of a genome) should display a
C₀t₁/₂ that is proportional to its complexity. Thus,
the complexity of any DNA can be determined
by comparing its C₀t₁/₂ with that of a standard
DNA of known complexity. Usually E. coli DNA
is used as a standard. Its complexity is taken to
be identical with the length of the genome (implying
that every sequence in the E. coli genome of
4.2 × 10⁶ bp is unique).
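The comparison with a standard DNA described above amounts to a simple proportion. The following is a minimal sketch, assuming that both DNAs are reassociated under identical conditions; the function name and the two half-reassociation readings are hypothetical, and only the E. coli genome size is taken from the text.

```python
# Sketch: estimate sequence complexity from half-reassociation values.
# Assumption: Cot1/2 is directly proportional to complexity and both DNAs
# were measured under the same reassociation conditions.

E_COLI_COMPLEXITY_BP = 4.2e6  # E. coli genome size quoted in the text


def complexity_from_cot_half(cot_half_sample, cot_half_standard,
                             standard_complexity_bp=E_COLI_COMPLEXITY_BP):
    """Return the complexity (bp) of a sample DNA relative to a standard."""
    if cot_half_sample <= 0 or cot_half_standard <= 0:
        raise ValueError("Cot1/2 values must be positive")
    return standard_complexity_bp * cot_half_sample / cot_half_standard


# Hypothetical readings: the sample reassociates 100 times more slowly
# than the E. coli standard.
estimate = complexity_from_cot_half(cot_half_sample=400.0, cot_half_standard=4.0)
print(f"Estimated complexity: {estimate:.2e} bp")
```

A sample whose half-reassociation value is 100 times that of the standard is therefore assigned 100 times the complexity of the standard.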
From the perspective of genetics, a major difference between prokaryotic and eukaryotic cells
is that a eukaryote has a nuclear envelope, which
surrounds the genetic material to form a nucleus
and separates the DNA from the other cellular
contents. In prokaryotic cells, the genetic material is in close contact with other components of
the cell, a property that has important consequences for the way in which genes are controlled. Another fundamental difference between
prokaryotes and eukaryotes lies in the packaging
of their DNA. In eukaryotes, DNA is closely
associated with a special class of proteins, the
histones, to form tightly packed chromosomes.
This complex of DNA and histone proteins is
termed chromatin, which is the stuff of eukaryotic chromosomes. Histone proteins limit the
accessibility of enzymes and other proteins that
copy and read the DNA, but they enable the DNA
to compactly fit into the nucleus. Eukaryotic
DNA must separate from the histones before the
genetic information in the DNA can be accessed.
However, prokaryotes do not possess histones, so
their DNA does not exist in the highly ordered,
tightly packed arrangement found in eukaryotic
cells. The copying and reading of DNA are therefore simpler processes in eubacteria.
Genes of prokaryotic cells are generally on a
single, circular molecule of DNA, the chromosome
of the prokaryotic cell. In eukaryotic cells, genes are
located on multiple, usually linear DNA molecules
(multiple chromosomes). Eukaryotic cells therefore
require mechanisms that ensure that a copy of each
chromosome is faithfully transmitted to each new
cell. This generalisation (a single, circular chromosome in prokaryotes and multiple, linear chromosomes in eukaryotes) is not always true. A few
bacteria have more than one chromosome, and
important bacterial genes are frequently found on
other DNA molecules called plasmids. Furthermore,
in some eukaryotes, a few genes are located on circular DNA molecules found outside the nucleus,
such as in mitochondria and chloroplasts.
Each eukaryotic species has a characteristic
number of chromosomes per cell: potatoes have
48 chromosomes, fruit flies have 8 and humans
have 46. There appears to be no special relationship
between the complexity of an organism and its
number of chromosomes per cell. In most eukaryotic cells, there are two sets of chromosomes. The
presence of two sets is a consequence of sexual
reproduction; one set is inherited from the male
parent and the other from the female parent. Each
chromosome in one set has a corresponding chromosome in the other set, together constituting a
homologous pair.

Chromosome Structure
The chromosomes of eukaryotic cells are larger
and more complex than those found in prokaryotes,
but each unreplicated chromosome nevertheless
consists of a single molecule of DNA. Although
linear, the DNA molecules in eukaryotic chromosomes are highly folded and condensed; if stretched
out, some human chromosomes would be several
centimetres long, thousands of times longer than
the span of a typical nucleus. To package such a
tremendous length of DNA into this small volume,
each DNA molecule is coiled again and again and
tightly packed around histone proteins, forming the
rod-shaped chromosomes. Most of the time, the
chromosomes are thin and difficult to observe, but
before cell division, they condense further into
thick, readily observed structures; it is at this stage
that chromosomes are usually studied under genetic
mapping.
A functional chromosome has three essential elements: a centromere, a pair of telomeres
and origins of replication. The centromere is
the attachment point for spindle microtubules,
which are the filaments responsible for moving
chromosomes during cell division. The centromere appears as a constricted region that

often stains less strongly than does the rest of


the chromosome. Before cell division, a protein complex called the kinetochore assembles
on the centromere, to which spindle microtubules later attach. Chromosomes without a
centromere cannot be drawn into the newly
formed nuclei; these chromosomes are lost,
often with calamitous consequences to the cell.
Telomeres are the natural ends, the tips, of a
linear chromosome; they serve to stabilise the
chromosome ends. If a chromosome breaks,
producing new ends, these ends have a tendency to stick together, and the chromosome is
degraded at the newly broken ends. Telomeres
provide chromosome stability. Origins of replication are the sites where DNA synthesis
begins; they are not easily observed by microscopy. In preparation for cell division, each
chromosome replicates, making a copy of itself.
These two initially identical copies, called sister
chromatids, are held together at the centromere. Each sister chromatid consists of a single
molecule of DNA.

Mitochondrial DNA
In animals and most fungi, the mitochondrial
genome consists of a single, highly coiled, circular DNA molecule (mtDNA). Plant mitochondrial
genomes often exist as a complex collection of
multiple circular DNA molecules. Each mitochondrion contains multiple copies of the mitochondrial genome, and a cell may contain many
mitochondria. Like eubacterial chromosomes,
mtDNA lacks the histone proteins normally associated with eukaryotic nuclear DNA. The guanine-cytosine (GC) content of mtDNA is often
sufficiently different from that of nuclear DNA
that mtDNA can be separated from nuclear DNA
by density gradient centrifugation. Mitochondrial
genomes are small compared with nuclear
genomes and vary greatly in size among different
organisms. Most of this size variation is in
non-coding sequences such as introns and intergenic regions. Flowering plants (angiosperms)
have the largest and most complex mitochondrial
genomes known; their mitochondrial genomes

range in size from 186,000 bp in white mustard


to 2,400,000 bp in muskmelon. Even closely
related plant species may differ greatly in the
sizes of their mtDNA. Part of the extensive size
variation in the mtDNA of flowering plants can
be explained by the presence of large direct
repeats, which constitute large parts of the
mitochondrial genome. Crossing over between
these repeats can generate multiple circular
chromosomes of different sizes. The mitochondrial genome in turnip, for example, consists of a
master circle consisting of 218,000 bp that
has direct repeats. Homologous recombination
between the repeats can generate two smaller
circles of 135,000 bp and 83,000 bp. Other species contain several direct repeats, providing
possibilities for complex crossing-over events
that may increase or decrease the number and
sizes of the circles.
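The turnip example can be checked with a little arithmetic. The sketch below is a simplification that treats each direct repeat as a single point on the master circle; the repeat positions are hypothetical and were chosen only so that the result matches the circle sizes quoted above.

```python
# Sketch: sizes of the two subcircles produced when homologous recombination
# occurs between two direct repeats on a circular master chromosome.
# Assumption: repeats are treated as points; their positions are hypothetical.

def resolve_master_circle(master_bp, repeat_a, repeat_b):
    """Return the two subcircle sizes (bp), larger first."""
    if not (0 <= repeat_a < master_bp and 0 <= repeat_b < master_bp):
        raise ValueError("repeat positions must lie on the master circle")
    if repeat_a == repeat_b:
        raise ValueError("two distinct repeat positions are required")
    arc = abs(repeat_b - repeat_a)  # DNA between the repeats forms one circle
    return tuple(sorted((arc, master_bp - arc), reverse=True))


circles = resolve_master_circle(master_bp=218_000, repeat_a=0, repeat_b=135_000)
print(circles)       # the two smaller circles
print(sum(circles))  # together they account for the whole master circle
```

With more than two direct repeats, the same function can be applied to each pair of repeats in turn, which illustrates why the number and sizes of circles can vary so widely.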

Chloroplast DNA
Geneticists have long recognised that many
traits associated with chloroplasts exhibit cytoplasmic inheritance, indicating that these traits
are not encoded by nuclear genes. In 1963, chloroplasts were shown to have their own DNA.
Among different plants, the chloroplast genome
ranges in size from 80,000 to 600,000 bp, but
most chloroplast genomes range from 120,000
to 160,000 bp. Chloroplast DNA (cpDNA) is
usually contained on a single, double-stranded
DNA molecule that is circular, is highly coiled
and lacks associated histone proteins. As in
mtDNA, multiple copies of the chloroplast
genome are found in each chloroplast, and there
are multiple organelles per cell; so there are several hundred to several thousand copies of
cpDNA in a typical plant cell.

Molecular Markers
A molecular marker is defined as a particular
segment of DNA that differs among individuals
at the nucleotide level. Molecular markers may or
may not correlate with phenotypic expression of
a trait. Molecular markers offer numerous
advantages over conventional morphological
markers and isoenzymes. They are stable and
detectable in all tissues regardless of growth,
differentiation, development and status of the
cell. Further, they are not confounded by
environmental, pleiotropic and epistatic effects.
The publication of Botstein et al. (1980) on
the construction of genetic maps using restriction fragment length polymorphism (RFLP) was
the first report of a molecular marker technique for
the detection of DNA polymorphism. After the
invention of the polymerase chain reaction (PCR;
see Box 3.1), several PCR-based markers were
developed. Thus, the basic techniques used to identify such third-generation markers can be
classified into two categories: (1) non-PCR-based techniques or hybridisation-based techniques
and (2) PCR-based techniques. Depending on
the need and modifications in the techniques,
a second generation of advanced molecular markers has been developed, and they are discussed in the
following sections.
Though there are several marker techniques
available at this point, it is essential to consider
that an ideal molecular marker technique for
genetic mapping should meet at least the following criteria: it should (1) be polymorphic and evenly distributed throughout the genome; (2) provide
adequate resolution of genetic differences;
(3) generate multiple, independent and reliable
markers; (4) be simple, quick and inexpensive;
(5) need small amounts of tissue and DNA
samples; (6) have linkage to distinct phenotypes;
and (7) require no prior information about
the genome of an organism. Unfortunately, no
molecular marker technique is ideal for every
situation. Techniques differ from each other
with respect to important features such as
genomic abundance, level of polymorphism
detected, locus specificity, reproducibility, technical requirements and cost. The following
sections describe the principle of each marker
technique, advancement and applications in plant
breeding. Table 3.1 describes the comprehensive
view of marker techniques, their applications
and limitations. The details/features of each
marker class are furnished in Table 3.2.


Box 3.1 PCR

For DNA marker analysis, it is essential to
have a large quantity of a specific DNA fragment,
and scientists found it difficult to obtain such
quantities before the 1980s. It was Dr. Kary
Banks Mullis who solved this limitation in the form of the polymerase chain reaction
(PCR), and he received the Nobel Prize in Chemistry in 1993 for this invention. Since then,
PCR has been regarded as one of the scientific
techniques of the twentieth century with
immense potential, and it is now very hard to
find a molecular laboratory without a PCR
machine. PCR utilises the ability of DNA
polymerase to synthesise a new strand of DNA
complementary to the given template strand.
Since DNA polymerase can add a nucleotide
only onto a pre-existing 3′-OH group, it needs
a primer to which it can add the first nucleotide. This requirement makes it possible to
amplify the target region of the DNA template. At
the end of the PCR reaction, the specific target
DNA sequence (in our case, the marker region)
will have accumulated in billions of copies
(Fig. 3.1).

Fig. 3.1 Exponential amplification of target sequence using PCR (each cycle: denaturation at 94-96 °C, annealing at 30-55 °C and extension at 72 °C; 8 copies after the 2nd cycle, 16 after the 3rd and 2³⁶ after the 35th; 30-35 cycles in total)

How It Works?
The PCR reaction requires the following
components:
DNA Template: It is the sample DNA that
contains the target sequence. At the beginning
of the reaction, high temperature is applied to
the original double-stranded DNA molecule
to separate the strands from each other, and
this process is termed as denaturation.

DNA Polymerase: It is a type of enzyme
that synthesises new strands of DNA complementary to the template. The first and most
commonly used enzyme is Taq DNA polymerase (isolated from Thermus aquaticus).
Alternatively, Pfu DNA polymerase (obtained
from Pyrococcus furiosus) is used widely
because of its higher fidelity when copying
DNA. Although these enzymes are subtly
different, they both have two capabilities that
make them suitable for PCR: (1) they can
generate new strands of DNA using a DNA
template and primers and (2) they are heat
resistant. Generally, the DNA polymerases of
eukaryotes break down at temperatures well below
95 °C, the temperature necessary to separate
two complementary strands of DNA in a test
tube. Hence, the DNA polymerases most
often used in PCR come from the above-mentioned
microbes, which live in hot environments such as hot springs. Such
enzymes can survive near-boiling temperatures and work quite well at 72 °C.
Primers: They are short pieces of single-stranded DNA that are complementary to the
3′ ends of the template strands flanking the target. Depending on the marker
class, we need to provide either a single primer
(in case of RAPD, ISSR, etc.) or two primers
(forward and reverse primers; in case of SSR,
CAPS, etc.). The polymerase begins synthesising new DNA from the 3′ end of the primer.
Through complementary base pairing, one primer
attaches to the top strand at one end of the target DNA
and the other primer attaches to the bottom strand at the other end.
In most of the cases, since the primers are
more than 20 bp long, they target just a single
locus in the entire genome.
Nucleotides (dNTPs or Deoxynucleotide
Triphosphates): They are single units of the
bases A, T, G and C, which are essentially
building blocks for synthesising new DNA
strands.
Buffers and sterile water: These are added
to the PCR mix to maintain the pH, to limit
side effects of the reaction chemistry that
would otherwise impair the PCR and to maintain the optimum
activity of the enzyme. Divalent ions such as
Mg²⁺ are also supplied since they are cofactors
for the DNA polymerase.
PCR Program: PCR relies on thermal
cycling, consisting of 30-40 cycles of repeated
heating and cooling of the reaction for DNA
melting and enzymatic replication of the DNA
(Fig. 3.1). A PCR program contains a minimum of
five different steps, each characterised by a specific
temperature. This is done on an automated
PCR thermal cycler or PCR machine.
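The remark under Primers that a primer of more than 20 bp usually targets a single locus can be illustrated with a rough probability estimate. The sketch below assumes a random genome with equal base frequencies, which real genomes are not, and the genome size used is an illustrative value rather than one taken from the text.

```python
# Sketch: expected number of perfect matches to a primer that would occur by
# chance alone. Assumptions: random sequence, equal base frequencies, and both
# strands searched; the genome size below is illustrative only.

def expected_random_matches(primer_len, genome_bp, strands=2):
    """Expected chance occurrences of one specific primer-length sequence."""
    if primer_len <= 0 or genome_bp <= 0:
        raise ValueError("primer length and genome size must be positive")
    return strands * genome_bp / 4 ** primer_len


ILLUSTRATIVE_GENOME_BP = 2.5e9  # hypothetical large plant genome

for length in (10, 15, 20):
    hits = expected_random_matches(length, ILLUSTRATIVE_GENOME_BP)
    print(f"{length}-mer: about {hits:.3g} chance matches expected")
```

Under these assumptions, the expected number of chance matches falls far below one at 20 bp, whereas a much shorter primer is expected to match at many sites.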

Step 1: Initialisation or Initial Denaturation
This step consists of heating the reaction to a
temperature of 94-96 °C (or 98 °C if extremely
thermostable polymerases are used), which is
held for 1-9 min. At this temperature, almost
all the DNA is denatured by disruption of
the hydrogen bonds between complementary bases, yielding single-stranded DNA
molecules.
Step 2: Denaturation
It usually consists of heating the reaction to
94-98 °C for 20-30 s.
Step 3: Annealing
The reaction temperature is lowered to
50-65 °C for 20-40 s, allowing annealing of
the primers to the single-stranded DNA template. Typically, the annealing temperature is
about 3-5 °C below the melting temperature
(Tm) of the primers used (the melting temperature
can be obtained from the data sheet provided by
the commercial company that synthesised
the primer). Stable DNA-DNA hydrogen
bonds are formed only when the primer
sequence very closely matches the template
sequence. The polymerase binds to the primer-template
hybrid and is ready to begin new
DNA strand synthesis.
Step 4: Extension or Elongation
The temperature of this step is fixed depending on the type of DNA polymerase used in
the PCR mix. For example, Taq polymerase
has its optimum activity temperature at
75-80 °C, and commonly a temperature of
72 °C is used with this enzyme. At this step,
the DNA polymerase synthesises a new DNA
strand complementary to the DNA template
strand by adding dNTPs that are complementary
to the template in the 5′ to 3′ direction. This is done
by condensing the 5′-phosphate group of the
dNTPs with the 3′-hydroxyl group at the end
of the nascent (extending) DNA strand. Thus,
the polymerase adds dNTPs from 5′ to 3′,
reading the template from the 3′ to the 5′ side, to make
two double-stranded molecules. The extension time depends both on the DNA polymerase used and on the length of the DNA
fragment to be amplified. As a rule of thumb,
at its optimum temperature, the DNA polymerase will polymerise about 1,000 bases per
minute.
Steps 2-4 are repeated for 30-35 cycles.
Under optimum conditions, that is, if there are
no limitations due to limiting substrates or
reagents, at each extension step, the amount of
DNA target is doubled, leading to exponential
(geometric) amplification of the specific DNA
fragment.

Optional Step: Final Elongation
This single step is occasionally performed at a
temperature of 70-74 °C for 5-15 min after the
last PCR cycle to ensure that any remaining
single-stranded DNA is fully extended.
Step 5: Final Hold
This step is set at 4-15 °C for an indefinite
time and may be employed for short-term storage of the PCR products.
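The steps above can be written down as a small data structure, which makes it easy to estimate the run time and the ideal yield of a program. This is only a sketch: the class and function names are hypothetical, the temperatures and times are single values picked from within the ranges given in the steps, the extension time uses the rule of thumb of about 1,000 bases per minute, and the ramping time between temperatures is ignored.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    temp_c: float
    seconds: int


def extension_seconds(product_bp, bases_per_min=1000):
    """Extension time from the rule of thumb of about 1,000 bases per minute."""
    return max(1, round(60 * product_bp / bases_per_min))


def build_program(product_bp, anneal_c, cycles=30):
    """Assemble a five-step program; all values are illustrative choices."""
    return {
        "initial": Step("initial denaturation", 94, 120),
        "cycle": [
            Step("denaturation", 94, 30),
            Step("annealing", anneal_c, 30),
            Step("extension", 72, extension_seconds(product_bp)),
        ],
        "cycles": cycles,
        "final": Step("final elongation", 72, 300),
        "hold_c": 4,  # final hold, held indefinitely
    }


def run_time_minutes(program):
    """Total programmed time in minutes, ignoring ramping and the final hold."""
    per_cycle = sum(step.seconds for step in program["cycle"])
    total = (program["initial"].seconds
             + program["cycles"] * per_cycle
             + program["final"].seconds)
    return total / 60


def ideal_copies(start_copies, cycles):
    """Copies after perfect doubling in every cycle (the optimum case)."""
    return start_copies * 2 ** cycles


program = build_program(product_bp=1000, anneal_c=55, cycles=30)
print(f"Programmed time: {run_time_minutes(program):.0f} min")
print(f"Ideal copies from one template after 30 cycles: {ideal_copies(1, 30):,}")
```

Real reactions fall short of the ideal doubling once primers, dNTPs or enzyme become limiting, so the second figure is an upper bound rather than a prediction.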

Tips to Improve the PCR


The requirement of an optimal PCR reaction is to amplify a specific locus without
any unspecific by-products. Therefore,
annealing needs to take place at a
sufficiently high temperature to allow
only the perfect template-primer matches
to occur in the reaction. For any given
primer pair, the PCR program can be
selected based on the composition (GC
content) of the primers and the length of
the expected PCR product. In the majority
of the cases, products expected to be
amplified are relatively small (from 0.1 to
3 kb). The activity of the Taq polymerase
is about 1,000 nucleotides/min at optimal
temperature (72-78 °C), and the extension
time in the reaction can be calculated
accordingly. As the activity of the enzyme
may not always be optimal during the
reaction, an easy rule is to consider an
extension time (in minutes) equal to the
number of kb of the product to be
amplified (e.g. 1 min for a 1 kb product,
2 min for a 2 kb product).
Many researchers use a 2-5-min first
denaturing step before the actual cycling
starts. This is supposed to help denature the target DNA better (especially
hard-to-denature templates such as those found in
polyploids). Also, a final extension
time of 5-10 min is described in many
reports (to finish the elongation of many
or most PCR products initiated during
the last cycle). A denaturing time of
20-50 s is sufficient to achieve good PCR
products during the cyclic process. A long
denaturing time will expose Taq polymerase to high temperatures for a long time
and hence may decrease the activity of
the enzyme.
The annealing temperature can be chosen
based on the melting temperature of the
primers. A simple procedure is to use an initial annealing temperature of 54 °C (usually
good for most primers with a length of 20 bp
or more). The annealing temperature should
not be much lower unless you have designed
the primer from a heterologous sequence.
If unspecific products result, this temperature should be increased. If the reaction is
specific (only the expected product is synthesised), the temperature can be
used as it is. Gradient PCR can be employed
with different annealing temperatures when
the primers are designed from heterologous systems. To calculate Tm for duplex
DNA of <50 bp, use the following simple
rule:
Count the number of A or T and of G or C bases
Add 2 °C for each A or T
Add 4 °C for each G or C
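The simple rule above (often called the Wallace rule) is easy to apply by program. The following sketch uses a hypothetical 20-base primer and hypothetical function names; the suggested annealing window of 3-5 °C below Tm follows the guidance given under Step 3.

```python
# Sketch: melting temperature from base counts (2 °C per A/T, 4 °C per G/C),
# intended for short oligonucleotides only. The primer below is hypothetical.

def simple_tm(primer):
    """Return Tm in °C using the 2 + 4 rule."""
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must contain only A, C, G and T")
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    return 2 * at_count + 4 * gc_count


def annealing_window(primer):
    """Annealing temperatures 5 °C and 3 °C below the estimated Tm."""
    tm = simple_tm(primer)
    return tm - 5, tm - 3


primer = "AGCTTGACCGATGCATTGCA"  # hypothetical: 10 A/T and 10 G/C bases
print(simple_tm(primer))         # 2*10 + 4*10 = 60
print(annealing_window(primer))  # (55, 57)
```

For a primer pair, a cautious choice is to base the annealing temperature on the primer with the lower estimated Tm and then adjust it empirically, for example by gradient PCR.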
In general, 30 cycles are sufficient for a usual
PCR reaction. Little or no quantitative
change (i.e., in the relative amounts of PCR
products) was observed when the number of
cycles was increased from 30 to 45. Little quantitative gain
was noticed when increasing the number of
cycles up to 60.
Like a simple PCR, multiplex reactions
should be done at a stringent enough
temperature, allowing amplification of all
loci of interest without any by-products.
Although many individual loci can be
specifically amplified at an annealing temperature of 56-60 °C, experiments showed
that lowering the annealing temperature
by 4-6 °C was required for the same loci
to be co-amplified in multiplex mixtures.
Due to differences in base composition,
length of product or secondary structure,
some loci are more efficiently amplified
than others. When many loci are simultaneously amplified (multiplexed), the more
efficiently amplified loci will negatively
influence the yield of product from the less
efficient loci. This phenomenon is due in
part to the limited supply of enzyme and
nucleotides in the PCR reaction. Therefore,
during the multiplex procedure sufficient
quantity of PCR components should be
added.
While people typically measure DNA
quantity in ng, the relevant unit is actually
moles, that is, how many copies of the
sequence that will anneal with the primers are present.
Thus, the amount of DNA in ng that you
need to add is a function of its complexity.
In theory, a single molecule of DNA can be
used in PCR, but normally people use
between 1,000 and 100,000 molecules for
eukaryotic nuclear DNA. Both DNA template quality and PCR product size affect
the amount of DNA added to the PCR mix.
If the DNA possesses very high molecular
weight (such as polyploids), and/or the
PCR product length is short (e.g. an SSR),
less DNA can be used since a higher fraction
of the molecules will contain the annealing
sites for both the forward and reverse
primers. If the DNA is degraded and you
want to amplify a large product, it may not
work, but the same DNA may be fine for
amplifying SSRs.
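The conversion between nanograms and template copies can be sketched as follows. The calculation assumes double-stranded DNA with an average mass of about 660 g/mol per base pair and one copy of the target locus per haploid genome; the function names are hypothetical, and the larger genome size is an illustrative value (the smaller one is the E. coli genome size quoted earlier in this chapter).

```python
# Sketch: convert a DNA mass into the number of genome (template) copies.
# Assumptions: double-stranded DNA, about 660 g/mol per bp on average, and a
# single-copy target, so one haploid genome equals one template molecule.

AVOGADRO = 6.022e23
GRAMS_PER_MOL_PER_BP = 660.0  # approximate average for double-stranded DNA


def genome_copies(mass_ng, genome_bp):
    """Number of haploid genome copies contained in mass_ng of DNA."""
    grams = mass_ng * 1e-9
    return grams * AVOGADRO / (genome_bp * GRAMS_PER_MOL_PER_BP)


def mass_ng_for_copies(copies, genome_bp):
    """DNA mass (ng) needed to supply the requested number of genome copies."""
    return copies * genome_bp * GRAMS_PER_MOL_PER_BP / AVOGADRO * 1e9


small_genome = 4.2e6  # bp, E. coli
large_genome = 2.5e9  # bp, illustrative value for a large plant genome

print(f"10 ng, small genome: {genome_copies(10, small_genome):.3g} copies")
print(f"10 ng, large genome: {genome_copies(10, large_genome):.3g} copies")
print(f"ng for 10,000 copies, large genome: "
      f"{mass_ng_for_copies(10_000, large_genome):.3g}")
```

The same mass of DNA therefore carries far fewer template copies when the genome is large, which is why the amount to add depends on genome complexity.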
Standard Mg²⁺ concentration is 2 mM, but
sometimes the concentration needs to be
raised (rarely lowered) to get a PCR to work.
Raising Mg²⁺ lowers specificity and is roughly
comparable to lowering the annealing temperature. It may cause multiple bands to
appear (or, occasionally, disappear).
It is better to heat up the thermocycler block
to high temperature (>100 °C) before starting the PCR program. This is not a true hot
start, but it may improve the specificity of
the reaction.
Nested PCR can be employed using the
primary PCR product as template with new
forward and reverse primers that are
designed internal to the original. It will
eliminate extra bands if the first PCR is
messy and produce a robust band where the
first PCR is weak or even invisible. Besides,
this method saves genomic DNA.
Enzymes are expensive and perishable. It is
better to follow all the rules that specify the
usage of enzymes (such as storage at -20 °C
in a frost free freezer in 50% glycerol and
wearing gloves when handling the
enzymes). Before you open a new tube of
enzyme, first spin it briefly as there is often
enzyme in the cap. This is particularly true
for temperature-sensitive enzymes that
may be put in ice: enzyme in the cap does
not stay cold. Spin tubes as necessary to
keep enzyme at the bottom. Avoid trying to
measure out minute quantities of enzyme,
as the 50% glycerol storage buffer makes
this impossible. When pipetting enzyme
from a stock tube, place the end of the tip
just far enough into the enzyme to get what
you need, and do not plunge the tip way
down into the solution, as the outside of the
tip will become covered with enzyme and
your measurement will be off. Whenever
possible, make a cocktail of enzyme, buffer,
water, etc., and aliquot this as appropriate.
Do not add enzyme to unbuffered water,
which will denature it. Mix water and
buffer first, place on ice, then add enzyme.
The volume of the enzyme should be less
than one-tenth of the final volume of the
reaction mixture, as too much glycerol can
interfere with enzyme activity.

Restriction Fragment Length Polymorphism (RFLP)
In RFLP, DNA polymorphism is detected by
hybridising a chemically labelled DNA probe to a
Southern blot of sample DNA that has been digested
with restriction endonucleases (Botstein et al.
1980). Thus, RFLP generates a differential banding
profile arising from nucleotide substitutions or DNA rearrangements, such as insertions,
deletions or single-nucleotide polymorphisms, in the
recognition sites of the restriction enzymes
(Fig. 3.2). Further, the detection of polymorphism
also depends on the DNA probe used. The DNA
probe is a radioactively labelled DNA sequence
that hybridises with one or more fragments of the
restriction enzyme-digested DNA sample after
they have been separated by gel electrophoresis. Short,
single or low copy genomic DNA or cDNA clones
are typically used as RFLP probes. Thus, RFLP is
specific to the probe and restriction enzyme combination and hence results in a unique banding pattern characteristic of a specific genotype at a
specific locus. RFLP markers are relatively highly
polymorphic, co-dominantly inherited and highly
reproducible. Because of their presence throughout
the plant genome, high heritability and locus
specificity, the RFLP markers are considered superior. The method also provides an opportunity to
simultaneously screen numerous samples. DNA
blots can be analysed repeatedly by stripping and
re-probing (usually eight to ten times) with different RFLP probes. However, RFLP is not widely
used in linkage mapping since it is time consuming, involves expensive and radioactive/toxic
reagents and requires a large quantity of high-quality
genomic DNA. The requirement of prior sequence
information for probe generation further increases
the complexity of the methodology. These limitations led to the conceptualisation of a new set of
less technically complex methods that are based
on PCR.
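The way a single nucleotide change in a recognition site alters the fragment pattern can be shown with a toy digestion. In the sketch below, the two allele sequences are invented for illustration, and the enzyme is assumed to be EcoRI, which recognises GAATTC and cuts after the first G. Hybridisation with a probe is not modelled; the sketch only lists the fragment lengths that a digest would produce.

```python
# Sketch: fragment lengths after digesting a linear DNA molecule at every
# occurrence of a recognition site. The alleles are invented examples.

def digest(seq, site, cut_offset):
    """Return fragment lengths; cut_offset is the cut position within the site."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"
ECORI_CUT = 1  # EcoRI cuts between G and AATTC

# Allele A carries two EcoRI sites; allele B has lost the second site through
# a single base substitution (GAATTC -> GAGTTC).
allele_a = "A" * 30 + "GAATTC" + "C" * 40 + "GAATTC" + "T" * 20
allele_b = "A" * 30 + "GAATTC" + "C" * 40 + "GAGTTC" + "T" * 20

print(digest(allele_a, ECORI_SITE, ECORI_CUT))  # [31, 46, 25]
print(digest(allele_b, ECORI_SITE, ECORI_CUT))  # [31, 71]
```

A probe that hybridises to the region between the two sites would detect the 46 bp fragment in allele A and the 71 bp fragment in allele B, and a heterozygote would show both bands, which is the basis of the co-dominant scoring mentioned above.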

PCR-Based Techniques
After the invention of polymerase chain reaction
(PCR) technology (Mullis and Faloona 1987;
Box 3.1), a large number of approaches for generating a variety of molecular markers were
described and used in genetic mapping. This is
primarily due to its obvious simplicity and high
probability of success. Further, usage of random
primers overcame the limitation of prior sequence
knowledge for PCR analysis and facilitated the
development of genetic markers for a range of
purposes. PCR-based techniques can further be
subdivided into two subcategories: (1) arbitrarily
primed PCR-based techniques or sequence
nonspecific techniques and (2) sequence-targeted
PCR-based techniques.

Differences in isoenzymes that


are detected by gel electrophoresis
and specific staining
Co-dominant inheritance

Differences in the presence or absence


of recognition sites in the target region
Co-dominant inheritance

Differences in primer annealing sites


Dominant inheritance

Differences in number of repeats of


microsatellite motifs
Co-dominant inheritance

Differences in the presence or absence


of recognition site and differences in
the primer annealing sites
Dominant inheritance
Difference in the sequences at
single-nucleotide level
Dominant inheritance

Isozymes

RFLP

RAPD

SSR

AFLP

Genetic diversity
Fine mapping
Map-based cloning

Complicated methodology
Each marker has less alleles
Mixture interpretation is more difficult
Require costly equipments to assay

High levels of polymorphism


Extremely degraded DNA samples
can be used
Most common in genome
Multiplexing hundreds of markers
in a single chip is possible

Linkage and QTL


mapping
Marker-assisted
selection
Hybrid fixation
Saturation mapping

Genetic diversity
Saturation mapping

Hybrid fixation
Genetic diversity

Large amount of DNA required

Usually require polyacrylamide gel


electrophoresis which is labour intensive

Time and cost intensive initial establishment

Large amount of DNA is required


Limited polymorphism especially in related
species
Poor reproducibility
Generally not transferable

Map construction

Map construction

Genetic diversity

Applications
Conventional plant
breeding program

Multiple loci

Quick, simple and inexpensive


Small amount of DNA is required
Multiple loci from a single primer
Technically simple, robust
and reliable
Transferable between populations

Transferable across the laboratories

Phenotype-based analysis

Suitable for estimating a wide range


of population genetics parameters
and for genetic mapping
Robust and reliable
Time consuming, laborious and harmful

Disadvantages
Limited in number
Laborious and time-consuming procedures
Relatively few biochemical assays available
to detect enzymes

Advantages
Simple to assay
Lowest cost involved protocol
Robust and highly reproducible

SNP

Principle and mode of inheritance


Differences in phenotypic expression
of the given trait (e.g. petal colour)

Marker class
Morphological
markers

Table 3.1 Properties, advantages and limitations of markers used in genetic mapping


Genome and
QTL-mapping
potential
Comparative
mapping
potential

Reproducibility

Transferability

Degree of
polymorphism
Amount of DNA
sample required
Ease of assay
Can be
automated?
Equipment cost
Development
cost
Assay cost
Easy
Yes

Difficult
Difficult

Expensive
Expensive

Expensive

Cheap
Cheap

Cheap

Very limited

Excellent

Good

Very good

Low to medium

Within species

Across families Across genera


and genera
Very high
High to very
high
Limited
Good

Moderate

Moderate
Moderate

~10 ng

~10 mg

Few mg
of tissue
Easy
Difficult

Unlimited
Not applicable
(presence/
absence type
of detection)
Low to medium

Limited by
the size of
genome and by
nucleotide
polymorphism

RAPD
Anonymous

Low to medium

100s
Rare to
extremely rare

RFLP
Anonymous/
genic
Limited by the
restriction site
(nucleotide)
polymorphism

Low

Limited by
the number
of enzyme
genes and
histochemical
enzyme assays
available
30-50
Rare

Maximum
theoretical
number of
possible loci
in analysis

Number of loci
Null alleles

Isozymes
Genic

Features
Origin

Table 3.2 Comparison of features of different types of markers

Very limited

Medium to
high
Very good

Within species

Moderate

Expensive
Moderate

Moderate
Yes

~25 ng

Unlimited
Not applicable
(presence/
absence type
of detection)
Low to medium

Limited by the
restriction site
(nucleotide)
polymorphism

AFLP
Anonymous

~10 ng

1,000s
Not applicable
(presence/
absence type
of detection)
Low to medium

Limited by
the size of
genome and
by nucleotide
polymorphism

Good

Within genus
or species
Medium to
high
Good

Moderate

Expensive
Very expensive

Very limited

Within genus
or species
Low to
medium
Very good

Moderate

Moderate
Moderate

Easy to moderate Easy


Yes
Yes

~50 ng

Medium to high

10s
Occasional
to common

Limited by the
size of genome
and number of
simple repeats
in a genome

SSR
ISSR
Anonymous/genic Anonymous

Limited

Limited

Within genus
or species
High

Moderate

Moderate
Expensive

Moderate
Yes

~25 ng

Medium to high

Limited

Limited

Within genus
or species
High

Moderate

Moderate
Moderate

Easy
Semi-automated

~25 ng

Medium to high

10s
10s
Rare to extremely Rare to
rare
extremely rare

SCAR
CAPS
Anonymous/genic Anonymous/
genic
Limited by the
Limited by the
size of genome
size of genome

Limited

Moderate to
expensive
Within genus
or species
Medium to
high
Very good

Expensive
Expensive

Easy
Yes

~50 ng

Medium to high

10s
Rare to
extremely rare

SNP
Anonymous/
genic
Limited by the
size of genome

[Figure: DNA isolated from individuals; restriction digestion and gel
electrophoresis; transfer of digested DNA fragments to a membrane
(Southern blotting); radioactive DNA probe binds to specific DNA
fragments; autoradiography (X-ray film sandwiched to the membrane to
detect the radioactive pattern)]

Fig. 3.2 Schematic description of RFLP marker development

Arbitrarily Primed PCR-Based Markers


Random Amplified Polymorphic DNA (RAPD)
The RAPD technique is based on differential PCR amplification of genomic
DNA using short oligonucleotide primers of random sequence (mostly ten
bases long) (Fig. 3.3). Differential banding patterns are usually produced
by rearrangements or deletions at or between the oligonucleotide primer
binding sites in the genome (Williams et al. 1991).
As the approach requires no prior knowledge of the genome being analysed,
it can be employed across species using universal primers. Results obtained
in various plants indicate that RAPDs are dominant, highly polymorphic and
informative, and complementary to RFLP markers. RAPD markers offer many
advantages, such as a higher frequency of polymorphism, rapidity, technical
simplicity, use of fluorescence, feasibility of automation and a requirement
of only a few nanograms of DNA. Because of these advantages, RAPD markers
have potential application in crop improvement through locating and
manipulating genes of interest, identification of somatic hybrids,
evaluation and conservation of genetic resources, DNA profiling, population
genetics and gene mapping.
However, the main limitation of RAPD technology is inconsistency: PCR
reactions are very sensitive to factors such as annealing temperature,
template DNA concentration and Mg2+ ion concentration, and hence results
are often difficult to reproduce, sometimes even within the same
laboratory. Further, as several discrete loci in the genome are amplified
by each primer, scoring is complicated. Since they are dominant markers,
RAPD profiles cannot be used to distinguish heterozygous individuals from
homozygous individuals that carry the band. Hence, RAPD markers, although
useful for genetic studies, should be used with caution.
Paran and Michelmore (1993) were able to separate RAPD fragments and then
clone and sequence them after reamplification. These sequence data were
used to design longer PCR primers specific to particular RAPD fragments,
which were then used to consistently amplify those specific fragments from
genomic DNA (thus, this technique is also called sequence-tagged site,
STS). This method eliminates the reproducibility problem associated with
RAPD analysis. Reamplification from genomic DNA and subsequent sequencing
of the PCR products also allow any artefacts of RAPD technology to be
identified.
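
The loss of information caused by dominant scoring can be illustrated with a small sketch. The function name and the genotype encoding are assumptions made for this illustration and do not come from any RAPD software; the sketch only shows that genotypes AA and Aa give the same band phenotype when a marker is scored as presence or absence.

```python
def band_phenotype(genotype, dominant=True):
    """Return the scorable phenotype for a diploid genotype: 'AA', 'Aa' or 'aa'.

    'A' is the allele that amplifies a band. 'a' gives no band in the
    dominant case, or a band of a different size in the co-dominant case.
    """
    alleles = set(genotype)
    if dominant:
        # Presence/absence scoring: any copy of 'A' gives a band.
        return "band present" if "A" in alleles else "band absent"
    # Co-dominant scoring: each allele gives its own distinguishable band.
    return " + ".join(f"band {a}" for a in sorted(alleles))


for g in ("AA", "Aa", "aa"):
    print(g, "|", band_phenotype(g, dominant=True), "|", band_phenotype(g, dominant=False))
```

With dominant scoring, AA and Aa fall into the same class, whereas co-dominant scoring separates all three genotypes.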

Arbitrarily Primed Polymerase Chain Reaction (AP-PCR)
and DNA Amplification Fingerprinting (DAF)
These techniques are independently developed methodologies that are
variants of RAPD. For AP-PCR (Welsh and McClelland 1990), a single primer
(about 10-15 nucleotides long) is used. The technique involves
amplification for the initial two PCR cycles at low stringency. Thereafter,
the remaining cycles are carried out at higher stringency by increasing the
annealing temperature. This variant of RAPD was not very popular as it
involved autoradiography, but it has since been simplified, as the
fragments can now be fractionated using agarose gel electrophoresis. The
DAF technique uses single arbitrary primers shorter than ten nucleotides
for amplification (Caetano-Anollés and Bassam 1993), and the amplicons are
analysed on polyacrylamide gels with silver staining.

[Figure: random 10 bp oligonucleotide primer; for simplicity only 3 loci
(1-3) are described in the genome. x: a single base change destroys the
target sequence for primer binding, and hence this locus will not amplify
from individual B. PCR amplification of the target is followed by agarose
gel electrophoresis.]

Fig. 3.3 Schematic description of RAPD marker

Amplified Fragment Length Polymorphism (AFLP)
To overcome the limited reproducibility associated with RAPD, AFLP
technology (Vos et al. 1995) was developed. It combines the power of RFLP
with the flexibility of PCR-based technology by ligating primer recognition
sequences (adaptors) to the restricted DNA and selectively amplifying the
restriction fragments by PCR using a limited set of primers (Fig. 3.4). The
primer pairs used for AFLP usually produce 50-100 bands per assay. The
number of amplicons per AFLP assay is a function of the number of selective
nucleotides in the AFLP primer combination, the selective nucleotide motif,
GC content and physical genome size and complexity. The AFLP technique
generates fingerprints of any DNA regardless of its source and without any
prior knowledge of the DNA sequence. Most AFLP fragments correspond to
unique positions on the genome and hence can be exploited as landmarks in
genetic and physical mapping. The technique can also be used to distinguish
closely related individuals at the subspecies level and to map genes.
Applications of AFLP in plant mapping include establishing linkage groups
in crosses, saturating regions with markers for map-based gene cloning
efforts and assessing the degree of relatedness or variability among
cultivars. For high-throughput screening, fluorescence-tagged primers are
also used in AFLP analysis; the amplified fragments are detected on
denaturing polyacrylamide gels using an automated ALF-DNA sequencer with
the fragment option (Huang and Sun 1999).
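
The dependence of band number on selective nucleotides can be sketched with a simple calculation. It assumes, as an idealisation that is not stated in the text, that the four bases are equally frequent and independent at the positions next to the restriction sites, so that each selective nucleotide retains about one quarter of the fragments. Real band counts also depend on GC content, the selective motif and genome complexity. The function name and the starting fragment number are hypothetical.

```python
def expected_aflp_bands(n_restriction_fragments, n_selective_eco, n_selective_mse):
    """Idealised expected number of amplified fragments.

    Assumes each selective nucleotide matches with probability 1/4
    (equal, independent base frequencies); this is a simplifying assumption.
    """
    total_selective = n_selective_eco + n_selective_mse
    return n_restriction_fragments / (4 ** total_selective)


# Hypothetical example: 250,000 EcoRI-MseI fragments carrying both adaptors.
for eco, mse in [(1, 1), (2, 2), (3, 3)]:
    bands = expected_aflp_bands(250_000, eco, mse)
    print(f"+{eco}/+{mse} primers: about {bands:.0f} fragments")
```

Under this idealisation, only the +3/+3 primer combination brings the hypothetical example down to a number of fragments that can be scored on a single gel.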

Sequence-Specific PCR-Based Markers


With the advent of high-throughput sequencing technology, abundant DNA
sequence information has been generated for the genomes of many plant
species. For crops where genome sequencing projects have not yet been
started, large collections of expressed sequence tags (ESTs) are available
in the public domain. Functional genomics approaches through ESTs offer
great scope for the development of gene-based markers for molecular
breeding of complex traits. They also provide better knowledge of the
activity of genes involved in pest and disease resistance and tolerance to
environmental stresses, and promise to increase productivity and yield.
ESTs have been generated, and thousands of sequences have been annotated as
putative functional genes using powerful bioinformatics tools. In order to
correlate DNA sequence information with particular phenotypes,
sequence-specific molecular marker techniques have been designed. The
following sections describe such marker techniques in detail.

[Figure: (1) digestion of genomic DNA with EcoRI (GAATTC) and MseI (TTAA)
and ligation of EcoRI and MseI adaptors to the restriction products;
(2) pre-amplification with unlabelled primers having a single selective
nucleotide; (3) final selective amplification with AFLP primers having 2-3
selective nucleotides; the EcoRI-specific primers are labelled. AFLP
primers consist of three parts: a core sequence (proprietary, not revealed
to the public), an enzyme-specific sequence and a selective extension
sequence; (4) polyacrylamide gel electrophoresis and scoring of the AFLP
profile; for simplicity only a few bands are shown (there will actually be
50-100 bands per assay).]

Fig. 3.4 Schematic representation of AFLP protocol

Microsatellite-Based Marker
Technique
Microsatellites, also called short tandem repeats (STR), simple sequence
repeats (SSR) or sequence-tagged microsatellite sites (STMS), are
monotonous repetitions of very short nucleotide motifs (usually one to six
base pairs). They occur as interspersed repetitive elements in all
eukaryotic genomes (Tautz and Renz 1984). Variation in the number of
tandemly repeated units is mainly due to strand slippage during DNA
replication, where the repeats allow mispairing of the strands that results
in excision or addition of repeat units. As slippage in replication is more
likely than point mutation, microsatellite loci tend to be hypervariable.
The regions flanking the microsatellites are generally conserved among
species or even among genera, and PCR primers complementary to the flanking
regions are used to amplify SSR-containing DNA fragments. The length of the
amplified fragment varies according to the number of repeat units
(Fig. 3.5).

[Figure: SSR motif (AT)10/(TA)10 in individual A, (AT)5/(TA)5 in individual
B and (AT)20/(TA)20 in individual C; forward and reverse primers flank the
corresponding SSR or microsatellite motif; PCR amplification and gel
electrophoresis. The differing number of repeats helps in polymorphism
identification: individual B has only 5 repeats and hence its PCR product
moves rapidly, whereas the PCR product of C moves slowly because of its
larger size (20 repeats).]

Fig. 3.5 Schematic representation of microsatellite or SSR marker development

Microsatellite assays show extensive inter-individual length polymorphisms
during PCR analysis of unique loci using discriminatory primer sets. The
PCR amplification protocols used for microsatellites employ locus-specific
primer pairs that are either unlabelled or include one radiolabelled or
fluorolabelled primer. Unlabelled PCR products are analysed on
polyacrylamide or agarose gels. The use of fluorescent-labelled
microsatellite primers and laser detection (as available in automated
sequencers) in genotyping procedures has significantly improved throughput
and automation. However, due to the high price of the fluorescent label,
which must be carried by one of the primers in each primer pair, the assay
becomes costly. As an alternative, Schuelke (2000) introduced a procedure
in which three primers are used for the amplification of a defined
microsatellite locus: a sequence-specific forward primer with an M13(-21)
tail at its 5′ end, a sequence-specific reverse primer and the universal
fluorescent-labelled M13(-21) primer. This technique has proved simple and
less expensive.
Microsatellites are highly popular genetic markers because of their
co-dominant inheritance, high abundance, enormous extent of allelic
diversity, high information content, locus specificity and the ease of
assessing SSR size variation by PCR with pairs of flanking primers, an
assay that is also easily automated for high-throughput screening. The
reproducibility of microsatellites is such that they can be used
efficiently by different research laboratories to produce consistent data,
and hence they are considered the markers of choice in many crop-breeding
programs. Thus, the advent of SSR or microsatellite markers has brought a
new, user-friendly and highly polymorphic class of genetic markers to many
plant species. However, the high development cost and effort required to
obtain working SSR primers for a given species have restricted their use to
only a few of the agriculturally important crops.
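
The relationship between repeat number and amplicon length shown in Fig. 3.5 can be sketched as follows. The flanking-region lengths (50 bp and 60 bp, primers included) are hypothetical values chosen for illustration; only the repeat counts for individuals A, B and C come from the figure.

```python
def ssr_amplicon_length(motif, n_repeats, left_flank_bp, right_flank_bp):
    """Amplicon length = flanking sequence (primers included) + repeat tract."""
    return left_flank_bp + len(motif) * n_repeats + right_flank_bp


# Repeat counts from Fig. 3.5; flank lengths are assumed for illustration.
individuals = {"A": 10, "B": 5, "C": 20}
sizes = {
    name: ssr_amplicon_length("AT", n, left_flank_bp=50, right_flank_bp=60)
    for name, n in individuals.items()
}

# Smaller fragments migrate faster through the gel, so list them first.
for name in sorted(sizes, key=sizes.get):
    print(f"Individual {name}: {sizes[name]} bp")
```

The three alleles differ only in the length of the repeat tract, so the band order on the gel (B, then A, then C) directly reflects the repeat number.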

Microsatellites are classified into different types: (1) based on the
number of nucleotides per repeat (such as mononucleotide (A)n, dinucleotide
(CA)n, trinucleotide (CGT)n, tetranucleotide (CAGA)n, pentanucleotide
(AAATT)n and hexanucleotide (CTTTAA)n, where n is the variable number of
repeats), (2) based on the arrangement of nucleotides in the repeat motifs
(such as pure or perfect or simple perfect (CA)n, simple imperfect (AAC)n
ACT (AAC)n, compound or simple compound (CA)n (GA)n and interrupted or
imperfect or compound imperfect (CCA)n TT (CGA)n) and (3) based on the
location of SSRs in the genome (such as nuclear (nuSSRs), chloroplastic
(cpSSRs) and mitochondrial (mtSSRs)).
Several methods have been pursued to develop SSR markers, including
analysis of SSR-enriched small-insert genomic DNA libraries, SSR mining
from ESTs and derivation from large-insert bacterial artificial
chromosomes by end sequence analysis. As an example, mining of SSRs from an
EST database is described in Box 3.2.

Box 3.2 Practising Genotyping of a Mapping Population with SSR Markers

DNA-based marker techniques such as RFLP, RAPD, SSR and AFLP are routinely
used in genetic studies, and their advantages as well as limitations have
long been recognised. Among these markers, SSRs are the molecular breeder's
marker of choice. SSRs exist throughout the whole genome of an organism, in
both non-coding and coding regions. In the past, genomic SSRs (gSSRs) were
developed by isolating and sequencing clones containing putative SSR
regions and then designing and testing flanking primers. However, expressed
sequence tag (EST)-derived SSRs have some intrinsic advantages over gSSRs
because they are present in expressed regions of the genome. In recent
years, great efforts have been made to develop gSSRs and EST-SSRs for
several crops, and they have been widely used in genetic mapping.
ESTs containing at least four di-, tri-, tetra-, penta- or hexanucleotide
repeats (EST-SSRs) in the crop of interest can be identified using the SSR
identification tool (SSRIT) available at
http://www.gramene.org/db/markers/ssrtool. The procedure is simple: enter
or paste the EST sequence into the text area and select the parameters to
identify SSR motifs. Once an EST sequence containing an SSR motif is
identified, primers that flank the motif have to be designed. Such primers
can be designed for the flanking regions of the SSR using the web-based
software Primer3 v 0.4.0 or higher versions of this program
(http://frodo.wi.mit.edu/primer3/). Primers can be designed based on the
criteria of 50% GC content, a minimum melting temperature of 50 °C and
absence of secondary structure, or other parameters as required. Primers
ranging from 18 to 27 nucleotides in length that give amplified products of
100-400 bp can be picked and used for primer synthesis. If possible,
primers may be designed within the 5′ or 3′ untranslated region (UTR) (or
near the start or stop codon within coding DNA) closest to the repeat motif
and/or at the start of the intron (as intronic polymorphic primers) to
increase the polymorphic information content. Once the primers are
synthesised, PCR (see Box 3.1) can be executed to amplify the SSR motifs
from the template DNA samples.
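
The SSR search step can be sketched with a regular expression. This is a minimal illustration of the idea and not the SSRIT implementation; the function name, the default of four repeats (taken from the criterion above) and the toy EST sequence are assumptions made for the example.

```python
import re


def find_ssrs(sequence, min_repeats=4, min_motif=2, max_motif=6):
    """Return (start, motif, n_repeats) for perfect tandem repeats in a DNA sequence."""
    sequence = sequence.upper()
    hits = []
    for motif_len in range(min_motif, max_motif + 1):
        # A motif of motif_len bases followed by at least (min_repeats - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs that are themselves repeats of a shorter unit,
            # e.g. 'ATAT' is a dinucleotide repeat, not a tetranucleotide one.
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return sorted(hits)


# Toy sequence (hypothetical) containing (AT)6 and (CAG)5.
est = "GGCTA" + "AT" * 6 + "GCCGTTGA" + "CAG" * 5 + "TTCGA"
print(find_ssrs(est))  # [(5, 'AT', 6), (25, 'CAG', 5)]
```

The reported start positions (0-based) and repeat counts indicate where flanking primers would have to be placed; primer design itself is left to a dedicated program such as Primer3.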
For SSR analysis, three electrophoresis methods are currently employed to
determine length polymorphisms: polyacrylamide gel electrophoresis,
MetaPhor agarose gel electrophoresis and automated capillary
electrophoresis (see Box 3.4), and all these methods produce comparable and
reproducible results. Polyacrylamide gel electrophoresis (PAGE) is the most
common method and gives excellent results. The amplification products in
polyacrylamide gels are typically visualised by radioactive labelling,
fluorescent dye labelling or silver staining. However, these visualisation
techniques require either expensive or hazardous radioactive chemicals and
are time consuming. On the other hand, capillary electrophoresis can be
performed more quickly and is well suited to high-throughput analysis.
Capillary electrophoresis with the CEQ™ 8000 Genetic Analysis System,
QIAxcel System or ABI 3130xl DNA sequencer can easily separate products and
determine allele size, but it is more expensive and requires more
sophistication and expertise. MetaPhor agarose gel electrophoresis (MAGE)
is another approach to separate alleles of microsatellite markers. MetaPhor
agarose (FMC or Cambrex Corporation, USA) is an intermediate melting
temperature agarose (75 °C) that provides twice the resolution capabilities
of the finest-sieving agarose products. Using submarine gel
electrophoresis, MetaPhor agarose gives high-resolution separation of
20-800 bp DNA fragments that differ in size by 2%, which approximates the
resolution of polyacrylamide gels. MetaPhor agarose gels (2-4%) made in
either TAE or TBE and stained with ethidium bromide are ideal for resolving
SSRs.

Organelle Microsatellites
Plant organelle genomes such as chloroplast DNA and mitochondrial DNA have
been increasingly applied to study population genetic structure and
phylogenetic relationships in plants. Due to their uniparental mode of
transmission (Box 3.3), chloroplast and mitochondrial genomes exhibit
patterns of genetic differentiation that differ from those of nuclear
alleles. Thus, for a comprehensive understanding of plant population
differentiation and evolution, all three interrelated genomes must be
considered.

Chloroplast Microsatellites
Numerous studies have shown that chloroplast microsatellites consisting of
several relatively short mononucleotide stretches (such as (dA)n and (dT)n)
are ubiquitous and polymorphic. Chloroplast genome-based markers uncover
genetic discontinuities and distinctiveness among or between taxa with
slight morphological differentiation, which sometimes cannot be revealed by
nuclear DNA markers. The conservation and homology of sequence in the
chloroplast genome make it possible to compare genes across the plant
kingdom and examine phylogenetic relationships in taxa that have diverged
for hundreds of thousands to millions of years. Chloroplast microsatellites
are now becoming firmly established as a high-resolution tool for

examining patterns of cytoplasmic variation in a wide range of plant
species. Chloroplast microsatellites are particularly effective markers for
studying mating systems, gene flow via both pollen and seeds and
uniparental lineages. Chloroplast microsatellite-based markers have been
used for the detection of hybridisation and introgression and for the
analysis of the genetic diversity and phylogeography of plant populations.
One limitation of the approach is the need for sequence data for primer
construction. Primer sequences flanking chloroplast microsatellites are
usually inferred from fully or partially sequenced chloroplast genomes. In
general, these primer pairs produce polymorphic PCR fragments from the
species of origin and their close relatives, but transferability to more
distant taxa is limited. Attempts to design universal primers to amplify
chloroplast microsatellites have resulted in a set of consensus chloroplast
microsatellite primers that aim at amplifying cpSSR regions in the
chloroplast genome of dicotyledonous angiosperms (Weising and Gardner
1999).

Mitochondrial (mt) Microsatellites
In contrast to animal mtDNA, which typically has a size of 10 MDa per
mitochondrial genome, plant mtDNA is far more complex. For example, the
maize mitochondrial genome has been estimated to be 320 MDa. In addition to
its larger size, plant mtDNA is characterised by molecular heterogeneity
observed as classes of circular chromosomes that vary in size and relative
abundance. There are only a few reports that describe the utilisation of
mtSSRs in plant species.

Box 3.3 Features of Molecular Markers

Dominant, Co-dominant and Cytoplasmic or Uniparentally Inherited Markers
In diploid organisms (organisms harbouring two copies of each chromosome),
each individual carries two alleles at a given marker locus, and ideally
the marker assay should reveal the exact genotype. In contrast, for markers
such as RAPD, AFLP and ISSR, it is only possible to describe whether a
given marker allele (e.g. A) is present or not at the given locus.
Therefore, in such cases, one cannot distinguish the heterozygous genotype
(Aa) from the homozygous genotype (AA). This genotyping method clearly
incurs a loss of information, and such markers are referred to as dominant
markers. Alternatively, SSRs, RFLPs, etc., are called co-dominant markers
since they can distinguish a heterozygote (two bands for Aa, i.e. the bands
produced by AA and aa co-occur) from each of the homozygotes AA and aa (a
single band of different size for AA and for aa) (Fig. 3.6).

Fig. 3.6 Diagrammatic explanation for dominant (a) and co-dominant (b)
marker that reveals homozygotes (AA or aa) and heterozygotes (Aa)

Dominant markers allow the analysis of many loci per experiment without
requiring any prior information on their sequence. For predominantly
self-fertilising species, heterozygosity can be disregarded, and allele
frequencies can be considered equal to the observed band frequencies. In
contrast, co-dominant markers allow analysis of only one locus per
experiment, and hence the amount of data per assay is usually lower.
Nevertheless, they are more informative since the allelic variations at
that locus can be distinguished. As a consequence, corresponding linkage
groups can be identified across different genetic maps. However, it is
essential to know the sequence of the particular locus precisely.
There is yet another feature of molecular markers that is worth mentioning
here: cytoplasmic markers, which are uniparentally inherited (either
maternally or paternally). Mitochondrial- and chloroplast-specific SSR or
SNP markers are placed in this category (refer to the text for details),
and the use of such markers requires adequate caution during linkage
mapping.

Polymorphism Information Content (PIC)
The PIC value is commonly used in genetics as a measure of polymorphism for
a marker locus used in linkage analysis. It is the probability that one
could identify which marker allele of a parent has been transmitted to the
offspring. PIC can be calculated as described in Chap. 1 or using the
freely available program CERVUS v2.0. PIC values for co-dominant markers
can range from 0 to 1.0, whereas for dominant markers PIC has a maximum
value of 0.5.
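
As a sketch of the calculation, the function below uses the widely cited formula of Botstein et al. (1980), PIC = 1 - sum(p_i^2) - sum over i<j of 2 p_i^2 p_j^2, where p_i is the frequency of allele i. It is assumed here, not confirmed, that this corresponds to the method referred to in Chap. 1; the allele frequencies are hypothetical.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphism information content for one co-dominant locus (Botstein et al. 1980)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    homozygosity = sum(p ** 2 for p in allele_freqs)
    # Correction for matings in which the parental alleles cannot be told apart.
    correction = sum(2 * (p ** 2) * (q ** 2) for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - correction


print(round(pic([0.5, 0.5]), 3))                # two equally frequent alleles
print(round(pic([0.25, 0.25, 0.25, 0.25]), 3))  # four equally frequent alleles
print(round(pic([1.0]), 3))                     # monomorphic locus
```

PIC rises with the number of alleles and with the evenness of their frequencies, which is why multi-allelic markers such as SSRs tend to be more informative than bi-allelic ones.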

Inter-Simple Sequence Repeats (ISSR)


ISSR involves amplification of DNA segments present at an amplifiable
distance between two identical microsatellite repeat regions that are
oriented in opposite directions. The technique uses microsatellites as
primers in a single-primer PCR targeting multiple genomic loci to amplify
mainly inter-simple sequence repeats of different sizes. The microsatellite
repeats used as primers for ISSRs can be dinucleotide, trinucleotide,
tetranucleotide or pentanucleotide. The primers used can be either
unanchored or, more usually, anchored at the 3′ or 5′ end with 1-4
degenerate bases extending into the flanking sequences. Thus, the principle
is similar to RAPD; however, ISSRs use longer primers (15-30 mers) than
RAPD (10 mers), which permits the use of high annealing temperatures and
hence higher stringency. The annealing temperature depends on the GC
content of the primer used and ranges from 45 to 65 °C. The amplified
products are usually 200-2,000 bp long and amenable to detection by both
agarose and polyacrylamide gel electrophoresis (PAGE). ISSRs exhibit the
specificity of microsatellite markers but need no sequence information for
primer synthesis, thus enjoying the advantage of random markers. The
primers are not proprietary and can be synthesised by anyone. The technique
is simple and quick, and the use of radioactivity is not essential. ISSR
markers usually show high polymorphism, although the level of polymorphism
has been shown to vary with the detection method used. PAGE in combination
with radioactivity was shown to be the most sensitive, followed by PAGE
with silver nitrate staining and then agarose gel with ethidium bromide
detection. As with RAPDs, reproducibility, dominant inheritance and
homology of co-migrating amplification products are the main limitations of
ISSRs. ISSRs segregate mostly as dominant markers, although co-dominant
segregation has been reported in some cases. As in RAPD, there is also a
possibility that fragments with the same mobility originate from
non-homologous regions.
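
As a rough illustration of how GC content drives the annealing temperature, the sketch below applies the Wallace rule of thumb, Tm ≈ 2(A+T) + 4(G+C) °C. This rule is an approximation for short oligonucleotides and is not a method given in the text. The two anchored ISSR primers are hypothetical examples with fixed (non-degenerate) anchor bases.

```python
def wallace_tm(primer):
    """Approximate melting temperature (deg C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    if at + gc != len(primer):
        raise ValueError("primer must contain only A, C, G and T")
    return 2 * at + 4 * gc


def gc_percent(primer):
    """GC content of the primer as a percentage of its length."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


# Hypothetical 3'-anchored ISSR primers: (AT)8 + G and (GA)8 + C.
for primer in ("AT" * 8 + "G", "GA" * 8 + "C"):
    print(primer, f"GC = {gc_percent(primer):.0f}%", f"Tm ~ {wallace_tm(primer)} C")
```

The GA-based primer has a much higher estimated Tm than the AT-based primer of the same length, which illustrates why annealing temperatures have to be set primer by primer and are usually optimised empirically.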

Single-Nucleotide Polymorphisms (SNPs)
Variations at the single-nucleotide level in the genome sequence of
individuals of a population are known as SNPs (Jordan and Humphries 1994).
They constitute the most abundant molecular markers in the genome and are
widely distributed throughout genomes, although their occurrence and
distribution vary among species. SNPs are usually more prevalent in the
non-coding regions of the genome. Within the coding regions, an SNP is
either non-synonymous, resulting in an amino acid sequence change, or
synonymous, in which case it does not alter the amino acid sequence.
However, synonymous changes can modify mRNA splicing and thus sometimes
result in phenotypic differences. Improvements in sequencing technology and
the availability of an increasing number of EST sequences have made direct
analysis of genetic variation at the DNA sequence level possible. The
majority of SNP genotyping assays are based on one or two of the following
molecular mechanisms: allele-specific hybridisation, primer extension,
oligonucleotide ligation and invasive cleavage. High-throughput genotyping
methods, including DNA chips, allele-specific multiplex PCR and primer
extension approaches, make SNPs especially attractive as genetic markers.
Because of these technological improvements, SNPs are highly suitable for
automation and are used for the construction of ultra-high-density genetic
maps.
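
The distinction between synonymous and non-synonymous SNPs can be sketched by translating a codon before and after the base change. The sketch assumes the standard nuclear genetic code; the function name and the example codons are illustrative choices, not taken from the text.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order ('*' = stop codon).
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base change within one codon (position is 0, 1 or 2)."""
    codon = codon.upper()
    mutant = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutant]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


print(classify_coding_snp("GAA", 2, "G"))  # GAA -> GAG, both code for glutamate (E)
print(classify_coding_snp("GAA", 2, "T"))  # GAA -> GAT, glutamate (E) to aspartate (D)
```

A classification of this kind says nothing about effects on splicing or expression, so a synonymous call should not be read as proof that the SNP is phenotypically neutral.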

Single-Feature Polymorphism (SFP)
Genome-wide polymorphism discovery by SFP depends on the principle that a
sequence that perfectly matches a feature/probe sequence present on a gene
chip or microarray may hybridise with greater affinity than one with a
mismatch. The polymorphism between two sequences originating from two
different varieties or genotypes results in differential hybridisation
intensity, and this property functions as a molecular marker popularly
known as SFP. Such genetic differences between genotypes at the sequence
level are of two kinds: single-nucleotide polymorphisms (SNPs) and
insertions/deletions (INDELs). The assays are done by labelling genomic DNA
(target) and hybridising it to arrayed oligonucleotide probes that are
complementary to the target. Either type of variation can potentially
influence the hybridisation of the target to 25-mer oligonucleotides. Each
SFP is scored by the presence or absence of a hybridisation signal with its
corresponding oligonucleotide probe on the array. Thus, a polymorphism
detected by a single probe in an oligonucleotide array is called an SFP,
where a feature refers to a probe in the array. Since it is amenable to
microarray-based genotyping, it is highly suitable for high-throughput
genotyping. For genotyping large populations, the cost per individual is
more critical than the cost per data point. Spotted oligonucleotide
microarrays have the potential to provide low-cost genotyping platforms.
Polymorphisms within a transcribed sequence are of particular interest
because they may reflect variation in biological function.

Sequence-Characterised Amplified Regions (SCAR)
In order to utilise markers identified by arbitrary-primed techniques (such
as RAPD, AFLP and ISSR) for map-based cloning and/or efficient
marker-assisted selection (MAS), identification of an unambiguous single
locus is a must. In addition, the arbitrary marker techniques are sensitive
to changes in the reaction conditions. To bridge the gap between the
ability to obtain markers linked to a gene of interest in a short time and
the use of these markers for map-based cloning approaches and for routine
MAS, the SCAR marker technique was developed and applied. SCARs are
PCR-based markers that represent genomic DNA fragments at genetically
defined loci. SCARs are identified by PCR amplification using
sequence-specific oligonucleotide primers (Paran and Michelmore 1993).
Development of SCARs involves cloning the amplified products of arbitrary
marker techniques and then sequencing the two ends of the cloned products.
The sequence is thereafter used to design specific primer pairs of 15-30 bp,
which amplify single major bands of a size similar to that of the cloned
fragment. Polymorphism is either retained as the presence or absence of
amplification of the band or appears as length polymorphism, which converts
dominant arbitrary-primed marker loci into co-dominant SCAR markers. As
SCARs are primarily defined genetically, they can be used both as physical
landmarks in the genome and as genetic markers. Co-dominant SCARs are more
informative for genetic mapping than dominant arbitrary-primed molecular
markers; they can also be used to screen pooled genomic libraries by PCR
and for physical mapping, defining locus specificity as well as comparative
mapping and homology studies among related plant species. Thus, SCARs have
several advantages over RAPD or AFLP: (1) higher reproducibility resulting
from longer primers and higher annealing temperatures and (2) the
possibility of converting dominant markers to co-dominant markers.
However, cloning and sequencing are still laborious steps in SCAR
development. To avoid this problem, the extended random primer amplified
region (ERPAR) approach has been developed (Wang et al. 2000). Similar to
SCAR, an ERPAR uses specific primer pairs, but these are derived from RAPD
primers by adding bases sequentially to their 3′ ends. The extension of
primers is a continuous procedure of adding bases and screening primer
pairs. Because the longer primers are designed without sequence
information, cloning and sequencing are not needed. ERPAR has the same
advantages as SCAR; in addition, it eliminates the tedious work involved in
SCAR development. Thus, it is a universal and efficient approach to convert
an RAPD marker into a stable marker.

Cleaved Amplified Polymorphic Sequences (CAPS)
The CAPS marker technique provides a way to
utilise the DNA sequences of mapped RFLP
markers to develop PCR-based markers thereby
eliminating the tedious DNA blotting (Komori
and Nitta 2005). Therefore, CAPS are also known
as PCR-RFLP markers. CAPS markers detect the
restriction fragment length polymorphisms caused
by single base changes such as SNPs or insertions/deletions, which modify restriction endonuclease recognition sites in PCR amplicons. CAPS
assays are performed by digesting locus-specific
PCR amplicons with one or more restriction
enzymes, followed by separation of the digested
DNA on agarose or polyacrylamide gels (Fig. 3.7).
The primers are synthesised based on the sequence
information available in databanks of genomic or
cDNA sequences or from cloned RAPD bands.

Fig. 3.7 Schematic illustration of CAPS: the target gene is PCR-amplified from individuals A and B, the PCR products are digested with a restriction enzyme, and the presence or absence of the restriction site identifies the polymorphism
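The logic of a CAPS assay can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two 60 bp amplicons are invented, EcoRI (G^AATTC) is chosen arbitrarily as the enzyme, and the function names `digest` and `caps_bands` are hypothetical helpers, not part of any published protocol.

```python
def digest(amplicon, site, cut_offset):
    """Return fragment lengths after cutting at every occurrence of `site`.

    `cut_offset` is the position inside the site where the top strand is cut
    (1 for EcoRI, which cuts G^AATTC).
    """
    fragments, start = [], 0
    pos = amplicon.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(cut - start)
        start = cut
        pos = amplicon.find(site, pos + 1)
    fragments.append(len(amplicon) - start)
    return fragments


def caps_bands(allele_1, allele_2, site, cut_offset):
    """Distinct band sizes expected on the gel for a diploid individual."""
    bands = set(digest(allele_1, site, cut_offset))
    bands |= set(digest(allele_2, site, cut_offset))
    return sorted(bands, reverse=True)


# Hypothetical 60 bp amplicons that differ by one base inside the EcoRI site.
ALLELE_A = "ACGT" * 5 + "GAATTC" + "TGCA" * 8 + "AC"  # site intact
ALLELE_B = "ACGT" * 5 + "GAGTTC" + "TGCA" * 8 + "AC"  # SNP destroys the site

for label, pair in {"AA": (ALLELE_A, ALLELE_A),
                    "BB": (ALLELE_B, ALLELE_B),
                    "AB": (ALLELE_A, ALLELE_B)}.items():
    print(label, caps_bands(*pair, "GAATTC", 1))
```

Because the heterozygote shows the uncut band together with both digestion products, the three genotypes give three distinct patterns, which is what makes the marker co-dominant.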
The CAPS analysis is versatile and can be
combined with single-strand conformational
polymorphism (SSCP; see below), SSR, SCAR,
AFLP or RAPD analysis to increase the possibility of finding DNA polymorphisms. The
CAPS markers are co-dominant and locus
specific and have been used to distinguish
between plants that are homozygous or heterozygous for alleles. Thus, CAPS proves useful for
genotyping, positional or map-based cloning
and molecular identification studies where
sequence-based identification is not feasible.
The technique is, however, limited to mutations
that create or disrupt a restriction enzyme recognition site. To overcome this limitation,
Michaels and Amasino (1998) proposed a variant of the CAPS method called derived cleaved
amplified polymorphic sequence (dCAPS). In
dCAPS analysis, a restriction enzyme recognition site that includes the SNP is introduced
into the PCR product by a primer containing one
or more mismatches to the template DNA. The
modified PCR product is then subjected to
restriction enzyme digestion, and the presence
or absence of the SNP is determined by the
resulting restriction pattern. The method is simple, relatively inexpensive and utilises the ubiquitous technologies of PCR, restriction digestion
and agarose gel analysis. This technique has proved
useful for following known mutations in segregating populations and for positional cloning
of new genes in plants.
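The dCAPS idea, a mismatched primer that completes a recognition site in only one allele, can be illustrated with a toy example. Everything here is an assumption made for illustration: the template sequences are invented, EcoRI (GAATTC) is an arbitrary enzyme choice, and `amplify` simply models the fact that a PCR product carries the primer sequence rather than the template bases under the primer.

```python
SITE = "GAATTC"  # EcoRI recognition sequence (arbitrary choice for this sketch)

# Hypothetical template: the five bases before the SNP read GAGTT,
# one base away from GAATT; the SNP itself is C or T.
UPSTREAM = "ACGT" * 5 + "GAGTT"
DOWNSTREAM = "TGCA" * 4
TEMPLATES = {snp: UPSTREAM + snp + DOWNSTREAM for snp in ("C", "T")}

# dCAPS primer: matches the template up to the base before the SNP, except
# for one deliberate mismatch (G -> A) that completes GAATT.
DCAPS_PRIMER = "ACGT" * 5 + "GAATT"


def mismatches(primer, template):
    """Number of positions at which the primer differs from the template."""
    return sum(1 for p, t in zip(primer, template) if p != t)


def amplify(template, primer):
    """PCR product: primer sequence followed by the rest of the template."""
    return primer + template[len(primer):]


def is_cut(sequence):
    """True if the sequence contains the recognition site."""
    return SITE in sequence


for snp, template in TEMPLATES.items():
    product = amplify(template, DCAPS_PRIMER)
    print(snp, "native cut:", is_cut(template), "dCAPS cut:", is_cut(product))
```

Neither native allele carries the site, so ordinary CAPS would fail here; after amplification with the mismatched primer, only the C allele is digested.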

Randomly Amplified Microsatellite Polymorphisms (RAMP)
Microsatellite-based markers show a high degree
of allelic polymorphism, but they are labour
intensive. On the other hand, RAPD markers are
inexpensive but exhibit a low degree of polymorphism. To compensate for the weaknesses
of these two approaches, a technique termed as
RAMP was developed (Wu et al. 1994). The
technique involves a radiolabelled primer consisting of a 5′ anchor and 3′ repeats, which is
used to amplify genomic DNA in the presence
or absence of RAPD primers. The resulting
products are resolved using denaturing polyacrylamide gels, and as the repeat primer is
labelled, only the amplification products derived
from the anchored primer are detected. The
melting temperatures of the anchored primers
are usually 10–15 °C higher than those of the
RAPD primers; thus, at the higher annealing
temperature, only the anchored primer anneals
efficiently, whereas in PCR cycles at the low
annealing temperature, both the anchored microsatellite and RAPD primers anneal. The
PCR program therefore switches between high
and low annealing temperatures during the reaction. Most fragments
obtained with RAMP primers alone disappear
when RAPD primers are included, and different
patterns are obtained with the same RAMP
primer and different RAPDs, indicating that
RAPD primers compete with RAMP primer
during the low annealing temperature cycle.
RAMP has been successfully employed in plant
genetic diversity studies.
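The temperature-switching argument can be made concrete with a rough calculation. The sketch below uses the Wallace rule of thumb for short oligonucleotides, Tm ≈ 2(A+T) + 4(G+C) °C, which is a crude approximation and not the method of Wu et al. (1994). Both primer sequences are invented, and treating a primer as "annealing" whenever its estimated Tm is at or above the annealing temperature is a deliberate simplification.

```python
def wallace_tm(primer):
    """Rule-of-thumb melting temperature (degrees C) for a short oligo."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Hypothetical primers: a 5' anchor plus 3' (GA) repeats, and a RAPD 10-mer.
PRIMERS = {
    "anchored": "ACT" + "GA" * 6,
    "RAPD": "GGTGCGGGAA",
}


def annealing_primers(annealing_temp):
    """Primers whose estimated Tm is at or above the annealing temperature."""
    return [name for name, seq in PRIMERS.items()
            if wallace_tm(seq) >= annealing_temp]


for temp in (42, 32):  # illustrative high and low annealing temperatures
    print(temp, annealing_primers(temp))
```

With these invented sequences the estimated melting temperatures differ by 10 °C, so only the anchored primer passes the check in the high-temperature cycle, while both pass in the low-temperature cycle.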

Sequence-Related Amplified Polymorphism (SRAP)
The aim of SRAP technique (Li and Quiros
2001) is the amplification of open reading
frames (ORFs). It is based on two-primer
specific PCR amplification. The technique uses
primers of arbitrary sequence, which are 17–21
nucleotides in length. It uses pairs of primers
with AT- or GC-rich cores to amplify intragenic
fragments for polymorphism detection. The
primers consist of the following elements: (1)
core sequences, which are 13–14 bases long,
where the first 10 or 11 bases, starting at the 5′ end, are sequences of no specific constitution
(filler sequences), followed by the sequence
CCGG in the forward primer and AATT in the
reverse primer, and (2) three selective nucleotides
at the 3′ end, following the core. The
filler sequences of the forward and reverse
primers must be different from each other and
can be 10 or 11 bases long. For the first five
cycles, the annealing temperature is set at 35 °C.
The following 35 cycles are run at 50 °C. The
amplified DNA fragments are fractionated on
denaturing acrylamide gels and detected by
autoradiography or silver staining. SRAP combines simplicity, reliability and a moderate throughput ratio and facilitates sequencing of selected
bands. SRAP targets coding sequences in the
genome and results in a moderate number of
co-dominant markers. Sequencing demonstrated that SRAP polymorphism results from
two events: fragment size changes due to insertions and deletions, which could lead to co-dominant markers, and nucleotide changes
leading to dominant markers. The SRAP marker
system has been adapted for a variety of purposes
in different crops, including map construction,
gene tagging and genetic diversity studies.
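The primer layout described above (filler, then CCGG or AATT, then three selective nucleotides) can be written down as a small constructor. This is a sketch under assumptions: the filler and selective sequences are illustrative inputs, and `build_srap_primer` and `ANNEALING_PROGRAM` are hypothetical names. Only the layout and the two-stage annealing temperatures come from the text.

```python
CORE_MOTIF = {"forward": "CCGG", "reverse": "AATT"}

# Annealing stages from the text, as (number of cycles, temperature in C).
ANNEALING_PROGRAM = [(5, 35), (35, 50)]


def build_srap_primer(filler, direction, selective):
    """Assemble filler + core motif + three selective 3' nucleotides."""
    if len(filler) not in (10, 11):
        raise ValueError("filler must be 10 or 11 bases long")
    if len(selective) != 3:
        raise ValueError("exactly three selective nucleotides are expected")
    return filler + CORE_MOTIF[direction] + selective


# Illustrative fillers; the two fillers must differ from each other.
forward = build_srap_primer("TGAGTCCAAA", "forward", "ATA")
reverse = build_srap_primer("GACTGCGTACG", "reverse", "AAT")

print(forward, len(forward))
print(reverse, len(reverse))
print("total cycles:", sum(cycles for cycles, _ in ANNEALING_PROGRAM))
```

Changing only the three selective bases while keeping the filler and motif fixed gives a family of related primers, which is how many primer combinations can be generated from a small set of core sequences.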

Target Region Amplification Polymorphism (TRAP)
The TRAP technique (Hu and Vick 2003) is a
rapid and efficient PCR-based technique, which
utilises bioinformatics tools and EST database
information to generate polymorphic markers,
around targeted candidate gene sequences. The
technique uses two primers (18 nucleotides in
length) to generate markers. One of the primers,
the fixed primer, is designed from the targeted EST
sequence in the database; the second primer is an
arbitrary primer with either an AT- or GC-rich core
to anneal with an intron or exon. As the TRAP
technique can be used to generate markers for
specific gene sequences, it is useful for genotyping germplasm and generating markers associated with desirable agronomic traits in crop plants
for marker-assisted breeding. The technique has
also been effectively used in fingerprinting, in
estimating genetic diversity and mapping QTL.

Single-Strand Conformation
Polymorphism (SSCP)
Single-strand conformation polymorphism is the
mobility shift analysis of single-stranded DNA
sequences on neutral polyacrylamide gel electrophoresis, to detect polymorphisms produced by
differential folding of single-stranded DNA due
to subtle differences in sequence (often a single
base pair) (Orita et al. 1989). In the absence of a
complementary strand, the single strand undergoes intra-strand base pairing, resulting in loops
and folds that give it a unique 3-D structure,
which can be considerably altered by a single
base change, resulting in differential mobility
(Fig. 3.8). SSCP analysis is a powerful tool for assessing the complexity of PCR
products, as the two DNA strands from the same
PCR product often run separately on SSCP gels,
thereby providing opportunities (1) to score a
polymorphism and (2) to resolve internal sequence
polymorphisms in some PCR products from
identical places in the two parental genomes.
PCR-based SSCP analysis is a rapid, simple and
sensitive technique for the detection of various mutations, including single-nucleotide substitutions,
insertions and deletions, in PCR-amplified
DNA fragments. The technique resembles
RFLP analysis in that it can also decipher the allelic variants of inherited and genetic traits. However,
unlike RFLP analysis, SSCP analysis can detect
DNA polymorphisms and mutations at multiple
places in DNA fragments. SSCP gels have
been used to increase the throughput and reliability
of scoring during mapping.

Fig. 3.8 Schematic representation of SSCP: the target gene is PCR-amplified from individuals A and B and denatured to produce single strands (or denatured products from A and B are pooled); sequence differences cause differential folding of the single strands, and the differing conformations lead to differences in gel mobility

Fluorescence-based PCR-SSCP (F-SSCP)
is an adapted version of SSCP analysis involving amplification of the target sequence using
fluorescent primers (Makino et al. 1992). The
major disadvantage of the technique is that the
development of SSCP markers is labour intensive
and costly and cannot be automated.

Transposable Elements (TE)-Based Molecular Markers
Transposons are mobile genetic elements capable
of changing their location in the genome. They
were first discovered in maize. There are two
broad classes of transposable elements, each with
characteristic properties. For all Class I or retroelements, such as retrotransposons, short interspersed nuclear elements and long interspersed
nuclear elements, it is the element-encoded
mRNA, and not the element itself, that forms
the transposition intermediate. This means that
each transposition event creates a new copy of
the transposon, while the original copy remains
intact at the donor site. In contrast, Class II consists of DNA transposons, which change their
location in the genome by a cut-and-paste mechanism. In other words, they excise themselves
from the donor site and reintegrate themselves at
the acceptor site. Transposons can be further subdivided into
subclasses, superfamilies, families and subfamilies based on structural characteristics: the type and orientation of open
reading frames; the presence, orientation, length
and sequence of their terminal repeats; and the
length and sequence of target site duplications
created upon insertion.

Fig. 3.9 Schematic representation of development of (a) IRAP and (b) REMAP primers: (a) outward-facing 5′ and 3′ primers on long terminal repeats (LTRs) for IRAP marker development; (b) one outward-facing LTR primer used together with a primer corresponding to an SSR or microsatellite motif

Retrotransposon-Based Molecular
Markers
In plants with large genomes, retrotransposons
are the major class of repetitive DNA, comprising 40–60% of the genome. Based on their structural organisation and amino acid similarities
among their encoded reverse transcriptases, retrotransposons can be divided into three categories. Long terminal direct repeats (LTRs) flank
two of these categories, and they encode proteins
similar to the retroviruses. These LTR retrotransposons are referred to as the gypsy-like and
copia-like retrotransposons. The third class of
retrotransposons, the LINE1-like or non-LTR
retrotransposons, lacks terminal repeats and
encodes proteins with significantly less similarity
to those of the retroviruses. Retrotransposons
replicate by successive transcription, reverse
transcription and insertion of the new cDNA
copies back into the genome. Copia-like and
gypsy-like retrotransposons are present throughout the plant kingdom. Retrotransposons provide
an excellent opportunity to develop molecular
marker systems (Kalendar et al. 1999) due to their
long, defined, conserved sequences and the new
insertional polymorphisms produced by replicationally active members. The new insertions help
to order insertion events temporally in a lineage and thus can be used to determine pedigrees
and phylogenies. Retrotransposon-based molecular analysis relies on amplification using a
primer corresponding to the retrotransposon and
a primer matching a section of the neighbouring
genome. Sequence-specific amplified polymorphism (S-SAP) relies on amplification of DNA
between a retrotransposon integration site and a
restriction site with a ligated adapter (Waugh
et al. 1997). In inter-retrotransposon amplified
polymorphism (IRAP), DNA between two nearby
retrotransposons or LTRs is amplified (Fig. 3.9).
Retrotransposon-microsatellite amplified polymorphism (REMAP) involves amplification of
fragments which lie between a retrotransposon
insertion site and a microsatellite site (Fig. 3.9).
Retrotransposon-based insertion polymorphism
(RBIP) detects loci occupied by or empty of a
retrotransposon.

Inter-retrotransposon Amplified Polymorphism (IRAP) and REtrotransposon-Microsatellite Amplified Polymorphism (REMAP)
IRAP and REMAP are two amplification-based
marker methods which have been developed
based on the position of given LTRs within the
genome. These two markers have been developed
originally for the BARE-1 retrotransposon of the genus
Hordeum, which is present in the barley genome in
numerous copies. IRAP markers are generated from two LTRs lying close together, using outward-facing primers that anneal to LTR target sequences
(Fig. 3.9). In REMAP, amplification between
an LTR and a nearby simple sequence repeat, such
as a constitutive microsatellite, produces markers
(Fig. 3.9). Both IRAP and REMAP examine polymorphism in retrotransposon insertion sites, IRAP
between retrotransposons and REMAP between
retrotransposons and microsatellites (SSRs).
Retrotransposons can integrate in either orientation into the genome. For head-to-head and tail-to-tail orientations, PCR products can be generated
using a single primer from elements sufficiently
close to one another. Intervening genomic DNA
for elements in head-to-tail orientation is amplified
using both 5′ and 3′ LTR primers. The REMAP
method relies on one outward-facing LTR primer
and a second primer from a microsatellite. Primers
were designed to the (GA)n, (CT)n, (CA)n, (CAC)n
and (GTG)n microsatellites and were
anchored (all but one) to the microsatellite 3′ terminus by the addition of a single selective base at
the 3′ end. In both techniques, polymorphism is
detected by the presence or absence of the PCR
product. Lack of amplification indicates the
absence of the retrotransposon at the particular
locus. As these markers were extremely polymorphic, they can prove useful for evaluating
intraspecific relationships. The Copia-SSR marker
assay, a variant of REMAP, utilises a Ty1-copia-specific
primer along with anchored SSR
primers. The IRAP technique has been used in genome
classification of plant cultivars and to detect similarity between cultivars.

Sequence-Specific Amplification Polymorphism (S-SAP)
The technique was first used to investigate the
location of BARE-1 retrotransposons in the barley
genome (Waugh et al. 1997). In principle, it is a
simple modification of the standard AFLP protocol. The final amplification is performed with
retrotransposon-specific and MseI-adaptor-specific primers. S-SAP has been extensively used
to generate markers to study genetic diversity and
to prepare linkage maps in several plants.
Retrotransposon-Based Insertion
Polymorphism (RBIP)
The technique was first developed using the
PDR1 retrotransposon in the pea (Flavell et al.
1998). It requires the sequence information of
the 5′ and 3′ regions flanking the transposon.
When a primer specific to the transposon is used
together with a primer designed to anneal to the
flanking region, they generate a product from
template DNA containing the insertion. On the
other hand, primers specific to both flanking
regions amplify a product if the insertion is
absent. Polymorphisms can be identified using
standard agarose gel electrophoresis or by
hybridisation with a reference PCR fragment.
Hybridisation is more useful for automated,
high-throughput analysis. RBIP is technically
demanding and somewhat costlier than other methods for detecting transposon insertions.
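The two-reaction logic of RBIP can be expressed as a small decision function. This is a sketch only: the function name and the wording of the calls are hypothetical, and reading products in both reactions as a heterozygote is an interpretation added here rather than a statement from the text.

```python
def call_rbip(occupied_product, empty_product):
    """Score one locus from two PCR outcomes.

    occupied_product: transposon primer + flanking primer gave a product
    empty_product:    the two flanking primers gave a product
    """
    if occupied_product and empty_product:
        # Assumed interpretation: one homologue carries the insertion.
        return "heterozygous"
    if occupied_product:
        return "occupied"   # insertion present
    if empty_product:
        return "empty"      # insertion absent
    return "no call"        # neither reaction worked


# Hypothetical results for four plants: (occupied reaction, empty reaction)
results = {"P1": (True, False), "P2": (False, True),
           "P3": (True, True), "P4": (False, False)}
for plant, outcome in results.items():
    print(plant, call_rbip(*outcome))
```

Because every outcome except a double failure yields a positive signal in at least one reaction, a failed PCR can be told apart from a true absence of the insertion.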
Transposon Display (TD)
TD permits the simultaneous detection of many
TEs from high copy number lines. The technique is a modification of the AFLP procedure
where PCR products are derived from primers
anchored in a restriction site (i.e. BfaI or MseI)
and a transposable element rather than in two
restriction sites (van den Broeck et al. 1998).
Individual transposons are identified by a ligation-mediated PCR that starts from within the
transposon and amplifies part of the flanking
sequence up to a specific restriction site.
The resulting PCR products can be analysed in a
high-resolution polyacrylamide gel system. TD
was first used to reveal the copy number of the
dTph1 transposon family (elements with terminal inverted repeats, TIRs) in petunia and
the related insertion events. It also allows detection
of an insertion that can be correlated with a particular phenotype. It is also possible to exploit
the unique properties of a group of TEs called
miniature inverted repeat transposable elements
(MITEs) using TD technique to develop a new
class of molecular marker for analysing Hbr
transposon family in maize.

Inter-MITE Polymorphism (IMP)


The technique is in principle very similar to
IRAP, except that it uses MITE-like transposons
rather than retrotransposons. MITEs are short,
non-autonomous DNA elements (class II transposons) that are widespread and abundant in plant
genomes and exhibit high copy number and intrafamily homogeneity in size and sequence. Most
of the hundreds of thousands of MITEs identified
to date have been divided into two major groups
on the basis of shared structural and sequence
characteristics: Tourist-like and Stowaway-like.
The IMP technique was first used to identify two
groups of MITEs in barley, one belonging to the
Stowaway family and the other to the Barfly family
(Chang et al. 2001).
It is expected that more marker
systems could be developed based on the features
of transposable elements. However, it would be
desirable to generate markers that are chromosome specific (which would be a herculean task
because of the nature of transposable elements).

Diversity Array Technology (DArT)


DArT operates on the principle that the sample
genomic DNA contains two types of fragments:
(1) constant fragments (found in any representation prepared from a DNA sample) and (2) variable or polymorphic fragments (found in
some but not all of the representations of the DNA
samples). The variable fragments are informative
because they reflect sequence variation that
determines the fraction of the original DNA
sample that is included in the representation.
The variable fragments are therefore called DArT
markers. Their presence or absence in a genomic
representation is assayed by hybridising the representation to a DArT array consisting of a library
of that given sample. Thus, DArT consists of the
following sequence of steps: complexity reduction of the sample DNA, library creation, microarraying of the libraries onto glass slides, hybridisation
of fluorescently labelled DNA onto the slides, scanning of
the slides for hybridisation signal, and data extraction and analysis (http://www.diversityarrays.com/molecularprincip.html).
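The distinction between constant and variable fragments amounts to a simple rule applied to a presence/absence table. The sketch below is illustrative only: the clone names, sample names and scores are invented, and it assumes the hybridisation signals have already been converted to 0/1 calls, a step that is not modelled here.

```python
def classify_clones(scores):
    """Classify each array clone from its 0/1 calls across samples.

    scores: {clone_id: {sample_name: 0 or 1}}
    """
    classes = {}
    for clone, calls in scores.items():
        present = sum(calls.values())
        if present == len(calls):
            classes[clone] = "constant"      # found in every representation
        elif present == 0:
            classes[clone] = "absent"        # no sample hybridised
        else:
            classes[clone] = "polymorphic"   # candidate DArT marker
    return classes


# Hypothetical presence/absence calls for three clones in three samples.
SCORES = {
    "clone_1": {"A": 1, "B": 1, "C": 1},
    "clone_2": {"A": 1, "B": 0, "C": 1},
    "clone_3": {"A": 0, "B": 0, "C": 0},
}
print(classify_clones(SCORES))
```

Only clones scored as present in some samples and absent in others are kept as markers; each is a dominant, presence/absence marker.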

Intron-Targeted Intron–Exon Splice Conjunction (IT-ISJ) Marker
Weining and Langridge (1991) considered that
gene promoter regions, intron–exon splice conjunction sites and 3′ poly-A addition sites in primary RNA are all closely
linked with the targeted genes, so they can be used
to design PCR primers. Based on the conserved sequences of intron–exon junctions,
Weining and Langridge (1991) designed ISJ
(intron–exon splice junction) primers for
amplifying introns or exons and used the ISJ
primer PCR products to analyse genomic DNA
in wheat and barley. They found that the ISJ
primers alone produced smeared bands, but ISJ primers in conjunction with random primers and specific
primers produced clear bands. The core part of the
forward primers included the 5′ splice junction conserved sequence GAGGTAAGT, which was
supplemented with the recognition sequence of the restriction endonuclease SphI,
GCATGC, at the 5′ end
and with three selective bases at the 3′ end. The core
part of the reverse primers included the 3′ splice junction
conserved sequence ACCTGCA, which was
supplemented with the recognition sequence of the restriction endonuclease EcoRI,
GAATTC, at the 5′ end
and three selective bases at the 3′ end. In order to
determine the value of the IT-ISJ marker
in genetic map construction, different IT-ISJ
primer combinations were used to genotype a
recombinant inbred line population developed
from upland cotton, and a genetic map was
constructed.

Restriction Site Associated DNA (RAD) Markers

RAD can identify and type a large number of
markers on a resource that is easy to produce for
both model and non-model organisms (Baird
et al. 2008). These markers were first employed
to rapidly map a recombination breakpoint in the
model organism, Drosophila melanogaster, using
a pre-existing genomic tiling path microarray.
The procedure of RAD marker development is
explained in Fig. 3.10.

Fig. 3.10 Schematic representation of RAD marker development: DNA samples A and B are digested with restriction enzymes and linkers are ligated; the restriction products are physically sheared; the RAD tags are purified and released; and the tags are labelled and hybridised to identify or type RAD markers (the legend marks recognition sites and mutations at recognition sites)

RNA-Based Molecular Markers

As an alternative to DNA, other types of nucleic acids,
such as RNA, have also been used as templates to
develop special kinds of molecular markers. For
example, PCR-based marker techniques such
as complementary DNA-AFLP (cDNA-AFLP),
cDNA-SSCP and RNA fingerprinting by arbitrarily
primed PCR (RAP-PCR) are used as markers.


cDNA-AFLP
The cDNA-AFLP is a novel RNA fingerprinting
technique to display differentially expressed
genes (Bachem et al. 1996). The methodology
includes digestion of cDNAs by two restriction
enzymes followed by ligation of oligonucleotide
adapters and PCR amplification using primers
complementary to the adapter sequences with
additional selective nucleotides at the 3′ end. The
cDNA-AFLP technique is more stringent and
reproducible than RAP-PCR. In contrast to
hybridisation-based techniques, such as cDNA
microarrays, cDNA-AFLP can distinguish
between highly homologous genes from individual gene families. Further, there is no requirement of any pre-existing sequence information in
cDNA-AFLP; thus, it is valuable as a tool for the
identification of novel process-related genes such
as stress-regulated genes.

RNA Fingerprinting by Arbitrarily Primed PCR (RAP-PCR)
The RAP-PCR technique (Welsh et al. 1992)
involves fingerprinting of RNA populations using
an arbitrarily selected primer at low stringency for
first- and second-strand cDNA synthesis, followed
by PCR amplification of the cDNA population. The
method requires nanograms of total RNA and is
unaffected by low levels of genomic DNA contamination. Differential PCR fingerprints are
detected for RNAs from the same tissue isolated
from different individuals and for RNAs from
different tissues from the same individual. The
individual-specific differences revealed are due to
sequence polymorphisms and are useful for genetic
mapping of genes. The tissue-specific differences
revealed are useful for studying differential gene
expression.

cDNA-SSCP
The SSCP analysis of RT-PCR products can
be used to evaluate the expression status (presence and relative quantity) of highly similar
homologous gene pairs from a polyploid
genome. Replicated tests show that cDNA-SSCP
reliably separates duplicated transcripts with
99% sequence identity (Cronn and Adams 2003).
This technique has been used to gain remarkable
insight into the global frequency of silencing in
synthetic and natural polyploids.

Role of Genomics
Genomics has brought an innovative level of
hope to development of novel types of markers
and unravelling the secrets of complex traits.
Genome and/or gene sequences themselves have
the potential to provide a comprehensive list of
the markers in an organism. Functional genomics
approaches can then be used to generate information about gene function, as well as data on
genetic interactions, not only among and between
gene complexes but also in response to environmental stimuli. At present, microarray technology
(see Box 3.4) is providing the most comprehensive assessment of gene function and variation. Our
ability to view the transcription of the genome is
improving rapidly, and as a result, the potential to
dissect complex traits is also developing. Already,
array technology has been instrumental in identifying groups of co-expressed genes in various
physiological states, including stages of development and disease. Although array technology is
valuable, these data are not conclusive or comprehensive as regards gene function and only provide one more piece (i.e. transcriptional profile) of
the puzzle. The translation of genes into proteins
is another key step in gene action, and it will be
essential to subject protein synthesis, as well as
protein interaction, to the same genome-wide
analysis to understand how genotype can influence
a complex phenotype. In other words, how the
growing collections of data at the DNA, RNA,
protein and metabolite levels can be combined to
dissect complex traits and diseases remains to be
seen. It has been proposed that the power available
through the merger of genetics and genomics
(called genetical genomics or eQTL; discussed
in chapter 7) might lead to further unravelling
of metabolic, regulatory and developmental


Box 3.4 Techniques Used to Find DNA Variations

Finding the polymorphic marker is the key
factor that decides the success of linkage mapping. Identifying polymorphism relies on the
efficient discrimination of DNA markers
generated from the individuals. Usually, the
markers are classified as monomorphic or
polymorphic using techniques such as gel and
capillary electrophoresis, microarray and
TILLING.

Gel Electrophoresis
The term electrophoresis describes the
migration of charged particles under the
influence of an electric field. Gel electrophoresis is the technique in which molecules are
forced across a span of gel, driven by an electrical current. On either end of the gel, there
are activated electrodes that provide the driving force. A molecule's properties
(especially size, charge (the possession of ionisable groups) and conformation) therefore determine
how rapidly an electric field can move the
molecule through a gelatinous medium or
matrix. For DNA, the important factors are the length
and conformation of the molecule; smaller
molecules travel farther.
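Because smaller fragments travel farther, the size of an unknown band is usually estimated by comparing its migration distance with that of a size ladder run on the same gel. The sketch below assumes the common approximation that log10(fragment size) changes roughly linearly with distance between neighbouring ladder bands. The ladder values are invented and `estimate_size` is a hypothetical helper, not a method described in this book.

```python
import math


def estimate_size(distance_mm, ladder):
    """Estimate fragment size (bp) by semi-log interpolation.

    ladder: list of (size_bp, distance_mm) pairs for the marker lane.
    """
    points = sorted(ladder, key=lambda point: point[1])
    for (size_1, dist_1), (size_2, dist_2) in zip(points, points[1:]):
        if dist_1 <= distance_mm <= dist_2:
            fraction = (distance_mm - dist_1) / (dist_2 - dist_1)
            log_size = math.log10(size_1) + fraction * (
                math.log10(size_2) - math.log10(size_1))
            return 10 ** log_size
    raise ValueError("band lies outside the range covered by the ladder")


# Hypothetical ladder: larger fragments have migrated a shorter distance.
LADDER = [(1000, 20.0), (500, 28.0), (100, 45.0)]

for distance in (25.0, 40.0):
    print(distance, round(estimate_size(distance, LADDER)))
```

Two bands are scored as polymorphic only if their estimated sizes differ by more than the resolution of the gel system in use.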

Agarose and Polyacrylamide Gel Electrophoresis
The matrix is composed of either agarose or polyacrylamide, each of which has attributes
suitable to particular tasks. Agarose is a polysaccharide extracted from seaweed. It is
simple to prepare and is typically used at
concentrations of 0.5–2% to resolve 100 bp
to 15 kb DNA fragments. The higher the
agarose concentration, the stiffer the gel and
the smaller the DNA fragments that can be resolved.
Polyacrylamide gels are chemically cross-linked gels formed by the polymerisation of
acrylamide with a cross-linking agent, N,N′-methylenebisacrylamide (Bis). The reaction is
a free radical polymerisation, carried out with
ammonium persulfate as the initiator and
N,N,N′,N′-tetramethylethylenediamine (TEMED)
as the catalyst. The length of the polymer
chains is dictated by the concentration of acrylamide used, which is typically between 3.5
and 20%. Polyacrylamide gels are significantly
more laborious to prepare than agarose gels.
Because oxygen inhibits the polymerisation
process, they must be poured between glass
plates (or cylinders). Polyacrylamide gels have
a rather small range of separation but very high
resolving power. Polyacrylamide is used for
separating DNA fragments of less than 500 bp;
under appropriate conditions, fragments differing in length by
a single base pair are easily resolved. Small
DNAs or RNAs (smaller than 100 bp) are better separated by polyacrylamide gels; however,
2–3% agarose gels may be adequate to separate even 50 bp fragments from much larger
nucleic acids.
DNA electrophoresis is arguably the most
commonly performed molecular assay over
the past 50 years. The technique was initially
borrowed from protein and RNA techniques
rather than primarily developing through
design of optimised methods. It generally
employs suboptimal buffers having high ionic
strength, conductance and electric field
strength. Excessive joule heating limits the
tolerable applied voltage and the speed of
electrophoretic separation.
A number of buffers are used for agarose electrophoresis, the most common being
Tris/Acetate/EDTA (TAE), Tris/Borate/EDTA
(TBE) and lithium borate (LB). TAE has the
lowest buffering capacity but provides the best
resolution for larger DNA; this requires a lower
voltage and more time. LB is relatively new and is
ineffective in resolving fragments larger than
5 kb (Brody et al. 2004). However, with its
low conductivity, a much higher voltage can
be used (up to 35 V/cm), which means a
shorter analysis time for routine electrophoresis.
A size difference as small as one base pair can
be resolved in a 3% agarose gel with an
extremely low conductivity medium such as
1 mM lithium borate.
Thus, recent modifications of DNA electrophoresis eliminated sodium EDTA and substituted alkali metal cations for Tris. Lithium was
preferred over other alkali metal cations for its
large shell of hydration and low electrokinetic
mobility, which provided lower conductance,
improved tolerance for applied voltage, lower
heat generation and improved separation quality. Compared to Tris/Borate/EDTA (TBE),
the alkali metal ion media decreased the conductivity, lowered the final running temperature and reduced the time for electrophoretic
separation. In general, TAE buffer (Tris/
Acetate/EDTA) is the most commonly used
agarose gel electrophoresis buffer. TAE has the
lowest buffering capacity and offers the best
resolution for larger DNA. However, TAE
requires a lower voltage and more time.
Alternatively, TBE buffer (Tris/Borate/EDTA)
is often used for smaller DNA fragments (i.e.
less than 500 bp). Sodium borate (SB) buffer
can also be used because its low conductivity allows higher voltages (up to 35 V/cm)
during electrophoresis. This
allows a shorter analysis time for routine electrophoresis. However, it is ineffective for
resolving fragments larger than 5 kb.

MetaPhor Agarose Gel Electrophoresis


MetaPhor Agarose is a high-resolution agarose which is considered as an alternative to
polyacrylamide. MetaPhor Agarose is an
intermediate melting temperature (75 °C) agarose with twice the resolution capabilities of
the finest-sieving agarose products. Using
submarine gel electrophoresis, PCR products
and small DNA fragments (20–800 bp) that
differ in size by 2% can be resolved. Of late,
this has been widely employed in SSR marker
analysis, since polyacrylamide involves
expensive and laborious protocols.

Temperature Gradient Gel Electrophoresis
Temperature Gradient Gel Electrophoresis
(TGGE) is a powerful technique to separate
DNA fragments of identical length. In contrast
to conventional electrophoresis methods, molecules are separated by their melting behaviour. Thus, it becomes possible to separate
DNA fragments according to their primary
sequence. To understand TGGE, there are two
fundamental points. The first is how the structure of DNA changes with temperature; the
second is how these changes in structure affect
the movement of DNA through a gel. As temperature rises, the two strands of the DNA
start to unwind. At some high temperature, the
two strands will completely separate. However,
at some intermediate temperature, the two
strands will be partly separated, with part of
the molecule still double stranded and part
single stranded. What makes TGGE useful is
that the mobility of the DNA molecule through
the gel decreases drastically when these partially melted structures are formed, and most
important, the exact temperature at which this
occurs depends on sequence; thus, TGGE
offers a sequence-dependent, size-independent method for separating DNA molecules.
A very simple but realistic analogy is to consider a person moving through a crowded
room; when you extend your arms out, your
movement through the room slows drastically,
even though your mass has not changed.
Denaturing gradient gel electrophoresis
(DGGE) works in the same principle. However,
the difference is a small sample of DNA is
applied to an electrophoresis gel that contains
a denaturing agent. It has been shown that certain denaturing gels are capable of inducing
DNA to melt at various stages. As a result of
(continued)

Role of Genomics

73

Box 3.4 (continued)

this melting, the DNA spreads through the gel


and can be analysed for single components,
even those as small as 200700 bp.

Pulsed Field Gel Electrophoresis
In 1984, Schwartz and Cantor described pulsed field gel electrophoresis (PFGE), introducing a new way to separate DNA. In particular, PFGE resolved extremely large DNA in agarose, from 30–50 kb to 10 Mb. During continuous field electrophoresis, DNA above 30–50 kb migrates with the same mobility regardless of size. This is seen in a gel as a single large diffuse band. If, however, the DNA is forced to change direction during electrophoresis, different-sized fragments within this diffuse band begin to separate from each other. With each reorientation of the electric field relative to the gel, smaller DNA will begin moving in the new direction more quickly than larger DNA. Thus, the larger DNA lags behind, providing a separation from the smaller DNA. Currently, there are three models that attempt to describe the behaviour of DNA during PFGE: the biased reptation model (BRM), the chain model and, most recently, the bag model.

Capillary Electrophoresis
Capillary electrophoresis has largely replaced the use of gel separation techniques due to significant gains in workflow, throughput and ease of use. Fluorescently labelled DNA fragments are separated according to molecular weight, and the process can be automated since it does not involve gel casting. During capillary electrophoresis, the PCR products or DNA enter the capillary as a result of electrokinetic injection. A high-voltage charge applied to the buffered sequencing reaction forces the negatively charged fragments into the capillaries. The extension products are separated by size based on their total charge. The electrophoretic mobility of the DNA sample can be affected by the run conditions: the buffer type, concentration and pH; the run temperature; the amount of voltage applied; and the type of polymer used. Shortly before reaching the positive electrode, the fluorescently labelled DNA fragments, separated by size, move across the path of a laser beam. The laser beam causes the dyes on the fragments to fluoresce. An optical detection device detects the fluorescence, and the signal is converted into data.

Microarray
Microarrays can be used to find polymorphic SNP or SFP markers. A microarray works by exploiting the ability of a fluorescently labelled DNA fragment to bind (or hybridise) specifically to the markers (predefined DNA templates) arranged in a regular pattern on a small chip. Depending on the strength or degree of binding/hybridisation, the colour intensity varies, and it is used to generate the data. The major advantage of microarrays is that several DNA samples can be analysed in a single experiment and thousands of data points can be generated.

TILLING
TILLING (Targeting Induced Local Lesions IN Genomes) is a reverse genetics process that relies on the ability of a special enzyme to detect mismatches in normal and mutant (or polymorphic) DNA strands when they are annealed. By selectively pooling the DNA and amplifying with fluorescently labelled primers, mismatched heteroduplexes are generated between wild-type and mutant DNA. The heteroduplexes are incubated with the plant endonuclease CEL-I (which cleaves heteroduplexes at mismatched sites), the resultant products are visualised on a capillary sequencer and the fluorescently labelled traces are analysed. The differential end labelling of the amplification products permits the two cleavage fragments to be observed and the position of the mismatch or polymorphism to be identified. When a mutation/polymorphism is detected in the pooled DNA, the individual DNA samples are sequenced to identify the specific plant carrying the polymorphism (McCallum et al. 2000).

Fig. 3.11 Evolutionary and historical perspectives of molecular markers [timeline figure running from the classical era (morphological variants, pre-1950s; gel electrophoresis, 1950s; allozymes, 1960s) through the pre-PCR era (restriction, 1968; Southern blotting, 1975; RFLPs, 1980s) and PCR (1987) to the genomics era (SNPs on chips, after 2000)]

pathways, but rigorous investigations still need to be completed. What is clear, however, is that genomic technology is emerging in such a way that it will supply quantities of data that require detailed statistical and mathematical analyses.

Selection of Marker Technology
As science advances, several classes of marker technologies have been identified. Figure 3.11 describes the evolutionary and historical perspectives of the marker systems. An obvious problem that usually arises is how to choose the most appropriate marker among the myriad of different marker technologies. In general, the choice of a molecular marker technique has to be a compromise between reliability and ease of analysis, statistical power and confidence of revealing polymorphisms. Thus, before selecting the marker technology, the following should be finalised.

Research Problem
This is the key question that needs to be solved before choosing the right marker technology. Thus, the first step is to finalise the biological question one wants to answer with the research. For instance, for information on population history or phylogenetic relationships, sequence data or restriction site data should be used. In order to construct a saturated linkage map (i.e. approximately one marker in every 1 cM distance), a combination of SSRs and AFLPs needs to be selected.

Quality of DNA
RFLP analysis requires large amounts of pure, high-quality DNA. Most PCR-based methods require only tiny quantities of DNA. In many cases, PCR is performed only to amplify the original amount of target DNA. Hence, the marker technology should also be selected in keeping with the available facilities and resources.

The Number of Loci and/or Alleles
The next critical question in this context is: will information from a few loci be sufficient, or is greater genome coverage required? Isozymes are usually limited in number. AFLPs detect high numbers of loci. Where hyper-variability is required, the best techniques are those based on single-locus SSRs.

Discrimination Level
Further, it is also important to decide at what taxonomic level the genetic variation is being measured: within populations, between species or between genera. Is the selected method appropriate for detecting the desired level of variation? SSRs can provide sufficient variation between genera; however, to generate the same degree of variation between species, it is better to use SNPs.

Mode of Inheritance
Other questions relate to the inheritance of markers in the segregating progenies and also need to be addressed before selecting the marker system. Should both homozygotes and heterozygotes be identified? Are co-dominant markers needed (single-locus RFLPs, isozymes, SSRs), or will dominant markers suffice (RAPD, AFLP)? If presence versus absence information is sufficient, then any molecular marker technology can be used; but if information about heterozygotes is needed (e.g. population and diversity structure, knowledge on type of inheritance), then co-dominant markers such as isozymes or microsatellites should be used.
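To make the distinction concrete, the following minimal Python sketch scores the same hypothetical F2 genotypes with a co-dominant and a dominant marker. The genotype labels, the A/H/B scoring codes and the function names are illustrative assumptions rather than the conventions of any particular software, and the sketch assumes that the dominant allele A is the one producing the band.

```python
# Illustrative sketch only: genotype labels and scoring codes are assumptions.

def score_codominant(genotype):
    """Co-dominant markers distinguish all three genotype classes."""
    codes = {"AA": "A", "Aa": "H", "aa": "B"}  # H = heterozygote
    return codes[genotype]


def score_dominant(genotype):
    """Dominant markers report only band presence (1) or absence (0).

    Assumes allele A produces the band, so AA and Aa look identical.
    """
    return 0 if genotype == "aa" else 1


# Hypothetical F2 individuals
f2_genotypes = ["AA", "Aa", "Aa", "aa"]

codominant_scores = [score_codominant(g) for g in f2_genotypes]
dominant_scores = [score_dominant(g) for g in f2_genotypes]

print(codominant_scores)  # ['A', 'H', 'H', 'B']
print(dominant_scores)    # [1, 1, 1, 0]
```

With the dominant marker, the AA and Aa individuals receive the same score, which is why heterozygotes cannot be identified from presence versus absence data alone.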

Expertise Required
Techniques involving hybridisation or manual sequencing are technically demanding, whereas RAPDs or SSRs (once the primers are available) are the least demanding techniques. Thus, the availability of expertise also decides the selection of marker technology. Further, access to laboratory facilities and equipment, and manpower with a good grasp of many basic laboratory skills, are also required to choose the appropriate marker technology.

Costs
In terms of costs, isozymes are the cheapest; RAPD, RFLP and even AFLP are intermediate; sequencing and SNP analysis are still more expensive. The costs of all types of experiments should be considered, because lack of reproducibility of some markers may, in the end, result in higher costs. For required skills, a visit to another laboratory where the relevant techniques are being used can provide invaluable information. Of late, costs for sequencing experiments have significantly decreased. Many ESTs are already available for several species. Microarrays, based on either anonymous genomic characterisation or gene expression, are becoming common. Microarray technology is still very demanding, technically and financially (in terms of equipment and consumables). Before deciding on it, get acquainted with the techniques, requirements and outputs. A better option might be to consider outsourcing of sample analysis. SNPs are being routinely used in human studies. They are still too expensive for standard applications to genetic diversity studies in plants. Nevertheless, SNPs reveal the ultimate level of variation in the DNA sequence (the nucleotides), and they would be the future's best molecular marker option when their costs of discovery and application decrease.

Speed
Further, it is necessary to decide how quickly the data are needed and how much time the equipment will allow. PCR-based methods certainly give fast results when primers are available. Hybridisation-based methods are slower. Conventional DNA sequencing is slow, whereas automated sequencing is faster.

Reproducibility
Yet another critical question is whether robust methods are required. For example, will the markers be exchanged? Is more than one laboratory involved? Isozymes, RFLPs, SSRs and sequencing are robust, whereas RAPD is not.

PCR Versus Non-PCR Techniques
PCR-based molecular marker techniques open up numerous possibilities and could be considered first, because of their simplicity. Hybridisation-based techniques are more labour intensive, hazardous and technically demanding and require costly equipment. Thus, PCR-based techniques can be explored. To this end, the following points may be considered:
• RAPD is an excellent technique by which to become familiar with PCR. It allows rapid examination of polymorphisms in most, if not all, species of interest since primers are readily available.
• Other PCR-based markers such as SSR could be applied relatively easily, if primers are already available in the given species. Strategies for searching appropriate primers are also improving, and some approaches for searching putative microsatellites rely on sequence databases, circumventing the problem of having to make and screen libraries in the laboratory.
• AFLPs have become a very popular option, although their need for a double PCR and vertical gel electrophoresis makes them more expensive and technically more demanding. However, this is the only PCR-based technique that helps in constructing a saturated linkage map.
In summary, the three key factors that assess the utility of DNA markers in genetic mapping are:
1. The informativeness of a genetic marker: It is measured by the number of alleles and allele frequencies. There are two measures of informativeness: heterozygosity and polymorphic information content (discussed in Box 3.3).
2. The throughput of a genetic marker: It is the multiplex ratio, that is, the number of simultaneously assayed loci.
3. Genotyping error: It affects the reproducibility of the marker assay and the clarity of the marker genotypes.

Marker Genotyping and Scoring
Once the appropriate marker technology (or technologies) is selected, it first needs to be employed in a parental polymorphism survey. It is vital to identify as many polymorphic markers as possible since only those polymorphic markers will be used to construct the linkage map. In order to construct a saturated linkage map, it is essential to find polymorphic markers that span the whole genome. As a general rule, a preliminary linkage map needs markers in every 10-cM interval, whereas a saturated linkage map requires markers in every 1-cM interval. The number of markers required to construct such a preliminary or saturated linkage map varies depending on the marker system and plant species. For example, in cotton (Gossypium spp.), SSRs provide 8–30% polymorphism between the interspecific parents. Tanksley and McCouch (1997) suggested that once a map of 5,125 cM reaches a density of about one marker per 5 cM, or a total of about 1,025 marker loci, the map should link up into 26 linkage groups corresponding to the 26 gametic chromosomes of tetraploid cotton. He et al. (2007) have published such a map with F2 and F2:3 populations (G. hirsutum × G. barbadense), which includes 1,029 genetic loci mapped to 26 linkage groups that covered 5,472.3 cM with an average distance of 5.31 cM between loci. In some polyploid species such as sugarcane, identifying polymorphic markers is more complicated. In such cases, the mapping of diploid relatives of polyploid species can be of great benefit in developing maps for polyploid species. However, diploid relatives do not exist for all polyploid species. A general method for the mapping of polyploid species is based on the use of single-dose restriction fragments.
In all cases, thus, it is essential to identify a sufficient number of polymorphic markers that span all the chromosomes of the given species. These polymorphic markers are to be surveyed across the progenies of the given mapping population (and, if possible, across F1 hybrids). This is known as marker genotyping of the population. Therefore, DNA must be extracted from each individual of the given mapping population when DNA markers are used.
The segregation of these polymorphic markers in the progenies is then scored for parental or recombinant behaviour. Markers that are close together or tightly linked will be transmitted together from parent to progeny more frequently than markers that are located further apart. In a segregating population, there is a mixture of parental and recombinant genotypes. The expected segregation ratios for co-dominant and dominant markers (Table 3.3) are compared with the actual ratios found in the experimental population.

Table 3.3 Expected segregation ratios for different marker systems in different population types

                                        Genetic segregation ratio
Population type                         Co-dominant markers         Dominant markers
                                        (e.g. RFLP, SSR, CAPS)      (e.g. RAPD, AFLP, ISSR)
F2 progenies                            1:2:1 (AA:Aa:aa)            3:1 (B_:bb)
Backcross progenies, BC1                1:1 (Cc:cc)                 1:0 (D_)
Backcross progenies, BC2                1:1 (Ee:ee)                 1:1 (Ff:ff)
Recombinant inbred lines, doubled       1:1 (GG:gg)                 1:1 (HH:hh)
haploid lines or near isogenic lines
Significant deviations from expected ratios can be analysed using chi-square tests (discussed below). Generally, markers will segregate in a Mendelian fashion, although distorted segregation ratios may be encountered in certain populations. The frequency of recombinant genotypes can be used to calculate recombination fractions, which may be used to infer the genetic distance between markers. By analysing the segregation of markers, the relative order and distances between markers can be determined: the lower the frequency of recombination between two markers, the closer they are situated on a chromosome (conversely, the higher the frequency of recombination between two markers, the further apart they are situated on a chromosome). Markers that have a recombination frequency of 50% are described as 'unlinked' and are assumed to be located far apart on the same chromosome or on different chromosomes. Mapping functions are used to convert recombination fractions into map units called centimorgans (cM). For a more detailed explanation of linkage mapping, refer to Chap. 4.
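The conversion from recombination fraction to map distance can be sketched in a few lines of Python. The text does not name a specific mapping function at this point; the Haldane and Kosambi functions used below are two widely used choices and are shown here as an assumption. The recombinant counts are hypothetical, and counting recombinants directly in this way assumes a population in which each recombinant individual can be recognised (e.g. a backcross or DH population).

```python
import math


def recombination_fraction(n_recombinant, n_total):
    """Fraction of progeny showing a recombinant genotype for two markers."""
    return n_recombinant / n_total


def haldane_cm(r):
    """Haldane mapping function (assumes no crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5); r = 0.5 means unlinked")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi mapping function (allows for crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5); r = 0.5 means unlinked")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Hypothetical data: 10 recombinants among 100 progeny
r = recombination_fraction(10, 100)

print(round(haldane_cm(r), 2))  # 11.16
print(round(kosambi_cm(r), 2))  # 10.14
```

For small recombination fractions both functions give a distance close to 100 × r cM; they diverge as r approaches 0.5, where markers behave as unlinked and no finite distance can be assigned.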

Analysing the Genotype Score: Chi-Square Test
The genetic segregation ratio at a given marker locus is jointly determined by the nature of the marker (dominant/co-dominant; see Box 3.3) and the type of mapping population (Table 3.3). Therefore, a thorough understanding of the nature of markers and mapping populations is crucial for any mapping project. Markers such as RFLPs, microsatellites and CAPS are co-dominant in nature, while AFLP, RAPD and ISSR are often scored as dominant markers. Mapping populations such as RILs and DHs equalise marker types because the parental alleles at a marker locus are fixed in the homozygous condition. These populations result in a 1:1 segregation ratio at a marker locus irrespective of the genetic nature of the markers. In contrast, an F2 population segregates in a 1:2:1 ratio for a co-dominant marker and in a 3:1 ratio for a dominant marker (refer to Table 3.3 for other types of segregation). Depending upon the segregation pattern, statistical analysis of marker data will vary.
Significant deviation from the expected segregation ratio in a given marker-population combination is referred to as segregation distortion. There are several reasons for segregation distortion. It may be due to gamete/zygote lethality, meiotic drive/preferential segregation, sampling/selection during population development and differential responses of parental lines to tissue culture in the case of DHs. Segregation distortion can also be specific to some markers in an otherwise normal mapping population. It is therefore important to test the goodness of fit of the segregation ratio for each individual marker locus and, if necessary, to eliminate markers showing a high degree of segregation distortion from the analysis.

χ² Test to Analyse the Segregation Ratio Using the Program AntMap
The chi-square (χ²) test is the most commonly used statistical analysis to test the hypothesis concerning the frequency distribution or segregation pattern in genetics.

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency. Compare the computed χ² value with the tabulated χ² value. Reject the hypothesis of goodness of fit to the given ratio if the computed χ² value exceeds the corresponding tabulated χ² value at the given level of significance (i.e. 1% or 5%). The chi-square test can be done using the program AntMap. This program is freely available at http://lbm.ab.a.u-tokyo.ac.jp/~iwata/antmap/. The following simple steps are sufficient to perform chi-square analysis using AntMap (for further advanced analyses, refer to the tutorial given on the same website).
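Before turning to the program, the calculation itself can be illustrated by hand. The following is a minimal Python sketch, not AntMap's own code, that computes the chi-square statistic for a co-dominant marker in an F2 population (expected ratio 1:2:1) and compares it with the tabulated critical values. The observed counts and the function and variable names are hypothetical; the critical values are the standard tabulated ones for 1 and 2 degrees of freedom.

```python
# Minimal sketch of a goodness-of-fit test for a segregation ratio.
# Observed counts below are hypothetical.

def chi_square(observed, expected_ratio):
    """Sum of (O - E)^2 / E over all genotype classes."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * part / ratio_sum for part in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Tabulated critical values: {degrees of freedom: {significance level: value}}
CRITICAL_VALUES = {
    1: {0.05: 3.841, 0.01: 6.635},
    2: {0.05: 5.991, 0.01: 9.210},
}

observed_f2 = [22, 55, 23]   # counts of AA, Aa, aa in 100 F2 plants
expected_ratio = [1, 2, 1]   # co-dominant marker in an F2 population

statistic = chi_square(observed_f2, expected_ratio)
df = len(observed_f2) - 1    # number of classes minus one
reject_at_5_percent = statistic > CRITICAL_VALUES[df][0.05]

print(round(statistic, 2))   # 1.02
print(reject_at_5_percent)   # False
```

Here the computed value (1.02) is well below the tabulated value for 2 degrees of freedom at the 5% level (5.991), so the hypothesis of a 1:2:1 segregation is not rejected for this hypothetical marker.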

Step 1: Open an Input File
Open an input file in MapMaker format (*.raw) through the File-Open menu. Refer to Chap. 4 for how to prepare a *.raw file. After opening the file, its contents will appear in the Data panel. When the Log tab is clicked, you can see a summary of the input data.

Step 2: Segregation Ratio Test
Select Segregation Test from the Analysis menu. The results of the segregation ratio tests then appear in the Result panel. If the P value is <0.01, the marker is flagged with ** (highly significant); for a P value of 0.01–0.05, it is flagged with * (significant). In other words, a flagged marker deviates from the hypothesised segregation ratio at the 1% (**) or 5% (*) level of significance.
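The flagging rule described in Step 2 can be mimicked with a short Python sketch. This is an illustration under stated assumptions and not AntMap's implementation: the function names are hypothetical, and the P value is computed with the closed-form chi-square tail probabilities for 1 and 2 degrees of freedom only, which cover the 1:1, 3:1 and 1:2:1 ratios of Table 3.3.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail probability of the chi-square distribution.

    Closed forms exist for df = 1 and df = 2; other values are not
    handled in this sketch.
    """
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("this sketch only supports df = 1 or df = 2")


def significance_flag(p_value):
    """Return '**' for P < 0.01, '*' for P of 0.01-0.05, '' otherwise."""
    if p_value < 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    return ""


# Hypothetical chi-square value for a co-dominant F2 marker (df = 2)
p = chi_square_p_value(12.5, 2)
print(significance_flag(p))  # prints **
```

A marker flagged in this way shows segregation distortion and is a candidate for removal before map construction, as discussed above.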

Bibliography
Literature Cited
Bachem CWB, van der Hoeve RS, de Bruijn SM, Vreugdenhil D, Zabeau M, Visser RGF (1996) Visualisation of differential gene expression using a novel method of RNA fingerprinting based on AFLP: analysis of gene expression during potato tuber development. Plant J 9:745–753
Baird NA, Etter PD, Atwood TS, Currey MC, Shiver AL et al (2008) Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLoS One 3(10):e3376. doi:10.1371/journal.pone.0003376

Botstein D, White RL, Skolnick M, Davis RW (1980) Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet 32:314–333
Brody JR, Calhoun ES, Gallmeier E, Creavalle TD, Kern SE (2004) Ultra-fast high-resolution agarose electrophoresis of DNA and RNA using low-molarity conductive media. Biotechniques 37(4):598–602
Caetano-Anollés G, Bassam BJ (1993) DNA amplification fingerprinting using arbitrary oligonucleotide primers. Appl Biochem Biotechnol 42:189–200
Chang RY, O'Donoughue LS, Bureau TE (2001) Inter-MITE polymorphisms (IMP): a high throughput transposon-based genome mapping and fingerprinting approach. Theor Appl Genet 102:773–781
Cronn RC, Adams KL (2003) Quantitative analysis of transcript accumulation from genes duplicated by polyploidy using cDNA-SSCP. Biotechniques 34:726–734
Flavell AJ, Knox M, Pearce SR, Ellis THN (1998) Retrotransposon based insertion polymorphisms (RBIP) for high throughput marker analysis. Plant J 16:643–665
He DH, Lin ZX, Zhang XL, Nie YC, Guo XP, Zhang YX, Li W (2007) QTL mapping for economic traits based on a dense genetic map of cotton with PCR-based markers using the interspecific cross of Gossypium hirsutum × Gossypium barbadense. Euphytica 153(1):181–197
Hu J, Vick BA (2003) Target region amplification polymorphism: a novel marker technique for plant genotyping. Plant Mol Biol Rep 21:289–294
Huang J, Sun M (1999) A modified AFLP with fluorescence labelled primers and automated DNA sequencer detection for efficient fingerprinting analysis in plants. Biotechnol Tech 14:277–278
Jordan SA, Humphries P (1994) Single nucleotide polymorphism in exon 2 of the BCP gene on 7q31-q35. Hum Mol Genet 3:1915
Kalendar R, Grob T, Regina M, Suoniemi A, Schulman A (1999) IRAP and REMAP: two new retrotransposon-based DNA fingerprinting techniques. Theor Appl Genet 98:704–711
Komori T, Nitta N (2005) Utilization of CAPS/dCAPS method to convert rice SNPs into PCR-based markers. Breed Sci 55:93–98
Li G, Quiros CF (2001) Sequence-related amplified polymorphism (SRAP), a new marker system based on a simple PCR reaction: its application to mapping and gene tagging in Brassica. Theor Appl Genet 103:455–546
Makino R, Yazyu H, Kishimoto Y, Sekiya T, Hayashi K (1992) F-SSCP: fluorescence-based polymerase chain reaction single-strand conformation polymorphism (PCR-SSCP) analysis. PCR Methods Appl 2:10–13
McCallum CM, Comai L, Greene EA, Henikoff S (2000) Targeted screening for induced mutations. Nat Biotechnol 18:455–457
Michaels SD, Amasino RM (1998) A robust method for detecting single nucleotide changes as polymorphic markers by PCR. Plant J 14:381–385
Mullis KB, Faloona F (1987) Specific synthesis of DNA in vitro via polymerase chain reaction. Methods Enzymol 155:350–355
Orita M, Iwahana H, Kanazawa H, Hayashi K, Sekiya T (1989) Detection of polymorphisms of human DNA by gel electrophoresis as single-strand conformation polymorphism. Proc Natl Acad Sci USA 86:2766–2770
Paran I, Michelmore RW (1993) Development of reliable PCR-based markers linked to downy mildew resistance genes in lettuce. Theor Appl Genet 85:985–999
Schuelke M (2000) An economic method for the fluorescent labelling of PCR fragments. Nat Biotechnol 18:233–234
Schwartz DC, Cantor CR (1984) Separation of yeast chromosome-sized DNAs by pulsed field gradient electrophoresis. Cell 37:67–75
Tanksley SD, McCouch SR (1997) Seed banks and molecular maps: unlocking genetic potential from the wild. Science 277:1063–1066
Tautz D, Renz M (1984) Simple sequences are ubiquitous repetitive components of eukaryotic genomes. Nucleic Acids Res 12(10):4127–4138
van den Broeck D, Maes T, Sauer M, Zethof J, De Keukeleire P, D'Hauw M, Van Montagu M, Gerats T (1998) Transposon Display identifies individual transposable elements in high copy number lines. Plant J 13:121–129
Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T, Hornes M, Frijters A, Pot J, Peleman J, Kuiper M, Zabeau M (1995) AFLP: a new technique for DNA fingerprinting. Nucleic Acids Res 23:4407–4414
Wang X, Zhiyuan F, Sanwen H, Peitian S, Yumei L, Limei Y, Mu Z, Dongyu Q (2000) An extended random primer amplified region (ERPAR) marker linked to a dominant male sterility gene in cabbage (Brassica oleracea var. capitata). Euphytica 112:267–273
Waugh R, McLean K, Flavell AJ, Pearce SR, Kumar A, Thomas WTB, Powell W (1997) Genetic distribution of Bare-1-like retrotransposable elements in the barley genome revealed by sequence-specific amplification polymorphisms (SSAP). Mol Gen Genet 253:687–694
Weining S, Langridge P (1991) Identification and mapping of polymorphisms in cereals based on the polymerase chain reaction. Theor Appl Genet 82:209–216
Weising K, Gardner RC (1999) A set of conserved PCR primers for the analysis of simple sequence repeat polymorphisms in chloroplast genomes of dicotyledonous angiosperms. Genome 42:9–11
Welsh J, McClelland M (1990) Fingerprinting genomes using PCR with arbitrary primers. Nucleic Acids Res 18:7213–7218
Welsh J, Chada K, Dalal SS, Ralph D, Cheng R, McClelland M (1992) Arbitrarily primed PCR fingerprinting of RNA. Nucleic Acids Res 20:4965–4970
Williams JGK, Kubelik AR, Livak KJ, Rafalski JA, Tingey SV (1991) DNA polymorphisms amplified by arbitrary primers are useful as genetic markers. Nucleic Acids Res 18:6531–6535

Wu KS, Jones R, Danneberger L, Scolnik P (1994) Detection of microsatellite polymorphisms without cloning. Nucleic Acids Res 22:3257–3258

Further Readings
Agarwal M, Shrivastava N, Padh H (2008) Advances in molecular marker techniques and their applications in plant sciences. Plant Cell Rep 27:617–631
Eathington SR et al (2007) Molecular markers in a commercial breeding program. Crop Sci 47(S3):S154–S163
Jena KK, Mackill DJ (2008) Molecular markers and their use in marker assisted selection in rice. Crop Sci 48:1266–1277
Lorz H, Wenzel G (2005) Molecular marker systems in plant breeding and crop improvement, Biotechnology in agriculture and forestry 55. Springer, New York
Van Bueren L et al (2010) The role of molecular markers and marker assisted selection in breeding for organic agriculture. Euphytica 175:51–64

Linkage Map Construction

Genome mapping methods are generally divided into two categories: (1) genetic or linkage mapping and (2) physical mapping. Genetic mapping is based on the use of genetic techniques to construct maps showing the positions of genes and other sequence features on a genome, whereas physical maps are constructed by directly sequencing DNA molecules, and such a physical map shows the positions of sequence features, including genes. There is yet another map in genome analysis, called the cytogenetic map. It describes the visual appearance of a chromosome (known as the karyotype) when chromosomes are stained and examined under a microscope. A physical map identifies the actual physical position of genes and other DNA elements on the chromosomes and facilitates positional cloning of agronomically important genes and analysis of chromosomes and genome structure in detail (refer to Chap. 7 for a detailed description). This chapter focuses on a detailed description of genetic or linkage mapping, besides briefly portraying the other two types of mapping procedure.

Basics of Genetic/Linkage Mapping: Mendelian Ratios, Meiosis, Crossing Over and Partial Linkage
Like a geographic map, a genetic map must show the positions of distinctive features. In a geographic map, these markers are recognisable components of the landscape, such as rivers, ponds, elevations, roads and buildings. Similarly, to describe the genetic landscape, morphological markers, isozymes and nucleic acid-based markers are used (discussed in Chap. 3).
The principle of genetic mapping was conceptualised more than a century ago. The discovery of genetic linkage, first reported in 1905 in the sweet pea by Bateson and colleagues (although it was referred to as coupling at that time), and the observation by Morgan (that the amount of crossing over between genes indicates the distance between them on a chromosome) helped Sturtevant to develop the first genetic map in 1913.
Visual appearance or morphological markers were initially used to construct the first genetic maps in the early decades of the twentieth century for organisms such as the fruit fly. To be useful in genetic analysis, a morphological trait should be heritable and has to exist in at least two alternative forms or phenotypes (e.g. tall or short stems in the pea plants originally studied by Mendel). Each phenotype is specified by a different allele of the corresponding gene, and those phenotypes should be distinguishable by visual examination. For example, the first fruit-fly maps showed the positions of genes for body colour, eye colour, wing shape etc., since all of these phenotypes are visible simply by looking at the flies with a low-power microscope or the naked eye. As discussed in Chap. 3, it was soon realised that there were only a limited number of visual phenotypes whose inheritance could be studied, and in many cases their analysis was complicated because a single phenotype could be affected by more than one gene. For example, by 1922 over 50 genes had been mapped onto the four fruit-fly chromosomes, but nine of these were for eye colour. In other words, it was very difficult to distinguish between fly eyes that were coloured red, light red, vermilion, garnet, carnation, cinnabar, ruby, sepia, scarlet, pink, cardinal, claret, purple or brown. To make gene maps more comprehensive, it would be necessary to find characteristics that were more distinctive and less complex than visual ones. The answer was to use the knowledge of biochemistry (isozymes) and molecular biology (DNA- or RNA-based markers) to distinguish phenotypes. Hence, in order to prepare a comprehensive genetic map (i.e. to have a complete coverage of the genome), we need a large set of markers.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice and Benefits, DOI 10.1007/978-81-322-0958-4_4, Springer India 2013

Once a set of distinguishable or polymorphic
markers has been assembled, the next step is to
construct the linkage map. The technique involved
in linkage map construction is based on
genetic linkage, a phenomenon that is best understood by starting from the experiments carried out in the nineteenth
century by Gregor Mendel. Mendel studied seven
pairs of contrasting characteristics in his pea
plants, one of which was violet versus white flower
colour. Two important points must be
considered to understand the concept of genetic
linkage: (1) Pure-breeding plants always give rise
to flowers with the parental colour. These plants
are homozygotes, each possessing a pair of identical alleles, denoted by VV for violet flowers and
WW for white flowers. (2) When two pure-breeding plants are crossed, only one of the phenotypes
is seen in the F1 generation. Genetic mapping is
based on the principles of inheritance as first
described by Gregor Mendel in 1865. From the
results of his breeding experiments with peas,
Mendel concluded that each pea plant possesses
two alleles for each gene, but displays only one
phenotype. As discussed above, this is easy to
understand if the plant is pure breeding, or
homozygous, for a particular characteristic, since
it possesses two identical alleles and displays the
appropriate phenotype. However, Mendel showed
that if two pure-breeding plants with different


phenotypes are crossed, then all the progeny (the


F1 generation) display the same phenotype. These
F1 plants must be heterozygous, meaning that
they possess two different alleles, one for each
phenotype, one allele inherited from the mother
and one from the father. Mendel postulated that
in this heterozygous condition one allele overrides the effects of the other allele; he therefore
described the phenotype expressed in the F1
plants as being dominant over the second, recessive phenotype. Thus, Mendel's first law (alleles
segregate randomly) and second law (pairs of
alleles segregate independently) help to predict the outcome of genetic crosses. This is a
perfectly correct interpretation of the interaction
between the pairs of alleles studied by Mendel.
This study helped to introduce the concept of recombination. When two characters are considered, a
gamete is said to be parental or nonrecombinant
if the genes governing the two characters were
both inherited from the same parent. It is said to
be recombinant if the genes it contains for the
two characters were inherited from different parents. For example, an F1 individual that received the gametes WV and wv from its parents may
pass on to an offspring one of four gametes,
WV, Wv, wV or wv. In this example, Wv and wV
are recombinant gametes because they represent
a mixing of genetic material which had been
inherited separately. For unlinked genes, Mendel's second law
specifies that a given gamete has a chance of 1/2 of
being recombinant (i.e. at most half of
the progeny are recombinant).
However, later it was noticed that this simple
dominant-recessive rule can be complicated by
situations that Mendel did not encounter. One of these
is incomplete dominance, where the heterozygous phenotype is intermediate between the two
homozygous forms. An example is when red carnations are crossed with white ones, the F1
heterozygotes being pink. Another complication
is co-dominance, when both alleles are detectable
in the heterozygote. Co-dominance is the typical
situation for DNA markers.
Further, his second law was also questioned
since it was soon established that genes reside on
chromosomes, and all organisms have many more
genes than chromosomes. If the chromosomes
are inherited as intact units, then the alleles of genes


should also be inherited together since they are


on the same chromosome. Correns in 1913
described the phenomenon of complete linkage
or complete gametic coupling, in which alleles of
two or more different characters appeared to be
always inherited together rather than independently (i.e. no recombination was observed
between them). This is the principle of genetic
linkage.
Although this seems to violate Mendel's second law, an obvious extension of his theory would
be to assume that the genes for these characters
are physically attached. Further, the chromosome
theory of heredity formulated by Sutton in 1903
also provided a physical mechanism for Mendel's
laws if it was assumed that the independent
Mendelian characters lay on different chromosomes and that those which were completely
linked lay on the same chromosome.
Though the chromosome theory was shown to be correct, the results
did not turn out exactly as expected. The complete linkage that had been anticipated between
many pairs of genes failed to materialise. Pairs of
genes were either inherited independently, as
expected for genes in different chromosomes, or,
if they showed linkage, then it was only partial
linkage, that is, sometimes they were inherited
together and sometimes they were not.
Partial linkage was discovered in the early
twentieth century. When a cross was carried out
by Bateson, Saunders and Punnett in 1905 with
sweet peas, the parental cross gave the typical
dihybrid result, with all the F1 plants displaying
the same parental phenotype. However, the F1
cross gave unexpected results, as the progeny
showed neither a 9:3:3:1 ratio (expected for genes
on different chromosomes) nor a 3:1 ratio
(expected if the genes are completely linked). An
unusual ratio is typical of partial linkage. Partial
linkage was explained later when the behaviour
of chromosomes during meiosis was elucidated
at the molecular level.
It was Thomas Hunt Morgan who made the
conceptual link between partial linkage and the
behaviour of chromosomes when the nucleus of a
cell divides. Cytologists in the late nineteenth century had distinguished two types of nuclear division: mitosis and meiosis. Mitosis is more


common, being the process by which the diploid


nucleus of a somatic cell divides to produce two
daughter nuclei, both of which are diploid. Before
mitosis begins, each chromosome in the nucleus
is replicated, but the resulting daughter chromosomes do not immediately break away from one
another. To begin with they remain attached at
their centromeres and by cohesion proteins which
hold together the arms of the replicated chromosomes. The daughters do not separate until later
in mitosis when the chromosomes are distributed
between the two new nuclei. Obviously it is
important that each of the new nuclei receives a
complete set of chromosomes.
In contrast, at the start of meiosis the chromosomes condense, and each homologous pair lines
up to form a bivalent (Fig. 4.1). Within the bivalent, crossing over might occur, involving breakage of chromosome arms and exchange of DNA.
Meiosis then proceeds by a pair of successive nuclear
divisions that result initially in two nuclei, each
with two copies of each chromosome still attached
at their centromeres, and finally in four nuclei,
each with a single copy of each chromosome.
These final products of meiosis, the gametes, are
therefore haploid.
Mitosis illustrates the basic events occurring
during nuclear division but is not directly relevant
to genetic mapping. Instead, it is the distinctive
features of meiosis that interest us. Meiosis
occurs only in reproductive cells and results in a
diploid cell giving rise to four haploid gametes,
each of which can subsequently fuse with a gamete of the opposite sex during sexual reproduction. The fact that meiosis results in four haploid
cells whereas mitosis gives rise to two diploid
cells is easy to explain: meiosis involves two
nuclear divisions, one after the other, whereas
mitosis is just a single nuclear division. This is an
important distinction, but the critical difference
between mitosis and meiosis is more subtle.
Recall that in a diploid cell there are two separate
copies of each chromosome, referred to as pairs
of homologous chromosomes. During mitosis,
homologous chromosomes remain separate from
one another, each member of the pair replicating
and being passed to a daughter nucleus independently of its homologue. In meiosis, however, the


Fig. 4.1 Features of meiosis (stages shown: interphase, prophase I, metaphase I, anaphase I, prophase II and telophase II; homologous chromosomes carrying alleles P, Q, R and p, q, r form a chiasma, crossing over occurs, and the recombinant chromatids give rise to recombinant gametes)


pairs of homologous chromosomes are by no


means independent. During meiosis I, each chromosome lines up with its homologue to form a
bivalent. This occurs after each chromosome has
replicated, but before the replicated structures
split, so the bivalent in fact contains four chromosome copies, each of which is destined to find its
way into one of the four gametes that will be produced at the end of meiosis. Within the bivalent, the chromosome arms (the chromatids) can
undergo physical breakage and exchange of segments of DNA (see Fig. 4.1). This is called
crossing over or recombination and was discovered by the Belgian cytologist Janssens in 1909.
This was just 2 years before Morgan started to
think about partial linkage.
This discovery of crossing over helped
Morgan to explain partial linkage. To understand
this we need to think about the effect of crossing

over on the inheritance of genes. Let us consider


two genes, each of which has two alleles. We
will call the first gene A and its alleles A and a,
and the second gene B with alleles B and b.
Imagine that the two genes are located on chromosome number 2 of Drosophila melanogaster
(fruit fly), the organism used by Morgan. We are
going to follow the meiosis of a diploid nucleus
in which one copy of chromosome 2 has alleles
A and B, and the second has a and b. In such a
scenario there are two alternatives (as depicted
in Fig. 4.2):
1. A crossover does not occur between genes A
and B. If this happens, then two of the resulting gametes will contain chromosome copies
with alleles A and B, and the other two will
contain a and b. In other words, two of the
gametes have the genotype AB, and two have
the genotype ab.

2. A crossover does occur between genes A and
B. This leads to segments of DNA containing
gene B being exchanged between homologous
chromosomes as shown in Fig. 4.2. The eventual result is that each gamete has a different
genotype: 1 AB, 1 aB, 1 Ab, 1 ab.

Fig. 4.2 The effect of crossover on linked genes (left panel: no crossover, gamete genotypes 2AB:2ab; right panel: crossover between A and B, gamete genotypes 1AB:1aB:1Ab:1ab)

Now think about what would happen if we
looked at the results of meiosis in 100 identical

cells. If crossovers never occur, then the resulting


gametes will have the following genotypes: 200 AB
and 200 ab. This is complete linkage: genes A and
B behave as a single unit during meiosis. But if
crossovers occur between A and B in some of the
nuclei (as is more likely), then the allele pairs will
not be inherited as single units. Let us say that
crossovers occur during 40 of the 100 meioses.
The following gametes will result: 160 AB, 160 ab,
40 Ab, 40 aB. In this case, the linkage is not complete; it is only partial. The gametes fall into two classes: the parental genotypes (AB, ab) and the recombinant genotypes (Ab, aB). In the example, the combinations aB and Ab did not appear in the parental
cells. These new combinations are the result of
recombination and are therefore referred to as recombinant genotypes.
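The arithmetic of this example can be checked with a short sketch. The function names `gamete_counts` and `recombination_frequency` are illustrative and not taken from the text; the sketch assumes that every meiosis yields four gametes and that at most one crossover occurs between A and B in any one meiosis.

```python
def gamete_counts(n_meioses, n_with_crossover):
    """Count gametes produced by meioses of an AB/ab nucleus.

    Assumes four gametes per meiosis and at most one crossover
    between genes A and B in any single meiosis.
    """
    n_without = n_meioses - n_with_crossover
    return {
        # No crossover gives 2 AB + 2 ab; a crossover gives one of each genotype.
        "AB": 2 * n_without + n_with_crossover,
        "ab": 2 * n_without + n_with_crossover,
        "Ab": n_with_crossover,
        "aB": n_with_crossover,
    }


def recombination_frequency(counts):
    """Percentage of gametes that carry a recombinant genotype (Ab or aB)."""
    recombinants = counts["Ab"] + counts["aB"]
    return 100.0 * recombinants / sum(counts.values())


counts = gamete_counts(100, 40)
print(counts)
print(recombination_frequency(counts))
```

With 40 crossovers in 100 meioses, the sketch reproduces the 160 AB, 160 ab, 40 Ab and 40 aB gametes given above, which corresponds to a recombination frequency of 20%. Even if a crossover occurred in every meiosis, only half of the gametes would be recombinant.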
Once Morgan had understood how partial linkage could be explained by crossing-over during
meiosis, he was able to devise an experiment that
paved the way to map the relative positions of genes
on a chromosome. In fact the most important work
was done not by Morgan, but by an undergraduate in
his laboratory, Arthur Sturtevant in 1913. Sturtevant
assumed that crossing-over was a random event,
there being an equal chance of it occurring at any
position along a pair of lined-up chromatids. If this
assumption is correct, then two genes that are close
together will be separated by crossovers less
frequently than two genes that are more distant
from one another. Furthermore, the frequency with
which the genes are unlinked by crossovers will be
directly proportional to how far apart they are on
their chromosome. The recombination frequency is
therefore a measure of the distance between two
genes. If you work out the recombination frequencies for different pairs of genes, you can construct a
map of their relative positions on the chromosome.
The way in which the recombination frequency
calculation helps in the construction of a
genetic map is explained below. Let us consider
the original experiments carried out with fruit flies
by A. Sturtevant (explained in Fig. 4.3). He
studied four genes (in his time the gene was not
defined at the molecular level; instead, genes were
considered as entities responsible for the heritability
of traits from parent to offspring). All four
genes are on the X chromosome of the fruit fly. By
making experimental crosses, he observed the
number of parental and recombinant genotypes
among the progeny. Recombination frequencies
between the genes were calculated as

$$\text{Recombination frequency} = \frac{\text{Number of recombinants}}{\text{Total number of progeny}} \times 100\,\%.$$

Recombination frequencies
Between miniature wings (m) and vermilion eyes (v) = 3.0%
Between miniature wings (m) and yellow body (y) = 33.7%
Between vermilion eyes (v) and white eyes (w) = 29.4%
Between white eyes (w) and yellow body (y) = 1.3%

Deduced map positions
y = 0, w = 1.3, v = 30.7, m = 33.7

Fig. 4.3 Construction of a genetic map using recombination frequencies

The calculated recombination frequencies


between these four genes were used to depict the
distance between the investigated genes. These
are shown along with their deduced map positions in Fig. 4.3.
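The deduction of map positions from the four recombination frequencies can be sketched in code. This is only an illustration, not Sturtevant's procedure: the names `positions_for`, `misfit` and `deduce_map` are hypothetical, and the sketch assumes that one gene (here y) lies at the end of the map and that map distance equals recombination frequency.

```python
from itertools import permutations

# Recombination frequencies (%) quoted in the text; keys are unordered gene pairs.
RF = {
    frozenset(("m", "v")): 3.0,
    frozenset(("m", "y")): 33.7,
    frozenset(("v", "w")): 29.4,
    frozenset(("w", "y")): 1.3,
}


def positions_for(order, rf):
    """Place genes on a line by summing the frequencies of adjacent pairs.

    Returns None if some adjacent pair was never measured.
    """
    pos = {order[0]: 0.0}
    for left, right in zip(order, order[1:]):
        pair = frozenset((left, right))
        if pair not in rf:
            return None
        pos[right] = pos[left] + rf[pair]
    return pos


def misfit(pos, rf):
    """Largest gap between a measured frequency and the mapped distance."""
    worst = 0.0
    for pair, value in rf.items():
        a, b = tuple(pair)
        worst = max(worst, abs(abs(pos[a] - pos[b]) - value))
    return worst


def deduce_map(rf, origin):
    """Try every order starting at `origin` and keep the best-fitting one."""
    genes = sorted(set().union(*rf))
    best = None
    for order in permutations(genes):
        if order[0] != origin:
            continue  # assumption: the origin gene sits at one end of the map
        pos = positions_for(order, rf)
        if pos is None:
            continue
        score = misfit(pos, rf)
        if best is None or score < best[0]:
            best = (score, pos)
    if best is None:
        raise ValueError("no gene order is supported by the measured pairs")
    return best[1]


gene_map = deduce_map(RF, origin="y")
for gene, position in sorted(gene_map.items(), key=lambda item: item[1]):
    print(gene, round(position, 1))
```

The only order that agrees with all four measurements is y, w, v, m, with v at 1.3 + 29.4 = 30.7 and m at 30.7 + 3.0 = 33.7, which matches the directly measured m to y frequency of 33.7%.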
Thus, it is clear that the resolution of a genetic
map depends on the number of crossovers that
have been scored (the more crossover events sampled, the higher the resolution). This is not
a major problem for microorganisms because
these can be obtained in huge numbers, enabling
many crossovers to be studied, resulting in a
highly detailed genetic map in which the markers
are just a few kb apart. For example, when the
Escherichia coli genome sequencing project
began in 1990 to construct the physical map, the
latest genetic map for this organism comprised
over 1,400 markers, an average of one per 3.3 kb
(kilobase pairs). This was sufficiently detailed to
direct the sequencing program without the need
for extensive physical mapping. Similarly, the
Saccharomyces cerevisiae project was supported
by a fine-scale genetic map (approximately 1,150
genetic markers, on average one per 10 kb). The
problem with humans and most other eukaryotes
is that it is simply not possible to obtain large
numbers of progeny, so relatively few crossover
events can be studied, and the resolving power of
linkage analysis is restricted. This means that
genes that are several tens of kb apart may appear
at the same position on the genetic map, and thus
such genetic maps have limited accuracy.
Sturtevant's assumption was that crossovers occur at random
along chromosomes. However, when molecular


Fig. 4.4 Comparison between part of the genetic and physical maps of Saccharomyces cerevisiae chromosome 3 (markers shown on both maps: cha1, glk1, his4, SUP53, leu2, centromere, pgk1, pet18, cry1, MAT, thr4, SUP61, ABT1)


data were generated, it was realised that this


assumption is only partly correct because the
presence of recombination hotspots means that
crossovers are more likely to occur at some points
rather than at others. The effect that this can have
on the accuracy of a genetic map was illustrated
in 1992 when the complete sequence for S. cerevisiae chromosome III was published, enabling
the first direct comparison to be made between a
genetic map and the actual positions of markers
as shown by DNA sequencing. There were considerable discrepancies even to the extent that
one pair of genes had been ordered incorrectly by
genetic analysis. Figure 4.4 compares part of the genetic
map with the physical map (determined by DNA sequencing)
and shows these discrepancies. Note that the order of the upper two
markers (glk1 and cha1) is incorrect on the
genetic map, and that there are also differences in
the relative positioning of other pairs of markers.

It is worth mentioning here that S. cerevisiae is


one of the two eukaryotes (fruit fly is the second)
whose genomes have been subjected to intensive
genetic mapping. If the yeast genetic map is inaccurate, then how precise are the genetic maps of
organisms subjected to less detailed analysis?
These two limitations of genetic mapping clearly
stress the point that for most eukaryotes a genetic
map must be checked and supplemented by alternative mapping procedures (such as cytogenetic
mapping or fluorescent in situ hybridization
(FISH)) before large-scale DNA sequencing
begins.
Thus, Sturtevant's assumption about the randomness of crossovers was not entirely justified.
Comparisons between genetic maps and the
actual positions of genes on DNA molecules, as
revealed by physical mapping and DNA sequencing, have shown that some regions of chromosomes, called recombination hotspots, are more
likely to be involved in crossovers than others.
This means that a genetic map distance does not
necessarily indicate the physical distance between
two markers. Also, we now realise that a single
chromatid can participate in more than one crossover at the same time but that there are limitations on how close together these crossovers can
be, leading to more inaccuracies in the mapping
procedure. Despite these limitations, linkage
analysis usually makes correct deductions about
gene/marker order, and distance estimates are
sufficiently accurate to generate genetic maps
that are of value as frameworks for genome
sequencing projects. The following section
describes the basic principles involved in the construction of linkage or genetic maps using different algorithms, since mapping cannot be done
manually with a large number of markers.

Mapping Functions
From the above explanation, it is clear that two
genes are said to be linked if they are located on
the same chromosome by assuming that different
chromosomes segregate independently during
meiosis. Therefore, for two genes located at different chromosomes, we may assume that their alleles


also segregate independently. The chance that an


allele at one locus coinherits with an allele at
another locus of the same parental origin is then
0.5 (1/2), and such genes are unlinked. Thus, in the
above example, the chance that A/B or a/b are coinherited by the offspring is 0.5 in case the genes are
unlinked. This chance increases if the genes are
linked. Linkage is a matter of degree: even if genes are located on the same
chromosome, they have a chance of not being inherited
as in the parental state due to recombination. The
greater the distance between two genes, the more
frequently there will be a crossover, and hence the
higher the number of recombinants.
It should be also noted that the combinations
aB and Ab are not always the recombinants. For
example, if the F1 was made from a parental cross
AAbb × aaBB, then the recombinant gametes
would be AB and ab. Therefore, we have to determine how the alleles were joined in the parental
generation. This is known as the phase. If AB and
ab were joined in the parental gametes, the gene
pairs are said to be in coupling phase (as in the
cross AABB × aabb). Otherwise, as in the cross
AAbb × aaBB, the gene pairs are in repulsion
phase. These terms can be somewhat ambiguous if
there are no dominant or mutant alleles.
Two further genetic phenomena to be noted
at this point are linkage equilibrium and its opposite, linkage disequilibrium. These terms describe
the chance of coinheritance of alleles at different loci. Alleles that are in random association
are said to be in linkage equilibrium. Linkage
disequilibrium can be the result of physical linkage of genes, but it can also occur even if the genes are on different
chromosomes (see Chap. 6 for more details).
The main idea of linkage or genetic mapping
is finding those genes/markers that are linked
together and coinherited to the next generation.
Modern linkage analysis uses not only genes that
code for proteins that produce observable traits
but also neutral markers (see Chap. 3 for
more details). Markers are mapped relative to one
another on chromosomes and used as signposts
against which to map genes of interest that are
linked with a marker. This process of finding the
linked markers/genes is referred to as grouping.


The distance between two genes is determined by


their recombination fraction. The map unit is the
centimorgan (cM). One morgan is the distance over
which, on average, one crossover occurs per meiosis, and one cM is one-hundredth of a morgan. Sturtevant established the genetic map unit
by defining it as a portion of the chromosome of
such length that, on average, one crossover
will occur in it out of every 100 gametes formed
(Sturtevant 1913).
When considering the mapping of more than
two markers/genes on the genetic map, it would
be very handy if the distances on the map were
additive. However, genetic studies have shown
that recombination fractions are not additive
(the recombination fraction is not the best estimate
of genetic distance, since it is also subject to a certain variability). For example, consider the loci A, B and
C. The recombination fraction between A and C is
not, in general, equal to the sum of the recombination fractions between A and B and between B and C.
If the recombination fraction between A and B is $r_1$ and that between B and C
is $r_2$, then the recombination fraction between A and C, $r_{12}$, depends on the
existence of interference. If recombination
between A and B (with probability $r_1$) is independent of recombination between B
and C (with probability $r_2$), we say that there is no
interference. In that case, the recombination fraction
between A and C is $r_{12} = r_1 + r_2 - 2 r_1 r_2$.
Interference is the effect in which the occurrence of a crossover in a certain region reduces
the probability of a crossover in the adjacent
region, and thus reduces the frequency of double crossovers. If there is complete interference, the event
of a crossover in one region completely suppresses recombination in adjacent regions. In
that case $r_{12} = r_1 + r_2$, that is, the recombination
fractions are additive. Also, within small distances, the term $2 r_1 r_2$ may be ignored, and recombination fractions are nearly additive. More
generally, double recombinants cannot be
ignored, and recombination fractions are not
additive. If distances were not additive, it would
be necessary to redo a genetic map each time
new loci (markers/genes) are discovered. To
avoid this problem, the distances on the genetic
map are calculated using a mapping function.
A mapping function translates recombination


frequencies between two loci into a map distance


in cM.
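The non-additivity of recombination fractions described above can be illustrated numerically. In this minimal sketch the function names are hypothetical; the first function assumes no interference (independent recombination in the two intervals), and the second assumes complete interference.

```python
def combine_no_interference(r1, r2):
    """Recombination fraction A-C when the A-B and B-C intervals recombine independently."""
    # A and C are recombinant when exactly one of the two intervals recombines.
    return r1 * (1 - r2) + (1 - r1) * r2  # algebraically r1 + r2 - 2*r1*r2


def combine_complete_interference(r1, r2):
    """Recombination fraction A-C when a crossover in one interval excludes the other."""
    return r1 + r2


for r1, r2 in [(0.01, 0.02), (0.10, 0.15), (0.30, 0.30)]:
    print(
        r1,
        r2,
        round(combine_no_interference(r1, r2), 4),
        round(combine_complete_interference(r1, r2), 4),
    )
```

For small values such as 0.01 and 0.02 the two results are almost identical, whereas for 0.30 and 0.30 the simple sum (0.60) would exceed the ceiling of 0.5 while the no-interference value (0.42) does not.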
A mapping function gives the relationship
between the distance between two chromosomal
locations on the genetic map (in cM) and their
recombination frequency.
Thus, the properties of a good mapping function are:
1. Distances are additive, that is, the distance A–C
should be equal to A–B + B–C if the order is
A–B–C.
2. Large map distances should translate into a recombination fraction that approaches, but does not exceed, 50%.
In general, a mapping function depends on the
interference assumed. With complete interference, and within small distances, the mapping
function is simply:

$$d = r,$$

where $d$ is the map distance (in morgans; multiply by 100 to obtain cM) and $r$ is the recombination fraction.
Two mapping functions are commonly used: the
Haldane mapping function and the Kosambi mapping function. With no interference (i.e. all crossovers occur
independently of one another), the Haldane mapping function is appropriate:

$$d = -\frac{1}{2}\ln(1 - 2r),$$

whereas Kosambi's mapping function allows
some positive interference (i.e. one chiasma
deters the occurrence of a second in close proximity to the first), and hence the distance is calculated as

$$d = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r}.$$

Based on several studies, it is established that
there is little difference between the different
mapping functions when the distance is below
15 cM.
Thus, a mapping function provides a better estimate of genetic distance than the recombination
fraction used by Sturtevant. On the other hand, as
stated earlier, there is no general relationship
between genetic distance and physical distance
(in base pairs). There is large variability between
species in the average number of kilobase pairs
(kb) per cM. Even within chromosomes there is
variation, with some regions having fewer crossovers, and therefore more kb per cM, than others.
Further, it should be noted that the estimation of
genetic map distances is strongly influenced by
chemical agents and physical radiation prevailing during the experiments (which can increase
the recombination frequency), plasmagenes, genotype, chromosomal aberrations, distance from the
centromere etc. Thus, there is always some
variation between genetic distance and physical
distance, since the estimation of genetic distance is affected by more factors.
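The Haldane and Kosambi mapping functions of this section can be written as a short sketch. The function names are illustrative; the sketch assumes the standard forms of both functions, with the result expressed in cM (the distance in morgans multiplied by 100) and r restricted to values below 0.5.

```python
import math


def haldane_cm(r):
    """Map distance in cM under Haldane's function (no interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Map distance in cM under Kosambi's function (some positive interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must satisfy 0 <= r < 0.5")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def haldane_r(d_cm):
    """Inverse of Haldane's function: recombination fraction for a distance in cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


for r in (0.05, 0.10, 0.20, 0.30, 0.40):
    print(r, round(haldane_cm(r), 1), round(kosambi_cm(r), 1))
```

For r = 0.10 the two functions give roughly 11.2 and 10.1 cM, and the gap widens as r grows. Haldane distances are additive when there is no interference: the distance for r = 0.22 (the combined fraction of 0.10 and 0.15) equals the sum of the distances for 0.10 and 0.15.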

Mapping of Genetic Markers: Practical Considerations
From the foregoing discussion, it is clear that
markers can be genetically mapped relative to
each other by:
1. Determining recombination fractions
2. Using a mapping function
Recombination fractions between genetic
markers can be estimated from a mapping population (see Chap. 2 for the different types of mapping
populations and their importance in genetic mapping). Since we can observe complete marker
genotypes in every progeny of the mapping
population, it is easy to calculate the recombination
fraction. Recombination fractions are estimated
from the proportion of recombinant gametes.
This is relatively easy to determine if we know
linkage phase in parents (the haplotype of the
gamete that was transmitted from parent to offspring). If the linkage phase is known in parents,
we can know which gametes are recombinants
and which ones are nonrecombinant. However, in
practice, linkage phases are not always known.
This is especially the case in animals, as it is hard
to create inbred lines. And markers are often in
linkage equilibrium, even across breeds. If the
linkage phase is not known, we can usually infer


the parental linkage phase, as the number of


recombinants is expected to be smaller than the
number of nonrecombinants.
However, by chance there may be more recombinants than nonrecombinants. For this reason,
maximum likelihood is used to determine the
most likely phase, and therefore to determine the
most likely recombination fraction. Information
about the gamete that was received by an offspring depends on the genotypes of offspring and
parents. If parents and offspring are all heterozygous (e.g. Aa), then we do not know which allele
was paternal and which was maternal. If the marker
genotypes of a parent are not heterozygous, we
have no information about recombination events
during its meiosis. For example, if the sire has
genotype AB/Ab we cannot distinguish between
recombinant and nonrecombinant gametes. However, if one parent is
homozygous, it increases the chance of having
informative meioses in the other parent.
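The idea of choosing the most likely phase by maximum likelihood can be sketched for the simplest case. The sketch assumes that each informative gamete can be scored as either AB/ab or Ab/aB, that gametes are independent, and that r cannot exceed 0.5; the function names are hypothetical.

```python
import math


def log_likelihood(r, n_recombinant, n_total):
    """Binomial log-likelihood of r (natural log, constant term dropped)."""
    n_parental = n_total - n_recombinant
    value = 0.0
    if n_recombinant:
        value += n_recombinant * math.log(r)
    if n_parental:
        value += n_parental * math.log(1.0 - r)
    return value


def infer_phase(n_ab_like, n_aB_like):
    """Return (phase, r_hat) for counts of AB/ab gametes and Ab/aB gametes."""
    n_total = n_ab_like + n_aB_like
    if n_total == 0:
        raise ValueError("no informative gametes")
    results = {}
    # In coupling the Ab/aB gametes are the recombinants; in repulsion the AB/ab ones are.
    for phase, n_rec in (("coupling", n_aB_like), ("repulsion", n_ab_like)):
        r_hat = min(n_rec / n_total, 0.5)  # estimate capped at 0.5
        results[phase] = (log_likelihood(r_hat, n_rec, n_total), r_hat)
    best_phase = max(results, key=lambda phase: results[phase][0])
    return best_phase, results[best_phase][1]


print(infer_phase(43, 7))
print(infer_phase(7, 43))
```

With 43 gametes of one class and 7 of the other, the phase in which the 7 are recombinants is far more likely, giving an estimated recombination fraction of 0.14 in either orientation of the counts.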

Testing for Linkage: LOD Scores


Besides estimating the most likely recombination
fraction, it is important to test or validate those
estimates statistically. In particular we want to
test whether or not two loci are really linked.
Therefore, the statistical test to perform is the
likelihood of a certain recombination fraction (r)
versus the likelihood of no linkage (r = 0.5).
Different likelihoods are usually compared by
taking their ratio. The base-10 logarithm
of this likelihood ratio, known as the LOD
score (an abbreviation of logarithm of odds), is the most
widely used test statistic. It was introduced by
Haldane and Smith in 1947 and is considered a
key concept in linkage analysis.
A LOD score above 3 is generally used as a
critical value. A LOD score of >3 implies that the
null hypothesis (r = 0.5) is rejected. This value
implies a ratio of likelihoods of 1,000 to 1 (i.e.
the observed data are 1,000 times more likely under linkage at the estimated r than under no
linkage). This seems like a very stringent criterion. However, it accounts for the prior probability of linkage. Due to the finite number of
chromosomes, there is a reasonable probability
(e.g. 5% in humans with 23 chromosome pairs)


that two random loci are linked. Nevertheless,


different LOD thresholds should be used for
different data sets. When a pair of markers is considered during the analysis, it is known as pairwise
or two-point analysis, whereas analyses that consider many markers simultaneously are known as
multipoint analyses.
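A two-point LOD score can be computed with a few lines of code. This is a minimal sketch, assuming phase-known, fully informative gametes so that the likelihood is binomial; the function name `lod_score` and the example counts are illustrative only.

```python
import math


def lod_score(n_recombinant, n_total, r=None):
    """log10 of L(r) / L(0.5) for n_recombinant recombinants among n_total gametes.

    If r is not given, the maximum likelihood estimate (capped at 0.5) is used.
    """
    if n_total <= 0 or not 0 <= n_recombinant <= n_total:
        raise ValueError("counts must satisfy 0 <= n_recombinant <= n_total, n_total > 0")
    if r is None:
        r = min(n_recombinant / n_total, 0.5)
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must lie between 0 and 0.5")
    n_parental = n_total - n_recombinant
    if r == 0.0 and n_recombinant > 0:
        return float("-inf")  # r = 0 is impossible once a recombinant is seen
    log_l_r = 0.0
    if n_recombinant:
        log_l_r += n_recombinant * math.log10(r)
    if n_parental:
        log_l_r += n_parental * math.log10(1.0 - r)
    log_l_null = n_total * math.log10(0.5)
    return log_l_r - log_l_null


print(round(lod_score(2, 20), 2))
print(round(lod_score(10, 20), 2))
```

Two recombinants among 20 gametes give an estimated r of 0.1 and a LOD score just above 3, so linkage would be declared at the usual threshold, whereas 10 recombinants among 20 give a LOD score of 0.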

Grouping, Ordering and Spacing


A genetic map provides an essential resource to
understand the order and spacing of markers
(relative order when compared to those of other
similar species). Thus, the key step is identification
of a set of markers that are arranged together as a
single group and finding the order and spacing of
each marker in the given group. The mapping
population consists of p plants that result from
a crossing experiment with a given experimental
design. The commonly used designs include
backcrossing, F2, doubled haploid and recombinant inbred lines (refer chapter 2 for more details).
Further, marker data can consist of different
types: co-dominant or dominant. Thus, the primary data set consists of an m × p matrix, with p
members of a mapping population each scored
for m markers. Taken together, the experimental design and the marker type will define the way
in which distances and other functions are calculated between distinct markers. The computational approaches to be used in this linkage
analysis can be split into three parts: grouping,
ordering and spacing.
Grouping divides the DNA marker set into
distinct linkage groups. The number of linkage
groups in a species, as a rule, should be equal to
its gametic chromosome number (or haploid
number of chromosomes). Ideally, there
should be a one-to-one correspondence between linkage groups and
haploid chromosomes (e.g. if there are five chromosomes in the gametes, it should have five linkage groups). However, this will depend on the
density and proximity of the underlying markers,
which is a consequence of the co-ancestry of the
two parents in addition to the marker development strategies as well as regional recombination


rates. On the other hand, if a researcher knows that
all the DNA markers are derived from a single
chromosome, this analytical step is unnecessary.
Several types of solution have been proposed for
the marker grouping problem. One type recognises the underlying similarity to the well-studied
area of agglomerative hierarchical clustering. In
methods such as nearest neighbour locus, clusters
of markers (i.e. linkage groups) are grown by
sequentially adding that marker which shows the
lowest recombination value to the current members of the cluster. For example, the strategy
employed by MAPMAKER (Box 4.1) is of this
type. It begins by calculating all two-point maximum likelihood distances and corresponding
LOD scores, with linkage established between
pairs of markers if the LOD score is >3 and the
inter-marker distance is <80 Haldane cM (default
values used by MAPMAKER, which can be
changed by the user). MAPMAKER considers
linkage to be transitive such that if marker A is
linked to marker B, and if B is linked to C, then
A, B and C are candidates for belonging to the
same linkage group (but which may be excluded
later if they show significant deviation from additivity of their map distances). Another type of
grouping method adopts ideas from graph theory.
For example, MadMapper and MSTMAP both
use graph partitioning approaches, creating a
complete graph of all markers connected to all
other markers and with connecting graph edges
weighted by some two-point function of the data.
Then, all edges over a certain threshold value are
chopped, leaving a number of distinct subgraphs,
each of which corresponds to a linkage group. It
is notable that many grouping methods require
input parameters to be specified by the user,
thereby influencing their output. Consequently,
linkage group content can be changed to some
extent by a user's expertise, knowledge and
opinion.
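The threshold-and-transitivity idea described above can be sketched in a few lines of Python. This is only an illustration, not MAPMAKER's or MSTMAP's actual implementation: the function name `group_markers`, the `pair_stats` dictionary and the example LOD scores and distances are all hypothetical, and the default thresholds simply mirror the LOD 3 and 80 cM values quoted in the text.

```python
from itertools import combinations

def group_markers(markers, pair_stats, min_lod=3.0, max_cm=80.0):
    """Split markers into linkage groups by transitive two-point linkage.

    pair_stats maps frozenset({a, b}) -> (lod, distance_cM); pairs that are
    absent are treated as unlinked.
    """
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent pointers to the representative of m's group.
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving keeps chains short
            m = parent[m]
        return m

    for a, b in combinations(markers, 2):
        lod, dist = pair_stats.get(frozenset((a, b)), (0.0, float("inf")))
        if lod > min_lod and dist < max_cm:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    # Largest groups first; ties broken by marker names.
    return sorted(groups.values(), key=lambda g: (-len(g), g))

# Hypothetical two-point results for five markers.
stats = {
    frozenset(("A", "B")): (5.2, 12.0),
    frozenset(("B", "C")): (4.1, 20.0),
    frozenset(("A", "C")): (1.0, 35.0),  # weak direct evidence, joined via B
    frozenset(("D", "E")): (6.0, 8.0),
}
print(group_markers(["A", "B", "C", "D", "E"], stats))
```

Raising `min_lod` or lowering `max_cm` splits the groups further, which illustrates how user-chosen parameters influence linkage group content.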
Ordering takes each of the linkage groups in
turn and aims to find the relative orders of the
markers within the group. For a linkage group of
m markers, there are m!/2 possible orders.
Hence, for large data sets, this is not a
simple task that can be undertaken exhaustively,
because of the prohibitive computational time required to
carry it out. Given a linkage group, we wish to
find the order of its markers that maximises or
minimises some scoring function. This scoring
function is commonly known as an objective
function. In simple terms, we want some way to
(1) evaluate the quality of a given marker order
and (2) to describe how one marker order is better
or more suitable than another. Furthermore, we
require an objective function that is simple to calculate yet is also biologically and statistically
meaningful. An example of a simple objective
function, to be minimised, is the sum of adjacent
recombination fractions (SARF) (see Further
Readings for more on SARF and other computational approaches). Since adjacent marker loci
tend to have the smallest recombination fractions,
the marker order that minimises SARF was
referred to by its developer as the minimum distance map. Examples of other popular objective
functions are the maximum sum of adjacent LOD
scores (SALOD), the minimum number of crossovers, the product of adjacent recombination
fractions (PARF), the minimum entropy, the minimum weighted least-squares marker order, the
maximum likelihood (ML) and the maximum
number of fully informative meioses. It is worth
mentioning that the number of possible orders grows factorially, so once a linkage group exceeds
roughly six to ten markers, exhaustive ordering takes a very long time even on
very powerful computers. Thus, optimising an
objective function over all m!/2 possible marker
orders is not feasible for most data sets. Finding
an optimal marker order for a particular objective
function is known in computer science terminology as an NP-hard (non-deterministic polynomial-time hard)
combinatorial problem and necessitates the use
of a search strategy that significantly reduces the
space of marker orders to explore. Initially, search
strategies such as seriation and branch-and-bound
were used. In a seriation approach, a marker order
is grown in a greedy fashion from an initial pair
of tightly linked markers, adding at each step the
single most informative marker in the position
that optimises the objective function. In the
branch-and-bound strategy, an initial good solution is found, perhaps based on a two-point
method. Subsequently, the initial marker order is
probed by incrementally constructing partial
orders; any partial order that scores worse than the current full
order is eliminated, along with all full orders based
on, or descended from, it. Once a full order better
than the current is discovered, it becomes the next
current order to be investigated. In this way, the
objective function never decreases from the initial solution to the time the method terminates.
Subsequent to these approaches, a convenient
relationship was discovered between the marker
ordering problem and the symmetric wandering
salesman problem, a variant of the travelling
salesman problem (TSP), perhaps one of the best
researched and understood problems in computer
science. In this problem, a given set of m cities
has to be traversed so that every city is visited
exactly once in such a way that the total distance
travelled is minimised and that the choice of the
first and last cities is free. Thus, algorithms for
solution of the TSP can be used within genetic
map estimation, with the m cities recoded as
our m markers. The type of strategy that seems
to cope best with the presence of missing data,
and hence lends itself well to genetic mapping (where missing data are common), is
the local search procedure. AntMap uses a TSP-based approach
to order the markers in a given linkage group
(explained in Box 4.2). To estimate order, one
may consider several candidate orders and maximise the appropriate likelihood under each of
them. The maximum likelihood estimate of order
is that order whose maximised likelihood is highest. The same procedure can be followed
when a new locus is to be added to an existing map. The
JoinMap package, which builds up the order greedily by adding one marker at a time, has several refinements to this general
scheme. For example, the order in which markers
are added to the sequence is not random, but
depends on the amount of information a marker
contains. In addition, after a marker has been
added, a local reshuffling can be applied so
that the previously established sequence is not
frozen and the algorithm does not become
trapped in a local optimum from which it cannot
escape.
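To make the idea of an objective function concrete, the following minimal Python sketch scores marker orders by SARF and evaluates all m!/2 distinct orders exhaustively. It is an illustration only, feasible just for very small groups; the function names and the recombination fractions are hypothetical and are not taken from any mapping package.

```python
from itertools import permutations

def sarf(order, rf):
    """Sum of adjacent recombination fractions for one marker order."""
    return sum(rf[frozenset(pair)] for pair in zip(order, order[1:]))

def best_order_exhaustive(markers, rf):
    """Evaluate all m!/2 distinct orders and return (score, order) with minimal SARF."""
    best = None
    for perm in permutations(markers):
        if perm[0] > perm[-1]:
            continue  # skip the mirror image of an order that is also enumerated
        score = sarf(perm, rf)
        if best is None or score < best[0]:
            best = (score, perm)
    return best

# Hypothetical pairwise recombination fractions for four markers.
markers = ["M1", "M2", "M3", "M4"]
rf = {
    frozenset(("M1", "M2")): 0.10,
    frozenset(("M2", "M3")): 0.12,
    frozenset(("M3", "M4")): 0.08,
    frozenset(("M1", "M3")): 0.20,
    frozenset(("M2", "M4")): 0.18,
    frozenset(("M1", "M4")): 0.25,
}
score, order = best_order_exhaustive(markers, rf)
print(order, round(score, 2))
```

Because the loop visits every permutation, its running time grows factorially with the number of markers, which is why the heuristic search strategies described above (seriation, branch-and-bound, TSP-style local search) are needed for realistic linkage groups.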
Spacing involves finding the map distances for an ordered set of markers in a given
linkage group. Usually, the distance between each
adjacent pair of marker loci is expressed in cM, and the length
of the linkage group is the sum of those distances.
Remember that recombination fractions are not additive across
three markers. This problem is handled by taking
or refining the two-point analysis calculated in
the grouping step and converting recombination fractions to map distances with a map function. The total map distance between
two genes of a linkage group may exceed 50 or
even 100 cM, but this does not mean that they would
show more than 50% recombination. The frequency of recombination between two linked
genes cannot exceed 50%, which is the frequency
in the case of independent segregation. There is
an approximately 1:1 correspondence between map distance and
the observed recombination frequency up to about 15 cM. However,
there is a progressive decline in the frequency of
observed recombination for every additional
1 cM beyond 15 cM. Thus, a map distance around
90 cM is expected to show close to 50%
recombination.
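The relationship between map distance and observed recombination can be illustrated with the two most commonly used map functions. The sketch below uses the standard Haldane and Kosambi formulas; the function names are chosen here for illustration. Under these formulas, 90 cM corresponds to roughly 42% recombination with Haldane and roughly 47% with Kosambi, and neither function ever exceeds 50%.

```python
import math

def haldane_r(cm):
    """Recombination fraction for a Haldane distance in cM (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * cm / 100.0))

def kosambi_r(cm):
    """Recombination fraction for a Kosambi distance in cM (partial interference)."""
    return 0.5 * math.tanh(2.0 * cm / 100.0)

def haldane_cm(r):
    """Haldane map distance in cM for a recombination fraction r < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * r)

def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction r < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))

# Tabulate expected recombination (%) for a range of map distances.
for cm in (5, 15, 50, 90, 150):
    print(f"{cm:>4} cM  Haldane {100 * haldane_r(cm):5.1f}%  "
          f"Kosambi {100 * kosambi_r(cm):5.1f}%")
```

For short distances both functions return values close to the distance itself, which is the near 1:1 correspondence mentioned above.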

Sources of Error
It is necessary to be aware that genetic map estimation, like any estimation procedure, is prone to
error. Error may arise due to many factors, including missing data, chiasma interference, genotyping error and segregation distortion. Missing data
can lead to an incorrect marker order, particularly
in dense regions of a map. Some scoring failures
are likely to be the results of random processes.
However, there is also an element of systematic
bias, and we often see a particular marker for
which several plants are not scored. In such a
case, we may wish to delete the marker from our
analysis. For less systematic cases, we may wish
to infer missing values through some computational method. In the presence of chiasma interference, the Haldane map function is not valid,
since it assumes no interference has taken place.
However, many map functions account for chiasma interference in varying degrees. For example, the Rao map function is a versatile function
that accounts for interference along a sliding
scale. Although the Rao map function is not
widely implemented in software tools (see
Box 4.3 for a list of software for genetic
mapping), the Kosambi map function, which
accounts for interference, is supported by many
such software tools. Genotyping errors can have a large
impact on the accuracy of a map, inflating map
lengths (particularly when applying multipoint
maximum likelihood methods), reducing estimates of chiasma interference and supporting
incorrect marker orders. In practice, many
researchers will deal with genotyping errors by
searching for double recombinants on an estimated genetic map (and sometimes recombinants
over short distances), followed by checking of
potentially erroneous scores. However, such an
approach will not always be practical and is
unlikely to uncover all cases of genotyping error.
Consequently, two types of computational
approach to this problem have been developed.
The first type concerns the identification of potentially erroneous scores. For example, the JoinMap
software implements a method that calculates a
probability for each genotype, given the scores of
the two flanking markers and the inter-marker
distances. Genotypes with low probabilities can
then be investigated further. The second type
concerns modifying the map either during or following the estimation process. Both an error filter
for pairwise methods, which corrects map length
while considering the level of interference (p),
and error corrections for multipoint methods have been
described in the literature. Although
both methods were shown to perform well for certain data
sets (and the underestimation of
interference in their absence was highlighted), it was noted that the
multipoint correction was potentially not as satisfactory as the error filter method, as it was performed on a marker order obtained under the
assumption of no error.
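The manual search for double recombinants mentioned above can be sketched as follows. This is a deliberately simple illustration, not the probability-based method implemented in JoinMap: it only flags a genotype call that differs from both of its immediate neighbours in map order. The function name and the example scores are hypothetical.

```python
def flag_singletons(scores):
    """Return indices of genotype calls that differ from both neighbours.

    scores is a string of genotype codes for ONE individual, listed in map
    order (e.g. "AAABAAA"). A lone call flanked on both sides by the same,
    different genotype implies a double recombinant over a short interval
    and is a candidate genotyping error. Missing calls ("-") are skipped.
    """
    flagged = []
    for i in range(1, len(scores) - 1):
        left, mid, right = scores[i - 1], scores[i], scores[i + 1]
        if "-" in (left, mid, right):
            continue  # cannot judge a call next to missing data
        if left == right and mid != left:
            flagged.append(i)
    return flagged

# Hypothetical RIL scored at ten ordered markers; index 3 looks suspicious.
print(flag_singletons("AAABAAABBB"))
```

Flagged calls are only candidates for re-checking; over long intervals a genuine double recombinant is perfectly possible.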
Where segregation distortion is found to have
occurred (calculated via chi-squares, refer
Chap. 3), the mapping population deviates from
allele and genotype frequencies expected under
the Hardy-Weinberg law (which states that population frequencies remain in equilibrium across
generations unless disturbed by some phenomenon). For plant mapping populations, such deviations from the expected frequencies typically
arise as the result of gametic or post-zygotic
selection, resulting in an estimated marker position which,
though consistent with the marker scores, does
not correspond to the physical location of the
marker. Although simulation analysis showed


that the presence of segregation distortion had
little effect on the accuracy of marker order or
map length, this contradicts the results of other
studies and may be data set specific. Consequently,
methods which allow such markers to be identified
prior to analysis are useful, as they give the
researcher the opportunity to analyse the data set
either with or excluding such markers (or potentially both). The interplay between these sources
of error is complex because of the interaction
between genotyping errors and chiasma interference. It has also been noted that missing values
led to shorter map lengths for more widely spaced
markers, particularly in the presence of segregation distortion, when using the weighted least-squares method, and that missing
values had a lesser effect on the accuracy of
marker order than did genotyping errors. Another
source of error is the mixing of marker
types within a single scoring scheme, which can result in
attraction of similar types of marker independent of their chromosomal locations.
Consequently, diagnostic tests and methods that
allow researchers to interact with their mapping
data are desirable. Additional data may help to
resolve errors. For example, physical mapping
data and in particular complete genome sequences
will also present a marker order. This is highly
attractive, as estimating marker order is the most
difficult part of genetic map estimation for large
data sets. However, we should also be aware
when comparing genetic and physical marker
orders that the genome sequence is itself an estimate gained from a sequence assembly process
and may not be highly accurate for up to several
years following initial sequencing. Furthermore,
when comparing genetic and physical marker
orders from different organisms, we must not
underestimate the effect that micro-rearrangements could have on making inferences on the
accuracy of the genetic map. In general, the accuracy of any genetic map estimation method relies
on the distribution of recombination frequencies,
the proportion of missing data, the quantity of
noise due to genotyping errors and genetic interference. As marker data sets grow, it is a
challenge for researchers to discover new
search methods that can facilitate fast and accurate use of objective functions.
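Identifying distorted markers before analysis, as recommended above, usually relies on the chi-square test mentioned earlier. A minimal sketch follows, assuming codominant scores coded A/H/B as in the MAPMAKER conventions, expected ratios of 1:1 for RILs and 1:2:1 for an F2, and the usual 5% critical values (3.841 and 5.991). The function name and the `EXPECTED` table are illustrative; dominant codes (C, D) and missing data are simply ignored here.

```python
from collections import Counter

# Expected segregation ratios and 5% chi-square critical values
# (3.841 for 1 degree of freedom, 5.991 for 2 degrees of freedom).
EXPECTED = {
    "ri": ({"A": 1, "B": 1}, 3.841),
    "f2": ({"A": 1, "H": 2, "B": 1}, 5.991),
}

def segregation_chi2(scores, cross="ri"):
    """Return (chi-square statistic, True if distorted at the 5% level)."""
    ratios, critical = EXPECTED[cross]
    counts = Counter(c for c in scores.upper() if c in ratios)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no informative genotype calls")
    total_ratio = sum(ratios.values())
    chi2 = 0.0
    for genotype, ratio in ratios.items():
        expected = n * ratio / total_ratio
        chi2 += (counts[genotype] - expected) ** 2 / expected
    return chi2, chi2 > critical

# Hypothetical RIL marker with 70 A and 30 B calls plus 5 missing scores.
print(segregation_chi2("A" * 70 + "B" * 30 + "-" * 5, cross="ri"))
```

Markers flagged in this way can then be analysed with and without inclusion in the map, as suggested in the text.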
The outcome of a mapping experiment depends
on the composition of the sample population. The
larger the mapping population, the more confidence
we have in the estimates of recombination frequencies and map distances. For most purposes,
populations in the size range of 80 to 400 are used.
Remember that the population type also influences
the standard errors of the estimates. It is good to
realise that, for example, an experiment with
100 RILs will result in a (slightly) different map
from one based on a sample of an F2 population, each sample having its own best map. Although the variation between these
maps with respect to marker order may be nil, the
resulting total map length and the inter-marker distances are quite variable. This demonstrates that
a single, ultimate 'true' linkage map does not exist.
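The effect of population size on the precision of a recombination estimate can be illustrated with the binomial standard error. The sketch below assumes a backcross-like design in which recombinant gametes are counted directly; F2 and RIL estimators have different variances, so the numbers are indicative only. The function name is illustrative.

```python
import math

def rf_standard_error(r, n):
    """Binomial standard error of a recombination fraction r estimated by
    directly counting recombinants among n informative gametes."""
    return math.sqrt(r * (1.0 - r) / n)

# Standard error of r = 0.10 for population sizes in the commonly used range.
for n in (80, 200, 400):
    print(n, round(rf_standard_error(0.10, n), 4))
```

Because the standard error falls only with the square root of n, quadrupling the population size is needed to halve it.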

Chromosomal Assignment
Once the linkage groups are identified and refined
from the data sets, the next step is assigning chromosome number to each linkage group. It is
usually done with the help of cytogenetic stocks.
Nullisomic/disomic/trisomic lines are used to
identify which chromosome of the given species
contains the markers that constitute the given linkage group. Assignment of markers to specific chromosomes can also be accomplished through PCR
using template DNA from each of the nullisomic
lines (or disomic or trisomic or tetrasomic lines
depending on the availability) in the given species.
It is also possible to assign the chromosome using
microisolated translocation chromosomes as a template in the PCR with the primer of the given
marker. Alternatively, deletion mapping using
structural aberrations of specific chromosomes can
also be employed in this context. In many species,
the chromosomes are designated in sequential
order based on their relative sizes. Recently, assignment of markers to the individual chromosomes or
chromosome arms is being extensively undertaken
with the help of fluorescent in situ hybridization
(FISH). Further, such FISH analysis helps in
comparison of physical and genetic map and
identification of introduced chromosomal segments
among related species. In polyploid species, chromosomal assignment is
still complicated, since it involves a stepwise process
that builds on previous genetic and cytogenetic
information. Aneuploid stocks are employed to
locate markers on the chromosomes and to assign
linkage groups to chromosomes. In cotton, monosomic and monotelodisomic stocks that are hemizygous for one arm provide a facile means to localise
marker loci to one arm or another of the given chromosome. For example, TM-1/3-79 derived F1s
have been evaluated for monosomic or monotelodisomic stocks (Kohel et al. 1970). In each F1, the
donor genotype is euploid Gossypium barbadense
accession 3-79, and the recipient genotype is
hypoaneuploid G. hirsutum, usually a backcross
derivative of accession TM-1. TM-1 is an inbred
line derived from Deltapine 14 and is considered
the genetic standard of upland cotton (G. hirsutum). The inbred 3-79 is a doubled haploid derived
from G. barbadense. A monosomic F1 substitution
stock has a single chromosome from the donor substituted for the corresponding chromosome pair of
the recipient genotype. Similarly, monotelodisomic
F1 stocks lack alleles from the recurrent parent in
the hemizygous chromosome arm from the donor,
but carry alleles of the recurrent parent in the
opposing arm (either in homozygous or heterozygous condition, depending on the patterns of crossing over). In general, SSR markers in combination
with cytogenetic stocks are used to construct the
framework map, and other types of markers are
subsequently added to this framework map.

Allopolyploidy and Autopolyploidy


Polyploidy has played an important role in higher
plant evolution and applied plant breeding.
Polyploids are commonly categorised as (1) allopolyploids, resulting from the increase of chromosome number through hybridization and subsequent
chromosome doubling, and (2) autopolyploids, due
to chromosome doubling of the same genome
by fusion of unreduced gametes. Allopolyploids
undergo bivalent pairing at meiosis because only
homologous chromosomes pair. For autopolyploids,
however, all homologous chromosomes can pair at
the same time, so that multivalents are formed and, therefore,
double reduction can occur. For some polyploids, these two types of pairing occur at the same
time, leading to a mixed category. Alfalfa, banana,
canola, coffee, cotton, potato, soybean, strawberry,
sugarcane, sweet potato and wheat represent
excellent examples of polyploids of economic
importance. In spite of the economic relevance
of polyploid crops, genetic mapping of these species has been relatively overlooked. Statistical
methods for genetic mapping have been well developed for diploid species but are lagging in the more
complex polyploids. This is because of intrinsic
difficulties such as the uncertainty of the chromosome behaviour at meiosis I and the need for very
large segregating populations. An important, yet
underestimated, issue in mapping polyploids is the
choice of the molecular marker system. An ideal
molecular marker system for polyploid mapping
should maximise the percentage of single-dose
markers detected and the possibility of recognising
allelic markers. The genetic mapping of polyploids,
where genome number is higher than two, is further
complicated by uncertainty about the genotype-phenotype
correspondence, inconsistent meiotic
mechanisms, heterozygous genome structures and
increased allelic (action) and nonallelic (interaction)
combinations. Readers are referred to Wu
et al. (2001) for a review of several challenges due
to the complexities of linkage analysis in polyploids
and description of statistical models and algorithms
that have been developed for linkage mapping based
on their distinct meiotic characteristics. Besides,
this paper also describes several issues that should
be addressed to better understand the genome structure and organisation of polyploids and the genetic
architecture of complex traits for this unique group
of plants.

Fig. 4.5 Bridging different linkage maps of the same species into a single comprehensive linkage map. Numbers in parentheses indicate the number of markers unified at each stage.

Bridging Linkage Maps to Develop Unified Linkage Maps
It is often difficult to construct a linkage map that
covers the entire genome due to unavailability of
polymorphic markers, unavailability of recombinants for the markers and several other reasons.
In such cases, maps developed with the help of
different mapping populations can be integrated
into a single map with the help of anchor markers,
as shown in Fig. 4.5. This figure schematically
represents the stepwise assemblage of a linkage
map based on a number of different crosses using
a reference set of anchored markers. Maps A, B
and C are obtained from different mapping populations. Integration is possible with the anchor
loci that are common to two or more data sets.
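The first practical step in such an integration is to list the anchor loci shared between maps. A minimal sketch, with hypothetical map contents and an illustrative function name, is:

```python
def anchor_loci(maps):
    """Return loci present in at least two maps, with the maps that share them.

    maps: dict mapping a map name to an ordered list of locus names.
    """
    seen = {}
    for name, loci in maps.items():
        for locus in dict.fromkeys(loci):  # ignore accidental duplicates
            seen.setdefault(locus, []).append(name)
    return {locus: names for locus, names in seen.items() if len(names) >= 2}

# Hypothetical maps A, B and C from three different mapping populations.
maps = {
    "A": ["m1", "m2", "m3", "m5"],
    "B": ["m2", "m4", "m5", "m6"],
    "C": ["m5", "m6", "m7"],
}
print(anchor_loci(maps))
```

Loci that appear in only one map (here m1, m3, m4 and m7) can be placed on the unified map only relative to these anchors.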

Box 4.1 Linkage Map Construction Using MAPMAKER/EXP

Data File Preparation


The following is an excerpt from the
MAPMAKER/EXP tutorial.
The very first line of your raw data file
should read like:
data type xxxx
where xxxx is one of the allowed data types,
either:
f2 intercross
f2 backcross
f3 self
ri self
ri sib
The second line of the raw file should contain a list of three numbers separated by spaces,
such as
46 362 2
The first of these values indicates the number
of progeny for which data are included in the
file (in this case, 46). The second indicates the
number of genetic loci for which data are supplied (362). The third indicates the number of
quantitative traits in the data set (here 2,
although this may be zero, of course).
Additional information may be optionally
supplied at the end of this line. In particular,
you may specify the coding scheme you use
for genotypes. By default, the codes used for
F2 backcross (a.k.a. BC1) data are:
A Homozygote for the recurrent parent
genotype
H Heterozygote
- Missing data for the individual at this
locus
For F2 intercross data, the default codes are:
A Homozygote for the allele from parental strain a of this locus
B Homozygote for the allele from parental strain b of this locus
H Heterozygote carrying both alleles a
and b
C Not a homozygote for allele a (either
bb or ab genotype)
D Not a homozygote for allele b (either
aa or ab genotype)
- Missing data for the individual at this
locus
For RI data, the default codes are:
A Homozygote for parental genotype a
B Homozygote for parental genotype b
- Missing data for the individual (or
line) at this locus
Also by default, MAPMAKER will match
genotype characters in a case-insensitive
manner (i.e. 'a' and 'A' indicate the same
genotypes).
However, you can tell MAPMAKER to use
whatever conventions you like, as long as you
use the same conventions for the entire data
file. First off, if you follow the numbers on the
second line with the word 'case', then
MAPMAKER will match genotype characters
in a case-sensitive manner (i.e. 'a' and 'A' can
be used to indicate different genotypes). For
example,
46 362 2 case
If you do not wish to use case-sensitive
genotypes, do not include the word 'case'.
To specify the coding scheme itself, include
on the end of the above line the word 'symbols'
followed by the coding scheme you wish to
use, defined in terms of the coding scheme
above. For example, if you wish to use the
following scheme with an RI data set,
1 Homozygote for parental genotype a
2 Homozygote for parental genotype b
0 Missing data for the individual (or line)
at this locus
then you would use a second line like
46 362 2 symbols 1=A 2=B 0=-
Note that when interpreting this line,
MAPMAKER is in fact quite finicky about
spaces and case distinctions (in order to keep
MAPMAKER from ever misunderstanding
exactly what you mean). In particular, NO
SPACES should surround the '=' signs.
To use the following scheme with a backcross data set,
a Homozygote for parental genotype a
A Heterozygote
- Missing data for the individual (or line)
at this locus
you should use a line like
46 362 2 case symbols a=A A=H
The main restriction on coding schemes is
that the only allowed symbols are letters,
numbers and the characters '-' and '+'.
After the first two header lines, the raw file
should then present the genetic locus data in
the following simple format: For each locus,
you list (1) the name of the locus, preceded by
an asterisk (*); (2) one or more spaces
(or tabs etc.); and (3) the genotypic data for
all individuals, in order. For example
*locus1 BA-HHHAAABBB-HHAA
would provide data for a locus named locus1
with individual #1 having the B genotype,
individual #2 having the A genotype and so
forth. Data for each new locus should begin on
a new line (with blank lines allowed), although
the genetic data for any one locus may be
broken by any number of spaces, tabs and
line breaks. This means that, among other
things, tab-delimited-text files (such as those
often exported by spreadsheet programs) will
work well, for example:
*L2 B A - H H H A A A B B B - H
There is a system-dependent maximum
line length, although it is fairly large (at least
1,000 characters, where a tab counts as one
character).
Locus names should be kept to at most 8
characters and must be limited to alphabetic and
numeric characters, along with the underscore
character (_) and periods (.). No other characters are allowed (although any dashes in locus
names (-) will be converted to underscores).
Locus names must start with an alphabetic character (so that they are not confused with locus
numbers in MAPMAKER sequences).
Any quantitative trait data should come
after the genetic locus data. These data follow
a similar format, except that the trait values for
each individual must be separated by at least
one space, tab or line break. A dash (-) alone
indicates missing data. For example
*weight 6.3 7.7 8.0 6.2 8.6 - 7.5 9.0 5.5 - 8.4 7.7 7.4 6.9
would correspond to a trait named 'weight',
for which individual #1 has a value of 6.3,
individual #2 has a value of 7.7 and so on. The
sixth individual is missing data for this trait
(and will be ignored for all analyses involving
these trait data). As for the genotypes, a new
trait should begin on a new line, and line
breaks are allowed. Tab-delimited-text files
work well here too.
Traits may also be specified as functions of
other existing trait data. For example:
*weight1 6.3 7.7 8.0 6.2 8.6 6.9 7.5 9.0
*weight2 6.7 7.9 7.5 6.8 8.0 7.3 7.5 9.5
*mean = (weight1 + weight2)/2
The format of these equations is described
under the 'make trait' command. Such traits
must be included in the number of traits indicated on the file's second line.
Note that genetic maps (particularly for
MAPMAKER/QTL) are no longer included in
the raw file, as they were with MAPMAKER
Version 2.0. Instead, use a '.prep' initialization file, described in the MAPMAKER manual.
Finally, note that comments may be inserted
on any line starting with a number sign character (#).
An example of a complete raw file is as
follows:
data type f2 intercross
20 5 2
# tiny data set for practical class demonstration
*locus1 BBBHH-AAABBBHHH-AABA
*locus2 AB-ABHABHAB-ABHABHBH
*locus3 ABBAHHHBHABHABHBBHH-
# Locus3 may be mis-scored in individual 12!
*locus4 ABHABAAAHAB-ABHAB HHB
*locus5 ABHABHAA-ABHABHAHHHB
*trait1 6.3 7.7 8.0 6.2 8.8 6.2 4.1 6.5 5.4 7.3
8.7 9.0 5.2 6.8 7.2 7.1 7.6 8.3 8.1 7.5
*trait2 5.5 5.5 5.5 4.5 4.5 4.5 3.5 3.5 3.5 -
5.5 5.5 4.5 4.5 4.5 3.5 5.2 6.8 7.2 7.1
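Because a single missing or extra character makes a raw file unusable, it can help to check the counts before loading the file. The following Python sketch is not part of MAPMAKER; it is a simplified, hypothetical checker that follows the format rules described above (header line, three counts, '*' before each name, '#' comments, data allowed to continue over several lines). It does not handle the 'case' and 'symbols' options or traits defined by equations.

```python
def check_raw(text):
    """Check that every locus/trait in a MAPMAKER-style raw file has one
    value per individual. Returns a list of (name, found, expected) problems."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines[0].lower().startswith("data type"):
        raise ValueError("first line must start with 'data type'")
    n_ind, n_loci, n_traits = (int(x) for x in lines[1].split()[:3])

    records = []  # each entry: [name, list of whitespace-separated tokens]
    for ln in lines[2:]:
        if ln.startswith("*"):
            name, *rest = ln[1:].split()
            records.append([name, rest])
        elif records:
            records[-1][1].extend(ln.split())  # continuation line

    problems = []
    if len(records) != n_loci + n_traits:
        problems.append(("<file>", len(records), n_loci + n_traits))
    for i, (name, tokens) in enumerate(records):
        # Loci: count genotype characters; traits: count separated values.
        found = len("".join(tokens)) if i < n_loci else len(tokens)
        if found != n_ind:
            problems.append((name, found, n_ind))
    return problems

# Hypothetical toy file: 4 individuals, 2 loci, 1 trait.
sample = """data type f2 intercross
4 2 1
# hypothetical toy data
*locus1 AB-H
*locus2 A B
H H
*trait1 6.3 - 7.7 8.0
"""
print(check_raw(sample))
```

An empty list means the counts agree with the second header line; any tuple names the locus or trait whose number of values does not match the number of individuals.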

The MAPMAKER Data: How to Prepare It and What Does It Look Like?
For example, if there are 500 recombinant
inbred lines scored for 200 SSR markers that
were polymorphic between the parents A and B used
in recombinant inbred line development, the
data file can be prepared in a Microsoft
Office Excel sheet in the following format:
data type ri self
500 200 0
*ssr1    A A B B A B A B ... (scores up to the 500th RIL)
*ssr2    B B - A B A B ... (scores up to the 500th RIL)
.
.
.
*ssr200  A - A A B B B ... (scores up to the 500th RIL)
Once the data file is prepared in Excel as described above, save it as
a *.txt (tab-delimited text) file.
Open the folder containing this
*.txt file and change the file extension to *.raw
using folder options.
Important notes:
1. The '*' stands for a file name of your choice. For example, the file name for the
data above is 'RIL'.
2. If you cannot see the file extension for
the file, open the folder
options, click the 'View' tab and clear
the check box 'Hide extensions for known
file types'. The file extension then becomes visible
in the folder, and you can change the
extension alone (i.e. RIL.txt is to be
changed to RIL.raw).
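As an alternative to renaming the exported file by hand, the raw file text can be generated with a short script. The sketch below is a hypothetical helper, not part of MAPMAKER: it assumes a tab-delimited table with one marker per row (marker name first, then one score per line), treats empty cells as missing data and writes no trait data.

```python
def table_to_raw(table_text, data_type="ri self", n_traits=0):
    """Convert tab-delimited marker scores (one marker per row: name, then
    one score per individual) into MAPMAKER-style raw file text."""
    rows = [ln.split("\t") for ln in table_text.splitlines() if ln.strip()]
    n_ind = len(rows[0]) - 1
    out = [f"data type {data_type}", f"{n_ind} {len(rows)} {n_traits}"]
    for name, *scores in rows:
        if len(scores) != n_ind:
            raise ValueError(f"{name}: expected {n_ind} scores, got {len(scores)}")
        # Empty cells become '-', the default missing-data code.
        genotypes = "".join(s.strip() or "-" for s in scores)
        out.append(f"*{name} {genotypes}")
    return "\n".join(out) + "\n"

# Hypothetical two-marker table for four RILs (the third score of ssr1 is empty).
table = "ssr1\tA\tB\t\tA\nssr2\tB\tB\tA\t-\n"
print(table_to_raw(table))
```

The returned text can then be written to a file such as RIL.raw with any text-writing routine.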

Running Mapmaker
Precisely how you should start MAPMAKER
depends on your computer. It should be noted
that MAPMAKER downloaded from
http://www.broad.mit.edu/ftp/distribution/software/mapmaker3/
can be installed only on Windows
XP or earlier versions of Windows. It is not
supported by later operating systems
such as Windows Vista and Windows 7. Open
the mapmaker folder and double-click the
mapmaker icon to get to the command
prompt.
When MAPMAKER starts running, you
will first see its start-up banner and a prompt
'1>' for the first command.
Commands that should be typed into
MAPMAKER are shown in the procedure below
in bold italics, while MAPMAKER
output is presented in regular type.
The first step in almost every MAPMAKER
session is to load a data file for analysis. If you
are starting out an analysis on a new data set,
or if you have modified the raw data in an
existing data set, you will do this using
MAPMAKER's 'prepare data' command. If
instead you are resuming an analysis of a particular (unmodified) data set, you may use the
'load data' command, which preserves many
of the results from your previous session. If
you are just starting out, use MAPMAKER's
'prepare data' command to load the data file RIL.raw.
From this file, MAPMAKER extracts:
The type of cross, number of markers and
number of scored progeny
The genotype for each marker in each individual (if available)
Other information may be present in the
data files, such as quantitative trait data and
precomputed linkage results. These issues
will be addressed later. Before performing
any analyses of the data set, first instruct
MAPMAKER to save a transcript of this session in a text file for later reference. Using
the 'photo' command, a transcript named
'RIL.out' is started. Note that if the file
already exists, MAPMAKER appends new
output to this file. These two commands are shown below as they appear in the DOS
window.
************************************
*          MAPMAKER/EXP            *
*          (version 3.0b)          *
*                                  *
************************************
Type 'help' for help. Type 'about' for general information.
1 > prepare RIL.raw
preparing data from RIL.raw
ri self data (500 individuals, 200 loci) ok
saving genotype data in file RIL.data ok
2 > photo RIL.out
photo is on: file is RIL.out
Finding Linkage Groups by Two-Point Linkage


Begin the linkage map construction
analysis by performing a classical two-point,
or pairwise, linkage analysis of the data set. First,
we need to tell MAPMAKER which loci we
wish to consider in our two-point analysis. We
do this using MAPMAKER's 'sequence' command ('seq' will also work). When you type
something like:
3 > sequence 1 2 3
MAPMAKER is told which loci (and, in
some cases, which orders of those loci) any
following analysis commands should consider (e.g. SSR1, SSR2, SSR3). Since almost
all of MAPMAKER's analysis functions use
the current sequence to indicate which loci
they should consider, you will find that the
'sequence' command must be entered before
performing almost any analysis function.
The sequence of loci in use remains
unchanged until you again type the
'sequence' command to change it. In this
two-point analysis, we want to examine all
the loci in our sample data set. Thus, we now
type into MAPMAKER:
3 > sequence 1 2 3 4 5 6 7 8 9 10 11 12 13
(OR)
3 > sequence all
MAPMAKER gives each marker in the data
file its own number; it does not work with
the names SSR1, SSR2 etc. If at any point you want to
see the real name of a marker, use the 'translate' command after specifying the sequence
of those markers (e.g. 'seq 1 2 3', then 'translate' or 'tra').
Note that for two-point analysis, the order
in which the loci are listed is unimportant.
Alternatively, if you know the chromosomal
location of each marker, you can specify
only those marker numbers belonging to the
given chromosome in the sequence command, and hence only those markers will be

analysed for their fitness into a single linkage group. For example, if SSR1 to SSR5
belong to chromosome 1, then the command
to be used is
3 > sequence 1 2 3 4 5
However, there are 200 markers in this
data file, and suppose we dont know the
chromosomal position of each marker. If that
is the case, this data set is too many to work
with at once since doing all possible orders
of all these markers at once would take a long
time. The next step is instructing the program
to divide the markers in the sequence into
linkage groups; for this, type MAPMAKERs
group command. To determine whether any
two markers are linked, MAPMAKER calculates the maximum likelihood distance and
corresponding LOD score between the two
markers: If the LOD score is greater than
some threshold, and if the distance is less
than some other threshold, then the markers
will be considered linked. By default, the
LOD threshold is 3.0, and the distance threshold is 80 Haldane cM. For the purpose of
finding linkage groups, MAPMAKER considers linkage transitive. That is, if marker A
is linked to marker B, and if B is linked to C,
then A, B and C will be included in the same
linkage group. The 200-marker data set mentioned above would be too
complicated to use as an illustration. The example below therefore uses
a simple data set that contains 13 markers. As the output shows,
MAPMAKER has divided this 13-marker data set into two linkage groups,
which it names group1 and group2, and a
list of unlinked markers (this list is not shown if the
data set contains no unlinked markers).
4 > group
Linkage groups at min LOD 3.00, max distance 80.0
group1 = 1 2 3 5 7
group2 = 4 6 8 9 10 11 12
unlinked 13
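The transitive grouping rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MAPMAKER's own code: the function name `group_markers`, the marker labels A to F and all two-point LOD and distance values are hypothetical, and only the default thresholds (LOD 3.0, 80 cM) come from the text.

```python
def group_markers(markers, pairwise, min_lod=3.0, max_dist=80.0):
    """Group markers transitively: a pair is linked when its LOD is above
    min_lod and its distance is below max_dist (defaults taken from the text)."""
    parent = {m: m for m in markers}

    def find(m):
        # Follow parent pointers to the representative of m's group.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for (a, b), (lod, dist) in pairwise.items():
        if lod > min_lod and dist < max_dist:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for m in markers:
        groups.setdefault(find(m), []).append(m)
    linked = sorted(g for g in groups.values() if len(g) > 1)
    unlinked = sorted(g[0] for g in groups.values() if len(g) == 1)
    return linked, unlinked


# Hypothetical two-point results: (marker, marker) -> (LOD, distance in cM).
pairwise = {
    ("A", "B"): (12.4, 10.2),
    ("B", "C"): (8.1, 18.5),
    ("A", "C"): (2.1, 30.0),  # below the LOD threshold, yet A and C join via B
    ("D", "E"): (6.0, 25.0),
    ("C", "D"): (0.4, 95.0),
    ("E", "F"): (1.2, 60.0),
}
linked, unlinked = group_markers(list("ABCDEF"), pairwise)
print(linked)
print(unlinked)
```

Markers A and C end up in the same group even though their own two-point LOD is below the threshold, which is exactly the transitivity that the group command relies on.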

Exploring Map Orders by Hand


To determine the most likely order of markers
within a linkage group, we could imagine
using the following simple procedure: For
each possible order of that group, we calculate the maximum likelihood map (i.e. the
distances between all markers given the data)
and the corresponding map's likelihood. We
then compare these likelihoods and choose
the most likely order as the answer. This type
of exhaustive analysis may be performed
using MAPMAKER's compare command.
In practice, however, this sort of exhaustive
analysis is not feasible for even medium-sized groups: A group of N markers has N!/2
possible orders, a number which becomes
unwieldy (for most computers) when N gets
to be between 6 and 10. In practice, one needs
to order subsets of the linkage group and then
overlap those subsets, mapping any remaining markers relative to those already mapped,
a process which is illustrated in the next section. In the above example, since group1
consists of markers 1, 2, 3, 5 and 7, it is small
enough to perform the fully exhaustive analysis. To do this, we first change MAPMAKER's
sequence to {1 2 3 5 7}. Here, the braces {} indicate that the order of the markers contained
within them is unknown and thus that all
possible orders need to be considered. We
then type the compare command, instructing MAPMAKER to compute the maximum
likelihood map for each specified order of
markers and to report the orders sorted by the
likelihoods of their maps. Please note the
bracket type, as other brackets have different
meanings: [] mean that the markers within are at the
same locus (so order does not matter), and < >
mean that the order within is known but not the
orientation of the group itself (it could be the inverse
order).
5 > sequence {1 2 3 5 7}
sequence #2 = {1 2 3 5 7}
6 > compare

Best 20 orders:
1: 1 3 2 5 7 Like: 0.00
2: 3 1 2 5 7 Like: -6.00
3: 5 7 2 3 1 Like: -20.20
4: 5 7 2 1 3 Like: -26.26
5: 2 5 7 3 1 Like: -27.25
6: 2 5 7 1 3 Like: -28.39
7: 2 3 1 5 7 Like: -28.85
8: 5 2 3 1 7 Like: -32.33
9: 2 1 3 5 7 Like: -34.12
10: 5 7 1 3 2 Like: -35.55
11: 5 2 1 3 7 Like: -37.61
12: 1 3 5 2 7 Like: -37.76
13: 3 1 5 2 7 Like: -39.09
14: 5 7 3 1 2 Like: -40.38
15: 1 3 5 7 2 Like: -40.87
16: 3 1 5 7 2 Like: -41.55
17: 5 2 7 3 1 Like: -43.67
18: 5 2 7 1 3 Like: -44.78
19: 5 1 3 2 7 Like: -47.63
20: 2 5 3 1 7 Like: -52.28
order1 is set
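The N!/2 count of distinct orders quoted above can be checked with a short Python sketch. It is an illustration only: the helper name `distinct_orders` is hypothetical, and the sketch merely enumerates orders without computing any likelihood.

```python
from itertools import permutations
from math import factorial


def distinct_orders(markers):
    """All marker orders, counting an order and its reverse only once."""
    # Keeping only permutations whose first marker is smaller than the last
    # retains exactly one orientation of every order.
    return [p for p in permutations(markers) if p[0] < p[-1]]


orders = distinct_orders([1, 2, 3, 5, 7])
print(len(orders))  # number of orders compared for group1

# Growth of the search space with the number of markers N.
for n in (5, 7, 10):
    print(n, factorial(n) // 2)
```

For the five markers of group1 this gives the 60 orders mentioned in the text, while ten markers already give more than 1.8 million.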
Note that while MAPMAKER examines
all 5!/2 = 60 possible orders, by default only the
20 most likely ones are reported. For each of
these 20 orders, MAPMAKER displays the
log-likelihood of that order relative to the
best likelihood found. Thus, the best order 1
3 2 5 7 is indicated as having a relative log-likelihood of 0.0. The second best order 3 1
2 5 7 is significantly less likely than the best,
having a relative log-likelihood of -6.0. In
other words, the best order of this group is
supported by an odds ratio of roughly
1,000,000:1 (10 to the 6th power to one) over
any other order. We consider this good evidence that the first order is the
right order.
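The conversion from a relative log-likelihood to an odds ratio used above is simple arithmetic, sketched below in Python. The function name `odds_against` is hypothetical, the values are copied from the first three lines of the compare output, and the sketch assumes (as the text does) that the reported values are base-10 logarithms.

```python
# Relative log10-likelihoods copied from the compare output above.
relative_log_like = {
    "1 3 2 5 7": 0.00,
    "3 1 2 5 7": -6.00,
    "5 7 2 3 1": -20.20,
}


def odds_against(rel_log10_like):
    """Odds in favour of the best order over an order with this relative value."""
    return 10 ** (-rel_log10_like)


for order, value in relative_log_like.items():
    print(f"{order}: best order favoured {odds_against(value):.3g} : 1")
```

The same arithmetic applies to two-point LOD scores: an LOD of 2 corresponds to odds of 100:1 and an LOD of 3 to 1,000:1.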

Displaying a Genetic Map


When we used the compare command previously, MAPMAKER calculated the map
distances and log-likelihood for each of the 60
orders we were considering. The compare
command, however, only reports the relative
log-likelihoods and afterwards forgets the map
distances. To actually display the genetic
distances, we must instead use the map command. Like compare, the map command
instructs MAPMAKER to calculate the maximum likelihood map of each order specified
by the current sequence. If the current sequence
specifies more than one order (e.g. the
sequence {1 2 3 5 7} specifies 60 orders),
then the maps for all specified orders will be
calculated and displayed. Because we found
one order of this group to be much more likely
than any other, we probably only care to see
the map distances for this single order. First,
we set MAPMAKER's sequence, putting the
markers in their best order and doing away
with the set brackets. Next, we simply type
map to display this order's maximum likelihood map. As you can see, the distances
between neighbouring markers are displayed.
Note, however, that these distances may be
considerably different from the two-point
distances between those markers: This is
because MAPMAKER's so-called multipoint
analysis facility can take into account much
more information, such as flanking marker
genotypes and some amount of missing data.
This is precisely the reason that we use multipoint analysis rather than two-point analysis
to order markers: Because more data are taken
into account, you have a smaller chance of
making a mistake.
7 > sequence 1 3 2 5 7
sequence #3 = 1 3 2 5 7
8 > map
==============================
Map:
Markers Distance
1 SSR1 4.2 cM
3 SSR3 15.0 cM
2 SSR2 11.9 cM
5 SSR5 12.2 cM
7 SSR7 ----------
43.2 cM 5 markers log-likelihood = -424.94
==============================

Mapping a Slightly Larger Group


As we mentioned earlier, exhaustive analyses of
large linkage groups are not practical. Instead,
to find a map order of a larger group, we need to
find a subset of markers on which we can perform an exhaustive compare analysis. Thus, to
map group2 (in the above example), we could
pick a subset of its 7 markers at random, although
we might do better if we pick markers which are
likely to be ordered with high likelihood.
Generally, this is true for sets of markers which
(1) have as little missing data as possible and (2)
do not include many closely spaced markers.
To quickly see how much data is available
for the markers in the given group, we set
MAPMAKER's sequence appropriately and
use MAPMAKER's list loci command.
MAPMAKER prints a list of loci, showing
each marker by both its MAPMAKER-assigned number and its name in the
data file. For each
marker, MAPMAKER also prints the number of
informative progeny (out of the 500 in the data
set) and the type of scoring. In this case all loci
have been scored as co-dominant markers (e.g. SSR genotypes in RILs), although
clearly markers 4 and 6 are the least informative. To also look for markers which may be
too close, we use MAPMAKER's lod table
command. MAPMAKER prints both the distance and the LOD score between all pairs of
markers in the current sequence. Fortunately,
the closest pair is separated by over 6.0 cM, a
distance which should almost always be
resolvable in a data set with so many informative meioses. Given the results of these two
analyses, a good subset to try might be:
8 9 10 11 12
Note that the above two tests could have been performed automatically
using MAPMAKER's suggest subset command.
9 > sequence 4 6 8 9 10 11 12
sequence #4 = 4 6 8 9 10 11 12
10 > list loci
Num  Name   Genotypes   Linkage Group
4    SSR4   273 codom   group2
6    SSR6   275 codom   group2
8    SSR8   306 codom   group2
9    SSR9   327 codom   group2
10   SSR10  297 codom   group2
11   SSR11  324 codom   group2
12   SSR12  319 codom   group2
11 > lod table
Bottom number is LOD score; top number is centimorgan distance:
        4      6      8      9     10     11
6     63.1
      3.33
8     16.8   56.0
     39.06   4.33
9     56.3   17.8   54.8
      6.77  36.70   7.68
10   106.3   27.7      -   43.3
      0.89  22.51         15.08
11    14.9   74.0    6.3   65.4      -
     43.78   2.20  80.87   5.76
12    28.2   43.1   18.4   24.1   89.1   30.1
     22.24   9.13  39.84  32.39   2.22  23.90
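The two checks described above (how informative each marker is, and how close the closest pair is) can be mimicked with a small Python sketch. It is not the algorithm behind MAPMAKER's suggest subset command; the function names and the cut-off of 290 informative progeny are assumptions made for illustration, while the counts and distances are copied from the list loci and lod table outputs.

```python
# Informative-progeny counts as printed by "list loci".
informative = {4: 273, 6: 275, 8: 306, 9: 327, 10: 297, 11: 324, 12: 319}

# Two-point distances (cM) copied from "lod table" for the candidate markers;
# pairs without a reported distance are simply left out.
distance_cm = {
    (8, 9): 54.8, (8, 11): 6.3, (8, 12): 18.4,
    (9, 10): 43.3, (9, 11): 65.4, (9, 12): 24.1,
    (10, 12): 89.1, (11, 12): 30.1,
}


def pick_subset(informative, min_informative):
    """Keep markers with at least min_informative scored progeny."""
    return sorted(m for m, n in informative.items() if n >= min_informative)


def closest_pair(markers, distance_cm):
    """Return the closest pair of markers in the subset and its distance."""
    pairs = {p: d for p, d in distance_cm.items()
             if p[0] in markers and p[1] in markers}
    pair = min(pairs, key=pairs.get)
    return pair, pairs[pair]


subset = pick_subset(informative, min_informative=290)  # hypothetical cut-off
pair, gap = closest_pair(subset, distance_cm)
print(subset)
print(pair, gap)
```

With this cut-off the two least informative markers (4 and 6) are dropped, and the closest remaining pair is 8 and 11 at 6.3 cM, in line with the subset chosen in the text.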
As we did with the small linkage group,
we can change MAPMAKER's sequence
to specify the subset we wish to test and then
type the compare command. This time, the
results are even more conclusive, with order1
far more likely than any other order. The sequence of
commands to be used here is:
9 > sequence {8 9 10 11 12}
10 > compare
11 > sequence order1
12 > map
Note that this time we set the sequence using a special shortcut, order1, instead of typing out
the marker order reported as order1. This
shows that the markers to be analysed can be specified to the
sequence command in either way.
To determine the map position of the remaining two markers in group2, we will use the following procedure: Starting with the known
order of 5 markers, we will place the other two
(one at a time) into every interval in this order
and then recalculate the maximum likelihood
map of each resulting 6-marker order. In this
analysis, MAPMAKER recalculates all
recombination fractions for all intervals in
each map (not just the ones involving the
newly placed markers). This function is performed by MAPMAKER's try command. In
its output, MAPMAKER again displays the relative log-likelihood of each position for the
inserted markers. A relative log-likelihood
of 0 indicates the best position, while the negative log-likelihoods indicate the odds against
placement in each other interval.
13 > sequence {8 9 10 11 12}
sequence #5 = {8 9 10 11 12}
13 > compare
Best 20 orders:
1: 11 8 12 9 10 Like: 0.00
2: 10 11 8 12 9 Like: -14.57
3: 8 11 12 9 10 Like: -15.23
4: 10 9 11 8 12 Like: -27.20
5: 11 8 12 10 9 Like: -29.97
6: 10 8 11 12 9 Like: -30.14
7: 9 10 11 8 12 Like: -32.23
8: 8 11 10 9 12 Like: -39.80
9: 10 9 8 11 12 Like: -39.91
10: 9 11 8 12 10 Like: -40.05
11: 11 8 10 9 12 Like: -40.25
12: 11 8 9 12 10 Like: -44.73
13: 8 11 12 10 9 Like: -45.21
14: 10 11 8 9 12 Like: -46.57
15: 8 11 9 12 10 Like: -47.46
16: 9 10 8 11 12 Like: -47.94
17: 10 8 11 9 12 Like: -49.61
18: 8 11 10 12 9 Like: -52.71
19: 9 8 11 12 10 Like: -52.74
20: 11 8 10 12 9 Like: -53.07
order1 is set
14 > sequence order1
sequence #6 = order1
15 > try 4 6
             4        6
        ------------------
        |   0.00  -42.68 |
11      |                |
        | -35.57  -118.6 |
8       |                |
        | -19.65  -70.19 |
12      |                |
        | -46.80  -28.09 |
9       |                |
        | -51.35    0.00 |
10      |                |
        | -43.40  -21.09 |
        |----------------|
INF     | -44.66  -45.03 |
        ------------------
BEST     -619.33 -612.03
In this case, we see that marker 4 is best placed before marker 11. We also
see that marker 6 strongly prefers to be in between markers 9 and 10: even
the next most likely position for marker 6 is more than 10 to the 21.09th
power times less likely. The try command not only tries to place markers in
each interval in the framework but also tries to place each marker
infinitely far away (i.e. forced 50% recombination between it and the
framework). The relative log-likelihoods for this position are indicated
following the INF entry in the MAPMAKER output; in other words, INF
corresponds to the marker lying anywhere else but on this sequence. In the
above test, we see that a log-likelihood difference of 44.66 supports
linkage between marker 4 and the rest of the group. In the same way that a
two-point LOD score indicates the odds of linkage between two loci when they
are separated by their maximum likelihood distance, these relative
log-likelihoods indicate the odds supporting linkage between one locus and a
framework of loci when the locus is placed in its most likely position. As a
last step, we now type the complete sequence for this group, adding markers
4 and 6 into their most likely positions. Then we type map to see the
complete map of all markers in this group.
16 > sequence 4 11 8 12 9 6 10
sequence #7 = 4 11 8 12 9 6 10
17 > map
==============================
Map:
Markers Distance
4 SSR4 14.8 cM
11 SSR11 6.4 cM
8 SSR8 18.9 cM
12 SSR12 24.0 cM
9 SSR9 18.1 cM
6 SSR6 28.6 cM
10 SSR10 ----------
110.8 cM 7 markers log-likelihood = -688.99
==============================
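The way the try output is read can be sketched in Python as follows. This is an illustration rather than part of MAPMAKER: the function name `summarise` and the interval labels are hypothetical, while the relative log-likelihoods are copied from the try 4 6 output for the framework order 11 8 12 9 10.

```python
# Positions tried for each marker, top to bottom as in the try output.
positions = ["before 11", "11-8", "8-12", "12-9", "9-10", "after 10",
             "INF (unlinked)"]

# Relative log10-likelihoods copied from the try output.
try_scores = {
    4: [0.00, -35.57, -19.65, -46.80, -51.35, -43.40, -44.66],
    6: [-42.68, -118.6, -70.19, -28.09, 0.00, -21.09, -45.03],
}


def summarise(scores):
    """Return the best position, the runner-up and the log10 odds between them."""
    ranked = sorted(zip(scores, positions), reverse=True)
    (best_score, best_pos), (second_score, second_pos) = ranked[0], ranked[1]
    return best_pos, second_pos, best_score - second_score


for marker, scores in try_scores.items():
    best, runner_up, support = summarise(scores)
    print(f"marker {marker}: best position {best}; "
          f"next best {runner_up} is 10^{support:.2f} times less likely")
```

For marker 6 this reports the interval between markers 9 and 10 as best, with the next position 10 to the 21.09th power times less likely, which matches the reading given in the text.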
Likewise, we need to continue this process for all the linkage groups. Note
that sometimes, depending on the data file, a single chromosome may be
represented by more than one linkage group. However, when more markers from
that chromosome are added to the data set, the added markers may merge the
two or more linkage groups into a single linkage group. It is also important
to note that this program compares combinations of markers and gives the
likelihoods of possible sequence orders. It does NOT tell you the right
sequence, but it will tell you the most likely order; you must decide what
LODs and cM distances you will accept, and therefore the process can be
highly subjective. Hence, most importantly, when you score the data, do not
guess. When you make a mistake in scoring, it will look like a recombination
has taken place. Therefore, missing data are better than wrong data.
MAPMAKER running under DOS in Windows can report the map distances; however,
it cannot display a graphical view of the genetic map in the Microsoft
Windows operating system. MapChart is a specially designed Windows program
that can draw linkage maps and QTL maps very easily. It is freely available
at http://www.biometris.wur.nl/uk/Software/MapChart/. Alternatively, MapDraw
can also be used for linkage map drawing, and it is available free of cost
at http://www.nslijgenetics.org/soft/mapdraw.v2.2.xls.

Tips to Improve Your Analysis


1. While you are using the compare command, recall that an LOD of 2 means
one event is 100 times more likely, an LOD of 3 means it is 1,000 times more
likely, etc. A general guideline is that an LOD of 2 or 3 is conventionally
acceptable. Suppose that the first 2 orders have exactly the same
likelihood, meaning that either order is equally likely. If we look at the
sequences, we may see that the only difference between the first 2 orders is
that the order of two markers (say SSR56 and SSR58) cannot be
differentiated. The order of the other markers seems clearly to be, for
example, SSR55, (either SSR56 or SSR58), SSR57 and SSR59. An educated guess
would be that SSR56 and SSR58 are either at the same locus or tightly linked
(with not enough recombinations to create a statistically significant
order). We can check this by asking for the recombination distance between
the 2 markers, using the map command. We can double-check our order by using
ripple. This command assumes the general order is known but checks other
possible orders within each group of 3 markers, moving down the given
sequence. (Note that you would not want to use ripple for a completely
unknown order as it only looks at 3 markers at a time. Further, when you
specify the sequence command, omit the {}, or it will check all triplets of
all possible combinations.)
2. A map with 20 cM or more between markers might be questionable (remember,
we do not know a sure order, just the most likely one).
3. To make a complete map, you would need to keep going with this process
until you had a full set of good linkage groups. There are many other
commands you can try too, depending on your preferences.
4. You can probably see that there is no single right way to use MAPMAKER.
Instead of choosing some markers of Group 1 to compare, we could also have
grouped again with more stringent LOD and cM levels, or we could have worked
backwards by using the first order command to get an order and then pulled
off markers that did not fit well. Several such options can be tried, since
mapping is a very iterative and somewhat subjective process. Readers are
strongly recommended to read the MAPMAKER manual, which is available at
http://linkage.rockefeller.edu/soft/mapmaker/, before working with this
program.
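The sliding three-marker check performed by ripple (Tip 1) can be sketched in Python. This is a toy illustration under loud assumptions: MAPMAKER scores each candidate order by its multipoint likelihood, whereas the sketch below uses a hypothetical stand-in score (the sum of adjacent distances computed from made-up marker positions), and the names `ripple`, `map_length` and `position_cm` are invented for the example.

```python
from itertools import permutations

# Hypothetical marker positions in cM, used only to build a toy score.
position_cm = {"A": 0, "B": 10, "C": 20, "D": 30, "E": 40}


def map_length(order):
    """Toy score: sum of distances between adjacent markers (lower is better)."""
    return sum(abs(position_cm[a] - position_cm[b])
               for a, b in zip(order, order[1:]))


def ripple(order, score, window=3):
    """For each window of consecutive markers, score every permutation of that
    window while the rest of the order stays fixed; report the best one."""
    results = []
    for start in range(len(order) - window + 1):
        candidates = []
        for perm in permutations(order[start:start + window]):
            candidate = order[:start] + list(perm) + order[start + window:]
            candidates.append((score(candidate), candidate))
        candidates.sort()
        best_score, best_order = candidates[0]
        results.append((start, best_order, best_score))
    return results


# A starting order with markers B and C locally swapped.
for start, best_order, best_score in ripple(["A", "C", "B", "D", "E"], map_length):
    print(start, best_order, best_score)
```

Because only three markers are permuted at a time, the check can repair a local swap such as B and C here, but it cannot rescue an order that is wrong on a larger scale, which is why the tip warns against using ripple on a completely unknown order.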

Box 4.2 Linkage Map Construction Using AntMap

Locus ordering is an essential procedure in genome mapping. When the number
of loci is large, it is quite difficult to determine the optimum order with
an exhaustive search of all possible orders. The problem of searching for
the optimum order has been recognised as a special case of the travelling
salesman problem (TSP), that is, given a set of cities and the distances
between each pair of them, find a round trip of minimal total length
visiting each city exactly once. In recent years, Ant Colony Optimization
(ACO), which is a set of algorithms inspired by the behaviour of real ant
colonies, has been successfully used to solve discrete optimization problems
such as the TSP. Iwata and Ninomiya (2004) developed a novel system based on
ACO for locus ordering in genome mapping. Loci and the absolute value of the
log-likelihood (or the recombination fraction) between loci were regarded as
TSP cities and the distances between cities, respectively. They tested the
system using a simulated segregating population and found it highly
efficient for linkage grouping as well as locus ordering in genome mapping.
To make the newly developed system widely available, they developed a
software package named AntMap for constructing linkage maps with it. AntMap
performs segregation tests, linkage grouping and locus ordering and
constructs a linkage map quite rapidly and nearly automatically. The speed
of the ACO-based algorithm makes it feasible to conduct a bootstrap test of
the estimated order. With the aid of this software, researchers can save
time and labour and can obtain a linkage map whose reliability is indicated
by bootstrap values. Another advantage of AntMap is that it is open source
(http://lbm.ab.a.u-tokyo.ac.jp/~iwata/antmap/), that is, the source code and
executable of AntMap are available under the General Public License (GPL).
The Java and C++ objects that implement this newly developed system can also
be utilised in other applications.

Input File Format


The input file format of AntMap is identical to the
*.raw files required by MAPMAKER (Lander
et al. 1987). AntMap can analyse data derived
from progeny of several types of crosses,
including:
1. F2 intercross
2. F2 backcross (e.g. BC1)
3. Recombinant inbred lines by self-mating
4. Doubled haploid lines
However, the current version of AntMap
does not support two types of cross, F3 intercross by self-mating (f3 self) and recombinant inbred lines by sib-mating (ri sib), which
are supported by MAPMAKER/EXP.
The step-by-step procedure to be followed while
using AntMap is clearly described in the AntMap
Tutorial. The following are excerpts from it.

Step 0: Start AntMap


Start AntMap in the Windows operating system
by double-clicking the AntMap icon.
AntMap can also be executed by using the
executable jar file AntMap.jar on any platform (Linux, Solaris and Mac OS as well as
Windows).
Step 1: Open an Input File
Open an input file in MAPMAKER format (*.raw) through the File-Open menu.
After opening the file, the contents of the file will appear in
the Data panel. By clicking the Log tab,
you can see a summary of the input data.
Step 2: Segregation Ratio Test
Select Segregation Test from the Analysis
menu. By doing so, you can see the results of
segregation ratio tests in the Result panel.
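A segregation ratio test is, in essence, a chi-square goodness-of-fit test of the observed genotype counts against the expected Mendelian ratio. The Python sketch below illustrates the idea only; it does not reproduce AntMap's implementation. The function names and genotype counts are hypothetical, and the 5% critical values for 1 and 2 degrees of freedom are standard table values.

```python
def chi_square(observed, expected_ratio):
    """Chi-square statistic of observed counts against an expected ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Critical values of chi-square at the 5% level, keyed by degrees of freedom.
CRITICAL_5PCT = {1: 3.841, 2: 5.991}


def fits_ratio(observed, expected_ratio):
    """True if the counts do not deviate significantly from the ratio."""
    df = len(observed) - 1
    return chi_square(observed, expected_ratio) <= CRITICAL_5PCT[df]


# Hypothetical genotype counts (A, H, B) for co-dominant markers in an F2,
# tested against the expected 1:2:1 ratio.
print(chi_square([24, 52, 24], [1, 2, 1]))
print(fits_ratio([24, 52, 24], [1, 2, 1]))
print(fits_ratio([45, 40, 15], [1, 2, 1]))
```

Markers that fail such a test show distorted segregation and are usually examined more closely before they are used for grouping and ordering.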
Step 3: Linkage Grouping
Click the Options tab. Then you can see the
Grouping option panel. You can choose one
of the two grouping methods: nearest
neighbouring locus and all combinations.
The former builds up a group by sequentially
adding the locus which shows the smallest
recombination value against it. The latter will
produce results similar to those of the group command
of MAPMAKER. You can also choose the
grouping criterion, the threshold value and the
minimum number of markers for a single group.
Otherwise, keep these options unchanged except
for the threshold value.
Select Linkage Grouping from the
Analysis menu. Then you can see the
results of linkage grouping in the Result
panel. When you analyse your own data, you may
not be able to achieve a good separation of
markers into linkage groups from the start. In
such a case, find a good combination of
threshold value, criterion and method
by trial and error. It is better to
organise your data according to chromosomes and then proceed separately for each
chromosome.

Step 4: Locus Ordering and Genetic Map
Click the Options tab, and click the
Ordering tab. Then you can see the Ordering
option panel. For locus ordering, you can
choose one of two criteria: LL and
SARF. LL is an abbreviation for log-likelihood. SARF is an abbreviation for sum of
adjacent recombination fractions. AntMap
will search for a locus order which maximises
the log-likelihood or minimises the SARF. You can
also choose the number of runs of locus ordering. You can find the meaning of this option in
the AntMap Options section of the AntMap
user's manual. The map function for calculating
the map distance between adjacent markers can
be either the Haldane or the Kosambi
function. Otherwise, keep these options
unchanged. Select Locus Ordering from
the Analysis menu. Then you can see the
results of locus ordering in the Result panel.
You can also obtain a graphic of the linkage map
in the Map panel.
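The SARF criterion and the two map functions mentioned in this step can be illustrated with a short Python sketch. It is not AntMap's code: AntMap searches the order space with ant colony optimization, whereas this sketch simply tries every order of four markers. The marker names and recombination fractions are hypothetical; the Haldane and Kosambi formulas are the standard ones.

```python
import math
from itertools import permutations


def haldane_cm(r):
    """Haldane map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction 0 <= r < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def sarf(order, rf):
    """Sum of adjacent recombination fractions along an order."""
    return sum(rf[frozenset(pair)] for pair in zip(order, order[1:]))


def best_order_by_sarf(markers, rf):
    """Exhaustive search (small groups only); one orientation per order."""
    candidates = (p for p in permutations(markers) if p[0] < p[-1])
    return min(candidates, key=lambda p: sarf(p, rf))


# Hypothetical pairwise recombination fractions for four markers.
rf = {frozenset(pair): value for pair, value in {
    ("A", "B"): 0.10, ("B", "C"): 0.15, ("A", "C"): 0.22,
    ("C", "D"): 0.05, ("B", "D"): 0.19, ("A", "D"): 0.26,
}.items()}

order = best_order_by_sarf(["A", "B", "C", "D"], rf)
print(order, round(sarf(order, rf), 2))
for a, b in zip(order, order[1:]):
    r = rf[frozenset((a, b))]
    print(a, b, round(haldane_cm(r), 1), round(kosambi_cm(r), 1))
```

For the same recombination fraction the Kosambi distance is shorter than the Haldane distance because the Kosambi function allows for crossover interference, so the choice of map function changes the reported map length but not the locus order.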

Step 5: One-Step Mapping


Select Full Course from the Analysis menu.
This runs the overall process from the segregation ratio test (Step 2) to locus ordering (Step
4) at once.
Step 6: Redraw a Linkage Map
Click the Options tab, and click the Draw
map tab. Then you can see the Draw map
option panel. By changing the Scale factor
option, you can change the drawing size of the linkage
map. After changing the option
value, select Redraw Map from the Analysis
menu. You then obtain a linkage
map drawn at the new size.
Step 7: Bootstrap Test for Locus Order
You can evaluate the reliability of the estimated locus order by using a
bootstrap test. The bootstrap test (or bootstrapping) is a method for
estimating the sampling distribution of an estimator by resampling with
replacement from the original sample. In a bootstrap test, a random sample
of size n is drawn with replacement from the original sample of size n, and
estimates are obtained from the random sample. After repeating (iterating)
this operation many times (e.g. 100–1,000 times), the stability of the
estimates (e.g. the standard error or confidence interval of the estimators)
is evaluated. In the bootstrap test for locus order, we can obtain the
probability that a locus is located at its estimated position in the order.
Click the Options tab, and click the Ordering tab. Then you can see the
Ordering option panel. You can change the number of iterations (repeats) of
bootstrapping. To get a good estimate of the percentage of correct locus
order, 100 may be sufficient. You can also choose the group which is
targeted in the bootstrap test. Select Bootstrap Test from the Analysis
menu. Then you can see the results of the bootstrap test for locus order in
the Result panel. You can also obtain a graphic of the linkage map with
bootstrap values in the Map panel. The bootstrap test for all linkage groups
may take a long time even with a high-end PC. Thus, you had better set your
computer to perform this test during your lunch break or after you go home.
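The resampling idea behind the bootstrap test can be illustrated with a deliberately simple Python sketch. It is not AntMap's procedure, which re-estimates the locus order on every resampled data set; here the statistic is just the recombination fraction between two markers. The data (20 recombinants among 100 progeny), the function names and the random seed are assumptions made for the example.

```python
import random
import statistics


def recombination_fraction(flags):
    """Fraction of progeny scored as recombinant (1) between two markers."""
    return sum(flags) / len(flags)


def bootstrap_se(flags, n_iter=1000, seed=1):
    """Bootstrap standard error: resample n progeny with replacement n_iter times."""
    rng = random.Random(seed)  # fixed seed so that the sketch is repeatable
    n = len(flags)
    estimates = []
    for _ in range(n_iter):
        resample = [flags[rng.randrange(n)] for _ in range(n)]
        estimates.append(recombination_fraction(resample))
    return statistics.stdev(estimates)


# Hypothetical data: 20 recombinant and 80 non-recombinant progeny.
flags = [1] * 20 + [0] * 80
r_hat = recombination_fraction(flags)
se = bootstrap_se(flags)
print(r_hat, round(se, 3))
```

The bootstrap standard error should be close to the binomial value sqrt(0.2 x 0.8 / 100) = 0.04. In the locus-order setting the same resampling loop is used, but each iteration records where every locus falls in the re-estimated order.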

Step 8: Save Results of Linkage Mapping
You can save the information in the Result, Log
and Map panels through the Save submenu
in the File menu. The information in Result
and Log is saved as a text file. The information in Map (i.e. a graphic of the linkage map) is
saved as a JPEG (*.jpg) file.

Box 4.3 List of Software Available for Linkage Map Construction

A comprehensive list of computer software for genetic linkage analysis of
human pedigree data, QTL analysis of animal/plant breeding data, genetic
marker ordering, genetic association analysis, haplotype construction,
pedigree drawing and population genetics is given in alphabetical order at
http://linkage.rockefeller.edu/soft/list.html. However, the following
software packages are very often used by plant molecular breeders in genetic
or linkage map construction.
1. MAPMAKER (http://www.broad.mit.edu/ftp/distribution/software/mapmaker3/)
2. JoinMap (http://www.kyazma.nl/)
3. AntMap (http://cse.naro.affrc.go.jp/iwatah/antmap/index.html)
4. Map Manager QTX (http://www.mapmanager.org/)
5. QGene (http://www.qgene.org/)
6. R/QTL (http://www.rqtl.org)
7. MSTMAP (http://www.138.23.191.145/mstmap/)
8. CarthaGene (http://www.inra.fr/mia/T/CarthaGene/)
9. MadMapper (http://cgpdb.ucdavis.edu/XLinkage/MadMapper/)
10. THREaD Mapper (http://cbr.jic.ac.uk/dicks/software/threadmapper/index.html)
11. QTL IciMapping (http://www.isbreeding.net/oldweb/download_software_ICIM.aspx)
In practice, it is almost certainly best to use a mixture of approaches in
developing and refining a map. This is not only because each one brings
something unique to the analysis but also because we do not know which
approach will succeed best for a new data set and we do not know enough
about the behaviour of each tool to judge this in advance. Map estimation is
best treated as an iterative process, in which researchers first grasp the
global pattern of their data set and then re-evaluate and revise the
grouping and ordering of markers, rather than following a rigid, linear
three-stage methodology of grouping, ordering and spacing.


Bibliography
Literature Cited
Bateson W, Saunders ER, Punnett R (1905) Experimental
studies in the physiology of heredity. Rep Evol Comm
R Soc 2:1–55
Bovenhuis H, Meuwissen THE (1996) Detection and mapping of quantitative trait loci. Animal Genetics and
Breeding Unit. UNE, Armidale. ISBN 186389 323 7
Bulmer MG (1971) The effect of selection on genetic variability. Am Nat 105:201
Correns C (1913) Selbststerilität und Individualstoffe.
Biol Centralbl 33:389–423
Haldane JBS, Smith CAB (1947) A new estimate of the
linkage between the genes for colour-blindness and
haemophilia in man. Ann Eugen 14:10–31
http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=genomes
Iwata H, Ninomiya S (2006) AntMap: constructing genetic
linkage maps using an ant colony optimization algorithm. Breed Sci 56:371–377
Janssens FA (1909) La théorie de la chiasmatypie.
Nouvelle interprétation des cinèses de maturation.
Cellule 22:387–411
Kohel RJ, Richmond TR, Lewis CF (1970) Texas Marker
1. Description of genetic standards for G. hirsutum L.
Crop Sci 10:670–671
Lander ES, Green P, Abrahamson J, Barlow A, Daly MJ,
Lincoln SE, Newburg L (1987) MAPMAKER: an
interactive computer package for constructing primary
genetic linkage maps of experimental and natural populations. Genomics 1:174–181
MAPMAKER v3.0 Tutorial. http://linkage.rockefeller.edu/soft/mapmaker/
Mendel G (1865) Available at
http://www.dnalc.org/view/16172-Gallery-3-Gregor-Mendel-Manuscript-1865.html
Morgan TH (1911) Random segregation versus coupling
in Mendelian inheritance. Science 34:384
Morton NE (1955) Sequential tests for the detection of
linkage. Am J Human Genet 7:277–318
Sturtevant AH (1913) The linear arrangement of six
sex-linked factors in Drosophila, as shown by their
mode of association. J Exp Zool 14:43–59
Sutton WS (1903) The chromosomes in heredity. Biol
Bull 4:231–251

Further Readings
Bailey NTJ (1961) Introduction to the mathematical
theory of genetic linkage. Oxford University Press,
London
Cheema J, Dicks J (2009) Computational approaches and
software tools for genetic map estimation in plants.
Brief Bioinform 10(6):595–608
McPeek MS (1996) An introduction to recombination and
linkage analysis.
http://www.stat.wisc.edu/courses/st992-newton/smmb/files/broman/mcpeek96.pdf
Whitehouse HLK (1973) Towards an understanding
of the mechanism of heredity. St. Martin's Press,
New York
Wu R, Gallo-Meagher M, Littell RC, Zeng Z (2001)
General polyploid model for analyzing gene segregation in outcrossing tetraploid species. Genetics
159:869–882

Phenotyping

Phenotyping Versus QTL Mapping


The ultimate goal of plant breeding is to develop
cultivars that have shown consistently good
performance for the primary traits of interest.
Primary traits are usually agronomically and economically important traits and will vary among
crop species. These traits are quantitative, rather
than qualitative, in nature. Quantitative traits
vary continuously (e.g. yield, quality and stress
tolerance), whereas qualitative ones are usually
(not always) binary (yes vs. no; e.g. resistance to
a fungus and colour of flower). Quantitative traits
are typically governed by a number of genes,
while qualitative ones are often simply inherited
(decided by one or two genes; hence called
simple or major traits). Although progress had
been made in cultivar development in most crop
species since the rediscovery of Mendelism, further genetic progress required more information
on the inheritance of the primary traits and
associations with other traits that are needed
in improved cultivars. Quantitative geneticists
believed that they could enhance breeding methods if the inheritance of quantitative traits was
better understood. However, some of the assumptions (random mating populations, linkage equilibrium, two alleles per locus, no epistasis, etc.)
used by the quantitative geneticists in developing
the theory and methods of estimation did not
seem realistic to practicing plant breeders.
Initially, greater efforts were devoted to studies
related to types of gene action. Identifying the
genes for primary traits will help in answering
several genetic questions: How many genes
influence the given traits, and what are their relative effect sizes? Do these genes show evidence
of non-neutral evolution at the sequence level?
What environmental and evolutionary forces lead
to the maintenance of variation at these loci? Do
ecologically similar environments favour the
same genes, or is it possible to achieve a similar
phenotype with different genetic mechanisms?
Recent breakthroughs in molecular biology
have helped to find answers to many of these questions via quantitative trait loci (QTL) mapping.
The loci involved in the inheritance of quantitative traits are commonly called QTL, and
identification of such QTL is referred to as QTL
mapping. The purpose of the phenotyping experiment (evaluating the given trait) is to assign a trait
value to each mapping population member. This
value is then combined with the allele scores at the
set of marker loci distributed throughout the
genome (refer to chapter 4). A data file is then created which
includes all the trait data and all the marker data
for the entire population. Various software applications can be applied to this data file to identify
statistical associations/correlations between the
presence of alternative alleles and the trait value.
The greater this correlation is, the higher the
probability that a gene near the marker contributes
to the trait. To calculate the strength of the
association between genotype and phenotype,
the mapping population is split into two groups,
according to the allele they carry at
each marker in turn. Then the mean trait values of
these two classes are compared. If the difference is

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_5, Springer India 2013

significant, then this provides initial evidence for
the location of a QTL in the neighbourhood of
the marker (refer to chapter 6 for further details on
QTL-mapping methods and principles).
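The comparison of the two marker classes described above amounts to a single-marker analysis, which can be sketched in Python as follows. This is a minimal illustration, not the method of any particular QTL package: the trait values, marker names (M1, M2) and genotype codes are hypothetical, and Welch's t statistic is used here as one reasonable way to compare the two class means.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two class means."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


def single_marker_scan(genotypes, trait):
    """Split the population by allele at each marker in turn and compare means."""
    results = {}
    for marker, alleles in genotypes.items():
        a_values = [t for g, t in zip(alleles, trait) if g == "A"]
        b_values = [t for g, t in zip(alleles, trait) if g == "B"]
        results[marker] = welch_t(a_values, b_values)
    return results


# Hypothetical trait values for eight lines and their alleles at two markers.
trait = [10.1, 9.8, 10.4, 10.0, 7.9, 8.2, 8.0, 7.7]
genotypes = {
    "M1": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "M2": ["A", "B", "A", "B", "A", "B", "A", "B"],
}

for marker, t_value in single_marker_scan(genotypes, trait).items():
    print(marker, round(t_value, 2))
```

In this toy data set the two allele classes at M1 differ strongly in mean trait value, which would be initial evidence for a QTL near M1, whereas the classes at M2 are nearly identical.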
Thus, the goal of QTL mapping is to determine the loci that are responsible for variation in
quantitative traits. In some situations, determination of the number, location and the interaction of
these loci is the ultimate goal besides identifying
the actual genes and their functions. For example,
breeding studies attempt to identify the loci that
improve crop yield or quality and then to bring
the favourable alleles together into elite lines via
marker-assisted breeding. Understanding the
response of QTL in different environments or
genetic backgrounds improves the efficiency of
marker-assisted breeding. If the genes underlying
the QTL are known (i.e. the QTL have been
cloned, a process called map-based cloning that is discussed
in chapter 7), then transgenic approaches can also
be used to directly introduce beneficial alleles
across wide species boundaries.
Identifying a gene or QTL within a plant
genome is like finding the proverbial needle in a
haystack. However, QTL analysis can be used to
divide the haystack into manageable piles and systematically search them. Data collection on
the given trait is often hampered by the significant
influence that environmental factors have on the
expression of a trait and the variability of these
environmental factors. This is especially true for
traits related to crop yield. In addition to their
sensitivity to environment and the phenomenon
of genotype-by-environment interaction (i.e. the
differential reaction of genotypes to environmental changes), such traits are often controlled by a
large number of genes. These factors make it
difficult to analyse their genetic basis and therefore complicate QTL analysis.

Need for Precise Phenotyping


The accuracy of phenotypic evaluation is of the
utmost importance for the accuracy of QTL mapping. A reliable QTL map can only be produced
from reliable phenotypic data. Replicated phenotypic
measurements or the use of clones (via cuttings)
can be used to improve the accuracy of QTL
mapping by reducing experimental error or
background noise. High-throughput phenotyping
for QTL mapping under highly controlled plant
development conditions provides the best basis
for extracting a maximum of information from
mapping populations. This way, reproducible and
comprehensive datasets are generated. Some thorough studies may include conducting phenotypic
evaluations both in field and glasshouse trials.
Moreover, QTL mapping assumes accurate
phenotypic scoring methods, something that can be
difficult to optimise and even more difficult to keep
working for months or years. Just a few mis-scored
individuals can totally confound QTL discovery
and placement. Even when a well-performed mapping experiment indicates promising QTL, there is
always much more that needs to be done to make
the mapping data ready for QTL analysis. In such
cases, repetition over several years and several
locations, repetition in larger sibling populations,
repetition in genetically unrelated populations and
detailed analyses in marker-generated near-isogenic
lines (NILs) that isolate the effects of individual
QTL can be considered as additional steps to
improve and validate the QTL analysis. It is also
important to consider that any one of these efforts
could be expensive, time consuming or impossible
in practice. Hence, it is essential to understand the
basic principles and a broad set of references that
are useful for the optimal management of phenotyping practices for QTL discovery.
To be practical, the first step is to define the
target environments (also identified as the target
population of environments (TPE)). Differences
in TPE are largely determined by genotype-by-environment interactions (GEI). The identification
and characterisation of a TPE is facilitated by the
use of crop simulation models based on historic
records of weather data. Simulation can describe
a TPE by the frequency of occurrence of specific
biotic and abiotic stresses and be based on the soil
profile (moisture, nutrient, microbial load, etc.)
along with the crop cycle. Within each TPE, GEI
are frequently observed relating to yearly
fluctuations in environmental factors (e.g. rainfall
and temperature), diseases (e.g. foliar disease)
and/or parasites (e.g. insects). Ideally, phenotyping
should be carried out across a broad range of
environments present within the TPE, and it has
been shown on several occasions that this improves
QTL analysis. Further, in combination with high-throughput phenotyping, multi-location trials
help to standardise and improve the collection of
phenotypic data and facilitate the creation of
repository databases useful for QTL meta-analyses and other comprehensive approaches
(explained in chapter 6). Thus, QTL analysis requires great emphasis on the
basic factors that are crucial for the management
of experiments and the collection of meaningful
and error-free phenotypic data.
Three basic principles of experimental design
(replication, randomization and blocking, also called local control)
proposed by the statistician R. A. Fisher should
be strictly applied to any field or greenhouse test for
QTL identification. In fact, for a QTL-mapping
project, field experiments should control experimental error more stringently than usual, since minor
QTLs with small effects are expected to be
detected. In a trial with fewer than three replicates
and a small plot size per genotype, a coefficient of
variation (CV) higher than 15% is usually considered undesirable. One may expect an even
higher CV and greater environmental variation when individual plants (such as the individual progenies of a
mapping population used for QTL mapping) are
the units of measurement. Heritability estimates
(see below) based on individual plots are usually
much higher than those based on individual plants,
which is why breeders routinely test progenies in
replicated plots.
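As a rough illustration of the CV threshold mentioned above, the following Python sketch computes the coefficient of variation of a trial as the square root of the error mean square divided by the grand mean. It assumes a completely randomized layout without blocks, so that the pooled within-genotype variance serves as the error mean square. The function name `trial_cv` and the yield values are hypothetical.

```python
from math import sqrt
from statistics import mean


def trial_cv(plots):
    """Coefficient of variation (%) of a trial.

    plots maps each genotype to its list of replicate plot values.
    CV = sqrt(error mean square) / grand mean * 100, where the error mean
    square is the pooled within-genotype variance (no blocks assumed).
    """
    all_values = [x for values in plots.values() for x in values]
    grand_mean = mean(all_values)
    if grand_mean == 0:
        raise ValueError("CV is undefined when the grand mean is zero")
    ss_error = sum((x - mean(values)) ** 2
                   for values in plots.values() for x in values)
    df_error = len(all_values) - len(plots)
    if df_error <= 0:
        raise ValueError("replicated plots are needed to estimate error")
    return sqrt(ss_error / df_error) / grand_mean * 100.0


# Hypothetical yields (t/ha) of two genotypes in three replicates each.
yields = {"G1": [4.0, 5.0, 6.0], "G2": [7.0, 8.0, 9.0]}
cv = trial_cv(yields)
print(f"CV = {cv:.1f}%", "(above 15%)" if cv > 15 else "(acceptable)")
```

A blocked design would remove the block sum of squares from the error term before computing the CV, which this sketch does not attempt.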
Phenotyping under controlled conditions is
relatively straightforward when scoring traits in a
binary fashion, such as for photoperiod sensitivity, and when environmental conditions do not
have much effect on the target trait or are easily
defined (e.g. light vs. darkness). However, it
becomes more complex when the target traits are
quantitatively assessed, as in the case of growth,
and when environmental conditions that vary
during the day (e.g. temperature, light intensity
and soil water status) influence the target trait
(e.g. the rate of leaf elongation). In this case, the
phenotype is rather dynamic and better defined
by a series of response curves to environmental
stimuli, an approach that is very time consuming
and requires a tight control of environmental conditions. High-throughput phenotyping platforms
allow for the automation of these procedures and
streamline and standardise the collection of
highly accurate phenotypic data. State-of-the-art
technology including imaging, robotic and computing equipment allows for the continuous phenotypic measurement of tens of thousands of
plants automatically and non-destructively. On
the other hand, the installation and operating cost
of these platforms is very high. Additionally, it is
critical that the experimental conditions mimic as
closely as possible the dynamics of the ecological environment prevailing in the fields of the
TPE. At the same time, no matter how accurate and precise the phenotyping is,
the vast majority of the QTLs determining the
measured phenotype will remain undetected, because
the effects of most genetic factors controlling quantitative traits
are simply too small to be
identified at a statistically significant level.

Phenotyping for Biotic Stress


Biotic stresses, such as diseases and insects
(including fungi, bacteria, viruses, nematodes,
phytoplasmas, herbivorous insects and sometimes weed species), account for significant
annual yield losses in crop plants. Biotic stress
usually affects all parts of the plants in all the
crop-growing regions and seasons. Resistance to
these diseases and insects is controlled either by
dominant or recessive major genes or by QTL.
Phenotyping of mapping populations for their
resistance to the given biotic stress is the key step
in QTL analysis. Upon identification of QTLs,
more durable resistance could be achieved by
pyramiding of resistance genes via marker-assisted selection (refer to chapter 8 for further
details). However, progress in this direction is
hindered by the pathogenic variability of insects
and pathogens and the evolution of new and more
aggressive pathotypes or races. Although sources
of resistance or tolerance to pests and diseases
have recently been identified in several crops, in
most cases genetic studies are not available. Only
for a few diseases (those of agronomic and economic significance, depending on the pest/pathogen isolate or race) have resistance or dominant genes
been reported. At present, it is not clear whether
the reported resistance genes represent the same
or different loci because allelic tests were not
performed. Involvement of other genes in expression of resistance further complicates this picture.
Yet another drawback in this context is that, when the
crop is screened in the field for biotic stress resistance, several pathotypes/genotypes of the pest
and pathogen coexist in the same field or even in
the same infected plant part. Since random mating may occur between different pathotypes or genotypes of the pest and pathogens
carrying different mating type alleles, genetic
recombination may contribute to genotypic diversity and provide the pests/pathogens with an
additional means to adapt to resistant germplasm.
Thus, when screening breeding materials for
biotic stress resistance, a combination of several
methods and strategies should be applied for
assessment of such resistance. Numerous studies
have indicated that testing under controlled glasshouse or growth chamber conditions combined
with field screening would very much help to
improve the reproducibility of the results (which
is essential for accurate and consistent QTL
identification) since severity and spread of the
pest and diseases are highly dependent on environmental conditions (especially on humidity,
which may change from year to year).
It is also imperative to note that different loci
may contribute to resistance at different points of
the life cycle of the plant. Usually, biotic
stress resistance is scored on a
scale (e.g. score 1 denotes completely resistant
and score 9 denotes completely susceptible). As
the scale used for biotic stress resistance evaluation is subjective, particularly for intermediate
values (in the above scoring, e.g. score 4, moderately resistant; score 5, moderately susceptible), a bias may be introduced by the researcher
that may affect the phenotyping data and ultimately the QTL-mapping process. To resolve this
dilemma, it is commonly suggested to apply
different scoring systems for the given pest or
disease resistance in the same environment.
While conducting bioassay tests, it is necessary
to develop a pure pest population from a single
colony grown on a single host under controlled
conditions with an appropriate standard procedure.
Replicated experiments should be carried out
with larvae or nymphs of the same instar on
plants at the same phenological stage, and data
should be collected at different time points.
Failure to do so may cause differential
responses and hence serious errors in phenotyping data. Further, recent evidence shows that
plants respond to multiple stresses differently
from how they respond to individual stresses, activating a specific programme relating to the exact
environmental conditions encountered. Rather
than being additive, the presence of an abiotic
stress can have the effect of reducing or enhancing susceptibility to a biotic pest or pathogen and
vice versa. This interaction between biotic and
abiotic stresses is orchestrated by signalling pathways that may induce or antagonise one another
and further controlled by a complex regulatory
network. Hence, such phenotypic data should be
analysed very cautiously during QTL analysis
and interpretation.

Phenotyping for Abiotic Stress


Crop production is limited by various abiotic
stresses such as water deficit, submergence, salinity and deficiencies of P and Zn. In recent years,
advances in physiology, molecular biology and
genetics have greatly improved our understanding of how crops respond to these stresses and the
basis of varietal differences in tolerance. Progress
has relied on the application of rather specific
phenotypic screens that allow the effects of stress
to be distinguished from other general differences. QTLs have been identified that explain a
considerable portion of observed variation, and
in some cases, the genes underlying specific
QTLs have been identified (e.g. submergence tolerance in rice). Which traits are suitable for
QTL mapping of abiotic stress resistance/tolerance has long been debated as a key question.
For example, the morpho-physiological
traits and the corresponding QTLs that affect
drought tolerance can be categorised as constitutive (i.e. also expressed under well-watered conditions) or drought-responsive (i.e. expressed only
under pronounced water shortage) (see chapter
11 for more detailed description of drought tolerance in rice). While drought-responsive traits/
QTLs usually affect yield only under rather
severe drought conditions, constitutive traits/
QTLs can affect yield at low and intermediate
levels of drought stress as well. The response of
QTLs for drought-adaptive traits (e.g. accumulation of osmolytes and relocation of water-soluble
carbohydrates) to drought is probably due to regulation of the expression of the underlying structural genes in response to signalling cues such as
abscisic acid (ABA) accumulation, which is in turn
induced by cellular dehydration. Experimental
evidence indicates that the progress achieved by
breeders during the last century can mainly be
accounted for by changes in constitutive traits that
affect dehydration avoidance rather than drought-responsive traits. In this respect, emphasis is
increasingly being placed on phenotyping traits
that constitutively increase yield per se, rather
than on characteristics that enhance plant survival
under extreme drought, in view of a possible negative trade-off under less severe circumstances.
An excellent collection of methods, principles
and protocols useful in abiotic stress resistance
screening (more particularly for drought screening in crop plants) is comprehensively described
in the book Drought Phenotyping in Crops: From
Theory to Practice. Before starting a phenotyping
experiment for abiotic stress resistance, readers
are advised to refer to this book for a better understanding of phenotyping and of the issues and challenges
in planning and managing experiments specific to
each crop or trait, and of its importance in QTL analysis for abiotic stress resistance traits.
Good phenotyping is pivotal for reducing the
genotype-phenotype gap, especially for quantitative traits, which are the major determinants of
abiotic stress resistance. Keeping a good record of
meteorological parameters (rainfall, temperatures,
wind, evapotranspiration, light intensity, etc.)
allows for more meaningful interpretation of the
results and identification of the environmental
factors limiting yield. The basic attributes of good
phenotyping carried out with appropriate genetic
materials are accuracy and precision of measurements, coupled with relevant experimental conditions that are representative of the TPE. Accuracy
involves the degree of closeness of a measured or
calculated quantity to its actual (true) value.
Accuracy is closely related to precision, also
termed reproducibility or repeatability, the degree
to which further measurements or calculations
show the same or similar results. A further
complexity of phenotyping a large number of
genotypes (e.g. a mapping population) for stress-adaptive features is exemplified by those traits for
which the value can vary considerably within a
rather short timeframe due to changing environmental conditions. Good phenotyping means not
only the collection of accurate data to minimise
the experimental noise introduced by uncontrolled environmental and experimental variability but also the collection of data that are relevant
and meaningful from a biological and agronomic
standpoint, under the conditions prevailing in
farmers' fields within the TPE. Although hundreds
of accurate studies reporting thousands of stress-responsive genes and QTLs can be found in the
literature, the relevance of these data to real field
conditions is often questionable.

Heritability of Phenotypes
Collecting accurate phenotypic data that are
relevant to the TPE has always been a major
challenge for the improvement of quantitative
traits. The success of this endeavour is intimately
connected with the heritability of the trait, namely,
the portion of the phenotypic variability accounted
for by additive genetic effects that can be inherited
through sexually propagated generations. Trait
heritability varies according to: (1) the genetic
make-up of the materials under investigation, (2)
the conditions under which the materials are investigated and (3) the accuracy and precision of the
phenotypic data. With only a few notable exceptions, most of the traits determining the performance of crops have low (~0.30–0.40)
or, at best, intermediate (~0.40–0.60) heritability.
This impairs our capacity to dissect their genetic
basis properly. Despite this, careful evaluation
and appropriate management of the experimental
factors that lower the heritability of traits, coupled with a wise choice of the genetic material
(e.g. use of phenotypically dissimilar parents to
obtain the most extreme phenotypes when developing a mapping population), can provide effective ways to
increase heritability. Once a sound association
has been established between a marker and a
locus affecting a target trait, the problems encountered in the conventional selection of quantitative
traits, particularly the lowly heritable ones, can
be partially overcome through the use of markers
linked to QTLs for the target trait. This enables
individuals to be scored based on their genetic
make-up rather than their phenotypic features,
and the process is referred to as marker-assisted
selection (refer to chapter 8 for more details).
However, the probability of identifying the
relevant chromosomal regions and accurately
estimating their effects relies on good phenotyping of the genetic materials originally used to
establish the phenotype-genotype associations.
In other words, the effectiveness of marker-based
approaches intimately depends on how well and
how accurately the target trait has been assessed
phenotypically in mapping populations. In fact, a
low heritability impairs the probability of detecting the presence of QTLs, thereby increasing
Type II errors (i.e. false negatives).
Heritability measures the proportion of the
phenotypic variance that is due to genetic effects.
This measure is important for QTL mapping
because it tells us the maximum proportion
of phenotypic variance that can be explained
by the QTLs. Thus, if a trait has a heritability of 50% in a particular set of environments
and if we detected all the QTL that affect the
trait, the combined effects of all the QTL can
explain 50% (but no more than 50%) of the phenotypic variation. In practice, it is possible to
overfit a QTL model, so it seems to be explaining more than the limit set by heritability, but in
such cases, the model is actually explaining
noise, rather than genetic effects, and will have
less predictive value than one thinks. Thus, by
knowing the heritability of a trait for a particular
data set, one can at least know where the limit of
QTL modelling is, so one can know if overfitting
is likely to be a problem.
Typically, for both selection applications and
for QTL mapping, phenotypic variance means the variance of
line-mean phenotypes. Thus, if we have data
from multiple replications and multiple environments, we first compute the means of each
line across replications and environments, then
we can calculate the variance of these means.
This is the phenotypic variance. So even if environment and experimental errors have large
effects on the phenotype observed in a single
plot, one can reduce the effect of these nongenetic factors on the line mean by averaging
across multiple replications and plots. This
results in an increase in the heritability on a
line-mean basis, even if the heritability is very
low on a single-plot basis. Since selection or
QTL mapping is conducted on the basis of line
means, rather than individual plot values, one
can experimentally increase the line-mean heritability by good experimental design and extensive environmental replication.
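The effect of replication on line-mean heritability can be illustrated numerically. The sketch below assumes the common simplified expression H² = σ²g / (σ²g + σ²e / r), in which the error variance of a line mean shrinks in proportion to the number of replicated plots r, and it ignores genotype-by-environment variance. This expression and the variance values are assumptions made for illustration; they are not given in the text above.

```python
def line_mean_heritability(sigma2_g, sigma2_e, n_reps):
    """Heritability on a line-mean basis under a simplified model.

    sigma2_g -- genetic variance among lines
    sigma2_e -- error variance of a single plot
    n_reps   -- number of plots averaged per line
    Assumes H2 = sigma2_g / (sigma2_g + sigma2_e / n_reps), with no
    genotype-by-environment term.
    """
    if n_reps < 1:
        raise ValueError("n_reps must be at least 1")
    return sigma2_g / (sigma2_g + sigma2_e / n_reps)


# Hypothetical trait with low single-plot heritability (0.20).
for reps in (1, 2, 4, 8):
    h2 = line_mean_heritability(sigma2_g=1.0, sigma2_e=4.0, n_reps=reps)
    print(f"{reps} replicate(s): line-mean heritability = {h2:.2f}")
```

With these illustrative values the heritability rises from 0.20 on a single-plot basis to about 0.67 when eight plots are averaged, which is the behaviour described in the paragraph above.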
The heritability estimates (say x) tell us that the
best possible QTL models (assuming we detect all
the QTL affecting each trait) can explain at most
x % of the phenotypic variance for a given trait. The
remaining phenotypic variance (100 − x %) cannot
be explained by genetics or QTLs, since it is due to
GEI or to error variance. We should be able to
detect QTLs that explain more variance within
each environment because the within-environment
heritabilities are higher, but since the GEI variance
is large, we expect that some of the QTLs detected in one year
will differ in location and/or effect from the
QTLs detected in another year. This kind of
GEI is mainly noise. Hence, it is not advisable to
look for year-specific QTLs.
Assuming that both the type and the number of
treatments (genotypes, stress type (including
intensity, degree and duration), etc.) to be evaluated are adequate for the specific objectives of
each experiment, the following general factors
should be evaluated carefully to ensure the collection of meaningful phenotypic data in field
experiments: experimental design, heterogeneity
of experimental conditions between and within
experimental units, size of the experimental unit
and number of replicates, number of sampled
plants within each experimental unit and genotype-by-environment-by-management interaction. The
relative impact of each factor on the quality of the
phenotypic data to be collected will vary greatly
according to each experiment. As an example, an
excessive heterogeneity in soil characteristics
(depth, moisture, pH, etc.) and/or compaction
among field plots will inevitably increase the
experimental error and will jeopardise an accurate
evaluation of yield. The additional factors such as
variation in phenology, interaction with other
biotic and abiotic stresses and managing the
dynamics and intensity of given stress episodes
should also receive due attention when planning
and conducting the experiments. Insufficient attention may lead to faulty conclusions, particularly in
terms of interpreting cause and effect relationships
between yield and other traits/variables.

Statistical Analysis of Phenotypic Data: Simple Statistics, Heritability Estimation and Correlation
The data collected from phenotyping experiments
can be used to compute the mean, minimum and
maximum values of the given traits. Correlation
analysis should be done to understand the relationships among the investigated traits (the
Pearson correlation coefficient is widely preferred). A
negative genetic correlation between two traits
suggests that a large proportion of the QTL
affect both traits but in opposite directions.
If two traits have a strong positive correlation, we expect to find some
QTL for both traits in the same chromosomal locations. In order to calculate heritability, it is
essential to perform a single-factor analysis of variance (ANOVA). This can be done using any statistical
software such as SAS, IRRISTAT and GENSTAT
or simply by using Microsoft Excel. From the
ANOVA table, the genetic variance $\sigma_a^2$
can be obtained as

$$\sigma_a^2 = \frac{\text{Genotype mean square} - \text{Error mean square}}{\text{Number of replications}}$$

The error mean square is also denoted by $\sigma_e^2$ and the
number of replications by $r$. From these values,
broad-sense heritability ($h^2$; repeatability on a single-plot level) is calculated as

$$h^2 = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_e^2} \times 100\,\%$$

The higher the $h^2$ value, the higher the
repeatability of the given trait. In other words, the
environmental effect on the trait becomes small as
$h^2$ approaches 100% (i.e. 1). Conversely, if $h^2$ is 0, there is no point
in doing QTL analysis. The $h^2$ can be interpreted
as follows: 0–30%, low heritability;
31–60%, moderate heritability; and 61–100%,
the trait is highly heritable.
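The heritability calculation described above can be sketched in Python as follows. This is a minimal illustration, assuming a balanced single-factor layout (every genotype has the same number of replicates and there are no blocks). The function names, the class boundaries used for non-integer percentages and the data are hypothetical, and a negative variance estimate is simply set to zero.

```python
from statistics import mean


def broad_sense_heritability(plots):
    """Return (sigma2_a, sigma2_e, h2_percent) from a balanced one-way ANOVA.

    plots maps each genotype to its list of replicate values.
    """
    rep_counts = {len(values) for values in plots.values()}
    if len(rep_counts) != 1:
        raise ValueError("this sketch assumes equal replication")
    r = rep_counts.pop()
    g = len(plots)
    if g < 2 or r < 2:
        raise ValueError("need at least two genotypes and two replicates")
    grand_mean = mean([x for values in plots.values() for x in values])
    ss_genotype = r * sum((mean(v) - grand_mean) ** 2 for v in plots.values())
    ss_error = sum((x - mean(v)) ** 2 for v in plots.values() for x in v)
    ms_genotype = ss_genotype / (g - 1)        # genotype mean square
    ms_error = ss_error / (g * (r - 1))        # error mean square (sigma2_e)
    sigma2_a = max((ms_genotype - ms_error) / r, 0.0)
    total = sigma2_a + ms_error
    if total == 0:
        raise ValueError("no variation in the data")
    return sigma2_a, ms_error, 100.0 * sigma2_a / total


def heritability_class(h2_percent):
    """Classes from the text; the cut-offs for fractions are an assumption."""
    if h2_percent <= 30:
        return "low"
    if h2_percent <= 60:
        return "moderate"
    return "high"


# Hypothetical trait values for three genotypes in two replicates.
data = {"G1": [10.0, 12.0], "G2": [14.0, 16.0], "G3": [18.0, 20.0]}
sigma2_a, sigma2_e, h2 = broad_sense_heritability(data)
print(f"sigma2_a = {sigma2_a:.2f}, sigma2_e = {sigma2_e:.2f}, "
      f"h2 = {h2:.1f}% ({heritability_class(h2)})")
```

Here the genotype mean square is 32 and the error mean square is 2, so the genetic variance is (32 − 2) / 2 = 15 and the heritability is 15 / 17, or about 88%.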

Bibliography
Literature Cited
Monneveux P, Ribaut JM (2012) Drought phenotyping in
crops: from theory to practice. CIMMYT/Generation
challenge programme, Mexico. Freely available at:
https://www.integratedbreeding.net/drought-phenotyping-crops-theory-practice

Further Readings
Pask AJD, Pietragalla J, Mullan DM, Reynolds MP (2012)
Physiological breeding II: a field guide to wheat phenotyping. CIMMYT, Mexico
Reynolds MP, Pask AJD, Mullan DM (2012) Physiological
breeding I: interdisciplinary approaches to improve
crop adaptation. CIMMYT, Mexico
Shashidhar HE, Henry A, Hardy B (2012) Methodologies
for drought studies in rice. International Rice Research
Institute, Los Baños

QTL Identification

QTL: A Prelude
Most of the important agronomic traits are quantitatively inherited and are controlled by several
genes (i.e. polygenic). Thus, the nature of quantitative traits is that their expression is controlled
by tens, hundreds or even thousands of quantitative trait loci (QTL), each of which generally has only a small effect on the trait. A QTL is a
genomic region that comprises gene(s) which
govern(s) the expression of the quantitative trait.
Since the advent of molecular markers, researchers and breeders have aimed to identify functional
markers (refer chapter 3 for different kinds
of markers) associated with these QTL for implementation of marker-assisted selection. Historically, QTL detection started with linkage mapping
in biparental populations (refer chapter 2 for
population types (Sax 1923; Thoday 1961)).
Identifying a gene or QTL within a plant genome
is like finding the proverbial needle in a haystack.
However, QTL analysis can be used to divide the
haystack into manageable piles and systematically
search them. In simple terms, QTL analysis is
based on the principle of detecting an association
between phenotype and the genotype of markers.
Markers are used to partition the mapping population into different genotypic groups based on the
presence or absence of a particular marker locus
and to determine whether significant differences
exist between groups with respect to the quantitative trait being measured. Thus, a statistically
significant difference between the phenotypic means
of the marker groups (either 2 or 3, depending
on the marker system and type of population)
indicates that the marker locus being used to partition the mapping population is linked to a QTL
controlling the trait.
A significant P value for the difference between mean trait values
indicates linkage between marker and QTL
because of recombination (refer to chapter 4 for details on
recombination). The closer a marker is to a
QTL, the lower the chance of recombination
occurring between marker and QTL. Therefore,
the QTL and marker will usually be inherited
together in the progeny, and the mean of the group
with the tightly linked marker will be significantly
different (P < 0.05) from the mean of the group without the marker. When a marker is loosely linked or
unlinked to a QTL, the marker and QTL segregate independently. In this situation, there
will be no significant difference between the means of
the genotype groups based on the presence or
absence of the marker. Markers located
far from the QTL or on different chromosomes are inherited independently of the
QTL; therefore, no significant differences between
means of the genotype groups will be detected.
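The dependence of the marker-class difference on recombination can be made concrete with a small calculation. The sketch below assumes a doubled-haploid population derived from an F1, a single QTL whose two homozygotes have genotypic values +a and −a, and a recombination fraction r between marker and QTL. Under these assumptions, which are not spelled out in the text above, the expected difference between the two marker-class means is 2a(1 − 2r). The function name and the effect size are hypothetical.

```python
def expected_marker_contrast(a, r):
    """Expected difference between marker-class means in a DH population.

    a -- additive effect of the QTL (homozygotes have values +a and -a)
    r -- recombination fraction between marker and QTL (0 to 0.5)
    Within one marker class a fraction (1 - r) of lines carries the linked
    QTL allele and a fraction r carries the other one, so the class mean is
    a * (1 - 2r) and the contrast between classes is 2a * (1 - 2r).
    """
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie between 0 and 0.5")
    return 2.0 * a * (1.0 - 2.0 * r)


# Hypothetical QTL with additive effect a = 5 trait units.
for r in (0.0, 0.1, 0.3, 0.5):
    diff = expected_marker_contrast(5.0, r)
    print(f"r = {r:.1f}: expected difference = {diff:.1f}")
```

The contrast equals the full QTL effect when the marker sits on the QTL (r = 0) and falls to zero for an unlinked marker (r = 0.5), which is why only closely linked markers show significant differences between genotype groups.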
There are different methods used to detect the
QTL and test the inheritance of QTL and markers.
Those methods are discussed in detail hereunder;
a comparison of the commonly used methods of QTL detection is given in Table 6.1, and a list
of QTL-mapping software is given in Box 6.1.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_6, Springer India 2013


Table 6.1 Comparison of different types of methods used in QTL analysis

| Features | Single-marker analysis | Simple interval mapping | Composite interval mapping | Multiple QTL mapping |
| --- | --- | --- | --- | --- |
| Principle | One marker is involved at a time to find the QTL-marker association | It is based on the joint frequencies of a pair of adjacent markers and a putative QTL flanked by the two markers | Multiple regression methods are integrated with interval mapping to increase the probability of including all significant QTL in the model | It uses multiple marker intervals simultaneously to fit multiple putative QTL directly in the QTL-mapping model |
| Methods | Simple t-test, ANOVA, linear regression, likelihood ratio test, maximum likelihood estimation | Likelihood approach, regression approach or combination of the above two approaches | Combining simple interval mapping with multiple regression methods | Cockerham's model for interpreting genetic parameters and the method of maximum likelihood for estimating genetic parameters |
| Advantages | Simple in terms of data analysis; performed using common statistical software; gene order and complete linkage map are not required | QTL location can be identified | Multiple QTL in a single linkage group can be identified | More powerful and precise than all the above three methods; epistasis between QTL, genotypic values of individuals and heritabilities of quantitative traits can be readily estimated and analysed |
| Limitations | The putative QTL genotypic means and QTL positions are confounded, which causes biased estimation of QTL effects and low power in detection of such QTL; QTL positions cannot be precisely determined due to the non-independence among the hypothesis tests for linked markers that confound QTL effect and position; doing a t-test/ANOVA at every marker results in many false positives | Requires prior construction of a good quality linkage map; considers one QTL at a time in the model for QTL mapping and hence is biased in estimation of QTL when multiple QTL are located in the same linkage group | Inclusion of too many cofactors reduces the power to identify QTL relative to interval mapping | Sophisticated high-end systems are required with skilled manpower |
| Reference | Edwards et al. (1987) | Lander and Botstein (1989) | Jansen (1993), Rodolphe and Lefort (1993), and Zeng (1993) | Kao et al. (1999) |


Box 6.1 List of QTL-Mapping Software

In the past decades, many QTL-mapping
procedures have been developed. A large
number of computer programs are now available to implement these methods. These
programs have significantly simplified the
applications of the methods in QTL analysis.
A complete list of the programs is posted on
the web sites http://linkage.rockefeller.edu/
soft and http://www.stat.wisc.edu/~yandell/
statgen/software/biosci/linkage.html. Most of
the programs were developed as standalone
software packages. These include MapMaker/
QTL [1], MapManager [2], QTL Express [3],
MapQTL [4], MCQTL [5], MULTIMAPPER
[6], Meta-QTL [7], WinQTLCart [8] and QTL
Network [9]. Other programs were developed
using the R package, for example, R/qtl [10]
and R/qtlbim [11]. PROCBTL is a trial version of a SAS procedure for mapping binary
trait loci (BTL) [12]. Another SAS-based software package, PROC QTL Version 1.0, is
available at http://www.statgen.ucr.edu/software.html. To get more details on specific
software, please refer to the references given at
the end.
MAPMAKER/QTL is a widely used program for UNIX or DOS operating systems
and is the original QTL-mapping program
intended for distribution. It can perform
composite interval mapping, although the
documentation does not use that term; but it
cannot perform permutation tests. It requires
the companion program MAPMAKER/EXP
to format data and to calculate marker
maps.
QTL Cartographer is a suite of programs
for DOS, UNIX or Mac OS. They are designed
to be used in sequence, each accepting input
in the form of text files and storing its output
in text files for the next program. This suite
offers several variations of CIM with automatic selection of background loci. It also has
provision for estimating confidence intervals
by resampling. QTL Cartographer, MapQTL and PLABQTL are similar in many respects.
QTL Cartographer is distinguished by its menu-driven interface, its more detailed
documentation, its resampling methods and the lack of a licensing fee.
Map Manager QT is a program for Mac OS
distinguished by its graphical user interface
for data entry, editing, manipulation and
display. It is designed to be used either as a
mapping program itself or as a data-preparation program for other mapping programs.
QGene is a commercial program for Mac
OS whose strength is a variety of graphics for
displaying trait data and relationships among
marker genotypes and between traits and
marker genotypes. These functions make it
uniquely useful for rapid exploration of data.
However, it does not perform CIM.
MapQTL is a commercial program for
several operating systems that is distinguished
by its ability to map QTL in populations
derived from non-inbred parents, in which
both markers and QTL may have more than
two alleles. It also offers a nonparametric
form of single-locus association, the Kruskal-Wallis
rank sum test, appropriate for data with
distributions far from normal.
PLABQTL is a script-driven program for
DOS or AIX that is designed to analyse automatically a dataset at increasing levels of
complexity in successive runs. The final level
is capable of evaluating the effect of different
environments and the effect of interactions
between QTL and environmental effects.
MQTL is a program for DOS or Sun OS
that uses a simplified form of composite
interval mapping (sCIM) for mapping QTL
in large data sets derived from multiple environments. Like PLABQTL, it will estimate
environmental effects and QTL × environment
interactions.
Multimapper is a program for UNIX that
implements a Bayesian method for building
multi-QTL models automatically. Multimapper

is designed to map QTL within a single linkage group, and it produces a plot of QTL probability as a function of map distance. This type
of plot seems intuitively more interpretable than
the plot of the likelihood ratio statistic or LOD
score produced by other programs. However, it
seems to be most suited to the analysis of
single chromosomes for which other programs
have indicated the possibility of multiple QTL.
Multimapper is designed to work with QTL
Cartographer as a companion program.
The QTL Cafe is a program being developed in Java to make it available for multiple
computer platforms. It is currently available as
an applet that runs in a Java-enabled World
Wide Web browser.
Epistat is a program for DOS designed
primarily for the detection and analysis of
interactions between QTL. It does not perform
interval mapping and therefore does not require
mapped markers. It is an interactive program,
displaying graphic results in response to single-keystroke commands.
QTL IciMapping is an integrated software package for building genetic linkage maps and
mapping QTL. Its modules are designed to be user-friendly, and the software is
updated regularly.

Key References for QTL Mapping Software
1. Lander ES, Green P, Abrahamson J et al (1987) MAPMAKER: an interactive computer
package for constructing primary genetic linkage maps of experimental and natural
populations. Genomics 1(2):174–181
2. Manly KF, Cudmore RH Jr, Meer JM (2001) MapManager QTX, cross-platform
software for genetic mapping. Mammalian Genome 12(12):930–932
3. Seaton G, Haley CS, Knott SA, Kearsey M, Visscher PM (2002) QTL express: mapping
quantitative trait loci in simple and complex pedigrees. Bioinformatics 18(2):339–340
4. Van Ooijen JW (2004) MapQTL® 5, software for the mapping of quantitative trait
loci in experimental populations. Kyazma B.V., Wageningen
5. Jourjon M-F, Jasson S, Marcel J, Ngom B, Mangin B (2005) MCQTL: multi-allelic QTL
mapping in multi-cross design. Bioinformatics 21(1):128–130
6. Martinez V, Thorgaard G, Robison B, Sillanpaa MJ (2005) An application of
Bayesian QTL mapping to early development in double haploid lines of rainbow trout
including environmental effects. Genet Res 86(3):209–221
7. Veyrieras J-B, Goffinet B, Charcosset A (2007) MetaQTL: a package of new
computational methods for the meta-analysis of QTL mapping experiments. BMC
Bioinformatics 8, article 49
8. Wang S, Basten CJ, Zeng ZB (2007) Windows QTL Cartographer 2.5. Department
of Statistics, North Carolina State University, Raleigh, NC, USA.
http://statgen.ncsu.edu/qtlcart/WQTLCart.htm
9. Yang J, Hu C, Hu H et al (2008) QTL network: mapping and visualizing genetic
architecture of complex traits in experimental populations. Bioinformatics 24(5):721–723
10. Broman KW, Wu H, Sen S, Churchill GA (2003) R/qtl: QTL mapping in experimental
crosses. Bioinformatics 19(7):889–890
11. Yandell BS, Mehta T, Banerjee S et al (2007) R/qtlbim: QTL with Bayesian interval
mapping in experimental crosses. Bioinformatics 23(5):641–643
12. SAS Institute (2007) SAS OnlineDoc® 9.2. SAS Institute, Cary

Fig. 6.1 Principle of single-marker analysis. Trait mean values (y-axis; values shown: 550.5,
471.5 and 361.0) are plotted against the marker classes AA, Aa and aa (x-axis) and fitted with
the regression line $y = b_0 + b_1 x + e$, where $y$ is the phenotypic value of a line, $b_0$ is
the population mean, $b_1$ is the additive effect of the locus on the trait and $e$ is a residual
error term. $x$ is directly related to the genotypic code at the locus being tested for the line
considered: it is -1 (for female parent) or 1 (for donor or male parent)

Single-Marker Analysis (SMA)


Single-marker analysis (also called single-point analysis) is the simplest method for detecting
QTL associated with single markers. The statistical
methods used for single-marker analysis include
t-tests, analysis of variance (ANOVA) and linear
regression. Linear regression is most commonly
used because the coefficient of determination (r²)
of the regression on the marker estimates the proportion of phenotypic variation explained by
the QTL linked to the marker.
Typically, the null hypothesis tested is that the
mean of the trait value is independent of the genotype at a particular marker. The null hypothesis
is rejected when the test statistic is larger than a
critical value, and it is then declared that a QTL is
linked to the marker under investigation. The
t-test, ANOVA and simple linear regression
approaches are all equivalent when
their hypotheses test for differences in the
phenotypic means. In analysis of variance
(ANOVA, sometimes called marker regression)
at the marker loci, one splits the progenies into groups at each typed marker, according to
their genotypes at the marker, and compares the
phenotype distributions of the groups. For
example, in Fig. 6.1, we see that the individuals
with genotype aa for a marker have
significantly higher phenotype values than those
with genotype Aa and AA at that marker, indicating that the marker is linked to a QTL. In
contrast, when the phenotype distributions of the
genotypic classes are approximately the same,
it is decided that this marker does not appear to
be linked to a QTL.
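
The regression form of single-marker analysis can be sketched in a few lines of Python. This is a minimal illustration and not the procedure of any program named in this chapter: the function name, the eight phenotype values and the two-class coding (-1 and 1) are assumptions made for the example.

```python
import math

def single_marker_regression(codes, phenotypes):
    """Fit y = b0 + b1*x + e by least squares; x is the marker code (-1 or 1)."""
    n = len(phenotypes)
    mean_x = sum(codes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in codes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(codes, phenotypes))
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    b1 = sxy / sxx                 # estimated effect associated with the marker
    b0 = mean_y - b1 * mean_x      # intercept
    rss = syy - b1 * sxy           # residual sum of squares
    r2 = 1.0 - rss / syy           # proportion of phenotypic variation explained
    t_stat = b1 / math.sqrt(rss / (n - 2) / sxx)  # t statistic for H0: b1 = 0
    return b0, b1, r2, t_stat

# Hypothetical data: eight lines, four in each marker class.
codes = [-1, -1, -1, -1, 1, 1, 1, 1]
phenotypes = [350.0, 372.0, 365.0, 357.0, 540.0, 561.0, 548.0, 553.0]
b0, b1, r2, t_stat = single_marker_regression(codes, phenotypes)
print(f"b0={b0:.2f}  b1={b1:.2f}  r2={r2:.3f}  t={t_stat:.2f}")
```

With the -1/1 coding, b1 is half the difference between the two class means, and the t statistic is the one that would be compared with a critical value to decide whether the marker is declared linked to a QTL.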
The results from single-marker analysis are
usually presented in a table, which indicates the
chromosome (if known) or linkage group containing the markers, probability values and the
percentage of phenotypic variation explained by
the QTL (denoted r²). Sometimes, the allele size
of the marker is also reported. QTL Cartographer,
QGene and MapManager QTX are commonly
used computer programs to perform single-marker analysis. Other common statistical software
such as SAS, IRRISTAT or even Microsoft
Office Excel can also be employed for single-marker
analysis.
The chief advantage of analysis of variance at
the marker loci is its simplicity: it can be performed with basic statistical software programs.
In addition, a genetic map for the markers is not
required, and the method may be easily extended
to account for multiple loci. A further advantage
is the easy inclusion of covariates, such as sex,
treatment or an environment effect. However, the
major disadvantage of this method is that the
farther a QTL is from a marker, the less likely it
is to be detected. This is because recombination
may occur between the marker and the QTL, which
also causes the magnitude of the effect of a QTL to be
underestimated. The use of a large number of
segregating DNA markers covering the entire
genome (usually at intervals of less than 15 cM)
may minimise both problems. Regression on
marker genotypes gives a great deal of information about marker-trait associations, but there are
some problems with this approach: (1) The
approach only considers the marker positions and
has less power to detect a QTL lying between the markers. (2) We cannot estimate the QTL effect and
the recombination frequency separately. (3) There
is a large amount of variation within each marker
class, and some of this will be due to other QTL
affecting the trait; we need to take this into
account for a more accurate test for the presence
of a QTL. Further, we must discard individuals
whose genotypes are missing at the marker, since
inclusion of such lines may produce a biased or
overestimated effect. Despite these problems, regression on marker genotypes is a good
start in QTL analysis. It identifies associations
without knowing the position of the marker on
the map, and it may be adapted for use in any
type of population.

Interval Mapping
Lander and Botstein in 1989 developed simple
interval mapping (SIM), which overcomes the
disadvantages of analysis of variance at marker
loci. SIM is currently the most popular approach
for QTL mapping in experimental crosses. This
method makes use of linkage maps and analyses
intervals between adjacent pairs of linked markers along chromosomes simultaneously, instead
of analysing single markers. The use of linked
markers for analysis compensates for recombination between the markers and the QTL and is
considered statistically more powerful compared
to single-point analysis. The intervals that are
defined by ordered pairs of markers are searched
in increments (e.g. 2 cM), and statistical methods
are used to test whether a QTL is likely to be
present at the location within the interval or not.
It is important to realise that interval mapping
statistically tests for a single QTL at each increment across the ordered markers in the genome.
Interval mapping searches through the ordered
genetic markers in a systematic, linear (also
referred to as one-dimensional) fashion, testing
the same null hypothesis at each increment.

Interval mapping methods produce a profile of
the likely sites for a QTL between adjacent linked
markers. In other words, QTL are located with
respect to a linkage map. Given the marker genotype data (and assuming that the recombination
process in meiosis exhibits no interference), one
may calculate the probability that an individual
has genotype AA (or Aa or aa) at a putative QTL.
In interval mapping, we obtain maximum likelihood estimates of the parameters of the model
(the mean phenotype of each QTL genotype class and the residual variance), defined to
be the values for which the likelihood of the observed data achieves
its maximum. The results of the test statistic for
SIM (as well as for composite interval mapping
(CIM), which will be discussed subsequently) are
typically presented using a logarithm of odds
(LOD) score or likelihood ratio statistic (LRS).
There is a direct one-to-one transformation
between LOD scores and LRS scores (the conversion is LRS = 2 ln(10) × LOD ≈ 4.6 × LOD).
These LOD or LRS profiles are used to identify
the most likely position for a QTL in relation to
the linkage map, which is the position where the
highest LOD value is obtained. A typical output
from interval mapping is a graph with the markers
comprising the linkage groups on the x-axis and the
test statistic (LOD) on the y-axis (Fig. 6.2). The
peak or maximum must also exceed a specified
significance level in order for the QTL to be
declared as real (i.e. statistically significant).
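
The probability of the QTL genotype given the flanking markers, and the conversion between LOD and LRS, can be illustrated with a short Python sketch. It assumes a backcross (each locus is either AA, coded 0, or Aa, coded 1), no interference (Haldane map function) and hypothetical distances; the function names are invented for this example and are not taken from any mapping program.

```python
import math

def haldane_r(distance_cm):
    """Recombination fraction for a map distance in cM (no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))

def prob_qtl_het(left, right, d_left_cm, d_right_cm):
    """P(QTL genotype is Aa | flanking marker genotypes) in a backcross.

    left and right are the flanking marker genotypes (0 = AA, 1 = Aa);
    d_left_cm and d_right_cm are the distances from the putative QTL
    to the left and right marker.
    """
    r1 = haldane_r(d_left_cm)
    r2 = haldane_r(d_right_cm)

    def weight(qtl):
        # Probability of this QTL genotype together with both marker genotypes,
        # up to a constant that cancels in the ratio below.
        w_left = 1.0 - r1 if qtl == left else r1
        w_right = 1.0 - r2 if qtl == right else r2
        return w_left * w_right

    return weight(1) / (weight(0) + weight(1))

def lod_to_lrs(lod):
    """LRS = 2 ln(10) * LOD, i.e. about 4.6 * LOD."""
    return 2.0 * math.log(10.0) * lod

# Putative QTL 7 cM from the left marker and 8 cM from the right marker.
for left, right in [(1, 1), (1, 0), (0, 1), (0, 0)]:
    print(left, right, round(prob_qtl_het(left, right, 7.0, 8.0), 4))
print("LRS for LOD 3:", round(lod_to_lrs(3.0), 2))
```

In an interval mapping scan these probabilities would be recomputed at every increment along the interval and used as weights in the likelihood; that step is not shown here.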
Fig. 6.2 Principle of interval mapping by maximum likelihood method. The LOD score (y-axis) is
plotted against locus position (x-axis) along markers F, G, H and I. The LOD threshold (the LOD
level at which a QTL effect occurs by chance; usually fixed at 3.0) is marked, and the maximum
likelihood position of the QTL lies between loci G and H

Figure 6.2 displays the LOD (logarithm of
the odds favouring linkage, a score that measures the strength of evidence for the presence of
a QTL) curve for a chromosome or linkage group.
The LOD curve achieves its maximum at position 32 cM (between markers G and H), indicating
the presence of a QTL at this position. When confronted with an LOD curve
(or with 19 or 20 such curves, one for each chromosome), a question may arise: Is an observed
peak actually a QTL? The LOD score indicates the strength
of evidence for the presence of a QTL, with larger
LODs corresponding to greater evidence. The
question is, how large is large? The standard
approach to answering this question has been to
formulate the problem as one of hypothesis testing. Consider the null hypothesis that there are no
QTL segregating in the mapping population. We
determine the distribution of the LOD score in
this situation. The probability of obtaining an
LOD score as large as or larger than the one
observed, if there were no QTL, is called the P
value. Large LOD scores give small P values;
very small P values indicate that either the null
hypothesis is false (there really is a QTL) or a
very rare event has occurred. When one performs a
genome scan to identify QTL, one examines the
LOD score at 100 or more marker loci (in fact,
during interval mapping, at all locations between
markers). Thus, the null distribution of the LOD
score at a single location is not appropriate for
forming an overall threshold. Some adjustment
must be made for our examination of multiple
putative QTL locations over the whole genome.
Lander and Botstein (1989) performed extensive
computer simulations to estimate the appropriate
LOD threshold for various genome sizes and
marker densities and gave analytical calculations
for the case of a very dense marker map. These
guidelines (e.g. fixing a minimum LOD threshold
of 3.0) should suffice for most uses.
Alternatively, the determination of significance
thresholds is most commonly performed using
permutation tests (discussed below in detail).
Briefly, the phenotypic values of the population
are shuffled whilst the marker genotypic values
are held constant (i.e. all marker-trait associations are broken), and QTL analysis is performed
to assess the level of false positive marker-trait
associations. This process is then repeated (e.g.
1,000 times), and significance levels can then be
determined based on the level of false positive
marker-trait associations. The observed LOD
score (with the phenotypes in the correct order) is
compared to the 1,000 LOD scores obtained from
permuted versions of the data. The proportion of
these 1,000 LOD scores that exceed the actual,
observed LOD score is reported as an approximate P value. This provides a customised
threshold tailor-made for the individual experiment.
Before permutation tests were widely accepted as
an appropriate method to determine significance
thresholds, an LOD score of between 2.0 and 3.0
(most commonly 3.0) was usually chosen as the
significance threshold, as stated above. An LOD
score of 3 indicates that the observed data are
1,000 times more likely if there is a QTL at the
specified position than
if there is no QTL.
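
The shuffling procedure can be sketched as follows. This is an illustrative Python example under stated assumptions: the LOD score is taken from a single-marker regression at each marker (LOD = -(n/2) log10(1 - r²)) rather than from interval mapping, the genome-wide maximum LOD is recorded for every permutation, and the data, effect size, seed and function names are all hypothetical.

```python
import math
import random

def marker_lod(codes, phenotypes):
    """LOD score of a single-marker regression: -(n/2) * log10(1 - r2)."""
    n = len(phenotypes)
    mean_x = sum(codes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in codes)
    syy = sum((y - mean_y) ** 2 for y in phenotypes)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(codes, phenotypes))
    r2 = min(sxy * sxy / (sxx * syy), 1.0 - 1e-12)
    return -(n / 2.0) * math.log10(1.0 - r2)

def genome_max_lod(markers, phenotypes):
    """Largest LOD score over all markers (one list of codes per marker)."""
    return max(marker_lod(codes, phenotypes) for codes in markers)

def permutation_test(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Return the observed maximum LOD, the permutation threshold and a P value."""
    rng = random.Random(seed)
    observed = genome_max_lod(markers, phenotypes)
    shuffled = list(phenotypes)
    null_lods = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break every marker-trait association
        null_lods.append(genome_max_lod(markers, shuffled))
    null_lods.sort()
    threshold = null_lods[min(n_perm - 1, math.ceil((1.0 - alpha) * n_perm) - 1)]
    # ">=" counts ties against the observed score, which is slightly conservative.
    p_value = sum(lod >= observed for lod in null_lods) / n_perm
    return observed, threshold, p_value

# Simulated backcross-like data: 80 lines, 10 unlinked markers, QTL at marker 3.
rng = random.Random(42)
n_lines, n_markers = 80, 10
markers = [[rng.choice([-1, 1]) for _ in range(n_lines)] for _ in range(n_markers)]
phenotypes = [10.0 + 1.0 * markers[3][i] + rng.gauss(0.0, 1.0) for i in range(n_lines)]
observed, threshold, p_value = permutation_test(markers, phenotypes, n_perm=500)
print(f"observed LOD={observed:.2f}  threshold={threshold:.2f}  P={p_value:.3f}")
```

Only 500 permutations are run here to keep the example quick; the text suggests, for example, 1,000.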
Many researchers have used MapMaker/QTL,
QTL Cartographer and QGene to conduct SIM.
The most common way of reporting QTL is by
indicating the most closely linked markers in a
table and/or by marking the QTL as bars (or oval shapes or arrows) on
linkage maps (shown as bars in Fig. 6.3).
The chromosomal regions represented by rectangles are usually the regions that exceed the
significance threshold (Fig. 6.2). Usually, a pair
of markers (the most tightly linked markers on
each side of a QTL) is also reported in a table;
these markers are known as flanking markers.
The reason for reporting flanking markers is that
selection based on two markers should be more
reliable than selection based on a single marker.

Fig. 6.3 Presentation of hypothetical QTL for plant height and internode length in a linkage
map. Numbers above the vertical bars represent chromosome numbers. Numbers to the left of each
vertical bar represent the genetic distance between the markers in cM. Horizontal bars and
letters denote markers on the linkage map. The legend distinguishes plant height QTL from
internode length QTL

Again, the reason for the increased reliability is
that the chance of recombination occurring on both sides of the QTL (between each flanking
marker and the QTL) is much lower than
the chance of recombination between a single
marker and the QTL.
It should also be noted that QTL can only be
detected for traits that differ between the parents used to construct the mapping population
and hence segregate in it.
Therefore, in order to maximise the data obtained
from a QTL-mapping study, several criteria may
be used for phenotypic evaluation of a single trait
(for instance, rice yield can be evaluated based on
the number of panicles, the number of spikelets per panicle,
1,000-grain weight, etc.). QTL that are detected
in common regions (based on different criteria
for a single trait) are likely to be important QTL
for controlling the trait. Mapping populations
may also be constructed from parents that
differ for multiple traits. This is advantageous
because QTL controlling the different traits can
be located on a single map. However, for many
parental genotypes used to construct mapping
populations, this is not always possible, because
the parents may differ for only one trait of
interest. Furthermore, the same set of lines of the
mapping population used for phenotypic evaluation must be available for marker genotyping and
subsequent QTL analysis, which may be difficult
with completely or semi-destructive bioassays
(e.g. screening for resistance to necrotrophic fungal pathogens).
In general terms, the identified QTL may also
be described as major or minor. This definition
is based on the proportion of the phenotypic variation explained by a QTL (based on the r² value):
major QTL account for a relatively large
amount (e.g. >10%), and minor QTL usually
account for <10%. Sometimes, major QTL may
refer to QTL that are stable across environments,
whereas minor QTL may refer to QTL that are
environmentally sensitive, especially for QTL
that are associated with disease resistance or
drought tolerance. In more formal terms, QTL
are classified as (1) suggestive, (2) significant
and (3) highly significant. This classification was
mainly proposed to avoid large numbers of false
positive claims and also to ensure that real linkage
was not missed. Significant and highly significant
QTL were given significance levels of 5 and
0.1%, respectively, whereas a suggestive QTL is
one that would be expected to occur once at
random in a QTL-mapping study (in other words,
there is a warning regarding the reliability of
suggestive QTL). The mapping program
MapManager QTX reports QTL mapping results
with this classification.
Although the most likely position of a QTL is
the map position at which the highest LOD or LRS
score is detected, QTL actually occur within
confidence intervals (see below). There are several
ways in which confidence intervals can be calculated. The simplest is the one-LOD support
interval, which is determined by finding the region
on both sides of a QTL peak that corresponds to a
decrease of one LOD score as performed by
Mapmaker/QTL. Bootstrapping, a statistical method
for resampling, is another method to determine the
confidence interval of QTL and can be easily applied
within some mapping software programs such as
MapManager QTX.
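
As a rough illustration of the one-LOD support interval, the Python sketch below takes an LOD profile evaluated on a grid of map positions and returns the outermost contiguous grid points around the peak whose LOD is within one unit of the maximum. The profile values are hypothetical, and the function is an assumption made for this example; it does not reproduce how Mapmaker/QTL or any other program reports the interval (for instance, no interpolation between grid points is attempted).

```python
def one_lod_support_interval(positions, lods, drop=1.0):
    """Return (left, peak, right) map positions of the support interval."""
    peak = max(range(len(lods)), key=lambda i: lods[i])
    cutoff = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1      # extend leftwards while the LOD stays within the drop
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1     # extend rightwards in the same way
    return positions[left], positions[peak], positions[right]

# Hypothetical LOD profile scanned in 2 cM increments.
positions = [20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40]
lods = [1.2, 1.9, 2.6, 3.1, 3.8, 4.3, 4.6, 4.1, 3.5, 2.7, 1.8]
print(one_lod_support_interval(positions, lods))  # (28, 32, 34)
```

Here the peak is at 32 cM (LOD 4.6), and the grid points with LOD of at least 3.6 span 28 to 34 cM. A bootstrap interval would instead resample the individuals with replacement, repeat the scan and summarise the distribution of peak positions.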
All linkage maps are unique and are a product
of the mapping population (derived from two
specific parents) and the types of markers used.
Even if the same set of markers is used to construct
linkage maps, there is no guarantee that all of the
markers will be polymorphic between different
populations. Therefore, in order to correlate information from one map to another, common markers
are required. Common markers that are highly
polymorphic in mapping populations are called
anchor markers (also core markers). Anchor markers
are typically SSRs or RFLPs (refer to chapter 3 for
details). Specific groups of anchor markers that
are located in close proximity to each other in
specific genomic regions are generally referred
to as bins. Bins are used to integrate maps and
are defined as 10–20 cM regions along chromosomes; the boundaries of each are defined by
a set of anchor markers. If common anchor
markers have been incorporated into different
maps, the maps can be aligned together to produce
consensus maps. Consensus maps are produced
by combining or merging different maps, constructed from different genotypes (see
chapter 4). Such consensus maps can be extremely
useful for efficiently constructing new maps (with
evenly spaced markers) or for targeted (or localised)
mapping. For example, a consensus map can
indicate which markers are located in a specific
region containing a QTL and thus be used to
identify more tightly linked markers. The study
of similarities and differences of markers and
genes within and between species, genera or
higher taxonomic divisions is referred to as
comparative mapping (refer to chapter 7). It involves
analysing the extent to which the order in which markers occur
is conserved between maps
(i.e. collinear markers); conserved marker order
is referred to as synteny. Comparative mapping
may assist in the construction of new linkage
maps (or localised maps of specific genomic
regions) and in predicting the locations of QTL in
different mapping populations.
Interval mapping has several advantages over
analysis of variance at the marker loci. First, it
provides a curve, which indicates the evidence
for QTL location. Second, it allows for the inference of QTL to positions between markers. Third,
it provides improved estimates of QTL effects
(the apparent effect at a marker locus is decreased
as a result of recombination between the marker
and the QTL). Fourth, and perhaps most important, appropriately performed interval mapping
makes proper allowance for incomplete marker
genotype data. In the calculation of an individual's QTL genotype probabilities, conditional on
its marker genotype data, one considers the
closest flanking typed markers for that individual.
If an individual is missing the marker genotype
for a flanking marker, one moves to the next
flanking marker for which genotype data are
available. Allowance may even be made for the
presence of genotyping errors.
On the other hand, although interval mapping
is certainly more powerful than single-marker
approaches for detecting QTL, it is limited both by
the model, which defines it as a single-QTL method,
and by the one-dimensional search, which does not
allow interactions between multiple QTL to be
considered. An additional disadvantage of interval
mapping, in comparison to analysis of variance,
is that it requires some increase in computation
time and the use of specially designed software.
An important, yet often ignored, issue in QTL
mapping concerns selection bias in the apparent
(estimated) effects of QTL. Such estimated
effects are often too large. Consider a single QTL
with an effect of moderate size, and imagine there
is a marker very near the QTL. In a particular
experiment, the estimated effect of the QTL will
be somewhat different from its true effect: the
observed difference between the phenotype averages for the two QTL genotype groups will not
be the same as the true difference. Nevertheless,
to produce an LOD score sufficiently large to
declare the presence of a QTL, the estimated
effect must be large. This introduces bias in the
estimated effect. (Bias is also introduced in the
maximisation over possible QTL locations; the
inferred location for a QTL is the one that gives
the largest estimated QTL effect.) Because this
bias is the result of the selection of only those loci
for which there is sufficient evidence for the presence of a QTL, it is called selection bias. The
power to detect QTL with a larger effect is higher,
and the bias in their estimated effects will be
lower but may still be substantial. QTL with very
large effects are always detected, and so the bias in
their estimated effects will be minimal.
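
Selection bias is easy to demonstrate by simulation. The Python sketch below is only an illustration under assumed values: a single marker lying exactly at a QTL of moderate effect, two genotype groups of 50 lines each, and a hypothetical critical t value of 3.3 standing in for the significance threshold. It repeats the experiment many times and averages the estimated effect over only those replicates in which the QTL was declared.

```python
import math
import random

def simulate_selection_bias(true_effect=0.3, n=100, sd=1.0, t_crit=3.3,
                            n_rep=2000, seed=0):
    """Return (power, mean estimated effect among detected replicates)."""
    rng = random.Random(seed)
    half = n // 2
    detected = []
    for _ in range(n_rep):
        group0 = [rng.gauss(0.0, sd) for _ in range(half)]
        group1 = [rng.gauss(true_effect, sd) for _ in range(half)]
        mean0 = sum(group0) / half
        mean1 = sum(group1) / half
        ss = (sum((v - mean0) ** 2 for v in group0)
              + sum((v - mean1) ** 2 for v in group1))
        pooled_var = ss / (2 * half - 2)
        se = math.sqrt(pooled_var * 2.0 / half)   # standard error of the difference
        estimate = mean1 - mean0                  # estimated QTL effect
        if abs(estimate / se) > t_crit:           # QTL declared in this replicate
            detected.append(estimate)
    power = len(detected) / n_rep
    mean_detected = sum(detected) / len(detected) if detected else float("nan")
    return power, mean_detected

power, mean_detected = simulate_selection_bias()
print(f"power={power:.3f}  true effect=0.30  "
      f"mean estimate when detected={mean_detected:.2f}")
```

Because a replicate is counted only when its estimate is large enough to pass the threshold, the average over detected replicates overstates the true effect; rerunning with a larger `true_effect` shows higher power and a smaller relative bias.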

Multiple QTL and Methods to Detect Multiple QTL
Interval mapping assumes the presence of a single QTL. One may use interval mapping to identify multiple QTL, especially when they are on
separate chromosomes, but there are several
advantages to using methods that model multiple
QTL simultaneously. First, by controlling for the
presence of a QTL, one may reduce the residual
variation and obtain greater power to detect additional QTL. Second, one may better separate
linked QTL. Third, the identification of interactions between QTL (called epistasis; see below)
requires the joint modelling of multiple QTL.
Thus, it is important to have a description of the
major statistical approaches for QTL mapping
that make use of multiple QTL models. The
simplest such method is multiple regression. The
aim is principally to frame the problem as one of
model selection and to describe the key issues in
model selection (the most important of which is
the choice of criteria for comparing models).
While this simple approach should be more
widely used, it shares many of the disadvantages
of analysis of variance at marker loci; most
importantly, it requires complete marker genotype data. The simplest multiple QTL method
that makes allowance for missing genotype data
is the use of forward selection in interval mapping. An approach that has received much
attention and has been widely applied in practice is
composite interval mapping (CIM; discussed
below). In this method, one performs interval
mapping using a subset of marker loci as covariates. These markers serve as proxies for other
QTL to increase the resolution of interval mapping by accounting for linked QTL and reducing
the residual variation. The key problem with CIM
concerns the choice of suitable marker loci to
serve as covariates. Yet another interesting development is multiple interval mapping (MIM).
MIM is the extension of interval mapping to
multiple QTL, just as multiple regression
extends analysis of variance. MIM allows one to
infer the location of QTL to positions between
markers, makes proper allowance for missing
genotype data and can allow interactions between
QTL. This is not the final solution to the QTL-mapping problem; one is still confronted with
comparing models and searching through models.
Statistical researchers have much work to do in
this area. The above descriptions cover the major
approaches to QTL mapping in experimental
crosses. Several other approaches are available,
including Bayesian methods and the use of a
genetic algorithm. These new methods may
become important in the future but are beyond
the scope of this elementary description of statistical methods for QTL mapping, and hence the
readers are requested to refer to the further readings
for more details. However, the basic principles
and methods used in CIM and MIM are discussed
hereunder.

Composite Interval Mapping


Of late, composite interval mapping (CIM) is
becoming popular for mapping QTL. The main
advantage of CIM is that it is more precise and
effective at mapping QTL compared to single-point analysis and interval mapping, especially
when linked QTL are involved. CIM combines
the approaches of interval mapping and single-marker analysis in a multiple regression
framework. The motivation for CIM was that the error
term ($e$) in the SIM model ($y = b_0 + b_1 x + e$) is
composed in part of true experimental error but
also in part of variation due to QTL at other loci
(or genetic background segregation). Some of the
variability among lines that share a common QTL
genotype at QTL locus 1 is due to the fact that
they can have different genotypes at QTL locus 2
somewhere else in the genome. The CIM approach
begins by first conducting single-marker analysis
and then building multiple-marker models using
typical regression model building methods
(forward or stepwise regression). Forward regression operates by first selecting the marker with
the highest statistical significance (highest LRT
or LOD score). Next, the second most significant
marker is added to make a 2-locus model. The
two markers are re-evaluated for significance, and
if they both remain significant in the model, then
the procedure continues by adding the third most individually
significant marker and so on. At any step, if a
marker is no longer significant, it can be dropped
from the model. In this way, a model is built that includes
the most important markers, all of which remain
significant when fitted simultaneously.
In CIM terminology, these markers are called
cofactors. Once the model containing the cofactors is built, we then rescan the entire genome
using interval mapping with the cofactors included in the model. Many researchers have
used QTL Cartographer, MapManager QTX and
PLABQTL to perform CIM.
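
The cofactor-building step can be sketched in Python as a plain forward selection. This is a simplified illustration rather than the algorithm of any of the programs named above: at each step it adds the marker with the largest F statistic given the markers already in the model (instead of ranking markers by their individual significance), it never drops a marker once entered, and the entry threshold, simulated data and function names are assumptions made for the example. Sequential orthogonalisation is used so that no matrix library is needed.

```python
import random

def centred(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def remove_projection(v, q):
    """Return v minus its least-squares projection on q."""
    coef = dot(v, q) / dot(q, q)
    return [a - coef * b for a, b in zip(v, q)]

def forward_select_cofactors(markers, phenotypes, f_to_enter=15.0, max_cofactors=5):
    """Forward selection of marker cofactors; returns marker indices in entry order."""
    n = len(phenotypes)
    resid_y = centred(phenotypes)
    resid_x = {j: centred(codes) for j, codes in enumerate(markers)}
    selected = []
    while resid_x and len(selected) < max_cofactors:
        rss = dot(resid_y, resid_y)
        df = n - len(selected) - 2       # residual df after adding one more marker
        best_j, best_f = None, 0.0
        for j, xj in resid_x.items():
            sxx = dot(xj, xj)
            if sxx < 1e-9:               # marker fully explained by chosen cofactors
                continue
            explained = dot(xj, resid_y) ** 2 / sxx
            f_stat = explained / (max(rss - explained, 1e-12) / df)
            if f_stat > best_f:
                best_j, best_f = j, f_stat
        if best_j is None or best_f < f_to_enter:
            break
        q = resid_x.pop(best_j)
        selected.append(best_j)
        # Adjust the phenotype and the remaining markers for the new cofactor.
        resid_y = remove_projection(resid_y, q)
        resid_x = {j: remove_projection(xj, q) for j, xj in resid_x.items()}
    return selected

# Simulated data: 200 lines, 10 unlinked markers, QTL at markers 2 and 7.
rng = random.Random(7)
n_lines, n_markers = 200, 10
markers = [[rng.choice([-1, 1]) for _ in range(n_lines)] for _ in range(n_markers)]
phenotypes = [1.0 * markers[2][i] - 0.8 * markers[7][i] + rng.gauss(0.0, 0.5)
              for i in range(n_lines)]
cofactors = forward_select_cofactors(markers, phenotypes)
print("selected cofactors:", cofactors)
```

In a full CIM analysis the selected markers would then be included as covariates while the genome is rescanned by interval mapping; that rescan is not shown here.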

Multiple Trait Mapping


Multiple traits that are correlated with each other
can add further information about the investigated
traits. To some extent,
two measurements on correlated traits are
like repeated measurements. Therefore, information from correlated traits can reduce the effect of
error variance, making it easier (more powerful)
to detect QTL. Not only is the power of QTL detection
increased, but the precision of the QTL map
position is also better, and models regarding
the genetic correlation between two traits can be tested. Jiang
and Zeng in 1995 proposed a multiple-trait
version of composite interval mapping.
Their method is based on maximum likelihood
and requires special programs for analysis. It is
postulated that a considerable increase in the power of
QTL detection can be expected when using information from two correlated traits.

Testing for Linked QTL Versus Pleiotropic QTL
In single-trait analysis, when two QTL are
found in the same region (i.e. a single genomic
region is linked to two different traits), the question
arises whether the same gene affects both traits
or whether these are two separate QTL.
Unravelling this difference allows a better understanding of the nature of the genetic
correlation between two
traits. It indicates whether
an unfavourable genetic
correlation between two characters can be broken (in the case of
linkage) or whether this is impossible (in the case of
pleiotropism, which refers to the same gene(s) being involved
in the expression of several traits).
The test can be carried out with
H0: position 1 = position 2
H1: position 1 ≠ position 2
Other genetic models could also be compared
and tested (depending on the design), such as (1)
the existence of epistasis and (2) a QTL affecting one
trait only versus affecting both traits. Maximum
likelihood might be a bit laborious for multiple-trait
analyses, especially when comparing a range of
genetic models. Moser in 1998 proposed a
multiple-trait regression approach and showed again
that regression is very similar to maximum likelihood methods (at least in designed experiments).

Multiple Interval Mapping (MIM) or Multiple QTL Mapping
As stated earlier, both SIM and CIM were designed to detect a single
QTL at a time, based on a statistical test of whether a candidate
position for a QTL has a significant effect. The test is applied to each
position in the genome in turn, which creates a genome scan for QTL.
Though intuitive and widely used, these methods are still insufficient
to study the genetic architecture of complex quantitative traits that
are affected by multiple QTL. When a trait is affected by multiple loci,
it is statistically more efficient to search for those QTL together.
Also, in order to study epistasis among QTL, multiple QTL need to be
analysed together (Box 6.2). In this setting, QTL analysis is basically
a model-selection problem.
Multiple interval mapping (MIM) is designed to analyse multiple QTL,
including their epistasis, together through a model-selection procedure
that searches for the best genetic model for the quantitative trait.
As shown by Kao et al. in 1999, given a genetic model (number, location
and interaction of multiple QTL), the corresponding linear model leads
to a likelihood function similar to that in SIM but more complex. An
expectation/maximisation algorithm can be used to maximise the
likelihood and obtain maximum likelihood estimates of the parameters.
The following model-selection method is used to traverse the genetic
model space during MIM analysis in QTL Cartographer (refer Box 6.2):
1. Forward selection of QTL main effects,
sequentially: In each cycle of selection, pick
the best position of an additional QTL, and
then perform a likelihood ratio test for its main
effect. If a test statistic exceeds the critical
value, this effect is retained in the model. Stop
when no more QTL can be found.
2. Search for epistatic effects between QTL main
effects included in the model, and perform
likelihood ratio tests on them: If a test statistic
exceeds the critical value, the epistatic effect
is retained in the model.
Repeat the process until no more significant
epistatic effects can be found.
3. Re-evaluate the significance of each QTL's
main effect in the model: If the test statistic for
a QTL falls below the significance threshold
conditional on other retained effects, this QTL
is removed from the model. However, if a QTL
is involved in a significant epistatic effect with
other QTL, it is not subject to this backward
elimination process. This process is performed
stepwise until no effects can be dropped.

4. Optimise estimates of QTL positions based on
the currently selected model: Instead of performing a multidimensional search around the
regions of current estimates of QTL positions,
estimates of QTL positions are updated in turn
for each region. For the rth QTL in the model,
the region between its two neighbour QTL is
scanned to find the position that maximises
the likelihood (conditional on the current estimates of positions of other QTL and QTL
epistasis). This refinement process is repeated
sequentially for each QTL position until there
is no change in the estimates of QTL positions.
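The forward and backward steps of the procedure above can be illustrated with a much simpler stand-in. The following Python sketch is not the MIM algorithm of Kao et al. and not the QTL Cartographer implementation: it assumes that candidate QTL positions coincide with markers coded -1/+1, uses ordinary least squares in place of the EM-based likelihood, and omits the epistasis search (step 2) and the position refinement (step 4). The function names, the toy data and the use of the LRT value 11.5 as the critical value are assumptions made for illustration only.

```python
import math
from itertools import product


def rss(y, columns):
    """Residual sum of squares of y regressed on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                factor = A[r][k] / A[k][k]
                A[r] = [a - factor * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))


def lrt(y, reduced, full):
    """LRT for nested normal linear models: n * ln(RSS_reduced / RSS_full)."""
    return len(y) * math.log(rss(y, reduced) / rss(y, full))


def select_markers(y, markers, critical=11.5):
    """Forward selection, then backward elimination, of marker main effects."""
    chosen = []
    while len(chosen) < len(markers):          # forward selection (step 1)
        base = [markers[m] for m in chosen]
        scores = {m: lrt(y, base, base + [col])
                  for m, col in markers.items() if m not in chosen}
        best = max(scores, key=scores.get)
        if scores[best] < critical:
            break
        chosen.append(best)
    dropped = True
    while dropped:                             # backward elimination (step 3)
        dropped = False
        for m in list(chosen):
            others = [markers[c] for c in chosen if c != m]
            if lrt(y, others, others + [markers[m]]) < critical:
                chosen.remove(m)
                dropped = True
                break
    return chosen


# Toy data (hypothetical): 16 lines in a balanced design over four unlinked markers.
design = list(product([-1, 1], repeat=4))
markers = {f"M{j + 1}": [row[j] for row in design] for j in range(4)}
# Simulated trait: effects at M1 and M3 plus a small residual pattern.
y = [50 + 3 * r[0] - 2 * r[2] + 0.5 * r[0] * r[1] * r[2] * r[3] for r in design]
print(select_markers(y, markers))
```

With these toy data, only the two markers given a simulated effect (M1 and M3) pass the critical value. A real MIM analysis would in addition evaluate positions between markers and test pairwise epistatic terms among the retained QTL.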
Thus, model selection entails four distinct
steps: (1) Select a class of models (e.g. additive
models or models including pairwise interactions
between QTL), (2) search through the space of
models (there may be more possible models than
may be inspected individually), (3) compare
models and (4) assess the performance of a
model-selection procedure. If one allows only
three or fewer QTL, one may perform a simultaneous search to consider each such model. But if
one wishes to consider the possibility of many
more QTL, it will be impossible to inspect each
possible model individually, so one must form
some procedure for searching through this space
of models to pick out the best ones. Finally, it is
important to consider how one may assess the
performance of a model-selection procedure.
Decisions should be guided by the aims of the
study. In a study seeking to use marker-assisted
selection to improve an agricultural product, one
may be willing to allow a few extraneous loci in
an effort to identify a reasonably large number of
QTL. A scientist wishing to perform positional
cloning of a genomic region (refer chapter 7) may
be satisfied only with a small number of strongly
supported QTL; this avoids wasting expensive
and time-consuming efforts on extraneous loci.
These sorts of aims should guide the researcher
in framing the desired performance characteristics for a procedure, which may then be used in
choosing an appropriate mapping method. One
will need to rely on experience, educated guesses
and large computer simulation studies because,
unfortunately, the appropriate mapping method
will vary with the context.


Box 6.2 How to Analyse QTL Using QTL Cartographer

Windows QTL Cartographer (available at no cost at
http://statgen.ncsu.edu/qtlcart/WQTLCart.htm) maps QTL in cross
populations derived from inbred lines. It includes a powerful graphic
tool for presenting and summarising mapping results and can import and
export data in a variety of formats. It provides single-marker analysis,
interval mapping, composite interval mapping, Bayesian interval mapping
and multiple interval mapping. There are two stages in making a QTL map
for a particular trait (once you've scored hundreds of marker loci in
hundreds of F2 or backcross progeny):
1. Construct a genetic map of your markers.
2. Feed the genetic map, marker data and phenotype data into QTL
Cartographer and run the analysis.
Constructing the genetic map is dealt with in detail in chapter 4. This
box focuses on the second step.

Preparation of Data Files


Four different data files are to be prepared to
use in QTL Cartographer. They are (1) data
file of chromosome label and marker number,
(2) data file of marker label, (3) data file of
marker position and (4) data file of genotype
and phenotype. These data files should be prepared as explained below.

Preparation of Data File of Chromosome Label and Marker Number
This data file is prepared in Microsoft Excel in the following format:
C2     4
C3     3
C4a    2
C4b    3

Suppose that there are four chromosomes and that four linkage groups
were generated during map construction. On analysing those linkage
groups, however, it is found that chromosome 1 has no linkage group and
chromosome 4 has two. Label the chromosomes or linkage groups as C1 to
C4 (C denotes chromosome; the number 1–4 identifies the respective
linkage group, and a letter suffix distinguishes different linkage
groups that belong to the same chromosome). The chromosome labels are
entered in the first column as above. No linkage group was found for
chromosome 1, and hence it should not be mentioned in this column. Note
that chromosome 4 has two linkage groups, and hence they are entered as
separate linkage groups in the above data file. The number of markers
present in each linkage group is entered in the second column, next to
the corresponding chromosome label (remember that this is not the
number of markers used in the linkage mapping analysis but the number
of markers that remain linked at the end of that analysis). This data
file is saved in Text (Tab delimited) type with a suitable name (e.g.
file1.txt).

Preparation of Data File of Marker Label
The data file is prepared in Microsoft Excel in the following format:
NAU1246  NAU3684  NAU3875  BNL3971  NAU3083  NAU3172  NAU3839

Marker labels are entered in the first row. The markers are listed in
map order: from the first marker of chromosome 2 (position 0 cM) to its
last marker, followed by the first marker of chromosome 3 (position
0 cM) to its last marker. The same order is followed for all the
chromosomes entered in the first data file (in this example, up to the
last marker of chromosome C4b). The data file is saved in Text (Tab
delimited) type with a suitable name (e.g. file2.txt).

Preparation of Data File of Marker Position
The data file is again prepared in Microsoft Excel as below.
0
23.3
67.5
83.3

0
9
51.9

0

The marker positions (in cM) are entered in the first column as above.
Marker positions start from 0 for the first marker of the first
chromosome (in this case chromosome 2). After typing the position of
the last marker of a given chromosome, the next row is left blank
before continuing to the next chromosome (see above). The data file is
saved in Text (Tab delimited) type with a suitable name (e.g.
file3.txt).

Preparation of Data File of Genotype and Phenotype
This data file is prepared in Microsoft Excel in the following format:

Individual label  NAU1246  NAU3684  NAU3875  BNL3971  DFF  PH    YLD
BC1-1             2        0        .        2        55   68.7  15.2
BC1-2             0        .        .        0        45   84.2  20.5
BC1-3             2        2        2        2        61   65.7  17.5

BC1-n                                                 58   71.5  22.8

The first row is reserved for the header line, which is filled in the
following order: the individual label, the genotypic score for each
marker (marker labels are entered in the same order as in file2 and
file3) and then the phenotypic values (in the above example, phenotype 1
is DFF (days to first flowering), phenotype 2 is PH (plant height) and
so on). In the subsequent rows, the data are entered with one row for
each individual, following the header line column by column. The
genotype codes used for scoring backcross progenies are as follows (for
other types of mapping population, refer to the QTL Cartographer manual
or the help button available in the QTL Cartographer main window):
2 = homozygous parent A, 0 = homozygous parent B, 1 = heterozygous and
. = missing value. Likewise, the mean phenotypic values scored for each
progeny should be entered in uniform units (e.g. plant height in cm for
all the individuals; entering one individual's plant height in cm and
another individual's in mm or m is incorrect). The data file is saved
in Text (Tab delimited) type with a suitable name (e.g. file4.txt).
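Because the four files must agree with one another (marker counts, label order, position blocks and genotype codes), it can help to check them before import. The following Python sketch is one possible way to do so and is not part of QTL Cartographer; the function names are hypothetical, and the example values are those shown in this box, restricted to linkage group C2 so that the four marker columns of the genotype table match.

```python
GENOTYPE_CODES = {"2", "1", "0", "."}  # backcross codes listed in the box


def tab_delimited(rows):
    """Join rows of values into tab-delimited text, one row per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in rows) + "\n"


def position_text(positions):
    """One position per line, with a blank line between linkage groups."""
    return "\n\n".join("\n".join(str(p) for p in group) for group in positions) + "\n"


def check_inputs(groups, labels, positions, header, records):
    """Return a list of inconsistencies among the four data sets."""
    problems = []
    n_markers = sum(count for _, count in groups)
    if len(labels) != n_markers:
        problems.append("file2: number of labels differs from file1 marker counts")
    if len(positions) != len(groups):
        problems.append("file3: number of position blocks differs from file1")
    for (name, count), block in zip(groups, positions):
        if len(block) != count:
            problems.append(f"file3: {name} has {len(block)} positions, expected {count}")
        if block and block[0] != 0:
            problems.append(f"file3: {name} does not start at position 0")
        if block != sorted(block):
            problems.append(f"file3: {name} positions are not in map order")
    if header[1:1 + n_markers] != labels:
        problems.append("file4: marker columns differ from file2 labels")
    for rec in records:
        if len(rec) != len(header):
            problems.append(f"file4: {rec[0]} has {len(rec)} fields, expected {len(header)}")
        bad = [g for g in rec[1:1 + n_markers] if str(g) not in GENOTYPE_CODES]
        if bad:
            problems.append(f"file4: {rec[0]} has invalid genotype codes {bad}")
    return problems


# Values copied from the example in the box, restricted to linkage group C2.
groups = [("C2", 4)]
labels = ["NAU1246", "NAU3684", "NAU3875", "BNL3971"]
positions = [[0, 23.3, 67.5, 83.3]]
header = ["Individual", *labels, "DFF", "PH", "YLD"]
records = [
    ["BC1-1", 2, 0, ".", 2, 55, 68.7, 15.2],
    ["BC1-2", 0, ".", ".", 0, 45, 84.2, 20.5],
    ["BC1-3", 2, 2, 2, 2, 61, 65.7, 17.5],
]

print(check_inputs(groups, labels, positions, header, records))  # empty list if consistent
print(tab_delimited(groups), end="")                             # content of file1
```

The text returned by `tab_delimited` and `position_text` can be written to file1.txt to file4.txt with ordinary file output; whether QTL Cartographer accepts files produced this way exactly as it accepts Excel exports is an assumption that should be verified on a small data set.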

Importing Data File into QTL Cartographer
All the data files prepared as described above are imported into QTL
Cartographer for QTL analysis by following the steps below:
Run QTL Cartographer by double clicking the radio button, and select
the tab "New" from the "File" menu. A "Basic information" box will
appear, and the following information is to be filled in that box.
To save the data file, choose the destination directory by selecting
the tab "File name" and the "Save as" menu.
The information about the number of chromosomes (linkage groups),
number of traits, number of other traits (binary values such as sex),
number of individuals and cross type (in this case B1, i.e. backcross
to parent 1) should be entered in the "Basic information" box; then run
the program.
The message "input chromosome label and marker number for each
chromosome" will appear. Select the data file of chromosome label and
marker number (in this case, file1.txt) from the corresponding
directory and send the data to QTL Cartographer by selecting the tab
"Send Data".
The message "input marker labels and positions for each chromosome"
will appear. Select the "Labels" tab, choose the data file of marker
labels (in this case, file2.txt) from the corresponding directory and
click "Send Data" to send the data to QTL Cartographer. Then, select
the tab "Positions" and browse to the data file of marker positions (in
this case, file3.txt) in the corresponding directory. The data are sent
to QTL Cartographer by selecting the tab "Send Data".
The message "cross information filename" will then appear. Select the
data file of genotype and phenotype (in this case, file4.txt) from the
corresponding directory, and send it to QTL Cartographer by selecting
the "Send Data" tab.
Finally, click the tab "Finish" to import all the required data into
QTL Cartographer, which will result in the message "QTL Cart has
created the source data file, the new source data file has been saved".
A description of the imported data will appear in the main window, and
it must be examined critically to cross-check the imported data in all
aspects. The data file will appear in the big lower window. The top
left third of the screen shows some basic information about the data
set, the top middle allows you to visualise or modify specific
genotypic or phenotypic data points, and the top right of the screen
has the options for QTL analysis.

Importing Mapmaker Files into WinQTL Cartographer
Alternatively, if the linkage and QTL-mapping analysis has been done
using Mapmaker with the same data (refer chapter x), all the
information contained in the above four files can be obtained from
Mapmaker and imported into QTL Cartographer through the following
steps:
1. File > Import > Source DATA import 1/1
In this window, enable "MapMaker/QTL format" and click <Next>.
2. In the "Source Data Import 2/2" window:
Click <Map file> and provide the Mapmaker file with the .map extension.
Click <Cross Data> and provide the cross data (the input data used in
Mapmaker) with the .raw or .txt extension.
The source data file for WinQTL will be created in the working
directory with the same file name and an extension of _mps_ln.
3. Click <Finish>.
A new window will appear with the message "The new source data file has
been saved".

Single-Marker Analysis (SMA)


From the "Analysis" menu, select the "Single Marker Analysis" option to
perform SMA. Select the option "Graphic", and specify the destination
directory and a suitable name for saving the output file. From the tab
"Chrom", select the option "All Chroms", and the graphic of all the
chromosomes will be displayed. SMA of each chromosome can also be done
separately by selecting the option "First Chrom", "Second Chrom" and so
on; the graphic of each chromosome will then be displayed for
chromosome-wise analysis. Under the tab "Setting", select the options
"Show Trait Names or Legend", "Show Marker Names" and "Show Chromosome
Names" to display that information in the graphic. Use the option "Copy
Graph to Clipboard" from the "File" menu to paste the graph into
Microsoft Word or PowerPoint.
If you push the "View info" button in the "Single Marker Analysis" box,
you'll get the results of a linear regression analysis of the
relationship between phenotype and marker genotype for each marker
individually. This analysis tells us whether there is a significant
relationship between genotype and phenotype for each marker. If you
push the "View info" button in the "Statistical Summary" box, you'll
get summary statistics on the pattern of trait variation in the mapping
population and on the pattern of segregation at the marker loci, that
is, whether they follow Mendelian expectations. We can thus check
whether the genotype proportions in our mapping population all appear
to be consistent with Mendelian expectations.
In the results, sample size refers to the number of lines used in the
analysis. The variance reported is almost identical to the phenotypic
variance of line means; these numbers should be nearly identical
because they estimate the same thing, although in slightly different
ways. Following the trait statistics and histograms is a long table
showing the percentage of missing data at each marker locus. You should
at least scan this table to see if there are any loci with large
amounts of missing data because that will warn you to be more doubtful
of the QTL tests at those loci. Following this is a table showing tests
of segregation distortion at each locus.
Chi2 is the χ² test of the null hypothesis that the locus is segregating
as expected for a Mendelian locus in the population. This test is based
on the difference between the expected and observed numbers of lines in
each genotypic class. The larger the deviation from expectation under
the null hypothesis, the larger this number is. The question to ask is:
"Is there a significant deviation from the expected segregation at this
locus?" It is actually fairly tricky to answer. For example, from the
results, you may find that the P value of a test is 0.022. If you
consider α = 0.05 the threshold for significance, then you would
consider the data to demonstrate a significant deviation from expected
segregation. However, keep in mind that setting a threshold of 0.05
means that one expects to declare, by chance, 5% of all tests
significant even if the null hypothesis of Mendelian segregation is
always true. When we reject the null hypothesis even though it is true,
we make a Type I error. Since we are testing many loci for segregation
distortion, one should probably use a more stringent threshold to avoid
making too many Type I errors. One possibility is to use an
experiment-wise (or whole-genome-wise) threshold that adjusts the
significance threshold to keep the probability of making at least one
Type I error at some constant level. This often leads to very stringent
significance thresholds because it becomes very difficult to avoid
making even one Type I error if you conduct many tests (remember, the
number of tests here is equal to the number of marker loci). So as you
do a better job of controlling the rate of Type I errors, you end up
making more Type II errors (where you do not reject the null hypothesis
in cases where it is not true). Worse, it is not even clear how to set
this threshold correctly for data where the tests are not all
independent of each other. In the case of genetic data, tests at linked
loci are not independent. If there is segregation distortion in a
genomic region, then all loci in that linked region will exhibit
distortion. In such cases, the following points may help. First, decide
what the relative cost of making a Type I versus a Type II error is. In
this example, what is the effect on QTL mapping if there really is
segregation distortion? The biggest difficulty is that segregation
distortion leads to biased recombination frequency estimates during
linkage map construction (see chapter 4 for a detailed description).
However, for single-marker QTL analyses, segregation distortion causes
no bias at all. We just need to keep in mind, for the later methods of
QTL analysis to be discussed, that the map distances are not really
known and may be estimated with some bias. Second, set a significance
threshold somewhere between 0.05 (the most liberal) and a
Bonferroni-corrected threshold of 0.05/n, where n = number of tests
(the most conservative), depending on how concerned one is about Type I
versus Type II errors. An ad hoc, somewhat liberal threshold that is
often used is obtained by dividing 0.05 by the number of chromosome
arms in the linkage map.
Since loci at the two different ends of a chromosome tend to be
independent of each other, we guess that there are at least two
independent groups of tests on each chromosome. For example, in rice,
there are 24 chromosome arms, so the threshold is p = 0.05/24 = 0.002.
The corresponding χ² value with one degree of freedom is 9.47. Even
with this adjusted threshold, we may find significant segregation
distortion on every chromosome, and it may be very strong for some
markers. It can reasonably be assumed that there are problems with the
linkage map in such regions. You may notice one other interesting fact
about such regions: The QTL regions overlap with regions undergoing
segregation distortion, and the favourable QTL alleles are in excess
frequencies in these regions. This can be identified by carefully
examining the "Statistical Summary" output and checking the segregation
distortion results in the region (refer chapter 3 for χ² analysis using
AntMap).
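The segregation test and the thresholds discussed above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a backcross in which the two genotype classes are expected in a 1:1 ratio (so the test has one degree of freedom); the function names and the counts 120 and 80 are hypothetical, and the code is not the routine used by QTL Cartographer.

```python
import math


def chi2_1df_pvalue(x):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def segregation_chi2(n_a, n_b):
    """Chi-square statistic for a 1:1 expectation of two genotype classes."""
    expected = (n_a + n_b) / 2.0
    return (n_a - expected) ** 2 / expected + (n_b - expected) ** 2 / expected


def chi2_1df_critical(p, lo=0.0, hi=100.0):
    """Critical chi-square value (1 df) for tail probability p, found by bisection."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_1df_pvalue(mid) > p:
            lo = mid   # tail still too large: move to larger statistics
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical marker: 120 lines carry one genotype and 80 the other.
stat = segregation_chi2(120, 80)
p_value = chi2_1df_pvalue(stat)

liberal = 0.05             # per-test threshold
arm_based = 0.05 / 24      # ad hoc threshold for 24 chromosome arms (rice)
print(f"chi2 = {stat:.2f}, P = {p_value:.4f}")
print("significant at 0.05:", p_value < liberal)
print("significant at 0.05/24:", p_value < arm_based)
print(f"critical chi2 at 0.05/24: {chi2_1df_critical(arm_based):.2f}")
```

The hypothetical marker is declared distorted under the liberal per-test threshold but not under the chromosome-arm threshold, which illustrates the trade-off between Type I and Type II errors described above. For an F2 population the expected ratio and the degrees of freedom differ, so the one-degree-of-freedom shortcut used here would not apply.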
During the computation, single-marker analysis considers one locus at a
time and fits the following regression model (refer Fig. 6.1):
y = b0 + b1 x + e,
where y is the phenotypic value of a line, b0 is the population mean,
b1 is the additive effect of the locus on the trait and e is a residual
error term. x is directly related to the genotypic code at the locus
being tested for the line considered; it is -1 (for the female or
recurrent parent) or +1 (for the male or donor parent). The population
mean estimate, b0, should change very little from marker to marker. The
critical parameter in this equation is b1; it is half the effect of
changing the genotype from the female homozygote (x = -1) to the male
homozygote (x = +1) at this locus. If the marker locus is not linked to
a QTL, then we expect that changing the genotype at the marker locus
has no effect on the phenotype and b1 = 0. As the effect of changing
the genotype becomes greater, the absolute value of b1 increases, and
the error terms, e, decrease. This leads to increased evidence against
the null hypothesis of b1 = 0 (no QTL linked to the marker).
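The regression model above can be fitted by ordinary least squares in a few lines. The Python sketch below assumes genotypes coded -1 and +1 as in the text; the function name and the eight phenotypic values are hypothetical, and QTL Cartographer itself uses maximum likelihood rather than this code.

```python
def single_marker_fit(x, y):
    """Least-squares estimates (b0, b1) of the model y = b0 + b1*x + e."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Hypothetical data: four lines of each genotype class at one marker.
genotype = [-1, -1, -1, -1, 1, 1, 1, 1]
phenotype = [48, 52, 50, 50, 58, 62, 60, 60]

b0, b1 = single_marker_fit(genotype, phenotype)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

In this balanced example the two genotype classes have means of 50 and 60, so b0 is the overall mean (55) and b1 is half the difference between the class means (5), in line with the interpretation of b1 given above. With unequal class sizes, b0 is the midpoint of the two class means rather than the overall mean.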
The test of significance of b1 can be done by regression, ANOVA or
maximum likelihood. The results of these methods for single-marker
analysis are essentially identical. QTL Cartographer actually does this
test using maximum likelihood estimation. Maximum likelihood estimates
the most likely value of b1 given the observed genotypic and phenotypic
data and reports the likelihood of the model with the most likely value
of b1 as L1. A significance test is based on the likelihood ratio test
(LRT). The LRT is calculated as -2 times the natural log of the ratio
of the likelihood of the model where b1 is set equal to 0 (L0) to the
likelihood of the most likely QTL model (L1), that is,
LRT = -2 ln(L0/L1). This can be converted to an F-test. Notice that the
values of x (the genotypic values) change for each locus, so the model
is recalculated for each marker locus, and the significance test is
redone for each locus. Therefore, we will test as many QTL models as we
have markers in the data set.
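The link between the LRT and the F-test can be made concrete. The following Python sketch assumes normally distributed residuals, in which case the LRT reduces to n ln(RSS0/RSS1), where RSS0 and RSS1 are the residual sums of squares without and with the marker term; the function name and the data are hypothetical, and this is an illustration rather than the QTL Cartographer routine.

```python
import math


def single_marker_lrt(x, y):
    """Return (LRT, F) for H0: b1 = 0 in the model y = b0 + b1*x + e."""
    n = len(y)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    rss0 = syy                       # model without the marker (b1 = 0)
    rss1 = syy - sxy ** 2 / sxx      # model with the marker
    lrt = n * math.log(rss0 / rss1)  # equals -2 ln(L0/L1) for normal errors
    f_stat = (rss0 - rss1) / (rss1 / (n - 2))
    return lrt, f_stat


# Hypothetical data: four lines of each genotype class at one marker.
genotype = [-1, -1, -1, -1, 1, 1, 1, 1]
phenotype = [48, 52, 50, 50, 58, 62, 60, 60]

lrt, f_stat = single_marker_lrt(genotype, phenotype)
print(f"LRT = {lrt:.2f}, F = {f_stat:.2f}")
# The two statistics are monotonically related: LRT = n * ln(1 + F / (n - 2)).
print(f"LRT from F = {len(phenotype) * math.log(1 + f_stat / (len(phenotype) - 2)):.2f}")
```

Because one statistic is a monotonic function of the other, a marker ranked as more significant by the F-test is also ranked as more significant by the LRT, which is why regression, ANOVA and maximum likelihood give essentially the same single-marker results.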
Scanning the output table, we can find significant results, which are
flagged by "*" and "**". The point to be noted here is that QTL
Cartographer's single-marker analysis is essentially identical to a
regression or ANOVA conducted using the genotype data for one marker at
a time. It is natural to test the effect of the marker locus on the
trait in this fashion. But recall that we usually consider the markers
to be neutral and that we are really searching for QTL that are linked
to the marker loci. Therefore, the phenotypic effect observed at a
marker locus is affected both by the true QTL effect and by the
recombination frequency between the marker and the QTL. This makes
sense, since recombination between the marker and the QTL results in
progeny with the opposite QTL allele compared to the parental
arrangement. Between the two extremes of marker and QTL being unlinked
and being tightly linked, the estimated effect of the QTL decreases
linearly as recombination between the marker and the QTL increases.
This means that unless the marker is right at the QTL, you will
underestimate the true effect of the QTL. The marker closest to the QTL
should have the largest effect. An important question is this: if eight
significant markers were identified on chromosome 1, does it mean that
the analysis has found eight QTL on chromosome 1? In reality, we do not
know whether there are multiple QTL or a single QTL whose effect
extends to numerous linked loci, but the latter hypothesis is simpler,
so it is usually accepted unless solid evidence to the contrary can be
given.
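The linear decline of the marker effect with recombination, described above, can be written down explicitly. The Python sketch below assumes a backcross with a single QTL of effect a (in the same -1/+1 coding as the regression model) and a recombination fraction r between marker and QTL, for which the expected marker estimate is a(1 - 2r); the function name and the value a = 5 are hypothetical.

```python
def expected_marker_effect(qtl_effect, r):
    """Expected single-marker estimate b1 for a QTL of the given effect
    at recombination fraction r from the marker (backcross, one QTL)."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("recombination fraction must lie between 0 and 0.5")
    return qtl_effect * (1.0 - 2.0 * r)


for r in (0.0, 0.1, 0.25, 0.5):
    print(f"r = {r:.2f}  expected b1 = {expected_marker_effect(5.0, r):.2f}")
```

At r = 0 the marker recovers the full QTL effect, and at r = 0.5 (unlinked) the expected estimate is zero. A small estimated effect at a marker may therefore reflect either a small QTL close by or a large QTL further away, which single-marker analysis alone cannot distinguish.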

Interval Mapping
To perform interval mapping, select the option "Interval Mapping" from
the "Analysis" menu. Specify the destination directory in which to save
the graphic of the interval mapping results. Since many statistical
tests are done in a QTL analysis, you have to take account of that fact
in choosing a threshold value of the likelihood ratio statistic for
declaring that you've found a QTL. You can accept the default value,
use one of your own or select one through permutations (which takes the
longest but produces the most reliable threshold value). The number of
permutations can be set to 300–1,000 or more. QTL Cartographer will
automatically calculate the threshold when you press the "Go" tab, and
the resulting LOD score will be fixed as the threshold for interval
mapping. As mentioned above, the threshold value can also be fixed
manually in the appropriate tab in the same window. Note that the
default significance threshold is an LRT value of 11.5, which equals an
LOD score of 2.5 (refer text for details). Once this threshold value is
set, interval mapping can be performed. The other parameter you may
want to change is the walk speed. That's the parameter that determines
the interval along the map at which QTL calculations are done. If you
have a very dense map, you can set the interval to be quite small, and
you'll have a much more precise idea of where any QTL you locate may
be, but it will take the program much longer to do the calculations. If
you have no particular walk speed in mind, leave it at the default of
2 cM.
The graphics of all the chromosomes can be obtained by selecting the
"All Chromos" option from the tab "Chrom". Interval mapping for each
chromosome can also be carried out separately by selecting the
particular chromosome ("First Chrom", "Second Chrom" and so on), and
the graph of each chromosome can be saved separately (as shown in
Fig. 6.4). Similarly, interval mapping can also be performed for each
trait separately by selecting one trait at a time (1: DFF, 2: PH, etc.).
The additive effect for the particular trait is also displayed
separately as a graphic, just below the graph of the LOD score
(Fig. 6.4). Analyse the graph of each chromosome to identify the QTL
linked to the particular trait as peaks of the LOD score that exceed
the threshold. These are the peaks of the likelihood profile where QTL
are most likely to be located (if you accept a peak as being
significant, the exact position of the peak can be seen in the results
table).
Figure 6.4 suggests that a QTL is present at about 20 cM from the left
end of the chromosome. There are two parts to the graph. The x-axis of
both graphs shows the marker positions along the linkage map. The top
graph plots the LOD score against the position on the map. You can see
that this has some relationship to the LRT discussed previously. Why
are LOD scores given instead of LRTs? It is for simplicity. Results
from linkage mapping programs (such as MAPMAKER) are often given as LOD
scores, so it makes some sense to also report the QTL results in terms
of LOD. Also, LOD scores are easier to interpret than LRTs. One can
easily see from the definition of an LOD score that:
LOD = 0 means that the best QTL model and the no-QTL model have
identical likelihoods (thus, no evidence for a QTL).
LOD = 1 means that the best QTL model is 10 times more likely than the
no-QTL model (which is considered only limited evidence for a QTL, not
significant).
LOD = 2 means that the best QTL model is 100 times more likely than the
no-QTL model (which is still considered only limited evidence for a
QTL, not significant).
LOD = 3 means that the best QTL model is 1,000 times more likely than
the no-QTL model. A threshold of 2.5–3 is often used to declare
significance of QTL to minimise the frequency of Type I errors.
Notice that a horizontal line is drawn across
the graph at the common threshold value of
2.5. You can actually change the level of this
threshold on the graph by choosing Setting > Set
display parameters and entering the desired
value in the box near the bottom right of the
dialog box. This raises the question of what
the appropriate threshold for significance
should be for declaring a QTL to exist near a
marker (and that is why we used a permutation
test). An LOD of 2.5 corresponds to an LRT of
11.5, which corresponds to a P value of
0.0007. This is lower than the ad hoc threshold
of 0.05/24 = 0.002 previously suggested for
rice. Again, we are faced with the problem of
balancing Type I and Type II errors.
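The correspondence between LOD scores, LRT values and P values quoted above follows from the definitions: an LOD score is the base-10 logarithm of the likelihood ratio, whereas the LRT is twice its natural logarithm, so LRT = 2 ln(10) × LOD ≈ 4.6 × LOD. The Python sketch below assumes that the LRT is referred to a χ² distribution with one degree of freedom, which is the assumption that reproduces the P value of 0.0007; the function names are hypothetical.

```python
import math


def lod_to_lrt(lod):
    """Convert an LOD score to the likelihood ratio test statistic."""
    return 2.0 * math.log(10.0) * lod


def lrt_to_lod(lrt):
    """Convert a likelihood ratio test statistic to an LOD score."""
    return lrt / (2.0 * math.log(10.0))


def lrt_pvalue_1df(lrt):
    """Upper-tail chi-square probability with one degree of freedom."""
    return math.erfc(math.sqrt(lrt / 2.0))


for lod in (1.0, 2.0, 2.5, 3.0):
    lrt = lod_to_lrt(lod)
    print(f"LOD = {lod:.1f}  likelihood ratio = {10 ** lod:,.0f}  "
          f"LRT = {lrt:.2f}  P = {lrt_pvalue_1df(lrt):.5f}")
```

For models with more than one tested parameter (for example additive and dominance effects in an F2), the degrees of freedom are larger, and the same LOD score corresponds to a larger P value than the one-degree-of-freedom figure computed here.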
The bottom graph plots the additive effect against the marker position.
Notice that the additive effect can shift from positive to negative
depending on the QTL. For example, finding the corresponding line in
the output (position 20.0601), we can see that the additive effect of
the A allele at this locus is estimated to be 9.20 and that this QTL
accounts for about 22% of the variance (r²) in the trait (these values
can be obtained from the table in the results output). The key point to
be noted here is that interval mapping should have higher power to
detect QTL located between marker loci and should provide better
(unbiased) estimates of the QTL effects. But this is all based on the
assumption that our linkage map is accurate!
The r² value for a QTL peak can be interpreted as the proportion of the
phenotypic variance explained by that QTL. But this interpretation must
be made with caution. If it were strictly true, then we could add up
all of the r² values for the QTL discovered and obtain the proportion
of phenotypic variance that all of our QTL combined explain. For
example, suppose that seven QTL are reported in the output and that
their r² values add up to 94% of the phenotypic variation. This must be
an overestimate because the heritability of the trait is usually less
than 94%. Therefore, realise that the total variance explained by the
QTL will typically be less than the sum of the individual QTL r² values
(in some cases, the individual QTL r² values can even sum to more than
100%).

Fig. 6.4 Interval mapping results for the sample data

One obvious reason that the r2 values can
sum to more than they really explain jointly is
that some of the QTL peaks given in the SIM
output are false positives (Type I error). It is
previously mentioned that if one conducts
many independent tests, the overall probability of making at least one Type I error is much
higher than the threshold rate for an individual
test. It is also discussed that it is difficult to
determine an appropriate threshold level for
declaring significance and it depends on the

relative costs of making Type I and Type II


errors. For that reason, it is suggested to
perform permutation tests as a way to accurately obtain the overall genome-wise QTL
Type I error rate. And another possible reason
is by adding up individual QTL r2 values to
obtain a combined effect estimate, you are
assuming that the QTL effects are independent. This can be violated in at least three
ways in typical mapping studies: (1) The QTL
may be linked on the same chromosome.
(2) The QTL may be on different chromosomes,
but are not completely independent just
because the sample size (number of mapping
lines) is finite. (3) The QTL genetic effects
may interact epistatically. These problems of
not knowing if a QTL is real or not and of
overestimating the QTL effects in singlemarker analysis and SIM can be addressed are
to build multiple QTL models such as composite interval mapping (but does not entirely
solve the above-said problems). This should
help to eliminate some false positive QTL
because it is more difficult for them to be
(continued)

Multiple Interval Mapping (MIM) or Multiple QTL Mapping

137

Box 6.2 (continued)

included in a multiple QTL model and remain
significant. It will also improve our estimates
of the QTL effects and give more realistic estimates of the total variation explained by the
QTL jointly because the r2 value of the multiple QTL model takes into account their lack of
independence. The other issue, the genome-wise error rate, is not entirely solved by
multiple QTL modelling, because it is still
not clear what the probability of a Type I error
is in multiple QTL models. For interval mapping and composite interval mapping, however, we can get good estimates of the
genome-wide Type I error rate by using permutation tests. The permutation test will normally take some time to finish. Usually, 1,000
permutations are recommended for an accurate estimate of the threshold value. The value
that occurs at the bottom of the highest 5% of
the permuted test statistics is used as the threshold above which an LRT statistic is declared significant at the 5% level,
and it is automatically set by QTL Cartographer
during the analysis, as stated above.
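The inflation of the Type I error rate with many independent tests, mentioned above, can be illustrated with a short calculation. This is only a sketch: the function name is hypothetical, and the assumption of fully independent tests is an idealisation, because linked markers give correlated tests.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error among n_tests independent tests,
    each carried out at the per-test significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


# Per-test level of 5% applied to an increasing number of independent tests.
for n_tests in (1, 10, 50, 200):
    print(n_tests, round(familywise_error(0.05, n_tests), 3))
```

With a single test the error rate equals the per-test level, but it approaches 1 as the number of tests grows, which is why a genome-wise threshold is needed.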
When analysing an F2 or any mapping population design using interval mapping or composite interval mapping, QTL Cartographer
reports 21 columns of information for each
position in the walk along the chromosomes.
Before enumerating those statistics, it is useful
to point out that there are four hypotheses
being examined at each position (refer to the
manual for details):
1. H0: a = 0, d = 0. Both the additive allelic
effect and the dominance deviation are
zero.
2. H1: a ≠ 0, d = 0. The additive allelic effect
is distinguishable from zero, but the dominance deviation is zero.
3. H2: a = 0, d ≠ 0. The additive allelic effect
is zero, but the dominance deviation is distinguishable from zero.
4. H3: a ≠ 0, d ≠ 0. Both the additive allelic
effect and the dominance deviation are
distinguishable from zero.

Many of the 21 columns in the output correspond
to comparisons among these hypotheses or to
estimates of additive and dominance effects
under a particular hypothesis; refer to the
manual for the details of each column.

Composite Interval Mapping


The options available and procedure for composite interval mapping are very similar to
those for interval mapping. That is because
the underlying statistical model is very similar. In fact, the only difference is that CIM
attempts to statistically control for the genotype at markers other than those immediately flanking the candidate QTL. As a result,
the graphic displays generated by
interval mapping and composite interval
mapping look quite similar.
The idea is that including the cofactors in
the model reduces the error term and should
provide higher statistical power to detect the
QTL in the interval mapping scan. However, power
of QTL detection can actually decrease if you
try to fit linked marker loci. QTL Cartographer
deals with this issue by using a window that
slides along the chromosome as the interval
mapping proceeds and drops out of the model
any cofactors that are within a set distance from
the markers defining the interval being tested.
Thus, if you set the window size to 10 cM and
you are testing a position within the interval
defined by loci B and C, then any markers
lying between 5 cM to the left of B and 5 cM to the right
of C would be dropped from the model if they
happened to be cofactors. What this means is
that the model being tested at each position is
actually subject to change as cofactors drop in
and out of the model due to being blocked by
the sliding window. This makes interpretation
of CIM results difficult sometimes.
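The sliding-window rule can be sketched in a few lines. The sketch follows the worked example in the text (half of the window size is blocked on either side of the flanking markers); the function name and the marker positions are hypothetical, and the exact rule used inside QTL Cartographer should be checked against its manual.

```python
def active_cofactors(cofactors, left_marker, right_marker, window_cm=10.0):
    """Return the cofactor positions (cM, same chromosome) that stay in the model
    while the interval between left_marker and right_marker is being tested."""
    half_window = window_cm / 2.0
    blocked_from = left_marker - half_window
    blocked_to = right_marker + half_window
    # Cofactors inside the blocked region are dropped from the model.
    return [pos for pos in cofactors if not (blocked_from <= pos <= blocked_to)]


# Hypothetical example: locus B at 40 cM, locus C at 52 cM, window size 10 cM.
print(active_cofactors([10.0, 37.0, 45.0, 56.0, 80.0], 40.0, 52.0))
```

As the tested interval moves along the chromosome, the list returned for each position changes, which is the reason the fitted model differs from position to position.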
We implement the CIM analysis in QTL
Cartographer by selecting Composite Interval
Mapping from the Analysis drop-down
menu on the top right of the main window.


Again, we have the option to accept the
default threshold of LRT = 11.5 (LOD = 2.5)
or to run a permutation test using CIM
(the threshold could differ between CIM and
SIM for the same data set because the analysis methods are different). You can also see the
various options for selecting cofactors and
setting window size by clicking the Control
button at the top centre of the top panel. The
default is Model 6 which selects only the
most significant markers as cofactors using
multiple regressions. There are other model
options for choosing the cofactors (and you
can even define the cofactors yourself), but
these other models are not generally recommended (there may be some special cases
where they would be useful). Having selected
Model 6, we can still choose the multiple
regression method (forwards, backwards or
forwards and backwards stepwise). Forwards and backwards stepwise regression is generally recommended as the best model-selection algorithm,
but it takes longer to select the cofactors;
alternatively, keep the default of forward selection. If
you do choose stepwise regression, you will
need to decide on appropriate thresholds for
permitting markers to enter the model and to
delete markers from the model. We can leave
the window size as the default of 10 cM and
accept the default number of control markers
(cofactors) of 5. It is probably good to limit
the number of cofactors to about 5 unless you
have a very large population size, or you may
end up with so many cofactors that there will
be little power to detect QTL in the interval
mapping scans.
The output from the CIM analysis may
show fewer QTL peaks than SIM,
but each of the CIM peaks may have a higher
LOD score than the corresponding SIM peak. You can
also notice that the additive effect estimates
and the r2 values of the QTL are usually higher
with CIM than they are with SIM. This is
because of the higher power of detection and
higher estimation precision gained by controlling the genetic background variation with the
cofactors. But these r2 values are still not based
on fitting all of the QTL in a final model. And
we still have the problem of finding tightly
linked QTL peaks. These problems can be
addressed by making a model that fits each of
the QTL positions as interval positions simultaneously, without additional cofactors. This
would give us a valid estimate of the total variation explained by the model and would give
us the evidence of which peak of multiple
linked QTL peaks is the most likely position
of the QTL.
We can also estimate a 95% confidence
interval on the position of the QTL using these
CIM results. This is based on the 1-LOD support interval, meaning that the confidence
interval includes the position of the QTL peak
plus all positions to the right and left of it that
have LOD scores within 1 of the peak. For
example, you can get a rough guess at the 95%
CI for the QTL at a particular position, say
215.6 by looking at the LOD profile graph.
The LOD at the peak is about 3.6, so any positions flanking it that have LOD scores greater
than 2.6 should be included in the confidence
interval. You can also do this by looking at the
results for each tested position in the output
file. Suppose the LRT value for position
215.6 is given as 16.79 (~3.6 LOD); we then
need to include any positions around it with
LRT values greater than 11.97 (= 2.6 LOD). In
fact, it is not really known how to obtain true
confidence intervals for QTL located with
CIM, and the 1-LOD support interval may be
an underestimate, but even so, it illustrates
the point that in typical QTL-mapping studies, a QTL position cannot be located with
better precision than about 10 cM. This makes
relating QTL to underlying genes (positioned
on a physical map) extremely unreliable.
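The LRT and LOD values quoted above are related by LRT = 2 ln(10) × LOD (about 4.605 × LOD), and the 1-LOD support interval can be read off a LOD profile mechanically. The following is a minimal sketch: the function names and the example profile are hypothetical, and only the conversion factor and the values 11.5, 16.79 and 11.97 come from the text.

```python
import math

LRT_PER_LOD = 2.0 * math.log(10.0)  # about 4.605


def lrt_to_lod(lrt):
    """Convert a likelihood-ratio test statistic to a LOD score."""
    return lrt / LRT_PER_LOD


def lod_to_lrt(lod):
    """Convert a LOD score to a likelihood-ratio test statistic."""
    return lod * LRT_PER_LOD


def one_lod_support_interval(positions, lods, drop=1.0):
    """Return the leftmost and rightmost positions around the highest peak
    whose LOD scores stay within `drop` LOD units of the peak."""
    peak = max(range(len(lods)), key=lods.__getitem__)
    cutoff = lods[peak] - drop
    left = peak
    while left > 0 and lods[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lods) - 1 and lods[right + 1] >= cutoff:
        right += 1
    return positions[left], positions[right]


print(round(lod_to_lrt(2.5), 1))    # default threshold expressed as LRT
print(round(lrt_to_lod(16.79), 1))  # LOD at the peak in the example

# Hypothetical LOD profile around the peak at 215.6 cM.
positions = [209.6, 211.6, 213.6, 215.6, 217.6, 219.6, 221.6]
lods = [1.9, 2.7, 3.2, 3.6, 3.1, 2.4, 2.8]
print(one_lod_support_interval(positions, lods))
```

The interval is extended only through contiguous positions, so a second region that rises above the cut-off further away (221.6 in the hypothetical profile) is not included.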
Multiple Interval Mapping


Multiple interval mapping is a still more
sophisticated method of mapping. It allows
you to identify more than one QTL and to
refine your analyses as you go along. One nice
feature is that it provides an easily understandable summary of the results. Choose multiple
interval mapping (MIM) from the Analysis
drop-down menu on the top right of the main
window. We are prompted to select the trait
and choose trait 1, PH. A new top window
opens and says No MIM Model Exist. Create
a new MIM model by selecting New Model.
The Create New MIM Model window opens,
and we can choose the method we want to
create the MIM Model. We can choose
Forward Selection on Markers, Forward &
Backward Selection on Markers, Scan Through
Composite Interval Mapping or MIM Forward
Search Method. The first two options implement multiple regression model building
by fitting marker loci (not interval positions)
in the model, as CIM does. The Scan
Through Composite Interval Mapping
approach inputs the information from CIM
and fits a multiple QTL model by first selecting the position with the highest LOD score
from CIM, then fitting the position with the
2nd highest LOD score from CIM and so
forth. Only positions that remain significant
when fitted with the previously included QTL
positions will be maintained in the MIM
model. The MIM Forward Search Method
builds a multiple interval position model by
first selecting the position with highest LOD
score from interval mapping. Then, the
genome is rescanned with interval mapping,
but including the first selected position in the
model during the rescan. Then, the next most
significant position found upon rescanning the
genome is fit into the model. Following this,
the genome is rescanned again, but including
the first two positions in the model. This process continues until no more positions can be
added as significant positions. These two
approaches result in MIM models that can be
then further refined by testing the effects of
moving one QTL position just slightly, while
maintaining the other positions constant to see
if the model can be improved. This can be
done iteratively until no further improvements
can be made in the model. Then the final
model can be tested, providing total r2 values
for all QTL jointly and additive effects of QTL
estimated simultaneously. However, for preliminary analysis, it is advised to start MIM
using the CIM and MIM default methods to
compare the models they select as best. Start
the MIM search procedure to build the initial
MIM model. A dialog box pops up, and we are
asked to choose the model-selection criterion
from among Bayesian information criterion
(BIC), Akaike information criterion (AIC)
and modified versions of the original BIC.
These selection criteria are computations that
weigh the increase in likelihood from adding a
parameter (such as a new QTL) to the model
against the possibility of over-fitting the model
by adding too many parameters. Each additional parameter can only be added if it
increases the likelihood more than some
threshold value. The different criteria vary in
how stringent they make that threshold. AIC is
the least stringent, and the original BIC is
probably a good choice. The MIM
analysis estimates the positions of the
QTL and their additive effects. We can
also test for epistasis among pairs of QTL. Hit
Refine Model, then Searching for new QTL
in the window that pops up; then in the new
top panel, select the Search for Epistasis button and then hit Start.
Caution: Interpreting the results requires
more advanced knowledge of the genetics of the
traits and should be done with restraint.
Readers are requested to refer to the manual/tutorial and the latest papers that have used MIM.
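The trade-off made by the information criteria mentioned above can be illustrated with their textbook definitions. This sketch is an assumption about the general form only: the modified BIC variants offered by QTL Cartographer use their own penalty functions, which are described in its manual, and the function names here are hypothetical.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 ln L + 2 k."""
    return -2.0 * log_likelihood + 2.0 * n_params


def bic(log_likelihood, n_params, n_obs):
    """Original Bayesian information criterion: -2 ln L + k ln(n)."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def min_loglik_gain(criterion, n_obs, extra_params=1):
    """Increase in log-likelihood that extra_params new parameters must deliver
    before the criterion prefers the larger model."""
    penalty_per_param = 2.0 if criterion == "AIC" else math.log(n_obs)
    return 0.5 * penalty_per_param * extra_params


# With 200 mapping lines, BIC demands a larger likelihood gain per added QTL than AIC.
print(min_loglik_gain("AIC", 200), round(min_loglik_gain("BIC", 200), 2))
```

Because ln(n) exceeds 2 for any population of eight or more lines, the original BIC is the more stringent of the two in practical mapping populations.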
It is difficult to manually draw the QTL
map (such as shown in Fig. 6.3) with publication quality. MapChart, which is freely
available at http://www.biometris.wur.nl/uk/
Software/MapChart/, can be used for this purpose. MapChart is a computer package for
the MS-Windows platform that produces
charts of genetic linkage maps and QTL data.
These charts are composed of a sequence of
vertical bars representing the linkage groups
or chromosomes. On these bars, the positions
of loci are indicated, and next to the bars,
QTL intervals and QTL graphs can be shown.
MapChart reads the linkage information (i.e.
the locus and QTL names and their positions)
from text files. This information has to be
calculated before using MapChart, usually
with genetic mapping software such as
Mapmaker, QTL Cartographer, JoinMap
and MapQTL.

Statistical Significance
Regardless of the method used to estimate and
locate single or multiple QTL, once the test
statistics are calculated, the likelihood of the
event is assessed. The statistical basis of these
comparisons relies on model assumptions,
the most common of which requires the quantitative trait values to be normally distributed. In
reality, however, the distribution of the trait
values is not normal and needs to be considered
as a mixture of (normal) distributions. Violating
the normality assumption has an impact on the
distribution of the statistic used to test for a QTL,
which makes standard statistical procedures
potentially inaccurate.
One approach to obtaining the distribution
(or behaviour, in the long term) of the test statistic
is to use a computer simulation to produce the
data. Thousands of data sets, taken from the same
statistical model, are simulated and the test statistics calculated. Together, these test statistics show
the behaviour of the test in the long run and,
therefore, represent the statistical distribution of
the particular test statistic. From this distribution,
one chooses the level of statistical significance or
threshold above which results are considered
statistically significant (or valid). This approach
is indeed useful if the model used to simulate the
data is the true model. However, the model rarely
describes the complicated relationships that occur
in the genome. For example, epistasis is difficult
to model unless the interacting QTL are known in
advance. When a detailed model accurately
describes complex relationships between multiple (interacting) QTL, it is often the case that
simulation-based thresholds are the only practical way to assess statistical significance because
alternative approaches are so computationally
demanding. In QTL analysis, the test statistic provides only an approximate test, as the null hypothesis involves a non-mixture distribution whereas
the QTL model involves a mixture distribution.
Regression analysis also provides only approximate test statistics, as it assumes normally distributed errors within each marker type, whereas the
distribution is really a mixture of two (or three) distributions.
Nonparametric resampling methods have provided a useful alternative to simulation-based
thresholds. Permutation resampling and bootstrap resampling have been applied as a means of
randomising the phenotypic (trait) data for the
purpose of evaluating any test statistic under a
null hypothesis that tests for a QTL.

Permutation Testing
Churchill and Doerge in 1994 proposed permutation testing to obtain empirical distributions for
test statistics. In a permutation test, the phenotypic data are
randomly shuffled over the marker data. Analysis
of the permuted data provides a test statistic
generated under the null hypothesis (marker not
associated with QTL), and repeating the shuffle many times gives the empirical null distribution of the statistic. The number of permutations required is about 10,000 for a reasonable
approximation of threshold levels of 1%. The
important property of this method is that it does
not depend on the distribution of the data. A
permutation test is typically used to determine a
threshold value for significance testing of the
existence of a QTL effect.
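The permutation procedure can be sketched as follows. This is an illustration under stated assumptions, not the routine used by any mapping package: the test statistic is simply the largest squared marker-trait correlation over all markers (a stand-in for the LRT or LOD score), and all function names are hypothetical.

```python
import random


def max_marker_stat(genotypes, phenotypes):
    """Largest squared marker-trait correlation across all markers.
    genotypes holds one list of numeric genotype codes per marker."""
    n = len(phenotypes)
    mean_y = sum(phenotypes) / n
    ss_y = sum((y - mean_y) ** 2 for y in phenotypes)
    best = 0.0
    for marker in genotypes:
        mean_x = sum(marker) / n
        ss_x = sum((x - mean_x) ** 2 for x in marker)
        if ss_x == 0 or ss_y == 0:
            continue  # a monomorphic marker or constant trait carries no information
        s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(marker, phenotypes))
        best = max(best, s_xy ** 2 / (ss_x * ss_y))
    return best


def permutation_threshold(genotypes, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Genome-wise threshold: the lowest value among the highest alpha fraction
    of the maximum statistics obtained from shuffled phenotypes."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any marker-trait association
        null_max.append(max_marker_stat(genotypes, shuffled))
    null_max.sort()
    n_top = max(1, round(alpha * n_perm))
    return null_max[n_perm - n_top]
```

Because the maximum over all markers is recorded in each permutation, the resulting threshold controls the genome-wise rather than the per-marker error rate; the marker data are never shuffled, so the genetic map is unchanged.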

Bootstrapping
Bootstrapping, described by Visscher et al. in
1996, is an alternative resampling procedure.
From the original dataset, N individual observations are drawn with replacement. An observation
is a phenotype and its marker type; hence, unlike
in permutation testing, the observed combinations
remain together. Note that some observations may
appear twice or more in the bootstrap sample, whereas
others may not appear at all. It has been shown that confidence
intervals are approximated very well with this method, even with
only 200 bootstrap samples. A bootstrap
method is typically used to determine an empirical confidence interval for the QTL location,
assuming that the QTL effect exists. In QTL analysis, usually many markers are tested, often for
multiple traits and in multiple families. The risk
of false positives is very high with so many tests.
If a 5% significance level were used for each test, we
would expect 5% of the tests to give false positives when no QTL is present. Therefore, a
more stringent significance level is usually applied
for genome-wide QTL detection. For example, for 200 tests, we would need a
significance level of 0.05/200 = 0.00025 to have
an overall chance of false positives of about 5%. Usually, a
significance level of around 0.1% is applied.

Permutation Versus Bootstrapping and Other Methods
In permutation, traits are randomly assigned to
individuals in the data set with no single trait
value being assigned to more than one individual.
In contrast, in a bootstrap randomisation of the data,
individuals acquire phenotypes by sampling with replacement, such that after one individual receives a random trait assignment, some
other individual might receive the same random
trait assignment. The debate about permutation
or bootstrap randomisation is continuing and is
based on the argument that a permutation retains
the summary information of the trait, whereas the
bootstrap changes the mean and variance of the
bootstrap sample. In both resampling approaches,
the genotypic (marker) assignments remain as in
the original data, and, therefore, the genetic map
does not change. An additional implication of not
changing the genetic map is that all genotypic
and population information is retained (such as
segregation distortion, missing data and recombination fractions). In general, empirical threshold
values obtained by permutation testing are widely
reported in publications. Permutation testing
can also be used to obtain genome-wide significance
levels by simply repeating the procedure across
all markers.
However, both resampling methods have been
noted as being computationally demanding
techniques that require more than 1,000 resamples, and each potentially leads to different
results. Additionally, when the models are very
complex, the extension of resampling methods to
these situations quickly becomes computationally too demanding, as one would have to provide
up to 1,000 resamples for every model considered. Motivated by the computational intensity of
the resampling-based methods, Piepho suggested a
quick method for calculating approximate QTL
thresholds. Because the Piepho thresholds are
theoretically based and do not retain the previously mentioned genetic specifications, they
remain constant across experiments, even though
it is well known that the environment has a large
role in the variation of a quantitative trait and,
therefore, the accuracy of QTL location. In situations in which the biological and statistical effects
are minimised (e.g. segregation distortion, environmental variation, small sample size and incomplete data), the theoretical and resampling-based
thresholds are generally the same.
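The bootstrap procedure for a confidence interval on QTL location, as described in the Bootstrapping section, can be sketched as follows. The sketch rests on simplifying assumptions: the QTL position is taken to be the marker with the largest squared marker-trait correlation (a stand-in for the peak of a LOD profile), the interval is a simple percentile interval, and all function names are hypothetical.

```python
import random


def r_squared(marker, phenotypes):
    """Squared correlation between one marker's genotype codes and the trait."""
    n = len(phenotypes)
    mean_x, mean_y = sum(marker) / n, sum(phenotypes) / n
    ss_x = sum((x - mean_x) ** 2 for x in marker)
    ss_y = sum((y - mean_y) ** 2 for y in phenotypes)
    if ss_x == 0 or ss_y == 0:
        return 0.0
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(marker, phenotypes))
    return s_xy ** 2 / (ss_x * ss_y)


def peak_position(genotypes, phenotypes, positions):
    """Map position of the marker most strongly associated with the trait."""
    scores = [r_squared(marker, phenotypes) for marker in genotypes]
    return positions[scores.index(max(scores))]


def bootstrap_position_ci(genotypes, phenotypes, positions,
                          n_boot=200, level=0.95, seed=0):
    """Percentile confidence interval for the QTL position."""
    rng = random.Random(seed)
    n = len(phenotypes)
    peaks = []
    for _ in range(n_boot):
        # Draw N individuals with replacement; each individual's phenotype
        # and marker genotypes stay together.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_geno = [[marker[i] for i in idx] for marker in genotypes]
        boot_pheno = [phenotypes[i] for i in idx]
        peaks.append(peak_position(boot_geno, boot_pheno, positions))
    peaks.sort()
    tail = (1.0 - level) / 2.0
    lower = peaks[int(tail * (n_boot - 1))]
    upper = peaks[int(round((1.0 - tail) * (n_boot - 1)))]
    return lower, upper
```

Unlike the permutation sketch, the resampling here keeps each phenotype with its own genotypes, so the marker-trait association is preserved and only the sampling variation of the estimated position is explored.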

QTL × QTL Interaction: Impact of Epistasis
Epistasis refers to interactions between alleles
from two or more genetic loci of the genome. The
consequence of epistasis is that the phenotype of
an individual cannot be predicted simply by the
sum of the single-locus effects but rather depends
on the specific combinations of loci. In germplasm that has experienced selection, epistasis
has been shown to contribute to the expression of
complex traits. Hence, estimation of genetic
architecture of the trait in terms of contribution of
main effects and epistatic interactions to the
genotypic variance is important in plant breeding. Such an interaction may arise when two
genes are part of a common biochemical pathway, with gene 1 upstream of gene 2, so that in
individuals homozygous for a null mutation in
gene 1, mutations in gene 2 have no effect. This
is the origin of the term epistasis, which
literally means 'to stop'. Statistical geneticists now
apply the term more widely to indicate any
deviation from additivity between QTL.
Among the approaches, multiple QTL models
are more powerful than single-QTL approaches
because they can potentially differentiate between
linked and interacting QTL. Epistasis, that
is, the interaction of alleles at two or more QTL,
has great potential to alter the quantitative trait
in a manner that is difficult to predict. One of
the most extreme (and simplest) cases is the complete loss of trait expression in the presence of a
particular combination of alleles at multiple QTL.
The crucial challenge in the search for multiple
QTL is to consider every position in the genome
simultaneously, for the location of a potential
QTL that might act independently, be linked to
another QTL or interact epistatically with other
QTL. Interacting QTL are of particular interest as
they indicate regions of the genome that might
not otherwise be associated with the quantitative
trait using a one-dimensional search. Although
the concept of locating multiple, interacting QTL
is straightforward, implementation is quite difficult
due to the tremendous number of potential QTL
and their interactions, which lead to innumerable
statistical models and heavy computational
demand. One heuristic approach that has been
taken is to first locate all single QTL, then to
build a statistical model with these QTL and their
interactions and, finally, search in one dimension
for significant interactions. Kao et al. (1999) made
such a proposal (see above) through a direct
extension of interval mapping to include a simultaneous search for multiple epistatic QTL. Owing
to the computational intensity of a multidimensional search, a truly simultaneous investigation is not
possible, and the search is referred to as a quasi-simultaneous investigation. Approaches like this
have the potential to work in many situations, but
are limited to the pool of QTL that resulted from
the first-pass QTL analyses, and have little hope
of establishing true epistatic effects for QTL that
are not individually significant. Searching through
all potential models is a problem known as model
selection and remains an active area of research
in genetical statistics.
It must be noted that the detection of epistatic
QTL will rely even more on large population sizes
than the detection of main effects. The most promising approach to detect epistatic QTL appears to
be a full two-dimensional scan for all possible
pairwise interactions. Such scans are nowadays
computationally feasible and have successfully
been used to detect epistatic interactions.
In contrast, some researchers have considered
that epistasis appears to be of minor importance in
breeding populations. For most crops and traits,
epistasis could be detected, but the proportion of
genotypic variance explained by these epistatic
QTL was small compared to that of the main
effect QTL. There are, however, exceptions where
individual epistatic QTL have been identified
which explain a proportion of genotypic variance
comparable to that of the main effects. As the
forces active in natural populations are not effective in breeding populations, epistatic interactions
may be selected and maintained, thus contributing
to the expression of the trait. In addition, some
results suggest the presence of epistatic master
regulators, that is, loci that appear to be involved
in a large number of interactions. Though the contribution of epistasis to the genetic architecture of
agronomic traits in breeding populations appears
to be small, an epistasis scan seems advisable as
single epistatic QTL may have large effects and
thus may improve knowledge-based breeding.
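The idea of epistasis as a deviation from additivity, and of a two-dimensional scan over all locus pairs, can be illustrated with a small sketch. It assumes a population with only two genotype classes per locus (for example a backcross, doubled haploid or RIL population) and that all four two-locus classes are observed; the function names are hypothetical, and a real scan would also attach a significance test to each pair.

```python
from collections import defaultdict
from itertools import combinations


def interaction_contrast(locus_a, locus_b, trait):
    """Deviation from additivity among the four two-locus genotype class means.
    The value is zero when the effects of the two loci simply add up."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, b, value in zip(locus_a, locus_b, trait):
        sums[(a, b)] += value
        counts[(a, b)] += 1
    mean = {cell: sums[cell] / counts[cell] for cell in sums}
    a_lo, a_hi = sorted(set(locus_a))
    b_lo, b_hi = sorted(set(locus_b))
    return (mean[(a_hi, b_hi)] - mean[(a_hi, b_lo)]
            - mean[(a_lo, b_hi)] + mean[(a_lo, b_lo)])


def two_dimensional_scan(genotypes, trait):
    """Interaction contrast for every pair of loci (one genotype list per locus)."""
    return {
        (i, j): interaction_contrast(genotypes[i], genotypes[j], trait)
        for i, j in combinations(range(len(genotypes)), 2)
    }
```

In the pathway example given above, where gene 2 has an effect only when gene 1 is functional, the contrast for that pair of loci is non-zero even though it would be zero for any pair acting additively. The number of pairs grows with the square of the number of loci, which is the source of the computational demand mentioned in the text.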

QTL × Environment Interaction
Not all genotypes respond similarly to
environmental signals; there is variation in
response (mainly in terms of reaction
or sensitivity to the environmental stimuli or
signal). Differential genotypic expression across
environments is often referred to as genotype ×
environment interaction (G × E or GEI), which is
one of the unifying challenges facing plant breeders. G × E is an age-old, universal issue that relates
to all living organisms. Genotypes and environments interact to produce an array of phenotypes.
GEI can be defined as the difference between the
phenotypic value and the value expected from the
corresponding genotypic and environmental values. Thus, G × E is the variation caused by the
joint effects of genotypes and environments.
Many agriculturally important traits are end-point
measurements, reflecting the aggregate effects of
large numbers of genes acting independently and
in concert throughout the life cycle. External
factors at any time during the life cycle may
change the developmental process in ways that
may not be predictable. The extent to which G × E
affects a trait is an important determinant of the
degree of testing over years and locations that
must be employed to satisfactorily quantify the
performance of a crop genotype. Because testing
is a major factor in the time and cost of developing new crop varieties, G × E interactions and
their consequences have received much attention.
For example, it has been found that the genetic control
of cotton fibre quality, as reflected by QTL detected
by genome-wide mapping, is markedly affected
both by general differences between growing
seasons (years) and by specific differences in
water regimes. There appears to exist a basal set
of QTL that are relatively unaffected by environmental parameters and may account for progress
from selection in a wide range of environments,
such as the diverse sets of environments that are
often employed in mainstream cotton breeding
programs. On the other hand, differences between
years were reflected in similar numbers of QTL
that were specific to each of the years. With respect
to water regime, several QTL were detected only in the
water-limited treatment, while only a few were
specific to the well-watered treatment. This suggests that improvement of fibre quality under water stress may be even more complicated than
improvement of this already-complex trait under
well-watered conditions. As a component of the
total phenotypic variance (the denominator in
any heritability equation), G × E affects heritability negatively. The larger the G × E component,
the smaller the heritability estimate; thus, progress from selection would be limited. A large
G × E reflects the need for testing cultivars in
numerous environments (locations and/or years)
to obtain reliable results. If the weather patterns
and/or management practices differ in target
areas, testing must be done at several sites representative of the target areas. The disadvantages of
discarding genotypes evaluated in only one environment in early stages of a breeding program have been
discussed on many occasions. The discarded genotypes might have the potential to do well at
another location or in another year. Thus, some
potentially useful genes could be lost due to
limited testing. With the increasing omnipresence of marker technology in plant breeding,
the classical problem of how to handle G × E is
gradually being absorbed into more basic questions about the existence and description
of differential gene expression, where the term
'gene' is replaced by 'QTL'. Because of this process, the need has arisen for statistical models
that are applicable in the contexts of both G × E
and QTL × environment interaction (Q × E).
Though theory for QTL detection and estimation
has developed strongly during the past decades,
theory for Q × E is still scarce and applications of
such theory are few. Noteworthy contributions
are listed in the further readings, and readers are
requested to go through those bibliographies for
cutting-edge knowledge on Q × E.
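The statement that G × E sits in the denominator of the heritability equation can be made concrete with a short calculation. The formula below is the commonly used entry-mean (broad-sense) heritability for genotypes tested in several environments with replication; it is a standard textbook form rather than one given in this chapter, and the function name and variance values are hypothetical.

```python
def entry_mean_heritability(var_g, var_ge, var_error, n_env, n_rep):
    """Heritability on an entry-mean basis:
    var_g / (var_g + var_ge / n_env + var_error / (n_env * n_rep))."""
    phenotypic = var_g + var_ge / n_env + var_error / (n_env * n_rep)
    return var_g / phenotypic


# Same genetic and error variances, with and without a G x E component,
# for genotypes tested in one environment with two replications.
print(round(entry_mean_heritability(4.0, 0.0, 4.0, n_env=1, n_rep=2), 3))
print(round(entry_mean_heritability(4.0, 4.0, 4.0, n_env=1, n_rep=2), 3))
# Testing in four environments dilutes the G x E contribution.
print(round(entry_mean_heritability(4.0, 4.0, 4.0, n_env=4, n_rep=2), 3))
```

The G × E variance is divided by the number of environments, which is why testing in more environments recovers part of the heritability lost to a large G × E component.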

Congruence of QTL: Across the Environments and Across the Genetic Backgrounds Is the Key in MAS
Relatively large numbers of QTL have been detected for
agronomic traits, yet in most cases the detected QTL
explained less than half of the total genetic
variation. What causes the remaining genetic variation that is unexplained by QTL in large samples?
One possibility is that there are many QTL with
very small effects, as assumed in classical models
of quantitative genetics, and these remain undetected even with very large sample sizes. Another
possibility is higher-order epistatic interactions, which are recalcitrant to QTL mapping.
Further, a recurring complication in the use of QTL
data is that different parental combinations and/or
experiments conducted in different environments
often result in identification of partly or wholly
nonoverlapping sets of QTL (as stated in the above
cotton example). The majority of such differences
in the QTL landscape are presumed to be due to
environment sensitivity of genes. Hence, taking proper
care to include Q × E analysis will improve the
further progress of QTL mapping towards MAS.
The use of stringent statistical thresholds to infer
QTL while controlling experiment-wise error rates
is another reason for incongruence: each study identifies only a
fraction of the QTL, so the sets detected in different studies overlap only partly.
Small QTL with opposite phenotypic
effects might occasionally be closely linked in coupling in early-generation populations and separated
only in advanced-generation populations after
additional recombination. Comparison of multiple
QTL-mapping experiments by alignment to a common reference map offers a more complete picture
of the genetic control of a trait than can be obtained
in any one study. However, the lack of a common set of
anchor markers in the published reports of many
crop plants limits the comparison of QTL across
genetic backgrounds.

Meta-QTL Analysis
Since the first publication of a QTL localisation
in tomato using molecular data by Paterson et al.
in 1988, more and more species and traits have
been studied, and many of these results have been
made available via public databases. One of the
main purposes of these databases was to help
researchers to compare results from different
QTL studies, that is, to study the congruency of QTL
locations in order to find out whether the QTL identified for a
given trait in one population are the same as
those detected in other populations.
In theory, one would expect that the variation
of a quantitative trait within a species is explained
by a finite number of genes. Thus, QTL congruency investigation will be a relevant approach to
improve knowledge on trait genetics. Nevertheless,
combining results from linkage studies can be
tedious because, even if several studies focus on the
same trait within the same species, the family structures, sample sizes,
genetic maps or simply the QTL detection methods
may differ between studies. Some methods have
been recently developed to tackle the issues
raised by heterogeneity between QTL studies.
Integration of genetic maps and QTL locations
by iterative projections on a reference map is now
widely used to position both markers and QTL
on a single and homogeneous consensus map
(referred to as comparative mapping; see Chapter 7).
However, this process yields a consensus marker
map for which both the statistical properties
and biological reality cannot be clearly assessed,
even if a robust ordered marker map was used
as reference. Alternatively, an approach using
graph theory to integrate various types of maps
(such as genetic and physical maps) has been
proposed, but it mainly dealt with dissection of
marker order inconsistencies between maps. In
order to study QTL congruency, Goffinet and
Gerber in 2000 proposed a strategy called
meta-analysis. Meta-analysis, which is mainly
used in medical, social and behavioural sciences,
aims to pool results across independent studies in
order to combine them in a single result or estimate. The relevance of meta-analysis investigations in genetics and evolution has been discussed
widely. Yet another meta-analysis-based approach
was proposed by Etzel and Guerra in 2002 to
overcome the between-study heterogeneity and
to refine both QTL location and the magnitude of
the genetic effects. Nevertheless, both methods are limited to a small number of underlying
QTL positions (from one to four for the former
and only one for the latter), which is a serious limitation for a whole-genome study of QTL congruency. Even if the average number of QTL per
experiment is around four in plants, one would
expect that more than four genes can be involved
in the trait variation on a single chromosome. In
order to incorporate this fact, a computational
and statistical package, called Meta-QTL, was
developed for carrying out whole-genome meta-analysis of QTL-mapping experiments. Contrary
to other methods, Meta-QTL offers a complete
statistical process to establish a consensus model
for both the marker and the QTL positions on the
whole genome. First, Meta-QTL implements a
new statistical approach to merge multiple distinct genetic maps into a single consensus map
which is optimal in terms of weighted least
squares and can be used to investigate recombination rate heterogeneity between studies.
Secondly, assuming that QTL can be projected
on the consensus map, Meta-QTL offers a new
clustering approach based on a Gaussian mixture
model to decide how many QTL underlie the distribution of the observed QTL. Meta-QTL is
freely available at http://bioinformatics.org/mqtl.
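The pooling idea behind meta-analysis can be illustrated with a fixed-effect, inverse-variance weighted average. The sketch below is not the Goffinet and Gerber procedure or the Meta-QTL algorithm. It assumes that all studies report the same single QTL, that the positions have already been projected onto a common map and that each position comes with a standard error. The function name and the input values are hypothetical.

```python
from math import sqrt


def pool_qtl_positions(positions_cm, std_errors_cm):
    """Fixed-effect pooling: weight each study by 1 / SE^2."""
    if len(positions_cm) != len(std_errors_cm) or not positions_cm:
        raise ValueError("need one standard error per position")
    weights = [1.0 / se ** 2 for se in std_errors_cm]
    total_weight = sum(weights)
    pooled = sum(w * x for w, x in zip(weights, positions_cm)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical QTL positions (cM on the consensus map) from three studies.
pooled, pooled_se = pool_qtl_positions([42.0, 46.0, 44.0], [2.0, 4.0, 2.0])
print(f"pooled position = {pooled:.2f} cM, SE = {pooled_se:.2f} cM")
```

Precise studies dominate the pooled estimate, and the pooled standard error is smaller than that of any single study. Deciding how many distinct QTL underlie a set of reported positions, as Meta-QTL does, requires a clustering model on top of this weighting.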

Concluding Remarks on QTL Methods


The simplest statistical method for QTL mapping
is analysis of variance at marker loci. This approach
suffers when there is appreciable missing marker
genotype data and when the markers are widely
spaced. Interval mapping, though more complicated and more computationally intensive, allows
for missing genotype data. LOD scores are used
to measure the strength of evidence for the presence of a QTL; the LOD curve for a chromosome
indicates whether a QTL may be present and
where it is likely to be located. The region where
the LOD score is within 1.0 of its maximum may
be taken as the plausible region for the location of
the QTL. Permutation tests are
valuable for determining significance thresholds
for the LOD score; although computationally
intensive, permutation tests allow for the observed
phenotype distribution, marker density and pattern
of missing genotype data. Interval mapping
and analysis of variance make use of a single-QTL model. Methods that consider multiple QTL
simultaneously have three advantages: greater
power to detect QTL, greater ability to separate
linked QTL, and the ability to estimate interactions between QTL. These more complex methods may facilitate the identification of additional
QTL and assist in elucidating the complex genetic
architecture underlying many quantitative traits.
Model selection is the principal problem in multiple QTL methods; the chief concern is the formulation of appropriate criteria for comparing
models. The simplest multiple QTL method,
multiple regression, should be used more widely,
although, like analysis of variance, it suffers in
the presence of appreciable missing marker genotype data. A forward selection procedure using
interval mapping (i.e. the calculation of conditional LOD curves) is appropriate in cases of
QTL that act additively and makes proper allowance for missing genotype data. MIM is an
improved method that, although computationally
intensive, can, in principle, map multiple QTL
and identify interactions between QTL. The
important aspects of the model-selection problem
require much further study and will not have general solutions. From results of QTL experiments
gathered over a wide range of plant species, it has
been shown that confidence intervals around the most
likely QTL positions are, on average, approximately 10 cM, which usually includes several
hundred genes. Several researchers have also
pointed out that QTL detection is statistically
biased both in the number of QTL, which is
underestimated since only QTL with large effects
are detected, and in the QTL effects, which are
overestimated as only significant effects are
reported (a phenomenon commonly referred
to as the Beavis effect). Much progress has been made
in methodological development on multiple QTL
mapping, threshold determination and Bayesian
QTL-mapping methods. This area has
advanced greatly through the interaction between
genotyping technologies and statistical methodologies in the last several years and will continue
to do so in the future. However, it is equally important that these tools are applied with a thorough
understanding of the genetic data and the tools
themselves.
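The 1-LOD support interval mentioned above can be read directly from a LOD curve. The following is a minimal sketch that assumes the curve has already been computed on a grid of map positions; it reports the contiguous grid points around the peak whose LOD is within the chosen drop and does not interpolate between grid points. The function name and the LOD values are hypothetical.

```python
def lod_support_interval(positions_cm, lod_scores, drop=1.0):
    """Return (peak position, left bound, right bound) of the LOD-drop interval."""
    if len(positions_cm) != len(lod_scores) or not positions_cm:
        raise ValueError("positions and LOD scores must have equal, non-zero length")
    peak = max(range(len(lod_scores)), key=lod_scores.__getitem__)
    cutoff = lod_scores[peak] - drop
    left = peak
    while left > 0 and lod_scores[left - 1] >= cutoff:
        left -= 1
    right = peak
    while right < len(lod_scores) - 1 and lod_scores[right + 1] >= cutoff:
        right += 1
    return positions_cm[peak], positions_cm[left], positions_cm[right]


# Hypothetical LOD curve evaluated every 5 cM along one chromosome.
positions = [0, 5, 10, 15, 20, 25, 30, 35, 40]
lods = [0.4, 1.1, 2.3, 3.6, 4.2, 3.5, 2.0, 0.9, 0.3]
peak_cm, left_cm, right_cm = lod_support_interval(positions, lods)
print(f"peak at {peak_cm} cM, 1-LOD support interval {left_cm}-{right_cm} cM")
```

A denser grid gives a finer interval; with widely spaced evaluation points, the reported bounds are only as precise as the grid itself.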

Alternatives in Classical QTL Mapping


Several alternative procedures are
available for QTL mapping besides the methods described above. These include bulked segregant
analysis, selective genotyping, association mapping and nested association mapping.

Bulked Segregant Analysis and Selective Genotyping
The construction of linkage maps and QTL analysis takes a considerable amount of time and effort
and may be very expensive. Therefore, alternative
methods that can save time and money would be
extremely useful, especially if resources are limited. Two short-cut methods that are commonly
used to identify markers linked to QTL are bulked
segregant analysis (BSA) and selective genotyping. Both methods require mapping populations.
BSA is a method used to detect markers located
in specific chromosomal regions (Michelmore
et al. 1991). Briefly, two pools or bulks of DNA
are each prepared by combining samples from 10 to 20 individual
plants from a segregating population; these two
bulks should differ for a trait of interest (e.g. resistant
vs. susceptible to a particular disease). By making
DNA bulks, all loci are randomised, except for the
region containing the gene of interest. Markers are
screened across the two bulks. Polymorphic markers may represent markers that are linked to a gene
or QTL of interest. The entire population is then
genotyped with these polymorphic markers, and a
localised linkage map may be generated. This
enables QTL analysis to be performed and the
position of a QTL to be determined. BSA is generally used to tag genes controlling simple traits, but
the method may also be used to identify markers
linked to major QTL. High-throughput or high-volume marker techniques such as RAPD or
AFLP (see chapter 3), which can generate multiple
markers from a single DNA preparation, are generally preferred for BSA.
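The logic of screening markers across two bulks can be sketched as follows. In practice the DNA is physically pooled and each marker is scored once per bulk; the sketch instead assumes, for illustration only, that individual genotypes are known (coded as 0, 1 or 2 copies of one parental allele) so that the allele frequency each bulk would show can be computed. Marker names, genotypes and the frequency-difference cut-off are hypothetical.

```python
def bulk_allele_frequency(genotypes):
    """Frequency of the parent-A allele in a bulk; genotypes are 0, 1 or 2 copies."""
    return sum(genotypes) / (2.0 * len(genotypes))


def screen_bulks(bulk_one, bulk_two, min_difference=0.6):
    """Return markers whose allele frequency differs strongly between the bulks."""
    candidates = {}
    for marker in bulk_one:
        diff = bulk_allele_frequency(bulk_one[marker]) - bulk_allele_frequency(bulk_two[marker])
        if abs(diff) >= min_difference:
            candidates[marker] = round(diff, 3)
    return candidates


# Hypothetical genotypes of ten resistant and ten susceptible plants at three markers.
resistant = {
    "M1": [2, 2, 2, 1, 2, 2, 2, 2, 1, 2],
    "M2": [1, 0, 2, 1, 1, 2, 0, 1, 1, 1],
    "M3": [0, 1, 1, 2, 1, 0, 2, 1, 1, 1],
}
susceptible = {
    "M1": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "M2": [1, 1, 0, 2, 1, 1, 1, 0, 2, 1],
    "M3": [1, 2, 0, 1, 1, 1, 0, 1, 2, 1],
}
candidates = screen_bulks(resistant, susceptible)
print(candidates)
```

Markers unlinked to the trait sit near a frequency of 0.5 in both bulks, so only the marker close to the gene of interest passes the cut-off and would be carried forward to genotyping of the entire population.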

Selective genotyping (also known as distribution extreme analysis or trait-based marker
analysis) involves selecting individuals from a
population that represent the phenotypic extremes
or tails of the trait being analysed (Lander and
Botstein 1989). In other words, the segregating
population is evaluated phenotypically as a first
step. Then, genotypic evaluation is performed on
only a subset of the population: those genotypes
that occur in the tails of the distribution of the
trait of interest. Linkage map construction and
QTL analysis are performed using only the individuals with extreme phenotypes. By genotyping
a subsample of the population, the costs of a mapping study can be significantly reduced. Selective
genotyping is typically used when growing and
phenotyping individuals in a mapping population
is easier and/or cheaper than genotyping using
DNA marker assays.
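The selection step of selective genotyping can be sketched in a few lines. The example assumes that every individual has already been phenotyped and that the same fraction is taken from each tail; the function name, the plant identifiers and the trait values are hypothetical.

```python
def select_extremes(phenotypes, tail_fraction=0.2):
    """Return (low tail, high tail) lists of individual IDs to be genotyped."""
    if not 0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5]")
    ranked = sorted(phenotypes, key=phenotypes.get)
    n_tail = max(1, int(len(ranked) * tail_fraction))
    return ranked[:n_tail], ranked[-n_tail:]


# Hypothetical plant heights (cm) for ten individuals of a segregating population.
heights = {"P01": 82, "P02": 95, "P03": 61, "P04": 110, "P05": 77,
           "P06": 103, "P07": 68, "P08": 90, "P09": 88, "P10": 120}
low_tail, high_tail = select_extremes(heights, tail_fraction=0.2)
print("genotype these individuals:", low_tail + high_tail)
```

Only four of the ten plants are genotyped here, which is where the cost saving comes from.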
The disadvantages of these methods are that
they are not efficient in determining the effects of
QTL and that only one trait can be tested at a time
since the individuals selected for extreme phenotypic values will usually not represent extreme
phenotypic values for other traits. Furthermore,
single-point analysis cannot be used for QTL
detection, because the phenotypic effects would
be grossly overestimated, and hence interval
mapping methods must be used (Lander and
Botstein 1989).

Genomics-Assisted Breeding
In the last decade, some scientific milestones,
including genome sequencing projects, EST databases and microarray technologies, have enhanced
the understanding of plant genomes and allowed for
the identification of genes responsible for a desired
trait. Besides using random markers derived from
anonymous polymorphic sites in the genome, it has
become possible to generate functional markers;
they are derived from polymorphisms within the
transcribed regions of the genome. Such markers
are completely linked to the desired trait allele and
have also been termed perfect markers. The main
limitations of applying random, non-perfect DNA
markers such as RFLPs, AFLPs or microsatellite
markers are the limited number of detectable polymorphisms, low throughput and high costs of assaying each locus. The development of SNPs allows
higher throughput, but still marker development and
PCR reactions are required. Thus, it was suggested
that marker-assisted breeding and selection will
gradually evolve into genomics-assisted breeding
(the term genomic selection is also used in some
publications). Currently, array mapping, association
mapping and EcoTILLING are often discussed as
methodologies within the context of genomics-assisted breeding; see chapter 10 for more
details.

Array Mapping
With the completion of the genomic sequences of
several model and crop plants (beginning with Arabidopsis thaliana, the first plant genome to be deciphered), plant
genomics moved on to the era of functional genomics. The mere sequence of a genome is of limited
value in revealing the function of genes. Gene
expression needs to be studied in the next step and
DNA microarrays have become the main technological approach to expression studies. Microarrays
(also known as biochips, DNA chips and gene
chips) were developed by Schena and co-workers
in 1995. There are several ways in which genes
can be arrayed, the two most common technologies being cDNA arrays and oligonucleotide
arrays. To construct an oligonucleotide array, oligonucleotides are synthesised in situ for setting up
the array, requiring knowledge of sequence data.
cDNA arrays are also applicable to non-model
organisms, as they only require a large cDNA
library and the development of ESTs. ESTs are
end segments of sequences from cDNA clones that
correspond to mRNA, that is, parts of expressed
genes. To construct a cDNA array, several thousand
ESTs are needed. A unique set of these ESTs is
amplified by PCR and used to construct the array.
Irrespective of cDNA arrays or oligonucleotide
arrays, the basic steps are the following: (1) mRNA
is extracted from the cells or tissues of a sample, (2)
the mRNA is converted into cDNA and fluorescently labelled,
(3) the labelled cDNA is hybridised with the array, whose probes have been robotically spotted (or synthesised in situ) onto a planar surface (often a glass
microscope slide or filter), so that labelled cDNA pieces
bind to their complementary counterparts on the
array, and (4) a laser scanner is used to measure
the fluorescent signal of the hybridised probes.
As the intensity of the signals from the samples
correlates with the original concentration of mRNA
in the cell/tissue, it can be estimated whether the
expression of a gene is up- or downregulated,
absent or unchanged. Besides RNA expression
profiling, microarrays offer opportunities for DNA
polymorphism analysis and have been found useful in linkage mapping, the dissection of QTL or
assessment of population structure. Fragments
matching the array feature sequence perfectly will
hybridise with a higher affinity than a fragment
mismatching the sequence, and thus every array
oligonucleotide has the potential to measure a
polymorphism. The sequence polymorphisms
detected as a difference in hybridisation intensity
between two samples function as molecular
markers and are referred to as single-feature
polymorphisms (SFPs; see chapters 3 and 10).
Microarrays can detect high numbers of SFP
markers, and as several hundred thousand loci can
be measured in a single experiment, all markers
can be scored simultaneously, thus allowing the
mapping of quantitative or multigenic trait loci.
No amplification steps, gels or enzymatic manipulations are required to carry out a microarray experiment, which
makes such high-density oligonucleotide arrays an
effective platform for identifying allelic variation.
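The way a difference in hybridisation intensity turns into a marker call can be sketched as follows. The example assumes replicate intensities for each array feature in two genotypes and uses a simple log2-ratio cut-off; published SFP methods use more careful statistics, so the threshold, the function name and all intensity values here are hypothetical.

```python
from math import log2
from statistics import mean


def call_sfps(intensities_a, intensities_b, min_abs_log2_ratio=1.0):
    """Flag features whose mean hybridisation intensity differs between two genotypes."""
    calls = {}
    for feature in intensities_a:
        ratio = log2(mean(intensities_a[feature]) / mean(intensities_b[feature]))
        if abs(ratio) >= min_abs_log2_ratio:
            calls[feature] = round(ratio, 2)
    return calls


# Hypothetical replicate intensities for three array features in two parents.
parent_a = {"f1": [1200, 1100, 1300], "f2": [800, 820, 780], "f3": [300, 280, 320]}
parent_b = {"f1": [310, 290, 300], "f2": [790, 810, 800], "f3": [1250, 1150, 1200]}
sfps = call_sfps(parent_a, parent_b)
print(sfps)
```

The same log2-ratio calculation applied to labelled cDNA rather than genomic DNA indicates whether a gene is up- or downregulated between two samples.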
Wolyn et al. (2004) developed a method called
eXtreme array mapping (XAM) that combines
array hybridisation with BSA in order to map
QTL, with the aim of reducing the time and effort
needed to genotype and map QTL. Within
each bulk, the individuals are identical for the
trait/gene of interest but arbitrary for all other
genes. Ideally, the two samples differ genetically
only in the selected region and are expected to
have equal mixtures of both parental genotypes at
loci unlinked to the mutation. The chromosomal
region linked to the gene causing the phenotype
will be fixed for alternative alleles between the
two pools. BSA has the advantage of identifying
markers associated with a trait without needing
the construction of a full genetic map. BSA is
widely used in many marker development
programs. One possibility in BSA is to hybridise
DNA from each pool to a microarray. In this way,
SFPs can be identified, indicating a genomic
region of interest containing alleles that can be
tested before introgression into elite germplasm.
Another application of the microarray technology to the analysis of DNA variation is the
Diversity Array Technology (DArT). Using
DArT, the presence and amount of a specific
DNA fragment can be assessed in the total
genomic DNA of an organism or a population.
DArT does not rely on DNA sequence information, and potential applications include germplasm characterisation, genetic mapping, gene
tagging or MAS. In terms of cost and speed of
marker discovery/analysis, DArT can be a good
alternative to other marker techniques such as
RFLP, AFLP, microsatellite markers or SNP
(see chapter 3). The major advantage of microarrays is the fact that gene expression patterns for a
large number of genes or even a whole genome
can be obtained in one experiment. As the elements placed on the chip are only between 20 and
200 µm in diameter and spaced only 50 µm apart,
a whole genome complement can be placed on
one chip.

Association Mapping
In plants, most of the QTL analyses have been
conducted in highly structured populations with
known pedigrees (such as F2 or backcross populations). However, in general, such structured
populations have two major limitations. First, the
limited number of recombination events results
in poor resolution for quantitative traits. Second,
only two alleles at any given locus can be studied
simultaneously. In order to increase the resolution of mapping populations, large populations
that have undergone several rounds of random
mating should be created. These rounds of mating increase the potential number of recombination events, and structured populations such as
recombinant inbred lines are potential resources
in this context. Despite these efforts, the resolution for many QTL is still several centimorgans
(cM), corresponding to hundreds of genes.
Additionally, the low number of alleles sampled
per locus in each population makes it difficult to
examine the full range of genetic diversity available in crop germplasm.
An increasingly common method
of refining the identification of QTL is the
production of near-isogenic lines (NILs) followed by positional cloning. Nevertheless, technical limitations, such as the lack of contiguous
coverage and the large amounts of repetitive DNA
in the genomes of many plant species, prevent the
successful implementation of positional cloning by
means of chromosome walking (see chapter 7).
Aside from these technical issues, positional
cloning may not be efficient at identifying genes
responsible for complex traits. This is due in part
both to the difficulty of developing NILs for loci
that explain less than 20% of the variance and to
constraints created by only using two alleles. Indeed,
the majority of genes cloned via positional cloning explain large portions of the phenotypic variation, for example, fruit weight2.2 in
tomato, teosinte branched1 (tb1) in maize,
heading date1 in rice and FRIGIDA and
CRYPTOCHROME2 in Arabidopsis. Further, the
production of NILs is a time-consuming process,
especially in long-generation species.
Similar kinds of limitations were documented
in animal genetics too. Linkage analysis has not
been successful in fine-scale mapping of disease
loci in humans because construction of organised
pedigrees from controlled breeding crosses is not
possible. Even when studying families with high
occurrence of a disease, it is often difficult to find
direct evidence of genetic recombination between
polymorphic sites. Therefore, the medical community turned to association analysis because
there were too few meioses in most families to
finely map diseases. Association analysis, also
known as linkage disequilibrium (LD) mapping
or association mapping, is a population-based
survey used to identify trait-marker relationships
based on LD. Unlike linkage analysis, where
familial relationships are used to predict correlations between phenotype and genotype, association methods rely on previous, unrecorded
sources of disequilibrium to create population-wide marker-phenotype associations. Genetic
diversity is evaluated across natural populations
to identify polymorphisms that correlate with
phenotypic variation. Association analysis is
extremely powerful because the individuals that
are sampled do not have to be closely related,
which harnesses all of the meiotic and recombination events among those individuals to improve
resolution. Because of these recombination
events, only markers in LD with a disease or trait
of interest will associate with the disease or trait.
Association analysis was successfully used for
the identification and cloning of the cystic fibrosis
gene, the diastrophic dysplasia gene and one of
the major Alzheimer's factors.
As in animals, association analysis recently
emerged as a powerful tool to identify QTL in
plants, thereby increasing mapping resolution
substantially over the current capabilities of
standard mapping populations. Association analysis has the potential to identify a single polymorphism within a gene that is responsible for
the difference in phenotype. In addition, many
plant species have high levels of diversity for
which association approaches are well suited to
evaluate the numerous alleles available. LD plays
a central role in association analysis. The distance
over which LD persists will determine the number and density of markers and experimental design
needed to perform an association analysis.
LD is also known as gametic phase disequilibrium, gametic disequilibrium and allelic association. Simply stated, LD is the nonrandom
association of alleles at different loci. It is the
correlation between polymorphisms (e.g. single-nucleotide polymorphisms (SNPs); see chapter 3)
that is caused by their shared history of mutation
and recombination. In a large, randomly mating
population with loci segregating independently
and in the absence of selection, mutation or migration, polymorphic loci will be in linkage equilibrium. In contrast, linkage, selection and admixture
will increase levels of LD.
The terms linkage and LD are often confused.
Although LD and linkage are related, they are distinctly different. Linkage refers to the correlated
inheritance of loci through the physical connection on a chromosome, whereas LD refers to the
correlation between alleles in a population. The
confusion occurs because tight linkage may result
in high levels of LD. For example, if two mutations occur within a few bases of one another, they
undergo the same pressures of selection and drift
through time. Because recombination between
the two neighbouring bases is rare, the presence
of these SNPs is highly correlated, and the tight
linkage will result in high LD. In contrast, SNPs
on separate chromosomes experience different
selection pressures and independent segregation,
so these SNPs have a much lower correlation or
level of LD. A variety of statistics have been used
to measure LD, and each method has its own relative advantages and disadvantages.
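Three of the most widely used LD statistics, D, D' and r^2, can be computed from the counts of the four two-locus haplotypes. The sketch below assumes two biallelic loci (alleles A/a and B/b) and phased haplotype counts; the function name and the counts are hypothetical.

```python
def ld_statistics(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) from counts of the four two-locus haplotypes."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / total   # frequency of allele B at the second locus
    d = p_AB - p_A * p_B          # deviation from independent association
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical sample of 100 haplotypes.
d, d_prime, r_squared = ld_statistics(n_AB=40, n_Ab=10, n_aB=10, n_ab=40)
print(f"D = {d:.3f}, D' = {d_prime:.2f}, r^2 = {r_squared:.2f}")
```

D' rescales D by its maximum possible value given the allele frequencies, whereas r^2 is the squared correlation between the two loci and is the quantity most directly related to the power of an association test.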
Because allele frequency and recombination
between sites affect LD, most of the processes
observed in population genetics are reflected in LD
patterns. Population mating patterns and admixture
can strongly influence LD. Generally, LD decays
more rapidly in outcrossing species as compared to
selfing species. This is because recombination
is less effective in selfing species, where individuals are more likely to be homozygous, than in
outcrossing species. Admixture is gene flow
between individuals of genetically distinct populations followed by inter-mating. Admixture results
in the introduction of chromosomes of different
ancestry and allele frequencies. Often, the resulting
LD extends to unlinked sites, even on different
chromosomes, but breaks down rapidly with random mating.
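The statement that LD between unlinked sites breaks down rapidly with random mating follows from the standard population-genetic expectation D_t = D_0 (1 - c)^t, where c is the recombination fraction and t the number of generations. The sketch below assumes a large, randomly mating population with no selection, drift or migration; the function name and the chosen values of c are illustrative.

```python
def ld_after_generations(d0, recombination_fraction, generations):
    """Expected LD after t generations of random mating: D_t = D_0 * (1 - c)^t."""
    return d0 * (1.0 - recombination_fraction) ** generations


# Fraction of the initial LD remaining after ten generations of random mating.
unlinked = ld_after_generations(1.0, 0.5, 10)        # loci on different chromosomes
tightly_linked = ld_after_generations(1.0, 0.001, 10)
print(f"unlinked: {unlinked:.4f}, tightly linked: {tightly_linked:.4f}")
```

LD between unlinked loci halves every generation, whereas LD between tightly linked loci is almost unchanged after the same number of generations.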
LD can also be created in populations that have
recently experienced a reduction in population
size (bottleneck) with accompanying extreme
genetic drift. During a bottleneck, only a few allelic
combinations are passed on to future generations.
This can generate substantial LD. Selection,
which produces locus-specific bottlenecks, also
causes LD between the selected allele at a locus
and linked loci. Moreover, selection for or against
a phenotype controlled by two unlinked loci may
result in LD despite the fact that the loci are not
physically linked. There are several explanations
for why the LD patterns are so different between
plant samples. First, most of the diversity in
plants such as maize is descended from an
extremely variable outcrossing wild relative with
large effective population sizes. Most of the
observed recombinant haplotypes were probably
generated before domestication of this wild relative. Hence, the different rates of LD decay reflect
differing levels of population bottleneck, that is,
the progression from diverse landraces to diverse
inbreds to elite inbreds. Additionally, the LD
reported between loci 100 kb apart likely includes
recombinationally inactive repetitive regions of the
genome, which are not present in the other studies.
The basic structure of LD is understood for
only a few plant species. There are still many issues
that need to be better studied and resolved before
LD can be used routinely to dissect complex
traits. The reluctance to use this technique in
plant systems and the mixed results seen in animal systems are due in large part to the effects of
population structure. The presence of population
stratification and an unequal distribution of alleles
within these groups can result in non-functional,
spurious associations. Highly significant LD
between polymorphisms on different chromosomes may produce associations between a
marker and a phenotype, even though the marker
is not physically linked to the locus responsible
for the phenotypic variation. Effective recombination rate is related to the degree of selfing that
a species exhibits. This is because recombination
is less effective in selfing species where individuals are more likely to be homozygous at a given
locus than in outcrossing species. Although
physical recombination may occur more often in
selfing species, recombination is rarely between
distinct alleles; hence, the amount of effective
recombination is fairly low. This relationship
between recombination and selfing can extend
to LD. Because effective recombination is
reduced severely in highly selfing species, LD
will be more extensive. As mentioned above, the extent of LD
decreases as the recombination fraction increases.
One must be cautious, however, when predicting
the structure of LD based on the present-day
mating system because the mating system may
have changed significantly, whether by natural
evolutionary processes or by human intervention.
Because selfing rates can change rapidly, it is
necessary to empirically determine the LD
structure before employing association-based
methods.
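The effect of selfing on effective recombination can be given a rough numerical form. The sketch below relies on an approximation from population-genetic theory that is not stated in this chapter and is therefore an assumption here: the equilibrium inbreeding coefficient is F = s / (2 - s) for a selfing rate s, and the effective recombination fraction is c(1 - F). Function names and input values are hypothetical.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = s / (2 - s) under partial selfing."""
    if not 0.0 <= selfing_rate <= 1.0:
        raise ValueError("selfing_rate must be between 0 and 1")
    return selfing_rate / (2.0 - selfing_rate)


def effective_recombination(recombination_fraction, selfing_rate):
    """Assumed approximation: effective recombination = c * (1 - F)."""
    return recombination_fraction * (1.0 - inbreeding_coefficient(selfing_rate))


# The same physical recombination fraction under three mating systems.
for selfing_rate in (0.0, 0.5, 0.99):
    c_eff = effective_recombination(0.01, selfing_rate)
    print(f"selfing rate {selfing_rate:.2f}: effective recombination = {c_eff:.6f}")
```

Under this approximation a species that selfs 99% of the time retains only about 2% of its physical recombination as effective recombination, which is consistent with the more extensive LD described above for highly selfing species.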

A major unresolved question is how genome
structure and the rate of recombination affect the
structure of LD across the genome. It is generally
accepted that different regions of genomes undergo
different rates of recombination. For example, in
maize, there is extensive evidence for tremendous
heterogeneity in rates of recombination across
the genome. There is also evidence that gene-rich stretches are likely to have more recombination than methylated, gene-poor regions. One
reason for decreased recombination in various
regions is that the retrotransposon composition
can be entirely different between two alleles.
Unfortunately, the direct connection between
the present locations of hot spots and structure of
LD produced through evolution has not been completely demonstrated in plants. However, it is
likely that this connection does exist, as in humans.
This suggests that predicting LD levels between
two sets of polymorphisms based solely on physical distance will be problematic. For example, two
sites at either end of a 5-kb gene might have very
little LD if the gene is a hot spot, whereas two sites
on either side of 100 kb of retrotransposons could
have very high levels of LD. The design of LD
mapping experiments and placement of SNPs will
require a thorough understanding of how these hot
spots are dispersed.
Association approaches have been the main
application of LD, but the nature of LD in the
population determines what type of association
approach can be conducted. There are mainly two
approaches: whole-genome scan and candidategene(s)-based analysis. The rate of LD decay
determines which of these two approaches can
be used in association mapping.
In whole-genome scans, markers distributed across the genome are employed to evaluate
all genes simultaneously. For example, the human
genome may require 70,000 markers, Arabidopsis
2,000 markers and diverse maize landraces
750,000 markers, whereas only 50,000 markers
are required for elite maize lines. The first association study to attempt a genome scan in plants
was conducted in sea beet (Beta vulgaris ssp.
maritima), a wild relative of sugar beet (Beta
vulgaris ssp. vulgaris) (Hansen et al. 2001). For
species other than Arabidopsis, rice and crops
that have physical maps, this could be a hefty
number of markers, although technological
improvements in the future may enable the scoring of such huge numbers of markers. Despite these
advances in genotyping, the key problems in association mapping are the large amount of resources
needed for phenotyping and the resolution of statistical
issues. Statistical significance in a genome scan
could only be obtained with large sample sizes of
thousands of individuals for QTL that explain
modest amounts of variation.
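The dependence of marker number on the rate of LD decay can be approximated by dividing the genome size by the distance over which LD persists, which gives roughly one marker per LD block. This is a back-of-the-envelope sketch, not the calculation behind the figures quoted above; the genome size and LD distances are hypothetical.

```python
def markers_for_genome_scan(genome_size_bp, ld_extent_bp):
    """Rough lower bound: one marker for every stretch over which LD persists."""
    if genome_size_bp <= 0 or ld_extent_bp <= 0:
        raise ValueError("genome size and LD extent must be positive")
    return (genome_size_bp + ld_extent_bp - 1) // ld_extent_bp  # integer ceiling


# Hypothetical 2,500 Mb genome under two different extents of LD.
genome_size = 2_500_000_000
for label, ld_extent in (("LD persists for 50 kb", 50_000), ("LD persists for 2 kb", 2_000)):
    n_markers = markers_for_genome_scan(genome_size, ld_extent)
    print(f"{label}: about {n_markers:,} markers")
```

A population in which LD extends 25 times further needs 25 times fewer markers for the same genome, at the price of correspondingly lower mapping resolution.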
There are two ways to circumvent this problem: either a population with greater levels of LD
can be chosen or the analysis can be restricted to
candidate gene regions. By choosing a bottlenecked population, one can substantially increase
genome-wide LD. The limitation of this approach
is that the appropriate populations must be
identified, and by their nature, these bottlenecked
populations will only contain a subset of the total
variation. Again, it is necessary to point out that
novel alleles outside the elite germplasm will not
be identified. The candidate gene association
approaches rely on combining multiple lines of
evidence to restrict the numbers of genes that are
evaluated. Genome sequencing, comparative
genomics, transcript profiling, low-resolution
QTL analysis and large-scale knockouts all
provide opportunities to develop and refine candidate gene lists. These approaches are powerful at
identifying candidate genes but not at evaluating
allelic effects. The first association study of a
quantitative trait based on a candidate gene was
the analysis of flowering time and the dwarf8 (d8)
gene in maize by Thornsberry et al. in 2001.
The candidate gene approach can substantially
reduce the amount of genotyping required, but
most importantly, it can reduce the multiple-testing issues
created by testing thousands of sites across the
genome. The statistical issues in combining these
disparate types of evidence have not been resolved.
In plants, another way to conduct a genomic scan
is to use F1-derived mapping populations. These
populations are efficient for doing a genome scan,
as often only a few hundred markers are needed.
Because only two alleles are being evaluated,
these populations will have more statistical power
to evaluate the effect of a chromosomal region in
comparison to association mapping. Additionally,
there is more statistical power to evaluate epistasis. The advantages of association mapping in
terms of resolution, speed and allelic range are
complementary to the strengths of F2-based QTL
mapping, namely, marker efficiency and statistical power. There are two commonly used programs for association mapping: TASSEL
(http://www.maizegenetics.net/tassel) and STRUCTURE
(http://pritch.bsd.uchicago.edu/structure.html).
Readers are encouraged to consult these websites and
manuals for detailed procedures for association
mapping, which are self-explanatory and simple
to follow. The free website
http://www.extension.org/pages/62755/association-mapping-and-tasselsoftware-tutorial
may also be visited for further
technical tips.
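The multiple-testing burden that separates a genome scan from a candidate gene analysis can be made concrete with a Bonferroni correction. Bonferroni is used here only as a simple, conservative illustration; it is an assumption of this sketch rather than a method prescribed in this chapter or by the programs above. Function names, marker names and p-values are hypothetical.

```python
def bonferroni_threshold(n_tests, family_alpha=0.05):
    """Per-test significance threshold that keeps the family-wise error at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def significant_markers(p_values, family_alpha=0.05):
    """Return the markers whose p-value passes the Bonferroni threshold."""
    threshold = bonferroni_threshold(len(p_values), family_alpha)
    return {marker: p for marker, p in p_values.items() if p <= threshold}


print("genome scan, 50,000 markers:", bonferroni_threshold(50_000))
print("candidate gene, 20 sites:", bonferroni_threshold(20))

# Hypothetical p-values from three tested sites.
hits = significant_markers({"m1": 1e-5, "m2": 0.03, "m3": 0.2})
print(hits)
```

An association must be far stronger to survive correction in a genome scan than in a candidate gene study, which is one reason why genome scans need such large sample sizes.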

Nested Association Mapping


From the above discussions, it is obvious that
linkage analysis often identifies broad chromosome regions of interest with relatively low marker
coverage, while association mapping offers
high resolution with either prior information on
candidate genes or a genome scan with very high
marker coverage. An integrated mapping strategy
would combine the advantages of the two
approaches to improve mapping resolution without requiring excessively dense marker maps.
Nested association mapping (NAM) has been
proposed as a genome-wide complex trait dissection strategy that integrates the advantages of
linkage analysis and association mapping in a
single, unified mapping population. The proposed
procedure in NAM involves the following steps:
(1) selecting diverse founders and a single reference line for developing a large set of related
mapping progenies, preferably recombinant inbred
lines (RILs), for robust phenotypic trait collection,
(2) either sequencing completely or densely genotyping the founders, (3) genotyping a smaller
number of tagging markers on both the founders
and the progenies to define the inheritance of
chromosome segments and to project the high-density marker information from the founders
to the progenies, (4) phenotyping progenies for
various complex traits and (5) conducting genome-wide association analysis relating phenotypic
traits with projected high-density markers of the
progenies. The aims of the experimental design in
NAM are to (1) capture crop genetic diversity, (2)
exploit ancestral recombination, (3) efficiently
take advantage of next-generation sequencing
technologies through genetic design, (4) generate
mapping materials that can be evaluated for agronomic traits at field locations of temperate regions,
(5) develop a mapping population that has
sufficient power to detect numerous QTL and
resolve them to a level of individual genes and (6)
provide a community resource. Thus, NAM has
several advantages, and Yu et al. (2008) have
provided a detailed comparison of the main
characteristics of different mapping strategies.
In NAM, the advantages of designed mapping
populations from linkage analysis and of high
resolution from association mapping are integrated through the development of a large number
of RILs from diverse founders. While the common-parent-specific markers allow the prediction of transmission of chromosome segments in
RILs, the short range of LD within these segments
across the diverse founders enables improved
mapping resolution. The genetic background
effect of these parental founders on mapping individual QTL, which can be a hurdle for association
mapping, is systematically minimised by
reshuffling the genomes of the two parents of each
cross during RIL development as well as by the
combined analysis of all RILs across all the
crosses. In general, the strategy of projecting
sequence information, nested within informative
markers, from the most connected individuals to
the remaining individuals is applicable to a wide
range of crop species though it was first shown in
maize.
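As a rough illustration of step (3) above, the projection of founder marker information onto a progeny line can be sketched as follows. This is a minimal Python sketch under simplifying assumptions, not the procedure of any published NAM study: the function name `project_dense_markers`, the 'F'/'R' coding of segment origin and the toy positions and alleles are all hypothetical, and a high-density marker is assigned only when both flanking tagging markers agree on parental origin.

```python
from bisect import bisect_left

def project_dense_markers(tag_pos, tag_origin, dense_pos,
                          founder_alleles, reference_alleles):
    """Project parental high-density alleles onto one progeny line.

    tag_pos: sorted positions (cM) of tagging markers typed on the progeny.
    tag_origin: 'F' (founder) or 'R' (reference line) at each tagging marker.
    dense_pos: positions of high-density markers typed only on the parents.
    Returns one allele per dense marker, or None when the origin is ambiguous.
    """
    projected = []
    for pos, f_allele, r_allele in zip(dense_pos, founder_alleles, reference_alleles):
        i = bisect_left(tag_pos, pos)  # index of the first tag at or beyond pos
        if i < len(tag_pos) and tag_pos[i] == pos:
            left = right = i  # dense marker coincides with a tagging marker
        else:
            left, right = i - 1, i
        if left < 0 or right >= len(tag_pos):
            projected.append(None)  # outside the region covered by tags
        elif tag_origin[left] != tag_origin[right]:
            projected.append(None)  # a crossover lies between the flanking tags
        elif tag_origin[left] == "F":
            projected.append(f_allele)
        else:
            projected.append(r_allele)
    return projected

# Hypothetical RIL with one crossover between 10 and 20 cM.
example = project_dense_markers(
    tag_pos=[0.0, 10.0, 20.0],
    tag_origin=["R", "R", "F"],
    dense_pos=[5.0, 15.0, 20.0, 25.0],
    founder_alleles=["A", "C", "G", "T"],
    reference_alleles=["T", "G", "A", "C"],
)
print(example)
```

In this toy case the marker at 15 cM is left unassigned because the flanking tagging markers disagree; a real analysis would typically weight the two possible origins by the distance to each flanking marker rather than discard the marker.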

EcoTILLING
EcoTILLING is based on the methodology of
TILLING (Targeting Induced Local Lesions IN
Genomes), which was developed as a strategy in
reverse genetics (McCallum et al. 2000).
TILLING is a methodology that identifies DNA
polymorphisms regardless of phenotypic consequence, allows the identification of single-base-pair allelic variation in target genes and can be
applied to any organism that can be chemically
mutagenised. It is, on the one hand, an attractive
strategy for functional genomics and, on the
other hand, also attractive for agricultural applications. TILLING requires relatively few individual plants and is therefore appropriate for
small- and large-scale screening. In TILLING,
traditional chemical mutagenesis is followed by
PCR-based screening to identify point mutations in regions of interest. First, the regions of
interest are amplified by PCR. By denaturing
and re-annealing the PCR products, heteroduplex molecules between wild-type fragments
and mutated fragments form, provided that at
least one plant in the pool includes a mutation in
the amplified region. The resultant double-stranded products are digested by CEL I, an
endonuclease that specifically targets and digests
heteroduplexes at mismatch positions. The
cleaved products are resolved on denaturing
polyacrylamide gels, individuals carrying a
mutation in the gene of interest are identified
and the mutant PCR product is sequenced. The
TILLING methodology has been adapted to the
discovery of polymorphisms in natural populations, termed EcoTILLING by Comai et al.
(2004). Cleavage with CEL I allows multiple mismatches in a DNA duplex to be displayed.
If an unknown homologous DNA is heteroduplexed to a known sequence, the number and
position of polymorphisms can be revealed, and
the approximate position of each SNP within a
few nucleotides is recorded. EcoTILLING is
applicable to any species, including heterozygous and polyploid ones. It often compares
favourably to full sequencing because it reduces
the number of sequences that need to be
determined in order to identify a point mutation in a gene of interest. It is considered that
TILLING/EcoTILLING remains at the moment
the technique of choice for medium- to high-throughput reverse genetics in many organisms.
EcoTILLING is gel based and thereby a low-cost method. As a marker system, it combines
two advantages. Being based on the gene of
interest itself, it has the advantage of a functional marker, and it produces a high number of
marker alleles because every SNP in the
amplified sequence results in a change in the
overall fragment pattern. Currently, EcoTILLING
and microarrays, as two methods for natural
polymorphism discovery, seem to be two
complementary tools. While microarrays have
their strength in the detection of global natural polymorphisms among a few genotypes,
EcoTILLING is better suited for surveying
diversity at specific loci among many genotypes.
In general, it can be expected that developments
in marker technologies during the next few years
will go along with the development of sequencing technologies. The new generation of
sequencing technologies, called next-generation
sequencing, that has become available during
the last few years permits the rapid production
of sequence information, and it can be expected
that sequence information of many different
crop plants will become available soon.

Challenges in QTL Mapping


Although a huge number of studies on QTL mapping
of agronomically and economically important traits in several crop plants have
been published, geneticists, statisticians and breeders
have repeatedly shown that the
QTL-mapping strategies used in these publications
have several limitations. These limitations, and the different
approaches that can be employed to overcome
them, are discussed below.

Challenges with Mapping Populations


There are several types of experimental design
that are suitable for QTL analysis, depending on
the mating system of the crop species. Advantages
and limitations of each system in QTL analysis
are discussed in chapter 2. Most QTL analysis in
plants involves populations derived from pure
lines, and several approaches have been developed to associate QTL with molecular markers in
such populations. In autogamous species, QTL-mapping studies commonly make use of F2 or
backcross progenies because they are the easiest
and earliest to obtain. An F2 is better than a
backcross because, in a backcross, QTL at which the donor allele is
recessive to the recurrent-parent allele cannot be detected, and when
dominance is present, backcrosses give biased
estimates of the effects because additive and
dominance effects are completely confounded
in this design. The degree of dominance can be
estimated in F2 progenies, but F2 and backcross
populations have two important drawbacks:
the genotype cannot be replicated
(and therefore cannot be evaluated several times
or in several environmental conditions, different
years, locations, etc.), as it can with doubled
haploids (DHs) or recombinant inbred lines
(RILs), and epistatic interactions can hardly be
studied. When n pairs of genes segregate independently, the number of different gametes is 2^n,
while the number of possible genotypes in an F2
is 3^n; that is, with doubled haploids or RILs,
fewer individuals need to be screened (and this is
economically very important when using molecular markers) to cover a similarly wide spectrum
of recombinants. Using simulated populations, it
was concluded that the DH population (also valid
for a RIL population) could be used with smaller
sample sizes because of their advantage over
backcrosses. Moreover, more accurate estimates
of the location of the QTL were obtained with
less variance. This result is to be expected because
the interval mapping approach, in the absence of
overdominance, uses more widely separated
genotypic values than in a backcross.
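The difference between the number of gametic classes and the number of F2 genotypic classes can be made concrete with a short calculation. The Python sketch below is only illustrative; the function names and the chosen values of n are assumptions made for this example.

```python
def gamete_classes(n_loci):
    """Distinct gametes (and homozygous DH/RIL genotypes) for n independent loci."""
    return 2 ** n_loci

def f2_genotype_classes(n_loci):
    """Distinct genotypes in an F2 for n independent loci (three per locus)."""
    return 3 ** n_loci

for n in (1, 5, 10):
    print(n, gamete_classes(n), f2_genotype_classes(n))
```

With ten independently segregating loci there are 1,024 possible DH or RIL genotypes but 59,049 possible F2 genotypes, which is why fewer DH or RIL individuals are needed to sample a comparable range of recombinants.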
For RILs or DHs, the power of detecting a
given quantitative trait locus is clearly related to
its relative contribution to the heritability of the
character. The power of the test was about 90%
for heritabilities of QTL as low as 5%. To obtain
a similar power for backcrosses, the heritability
attributable to the individual quantitative trait
locus should be around 14%. For a given type of
gene action, it seems that DHs have a similar
power to an F2. However, if dominance is present,
DHs or RILs will only detect the additive component of a particular quantitative trait locus. This
could be very important for QTL showing
overdominant (or pseudo-overdominant) effects.
The major technical advantage for DHs or RILs,
independent of any effect of replication on the
required number of offspring, lies in the fact that
the lines can be reproduced independently and
continuously evaluated with respect to additional
quantitative traits and markers with all the information being cumulative. If the effect of replication is taken into account, replicated progenies
can bring about a major reduction in the number
of lines that need to be scored. Reductions are
greatest when heritability of the trait is low, under
the assumption of co-dominance at all QTL.
Current statistical methods for mapping QTL
based on controlled crosses are well-developed
(Table 6.1). These methods depend critically on
well-defined mapping pedigrees, such as F2, F3 or
backcrosses, initiated with two inbred lines. The
development of such pedigrees is extremely
difficult in outcrossing species, particularly fruit
and forest trees, owing to high heterozygosity
(probably maintained by recessive lethals) and
long generation intervals. Therefore, other strategies based on half- or full-sib families derived
from controlled crosses have been proposed for
outcrossing species. Alternatively, another
approach that takes advantage of the haploid tissue known as the megagametophyte in gymnosperms has been proposed. To be able to apply
the MAPMAKER program (see chapter 4), a
full-sib family is usually analysed as a double
pseudo-testcross, enabling the construction of a
map for each parent and the utilisation of dominant markers (e.g. RAPD). In the cross between
two heterozygous individuals, many single-dose
RAPD markers will be heterozygous in one
parent, null in the other and therefore segregate
1:1 in their progeny following a testcross
configuration. Two separate data sets are then
obtained, one for each parent. This is very convenient when parents belong to different species or
genera since they may differ in gene order because
of translocations, inversions or deletions during
evolution. QTL-mapping studies that use a
pseudo-testcross format differ from those that
use inbred populations in that up to four different
quantitative trait locus alleles (and marker alleles)
may be segregating. Because the two parents do
not derive from the same F1 individuals, the
marker alleles in each may differ in state and in
phase from the QTL alleles. If genotypes are
entered into MAPMAKER as obtained, without
giving the phase or considering both possibilities
per locus, linked markers that differ in phase will
be placed in different linkage groups, although
they are closely linked. An important limitation
of the pseudo-testcross design is that only the
effect of an allele substitution (substituted by the
alleles of the other parent) can be tested, which is
much less powerful than the classical testing. In
other words, in addition to the effect of allele
substitution, only genotypic values can be estimated. If dominant markers are used, the phase
and power limitations clearly increase, although
many studies ignore it.
In considering how many progenies in a mapping population to obtain and how many markers
to type, one thinks about both the chance of
detecting QTL and the resolution of localisation
of QTL. The chance of detecting a QTL is called
the power. Suppose that under the null hypothesis
of no segregating QTL, one obtains a genome-wide maximum
LOD score of at least 3 only 5%
of the time, so that a threshold of 3.0 may be used to
define significant evidence for the presence of a
QTL. In this case, the power to detect a QTL is
the chance that one will obtain an LOD score
above 3 in the region of the QTL. Power depends
on the type of cross, the size of the effect of the
QTL, the number of progenies obtained, the density of typed markers in the region of the QTL
and the stringency of the chosen LOD threshold
(i.e. the significance level). When a QTL has an
effect of only moderate size, this power can be
extremely low. It is possibly more interesting to
consider the power to detect at least one QTL. If
there are 10 unlinked QTL segregating in a cross
and for each of them the power is only 20%, one
will still have approximately 90% power to detect
at least one of them. This has implications for the
replication of experiments; if there are many
moderate-sized QTL segregating in a particular
cross, the set of QTL for which one will obtain
strong evidence may be quite different from one experiment to the next. Of course,
QTL with quite strong effect will be detected
with high power and so will be seen with each
group of progenies.
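The figure of approximately 90% quoted above follows directly from the complement rule. The short Python sketch below reproduces it, assuming that detection of each unlinked QTL is an independent event with the same power; the function name is hypothetical.

```python
def power_at_least_one(power_per_qtl, n_qtl):
    """Probability of detecting at least one of n_qtl QTL,
    assuming independent detection with equal power for each."""
    return 1.0 - (1.0 - power_per_qtl) ** n_qtl

print(round(power_at_least_one(0.20, 10), 3))  # about 0.893
```

The same calculation shows why two replicate experiments can each detect several QTL and yet share few of them: each individual QTL is still found only one time in five.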
However, with a mapping population size
of 200 typed at 1 cM spacing, the precision of
localisation of the QTL is greatly improved. But
these results are not necessarily typical. It is recommended that initial genotyping in an experimental cross be performed with markers at a
10 to 15 cM spacing. It is also suggested that for
markers spaced at 10 cM or closer, there is really
little point in increasing marker density when the
goal is simple detection of a linked QTL. Typing
additional markers in the region of an inferred
QTL may improve the resolution of its localisation, but such improvement will likely only occur
if one has typed many progenies in that population or the QTL has a relatively large effect.

Markers and Their Implications


There is no absolute value for the number of
DNA markers required for a genetic map, since
the number of markers varies with the number
and length of chromosomes in the organism. For
detection of QTL, a relatively sparse framework
(or skeletal or scaffold) map consisting of
evenly spaced markers is adequate, and preliminary genetic mapping studies generally contain
between 100 and 200 markers. However, this
depends on the genome size of the species; more
markers are required for mapping in species with
large genomes. It was repeatedly shown that the
power of detecting a QTL was virtually the same
for a marker spacing of 10 cM as for an infinite
number of markers and only slightly decreased
for marker spacing of 20 or even 50 cM.
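Because the number of markers needed depends on the number and length of the chromosomes, a quick estimate for an evenly spaced framework map can be computed as below. This is a minimal Python sketch; the function name and the example genome of ten chromosomes of 100 cM each are hypothetical values chosen for illustration only.

```python
import math

def framework_marker_count(chromosome_lengths_cm, spacing_cm):
    """Markers needed for an evenly spaced framework map,
    placing one marker at the start of each chromosome and one every spacing_cm."""
    return sum(math.floor(length / spacing_cm) + 1
               for length in chromosome_lengths_cm)

# Hypothetical genome: ten chromosomes of 100 cM each.
lengths = [100] * 10
for spacing in (10, 20, 50):
    print(spacing, framework_marker_count(lengths, spacing))
```

For this hypothetical genome a 10 cM framework requires 110 markers, which falls within the range of 100 to 200 markers mentioned above for preliminary mapping studies.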
Typically, when investigations focus on
questions of genomic location, then more sophisticated methods of QTL analysis, which rely on
the estimated order of markers, are used. The
added information that is gained from knowing
the relationships between markers is essential to
QTL methodologies that aim to locate QTL. The
accuracy of locating QTL is limited by the information, in particular the number of recombinants
that is gained from observing the genotypic states
of the markers. These observed recombinants can
be limited by both small sample size and missing
genotypic data. A question that researchers very often
ask at this stage is "Should I
genotype more markers on fewer individuals or
score more individuals (for genotype and
phenotype) on fewer markers?" Because observed
recombinants provide the information, scoring
more individuals addresses the concerns mentioned
previously.

Segregation Distortion
The first step in any QTL-mapping experiment
is usually to construct populations that originate from homozygous, inbred parental lines.
The resulting F1 lines will tend to be heterozygous at all markers and QTL. From the F1 population, crosses are made (e.g. backcross, F2
intercross and crosses to generate recombinant
inbred lines), and the segregation of markers
and QTL is statistically modelled. In general,
experimenters assume that markers are segregating randomly, but if markers are subject to
segregation distortion, it is not possible to
anticipate how the resulting estimates of recombination, and hence any potential QTL locations, will be affected. Two important issues should
be considered when assessing these statistical
results. The first consideration is sample size.
The number of individuals studied provides
information for the estimation of phenotypic
means and variances. A large sample of individuals provides the opportunity to observe
recombinant events (and thus to gain knowledge
of segregation distortion) and to estimate
parameters with greater accuracy and, therefore, gives a greater ability to detect QTL.
Missing data and markers with distorted segregation may make the order of markers difficult
to determine. In particular, markers deviating
significantly from expected Mendelian segregation ratios and markers with fewer than 100 data
points are usually excluded from the QTL analysis. High
marker density is often seen as a guarantee of
a high-quality QTL analysis regardless
of the proportion of dominant versus co-dominant markers or the reliability of the order of
markers. At the same time, the abundance of
dominant markers (RAPDs, AFLPs) may cause
problems in the construction of maps and in the
analysis of QTL by interval mapping procedures.
In QTL analysis, the genotype at a chromosomal
position is inferred from the genotype of the marker
at that position. If the marker cannot distinguish
between the genotypes in the progeny (e.g. a
dominant marker in an F2), such reduction in
information affects the power of QTL detection.
In cases where the markers are very tightly
linked, analysis of hundreds of segregating progeny may be required to determine the correct
order of markers. Linkage maps with a high density of markers therefore have to be obtained
from huge segregating populations. An alternative methodology for constructing dense genetic
linkage maps has been recently reported (Jansen
et al. 2001). It is based on simulated annealing to
obtain the best map according to the number of
recombination events. It uses the Gibbs sampler
for missing data imputation and, notably, establishes posterior intervals for the positions of
markers, as a measure of precision of the genetic
linkage map obtained.
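The screening of markers for deviation from Mendelian ratios mentioned in this section is commonly done with a chi-square goodness-of-fit test. The sketch below is a minimal Python illustration and not a prescribed procedure: the function names, the genotype counts and the use of a fixed 5% significance level are assumptions, while the critical values 3.841 (1 degree of freedom) and 5.991 (2 degrees of freedom) are the standard chi-square values at that level.

```python
# Standard chi-square critical values at the 5% level, keyed by degrees of freedom.
CRITICAL_5PCT = {1: 3.841, 2: 5.991}

def chi_square(observed, ratio):
    """Goodness-of-fit statistic of observed counts against an expected ratio."""
    total = sum(observed)
    expected = [total * r / sum(ratio) for r in ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def is_distorted(observed, ratio):
    """True if the marker deviates from the expected ratio at the 5% level."""
    df = len(observed) - 1
    return chi_square(observed, ratio) > CRITICAL_5PCT[df]

# Hypothetical co-dominant marker in an F2 (expected 1:2:1).
print(chi_square([48, 105, 47], [1, 2, 1]), is_distorted([48, 105, 47], [1, 2, 1]))
# Hypothetical marker in a backcross (expected 1:1).
print(chi_square([130, 70], [1, 1]), is_distorted([130, 70], [1, 1]))
```

When many markers are tested, a correction for multiple testing would normally be applied before markers are discarded; that step is omitted here for brevity.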

Phenotyping
The accuracy of phenotypic evaluation is of the
utmost importance for the accuracy of QTL
mapping (see chapter 5). A reliable QTL map can
only be produced from reliable phenotypic data.
Replicated phenotypic measurements or the use
of clones (via cuttings) can be used to improve
the accuracy of QTL mapping by reducing background noise. Thorough studies should include
phenotypic evaluations that have been conducted
in both field and glasshouse trials, and QTL-mapping studies should be independently
confirmed or verified. Such confirmation studies
(referred to as replication studies) may involve
independent populations constructed from the
same parental genotypes or closely related genotypes used in the primary QTL-mapping study.
Sometimes, larger population sizes may be used.
Furthermore, some recent studies have proposed
that QTL positions and effects should be evaluated in independent populations because QTL
mapping based on typical population sizes results
in a low power of QTL detection and a large bias
of QTL effects. Unfortunately, due to constraints
such as lack of research funding and time, and
possibly a lack of understanding of the need to
confirm results, QTL-mapping studies are rarely
confirmed. An important issue for QTL detection
in breeding populations is that the phenotypic
data from breeding programs is often generated
by combining multiple trials, thus resulting in
unbalanced designs. Another important consideration is that a statistically sound joint analysis of
the phenotypic data requires overlapping genotypes between different trials, locations and
years (breeding cycles). Another crucial factor
that strongly determines the success of a QTL-mapping experiment is the phenotyping intensity.
High heritabilities are a prerequisite for reliable
QTL results and a high predictive power of the
detected QTL, that is, a low bias in the estimation
of the proportion of genotypic variance explained
by these QTL.
Another major concern in trait evaluation is
not only reducing environmental variation relative to genetic variation but also
the distribution of trait values in the segregating
populations. Some deviations from normality can be
corrected by a variable transformation (log10,
arcsin, etc.). For others, nonparametric tests for
QTL detection should be used. Again, many studies ignore these features and their effects on QTL
analysis and on the efficiency and profitability of MAS.
Also, the trade-off between extent of replication
and environments over which the progeny needs
to be evaluated versus number of progeny should
be considered. The cost-effectiveness of all of
these depends upon the relative costs for genotypic and phenotypic analyses, of course. It is
clear that a single approach to the QTL analysis
of a quantitative trait is never enough to fully
understand its genetic control.
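The variable transformations mentioned above (log10 and arcsin) can be sketched in a few lines of Python. This is an illustrative example only: the function names and the toy trait values are hypothetical, the arcsin transformation is assumed to be the usual arcsine square-root form applied to proportions, and sample skewness is used here simply as a rough indicator of asymmetry.

```python
import math

def log10_transform(values):
    """Log10 transformation for positive, right-skewed trait values."""
    return [math.log10(v) for v in values]

def arcsine_sqrt_transform(proportions):
    """Arcsine square-root transformation for proportions between 0 and 1."""
    return [math.asin(math.sqrt(p)) for p in proportions]

def skewness(values):
    """Sample skewness: third central moment over the 1.5 power of the variance."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed trait values.
raw = [1, 2, 4, 8, 16, 32, 64]
print(skewness(raw), skewness(log10_transform(raw)))
```

In this toy data set the raw values are strongly right-skewed, whereas the log-transformed values are symmetric. Real data rarely behave so neatly, and when no transformation is adequate the nonparametric tests mentioned above remain the alternative.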
As with genes, QTL effects may be environmentally sensitive, and this sensitivity results in
phenotypic plasticity or the ability of the organisms to take on alternative developmental fates,
depending on environmental cues. Phenotypic
plasticity is likely to be of particular importance
in plants since their sedentary nature dictates that
they adjust to their local environment. Species
with great phenotypic plasticity have been seen
as likely progenitors for novel species which
express only one of the possible developmental
fates of their ancestors. It has been shown that selection during maize domestication for a QTL allele
(teosinte branched1), which lacks environmental
plasticity, may have led to the fixation of a morphological form that can be induced in teosinte
(its ancestor species) by environmental conditions.
Many authors treat G × E interaction at the
level of QTL as a lack of consistency of
QTL effects across environments and conclude
that such QTL are of little interest for MAS purposes.
However, if a QTL shows G × E interaction, then
selection of genotypes adapted to specific environments may well be achieved. The proportion
of this kind of QTL is especially impressive in
fruit and forest tree species. Selection pressure
on phenotypic plasticity has to be stronger on
perennials than on annuals. Following this reasoning, plasticity (ability to change gene expression depending on environmental conditions)
should be the rule in tree species rather than
the exception. In any case, the study
of G × E interaction needs carefully designed
experiments with several replications of each
genotype per environmental condition tested,
which is not usually achieved in QTL studies of
woody species.
For traits with low heritability, extensive replication and evaluation across different environments is critical to get good estimates of QTL
effects. It is suggested that larger population sizes
and more phenotypic testing are higher priorities
than making dense linkage maps (e.g. increasing
marker density beyond one marker per 15 to 20 cM).
Other effects of small sample size include underestimation of the number of QTL involved in a
trait because the power of the QTL significance
tests is reduced. Simultaneously, the effects of
QTL that are detected with small progeny sizes
are overestimated, sometimes greatly so. The r^2
values based on studies with small population
sizes may be impressively high, but they are
probably not realistic. In the few cases when the
QTL models developed in small populations are
tested against independent validation data sets
with larger populations, the real amount of variation they explain is much less. Studies
of the predictive power of QTL mapping using
cross-validation techniques have also reported that
QTL mapped in populations of typical size have
poor predictive power in independent samples
from the same population. Thus, perhaps we
should be less concerned with Type I errors
(finding false positive QTL) than with Type II
errors (missing real QTL).
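The overestimation of QTL effects in small populations described above can be illustrated with a small simulation. The Python sketch below is a toy model under stated assumptions, not a reproduction of any published study: it assumes a backcross with a fully informative marker located exactly at a single QTL, a residual standard deviation of 1, a true effect of 0.5 and an approximate significance threshold of |t| > 2.0; all function names and parameter values are hypothetical.

```python
import random
import statistics

def simulate_backcross(n, effect, rng):
    """Simulate one backcross of n progeny; return (estimated effect, t statistic)."""
    groups = {0: [], 1: []}
    for _ in range(n):
        genotype = rng.randint(0, 1)  # two genotype classes, expected 1:1
        groups[genotype].append(rng.gauss(effect * genotype, 1.0))
    a, b = groups[0], groups[1]
    if len(a) < 2 or len(b) < 2:
        return None  # cannot estimate a variance in this replicate
    diff = statistics.mean(b) - statistics.mean(a)
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return diff, diff / se

def mean_significant_estimate(n, effect, n_reps=1000, t_threshold=2.0, seed=1):
    """Mean estimated effect among significant replicates, and the empirical power."""
    rng = random.Random(seed)
    significant = []
    for _ in range(n_reps):
        result = simulate_backcross(n, effect, rng)
        if result is not None and abs(result[1]) > t_threshold:
            significant.append(result[0])
    return statistics.mean(significant), len(significant) / n_reps

small_mean, small_power = mean_significant_estimate(n=40, effect=0.5)
large_mean, large_power = mean_significant_estimate(n=400, effect=0.5)
print(small_mean, small_power)
print(large_mean, large_power)
```

With 40 progeny only a minority of replicates reach significance, and those that do report an effect well above the true value of 0.5; with 400 progeny nearly every replicate is significant and the mean estimate is close to the true value.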

Statistical Issues
As we discussed, a QTL is a region of any genome
that is responsible for variation in the quantitative
trait of interest. The goal of identifying all such
regions that are associated with a specific complex phenotype might, at first, seem quite simple,
especially with all the genomic and computational tools available to help us. Unfortunately,
the task is difficult because of the sheer number
of QTL, and the possible epistasis or interactions
between QTL, and because of the many additional sources of variation. To combat this, QTL
experiments can be designed with the aim of
containing the sources of variation to a limited
number so that dissection of a complex phenotype might be possible. In general, a large sample
of individuals has to be collected to represent the
total population, to provide an observable number of recombinants and to allow a thorough
assessment of the trait under investigation. This
is the first key step in QTL analysis, and it is
ignored in most of the studies.
Composite interval mapping and multiple
QTL mapping achieve the same result by reducing
the number of potential models under consideration. Both methods extend the ideas of interval
mapping to include additional markers as cofactors (outside a defined window of analysis) for
the purpose of removing the variation that is
associated with other (linked) QTL in the genome.
The limitations of both approaches are that they
are restricted to one-dimensional searches across
the genetic map and are challenged at times by
the multiplicity of epistatic QTL effects. There is
also a risk of putting too many markers in the
model as cofactors, and care should be taken to
preserve the amount of information that is available for estimation of the QTL effect.
The importance of developing models with
multiple QTL is well understood for linked QTL
and has an even greater role in the estimation and
location of epistatic QTL. The limiting feature in
successfully using multiple QTL models is not
our inability to write an equation for a model; it is
our inability to identify the best model or subset
of models (from potentially millions). Enumeration
of all possible QTL models that consider the
appropriate genetic architecture for the experiment, as well as linkage and epistasis, is a
daunting task. Accurate and fast simultaneous
multidimensional searches through the most
likely models, and their comparisons, are required
to determine the most feasible models that warrant further investigation. As shown previously,
one-dimensional searches (e.g. interval mapping
and composite interval mapping) have benefited
the mapping community but are limited by
their inability to accommodate multiple linked
QTL. Because a stepwise linear approach to
model building, by adding and deleting every
combination of multiple (linked) QTL and their
interactions, is not computationally feasible,
many investigators have proposed solutions by
addressing the computational issues rather than
the QTL-mapping method itself. One approach is
to globally search for the optimum multiple QTL
genotype using genetic algorithms. The application of genetic algorithm(s) to multiple QTL
problems is one of many beneficial approaches
because it allows a sampling of the QTL models
across unequal QTL numbers to be considered
and because it can be used in conjunction with any
QTL-mapping methodology that is implemented
for a multidimensional search of a genome.
An inclusive computational framework for
addressing many of the previously mentioned
challenges, namely, covariates, nonnormal trait
distributions, epistatic QTL and the issues of
multiple simultaneous searches, has been put
forward by Sen and Churchill. The approach
breaks the QTL problem into two distinct parts:
the relationship between the QTL and the quantitative trait and the location of the QTL. Separating
these two independent relationships allows the
initial focus to be placed on estimating the
unknown QTL genotypes and then on
searching for and comparing different models using the information gained from completing the QTL genotypes. The power of
breaking a problem into two independent parts is
not new, as it was dealt with by Jansen in 1993;
it lies in the fact that information gained in
the first part can be used in the second part.
Once the QTL genotypes are estimated, Sen and
Churchill explore all possible models using an
approach that allows distinct models of different
QTL numbers to be considered. As the QTL
genotypes are calculated independently from the
QTL effect and location, previous issues of
epistasis and linked QTL are eliminated because
the state of the QTL genotype and QTL number
is known before the estimation of their effects
and interactions. Multi-trait QTL mapping can
also benefit from the computational framework
of Sen and Churchill by simply extending from a
single phenotype to multiple correlated phenotypes and by dissecting the problem in a similar
manner. Although the Sen and Churchill view
has been shown to benefit QTL mapping, it might
have an even larger potential for accommodating
other types of problem and data structure
(for details, see Doerge 2002).
The most obvious applications of QTL analysis are MAS in crop breeding and QTL cloning
for transgenic technology. The success (or
efficiency) in both endeavours primarily depends
on the reliability and accuracy of the QTL analysis from which the information has been obtained.
Chromosomal QTL regions are quite often large
and can include many open reading frames or
favourable QTL alleles in repulsion. This situation can exacerbate linkage drag, that is, the
introgression into elite germplasm of undesirable
characters that are linked to a desirable QTL, when QTL analysis is applied in plant breeding.
Thus, a principal objective of QTL analysis is
confining QTL to narrow chromosomal regions,
which implies joint consideration of the type of
experimental design or segregating population,
its size, the number, informativeness and level of polymorphism of the DNA markers and the statistical
methodologies both to build up the linkage map
and to perform the QTL analysis. These are the
methodological features that should be considered
seriously. Other factors also have an important
influence on this accuracy: the experimental
design (including the type of segregating
population), its size, the heritability of the trait,
the number and contribution of each quantitative
trait locus to the total genotypic variance, their
interactions, their distribution over the genome,
the number and distance between consecutive
markers, the percentage of co-dominant markers,
the reliability of the order of markers in the linkage map, the evaluation of the trait, etc. There are
also situations that may reduce the efficiency of
MAS, when the environment or the genetic background, or both together, affects the final contribution of the QTL (i.e. when G × E and epistatic
interactions are involved in the phenotypic value).
QTL analysis not only provides DNA markers for
efficient selection, it is also of particular value in
resolving these interacting environmental and
genetic effects which are common in agronomically important traits such as days to flowering,
stay-green or tolerance to abiotic stresses. These
aspects are also considered because their study
will help not only plant breeding and germplasm
enhancement but also plant genomics, by connecting
proteins of known biochemical function to
the agronomic traits in which they are involved.
Another basic problem that concerns QTL
analysis is the true number of QTL governing a
quantitative trait. It has been shown that it is
difficult to locate more than 12 QTL in any given
population at any one time, and generally far
fewer. Moreover, because only significant effects
are reported, published QTL effects will be biased
towards larger values; the more stringent the
significance level, the greater the bias. It is not
the estimation procedures that are biased, it is the
fact that only the significant estimates are used;
the poorer the power of the test (low progeny
number), the greater the bias. This bias will be
greater on estimates of dominance than on additive effects because dominance effects are more
difficult to detect. All these biases are larger with
QTL of small effects and together imply that one
will tend to underestimate the true number of
QTL but exaggerate their additive and dominance
effects. Suggestions in the statistical literature to
diminish these problems include model validation with an additional sample and resampling
strategies such as bootstrapping.
It has also long been clear that the confidence
intervals (CIs) associated with QTL locations in
segregating populations are large since QTL positions are
estimated with poor precision. The CI for a QTL
using likelihood methods is generally a 1-LOD
support interval, which means that any position
around a likelihood peak whose LOD score
is no more than 1 below that of the peak is included
in the CI. Generally, QTL have been located to
intervals of 15 to 20 cM. This is probably sufficient
for marker-assisted selection, but this level of
precision is nowhere near satisfactory to contemplate map-based cloning of QTL. The reliability
depends on the heritability of the individual quantitative trait locus. Given a typical trait with an
overall broad heritability of 50% or less, the individual quantitative trait locus will have heritabilities of a fraction of this 50%. Thus, with five
equally sized QTL, each can only have a heritability of 10%. Simulations have shown that the
95% CI of such a quantitative trait locus in an F2
population of 300 individuals is more than 30 cM,
while it is very difficult to reduce the CI to much
less than 10 cM, even for a very highly heritable
quantitative trait locus. More markers beyond a
density of one in every 15 cM do not help much.
These distances should be viewed in the context
that, on average, a chromosome is about 100 cM
long. Several approaches have been explored to
overcome this problem. Again, increasing the
number of genotypes is the most efficient way of
improving precision, which is easy to achieve
with F2 or backcross populations of herbaceous
plants. Another strategy is to enhance the heritability of individual QTL in one of two ways. First,
the environmental variation can be minimised by
having many replicates of each individual, as can
easily be achieved with RIL and DH lines (or
vegetatively propagated fruit trees). Second, the
residual variation caused by other QTL can be
identified and removed from the error, as in the multiple-QTL mapping approach or composite interval mapping. However, in such cases, CIs cannot
be reduced to much less than 10 cM and then
only for the QTL with the largest effects. Note
that 1 cM equates to roughly 300 kbp in Arabidopsis and
6,000 kbp in wheat, so a 10 cM interval spans about
3,000 kbp and 60,000 kbp, respectively. Because of the width of the
CI, it is difficult to demonstrate the existence of
more than three QTL per chromosome. This limitation affecting the distribution of QTL along the
chromosomes is largely due to the low chiasma
frequency per chromosome (around two, on average), which limits recombination and hence
quantitative trait locus resolution. To go below
10 cM resolution, it would be necessary to resort
to fine QTL-mapping designs, such as advanced
intercross lines or near-isogenic lines, or to
greatly increase population sizes (refer to chapter
2). Analysis of hundreds or thousands of segregating progeny might be required, which is
costly and time-consuming. Alternatively,
a pooled-sample approach to the construction of
high-resolution genetic maps has been proposed.
Increasing resolution allows the discovery of
new QTL since linked QTL with favourable
alleles in repulsion would mask each other.
Increasing resolution is also very important to
reduce linkage drag during the marker-assisted
introgression of wild genes because a good QTL
allele for a trait might be linked in phase to a bad
QTL allele for another important trait. There are
two situations in plant genomics where the width of the CI is important: distinguishing linked
QTL governing different traits from a single quantitative trait locus with pleiotropic effects on those
traits, and candidate gene analysis. QTL with
pleiotropic effects seem to be crucial in coordinating (or regulating) the connected physiological pathways of traits. Genes with related
functions usually cluster through the genome.
Gene clustering seems to be the case, at least, for
resistance genes or genes controlling floral traits,
which is very convenient for comparative genomics. Correlated traits also usually have QTL in
common genomic regions. Several statistical
approaches for analysing several quantitative traits
simultaneously are being explored, such as multivariate
methodologies using Markov chain Monte
Carlo approaches (Guo and Thompson 1992) or
canonical transformation of the traits into
canonical variates, to which univariate techniques
can then be applied (Mangin et al. 1998).
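
The 1-LOD support interval described earlier in this section can be sketched in a few lines. This is an illustrative example only: the LOD profile below is invented, the function name `lod_support_interval` is hypothetical, and the bounds are read off the scanned grid without interpolating between positions.

```python
def lod_support_interval(positions, lods, drop=1.0):
    """Return (left, peak, right) positions of the LOD-drop support interval.

    The interval is the contiguous run of scanned positions around the peak
    whose LOD score is no more than `drop` units below the peak LOD.
    """
    peak_idx = max(range(len(lods)), key=lods.__getitem__)
    threshold = lods[peak_idx] - drop
    left = peak_idx
    while left > 0 and lods[left - 1] >= threshold:
        left -= 1
    right = peak_idx
    while right < len(lods) - 1 and lods[right + 1] >= threshold:
        right += 1
    return positions[left], positions[peak_idx], positions[right]


if __name__ == "__main__":
    # Hypothetical LOD profile scanned every 5 cM along one linkage group
    positions_cm = [0, 5, 10, 15, 20, 25, 30, 35, 40]
    lod_scores = [0.5, 1.2, 2.4, 3.6, 4.1, 3.5, 2.9, 1.8, 0.7]
    print(lod_support_interval(positions_cm, lod_scores))
```

With a coarse scan such as this one, the reported bounds are limited by the grid spacing, so real software usually scans at 1 or 2 cM steps.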
Taking a step forward, high-resolution mapping may deliver several candidate genes but no
proof of the molecular basis of the quantitative
trait locus. Progress in this direction will require
association tests, gene expression profiling and
complementation tests (functional and quantitative). It is clear that the experimental set-up in an
expression quantitative trait loci (eQTL; see
chapter 7) mapping study is similar in structure to
a traditional QTL-mapping study, but with thousands of phenotypes. The simplicity with which
this difference can be stated obscures the resulting challenges posed for the statistical analysis of
eQTL data. The statistical methods available for
multi-trait QTL mapping consider relatively few
traits and are not easily extended to the eQTL setting as they require estimation of a phenotype
covariance matrix, which is not feasible for hundreds or thousands of traits (for a review of eQTL
methods, refer to Kendziorski et al. 2006 and references therein).
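
A quick count shows why estimating a phenotype covariance matrix becomes infeasible in the eQTL setting. The sketch below simply counts the distinct variances and covariances for a given number of traits; the function name and the trait counts are illustrative assumptions.

```python
def covariance_parameters(n_traits):
    """Number of distinct variances and covariances among n_traits traits."""
    return n_traits * (n_traits + 1) // 2


if __name__ == "__main__":
    # A handful of agronomic traits versus thousands of transcript levels
    for n_traits in (5, 100, 10_000):
        print(n_traits, "traits ->", covariance_parameters(n_traits), "parameters")
```

With only a few hundred genotyped individuals, the number of parameters for thousands of transcripts far exceeds the number of observations available to estimate them.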
Some of the studies simply show QTL at different map positions, or with different effects in
different environments, which may result from
statistical uncertainty. Those studies, in annual
species, show that the expression of QTL can
vary among environments, and, together, they
suggest that most of the identified QTL show
significant G × E interaction. The percentage of
such interaction is expected to be larger as the
difference among the target environments
becomes larger, as in the case of control versus
stress. Very often, G × E interaction is confounded
with the effect of the research team. For example,
when two traits were evaluated in two locations by two different teams, only three QTL out
of 12 and three out of 16 were detected by both
teams at both locations. This can be easily seen
in the published reports. Therefore, the effect of
the research team may be at least as large as, or
more important than, the G × E interaction as such.
How the traits were evaluated might also
be important because, in all cases, the evaluation
was visual using a simple scoring scale from 1 to
5 or 9. Unless the population size is large enough,
the lines or families are uniform and the evaluation is consistent across researchers, the study
of QTL × E interaction is not relevant.
A considerable body of research in quantitative genetics suggests that epistatic interactions
among loci at two-locus, three-locus and higher-order levels often have major effects on adaptability and have a considerable influence on phenotype.
If there is gene interaction, populations can differentiate not only for population means but also
for local average effects. The consequence of this
differentiation is that the local average effects of
alleles change relative to each other so that an
allele favoured by selection in one population
may be removed by selection in other populations.
Using a two-locus genetic model that includes
measures of genetic population differentiation, the
potential role of additive × dominant and dominant × dominant epistasis in reproductive isolation
and inbreeding depression at the QTL level has been shown theoretically. It was
also concluded that the same forces that reduce
the apparent contribution of genetic interactions
to the variance within populations lead to populations differentiating for the local average effects
of alleles. Epistasis between QTL assayed in populations segregating for an entire genome has
been found at a frequency close to that expected
by chance alone. Yet, when RILs, DHs and
isogenic lines are used, epistasis is detected more
frequently. Therefore, QTL mapping may underestimate the number of non-additive interactions
for three reasons. First, advanced backcross
progenies are not useful for detecting
epistatic QTL since every backcross generation
greatly reduces the number of genotypic combinations because the recurrent parent genotype is being
recovered. For example, assuming unlinked loci and complete dominance, the frequency of individuals with phenotype AB derived from the two-locus double heterozygote AaBb by self-pollination
will be 9/16, while by backcrossing it will be 1
(backcross to the dominant parent) or 1/4 (testcross). Second, even large F2 mapping
populations will contain few individuals in the
two-locus double homozygous classes, limiting
the statistical power for detecting non-additive deviations for these genotypes. Finally, searching for
epistatic interactions involves many statistical
tests, so significance thresholds must be increased
accordingly. Unless epistatic interactions contribute largely to the total variance, they will not show
up in F2 populations. Kao et al. (1999) described a
method for simultaneous mapping of multiple
interacting QTL, but owing to computational constraints,
this is only a quasi-simultaneous QTL-mapping method.
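
The genotype frequencies quoted in the backcross example above can be checked by enumerating gametes. The sketch below assumes two unlinked loci with complete dominance; the function names are hypothetical and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product


def gamete_freqs(genotype):
    """Gamete frequencies for a genotype such as ("Aa", "Bb"), unlinked loci."""
    freqs = {}
    weight = Fraction(1, 2 ** len(genotype))
    # One allele is drawn per locus; every combination is equally likely
    for gamete in product(*genotype):
        freqs[gamete] = freqs.get(gamete, Fraction(0)) + weight
    return freqs


def dominant_phenotype_freq(parent1, parent2):
    """Frequency of progeny showing the dominant phenotype at every locus."""
    total = Fraction(0)
    for (g1, p1), (g2, p2) in product(gamete_freqs(parent1).items(),
                                      gamete_freqs(parent2).items()):
        # A locus shows the dominant phenotype if either allele is upper case
        if all(a.isupper() or b.isupper() for a, b in zip(g1, g2)):
            total += p1 * p2
    return total


if __name__ == "__main__":
    f1 = ("Aa", "Bb")
    print("selfing:", dominant_phenotype_freq(f1, f1))
    print("backcross to AABB:", dominant_phenotype_freq(f1, ("AA", "BB")))
    print("testcross to aabb:", dominant_phenotype_freq(f1, ("aa", "bb")))
```

The same enumeration makes clear how few genotypic classes remain after backcrossing compared with selfing, which is why backcross progenies carry little information on epistasis.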

Practical Utility
From a practical point of view, the following
question is often raised: Is the information from a
QTL analysis enough to be successful in MAS
for QTL? The experimental results have been
mixed. Schneider et al. (1997) have reported that
MAS improved drought resistance performance by
11% under stress and 8% under non-stress in common beans. A MAS study for malting quality in
barley, based on two QTL, gave contrasting results
(Han et al. 1997). Whereas tandem genotypic and
phenotypic selection proved useful for one quantitative trait locus, a second putative quantitative trait
locus identified in the original mapping population
vanished in the population used for selection. The
proportion of genetic variance explained by the
QTL, individually and together, in the QTL experiment is a first key point. The second key point is
that G × E and epistatic interactions at any quantitative trait locus may be involved in the phenotypic
value. Concerning the first point, it is often difficult
to determine from the literature how much of the
genetic variance is explained by the QTL, either
individually or together, because only the total
phenotypic variance is reported. It is therefore not
possible to decide whether any variation left unexplained is caused by other QTL or the environment.
Taking into account that for QTL alleles of small
effect the magnitude of the bias will be larger than
for QTL alleles of large effect, one should be
especially cautious with QTL of small effect.
Fortunately, in some cases, a small number of QTL
have been reported as contributing to a large proportion of the trait variance. This would explain
why MAS experiments have generally been successful when using the marker information for
introgressing or accumulating QTL alleles of large
effect. At the same time, the purpose of the QTL
analysis is not only MAS but also the genetic
dissection of the quantitative trait. Therefore, all
QTL have to be identified regardless of whether
their effect is large or small, or environmentally
sensitive or not. This task requires information
from different progenies in different environments,
the development and implementation of robust QTL-mapping methodologies and complementary
experimental designs to confirm, at least, QTL
positions.

Bibliography
Literature Cited
Churchill GA, Doerge RW (1994) Empirical threshold
values for quantitative trait mapping. Genetics 138(3):963–971
Comai L, Young K, Till BJ, Reynolds SH, Greene EA,
Codomo CA, Enns LC, Johnson JE, Burtner C, Odden
AR, Henikoff S (2004) Efficient discovery of DNA
polymorphisms in natural populations by Ecotilling.
Plant J 37:778–786
Edwards MD, Stuber CW, Wendel JF (1987)
Molecular marker facilitated investigation of quantitative trait loci in maize. I. Numbers, genomic distribution and types of gene action. Genetics 116:113–125
Etzel C, Guerra R (2002) Meta-analysis of genetic-linkage of quantitative trait loci. Am J Hum Genet
71:56–65
Goffinet B, Gerber S (2000) Quantitative trait loci: a
meta-analysis. Genetics 155:463–473
Guo SW, Thompson EA (1992) Performing the exact test
of Hardy-Weinberg proportion for multiple alleles.
Biometrics 48:361–372
Han F, Ullrich SE, Kleinhofs A, Jones BL, Hayes PM,
Wesenberg DM (1997) Fine structure mapping of the
barley chromosome-1 centromere region containing
malting-quality QTLs. Theor Appl Genet 95:903–910
Hansen M, Kraft T, Ganestam S, Säll T, Nilsson NO
(2001) Linkage disequilibrium mapping of the bolting
gene in sea beet using AFLP markers. Genet Res
77:61–66
Jansen RC (1993) Interval mapping of multiple quantitative trait loci. Genetics 135:205–211
Jansen J, De Jong AG, Van Ooijen JW (2001) Constructing
dense genetic linkage maps. Theor Appl Genet
102:1113–1122
Jiang C, Zeng ZB (1995) Multiple trait analysis of genetic
mapping for quantitative trait loci. Genetics
140(3):1111–1127
Kao C-H et al (1999) Multiple interval mapping for quantitative trait loci. Genetics 152:1203–1216
Lander ES, Botstein D (1989) Mapping Mendelian factors
underlying quantitative traits using RFLP linkage
maps. Genetics 121:185–199
Mangin B, Thoquet P, Grimsley N (1998) Pleiotropic
QTL analysis. Biometrics 54:88–99
McCallum CM, Comai L, Greene EA, Henikoff S (2000)
Targeting induced local lesions IN genomes (TILLING)
for plant functional genomics. Plant Physiol
123:439–442
Michelmore RW, Paran I, Kesseli RV (1991) Identification
of markers linked to disease-resistance genes by bulked
segregant analysis: a rapid method to detect markers in
specific genomic regions by using segregating populations. Proc Natl Acad Sci USA 88:9828–9832
Moser G, Muller E, Beeckmann P, Yue G, Geldermann
H (1998) Mapping QTL in F2 generations of Wild
Boar, Pietrain and Meishan pigs. In: Proceedings of
the 6th world congress on genetics applied to livestock production, vol 26, Armidale, pp 478–481
Paterson AH, Lander ES, Hewitt JD, Peterson S, Lincoln
SE, Tanksley SD (1988) Resolution of quantitative
traits into Mendelian factors by using a complete linkage map of restriction fragment length polymorphisms.
Nature 335:521–529
Rodolphe F, Lefort M (1993) A multi-marker model for
detecting chromosomal segments displaying QTL
activity. Genetics 134:1277–1288
Sax K (1923) The association of size difference with seed-coat pattern and pigmentation in Phaseolus vulgaris.
Genetics 8:552–560
Schena M, Shalon D, Davis RW, Brown PO (1995)
Quantitative monitoring of gene expression patterns
with a complementary DNA microarray. Science
270:467–470
Schneider AK, Mary EB, James DK (1997) Marker-assisted selection to improve drought resistance in
common bean. Crop Sci 37:51–60
Thoday JM (1961) Location of polygenes. Nature
191:368–370
Thornsberry JM, Goodman MM, Doebley J, Kresovich S,
Nielsen D et al (2001) Dwarf 8 polymorphisms associate with variation in flowering time. Nat Genet
28:286–289
Visscher PM, Thompson R, Haley CS (1996) Confidence
intervals in QTL mapping by bootstrapping. Genetics
143:1013–1020
Wolyn DJ, Borevitz JO, Loudet O, Schwartz C, Maloof J,
Ecker JR, Berry CC, Chory J (2004) Light-response
quantitative trait loci identified with composite
interval and eXtreme array mapping in Arabidopsis
thaliana. Genetics 167:907–917
Yu J, Holland JB, McMullen MD, Buckler ES (2008)
Genetic design and statistical power of nested association mapping in maize. Genetics 178:539–551
Zeng ZB (1993) Theoretical basis for separation of
multiple linked gene effects in mapping quantitative
trait loci. Proc Natl Acad Sci 90:10972–10976

Further Readings
Asíns MJ (2002) Present and future of quantitative trait
locus analysis in plant breeding. Plant Breed
121:281–291
Broman KW (2001) Review of statistical methods for
QTL mapping in experimental crosses. Lab Anim
30(7):44–52
Devlin B, Risch N (1995) A comparison of linkage disequilibrium measures for fine-scale mapping.
Genomics 29:311–322
Doerge RW (2002) Mapping and analysis of quantitative
trait loci in experimental populations. Nat Rev Genet
3:43–53
Hospital F (2009) Challenges for effective marker-assisted
selection in plants. Genetica 136:303–310
http://www.knowledgebank.irri.org/ricebreedingcourse/bodydefault.htm#QTL_mapping.htm
Jorde LB (2000) Linkage disequilibrium and the search
for complex disease genes. Genome Res 10:1435–1444
Kang MS (2002) Quantitative genetics, genomics, and
plant breeding. In: Papers from the symposium on
quantitative genetics and plant breeding in the 21st
century, Louisiana State University, 26–28 Mar 2001,
CAB International 2002
Kendziorski CM et al (2006) Statistical methods for
expression quantitative trait loci (eQTL) mapping.
Biometrics 62:19–27
McMullen MD et al (2009) Genetic properties of the
maize nested association mapping population. Science
325:737–740
Würschum T (2012) Mapping QTL for agronomic traits in
breeding populations. Theor Appl Genet 125:201–210
Xu Y, Crouch JH (2008) Marker-assisted selection in plant
breeding: from publications to practice. Crop Sci
48:391–407

Fine Mapping

Need for Fine Mapping or High-Resolution Mapping
The ultimate aim of molecular genetic studies of
quantitative genetic variation is to find the genes
that influence the trait. However, the use of MAS
does not require the gene to be known, but can be
effective with linked markers. So, the critical
point is how closely a QTL is mapped with
respect to the markers. Several simulation studies
have shown that for MAS, informative markers
that flank a QTL within 5 cM seem adequate. In
contrast, virtually all QTL-mapping studies have
been conducted with panels of 100–300 markers
covering the entire genome, corresponding to an
average distance between markers of between ~5 and
20 cM. Hence, it is imperative to fine map at least
those QTL regions with a larger number of markers. Such a mapping process is also referred to as
high-resolution mapping.
Fine mapping of QTL will also increase the
efficiency of foreground selection in introgression programs through MAS because the genomic
region that has to be controlled is smaller. This
will reduce the number of individuals that is
required and the genotyping cost. In addition,
introgression of a smaller genomic region helps
to eliminate unwanted genes that are located
around the target QTL. This is particularly important when the donor is an exotic genetic resource.
Similar considerations also hold true for recurrent MAS (refer chapter 8 for more details). For
MAS to be effective, the target QTLs must be
free from any undesirable linkage. The large size
of the regions encompassing QTLs and the likely
presence of undesirable linked genes make it
essential to fine map such regions to facilitate
their precise introgression and to identify candidate genes within these QTLs.
Further, fine mapping will help to clone the
genes residing at the target QTLs (referred to as
map-based cloning; see below). This provides
more detailed knowledge of the functional genes
underlying these QTL and allows a better understanding of the physiology of the quantitative
trait. This might also allow better prediction of
the effects of the QTL in different genetic backgrounds and environmental conditions and on
different characteristics of performance. In addition, specific management strategies could be
developed for specific genotypes to enhance their
performance.
Thus, the initial QTL-mapping step typically
needs to be followed by a fine-mapping step. To
select the optimal fine-mapping strategy, one
needs to have a good understanding of what
factors limit the achievable mapping resolution. Among them, the four primary
factors are:
1. Marker density: Mapping consists of placing
a QTL in a given marker interval. The more
markers one has, the smaller the average
interval size and, thus, the higher the map
resolution.
2. Crossover density: Actually, recombinant
chromosomes are the only ones that provide
mapping information.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_7, Springer India 2013

3. QTL detection methods: This corresponds to
the accuracy with which one can infer the QTL
genotype of a given individual or chromosome.
Positioning a QTL with respect to a crossover
requires knowledge of the QTL allele carried
by the corresponding chromosome.
4. Molecular architecture of the QTL: Many
QTL probably reflect the combined effect of
not one, but several, linked QTLs. Approaching
such a composite QTL using a model that
assumes a single location may result in fuzzy
positioning.

Types of Molecular Markers Suitable for Fine Mapping
Increasing the marker density in a chromosome
segment of interest is conceptually the easiest
limiting factor to resolve. However, developing
markers that target specific regions is a laborious
and time-consuming task. Fortunately, this has
recently changed with the availability of the
nearly complete genome sequences of the major
crop species. Microsatellite markers can be
directly identified from the genomic sequences,
and suitable primers can be designed and used in
fine mapping since they are simple and easily exchanged
among laboratories. However, the frequency of
polymorphism detected at microsatellite loci
is generally not sufficient: because it is very low
(<10%), SSRs are not suitable for fine mapping,
which prompts the use of other markers. In such
cases, insertions or deletions (INDELs) and/or
single nucleotide polymorphisms (SNPs) in both
intergenic and coding regions (see chapters 3
and 10) might be more useful in fine mapping
since they are far more efficient than microsatellites in detecting polymorphism. Although the
validity of each SNP needs to be confirmed, the
conversion rate is generally very high and the
transportability across populations remarkably
good. If the genome sequence is available for the
target species, it is possible to identify the putative genes present in the QTL regions. SNPs can
be developed for these regions to enhance the
efficiency of identifying causal polymorphism.
In addition to the selection of appropriate, highly
polymorphic markers, a good number of
recombinants is needed to establish the relative
position of the locus of interest (for which
thousands of progenies are needed). Further, since transformation is a routine activity in several plant
species, functional complementation would be a
more productive approach to analyse the function
of identified candidate genes. Thus, fine mapping will help to
identify more tightly linked markers, besides
deciphering the physiological and molecular mechanisms underlying the expression of such quantitative traits.
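
The need for thousands of progenies can be made concrete with a rough calculation. The sketch below is an assumption-laden illustration rather than a design rule from the text: it uses Haldane's mapping function (no crossover interference) to convert map distance into a recombination fraction and treats each informative gamete as an independent trial; the function names are hypothetical.

```python
import math


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))


def expected_recombinants(n_gametes, d_cm):
    """Expected number of recombinant gametes within an interval of d_cm."""
    return n_gametes * haldane_r(d_cm)


def gametes_for_at_least_one(d_cm, prob=0.95):
    """Informative gametes needed to see >= 1 recombinant with probability prob."""
    r = haldane_r(d_cm)
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - r))


if __name__ == "__main__":
    for d_cm in (5.0, 1.0, 0.1):
        print(f"{d_cm} cM: r={haldane_r(d_cm):.5f}, "
              f"gametes needed for one recombinant (95%)={gametes_for_at_least_one(d_cm)}")
```

Under these assumptions, resolving an interval of about 0.1 cM already calls for roughly three thousand informative gametes just to observe a single crossover, and several recombinants are needed in practice.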

Physical Mapping and Its Role in Fine Mapping
A physical map is an ordered set of DNA fragments, among which the distances are expressed
in physical distance units, that is, in base pairs.
The resolution or accuracy with which this can be
done ranges from mapping loci to a particular
chromosome (low resolution) to the determination of the precise nucleotide sequence (high
resolution). Physical maps are an important
resource for several kinds of molecular research, such as
positional or map-based cloning of agronomically important genes, analysing chromosomes
and genome structure in detail and establishing
the relationship between genetic and physical
distance (thereby increasing the efficiency of fine
mapping).
It has been shown in Arabidopsis that, on average, 1 cM equals 280 kbp. On the other hand,
in barley, a 1-cM genetic distance covers more than
7,000 kbp. Moreover, the relationship between genetic
and physical distance can vary up to 100-fold (or
even more) between different regions of the same
genome. For example, in tomato, the average
amount of DNA per cM was estimated as 750 kbp,
but this varies among regions of the tomato genome,
from as low as 50 kbp per cM to more
than 4,000 kbp per cM. The large discrepancy is
mainly due to the existence of recombination
suppression (at centromeric regions) and recombination hot spots on the chromosomes.
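
To see what these ratios mean for a mapped QTL, the following sketch converts a genetic interval into an approximate physical span using only the kbp-per-cM figures quoted above. The dictionary keys and the function name are illustrative, the barley figure is a lower bound, and real ratios vary along each chromosome.

```python
# kbp per cM as quoted in the text (barley is "more than" 7,000, i.e. a lower bound)
KBP_PER_CM = {
    "Arabidopsis (average)": 280,
    "tomato (average)": 750,
    "tomato (recombination hot spot)": 50,
    "tomato (suppressed region)": 4000,
    "barley (lower bound)": 7000,
}


def physical_span_kbp(interval_cm, kbp_per_cm):
    """Approximate physical size (kbp) of a genetic interval given in cM."""
    return interval_cm * kbp_per_cm


if __name__ == "__main__":
    interval_cm = 10  # a typical QTL confidence interval
    for label, ratio in KBP_PER_CM.items():
        print(f"{label}: {physical_span_kbp(interval_cm, ratio):,} kbp")
```

The same 10 cM interval can therefore correspond to a few hundred kbp in a recombination hot spot or to tens of thousands of kbp in a large or recombination-poor genome region.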
A prerequisite for physical mapping is the
availability of libraries containing large inserts of
genomic DNA and the techniques and resources
such as pulsed field gel electrophoresis, rare-cutting restriction enzymes and Southern blotting
facilities. Large genomic DNA inserts derived
from the given crop genome are usually cloned
into high-capacity vectors such as cosmids, yeast
artificial chromosomes (YAC), bacterial artificial
chromosomes (BAC), bacteriophage P1-derived
artificial chromosomes (PAC) and mammalian
artificial chromosomes (MAC). Using such vectors, insert DNA of 45 to 800 kbp can be cloned.
Such large insert libraries facilitate the development of small insert libraries which will be
sequenced to determine the order of nucleotides
in those small inserts using state-of-the-art automated DNA sequencing technologies (such as
pyrosequencing, massively parallel signature
sequencing (MPSS), polony sequencing and
sequencing with Illumina or SOLiD; see chapter 10).
Then, the sequencing results are ordered or
assembled as contigs, and from this assembly, the
complete physical map of the genome is prepared. Such a physical map can be compared with
the genetic map, and new markers (such as SNPs
and/or INDELs) can be obtained from the physical map for fine or localised mapping of the target
QTL region in the genetic map.

Comparative Mapping
Genetic or physical maps constructed in one species can be compared by means of common markers (or common single gene traits) with closely
related species. Such common markers are
referred to as anchored markers. These comparative maps can be used to study genome evolution (how the genome has been rearranged
through time) and to make inferences about gene
organisation, repeated sequences, etc. Further,
map-based cloning (see below) may be easier in
some species than others, for example, rice (with
a small genome) versus wheat (with a massive
genome). Conservation of the gene order within a
chromosomal segment between different species
is referred to as colinearity, whereas conservation
of the order of genes in DNA fragments that are
bigger than 50 kb is referred to as microcolinearity.
Deletions, inversions and duplications are detected
at this level. In comparative mapping, loci in
different species originating from the same ancestral locus are called orthologous loci. Paralogous
loci are those loci in different (or the same) species that arose due to a duplication of an ancestral
locus. Comparative mapping has been done in
several crop species that usually belong to a single
family. For example, in Solanaceae, comparative
maps are available for tomato and pepper and
tomato and potato. Similarly, in Gramineae, a
comparative map between rice and maize is available. The details of this comparative map have
shown that maize has six times more nuclear DNA than
rice. However, the sixfold greater DNA content did not increase
recombination in the conserved regions compared with rice. Single-copy genes of rice are
always duplicated in maize, and 72% of the duplications still exist in the maize genome. Further, it is
noticed that the loss of 28% of the duplicated copies of
maize genes could have resulted from deletions or
loss of entire chromosomes or chromosomal segments. Pairs of homologous chromosomes in
maize are similar and collinear to rice chromosomes
and have the same gene content but with shuffled
gene orders. An example of a Gramineae comparative map can be found at CMap (the Comparative
Map Viewer), which allows you to construct comparisons between different maps. CMap is available at http://www.gramene.org/cmap/. In this
module, you can view genetic, physical, sequence
and QTL maps for many species of cereal crops.
All data (map sets, maps, features and correspondences) in the Maps Module are built from the
Markers Module. Users are encouraged to consult
the Markers Module for primary information
about markers and their mappings. The Maps
Module should be considered to be primarily a
visualisation tool.
Thus, it is obvious that the gains of comparative mapping are severalfold: (1) Maps constructed in one species can be compared by means
of common (or anchor) markers with closely
related species. (2) These comparative maps can
be used to study genome evolution (how the
genome has been rearranged through time) and
to make inferences about gene organisation,
repeated sequences, etc. (3) It facilitates easier
map-based cloning.
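
Colinearity between two maps can be quantified from the anchor markers they share. The sketch below is one simple possibility, not the method used by CMap: it reports the fraction of anchor-marker pairs that appear in the same relative order on both maps. The marker names and positions are invented for illustration.

```python
from itertools import combinations


def concordant_fraction(anchors):
    """Fraction of anchor-marker pairs in the same relative order on both maps.

    `anchors` maps a marker name to (position on map A, position on map B).
    A value of 1.0 indicates perfect colinearity; 0.0 indicates a fully
    inverted segment.
    """
    pairs = list(combinations(anchors.values(), 2))
    concordant = sum(1 for (a1, b1), (a2, b2) in pairs
                     if (a1 - a2) * (b1 - b2) > 0)
    return concordant / len(pairs)


if __name__ == "__main__":
    # Hypothetical anchor markers: (cM in species A, cM in species B)
    collinear = {"m1": (5, 12), "m2": (18, 30), "m3": (27, 41),
                 "m4": (40, 66), "m5": (52, 80)}
    # Same markers, with m3 and m4 swapped in species B (a local inversion)
    rearranged = dict(collinear, m3=(27, 66), m4=(40, 41))
    print(concordant_fraction(collinear))
    print(concordant_fraction(rearranged))
```

Pairs that break the shared order point to the markers flanking a putative rearrangement, which can then be inspected in a map viewer.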


Genetical Genomics/eQTL Mapping


Transcriptome analysis (studies on gene expression at mRNA level with spatial and temporal
pattern) with microarrays is opening exciting
possibilities for the genetic dissection of complex
traits (see chapter 10). In an approach called genetical genomics, the expression levels of many (not
all) genes are measured in one or more tissues
assumed to be relevant with respect to the given
phenotype. Jansen and Nap (2001) first introduced this concept as genetical genomics. The
transcript levels of individual genes are treated as
quantitative traits and subjected to QTL mapping
to identify expression QTL (eQTL). In general,
tissue samples are harvested and the mRNA is
purified and then subjected to some means of measurement. In microarray hybridisation technology,
the purified mRNA is converted to labelled cDNA
which hybridises with complementary DNA on
the microarray slide. The relative amounts of transcript present for each gene represented on the
microarray are determined by measuring the
amount of label bound following the hybridisation. eQTLs are derived from polymorphisms in
the genome that result in differential measurable
transcript levels. Of course, any method of expression profiling based on RNA, protein or metabolites can be used as quantitative trait in genetical
genomics. eQTLs are typically sorted into local
eQTL when the affected gene lies within the
confidence interval of the eQTL, as opposed to
distant eQTL when not. Local and distant eQTL
are also, respectively, referred to as cis- versus
trans-acting eQTL. One possible explanation for a
local eQTL is a cis-acting regulatory mutation that
directly controls the transcript level of the corresponding gene, whereas a distant eQTL necessarily implies a trans-acting molecular mechanism.
Thus, genetical genomics approach provides a
novel way of discovering, at a genetic level, regulators of gene expression acting either in cis or in
trans relative to the target gene. The eQTL position may coincide with the gene itself, displaying
cis regulation, or be different, thus revealing trans-acting factors controlling expression. A common
feature of eQTL studies is the detection of hot
spots of trans-acting eQTL, interpreted as regions
rich in regulatory genes that co-regulate many
downstream targets.
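
The local versus distant classification described above can be expressed as a simple rule. The sketch below is a minimal illustration: the record layout (`EQTL`), the gene names, and all positions are hypothetical, and map positions are assumed to be in cM on a common genetic map.

```python
from dataclasses import dataclass


@dataclass
class EQTL:
    gene: str            # gene whose transcript level is the trait
    gene_chrom: str      # chromosome carrying the gene
    gene_pos_cm: float   # map position of the gene
    qtl_chrom: str       # chromosome carrying the eQTL
    ci_start_cm: float   # start of the eQTL confidence interval
    ci_end_cm: float     # end of the eQTL confidence interval


def classify(eqtl):
    """Return 'local' if the gene lies within the eQTL CI, else 'distant'."""
    same_chrom = eqtl.gene_chrom == eqtl.qtl_chrom
    inside_ci = eqtl.ci_start_cm <= eqtl.gene_pos_cm <= eqtl.ci_end_cm
    return "local" if same_chrom and inside_ci else "distant"


if __name__ == "__main__":
    # Hypothetical records for illustration only
    records = [
        EQTL("geneA", "chr1", 42.0, "chr1", 35.0, 50.0),
        EQTL("geneB", "chr1", 42.0, "chr1", 80.0, 95.0),
        EQTL("geneC", "chr2", 10.0, "chr5", 5.0, 20.0),
    ]
    for rec in records:
        print(rec.gene, classify(rec))
```

Counting the distant eQTL that fall in the same map interval would be a natural next step for spotting the hot spots mentioned above; note that with wide confidence intervals, a "local" call does not by itself prove a cis-acting mechanism.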
The majority of expression studies are being
performed in mapping populations in order to aid
in the identification of eQTLs of interest, as well
as to take advantage of simplified genetics due to
the homozygosity of the selfed progeny derived
from the biparental cross. As one might expect,
differential gene expression can be explained
simply by sequence differences in the gene itself,
for example, the promoter regions that respond to
transcription factors to varying degrees. In other
words, a motif integral for transcription factor
binding may contain a polymorphism or mutation that prevents effective binding, and therefore,
decreases transcription of that gene. Additionally,
polymorphisms in the intronic regions could
affect splicing, or changes in untranslated
regions (UTRs) could affect mRNA stability, both
potentially creating degradation-susceptible transcripts. When samples are derived from a genotyped mapping population and subjected to
high-resolution mapping, these cases of polymorphism are identified as cis-eQTLs. In this
case, the genomic marker allele that most closely
associates with a phenotype is located in close
proximity to the gene being measured. However,
genomic markers most closely associated with an
eQTL phenotype may physically lie far from the
gene being measured. In these trans-eQTLs, the
polymorphism that results in differential transcript levels may be located in the transcription
factor itself, thereby creating a dysfunctional or
hyperactive protein. It could be expected that cis-acting factors would have larger effects on measurable transcript levels.
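The local versus distant distinction described above can be sketched in a few lines. This is only an illustration: the `Locus` type, the function name and the 1-Mb window are assumptions made for the example, not values given in the text, and an appropriate window depends on the species and on mapping resolution.

```python
from dataclasses import dataclass


@dataclass
class Locus:
    chromosome: str
    position_bp: int


def classify_eqtl(gene: Locus, eqtl_peak: Locus, window_bp: int = 1_000_000) -> str:
    """Label an eQTL as 'local' or 'distant' relative to the gene whose transcript it affects.

    window_bp is a hypothetical threshold chosen for illustration only.
    """
    same_chromosome = gene.chromosome == eqtl_peak.chromosome
    if same_chromosome and abs(gene.position_bp - eqtl_peak.position_bp) <= window_bp:
        return "local"    # compatible with, but not proof of, cis regulation
    return "distant"      # implies a trans-acting mechanism


# Hypothetical positions, for illustration only.
print(classify_eqtl(Locus("1", 5_000_000), Locus("1", 5_400_000)))
print(classify_eqtl(Locus("1", 5_000_000), Locus("7", 5_400_000)))
```

A local eQTL is only a candidate for cis regulation; as the text notes, allele-specific evidence is needed to confirm the mechanism.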
In addition to understanding general patterns
of gene expression, these genetical genomic studies are creating caches of information useful for a
multitude of applications. As one gene regulates
the level of expression of another (trans-acting
eQTL), novel upstream or downstream components in gene regulation pathways can be
identified. In addition to steady state analysis, the
induction of stimuli such as drought can lead to a
deeper understanding of gene networks that are
activated under such conditions. Correlation of
measured transcript levels (eQTL phenotype)
with classic QTL phenotypes may suggest functional roles for the allelic variation in gene expression and serve as a predictor of downstream
effects on plant development, morphology and
traits of agronomic interest. Finally, the analysis of the
activation of particular genes under steady state
or external stimuli treatment provides insight into
the functionality of endogenous promoters. While
promoters used for transgenic expression have
been thoroughly analysed in model systems and
model inbred lines, the understanding of agronomically important phenotypes may benefit
from the analysis of genetic polymorphisms of
trans-acting regulators affecting transgene expression and therefore can allow for the optimization
of expression both of current and future transgenic lines.
Local eQTL that co-localises with QTL affecting the phenotype of interest denotes possible
causal genes. Genetical genomics thus provides a
highly parallelized shortcut bypassing, at least in
some instances, tedious QTL fine mapping. It is
important to realise that finding local eQTL overlapping a phenotypic QTL, a common occurrence
given the abundance of local eQTL in most
experiments, provides interesting candidates, but
does not establish causal connection. A correlation between the corresponding expression traits
and phenotype in the studied population does not
establish causal connection either. One possible
strategy to distinguish fortuitous from causal
correlation is to apply conditional correlation
measures. If transcript levels of the candidate
gene directly affect the phenotype, one will find a
correlation between transcript levels and the phenotype both across and within genotypes; if transcript levels and the phenotype are not causally
linked, the correlation will be observed across,
but not within, genotypes. Alternatively, one can
attempt to specifically disturb the candidate gene
genetically (for instance, by knockout or
knockdown strategies) and measure the effects
on the phenotype.
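The conditional correlation idea can be illustrated with a small sketch. It assumes one simple way of conditioning, namely subtracting the mean of each genotype class at the QTL before correlating; the function names and the toy data are hypothetical and not taken from the text.

```python
from statistics import mean


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def within_genotype_correlation(genotypes, expression, phenotype):
    """Correlate expression and phenotype after removing genotype-class means."""
    def centre(values):
        groups = {}
        for g, v in zip(genotypes, values):
            groups.setdefault(g, []).append(v)
        class_means = {g: mean(vs) for g, vs in groups.items()}
        return [v - class_means[g] for g, v in zip(genotypes, values)]

    return pearson(centre(expression), centre(phenotype))


# Toy data: both traits differ between genotype classes but are unrelated within a class.
genotypes = ["AA"] * 4 + ["BB"] * 4
expression = [1.0, 1.2, 0.8, 1.0, 3.0, 3.2, 2.8, 3.0]
phenotype = [10.2, 10.0, 10.0, 9.8, 20.2, 20.0, 20.0, 19.8]

across = pearson(expression, phenotype)
within = within_genotype_correlation(genotypes, expression, phenotype)
print(f"across genotypes: {across:.2f}, within genotypes: {within:.2f}")
```

In this toy case the correlation is strong across genotypes and essentially absent within them, which is the pattern the text associates with a non-causal overlap.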
The contribution of genetical genomics to the
molecular dissection of complex traits is not limited to facilitating the discovery of causal genes.
Whereas local eQTL coinciding with phenotypic
QTL point towards primary events, co-localised
distant eQTL may help unravel the networks that
connect primary events and phenotypes. The
phenotype may be controlled by the products of
the genes regulated in trans. Just as for overlapping local eQTL, one has to be wary of fortuitous QTL overlap and resulting trait correlations.
The same conditional correlation measures and
gene perturbations may be applied to probe into
the nature of the observed correlations. Despite
its many attractive features, genetical genomics
has its limitations. It can only detect effects that
are mediated by alterations in transcript levels.
Moreover, it can only detect effects that manifest
themselves in the panel of examined tissues,
which are usually limited. Further, it should be
noted that QTL regions appear often quite complex and approximate and may contain hundreds
of genes. Consequently, the actual involvement
of the candidate gene in most cases remains to
be confirmed by genetic and physical mapping, positional cloning, expression analysis
or genetic transformation experiments. Cost-saving alternatives to large genome-wide and
population-wide analyses with minimal loss of
informativeness have been proposed: analysing
pooled samples of phenotypically extreme
members of the population or concentrating on
genotypically selected individuals. Though
anonymous DNA markers are useful for covering the
entire genome and for efficient QTL analysis,
deployment of gene-derived markers will be
more desirable since they can validate the
identified QTLs by elucidating the genes underlying those QTLs. Transcript-derived markers
such as EST-SSR, CAPS, dCAPS and more
recently SNPs have promising applications at
this juncture. Further, advances in microarray
technologies reveal global changes in gene
expression, and mapping of these changes in the
same mapping population used for QTL analysis
might help to identify informative eQTLs. However,
it should be noted that genetical genomics is
only in its infancy. It is also vital to note that
several studies in proteomics (see chapter 10) have
indicated that functionally important changes in
the levels of transcripts are not necessarily reflected
in changes in the levels of proteins, and hence
assessing the genetics of proteins, transcripts and
DNA markers is essential to infer causal networks
to understand how the system works as an integrated whole. Another concern during eQTL
analysis is how transcript variation relates to
other genomic and/or physiological levels. As
stated, preliminary evidence suggests that metabolites and transcripts have different levels of
heritability as well as epistasis underlying their
genetic architecture. The differences in heritability between these three trait levels (metabolite,
enzyme activity and transcript) could be explained
by transcripts being functionally linked to potential DNA polymorphisms in their genes or regulators, which would leave less potential for
stochastic noise to be introduced between the
genetic and transcript variations (see chapter
10). In comparison, the variations in metabolite
and enzyme activities require a DNA polymorphism to be processed via transcription and
translation, with the extra steps allowing more
stochastic noise into the system. An alternative
explanation is that fundamental differences exist
in the physics of the three trait levels. Metabolites
within a network are directly linked such that the
atoms in one metabolite are transferable to a different metabolite via few direct enzymatic steps.
This interconnectedness could magnify small
biological perturbations, allowing more noise in
metabolic networks than in corresponding transcripts. Similarly, enzymatic networks are likely
dominated by Michaelis–Menten kinetics, which
introduces a nonlinear relationship between the
levels of protein and enzyme activity. As such,
the use of linear statistical approaches to define
heritability may produce a bias in heritability
within a nonlinear enzymatic network in comparison with transcript networks if transcription
behaves in a more linear fashion.
Thus, keeping all these limitations in mind, it
is suggested that integration of the advances in
quantitative genetics, functional genomics and
bioinformatics as system quantitative genetics
can greatly facilitate systems level understanding
of the biological cause and effect relationship. A
number of experiments are underway in this
direction and will hopefully yield exciting results
in the near future.

Map-Based Cloning
Successful isolation of genes underlying the target QTL using the information on the QTL map and
physical map is referred to as map-based cloning.
There are at least three important steps in map-based cloning, although the details may vary depending on
the crop and purpose:
1. Mapping of target QTL and identification of
more closely linked markers through fine mapping. For preliminary QTL mapping, a population size of 60–150 individuals with 100–200
markers that span the entire genome is sufficient.
However, for fine mapping, it is essential to
increase the population size to >1,000 and to use a larger
number of informative/polymorphic markers.
2. Physical localization of the target QTL on the
physical map using the markers' sequence
information (referred to as chromosome landing). This identifies the genomic fragment
which is flanked by the target markers. The
identified genomic region is then scanned
for putative candidate genes (referred
to as chromosome walking). It is usually done
by screening a large insert genomic library
with the closely linked marker and isolating the
clones that hybridise with the marker. This is
followed by creating new markers (usually
sequences at the end of the clone) and screening the segregating population (often a large population of 1,000–3,000 individuals)
with the new markers. The goal is to find a set
of markers that co-segregate with the gene
under the QTL. Co-segregation means that
whenever one allele of the gene is expressed,
the markers associated with that allele are also
present (i.e. recombination has not occurred
between the gene and the marker). Such
identified genes are called positional candidate genes, as they lie in the region identified by the genome
scan as likely to host a QTL.
3. Gene identification, characterization and validation: Co-segregation confirms that the genes are
within the two flanking markers. Step 2 usually
finds a large number of putative candidate genes
(which are identified by predicting open reading
frames (ORFs) in the DNA sequence of the
selected clone through bioinformatics tools).
It is now necessary to determine the actual candidate gene behind the QTL. This can be done
by several approaches such as generation of
transgenic plants with the identified putative
candidate genes and generation of independently
derived mutant alleles at the target gene (referred
to as recombinational or mutant analysis).
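The co-segregation test in step 2 can be sketched as a count of recombinants between each marker and the target gene. The marker names and allele calls below are hypothetical, and the sketch assumes that the genotype at the target gene can be inferred for every individual (for example from progeny testing).

```python
def recombinants(marker_calls, gene_calls):
    """Return indices of individuals whose marker allele differs from the gene allele."""
    return [i for i, (m, g) in enumerate(zip(marker_calls, gene_calls)) if m != g]


def cosegregating_markers(marker_table, gene_calls):
    """Return names of markers that show no recombination with the target gene."""
    return [name for name, calls in marker_table.items()
            if not recombinants(calls, gene_calls)]


# Hypothetical calls for six individuals: "A" = parent 1 allele, "B" = parent 2 allele.
gene_calls = ["A", "A", "B", "B", "A", "B"]
marker_table = {
    "M1": ["A", "B", "B", "B", "A", "B"],  # one recombinant (individual 1)
    "M2": ["A", "A", "B", "B", "A", "B"],  # co-segregates with the gene
    "M3": ["A", "A", "B", "A", "A", "B"],  # one recombinant (individual 3)
}
print(cosegregating_markers(marker_table, gene_calls))
```

Markers with recombinants on either side of a co-segregating marker delimit the candidate interval, which is why the larger populations mentioned above are needed: more meioses give more recombinants and a narrower interval.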
Map-based cloning was first successfully
employed in a mammalian system, for the cystic
fibrosis gene. In plants, it has been demonstrated
on several occasions. For example, map-based
cloning has been applied for isolating the ABI3 gene
and an omega-3 fatty acid desaturase gene in
Arabidopsis. Similarly, fruit weight2.2 in tomato,
teosinte branched1 (tb1) in maize, heading
date1, Sub1 and SalT in rice and FRIGIDA
and CRYPTOCHROME2 in Arabidopsis have
been isolated using positional cloning approaches.
The map-based cloning of the sd-1 gene, as an
example, is explained here briefly. Several studies
have reported that sd-1 is closely linked to several
molecular markers on chromosome 1; however,
the resolution of these genetic analyses was not
sufficient to identify the gene responsible for the trait, semi-dwarfism (sd). By employing advanced positional
cloning strategies with high-throughput genetic
mapping using CAPS, dCAPS or single nucleotide polymorphism (SNP) markers, Monna et al.
(2002) successfully identified sd-1 as a single
open reading frame (ORF) which encoded gibberellin 20-oxidase, the key enzyme in the gibberellin
biosynthesis pathway. Analysis of 3,477 segregants using several PCR-based marker technologies, including CAPS, derived-CAPS and SNPs,
revealed one ORF in a 6-kb candidate interval.
Normal-type rice cultivars have an identical
sequence in this region, consisting of 3 exons
(558, 318 and 291 bp) and 2 introns (105 and
1,471 bp). Dee-Geo-Woo-Gen-type sd-1 mutants
have a 383-bp deletion from the genome (278-bp
deletion from the expressed sequence), from the
middle of exon 1 to upstream of exon 2, including
a 105-bp intron, resulting in a frameshift that produces a termination codon after the deletion site.
The radiation-induced sd-1 mutant Calrose 76 has
a 1-bp substitution in exon 2, causing an amino
acid substitution (Leu [CTC] to Phe [TTC]).
Expression analysis suggests the existence of at
least one more locus of gibberellin 20-oxidase which
may prevent severe dwarfism from developing in
sd-1 mutants. Monna et al. (2002) thus
demonstrated the potential of accelerated positional cloning and its applications in plants.
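The figures quoted for the two sd-1 alleles can be checked with simple arithmetic. The sketch below uses only the numbers given in the text. It assumes that the 278 bp removed from the expressed sequence all lie within the coding region, and the codon table contains only the two codons mentioned (both assignments follow the standard genetic code).

```python
# Figures quoted in the text for the Dee-Geo-Woo-Gen-type sd-1 allele.
GENOMIC_DELETION_BP = 383
EXPRESSED_DELETION_BP = 278
INTRON_1_BP = 105

# The genomic deletion should equal the expressed-sequence deletion plus the lost intron.
deletion_is_consistent = EXPRESSED_DELETION_BP + INTRON_1_BP == GENOMIC_DELETION_BP

# A coding-sequence deletion that is not a multiple of 3 shifts the reading frame.
shifts_reading_frame = EXPRESSED_DELETION_BP % 3 != 0

# Only the two codons named in the text for the Calrose 76 allele.
CODON_TO_AMINO_ACID = {"CTC": "Leu", "TTC": "Phe"}


def substitute_base(codon, index, new_base):
    """Return the codon with the base at the given 0-based index replaced."""
    return codon[:index] + new_base + codon[index + 1:]


mutant_codon = substitute_base("CTC", 0, "T")
print(deletion_is_consistent, shifts_reading_frame)
print(CODON_TO_AMINO_ACID["CTC"], "->", CODON_TO_AMINO_ACID[mutant_codon])
```

Under that assumption, the 278-bp loss is not divisible by three, which is consistent with the frameshift and premature termination codon reported for the deletion allele.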

Validation of QTLs
The markers identified in preliminary genetic
mapping studies are seldom suitable for marker-assisted selection without further testing, validation and additional development. Markers that
are not adequately tested before use in MAS programs may not be reliable for predicting phenotype and will therefore be of little use. Generally, the
steps required for the development of markers for
use in MAS include high-resolution mapping,
validation of markers and possibly marker conversion, testing the markers in related germplasm
accessions and testing the genes isolated from the
map-based cloning using transgenic tests. The
procedure of fine mapping and its importance
have been discussed above and the rest is discussed hereunder.

Testing the Markers in Related Germplasm Accessions
Generally, markers should be validated by testing
their effectiveness in determining the target phenotype in independent populations and different
genetic backgrounds, which is referred to as
marker validation. In other words, marker validation involves testing the reliability of markers to
predict phenotype. This indicates whether or not a
marker could be used in routine screening for
MAS. Markers should also be validated by testing
for the presence of the marker on a range of cultivars and other important genotypes that possess
the target trait. Even when a single gene controls
a particular trait, there is no guarantee that
DNA markers identified in one population will be
useful in different populations, especially when
the populations originate from distantly related
germplasm. For markers to be most useful in
breeding programs, they should reveal polymorphism in different populations derived from a wide
range of different parental genotypes.
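One simple way to summarise a validation test is the fraction of accessions in which the marker allele correctly predicts the phenotype. The sketch below is an assumption about how such a summary could be computed, not a procedure described in the text; the allele labels and records are hypothetical.

```python
def marker_prediction_accuracy(records, favourable_allele):
    """Fraction of accessions in which the marker allele agrees with the observed trait.

    records: iterable of (marker_allele, has_trait) pairs, where has_trait is a bool.
    """
    records = list(records)
    correct = sum((allele == favourable_allele) == has_trait
                  for allele, has_trait in records)
    return correct / len(records)


# Hypothetical validation panel: marker allele "R" is expected to indicate resistance.
panel = [("R", True), ("R", True), ("S", False), ("R", False)]
print(marker_prediction_accuracy(panel, favourable_allele="R"))
```

A marker that predicts well in the mapping population but poorly in such a panel is the situation the paragraph above warns about.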
There are two instances where markers may
need to be converted into other types of markers:
when there are problems of reproducibility (e.g.
RAPDs) and when the marker technique is complicated, time consuming or expensive (e.g. RFLPs
or AFLPs). The problem of reproducibility may be
overcome by the development of SCARs or STSs
derived by cloning and sequencing specific RAPD
markers (see chapter 3 for more details). SCAR
markers are robust and reliable. They detect a single locus and may be co-dominant. RFLP and
AFLP markers may also be converted into SCAR
or STS markers. The use of such PCR-based markers that are converted from RAPD, RFLP or AFLP
markers is technically simpler, less time consuming and cheaper. In addition, STS markers may
also be transferable to related species.

Bibliography
Literature Cited
Monna L, Kitazawa N et al (2002) Positional cloning of
rice semidwarfing gene, sd-1: rice "green revolution
gene" encodes a mutant enzyme involved in gibberellin synthesis. DNA Res 9:11–17
Jansen RC, Nap JP (2001) Genetical genomics: the added
value from segregation. Trends Genet 17:388–391

Further Readings
Holloway B, Li B (2010) Expression QTLs: applications
for crop improvement. Mol Breed 26:381–391
Kliebenstein D (2009) Quantitative genomics: analyzing
intraspecific variation using global gene expression polymorphisms or eQTLs. Annu Rev Plant Biol 60:93–114
Paran I, Zamir D (2003) Quantitative traits in plants: beyond the
QTL. Trends Genet 19(6):303–306

Marker-Assisted Selection

Conventional plant breeding is largely dependent
on selection of desirable plants, the performance of which is strongly
influenced by genotype × environment interaction. Selecting plants in a segregating progeny
that contain appropriate combinations of genes is
a critical component of plant breeding. Usually,
breeders improve crops by crossing plants with
desired traits, such as high yield or disease
resistance, and selecting the best offspring over
multiple generations of testing under multilocation trials. Thus, to develop a new variety, it
may take 10–15 years. Any technique that may
speed up this process or make it more efficient is
a boon to breeders.
Molecular marker technology offers such a possibility. Marker-assisted selection (MAS) involves
selecting individuals based on their marker pattern
(genotype) rather than their observable traits
(phenotype). The term marker-assisted selection
was first used by Beckmann and Soller in 1986.
Since then, the term marker-assisted selection has
attracted plant breeders and geneticists, and subsequently, both the numbers of publications on MAS
and on QTL mapping have increased dramatically.
Sometimes, the term SMART breeding, an acronym for Selection with Markers and Advanced
Reproductive Technologies, which was first used
in animal breeding, is used to describe marker-supported breeding strategies. In some publications, genotype-assisted selection is also
used instead of MAS. Once markers that are tightly
linked to genes or QTLs of interest have been
identified, prior to field evaluation of large numbers
of plants, breeders may use specific DNA marker
alleles as a diagnostic tool to identify plants carrying
the genes or QTLs.
Major MAS methods include the following:
(1) Marker-assisted introgression or marker-assisted backcross, where one gene from a donor
line is introgressed into the genetic background of
a recipient parent by repeated backcrossing to the
recipient parent. Here, markers are used either to
control the presence of the target gene or to accelerate the return of background genome to recipient type. (2) Population screening: the simple
screening of populations (e.g. F2, F3, recombinant
inbred lines, doubled haploids) for genotypes of
interest based on markers. (3) Gene pyramiding
schemes, where two (or more) parent line(s), each
hosting one (or more) gene(s) of interest, are
crossed, then the offspring population is screened
for individuals carrying both (or all) genes of
interest. The process can be iterated further to
combine more genes. More complex methods are
(4) marker-based recurrent selection (several
generations of selection on markers with random
mating) and (5) selection on an index combining
molecular and phenotypic score. These methods
are discussed in detail in this chapter.
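For the gene pyramiding scheme, the population size needed for screening can be estimated with a short calculation. The sketch assumes an F2 population, unlinked target genes and a requirement that the selected plant be homozygous at every target locus (expected frequency 1/4 per gene); these assumptions, the function names and the 95% confidence level are illustrative and not taken from the text.

```python
import math


def f2_fraction_homozygous(n_genes):
    """Expected F2 frequency of plants homozygous for the target allele at all unlinked loci."""
    return 0.25 ** n_genes


def min_population_size(target_fraction, confidence=0.95):
    """Smallest n such that P(at least one target plant among n) >= confidence."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - target_fraction))


for n_genes in (1, 2, 3):
    fraction = f2_fraction_homozygous(n_genes)
    print(n_genes, fraction, min_population_size(fraction))
```

The required population grows quickly with each added gene, which is one reason pyramiding is often carried out stepwise over several crosses.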

Advantages of MAS
MAS can theoretically enhance breeders' selection efficiency because:
1. It can be performed on seedling material, thus reducing the time required before a plant's genotype is known. In contrast, many important plant traits are observable only when the plant has reached flowering or harvest maturity. Knowing a plant's genotype before flowering can be particularly useful in order to plan the appropriate crosses between selected individuals.
2. MAS is not affected by environmental conditions. Some crop production constraints (such as disease, insect pests, temperature and water stress) occur sporadically or non-uniformly. Therefore, evaluating resistance to those constraints may not be possible in a given year or location. MAS offers the chance to determine a plant's resistance level independent of environment.
3. When recessive alleles determine the trait of interest, they cannot be detected through phenotypic evaluation of heterozygous backcross plants, because their presence is masked by the dominant allele. In a traditional backcross program, plants with recessive alleles are identified by progeny evaluation after self-pollination or testcrossing to a recessive tester. This time-consuming step can be eliminated in a MAS program, because recessive alleles are identified by appropriate linked markers.
4. Gene pyramiding or combining multiple genes simultaneously: When multiple resistance genes are pyramided (or combined) together in the same variety or breeding line, the presence of each individual gene is difficult to verify phenotypically. The presence of one resistance gene may conceal the effect of additional genes. This problem can be overcome if markers are available for each of the resistance genes.
5. Selecting for traits with low heritability: Environmental variation in the field reduces a trait's heritability, the proportion of phenotypic variation that is due to genetics. In a low heritability situation, progress from phenotypic selection will be slow, because so much of the variation for the trait is due to environmental variation, experimental error or genotype × environment interaction, and will not be passed on to the next generation. If a reliable marker for a trait is available, MAS can result in greater progress than phenotypic selection in such a situation.
6. Elimination of unreliable phenotypic evaluation associated with field trials due to environmental effects.
7. Testing for specific traits where phenotypic evaluation is not feasible (e.g. quarantine restrictions may prevent exotic pathogens from being used for screening).
8. MAS may be cheaper and faster than conventional phenotypic assays, depending on the trait. For example, evaluating nematode resistance is usually an expensive operation because it requires artificial inoculation of plants with nematode eggs, followed by a labour-intensive technique to count the number of nematodes present. Selecting on the basis of a reliable marker would probably be cost-effective in this case. On the other hand, plant height is cheap and easy to measure, so there may not be an economic advantage in using markers for that trait, and conventional selection is sufficient. Economic aspects of MAS in a maize breeding program are discussed in detail in several publications; readers are referred to Dreher et al. (2003) and Morris et al. (2003). Economics will be a major driver of the application of MAS. For certain traits that are expensive or logistically difficult to evaluate, MAS is an attractive alternative. Time savings obtained through MAS may be as important as cost savings where there are competitive markets for improved cultivars. Any cost change in DNA extraction or genotyping methods, or on the other hand, in phenotypic evaluation methods, will affect the relative economic benefits of MAS.
9. A consideration that may affect the cost-effectiveness of MAS is that multiple markers can be evaluated using the same DNA sample. Once DNA is extracted and purified, it may be used for multiple markers, for the same or different traits, thus reducing the time and cost per marker.
10. Markers can be applied in the choice of parents in crossing programs. Here, they can either help to maximise diversity, and in this way support the exploitation of heterosis, or they can minimise diversity, if gene complexes built up in elite inbred germplasm are to be preserved.
11. Recessive genes can be maintained without the need for progeny tests in each generation, as homozygous and heterozygous plants can be distinguished with the aid of co-dominant markers. In backcrossing, DNA markers can help to minimise linkage drag around the target gene and reduce the generations required to recover a recurrent parent's genetic background.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_8, Springer India 2013

Limitations in MAS
MAS is not universally advantageous and cannot
be applied to all the traits in all the crops. Some
limitations of the technique are briefly discussed
hereunder:
1. MAS may be more expensive than conventional techniques, especially for start-up
expenses and labour costs. In certain situations, conventional breeding methods may be
sufficient to meet the breeding objective. An
important consideration for MAS, often not
reported, is that while markers may be cheaper
to use, there is a large initial cost in their
development.
2. Recombination between the marker and the
gene of interest may occur, leading to false
positives. For example, if the marker and the
gene of interest are separated by 5 cM and
selection is based on the marker pattern, there
is an approximately 5% chance of selecting
the wrong plant. This is based on the general
guideline that across short distances, 1 cM of
genetic distance is approximately equal to 1%
recombination. The breeder will need to
decide the error rate that is acceptable in the
MAS program, keeping in mind that errors are
also usually involved in phenotypic evaluation. To reduce this problem, it may be necessary to use flanking markers on either side
of the QTL of interest to increase the probability that the desired gene is selected.
3. Sometimes, markers that were used to detect a
locus must be converted to breeder-friendly
markers that are more reliable and easier to
use. For example, RFLP markers may need to be
converted to STS markers, and RAPD markers
to SCAR markers, for greater
reliability.
4. Imprecise estimates of QTL locations and
effects may result in slower progress than
expected. Many QTLs have large confidence
intervals of 20 cM or more, or their relative
importance in explaining trait inheritance may have
been overestimated.
5. Markers developed for MAS in one population may not be transferable to other populations, either due to lack of marker
polymorphism or the absence of a marker–trait
association.
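The error rates discussed in limitation 2 can be put into numbers with a short sketch. It uses the guideline quoted there (1 cM is approximately 1% recombination over short distances) and additionally assumes no crossover interference, so that the probability of a double crossover is the product of the two single-interval rates. The function names are hypothetical.

```python
def single_marker_error(distance_cm):
    """Approximate chance that a plant selected on one marker lacks the target gene."""
    return distance_cm / 100.0


def flanking_marker_error(left_cm, right_cm):
    """Chance that the gene is lost although both flanking markers carry the donor allele.

    Assumes no interference: a double crossover occurs with probability r1 * r2.
    """
    r1, r2 = left_cm / 100.0, right_cm / 100.0
    double_recombinant = r1 * r2
    non_recombinant = (1 - r1) * (1 - r2)
    return double_recombinant / (double_recombinant + non_recombinant)


print(f"single marker at 5 cM: {single_marker_error(5):.3f}")
print(f"flanking markers at 5 cM each: {flanking_marker_error(5, 5):.4f}")
```

Under these assumptions, selecting on two flanking markers lowers the error from about 5% to well under 1%, at the price of discarding plants that are recombinant in either interval.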

Prerequisites for an Efficient Marker-Assisted Selection Program
Before practising MAS, the following important
requirements should be considered in detail to
ensure successful implementation.
High-Throughput DNA Extraction and Marker
Technology: Most breeding programs would need
to screen hundreds to thousands of plants for
desired marker patterns. In many cases, the results
will be needed quickly to allow the breeder to
make selections in a timely manner. Both of these
considerations demand a simple and efficient
DNA extraction system that can handle a large
number of samples in a streamlined operation and
low-cost, high-throughput marker technology.
Labs conducting MAS should develop a
strategy that extracts DNA from small tissue samples in 96- or even 384-well plates and assays the
markers tightly linked to the desired QTL
within a reasonable period of time. Although
DNA markers have received the most attention,
other types of markers (protein, morphological,
cytological) can also be used in MAS programs.
For efficient MAS, important attributes of markers include ease of use, small amount of DNA
required, low cost, repeatability of results, high
rate of polymorphism, occurrence throughout the
genome and co-dominance. As stated earlier,
co-dominance is the ability to detect both parental
forms of a marker in heterozygotes. It is an advantage when heterozygous individuals are screened,
such as in backcross breeding programs or in an
F2 population. SSRs combine the desirable features listed above and are the current marker of
choice for many crop species. SNPs require more
detailed knowledge of the specific, single nucleotide DNA changes responsible for genetic variation among individuals. Only a small number of
SNPs are currently available for MAS in plants,
but within a few years, many more are expected to
be developed and may become an important
marker type for MAS.
Genetic Maps: Linkage maps provide a framework for detecting marker–trait associations and
for choosing markers to employ in MAS. Once a
marker is found to be associated with a trait in
a given population, a dense molecular marker
(or high-resolution or fine) map in a standard
reference population will help identify markers
that are closer to, or that flank, the target QTL.
Selection of QTLs for MAS: It is important to
decide the number of QTLs selected for MAS.
Theoretically, all markers that are tightly linked
to QTL could be used for MAS. However, due to
the cost of utilising several QTL, only markers
that are tightly linked to three QTLs are typically
used, although there have been reports of up to 5
QTLs being introgressed into tomato via MAS.
Even selecting for a single QTL via MAS can be
beneficial in plant breeding; such a QTL should
account for the largest proportion of phenotypic
variance for the trait. Furthermore, all QTLs
selected for MAS should be stable across
environments.
Knowledge of Associations and Validation
Between Molecular Markers and Trait of Interest:
The most crucial ingredient for MAS is knowledge of markers that are associated with the given
traits. This information on marker validation
might collectively come from QTL studies,
bulked segregant analysis, classical mutant analysis, fine mapping, comparative mapping, map-based cloning or some other means.

Efficient Data Management System: Large numbers
of samples are handled in an MAS program, with
each sample potentially evaluated for multiple
markers. This situation requires an efficient system for labelling, storing, retrieving and analysing
large data sets, and producing reports useful to
the breeder.

Procedure for a Generalised MAS Program for Selection from Breeding Lines/Populations
The simplified basic procedure (Fig. 8.1) for conducting MAS with DNA markers is as follows:
1. Extract DNA from tissue of each individual or
family in a population.
2. Screen DNA samples via PCR for the molecular markers linked to the QTL.
3. Analyse the PCR products, using an appropriate separation and detection technique such
as agarose gel electrophoresis.
4. Identify individuals having the desired marker
allele linked to target QTL.
5. Combine the marker results with other selection
criteria (e.g. phenotypic data or other marker
results), select the progenies of the population that
are positive for the given marker allele and advance
those individuals in the breeding program.
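Steps 4 and 5 amount to filtering the scored population. The sketch below assumes that gel scores have already been converted to genotype calls for a co-dominant marker; the plant identifiers, genotype labels and the optional second criterion are hypothetical.

```python
def select_individuals(marker_calls, desired_genotype, passes_other_criteria=None):
    """Return plant IDs carrying the desired marker genotype and meeting any extra criterion.

    marker_calls: dict mapping plant ID to a genotype call (e.g. "RR", "RS", "SS").
    passes_other_criteria: optional function taking a plant ID and returning a bool.
    """
    selected = []
    for plant_id, genotype in marker_calls.items():
        if genotype != desired_genotype:
            continue  # step 4: discard plants lacking the desired marker genotype
        if passes_other_criteria is not None and not passes_other_criteria(plant_id):
            continue  # step 5: combine with phenotypic data or other marker results
        selected.append(plant_id)
    return selected


# Hypothetical F2 scores and a hypothetical set of plants with acceptable plant type.
marker_calls = {"F2-01": "RR", "F2-02": "RS", "F2-03": "SS", "F2-04": "RR"}
good_plant_type = {"F2-01", "F2-02"}

print(select_individuals(marker_calls, "RR"))
print(select_individuals(marker_calls, "RR", lambda pid: pid in good_plant_type))
```

Whether heterozygotes are kept or discarded depends on the breeding objective; this example keeps only plants homozygous for the desired allele.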
Markers are used for selecting qualitative as well
as quantitative traits. MAS can aid selecting for
all target alleles that are difficult to assay phenotypically. Especially in early generations, where
breeders usually restrict their selection activities
to highly heritable traits because a visual selection
for complex traits like yield is not possible
with only few plants per plot being available,
MAS is said to be effective, cost- and time-saving.
To improve early-generation selection, markers
should decrease the number of plants retained
due to their early-generation performance, and at the
same time they should ensure a high probability
of retaining superior lines. Important prerequisites
for successful early-generation selection with
MAS are large populations and low heritability
of the selected traits, as under individual selection, the relative efficiency of MAS is greatest
for characters with low heritability.
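The dependence of MAS efficiency on heritability can be made concrete with a standard quantitative-genetic approximation that is not given in the text. It assumes that the markers explain a proportion p of the additive genetic variance, that selection is on the marker score alone, and that the selection intensity equals that of individual phenotypic selection. The expected response relative to phenotypic selection is then sqrt(p / h²).

```python
import math


def relative_efficiency_marker_only(p_explained, heritability):
    """Response to selection on a marker score relative to individual phenotypic selection.

    p_explained: proportion of additive genetic variance associated with the markers.
    heritability: narrow-sense heritability of the trait (0 < h2 <= 1).
    """
    return math.sqrt(p_explained / heritability)


for h2 in (0.1, 0.3, 0.5):
    print(h2, round(relative_efficiency_marker_only(0.2, h2), 2))
```

With markers explaining 20% of the additive variance, marker-only selection is more efficient than phenotypic selection only when heritability is below 0.2, which matches the statement that the advantage is greatest for low-heritability traits.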

[Figure 8.1: flowchart. Parents P1 (S) and P2 (R) are crossed to give the F1, which is selfed to generate a large F2 population. Steps shown: identify a molecular marker linked to the trait of interest (for example, R = resistant and S = susceptible to disease; R and S lines have different banding patterns); extract DNA from tissue of each individual; assay the DNA samples for the marker (e.g. using PCR); analyse the products (e.g. agarose gel electrophoresis of PCR products); identify individuals having the desired marker allele (lines having the S banding pattern and heterozygotes are removed); combine the marker results with other selection criteria and advance those individuals.]

Fig. 8.1 Basic procedure in MAS

Marker-Assisted Backcross Breeding


Using conventional breeding methods, it typically takes 6–8 backcrosses to fully recover the
recurrent parent genome. The theoretical proportion of the recurrent parent genome after n generations of backcrossing is given by
$$\frac{2^{n+1} - 1}{2^{n+1}}$$
(where n = number of backcrosses, assuming an
infinite population size). The percentages of
recurrent parent recovery after each backcross
generation are presented in Table 8.1. The percentages shown in Table 8.1 are only achieved
with large populations; the percentages are usually lower in smaller population sizes that are
typically used in actual plant breeding programs.
Although the average percentage of the recurrent parent genome is 75% (for the entire BC1
population), some individuals possess more of
the recurrent parent genome than others.

Table 8.1 Percentage of recurrent parent genome after backcrossing

Backcross generation    Percentage of recurrent parent genome
BC1                     75.0
BC2                     87.5
BC3                     93.8
BC4                     96.9
BC5                     98.4
BC6                     99.2
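
The values in Table 8.1 follow directly from the formula for the expected recurrent parent proportion. A minimal sketch (the function name is chosen here for illustration only):

```python
def recurrent_parent_fraction(n):
    """Expected fraction of the recurrent parent genome after n backcrosses,
    (2**(n+1) - 1) / 2**(n+1), assuming an infinite population size."""
    return (2 ** (n + 1) - 1) / 2 ** (n + 1)

# Print the expected percentage for BC1 to BC6.
for n in range(1, 7):
    print(f"BC{n}: {100 * recurrent_parent_fraction(n):.2f}%")
```

These are population averages; as noted above, individual plants within a backcross generation deviate from them.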

Therefore, if tightly linked markers flanking the QTL
and evenly spaced markers from other chromosomes (i.e. unlinked to QTLs) of the recurrent
parent are used for selection, the introgression of
QTLs and recovery of the recurrent parent may
be accelerated. This process is called marker-assisted backcrossing (MABC). MABC is generally
successful, except when the effect of
the target gene is unstable (e.g. a QTL of low
effect on a complex trait). MABC is
considered the simplest form of MAS, in which
the goal is to incorporate a major gene from the


Fig. 8.2 Schematic representation of marker-assisted backcross program for single QTL. Two to three QTLs can be backcrossed with the same process but larger populations are required at each generation. For more loci, conduct parallel MABC and combine the loci at the end (i.e. by crossing final BC3F1s). [Flow chart: select 2–4 polymorphic markers per chromosome (as background markers), 2–3 flanking markers on each side of the target QTL (as recombinant markers) and tightly linked markers (as foreground markers); cross recurrent parent × donor parent to obtain the F1; backcross to the recurrent parent and get 100–300 BC1F1 seeds; grow the plants and genotype for the chosen markers (foreground, recombinant and background selection); select the BC1F1 progenies based on recovery of the target QTL and background markers; cross recurrent parent × selected BC1F1 and get 100–300 BC2F1 seeds; continue the same process until BC3F1 (100–300 seeds); self the selected BC3F1; test the BC3F2 for homozygosity at the target QTL; multiply seed of the homozygous positive progenies.]

donor parent into an elite cultivar or a breeding
line (the recurrent parent). The use of additional
markers to accelerate cultivar development is
sometimes referred to as full MAS or complete
line conversion. In either case, the desired
outcome is a line containing only the major gene
from the donor parent, with the recurrent parent
genotype present everywhere else in the genome.
The use of markers can reduce the number of
generations required to achieve the desired proportion of the recurrent parent genome. For
example, whereas a conventional backcrossing program
takes six generations to achieve more than 99%
recurrent parent genome (Table 8.1), MABC can take only three
backcross generations (Fig. 8.2). In this
setting, two types of selection are
recognised:
1. Foreground selection, in which the breeder
selects plants having the donor parent marker allele
at the target locus (i.e. at the tightly linked marker,
or at the direct or perfect marker for the QTL). The objective is to
maintain the target locus in a heterozygous
state (one donor allele and one recurrent
parent allele) until the final backcross is
completed. Then, the selected plants are self-pollinated, and progeny plants that
are homozygous for the donor allele are identified.
Foreground selection is the part of MABC that
is the most similar to MAS. In this case, however, one of the goals besides the selection of
the target trait at each generation is to minimise the amount of linked genomic region
from the donor parent that ends up being transferred along with the trait. In traditional backcrossing, the linked regions from the donor
parent can cover a very large span of the
chromosome on either side of the introgressed

gene even after many generations of backcrossing. This can lead to linkage drag, where
deleterious traits from the donor parent are
inadvertently transferred to the recipient parent along with the target trait. Ensuring the
cleanest transfer of the target trait includes the
following steps: (a) the availability of several
closely linked markers on each side of the
target trait. This is easy for transgenic traits in
crops where a dense set of mapped markers is
available but could be harder to achieve if the
marker–trait linkage is not strong, and especially in the case of quantitative traits where
the region to introgress may be quite large. (b)
Enough plants are screened for the linked
markers at each generation to increase the
chances of recombination close to the target
region. This is done typically in two successive steps: (1) In the BC1 generation, the focus
is on finding the closest possible recombinations on one side of the target trait (besides
ensuring that the proper alleles on the other
side are still present). Enough plants are
selected at this stage to still allow for background selection (see below). (2) In the BC2
generation, the same takes place for the other
side of the target trait. (c) Selfing will then be
needed to fix the introgressed region. That will
be done at the end of the background selection
process, which may take an additional generation. This selection of a very clean introgression can thus be done quickly in two
generations of backcrossing. One caution is
that the size of the final donor region surrounding the introgressed gene will depend on
the intensity of the effort, especially in terms
of number of BC1 and BC2 plants that are
screened. Enough plants need to be screened
not only to find a close recombination at each
step (usually markers that flank the target
QTLs are used as recombinant markers) but
also to have enough plants remaining for a
sufficient background selection.
2. Background selection, in which the breeder
selects for recurrent parent marker alleles in
all genomic regions except the target locus,
and the target locus is also additionally
selected based on phenotype. Background

selection is important in order to eliminate


potentially deleterious genes introduced from
the donor through linkage drag, the inheritance of unwanted donor alleles in the same
genomic region as the target locus. It was considered a problem that is difficult to overcome
with conventional backcrossing, but now it
can be addressed efficiently with the use of
markers. The background selection is focused
on recovering as much as possible of the
genome of the recurrent parent on the chromosomes not carrying the target trait (that particular chromosome is primarily handled as
part of the foreground selection). The concept
is to use a set of well-spaced markers that
cover all those chromosomes. At each backcross generation, the plants preselected from
the foreground selection step are genotyped
for this array of markers and scored for their
similarity to the genome of the recurrent
parent. At each generation, the plants that
have recovered the most of the recurrent
parent are used for the next generation of
backcrossing. Plants with more than 95%
recovery of the recurrent parent's genome can
be obtained by the BC2 or BC3 generation
depending on the intensity of the work done.
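
The two selection steps described above can be sketched as a small scoring routine. All plant names and marker scores below are hypothetical; 'H' denotes a heterozygous (donor/recurrent) locus and 'R' a locus homozygous for the recurrent parent allele, which are the only two classes expected in a backcross generation.

```python
# Hypothetical BC1F1 marker data: one foreground (target) marker and
# eight background markers on non-carrier chromosomes.
plants = {
    "plant_1": {"target": "H", "background": "RRHRRHRR"},
    "plant_2": {"target": "R", "background": "RRRRRRRR"},
    "plant_3": {"target": "H", "background": "RRRRRHRR"},
    "plant_4": {"target": "H", "background": "HHRHRHRH"},
}

def recurrent_recovery(background):
    """Fraction of recurrent parent alleles at the background markers.
    An 'R' locus carries two recurrent alleles, an 'H' locus carries one."""
    recurrent_alleles = sum(2 if g == "R" else 1 for g in background)
    return recurrent_alleles / (2 * len(background))

def select_plants(plants, keep=2):
    """Foreground selection (keep carriers of the donor allele), then
    background selection (rank carriers by recurrent parent recovery)."""
    carriers = [name for name, p in plants.items() if p["target"] == "H"]
    ranked = sorted(
        carriers,
        key=lambda name: recurrent_recovery(plants[name]["background"]),
        reverse=True,
    )
    return ranked[:keep]

print(select_plants(plants))
```

Here plant_2 is discarded despite its clean background because it has lost the donor allele at the target locus, which is why foreground selection is applied before background scoring.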
In practice, both foreground and background
selections are often conducted in the same
backcross program, either simultaneously or
sequentially. However, the efficiency of marker-assisted backcrossing depends on a number of
factors, including the population size of each
backcross generation, the distance of markers from
the target locus and the number of background
markers used. Experienced MAS researchers
have shown that the recurrent
parent genome is recovered faster with MAS than with conventional backcrossing when foreground and background selection are combined. The recurrent
parent genome is recovered more slowly on the
chromosome carrying the target locus than on
other chromosomes because of the difficulty in
breaking linkage with the target donor allele.
Refer to the further readings (particularly Neeraja
et al. 2007) for methods for optimising sample
sizes and selection strategies in marker-assisted
selection.
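
To give a feel for how marker distance drives population size, the following rough sketch estimates how many backcross plants must be screened to find at least one that carries the donor allele at the target locus and is recombinant between the target and one flanking marker. It assumes the Haldane mapping function (no interference), a single flanking side and independent plants; these assumptions and the function names are illustrative and are not the methods of the cited papers.

```python
import math

def haldane_r(d_cM):
    """Recombination fraction for a map distance of d_cM centimorgans
    under the Haldane mapping function."""
    return 0.5 * (1 - math.exp(-2 * d_cM / 100))

def min_plants(d_cM, prob=0.99):
    """Smallest number of backcross plants such that, with probability
    `prob`, at least one plant carries the donor allele at the target
    (probability 1/2) and is recombinant in the interval to a flanking
    marker d_cM away (probability r)."""
    p = 0.5 * haldane_r(d_cM)
    return math.ceil(math.log(1 - prob) / math.log(1 - p))

# Closer flanking markers give a cleaner introgression but need more plants.
for d in (1, 5, 10, 20):
    print(f"flanking marker at {d:>2} cM: screen at least {min_plants(d)} plants")
```

The steep rise in required plants for tightly linked flanking markers illustrates why the size of the final donor segment depends on the number of BC1 and BC2 plants screened.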

The procedure below describes the MABC process for a single locus:
1. Selection of markers
Two to four well-spread polymorphic markers
per chromosome should be selected for
background (recurrent genome) selection.
Similarly, two or three flanking markers on
each side of the target QTL should be selected.
If the QTL is 25 cM apart from the markers,
it is better to find more markers in that interval, and
those additional markers should also be used
to introgress the target QTL.
2. Crossing program
Start the crossing program between the recurrent parent (elite line or cultivar) and the donor
parent (which contains the target QTL) and
obtain the F1 plants. The F1 plants are then
backcrossed with the recurrent parent to obtain
100–300 BC1F1 seeds.
3. Genotyping of BC1F1
Grow all the BC1F1 seeds and genotype them
with the chosen foreground and background
markers. The BC1F1 plants are selected based
on (1) close recombination on one side of
target QTL (between two flanking markers)
and (2) best recovery of recurrent background
at noncarrier chromosomes.
4. Repeat steps 2 and 3 until 100–300 BC3F1 seeds
are produced
5. Selfing and genotyping
Self all the selected BC3F1 progenies and genotype the selfed progenies for homozygosity
at introgressed QTL. Bulk all the homozygous
positive progenies and increase the seeds
through selfing and make a final genotyping
test before proceeding to multi-location trials for evaluation of the phenotype governed by the target QTL. The same procedure
can be followed to backcross two or three
QTLs at the same time, but larger populations
will be needed at each generation (e.g. for three
QTLs, we may need up to 1,000 progenies).
Alternatively, conduct parallel MABC for
each selected QTL and combine the loci at the
end by crossing the final BC3F1s.
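
The need for larger populations with more QTLs can be made concrete with a simple calculation. The sketch below assumes unlinked target loci, so that a backcross plant is heterozygous at all k targets with probability (1/2)^k; the function names are hypothetical.

```python
import math

def carrier_fraction(k):
    """Expected fraction of backcross progeny heterozygous at all k
    unlinked target loci."""
    return 0.5 ** k

def min_population(k, prob=0.99):
    """Smallest population giving at least one full carrier with
    probability `prob`."""
    p = carrier_fraction(k)
    return math.ceil(math.log(1 - prob) / math.log(1 - p))

for k in range(1, 5):
    print(f"{k} QTL(s): carrier fraction {carrier_fraction(k):.4f}, "
          f"at least {min_population(k)} plants for one carrier")
```

This gives only the minimum needed to recover a single carrier. Practical population sizes are far larger, because enough carriers must remain for recombinant and background selection among them.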
It should also be noted that use of markers to
select for multiple QTLs is more complex, and
less proven, than selection for a single gene.
Population sizes required to recover individuals

with all the desired marker patterns increase


exponentially with the number of QTLs involved.
In a backcrossing scheme, there may be little
opportunity to select for the recurrent parent
genome, because few individuals will have the
desired marker pattern at all the target loci.
If some of the genes are QTLs, whose locations
and effects are often imprecisely estimated, then
there is uncertainty that the results of MAS will
meet expectations. Finally, the more the genes
undergoing selection, the greater the chances of
incorporating unfavourable alleles through linkage drag. Hence, the following suggestions are
proposed for selecting multiple QTLs:
1. Limit the number of QTLs undergoing selection to three or four.
2. Target only verified QTLs that have medium
to large effects and that are consistently
detected in several environments.
3. Examine the QTL analysis results carefully to
decide which markers to select (usually both
the markers that flank the selected QTL).
4. If desired, an index can be constructed that
weights some markers differently than others,
depending on their relative importance in
terms of effect sizes (and/or contribution to
the expression of phenotype).
5. When more than two QTLs are involved, consider a stepwise backcrossing procedure. For
example, if four target QTLs are to be introgressed into the same genetic background, one
could first conduct two parallel backcross
schemes, each incorporating two target QTLs.
Then, the selected individuals from each
scheme are crossed and plants with all four
targets identified. This procedure gives greater
opportunity to conduct background selection
for the recurrent parent genome than selecting
for all four targets simultaneously.
6. Alternatively, F2 enrichment, backcrossing
and inbreeding can be employed (Bonnett
et al. 2005) to reduce the population size
needed to attain selection goals.
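
The marker index mentioned in suggestion 4 can be sketched as a weighted sum of favourable allele counts. The weights, QTL names and genotypes below are invented for illustration; in practice the weights would come from the estimated QTL effects.

```python
# Hypothetical weights, e.g. proportional to estimated QTL effect sizes.
weights = {"qtl_1": 0.5, "qtl_2": 0.3, "qtl_3": 0.2}

def marker_index(genotype, weights):
    """Weighted sum of favourable allele counts (0, 1 or 2 per marker)."""
    return sum(weights[m] * genotype[m] for m in weights)

# Hypothetical lines scored for the number of favourable alleles per QTL.
lines = {
    "line_A": {"qtl_1": 2, "qtl_2": 0, "qtl_3": 2},
    "line_B": {"qtl_1": 1, "qtl_2": 2, "qtl_3": 2},
    "line_C": {"qtl_1": 0, "qtl_2": 2, "qtl_3": 2},
}

ranked = sorted(lines, key=lambda name: marker_index(lines[name], weights), reverse=True)
print(ranked)
```

Because the index depends entirely on the estimated effects, it inherits their uncertainty, which is one reason for restricting selection to verified QTLs.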
Another important point to be considered here
is that MAS never replaces phenotypic selection
entirely. Especially for disease resistances, a
final testing of breeding lines is always required,
regardless of how tightly a marker is linked to a QTL.
There is no doubt that the collection and use of very

high quality phenotypic data are critical for the


application of MAS. It is also concluded that it is
risky to carry out selection solely on the basis of
marker effects, without confirming the estimated
effects by phenotypic evaluation, and further
that laboratory-based breeding should remain
the servant of the field breeder and not its master. Further, it has been observed that backcrossing is a very conservative breeding strategy and
should not become the prime focus of a breeding
program, as it hardly ever broadens the
genetic basis of plants in a substantial way. To
overcome the limitation of only being able to
improve existing elite genotypes, other approaches
like marker-assisted recurrent selection (see
below) have to be considered.

Gene Pyramiding or Stacking


In many cases, the breeder's goal will not be to
introgress a single trait but potentially to introgress several traits at the same time, possibly
from different sources. Instead of trying to handle
all those traits together in the backcrossing process, the best approach usually is to perform all
those conversions into the same background individually in parallel and then to intercross the final
single conversions to combine the traits together
(see above). In that case, only MAS is needed at
the end since the narrowing of the introgressed
regions through foreground selection and the
recovery of the recurrent parent through background selection have already been done for each
individual trait.
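
When the final single conversions are intercrossed, the F1 is heterozygous at every introgressed locus, and the pyramid has to be fixed by selfing and MAS. A minimal sketch of the expected frequencies, assuming unlinked loci and simple Mendelian segregation (function names are illustrative):

```python
from fractions import Fraction

def fully_homozygous_fraction(n_loci):
    """Expected fraction of F2 plants homozygous for the desired allele at
    every one of n unlinked loci, when the F1 is heterozygous at all of them."""
    return Fraction(1, 4) ** n_loci

def plants_per_pyramid(n_loci):
    """Expected number of F2 plants screened per fully homozygous pyramid."""
    return 1 / fully_homozygous_fraction(n_loci)

for n in (1, 2, 3, 4):
    print(f"{n} loci: 1 in {plants_per_pyramid(n)} F2 plants carries the full pyramid")
```

The rapid fall in frequency with each added locus is one argument for combining loci stepwise rather than all at once.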
The most frequent strategy of pyramiding is
combining multiple resistance genes. Different
resistance genes can be combined in order to
develop broad-spectrum resistance to diseases
and insects. Either qualitative resistance genes
can be combined or quantitative resistances controlled by QTLs. An example for the combination of two resistance QTLs is the pyramiding of
a major stripe rust resistance gene and two QTLs
in the same genotype. In order to pyramid disease
or pest resistance genes that have similar phenotypic effects, and for which the matching races
are often not available, MAS might even be the
only practical method, especially where one gene
masks the presence of other genes. For example, the


Barley Yellow Mosaic Virus (BaYMV) complex
is a major threat to winter barley cultivation in
Europe. As the disease is caused by various
strains of BaYMV and Barley Mild Mosaic Virus
(BaMMV), pyramiding resistance genes seems
an intelligent strategy. However, phenotypic
selection cannot be carried out due to the lack of
differentiating virus strains. Thus, MAS offers
promising opportunities. Suitable strategies have
been developed for pyramiding genes against the
BaYMV complex. At the same time, pyramiding
has to be repeated after each crossing, because
the pyramided resistance genes are segregating in
the progeny.

Accelerated Methods of Gene Pyramiding
Gene pyramiding is considered as one of the
best MAS methods currently available (along
with marker-assisted introgression, which is
complementary since its aim is slightly different). However, even this method can accumulate only a couple of major genes from two
parents and requires a couple of generations. If
large sources of major genes were really to be
unlocked, then an efficient marker-assisted gene
pyramiding scheme would need to tackle
multiple, possibly linked, genes, from multiple
parents. Methodological developments in this
area are only starting and still need more work
(Hospital 2003).

Marker-Assisted Recurrent
Selection (MARS)
In marker-assisted recurrent selection (MARS),
the breeders take advantage of favourable alleles
originating from both parents involved in the
crossing program. QTL alleles impacting the
major traits of interest to the breeders are
identified within breeding populations and accumulated through successive intercrossing using
only genotypic selection. Recombined lines are
then subjected to a final phenotypic screen to
select the best varieties to release. This allows the

Fig. 8.3 Flow chart explaining marker-assisted recurrent selection. [Flow chart: Parent 1 × Parent 2 → F1 → F2 (generate 300 progenies using the single-seed descent method) → F3 → F3:4 → F3:5 (if required); genotyping, and phenotyping by evaluation at multiple locations; QTL analysis; modelling and selection of QTLs to recombine; identification of F3-derived progenies to recombine; genotype 8–16 seeds per progeny of F3:6 and select the best 8 plants (e.g. A–H) to cross; first recombination cycle: A × B, C × D, E × F and G × H, giving four F1s; second recombination cycle: two F1s; third recombination cycle: one F1, advanced through F2, F3 and F3:4 to multi-location phenotyping.]

generation of progenies with an optimum combination of key alleles from both parents that could
never be obtained by chance recombination alone.
Thus, MARS has a clear breeding objective, as
opposed to QTL discovery conducted in good ×
bad crosses. The concept is to identify QTL
effects for polygenic traits (usually minor) that
are specific to that population and to recombine
them via genotypic selection to generate superior
progenies for variety development. To do this,
de novo QTL detection is performed with each
population of interest and the best lines are
recombined to obtain a progeny that performs

better than either of the two parents (Fig. 8.3). In


contrast to MARS, which uses de novo QTL mapping as part of its process, the use of MAS or
MABC implies prior knowledge of mapping
information for the targeted traits.
If one of the two parents presents a large QTL
such as for a quality trait or biotic stress resistance (identified through published report or
historical data or de novo identification), such a
QTL can also be included in the selection and the
favourable allele is fixed at an early stage of recombination. MARS can be used to select for specific
traits like yield under water stress conditions,

but it should also include many other traits of


interest to the breeder (such as yield under optimal conditions, maturity, disease resistance) so
that the final selection of alleles to recombine can
take all those factors into account and negative
correlations between traits at a given locus can be
identified and/or eliminated. Thus, with the use
of markers, recurrent selection can be accelerated
considerably. In continuous nursery programs,
pre-flowering genotypic information is used for
marker-assisted selection and controlled pollination. Accordingly, several selection cycles are
possible within 1 year, accumulating favourable
QTL alleles in the breeding population.
Additionally, it is possible today to define an
ideal genotype as a pattern of QTLs, all QTLs
carrying favourable alleles from various parents.
If individuals are crossed based on their molecular marker genotypes as in MARS, it might be
possible to get close to the ideal genotype after
several successive generations of crossings. It is
likely that through such a MARS breeding
scheme, higher genetic gain will be achieved than
through MABC.
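
Choosing crosses from marker genotypes so as to approach an ideal genotype can be sketched as a search for complementary parents. The lines and QTL scores below are hypothetical; 1 means the line is fixed for the favourable allele at that QTL.

```python
from itertools import combinations

# Hypothetical favourable-allele profiles at six QTLs.
lines = {
    "A": (1, 1, 0, 0, 1, 0),
    "B": (0, 0, 1, 1, 0, 0),
    "C": (1, 0, 1, 0, 1, 0),
    "D": (0, 1, 0, 1, 0, 1),
}

def coverage(a, b):
    """Number of QTLs at which at least one parent carries the favourable allele."""
    return sum(x or y for x, y in zip(a, b))

# Pick the pair of lines whose cross could, in principle, combine the most
# favourable alleles in a single progeny.
best_pair = max(
    combinations(sorted(lines), 2),
    key=lambda pair: coverage(lines[pair[0]], lines[pair[1]]),
)
print(best_pair, coverage(lines[best_pair[0]], lines[best_pair[1]]))
```

This simple count ignores linkage and effect sizes, so it is only a starting point for the kind of modelling used to choose QTLs and parents in MARS.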
Basic Steps Involved in MARS
1. Selection of parents
MARS works best with populations that are
derived from good × good crosses, that is, using
parental lines that are used in a regular breeding
program. Excessive segregation for traits such
as maturity or height should be avoided to allow
a good quality yield evaluation. It is probably a
good idea to start more crosses between various
parents and then to focus the MARS project on
the most suitable populations.
2. Population development
MARS does not need very advanced populations, and F3-derived populations are generally sufficient. Progenies are advanced to the
F3 generation through single-seed descent
(single F3 plants are selfed to generate F3:4 or
F3:5 progenies, depending on the amount of
seed necessary for multi-location yield testing). The population size will depend on the
precision of QTL mapping desired by the
breeder and can range from 200 to 500.
Usually, the population size is made to fit a
96-well PCR plate format, so it would be a
multiple of a given number (92, 94 or other)
fitting in that format (if we include the parents
and maybe some checks).
3. Parental and progeny genotyping
MARS does not need a large density of markers since relatively little recombination has
taken place during the F3 population development. Typically, having markers covering the
genome with approximately a 10 cM average
distance between markers should be adequate.
SSRs or SNPs can be used but SNPs will
greatly facilitate the expansion to multiple
MARS projects. For large-scale MARS use,
the best would be to have the parental genotyping with a relatively high density of SNP
markers (1,000–2,000) so that specific sets of
SNPs polymorphic for a given MARS population can be quickly chosen. DNA samples are
obtained directly from the F3 plants or from
bulked F4 progenies from each F3 if more leaf
material is needed or if sampling could not be
done at the F3. These samples are genotyped at
the polymorphic loci identified from the
parental screening.
4. Phenotyping
Multi-location field trials, using replicated
experimental designs, are then conducted to
obtain good evaluation of the target traits
(refer to Chapter 5). Accurate plant phenotyping
is critical to the success of MARS. Evaluation
of nontarget traits segregating in the population can also generate new useful information,
including potential negative correlations with
target traits.
5. Identification of QTLs
Many QTL analysis procedures are available
for QTL identification for the traits of interest.
Using a selection index with different weights
given to various key traits is often useful
for final QTL selection. Ideally, the breeder
will use different models to compare the
results and decide on the QTLs to recombine
(refer to Chapter 6).
6. Recombination cycles
Once a set of key QTLs has been identified, a
few sets of F3-derived progenies are chosen
based on their complementarity for the presence
of favourable alleles and on their overall phenotypic performance. Several individual plants
(F4 or F5 depending on what makes the most

sense for that crop) of each progeny are grown


and genotyped (nearest marker to the QTL
peak, or flanking markers) to identify the best
individual plants to use in the recombination
crosses. An example would be to cross four
pairs of progenies (8 lines), then the two pairs of
resulting F1s in the second cycle, and then the
final two F1s in the final cycle. At each stage, the
F1s are genotyped and the best ones are used
again for the next cycle of recombination. At the
end of the process, the resulting lines are selfed
for a few generations for fixation.
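
The pairwise recombination scheme in the example above (eight lines, then four F1s, then two, then one) can be written out as a short routine. This is only a sketch of the bookkeeping: it assumes the number of starting lines is a power of two and does not model the genotyping and selection of F1s carried out between cycles.

```python
def recombination_cycles(lines):
    """Pair adjacent entries cycle by cycle until a single cross remains.
    Assumes len(lines) is a power of two (e.g. 8 selected plants)."""
    cycles = []
    current = list(lines)
    while len(current) > 1:
        crosses = [(current[i], current[i + 1]) for i in range(0, len(current), 2)]
        cycles.append(crosses)
        # The F1 of each cross becomes a parent in the next cycle.
        current = [f"({a} x {b})" for a, b in crosses]
    return cycles

for number, crosses in enumerate(recombination_cycles("ABCDEFGH"), start=1):
    print(f"cycle {number}: " + ", ".join(f"{a} x {b}" for a, b in crosses))
```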
In order to ensure the variability at the unselected loci for the final phenotypic evaluation, a
few different independent sets of parental progenies and several progenies from the final recombination cycles will be employed. Lines can also
be developed from each intermediate recombination step. The specific strategies used for the
recombination process will depend on the crop
(ease of crossing, number of progenies obtained
per cross, cycle length, etc.), on the number of
loci to recombine, and on the breeder's preference (which is again based on availability of
expertise/labour, resources, etc.).

Advanced Backcross (AB)-QTL Analysis
QTL studies using populations which carry
alleles of both parents at relatively high frequency
(e.g. F2, BC1) are well suited for QTL mapping,
but have some drawbacks when it comes to
detecting and transferring useful QTLs from
unadapted germplasm into elite breeding lines.
Undesirable QTL alleles from the unadapted parent occur in high frequency and epistatic interactions are likely to occur, because donor alleles are
present at a high frequency. Tanksley and Nelson
(1996) proposed a method for simultaneously
discovering valuable QTLs from unadapted germplasm (e.g. land races, wild species) and transferring them into elite breeding lines. The method
is named advanced backcross QTL analysis
(AB-QTL) and delays QTL analysis until the BC2
or BC3 generation. In BC1, negative selection is
conducted to reduce deleterious donor alleles,

while the BC2 and BC3 populations are evaluated


for traits of interest and genotyped using molecular
markers. In this way, the identification of QTL
happens while these QTLs are transferred into an
adapted genetic background. The AB-QTL
method can be employed to exploit unadapted
germplasm for the quantitative trait improvement
of crop plants and has been applied successfully
in several crop species, for example, barley,
maize, rice, tomato and wheat.

Mapping-As-You-Go (MAYG)
In 2004, Podlich et al. suggested the Mapping-As-You-Go (MAYG) approach, to overcome the
problem of inaccurate estimation of QTLs and
their effects. MAYG is a mapping-MAS strategy
that accounts for the presence of epistasis and
genotype by environment (G × E) interactions.
The effectiveness of the MAYG approach has
been investigated through simulation. In the
MAYG approach, estimates of QTL allele effects
are continually revised by remapping new elite
germplasm generated during cycles of MAS, thus
ensuring that QTL estimates remain relevant to
the current set of germplasm in the breeding
program. It is considered as a mapping-MAS
strategy that explicitly recognises that alleles of
QTL for complex traits can have different values
as the current breeding material changes with
time. The integration of genetic mapping and
MAS offers two major advantages: (1) ability to
carry out marker–trait association analysis using
breeding populations directly rather than having
to follow time-consuming development of genetic
populations and (2) combining marker–trait association development and validation. This saves
time, both in the process itself and in the generation of the necessary genetic materials.
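
The core idea of MAYG, re-estimating marker effects on the current breeding material at each cycle instead of relying on the initial estimates, can be sketched as follows. The data are invented, and the simple contrast of class means stands in for a proper QTL analysis; it is not the simulation model used by Podlich et al. (2004).

```python
from statistics import mean

def estimate_effect(population, marker):
    """Difference in mean phenotype between carriers (1) and non-carriers (0)
    of the favourable allele at one marker."""
    carriers = [ind["phenotype"] for ind in population if ind["markers"][marker] == 1]
    others = [ind["phenotype"] for ind in population if ind["markers"][marker] == 0]
    return mean(carriers) - mean(others)

def update_effects(population, markers):
    """Re-estimate all marker effects on the current population."""
    return {m: estimate_effect(population, m) for m in markers}

# Hypothetical germplasm from two successive cycles of selection.
cycle_1 = [
    {"markers": {"m1": 1, "m2": 0}, "phenotype": 12.0},
    {"markers": {"m1": 1, "m2": 1}, "phenotype": 14.0},
    {"markers": {"m1": 0, "m2": 1}, "phenotype": 10.0},
    {"markers": {"m1": 0, "m2": 0}, "phenotype": 8.0},
]
cycle_2 = [
    {"markers": {"m1": 1, "m2": 0}, "phenotype": 13.0},
    {"markers": {"m1": 1, "m2": 1}, "phenotype": 13.0},
    {"markers": {"m1": 0, "m2": 1}, "phenotype": 12.0},
    {"markers": {"m1": 0, "m2": 0}, "phenotype": 10.0},
]

for label, population in (("cycle 1", cycle_1), ("cycle 2", cycle_2)):
    print(label, update_effects(population, ["m1", "m2"]))
```

In this toy example the estimated effects shrink from one cycle to the next, which is the situation in which selection based on the original estimates would be misleading.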

Application of Markers in Germplasm Storage, Evaluation and Use
Marker-assisted germplasm evaluation is another
important tool in the acquisition, storage and use
of plant genetic resources, and the evaluation of

germplasm can be considerably improved with


the assistance of markers. Markers can be used
prior to crossing to evaluate the breeding material.
Also, mixing of seed samples can be discovered
using markers instead of growing plants to maturity and assessing morphological characteristics.
In order to broaden the genetic base of core
breeding material, germplasm of diverse genetic
background for crossings with elite cultivars can
be identified with the assistance of markers, and
markers are on the whole a valuable tool for characterising genetic resources, delivering detailed
information usable in selecting parents. The
genotypic evaluation of germplasm based on
molecular markers (marker-assisted germplasm
evaluation, MAGE) and/or QTL analysis can be
used to identify and extract superior alleles from
inferior germplasm. This complements phenotypic selection. The advancements in the field of
genomics have considerably contributed to
increase the use of wild relative genes, as they
allow for the isolation of beneficial genes, the
selection for traits which are difficult to detect
based on phenotype or the screening of whole
collections of wild relatives. MAS has increasingly been applied for the maintenance of recessive alleles in backcrossing pedigrees and for
pyramiding resistance genes. Molecular markers
can also be used for (1) differentiating cultivars
and creating, maintaining and improving heterotic groups; (2) assessing collections and identifying germplasm redundancy, underrepresented
alleles and genetic gaps; (3) monitoring genetic
shifts that can occur during medium- or long-term
storage, regeneration, domestication and breeding; (4) identifying unique germplasm; and
(5) constructing core collections.
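
Two of the uses listed above, identifying redundant germplasm and comparing accessions, can be sketched with marker fingerprints. The accession names and allele strings are hypothetical, and the simple matching coefficient is just one of several similarity measures that could be used.

```python
from collections import defaultdict

# Hypothetical marker fingerprints: one allele call per marker.
accessions = {
    "acc_01": "AABBAB",
    "acc_02": "AABBAB",
    "acc_03": "ABBBAA",
    "acc_04": "AABBAB",
    "acc_05": "BBABAA",
}

def redundancy_groups(accessions):
    """Groups of accessions sharing an identical marker fingerprint."""
    groups = defaultdict(list)
    for name, profile in accessions.items():
        groups[profile].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

def simple_matching(a, b):
    """Proportion of markers at which two fingerprints agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

print(redundancy_groups(accessions))
print(simple_matching(accessions["acc_01"], accessions["acc_03"]))
```

Identical fingerprints only suggest redundancy at the markers assayed, so candidate duplicates would normally be confirmed with more markers or with phenotypic data.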

Resources for MAS on the Web

A large collection of web resources is available for MAS on the World Wide Web, and some of them are listed below:
1. As an example of current opportunities for MAS in wheat, protocols for over 20 trait-associated markers (associated with disease resistance, insect resistance and grain quality) are posted on the website MAS Wheat: Bringing Genomics to the Wheat Fields (http://maswheat.ucdavis.edu/).
2. Grafgen: Design of Precision Graphical Genotypes (http://moulon.inra.fr/~fred/programs/programs.html), a computer program developed by Frederic Hospital's group at INRA, France. Using marker data for a population, the program displays each individual's allelic composition in a graphical format as an aid to selecting desirable genotypes.
3. Molecular Plant Breeding (http://www.molecularplantbreeding.com/), an Australian-based initiative to incorporate marker-assisted strategies into plant breeding programs.
4. PLABSIM, MAS simulation software available from Matthias Frisch's website at the University of Hohenheim, Germany (http://www.uni-hohenheim.de/~frisch/).
5. Popmin (http://moulon.inra.fr/~fred/programs/programs.html), another computer program from Frederic Hospital's group at INRA, France. This program calculates optimum population sizes for marker-assisted backcrossing programs.
6. Molecular marker assisted selection as a potential tool for genetic improvement of crops, forest trees, livestock and fish in developing countries (http://www.fao.org/biotech/Conf10.htm). This site reports results of a conference sponsored by FAO's Electronic Forum on Biotechnology in Food and Agriculture.
7. Molecular marker maps that have been constructed for a wide range of crops are available at www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html.

Bibliography

Literature Cited

Beckmann JS, Soller M (1986) Restriction fragment length polymorphisms in plant genetic improvement. Oxford Surv Plant Mol Cell Biol 3:197–246
Bonnett DG, Rebetzke GJ, Spielmeyer W (2005) Strategies for efficient implementation of molecular markers in wheat breeding. Mol Breed 15:75–85
Dreher K, Khairallah M, Ribaut JM, Morris M (2003) Money matters (I): costs of field and laboratory procedures associated with conventional and marker-assisted maize breeding at CIMMYT. Mol Breed 11:221–234
Morris M, Dreher K, Ribaut JM, Khairallah M (2003) Money matters (II): costs of maize inbred line conversion schemes at CIMMYT using conventional and marker-assisted selection. Mol Breed 11:235–247
Tanksley SD, Nelson JC (1996) Advanced backcross QTL analysis: a method for the simultaneous discovery and transfer of valuable QTLs from unadapted germplasm into elite breeding lines. Theor Appl Genet 92:191–203

Further Readings
Beavis WD (1998) QTL analysis: power, precision, and
accuracy. In: Paterson AH (ed) Molecular dissection of
complex traits. CRC Press, Boca Raton, pp 145161
Frisch M, Melchinger AE (2001) Marker-assisted backcrossing for introgression of a recessive gene. Crop
Sci 41:14851494
Frisch M, Bohn M, Melchinger AE (1999a) Minimum
sample size and optimal positioning of flanking markers in marker-assisted backcrossing for transfer of a
target gene. Crop Sci 39:967975
Frisch M, Bohn M, Melchinger AE (1999b) Comparison
of selection strategies for marker-assisted backcrossing of a gene. Crop Sci 39:12951301
Frisch M et al (2000) PLABSIM: software for simulation
of marker-assisted backcrossing. J Hered 91:8687
Hospital F (2003) Marker-assisted breeding. In: Newbury
HJ (ed) Plant molecular breeding. Blackwell
Publishing/CRC Press, Oxford/Boca Raton, pp 3059
Kearsey MJ, Farquhar AGL (1998) QTL analysis in
plants; where are we now? Heredity 80:137142

Knapp S (1998) Marker-assisted selection as a strategy
for increasing the probability of selecting superior
genotypes. Crop Sci 38:1164-1174
Knight J (2003) Crop improvement: a dying breed. Nature
421:568-570
Morgante M, Salamini F (2003) From plant genomics to
breeding practice. Curr Opin Biotechnol 14:214-219
Neeraja C, Maghirang-Rodriguez R, Pamplona A, Heuer S,
Collard B, Septiningsih E et al (2007) A marker-assisted
backcross approach for developing submergence-tolerant
rice cultivars. Theor Appl Genet 115:767-776
Peleman JD, van der Voort JR (2003) Breeding by design.
Trends Plant Sci 8:330-334
Podlich DW, Winkler CR, Cooper M (2004) Mapping as
you go: an effective approach for marker-assisted
selection of complex traits. Crop Sci 44:1560-1571
Ribaut JM, Hoisington D (1998) Marker-assisted selection:
new tools and strategies. Trends Plant Sci 3:236-238
Smith S, Beavis W (1996) Molecular marker assisted
breeding in a company environment. In: Sobral BWS
(ed) The impact of plant molecular genetics.
Birkhauser, Boston, pp 259-272
Thomas WTB (2003) Prospects for molecular breeding of
barley. Ann Appl Biol 142:1-12
Xu Y (2003) Developing marker-assisted selection strategies
for breeding hybrid rice. Plant Breed Rev 23:73-174
Xu Y, Crouch JH (2008) Marker-assisted selection in plant
breeding: from publications to practice. Crop Sci
48:391-407
Young N (1999) A cautiously optimistic vision for marker-assisted
breeding. Mol Breed 5:505-510
Yousef GG, Juvik JA (2001) Comparison of phenotypic
and marker-assisted selection for quantitative traits in
sweet corn. Crop Sci 41:645-655

Success Stories in MAS

A large number of publications have reported the identification
of new QTLs in crop plants since QTL mapping was first
described in tomato in 1988. However, reports on the successful
application of MAS in plant breeding programs are still limited.
Several papers have discussed this fact and reviewed the current
status and applications of molecular markers in public and private
sector breeding programs (see Further Readings). Most critical
reviewers have concluded that the rate, scale and scope of uptake
of genomics and MAS in crop breeding programs continually lag
behind expectations. Thus, it has been repeatedly stated that the
vast majority of the favourable alleles at these identified QTLs
reside in publications rather than in cultivars that have been
improved through the introgression or selection of such QTLs.
However, the aim of this book is to show how QTLs can be
successfully detected by circumventing the challenges that limit
the transfer of knowledge from QTL mapping to routine MAS in
plant breeding programs. The previous chapters have addressed
those approaches, and this chapter describes how they have been
successfully applied in the development of new crop cultivars.
Critical analysis of published reports gives the impression that
MAS has great potential in the genetic improvement of crop plants,
provided that its limitations are properly addressed. Among the
different MAS-based breeding strategies applied (refer to Chapter 8),
MABC/introgression is the strategy used in most of the publications.
Regarding the breeding objective, breeding for disease/pest
resistance clearly dominates among publications, since these traits
are mainly controlled by major genes and detection of such QTLs is
more or less accurate. By contrast, only a few studies have reported
the successful application of MAS for improved yield, quality
traits, abiotic stress tolerance, variety detection or growth
characters (see below). Another notable feature of MAS studies is
that microsatellites are the predominant marker technology. Though
almost all the publications result from public breeding programs,
it would be incorrect to conclude that MAS is mainly conducted in
public breeding programs. It has to be considered that publishing
is of little or no importance for private plant breeders, while it
is one of the main aims in public research institutes and
universities. The following sections present success stories from
different crops that employed MAS; the list is not exhaustive. Due
to space constraints, only a few examples in each crop are shown,
merely to showcase that MAS has been widely employed for the
genetic improvement of crop plants. Please refer to the Further
Readings for more examples.

Tomato
This was the first crop in which both QTL
mapping and MAS were demonstrated.
Tanksley et al. (1981) first demonstrated
MAS on metric characters using isozyme markers in early generations

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_9, Springer India 2013


of tomato lines. Lecomte et al. (2004) introgressed five QTLs
controlling fruit quality in tomato from a parental line into three
improved lines through a marker-assisted backcross program.

Maize
This was the second crop in which isozyme markers were successfully
used, by Stuber in 1982, for the genetic improvement of yield. In
another study, Yousef and Juvik (2002) showed that QTLs identified
in a mapping population can exert the same effects in different
genetic backgrounds and across two environments. By introgressing
three marker-QTL alleles associated with enhanced seedling
emergence into elite lines using marker-assisted backcrossing, this
trait was successfully enhanced in sweet corn. The AB-QTL method,
which can be used for the simultaneous identification and transfer
of favourable QTL alleles, has been used successfully to improve
yield in elite maize lines (Ho et al. 2002), and Bouchez et al.
(2002) successfully introgressed favourable QTLs for earliness and
grain yield between maize elite lines. As abiotic stress resistance
is a complex trait, only a few successful MAS applications in
breeding for such traits have been published. One example is a
marker-assisted backcross experiment conducted at CIMMYT to improve
grain yield in tropical maize under water-limited conditions
(Ribaut and Ragot 2006). Another important example of the
successful application of MAS in maize is the use of microsatellite
markers for the conversion of normal maize lines into Quality
Protein Maize (QPM), which contains more lysine and tryptophan than
the native lines (Babu et al. 2004).

Wheat
Examples of commercially released genetic material include Patwin
(a hard white spring wheat), the first variety developed by MAS to
be released by the University of California at Davis
(http://www.plantsciences.ucdavis.edu/plantbreeding/main/history.htm),
which contains the introgressed stripe rust resistance gene Yr17
and leaf rust resistance gene Lr37 (Helguera et al. 2003).
Similarly, the leaf rust resistance genes Lr1, Lr9, Lr24 and Lr47
were introgressed into common wheat cultivars by MAS (Nocente
et al. 2007). Marker-assisted pyramiding of two cereal cyst
nematode resistance genes from Aegilops variabilis in wheat has
also been reported (Barloy et al. 2007), and DNA markers are used
extensively for cereal cyst nematode (Heterodera avenae Woll.)
resistance (Eagles et al. 2001). The extensive use of MAS in
CIMMYT wheat breeding programs is reported elsewhere. Large wheat
MAS programs have also been developed in Australia, covering around
20 genes or chromosome regions used in cultivar development. During
the last few years, remarkable progress in implementing MAS
strategies for cultivar development has been achieved by the MAS
Wheat Consortium in the United States, including the completion of
80 MAS projects (visit the consortium website for more details).

Rice
Ashikari et al. (2005) provide a good example of a successful gene
pyramiding experiment. First, the separate introgression of one QTL
for grain number and one QTL for plant height into the same genetic
background improved both traits. Second, the lines generated by
pyramiding both QTLs in the same genetic background exhibited trait
values slightly lower than expected from the single introgression
lines, but overall, the combination of loci was still beneficial
and improved the yield of a rice strain. There are many other
successful examples in numerous species, including the pyramiding
of Xa7 and Xa21 to improve resistance to bacterial blight in hybrid
rice (Zhang et al. 2006). Up to now, MAS in rice breeding has
mainly been utilised for pyramiding disease resistances, namely, to
bacterial blight and blast (Narayanan et al. 2002). In 2002, two
cultivars resistant to bacterial leaf blight that had been selected
using MAS were released in Indonesia. The variety Angke carries the
resistance gene xa5, and Conde carries Xa7 (Bustamam et al. 2002).
Several publications report introgression from wild relatives
(e.g. O. glumaepatula, O. rufipogon) in order to improve yield
(Liang et al. 2004). In 2006, two lines showing strong submergence
tolerance were developed by introgressing a locus conferring
submergence tolerance from the cultivar FR13A into the variety
Swarna (Xu et al. 2006). Jantaboon et al. (2011) successfully
introgressed four QTLs conferring submergence tolerance and cooking
quality traits to develop ideotypes using MAS. A marker-assisted
backcross breeding approach was employed to incorporate the blast
resistance genes Piz-5 and Pi54 from the donor lines C101A51 and
Tetep into the genetic background of PRR78 to develop Pusa1602
(PRR78 + Piz5) and Pusa1603 (PRR78 + Pi54), respectively (Singh
et al. 2012).
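The cost of pyramiding grows quickly with the number of loci, which a short calculation can illustrate. The sketch below assumes unlinked loci segregating in an F2 population and asks how many plants are needed to find, with a chosen probability, at least one plant homozygous for the favourable allele at every target locus. The function names and the 95% threshold are illustrative assumptions, not values taken from the studies cited above.

```python
import math


def f2_fixed_fraction(n_loci: int) -> float:
    """Expected fraction of F2 plants homozygous for the favourable
    allele at every one of n unlinked loci (1/4 per locus)."""
    return 0.25 ** n_loci


def min_population_size(n_loci: int, prob: float = 0.95) -> int:
    """Smallest F2 population that contains at least one fully
    homozygous favourable plant with probability `prob`."""
    f = f2_fixed_fraction(n_loci)
    # P(at least one) = 1 - (1 - f)^N >= prob, solved for N
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - f))


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, f2_fixed_fraction(n), min_population_size(n))
```

Under these assumptions, two loci already require several dozen F2 plants and four loci several hundred, which is one reason why pyramiding is usually carried out stepwise with markers rather than in a single segregating generation.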

Barley
In Australia, a marker linked (0.7 cM) to the Yd2
gene for resistance to barley yellow dwarf virus
was successfully used to select for resistance in a
barley backcross breeding scheme (Jefferies et al.
2003). Field test data showed that BC2F2-derived
lines containing the linked marker had fewer leaf
symptoms and higher grain yield when infected
by the virus compared to lines lacking the marker.
Castro et al. (2003) provided an example of gene
pyramiding in barley by combining a qualitative
gene with QTL alleles for resistance to barley
stripe rust. Preliminary results indicated that combining qualitative
and quantitative resistance genes improved resistance levels in the
presence of a virulent race of the pathogen.
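The value of a tightly linked marker, such as the one 0.7 cM from Yd2, can be illustrated with a short calculation. The sketch below converts map distance to a recombination fraction with Haldane's mapping function and then estimates the chance that the marker allele and the gene stay together over several meioses. The choice of Haldane's function and the number of meioses are assumptions made for illustration only; they are not taken from Jefferies et al. (2003).

```python
import math


def haldane_r(distance_cm: float) -> float:
    """Recombination fraction for a map distance in cM
    (Haldane's mapping function, no crossover interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def prob_linkage_intact(distance_cm: float, n_meioses: int) -> float:
    """Probability that marker and gene remain non-recombinant
    in each of n successive meioses."""
    return (1.0 - haldane_r(distance_cm)) ** n_meioses


if __name__ == "__main__":
    # 0.7 cM is the distance quoted in the text; 3 meioses is an assumed example
    print(round(haldane_r(0.7), 5))
    print(round(prob_linkage_intact(0.7, 3), 4))
    # A loosely linked marker (10 cM, hypothetical) for comparison
    print(round(prob_linkage_intact(10.0, 3), 4))
```

With a marker this close to the gene, the marker genotype remains a reliable proxy for the resistance allele across a short backcross scheme, whereas a marker 10 cM away would misclassify a noticeable share of lines.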

Soybean
Soybean yields were increased by using marker-assisted backcrossing
to introgress a yield QTL from a wild accession into commercial
genetic backgrounds (Concibido et al. 2003). Although the yield
enhancement was observed in only two of six genetic backgrounds,
the study demonstrates the potential of incorporating wild alleles
with the assistance of markers. In soybean, the most prominent
example of MAS application in breeding is resistance to soybean
cyst nematode (Heterodera glycines). Mudge et al. (1997) showed
that MAS using SSR markers that flank rhg1 was 98% accurate in
identifying resistant lines from a cross between Evans and
PI 209332. Refer to Concibido et al. (2004) for an excellent
review on MAS for cyst nematode resistance in soybean.

Varieties Released Through MAS


MAS-breeding programs have been used to produce two low-amylose
rice varieties, Cadet and Jacinto (Hardin 2000), and two Indonesian
rice varieties, Angke and Conde, with resistance to bacterial leaf
blight (Bustamam et al. 2002). A white bean variety resistant to
BGYMV and common bacterial blight, Verano (Beaver et al. 2008), a
leaf rust-resistant wheat variety from Argentina, Biointa 2004
(Bainotti et al. 2009), and an Australian barley variety, SloopSA,
resistant to cereal cyst nematode (Barr et al. 2000) have also been
released. The soybean cultivar Sheyenne, tolerant to iron
deficiency-induced chlorosis and resistant to lodging, was derived
from a Pioneer variety. Sheyenne was confirmed to be different from
that variety with the help of markers (Helms et al. 2008). Other
important examples of success in MAS are a maize variety named
Sunrise, with high resistance against the western corn rootworm
(Diabrotica virgifera), and a potato producing pure amylopectin,
which is the first product in Germany developed by TILLING to
achieve market readiness. The maize variety was developed by the
German Saaten-Union; the potato was developed by German Fraunhofer
researchers and is processed by the Emsland group, the largest
German potato processor. As both examples originate from private
breeding programs, they will most probably never appear in
scientific journals (Brumlop and Finckh 2010). Nevertheless, press
reports announcing MAS-breeding projects or releases of varieties
that were bred with the assistance of markers are mentioned here.
In the USA, the barley variety Tango, carrying two QTLs for adult
resistance to stripe rust, was released in 2000 (Hayes et al. 2003)
and is claimed to be the first commercially released barley variety
developed using MAS. However, Tango yields less than its recurrent
parent and is therefore primarily seen as a genetically
characterised source of resistance to barley stripe rust rather
than as a variety in its own right. As a result of the South
Australian Barley Improvement Program, the malting variety Sloop
was improved with cereal cyst nematode resistance introgressed from
the variety Chebec and released in 2002 as SloopSA (Brumlop and
Finckh 2010).

Hybrids Developed Through MAS


A common application of marker-assisted backcrossing has been the
introgression of transgenes into an adapted variety or line (e.g.
introgression of the Bt insect resistance transgene into different
genetic backgrounds in maize and cotton). It has been shown in
previous chapters that the easy scenario is when the marker allele
M and the QTL allele Q are always together. This is only the case
if the marker actually measures the relevant polymorphism within
the gene that causes the effect. Such a direct marker is very
convenient, because the marker genotype directly informs us about
the QTL genotype. Direct markers are typically available for genes
that were known to exist before they were mapped and that have a
large effect. In contrast, if indirect or linked markers are used
in MAS, there is a chance of recombination between the marker and
QTL alleles. Direct markers are therefore generally much preferred
to linked markers, provided they truly mark major gene effects.
Their biggest benefit is that they can be used even without trait
measurement or pedigree recording. Often, the target gene can also
be detected phenotypically (e.g. pest resistance conferred by the
Bt gene), and markers are then used to select for the recurrent
parent genome. The technique has reportedly accelerated the
recovery of the recipient genome by about two backcross
generations, and almost all the Bt hybrids released in India have
been developed using this strategy. Similarly, in pearl millet
(Pennisetum glaucum), the parental lines of the original hybrid
(HHB 67) were improved for downy mildew (caused by Sclerospora
graminicola (Sacc.) Schroet.) resistance through MAS combined with
conventional backcross breeding, leading to the release in India of
a new hybrid, HHB 67-2 (Navarro et al. 2006).
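To put the saving of about two backcross generations into perspective, the expected recovery of the recurrent parent genome without any marker selection can be computed directly. The sketch below uses the standard expectation that each backcross halves the remaining donor genome, so that generation BCn carries on average 1 - (1/2)^(n+1) recurrent parent genome. The 99% target is an arbitrary illustrative threshold, and the function names are hypothetical.

```python
def expected_rpg(n_backcrosses: int) -> float:
    """Expected recurrent parent genome fraction in generation BCn
    when no marker-based background selection is applied."""
    return 1.0 - 0.5 ** (n_backcrosses + 1)


def generations_to_reach(target: float) -> int:
    """Number of backcrosses needed for the expected recurrent parent
    genome fraction to reach `target` (must be below 1)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must lie strictly between 0 and 1")
    n = 1
    while expected_rpg(n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"BC{n}: {expected_rpg(n):.4f}")
    print(generations_to_reach(0.99))
```

Under this expectation, unselected backcrossing needs six generations to pass 99% recovery, so selecting with background markers for individuals that already reach that level two generations earlier represents a substantial saving in time.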

MAS in Multinational Companies


Although there is very limited specific information on the
successes of molecular breeding, the first commercial products of
MAS are expected to be released to the market by all the major
multinational breeding companies in the very near future. The first
cultivar developed through MAS by Monsanto was released to the US
market in 2006. Examples of patent applications related to MAS
technologies are available at the Free Patents Online database
(www.freepatentsonline.com). A search of a patent database using
"marker-assisted selection" as the search term returns a list of
patents related to MAS. Check for the latest updates.

Contrasting Stories
In some cases, MAS is not as efficient as expected. Most of the
time, this depends on how stable the QTL effects are, which may be
altered in different ways. In some cases, the QTL effect vanishes
after MAS or introgression (Shen et al. 2001). One can then wonder
whether the QTL was a false positive (ghost QTL) or a true positive
whose effect (expression) depended on one or several of the
interactions listed below. There is also a tendency for supposedly
additive QTL effects not to sum up as expected. Refer to Hospital
(2009) for more details on the reasons for failures of MAS in crop
plants.

Conclusions and Future Prospects


Marker-assisted selection has been successful for introgressing and
pyramiding major-effect genes; however, many challenges remain to
be resolved before MAS can routinely provide added value for
breeding very complex traits. The genetic basis of complex traits
and the interactions between all related traits will become much
better understood because of the rapid developments in "omics"
studies. This will allow accurate modelling of gene networks and
the development of robust simulation tools for designing target
genomic ideotypes. Integration of all the state-of-the-art branches
of biotechnology, physiology, biochemistry, soil science, plant
breeding and genetics is the need of the hour. With the
availability of such knowledge and tools, the early stages of plant
breeding programs will become much more efficient through the
design of knowledge-based breeding programs. However, there will be
no substitute for multi-locational replicated evaluation trials for
screening elite breeding lines and for the selection and validation
of finished products of MAS before distribution to local breeding
companies and farmers' fields.

Bibliography
Literature Cited
Babu ER, Mani VP, Gupta HS (2004) Combining high
protein quality and hard endosperm traits through
phenotypic and marker assisted selection in maize.
In: Proceedings of the 4th international crop science
congress, Brisbane
Bainotti C, Fraschina J, Salines JH, Nisi JE, Dubcovsky
J, Lewis SM, Bullrich L, Vanzetti L, Cuniberti
M, Campos P, Formica MB, Masiero B, Alberione E,
Helguera M (2009) Registration of BIOINTA 2004
wheat. J Plant Regist 3:165-169
Barloy D, Lemoine J, Abelard P, Tanguy AM, Rivoal R,
Jahier J (2007) Marker assisted pyramiding of two
cereal cyst nematode resistance genes from Aegilops
variabilis in wheat. Mol Breed 20:31-40
Barr AR, Jefferies SP, Warner P, Moody DB, Chalmers KJ,
Langridge P (2000) Marker-assisted selection in theory
and practice. In: Proceedings of the 8th international
barley genetics symposium, vol I. Adelaide, Australia,
pp 167-178
Beaver JS, Porch TG, Zapata M (2008) Registration of
Verano white bean. J Plant Regist 2:187-189
Bouchez A, Hospital F, Causse M, Gallais A, Charcosset
A (2002) Marker-assisted introgression of favorable
alleles at quantitative trait loci between maize elite
lines. Genetics 162:1945-1959

Bustamam M, Tabien RE, Suwarno A, Abalos MC, Kadir
TS, Ona I, Bernardo M, Veracruz CM, Leung H (2002)
Asian rice biotechnology network: improving popular
cultivars through marker-assisted backcrossing by the
NARES. Poster presented at the international rice
congress, 16-20 Sept 2002, Beijing
Castro AJ et al (2003) Mapping and pyramiding of qualitative
and quantitative resistance to stripe rust in barley.
Theor Appl Genet 107:922-930
Concibido VC, Diers BW, Arelli PR (2004) A decade of
QTL mapping for cyst nematode resistance in soybean.
Crop Sci 44:1121-1131
Concibido VC et al (2003) Introgression of a quantitative
trait locus for yield from Glycine soja into commercial
soybean cultivars. Theor Appl Genet 106:575-582
Eagles HA, Bariana HS, Ogbonnaya FC, Rebetzke GJ,
Hollamby GJ, Henry RJ, Henschke PH, Carter M
(2001) Implementation of markers in Australian wheat
breeding. Aust J Agric Res 52:1349-1356
Fraley R (2006) Presentation at Monsanto European
investor day, 10 Nov 2006. Available at
www.monsanto.com/investors/presentations.asp
Hardin B (2000) Rice breeding gets marker assists.
Available at www.ars.usda.gov/is/AR/archive/dec00/rice1200.pdf.
Verified 19 Nov 2012
Hayes PM, Corey AE, Mundt C, Toojinda T, Vivar H
(2003) Registration of Tango barley. Crop Sci
43:729-731
Helguera M, Khan IA, Kolmer J, Lijavetzky D, Zhong-Qi
L, Dubcovsky J (2003) PCR assays for the Lr37-Yr17-Sr38
cluster of rust resistance genes and their use
to develop isogenic hard red spring wheat lines. Crop
Sci 43:1839-1847
Helms TC, Nelson BD, Goos RJ (2008) Registration of
Sheyenne soybean. J Plant Regist 2:20-20
Ho C, McCouch R, Smith E (2002) Improvement of
hybrid yield by advanced backcross QTL analysis in
elite maize. Theor Appl Genet 105:440-448
Jantaboon J, Siangliw M, Im-mark S, Jamboonsri W,
Vanavichit A, Toojinda T (2011) Ideotypes breeding
for submergence tolerance and cooking quality by
MAS in rice. Field Crops Res 123(3):206-213
Jefferies SP, King BJ, Barr AR, Warner P, Logue SJ,
Langridge P (2003) Marker-assisted backcross introgression
of the Yd2 gene conferring resistance to barley
yellow dwarf virus in barley. Plant Breed 122:52-56
Lecomte L, Duff P, Buret M, Servin B, Hospital F, Causse
M (2004) Marker-assisted introgression of five QTLs
controlling fruit quality traits into three tomato lines
revealed interactions between QTLs and genetic
backgrounds. Theor Appl Genet 109:658-668
Liang F, Deng Q, Wang Y, Xiong Y, Jin D, Li J, Wang B
(2004) Molecular marker-assisted selection for
yield-enhancing genes in the progeny of 9311 × O.
rufipogon using SSR. Euphytica 139:159-165
Mudge J, Cregan PB, Kenworthy JP, Kenworthy WJ, Orf
JH, Young ND (1997) Two microsatellite markers that
flank the major soybean cyst nematode resistance
locus. Crop Sci 37:1611-1615

Narayanan NN, Baisakh N, Vera Cruz CM, Gnanamanickam
SS, Datta K, Datta SK (2002) Molecular breeding for
the development of blast and bacterial blight resistance
in rice cv. IR50. Crop Sci 42:2072-2079
Navarro RL, Warrier GS, Maslog CC (2006) Genes are
gems: reporting agri-biotechnology - a sourcebook
for journalists. International Crops Research
Institute for the Semi-Arid Tropics, Patancheru, Andhra
Pradesh, India
Nocente F, Gazza L, Pasquini M (2007) Evaluation of leaf
rust resistance genes Lr1, Lr9, Lr24, Lr47 and their
introgression into common wheat cultivars by
marker-assisted selection. Euphytica 155(3):329-336
Ribaut JM, Ragot M (2006) Marker-assisted selection to
improve drought adaptation in maize: the backcross
approach, perspectives, limitations, and alternatives.
J Exp Bot 58:351-360
Shen L, Courtois B, McNally KL, Robin S, Li Z (2001)
Evaluation of near-isogenic lines of rice introgressed
with QTLs for root depth through marker-aided
selection. Theor Appl Genet 103:75-83
Singh VK et al (2012) Incorporation of blast resistance
into PRR78, an elite Basmati rice restorer line,
through marker assisted backcross breeding. Field
Crops Res 128:8-16
Stuber CW (1982) Improvement of yield and ear number
resulting from selection at allozyme loci in a maize
population. Crop Sci 22:737
Tanksley SD, Medino-Filho DH, Rick CM (1981) The
effect of isozyme selection on metric characters in an
interspecific backcross of tomato: basis of an early
screening procedure. Theor Appl Genet 60:291-296
Xu K, Xu X, Fukao T, Canlas P, Maghirang-Rodriguez
R, Heuer S, Ismail AM, Bailey-Serres J, Ronald PC,
Mackill DJ (2006) Sub1A is an ethylene-response-factor-like
gene that confers submergence tolerance to
rice. Nature 442:705-708
Yousef GG, Juvik JA (2002) Enhancement of seedling
emergence in sweet corn by marker-assisted backcrossing
of beneficial QTL. Crop Sci 42:96-104
Zhang J, Li X, Jiang G, Xu Y, He Y (2006) Pyramiding of
Xa7 and Xa21 for the improvement of disease resistance
to bacterial blight in hybrid rice. Plant Breed
125(6):600-605

Further Readings
Anthony VM, Ferroni M (2012) Agricultural biotechnology
and smallholder farmers in developing countries.
Curr Opin Biotechnol 23:278-285
Ashikari M, Sakakibara H, Lin S, Yamamoto T, Takashi T,
Nishimura A et al (2005) Cytokinin oxidase regulates
rice grain production. Science 309:741-745
Brumlop S, Finckh MR (2010) Applications and potentials
of marker assisted selection (MAS) in plant breeding.
Final report of the F+E project "Applications and
Potentials of Smart Breeding" (FKZ 350 889 0020). On
behalf of the Federal Agency for Nature Conservation,
December 2010. http://www.bfn.de/0502_skripten.html
Hospital F (2009) Challenges for effective marker-assisted
selection in plants. Genetica 136:303-310
Ribaut JM, Hoisington D (1998) Marker assisted selection:
new tools and strategies. Trends Plant Sci 3(6):236-239
Zong G, Ahong W, Lu W, Guohua L, Minghong G, Tao S,
Bin H (2012) A pyramid breeding of eight grain-yield
related quantitative trait loci based on marker-assistant
and phenotype selection in rice (Oryza sativa L.).
J Genet Genomics 39(7):335-350

Curtain Raiser to Novel MAS Platforms

Current Techniques in Molecular, Biochemical and Physiological
Studies and Their Integration into MAS
The key goal of plant breeding programmes is the generation of
elite crop plants that combine superior genes/alleles. However, the
critical limitation is the lack of understanding of what most genes
do in terms of the expression of the desired phenotype (e.g. pest
resistance, salt tolerance and yield increase) in plants. We do
know that all the agronomically important traits are quite complex.
For example, in halophytes, we know that salt tolerance depends on
the ability to compartmentalise ions, which in turn depends on the
regulation of transpiration, the tight control of leakage of ions
through the root apoplast, the nature of the membranes in the leaf
vacuoles, the synthesis of compatible solutes such as glycine
betaine and the ability to tolerate low K:Na ratios in the
cytoplasm of mature cells or the ability of protein synthesis to
operate at low K:Na ratios in the cells. Under such conditions, how
might QTL mapping be useful in increasing yield under unfavourable
environments? In order to have efficient knowledge-based MAS, it is
necessary to understand the techniques that are being used to
unravel the function of genes, and such knowledge should be
incorporated into the QTL mapping procedure. This chapter presents
the state-of-the-art techniques in molecular, biochemical and
physiological studies and their potential role in MAS.

Molecular Techniques
To realise the importance of rapidly accumulating data as well as
to understand the functioning of the cell at the organism level,
there is a need for high-throughput molecular techniques. The
studies that use such techniques are collectively called functional
genomics. The term "functional genomics" is defined as the
development and application of global or genome-wide experimental
approaches to assess gene function by using the information and
components provided by structural genomics. Several approaches have
been used to explore the probable function of genes, as well as to
monitor their expression in relation to various other genes, and
they are explained hereunder.

Expression Profiling
A major part of functional genomics is the
analysis of gene expression. Having knowledge
of when and where a gene product, that is,
RNA and/or protein, is expressed can give vital
information about the particular gene in question.
The very first step in generating a genome-wide
expression profile is the preparation of expressed
sequence tag (EST) profiles. ESTs are DNA

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_10, Springer India 2013


Fig. 10.1 cDNA library construction and EST database development

sequences read from either end of complementary DNA (cDNA)
molecules. Since cDNAs are prepared from mRNA, they provide
information about the expressed part of the genome. EST data sets
have been generated on a large scale for almost all crop species
and have been deposited in the NCBI (National Center for
Biotechnology Information) database for ESTs (dbEST; Fig. 10.1).
The large number of EST sequences, however, may not represent the
number of expressed genes because many of them are redundant. For
example, a total of 252,364 sequences (221,715 ESTs and 30,649 mRNA
sequences) had been clustered into only 31,080 genes in rice (as of
10 September 2012). A minimally redundant set of ESTs provides a
suitable substrate for a variety of high-throughput techniques used
for expression analyses, such as microarrays. Such a collection of
ESTs gains additional value if the ESTs are the outcome of
differential screening in relation to a particular state, for
example, drought or salt stress. At the same time, the 28,000
full-length cDNA sequences reported for rice could help to annotate
genes accurately and provide resources for gene discovery and
manipulation. Other techniques used in expression genomics include
traps and the serial analysis of gene expression (see below). The
technique used to generate ESTs is cDNA library construction, and
it is described in detail hereunder.
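The degree of redundancy implied by the rice figures quoted above can be checked with a few lines of arithmetic. The sketch below simply divides the number of deposited sequences by the number of gene clusters; the function name is hypothetical, and the calculation says nothing about how the clustering itself was performed.

```python
def mean_sequences_per_cluster(n_sequences: int, n_clusters: int) -> float:
    """Average number of deposited sequences per gene cluster,
    a simple indicator of redundancy in an EST collection."""
    return n_sequences / n_clusters


# Rice figures quoted in the text (as of 10 September 2012)
N_ESTS = 221_715
N_MRNAS = 30_649
N_GENE_CLUSTERS = 31_080

total_sequences = N_ESTS + N_MRNAS
redundancy = mean_sequences_per_cluster(total_sequences, N_GENE_CLUSTERS)

if __name__ == "__main__":
    print(total_sequences)
    print(round(redundancy, 2))
```

On these numbers each gene cluster is represented by roughly eight sequences on average, which is why a minimally redundant subset is preferred as a substrate for microarrays and similar platforms.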


cDNA Library Construction


The generation of full-length cDNA libraries is
indispensable for characterising the structure
and function of newly discovered genes. Several
procedures for the construction of cDNA libraries are available depending on their applications.
The synthesis of cDNA libraries is a chain of
enzymatic reactions, each requiring specific buffers, substances and enzymes. In most cDNA
libraries, the first step is the isolation of total
RNA followed by removal of the highly abundant rRNA and tRNA components to isolate
mRNA. However, in PCR approach to cDNA
synthesis, total RNA is the starting material.
Usually the first strand of cDNA is synthesised
by a reverse transcriptase, and it is followed by
second-strand synthesis by DNA polymerase.
Subsequently, such cDNAs are ligated into an
adaptor, and such adaptor ligation facilitates
their easy integration into the vector (Fig. 10.1).
Such recombinant vectors are later sequenced
(see below) to characterise the nucleotide
sequence of each EST. Advantage of cDNA
libraries is that if the gene of interest is highly
expressed in a particular tissue, there will be
abundance of that mRNA, and it will be easy to
isolate because it will be enriched in a cDNA
library made from that particular tissue. A cDNA
library will represent individual genes, although
not all the genes are represented. Further, there
were no promoters or introns will be present.
Thus, conventional cDNA library construction
methods suffer from several major shortcomings.
First, the majority of cDNA clones are not fulllength, especially for mRNAs longer than 2 kb.
This loss of sequence is typically due to premature termination of reverse transcription or 5terminal sequence loss caused by cDNA blunt-end
polishing before cloning. As a result, cDNA 5
ends are significantly underrepresented in cDNA
libraries. Second, an adaptor-mediated cloning
process is still a common approach for cDNA
library construction (Fig. 10.1). Thus, the resulting cDNA libraries can be comprised of up to
20% undesirable ligation by-products (chimeras)
and inserts of non-mRNA origin (e.g. genomic

195

DNA, mitochondrial DNA, ribosomal RNA and


adaptor dimers). Additionally, current library
construction methods for directional cloning suffer from their reliance on methylation, a process
that is often incomplete in protecting internal
restriction sites and is also inefficient for cloning.
To overcome these limitations, several protocols
for cDNA library construction have been
described that exploit the mRNA cap structure
to enrich for full-length sequences. Leading technologies in this field include the oligo-capping
method, CAPture, the SMART™ approach and
CAP-trapper. As an example, the oligo-capping
method is described in detail.
Usually, cDNA libraries constructed by
conventional methods have a high content
of non-full-length cDNA clones. One of the reasons for this is that reverse transcriptase tends to stall during first-strand
synthesis and fall off the template, leaving non-full-length
cDNA. Thus, non-full-length cDNA is an
unavoidable result of using reverse transcriptase for cDNA synthesis. In order to
make a full-length cDNA library, a
selection procedure needs to be designed, such as
selection of cDNAs that contain both ends of the
mRNA. For that purpose, the features that are
characteristic of the 3′-end and the 5′-end of
mRNA should be used as tags.
The polyA stretch is a characteristic feature
of the 3′-end of mRNA. Conventional methods
use the polyA as a sequence tag to
select the 3′-end of mRNA. In
conventional methods, the first-strand cDNA is
usually synthesised from an oligo(dT) primer.
Because the oligo(dT) primer mostly hybridises to the
polyA, most of the cDNA is selectively synthesised from the 3′-end of the mRNA. Thus, the
conventional methods include a selection
step for the 3′-end tag of the mRNA. In
contrast, they include no step to select the 5′-end
of mRNA. As a result, the largest part of the
cDNA library is occupied by cDNAs that
lack the 5′-end of the mRNA.
The 5′-end of mRNA also has a characteristic
structure, called the cap structure, but unfortunately it is not a sequence tag. Unlike the polyA
at the 3′-end, it cannot be used for hybridisation.
If the 5′-end tag of the mRNA were also a
sequence tag, it would be easy to use it to select
the 5′-end of mRNA. To overcome this
difficulty, a method was introduced that adds a
sequence tag at the 5′-end; it is called the
oligo-capping method. This method
replaces the cap structure of mRNA with a
synthetic oligonucleotide enzymatically. Each
mRNA product of oligo-capping carries
sequence tags at both ends: polyA
at the 3′-end and the cap-replacing oligo at the
5′-end. Thus, with oligo-capped mRNA as the
starting material, a system was developed to
selectively clone the cDNAs that contain both
sequence tags at their respective ends.
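
The selection logic described above can be illustrated in silico. The following is only a minimal sketch: the cap-replacing oligo sequence (CAP_OLIGO), the minimum polyA length (MIN_POLYA) and the function names are hypothetical choices made for illustration, and sequences are written in mRNA sense as DNA letters. A molecule is kept only when it carries both end tags.

```python
# Hypothetical tag definitions (illustrative only, not taken from any protocol).
CAP_OLIGO = "AGCATCGAGTCGGCCTTGTTG"  # assumed cap-replacing oligo at the 5' end
MIN_POLYA = 10                       # assumed minimum poly(A) run counted as the 3' tag


def has_five_prime_tag(seq):
    """True if the molecule starts with the cap-replacing oligo."""
    return seq.startswith(CAP_OLIGO)


def has_three_prime_tag(seq):
    """True if the molecule ends with at least MIN_POLYA adenines."""
    return seq.endswith("A" * MIN_POLYA)


def select_full_length(molecules):
    """Keep only molecules that carry both sequence tags."""
    return [m for m in molecules
            if has_five_prime_tag(m) and has_three_prime_tag(m)]


if __name__ == "__main__":
    body = "ATGGCCATTGTAATGGGCCGC"
    full = CAP_OLIGO + body + "A" * 15
    truncated = body[5:] + "A" * 15   # reverse transcription stopped early: no 5' tag
    no_tail = CAP_OLIGO + body        # lacks the poly(A) tag
    kept = select_full_length([full, truncated, no_tail])
    print(len(kept), "of 3 molecules carry both tags")
```

In the wet-lab procedure the same two-tag requirement is imposed by the choice of primers rather than by string matching; the sketch only mirrors the logic.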

Differential Display and Representational Difference Analysis
A large number of PCR-based methods have been
developed for analysing gene expression. The
sensitivity of PCR makes it especially useful in
analysing rare transcripts that cannot be analysed
by Northern blotting techniques. For known
sequences, quantitative PCR is used to analyse
relative levels of gene expression in different tissues or after different treatments. Various PCR-based methods have been developed to identify
and isolate differentially expressed genes. Two of
the most commonly used procedures are representational difference analysis (RDA) and differential display. RDA is used to select for genes
expressed in only one mRNA population (the tester mRNA) compared to a second mRNA population (the driver). After cDNA synthesis and
amplification of both populations, adapters are
ligated only to the tester cDNA population
(T-adapters). The tester and driver are mixed,
denatured and hybridised so that common
sequences between the populations form tester-driver
hybrids. Because of the excess of driver in
the hybridisation mix, only tester-specific
sequences form tester-tester molecules. These
are amplified using T-adapter-specific primers
and used for further studies. RDA results in
identifying a set of tissue- or treatment-specific
cDNAs.

10

Curtain Raiser to Novel MAS Platforms

Differential display uses an arbitrary primer to
amplify cDNAs obtained from different mRNA
samples randomly. One primer (5′-T11NN, where
NN are any two specific nucleotides) selects only
cDNAs that have the nucleotides NN immediately adjacent to the polyA tail. When PCR is
carried out using this primer in conjunction with
a random 10-mer primer, the same subset of
cDNAs is selectively amplified in each sample
analysed. PCR reactions from the different samples are run side by side on sequencing gels, so
that gene expression differences can be visualised as bands present in one lane and absent in
another. The bands of interest are cut out of the
gel, and the DNA is eluted, cloned, sequenced
and used for further analysis. This method is useful for analysing many different tissues or treatments at once, but a large number of different
primers are needed to survey for differences in all
of the cDNAs in a sample.
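
How an anchored primer picks out a subset of transcripts can be sketched as follows. This is an illustration under simplifying assumptions, not a primer design tool: transcripts are written in mRNA sense as DNA letters, perfect matching is assumed, and the function names are hypothetical. The two anchor bases of the primer are the reverse complement of the two mRNA bases just upstream of the polyA tail.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def bases_before_tail(transcript, min_polya=11):
    """Return the two bases just upstream of the 3' poly(A) tail (mRNA sense).

    Returns None when the tail is shorter than min_polya. A base 'A' directly
    upstream of the tail cannot be told apart from the tail itself.
    """
    body = transcript.rstrip("A")
    tail_length = len(transcript) - len(body)
    if tail_length < min_polya or len(body) < 2:
        return None
    return body[-2:]


def anchored_primer(dinucleotide, t_run=11):
    """Primer (5'->3') that anneals to a tail preceded by `dinucleotide`."""
    reverse_complement = "".join(COMPLEMENT[b] for b in reversed(dinucleotide))
    return "T" * t_run + reverse_complement


def primed_subset(transcripts, dinucleotide):
    """Names of transcripts that the primer for `dinucleotide` would select."""
    return sorted(
        name for name, seq in transcripts.items()
        if bases_before_tail(seq) == dinucleotide
    )


if __name__ == "__main__":
    pool = {
        "gene1": "ATGGC" + "A" * 12,
        "gene2": "ATGCT" + "A" * 12,
        "gene3": "CCTGC" + "A" * 14,
    }
    for anchor in ("GC", "CT"):
        print(anchored_primer(anchor), primed_subset(pool, anchor))
```

Because each anchor selects only the transcripts ending in one particular dinucleotide, covering every transcript requires cycling through many anchor and arbitrary-primer combinations, which is the practical cost noted above.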

Subtractive Hybridisation
Subtractive hybridisation is a popular technique
for gene discovery from non-model organisms
without an annotated genome sequence. Subtractive techniques are
valuable tools for identifying differentially regulated genes important for cellular growth and
differentiation. Over the last decade, numerous
subtractive hybridisation techniques have been
developed and used to isolate significant genes in
many systems. The simple suppression subtractive hybridisation (SSH; see below) is a widely
used method for separating DNA molecules that
distinguish two closely related DNA samples.
Two of the main SSH applications are cDNA
subtraction and genomic DNA subtraction. It is
based primarily on a suppression polymerase
chain reaction (PCR) technique and combines
normalisation and subtraction in a single procedure. The normalisation step equalises the abundance of DNA fragments within the target
population, and the subtraction step excludes
sequences that are common to the populations
being compared. This dramatically increases the
probability of obtaining low-abundance differentially expressed cDNAs or genomic DNA
fragments and simplifies analysis of the subtracted
library. The SSH technique is applicable to many
comparative and functional genetic studies for
the identification of disease, developmental,
tissue-specific or other differentially expressed
genes (e.g. diseased vs. normal tissues, drought-stressed
vs. irrigated plant cells). As shown in
many examples, the SSH technique may result in
over 1,000-fold enrichment for rare sequences
in a single round of subtractive hybridisation.
SSH has been shown to be an efficient technique
for identifying and characterising differences
between two populations of nucleic acids. For
example, it detects differences between the RNA
in different cells, tissues, organisms or sexes
under normal conditions, during different
growth phases, after various treatments (e.g. hormone application, heat shock) or in diseased
(or mutant) versus healthy (or wild-type) cells.
Subtractive hybridisation also detects DNA
differences between different genomes or
between cell types where deletions or certain
types of genomic rearrangements have occurred.
Subtractive hybridisation requires two populations of nucleic acids; the tester (or tracer) contains the target nucleic acid (the DNA or RNA
differences that one wants to identify), and the
driver lacks the target sequences. The two populations are hybridised with a driver to tester ratio
of at least 10:1. Because of the large excess of
driver molecules, tester sequences are more likely
to form driver-tester hybrids than double-stranded tester. Only the sequences in common
between the tester and the driver hybridise, however, leaving the remaining tester sequences
either single-stranded or forming tester-tester
pairs. The driver-tester hybrids, double-stranded driver
and any single-stranded driver molecules are subsequently removed (the subtractive step), leaving only tester molecules enriched for sequences
not found in the driver. Usually multiple rounds
of subtractive hybridisation are necessary to identify truly tester-specific nucleic acid sequences.
There are five basic steps to subtractive hybridisation: (1) choosing material for isolating tester
and driver nucleic acids, (2) producing tester
and driver, (3) hybridising, (4) removing driver-tester
hybrids and excess driver (subtraction) and
(5) isolating the complete sequence of the
remaining target nucleic acid. Variations are
possible at each step, and the materials used and
methods chosen depend on the desired results.
When choosing appropriate sources for driver
and tester, it must be kept in mind that the less
complex the source of tester and driver and the
more sequences they have in common, the easier
it is to isolate specific target sequence differences.
For example, it is easier to identify RNA differences between cell types than it is to identify
differences between tissues because fewer genes
are expressed in single cells.
1. Preparation of Driver and Tester
In principle, both tester and driver samples
can be either DNA or RNA, but it is often
most practical for the tester to be DNA
(because the tester is present in a low concentration, and DNA is more stable than RNA)
and for the driver to be RNA (after hybridisation, excess driver RNA can be eliminated
enzymatically or by alkali degradation). In
the basic subtractive hybridisation protocol,
RNA from the tester source is reverse transcribed into complementary DNA (cDNA)
and hybridised to polyA+ driver RNA. The
tester-driver hybrids are removed, excess
fresh driver is added, and the hybridisation is
repeated once. The remaining target cDNA
is either cloned or used to make a probe. This
basic procedure is useful if the starting material is not very complex and is easy to isolate.
If little starting tissue is available or if the
starting material is complex, multiple rounds
of hybridisation-subtraction are needed, and it
is necessary to use a library- or a PCR-based
technique. Tester and driver are prepared from
cDNA libraries as phagemids or as library
inserts amplified by PCR or in vitro transcription. Alternatively, cDNA from tester and
driver sources is ligated to different primers,
amplified by PCR and hybridised. The steps
are repeated as needed.
2. Hybridisation
When single-stranded nucleic acids are
hybridised to each other, more abundant
sequences anneal more rapidly because they
encounter each other more frequently. During
subtractive hybridisation, the hybridisation
step is driven by the excess driver sequences,
so tester sequences that have complementary
sequences in the driver population rapidly
form driver-tester hybrids, whereas sequences
unique to the tester population remain single-stranded or form tester-tester pairs more
slowly. Rare sequences from either population take longer to pair up than abundant
sequences. The ratio of driver to tester, the
overall concentration of driver, the temperature and the length of hybridisation should be
chosen based on the complexity of the driver
and tester, the abundance class of the target
nucleic acids and the length of the driver and
tester sequences used.
2.1. Subtraction
The purpose of the subtraction step is to
remove driver-tester hybrids formed
during the hybridisation step, leaving
behind tester enriched for the target
sequences. Many different methods are used
for subtraction, depending on the nature
of the driver and the tester. A few possibilities are mentioned here. Hydroxyapatite
chromatography is used to bind double-stranded driver and driver-tester hybrids,
leaving single-stranded nucleic acids
behind. This is a good choice if the driver
is RNA because single-stranded RNA can
be removed chemically or enzymatically,
leaving only single-stranded cDNA tester
after the subtraction.
If the tester is a single-stranded
phagemid library and the driver is first-strand cDNA, after hybridisation, the
double-stranded driver-tester hybrids
can be digested with a frequent-cutting
restriction enzyme and the hybridisation
mixture used to infect bacteria. Only the
single-stranded tester phagemids infect,
and they can thus be isolated. A common
procedure is to use biotin-streptavidin
binding to separate nucleic acids.
Streptavidin binds to biotinylated driver
sequences, and phenol extraction is used
to remove the streptavidin protein and
the bound driver and driver-tester
hybrids. Streptavidin can also be attached
to beads or to a column and used to
remove excess driver and driver-tester
hybrids.
The effectiveness of the subtraction is
monitored by using radiolabelled tester
and determining whether the levels of
single-stranded tester decrease after subtraction. Alternatively, enrichment for
target sequences is monitored. If there
are known genes common to the driver
and tester and one or more specific to the
tester, it can be determined, after each
round of hybridisation and subtraction,
whether the tester-specific gene is becoming
more abundant compared with the common genes.
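
Monitoring enrichment round by round can be sketched with a toy model. Everything in it is an assumption made for illustration: abundances are plain numbers, the driver is rescaled to a fixed excess over the tester at every round, and the share of a tester species that escapes pairing with driver is taken to be t / (t + d). It is not a kinetic simulation, but it shows how the ratio of a tester-specific gene to the common genes would be tracked.

```python
def subtraction_round(tester, driver, ratio=10.0):
    """One idealised hybridisation-subtraction round.

    tester, driver: dicts mapping a sequence name to a positive abundance.
    The driver is rescaled so that total driver = ratio x total tester.
    Toy assumption: a fraction t / (t + d) of each tester species avoids
    pairing with driver and therefore survives the subtraction.
    """
    scale = ratio * sum(tester.values()) / sum(driver.values())
    remaining = {}
    for name, t in tester.items():
        d = driver.get(name, 0.0) * scale
        remaining[name] = t * t / (t + d)
    return remaining


def share(pool, name):
    """Fraction of the pool made up by one species."""
    return pool[name] / sum(pool.values())


def enrichment_by_round(tester, driver, target, rounds=2, ratio=10.0):
    """Fold enrichment of `target` in the tester pool after each round."""
    start = share(tester, target)
    pool = dict(tester)
    history = []
    for _ in range(rounds):
        pool = subtraction_round(pool, driver, ratio)
        history.append(share(pool, target) / start)
    return history


if __name__ == "__main__":
    tester = {"common": 99.0, "target": 1.0}
    driver = {"common": 100.0}
    for i, fold in enumerate(enrichment_by_round(tester, driver, "target"), 1):
        print(f"round {i}: target enriched about {fold:.1f}-fold")
```

In practice the same ratio would be estimated experimentally, for example from the signals of a known tester-specific gene and a known common gene after each round.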
2.2. Isolation of Target Sequences
After one or more hybridisation and
subtraction steps, the resulting tester
nucleic acids should be greatly enriched
for target sequences. However, it is still
possible that rare sequences common to
both the driver and the tester remain,
and in many cases, the sequences
isolated are only partial gene sequences.
The remaining tester sequences are isolated and analysed in a variety of ways.
Tester can be made into an enriched
library and probed with driver and tester
sequences to look for tester-specific
clones, or the tester is labelled and used
to probe tester and driver libraries and
to isolate full-length clones. It is necessary to further analyse isolated tester
sequences by Northern blotting, in situ
hybridisation or PCR methods to determine whether the sequences are truly
tester-specific.
Alternatives to standard subtractive
hybridisation techniques include
positive selection and suppression subtractive
hybridisation. In positive selection, hybridisation of tester
and driver is still carried out but, rather
than removing unwanted driver-tester
and driver sequences by subtraction
during step 4, double-stranded tester
sequences are positively selected by
selective cloning or selective amplification.
Again, various methods are employed to
carry out positive selection. A simple
method is to digest tester with a restriction
enzyme producing cohesive ends while
using sonication to shear the driver DNA
randomly. After hybridisation, DNA
ligase and vector DNA are added. Only
double-stranded tester is cloned into the
vector, which can then be used to transform bacteria.
In suppression subtractive
hybridisation, itself a positive selection
technique, both driver and tester are
digested with a frequent-cutting restriction enzyme to give blunt ends. Tester is
divided into two samples, which are
ligated to different adapters, P1 and P2,
and then hybridised to excess driver. Then
the two tester populations are mixed, and
additional driver is added. Hybrids
formed between members of the two subtracted tester populations are selectively
amplified by PCR using primers specific
to P1 and P2. Molecules that have either
P1 or P2 adapters at both ends form panhandles as the adapters hybridise to each
other, and these molecules are not
amplified by PCR; this is the
suppression effect.
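
The fate of each kind of molecule in the final PCR can be summarised as a small decision rule. The sketch below is illustrative only: the function name and outcome labels are hypothetical, and the treatment of molecules carrying an adapter on one end only (linear rather than exponential amplification) is an addition beyond the description above, based on the fact that such molecules offer a primer site at a single end.

```python
def pcr_outcome(left, right):
    """Classify a double-stranded molecule by the adapters on its two ends.

    left, right: "P1", "P2" or None (None marks a driver-derived end
    that carries no adapter).
    """
    if left is None and right is None:
        return "not amplified (no primer sites)"
    if left is None or right is None:
        return "linear amplification only"
    if left == right:
        return "suppressed (panhandle)"
    return "exponential amplification"


if __name__ == "__main__":
    molecules = {
        "tester1-tester2 hybrid": ("P1", "P2"),
        "tester1 self-hybrid": ("P1", "P1"),
        "tester2 self-hybrid": ("P2", "P2"),
        "tester-driver hybrid": ("P1", None),
        "driver-driver hybrid": (None, None),
    }
    for name, ends in molecules.items():
        print(f"{name:24s} -> {pcr_outcome(*ends)}")
```

Only the first class, formed between the two differently tagged tester pools, is amplified exponentially, so the PCR step itself performs the positive selection.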

Microarray
Microarrays are also called DNA chips or
biochips. DNA chips are made of silicon,
nylon or glass, on which DNA fragments are arrayed. The DNA fragments may be
obtained from cDNA clones, EST clones, genomic
clones or DNA amplified from open reading
frames. The size of a single DNA chip varies from
1 to 3.24 cm². Yet within this small area, nearly all the genes of a crop plant
can be displayed.
DNA chip technologies utilise microscopic
arrays (microarrays) of molecules immobilised
on solid surfaces for hybridisation analysis.
Advanced arraying technologies such as photolithography, micro-spotting and ink-jetting, coupled with sophisticated fluorescence detection
systems and bioinformatics, permit molecular
data gathering at an unprecedented rate. Mixtures
of DNA or RNA isolated from biological sources
are labelled enzymatically by incorporating
nucleotides bearing reporter groups and hybridised
to microarrays. Hybridisation reactions yield
heteroduplexes between individual components
of the fluorescent sample (probe) and complementary sequences (target) on the chip surface.
Since each target element or feature is chemically homogeneous and occupies a known location, the identity and quantity of each component
in the fluorescent mixture can be ascertained by
measuring the fluorescence intensity at each
position on the microarray. Though the basic
principles behind DNA chips (e.g. the hybridisation of samples to immobilised DNA molecules)
are conceptually similar to those used in earlier
filter-based assays (such as Southern blotting),
the precision, speed and scale afforded by DNA
chip assays are unmatched and represent a major
technological advance in molecular biology.
The characteristic features of microarrays that
make them highly useful in functional genomics
are:
1. Parallelism: Microarray analysis allows parallel acquisition and analysis of massive data.
This greatly increases the speed of experimental work. It allows meaningful comparison
between genes or gene products represented
in microarrays and may eventually allow the
analysis of the entire genome of any organism
in a single reaction. Recent gene expression
experiments in yeast are important examples
of achieving this goal.
2. Miniaturisation: Microarray analysis involves
miniaturisation of DNA assays, thus reducing reaction times
and reagent consumption.
3. Speed: Microarray analysis is highly sensitive
and allows rapid data acquisition with either
confocal scanners or cameras equipped with
charge-coupled devices (CCDs).
4. Multiplexing: This is a process by which multiple samples are analysed in a single assay.
The labelling and detection methods help to
analyse multiple samples on a single DNA
chip. Multiplexing also increases the accuracy
of comparative analysis by eliminating complicating factors such as chip-to-chip variation,
discrepancies in reaction conditions and other
shortcomings inherent in comparing separate
experiments. It has already been used in
expression analysis, genotyping and DNA
resequencing.
5. Automation: Advanced manufacturing technologies permit the mass production of DNA
chips, and automation has led to the proliferation
of microarray assays by ensuring their quality,
availability and affordability. As a result, DNA
chips may eventually become like commodity
items in the computer industry.
6. Combinatorial synthesis: Using the combinatorial synthesis strategy, the set of all 4^k oligonucleotides of length k
(k-mers) can be generated in 4 × k synthesis
cycles. For example, the set of all 4-mers
(4^4 = 256) can be synthesised in 4 rounds, each
round having 4 cycles, thus making a total of
16 cycles.
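
The arithmetic of combinatorial synthesis can be checked with a short simulation. This is a sketch of the counting argument only, with hypothetical function names: each array feature is assigned one k-mer, each round extends every feature by one position, and each of the four cycles within a round adds a single base to exactly those features that need it.

```python
from itertools import product

BASES = "ACGT"


def all_kmers(k):
    """All 4**k sequences of length k, in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


def synthesise_all_kmers(k):
    """Grow every k-mer in parallel; return (grown sequences, cycles used)."""
    targets = all_kmers(k)
    grown = [""] * len(targets)
    cycles = 0
    for position in range(k):          # one round per position
        for base in BASES:             # four cycles per round
            cycles += 1
            for i, target in enumerate(targets):
                if target[position] == base:
                    grown[i] += base   # this feature is exposed in this cycle
    return grown, cycles


if __name__ == "__main__":
    for k in (4, 8):
        print(f"k={k}: {4 ** k} oligonucleotides in {4 * k} cycles")
    grown, cycles = synthesise_all_kmers(4)
    print(grown == all_kmers(4), cycles)
```

The number of features grows exponentially with k while the number of synthesis cycles grows only linearly, which is what makes in situ synthesis of complete k-mer sets feasible.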

Types of DNA Chips and Their Production
Two major types of DNA chips are available for
DNA analysis.

Oligonucleotide-Based Chips
This type of DNA chip contains a high density
of short oligonucleotides arranged as microarrays, which are
prepared by photolithography. Such arrays contain 100,000-400,000 oligonucleotides immobilised within an area of 1.6 cm². This allows the use
of targeted regions of genomic DNA for sequencing or for large-scale analysis of single nucleotide polymorphisms (SNPs).

DNA-Based Chips or cDNA Arrays
This type of DNA chip contains a high density
of DNA fragments arranged as microarrays, most often derived from
cDNA (hence the name cDNA arrays). They are currently made by robotically spotting a large number of PCR-amplified
DNA fragments onto glass or nylon surfaces.
The hybridisation is carried out with fluorescently
labelled mRNA or its corresponding cDNA, and
the hybridised duplexes are identified by colour
fluorescence detection methods. These DNA
chips, thus, can be used for studying gene
expression patterns in time and space.
The above two types of microarrays can be
produced by using two different approaches:
synthesis and deposition. In the synthesis
approach, microarrays are prepared in a stepwise
fashion by in situ synthesis of nucleic acids from
biochemical building blocks, the nucleotides.
With each round of synthesis, individual nucleotides are added to growing chains until the
desired length is achieved. In the deposition or
delivery approach, on the other hand, separately
prepared samples of nucleic acids are deposited
exogenously for chip fabrication. Molecules,
such as cDNA fragments, are amplified by PCR
and purified; small quantities of these fragments
are then deposited onto known locations using a
variety of delivery technologies. The key parameters
for evaluating both the techniques include
microarray density and design, biochemical composition, quality, cost and ease of prototyping.

Hybridisation and Detection Methods


Hybridisation of the target DNA to a microarray
yields sequence information. The target DNA is
labelled and incubated with the array. If the target
DNA has regions complementary to the probes
on the array, then the target DNA will hybridise
with these probes. Under a fixed set of hybridisation conditions, for example, target concentration, temperature and buffer and salt concentration,
the fraction of probes bound to targets will vary
with the base composition of the probe and the
extent of the target-probe match. In general, for a
given length, probes with high GC content will
hybridise more strongly than those with high AT
content. Similarly, probes matching the target
will hybridise more strongly than probes with
mismatches, insertions and deletions. Various
detection methods are currently available for the
analysis of hybridisation patterns on microarrays
of immobilised probes. Some rely on the use of
enzymes to enable detection, while others detect
hybridisation directly.
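
The effect of base composition on duplex stability can be illustrated with a rough calculation. The sketch below uses the Wallace rule, Tm ≈ 2 °C × (A + T) + 4 °C × (G + C), a rule of thumb for short oligonucleotides that is introduced here as an assumption and is not part of the text above; it ignores salt, probe concentration and sequence context, and the function names are hypothetical.

```python
def gc_fraction(probe):
    """Fraction of G and C bases in the probe."""
    probe = probe.upper()
    return (probe.count("G") + probe.count("C")) / len(probe)


def wallace_tm(probe):
    """Approximate melting temperature (deg C) of a short oligonucleotide.

    Wallace rule: 2 degrees per A/T and 4 degrees per G/C. Only a rough
    guide, intended for probes of roughly 14-20 nucleotides.
    """
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    for probe in ("GCGCGCGCGCGCGCGC", "ATGCATGCATGCATGC", "ATATATATATATATAT"):
        print(probe, f"GC={gc_fraction(probe):.2f}", f"Tm~{wallace_tm(probe)} C")
```

Probes of equal length but different GC content therefore behave differently under one fixed set of hybridisation conditions, which is one reason why signal intensities across an array are not directly comparable between probes.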
For the detection of hybridisation patterns on
DNA chips, the technique of reverse dot-blot,
used earlier on membranes, is utilised. The
technique is so described because, as opposed to
dot-blots, where the target DNA is dot-blotted on
the membrane and the probes are labelled, on
DNA chips the probes are anchored in the form
of microarrays and the target DNA is labelled.
Once hybridisation is completed, the detection
of hybridisation is achieved either with the help
of an enzyme system (enzyme-assisted detection) or directly due to radiolabelling and/or
fluorescence.
The target DNA is either nonradioactively
labelled (biotin or digoxigenin labelling) or
radioactively labelled, the former requiring enzymatic detection and the latter requiring direct
detection through autoradiography, gas phase
ionisation or phosphorimagers. However, there
are drawbacks with the detection methods involving radioactivity (such as low resolution). In
order to circumvent these problems, fluorochromes
may be used, which also allow direct detection through fluorescence. This also allows
multiplexing, where more than one target DNA,
each labelled with a different fluorochrome, can be used
for hybridisation to the microarray on a DNA
chip. The hybridisation patterns can be scanned
in this case using an automatic scanner. These detection systems are based either on lens-based systems (epifluorescent and confocal microscopes)
or on CCD-based systems. The lens-based systems, including confocal microscopy, allow
selective detection of the surface-bound molecules, as opposed to those in the surrounding
fluid medium. However, these are not well suited
to the level of miniaturisation already achieved in
DNA chip technology. Therefore, more recently
CCD detection systems have been developed to
detect small quantities of array-bound molecules.
In this method, labelled target DNA is hybridised
to an immobilised probe on a silicon wafer. The
wafer is then placed on the CCD surface, and a
signal is generated. A fluorescence microscope
fitted with a CCD camera and a computer is used
for data capturing.
Once the microarray scanners have captured
the image of the microarray biochip, that image
must be rigorously analysed to determine which
elements correspond to artefacts or contamination
and which correspond to actual signal. Due to the
huge number of spots on the array, automatic
determinations must be made concerning issues
such as background intensity, the presence of
brightly glowing dust or lint artefacts, the occurrence of donut-shaped signals rather than solid
spots and the warping or irregularities in the array
itself. Image analysis software (e.g. Array Vision,
Clone Tracker, ImaGene and Gene Vision) has
been steadily improving to meet these challenges.
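
The kind of automatic decision such software makes for each spot can be sketched in a few lines. This is a deliberately simplified illustration and does not reproduce the algorithm of any of the packages named above: the pixel lists, the 16-bit saturation value, the minimum signal-to-background ratio and the flag names are all assumptions.

```python
from statistics import median


def quantify_spot(spot_pixels, background_pixels,
                  saturation=65535, min_ratio=2.0):
    """Background-correct one spot and flag likely problems.

    spot_pixels:       intensities inside the spot outline
    background_pixels: intensities in the local surrounding area
    saturation:        assumed detector maximum (16-bit image)
    min_ratio:         assumed minimum foreground/background ratio
    """
    foreground = median(spot_pixels)
    background = median(background_pixels)
    flags = []
    if max(spot_pixels) >= saturation:
        flags.append("saturated pixel (possible dust or lint)")
    if background > 0 and foreground / background < min_ratio:
        flags.append("weak signal")
    return {"signal": max(foreground - background, 0), "flags": flags}


if __name__ == "__main__":
    clean = quantify_spot([1000, 1100, 900], [100, 110, 90])
    dusty = quantify_spot([150, 160, 65535], [100, 100, 100])
    print(clean)
    print(dusty)
```

Medians are used here rather than means so that a single bright artefact pixel does not dominate the estimate; detecting doughnut-shaped spots or grid warping would need spatial information that this sketch leaves out.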
Microarrays have a large number of applications, which will expand in future. Some of them
include:

1. DNA Sequencing by Hybridisation


The two popular methods of sequencing are
Sanger's dideoxy synthesis method and
Maxam and Gilbert's chemical degradation method (see
below). Sanger's method is still used
as a routine method for DNA sequencing.
However, the efficiency, cost and reliability of
these two methods could not cope with
the requirements of large-scale genome sequencing. Therefore, in the late 1980s, a new approach
towards DNA sequencing was suggested simultaneously by four groups. The approach was
described as sequencing by hybridisation (SBH).
The method involves manufacturing sequencing DNA chips that contain a complete set of
immobilised oligonucleotides of a particular size
(e.g. 8-mers) and hybridisation of the target DNA
of unknown sequence (whose sequence is to be
determined) onto these DNA chips. The hybridisation patterns are then recorded using one of
the several suitable devices discussed earlier.
Identification and analysis of the overlapping
oligomers that form perfect duplexes with the
DNA of interest permits reconstruction of the target DNA sequence. During the 1980s, it was
believed that SBH using microarrays carrying all
the possible 65,536 (4^8) octamer oligonucleotides
could be used as an alternative to
Sanger's dideoxy and Maxam and Gilbert's
methods of sequencing. However, this objective
has not been achieved, since uniform
hybridisation signals cannot be obtained for a large
number of oligonucleotides in parallel due to
sequence-dependent variability in heteroduplex
formation. This leads to false positives and false
negatives so that unambiguous determination of
an unknown sequence is not always possible.
Further complications arise due to repeated
sequences. Consequently, the technical barriers
of SBH are now obvious, and microarrays, which
were initially considered to be useful only for SBH,
are now used for a variety of other purposes.
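
The reconstruction step of SBH, and the way repeats defeat it, can be shown with a small sketch. It assumes an ideal experiment in which exactly the k-mers present in the target give a signal; the greedy chaining used here is a simplification chosen for illustration (the function names are hypothetical) and it simply gives up when the overlaps are ambiguous.

```python
def spectrum(sequence, k):
    """Set of all k-mers in the sequence (what an ideal chip would report)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct(kmers, k):
    """Chain k-mers by (k-1)-base overlaps; return None if ambiguous."""
    kmers = set(kmers)
    suffixes = {kmer[1:] for kmer in kmers}
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(starts) != 1:
        return None                      # no unique starting oligomer
    sequence = starts[0]
    used = {sequence}
    while len(used) < len(kmers):
        tail = sequence[-(k - 1):]
        candidates = [kmer for kmer in kmers - used if kmer[:-1] == tail]
        if len(candidates) != 1:
            return None                  # repeat or missing signal
        sequence += candidates[0][-1]
        used.add(candidates[0])
    return sequence


if __name__ == "__main__":
    target = "ATGGCGTGCA"
    print(reconstruct(spectrum(target, 4), 4) == target)
    repeated = "ATGATGA"
    print(reconstruct(spectrum(repeated, 3), 3))
```

A single false-negative or false-positive oligomer breaks the chain in the same way a repeat does, which mirrors the practical limitations described above.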

2. Single Nucleotide Polymorphisms and Point Mutations
Restriction fragment length polymorphisms
(RFLPs) and simple sequence repeats (SSRs)
were the markers of choice in the past, but these
markers had some drawbacks. For instance, they
need gel-based assays and are, therefore, time
consuming and expensive. Recently, single
nucleotide polymorphisms (SNPs) as biallelic
genetic markers have been extensively used
as the markers of choice (refer to chapter 3).
Although they have the disadvantage of being
biallelic as against SSRs, which are polyallelic,
their abundance (more than 1 per 1,000 bp) makes
them attractive. Genotyping individuals using
SNPs through microarray needs only plus/minus
assay, and hence, it permits easier automation.
Further, high-density oligonucleotide arrays
allow genotyping at a large number of these biallelic loci in parallel. The approach used for this
purpose relies on the capacity to distinguish a
perfect match from a single-base mismatch. A set
of four groups of oligonucleotides of known and
related sequences is used, such that corresponding oligomers in the four groups differ
only at the central base. For this purpose, a tiling strategy proposed by Affymetrix makes use
of a microarray of 40,000 oligomers for resequencing a 10 kb gene. The use of SNPs offers great
promise for rapid and highly automated genotyping, leading to rapid progress in developing
high-resolution genetic maps (refer to chapter 7).
However, it has been emphasised that there are also
some problems with this technology, since the association of SNPs with individual traits can be broken
by recombination, thus making it necessary to
have many SNPs associated with a trait.
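
The four-probe tiling idea can be sketched as follows. This is a simplified illustration, not the actual array design: probes are written in the same sense as the sample for readability (real probes are complementary), hybridisation is reduced to exact string matching, and the probe length and function names are assumptions. Four probes per interrogated position give 4 × 10,000 = 40,000 oligomers for a 10 kb gene, consistent with the figure quoted above.

```python
BASES = "ACGT"


def tiling_probes(reference, probe_len=15):
    """For each interior position, four probes differing only at the centre."""
    half = probe_len // 2
    tiles = {}
    for pos in range(half, len(reference) - half):
        window = reference[pos - half:pos + half + 1]
        tiles[pos] = {b: window[:half] + b + window[half + 1:] for b in BASES}
    return tiles


def call_base(sample, pos, probes):
    """Base whose probe matches the sample perfectly, or None if none does."""
    half = len(next(iter(probes.values()))) // 2
    window = sample[pos - half:pos + half + 1]
    matches = [base for base, probe in probes.items() if probe == window]
    return matches[0] if matches else None


if __name__ == "__main__":
    reference = "ACGTACGTTAGCCGATAGCTA"
    sample = reference[:10] + "A" + reference[11:]   # G -> A change at position 10
    tiles = tiling_probes(reference, probe_len=5)
    print("probes on array:", sum(len(t) for t in tiles.values()))
    for pos in (9, 10, 11):
        print(pos, call_base(sample, pos, tiles[pos]))
```

The variant base is called at the changed position, while the neighbouring positions give no perfect match because their probes overlap the change; on a real array this appears as a local drop in signal around the polymorphism.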

3. Functional Genomics
Microarrays for gene expression analysis provide
an integrated platform for functional genomics.
Samples of mRNA from a variety of cells and tissues are used for microarray analysis and
yield information about specific changes
in gene expression patterns. The mRNA samples
of interest are labelled and used for hybridisation-based microarray analysis, yielding quantitative
data on the expression of thousands of cellular
genes. Parallel measurement of transcript levels
for thousands of genes is one of the most widespread uses of DNA chip technology. Both oligonucleotide and cDNA microarrays are very useful
for estimating levels of transcripts.

4. Reverse Genetics
DNA chips can also be used for characterisation
of mutant populations exposed to various selection pressures, to collect information about the
fitness value of a variety of alleles for each of the
large number of genes in a species. This is done
particularly in organisms where the complete
sequence of the genome is already available, by
studying the impact of deletions/insertions followed by analysis of their fitness (such an
approach, where a study starts with the DNA
sequence and concludes with the analysis of
phenotype, is described as reverse genetics).
This can be achieved if the mutants are first subjected to a selection pressure and then characterised. This can be illustrated using the example of
yeast, where the genome has been completely
sequenced and was shown to carry 6,000 open
reading frames (ORFs). Unique molecular
sequences or bar codes can be introduced in
each of the above 6,000 ORFs in the yeast
genome. A mixture of yeast strains containing
individual bar codes for all 6,000 genes is then
subjected to a selection pressure. Samples of
cells are taken, and bar code sequences are
labelled using multiplex PCR with fluorescent
primers. A pool of fluorescent amplicons is then
hybridised to an oligonucleotide microarray containing sequences complementary to each of the
amplified bar codes, and after detection of
fluorescent signals, an estimate of fitness of each
strain under a given selection pressure can be
worked out. In species where the genome
sequence is not yet fully determined, ESTs can
be used to identify mutants. Hybridisation of
PCR amplicons (derived from lines carrying insertion elements) to a microarray of ESTs
can be used to identify mutant lines.
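
One simple way to turn bar code signals into a fitness estimate is sketched below. The text does not specify a formula, so the score used here (log2 change in each strain's share of the total signal between two time points) is an assumption chosen for illustration, as are the strain names and signal values.

```python
import math


def relative_abundance(signals):
    """Convert raw bar code signals to fractions of the pooled signal."""
    total = sum(signals.values())
    return {strain: value / total for strain, value in signals.items()}


def fitness_scores(before, after):
    """log2 change in relative abundance per strain under selection.

    Positive scores mark strains that increased in the pool; negative
    scores mark strains that were depleted. Strains with no signal at
    either time point are skipped.
    """
    share_before = relative_abundance(before)
    share_after = relative_abundance(after)
    scores = {}
    for strain, start in share_before.items():
        end = share_after.get(strain, 0.0)
        if start > 0 and end > 0:
            scores[strain] = math.log2(end / start)
    return scores


if __name__ == "__main__":
    before = {"orf1_del": 100.0, "orf2_del": 100.0, "orf3_del": 100.0}
    after = {"orf1_del": 180.0, "orf2_del": 100.0, "orf3_del": 20.0}
    for strain, score in sorted(fitness_scores(before, after).items()):
        print(f"{strain}: {score:+.2f}")
```

Normalising to the pooled signal compensates for differences in overall labelling or hybridisation efficiency between the two arrays, although it cannot correct for probe-specific effects.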

5. Diagnostics and Genetic Mapping


DNA chips are also being used for diagnostics.
Since some information about the alleles belonging to genes responsible for a number of diseases
is available, the search can be focused on a
restricted number of polymorphisms, thus reducing the required number of features on a DNA
chip. For instance, human diagnostic chips have
been prepared to detect mutant alleles in CFTR
(cystic fibrosis), BRCA1 (breast cancer susceptibility
gene) and beta-globin genes. For CFTR, one
microarray containing 428 features was designed
to detect mutations in exon 11 of CFTR, and
another microarray containing 1,480 features was
designed for detection of known deletions, insertions or base substitutions. Hybridisation of
genomic DNA samples from cystic fibrosis patients with
already characterised mutations to diagnostic
chips for CFTR gave the expected results. Similarly,
genotyping of patients with uncharacterised
mutations by microarrays could be confirmed by
RFLP and PCR techniques. These results
confirmed the utility of microarrays in diagnostics. DNA chip technology was also successfully
applied to the genotyping of hepatitis virus in
blood samples.

6. Genomic Mismatch Scanning


Genomic mismatch scanning (GMS) is a hybridisation-based method for linkage analysis.
Homologous segments are identified by the
formation of heteroduplexes that are free of any
mismatches. Fragments of chromosomal DNA
representing inherited regions are hybridised to a
microarray of ordered genomic clones, and positive hybridisation signals pinpoint regions of
identity by descent at high resolution. Mapped
PCR products can be used to prepare a microarray of physical fragments, which can also be used for
detecting meiotic recombination breakpoints.
GMS is only one example of the use of
microarrays to characterise the composition of
a nucleic acid mixture subjected to in vitro selection.
Restriction endonuclease protection, selection
and amplification (REPSA) is another example of
a selection method that could be adapted to
DNA microarray-based detection. REPSA makes
use of a combination of restriction enzyme cleavage, PCR amplification and filter binding to selectively identify the DNA sequences bound
by DNA-binding proteins.

7. DNA Chips and Agriculture


DNA chips with ESTs can also be used to collect
data on expression in an agricultural crop under
different conditions. This information can prove
to be of practical utility in agricultural biotechnology. For instance, if the genes whose expression
responds to a hormone are known, the action of that hormone can be monitored. Transgenic plants can also be rapidly analysed using microarrays, and expression patterns
under different environmental conditions can be predicted at the gene level. The action of a herbicide
can be similarly determined and a decision
taken on the application of the herbicide. DNA
microarrays are also being extensively used for the
study of DNA polymorphism (e.g. SNPs) to develop
molecular markers tagged to specific economic
traits (see above). The molecular markers thus
developed can be used in diagnostics and for
actual molecular marker-aided selection in breeding programmes. The main advantage of DNA
chips for developing molecular markers is the
simultaneous analysis of thousands of polymorphisms in a single experiment. This will of course
require a cost-effective microarray technology.
The current excitement and activity around this technique
suggest that a complete microarray system
will soon be available at an affordable price.
Functional analysis, through parallel expression monitoring, should help researchers better
understand the fundamental mechanisms that
underlie plant growth and development. By accumulating databases of expression information as a
function of tissue type, developmental stage, hormone and herbicide treatment, genetic background and environmental condition, it should be
possible to identify the genes involved in many
aspects of plant biology. Microarray analysis provides a way to link genomic sequence information
and functional analysis. Several specific research
areas will be of significant commercial interest.
Because of the central role of plant hormones in
plant growth and development, microarray-based
gene expression analysis of plant hormone action
will be an important commercial project. The
interplay of genes and the environment is also of
particular importance in plants and will constitute
another area of research interest. Microarrays will
assist plant biotechnology companies by allowing
rapid analysis of transgenic plants. These data
will permit genome-wide correlations between
expression patterns and a host of desirable traits
such as fertility, seed set, yield and resistance to
environmental stress and insects. It may ultimately
be possible to reduce the need for costly field trials by chip-based analysis of transgenic lines. The
use of microarray technology to understand the
effect of small molecules on gene expression
might serve to speed the discovery of herbicides
and elucidate their mechanism of action.
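
The idea of correlating expression patterns with desirable traits can be illustrated with a small sketch. It assumes that each gene has one expression value per plant line and that the trait (e.g. yield) has been measured on the same lines. The function names and the choice of Pearson correlation are illustrative assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def rank_genes_by_trait(expression, trait):
    """Rank genes by the absolute correlation of expression with a trait."""
    scores = {gene: pearson(values, trait) for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical expression values across four transgenic lines
    expression = {"geneA": [1, 2, 3, 4], "geneB": [4, 1, 3, 2]}
    trait = [10, 20, 30, 40]  # hypothetical trait values for the same lines
    print(rank_genes_by_trait(expression, trait))
```

Genes at the top of the ranking are only candidates. A correlation of this kind does not by itself establish that a gene controls the trait.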

8. Proteomics
Like genomics, proteomics is a genome-scale discipline, and one of its concerns is the
study of protein-protein interactions. DNA chips
can also be used for this area of study. Protein
linkage maps can also be created using genomic
sequence information. Protein-protein interactions can be studied using the yeast two-hybrid
system. In this system, two fusion proteins are
used for the activation of transcription of a
reporter gene in yeast. The first fusion protein
contains a DNA-binding domain fused to one protein of interest, and the second contains a transcriptional activation domain fused to a
second protein of interest. Specific interaction
between two chimeric proteins leads to transcriptional activation of the reporter genes which
can be easily scored with colour-based assays.
The identity of the two proteins of interest is
confirmed by sequence analysis of each clone
thus identified. Therefore, major sequencing
work is involved in the two-hybrid system.
As an alternative to the DNA sequencing needed in
two-hybrid analysis, DNA
chip arrays can be used to identify the genes
involved in protein-protein interactions. In cases
where entire genome sequences are available,
DNA chips can be used for parallel resequencing
so that clones involved in the two-hybrid system
can be identified through a single hybridisation to
genomic chips. A phage display library can
also be used with a DNA chip-based detection system. This involves the use of fusion proteins encoded
by chimeric sequences of a phage coat protein
gene and the gene of interest.

9. Nucleic Acid Sequencing


The term DNA sequencing refers to biochemical
methods for determining the order of the nucleotide bases, adenine, guanine, cytosine and thymine, in a DNA molecule. The sequence of DNA
constitutes the heritable genetic information in
nuclei, plasmids, mitochondria and chloroplasts
that forms the basis for the developmental programmes of all living organisms. Determining the
DNA sequence is therefore useful in basic research
studying fundamental biological processes, as
well as in applied fields such as diagnostic or
forensic research, genetic mapping and MAS. The
advent of DNA sequencing has significantly
accelerated biological research and discovery.
The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the large-scale sequencing of the plant
genomes. The field of DNA sequencing technology development has a rich and diverse history.
However, the overwhelming majority of DNA
sequence production to date has relied on some
version of the Sanger biochemistry.

In the late 1970s, two DNA sequencing
techniques for longer DNA molecules were
invented. These were the Sanger (or dideoxy)
method and the Maxam-Gilbert (chemical cleavage) method. The Maxam-Gilbert method is
based on nucleotide-specific cleavage by chemicals and is best used to sequence oligonucleotides
(short nucleotide polymers, usually smaller than
50 base pairs in length). The Sanger method is
more commonly used because it has proven
technically easier to apply and, with the advent of
PCR and automation of the technique, is easily
applied to long strands of DNA including some
entire genes. This technique is based on chain
termination by dideoxy nucleotides during PCR
elongation reactions.
In the Sanger method, the DNA strand to be
analysed is used as a template, and DNA polymerase is used, in a PCR reaction, to generate
complementary strands using primers. Four
different PCR reaction mixtures are prepared,
each containing a certain percentage of a dideoxynucleoside triphosphate (ddNTP) analogue of
one of the four nucleotides (dATP, dCTP, dGTP or
dTTP). Synthesis of the new DNA strand continues until one of these analogues is incorporated,
at which time the strand is prematurely terminated.
Each PCR reaction will end up containing a mixture of different lengths of DNA strands, all ending with the nucleotide whose dideoxy analogue was used
for that reaction. Gel electrophoresis is then used
to separate the strands of the four reactions, in
four separate lanes, and determine the sequence
of the original template based on what lengths of
strands end with what nucleotide.
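
The read-out logic of the four-lane gel can be sketched in a few lines. This is an idealised illustration that assumes every possible termination product is present and perfectly resolved. The function names are hypothetical, and the input is the newly synthesised strand rather than the template it copies.

```python
def sanger_lanes(synthesised_strand):
    """Return, for each ddNTP reaction, the lengths of the terminated fragments.

    A fragment of length n appears in the lane of base X when position n of
    the newly synthesised strand is X (chain terminated by ddX).
    """
    lanes = {"A": [], "C": [], "G": [], "T": []}
    for length, base in enumerate(synthesised_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence by ordering fragments from shortest to longest."""
    base_at_length = {}
    for base, lengths in lanes.items():
        for length in lengths:
            base_at_length[length] = base
    return "".join(base_at_length[n] for n in sorted(base_at_length))


if __name__ == "__main__":
    lanes = sanger_lanes("GATTACA")
    print(lanes)
    print(read_gel(lanes))
```

Reading the gel from the shortest fragment upwards gives the synthesised strand from its 5' end. The template sequence is its reverse complement.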
In the automated Sanger reaction, primers are
used that are labelled with four different coloured
fluorescent tags. PCR reactions, in the presence
of the different dideoxy nucleotides, are performed as described above. Next, the
four reaction mixtures are combined and
applied to a single lane of a gel. The colour of
each fragment is detected using a laser beam, and
the information is collected by a computer which
generates chromatograms showing peaks for each
colour, from which the template DNA sequence
can be determined. Typically, the automated
sequencing method is only accurate for sequences
up to a maximum of about 700-800 bp in length.
However, it is possible to obtain full sequences of
larger genes and, in fact, whole genomes, using
stepwise methods such as primer walking and
shotgun sequencing.
In primer walking, a workable portion of a
larger gene is sequenced using the Sanger method.
New primers are generated from a reliable segment of the sequence and used to continue
sequencing the portion of the gene that was out of
range of the original reactions. Shotgun sequencing entails randomly cutting the DNA segment of
interest into more appropriate (manageable) sized
fragments, sequencing each fragment and arranging the pieces based on overlapping sequences.
This technique has been made easier by the application of computer software for arranging the
overlapping pieces.
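
The arrangement of shotgun fragments by their overlaps can be sketched as a greedy merge. This is a toy illustration that assumes error-free reads from a single strand and a sequence with no repeats. Real assembly software is far more elaborate, and the function names and the `min_len` threshold are assumptions.

```python
def overlap(left, right, min_len):
    """Length of the longest suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

If the fragments do not all overlap, the function returns several contigs instead of one. That mirrors the gaps left when coverage is incomplete.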

Second-Generation DNA Sequencing


Alternative strategies for DNA sequencing can be
grouped into several categories. These include
(1) micro-electrophoretic methods, (2) sequencing by hybridisation, (3) real-time observation of
single molecules and (4) cyclic-array sequencing.
Here, we use 'second generation' in reference to
the various implementations of cyclic-array
sequencing that have recently been realised in a
commercial product (e.g. 454 sequencing (used in
the 454 Genome Sequencers, Roche Applied
Science; Basel), Solexa technology (used in the
Illumina (San Diego) Genome Analyser), the
SOLiD platform (Applied Biosystems; Foster
City, CA, USA), the Polonator (Dover/Harvard)
and the HeliScope Single Molecule Sequencer
technology (Helicos; Cambridge, MA, USA)).
The concept of cyclic-array sequencing can be
summarised as the sequencing of a dense array of
DNA features by iterative cycles of enzymatic
manipulation and imaging-based data collection.
Although these platforms are quite diverse in
sequencing biochemistry as well as in how the
array is generated, their workflows are conceptually similar. Library preparation is accomplished
by random fragmentation of DNA, followed by
in vitro ligation of common adaptor sequences.
Sequencing features are then generated by clonal amplification.
What is common to these methods is that
PCR amplicons derived from any given single
library molecule end up spatially clustered, either
to a single location on a planar substrate (in situ
polonies, bridge PCR) or to the surface of micron-scale beads, which can be recovered and arrayed
(emulsion PCR). The sequencing process itself
consists of alternating cycles of enzyme-driven
biochemistry and imaging-based data acquisition.

454 Pyrosequencing
The 454 system was the first next-generation
sequencing platform available as a commercial
product. In this approach, libraries may be constructed by any method that gives rise to a mixture of short, adaptor-flanked fragments. Clonal
sequencing features are generated by emulsion
PCR, with amplicons captured to the surface of
28-μm beads. After breaking the emulsion,
beads are treated with denaturant to remove
untethered strands and then subjected to a
hybridisation-based enrichment for amplicon-bearing beads (i.e. those that were present in an
emulsion compartment supporting a productive
PCR reaction). A sequencing primer is hybridised to the universal adaptor at the appropriate
position and orientation, that is, immediately
adjacent to the start of unknown sequence.
Sequencing is performed by the pyrosequencing method. In brief, the amplicon-bearing beads
are pre-incubated with Bacillus stearothermophilus (Bst) polymerase and single-stranded binding
protein and then deposited on to a micro-fabricated array of picoliter scale wells (with dimensions such that only one bead will fit per well) to
render this biochemistry compatible with array-based sequencing. Smaller beads are also added,
bearing immobilised enzymes which are also
required for pyrosequencing (e.g. ATP sulfurylase and luciferase). During the sequencing, one
side of the semi-ordered array functions as a flow
cell for introducing and removing sequencing
reagents, whereas the other side is bonded to a
fibre-optic bundle for CCD (charge coupled
device)-based signal detection. At each of several
hundred cycles, a single species of unlabelled
nucleotide is introduced. On templates where this
results in an incorporation event, pyrophosphate
is released. Via ATP sulfurylase and luciferase,
incorporation events immediately drive the generation of a burst of light, which is detected by
the CCD as corresponding to the array coordinates of specific wells.
In contrast with other platforms, therefore, the
sequencing by synthesis must be monitored live
(i.e. the camera does not move relative to the
array). Across multiple cycles (e.g. A-G-C-T-A-G-C-T), the pattern of detected incorporation
events reveals the sequence of templates represented by individual beads. Like the HeliScope
(discussed below), the sequencing is asynchronous in that some features may get ahead or
behind other features depending on their sequence
relative to the order of base addition. A major
limitation of the 454 technology relates to
homopolymers (i.e. consecutive instances of the
same base, such as AAA or GGG). Because there
is no terminating moiety preventing multiple consecutive incorporations at a given cycle, the length
of all homopolymers must be inferred from the
signal intensity. This is prone to a greater error
rate than the discrimination of incorporation versus non-incorporation. As a consequence, the
dominant error type for the 454 platform is insertion-deletion, rather than substitution. Relative to
other next-generation platforms, the key advantage of the 454 platform is read-length. For example, the 454 FLX instrument generates ~400,000
reads per instrument run at lengths of 200-300 bp.
Currently, the per-base cost of sequencing with
the 454 platform is much greater than that of other
platforms (e.g. SOLiD and Solexa), but it may be
the method of choice for certain applications
where long read-lengths are critical (e.g. de novo
assembly and metagenomics).
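
The way a sequence is inferred from the pattern of light signals, and why homopolymers are error-prone, can be sketched with an idealised flowgram. The sketch assumes that the light signal is exactly proportional to the number of bases incorporated in a flow. The function names and the flow order are illustrative assumptions, not the instrument's software.

```python
def flowgram(synthesised_strand, flow_order="AGCT", max_flows=400):
    """Simulate ideal signals: one (base, intensity) pair per nucleotide flow."""
    signals = []
    position = 0
    flow = 0
    while position < len(synthesised_strand) and flow < max_flows:
        base = flow_order[flow % len(flow_order)]
        run = 0
        # Without a terminator, a whole homopolymer run is extended in one flow
        while position < len(synthesised_strand) and synthesised_strand[position] == base:
            run += 1
            position += 1
        signals.append((base, float(run)))
        flow += 1
    return signals


def call_bases(signals):
    """Infer the sequence by rounding each intensity to a whole number of bases."""
    return "".join(base * round(intensity) for base, intensity in signals)


if __name__ == "__main__":
    ideal = flowgram("AAGT")
    print(ideal)
    print(call_bases(ideal))
    # A noisy reading of the AA run illustrates an insertion error
    print(call_bases([("A", 2.6), ("G", 1.0), ("C", 0.0), ("T", 1.0)]))
```

Because the length of a run is read from the intensity alone, a modest measurement error on a long homopolymer becomes an insertion or a deletion rather than a substitution.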

Illumina Genome Analyser


Commonly referred to as 'the Solexa', this platform has its origins in work by Turcatti and colleagues and in the merger of four companies: Solexa
(Essex, UK), Lynx Therapeutics (Hayward, CA,
USA), Manteia Predictive Medicine (Coinsins,
Switzerland) and Illumina. Libraries can be
constructed by any method that gives rise to a
mixture of adaptor-flanked fragments up to several
hundred bp in length. Amplified sequencing
features are generated by bridge PCR. In this
approach, both forward and reverse PCR primers
are tethered to a solid substrate by a flexible
linker, such that all amplicons arising from any
single template molecule during the amplification
remain immobilised and clustered to a single
physical location on an array. On the Illumina
platform, the bridge PCR is somewhat unconventional in relying on alternating cycles of extension with Bst polymerase and denaturation with
formamide. The resulting clusters each consist
of ~1,000 clonal amplicons. Several million clusters can be amplified to distinguishable locations
within each of eight independent lanes that are
on a single flow cell (such that eight independent
libraries can be sequenced in parallel during the
same instrument run). After cluster generation,
the amplicons are made single-stranded (linearisation)
and a sequencing primer is hybridised to a universal sequence flanking the region of interest.
Each cycle of sequence interrogation consists of
single-base extension with a modified DNA polymerase and a mixture of four nucleotides. These
nucleotides are modified in two ways. They are
reversible terminators, in that a chemically
cleavable moiety at the 3′ hydroxyl position
allows only a single-base incorporation to occur
in each cycle, and one of four fluorescent labels,
also chemically cleavable, corresponds to the
identity of each nucleotide. After single-base
extension and acquisition of images in four channels, chemical cleavage of both groups sets up for
the next cycle. Read-lengths up to 36 bp are currently routine; longer reads are possible but may
incur a higher error rate.
Read-lengths are limited by multiple factors
that cause signal decay and dephasing, such as
incomplete cleavage of fluorescent labels or terminating moieties. The dominant error type is
substitution, rather than insertions or deletions
(and homopolymers are certainly less of an issue
than with other platforms such as 454). Average
raw error rates are on the order of 1-1.5%, but
higher accuracy bases with error rates of 0.1% or
less can be identified through quality metrics
associated with each base-call. As with other
systems, modifications have recently enabled
mate-paired reads, for example, each sequencing
feature yielding 2 × 36 bp independent reads
derived from each end of a given library molecule
several hundred bases in length.
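
The link between an error rate and a per-base quality value can be made concrete with a short sketch. It assumes that the quality metric is a Phred-scaled score, Q = -10 log10(p), which the text does not state explicitly for this platform. The function names and the threshold are illustrative.

```python
import math


def phred_from_error(error_probability):
    """Convert an error probability into a Phred-scaled quality score."""
    return -10.0 * math.log10(error_probability)


def error_from_phred(quality):
    """Convert a Phred-scaled quality score back into an error probability."""
    return 10.0 ** (-quality / 10.0)


def high_accuracy_bases(read, qualities, min_quality=30.0):
    """Keep only the base-calls whose quality meets the threshold."""
    return [(index, base) for index, (base, quality)
            in enumerate(zip(read, qualities)) if quality >= min_quality]


if __name__ == "__main__":
    print(phred_from_error(0.001))   # an error rate of 0.1%
    print(phred_from_error(0.015))   # an error rate of 1.5%
    print(high_accuracy_bases("ACGT", [35, 12, 30, 8]))
```

Under this assumption, the 0.1% error rate mentioned above corresponds to a score of about 30, and the raw 1-1.5% rate to a score of roughly 18-20.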

AB SOLiD
This platform has its origins in the system
described by J. Shendure and colleagues in 2005
and in work by McKernan and colleagues at
Agencourt Personal Genomics (Beverly, MA,
USA), which was acquired by Applied Biosystems
(Foster City, CA, USA) in 2006. Libraries may
be constructed by any method that gives rise to a
mixture of short, adaptor-flanked fragments,
though much effort with this system has been put
into protocols for mate-paired tag libraries with
controllable and highly flexible distance distributions. Clonal sequencing features are generated
by emulsion PCR, with amplicons captured to the
surface of 1-μm paramagnetic beads. After breaking the emulsion, beads bearing amplification
products are selectively recovered and then immobilised to a solid planar substrate to generate a
dense, disordered array. Sequencing by synthesis
is driven by a DNA ligase, rather than a polymerase. A universal primer complementary to
adaptor sequence is hybridised to the array of
amplicon-bearing beads. Each cycle of sequencing involves the ligation of a degenerate population of fluorescently labelled octamers. The octamer
mixture is structured, in that the identity of
specific position(s) within the octamer (e.g. base
5) correlates with the identity of the fluorescent
label. After ligation, images are acquired in four
channels, effectively collecting data for the same
base positions across all template-bearing beads.
Then, the octamer is chemically cleaved between
positions 5 and 6, removing the fluorescent label.
Progressive rounds of octamer ligation enable
sequencing of every 5th base (e.g. bases 5, 10,
15, 20). Upon completing several such cycles, the
extended primer is denatured to reset the system.
Subsequent iterations of this process can be
directed at a different set of positions (e.g. bases
4, 9, 14, 19) either by using a primer that is set
back one or more bases from the adaptor-insert
junction or by using different mixtures of octamers where a different position (e.g. base 2) is
correlated with the label. An additional feature
of this platform involves the use of two-base
encoding, which is an error-correction scheme in
which two adjacent bases, rather than a single
base, are correlated with the label. Each base
position is then queried twice (once as the first
base and once as the second base, in a set of 2 bp
interrogated on a given cycle) such that miscalls
can be more readily identified.
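
Both the position pattern of the ligation rounds and the two-base encoding can be sketched briefly. The colour assignment below uses a common representation in which each dinucleotide colour is the XOR of two-bit base codes. The function names, and the assumption that the first base is known from the adaptor, are illustrative.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}


def interrogated_positions(primer_offset, cycles, step=5, labelled_position=5):
    """Template positions read in one round; offset 0 gives 5, 10, 15, ..."""
    return [labelled_position - primer_offset + step * c for c in range(cycles)]


def encode_colours(sequence):
    """One colour (0-3) per pair of adjacent bases."""
    return [BASE_TO_BITS[a] ^ BASE_TO_BITS[b] for a, b in zip(sequence, sequence[1:])]


def decode_colours(first_base, colours):
    """Rebuild the bases from a known first base and the colour calls."""
    bases = [first_base]
    for colour in colours:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ colour])
    return "".join(bases)


if __name__ == "__main__":
    print(interrogated_positions(0, 4))
    print(interrogated_positions(1, 4))
    print(encode_colours("ATGGC"))
    print(encode_colours("ATCGC"))  # a single-base change alters two colours
    print(decode_colours("A", encode_colours("ATGGC")))
```

Because every base contributes to two adjacent colours, a true single-base variant changes two neighbouring colour calls. An isolated colour change is therefore more likely to be a measurement error.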
A related system to the SOLiD is the Polonator,
also based in part on the system developed by J.
Shendure and the Church group at Harvard. This
platform also uses sequencing features generated
by emulsion PCR and sequencing by ligation.
The cost of the instrument, however, is substantially lower than that of other second-generation
sequencing instruments. Additionally, the instrument is open source and programmable, potentially enabling user innovation (e.g. the use of
alternative biochemistries). The current read-lengths, however, may be significantly limiting.
An additional disadvantage, common to 454,
SOLiD and the Polonator, is that emulsion PCR
can be cumbersome and technically challenging.
On the other hand, it is possible that sequencing
on a high-density array of very small (1 μm)
beads (with sequencing by ligation, polymerase
extension or another biochemistry) may represent
the most straightforward opportunity to achieve
extremely high data densities, simply because
1-μm beads physically exclude one another at a
spacing that is on the order of the diffraction
limit. Furthermore, high-resolution ordering of
1-μm bead arrays may enable the limit of one
pixel per sequencing feature to be closely
approached.

HeliScope
The Helicos sequencer, based on work by Quake's
group, also relies on cyclic interrogation of a dense
array of sequencing features. However, a unique
aspect of this platform is that no clonal amplification
is required. Instead, a highly sensitive fluorescence
detection system is used to directly interrogate
single DNA molecules via sequencing by synthesis.
Template libraries, prepared by random fragmentation and poly-A tailing (i.e. no PCR
amplification), are captured by hybridisation to
surface-tethered poly-T oligomers to yield a
disordered array of primed single-molecule
sequencing templates. At each cycle, DNA polymerase and a single species of fluorescently
labelled nucleotide are added, resulting in template-dependent extension of the surface-immobilised
primer-template duplexes. After acquisition of
images tiling the full array, chemical cleavage
and release of the fluorescent label permits the
subsequent cycle of extension and imaging. As
described in some reports, several hundred cycles
of single-base extension (i.e. A, G, C, T, A, G, C,
T) yield average read-lengths of 25 bp or
greater. Notable aspects of this system include
the following. First, like the 454 platform, the
sequencing is asynchronous, as some strands will
fall ahead or behind others in a sequence-dependent manner. Chance also plays a role, as some
templates may simply fail to incorporate on a
given cycle despite having the appropriate base at
the next position. However, because these are
single molecules, dephasing is not an issue, and
such events do not in and of themselves lead to
errors. Second, no terminating moiety is present
on the labelled nucleotides. As with the 454 system,
therefore, homopolymer runs are an important
issue. However, because single molecules are
being sequenced, the problem can be mitigated
by limiting the rate of incorporation events.
Additionally, it was noted that consecutive incorporations of labelled nucleotide at homopolymers
produced a quenching interaction that enabled
the researchers to infer the discrete number of
incorporations (e.g. A vs. AA vs. AAA). Third,
the raw sequencing accuracy can be substantially
improved by a two-pass strategy in which the
array of single-molecule templates (here with
adaptors at both ends) is sequenced as described
above and then fully copied. As the newly synthesised strand is surface-tethered, the original
template can be removed by denaturing.
Sequencing primed from the distal adaptor then
yields a second sequence for the same template,
obtained in the opposite orientation. Positions
that are concordant between the two reads have
Phred-like quality scores. Finally, largely
secondary to the incorporation of contaminating,
unlabelled or non-emitting bases, the dominant
error type is deletion (2-7% error rate with one
pass, 0.2-1% with two passes). However, substitution error rates are substantially lower (0.01-1%
with one pass). With two passes, the per-base
raw substitution error rate (approaching 0.001%)
may currently be the lowest of all the second-generation platforms.
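
The two-pass idea can be sketched as a comparison of the first read with the reverse complement of the second. This toy version assumes that the two reads have equal length and contain no insertions or deletions. Since deletions are the dominant error type here, a real pipeline would have to align the reads first. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(COMPLEMENT)[::-1]


def two_pass_consensus(first_pass, second_pass):
    """Keep concordant positions; mark discordant positions with 'N'."""
    second_forward = reverse_complement(second_pass)
    return "".join(a if a == b else "N"
                   for a, b in zip(first_pass, second_forward))


if __name__ == "__main__":
    print(two_pass_consensus("ACGTT", "AACGT"))  # the passes agree
    print(two_pass_consensus("ACGTT", "AACGA"))  # the passes disagree once
```

Positions where the two passes agree can be trusted more than either read alone, which is the basis for the lower two-pass error rates quoted above.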
The advantages and disadvantages of the different
approaches in terms of cost, limitations and
practical aspects of implementation, together with the clear differences between conventional sequencing and the
second-generation platforms, determine which
general strategy represents the best option for any
given project. The applications of conventional
sequencing (i.e. Sanger) have grown diverse, and
for small-scale projects in the kilobase-to-megabase range, this will likely remain the technology of choice for the immediate future. This is
a consequence of its greater granularity (i.e. the
ability to efficiently operate at either small or
large production scales) relative to the new technologies. Even so, it is clear that despite limitations relative to Sanger sequencing (e.g. in terms
of read-length and accuracy), large-scale projects
will quickly come to depend entirely on next-generation sequencing. As an example of the
advantages of the new platforms, consider that
large-scale resequencing studies for identifying
germline variation or somatic mutations have
relied on Sanger-based resequencing approaches
that in turn are reliant on one-at-a-time PCR
amplification of each targeted region. In this context, the requirements of a Sanger sequencing
approach include major costs beyond just
reagents. These include robotic support for
reagent handling, processing of multiple samples in 96- or 384-well formats, maintenance of capillary-based sequencers, extensive bioinformatics
infrastructure to handle the flow of data and dedicated support staff to maintain complicated
equipment. For the overall cost of
conventionally sequencing 100 genes from 100
samples, assuming each gene has an average of
10 exons, quoted estimates from non-commercial
genome centres and commercial sequence service
providers ranged from $300,000 to over
$1,000,000 (as of August 2012). Clearly, this
cost is beyond the range of most individual laboratories. In addition to reducing the per-base
cost of sequencing by several orders of magnitude, second-generation instruments have fewer
infrastructure requirements; instead, the principal
challenge is downstream data management.
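
The scale behind the quoted estimate can be worked through with simple arithmetic. The sketch assumes one PCR amplicon per exon per sample, which is a simplification; the function names are illustrative.

```python
def amplicon_count(genes, samples, exons_per_gene):
    """Number of one-at-a-time PCR amplicons, assuming one amplicon per exon."""
    return genes * samples * exons_per_gene


def cost_per_amplicon(total_cost, amplicons):
    """Average cost of amplifying and sequencing one targeted region."""
    return total_cost / amplicons


if __name__ == "__main__":
    amplicons = amplicon_count(genes=100, samples=100, exons_per_gene=10)
    print(amplicons)
    # Lower and upper ends of the quoted range
    print(cost_per_amplicon(300_000, amplicons))
    print(cost_per_amplicon(1_000_000, amplicons))
```

Under this assumption, the project needs 100,000 separate amplifications, and the quoted range works out to roughly $3 to $10 per targeted region.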

Microchip-Based Electrophoretic Sequencing
Significant progress has been made toward developing methods whereby conventional electrophoretic sequencing can be carried out on a
micro-fabricated device. The primary advantages
of this approach include faster processing times
and substantial reductions in reagent consumption. An ideal device for this purpose would integrate all aspects of sample processing, with
microfluidic transport of the reaction volume
between steps, for example, clonal amplification
by nanoliter-scale PCR from a single cell or a
single template molecule; template purification;
cycle sequencing reaction; isolation and concentration of extension fragments; and injection into
a microchannel for electrophoretic separation
(potentially parallelised, e.g. with 384 or more
channels concentrically arranged around a
rotating fluorescence scanner). Many of the key
challenges have already been overcome in proof-of-concept experiments. Although it is unclear at
present whether these efforts will
be able to keep pace with cyclic-array sequencing
and other strategies, it is worth bearing in mind
that the Sanger biochemistry coupled to electrophoretic separation remains by far the best option
for DNA sequencing in terms of read-length and
accuracy; we simply lack methods to parallelise
it to the extent possible with cyclic-array strategies. One could imagine that lab-on-a-chip
nucleic acid analysis could supplant conventional
DNA sequencing for low-scale applications and
may also prove useful in the context of point-of-care diagnostics.


Sequencing by Hybridisation
The basic concept of sequencing by hybridisation
is that the differential hybridisation of labelled
nucleic acid fragments to an array of oligonucleotide probes can be used to precisely identify
variant positions. Usually, the oligos tethered to
the array are designed as a tiling representation of
the reference sequence corresponding to the
genome of interest. In the approach taken
by Affymetrix (Santa Clara, CA, USA) and
Perlegen (Mountain View, CA, USA) (in performing extensive SNP discovery in, e.g. human,
mouse and yeast), each possible single-base substitution is represented on the array by an independent feature. Roche NimbleGen (Madison,
WI, USA), in performing sequencing by hybridisation of microbial genomes, takes a two-tier
approach, with an initial array directed at performing approximate localisation, and a second
custom array directed at pinpointing and
confirmation of variant positions. Although
microarrays are clearly useful and cost effective
for genomic resequencing as well as a range of
other genome-scale applications (see above), it is
unclear what will happen as next-generation
sequencing technologies begin to compete for
many of the same applications (e.g. resequencing, but also expression analysis, structural variation analysis, DNA-protein binding).
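
The tiling design, in which every possible single-base substitution gets its own feature, can be sketched as follows. The probe length, the central placement of the queried base and the function name are assumptions made for illustration, not a vendor's actual design.

```python
def tiling_probes(reference, probe_length=25):
    """Return (position, base, probe) features for a resequencing array.

    For each reference position that can sit at the centre of a probe,
    four probes are generated: one per possible base at that position.
    `probe_length` is assumed to be odd.
    """
    half = probe_length // 2
    features = []
    for centre in range(half, len(reference) - half):
        window = reference[centre - half:centre + half + 1]
        for base in "ACGT":
            probe = window[:half] + base + window[half + 1:]
            features.append((centre, base, probe))
    return features


if __name__ == "__main__":
    for feature in tiling_probes("ACGTACGTA", probe_length=5)[:4]:
        print(feature)
```

The sample is expected to hybridise most strongly to the one probe of each set of four that matches its base at that position. The feature count therefore grows as four times the length of the tiled reference.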
In terms of sequencing, limitations of
microarrays include the following: (1) Sequences
that are repetitive or subject to cross hybridisation cannot easily be interrogated; (2) it remains
unclear how de novo sequencing can be achieved
with hybridisation-based strategies; and (3)
without very careful data analysis, false positives pose an important problem, and it is not
clear how to obtain the equivalent of redundant
coverage that is possible with conventional and
cyclic-array sequencing. Thus far, sequencing
by hybridisation has likely had its greatest impact
in the context of genome-wide association
studies, which rely on array-based interrogation
(i.e. genotyping by hybridisation) of a highly
defined set of discontinuous genomic coordinates. A different (and earlier) take on the idea of
sequencing by hybridisation involves serial or
parallel interrogation with comprehensive sets of
short oligonucleotides (e.g. 4,096 6-mers or
8,192 7-mers) followed by sequence reconstruction. Recently, this basic strategy was used
in the context of an array of rolling circle-amplified sequencing features to perform resequencing of an E. coli genome. This successful
proof of concept is perhaps better classified as a
cyclic-array method, where serial hybridisation
rather than polymerase-driven synthesis was used
for the actual sequencing.
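
Sequence reconstruction from a comprehensive set of short oligonucleotides can be illustrated with a toy sketch. It assumes an error-free hybridisation spectrum, a known starting k-mer and a target with no repeated (k-1)-mers. The function names are hypothetical, and practical reconstruction is considerably harder.

```python
def spectrum(sequence, k):
    """The set of all k-mers that would hybridise to the target sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def reconstruct(kmers, start):
    """Extend from `start` while exactly one unused k-mer overlaps by k-1 bases."""
    k = len(start)
    sequence = start
    remaining = set(kmers) - {start}
    while remaining:
        suffix = sequence[len(sequence) - (k - 1):]
        candidates = [kmer for kmer in remaining if kmer.startswith(suffix)]
        if len(candidates) != 1:
            break  # dead end or ambiguity, e.g. caused by a repeat
        sequence += candidates[0][-1]
        remaining.remove(candidates[0])
    return sequence


if __name__ == "__main__":
    print(4 ** 6)  # number of distinct 6-mers
    kmers = spectrum("ATGCGTAC", 3)
    print(sorted(kmers))
    print(reconstruct(kmers, "ATG"))
```

The early exit on ambiguity shows in miniature why repetitive sequences are difficult for hybridisation-based strategies.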

Sequencing in Real Time


Several academic groups and companies are
working on technologies for ultrafast DNA
sequencing that are substantially different from
the current next-generation platforms. One
approach is nanopore sequencing, in which
nucleic acids are driven through a nanopore
(either a biological membrane protein such as
alpha-hemolysin or a synthetic pore). Fluctuations
in DNA conductance through the pore, or, potentially, the detection of interactions of individual
bases with the pore, are used to infer the nucleotide sequence. Although progress has been made
in achieving early proof-of-concept demonstrations with such methods, major technical challenges remain along the path to a truly practical
nanopore-based sequencing platform. Another
approach involves the real-time monitoring of
DNA polymerase activity. Nucleotide incorporations can potentially be detected through
fluorescence resonance energy transfer (FRET)
interactions between a fluorophore-bearing polymerase and gamma phosphate-labelled nucleotides (Visigen; Houston), or with zero-mode
waveguides (Pacific Biosciences; Menlo Park,
CA, USA), with which illumination can be
restricted to a zeptoliter-scale volume around a
surface-tethered polymerase such that incorporation of nucleotides (with fluorescent labels on
phosphate groups) can be observed with low
background. Pacific Biosciences demonstrated
substantial progress toward a working technology, including the potential for longer reads than
Sanger sequencing, in several presentations and
publications. Although technical hurdles remain
and the bar has been raised by cyclic-array methods, we are also unlikely to run out of nucleotides
to sequence anytime soon.

Targeted Capture of Genomic Subsets


For genomic resequencing (i.e. sequencing for
somatic or germline variation discovery in
individual(s) of a species for which a reference
genome is available), it is frequently the case that
investigators would prefer to use finite resources
to sequence a specific subset of the genome
across more individuals, rather than the whole
genome of fewer individuals. Examples of
genomic subsets that may be highly relevant
include (1) a specific megabase scale region of
the genome to which a disease phenotype has
been mapped, (2) exons of specific candidate
genes belonging to a disease-related pathway and
(3) the full complement of protein-coding DNA
sequences. These subsets generally total to
megabases, raising the question of how they can
be efficiently isolated barring hundreds or thousands of individual PCR reactions. In other
words, analogous to how PCR served as an effective front-end for resequencing of kilobase-sized targets with capillary electrophoresis, there
is a strong need for flexible targeting methods
that are matched to the megabase-scale granularity at which the next-generation sequencing platforms operate. Fortunately, a variety of such
methods have shown convincing proof-of-concept demonstrations in the past several years.
These include methods that, like PCR, rely on a
combination of oligonucleotide hybridisation and
enzymatic activity (e.g. polymerase or ligase) to
confer specificity but, unlike PCR, are more compatible with high degrees of multiplexing. For
example, Ji and colleagues in 2007 described the
multiplex capture of 177 exons by selective circularisation of restriction fragments. Another
approach is capture by hybridisation. It has been
demonstrated that 10,000-fold hybridisation-based enrichment can be achieved for sequences derived
from BAC (bacterial artificial chromosome)-sized genomic regions.

Global advantages of second-generation or
cyclic-array strategies, relative to Sanger sequencing, include the following: (1) In vitro construction of a sequencing library, followed by in vitro
clonal amplification to generate sequencing
features, circumvents several bottlenecks that
restrict the parallelism of conventional sequencing (i.e. transformation of E. coli and colony
picking). (2) Array-based sequencing enables a
much higher degree of parallelism than conventional capillary-based sequencing. As the effective size of sequencing features can be on the
order of 1 µm, hundreds of millions of sequencing reads can potentially be obtained in parallel
by rastered imaging of a reasonably sized surface
area. (3) Because array features are immobilised
to a planar surface, they can be enzymatically
manipulated by a single reagent volume. Although
microliter-scale reagent volumes are used in practice, these are essentially amortised over the full set
of sequencing features on the array, dropping the
effective reagent volume per feature to the scale
of picoliters or femtoliters. Collectively, these
differences translate into dramatically lower costs
for DNA sequence production.
On the other hand, the advantages of second-generation DNA sequencing are currently offset
by several disadvantages. The most prominent of
these include read-length (for all of the new
platforms, read-lengths are currently much
shorter than conventional sequencing) and raw
accuracy (on average, base-calls generated by the
new platforms are at least tenfold less accurate
than base-calls generated by Sanger sequencing).
Although these limitations create important algorithmic challenges for the immediate future, we
should bear in mind that these technologies
will continue to improve with respect to these
parameters, much as conventional sequencing
progressed gradually over three decades to reach
its current level of technical performance.
There are important differences among the
second-generation platforms themselves that may
result in advantages with respect to specific applications. Some applications (e.g. resequencing)
may be more tolerant of short read-lengths than
others (e.g. de novo assembly). For applications
relying on tag counting (e.g. quantification of
protein–DNA interactions), one would actually
prefer a given amount of sequencing to be split
into as many reads as possible (above some
minimum length that allows placement to a reference). The overall accuracy as well as the specific
error distributions of individual technologies
(e.g. the rate of insertion–deletion vs. substitution
errors, the propensity for systematic consensus
errors) may also be highly relevant. Mate-paired
reads, useful in de novo assembly and for
mapping structural variants, for example, are
now available with all of the second-generation
platforms, but the extent to which the distance
distribution with which the read pairs are separated can be controlled or varied may be an
important factor. Finally, of course, the cost of
sequencing varies greatly between the second-generation platforms, and as consumers, we hope
for more competition between vendors than was
the case with conventional sequencing in the past
decade. Comparisons of per-base costs can be
helpful but occasionally misleading, as, for
example, more accurate bases may be worth more
than less accurate bases.
The DNA sequence of the entire genome constitutes the ultimate objective of physical mapping (see chapter 7). It provides the most detailed
description of an organism's genome and can act
as a bridge between the structural and the functional phases of genomics. With the advances in
sequencing strategies, including automation and
the vast input of computational biology, there has
been accelerated accumulation of sequence data
for many plant species (visit the NCBI website for a
list of plant species that have been completely
sequenced). These are significant milestones in
the sequence-based era of genomic research.

Handling and Storage of Sequence Information
To date, many millions of base pairs of DNA from
many species have been sequenced and deposited.
For example, the chromosome sequences of at least 100
bacterial species, several yeasts and almost the
entire human, rice and other crop genomes have been determined. These sequences
contain an enormous amount of information,
so much, in fact, that special computer programs
had to be designed to help interpret just a fraction
of the data. When a DNA sequence is published in
a scientific journal, it is also deposited in a computer database known as GenBank. When a
sequence is placed in GenBank, the known and
predicted features of the sequence are also indicated. These include promoters, open reading
frames and transcription factor binding sites. Just
a listing of As, Cs, Gs and Ts is known as a raw
sequence, and the sequence with all of the features
indicated is known as an annotated sequence.
What can be learned from sequence searches?
First, DNA sequence searches are more stringent
than protein sequence searches. Two DNA sequences
either have an adenine in the same position or
they do not. Protein sequences can have the same
amino acid in the same place and are, thus, identical at that position. Proteins can also have similar amino acids in one position, such as valine in
one protein and alanine in the other. Because both
amino acids are hydrophobic, they can frequently
carry out the same functions. In this case, the
proteins are said to be similar in a given position.
If two proteins have similarity over a large
segment of their sequences, they may have similar
functions. This kind of analysis is especially useful if the function of one of the proteins has been
identified. Knowing the function of one of the
proteins suggests that the other protein should
also be checked for this function.
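The distinction between identity and similarity can be made concrete with a small sketch. The residue grouping below is a deliberately crude assumption made for illustration (real search tools score substitutions with matrices such as BLOSUM62), and the two peptides are hypothetical; the sequences are assumed to be already aligned and of equal length.

```python
# Toy illustration: percent identity vs. percent similarity for two
# pre-aligned, equal-length protein sequences.

# ASSUMPTION: a crude grouping of amino acids by side-chain character,
# invented for this example; it is not a published scoring scheme.
SIMILARITY_GROUPS = [
    set("AVLIMFWC"),   # hydrophobic (includes valine V and alanine A)
    set("STNQYGP"),    # polar or small
    set("DE"),         # acidic
    set("KRH"),        # basic
]


def are_similar(a, b):
    """True if residues a and b fall into the same toy group."""
    return any(a in group and b in group for group in SIMILARITY_GROUPS)


def identity_and_similarity(seq1, seq2):
    """Return (percent identity, percent similarity) for aligned sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    identical = sum(a == b for a, b in zip(seq1, seq2))
    # Identical residues share a group, so they also count as similar.
    similar = sum(are_similar(a, b) for a, b in zip(seq1, seq2))
    n = len(seq1)
    return 100.0 * identical / n, 100.0 * similar / n


if __name__ == "__main__":
    # Hypothetical peptides differing only by a valine -> alanine change.
    pid, psim = identity_and_similarity("MKVLD", "MKALD")
    print(f"identity = {pid:.0f}%, similarity = {psim:.0f}%")
```

In this toy case the valine/alanine position lowers the identity but not the similarity, which is the situation described above.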
More limited regions of sequence similarity or
identity can indicate the presence of a cofactor
binding site. An example of this is the Walker
box, which is an ATP binding site. Sequence
similarities can provide very valuable information about an unknown sequence and dramatically influence the direction of experiments on
the novel gene or protein.
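A short motif such as an ATP-binding site can be located with a simple pattern search. The sketch below assumes the commonly cited Walker A (P-loop) consensus G-x(4)-G-K-[S/T]; the text names only the "Walker box" and gives no pattern, so the consensus and the toy sequence are assumptions made for illustration.

```python
import re

# ASSUMPTION: Walker A (P-loop) consensus G-x(4)-G-K-[ST]; the pattern is
# not taken from the text, which only names the motif.
WALKER_A = re.compile(r"G.{4}GK[ST]")


def find_walker_a(protein):
    """Return (1-based start position, matched residues) for every hit."""
    return [(m.start() + 1, m.group())
            for m in WALKER_A.finditer(protein.upper())]


if __name__ == "__main__":
    # Toy sequence with one embedded motif.
    toy_protein = "MTLKLVVVGAGGVGKSALTIQ"
    print(find_walker_a(toy_protein))
```

A hit of this kind is only a hint that the protein may bind ATP; short patterns also match by chance, so the prediction still needs experimental support.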
Before being sequenced, most genomes contain few genes whose locations have already
been determined, which, coupled with the enormous amount of DNA in a genome and the
complexities of gene structure, makes finding
genes a difficult task. Computer programs have
been developed to look for specific sequences
in DNA that are associated with certain genes.
For example, protein-encoding genes are characterised by an open reading frame, which includes
a start codon and a stop codon in the same reading frame.
Specific sequences mark the splice sites at the
beginning and end of introns; other specific
sequences are present in promoters immediately
upstream of start codons. Still other sequences
are associated with particular functions in certain
classes of proteins. Computer programs have
been developed that scan the DNA for these
sequences and identify genes on the basis of their
presence and position. Some of these programs
are capable of examining databases of EST and
protein sequences to see if there is evidence that
a potential gene is expressed.
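The simplest of these signals, the open reading frame, can be searched for directly. The following is a minimal sketch rather than a real gene finder: it scans only the forward strand, ignores introns, promoters and splice sites, and the function name and the `min_codons` threshold are assumptions introduced for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find open reading frames on the forward strand.

    Returns a list of (start, end, sequence) with 0-based start and
    exclusive end. Each ORF runs from an ATG to the first in-frame stop
    codon, inclusive; only the first ATG before each stop is reported.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Length is counted in codons, stop codon included.
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    # Hypothetical fragment: ATG GCT TAA sits in the third reading frame.
    print(find_orfs("CCATGGCTTAAGG"))
```

In practice a much larger `min_codons` would be used, since short ATG-to-stop stretches occur frequently by chance in non-coding DNA.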
It is important to recognise that the programs
that have been developed to identify genes on the
basis of DNA sequence are not perfect. Therefore,
the numbers of genes reported in most genome
projects are estimates. The presence of multiple
introns, alternative splicing, multiple copies of
some genes and much non-coding DNA between
genes makes accurate identification and counting
of genes difficult.

Predicting Function from Sequence


The nucleotide sequence of a gene can be used to
predict the amino acid sequence of the protein
that it encodes. The protein can then be synthesised or isolated and its properties studied to
determine its function. However, this biochemical approach to understanding gene function is
both time consuming and expensive. A major
goal of functional genomics has been to develop
computational methods that allow gene function
to be identified from DNA sequence alone,
bypassing the laborious process of isolating and
characterising individual proteins.

Homology Searches
One computational method (often the first
employed) for determining gene function is to
conduct a homology search, which relies on
comparing DNA and protein sequences from the
same and different organisms. Genes that are
evolutionarily related are said to be homologous.
Homologous genes found in different species that
evolved from the same gene in a common ancestor are called orthologs. For example, both mouse
and human genomes contain a gene that encodes
the alpha subunit of haemoglobin; the mouse and
human alpha-haemoglobin genes are said to be
orthologs, because both genes evolved from an
alpha-haemoglobin gene in a mammalian ancestor common to mice and humans. Homologous
genes in the same organism (arising by duplication of a single gene in the evolutionary past) are
called paralogs. Within the human genome is a
gene that encodes the alpha subunit of haemoglobin and another homologous gene that encodes
the beta subunit of haemoglobin.
These two genes arose because an ancestral
gene underwent duplication and the resulting
two genes diverged through evolutionary time,
giving rise to the alpha- and beta-subunit genes;
these two genes are paralogs. Homologous genes
(both orthologs and paralogs) often have the
same or related functions; so, after a function has
been assigned to a particular gene, it can provide
a clue to the function of a homologous gene.
Databases containing genes and proteins found
in a wide array of organisms are available for
homology searches. Powerful computer programs
have been developed for scanning these databases
to look for particular sequences. A commonly
used homology search program is BLAST (Basic
Local Alignment Search Tool). Suppose a geneticist sequences a genome and locates a gene that
encodes a protein of unknown function. A homology search conducted on databases containing the
DNA or protein sequences of other organisms
may identify one or more orthologous sequences.
If a function is known for one of these sequences,
that function may provide information about the
function of the newly discovered protein.
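The idea of a local alignment, on which BLAST is built, can be illustrated with the classical Smith-Waterman dynamic programme. This is not how BLAST itself is implemented (BLAST uses a faster heuristic to approximate such alignments); the sketch and its scoring values (match +2, mismatch -1, gap -2) are assumptions chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    Only the score is returned; the alignment itself is not traced back.
    """
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # Two hypothetical sequences sharing the local stretch ACGT.
    print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

A database search repeats a comparison of this kind against every stored sequence and reports the best-scoring hits, together with an estimate of how likely such a score is to arise by chance.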
In a similar way, computer programs can
search a single genome for paralogs. Eukaryotic
organisms often contain families of genes that
have arisen by duplication of a single gene. If a
paralog is found and its function has been previously assigned, this function can provide
information about a possible function of the
unknown gene. However, paralogs often evolve
new functions; so information about their functions must be used cautiously. Of the genes newly
identified through genomic-sequencing projects,
50% are significantly similar to orthologs and
paralogs whose function has already been described.
The 50% of newly identified genes that cannot be
assigned a function on the basis of homology
searches will undoubtedly decrease in number as
functions are assigned to more and more genes and
as more genomes are sequenced.

Other Sequence Comparison Strategies
Complex proteins often contain regions that
have specific shapes or functions called protein
domains. For example, certain DNA-binding
proteins attach to DNA in the same way; these
proteins have in common a domain that provides
the DNA-binding function. Each protein domain
has an arrangement of amino acids common to
that domain. There are probably a limited,
though large, number of protein domains, which
have mixed and matched through evolutionary
time to yield the protein diversity seen in present-day organisms.
Many protein domains have been characterised, and their molecular functions have been
determined. The sequence from a newly identified
gene can be scanned against a database of known
domains. If the gene sequence encodes one or
more domains whose functions have been previously determined, the function of the domain can
provide important information about a possible
function of the new gene.
Another computational method for predicting
protein function is a phylogenetic profile. In this
method, the presence-and-absence pattern of a
particular protein is examined across a set of
organisms whose genomes have been sequenced.
If two proteins are either both present or both
absent in all genomes surveyed, the two proteins
may be functionally related. For example, the
two proteins might function as consecutive steps
in a biochemical pathway. The idea is that the
two proteins depend on each other and will evolve
together. One protein cannot function without the
other, and they will either both be present or both
be absent.
To understand this concept, consider the
following proteins in four bacterial species:
E. coli: protein 1, protein 2, protein 3, protein 4, protein 5, protein 6
Species A: protein 1, protein 2, protein 3, protein 6
Species B: protein 1, protein 3, protein 4, protein 6
Species C: protein 2, protein 4, protein 5
We can create a phylogenetic profile by constructing a table comparing the presence (+) or
absence (−) of the proteins in the four bacterial
species.
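Such a presence/absence table can be built mechanically from the four species lists. The sketch below uses only the data given above; the function names are hypothetical and chosen for this example.

```python
from collections import defaultdict

# Presence lists copied from the four species listed in the text.
GENOMES = {
    "E. coli": {1, 2, 3, 4, 5, 6},
    "Species A": {1, 2, 3, 6},
    "Species B": {1, 3, 4, 6},
    "Species C": {2, 4, 5},
}


def phylogenetic_profiles(genomes):
    """Map each protein to its presence (+) / absence (-) pattern,
    one character per species, in the order the species are listed."""
    proteins = sorted(set().union(*genomes.values()))
    return {
        p: "".join("+" if p in members else "-"
                   for members in genomes.values())
        for p in proteins
    }


def co_occurring_groups(profiles):
    """Group proteins that share an identical profile (two or more)."""
    by_pattern = defaultdict(list)
    for protein, pattern in profiles.items():
        by_pattern[pattern].append(protein)
    return [group for group in by_pattern.values() if len(group) > 1]


if __name__ == "__main__":
    profiles = phylogenetic_profiles(GENOMES)
    print("Protein  " + "  ".join(GENOMES))
    for protein, pattern in profiles.items():
        print(f"{protein:<8} {'  '.join(pattern)}")
    print("Proteins with identical profiles:", co_occurring_groups(profiles))
```

With real data the profiles span hundreds of genomes, and near-identical rather than strictly identical patterns are usually accepted as evidence of a functional link.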
The phylogenetic profile reveals that proteins
1, 3 and 6 are either all present or all absent in
all species, so these proteins might be functionally related. Examining fusion patterns among
proteins is another method for predicting functional relations; this technique is sometimes
called the Rosetta Stone method. Functionally
related, separate proteins in one organism sometimes exist as a single, fused protein in another
organism. Thus, the presence of a fused AB
protein in one species suggests that separate
proteins A and B in another organism may be
functionally related.
Yet another method for determining the function of an unknown gene is gene neighbour
analysis. Genes that encode functionally related
proteins are often closely linked in an organism's genome
(called linked genes; see chapter 4). For
example, if two genes are consistently linked in
the genomes of several bacteria, they might be
functionally related. Functionally related genes
are sometimes also linked in eukaryotes; examples are the Hox genes, which play an important
role in embryonic development. It is important to
recognise that functions suggested by computational methods such as homology searches,
phylogenetic profiling, fusion proteins and neighbour analysis do not define a protein's function;
rather, these computational methods provide
hints about possible functions that can be pursued
through detailed analyses of the biochemistry
and cellular location of the protein. Nevertheless,
these computational methods and others like them
have proved to be invaluable in determining the
functions of genes revealed in genomic studies.

Serial Analysis of Gene Expression (SAGE)
The genomic sequences of a wide variety of
organisms were revealed during the last decade.
The genomes of eukaryotic organisms are long
and massive and contain an enormous number of
genes. By precisely regulating the activities of these
genes, each organism can supply the required amounts
of gene products at the appropriate time.
It is thus believed
that the majority of biological phenomena found
in a variety of organisms can be explained by the
quantity of gene products. Although gene
function is ultimately carried out by the final
product, the protein, a large number of
observations indicate that the amount of protein produced
depends directly on the amount of mRNA
that encodes it. This means that cellular function
under given conditions at a given time can, in general, be understood
by measuring the species and respective numbers
of mRNAs at that point in time. However, each cell
contains more than 10,000 mRNA species, with the copy number of each
species ranging from less than one to more than
10,000 and a total of up to half a million mRNA
transcript copies. It was therefore practically
impossible to determine them all. The only feasible tactic
was to identify genes whose expression was
influenced by a variety of internal or external factors. The methods used for this were classical differential colony
(plaque) hybridisation of cDNA clones, subtractive hybridisation and the differential display method
(see above). Large-scale random cDNA sequencing in EST projects was very useful for the
identification of unknown genes expressed in
given cells or tissues. However, this approach
was not designed to quantify expressed genes,
since the cDNA library to be sequenced was
usually normalised to eliminate recurring transcripts derived from abundant class mRNA
sequences for the purpose of expanding the size
of the gene collection. The body mapping project
was a unique and direct attempt to construct
gene expression profiles of a number of cells and
tissues by random sequencing of a 3′-directed
cDNA library. Fragments of about 300 bp from these
3′ regions were called gene signatures, and each
represented a particular mRNA species. By
sequencing 1,000 or so cDNA clones, they could
make a rough pattern of gene expression and
identify mRNAs of highly abundant class.
However, an unavoidable weakness common
to both the EST and body mapping projects is that they
include an inefficient sequencing step, in which
one sequencing run yields only one cDNA
sequence. Mainly because of this low throughput, the profiles obtained by the body mapping
project inevitably fell far short of
what was expected and required. Although the
more recent methods of hybridisation-based
analyses (DNA microarray) using immobilised
cDNAs or oligonucleotides (see above) can
potentially examine the expression patterns of a
relatively large number of genes, the method can
only examine expressed sequences that have
already been identified.
In contrast, the SAGE method allows for a
quantitative and simultaneous analysis of a large
number of transcripts in any particular cells or
tissues, without prior knowledge of the genes
(Velculescu et al. 1995). Like the body mapping
procedure, this method uses
the 3′ portion of the mRNA as the gene tag, but in a
much shorter form (9–10 bp). These tags can be
serially connected before cloning into a plasmid
vector. Since the resulting plasmid clones contain
multiple tags, sequences of several dozens of
mRNAs can be obtained by a single sequencing
reaction. The rapid and cost-saving sequencing made possible by
this device allows the quantification and
identification of a large number of cellular
transcripts.
SAGE is based mainly on two principles, representation of mRNAs (cDNAs) by short sequence
tags and concatenation of these tags for cloning
to allow efficient sequencing analysis. If one
wants to elucidate the gene expression profile of
a hypothetical cell, one would have to conduct
several cDNA sequencing reactions. However, if
each mRNA species can be represented by a short
unique sequence stretch (such as a 9-bp tag), the
purpose would be attained by sequencing these tags,
because a sequence stretch as short as 9 bp can
distinguish 4^9 (262,144) transcripts, assuming a random
nucleotide distribution throughout the genome.
This ability appears sufficient for the discrimination of all the human transcripts, because the
human genome is estimated to encode between
28,642 and 153,478 genes. However, since the current sequencing procedure handles one clone at a
time, one has to conduct at least seven sequencing reactions for the profiling of this hypothetical
cell. There is no particular merit in replacing
mRNA with a short sequence stretch, and this is
the reason why the body mapping project suffered
a setback despite its conceptual importance.
However, if we could connect these tags into a
long stretch of DNA, a sequencing reaction would be needed only once. Since a currently
used automated DNA sequencer reliably gives
500–600 nucleotides for any given clone, one
would be able to obtain 50–60 9-bp
tag-represented mRNA sequences from a single
reaction and run. This is more than enough for
the elucidation of the gene expression profile of this
hypothetical cell. The SAGE procedure can be
explained briefly as follows: Double-stranded
cDNA is synthesised from mRNA by means of a
biotinylated oligo(dT) primer. The cDNA is then
cleaved with a restriction enzyme (called the anchoring enzyme). Any enzyme with a four-base recognition site
may be used, because such enzymes cleave every 256 bp
(4^4) on average, while the majority of mRNAs are
considered to be much longer. In practice, NlaIII is
the most frequently used enzyme. The 3′-most
portion of the cleaved cDNA, with a common
NlaIII cohesive end at its 5′ terminus, is then
recovered by binding to streptavidin-coated
beads. After dividing the reaction mixture into
two portions, two independent linkers are ligated
via the NlaIII cohesive termini, one to each portion.
These linkers are designed to contain a type IIS
enzyme site (usually FokI or BsmFI, designated
the tagging enzyme) near (or partially overlapping) the 3′ NlaIII sequence. After the reaction mixtures are digested with the type IIS enzyme,
the released portions are recovered. The resulting staggered ends of the products are then blunt-ended by
T4 DNA polymerase. The two portions are mixed
again and ligated. Since the 5′ ends of the linkers
are blocked by an amino group, only the mRNA-derived termini can be ligated, in a tail-to-tail orientation. The products are PCR-amplified,
cleaved by the anchoring enzyme NlaIII and then
separated by polyacrylamide gel electrophoresis
(PAGE). Ditag fragments flanked at both ends by
NlaIII cohesive termini are isolated and ligated
to obtain concatemers. Highly concatenated
products are recovered by PAGE and cloned into
a plasmid vector for sequencing. Thus, SAGE
analysis is designed to provide a readout, via
sequencing, of the spectrum of genes being
expressed in a cell.
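The figures quoted in this paragraph follow from simple counting, as the sketch below shows. The function names are hypothetical, the calculations assume a random sequence with equal base frequencies, and the tags-per-read estimate ignores the anchoring-site spacers between ditags, so it should be read as an upper bound.

```python
def distinct_tags(tag_length):
    """Number of different sequences of the given length over A, C, G, T."""
    return 4 ** tag_length


def mean_cut_spacing(site_length):
    """Expected spacing (bp) between recognition sites of n bases,
    assuming random sequence with equal base frequencies."""
    return 4 ** site_length


def tags_per_read(read_length, tag_length):
    """Upper bound on the number of tags in one sequencing read.

    ASSUMPTION: spacer bases between ditags are ignored.
    """
    return read_length // tag_length


if __name__ == "__main__":
    print("9-bp tags:", distinct_tags(9))                    # 4^9
    print("4-base cutter spacing:", mean_cut_spacing(4))     # 4^4
    print("tags per 500-nt read:", tags_per_read(500, 9))
    print("tags per 600-nt read:", tags_per_read(600, 9))
```

The number of possible 9-bp tags comfortably exceeds the upper gene-count estimate quoted above, and a single read of 500 to 600 nucleotides holds on the order of 50 to 60 tags.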
Thus, in simple terms, the principles that underlie
the SAGE methodology include the following:
(1) a short sequence tag (10–15 bp) contains
sufficient information to uniquely identify a transcript provided that the tag is obtained from a
unique position within each transcript, (2)
sequence tags can be linked together to form long
serial molecules that can be cloned and sequenced
and (3) quantification of the number of times a
particular tag is observed provides the expression
level of the corresponding transcript.
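Principles (1) and (3) can be illustrated in silico. The sketch below takes the tag as the bases immediately 3′ of the 3′-most NlaIII site (recognition sequence CATG) in each transcript and counts how often each tag occurs. The transcript sequences are invented for the example, and the linker, ditag and concatemer steps of the laboratory protocol are not modelled.

```python
from collections import Counter

ANCHOR_SITE = "CATG"  # NlaIII recognition sequence


def sage_tag(cdna, tag_length=10):
    """Return the tag immediately 3' of the 3'-most anchoring site.

    Returns None if the transcript has no site or too little sequence
    downstream of it.
    """
    cdna = cdna.upper()
    pos = cdna.rfind(ANCHOR_SITE)
    if pos == -1:
        return None
    begin = pos + len(ANCHOR_SITE)
    tag = cdna[begin:begin + tag_length]
    return tag if len(tag) == tag_length else None


def tag_counts(transcripts, tag_length=10):
    """Count tags over a list of transcript copies."""
    tags = (sage_tag(t, tag_length) for t in transcripts)
    return Counter(tag for tag in tags if tag is not None)


if __name__ == "__main__":
    # Hypothetical transcripts; gene_b carries two NlaIII sites and only
    # the 3'-most one defines its tag.
    gene_a = "GGCATGAAACCCGGGTTTAAAA"
    gene_b = "TTCATGCCCATGTTTTGGGGCCAAAA"
    population = [gene_a] * 3 + [gene_b]
    for tag, n in tag_counts(population).items():
        print(tag, n)
```

The tag counts stand in for transcript abundance: the hypothetical gene_a, present in three copies, contributes three identical tags.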
An extra stringency criterion that facilitates gene
identification is that the tag must include the 3′-most
anchoring site in a predicted transcript. A
fraction of genes will have multiple tags due to
alternative splicing near the 3′ end, or use of
alternative polyadenylation sites, but for the most
part, these can be identified. The number of times
a specific tag is found in the SAGE sequences
reflects its abundance in the mRNA population.
Therefore, SAGE is described as a method that is
used to obtain comprehensive, unbiased and
quantitative gene expression profiles. Its major
advantage over arrays is that it does not require a
priori knowledge of the genes to be analysed and
reflects absolute mRNA levels. Since the original
SAGE protocol was developed in a short-tag (10-bp) format, several modifications have been made
to produce longer SAGE tags for more precise
gene identification and to decrease the amount of
starting material necessary. Several SAGE-like
methods have also been developed for the
genome-wide analysis of DNA copy number
changes and methylation patterns, chromatin
structure and transcription factor targets.
Unlike array and chip methods, SAGE does not
require cDNA or EST clones to be prepared in advance. The expression
information derives from SAGE tags, which are
produced as part of the analysis. Sequence information is required to assign the tags to individual
ORFs. However, unassigned SAGE tags are also
useful (in species for which the complete genomes
have not been sequenced, unassigned tags will be
encountered frequently). They can be used to pull
out promoters from genomic clones, to provide
information about coordinated gene regulation,
and to identify previously unknown genes.
Quantitative comparison of SAGE samples is not
always easy to interpret. A tag present in four
copies in one sample of 50,000 tags and two copies in another may actually be twofold induced,
or the difference may simply be due to random sampling.
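One simple way to judge such a difference is an exact binomial test on the split of the combined count between the two libraries. This is offered as an illustrative sketch, not as the method used in the SAGE literature cited here; it assumes that the second library also contains 50,000 tags (the text gives the size of only one sample), and the function name is hypothetical.

```python
from math import comb


def two_sided_split_pvalue(count_a, count_b, size_a, size_b):
    """Probability, under equal expression, of a split at least as
    unlikely as the observed one, conditional on the total count.

    Given n = count_a + count_b, count_a follows Binomial(n, p) with
    p = size_a / (size_a + size_b) if the tag is equally abundant in
    both libraries.
    """
    n = count_a + count_b
    p = size_a / (size_a + size_b)
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    observed = pmf[count_a]
    # Sum every outcome that is no more probable than the observed one
    # (small tolerance guards against floating-point ties).
    return sum(x for x in pmf if x <= observed * (1 + 1e-9))


if __name__ == "__main__":
    # ASSUMPTION: both libraries contain 50,000 tags.
    p_value = two_sided_split_pvalue(4, 2, 50_000, 50_000)
    print(f"p = {p_value:.4f}")
```

For counts this small the probability is large, so a four-versus-two split on its own gives no evidence of induction; deeper sampling or replicate libraries would be needed.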

cDNA-AFLP
For many years the isolation of genes for which
products and mutants were not known was only
possible by differential screening of cDNA libraries. The first in vitro technique for the determination of transcript patterns was differential display
reverse transcription PCR (DDRT-PCR). For the
first time it was possible to determine simultaneously a large part of the transcripts present in a
eukaryotic cell within a single experiment with
high sensitivity. The technique was applied
widely, and for several years no other method
was available by which comprehensive transcript
patterns of eukaryotic cells could be obtained.
Later, Fischer and his group combined DDRT-PCR and amplified fragment length polymorphism (AFLP), a method developed by Vos et al.
in 1995 for the characterisation of genomic DNA.
The new technique, termed restriction fragment
length polymorphism-coupled domain-directed
differential display (RC4D), provided a useful
tool to detect differentially expressed members of
individual gene families. The cDNA-AFLP technique is based on the selective PCR amplification
of adapter-ligated restriction fragments derived
from cDNA. The principle of this technique is
described briefly hereunder.
cDNA is synthesised from total RNA or
poly(A) RNA and is digested with TaqI and AseI,
which recognise 4 and 6 bp, respectively. A complete digest of plant cDNA with these enzymes
produces five different types of molecules: Ase/
Ase fragments, Ase/Taq fragments, Taq/Taq fragments and two terminal fragments with only one
cohesive end. TaqI, which cuts DNA frequently,
generates small cDNA fragments (around 256 bp
on average), which amplify well and lie in the
optimal size range for separation on sequencing
gels. AseI, which cuts only rarely due to its longer
recognition sequence, reduces the number of
fragments to a manageable level. Following digestion, double-stranded adapters are ligated to the
restriction fragments to generate templates for
amplification. PCR amplification is carried out in
two steps. In the first step, around 15 cycles of
non-specific amplification are carried out using
primers without extensions. The products of this
reaction are then subjected to a second round of
PCR amplification using primers bearing at their
3′ end two additional nucleotides which extend
into the sequence of the restriction fragments,
allowing only a subpopulation to be amplified.
All the 256 possible primer combinations are
necessary to amplify the whole cDNA population.
The amplicons are separated on a polyacrylamide
gel and visualised by autoradiography. Most of the
bands represent Ase/Taq fragments because Ase/
Ase fragments are rare and Taq/Taq fragments are
not visible on the gel. RNA probes from different
sources (A, B) will produce different cDNA-AFLP banding patterns, which allow differentially
expressed cDNAs to be identified. However, there
are variations on the above protocol, and three
of them are described hereunder.
1. cDNA-AFLP with Two Restriction Enzymes
cDNA-AFLP is an RNA fingerprinting technique that evolved from AFLP (amplified
fragment length polymorphism), a method
described by Vos and his co-workers in
1995 for the fingerprinting of genomic DNA
(see chapter 3). The classical cDNA-AFLP
procedure uses the standard AFLP protocol on
a cDNA template. The technique involves
three steps: (1) restriction of cDNA and ligation of oligonucleotide adapters, (2) selective
amplification of sets of restriction fragments
using PCR primers bearing selective nucleotides at the 3′ end and (3) gel analysis of the
amplified fragments. Restriction of plant
cDNA with a combination of two restriction
enzymes, a tetra cutter and a hexa cutter,
allows a significant fraction of the cDNA population to be cleaved and to be represented as
a discrete banding pattern on a sequencing
gel. In genomic AFLP with plant DNA, three
selective bases on the end of each primer are
required to give a useful banding pattern. The
lower complexity of cDNA allows the use of
two selective bases for each primer giving a
total of 256 possible primer combinations.
The largest cDNA-AFLP products visible on a
polyacrylamide sequencing gel are around
1,000 bp in size, the lower end of the gel representing approx. 100 bp. In this size window,
an average of 40 bands can be observed for
each primer combination, corresponding to a
total of approx. 10,000 bands.
2. cDNA-AFLP with One Restriction Enzyme
A systematic comparison of known potato
cDNA sequences showed that approx. 45%
are cleaved by the AseI/TaqI restriction
enzyme combination. Thus, as long as only
one pair of enzymes is applied, about half of
the transcripts present in a cell will not be
detected by the standard cDNA-AFLP technique. To obtain more comprehensive patterns, the cDNA-AFLP protocol has been modified,
and it has been shown that the rarely cutting enzyme can
be omitted and that meaningful banding patterns
can be produced using TaqI alone. Samples
derived from buds of red and white flowers of
the common morning glory (Ipomoea purpurea) were compared using 96 different primer
combinations, each of which gave approximately 50 bands, corresponding to a total of
approximately 5,000 bands.
3. iAFLP
iAFLP (introduced AFLP) is a quantitative
high-throughput expression profiling method
specifically designed to measure the concentrations of known transcripts in numerous
different samples. cDNA from each sample is
restricted with MboI and ligated to one of up
to six adapters having short insertions of various lengths into a common sequence (polymorphic adapters). Following ligation, the
differentially adapted cDNAs are pooled and
3′ end fragments are selectively amplified with
a gene-specific primer and a fluorescently
labelled adapter primer. The amplicon is then
separated on an automatic sequencer. Due to
length heterogeneity introduced by the polymorphic adapters, iAFLP fragments from different samples will produce distinct peaks on
the electropherogram. Transcript abundance
is determined by evaluating peak areas relative to an internal standard.
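
The arithmetic behind the three variants above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the peak areas are hypothetical, and real iAFLP quantification would also involve peak calling and normalisation steps not described here. With two selective bases on each primer there are 4^2 x 4^2 = 256 primer pairs, and 256 x 40 = 10,240 bands, which the text rounds to approx. 10,000. Likewise, 96 x 50 = 4,800, or approximately 5,000 bands.

```python
def primer_combinations(selective_bases_per_primer: int) -> int:
    """Number of distinct primer pairs when each of the two primers
    carries the given number of selective bases (4 choices per base)."""
    per_primer = 4 ** selective_bases_per_primer
    return per_primer * per_primer


def expected_bands(n_combinations: int, bands_per_combination: int) -> int:
    """Total bands surveyed across all primer combinations."""
    return n_combinations * bands_per_combination


def relative_abundance(peak_areas: dict, standard: str) -> dict:
    """Divide every peak area by the area of the internal-standard peak."""
    reference = peak_areas[standard]
    if reference <= 0:
        raise ValueError("internal standard peak area must be positive")
    return {name: area / reference
            for name, area in peak_areas.items() if name != standard}


if __name__ == "__main__":
    print(primer_combinations(2))                      # cDNA-AFLP, two enzymes
    print(expected_bands(primer_combinations(2), 40))  # approx. 10,000 bands
    print(expected_bands(96, 50))                      # TaqI alone, approx. 5,000
    # Hypothetical peak areas (arbitrary units) for three pooled samples.
    areas = {"standard": 2000.0, "sample_A": 1000.0,
             "sample_B": 4000.0, "sample_C": 500.0}
    print(relative_abundance(areas, "standard"))
```

Setting the number of selective bases to three gives 4,096 primer pairs, which shows why genomic AFLP is more selective than cDNA-AFLP.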

Applications
cDNA-AFLP and its application to plants was
first described by Bachem et al. in 1996, who
analysed differential gene expression in a synchronised potato in vitro tuberisation system.
During screening with different primer combinations, two lipoxygenase cDNA fragments were
isolated on the basis of their differential expression during potato tuber formation. Both transcripts are highly tuber specific and are expressed
strongly in 15-d-old tubers, but not in stolons,
leaves or petioles and only at very low levels in
stems. The dramatic induction of a lipoxygenase
gene just after the start of tuberisation led the
authors to speculate that the expression of at least
one of these enzymes might directly be linked to
the tuber development process. Following this
initial report, a small number of papers have
described the use of cDNA-AFLP fingerprinting
in plant and animal systems. Habu et al. in 1997
compared mRNA samples obtained from the
flower buds of two lines of Ipomoea purpurea.
Fourteen cDNA fragments (approximately 0.3%)
amplified differently in the two samples. Two of
these were shown to have been derived from a
gene that was actively expressed in the buds of
red flowers but not in those of white flowers.
Sequence analysis showed that this cDNA carries
a sequence highly homologous to the chalcone
synthase gene, which encodes a key enzyme in the flavonoid
biosynthetic pathway. cDNA-AFLP was also
applied to identify differentially expressed genes
in cold-tolerant and cold-sensitive alfalfa genotypes and rice.

RFLP-Coupled Domain-Directed
Differential Display (RC4D)
Many genes and their protein products have a
modular structure where the presence of certain
domains (family-specific domains, FSDs) defines
membership in different gene families. This is
well characterised for the chlorophyll a/b binding
proteins and for many transcription factors.
Restriction fragment length polymorphism-coupled domain-directed differential display (RC4D,
which was first described by Fischer and his team
in 1995) is a method specifically designed to
analyse expression of multi-gene families at different developmental stages, in diverse tissues or
in different organisms. RC4D combines cDNA-AFLP technology with a gene family-specific
version of DDRT-PCR. In RC4D, instead of arbitrary decameric primers, longer primers directed
against an FSD are used, allowing cDNAs belonging to the same gene family to be selectively
amplified. As the amplification products are relatively uniform in length, restriction fragment
length polymorphism (RFLP) is introduced by
digestion with a frequently cutting restriction
enzyme. This reduces the amplicon size from
approximately 1 kbp to several hundred base
pairs, which is optimal for separation on acrylamide gels. Family members can thus easily be
distinguished by size. Briefly, the RC4D protocol
is as follows. cDNA is synthesised from
mRNA with an oligo(dT) primer bearing a PCR
downstream primer binding sequence at its 5′
end. PCR is performed with the downstream
primer and an upstream primer specific for a
family-specific domain (FSD). This results in a
mixture of truncated family member cDNAs. The
amplicon is digested with a frequently cutting
restriction enzyme, and double-stranded linkers
are ligated to the cohesive ends. PCR with a
linker primer and an FSD primer results in a
population of family member cDNA fragments
of different lengths. To get rid of the unligated
fragments, a further round of PCR is performed
using the FSD primer and a primer directed
against the linker. Amplification products are
then used as a template to extend a radiolabelled
FSD primer, and extension products are separated
on acrylamide gels. Different samples will produce
different RC4D banding patterns, which allow
identification of differentially expressed cDNAs.
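
The size polymorphism that RC4D relies on can be illustrated with a small in silico sketch. It assumes that, after digestion and linker ligation, the amplified fragment runs from the FSD primer to the first restriction site downstream of it. The recognition site GATC, the primer sequence and the cDNA sequences are hypothetical placeholders, and the linker length and exact cut position are ignored.

```python
def rc4d_fragment_length(cdna, fsd_primer, site="GATC"):
    """Distance from the start of the FSD primer to the first restriction
    site downstream of it, or None if primer or site is absent.

    Simplification: linker length and the exact cut position within the
    recognition site are not modelled.
    """
    start = cdna.find(fsd_primer)
    if start == -1:
        return None  # not a member of the targeted gene family
    cut = cdna.find(site, start + len(fsd_primer))
    if cut == -1:
        return None  # no site downstream, so no linker-primed product
    return cut - start


if __name__ == "__main__":
    primer = "CCGGTT"  # hypothetical FSD primer
    family = {
        "member_1": "AAAT" + primer + "ACGTACGTAC" + "GATC" + "TTTT",
        "member_2": primer + "AAAA" + "GATC",
        "unrelated": "ACGTACGTGATC",
    }
    for name, seq in family.items():
        print(name, rc4d_fragment_length(seq, primer))
```

Two family members sharing the same FSD but differing in the position of the first downstream site yield fragments of different lengths, which is what allows them to be resolved on the gel.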
RC4D was first used to analyse differential
expression of MADS box genes in male and female
inflorescences of maize. The name MADS was
constructed from the initials of the first four members of the gene family, which were MCM1 (yeast),
AGAMOUS (plants), DEFICIENS (plants) and
SRF (human). A small collection of MADS box
primers was designed, directed against sequences
encoding derivatives of a highly conserved amino
acid motif which covered all its variations known
from plants. RC4D yielded many fragments
significantly different in size. Most of them were
equally present in both sexes. Four already known
and two new MADS box genes were identified,
being either specifically expressed in the female
sex or preferentially expressed in male or female
inflorescences, respectively. The two new MADS
box genes belong to a subfamily showing sequence
similarity to floral homoeotic and transcription
factor genes. Another example of using RC4D was
identification of several cDNAs coding for calcium-dependent protein kinases involved in calcium signalling during cold induction of the kin
genes of Arabidopsis thaliana.

Gene Tagging by Insertional Mutagenesis
Identification of genes by insertional mutagenesis
is quite advantageous due to the ease of isolating
the tagged gene in comparison with functional
analysis based on mutations derived from chemical or physical treatments. The process of insertional mutagenesis involves the insertion of a
known segment of DNA into a gene of interest.
This inserted sequence often creates a knockout
mutation by blocking or disrupting the expression
of the gene and might result in a mutant phenotype
that can be screened. In addition, the insertion
sequence also tags the affected gene, which can
be isolated by using hybridisation probes based
on the sequence of the gene tag. Once the mutated
gene is known, the initial wild-type gene can also
be identified. Such a method has a major advantage of not requiring any prior knowledge of the
gene product or its expression. Also, this approach
provides a direct route to determine the function
of a gene product in situ unlike other methods
which are correlative and do not necessarily
prove a relationship between a gene sequence
and its function. Two types of insertion sequences
are commonly used for mutagenesis in
plants: transposable elements and Agrobacterium
tumefaciens-mediated T-DNA (transfer DNA)
insertions.

T-DNA Tag
The process of gene tagging using T-DNA as the
insert has been used effectively to isolate genes,
especially in Arabidopsis. T-DNA insertional
mutagenesis has also been used to produce 22,090
primary transgenic rice plants having approximately 25,700 tags. Another efficient T-DNA
tagging system for japonica rice has also been
described in which over 1,000 T-DNA tags in rice
genome have been characterised. It clearly
revealed that preferential insertion has occurred
in gene-rich regions.

Transposon Tags
Transposons, first recognised by Barbara
McClintock in maize, have become a powerful
tool for gene isolation. The mutagenic potential
of mobile elements, their ability to tag the
mutated sequences and their widespread
distribution have all been exploited for
cloning genes. The application of transposon
tagging was initially restricted to plants, such as
maize (Zea mays) and snapdragon (Antirrhinum),
with active and well-characterised endogenous
transposons. However, maize transposon systems have now been used for mutagenesis in heterologous transgenic plant species which otherwise
lack an active endogenous transposon family.
For example, the Ac element was introduced
into rice, and plants in which it had transposed were identified by screening for hygromycin resistance, since the
autonomous Ac element had been cloned
between the promoter and the hph-coding
region, so that excision of Ac restores hph expression. A strategy using the maize Ac-Ds system
has also been used effectively for gene tagging in
rice. Retrotransposons, transposable
elements that transpose via an RNA intermediate
and are structurally similar to integrated copies of
retroviruses, have also been shown to be efficient
gene tags as demonstrated by the introduction
of tobacco retrotransposon Tto1 into rice and
its autonomous transposition through reverse
transcription.
Classical genetic approaches to identify
genes, as mentioned earlier, are generally based
on the creation of mutations leading to a recognisable phenotype reflecting the gene function,
such as in gene tagging. However, this is not
always possible, since many genes show functional redundancy, and thus mutation in one gene
or locus could be compensated for by the functioning of one or more other family members.
Moreover, certain genes function at different
stages of development. Mutations in such genes
could cause early lethality or could be highly
pleiotropic. This can thus prevent the identification
of the role of the gene. Trapping techniques have
been developed keeping these limitations in
mind. Entrapment strategies rely on the use of
inserts, such as transposons or T-DNA, containing reporter gene constructs, whose expression is
dependent on cis-acting regulatory sequences at
the site of insertion. The inserts then allow for
the identification of genes, based on their expression pattern, even though they might not display
an obvious mutant phenotype. Three basic types
of gene traps are constructed using reporter
genes such as those encoding β-glucuronidase
(GUS) and green fluorescent protein (GFP):
enhancer trap, promoter trap and gene trap.
Another approach used to assess gene function
is activation tagging. This technique is based on
the use of an insertion element carrying a strong
enhancer. Thus, on integration into the genome,
it causes activation of an adjacent gene or
enhances its expression, resulting in gain-of-function mutants.

Post-transcriptional Gene Silencing
Epigenetic regulation of gene expression is a heritable change in gene expression that cannot be
explained by changes in gene sequence. It can
result in the repression or activation of gene
expression and is therefore referred to as gene
silencing or gene activation, respectively. Until
the end of the 1980s, only modifications of DNA
or protein that lead to transcriptional repression
or activation, or to the formation of prions, were
classified as epigenetic. During the 1990s, however, a number of gene-silencing phenomena that
occur at the post-transcriptional level were discovered in plants, fungi, animals and ciliates,
introducing the concept of post-transcriptional
gene silencing (PTGS) or RNA silencing. PTGS
results in the specific degradation of a population
of homologous RNAs. It was first observed after
introduction of an extra copy of an endogenous
gene (or of the corresponding cDNA under the
control of an exogenous promoter) into plants.
Because RNAs encoded by both transgenes and
homologous endogenous gene(s) were degraded,
the phenomenon was originally called co-suppression. A similar phenomenon in the fungus
Neurospora crassa was named quelling. Later,
several groups showed that PTGS can also affect
transgenes that are not homologous to endogenous genes, suggesting that this phenomenon is
not a simple regulatory mechanism that controls
the expression of endogenous genes. Fire et al. in
1998 identified a related mechanism, RNA interference (RNAi), in animals. RNAi results in the
specific degradation of endogenous RNA in the
presence of homologous dsRNA either locally
injected or transcribed from an inverted repeat
transgene. Injected dsRNA, as well as transgenes
expressing dsRNA, also triggers silencing of
homologous (trans)genes in plants. This strongly
suggests that a mechanistic link between PTGS,
quelling and RNAi exists. Thus, understanding
such gene regulation mechanisms is also highly relevant
to characterising QTLs at the molecular
level.

MicroRNAs
MicroRNAs are a class of post-transcriptional
regulators. They are short ~22 nucleotide RNA
sequences that bind to complementary sequences
in the 3′ untranslated region (UTR) of multiple
target mRNAs, usually resulting in their silencing. MicroRNAs target ~60% of all genes, are
abundantly present in cells and are able to repress
hundreds of targets each. These features, coupled
with their conservation in organisms ranging
from the unicellular alga Chlamydomonas reinhardtii to mitochondria, suggest they are a vital
part of genetic regulation with ancient origins.
MicroRNAs were first discovered in 1993 by
Victor Ambros, Rosalind Lee and Rhonda
Feinbaum during a study into development in the
nematode Caenorhabditis elegans regarding the
gene lin-14. This screen led to the discovery that
lin-14 is regulated by a short
RNA product from lin-4, a gene that is transcribed into
a 61 nucleotide precursor that matures to a 22
nucleotide RNA which contains
sequences partially complementary to multiple
sequences in the 3′ UTR of the lin-14 mRNA.
This complementarity was sufficient and necessary to inhibit the translation of lin-14 mRNA.
Retrospectively, this was the first microRNA to
be identified, though at the time Ambros et al.
speculated it to be a nematode idiosyncrasy. Since
then, several thousand miRNAs and their targets
have been discovered in all eukaryotes including
mammals, fungi and plants.
In plants, the successful targeting reaction
requires complementarity of the miRNA at most
of the residues. The consequence of the targeting
reaction depends on the nature of the targeted
RNA and the extent of complementarity with the
miRNA. The target RNA is cleaved, and the level
of the protein product is reduced if there is near
complete complementarity, including positions 9
and 10 of the miRNA. Translational suppression
without turnover of the target RNA is mediated by
miRNAs with incomplete complementarity to
their target. In addition, there may be miRNA-mediated targeting of chromatin-associated RNAs
that leads directly or indirectly to targeted epigenetic modification. In some instances, miRNA-mediated gene silencing is a simple negative
switch: Whenever the miRNA gene is active, the
target mRNA is silent. However, these versatile
RNA regulators may also participate in feedback
loops and carry out more subtle roles in genetic
regulation. They might dampen fluctuations in
target gene expression, for example, or influence
temporal changes. In some instances, the miRNAs
or their precursors may move through plasmodesmata, and different stages in the feedback system
occur in adjacent cells or in separate roots and
shoots. miRNAs may also initiate regulatory cascades with multiple mRNA targets. These cascades involve secondary small interfering RNAs
(siRNAs) that associate with argonaute (AGO)
proteins, similarly to miRNAs. The first step in
these cascades requires an RNA-dependent RNA
polymerase (RDR, RDR6 in Arabidopsis thaliana), and it takes place when the initiator miRNA
duplex structure is asymmetrical, if the initiator
miRNA is 22 nucleotides rather than 21 nucleotides long, or if there are two target sites for
21-nucleotide RNAs. The initiator miRNA stimulates the RDR to convert the targeted RNA into
long, double-stranded RNA that is then processed
by Dicer into secondary siRNAs. A high proportion of the secondary siRNAs are in a 21-nucleotide phased register in which the first position is
the cleavage target of the initiator miRNA.
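
The dependence of the outcome on complementarity, described above, can be sketched as a toy classifier. This is a deliberately simplified illustration, not a target prediction tool: the miRNA sequence is a hypothetical 21-nucleotide example, only Watson-Crick pairs are counted (G:U wobble is ignored), the site is assumed ungapped, and the threshold of three mismatches for "near complete" complementarity is an arbitrary assumption.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def paired_positions(mirna, target_site):
    """One boolean per miRNA position (5'->3'), True where it pairs.

    Both sequences are written 5'->3', so the target site is read in
    reverse: miRNA position 1 faces the 3' end of the site.
    """
    if len(mirna) != len(target_site):
        raise ValueError("sketch assumes an ungapped site of equal length")
    return [COMPLEMENT[m] == t for m, t in zip(mirna, reversed(target_site))]


def predicted_outcome(mirna, target_site, max_mismatches=3):
    """'cleavage' if pairing is near complete and includes positions 9
    and 10 of the miRNA, otherwise 'translational repression'."""
    pairs = paired_positions(mirna, target_site)
    centre_paired = pairs[8] and pairs[9]  # positions 9 and 10, 1-based
    if pairs.count(False) <= max_mismatches and centre_paired:
        return "cleavage"
    return "translational repression"


if __name__ == "__main__":
    mirna = "UGACAGAAGAGAGUGAGCACA"      # hypothetical 21-nt miRNA
    perfect_site = reverse_complement(mirna)
    print(predicted_outcome(mirna, perfect_site))

    # Disrupt the base facing miRNA position 10.
    site = list(perfect_site)
    index = len(mirna) - 10
    site[index] = "G" if site[index] != "G" else "A"
    print(predicted_outcome(mirna, "".join(site)))
```

A single mismatch at position 10 is enough to move the toy prediction from cleavage to translational repression, mirroring the role the text assigns to positions 9 and 10.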
Comparing miRNAs between species can
even be used to delineate molecular evolutionary history on the basis that the complexity of an
organism's phenotype may reflect that of the
microRNA complement encoded in its genome. Unfortunately,
validating microRNA targets is
substantially more time consuming than
predicting sequences and targets. Due to their
abundant presence and far-reaching potential,
miRNAs have all sorts of functions in physiology, from cell differentiation, proliferation and
apoptosis to the endocrine system, haematopoiesis, fat metabolism and limb morphogenesis.
They display different expression profiles from
tissue to tissue, reflecting the diversity in cellular phenotypes and as such suggest a role in tissue differentiation and maintenance. Hence,
integration of such information in QTL mapping studies can open up new avenues in
MAS.

Biochemical Techniques
Biochemistry involves the study of chemical processes that occur in the living organisms with the
ultimate aim of understanding the nature of life in
molecular terms. There are several biochemical
techniques that have their role in unravelling the
molecular basis of life. One- and two-dimensional electrophoresis are the most widely used
techniques in protein identification and characterisation. Mass spectrometry is mainly used to
identify and characterise proteins (proteomics) and small metabolites (metabolomics). There
are large numbers of biochemical techniques that
have potential application in MAS, and only a
few major techniques are discussed hereunder.

Plant Proteomics
Proteins are the workhorses of the cell and have
important functions in both normal and abnormal
states. In order to understand how proteins interact and regulate various cellular processes, it is
important to understand their expression behaviour under a wide range of experimental conditions. Unlike the genome which contains a fixed
number of genes, the levels of protein within the
cells are highly dynamic. Proteins are constantly
processed within the cell in response to external
stimuli and undergo a wide range of posttranslational modifications. As a result, it is hard to
accurately determine the exact number or quantities of proteins which are present within the biological systems. In addition, protein families are
extremely diverse and have considerable differences in their physical sizes, chemical and structural properties, affinity constants and relative
abundance within the cells. As a result, accurately
characterising such interactions is extremely
challenging.
The term proteomics was first coined in 1995
and was defined as the large-scale characterisation of the entire protein complement of a cell
line, tissue or organism. Today, two definitions of
proteomics are encountered. The first is the more
classical definition, restricting the large-scale
analysis of gene products to studies involving
only proteins. The second and more inclusive
definition combines protein studies with analyses
that have a genetic readout such as mRNA analysis, genomics and the yeast two-hybrid analysis.
However, the goal of proteomics remains the
same, that is, to obtain a more global and integrated view of biology by studying all the proteins of a cell rather than each one individually.
Using the more inclusive definition of proteomics, many different areas of study are now grouped
under the heading proteomics. These include
protein-protein interaction studies, protein
modifications, protein function and protein localisation studies to name a few. The aim of proteomics is not only to identify all the proteins in a
cell but also to create a complete three-dimensional (3-D) map of the cell indicating where proteins are located. These ambitious goals will
certainly require the involvement of a large number of different disciplines such as molecular
biology, biochemistry and bioinformatics. It is
likely that in bioinformatics alone, more powerful computers will have to be devised to organise
the immense amount of information generated
from these endeavours.
In the quest to characterise the proteome of a
given cell or organism, it should be remembered
that the proteome (the complete set of proteins at
the given time) is dynamic. The proteome of a cell
will reflect the immediate environment in which it
is studied. In response to internal or external cues,
proteins can be modified by posttranslational
modifications, undergo translocations within the
cell or be synthesised or degraded. Thus, examination of the proteome of a cell is like taking a
snapshot of the protein environment at any given
time. Considering all the possibilities, it is likely
that any given genome can potentially give rise to
an infinite number of proteomes.

The first protein studies that can be called
proteomics began in 1975 with the introduction
of the two-dimensional gel by O'Farrell, Klose
and Scheele, who began mapping proteins from
Escherichia coli, mouse and guinea pig, respectively. Although many proteins could be separated and visualised, they could not be identified.
Despite these limitations, shortly thereafter, a
large-scale analysis of all human proteins was
proposed. The goal of this project, termed the
human protein index, was to use two-dimensional protein electrophoresis (2-DE) and other
methods to catalogue all human proteins.
However, lack of funding and technical limitations prevented this project from progressing. Although
the development of 2-DE was a major step forward, the science of proteomics would have to
wait until the proteins displayed by 2-DE could
be identified. One problem that had to be overcome was the lack of sensitive protein sequencing technology. Improving sensitivity was
critical for success because biological samples
are often limiting and both one-dimensional
(1-D) and two-dimensional (2-D) gels have
limits in protein loading capacity. The first
major technology to emerge for the identification
of proteins was the sequencing of proteins by
Edman degradation. A major breakthrough was
the development of microsequencing techniques for electroblotted proteins. This technique was used for the identification of proteins
from 2-D gels to create the first 2-D databases.
Improvements in microsequencing technology
resulted in increased sensitivity of Edman
sequencing in the 1990s to high-picomole
amounts.
One of the most important developments in
protein identification has been the development
of mass spectrometry (MS). In the last decade,
the sensitivity of analysis and accuracy of results
for protein identification by MS have increased
by several orders of magnitude. It is now estimated that proteins in the femtomolar range can
be identified in gels. Because MS is more sensitive, can tolerate protein mixtures and is amenable to high-throughput operations, it has
essentially replaced Edman sequencing as the
protein identification tool of choice.


Why Proteomics?
Many types of information cannot be obtained
from the study of QTLs or genes alone. For
example, proteins (and, in turn, metabolites), not genes,
are responsible for the phenotypes of cells. It is
impossible to elucidate mechanisms of growth
and development, disease, aging and effects of
the environment solely by studying the genome.
Only through the study of proteins can protein
modifications be characterised and the targets of
drugs identified.
1. Annotation of the Genome
One of the first applications of proteomics will
be to identify the total number of genes in a
given genome. This functional annotation of a
genome is necessary because it is still difficult
to predict genes accurately from genomic data.
One problem is that the exon-intron structure
of most genes cannot be accurately predicted
by bioinformatics. To achieve this goal, genomic
information will have to be integrated with
data obtained from protein studies to confirm
the existence of a particular gene.
2. Protein Expression Studies
In recent years, the analysis of mRNA expression by various methods has become increasingly popular. These methods include SAGE
and DNA microarray technology (see above).
However, the analysis of mRNA is not a
direct reflection of the protein content in the
cell. Indeed, many studies have now
shown a poor correlation between mRNA
and protein expression levels. The formation
of mRNA is only the first step in a long
sequence of events resulting in the synthesis
of a protein. First, mRNA is subject to posttranscriptional control in the form of alternative splicing, polyadenylation and mRNA
editing. Many different protein isoforms can
be generated from a single gene at this step.
Second, mRNA then can be subject to regulation at the level of protein translation.
Proteins, having been formed, are subject to
posttranslational modification. It is estimated
that up to 200 different types of posttranslational protein modification exist. Proteins
can also be regulated by proteolysis and
compartmentalisation. The average number
of protein forms per gene was predicted to be
one or two in bacteria, three in yeast and
three or more in humans. Therefore, it is clear
that the theory of one gene, one protein is
an oversimplification. In addition, some
bodily fluids such as serum or urine have no
mRNA source and therefore cannot be studied
by mRNA analysis.
3. Protein Function
According to one study, no function can be
assigned to about one-third of the sequences
in organisms for which the genomes have been
sequenced. The complete identification of all
proteins in a genome will aid the field of structural genomics in which the ultimate goal is to
obtain 3-D structures for all proteins in a proteome. This is necessary because the functions
of many proteins can only be inferred by
examination of their 3-D structure.
4. Protein Modifications
One of the most important applications of
proteomics will be the characterisation of
posttranslational protein modifications.
Proteins are known to be modified posttranslationally in response to a variety of intracellular and extracellular signals. For example,
protein phosphorylation is an important signalling mechanism, and dysregulation of protein kinases or phosphatases can result in
undesirable effects such as oncogenesis. By
using a proteomics approach, changes in the
modifications of many proteins expressed by
a cell can be analysed simultaneously.
5. Protein Localisation and Compartmentalisation
One of the most important regulatory mechanisms known is protein localisation. The mislocalisation of proteins is known to have
profound effects on cellular function (e.g. cystic fibrosis). Proteomics aims to identify the
subcellular location of each protein. This
information can be used to create a 3-D protein map of the cell, providing novel information about protein regulation.
6. Protein-Protein Interactions
Of fundamental importance in biology is the
understanding of protein-protein interactions.
The process of cell growth, programmed cell
death and the decision to proceed through the
cell cycle are all regulated by signal transduction through protein complexes. Proteomics
aims to develop a complete 3-D map of all protein interactions in the cell. One step toward
this goal was completed for the microorganism
Helicobacter pylori. Using the yeast two-hybrid method to detect protein interactions,
1,200 connections were identified between H.
pylori proteins covering 46.6% of the genome.
A comprehensive two-hybrid analysis has also
been performed on all the proteins obtained
from the yeast S. cerevisiae.

Types of Proteomics
Protein Expression Proteomics
The quantitative study of protein expression
between samples that differ by some variable is
known as expression proteomics. In this approach,
protein expression of the entire proteome or of
subproteomes between samples can be compared.
Information from this approach can identify
novel proteins in signal transduction or identify
disease-specific proteins.

Structural Proteomics
Proteomics studies whose goal is to map out the
structure of protein complexes or the proteins
present in a specific cellular organelle are known
as cell map or structural proteomics. Structural
proteomics attempts to identify all the proteins
within a protein complex or organelle, determine where they are located and characterise all
protein-protein interactions. An example of
structural proteomics is the analysis of the
nuclear pore complex. Isolation of specific subcellular organelles or protein complexes by
purification can greatly simplify the proteomic
analysis. This information will help piece together
the overall architecture of cells and explain how
expression of certain proteins gives a cell its
unique characteristics.

Functional Proteomics
Functional proteomics is a broad term for
many specific, directed proteomics approaches.
In some cases, specific subproteomes are isolated by affinity chromatography for further
analysis. This could include the isolation of protein complexes or the use of protein ligands to
isolate specific types of proteins. This approach
allows a selected group of proteins to be studied
and characterised and can provide important
information about protein signalling, disease
mechanisms or protein-drug interactions.

Protein Analysis
By the very definition of proteomics, it is expected
that complex protein mixtures will be encountered. Therefore, methods must exist to resolve
these protein mixtures into their individual components so that the proteins can be visualised,
identified and characterised. The predominant
technology for protein separation and isolation is
polyacrylamide gel electrophoresis. Unlike the
breakthroughs in molecular biology that eventually enabled the sequencing of the human genome,
some aspects of protein science have shown little
progress over the years. Protein separation technology is one of them. Since its inception several
decades ago, protein electrophoresis still remains
the most effective way to resolve a complex mixture of proteins. In many applications, it is at this
stage where the bottleneck occurs. This is because
1- or 2-DE is a slow, tedious procedure that is not
easily automated. However, until something
replaces this methodology, it will remain an
essential component of proteomics.

One- and Two-Dimensional Gel Electrophoresis
For many proteomics applications, 1-DE is the
method of choice to resolve protein mixtures. In
1-DE, proteins are separated on the basis of
molecular mass. Because proteins are solubilised
in sodium dodecyl sulphate (SDS), protein solubility
is rarely a problem. Moreover, 1-DE is
simple to perform, is reproducible and can be
used to resolve proteins with molecular masses of
10 to 300 kDa. The most common application of
1-DE is the characterisation of proteins after some
form of protein purification. This is because of the
limited resolving power of a 1-D gel. If a more
complex protein mixture such as a crude cell
lysate is encountered, then 2-DE can be used. In
2-DE, proteins are separated by two distinct properties. They are resolved according to their net
charge in the first dimension and according to
their molecular mass in the second dimension.
The combination of these two techniques produces resolution far exceeding that obtained in
1-DE. One of the greatest strengths of 2-DE is the
ability to resolve proteins that have undergone
some form of posttranslational modification. This
resolution is possible in 2-DE because many types
of protein modifications confer a difference in
charge as well as a change in mass on the protein.
One such example is protein phosphorylation.
Frequently, the phosphorylated form of a protein
can be resolved from the nonphosphorylated form
by 2-DE. In this case, a single phosphoprotein
will appear as multiple spots on a 2-D gel. In addition, 2-DE can detect different forms of proteins
that arise from alternative mRNA splicing or proteolytic processing.
The primary application of 2-DE continues to
be protein expression profiling. In this approach,
the protein expression of any two samples can be
qualitatively and quantitatively compared. The
appearance or disappearance of spots can provide
information about differential protein expression,
while the intensity of those spots provides quantitative information about protein expression levels.
Such information can be treated as quantitative
traits and mapped on the linkage map (which is
referred to as protein QTL (pQTL) mapping).
Protein expression profiling can be used for samples from whole organisms, cell lines, tissues or
bodily fluids. Examples of this technique include
the comparison of normal and diseased tissues or
of cells treated with various chemicals (pesticide/
herbicide) or stimuli (water or salinity or nutrient
stress). Another application of 2-DE is in cell map
proteomics. 2-DE is used to map proteins from
microorganisms, cellular organelles and protein
complexes. It can also be used to resolve and
characterise proteins in subproteomes that have
been created by some form of purification of a
proteome. Because a single 2-DE gel can resolve
thousands of proteins, it remains a powerful tool
for the cataloguing of proteins. Many 2-DE databases have been constructed and are available on
the World Wide Web.
A number of improvements have been made
in 2-DE over the years. One of the biggest
improvements was the introduction of immobilised pH gradients, which greatly improved the
reproducibility of 2-DE. The use of fluorescent
dyes has improved the sensitivity of protein
detection, and specialised pH gradients are able
to resolve more proteins. The speed of running
2-DE has been improved, and 2-D gels can now
be run in the mini-gel format. In addition, there
have been efforts to automate 2-DE. Hochstrasser's
group has automated the process of 2-DE from
gel running to image analysis and spot picking.
The use of computers has aided the analysis of
complex 2-D gel images. This is a critical aspect
of 2-DE because a high degree of accuracy is
required in spot detection and annotation if artefacts are to be avoided. A molecular scanner is
available to record 2-DE images. Software programs, such as Melanie, compare computer
images of 2-D gels and facilitate both the
identification and quantitation of protein spots
between samples. An exciting advance in 2-DE
was developed by Minden and co-workers. This
technology is called difference gel electrophoresis (DIGE) and utilises fluorescent tagging of
two protein samples with two different dyes.
The tagged proteins are run on the same 2-D gel,
and post-run fluorescence imaging of the gel is
used to create two images, which are superimposed to identify pattern differences. The dyes
are amine reactive and are designed to ensure that
proteins common to both samples have the same
relative mobility regardless of the dye used to tag
them. This technique circumvents the need to
compare several 2-D gels. In their original paper,
DIGE was used to detect differences between
exogenous proteins in two Drosophila melanogaster embryo extracts at nanogram levels.
Moreover, an inducible protein from Escherichia
coli was detected after 15 min of induction. This
technology is now commercially available from
Amersham Pharmacia.
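The comparison of two dye channels in DIGE can be illustrated with a minimal sketch. It assumes that spot intensities have already been extracted from the two fluorescence images and matched by spot identifier; the function name, the spot identifiers, the intensity values and the twofold threshold are all hypothetical and are not taken from the original DIGE work.

```python
import math

def dige_log_ratios(channel_a, channel_b, min_abs_log2=1.0):
    """Return spots whose log2(B/A) intensity ratio meets the threshold.

    channel_a and channel_b map spot identifiers to intensities measured
    for the two dyes on the same gel (hypothetical input format).
    """
    changed = {}
    for spot_id in sorted(channel_a.keys() & channel_b.keys()):
        a, b = channel_a[spot_id], channel_b[spot_id]
        if a <= 0 or b <= 0:
            continue  # skip spots without a usable signal in both channels
        log_ratio = math.log2(b / a)
        if abs(log_ratio) >= min_abs_log2:
            changed[spot_id] = log_ratio
    return changed

# Hypothetical intensities for three spots labelled with two dyes.
control = {"spot1": 1200.0, "spot2": 100.0, "spot3": 800.0}
treated = {"spot1": 1250.0, "spot2": 400.0, "spot3": 200.0}
differences = dige_log_ratios(control, treated)
print(differences)
```

Because both samples migrate in the same gel, the ratio is taken spot by spot without any gel-to-gel alignment, which is the practical benefit described above.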
However, a number of problems with 2-DE
still remain. Despite efforts to automate protein
analysis by 2-DE, it is still a labour-intensive and
time-consuming process. A typical 2-DE experiment can take 2 days, and only a single sample
can be analysed per gel. In addition, 2-DE is limited by both the number and type of proteins that
can be resolved. For example, the protein mixture
obtained from a eukaryotic cell lysate is too complex to be completely resolved on a single 2-D
gel. Many large or hydrophobic proteins will not
enter the gel during the first dimension, and proteins of extreme acidity or basicity (proteins with
pIs below pH 3 and above pH 10, respectively)
are not well represented. Some of these problems
can be overcome with different solubilisation
conditions and pH gradients. Another limitation
of 2-DE is the inability to detect low-copy proteins when a total-cell lysate is analysed. In a
crude cell extract, the most abundant proteins can
dominate the gel, making the detection of low-copy proteins difficult. It was determined in the
analysis of yeast proteins by 2-DE that no proteins defined as low-copy proteins were visible
by 2-DE. Yet it is estimated that over half of the
6,000 genes in yeast may encode low-copy proteins. In mammalian cells, the dynamic range of
protein expression is estimated to be between 7
and 9 orders of magnitude. This problem cannot
be overcome by simply loading more protein on
the gel, because the resolution will decrease and
the co-migration of proteins will increase.
Because of these limitations, the largest application of 2-DE in the future will probably involve
the analysis of protein complexes or subproteomes as opposed to whole proteomes.

Alternatives to Electrophoresis in Proteomics
The limitations of 2-DE have inspired a number
of approaches to bypass protein gel electrophoresis. One approach is to convert an entire protein
mixture to peptides (usually by digestion with
trypsin) and then purify the peptides before subjecting them to analysis by mass spectrometry
(MS). Various methods for peptide purification
have been devised, including liquid chromatography, capillary electrophoresis and a combination
of techniques such as multidimensional protein
identification or cation-exchange chromatography and reverse-phase (RP) chromatography. The
advantage of these methods is that because a 2-D
gel is avoided, a greater number of proteins in the
mixture can be represented. The disadvantage is
that it can require an immense amount of time
and computing power to interpret the data
obtained. In addition, considerable time and
effort may be expended in the analysis of uninteresting proteins. One of the most exciting techniques to emerge as an alternative to protein
electrophoresis is that of isotope-coded affinity
tags (ICAT). This method allows the quantitative
protein profiling between different samples without the use of electrophoresis.

Acquisition of Protein Structure Information
Edman Sequencing
One of the earliest methods used for protein
identification was microsequencing by Edman
chemistry to obtain N-terminal amino acid
sequences. Little has changed in Edman chemistry since its introduction, but improvements in
sequencing technology have increased the sensitivity and ease of Edman sequencing. Although
the use of Edman sequencing is decreasing in the
field of proteomics, it is still a very useful tool for
several reasons. First, because Edman sequencing
existed before MS as a sequencing tool, a considerable number of investigators continue to use
Edman sequencing. Second, Edman sequencing
of relatively abundant proteins is a viable alternative to MS if a mass spectrometer is in high
demand for the identification of low-copy proteins
or is not available. Finally, Edman sequencing is
used to obtain the N-terminal sequence of a protein (if possible) to determine its true start.
The N-terminal sequencing of proteins was
introduced by Edman in 1949. Today, Edman
sequencing is most often used to identify proteins
after they are transferred to membranes. The
development of membranes compatible with
sequencing chemicals allowed Edman sequencing to become a more applicable sequencing
method for the identification of proteins separated by SDS-polyacrylamide gel electrophoresis. One of the biggest problems that has limited
the success of Edman sequencing in the past is
N-terminal modification of proteins. Since it is
difficult to tell if a protein is N-terminally blocked
before it is sequenced, precious samples were
often lost in failed sequencing attempts. To overcome this problem, a novel approach called
mixed-peptide sequencing has been developed. In
mixed-peptide sequencing, a protein is converted
into peptides by cleavage with cyanogen bromide
(CNBr) or skatole, and the peptides are sequenced
in an Edman sequencer simultaneously. Briefly,
the process of mixed-peptide sequencing involves
separation of a complex protein mixture by polyacrylamide gel electrophoresis (1-D or 2-D) and
then transfer of the proteins to an inert membrane
by electroblotting. The proteins of interest are
visualised on the membrane surface, excised and
fragmented chemically at methionine (by CNBr)
or tryptophan (by skatole) into several large peptide fragments. On average, three to five peptide
fragments are generated, consistent with the frequency of occurrence of methionine and tryptophan in most proteins. The membrane piece is
placed directly into an automated Edman
sequencer without further manipulation. Between
6 and 12 automated Edman cycles are carried out
(4 to 8 h), and the mixed sequence data are fed into
the FASTF or TFASTF algorithms, which sort
and match the data against protein (FASTF) and
DNA (TFASTF) databases to unambiguously
identify the protein. The FASTF and TFASTF
programs were written in collaboration with
William Pearson (Department of Biochemistry,
University of Virginia) and are available at several databases including NCBI. Because minimal
sample handling is involved, mixed-peptide
sequencing can be a sensitive approach for identifying proteins in polyacrylamide gels at the 0.1-
to 1-pmol level. The mixed sequence approach
has the advantage of enabling subsequent searches
to be carried out against unannotated or non-species-specific DNA databases as well as annotated protein databases. This is because the
T/FASTF algorithms utilise actual amino acid
sequence and are therefore able to tolerate
errors in the database as well as polymorphisms
or conservative substitutions. A variation of
T/FASTF has been devised for MS. The T/FASTF/S
programs are available at http://fasta.bioch.virginia.edu/.
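The idea behind matching mixed Edman data to a database can be sketched in a few lines. The example below is a deliberately simplified illustration and is not the FASTF or TFASTF algorithm: it assumes cleavage after every methionine by CNBr, an unblocked N-terminus, and a plain count of cycles in which all predicted residues were observed. The protein names and sequences are hypothetical.

```python
def cnbr_fragments(sequence):
    """Split a protein after each methionine, as CNBr cleavage would."""
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        if residue == "M":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        fragments.append(sequence[start:])
    return fragments

def predicted_cycles(sequence, n_cycles):
    """Residues expected in each Edman cycle when all fragments are
    sequenced simultaneously from their N-termini."""
    fragments = cnbr_fragments(sequence)
    return [{f[i] for f in fragments if i < len(f)} for i in range(n_cycles)]

def match_score(observed_cycles, sequence):
    """Count cycles whose predicted residues were all observed."""
    predicted = predicted_cycles(sequence, len(observed_cycles))
    return sum(1 for obs, pred in zip(observed_cycles, predicted)
               if pred and pred <= obs)

# Hypothetical database and mixed-sequence data from four Edman cycles.
database = {
    "protA": "GSAMKLVEMTRPQW",
    "protB": "AKMDDFMGHIK",
}
observed = [{"G", "K", "T"}, {"S", "L", "R"}, {"A", "V", "P"}, {"M", "E", "Q"}]
scores = {name: match_score(observed, seq) for name, seq in database.items()}
best_hit = max(scores, key=scores.get)
print(scores, best_hit)
```

A real search program must additionally tolerate missing residues, sequencing errors and substitutions, and assign a statistical score, which this sketch does not attempt.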

Mass Spectrometry
MS enables protein structural information, such as
peptide masses or amino acid sequences, to be
obtained. This information can be used to identify
the protein by searching nucleotide and protein
databases. It also can be used to determine the type
and location of protein modifications. The harvesting of protein information by MS can be divided
into three stages: (1) sample preparation, (2) sample ionisation and (3) mass analysis.

Sample Preparation
In most of proteomics, a protein is resolved from
a mixture by using a 1- or 2-D polyacrylamide
gel. The challenge is to extract the protein or its
constituent peptides from the gel, purify the sample and analyse it by MS. The extraction of whole
proteins from gels is inefficient; however, if a
protein is in-gel digested with a protease, many
of the peptides can be extracted from the gel. A
method for in-gel protein digestion was developed and is now commonly applied to both 1- and 2-D gels. In-gel digestion is more efficient at
sample recovery than other common methods
such as electroblotting. In addition, the conversion of a protein into its constituent peptides provides more information than can be obtained
from the whole protein itself. For many applications, the peptides recovered following in-gel
digestion need to be purified to remove gel contaminants. Common impurities from electrophoresis such as salts, buffers and detergents can
interfere with MS. In addition, peptide samples
often require concentration before being analysed
by MS. One method of peptide purification commonly employed for this purpose is reverse-phase
chromatography, which is available in a variety of
formats. Peptides can be purified with ZipTips
(Millipore) or Poros R2 perfusion material
(PerSeptive Biosystems, Framingham, Mass.) or
by high-pressure liquid chromatography (HPLC).

Sample Ionisation
For biological samples to be analysed by MS, the
molecules must be charged and dry. This is
accomplished by converting them to desolvated
ions. The two most common methods for this are
electrospray ionisation (ESI) and matrix-assisted
laser desorption/ionisation (MALDI). In both
methods, peptides are converted to ions by the
addition or loss of one or more protons. ESI and
MALDI are soft ionisation methods that allow
the formation of ions without significant loss of
sample integrity. This is important because it
enables accurate mass information to be obtained
about proteins and peptides in their native states.
(a) Electrospray Ionisation: In ESI, a liquid sample flows from a microcapillary tube into the
orifice of the mass spectrometer, where a
potential difference between the capillary and
the inlet to the mass spectrometer results in the
generation of a fine mist of charged droplets.
As the solvent evaporates, the sizes of the
droplets decrease, resulting in the formation of
desolvated ions. A significant improvement in
ESI technology occurred with the development
of nanospray ionisation. In nanospray ionisation, the microcapillary tube has a spraying
orifice of 1 to 2 μm and flow rates as low as
5 to 10 nl/min. The low flow rates possible with
nanospray ionisation reduce the amount of
sample consumed and increase the time available for analysis. For ESI, there are several
ways to deliver the sample to the mass spectrometer. The simplest method is to load
individual microcapillary tubes with sample.
Because a new microcapillary tube is used for
each sample, cross-contamination is avoided.
In ESI, peptides require some form of
purification after in-gel digestion, and this can be
accomplished directly in the microcapillary
tubes. The drawback to both the purification
and manual loading of microcapillary tubes is
that it is tedious and slow. As an alternative,
electrospray sources have been connected in
line with liquid chromatography (LC) systems
that automatically purify and deliver the sample to the mass spectrometer. Examples of this
method are LC, reverse-phase LC (RP-LC)
and reverse-phase microcapillary LC (RP-μLC).
(b) Matrix-Assisted Laser Desorption/Ionisation
(MALDI): In MALDI, the sample is incorporated into matrix molecules and then subjected
to irradiation by a laser. The laser promotes
the formation of molecular ions. The matrix is
typically a small energy-absorbing molecule
such as 2,5-dihydroxybenzoic acid or α-cyano-4-hydroxycinnamic acid. The analyte is
spotted, along with the matrix, on a metal
plate and allowed to evaporate, resulting in the
formation of crystals. The plate, which can be
in a 96-well format, is then placed in the mass
spectrometer, and the laser is automatically
targeted to specific places on the plate. Since
sample application can be performed by a
robot, the entire process including data
collection and analysis can be automated.
This is the single biggest advantage of MALDI.
Another advantage of MALDI over ESI is that
samples can often be used directly without
any purification after in-gel digestion.
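Since both ESI and MALDI generate ions by the addition or loss of protons, the m/z value expected for a peptide of known neutral mass follows from simple arithmetic. The sketch below is an illustration only; the function name and the example mass are hypothetical, and the proton mass is the standard value of about 1.007276 Da.

```python
PROTON_MASS = 1.007276  # Da, mass of a proton

def mass_to_charge(neutral_mass, charge):
    """m/z of an ion formed by gaining (charge > 0) or losing (charge < 0)
    the corresponding number of protons."""
    if charge == 0:
        raise ValueError("a neutral molecule is not detected by MS")
    return (neutral_mass + charge * PROTON_MASS) / abs(charge)

# A hypothetical peptide of neutral monoisotopic mass 1000.0 Da.
singly_charged = mass_to_charge(1000.0, 1)
doubly_charged = mass_to_charge(1000.0, 2)
deprotonated = mass_to_charge(1000.0, -1)
print(singly_charged, doubly_charged, deprotonated)
```

The same peptide therefore appears at different m/z values depending on its charge state, which matters for ESI in particular because it often produces multiply charged ions.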

Mass Analysis
Mass analysis follows the conversion of proteins
or peptides to molecular ions. This is accomplished by the mass analysers in a mass spectrometer, which resolve the molecular ions on the
basis of their mass and charge in a vacuum.
(a) Quadrupole Mass Analysers: One of the most
common mass analysers is the quadrupole
mass analyser. Here, ions are transmitted
through an electric field created by an array
of four parallel metal rods, the quadrupole.
A quadrupole can act to transmit all ions or as
a mass filter to allow the transmission of ions
of a certain mass-to-charge (m/z) ratio. If
multiple quadrupoles are combined, they can
be used to obtain information about the
amino acid sequence of a peptide. For a more
detailed review of the operating principles
of a quadrupole mass analyser, the reader is
directed to several excellent reviews.
(b) Time of Flight: A time-of-flight (TOF) instrument is one of the simplest mass analysers. It
measures the m/z ratio of an ion by determining the time required for it to traverse the
length of a flight tube. Some TOF mass analysers include an ion mirror at the end of the
flight tube, which reflects ions back through
the flight tube to a detector. In this way, the
ion mirror serves to increase the length of the
flight tube. The ion mirror also corrects for
small energy differences among ions. Both of
these factors contribute to an increase in mass
resolution.
(c) Ion Trap: Ion trap mass analysers function to
trap molecular ions in a 3-D electric field. In
contrast to a quadrupole mass analyser, in
which ions are discarded before the analysis
begins, the main advantage of an ion trap
mass analyser is the ability to allow ions to be
stored and then selectively ejected from the
ion trap, increasing sensitivity.
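The relationship between flight time and m/z in a TOF analyser can be made concrete with an idealised calculation. The sketch assumes a simple linear instrument in which every ion receives the same kinetic energy (charge times accelerating voltage) and then drifts through a field-free tube; the tube length and voltage are hypothetical defaults, and effects such as the ion mirror and initial energy spread are ignored.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms per dalton

def flight_time(mass_da, charge, tube_length_m=1.0, accel_voltage=20000.0):
    """Drift time (s) of an ion through an idealised linear flight tube."""
    mass_kg = mass_da * DALTON
    kinetic_energy = charge * ELEMENTARY_CHARGE * accel_voltage
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return tube_length_m / velocity

def mz_from_time(time_s, tube_length_m=1.0, accel_voltage=20000.0):
    """Invert the relation above to recover m/z (Da per charge)."""
    mz_kg = 2.0 * ELEMENTARY_CHARGE * accel_voltage * time_s ** 2 / tube_length_m ** 2
    return mz_kg / DALTON

t_light = flight_time(1000.0, 1)
t_heavy = flight_time(4000.0, 1)
print(t_light, t_heavy, mz_from_time(t_light))
```

Flight time grows with the square root of m/z, so an ion four times heavier at the same charge takes twice as long to reach the detector.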

Types of Mass Spectrometers


Most mass spectrometers consist of four basic
elements: (1) an ionisation source, (2) one or more
mass analysers, (3) an ion mirror and (4) a detector. The names of the various instruments are
derived from the name of their ionisation source
and the mass analyser. Some of the most common
mass spectrometers are discussed hereunder. The
analysis of proteins or peptides by MS can be
divided into two general categories: (1) peptide
mass analysis and (2) amino acid sequencing.
In peptide mass analysis or peptide mass
fingerprinting, the masses of individual peptides
in a mixture are measured and used to create a
mass spectrum. In amino acid sequencing, a procedure known as tandem mass spectrometry, or
MS/MS, is used to fragment a specific peptide
into smaller peptides, which can then be used to
deduce the amino acid sequence.
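Peptide mass fingerprinting can be illustrated with a small in silico digest. The sketch below assumes the commonly used rule that trypsin cleaves after lysine or arginine unless the next residue is proline, uses standard monoisotopic residue masses, and scores each database entry by the number of observed masses it explains. The protein names, sequences, tolerance and "observed" masses are hypothetical, and real search engines use more elaborate scoring.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565

def tryptic_peptides(sequence):
    """Cleave after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(MONO[aa] for aa in peptide) + WATER

def fingerprint_score(observed_masses, sequence, tolerance=0.01):
    """Number of observed masses matched by a predicted tryptic peptide."""
    predicted = [peptide_mass(p) for p in tryptic_peptides(sequence)]
    return sum(any(abs(m - obs) <= tolerance for m in predicted)
               for obs in observed_masses)

# Hypothetical database; the observed masses are simulated from protA
# with a small measurement error added.
database = {"protA": "AGSKLVEMRTPQWK", "protB": "MDDFKGHIRAAY"}
observed = [peptide_mass(p) + 0.002 for p in ["AGSK", "LVEMR", "TPQWK"]]
scores = {name: fingerprint_score(observed, seq) for name, seq in database.items()}
best_hit = max(scores, key=scores.get)
print(scores, best_hit)
```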
(a) Triple Quadrupole: Triple-quadrupole mass
spectrometers are most commonly used to
obtain amino acid sequences. In the first stage
of analysis, the machine is operated in MS
scan mode, and all ions above a certain m/z
ratio are transmitted to the third quadrupole
for mass analysis. In the second stage, the
mass spectrometer is operated in MS/MS
mode, and a particular peptide ion is selectively passed into the collision chamber. Inside
the collision chamber, peptide ions are fragmented by interactions with an inert gas by a
process known as collision-induced dissociation or collisionally activated dissociation.
The peptide ion fragments are then resolved
on the basis of their m/z ratio by the third
quadrupole. Since two different mass spectra
are obtained in this analysis, it is referred to as
tandem mass spectrometry (MS/MS). MS/MS
is used to obtain the amino acid sequence of
peptides by generating a series of peptides that
differ in mass by a single amino acid.
(b) Quadrupole-TOF: Several hybrid mass
spectrometers have emerged from the combination of different ionisation sources with
mass analysers. One example is the quadrupole-TOF mass spectrometer. In this machine,
the first quadrupole (Q) and the quadrupole
collision cell (q) of a triple-quadrupole
machine have been combined with a time-of-flight analyser (TOF). The main applications
of a QqTOF mass spectrometer are protein
identification by amino acid sequencing and
characterisation of protein modifications.
However, because it is coupled to electrospray, it is not typically utilised for large-scale proteomics.
(c) MALDI-TOF: The principal application of a
MALDI-TOF mass spectrometer is peptide
mass fingerprinting because it can be completely
automated, making it the method of choice for
large-scale proteomics work. Because of its
speed, MALDI-TOF is frequently used as a
first-pass instrument for protein identification.
If proteins cannot be identified by fingerprinting,
they can then be analysed by electrospray and
MS/MS. A MALDI-TOF machine can also be
used to obtain the amino acid sequence of peptides by a method known as post-source decay.
However, peptide sequencing by post-source
decay is not as reliable as sequencing with
competing electrospray methods because the
peptide fragmentation patterns are much less
predictable.
(d) MALDI-QqTOF: The MALDI-QqTOF mass
spectrometer was developed to permit both
peptide mass fingerprinting and amino acid
sequencing. It was formed by the combination of a MALDI ion source with a QqTOF
mass analyser. Thus, if a sample is not
identified by peptide mass fingerprinting in
the first step, the amino acid sequence can
then be obtained without having to use a different mass spectrometer. However, the amino
acid sequence information obtained using this
instrument was more difficult to interpret than
that obtained from a nanospray-QqTOF mass
spectrometer.
(e) FT-ICR: A Fourier transform ion cyclotron
resonance (FT-ICR) mass spectrometer is an
ion-trapping instrument that can achieve
higher mass resolution and mass accuracy
than any other type of mass spectrometer.
Recently, FT-ICR has been employed in the
analysis of biomolecules ionised by both ESI
and MALDI. The unique abilities of FT-ICR
provide certain advantages compared to other
mass spectrometers. For example, because of
its high resolution, FT-ICR can be used for
the analysis of complex mixtures. FT-ICR,
coupled to ESI, is also being employed in the
study of protein interactions and protein conformations. A high-throughput, large-scale
proteomics approach involving FT-ICR has
recently been developed.

Peptide Fragmentation
As peptide ions are introduced into the collision
chamber, they interact with the collision gas (usually nitrogen or argon) and undergo fragmentation primarily along the peptide backbone. Since
peptides can undergo multiple types of fragmentation, nomenclature has been created to indicate
what type of ions has been generated. If, after
peptide bond cleavage, the charge is maintained
on the N-terminus of the ion, it is designated a
b-ion, whereas if the charge is maintained on the
C-terminus, it is a y-ion. The difference in mass
between adjacent y- or b-ions corresponds to that
of an amino acid. This can be used to identify the
amino acid and hence the peptide sequence, with
the exception of isoleucine and leucine, which
are identical in mass and therefore indistinguishable. In addition to fragmentation along the peptide backbone, cleavage can occur along amino
acid side chains, and this information can be used
to distinguish isoleucine and leucine.
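The way a sequence is read from a fragment ladder can be shown with a short calculation. The sketch below assumes singly charged b- and y-ions, standard monoisotopic residue masses and a complete, noise-free ion series; the peptide SAMPLER and the mass tolerance are hypothetical choices for illustration.

```python
# Monoisotopic residue masses in Da.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276

def b_ions(peptide):
    """m/z of singly charged b-ions (N-terminal fragments b1..b(n-1))."""
    return [sum(MONO[aa] for aa in peptide[:i]) + PROTON
            for i in range(1, len(peptide))]

def y_ions(peptide):
    """m/z of singly charged y-ions (C-terminal fragments y1..y(n-1))."""
    return [sum(MONO[aa] for aa in peptide[-i:]) + WATER + PROTON
            for i in range(1, len(peptide))]

def residues_from_ladder(mz_values, tolerance=0.01):
    """Assign a residue to each gap between adjacent ions of one series."""
    ordered = sorted(mz_values)
    calls = []
    for lower, upper in zip(ordered, ordered[1:]):
        delta = upper - lower
        matches = sorted(aa for aa, mass in MONO.items()
                         if abs(mass - delta) <= tolerance)
        calls.append("/".join(matches) if matches else "?")
    return calls

b_calls = residues_from_ladder(b_ions("SAMPLER"))
y_calls = residues_from_ladder(y_ions("SAMPLER"))
print(b_calls)
print(y_calls)
```

The b-series is read from the N-terminus and the y-series from the C-terminus, so the two lists run in opposite directions. The leucine position is reported as "I/L" in both, reflecting the identical masses noted above.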

De Novo Peptide Sequence Information
Another approach to protein identification is to
obtain de novo sequence data from peptides by
MS/MS and then use all the peptide sequences to
search appropriate databases. Multiple peptide
sequences can be used for protein identification
by searching databases with the FASTS program.
The single biggest advantage of this method is
the capability of searching peptide sequence
information across both DNA and protein databases. This is because the search engine utilised
exhibits a certain amount of flexibility in the
assignment of protein scores. This search method
is useful for organisms that do not have well-annotated databases. However, because this
method requires several peptide amino acid
sequences of three or four amino acids, it is not
the first choice for peptide identification. Rather,
the much faster methods of peptide mass
fingerprinting or peptide mass tag searching can
be used first. If these search methods fail, de novo
sequence information can be obtained and used
to identify the protein.

Uninterpreted MS/MS Data Searching


A large number of programs are now available
for the identification of proteins by using uninterpreted MS/MS data. Examples include programs such as Mascot, SONAR and SEQUEST.
However, searches against unannotated or
untranslated DNA databases with uninterpreted

232

MS/MS data are likely to suffer from the same


pitfalls associated with mass fingerprinting. In
particular, polymorphisms, sequencing errors
and conservative substitutions will probably
contribute to failure to accurately identify a protein. The development of uninterpreted MS/MS
search algorithms that are error tolerant may
overcome some of these shortcomings, provided
that they assign some form of statistical scoring
to the identified proteins.

Proteomics Approach to Protein Phosphorylation
Posttranslational modification of proteins is a
fundamental regulatory mechanism, and characterisation of protein modifications is paramount
for understanding protein function. MS is one of
the most powerful tools for the analysis of protein modifications because virtually any type of
protein modification can be identified. Although
we focus here on protein phosphorylation, the
analysis of other types of protein modification by
MS can also be done.
Protein phosphorylation is one of the most
common of all protein modifications and has been
found in nearly all cellular processes. MS can be
used to identify novel phosphoproteins, measure
changes in the phosphorylation state of proteins
in response to an effector and determine phosphorylation sites in proteins. Identification of
phosphorylation sites can provide information
about the mechanism of enzyme regulation and
the protein kinases and phosphatases involved.
A proteomics approach to protein phosphorylation has the advantage that instead of studying
changes in the phosphorylation of a single
protein in response to some perturbation, one
can study all the phosphoproteins in a cell (the
phosphoproteome) at the same time. A common
approach to studying protein phosphorylation
events is the use of in vivo labelling of phosphoproteins with inorganic 32P. The phosphoproteomes of cells that differ in some way (e.g.
normal vs. water stressed) can be analysed by
growing cells in inorganic 32P and creating cell
lysates. Changes in the phosphorylation state of
proteins can then be examined by 2-DE and
autoradiography. Proteins of interest are excised
from the gel and microsequenced by MS. A major
limitation of this approach is that while many
phosphorylated proteins can be visualised by
autoradiography, they cannot be identified because
of their low abundance. One solution to this
problem is enrichment of the phosphoproteome.

Phosphoprotein Enrichment
Enrichment of the phosphoproteome of a cell can
allow the identification of low-copy phosphoproteins that would otherwise go undetected. In one
approach, phosphoproteins were enriched by
conversion of phosphoserine residues to biotinylated residues. This method is an extension of
techniques originally developed by Hielmeyer
and colleagues. Following derivatisation, proteins that were formerly phosphorylated can be
isolated by avidin affinity chromatography.
Proteins immobilised on avidin beads can then be
eluted with biotin, theoretically resulting in the
isolation of the entire phosphoserine proteome.
By increasing the amount of cell lysate used
for avidin affinity chromatography, low-abundance
phosphoproteins can be enriched. However, this
technique does not work for phosphotyrosine,
and the reactivity of phosphothreonine by this
method is very poor. Tyrosine-phosphorylated
proteins can be isolated by the use of antiphosphotyrosine antibodies. As an alternative, another
method for phosphopeptide enrichment was
devised to allow the recovery of proteins phosphorylated on serine, threonine and tyrosine. In
this method, a protein or mixture of proteins is
digested to peptides with a protease and then
subjected to a multistep procedure for the conversion of phosphoamino acids into free sulfhydryl
groups. To capture the derivatised peptides, the
free sulfhydryl groups in the peptides are then
reacted with iodoacetyl groups immobilised on
glass beads. Enrichment of the phosphoproteome
can also be combined with protein profiling by
1- or 2-DE. In this way, changes in protein amount
observed on electrophoresis will reflect the level
of protein phosphorylation. Thus, the principle of
protein quantitation by ICAT can be combined
with phosphoprotein enrichment.

Phosphorylation Site Determination by Edman Degradation
Edman sequencing is still a widely used method
for determining phosphorylation sites in proteins
labelled with 32P, either in vitro or in vivo. This is
because sites can be determined at the subfemtomolar level if enough radioactivity can be
incorporated into the phosphoprotein of interest.
This can be as little as 1,000 cpm (which is not
ideal). Briefly, a 32P-labelled protein is digested
with a protease, and the resulting phosphopeptides are separated and purified by reverse-phase
HPLC or thin-layer chromatography (TLC).
The isolated peptides are then cross-linked via
their C termini to an inert membrane (e.g.
Immobilon P, PerSeptive Biosystems). The radioactive membrane is subjected to several rounds of
Edman cycles, and radioactivity is collected after
the cleavage step. The released 32P is counted in a
scintillation counter. This method positionally
places the phosphoamino acid within the
sequenced phosphopeptide. Of course, this is
meaningful only if the sequence of the phosphopeptide is already known. In addition, the
analysis ceases to be quantitative beyond
30 Edman cycles (even with efficient, modern
Edman machines) due to well-understood issues
with repetitive yield associated with Edman
chemistry.
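The effect of repetitive yield can be illustrated with a short calculation. The sketch assumes that each Edman cycle recovers a constant fraction of the peptide still available, so that the remaining signal falls geometrically; the yield of 92% per cycle is a hypothetical value chosen for illustration, not a figure from the text.

```python
def remaining_fraction(repetitive_yield, cycles):
    """Fraction of the starting peptide still sequenceable after the
    given number of cycles, assuming a constant yield per cycle."""
    if not 0.0 < repetitive_yield <= 1.0:
        raise ValueError("repetitive yield must lie in (0, 1]")
    return repetitive_yield ** cycles

# Hypothetical repetitive yield of 92% per cycle.
for cycle in (1, 10, 20, 30):
    print(cycle, round(remaining_fraction(0.92, cycle), 3))

after_30 = remaining_fraction(0.92, 30)
```

Under this assumption, less than one tenth of the starting material remains after 30 cycles, so the radioactivity released at late cycles becomes difficult to distinguish from carry-over of earlier cycles.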

Phosphorylation Site Determination by Mass Spectrometry
Because of its sensitivity, MS can allow the direct
sequencing of phosphopeptides, resulting in
unambiguous phosphorylation site identification.
Below, a brief overview of some common methods for phosphorylation site determination by
MS is given. Identification of phosphorylation
sites in proteins provides several unique challenges for the mass spectrometrist. For example,
unlike in protein identification, where analysis of
any peptide within the protein can be informative,
phosphorylation site analysis requires that the
phosphorylated peptide be analysed. This means
that considerably more protein is required for
analysis. In addition, phosphorylation can alter
the cleavage pattern of a protein, and the resulting phosphopeptides may require different
purification methods. To isolate and purify the
phosphopeptides of interest, it may be necessary
to alter the way in which the phosphoprotein is
digested and to alter the pH or the chromatographic material used for peptide purification.
1. Phosphopeptide Sequencing by MS/MS
A combination of HPLC, Edman degradation
and phosphopeptide sequencing by MS/MS
provides the best results for phosphorylation
site determination. Following excision and
digestion of a 32P-labelled protein, the peptides
are resolved by HPLC. By monitoring HPLC
fractions for radioactivity, the phosphopeptides can be selected for analysis. This reduces
the complexity of the peptide mixture before
MS is performed and facilitates phosphopeptide identification. Phosphopeptides can be
identified from a mixture of peptides by a
method known as precursor ion scanning.
Peptides are sprayed under neutral or basic
conditions, and phosphopeptides are identified
in the precursor ion scan. Once a phosphopeptide is identified, the peptide mixture is sprayed
under acidic conditions, and the phosphopeptide is sequenced by conventional tandem MS/MS.
On fragmentation of the phosphopeptide,
phosphoserine and phosphothreonine can be
identified by the formation of elimination
products.
2. Analysis of Phosphopeptides by MALDI-TOF
MALDI-TOF mass spectrometry can also be
used to identify phosphopeptides. When phosphorylated peptides are subjected to ionisation
by MALDI, phosphate groups are frequently
liberated from the peptides. This is the case
for phosphoserine- and phosphothreonine-containing peptides, which can liberate HPO3
or H3PO4, resulting in a neutral loss of 80 and
98 Da, respectively. Careful examination of
the TOF spectrum for differences in peptide
masses of 80 Da that are not found in the
unphosphorylated peptide control can identify
phosphopeptides. Phosphopeptides can also
be identified by treating one of two identical
samples with protein phosphatase to liberate
phosphate groups. Once a phosphopeptide is
identified, it can be sequenced by MS/MS for
identification of the phosphorylation site.
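The 80 Da comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the peak lists, the 0.02 Da tolerance and the function name are hypothetical, and the monoisotopic mass of HPO3 (79.9663 Da) is used in place of the nominal value of 80 Da.

```python
# Monoisotopic mass (Da) of HPO3, the group whose loss or gain
# corresponds to the nominal 80 Da shift discussed in the text.
HPO3 = 79.9663


def find_phospho_candidates(sample_masses, control_masses, tol=0.02):
    """Return (candidate_phosphopeptide, unmodified_peptide) mass pairs.

    A sample peak is reported when it is absent from the
    unphosphorylated control and lies one HPO3 above a control peak.
    """
    def present(mass, peaks):
        return any(abs(mass - p) <= tol for p in peaks)

    candidates = []
    for heavy in sample_masses:
        if present(heavy, control_masses):
            continue  # also seen in the control, so not phospho-specific
        for light in control_masses:
            if abs((heavy - light) - HPO3) <= tol:
                candidates.append((heavy, light))
    return candidates


# Hypothetical peptide masses (Da), invented for illustration only.
control = [1045.56, 1320.68, 1789.91]
sample = [1045.56, 1320.68, 1400.646, 1789.91]

pairs = find_phospho_candidates(sample, control)
print(pairs)
```

The same function could be applied to a phosphatase-treated and an untreated aliquot, with the treated aliquot playing the role of the control. A candidate found this way still needs MS/MS sequencing to locate the phosphorylation site.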

Metabolite Profiling Technologies


Two techniques dominate metabolite profiling
strategies: (1) mass spectrometry (MS) and (2)
nuclear magnetic resonance (NMR). Metabolomics, or the more modestly termed metabolite profiling, has been carried out since the
mid-1970s but only became a standard laboratory technique after 2000. The following paragraphs
provide short definitions of the techniques
and their relative advantages and disadvantages.
Gas chromatography-mass spectrometry (GC-MS), gas chromatography-time-of-flight mass spectrometry (GC-TOF-MS) and liquid chromatography-mass spectrometry (LC-MS)
are currently the standard mass spectrometry
methods for metabolite analyses. GC-MS technologies enable the identification and robust
quantification of a few hundred primary metabolites within a single extract. The main advantage
of this instrument stems from the fact that it has
long been used for metabolite profiling, and,
therefore, there are stable protocols for machine
set-up, maintenance and usage. GC-TOF-MS
offers several advantages, most notably, fast scan
times, which give rise to either improved peak
deconvolution (the ability to resolve partially co-eluting peaks) or higher sample throughput.
Compared with GC-MS technologies, LC-MS
offers several distinct advantages, chiefly its
adaptability to measure a far broader range of
metabolites encompassing both primary and secondary metabolites. However, LC-MS usually
uses electrospray ionisation, which is prone to
ion suppression (i.e. the competition of co-eluting
entities for ionisation energy) making it important to validate novel applications of this type of
instrumentation. In addition to these machines,
use of capillary electrophoresis-mass spectrometry (CE-MS) and Fourier transform ion cyclotron resonance mass spectrometry (FT-ICR-MS)
systems has been demonstrated for metabolite
profiling. The first of these, CE-MS, is a highly
sensitive methodology that can detect low-abundance metabolites and that provides good analyte
separation, whereas the second, FT-ICR-MS,
relies solely on very high-resolution mass analysis, which potentially enables the measurement
of the empirical formula for thousands of
metabolites; however, it is somewhat limited by
the lack of chromatographic separation. NMR
approaches, which rely on the detection of magnetic nuclei of atoms after application of a constant magnetic field, are the main alternative to
MS-based approaches for metabolite profiling.
These are well-developed and well-validated
methods, and the computer software associated
with NMR instrumentation is, consequently,
also advanced. Furthermore, despite limitations
in its sensitivity and, therefore, in metabolite coverage, NMR retains an advantage over MS-based
approaches for certain biological questions. For
example, it can be used non-invasively (i.e. on
living cells) and, because the pH of the vacuole is
different from that found elsewhere in the cell,
NMR can provide subcellular information, and it
is easier to derive atomic information for flux
modelling from NMR than from MS-based
approaches.

Physiological Techniques
Several physiological criteria (including the physiological traits that determine yield under
normal and unfavourable environments and the
genetic basis of such traits) need to
be evaluated before starting a molecular breeding programme. The use of physiological traits as
indirect selection indices for yield (such as tillering,
xylem vessel diameter, leaf dimensions, stomatal
or cuticular water loss and harvest index) in breeding
programmes has been discussed elsewhere. As in the
previous sections, only a few physiological techniques are explained below, though a large array of
techniques is available to increase the efficiency
of QTL mapping and MAS.

The global water shortage caused by an
increasing world population and worldwide
climate change is considered as one of the major
challenges in agriculture. The combined and
continued impact of drought, salinity and high
temperature impairs photosynthesis during
the daytime and increases surface temperatures at night, which in turn increases photorespiratory losses and thus reduces productivity.
The elevated greenhouse gas concentrations may
lead to the general drying of the subtropics. Thus,
the convergence of population growth and variable climate is expected to threaten global food
security. This compels scientists to develop
drought-adapted varieties through molecular
breeding and genetic modification approaches.
However, it is clear that the demand to produce
sufficient major food crops (wheat, rice and
maize) for the growing population has always
been increasing. Hence, optimising yield stability
for these major crops and locally important crops
is essential. Therefore, maintaining food security
in this scenario will require systematic approaches
including advances in physiological approaches.
The physiological dissection of complex traits
like drought, salinity or nutrient stress tolerance
is a first step to understand the genetic control of
tolerance and will ultimately enhance the
efficiency of MAS strategies. Developing and
integrating a gene-to-phenotype concept in crop
improvement requires particular attention to
phenotyping and ecophysiological modelling, as
well as the identification of stable candidate
genomic regions through novel concepts of
genetical genomics (see chapter 7). Knowledge
of both the plant physiological response and
integrative modelling is needed to tackle the
confounding effects associated with environment
and gene interaction. To maximise the impact of
using specific physiological traits, breeding
strategies require a detailed knowledge of the
environment where the crop is grown, genotype ×
environment interactions and fine tuning of the
genotypes suited to local environments. A physiological approach has an advantage over empirical breeding for yield per se because it increases
the probability of crosses resulting in additive
gene action for stress adaptation, provided that
the germplasm is characterised more thoroughly
for physiological traits than for yield alone.
The use of physiological traits in a breeding
programme, either by direct selection or through
a surrogate such as molecular markers, depends
on their relative genetic correlation with yield,
extent of genetic variation, heritability and genotype × environment interactions. For instance, in
drought environments, osmotic adjustment, accumulation and remobilisation of stem reserves,
superior photosynthesis, heat- and desiccation-tolerant enzymes, etc. are important physiological traits. However, it is important to establish
their heritability and genetic correlation with
yield in target environments. Identification of
physiological traits and mechanisms is time consuming and costly; however, if successful, the
benefits are likely to be substantial. Information on important physiological traits can be collected on potential parental lines by
screening an entire crossing block, or a set of
commonly used parents, thus producing a catalogue of useful physiological traits. This information can be used strategically in designing
crosses, thereby increasing the likelihood of
transgressive segregation events, which bring
together desirable traits. Alternatively, if enough
resources are available, screening for physiological traits could be applied to segregating generations, in yield trials or at any intermediate stage,
depending on when genetic gains from selection
are optimal. It is important to note that
breeding strategies using specific traits are effective
only when these traits are properly defined in
terms of the stage of crop development at which
they are relevant, the specific attributes of the
target environment for which they are adaptive
and their potential contribution to yield. For
example, the early escape from progressively
intensifying moisture stress, through the manipulation of plant phenology, is the most commonly exploited genetic strategy used to ensure
relatively stable yields under terminal drought
conditions. When significant genetic diversity
for a physiological trait in a germplasm collection for the given species is established, it is
imperative that the relevance of the trait as a
selection criterion be determined. The precise
phenotyping of physiological traits often requires
the utilisation of sophisticated and expensive
techniques, and the techniques used to characterise physiological traits specific to drought tolerance
are explained here.

Near-Infrared (NIR) Spectroscopy


This method provides spectral information corresponding to the field plot in a single near-infrared spectrum, where physical and chemical
characteristics of the harvested seed material are
captured. By using calibration models (i.e. mathematical and computational operations that relate
the spectral information with phenotypic values),
several traits can be determined on the basis of a
single spectrum (dry matter, protein, nitrogen,
starch and oil content, grain texture and grain
weight, etc.). The use of NIR spectroscopy on
agricultural harvesters provides indexing of grain
characteristics. In contrast to conventional sample-based methods, NIR spectroscopy on agricultural harvesters secures a good distribution of
measurements within plots and covers substantially larger amounts of plot material, thus
reducing sampling error and providing more representative measurements of the plot material in
terms of homogeneity.
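A calibration model of the kind mentioned above can be illustrated with a minimal sketch. The example assumes, purely for illustration, that protein content is linearly related to absorbance at a single wavelength and fits that line by ordinary least squares. The numbers and function names are invented, and practical NIR calibrations normally combine many wavelengths in a multivariate regression.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical calibration set: absorbance at one NIR wavelength and the
# protein content (%) measured on the same samples by a reference method.
absorbance = [0.40, 0.50, 0.60, 0.70]
protein = [9.0, 11.0, 13.0, 15.0]

slope, intercept = fit_line(absorbance, protein)


def predict_protein(a):
    """Predict protein content (%) of a new plot from its absorbance."""
    return slope * a + intercept


predicted = predict_protein(0.55)
print(round(slope, 3), round(intercept, 3), round(predicted, 3))
```

Once fitted, the same model can be applied to every spectrum recorded on the harvester, which is what allows several traits to be read from a single spectrum when one calibration exists per trait.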

Canopy Spectral Reflectance (SR) and Infrared Thermography (IRT)
Spectral reflectance of the plant canopy is a non-invasive phenotyping technique that enables several dynamic complex traits, such as biomass
accumulation, to be monitored with high temporal resolution. It has many advantages, including
easy and quick measurements and integration at the
canopy level. In addition, parameters such as photosynthetic
capacity, leaf area index, intercepted radiation
and chlorophyll content can
be measured simultaneously via a series of
diverse spectral indices. Plant water status as
determined by plant water content or water potential integrates the effects of several drought-adaptive traits. Several methods are used to determine
crop water content, including leaf water potential,
leaf stomatal conductance and canopy temperature, which is a relative measure of water flow
associated with water absorption from the soil
under water deficit. In addition to the above, one
of the most commonly used indirect techniques
for measurement of these variables is thermal
infrared imaging, or infrared thermography,
which involves the measurement of leaf or canopy temperature. Plant canopy temperature is a
widely measured variable that is closely related
to canopy conductance at the vegetative stage and
therefore provides insight into plant water status.
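As an illustration of how a spectral index is derived from canopy reflectance, the sketch below computes the normalised difference vegetation index (NDVI). NDVI is a widely used index that is not named in the text above and is introduced here only as an example. The sketch also computes canopy temperature depression (air temperature minus canopy temperature), one common way of expressing infrared thermography readings. All plot names and values are hypothetical.

```python
def ndvi(nir, red):
    """Normalised difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def canopy_temperature_depression(air_temp_c, canopy_temp_c):
    """Positive values mean the canopy is cooler than the surrounding air."""
    return air_temp_c - canopy_temp_c


# Hypothetical per-plot readings: reflectance fractions and canopy
# temperature (degrees C) taken at the same time of day.
plots = {
    "plot_A": {"nir": 0.50, "red": 0.10, "canopy_c": 28.0},
    "plot_B": {"nir": 0.40, "red": 0.20, "canopy_c": 33.0},
}
air_temp_c = 32.0

summary = {}
for name, p in plots.items():
    summary[name] = (
        round(ndvi(p["nir"], p["red"]), 3),
        canopy_temperature_depression(air_temp_c, p["canopy_c"]),
    )

for name, (index, ctd) in summary.items():
    print(name, index, ctd)
```

In this invented data set, plot_A combines a higher index with a cooler canopy, the pattern expected of a genotype that maintains transpiration under water deficit.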
One high-throughput integrated phenotyping platform, which includes a pipeline of
imaging, image processing automatisation and
data handling modules, was developed by
LemnaTec, a German company (http://www.lemnatec.com).
The platform has the capacity to
measure almost unlimited sets of parameters easily, allows comprehensive screening and provides
statistics on various plant traits in a dynamic way.
Depending on the degree of automatisation,
plants are manually placed in the Scanalyzer 3-D
or transported on conveyor belts directly from the
greenhouses to the imaging chambers. Such
chambers provide top and side imaging of both
shoot and root systems to quantify plant height/
width, biomass and plant architecture. The application
of different camera and acquisition modes, from
visible light to near-infrared (NIR/SWIR), infrared (IR) and fluorescence imaging, opens new
perspectives for visualisation using non-destructive quantification. The key application is in the
fast developing domain of plant functional
genomics. These automated systems will increase
our understanding of plant growth kinetics and
help improve plant models for systems biology or
breeding programmes.

Estimation of Compatible Solutes


Under osmotic stress, an important response is
the accumulation of osmotically active compounds called
osmolytes in order to lower the osmotic potential.
These are referred to as compatible metabolites
because they do not apparently interfere with
normal cellular metabolism. Molecules like
glycerol and sucrose were discovered by empirical
methods to protect biological macromolecules
against the damaging effects of salinity. Later, a
systematic examination of the molecules that
accumulate in halophytes and halotolerant
organisms led to the identification of a variety
of molecules also able to provide protection.
Characteristically, these molecules are not highly
charged, but are polar, highly soluble and have a
larger hydration shell. Such molecules will be
preferentially solubilised in the bulk water of the
cell where they could interact directly with the
macromolecules. The biochemical pathways
producing them are now better known, and there
are several sophisticated methods to estimate
such compounds. Genes encoding the rate-limiting
steps of these pathways have been cloned and transferred into
crop plants to raise the level of osmolytes.
Osmolytes for which some progress has been
made are indicated in Table 10.1.

Table 10.1 Important osmolytes that accumulate in plants during drought and salinity

Carbohydrate      Nitrogenous compound    Organic acid
Sucrose           Proteins                Oxalate
Sorbitol          Betaine                 Malate
Mannitol          Glutamate
Glycerol          Aspartate
Arabinitol        Glycine
Pinitol           Choline
Other polyols     Putrescine

To sum up, the techniques and platforms
mentioned above will greatly improve the phenotyping accuracy and throughput, thus contributing to a better elucidation of the genetic
control of complex physiological traits in plants.
However, many of the techniques discussed
above are applied to plants grown under controlled conditions that may not reflect field environment or can only be used to assess a limited
number of genotypes due to high costs and/or
practicality. Therefore, to overcome this problem, multitiered selection screens have been proposed, in which a simple but less accurate first screen allows a large number
of genotypes to be evaluated, followed
by tiers of more sophisticated screens applied to decreasing
numbers of genotypes.
A three-tiered sequence of physiological screens
has already been used to identify candidate
genotypes for use as parents in breeding programmes for some key traits, such as nitrogen fixation
activity during soil water deficit in soybean.
Furthermore, bringing integrative phenotyping
technology, such as that developed by LemnaTec,
from the controlled environments to the field will
improve the assessment of plant responses to
environmental stimuli while enabling highthroughput screening and generating comprehensive and accurate phenotypic data.
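The logic of a multitiered selection screen can be sketched as a simple filter chain. The example below is a hypothetical illustration: the genotype names, trait names, scores and the number of genotypes kept at each tier are all invented, and higher scores are assumed to be better in every screen.

```python
def tiered_screen(genotypes, screens):
    """Apply successive screens, each keeping only the best genotypes.

    `screens` is a list of (score_function, keep) pairs, ordered from the
    cheapest, least accurate screen to the most sophisticated one.
    """
    remaining = list(genotypes)
    for score, keep in screens:
        remaining = sorted(remaining, key=score, reverse=True)[:keep]
    return remaining


# Hypothetical scores for six genotypes in three screens.
data = {
    "G1": {"visual": 5, "ctd": 1.0, "n_fixation": 0.60},
    "G2": {"visual": 4, "ctd": 3.0, "n_fixation": 0.80},
    "G3": {"visual": 2, "ctd": 4.0, "n_fixation": 0.90},
    "G4": {"visual": 5, "ctd": 2.5, "n_fixation": 0.70},
    "G5": {"visual": 1, "ctd": 0.5, "n_fixation": 0.30},
    "G6": {"visual": 3, "ctd": 2.0, "n_fixation": 0.85},
}

screens = [
    (lambda g: data[g]["visual"], 4),      # tier 1: quick visual score
    (lambda g: data[g]["ctd"], 2),         # tier 2: canopy temperature
    (lambda g: data[g]["n_fixation"], 1),  # tier 3: costly assay
]

selected = tiered_screen(data, screens)
print(selected)
```

In this invented data set, G3 has the best score in the final assay but is discarded by the first, less accurate screen. This is the trade-off accepted in exchange for evaluating many genotypes cheaply.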

Genomics-Assisted Breeding
A number of resources for major crop species
including detailed, high-density genetic maps,
cytogenetic stocks, contig-based physical maps
and deep coverage and large-insert libraries are
now available to the public. These tools have
facilitated the isolation of genes via map-based
cloning, the localisation of quantitative trait loci
(QTLs) and the sequencing and annotation of
large genomic DNA fragments in several plant
species. Complete genome sequences of
plants such as Arabidopsis and rice have become
available through public databases. Further, data from
whole-genome or gene space sequencing projects for several plant species such as maize
(http://www.maizegenome.org/), sorghum, wheat
(http://www.wheatgenome.org/), tomato
(http://sgn.cornell.edu/help/about/tomato_sequencing.html),
tobacco (http://www.intl-pag.org/13/abstracts/PAG13_P027.html),
poplar (http://genome.jgi-psf.org/Poptr1/),
Medicago (http://www.medicago.org/genome/)
and lotus (http://www.kazusa.or.jp/lotus/)
are now ready to use. The widespread use of
transcriptome sampling strategies is a complementary approach to genome sequencing and has resulted
in large collections of expressed sequence tags
(ESTs) for almost all the important plant species
(http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html).
Comparative sequence analysis
can be used in some cases to facilitate isolation
of genes in species lacking ESTs. However,
EST resources have some limitations, such as
unidentified contaminants, chimeric sequences,
multiple forms in polyploids (homoeoalleles) and
putatively non-functional transcripts. Moreover, they
lack untranscribed regulatory regions and tend to miss under-represented genes.
One of the hallmarks of genomics research has
been the discovery of new mechanisms contributing to genome evolution. Bioinformatics facilitates
both the analysis of genomic and post-genomic
data and the integration of data from the related
fields of transcriptomics, proteomics, metabolomics and phenomics. Several bioinformatic tools
and databases have been developed for DNA
sequence analysis, marker discovery and querying
and analysing information. Enhanced bioinformatic tools, genome databases and integration of
information from different fields enable the
identification of genes and gene products and can
elucidate the functional relationships between
genotype and observed phenotype. Probably the
most important future prospect is the enhancement
of visualisation tools that extend beyond simple
relationships and help us more clearly to interpret
the complex multidimensional biological networks
of genes and their relationships to phenotypes.
Metabolomics approaches enable the parallel
assessment of the levels of a broad range of metabolites and have been documented to have great
value in both phenotyping and diagnostic analyses
in plants. These tools have recently been turned to
evaluation of the natural variance apparent in
metabolite composition.
Such advances in genomics can contribute to
crop improvement in two general ways. First, a
better understanding of the biological mechanisms can lead to new or improved screening
methods for selecting superior genotypes more
efficiently. Second, new knowledge can improve
the decision-making process for more efficient
breeding strategies. This is broadly termed
genomics-assisted breeding.

Functional Markers
During the past decades, molecular mapping has
identified chromosome regions carrying important genes in crop plants using SSR, RFLP, AFLP,
RAPD, DArT and other markers. However, these
usually neutral genetic markers can be some
distance from the targeted genes and thus are
often population specific or parent related, and
their predictive value depends on the degree of
linkage between markers and target locus alleles
in specific populations. As a result, relatively few
linked markers are used in breeding. In contrast,
functional or gene-specific markers are derived
from polymorphic sites within candidate genes
that are directly associated with phenotypic variation. Because they are developed from functional gene sequences,
they accurately discriminate alleles at a locus
and represent ideal markers for MAS in breeding.
A candidate gene is defined as a gene that has been
identified as related to a particular trait (phenotype, disease or condition). Candidate genes in
general can be divided into two categories: positional and functional. A positional candidate
gene is one that might be associated with a trait,
based on the location of a gene on a chromosome.
A functional candidate gene is one whose function
has something in common biologically with the
trait under investigation. Positional candidate
genes are identified through QTL- and map-based
cloning approaches, whereas functional genomics
approaches such as transcriptomics and expression genetics provide the set of functional candidate genes.
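The statement that the predictive value of a linked marker depends on the degree of linkage can be made concrete with a small calculation. The sketch below is an illustration under stated assumptions: it converts map distance to a recombination fraction with Haldane's mapping function, which assumes no crossover interference, and treats each generation of marker-only selection as one independent meiosis. The function names are hypothetical.

```python
import math


def haldane_recombination_fraction(distance_cm):
    """Recombination fraction for a map distance given in centimorgans."""
    d = distance_cm / 100.0  # convert cM to Morgans
    return 0.5 * (1.0 - math.exp(-2.0 * d))


def prob_target_retained(distance_cm, generations=1):
    """Chance that the target allele is still coupled to the marker allele
    after the given number of meioses with selection on the marker alone."""
    r = haldane_recombination_fraction(distance_cm)
    return (1.0 - r) ** generations


# A distance of 0 cM corresponds to a functional marker inside the gene.
for distance in (0, 1, 5, 20):
    print(distance, round(prob_target_retained(distance, generations=3), 3))
```

At 0 cM the marker never separates from the target allele, which is the advantage claimed for functional markers. The probability falls as the distance and the number of generations increase.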
Functional markers have advantages over random DNA markers, because they are diagnostic
of the desired trait allele. Many new crop-specific
genes have been cloned during the past years, and
the corresponding functional markers have been
developed and used in MAS. For example, more
than 30 loci (genes) have been cloned in common
wheat and its relatives, and 97 functional markers
for wheat processing quality, agronomic traits
and disease resistance genes have been developed
and used to identify those alleles (Liu et al. 2012).
Knowledge of marker-trait association is a prerequisite for marker-assisted selection. SNPs and
InDels are the most abundant forms of DNA
sequence variation in crop plants, and this was
confirmed with cloned genes and amplicons.
Large-scale genome sequencing and associated
bioinformatics are becoming widely accepted
research tools for accelerating the analysis of crop
genome structure and function. Second-generation
DNA sequence data from several crops provide an
opportunity to use genomic information to clone
genes and develop SNP markers. Rapid progress
is now being achieved in assembling the DNA
sequences from individual chromosome arms of
crop plants, and this progress provides a template
for defining functional markers (FMs) for future use. High-quality
genome sequences integrated with molecular
genetic maps provide the basis for identifying
duplicated genes, analysing promoter regions in
detail, defining SNPs/InDels and aligning the
transcriptome with the genome. These advances
will allow gene networks to be clearly defined
and thus allow meaningful functional markers to
be developed for complex traits. Extensive proteomic studies have allowed identification of
many allelic variants, and genomic analyses
identified several markers for discriminating
alleles at one locus. These successes have indicated that it is now essential to establish rapid,
convenient and economical PCR-based assays in
crop breeding. In order to detect genes simultaneously in a single PCR, multiplex PCR can be
developed, in which several markers in the same
reaction mix are co-amplified under identical
conditions. However, a clear challenge in
multiplexing markers is that the different primers must have similar annealing
temperatures and the
expected PCR products must be easily separated on
agarose gels. If alleles conferring specific resistance are being sought, it is important to know
which alleles are effective and potentially useful
to local breeding programmes. However, more
functional markers are needed for important traits
such as disease and stress resistance in order to
strengthen the application of molecular markers
in breeding programmes. SNPs are the most
applicable markers for high-throughput screening once the genotype-phenotype associations
are determined. The use of these markers will expand as high-throughput techniques
for MAS based on functional SNP markers are developed and
DNA chips are produced for efficient analysis.
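The two multiplex PCR constraints noted above, similar annealing temperatures and separable product sizes, can be checked with a short script. The sketch below is illustrative only. It estimates primer melting temperature with the simple Wallace rule, Tm = 2(A+T) + 4(G+C), which is a rough approximation for short oligonucleotides. The primer sequences, amplicon sizes and the two thresholds (5 degrees C and 50 bp) are hypothetical choices.

```python
def wallace_tm(primer):
    """Approximate melting temperature (degrees C) by the Wallace rule."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc


def multiplex_compatible(markers, max_tm_spread=5, min_size_gap=50):
    """True if all primers have similar Tm and all products differ in size
    by at least `min_size_gap` base pairs."""
    tms = [wallace_tm(p) for m in markers for p in (m["forward"], m["reverse"])]
    sizes = sorted(m["amplicon_bp"] for m in markers)
    tm_ok = max(tms) - min(tms) <= max_tm_spread
    size_ok = all(b - a >= min_size_gap for a, b in zip(sizes, sizes[1:]))
    return tm_ok and size_ok


# Hypothetical markers, invented for illustration only.
marker_1 = {"forward": "ATGCGTACGTTAGCCTAGCA",
            "reverse": "TTGACCGATCGGATCCAATG", "amplicon_bp": 250}
marker_2 = {"forward": "GCTAGCTAGGATCCATGCAT",
            "reverse": "CATGGTACCTTAGCAGTCAT", "amplicon_bp": 400}
marker_3 = {"forward": "ATGCGTACGTTAGCCTAGCA",
            "reverse": "TTGACCGATCGGATCCAATG", "amplicon_bp": 260}

print(multiplex_compatible([marker_1, marker_2]))            # sizes well apart
print(multiplex_compatible([marker_1, marker_2, marker_3]))  # 250 vs 260 bp
```

A check of this kind only screens out obviously unsuitable combinations. Primer-dimer formation and amplification efficiency still have to be tested at the bench.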

Comparative Genomics
The number of sequenced plant genomes and
associated genomic resources is growing rapidly
with the advent of both an increased focus on
plant genomics from funding agencies and the
application of inexpensive next-generation
sequencing. It seems certain that the
sequencing of major crop plants, followed by the
assignment of function to these (draft) sequences,
will yield a lot of information for applications of
genomics in orphan species as well. This
expectation is based on the fact that a
significant degree of synteny exists between
plant species as revealed by several comparative
genetic mapping experiments. Comparative
genomics is the study of the relationship of
genome structure and function across different
biological species or strains. Actually, it is an
attempt to take advantage of the information provided by the signatures of selection to understand
the function and evolutionary processes that act
on genomes. While it is still a young field, it
holds great promise to yield insights into many
aspects of the evolution of modern crop species.
For example, conservation of gene order and
content has been detected between Arabidopsis
and other dicot species, such as
the cultivated Brassica species, tomato and soybean. Within the monocots also, especially the
cereals, extensive colinearity has been observed
by comparative mapping of the genomes using
genetic markers. This phenomenon of macrocolinearity was first established among seven
grass species, with rice as the reference genome,
and was represented in the form of a graphical
consensus map that is popularly known as the
circle diagram. This map has been refined to
embrace more grass species whose genomes are
described using several rice linkage blocks (visit
www.gramene.org for more information).
Altogether these studies give the general impression that all the grasses examined have similar
gene order despite the large differences in DNA
content or chromosome number. Microcolinearity,
or the conservation of gene order at the submegabase level, is also observed to be extensive
but has frequent deviations which can be attributed to small-scale rearrangements, deletions or
even local gene amplification and translocation.
This has been examined not only between sorghum and maize but also between rice and other
crop plants as well as between rice subspecies.
The absence of microcolinearity, in contrast to
colinearity at the recombinational map level, has also been
confirmed by comparison of small segments of
the rice genome sequence with those of other cereals. In
particular, comparison of wheat chromosome bin-mapped
ESTs with the rice genome sequence has predicted
that the order of rice genes in relation to the wheat
genome could emerge as a complex pattern, and
its utility for synteny-based analysis/application
remains to be assessed. Nevertheless, the rice
genome has come forth as a relatively stable
genome compared to other cereals, which have
undergone most of the rearrangements during evolution. Various investigations have also revolved
around the idea of colinearity between monocot
and dicot plants. However, the rice genome, being
four times larger and containing more than twice
the number of genes as that of Arabidopsis, may
show limited synteny. The low level of synteny
between Arabidopsis and rice might not be adequate for applications in map-based cloning strategies or for integration of functional and
structural genomic data across the monocot-dicot
divide, but a detailed study of the genomic
data of both plants could provide answers to
questions related to the structure and evolution of
genomes. On the other hand, the high level of
genome colinearity between plant species belonging to the same family can be exploited to carry
out fine mapping and map-based cloning experiments, especially in the case of crop plants having large genomes. In the cereals, for example, the genetic
mapping of an agronomically important locus is
carried out in the large-genome species, followed by
cloning using information from a closely
related model organism such as rice.
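Conservation of gene order between two species can be quantified in many ways. One minimal sketch, offered as an assumption and not as the method used in the studies cited above, scores colinearity as the fraction of shared genes that lie in the longest run preserving the reference order (a longest increasing subsequence). The gene names are hypothetical, and only one chromosome orientation is considered.

```python
from bisect import bisect_left


def longest_colinear_run(positions):
    """Length of the longest strictly increasing subsequence."""
    tails = []  # tails[k] = smallest tail of an increasing run of length k + 1
    for p in positions:
        i = bisect_left(tails, p)
        if i == len(tails):
            tails.append(p)
        else:
            tails[i] = p
    return len(tails)


def colinearity_score(order_ref, order_query):
    """Fraction of shared genes whose order is conserved (0.0 to 1.0)."""
    index_in_ref = {gene: i for i, gene in enumerate(order_ref)}
    positions = [index_in_ref[g] for g in order_query if g in index_in_ref]
    if not positions:
        return 0.0
    return longest_colinear_run(positions) / len(positions)


# Hypothetical orthologous gene orders along one chromosome segment.
reference = ["g1", "g2", "g3", "g4", "g5", "g6"]
# The query has a local inversion (g3..g5) and one gene with no orthologue.
query = ["g1", "g2", "g5", "g4", "g3", "g6", "gX"]

score = colinearity_score(reference, query)
print(round(score, 3))
```

A small inversion lowers the score only slightly. This mirrors the description above of microcolinearity as extensive but interrupted by small-scale rearrangements.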
The major benefits of comparative genomics
are twofold: (1) using computer-based analysis to zero in on the genomic features that have
been preserved in multiple organisms over millions of years, researchers will be able to pinpoint
the signals that control gene function, which in
turn should translate into innovative approaches
for treating human disease and improving human
health, and (2) in addition to its implications for
human health and well-being, comparative genomics may benefit the plant world as well. As sequencing technology grows easier and less expensive, it
will likely find wide applications in agricultural
biotechnology as a tool to tease apart the often-subtle
differences among animal species. Such
efforts might also lead to the rearrangement of our understanding of some branches on
the evolutionary tree, as well as point to new strategies for conserving rare and endangered species.

Identification of Novel Molecular Networks and Construction of New Metabolic Pathway
Despite extensive knowledge of fundamental metabolic processes, the mechanisms of physiological
modulation over short and extended time intervals
in response to changing environmental conditions
remain difficult to understand. What is more, the
mere existence of some plant metabolites such as
trehalose still puzzles us. Correspondingly, investigation of metabolic network regulation upon
genetic or environmental perturbations may be
viewed as a necessity for pathway discovery and
functional genomics. There is a long tradition of,
and extensive knowledge about, metabolite analysis. In fact, metabolite analysis can be better
understood by distinguishing among levels on
the basis of its objectives. Four levels can be
identified. First, there is metabolite target analysis, which utilises specialised protocols for the
analysis of difficult analytes such as phytohormones. Second, metabolite profiling aims at
quantitation of several predefined targets (e.g. of
all metabolites of a specific pathway or a set of
metabolites typical for different pathways).
Third, metabolomics has the ultimate goal of
unbiased identification and quantitation of all the
metabolites present in a certain biological sample from an organism grown under defined conditions. Fourth, there is metabolic fingerprinting,
which, instead of separating individual metabolites by physical parameters, focuses on collecting and analysing data from crude metabolite
mixtures to rapidly classify samples. Among
these four approaches, metabolomics seems to
be best suited for investigation of metabolic networks, because it focuses on quantifying individual metabolites without having a bias
concerning the choice of targets to be analysed,
as in metabolite profiling.

Ideally, metabolomic data should accurately
describe physiological processes as responses to
developmental, genetic or environmental changes.
However, some theoretical considerations limit
direct interpretation of metabolic networks
generated from metabolic snapshots. First, any
subcellular compartmentalisation is lost in the
process of sample preparation. Although mRNA
or protein expression levels can sometimes be
ascribed to plant compartments on the basis of
their target sequences, there is a high degree of
uncertainty about the actual location of metabolites, many of which may occur simultaneously
(and for potentially different purposes) in different locations and in varying amounts. Therefore,
metabolomic information can only be interpreted at
the multicellular, tissue or organ level. If metabolite analysis of subcellular compartments is the
goal, large amounts of tissue must be used for the
parallel determination of enzyme activities for
ascribing cellular compartments to density
fractions. Because plant metabolomes are so
complex, many of the detected metabolites will
remain structurally unidentified until being elucidated by de novo identification, which is much
more difficult than the identification of transcripts
or proteins. Finally, the question arises of how to
correlate metabolite levels under different situations if they only relate to multiple steady states
without any kinetic experimental design that
could guide interpretation. Most often, average
metabolite levels are used for deducing novel
insights into plant physiology. This strategy again
results in a loss of information, however, because
metabolomic data from individual snapshots can
be regarded as being as reliable as the initial method validation tests have shown. Any variation found
in a homozygous plant population therefore indicates responses to subtle differences in plant
development or physiology for each individual
plant. This variation must have biological causes
reflecting the flexibility of metabolic networks in
the studied populations. It can, therefore, be used
to calculate pathways by comprehensive pairwise
metabolite correlation plots. In this way, stoichiometrically feasible metabolic networks could be
computed for a variety of organisms. Such networks
would enable researchers to predict the effect of
knockout mutations and novel metabolic pathways.
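The pairwise metabolite correlation idea described above can be illustrated with a minimal sketch. The metabolite names, the levels and the 0.9 threshold are hypothetical values chosen for illustration only; a real analysis would use validated measurements from many individual plants and would need to correct for multiple testing.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical levels (arbitrary units) of three metabolites in five plants.
levels = {
    "glucose":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "fructose": [2.1, 3.9, 6.2, 8.0, 9.8],
    "proline":  [5.0, 4.0, 3.0, 2.0, 1.0],
}

def correlation_edges(levels, threshold=0.9):
    """Return metabolite pairs whose absolute correlation meets the threshold."""
    edges = []
    for a, b in combinations(sorted(levels), 2):
        r = pearson(levels[a], levels[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

for edge in correlation_edges(levels):
    print(edge)
```

Each reported pair can be read as a candidate edge in a correlation network; correlation alone does not establish that two metabolites share a pathway.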


Besides allowing comparison with experimentally established metabolic networks, the inherent
characteristics of topological metabolic networks
could be investigated to compare structural differences in network organisation and thus
improve our understanding of key metabolites
and the effects of random mutations in biological
systems.
An understanding of metabolic networks might
be further improved by an integration of static
enzyme stoichiometry networks and inherent
network characteristics. Eventually, the combination of metabolomic analysis with other profiling
technologies, especially proteomics and integrative techniques like metabolic control analysis,
could enable novel pathway discovery and aid the
evaluation of changes in plant networks produced
by genetic or environmental changes.

Bioinformatics for MAS


Bioinformatics refers to the study of biological
information using concepts and methods in computer science, statistics and engineering. It can
be divided into two categories: biological information management and computational biology.
Bioinformatics plays an essential role in today's
plant science. As the amount of data grows exponentially, there is a parallel growth in the demand
for tools and methods in data management, visualisation, integration, analysis, modelling and
prediction. At the same time, many researchers
in biology are unfamiliar with available bioinformatics methods, tools and databases, which
could lead to missed opportunities or misinterpretation of the information. Only a few commonly
used bioinformatics tools with a potential role in
MAS are listed here. The list is not exhaustive;
the rapid pace of development in bioinformatics
makes a complete list impractical.
Biological sequences such as DNA, RNA and
protein sequences are the most fundamental objects
of a biological system at the molecular level.
Advances in sequencing technologies provide
opportunities in bioinformatics for managing,
processing and analysing the sequences. Shotgun
sequencing (see above) is currently the most
common method in genome sequencing: pieces
of DNA are sheared randomly, cloned and
sequenced in parallel. Software has been developed to piece together the random, overlapping
segments that are sequenced separately into a
coherent and accurate contiguous sequence.
Numerous software packages exist for sequence
assembly, including Phred/Phrap/Consed
(http://www.phrap.org), Arachne
(http://www.broad.mit.edu/wga/) and GAP4
(http://staden.sourceforge.net/overview.html).
The Institute for Genomic Research (TIGR) developed a modular, open-source
package called AMOS (http://www.tigr.org/software/AMOS/),
which can be used for comparative genome assembly.
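The principle of piecing overlapping shotgun reads into a contiguous sequence can be sketched with a toy greedy merger. This is an illustration only and is not how Phrap, Arachne, GAP4 or AMOS work internally; it assumes error-free reads from one strand, and the function names and the minimum overlap of three bases are arbitrary choices.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:  # ties keep the first pair found
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; the remaining reads stay separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"]))
```

Real assemblers must also handle sequencing errors, reverse-complement reads and repeats, which is why base-quality scores and more elaborate graph methods are used in practice.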
Gene finding refers to prediction of introns
and exons in a segment of DNA sequence.
Dozens of computer programs for identifying
protein-coding genes are available. Some of the
well-known ones include Genscan
(http://genes.mit.edu/Genscan.html), GeneMarkHMM
(http://opal.biology.gatech.edu/GeneMark/),
GRAIL (http://compbio.ornl.gov/Grail-1.3/),
Genie (http://www.fruitfly.org/seq_tools/genie.html)
and Glimmer (http://www.tigr.org/softlab/glimmer).
In addition, one can use genome
comparison tools such as SynBrowse
(http://www.synbrowser.org/) and VISTA
(http://genome.lbl.gov/vista/index.shtml) to enhance the accuracy
of gene identification.
An important aspect of genome annotation is
the analysis of repetitive DNAs, which are copies
of identical or nearly identical sequences present
in the genome. Repetitive sequences exist in
almost any genome and are abundant in most
plant genomes. The identification and characterisation of repeats is crucial to shed light on the
evolution, function and organisation of genomes
and to enable filtering for many types of homology searches. A small library of plant-specific
repeats can be found at ftp://tigr.org/pub/data/TIGR_Plant_Repeats/;
this is likely to grow substantially as more genomes are sequenced. One
can use RepeatMasker (http://www.repeatmasker.org/)
to search for repetitive sequences in a genome.
Working from a library of known repeats,
RepeatMasker is built upon BLAST and can
screen DNA sequences for interspersed repeats
and low complexity regions. Repeats with poorly
conserved patterns or short sequences are hard to
identify using RepeatMasker due to the limitations of BLAST. To identify novel repeats, various algorithms have been developed. Some widely
used tools include RepeatFinder
(http://ser-loopp.tc.cornell.edu/cbsu/repeatfinder.htm) and RECON
(http://www.genetics.wustl.edu/eddy/recon/).
Simple sequence repeats can be identified in a
given sequence using SSRIT, available at
www.gramene.org.
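As an illustration of what a simple sequence repeat search involves, the following sketch finds perfect tandem repeats with a regular expression. It is not the SSRIT algorithm; the function names, the motif lengths of 2 to 4 bases and the minimum of five repeat units are assumptions made for this example.

```python
import re

def is_compound(motif):
    """True if the motif is itself built from a shorter repeated unit."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))

def find_ssrs(seq, min_repeats=5, max_motif=4):
    """Return (motif, repeat count, start position) for perfect tandem repeats."""
    seq = seq.upper()
    hits = []
    for motif_len in range(2, max_motif + 1):
        # A motif of the given length followed by at least min_repeats - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_compound(motif):
                continue  # e.g. AAAA or ATAT, already covered by a shorter motif
            hits.append((motif, len(m.group(0)) // motif_len, m.start()))
    return hits

example = "GGC" + "AT" * 6 + "CCGTC" + "GAA" * 5 + "TT"
print(find_ssrs(example))
```

The reported start positions are 0-based. Flanking sequence around each hit is what would normally be used for designing SSR marker primers.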
Comparing sequences provides a foundation
for many bioinformatics tools and may allow
inference of the function, structure and evolution
of genes and genomes. Methods in sequence
comparison can be largely grouped into pairwise, sequence-profile and profile-profile comparison. For pairwise sequence comparison,
FASTA (http://fasta.bioch.virginia.edu/) and
BLAST (http://www.ncbi.nlm.nih.gov/blast/)
are popular. To assess how confidently an
alignment represents a homologous relationship,
a statistical measure (the expectation value, or e-value)
is integrated into pairwise sequence alignments.
A sequence profile is calculated using the probability of occurrence for each amino acid at each
alignment position. PSI-BLAST
(http://www.ncbi.nlm.nih.gov/BLAST/) is a popular example
of a sequence-profile alignment tool. Some other
sequence-profile comparison methods are slower
but even more accurate than PSI-BLAST, including HMMER (http://hmmer.wustl.edu/), SAM
(http://www.cse.ucsc.edu/research/compbio/sam.html)
and META-MEME (http://metameme.sdsc.edu/).
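The notion of a sequence profile, a table of residue probabilities for each alignment column, can be sketched as follows. This is a simplified illustration and not the PSI-BLAST or HMMER implementation; it assumes a gap-free alignment, a uniform background frequency of 0.05 and a pseudocount of 1, and the toy alignment is invented.

```python
from collections import Counter
from math import log2

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0, alphabet=AMINO_ACIDS):
    """Per-column residue probabilities from equal-length aligned sequences."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        profile.append({aa: (counts.get(aa, 0) + pseudocount) / total
                        for aa in alphabet})
    return profile

def log_odds_score(seq, profile, background=0.05):
    """Sum of log2(column probability / background) over all positions."""
    return sum(log2(col[aa] / background) for aa, col in zip(seq, profile))

toy_alignment = ["MKV", "MKI", "MRV", "MKV"]  # hypothetical aligned peptides
profile = build_profile(toy_alignment)
print(round(log_odds_score("MKV", profile), 2))
print(round(log_odds_score("WWW", profile), 2))
```

A sequence resembling the aligned family scores higher than an unrelated one, which is the basis on which profile methods detect more distant homologues than pairwise comparison.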
Proteins can be generally classified based on
sequence, structure or function. Several sequence-based methods have been developed based on sizable
protein sequences (typically longer than 100
amino acids), including Pfam (http://pfam.wustl.edu/),
ProDom (http://protein.toulouse.inra.fr/prodom/current/html/home.php)
and Clusters of Orthologous Groups (COG)
(http://www.ncbi.nlm.nih.gov/COG/new/). Other methods are
based on fingerprints of small conserved motifs
in sequences, as with PROSITE (http://au.expasy.org/prosite/),
PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/)
and BLOCKS (http://www.psc.edu/general/software/packages/blocks/blocks.html).
Several bioinformatics tools have
been developed for two-dimensional (2-D) electrophoresis analysis. SWISS-2DPAGE can locate
the proteins on the 2-D PAGE maps from Swiss-Prot
(http://au.expasy.org/ch2d/). Melanie
(http://au.expasy.org/melanie/) can analyse, annotate
and query complex 2-D gel samples. Flicker
(http://open2dprot.sourceforge.net/Flicker/) is an
open-source stand-alone program for visually
comparing 2-D gel images. PDQuest
(http://www.proteomeworks.bio-rad.com) is a popular
commercial software package for comparing 2-D
gel images. Some software platforms handle
related data storage and management, including
PEDRo (http://pedro.man.ac.uk/), a software
package for modelling, capturing and disseminating 2-D gel data and other proteomics experimental data.
A protein family can be represented in a phylogenetic tree that shows the evolutionary relationships among proteins. Phylogenetic analysis
can be used in comparative genomics, gene function prediction and inference of lateral gene
transfer among other things. The analysis typically starts from aligning the related proteins
using tools like ClustalW
(http://bips.u-strasbg.fr/fr/Documentation/ClustalX/).
Among the popular methods to build phylogenetic trees are minimum distance (also called neighbour joining),
maximum parsimony and maximum likelihood
trees. Some programs provide options to use any
of the three methods, for example, the two widely
used packages PAUP (http://paup.csit.fsu.edu)
and PHYLIP (http://evolution.genetics.washington.edu/phylip.html).
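Distance-based tree building starts from a matrix of pairwise distances between aligned sequences. The sketch below computes the simplest such measure, the proportion of differing sites, and reports the closest pair, which is the first pair a clustering method would join. It is not the PAUP or PHYLIP implementation; the sequences and function names are hypothetical, and no correction for multiple substitutions is applied.

```python
def p_distance(a, b):
    """Proportion of aligned positions that differ, skipping gapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)

def closest_pair(aligned):
    """Return (distance, name1, name2) for the most similar pair of sequences."""
    names = sorted(aligned)
    candidates = [
        (p_distance(aligned[m], aligned[n]), m, n)
        for i, m in enumerate(names)
        for n in names[i + 1:]
    ]
    return min(candidates)

# Hypothetical pre-aligned sequences of equal length.
aligned = {
    "A": "ACGTACGT",
    "B": "ACGTACGA",
    "C": "TCGAACTA",
}
print(closest_pair(aligned))
```

A full neighbour-joining implementation would repeat this step on a progressively reduced matrix while adjusting for unequal rates among lineages.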
As more reliable data are collected, one can use
ordinary differential equations for dynamic simulations of metabolic networks and combine information about connectivity, concentration balances,
flux balances, metabolic control and pathway optimisation. Ultimately, one may integrate all of the
information and perform analysis and simulation
in a cellular modelling environment like E-Cell
(http://www.e-cell.org/) or CellDesigner
(http://www.systems-biology.org).
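A dynamic simulation with ordinary differential equations can be sketched for a hypothetical two-step pathway, substrate S to intermediate I to product P, with first-order kinetics. The rate constants, the starting concentration and the use of simple Euler integration are assumptions made for illustration; environments such as E-Cell or CellDesigner provide far more capable solvers and model formats.

```python
def simulate(k1=0.5, k2=0.2, s0=10.0, dt=0.01, t_end=20.0):
    """Euler integration of dS/dt = -k1*S, dI/dt = k1*S - k2*I, dP/dt = k2*I."""
    s, i, p = s0, 0.0, 0.0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        v1 = k1 * s  # flux of the first reaction
        v2 = k2 * i  # flux of the second reaction
        s -= v1 * dt
        i += (v1 - v2) * dt
        p += v2 * dt
    return s, i, p

s, i, p = simulate()
print(f"S={s:.4f} I={i:.4f} P={p:.4f} total={s + i + p:.4f}")
```

The total of S, I and P stays equal to the starting amount, which reflects the concentration balance mentioned above and serves as a quick check on the integration.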
The data that are generated and analysed as
described in the previous sections need to be
compared with the existing knowledge in the
field in order to place the data in a biologically
meaningful context and derive hypotheses. To do
this efficiently, data and knowledge need to be
described in explicit and unambiguous ways that
must be comprehensible to both humans and
computer programs. An ontology is a set of vocabulary terms whose meanings and relations with
other terms are explicitly stated and which are
used to annotate data. A list of open-source
ontologies used in biology can be found on the
Open Biological Ontologies website
(http://obo.sourceforge.net/). Many ontologies on this site
are under development and are subject to frequent
change. Gene Ontology (GO) (www.geneontology.org)
is an example of a bio-ontology that has
garnered community acceptance. Other examples of ontologies currently in development are
the Sequence Ontology (SO) and the Plant
Ontology (PO) project (www.plantontology.org).
In addition, large collections of biological
databases are available on the web for several
crops. Nucleic Acids Research (http://nar.oxfordjournals.org/) publishes a database issue in
January of every year.
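The definition of an ontology given above, terms with explicitly stated relations that are used to annotate data, can be made concrete with a small sketch. The term names are only modelled on GO-style vocabulary, the gene names are invented, and the dictionary-based structure is an assumption for illustration rather than the format used by any ontology project.

```python
# Illustrative terms; each maps to its direct "is_a" parents.
IS_A = {
    "response to water deprivation": ["response to abiotic stimulus"],
    "response to heat": ["response to abiotic stimulus"],
    "response to abiotic stimulus": ["response to stimulus"],
    "response to stimulus": [],
}

def ancestors(term, is_a=IS_A):
    """All terms reachable from the given term by following is_a links."""
    seen = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen

def annotated_with(gene_annotations, term):
    """Genes annotated with the term itself or with any more specific term."""
    return sorted(
        gene
        for gene, terms in gene_annotations.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )

# Hypothetical annotations.
genes = {
    "geneA": ["response to heat"],
    "geneB": ["response to water deprivation"],
    "geneC": ["response to stimulus"],
}
print(annotated_with(genes, "response to abiotic stimulus"))
```

Because the relations are explicit, a query for a general term also retrieves genes annotated with its more specific descendants, which is what makes ontology-based annotation usable by computer programs.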

Bibliography
Literature Cited
Bachem CWB, van der Hoeven RS, de Bruijn SM,
Vreugdenhil D, Zabeau M, Visser RGF (1996)
Visualization of differential gene expression using a
novel method of RNA fingerprinting based on AFLP:
analysis of gene expression during potato tuber development. Plant J 9:745753
Edman P (1949) A method for the determination of amino
acid sequence in peptides. Arch Biochem 22(3):475
Fire A, Xu S, Montgomery MK, Kostas SA, Driver SE,
Mello CC (1998) Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391:806811

244
Fischer A, Saedler H, Theissen G (1995) Restriction fragment length polymorphism-coupled domain-directed
differential dis-play: a highly efficient technique for
expression analysis of multigene families. Proc Natl
Acad Sci USA 92:53315335
Habu Y, Fukuda-Tanaka S, Hisatomi Y, lida S (1997)
Amplified restriction fragment length polymorphismbased mRNA fingerprinting using a single restriction
enzyme that recognizes a 4-bp sequence. Biochem
Biophys Res Commun 234:516521
Ji H, Hodges E et al (2007) Genome-wide in situ exon capture
for selective resequencing. Nat Genet 39:15221527
Liu Y, He Z, Appels R, Xia X (2012) Functional markers
in wheat: current status and future prospects. Theor
Appl Genet 125:110
Shendure J et al (2005) Accurate multiplex polony sequencing
of an evolved bacterial genome. Science 309:17281732
Vos P, Hogers R, Bleeker M, Reijans M, van de Lee T,
Hornes M, Freijters A, Pot J, Peleman J, Kuiper M,
Zabeau M (1995) AFLP: a new concept for DNA fingerprinting. Nucleic Acids Res 21:44074414
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995)
Serial analysis of gene expression. Science 270:484487

10

Curtain Raiser to Novel MAS Platforms

Further Readings
Buzdin A, Lukyanov S (eds) (2007) Nucleic acids hybridization. Springer, New York
Rhee S, Dickerson J, Xu D (2007) Bioinformatics and its
applications in plant biology. Annu Rev Plant Biol
57:335360
Shendure J, Hanlee J (2008) Next-generation DNA
sequencing. Nat Biotechnol 26(10):11351145
Tyagi AK, Khurana JP, Khurana P, Raghuvanshi S, Gaur
A, Kapur A, Gupta V, Kumar D, Ravi V, Vij S, Khurana
P, Sharma S (2004) Structural and functional analysis
of rice genome. J Genet 83:7999
Varshney RK, Graner A, Sorrells ME (2005) Genomicsassisted breeding for crop improvement. Trends Plant
Sci 10(12):621630
Yamamoto M et al (2001) Use of serial analysis of gene
expression (SAGE) technology. J Immun Method
250:4566
Ye SQ et al (2000) MiniSAGE: gene expression profiling
using serial analysis of gene expression from 1 mg
total RNA. Anal Biochem 287:144152

Recent Advances in MAS in Major Crops

The amount of land available for crop production
is decreasing steadily due to urban growth and
land degradation, and the trend is expected to be
much more dramatic in developing than in
developed countries. This decrease in the
amount of land available for crop production, together with the
increase in human population, will have major
implications for food security over the next two
or three decades. Food insecurity and malnutrition
result in serious public health problems. Much of
the early rise in grain production resulted
from an increase in area under cultivation,
irrigation, better agronomic practices and, most
importantly, improved cultivars through conventional breeding strategies. However, yields of
several crops have already reached a plateau in
developed countries, and therefore, most of the
productivity gains in the future will have to be
achieved in developing countries through better
natural resources management and crop improvement. It is in this context that marker-assisted
selection (MAS) will play an important role in
food production in the near future. MAS offers
plant breeders access to a wide array
of novel genes and traits, which can be incorporated
into high-yielding and locally adapted cultivars.
This approach offers rapid introgression of novel
genes and traits into elite agronomic backgrounds.
Though MAS has been successfully applied to
several crops (see chapter 9), only four crops are
discussed in detail in the sections below.

11

Rice
Rice (Oryza sativa L.) is an intimate part of the
culture, food habits and economy of many societies
and is one of the most important crops for mankind. It is the staple food of more than three
billion people, and it accounts for 50-80% of
their daily calorie intake. To meet the growing
demand for food and to sustain food security for
people in low-income countries, rice production
has to be raised by another 70% over the next
three decades. This means raising rice yields
above the current level, assuming these countries can
maintain their rice-growing area at current levels.
For the irrigated ecosystem, the rice yield
will be difficult to raise above the current levels of
5-6 t/ha. The potential for increasing yield in the
rainfed ecosystem is vast, as the current yield is
only about 2.0 t/ha (compared with attainable
yields of 5.0 t/ha). Nearly 40% of the total rice area is
grown under rainfed conditions, and future
increases in rice production will rely on rainfed
ecosystems. Hence, this section describes the
importance of MAS in genetic improvement of
rice under water-limited environments. As with
the complex trait of drought tolerance, MAS can
also be applied to genetically improve other
complex characteristics such as pest and disease
resistance, nutrient improvement and other quality
and agronomic traits.
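The production target quoted above, a 70% increase over three decades, can be converted into an equivalent compound annual growth rate with a short calculation. The function name is hypothetical, and constant year-on-year growth on a fixed rice area is assumed.

```python
def required_annual_growth(total_increase, years):
    """Constant annual growth rate that compounds to the given total increase."""
    return (1.0 + total_increase) ** (1.0 / years) - 1.0

rate = required_annual_growth(0.70, 30)
print(f"{rate * 100:.2f}% per year")
```

Under these assumptions the target corresponds to yield growth of a little under 2% per year, sustained for 30 years.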

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_11, Springer India 2013


Rice and Drought


Rice is a heavy consumer of water, requiring around
5,000 liters of water to produce 1 kg of rice, and
is less efficient in the way it uses water than either
wheat or maize. In Asia, where 90% of all rice is
grown and the vast majority of it is consumed,
72% of freshwater resources are used for irrigating rice crops. However, water availability has
been shrinking as domestic and industrial demand
has increased. In the tropics of South and Southeast
Asia, only 41% of the rice area is irrigated. Yield
loss due to drought is 227 kg/ha (20% of average
yield) in the upland ecosystem. In a typical year,
abiotic stresses decrease rice yields by about 15%
in Asia, more than twice the damage caused by
biotic stresses. Almost half of the land planted to
rice in Asia and almost all of the rice in Africa is
rainfed and the yields are seriously limited by water
stress. Thus, drought is clearly the most important abiotic constraint in the upland ecosystem.
Rice is the main food of 65% of the population
in India. It constitutes about 52% of the total food
grain production and 55% of total cereal production. Rice environments in India are extremely
diverse. Since the major portion of the area under
rice in India is rainfed, production is strongly
tied to the distribution of rainfall. In some of the
states, erratic rainfall leads to drought during
the vegetative period, but later on the crop may
be damaged by submergence due to high rainfall.
Improving the yield of rainfed rice can be
achieved by selecting directly for yield under
stress in a breeding program. However, the ability
to select for yield is severely hampered by year-to-year variability in rainfall pattern and low heritability of yield under water stress. Consequently,
it has been suggested that improvements in yield
could be achieved more efficiently by identifying
secondary traits that allow a plant to escape, avoid
or tolerate water stress and selecting for those
traits in a breeding program.

Mechanisms of Drought Resistance in Rice
In general, the rice plant uses less than 5% of the
water absorbed through roots from the soil. The
rest is lost through transpiration, which helps to
maintain leaf energy balance of the crop. The
effect of water stress may vary with variety,
growth stage of the rice crop and degree and
duration of water stress. Two kinds of traits,
constitutive and adaptive, may confer
drought resistance in rice.
Constitutive traits are expressed under anaerobic,
non-water-stressed conditions, do not require water
stress for their expression and may demonstrate
variation that is subsequently modified by
adaptive traits. Adaptive traits, such as
osmotic adjustment (OA), are those
expressed in response to water deficit. Identifying
traits of importance in drought resistance is
difficult due to the complexity of climatic variation in precipitation and evapotranspiration, the
diversity of the rice hydrological environments,
the relationship between soil moisture status
and nutrient availability and the differential plant
interactions with this environment. Traits that
contribute to drought resistance in rice have been
reviewed by several researchers (see chapter 5).
All the traits have either positive or negative
influence on yield, depending on the existing
drought situation (timing, severity and duration)
and depending on whether a survival or production
mechanism is necessary. The best combination
of traits depends, therefore, on the nature of the
drought stress. This emphasises the need for a
good characterisation of drought occurrence in
the target area for breeding programs. The problem of adaptation to drought conditions in rice
is complex and unique as compared with most
other crops. The following traits have been
shown to be important for drought
resistance in rice.

Phenology
If a pattern of drought occurrence can be identified,
the plant can escape drought by having the most
sensitive phenological stages coinciding with the
periods of lower risks of drought stress either
through manipulation of the plant duration or
through manipulation of the cropping calendars.
For example, in a terminal stress situation, a
common phenomenon in South Asia, breeding
for short-duration varieties is a simple strategy
with proven efficacy. The duration of upland
varieties of Bangladesh and eastern India is
generally below 95 days, which matches the
short monsoon season. The role of plant developmental and phenological factors in affecting
crop response to drought stress, such as moderated
water use through reduced leaf area and shorter
growth duration, has already been discussed
elsewhere.

Root System
The possession of a deep and thick root system
that allows access to water deep in the soil
profile is considered crucially important in determining drought resistance. The trait may be less
important in rainfed lowland rice, where hardpans
may severely restrict root growth. Here, the
ability to penetrate a hard layer is considered
important. This trait may also be useful in upland
rice where high penetration resistance may limit
rooting depth and where soils will harden as they
dry. The penetration of roots through uniform
hard layers is probably achieved through the
possession of large root diameter which resists
buckling, but when the impedance is due to a
coarse textured sandy or stony horizon, thin roots
would penetrate more easily. The investment of
carbon in a deep root system may have a yield
implication because of loss of carbon allocation
to the shoot. The rapid development of deep or
thick root systems may, therefore, be of limited
value if terminal drought occurs early in the crop
cycle, but it is certainly important for intermittent
and later terminal drought situations. It is also
important to note that root growth is influenced
by the environment. Chemical or physical adverse
conditions such as low water potential or high/
low soil temperature directly inhibit root growth.
Biological factors in the rooting environment
such as root-feeding nematodes, termites, mites
and aphids can severely reduce root proliferation
or rooting depth and thereby affect drought resistance. The shoot environment can also indirectly
influence root growth either via carbon supply
or signalling process (e.g. light interception,
water status, nutrient status). At the genetic level,
the response of roots to the environment is poorly
understood because roots are intrinsically difficult
to study, particularly in the natural environment.
Irrespective of root axial resistance, a few
long roots can theoretically sustain reasonable
evapotranspirational demand at adequately high
leaf water potential. The ability of rice to reach
deep soil moisture or to penetrate compacted soil
is linked with the capacity to develop a few thick
(fibrous) and long root axes. Thick roots persist
longer and produce more and larger branch
roots, thereby increasing root length density
and water-uptake capacity. When drought stress
develops, the root/shoot dry matter ratio increases,
as shown in some studies. Sometimes, even
the absolute size of the root increases. Most certainly, root morphology and distribution change.
Improving drought resistance through breeding
on root traits is limited by the need for labour-intensive, destructive and expensive
phenotyping protocols. Whatever the desirable root
ideotype may be, it would be extremely difficult
to perform selection based on measuring the root
phenotype.

Osmotic Adjustment
Osmotic adjustment (OA) is increasingly recognised in several crop plants as an effective component of drought resistance, which has a positive
direct or indirect effect on plant productivity
under drought stress. Generally, when cells are
subjected to slow dehydration, compatible
solutes are accumulated in the cytosol resulting
in the maintenance of cell water content against
the reduction in apoplastic water potential. The
compatible solutes (various sugars, organic
acids, amino acids, sugar alcohols or ions, most
commonly K+) differ with plant species and
genera. The main solutes that are responsible
for OA in rice under water-deficit conditions
have not been elucidated. Rice does not accumulate
glycine betaine because of a deficiency in choline
monooxygenase and betaine aldehyde dehydrogenase, the key enzymes involved in glycine
betaine synthesis. Rice accumulates proline, but
the extent of proline accumulation and its
contribution to OA have not been evaluated.
The support of leaf turgor by OA in rice was well
reflected in delayed leaf rolling when water deficit
developed. Results indicate that leaf rolling and
leaf death can be delayed by OA in rice. However,
more data are needed on the contribution of
OA to rice performance under different drought
stress conditions. Traditional upland cultivars
generally tend to excel in root growth and soil
moisture extraction capacity while lacking in OA.
These cultivars usually develop severe leaf dehydration and leaf rolling as soon as soil moisture is
depleted. It can be speculated that under upland
situations with deep soil moisture, there may
have been a selective advantage to deep and thick
root systems, which served to maintain high
leaf water status and dehydration avoidance.
Under such conditions, deep roots have evolved
in adapted materials. OA did not evolve under
such conditions because plants were usually
avoiding severe water deficit. The capacity for
OA may have evolved where leaf tissue water
status was often reduced by water deficit, such
as in lowland rice where deep rooting is often
deterred by the subsoil compaction. These different
modes of response to drought stress require
validation and further research to suggest clues to
desirable breeding strategies with respect to the
different rice environments.

Dehydration Tolerance
Dehydration tolerance (the ability of leaves to
tolerate desiccation-level water stress) helps
plant organs to survive short-term water deficits.
The lowest leaf water potential that leaves reach
just prior to death (lethal leaf water potential) has
been used to determine dehydration tolerance.
During terminal stress, dehydration tolerance
may allow plants to maintain metabolic activity
for a longer time and to translocate more stored
assimilates to the grain. Plants with the ability
to adjust osmotically or tolerate dehydration
may delay leaf rolling, delay stomatal closure
and maintain leaf expansion with little cost, which
should promote resistance particularly in the
terminal drought situation. So if dehydration
tolerance of rice is increased by breeding approaches, then it could be possible to increase or at
least stabilise the yield of rainfed rice. As reported
in some studies, genotypic variation for dehydration tolerance capacity of rice is large. However,
incorporation of this trait in breeding programs
is hampered by complex experimental protocols
requiring heavy investment in creating controlled
environment facilities.

Shoot-Related Drought-Resistance Traits
Leaf Rolling
Several mechanisms of drought resistance are
associated with the shoots of rice. Leaf rolling
(drought avoidance) reduces the water loss in
addition to reducing the leaf area exposed to heat
and light radiation. Varieties differ in their ability
to roll leaves under similar water deficit. There is
some evidence that enhanced ability to roll leaves
confers a yield advantage under drought conditions.
However, most breeders regard the onset
of leaf rolling as a sign that the plant is suffering
and select against its early manifestation.
Green Leaf Area
It has been suggested that plants which are able
to retain green leaf area are better able to recover
after drought and give good yield. Leaf drying,
often used in field scoring, is the reverse side
of the stay-green ability and has been shown
to be correlated with leaf relative water content.
However, it has proved difficult to separate the
green leaf retention from the possible underlying
mechanisms of drought resistance since the
process of drought recovery in terms of mechanisms, importance or genetic variation is poorly
understood.
Stomatal Closure and Canopy
Temperature
Another mechanism of drought avoidance in the
rice shoot is fast stomatal closure, which acts to
reduce water losses. Varietal differences in the
sensitivity of stomatal conductance to leaf water
status do exist. The contribution of stomatal
conductance to drought performance in the field
is yet to be identified. However, a plant with
sensitive stomata would only be adapted to
relatively severe drought; during
mild drought, rapid stomatal closure would
reduce photosynthesis when there is no need to
do so. Canopy temperature can also be used,
since low canopy temperature may indicate more
favourable soil moisture conditions. This characteristic could be valuable in selection, but measuring it requires extremely uniform soils to
eliminate any subsoil spatial variation.

Cell Membrane Stability


The cell membrane is one of the main cellular
targets common to different stresses. The extent
of its damage is commonly used as a measure of
tolerance to various stresses in plants such as
freezing, heat, drought and salt. Cell membrane
stability (CMS) or the reciprocal of cell membrane
injury is a physiological index widely used
for the evaluation of drought and temperature
tolerance. This method was developed for a
drought and heat tolerance assay in sorghum and
measures the amount of electrolyte leakage from
leaf segments. Its reliability as an index of heat
stress tolerance is supported in several plant
species by good correlation between CMS and
plant performance in the field under high temperature and water stress. The genetic variation
in heat tolerance in various crops has been studied
using CMS as one of the component traits.
Phenotype selection for CMS may not always
lead to accurate results for breeding purposes
because of its complex nature and its strong interaction with the environment. Thus, the evaluation
of this trait should be done in a controlled environmental situation.
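The electrolyte leakage assay described above is usually reduced to a single index calculated from conductance readings. The text does not give a formula; the sketch below uses one formulation that is common in the literature, which is an assumption here, and the readings in the example are invented.

```python
def cell_membrane_stability(t1, t2, c1, c2):
    """Cell membrane stability (%) from electrical conductance readings.

    t1, c1: conductance of stressed and control leaf samples after incubation
    t2, c2: conductance of the same samples after total electrolyte release
            (for example after autoclaving)
    """
    return (1.0 - t1 / t2) / (1.0 - c1 / c2) * 100.0

# Hypothetical readings in arbitrary conductance units.
cms = cell_membrane_stability(t1=40.0, t2=100.0, c1=10.0, c2=100.0)
print(f"CMS = {cms:.1f}%, injury = {100.0 - cms:.1f}%")
```

The control readings correct for leakage that is unrelated to the stress treatment, so a genotype with less stress-induced leakage receives a higher stability value.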
Water Use Efficiency
Connected to stomata and leaf rolling is water
use efficiency (WUE, the ratio of carbon
gained to water used). Analysis of WUE generally
relies on measuring carbon isotope discrimination. This has been shown to vary between rice
varieties, suggesting that upland varieties need
less water for every molecule of carbon fixed.
A plant that is more water use efficient should


be more successful in a drought environment,
particularly late in the growing season when
transpiration accounts for the majority of total
evaporation. WUE can be either positively or
negatively related to production under stress,
depending largely on the genotype's
capacity to sustain transpiration, so WUE alone
is questionable as a selection criterion.
Therefore, WUE can even be a misleading parameter if selection for high WUE is performed under
drought stress where genotypic variation in deep
soil moisture extraction is possible. It is realised
that results from selection for WUE (by carbon
isotope discrimination) depend very much on the
environmental conditions in which such selection
is performed. It also seems that the results from
selection for high WUE may be unpredictable.
In several crops, the correlations between WUE
and dry matter production were inconsistent
in experiments conducted over different water
regimes and years.
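
The two quantities discussed above can be illustrated with a minimal sketch. It assumes the commonly used definition of carbon isotope discrimination, Delta = (delta_air - delta_plant) / (1 + delta_plant/1000), with delta values in per mil. The genotype names, the delta values and the air value of -8 per mil are hypothetical and serve only as an illustration. In C3 plants such as rice, a lower Delta is generally taken as an indication of higher WUE, subject to the environmental caveats described in the text.

```python
def water_use_efficiency(carbon_gained_g, water_used_kg):
    """WUE as the ratio of carbon (or dry matter) gained to water used."""
    if water_used_kg <= 0:
        raise ValueError("water_used_kg must be positive")
    return carbon_gained_g / water_used_kg


def carbon_isotope_discrimination(delta_air, delta_plant):
    """Discrimination (per mil) from delta13C of air and of plant tissue."""
    return (delta_air - delta_plant) / (1.0 + delta_plant / 1000.0)


# Hypothetical values: delta13C of air assumed to be -8 per mil.
DELTA_AIR = -8.0
samples = {"upland_line": -26.0, "lowland_line": -29.0}

discrimination = {
    name: carbon_isotope_discrimination(DELTA_AIR, delta_plant)
    for name, delta_plant in samples.items()
}

for name, value in discrimination.items():
    print(f"{name}: discrimination = {value:.2f} per mil")
```

Because discrimination is measured on tissue that integrates the whole growth period, a ranking obtained this way reflects the environment in which the plants were grown, which is one reason the text cautions against using WUE alone as a selection criterion.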

Epicuticular Wax
It has been repeatedly shown that total crop dry
matter production is linearly and positively
related to crop transpiration. This relationship
is partly derived from the fact that the control of
both transpiration and CO2 exchange is dependent
on stomatal activity. However, loss of water can
also occur through non-stomatal pathways for
which no return in CO2 fixation is expected. Non-stomatal resistance to water loss from leaves
can also be considered a drought-avoidance
mechanism. An important non-stomatal pathway
is the leaf cuticle. Research suggests that rice has
a low cuticular resistance to water loss compared
with other grasses but variation between varieties
exists, and this may have potential in breeding
for improvement in drought resistance. The
fact that traditional upland rice cultivars have
relatively higher epicuticular wax supports the
hypothesis that high epicuticular wax is an important
drought-resistance attribute in rice. The specific
effects of the amount, the composition and the
form of cuticular wax in rice were explored, but
the quantification of these factors with respect
to rice performance under drought stress is still

needed. Further physiological and biochemical
work is required to logically link cuticular
resistance and epicuticular wax with drought
resistance and to allow efficient manipulation in breeding programs.

Other Traits
The value of improving the use of absorbed
light, resistance to photoinhibition and capacity
for non-photochemical quenching to improve
drought resistance of rice has been described.
In addition, a genetic basis for difference in
resistance to photoinhibition in rice has been
demonstrated. These traits are physiologically,
biochemically and genetically complex in themselves and interact with each other. Since abscisic
acid (ABA) has been shown to be involved in
regulating stomatal conductance, OA and root
conductivity, interest has been shown in measuring
ABA contents in order to establish relationships
with drought resistance. Varietal differences
in leaf ABA content and sensitivity to applied
ABA also exist in rice.
In summary, a utilisable secondary trait in
breeding for drought resistance in rice should be
(1) genetically associated with grain yield under
drought, (2) highly heritable, (3) stable and feasible
to measure and (4) not associated with yield loss
under ideal growing conditions. However, although
several such traits have been described above,
they are rarely selected for in traditional
rice improvement programs because phenotypic
selection for these traits involves complex,
difficult and labour-intensive protocols, because environments and water
availability are tremendously diverse and because large genotype × environment interactions complicate selection.
Physiological studies indicate
that the ability of the root system to exploit
deep soil moisture and the capacity for OA
during water stress are major
drought-resistance traits in rice. These traits can also be
negatively correlated owing to tight genetic linkage
of some of the controlling genes, as was shown
for OA and root morphology. Therefore, the
impact of one trait in isolation may be difficult to
establish. One promising approach is to map genetic
loci (quantitative trait loci, QTL) influencing

drought-resistance traits and crop productivity
in stressful environments. Once tightly linked
markers have been identified, they can be used to
develop a marker-assisted selection (MAS) strategy
for breeding applications. Molecular markers
allow breeders to track the genetic loci controlling
drought resistance without measuring the phenotype, thus reducing the need for extensive field
testing over space and time. High-resolution
mapping and physical mapping can then be used
to isolate the drought-resistance genes by
map-based cloning. The genes of
interest can be used in functional studies and crop
improvement through genetic transformation.
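
The idea of tracking a locus with linked markers instead of the phenotype can be sketched as follows. This is only an illustration under stated assumptions: the line names, the marker names M1 and M2 and the allele codes "A" (recurrent parent) and "B" (donor) are hypothetical, and a real MAS scheme would also consider background markers and genotyping error.

```python
def select_lines(genotypes, flanking_markers, favourable="B"):
    """Return the lines carrying the favourable allele at every flanking marker."""
    selected = []
    for line, calls in genotypes.items():
        if all(calls.get(marker) == favourable for marker in flanking_markers):
            selected.append(line)
    return selected


# Hypothetical marker calls at two markers flanking a root-depth QTL.
genotypes = {
    "line_01": {"M1": "B", "M2": "B"},
    "line_02": {"M1": "A", "M2": "B"},  # recombinant between the two markers
    "line_03": {"M1": "B", "M2": "B"},
    "line_04": {"M1": "A", "M2": "A"},
}

chosen = select_lines(genotypes, ["M1", "M2"])
print(chosen)
```

Requiring the favourable allele at both flanking markers discards recombinants such as line_02, in which the QTL allele can no longer be inferred with confidence.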

Genetic Linkage Map in Rice


Construction of a linkage map is essentially the
first step in QTL mapping. Such maps allow genetic
dissection of QTL, facilitate high-resolution
genetic mapping and positional cloning of important genes, assist in local comparisons of synteny
within and across species and provide an
ordered scaffold on which complete physical
maps can be assembled. Recent progress in DNA
markers and their linkage maps has provided
efficient tools and methods for mapping individual
loci controlling not only monogenic but also
polygenic traits. For rice, the first molecular
marker-based genetic map was constructed by
McCouch et al. in 1988, and since then several
linkage maps have been constructed in rice using
different mapping populations, including high-density restriction fragment length polymorphism
(RFLP) maps and expressed sequence tag (EST)
maps. These maps provide the foundation for
molecular genetic analysis of almost any trait
of interest and thus have a number of advantages
over classical genetic maps for genetic research
and breeding.
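
Linkage maps express distances in centimorgans (cM), which are obtained from observed recombination fractions through a mapping function. The sketch below shows the two standard functions, Haldane and Kosambi. It assumes a population in which the recombination fraction r between two markers is observed directly (for example doubled haploids or a backcross), and the counts used are hypothetical.

```python
import math


def recombination_fraction(recombinants, total):
    """Observed proportion of recombinant individuals between two markers."""
    if total <= 0:
        raise ValueError("total must be positive")
    return recombinants / total


def haldane_cm(r):
    """Haldane map distance in cM (assumes no crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM (allows for crossover interference)."""
    if not 0.0 <= r < 0.5:
        raise ValueError("r must lie in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


# Hypothetical example: 12 recombinants among 120 doubled haploid lines.
r = recombination_fraction(12, 120)
print(f"r = {r:.2f}, Haldane = {haldane_cm(r):.2f} cM, Kosambi = {kosambi_cm(r):.2f} cM")
```

For small r both functions give distances close to 100 × r, and they diverge as r approaches 0.5, which is why the mapping function used should be stated when map lengths from different studies are compared.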

QTL Mapping of Drought-Resistance Traits in Rice
The availability of high-density linkage maps is
a valuable resource for studies that genetically
dissect complex traits such as drought
resistance. QTL mapping provides a potential
tool for conducting physiological and genetic
research to understand and improve drought
resistance. It eases screening for traits that are
difficult to quantify and are influenced by environmental stimuli.
Good progress has been made in identifying
molecular markers linked to various drought-resistance traits in rice. The author of this book and
his colleagues have published two review papers based on the available literature,
which are freely available on the web (see https://sites.google.com/site/drnmboopathi/). Table 11.1
summarises, as examples from selected publications,
the QTL identified for different
drought-resistance traits and their flanking markers
in different mapping populations. The first
QTL associated with various root morphological
characters were reported by Champoux et al. in 1995 in a CO39/Moroberekan recombinant inbred (RI) line population
grown under greenhouse conditions.
They also identified QTL linked to
drought avoidance in the field under water-deficit
stress at three different growth stages using the
same mapping population. It is encouraging to
note that over 50% of the putative QTL associated with root characters in the greenhouse study
mapped to the same chromosomal locations as
QTL influencing drought avoidance in the field
experiments. Using the same RI lines, Ray et al.
in 1996 mapped QTL for root penetration ability
using a wax-petrolatum layer. As in the previous study,
QTL associated with root traits were clustered.
This suggests that genes determining
root morphology may be clustered in certain
chromosomal regions, or that these regions contain
genes with pleiotropic effects.
Most of the QTLs linked to tiller number
mapped close to chromosomal regions identified
as associated with total root number. These results
suggest that molecular markers could play a
significant role in studying the relationship between
shoot- and root-related drought-resistance traits.
This issue can be investigated further in a rice
population developed specifically for the purpose
of studying these traits. An analysis was also
conducted using a subset of this population
to identify and map QTL associated with dehydration tolerance and OA by Lilley and her team
in 1996, and the identified QTLs were compared
with root traits and leaf rolling scores measured in
the same lines. It is interesting to note that the
putative OA locus and two of the dehydration
tolerance QTL on chromosome 8 were close to
the regions associated with root morphology.
Their results suggested that OA and
dehydration tolerance are negatively correlated
with root morphological characters associated
with drought avoidance. High OA and dehydration
tolerance were associated with CO39 (indica) alleles,
and extensive root systems were associated with
Moroberekan (japonica) alleles. It was suggested
that to combine high OA with extensive root
systems, the linkage between these traits needs
to be broken.
It is obvious that QTL detection depends on
the cross combination used in the analysis
because detection of QTL is based on allelic
differences in QTL between parental lines. Thus,
an important question is whether QTLs detected
in one population are shared with QTL detected
in other populations. QTL analysis of the same
traits using different cross combinations will be
necessary to answer this question. In this context,
several studies used a doubled haploid
(DH) population derived from the IR64/Azucena
cross to map the genes controlling root
morphology and distribution. The main QTLs
were common between traits, which indicates
that it may be possible to modify several aspects
of root morphology simultaneously. The sd-1
locus on chromosome 1, which has a large effect
on plant height and tillering, was found to
co-locate with QTL governing the root system in
these studies. However, the QTL on chromosome 7
that was associated with effects on maximum
root depth did not seem to be linked with a QTL
for plant height. This suggests that it may be
possible to decrease the height of traditional tall
upland rice varieties without diminishing the
quality of their root system. In addition, those reports
identified several common QTL depending on
the traits. Development of isogenic lines would
help to clarify the proper value of the common

127 (RFLP)

127 (RFLP)

CO39/Moroberekan, 281 F7 RILs (52)

Linkage map
coverage (cM) Traits
Root thickness
Root-shoot ratio
Root dry weight
per tiller
Deep root weight
Maximum root depth
Drought avoidance
(leaf rolling)
Number of penetrating
roots
Total number of roots
Root penetration
index
Tiller number
Dehydration tolerance
Osmotic adjustment
Relative water content

4
19
6
10
5
1
2

8
4
18

Across
population

QTL identified
Across trials/
No. of QTL experiments
18

16

14

14
36
32
35

19
13

18.5

35

35

Lilley et al. (1996)

Ray et al. (1996)

Maximum
phenotypic
variance (%) References
56
Champoux et al.
(1995)
38
11

CO39/Moroberekan, 281 F7 RILs (202)

Parents
Populationa
CO39/Moroberekan, 281 F7 RILs (203)

Number and type


of markers used
127 (RFLP)

Table 11.1 Details of mapping population, linkage map characteristics and QTL identified for drought-resistant traits in rice from selected publications


150
BC3F3(142)
135 DH
(90, 84,
56 & 109)

IR62266/
IR60080
IR64/Azucena

167 (RFLP, SSR,


candidate genes)
260 (RFLP, SSR,
RAPD, isozymes)

249 (RFLP, SSR,


cDNA-AFLP)

2,457

1,370
Days to flowering
Plant height
Grain yield
Harvest index
Days to maturity
Root thickness
Root volume
Root dry weight
Maximum root length

Seminal root
length
Relative seminal
root length
Adventitious
root number
Relative adventitious
root number
Lateral root length
Relative lateral
root length
Lateral root number
Relative lateral
root number
Osmotic adjustment

7
1
4
1
2
1

2
2
1
1
1
1
1
1
1

1
1
1

12

19

24.6
20.0
15.7
19.7
20.4
26.9
29.1
30.7
12.9

25.0

11.7
12.3

14.4
11.9

15.0

18.2

13.9

13.4

Venuprasad et al.
(2002)

Robin et al. (2003)

Zheng et al. (2003)

DH doubled haploids, RIL recombinant inbred lines, BC backcross progenies, RFLP restriction fragment length polymorphism, RAPD random amplified polymorphic DNA,
SSR simple sequence repeats, cDNA complementary DNA, AFLP amplified fragment length polymorphism
a Subset of population used for phenotyping is indicated in parentheses

150 RIL (96)

IR1552/Azucena

QTL by eliminating the confounding effects
of other genomic regions and to fine-tune their
location.
QTLs controlling drought-avoidance mechanisms (such as leaf rolling, leaf drying, relative
water content of leaves and relative growth rate
under stress) have been analysed in this DH population in three field trials with different drought
stress intensities at two sites.
Some of the QTLs were common across the trials
and traits. QTLs detected for leaf rolling, leaf
drying and relative water content mapped
to the same locations as QTL controlling root
morphology in the previous study using the same
population. QTL identified for leaf rolling in this
population were located in positions similar to those of QTL
for leaf rolling in other populations. In contrast,
when a randomly
chosen subset of 56 DH lines derived from this
cross was grown in polyvinyl chloride cylinders
to study root morphology and associated traits
under well-watered conditions and low-moisture
stress at two growth stages during the vegetative
phase, 15 QTLs
were detected across both growth stages, and
only three were common between the stages.
This reveals that different sets of QTL are detected
at different developmental stages within the
vegetative phase itself. Further, the absence of common
QTL for root traits between the two developmental
stages and two moisture regimes in this study
suggests the existence of parallel genetic pathways
operating at different growth stages and moisture
regimes. Using a wax-petrolatum layer system
that simulates compacted soil layers, root traits
were evaluated in a subset of these DH lines.
QTLs for root penetration index, penetrated root
thickness, penetrated root number and total root
number have been located. Common QTLs linked
to root penetration index and basal root thickness
were noted across experimental systems and
genetic backgrounds. This suggests that both root
penetration ability and root thickness may be
controlled by genes that are closely linked or
have pleiotropic effects. No QTLs for maximum
penetrated root length were detected by interval
mapping, although five RFLP markers were
found to be significantly associated with this trait using

single-marker analysis. Root length is known
to be highly sensitive to environmental variation
and therefore is more difficult to improve than
other root traits such as root thickness.
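
Single-marker analysis, mentioned above, can be sketched as a comparison of phenotype means between marker genotype classes. The example below is a simplified illustration, not the procedure of any study cited here: it computes the proportion of phenotypic variance explained by one marker (R squared) in a population of homozygous lines such as DH lines or RILs, and the phenotype values and allele calls are hypothetical.

```python
from statistics import mean


def single_marker_r2(phenotypes, marker_calls):
    """Proportion of phenotypic variance explained by one marker.

    Lines are grouped by marker allele and the between-group sum of
    squares is divided by the total sum of squares.
    """
    groups = {}
    for value, allele in zip(phenotypes, marker_calls):
        groups.setdefault(allele, []).append(value)

    grand_mean = mean(phenotypes)
    ss_total = sum((value - grand_mean) ** 2 for value in phenotypes)
    if ss_total == 0:
        raise ValueError("phenotype shows no variation")
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2 for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical root length (cm) of eight lines and their allele at one marker.
root_length = [10, 12, 11, 11, 8, 9, 7, 8]
marker = ["A", "A", "A", "A", "B", "B", "B", "B"]

r2 = single_marker_r2(root_length, marker)
print(f"Variance explained by the marker: {100 * r2:.1f}%")
```

Unlike interval mapping, this approach tests each marker separately and does not estimate the position of the QTL between markers, so a marker can be significant here while no QTL is declared by interval mapping, as reported above for maximum penetrated root length.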
Another population extensively analysed for
QTL linked to drought resistance is Bala/Azucena,
developed by Price and his team. They reported
the construction of a linkage map and its use in
mapping the QTL controlling maximum root
length at various stages of root development,
adventitious root thickness and root volume in
an F2 population. QTL for different days/stages
showed different types of genetic effect. Some
QTLs observed in the Bala/Azucena population
are evident in the CO39/Moroberekan population,
while some are not. The same population was used
for mapping two shoot-related mechanisms,
namely, stomatal conductance and leaf rolling,
along with heading date. This F2 population was
advanced to F6, and a more detailed linkage
map was constructed to analyse the QTL for root
penetration ability using a modified wax-petrolatum
layer. It is interesting to note that some of the
QTLs for root penetration ability reported there
are close to QTL for root morphology reported in
the F2. However, the differences in the reported
locations of QTL between this study and similar
studies are probably due to the different populations
studied and to the different methods used for
assessing the root penetration phenotype. Comparison of the QTL identified in this study with
previous reports of QTL for root morphology
suggests that alleles which improve root penetration ability may also make the roots either
longer or thicker. In another study, QTLs for
drought avoidance based on field trials in the
Philippines and West Africa were localised.
QTLs for leaf rolling, leaf drying and relative
water content were mapped for each site and
across sites. However, there was relatively
poor correlation between traits measured at the
two sites, suggesting that different
genetic components may contribute to drought
resistance in the different environments. The
same experimental materials were used to map
QTL for root morphology and distribution
using soil-filled chambers exposed to contrasting
water-deficit regimes. QTLs for deep root
weight, maximum root length, root-shoot ratio,
number of deep roots and root thickness were
identified. Some were revealed only in individual
experiments and/or for individual traits, while others
were common to different traits or experiments.
A comprehensive dissection of the
physiological and morphological traits related to
drought resistance, partitioning of drought
resistance into components and comparative QTL
analysis would contribute to a better understanding
of the genetic basis of drought resistance in
plants. The parents CT9993 and IR62266 were
studied at the morphological and physiological levels
and shown to differ in root system and OA.
In order to better understand the mechanisms of
drought tolerance via OA and drought avoidance
via a deep root system in rice, a molecular dissection of QTL for both OA and root traits in one
genetic background is important. Hence, genomic
regions responsible for CMS were studied in the
greenhouse in a slowly developing drought-stress
environment using rice DH lines derived from
CT9993/IR62266. No significant correlation was
found between CMS and relative water content,
indicating that the variation in CMS was genotypic
in nature. Nine putative QTLs were located
for CMS, and one of them, on chromosome 8,
mapped to the same locus as a QTL for OA.
Moreover, several QTLs involved in root morphology and drought avoidance in rice have been
identified in this region. The mapping of a CMS
QTL in this region suggests that it might
contain genes for different traits responsible for
conferring drought resistance in rice. The same
DH lines were used to map the QTL associated
with root traits and OA. Consistent QTL for
drought responses across genetic backgrounds
were detected. Comparative mapping identified
three conserved regions associated with various
physiological responses to drought in several grass
species. This result suggests that these regions
conferring drought adaptation have been conserved across grass species during genome evolution and might be directly applied across species
for the improvement of drought resistance in
cereal crops.
Rice develops roots under anaerobic soil
conditions with ponded water prior to exposure
to aerobic soil conditions and water stress in
rainfed lowlands. Constitutive root system development in anaerobic soil conditions has been
reported to have a positive effect on subsequent
expression of adaptive root traits and water
extraction during water stress (Kamoshita et al.
2008). The effect of phenotyping environment on
identification of QTL for constitutive root morphology traits was studied using greenhouse
experiments, and the results emphasised the need for
careful selection of a phenotyping environment
that relates closely to the target environment
where the traits are to be expressed, and for careful interpretation of results, since QTL may otherwise be misplaced. In spite of large environmental
effects, even in well-watered anaerobic conditions,
stable QTL were identified across the experiments in CT9993/IR62266 DH lines. Physical
mapping of the putative QTL for deep root
morphology traits would help to elucidate how
rooting depth and deep root mass are genetically
controlled at the molecular level. QTLs linked to
plant height, number of tillers, total root number,
root dry weight, total plant length and root to
shoot length ratio were identified in this population under well-watered conditions. Some of the
alleles governing the root-related traits were from
IR62266, which indicates that the inferior parent can
also contribute favourable alleles for root traits.
Drought-resistance component traits, described above, can interact with each other in modifying the plant water status. The real test for
drought resistance is continuous growth and
production under stress. Three traits, which
perhaps encapsulate all the drought-resistance
components, are leaf expansion (as an indication
of plant turgor), biomass production and ultimately
grain production under stress. Although previous
analysis indicated the map positions of QTL
associated with drought-resistance traits and their
co-location, the effects of those traits on plant
production under drought have to be properly
established. Thus, there is a need to determine
whether the QTLs linked to drought-resistance
traits also affect yield under stress. By comparing
the coincidence of QTL for specific traits and
QTL for plant production under drought, it is
possible to test whether a particular constitutive

or adaptive response to drought stress is of
significance in improving field level drought
resistance. Such associations would also improve
the efficacy of MAS in breeding for drought
tolerance in rice. QTLs associated with grain
yield and root morphological traits were mapped
in IR64/Azucena DH population under contrasting
moisture regimes. CT9993/IR62266 DH lines were
used to identify the QTL linked to rice performance under drought and to genetically dissect
the nature of association between drought-resistance traits and yield under drought in the field.
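
The comparison of QTL positions described above can be sketched as a simple interval-overlap check. This is a minimal illustration, assuming each QTL is summarised by a chromosome and a confidence interval in cM on a common map. All traits, chromosomes and positions below are hypothetical and are not taken from Table 11.1.

```python
def colocated(qtl_a, qtl_b):
    """True if two QTL are on the same chromosome and their intervals overlap."""
    return (
        qtl_a["chrom"] == qtl_b["chrom"]
        and qtl_a["start_cm"] <= qtl_b["end_cm"]
        and qtl_b["start_cm"] <= qtl_a["end_cm"]
    )


def shared_regions(trait_qtl, yield_qtl):
    """List (component trait, yield trait, chromosome) for co-located QTL pairs."""
    return [
        (a["trait"], b["trait"], a["chrom"])
        for a in trait_qtl
        for b in yield_qtl
        if colocated(a, b)
    ]


# Hypothetical QTL for component traits and for yield under stress.
trait_qtl = [
    {"trait": "maximum root depth", "chrom": 7, "start_cm": 40.0, "end_cm": 55.0},
    {"trait": "osmotic adjustment", "chrom": 8, "start_cm": 10.0, "end_cm": 22.0},
    {"trait": "leaf rolling", "chrom": 1, "start_cm": 120.0, "end_cm": 135.0},
]
yield_qtl = [
    {"trait": "grain yield under stress", "chrom": 1, "start_cm": 130.0, "end_cm": 150.0},
    {"trait": "grain yield under stress", "chrom": 8, "start_cm": 30.0, "end_cm": 45.0},
]

shared = shared_regions(trait_qtl, yield_qtl)
print(shared)
```

Overlapping intervals indicate only that the two QTL cannot be separated at the resolution of the map. Whether this reflects pleiotropy or linkage has to be resolved by finer mapping or near-isogenic lines.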

Rice Subspecies and Habitat


Rice is cultivated in four continents, and very
large germplasm collections are available offering
many possibilities of identifying adaptive traits
and tolerance characters towards abiotic stresses.
Cultivated rice belongs to the Oryza sativa complex, which contains the two cultivated species,
O. sativa and O. glaberrima, and several wild
species, which are considered as direct ancestors
of the cultivated ones. O. sativa is cultivated all
over the world, whereas O. glaberrima is cultivated
only in Africa. Within the O. sativa species, two
major groups of ecogeographic races are distinguished, the indica and japonica types. They
roughly correspond to rice grown in tropical
regions of Southeast Asia and in more temperate
regions of Japan and northern China, respectively.
Indica and japonica varieties cross-hybridise, but
usually many plants in the progeny are sterile or
partially sterile. Large and well-known genetic
diversity exists at the subspecies level and is a
valuable resource for both classical and biotechnology-assisted breeding.
Most of the populations used in QTL analysis
of drought-resistance traits were derived from
indica/japonica crosses because of the high
frequency of polymorphism arising from their wide
variation. Development of a deep and extensive
root system is one adaptive strategy of plants
for drought avoidance. Upland japonica cultivars
appear to rely on their deep and extensive root
systems to achieve their demonstrated capacity for
drought avoidance, whereas indica cultivars have
different adaptive strategies including shortening
of growth duration and tissue-level tolerance.
Whether a drought-avoidance strategy based
almost entirely on a well-developed root system
in a japonica background can be combined with
tissue-level tolerance and/or short growth duration to improve plant performance under water
stress in specific environments is a question
which is central to drought-resistance breeding in
cereals. The phenomenon of return to parental
type after repeated generations of selfing following indica/japonica hybridisation is familiar to
rice breeders and makes it difficult to obtain
favourable recombinants through traditional
means. Differential adaptation to edaphic factors,
such as soil, water and temperature regimes, and
genetically controlled sterility barriers separate
these two major subspecies. Evaluation of upland
japonica/lowland indica populations under
anaerobic lowland conditions may be confounded
by the difference in adaptation to lowland conditions. Cross combinations used in breeding
programs are mainly same-ecotype crosses, such
as japonica/japonica and indica/indica.
Therefore, more QTL analyses based on crosses
between closely related varieties, especially
indica/indica crosses, will be necessary for
identification of QTL alleles which will be useful
in rice breeding. Ali et al. in 2000 analysed RILs
developed from two indica parents, IR58821/
IR52561, to map QTL for root traits in two
different seasons. They identified not only
common QTL between the two seasons but also
consistent QTL across genetic backgrounds. The
effect of phenotyping environment and genetic
background on QTL identification was examined
using this population. QTLs for shoot biomass, deep root morphology and root thickness
were mapped. Consistent QTLs across
experiments and genetic backgrounds were
detected. Results from these studies suggest
that some similarity exists between
japonica/indica crosses and indica/indica crosses
in the genetic control of root traits. Since then,
several studies have been conducted using such cross
combinations (e.g. see Gomez et al. 2010).

Marker-Aided Selection and Near-Isogenic Lines for Drought-Resistance Improvement
The QTL presented above, associated with different
drought-resistance mechanisms assessed at different sites and seasons and with different methodologies, confirm
the complexity of the genetics of drought resistance in rice. They also illustrate the degree of QTL
by year and QTL by site interaction and demonstrate the value of calculating averages for
identification of the more stable but small-effect
QTL. A significant proportion of the phenotypic
variability of several of these putative drought-resistance traits is explained by the segregation
of relatively few genetic loci, thus leading to the
possibility of indirect selection of these complex
traits using a MAS strategy. This information is
potentially valuable to breeders and enables
researchers to target specific regions in order to
produce near-isogenic lines (NILs) at some QTL.
These NILs will allow more accurate determination of environmentally stable QTL
and further allow assessment of the impact
of QTL on yield under drought. They could also
aid in the identification of the genes responsible
for the QTL through candidate gene and/or positional cloning approaches. Shen et al. in 2001
reported improvement of the rice root system by
MAS of several root QTL. They also studied
the possible effects of these introgressed segments
on other agronomic traits through pleiotropy
or linkage drag. Work has also been done to transfer the QTL for root morphological traits from
Azucena into a popular Indian variety, Kalinga
III, by MAS. NILs were developed for OA in a
japonica background. NILs will serve as valuable
material to test the utility of the introgressed QTL.
They will also help in understanding the
physiological and molecular mechanisms underlying
the QTL and in evaluating the contribution of
the QTL to yield in the target environment.

Target Population of Environment and Molecular Breeding
To improve the drought resistance of rainfed
lowland rice, mapping populations from crosses
between parental lines that are equally well
adapted to target environments should be evaluated (see also chapter 5). Focusing on the variation within a single ecotype might hasten progress
towards drought resistance, and the use of locally well-adapted germplasm will increase the efficiency of
breeding. Traditional rice varieties are still
being grown in rainfed uplands even though
they give a low but definite yield. There is a need
to develop rice varieties with higher yield that
retain the drought-tolerance capacity of traditional accessions. The necessity of QTL
identification based on the variation from
crosses between two related varieties belonging to
the same subspecies and adapted to the target population
of environment (TPE) has been emphasised by
various authors. Further, upland rice environments vary widely in terms of climate and edaphic
factors, making it difficult to use genetic material
developed for one location in other locations.
Most of the QTLs linked to drought-resistance
traits were flanked mostly by RFLP markers and a few
amplified fragment length polymorphism (AFLP)
markers. Though RFLP markers are reliable, they
involve tedious, time-consuming protocols and the
handling of hazardous radioactive chemicals. Hence,
they are not suitable for routine MAS. The RFLP
and AFLP markers need to be converted to
simple, rapid and inexpensive polymerase
chain reaction (PCR)-based markers, such as sequence-tagged sites (STS),
to enhance and economise the breeding programs.
This conversion involves extra effort, and the
converted marker must still detect the polymorphism
between the parents shown by the original RFLP
or AFLP marker. Identification of simple PCR-based nonradioactive markers linked to putative
drought-resistance component traits will hasten
MAS for drought-resistance improvement. SSRs,
inter-simple sequence repeats (ISSRs) and random amplified polymorphic DNAs (RAPDs) are
well-established PCR-based markers used
in the mapping process (see chapter 3).
The candidate gene approach has been applied
in plant genetics in the past decade for the characterisation and cloning of QTL (see chapter 10).
Candidate genes are genes involved in the expression of a given trait. They can be identified either
from previously sequenced genes of known function or from cDNA libraries constructed specific

258

to different organs, developmental stages or stress


responses. Expressed sequence tags (ESTs) are
partial or single-pass sequencing of more or
less randomly chosen cDNA clones from libraries
at all stages of plant growth and development.
They allow fast and affordable gene identification.
Development of EST-based markers is dependent
on extensive sequence data of regions of the
genome that are expressed. They are highly
reproducible and can be directly associated
with functional genes. A number of ESTs specific
to drought response are now available in the EST
database (dbEST). It will be important to resolve
to what extent the allelic variation in these genes
affects drought tolerance in rice. Hybridisationbased RFLP markers have been developed
from ESTs and used extensively for the construction of high-density genetic linkage maps
in rice. The genetic factors underlying constitutive and adaptive morphological traits of roots
under different water-supply conditions were
investigated using RI lines derived from IR1552/
Azucena by exploiting the genetic map constructed with EST clones and cDNA-AFLP
clones. Two genes for cell expansion, OsEXP2
and endo-1,4-b-d-glucanase Ecase, and four
cDNA-AFLP clones from root tissues of Azucena
were mapped on the intervals carrying the QTL
for seminal as well as lateral root length. Robin
et al. in 2003 found a candidate gene that was
closely linked to QTL for OA. The tight linkage
between these candidate genes and the QTL for
root traits and OA may demonstrate a causal
relationship. However, further investigation of
these genes for stimulated root elongation under
water-limited stress in rice is needed before
drawing conclusions on what gene lies beneath
the QTL. The candidate genes used in these
studies were engaged as radioactive probe as
that of RFLP. Development of PCR-based
EST markers could be useful in QTL mapping
and efficient MAS for drought-resistance
improvement in rice. Further, ESTs allow a computational approach to the development of
SSR for which previous development strategies
have been expensive. Pattern-finding programs
can be employed to identify SSRs in the ESTs.
Readily available EST sequence information
allows the design of primer pairs, which can be
used to identify the length polymorphism among
the parental lines.
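
The pattern-finding step mentioned above can be sketched in a few lines of Python. This is only an illustration, not the program used in any cited study: the function name `find_ssrs`, the minimum of five tandem repeats and the example sequence are all assumptions made here. It reports perfect repeats of 2 to 6 nucleotide motifs, matching the definition of SSRs given later in this chapter.

```python
import re


def find_ssrs(seq, min_repeats=5):
    """Return (start, motif, repeat_count) for each perfect SSR in seq.

    min_repeats is an assumed threshold, not a value taken from the text.
    """
    seq = seq.upper()
    # A lazily matched motif of 2-6 bases followed by at least
    # (min_repeats - 1) further copies of the same motif.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            # Skip single-base runs such as "AAAA..." reported as motif "AA".
            continue
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


# Hypothetical EST fragment carrying an (AG)8 and a (CTT)6 repeat.
est = "TTGACC" + "AG" * 8 + "CCGTA" + "CTT" * 6 + "GGA"
ssr_hits = find_ssrs(est)
print(ssr_hits)
```

Primers would then be designed in the sequence flanking each reported start position, and the amplified fragment lengths compared between the parental lines.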

Concluding Remarks on MAS in Rice for Water-Limited Environments
Managed drought environments in the field, such
as dry season trials, delayed planting in the
wet season, use of high toposequence locations,
drainage, raised beds and large-scale rainout
shelters, have been developed to simulate the
target environments for breeding. Selection for
higher grain yield under managed stress, partly
assisted by selection for secondary or integrative
traits such as low leaf rolling score, low spikelet
sterility and high drought-resistance index,
with their moderate to high degrees of heritability, shows promise. Understanding of genotypic
responses to drought is increasing. Resistance
traits differ under different types of drought (e.g.
terminal drought, vegetative stage drought and
intermittent drought), but genotypic responses
that contribute to drought avoidance (e.g. deep
and thick roots and conservative water use by
moderate plant size) and maintain higher plant
water status are often found to be more important
for higher yield under stress than are tolerance
mechanisms. Transgenic rice, engineered for
enhanced expression of primary induced traits
for drought tolerance, has been studied under
laboratory conditions, but the usefulness of these
lines under field drought conditions remains to
be tested. QTLs for constitutive primary traits
such as deep roots and plant-type traits such as
plant height contributed more to phenotypic expression than QTL for induced traits and
were identified across different populations
under both well-watered and stress conditions.
The QTLs for root traits and plant-type traits,
together with QTL for plant water status, were
more often co-located with integrated traits such
as grain yield under stress. Although it is unlikely
that a single primary or secondary trait will improve
rice resistance to different types of drought, selection of some of the QTL clusters containing multiple
drought-resistance traits is promising.

In spite of the large amount of information
on QTL linked to various drought-resistance
traits, routine use of these QTLs in MAS is not
widely practised. The accuracy of phenotyping in
these QTL mapping studies is one concern.
Further, use of molecular approaches may be
limited because of the need to consider a large
number of QTL with individually small effects.
The effects that MAS for such QTL will have on
improvement of plant breeding can be estimated
by the use of simulation models. Development
of near-isogenic lines for these QTL will allow
testing of their true agronomic value. Several labs
are currently working on MAS introgression of
these QTL into locally adapted elite rice lines.

Cotton
Cotton (Gossypium spp.) is a commercial natural fibre crop of global importance and generates high employment at various
stages. Though synthetic/man-made fibres have
made inroads, cotton retains a prime position in India, where it has been
cultivated for more than 5,000 years.
Globally, India ranks first in cotton area but
occupies second position in production, next
to China. Cotton contributes significantly
to the Indian economy, earning more than 30%
of foreign exchange.
India has the distinction of growing all
four cultivated cotton species, namely, Gossypium
arboreum, G. herbaceum, G. barbadense and
G. hirsutum. Among the four species, the tetraploid
(or allopolyploid) species G. hirsutum L. and G.
barbadense L. account for 90 and 8% of
world cotton production, respectively. Though
India is a major cotton-growing and cotton-consuming
country, commercial cotton lint produced in India
falls within a narrow fibre quality spectrum, and hence
several thousand bales of cotton lint suited to modern
textile industries are imported. Thus, it is
imperative to improve the fibre quality of the
cotton cultivars.
Conventional breeding methods have contributed much to the development of high-yielding
cotton cultivars. But, the efficiency of fibre
genetic improvement still remains to be resolved
due to negative association between lint yield and
fibre quality. The long-term challenge faced by
cotton breeders is the simultaneous improvement
of yield and fibre quality traits to meet the
demands of the cotton cultivators as well as the
modernised textile industry. Textile industry is
based on measurable quality factors, and often
this is the area where technological changes
are being rapidly implemented. All the changes
in spinning technology require unique and often
greater cotton fibre quality, especially strength,
for processing. Strong fibres survive the rigours
of ginning, cleaning, carding, combing and
drafting. Besides fibre strength, fibre length and
fibre fineness are the other key qualities that
influence textile processing. Usually, G. hirsutum
accessions possess high yield, and G. barbadense
accessions have superior fibre quality traits.
Though considerable progress has been made in
the past, the current genetic information and
conventional plant breeding methods involving
interspecific hybridisation between G. hirsutum and G. barbadense cannot lead to quick
improvement of fibre quality. This may be due
to the long duration and low
selection efficiency of such cross combinations.
These attempts have also resulted in poor agronomic qualities of the progeny, distorted segregation, sterility, mote formation and limited
recombination due to incompatibility between
the genomes.
On the other hand, quantitative trait loci (QTL)
mapping and marker-assisted selection (MAS)
offer new avenues to overcome the above-said
limitations. Molecular markers are employed to
construct genetic linkage maps, which can be
used to understand the genetic basis of, and to
improve, complex polygenic traits
such as fibre quality. The identification of markers tightly
linked to stable QTL affecting
fibre traits across generations would be useful
in MAS and thus increase the efficiency of
breeding programs. Thus, the identification of
DNA markers linked to the fibre quality QTL
would allow cotton breeders to trace this very
important trait in early plant growing stages or
in early segregating generations.

Status of Cotton Molecular Marker Technology
DNA marker technology has enormous potential
to improve the efficiency and precision of conventional plant breeding via MAS. The advantage
of MAS over conventional breeding is that
selection is simpler than phenotypic selection and
can be done at the seedling stage itself
(a single plant or even a small leaf sample is enough
to determine the genotype at the gene or QTL for the particular
trait). Thus, DNA marker technology provides
a valuable tool for plant breeders to select
desirable plants directly on the basis of genotype
rather than phenotype. Advances in the use of
DNA markers for QTL identification and MAS have
shown promise for streamlining plant breeding
programs. For example, genetic maps constructed
using crosses of upland cotton (Gossypium
hirsutum L.) and Egyptian cotton (Gossypium
barbadense L.) have led to the identification of
several QTLs for fibre strength, fineness and
length (see Table 11.2).
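
Selection on genotype rather than phenotype can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the plant identifiers, the marker names `SSR_1` and `SSR_2`, the allele codes and the data layout are assumptions made for illustration only. The function keeps seedlings that carry the favourable allele at every marker flanking a target QTL.

```python
def select_seedlings(genotypes, favourable):
    """Return IDs of plants carrying the favourable allele at every marker.

    genotypes:  {plant_id: {marker_name: allele_call}}  (assumed layout)
    favourable: {marker_name: favourable_allele_call}
    """
    return sorted(
        plant
        for plant, calls in genotypes.items()
        if all(calls.get(marker) == allele for marker, allele in favourable.items())
    )


# Hypothetical marker calls scored from seedling leaf samples.
seedlings = {
    "plant_01": {"SSR_1": "BB", "SSR_2": "BB"},
    "plant_02": {"SSR_1": "BB", "SSR_2": "AB"},
    "plant_03": {"SSR_1": "AA", "SSR_2": "BB"},
    "plant_04": {"SSR_1": "BB", "SSR_2": "BB"},
}
target = {"SSR_1": "BB", "SSR_2": "BB"}
selected = select_seedlings(seedlings, target)
print(selected)
```

Requiring the favourable allele at markers on both sides of the QTL is a common design choice because a single recombination between one marker and the QTL then does not go unnoticed.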

Molecular Markers and Polymorphism in Cotton
Though modern G. hirsutum and G. barbadense
cultivars show significant variation for important
traits including fibre production, pest resistance
and tolerance to environmental adversities such
as heat, cold and drought, these cultivars exhibit
narrow genetic diversity. Decrease in genetic
diversity is harmful to future breeding programs.
Molecular markers are playing a critical and
increasing role in the analysis of genetic diversity
in cotton cultivars. Wild Gossypium germplasm
harbours many valuable traits including disease
and insect resistance, stress tolerance and fibre
quality attributes. DNA markers in construction
of genetic maps would be useful in introgression
of alien genes into cultivated cotton species.
Molecular markers have been
recognised as an essential tool for plant breeding
because they are neutral,
lack epistasis and are simply inherited as Mendelian
characters. Efficient construction of a genetic map
requires well-spaced polymorphic markers for
the given parents. Hence, selection of a marker
system that serves the above purpose is the key
step in MAS.
To overcome the paucity of a particular type
of DNA markers, genetic maps were developed
by incorporating different classes of markers.
For example, Lacape and his group have constructed a combined restriction fragment length
polymorphism (RFLP)–simple sequence repeat
(SSR)–amplified fragment length polymorphism (AFLP) map based on an interspecific
G. hirsutum × G. barbadense backcross population
of 75 BC1 plants. The map consists of 888 loci
ordered into 37 linkage groups and spans
4,400 cM. This map was updated, mostly with new
SSR markers, to contain 1,160 loci that spanned
5,519 cM with an average distance between loci
of 4.8 cM. Similarly, SSRs, SRAP, RAPD and
retrotransposon–microsatellite amplified polymorphisms (REMAPs) were also employed to
construct cotton linkage maps. Due to conservation
of genomic regions in cotton, a combination of
different types of molecular markers is required
to obtain a sufficiently saturated linkage map.
However, use of simple, cost-effective
marker types may have promising applications
in the Indian scenario. Among the different types
of molecular marker system used to study the extent of
diversity in cultivated cotton, SSR markers are
the best for assessing genetic variation within
cultivated diploid and tetraploid cotton.
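
The average spacing quoted above is simply map length divided by the number of loci, which a short Python sketch makes explicit. The function names, the 20 cM gap threshold and the marker positions in the second example are assumptions for illustration; only the map lengths and locus counts come from the text.

```python
def mean_marker_spacing(map_length_cm, n_loci):
    """Average distance between loci, computed as map length / number of loci."""
    return map_length_cm / n_loci


def large_gaps(positions_cm, max_gap_cm=20.0):
    """Return (left, right) marker positions whose interval exceeds max_gap_cm.

    max_gap_cm is an assumed threshold for a 'well-spaced' map.
    """
    ordered = sorted(positions_cm)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_gap_cm]


# Figures reported in the text for the original and the updated map.
original_spacing = mean_marker_spacing(4400, 888)
updated_spacing = mean_marker_spacing(5519, 1160)
print(round(original_spacing, 2), round(updated_spacing, 2))

# Hypothetical marker positions on one linkage group.
gaps = large_gaps([0.0, 4.5, 9.8, 35.0, 38.2])
print(gaps)
```

An average hides uneven coverage, which is why the second function looks at individual intervals: a map can have a small mean spacing and still leave long stretches without any marker.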

Simple Sequence Repeats (SSRs) in Cotton
Though several types of DNA markers are available, simple sequence repeats (SSRs) are
considered the markers of choice in many
crop-breeding activities. SSRs, or microsatellites, are short, tandemly repeated DNA sequence
motifs that consist of two to six nucleotide core
units. They are highly abundant in eukaryotic
genomes but also occur in prokaryotes at lower frequencies. The regions flanking the microsatellites
are generally conserved, and PCR primers complementary
to the flanking regions are used to amplify SSR-containing DNA fragments.

Table 11.2 Selected examples in QTL mapping for agronomic, yield and fibre quality traits in cotton

Population type    Species involved
F2                 G. hirsutum × G. barbadense
RILs               G. hirsutum × G. hirsutum
F2                 G. hirsutum × G. hirsutum
RILs               G. hirsutum × G. hirsutum
BC, F2             G. hirsutum × G. barbadense
BC3F2              G. hirsutum × G. tomentosum
RIL                G. hirsutum × G. hirsutum

Entries of the remaining columns, in source order (row alignment not recoverable):
QTL reported for: uniformity ratio; fibre elongation; micronaire; lint percentage; boll size; lint percentage; reniform nematode resistance; fibre fineness; fibre strength; fibre length; earliness; micronaire; 2.5% span length; elongation percentage; bundle strength; fibre length; fibre thickness; fibre elongation; fibre strength
Chromosome number/linkage group: Chr.14; Chr.7, Chr.13, Chr.18, Chr.24, Chr.25; Chr.4, Chr.7, Chr.14, Chr.18, Chr.23, Chr.25; Chr.3, Chr.4, Chr.5, Chr.7, Chr.14, Chr.16, Chr.19, Chr.25; Chr.4, Chr.7, Chr.13, Chr.14, Chr.25; Chr.4, Chr.7, Chr.13, Chr.14, Chr.15, Chr.18, Chr.25; LGD02; Chr.20; Chr.22; LGD03; Chr.10; LGA02; LGD03; LGD04; Chr.3, Chr.5, Chr.13; Chr.12, Chr.13, Chr.14, Chr.20; Chr.14, Chr.20, Chr.26; Chr.5, Chr.9, Chr.12, Chr.16, Chr.20, Chr.26; Chr.26; D08; D08; Chr.21
Maximum phenotypic variance observed (%): 13.4; 11.5; 19.1; 11.9; 27.8; 20.6; 87.1; 35; 19; 15; 13.3; 9.7; 12.0; 14.7; 12.6; 14.0; 12.3; 8.1; 13.3; 38.6; 9.7; 13.7
References: Zhang et al. (2011); Sun et al. (2012); Gutierrez et al. (2011); Jenkins et al. (2010); Chen et al. (2010); Wu et al. (2009); Jiang et al. (1998)

Several methods have
been pursued to develop SSR markers in cottons,
including analysis of SSR-enriched small insert
genomic DNA libraries, SSR mining from
expressed sequence tags (ESTs) and large insert
BAC derivation by end sequence analysis or SSR-containing BAC subcloning. More than 16,000
SSRs had been developed in cotton and made
publicly available as of September 2012
(http://www.cottonmarker.org). It is considered that the
total pool of SSRs present in the cotton genome is
sufficient to satisfy the requirements of extensive
genome mapping and MAS. Several SSRs have been
assigned to cotton chromosomes by making use
of aneuploid stocks. SSRs have been employed
to study the extent of genetic diversity among
cotton germplasm. Even though a few studies
revealed a low level of polymorphism within
G. hirsutum genotypes, other studies
clearly discriminated the evaluated germplasm and
resolved the phylogenetic evolution of Gossypium species.

Cotton Linkage Maps


As in most plant species, the early application of
DNA markers in cotton genomic research was
in the form of RFLPs. It is, therefore, not
surprising that the first molecular linkage map of
Gossypium was constructed from an
interspecific G. hirsutum × G. barbadense F2 population based on RFLPs by Reinisch et al. in 1994,
who assembled 705 RFLP loci into 41
linkage groups with an average spacing between
markers of about 7 cM. This map was later
extended to span 4,447.9 cM of the cotton
genome with 2,584 loci at intervals of about 1.7 cM,
covering all 26 chromosomes of
the allotetraploid cottons and representing the most
complete genetic map of Gossypium to date.
Many of the DNA probes of the map were also
mapped in crosses of the D-genome diploid species G. trilobum × G. raimondii and the A-genome
diploid species G. arboreum × G. herbaceum.
Detailed comparative analysis of
gene orders between the tetraploid AD subgenomes and the maps of the A and D diploid
genomes has revealed intriguing insights into the
organisation, transmission and evolution of the
Gossypium genomes. Later, an F2 population was
derived from a cross between homozygous lines
G. hirsutum cv. TM-1 and G. barbadense cv. 3-79
at the USDA-ARS in Texas, and segregation data
of 171 F2 individuals of this cross were obtained
for 868 genetic markers. These markers were
mapped into 50 linkage groups spanning
nearly 5,000 cM of the cotton genome.
A trispecific F2 population was also developed
from three different cultivars to study inheritance
patterns of segregating loci and to establish linkage groups among three genome species. Besides
interspecific linkage maps, intraspecific maps
have also been constructed by several researchers to
investigate the cotton genome and identify molecular
markers linked to agriculturally important genes/
QTL. The linkage maps so far constructed in cotton
have helped to determine the chromosomal location
of many agronomically important characters
such as yield, fibre quality,
bacterial blight resistance, pubescence, stomatal
conductance, verticillium wilt resistance and
leaf morphology.

QTL Mapping for Yield and Fibre Quality Traits in Cotton
Because most measures of cotton quality and
productivity are polygenic, QTL mapping is a
high priority in many research programs. Selected
noteworthy findings from QTL mapping for yield and fibre quality in cotton
are summarised in Table 11.2. Comparison of QTL across these studies
revealed poor consistency
among populations. Although some QTLs were
found to be located on the same chromosomes in
different populations, no common markers could
prove that they were the same QTL. Only a
few stable and common QTLs have been reported
up to now, owing to non-replicated experiments and
difficulty in assigning linkage groups. To
identify stable QTL for routine molecular breeding
programs, we need to integrate different maps of
intraspecific and interspecific populations, and for
this it is important to work with a fixed population
and a common set of molecular markers.

Specific Challenges in Cotton MAS


Despite the achievements described above,
genetic improvement of cotton faces some specific
challenges because of its polyploid genome structure, large genome size and so forth;
these are described below.

Challenges with Mapping Populations


Detection of QTL is often limited by several
factors such as genetic properties of QTL,
environmental effects, population size and
experimental error. Hence, it is desirable to
independently confirm QTL mapping studies.
Such confirmation studies may involve independent populations constructed from the same
parental genotypes or closely related genotypes used in the primary QTL mapping study.
Sometimes, larger population sizes may also be
used. Furthermore, some recent studies have
proposed that QTL positions and effects should
be evaluated in independent populations because
QTL mapping based on typical population sizes
results in a low power of QTL detection and a
large bias of QTL effects.
Unfortunately, due to constraints such as lack
of research funding and time and possibly a lack
of understanding of the need to confirm results,
QTL mapping studies are rarely confirmed.
Validation of conserved fibre quality QTL
across populations has not been conclusive
because the majority of these QTL studies
were based on small and mortal (F2 or
backcross (BC)) populations. Compared with
F2 or BC populations, homozygous immortalised recombinant inbred lines (RILs) constitute the preferred
material for QTL mapping in many crops. RILs
have not been widely utilised in cotton except
in some cases, mainly due to long development
timelines and difficulties in producing
sufficient seed. Though there is no clear rule for
the precise population size that is required for
QTL analysis, it is increasingly believed that
sampling limited numbers of progeny in mapping studies tends to skew the distribution of QTL effects and to identify only a limited
number of QTL, even if many genes with equal
and small effects actually control the trait.
Further, in several published reports, the number
of linkage groups exceeds the gametic chromosome number (n = 26), and numerous linkage
groups are yet to be associated with specific
chromosomes mainly due to lack of informative
markers and use of small sample size. Moreover,
common identities and common nomenclature
have yet to be established among many linkage
groups in the laboratory-specific maps. Physical
coverage of the cotton genome by these linkage
maps also remains unknown. In most of the published maps, the markers were not uniformly
spaced over many linkage groups. It is suggested
that such regions may be heterochromatin or
gene rich. Clusters of markers with very limited
recombination were frequently present which
may be indicative of QTL-rich (gene-rich) regions
of cotton.
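
The effect of population size on detection power and on the bias of estimated QTL effects can be explored with a small simulation. The sketch below is an assumption-laden toy model, not a reconstruction of any published analysis: one marker fully linked to one QTL, two homozygous genotype classes as in RILs, unit residual variance, a true effect of 0.3 standard deviations and a fixed significance threshold of |t| > 3 are all choices made here for illustration.

```python
import random


def _mean_var(values):
    """Sample mean and unbiased sample variance."""
    m = sum(values) / len(values)
    v = sum((x - m) ** 2 for x in values) / (len(values) - 1)
    return m, v


def simulate_qtl_scan(n_lines, true_effect, n_reps=1000, t_threshold=3.0, seed=1):
    """Return (detection power, mean estimated effect among detected QTL).

    Each replicate draws n_lines individuals, splits them at random into two
    marker classes and tests the difference in phenotype means.
    """
    rng = random.Random(seed)
    detected = []
    for _ in range(n_reps):
        allele_a, allele_b = [], []
        for _ in range(n_lines):
            if rng.random() < 0.5:
                allele_a.append(rng.gauss(0.0, 1.0))
            else:
                allele_b.append(rng.gauss(true_effect, 1.0))
        if len(allele_a) < 2 or len(allele_b) < 2:
            continue  # cannot estimate a variance in this replicate
        mean_a, var_a = _mean_var(allele_a)
        mean_b, var_b = _mean_var(allele_b)
        estimate = mean_b - mean_a
        std_err = (var_a / len(allele_a) + var_b / len(allele_b)) ** 0.5
        if abs(estimate / std_err) > t_threshold:
            detected.append(estimate)
    power = len(detected) / n_reps
    mean_estimate = sum(detected) / len(detected) if detected else float("nan")
    return power, mean_estimate


small = simulate_qtl_scan(50, true_effect=0.3)
large = simulate_qtl_scan(500, true_effect=0.3)
print("n=50 :", small)
print("n=500:", large)
```

Under these assumptions the small population detects the QTL far less often, and the effects it does report are inflated, because only replicates in which sampling error happened to exaggerate the effect pass the threshold.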

QTL × Environment Analysis


Relatively large numbers of QTL were detected
for fibre quality traits, and most of the detected
QTLs explained less than half of the total
genetic variation. What causes the remaining
genetic variation that is unexplained by QTL in
large samples? One possibility is that there are
many QTLs with very small effects, as assumed
in classical models of quantitative genetics,
and these remain undetected even with very
large sample sizes. Another possibility is the
higher-order epistatic interactions, which are
refractory to QTL mapping. Further, a recurring complication in the use of QTL data is that
different parental combinations and/or experiments conducted in different environments
often result in identification of partly or wholly
nonoverlapping sets of QTL. The majority
of such differences in the QTL landscape are
presumed to be due to environment sensitivity
of genes. Hence, including QTL ×
environment interaction analysis, which has
been limited in the published literature,
will advance the progress of QTL mapping towards MAS.
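
What a QTL × environment interaction means in practice can be shown with a minimal worked example in Python. The environments, genotype codes and phenotype values below are hypothetical, and the calculation is only a descriptive contrast of allele effects; a real analysis would fit a statistical model with an interaction term.

```python
from statistics import mean


def qtl_effects_by_environment(records):
    """Return {environment: mean(B) - mean(A)} for one marker.

    records: iterable of (environment, genotype, phenotype), with genotype
    coded "A" or "B" (an assumed coding for the two homozygous classes).
    """
    groups = {}
    for env, genotype, value in records:
        groups.setdefault(env, {"A": [], "B": []})[genotype].append(value)
    return {env: mean(g["B"]) - mean(g["A"]) for env, g in groups.items()}


# Hypothetical fibre-length records (mm) at one marker in two environments.
records = [
    ("irrigated", "A", 28.0), ("irrigated", "A", 29.0),
    ("irrigated", "B", 31.0), ("irrigated", "B", 32.0),
    ("rainfed", "A", 27.0), ("rainfed", "A", 28.0),
    ("rainfed", "B", 27.5), ("rainfed", "B", 28.5),
]
effects = qtl_effects_by_environment(records)
interaction = effects["irrigated"] - effects["rainfed"]
print(effects, interaction)
```

If the allele effect were the same in both environments the contrast would be close to zero; a large contrast is the signature of an environment-sensitive QTL that would be detected in one trial and missed in the other.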

Incongruence Among QTL Studies


The use of stringent statistical thresholds to infer
QTL while controlling experiment-wise error
rates is another reason for identification of only
a small fraction of these nonoverlapping QTL.
Small QTL with opposite phenotypic effects
might occasionally be closely linked in coupling
in early-generation populations and separated
only in advanced-generation populations after
additional recombination. Comparison of multiple
QTL mapping experiments by alignment to a
common reference map offers a more complete
picture of the genetic control of a trait than can
be obtained in any one study. However, the lack of a
common set of anchor markers in the published
reports limits the comparison of QTL across
genetic backgrounds.
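
Why shared anchor markers matter can be seen from a simple projection routine. The Python sketch below places a QTL position from one study on a reference map by linear interpolation between the two flanking markers that both maps share. The function name, the anchor coordinates and the assumption that marker order is collinear between the maps are illustrative; the cited studies may have used different procedures.

```python
def project_position(pos_cm, anchors):
    """Project a position from a study map onto a reference map.

    anchors: list of (study_cM, reference_cM) pairs for markers present on
    both maps. Marker order is assumed to be the same on the two maps.
    """
    ordered = sorted(anchors)
    for (s0, r0), (s1, r1) in zip(ordered, ordered[1:]):
        if s0 <= pos_cm <= s1:
            if s1 == s0:
                return r0  # co-located anchors: nothing to interpolate
            return r0 + (pos_cm - s0) * (r1 - r0) / (s1 - s0)
    raise ValueError("position lies outside the interval covered by anchors")


# Hypothetical anchor markers shared by a study map and a reference map.
shared = [(10.0, 22.0), (30.0, 52.0), (50.0, 70.0)]
projected_a = project_position(20.0, shared)
projected_b = project_position(40.0, shared)
print(projected_a, projected_b)
```

Without at least two shared markers flanking a QTL the routine has nothing to interpolate between and raises an error, which mirrors the limitation described above.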

Complexities in Integration of Functional Genomics with QTL
Fibre gene function is highly conserved in the
genomes of wild and cultivated species, as well
as diploid and tetraploid species, despite millions
of years of evolutionary history. The phenotypic
variation in fibre properties therefore is more likely
one of quantitative differences in gene expression
as opposed to differences in the genotype at the
DNA level. Hence, further studies are required to
understand the number of copies of the genes,
their regulation and specific function in fibre
development. Though systematic transcriptomic
approaches can be combined with QTL analyses
(discussed below), these studies do not address
the occurrence of alternative splicing or the
posttranslational modifications of the proteins.
In addition, proteins can move in and out of
macromolecular complexes, thereby modifying
their functionality. This level of complexity cannot
be tackled using transcriptomics alone, and
hence it is vital to include proteomics in MAS.
On the other hand, biochemical functions of only
a small proportion of the identified proteins have
been demonstrated and/or determined based on
the assumption that proteins sharing conserved
domains have the same activity. Hence, the remaining
proteins (domains of unknown function) remain
a challenge for elucidation of their biological
function. In addition, quantitative data
on the proteome and metabolome are still in their
infancy, and protein–protein interactions and
interactions of proteins with other macromolecules remain to be
revealed. Therefore, complete knowledge of fibre
growth and development at the molecular level and
its integration with QTL mapping is essential
to design next-generation breeding strategies.

Alternatives and Future Perspectives


The value of MAS in routine cotton
breeding programs for fibre productivity and
quality has been realised in only a few reports.
This points to the need for several improvements
in the current methodologies and tools, and the
following strategies are proposed for successful
MAS in cotton.

Meta-analysis of QTL: Synergy Through Networks
Though QTLs for several common traits have been
mapped, direct comparisons cannot be made
since no common markers exist among these
studies. Detected QTLs are confined to a single family,
the sizes of QTL effects that can be detected are
limited, and inferences are restricted to a single
population and set of conditions. Thus, one direction for QTL analysis is to combine information
from several or many studies by meta-analysis.
Integration of QTL from different populations
into a common map facilitates exploration of
their allelic and homoeologous relationships,
though the level of resolution is limited by comparative marker densities, variation in recombination rates in different crosses, variation in gene
densities across the genome and other factors.
Using a high-density reference genetic map
consisting of 3,475 loci in total, Rong and his team
reported the alignment of 432 QTL mapped in one
diploid and ten tetraploid interspecific cotton
populations, depicted in a CMap resource.
Similarly, the Lacape group conducted a meta-analysis
of more than 1,000 QTLs obtained from the RIL
and BC populations derived from the same
parents and reported consistent meta-clusters
for fibre colour, fineness and length. As per their
discussion, although their result on cotton fibre
can hardly support the optimistic assumption that
QTLs are accurate, they have shown that the reliability of QTL-calls and the estimated trait impact
can be improved by integrating more replicates
in the analysis. Hence, it is imperative to verify
the regions of convergence with new maps which
share common markers with the consensus map.
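
One simple way to combine QTL estimates that have already been projected onto a common map is inverse-variance weighting, a generic fixed-effect meta-analysis technique. The Python sketch below is offered only as an illustration of the idea; it is not claimed to be the procedure used by the groups cited above, and the positions and standard errors are hypothetical.

```python
def pooled_qtl_position(estimates):
    """Inverse-variance weighted mean of QTL positions on a common map.

    estimates: list of (position_cM, standard_error_cM), one per study.
    Returns (pooled_position_cM, pooled_standard_error_cM).
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    position = sum(w * pos for w, (pos, _) in zip(weights, estimates)) / total
    return position, (1.0 / total) ** 0.5


# Hypothetical estimates of the same QTL from two studies.
equal_precision = pooled_qtl_position([(40.0, 2.0), (46.0, 2.0)])
unequal_precision = pooled_qtl_position([(40.0, 1.0), (50.0, 3.0)])
print(equal_precision, unequal_precision)
```

The pooled standard error is smaller than that of any single study, which is the sense in which adding replicates improves the reliability of a QTL call; the more precise study also pulls the pooled position towards its own estimate.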

Map-Based Cloning
As QTL mapping results accumulate over the
coming years, attention will turn to cloning QTL and
then to using them. This requires higher resolution of QTL mapping, combined with a dense
marker map. A centimorgan (cM), corresponding
to a recombination frequency of 1%, can span 10–1,000 kbp
and can vary across species or even within a
chromosome of a given species. This region
may contain both desirable and undesirable
genes, and hence to avoid the linkage drag of
undesirable traits, it is important to establish
the causal relationship between the QTL and
phenotype using positional or map-based cloning. The physical size of a cM in cotton is not
prohibitive to map-based cloning, but the lengthy
genetic map will require a large number of markers
in order to be sufficiently close to most genes for
chromosome walking. A new high-throughput
marker type, SNPs, is gaining importance in this
context, but the huge initial investment for SNP generation necessitates simple, innovative and economical
marker techniques. It is also important to note
that instead of using anonymous DNA markers,
development and use of gene-specific functional
markers such as SRAP, TRAP and PAAP (see
chapter 3) may increase the efficiency of map-based cloning.
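
The relation between genetic and physical distance discussed above can be made concrete with a rough calculation in Python. The genome size of about 2,500 Mb used for tetraploid cotton is an assumption introduced here, not a figure from this chapter; the map length of 4,447.9 cM is the one quoted earlier for the tetraploid map. The result is a genome-wide average only, since the ratio varies along chromosomes.

```python
import math


def kb_per_cm(genome_size_mb, map_length_cm):
    """Genome-wide average physical distance (kb) per centimorgan."""
    return genome_size_mb * 1000.0 / map_length_cm


def markers_needed(map_length_cm, target_spacing_cm):
    """Number of evenly spaced markers needed to reach a target spacing."""
    return math.ceil(map_length_cm / target_spacing_cm)


# Assumed genome size (Mb) combined with the map length quoted in the text.
average_kb = kb_per_cm(2500, 4447.9)
needed = markers_needed(4447.9, 1.0)
print(round(average_kb), needed)
```

Under these assumptions one centimorgan corresponds to a few hundred kilobases, inside the 10–1,000 kbp range given above, while even spacing at 1 cM would call for several thousand markers.
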
Further, map-based cloning in polyploids such
as cotton introduces a new technical challenge
not encountered in diploid (or highly diploidised)
organisms, namely, that virtually all single-copy DNA probes occur at two or more unlinked
loci. This makes it difficult to assign megabase
DNA clones to their site of origin. One possible
approach to this problem is the utilisation of
diploids in physical mapping and map-based
cloning.

Cotton Genome Sequencing


Decoding cotton genomes will be a foundation
for improving understanding of the functional
and agronomic significance of polyploidy and
genome size variation within the Gossypium
genus. The whole-genome shotgun sequence of
the smallest Gossypium genome, G. raimondii,
provided fundamental information about gene
content and organisation. This sequence will
be used to query homologous and orthologous
genomes and to investigate the gene and allele
basis of phenotypic and evolutionary diversity
for cotton improvement. A good parallel approach
may be to search for candidates in species that
have naturally superior fibre qualities.
Sequencing of the G. raimondii genome established
the critical initial template for characterising the
spectrum of diversity among the eight Gossypium
genome types and three polyploid clades and
provided a reference for sequencing many genomes in Gossypium species, which is essential for
further improvement of cotton.

Advances in Functional Genomics


Several studies performed to compare the structural differences in the genomes have shown that
the difference is in the expression pattern rather
than in the presence or absence of particular
genes. The comparison of gene expression profiling
between contrasting genotypes with respect to
fibre quality can be extended to transcription
profiling at the QTL level, and the genes identified
at such QTL may potentially be better candidates
for superior fibre quality. In addition to cDNA
and oligonucleotide microarrays, tiling path
arrays can also be used to study gene expression
in plants. The advantage of tiling path arrays
over conventional microarrays is that they are not
restricted to known gene structures and hence provide
unbiased and more accurate information about
the transcriptome. In addition, they provide knowledge on transcriptional control at the chromosomal level. The use of tiling path arrays could
help to provide a better understanding of the fibre
transcriptome at the genome-wide level, but it
is yet to be tried in cotton. This will result in
a paradigm shift from MAS to genomics-assisted
selection.

System Quantitative Genetics: Bridging Subdisciplines
The ultimate objective of QTL mapping is to
identify the causal genes or even the causal
sequence changes, the quantitative trait nucleotides
(QTNs). While this remains a major challenge,
it has been achieved in a few instances in other
crops. Identification of candidate genes and
enrichment of functional markers within small
targeted genomic regions are driven by the increasing availability of sequence resources and genomic
databases and by technological developments.
If functional candidate genes for a trait are not
known, co-location of candidate gene polymorphisms with map positions, linkage to QTL,
association of alleles with specific traits or the
identification of syntenic regions among genomes
can help to select positional candidate genes for
the trait. In another approach called genetical
genomics, gene expression profiles are quantitatively assessed within a segregating population,
and expression quantitative trait loci (eQTL)
can be mapped like classical QTL (see chapters 7
and 10). Though global eQTL mapping studies,
using whole-genome microarrays, have been
published in yeast, Arabidopsis, maize and
eucalyptus, such work is at a preliminary stage in cotton. In
addition, a comparative picture of transcript versus protein abundance indicates that functionally
important changes in the levels of the former are
not necessarily reflected in changes in the levels of
the latter. The same holds for metabolomes.
Hence, genes, proteins, metabolites and phenotypes should be considered simultaneously to
unravel the complex molecular circuitry that
operates within the cell. A complete elucidation
of the genotype–phenotype map does not seem to
be feasible unless we can include all possible
causal variables in the network-inference methodology. One has to take a global perspective
on life processes instead of focusing on individual components of the system. The network approach connecting all these subdisciplines indicates the
emergence of a system quantitative genetics.

Association Mapping and Alternatives
Association mapping provides another route to
identifying QTLs that have effects across a
broader spectrum of germplasm, if false positives
that are caused by population structure can be
minimised. In addition, QTL mapping in biparental populations reveals only a slice of the genetic
architecture for a trait because only alleles that
differ between the two parental lines will segregate. Therefore, more comprehensive analyses
of genetic architecture require consideration of
multiple populations that represent a larger
sample of the standing genetic variation in the
species. An important genetic resource developed
in recent years is the nested association mapping (NAM) population. The NAM
population is a novel approach for mapping genes
underlying complex traits, in which the statistical
power of QTL mapping is combined with the
high (potentially gene-level) chromosomal resolution of association mapping, and it has been
adopted in maize (see chapter 6). Although
sufficient diversity must be present in each association mapping panel, too much phenotypic
diversity (or poor adaptation to any specific growing environment) may make it difficult to phenotype a panel in an association study. Thus, more
region-specific association mapping panels may
need to be created that contain germplasm more
suited to specific growing regions.

Improved Databases
There is a great need to expand bioinformatic infrastructure for managing, curating and annotating
the cotton genomic sequences that will be generated
in the near future. The cotton genome sequence
and functional genomics database of the future
should be able to host and manage cotton information resources using community-accepted genome
annotation, nomenclature and gene ontology.
Some existing databases may be upgraded to
effectively handle a large amount of data flow
and community requests, but additional resources
will be sought to support key bioinformatic
needs.

Concluding Remarks for MAS in Cotton


Significant strides have been made, particularly in
characterising phenotypic and molecular diversity in the cotton
germplasm and identifying QTL linked to
fibre productivity and quality. Yet the application
of molecular marker-assisted breeding tools to
accelerate gains in cotton productivity has barely
begun, and there is vast potential and need to
expand the scope and impact of such innovative
breeding programs. Progress in this direction will
be further enhanced by bringing in the information
generated through omics studies. Further, as
discussed above, innovative
strategies, resource pooling and capacity building to deploy marker-assisted breeding in
cotton will eventually lead to the development of cotton
cultivars with improved productivity
and quality.

Mungbean
Pulses are important protein resources that help
meet the nutritional requirements of poor people
living in developing countries. Among them,
mungbean (Vigna radiata (L.) Wilczek) is one of
the most widely cultivated species throughout
the southern half of Asia, particularly
in rainfed areas. It is
adapted to short growing seasons, low water availability and nutrient-deficient soils of poor
fertility. It is popularly grown as a component
in various cropping systems because of its ability
to fix nitrogen in association with soil bacteria,
early maturity (approximately 60 days) and relative
drought tolerance. It is a self-pollinating diploid
plant with 2n = 2x = 22 chromosomes and a
genome size of 515 Mb/1C.
Despite its importance in the poor man's food
basket, mungbean genomic research has lagged
behind that of other crop species due to a lack of
polymorphic DNA markers. A limited number of
polymorphic SSR markers, the marker of choice,
have been published for mungbean. Therefore,
developing and identifying polymorphisms of the
SSR motifs of mungbean is an important requirement for mungbean development. Similarly,
single-nucleotide polymorphisms are the most
frequently found variation in DNA and are valuable
markers for high-throughput genetic mapping,
analysis of genetic variation and association
mapping studies in crop plants. Several methods
have been described for SNP detection such as
high-throughput sequencing technologies and
EcoTILLING. However, the discovery of SNP
markers based on transcribed regions has become
a common application in plants because of the
large number of ESTs available in databases,
and EST-SNPs have been successfully mined
from EST databases in non-model organisms.
A transcriptome is the set of all RNA molecules,
including mRNA, rRNA, tRNA and non-coding
RNA, produced in one cell or a population of
cells. Although the analysis of relative mRNA
expression levels might be complicated by the fact
that relatively small changes in mRNA expression can produce large changes in the total amount
of corresponding protein present in the cell, a
number of organism-specific transcriptome
databases have been constructed and annotated to
aid in identifying genes that are differentially
expressed in distinct cell populations or subtypes.
Unlike genome analysis, transcriptome analysis
offers a full profile of gene function information
under various conditions, because the transcriptome differs among environments, cell types, developmental
stages and cell states. It has been repeatedly shown that
transcriptome or EST sequencing is an efficient
way to generate functional genomic data for
non-model organisms.
Interestingly, some studies have focused
on the analysis of transcriptomic functions and
the investigation of SSR and SNP markers in mungbean.
Such studies can support a clear understanding of
the transcriptomic functions in mungbean and
can provide resource data for
crop improvement programs. Next-generation
transcriptome sequencing will serve as a superior
resource for developing polymorphic DNA markers, not only because of the enormous quantities
of sequence data in which markers can be discovered but also because the discovered markers
are gene-based. Such markers are advantageous
because they facilitate the detection of functional
variation and selection in genomic scans or genetic
association studies in mungbean. A large number of SSRs and SNPs are now available, and they
are potentially useful for multiple applications
ranging from population genetics, linkage mapping and comparative genomics to gene-based
association studies.

Genetic Diversity and Linkage Mapping in Mungbean
A large collection of mungbean germplasm
encompassing 415 cultivated (V. radiata var. radiata), 189 wild (V. radiata var. sublobata) and 11
intermediate accessions from diverse geographic
regions has been characterised using 19 azuki
bean SSRs. The results revealed that mungbean
has its highest diversity in South Asia, supporting
the view of its domestication in the Indian subcontinent, and showed that Australia and Papua
New Guinea are centres of diversity for wild
mungbean. A core collection of 106 accessions
representing the most genetically diverse of this
germplasm has been assembled. Despite the work
carried out on the Fabaceae, research into mungbean genetics and evolution is not as advanced
as in many other species.
Several linkage maps of mungbean have been
constructed (e.g. Menancio-Hautea et al. 1992;
Lambrides et al. 2000; Humphry et al. 2002)
upon which most marker research into this crop
has been based, but they do not provide the same
level of genome saturation seen in many other
species, mainly for the reason mentioned above.
These maps were constructed from the data of F2
or RIL populations from inter-subspecific crosses
of VC3980 (cultivated) × TC1966 (wild, from
Madagascar) or Berken (cultivated) × ACC41
(wild, from Australia) using mainly RFLP and/or
random amplified polymorphic DNA (RAPD)
markers. The population size ranged from 58
to 80 plants. The maps differ in length (737.9–1,570 cM),
number of markers (102–255 markers),
number of linkage groups (LG) (12–14) and level
(12–30.8%) and regions of marker distortion.
The most comprehensive map consists of 255
loci with an average distance between the adjacent
markers of 3 cM. However, most of the maps
do not resolve 11 LGs, which is the haploid
chromosome number of mungbean. To resolve
11 LGs and saturate the map, many more markers
are needed. In addition, the genome coverage of
the markers has yet to be determined.

QTL Mapping in Mungbean
QTLs for several traits encompassing azuki
bean weevil resistance, seed colour, seed weight,
hard-seededness, powdery mildew resistance
and Cercospora leaf spot resistance were mapped
with molecular markers in mungbean. Among
them, QTL linked to bruchid, Cercospora leaf
spot and yellow mosaic virus resistance are of
importance for genetic improvement of this
crop, and they are highlighted here. The bruchid-resistance gene (Br) has already been mapped
using an F2 population from a cross between the
resistant line TC1966 and a susceptible cultivar. Br is located on linkage group 9 of the
current mungbean linkage map. Mungbean has
a relatively small genome, ranging from
470 to 560 Mb. The current estimated genetic
size of the mungbean genome is about 1,570 cM.
The small genome size of mungbean may allow
us to apply a map-based cloning strategy to
isolate the resistance gene. Cloning of the Br
gene would aid not only the elucidation of the
synthetic pathway of the resistance factor(s)
but also the development of transgenic plants
harbouring resistance against a wide spectrum
of insect pests. In another study, molecular
markers tightly linked to the resistance
locus were reported following the construction of a
high-resolution linkage map.
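The feasibility of map-based cloning depends on how much physical DNA corresponds to a unit of genetic distance. The short sketch below simply divides the genome size range quoted above (470 to 560 Mb) by the estimated genetic size (about 1,570 cM) to obtain a genome-wide average in kb per cM. The function name is illustrative, and the result is only an average: the true ratio varies widely along a chromosome, so it should not be read as the distance around Br itself.

```python
def kb_per_cM(genome_size_mb: float, map_length_cM: float) -> float:
    """Average physical distance (kb) spanned by one centimorgan."""
    if map_length_cM <= 0:
        raise ValueError("map length must be positive")
    return genome_size_mb * 1000.0 / map_length_cM


MAP_LENGTH_CM = 1570.0             # genetic size quoted in the text
GENOME_RANGE_MB = (470.0, 560.0)   # genome size range quoted in the text

# Genome-wide averages for the lower and upper genome size estimates.
low, high = (kb_per_cM(mb, MAP_LENGTH_CM) for mb in GENOME_RANGE_MB)
print(f"roughly {low:.0f} to {high:.0f} kb per cM on average")
```

On this average, a marker lying 1 cM from a target gene would be separated from it by a few hundred kilobases, which is why tightly linked markers from a high-resolution map are needed before chromosome walking becomes practical.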
Cercospora leaf spot (CLS) caused by the
fungus Cercospora canescens Ellis and Martin
is a serious disease in mungbean, and the disease
can reduce seed yield by up to 50%. The QTL
analysis was conducted using F2 (KPS1 × V4718)
and BC1F1 [(KPS1 × V4718) × KPS1] populations developed from crosses between the CLS-resistant mungbean V4718 and the CLS-susceptible
cultivar Kamphaeng Saen 1 (KPS1). The results
of segregation analysis indicated that resistance
to CLS is controlled by a single dominant gene,
while composite interval mapping consistently
identified one major QTL (qCLS) for CLS
resistance on linkage group 3 in both F2 and
BC1F1 populations. qCLS was located between
markers CEDG117 and VR393 and accounted
for 65.5–80.53% of the disease score variation
depending on seasons and populations. An allele
from V4718 increased the resistance. The SSR
markers flanking qCLS will facilitate transferral
of the CLS resistance allele from V4718 into elite
mungbean cultivars.
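The segregation analysis mentioned above can be illustrated with a chi-square goodness-of-fit test. Under a single dominant resistance gene, an F2 is expected to segregate 3 resistant : 1 susceptible, and a backcross to the susceptible parent 1 : 1. The sketch below is a minimal illustration using only the Python standard library; the plant counts are hypothetical and are not data from the study, and the function names are chosen here for illustration.

```python
import math


def chi_square_gof(observed, expected_ratio):
    """Chi-square statistic for observed class counts against a Mendelian ratio."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def p_value_df1(chi2):
    """Upper-tail p-value for a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical counts (resistant, susceptible); illustration only.
f2_counts = (118, 42)      # tested against 3:1
bc1f1_counts = (47, 53)    # tested against 1:1

chi2_f2 = chi_square_gof(f2_counts, (3, 1))
chi2_bc = chi_square_gof(bc1f1_counts, (1, 1))

for name, chi2 in (("F2 3:1", chi2_f2), ("BC1F1 1:1", chi2_bc)):
    p = p_value_df1(chi2)
    verdict = "consistent with" if p > 0.05 else "deviates from"
    print(f"{name}: chi2 = {chi2:.3f}, p = {p:.3f} ({verdict} a single dominant gene)")
```

With two phenotypic classes there is one degree of freedom, so the p-value can be computed directly from the complementary error function without a statistics package.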
At present, yellow mosaic disease, caused by mungbean yellow mosaic virus
(MYMV), is the most important disease of mungbean worldwide. The disease is characterised by yellow mosaic on leaves of infected
plants that results in considerable yield losses.
MYMV is a bipartite begomovirus
transmitted by whiteflies (Bemisia
tabaci). Lambrides and his group tagged the
resistance gene from NM92 in two RIL populations, using a bulked segregant analysis (BSA) strategy. A marker generated
from RAPD primer OPAJ20 was found to be
distantly linked with the resistance gene. Inter-simple sequence repeat (ISSR) and SCAR
markers linked to the resistance in blackgram
have shown potential for locating the gene in
mungbean. Lambrides and Godwin suggested
that the mungbean probe Mng247, associated with
soybean mosaic virus resistance, might be useful
in identifying the MYMV resistance gene. In addition, the Mng247-derived SSR marker, M3Satt41,
may also be useful in this regard.

Legume Comparative Genomics and Its Importance in Mungbean MAS
Economically, legumes represent the second most
important family of crop plants after Poaceae
(grass family), accounting for approximately
27% of the world's crop production. On a worldwide basis, legumes contribute about one-third of
humankind's protein intake, while also serving
as an important source of fodder and forage for
animals and of edible and industrial oils. One of
the most important attributes of legumes is their
unique capacity for symbiotic nitrogen fixation,
underlying their importance as a source of nitrogen in both natural and agricultural ecosystems.
Legumes also accumulate natural products (secondary metabolites) such as isoflavonoids that
are beneficial to human health through anticancer
and other health-promoting activities.
The legumes are highly diverse and contain
several economically important crops such as
soybean (Glycine max), peanut (Arachis hypogaea),
mungbean (Vigna radiata), chickpea (Cicer
arietinum), lentil (Lens culinaris), common
bean (Phaseolus vulgaris), pea (Pisum sativum)
and alfalfa (Medicago sativa). Despite their close
phylogenetic relationships, crop legumes differ
greatly in their genome size, base chromosome
number, ploidy level and self-compatibility. Nevertheless, earlier studies indicated that members
of the legumes exhibited extensive genome conservation based on comparative genetic mapping.
Unlike many of the major crop legumes, Medicago truncatula and Lotus japonicus (selected as model
systems for studying legume genomics and biology)
have small genomes, are amenable to forward
and reverse genetic analyses and are well suited for
studying biological issues important to the related
crop legume species.
An immediate goal of legume genomics is
to transfer knowledge between model and crop
legumes. Accordingly, an in-depth understanding
of conservation of genome structure among
legume species is a prerequisite to achieving this
goal. The idea that conserved genome structure
can facilitate transfer of knowledge among related
plant species is best illustrated in grasses, in which
genome macrosynteny and microsynteny have
been extensively maintained.
It has been demonstrated that mungbean and
cowpea (Vigna unguiculata) exhibit a high
degree of linkage conservation, although chromosomal rearrangements have occurred since the
divergence of the two species. Comparative
mapping among mungbean, common bean and
soybean in the Phaseoleae tribe indicated that
mungbean and common bean linkage groups
were highly conserved, but synteny with soybean
was limited to short linkage blocks.
Use of a bridging species (soybean) revealed
that homoeologous segments of soybean chromosomes showed a higher degree of synteny with
chromosomes of common bean and mungbean
than previously thought.
Comparative mapping in mungbean and a distantly related legume crop, lablab, gave surprising
results in that the two species share several large
conserved genome blocks as indicated by similar
marker orders and LGs. However, the results
also showed genome rearrangements and many
deletions/duplications after divergence.
By contrast, macrosyntenic relationships
between M. truncatula and Phaseoloid legumes
were more complicated and less informative.
Twenty-nine of the 38 (approximately 76%) markers mapped between M. truncatula and mungbean
revealed evidence of conserved gene order, whereas
the remaining markers mapped to nonsyntenic
positions. Despite these limitations, it is proposed
that a comprehensive analysis of legume comparative genomics in future may help to genetically
improve the mungbean via MAS.

Concluding Remarks for MAS in Mungbean
Although some progress in genome research
has been made in mungbean, it is still far behind
the other major legume crops such as soybean,
cowpea and common bean, or even its less important
relative, azuki bean. The
current genetic linkage maps of mungbean are
not yet sufficiently detailed, and hence dense or
saturated maps with 11 LGs resolved are needed for this
crop. A major obstacle to achieving such
maps is the lack of high-throughput SSR and
SNP markers (however, some progress has been made
to this end; see above). As indicated above, the
genome study in mungbean has been made
possible by using genetic markers from other
related legumes, and this trend will continue
since only limited genetic resources are available
for further study in mungbean. For example, SSRs
from azuki bean, common bean and cowpea will
be useful in development of mungbean linkage
map with 11 LGs resolved, as in the case of
blackgram. Moreover, the information obtained
from sequencing of the soybean genome, common
bean ESTs and the gene space of cowpea, M. truncatula and Lotus japonicus can be used to create high-throughput genetic markers for mungbean. In
addition, a database of thousands of cowpea gene
space sequences containing SSRs is now publicly
available. In-silico development of cowpea SSRs
and application of those markers in mungbean
are also interesting. With many genomic tools and
resources for legumes becoming increasingly
available, more detailed and in-depth genome
mapping of mungbean will be possible in the
near future. One such study is already reported
(Isemura et al. 2012). The genetic differences
between mungbean and its presumed wild ancestor were analysed for domestication-related
traits by QTL mapping. A genetic linkage map of
mungbean was constructed using 430 SSR and
EST-SSR markers from mungbean and its related
species, and all these markers were mapped onto
11 linkage groups spanning a total of 727.6 cM.
This mungbean map was the first map where
the number of linkage groups coincided with the
haploid chromosome number of mungbean. In
total, 105 QTLs and genes for 38 domestication-related traits were identified using this map.
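The degree of saturation of a linkage map can be summarised by the mean spacing between adjacent markers. As a rough sketch, the figures quoted above for the 2012 map (430 markers, 11 linkage groups, 727.6 cM) can be combined as follows. The calculation assumes that every marker occupies a distinct position and counts only intervals within linkage groups, so it gives a lower bound on the real average gap; the function name is illustrative.

```python
def mean_marker_spacing(map_length_cM, n_markers, n_linkage_groups):
    """Mean distance (cM) between adjacent markers.

    A linkage group carrying k markers has k - 1 intervals, so the whole map
    has (n_markers - n_linkage_groups) intervals in total.
    """
    intervals = n_markers - n_linkage_groups
    if intervals <= 0:
        raise ValueError("need more markers than linkage groups")
    return map_length_cM / intervals


# Figures quoted in the text for the 11-linkage-group map.
spacing = mean_marker_spacing(727.6, 430, 11)
markers_per_lg = 430 / 11

print(f"mean spacing: {spacing:.2f} cM")
print(f"mean markers per linkage group: {markers_per_lg:.1f}")
```

The resulting spacing is well below the 3 cM average reported for the most comprehensive of the earlier maps, which is consistent with the statement that this map is more saturated.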
Another challenge for mungbean genome
researchers is the development and establishment
of a more efficient protocol for genetic transformation to support breeding work, as the use of
transgenic technology is inevitable for mungbean
in the future. The technology will be helpful in
developing cultivars that resist serious
insect pests and tolerate adverse environments for which
no effective gene source exists in the mungbean gene pool,
such as legume pod borers, drought and other
abiotic stresses.

Tomato
Tomatoes (Solanum lycopersicum L.; syn. Lycopersicon esculentum Mill.) are among the most economically
important crops in the world.
Tomatoes are juicy berry fruits of the nightshade
family (Solanaceae). They came originally from
Central and South America. They are nutritious
vegetables that provide good quantities of vitamins A and C as well as essential minerals and
other nutrients. Furthermore, fresh and processed
tomatoes are the richest sources of the dietary
antioxidant lycopene, which arguably protects
cells from oxidants that have been linked to cancer. Tomato is also a source of other compounds
with antioxidant activities, including chlorogenic
acid, plastoquinones, rutin, tocopherol and
xanthophylls.
Tomatoes are of great economic value
because of their high
yields. Tomatoes are also one of the main
ingredients in hundreds of dishes and products
that are sold in supermarkets throughout the
developing and developed world, so
demand for them is extremely high. Tomato production
is ranked first in India, where it is dominated by small
business owners and farmers.
They favour tomatoes because the crop's
high market value makes up a very
large part of their income.
Tomatoes are also a popular choice by people
who wish to grow fruits and vegetables in their
own gardens. Not only can they be used raw in
salads, but they are also an essential part of
many recipes as well as many products such as
tomato ketchup and chutney. They can also be
grown both indoors in greenhouses and outdoors, although tomatoes that are grown outside
tend to have higher nutrient contents than those
grown in greenhouses. Tomatoes have many
advantages over other types of vegetable crops:
(1) their high yield
results in high economic value; (2) they
have very high nutritional value, with high levels of pro-vitamin A and vitamin C, and are
ranked first for their nutritional contribution to the
human diet; (3) they are a short-duration crop;
and (4) they are very well suited to different
cropping systems that include grains, pulses,
cereals and oilseeds.
Over 200 diseases of cultivated tomato have been
documented, and they seriously affect fruit
yield. Growers usually employ an integrated pest/
disease management strategy including both
cultural practices and pesticide use to combat the
damage caused by these pathogens. An example
of a cultural practice is the use of netting over
tomato plants, which provides a physical barrier
that can be effective in keeping disease-bearing
insects away from the crop.

Conventional Breeding and Tomato Improvement
Conventional breeding efforts in tomato date
back to the 1930s, when breeding for improvement of the overall horticultural characteristics
of tomato started. As market demand developed
for more specific traits desired by the fresh-market or processing tomato industry, breeding
objectives became more specialised, and by the
1950s, improved varieties were developed for
either processing or fresh-market uses through
selecting best phenotypes. Despite a significant
contribution in genetic improvement, conventional breeding has several potential inherent
difficulties, including limitations in the availability of screening environments, reduced response
to selection for traits with low heritability or
recessive expression, the length of the growing period before trait
evaluation can be conducted, genetic linkage
drag, the need to use large populations and thus
large space and concerns regarding genotype by
environment (G × E) interactions. Furthermore,
in some cases, breeders are unable to fully characterise or utilise the genetic information available
in wild germplasm or breeding populations via
phenotypic screening.

Biotechnology and Tomato Breeding


Advances in DNA technology since the 1950s have
revolutionised tomato breeding.
Two areas of biotechnology have an
immediate effect on tomato breeding: (1) transgenic technology and (2) marker-assisted selection (MAS). Despite numerous research studies
on transgenic approaches against diseases
of plants, there are currently few, if any,
transgenic tomato varieties (and only in some countries)
available to the grower that are resistant to any
pathogens. Further, there remains an issue of
public resistance, which, combined with the high
cost of obtaining regulatory approval, has effectively prohibited this promising technology from
being used in commercial tomato cultivation.
Thus, MAS has proven potential in
tomato breeding for genetic improvement of
several important economic traits such as pest
and disease resistance, quality improvement and
nutrient enhancement. With the advent of molecular markers and genetic maps, there has been
an increased interest in using marker technology
to facilitate tomato crop improvement. Tomato
was among the first crop species for which genetic
markers and maps were developed and utilised
for breeding purposes (Tanksley et al. 1992).
Molecular markers and MAS can potentially
overcome at least some of the limitations associated
with conventional breeding involving phenotypic
selection. A major advantage of DNA markers is
that they are neutral in phenotypic reactions,
that is, they do not have any pleiotropic effect
on the phenotype, nor are they influenced in their
segregation and inheritance by the growing
conditions of the plant. Furthermore, molecular
markers can be detected at any growth stage,
offering the possibility of selecting plants on the
basis of convenience to the breeder, in contrast
to the season-bound nature of conventional selection. With the availability of molecular markers
distributed throughout the tomato genome, many
tomato genetic maps have been developed,
including the high-density linkage map of tomato
based on an S. lycopersicum × S. pennellii cross
(refer to Foolad et al. 2008 for a list of tomato
genetic maps).

As discussed in chapter 8, successful application
of MAS depends on several factors. A major
concern in the use of molecular markers for
breeding purposes in tomato is the low frequency
of marker polymorphism within breeding populations as shown in several reports. Most genetic
maps of tomato are based on interspecific crosses
between the cultivated and related wild species
of tomato, where marker polymorphism is abundant. This is particularly true when the wild
species is only distantly related to the cultivated
tomato, such as S. pennellii, which has been used
for the construction of the high-density molecular linkage map of tomato. However, as in
the rice case study, most tomato-breeding
populations are based on intraspecific crosses
within the cultigen or crosses between the cultivated and closely related wild species such
as S. pimpinellifolium. In such populations, there
is much less marker polymorphism compared
to that in wide crosses. Thus, efforts must be
made to identify markers with a higher rate of
polymorphism in breeding populations. Further,
markers must be high throughput and economically affordable to justify their use in large populations. Finally, linkage association between the
gene or QTL of interest and the genetic marker
must be tight enough to avoid unwanted crossing
over, which may result in false positive selection.
In this regard, the best genetic markers are those
that are within the gene of interest. Due to the low
genetic diversity within the tomato cultigen, new
marker technologies, which can detect minor
genetic variation, are being leveraged for marker
discovery and tomato variety development. Among
the marker classes, SNPs have become the marker
of choice for numerous reasons. First, SNPs are
more plentiful than other marker types. Second,
high-throughput TaqMan-based SNP assays can
be developed for large-scale genotyping and
relatively easy data analysis. Third, TaqMan-based
SNP genotyping is cheaper than other protocols
when larger numbers of samples are involved.
Furthermore, a newer technology that is emerging and is being employed by some public and
private tomato researchers is genotyping by
sequencing (GBS). This technology is becoming
more feasible due to its reduced cost and the fact
that large numbers of polymorphic
SNPs are normally discovered between genotypes (often
on the order of hundreds of thousands). With
the completion of the tomato reference genome
sequence, localising SNPs identified by GBS
to specific physical locations is becoming an
easy task.
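The requirement stated above, that linkage between the marker and the gene or QTL be tight enough to avoid false positive selection, can be made concrete with a small calculation. The sketch below converts map distance to recombination fraction with the Haldane mapping function (which assumes no crossover interference) and then gives the chance, per meiosis, that a plant selected on the marker has lost the target allele. It compares a single linked marker with a pair of flanking markers. The function names and the 5 cM example distances are illustrative assumptions, not values from the text.

```python
import math


def haldane_r(distance_cM):
    """Recombination fraction for a map distance in cM (Haldane, no interference)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cM / 100.0))


def single_marker_error(distance_cM):
    """P(target allele lost | linked marker allele retained), one meiosis."""
    return haldane_r(distance_cM)


def flanking_marker_error(left_cM, right_cM):
    """P(target allele lost | both flanking marker alleles retained), one meiosis.

    The target is lost only through a double crossover, one on each side.
    """
    r1, r2 = haldane_r(left_cM), haldane_r(right_cM)
    double_crossover = r1 * r2
    no_crossover = (1.0 - r1) * (1.0 - r2)
    return double_crossover / (double_crossover + no_crossover)


# Illustrative distances only.
print(f"single marker at 5 cM:     {single_marker_error(5.0):.4f}")
print(f"flanking markers at 5 cM:  {flanking_marker_error(5.0, 5.0):.4f}")
print(f"marker within the gene:    {single_marker_error(0.0):.4f}")
```

The comparison shows why flanking markers, or better still markers within the gene of interest, are preferred: the error rate falls from about the recombination fraction itself to roughly its square, and to zero for a truly intragenic marker.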
Tomato was one of the first crops for which
molecular markers were suggested as indirect
selection criteria for breeding purposes (as early
as 1974; refer to Foolad and Panthee
2012 for an excellent review of tomato breeding
using MAS). The actual use of MAS in tomato
breeding began approximately three decades
ago with the use of the isozyme marker acid
phosphatase (Aps-11 locus) as an indirect selection
criterion for breeding for nematode resistance.
This isozyme marker still is being used in many
private and public tomato-breeding programs for
selecting for nematode resistance. However, more
recently, with the development of new molecular
markers and maps in tomato, MAS has become
a routine practice in many tomato-breeding
programs, in particular in the private sector, for
several purposes including the following three.
First, MAS is often used to assess hybrid purity
from overseas production by screening seed lots
with a panel of molecular markers. The technologies used for this purpose vary widely; SNPs
are leveraged regularly, PCR-based markers
are employed routinely, and in some cases, even
well-known isozyme markers are recruited.
Second, when reliable markers closely linked
to resistance genes (or specific fruit quality loci)
are known, MAS is used effectively for quick
germplasm screening for disease resistance or
fruit quality. Often, a panel of linked markers is
used on individual selections or pools of seed
or tissue from early-generation populations to
index breeding populations. This aids breeding
efforts by informing the breeder about which
disease resistances or fruit quality traits are segregating or fixed in a given population. However,
organism screening may often still be required to
verify the results of MAS and to validate linkage
(or lack thereof) between markers and the trait(s)
of interest. Third, MAS is employed for marker-assisted backcrossing (MAB; refer to chapter 8)
after reliable linkages between markers and
simple traits of interest are discovered. Such traits
include, but are not limited to, disease resistance,
fruit colour and carotenoid content (e.g. lycopene
and β-carotene), fruit ripening-related traits
(various genes including Rin and Nr), jointless
pedicel (j2) and extended field storage (EFS;
using various genes including Alcobaca and Long
Keeper). It appears that for many simple disease-resistance traits in tomato, MAS is not only faster
than conventional selection but also cheaper and
more effective. In tomato, genes for resistance
to over 35 pathogens have been identified and
mapped. It is assumed that currently in the tomato
seed industry MAS is routinely employed for
selecting for several qualitative disease-resistance
traits, including fusarium wilt races 1, 2 (with
some difficulty) and 3, late blight (Ph-3 and possibly
Ph-2), verticillium wilt race 1, bacterial spot
(Rx3 and Rx4), tomato spotted wilt virus (Sw5),
tomato yellow leaf curl virus (Ty1, Ty2, Ty3 and
Ty4) and root-knot nematode. As an example,
the detailed MAS work for genetic improvement
of tomato for bacterial spot and tomato yellow leaf curl virus
resistance is discussed below (see Foolad and
Panthee 2012 for references and other details).
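The first use mentioned above, hybrid purity testing, lends itself to a simple illustration. At a marker for which the two inbred parents carry different alleles, a true F1 seed must be heterozygous, whereas a seed arising from accidental selfing of the female parent is homozygous for the female allele. The sketch below classifies the seeds of a sample and reports the proportion of true hybrids. The marker names, alleles and seed genotypes are hypothetical, and the functions are illustrative rather than any standard laboratory pipeline.

```python
from collections import Counter


def classify_seed(seed_calls, female, male):
    """Classify one seed as 'hybrid', 'female_self' or 'off_type'.

    seed_calls maps marker name -> pair of alleles observed in the seed;
    female and male map marker name -> the allele each inbred parent is fixed for.
    """
    classes = set()
    for marker, alleles in seed_calls.items():
        if female[marker] == male[marker]:
            raise ValueError(f"marker {marker} is not polymorphic between the parents")
        observed = frozenset(alleles)
        if observed == frozenset((female[marker], male[marker])):
            classes.add("hybrid")        # heterozygous, as expected for a true F1
        elif observed == frozenset((female[marker],)):
            classes.add("female_self")   # homozygous for the female allele
        else:
            classes.add("off_type")
    # A seed counts as hybrid (or selfed) only if every marker agrees.
    return classes.pop() if len(classes) == 1 else "off_type"


def hybrid_purity(seed_lot, female, male):
    """Return (proportion of true hybrids, counts per class) for a seed sample."""
    counts = Counter(classify_seed(seed, female, male) for seed in seed_lot)
    return counts["hybrid"] / len(seed_lot), counts


# Hypothetical two-marker panel and four-seed sample (illustration only).
FEMALE = {"M1": "A", "M2": "C"}
MALE = {"M1": "G", "M2": "T"}
SEED_LOT = [
    {"M1": ("A", "G"), "M2": ("C", "T")},
    {"M1": ("G", "A"), "M2": ("T", "C")},
    {"M1": ("A", "A"), "M2": ("C", "C")},
    {"M1": ("A", "G"), "M2": ("C", "T")},
]

purity, counts = hybrid_purity(SEED_LOT, FEMALE, MALE)
print(f"hybrid purity: {purity:.0%}", dict(counts))
```

Using more than one marker guards against genotyping errors at a single locus, at the cost of classifying seeds with conflicting calls as off-types.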

MAS for Bacterial Spot Resistance


Bacterial spot, a common disease of tomato
throughout the world and particularly in tropical
and subtropical regions, is caused by four
species and five races of Xanthomonas, including
X. euvesicatoria (race T1), X. vesicatoria (race
T2), X. perforans (races T3, T4 and T5) and
X. gardneri (race T2). Among these, X. perforans
is the predominant species. Bacterial spot affects
leaves, stems and fruit and causes defoliation, fruit
lesions and reduced yield. The chemical control of
this disease has not been very effective due to
the presence of multiple sources of inoculum and
development of chemical resistance in the pathogen.
Sources of host genetic resistance to bacterial
spot have been identified in S. lycopersicum (e.g.
Hawaii 7998 and Hawaii 7981), S. lycopersicum
var. cerasiforme (PI 114490) and the related wild
species S. pimpinellifolium (PI 126932 and PI
128216) and S. pennellii (LA 716). However,
the presence of multiple species and races of the
pathogen as well as the complex nature of host
genetic resistance has made bacterial spot resistance breeding in tomato very challenging. While
most resistance sources seem to be race-specific,
some resistant genotypes interact with multiple
races of the pathogen and exhibit quantitative
response. For example, the breeding line Hawaii
7998, the most reliable source of resistance to race
T1, exhibits reduced disease symptoms in the
field and a hypersensitive response (HR) to T1 in
the greenhouse. Three QTLs/genes, Rx-1 (chromosome 1), Rx-2 (chromosome 1) and Rx-3 (chromosome 5), were reported to be independently
associated with HR in the greenhouse using a
population derived from crosses between Hawaii
7998 and S. pennellii accession LA 716. The
RFLP markers associated with these genes,
however, are based on S. pennellii LA 716 and
thus are not polymorphic in most breeding populations, limiting their utility for MAS breeding.
The Rx-3 locus was subsequently confirmed to
provide HR as well as field resistance in advanced
backcross populations derived from a cross
between Hawaii 7998 and processing breeding
line OH 88119 (susceptible), and markers linked
to Rx-3 were also reported including a CAPS
marker that has been used for MAS breeding.
Breeding line Hawaii 7981 provides an HR-based
resistance to race T3 of the pathogen and is considered the strongest source of resistance to this
race under both greenhouse and field conditions.
This resistance is controlled by a single gene,
Xv-3, which is mapped to tomato chromosome
11. In another study, using a population derived
from OH 88119 and PI 128216 (a resistant accession of S. pimpinellifolium), markers associated
with race T3 resistance were identified in the same
location as Xv-3 on chromosome 11, and the resistance gene was designated as Rx-4. SSR and SNP
markers associated with Rx-4 have been identified.
S. pennellii accession LA 716 exhibits HR to race
T4, conferred by the resistance gene Xv-4, which
originally was mapped to tomato chromosome 3.
Another bacterial spot resistance gene, Bs-4,
was discovered in cv. Moneymaker and mapped
to the short arm of chromosome 5. Furthermore,
the S. lycopersicum var. cerasiforme accession
PI 114490 (yellow cherry tomato) has shown field
resistance to multiple races of the pathogen. This
resistance seems complex as it may be conferred
by different genes in response to different races
of the pathogen. However, in a mapping study
using this accession, a major QTL was identified
on chromosome 11, which may confer resistance
to races T1, T2, T3 and T4. In addition, QTLs
associated with resistance to race T4 of bacterial spot were
identified on chromosomes 3 (PVE = 4.8%) and 11
(PVE = 29.4%) in inbred backcross populations
developed from PI 114490, OH 9242 and Fla
7600. In a different study, two RAPD markers
associated with bacterial spot resistance were
reported, where the markers were originally
derived based on a resistance gene (Bs-2) in pepper. In this study, an F2 population of pepper
from a cross between Early Calwonder (bs1/bs1
bs2/bs2 bs3/bs3) and Early Calwonder 20R (bs1/
bs1 Bs2/Bs2 bs3/bs3) was employed to identify
recombinants, which subsequently were used to
identify the gene sequence and design primers
for screening for the Bs-2 gene in tomato.
In summary, the available molecular markers
associated with different bacterial spot resistance
genes or QTLs are expected to be useful for
pyramiding resistance from different sources via
MAS, providing a strong and durable resistance
to tomato bacterial spot. However, because of the
complexities of the pathogen and host resistance,
it may be necessary to combine MAS with field
disease screening to confirm the presence of
strong resistance.
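
As a rough illustration of why markers help when pyramiding resistance from several sources, the sketch below computes the expected fraction of F2 plants that are homozygous for the target allele at every locus, and the population size needed to find at least one such plant. It assumes unlinked loci with Mendelian 1:2:1 segregation, which is a simplification made here (several loci discussed above lie on chromosome 11, so real frequencies would differ). The function names and the 95% confidence level are illustrative choices, not taken from the studies cited.

```python
import math
from fractions import Fraction


def f2_fixed_fraction(n_loci: int) -> Fraction:
    """Expected fraction of F2 plants homozygous for the target allele at every locus.

    Assumes unlinked loci, each segregating 1:2:1 (an illustrative simplification).
    """
    if n_loci < 1:
        raise ValueError("n_loci must be at least 1")
    return Fraction(1, 4) ** n_loci


def min_f2_plants(n_loci: int, confidence: float = 0.95) -> int:
    """Smallest F2 population giving `confidence` probability of at least one target plant."""
    p = float(f2_fixed_fraction(n_loci))
    return math.ceil(math.log(1 - confidence) / math.log(1 - p))


for n in (1, 2, 3):
    print(n, f2_fixed_fraction(n), min_f2_plants(n))
```

Under these assumptions the required population grows quickly with the number of loci, which is one reason genotype-based screening is attractive for pyramiding.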

MAS for Tomato Yellow Leaf Curl Virus Resistance
Tomato yellow leaf curl virus (TYLCV), a
monopartite geminivirus transmitted by whitefly,
causes a serious disease of tomatoes in tropical and
subtropical regions of the world. Genetic sources
of resistance have been identified in the tomato
wild species S. pimpinellifolium, S. peruvianum,
S. cheesmanii, S. habrochaites and S. chilense
and used to study the genetic control of resistance.
Due to the very destructive nature of this disease
in certain tomato-growing regions, intensive
breeding efforts have been devoted to developing
TYLCV resistant cultivars, mostly in private seed
companies. Traditional breeding has resulted in
development of cultivars with reduced susceptibility, but no cultivar with complete resistance to
TYLCV is available. In addition, the disease
response of the resistant cultivars often varies
from location to location, and it has been difficult
to develop resistant cultivars with horticultural
characteristics similar to those of susceptible
ones. Thus far, four resistance loci, Ty-1, Ty-2,
Ty-3 and Ty-4, have been identified and mapped
to tomato chromosomes 6, 11, 6 and 3, respectively. Several QTLs conferring resistance to
TYLCV have also been identified. At least six
PCR-based molecular markers associated with
the major resistance genes have been developed
and reported. However, the lack of consistent
genetic markers associated with TYLCV resistance
has hindered the utility of MAS for this trait.
In addition, since TYLCV is considered a dangerous pathogen, screening germplasm for resistance as well as validation of any genetic marker
has been challenging.

MAS for Other Economic Traits


As for quantitative traits, in addition to the limited
use of MAS for manipulating QTLs for traits
such as fruit flavour and soluble solids content
(Brix), MAS is being attempted for improving
quantitative resistance to diseases such as powdery mildew, bacterial canker and bacterial wilt.
Furthermore, despite considerable efforts devoted
to the identification and mapping of QTLs for
various abiotic stress tolerance traits in tomato,
including salt tolerance, drought tolerance and
cold tolerance, MAS does not appear to have been
employed for improving any of these traits. As
is the case in other crop species, many QTLs
reported for complex traits in tomato are either
unreliable, population-specific or not strong
enough in terms of linkage to warrant their use
for marker-assisted breeding. In fact, in many
cases where MAS has been employed to transfer
QTLs from wild species, there have been problems
associated with linkage drag and recovery of
desirable horticultural characteristics. Such undesirable associations could be due to genetic linkage
and/or pleiotropic effects; the distinction between
the two is often not very straightforward. Thus,
before MAS can become a routine practice for
improving complex traits in tomato, these issues must be addressed.

MAS for Genetic Improvement of Fruit Quality Traits
Antioxidants in tomato fruits have been a public
health focus for many years. Lycopene in tomato fruit
(its content is abbreviated LYC here) is an important
lipid-soluble antioxidant in the human
diet and can prevent the initiation or propagation
of oxidising chain reactions. Total soluble solid
content (SSC) is one of the main components of
tomato flavour, and it is the property in tomato
most likely to match the consumer perception
of internal quality. LYC and SSC are the main
quality traits of tomato fruit. Tomato fruit quality is governed by a range of genetic
and environmental factors that result in quantitative variation across varieties, and its
inheritance is complex.
Overcoming the genetic linkage
between fruit quality traits therefore presents a challenge
for conventional breeding methods. Using
QTL mapping to find major genes and functional
markers, and thereby improve control over quantitative traits, is an effective way to address these
problems.
Conventional breeding methods provide little
information on the chromosomal regions controlling these complex quality traits or on the simultaneous effects of each chromosomal region on
other traits through epistasis, pleiotropy and
linkage. If based only on phenotype analysis,
selection by conventional breeding methods is
extremely difficult when genotype × environment
interactions are substantial. No reliable field
screening technique exists that can be used year
after year and generation after generation. One
approach to facilitate the selection and breeding
of complex quality traits is to identify genetic
markers linked to the traits of interest. During the
past decades, QTL studies conducted for tomato
have covered more than 50 traits, most of which are
fruit-related. Studies on LYC and
SSC have suggested the existence of at least 17
QTLs for LYC on all tomato chromosomes
except chromosome 9 and at least 109 QTLs for SSC on all
chromosomes. With the exception of 2 QTLs
for LYC, none of these QTLs have been used for
marker-assisted selection (MAS) in breeding;
this suggests that constructing a static model of
genetic roles at only one developmental stage
is inadequate and that more effort should be directed
towards examining the stability and effectiveness
of the target trait QTLs with a view to using a
dynamic model of the genetic variation.

Fine Mapping and Characterisation of Fruit-Size QTL
Fruit size is one of the most important agricultural
traits controlled by quantitative trait loci (QTL).
Therefore, identification of the underlying genes
of the major fruit-size loci may benefit the breeding
industry, as well as help us better understand
the molecular mechanism underlying fruit development. In one study, one of the major fruit-size
loci in tomato, fw3.2, was fine mapped by linkage
analysis to a 51.4 kb interval corresponding to
a BAC clone of the tomato genome. The gene action
suggested that a gain-of-function mutation in the
cultivated allele, producing larger fruit, occurred during
domestication. The phenotypic characterisation
of near-isogenic lines (NILs) showed that this
locus also controls other traits such as branch
number, leaf size and seed size. Yield per plant
was similar, and the larger fruited lines carried
fewer fruit that ripened later than the smaller
fruited lines. The changes in fruit weight were
not due to an alteration in the sink-source relationship. Expression level analysis of the seven
candidate genes in the NILs did not identify
which gene may underlie fw3.2, and numerous
SNPs and InDels were found between the parents
of the population. Based on function of the putative
orthologs, one candidate gene is proposed to
be FW3.2. Association mapping around this
candidate gene yielded one quantitative trait
nucleotide (QTN) in the promoter of the gene.
Further genetic analysis of this QTN supported
the finding that this SNP is the causative mutation
at the fw3.2 locus.

Concluding Remarks for MAS in Tomato


Molecular markers associated with genes or
QTLs have been reported for numerous economically important traits in tomato. Theoretically,
such marker information should be useful for
improving qualitative or quantitative traits in
tomato via marker-assisted breeding. In practice,
however, while markers have been used rather
extensively for improving certain simply inherited
traits in tomato, they have rarely been utilised for
improving complex traits. This has been due to
various reasons, including population-specific
markers (e.g. lack of correspondence between
QTLs identified in interspecific populations and
those existing in breeding populations), lack of
marker validation by repeating experiments, lack
of marker polymorphism in breeding populations
and linkage drag. For simply inherited characteristics, in particular some disease-resistance traits,
however, markers have been used for tomato
breeding to a great extent in both public- and
private-sector breeding programs. It is estimated
that, at least for some disease-resistance traits,
MAS is not only faster than phenotypic selection
but it is also cheaper and more efficient. However,
not all markers publicly reported in the literature
are readily applicable in tomato-breeding programs. Often additional efforts are necessary to
refine the markers or to identify and develop new
markers with greater utility and reproducibility
in specific breeding populations. In particular,
extra efforts are often required to identify/develop
markers that detect polymorphism within tomato-breeding populations. In fact, as most commercial-scale tomato-breeding material is developed
by the private sector, such programs often develop
their own resource of proprietary markers and associations tailored to their germplasm pool. Often
publicly available marker information is a good
start but not always adequate. The utility of available markers for several major disease-resistance
traits in tomato was tested in a number of breeding
lines and commercial cultivars with known
resistance/susceptibility responses. While several
markers were validated, others needed PCR optimisation for successful amplifications or were
not informative in the genotypes used. Specifically,
of the 37 markers examined, 19 (approximately
51%) were informative, including markers for
resistance to Fusarium wilt, late blight, bacterial
wilt, tomato mosaic virus, tomato spotted wilt
virus and root-knot nematodes (Panthee and
Foolad 2012). It appears that many of the available markers may need to be further refined or
examined for trait association and presence of
polymorphism in breeding lines and populations.
However, with recent advances in tomato sequencing, it is becoming increasingly possible to
develop more informative markers to accelerate
the use of MAS in tomato breeding. Thus, it is
imperative that additional efforts be
devoted to identifying allele-specific and
population-specific markers in order to expand
the utility of MAS in tomato breeding.

Hot Pepper
Hot pepper (Capsicum annuum) is an important
horticultural crop, not only because of its economic importance but also because of the nutritional
and medicinal value of its fruit. The fruits are an
excellent source of natural colours and antioxidants. A wide spectrum of antioxidant vitamins,
carotenoids, capsaicinoids and phenolic compounds is present in hot pepper fruits. The intake
of these compounds in food is an important
health-protecting factor that helps prevent widespread
human diseases. Acreage under hot peppers is
increasing as growers shift from
traditional crop-based farming to nontraditional crop
production, a shift driven in turn by declining
income from regular cropping programs. During
the last decade, the area under protected cultivation (poly/plastic tunnels) of vegetables such as
hot pepper, tomato and cucumber has been increasing
steadily. Hot pepper is one of the potential crops
to be grown in poly/plastic tunnels.


Progress in MAS in Hot Pepper


Male sterility (MS) is
used in breeding programs to achieve economical
seed production. Male sterility is divided into
genic male sterility (GMS) and cytoplasmic male
sterility (CMS), which are used to breed commercial pepper varieties. The CMS system, however,
is not feasible in some pepper varieties, including
C. annuum, because of the absence of a restorer
source. GMS is thus important for seed production in bell peppers. A GMS-linked marker from
bell peppers was developed using bulked segregant analysis combined with the amplified fragment length
polymorphism (AFLP) method on F2 and sibling individuals. Screening 1,024 AFLP primer sets revealed a
polymorphism from EcoRI ACG/MseI GTT
among the siblings. An internal sequence-based
primer was designed from the 395 bp sequence
for high-resolution melting (HRM) analysis, and
the marker scores of 87 of 92 F2 individuals corresponded to their phenotypes. The marker was
mapped on chromosome 5 on the AC99 map.
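
The marker-phenotype agreement reported above (87 of 92 F2 individuals) can be expressed as a simple concordance rate. The sketch below is only an illustration: the individual calls are hypothetical placeholders constructed to reproduce the reported counts, and the split between fertile and sterile plants is an assumption, not data from the study.

```python
def concordance(marker_calls, phenotypes):
    """Fraction of individuals whose marker-predicted class matches the observed phenotype."""
    if len(marker_calls) != len(phenotypes) or not marker_calls:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(m == p for m, p in zip(marker_calls, phenotypes))
    return matches / len(marker_calls)


# Hypothetical calls: only the totals (87 matches out of 92) come from the text.
phenotypes = ["fertile"] * 70 + ["sterile"] * 22
marker_calls = (["fertile"] * 67 + ["sterile"] * 3      # 3 fertile plants mis-scored
                + ["sterile"] * 20 + ["fertile"] * 2)   # 2 sterile plants mis-scored

rate = concordance(marker_calls, phenotypes)
print(f"{rate:.1%}")  # 94.6%
```

The five discordant individuals could reflect recombination between marker and gene or scoring error; the counts alone do not distinguish the two.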
Phytophthora root rot, caused by Phytophthora
capsici, is a major disease that limits pepper
production worldwide. P. capsici is a soil-borne pathogen that can survive on host residues in soil for
months. Various methods to control phytophthora
root rot have been reported; however, most
treatments increase production costs as well as
environmental and health risks. The use of resistant cultivars is a simple and effective strategy.
Several resistance sources to phytophthora root
rot have been reported, but commercial cultivars
with good stable resistance in different environments against diverse isolates of the pathogen
across regions are still lacking. Quantitative trait
loci (QTL) for resistance to phytophthora root
rot were investigated using two Korean P. capsici
isolates and 126 F8 recombinant inbred lines
derived from a cross of Capsicum annuum line
YCM334 (resistant parent) and local cv. Tean
(susceptible parent). Seven QTLs on chromosome 5 that were common to
resistance against the two isolates were identified,
in addition to isolate-specific QTLs.
The common QTLs with major
effects on resistance to the two isolates explained
20.0–48.2% of the phenotypic variation. The isolate-specific QTLs explained 6.0–17.4% of the phenotypic variation. The result confirms a
gene-for-gene relationship between C. annuum
and P. capsici for root rot resistance (Truong
et al. 2012). QTLs for phytophthora root rot
resistance were previously identified on chromosome 11 in other studies. Thus, the results
indicate that at least a few specific gene functions are important components of root rot resistance to different P. capsici races/isolates in the
YCM334 × Tean population. Identification of
isolate-specific resistance QTLs in P. capsici–C.
annuum interactions will help breeders in selecting appropriate resistant lines for future hybridisation. Breeders may need to breed for resistance
against specific isolates from different regions
and then pyramid a number of specific genes
into a cultivar to confer resistance. The approach
for further studies could be to develop near-isogenic lines carrying different combinations of
QTLs and to challenge these lines with
different pathogen isolates.
Pungency in peppers is due to
capsaicinoids, the molecules
that cause a pungent, burning sensation when hot
peppers are consumed and that are produced exclusively
in the genus Capsicum. This organoleptic quality
is due to the activation of the TRPV1 (VR1)
receptor. The primary capsaicinoids are capsaicin,
dihydrocapsaicin and nordihydrocapsaicin.
The presence of capsaicinoids makes pungent
peppers valuable as a spice. In contrast, the
absence of capsaicinoids is important when nonpungent peppers are grown as a vegetable crop.
The major gene Pun1 is required for the production
of capsaicinoids. Three distinct mutant alleles
of Pun1 have been found in three cultivated
Capsicum species, one of which has been widely
utilised by breeders. A robust collection of
molecular markers for this set of alleles was
identified that can differentiate four Pun1 alleles.
Those markers were tested on a diverse panel of
pepper lines and in an F2 population segregating
for pungency (Wyatt et al. 2012). These markers
will be useful for pepper breeding, germplasm
characterisation and seed purity testing. Those
markers are unique in their ability to detect the
functional nucleotide polymorphisms of the
three Pun1 alleles. This set of Pun1 markers will
aid diversity studies through the easy
identification of the three known Pun1 mutants
in a wide range of germplasm. Additionally, the
markers are useful for seed lot testing in seed
purity programs. With a trait such as pungency
in fruit, which can cause a painful sensation upon
contact, it is critical to maintain the purity of nonpungent seed stocks. Finally, these markers will
be highly useful in breeding programs because
they provide an easy method to genotype populations and quickly identify plants with the
desired pungency state.

Concluding Remarks on MAS in Hot Pepper
Molecular markers have contributed to the
genetic improvement of hot pepper in several
ways, including efficient screening of large
amounts of germplasm for genetic diversity
analysis, screening for seed purity, fingerprinting and QTL mapping. Though genes for major
dominant traits have been mapped, QTLs for
complex polygenic traits such as pest and disease resistance and abiotic stress resistance
remain to be analysed. It is envisaged that
future developments in molecular biology may
reduce the cost of marker development,
which in turn would have a huge impact on hot pepper
breeding via MAS.

Bibliography
Literature Cited
Ali ML, Pathan MS, Zhang J, Bai G, Sarkarung S, Nguyen
HT (2000) Mapping QTLs for root traits in a recombinant inbred population from two indica ecotypes in
rice. Theor Appl Genet 101:756–766
Boopathi NM, Senthil A, Chandrikala R, Singh A,
Shanmugasundaram P, Sadasivam S, Babu RC (2002)
Mapping quantitative trait loci and marker assisted
selection for the improvement of drought tolerance in
rice. Madras Agric J 89(10–12):553–562
Champoux MC, Wang G, Sarkarung S, Mackill DJ,
O'Toole JC, Huang N, McCouch SR (1995) Locating
genes associated with root morphology and drought
avoidance in rice via linkage to molecular markers.
Theor Appl Genet 90:961–981
Chen H, Qian N, Guo W, Song Q, Li B, Deng F, Dong C,
Zhang T (2010) Using three selected overlapping RILs
to fine-map the yield component QTL on Chro.D8 in
Upland cotton. Euphytica 176:321–329
Foolad MR, Panthee DR (2012) Marker-assisted selection
in tomato breeding. Crit Rev Plant Sci 31(2):93–123
Foolad MR, Merk HL, Ashrafi H (2008) Genetics, genomics
and breeding of late blight and early blight resistance
in tomato. Crit Rev Plant Sci 27:75–107
Gomez S, Boopathi NM, Kumar SS, Ramasubramanian T,
Chengsong Z, Jeyaprakash P, Senthil A, Babu RC
(2010) Molecular mapping and location of QTL for
drought resistance traits in indica rice (Oryza sativa
L.) lines adapted to target environments. Acta Physiol
Plant 32(2):355–364
Gutierrez OA, Robinson AF, Jenkins JN, McCarty JC,
Wubben MJ, Callahan FE, Nichols RL (2011)
Identification of QTL regions and SSR markers associated with resistance to reniform nematode in Gossypium
barbadense L. accession GB713. Theor Appl Genet
122:271–280
Humphry ME, Konduri V, Lambrides CJ, Magner T,
McIntyre CL, Aitken EAB, Liu CJ (2002) Development
of a mungbean (Vigna radiata) RFLP linkage map and
its comparison with lablab (Lablab purpureus) reveals
a high level of synteny between the two genomes.
Theor Appl Genet 105:160–166
Isemura T, Kaga A, Tabata S, Somta P, Srinives P et al
(2012) Construction of a genetic linkage map and
genetic analysis of domestication related traits in
mungbean (Vigna radiata). PLoS One 7(8):e41304.
doi:10.1371/journal.pone.0041304
Jenkins JN, Wu J, Guo Y, McCarty JC (2010) Use of fiber
and fuzz mutants to detect QTL for yield components,
seed, and fiber traits of upland cotton. Euphytica
172:21–34
Jiang CX, Wright RJ, El-Zik KM, Paterson AH (1998)
Polyploid formation created unique avenues for
response to selection in Gossypium (cotton). Proc Natl
Acad Sci USA 95(8):4419–4424
Kamoshita A, Babu RC, Boopathi NM, Fukai S (2008)
Phenotypic and genotypic analysis of drought
resistance traits for development of rice cultivars
adapted to rainfed environments. Field Crops Res
109(1–3):1–23
Lambrides CJ, Lawn RJ, Godwin ID, Manners J, Imrie
BC (2000) Two genetic linkage maps of mungbean
using RFLP and RAPD markers. Aust J Agric Res
51:415–425
Lilley JM, Ludlow MM, McCouch SR, O'Toole JC (1996)
Locating QTL for osmotic adjustment and dehydration
tolerance in rice. J Exp Bot 47:1427–1436
McCouch SR, Kochert G, Yu ZH, Wang ZY, Khush GS,
Coffman WR, Tanksley SD (1988) Molecular mapping
of rice chromosomes. Theor Appl Genet 76:815–829
Menancio-Hautea D, Kumar L, Danesh D, Young ND
(1993) A genome map for mungbean [Vigna radiata
(L.) Wilczek] based on DNA genetic markers (2n = 2x
= 22). In: O'Brien JS (ed) Genetic maps 1992. A compilation of linkage and restriction maps of genetically
studied organisms. Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, pp 6.259–6.261
Panthee DR, Foolad MR (2012) A reexamination of
molecular markers for use in marker-assisted breeding
in tomato. Euphytica 184:165–179
Ray JD, Yu LX, McCouch SR, Champoux MC, Wang G,
Nguyen HT (1996) Mapping quantitative trait loci
associated with root penetration ability in rice (Oryza
sativa L.). Theor Appl Genet 92:627–636
Reinisch AJ, Dong J, Brubaker CL, Stelly DM, Wendel
JF, Paterson AH (1994) A detailed RFLP map of cotton, Gossypium hirsutum × Gossypium barbadense:
chromosome organization and evolution in a disomic
polyploid genome. Genetics 138:829–847
Robin S, Pathan MS, Courtois B, Lafitte R, Carandang S,
Lanceras S, Amante M, Nguyen HT, Li Z (2003)
Mapping osmotic adjustment in an advanced backcross
inbred population of rice. Theor Appl Genet
107:1288–1296
Shen L, Courtois B, McNally KL, Robin S, Li Z (2001)
Evaluation of near-isogenic lines of rice introgressed
with QTLs for root depth through marker-aided selection. Theor Appl Genet 103:75–83
Sun FD, Zhang JH, Wang SF, Gong WK, Shi YZ, Liu AY,
Li JW, Gong JW, Shang HH, Yuan YL (2012) QTL
mapping for fiber quality traits across multiple generations and environments in upland cotton. Mol Breed
30:569–582
Tanksley SD, Ganal MW, Prince JP, Devicente MC,
Bonierbale MW, Broun P, Fulton TM, Giovannoni JJ,
Grandillo S, Martin GB et al (1992) High-density
molecular linkage maps of the tomato and potato
genomes. Genetics 132:1141–1160
Truong HTH et al (2012) Identification of isolate-specific
resistance QTLs to phytophthora root rot using an
intraspecific recombinant inbred line population of
pepper (Capsicum annuum). Plant Pathol 61(1):48–56
Venuprasad R, Shashidhar HE, Hittalmani S, Hemamalini
GS (2002) Tagging quantitative trait loci associated
with grain yield and root morphological traits in rice
under contrasting moisture regimes. Euphytica
128:293–300
Wu J, Gutierrez OA, Jenkins JN, McCarty JC, Zhu J
(2009) Quantitative analysis and QTL mapping for
agronomic and fibre traits in an RI population of upland
cotton. Euphytica 165:231–245
Wyatt LE et al (2012) Development and application of a
suite of non-pungency markers for the Pun1 gene in
pepper (Capsicum spp.). Mol Breed. doi:10.1007/
s11032-012-9716-9
Zhang Z, Rong J, Waghmare VN, Chee PW, May OL,
Wright RJ, Gannaway JR, Paterson AH (2011) QTL
alleles for improved fiber quality from a wild
Hawaiian cotton, Gossypium tomentosum. Theor Appl
Genet 123:1075–1088
Zheng BS, Yang L, Zhang WP, Mao CZ, Wu YR, Yi KK,
Liu FY, Wu P (2003) Mapping QTLs and candidate
genes for rice root traits under different water-supply
conditions and comparative analysis across three populations. Theor Appl Genet 107:1505–1515

Further Reading
Boopathi NM, Thiyagu K, Urbi B, Santhoshkumar M,
Gopikrishnan A, Aravind S, Swapnashri G, Ravikesavan
R (2011) Marker-assisted breeding as next-generation
strategy for genetic improvement of productivity and
quality: can it be realized in cotton? Int J Plant Genom
2011. doi:10.1155/2011/670104

Future Perspectives in MAS

MAS can be simply defined as selection for a
trait based on the genotype of an associated
marker rather than the trait itself. In essence, the
associated marker is used as an indirect selection
criterion. The potential of MAS as a tool for crop
improvement has been extensively explored in
different plant species. Major applications of
MAS include (1) tracing favourable alleles
and pyramiding them in desirable genetic backgrounds (foreground MAS), (2) eliminating
unwanted genetic backgrounds (background
MAS) or undesirable plant material in early
breeding generations and identifying the most
desirable gene combinations or individuals in
segregating populations and (3) breaking the
undesirable linkages between favourable and
unfavourable alleles (reducing linkage drag). The
success of MAS in plant breeding is often
assessed on the basis of these three components.
In theory, MAS can reduce the cost and increase
the precision and efficiency of selection and
breeding. However, MAS is not a silver bullet,
and it can be more effective than conventional
phenotype-based selection only under certain
situations, including when (1) trait-based selection is not feasible (e.g. lack of selection environment or pathogen), (2) such selection is costly or
ineffective, (3) trait expression is developmentally regulated or phenotypically not obvious
until late in the season, (4) the trait is governed by
recessive or incompletely dominant gene(s), (5)
trait heritability is low, rendering conventional
phenotypic selection ineffective, (6) G × E
interactions are strong, (7) multiple-trait selection
is desired, (8) genes are being introduced or
pyramided from different sources and (9) genes/QTLs are being transferred from wild genetic backgrounds. Furthermore, in a backcross-breeding
programme, MAS allows reduction of linkage
drag by selecting against the undesirable donor
genome and for desirable recurrent parent genome
(background selection) while also selecting for
desirable donor alleles (foreground selection).
Moreover, with MAS, it is possible to conduct
multiple rounds of selection in a year, allowing
approximately two generations of selection per
year, compared to one in phenotypic selection
methods.
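
To make the value of background selection concrete, the sketch below gives the expected proportion of recurrent parent genome after successive backcrosses when no marker-based background selection is applied. It uses the standard expectation 1 - (1/2)^(n+1) for generation BCn, which holds on average for loci unlinked to the target gene; the function name is illustrative. Marker-based background selection aims to pick individuals that exceed this average in each generation.

```python
from fractions import Fraction


def expected_recurrent_fraction(n_backcrosses: int) -> Fraction:
    """Average recurrent-parent genome fraction in generation BCn (n = 0 is the F1).

    Assumes no selection and loci unlinked to the introgressed target.
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return 1 - Fraction(1, 2) ** (n_backcrosses + 1)


for n in range(1, 5):
    print(f"BC{n}: {float(expected_recurrent_fraction(n)):.4f}")
```

Regions linked to the target gene return to the recurrent parent type more slowly than this average, which is the linkage drag problem described above.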
The success of MAS also depends on many
other factors, including the underlying genetic
control of the trait(s) of interest. MAS has been
possible, if not always practical, for a wide range
of qualitative/simple traits since the early twentieth century. The utility of MAS for manipulating
single-gene traits is straightforward and has been
well documented. MAS for the improvement of
polygenic traits, however, is more complicated,
though its usefulness has been recognised.
In general, for quantitative traits, MAS seems to
be most effective for traits with low (0.1–0.3)
heritability and those which are controlled by
rather small numbers of QTLs with large effects.
However, with the recent development of
next-generation molecular tools and genetic
maps, MAS has on several occasions proved more attractive
and practical for many simple and complex traits
in applied breeding programmes.
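
The link between low heritability and the usefulness of MAS can be illustrated with a commonly used approximation from quantitative genetics theory (usually attributed to Lande and Thompson 1990; it is not taken from this chapter). Under that approximation, selection on a marker score alone is sqrt(p / h2) times as efficient as phenotypic selection, where p is the proportion of additive genetic variance explained by the markers and h2 is the trait heritability. The values of p and h2 below are arbitrary illustrations.

```python
import math


def relative_efficiency(p_marker: float, h2: float) -> float:
    """Efficiency of marker-only selection relative to phenotypic selection.

    p_marker: proportion of additive genetic variance explained by the markers.
    h2: narrow-sense heritability of the trait.
    Uses the sqrt(p / h2) approximation; both inputs are assumed known.
    """
    if not 0 < h2 <= 1 or not 0 <= p_marker <= 1:
        raise ValueError("h2 must be in (0, 1] and p_marker in [0, 1]")
    return math.sqrt(p_marker / h2)


for h2 in (0.1, 0.3, 0.5):
    print(h2, round(relative_efficiency(0.3, h2), 2))
```

With markers capturing 30% of the additive variance, marker-only selection is favoured in this sketch only while heritability stays below 0.3, which is consistent with the heritability range quoted above.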

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4_12, Springer India 2013


One of the yet unrealised promises of molecular markers is their utility for improvement of
complex quantitative traits, which are often
controlled by more than one gene and exhibit low
heritability and often strong G × E interactions.
The failure in using molecular markers for complex traits is due to various reasons, including
QTLs being unreliable or population- or environment-specific, QTLs not strong enough in terms
of linkage to warrant their use for marker-assisted
breeding, lack of marker validation or marker
polymorphism in breeding populations and problems associated with linkage drag. However, it
should be possible to use markers for improving
complex traits assuming that additional necessary efforts are made to develop reliable markers,
including minimising the environmental effects
and maximising the relationship between genotype and phenotype (e.g. by repeating experiments in multiple environments), breaking up
complex traits into their individual components
and identifying QTL-linked markers for such
components, and identifying QTLs using actual
breeding populations. Obviously, these are not
easy challenges, but they are doable.
Thus, future progress in MAS will greatly
depend on improved genetics. However, the agronomical context, as well as socio-economic factors and policy, must be taken into account; they
influence to a large extent whether farmers adopt
improved varieties and whether they can minimise the gap between yield potential and on-farm
yield. This integration of quantitative knowledge
arising from diverse but complementary disciplines will allow researchers to more fully understand genes associated with complex traits in
crop plants and more precisely forecast the penalty of modulating expression levels of those
genes.
Large-scale genome sequencing and associated bioinformatics are becoming widely accepted
research tools for accelerating the analysis of
plant genome structure and function. Second-generation DNA sequences from crop plants can
provide an opportunity to use genomic information to clone genes and develop SNP markers in
plants. Rapid progress is now being achieved in
assembling the DNA sequences from individual
chromosome arms of plant genomes, and this
progress provides a template for defining
novel functional markers for future use. High-quality crop genome sequences integrated with
molecular genetic maps provide the basis for
identifying duplicated genes, analysing promoter
regions in detail, defining SNPs/InDels and
aligning the transcriptome with the genome.
These advances will allow gene networks to be
clearly defined and thus allow meaningful causal
or functional markers to be developed for complex
traits.
Extensive proteomic studies have allowed
identification of many allelic variants at the novel
genes, and genomic analyses identified several
markers for discriminating alleles at one locus.
These successes have indicated that it is now
essential to establish rapid, convenient and
economical PCR-based assays in crop breeding.
In order to detect genes simultaneously in a single
PCR, multiplex PCR can be developed, in which
several markers in the same reaction mix are
co-amplified under identical conditions. For
example, two multiplex PCR assays, developed
for the identification of genes/loci ω-secalin,
Glu-B1-2a, Glu-D1-1d, Glu-A3d, Glu-B3,
Pin-D1b, Ppo-A1, Ppo-D1 and Wx-B1b, provide
the proof of concept for the efficient screening of
genotypes in wheat. A clear challenge in
multiplexing markers is that the different primers must have similar annealing
temperatures and the
expected PCR products must be easily separated on
agarose gels. Although several genes conferring
pest/disease resistance have been cloned in plants,
gene-specific markers are available for only
a few genes. If alleles conferring specific resistance
are being sought, it is important to know which
alleles are effective and potentially useful to local
breeding programmes. A good example is the
leaf rust resistance genes Lr10 and Lr21, which
confer resistance to a broad spectrum of Puccinia
triticina races, but gene-specific markers are
not available for these two genes because the
reactions of alleles to various Puccinia triticina
races have not been well characterised. Currently,
functional markers are being increasingly adopted
in crop breeding, including wheat. Many
functional markers associated with wheat quality
genes, in particular, are available; however, more
functional markers are needed for important traits
such as disease and stress resistance in order to
strengthen the application of molecular markers
in breeding programmes. SNPs are the most
applicable markers for high-throughput screening once the genotype-phenotype associations
are determined. The expanded use of these
markers will develop as high-throughput techniques for MAS based on functional SNP markers
and chips are established. The meaningful interpretation of whole-genome studies to associate
SNPs with variation in phenotype is expected to
provide the next generation of functional markers
for use in MAS.
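The two multiplexing constraints mentioned above (similar annealing temperatures across primers, and products that separate cleanly on an agarose gel) can be checked before any wet-lab work. The following is only a minimal sketch: the primer sequences, product sizes and thresholds (`max_tm_spread`, `min_size_gap`) are hypothetical, and the melting temperature uses the crude Wallace rule of thumb rather than a nearest-neighbour model.

```python
from itertools import combinations


def wallace_tm(primer: str) -> int:
    """Rough melting temperature (Wallace rule): 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def multiplex_conflicts(markers, max_tm_spread=5, min_size_gap=50):
    """Return a list of problems; an empty list means the marker set passes.

    max_tm_spread (degrees C) and min_size_gap (bp) are assumed thresholds.
    """
    problems = []
    # All primers in the mix should anneal under the same conditions.
    tms = [wallace_tm(p) for m in markers for p in (m["forward"], m["reverse"])]
    spread = max(tms) - min(tms)
    if spread > max_tm_spread:
        problems.append(f"primer Tm spread {spread} C exceeds {max_tm_spread} C")
    # Every pair of products must differ enough in size to resolve on a gel.
    for a, b in combinations(markers, 2):
        gap = abs(a["product_bp"] - b["product_bp"])
        if gap < min_size_gap:
            problems.append(f"{a['name']} and {b['name']} differ by only {gap} bp")
    return problems


# Hypothetical primers and product sizes, for illustration only.
markers = [
    {"name": "markerA", "forward": "ATGCGTACGTTAGCCTAGGA",
     "reverse": "TTGACCGTAGCTAGGCATCA", "product_bp": 250},
    {"name": "markerB", "forward": "GCTAGCTAGGATCCGATTAC",
     "reverse": "CATGGCTAACGTTAGCCTAG", "product_bp": 480},
    {"name": "markerC", "forward": "AGCTTAGGCATCGATCGTAA",
     "reverse": "TGCATCGGATTACGCTAGAT", "product_bp": 500},
]

for problem in multiplex_conflicts(markers):
    print(problem)
```

In this made-up set the primers are compatible in temperature, but two of the products are too close in size, so one of those markers would need to be redesigned or moved to a second reaction.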

MAS in Orphan Crops


The development of genetic markers is complex
and costly in species with little pre-existing
genomic information, such as orphan, neglected
or underutilised crops that nevertheless have potential for
human welfare. Such orphan crops possess some
of the largest and least studied genomes among
cultivated crop plants, and only a few gene-based
genetic maps have been reported in such crops.
The development of new markers in orphan crops
will be an essential step for MAS to be adopted as
a routine procedure in the breeding
programmes of such crops. Many regional working groups are
now engaged in developing molecular markers in
those crops. This includes the utilisation of
SCAR, SRAP, ISSR, AFLP, SSR and SNP markers (see chapter 3). New SSRs may be developed
from SSR-enriched libraries of locally adapted
genotypes, from ESTs or by transferring SSRs
across species. The development of
SSRs together with increasingly larger sets of
transferable markers such as ESTs in orphan
crops should provide direct bridges among
genetic maps, not only streamlining
high-resolution mapping and positional cloning
of major QTLs or genes of interest but also enabling the
development of many types of DNA markers
such as STSs, SCARs or SNPs that will greatly
help in establishing MAS systems in orphan
crops.

Evaluation of the extent of linkage disequilibrium in exotic and domesticated germplasm is yet
another requirement. Phenotypic evaluation of
multiple populations per species should be conducted so that the locations of quantitative trait
loci for important agronomic traits can be
identified by genetic and association mapping.
The accumulation of mapping information will
facilitate the exploration of syntenic regions
across orphan crops. These genetic tools will also
help in the construction of physical maps of chromosomes in orphan crops. Construction of physical
maps will allow better understanding of such
complex genomes and facilitate cloning and
manipulation of traits of economic interest.
This will also help to better understand the secondary metabolism involved in interactions
between neglected crops and pathogens, symbiotic organisms, predators and pollinators and will
lead to varieties with enhanced yield potential,
nutritional benefits, resistance to pests and diseases and tolerance of adverse environmental
conditions.
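The extent of linkage disequilibrium mentioned at the start of this passage is usually summarised per marker pair by D, D' and r². The sketch below shows the standard calculation for two biallelic loci. It assumes phased haplotype counts are available (often not the case for raw genotype data), and the function name and example counts are hypothetical.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) for two biallelic loci from phased haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total            # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total    # frequency of allele A at locus 1
    p_B = (n_AB + n_aB) / total    # frequency of allele B at locus 2

    # D: deviation of the observed haplotype frequency from independence.
    d = p_AB - p_A * p_B

    # r^2: squared correlation between alleles at the two loci.
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else float("nan")

    # D': D scaled by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")
    return d, d_prime, r2


# Hypothetical haplotype counts for one marker pair.
print(ld_stats(40, 10, 10, 40))
```

Computing r² for many marker pairs and plotting it against genetic or physical distance gives the LD decay curve, which in turn indicates how dense a marker set association mapping would need in a given germplasm panel.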
Using molecular marker technology, it is now
feasible to analyse quantitative traits such as
salt tolerance and identify the chromosomal
regions (QTLs) associated with such characters.
Identifying such regions will significantly help to
increase the selection efficiency in the breeding
programmes. Molecular marker-assisted selection is considered to be faster, more efficient and
probably more cost effective than conventional
screening particularly for abiotic stresses where
expression of the trait is subject to significant
environmental effects. It will also help narrow
down the possible candidate genes and ultimately
lead to map-based cloning of the major genes
controlling the trait of interest, opening a new
avenue for genetic manipulation using the real
candidate genes, since several such underutilised crops have been shown to be well adapted to
unfavourable environmental conditions. With
the recent advances in DNA sequencing and single nucleotide polymorphism (SNP) genotyping,
new approaches to QTL mapping and quantitative trait nucleotide (QTN) identification are now
available, and this could be applied to orphan
crops for identification of phenotype-related SNPs.
Once genes responsible for quantitative variation
are identified, information can be passed on to
those crop breeding programmes to enable implementation of MAS. This will greatly help in accelerating the breeding programme. In addition,
traditional breeding efforts will be greatly enhanced
through integrated approaches using functional,
comparative and structural genomics. It should be
kept in mind, however, that optimisation of marker
genotyping methods in terms of cost-effectiveness
and a greater level of integration between molecular and conventional breeding represent the
main challenges for the future adoption and
impact of MAS on orphan crop breeding.
Orphan crops are widely distributed across the
Mediterranean region and have shallow soil
requirements, and their cultivated accessions have
variable seed yields in Mediterranean environments. In addition, some of them are highly nutritious: for example,
yellow lupin seeds have the highest protein content and twice the cysteine and methionine content
of most lupins. However, despite these highly nutritional qualities, there is a lack of genetic and
molecular tools to aid the breeding of
this species. Nevertheless, some progress has been
made in certain orphan crops. EST sequencing
has accelerated gene discovery when genome
sequences are not available, facilitating gene family identification and development of molecular
markers. Next-generation sequencing has generated an enormous amount of expressed sequence data
for a large number of plant species, especially minor
or orphan crops. For example, EST and genome
sequencing of lentil and chickpea would not have
been feasible without next-generation sequencing.
The lower cost and greater sequence yield have
allowed the identification of candidate genes, even
when they are expressed at low levels.
Research on plants, animals and fungi has
shown that sequences of expressed genes are
often widely transferable among species, and
even genera, allowing wide genome comparative
mapping studies (see chapter 7). For instance, the
combination of orphan crop EST sequences with
model plant genetic and genomic resources, such
as Lotus japonicus (Japanese trefoil) and
Medicago truncatula (barrel medic), has identified
macro- and microscale synteny, discovered new
genes and alleles and provided insights into
genome evolution and duplication. Comparisons
between ESTs and gene sequences among several
legume species have allowed comparative
genome studies between L. albus and M. truncatula, and L. angustifolius and Lotus japonicus.
The use of molecular markers and the development of suitable mapping populations will
allow significant progress in mapping to enhance
breeding strategies in orphan crops. For example,
local faba bean variety Hassawi 2, with drought
tolerance and excellent cooking quality, was used
with an introduced small black seeded Pakistani
variety for developing a mapping population in
an attempt to map QTLs for drought tolerance in
Vicia faba. Those studies proved that some
physiological parameters such as stomatal
conductance, leaf rolling and leaf temperature as
well as grain yield under stress are well associated with drought tolerance. These parameters
along with water use efficiency and proline
content could be utilised in plant phenotyping.
Breeding programmes for drought tolerance in
faba bean should consider the genetic diversity in
the tested genotypes for physiological, morphological and agronomical traits and the important
correlations among these traits. Significant
correlations allow the utilisation of relatively
simple traits as indirect selection criteria for
drought tolerance in faba bean breeding. Other
drought tolerance traits investigated in a number of
field legumes include dry matter accumulation
under stressed and unstressed environments,
relative water content (RWC), stomatal frequency,
stomatal size, transpiration efficiency, carbon isotope discrimination, leaf temperature and osmotic
potential. These traits have been found to be
significantly associated with drought tolerance and
could be utilised in selection for drought tolerance.
There is an urgent need to identify chromosomal
regions associated with economically important
traits in faba bean. Identification of expression
QTLs (eQTLs) will help in narrowing down
candidate genes for traits of interest and lead to
an increased number of QTLs for agronomically
important traits for faba bean improvement.
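The use of a simple, correlated trait as an indirect selection criterion, as discussed above, can be made concrete with the standard correlated-response expressions of quantitative genetics. These are textbook results rather than formulas given in this chapter, and the notation is an assumption made here: X is the easily measured trait (for example, leaf temperature), Y is the target trait (for example, grain yield under stress), i is the selection intensity, h_X and h_Y are the square roots of the heritabilities, r_A is the additive genetic correlation and sigma_{P_Y} is the phenotypic standard deviation of Y.

```latex
CR_Y = i \, h_X \, h_Y \, r_A \, \sigma_{P_Y}, \qquad
R_Y = i \, h_Y^{2} \, \sigma_{P_Y}, \qquad
\frac{CR_Y}{R_Y} = \frac{r_A \, h_X}{h_Y}
```

At equal selection intensity, indirect selection on X is therefore expected to outperform direct selection on Y only when r_A h_X exceeds h_Y, that is, when the simple trait is both strongly genetically correlated with the target and more heritable than it.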
One of the functional genomic approaches to
identify candidate genes responsible for a trait of
interest is through differential expression strategies.
DNA chips and subtractive hybridisation are
among the tools of choice to identify abiotic
stress responsive genes. Many genes are expected
to be drought responsive, of which only a few
are the real candidate genes. Combining
the QTL approach with a differential display strategy will allow narrowing down the possible candidate genes by focusing only on those responsive
genes in the major QTL regions in faba bean. In
summary, using bioinformatics tools and analysis
of gene motifs, real candidate genes could be
identified in faba bean. Further PCR-based
validation using primers designed from such candidate genes
will demonstrate the efficiency of the
genes identified. This will allow trait manipulation and eventually will lead to the development
of stress tolerant faba bean genotypes. The availability of second-generation sequencing and
high-throughput technology in parallel with
other genomic approaches will facilitate the
analysis of transcripts, proteins and insertional
and chemically induced mutants and will allow
the relationship between gene function and phenotype
to be understood.
Furthermore, developing efficient regeneration protocols will allow successful in vitro
culture and genetic transformation in orphan
crops. This will facilitate the development of
transgenic plants in such underutilised crops with
excellent biotic and abiotic stress tolerance and
open a new avenue for functional genomics and
crop manipulation. Ultimately, this will help in
developing better genotypes of underutilised
crops that are suitable for local and regional
ecosystems and in enhancing the role of orphan crops
for conservation agriculture in arid and semiarid
regions.

MAS in Developing Countries


Though there have been successful examples of MAS
in developed countries, the transfer and
application of new plant biotechnologies to
developing countries are recognised as a big challenge, and solutions can be found only through
innovative partnerships and collaborations with
advanced laboratories. Molecular breeding for
polygenic traits has been successfully deployed
in the multinational private sector, and several
experts in the art see molecular plant breeding as
the foundation for twenty-first century crop
improvement.
Although the number of success stories is
increasing, it is fair to say that, in today's reality,
MAS application for complex traits in breeding
programmes remains primarily limited to the private sector and is barely used in developing countries. The reasons for this situation in developing
countries are a shortage of well-trained personnel,
inadequate access to high-throughput genotyping, inappropriate phenotyping infrastructure,
unaffordable information systems and analysis
tools and the logistical difficulty of integrating
new approaches with traditional breeding methodologies, including problems when scaling
up from small to large breeding programmes.
Therefore, except for leading emerging economies, the capacity to conduct intensive research
in plant biology and to support plant breeding
remains rather limited in developing countries,
and in some cases it has even decreased over the
last decade. For example, although there has been
a strong focus on agricultural development in
Africa in recent years, many of the African breeding institutes, especially those in sub-Saharan
Africa, remain dependent on international support
for agricultural research. These needier institutes
tend to be in countries whose population has a
high proportion of resource-poor people; thus,
building the capacities of breeding programmes
and seed systems in those countries is vital to
achieving any improvement in the ability of poor
farmers to grow improved varieties. In order to
realise the full potential of marker technologies
and bioinformatics in plant breeding, tools for
molecular characterisation, accurate phenotyping, efficient information systems and effective
data analysis must be integrated with breeding
workflows managing pedigree, phenotypic,
genotypic and adaptation data into efficient
information systems. With all the progress
achieved in marker technology, software development, analytical pipelines and data management
systems, it is time to provide an information
system, available through a public platform that
will offer breeding programmes in developed and
developing countries access to modern breeding
technologies, in an integrated and configurable
way, to boost crop quality and productivity.
There are several constraints in developing
countries that hamper the application of MAS.
Some relate to access to information and publications. Others relate to data collection, management and storage, such as availability of systems
for reliable sample and data tracking. Very important are the scientific and technical concerns
involved in adequate experimental design, precise
and reliable trait phenotyping (i.e. dissection of
complex traits), dependable marker validation
and advanced analytical methodologies and tools
for accurate decision making, among others.
Thus, the main challenges hampering the potential
of molecular breeding in developing countries
encompass (1) human resources, (2) infrastructure
capacity, (3) access to marker technologies and
(4) availability of an efficient data management
system. Human capacity for molecular breeding
technologies in developing countries is an on-going
challenge, and limitations include substandard
agriculture programmes at universities; difficulties
in keeping up to date with relevant developments,
including failures by others; poor technical skills
in core disciplines; isolation as a result of
insufficient peer critical mass in the workplace;
and poor incentives to attract and retain scientists, resulting in brain drain and staff turnover.
Fortunately, the establishment of marker
service laboratories has brought a clear change in mentality: breeders need to be trained in how to analyse
the data, not in how to run marker genotyping.
There is general acceptance that large-scale genotyping activities are best outsourced, while nobody
questions the need for basic local laboratories. For breeders to efficiently access relevant information generated by themselves and by other researchers,
reliable data management (including sample
tracking, data collection and storage and modern
analytical methodologies and tools for accurate
decision making, among others) is critical both
within a given molecular breeding programme
and across programmes. In view of this, it is
essential that breeders manage pedigree, phenotypic and genotypic information through common or mutually compatible crop information
systems.
However, amidst the challenges there are also
actual and potential opportunities. Several of the
constraints listed above, in particular access to
marker technologies and limited data management systems, can be overcome through the establishment of crosscutting technology and service
platforms, and several international initiatives are
supporting the development of such platforms in
tight collaboration with partners from developing
countries. To partially offset the undesirable trend
of losing the champions, novel international initiatives such as the Alliance for a Green Revolution
in Africa (AGRA) support high-quality education
in the South, and although there is still a long way
to go, governmental and institutional commitment
is increasing for the adoption of biotechnologies
in developing countries (Delannay et al. 2012).

Community Efforts in Developing Countries and Their Implications in MAS

The recent emergence of affordable large-scale
marker technologies (e.g. Diversity Arrays
Technology (DArT), SNPs), the sharp decline of
sequencing costs boosting marker development
based on sequence information and the explicit
efforts of national agricultural research programmes (e.g. in India) and international initiatives such as the Generation Challenge Programme
(GCP) have all resulted in a large increase in
the number of genomic resources available for
less-studied crops. As a result, most key crops in
developing countries now have adequate genomic
resources for meaningful genetic studies and
most MAS applications. In more recent times,
the capacity of the national breeding institutes, in
terms of their financial resources, infrastructure
and expertise, has evolved in a somewhat country-specific manner, reflecting the health of their
domestic economies. Thus, capacity has degraded
in some countries, while in others there have been
major improvements, as evidenced by a change
from requiring training and support from large
international programmes to becoming mutual
partners in agricultural research. This is reflected
in the sharp differences in capacity to conduct and
apply biotechnological research in developing
countries.
Interestingly, newly industrialised countries
such as Brazil, China, India, Mexico, South
Africa and Thailand substantially invest in technology and research and development (R&D)
and are self-reliant in most aspects of marker
technologies. These countries have the concomitant potential to effectively adopt, adapt and apply
information and communication technologies to
enhance research efficiency and outputs. They
are therefore naturally at the frontline in adopting
molecular breeding technologies. These institutes
are beginning to communicate with one another,
as illustrated by the 2006 agreement between
Brazil, China and India to collaborate in the area
of agriculture, including the exchange of genetic
resources and joint efforts in plant biology and
breeding.
On the other hand, mid-level developing world
economies such as Colombia, Indonesia, Kenya,
Morocco, Uruguay and Vietnam are well aware
of the importance of MAS, and some effectively apply
marker technologies for germplasm characterisation and selection of major genes. These countries have a matching potential for a limited
utilisation of molecular breeding platforms, a
potential that can be enhanced fairly rapidly in
the medium to long term. In contrast, low-level
developing world economies are struggling to
sustain even basic conventional breeding. They
make very limited or no use
of molecular breeding and are unlikely to adopt
molecular breeding platforms except in the long
term. Owing to its ability to generate precise trait linkage information
for specific regions of the genome quickly and
cost-effectively, MAS is
expected to improve the efficiency of crop
breeding and progressively increase genetic gains
by selecting and stacking favourable alleles at target loci with markers. Comparing the cost-effectiveness of MAS with that of phenotypic selection
is not straightforward. Firstly, interlinked factors
other than cost, such as trade-offs between time
and money, are likely to play an important role in
determining the choice of screening method.
Secondly, the choice between MAS and conventional selection may be complicated by the fact
that the two are rarely direct substitutes for one
another or mutually exclusive, and in fact they
are quite complementary under most breeding
schemes. Where operating capital is not a limitation, MAS maximises the net present value. Moreover,
with the decrease in marker data point cost and
increased access to marker service laboratories,
marker-assisted breeding operating costs are
shrinking, making this approach increasingly
attractive from an economic perspective.
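The net present value comparison referred to above can be set up in a few lines. The sketch below is purely illustrative: the annual costs, years to variety release, annual benefits, time horizon and discount rate are all hypothetical figures chosen to show the mechanics (higher yearly cost but earlier release for MAS), not estimates from any study cited here.

```python
def npv(cash_flows, rate):
    """Net present value of yearly cash flows; cash_flows[0] falls in year 0."""
    return sum(cf / (1 + rate) ** year for year, cf in enumerate(cash_flows))


def scheme_cash_flows(annual_cost, years_to_release, annual_benefit, horizon):
    """Costs are paid each year until release; benefits accrue every year after."""
    return [
        -annual_cost if year < years_to_release else annual_benefit
        for year in range(horizon)
    ]


# Hypothetical figures (arbitrary currency units), for illustration only.
conventional = scheme_cash_flows(
    annual_cost=100, years_to_release=8, annual_benefit=400, horizon=20
)
mas = scheme_cash_flows(
    annual_cost=150, years_to_release=5, annual_benefit=400, horizon=20
)

for label, flows in (("conventional", conventional), ("MAS", mas)):
    print(label, round(npv(flows, rate=0.05), 1))
```

Because benefits that arrive earlier are discounted less, shortening the breeding cycle can outweigh a higher yearly operating cost; how large that advantage is depends entirely on the assumed inputs.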
Few economic analyses have been undertaken
to assess the potential impacts of MAS. A well-known
example is the impact of the submergence tolerance
gene for rice in Asia. Among the few analyses
available is an evaluation of the economic benefits
of MAS to develop rice varieties with tolerance to
salinity and P deficiency in Bangladesh, India,
Indonesia and the Philippines, since DNA molecular markers for these traits are available (see
chapter 11). Encompassing a broad set of economic parameters, the study concluded that MAS
is estimated to save at least 2 to 3 years, resulting in
significant incremental benefits in the range of
USD 300 to 800 million, depending on the country,
abiotic stress and lag for conventional breeding.
Another study estimates the benefits of using
marker-assisted breeding, as compared with conventional breeding alone, in developing cassava
varieties resistant to cassava mosaic disease,
green mite, whitefly and postharvest physiological
deterioration in Nigeria, Ghana and Uganda.
Marker-assisted breeding is estimated to save at
least 4 years in the breeding cycle for varieties
resistant to the pests and to result in incremental
net benefits over 25 years in the range of USD
34 to 800 million depending on the country, the particular constraint and various assumptions.
The key technical constraint to the efficient
management of crop information across the
layers of implementation is standardisation and
consistency. At the crop level, the most important
key to data integration is a community-accepted
trait dictionary: an ontology of the traits of interest for
each crop together with a set of effective protocols for their evaluation, including scales or units
of measurement and data quality standards.
Developing, maintaining and supporting integrated breeding informatics applications are also
critical. This would include the design of databases to manage crop information from any crop
and the development of user applications to facilitate breeding processes. These would need to be
configured to the best practices for each crop to
provide common functionality under different
community efforts.
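A trait dictionary of the kind described above is, at its simplest, a shared lookup table against which every incoming observation is checked. The following sketch shows one possible shape for it. The trait identifiers, units, permitted ranges and the `validate_record` helper are all hypothetical; a real dictionary would be defined and maintained by the crop community.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraitDefinition:
    name: str
    unit: str        # agreed unit or scale for this trait
    minimum: float   # lowest plausible value (data quality standard)
    maximum: float   # highest plausible value


# Hypothetical entries; a real dictionary would be agreed by the crop community.
TRAIT_DICTIONARY = {
    "PH": TraitDefinition("plant height", "cm", 0.0, 400.0),
    "DTF": TraitDefinition("days to flowering", "day", 0.0, 365.0),
    "DIS": TraitDefinition("disease score", "1-9 scale", 1.0, 9.0),
}


def validate_record(record):
    """Return a list of problems found in one phenotypic observation."""
    trait = TRAIT_DICTIONARY.get(record.get("trait_id"))
    if trait is None:
        return [f"unknown trait id: {record.get('trait_id')!r}"]
    problems = []
    if record.get("unit") != trait.unit:
        problems.append(
            f"{trait.name}: expected unit {trait.unit!r}, got {record.get('unit')!r}"
        )
    value = record.get("value")
    if not isinstance(value, (int, float)) or not (
        trait.minimum <= value <= trait.maximum
    ):
        problems.append(
            f"{trait.name}: value {value!r} outside {trait.minimum}-{trait.maximum}"
        )
    return problems


# A record measured in metres instead of the agreed centimetres is flagged.
print(validate_record({"trait_id": "PH", "unit": "m", "value": 0.95}))
```

Rejecting or flagging records at the point of entry, rather than during analysis, is what allows data from different programmes to be pooled without manual reconciliation.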

Field and Laboratory Infrastructure Improvement

Reliable phenotypic data are a must for high-quality genetic studies, and most developing countries
lack suitable field infrastructure for proper trials
and collection of accurate phenotypic data.
Guidelines on best practice must be provided on
how to design and run a trial and conduct precise
phenotyping for genetic studies under different
target environments. Improving access to homogeneous field areas and paying attention to good
soil preparation and homogeneous sowing are
critical. Until a few years ago, the major investment required to establish large-scale marker
technology was considered a large impediment to
the application of molecular breeding in developing countries. One of the challenges in conducting
agronomic research in developing countries is
that research stations are often underfunded
and understaffed and do not have the resources
necessary to establish and maintain the field environments appropriate for quality phenotyping.
Even with the availability of the best genotyping
resources, integrated molecular breeding programmes will be doomed to failure in the absence
of quality phenotypic data to support the proper
identification of the main QTLs affecting key
target traits.
The ability to generate genotyping data has
been one of the main stumbling blocks preventing
wide utilisation of markers in developing countries. Molecular markers rely on the availability
of high-quality laboratories able to perform
the necessary molecular biology operations. For
simple sequence repeat (SSR) markers, these
operations include at a minimum high-quality
DNA extraction, polymerase chain reaction (PCR)
amplification, gel electrophoresis and gel scoring.
Performing those operations requires well-trained
technicians and the availability of well-equipped
laboratories with stable electricity supply, reliable
supply of clean water, room temperature and
humidity control and the scientific equipment
necessary to perform those tasks. Refrigerators
and freezers (regular freezers and −80 °C freezers)
also need to be in operation on an uninterrupted
basis to store temperature-sensitive reagents,
primers and DNA samples. Automatically
triggered power generators need to be installed
when a reliable electrical supply cannot be
guaranteed. A first attempt to resolve this issue
has been for donor organisations to fund the construction of genotyping laboratories in various
places of the Third World. However, except for
large, well-funded centres, this was often not successful because sustained resourcing was not
available to hire qualified personnel and to purchase and maintain the necessary equipment and
reagents. The logistics of reliably shipping perishable reagents to remote areas of the Third World
is also often an obstacle. As a result, there are
unfortunately a number of poorly equipped laboratories lying idle in some remote parts of Africa.
In spite of that, a few local centres, such as the
National Root Crop Research Institute (NRCRI)
in Umudike, Nigeria, have been successful in
establishing low-throughput laboratories that can
serve the basic genotyping needs of their breeders. An intermediate solution is to rely on regional
hubs. Those hubs should be relatively well-funded and well-equipped laboratories that can
handle primarily SSR genotyping for interested
parties. Part of the strategy is to rely on four
hubs covering the needs of the Americas (Centro
Internacional de Agricultura Tropical, CIAT,
www.ciat.cgiar.org), Africa (BioSciences eastern
and central Africa, BecA, http://hub.africabiosciences.org), South Asia (International Crops
Research Institute for the Semi-Arid Tropics,
ICRISAT, www.icrisat.org) and Southeast Asia
(International Rice Research Institute, IRRI,
www.irri.org). Those hubs will be able to provide
basic genotyping needs and at the same time help
train local scientists in the fundamentals of
molecular breeding.
Full integration of molecular markers into
breeding programmes will require the availability of high-throughput and low-cost genotyping
platforms primarily based on SNPs. SNPs are the
only marker type that can meet the long-term
needs of integrated molecular breeding so that it
can be widely applied in a cost-effective manner.
However, high-throughput SNP genotyping
requires the use of highly automated laboratories
using an array of sophisticated equipment
(pipetting robots, high-density PCR, high-throughput SNP detection machines, high-level
informatics). Although large private seed companies have had the need and the resources to put in
place large-scale genotyping laboratories for their
own uses, smaller programmes, especially in the
public sector, have typically not had the resources
or the justification to establish and maintain such
large operations to meet their increasing needs
for SNP genotyping data. In response to this
need, a few private marker service laboratories
have sprung up over the past few years. Those
laboratories can provide complete genotyping
services for their customers, from DNA extraction to generation of large numbers of SNP or
other data points. Due to their broad customer
base (from medical research laboratories to animal and plant breeding operations, both public
and private), such laboratories can have the large
volume of data point production that can lead to
low costs to the customer and high throughput.
They are able to invest in the most advanced
equipment to keep up with the constant evolution
of genotyping technologies and are able to pass
on the resulting benefits to their customers.
Processes have now been put in place for rapid
shipment of dried leaf samples from any location
(field or laboratory) around the world without
the phytosanitary and similar restrictions that can
affect the shipment of seed or other viable
tissues.
Contract genotyping is also generally exempt
from material transfer agreements (MTAs) and
other intellectual property requirements because
the material being sent is not viable and will not
be used for any other purpose than the generation
of genotyping data for the exclusive benefit of the
customer. Examples of such companies that can
service breeding programmes from around the
world are DNA Land-Marks, Inc. of Saint-Jean-sur-Richelieu, Quebec, Canada (http://www.dnalandmarks.ca/english), and KBioscience Ltd. of
Hoddesdon, UK (http://www.kbioscience.co.uk).
This approach represents a very attractive solution for large-scale integration of markers into
Third World country breeding programmes, as it
does not necessitate any heavy capital investment
and it completely removes the maintenance and
equipment upgrade issues.

Lessons Learnt and Concluding Remarks

Marker-assisted selection that complements a regular conventional breeding programme increases
genetic gain per crop cycle, stacks favourable
alleles at target loci and reduces the number of
selection cycles. In the last decade, the multinational private sector has benefitted immensely
from MAS, which demonstrates its efficacy.
In contrast, its adoption is still limited in the public sector, and it is hardly used in developing
countries. Major bottlenecks in these countries
include shortage of well-trained personnel, inadequate high-throughput capacity, poor phenotyping infrastructure, lack of information systems or
adapted analysis tools or simply resource-limited
breeding programmes. The emerging virtual
platforms aided by the information and communication technology revolution will help to overcome some of these limitations by providing
breeders with better access to genomic resources,
advanced laboratory services and robust analytical and data management tools. Apart from some
advanced national agricultural research systems,
the implementation of large-scale molecular
breeding programmes in developing countries
will take time. However, the exponential development of genomic resources, including for less-studied crops, the ever-decreasing cost of marker
technologies and the emergence of platforms for
accessing MAS tools and support services, plus
the increasing public-private partnerships and
needs-driven demand for improved varieties to
counter the global food crisis, are all grounds to
predict that MAS will have a significant impact on
crop breeding in developing countries. These
predictions are supported by some preliminary
successful examples presented in previous
chapters 9 and 11. Advances in genomics research
are generating new tools, such as functional
molecular markers and informatics, as well as
new knowledge about statistics and inheritance
phenomena that could increase the efficiency and
precision of crop improvement. In particular, the
elucidation of the fundamental mechanisms of
heterosis and epigenetics, and their manipulation,
has great potential. Eventually, knowledge of the
relative values of alleles at all loci segregating in a
population could allow the breeder to design a
genotype in silico and to practise whole-genome
selection for minor crops in developing countries.
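The idea of designing a genotype in silico from allele values can be illustrated with a minimal sketch. It assumes a purely additive model in which each marker has one estimated effect per copy of the scored allele; dominance, epistasis and genotype-by-environment interaction are ignored. The marker names, effect values, plant names and function names below are hypothetical and are not taken from any dataset or software discussed in this book.

```python
# Illustrative sketch only: markers, effects and genotypes are hypothetical.

def breeding_value(genotype, effects):
    """Predicted value: allele dosage (0, 1 or 2 copies) times estimated effect, summed over markers."""
    return sum(effects[marker] * genotype[marker] for marker in effects)


def design_genotype(effects):
    """In silico target genotype: fix the scored allele where its effect is positive, drop it otherwise."""
    return {marker: 2 if effect > 0 else 0 for marker, effect in effects.items()}


def rank_candidates(candidates, effects):
    """Order candidate plants from highest to lowest predicted value."""
    return sorted(
        candidates,
        key=lambda name: breeding_value(candidates[name], effects),
        reverse=True,
    )


# Assumed per-allele effects for three markers (arbitrary trait units).
effects = {"M1": 0.5, "M2": -0.2, "M3": 0.3}

# Assumed genotypes of three candidate plants, coded as allele dosage.
candidates = {
    "plant_A": {"M1": 2, "M2": 2, "M3": 0},
    "plant_B": {"M1": 1, "M2": 0, "M3": 2},
    "plant_C": {"M1": 0, "M2": 1, "M3": 1},
}

ideal = design_genotype(effects)
ranking = rank_candidates(candidates, effects)
print("Target genotype:", ideal)
print("Candidates, best first:", ranking)
```

In practice the effects would have to be estimated from a phenotyped and genotyped training population, and the number of markers would run to thousands; the sketch only shows how known allele values translate into a target genotype and a ranking of candidates.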
Considerable progress has been made in building
infrastructure for applying genomics approaches.
This includes one-dimensional genetic information (genome sequences), many ESTs and gene
knockout populations in several plant species
of biological and agronomic importance. New
knowledge and new tools are changing the strategies used in crop plant research and will thus
reduce the costs and increase the throughput of
the assays. There is a continuing need to integrate
disciplines such as structural genomics, transcriptomics, proteomics and metabolomics with plant
physiology and plant breeding. Bioinformatics is
providing the means for integration and structured
interrogation of datasets that will facilitate
the cross-fertilisation of disciplines. Genomics
research has successfully unravelled various metabolic pathways and provided molecular markers
for agronomic traits. However, the mechanisms
of epigenetic phenomena are only beginning to
be understood, and their potential role in crop
improvement is unknown. Similarly, tantalising
bits of information concerning the possible basis
of heterosis are gradually emerging. Eventual
elucidation of the mechanism of heterosis might
be one of the most important contributions of
molecular genetics research to crop improvement.
Ultimately, the goal of the breeder will be to assay
the genetic make-up of individual plants rapidly
and to select desirable genotypes in breeding
populations. The construction of graphical genotypes of each plant or progeny row would allow
the breeder to determine which chromosome sections are inherited from each parent to facilitate
the selection process and perhaps to reduce the
need for extensive field tests. A logical extension
of whole-genome selection for the breeder would
be to design the superior genotypes in silico, an
approach described as "breeding by design".
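A graphical genotype can be sketched in a few lines of code. The example assumes that each marker on a chromosome has already been assigned a parental origin ("A" or "B") and a map position in centimorgans; the marker names, positions and function names are hypothetical. Consecutive markers sharing the same origin are collapsed into chromosome segments, which is the information a breeder would read from the graphical display.

```python
from itertools import groupby

# Illustrative sketch only: marker names, positions and calls are hypothetical.


def parental_segments(calls):
    """Collapse ordered marker calls into contiguous runs of one parental origin.

    calls: list of (marker, position_cM, parent) tuples sorted by position.
    Returns a list of (parent, first_position, last_position) tuples.
    """
    segments = []
    for parent, run in groupby(calls, key=lambda call: call[2]):
        run = list(run)
        segments.append((parent, run[0][1], run[-1][1]))
    return segments


def draw(calls):
    """One-line text rendering: one letter per marker, in map order."""
    return "".join(parent for _, _, parent in calls)


# Assumed calls for one chromosome of one plant.
chromosome = [
    ("m1", 0.0, "A"),
    ("m2", 10.0, "A"),
    ("m3", 25.0, "B"),
    ("m4", 40.0, "B"),
    ("m5", 55.0, "A"),
]

print(draw(chromosome))
for parent, start, end in parental_segments(chromosome):
    print(f"parent {parent}: {start} to {end} cM")
```

The exact crossover points lie somewhere between the flanking markers of adjacent segments, so the segment boundaries reported here are only as precise as the marker density allows.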
Thus, in the post-genomics era, high-throughput approaches combined with automation,
increasing amounts of sequence data in the public
domain and enhanced bioinformatics techniques
will contribute to genomics research for crop
improvement. However, the costs of applying
genomics strategies and tools often exceed the resources
available to commercial or public breeding
programmes, particularly for crops that are only
of regional importance. Newly developed genetic
and genomics tools will enhance, but not replace,
the conventional breeding and evaluation process. The ultimate test of the value of a genotype
is its performance in the target environment and
acceptance by farmers and consumers.

Bibliography
Literature Cited
Delannay X, McLaren G, Ribaut JM (2012) Fostering
molecular breeding in developing countries. Mol
Breed 29:857–873

Further Readings
Ali HQ et al (2012) An overview of genomics assisted
improvement of drought tolerance in maize (Zea mays
L.): QTL approaches. Afr J Biotechnol 11(65):
12839–12848
Fauquet CM, Taylor NJ, Tohme J (2012) The global cassava partnership for the 21st century (GCP21). Trop
Plant Biol 5:4–8
Foolad MR, Panthee DR (2012) Marker-assisted selection
in tomato breeding. Crit Rev Plant Sci 31(2):93–123
Fridman E, Zamir D (2012) Next-generation education in
crop genetics. Curr Opin Plant Biol 15:218–223
Isemura T, Kaga A, Tabata S, Somta P, Srinives P et al
(2012) Construction of a genetic linkage map and
genetic analysis of domestication related traits in
Mungbean (Vigna radiata). PLoS One 7(8):e41304.
doi:10.1371/journal.pone.0041304
Khan M (2012) Current status of genomic based approaches
to enhance drought tolerance in rice (Oryza sativa L.):
an over view. Mol Plant Breed 3(1):1–10. doi:10.5376/
mpb.2012.03.00
Liu Y, He Z, Appels R, Xia X (2012) Functional markers
in wheat: current status and future prospects. Theor
Appl Genet 125:1–10
Nakaya A, Isobe SN (2012) Will genomic selection
be a practical method for plant breeding? Ann Bot
110(6):1303–1316. doi:10.1093/aob/mcs109
Panthee DR, Foolad MR (2012) A re-examination of
molecular markers for use in marker-assisted breeding
in tomato. Euphytica 184:165–179
Sharma HC et al (2002) Applications of biotechnology for
crop improvement: prospects and constraints. Plant
Sci 163:381–395
Varshney RK, Graner A, Sorrells ME (2005) Genomics-assisted breeding for crop improvement. Trends Plant
Sci 10(12):621–630
Xu Y et al (2012a) Whole-genome strategies for
marker-assisted plant breeding. Mol Breed
29:833–854
Xu Y, Li Z-K, Thomson MJ (2012b) Molecular breeding
in plants: moving into the mainstream. Mol Breed
29:831–832

About the Author

N. Manikanda Boopathi is presently working
as an Assistant Professor (Biotechnology) at
the Department of Plant Molecular Biology
and Bioinformatics, CPMB&B, Tamil Nadu
Agricultural University, Coimbatore, India. He
graduated in agricultural sciences, did his master's
and doctoral studies in Plant Biotechnology and
trained at the International Rice Research Institute,
the Philippines. He has taught more than 20
courses for undergraduate and postgraduate
students at his university and is frequently invited
to deliver lectures at institutions
both in India and abroad. His scientific work has
been recognised on several occasions and
has brought him laurels and awards. He has
vast experience in QTL mapping and marker-assisted
selection in rice and cotton. He has
completed several national and international
research projects and is currently working on
two countrywide and one worldwide network
projects that address the problems of biotic and
abiotic stresses in cotton, mungbean, hot pepper
and tomato using system quantitative genetics.
His publications can be found at http://sites.google.com/site/drnmboopathi
and/or http://tnaucottondatabase.wordpress.com/.

N.M. Boopathi, Genetic Mapping and Marker Assisted Selection: Basics, Practice
and Benefits, DOI 10.1007/978-81-322-0958-4, Springer India 2013
