Mapping the Genome: Some Combinatorial Problems Arising in Molecular Biology
Richard M. Karp*
Introduction
Experimental
Techniques
The
hereditary
information offspring
that is encoded
all
living in DNA
We briefly describe some of the basic experimental techniques that are used in physical mapping. The bibli-
of papers giving
are tran-
A r-estriction
the living
sequence of nucleotides
chemical
cesses of life. A DNA tertwined nucleotides strands on the human nucleot The many one strand molecule helical drawn is bonded and is a polymer Each the set from consisting strand {A, C, T, G}. of two The each inof two A on T
ing a DNA molecule at any site where that sequence sites for the occurs. Such sites are called restriction given restriction enzyme. A restriction enzyme can be digest, in which the DNA is used to obtain a complete
cut into disjoint consecutive different restriction each restriction fragments, restriction copies sites, sites, of the so that by two is obtained. called gel electrophoresis out size of each a collecaccording can each of which is bounded digest, at difof fragconsecuby two in which ferent ments, tive) sites, DNA a wide (not or a par+al are cleaved selection
strands.
is a sequence
are complementary other, ides. ultimate other the goal efforts genetic each
in the C is bonded
sense that
to its complementary
nucleotide
bounded
necessarily
chromosome}
each
strand
Using of the Human in molecular DNA of humans information goal Genome biology and other contained Project species therein. A physical fragmarkDNA to until and and A physt ion to size,
of DNA
can be separated
is to sequence
approximate
fragment
be determined. Cloning incorporated replication of macroscopic and The thus most makes perimentally incorporate cosmids, cleotides can is a process into process in which a fragment genetic incorporated the of DNA host; is
intermediate 23 pairs
is to construct
the selffragment,
of our
as a factory
specifies of DNA
molecule.
fragment
ers provide molecule. be close the gene In order molecule ber DNA This ing lems should entists.
a clone.
of interest along
used about
an experimenter
at a marker walk
incorporate of
50,000
is identified. to construct it is necessary called and then a physical to extract clones, map from obtain how and the of a large it a large DNA num-
6h?omosomes, up a million
handle
fragments
cleot ides. A clone libraTy DNA library effort. is a procedure molecules containing together. hybridizes whether to the bond a probe in which two singleseis a collection is the starting of clones The for point covering creation any physical one of
of each
more
molecules
of interest.
a clone mapping
to a number in a primitive
of challeng-
probabilistic computer
are currently
complementary
of theoretical
of nucleotides whether
is radioactively to determine
to a given
it is possible
piece
that
is complementary
* Supported by NSF Grant CCR-9005448.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
STOC
that
of a single
to hundreds
sizes, the locations of the restriction sites along the clone. We begin by describing the rather unrealistic (noiseless) versions of these problems, in which all measurements are assumed to be exact. We then comment on how measurement error can be incorporated into the
problems,
The
Lander-Waterman
question to construct is that by clones. nearly is to determine a physical all the DNA the map.
Model
number A minimal should inare of
A fundamental
clones needed requirement be covered troduced
4.1
A clone A, and then B
The
Double
Digest
digested
Problenn
using of A restriction then cllone al < the be enzyme using be az A IV, <:
molecule
Lander model
[LW88] clones
B, and
all of the same dent is the DNA. by any out random length Define clones, gaps. is defined
let the restriction . . . < a~, and let b1<b2 together where sequences first A = <... the the
sites the
for
enzyme B are
< bq. restriction and yields al,.. yields multiset double reconstruct the sequence {ai}
A and
is not interval
is the
of merging
as a maximal
Lander
experiment {al,az
multiset
of fragment the
number
aP_.l, IV aP}, of fragment and (in the its C2, . . . . cPtq restriction i.e., that whether would yield N bq}, problem to determine
of C.
experiment {bl,bzbl,..., iment, cP+~}. sion) lem multisets. has choice the The is to
multiset
bq bq-l, digest
that
of the library
on these in order
create
sites whether
is at least
5. Thus,
cosmid
clones
there
clones if only
would Artificial
a portion being
a Yeast long)
d, 1?,C of multisets. There are instances that are wildly ambiguous, in the sense that the number of solutions grows exponentially
hand, randomly if the from y one, that will
were
as a function
sites the are there are
of p + q. On the other
real numk~ers is unique enumerative it quickly. drawn with alsolution various find
cosmid
restriction and
4
There striction
Fingerprints
are two fundamental In the fingerprinting, of the clone of restriction fragments In and to the it second the clone. error Thus, fragments approaches first to obtaining called of complete using and using called to to the differsizes a re-
probability gorithms
certainly
fingerprint
approach, a number
4.2
The tial
Partial Digest
digest a single
Digest Problems
problem enzyme ..< the A.
and
Probed
Par-
or partial ent
combinations resulting
In
the
partial using
is partially
di-
of the
trophoresis. tion of probes, hybridize that printing gerprinting, trophoresis the remaining
sites the
be al fragments multiset
< a2 <.
fingerprinting, the
clone It
is exposed is necessary
possible
bounded of fragment
is determined
lengths. problem
be lost, values
gel elec-
b on
gives only
approximate
segregated
and
mea-
to false the
is recorded negatives,
ai Iai
and
the
again
restriction to polynomial
of restriction
il y reducible unlikely
fact orizat
of reconstructing,
to be NP-complete.
papers
discuss how the degree of ambiguity of choices for the set of restriction a given multiset of fragment lengths)
(i.e., can
set
(the the
and
a set
of intervals the
(the
sites com-
clones) order
determine occur.
left-to-right
grow as a function of the number of restriction sites. In practice, the measurement process will miss certain fragments and will give only approximate values for the sizes of the remaining fragments. Thus, instead of a perfect solution, it is necessary to look for a choice of restriction sites that best fits the noisy data. Given a
In the absence each clone to it probes. property that, in is consecutive Thus, for every
the problem left-to-right (dij ) has the the rows the ones this are
is very that
easy.
is an interval,
hybridize
in the
probabilistic model of measurement error, this problem can be cast as a maximum likelihood problem, and can be attacked by branch-and-bound methods. A successful example of this approach is [KN93].
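The maximum likelihood formulation can be illustrated with a minimal sketch. It assumes a partial digest in which every pair of sites (including the two clone ends) yields one fragment, independent Gaussian measurement errors with a common standard deviation, and no missing fragments, so that sorted predicted and observed lengths can be matched one to one. The function names and the error model are illustrative assumptions, and this is not the branch-and-bound method of [KN93]; it only scores a candidate set of sites.

```python
from itertools import combinations
from math import log, pi


def partial_digest(sites):
    """Sorted multiset of distances between all pairs of site positions."""
    pts = sorted(sites)
    return sorted(b - a for a, b in combinations(pts, 2))


def log_likelihood(sites, observed, sigma=1.0):
    """Gaussian log-likelihood of the observed lengths for candidate sites.

    Assumption: no fragment is lost, so the sorted predicted lengths and
    the sorted observed lengths are paired position by position.
    """
    predicted = partial_digest(sites)
    if len(predicted) != len(observed):
        return float("-inf")  # candidate cannot explain the data at all
    obs = sorted(observed)
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, obs))
    n = len(obs)
    return -squared_error / (2 * sigma ** 2) - n * log(sigma * (2 * pi) ** 0.5)


# Hypothetical noisy measurements for a clone with sites 0, 2, 5, 9.
measured = [2.1, 2.9, 4.0, 5.2, 6.8, 9.1]
better = log_likelihood([0, 2, 5, 9], measured)
worse = log_likelihood([0, 3, 5, 9], measured)
```

A search procedure would explore candidate site sets and keep the one with the highest score; handling lost fragments requires a richer error model than the one assumed here.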
property
on the
structure
yield in linear
description
the
by assuming false positives on on p and case, model take not occur occur latter This
Physical bridization
Mapping
Using
Hy-
false not j j
at random, if one
Data
and his coworkers Deborah physical clones The are at Berkeley and (Farid Geoffrey
if probe
then then
The
present
writer
Weisser mapping
with random
probability variables
1 p. ~j
algorithms by hy-
are independent.
have
practical false
as experimenters
data
mapping
to avoid In this i most ing into fast lating precise. To where properly with clone sume j,
is equal
likely a matrix
by chang-
a minimum This
positives.
techniques
method
instances;
between
probes
clones.
The
of several noiseless
clones; case)
understand the data contains and a vertex that this problem order clone cannot we delete graph
the
consider us assume
the
case
of the
number
of pooled
to problems of group
by statisticians
an edge graph
vj ) if and into
is connected,
otherwise
mapping
5.1
The
Case
of Unique
molecule
Probes
as an interval A that on the ([BD91], unlikely clones the on the probe is
a probe Then
vertices
to it.
of the
There clones
In an increasingly are extracted is long time. probe in of the (in the along problem
clones between
of isolated three
probe
correspondone ob-
component, vertices.
a divide-and-conquer
ordering
probes noiseless
We have
found
that
this
approach
works number
successfully of randomly
interpretation:
given
incidence
of a significant
placed when
false it and
Call
a probe
vertex
a sp~itter
Under determine,
number ordering
it is an open
the minimum the
problem
n,L and expected left-to-right
to
N
clone
vertices
of the parameters
model, to determine
remaining tain sumes strictly tains that the the clones. that to
two
components to the
contains of the
clones,
5.2
Nonunique Model
Probes:
The
Pokson
right
these
continuing
as long
as splitters
to clone complicated one can no hybridize
abundant, ordered
cess usually correctly they The blesome, annealing we have 21 that structed Another clones; a single more gether. can
Several experimenters ([CN90], [EL89], [L90], [LD91] and [ST90]) have approached physical mapping using hybridization probes to fingerprint the clones, but without requiring each probe to occur at a unique point. Instead, each probe may occur many times along the DNA. The data is given by a matrix $(d_{ij})$, where an
sequence
of subproblems
entrydii indicates
j. The resulting the one assume same than longer to the In clone mines curs). with that
be solved where
case but
positives good
occur success
is more using
trou-
we have
simulated
clones
overlap
if they
heuristic
probe. interpretation on the set (the incidence wish to line places between determine of and the each the the problem prclbe the intervals most probe each deterocand likely
a physical research
is in close agreement by a French complication are clones of the fragments out chimeric
where
these disjoint
the
sets,
we
fragment
chromosome, with
intervals. attacks clones within probe model this are problem under that the asto is the
chromosome various
We are experimenting
to screening
another along
5.1.1 Up The by
On-Line
to now practical
Probe
Selection
that the are are unique in probes advance. to
occurrences by a Poisson in
of known
~ (thus, dz with
we have
assumed
interval
of length
be used
intervals
assuming with
choice
begun
to study idealized
strategies
selection
the following
assumptions:
The the
set
of clones
is determined of clones
in
advance, the
and
This tractable, likelihood that are imize explain probes, probes over the
problem
distribution model.
satisfies
Lander-
to the problems
are reasonably NP-hard. the the and, incident alphabet whether this problem
in practice, simplest
although is to needed
experiments positives
are
i.e.,
or false from
be the the
of some
clone.
step, the
experimenter will
c. Then that,
word is is al-
next
P such there
c, there It
probe
likely then
containing to approximate
in S(c).
of the ends.
If two
exists
a constant
considered 1 ranges to
are
interleavof of the
terval graphs instead of interval graphs. Unfortunately, these problems are NP-hard ([GK92] and [GS92]). A number or approximate tion problems. interleaving of researchers methods We outline have investigated an approach
Wj k,
and
is an
logarithm
heuristic
likelihood The
m be a permutation
of solving
compatible
in the minimum with c(n). ~(1) in and linear
with pervalue ~. of
left-to-right
interleavings may
1 compatible as min. functions proposes are in quite for widely practice. effective
Then
that,
is the set of pairs of clones that overlap in $I$. Then we seek an $I$ maximizing $F(I)$. For any permutation $\pi$ of the clones, let $C(\pi) = \max_I F(I)$, where the maximization
ranges the in over the that, time. by interleavings for a given Thus the search problem. based on this compatible as mi% m, C(m) can with C(m). It ~. Then problem [AK93] can be restated is shown can
[AK93],
Lin-Kernighan
be computed problem
( [LK73]
compu-
a local
spirit
of the
Lin-Kernighan approach.
being
traveling-salesman program
search
computer
devised
7
6 Mapping Graphs
Any method of fingerprinting clones can be used to esti-
Sequencing
The Sequence
version a DNA whose
Problems
Assembly
physical
Using
of the molecule
fragments will
that two clones overlap; for example, if hybridization fingerprinting is used and the Poisson model of probe occurrences is assumed, then two clones are likely to overlap if they have many probes in common and there are few probes that lie on one clone but not the other. Given an overlap probability $p_{jk}$ for each pair $(j, k)$ of clones, it is natural to look for an interleaving $I$ that best agrees with these probabilities. The degree of agreement of interleaving $I$ may be measured by $\prod_{(j,k) \in E} p_{jk} \prod_{(j,k) \in \bar{E}} (1 - p_{jk})$, where $E$ is the set of pairs of clones that overlap in $I$, and $\bar{E}$ is the complementary set. Letting $w_{jk} = \ln \frac{p_{jk}}{1 - p_{jk}}$, we find that maximizing the degree of agreement is equivalent to maximizing $\sum_{(j,k) \in E} w_{jk}$.
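The passage from a product of overlap probabilities to a sum of weights can be checked numerically. The sketch below assumes the overlap probabilities are stored in a dictionary keyed by clone pairs and that an interleaving is summarized by its set of overlapping pairs; the names `agreement`, `weight` and `weighted_score` are hypothetical. Because the product of all factors $1 - p_{jk}$ does not depend on the interleaving, two interleavings differ in log agreement by exactly the difference of their weight sums.

```python
from math import log, prod


def agreement(p, overlapping):
    """Product of p[pair] over overlapping pairs and 1 - p[pair] over the rest."""
    return prod(p[pair] if pair in overlapping else 1 - p[pair] for pair in p)


def weight(pjk):
    """Log-odds weight of a clone pair with overlap probability pjk."""
    return log(pjk / (1 - pjk))


def weighted_score(p, overlapping):
    """Sum of the log-odds weights over the overlapping pairs."""
    return sum(weight(p[pair]) for pair in overlapping)


# Hypothetical overlap probabilities for three clones 0, 1, 2.
p = {(0, 1): 0.9, (0, 2): 0.2, (1, 2): 0.7}
e1 = {(0, 1), (1, 2)}   # overlapping pairs in one interleaving
e2 = {(0, 2)}           # overlapping pairs in another interleaving
log_ratio = log(agreement(p, e1)) - log(agreement(p, e2))
score_gap = weighted_score(p, e1) - weighted_score(p, e2)
```

Here `log_ratio` and `score_gap` agree up to floating-point error, which is why ranking interleavings by weight sum gives the same order as ranking them by degree of agreement.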
Under
reasonable
probabilistic thus
as a substring, pIobierm
shortest
supe?wtving
given
the shortest
string
containing
strings Z1, x2, ..., xn, find each of the given strings
is equivalent
problem determines an edge intervals an
to maximizing
it is natural Any two A a graph with graph to a
To attack introduce intervals vertex tices arising interval sulting val terized determine ([BL76] Our graph another, for if the in the
this
concept interval
of an interval
set of ver-
along each
as a consecutive substring. Simple polynomial-time approximation algorithms for the shortest superstring problem are discussed in [GM80] and [T89]. In particular, [BJ91] gives an approximation algorithm that solves the problem within a factor of three; i.e., the superstring it produces is at most three times as long as the shortest superstring. The shortest superstring problem is known to be complete for the complexity class
gTaph; if no
then graph. the reIntercharacto graph
in another, graphs
graphs
are well
Max SNP. This implies that unless P = NP there exists an $\epsilon > 0$ such that it is NP-hard to solve the shortest superstring problem within a factor of $1 + \epsilon$. Of greater interest to biologists is the case where the measurement of the sequences for the fragments is subject to error and, given these possibly erroneous sequences, one wants to determine the most likely sequence for the DNA molecule from which the fragments have been drawn. This problem is known generically as the sequence assembly problem.
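For the noiseless shortest superstring problem, a simple greedy merge heuristic gives a feel for how approximation algorithms of this kind operate. The sketch below repeatedly merges the pair of strings with the largest suffix-prefix overlap. It is an illustrative assumption about one natural heuristic, not a transcription of the algorithm of [BJ91], and no approximation factor is claimed for this code.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Merge the most-overlapping pair until one string remains."""
    # Drop duplicates and strings already contained in another string.
    unique = list(dict.fromkeys(strings))
    frags = [s for s in unique if not any(s != t and s in t for t in unique)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""


fragments = ["ACGT", "GTAC", "TACG"]
superstring = greedy_superstring(fragments)
```

The quadratic pair scan keeps the sketch short; it contains every input as a substring by construction, but it may be longer than the optimum on other inputs.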
pvob~e~
algorithm
whether [KM89]).
is an interval
problem ~E(~)
Wj k,
find E(G)
to require restrict
no clone
is known
generically
versions,
to proper
assembly
ithas many
statistical information
model
erDNA
to solving
the
messier,
but
more
relevant,
other for
factors.
is a good prob-
and
have
reviewing
on this
many
7.2
Sequencing
by
Hybridization
Sequencing by hybridization ([DL89], [PL91] and [SP91]) is a novel method of sequencing DNA molecules. Define a generalized probe as a sequence with
dont ple, first the cares over the alphabet AC
Search for Homologies strings for homologies that are similar Trees
{A, C, T, G}.
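The notion of a probe with don't cares can be made concrete with a small sketch. It assumes that a don't-care position is written `*` and matches any single nucleotide, and it treats a hybridization outcome simply as "the probe pattern occurs somewhere in the target string", ignoring strand complementarity and experimental error. The symbol and the function names are hypothetical choices made for illustration.

```python
ALPHABET = set("ACGT")
DONT_CARE = "*"  # hypothetical symbol for a don't-care position


def matches_at(probe, target, start):
    """True if the probe matches the target beginning at index start."""
    window = target[start:start + len(probe)]
    return all(p == DONT_CARE or p == c for p, c in zip(probe, window))


def probe_occurs(probe, target):
    """True if the generalized probe matches some substring of the target."""
    if any(p != DONT_CARE and p not in ALPHABET for p in probe):
        raise ValueError("probe must use A, C, G, T or the don't-care symbol")
    last_start = len(target) - len(probe)
    return any(matches_at(probe, target, i) for i in range(last_start + 1))


hit = probe_occurs("AC*G", "TTACTGA")   # "ACTG" matches at index 2
miss = probe_occurs("AC*G", "TTACTA")   # no window ends in G after AC?
```

Running a family of such probes against an unknown string yields a yes/no answer per probe, which is the kind of data a sequencing-by-hybridization reconstruction starts from.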
T that T by attempts of the contains in the
For
Phylogenetic
probe the
G represents A fifth
position,
to sequence method,
hybridizing
general-
probes that
version can
assumed
of p in probes string P
us say same
that
y #
z of the
as z such following
arise.
N(pl,
of z, , pn} a positive
z)
and only
the if
is possible
P, determine
what
fraction
Protein Folding Determining the three-dimensional structure of a protein from its sequence of amino acids.
In conclusion, a word about the role of combinatorial optimization. It is convenient to formulate reconstruction problems in molecular biology as optimization problems. For example, physical mapping can be treated as the problem of finding the most likely interleaving of a set of clones, given the fingerprints of the clones. However, the most likely interleaving is of little use unless it very closely resembles the true interleaving. Thus, optimization methods should be viewed not as vehicles for solving a problem, but for proposing a plausible hypothesis to be confirmed or disconfirmed by further experiments. The search for the correct solution of a reconstruction problem must inevitably be an iterative process involving a close interaction between experimentation and computation.
Given
Acknowledgements
The writer wishes to acknowledge enlightening discussions with William Chang, Dan Gusfield, John Kececioglu, Gene Lawler, Dalit Naor, David Nelson, Frank Olken, Pavel Pevzner, Ron Shamir, Terry Speed, David Torney, Michael Waterman and, especially, Farid Alizadeh, Lee Newberg, Deborah Weisser and Geoffrey Zweig.
est cardinality sequences m. The solved a given first in the length and
of length
second
been of to
case where
t ([P89]).
open.
be completely
References
[AB89]
F.M.
Ausubel,
R. Brent,
R.E. Kingston,
Biology. Newberg of in Molecular on
et al.
1989 and D.
8
This
Conclusion
[AK92] article has been concerned about serious the will versions by the with problems molecule it. will an It of defrom inthe errors. if fail [BD91]
Current F. A
Protocols R.
Physical
Mapping
Chromosomes: BiolDis-
fragments
is important
ACM/SIAM
Symposium
J. Dausset
Cohen.
TheoStrategy PTOC.
are seduced
appealing of these
easy to un-
Analysis Random
Mapping Landmarks.
Acad.
Sci.
USA
Vol.
88,
pp
3917-3921.
[FS83]
and
W.W.
Ralph. Frag-
Restriction
19-29.1983 and Tech Institute December, J. Storer. Vol. 20, R. Shamir. The Com-
Blum,
T.
Jiang, Linear
M.
Li,
J.
Tromp
and
M. [GK921 M.C. Graph Moise puter [GM80] Golumbic, Sandwich and Frida No. Sciences Kaplan Problems Eskansky 270/92, and Science Teport, of 1992 On pp Findof 50-58.
Approximation
of the Twenty
on Theory of
G.S.
Testing
for
Conand 335[G090]
Superstring.
Journal
secutive Planarity
Ones
Graphs 13, pp
.lorwnal
of Computer 379.1976 [BS90] E. A.V. ing Maps 31(2), [CL89] A.V. et al. ing. [CL89] A.V. B. Raff, Meister Branscomb, Carrano
E.D. Region
Green
and
M.V.
Olson. Fibrosis
of the
Cystic
T. and Human
Slezak, M. and
R.
Pae,
D.
Chromosomes: Mapping.
A Model Vol.
ConstructGenome
Science
250,
Chromosomeof the
Golumbic Algorithms
and for
R.
Shamir.
Approach.
PTOC. lsTael
Fluorescence-Based, 129-136.1989 L.K. T. L. High Ashworth, Slezak, McBride, Resolution Method Vol. 4, pp [K91] M. S.
Gonick,
Wheelis, 1991
Harper
for Molecules.
the
Keith, A
85, pp 7298-7301.1988 and Approximation Reconstruction. of Arizona K. E. Isono. Coli for 1991 The Rapid LiAl-
Fluorescence-based, of DNA Fingerprinting. 129-136.1989 [CN90] A.G. hetner mid Type Craig, and D. H.
Semi-automated Genomics
J .D. Kececioglu. gorithms PhD for Dissertation, Kohara, Map Application and Cell A.
Exact DNA
Sequence University
J .D.
Hoheisel,
G. of
ZeCos-
[KA87]
Y.
Akiyama of the
and
Whole
Chromo-
Clones
Covering
of a New
Strategy
I (HSV-1) by
Genome:
Sorting Vol.
of a Large
Genomic
Hybridisation.
and
Linear-Time Drmanac, Graphs. Crkvenajakov. Genomics [EL89] Vol. Sequencing 4, pg 114.1989 K.A. Genomes Lewis. Physical USA MapVol. 86, [L90]
Recognizing of Computing
18, pp 68-81.1989 [KN93] R. Karp, L. Newberg, Partial submitted Genes Press, Clone Digest An Algorithm 1993 PTess and OzfoTd for the
by Hybridization:
G.A.
ping
Evans
and
lem, B.
of Complex
by Cosmid
Multiplex
Analysis.
Lewin.
pp 5030-5034.1989 [L190] [F85] P.C. Fishburn. John Interval Wiley Orders and Interval [LD91] [FG65] D.R. trices Math Fulkerson abd Vol. and O.A. Gross. Incidence Jou?nal MaGraphs. pp 35-561985
Simple.
Nature
346, pp 611-12.1990 Lehrach, R. Drmanac, Fingerprinting Genome J. Hoheisel, in Genome Analysis et Vol. al. 1,
Hybridization
Map-
Interval
Graphs.
Pacific
of
and Sequencing.
15, pp 835-855.1965
pp 39-81.1991
[LK73]
S. Lin
and W. Kernighan.
An
Effective
HeurisProb-
[S w91]
W.
Schmitt
and
M.S.
in
Waterman.
Multiple ProbVol.
Solutions lems.
of DNA
Restriction
Applied
Map,ping
Mathematics
Operations
Advances
12, pp 412-427,
[LW88] E.S. Lander and M.S. Waterman. Genomic Mapping by Fingerprinting Random Clones: A Mathematical Analysis. Genomics Vol. 2, pp 231-239. 1988
[T89]
J.S. the
Turner. Shortest
Superstring Vol.
Information pp 1-20.1983
and
Computation
[NN92]
L.A. Newberg and D. Naor. A Lower Bound on the Number of Solutions to the Exact Probed Partial Digest Problem. Advances in Applied
Mathematics, to appear. J.E. Dutchik, Strategy in Yeast. M.Y. for Proc. Graham, Natl. Acad. et al. RestricSci.
[T91]
Mapping
Using Bioiogy
Unique Vol.
of Molecular
[OD86]
M.V. tion
Olson, Mapping
[TB91]
and
D.J. Fingerprint
Random-Clone
Genomic
of DNA
Mapping
USA Vol.
[P82] W.
of Mathematical
Biology
6, pp 853-879.1991 , P. Dussen, Map C. Mugnier Construction Comparability in the 103-110.1988 et al. in Assembly the CEA 9. pp and S. UsAlBio-
Acids
Res. Vol.
P.
Tuffery
10, pp 217-227.1982 [P89] P.A. puter tural [PD84] G. Pevzner. Analysis. Dynamics. Polner, L. DNA Nucleic L-tuple DNA Sequencing: ComStruc[T092] Dorgai Physical Acids and Map L. Orosz. PMAP. ProPMAPS: grams. 1984 [PL91] P.A, Pevzner, Y.P. Lysov, K.R. Khrapko, Construction
Hazout. ing
Restriction
a Complete Vol.
Sentences 1, pp
Computer
4, No.
Applications
Journal
Vol.
of Biomoleczdar
7, pp 63-73.1989 A. Olsen, of Acids Region B. Trask, of Analysis Family Nucleic Cosmid Contigs Human Vol.
Res. Vol
12, pp 227-236.
Research
2653-2660.1990
A.V. Mirz-
[w91]
Exons, 1991
Introns
and
Talking
Genes
Ba-
Belyavsky,
bridization. and Dynamics 1991 [S78] M. Stefik. mentation
V.L.
Florentiev Chim
and
A.D.
abekov. Immoved
Journal Vol.
for Secmencimz -..bv HyStructure [WG86] M.S. Graphs Vol. Waterman and Maps 48, pp. and J.R. Griggs. Bull. Intervall Biol. No. 2, pp 399-410. of DNA.
of Biomo~ecuiar
9, Issue
of Math.
Inferring Data.
DNA Artificial
Structure Intelligence
from Vol.
Seg11,
[WH87]
Molecular
of the
pp 85-114.1978 [SP91] Z. Strezoska, I. Labat, Sequencing by a Acad. Sci., Skiena, T. Paunesku, D. Radosavljevic, DNA Read Natl. 1991 Reon
Edition,
Benjamin/Cummings
Non-Gel-Baaed
[ss90] S.S.
Smith From
and
Interpoint
Distances.
[ST90]R.L.
et al. somes
Stallings, Physical
Mapping
by Repetitive
Fingerprinting.
Proceedings
of the
of Sci-
ences Vol.
87, pp 6218-22.