
DNA SEQUENCING

The sequencing protocol is essentially a technical procedure with several variants, but the most widely used techniques are based on the enzymatic method. Whatever the method, the desired result is a set of overlapping fragments that terminate at different bases and differ in length by one nucleotide. This is known as a set of nested fragments. Once the technique has generated a set of nested fragments, detection is the final stage of the sequencing procedure. This usually involves separation of the fragments on a polyacrylamide gel. Slab gels, in which fragments are radioactively labeled, generate an autoradiograph. Automated sequencing procedures tend to use fluorescent labels and continuous electrophoresis to separate the fragments, which are identified as they pass a detector.

There are two main methods for sequencing DNA. In one method, developed by Allan Maxam and Walter Gilbert, chemicals are used to cleave the DNA at certain positions, generating a set of fragments that differ by one nucleotide. The same result is achieved in a different way in the second method, developed by Fred Sanger and Alan Coulson, which involves enzymatic synthesis of DNA strands that terminate in a modified nucleotide. Analysis of fragments is similar for both methods and involves gel electrophoresis and autoradiography (assuming that a radioactive label has been used). The enzymatic method (and variants of the basic technique) has now almost completely replaced the chemical method as the technique of choice, although there are some situations where chemical sequencing can provide useful data to confirm information generated by the enzymatic method. Fluorescence-based detection methods are also used in place of radioactive isotopes. This is particularly important in DNA sequencing, as it speeds up the process and enables the technique to be automated.

Nucleotide Sequencing

Traditionally, about 500 nucleotides can be sequenced at a time. Sequencing an entire genome therefore calls for cloning it in different vectors (plasmids up to 10 kb, cosmids up to 40 kb, BACs up to 300 kb and YACs above 300 kb). The uniformity of the DNA molecule and the seemingly monotonous repetition of the nucleotide bases may seem like impenetrable barriers to determining the precise order of the bases within a nucleic acid. The earliest methods used were, however, impractical for DNA sequencing on a large scale. In 1975, Fred Sanger and Alan Coulson devised a method of direct DNA sequencing referred to as the plus-minus method (Sanger and Coulson, 1975). This method utilized a DNA polymerase, primed by synthetic radiolabeled oligonucleotides, to generate fragments of DNA that could be analyzed following electrophoresis and autoradiography. This technique was used to determine the entire 5386 bp sequence of the bacteriophage φX174 genome (Sanger et al., 1977).

DNA Sequencing gDNA libraries

DNA Sequencing METHODS- Manual DNA Sequencing

1. Maxam-Gilbert (chemical) sequencing


(Suitable for sequencing short oligonucleotide sequences) A defined fragment of DNA is required as the starting material. This need not be cloned in a plasmid vector, so the technique is applicable to any DNA fragment. The DNA is radiolabeled with 32P at the 5' ends of each strand, and the strands are denatured, separated, and purified to give a population of labeled strands for the sequencing reactions. The next step is a chemical modification of the bases in the DNA strand. This is done in a series of four or five reactions with different specificities, and the reaction conditions are chosen so that, on average, only one modification will be introduced into each copy of the DNA molecule. The modified bases are then removed from their sugar groups and the strands cleaved at these positions using the chemical piperidine. The theory is that, given the large number of molecules and the different reactions, this process will produce a set of nested fragments.

The method requires radioactive labeling at one 5' end of the DNA (typically by a kinase reaction using gamma-32P ATP) and purification of the DNA fragment to be sequenced. Chemical treatment generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). For example, the purines (A+G) are depurinated using formic acid, the guanines (and to some extent the adenines) are methylated by dimethyl sulfate, and the pyrimidines (C+T) are hydrolyzed using hydrazine. The addition of salt (sodium chloride) to the hydrazine reaction inhibits the reaction of thymine, giving the C-only reaction. The modified DNAs are then cleaved by hot piperidine at the position of the modified base. The concentration of the modifying chemicals is controlled to introduce on average one modification per DNA molecule. Thus a series of labeled fragments is generated, from the radiolabeled end to the first "cut" site in each molecule. The fragments in the four reactions are electrophoresed side by side in denaturing acrylamide gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred.
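The way a sequence is inferred from the four lanes (G, A+G, C, C+T) can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions: the function names are hypothetical, every expected band is assumed to be present, and a band is indexed by the position of the cleaved base counted from the labeled 5' end (a real gel shows fragment lengths, which the reader converts to positions).

```python
LANES = ("G", "A+G", "C", "C+T")

def cleave_lanes(sequence):
    """Idealized lanes: for each reaction, the set of 1-based positions
    (from the labeled 5' end) at which cleavage gives a band."""
    lanes = {name: set() for name in LANES}
    for position, base in enumerate(sequence, start=1):
        if base in "AG":
            lanes["A+G"].add(position)   # formic acid hits both purines
        if base == "G":
            lanes["G"].add(position)     # dimethyl sulfate reaction
        if base in "CT":
            lanes["C+T"].add(position)   # hydrazine hits both pyrimidines
        if base == "C":
            lanes["C"].add(position)     # hydrazine plus salt: C only
    return lanes

def read_lanes(lanes, length):
    """Infer the base at each position from the lanes that show a band."""
    bases = []
    for position in range(1, length + 1):
        in_g = position in lanes["G"]
        in_ag = position in lanes["A+G"]
        in_c = position in lanes["C"]
        in_ct = position in lanes["C+T"]
        if in_g and in_ag:
            bases.append("G")   # band in both G and A+G
        elif in_ag:
            bases.append("A")   # band in A+G only
        elif in_c and in_ct:
            bases.append("C")   # band in both C and C+T
        elif in_ct:
            bases.append("T")   # band in C+T only
        else:
            bases.append("N")   # no interpretable band
    return "".join(bases)

lanes = cleave_lanes("GATTACA")
print(read_lanes(lanes, 7))
```

The point of the sketch is that A and T never need a lane of their own: an A is a band that appears in the A+G lane but not the G lane, and a T is a band in the C+T lane but not the C lane.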

2. Sanger-Coulson (dideoxy or enzymatic) sequencing

Although the end result is similar to that attained by the chemical method, the Sanger-Coulson procedure is totally different from that of Maxam and Gilbert. In this case a copy of the DNA to be sequenced is made by the Klenow fragment of DNA polymerase I. The template for this reaction is single-stranded (SS) DNA, and a primer must be used to provide the 3' terminus for DNA polymerase to begin synthesizing the copy (Fig. 3.9). The production of nested fragments is achieved by the incorporation of a modified dNTP in each reaction. These dNTPs lack a hydroxyl group at the 3' position of deoxyribose, which is necessary for chain elongation to proceed. Such modified dNTPs are known as dideoxynucleoside triphosphates (ddNTPs). The four ddNTPs (A, G, T, and C forms) are included in a series of four reactions, each of which contains the four normal dNTPs. The concentration of the dideoxy form is such that it will be incorporated into the growing DNA chain infrequently. Each reaction, therefore, produces a series of fragments terminating at a specific nucleotide, and the four reactions together provide a set of nested fragments. The DNA chain is labeled by including a radioactive dNTP in the reaction mixture. This is usually [α-35S]dATP, which enables more sequence to be read from a single gel than the 32P-labeled dNTPs that were used previously. The generation of fragments for dideoxy sequencing is more complicated than for chemical sequencing and usually involves sub-cloning into different vectors. Many plasmid vectors are now available and some types can be used directly for DNA sequencing experiments. Another method is to clone the DNA into a vector such as the bacteriophage M13, which produces single-stranded DNA during infection. This provides a suitable substrate for the sequencing reactions.

Figure 9.6 depicts the dideoxy form of the nucleotide, whose incorporation terminates the chain. The gel used to separate the newly synthesized DNA fragments usually contains a high concentration of urea (7 M) and is run at a high power level to heat the gel to about 70 °C. Both of these have denaturing effects on the DNA fragments and help reduce secondary structure in the single-stranded molecules, which could otherwise make them run anomalously through the gel.

3. Automated DNA Sequencing- High-throughput Sequencing Protocols


One of the major advances in technology that enabled sequencing to move from single-gel lab-based systems up to large-scale production-line sequencing was the automation of many parts of the process. A good lab scientist or technician could sequence perhaps a few hundred bases per day, which was not going to solve the problem of determining genome sequences as opposed to gene sequences. The technology had to improve by orders of magnitude. This was achieved by improvements in sample preparation and handling, with robotic processing enabling high-volume throughput. In a similar way, automation of sequencing reactions, together with linear continuous capillary electrophoresis techniques, enabled scale-up of the sequence-determination stage of the process.

A straightforward way to increase the throughput of DNA sequencing would be to combine the four individual sequencing reactions (each containing a different ddNTP) into a single reaction that could be analyzed on a single lane of a gel. This is not possible using radioactivity since each band is distinguishable only by the position in which it runs on the gel. Therefore, combining all four lanes would merely result in a series of bands differing in size by a single base (Figure 9.8). However, if the terminal base of each DNA fragment can be identified specifically then, since each band on the gel is a different size, the DNA sequence can be unambiguously assigned from a single gel lane. A set of dideoxynucleotides has been developed that are labeled with fluorescent dyes precisely for this purpose. The dideoxynucleotide can still be incorporated into DNA opposite its complementary base, which again results in the termination of DNA synthesis. The dye structures attached to the dideoxynucleotide contain a fluorescein donor dye linked to a dichlororhodamine (dRhodamine) acceptor dye via an aminobenzoic acid linker and are called Big Dye terminators. An argon ion laser is able to excite the fluorescein donor dye that efficiently transfers the energy to one of the four acceptor dyes, each of which has a distinctive emission spectrum (Figure 9.9). Each dideoxynucleotide is labeled with a different acceptor dye so that DNA fragments ending in a different ddNTP will fluoresce at a different wavelength. Sequencing reactions can therefore be performed in a single tube (or single well of a microtitre dish) and the products separated either on a single lane of a gel, or using a capillary tube containing a gel matrix. The intensity and wavelength of the fluorescent emission is measured as the DNA fragments move past a laser and fluorescence detector located at the bottom of the gel. This information is fed directly into a computer so that the resulting sequence can be automatically assigned and stored.

Sophisticated base calling software is available to convert the fluorescent patterns obtained into a sequence of DNA bases (Figure 9.10). Sequencing in this way has massive speed advantages over manual sequencing methods. As many as 1000 bases can be read automatically from a single reaction, although the sequence obtained from within 500 bp of the primer is generally more reliable than that further away. Additionally, the detection methods used during automated sequencing are far more reliable than sequence interpretation from an autoradiograph.
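The idea behind base calling can be illustrated with a deliberately naive Python sketch. It is not how any real base-calling package works: it assumes that peaks have already been detected, that each peak is summarized by four fluorescence intensities (one per dye), and that the channel order A, C, G, T and the `min_ratio` threshold are arbitrary choices made for this example.

```python
# Assumed channel order for this example; a real instrument defines its own
# dye-to-base mapping.
CHANNEL_BASES = ("A", "C", "G", "T")

def call_bases(peaks, min_ratio=2.0):
    """Call one base per peak from four-channel fluorescence intensities.

    peaks: sequence of 4-tuples, ordered from the shortest fragment to the
    longest, giving the intensity in each dye channel.
    A peak is called only if the strongest channel is at least `min_ratio`
    times the second strongest; otherwise it is reported as 'N'.
    """
    calls = []
    for intensities in peaks:
        ranked = sorted(range(4), key=lambda i: intensities[i], reverse=True)
        best, runner_up = ranked[0], ranked[1]
        strong_enough = intensities[best] > 0
        unambiguous = intensities[best] >= min_ratio * intensities[runner_up]
        if strong_enough and unambiguous:
            calls.append(CHANNEL_BASES[best])
        else:
            calls.append("N")
    return "".join(calls)

peaks = [
    (900, 50, 30, 20),   # clear signal in the first channel
    (40, 60, 800, 30),   # clear signal in the third channel
    (300, 280, 10, 5),   # two channels of similar strength: ambiguous
    (10, 20, 30, 700),   # clear signal in the fourth channel
]
print(call_bases(peaks))
```

The ambiguous third peak shows why reads become less reliable far from the primer: as peaks broaden and overlap, the strongest channel no longer dominates and a cautious caller reports N rather than guessing.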

The Methodology for DNA Sequencing


Rapid and efficient methods for DNA sequencing were first devised in the mid-1970s. Two different procedures were published at almost the same time:

The chain termination method, in which the sequence of a single-stranded DNA molecule is determined by enzymatic synthesis of complementary polynucleotide chains, these chains terminating at specific nucleotide positions;

The chemical degradation method, in which the sequence of a double-stranded DNA molecule is determined by treatment with chemicals that cut the molecule at specific nucleotide positions.

Both methods were equally popular to begin with but the chain termination procedure has gained ascendancy in recent years, particularly for genome sequencing. This is partly because the chemicals used in the chemical degradation method are toxic and therefore hazardous to the health of the researchers doing the sequencing experiments, but mainly because it has been easier to automate chain termination sequencing. As we will see later, a genome project involves a huge number of individual sequencing experiments and it would take many years to perform all these by hand. Automated sequencing techniques are therefore essential if the project is to be completed in a reasonable time-span.

Chain termination DNA sequencing:


Chain termination DNA sequencing is based on the principle that single-stranded DNA molecules that differ in length by just a single nucleotide can be separated from one another by polyacrylamide gel electrophoresis. This means that it is possible to resolve a family of molecules, representing all lengths from 10 to 1500 nucleotides, into a series of bands.

Figure 6.1. Polyacrylamide gel electrophoresis can resolve single-stranded DNA molecules that differ in length by just one nucleotide. The banding pattern is produced after separation of single-stranded DNA molecules by denaturing polyacrylamide gel electrophoresis. The molecules are labeled with a radioactive marker and the bands visualized by autoradiography. The bands gradually get closer together towards the top of the ladder. In practice, molecules up to about 1500 nucleotides in length can be separated if the electrophoresis is continued for long enough.

Chain termination sequencing


The starting material for a chain termination sequencing experiment is a preparation of identical single-stranded DNA molecules. The first step is to anneal a short oligonucleotide to the same position on each molecule, this oligonucleotide subsequently acting as the primer for synthesis of a new DNA strand that is complementary to the template. The strand synthesis reaction, which is catalyzed by a DNA polymerase enzyme and requires the four deoxyribonucleotide triphosphates (dNTPs - dATP, dCTP, dGTP and dTTP) as substrates, would normally continue until several thousand nucleotides had been polymerized. This does not occur in a chain termination sequencing experiment because, as well as the four dNTPs, a small amount of a dideoxynucleotide (e.g. ddATP) is added to the reaction. The polymerase enzyme does not discriminate between dNTPs and ddNTPs, so the dideoxynucleotide can be incorporated into the growing chain, but it then blocks further elongation because it lacks the 3'-hydroxyl group needed to form a connection with the next nucleotide. If ddATP is present, chain termination occurs at positions opposite thymidines in the template DNA. Because dATP is also present the strand synthesis does not always terminate at the first T in the template; in fact it may continue until several hundred nucleotides have been polymerized before a ddATP is eventually incorporated. The result is therefore a set of new chains, all of different lengths, but each ending in ddATP. Now the polyacrylamide gel comes into play. The family of molecules generated in the presence of ddATP is loaded into one lane of the gel, and the families generated with ddCTP, ddGTP and ddTTP loaded into the three adjacent lanes. After electrophoresis, the DNA sequence can be read directly from the positions of the bands in the gel.
The band that has moved the furthest represents the smallest piece of DNA, this being the strand that terminated by incorporation of a ddNTP at the first position in the template. In the example shown in Figure 6.2 this band lies in the 'G' lane (i.e. the lane containing the molecules terminated with ddGTP), so the first nucleotide in the sequence is 'G'. The next band, corresponding to the molecule that is one nucleotide longer than the first, is in the 'A' lane, so the second nucleotide is 'A' and the sequence so far is 'GA'. Continuing up through the gel we see that the next band also lies in the 'A' lane (sequence GAA), then we move to the 'T' lane (GAAT), and so on. The sequence reading can be continued up to the region of the gel where individual bands are not separated.
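The logic of the four reactions and of reading the gel from the bottom up can be mimicked with a small Python simulation. This is a sketch under stated assumptions, not a model of real reaction kinetics: the template is written in the order in which the polymerase copies it, the termination probability `p_terminate`, the number of molecules and all function names are hypothetical, and chain lengths are counted from the first nucleotide added after the primer.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def terminated_lengths(template, dd_base, rng, p_terminate=0.2, n_molecules=2000):
    """Simulate one reaction containing a single ddNTP.

    Returns the set of chain lengths observed, i.e. the band positions in
    that lane. Each time the polymerase must add `dd_base`, it picks the
    dideoxy form with probability `p_terminate` and the chain stops there.
    """
    lengths = set()
    for _ in range(n_molecules):
        for length, template_base in enumerate(template, start=1):
            adds_dd_base = COMPLEMENT[template_base] == dd_base
            if adds_dd_base and rng.random() < p_terminate:
                lengths.add(length)
                break  # chain terminated; this molecule is finished
    return lengths

def read_gel(lanes):
    """Read the sequence from the shortest band (bottom) to the longest (top)."""
    band_to_lane = {}
    for base, lengths in lanes.items():
        for length in lengths:
            band_to_lane[length] = base
    return "".join(band_to_lane[length] for length in sorted(band_to_lane))

def sequence_template(template, seed=0):
    """Run the four reactions (A, C, G, T) and read the resulting lanes."""
    rng = random.Random(seed)
    lanes = {dd: terminated_lengths(template, dd, rng) for dd in "ACGT"}
    return read_gel(lanes)

# The new strand is complementary to the template, so this prints GAATTCCGAT.
print(sequence_template("CTTAAGGCTA"))
```

Because termination is random, a band can only be read if at least one molecule happened to stop at that position. With many molecules per reaction this is practically certain for short templates, but the chance of reaching a distant position falls with every earlier opportunity to terminate, which is one reason why the ratio of ddNTP to dNTP has to be kept low.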

Figure 6.2. Chain termination DNA sequencing. (A) Chain termination sequencing involves the synthesis of new strands of DNA that are complementary to a single-stranded template. (B) Strand synthesis does not proceed indefinitely because the reaction mixture contains small amounts of a dideoxynucleotide, which blocks further elongation because it has a hydrogen atom rather than a hydroxyl group attached to its 3'-carbon. (C) Strand synthesis in the presence of ddATP results in chains that are terminated opposite Ts in the template. This 'A' family of terminated chains is loaded into one lane of a polyacrylamide gel, alongside the families of terminated chains from the T, G and C reactions. (D) In the methodology shown here, the banding pattern is visualized by autoradiography, the terminated chains having become radioactively labeled by inclusion of a labeled dNTP in the strand synthesis reactions. The sequence, shown on the right, is read by noting which lane each band lies in, starting at the bottom of the autoradiograph and moving band by band towards the top.

DNA polymerases for chain termination sequencing


Any template-dependent DNA polymerase is capable of extending a primer that has been annealed to a single-stranded DNA molecule, but not all polymerases do this in a way that is useful for DNA sequencing. Three criteria in particular must be fulfilled by a sequencing enzyme:

High processivity. This refers to the length of polynucleotide that is synthesized before the polymerase terminates through natural causes. A sequencing polymerase must have high processivity so that it does not dissociate from the template before incorporating a chain-terminating nucleotide.

Negligible or zero 5'→3' exonuclease activity. Most DNA polymerases also have exonuclease activities, meaning that they can degrade DNA polynucleotides as well as synthesize them. A 5'→3' exonuclease activity enables the polymerase to remove a DNA strand that is already attached to the template. This is a disadvantage in DNA sequencing because removal of nucleotides from the 5' ends of the newly synthesized strands alters the lengths of these strands, making it impossible to read the sequence from the banding pattern in the polyacrylamide gel.

Negligible or zero 3'→5' exonuclease activity. This is also desirable so that the polymerase does not remove the chain termination nucleotide once it has been incorporated.

These are stringent requirements and are not entirely met by any naturally occurring DNA polymerase. Instead, artificially modified enzymes are generally used. The first of these to be developed was the Klenow polymerase, which is a version of Escherichia coli DNA polymerase I from which the 5'→3' exonuclease activity of the standard enzyme has been removed, either by cleaving away the relevant part of the protein or by genetic engineering. The Klenow polymerase has relatively low processivity, limiting the length of sequence that can be obtained from a single experiment to about 250 bp, and giving non-specific bands on the sequencing gel, these 'shadow' bands representing strands that have terminated naturally rather than by incorporation of a ddNTP. The Klenow enzyme was therefore superseded by a modified version of the DNA polymerase encoded by bacteriophage T7, this enzyme going under the trade name 'Sequenase'. Sequenase has high processivity and no exonuclease activity, and also possesses other desirable features such as rapid reaction rate and the ability to use many modified nucleotides as substrates.

The chemical degradation sequencing method


The difference between the two sequencing techniques lies in the way in which the A, C, G and T families of molecules are generated. In the chemical degradation procedure these families are produced by treatment with chemicals that cut specifically at a particular nucleotide, not by enzymatic synthesis. The starting material can be double-stranded DNA but before beginning the sequencing procedure, the double-stranded molecules must be denatured into single-stranded DNA, with each strand labeled at one end. To illustrate the procedure, we will follow the 'G' reaction. First, the molecules are treated with dimethyl sulfate, which attaches a methyl group to the purine ring of G nucleotides. Only a limited amount of dimethyl sulfate is added, the objective being to modify, on average, just one G per polynucleotide. At this stage the DNA strands are still intact, cleavage not occurring until a second chemical - piperidine - is added. Piperidine removes the modified purine ring and cuts the DNA molecule at the phosphodiester bond immediately upstream of the baseless site that is created. The result is a set of cleaved DNA molecules, some of which are labeled and some are not. The labeled molecules all have one end in common and one end determined by the cut sites, the latter indicating the positions of the G nucleotides in the DNA molecules that were cleaved. In other words, the G family of molecules produced by chemical treatment is equivalent to the G family produced by the chain termination method. The families of cleaved molecules are electrophoresed in a polyacrylamide gel and the sequence read in a similar way to that described for chain termination sequencing. The only significant difference is that problems have been encountered in developing chemical treatments to cut specifically at A or T, and so the four reactions that are carried out are usually 'G', 'A + G', 'C' and 'C + T'. This does not affect the accuracy of the sequence that is read from the gel.

Fluorescent primers are the basis of automated sequence reading


The standard chain termination sequencing methodology employs radioactive labels, and the banding pattern in the polyacrylamide gel is visualized by autoradiography. Usually one of the nucleotides in the sequencing reaction is labeled so that the newly synthesized strands contain radiolabels along their lengths, giving high detection sensitivity. To ensure good band resolution, 33P or 35S is generally used, as the emission energies of these isotopes are relatively low, in contrast to 32P, which has a higher emission energy and gives poorer resolution because of signal scattering. Replacement of radioactive labels by fluorescent ones has given a new dimension to in situ hybridization (FISH) techniques. Fluorolabeling has been equally important in the development of sequencing methodology, in particular because the detection system for fluorolabels has opened the way to automated sequence reading. The label is attached to the ddNTPs, with a different fluorolabel used for each one. Chains terminated with A are therefore labeled with one fluorophore, chains terminated with C are labeled with a second fluorophore, and so on. Now it is possible to carry out the four sequencing reactions - for A, C, G and T - in a single tube and to load all four families of molecules into just one lane of the polyacrylamide gel, because the fluorescent detector can discriminate between the different labels and hence determine if each band represents an A, C, G or T. The sequence can be read directly as the bands pass in front of the detector and either printed out in a form readable by eye or sent straight to a computer for storage. When combined with robotic devices that prepare the sequencing reactions and load the gel, the fluorescent detection system provides a major increase in throughput and avoids errors that might arise when a sequence is read by eye and then entered manually into a computer. It is only by use of these automated techniques that we can hope to generate sequence data rapidly enough to complete a genome project in a reasonable length of time.

Figure 6.7. Automated DNA sequencing with fluorescently labeled dideoxynucleotides. (A) The chain termination reactions are carried out in a single tube, with each dideoxynucleotide labeled with a different fluorophore. In the automated sequencer, the bands in the electrophoresis gel move past a fluorescence detector, which identifies which dideoxynucleotide is present in each band. The information is passed to the imaging system. (B) The printout from an automated sequencer. The sequence is represented by a series of peaks, one for each nucleotide position. In this example, a green peak is an 'A', blue is 'C', black is 'G', and red is 'T'.

RECONSTRUCTION OF SEQUENCES (FRAGMENT ASSEMBLY):

DEFINITION OF COVERAGE:

HUMAN GENOME PROJECT ABI3700 Prism

The number of genes in the human genome was unknown, with estimates ranging from 50,000 to 90,000 (refs 1, 2), and to more than 140,000 according to unpublished sources. A procedure named Exofish, based on homology searches, was used to identify human genes quickly and reliably. This method relies on the sequence of another vertebrate, the pufferfish Tetraodon nigroviridis, to detect conserved sequences with a very low background. Similar to Fugu rubripes, a marine pufferfish proposed by Brenner et al. (ref. 3) as a model for genomic studies, T. nigroviridis is a more practical alternative (ref. 4) with a genome also eight times more compact than that of human. Many comparisons have been made between F. rubripes and human DNA that demonstrate the potential of comparative genomics using the pufferfish genome (ref. 5). Application of Exofish to the December version of the working draft sequence of the human genome and to Unigene showed that the human genome contains 28,000-34,000 genes, and that Unigene contains less than 40% of the protein-coding fraction of the human genome.

A map of 1.42 million single nucleotide polymorphisms (SNPs) distributed throughout the human genome has been described, providing an average density on available sequence of one SNP every 1.9 kilobases. An estimated 60,000 SNPs fall within exons (coding and untranslated regions), and 85% of exons are within 5 kb of the nearest SNP. Nucleotide diversity varies greatly across the genome, in a manner broadly consistent with a standard population genetic model of human history. This high-density SNP map provides a public resource for defining haplotype variation across the genome, and should help to identify biomedically important genes for diagnosis and therapy.

DNA sequencing using reversible terminators (Nature 456, 53-59 (6 November 2008) | doi:10.1038/nature07517; Received 24 June 2008; Accepted 2 October 2008)

Accurate whole human genome sequencing using reversible terminator chemistry


David R. Bentley et al. generated high-density single-molecule arrays of genomic DNA fragments attached to the surface of the reaction chamber (the flow cell) and used isothermal bridging amplification to form DNA clusters from each fragment. The DNA in each cluster was made single-stranded and a universal primer was added for sequencing. For paired read sequencing, the templates were then converted to double-stranded DNA and the original strands removed, leaving the complementary strand as template for the second sequencing reaction (Fig. 1a-c). To obtain paired reads separated by larger distances, DNA fragments of the required length (for example, 2 ± 0.2 kb) were circularized and short junction fragments obtained for paired end sequencing (Fig. 1d).

DNA and sequencing

DNA samples (NA07340 and NA18507) and cell line (GM07340) were obtained from Coriell Repositories. DNA samples were genotyped on the HM550 array and the results compared to publicly available data to confirm their identity before use. Methods for DNA manipulation, including sample preparation, formation of single-molecule arrays, cluster growth and sequencing, were all developed during this study and formed the basis for the standard protocols now available from Illumina, Inc. All sequencing was performed on Illumina GA1s equipped with a one-megapixel camera. All purity filtered read data are available for download from the Short Read Archive at NCBI or from the European Short Read Archive (ERA) at the EBI.

Analysis software

Image analysis software and the ELAND aligner are provided as part of the Genome Analyzer analysis software. SNP and structural variant detectors will be available as future upgrades of the analysis pipeline. The Resembl extension to Ensembl is available on request. The MAQ (Mapping and Assembly with Qualities) aligner is freely available for download from http://maq.sourceforge.net/.

Data access

Sequence data for NA18507 are freely available from the NCBI short read archive, accession SRA000271 (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA000271). X chromosome data are freely available from ERA, accession ERA000035. Links to Resembl displays for chromosome X and human data, plus information on other available data, are provided at http://www.illumina.com/HumanGenome.

Preparation of Sample:

DR Bentley et al. Nature 456, 53-59 (2008) doi:10.1038/nature07517

Fig. 1. a, DNA fragments are generated, for example, by random shearing and joined to a pair of oligonucleotides in a forked adaptor configuration. The ligated products are amplified using two oligonucleotide primers, resulting in double-stranded blunt-ended material with a different adaptor sequence on either end. b, Formation of clonal single-molecule array. DNA fragments prepared as in a are denatured and single strands are annealed to complementary oligonucleotides on the flow-cell surface (hatched). A new strand (dotted) is copied from the original strand in an extension reaction that is primed from the 3' end of the surface-bound oligonucleotide; the original strand is then removed by denaturation. The adaptor sequence at the 3' end of each copied strand is annealed to a new surface-bound complementary oligonucleotide, forming a bridge and generating a new site for synthesis of a second strand (dotted). Multiple cycles of annealing, extension and denaturation in isothermal conditions result in growth of clusters, each ~1 μm in physical diameter. This follows the basic method outlined in ref. 33. c, The DNA in each cluster is linearized by cleavage within one adaptor sequence (gap marked by an asterisk) and denatured, generating single-stranded template for sequencing by synthesis to obtain a sequence read (read 1; the sequencing product is dotted). To perform paired-read sequencing, the products of read 1 are removed by denaturation, the template is used to generate a bridge, the second strand is re-synthesized (shown dotted), and the opposite strand is then cleaved (gap marked by an asterisk) to provide the template for the second read (read 2). d, Long-range paired-end sample preparation. To sequence the ends of a long (for example, >1 kb) DNA fragment, the ends of each fragment are tagged by incorporation of biotinylated (B) nucleotide and then circularized, forming a junction between the two ends. Circularized DNA is randomly fragmented and the biotinylated junction fragments are recovered and used as starting material in the standard sample preparation procedure illustrated in a. The orientation of the sequence reads relative to the DNA fragment is shown (magenta arrows). When aligned to the reference sequence, these reads are oriented with their 5' ends towards each other (in contrast to the short insert paired reads produced as shown in a-c). See Supplementary Fig. 17a for examples of both. Turquoise and blue lines represent oligonucleotides and red lines represent genomic DNA. All surface-bound oligonucleotides are attached to the flow cell by their 5' ends. Dotted lines indicate newly synthesized strands during cluster formation or sequencing. (See Supplementary Methods for details.)

High-throughput sequencing

The high demand for low-cost sequencing has driven the development of high-throughput sequencing technologies that parallelize the sequencing process, producing thousands or millions of sequences at once.[20][21] High-throughput sequencing technologies are intended to lower the cost of DNA sequencing beyond what is possible with standard dye-terminator methods.[22]

Lynx Therapeutics' Massively Parallel Signature Sequencing (MPSS)



The first of the "next-generation" sequencing technologies, MPSS was developed in the 1990s at Lynx Therapeutics, a company founded in 1992 by Sidney Brenner and Sam Eletr. MPSS was a bead-based method that used a complex approach of adapter ligation followed by adapter decoding, reading the sequence in increments of four nucleotides; this method made it susceptible to sequence-specific bias or loss of specific sequences. Because the technology was so complex, MPSS was only performed in-house by Lynx Therapeutics and no machines were sold. Lynx Therapeutics merged with Solexa in 2004, and the merged company was later purchased by Illumina.[23] When the merger led to the development of sequencing-by-synthesis, a simpler approach with numerous advantages, MPSS became obsolete. However, the essential properties of the MPSS output were typical of later "next-gen" data types, including hundreds of thousands of short DNA sequences. In the case of MPSS, these were typically used for sequencing cDNA to measure gene expression levels.

Polony Sequencing

Polony sequencing, developed in George Church's lab at Harvard, was among the first next-generation sequencing systems and was used to sequence a full genome in 2005. It combined an in vitro paired-tag library with emulsion PCR, an automated microscope, and ligation-based sequencing chemistry to sequence an E. coli genome at an accuracy of >99.9999% and a cost approximately 1/10th that of Sanger sequencing. The technology was licensed to Agencourt Biosciences, subsequently spun out into Agencourt Personal Genomics, and ultimately incorporated into the Applied Biosystems SOLiD platform.

Pyrosequencing

A parallelized version of pyrosequencing was developed by 454 Life Sciences. The method amplifies DNA inside water droplets in an oil solution (emulsion PCR), with each droplet containing a single DNA template attached to a single primer-coated bead that then forms a clonal colony. The sequencing machine contains many picolitre-volume wells each containing a single bead and sequencing enzymes. Pyrosequencing uses luciferase to generate light for detection of the individual nucleotides added to the nascent DNA, and the combined data are used to generate sequence read-outs.[16] This technology provides intermediate read length and price per base compared to Sanger sequencing on one end and Solexa and SOLiD on the other. [24] 454 Life Sciences has since been acquired by Roche Diagnostics.
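The light-based read-out described above can be illustrated with a toy flowgram. This is only a sketch under stated assumptions: nucleotides are flowed one at a time in a fixed, hypothetical order ("TACG"), the light signal is taken to be exactly proportional to the number of bases incorporated in that flow, and the function names are invented for illustration. Real instruments must also deal with noise, which is ignored here.

```python
def simulate_flowgram(new_strand, flow_order="TACG"):
    """Return [(nucleotide, light_units), ...] for the flows needed to build new_strand.

    light_units is the number of bases incorporated in that flow, standing in
    for the luciferase light signal (assumed proportional to incorporation).
    """
    unknown = set(new_strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by flow order: {sorted(unknown)}")
    flows, pos, flow_index = [], 0, 0
    while pos < len(new_strand):
        nucleotide = flow_order[flow_index % len(flow_order)]
        incorporated = 0
        # A homopolymer run is extended completely within a single flow.
        while pos < len(new_strand) and new_strand[pos] == nucleotide:
            incorporated += 1
            pos += 1
        flows.append((nucleotide, incorporated))
        flow_index += 1
    return flows


def decode_flowgram(flows):
    """Turn per-flow light units back into the synthesized sequence."""
    return "".join(nucleotide * units for nucleotide, units in flows)


flows = simulate_flowgram("TTACGGA")
print(flows)
print(decode_flowgram(flows))
```

A flow with zero light units means the flowed nucleotide did not match the next template position. A run of identical bases gives one stronger signal rather than several separate ones, which is why several bases can be added in a single step.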

Illumina (Solexa) sequencing


Solexa, now part of Illumina, developed a sequencing technology based on reversible dye-terminators. DNA molecules are first attached to primers on a slide and amplified so that local clonal colonies are formed (bridge amplification). Four types of reversible terminator nucleotides are added, and non-incorporated nucleotides are washed away. Unlike pyrosequencing, the DNA can only be extended one nucleotide at a time. A camera takes images of the fluorescently labeled nucleotides; the dye, along with the terminal 3' blocker, is then chemically removed from the DNA, allowing the next cycle.[25]

SOLiD sequencing

Applied Biosystems' SOLiD technology employs sequencing by ligation. Here, a pool of all possible oligonucleotides of a fixed length is labeled according to the sequenced position. Oligonucleotides are annealed and ligated; the preferential ligation by DNA ligase for matching sequences results in a signal informative of the nucleotide at that position. Before sequencing, the DNA is amplified by emulsion PCR. The resulting beads, each containing only copies of the same DNA molecule, are deposited on a glass slide.[26] The result is sequences of quantities and lengths comparable to Illumina sequencing.[24]

Future methods
Sequencing by hybridization is a non-enzymatic method that uses a DNA microarray. A single pool of DNA whose sequence is to be determined is fluorescently labeled and hybridized to an array containing known sequences. Strong hybridization signals from a given spot on the array identify its sequence in the DNA being sequenced.[27] Mass spectrometry may be used to determine mass differences between DNA fragments produced in chain-termination reactions.[28]
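The idea behind sequencing by hybridization can be sketched in a few lines. The assumptions are strong and are not taken from the text: the array is imagined to carry every possible probe of length k, hybridization is perfect (no mismatches, no missed spots), and each probe is written as the target-strand sequence it detects rather than as its complement. The function names are hypothetical, and k is assumed to be at least 2.

```python
def hybridization_spectrum(target, k):
    """Probes of length k (written as target-strand sequence) that would light up."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def reconstruct(spectrum):
    """Chain probes that overlap by k-1 bases; fails if the chain is ambiguous."""
    by_prefix = {}
    for probe in spectrum:
        if probe[:-1] in by_prefix:
            raise ValueError("repeated (k-1)-mer: reconstruction is ambiguous")
        by_prefix[probe[:-1]] = probe
    suffixes = {probe[1:] for probe in spectrum}
    # The first probe is the only one whose prefix is not another probe's suffix.
    starts = [probe for probe in spectrum if probe[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("cannot find a unique starting probe")
    sequence = starts[0]
    k = len(sequence)
    for _ in range(len(spectrum) - 1):
        next_probe = by_prefix.get(sequence[-(k - 1):])
        if next_probe is None:
            raise ValueError("spectrum does not form a single chain")
        sequence += next_probe[-1]
    return sequence


spots = hybridization_spectrum("ATGGCGTGCA", 4)
print(sorted(spots))
print(reconstruct(spots))
```

With short probes, repeated subsequences quickly make the chain ambiguous, and the sketch then raises an error instead of guessing.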

DNA sequencing methods currently under development include labeling the DNA polymerase,[29] reading the sequence as a DNA strand transits through nanopores,[30][31] and microscopy-based techniques, such as AFM or electron microscopy, that identify the positions of individual nucleotides within long DNA fragments (>5,000 bp) by labeling nucleotides with heavier elements (e.g., halogens) for visual detection and recording.[32][33]

In microfluidic Sanger sequencing, the entire thermocycling amplification of DNA fragments, as well as their separation by electrophoresis, is done on a single glass wafer (approximately 10 cm in diameter), thus reducing reagent usage as well as cost. In some instances researchers have shown that they can increase the throughput of conventional sequencing through the use of microchips. Further research is still needed to make this use of the technology effective.

In October 2006, the X Prize Foundation established an initiative to promote the development of full genome sequencing technologies, called the Archon X Prize, intending to award $10 million to "the first Team that can build a device and use it to sequence 100 human genomes within 10 days or less, with an accuracy of no more than one error in every 100,000 bases sequenced, with sequences accurately covering at least 98% of the genome, and at a recurring cost of no more than $10,000 (US) per genome."[34]

Each year the NHGRI promotes grants for new research and developments in genomics. The 2010 grants and 2011 candidates include continuing work on microfluidic, polony and base-heavy sequencing methodologies.[35]

Major landmarks in DNA sequencing


1953 Discovery of the structure of the DNA double helix.[36]
1972 Development of recombinant DNA technology, which permits isolation of defined fragments of DNA; prior to this, the only accessible samples for sequencing were from bacteriophage or virus DNA.
1977 The first complete DNA genome to be sequenced is that of bacteriophage φX174.[37]
1977 Allan Maxam and Walter Gilbert publish "DNA sequencing by chemical degradation".[5] Frederick Sanger, independently, publishes "DNA sequencing with chain-terminating inhibitors".[38]
1984 Medical Research Council scientists decipher the complete DNA sequence of the Epstein-Barr virus, 170 kb.
1986 Leroy E. Hood's laboratory at the California Institute of Technology and Smith announce the first semi-automated DNA sequencing machine.
1987 Applied Biosystems markets the first automated sequencing machine, the model ABI 370.
1990 The U.S. National Institutes of Health (NIH) begins large-scale sequencing trials on Mycoplasma capricolum, Escherichia coli, Caenorhabditis elegans, and Saccharomyces cerevisiae (at US$0.75/base).
1991 Sequencing of human expressed sequence tags begins in Craig Venter's lab, an attempt to capture the coding fraction of the human genome.[39]
1995 Craig Venter, Hamilton Smith, and colleagues at The Institute for Genomic Research (TIGR) publish the first complete genome of a free-living organism, the bacterium Haemophilus influenzae. The circular chromosome contains 1,830,137 bases and its publication in the journal Science[40] marks the first use of whole-genome shotgun sequencing, eliminating the need for initial mapping efforts.
1996 Pål Nyrén and his student Mostafa Ronaghi at the Royal Institute of Technology in Stockholm publish their method of pyrosequencing.[41]
1998 Phil Green and Brent Ewing of the University of Washington publish phred for sequencer data analysis.[42]
2000 Lynx Therapeutics publishes and markets "MPSS", a parallelized, adapter/ligation-mediated, bead-based sequencing technology, launching "next-generation" sequencing.[43]
2001 A draft sequence of the human genome is published.[44][45]
2004 454 Life Sciences markets a parallelized version of pyrosequencing.[46][47] The first version of their machine reduced sequencing costs 6-fold compared to automated Sanger sequencing, and was the second of a new generation of sequencing technologies, after MPSS.[24]

In silico biology: bioinformatics and data mining
Journal: CABIOS (Computer Applications in the Biosciences), 1984

Sequencing: construction of physical maps (the physical position of each clone on the chromosome, based on physical distance). [A linkage or chromosomal map, by contrast, is made from recombination frequencies.]

Methods of whole-genome sequencing (started during the late 1980s):

1. Whole-genome shotgun (WGS) sequencing

Bacterium Haemophilus influenzae (1995): 140 sequence contigs, each of 2-20 overlapping clones and representing different, non-overlapping portions of the genome.

A contig is a set of contiguous overlapping clones, each contig having from two to more than 25 clones; a singleton is a clone not incorporated into any contig.

The International Human Genome Project (HGP, headed by Francis Collins of the NIH; a federal, multi-state, multi-institution effort; IHGSC = International Human Genome Sequencing Consortium) and a private genomics company (Celera Genomics of Maryland, USA, led by Craig Venter) released rough drafts of the human genome sequence in February 2001.

Genomes of the following eukaryotes were sequenced:
Yeast (Saccharomyces cerevisiae)
Nematode (Caenorhabditis elegans)
Fruit fly (Drosophila melanogaster)
Higher plant (Arabidopsis thaliana)
Fission yeast (Schizosaccharomyces pombe), 2002
Later, the mouse and human genomes
Rough drafts of the rice genome by M/S Monsanto and M/S Syngenta, and also by Chinese researchers

Assembly packages for large-scale genome sequencing:

Shotgun sequences are generated by automatic sequencers.
Contig: a set of contiguous overlapping clones.
PHRED: base-calling software with quality identification. It reads traces from the ABI 373, ABI 377 and ABI 3700, and documents the called sequences in PHRED files or FASTA format.

The output can be read by the PHRAP sequence assembly program.
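The quality values that PHRED attaches to each base call follow the standard definition Q = -10 log10(P), where P is the estimated probability that the call is wrong. The sketch below converts between the two and picks out the longest high-quality stretch of a read. The cut-off of Q20 and the function names are assumptions made for illustration and are not settings taken from PHRED or PHRAP.

```python
import math


def phred_to_error_prob(quality):
    """Error probability for a PHRED quality value (Q = -10 * log10(P))."""
    return 10 ** (-quality / 10)


def error_prob_to_phred(error_prob):
    """PHRED quality value for an error probability."""
    return -10 * math.log10(error_prob)


def high_quality_span(qualities, min_quality=20):
    """Return (start, end) of the longest run with quality >= min_quality.

    end is exclusive; (0, 0) is returned when no base passes the cut-off.
    """
    best = (0, 0)
    run_start = None
    # The sentinel closes a run that extends to the end of the read.
    for i, quality in enumerate(list(qualities) + [float("-inf")]):
        if quality >= min_quality:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best


read_qualities = [8, 12, 25, 30, 40, 35, 9, 22, 21, 5]
print(phred_to_error_prob(20))            # one error in 100 calls
print(high_quality_span(read_qualities))  # indices of the usable stretch
```

An assembler that builds a consensus from the high-quality parts of reads needs exactly this kind of per-base information, which is why base-calling output with quality values is the natural input for assembly.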

PHRAP (fragment assembly program, or Phil's Revised Assembly Program): for assembly of shotgun DNA sequence data. It builds each contig sequence as a mosaic of the high-quality parts of the reads and leaves low-quality values out of the consensus sequence.
DEMIGLACE: for identifying polymorphisms, including SNPs.
Consed: sequence editor-viewer, for viewing and editing genome sequence.

Seqman and Sequencher: for small-scale sequencing projects.
EST clustering packages: ESTs of more than 250 species are in dbEST. EST = expressed sequence tag (a short sequence of an expressed gene).
CORPUS (Contig Semantics) refinement performed using

2. Map first, sequence later / Clone-by-clone strategy / Hierarchical shotgun cloning

Between 10,000 and 20,000 BACs were selected to generate a working draft of the human genome. A minimum tiling path is determined (with suitable algorithms and software) so that as few BAC clones as possible cover the entire genome. BAC clones are used for subcloning fragments of 500-800 bp into cosmid or plasmid vectors. These are sequenced randomly. Every part of the genome is sequenced 4-5 times so that no part is left out. In WGS sequencing, 8-10-fold sequencing is required for similar efficiency.

2.1. Construction of a whole-genome BAC map (BAC-by-BAC approach)

Maps with landmarks based on molecular markers such as RFLPs, STSs, SSRs and AFLPs are now available, which allows the construction of whole-genome physical maps of ordered BACs. The steps involved are:

(i) Fingerprinting of BAC clones giving 10-15 times coverage of the genome, e.g. 300,000 BACs (each 150 kb) = 45 billion bp (15-fold coverage of 3 billion bp).

BACs are arranged as contigs and singletons with software such as FPC. Physical mapping on the chromosome is done by chromosome walking, FISH, deletion stocks and radiation hybrids. [In radiation hybrid mapping, human chromosomes are separated from one another and broken into several fragments using high doses of X-rays. Similar to the underlying principle of mapping genes by linkage analysis based on recombination events, the farther apart two DNA markers are on a chromosome, the more likely a given dose of X-rays will break the chromosome between them and thus place the two markers on two different chromosomal fragments. The order of markers on a chromosome can be determined by estimating the frequency of breakage, which in turn depends on the distance between the markers. This technique has been used to construct whole-genome radiation hybrid maps.] {Technique: A rodent-human somatic cell hybrid ("artificial" cells with both rodent and human genetic material), which contains a single copy of the human chromosome of interest, is X-irradiated. This breaks the chromosome into several pieces, which are subsequently integrated into the rodent chromosomes. In addition, the dosage of radiation is sufficient to kill the somatic cell hybrid (donor) cells, which are then rescued by fusing them with non-irradiated rodent recipient cells. The latter, however, lack an important enzyme and are also killed when grown in a specific medium. Therefore, the only cells that can survive the procedure are donor-recipient hybrids that have acquired a rodent gene for the essential enzyme from the irradiated rodent-human cell line.}
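The link between breakage frequency and marker distance can be made concrete with a deliberately crude sketch. Each marker is scored as retained (1) or lost (0) in a panel of hybrid cell lines; two markers that are often separated by a break tend to show different retention patterns. The panel data and marker names below are invented, and the raw discordance fraction is only a proxy: real radiation hybrid mapping estimates the breakage frequency with a statistical model that accounts for how often fragments are retained.

```python
from itertools import combinations


def discordance(pattern_a, pattern_b):
    """Fraction of hybrid lines that retain exactly one of the two markers."""
    if len(pattern_a) != len(pattern_b):
        raise ValueError("markers must be scored on the same hybrid panel")
    differing = sum(a != b for a, b in zip(pattern_a, pattern_b))
    return differing / len(pattern_a)


def pairwise_discordance(retention):
    """Discordance for every marker pair; smaller values suggest closer markers."""
    return {
        (m1, m2): discordance(retention[m1], retention[m2])
        for m1, m2 in combinations(sorted(retention), 2)
    }


# Hypothetical retention patterns of three markers across eight hybrid lines.
panel = {
    "A": "11001010",
    "B": "11001000",
    "C": "10011001",
}
for pair, value in sorted(pairwise_discordance(panel).items(), key=lambda kv: kv[1]):
    print(pair, value)
```

In this invented panel the pair A/B is least often separated and A/C most often, which is consistent with the marker order A, B, C.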

(ii) Contigs are joined by examining their extreme ends, using a gap-filling approach.
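Step (i) above arranges fingerprinted BACs into contigs and singletons. A minimal sketch of that idea follows, assuming that each fingerprint is simply a list of restriction fragment sizes in bp, that two bands match when they differ by no more than a fixed tolerance, and that two clones overlap when they share at least a fixed number of bands. The clone names, sizes, tolerance and threshold are all hypothetical, and production software scores overlaps statistically rather than with a raw band count.

```python
def shared_bands(fingerprint_a, fingerprint_b, tolerance=5):
    """Count bands of A that match a not-yet-used band of B within tolerance (bp)."""
    remaining = sorted(fingerprint_b)
    shared = 0
    for size in sorted(fingerprint_a):
        for i, other in enumerate(remaining):
            if abs(size - other) <= tolerance:
                shared += 1
                del remaining[i]  # each band of B may be matched only once
                break
    return shared


def build_contigs(fingerprints, min_shared=3, tolerance=5):
    """Group clones linked by overlaps; return (contigs, singletons)."""
    names = sorted(fingerprints)
    parent = {name: name for name in names}

    def find(name):
        # Union-find root lookup with path compression.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if shared_bands(fingerprints[a], fingerprints[b], tolerance) >= min_shared:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    contigs = [members for members in groups.values() if len(members) > 1]
    singletons = [members[0] for members in groups.values() if len(members) == 1]
    return contigs, singletons


clones = {
    "bac1": [500, 1200, 2300, 3100, 4800],
    "bac2": [1202, 2298, 3100, 5600, 7000],
    "bac3": [3102, 5598, 7003, 8100, 9500],
    "bac4": [650, 1800, 2750, 6100, 9900],
}
print(build_contigs(clones))
```

Here bac1 and bac3 share only one band, but both overlap bac2, so all three end up in one contig, while bac4 stays a singleton.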

2.2. Building of clone contigs (overlapping series of cloned DNA fragments)

(i) Hybridization approach: The first clone is selected by hybridization with mapped DNA markers. One then progresses by chromosome walking to the next clone, whose insert overlaps with that of the previous clone: the clone in question is used as a probe to screen the genomic library. The problem with this method is the presence of repetitive DNA, which gives non-specific hybridization. This can be partly reduced by prehybridization with an excess of genomic DNA. Subcloning the end of the clone and using it as a probe also eliminates the non-specific hybridization due to repetitive DNA.

(ii) PCR approach for building contigs: The end of a clone is sequenced, and a PCR primer pair designed from that sequence is tested on all other clones, so that overlapping clones can be identified. This process is then repeated. It can be sped up by combinatorial screening.

STS (sequence-tagged site) content mapping is used to find overlapping YAC or BAC clones. [An STS is a short unique sequence that identifies one or more specific loci and can be amplified by PCR. Each STS has a pair of PCR primers, which are designed by partial sequencing of an RFLP probe representing a mapped low-copy-number DNA sequence. A sequence-tagged site is a short (200 to 500 base pair) DNA sequence that has a single occurrence in the genome and whose location and base sequence are known. STSs can be easily detected by the polymerase chain reaction (PCR) using specific primers. For this reason they are useful for constructing genetic and physical maps from sequence data reported from many different laboratories. They serve as landmarks on the developing physical map of a genome.

When STS loci contain genetic polymorphisms (e.g. simple sequence length polymorphisms, SSLPs, or single nucleotide polymorphisms), they become valuable genetic markers, i.e. loci that can be used to distinguish individuals. They are used in shotgun sequencing, specifically to aid sequence assembly. The STS concept was introduced by Olson et al. (1989). In assessing the likely impact of the polymerase chain reaction (PCR) on human genome research, they recognized that single-copy DNA sequences of known map location could serve as markers for genetic and physical mapping of genes along the chromosome. The advantage of STSs over other mapping landmarks is that the means of testing for the presence of a particular STS can be completely described as information in a database: anyone who wishes to make copies of the marker simply looks up the STS in the database, synthesizes the specified primers, and runs the PCR under the specified conditions to amplify the STS from genomic DNA. In most cases STS markers are co-dominant, i.e., they allow heterozygotes to be distinguished from the two homozygotes. The DNA sequence of an STS may contain repetitive elements, sequences that appear elsewhere in the genome, but as long as the sequences at both ends of the site are unique and conserved, researchers can uniquely identify this portion of the genome using tools usually present in any laboratory. Thus, in a broad sense STSs include such markers as microsatellites (SSRs, STMS or SSRPs), SCARs, CAPS and ISSRs.]

With combinatorial screening, two BAC clones giving PCR products of the same size with the same STS primers are assumed to overlap. Combinatorial screening significantly reduces the number of PCR reactions required.
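How combinatorial screening cuts the number of PCR reactions can be shown with a simple pooling scheme. The layout is an assumption, since the notes do not specify one: clones sit in a grid (for example an 8 x 12 plate), one PCR is run on the pooled clones of each row and one on each column, and a clone is a candidate when both its row pool and its column pool are positive. The set of positive clones stands in for the PCR results, and all names are hypothetical.

```python
def pooled_screen(grid, positive_clones):
    """Screen row and column pools instead of single clones.

    grid            -- list of rows, each a list of clone names
    positive_clones -- clones that actually carry the STS (simulated PCR result)
    Returns (candidate clones, number of pool PCRs performed).
    """
    n_rows, n_cols = len(grid), len(grid[0])
    row_hits = [r for r in range(n_rows)
                if any(clone in positive_clones for clone in grid[r])]
    col_hits = [c for c in range(n_cols)
                if any(grid[r][c] in positive_clones for r in range(n_rows))]
    # A clone is a candidate when its row pool and column pool are both positive.
    candidates = [grid[r][c] for r in row_hits for c in col_hits]
    return candidates, n_rows + n_cols


plate = [[f"r{r}c{c}" for c in range(12)] for r in range(8)]
candidates, n_pcr = pooled_screen(plate, {"r2c5", "r2c9"})
print(candidates, n_pcr, "pool PCRs instead of", 8 * 12)
```

When positives fall in different rows and different columns, the intersections include clones that are not positive, so a few confirmation PCRs on single clones are still needed. The total remains far below one reaction per clone.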

(iii) Clone fingerprinting: Restriction fragments from all clones are electrophoresed and the banding patterns compared. Fragments of similar size indicate overlaps. This is suitable even for sequencing over long distances, such as a whole genome (unlike the earlier two methods). Overlaps are inferred from the banding patterns. Clone fingerprinting includes (a) restriction fingerprinting, (b) repetitive DNA probing, (c) repetitive DNA PCR fingerprinting and (d) STS content mapping.

(iv) Directed shotgun approach: This method sequences 10 times the genome size and covers 99.8% of the genome, leaving only a few gaps (which can be closed as in H. influenzae). For the human genome, 70 million individual clones (each 500 bp) will give 35,000 Mb of sequence, which is 10 times the genome size of 3,500 Mb. The MegaBACE 4000 sequencer (2001) can sequence 1-2 Mb/day, so 100 such machines can complete this amount of sequencing in about 1 year. With the ABI 3700 (1998), 70 machines (each handling 1000 clones = 0.5 Mb/day) would take about 3 years.
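The figures in this paragraph follow from two small formulas: coverage = (number of reads x read length) / genome size, and time = total sequence / (machines x output per machine per day). The sketch below reworks them; the function names are hypothetical, and the read count of 70 million is the value that 35,000 Mb at 500 bp per read implies.

```python
def coverage(n_reads, read_length_bp, genome_size_bp):
    """Fold coverage: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def days_to_finish(total_mb, n_machines, mb_per_machine_per_day):
    """Days needed to produce total_mb of raw sequence."""
    return total_mb / (n_machines * mb_per_machine_per_day)


# 10-fold shotgun coverage of a 3,500 Mb genome with 500 bp reads.
print(coverage(70_000_000, 500, 3_500_000_000))

# 100 machines at 1 Mb/day each (the lower end of the 1-2 Mb/day range).
print(days_to_finish(35_000, 100, 1.0) / 365)

# 70 machines at 0.5 Mb/day each.
print(days_to_finish(35_000, 70, 0.5) / 365)
```

At 1 Mb/day per machine the first configuration needs a little under a year, and the second needs 1000 days, or roughly 3 years, which matches the estimates quoted above.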

Whole genome sequence data :

Electronic PCR (e-PCR) : Bridging the gap between mapping and sequencing of genome

From genome sequence to function (annotation): integrative biology

Methods of annotation of genome sequences:
Annotation by sequence search
Loss of function by mutation approach:
i. Insertion mutagenesis
ii. Targeted gene disruption
iii. Gene silencing

Gene trap and enhancer trap:
