I. Brief history of sequencing
II. Sanger dideoxy method for sequencing
III. Sequencing large pieces of DNA
IV. The $1,000 genome
MC chapter 12
Methods of sequencing
A. Sanger dideoxy (primer extension/chain-termination) method: the most popular protocol for sequencing; very adaptable and scalable to large sequencing projects
B. Maxam-Gilbert chemical cleavage method: DNA is labelled and then chemically cleaved in a sequence-dependent manner. This method is tedious and not easily scaled
C. Pyrosequencing: chain extension is measured by monitoring pyrophosphate release
One way for obtaining single-stranded DNA from a double stranded source--magnets
Should be highly processive and incorporate ddNTPs efficiently
Should lack exonuclease activity
Thermostability required for cycle sequencing
b) Extend the primer with DNA polymerase in the presence of all four dNTPs, with a limited amount of a dideoxy NTP (ddNTP)
DNA polymerase incorporates ddNTPs in a template-dependent manner, but it works best if the DNA pol lacks 3'-to-5' exonuclease (proofreading) activity
ddATP in the reaction: anywhere there is a T in the template strand, a ddA will occasionally be added to the growing strand, terminating it
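The ddATP reaction can be sketched in a few lines of Python. This is only an illustration: the template string, the primer length and the function name `dda_fragments` are made up for the example, and the template is written 3'->5' so that its positions line up with the growing strand.

```python
def dda_fragments(template_3to5: str, primer_len: int) -> list[int]:
    """Lengths of all strands that a ddATP reaction can produce.

    The template is written 3'->5', so template position i pairs with
    position i of the new strand (which grows 5'->3').
    """
    lengths = []
    for i, base in enumerate(template_3to5):
        # Termination can only happen past the primer, opposite a template T.
        if i >= primer_len and base == "T":
            lengths.append(i + 1)  # new strand ends with ddA at position i
    return lengths


template = "GCATTGCATGTCA"  # hypothetical template, written 3'->5'
print(dda_fragments(template, primer_len=3))  # [4, 5, 9, 11]
```

Each length in the list corresponds to one band (or peak) in the A lane: one for every T in the template beyond the primer.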
Fluorescence: ddNTPs are chemically synthesized to contain fluors
Each ddNTP fluoresces at a different wavelength, allowing identification
Polyacrylamide gel electrophoresis -- good resolution of fragments differing by a single nucleotide
Slab gels: as previously described
Capillary gels: require only a tiny amount of sample to be loaded, run much faster than slab gels, best for high-throughput sequencing
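How a sequence is read from size-sorted, dye-terminated fragments can be sketched as follows. The template, primer length and function name are hypothetical, and the model is idealized: it assumes every possible terminated fragment is present and perfectly resolved.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def read_ladder(template_3to5: str, primer_len: int) -> str:
    """Read the newly synthesized sequence from terminated fragments."""
    # One fragment per position past the primer: (length, terminal ddNTP).
    fragments = [
        (i + 1, COMPLEMENT[base])
        for i, base in enumerate(template_3to5)
        if i >= primer_len
    ]
    # In the tube the fragments are mixed together...
    random.Random(0).shuffle(fragments)
    # ...and electrophoresis sorts them by size, shortest first.
    fragments.sort(key=lambda frag: frag[0])
    # The colour of each successive fragment gives the next base.
    return "".join(base for _, base in fragments)


print(read_ladder("GCATTGCATGTCA", primer_len=3))  # AACGTACAGT
```

The read is the complement of the template downstream of the primer, which is why single-nucleotide resolution of the gel matters.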
Animation of cycle sequencing: see http://www.dnai.org/
Click on: manipulation > techniques > sorting and sequencing
An automated sequencer
The output
~160 kbp
Sequencing done by TIGR (Maryland) and The Sanger Institute (Cambridge, UK) Here we report an analysis of the genome sequence of P. falciparum clone 3D7, including descriptions of chromosome structure, gene content, functional classification of proteins, metabolism and transport, and other features of parasite biology.
Sequencing strategy
A whole chromosome shotgun sequencing strategy was used to determine the genome sequence of P. falciparum clone 3D7. This approach was taken because a whole genome shotgun strategy was not feasible or cost-effective with the technology that was available at the beginning of the project. Also, high-quality large insert libraries of (A + T)-rich P. falciparum DNA have never been constructed in Escherichia coli, which ruled out a clone-by-clone sequencing strategy. The chromosomes were separated on pulsed field gels, and chromosomal DNA was extracted
The shotgun sequences were assembled into contiguous DNA sequences (contigs), in some cases with low coverage shotgun sequences of yeast artificial chromosome (YAC) clones to assist in the ordering of contigs for closure. Sequence tagged sites (STSs) [10], microsatellite markers [11,12] and HAPPY mapping [7] were also used to place and orient contigs during the gap closure process. The high (A + T) content of the genome made gap closure extremely difficult [7-9]. Chromosomes 1-5, 9 and 12 were closed, whereas chromosomes 6-8, 10, 11, 13 and 14 contained 3-37 gaps (most less than 2.5 kb) per chromosome at the beginning of genome annotation. Efforts to close the remaining gaps are continuing.
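The idea of assembling shotgun reads into contigs can be illustrated with a toy greedy overlap merger. This is a teaching sketch, not the assembler used in the project: the reads are invented, they are assumed to be error-free and all from the same strand, and `overlap` and `greedy_assemble` are hypothetical helper names.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


print(greedy_assemble(["GTTAGC", "ATGCGTAC", "GTACGTTA"]))  # ['ATGCGTACGTTAGC']
print(greedy_assemble(["AAAA", "CCCC"]))  # two contigs, i.e. one gap
```

When no overlap joins two pieces, they stay as separate contigs. That is the situation in which extra mapping information is needed to order and orient them.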
Methods: Sequencing, gap closure and annotation
The techniques used at each of the three participating centres for sequencing, closure and annotation are described in the accompanying Letters [7-9]. To ensure that each centre's annotation procedures produced roughly equivalent results, the Wellcome Trust Sanger Institute (Sanger) and the Institute for Genomic Research (TIGR) annotated the same 100-kb segment of chromosome 14. The number of genes predicted in this sequence by the two centres was 22 and 23; the discrepancy being due to the merging of two single genes by one centre. Of the 74 exons predicted by the two centres, 50 (68%) were identical, 9 (12%) overlapped, 6 (8%) overlapped and shared one boundary, and the remainder were predicted by one centre but not the other. Thus 88% of the exons predicted by the two centres in the 100-kb fragment were identical or overlapped.
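The exon agreement figures can be checked with a few lines of arithmetic, using only the counts given in the excerpt (74 exons in total; 50 identical, 9 overlapping, 6 overlapping with one shared boundary). The helper name `percent` is just for this example.

```python
def percent(n: int, total: int) -> int:
    """Share of total, rounded to a whole percent."""
    return round(100 * n / total)


total_exons = 74
counts = {
    "identical": 50,
    "overlapped": 9,
    "overlapped and shared one boundary": 6,
}

for label, n in counts.items():
    print(f"{label}: {n}/{total_exons} = {percent(n, total_exons)}%")

agreeing = sum(counts.values())  # 65 exons identical or overlapping
print(f"identical or overlapping: {percent(agreeing, total_exons)}%")
print(f"predicted by one centre only: {total_exons - agreeing} exons")
```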
Previous sequencing techniques: one DNA molecule at a time
Needed: many DNA molecules at a time -- arrays
Sequence by DNA polymerase-dependent chain extension, one base at a time, in the presence of a reporter (luciferase)
Luciferase is an enzyme that will emit a photon of light in response to the pyrophosphate (PPi) released upon nucleotide addition by DNA polymerase
Flashes of light and their intensity are recorded
The readout is recorded by a detector that measures the position and intensity of the light flashes
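A rough sketch of how a flash record is turned back into sequence. It rests on assumptions that are not in the slides: nucleotides are flowed one at a time in a fixed cyclic order (here `TACG`, chosen arbitrarily), and the light intensity is proportional to the number of identical bases added in that flow. The function names are hypothetical.

```python
def flowgram(new_strand: str, flow_order: str = "TACG", cycles: int = 10):
    """Simulate the signal for each nucleotide flow.

    Returns a list of (nucleotide, intensity); the intensity is the number
    of bases incorporated during that flow (0 means no flash).
    """
    signals, pos = [], 0
    for _ in range(cycles):
        for nt in flow_order:
            run = 0
            # A run of identical bases is added in a single flow.
            while pos < len(new_strand) and new_strand[pos] == nt:
                run += 1
                pos += 1
            signals.append((nt, run))
    return signals


def decode(signals) -> str:
    """Rebuild the sequence: each flow contributes `intensity` bases."""
    return "".join(nt * intensity for nt, intensity in signals)


signals = flowgram("TTAGGGC", cycles=2)
print(signals[:4])       # [('T', 2), ('A', 1), ('C', 0), ('G', 3)]
print(decode(signals))   # TTAGGGC
```

On an array, the position of each flash identifies which template is being read, and the intensity gives the length of the run.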
From www.454.com
On WebCT -- The $1000 genome -- review of new sequencing techniques by George Church
Introduction to bioinformatics
1) Making biological sense of DNA sequences
2) Online databases: a brief survey
3) Database in depth: NCBI
4) What is BLAST?
5) Using BLAST for sequence analysis
6) Biology workbench, etc.
www.ncbi.nlm.nih.gov
www.tigr.org
http://workbench.sdsc.edu
(2006)
b)
2) Non-genes
a) Regulation: promoters and factor-binding sites
b) Transactions: replication, repair, and segregation, DNA packaging (nucleosomes)
Sequence output
Raw data
Computer calls
GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
ORF map
1) Where are the potential starts (ATG) and stops (TAA, TAG, TGA)?
2) Which reading frame is correct?
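Both questions can be explored with a small ORF scan. The sketch below is a simplification: it looks only at the three forward reading frames, treats the first ATG after a stop as the start, and ignores codons containing N. The function name, the `min_codons` cutoff and the test sequence are made up for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 30):
    """Return (frame, start, end) for each ATG...stop ORF.

    start is the index of the A of ATG; end is one past the stop codon.
    min_codons counts codons from ATG up to, but not including, the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first potential start in this frame
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None  # look for the next ORF after this stop
    return orfs


toy = "CCATGAAATTTGGGTAACC"  # hypothetical sequence
print(find_orfs(toy, min_codons=1))  # [(2, 2, 17)]
```

A long open stretch in one frame, with frequent stops in the other two, is the usual hint as to which reading frame is correct.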
NCBI
NCBI home page -- go to www.ncbi.nlm.nih.gov for the following pages
PubMed: search tool for literature -- search by author, subject, title words, etc.
All databases: a retrieval system for searching several linked databases
Alignment of sequences
The principle: two homologous sequences derived from the same ancestral sequence will have at least some identical (or similar) amino acid residues
The fraction of identical amino acids is called percent identity
Similar amino acids: some amino acids have similar physical/chemical properties and are more likely to substitute for each other -- these give specific similarity scores in alignments
Gaps in similar/homologous sequences are rare, and are given penalty scores
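Percent identity and a gap-penalized score can be computed for two rows that are already aligned. The scoring values below (+1 match, -1 mismatch, -2 gap) are arbitrary placeholders rather than a real substitution matrix, and percent identity is taken over the alignment length, which is one of several conventions. The example rows are the first ten residues of the Pho TFB1 and Pfu TFB1 sequences from the alignment slide.

```python
def score_alignment(a: str, b: str, match=1, mismatch=-1, gap=-2):
    """Percent identity and a simple score for two pre-aligned rows.

    Rows must have equal length; '-' marks a gap. Assumes a pairwise
    alignment, so no column has a gap in both rows.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    identical = 0
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap        # gaps are penalized
        elif x == y:
            identical += 1
            score += match
        else:
            score += mismatch
    return 100.0 * identical / len(a), score


print(score_alignment("MTKQKVCPVC", "MNKQKVCPAC"))  # (80.0, 6)
print(score_alignment("MTKQ-VC", "MTKQKVC")[1])     # 4: six matches, one gap
```

A real similarity score would replace the flat mismatch penalty with a matrix that scores chemically similar residues more favourably.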
Homology of proteins
Homology: similarity of biological structure, physiology, development, and evolution, based on genetic inheritance
Homologous proteins: statistically similar sequences, and therefore (often, but not always) similar functions
Pho TFB1    1 -----------------MTKQKVCPVCGST--EFIYDPERGEIVCARCGY
Pab TFB     1 -----------------MTKQRVCPVCGST--EFIYDPERGEIVCARCGY
Pfu TFB1    1 -----------------MNKQKVCPACESA--ELIYDPERGEIVCAKCGY
Tko TFB1    1 -----------------MSGKRVCPVCGST--EFIYDPSRGEIVCKVCGY
Tko TFB2    1 ------------MRG--ISPKRVCPICGST--EFIYDPRRGEIVCAKCGY
Pfu TFB2    1 -------MSSTEPGGGWLIYPVKCPYCKSR--DLVYDRQHGEVFCKKCGS
Pho TFB2    1 ------------YGG----SKIRCPVCGSS--KIIYDPEHGEYYCAECGH
Sso TFB1    1 ------------MLYLSEENKSVSTPCPPD--KIIFDAERGEYICSETGE
Sso TFB2    1 ---------------------MKCPYCKTDN-AITYDVEKGMYVCTNCAS
Sce TFIIB   1 MMTRESIDKRAGRRGPNLNIVLTCPECKVYPPKIVERFSEGDVVCALCGL
(Pho TFB2: N-terminal domain (NTD) deduced from BLAST)

Pho TFB1   32 VIEENIIDMGPEWRAFDASQR--EKRSRTGAPESILLHDKGLSTDIGIDR
Pab TFB    32 VIEENIVDMGPEWRAFDASQR--EKRSRTGAPESILLHDKGLSTDIGIDR
Pfu TFB1   32 VIEENIIDMGPEWRAFDASQR--ERRSRTGAPESILLHDKGLSTEIGIDR
Tko TFB1   32 VIEENVVDEGPEWRAFDPGQR--EKRARVGAPESILLHDKGLSTDIGIDR
Tko TFB2   35 VIEENVVDEGPEWRAFEPGQR--EKRARTGAPMTLMIHDKGLSTDIDWRD
Pfu TFB2   42 ILATNLVDSEL--------------SRKTKTNDIPRY-TKRIG-------
Pho TFB2   33 VIKS--FDTRV--------------RTFSSP---PKFRSKGTS-------
enzymes
Non-enzymes
40-20% identity: fold can be predicted by similarity but precise function cannot be predicted (the 40% rule)
BLAST considers all possible combinations of matches, mismatches, and gaps in any given alignment
Gives the best (highest scoring) alignment of sequences
Three scores:
1) percent identity
2) similarity score
3) E-value -- the number of hits with this much similarity expected by chance (lower number, higher probability of evolutionary homology, higher probability of similar function)
What is the E-value? The E value represents the chance that the similarity is random and therefore insignificant. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance.
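The dependence of the E value on database size can be made concrete with the standard Karlin-Altschul relation between bit score and expected number of chance hits, E = m x n x 2^(-S'), where m is the query length, n the total database length and S' the bit score. The sketch ignores the effective-length corrections that BLAST applies, and the numbers in the example are invented.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least bit_score.

    Uses E = m * n * 2**(-S'), with raw (uncorrected) lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


query_len = 300          # hypothetical query length (residues)
db_len = 1_000_000_000   # hypothetical database size (residues)

for bits in (30, 40, 60):
    print(bits, expect_value(bits, query_len, db_len))

# The same score is less significant in a database twice as large.
ratio = expect_value(40, query_len, 2 * db_len) / expect_value(40, query_len, db_len)
print(ratio)  # 2.0
```

Every extra bit of score halves E, while doubling the database doubles it, which is why the same alignment can look significant in a small genome search and insignificant against all of GenBank.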
You can change the Expect value threshold on most main BLAST search pages. When the Expect value is increased from the default value of 10, a larger list with more low-scoring hits can be reported.
E values (continued)
From the BLAST tutorial: Although hits with E values much higher than 0.1 are unlikely to reflect true sequence relatives, it is useful to examine hits with lower significance (E values between 0.1 and 10) for short regions of similarity. In the absence of longer similarities, these short regions may allow the tentative assignment of biochemical activities to the ORF in question. The significance of any such regions must be assessed on a case by case basis.
Multi-domain proteins
E value greater than 10^-10: similar structure but possibly different functions
Computer calls
GNNTNNTGTGNCGGATACAATTCCCCTCTAGAAATAATTTTGTTTAACTTTAAGAAGGAGATATACATATGCACCACCAC CACCACCACCCCATGGGTATGAATAAGCAAAAGGTTTGTCCTGCTTGTGAATCTGCGGAACTTATTTATGATCCAGAAAG GGGGGAAATAGTCTGTGCCAAGTGCGGTTATGTAATAGAAGAGAACATAATTGATATGGGTCCTAAGTGGCGTGCTTTTG ATGCTTCTCAAAGGGAACGCAGGTCTAGAACTGGTGCACCAGAAAGTATTCTTCTTCATGACAAGGGGCTTTCAACTGCA ATTGGAATTGACAGATCGCTTTCCGGATTAATGAGAGAGAAGATGTACCGTTTGAGGAAGTGGCANTCCANATTANGAGT TAGTGATGCAGCANANAGGAACCTAGCTTTTGCCCTAAGTGAGTTGGATAGAATTNCTGCTCAGTTAAAACTTCCNNGAC ATGTAGAGGAAGAAGCTGCAANGCTGNACANAGANGCAGNGNGANAGGGACTTATTNGANGCAGATCTATTGAGAGCGTT ATGGCGGCANGTGTTTACCCTGCTTGTAGGTTATTAAAAGNTCCCGGGACTCTGGATGAGATTGCTGATATTGCTAGAGC
BLAST against (go to genomes page):
-- Microbial genomes
-- environmental sequences (genomes)
Results:
1) Distribution of hits: query sequence and positions in sequence that gave alignments
2) Sequences producing significant alignments
   1) Accession number (this takes you to the sequence that yielded the hit: gene or contig)
   2) Name of sequence (sometimes identifies the gene)
   3) Similarity score
   4) E-value
3) Alignments arranged by E value, with links to gene reports
1) Homology? The function is only inferred (NOT known)
2) Large percentages of coding proteins cannot be assigned function based on homology
For a current list of databases and bioinformatics tools see: Nucleic Acids Research annual bioinformatics issue (comes out every January).
http://www.oxfordjournals.org/nar/database/cap/
Any DNA or protein sequence can then be compared to all other sequences in databases, and similar sequences identified
There is much more -- a great diversity of programs and databases are available
Brown and Botstein (1999) Exploring the new world of the genome with DNA microarrays Nature Genetics 21, p. 33-37.
genome: DNA
transcriptome: RNA
proteome: protein
3)
1) PCR each ORF (several for each ORF), attach (spot) each PCR product to a solid support in a specific order (pioneered by Pat Brown's lab, Stanford)
2) Chemically synthesize ORF-specific oligonucleotide probes directly on a microchip (Affymetrix)
The RNA comes from the cells and conditions you are interested in
hybridize
read-out
(#6 helps correct for variations in the quantity of starting RNA, and for variable labelling and detection efficiencies)
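One common way to correct for such variation, not necessarily the step referred to above, is to hybridize a differently labelled reference sample to the same array and work with log ratios centred on their median. The sketch below assumes two lists of spot intensities in the same ORF order; the function name and the numbers are invented.

```python
import math
from statistics import median


def normalized_log_ratios(sample, reference):
    """log2(sample/reference) per spot, shifted so the median is 0.

    Centring on the median assumes that most ORFs do not change, so a
    global offset reflects RNA quantity or labelling efficiency rather
    than biology.
    """
    raw = [math.log2(s / r) for s, r in zip(sample, reference)]
    centre = median(raw)
    return [x - centre for x in raw]


sample = [200.0, 400.0, 800.0]      # hypothetical intensities, condition of interest
reference = [100.0, 100.0, 100.0]   # hypothetical intensities, reference RNA
print(normalized_log_ratios(sample, reference))  # [-1.0, 0.0, 1.0]
```

In the toy data every spot is at least twofold brighter in the sample channel. After centring, only the departures from the typical ratio remain.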
Mass spectrometry
Separate individual proteins from the cell by charge and mass; individual proteins can then be identified (but genome sequence information is needed for this)
2D gel electrophoresis
1) Separate proteins on the basis of isoelectric point
2D gel electrophoresis
2) Lay the gel containing the isoelectrically focused proteins on an SDS-PAGE gel and separate on the basis of size
Liquid chromatography and tandem mass spectrometry
Software for processing data
From J.R. Yates 1998 Mass spectrometry and the age of the proteome J Mass Spec. 33, p 1-19
Example of a protein microarray
Proteins fused to GST with 6x histidine tags, immobilized on a Ni2+ matrix
Anti-GST tells how much protein is immobilized on the surface
Specific assays identify proteins with specific activities -- calmodulin binding, phosphoinositide binding