Why bioinformatics?
Other techniques raise more questions than they answer. Bioinformatics is what answers the questions those techniques generate.
Outline
Bioinformatics Defined
Evolution of Bioinformatics
Bioinformatics History
Common Uses of Bioinformatics
Procedures and Tools of Bioinformatics
Our Procedure
Our Results
Resources
Bioinformatics Defined
Bioinformatics is a broad term covering the use of computer algorithms to analyze biological data.
It differs from computational biology: computational biology is the use of computer technology to answer a single, hypothesis-based question, whereas bioinformatics is the omnibus use of computerized statistical analysis to make statistical or comparative inferences, i.e. converting data to information.
1977: ΦX174 phage genome sequenced
1990: Paper published in the Journal of Molecular Biology describes a sequence alignment search algorithm
1990s: Software used to find fragment overlap for the Human Genome Project
1992: NCBI takes over the GenBank DNA sequence database in response to the growing number of gene patents
1994: Entrez Global Query Cross-Database Search System allows users to search the GenBank database
1995: Dr. Owen White writes software to help find gene elements (promoters, start and stop codons, etc.) in the sequenced Haemophilus influenzae genome
1996: NCBI-BLAST created to provide powerful heuristic searches against the GenBank database
Because proteins are ultimately the tool of all* gene expression, proteomics is, in effect, the product science made possible by bioinformatics
A proteome is the collection of all proteins expressed in a cell at a given time
Every organism has 1 genome, but many proteomes
In addition to high-throughput protein analysis, proteomics is researched through cDNA analysis (RT-PCR)
Proteomics represents a methodical addition of large-scale biology to traditional molecular biology, made possible by bioinformatics
Protein or gene homology refers to nucleotide or amino acid sequences or domains shared between different proteins or genes, whether from the same or different organisms
Homology searches look through databases for nucleotide or amino acid sequences that match sequences in unknown samples
DNA Sequencing
Sanger Method
New nucleotide chains of DNA being replicated by DNA polymerase are stopped when dideoxynucleotides (ddNTPs, added to the reaction mixture in a ~1/100 ratio) are incorporated into the chain
DNA Sequencing
Fluorescent dyes are bound to the ddNTPs, allowing the molecule to be detected when it is excited by a laser
Terminated DNA chains are run on a gel, and fragments are resolved by size
By combining the fluorescence readings from each size of nucleotide chain, the DNA sequence is computed
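The read-out step can be illustrated with a minimal Python sketch. It assumes an idealized input in which each fragment size has exactly one base call; the function name and the toy readings are hypothetical and are not taken from any sequencing software.

```python
def read_sanger_fragments(fragments):
    """Reconstruct a sequence from chain-terminated fragments.

    fragments: iterable of (length, terminal_base) pairs, one per fragment
    size, where terminal_base is the base reported by the dye on the ddNTP
    that terminated the chain. Idealized: real chromatograms are noisy.
    """
    ordered = sorted(fragments)  # shortest fragment ends nearest the primer
    lengths = [length for length, _ in ordered]
    if len(set(lengths)) != len(lengths):
        raise ValueError("more than one base call for a fragment length")
    return "".join(base for _, base in ordered)


# Toy, made-up readings (not real data): sizes arrive unordered off the gel.
toy_readings = [(3, "G"), (1, "A"), (4, "C"), (2, "T")]
print(read_sanger_fragments(toy_readings))
```

Sorting by fragment length is the software equivalent of resolving the fragments by size on the gel.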
Sequence Analysis
First Things First
Sequence File Formats:
Most common for nucleotides: FASTA / Multi-FASTA
">" followed by any unicode text; the entire line is read as the sequence title
Line break followed by a continuous 5'-3' nucleotide sequence, or a protein sequence using 1-letter codes
Example:
>E. coli Globin-coupled chemotaxis sensory transducer (TM domain)
ATGGACCTGATCACAAATGCGATTTAGAGACCTGATCACAAATG
CGATGACCTGATCACAAATGCGATGACCTGATCACAAATGCGA
TGTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGAT
CTAAACCTGATCACAAATGCGATGACCTGATCACAAATGCGATT
AA
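A minimal sketch of reading this format in Python is given here for illustration. The function name `parse_fasta` and the toy records are hypothetical; a real project would more likely use an established library.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) records from FASTA / Multi-FASTA text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new title line closes the previous record, if any.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line.replace(" ", "").upper())
        else:
            raise ValueError("sequence data found before the first '>' title line")
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


# Toy Multi-FASTA input (made-up sequences).
example = """>sample 1
ATGGAC
CTGA
>sample 2
ttgaca
"""
for name, seq in parse_fasta(example):
    print(name, seq, len(seq))
```

Each record keeps the whole title line as-is, matching the rule that everything after ">" is the sequence title.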
NCBI-BLAST
Run by the National Center for Biotechnology Information
BLAST uses a heuristic algorithm based on the Smith-Waterman algorithm
The algorithm searches the database for a small string within the query (the seed; default length 11 for nucleotide searches), then when it detects a match, searches for shared nucleotides at each end of the seed to extend the match
Gaps are taken into account, then the matches are presented in order of statistical significance
http://www.ncbi.nlm.nih.gov/BLAST/
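The seed-and-extend idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not BLAST's actual implementation: it extends only while bases match exactly, ignores gaps and scoring, and ranks hits by length rather than statistical significance. The function name is hypothetical.

```python
from collections import defaultdict


def seed_and_extend(query, subject, word_size=11):
    """Return ungapped exact-match hits as (query_start, subject_start, length)."""
    # Index every word of length word_size in the subject sequence.
    index = defaultdict(list)
    for j in range(len(subject) - word_size + 1):
        index[subject[j:j + word_size]].append(j)

    hits = set()
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], ()):
            # Extend the seed to the left while the bases agree.
            q_start, s_start = i, j
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            # Extend the seed to the right while the bases agree.
            q_end, s_end = i + word_size, j + word_size
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            hits.add((q_start, s_start, q_end - q_start))
    # Longest hits first (a stand-in for ranking by significance).
    return sorted(hits, key=lambda hit: -hit[2])


# Toy sequences with a short word size so the example stays readable.
print(seed_and_extend("AAGATTACACC", "TTGATTACAGT", word_size=5))
```

Overlapping seeds on the same diagonal extend to the same hit, which is why the hits are collected in a set.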
Basic nucleotide sequence searches
The BLAST that you used for your sequences
Similar technology used to search amino acid sequences
A more advanced protein BLAST useful for analyzing relationships between divergently evolved proteins
Use six-frame translation for proteins and nucleotides, respectively, in the search
MegaBLAST: Used for BLASTing several sequences at once to cut down on processing load and server reporting time
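Six-frame translation, mentioned above for the translated searches, can be sketched as follows. This assumes the standard genetic code and plain A/C/G/T input; the function names and frame labels are hypothetical, and codons containing other characters are reported as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return translations for frames +1..+3 and -1..-3 of a DNA sequence."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCCTAA").items():
    print(frame, protein)
```

Translating all six frames matters because a raw nucleotide sequence does not say which strand or which reading frame encodes the protein.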
Max/Total Score
Calculated from the number of matches and gaps. Higher relative to your query length is better
E Value
Translation: the E value is the number of matches scoring at least this well that would be expected by random chance when searching a database of this size
e.g. E = 1e-6 means about one such chance match would be expected per 1,000,000 searches of this kind
Smaller E values are better
Values larger than E = 1e-5 are too likely to be due to chance
Query Coverage
The percent of the query sequence matched by the database entry
Max Ident
The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)
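As a rough illustration of the two percentage columns, the sketch below computes them from one gapped pairwise alignment. It assumes identity is counted over all alignment columns (gaps included) and coverage over the aligned query bases; the function name and the toy alignment are hypothetical.

```python
def alignment_stats(aligned_query, aligned_subject, query_length):
    """Return percent identity and query coverage for one gapped alignment.

    aligned_query / aligned_subject: equal-length strings with '-' for gaps.
    query_length: length of the full, unaligned query sequence.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have the same length")
    columns = len(aligned_query)
    # Identical columns: same base in both rows, and not a gap.
    identical = sum(
        q == s and q != "-" for q, s in zip(aligned_query, aligned_subject)
    )
    # Query bases that take part in the alignment (gaps excluded).
    query_bases = sum(q != "-" for q in aligned_query)
    return {
        "percent_identity": 100 * identical / columns,
        "query_coverage": 100 * query_bases / query_length,
    }


# Toy alignment: 8 of 16 query bases aligned, one gap, one mismatch.
stats = alignment_stats("ACGT-ACGT", "ACGTTACCT", query_length=16)
print(stats)
```

The gap and the mismatch both lower the identity, which is the behaviour described for deletions or additions.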
Clustal (free)
ClustalX: software
ClustalW: web
http://www.ebi.ac.uk/clustalw/
DNAStar ($$$)
Functionality is similar, but the difference is in interface, tools, and speed of algorithms
SMART
Simple Modular Architecture Research Tool
Run by EMBL (European Molecular Biology Laboratory)
While BLAST compares nucleotide sequences and then informs you of any domains that may have been annotated to them, SMART compares by domains
PFAM
Protein domain database
Manually curated, trading volume for quality
Uses hidden Markov models for domain pattern recognition
Run by the Sanger Institute in the UK
Heuristic server-load analysis predicts when a key protein analysis report is due and crashes the server
http://www.sanger.ac.uk/Software/Pfam/
Interpro
Database of protein domains and functional sites
Best source of annotation
Other tools sometimes draw annotation from Interpro
Run by the European Bioinformatics Institute
http://www.ebi.ac.uk/interpro/
Protein Folding
Use knowledge of biochemistry to fold protein into predicted structure, then software to find lowest energy state
Commercial Programs:
Protein Shop Profold
Our Procedure
Each group selected two colonies to sequence
Colonies which survived ampicillin treatment were possibly transformed by the vector, which contained an ampicillin resistance gene
Presence of the PDI insert was expected to disrupt ccdB (lethal protein) and LacZ gene expression in the vector plasmid
LacZ expression resulted in some blue colonies, as the colonies were able to cleave the X-Gal substrate into a blue product
How did some blue colonies survive?
Did all blue colonies come from the PCR product?
Did the white colonies contain the PDI inserts?
Were some colonies able to survive without the ampicillin resistance plasmid?
What was the actual sequence of the commercial positive control insert?
Some samples were transformed with inserts collected from PCR instead of gel electrophoresis. Could non-PDI sequences have ligated to the vector and been inserted into bacteria?
Procedure
Samples were prepared with T3 and T7 (forward and reverse) primers in solution for sequencing
Samples were sent to the UH Manoa lab for sequencing
Chromatogram results were viewed with Finch TV to determine quality
Procedure
Sequences were trimmed at the 5' and 3' ends, then Finch TV was used to try to locate the restriction enzyme sites on the vector
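Locating a restriction site in a trimmed read is a plain substring search, which can be sketched in Python as below. The slides do not name the enzymes used, so the EcoRI site (GAATTC) and the toy read in the example are assumptions for illustration only; the function name is hypothetical.

```python
def find_sites(sequence, site):
    """Return 0-based start positions of every occurrence of site in sequence.

    Overlapping occurrences are included; the comparison ignores case.
    """
    if not site:
        raise ValueError("site must not be empty")
    sequence, site = sequence.upper(), site.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


# Assumed example: EcoRI recognition site in a made-up read.
read = "ttGAATTCaaGAATTCgg"
print(find_sites(read, "GAATTC"))
```

Finding the sites on both sides of the cloning region marks where vector sequence ends and insert sequence begins.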
Procedure
Sequences were exported in FASTA format
Procedure was repeated for the other strands
Pair-wise alignment was performed for both strands of each sample with EBI's tools
Consensus sequence from the pair-wise alignment was searched for in BLAST
Gene information was located from BLAST annotation and the TAIR website
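The alignment and consensus steps can be sketched with a small global (Needleman-Wunsch) aligner. This is an illustration only, not the EBI tool's implementation: the scoring values, the consensus rule (disagreements become N, a gap takes the other read's base), and the function names are all assumptions, and the reverse read is assumed to have been reverse-complemented already.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return the two gapped strings."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        step = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def consensus(aligned_a, aligned_b):
    """Merge two aligned reads (assumed rule: gap -> other base, conflict -> N)."""
    merged = []
    for x, y in zip(aligned_a, aligned_b):
        if x == y:
            merged.append(x)
        elif x == "-":
            merged.append(y)
        elif y == "-":
            merged.append(x)
        else:
            merged.append("N")
    return "".join(merged)


# Toy reads (made up): the second read is missing one base.
top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(top)
print(bottom)
print(consensus(top, bottom))
```

Marking disagreements as N keeps uncertain positions visible instead of silently trusting one strand.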
Results
General Remarks
Because colonies were selected prior to the identity of the positive control insert being questioned, no control colonies were sequenced
All sequenced white colonies definitively had the PDI gene insert, save for one interesting exception
Some blue colonies showed multiple nucleotide chromatogram readings, suggesting either sample contamination or separately transformed E. coli growing as one colony
Group 3 Results
Sequenced 1 blue and 1 white colony from the same plate
Colonies were transformed with PCR product, not gel-recovered DNA
White colonies had PDI insert
Blue colonies had a 154 bp partial insert, disrupting the ccdB gene but remaining in-frame and allowing a partially functional LacZ alpha gene to be expressed
Group 1 Results
White colony from PCR product showed PDI gene in both T3 and T7 strands
White colony from gel purification:
Group 2 Results
White colonies sequenced with PDI gene
Both T3 and T7 strand sequencing showed consistent multiple signals
Group 4 Results
1 white colony from PCR and 1 white colony from gel purification were sequenced
Both showed PDI gene
Final Remarks
All white colonies had the PDI gene, except one with a modified vector
All blue colonies were transformed with the direct PCR product (not gel purified)
Group 3 showed that a small (154 bp) insert that stays in-frame with the LacZ gene can knock out the ccdB gene, while still allowing the expression of an at least partially functioning LacZ gene
Some blue colonies with white rings could be 2 separate lines living together
Bacteria transformed with the ampicillin resistance gene could deplete an area of ampicillin, allowing bacteria without the gene to crowd the white bacteria out of the area of depleted ampicillin
How could bacteria without the insert survive both ccdB expression and ampicillin selection in broth?
The ccdB gene could be lost due to mutation
Bacteria could have cut the plasmid, deleting ccdB but retaining the ampicillin resistance gene and possibly LacZ
Resources
http://www.bioinformatics.org
http://syntheticbiology.org/Tools.html
NCBI BLAST: http://www.ncbi.nlm.nih.gov/BLAST/
SMART: http://smart.embl-heidelberg.de/
PFAM: http://www.sanger.ac.uk/Software/Pfam/
Interpro: http://www.ebi.ac.uk/interpro/
Canadian Bioinformatics Helpdesk Newsletter (Ramachandran Plot): http://gchelpdesk.ualberta.ca/news/22sep05/cbhd_news_22sep05.php
Finch TV: http://www.geospiza.com/finchtv/
EBI Pair-wise alignment: http://www.ebi.ac.uk/emboss/align/index.html
TAIR: http://www.arabidopsis.org