What is Bioinformatics?
Bioinformatics is the use of computers to solve biological and biomedical problems.
Bioinformatics is the application of information technology to mine, visualize, analyze, integrate, and manage biological and genetic information, which can then be applied in, among other things, accelerating drug discovery and development.
Application of the tools of computation and analysis to the capture and interpretation of biological data.
Biological data management and analysis.
NIH definition of bioinformatics (http://www.bisti.nih.gov/CompuBioDef.pdf): research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Use of Bioinformatics
DNA analysis
Genome sequencing
Sequence assembly
Sequence/gene annotations
Gene finding/sequence translation tools
Sequence similarity searching and alignment (e.g. BLAST, ClustalW)
Comparison between genomes
Evolution of sequences (phylogenetic analysis)
Gene expression
Sequence
Sequence similarity
Protein family assignments
Conserved motifs
Proteomics data analysis
Protein evolution
Mathematics Statistics
Chemistry
Vast Growth in (Structural) Data... but the Number of Fundamentally New (Fold) Parts Is Not Increasing that Fast
Bioinformatics Analysis?
It is like any other lab analysis!
You need to know your data/input sources
You need to understand your methods and their assumptions
You need a plan to get from point A to point B
You need to understand your equipment
You need to be critical and understand potential sources of error
You need to interpret your results
Your results need to be reproducible
Your results should be testable
Baxevanis & Ouellette 2001. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2nd Edition. John Wiley Publishing.
Gibas & Jambeck 2001. Developing Bioinformatics Computer Skills. O'Reilly.
Mount 2001. Bioinformatics: Genome Sequence Analysis.
Claverie & Notredame 2003. Bioinformatics For Dummies.
Lesk 2002. Introduction to Bioinformatics.
GenBank/GenPept
PHYLIP PIR
U03518;
Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
MEDLINE; 94303342.
PUBMED; 8030378.
rRNA <1..20
/product="18S ribosomal RNA"
misc_RNA 21..205
/standard_name="Internal transcribed spacer 1 (ITS1)"
rRNA 206..>237
/product="5.8S ribosomal RNA"
Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
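The sequence summary line of the entry above (237 BP; 41 A; 77 C; 67 G; 52 T) can be checked against the sequence rows themselves. The following is a minimal sketch in Python; the names `EMBL_ROWS`, `clean_sequence` and `base_composition` are made up for this illustration and do not belong to any tool mentioned in these slides.

```python
from collections import Counter

# Sequence rows as printed in the entry above, including the block
# spacing and the trailing position numbers.
EMBL_ROWS = """\
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
"""


def clean_sequence(rows: str) -> str:
    """Keep only letters, dropping spaces, digits and newlines."""
    return "".join(ch for ch in rows if ch.isalpha()).upper()


def base_composition(seq: str) -> Counter:
    """Count how often each base letter occurs."""
    return Counter(seq)


sequence = clean_sequence(EMBL_ROWS)
composition = base_composition(sequence)
print(len(sequence), [(base, composition[base]) for base in "ACGT"])
```

The length and the four base counts computed this way agree with the summary line of the entry, which is a quick sanity check after any copy or conversion step.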
FASTA
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length.
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
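The FASTA rules described above (a ">" description line, then sequence lines) are simple enough to read with a few lines of code. This is a minimal sketch, not the parser of any particular package; `parse_fasta` is a name chosen for this example.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """\
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
"""

records = list(parse_fasta(example))
for description, seq in records:
    print(description.split()[0], len(seq))
```

The first word of the description line is commonly used as the sequence identifier, which is why the example splits on whitespace.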
GCG
Exactly one sequence
Begins with annotation lines
Start of the sequence is marked by a line ending with ".."
This line also contains the sequence identifier, the sequence length and a checksum
ID   AA03518 standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
DE   Aspergillus awamori internal transcribed spacer 1 (ITS1) and
DE   18S rRNA and 5.8S rRNA genes, partial sequence.
XX
..
     1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
    61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
   121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
   181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
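Because the start of the sequence is marked by a line ending with "..", a GCG file can be split into annotation and sequence at that line. The sketch below assumes only what is stated above; `read_gcg` is a hypothetical helper name, and the example input is a shortened copy of the entry shown here.

```python
from typing import List, Tuple


def read_gcg(text: str) -> Tuple[List[str], str]:
    """Split GCG text into (annotation lines, sequence string)."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            annotation = lines[:i]
            # After the divider, keep letters only: this drops the
            # position numbers and the spaces between blocks.
            seq = "".join(ch for row in lines[i + 1:] for ch in row if ch.isalpha())
            return annotation, seq
    raise ValueError("no line ending with '..' found")


example = """\
ID   AA03518 standard; DNA; FUN; 237 BP.
XX
AC   U03518;
XX
..
     1 aacctgcgga aggatcatta ccgagtgcgg
"""

annotation, seq = read_gcg(example)
print(len(annotation), seq)
```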
GenBank/GenPept
The nucleotide (GenBank) and protein (GenPept) database entries are available from Entrez in this format
Can contain several sequences
Each entry starts with: "LOCUS"
The sequence starts with: "ORIGIN"
The sequence ends with: "//"
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and
            18S rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
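The three markers named above (LOCUS, ORIGIN and //) are enough to pull the sequences out of a file that holds several entries. The following is a minimal sketch under that assumption; `genbank_sequences` is an illustrative name, and the two short entries (`DEMO1`, `DEMO2`) are invented test data, not real database records.

```python
from typing import Iterator, Tuple


def genbank_sequences(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (locus name, sequence) for each entry in GenBank-style text."""
    name = None
    in_origin = False
    chunks = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            name = parts[1] if len(parts) > 1 else ""
            in_origin, chunks = False, []
        elif line.startswith("ORIGIN"):
            in_origin = True  # sequence rows follow
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(chunks)
            name, in_origin, chunks = None, False, []
        elif in_origin:
            # Drop the leading position number and the block spacing.
            chunks.append("".join(ch for ch in line if ch.isalpha()))


example = """\
LOCUS       DEMO1   20 bp    DNA
DEFINITION  Invented entry for illustration only.
ORIGIN
        1 aacctgcgga aggatcatta
//
LOCUS       DEMO2   10 bp    DNA
ORIGIN
        1 ccgagtgcgg
//
"""

entries = list(genbank_sequences(example))
print(entries)
```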
Phylip format
2 2000
G019uabh  ATACATCATA ACACTACTTC CTACCCATAA GCTCCTTTTA ACTTGTTAAA
G028uaah  CATAAGCTCC TTTTAACTTG TTAAAGTCTT GCTTGAATTA AAGACTTGTT

          GTCTTGCTTG AATTAAAGAC TTGTTTAAAC ACAAAAATTT AGAGTTTTAC
          TAAACACAAA ATTTAGACTT TTACTCAACA AAAGTGATTG ATTGATTGAT

          TCAACAAAAG TGATTGATTG ATTGATTGAT TGATTGATGG TTTACAGTAG
          TGATTGATTG ATGGTTTACA GTAGGACTTC ATTCTAGTCA TTATAGCTGC
The first line of the input file contains the number of sequences and their length (all sequences must have the same length), separated by blanks. The next line contains a sequence name, followed by the sequence itself in blocks of 10 characters. The remaining sequences follow in the same way. PHYLIP also accepts an interleaved layout, in which the sequences are written in alternating blocks.
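The header line and the 10-character blocks described above can be produced with a short function. This is a sketch of a sequential PHYLIP writer, assuming the classic convention that the name occupies the first ten columns of the line on which the sequence starts; `to_phylip_sequential` and its parameters are names chosen for this example.

```python
from typing import List, Tuple


def to_phylip_sequential(records: List[Tuple[str, str]],
                         block: int = 10,
                         blocks_per_line: int = 5) -> str:
    """Format (name, sequence) pairs as sequential PHYLIP text."""
    if not records:
        raise ValueError("no sequences given")
    length = len(records[0][1])
    if length == 0 or any(len(seq) != length for _, seq in records):
        raise ValueError("all sequences must be non-empty and of equal length")

    lines = [f"{len(records)} {length}"]  # number of sequences, then length
    for name, seq in records:
        groups = [seq[i:i + block] for i in range(0, length, block)]
        rows = [" ".join(groups[i:i + blocks_per_line])
                for i in range(0, len(groups), blocks_per_line)]
        # Name is cut or padded to exactly ten characters (assumption above).
        lines.append(name[:10].ljust(10) + rows[0])
        lines.extend(rows[1:])
    return "\n".join(lines) + "\n"


demo = [("G019uabh", "ATACATCATAACACTACTTC"),
        ("G028uaah", "CATAAGCTCCTTTTAACTTG")]
phylip_text = to_phylip_sequential(demo)
print(phylip_text)
```

Checking that all sequences have the same length before writing matters here, because the length appears only once, in the header line.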
Other formats
MEGA
#mega
Title: infile.fasta
#G019uabh
ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGCA
AAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
#G028uaah
CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
ATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACA
GTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATACA
TTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCT
ATAGCCTCCTTCCCCATCCCATCAGTCT
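Judging only from the example above, a MEGA file starts with a "#mega" line and a title, and each sequence is introduced by a line of the form "#name". The sketch below reads that layout and nothing more; it is an assumption that this covers the cases of interest, since the full MEGA format has further options that are not shown in these slides. `parse_mega` is a hypothetical name.

```python
from typing import Dict


def parse_mega(text: str) -> Dict[str, str]:
    """Read '#name' + sequence lines as laid out in the example above."""
    parts: Dict[str, list] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.lower() == "#mega":
            continue  # blank line or format marker
        if line.startswith("#"):
            current = line[1:].strip()  # a new sequence name
            parts[current] = []
        elif current is not None:
            parts[current].append(line)
        # Lines before the first '#name' (such as the title) are skipped.
    return {name: "".join(rows) for name, rows in parts.items()}


example = """\
#mega
Title: infile.fasta
#G019uabh
ATACATCATAACACTACTTC
CTACCCATAA
#G028uaah
CATAAGCTCCTTTTAACTTG
"""

mega_records = parse_mega(example)
print(mega_records)
```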
ReadSeq
Don Gilbert software@bio.indiana.edu, May 2001 Indiana University, Bloomington, Indiana
WWW
http://www.ebi.ac.uk/cgi-bin/readseq.cgi
http://bioportal.bic.nus.edu.sg/readseq/readseq.html
http://www-bimas.cit.nih.gov/molbio/readseq/
Seqret
A program in the EMBOSS suite
The Readseq package can read most common formats: examples of all these formats are included in the readseq directory. The formats include:
IG/Stanford, used by Intelligenetics and others
GenBank/GB, GenBank flatfile format
NBRF format (SAM modifications cause this to break when sequences do not have a terminating asterisk)
EMBL, EMBL flatfile format
GCG, single sequence format of GCG software
DNAStrider, for common Mac program
Fitch format, limited use
Pearson/Fasta, a common format used by Fasta programs and others
Zuker format, limited use. Input only.
Olsen, format printed by Olsen VMS sequence editor. Input only.
Phylip3.2, sequential format for Phylip programs
Plain/Raw, sequence data only (no name, document, numbering)
MSF, multi sequence format used by GCG software
PAUP's multiple sequence (NEXUS) format
PIR/CODATA format used by PIR
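Several of the formats covered in these slides can be told apart by their first line alone. The sketch below is a rough heuristic built only from the markers shown earlier (">", "LOCUS", "#mega", "ID" with a ".." divider, and a two-number header line); it is not how Readseq itself detects formats, and `guess_format` is a name made up for this example.

```python
def guess_format(text: str) -> str:
    """Guess a sequence format from its first non-blank line (heuristic)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith(">"):
        return "fasta"
    if first.startswith("LOCUS"):
        return "genbank"
    if first.lower().startswith("#mega"):
        return "mega"
    if first.startswith("ID "):
        # EMBL-style annotation; a line ending in ".." suggests GCG.
        if any(ln.rstrip().endswith("..") for ln in lines):
            return "gcg"
        return "embl"
    fields = first.split()
    if len(fields) == 2 and all(f.isdigit() for f in fields):
        return "phylip"  # number of sequences and their length
    return "unknown"


for sample in (">U03518 demo\nACGT\n", "LOCUS AAU03518\n", "2 2000\nG019uabh ATAC\n"):
    print(guess_format(sample))
```

A guess like this is only a first filter; a conversion tool still has to parse the whole file to confirm that it is valid.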
Databases in Biology
http://www3.oup.co.uk/nar/database/c/
Exercise
Retrieve sequences from sequence databases
Convert sequence formats
Study different formats and flow of information