
Computational Biology, Part 1 Introduction/Representing and Retrieving Sequences

Robert F. Murphy Copyright 1996, 2000-2006. All rights reserved.

Course Introduction
What these courses are about
What I expect
What you can expect

Course numbers
03-311 (first half of 03-310)
03-310 & 42-334: no programming reqd
03-510 & 42-534: above plus programs
03-710 & 42-734: above plus paper

What these courses are about


overview of ways in which computers are used to solve problems in biology
supervised learning of illustrative or frequently-used algorithms and programs
(03-510/710 & 42-534/42-734) supervised learning of programming techniques and algorithms selected from these uses

I expect

students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
students will have basic familiarity with use of computers (e.g., at the level of Computing Skills Workshop) and eagerness to gain new skills
(03-510/710 & 42-534/734) students have some programming experience and willingness to work to improve
heterogeneous class - I plan to include refreshers on each new topic
students will ask questions in class and via email

You can expect

Two major course sections


Computational Molecular Biology (Sequence & Structure Analysis)
Computational Cell Biology (modeling and image analysis)

Class sessions: lectures/demonstrations/quizzes
Pop quizzes on assigned reading and previous lectures
Homework assignments
  60% of grade for 03-311
  60% of grade for 03-310
  50% of grade for 03-510
  50% of grade for 03-710
Midterm March 7 (40% for 03-311, 20% for 03-510, 15% for others)
Final (30% of grade for 03-310, 03-710, 25% for 03-510)
Grades totally determined by points system
Communication on class matters via email list

Textbooks for first half of course

For all students


Required textbook: Bioinformatics: Sequence and Genome Analysis by David W. Mount

For 03-510/710 students


Recommended additional textbook: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al. (ISBN: 0-521-62971-3)

Additional suggested book for non-Bio majors

Chap. 1 of Computational Molecular Biology, Peter Clote & Rolf Backofen (ISBN 0-471-87252-0) is an excellent introduction to molecular biology for nonbiology majors

Web resources for CMU computational biology classes

Web page (http://www.cmu.edu/bio/education/courses/03310 or 03311 or 03510 or 03710)
Lecture Notes (as PowerPoint files)
Homework Assignments (as Word files)
Additional materials as needed

Class schedule

Tuesdays and Thursdays, 3:00 to 4:20: all
Mondays, 11:30 to 12:20: 03-310/311 recitation
Fridays, 1:30 to 2:20: 03-510/710 recitation

Information flow
A major task in computational molecular biology is to decipher information contained in biological sequences.
Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers.

Review of basic biochemistry


Central Dogma: DNA makes RNA makes protein
Sequence determines structure determines function

Structure

macromolecular structure divided into
  primary structure (1D sequence)
  secondary structure (local 2D & 3D)
  tertiary structure (global 3D)
DNA composed of four nucleotides or "bases": A,C,G,T
RNA composed of four also: A,C,G,U (T transcribed as U)
proteins are composed of amino acids

DNA properties - base composition

Some properties of long, naturally-occurring DNA molecules can be predicted accurately given only the base composition, usually expressed as either
  %GC (the percent of all base pairs that are G:C), or
  GC (the mole fraction of all bases that are either G or C)
%GC = 100*GC

DNA properties - melting temperature


Example of zero order sequence properties
Tm, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded
Tm (°C) = 69.3 + 41*GC (for 0.15 M NaCl)
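
The two base-composition quantities and the melting-temperature formula above can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, ambiguity codes are counted in the total but not as G or C, and the formula is the slide's empirical one for long DNA in 0.15 M NaCl, not a general-purpose Tm predictor.

```python
def gc_fraction(seq: str) -> float:
    """Mole fraction of bases that are G or C (between 0.0 and 1.0)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Booleans sum as 0/1, so this counts the G and C characters.
    return sum(base in "GC" for base in seq) / len(seq)


def melting_temperature(gc: float) -> float:
    """Tm in degrees C from Tm = 69.3 + 41*GC (assumes 0.15 M NaCl)."""
    return 69.3 + 41.0 * gc


demo = "ATGCGCTA"  # made-up sequence, for illustration only
gc = gc_fraction(demo)
print(f"%GC = {100 * gc:.1f}, Tm = {melting_temperature(gc):.1f} C")
```

Because the formula depends only on composition, any two molecules with the same GC fraction get the same predicted Tm, which is what makes it a "zero order" property.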

DNA structure - restriction maps


Restriction enzymes cut DNA at specific sequences. A restriction map is a graphical description of the order and lengths of fragments that would be produced by the digestion of a DNA molecule with one or more restriction enzymes
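
As a rough sketch of the computation behind such a map, the Python below finds every occurrence of a recognition sequence in a circular molecule and derives the fragment lengths. The helper names are hypothetical, the demo sequence is invented, and the cut is placed at the start of the site for simplicity (real enzymes cut at a fixed offset within or near the site, which shifts every cut equally and leaves circular fragment lengths unchanged). GAATTC is the EcoRI recognition sequence.

```python
def find_sites(seq: str, site: str, circular: bool = True) -> list[int]:
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    seq, site = seq.upper(), site.upper()
    # For a circular molecule, append the first len(site)-1 bases so that
    # sites spanning the origin are found as well.
    text = seq + seq[: len(site) - 1] if circular else seq
    positions = []
    start = text.find(site)
    while start != -1:
        positions.append(start)
        start = text.find(site, start + 1)  # step by 1 to allow overlaps
    return positions


def fragment_lengths(seq_len: int, cuts: list[int], circular: bool = True) -> list[int]:
    """Lengths of the fragments produced by cutting at the given positions."""
    cuts = sorted(set(cuts))
    if not cuts:
        return [seq_len]  # uncut molecule
    if circular:
        # n cuts in a circle give n fragments; the last one wraps the origin.
        inner = [b - a for a, b in zip(cuts, cuts[1:])]
        return inner + [seq_len - cuts[-1] + cuts[0]]
    bounds = [0] + cuts + [seq_len]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


plasmid = "GAATTCAAAAGAATTCAA"  # invented 18-base "plasmid"
sites = find_sites(plasmid, "GAATTC")
print(sites, fragment_lengths(len(plasmid), sites))
```

A map for several enzymes would repeat the search per enzyme and merge the cut positions before computing fragments.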

Restriction map of a circular plasmid with one enzyme


[Figure: circular map of plasmid pGEM4 with each AccII site marked around the circle]

Restriction map of all enzymes that cut only once


[Figure: circular map of plasmid pGEM4 labeled with the enzymes that cut only once: SspBI, BsrGI, Bsp1407I, NheI, NaeI, NgoMI, NgoAIV, SgrAI, Eco47III, Aor51HI, DsaI, BsmFI, EcoNI, AcsI, ApoI, EcoRI, Ecl136II, EcoICRI, SacI, SstI, Acc65I, Asp718I, AvaI, AflIII, AatII, SspI, AlwNI, XmnI, Asp700I, ScaI, Eco255I, XorII, PvuI, BspCI, AhdI, AspEI, Eam1105I, EclHKI, BpmI, GsuI, BglI, AviII, FspI]

Transcription

transcription is accomplished by RNA polymerase
RNA polymerase binds to promoters
promoters have distinct regions "-35" and "-10"
efficiency of transcription controlled by binding and progression rates
transcription start and stop affected by tertiary structure
regulatory sequences can be positive or negative

RNA processing
eukaryotic genes are interrupted by introns
these are "spliced" out to yield mRNA
splicing done by spliceosome
splicing sites are quite degenerate but not all are used

Translation
conversion from RNA to protein is by codon: 3 bases = 1 amino acid
translation done by ribosome
translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
translation affected by mRNA tertiary structure
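
The codon rule (3 bases = 1 amino acid) can be illustrated with a minimal Python sketch. It assumes the standard genetic code, written as a 64-letter string in T, C, A, G order for each codon position; the function name and the `to_stop` option are choices made for this example. The demo codons are the first six of the coding region of the rat obese mRNA shown later in these slides (feature "pept" starts at base 30).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code; amino acids listed with the first codon position
# varying slowest and the third fastest, each in T, C, A, G order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(seq: str, to_stop: bool = True) -> str:
    """Translate a DNA or RNA sequence read in frame from its first base."""
    seq = seq.upper().replace("U", "T")  # treat RNA as DNA
    protein = []
    for i in range(0, len(seq) - 2, 3):
        # Codons containing ambiguity codes are reported as X.
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break  # stop codon ends the protein
        protein.append(aa)
    return "".join(protein)


print(translate("ATGTGCTGGAGACCCCTG"))  # MCWRPL
```

Trailing bases that do not fill a complete codon are ignored.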

Protein localization
leader sequences can specify cellular location (e.g., insert across membranes)
leader sequences usually removed by proteolytic cleavage

Posttranslational processing
peptides fold after translation - may be assisted or unassisted
processing enzymes recognize specific sites (amino acid sequences)
protein signals can involve secondary and tertiary structure, not just primary structure

Representing and Retrieving Sequences

Definition

A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
  DNA composed of four nucleotides or "bases": A,C,G,T
  RNA composed of four also: A,C,G,U (T transcribed as U)
  proteins are composed of amino acids (20)

Representation of Sequences

characters
  simplest
  easy to read, edit, etc.
bit-coding
  more compact, both on disk and in memory
  comparisons more efficient
  more to come on this
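
To make the bit-coding idea concrete, here is one possible scheme, sketched in Python: two bits per base, packed into a single integer. The particular assignment (A=00, C=01, G=10, T=11) and the function names are assumptions for illustration; the slides do not specify an encoding at this point, and this version handles only unambiguous DNA.

```python
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}  # assumed assignment
BASE = "ACGT"  # index = 2-bit code


def pack(seq: str) -> tuple[int, int]:
    """Pack DNA into an integer, 2 bits per base; returns (bits, length)."""
    bits = 0
    for base in seq.upper():
        bits = (bits << 2) | CODE[base]  # KeyError for anything but A, C, G, T
    return bits, len(seq)


def unpack(bits: int, length: int) -> str:
    """Recover the sequence; length is needed because leading A's are zero bits."""
    out = []
    for shift in range(2 * (length - 1), -2, -2):
        out.append(BASE[(bits >> shift) & 0b11])
    return "".join(out)


packed = pack("GATTACA")
print(packed, unpack(*packed))
```

Four bases fit in one byte instead of four, and two packed sequences of equal length can be compared with a single integer comparison rather than character by character.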

Character representation of sequences

DNA or RNA
  use 1-letter codes (e.g., A,C,G,T)
protein
  use 1-letter codes
  can convert to/from 3-letter codes (e.g., A = Ala = Alanine, C = Cys = Cysteine)
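
Conversion between 1-letter and 3-letter amino acid codes is a table lookup. The sketch below covers the 20 standard amino acids only; the function names and the hyphen separator are choices made for this example.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Invert the table for the opposite direction.
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein: str, sep: str = "-") -> str:
    """'AC' -> 'Ala-Cys'; raises KeyError on non-standard letters."""
    return sep.join(ONE_TO_THREE[aa] for aa in protein.upper())


def to_one_letter(protein3: str, sep: str = "-") -> str:
    """'Ala-Cys' -> 'AC'; 3-letter codes are matched case-insensitively."""
    return "".join(THREE_TO_ONE[code.capitalize()] for code in protein3.split(sep))


print(to_three_letter("MCWRPL"))
```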

Representing uncertainty in nucleotide sequences

It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is possible at a given position
  to express ambiguity during sequencing
  to express variation at a position in a gene during evolution
  to express ability of an enzyme to tolerate more than one base at a given position of a recognition site

Representing uncertainty in nucleotide sequences


To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases.
This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code.

The I.U.B. Code


A, C, G, T, U
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate)
X or - are sometimes used
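
One way to put the code table to work is to expand a pattern written in I.U.B. codes into a regular expression and scan a sequence with it. The Python sketch below does that; the function names are invented for this example, U is treated as T, and the pattern GTYRAC is used purely as an illustration of a recognition site that tolerates more than one base at two positions.

```python
import re

# Each I.U.B. code mapped to the set of bases it stands for (U treated as T).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iub_to_regex(pattern: str) -> re.Pattern:
    """Turn e.g. 'RN' into the compiled regex '[AG][ACGT]'."""
    parts = []
    for code in pattern.upper():
        bases = IUB[code]  # KeyError for characters outside the code
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


def find_matches(pattern: str, seq: str) -> list[int]:
    """0-based start positions where `pattern` matches `seq` (overlaps included)."""
    regex = iub_to_regex(pattern)
    seq = seq.upper().replace("U", "T")
    width = len(pattern)
    return [
        i for i in range(len(seq) - width + 1)
        if regex.fullmatch(seq, i, i + width)
    ]


print(find_matches("GTYRAC", "AAGTCGACTTGTTAACGG"))  # [2, 10]
```

The sequence being searched is assumed to contain only unambiguous bases; matching ambiguous text against an ambiguous pattern would need set intersection instead of a regex.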

Representing uncertainty in protein sequences


Given the size of the amino acid alphabet, it is not practical to design a set of codes for ambiguity in protein sequences
Fortunately, ambiguity is less common in protein sequences than in nucleic acid sequences
Could use bit-coding as for nucleic acids but rarely done

Sequence File Formats

Sequence file formats


Two characteristics of file formats
  text or binary
  minimal or annotated
Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
Binary files are usually readable only by the program that created them (e.g., MacVector)
Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)

Examples of ASCII sequence file formats

Line (MacVector), Plain Text (AssemblyLIGN)

CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Examples of ASCII sequence file formats


Fasta (Entrez)

>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
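
Reading a FASTA file amounts to splitting on the ">" header lines and joining the sequence lines in between. The following is a minimal Python sketch under that assumption; the function name is made up, and it does no validation of the sequence characters.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []  # drop the ">" marker
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAG
ACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
"""
for header, seq in parse_fasta(sample):
    print(header.split("|")[3], len(seq))
```

In the header, the fields separated by "|" carry the database identifiers (here the accession D49653), and the free text after the first space is the description.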

Examples of ASCII sequence file formats

GCG (MacVector, GCG)

LOCUS       RATOBESE.G 539 BP SS-RNA ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION
KEYWORDS
SOURCE      Rattus norvegicus; Norway rat
ORGANISM    Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
AUTHORS     Murakami, T. & Shima, K.
TITLE       Cloning of rat obese cDNA and its expression in obese rats.
JOURNAL     Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference: DDBJ RATOBESE Accession: D49653
            -----------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495

[continued]

Examples of ASCII sequence file formats

GCG [continued]

FEATURES    From  To/Span  Description
pept        30    533      obese
????        1     539      source; /organism=Rattus norvegicus; /strain=OLETF, LETO and Zucker; /dev_stage=differentiated; /sequenced_mol=cDNA to mRNA; /tissue_type=adipose
BASE COUNT  121 A 167 C 133 G 118 T 0 OTHER
ORIGIN      ?
RATOBESE.G Length: 539 Jan 30, 1996 - 05:32 PM Check: 5797 ..
  1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
 61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//
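
In an annotated text format like this one, the sequence itself sits in numbered lines of ten-base blocks that end at the "//" marker. A small Python sketch (function names invented here, no claim to handle every GCG or GenBank variant) shows how those lines can be turned back into a plain sequence and how a BASE COUNT style tally can be recomputed from it.

```python
from collections import Counter


def parse_origin_lines(lines: list[str]) -> str:
    """Join numbered sequence lines ('61 GGCTTTGGTC CTATCTGTCC ...') into one string."""
    seq = []
    for line in lines:
        line = line.strip()
        if line == "//":  # end-of-record marker
            break
        fields = line.split()
        if fields and fields[0].isdigit():
            seq.extend(fields[1:])  # drop the leading position number
    return "".join(seq).upper()


def base_count(seq: str) -> dict[str, int]:
    """Tally A, C, G, T and everything else, as on the BASE COUNT line."""
    counts = Counter(seq)
    tally = {base: counts.get(base, 0) for base in "ACGT"}
    tally["OTHER"] = len(seq) - sum(tally.values())
    return tally


block = [
    "  1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA",
    " 61 GGCTTTGGTC",
    "//",
]
sequence = parse_origin_lines(block)
print(len(sequence), base_count(sequence))
```

Running the same two functions over the full record is a quick consistency check against the BASE COUNT and Length fields stored in the annotation.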

Sequence file format tips

When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
When retrieving from a database or exchanging between programs, use an annotated text format such as GCG
When using a sequence again with the same program, use that program's annotated binary format (or annotated text if binary is not available)

Entrez

Entrez

a client-server system for retrieval of information related to molecular biology
can be used
  via web page
  via "embedded" client in other software (e.g., MacVector)
provided by National Center for Biotechnology Information, part of the National Library of Medicine (NIH)

Entrez Databases
http://www.ncbi.nlm.nih.gov/

PubMed: The biomedical literature
  PubMed database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
PubMed Central: free, full text journal articles
Books: online books
OMIM: Online Mendelian Inheritance in Man
Nucleotide sequence database (GenBank)
Protein sequence database
Genome: complete genome assemblies
Structure: three-dimensional macromolecular structures

Entrez Databases

Taxonomy: organisms in GenBank
SNP: single nucleotide polymorphism
PopSet: population study data sets
And many more

Entrez essentials
Semi-automated entry of information into databases
The links between databases are critical to its usefulness

Entrez literature searching


can find papers on a given subject
can find papers on a specific gene
can find papers related to a given paper
can switch between literature and sequence databases
PubMed has links to publishers' websites to view full text of articles
PubMed Central has free full text copies

Entrez sequence searching


can find sequences for a given gene or protein
can download copy of sequence

Example Entrez Session

Goal: Find literature and sequences for cystic fibrosis genes


Use OMIM with Keyword searching.
Switch to Nucleotide database to see sequence.
Switch to Protein database to see sequence.
Change to GenPept format to save sequence.
Use links to find related literature in PubMed.
Use Related Articles to find similar articles.
Search the Nucleotide database by gene name.
Set Limits to narrow down the search.

Example Entrez Session: home of Entrez

Example Entrez Session: search OMIM for cystic fibrosis

Example Entrez Session: first hit is CFTR

Example Entrez Session: after clicking Links → Nucleotide

Example Entrez Session: after clicking Links → Protein

Example Entrez Session: Protein sequence from original cDNA

Example Entrez Session: click send to save it

Example Entrez Session: Links → PubMed

Example Entrez Session: paper in PubMed that is related

Example Entrez Session: Related Articles

Example Entrez Session: search Nucleotide for cftr

Example Entrez Session: 1012 hits related to cftr

Example Entrez Session: set limits as title and mRNA

Example Entrez Session: 141 hits with limits

Example Entrez Session: further narrow it down to human
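
The session above is carried out in the web browser. As a hedged aside, a similar search can also be expressed programmatically through NCBI's E-utilities, which is a different access route from the one these slides demonstrate. The Python sketch below only builds the request URLs and makes no network call; the helper names are invented, the search term (with its `[Title]` and `[Organism]` field tags) is one plausible way to mirror the "title" and "human" limits, and the id is the gi number from the FASTA header shown earlier, which may or may not still resolve.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """URL for an ESearch query: returns matching record ids for `term` in `db`."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )


def efetch_fasta_url(db: str, ids: list[str]) -> str:
    """URL for an EFetch request returning the given records as FASTA text."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"}
    )


# Assumed search term mirroring "cftr in title, limited to human".
print(esearch_url("nuccore", "cftr[Title] AND Homo sapiens[Organism]"))
print(efetch_fasta_url("nuccore", ["995614"]))
```

Fetching either URL (for example with `urllib.request.urlopen`) is left out on purpose; NCBI publishes usage guidelines for E-utilities that should be read before sending automated requests.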

Block Diagram for Entrez Literature Searching


[Block diagram centered on the Entrez Search Engine; box labels: Results of Previous Search, Additional Search Criterion, Displayed Item Selection, Desired Output Format, Results of Search (List), Item Display]
