Lectures Part 01

Computational Biology, Part 1 Introduction/Representing and Retrieving Sequences
Robert F. Murphy Copyright 1996, 2000-2006. All rights reserved.
Course Introduction
What these courses are about
What I expect
What you can expect
Course numbers
03-311 (first half of 03-310)
03-310 & 42-334: no programming required
03-510 & 42-534: above plus programs
03-710 & 42-734: above plus paper
What these courses are about

overview of ways in which computers are used to solve problems in biology
supervised learning of illustrative or frequently-used algorithms and programs
(03-510/710 & 42-534/42-734) supervised learning of programming techniques and algorithms selected from these uses
I expect
students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
students will have basic familiarity with use of computers (e.g., at the level of Computing Skills Workshop) and eagerness to gain new skills
(03-510/710 & 42-534/734) students have some programming experience and willingness to work to improve
heterogeneous class: I plan to include refreshers on each new topic
students will ask questions in class and via email
You can expect
Two major course sections
  Computational Molecular Biology (Sequence & Structure Analysis)
  Computational Cell Biology (modeling and image analysis)
Class sessions: lectures/demonstrations/quizzes
Pop quizzes on assigned reading and previous lectures
Homework assignments
  60% of grade for 03-311
  60% of grade for 03-310
  50% of grade for 03-510
  50% of grade for 03-710
Midterm March 7 (40% for 03-311, 20% for 03-510, 15% for others)
Final (30% of grade for 03-310, 03-710, 25% for 03-510)
Grades totally determined by points system
Communication on class matters via email list
Textbooks for first half of course
For all students
  Required textbook: Bioinformatics: Sequence and Genome Analysis by David W. Mount
For 03-510/710 students
  Recommended additional textbook: Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids by Durbin et al. (ISBN: 0-521-62971-3)
Additional suggested book for non-Bio majors
  Chap. 1 of Computational Molecular Biology, Peter Clote & Rolf Backofen (ISBN 0-471-87252-0) is an excellent introduction to molecular biology for non-biology majors
Web resources for CMU computational biology classes
Web page
(http://www.cmu.edu/bio/education/courses/03310 or 03311 or 03510 or 03710)
Lecture Notes (as PowerPoint files)
Homework Assignments (as Word files)
Additional materials as needed
Class schedule

Tuesdays and Thursdays, 3:00 to 4:20: all
Mondays, 11:30 to 12:20: 03-310/311 recitation
Fridays, 1:30 to 2:20: 03-510/710 recitation
Information flow
A major task in computational molecular biology is to decipher information contained in biological sequences.
Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers.
Review of basic biochemistry

Central Dogma: DNA makes RNA makes protein
Sequence determines structure determines function
Structure
macromolecular structure divided into
  primary structure (1D sequence)
  secondary structure (local 2D & 3D)
  tertiary structure (global 3D)
DNA composed of four nucleotides or "bases": A,C,G,T
RNA composed of four also: A,C,G,U (T transcribed as U)
proteins are composed of amino acids
DNA properties - base composition
Some properties of long, naturally occurring DNA molecules can be predicted accurately given only the base composition, usually expressed as either
  %GC (the percent of all base pairs that are G:C), or
  GC (the mole fraction of all bases that are either G or C)
%GC = 100*GC
DNA properties - melting temperature

Example of zero order sequence properties
Tm, the melting temperature, is defined as the temperature at which half of the DNA is single-stranded and half is double-stranded
Tm (°C) = 69.3 + 41*GC (for 0.15 M NaCl)
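The two formulas above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names and the 10-base example sequence are made up for illustration, and the relation is stated in these notes for long, naturally occurring DNA, so the short example shows the calculation rather than a realistic prediction.

```python
def gc_fraction(seq: str) -> float:
    """Mole fraction of bases that are G or C (the GC value in the notes)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return (seq.count("G") + seq.count("C")) / len(seq)


def melting_temp_c(seq: str) -> float:
    """Tm in degrees C from Tm = 69.3 + 41*GC (stated for 0.15 M NaCl)."""
    return 69.3 + 41.0 * gc_fraction(seq)


# Made-up example: 5 of the 10 bases are G or C, so GC = 0.5 and %GC = 50.
example = "ATGCGCGTTA"
print(gc_fraction(example))          # 0.5
print(100 * gc_fraction(example))    # 50.0
print(melting_temp_c(example))       # approximately 89.8
```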
DNA structure - restriction maps

Restriction enzymes cut DNA at specific sequences.
A restriction map is a graphical description of the order and lengths of fragments that would be produced by the digestion of a DNA molecule with one or more restriction enzymes.
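The fragment lengths behind a restriction map of a circular molecule can be sketched as follows. The function names and the 18-base "plasmid" are hypothetical; the only outside fact used is that EcoRI recognizes GAATTC and cuts after the G. Only one strand is searched, which is enough for a palindromic site like this one.

```python
def find_sites(seq: str, site: str) -> list[int]:
    """0-based start positions of `site` in a circular sequence."""
    seq, site = seq.upper(), site.upper()
    n = len(seq)
    # Append the first len(site)-1 bases so sites spanning the origin are found.
    wrapped = seq + seq[: len(site) - 1]
    return [i for i in range(n) if wrapped.startswith(site, i)]


def circular_fragments(seq_len: int, cut_positions: list[int]) -> list[int]:
    """Fragment lengths produced by cutting a circle at the given positions."""
    cuts = sorted(set(cut_positions))
    if not cuts:
        return []  # an uncut circle yields no linear fragments
    # Distance from each cut to the next one, wrapping around the origin.
    # A single cut gives a distance of 0, which means the full circle length.
    return [(cuts[(k + 1) % len(cuts)] - cuts[k]) % seq_len or seq_len
            for k in range(len(cuts))]


plasmid = "GAATTCAAAAGAATTCCC"      # made-up circular sequence, 18 bases
site, cut_offset = "GAATTC", 1      # EcoRI: G^AATTC
starts = find_sites(plasmid, site)
cuts = [(s + cut_offset) % len(plasmid) for s in starts]
print(starts)                                   # [0, 10]
print(circular_fragments(len(plasmid), cuts))   # [10, 8]
```

The fragment lengths always sum to the length of the circle, which is a quick sanity check on any digest.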
Restriction map of a circular plasmid with one enzyme

[Figure: circular map of plasmid pGEM4 with each AccII cut site labeled]
Restriction map of all enzymes that cut only once

[Figure: circular map of plasmid pGEM4 labeled with the enzymes that cut only once: SspBI, BsrGI, Bsp1407I, NheI, NaeI, NgoMI, NgoAIV, SgrAI, Eco47III, Aor51HI, DsaI, BsmFI, EcoNI, AcsI, ApoI, EcoRI, Ecl136II, EcoICRI, SacI, SstI, Acc65I, Asp718I, AvaI, AflIII, AatII, SspI, AlwNI, XmnI, Asp700I, ScaI, Eco255I, XorII, PvuI, BspCI, AhdI, AspEI, Eam1105I, EclHKI, BpmI, GsuI, BglI, AviII, FspI]
Transcription

transcription is accomplished by RNA polymerase
RNA polymerase binds to promoters
promoters have distinct regions "-35" and "-10"
efficiency of transcription controlled by binding and progression rates
transcription start and stop affected by tertiary structure
regulatory sequences can be positive or negative
RNA processing
eukaryotic genes are interrupted by introns
these are "spliced" out to yield mRNA
splicing done by spliceosome
splicing sites are quite degenerate but not all are used
Translation
conversion from RNA to protein is by codon: 3 bases = 1 amino acid
translation done by ribosome
translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
translation affected by mRNA tertiary structure
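The "3 bases = 1 amino acid" rule can be sketched in Python with the standard genetic code. The table layout (bases ordered T, C, A, G) and the function name are choices made for this illustration. The test sequence is the first 15 bases of the coding region (starting at base 30) of the rat obese mRNA record that appears later in these notes.

```python
from itertools import product

# Standard genetic code, listed for codons in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str, start: int = 0) -> str:
    """Translate DNA or RNA from `start`, stopping at the first stop codon."""
    seq = seq.upper().replace("U", "T")   # treat RNA as DNA for table lookup
    protein = []
    for i in range(start, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for codons with ambiguity codes
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGTGCTGGAGACCC"))   # MCWRP
print(translate("AUGUAA"))            # M
```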
Protein localization
leader sequences can specify cellular location (e.g., insert across membranes)
leader sequences usually removed by proteolytic cleavage
Posttranslational processing
peptides fold after translation - may be assisted or unassisted
processing enzymes recognize specific sites (amino acid sequences)
protein signals can involve secondary and tertiary structure, not just primary structure
Representing and Retrieving Sequences
Definition
A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
DNA composed of four nucleotides or "bases": A,C,G,T
RNA composed of four also: A,C,G,U (T transcribed as U)
proteins are composed of amino acids (20)
Representation of Sequences
characters
  simplest
  easy to read, edit, etc.
bit-coding
  more compact, both on disk and in memory
  comparisons more efficient
  more to come on this
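As a rough illustration of bit-coding, four unambiguous bases fit in 2 bits each, so a sequence can be packed into an integer and two equal-length sequences compared with a single integer comparison. The particular code assignment (A=0, C=1, G=2, T=3) and the function names are assumptions made for this sketch, not a scheme defined in these notes.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # arbitrary 2-bit assignment
BASE = "ACGT"                             # inverse lookup by code value


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover the DNA string; the length is needed because leading A's are zeros."""
    bases = []
    for _ in range(length):
        bases.append(BASE[value & 3])   # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


packed = pack("ACGT")
print(packed, bin(packed))        # 27 0b11011
print(unpack(packed, 4))          # ACGT
print(pack("GATTACA") == pack("gattaca"))   # True
```

A character representation needs 8 bits per base, so this packing is four times more compact, at the cost of no longer being readable in a text editor.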
Character representation of sequences
DNA or RNA
  use 1-letter codes (e.g., A,C,G,T)
protein
  use 1-letter codes
  can convert to/from 3-letter codes (e.g., A = Ala = Alanine, C = Cys = Cysteine)
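Conversion between the two protein notations is a table lookup. The sketch below covers the 20 standard amino acids; the hyphen-separated output format and the function names are choices made for this example.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
THREE_TO_ONE = {three: one for one, three in ONE_TO_THREE.items()}


def to_three_letter(protein: str) -> str:
    """'MCW' -> 'Met-Cys-Trp'."""
    return "-".join(ONE_TO_THREE[aa] for aa in protein.upper())


def to_one_letter(text: str) -> str:
    """'Met-Cys-Trp' -> 'MCW' (case-insensitive)."""
    return "".join(THREE_TO_ONE[name.capitalize()] for name in text.split("-"))


print(to_three_letter("MCWRP"))                 # Met-Cys-Trp-Arg-Pro
print(to_one_letter("Met-Cys-Trp-Arg-Pro"))     # MCWRP
```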
Representing uncertainty in nucleotide sequences
It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is possible at a given position
  to express ambiguity during sequencing
  to express variation at a position in a gene during evolution
  to express ability of an enzyme to tolerate more than one base at a given position of a recognition site
Representing uncertainty in nucleotide sequences

To do this for nucleotides, we use a set of single character codes that represent all possible combinations of bases.
This set was proposed and adopted by the International Union of Biochemistry and is referred to as the I.U.B. code.
The I.U.B. Code

A, C, G, T, U
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate)
X or - are sometimes used
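One way to work with these codes in a program is to give each of the four bases its own bit, so that every I.U.B. code becomes a 4-bit mask and two codes are compatible when their masks share a bit. The bit assignment, the function names, and the example pattern GTYRAC are assumptions chosen for this sketch; X and - are left out because their meaning varies between programs.

```python
A, C, G, T = 1, 2, 4, 8   # one bit per base (assignment is arbitrary)

IUB_MASK = {
    "A": A, "C": C, "G": G, "T": T, "U": T,
    "R": A | G, "Y": C | T, "S": G | C, "W": A | T, "M": A | C, "K": G | T,
    "B": C | G | T, "D": A | G | T, "H": A | C | T, "V": A | C | G,
    "N": A | C | G | T,
}


def matches(pattern: str, seq: str) -> bool:
    """True if every position of seq is allowed by the code in pattern."""
    if len(pattern) != len(seq):
        return False
    return all(IUB_MASK[p] & IUB_MASK[s]
               for p, s in zip(pattern.upper(), seq.upper()))


def find_pattern(pattern: str, seq: str) -> list[int]:
    """0-based positions in a linear sequence where the pattern matches."""
    k = len(pattern)
    return [i for i in range(len(seq) - k + 1) if matches(pattern, seq[i:i + k])]


print(matches("GTYRAC", "GTTAAC"))    # True  (Y = T, R = A)
print(matches("GTYRAC", "GTAAAC"))    # False (A is not a pyrimidine)
print(find_pattern("GTYRAC", "AAGTCGACTTGTAAAC"))   # [2]
```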
Representing uncertainty in protein sequences

Given the size of the amino acid alphabet, it is not practical to design a set of codes for ambiguity in protein sequences.
Fortunately, ambiguity is less common in protein sequences than in nucleic acid sequences.
Bit-coding could be used, as for nucleic acids, but this is rarely done.
Sequence File Formats
Sequence file formats

Two characteristics of file formats
  text or binary
  minimal or annotated
Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
Binary files are usually readable only by the program that created them (e.g., MacVector)
Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)
Examples of ASCII sequence file formats
Line (MacVector), Plain Text (AssemblyLIGN)
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC

Fasta (Entrez)
>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAAGAAGACCCCAGCGAGGAAAATGTGCTGGAGACCCCTGTGCCGGTTCCTGTGGCTTTGGTC
CTATCTGTCCTATGTTCAAGCTGTGCCTATCCACAAAGTCCAGGATGACACCAAAACCCTCATCAAGACC
ATTGTCACCAGGATCAATGACATTTCACACACGCAGTCGGTATCCGCCAGGCAGAGGGTCACCGGTTTGG
ACTTCATTCCCGGGCTTCACCCCATTCTGAGTTTGTCCAAGATGGACCAGACCCTGGCAGTCTATCAACA
GATCCTCACCAGCTTGCCTTCCCAAAACGTGCTGCAGATAGCTCATGACCTGGAGAACCTGCGAGACCTC
CTCCATCTGCTGGCCTTCTCCAAGAGCTGCTCCCTGCCGCAGACCCGTGGCCTGCAGAAGCCAGAGAGCC
TGGATGGCGTCCTGGAAGCCTCGCTCTACTCCACAGAGGTGGTGGCTCTGAGCAGGCTGCAGGGCTCTCT
GCAGGACATTCTTCAACAGTTGGACCTTAGCCCTGAATGCTGAGGTTTC
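Reading a FASTA file takes only a few lines: a line starting with ">" opens a record and carries its description, and every following line up to the next ">" is sequence. The parser below is a minimal sketch with a made-up function name; the sample text reuses the header and the first 20 bases of the rat obese record, wrapped at 10 bases per line to keep the example short.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = """>gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
CCAAGAAGAA
GAAGACCCCA
"""
for header, seq in parse_fasta(sample):
    print(header)        # gi|995614|dbj|D49653|RATOBESE Rat mRNA for obese.
    print(seq)           # CCAAGAAGAAGAAGACCCCA
```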
GCG (MacVector, GCG)
LOCUS       RATOBESE.G   539 BP   SS-RNA   ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION
KEYWORDS
SOURCE      Rattus norvegicus; Norway rat
  ORGANISM  Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
  AUTHORS   Murakami, T. & Shima, K.
  TITLE     Cloning of rat obese cDNA and its expression in obese rats.
  JOURNAL   Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference: DDBJ RATOBESE Accession: D49653
            -----------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770
            Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495
FEATURES    From  To/Span  Description
  pept      30    533      obese
  ????      1     539      source; /organism=Rattus norvegicus; /strain=OLETF, LETO and Zucker; /dev_stage=differentiated; /sequenced_mol=cDNA to mRNA; /tissue_type=adipose
BASE COUNT  121 A 167 C 133 G 118 T 0 OTHER
ORIGIN      ?
RATOBESE.G Length: 539 Jan 30, 1996 - 05:32 PM Check: 5797 ..
    1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
   61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
  121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
  181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
  241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
  301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
  361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
  421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
  481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//
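The numbered, space-separated sequence lines at the end of an annotated record like this one can be turned back into a plain string by dropping the position numbers, the spaces, and the closing "//". The sketch below uses made-up function names and only the first sequence line of the record as input; it also tallies bases in the same way as the BASE COUNT line, which for the full record sums to the stated length (121 + 167 + 133 + 118 = 539).

```python
from collections import Counter


def parse_numbered_sequence(lines: list[str]) -> str:
    """Join sequence blocks, skipping position numbers; stop at the // terminator."""
    bases = []
    for line in lines:
        for token in line.split():
            if token == "//":
                return "".join(bases)
            if token.isdigit():
                continue                  # leading position number
            bases.append(token.upper())
    return "".join(bases)


def base_count(seq: str) -> Counter:
    """Number of occurrences of each character in the sequence."""
    return Counter(seq)


lines = [
    "1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT",
    "//",
]
seq = parse_numbered_sequence(lines)
print(len(seq))                       # 60
print(sum(base_count(seq).values()))  # 60
```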
Sequence file format tips
When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
When retrieving from a database or exchanging between programs, use an annotated text format such as GCG
When using a sequence again with the same program, use that program's annotated binary format (or annotated text if binary is not available)
Entrez
Entrez

a client-server system for retrieval of information related to molecular biology
can be used
  via web page
  via "embedded" client in other software (e.g., MacVector)
provided by National Center for Biotechnology Information, part of the National Library of Medicine (NIH)
Entrez Databases
http://www.ncbi.nlm.nih.gov/
PubMed: The biomedical literature
  PubMed database contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
PubMed Central: free, full text journal articles
Books: online books
OMIM: Online Mendelian Inheritance in Man
Nucleotide sequence database (GenBank)
Protein sequence database
Genome: complete genome assemblies
Structure: three-dimensional macromolecular structures
Taxonomy: organisms in GenBank
SNP: single nucleotide polymorphism
PopSet: population study data sets
And many more
Entrez essentials
Semi-automated entry of information into databases
Critical to usefulness are the links between databases
Entrez literature searching

can find papers on a given subject
can find papers on a specific gene
can find papers related to a given paper
can switch between literature and sequence databases
PubMed has links to publishers' websites to view full text of articles
PubMed Central has free full text copies
Entrez sequence searching

can find sequences for a given gene or protein
can download copy of sequence
Example Entrez Session
Goal: Find literature and sequences for cystic fibrosis genes

Use OMIM with keyword searching.
Switch to Nucleotide database to see sequence.
Switch to Protein database to see sequence.
Change to GenPept format to save sequence.
Use links to find related literature in PubMed.
Use Related Articles to find similar articles.
Search the Nucleotide database by gene name.
Set Limits to narrow down the search.
Example Entrez Session: home of Entrez
Example Entrez Session: search OMIM for cystic fibrosis
Example Entrez Session: first hit is CFTR
Example Entrez Session: after clicking Links → Nucleotide
Example Entrez Session: after clicking Links → Protein
Example Entrez Session: Protein sequence from original cDNA
Example Entrez Session: click send to save it
Example Entrez Session: Links → PubMed
Example Entrez Session: paper in PubMed that is related
Example Entrez Session: Related Articles
Example Entrez Session: search Nucleotide for cftr
Example Entrez Session: 1012 hits related to cftr
Example Entrez Session: set limits as title and mRNA
Example Entrez Session: 141 hits with limits
Example Entrez Session: further narrow it down to human
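The last steps of this session (search Nucleotide for cftr, then narrow by title and organism) can also be expressed as a query against NCBI's E-utilities, the programmatic interface to Entrez. E-utilities are not covered in these notes, so treat the following as an assumption-laden sketch: it only builds the request URLs and does not contact NCBI, the helper names are made up, "nuccore" is the E-utilities name for the Nucleotide database, and the mRNA limit from the session is left out.

```python
from urllib.parse import urlencode

# Base address of NCBI E-utilities (not part of the original lecture).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """URL that asks Entrez for the IDs of records matching `term`."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax})


def efetch_fasta_url(db: str, ids: list[str]) -> str:
    """URL that asks Entrez for the given records in FASTA text format."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": "fasta", "retmode": "text"})


# Field tags in square brackets play the role of the Limits page.
print(esearch_url("nuccore", "cftr[Title] AND human[Organism]"))
# D49653 is the accession of the rat obese mRNA used in the file format examples.
print(efetch_fasta_url("nuccore", ["D49653"]))
```

Opening either printed URL in a browser, or with a tool such as urllib.request, would perform the actual request; the hit counts returned today will differ from those shown on the slides.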
Block Diagram for Entrez Literature Searching

[Figure: block diagram with boxes labeled Results of Previous Search, Additional Search Criterion, Displayed Item Selection, Desired Output Format, Entrez Search Engine, Results of Search (List), and Item Display]
