Course Introduction
What these courses are about
What I expect
What you can expect
Course numbers
03-311 (first half of 03-310)
03-310 & 42-334: no programming required
03-510 & 42-534: above plus programs
03-710 & 42-734: above plus paper
I expect
students will have basic knowledge of biology and chemistry (at the level of Modern Biology/Chemistry) and willingness to learn more
students will have basic familiarity with use of computers (e.g., at the level of Computing Skills Workshop) and eagerness to gain new skills
(03-510/710 & 42-534/734) students have some programming experience and willingness to work to improve
heterogeneous class - I plan to include refreshers on each new topic
students will ask questions in class and via email
Computational Molecular Biology (sequence and structure analysis)
Computational Cell Biology (modeling and image analysis)
Class sessions: lectures/demonstrations/quizzes
Pop quizzes on assigned reading and previous lectures
Homework assignments
  60% of grade for 03-311
  60% of grade for 03-310
  50% of grade for 03-510
  50% of grade for 03-710
Midterm March 7 (40% for 03-311, 20% for 03-510, 15% for others)
Final (30% of grade for 03-310, 03-710, 25% for 03-510)
Grades totally determined by points system
Communication on class matters via email list
additional textbook: Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids by Durbin et al. (ISBN 0-521-62971-3)
Chapter 1 of Computational Molecular Biology, Peter Clote & Rolf Backofen (ISBN 0-471-87252-0), is an excellent introduction to molecular biology for non-biology majors
Web page
(http://www.cmu.edu/bio/education/courses/03310 or 03311 or 03510 or 03710)
Lecture Notes (as PowerPoint files)
Homework Assignments (as Word files)
Additional materials as needed
Class schedule
to 4:20 all
Mondays
11:30
Fridays
1:30
Information flow
A major task in computational molecular biology is to decipher information contained in biological sequences.
Since the nucleotide sequence of a genome contains all information necessary to produce a functional organism, we should in theory be able to duplicate this decoding using computers.
Structure
primary structure (1D sequence)
secondary structure (local 2D & 3D)
tertiary structure (global 3D)
DNA composed of four nucleotides or "bases": A,C,G,T
RNA composed of four also: A,C,G,U (T transcribed as U)
proteins are composed of amino acids
Some properties of long, naturally occurring DNA molecules can be predicted accurately given only the base composition, usually expressed as either
%GC (the percent of all base pairs that are G:C), or
GC (the mole fraction of all bases that are either G or C)
%GC = 100*GC
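The two measures above can be computed directly from a sequence string. The following is a minimal Python sketch; the function names gc_fraction and percent_gc are illustrative, and skipping ambiguity characters is an assumption made here, not something the slides specify.

```python
def gc_fraction(seq: str) -> float:
    """Return GC, the mole fraction of bases that are G or C.

    Assumption: only unambiguous bases (A, C, G, T, U) are counted;
    any other character (N, gaps, whitespace) is ignored.
    """
    counted = [base for base in seq.upper() if base in "ACGTU"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc_count = sum(1 for base in counted if base in "GC")
    return gc_count / len(counted)


def percent_gc(seq: str) -> float:
    """Return %GC, which is simply 100 * GC."""
    return 100.0 * gc_fraction(seq)
```

For example, the rat obese record shown later in these notes reports 167 C and 133 G out of 539 bases, which corresponds to a %GC of about 55.7.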
Example of a zero-order sequence property: Tm, the melting temperature, defined as the temperature at which half of the DNA is single-stranded and half is double-stranded
[Figure: plot of Tm (°C); surviving labels: "Tm (°C)" and "NaCl)"]
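The slides do not give a formula for Tm. As a rough illustration of predicting it from base composition, the sketch below uses one widely quoted empirical approximation for long duplexes, Tm = 81.5 + 16.6 log10[Na+] + 0.41(%GC) - 675/N. The coefficients are an assumption taken from that common textbook form, not from this course, and the function name tm_long_dna is hypothetical.

```python
import math


def tm_long_dna(percent_gc: float, na_molar: float, length_bp: int) -> float:
    """Estimate the melting temperature (degrees C) of a long DNA duplex.

    Assumed empirical approximation (not taken from the slides):
        Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 675/N
    where [Na+] is the molar sodium concentration and N is the length in bp.
    """
    if not 0.0 <= percent_gc <= 100.0:
        raise ValueError("percent_gc must be between 0 and 100")
    if na_molar <= 0 or length_bp <= 0:
        raise ValueError("na_molar and length_bp must be positive")
    return (81.5
            + 16.6 * math.log10(na_molar)
            + 0.41 * percent_gc
            - 675.0 / length_bp)
```

Under this approximation, Tm rises linearly with %GC and with the logarithm of the salt concentration, which is why base composition alone gives a useful estimate for long molecules.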
[Figure: restriction map of pGEM4. Surviving site labels: AccII, AflIII, AatII, SspI, AlwNI, XmnI, Asp700I, ScaI, Eco255I, XorII, PvuI, BspCI, AhdI, AspEI, Eam1105I, EclHKI, BpmI, GsuI, BglI, AviII, FspI]
Transcription
transcription is accomplished by RNA polymerase
RNA polymerase binds to promoters
promoters have distinct regions "-35" and "-10"
efficiency of transcription controlled by binding and progression rates
transcription start and stop affected by tertiary structure
regulatory sequences can be positive or negative
RNA processing
eukaryotic genes are interrupted by introns
these are "spliced" out to yield mRNA
splicing done by spliceosome
splicing sites are quite degenerate but not all are used
Translation
conversion from RNA to protein is by codon: 3 bases = 1 amino acid
translation done by ribosome
translation efficiency controlled by mRNA copy number (turnover) and ribosome binding efficiency
translation affected by mRNA tertiary structure
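The codon rule (3 bases = 1 amino acid) can be sketched in a few lines of Python. This is an illustration only: it assumes the standard genetic code, reads a single frame starting at the first base, and uses the hypothetical function names transcribe and translate.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def transcribe(dna: str) -> str:
    """Return the mRNA with the same sequence as the coding strand (T -> U)."""
    return dna.upper().replace("T", "U")


def translate(seq: str, to_stop: bool = True) -> str:
    """Translate DNA or RNA codon by codon, starting at the first base.

    '*' marks a stop codon; codons containing unknown characters become 'X'.
    """
    seq = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)
```

As a usage example, the coding region of the rat obese record later in these notes begins ATG TGC TGG AGA CCC CTG at base 30, and translate("ATGTGCTGGAGACCCCTG") returns "MCWRPL".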
Protein localization
leader sequences can specify cellular location (e.g., insert across membranes)
leader sequences usually removed by proteolytic cleavage
Posttranslational processing
peptides fold after translation - may be assisted or unassisted
processing enzymes recognize specific sites (amino acid sequences)
protein signals can involve secondary and tertiary structure, not just primary structure
Definition
A sequence is a linear set of characters (sequence elements) representing nucleotides or amino acids
DNA composed of four nucleotides or "bases": A,C,G,T
RNA composed of four also: A,C,G,U (T transcribed as U)
proteins are composed of amino acids (20)
Representation of Sequences
characters
  simplest
  easy
bit-coding
  more compact, both on disk and in memory
  comparisons more efficient
  more to come on this
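To make the bit-coding idea concrete, here is a minimal Python sketch that packs each unambiguous DNA base into 2 bits. The particular assignment (A=0, C=1, G=2, T=3) and the function names pack and unpack are assumptions chosen for illustration; the slides do not fix an encoding.

```python
# Assumed 2-bit code for illustration: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"


def pack(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base (first base highest)."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODE[base]  # KeyError on ambiguous bases
    return value


def unpack(value: int, length: int) -> str:
    """Recover the DNA string; the length is needed because leading A's are zeros."""
    bases = []
    for shift in range(2 * (length - 1), -2, -2):
        bases.append(DECODE[(value >> shift) & 0b11])
    return "".join(bases)
```

With this coding four bases fit in one byte instead of four, and two equal-length words can be compared with a single integer comparison rather than character by character.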
DNA or RNA
use
protein
use
can
It is often the case that we would like to represent uncertainty in a nucleotide sequence, i.e., that more than one base is possible at a given position
to express ambiguity during sequencing
to express variation at a position in a gene during evolution
to express ability of an enzyme to tolerate more than one base at a given position of a recognition site
A, C, G, T, U
R = A, G (puRine)
Y = C, T (pYrimidine)
S = G, C (Strong hydrogen bonds)
W = A, T (Weak hydrogen bonds)
M = A, C (aMino group)
K = G, T (Keto group)
B = C, G, T (not A)
D = A, G, T (not C)
H = A, C, T (not G)
V = A, C, G (not T/U)
N = A, C, G, T/U (iNdeterminate)
X or - are sometimes used
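One use of these codes is searching a sequence for a recognition site that tolerates more than one base at some positions. The sketch below, offered as an illustration rather than a definitive tool, expands each code in the table above into a regular-expression character class. The function name find_sites and the example pattern ACRYGT are hypothetical choices, and only one strand is searched.

```python
import re

# Ambiguity codes from the table above, expanded to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "GC", "W": "AT", "M": "AC", "K": "GT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def find_sites(site: str, seq: str) -> list:
    """Return 1-based start positions where the ambiguous site matches seq."""
    body = "".join("[" + IUPAC[code] + "]" for code in site.upper())
    # A lookahead is used so that overlapping matches are also reported.
    pattern = re.compile("(?=" + body + ")")
    return [m.start() + 1 for m in pattern.finditer(seq.upper())]
```

For example, find_sites("ACRYGT", "TTACGCGTAAACATGTCC") reports matches at positions 3 (ACGCGT) and 11 (ACATGT), since R allows A or G and Y allows C or T.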
Two characteristics of file formats
  text or binary
  minimal or annotated
Text files use IUB codes and are readable by a word processor (e.g., SimpleText, Microsoft Word) or text editor (e.g., emacs)
Binary files are usually readable only by the program that created them (e.g., MacVector)
Annotated files preserve information known about the sequence (coding region start/stop, protein features, literature references, etc.)
Fasta (Entrez)
LOCUS       RATOBESE.G   539 BP   SS-RNA   ENTERED 09/23/95
DEFINITION  Rat mRNA for obese.
ACCESSION
KEYWORDS
SOURCE      Rattus norvegicus; Norway rat
ORGANISM    Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata; Vertebrata; Sarcopterygii; Mammalia; Eutheria; Rodentia; Sciurognathi; Myomorpha; Muridae; Murinae; Rattus
REFERENCE   [1]
AUTHORS     Murakami, T. & Shima, K.
TITLE       Cloning of rat obese cDNA and its expression in obese rats.
JOURNAL     Biochem. Biophys. Res. Commun., 209, 3, 944-952, (1995)
COMMENT     Database Reference: DDBJ RATOBESE Accession: D49653
            -----------
            Submitted (10-Mar-1995) to DDBJ by:
            Takashi Murakami
            Department of Laboratory Medicine
            School of Medicine
            University of Tokushima
            Kuramotocho 3-chome
            Tokushima 770 Japan
            Phone: +81-886-33-7184
            Fax: +81-886-31-9495
GCG [continued]
FEATURES    From   To/Span   Description
pept        30     533       obese
????        1      539       source; /organism=Rattus norvegicus; /strain=OLETF, LETO and Zucker; /dev_stage=differentiated; /sequenced_mol=cDNA to mRNA; /tissue_type=adipose
BASE COUNT  121 A   167 C   133 G   118 T   0 OTHER
ORIGIN ?
RATOBESE.G  Length: 539  Jan 30, 1996 - 05:32 PM  Check: 5797 ..
  1 CCAAGAAGAA GAAGACCCCA GCGAGGAAAA TGTGCTGGAG ACCCCTGTGC CGGTTCCTGT
 61 GGCTTTGGTC CTATCTGTCC TATGTTCAAG CTGTGCCTAT CCACAAAGTC CAGGATGACA
121 CCAAAACCCT CATCAAGACC ATTGTCACCA GGATCAATGA CATTTCACAC ACGCAGTCGG
181 TATCCGCCAG GCAGAGGGTC ACCGGTTTGG ACTTCATTCC CGGGCTTCAC CCCATTCTGA
241 GTTTGTCCAA GATGGACCAG ACCCTGGCAG TCTATCAACA GATCCTCACC AGCTTGCCTT
301 CCCAAAACGT GCTGCAGATA GCTCATGACC TGGAGAACCT GCGAGACCTC CTCCATCTGC
361 TGGCCTTCTC CAAGAGCTGC TCCCTGCCGC AGACCCGTGG CCTGCAGAAG CCAGAGAGCC
421 TGGATGGCGT CCTGGAAGCC TCGCTCTACT CCACAGAGGT GGTGGCTCTG AGCAGGCTGC
481 AGGGCTCTCT GCAGGACATT CTTCAACAGT TGGACCTTAG CCCTGAATGC TGAGGTTTC
//
When saving a sequence for use in an email message or pasting into a web page, use an unannotated text format such as FASTA
When retrieving from a database or exchanging between programs, use an annotated text format such as GCG
When using a sequence again with the same program, use that program's annotated binary format (or annotated text if binary is not available)
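For the first case, an unannotated FASTA file is just a ">" header line followed by the sequence, usually wrapped to a fixed width. The following is a minimal sketch of writing and reading that layout in Python; the function names write_fasta and read_fasta are illustrative, and the reader keeps only the first word of each header as the record name, which is a simplifying assumption.

```python
def write_fasta(name: str, seq: str, width: int = 60) -> str:
    """Return FASTA text: a '>' header line, then the sequence wrapped to width."""
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def read_fasta(text: str) -> dict:
    """Parse FASTA text into {name: sequence}.

    Assumption: the record name is the first word after '>'; the rest of
    the header line is treated as a free-text description and dropped.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            name = header.split()[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}
```

Usage: write_fasta("RATOBESE", seq) produces text that can be pasted into an email or a web form, and read_fasta recovers the sequence unchanged.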
Entrez
a client-server system for retrieval of information related to molecular biology
can be used via
provided by National Center for Biotechnology Information, part of the National Library of Medicine (NIH)
Entrez Databases
http://www.ncbi.nlm.nih.gov/
PubMed: contains Medline abstracts as well as links to full text articles on sites maintained by journal publishers
PubMed Central: free, full text journal articles
Books: online books
OMIM: Online Mendelian Inheritance in Man
Nucleotide: nucleotide sequence database (GenBank)
Protein: protein sequence database
Genome: complete genome assemblies
Structure: three-dimensional macromolecular structures
SNP: single nucleotide polymorphism
PopSet: population study data sets
And many more
Entrez essentials
Semi-automated entry of information into databases
Critical to usefulness are the links between databases
Search OMIM by keyword.
Switch to the Nucleotide database to see the sequence.
Switch to the Protein database to see the sequence.
Change to GenPept format to save the sequence.
Use links to find related literature in PubMed.
Use Related Articles to find similar articles.
Search the Nucleotide database by gene name.
Set Limits to narrow down the search.
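The steps above use the Entrez web pages. The same kind of search can also be expressed as a URL for NCBI's E-utilities service, which the slides do not cover; the sketch below only builds the query string and makes no network request. The helper name esearch_url is hypothetical, and "nuccore" is the E-utilities identifier assumed here for the Nucleotide database.

```python
from urllib.parse import urlencode

# Base address of the E-utilities ESearch service (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a database name and an Entrez query string.

    Field tags such as [Gene Name] play the role of the Limits settings
    on the web page by restricting where the term is searched.
    """
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# Hypothetical example: search the Nucleotide database by gene name.
example = esearch_url("nuccore", "leptin[Gene Name]", retmax=5)
```

Opening the resulting URL in a browser returns the matching record identifiers as XML; switching the db argument (for example to "protein" or "pubmed") mirrors switching databases in the web interface.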