
Electronic Molecular Biology

Dr. Fazeeda N. Hosein BIOL3061 2012

Learning Objectives
What is a database?
Why do we need databases and what databases are available to us?
What information can be obtained from databases?
What is BLAST?
What is a sequence alignment?
Which software can we use to compare sequences?
Which software can we use to obtain phylogenetic data?

What is a database?
A database (db) is designed to offer an organized mechanism for storing, managing and retrieving information.
A database is a collection of data that is:
structured
searchable (index) -> table of contents
updated periodically (release) -> new edition
cross-referenced (hyperlinks) -> links with other dbs
It also includes the associated tools (software) necessary for db access/query, db updating, db information insertion and db information deletion.
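The four operations named above (query, update, insertion, deletion) can be sketched with Python's built-in sqlite3 module. This is only an illustration, not how GenBank or any other sequence database is implemented; the table layout, accession numbers and sequences are invented placeholders.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding one table of sequence records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, organism TEXT, sequence TEXT)"
    )
    # The index plays the role of the 'table of contents': it makes searching fast.
    conn.execute("CREATE INDEX idx_organism ON entry (organism)")
    return conn

conn = build_demo_db()

# Insertion (placeholder records, not real database entries)
conn.execute("INSERT INTO entry VALUES (?, ?, ?)", ("X00001", "Homo sapiens", "ATGGCC"))
conn.execute("INSERT INTO entry VALUES (?, ?, ?)", ("X00002", "Zea mays", "ATGTTT"))

# Query
rows = conn.execute(
    "SELECT accession FROM entry WHERE organism = ?", ("Homo sapiens",)
).fetchall()

# Update
conn.execute(
    "UPDATE entry SET sequence = ? WHERE accession = ?", ("ATGGCCTAA", "X00001")
)

# Deletion
conn.execute("DELETE FROM entry WHERE accession = ?", ("X00002",))

remaining = conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
print(rows, remaining)
```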

Why do we need databases?


Biology has turned into a data-rich science.
High-throughput genomics, proteomics, metabolomics, ...
A vast amount of data is generated in experiments (such as MS peptide fragments and whole-genome sequencing).

The need for storing and communicating large datasets has grown tremendously.
Archiving, curation, analysis and interpretation of all of these datasets are a challenge.
Convenient methods for proper storing, searching and retrieving are necessary.
Databases are the means to handle this data overload.

What can databases do?


Make biological data available ...
1. to scientists
2. in computer-readable form
Support computer-based analysis.
Handle and share large volumes of data.
Provide an interface for computer-based systems (algorithms, Web interfaces).
Store data in defined formats.
Automate the storage and retrieval of experimental data.
Link knowledge with external resources.

What databases are available for us?


www.ebi.ac.uk/embl/

www.ncbi.nlm.nih.gov/

http://www.ddbj.nig.ac.jp

Sequences are submitted directly by scientists and genome sequencing groups, or are taken from the literature and patents.
Entries in the EMBL, GenBank and DDBJ databases are synchronized on a daily basis.
Accession numbers are managed in a consistent manner.
There is comparatively little error checking and a fair amount of redundancy.

What databases are available for us?


Database            URL                     Content
GenBank/DDBJ/EMBL   www.ncbi.nlm.nih.gov    Nucleotide sequences
Ensembl             www.ensembl.org         Human/mouse genome
PubMed              www.ncbi.nlm.nih.gov    Literature references
NR                  www.ncbi.nlm.nih.gov    Protein sequences
Swiss-Prot          www.expasy.org          Protein sequences
InterPro            www.ebi.ac.uk           Protein domains
OMIM                www.ncbi.nlm.nih.gov    Genetic diseases
Enzymes             www.expasy.org          Enzymes
PDB                 www.rcsb.org/pdb/       Protein structures
KEGG                www.genome.ad.jp        Metabolic pathways

Minimal content of an entry in a sequence database


Sequence
Accession number (AC) (never changes)
Taxonomic data
References
Annotation/Curation
Keywords
Cross-references
Documentation

The Perfect Database


1. Comprehensive, but easy to search.
2. Annotated, but not too annotated.
3. A simple, easy to understand structure.
4. Cross-referenced.
5. Minimum redundancy.
6. Easy retrieval of data.

How do you read an entry in GenBank


LOCUS: Unique string of 10 letters and numbers in the database. It is not maintained amongst databases, and is therefore a poor sequence identifier.

ACCESSION: A unique identifier for the record and a citable entity; it does not change when the record is updated. A good record identifier, ideal for citation in publications.

VERSION: New system in which the accession and version together play the same role as the accession and gi number.

Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.

PID: Protein identifier: g, e or d prefix to the gi number. There can be one or two on one CDS.

Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.

Protein_id: Identifier which has the same structure and function as the nucleotide accession.version numbers, but a slightly different format.

1 The LOCUS field consists of five different subfields:

1a Locus Name (HSHFE) - The locus name is a tag for grouping similar sequences. The first two or three letters usually designate the organism. In this case, HS stands for Homo sapiens. The last several characters are associated with another group designation, such as gene product. In this example, the last three characters represent the gene symbol, HFE. Currently, the only requirement for assigning a locus name to a record is that it is unique.

1b Sequence Length (12146 bp) - The total number of nucleotide base pairs (or amino acid residues) in the sequence record.


1c Molecule Type (DNA) - Type of molecule that was sequenced. All sequence data in an entry must be of the same type.

1d GenBank Division (PRI) - There are different GenBank divisions. In this example, PRI stands for primate sequences. Some other divisions include ROD (rodent sequences), MAM (other mammal sequences), PLN (plant, fungal, and algal sequences), and BCT (bacterial sequences).

1e Modification Date (23-July-1999) - Date of most recent modification made to the record. The date of first public release is not available in the sequence record. This information can be obtained only by contacting NCBI at info@ncbi.nlm.nih.gov.
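As a rough sketch, the five LOCUS subfields can be pulled apart by splitting the line on whitespace. The example line below is assembled from the values on this slide; its exact spacing and date spelling are assumptions, and real GenBank records may carry extra columns that this simple parser would reject.

```python
from typing import NamedTuple

class Locus(NamedTuple):
    name: str               # 1a locus name
    length: int             # 1b sequence length
    unit: str               # "bp" for nucleotides
    molecule_type: str      # 1c molecule type
    division: str           # 1d GenBank division
    modification_date: str  # 1e modification date

def parse_locus(line: str) -> Locus:
    """Split a LOCUS line that holds exactly the five subfields described above."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS":
        raise ValueError("unexpected LOCUS layout: " + line)
    _, name, length, unit, molecule_type, division, date = tokens
    return Locus(name, int(length), unit, molecule_type, division, date)

# Illustrative line built from the slide's values (layout is an assumption)
example = "LOCUS       HSHFE      12146 bp    DNA       PRI       23-JUL-1999"
locus = parse_locus(example)
print(locus.name, locus.length, locus.division)
```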

2 DEFINITION Brief description of the sequence. The description may include source organism name, gene or protein name, or designation as untranscribed or untranslated sequences (e.g., a promoter region). For sequences containing a coding region (CDS), the definition field may also contain a completeness qualifier such as "complete CDS" or "exon 1."

3 ACCESSION (Z92910) Unique identifier assigned to a complete sequence record. This number never changes, even if the record is modified. An accession number is a combination of letters and numbers that are usually in the format of one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456).

4 VERSION (Z92910.1) Identification number assigned to a single, specific sequence in the database. This number is in the format accession.version. If any changes are made to the sequence data, the version part of the number will increase by one. For example U12345.1 becomes U12345.2. A version number of Z92910.1 for this HFE sequence indicates that the sequence data has not been altered since its original submission.
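The accession.version format can be checked and split with a short regular expression. This sketch accepts only the two accession patterns mentioned above (one letter plus five digits, or two letters plus six digits); since the slide says accessions are "usually" in these formats, other valid patterns exist and would be rejected here. The function names are illustrative.

```python
import re

# Accession (1 letter + 5 digits, or 2 letters + 6 digits), a dot, then the version
VERSION_RE = re.compile(r"^([A-Z][0-9]{5}|[A-Z]{2}[0-9]{6})\.([0-9]+)$")

def split_version(identifier: str):
    """Return (accession, version) for an identifier such as 'Z92910.1'."""
    match = VERSION_RE.match(identifier)
    if match is None:
        raise ValueError("not in accession.version format: " + identifier)
    return match.group(1), int(match.group(2))

def same_record(a: str, b: str) -> bool:
    """True when two identifiers share an accession, whatever their versions."""
    return split_version(a)[0] == split_version(b)[0]

print(split_version("Z92910.1"))
print(same_record("U12345.1", "U12345.2"))
```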

5 GI (1890179) Also a sequence identification number. Whenever a sequence is changed, the version number is increased and a new GI is assigned. If a nucleotide sequence record contains a protein translation of the sequence, the translation will have its own GI number.

6 KEYWORDS (haemochromatosis; HFE gene) A keyword can be any word or phrase used to describe the sequence. Keywords are not taken from a controlled vocabulary. Notice that in this record the keyword, "haemochromatosis," employs British spelling, rather than the American "hemochromatosis." Many records have no keywords. A period is placed in this field for records without keywords.

7 SOURCE (human) Usually contains an abbreviated or common name of the source organism.

8 ORGANISM (Homo sapiens) The scientific name (usually genus and species) and phylogenetic lineage. See the NCBI Taxonomy Homepage for more information about the classification scheme used to construct taxonomic lineages.

9 REFERENCE

Citations of publications by sequence authors that support information presented in the sequence record. Several references may be included in one record. References are automatically sorted from the oldest to the newest. Cited publications are searchable by author, article or publication title, journal title, or MEDLINE unique identifier (UID). The UID links the sequence record to the MEDLINE record.

The FEATURES table

A feature is simply an annotation that describes a portion of the sequence. Each feature includes a location (sequence location or interval) and one or several qualifiers. Clicking on the feature name will open a record for the sequence interval identified in the feature location. A list of features can be found at http://www.ncbi.nlm.nih.gov/collab/FT/
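A feature, as described above, is a location plus qualifiers, which maps naturally onto a small data structure. The sketch below assumes a simple interval with 1-based, inclusive coordinates; compound locations (for example joined or complementary intervals) are not handled, and the class name, sequence and qualifier values are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    key: str    # feature name, e.g. "gene", "exon", "CDS"
    start: int  # first position of the interval (1-based, inclusive)
    end: int    # last position of the interval (1-based, inclusive)
    qualifiers: dict = field(default_factory=dict)

    def extract(self, sequence: str) -> str:
        """Return the part of the sequence that this feature annotates."""
        return sequence[self.start - 1:self.end]

# Invented 15-base sequence with a made-up CDS at positions 4..12
sequence = "GGGATGAAATAACCC"
cds = Feature("CDS", 4, 12, {"gene": "demo"})
print(cds.key, cds.extract(sequence), cds.qualifiers["gene"])
```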

The FEATURES table

source - An obligatory feature. The source gives the length of the entire sequence, the scientific name of the source organism, and the Taxon ID number. Other types of information that the submitter may include in this field are chromosome number, map location, clone, and strain identification.

The FEATURES table


gene - Sequence portion that delineates the beginning and end of a gene.

exon - Sequence segment that contains an exon. Exons may contain portions of 5' and 3' UTRs (untranslated regions). The name of the gene to which the exon belongs and the exon number are provided.

CDS - Sequence of nucleotides that code for amino acids of the protein product (coding sequence). This feature includes the translation into amino acids and may also contain gene name, gene product function, link to protein sequence record, and cross-references to other database entries.

intron - Transcribed but spliced-out parts. Intron number is shown.

polyA_signal - Identifies the sequence portion required for endonuclease cleavage of an mRNA transcript. Consensus sequence for the polyA signal is AATAAA.
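Searching a sequence for the polyA signal consensus is a plain substring search. This minimal sketch reports 1-based start positions of exact AATAAA matches only; it does not look for variant signals, and the function name is illustrative.

```python
def find_polya_signals(sequence: str, motif: str = "AATAAA"):
    """Return the 1-based start positions of every exact motif match."""
    seq = sequence.upper()
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)          # convert 0-based index to 1-based
        start = seq.find(motif, start + 1)   # continue just past this match start
    return positions

print(find_polya_signals("ccAATAAAggAATAAA"))
```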

BASE COUNT - Base Count gives the total number of adenine (A), cytosine (C), guanine (G), and thymine (T) bases in the sequence.

ORIGIN - Origin contains the sequence data, which begins on the line immediately below the field title.
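The ORIGIN block and the BASE COUNT line are closely linked: once the position numbers and spaces are stripped from the sequence lines, the base count can be recomputed. The sketch below uses a shortened, invented ORIGIN block (real records have longer lines) and assumes the block ends with a line containing only "//".

```python
from collections import Counter

def read_origin(lines):
    """Join the sequence lines of an ORIGIN block into one upper-case string."""
    bases = []
    for line in lines:
        stripped = line.strip()
        if stripped == "//":                      # end-of-record marker
            break
        if stripped.upper().startswith("ORIGIN"):  # field title, no sequence data
            continue
        # keep letters only: this drops position numbers and spaces
        bases.extend(ch for ch in stripped if ch.isalpha())
    return "".join(bases).upper()

def base_count(sequence):
    """Count A, C, G and T in the sequence."""
    counts = Counter(sequence)
    return {base: counts.get(base, 0) for base in "ACGT"}

# Invented, shortened example block
origin_block = [
    "ORIGIN",
    "        1 ggatcctaaa gcttacgtac",
    "       21 acgt",
    "//",
]
seq = read_origin(origin_block)
print(seq, base_count(seq))
```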


Basic Local Alignment Search Tool (BLAST)


BLAST is an algorithm for comparing primary biological sequence information (amino-acid or nucleotide sequences).
It enables comparison of a query sequence with a library or database of sequences, and identifies sequences that resemble the query sequence above a certain threshold.
BLAST is one of the most widely used bioinformatics programs.
It addresses a fundamental problem: finding, in a large database, the sequences that are similar to a sequence of interest.
The algorithm emphasizes speed over sensitivity (which makes it practical on the huge genome databases currently available).
Variants:
Nucleotide-nucleotide BLAST (blastn)
Protein-protein BLAST (blastp)
Nucleotide 6-frame translation-protein (blastx)
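The "6-frame translation" behind blastx means translating the nucleotide query in three reading frames on each strand. The sketch below shows only that translation step using the standard genetic code; it is not BLAST's implementation and does no database searching. Codons containing characters other than A, C, G or T are shown as X.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq: str) -> str:
    """Translate complete codons from the first base; '*' marks a stop codon."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )

def six_frames(seq: str) -> dict:
    """Translate all three frames on the forward (+) and reverse (-) strands."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(s[offset:])
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```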

BLAST
To run, BLAST requires a query sequence to search for, and either a sequence to search against (also called the target sequence) or a sequence database containing multiple such sequences.
Input: sequence in FASTA or GenBank format.
Output: a graphical view showing the hits found, a table showing sequence identifiers for the hits with scoring data, and alignments between the sequence of interest and the hits, with the corresponding BLAST scores.
NCBI: http://blast.ncbi.nlm.nih.gov/Blast.cgi
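FASTA input is simple enough to read by hand: a header line starting with ">" followed by one or more sequence lines. The minimal reader below keeps the first word of each header as the identifier; the record names and sequences in the example are invented.

```python
def parse_fasta(text: str) -> dict:
    """Map each record identifier (first word after '>') to its sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            if not words:
                raise ValueError("FASTA header without an identifier")
            current = words[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}

# Invented two-record example
fasta_text = """>query1 example nucleotide sequence
ATGGCC
TAA
>query2
ATGTTT
"""
sequences = parse_fasta(fasta_text)
print(sequences)
```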

Now what?
What other genes encode proteins that exhibit structures similar to your sequence? (Gene families)

Do you find proteins that are related in lineage over a range of species? (Evolutionary biology)

A phylogenetic tree shows the inferred evolutionary relationships among a set of species or sequences.

To do this, we use other programs that are available online as freeware: MEGA, ClustalW, ClustalX and Phylip.

Data entered into MEGA

Data analysed using MEGA

A sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
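Once two sequences have been aligned, their similarity can be summarised as percent identity. The sketch below assumes the sequences are already aligned to equal length with "-" for gaps, and it ignores any column containing a gap; other tools count gaps differently, so the numbers may not match what MEGA or Clustal report.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """Percent of identical residues over the gap-free columns of an alignment."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == gap or y == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * matches / compared

# Invented aligned pair: 6 gap-free columns, 5 of them identical
print(round(percent_identity("ATG-CCA", "ATGACTA"), 2))
```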

Data analysed in ClustalW

[Figure: phylogenetic tree; branching pattern and branch support values not recoverable. Scale bar: 0.05]

Sequences in the tree:
V. vinifera BAD18977 VvMYBA1
V. vinifera AB097924 VvmybA2
A. thaliana ABB03879 PAP1
L. esculentum AAQ55181 LeANT1
Petunia x hybrida AAF66727 An2 protein
Gh CAD87010 MYB10
A. majus ABB83828 VENOSA
A. majus ABB83826 ROSEA1
A. majus ABB83827 ROSEA2
M. d DQ886415 MYB1-1
V. vinifera AAS68190 Myb transcription factor
A. thaliana Q9FJA2 TT2
Z. mays AAA33482 c1 locus myb homologue
Z. mays AAA19821 transcriptional activator
A. andreanum MYB1 AAO92352.1
Fragaria x ananassa AAK84064 transcription factor MYB1
A. majus CAA55725 mixta
A. thaliana ABB03913 MYB12
Petunia x hybrida AAV98200 MYB-like protein ODORANT1
D. carota BAE54312 transcription factor DcMYB1
...
N. tabacum BAA88222 myb-related transcription factor

Tree generated using MEGA

Conclusions
A database is designed to offer an organized mechanism for storing, managing and retrieving information.
BLAST is an algorithm for comparing primary biological sequence information (amino-acid or nucleotide sequences):
Nucleotide-nucleotide BLAST (blastn)
Protein-protein BLAST (blastp)
Nucleotide 6-frame translation-protein (blastx)
Programs used to perform multiple alignments and generate phylogenetic trees: MEGA, ClustalW, ClustalX, Phylip.
