

AIM: This is the practical session for the chapter "Biomolecular databases" of the course "Introduction to bioinformatics".

Resources

This exercise will be based on the following Web resources.

Acronym | Type | Description | URL
EMBL | Nucleic sequences | The EMBL Nucleic Sequence Database (EBI - UK) | http://www.ebi.ac.uk/embl/
Genbank | Nucleic sequences | Genbank (NCBI - USA) | http://www.ncbi.nlm.nih.gov/Genbank/
DDBJ | Nucleic sequences | DDBJ - DNA Data Bank of Japan | http://www.ddbj.nig.ac.jp/
UniProt | Protein sequences | UniProt - the Universal Protein Resource | http://www.uniprot.org/
PDB | 3D structure of macromolecules | PDB - The Protein Data Bank | http://www.rcsb.org/pdb/
EnsEMBL | Genome browser | EnsEMBL Genome Browser (Sanger Institute + EBI) | http://www.ensembl.org/
UCSC | Genome browser | UCSC Genome Browser (University of California Santa Cruz - USA) | http://genome.ucsc.edu/
ECR | Genome browser | ECR Browser | http://ecrbrowser.dcode.org/
Integr8 | Comparative genomics | Integr8 - access to complete genomes and proteomes | http://www.ebi.ac.uk/integr8/
Prosite | Protein domains | Prosite - protein domains, families and functional sites | http://www.expasy.ch/prosite/
Pfam | Protein domains | PFAM - Protein families represented by multiple sequence alignments and hidden Markov models (HMMs) (Sanger Institute - UK) | http://pfam.sanger.ac.uk/
CATH | Protein domains | CATH - Protein Structure Classification | http://www.cathdb.info/
InterPro | Protein domains | InterPro (EBI - UK) | http://www.ebi.ac.uk/interpro/
GO | Gene ontology | Gene Ontology Database | http://www.geneontology.org/
Entrez | Multi-database | A collection of biomolecular databases maintained at the NCBI (USA), accessible via an interface called Entrez | http://www.ncbi.nlm.nih.gov/Entrez/
SRS | Data warehouse | A collection of biomolecular databases maintained at the European Bioinformatics Institute (EBI, UK), accessible via an interface called SRS | http://srs.ebi.ac.uk/

A quick tour of selected databases

The number of biomolecular databases is growing so fast that it is impossible to give a balanced survey of all the existing resources. We selected here a few databases on the basis of various criteria (popularity, ease of access, ...) to illustrate the type of information that can be retrieved from them. As an exercise, we propose to browse some databases in order to gather information about one particular protein. Each student can do the same analysis with a protein of his or her own interest. If you lack inspiration, you can for example run the exercise with the Drosophila protein Ubx.

Exercise (a)

Choose a protein for which you have some prior knowledge (e.g. the protein Ubx from Drosophila melanogaster) and try to extract all the information relevant to this protein from the databases listed in the table of biomolecular databases above.

Next steps

In the exercise above, we saw that each database provides a piece of information about some aspect of our protein of interest:

- gene sequence (GenBank, EMBL, DDBJ),
- genomic context + cross-genome conservation (EnsEMBL, UCSC, ECR),
- orthologous and paralogous genes (Integr8),
- protein sequence enriched with annotations about functional features (UniProt),
- 3D structure (PDB),
- structural domains (CATH),
- sequence motifs (PROSITE),
- ...

Note that this is just a very small sample of the information that can be obtained via the hundreds of biomolecular databases distributed around the world.

We will now consult two Web servers (NCBI Entrez and EBI SRS) that provide integrated access to multiple databases, thereby making it easier to examine multiple aspects of a protein of interest.

Retrieving information from the NCBI with Entrez

Entrez is a retrieval system for searching several linked databases stored at the NCBI (National Center for Biotechnology Information, United States).
http://www.ncbi.nlm.nih.gov/Entrez/

An example of simple query

We will try to retrieve from Entrez the information about the protein Gal4p from the yeast Saccharomyces cerevisiae.

- Open the Entrez home page: http://www.ncbi.nlm.nih.gov/Entrez/
- You can see a list of the databases supported at Entrez. Click on the link Protein sequence database.
- In the query box, type Gal4p

Questions

- How many results do you obtain?
- How many of them correspond to your needs?
- How could you try to improve the result?

Logical operators

A first improvement can be obtained by adding required words to the query. For instance, we could require the words "Saccharomyces" and "cerevisiae", in addition to "Gal4p". For this, you can use the logical operators 'AND', 'OR', and 'NOT' within the query sentence. Beware! These operators are case-sensitive: if you type them in lowercase, they will be treated as required words rather than as operators.

In the query box, type

    Gal4p AND Saccharomyces AND cerevisiae

An even more precise way to select Saccharomyces cerevisiae is to quote the pair of words.

    Gal4p AND "Saccharomyces cerevisiae"

Questions

- What about the result? Did we obtain an improvement?
- How do you explain the incorrect results?

Imposing constraints on a specific field

You can refine the selection by specifying the field in which your query text has to be found. For this, click on the link Limits just below the query box. A form appears which allows you to select a field and impose some constraints on its content. The form contains additional options for filtering the results on the basis of other criteria (language, publication date, ...), but we will ignore them for this tutorial.

- Click on the link Limits below the query box.
- Select the field Gene name.
- Try with Gal4p.

You obtain no results. The reason is that the yeast community convention is to add the suffix "p" to a gene name to denote its product. Thus, the name of the gene coding for the protein Gal4p is GAL4. Since we are filtering on the basis of gene name, we need to type GAL4.

Questions

- How many results do we obtain now?
- Do they all fit our needs?
- How could we refine the query?

Specifying constraints on multiple fields

The Limits form only allows you to select a single field, but we would like to impose constraints simultaneously on gene name and on organism. For this, we need to write a detailed query.

Click on the link Details below the query box. This gives you access to the precise sentence that was used to submit the query to the database. In principle, you should see the formal syntax for the previous query.

    GAL4[Gene Name]

We will now use this syntax to write our own queries directly. Before this, you need to uncheck the option Limits (otherwise, the next steps will not work).

To add a constraint on the organism, enter the following sentence in the query box.

    GAL4[Gene Name] AND Saccharomyces[Organism]

Questions

- How many results do you obtain now?
- What is the difference between these entries?
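The field-constrained query above can also be assembled programmatically. The following is a minimal sketch, not part of the original practical: it builds the query sentence in the text[Field] syntax shown above and wraps it in a URL for the NCBI E-utilities esearch service. The use of E-utilities, and the helper names build_term and build_esearch_url, are assumptions made for illustration; the sketch only builds the URL and does not contact the server.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities "esearch" service (assumption: this
# practical itself only uses the Entrez web interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(constraints):
    """Join (text, field) pairs with AND, using the text[Field] syntax.

    A field of None leaves the text unconstrained; multi-word texts are quoted.
    """
    parts = []
    for text, field in constraints:
        if " " in text:
            text = f'"{text}"'
        parts.append(f"{text}[{field}]" if field else text)
    # The operator must be uppercase, as explained in the tutorial.
    return " AND ".join(parts)


def build_esearch_url(db, term):
    """Return an esearch URL for the given database and query term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


term = build_term([("GAL4", "Gene Name"), ("Saccharomyces", "Organism")])
url = build_esearch_url("protein", term)
print(term)
print(url)
```

Keeping the query construction separate from the URL construction makes it easy to compare the sentence with what the Details link displays before sending anything to the server.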

Browsing a protein entry

Now that we have selected a reasonably low number of proteins, we can identify the one we were searching for: Gal4p from Saccharomyces cerevisiae. The result list should include a record with the identifier P04386. Click on this identifier to display the entire record. Browse the resulting page to get an idea of the annotation content.

Saving the protein sequence in FASTA format

At the top of the window, the option Display allows you to choose among different formats. Select the format FASTA. This will display the amino-acid sequence of the Gal4p protein.
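For readers unfamiliar with the FASTA format, here is a minimal sketch of how such a file is written and read back. A FASTA record is a header line starting with ">" followed by the sequence, usually wrapped over several lines. The function names, the 60-column wrapping width and the toy sequence are illustrative assumptions; the sequence is invented and is not that of Gal4p.

```python
def format_fasta(header, sequence, width=60):
    """Return one FASTA record: a '>' header line, then wrapped sequence lines."""
    lines = [">" + header]
    lines += [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join(lines) + "\n"


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Invented 80-residue toy sequence (NOT the real Gal4p sequence).
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 4
fasta_text = format_fasta("toy_protein hypothetical example", toy_sequence)
print(fasta_text)
print(parse_fasta(fasta_text))
```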


Selecting a database

The SRS interface is more powerful but also more complex than what we saw with Entrez, because SRS allows you to perform more complex queries. To use it, you must first become familiar with its basic concepts. SRS is a multi-database retrieval system, and the first step is to select one or several databases on which the query will be performed. For our first exercise, it will be sufficient to select a single database: UniProt, the non-redundant database of protein sequences. As we saw during the course, this database contains two sections: Swiss-Prot (proteins with experimental characterization, annotated with many references to the literature), and TrEMBL (translation of all the coding sequences from the EMBL nucleotide sequence database).

- Go to the SRS home page, and click Start a permanent project.
- You will be prompted for a user ID. Type the login name you would like to use in the future, and click OK. From now on, the server will store your queries, and the next time you connect to SRS with the same login name, you will be able to get back any previous result.
- At the top of the page, there is a series of "tabs", which allow you to select different tasks: Quick query, Library page, Query form, ...
- Try a Quick Search: select the database Proteins in the pop-up menu, and type the protein name Gal4p as query. Do you obtain the right result?
- Try the same query with the gene name (GAL4) instead of the protein name (Gal4p). How many results do you obtain?

Imposing constraints on field contents

We will now refine the query by imposing constraints on specific fields. For this, we will use the standard query form.

- Before using a query form, you need to specify which databases you want to include in your search. For this, click on the tab Library page. This page displays the list of databases supported by SRS at EBI. Select Uniprot-KB by clicking on the check box beside it.
- You can now open the query form by selecting the tab Query at the top of the page. You should see a form with 4 text boxes, which allow you to select proteins on the basis of 4 criteria. Each text box is preceded by a pop-up menu, which lets you select the field on which the constraint will be imposed.
- For the first criterion, select Gene Name in the pop-up menu, and type GAL4 in the text box.
- Click Search. How many results do you obtain? Check the gene names associated with these proteins. Do they all correspond to your query (GAL4)? In which organisms did you find a protein?

We will now add a second selection criterion, the organism.

- Come back to the query form.
- For the first criterion, select GAL4 as Gene name as previously.
- For the second query field, select Organism and enter Saccharomyces cerevisiae.
- Click Search and compare the results with the previous search.

Selecting multiple sequences

Saccharomyces cerevisiae, together with their description.

Gal4p belongs to a family of proteins containing a domain commonly called Zinc cluster. This domain contains 6 cysteines, which interact with 2 atoms of zinc. To date, this domain has only been found in fungi.

Exercise

- Retrieve the peptide sequences of all proteins from the yeast Saccharomyces cerevisiae which contain a binuclear Zn cluster domain, and save them in a file. Beware: this sequence file will be used for subsequent tutorials.
- Perform the same query as above with the yeast Schizosaccharomyces pombe.

We will try two different approaches to select all the Zinc cluster proteins from UniProt.

- A simple but not very accurate method: searching for the keyword that characterizes the Zinc cluster in the UniProt entries themselves.
- A more complex but more accurate method: identifying the Zinc cluster family in the PFAM database (a database specialized in the annotation of protein families), and following the links between this family and UniProt proteins.

Search by keywords

Read carefully the Swiss-Prot entry for GAL4, and try to find a way to select all the yeast proteins having the Zinc cluster domain. For this, you need to identify the part of the record where this domain is mentioned, and think about the best way to select all the proteins having the same keywords in the same field of their UniProt record.

Come back to the Query form, and select all proteins from Saccharomyces cerevisiae which contain a Zinc cluster domain. Beware, this exercise is difficult. Try to find the solution by yourself and, if you don't succeed, read the following.

In the GAL4_YEAST record, the Zinc cluster domain is indicated in the comment field, with the following sentence.

    Contains 1 Zn(2)-C6 fungal-type DNA-binding domain.

We will use the substring Zn(2)-C6, which seems to characterize this domain. You can come back to the Query form and impose, as a first restriction, the Organism name to be Saccharomyces cerevisiae. For the second restriction, select the field Comment: comment in the pop-up menu, and type Zn(2)-C6 in the text box. If you run the query now, you will obtain a syntax error. This error comes from the presence of parentheses in the query text. Parentheses have a specific meaning in SRS queries: they are used to group logical operations (AND, OR, NOT, ...).

We thus need to indicate that the string Zn(2)-C6 has to be treated as a single query term. For this, we have to quote the string. Type "Zn(2)-C6" (with the quotes) in the query box and run the query. We now obtain 50 proteins (in May 2006), which corresponds to the number of Zn cluster proteins identified in the genome of Saccharomyces cerevisiae.
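The same pitfall shows up when filtering downloaded records locally, because parentheses are also special characters in regular expressions. The sketch below is an analogy rather than SRS syntax: it assumes UniProt-style flat-file text in which comment lines start with the code CC, and uses re.escape so that Zn(2)-C6 is matched literally. The two records are invented and heavily abridged, and the function name is hypothetical.

```python
import re


def has_comment_term(record, term):
    """Return True if any CC (comment) line of the record contains term literally."""
    pattern = re.compile(re.escape(term))  # escape "(" and ")" so they match as text
    return any(
        pattern.search(line) is not None
        for line in record.splitlines()
        if line.startswith("CC")
    )


# Invented, heavily abridged records for illustration only.
RECORD_WITH_DOMAIN = "\n".join([
    "ID   TOYA_YEAST",
    "CC   -!- SIMILARITY: Contains 1 Zn(2)-C6 fungal-type DNA-binding domain.",
    "//",
])
RECORD_WITHOUT_DOMAIN = "\n".join([
    "ID   TOYB_YEAST",
    "CC   -!- FUNCTION: Invented comment without the domain keyword.",
    "//",
])

print(has_comment_term(RECORD_WITH_DOMAIN, "Zn(2)-C6"))
print(has_comment_term(RECORD_WITHOUT_DOMAIN, "Zn(2)-C6"))

# Without escaping, the parentheses form a regex group and the literal text is missed.
print(re.search("Zn(2)-C6", "Contains 1 Zn(2)-C6 domain"))
```

Escaping plays the same role here as the quotes in the SRS query box: it tells the search engine to treat the whole string as one literal term.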

Search with a link between PFAM and Uniprot

TO BE WRITTEN

Saving the sequences in a text file

- In the left panel, you can see a button View. Below this button, there is a pop-up menu proposing different viewing options for the database you selected. Select FastaSeqs and click View.
- Click on the Save button in the left panel. This leads you to another form, displaying the saving options. Select the option Text file, and check that the output format is set to FastaSeqs.
- Now click Save. You will be prompted to indicate the folder and file name. Let us save the result in a folder Desktop/bioinfo/Zinc_cluster, in a file Saccharomyces_Zinc_cluster.fasta.
- Perform the same query for Schizosaccharomyces pombe and save the result in a separate file. We will use these files for the following tutorials.

Questions

- Compare the number of entries selected with the query on "description", on "comments" and on either of those fields.
- Do you feel confident about your retrieval?
- How would you refine the query to obtain a more complete list of transcription factors?

Selecting multiple output fields

There are various ways to customize the fields to be returned. The simplest way is to select them from the list on the query form.

Exercise (b) Dealing with different formats

Get the information for the HOXA11 gene (accession number = AF039307) in the EMBL database. Take a look at the different fields and the different types of information presented about the gene. You should get an entry whose fields include:

- ID: identifier
- AC: accession number
- SV: sequence version
- DE: description
- OS: organism
- FT: feature
- SQ: sequence
- etc.
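To make the role of the two-letter line codes concrete, here is a minimal sketch that groups the lines of an EMBL-style flat-file entry by their code. It assumes the usual layout where the code occupies the first two columns and the content starts at column 6, with XX as a spacer line and // as the end-of-entry marker. The record below is entirely invented (it is not the AF039307 entry), and the function name is hypothetical.

```python
from collections import defaultdict

# Invented toy entry in EMBL-like layout; NOT the real AF039307 record.
TOY_RECORD = "\n".join([
    "ID   TOY0001; SV 1; linear; genomic DNA; STD; UNC; 24 BP.",
    "XX",
    "AC   TOY0001;",
    "XX",
    "DE   Invented description line for illustration only",
    "XX",
    "OS   Examplus hypotheticus",
    "XX",
    "FT   source          1..24",
    "FT   CDS             1..24",
    "XX",
    "SQ   Sequence 24 BP;",
    "     atggctagct agctagctag ctaa                                24",
    "//",
])


def parse_embl_fields(record):
    """Group the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in record.splitlines():
        code = line[:2]
        # Skip spacer lines (XX), the terminator (//) and sequence data lines,
        # which start with blanks instead of a line code.
        if code in ("XX", "//") or not code.strip():
            continue
        fields[code].append(line[5:].rstrip())
    return dict(fields)


fields = parse_embl_fields(TOY_RECORD)
for code, values in fields.items():
    print(code, values)
```

A GenBank record carries comparable information under spelled-out keywords such as LOCUS, DEFINITION and ACCESSION, which is the kind of difference the next step asks you to write down.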

Now, search for the same gene in GenBank. Take a look at the different format in which the data is presented. See this link for a description of the fields. Write down the differences.

Retrieve information from OMIM

OMIM (Online Mendelian Inheritance in Man) is a catalog of human genetic diseases and of the genes involved in them.

- Go to the NCBI web site and click on ENTREZ.
- Retrieve information for "early onset breast cancer" from the OMIM database and find out which genes are associated with the disorder. Which gene is the first to appear? On which chromosome is it located?
- Explore the type of information that is available in OMIM for this gene.

Linking


The next exercise illustrates how to link several databases.

Exercise

Find all the yeast genes coding for a metabolic enzyme (tip: start from the LENZYME database). Retrieve all Saccharomyces cerevisiae enzymes, together with their substrates and products.

Protocol

1. Select in UniProt all proteins from Saccharomyces cerevisiae.
2. When you have the result, click on the Link button in the left panel. The list of libraries is now displayed.
3. In Metabolic databases, select LENZYME (this is the enzyme section of LIGAND). To avoid massive data transfer, select the view *Names only* rather than the default view. Submit the link.
4. Link the new result to the database LCOMPOUND (select the view *Names only*).

Selecting custom fields across multiple databases

The previous query was interesting, but at each step we could only display the result for the target of the link (e.g. LENZYME), and we lost the information about the origin (e.g. UniProt). SRS allows you to go further, by defining custom views on linked databases.

5. In the tabs at the top of the form, click the View tab.
6. In the left panel, fill the View name box with "UniProt_substrates_products".
7. You can see two lists of databases. The left list is the origin database, the right one is the target database. Select UniProt-KB as origin and LENZYME as target. Click Create new view.
8. Select the output fields of your choice, within UniProt-KB and LENZYME (do not forget to include LENZYME substrates and products).
9. Click Save view (top of the form).
10. Open the Results form. Check the previous query where you linked UniProt to LENZYME (when writing this tutorial, it selected [...] proteins).
11. Come back to the Query form. Select all Saccharomyces cerevisiae enzymes having the string "EC 2.4" in their description. Select your new view before submitting the query.

Additional exercises

The following problems can be solved on the SRS server.

1. In UniProt, select all the proteins belonging to the species Saccharomyces cerevisiae.
2. Using the PATHWAY database (a mirror of KEGG), get all the yeast genes involved in galactose metabolism.
3. Find all the proteins of Escherichia coli for which there is a structure in PDB.
4. In UniProt, find all the enzymes with an aspartokinase catalytic domain.
5. Calculate, year by year, the number of entries submitted to UniProt during the last 10 years.
6. Calculate the frequency distribution of polypeptide lengths in Swiss-Prot, with a class interval of 100.

RESULT: The given biomolecular database exercise was done successfully, with appropriate results.
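For additional exercise 6 above, the binning step can be sketched as follows once the sequence lengths have been exported from the database. This is only an illustration of a frequency distribution with a class interval of 100: the function name is hypothetical, the classes are assumed to be 0-99, 100-199, and so on, and the lengths in the example are invented rather than taken from Swiss-Prot.

```python
from collections import Counter


def length_distribution(lengths, interval=100):
    """Count lengths per class; return (class start, class end, count) tuples.

    Classes are assumed to be [0, interval-1], [interval, 2*interval-1], ...
    Empty classes are omitted from the result.
    """
    counts = Counter((n // interval) * interval for n in lengths)
    return [(start, start + interval - 1, counts[start]) for start in sorted(counts)]


# Invented polypeptide lengths, for illustration only.
toy_lengths = [50, 99, 100, 150, 250, 881]
for start, end, count in length_distribution(toy_lengths):
    print(f"{start}-{end}\t{count}")
```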