Mol Biotechnol (2008) 38:165–177

DOI 10.1007/s12033-007-9003-x

REVIEW

In Silico Characterization of Proteins: UniProt, InterPro and Integr8

Nicola Jane Mulder · Paul Kersey · Manuela Pruess · Rolf Apweiler

Published online: 4 October 2007


© Humana Press Inc. 2007

Abstract Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

Keywords Bioinformatics · Databases · Protein sequences · Protein signatures · Annotation · Proteomics

N. J. Mulder (✉) · P. Kersey · M. Pruess · R. Apweiler
EMBL Outstation - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
e-mail: mulder@ebi.ac.uk

Introduction

Advances in DNA sequencing technologies have made the sequencing of whole genomes commonplace. However, the major problem associated with the influx of sequence data is the quick and reliable elucidation of protein function and large-scale analysis of whole proteomes (the protein component of genomes). The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB) [1], where curators try to add value to the raw sequences through biological data curation. The curators retrieve experimental information from the literature, but also make use of automated function prediction methods to supplement this information, particularly where there are no experimental data. These curated protein sequence records enter the highly curated UniProtKB/Swiss-Prot portion of the knowledgebase, and the rest of the sequences, derived from EMBL nucleotide sequence [2] entries and direct protein sequence submissions, remain in UniProtKB/TrEMBL, where they are either unannotated or have preliminary automatically derived annotation. In this article we will provide a detailed description of the main public protein sequence repositories and specialized sequence databases.

Traditionally, scientists use sequence similarity searches to compare a query sequence to those of known function, but this method has its limitations and relies on the quality of existing data. Alternative methods for protein sequence classification use protein signatures. A number of different databases developing protein signatures diagnostic for known protein families or domains have arisen. Each has its own focus, criteria and method for creating the signatures, and as a result also its own strengths and limitations. In addition, the sheer number of these databases is daunting for a bench scientist who is not necessarily qualified to understand the similarities, differences and idiosyncrasies of each.
Here we will explain these differences, and describe InterPro [3], an integrated resource that aims to solve the above-mentioned problem. InterPro unifies the major protein signature databases into a single, comprehensive resource with manual intervention by trained biologists to turn a database and software application into an understandable, usable tool for scientists worldwide. InterPro has been used by bench biologists analyzing a single gene product and by genome sequencing centers for the annotation of entire genomes.

The analysis and comparison of whole genomes provides a powerful tool for identifying unique or target genes in the reference organism. For example, a comparison between pathogenic and non-pathogenic organisms may shed light on which genes or proteins are responsible for pathogenesis if they are only found in the former organisms. The Integr8 resource [4] aims to present an analysis of completely sequenced genomes and proteomes for this purpose. This resource will also be described in more detail below.

Protein Sequence Databases

A range of new technologies in protein science make it possible to quickly identify large numbers of proteins in a complex, to map their interactions in a cellular context, to determine their location within the cell and to analyze their biological activities. Protein sequence databases play a vital role as a central resource for storing the data generated by these efforts and making them freely available to the scientific community. Data from large-scale experiments are often no longer published in a conventional sense but are deposited in a database. This means that protein sequence databases are the most comprehensive resource of information on proteins available to scientists.

In order to exploit the various resources fully, it is essential to distinguish between them and to identify the types of data they contain. Universal protein databases cover proteins from all species, while specialized data collections contain information about a particular protein family or group of proteins, or related to a specific organism. Universal protein sequence databases can be further subdivided into two categories: simple archives of sequence data, or sequence repositories, where the data are stored with little or no manual intervention in the creation of the records; and expertly curated databases, in which the original data are enhanced by the addition of further information derived from sources such as published scientific literature. One family of protein sequence databases, the Universal Protein Resource (UniProt), will be discussed in detail.

Sequence Repositories

A number of protein sequence databases act as repositories of protein sequences. These databases add little or no additional information to the sequence records they contain and generally make no effort to provide a non-redundant collection of sequences to users. One example of this is the GenBank Gene Products Data Bank, or GenPept, which is produced by the National Center for Biotechnology Information [5]. The entries in the database are derived from translations of the sequences contained in the nucleotide database maintained collaboratively by DDBJ [6], EMBL Nucleotide Sequence Database [2] and GenBank [7], and contain minimal annotation which has been extracted primarily from the corresponding nucleotide entry. The entries lack any additional annotation and the database does not contain proteins derived from amino acid sequencing. It presents a redundant view of the protein world, which means that each protein may be represented by multiple records and no attempt is made to group these records into a single database entry.

NCBI's Entrez Protein [5] is another example of a sequence repository. The database contains sequence data translated from the nucleotide sequences of the DDBJ/EMBL/GenBank database as well as sequences from UniProt, RefSeq and the Protein Data Bank (PDB) (all described below). The database differs from GenPept in that many of the entries contain additional information, but much of the annotated data has been extracted from curated databases, so little novel information is added to the entries that cannot be found in other data collections. As with GenPept, the sequence collection is redundant.

A more ambitious approach is taken by the Reference Sequence (RefSeq) collection produced by the NCBI [8]. The aim of the project is to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. NCBI staff and collaborators provide ongoing curation, with review status indicated on each record. However, the majority of the records are automatically generated with minimal manual intervention, so the database is closer to a sequence repository than to any of the curated databases discussed below.

Universal Curated Databases

Although repositories are an essential means of providing the user with sequences as quickly as possible, it is clear that, when additional information is added to a sequence, this greatly increases the value of the resource for users. The curated databases take basic sequence information and enrich it by adding information from a range of sources such as scientific literature. This information is extracted and validated by expert biologists before being added to the databases and this means that the data in these collections can be considered to be highly reliable. There is
also a large effort invested in maintaining non-redundant datasets by compiling all reports for a given protein sequence into a single record.

PIR, Swiss-Prot and TrEMBL: A Short History

The oldest universal curated protein sequence database is the Protein Information Resource Protein Sequence Database (PIR-PSD), established in 1984 as a successor to the original NBRF Protein Sequence Database. The latter was developed over a 20-year period by the late Margaret O. Dayhoff and published as the 'Atlas of Protein Sequence and Structure' from 1965 to 1978 [9]. Later it became a joint effort by Georgetown University Medical Center and the National Biomedical Research Foundation in Washington D.C. Another universal curated protein sequence database, Swiss-Prot, was established in 1986 as a collaboration between the European Bioinformatics Institute (EBI) and the Swiss Institute of Bioinformatics (SIB). In 1996, the TrEMBL (Translation from EMBL) database was introduced as a supplement to Swiss-Prot to make new sequences available as quickly as possible while preventing the dilution of the high-quality annotation in Swiss-Prot. It consisted of computer-annotated entries derived from the translation of all coding sequences in the DDBJ/EMBL/GenBank nucleotide sequence database which were not yet included in Swiss-Prot. To ensure completeness, it also contained a number of protein sequences extracted from the literature or submitted directly by the user community.

The increasing volume and complexity of protein data has meant that the protein databases have had to find ways to adapt to this data influx so that they can continue to play a central role in the proteomics era. Therefore, in 2002 the Swiss-Prot + TrEMBL groups at SIB and EBI and the PIR-PSD group at the Georgetown University Medical Center and National Biomedical Research Foundation joined forces as the UniProt Consortium, substantially funded by the National Institutes of Health (NIH, USA). The Universal Protein Resource (UniProt) resulted, which contains the now-leading universal curated protein sequence database, the UniProt Knowledgebase or UniProtKB [1].

The Universal Protein Resource

The Universal Protein Resource (UniProt) (http://www.uniprot.org) is the central resource for storing and interconnecting information from large and disparate sources and the most comprehensive catalogue of protein sequence and functional annotation. It has four components optimized for different uses:

• The UniProt Archive (UniParc) provides a non-redundant sequence collection of all publicly available protein sequence data.
• The UniProt Knowledgebase (UniProtKB) combines the work originally done with the expertly curated Swiss-Prot, TrEMBL and PIR-PSD databases.
• The UniProt Reference Clusters (UniRef) offer non-redundant views of the data contained in the UniProt Knowledgebase and UniParc.
• The UniProt Metagenomic and Environmental Sequences (UniMES) section stores metagenomic data as it becomes available.

UniParc

The UniProt Archive (UniParc) is the main sequence storehouse and is a comprehensive repository that reflects the history of all protein sequences [10]. UniParc houses all new and revised protein sequences from various sources to ensure that complete coverage is available at a single site. It includes not only UniProtKB but also translations from the EMBL-Bank/DDBJ/GenBank Nucleotide Sequence Databases, the Ensembl database of animal genomes [11], the International Protein Index (IPI) (see IPI section), the Protein Data Bank (PDB) [12], NCBI's Reference Sequence Collection (RefSeq), some model organism databases and protein sequences from the European, American and Japanese Patent Offices. To avoid redundancy, sequences are handled as strings: all sequences that are 100% identical over their entire length are merged, regardless of source organism. New and updated sequences are loaded on a daily basis, cross-referenced to the source database accession number and provided with a sequence version that increases upon changes to the underlying sequence. The basic information stored within each UniParc entry is the identifier, the sequence, a cyclic redundancy check number, source database(s) with accession and version numbers, and a time stamp. In addition, each source database accession number is tagged with its status in that database, indicating if the sequence still exists or has been deleted in the source database. UniParc records are designed to be without annotation, since annotation is only true in the real biological context of the sequence: proteins with the same sequence may have different functions depending on species, tissue, developmental stage, etc.

UniProtKB

The UniProt Knowledgebase (UniProtKB) consists of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL.
The former contains manually annotated records with information extracted from literature and curator-evaluated computational analysis. To achieve accuracy, annotations are performed by biologists with specific expertise. Information including function, catalytic activity, subcellular location, disease, structure and post-translational modifications is annotated. An important part of the annotation process involves the merging of different reports for a single protein. After a careful inspection of the sequences, the annotator selects the reference sequence, does the corresponding merging and lists the splice and genetic variants along with disease information when available. Any discrepancies between the different sequence sources are also annotated. Cross-references are provided to the underlying nucleotide sequence sources as well as to many other useful databases including organism-specific, domain, family and disease databases. UniProtKB/TrEMBL contains high-quality computationally analyzed records enriched with automatic annotation and classification. The computer-assisted annotation is created using automatically generated rules such as those in Spearmint [13] or manually curated rules based on protein families, including HAMAP family rules [14], RuleBase rules [15] and PIRSF classification-based name rules and site rules [16, 17]. UniProtKB/TrEMBL contains the translations of all coding sequences present in the EMBL/GenBank/DDBJ Nucleotide Sequence Databases, the sequences of PDB structures and data derived from amino acid sequences that are directly submitted to the UniProtKB or scanned from the literature. Some types of data are excluded, such as DDBJ/EMBL/GenBank entries that encode small fragments, synthetic sequences, most non-germline immunoglobulins and T-cell receptors, most patent sequences and some highly over-represented data. Records are selected for full manual annotation and integration into UniProtKB/Swiss-Prot according to defined annotation priorities.

UniRef

The UniProt Reference Clusters (UniRef) databases provide three clustered sets (UniRef100, 90 and 50) of sequences from UniProtKB and selected UniParc records in order to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences from view. The UniRef100 database combines identical sequences and sub-fragments with 11 or more residues into a single UniRef entry, which displays the sequence of a representative protein, with the accession numbers of all the UniProtKB entries within the cluster and links to the corresponding UniProtKB and UniParc records. UniRef90 and UniRef50 are built by further clustering UniRef100 sequences with 11 or more residues using the CD-HIT algorithm [18] such that each cluster is composed of sequences that have at least 90% or 50% sequence identity, respectively, to the representative sequence. Selection of the representative sequence in each UniRef cluster is based on the ranking of all the sequences in the cluster, using as criteria the quality of the entry, meaningful name, organism, and length of the sequence. The UniRef databases provide up-to-date collections of sequences. UniRef100 is the most comprehensive and non-redundant protein sequence dataset. UniRef90 and UniRef50 yield a database size reduction of 40% and 65%, respectively, providing for significantly faster sequence similarity searches. In addition, UniRef databases reduce the bias in sequence searches by providing a more even sampling of sequence space.

UniMES

The UniProt Metagenomic and Environmental Sequences (UniMES) section has been set up to deal with the new field of metagenomics, the large-scale genomic analysis of microbes recovered from environmental samples (as opposed to laboratory-grown organisms, which represent only a small proportion of the microbial world). UniMES currently contains the data from the Global Ocean Sampling Expedition (GOS), which was originally submitted to the databases of the International Nucleotide Sequence Database Collaboration (INSDC). The initial GOS dataset is composed of 28 million DNA sequences from oceanic microbes, from which nearly 6 million proteins are predicted. By combining the predicted protein sequences with automatic classification by InterPro, a resource for protein families, domains and functional sites, which will be described in detail in the "Protein sequence classification" section, UniMES uniquely provides free access to the array of genomic information gathered from the sampling expeditions, enhanced by links to further analytical resources.

Specialized Databases

In addition to the universal protein sequence databases, there are numerous specialized databases available to the life science community. Some are devoted to one particular aspect of proteins or protein groups or families, or to a specific organism, while others seek to consolidate and exploit already existing resources to their full potential. These vary in size and in the scope of the data they contain. One example of the former type of database is the Protein Data Bank (PDB), which archives three-dimensional structural data. The Gene Ontology (GO) [19] is a dynamic controlled vocabulary that can be applied to all organisms, even while knowledge of gene and protein roles in cells is still
accumulating and changing. GOA, the Gene Ontology Annotation project [20], annotates proteins to GO terms. IntAct [21] stores protein interaction data. Other examples are protein family databases and model organism databases. On the other hand, InterPro is an example of an integrated protein resource combining a number of databases that use different methodologies and a varying degree of biological information on well-characterized proteins to derive protein signatures for protein families, domains and sites. This database and its member databases are described in the "Protein sequence classification" section below.

Protein Family Databases

Protein family databases contain information related to a specific family or group of proteins. These databases are generally maintained by experts in the field, and due to the restricted nature of the data they contain, they are able to offer a finer level of granularity than may be possible in the universal databases. An example of such a database is MEROPS, an information resource for peptidases and their inhibitors [22]. A summary page is provided for each peptidase and this describes the classification and nomenclature of the protein and offers links to supplementary pages showing sequence identifiers, any known structures, and literature references. The proteins are classified using a hierarchical, structure-based approach in which each peptidase is assigned to a family on the basis of statistically significant similarities in amino acid sequence, and families that are thought to be homologous are grouped together in a clan. Information is also provided about naturally occurring peptidase inhibitors.

Organism-specific Databases

Model organisms are an invaluable tool for understanding the basis of human diseases and biological processes. Increasing amounts of genetic, phenotypic and protein-related information are being generated in a variety of model organism systems, and a number of databases dealing specifically with the biology of these organisms have been established to capture this information and provide it to the scientific community. Examples include FlyBase [23], which covers Drosophila species, the Mouse Genome Database [24], which contains information related to the laboratory mouse, WormBase [25], which covers Caenorhabditis elegans, the Saccharomyces Genome Database (SGD) [26], for Saccharomyces proteins, and The Arabidopsis Information Resource (TAIR) [27]. These databases play an important role in providing integrated access to the data available about these organisms.

Protein Sequence Classification

Grouping of related objects is one of the best ways to make sense of very large data sets, and protein sequences are no different. The classification of proteins into families helps us to reduce the overall problem space, as well as to predict protein function. Proteins with similar sequences should have similar functions, so if related sequences are grouped and one or more of these sequences has a known function, this function can be inferred for the uncharacterized sequences in the group. Various methods for classifying protein sequences have been developed, but arguably the best methods use protein signatures, which mathematically describe the common properties of related sequences.

Protein Signature Methods

To create a protein signature, a number of related protein sequences are required. It is not possible to identify common features of a protein family or domain from one sequence; however, an alignment of related sequences can be used to create a consensus for the protein family, or to identify conserved domains or residues. The highly conserved areas are likely to be involved in the common function of the related sequences; for example, highly conserved residue triads may indicate an active site or binding site. These conserved areas diagnostic of a protein family, domain or functional site can be used to develop protein signatures using several different methods. A query protein sequence is then searched against these signatures to identify which family the protein belongs to.

The simplest protein signature method uses regular expressions to show patterns of conserved amino acid residues. The regular expression specifies which amino acid(s) may or may not occur at each position. This core pattern is tested against a set of annotated sequences and optimized until it only hits the correct sequences in the test set [28]. Regular expressions are useful for identifying highly conserved active sites and binding sites, but are limited in their ability to find more distantly related sequences due to their lack of flexibility in recognizing single residue variations.

Another widely used protein signature method is a profile. Profiles are built from multiple sequence alignments, and are tables of position-specific amino acid weights and gap costs, or matrices describing the probability of finding an amino acid at a given position in the sequence [29]. The numbers in the table (scores) are used to calculate similarity scores between a profile and a sequence for a given alignment. For each set of sequences a threshold score is calculated to determine whether the query sequence is related to the original set of sequences in
the alignment. An additional method derived from profiles is a Hidden Markov Model (HMM), which essentially is a statistical profile based on probabilities rather than scores [30, 31]. HMMs are most commonly built with the HMMER package, written by Sean Eddy [32], which allows a user to create an HMM from a sequence alignment and to search a database of sequences against the HMM without the requirement of understanding how the HMMs work. Profiles and HMMs are powerful protein signature methods and compensate for the limitations of regular expressions in that they generally cover larger areas of the sequence, and are capable of identifying more divergent family members.

Protein Signature Databases

There are a number of protein signature databases in the public domain which use the methods described above to produce diagnostic signatures for protein families and domains, including the sequence-based PROSITE [33], PRINTS [34], Pfam [35], SMART [36], TIGRFAMs [37], PIRSF [16] and PANTHER [38], and the structure-based SUPERFAMILY [39] and Gene3D [40]. There are also databases that use sequence clustering and alignment methods, for example ProDom [41], which clusters all proteins in the database into families.

PROSITE is a database of both regular expressions and profiles and has a primary focus on signatures for the annotation of UniProtKB/Swiss-Prot proteins. PRINTS uses a variation on profiles to produce fingerprints, which are a collection of motifs along the conserved regions of a protein sequence. Fingerprints are particularly useful for diagnosing receptors and ion channels and for their high granularity, showing different levels in a protein family hierarchy.

Pfam, SMART, TIGRFAMs, PIRSF, PANTHER, SUPERFAMILY and Gene3D all use HMMs but in different ways. For example, each database may have different criteria for assembling its multiple sequence alignments, different means of calculating its thresholds, or different methods for post-processing its results. Pfam is driven by coverage and aims to have HMMs to represent all sequence space. Pfam includes both families and domains in its sets and where possible bases its HMMs on protein structural information. There are two parts to Pfam, the curated Pfam-A and the automatically generated Pfam-B, which covers those families not represented in Pfam-A. SMART, SUPERFAMILY and Gene3D tend to create HMMs for protein domains rather than families. SMART focuses predominantly on signalling, extracellular and chromatin-associated proteins, while SUPERFAMILY and Gene3D base their domains on structural domains in the SCOP [42] and CATH [43] databases respectively.

TIGRFAMs, PIRSF and PANTHER are protein family orientated, although TIGRFAMs do contain some domain HMMs. TIGRFAMs are also used for annotation, particularly of microbial genomes, and as a result, their HMMs hit mostly bacterial proteins. They focus on creating HMMs based on actual function rather than just sequence similarity, so all proteins belonging to a TIGRFAMs family must be "equivalogs", i.e. have the same function. They are therefore quite specific. PIRSF, on the other hand, only produces HMMs representing protein families, mostly at the superfamily level. The HMMs cover the full length of the protein sequences and are derived for those proteins containing the same domain composition with very similar sequence lengths. Multifunctional proteins are represented by different HMMs from those representing each part of the multifunctional protein, and no protein should match more than one unique PIRSF HMM. PANTHER creates HMMs from sequences derived from higher eukaryotic organisms and thus has few bacterial hits.

Integrating the Signatures in InterPro

While all the databases described above have significant overlaps in the protein families and domains they predict, they arrive at these overlaps by different means. Using just one of the databases to analyze a query sequence could result in no hits if the sequence is outside its range of coverage, and also makes one vulnerable to any limitations the chosen database may have. Conversely, however, trying to use all of them at the same time but from the separate sites may lead to confusion in trying to rationalize the different results obtained at each. This problem was resolved by the InterPro resource, which integrates all the protein signature databases into one. Signatures from PROSITE, PRINTS, Pfam, ProDom, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER that describe the same domain, family, repeat, active site, binding site or post-translational modification are grouped into single InterPro entries with unique accession numbers. InterPro entries that match a subset of proteins matched by other entries are related to each other as parent/child (family and subfamily) or contains/found in (domain composition) relationships.

InterPro Annotation

Each InterPro entry contains high-quality manual annotation providing useful information on the protein family, domain etc. in question. This is in the form of a name (short description), abstract and cross-references. Literature references cited in the abstract are stored in a reference
field in each entry. Mapping of InterPro entries to GO terms [19], where possible, provides additional functional annotation. The GO project is an effort to provide a universal ontology for describing gene products across all species, and has developed a set of terms in a directed acyclic graph under the three ontologies: molecular function, biological process and cellular component. Where all proteins matching an InterPro entry have the same function, the entry can be mapped to the appropriate GO term describing that function. InterPro entries are manually mapped to GO terms taking into account information in the abstract of the entries and annotation of proteins in the match lists. The associated GO terms should apply to all proteins with true hits to all signatures in the InterPro entry. The mappings provide an automatic means of large-scale GO characterization of proteins.

Additional annotation is provided in InterPro entries through a number of different fields, including the "Taxonomy", "Interaction", "Structural links" and the "Database links" fields. The "Taxonomy" field aims to provide an 'at a glance' view of the taxonomic range of the sequences associated with each InterPro entry. The lineages were carefully selected to provide a view of the major groups of organisms. The circular display has the taxonomy-tree root as its centre with the model organisms populating the outermost circle and nodes of the taxonomy-tree placed on the inner circles. The nodes themselves are either true taxonomy nodes and have an NCBI taxonomy ID or are artificial nodes created for this display, of which there are three: 'Unclassified', 'Other Eukaryota (Non-Metazoa)' and the 'Plastid Group'. The number of sequences associated with each lineage is displayed, with links to the InterPro matches for these proteins or the option to retrieve the FASTA sequences. The "Interaction" field provides links to example proteins that are known to interact with the InterPro domain through

software package, which combines different protein signature recognition methods from each of the InterPro member databases into one package. The software is used to compute all matches for each entry, and is also available for user query sequence searches. For a query sequence run through InterProScan via a web interface, the software outputs the resulting InterPro matches in a graphical format. These can also be viewed as a table, which lists the protein signature matches, positions of the matches, relationships for InterPro entries hit and GO terms where available from InterPro2GO mappings. For bulk sequences, InterProScan can also be downloaded and run locally, or programmatic access is provided via web services.

In InterPro entries, the protein matches may be viewed in tabular and different graphical formats. The table lists the protein accession numbers and the positions in the amino acid sequence where each signature from that InterPro entry hits. The match list may be displayed in a detailed graphical view, in which the sequence is split into several lines, one for each hit by a unique signature (the bars are color coded according to the member database). The graphical overview splits the protein sequence into different lines for each InterPro entry matched and displays the consensus domain boundaries of all signatures within each entry. The graphical views include hits to all signatures from the same and other InterPro entries, so that for each sequence the domain and/or motif organization can be seen at a glance. Where structures are available for proteins, there is a link from the graphical view to the corresponding PDB [12] structures and a separate line in the display showing the SCOP and/or CATH curated matches on the sequence as white striped bars. Some proteins have modelled structures from SWISS-MODEL [45] or MODBASE [46], which are shown separately in match views, again as white striped bars.
protein-protein interactions. Through its matches to protein sequences, InterPro
The ‘‘Structural links’’ field provides information on provides a powerful tool for protein sequence classification
curated structure links based on the correspondence and function prediction. It has been used in many genome
between the proteins matching the InterPro entry and those annotation projects, as well as by UniProt curators for
proteins of known structure and belonging to SCOP or individual protein sequence annotation.
CATH superfamilies. The links are only included when the
structural domains overlap considerably with one or more
of the InterPro signatures on the protein sequence. Finally, From Genes and Proteins to Genomes and Proteomes
the ‘‘Database links’’ field provides links to a number of
external databases such as PANDIT, MEROPS, CAZy, The availability of large quantities of protein sequence data
Enzyme Commission numbers, etc. changes not just the quantity of analysis that can be per-
formed, but also the types of analysis it is possible to
perform. As detailed experimental investigation cannot be
InterProScan performed on all predicted proteins, functional annotation
of these proteins is often inferred from automatically
In addition to annotation, each InterPro entry contains a list detectable patterns in sequence, meaning that confirmation
of precomputed matches to UniProtKB proteins. Protein of the putative protein’s actual expression and information
matches are calculated using the InterProScan [44] about any novel or exceptional functionality (or non-
functionality) it may possess, will not be available. However, the possession of a complete set of proteins for a species allows us, to the extent that the predictions are correct, to draw inferences from absence, to identify likely orthologues between species, and to describe the nature of a species in terms of the composition of its genome. Annotated genome sequence thus serves as the central platform for the evolving field of systems biology of whole organisms, used in both experimental design and interpretation.

There are a number of difficulties that a user can encounter in attempting to obtain complete, non-redundant proteome sets from protein sequence databases. First, some databases may contain redundant sequences by design (there are many senses in which two sequences may be redundant) or by “accident”, that is, the volume of redundant data may be too large to reliably identify and update. For example, the nucleotide sequence database maintained jointly by the EMBL, Genbank, and the DDBJ is non-redundant as a representation of submissions from individual experimentalists, i.e. a given submitter should update, not supplement, sequences they now believe to be incorrect, but it is redundant for protein sequences. UniProtKB is non-redundant for protein sequences at the species level, but identification of variants, splice isoforms, fragmentary sequence or errors has to be done manually and thus the database retains a certain element of redundancy in the wider sense. Another problem that may affect users is the incompleteness of a given data source, if, for example, different data sets have been submitted to different resources. A further problem is that some, although not all, of the proteins in a proteome may have been previously studied, providing information that needs to be integrated with, not replaced by, new prediction data made from the complete genome sequence. The UniProt non-redundant Complete Proteome sets, and IPI, are two data sets which have been produced in response to this problem.


UniProt Non-redundant Complete Protein Sets

As shown earlier, the UniProtKB consists of two sections, a manually curated part, UniProtKB/Swiss-Prot, and an automatically produced supplement, UniProtKB/TrEMBL. In UniProtKB/Swiss-Prot, data is subject to manual review, “redundant” sequences are merged according to the definition that there should only be one Swiss-Prot entry per gene, and experimental information obtained from published reports is available. UniProtKB/TrEMBL contains mainly data submitted by users to the EMBL/Genbank/DDBJ databases, although the annotations provided by submitters are supplemented or replaced by automatic predictions derived through sequence classification based on InterPro and inferences from manual annotation in similarly classified UniProtKB/Swiss-Prot entries.

In addition, special procedures are in place to handle data from species with completely deciphered genomes. The merging of extant, manually curated entries with newly submitted data associated with a genome sequencing project is a priority for UniProt curators; this allows for the production of non-redundant data sets that still contain all pertinent experimental information. Data that remains in UniProtKB/TrEMBL is quickly triaged: errors in references are corrected, missing gene names are inserted, extant gene names are classified by type, taxonomic classification is standardized, and cross-references to resources maintained by the organization responsible for the original generation of the sequence and/or annotation are added. A keyword ‘Complete proteome’ is also added, indicating that the entry can be considered as part of a set that collectively describes the complete proteome of an organism. Entries from bacterial genomes are then subject to annotation by a specialist pipeline, HAMAP [14], designed to identify families of bacterial proteins and apply appropriate annotation. Following manual review, entries identified as family members are then promoted to UniProtKB/Swiss-Prot.

In general, the level of redundant information is greater for proteomes of higher species, and the relationship between biologically real variants is more complex. For a number of species, specialist model organism databases, as described above, may maintain information about the various representations of the products of a single gene that exist in other databases. Data is taken from both these databases and used to ensure that a single entry per gene is chosen for inclusion in the UniProt proteome sets.

Another problem for a number of well-studied organisms is that revisions to the genome sequence, and/or revisions to the protein predictions made from it, have not necessarily always been submitted back into the standard public repositories. However, specialist model organism databases usually maintain this information. UniProt monitors some of these resources (including the Saccharomyces Genome Database and The Arabidopsis Information Resource) and builds new entries corresponding to any data that appears to be missing, to ensure that the UniProt proteome sets are complete. Recently this approach was extended to include predictions made from the human genome sequence by the Ensembl protein annotation pipeline. It is planned to further include “missing” (i.e. new) Ensembl predictions for all species well annotated by this resource.

UniProt non-redundant complete proteome sets can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/integr8.
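
The idea of keeping a single entry per gene in a complete proteome set can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Entry` record, its field names, the example accessions, and the tie-breaking rule (prefer a manually reviewed entry, then the longer sequence). The actual UniProt procedure also relies on model organism databases and curator review, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified protein record (not a real UniProtKB schema)."""
    accession: str   # made-up accession for illustration
    gene: str        # gene identifier the entry is assigned to
    reviewed: bool   # True for a manually curated (Swiss-Prot-like) entry
    length: int      # sequence length in amino acids


def one_entry_per_gene(entries):
    """Keep one representative entry per gene.

    Assumed rule: a reviewed entry beats an unreviewed one;
    among equals, the longer sequence wins.
    """
    best = {}
    for entry in entries:
        current = best.get(entry.gene)
        if current is None or (entry.reviewed, entry.length) > (current.reviewed, current.length):
            best[entry.gene] = entry
    return best


entries = [
    Entry("A0001", "geneA", False, 350),
    Entry("A0002", "geneA", True, 340),   # reviewed, so preferred for geneA
    Entry("B0001", "geneB", False, 120),
    Entry("B0002", "geneB", False, 180),  # longer, so preferred for geneB
]
proteome = one_entry_per_gene(entries)
for gene, entry in sorted(proteome.items()):
    print(gene, entry.accession)
```

A dictionary keyed by gene keeps the selection to a single pass over the entries; a different preference rule would only change the tuple used in the comparison.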


IPI

For many of the most studied species, the level of redundancy in gene and transcript predictions is very high. For example, as of August 2007, the UniProt Knowledgebase contains 71,479 human entries, all representing distinct sequences (and with information about approximately 9000 additional splice isoforms too). By contrast, the latest Ensembl build predicts the existence of just 43,570 protein-coding transcripts (some of which may encode the same protein as each other) from 43,570 genes. While some of the other sequences in UniProtKB may contain information of genuine biological interest, others may be fragments or erroneous predictions; others may contain information that is interesting only to certain communities (for example, variants derived from SNPs). Additionally, the Ensembl database is not the only database containing predictions made from the complete human genome sequence. The NCBI’s RefSeq project maintains a set of human protein sequences corresponding to their prediction of the complete human genome (this contains 29,569 entries) and numerous other groups make their own predictions, overlapping to a greater or lesser extent. The CCDS project (http://www.ncbi.nlm.nih.gov/CCDS/) is an attempt by Ensembl, RefSeq and the University of California, Santa Cruz to agree on a common set of coding sequences. The current CCDS set contains a total of 17,700 sequences, but unfortunately there is a wide divergence between predictions outside this core set.

To deal with this problem, IPI (the International Protein Index) [47] was conceived. IPI (http://www.ebi.ac.uk/IPI) is a database built from a number of distinct sources in which attempts are made to remove redundancy so as to include one entry per distinct gene product; each entry is cross-referenced to the source databases. While IPI aims at non-redundancy, the method used for building the database places a higher importance on completeness. Non-identical sequences can be merged into a single IPI entry if there is a reciprocal best match between a protein in one database and a protein in another, but IPI does not generally assert that a given protein prediction is wrong. It therefore comprises a maximal subset of all predictions but with some sequence and annotation-derived filtering for redundancy. It has been used as a database for protein identification from mass spectrometry data, where users may be keen not to miss a potential hit, but also wish to derive statistically significant results, which can be impeded by the use of data sets inflated by redundancy. IPI is also of use as a supplement to UniProtKB for a few species where the latest predictions have not yet been incorporated in that resource. The database is currently produced for a total of seven species in all, including six vertebrates and Arabidopsis thaliana. For most lower species, UniProtKB itself is complete and up-to-date and there is no need to supplement it with information from additional resources.


Genome Reviews

The Genbank/EMBL/DDBJ nucleotide sequence databases are repositories of stored information. However, this information may become out-of-date over time, because it is inferred using the contents of this and other databases at the time of annotation. Additionally, different submitters may have different interests, in terms of what they choose to annotate, quality standards and preferences for how to express themselves within the possibilities allowable in the rich data structures these databases support. This can cause problems for people wishing to use this data for comparative genomics: information may be incorrect or out of date, or represented in different ways in different entries.

Genome Reviews [48] (http://www.ebi.ac.uk/GenomeReviews) is a database of information about organisms with completely deciphered genomes. In Genome Reviews, sequences and annotations are taken from a variety of primary sources, and improved by importing the latest manual and automatic information derivable from UniProtKB, and additionally supplemented by information calculated by sequence analysis. All imported data is tagged to indicate the primary data source from which it was imported, allowing users to select their annotation according to source. The data is then distributed in a number of formats, including EMBL-format files, and an (Ensembl-like) MySQL database. An Ensembl-like Genome Browser is also available for interactive access. The EMBL format files not only contain additional information calculated since the original submission, but also represent all information in a standardized way, making Genome Reviews an ideal database for use in comparative genomic analysis.

Data is incorporated into Genome Reviews in three main ways. First, annotation is imported through the use of pre-existing cross-references between databases that describe the same entities. This allows the import of data from references that are more intensely curated, such as UniProtKB. Second, certain predictions are made directly from the DNA sequence, for example, predictions of non-protein coding genes are made using tRNAScan-SE and RFAM where no annotations are present in the original submission. Third, where information in the curated protein sequences is in conflict with the original sequence, the protein and DNA sequences are compared using the utilities BLAST [49] and ALIGN0 [50] to identify and describe the conflicts such that they can be annotated in the Genome Reviews flat file.

Gene sets are also available for all species in Genome Reviews, providing DNA sequences (complete with
annotated features) in EMBL and FASTA formats for each gene. These are cross-referenced to the UniProt entries that define the corresponding proteome sets. The current scope of Genome Reviews is bacteria, archaea, bacteriophage and a small number of lower eukaryotes, but it is aimed to expand the database to cover all non-vertebrate genomes.


Integr8

The Integr8 web portal [4] (http://www.ebi.ac.uk/integr8) provides a single point of entry to information about complete genomes. Available data includes DNA sequences from databases including the EMBL Nucleotide Sequence Database, Genome Reviews and Ensembl, protein sequences from databases including the UniProtKB and IPI, statistical genome and proteome analysis performed using InterPro, CluSTr [51], and GOA, and information about orthology, paralogy and synteny. Integr8 provides tools to enable this data to be searched, browsed, compared and contextualized, and easy access to files where the data representing complete genomes and their constituent components can be downloaded and used for further, large scale analysis.

Integr8 currently covers all cellular organisms with completely sequenced genomes, a total of 583 species (including 490 bacteria, 42 archaea and 53 eukaryotes) as of August 2007. This represents an increase of over 200% from just 193 species in July 2004.


Gene, Species and Sequence Search

A simple search form available on every page of the Integr8 web application provides access to summary information relevant to each species. When a species is selected, it is used as the default species for any subsequent gene or sequence search. Additionally, it is possible to select any portfolio of species to search against. A graphical query interface allows users to browse species alphabetically or according to their position in the taxonomic hierarchy, and allows species to be added to the portfolio individually or in groups. For example, one can add/remove all species below a single taxonomic node to/from the portfolio with a single mouse click. One can search for a gene or sequence within all species, the currently selected species portfolio or a new selection. Gene searching is simply carried out with any text string that might have been associated with one’s gene or protein, e.g. name, description, EC number. Additionally, if one types in text that comprises the first part of a term from GO, a menu of potential auto-completes will appear, allowing users to search for genes annotated with a given term without having prior knowledge of exactly what term is used to describe a particular function, process or cellular component. Another advantage of searching with GO terms is that the search is recursive: if one searches with a general GO term, all genes that are annotated with a more specific term will also be retrieved. For all retrieved genes, Integr8 provides direct links to a number of useful views, including the Integr8or (described below), and an Ensembl-like scalable genome browser for visualizing genes in genomes covered by Genome Reviews. Sets of UniProtKB entries describing the gene products can also be downloaded.

Integr8 also offers a sequence search facility, the Inquisitor. The Inquisitor analyses submitted sequences using the protein classification tool InterProScan in combination with a FASTA search specific to the species against which the user has chosen to search. A single summary page provides an overview of the results of the analysis.


Species-centric Information

For each species, Integr8 provides a description, a list of recent publications, a list of the components of its genome, and information about the composition of these components (such as their length, average GC content, and the length and codon usage in the CDSs they contain), represented textually or graphically as appropriate. Additionally, all genomes are assigned a Genome Annotation Score, based on the apparent completeness of the annotation and the degree of experimental evidence that exists to support it. This allows the change in the degree of annotation of a genome over time to be measured.

Integr8 also provides information about the composition of each species’ proteome. Individual proteins are classified according to InterPro, GO and CluSTr, and an overview of the composition of each proteome constructed on these criteria is available. For example, for each proteome, users can identify the most common protein families and domains, proteins without close relatives, or clusters of related proteins unclassified by InterPro. GO classifications for each species are summarized using a reduced set of high-level terms (GO Slim), presenting an overview of proteome function even in species where more specific annotations might not be available. A major advance in the past year has been the doubling in the coverage of CluSTr, a database that categorizes proteins into a hierarchy of clusters based on overall sequence similarity. Individual hierarchies of clusters have now been prepared for each of 109 proteomes, enabling the relationship of all paralogous proteins to be analyzed in these species. Additional structural data is also available based on information derived from the PDB and HSSP
[52]. Integr8 also offers comparative analysis, whereby the composition of multiple proteomes can be compared. A total of 160 comparative analyses, each featuring between two and four related species, have been pre-compiled. Additional comparisons can be specified interactively.


Gene Centric Information

A key feature of Integr8 is a single page that summarizes what is known about a gene. Emphasis is placed on showing clearly the relationship between the different products of a gene, and what information (sequence, experimental evidence etc.) is known about each; also, information at the gene, transcript and protein level is clearly separated and displayed, and the gene can be visualized in its genomic context. Potential orthologues and paralogues of each protein, identified using information derived from the CluSTr database, can also be displayed. A number of these can then be selected, and either aligned, or the potentially syntenic regions around each homologue can be viewed in a stylized comparative view showing the relationship of protein structure as determined by InterPro analysis.


Download and Web services

The following data is available for download from Integr8: (i) Genome Reviews files (ii) UniProt complete proteome sets (iii) IPI data sets (iv) Files of InterPro matches for each proteome set (v) GOA files, files of GO annotations for each proteome set (vi) “Chromosome tables”, summary files with information about each complete genome in an easy-to-parse, tab-delimited format (vii) Orthologue files, identifying putative orthologues between each pair of species in Integr8 (ix) an XML dump of the complete database. Most of the function of the Integr8 web portal is mirrored by a set of web service methods to allow programmatic access to the database. Using standard technologies, this makes Integr8 available to programs written in most modern programming languages; clients for Java and Perl are provided, but clients for other languages can easily be generated from the published service description.


Discussion

With the advent of whole genome sequencing as a relatively low-cost and common scientific method, huge quantities of data have been and continue to be generated. The tools and resources mentioned in this review are on target to relieve, and have already relieved, some of the burden of the influx of raw sequencing data. Sequence repositories and curated databases are vital for providing the public with access to all known sequences, and adding value to that data in the case of the latter. Protein sequence databases, such as UniProtKB, act as integrated resources of protein information, and derived databases like UniRef provide smaller subsets of the data that are more manageable for searching. However, some researchers are interested only in smaller, more complete, and yet more specific subsets of the data, and for this they can go to resources like IPI. For non-redundant protein sets users can access data from RefSeq or Integr8 (for completely sequenced genomes).

As well as just providing protein sequences, it is necessary for these resources to provide some form of function associated with them. In UniProtKB/Swiss-Prot annotation is derived from the literature, but here, and for UniProtKB/TrEMBL, protein function is also predicted using tools like InterPro. Through the protein signatures it integrates, InterPro acts as a protein sequence classification system, which facilitates automatic function prediction. It has been used for annotation of numerous completely sequenced genomes, and in the analysis and comparison of complete proteomes as provided in Integr8. In addition to these protein family and domain analyses, Integr8 provides a useful source of whole genome analysis and comparisons where users can get a genome-, gene- or protein-centric view of a complete genome. As the name suggests, it integrates data for each organism from a variety of sources, including GO, CluSTr, UniProtKB, InterPro etc., and facilitates searching, downloading or simply viewing of precomputed and user-chosen data.

There is a shift in current research from the single gene to whole genome focus, and hypothesis-driven bottom-up approaches are being replaced by exploratory top-down investigations. Proteins are being studied within their contexts in a cell to determine what molecules they interact with and which biological systems they play a role in. This “systems biology” approach is far more relevant to real life and likely to provide a more in-depth understanding of the molecular biology of organisms than the single gene focus. To achieve this, researchers need access to large and complete data sets with as much functional information as possible. It is hoped that a combination of tools such as those described here will help biologists shed some light on the biological function of newly discovered proteins and systems. Only once this is done will it be possible to use the data to its full potential in medical or commercial applications.

Acknowledgements UniProt is mainly supported by the National Institutes of Health (NIH) grant 1 U01 HG02712-01. Additional support comes from the European Commission’s grant 021902RII3,
from the NIH grants 1R01HGO2273-01, HHSN266200400 061C, NCI-caBIGICR-10-10-01 and ITR-0205470, and the Swiss Federal Government. InterPro was funded by the award of grant number QLRI-CT-2000–00517 and in part by grant number QLRI-CT-2001000015 from the European Union under the RTD programme “Quality of Life and Management of Living Resources”. InterPro is a member database of the MRC-funded eFamily project. Genome Reviews and Integr8 have been funded or are funded, respectively, by the European Commission’s grants QLRICT-2001000015, and 021902RII3.


References

1. The UniProt Consortium (2007). The Universal Protein Resource (UniProt). Nucleic Acids Research, 35, D193–D197.
2. Kulikova, T., Akhtar, R., Aldebert, P., Althorpe, N., Andersson, M., Baldwin, A., Bates, K., Bhattacharyya, S., Bower, L., Browne, P., Castro, M., Cochrane, G., Duggan, K., Eberhardt, R., Faruque, N., Hoad, G., Kanz, C., Lee, C., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., Lorenc, D., McWilliam, H., Mukherjee, G., Nardone, F., Garcia-Pastor, M. P., Plaister, S., Sobhany, S., Stoehr, P., Vaughan, R., Wu, D., Zhu, W., & Apweiler, R. (2007). EMBL Nucleotide sequence database in 2006. Nucleic Acids Research, 35, D16–D20.
3. Mulder, N. J., Apweiler, R., Attwood, T. K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P. S., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A. N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J. D., Sigrist, C. J., Thomas, P. D., Valentin, F., Wilson, D., Wu, C. H., & Yeats, C. (2007). New developments in the InterPro database. Nucleic Acids Research, 35, D224–D228.
4. Kersey, P., Bower, L., Morris, L., Horne, A., Petryszak, R., Kanz, C., Kanapin, A., Das, U., Michoud, K., Phan, I., Gattiker, A., Kulikova, T., Faruque, N., Duggan, K., Mclaren, P., Reimholz, B., Duret, L., Penel, S., Reuter, I., & Apweiler, R. (2005). Integr8 and genome reviews: Integrated views of complete genomes and proteomes. Nucleic Acids Research, 33, D297–D302.
5. Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K., Chetvernin, V., Church, D. M., DiCuccio, M., Edgar, R., Federhen, S., Geer, L. Y., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D. J., Madden, T. L., Maglott, D. R., Ostell, J., Miller, V., Pruitt, K. D., Schuler, G. D., Sequeira, E., Sherry, S. T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R. L., Tatusova, T. A., Wagner, L., & Yaschenko, E. (2007). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 35, D5–D12.
6. Okubo, K., Sugawara, H., Gojobori, T., & Tateno, Y. (2006). DDBJ in preparation for overview of research activities behind data submissions. Nucleic Acids Research, 34, D6–D9.
7. Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Wheeler, D. L. (2007). GenBank. Nucleic Acids Research, 35, D21–D25.
8. Pruitt, K. D., Tatusova, T., & Maglott, D. R. (2007). NCBI reference sequences (RefSeq): A curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research, 35, D61–D65.
9. Dayhoff, M. O. (1978). Atlas of protein sequence and structure, (Vol. 5, Supplement 3). Washington, DC: National Biomedical Research Foundation.
10. Leinonen, R., Diez, F. G., Binns, D., Fleischmann, W., Lopez, R., & Apweiler, R. (2004). UniProt archive. Bioinformatics, 20, 3236–3237.
11. Hubbard, T. J., Aken, B. L., Beal, K., Ballester, B., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cunningham, F., Cutts, T., Down, T., Dyer, S. C., Fitzgerald, S., Fernandez-Banet, J., Graf, S., Haider, S., Hammond, M., Herrero, J., Holland, R., Howe, K., Johnson, N., Kahari, A., Keefe, D., Kokocinski, F., Kulesha, E., Lawson, D., Longden, I., Melsopp, C., Megy, K., Meidl, P., Ouverdin, B., Parker, A., Prlic, A., Rice, S., Rios, D., Schuster, M., Sealy, I., Severin, J., Slater, G., Smedley, D., Spudich, G., Trevanion, S., Vilella, A., Vogel, J., White, S., Wood, M., Cox, T., Curwen, V., Durbin, R., Fernandez-Suarez, X. M., Flicek, P., Kasprzyk, A., Proctor, G., Searle, S., Smith, J., Ureta-Vidal, A., & Birney, E. (2007). Ensembl 2007. Nucleic Acids Research, 35, D610–D617.
12. Kouranov, A., Xie, L., de la Cruz, J., Chen, L., Westbrook, J., Bourne, P. E., & Berman, H. M. (2006). The RCSB PDB information portal for structural genomics. Nucleic Acids Research, 34, D302–D305.
13. Wieser, D., Kretschmann, E., & Apweiler, R. (2004). Filtering erroneous protein annotation. Bioinformatics, 20, i342–i347.
14. Gattiker, A., Michoud, K., Rivoire, C., Auchincloss, A. H., Coudert, E., Lima, T., Kersey, P., Pagni, M., Sigrist, C. J., Lachaize, C., Veuthey, A. L., Gasteiger, E., & Bairoch, A. (2003). Automated annotation of microbial proteomes in SWISS-PROT. Computational Biology and Chemistry, 27, 49–58.
15. Kretschmann, E., Fleischmann, W., & Apweiler, R. (2001). Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on Swiss-Prot. Bioinformatics, 17, 920–926.
16. Wu, C. H., Nikolskaya, A., Huang, H., Yeh, L. S., Natale, D. A., Vinayaka, C. R., Hu, Z. Z., Mazumder, R., Kumar, S., Kourtesis, P., Ledley, R. S., Suzek, B. E., Arminski, L., Chen, Y., Zhang, J., Cardenas, J. L., Chung, S., Castro-Alvear, J., Dinkov, G., & Barker, W. C. (2004). PIRSF: Family classification system at the protein information resource. Nucleic Acids Research, 32, D112–D114.
17. Natale, D. A., Vinayaka, C. R., & Wu, C. H. (2004). Large-scale, classification-driven, rule-based functional annotation of proteins. In S. Subramaniam (Ed.), Encyclopedia of genetics, genomics, proteomics and bioinformatics. Bioinformatics volume. John Wiley & Sons, Ltd.
18. Li, W., Jaroszewski, L., & Godzik, A. (2001). Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283.
19. Gene Ontology Consortium (2006). The Gene Ontology (GO) project in 2006. Nucleic Acids Research, 34, D322–D326.
20. Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., & Apweiler, R. (2004). The Gene Ontology Annotation (GOA) database: Sharing knowledge in UniProt with Gene Ontology. Nucleic Acids Research, 32, D262–D266.
21. Kerrien, S., Alam-Faruque, Y., Aranda, B., Bancarz, I., Bridge, A., Derow, C., Dimmer, E., Feuermann, M., Friedrichsen, A., Huntley, R., Kohler, C., Khadake, J., Leroy, C., Liban, A., Lieftink, C., Montecchi-Palazzi, L., Orchard, S., Risse, J., Robbe, K., Roechert, B., Thorneycroft, D., Zhang, Y., Apweiler, R., & Hermjakob, H. (2007). IntAct-open source resource for molecular interaction data. Nucleic Acids Research, 35, D561–D565.
22. Rawlings, N. D., Morton, F. R., & Barrett, A. J. (2006). MEROPS: The peptidase database. Nucleic Acids Research, 34, D270–D272.
23. Crosby, M. A., Goodman, J. L., Strelets, V. B., Zhang, P., & Gelbart, W. M. (2007). FlyBase: Genomes by the dozen. Nucleic Acids Research, 35, D486–D491.
24. Eppig, J. T., Blake, J. A., Bult, C. J., Kadin, J. A., & Richardson, J. E. (2007). The mouse genome database (MGD): New features facilitating a model system. Nucleic Acids Research, 35, D630–D637.
25. Bieri, T., Blasiar, D., Ozersky, P., Antoshechkin, I., Bastiani, C., Canaran, P., Chan, J., Chen, N., Chen, W. J., Davis, P., Fiedler, T. J., Girard, L., Han, M., Harris, T. W., Kishore, R., Lee, R., McKay, S., Muller, H. M., Nakamura, C., Petcherski, A., Rangarajan, A., Rogers, A., Schindelman, G., Schwarz, E. M., Spooner, W., Tuli, M. A., Van Auken, K., Wang, D., Wang, X., Williams, G., Durbin, R., Stein, L. D., Sternberg, P. W., & Spieth, J. (2007). WormBase: New content and better access. Nucleic Acids Research, 35, D506–D510.
26. Nash, R., Weng, S., Hitz, B., Balakrishnan, R., Christie, K. R., Costanzo, M. C., Dwight, S. S., Engel, S. R., Fisk, D. G., Hirschman, J. E., Hong, E. L., Livstone, M. S., Oughtred, R., Park, J., Skrzypek, M., Theesfeld, C. L., Binkley, G., Dong, Q., Lane, C., Miyasato, S., Sethuraman, A., Schroeder, M., Dolinski, K., Botstein, D., & Cherry, J. M. (2007). Expanded protein information at SGD: New pages and proteome browser. Nucleic Acids Research, 35, D468–D471.
27. Rhee, S. Y., Beavis, W., Berardini, T. Z., Chen, G., Dixon, D., Doyle, A., Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M., Miller, N., Mueller, L. A., Mundodi, S., Reiser, L., Tacklind, J., Weems, D. C., Wu, Y., Xu, I., Yoo, D., Yoon, J., & Zhang, P. (2003). The Arabidopsis Information Resource (TAIR): A model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Research, 31, 224–228.
28. Sigrist, C. J. A., Cerutti, L., Hulo, N., Gattiker, A., Falquet, L., Pagni, M., Bairoch, A., & Bucher, P. (2002). PROSITE: A documented database using patterns and profiles as motif descriptors. Briefings in Bioinformatics, 3, 265–274.
29. Gribskov, M., Luthy, R., & Eisenberg, D. (1990). Profile analysis. Methods in Enzymology, 183, 146–159.
30. Krogh, A., Brown, M., Mian, I. S., Sjolander, K., & Haussler, D. (1994). Hidden Markov models in computational biology. Applications to protein modeling. Journal of Molecular Biology, 235(5), 1501–1531.
31. Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: Probabilistic models of proteins and nucleic acids. Cambridge, UK: Cambridge University Press.
32. Eddy, S. HMMER2 Profile hidden Markov models for biological sequence analysis. [http://hmmer.wustl.edu/].
33. Hulo, N., Bairoch, A., Bulliard, V., Cerutti, L., De Castro, E., Langendijk-Genevaux, P. S., Pagni, M., & Sigrist, C. J. A. (2006). The PROSITE database. Nucleic Acids Research, 34, D227–D230.
34. Attwood, T. K., Bradley, P., Flower, D. R., Gaulton, A., Maudling, N., Mitchell, A. L., Moulton, G., Nordle, A., Paine, K., Taylor, P., Uddin, A., & Zygouri, C. (2003). PRINTS and its automatic supplement, prePRINTS. Nucleic Acids Research, 31, 400–402.
35. Finn, R. D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., Eddy, S. R., Sonnhammer, E. L., & Bateman, A. (2006). Pfam: Clans, web tools and services. Nucleic Acids Research, 34, D247–D251.
36. Letunic, I., Copley, R. R., Pils, B., Pinkert, S., Schultz, J., & Bork, P. (2006). SMART 5: Domains in the context of genomes and networks. Nucleic Acids Research, 34, D257–D260.
37. Selengut, J. D., Haft, D. H., Davidsen, T., Ganapathy, A., Gwinn-Giglio, M., Nelson, W. C., Richter, A. R., & White, O. (2007). TIGRFAMs and Genome Properties: Tools for the assignment of molecular function and biological process in prokaryotic genomes. Nucleic Acids Research, 35, D260–D264.
38. Mi, H., Guo, N., Kejariwal, A., & Thomas, P. D. (2007). PANTHER version 6: Protein sequence and function evolution data with expanded representation of biological pathways. Nucleic Acids Research, 35, D247–D252.
39. Wilson, D., Madera, M., Vogel, C., Chothia, C., & Gough, J. (2007). The SUPERFAMILY database in 2007: Families and functions. Nucleic Acids Research, 35, D308–D313.
40. Yeats, C., Maibaum, M., Marsden, R., Dibley, M., Lee, D., Addou, S., & Orengo, C. A. (2006). Gene3D: Modelling protein structure, function and evolution. Nucleic Acids Research, 34, D281–D284.
41. Bru, C., Courcelle, E., Carrere, S., Beausse, Y., Dalmar, S., & Kahn, D. (2005). The ProDom database of protein domain families: More emphasis on 3D. Nucleic Acids Research, 33, D212–D215.
42. Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C., & Murzin, A. G. (2004). SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Research, 32, D226–D229.
43. Greene, L. H., Lewis, T. E., Addou, S., Cuff, A., Dallman, T., Dibley, M., Redfern, O., Pearl, F., Nambudiry, R., Reid, A., Sillitoe, I., Yeats, C., Thornton, J. M., & Orengo, C. A. (2007). The CATH domain structure database: New protocols and classification levels give a more comprehensive resource for exploring evolution. Nucleic Acids Research, 35, D291–D297.
44. Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., & Lopez, R. (2005). InterProScan: Protein domains identifier. Nucleic Acids Research, 33, W116–W120.
45. Kopp, J., & Schwede, T. (2006). The SWISS-MODEL repository: New features and functionalities. Nucleic Acids Research, 34, D315–D318.
46. Pieper, U., Eswar, N., Davis, F. P., Braberg, H., Madhusudhan, M. S., Rossi, A., Marti-Renom, M., Karchin, R., Webb, B. M., Eramian, D., Shen, M. Y., Kelly, L., Melo, F., & Sali, A. (2006). MODBASE: A database of annotated comparative protein structure models and associated resources. Nucleic Acids Research, 34, D291–D295.
47. Kersey, P. J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., & Apweiler, R. (2004). The international protein index: An integrated database for proteomics experiments. Proteomics, 4, 1985–1988.
48. Sterk, P., Kersey, P. J., & Apweiler, R. (2006). Genome Reviews: Standardizing content and representation of information about complete genomes. Omics, 10, 114–118.
49. McGinnis, S., & Madden, T. L. (2004). BLAST: At the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Research, 32, W20–W25.
50. Myers, E. W., & Miller, W. (1988). Optimal alignments in linear space. Computer Applications in the Biosciences, 4, 11–17.
51. Petryszak, P., Kretschmann, E., Wieser, D., & Apweiler, R. (2005). The predictive power of the CluSTr database. Bioinformatics, 21(18), 3604–3609.
52. Dodge, C., Schneider, R., & Sander, C. (1998). The HSSP database of protein structure-sequence alignments and family profiles. Nucleic Acids Research, 26, 313–315.
