
Journal of Biotechnology 124 (2006) 629–639

Review

Bioinformatics database infrastructure for biotechnology research

Eleanor J. Whitfield, Manuela Pruess, Rolf Apweiler
EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Hall, Hinxton, Cambs CB10 1SD, UK
Received 14 October 2005; received in revised form 6 March 2006; accepted 3 April 2006

Abstract
Many databases are available that provide valuable data resources for the biotechnological researcher. According to their core data, they can be divided into different types. Some databases provide primary data, like all published nucleotide sequences, others deal with protein sequences. In addition to these two basic types of databases, a huge number of more specialized resources are available, like databases about protein structures, protein identification, special features of genes and/or proteins, or certain organisms. Furthermore, some resources offer integrated views on different types of data, allowing the user to do easy customised queries over large datasets and to compare different types of data.
© 2006 Elsevier B.V. All rights reserved.
Keywords: Bioinformatics; Nucleic acid database; Protein database; Genomics database; Proteome database

Contents
1. Introduction
2. Nucleotide sequence databases
    2.1. EMBL/DDBJ/GenBank
    2.2. RefSeq
    2.3. Ensembl
    2.4. Genome reviews
3. Protein sequence databases
    3.1. GenPept
    3.2. Entrez protein
    3.3. UniProt
4. Specialized databases
    4.1. Model organism databases
5. Protein identification databases
    5.1. GO/GOA
    5.2. IntAct
    5.3. SWISS-2DPAGE
    5.4. PRIDE
    5.5. ChEBI
6. Structure databases
    6.1. Protein databank
    6.2. Cambridge structural database
    6.3. RESID
7. Special features databases
    7.1. IntEnz
    7.2. TRANSFAC
    7.3. EPD
    7.4. IMGT
8. Integrated and comparative databases
    8.1. InterPro
    8.2. International protein index
    8.3. Integr8
9. Discussion
References

Corresponding author. Tel.: +44 1223 494680; fax: +44 1223 494468.
E-mail address: eleanor@ebi.ac.uk (E.J. Whitfield).

0168-1656/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi: 10.1016/j.jbiotec.2006.04.006

1. Introduction

If you knew that thujone, a terpenoid found in wormwood oil, gave absinthe, an emerald-green liquor, its particular flavour and was the active component of its claimed toxicity (Hold et al., 2000), you might want to investigate its ability to bind to the gamma-aminobutyric acid A receptors, or GABA-A receptors, in our brain (Kash et al., 2004), which can bring on a number of brain disorders.

To do this, you can query the many databases that are available in the public domain, provided by academic, bioinformatics and non-academic institutes. They range from simple sequence repositories with broad domains of interest, storing data with little or no manual intervention and therefore minimal detail, to expertly curated databases that cover all sequenced species and in which the original sequence data is enhanced by the manual annotation of further information. While all databases strive for complete coverage within their chosen scope, the domain of interest for some users can transcend those of individual resources. This may reflect the user's wish to combine different types of information, or the inability of a single resource to contain the full details of every query. It is important to provide the users of biomolecular databases with a degree of integration between these resources, as by nature they are all connected in a scientific sense and each one of them provides data that is important for understanding biological complexity.
2. Nucleotide sequence databases

Primary nucleotide sequence databases are essential to provide sequences to the user as quickly as possible. These databases add little or no additional information to the sequence records they contain. The redundancy of data, and the fact that EMBL/DDBJ/GenBank records cannot be updated, corrected or amended without the permission of the original submitter, have led to the creation of several secondary nucleotide sequence databases. These databases augment the annotation of completely sequenced genomes and are continually being developed and improved, providing the user with accurate and maintained datasets.

2.1. EMBL/DDBJ/GenBank
The International Nucleotide Sequence Database (INSD) collaboration provides a primary nucleotide sequence repository for the public domain. It is an archive database that allows submissions from a number of sources, including individual researchers, genome sequencing projects and patent applications, and updates to entries from the original submitters. INSD is a joint effort of three partner databases: the DNA Data Bank of Japan (DDBJ) (Tateno et al., 2005) at the National Institute of Genetics (NIG), the EMBL Nucleotide Sequence Database (EMBL) (Kanz et al., 2005) at the European Bioinformatics Institute (EBI) (Brooksbank et al., 2005) and GenBank (Benson et al., 2005) at the National Center for Biotechnology Information (NCBI) (Wheeler et al., 2005). The three organisations synchronize their data on a daily basis to achieve optimal synchrony and ensure worldwide coverage of all nucleotide sequence entries. EMBL/GenBank/DDBJ records include individual genes, whole genomes, RNA, third party annotation, expressed sequence tags, high-throughput cDNAs and synthetic sequences. Large-scale genomic sequencing has led to the exponential growth of these repositories, with over 59,828,564 records and 109,825,661,925 nucleotides in EMBL release 84 of September 2005. Due to its completeness and standing as a primary data provider, EMBL/DDBJ/GenBank is the initial source for many molecular biology databases. Protein sequence entries can be derived from the translation of the coding sequence annotation in a nucleotide entry.

2.2. RefSeq

The Reference Sequence (RefSeq) collection (Pruitt et al., 2005) aims to provide a comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms. RefSeq is based on data derived from EMBL/DDBJ/GenBank, supplemented by additional sets of curated or predicted data in organisms of particular scientific interest. The aims of the RefSeq collection include: explicitly linked nucleotide and protein sequences, updates to reflect current knowledge of sequence data and biology, and ongoing curation by NCBI staff and collaborators, with review status indicated on each record. However, most of the entries are automatically generated without any manual intervention or annotation, so this database should still be viewed mainly as a sequence repository. Release 13 of September 2005 includes 1,899,454 proteins covering 3060 organisms.
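
As a small illustration of the remark in Section 2.1 that protein sequence entries can be derived by translating the coding sequence annotated in a nucleotide entry, the following Python sketch translates a forward-strand coding region with the standard genetic code. The function name, the 1-based inclusive coordinates and the use of X for unrecognised codons are assumptions made for this example only; real entries may also carry joined or complementary locations and alternative translation tables, none of which are handled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(nucleotides, start, end):
    """Translate the 1-based, inclusive region start..end of a forward-strand
    sequence, stopping at the first stop codon (hypothetical helper)."""
    cds = nucleotides[start - 1:end].upper().replace("U", "T")
    protein = []
    # Ignore any trailing partial codon.
    for i in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X = unknown codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For a toy entry whose coding sequence is annotated at positions 3 to 11, calling translate_cds("ccATGGCTTAAgg", 3, 11) reads the codons ATG, GCT and TAA and stops at the stop codon.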
2.3. Ensembl

An example of a secondary nucleotide sequence genomic database is Ensembl (Hubbard et al., 2005). This is a joint project of the EBI and the Wellcome Trust Sanger Institute (WTSI) to develop a software system that produces and maintains fast automatic annotation of raw genomic sequence on selected metazoan genomes. Ensembl is a comprehensive source of stable annotation. Genes are annotated on evidence derived from known protein, cDNA and EST sequences. Novel genes are determined by the gene build system, which incorporates a wide range of methods including ab initio gene predictions, homology and gene prediction HMMs. All genes can be visualised in the context of the genome, mapping genes to transcripts to proteins. Data is augmented with alternative transcript and protein splice patterns, dbSNP data and DAS tracks of external databases. The gene build pipeline is constantly being developed to improve predictions and generates regular new versions of the genomes. Release 34 in September 2005 includes 8 mammalian species, 6 chordates and 5 other eukaryotes.

The Distributed Annotation System (DAS) (Dowell et al., 2001) specification was originally designed to allow the feature data for biological molecules to be served in relation to a genomic sequence. The DAS server system conceptually consists of a reference server, providing sequence data and its annotations, and an annotation server, providing coordinates for each feature and indicating a suitable DAS reference server from which the corresponding sequence can be obtained. A DAS client is a powerful application that is able to connect to at least one reference server and any number of annotation servers and merge the information from these servers in a unified display. The Vertebrate Genome Annotation (Vega) database (Ashurst et al., 2005) is another genomic database; a community resource for browsing manual annotation of finished sequences from a variety of vertebrate genomes, including human, mouse, dog and zebrafish. Vega displays only manually annotated gene structures built using transcriptional evidence, which can be examined in the browser. The University of California Santa Cruz (UCSC) Genome Browser (Karolchik et al., 2003) provides access to the reference sequence and working draft assemblies for a large collection of genomes. The Genome Browser zooms and scrolls over chromosomes, showing the work of annotators worldwide. The Gene Sorter shows expression, homology and other information on groups of genes that can be related in many ways. The NCBI's Entrez Map Viewer (Wheeler et al., 2005) provides special browsing capabilities for a subset of organisms in Entrez genomes. Map Viewer allows you to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest.

2.4. Genome reviews

The goal of the Genome Reviews (Kersey et al., 2005) project is to provide an up-to-date, standardised and comprehensively annotated view of the genomic sequence of organisms with completely deciphered genomes. Each Genome Review represents an enhanced version of the original sequence of a complete chromosome or plasmid, with additional annotation imported from data sources that include the UniProt knowledgebase, GO, the GOA (GO Annotation) project, InterPro, and HoGenom. Cross-references to 18 databases are also provided. Annotations used inconsistently among the original submissions have been standardised (for example rRNA and tRNA annotations) and deleted in cases where the coverage is low, making it easier to compare data across several genomes. The data is completely synchronised with the fortnightly UniProtKB releases, and evidence tags are attached to most feature qualifiers indicating the primary source of the information. Release 36 of September 2005 has 239 complete prokaryote genomes, represented in 420 entries. Further releases will see the addition of eukaryote genomes, the first being Saccharomyces cerevisiae, better representation of pseudogenes, and computational analysis to identify ncRNAs.

3. Protein sequence databases

Several protein sequence databases act as repositories of protein sequences and, like the primary nucleotide sequence databases, these are essential to provide the sequences to the user as quickly as possible. These databases add little or no additional information to the sequence records they contain and generally make no effort to provide a non-redundant collection of sequences to users. When additional information is annotated to a sequence, this greatly increases the value of the resource for users. Expert biologists validate such curated data before it is added to the databases, to ensure that the data in these collections is highly reliable. There is also a large effort invested in maintaining non-redundant datasets by compiling all reports for a given protein sequence into a single record.
3.1. GenPept

The GenBank Gene Products data bank (GenPept) (Wheeler et al., 2005) is produced by the NCBI. Entries in the database are derived from translations of the coding sequences contained in the collaborative nucleotide database and contain minimal annotation. The annotation in a GenPept entry has been extracted from the corresponding nucleotide entry, and the database does not contain proteins derived from amino acid sequencing. The database is redundant, as multiple records may represent each protein; no attempt is made to group these records into a single database entry.

3.2. Entrez protein

Entrez protein, a sequence repository also produced by NCBI, is compiled from a variety of sources. It also contains sequence data from translations of the coding sequences contained in the collaborative nucleotide database, as well as protein sequences submitted to the Protein Information Resource (PIR), UniProtKB/Swiss-Prot, the Protein Research Foundation (PRF) and the Protein Data Bank (PDB). Additional information exists as it has been extracted from manually curated databases such as UniProtKB/Swiss-Prot. As with GenPept, the sequence collection is redundant.

3.3. UniProt

The Universal Protein Resource (UniProt) (Bairoch et al., 2005) is a comprehensive catalogue of data on protein sequence and function, maintained by the UniProt consortium. The consortium is a collaboration of the Swiss Institute of Bioinformatics (SIB), the European Bioinformatics Institute (EBI), and the Protein Information Resource (PIR). UniProt is comprised of three components.

Firstly, the expertly curated UniProt Knowledgebase (UniProtKB), which will continue the work of UniProtKB/Swiss-Prot, UniProtKB/TrEMBL (Boeckmann et al., 2003) and PIR (Wu et al., 2003). UniProtKB/Swiss-Prot is a manually annotated database with information extracted from literature and curator-evaluated computational analysis. It contains a minimal level of redundancy and a high level of integration with other databases. UniProtKB/TrEMBL contains the translations of all coding sequences present in the collaborative nucleotide database and also protein sequences extracted from the literature or submitted to UniProtKB. Entries are enriched with automated classification and annotation. Records are awaiting full manual annotation. PIR produced the Protein Sequence Database (PSD) of functionally annotated protein sequences, which grew out of the Atlas of Protein Sequence and Structure (1965-1978) edited by Margaret Dayhoff. PIR-PSD is now an archive database, as all sequences and annotations have been integrated into UniProtKB.

Secondly, the UniProt archive (UniParc), into which new and updated sequences are loaded on a daily basis. UniParc (Leinonen et al., 2004) is a comprehensive repository of protein sequences, providing a mechanism by which the historical association of database records and protein sequences can be tracked. It is non-redundant at the level of sequence identity, but may contain semantic redundancies.

Thirdly, the non-redundant UniProt Reference clusters (UniRef), which provide non-redundant data reference collections based on the UniProt knowledgebase in order to obtain complete coverage of sequence space at several resolutions: 100, 90 and 50% sequence similarity.

Updates of UniProt are publicly available on a biweekly schedule. UniProt Release 6.1 consists of UniProtKB/Swiss-Prot Protein Knowledgebase Release 48.1 of 27 September 2005 (195,058 sequence entries, comprising 70,674,903 amino acids abstracted from 134,132 references) and UniProtKB/TrEMBL Protein Database Release 31.1 of 27 September 2005 (2,105,517 sequence entries comprising 680,464,593 amino acids).
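
The idea of an archive that is non-redundant at the level of sequence identity, while still tracking which database records have pointed at which sequence over time, can be sketched in a few lines of Python. This is only a toy model written for this section and not a description of how UniParc is implemented: the class name, the use of a SHA-256 checksum as the key and the active/historical flag are all assumptions.

```python
import hashlib


class SequenceArchive:
    """Toy archive keyed by a checksum of the sequence (an assumption of this sketch)."""

    def __init__(self):
        # checksum -> {"sequence": str, "xrefs": {(database, accession): is_active}}
        self._entries = {}

    def __len__(self):
        return len(self._entries)

    def load(self, database, accession, sequence):
        """Register a new or updated sequence for one source record; return its key."""
        seq = sequence.strip().upper()
        key = hashlib.sha256(seq.encode("ascii")).hexdigest()
        source = (database, accession)
        # Any sequence previously linked to this source record becomes historical.
        for entry in self._entries.values():
            if source in entry["xrefs"]:
                entry["xrefs"][source] = False
        entry = self._entries.setdefault(key, {"sequence": seq, "xrefs": {}})
        entry["xrefs"][source] = True
        return key

    def history(self, database, accession):
        """List (key, is_active) for every sequence ever linked to the source record."""
        source = (database, accession)
        return [
            (key, entry["xrefs"][source])
            for key, entry in self._entries.items()
            if source in entry["xrefs"]
        ]
```

Loading the same sequence from two source databases yields a single archive entry with two cross-references, and reloading a source record with a changed sequence keeps the old association as a historical one. The accessions used to exercise the class are made up.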

4. Specialized databases

4.1. Model organism databases

The Human Genome Organisation (HUGO) is the international organisation of scientists involved in human genetics, established in 1989 to promote and sustain international collaboration in the field. As part of HUGO, the Human Gene Nomenclature Committee (HGNC) maintains Genew, a database of approved human gene names and symbols (Wain et al., 2002). Their current priority is assigning nomenclature to genes submitted from the Human Genome Project; symbols for over 20,000 genes are approved. Scientists, journals and databases also request individual new symbols. HGNC approved symbols are used by many databases including UniProt, ensuring common nomenclature across all human data.

Similar gene-centric databases for model organisms are available. The Mouse Genome Informatics (MGI) (Eppig et al., 2005) provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse. FlyBase (Drysdale and Crosby, 2005) and WormBase (Chen et al., 2005) are comprehensive databases for information on the genetics and molecular biology of Drosophila and Caenorhabditis, respectively. The Rat Genome Database (RGD) (Twigger et al., 2002) curates and integrates rat genetic and genomic data and provides access to this data to support research using the rat as a genetic model for the study of human disease.

5. Protein identification databases

5.1. GO/GOA

GO (Gene Ontology Consortium, 2004) provides three structured controlled vocabularies, describing the molecular function, biological roles and cellular locations of gene products. The dynamic controlled vocabulary can be applied to all organisms, even while knowledge of gene and protein roles in cells is still accumulating and changing. Many resources have adopted GO, facilitating the integration of annotation and encouraging the development of many similar projects in other domains. A number of these projects can be accessed through the Open Biological Ontologies website.

The GOA project (GO Annotation) (Camon et al., 2004) is a combination of electronic mappings and manual curation assigning GO terms to all complete and incomplete proteomes that exist in UniProtKB. Widespread annotation of GO terms to protein products by many resources helps to promote the integration of annotation across databases, supplementing the use of standard names with the use of standard annotation vocabularies. Monthly GOA releases provide the GO assignments to UniProtKB, and individual files of GO assignments to 272 non-redundant proteome sets for complete genomes are available. The September release provides automatic and manual annotation of GO to 93,192 species; 205,928 PubMed references are cross-referenced.

5.2. IntAct

IntAct (Hermjakob et al., 2004) is an open source protein interaction database, repository and analysis system. The repository is populated with data from project partners and curated literature data. It provides both textual and graphical representations of protein interactions, maintains annotation standards by intensive use of controlled vocabularies to ensure data consistency, and allows exploration of interaction networks using GO annotations of the interacting proteins. The September release has nearly 36,000 proteins and nearly 53,000 interactions imported from the literature and manually curated. These are searchable and viewable using an interactive web graphical application of protein networks.

IntAct is a member of the IMEX consortium, with BIND, DIP, MINT and MIPS. User-submitted data will be exchanged between partners to provide a network of stable, comprehensive resources of interaction data.

5.3. SWISS-2DPAGE

Two-dimensional polyacrylamide gel electrophoresis (2-D PAGE) and Sodium Dodecyl Sulfate PAGE (SDS-PAGE) experiments distribute proteins in a gel-based system on the basis of molecular weight and charge, providing protein expression data. SWISS-2DPAGE (Hoogland et al., 2004) stores the results of such experiments and adds a variety of cross-references to other 2-D PAGE databases and to UniProtKB/Swiss-Prot. A SWISS-2DPAGE entry also contains images of the gels and textual information such as physiology, mapping procedures, experimental data and references. Release 17.3, March 2004 and updates up to 08 April 2005, contains 1265 entries in 36 reference maps from human, mouse, Arabidopsis thaliana, Dictyostelium discoideum, Escherichia coli, Saccharomyces cerevisiae, and Staphylococcus aureus (N315). The human and mouse 2D-PAGE databases at the Danish Centre for Human Genome Research are intended to aid functional genome analysis in health and disease. The information from each gel is stored as its own database, accessible through an interactive image of the gel itself.
5.4. PRIDE

The PRoteomics IDEntifications (PRIDE) (Martens et al., in press) database is a centralized, standards-compliant, public data repository for proteomics data. It has been developed to provide the proteomics community with a public repository for protein and peptide identifications together with the evidence supporting these identifications. PRIDE has been developed through a collaboration of the EBI and Ghent University in Belgium. The original motivation behind its development was to provide a common data exchange format and repository to support proteomics literature publications. This remit has grown with PRIDE, with the hope that it will provide a reference set of tissue-based identifications for use by the community. The future development of PRIDE has become closely linked to HUPO PSI. Release 2.0 in July 2005 includes a new and richer XML schema.

5.5. ChEBI

Chemical Entities of Biological Interest (ChEBI), available at the European Bioinformatics Institute (EBI), is a freely available dictionary of small molecular entities. ChEBI uses nomenclature, symbolism and terminology endorsed by the International Union of Pure and Applied Chemistry (IUPAC) and the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). The term molecular entity encompasses any constitutionally or isotopically distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc., identifiable as a separately distinguishable entity. The molecular entities in question are either products of nature or synthetic products used to intervene in the processes of living organisms. ChEBI is organised in an ontological classification, whereby the relationships between molecular entities or classes of entities and their parents and/or children are specified. Release 13 in July 2005 contains 5549 curated compounds.

6. Structure databases

6.1. Protein databank

The worldwide Protein Data Bank (wwPDB) (Berman et al., 2000) began in 1972 and is the single worldwide repository for the processing and distribution of over 32,500 three-dimensional structures for proteins, nucleic acids and carbohydrates as of September 2005. It is a collaboration of the Research Collaboratory for Structural Bioinformatics (RCSB), the Macromolecular Structural Database (MSD-EBI), and the Protein Data Bank of Japan (PDBj). The protein structures in the database are from X-ray crystallography and solution nuclear magnetic resonance (NMR) experiments.

The archive's growth has been accompanied by increases in both data content and the structural complexity of individual entries. A further acceleration is expected due to developments in high-throughput structural determination methodologies and worldwide structural genomics efforts, with an estimated tripling or quadrupling in size over the next 5 years. This has led to PDB completely overhauling their submission and browsing facilities in order to be able to respond appropriately.

6.2. Cambridge structural database

The Cambridge Structural Database (CSD) (Allen, 2002), principal product of the Cambridge Crystallographic Data Centre, is a repository of small molecule crystal structures. The release of Jan 2005 contained over 335,200 records of organic molecules and metal-organic compounds, with no polypeptide or polysaccharide larger than 24 units. Most three-dimensional structures were identified using either X-ray or neutron diffraction. CSD records results of single crystal studies and powder diffraction studies which yield 3D atomic coordinate data for at least all non-H atoms. Crystal structure data is captured from publications in the open literature and private communications to the CSD (via direct data deposition).

6.3. RESID

The RESID Database of Protein Modifications (Garavelli, 2004) is a comprehensive collection of annotations and structures for protein modifications including amino-terminal, carboxyl-terminal and peptide chain cross-link post-translational modifications. Release 42.00 in June 2005 contains 384 entries for predicted or observed co- or post-translational modifications of the 23 encoded alpha-amino acids. 317 of these modifications are annotated in UniProtKB. In addition to structural information, each record includes systematic and alternate names, atomic formulae and masses, enzyme activities generating the modifications, 3D models and structures, cross-references (including GO and ChEBI) and UniProt feature table annotations.

7. Special features databases

7.1. IntEnz

IntEnz (Fleischmann et al., 2004) is the name for the Integrated relational Enzyme database and is the most up-to-date version of the Enzyme Nomenclature created under the auspices of the Nomenclature Committee of the IUBMB. The goal of IntEnz is to incorporate data from the NC-IUBMB Enzyme Classification list, the Enzyme Nomenclature database (ENZYME) (Bairoch, 2000), and the Braunschweig Enzyme Database (BRENDA) of enzyme function (Schomburg et al., 2004). Release 13 of IntEnz, in August 2005, contains records for every enzyme with an EC number. Each record stores recommended and alternative names, catalytic activity, cofactors, disease information, and cross-references with UniProtKB. ENZYME is a repository of information relative to the nomenclature of enzymes; Release 38, September 2005, and updates up to 26 September 2005 contain 4563 entries. BRENDA provides similar records, with a breakdown by species for reactions, activities, cofactors, inhibitors, and substrates.

7.2. TRANSFAC

TRANSFAC (Matys et al., 2003) is a database on eukaryotic cis-acting regulatory DNA elements and trans-acting factors covering the whole range from yeast to human. Data is extracted from the original literature but, in the long term, it is hoped that a direct submission system will be available. Regulatory sites in the individual genes are mapped so they can be positioned on the genome as a whole. A tool has been developed for the identification of regulatory elements in newly sequenced genomes. Release 6.0 has 6627 transcription factor-binding sites and 1755 genes (1725 have sites annotated).

7.3. EPD

The Eukaryotic Promoter Database EPD (Schmid et al., 2004) was designed and developed at the Weizmann Institute of Science in Rehovot (Israel) and is currently maintained at ISREC in Epalinges/Lausanne (Switzerland). EPD is a specialized annotation database based on the EMBL Data Library, providing information about eukaryotic promoters extracted from scientific literature or, starting from release 73, compiled by a new in silico primer extension method. Release 83 was made available in July 2005.

7.4. IMGT

The international ImMunoGeneTics (IMGT) Project (Lefranc et al., 2005) maintains a high quality integrated knowledge resource specialized in immunoglobulins, T cell receptors, major histocompatibility complex (MHC), immunoglobulin superfamily and related proteins of the immune system of human and other vertebrate species. The collaborative nucleotide database entries fitting these categories are retrieved and annotated to a high standard. IMGT consists of sequence databases (IMGT/LIGM-DB, a comprehensive database of immunoglobulins and T cell receptors from human and other vertebrates, with translation for fully annotated sequences; IMGT/MHC-DB; IMGT/PRIMER-DB), a genome database (IMGT/GENE-DB), a structure database (IMGT/3Dstructure-DB), Web resources (IMGT Marie-Paule page) and interactive tools.
8. Integrated and comparative databases

8.1. InterPro

The identification of possible DNA coding regions can be deduced by similarity to previously characterised genes. Inferring biological function for a coding region can be a complicated process, which cannot always be achieved by sequence similarity searches. Protein sequence comparisons often provide the first clues to the structure and function of novel proteins, as functional constraints are known to persist in evolution. Protein domain signature databases are available for identifying distant relationships in novel sequences to a known protein family. InterPro (Mulder et al., 2005) is an integrated resource of protein families, domains and functional sites which amalgamates the efforts of the member databases, which are currently PROSITE (Hulo et al., 2004), PRINTS (Attwood et al., 2004), Pfam (Bateman et al., 2004), ProDom (Bru et al., 2005), SMART (Letunic et al., 2004), TIGRFAMS (Haft and Selengut, 2003), PIR SuperFamilies (Huang et al., 2003), SUPERFAMILY (Gough and Chothia, 2002), PANTHER (Mi et al., 2005) and Gene3D (Pearl et al., 2005). InterProScan (Quevillon et al., 2005) combines the different protein recognition methods and scanning tools of each method into one powerful searching resource, unifying the strength of the individual signature database methods to ensure the best prediction of protein domains for a query translation. In the absence of biochemical characterisation of a protein, domain predictions can be a good guide to protein function.

Release 11.0 of July 2005 contains 12,294 entries, representing 3240 domains, 8753 families, 230 repeats, 29 active sites, 21 binding sites and 21 post-translational modification sites.
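
To make the idea of unifying signature matches from several member databases more concrete, the following Python sketch groups matches under a shared integrated entry and merges overlapping regions. It is an illustration only and not InterProScan's actual procedure: the function name, the input layout, and the signature and entry identifiers in the usage note are all hypothetical.

```python
def integrate_hits(hits, signature_to_entry):
    """Group member-database matches by integrated entry and merge overlaps.

    hits: iterable of (member_db, signature, start, end) tuples, with
          1-based inclusive coordinates on the query protein.
    signature_to_entry: dict mapping a signature to its integrated entry;
          signatures absent from the mapping are skipped.
    """
    by_entry = {}
    for member_db, signature, start, end in hits:
        entry = signature_to_entry.get(signature)
        if entry is None:
            continue  # signature not integrated into any entry
        by_entry.setdefault(entry, []).append((start, end, member_db))

    merged = {}
    for entry, spans in by_entry.items():
        regions = []
        for start, end, member_db in sorted(spans):
            if regions and start <= regions[-1]["end"]:
                # Overlaps the previous region: extend it and record the source.
                regions[-1]["end"] = max(regions[-1]["end"], end)
                regions[-1]["sources"].add(member_db)
            else:
                regions.append({"start": start, "end": end, "sources": {member_db}})
        merged[entry] = regions
    return merged
```

With two made-up signatures from different member databases mapped to the same entry and matching residues 10-80 and 60-120, the function reports a single region 10-120 supported by both sources, which is the kind of unified view the integrated resource is meant to give.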
8.2. International protein index

Despite the complete determination of the genome sequence of several higher eukaryotes, their proteomes remain relatively poorly defined. Information about proteins identified by different experimental and computational methods is stored in different databases, meaning that no single resource offers full coverage of known and predicted proteins. The International Protein Index, IPI (Kersey et al., 2004), has been developed to address these issues and offers complete non-redundant data sets representing the human, mouse, rat, zebrafish, Arabidopsis and chicken proteomes, built from the UniProtKB, Ensembl and RefSeq databases. Each IPI entry represents a cluster of entries from the source databases believed to represent the same protein. One difficulty in creating IPI is that there is no absolute way of telling whether two entries in molecular biology databases represent different biological entities or the same entity rendered differently owing to in silico or experimental artefacts. To assemble IPI data sets, an automatic and pragmatic approach is chosen to build clusters through combining knowledge already present in the primary data sources (and in the cross-references between them) with the results of protein sequence similarity comparisons. After a cluster is assembled, a master entry from among the cluster members is chosen, which supplies the IPI entry with its sequence and annotation. Finally, an identifier is chosen for each cluster.

8.3. Integr8

Integr8 (Kersey et al., 2005) is a browser for information relating to completed genomes and proteomes, based on data contained in Genome Reviews, UniProtKB proteome sets and IPI. It provides access to species descriptions, recent literature, a detailed statistical overview of the genome and proteome, and summary information about each complete proteome. Data from a variety of sources, including InterPro, CluSTr and GO, are integrated. Integr8 can be used to identify putative paralogs and orthologs and to identify potential regions of synteny between organisms. The relationships of genes in the context of their genomic neighbours can be viewed, as well as the transcripts and proteins they encode. The Inquisitor tool allows a user to determine if their protein sequence is available in Integr8 and, if it is not, will provide its protein domain architecture and identify the known sequence of high similarity. The information is available for complete download or for a user-configured download using the BioMart query interface. Release 24 (August 2005) is built from UniProt release 5.8 and InterPro release 11.0 and contains 217 bacterial species, 25 eukaryotes and 21 archaea.
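The IPI assembly steps described in Section 8.2 (build clusters from cross-references and sequence comparisons, choose a master entry, assign an identifier) can be sketched as below. This is a minimal illustration, not the IPI pipeline: exact sequence identity stands in for a real similarity search, the source priority used to pick the master entry is an assumption (the text does not state the rule), and all entry names, sequences and cluster identifiers are hypothetical.

```python
from itertools import combinations

# Assumed preference order for choosing a master entry; the source text does
# not state the rule actually used, so this ranking is illustrative only.
SOURCE_PRIORITY = {"UniProtKB": 0, "RefSeq": 1, "Ensembl": 2}


def find(parent, x):
    """Return the cluster representative of x, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def build_clusters(entries, cross_refs):
    """Cluster entries linked by a cross-reference or an identical sequence.

    entries: dict of id -> (source database, sequence)
    cross_refs: iterable of (id, id) pairs taken from the source databases
    """
    parent = {entry_id: entry_id for entry_id in entries}
    links = list(cross_refs)
    # Stand-in for a real similarity search: link entries with equal sequences.
    links += [(a, b) for a, b in combinations(entries, 2)
              if entries[a][1] == entries[b][1]]
    for a, b in links:
        parent[find(parent, a)] = find(parent, b)
    clusters = {}
    for entry_id in entries:
        clusters.setdefault(find(parent, entry_id), []).append(entry_id)
    return sorted(sorted(members) for members in clusters.values())


def choose_master(members, entries):
    """Pick the member from the most preferred source (ties broken by id)."""
    return min(members, key=lambda m: (SOURCE_PRIORITY[entries[m][0]], m))


# Hypothetical source entries: id -> (database, sequence).
entries = {
    "U1": ("UniProtKB", "MKTAYIAK"),
    "E1": ("Ensembl", "MKTAYIAK"),
    "R1": ("RefSeq", "MKTAYIAR"),
    "R2": ("RefSeq", "MSSPLLQW"),
}
cross_refs = [("U1", "R1")]  # a cross-reference already present in the sources

clusters = build_clusters(entries, cross_refs)
# Assign a placeholder identifier to each cluster and record its master entry.
index = {f"CLUSTER_{n}": (choose_master(members, entries), members)
         for n, members in enumerate(clusters, start=1)}
```

A union-find structure is used because both kinds of evidence are pairwise links, and merging them transitively yields one cluster per putative protein regardless of the order in which links are processed.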

9. Discussion

Complete and up-to-date databases of biological knowledge are vital for information-dependent biological and biotechnological research. Much of the value of these resources is as part of an interconnected network of related databases, and many maintain cross-references to other databases. These cross-references provide the basic platform for more advanced data integration strategies.

The rapid accumulation of genome sequences for many organisms has turned attention to the identification and function of proteins encoded by these genomes. The increasing volume and variety of protein sequences and functional information available means that querying a manually annotated database, such as UniProtKB, provides a user with more criteria with which to perform a search: "give me all proteins in mouse that are phosphorylated on serine", "give me the protein domain architecture of alternative splice isoforms", "how many proteins have been identified in the Drosophila melanogaster genome?". This value-added protein information is not available within a sequence repository database. Cross-references available within a UniProtKB entry allow a user to link out to many databases, including those for more protein-specific data, a nucleotide database or an organism database.

Comparative databases allow users to identify gene or protein orthologs, so a query for one protein could, potentially, have a species-wide result. The use of standard identifiers, naming conventions and controlled vocabularies, the adoption of standards for data representation and exchange, and the use of data warehousing technologies enable such far-reaching results.
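The kinds of query mentioned in the Discussion rely on annotation being stored as structured fields rather than as bare sequence. The sketch below illustrates this with a toy in-memory record type; the class, field names, accessions and cross-reference identifiers are all hypothetical and do not reflect the UniProtKB data model or any real query interface.

```python
from dataclasses import dataclass, field


@dataclass
class ProteinEntry:
    """Toy annotated record; every value used below is a made-up placeholder."""
    accession: str
    organism: str
    # Each modification is (position, residue, modification type).
    modifications: list = field(default_factory=list)
    # Cross-references map a database name to an identifier in that database.
    cross_refs: dict = field(default_factory=dict)


def phosphorylated_on_serine(entries, organism):
    """Accessions of entries from `organism` carrying a phosphorylated serine."""
    return [
        e.accession
        for e in entries
        if e.organism == organism
        and any(res == "S" and kind == "phosphorylation"
                for _, res, kind in e.modifications)
    ]


def count_for_organism(entries, organism):
    """Number of entries annotated with the given organism."""
    return sum(1 for e in entries if e.organism == organism)


def link_out(entry, database):
    """Follow a cross-reference; returns None if the entry has no such link."""
    return entry.cross_refs.get(database)


entries = [
    ProteinEntry("ACC001", "Mus musculus", [(15, "S", "phosphorylation")],
                 {"nucleotide_db": "NUC001"}),
    ProteinEntry("ACC002", "Mus musculus", [(40, "T", "phosphorylation")]),
    ProteinEntry("ACC003", "Drosophila melanogaster",
                 [(7, "S", "phosphorylation")]),
]

mouse_phosphoserine = phosphorylated_on_serine(entries, "Mus musculus")
fly_count = count_for_organism(entries, "Drosophila melanogaster")
nucleotide_link = link_out(entries[0], "nucleotide_db")
```

None of these questions could be answered from a sequence repository alone, because the organism, modification and cross-reference fields are exactly the value-added annotation that curation supplies.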

References

Allen, F.H., 2002. The Cambridge structural database: a quarter of a million crystal structures and rising. Acta Crystallogr. B 58, 380-388.
Ashurst, J.L., Chen, C.K., Gilbert, J.G.R., Jekosch, K., Keenan, S., Meidl, P., Searle, S.M., Stalker, J., Storey, R., Trevanion, S., Wilming, L., Hubbard, T., 2005. The vertebrate genome annotation (Vega) database. Nucl. Acids Res. 33, D459-D465.
Attwood, T.K., Bradley, P., Gaulton, A., Maudling, N., Mitchell, A.L., Moulton, G., 2004. The PRINTS protein fingerprint database: functional and evolutionary applications. In: Dunn, M., Jorde, L., Little, P., Subramaniam, A. (Eds.), Encyclopaedia of Genomics. Proteomics and Bioinformatics.

Bairoch, A., 2000. The ENZYME database in 2000. Nucl. Acids Res. 28, 304-305.
Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., Martin, M.J., Natale, D.A., O'Donovan, C., Redaschi, N., Yeh, L.S.L., 2005. The universal protein resource (UniProt). Nucl. Acids Res. 33, D154-D159.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L.L., et al., 2004. The Pfam protein families database. Nucl. Acids Res. 32, D138-D141.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The protein data bank. Nucl. Acids Res. 28, 235-242.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L., 2005. GenBank. Nucl. Acids Res. 33, D34-D38.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., et al., 2003. The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucl. Acids Res. 31, 365-370.
Brooksbank, C., Cameron, G., Thornton, J., 2005. The European Bioinformatics Institute's data resources: towards systems biology. Nucl. Acids Res. 33, D46-D53.
Bru, C., Courcelle, E., Carrère, S., Beausse, Y., Dalmar, S., Kahn, D., 2005. The ProDom database of protein domain families: more emphasis on 3D. Nucl. Acids Res. 33, D212-D215.
Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., Apweiler, R., 2004. The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucl. Acids Res. 32, D262-D266.
Chen, N., Harris, T.W., Antoshechkin, I., Bastiani, C., Bieri, T., Blasiar, D., Bradnam, K., Canaran, P., Chan, J., Chen, C.K., et al., 2005. WormBase: a comprehensive data resource for Caenorhabditis biology and genomics. Nucl. Acids Res. 33, D383-D389.
Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R., Stein, L., 2001. The distributed annotation system. BMC Bioinformatics 2, 7.
Drysdale, R.A., Crosby, M.A., The FlyBase Consortium, 2005. FlyBase: genes and gene models. Nucl. Acids Res. 33, D390-D395.
Eppig, J.T., Bult, C.J., Kadin, J.A., Richardson, J.E., Blake, J.A., The Mouse Genome Database Group, 2005. The Mouse Genome Database (MGD): from genes to mice - a community resource for mouse biology. Nucl. Acids Res. 33, D471-D475.
Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K.B., Bairoch, A., Schomburg, D., Tipton, K.F., Apweiler, R., 2004. IntEnz, the integrated relational enzyme database. Nucl. Acids Res. 32, D434-D437.
Garavelli, J.S., 2004. The RESID database of protein modifications as a resource and annotation tool. Proteomics 4, 1527-1533.
Gene Ontology Consortium, 2004. The Gene Ontology (GO) database and informatics resource. Nucl. Acids Res. 32, D258-D261.
Gough, J., Chothia, C., 2002. SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucl. Acids Res. 30, 268-272.
Haft, D.H., Selengut, J.D., White, O., 2003. The TIGRFAMs database of protein families. Nucl. Acids Res. 31, 371-373.

Hermjakob, H., Montecchi-Palazzi, L., Lewington, C., Mudali, S., Kerrien, S., Orchard, S., Vingron, M., Roechert, B., Roepstorff, P., Valencia, A., et al., 2004. IntAct: an open source molecular interaction database. Nucl. Acids Res. 32, D452-D455.
Hold, K.M., Sirisoma, N.S., Ikeda, T., Narahashi, T., Casida, J.E., 2000. Alpha-thujone (the active component of absinthe): gamma-aminobutyric acid type A receptor modulation and metabolic detoxification. Proc. Natl. Acad. Sci. U.S.A. 97, 3826-3831.
Hoogland, C., Mostaguir, K., Sanchez, J.C., Hochstrasser, D.F., Appel, R.D., 2004. SWISS-2DPAGE, ten years later. Proteomics 4.
Huang, H., Barker, W.C., Chen, Y., Wu, C.H., 2003. iProClass: an integrated database of protein family, function and structure information. Nucl. Acids Res. 31, 390-392.
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., Cunningham, F., et al., 2005. Ensembl 2005. Nucl. Acids Res. 33, D447-D453.
Hulo, N., Sigrist, C.J.A., Le Saux, V., Langendijk-Genevaux, P.S., Bordoli, L., Gattiker, A., De Castro, E., Bucher, P., Bairoch, A., 2004. Recent improvements to the PROSITE database. Nucl. Acids Res. 32, D134-D137.
Kanz, C., Aldebert, P., Althorpe, N., Baker, W., Baldwin, A., Bates, K., Browne, P., van den Broek, A., Castro, M., Cochrane, G., et al., 2005. The EMBL nucleotide sequence database. Nucl. Acids Res. 33, D29-D33.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., Weber, R.J., Haussler, D., Kent, W.J., 2003. The UCSC genome browser database. Nucl. Acids Res. 31, 51-54.
Kash, T.L., Trudell, J.R., Harrison, N.L., 2004. Structural elements involved in activation of the gamma-aminobutyric acid type A (GABAA) receptor. Biochem. Soc. Trans. 32, 540-548.
Kersey, P.J., Duarte, J., Williams, A., Karavidopoulou, Y., Birney, E., Apweiler, R., 2004. The International Protein Index: an integrated database for proteomics experiments. Proteomics 4, 1985-1988.
Kersey, P.J., Bower, L., Morris, L., Horne, A., Petryszak, R., Kanz, C., Kanapin, A., Das, U., Michoud, K., Phan, I., et al., 2005. Integr8 and Genome Reviews: integrated views of complete genomes and proteomes. Nucl. Acids Res. 33, D297-D302.
Lefranc, M.P., Giudicelli, V., Kaas, Q., Duprat, E., Jabado-Michaloud, J., Scaviner, D., Ginestoux, C., Clement, O., Chaume, D., Lefranc, G., 2005. IMGT, the international ImMunoGeneTics information system. Nucl. Acids Res. 33, D593-D597.
Leinonen, R., Diez, F.G., Binns, D., Fleischmann, W., Lopez, R., Apweiler, R., 2004. UniProt archive. Bioinformatics 20, 3236-3237.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., Bork, P., 2004. SMART 4.0: towards genomic data integration. Nucl. Acids Res. 32, D142-D144.
Martens, L., Hermjakob, H., Jones, P., Taylor, C., Gevaert, J., Vandekerckhove, J., Apweiler, R., in press. PRIDE: The PRoteomics IDEntifications database. Proteomics, PPP Special Issue.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., Kloos, D.U., Land, S., Lewicki-Potapov, B., Michael, H., Munch, R., Reuter, I., Rotert, S., Saxel, H., Scheer, M., Thiele, S., Wingender, E., 2003. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucl. Acids Res. 31, 374-378.
Mi, H., Lazareva-Ulitsky, B., Loo, R., Kejariwal, A., Vandergriff, J., Rabkin, S., Guo, N., Mruganujan, A., Doremieux, O., Campbell, M.J., Kitano, H., Thomas, P.D., 2005. The PANTHER database of protein families, subfamilies, functions and pathways. Nucl. Acids Res. 33, D284-D288.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., Cerutti, L., et al., 2005. InterPro, progress and status in 2005. Nucl. Acids Res. 33, D201-D205.
Pearl, F., Todd, A., Sillitoe, I., Dibley, M., Redfern, O., Lewis, T., Bennett, C., Marsden, R., Grant, A., Lee, D., et al., 2005. The CATH Domain Structure Database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis. Nucl. Acids Res. 33, D247-D251.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2005. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucl. Acids Res. 33, D501-D504.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., Lopez, R., 2005. InterProScan: protein domains identifier. Nucl. Acids Res. 33, W116-W120.
Schmid, C.D., Praz, V., Delorenzi, M., Perier, R., Bucher, P., 2004. The Eukaryotic Promoter Database EPD: the impact of in silico primer extension. Nucl. Acids Res. 32, D82-D85.
Schomburg, I., Chang, A., Ebeling, C., Gremse, M., Heldt, C., Huhn, G., Schomburg, D., 2004. BRENDA, the enzyme database: updates and major new developments. Nucl. Acids Res. 32, D431-D433.
Tateno, Y., Saitou, N., Okubo, K., Sugawara, H., Gojobori, T., 2005. DDBJ in collaboration with mass-sequencing teams on annotation. Nucl. Acids Res. 33, D25-D28.
Twigger, S., Lu, J., Shimoyama, M., Chen, D., Pasko, D., Long, H., Ginster, J., Chen, C.F., Nigam, R., Kwitek, A., et al., 2002. Nucl. Acids Res. 30, 125-128.

Wain, H.M., Lush, M., Ducluzeau, F., Povey, S., 2002. Nucl. Acids Res. 30, 169-171.
Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Church, D.M., DiCuccio, M., Edgar, R., Federhen, S., Helmberg, W., et al., 2005. Database resources of the National Centre for Biotechnology Information. Nucl. Acids Res. 33, D39-D45.
Wu, C.H., Yeh, L.S., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z., Kourtesis, P., Ledley, R.S., Suzek, B.E., et al., 2003. The protein information resource. Nucl. Acids Res. 31, 345-347.