Proteomics 2004, 4, 1985–1988

DOI 10.1002/pmic.200300721

Technical Brief

The International Protein Index: An integrated database for proteomics experiments

Paul J. Kersey, Jorge Duarte, Allyson Williams, Youla Karavidopoulou, Ewan Birney and Rolf Apweiler

EMBL Outstation, The European Bioinformatics Institute, Hinxton, UK

Despite the complete determination of the genome sequence of several higher eukaryotes, their proteomes remain relatively poorly defined. Information about proteins identified by different experimental and computational methods is stored in different databases, meaning that no single resource offers full coverage of known and predicted proteins. IPI (the International Protein Index) has been developed to address these issues and offers complete non-redundant data sets representing the human, mouse and rat proteomes, built from the Swiss-Prot, TrEMBL, Ensembl and RefSeq databases.
Keywords: Bioinformatics / Databases / Human genome / International protein index / Proteomes

Received 19/8/03
Revised 6/11/03
Accepted 22/11/03

Correspondence: Dr. Paul Kersey, EMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK
E-mail: pkersey@ebi.ac.uk
Fax: +44-1223-494468

Abbreviations: DDBJ, DNA Data Bank of Japan; EMBL, European Molecular Biology Laboratory; IPI, International Protein Index; MGD, Mouse Genome Database; RGD, Rat Genome Database; UniProt, Swiss-Prot plus TrEMBL

© 2004 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim  www.proteomics-journal.de

Proteomics experiments, which aim at the comprehensive determination of the identity, characteristics and interactions of the proteins found in individual cellular systems, can provide information about real proteins in much greater quantities than traditional laboratory approaches. Experiments typically involve the use of mass spectrometry for peptide identification, after which a protein identification algorithm is used to match peptides to known protein sequences. Thus the success of a proteomics experiment carried out on material from a particular species is highly dependent on the prior determination and interpretation of the genome sequence of that species. However, even for well-studied genomes like those of human, mouse and rat, there is no consensus on the gene number, still less on the identity and structure of each gene. Moreover, the results of different gene prediction algorithms, and experimentally determined (mRNA and protein) sequences, are frequently stored in different databases.

Among the databases that store protein sequence information linked to higher eukaryotic genomes are Swiss-Prot [1], a high-quality manually annotated protein knowledgebase; TrEMBL [1], an automatically annotated supplement to Swiss-Prot mainly derived from submissions to the EMBL/GenBank/DDBJ nucleotide sequence databases [2]; Ensembl [3], a database of automatic annotations derived from the human genome sequence; and RefSeq [4], which contains, in different sections, protein sequences linked to experimentally determined mRNAs and predicted sequences derived automatically from the genome. Additional gene-centric resources are also available, such as LocusLink [5], the Mouse Genome Database (MGD) [6], the Rat Genome Database (RGD) [http://rgd.mcw.edu/], and Genew (the human gene nomenclature database) [7].

Data from some of these resources were used to create the first version of the International Protein Index (IPI), a non-redundant human proteome set that was used in the primary analysis of the human genome sequence [8]. Since September 2001, a significantly revised version of IPI has been produced monthly, and it now offers complete, non-redundant protein sets for human, mouse and rat. IPI provides cross-references between the primary data sources and maintains stable identifiers (with incremental versioning) to allow the tracking of sequences between releases.

Each IPI entry represents a cluster of entries from the source databases believed to represent the same protein. One difficulty in creating IPI is that there is no absolute way of telling whether two entries in molecular biology databases represent different biological entities or the same entity rendered differently owing to in silico or experimental artifacts. A human curator maintaining a database such as Swiss-Prot, in which entries believed to represent the same protein are merged, makes a judgment on this issue based on a variety of information. To assemble IPI data sets, we take an automatic and pragmatic approach, building clusters by combining knowledge already present in the primary data sources (and in the cross-references between them) with the results of protein sequence similarity comparisons. After a cluster is assembled, a master entry is chosen from among the cluster members; this entry supplies the IPI entry with its sequence and annotation. Finally, an identifier is chosen for each cluster.

The clusters are assembled by combining the results of sequence similarity comparisons with information derived from pre-existing cross-references as follows:

(i) Entries from a single data source with identical protein sequences are grouped together.

(ii) Pairwise inter-database protein sequence similarity comparisons are performed using the fast matching tool pmatch (R. Durbin, unpublished). pmatch identifies regions of identity between two sequences in excess of a threshold length (set to 20 amino acids for IPI). The identically matching segments belonging to each protein pair are then assembled according to the following criteria: (a) assembled segments must be colinear and non-overlapping in both sequences; and (b) the total gap length in any assembly must not exceed 5%. The quality of the match between each protein pair is defined as the length of the longest legal assembly.

(iii) Reciprocally best-matching pairs of entries (such that if x is an entry from database X, and y is an entry from database Y, and y is x's best match in Y, then x is also y's best match in X) are identified; paired entries are then assembled into clusters by combining the results from each inter-database comparison.

(iv) Protein sequences not reciprocally best matching to any entry from another database are scanned for subfragment matches (i.e., where entry x is 95% identical to entry y over 95% of its length, y is longer than x, and there is no restriction on the percentage of y that is matched by x).

(v) TrEMBL is non-redundant at the level of 100% sequence identity, but additional redundancy is filtered out of TrEMBL by clustering Swiss-Prot and TrEMBL (UniProt) at the 95% sequence identity level (using the program CD-HI [9]).

(vi) Entries from Genew, MGD, and RGD (as appropriate) and LocusLink are each mapped, through the tracking of existing cross-references, to UniProt and RefSeq entries; where there is more than one matching entry, a master is chosen (Swiss-Prot entries are preferred to TrEMBL entries; otherwise the entry with the longest sequence is preferred).

The results of these clustering steps are then aggregated, subject to certain constraints. The aim of the IPI process is to reflect, not contradict, knowledge contained in the source databases: therefore, while similar sequences are generally merged (as described above), exceptions are made where implicit knowledge in the source databases suggests that merging would be incorrect. For example:

(1) Entries from Swiss-Prot (a high-quality manually curated database) are not merged with each other (i.e., it is assumed that distinct proteins are correctly represented within Swiss-Prot).

(2) Entries from any source database with a mapping to a unique gene (as defined in a species-specific database, e.g., Genew, or in LocusLink) are not merged with entries mapped to a different gene.

(3) Ensembl and RefSeq each produce a non-redundant set of protein predictions for a complete genome assembly; therefore, Ensembl entries are not merged with other Ensembl entries, and RefSeq entries are not merged with other RefSeq entries.

(4) These rules only apply to proteins with different sequences. Proteins with identical sequences are always merged into a single IPI entry, even if there is evidence that they are derived from more than one gene.

Master entries are chosen for each cluster according to a hierarchy of the source databases; entries with preferred mappings to LocusLink or species-specific databases are preferred to entries without; and entries identified as subfragments of other entries in the same cluster are not chosen as masters.

Identifier stability is achieved according to the following rules: (i) if a master entry in one IPI release is also a master entry in the next IPI release, then that entry's cluster in the new release inherits the IPI identifier (ID) assigned to that entry's cluster in the previous release; (ii) if a non-master entry in one IPI release becomes a master entry in the subsequent IPI release, and if the master entry of its cluster in the previous release is either not in the subsequent IPI release or remains in the same cluster as the new master entry, then the cluster of the new master inherits the IPI ID assigned to its cluster in the previous release; (iii) if a cluster in a new IPI release has a master entry with the same sequence as a master entry of a cluster in the previous IPI release, and the IPI ID assigned to that old cluster has not been reassigned under the first two rules, it will be assigned to the cluster in the new release with an identical sequence. Thus the principle is that IPI IDs are preferentially inherited by cluster masters, secondarily by non-masters, and finally by sequences. This is illustrated in Table 1 with some examples taken from a recent IPI release.
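Criterion (a) of the segment assembly in step (ii) can be illustrated with a small dynamic-programming sketch. This is not pmatch itself, which is unpublished; it only assumes that exact-match segments are already available as hypothetical (x_start, y_start, length) tuples, and it deliberately leaves out criterion (b), because the text does not state what the 5% gap limit is measured against.

```python
MIN_SEGMENT = 20  # threshold length quoted in the text for IPI


def longest_colinear_assembly(segments, min_len=MIN_SEGMENT):
    """Return the largest total length of a colinear, non-overlapping chain.

    Each segment is (x_start, y_start, length): an exact match of `length`
    residues starting at x_start in sequence x and y_start in sequence y.
    The gap criterion (b) from the text is NOT applied here.
    """
    # Drop segments below the threshold, then order them by start in x so
    # that every possible predecessor of a segment appears before it.
    kept = sorted(s for s in segments if s[2] >= min_len)
    best = []  # best[j] = longest chain that ends with kept[j]
    for j, (xj, yj, lj) in enumerate(kept):
        score = lj
        for i in range(j):
            xi, yi, li = kept[i]
            # kept[i] may precede kept[j] only if it ends before kept[j]
            # starts in BOTH sequences (colinear and non-overlapping).
            if xi + li <= xj and yi + li <= yj:
                score = max(score, best[i] + lj)
        best.append(score)
    return max(best, default=0)


# Hypothetical segments for one protein pair.
segments = [(0, 0, 30), (35, 36, 25), (20, 70, 40), (70, 65, 20)]
quality = longest_colinear_assembly(segments)
print(quality)
```

In this toy input the 40-residue segment is the longest single match, but it cannot be chained with the others without breaking colinearity, so the best assembly is built from the remaining three segments.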
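The reciprocal best-match pairing of step (iii), and the way pairs from several inter-database comparisons combine into clusters, can be sketched as follows. The score dictionaries, accessions and function names are illustrative assumptions, and ties are broken by input order, which the text does not specify.

```python
from collections import defaultdict


def reciprocal_best_pairs(scores):
    """scores maps (x, y) -> match quality for entries x in X and y in Y."""
    best_for_x, best_for_y = {}, {}
    for (x, y), q in scores.items():
        if x not in best_for_x or q > scores[(x, best_for_x[x])]:
            best_for_x[x] = y
        if y not in best_for_y or q > scores[(best_for_y[y], y)]:
            best_for_y[y] = x
    # Keep a pair only when each entry is the other's best match.
    return sorted((x, y) for x, y in best_for_x.items() if best_for_y[y] == x)


def build_clusters(pairs):
    """Merge paired entries into clusters with a simple union-find."""
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for a in list(parent):
        groups[find(a)].add(a)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical match qualities from two inter-database comparisons.
uniprot_vs_refseq = {("P1", "NP_1"): 300, ("P1", "NP_2"): 120, ("P2", "NP_2"): 250}
uniprot_vs_ensembl = {("P1", "ENSP1"): 300, ("P2", "ENSP1"): 90}

pairs = reciprocal_best_pairs(uniprot_vs_refseq) + reciprocal_best_pairs(uniprot_vs_ensembl)
clusters = build_clusters(pairs)
print(clusters)
```

Here P2 matches ENSP1, but ENSP1's best match is P1, so the pair is not reciprocal and P2 stays out of that cluster.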
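The merge constraints (1) to (4) above amount to a veto applied to candidate merges. A minimal sketch follows, assuming a hypothetical Entry record with a source database name, a sequence and an optional unique gene mapping; it reads constraint (3) as forbidding merges within Ensembl and within RefSeq, which is consistent with Table 1, where RefSeq and Ensembl entries share clusters. The accessions and sequences are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    accession: str
    source: str                 # "Swiss-Prot", "TrEMBL", "Ensembl" or "RefSeq"
    sequence: str
    gene: Optional[str] = None  # unique gene mapping, if one exists


# Sources assumed to be internally non-redundant (constraints 1 and 3).
SELF_NONREDUNDANT = {"Swiss-Prot", "Ensembl", "RefSeq"}


def may_merge(a, b):
    """Return False when the constraints forbid merging a and b.

    A True result only means no constraint objects; whether the two
    entries are similar enough to merge is decided separately.
    """
    if a.sequence == b.sequence:
        return True   # constraint 4: identical sequences always merge
    if a.source == b.source and a.source in SELF_NONREDUNDANT:
        return False  # constraints 1 and 3
    if a.gene is not None and b.gene is not None and a.gene != b.gene:
        return False  # constraint 2: mapped to different genes
    return True


# Invented entries, for illustration only.
sp1 = Entry("P11111", "Swiss-Prot", "MKT", gene="G1")
sp2 = Entry("P22222", "Swiss-Prot", "MKV", gene="G1")
tr1 = Entry("Q99999", "TrEMBL", "MKV", gene="G2")
rs1 = Entry("NP_000001", "RefSeq", "MKTA")
en1 = Entry("ENSP00000000001", "Ensembl", "MKTAA")

print(may_merge(sp1, sp2), may_merge(sp2, tr1), may_merge(rs1, en1))
```

Checking identical sequences first mirrors the statement that the other rules only apply to proteins with different sequences.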

Table 1. A case history of some IPI entries

Case 1. Identifier propagated by master a)
  Old release (IPI human v2.26)
    IPI ID:            IPI00170428
    Master entry:      RefSeq NP: NP_060814
    Clustered entries: TrEMBL: Q9WT2, Q9NUS9, Q9BZD2, Q8IVZO
                       Ensembl: ENSP00000287171
  New release (IPI human v2.27)
    IPI ID:            IPI00170428
    Master entry:      RefSeq NP: NP_060814
    Clustered entries: TrEMBL: Q9WT2, Q9NUS9, Q7RTT8, Q9BZD2, Q8IVZO
                       Ensembl: ENSP00000287171

Case 2. Identifier propagated through minor entry after expiry of old master b)
  Old release (IPI human v2.26)
    IPI ID:            IPI00184670
    Master entry:      TrEMBL: Q96IA1
    Clustered entries: TrEMBL: Q01433-2
                       RefSeq NP: NP_631895
  New release (IPI human v2.27)
    IPI ID:            IPI00184670
    Master entry:      TrEMBL: Q01433-2
    Clustered entries: RefSeq NP: NP_631895

Case 3. Identifier propagated by sequence c)
  Old release (IPI human v2.26)
    IPI ID:            IPI00013763
    Master entry:      TrEMBL: Q9BE80
    Clustered entries: Ensembl: ENSP000026661
  New release (IPI human v2.27)
    IPI ID:            IPI00013763
    Master entry:      SWISS-PROT: Q8N1M1-4
    Clustered entries: RefSeq NP: NP_116124
                       Ensembl: ENSP00000266661

a) Identifier propagated by master. In IPI human release 2.26, many entries probably representing equilibrative nucleoside transporter 3 cluster together. In release 2.27, a new TrEMBL entry (Q7RTT8) also clusters with these entries. As the new cluster shares a master with a cluster in the old release (RefSeq NP_060814), the IPI ID is transferred.
b) Identifier propagated through minor entry after expiry of old master. TrEMBL splice isoform Q01433-2 is a non-master cluster entry in IPI human release 2.26 (the master is TrEMBL entry Q96IA1). In release 2.27, this isoform has become the master of its own cluster. Q96IA1 no longer exists (in the same or any other cluster), so the new cluster can inherit the IPI ID, even though Q01433-2 was not a master in the previous release.
c) Identifier propagated by sequence. Swiss-Prot splice isoform Q8N1M1-4 was not present in IPI human release 2.26, but is a cluster master in release 2.27. The new cluster inherits the IPI ID of the old cluster whose master was Q9BE80, because Q9BE80 no longer exists in IPI, but has the same sequence as Q8N1M1.
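The three inheritance rules can be sketched as three passes over the clusters of a new release, using the cases of Table 1 as input. The Cluster record and the function name are hypothetical, the sequence strings are placeholders (the real sequences are not given here), and the check that an ID is handed out at most once is an added assumption for rule (ii), where the text does not say how competing claims are resolved.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cluster:
    master: str                 # accession of the master entry
    members: frozenset          # all accessions in the cluster, master included
    master_seq: str             # sequence of the master entry
    ipi_id: Optional[str] = None


def propagate_ids(old, new):
    """Return the inherited IPI ID (or None) for each cluster in `new`."""
    old_by_master = {c.master: c for c in old}
    old_by_member = {m: c for c in old for m in c.members}
    old_by_seq = {}
    for c in old:
        old_by_seq.setdefault(c.master_seq, c)
    new_index_of = {m: i for i, c in enumerate(new) for m in c.members}

    assigned, used = {}, set()

    def give(i, prev):
        assigned[i] = prev.ipi_id
        used.add(prev.ipi_id)

    # Rule (i): the master was already a master in the old release.
    for i, c in enumerate(new):
        prev = old_by_master.get(c.master)
        if prev is not None and prev.ipi_id not in used:
            give(i, prev)

    # Rule (ii): the master was a non-master before, and the old master has
    # either disappeared or now sits in the same cluster as the new master.
    for i, c in enumerate(new):
        if i in assigned:
            continue
        prev = old_by_member.get(c.master)
        if prev is None or prev.ipi_id in used:
            continue
        old_master_now = new_index_of.get(prev.master)
        if old_master_now is None or old_master_now == i:
            give(i, prev)

    # Rule (iii): same master sequence as an old cluster with an unclaimed ID.
    for i, c in enumerate(new):
        if i in assigned:
            continue
        prev = old_by_seq.get(c.master_seq)
        if prev is not None and prev.ipi_id not in used:
            give(i, prev)

    return [assigned.get(i) for i in range(len(new))]


# Table 1 cases; accessions as printed, sequences are placeholders.
old_release = [
    Cluster("NP_060814", frozenset({"NP_060814", "Q9NUS9", "Q9BZD2"}), "SEQ_A", "IPI00170428"),
    Cluster("Q96IA1", frozenset({"Q96IA1", "Q01433-2", "NP_631895"}), "SEQ_B1", "IPI00184670"),
    Cluster("Q9BE80", frozenset({"Q9BE80", "ENSP000026661"}), "SEQ_C", "IPI00013763"),
]
new_release = [
    Cluster("NP_060814", frozenset({"NP_060814", "Q9NUS9", "Q7RTT8", "Q9BZD2"}), "SEQ_A"),
    Cluster("Q01433-2", frozenset({"Q01433-2", "NP_631895"}), "SEQ_B2"),
    Cluster("Q8N1M1-4", frozenset({"Q8N1M1-4", "NP_116124", "ENSP00000266661"}), "SEQ_C"),
]

inherited = propagate_ids(old_release, new_release)
print(inherited)
```

Running the passes in this order is what gives masters priority over non-masters, and non-masters priority over sequence identity. A cluster that inherits nothing would need a fresh identifier, which this sketch leaves as None.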

Each source database identifier can only appear once in an IPI set. Separate sequences are provided for alternatively spliced isoforms (with cross-references to isoform-specific identifiers in the source databases). Proteins with identical sequence but differential post-translational modification are not yet individually represented within IPI, as these are not yet generally well identified in the source databases. IPI provides a species-specific, complete and non-redundant dataset particularly suited to supporting protein identification in proteomics experiments. Its sequence- and identifier-based construction eliminates the need for manual filtering of redundant results in protein identification, while maintaining cross-references to the source data. IPI is produced monthly and is available at http://www.ebi.ac.uk/IPI. Statistics for the IPI releases of December 2003 are provided in Table 2.

Table 2. Composition of IPI releases, December 2003

Species                                                        Human     Mouse     Rat
IPI entries                                                    39 440    40 265    33 119
Source database entries referenced by IPI                      99 399    95 645    59 014
IPI entries matching UniProt, Ensembl and RefSeq entries       18 070    16 991     5 080
IPI entries matching entries from any 2 source databases a)     8 657     8 174    12 914
IPI entries matching entries from only 1 source database a)    12 713    15 110    15 325

a) Counting UniProt (Swiss-Prot plus TrEMBL) and RefSeq (curated and predicted) as single databases


The authors would like to thank Henning Hermjakob for his comments on the manuscript. This work has in part been funded by EU grant number QLRI-CT-2001-00015 under the RTD program Quality of Life and Management of Living Resources.

References
[1] Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M. et al., Nucleic Acids Res. 2003, 31, 365–370.
[2] Stoesser, G., Baker, W., van den Broek, A., Garcia-Pastor, M. et al., Nucleic Acids Res. 2003, 31, 17–22.
[3] Clamp, M., Andrews, D., Barker, D., Bevan, P. et al., Nucleic Acids Res. 2003, 31, 38–42.
[4] Pruitt, K. D., Tatusova, T., Maglott, D. R., Nucleic Acids Res. 2003, 31, 34–37.
[5] Pruitt, K. D., Maglott, D. R., Nucleic Acids Res. 2001, 29, 137–140.
[6] Blake, J. A., Richardson, J. E., Bult, C. J., Kadin, J. A. et al., Nucleic Acids Res. 2002, 30, 113–115.
[7] Wain, H. M., Lush, M., Ducluzeau, F., Povey, S., Nucleic Acids Res. 2002, 30, 169–171.
[8] The Genome International Sequencing Consortium, Nature 2001, 409, 860–921.
[9] Li, W., Jaroszewski, L., Godzik, A., Bioinformatics 2001, 17, 282–283.
