quantities of data that have been generated from high-throughput proteomic applications. The bottleneck of data validation and placement of the information obtained into sound biological context urgently needs to be addressed. Here, we review the issues that arise when analysing large quantities of data generated by liquid chromatography mass spectrometry, offer potential solutions for data management and predict the future direction of large-scale data analysis by mass spectrometry.

Keiryn L. Bennett*, Jan C. Brønd, Dan B. Kristensen, Alexandre V. Podtelejnikov and Jacek R. Wiśniewski†
MDS Denmark, Staermosegaardsvej 6, Odense M, DK-5230, Denmark
*e-mail: kbennett@mdsdenmark.com
†e-mail: jwisniewski@mdsdenmark.com

In the age of high-throughput proteomics, ‘more is better’ became the catch cry of the biotechnology and pharmaceutical industry. Companies have invested vast amounts of capital in state-of-the-art mass spectrometers, robotics to prepare samples for analysis, and automated systems to generate large data sets. These investments have neither been cost-effective nor significantly aided the pharmaceutical industry in discovering new targets. Mass spectrometry is a core technology in proteomic research; however, large-scale proteome characterization has led to serious bottlenecks in data interpretation.

The situation is eloquently captured by a recent quote from Scott Patterson, “our ability to generate data now outstrips our ability to analyse it” [1], and reflects the current status for industry and academia alike. There is now an evolution towards statistically sound, biologically relevant data sets: ‘quality, not quantity’ is the reality. Efforts are concentrated on producing focused data sets aimed at resolving the complexity of biological samples. This is being achieved by a synergistic relationship among sample preparation, mass spectrometry (MS) analysis of the

Priorities of large-scale proteomic analyses
The predominant methods of large-scale data generation are liquid chromatography mass spectrometry (LC–MS) of complex protein mixtures [2–5], and matrix-assisted laser desorption ionization (MALDI) MS profiling of tissue sections [6,7] and sera [8]. The current needs of the industry with respect to the information required from large-scale proteomic analyses are the detection of pharmaceutically relevant proteins. These include proteins involved in signal transduction and the effects of inhibitors or activators on the proteins; kinases and phosphatases involved in pathway modulation; and receptors and associated ligands involved in intra- and intercellular signalling. Identification of proteins such as transcription factors, cell-surface proteins, secreted proteins and biomarkers is a high priority for the generation of drug targets, diagnosis of disease and assessment of drug toxicity. Within these classes of proteins, the challenge is to identify low-abundance proteins and to provide quantitative information.

Low-abundance proteins
Owing to the inherent nature of cellular complexity, the detection and identification of low-abundance proteins by MS is extremely challenging. To increase the chance that a protein of low copy number will be detected, it is necessary to reduce the complexity of the sample. There are several approaches whereby sample complexity can be simplified. One method is to fractionate tissue and cells into subcellular components, such as plasma membrane [9–11], mitochondria [12], nucleoli [13] and lipid rafts [14,15]. A second
1741-8372/04/$ – see front matter ©2004 Elsevier Ltd. All rights reserved. PII: S1741-8372(04)02412-0 www.drugdiscoverytoday.com S43
reviews | mass spectrometry in proteomics supplement DDT: TARGETS Vol. 3, No. 2 (Suppl.), 2004
analysis of total cell lysate gives a ‘snap shot’ of the most abundant proteins, whereas ‘hypothesis-driven proteomics’ (e.g. specific pull-down experiments or subcellular fractionation) provides insight into the mechanism or function of a particular biological system.

Figure 1. Simplified outline of the experimental steps and flow of the data in a typical high-throughput mass spectrometry (MS)-based analysis of complex protein mixtures. Each sample protein (yellow circle) is cleaved into smaller peptides (yellow squares), which can be either unique to that protein (unbroken arrows) or shared with other sample proteins (broken arrows). The peptides are then ionized, and selected ions are fragmented to produce tandem MS (MS/MS) spectra. Some peptides are selected for fragmentation several times (broken arrows), whereas some are not selected even once. Each acquired MS/MS spectrum is searched against a sequence database and assigned a best-matching peptide, which might be correct (yellow squares) or incorrect (black square). The database search results are then manually or statistically validated. The list of identified peptides is used to infer which proteins are present in the original sample (yellow circles), and which are false identifications (black circles) corresponding to incorrect peptide assignments. The process of inferring protein identities is complicated by the presence of degenerate peptides corresponding to more than a single entry in the protein sequence database (broken arrows). Adapted, with permission, from Ref. [33], © (2003) American Chemical Society.

From an analytical angle, the quality of the data produced from the mass spectrometer is of fundamental importance. Low-resolution three-dimensional ion traps are extremely popular and well-suited to high-throughput LC–MS; however, the charge state of a precursor ion cannot be distinguished when operating in full-scan mode. The combination of a quadrupole mass selector and collision cell with orthogonal acceleration has led to high resolution (~10 000) and mass accuracy (5–20 ppm with internal calibration). Charge states are more readily assigned, which is an important advancement in minimizing false protein identifications. Fourier transform ion cyclotron resonance MS provides the ultimate performance with a mass accuracy of 1–5 ppm; however, such instruments are beyond the budget of most laboratories.

Ultimately, intelligent software solutions are vital to the continuation of proteomics. Areas of importance include (i) adapting current algorithms and developing new algorithms to process and search data, (ii) maximizing information extraction, (iii) semiautomatic to automatic data validation and generation of statistically sound data, and (iv) data management and curation.

Data processing and search algorithms
There have been numerous attempts to determine the optimal means of extracting information from MS-generated data and to process the information into a format that is compatible with search engine analysis. Effective automated transfer of MS data to informatic programs is dependent on the reliable performance and accuracy of peak-assignment algorithms. Most instrument vendors provide proprietary algorithms as part of the acquisition software, and there have been considerable efforts by independent groups to develop alternative mathematical approaches to maximize information extraction from the data.

A lack of fully automated data analysis systems for protein identification from complex mixtures requires the development of more rigorous search and scoring algorithms. Recently, Colinge et al. [29] introduced a scoring scheme named OLAV. This scheme is similar to established probability-based search engines, but the scoring has been markedly improved by including structural information from consecutive fragment ion matches; that is, it is an extension of the sequence tag concept. The overall benefit of OLAV compared with other probability-based search engines is the improved selection of correct peptide assignments based purely on the score returned by the search engine.

Other algorithms, including Protein Prospector (http://www.ucsf.edu), ProbID [30] and Sequest [31], are under further development with the aim of providing more reliable significance thresholds. It still remains to be seen, however, whether putative matches close to the threshold can be unequivocally trusted.

Maximal extraction of information
Multiple analyses of the same sample by MS via an ‘exclusion list’ approach [32] and iterative database analyses of the same data are being combined in an approach to extract information exhaustively from a single biological experiment (MDS Denmark).

During LC–MS analysis of a mixture of proteins on a quadrupole time-of-flight (TOF) instrument, the TOF region can resolve significantly more precursor ions than the mass spectrometer can select and fragment; therefore, a single LC–MS acquisition is insufficient to analyse all precursor ions comprehensively. Multiple LC–MS analyses of the same sample and the generation of an exclusion list between consecutive analyses provide a means whereby only unique precursor ions are fragmented. After each LC–MS analysis, the m/z ratio and retention time of all of the precursor ions selected for tandem MS (MS/MS) are added to an exclusion list. These peptides are disqualified from selection in subsequent analyses, and peptides that were not previously selected are fragmented. The greater the number of peptides selected for fragmentation by the mass spectrometer, the greater the number of peptides that will match a given protein, thereby increasing the sequence coverage and the probability of a correct identification. Following data generation, the files are iteratively searched through a combination of static and dynamic databases. The aim is to extract more information (e.g. posttranslational modifications, alternatively spliced proteins and point mutations) from a single experiment.

The analysis of a complex biological mixture by LC–MS usually involves the production of several fractions; each fraction is analysed and the protein database is searched. This represents linear data generation and analysis (Figure 2a). Alternatively, each fraction is analysed several times by LC–MS and the data are cycled through several databases (Figure 2b). Thus, the quantity of data that can be generated from a single biological experiment exponentially explodes.

Data validation and statistical analysis
An experimental peptide identification repository (EPIR) is a recently developed database system that facilitates the validation and statistical analysis of data generated by LC–MS. One of the software modules enables automated peptide validation. To enhance the sensitivity of true identification and to avoid false positives, additional parameters related to the identification are computed. An important parameter is the number of sibling peptides (NSP) [33]. NSP is the total number of unique peptides identifying a protein (or protein group) and is a strong indicator of the probability that the peptide identification is correct. Similar to OLAV, a score is calculated for consecutive y- and b-ion fragment matches. Information concerning specific fragment ions (e.g. proline), missed cleavage sites and quantitative information (if available) are also scored. MS/MS spectra that are rejected by autovalidation can be manually validated directly from the raw data by generating a sequence tag, or reassessed by the breakpoint algorithm with alternative parameters.

PeptideProphet (http://www.systemsbiology.org) [34] is an open source software program that facilitates automatic validation of proteomic data. Based on an empirical statistical model, a sensitivity threshold for correct and incorrect peptide identification can be selected. As would be expected, a higher sensitivity threshold will increase the number of false-positive identifications. Thus, the gain obtained from automatic validation is lost because of the need for additional manual validation to ensure that the peptides and proteins returned are indeed correct.

ProteinProphet [33] is a tool for protein validation based on the output of confirmed peptides from PeptideProphet. Identified peptides corresponding to the same protein are combined using a statistical model to estimate the probability that the protein is present. The number of sibling peptides is also estimated and the information is used to improve peptide validation.

Another approach for peptide validation is to analyse the data by an alternative algorithm, such as sequence tag [35], after the initial first-round analysis using a breakpoint algorithm. Confirmation of peptides by two unrelated algorithms will increase the probability that the protein identification is correct.

Data management and curation
Combining all of the information generated from a large-scale MS investigation in a simple fashion that not only allows peptide and protein validation, but also grouping of the results, reassembly of peptide information into proteins, and data reporting in a format that is simple to understand, is not a trivial task. There are several filtering and visualization programs that attempt to simplify large MS data sets [36–38].

The EPIR database containing validated and confirmed peptides has a series of modular interactive software tools that overlay the peptide database. These tools include (i) grouping of peptides to related proteins, (ii) addition of quantitative data, (iii) extraction of statistical information from within a single experiment, across multiple MS analyses and different experiments, and (iv) data collation into a simple format showing, for example, protein sequence, confirmed peptides and database accession identifiers.
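The grouping and collation steps listed for the EPIR tools can be illustrated with a minimal Python sketch. It is not the EPIR implementation, which is not described here. The function name `group_peptides`, the plain substring match used to map peptides to proteins, and the toy sequences are all assumptions made for illustration. NSP is counted as defined in the text, that is, the number of unique confirmed peptides per protein.

```python
from collections import defaultdict


def group_peptides(confirmed, proteins):
    """Group confirmed peptides by protein (hypothetical helper, not EPIR code).

    confirmed: iterable of peptide sequences (duplicates allowed).
    proteins:  dict mapping accession -> protein sequence.
    Returns a dict mapping accession -> peptides, NSP and sequence coverage.
    """
    matches = defaultdict(set)
    for peptide in set(confirmed):          # count each unique peptide once
        for accession, sequence in proteins.items():
            if peptide in sequence:         # assumption: exact substring match
                matches[accession].add(peptide)

    report = {}
    for accession, peptides in matches.items():
        sequence = proteins[accession]
        covered = [False] * len(sequence)
        for peptide in peptides:
            start = sequence.find(peptide)
            while start != -1:              # mark every occurrence as covered
                for i in range(start, start + len(peptide)):
                    covered[i] = True
                start = sequence.find(peptide, start + 1)
        report[accession] = {
            "peptides": sorted(peptides),
            "nsp": len(peptides),           # unique peptides identifying the protein
            "coverage": sum(covered) / len(sequence),
        }
    return report


if __name__ == "__main__":
    toy_proteins = {"P1": "MKTAYIAKQR", "P2": "GGGAYIAKTT"}   # invented sequences
    toy_peptides = ["MKTAY", "AYIAK", "AYIAK"]                # AYIAK is shared
    for accession, entry in sorted(group_peptides(toy_peptides, toy_proteins).items()):
        print(accession, entry)
```

In this toy input the peptide AYIAK maps to both entries, which is the degenerate-peptide situation described in the Figure 1 caption. A real tool would have to decide how to apportion such shared peptides between proteins, which this sketch does not attempt.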
Figure 2. Proteomic hierarchy: from cell or tissue extract to peptide or protein identification and data management. (a) Generic linear analysis by
liquid chromatography mass spectrometry (LC–MS) and database searching. Cell or tissue extracts are fractionated into subcellular components
and then further fractionated at the protein or peptide level. Each sample is analysed once by LC–MS, and the data generated are searched
against a single database. (b) Advanced exhaustive data production and database analysis. Cell or tissue extracts are fractionated into
subcellular components, such as plasma membrane, and then further fractionated at the protein or peptide level by, for example, off-line reversed-
phase high-performance liquid chromatography (rpHPLC). Each sample is analysed three times by LC–MS via the exclusion list approach, and for
each analysis the data are searched iteratively against a series of databases. The quantity of data generated from a single biological experiment
exponentially explodes through the application of multiple LC–MS analyses and multiple database searches (combined static and dynamic
databases). Peptide-centric databases for storing and mining LC/MS/MS data hold the key to large-scale MS data consolidation.
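The exclusion list approach used in panel (b) can be sketched in a few lines of Python. This is an idealized illustration, not the acquisition software used by the authors. It assumes that the same precursor ions are detected in every run, that the instrument picks the most intense eligible ions first, and that matching uses fixed m/z and retention-time tolerances. The function names, the tolerance values and the ion list are all hypothetical.

```python
def select_precursors(detected, exclusion, max_selected, mz_tol=0.05, rt_tol=60.0):
    """Pick up to max_selected precursors that are not on the exclusion list.

    detected:  list of (mz, retention_time_s, intensity) tuples.
    exclusion: list of (mz, retention_time_s) pairs from earlier runs.
    mz_tol and rt_tol are illustrative tolerances, not published values.
    """
    def is_excluded(mz, rt):
        return any(abs(mz - ex_mz) <= mz_tol and abs(rt - ex_rt) <= rt_tol
                   for ex_mz, ex_rt in exclusion)

    eligible = [ion for ion in detected if not is_excluded(ion[0], ion[1])]
    # Assumption: the most intense eligible ions are fragmented first.
    eligible.sort(key=lambda ion: ion[2], reverse=True)
    return eligible[:max_selected]


def run_exclusion_analyses(detected, n_runs=3, max_selected=2):
    """Simulate repeated LC-MS runs, growing the exclusion list after each run."""
    exclusion = []
    runs = []
    for _ in range(n_runs):
        chosen = select_precursors(detected, exclusion, max_selected)
        runs.append(chosen)
        # m/z and retention time of every fragmented precursor are excluded next time.
        exclusion.extend((mz, rt) for mz, rt, _ in chosen)
    return runs


if __name__ == "__main__":
    ions = [  # invented precursor ions: (m/z, retention time in s, intensity)
        (500.25, 600.0, 9e5),
        (612.80, 615.0, 7e5),
        (433.10, 900.0, 5e5),
        (720.40, 1200.0, 3e5),
        (388.70, 1500.0, 1e5),
    ]
    for number, run in enumerate(run_exclusion_analyses(ions), start=1):
        print("run", number, [mz for mz, _, _ in run])
```

With a limit of two fragmentation events per run, the five toy ions are all fragmented only after the third run, which mirrors the argument that a single acquisition cannot sample every precursor.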
Similarly, Interact (from the Institute for Systems Biology) [36] allows rapid and flexible interrogation and analysis of multiple data sets including filtering, unfiltering, sorting, grouping and highlighting of the data by user-controlled criteria. Comparisons across experiments are achieved using a tool that highlights similarities and differences at either the peptide or the protein level. Interact interfaces with quantitative software to provide the average relative abundance and standard deviation for each protein.

The potential issues outlined in this section to assist in the analysis of large-scale MS data sets are multifaceted, highly interdependent and cross-functional. The demand to formulate intelligible answers from MS-generated data will continue to direct advancement in sample preparation, MS and software development.

Perspectives
Large-scale data production by MS and analysis of the results in a biological context are still the domain of specialists who are aware of the disadvantages and restrictions of the technology. Expert knowledge is heavily exploited when considering the limitations of an experiment and the answers obtained. As MS and proteomics become increasingly accessible to a broader range of researchers – medical clinicians, for example – it is imperative that there is sufficient support to assist ‘novice’ scientists in formulating realistic conclusions from the data. Without assistance, an inadequate understanding of the technology will severely impact the reliability and significance of the results.

The future demand for software and statistical analysis of MS data will continue to grow as increasingly more laboratories realize the need for exhaustive data analysis rather than excessive data production. Quantitation adds an extra dimension of complexity to the data that cannot be addressed generically by current software tools. Therefore, further software development in this area is essential. The integration of visionary, well-formulated biological theories with MS, plus access to simple and comprehensible software solutions, is the key to bringing the realm of MS and proteomics successfully within the reach of everyday scientists.

References
1 Patterson, S.D. (2003) Data analysis – the Achilles heel of proteomics. Nat. Biotechnol. 21, 221–222
2 Lasonder, E. et al. (2002) Analysis of the Plasmodium falciparum proteome by high-accuracy mass spectrometry. Nature 419, 537–542
3 Florens, L. et al. (2002) A proteomic view of the Plasmodium falciparum life cycle. Nature 419, 520–526
4 Washburn, M.P. et al. (2001) Large scale analysis of the yeast proteome via multidimensional protein identification technology. Nat. Biotechnol. 19, 242–247
5 Lipton, M.S. et al. (2002) Global analysis of the Deinococcus radiodurans R1 proteome by using accurate mass tags. Proc. Natl. Acad. Sci. U. S. A. 99, 11049–11054
6 Caprioli, R.M. et al. (1997) Molecular imaging of biological samples: localisation of peptides and proteins using MALDI-TOF-MS. Anal. Chem. 69, 4751–4760
7 Yanagisawa, K. et al. (2003) Proteomic patterns of tumour subsets in non-small-cell lung cancer. Lancet 362, 433–439
8 Marshall, J. et al. (2003) Processing of serum proteins underlies the mass spectral fingerprinting of myocardial infarction. J. Proteome Res. 2, 361–372
9 Adam, P.J. et al. (2003) Comprehensive proteomic analysis of breast cancer cell membranes reveals unique proteins with potential roles in clinical cancer. J. Biol. Chem. 278, 6482–6489
10 Blonder, J. et al. A detergent- and cyanogen bromide-free method for integral membrane proteomics: application to Halobacterium purple membranes and human epidermis. Proteomics (in press)
11 Olsen, J.V. et al. (2004) HysTag – a novel proteomic quantification tool applied to differential display analysis of membrane proteins from distinct areas of mouse brain. Mol. Cell. Proteomics 3, 82–92
12 Mootha, V.K. et al. (2003) Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell 115, 629–640
13 Andersen, J.S. et al. (2002) Directed proteomic analysis of the human nucleolus. Curr. Biol. 12, 1–11
14 Nebl, T. et al. (2002) Proteomic analysis of a detergent-resistant membrane skeleton from neutrophil plasma membranes. J. Biol. Chem. 277, 43399–43409
15 Foster, L.J. et al. (2003) Unbiased quantitative proteomics of lipid rafts reveals high specificity for signalling factors. Proc. Natl. Acad. Sci. U. S. A. 100, 5813–5818
16 Neubauer, G. et al. (1998) Mass spectrometry and EST-database searching allows characterisation of the multiprotein spliceosome complex. Nat. Genet. 20, 46–50
17 Taylor, S.W. et al. (2003) Characterisation of the human heart mitochondrial proteome. Nat. Biotechnol. 21, 281–286
18 Ficarro, S.B. et al. (2002) Phosphoproteome analysis by mass spectrometry and its application to Saccharomyces cerevisiae. Nat. Biotechnol. 20, 301–305
19 Bunkenborg, J. et al. (2004) Screening for N-glycosylated proteins by liquid chromatography mass spectrometry. Proteomics 4, 454–465
20 Gygi, S.P. et al. (1999) Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat. Biotechnol. 17, 994–999
21 Ong, S.E. et al. (2002) Stable isotope labelling by amino acids in cell culture, SILAC, as a simple and accurate approach to expression proteomics. Mol. Cell. Proteomics 1, 376–386
22 Conrads, T.P. et al. (2001) Quantitative analysis of bacterial and mammalian proteomes using a combination of cysteine affinity tags and 15N-metabolic labelling. Anal. Chem. 73, 2132–2139
23 Blagoev, B. et al. (2003) A proteomics strategy to elucidate functional protein–protein interactions applied to EGF signalling. Nat. Biotechnol. 21, 315–318
24 Yao, X. et al. (2001) Proteolytic 18O labelling for comparative proteomics: model studies with two serotypes of adenovirus. Anal. Chem. 73, 2836–2842
25 Eng, J.K. et al. (1994) An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom. 5, 976–989
26 Perkins, D.N. et al. (1999) Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis 20, 3551–3567
27 Clauser, K.R. et al. (1999) Role of accurate mass measurement (±10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem. 71, 2871–2882
28 Field, H.I. et al. (2002) RADARS, a bioinformatics solution that automates proteome mass spectral analysis, optimises protein identification, and archives data in a relational database. Proteomics 2, 36–47
29 Colinge, J. et al. (2003) OLAV: towards high-throughput tandem mass spectrometry data identification. Proteomics 3, 1454–1463
30 Zhang, N. et al. (2002) ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data. Proteomics 2, 1406–1412
31 MacCoss, M.J. et al. (2002) Probability-based validation of protein identifications using a modified SEQUEST algorithm. Anal. Chem. 74, 5593–5599
32 Kristensen, D.B. et al. (2003) Multiple LC–MS exclusion list analyses: a tool to enhance protein identification from complex biological samples. Abstract at 51st ASMS Conference on Mass Spectrometry and Allied Topics (http://www.inmerge.com/aspfolder/ASMSAbstracts.html), A031566
33 Nesvizhskii, A.I. et al. (2003) A statistical model for identifying proteins by tandem mass spectrometry. Anal. Chem. 75, 4646–4658
34 Keller, A. et al. (2002) Empirical statistical model to estimate the accuracy of peptide identifications made by MS/MS and database search. Anal. Chem. 74, 5383–5392
35 Mann, M. and Wilm, M. (1994) Error-tolerant identification of peptides in sequence databases by peptide sequence tags. Anal. Chem. 66, 4390–4399
36 Han, D.K. et al. (2001) Quantitative profiling of differentiation-induced microsomal proteins using isotope-coded affinity tags and mass spectrometry. Nat. Biotechnol. 19, 946–951
37 Tabb, D.L. et al. (2002) DTASelect and contrast: tools for assembling and comparing protein identifications from shotgun proteomics. J. Proteome Res. 1, 21–26
38 Eddes, J.S. (2002) CHOMPER: a bioinformatic tool for rapid validation of tandem mass spectrometry search results associated with high-throughput proteomic strategies. Proteomics 2, 1097–1103