Introduction
With the rapid proliferation of proteomic data, there is a need
for tools that allow computational data mining and visualization of complex data sets. There are many software packages
available for processing proteomics data and displaying results
(for a recent review, see Palagi et al.).1 However, there is a paucity
of visualization tools that are simple and easily adaptable to
evolving proteomic data formats. Visualization tools combine
several sources of information for intelligent data mining. The
human eye is particularly suited to identifying complex patterns
and features, provided that the information is presented in a
structured visual way and limited to a few patterns at a time.
Red-green gene expression heat maps serve as an example
of a simple yet effective method of representing complex
data.2
Proteins exist in plasma and tissue sources in multiple forms
that result from alternative splicing (isoforms), precursor/
mature protein combinations, or different patterns of glycosylation. Most proteins are secreted as precursor proteins from
which biologically active forms are generated upon proteolytic
cleavage (e.g., see Khatib and Geraldine).3 For biomarker
discovery, it is important to assess the presence of isoforms
that may differ in their levels in a disease-related manner, as in
the case of phosphorylation and glycosylation, among numerous post-translational modifications. We present here a visualization approach for multidimensional proteomic data to
assist in the search for protein isoforms, precursor/mature
protein combinations, and fragments. Along with the visualization tool, we also describe a simple pattern recognition
algorithm that we developed to automatically detect and flag
potentially heterogeneous species of proteins in proteomic
experiments that involve extensive fractionation and result in
a large number of identified proteins in one experiment.
* To whom correspondence should be addressed. Tel: (206) 667-1250, fax: (206) 667-2537, E-mail: akrasnos@fhcrc.org.
Methods
Protein Separation and Mass Spectrometry Analysis. Serum
and plasma protein samples were subjected to fractionation
followed by LC-MS/MS analysis of tryptic digests from individual fractions. The full procedure, designated the Intact Protein
Analysis System (IPAS), has been previously described by Faca
et al.4 Briefly, after immunodepletion, acrylamide-labeled
samples5 were fractionated by anion-exchange chromatography into 12 fractions
and subsequently by reversed-phase chromatography into 12 fractions, giving a total of 144 fractions that were analyzed individually
by shotgun LC-MS/MS. In-solution tryptic digestion was
performed overnight with lyophilized aliquots from the reversed-phase (second dimension) fractionation step. The resulting
peptide mixtures were analyzed by an LTQ-FTICR mass spectrometer (Thermo-Finnigan) coupled with a NanoAcquity nanoflow chromatography system (Waters). Spectra were acquired in data-dependent mode in the m/z range of 400-1800,
including selection of the 5 most abundant +2 or +3 ions of
each MS spectrum for MS/MS analysis. Acquired data were
automatically processed by the Computational Proteomics
Analysis System (CPAS)6 pipeline. This pipeline includes the
X!Tandem search algorithm7 with the comet score module plug-in,8 PeptideProphet9 peptide validation, and the ProteinProphet10
protein inference tool. The tandem mass spectra were searched
against version 3.12 of the human IPI database.11 All identifications with a PeptideProphet probability greater than 0.75 were
selected and the subsequent protein identifications were
filtered at a 5% error rate.
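The peptide-level selection step described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory record type (`PeptideID`) and made-up example records; it is not the CPAS pipeline itself, and it covers only the probability cut-off, not the protein-level error-rate filter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PeptideID:
    """Hypothetical record for one peptide identification."""
    sequence: str
    protein: str
    fraction: Tuple[int, int]  # (anion-exchange index, reversed-phase index)
    probability: float         # PeptideProphet probability


MIN_PROBABILITY = 0.75  # threshold stated in the text


def filter_identifications(ids: List[PeptideID],
                           min_probability: float = MIN_PROBABILITY) -> List[PeptideID]:
    """Keep identifications with probability strictly greater than the threshold."""
    return [p for p in ids if p.probability > min_probability]


# Made-up records, for illustration only.
example = [
    PeptideID("PEPTIDEA", "IPI_A", (1, 1), 0.98),
    PeptideID("PEPTIDEB", "IPI_A", (1, 2), 0.75),  # not greater than 0.75: dropped
    PeptideID("PEPTIDEC", "IPI_B", (2, 3), 0.40),
]
kept = filter_identifications(example)
```

The comparison is strict because the text specifies a probability "greater than 0.75".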
Heterogeneity Detection Algorithm. The concept behind
cluster detection is as follows. For each protein (single IPI or a
protein group of multiple IPI numbers considered to represent
the same protein), the data were assembled into an n × m grid
of fractions, where n corresponds to the number of fractions
derived in ion-exchange chromatography (represented on the
X-axis) and m corresponds to the number of fractions derived
in RP-HPLC (represented on the Y-axis). The dimensions for
the two data sets used in this article are 12 × 12 for one data
10.1021/pr7007219 CCC: $40.75
technical notes
Figure 1. Visualization of proteomic data in 2-D fractionation experiments with differential sample labeling. The data shown are for
protein HFAC (hepatocyte growth factor activator). (A) The peptide and ratio map of the 2-D chromatography fractionation. The grid
represents the 2-D chromatography fractionation (12 × 11 fractions). The X-axis represents 12 fractions of ion-exchange chromatography
and the Y-axis, 11 fractions of RP HPLC. Each node of the grid shows the fraction location. The peptides are shown as concentric circles
of different colors (the full list of identified peptides is shown in the inset), whereby the size of the circle indicates the relative distance
of the peptide from the N-terminus of the full protein sequence; that is, circle size follows the sequential order of the
peptides starting from the N-terminus. The range for each peptide represents the starting and ending position in the protein sequence,
scaled to 0-1. Values are provided for ratios between two samples being compared based on differential acrylamide labeling,5 where
case samples are labeled with C13 acrylamide and control samples with C12 acrylamide. (B) Histogram of the ratios obtained for this
protein in an experiment in which a comparison is made between two samples (in all 132 fractions). (C) Total MS events map for 2-D
separation. Each node of the grid shows the number of MS events summed across all peptides, and the size of the circle reflects
that number visually.
Journal of Proteome Research Vol. 7, No. 6, 2008 2547
Krasnoselsky et al.
protein group of multiple IPI numbers considered to represent
the same protein), the n × m data matrix of fractions (as
described above), along with the ratio vector and a vector of the
number of spectral events for each peptide in each fraction, are
passed to the software. A vector of sequence
position information, scaled to 0-1, is passed to the application as well. All
preprocessing of the data is accomplished prior to passing the
data to the visualization tool. The outputs of the application
include three figures, saved as picture files (jpeg format): the
figure that combines fractionation, ratio, and peptide sequence
information (such as Figure 1A), histogram of the ratios (if
available, see Figure 1B), and a figure of total spectral events
for each fraction (such as Figure 1C). The Matlab code for the
application is available upon request from the author.
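As an illustration of the inputs described above, the following minimal sketch assembles an n × m fraction grid and scales peptide positions to 0-1. It is written in Python rather than Matlab and is not the authors' code; the function names, the tuple layout of the observations, and the example values are assumptions made for this example.

```python
from collections import defaultdict


def scale_positions(start, end, protein_length):
    """Scale 1-based residue positions of a peptide to the 0-1 range."""
    return (start / protein_length, end / protein_length)


def build_grid(observations, n, m):
    """Assemble an n x m grid of fractions.

    observations: iterable of (x, y, peptide, spectral_count), with 1-based
    x in 1..n (ion-exchange fraction) and y in 1..m (RP-HPLC fraction).
    Each grid cell maps peptide -> summed spectral count.
    """
    grid = [[defaultdict(int) for _ in range(m)] for _ in range(n)]
    for x, y, peptide, count in observations:
        if not (1 <= x <= n and 1 <= y <= m):
            raise ValueError(f"fraction ({x}, {y}) is outside the {n} x {m} grid")
        grid[x - 1][y - 1][peptide] += count
    return grid


def total_events(grid):
    """Total spectral events per fraction (the quantity drawn in a total-events map)."""
    return [[sum(cell.values()) for cell in row] for row in grid]


# Made-up observations on a 12 x 11 grid.
obs = [(2, 5, "PEP_A", 3), (2, 5, "PEP_B", 1), (7, 3, "PEP_A", 2)]
grid = build_grid(obs, n=12, m=11)
totals = total_events(grid)
```

Keeping the per-peptide counts in each cell lets the same structure feed both the peptide map and the total-events map.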
Results
Visualization of the IPAS Proteomics Data. The data generated in comparative proteomics experiments that utilize extensive protein fractionation contain information related to
isoforms that could be mined, but is generally not systematically analyzed. Such information is intrinsic to the locations
(fractions) in which proteins were identified. Thus, chromatographic properties contain information that could be used to
make inferences about subspecies/isoforms of proteins that
elute differently but may be the products of the same gene. In
this study, we analyzed data from 132 serum fractions that
resulted from 2-D fractionation of intact (undigested) proteins.
Figure 1A shows a representation of the 2-D fractionation as a
grid with the nodes denoting the fractions. The particular
identified peptides in a protein could be used to infer cleavages
as in the case of surface proteins that shed their extracellular
domains. We have devised a way of capturing this information
on the fractionation grid, whereby a set of concentric circles
represent the sequentially organized peptides. The circles are
scaled in such a way that the size of the circle indicates a
relative distance from the N-terminus of the protein, with the
peptide represented by the smallest circle being closest to the
N-terminus and the largest circle denoting the peptide closest
to the C-terminus. Such visualization aids in immediately
discerning a fragment: if a set of peptides appears doughnut-shaped in one or more fractions (such as the fraction with
coordinates [x = 2, y = 5] in Figure 1A), such a set of peptides
would be derived from the C-terminal portion of the protein.
If the peptides in a given fraction are represented by a set of
small circles (relative to all the peptides identified in the
fractions, as shown in the figure inset), such as in the fraction
with coordinates [x = 7, y = 3], then the fragment is derived
from the N-terminal portion of the protein. Thus, visualization
allows an immediate grasp of four characteristics for each
protein: the two chromatographic properties, the distribution
of peptides along the sequence, and in comparative quantitative studies the differential ratio. Furthermore, the same
visualization approach can be used for representing the
number of MS events for a given protein in a given fraction
(Figure 1C). Additional information is provided in an accompanying histogram of all ratios for a given protein in the
experiment (Figure 1B).
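The reading of the concentric circles described above can be sketched in code. The example below ranks peptides by scaled start position (rank 1 is the smallest circle, closest to the N-terminus) and labels the peptide set seen in one fraction. The half-way cut-off and the label names are assumptions made for illustration; the text describes a visual judgment, not a fixed rule.

```python
def circle_ranks(peptide_starts):
    """Map peptide -> rank by scaled start position (1 = closest to the N-terminus).

    The rank stands in for the circle size: larger rank, larger circle.
    """
    ordered = sorted(peptide_starts, key=peptide_starts.get)
    return {peptide: i + 1 for i, peptide in enumerate(ordered)}


def classify_fraction(observed, ranks):
    """Rough label for the peptides observed in a single fraction.

    Assumed heuristic: only large circles (a doughnut) suggests a C-terminal
    fragment, only small circles suggests an N-terminal fragment.
    """
    seen = sorted(ranks[p] for p in observed)
    if not seen:
        return "empty"
    half = len(ranks) / 2
    if seen[0] > half:
        return "C-terminal"
    if seen[-1] <= half:
        return "N-terminal"
    return "spanning"


# Made-up scaled start positions for four peptides of one protein.
starts = {"P1": 0.05, "P2": 0.20, "P3": 0.55, "P4": 0.80}
ranks = circle_ranks(starts)
```

For example, `classify_fraction({"P3", "P4"}, ranks)` gives `"C-terminal"`, because neither of the two N-terminal-most peptides is present.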
Automated Detection of Chromatographic Clusters. We
developed a simple pattern recognition algorithm (see Methods) to identify and flag proteins that show distinct chromatographic clusters, such as shown in Figure 2A. The cluster
identification occurs on the peptide level, and the number of
clusters is then averaged across all the peptides for a single
Figure 3. Hepatocyte growth factor activator protein. The sequence of the precursor is shown with a signal peptide in black letters,
prepropeptide removed in mature protein in red letters, short chain in blue letters, and long chain in green letters. The underlined
peptides denote those identified by mass spectrometry in 132 fractions.
Figure 4. Fibulin 1 isoforms. (A) Total MS events map for 2-D separation. The grid represents the 2-D chromatography fractionation (12 ×
12 fractions). The X-axis represents 12 fractions of ion-exchange chromatography and the Y-axis, 12 fractions of RP HPLC. Each node of
the grid shows the number of MS events corresponding to FBLN1, and the size of the circle reflects that number visually. (B) The
peptide and ratio map of the 2-D chromatography fractionation. Each node of the grid shows the fraction location as in (A). Information
is provided regarding fractions in which FBLN1 was found and the related peptides that were identified (the full list is displayed in the
figure inset). Peptides are shown as concentric circles of different colors, whereby the size of the circle indicates the relative distance
of the peptide from the N-terminus of the full protein sequence.
cates that two splice isoforms have been identified for this
protein. However, the visualization plot suggests that the two
chromatographic species are unlikely to be splice isoforms,
because the missing sequence in isoform 2 that distinguishes
it from isoform 1 is present in both clusters (the sequence maps
to the range of 0.17-0.30). An alternative explanation is the
difference in glycosylation pattern (F11 is heavily glycosylated).
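The argument above reduces to an interval check: does each chromatographic cluster contain at least one peptide that maps into the region absent from the alternative isoform? A minimal sketch follows. The peptide ranges are made-up values, not the F11 data, and the function names are hypothetical.

```python
def overlaps(peptide_range, region):
    """True if two closed intervals on the scaled 0-1 sequence axis overlap."""
    (p_start, p_end), (r_start, r_end) = peptide_range, region
    return p_start <= r_end and r_start <= p_end


def clusters_covering_region(clusters, region):
    """Names of clusters with at least one peptide overlapping the region.

    clusters: dict mapping cluster name -> list of (start, end) scaled ranges.
    """
    return [name for name, peptides in clusters.items()
            if any(overlaps(p, region) for p in peptides)]


# Region reported as missing in isoform 2, on the scaled sequence axis.
missing_in_isoform_2 = (0.17, 0.30)

# Made-up peptide ranges for two chromatographic clusters.
clusters = {
    "cluster_1": [(0.05, 0.08), (0.20, 0.23)],
    "cluster_2": [(0.25, 0.28), (0.60, 0.63)],
}
covering = clusters_covering_region(clusters, missing_in_isoform_2)
```

If every cluster appears in the result, none of them can correspond to the isoform that lacks the region, which is the reasoning applied to the two F11 clusters.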
Discussion
Fractionation based on chromatographic properties yields
a fingerprint of a protein that is determined by structural
variations in the protein. High resolution HPLC systems, such
as modern reverse-phase and ion-exchange HPLC, yield 2-D
fractionation patterns that allow inferences to be made regarding single protein heterogeneity. We have utilized this chromatographic pattern information, along with sequence mapping of identified peptides, to gain insight into potential
fragmentation patterns, splice isoforms, or other sources of
protein heterogeneity that might be found in a sample. To
reduce data complexity and allow an easier grasp of multidimensional proteomic data, we developed a visualization method
that combines three sources of information (four dimensions
of data) in one two-dimensional plot. Along with the visualization tool, we also developed a simple pattern recognition
algorithm to automatically detect and flag potentially heterogeneous species of proteins in experiments such as IPAS, which
involve extensive fractionation and identify more than a
thousand serum or plasma proteins in one experiment.4
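One way to picture the pattern recognition step is as counting groups of adjacent fractions for each peptide and averaging that count over the peptides of a protein. The sketch below does this with a breadth-first search over the fraction grid. The adjacency rule (eight neighbours), the flagging threshold of 1.5, and all names are assumptions made for this example and are not taken from the authors' implementation.

```python
from collections import deque


def count_clusters(fractions):
    """Count groups of adjacent fractions among a set of (x, y) grid cells.

    Assumption: two fractions are adjacent if they differ by at most one
    step in x and in y (8-neighbourhood).
    """
    remaining = set(fractions)
    clusters = 0
    while remaining:
        clusters += 1
        queue = deque([remaining.pop()])
        while queue:
            x, y = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    neighbour = (x + dx, y + dy)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        queue.append(neighbour)
    return clusters


def mean_clusters_per_peptide(peptide_fractions):
    """Average the per-peptide cluster counts for one protein."""
    counts = [count_clusters(cells) for cells in peptide_fractions.values() if cells]
    return sum(counts) / len(counts) if counts else 0.0


def flag_heterogeneous(peptide_fractions, threshold=1.5):
    """Flag a protein whose peptides fall, on average, into more than one cluster.

    The threshold is a hypothetical value chosen for illustration.
    """
    return mean_clusters_per_peptide(peptide_fractions) >= threshold


# Made-up fraction coordinates for three peptides of one protein.
protein = {
    "P1": {(2, 5), (3, 5), (7, 3)},
    "P2": {(2, 5), (7, 3), (7, 4)},
    "P3": {(2, 5)},
}
mean_count = mean_clusters_per_peptide(protein)
flagged = flag_heterogeneous(protein)
```

Averaging over peptides, rather than pooling all fractions for the protein, keeps a single stray identification from creating a cluster on its own.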
Given that proteins are identified based on matching of their
corresponding peptide mass spectra to sequence databases, the
Figure 5. Protein heterogeneity for LCAT and F11. The peptide map of the 2-D chromatography fractionation. Each node of the grid
shows the fraction location as in Figure 4A. (A) F11; peptides 6-9 are present in both chromatographically distinct clusters. Region 0.1-0.30
of the sequence of the F11 protein is missing in alternatively spliced isoform 2 (see text for details). (B) LCAT protein; N-glycosylation
of LCAT has been shown by mass spectrometry.14
peptide coverage in MS; and (c) have large enough differences
to be separable by common methods of protein fractionation.
In conclusion, we have developed a visualization tool to aid
in making inferences about heterogeneity of proteins identified
in proteomics experiments that utilize extensive fractionation.
We also provide a simple algorithm to detect and flag potential
splice isoforms, mature/precursor protein combinations, and
other types of protein structural variation.
References
(1) Palagi, P. M.; Hernandez, P.; Walther, D.; Appel, R. D. Proteome
informatics I: bioinformatics tools for processing experimental
data. Proteomics 2006, 6 (20), 5435-5444.
(2) Spellman, P. T.; Sherlock, G.; Zhang, M. Q.; Iyer, V. R.; Anders, K.;
Eisen, M. B.; Brown, P. O.; Botstein, D.; Futcher, B. Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998,
9, 3273-3297.
(3) Khatib, A.-M.; Geraldine, S. Growth Factors: To Cleave or not To
Cleave. In Regulation of Carcinogenesis, Angiogenesis and Metastasis by the Proprotein Convertases (PCs), A New Potential in Cancer
Therapy; Khatib, A.-M., Ed.; Springer: The Netherlands, 2006; pp
121-135.
(4) Faca, V.; Pitteri, S.; Newcomb, L.; Glukhova, V.; Phanstiel, D.;
Krasnoselsky, A.; Zhang, Q.; Struthers, J.; Wang, H.; Eng, J.;
Fitzgibbon, M.; McIntosh, M.; Hanash, S. Contribution of protein fractionation to depth of analysis of the serum and plasma proteomes.
J. Proteome Res. 2007, 6 (9), 3558-3565.
(5) Faca, V.; Coram, M.; Phanstiel, D.; Glukhova, V.; Zhang, Q.;
Fitzgibbon, M.; McIntosh, M.; Hanash, S. Quantitative analysis of
acrylamide labeled serum proteins by LC-MS/MS. J. Proteome Res.
2006, 5 (8), 2009-2018.
(6) Rauch, A.; Bellew, M.; Eng, J.; Fitzgibbon, M.; Holzman, T.; Hussey,
P.; Igra, M.; Maclean, B.; Lin, C. W.; Detter, A.; Fang, R.; Faca, V.;
Gafken, P.; Zhang, H.; Whitaker, J.; States, D.; Hanash, S.; Paulovich, A.; McIntosh, M. W. Computational Proteomics Analysis
System (CPAS): an extensible, open-source analytic system for
evaluating and publishing proteomic data and high throughput
biological experiments. J. Proteome Res. 2006, 5 (1), 112-121.
(7) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem
mass spectra. Bioinformatics 2004, 20 (9), 1466-1467.
(8) Maclean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General
framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22
(July 28), 2830-2832.
(9) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical
statistical model to estimate the accuracy of peptide identifications
made by MS/MS and database search. Anal. Chem. 2002, 74,
5383-5392.
(10) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical
model for identifying proteins by tandem mass spectrometry. Anal.
Chem. 2003, 75 (17), 4646-4658.
(11) Kersey, P.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.;
Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 7, 1985-1988.
(12) Miyazawa, K.; Shimomura, T.; Naka, D.; Kitamura, N. Proteolytic
activation of hepatocyte growth factor in response to tissue injury.
J. Biol. Chem. 1994, 269 (12), 8966-8970.
(13) Gallagher, W. M.; Currid, C. A.; Whelan, L. C. Fibulins and cancer:
friend or foe. Trends Mol. Med. 2005, 11 (7), 336-340.
(14) Liu, T.; Qian, W. J.; Gritsenko, M. A.; Camp, D. G., 2nd.; Monroe,
M. E.; Moore, R. J.; Smith, R. D. Human plasma N-glycoproteome
analysis by immunoaffinity subtraction, hydrazide chemistry, and
mass spectrometry. J. Proteome Res. 2005, 4 (6), 2070-2080.
PR7007219