
Isoform Analysis of LC-MS/MS Data from Multidimensional Fractionation of the Serum Proteome


Alexei L. Krasnoselsky,* Vitor M. Faca, Sharon J. Pitteri, Qing Zhang, and Samir M. Hanash
Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue North, Seattle, Washington 98109
Received November 8, 2007

Abstract: We developed a visualization approach for the identification of protein isoforms, precursor/mature protein combinations, and fragments from LC-MS/MS analysis of multidimensional fractionation of serum and plasma
proteins. We also describe a pattern recognition algorithm
to automatically detect and flag potentially heterogeneous
species of proteins in proteomic experiments that involve
extensive fractionation and result in a large number of
identified serum or plasma proteins in an experiment.
Examples are given of proteins with known isoforms that
validate our approach, and we present a subset of precursor/mature
protein pairs that were detected with this approach. Potential applications include identification of
differentially expressed isoforms in disease states.
Keywords: protein fractionation; visualization; LC-MS/MS; isoforms

Introduction
With rapid proliferation of proteomic data, there is a need
for tools that allow computational data mining and visualization of complex data sets. There are many software packages
available for processing proteomics data and displaying results
(for a recent review, see Palagi et al.).1 However, there is a paucity
of visualization tools that are simple and easily adaptable to
evolving proteomic data formats. Visualization tools combine
several sources of information for intelligent data mining. The
human eye is particularly suited to identify complex patterns
and features, provided that the information is presented in a
structured visual way and limited to a few patterns at a time.
The red-green heat maps used for gene expression data serve as an example
of a simple yet effective method of representing complex
data.2
Proteins exist in plasma and tissue sources in multiple forms
that result from alternative splicing (isoforms), precursor/mature
protein combinations, or different patterns of glycosylation. Most proteins are secreted as precursor proteins from
which biologically active forms are generated upon proteolytic
cleavage (e.g., see Khatib and Geraldine).3 For biomarker
discovery, it is important to assess the presence of isoforms
that may differ in their levels in a disease-related manner, as in
the case of phosphorylation and glycosylation, among numerous
post-translational modifications. We present here a visualization approach for multidimensional proteomic data to
assist in the search for protein isoforms, precursor/mature
protein combinations, and fragments. Along with the visualization tool, we also describe a simple pattern recognition
algorithm that we developed to automatically detect and flag
potentially heterogeneous species of proteins in proteomic
experiments that involve extensive fractionation and result in
a large number of identified proteins in one experiment.

* To whom correspondence should be addressed. Tel: (206) 667-1250, fax: (206) 667-2537, E-mail: akrasnos@fhcrc.org.

"Graphics reveals data." Edward R. Tufte in The Visual Display of Quantitative Information.

Journal of Proteome Research 2008, 7, 2546-2552. Published on Web 04/18/2008.

Methods
Protein Separation and Mass Spectrometry Analysis. Serum
and plasma protein samples were subjected to fractionation
followed by LC-MS/MS analysis of tryptic digests from individual fractions. The full procedure, designated Intact Protein
Analysis System (IPAS) has been previously described by Faca
et al.4 Briefly, after immunodepletion, acrylamide-labeled
samples5 were fractionated by anion-exchange into 12 fractions
and subsequently by reversed-phase into 12 fractions, representing a total of 144 fractions that were analyzed individually
by shotgun LC-MS/MS. In-solution tryptic digestion was
performed overnight with lyophilized aliquots from the reversed-phase (second dimension) fractionation step. The resulting
peptide mixtures were analyzed by an LTQ-FTICR mass spectrometer (Thermo-Finnigan) coupled with a NanoAcquity nanoflow chromatography system (Waters). Spectra were acquired in a data-dependent mode in the m/z range of 400-1800,
including selection of the 5 most abundant +2 or +3 ions of
each MS spectrum for MS/MS analysis. Acquired data were
automatically processed by the Computational Proteomics
Analysis System (CPAS)6 pipeline. This pipeline includes the
X!Tandem search algorithm7 with comet score module plugin,8 PeptideProphet9 peptide validation, and ProteinProphet10
protein inference tool. The tandem mass spectra were searched
against version 3.12 of the human IPI database.11 All identifications with a PeptideProphet probability greater than 0.75 were
selected and the subsequent protein identifications were
filtered at a 5% error rate.
Heterogeneity Detection Algorithm. The concept behind
cluster detection is as follows. For each protein (single IPI or a
protein group of multiple IPI numbers considered to represent
the same protein), the data were assembled into an n × m grid
of fractions, where n corresponds to the number of fractions
derived in ion-exchange chromatography (represented on the
X-axis) and m corresponds to the number of fractions derived
in RP-HPLC (represented on the Y-axis). The dimensions for
the two data sets used in this article are 12 × 12 for one data
set and 12 × 11 for the other. The pattern detection is
performed at the peptide level. For each peptide, a binary
peptide separation map is derived by assigning 1 to a fraction
where the peptide was identified and 0 where it was not. The
map serves as input to the protein heterogeneity detection
algorithm, which consists of two steps. First, the fractionation
pattern is smoothed by a 2 × 2 kernel, whereby each fraction
$x_{i,j}$ is assigned the sum of the values in the kernel:
$S_{i,j} = x_{i,j} + x_{i+1,j} + x_{i,j+1} + x_{i+1,j+1}$.
The rationale for smoothing is to reduce
the MS sampling effect that might result in overestimation
of the number of clusters. Second, the clusters are defined by selecting
the nodes with values equal to or exceeding k ($k_{\max} = 4$) and
separated by a gap of at least g fractions ($g = 2$ for this
fractionation experiment, based on the chromatographic resolution of the system). The number of identified peptide clusters
is then averaged across all peptides for a given protein to give
a cluster score assigned to this protein. The output consists
of all proteins ranked by the cluster score, with the cluster
statistics described at the peptide level.
Data Visualization. The visualization application requires
several input data matrices. For each protein (single IPI or a
protein group of multiple IPI numbers considered to represent
the same protein), the n × m data matrix of fractions (as
described above), along with the ratio vector and a vector of the
number of spectral events for each peptide in each fraction, is
passed to the software. A vector of sequence position information
scaled to 0-1 is passed to the application as well. All
preprocessing of the data is accomplished prior to passing the
data to the visualization tool. The outputs of the application
include three figures, saved as picture files (JPEG format): the
figure that combines fractionation, ratio, and peptide sequence
information (such as Figure 1A), a histogram of the ratios (if
available, see Figure 1B), and a figure of total spectral events
for each fraction (such as Figure 1C). The Matlab code for the
application is available upon request from the author.

Figure 1. Visualization of proteomic data in 2-D fractionation experiments with differential sample labeling. The data shown are for
protein HGFA (hepatocyte growth factor activator). (A) The peptide and ratio map of the 2-D chromatography fractionation. The grid
represents the 2-D chromatography fractionation (12 × 12 fractions). The X-axis represents 12 fractions of ion-exchange chromatography
and the Y-axis, 11 fractions of RP-HPLC. Each node of the grid shows the fraction location. The peptides are shown as concentric circles
of different colors (the full list of identified peptides is shown in the inset), whereby the size of the circle indicates a relative distance
of the peptide from the N-terminus of the full protein sequence. The size of the circle corresponds to the sequential order of the
peptides starting from the N-terminus. The range for each peptide represents the starting and ending position in the protein sequence,
scaled to 0-1. Values are provided for ratios between two samples being compared based on differential acrylamide labeling,5 where
case samples are labeled with C13 acrylamide and control samples with C12 acrylamide. (B) Histogram of the ratios obtained for this
protein in an experiment in which a comparison is made between two samples (in all 132 fractions). (C) Total MS events map for 2-D
separation. Each node of the grid shows the number of MS events summed up across all peptides, while the size of the circle reflects
visually that number.

Figure 2. Relationship between number of peptides and the
average number of clusters per protein. (A) The average cluster
score (number of identified chromatographic clusters averaged
across all the peptides per protein) is plotted against the total
number of unique peptides for the corresponding protein. (B)
The histogram of average cluster scores across all proteins with
two or more unique peptides.

10.1021/pr7007219 CCC: $40.75 © 2008 American Chemical Society
Journal of Proteome Research Vol. 7, No. 6, 2008 (technical notes)

Results

Visualization of the IPAS Proteomics Data. The data generated in comparative proteomics experiments that utilize extensive protein fractionation contain information related to
isoforms that could be mined but are generally not systematically analyzed. Such information is intrinsic to the locations
(fractions) in which proteins were identified. Thus, chromatographic properties contain information that could be used to
make inferences about subspecies/isoforms of proteins that
elute differently but may be the products of the same gene. In
this study, we analyzed data from 132 serum fractions that
resulted from 2-D fractionation of intact (undigested) proteins.
Figure 1A shows a representation of the 2-D fractionation as a
grid with the nodes denoting the fractions. The particular
identified peptides in a protein could be used to infer cleavages
as in the case of surface proteins that shed their extracellular
domains. We have devised a way of capturing this information
on the fractionation grid, whereby a set of concentric circles
represent the sequentially organized peptides. The circles are
scaled in such a way that the size of the circle indicates a
relative distance from the N-terminus of the protein, with the
peptide represented by the smallest circle being closest to the
N-terminus and the largest circle denoting the peptide closest
to the C-terminus. Such visualization aids in immediately
discerning a fragment: if a set of peptides appears doughnut-shaped in one or more fractions (such as the fraction with
coordinates [x = 2, y = 5] in Figure 1A), that set of peptides
is derived from the C-terminal portion of the protein.
If the peptides in a given fraction are represented by a set of
small circles (relative to all the peptides identified in the
fractions, as shown in the figure inset), such as in the fraction
with coordinates [x = 7, y = 3], then the fragment is derived
from the N-terminal portion of the protein. Thus, visualization
allows an immediate grasp of four characteristics for each
protein: the two chromatographic properties, the distribution
of peptides along the sequence, and in comparative quantitative studies the differential ratio. Furthermore, the same
visualization approach can be used for representing the
number of MS events for a given protein in a given fraction
(Figure 1C). Additional information is provided in an accompanying histogram of all ratios for a given protein in the
experiment (Figure 1B).
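
The reading rule described above (a set of small circles points to N-terminal peptides, a doughnut of large circles to C-terminal peptides) can be sketched in a few lines of Python. This is an illustrative sketch only and not the authors' Matlab code: the function names `circle_ranks` and `describe_fraction`, the midpoint cutoff, and the example start residues are all assumptions made for the example.

```python
def circle_ranks(peptide_starts):
    """Rank peptides by start position: rank 1 (smallest circle) is closest
    to the N-terminus, the highest rank (largest circle) to the C-terminus."""
    ordered = sorted(set(peptide_starts))
    return {start: rank for rank, start in enumerate(ordered, start=1)}


def describe_fraction(ranks_in_fraction, n_peptides):
    """Classify which part of the protein the peptides in one fraction cover.

    The midpoint cutoff is a hypothetical choice for illustration only.
    """
    if not ranks_in_fraction:
        return "not detected"
    midpoint = n_peptides / 2
    if all(rank > midpoint for rank in ranks_in_fraction):
        return "C-terminal portion"  # only large circles: doughnut shape
    if all(rank <= midpoint for rank in ranks_in_fraction):
        return "N-terminal portion"  # only small circles
    return "both halves"


# Hypothetical start residues for six peptides of one protein.
ranks = circle_ranks([40, 120, 310, 420, 500, 610])
late_peptides = {ranks[420], ranks[500], ranks[610]}
print(describe_fraction(late_peptides, len(ranks)))  # C-terminal portion
```

A fixed cutoff is cruder than visual inspection of the plot, which is why the text leaves the final call to the researcher.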
Figure 3. Hepatocyte growth factor activator protein. The sequence of the precursor is shown with a signal peptide in black letters,
prepropeptide removed in mature protein in red letters, short chain in blue letters, and long chain in green letters. The underlined
peptides denote those identified by mass spectrometry in 132 fractions.

Automated Detection of Chromatographic Clusters. We
developed a simple pattern recognition algorithm (see Methods) to identify and flag proteins that show distinct chromatographic clusters, such as shown in Figure 2A. The cluster
identification occurs on the peptide level, and the number of
clusters is then averaged across all the peptides for a single
protein to derive a protein score. Figure 2A shows that there is
no correlation between the number of peptides and
the average number of identified clusters. The increase in number of
clusters for proteins identified with a single peptide in multiple
fractions is most likely due to incorrect IDs. The single-peptide
hits were not included in subsequent analysis. The analysis
shows that out of 1224 proteins covered by more than one unique
peptide, 295 proteins showed chromatographic heterogeneity at the peptide level. Such heterogeneity could be
due to multiple factors that include MS sampling, precursor/
mature protein, multichain proteins connected by S-S bridges,
splice isoforms, post-translational modifications, and proteolytic fragments.
The algorithm flags all these instances as long as they are
manifested in a discontinuous elution profile for a given protein.
The histogram in Figure 2B shows that the majority of
heterogeneous proteins show fewer than two clusters per protein
(averaged number of identified peptide clusters). This is
reasonable given the limited resolution of the system (11 × 12
fraction grid).
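
The two-step detection described in Methods (2 × 2 smoothing of the binary peptide map, then counting clusters separated by a gap) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: cells beyond the grid edge are treated as 0, two selected nodes are placed in the same cluster when their row and column indices both differ by at most g, and the names `smooth`, `count_clusters`, and `cluster_score` are hypothetical. The threshold k is passed explicitly because the text gives only its maximum value.

```python
from collections import deque


def smooth(x):
    """S[i][j] = x[i][j] + x[i+1][j] + x[i][j+1] + x[i+1][j+1].
    Cells beyond the grid edge count as 0 (an assumption)."""
    n, m = len(x), len(x[0])

    def at(i, j):
        return x[i][j] if i < n and j < m else 0

    return [[at(i, j) + at(i + 1, j) + at(i, j + 1) + at(i + 1, j + 1)
             for j in range(m)] for i in range(n)]


def count_clusters(x, k, g=2):
    """Count groups of nodes with smoothed value >= k.
    Nodes whose indices differ by at most g in both dimensions are linked
    (an assumed reading of 'separated by a gap of at least g fractions')."""
    s = smooth(x)
    unvisited = {(i, j) for i, row in enumerate(s)
                 for j, value in enumerate(row) if value >= k}
    clusters = 0
    while unvisited:
        clusters += 1
        queue = deque([unvisited.pop()])
        while queue:  # breadth-first walk over linked nodes
            ci, cj = queue.popleft()
            near = [p for p in unvisited
                    if max(abs(p[0] - ci), abs(p[1] - cj)) <= g]
            for p in near:
                unvisited.remove(p)
                queue.append(p)
    return clusters


def cluster_score(peptide_maps, k, g=2):
    """Average the per-peptide cluster counts for one protein."""
    counts = [count_clusters(x, k, g) for x in peptide_maps]
    return sum(counts) / len(counts)


def empty_map(n=6, m=6):
    return [[0] * m for _ in range(n)]


# Toy data: peptide 1 seen in two distant regions, peptide 2 in one.
pep1 = empty_map()
pep1[0][0] = pep1[0][1] = pep1[5][5] = 1
pep2 = empty_map()
pep2[2][2] = 1
print(cluster_score([pep1, pep2], k=1))  # 1.5
```

With k = 1 the smoothing acts mainly to merge neighboring identifications; a larger k additionally requires several identifications within the same 2 × 2 window before a node is counted.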
Identification of Proteins and Their Cleavage Products.
Most proteins are synthesized in vivo in the form of an inactive
precursor that is cleaved upon a physiological event, either locally or
upon extracellular release. We have analyzed human
plasma for the presence of such precursor/mature protein pairs
using our pattern detection algorithm to flag potential isoforms.
Out of 295 proteins that were flagged as heterogeneous, 176
(or 60%) were consistent with precursors. Figure 1A shows an
example of the detection of the full-length precursor and the
mature form of hepatocyte growth factor activator (HGFA),
identified in the IPAS experiment with 14 peptides. As can
be seen in Figure 1A, the protein species elute as
separate clusters that correspond to the mature protein as well
as the corresponding precursor part removed upon cleavage
(see Figure 3 for explanation). The detection algorithm flags
this protein as heterogeneous and fragments may be discerned
upon inspection of the plot. The precursor for HGFA does not
convert single chain HGF to its biologically active form.12
However, cleavage of pre-HGFA at R407-I408 and R372-V373
converts it to its active two-chain form. Figure 1A shows that
we detect several forms. The R407-I408 site corresponds to
position 0.62 on the 0-1 scale from the N- to C-terminus of the 655 amino
acid-long HGFA, and R372-V373 to 0.57. Indeed, we
identified two sets of fractions that correspond to the precursor
part that is removed in the mature form (sequence 36:372 or
0.06:0.57) as well as the two chains of the mature protein itself
(0.57-0.6 and 0.72-0.98, peptides 9-14). Interestingly, the
short chain of the mature protein (peptide 8) yielded only a
single identified peptide, which elutes separately from the long
chain of the HGFA.
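
The scaled positions quoted above follow from dividing the residue number by the protein length. The short sketch below reproduces them; plain division is an assumption, since the text does not state the exact scaling formula, and the function name `scaled_position` is hypothetical.

```python
def scaled_position(residue, protein_length):
    """Scale a 1-based residue number to the 0-1 range used on the plots.

    Plain division by the protein length is an assumption; it reproduces
    the cleavage-site values quoted in the text.
    """
    if not 1 <= residue <= protein_length:
        raise ValueError("residue number outside the protein sequence")
    return residue / protein_length


HGFA_LENGTH = 655  # length of the HGFA precursor given in the text
for site, residue in [("R407-I408", 407), ("R372-V373", 372)]:
    print(site, round(scaled_position(residue, HGFA_LENGTH), 2))
```

This prints 0.62 for R407-I408 and 0.57 for R372-V373, matching the values in the text.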
Analysis of Protein Isoforms. Alternative splicing can produce protein isoforms that are distinguishable in
IPAS experiments. Here, we show one such example, fibulin-1
(FBLN1). Fibulin-1 is an extracellular matrix protein that is
known to have four different isoforms (for recent review, see
Gallagher et al.13). In the IPAS experiment described here, we
have identified peptides that map to FBLN1 and identify at least
two groups of isoforms: isoform C and isoforms B and D. The
latter are indistinguishable by the identified peptides and are
referred to here as isoform B/D. Figure 4 exhibits the fractionation
pattern of FBLN1. The differences between the isoforms lie in
the C-terminal portion of FBLN1. Figure 4A exhibits the
fractions in which the isoforms B/D were identified by unique
peptides (peptides 14 and 15), whereas the isoform C was
identified by its corresponding C-terminal peptides (peptides
11 and 12 in Figure 4B). Isoform B/D elutes in the earlier ion-exchange and reverse-phase HPLC fractions. There is also
some, albeit incomplete, separation of isoforms by reverse-phase HPLC for the late-eluting ion-exchange fractions. Analysis
of the peptide composition shows no evidence of the earlier
eluting fractions resulting from fragmentation of the later full-length protein. Such differences might be due to variation in
the glycosylation pattern of FBLN1. The contribution of each
isoform to the overall FBLN1 ratio could not be assessed in
this study due to the origin of the Cys-containing peptides from
the region of FBLN1 sequence common to all known isoforms.
However, the presence of several isoforms that are partially
resolved chromatographically is demonstrated.
Figure 4. Fibulin 1 isoforms. (A) Total MS events map for 2-D separation. The grid represents the 2-D chromatography fractionation (12 ×
12 fractions). The X-axis represents 12 fractions of ion-exchange chromatography and the Y-axis, 12 fractions of RP-HPLC. Each node of
the grid shows the number of MS events corresponding to FBLN1, while the size of the circle reflects visually that number. (B) The
peptide and ratio map of the 2-D chromatography fractionation. Each node of the grid shows the fraction location as in (A). Information
is provided regarding fractions in which FBLN1 was found and the related peptides that were identified (full list is displayed in the
figure inset). Peptides are shown as concentric circles of different colors, whereby the size of the circle indicates a relative distance
of the peptide from the N-terminus of the full protein sequence.

The utility of the visualization algorithm is illustrated
in Figure 5A, where two subspecies of coagulation factor F11
are shown. The detection algorithm flags F11 (IPI00008556) as
a chromatographically heterogeneous protein with two distinct
species (Figure 5A). The Swiss-Prot annotation (P03951) indicates
that two splice isoforms have been identified for this
protein. However, the visualization plot suggests that the two
chromatographic species are unlikely to be splice isoforms,
because the sequence that is missing in isoform 2 and distinguishes
it from isoform 1 is present in both clusters (the sequence maps
to the range of 0.17-0.30). An alternative explanation is a
difference in glycosylation pattern (F11 is heavily glycosylated).
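
The reasoning used for F11 (a region that distinguishes two splice isoforms cannot explain two chromatographic clusters if peptides from that region occur in both) reduces to an interval-overlap check on the 0-1 sequence scale. The sketch below is illustrative only: the function names are hypothetical and the peptide ranges are made-up values, not the F11 peptides reported in the paper.

```python
def overlaps(a, b):
    """True if two (start, end) ranges on the 0-1 scale share any sequence."""
    return a[0] < b[1] and b[0] < a[1]


def region_in_every_cluster(region, clusters):
    """True if each cluster holds at least one peptide overlapping the region.

    If the isoform-specific region is seen in every cluster, the clusters
    are unlikely to differ by the presence or absence of that region.
    """
    return all(any(overlaps(region, peptide) for peptide in peptides)
               for peptides in clusters)


# Hypothetical scaled peptide ranges for two chromatographic clusters.
cluster_a = [(0.18, 0.20), (0.45, 0.47)]
cluster_b = [(0.25, 0.27), (0.80, 0.82)]
spliced_region = (0.17, 0.30)  # range quoted in the text for isoform 2

print(region_in_every_cluster(spliced_region, [cluster_a, cluster_b]))  # True
```

A result of True argues against splicing as the source of the two clusters, leaving alternatives such as differential glycosylation, as discussed above.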

Discussion
Fractionation based on chromatographic properties yields
a fingerprint of a protein that is determined by structural
variations in the protein. High resolution HPLC systems, such
as modern reverse-phase and ion-exchange HPLC, yield 2-D
fractionation patterns that allow inferences to be made regarding single protein heterogeneity. We have utilized this
chromatographic pattern information, along with sequence mapping of identified peptides, to gain insight into potential
fragmentation patterns, splice isoforms, or other sources of
protein heterogeneity that might be found in a sample. To
reduce data complexity and allow an easier grasp of multidimensional proteomic data, we developed a visualization method
that combines three sources of information (four dimensions
of data) in one two-dimensional plot. Along with the visualization tool, we also developed a simple pattern recognition
algorithm to automatically detect and flag potentially heterogeneous species of proteins in experiments such as IPAS, which
involve extensive fractionation and identify more than a
thousand serum or plasma proteins in one experiment.4
Figure 5. Protein heterogeneity for LCAT and F11. The peptide map of the 2-D chromatography fractionation. Each node of the grid
shows the fraction location as in Figure 4A. (A) F11; peptides 6-9 are present in both chromatographically distinct clusters. Region 0.1-0.30
of the sequence of the F11 protein is missing in alternatively spliced isoform 2 (see text for details). (B) LCAT protein; N-glycosylation
of LCAT has been shown by mass spectrometry.14

Given that proteins are identified based on matching of their
corresponding peptide mass spectra to sequence databases, the
isoform identification process is dependent on accurate peptide
identifications. The goal of the automated detection algorithm
we have developed is to reduce data complexity by eliminating
proteins that do not show heterogeneity and leaving it to the
researcher, aided by the visualization tool, to make final
decisions about the flagged proteins. It is desirable to estimate
a false-discovery rate for the list of proteins deemed heterogeneous by the algorithm. To address this problem, the
availability of a benchmark set of known heterogeneous
proteins that are resolved by chromatography would be useful
to develop an algorithm for FDR estimation. In this publication,
we provide two examples in which the observed heterogeneity
of proteins (HGFA and FBLN1) could indicate the presence in
the samples of true precursor/mature protein forms (in the case of HGFA) and
different splice isoforms (in the case of FBLN1).
However, a definitive assessment requires
biochemical evidence to validate the finding of distinct species
for the same protein. Nevertheless, as shown in this paper, in
the example of coagulation factor F11, using our visualization
software tool enables the researcher to rule out a hypothesis,
such as the presence of alternatively spliced isoforms in the
case of F11.
Our approach allows us to start compiling a list of proteins
that could serve as a benchmark set for performance evaluation
of future isoform detection algorithms. Figure 5 shows an
example of two such proteins, F11 and LCAT. Compiling a
comprehensive data set for benchmarking of isoform detection
algorithms is beyond the scope of this paper and will be
addressed in future publications. Such a protein set should
satisfy at least the following criteria: the species of a protein
should (a) be well-defined and characterized biochemically; (b)
be detectable in normal plasma in quantities that allow good
peptide coverage in MS; and (c) have large enough differences
to be separable by common methods of protein fractionation.
In conclusion, we have developed a visualization tool to aid
in making inferences about heterogeneity of proteins identified
in proteomics experiments that utilize extensive fractionation.
We also provide a simple algorithm to detect and flag potential
splice isoforms, mature/precursor protein combinations, and
other types of protein structural variation.

References
(1) Palagi, P. M.; Hernandez, P.; Walther, D.; Appel, R. D. Proteome
informatics I: bioinformatics tools for processing experimental
data. Proteomics 2006, 6 (20), 5435-5444.
(2) Spellman, P. T.; Sherlock, G.; Zhang, M. Q.; Iyer, V. R.; Anders, K.;
Eisen, M. B.; Brown, P. O.; Botstein, D.; Futcher, B. Comprehensive
identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 1998,
9, 3273-3297.
(3) Khatib, A.-M.; Geraldine, S. Growth Factors: To Cleave or not To
Cleave. In Regulation of Carcinogenesis, Angiogenesis and Metastasis by the Proprotein Convertases (PCs), A New Potential in Cancer
Therapy; Khatib, A.-M., Ed.; Springer: The Netherlands, 2006; pp
121-135.
(4) Faca, V.; Pitteri, S.; Newcomb, L.; Glukhova, V.; Phanstiel, D.;
Krasnoselsky, A.; Zhang, Q.; Struthers, J.; Wang, H.; Eng, J.;
Fitzgibbon, M.; McIntosh, M.; Hanash, S. Contribution of protein fractionation to depth of analysis of the serum and plasma proteomes.
J. Proteome Res. 2007, 6 (9), 3558-3565.
(5) Faca, V.; Coram, M.; Phanstiel, D.; Glukhova, V.; Zhang, Q.;
Fitzgibbon, M.; McIntosh, M.; Hanash, S. Quantitative analysis of
acrylamide labeled serum proteins by LC-MS/MS. J. Proteome Res.
2006, 5 (8), 2009-2018.

(6) Rauch, A.; Bellew, M.; Eng, J.; Fitzgibbon, M.; Holzman, T.; Hussey,
P.; Igra, M.; Maclean, B.; Lin, C. W.; Detter, A.; Fang, R.; Faca, V.;
Gafken, P.; Zhang, H.; Whitaker, J.; States, D.; Hanash, S.; Paulovich, A.; McIntosh, M. W. Computational Proteomics Analysis
System (CPAS): an extensible, open-source analytic system for
evaluating and publishing proteomic data and high throughput
biological experiments. J. Proteome Res. 2006, 5 (1), 112-121.
(7) Craig, R.; Beavis, R. C. TANDEM: matching proteins with tandem
mass spectra. Bioinformatics 2004, 20 (9), 1466-1467.
(8) Maclean, B.; Eng, J. K.; Beavis, R. C.; McIntosh, M. General
framework for developing and evaluating database scoring algorithms using the TANDEM search engine. Bioinformatics 2006, 22
(July 28), 2830-2832.
(9) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Empirical
statistical model to estimate the accuracy of peptide identifications
made by MS/MS and database search. Anal. Chem. 2002, 74,
5383-5392.
(10) Nesvizhskii, A. I.; Keller, A.; Kolker, E.; Aebersold, R. A statistical
model for identifying proteins by tandem mass spectrometry. Anal.
Chem. 2003, 75 (17), 4646-4658.
(11) Kersey, P.; Duarte, J.; Williams, A.; Karavidopoulou, Y.; Birney, E.;
Apweiler, R. The International Protein Index: an integrated database for proteomics experiments. Proteomics 2004, 7, 1985-1988.
(12) Miyazawa, K.; Shimomura, T.; Naka, D.; Kitamura, N. Proteolytic
activation of hepatocyte growth factor in response to tissue injury.
J. Biol. Chem. 1994, 269 (12), 8966-8970.
(13) Gallagher, W. M.; Currid, C. A.; Whelan, L. C. Fibulins and cancer:
friend or foe. Trends Mol. Med. 2005, 11 (7), 336-340.
(14) Liu, T.; Qian, W. J.; Gritsenko, M. A.; Camp, D. G., 2nd.; Monroe,
M. E.; Moore, R. J.; Smith, R. D. Human plasma N-glycoproteome
analysis by immunoaffinity subtraction, hydrazide chemistry, and
mass spectrometry. J. Proteome Res. 2005, 4 (6), 2070-2080.

PR7007219
