Académique Documents
Professionnel Documents
Culture Documents
INTRODUCTION
35
36
PPI
Figure 1. Flowchart for the PPI site prediction using eFindSite . Details are given in text.
wileyonlinelibrary.com/journal/jmr
wileyonlinelibrary.com/journal/jmr
37
IPj
ASAi j
i1
Ni
X
ASAi
i1
Ns
X
ASAs j
s1
Ns
X
(1)
ASAs
s1
P
where,
ASAi(j) is thePsum of ASA of amino acid residues of
type j at the interface,P ASAi is the sum of ASA of all amino
acids at the interface,
ASAs(j) is thePsum of ASA of amino acid
residues of type j on the surface, and ASAs is the sum of ASA of
all amino acids on the surface.
Fraction of templates
Finally, we include the fraction of templates (FT) that have an
interfacial residue in the equivalent position according to templatetarget structure alignments constructed by Fr-TM-align.
Individual residue-level attributes, RSA, IP, SE, PSIP and FT, are
combined into a single probabilistic score using machine learning. Two different classiers, SVMs (Chang and Lin, 2011) and
the NBC (Zhang, 2004), are trained to predict interfacial residues
according to the assignment by iAlign (Gao and Skolnick, 2010).
iAlign assigns interfacial residues based on interatomic contacts,
which occur when any two heavy atoms belonging to residues
from different chains are within a distance of 4.5 . Both machine
learning models are twofold cross-validated on the BM4361
dataset. Specically, dataset proteins are randomly divided into
two subsets, A and B; A is used to train a model and then validate
it against B, and vice versa, the model trained on B is validated
against A. We note that <40% sequence identity between any
pair of proteins in the BM4361 dataset ensures that the classiers
are trained and validated using different proteins. Probability
thresholds optimized using the BM4361 dataset are 0.202 for
the SVM and 0.178 for the NBC predictor. These values were
selected to maximize MCC to 0.428, which corresponds to a true
positive rate of 0.464 at the expense of 0.076 false positive rate. A
given residue in the target protein is predicted to be at the interface when both probabilities are above their threshold values.
Calculation of interfacial interactions
Sequence entropy
Functionally important residues tend to be evolutionarily conserved (Caffrey et al., 2004; Guharoy and Chakrabarti, 2005;
Mintseris and Weng, 2005); therefore, we include a conservation
score estimating the sequence variability for each target residue.
First, multiple sequence alignments generated for the target
sequence by PSI-BLAST (Altschul et al., 1997) are converted
to a sequence prole. Then, the conservation score for each
residue position (SE) is calculated using the Shannon entropy
(Shanon, 1948):
SE
20
X
pi log2 pi
(2)
i1
PSIP
20
X
pi IPi
(3)
i1
CS
38
wileyonlinelibrary.com/journal/jmr
N
1X
SVMi NBC i
N i1
(4)
TP
TP FN
(5)
FP
FP TN
(6)
TN
FP TN
(7)
(8)
TP TN
TP FP TN FN
(9)
PPV
Accuracy:
ACC
(10)
wileyonlinelibrary.com/journal/jmr
39
where true positives (TP), false negatives (FN) and false positives
(FP) is the number of correctly predicted, underpredicted, and
overpredicted binding residues, respectively. True negatives
(TN) is the number of correctly predicted non-interfacial residues.
Binding residues in experimental complex structures (positives)
are dened as those forming proteinprotein interfaces according to iAlign (Gao and Skolnick, 2010). The minimum value is 0,
and the maximum value is 1 for all scores, except for MCC that
ranges from 1 to 1. MCC quanties the strength of the correlation between predicted and actual classes; by heavily penalizing
both overpredictions and underpredictions, it provides a
convenient assessment measure that balances the sensitivity
and specicity. In addition to numerical values assessing the
classication accuracy, we analyze the prediction results using
receiver operating characteristic (ROC) plots. This technique
was developed to evaluate the overall performance of a classier
and shows the trade-off between sensitivity and specicity. The
area under the ROC curve (AUC) quanties the performance of
a classier; larger AUC values indicate a better prediction power
of the classication model.
The accuracy of interface residue prediction is compared with
that of a random, size-independent classier. First, for a given
target protein, we estimate the size of its interface from the number of exposed residues as described by Martin (2014). Next, we
randomly select a patch on the target surface whose size is
equivalent to the estimated number of interfacial residues. This
patch represents a random interface and includes the correction
of a size bias, that is, smaller proteins have proportionally more
residues within the patch, increasing the chances of overlapping
with the correct interface.
Figure 2. Accuracy of eThread in recognizing templates for PPI site prediction. In (A), correct templates for the receptor (larger subunit) are dened
using the global structure similarity with a TM-score of 0.4, the overlap of interfacial residues with MCC of 0.5, and the local interfacial similarity with
an IS-score of 0.191. In (B), we evaluate the recognition of those dimer templates in which the ligand (smaller subunit) is globally similar to the targetbound ligand with a sequence identity of 40% and a TM-score of 0.4, respectively. Combined curves are calculated using a twofold cross-validation
against the BM4361 dataset. TPR, true positive rate; FPR, false positive rate. Gray areas correspond to predictions no better than random.
40
Because protein complexes are stabilized by a variety of interactions, we analyze the conservation of interaction patterns across
weakly related proteins. For each protein in the BM4361 dataset,
interfacial interactions in its dimer templates are mapped to the
target residues according to the structure alignments of receptor
proteins. ROC plots in Figure 3 show the structural conservation
of interfacial hydrogen bonds, salt bridges, aromatic and hydrophobic contacts at proteinprotein interfaces. ROC curves end at
certain sensitivity values, because we can only take account of
wileyonlinelibrary.com/journal/jmr
those surface residues having an interacting residue at a structurally aligned position in at least one template. The maximum
accuracy obtained for hydrogen bonds, salt bridges, hydrophobic and aromatic interactions is 0.900, 0.945, 0.895 and 0.949,
at a true (false) positive rate of 0.684 (0.091), 0.459 (0.049),
0.760 (0.098) and 0.488 (0.044), respectively. Comparison of
these ROC plots shows that the conservation of interfacial
hydrophobic contacts and hydrogen bonds is higher than aromatic interactions and salt bridges. The high conservation of hydrophobic contacts is in line with previous studies suggesting
wileyonlinelibrary.com/journal/jmr
41
PPI
Figure 6. Size and composition of interfaces predicted by eFindSite . (A) The correlation between the size of experimental interfaces identied by
PPI
iAlign and those predicted by eFindSite . (B) Amino acid composition of experimental and predicted interfaces.
Evaluation metric
FPR
TPR
ACC
SPC
PPV
MCC
0.076
0.077
0.042
0.464
0.415
0.151
0.835
0.824
0.800
0.924
0.922
0.957
0.594
0.563
0.459
0.428
0.381
0.177
42
wileyonlinelibrary.com/journal/jmr
BM1905C
BM1905H
BM1905M
Predictor
eFindSitePPI
eFindSitePPI
eFindSitePPI
PINUP
Random
eFindSitePPI
eFindSitePPI
eFindSitePPI
PINUP
Random
eFindSitePPI
eFindSitePPI
eFindSitePPI
PINUP
Random
(SVM)
(NBC)
(SVM)
(NBC)
(SVM)
(NBC)
Evaluation metric
FPR
TPR
ACC
SPC
PPV
MCC
0.150
0.208
0.076
0.091
0.078
0.161
0.228
0.083
0.112
0.074
0.169
0.233
0.089
0.121
0.076
0.581
0.627
0.464
0.244
0.086
0.539
0.590
0.428
0.179
0.087
0.517
0.571
0.402
0.166
0.097
0.760
0.760
0.835
0.748
0.759
0.785
0.739
0.829
0.722
0.778
0.775
0.732
0.822
0.709
0.780
0.850
0.793
0.924
0.808
0.921
0.838
0.771
0.916
0.787
0.925
0.839
0.766
0.910
0.778
0.923
0.483
0.421
0.594
0.414
0.209
0.418
0.357
0.522
0.284
0.201
0.393
0.341
0.489
0.264
0.212
0.403
0.366
0.428
0.189
0.011
0.344
0.304
0.371
0.080
0.019
0.314
0.281
0.339
0.053
0.030
For eFindSitePPI, three prediction protocols are evaluated: SVM only, NBC only and a combination of SVM and NBC (listed as
eFindSitePPI). Values pointing to the best performance are highlighted in bold, except for FPR and TPR that need to be
considered jointly
BM1905C, crystal structures; BM1905H, high-quality models; BM1905M, moderate-quality models.
FPR, false positive rate; TPR, sensitivity; ACC, accuracy; SPC, specicity; PPV, precision; MCC, Matthews correlation coefcient.
Random performance includes the correction of a size bias.
wileyonlinelibrary.com/journal/jmr
43
Table 3. Comparison of the performance of eFindSitePPI and PINUP using homodimers and heterodimers from the BM4361
dataset
Dataset
Homodimer
Heterodimer
Predictor
PPI
eFindSite
PINUP
eFindSitePPI
PINUP
Evaluation metric
FPR
TPR
ACC
SPC
PPV
MCC
0.088
0.089
0.093
0.090
0.478
0.239
0.354
0.217
0.820
0.771
0.806
0.773
0.911
0.910
0.906
0.909
0.574
0.414
0.456
0.368
0.419
0.187
0.289
0.156
Values pointing to the best performance are highlighted in bold, except for FPR and TPR that need to be considered jointly
FPR, false positive rate; TPR, sensitivity; ACC, accuracy; SPC, specicity; PPV, precision; MCC, Matthews correlation coefcient.
Table 4. Comparison of the performance of eFindSitePPI, PINUP and PrISE on the Benchmark 4.0 dataset. Values pointing to the
best performance are highlighted in bold, except for FPR and TPR that need to be considered jointly
Dataset
Bound
Unbound
Predictor
PPI
eFindSite
PINUP
PrISE
eFindSitePPI
Evaluation metric
FPR
TPR
ACC
PPV
MCC
0.049
0.065
0.042
0.047
0.399
0.347
0.381
0.377
0.909
0.783
0.790
0.909
0.404
0.307
0.432
0.499
0.352
0.246
0.279
0.338
44
Results for PINUP and PrISE are taken from ref. (Jordan et al., 2012).
TPR, sensitivity; ACC, accuracy; PPV, precision; MCC, Matthews correlation coefcient.
wileyonlinelibrary.com/journal/jmr
Table 5. Comparison of the performance of eFindSitePPI, ET and iJET using non-transient homodimers and heterodimers as well
transient complexes from the ET/iJET dataset
Dataset
Homodimer
Heterodimer
Transient
Predictor
PPI
eFindSite
ET
iJET
eFindSitePPI
ET
iJET
eFindSitePPI
ET
iJET
Evaluation metric
FPR
TPR
PPV
SPC
ACC
0.049
0.058
0.038
0.071
0.065
0.062
0.048
0.032
0.030
0.678
0.389
0.340
0.572
0.364
0.426
0.531
0.319
0.455
0.657
0.482
0.552
0.614
0.524
0.575
0.460
0.431
0.538
0.951
0.856
0.905
0.929
0.854
0.824
0.952
0.906
0.820
0.917
0.738
0.764
0.871
0.696
0.707
0.922
0.727
0.751
wileyonlinelibrary.com/journal/jmr
45
Results for ET and iJET are taken from ref. (Engelen et al., 2009).
Values pointing to the best performance are highlighted in bold, except for FPR and TPR that need to be considered jointly
TPR, sensitivity; PPV, precision; SPC, specicity; ACC, accuracy.
PPI
Figure 8. Example of PPI prediction by eFindSite for a homodimer (PDB-ID: 1GDH). (A) The surface representation of a monomer chain; true
positives, true negatives, false positives, and false negatives are colored in green, gray, red, and cyan, respectively. (B) Interface residues correctly
predicted to form specic interactions; dashed blue lines represent salt bridges and red lines represent hydrogen bonds.
PPI
Figure 9. Example of PPI prediction by eFindSite for a heterodimer (PDB-ID: 1TCR). The surface representations of alpha and beta chains are shown
in (A) and (B); the dimer complex is displayed in (C). True positives, true negatives, false positives, and false negatives are colored in green, gray/tan, red,
and cyan, respectively. Interfacial residues in both chains correctly predicted to form specic interactions are shown in (D). Dashed blue lines represent
salt bridges and red lines represent hydrogen bonds.
46
0.420 precision, and 0.929 accuracy (Figure 9A). For chain beta,
46% of interfacial residues are correctly predicted, with 0.815
specicity, 0.959 precision, and 0.817 accuracy (Figure 9B).
wileyonlinelibrary.com/journal/jmr
CONCLUSIONS
The analysis of evolutionarily weakly related dimer proteins reported in this study strongly suggests that the locations of their
binding sites are highly conserved, irrespectively of the global
structure similarity of proteinprotein complexes. Furthermore,
the interfacial geometry is preserved as well, thus can be predicted with a high accuracy. This is consistent with previous
studies demonstrating that surface regions responsible for protein binding are conserved among structural neighbors (Zhang
et al., 2010). Exploiting these insights, we developed eFindSitePPI,
a new approach for the prediction of protein-binding sites using
information derived from evolutionarily and structurally related
templates. eFindSitePPI employs sensitive meta-threading by
eThread (Brylinski and Lingam, 2012) to identify evolutionarily
related templates and extensively uses various machine learning
techniques to detect interfacial residues on a query protein
surface. A higher degree of conservation of local interface
compared with the global structure of protein complexes forms
the basis for an accurate prediction of interfacial binding sites.
In addition to these conservation patterns, eFindSitePPI also employs other residue-level descriptors to effectively discriminate
between interfacial and non-interfacial residues. For instance, it
incorporates the relative solvent accessible area and the interfacial propensities of amino acids, which have been already
successfully used by several other interfacial site prediction
algorithms (Liang et al., 2006; Li et al., 2008). A high accuracy in
extracting structural information from the twilight zone
templates motivated us to further extend the capabilities of
eFindSitePPI to predict specic interactions as well. That is,
eFindSitePPI also detects the types of molecular interactions that
target proteins are likely to form with their interacting partners;
this is demonstrated for hydrogen bonds, salt bridges as
well as hydrophobic and aromatic contacts. Comparative
benchmarking calculations on several datasets of protein dimers
show that eFindSitePPI outperforms other methods for proteinbinding residue prediction. Equally important, it is designed to
work with protein models so that the interfacial site can be
efciently predicted even when the experimental structure of a
query protein is unavailable. Finally, a carefully tuned condence
estimation system identies those predictions that are likely to
be correct. eFindSitePPI is freely available to the academic
community as a user-friendly web-server and a well-documented
stand-alone software distribution at http://www.brylinski.org/
endsiteppi; this website also provides all benchmarking datasets
and results reported in this paper.
Acknowledgements
This study was supported by the Louisiana Board of Regents
through the Board of Regents Support Fund [contract LEQSF
(201215)-RD-A-05]. We thank Wei Feinstein and Misagh Naderi
who read the manuscript and provided critical comments.
Portions of this research were conducted with high performance
computational resources provided by Louisiana State University
(HPC@LSU, http://www.hpc.lsu.edu).
REFERENCES
wileyonlinelibrary.com/journal/jmr
47
SUPPORTING INFORMATION
Additional supporting information may be found in the online
version of this article at the publishers website.
48
wileyonlinelibrary.com/journal/jmr