Computational Approaches For The Identification and Characterization of Protein Binding Sites

by
Dario Ghersi
A dissertation submitted to the Graduate Faculty of the Mount Sinai Graduate School of
Biological Sciences Biomedical Sciences Doctoral Program, in partial fulfillment of the
requirements for the degree of Doctor of Philosophy, Mount Sinai School of Medicine of
New York University
2010
Professor Roberto Sanchez
© 2010 by Dario Ghersi
All Rights Reserved
To my parents and all the people with whom I shared small and big things during those
years, with affection.
In omnibus requiem quaesivi, et nusquam inveni nisi in angulo cum libro

Thomas à Kempis
Abstract
The problem of inferring the function of a protein in the context of the complex network of interactions is one of the most crucial challenges faced by Computational Biology
today. Knowing the binding partners of proteins is an essential step to untangle the web
of functional relationships that control cellular processes, and the identification and the
characterization of a protein binding site represent an important step to achieve this goal.
Some of the most widely used techniques that have been developed by the bioinformatics
community over the years are discussed here, together with their limitations and applicability range.
This thesis introduces a framework to perform binding site identification on protein structures by means of an energy-based approach built on the concept of Molecular Interaction
Fields (MIFs). The approach has been validated on a large set of bound and unbound
protein structures, and a specific application of binding site identification in the context
of reverse virtual screening is discussed. The advantage of using chemically specific probes
to compute the MIFs is illustrated by applying the binding site identification procedure to
phosphorylated ligands. Furthermore, an improved version of the energy-based binding site
identification approach that incorporates evolutionary information is presented, emphasizing its advantage in situations where the energy-based signal is weak.
As an attempt to move beyond the problem of binding site identification, a methodology
that can be applied to infer the bound conformation of a protein starting from an unbound
form is introduced.
Taken together, the results presented in this work indicate that the energy-based approach
with multiple-probe MIFs provides a versatile framework to carry out binding site identification and hint at the possibility of identifying the bound form of structures that undergo
large conformational changes. Furthermore, the problem of predicting the type of ligand
that a binding site can accommodate lies among the future challenges that could benefit
from the methodology described here.
Acknowledgements
It is only fit to preface the acknowledgments with an apology to the people that one
inevitably forgets to mention despite their direct or indirect contributions.
I would like to thank my advisor Dr. Roberto Sanchez for encouraging me to find my own
voice and for his always wise suggestions (in science and beyond), Dr. Roman Osman for
following my progress with careful attention during all these years, Dr. Mihaly Mezei
for his always knowledgeable and bright advice, Dr. Ming-Ming Zhou for his continuous
support and insightful ideas, and all the members of the Sanchez Lab for useful discussions.
Many thanks to Dr. Suvobrata Chakravarty for the fruitful exchanges that we had over
lunch and to Zachary Charlop-Powers for the wide ranging discussions about science, music
and everything else and our enjoyable musical projects. Special thanks to Dr. Maurizio
Filippone for sharing with me cutting-edge ideas in machine learning and statistics and,
more importantly, because I could not imagine a better friend. Thanks to Dr. Fabiana
Renzi for being such a wonderful person to spend time with. I would also like to thank Prof.
Franco Celada for giving me the confidence to jump from Medicine into a computational
field and for showing me that more things are possible than one would imagine.
Finally, I would like to acknowledge all the people that contribute to make our Department of
Structural and Chemical Biology at Mount Sinai a friendly and collaborative environment.
Contents
Dedication
Abstract
Acknowledgements
List of Tables
List of Figures
1 Introduction
1.1 Inferring Protein Function
1.2 Binding Site Identification
1.2.1 Sequence-based approaches
1.2.2 Structure-based approaches
1.2.2.1 Geometric approaches
1.2.2.2 Energy-based approaches
1.2.3 Approaches that take into account the protein dynamics
1.3 Binding Site Characterization and Comparison
1.3.1 Approaches for comparing geometric features
1.3.2 Approaches for comparing structurally derived properties
1.4 Available Softwares for Binding Site Identification and Characterization
1.5 EASYMIFs and SITEHOUND
1.5.1 EASYMIFs
1.5.1.1 Calculation of MIFs in EASYMIFs
1.5.1.2 Visualizing the results
1.5.2 SITEHOUND
1.5.2.1 The SITEHOUND-web Server
1.6 Conclusions
2 Focused Docking
2.1 Introduction
2.1.1 Reverse Virtual Screening
2.1.2 The Protein-Ligand Docking Problem
2.1.2.1 The scoring function component
2.1.2.2 The search component
2.1.3 Blind Docking vs Focused Docking
2.2 Materials and Methods
2.2.1 Binding site identification
2.2.2 Blind docking setup
2.2.3 Focused docking setup
2.2.4 Focused docking with masked grids
2.2.5 Comparison of blind vs. focused docking
2.3 Results
2.3.1 Comparison of blind and focused docking protocols
2.3.2 Binding site detection accuracy
2.3.3 Docking pose accuracy
2.3.4 Comparison of blind vs. focused docking in the unbound dataset
2.4 Conclusions
3 Binding Site Identification for Phosphorylated Ligands
3.1 Introduction
3.2
3.2.1 Binding Site Identification
3.2.2 Dataset Construction
3.2.2.1 Phosphopeptides Dataset
3.2.2.2 ATP Dataset
3.2.2.3 Phosphosugars Dataset
3.2.3 Reranking of Putative Sites by Conservation
3.2.4 Assessment of the Prediction Accuracy
3.2.5 Electrostatic Potential Calculations
3.2.6 ROC Curves
3.3 Results
3.3.1 Overall Performance on the Whole Datasets
3.3.2 Evolutionary reranking of the putative sites
3.3.3 Role of the Electrostatic Potential
3.3.4 Probe Selectivity Analysis
3.4 Conclusions
4 Beyond Binding Site Identification
4.1 Introduction
4.1.1 Models of Conformational Changes
4.1.2 The Elastic Network Model
4.2
4.2.1 Dataset Construction
4.2.2 Root Mean Square Deviation (RMSD) Calculations
4.2.3 The Anisotropic Elastic Network Model (ANM)
4.2.3.1 Fitting
4.2.4 Side-chain Modeling and MIFs Calculations
4.2.5 Comparing MIFs derived from binding sites
4.3 Results
4.3.1 Normal Model Fitting
4.3.2 Bound Form Identification
4.4 Discussion
4.5 Conclusions
Appendices
A Introduction to Clustering
A.1 Brief overview of clustering in SITEHOUND
B Focused Docking Setup
B.1 Selection of complexes
B.2 Preparation of the proteins and ligands for docking
C Publications Resulting From This Thesis
List of Tables
1.1 Available softwares for binding site identification and characterization
2.1 Parameters for the different sets of focused docking experiments
2.2 Accuracy of binding site identification
2.3 Accuracy of Blind and Focused Docking in Unbound Proteins
3.1 Summary of the performance on the complete dataset of phosphorylated ligands
3.2 Summary of the performance for the first cluster only
3.3 Performance with the CMET probe
4.1 Dataset of complexes undergoing hinge-like motion upon binding
4.2 Overview of the results with the centroid shape function
List of Figures
1.1 Methyllysine recognition domains
1.2 Pocket Identification
1.3 Conformational changes
1.4 Identification of ligand-binding sites
1.5 An example of interaction energy calculations on a protein
1.6 Characterization of the yeast adenylate kinase binding site using EASYMIFs and SITEHOUND
1.7 SITEHOUND-web results page
2.1 Possible applications of reverse virtual screening
2.2 Focused docking scheme
2.3 SITEHOUND binding site identification performance
2.4 Binding site identification rate for blind and focused docking
2.5 Accuracy of blind and focused docking
2.6 Examples of focused docking results
3.1 Phosphopeptides dataset
3.2 Phosphosugars dataset
3.3 ATP dataset
3.4 Energy ratio density
3.5 Example of phosphobinding site identification
3.6 ROC Curves
3.7 Role of the electrostatic potential
3.8 Combining multiple probes
4.1 Models of Conformational Changes
4.2 ENM scheme
4.3 Contribution of normal modes to RMSD
4.4 MIFs comparison
4.5 Example of Bound Form Identification
4.6 Example of Bound Form Identification (2)
A.1 Effects of linkage on clustering results
Chapter 1
Introduction
1.1 Inferring Protein Function
The widespread structural genomics initiatives[103] have somewhat changed the traditional paradigm of the experimental investigation of proteins, where the structural determination was the last step and aimed to clarify and elucidate functional details. Now the
number of structures that have not yet been annotated is increasing and, as a consequence,
the bioinformatics community has taken up the challenge to develop more reliable tools for
inferring function from structure and for guiding further experimental work. Defining the
function of a protein is a difficult task, since it involves notions that range from the location
of the protein inside the cell to the complete blueprint of its regulatory systems. Widely
used approaches involve comparisons of sequences and global structural similarities. In
both cases, the underlying assumption is that proteins that share a detectable evolutionary
relationship are usually endowed with similar functions.
In the case of global structural similarity approaches, the functional information contained
in the closest match from databases like CATH[97] and SCOP[93] is transferred to the query
structure. Clearly, caution must be exercised when annotating unknown proteins in this way,
since divergent evolution could have produced practically indistinguishable folds with very
different functions. For this reason, a significant level of sequence similarity is also required
to transfer the annotation[54].
In an intriguing review published in 2006, Kolodny and colleagues[75] show how the concept
of a protein fold is more fluid than usually thought. The authors propose that we should
move away from the idea of a structural space divided into discrete and non-overlapping
islands (the folds) with similarly discrete and non-overlapping functions. The comparison
of global similarities in protein structures and evolutionary relationships between proteins
as inferred from global sequence comparisons are clearly very useful and represent one of
the main achievements of bioinformatics. On the other hand, it is possible to find numerous
examples of proteins that possess distinct evolutionary histories (as determined by sequence
and structural analysis) but that carry out similar functions. In many of these instances
one can find striking similarities at the level of the active site (in the case of enzymes)
or more generally in the binding site (for recognition domains). A compelling example is
given in Figure 1.1, where several structurally unrelated domains all involved in methyllysine recognition[16] show a similar arrangement of aromatic residues in the binding site.
These cases might represent instances of convergent evolution, where unrelated protein domains acquired similar recognition motifs that were particularly effective and were therefore
retained by selection. Another interesting example appeared in a recent issue of Science,
where it was shown that phylogenetically unrelated microbial hydrogenases possess similar
features in their active site[110].
An additional layer of complexity is given by the fact that many proteins whose function has already been determined are actually endowed with more than one function: this
phenomenon has been called moonlighting and is likely to play an important role in central
processes like catalysis, transcription and gene expression[27, 67]. The concept of moonlighting opens up new opportunities for the application of algorithms for the prediction of function
from structure, since only in cases where the different functions are clearly encoded in
detectable motifs can multiple sequence alignment be an adequate tool. Computational
methods that are specifically tailored to address the problem of binding site identification
and characterization are therefore much needed, and have the potential to go beyond the
traditional description of global sequence and structural similarities.
Figure 1.1: Domains involved in methylated lysine recognition: chromodomain (a: 1KNA), PHD
finger (b: 2FSA), tudor domain (c: 2IG0), ankyrin repeats (d: 3B95), MBT domain (e: 2RHI). The
bottom panel shows a close-up view of the aromatic residues surrounding the methylated lysine in
cage-like fashion. Despite being structurally unrelated, the domains shown here possess a binding
site with an aromatic cage responsible for the methylated lysine recognition.
Since the function of a protein is to some extent embedded in its interactions with other
molecules, elucidating these interactions is a key step to address the problem of functional
annotation. The concept of guilt by association exploits this idea and has been extensively employed to infer functional information about poorly characterized proteins[62].
In some instances, even a sufficiently complete description of the molecular function does
not encompass the different physiological roles played by a protein. An interesting example
is given by the Glycogen Synthase Kinase 3 (GSK3). In muscle cells, GSK3 is inhibited
by insulin activity, with the effect of promoting the conversion of glucose into glycogen and
therefore reducing blood glucose levels. From a therapeutic point of view GSK3 would
therefore seem a potentially attractive target[57]. On the other hand, GSK3 is also a part
of the Wnt signaling pathway, and a reduction in GSK3 is implicated in certain types of
cancers, while in other types (e.g. mixed lineage leukemia) GSK3 would seem to play an
oncogenic role[120]. Therefore, only a selective inhibition of GSK3 in muscle cells could
be a viable therapeutic option for diabetes. Furthermore, a study by Noble et al.[94]
showed how the inhibition of GSK3 leads to reduced production of the neurofibrillary
tangles implicated in Alzheimer's disease.
The example provided by GSK3 illustrates the importance of knowing all the binding
partners of a protein, since even a detailed chemical knowledge of only one of the pathways
where GSK3 is involved does not fully account for the pleiotropic effects of the protein.
The interactions a protein can participate in can be roughly divided into three broad classes,
depending on the chemical nature of the protein partner:
• Protein - Protein interactions
• Protein - Nucleic Acid (DNA or RNA) interactions
• Protein - Small Molecule interactions (where the term small molecule is meant to represent the broad class of non-polymeric molecules)
Protein - protein interactions can involve large contact interfaces (where shape complementarity usually plays an important role) or be mediated by peptide recognition modules[69, 2].
This work will concentrate on protein - small molecule and protein - peptide interactions,
since protein - protein interactions mediated by large contact interfaces and protein - nucleic
acid interactions represent a different challenge and require a different set of computational
approaches.
A preliminary but crucial step to exploit the functional information embedded in the network of protein interactions is binding site identification, defined as the identification of the
residues that make up the region where binding occurs.
An overview of the currently available methods to carry out binding site identification is
presented below.
1.2 Binding Site Identification
1.2.1 Sequence-based approaches
One of the simplest yet most effective ideas behind sequence-based approaches to identify functionally important residues is to exploit the evolutionary information contained in Multiple
Sequence Alignments (MSAs) of homologous sequences and extract a subset of residues
that show a high degree of conservation in the MSA. The assumption behind this idea is
that the evolutionary pressure acting on functionally important residues will reduce their
variability in a protein family. Different conservation measures have been employed, the
majority of them being cast in the information theoretic framework. Some of these measures take into account the background distribution of the amino acids, and they have been
shown to perform better than those that do not in a study that systematically compared
several conservation measures on three different large datasets[22]. More specifically, the
Jensen-Shannon divergence has been shown to outperform other information theoretic measures and will be discussed in greater detail in Chapter 3.
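As an illustration of a background-aware conservation measure, the sketch below scores each column of a toy alignment by the Jensen-Shannon divergence between the observed amino acid distribution and a background distribution. It is a minimal example under stated assumptions: the uniform background, the pseudocount value, the function names and the toy sequences are all hypothetical choices made for this illustration, and the scoring scheme evaluated in [22] may differ in its details (for instance in sequence weighting and gap handling).

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: a uniform background; a real analysis would use observed
# amino acid frequencies instead.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies of one alignment column; gaps are ignored."""
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS if p[aa] > 0)


def js_conservation(column, background=UNIFORM_BACKGROUND):
    """Jensen-Shannon divergence between a column and the background."""
    p = column_distribution(column)
    m = {aa: 0.5 * (p[aa] + background[aa]) for aa in AMINO_ACIDS}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(background, m)


# Toy alignment (hypothetical sequences, one string per homolog).
msa = ["ACDW", "ACEW", "ACFW", "AGHW"]
scores = [js_conservation([seq[i] for seq in msa]) for i in range(len(msa[0]))]
```

With base-2 logarithms the score lies between 0 and 1, and in this toy alignment the fully conserved first column scores higher than the partially conserved second column, which in turn scores higher than the fully variable third one.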
An alternative approach that takes advantage of phylogenetic analysis is the evolutionary
trace method developed by Lichtarge[85]. The idea behind the method is to consider the
degree of conservation of residue positions in a protein family in phylogenetically distinct
groups. The assumption is that functionally important residues may be conserved in a
subgroup but can vary across different subgroups, since these subgroups may have evolved
to perform slightly different functions.
Another approach that takes advantage of phylogenetic information is Rate4Site[88]. The
method relies on estimates of site-specific mutation rates obtained with a Bayesian approach, which
(by including prior information into the model) is less sensitive to the number of sequences
in the alignment than other conservation-based methods. On the other hand, a clear disadvantage of Rate4Site compared to simple information theoretic measures of conservation
is the speed of execution, which is substantially lower[22].
Despite their usefulness to infer functionally important residues, all the sequence-based
methods suffer from the fundamental limitation of not being able to discriminate
residues that are conserved as part of a binding site from residues that are crucial to protein
stability or regulation. In other words, while binding residues are usually conserved across
a protein family, conservation alone is not always a sufficiently specific criterion to identify
a binding site, since residues can be conserved for reasons other than binding. To overcome these limitations, other approaches have been devised that explicitly take structural
information into account.
Figure 1.2: Geometric identification of binding sites. A) Human alpha-Phosphomannomutase in
complex with D-mannose 1-phosphate (PDB code: 2fue). The ligand binds in a deep crevice that
is correctly identified as the largest pocket by LIGSITEcsc. B) Mannose 6-phosphate receptor in
complex with mannose 6-phosphate (PDB code: 1sz0). The binding site is a shallow pocket and in
this case is not among the top three sites predicted by LIGSITEcsc. The black arrows indicate the
location of the ligand and the blue spheres show the pockets detected by LIGSITEcsc.
1.2.2 Structure-based approaches
1.2.2.1 Geometric approaches
Most of the geometric approaches to identify binding sites in protein structures rely on
the assumption that a binding site is usually a geometrically well defined cleft or a pocket.
For example, in a study of 67 protein structures Laskowski determined that the largest cleft
corresponded to a binding site in over 83% of the cases[80].
One of the earliest approaches employed by cleft detection algorithms is the protein-solvent-protein concept (e.g. the POCKET[84] and LIGSITE[55] algorithms). The main idea
consists of embedding the protein in a 3D lattice and assigning the grid points to either the
protein (if within a predefined distance from an atom center) or the solvent. Subsequently,
the x, y, z axes are scanned and the pockets are defined as the regions in space that contain
points assigned to the solvent category and that are surrounded by protein points. An
improved version of the program (LIGSITEcs) used the Connolly surface[26]
to replace the protein-solvent-protein events with surface-solvent-surface events, and a further version (LIGSITEcsc) incorporated a conservation measure to rerank the putative
pockets[64].
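The protein-solvent-protein idea can be sketched in a few lines. The example below is only an illustration and does not reproduce the POCKET or LIGSITE implementations: it scans the three lattice axes only, and the function names, the distance cutoff, the event threshold and the toy "cup" of atoms are assumptions introduced here.

```python
def label_grid(atoms, size, spacing=1.0, cutoff=1.6):
    """Map each lattice index to True (protein) or False (solvent)."""
    grid = {}
    for i in range(size):
        for j in range(size):
            for k in range(size):
                x, y, z = i * spacing, j * spacing, k * spacing
                grid[(i, j, k)] = any(
                    (x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= cutoff ** 2
                    for ax, ay, az in atoms
                )
    return grid


def hits_protein(grid, point, step, size):
    """Walk from `point` along `step`; True if a protein point is met."""
    i, j, k = point
    while True:
        i, j, k = i + step[0], j + step[1], k + step[2]
        if not (0 <= i < size and 0 <= j < size and 0 <= k < size):
            return False
        if grid[(i, j, k)]:
            return True


def psp_events(grid, size):
    """Count, per solvent point, the axes enclosed by protein on both sides."""
    axes = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
    events = {}
    for point, is_protein in grid.items():
        if is_protein:
            continue
        events[point] = sum(
            hits_protein(grid, point, (a, b, c), size)
            and hits_protein(grid, point, (-a, -b, -c), size)
            for a, b, c in axes
        )
    return events


# Toy "cup": a floor plus four walls, open at the top (hypothetical data).
SIZE = 5
cup_atoms = [
    (i, j, k)
    for i in range(SIZE) for j in range(SIZE) for k in range(SIZE)
    if k == 0 or (k <= 3 and (i in (0, 4) or j in (0, 4)))
]
grid = label_grid(cup_atoms, SIZE, spacing=1.0, cutoff=0.5)
events = psp_events(grid, SIZE)
pocket = [p for p, n in events.items() if n >= 2]  # threshold is an assumption
```

In this toy case the solvent points inside the cup are enclosed along x and y but not along z, so they collect two events each, whereas the points in the open top layer collect none.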
Another well-established algorithm for pocket detection was developed by Laskowski and
implemented in the SURFNET program[79]. The gist of the idea is to place spheres between all pairs of atoms in such a way that no atom is contained inside the spheres,
which are then retained only if their radius is between 1 and 4 Å, and finally clustered. The
clustered spheres with the largest volume define the putative pocket. Another approach to
detect invaginations on protein structures was proposed by Mezei[89], exploiting the concept of circular variance.
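The sphere placement step can be illustrated as follows. This is a simplified sketch and not the SURFNET code: atoms are given as hypothetical (x, y, z, radius) tuples, each candidate sphere starts midway between the surfaces of two atoms and is shrunk until no third atom penetrates it, and the clustering of the retained spheres is left out.

```python
import math
from itertools import combinations


def gap_spheres(atoms, r_min=1.0, r_max=4.0):
    """Return (center, radius) for the spheres that fit between atom pairs."""
    spheres = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        (ax, ay, az, ar), (bx, by, bz, br) = a, b
        distance = math.dist((ax, ay, az), (bx, by, bz))
        gap = distance - ar - br
        if gap <= 0:
            continue  # the two atoms touch or overlap: no room for a sphere
        # Center the sphere midway between the two atomic surfaces.
        t = (ar + gap / 2) / distance
        center = (ax + t * (bx - ax), ay + t * (by - ay), az + t * (bz - az))
        radius = gap / 2
        # Shrink the sphere until no other atom penetrates it.
        for k, (cx, cy, cz, cr) in enumerate(atoms):
            if k in (i, j):
                continue
            radius = min(radius, math.dist(center, (cx, cy, cz)) - cr)
        if r_min <= radius <= r_max:
            spheres.append((center, radius))
    return spheres


# Three hypothetical atoms with a 1.5 Å radius each.
toy_atoms = [(0.0, 0.0, 0.0, 1.5), (7.0, 0.0, 0.0, 1.5), (3.5, 3.0, 0.0, 1.5)]
spheres = gap_spheres(toy_atoms)
```

For the three toy atoms only the sphere between the first two survives: it starts with a radius of 2 Å and is reduced to 1.5 Å by the third atom, while the spheres involving the third atom fall below the 1 Å lower bound.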
A systematic comparison between all the geometric methods briefly outlined above was
carried out by Huang and colleagues on a dataset containing 210 bound proteins plus 48
proteins for which an unbound form was available[64]. The performance of the methods
ranged from 80 to 87% for the bound dataset and from 71 to 77% for the unbound cases.
Recently, Huang and colleagues combined several geometric approaches with an energy-based
approach (see next section) into a metaserver named MetaPocket[63], yielding an
improvement over each of the individual methods used in isolation.
Despite their usefulness for binding site identification, one of the major shortcomings of all
the geometric approaches is that not all binding sites are deep pockets. An example is provided in Figure 1.2, where two types of binding sites are represented.
The first one is a typical deep cleft that can be easily identified by LIGSITEcsc, while the
second is a shallow pocket that does not rank among the top three sites identified by the
program.
1.2.2.2 Energy-based approaches
Energy-based approaches to binding site identification work on the assumption that a
protein binding site is characterized by energetic properties that stand out from the rest of
the protein surface and can be reliably identified.
One of the earliest attempts to characterize binding sites using energetic rather than geometric properties is the GRID program[49], which computes a semi-empirical interaction energy
between the protein and a set of chemical probes parameterized to mimic atom types and
chemical fragments of pharmaceutical and biological interest. The GRID program is not
meant to be used as a binding site identification tool per se, but the interaction energy
maps (also known as Molecular Interaction Fields) that are produced by the program can
be used for that purpose, with appropriate manipulations. As an example, Q-SiteFinder[81]
uses the GRID forcefield to compute an interaction energy map between the protein and
a methyl (-CH3) probe and carries out cluster analysis to identify the regions that have
the most favorable total interaction energy. These regions usually correspond to binding sites for
drug-like molecules[81]. More recently, Morita et al.[91] improved the performance of this
approach by using the AMBER force field[28] for the interaction energy calculations and
a more sophisticated two-step algorithm for clustering.
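The general workflow (compute an interaction energy map, keep the favorable points, cluster them and rank the clusters by total energy) can be sketched as below. The example makes several simplifying assumptions that are not taken from GRID, Q-SiteFinder or the work of Morita et al.: the probe interacts through a generic 12-6 Lennard-Jones term with made-up parameters, clustering is a plain single-linkage flood fill, and the energy cutoff and the toy atom coordinates are arbitrary.

```python
import math


def probe_energy(point, atoms, epsilon=0.2, r_min=4.0):
    """12-6 Lennard-Jones energy of a probe at `point` (assumed parameters)."""
    energy = 0.0
    for atom in atoms:
        r = math.dist(point, atom)
        if r < 1e-6:
            return float("inf")  # probe sits on top of an atom
        ratio = (r_min / r) ** 6
        energy += epsilon * (ratio * ratio - 2.0 * ratio)
    return energy


def interaction_field(atoms, lo, hi, spacing=1.0):
    """Probe energy at every point of a cubic grid spanning [lo, hi]."""
    n = int(round((hi - lo) / spacing)) + 1
    axis = [lo + i * spacing for i in range(n)]
    return {(x, y, z): probe_energy((x, y, z), atoms)
            for x in axis for y in axis for z in axis}


def cluster_points(points, link=1.01):
    """Single-linkage clustering: join points closer than `link`."""
    unvisited = set(points)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            near = [p for p in unvisited if math.dist(p, current) <= link]
            for p in near:
                unvisited.remove(p)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(cluster)
    return clusters


def rank_sites(field, energy_cutoff=-0.3):
    """Clusters of favorable points, most negative total energy first."""
    favorable = {p: e for p, e in field.items() if e <= energy_cutoff}
    scored = [(sum(favorable[p] for p in c), c) for c in cluster_points(favorable)]
    return sorted(scored, key=lambda item: item[0])


# Hypothetical "protein": eight atoms on the corners of a 6 Å cube.
cube_atoms = [(x, y, z) for x in (0.0, 6.0) for y in (0.0, 6.0) for z in (0.0, 6.0)]
field = interaction_field(cube_atoms, lo=-2.0, hi=8.0)
sites = rank_sites(field)
```

The first element of `sites` is the candidate site with the most favorable total interaction energy; in an actual application the atom coordinates would come from a protein structure and the energy term from a proper force field.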
An alternative approach to carry out binding site identification on protein structures builds
on the experimental technique introduced by Mattos and Ringe called Multiple Solvent
Crystal Structures (MSCS)[87]. The idea behind MSCS is to repeatedly soak the protein
with different organic solvents and identify the regions involved in binding to these solvents
by X-ray crystallography. Vajda and Guarnieri[116] have proposed an equivalent of this
procedure, where the solvent mapping is carried out computationally and a consensus site
where different solvents bind favorably is identified as the putative binding site.
1.2.3 Approaches that take into account the protein dynamics
All the energy-based methods discussed in the previous section treat the protein as a
rigid body. While this approximation may yield reasonable results in a variety of situations,
there are examples where it can be shown that the binding sites are not evident in the unbound form but show up transiently and are locked in by the ligand[38]. An illustrative
example is provided in Figure 1.3. Clearly, the identification of these transient pockets can
be exploited to design inhibitors in situations where the unbound conformation would suggest a low degree of druggability.
Conventional Molecular Dynamics (MD) simulations can be used to generate an ensemble
of conformations that are subsequently analyzed with binding site identification tools. A
compelling example can be found in the work of Schames et al.[107], who showed the formation of a trench near the active site of the HIV-1 Integrase during a 2 ns MD simulation.
Compounds that exploited both the active site and the trench were shown to have better
docking energies.
Figure 1.3: An example of conformational change occurring upon binding. A D-allose binding
protein is represented in the bound (green) and unbound (red) conformation, with the ligand in
licorice representation. The two domains of the protein move closer to each other and form a cleft
that accommodates the sugar.
Another study by Frembgen-Kesner et al. [43] identified a transiently forming binding site
on a p38 MAP kinase using MD simulations. The method adopted in this study involved
docking a known inhibitor to 5000 different snapshots generated during an MD simulation.
The results indicated that a large movement of a side-chain in the unbound conformation
allowed for the formation of a new binding site exploited by the inhibitor and adjacent to
the kinase ATP site, similar to what was seen in crystal structures of the complex.
The disadvantages of the methods involving MD simulations are associated with the high computational cost and with the potential inability of average-length MD simulations to capture
large conformational changes occurring upon binding. Other approaches that incorporate
alternative methods for generating ensembles of conformations have therefore been devised.
Gonzalez-Ruiz and Gohlke[48], for example, employed the FRODA method[123] to explore
the conformational space of the interleukin-2 receptor and were able to correctly identify the
bound conformation starting with an unbound form of the receptor. The main idea behind
the FRODA approach is to produce an ensemble of conformations that are not dependent
on time as in the MD framework but rather on the distance from a reference conformer.
Therefore the trajectory does not reflect time but a geometrical pathway. A rigidity analysis
step identifies the parts of the protein that are treated as rigid bodies during the geometric
simulation.
Other methods have coupled Normal Mode Analysis (NMA)[17] performed on MD snapshots or Elastic Network Models[5] with binding site identification or docking[24, 104].
The advantage that Normal Mode Analysis offers is the ability to capture large-amplitude
conformational changes. Furthermore, the possibility of performing Normal Mode Analysis
using the Elastic Network Model enables a substantial gain in speed compared to MD based
analyses.
It is worth noting that the methods outlined above require at least an approximate knowledge of the location of the binding site. The goal is to refine this knowledge
by identifying a conformation that is closer to the bound form or to provide a mechanistic
explanation to fill the gap between unbound and bound conformations. The problem of
reliably identifying the binding site a priori in the presence of large conformational changes
without any further knowledge is clearly much more challenging.
1.3 Binding Site Characterization and Comparison
As previously discussed, binding site identification represents an important step towards
functional annotation by providing knowledge of the residues that are involved in binding.
The majority of approaches to infer functional information from the analysis of the binding
site rely on comparisons with known annotated examples. In other words, the process of
functional annotation usually consists of transferring the information gained on some well
studied proteins to the protein of interest by virtue of their binding site similarity. This
approach relaxes the requirements for homology or structural similarity and underscores
the role that the binding site plays in protein function.
In a way that parallels what has been done in the field of binding site identification, both
geometric and energy-based approaches have been developed to compare binding sites.
1.3.1 Approaches for comparing geometric features
A graph theoretic method for identifying 3D patterns of amino acid side chains was
proposed by Artymiuk et al.[4]. By treating the side chains as nodes of a graph (using a
pseudo-atoms representation) and the distances between them as the edges it is possible
to search for a given pattern by resorting to well established subgraph isomorphism algorithms. A proof of principle was provided by screening a set of proteins for the Ser-His-Asp
catalytic triad[4]. More recently, a similar approach was taken by Zhang et al.[132] to build
a network of binding site similarities.
Another established approach to compare specific arrangements of residues is the TESS
algorithm[118], which uses a 3D template acquired by mining the primary literature and
containing all the atoms that are essential for an enzyme to perform its function; then,
given a query structure the algorithm looks for a match between the query and the 3D
template using the geometric hashing formalism, originally developed for object recognition
problems in computer vision[126]. Using a 3D template that contained information for the
serine protease active site (again with the well known Ser-His-Asp catalytic triad), the TESS
algorithm was able to detect the active site of all the serine proteases, acetylcholinesterase
and haloalkane dehalogenase[118].
The necessity to provide a template with a well-defined structural arrangement of residues
limits, in a sense, the applicability of the comparative approaches described above to enzymes or other molecules with a very conserved active site. Proteins whose function is to
bind other proteins or ligands (especially in the case of low affinity binding) are less suitable
to the generation of a well-defined template, since they will generally lack a highly conserved
geometrical arrangement of residues in the binding region.
1.3.2 Approaches for comparing structurally derived properties
Visual inspection of Molecular Interaction Fields (MIFs) provides useful information
about regions characterized by favorable interaction energies with specific chemical probes,
but it is feasible only when analyzing a few targets. In other cases, a quantitative measure
of similarity between targets with respect to specific probes becomes necessary to automate
the comparison. In addition, a quantitative similarity index is a valuable tool to cluster the
target structures and organize them according to interaction energy patterns. One can
apply the same considerations to the electrostatic potential, which is computed at discrete
points in space like the MIFs and can be analyzed using similar or identical indexes.
Different similarity indexes first developed for comparing quantum mechanically computed
electron densities of small molecules have been adapted to calculate the similarity of electrostatic potential and MIFs[20]. Among the available indexes, the Hodgkin index is one of
the most popular and has been extensively used for protein structural comparisons based on
electrostatic potential and MIFs. The Hodgkin index is defined by the following equation:
\[
SI = \frac{2\,(p_1 \cdot p_2)}{p_1 \cdot p_1 + p_2 \cdot p_2} \tag{1.1}
\]
where $p_1$ and $p_2$ represent the vectors containing the potential energy values for map
1 and 2 respectively and the symbol $\cdot$ indicates the dot product.
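As an illustration of the definition above, the following minimal Python sketch computes the Hodgkin index for two energy maps stored as flat lists of values sampled on the same grid points. The function name `hodgkin_index` and the toy maps are assumptions made for this example; they are not taken from PIPSA or any other program.

```python
def hodgkin_index(p1, p2):
    """Hodgkin similarity index between two maps sampled on identical grid points."""
    if len(p1) != len(p2):
        raise ValueError("maps must contain the same number of grid points")
    dot12 = sum(a * b for a, b in zip(p1, p2))  # p1 . p2
    dot11 = sum(a * a for a in p1)              # p1 . p1
    dot22 = sum(b * b for b in p2)              # p2 . p2
    denominator = dot11 + dot22
    if denominator == 0.0:
        raise ValueError("both maps are identically zero")
    return 2.0 * dot12 / denominator


# Toy maps (hypothetical energy values)
map_a = [-1.0, -2.0, 0.5]
map_b = [-1.0, -2.0, 0.5]
print(hodgkin_index(map_a, map_b))                 # identical maps -> 1.0
print(hodgkin_index(map_a, [-x for x in map_a]))   # sign-inverted map -> -1.0
```

The index ranges from -1 (anti-correlated maps) to 1 (identical maps). Unlike a plain correlation coefficient it also penalizes differences in magnitude: a map compared with a copy of itself scaled by 0.5 yields 0.8.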
The Hodgkin index is used by the PIPSA program[13], a software to analyze the pairwise
similarity of 3D interaction property fields of proteins. The proteins to be analyzed must be
superposed before computing the MIFs or the electrostatic potential; after the fields have
been calculated, PIPSA gives the option to select a region around the molecules (called
skin in the program) that contains points within a certain distance from the protein surface. Subsequently the program uses the Hodgkin index to calculate a similarity matrix and
clusters the structures accordingly. This approach has been successfully applied to different
types of proteins including PH domains[13] and WW domains[108], cupredoxins[34] and
E2 ubiquitin conjugating enzymes[124]. In all cases the clustering based on the similarity
analysis yielded classifications comparable to what could be achieved using functional information like known peptide binding specificities and catalytic mechanisms.
The Hodgkin Index and other similarity indexes usefully summarize global energy-based relationships between structures, but they cannot provide detailed information about regions
potentially important for binding affinity or selectivity. Another problem, specific to MIFs,
is how to integrate and analyze the information obtained by using many different chemical
probes. To this purpose multivariate analysis techniques and, in particular, Principal Component Analysis (PCA) have been applied to reduce the dimensionality of the problem and
identify regions characterized by highly selective interactions[99]. More recently, a number of publications have demonstrated the usefulness of Consensus Principal Component
Analysis (CPCA) applied to multi-probe MIFs[72]. CPCA allows one to discriminate between regions that are important for binding selectivity with respect to a particular probe
and regions that do not contribute to the selectivity of the protein binding sites. In addition,
many structures can be used to represent a particular protein (multiple NMR conformations or Molecular Dynamics snapshots, for instance), thus implicitly including dynamical
information into the analysis. More sophisticated statistical techniques such as Independent
Component Analysis (ICA)[25] could in principle be adopted to go beyond the requirements
needed by PCA to be effective (linear correlation and normal distributions of the variables).
The similarity indexes and multivariate techniques described above have shown their
potential as quantitative tools to compare protein structures and characterize chemically
important regions. Despite their usefulness, they have a major requirement that can limit
their applicability, namely they need a superposition of the binding sites. For proteins that
show very limited structural variability in their binding site this may not be a major issue,
but it becomes an obstacle when proteins known to bind the same ligand do not present
obvious ways to superpose their binding sites. Unfortunately, these are also the cases that
would benefit the most from these comparative approaches.
In addition, both similarity index comparisons and multivariate analysis rely on the assumption that the maximum correspondence between the grid points of different proteins
has been established; in other words, they require the best possible superposition of energy
values. This goal is only implicitly pursued when we perform a conventional minimization
of the RMSD of equivalent atoms. Using only atoms to maximize the similarity of energy
values could easily bias the results, since a group of atoms playing only a marginal role in
terms of contribution to the interaction energy with a particular probe or to the electrostatic
potential could heavily influence the final superposition of the proteins and, as a result, the
outcome of the calculations.
Barbany et al.[9] presented an approach to optimize a MIFs-based similarity index (named
MIPSIM index) defined by the following equation:
\[
\mathrm{MIPSIM}_{AB} = \frac{\sum_{i=1}^{N_A} \sum_{j=1}^{N_B} X_{iA} X_{jB} \exp(-a r_{ij}^{2})}{\sqrt{\sum_{i=1}^{N_A} \sum_{j=1}^{N_A} X_{iA} X_{jA} \exp(-a r_{ij}^{2}) \; \sum_{i=1}^{N_B} \sum_{j=1}^{N_B} X_{iB} X_{jB} \exp(-a r_{ij}^{2})}} \tag{1.2}
\]
by seeking the locally optimal superposition of the structures that returns the maximal
similarity. The method was originally proposed to superpose ligands, but it can be adapted
to protein binding sites. The major limitations of the approach are the computational demands posed by the optimization step and the fact that only a locally optimal solution will
be determined.
The problem of the optimal superposition of binding sites has been circumvented in several
approaches that combine geometrical features and structurally derived properties (e.g. the
electrostatic potential) in a translationally and rotationally invariant framework. For example, Kinoshita and Nakamura [74] built a database of functional sites described by the
electrostatic potential computed at several points on the Connolly surface of the binding
site, and then implemented a graph theoretic approach to query the database.
More recently, Sael et al.[106] developed another method to rapidly compare physicochemical properties such as the electrostatic potential and a hydrophobicity index mapped onto
the surface of proteins. The method takes advantage of 3D Zernike descriptors[95], which
consist of a series expansion of a 3D function, thereby allowing for a translationally and rotationally invariant comparison of the resulting coefficients.
Another translationally and rotationally invariant approach for comparing structurally derived properties has been introduced by Das et al.[32]. The approach relies on the concept of shape distributions[98] (originally introduced for object recognition purposes), which
measure the probability that a given property will be at a specified distance from another
on the surface, thereby incorporating shape and property distributions in one measure.
The method was benchmarked on the PDBBind database[119] and the results indicated its
capability to classify binding sites in functionally meaningful groups (defined by the EC
numbers of the enzymes).
One of the main limitations of these approaches lies in their inability to compare binding
sites of different sizes or highly different shapes. Also, the electrostatic potential may not
be the best structurally-derived property to use for all binding sites (see Chapter 3).
1.4 Available Software for Binding Site Identification and Characterization
The calculations involved in binding site identification and characterization require specialized software, irrespective of whether the geometrical or energy-based approaches are
used. The first major requirement for scientific software is clearly its
availability, but another essential feature required for large scale analyses is the possibility of running the programs locally as opposed to using a web-based interface. As Table
1.1 shows, most of the methods to carry out geometrical or energy-based identification
and characterization of protein binding sites are either provided as webservers or require
a commercial license. More importantly, no currently available tool provides a combined
framework in which one can perform binding site identification and characterization using
an energy-based approach.
For these reasons I set out to implement a comprehensive package to address these needs.
An introduction to the methods is provided below and in Appendix A.
Table 1.1: Available software for binding site identification and characterization
Name          Purpose                                         Availability              Reference
FTMAP         fragment-based identification of hot spots      webserver only            Brenke et al.[15]
CMIP          energy-based characterization of binding sites  currently unavailable     Gelpi et al.[44]
GRID          energy-based characterization of binding sites  commercial license        Goodford et al.[49]
LIGSITE       binding site identification (geometrical)       webserver and standalone  Huang et al.[64]
PocketFinder  binding site identification (geometrical)       webserver only            Hendlich et al.[55]
PocketPicker  binding site identification (geometrical)       webserver and standalone  Weisel et al.[121]
Q-SiteFinder  binding site identification (energy-based)      webserver only            Laurie et al.[81]
1.5 EASYMIFs and SITEHOUND
EASYMIFs and SITEHOUND[46] are two software tools that in combination enable the
identification of binding sites in protein structures using an energy-based approach. EASYMIFs
is a simple Molecular Interaction Fields (MIFs) calculator; and SITEHOUND is a post-processing tool for MIFs that identifies interaction energy clusters corresponding to putative
binding sites. While these tools are conveniently used in combination, they can also be
used separately. EASYMIFs can be used to calculate MIFs for binding site characterization,
Quantitative Structure-Activity Relationship (QSAR) studies, selectivity analysis of protein
families, pharmacophoric search, and other applications that require MIFs [29]. SITEHOUND
can be used to process the output from other MIF or Affinity Map calculation programs, in
addition to EASYMIFs, such as GRID [49] and the Autogrid tool of the AutoDock software
package [92].
Figure 1.4: Identification of ligand-binding sites using EASYMIFs and SITEHOUND. (A) A protein
structure is used as input and the program EASYMIFs computes the potential interaction energy of a molecular probe with the protein on each point on an orthogonal grid called a Molecular Interaction Field (MIF).
(B) The program SITEHOUND processes the MIF by first removing all points that have unfavorable interaction energy, (C) subsequently the remaining points are grouped using a hierarchical clustering algorithm,
and the resulting clusters are ranked by their Total Interaction Energy (the sum of the interaction energy
of all points in one cluster). (D) Known binding sites are usually found among the top three clusters.
1.5.1 EASYMIFs
Molecular Interaction Fields (MIFs) describe the spatial variation of the interaction
energy between a target molecule and a specific probe, which usually represents a chemical
group. Although the interaction energy field is, by definition, a continuous quantity, for
computational convenience it is usually discretized on a three-dimensional orthogonal grid
that surrounds the molecule of interest. The output of a MIF calculation is therefore
represented by an energy map that provides information about the potential energy between
the probe and the molecule under analysis. EASYMIFs aims to provide a simple and rapid
way to characterize a protein structure from a chemical standpoint at the global or local
level (e.g. around an active site), returning maps that can be loaded in a Molecular Graphics
Software such as PyMol[35], VMD[66] or Chimera[101].
1.5.1.1 Calculation of MIFs in EASYMIFs
EASYMIFs computes the potential energy between a chemical probe (represented by a
particular atom type) and the protein on a regularly spaced grid, using the following equation:
\[
V_i = \sum_{j} \left( V_{LJ}(r_{ij}) + V_E(r_{ij}) \right) \tag{1.3}
\]
where the potential energy calculated for a probe at a point i in the grid is equal to
the sum of a Lennard-Jones and an electrostatics term over all the atoms of the protein.
rij represents the distance between the probe at point i in the grid and an atom j of the
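protein. The Lennard-Jones and the electrostatics terms are expressed by the following two
equations:
\[
V_{LJ}(r_{ij}) = \frac{C_{ij}^{(12)}}{r_{ij}^{12}} - \frac{C_{ij}^{(6)}}{r_{ij}^{6}} \tag{1.4}
\]
\[
V_E(r_{ij}) = \frac{1}{4\pi\varepsilon_0} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}} \tag{1.5}
\]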
The $C^{(12)}$ and $C^{(6)}$ parameters in the Lennard-Jones term depend on the chosen probe
and the particular atom type and are taken from a matrix of LJ-parameters distributed
with the GROMACS package[117]. The electrostatic conversion factor $\frac{1}{4\pi\varepsilon_0}$ has been set to 138.935485 (kJ mol$^{-1}$ nm e$^{-2}$ in the GROMACS unit system).
The distance-dependent dielectric sigmoidal function has been taken from Solmajer and
Mehler[111] and has the following form:
\[
\varepsilon(r_{ij}) = A + \frac{B}{1 + k e^{-\lambda B r_{ij}}} \tag{1.6}
\]
where $A = 6.02944$; $B = e_0 - A$; $e_0 = 78.4$; $\lambda = 0.018733345$; $k = 213.5782$. When the
distance between the probe and an atom becomes less than 1.32 Å, a dielectric constant of
8 is used. The parameters reported above for the distance-dependent dielectric have been
taken from Cui et al.[30].
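The following Python sketch illustrates how equations 1.3 to 1.6 combine into the energy of a probe at a single grid point. It is not the EASYMIFs source code. The function names, the tuple layout used for atoms, and the unit handling are assumptions made for this example: coordinates are taken to be in nm (so that the constant 138.935485 applies directly) and are converted to Å only for the dielectric function, whose 1.32 Å cutoff is quoted in the text. Any $C^{(12)}$ and $C^{(6)}$ values passed in are placeholders, not GROMACS parameters.

```python
import math

COULOMB = 138.935485   # 1/(4*pi*eps0), value quoted in the text
A = 6.02944
E0 = 78.4
B = E0 - A
LAMBDA = 0.018733345
K = 213.5782
R_SHORT = 1.32         # Angstrom; below this distance a fixed dielectric is used
EPS_SHORT = 8.0


def dielectric(r_angstrom):
    """Sigmoidal distance-dependent dielectric (equation 1.6)."""
    if r_angstrom < R_SHORT:
        return EPS_SHORT
    return A + B / (1.0 + K * math.exp(-LAMBDA * B * r_angstrom))


def probe_energy(point, probe_charge, atoms):
    """Equation 1.3: sum of Lennard-Jones and electrostatic terms over all atoms.

    point -- (x, y, z) of the grid point, assumed to be in nm
    atoms -- iterable of (xyz, charge, c12, c6), with c12/c6 already
             combined for the probe/atom pair (hypothetical layout)
    """
    total = 0.0
    for xyz, charge, c12, c6 in atoms:
        r = math.dist(point, xyz)                # distance in nm, must be > 0
        v_lj = c12 / r**12 - c6 / r**6           # equation 1.4
        eps = dielectric(10.0 * r)               # nm -> Angstrom (assumption)
        v_e = COULOMB * probe_charge * charge / (eps * r)   # equation 1.5
        total += v_lj + v_e
    return total


# A single unit charge 1 nm away from an oppositely charged probe, no LJ term
print(probe_energy((0.0, 0.0, 0.0), -1.0, [((1.0, 0.0, 0.0), 1.0, 0.0, 0.0)]))
```

With the parameters listed above the sigmoidal function evaluates to approximately 8 at 1.32 Å, so the fixed short-range value joins the curve almost continuously, and it approaches 78.4 at large separations.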
1.5.1.2 Visualizing the results
EASYMIFs produces Interaction Energy Maps in the dx format, which can be conveniently visualized in PyMOL, Chimera, VMD and other molecular graphics packages. The dx
file is usually displayed as a contour plot, showing regions of space where the energy value
is within a specified range. For large scale analyses that involve the generation of many
thousands of maps it is also possible to use compressed maps that achieve a compression ratio
usually greater than 4, thereby making optimal use of the disk space. The compression algorithm incorporated into EASYMIFs is the Lempel-Ziv-Welch (LZW) algorithm[122], with
an O(1) dictionary search step that adds almost no overhead to the calculations.
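For readers unfamiliar with LZW, the sketch below shows the textbook form of the algorithm in Python. It is a generic illustration and not the implementation used in EASYMIFs, which is not reproduced here. A Python `dict` (a hash table with constant-time average lookup) stands in for the O(1) dictionary search mentioned above, and the sample text is made up.

```python
def lzw_compress(data):
    """Encode a bytes object as a list of integer LZW codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:           # hash lookup
            current = candidate
        else:
            codes.append(dictionary[current])
            dictionary[candidate] = next_code  # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes from a list of LZW codes."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:               # sequence defined by this very step
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        output.append(entry)
        dictionary[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return b"".join(output)


# Grid maps contain many repeated values, which is where LZW pays off
sample = b"-0.125 -0.125 -0.250 -0.125 " * 50
encoded = lzw_compress(sample)
assert lzw_decompress(encoded) == sample
print(len(sample), "bytes ->", len(encoded), "codes")
```

The number of codes is only a rough proxy for the compressed size, since each code needs more than 8 bits once the dictionary grows beyond 256 entries.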
Figure 1.5 shows an example of MIF calculations. EASYMIFs has been used to calculate
an interaction energy map between the protein (in the binding site region) and a hydroxyl probe, shown in gold in the Figure. The box around the binding site illustrates the
boundaries of the box used in the calculations.
1.5.2 SITEHOUND
The purpose of SITEHOUND is to manipulate the output of the EASYMIFs program (and
other programs such as Autogrid [92] and GRID [49]) in order to predict regions on protein
structures that are likely to be involved in binding to small molecules or peptides. The
approach is based on the Q-SiteFinder algorithm [81], but contains more options and
improvements. The main differences lie in the use of multiple probes for the detection of
different types of binding sites (see Chapter 3); alternative clustering algorithms, which
improve results for ligands of different shapes (see Appendix A); and the fact that SITEHOUND can be run independently of a web interface.
Figure 1.5: An example of interaction energy calculations on a protein. The protein shown here
is a D-allose binding protein (PDB code 1rpj). The box delimits the area of the protein where the
calculations have been carried out. The golden points indicate areas of favorable interaction energy with
a hydroxyl probe (energy threshold set to -28 kJ/mol). The ligand is overlaid for comparison, but was
removed before computing the interaction energy map.
The program first filters out all the grid points that have an energy value above a user-specified threshold (a negative value) and clusters the remaining points according to spatial proximity using single or average linkage agglomerative clustering (see Appendix A). Subsequently,
the Total Interaction Energy (TIE) of each cluster is computed and this value is used
to rank the clusters, from the most negative to the least negative. The last step involves
printing the results to text files and in the PDB and DX formats, which allow for graphical
display of the results on the protein using standard molecular visualization tools (such as
Chimera[101], PyMol[35] or VMD[66]).
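The filter, cluster and rank procedure described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions and not the SITEHOUND code. The function name `rank_clusters`, the threshold and the distance cutoff are hypothetical, and only single linkage is shown: cutting a single linkage dendrogram at a given distance is equivalent to finding the connected components of the graph that links points closer than that distance, which is done here with a small union-find structure.

```python
import math


def rank_clusters(points, energies, threshold, cutoff):
    """Return [(TIE, [points...]), ...] sorted from most to least negative TIE."""
    # 1. keep only grid points at or below the (negative) energy threshold
    kept = [(p, e) for p, e in zip(points, energies) if e <= threshold]

    # 2. single linkage at a fixed cutoff == connected components (union-find)
    parent = list(range(len(kept)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(kept)):
        for j in range(i + 1, len(kept)):
            if math.dist(kept[i][0], kept[j][0]) <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i, (p, e) in enumerate(kept):
        clusters.setdefault(find(i), []).append((p, e))

    # 3. Total Interaction Energy = sum of the energies of the cluster points
    ranked = [(sum(e for _, e in members), [p for p, _ in members])
              for members in clusters.values()]
    ranked.sort(key=lambda item: item[0])   # most negative first
    return ranked


# Toy grid: three neighbouring favorable points, one isolated favorable point,
# and two points that do not pass the threshold (all values hypothetical)
pts = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (10, 0, 0), (11, 0, 0), (20, 0, 0)]
ens = [-10.0, -12.0, -9.0, -20.0, -5.0, -1.0]
for tie, members in rank_clusters(pts, ens, threshold=-8.0, cutoff=1.5):
    print(tie, members)
```

In this example the three-point cluster outranks the single point with the lowest individual energy, which illustrates why ranking by the summed energy favors extended regions over isolated favorable points. The pairwise loop is quadratic in the number of retained points, which is adequate for a sketch but not for large maps.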
Figure 1.6 illustrates an example of binding site identification carried out with two different probes (methyl and phosphate oxygen) on the same protein. The combination of the
two probes yields a more comprehensive picture of the large binding site and the correct
identification of the adenosine and phosphate binding regions.
1.5.2.1 The SITEHOUND-web Server
A streamlined web-based interface to carry out binding site identification using SITEHOUND
has been made available at http://sitehound.sanchezlab.org[59]. The interface (Figure 1.7)
can be used to upload a PDB structure, automatically perform the binding site identification
and visualize the results of the calculations on a ribbon representation of the protein. The
residues potentially involved in binding are also reported on a per-cluster basis, together
with a summary of the main features of the clusters. From the results page (Figure 1.7)
the user can also download all the files that are produced by SITEHOUND. Furthermore, it
is possible to download the .map file produced by EASYMIFs or Autogrid, which can be
used by SITEHOUND to carry out the binding site identification with combinations of parameters
different from the default parameters used by the web server. SITEHOUND-web only allows
for the processing of relatively small systems with default parameters.
Figure 1.6: Characterization of the yeast adenylate kinase binding site using EASYMIFs and
SITEHOUND. (A) Ribbon diagram of the yeast adenylate kinase structure showing the top ranking clusters
as solid surfaces: phosphate probe cluster (red) and carbon probe clusters (green). (B) SITEHOUND clusters
superposed on the structure of the Ap5A (bis(adenosine)-5'-pentaphosphate) inhibitor of adenylate kinase [1].
The phosphate probe correctly identifies the pathway of phosphoryl transfer, and the carbon probe correctly
identifies the adenosine binding regions. The Figure was prepared using the 1aky CMET cluster.pdb and
1aky OP cluster.pdb files from the example, and the PyMOL[35] molecular graphics program.
1.6 Conclusions
The molecular function of a protein is largely determined by interactions with other
molecules at binding sites on its surface. Hence, identification of the location and characteristics of ligand-binding sites can contribute to functional annotation of a protein; it can
guide experiments, and be useful in predicting or verifying interactions. The identification
of ligand-binding sites can also be an important part of the drug discovery process.
Several methods have been developed for the identification of binding sites from protein
structures and sequences, and the main ideas underlying some of the most widely used
approaches have been discussed above. Sequence-based methods have the advantage of
being applicable to proteins of unknown structure, by relying on the evolutionary conservation of residues. However, they are also limited by the fact that not all binding sites are
conserved, and not all conserved residues correspond to binding sites. Structure-based approaches can overcome these limitations and complement sequence-based methods. Among
the structure-based approaches, the energy-based methods directly describe the molecular
interaction properties of the protein surface and can in principle also distinguish binding
sites with distinct ligand preferences (e.g. hydrophobic versus polar) if different chemical
probes are used for the molecular interaction calculation.
Figure 1.7: SITEHOUND-web results page. The results of a carbon probe calculation with the average
linkage clustering are shown for adenylate kinase (PDB code 1aky). The Web Interface is available at
http://sitehound.sanchezlab.org
The limited availability of software to compute interaction energy maps and the absence
of methods that incorporate multiple probes into the binding site identification framework
provided the motivation for implementing the EASYMIFs and SITEHOUND programs. The
remainder of this work will describe their application in the context of protein-ligand docking, binding site identification tailored to specific classes of ligands and the identification of
bound forms in cases where large conformational changes take place.
Chapter 2
Focused Docking
2.1 Introduction
2.1.1 Reverse Virtual Screening
As mentioned in Chapter 1, an important step towards (possibly automatic) protein
functional annotation consists in elucidating all the possible binding partners of a protein.
Directly investigating protein binding sites also has valuable applications in the field of drug
discovery, where several approaches have been implemented (both experimental and computational) for predicting all the possible biological targets of a drug[68]. The computational
approaches meant to accomplish this task usually go under the name of reverse virtual
screening.
At least four different situations can be envisaged where knowing all the targets of a drug
can be therapeutically exploited (figure 2.1). The first one involves the problem of so-called
orphan drugs, i.e. pharmaceuticals that are marketed and proven to be effective despite our
lack of knowledge about their molecular target(s). Related to this is the issue of adverse
drug reactions due to off-target effects, where the drug binds to (unwanted) secondary targets giving rise to an adverse reaction. Another scenario involves the repurposing of existing
drugs for new uses. In this case the off-target effects of the drug are exploited to treat other
medical conditions where the additional targets of the drug play an important role. The
advantage of repurposing the drug is mainly due to safety and pharmacoeconomics issues.
Figure 2.1: The panels represent different instances where the knowledge of all the targets modulated by a drug can be beneficial. A: adverse drug reactions due to off-target effects; B: orphan
drugs (drugs whose mechanism of action was previously unknown); C: drug repurposing; D: drug
interactions (for simplicity the drugs are shown to interact with the same target, but more indirect
interactions are also possible, for example at the pathway level). The dashed arrow indicates an
interaction that was previously unknown.
Finally, the problem of drug-drug interactions (a common cause of iatrogenic injuries and
therapeutic failures) can also be framed in the context of shared similarities between drug
targets. Significant effort has been devoted to experimental and computational approaches
to detect potentially harmful drug-drug interactions that arise from shared metabolizing
enzymes. In this case the interaction is due to the fact that both drugs compete for the
same metabolic pathway responsible for the degradation and subsequent excretion of the
drug, leading to increases in blood concentration of one or both drugs. A more challenging
mechanism to predict can be found at the pharmacodynamic level, where the interaction
between drugs occurs at the level of their targets, with synergistic or antagonistic effects.
The problem is compounded by the fact that the interaction between drugs is not necessarily due to modulation of the same target, but can also occur at the pathway level.
Determining all the biologically relevant protein partners of a small molecule clearly represents a formidable task. Computational filtering tools capable of narrowing down the
experimental validation to a handful of proteins are therefore necessary. One of the most
efficient approaches from a computational standpoint is to look for small molecules that
are very similar to the one that is under investigation and assume that they will also share
the same set of targets. A review of some of the techniques that are available can be found
in Sheridan and Kearsley[109]. Even though these approaches are computationally very
effective, they are limited in the sense that they rely on already discovered interactions of
cognate small molecules. If the small molecule of interest is not significantly similar to any
other small molecule for which there is enough information, nothing useful can be inferred.
More general approaches involve docking the small molecule into a large set of proteins and
using the docking energy as a ranking criterion. The main idea of screening a large set of
protein structures against a particular small molecule of interest has been described by Paul
et al.[100], who carried out docking experiments by screening proteins whose binding
site was already known. Although this approach enables faster reverse virtual screening, it
limits the universe of candidate targets to those proteins that have clearly identified binding
sites and only to those sites within the protein. Ideally, a reverse virtual screening approach
would require only the knowledge of the three-dimensional structure of the candidate target
proteins and would allow for the discovery of unexpected interactions that may occur at
previously unidentified binding sites. This idea will be described in this chapter by showing
the results of combining a binding site prediction step with a subsequent docking experiment.
A brief introduction to the docking problem is necessary to understand the challenges involved in its applications in the context of reverse virtual screening.
2.1.2 The Protein-Ligand Docking Problem
The problem of protein-ligand docking frames the two interrelated questions of the free
energy of binding and the optimal orientation and conformation of the ligand (and possibly
of the protein as well) as a global optimization problem. In other words, given a protein
structure and a ligand the docking algorithm will seek the orientation and conformation
of the ligand that yields a global minimum on the free energy landscape. In practical
applications, the free energy is usually approximated by the docking energy, which has
been tuned to empirically reproduce experimentally derived free energy values for a given
set of known complexes.
The majority of docking algorithms consist of two components:
1. A scoring function that returns the docking energy for a given orientation and conformation of the ligand with respect to the protein
2. A search strategy that seeks to find the global minimum in the docking energy landscape
Here the focus will be on the AutoDock[92] package, one of the most widely cited applications for molecular docking (as reported at http://autodock.scripps.edu).
2.1.2.1 The scoring function component
AutoDock estimates the free energy of binding between a protein (P) and a ligand (L)
using pairwise terms and an implicit solvent model. More formally:
\[
\Delta G = (V_{bound}^{L-L} - V_{unbound}^{L-L}) + (V_{bound}^{P-P} - V_{unbound}^{P-P}) + (V_{bound}^{P-L} - V_{unbound}^{P-L} + \Delta S_{conf}) \tag{2.1}
\]
In other words, the intramolecular energetics of the transition from the unbound state to
the bound one are evaluated separately for each of the molecules, and finally the intermolecular energetics of the protein and the ligand in the complex are computed. The second term
in the equation above is clearly 0 if the protein is kept fixed during docking. The entropic
loss that occurs upon binding (last term in the equation) is directly proportional to the
number of rotatable bonds in the ligand.
The pairwise atomic terms in AutoDock are described by the following equation:
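\[
V = W_{vdw} \sum_{i,j} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right)
+ W_{hbond} \sum_{i,j} E(\theta) \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right)
+ W_{elec} \sum_{i,j} \frac{q_i q_j}{\varepsilon(r_{ij})\, r_{ij}}
+ W_{sol} \sum_{i,j} (S_i V_j + S_j V_i) \exp\left( -r_{ij}^{2} / 2\sigma^{2} \right) \tag{2.2}
\]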
The first term is a 6/12 potential for dispersion/repulsion interactions, while the second
term explicitly takes into account the H-bond term, with a dependence on the angle expressed as a deviation from the ideal bonding geometry. The electrostatic term is expressed
by a Coulomb potential with a distance dependent dielectric. The desolvation potential is
based on the volume (V) of the atoms that surround a given atom weighted by a solvation
parameter (S) and (exponentially) by the distance.
All the coefficients in AutoDock4 were derived by calibration on a set of 188 protein-ligand
complexes whose binding energies were experimentally determined[65].
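The four pairwise terms described above can be sketched for a single atom pair as follows. This is only an illustration, not AutoDock's implementation: the unit weights, the Gaussian width SIGMA, and the simple dielectric function are assumptions made here, and the function names are hypothetical.

```python
import math

# Illustrative weights only; AutoDock's calibrated coefficients are not reproduced here.
W_VDW, W_HBOND, W_ELEC, W_SOL = 1.0, 1.0, 1.0, 1.0
SIGMA = 3.5  # Angstrom; assumed width of the Gaussian desolvation weighting


def dielectric(r):
    """Placeholder distance-dependent dielectric (assumption: eps(r) = 4r)."""
    return 4.0 * r


def pair_energy(r, A, B, C, D, E_theta, q_i, q_j, S_i, V_i, S_j, V_j):
    """Sum the four terms of equation 2.2 for one atom pair at distance r (Angstrom)."""
    vdw = W_VDW * (A / r**12 - B / r**6)                   # dispersion/repulsion
    hbond = W_HBOND * E_theta * (C / r**12 - D / r**10)    # directional H-bond
    elec = W_ELEC * q_i * q_j / (dielectric(r) * r)        # screened Coulomb
    desolv = W_SOL * (S_i * V_j + S_j * V_i) * math.exp(-r**2 / (2 * SIGMA**2))
    return vdw + hbond + elec + desolv
```

Summing pair_energy over all protein-ligand atom pairs would give the intermolecular part of the docking energy under these assumptions.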
2.1.2.2 The search component
The expressions described in the previous section define the energy landscape that has
to be sampled in order to determine the global optimum. Since an analytical solution to
this problem does not exist, AutoDock resorts to global search heuristics to find a reasonable approximation. In particular, AutoDock employs a modified version of Genetic
Algorithms[90] .
Genetic Algorithms are a class of evolutionary strategies widely employed in global optimization problems. They apply a Darwinian process of selection of the fittest to a population
of individuals, each of which represents a potential solution to the problem under analysis. In this
specific context, each individual bears a genotype that fully describes the orientation and
conformation of the ligand, and the fitness is simply represented by the resulting docking
energy.
A simplified description of the optimization algorithm implemented in AutoDock is given
below:
1. randomly initialize the population
2. evaluate the fitness of each individual
3. select a fraction of the most fit individuals for reproduction
4. apply crossover and mutation operators to the individuals
5. carry out local optimization
6. evaluate the fitness of the new individuals
7. discard a fraction of the least fit individuals
8. repeat steps 3 through 7 until the maximum number of energy evaluations has been
reached
The mutation operator simply applies a random change to the genotype of an individual
according to a predefined probability distribution. In AutoDock this amounts to a random
change in the values of one of the degrees of freedom of the ligand. The crossover operator
mimics the exchange of genetic material that occurs during meiosis, and it consists in
assembling parts of the parents' genotypes into a new combination.
The most notable difference from the common implementations of Genetic Algorithms is
to be found in the local optimization step, which brings the solutions represented by the
individual genotypes to their local minima, by means of a local optimization search carried
out in the coordinate space of the ligands. The optimized solution is then coded back
in the genotype of the individuals, and this explains the name of Lamarckian Genetic
Algorithms given to the hybrid version implemented in AutoDock.
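The loop described above can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not AutoDock's implementation: the genotype is a plain list of real numbers, the local optimizer is a simple random-perturbation search, and all names and default parameters are hypothetical. The Lamarckian aspect is that the locally optimized genotype, rather than the original one, is stored in the population.

```python
import random


def local_search(genotype, energy, rng, step=0.1, tries=20):
    """Random-perturbation local search (a stand-in for a real local optimizer)."""
    best, best_e = list(genotype), energy(genotype)
    for _ in range(tries):
        cand = [g + rng.uniform(-step, step) for g in best]
        e = energy(cand)
        if e < best_e:
            best, best_e = cand, e
    return best_e, best


def lamarckian_ga(energy, n_genes, pop_size=20, max_evals=5000,
                  mutation_rate=0.2, keep=0.5, seed=0):
    rng = random.Random(seed)
    evals = 0

    def counted(g):
        # Every call to the energy function counts towards the budget.
        nonlocal evals
        evals += 1
        return energy(g)

    # Steps 1-2: random initial population, each individual evaluated.
    pop = []
    for _ in range(pop_size):
        g = [rng.uniform(-5.0, 5.0) for _ in range(n_genes)]
        pop.append((counted(g), g))

    # The budget is checked once per generation, so it can be slightly exceeded.
    while evals < max_evals:
        pop.sort(key=lambda t: t[0])
        parents = pop[:max(2, int(keep * pop_size))]   # step 3 (and 7): keep the fittest
        children = []
        while len(children) < pop_size - len(parents):
            (_, a), (_, b) = rng.sample(parents, 2)
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = a[:cut] + b[cut:]                  # step 4: crossover
            if rng.random() < mutation_rate:           # step 4: mutation
                k = rng.randrange(n_genes)
                child[k] += rng.gauss(0.0, 1.0)
            # Steps 5-6: local optimization; the optimized genotype is written back.
            children.append(local_search(child, counted, rng))
        pop = parents + children
    return min(pop, key=lambda t: t[0])
```

For example, lamarckian_ga(lambda g: sum(x * x for x in g), n_genes=3) returns an (energy, genotype) pair for a simple quadratic bowl. In a docking setting the genotype would instead encode the translation, orientation, and torsions of the ligand, and the energy function would be the docking energy.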
It is important to mention that the maximum number of energy evaluations is the primary
factor that controls when the algorithm will stop its search. A value set too low will prevent
the algorithm from thoroughly sampling the free energy space and will usually result in poor
solutions. On the other hand, a balance has to be struck in terms of accuracy and efficiency
of the search, since in many applications docking is applied to thousands of molecules and
speed is a crucial factor.
Another factor that can potentially influence the aforementioned balance between speed and
quality of the results is the size of the space allowed for docking. In virtual screening, where
the binding site of the protein is usually known, the docking calculations are restricted to a
region approximately corresponding to the binding site. This is not applicable to the case
of reverse virtual screening, as will be discussed below.
2.1.3 Blind Docking vs Focused Docking
The goal of protein-ligand docking is to predict the position and orientation of a ligand
(usually a small molecule) when it is bound to a receptor protein. When the binding site
to be targeted by the small-molecule is known, selecting a reasonably small docking box
around this site facilitates docking by focusing sampling of the translational, rotational, and
torsional degrees of freedom of the ligand. This is the usual situation in lead optimization,
where predicting the binding mode or pose of the ligand is needed for rational design of
improved potency and selectivity, and in hit identification through virtual screening where
the goal is the discovery of ligands, out of a large library, that are likely to bind a protein
target. The reverse question is more difficult to address. Given a ligand, is it possible to
discover its most likely target? In this reverse virtual screening case, because the binding
site is not known it becomes necessary to explore the entire protein surface by docking, a
procedure that has been named blind docking[60, 61]. Because the space where blind
docking takes place must accommodate the entire protein and is therefore much larger
than a regular docking box, the number of energy evaluations carried out by the docking
program is usually set to a proportionally higher value[60, 61], with a corresponding
increase in the running time. This shortcoming has been partially overcome by using known
protein binding sites as targets for reverse-virtual screening[100]. Although this approach
enables faster reverse virtual screening, it limits the universe of candidate targets to those
proteins that have clearly identified binding sites and only to those sites within the protein.
Ideally, a reverse virtual screening approach would require only the knowledge of the three-dimensional structure of the candidate target proteins and would allow for the discovery of
unexpected interactions that may occur at previously unidentified binding sites. One such
approach has been described by Brown and Vander Jagt[19], in which a macromolecule
encapsulating surface (MES) was used to geometrically define the boundaries of predicted
binding sites and guide the docking search. On a set of 14 protein-ligand complexes the
MES approach was shown to improve the efficiency of the genetic algorithm-based optimizer
in the AutoDock[92] docking software.
The alternative option of predicting a set of putative sites and carrying out docking on
them one at a time is investigated here. The use of binding sites calculated directly from
the docking grid (i.e. interaction energy-based calculation) is evaluated as a tool to focus
the docking searches of the AutoDock[92, 65] software. This idea results in an approach
consisting of multiple independent docking runs carried out on smaller boxes, centered on
a few predicted binding sites, as opposed to one larger blind docking run that covers the
complete protein structure. By comparing the focused docking approach with reference
blind docking runs over a set of 77 ligand-protein complexes and 19 ligand-free proteins,
the following questions will be addressed: Is focused docking more accurate than blind
docking? Is there a real gain in computational efficiency when using focused docking? Is
there a penalty paid (e.g. missed binding sites) when using focused docking?
Figure 2.2: Blind docking and focused docking. The blind protocol consists of a single docking
experiment, carried out on the whole protein surface, whereas the focused protocol breaks up the
problem into multiple smaller docking experiments, focusing on predicted binding sites.
2.2 Materials and Methods
2.2.1 Binding site identification
A more detailed description of the algorithm can be found in Chapter 2. Here I will
only recall the main ideas behind the approach.
The algorithm to predict the location of potential binding sites for drug-like molecules is
based on principles similar to those that underlie the QSiteFinder[81] algorithm. Both algorithms identify the regions characterized by favorable van der Waals interactions, which
have been shown to play an important role in the binding of drug-like molecules to proteins.
The first step requires the computation of a low resolution (1.0 Å) carbon affinity map with
AutoGrid (part of the AutoDock suite v. 4), using a box large enough to accommodate the
entire protein. In the next step, a predefined energy cutoff (-0.3 kcal/mol for all cases) is
applied to filter out all the affinity map points corresponding to unfavorable interaction energies. Subsequently, the remaining points are clustered according to the spatial proximity
with an agglomerative hierarchical clustering algorithm using average linkage, as implemented in the C Cluster Library[33]. This step yields a hierarchical dendrogram, which is
finally cut into nonoverlapping clusters by applying a distance cutoff (7.8 Å for all cases).
This last step is made possible by the fact that the average linkage clustering produces
monotonic hierarchies. In other words, the distance between clusters at each merging step
never decreases. Therefore, the number of clusters need not be determined a priori, but
only the value for the distance cutoff must be chosen. Finally, these so-obtained clusters are
ranked by Total Interaction Energy (TIE, the sum of the energy values of all the points that
belong to the same cluster) and the first three are selected for focused docking (see below).
The spatial localization of the clusters is characterized by their center of energy (COE,
the average of their coordinates weighted by energy). The only two parameters that this
method requires are the energy cutoff to filter the grid points and the distance cutoff for the
clustering step. A range of values for these two parameters was tested, and a combination
(-0.3 kcal/mol and 7.8 Å, respectively) was chosen that yielded the most accurate binding
site prediction as defined by the accuracy measure introduced by Laurie[81]. In terms of
computational overhead for the binding site prediction step, it is worth noting
that the time required to run the SITEHOUND program is negligible with respect to the time
required for a full docking experiment. The median time calculated on the dataset was <1
min per protein on a Pentium IV machine.
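The filtering, clustering, and ranking steps described in this section can be sketched as follows. This is a naive pure-Python illustration, not the SITEHOUND code: the input format (a list of (x, y, z, energy) tuples), the function names, and the use of "less than or equal to" at the cutoffs are assumptions. Because average linkage is monotonic, stopping the merges as soon as the closest pair of clusters is farther apart than the distance cutoff gives the same clusters as cutting the full dendrogram at that distance.

```python
import math


def average_linkage(a, b, pts):
    """Mean Euclidean distance over all point pairs drawn from clusters a and b."""
    total = sum(math.dist(pts[i][:3], pts[j][:3]) for i in a for j in b)
    return total / (len(a) * len(b))


def predict_sites(grid, energy_cutoff=-0.3, distance_cutoff=7.8, top=3):
    """grid: iterable of (x, y, z, energy) tuples taken from an affinity map."""
    # Keep only grid points with favorable interaction energy.
    pts = [p for p in grid if p[3] <= energy_cutoff]
    clusters = [[i] for i in range(len(pts))]
    # Agglomerative clustering: merge the closest pair until the cutoff is exceeded.
    while len(clusters) > 1:
        d, i, j = min((average_linkage(clusters[i], clusters[j], pts), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > distance_cutoff:
            break
        clusters[i] += clusters[j]
        del clusters[j]
    sites = []
    for members in clusters:
        tie = sum(pts[k][3] for k in members)  # Total Interaction Energy
        coe = tuple(sum(pts[k][3] * pts[k][a] for k in members) / tie
                    for a in range(3))         # energy-weighted center
        sites.append({"tie": tie, "coe": coe, "size": len(members)})
    sites.sort(key=lambda s: s["tie"])         # most favorable (most negative) first
    return sites[:top]
```

The naive pairwise search is far too slow for a real affinity map with many thousands of points; a dedicated clustering library would be the practical choice there.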
2.2.2 Blind docking setup
Details about the complexes selected for this analysis and the ligand preparation can
be found in Appendix B. The docking parameters recommended by Hetenyi and van der
Spoel[61] were used, with the most relevant for this analysis being the docking box size and
the number of energy evaluations. The dimensions of the boxes were calculated in such a
way as to allow a clearance of 5 Å from each side of the box, and the resolution was set to
0.55 Å. The average number of points per box for this dataset amounted to 1.6 × 10^6. The
number of energy evaluations was set to 10^7 and, for comparison with the faster focused
docking (see below), an additional set of blind docking experiments was carried out with 10^6
energy evaluations. I refer to these two groups of docking experiments as slow and fast
blind docking, respectively.
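As a rough illustration of how a blind docking box with a given clearance and resolution might be sized, the sketch below computes the box center and the number of grid points per axis from a set of atomic coordinates. The function name and the convention of counting points as intervals plus one are assumptions made here, not a description of how AutoGrid expects its input.

```python
import math


def blind_box(coords, clearance=5.0, spacing=0.55):
    """Return (center, points_per_axis) for a box enclosing all coords.

    coords: list of (x, y, z) atomic coordinates in Angstrom.
    """
    mins = [min(c[a] for c in coords) for a in range(3)]
    maxs = [max(c[a] for c in coords) for a in range(3)]
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    # Extent of the protein plus the clearance on both sides, in grid intervals.
    npts = tuple(math.ceil((hi - lo + 2 * clearance) / spacing) + 1
                 for lo, hi in zip(mins, maxs))
    return center, npts
```

Multiplying the three point counts gives the total number of grid points in the box, which is the quantity averaged over the dataset in the text above.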
Table 2.1: Parameters for the different sets of focused docking experiments

Set  Description            Resolution (Å)  Dimensions (points)  No. of energy evaluations
1    Slow, low-resolution   0.55            40 × 40 × 40         10^7
2    Slow, high-resolution  0.375           60 × 60 × 60         10^7
3    Fast, low-resolution   0.55            40 × 40 × 40         10^6
4    Fast, high-resolution  0.375           60 × 60 × 60         10^6

2.2.3 Focused docking setup
In the focused docking experiments, the search space was restricted to the vicinity of the
top three binding sites predicted by the SITEHOUND program (figure 2.2). Thus, each focused
docking experiment consisted of three independent runs, with the docking box centered on
the COE of the predicted first, second, and third binding site respectively (ranked by TIE).
The size of the box for the focused docking experiments (23 Å × 23 Å × 23 Å) was chosen
on the basis of the results shown in figure 2.3, which shows that in 95% of the cases
the center of the ligand falls within 10.0 Å of the COE of one of the first three predicted
sites. The candidate solution was defined as the one that had the lowest docking energy
among the three putative sites explored. Two alternative ranking methods were explored,
one based on the selection of the largest cluster, and the one proposed by Ruvinsky[105]
that corrects for cluster occupancy. In both cases the ranking was less accurate than using
the lowest docking energy. To mimic the two blind docking runs (slow and fast) the number
of energy evaluations was also varied for the focused runs. Additionally, the smaller size of
the three focused docking boxes enabled the use of a second set with a higher resolution box.
Table 2.1 describes the four sets of focused docking experiments that result from varying
the number of energy evaluations and the docking box resolution. Because the number of
jobs per docking was set to 33 (instead of 100 as in the case of blind docking), set 1 and
set 2 are comparable in running times to slow blind docking, whereas set 3 and set 4 are
comparable to fast blind docking.
Figure 2.3: SITEHOUND binding site identification performance. Distribution of distances between
the center of the ligand in the crystal structure of the complex and the Center of Energy of the best
site (i.e. closest to the ligand) out of the first three ranking sites predicted by SITEHOUND for 77
protein-ligand complexes
2.2.4 Focused docking with masked grids
Another approach for biasing the docking towards the predicted binding sites was explored as an alternative to running independent docking experiments with smaller grids
centered on the predicted site. The approach consists in masking all the carbon grid points
that are outside a sphere of 11.0 Å radius centered at the predicted sites by assigning to
them extremely high energy values (10^5 kcal/mol), so that the regions outside the binding
sites become forbidden. The docking is then carried out as described for blind docking.
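The masking step can be sketched as below. This is only an illustration of the idea: the grid is assumed to be a list of (x, y, z, energy) tuples and the function name is hypothetical; the radius and the masking energy follow the values quoted above.

```python
import math

MASK_ENERGY = 1e5  # kcal/mol, assigned to forbidden grid points


def mask_grid(grid, site_centers, radius=11.0, mask_energy=MASK_ENERGY):
    """Return a copy of the grid with points outside all site spheres masked."""
    masked = []
    for x, y, z, e in grid:
        inside = any(math.dist((x, y, z), c) <= radius for c in site_centers)
        masked.append((x, y, z, e if inside else mask_energy))
    return masked
```

Passing a single center in site_centers corresponds to masking everything except one site at a time, while passing three centers keeps all three predicted sites accessible in the same run.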
2.2.5 Comparison of blind vs. focused docking
As a first step to compare blind and focused docking, it was determined whether the
docking results identified the correct binding pocket, as defined by the crystal structure of
the protein/ligand complex. This was done by measuring the overlap between the candidate
solutions for blind and focused docking (lowest docking energy pose of the ligand) and the
ligand in the experimental structure. The overlap was defined as the fraction of ligand
heavy atoms that fell within 2.0 Å of a ligand heavy atom in the crystal structure. A docking
solution was said to have identified the correct binding site if the overlap was ≥ 0.15. For
those cases where both blind and focused docking identified the correct binding site the
results were further characterized by comparing the root mean squared deviation of the
ligand heavy atoms (RMSD) of the candidate solutions for blind and focused docking with
respect to the experimental structure, using the values reported in the output produced by
AutoDock. The RMSD comparisons were restricted only to those complexes where both
protocols correctly identified the binding sites, because comparison of RMSDs for solutions
in incorrect binding sites would not be meaningful. The statistical significance of the RMSD
difference between blind and focused docking was assessed with a paired Student's t-test.
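The overlap measure defined in this section can be sketched as follows. The function names are hypothetical, coordinates are assumed to be (x, y, z) tuples of heavy atoms only, and the threshold is read here as "at least 0.15".

```python
import math


def overlap_fraction(docked, crystal, cutoff=2.0):
    """Fraction of docked heavy atoms within `cutoff` Angstrom of any crystal heavy atom."""
    hits = sum(1 for a in docked
               if any(math.dist(a, b) <= cutoff for b in crystal))
    return hits / len(docked)


def correct_site(docked, crystal, min_overlap=0.15):
    """True if the docked pose is counted as lying in the experimental binding site."""
    return overlap_fraction(docked, crystal) >= min_overlap
```

Note that this measure only asks whether the pose occupies the same pocket; it does not require atom-by-atom correspondence, which is what the RMSD comparison addresses.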
2.3 Results
As illustrated in figure 2.2, the main idea behind the focused docking protocol is to
break up the exploration of the protein surface into a few smaller independent docking
jobs. The benefits that result from using a smaller sampling space focused on candidate
binding sites include a better chance to identify the native binding mode of the ligand and
the possibility of performing docking in a much faster way, as will be shown below. The
assumption behind the use of a few predicted binding sites is that only a handful of possible
small-molecule binding sites exist on protein structures, and that these sites can be reliably
identified, thus it is not necessary to explore a very large number of sites and a gain in speed
is possible without a significant loss in coverage. These assumptions are tested in the results
shown below. The identification of candidate binding sites by the SITEHOUND algorithm (see
Methods) is the first step in the focused docking protocol. Because the predicted binding
site was used to center the docking box, it is important to assess whether the COE of the
clusters representing the predicted binding sites are close to the real center of the ligand.
Figure 2.3 shows the performance of the SITEHOUND binding site identification procedure
on the Astex Diverse Set, expressed as a histogram of distances between the center of the
ligand in the crystal structure of the complex and the COE of the predicted binding sites.
In 95% of the cases the center of the ligand falls within 10.0 Å of the COE of one of the
first three predicted sites (the first site alone yields 77% of the cases). For this reason, the
focused docking experiments, below, used the first three predicted sites.
2.3.1 Comparison of blind and focused docking protocols
Two sets of docking experiments, one with 10^7 and the other with 10^6 energy evaluations
for both focused and blind docking protocols were carried out. These sets are referred to as
slow and fast docking respectively, with fast mode docking being ten times faster than slow
mode docking. As described in the Methods section, the number of runs per job was reduced
for focused docking in such a way that the three individual runs (one for each predicted
binding site) that make up one focused docking experiment taken together require the same
amount of time as one blind docking experiment.
Table 2.2: Accuracy of binding site identification

Docking experiment    Correct cases  Incorrect cases  Fraction of correct cases (%)
Blind slow            55             22               71
Blind fast            51             26               66
Focused slow (Set 1)  64             13               83
Focused slow (Set 2)  65             12               84
Focused fast (Set 3)  63             14               82
Focused fast (Set 4)  62             15               80

2.3.2 Binding site detection accuracy
As a first step to assess the performance of the two protocols in predicting the native
binding mode of the ligands as defined in the crystal structures, I selected the poses with the
lowest docking energy and calculated the fraction of the ligand heavy atoms that overlapped
with atoms of the ligand in the crystal structure. This was used as a measure of the ability
of the docking protocol to identify the correct ligand binding site in the complete protein
structure (blind docking) or among the top three predicted binding sites (focused docking).
In the case of good overlap between the docking pose and the ligand in the crystal structure
the fraction will be close or equal to one, whereas in cases where the docking protocol
misses the binding site the overlap will be close or equal to zero. As shown in figure 2.4,
the focused docking protocol outperforms the blind docking protocol in terms of ligand
binding site identification in both fast and slow mode irrespective of the overlap cutoff used
to measure accuracy. Furthermore, the data show that there is a penalty to be paid when
using blind docking in fast mode, because more cases are missed in the faster mode. In
contrast, there is no significant difference between fast and slow mode for focused docking,
or between high and low resolution focused docking. Thus, focused docking is able to
achieve a higher accuracy of binding site identification than the best blind docking protocol
(slow blind docking) even while requiring only one tenth of the computing time (set 3
and set 4, fast focused docking). The focused docking approach using masked grids (see
Methods) was tested on the Astex Diverse Set using the 10^6 energy evaluations protocol.
All but the first three predicted sites were masked. Even though the results were better
than the blind docking protocol (figure 2.4), the overall accuracy is still much lower than
with any of the other focused docking protocols. To evaluate whether the lower accuracy
of the masked approach is a consequence of the competition of the three sites present
simultaneously during docking, or the masking itself, the same experiment was repeated by
masking one site at a time. In this case, the masked approach yielded results that were
indistinguishable from the ones produced by the other focused docking protocols. This
suggests that the simultaneous presence of the hot-spot regions is suboptimal for achieving
a thorough exploration of the correct binding site, and hence there is an advantage in
exploring the predicted sites one at a time either by reducing the size of the docking box
or by masking the sites individually. As mentioned earlier, for this dataset in 95% of the
cases the ligand center falls within 10.0 Å of at least one of the first three predicted sites.
Even though the first site alone accounts for 77% of the cases, the other two sites cannot
be neglected if one wants to achieve high accuracy of binding site prediction. Using the
overlap measure described earlier to assess whether the real binding site has been identified
in docking, the accuracy ranges from 80% to 84% for focused docking. The same measure
applied to the blind docking protocol yields a binding site identification accuracy of 71%
and 66%, for slow and fast blind docking, respectively (table 2.2). These results suggest
that focused docking can provide a small improvement over the initial SITEHOUND binding
site identification step by identifying some of the correct binding sites that ranked in the
second or third position. Blind docking is unable to do so probably because the large search
space prevents the exhaustive exploration of the three candidate sites, resulting in poor
discrimination. It is interesting to note that in most cases the incorrect sites identified
in blind docking correspond to one of the three sites predicted by SITEHOUND, thus the
incorrect solution is a consequence of incomplete sampling rather than scoring. In two
cases (PDB chains 1l2sB and 1hww) blind docking identified the correct site and focused
docking did not. In both cases the correct binding site was not among the top three
SITEHOUND sites (the correct sites ranked 4th and 8th, respectively). In those cases where
the binding site was missed by the blind docking protocol, but correctly identified by the
focused docking protocol, a tendency towards a higher number of rotatable bonds in the
ligand was observed. On the other hand, in those cases where the docking performance was
poor for both protocols no clear correlation with the number of rotatable bonds in the
ligand was observed. This observation is consistent with the benefits provided by focused
docking being simply a smaller sampling space, where the number of energy evaluations can
be spent more efficiently exploring the torsional degrees of freedom of the ligand.
2.3.3 Docking pose accuracy
To further compare the performance of the two docking protocols, for fast and slow
modes, the cases where both protocols correctly identified the binding site were selected
(arbitrarily defined as the cases where the overlap was ≥ 0.15) and the distributions of
ligand heavy atoms RMSD from the crystal structure were compared (figure 2.5). For
both fast and slow mode, focused docking outperformed blind docking (P-value < 0.05 and
< 0.01 for slow and fast mode respectively). Thus, even in those cases where both methods
identify the correct binding site, focused docking is able to produce ligand poses that are
more accurate than those produced by blind docking. In a few examples, blind docking
produced a slightly lower RMSD than focused docking, with the largest RMSD difference
being 0.39 Å. For comparison, the largest RMSD improvement due to the focused docking
was 4.61 Å. As regards the comparison among the different focused docking setups, no
statistically significant difference was observed in terms of binding site identification and
RMSD of the poses from the crystal structure. Thus, focused docking is able to achieve a
higher ligand docking accuracy than the best blind docking protocol (slow blind docking)
even while requiring only one tenth of the computing time (set 3 and set 4, fast focused
docking). This observation can be explained by considering that, on average, the smaller
box used in the focused experiments yields convergence of the docking algorithm with a
lower number of energy evaluations, thanks to the reduced sampling space. Therefore, for
focused docking no penalty has to be paid when using the fast mode, whereas this does not
hold true for the blind docking protocol. It is to be expected that the performance of blind
docking will further increase with an even higher number of energy evaluations, albeit
with a corresponding increase in the computational cost.
Figure 2.4: The number of cases that have a fraction of overlapping atoms equal to or greater
than a threshold is represented. The fraction of overlapping atoms is calculated as the fraction of
ligand heavy atoms in the lowest energy pose that are within 2.0 Å of a ligand heavy atom in the
crystal structure. Red (solid): slow blind docking; red (dashed): fast blind docking; purple (dashed):
fast focused docking (masked sites); blue (solid): slow focused docking (set 1); green (solid): slow
focused docking (set 2); blue (dashed): fast focused docking (set 3); green (dashed): fast focused
docking (set 4). See Table 2.1 for description of docking sets.
Figure 2.5: Accuracy of blind and focused docking. Distribution of RMSD of the lowest energy
poses with respect to the crystal structures for the focused and blind docking protocols. Only low-resolution focused docking results are shown (see Table 2.1). The comparison includes only cases
where both blind and focused docking identified the correct binding site. For slow docking 53 out
of 77 cases are included. For fast docking 49 out of 77 cases are included.
2.3.4 Comparison of blind vs. focused docking in the unbound dataset
Further testing of the docking protocols was carried out on a subset of the dataset
for which unbound forms of the proteins are available. The performance of blind docking
(slow mode protocol) and focused docking (fast mode, low resolution) was compared on
the unbound dataset of 19 proteins. As expected, the overall docking accuracy on this set
is lower than on the set of complexes. However, the focused docking protocol produced a
marked increase in accuracy with respect to the blind protocol. Although the blind protocol
identified 6 out of 19 binding sites, the focused protocol correctly identified 11, while using
one tenth of the computational time. This corresponds to an increase in accuracy from
32% to 58%. In those few cases where the blind docking identified the correct site, focused
docking outperformed it in terms of the accuracy (RMSD) of the lowest energy pose (Table
2.3). In summary, the results on the Astex Diverse Set indicate that the focused docking
protocol outperforms the blind docking approach both in terms of binding site identification
and RMSD from the crystal structure in the cases where the binding site was successfully
detected by both protocols (see Figure 2.6 for examples). Furthermore, for focused docking
no significant advantage was observed for slow mode docking, probably due to the more
thorough sampling achieved by focusing on a smaller region. This results in higher accuracy
using only one tenth of the computing time necessary for blind docking.
Figure 2.6: Examples of improved results with focused docking. Red: blind docking (slow); blue:
focused docking (set 3, fast, low-resolution); green: crystal structure. (A) and (B), the ligand is
placed in the correct binding site by focused docking, but missed by blind docking (PDB codes:
2bsm and 1n46, respectively). (C) The ligand is placed in the correct site by focused and blind docking,
but the focused docking result is more accurate (PDB code: 1pmn).
Table 2.3: Accuracy of Blind and Focused Docking in Unbound Proteins

Target protein  Focused RMSD (Å)  Blind RMSD (Å)
1hq2            4.75              n/a
1ke5            6.00              n/a
1n2v            3.95              n/a
1oq5            3.10              3.10
1oyt            0.56              0.42
1q41A           1.62              n/a
1s3v            0.70              3.10
1v0pA           3.40              3.29
1ywr            4.64              n/a
2br1            3.03              2.96
2bsm            1.72              1.72

2.4 Conclusions
A protocol to carry out protein-ligand docking suitable for cases where the binding sites
are not known a priori was developed. Using first a simple and fast algorithm to predict
binding sites, the approach then performs independent docking jobs around each predicted
site. The results show that the docking focused on a small number of predicted binding sites
not only reduces the computational time required to compute the solution, but the docking
results are also more accurate both in terms of binding site identification and of RMSD of
the lowest energy docked pose with respect to the experimental solution. Focused docking
is able to improve the binding site identification of the SITEHOUND algorithm because it is
able to identify the correct ligand binding site even in some cases where the binding site
did not rank first in the SITEHOUND results. Overall the results suggest that the benefits
of focused docking are a consequence of improved sampling in relevant regions (predicted
binding sites) and not due to removing unwanted decoy sites that would interfere with
scoring. The fact that very few binding sites were missed by the focused docking approach
confirms that, at least in this set, it is sufficient to explore only a few of the putative binding
sites per protein. The results, taken together, suggest that since focused docking achieves
higher accuracy at a fraction of the computational cost of blind docking it is well suited
as an effective and fast protocol to enable reverse virtual screening on a large number of
proteins. It is also possible to envision the application of this approach to aid the process
of characterization of newly determined structures, especially in the context of structural
genomics initiatives. Many protein structures produced by structural genomics projects
do not have functional annotations, and computational methods are often used to provide
clues for further experimental investigations[83]. Characterizing these protein structures
from the perspective of potential ligands could be very valuable for functional annotation,
and could also suggest novel therapeutic targets.
Chapter 3
Binding Site Identification for Phosphorylated Ligands
3.1 Introduction
Phosphorylated molecules play a vital role in a wide range of biological processes, both
in prokaryotic and eukaryotic organisms. The phosphate group is employed with remarkable versatility by the cell to store energy and to reversibly modify proteins in signaling
cascades. Besides proteins and nucleotides, another class of biomolecules that can undergo
phosphorylation is represented by sugars, either as intermediates in metabolic processes or
as signaling tags that are attached to proteins. Despite the fact that no rigid classification
is possible, we can approximately distinguish between phosphorylation as a means to energetically activate metabolic intermediates or products, and phosphorylation as a marker
or switch in cell signaling. In the latter case, the addition of the phosphogroup is in some
instances capable by itself of inducing conformational changes in proteins or otherwise autonomously driving biochemical processes, but in many cases a specific decoding process
has to take place. The decoding process is usually carried out by protein domains that
specifically recognize the phosphogroup in proteins, sugars or nucleotides.
A lot of effort has gone into the characterization of these protein domains, because of their
importance for understanding fundamental biological processes coupled with their potential
therapeutic exploitation. Historically, the SH2-domain was the first to be discovered[128] as
a protein module capable of binding to its cognate ligand in a phosphorylation-dependent
manner, with other new domains being identified over the years. Since phosphorylation
occurs in such a diverse range of contexts, it is not surprising that the domains involved in
its selective recognition are oftentimes unrelated from an evolutionary or structural standpoint. Despite this diversity, studies have tried to identify some of the properties that may be
common among all the domains that recognize their cognate ligands in a phosphorylation-dependent manner. In particular, in a computational study focused on phosphopeptide
recognition Joughin et al.[70] collected 3D structures of seven phosphopeptide-binding domains and extracted properties such as amino acid identity, surface curvature, and electrostatic potential, in order to characterize the phosphopeptide-binding region with respect
to the whole of the protein surface. The propensities for each of these properties were
combined into one joint propensity that was then mapped back on the protein surfaces and
used for visual identification of the regions likely to be involved in binding. An important
result of this work was that the process of phosphorecognition cannot be fully captured by
simple properties such as the electrostatic potential (in fact in many instances the binding
site was not the region of most positive electrostatic potential on the protein surface) or
amino acid composition (as it turned out, tryptophan has higher predictive power than
arginine or lysine, since these positively charged amino acids are quite common on protein
surfaces).
In this work the recognition of all three classes of phosphomodifications (on peptides, nucleotides and sugars) is considered, regardless of whether the phosphogroup has been added
for metabolic purposes or as a signaling switch. Furthermore, I address the question of
whether it is possible to find a single structurally-derived property that has sufficient discriminative power to confidently identify most of the binding sites. This problem is tackled
by detecting energetically favorable regions on the protein surface, along the lines of what
has been previously done in binding site identification for drug-like ligands (Chapter 2).
An important difference with the identification of binding sites for drug-like ligands is that,
in the case of phosphorecognition, most of the interaction energy does not come from the
Van der Waals term, which is what most of the energy-based approaches for binding site
identification exploit. Therefore, different energy maps have to be employed in order to
precisely identify the region of the binding site responsible for the selective recognition of
the phosphogroup. Furthermore, I investigate whether including evolutionary information
in the form of a per-residue conservation score derived from Multiple Sequence Alignments
of protein families can further improve the identification of the residues involved in binding
to the phosphogroup. The ability to reliably pinpoint the region of a binding site where
the phosphorecognition takes place can be useful to guide mutagenesis experiments or as a
step in functional annotation. It is worth noting that the problem of determining
whether a protein will bind to a phosphogroup is a different one, even though the approach
presented here can be used as a first step to achieve that goal.
3.2
3.2.1 Binding Site Identification
The approach that has been employed here for binding site identification has been
already detailed in Chapter 2, so only a cursory description will be given here, with an
emphasis on the specific issues pertinent to phosphorylated ligands.
The main idea of the binding site identification protocol employed for this analysis is to
identify regions near the protein surface where the interaction with the phosphate oxygen
is particularly favorable, as defined by very negative values of interaction energy. To
identify those favorable regions I used SITEHOUND, which has already been successfully employed to
identify binding sites for drug-like ligands. Schematically, the program carries out the
following steps:
1. An interaction energy map generated by EasyMIFs is read in and filtered by retaining
only the points that are below a predefined energy threshold (e).
2. The remaining points are clustered based on their position in Euclidean space with
an agglomerative hierarchical clustering using average linkage.
3. The resulting dendrogram is cut into non-overlapping clusters by applying a distance
cutoff (d).
4. Finally, the clusters are ranked by Total Interaction Energy (TIE), the sum of the
energy values of all the points that belong to the same cluster.
Only two parameters have to be optimized, e and d. To pick a good combination of these two
parameters, a grid search on 25 randomly selected bound structures from the phosphopeptide
dataset was carried out, with e ranging from -9 to -7.5 and d from 6.5 to 8, with incremental
steps of 0.1 for both parameters. The selected values were e = -8.5 kJ/mol and d = 6.5 Å,
since this combination yielded a good compromise between coverage of cases and accuracy
of the prediction. With these values, 80% of the cases had at least one correct prediction in the first three
clusters and the median Matthews Correlation Coefficient (see next section) for these was
0.86.
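The four steps listed above can be illustrated with a minimal sketch. This is not the SITEHOUND implementation: the function names, the in-memory grid format (tuples of x, y, z and energy) and the naive clustering loop are assumptions made for illustration only. Because average linkage produces monotonically increasing merge distances, stopping the merges once the smallest average inter-cluster distance exceeds d is equivalent to cutting the dendrogram at d.

```python
import math
from itertools import combinations


def average_linkage_clusters(coords, d_cut):
    """Naive average-linkage agglomerative clustering, stopped at distance d_cut."""
    clusters = [[i] for i in range(len(coords))]

    def avg_dist(a, b):
        # Average pairwise Euclidean distance between two groups of points.
        total = sum(math.dist(coords[i], coords[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        (i, j), best = min(
            (((i, j), avg_dist(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best > d_cut:  # assumption: merge only while distance <= d_cut
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return clusters


def rank_sites(grid, e_thr=-8.5, d_cut=6.5):
    """grid: iterable of (x, y, z, energy). Returns clusters sorted by TIE."""
    kept = [p for p in grid if p[3] < e_thr]            # step 1: energy filter
    coords = [p[:3] for p in kept]
    sites = []
    for members in average_linkage_clusters(coords, d_cut):  # steps 2 and 3
        tie = sum(kept[i][3] for i in members)           # step 4: Total Interaction Energy
        sites.append({"tie": tie, "points": [coords[i] for i in members]})
    return sorted(sites, key=lambda s: s["tie"])         # most negative TIE first
```

The naive loop recomputes all inter-cluster distances at every merge, which is adequate for a toy grid but far too slow for real interaction energy maps.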
3.2.2 Dataset Construction
3.2.2.1 Phosphopeptides Dataset
All the crystal structures in the PDB[12] (downloaded on Nov 28th, 2008) were collected
and filtered for the presence of at least one target residue named PTR (phosphotyrosine),
SEP (phosphoserine) or TPO (phosphothreonine). All the residues within 5.0 Å of the target
residues were extracted and their chain identifier was recorded. All the cases where the
chain identifier of the target residue and that of the interacting residues were identical were
discarded, since these represented phosphorylated proteins in isolation and not complexes.
To remove redundancy, the remaining sequences were clustered at 50% sequence identity.
The highest resolution structures were picked out from the resulting clusters. This procedure
yielded a total of 48 different proteins. Five of them (pdb codes 1j4x, 1p22, 1u7f, 2oq1 and
2z8p) contained two phosphorylated residues, and each residue was treated independently in
the analysis (for a total of 53 different binding sites). A corresponding dataset of unbound
proteins was also generated by carrying out a BLAST[3] search (with standard parameters
and an expected value of E-6) of the bound chains on the entire PDB. All the hits were
filtered by excluding cases with sequence identity (computed on the entire query sequence)
or coverage less than 95%. The structures that did not have a TPO, SEP or PTR residue
were retained. Finally, the crystal structures with an empty binding site, with the highest
coverage and the highest resolution (in this order of preference) were retained. This protocol
yielded a total of 29 unbound proteins. Four of these proteins corresponded to structures
bound to double-phosphorylated peptides, and each binding site was treated independently
as for the bound forms (for a total of 33 different binding sites).
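The first filtering step of the bound dataset (finding phosphorylated residues that contact a different chain) can be sketched as follows. This is an illustrative reconstruction, not the script used in this work: the function names are hypothetical, the parser reads only the fixed-column ATOM/HETATM fields of the PDB format, and the 5.0 Å test is applied atom by atom. Redundancy removal and resolution-based selection are not shown.

```python
import math

PHOSPHO_RESIDUES = {"PTR", "SEP", "TPO"}


def parse_atoms(pdb_lines):
    """Yield (chain, residue name, residue number, xyz) from ATOM/HETATM records."""
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], line[17:20], int(line[22:26]), xyz


def interchain_phospho_contacts(pdb_lines, cutoff=5.0):
    """Map each phospho-residue to the residues of OTHER chains within cutoff.

    Phospho-residues with no partner on a different chain are dropped,
    mirroring the exclusion of phosphorylated proteins in isolation.
    """
    atoms = list(parse_atoms(pdb_lines))
    contacts = {}
    for chain, resname, resseq, xyz in atoms:
        if resname not in PHOSPHO_RESIDUES:
            continue
        partners = contacts.setdefault((chain, resname, resseq), set())
        for chain2, resname2, resseq2, xyz2 in atoms:
            if chain2 != chain and math.dist(xyz, xyz2) <= cutoff:
                partners.add((chain2, resname2, resseq2))
    return {key: found for key, found in contacts.items() if found}
```

The all-against-all distance loop is quadratic in the number of atoms; a spatial grid or k-d tree would be preferable when scanning the whole PDB.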
3.2.2.2 ATP Dataset
To build a diverse dataset of ATP binding proteins I resorted to the sc-PDB database[73],
a collection of biologically relevant protein-small molecule complexes. All the proteins in
complex with ATP whose binding site was made of a single chain were selected and clustered
at 50% sequence identity to remove redundancy. From each cluster the highest resolution
structures were extracted. This protocol yielded a total of 70 different proteins. To build
a corresponding dataset of unbound structures the same protocol described above for the
phosphopeptides was followed, yielding a total of 33 proteins.
3.2.2.3 Phosphosugars Dataset
The same procedure outlined above for the ATP dataset was followed, yielding a total
of 29 bound and 17 unbound proteins.
3.2.3 Reranking of Putative Sites by Conservation
All the sequences corresponding to the structures in the datasets were extracted from
the PDB files. For each sequence a BLAST search on the nr database (downloaded on Jan
17th 2009) was run and the hits with an E-value ≤ 0.0001 and a coverage ≥ 90% were
retained. A Multiple Sequence Alignment (MSA) of each set of homologs was performed with ClustalW 2.0[78] using the default parameters. Finally, the conservation of each
column in the MSAs was measured by using the Jensen-Shannon divergence score (JSD),
as described in Capra and Singh[22]. The advantage that the JSD provides over simpler
measures of conservation (e.g. the Shannon entropy) is the possibility of incorporating
background information about residue distributions.
The JSD score was computed using the Python program named conservation code,
available at http://compbio.cs.princeton.edu/conservation.
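To make the idea of a background-aware conservation score concrete, the sketch below computes a plain Jensen-Shannon divergence between the amino acid distribution of one alignment column and a background distribution. It is a simplified illustration and not the conservation code program: the function names are hypothetical, gaps are simply ignored, no sequence weighting or window smoothing is applied, and the background distribution is supplied by the caller.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies of one MSA column (gaps and unknowns ignored)."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(column, background):
    """Jensen-Shannon divergence between a column and a background distribution.

    With base-2 logarithms the score lies between 0 (column matches the
    background) and 1.
    """
    p = column_distribution(column)
    m = [(pi + qi) / 2 for pi, qi in zip(p, background)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(background, m)
```

For example, with a uniform background (used here purely as a placeholder), a fully conserved column such as "AAAAAAAA" scores higher than a column containing ten different residues.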
The top 5 sites predicted by SITEHOUND were sorted using the average of their per-residue
conservation scores, from the most conserved to the least conserved site.
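The reranking step can be sketched in a few lines. The data layout is an assumption made for illustration: each predicted site is a list of residue identifiers (already ordered by energy rank) and the conservation scores are stored in a dictionary keyed by residue identifier.

```python
def rerank_by_conservation(sites, conservation, top_n=5):
    """Sort the top_n energy-ranked sites by mean per-residue conservation.

    sites: list of residue-identifier lists, ordered by energy rank.
    conservation: dict mapping residue identifier -> conservation score.
    """
    def mean_score(residues):
        scores = [conservation[r] for r in residues if r in conservation]
        # Sites without any scored residue are placed last (an assumption).
        return sum(scores) / len(scores) if scores else float("-inf")

    # sorted() is stable, so ties keep their original energy-based order.
    return sorted(sites[:top_n], key=mean_score, reverse=True)
```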
3.2.4 Assessment of the Prediction Accuracy
The clusters generated in the binding site identification step are used to identify the
residues that are in contact with them by applying an arbitrarily chosen distance cutoff of
4 Å. The groups of residues that contribute to each cluster make up the predicted binding
sites and can be directly compared with the residues that are in contact with the phospholigands in the complexes (or the corresponding ones in the unbound form). Binding site
identification can therefore be converted into a classification problem, where the task is to
decide whether a given residue is involved in binding or not.
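A minimal sketch of how a cluster of grid points can be turned into a set of predicted residues is given below. The function name and the input layout (atoms as pairs of residue identifier and coordinates) are assumptions for illustration; a residue is counted as predicted when any of its atoms lies within the cutoff of any cluster point.

```python
import math


def residues_near_cluster(cluster_points, atoms, cutoff=4.0):
    """Return the residue identifiers with at least one atom within cutoff.

    cluster_points: iterable of (x, y, z) grid points belonging to one cluster.
    atoms: iterable of (residue_id, (x, y, z)) for all protein atoms.
    """
    points = list(cluster_points)
    return {
        residue_id
        for residue_id, xyz in atoms
        if any(math.dist(xyz, p) <= cutoff for p in points)
    }
```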
In order to assess the quality of the predictions I resorted to the Pearson Correlation Coefficient between the Prediction (P) and the Reference (R). As shown by Baldi et al.[8], the
Pearson Coefficient for a classifier can be conveniently expressed by using the True Positives
(TP), the True Negatives (TN), the False Positives (FP) and the False Negatives (FN) with
the following equation:
\[
\mathrm{MCC}(P, R) = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}} \tag{3.1}
\]
The latter expression is better known as the Matthews Correlation Coefficient (MCC).
As discussed in Baldi et al.[8], the MCC can be directly related to a chi-squared test applied
to the 2x2 contingency matrix containing the TP, TN, FP and FN by using the following
equation:
\[
\chi^2 = N \cdot \mathrm{MCC}^2 \tag{3.2}
\]
where N represents the total number of residues.
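The two expressions above translate directly into code. The sketch below is illustrative: the function names are hypothetical, residues are represented as Python sets, a zero denominator is mapped to an MCC of 0 (a common convention, assumed here), and the p-value uses the closed form of the chi-squared survival function with one degree of freedom.

```python
import math


def confusion(predicted, reference, all_residues):
    """Count TP, TN, FP, FN for residue sets (both subsets of all_residues)."""
    tp = len(predicted & reference)
    fp = len(predicted - reference)
    fn = len(reference - predicted)
    tn = len(all_residues) - tp - fp - fn
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from the 2x2 contingency counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def chi2_and_p_value(mcc_value, n_residues):
    """Chi-squared statistic N * MCC^2 and its p-value (1 degree of freedom)."""
    chi2 = n_residues * mcc_value ** 2
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```

A prediction that recovers half of a ten-residue site with three false positives in a 100-residue protein would, for instance, be scored with `mcc(*confusion(predicted, reference, all_residues))`.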
3.2.5 Electrostatic Potential Calculations
The electrostatic potential calculations have been performed with APBS 1.0[7] with
default parameters.
3.2.6 ROC Curves
Receiver Operating Characteristic Curves[39] were built for the evolutionary-based approach, the energy-based approach and the reranked energy-based approach by plotting the
True Positive Rate vs. the False Positive Rate. All the residues of all the proteins in each
dataset were pooled together, and divided into binding vs. non-binding residues. For the
energy-based and reranked energy-based approach only the top 5 clusters were considered,
whereas for the evolutionary-based approach the normalized conservation range was divided
into 21 equally spaced intervals. The conservation score was normalized on each protein
individually by subtracting the mean and dividing by the standard deviation.
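The construction of the ROC curve for the evolutionary-based approach can be sketched as follows. Several details are assumptions made for illustration: the function names are hypothetical, the population standard deviation is used for the normalization, 21 equally spaced thresholds spanning the pooled normalized range are used, a residue is predicted as binding when its normalized score is at or above the threshold, and both classes are assumed to be present.

```python
from statistics import mean, pstdev


def zscores(scores):
    """Normalize the scores of one protein to zero mean and unit deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma else 0.0 for s in scores]


def roc_points(proteins, n_thresholds=21):
    """Return (FPR, TPR) pairs for equally spaced conservation thresholds.

    proteins: list of (scores, labels) pairs, one per protein, where labels
    are True for binding residues and False otherwise.
    """
    pooled = [
        (z, label)
        for scores, labels in proteins
        for z, label in zip(zscores(scores), labels)
    ]
    low = min(z for z, _ in pooled)
    high = max(z for z, _ in pooled)
    positives = sum(1 for _, label in pooled if label)
    negatives = len(pooled) - positives
    points = []
    for k in range(n_thresholds):
        threshold = low + (high - low) * (k / (n_thresholds - 1))
        tp = sum(1 for z, label in pooled if label and z >= threshold)
        fp = sum(1 for z, label in pooled if not label and z >= threshold)
        points.append((fp / negatives, tp / positives))
    return points
```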
3.3 Results
3.3.1 Overall Performance on the Whole Datasets
Ligand              MCC ≥ 0.3 - OP            MCC ≥ 0.3 - CMET          M. MCC OP   M. MCC CMET
ATP (b)             62/70                     57/70                     0.70        0.75
ATP (u)             29/33                     25/33                     0.63        0.60
ATP (P, b)          61/70                     52/70                     0.76        0.60
ATP (P, u)          28/33                     22/33                     0.63        0.52
Phosphopept. (b)    43/53 (41/48 proteins)    25/53 (25/48 proteins)    0.82        0.63
Phosphopept. (u)    25/33 (23/29 proteins)    17/33 (16/29 proteins)    0.76        0.60
Phosphosugars (b)   28/29                     24/29                     0.80        0.72
Phosphosugars (u)   14/17                     10/17                     0.75        0.69
Table 3.1: Number of cases with an MCC ≥ 0.3 and median value of MCC for the ATP,
phosphopeptides and phosphosugars datasets. In the case of phosphopeptides the results per
individual binding site and per protein are reported separately (since 5 entries in the bound form
and 4 corresponding entries in the unbound form have two phosphorylated residues in the peptide,
the total number of individual sites is greater than the number of proteins).
The columns contain: the type of ligands, the number of cases with an MCC ≥ 0.3 with OP
(phospho) and CMET (methyl) probes respectively, and the median values of MCC for those cases
(with OP and CMET respectively). The letters b or u refer to bound and unbound datasets
respectively, whereas P indicates that only the residues in contact with the phosphogroup have
been considered as part of the binding site.
Dataset               Conservation (+)          Conservation (-)
ATP (b)               39/70                     39/70
ATP (u)               17/33                     15/33
Phosphopeptides (b)   30/53 (proteins 30/48)    24/53 (proteins 24/48)
Phosphopeptides (u)   22/33 (proteins 21/29)    13/33 (proteins 12/29)
Phosphosugars (b)     22/29                     20/29
Phosphosugars (u)     10/17                     8/17
Table 3.2: Number of cases with an MCC ≥ 0.3 in the first cluster only, with and without the
evolutionary reranking based on conservation

A summary of the results across the three datasets containing phosphopeptides, phosphosugars and ATP is provided in Figures 3.1, 3.2 and 3.3, respectively. The performance of
the binding site identification has been assessed by treating it as a classification problem,
where the objective is to discriminate between the residues that are in contact with the
ligand versus the ones that are not. The well established Matthews Correlation Coefficient
(MCC) was used to quantify the agreement between the predictions and the actual interacting residues derived from the crystal structures. In this way it is possible to use the
same performance measure to compare bound and unbound forms, even in the presence of
conformational changes, without the need for superposition. The MCC can range from -1
to 1, with 1 being a perfect match between the residues predicted to be in contact with the
phospholigand and the ones derived from the crystal structure of the complex. As discussed
in the Methods section, it is also possible to estimate the statistical significance of an MCC
value (useful for borderline cases, where only a subset of residues is correctly predicted).
In general, a value of 0.3 for a protein of at least 80 residues represents the lower limit for
a p-value < 0.05. A value of 0.3 was therefore chosen as the limit for discriminating partially correct
predictions from wrong ones.
Figures 3.1, 3.2 and 3.3 show the distribution of the MCC and the rank of the best
prediction (out of the top 5) before and after evolutionary reranking. On average, the
approach seems to perform better on the phosphosugars and ATP datasets than on the
phosphopeptides dataset, which also shows a more substantial performance deterioration
for the unbound forms. I identified a potential explanation for this behavior by computing
the ratio between the average interaction energy on a 5 Å shell in the binding site and the
average interaction energy on a 5 Å shell surrounding the entire protein surface. The results
are illustrated with density plots in Figure 3.4. As expected, both phosphosugars and ATP
binding sites showed larger values, and therefore a stronger signal than the one deriving
from the phosphopeptides binding sites. This observation can also help explain the more
robust performance on the unbound forms of phosphosugars and ATP compared to the
phosphopeptides, since a stronger signal in the bound form has a better chance to remain
detectable even in the absence of the conformational changes experienced by the proteins
upon ligand binding.

Figure 3.1: Matthews Correlation Coefficient distribution for the phosphopeptides dataset in bound
and unbound forms. The stacked bars show the rank of the best prediction (from 1st to 5th)
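The energy ratio used in this section to compare signal strength across datasets can be sketched as follows. The interpretation of the shells is an assumption made for illustration: a grid point belongs to a shell when it lies within 5 Å of any atom of the given atom set (binding site atoms in one case, all protein atoms in the other). The function names and the grid layout (tuples of x, y, z and energy) are hypothetical.

```python
import math


def mean_energy_in_shell(grid, atoms, shell=5.0):
    """Average energy of the grid points within `shell` of any atom in `atoms`."""
    values = [
        energy
        for x, y, z, energy in grid
        if any(math.dist((x, y, z), atom) <= shell for atom in atoms)
    ]
    return sum(values) / len(values)


def site_to_surface_ratio(grid, site_atoms, protein_atoms, shell=5.0):
    """Ratio between the mean energy around the site and around the whole protein."""
    site_mean = mean_energy_in_shell(grid, site_atoms, shell)
    protein_mean = mean_energy_in_shell(grid, protein_atoms, shell)
    return site_mean / protein_mean
```

Since favorable interaction energies are negative in both shells, a ratio above 1 indicates that the binding site is, on average, more favorable than the protein surface as a whole.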
3.3.2 Evolutionary reranking of the putative sites
Another observation that can be made by inspecting the rank distributions of Figures
3.1, 3.2 and 3.3 is that the evolutionary reranking improves the overall performance by
shifting the rank of the best prediction near the top. In other words, the confidence in the
top predictions is increased when evolutionary reranking is applied.

Figure 3.2: Matthews Correlation Coefficient distribution for the phosphosugars dataset in bound
and unbound forms. The stacked bars show the rank of the best prediction (from 1st to 5th)

On the other hand, it is not surprising that evolutionary information by itself cannot identify the residues that
are specifically involved in binding with the same level of accuracy afforded by the energy-based approach, since residues tend to be conserved both for structural and functional
reasons. In other words, the residues that are specifically involved in binding form a subset
that is smaller than the set of conserved surface residues. Figure 3.5 shows an example
where a structure has been colored according to the conservation score derived from a
set of homologous sequences: the binding site is clearly composed of conserved residues,
but many other conserved residues on the surface do not play any role in binding to the
phosphorylated residue. It can be observed that the large conserved region of Figure 3.5 does
indeed correspond to the protein-protein interface, and the top ranking cluster identifies
the small subset of residues that plays a role in the recognition of the phosphorylated
residue. Furthermore, the evolutionary analysis requires the availability of a sizable number
of homologous sequences. Nonetheless, adding evolutionary information provides a way to
reduce the noise coming from decoy sites with a relatively small computational overhead.
Again, the dataset that benefits the most from the conservation-based reranking is the one
containing phosphopeptides. This finding is consistent with what was observed before, since
the weaker energetic signal is strengthened by the signal coming from conservation.

Figure 3.3: Matthews Correlation Coefficient distribution for the ATP dataset in bound and unbound forms. The stacked bars show the rank of the best prediction (from 1st to 5th)
The same results in the form of Receiver Operating Characteristic (ROC) curves are shown
in Figure 3.6, where the True Positive Rate is plotted against the False Positive Rate
for each approach. As can be seen, the energy-based approach substantially outperforms
the evolutionary-based one in all the datasets, with the exception of the phosphopeptides
unbound, where the performances are comparable. Interestingly, this is also the dataset
that benefits the most from the evolutionary reranking of the energy-based approach, and
the combined approach yields a higher True Positive Rate at the same False Positive Rate
than either the evolutionary or the energy-based approaches used in isolation.

Figure 3.4: Density plot for the ratio between the average interaction energy in the binding site and
the average interaction energy on the whole protein surface (black: phosphopeptides; blue: ATP;
red: phosphosugars)
3.3.3 Role of the Electrostatic Potential
As already pointed out in Joughin et al.[70], the electrostatic potential plays an important role in the interaction between proteins and phospholigands, but in a non-trivial way.
In other words, the binding site does not necessarily correspond to the most positive patch
on the protein surface. An example is illustrated in Figure 3.7, where the structure of a
phosphatase in complex with a phosphorylated peptide colored by electrostatic potential
is shown. The region where the phosphoserine is binding contains both a positive and
a negative part, making a pure electrostatic-based approach quite ineffective in identifying
the correct binding site.
Figure 3.5: Kinase associated phosphatase in complex with phospho-cdk2 (pdb code: 1fq1). The
surface is colored by conservation (dark green: highly conserved, light green: non conserved)
3.3.4 Probe Selectivity Analysis
Dataset               MCC ≥ 0.3 (first cluster only)
ATP (b)               35/70
ATP (u)               13/33
Phosphopeptides (b)   4/53 (proteins 4/48)
Phosphopeptides (u)   5/33 (proteins 5/33)
Phosphosugars (b)     12/29
Phosphosugars (u)     3/17
Table 3.3: Performance with the CMET probe, first cluster only. The number of cases with an
MCC ≥ 0.3 in the first cluster is shown for each dataset. The letters b or u refer to bound and
unbound datasets respectively
Figure 3.6: ROC Curves. True vs. False Positive Rate of three different binding site identification
approaches. Black curve: conservation only; red curve: energy-based approach; green: energy-based
approach reranked by conservation

I assessed the performance of binding site identification performed with a phosphate oxygen probe against a methyl probe, which is usually employed for cases where the dominant
component of the interaction energy comes from Van der Waals interactions[47, 81]. Table 3.1 shows
the results for all datasets. Since ATP contains three phosphogroups, I also considered the
subset of residues that are in contact with the phosphogroups only (thereby excluding the
region of the binding site that binds to the nucleotidic part of the ligand). In this way one
can directly assess whether it is possible to discriminate the part of the binding site that
binds to the phosphogroups vs. the one that binds to the nucleotidic part of the ligand. An
example of combining multiple interaction energy maps is illustrated in Figure 3.8, where
SITEHOUND with both carbon and phosphate oxygen probes has been applied to the same
protein, yielding a more comprehensive picture of the binding site.
Table 3.3 shows the results obtained by using only the topmost cluster. Overall, the results
indicate that the performance with the methyl probe is inferior to the one achieved by using
the phosphate oxygen. In other words, there is an advantage in using the more selective
phosphate oxygen probe when studying proteins that are known to bind to phosphorylated
ligands.
Figure 3.7: CTD-specific phosphatase Scp1 in complex with phosphorylated peptide (2ght). The
binding site is correctly identified by the 2nd and 3rd clusters. As shown by the electrostatic potential
map, the site where the phosphoserine is binding contains a region of negative potential, confirming
that the region binding to the phospholigand is not necessarily the most positive one on the protein
surface.
3.4 Conclusions
I presented a computational approach to identify the portion of a protein binding site
where a specific interaction with the phosphogroup(s) takes place. The procedure was
tested on three independent datasets comprising bound complexes and unbound proteins
involved in the recognition of ATP, phosphopeptides and phosphosugars. The overall performance suggests that by using a specific phosphate probe to compute interaction energy
maps one is able to reliably identify the binding sites in the majority of the cases. The
results also indicate that the approach is relatively insensitive to the small conformational
rearrangements that occur in the unbound forms. An optional step involving the reranking
of the top predicted sites by conservation score further improves the predictions where the
energy-based signal is relatively weak (as in some of the phosphopeptides cases). On the
other hand, conservation alone cannot be used to precisely pinpoint the residues involved
in the specific recognition of the phosphogroup, since they generally form a proper subset
of all the conserved residues in a protein family. Despite the variability in the electrostatic
potential or the amino acid composition of the binding sites, the signal derived from the
interaction energy with a phosphate probe is invariably higher in the binding site as compared to the rest of the protein, and the energy-based approach successfully exploits this
feature.
It is important to mention that this method cannot be used directly to identify a priori
proteins that could be potentially involved in phosphorecognition, but it can suggest mutagenesis experiments to confirm specific binding or guide further computational studies such
as molecular docking. On the other hand, the more challenging problem of binding site
classification (i.e. assigning the possible class of ligands to a binding site) can be considered
as an extension of the problem of binding site identification. The results presented here
(in particular the combination of multiple interaction energy maps shown in Figure 3.8)
indicate that an energy-based approach is an option worth exploring in the quest
for a reliable integrative approach for binding site identification and classification (as will
be briefly discussed in Chapter 4).

Figure 3.8: 1kvk (mevalonate kinase in complex with ATP). The first ranking cluster obtained
with the phosphate oxygen probe correctly identifies the part of the site involved in binding to
the phosphogroup, whereas the third ranking cluster obtained with the methyl probe identifies the
nucleotidic part of the ligand, illustrating the use of multiple probes to characterize heterogeneous
binding sites.
Chapter 4
Beyond Binding Site Identification
4.1 Introduction
The focus of this work so far has been on protein binding site identification by means
of an energy-based approach that exploits the information contained in protein structures.
The basic assumption that justifies such an endeavor is that the binding site is detectable in
the absence of the ligand. As pointed out in Chapter 2, one application of binding site
identification is in the context of designing small molecules capable of targeting the binding
site. In this case the unbound conformation must be reasonably close to the bound one for
such applications to be successful.
While these assumptions hold true in many circumstances, proteins do not necessarily exist in one stable conformation and there are numerous examples of proteins that undergo
substantial conformational changes[42].
This chapter will introduce a methodology that can be applied to infer the bound conformation of a protein starting from an unbound form and assuming an approximate knowledge of
the residues involved in binding and the type of ligand. The underlying assumption is that
it is possible to identify a subset of structures from an ensemble of computationally generated conformers that contain a reasonable approximation of the bound conformation. This
assumption is supported by a growing body of experiments that challenge the traditional
lock and key and induced fit models, as will be discussed in the next section.
4.1.1 Models of Conformational Changes
One well characterized example of conformational change occurring upon binding is
represented by the interaction between antibodies and antigens[131], where the structural
plasticity explains a certain degree of crossreactivity existing between an antibody and several different antigens. Another compelling example is provided by proteins involved in
binding to multiple partners that show a high degree of plasticity at the interface[113]. It is
also possible to find examples of enzymes that undergo conformational changes upon binding to the substrate, with the binding residues displaying larger changes than the catalytic
residues[51].
In all these instances the classical lock and key model (Figure 4.1-A) proposed at the end
of the XIX century by Fischer[40] for enzyme catalysis cannot be invoked to explain the
binding process. An alternative model to explain the conformational changes observed upon
binding was proposed by Koshland [76] in the late fifties, and has since become the textbook concept known as induced fit (Figure 4.1-B). The main idea underlying the induced
fit model is that the match between the protein and the ligand (which could be another
protein as well) takes place after the initial weak binding between the two molecules. In
other words, a two-step mechanism is invoked in which the very presence of the ligand
induces the conformational change that leads to an energetically favorable interaction.
More recently, it has been shown that the induced fit model (despite its thermodynamic
plausibility) is not always compatible with kinetic measurements. As an example, Bosshard
showed how the binding between an antigen and an antibody undergoing a conformational
change necessary to accommodate the ligand would require almost a day to reach equilibrium under the induced fit paradigm[14]. The reason for the long time required to reach
equilibrium can be found in the nature of the initial complex between the antigen and the
antibody. Because of its instability, there is a small chance that the complex will undergo
the induced conformational change required to stabilize it, and this explains why it will
take such a long time to reach the equilibrium.
An alternative model (known as the conformational selection model, Figure 4.1-C) postulates that the best matching conformation or a conformation close to the best matching
one already exists in solution (albeit perhaps underrepresented in the ensemble of conformations visited by a protein) and that the ligand would select it and shift the equilibrium
of the ensemble towards that particular conformer[42]. Kinetic measurements performed
on antigen-antibody interactions [41, 82, 11] provided further support to the conformational selection mechanism, which has been validated on other types of protein-protein
interactions[21, 112, 127] and also in enzyme-ligand interactions[125, 10, 58].
The conformational selection model provides the theoretical foundation that justifies endeavors like the one presented here aiming to identify the bound form of a protein without
explicitly modeling the complex. In other words, the unbound form is used to computationally generate an ensemble of conformations (sampling phase) that is then processed to
select a small set of conformers that contain a structure close to the bound form (selection
phase). The ligand is not explicitly taken into account in the sampling phase, but can play
a role in the selection phase.
Several alternative approaches can be used in the sampling phase to generate the ensemble
of conformers, such as Molecular Dynamics, Monte Carlo simulations, or coarse-grained
models. The Elastic Network Model has been chosen here for its simplicity and the low
computational demands.
4.1.2 The Elastic Network Model
Molecular Dynamics represents one of the most widely used approaches to sample the conformational space of proteins. However, the observation that proteins display in many cases
collective motion[45] suggests that alternative approaches designed to exploit this type of
motion may also work. Normal Mode Analysis (NMA)[23] in particular can be used in
situations where it is possible to express the motion of a protein in terms of some collective
variables. NMA has been successfully applied to model the conformational changes occurring in hexokinase[52], lysozyme[18] and citrate synthase[86] among others, and on a dataset
containing 20 protein structures[114].
The work of Tirion[115] and Bahar[6] showed that it was possible to replace the detailed full
atomistic potential with a coarse-grained representation of the protein and a single-parameter harmonic potential. As will be described in detail in Section 4.2.3, the backbone or
the alpha-trace of the protein is represented as nodes in a network, and an edge between
two nodes is drawn if they are within a predefined distance.

Figure 4.1: Mechanistic models for protein-ligand binding. A) The classic lock and key
model, proposed by Fischer[40] in 1894, assumes a preexisting perfect complementarity between
the molecules and can be invoked only in the absence of conformational changes. B) The alternative induced fit model, proposed by Koshland[76] in 1958 to account for conformational changes,
postulates a structural rearrangement induced by the presence of the ligand occurring after a weak
interaction between the two molecules. C) The conformational selection model, proposed as an
alternative to the induced fit model (that is not always feasible from a kinetic standpoint), postulates that the bound form already exists in solution and is selected by the ligand
The Elastic Network Model (ENM) has been successfully applied to model local fluctuations
and to reproduce the B-factors of proteins[6], but also to model large-scale conformational
changes[129]. Because of its low computational demands, the ENM can also be used as an
effective tool for generating an ensemble of conformations, as discussed here.
4.2
4.2.1

Dataset Construction
The MolMovDB[45] and the Gunasekaran Database[50] were screened for instances of
protein/small molecule complexes where at least one crystal structure for the bound and
one for the unbound form existed. Furthermore, only the complexes where the protein
underwent a hinge-like motion upon binding to a small molecule were considered. This
procedure yielded a total of 11 complexes, reported in Table 4.1.

PDB codes   Description                          Ligand                     gwRMSD (Å)   FIT RMSD (Å)
1ake 4ake   Adenylate Kinase                     Adenosine Pentaphosphate    8.19         3.42
1anf 1omp   Maltodextrin Binding Protein         Maltose                     7.25         1.24
1gky 1ex6   Guanylate Kinase                     GMP                         4.39         1.60
1jg6 1jej   Beta-glucosyltransferase             UDP                         2.75         0.93
1lfg 1lfh   Lactoferrin                          Fe + Carbonate              8.18         4.37
1lst 2lao   Lys, Arg, Ornithine Binding Protein  Lysine                      8.58         1.57
1quk 1oib   Phosphate Binding Protein            PO4                         5.67         0.96
1rpj 1gud   D-Allose Binding Protein             D-Allopyranose              6.17         1.21
1suv 1bp5   Transferrin                          Fe + Carbonate             12.46         2.10
1wdn 1ggg   Glutamine Binding Protein            Glutamine                  10.27         2.19
2dri 1ba2   Ribose Binding Protein               Ribose                     12.49         2.12

Table 4.1: Complexes obtained from MolMovDB [45] and Gunasekaran database [50] and used in
this analysis. The PDB codes column contains the PDB codes for each bound-unbound pair of
proteins. The gwRMSD column refers to the RMSD between bound and unbound form computed
according to [31] (see next section). The FIT RMSD column contains the RMSD after the Elastic
Network fitting procedure.
4.2.2
Root Mean Square Deviation (RMSD) Calculations
The widely used superposition method devised by Kabsch[71] cannot be reliably applied
as is, because of the domain (rigid-body) movements that occur upon ligand binding in the
proteins included in the dataset. To ensure that the cores of the proteins are correctly
superposed, the algorithm presented in Damm and Carlson[31] has been employed. Briefly,
the algorithm carries out an initial superposition of the two proteins using the Kabsch
method, and then iteratively refines the superposition with a Gaussian-weighting scheme.
More specifically, given a reference structure A and another structure B initially superposed
onto A with the Kabsch method, one can compute a vector d of distances for pairs of
equivalent atoms whose entry i is:
\[ d_i = \sqrt{(x_i^A - x_i^B)^2 + (y_i^A - y_i^B)^2 + (z_i^A - z_i^B)^2} \tag{4.1} \]
This distance is used to compute a vector of weights for each atom pair as follows:
\[ w_i = \exp\left(-\frac{(d_i)^2}{c}\right) \tag{4.2} \]
where c is a scaling factor.

Finally, the structures are superposed again, with the relative contribution of each atom
pair i given by equation 4.2. The procedure is iterated until convergence is reached. In
this way the atoms that do not undergo major movements tend to receive a higher weight,
thereby ensuring a superposition of the static core of the proteins.
The algorithm outlined above has been implemented in R[102].
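As an illustration of the weighting step only (equations 4.1 and 4.2), a minimal Python sketch is given below. It is not the R implementation mentioned above: the function names are hypothetical, the value c = 5.0 is an assumed default that is not taken from this work, and the weighted superposition itself (which would be repeated with these weights until convergence) is left out.

```python
import math

def gaussian_weights(coords_a, coords_b, c=5.0):
    """Distances (eq. 4.1) and Gaussian weights (eq. 4.2) for equivalent atoms.

    coords_a, coords_b: sequences of (x, y, z) tuples, already superposed.
    c: scaling factor; 5.0 is an assumed value chosen for illustration.
    """
    distances = [math.dist(a, b) for a, b in zip(coords_a, coords_b)]
    weights = [math.exp(-(d ** 2) / c) for d in distances]
    return distances, weights

def weighted_rmsd(distances, weights):
    """RMSD in which each atom pair contributes according to its weight."""
    total = sum(weights)
    return math.sqrt(sum(w * d ** 2 for w, d in zip(weights, distances)) / total)
```

Atoms that barely move receive weights close to 1, while atoms in a displaced domain receive weights close to 0, so that the weighted RMSD mostly reflects the static core.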
4.2.3
The Anisotropic Elastic Network Model (ANM)
Figure 4.2: Schematic representation of the Elastic Network Model. Only the trace of the protein
is retained and each backbone atom is treated as a node in the network. An edge is drawn between
two nodes (atoms) if the distance between them is within a specified distance.
The ensemble of conformations generated from the unbound form of the proteins has
been obtained with the Anisotropic Elastic Network Model[5, 37]. The unbound protein is
assumed to be in a minimum of potential energy and the interactions between atoms are
modeled with the harmonic approximation, by using a spring constant. Only the backbone
of the protein is retained in the analysis. Two nodes i and j in this coarse-grained representation of the protein are connected if their distance is within a specified cut-off $r_c$ (Figure
4.2). In this way, the adjacency matrix $\Gamma$ that fully describes the network can be defined
as follows:
\[ \Gamma_{ij} = \begin{cases} 1 & \text{if } d_{ij} \le r_c \\ 0 & \text{if } d_{ij} > r_c \end{cases} \tag{4.3} \]
where $d_{ij}$ is the Euclidean distance between two nodes and $r_c$ is a specified distance cut-off.
From the matrix $\Gamma$ one can compute the Hessian of the system, a block matrix defined as:
\[ H_{ij} = -\frac{\gamma\,\Gamma_{ij}}{(R_{ij}^0)^2}
\begin{bmatrix}
X_{ij}X_{ij} & X_{ij}Y_{ij} & X_{ij}Z_{ij} \\
Y_{ij}X_{ij} & Y_{ij}Y_{ij} & Y_{ij}Z_{ij} \\
Z_{ij}X_{ij} & Z_{ij}Y_{ij} & Z_{ij}Z_{ij}
\end{bmatrix} \tag{4.4} \]
\[ H_{ii} = -\sum_{j \neq i} H_{ij} \tag{4.5} \]
where $X_{ij}$, $Y_{ij}$ and $Z_{ij}$ are the components of the distance vector between two nodes i and
j in the x, y, and z direction respectively, $R_{ij}^0$ is the distance between the two nodes in the starting structure, and $\gamma$ represents the spring constant, identical for
all pairs of nodes.
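The construction of the Hessian described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is hypothetical, the default cut-off of 15.0 Å and the unit spring constant are placeholder values not taken from this work, and the matrix is stored as a plain nested list rather than in the form used by any particular ANM package.

```python
import math

def anm_hessian(coords, r_c=15.0, gamma=1.0):
    """Build the 3N x 3N ANM Hessian from node coordinates.

    coords: list of (x, y, z) node positions (e.g. backbone atoms).
    r_c: distance cut-off (assumed default); gamma: spring constant.
    """
    n = len(coords)
    H = [[0.0] * (3 * n) for _ in range(3 * n)]
    for i in range(n):
        for j in range(i + 1, n):
            diff = [coords[j][k] - coords[i][k] for k in range(3)]
            d2 = sum(x * x for x in diff)
            if math.sqrt(d2) > r_c:
                continue  # adjacency entry is 0: no spring between i and j
            for a in range(3):
                for b in range(3):
                    h = -gamma * diff[a] * diff[b] / d2
                    H[3 * i + a][3 * j + b] = h   # off-diagonal block H_ij
                    H[3 * j + a][3 * i + b] = h   # symmetric block H_ji
                    H[3 * i + a][3 * i + b] -= h  # diagonal block H_ii
                    H[3 * j + a][3 * j + b] -= h  # diagonal block H_jj
    return H
```

The diagonal blocks are accumulated as the negative sum of the off-diagonal blocks, which makes the matrix symmetric and invariant to rigid translations of the whole network.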
By performing the spectral decomposition of H one obtains 3N - 6 eigenvectors with corresponding nonzero real eigenvalues, which represent the directions (modes) along which the collective motion of the nodes takes place. The low-frequency modes (that express the highly
collective motions of the nodes) can be picked out by selecting the eigenvectors with the
lowest corresponding eigenvalues.
For this analysis, the top three modes (those with the lowest frequencies) were selected, and a sampling along these modes was
performed. The conformations were generated by computing a displacement along each
of the three modes, ranging from -180 to 180 arbitrary units with a step size of 20 (19 values per mode). This
procedure yielded 19^3 = 6859 conformers for each pair.
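The sampling scheme can be illustrated with the following Python sketch, which enumerates every combination of displacements along the selected modes. The function names are hypothetical, and the coordinates and modes are assumed to be flattened vectors of equal length; how the displaced coordinates are turned back into a protein structure is not covered here.

```python
from itertools import product

def mode_displacements(lo=-180, hi=180, step=20):
    """Displacement values sampled along each mode (arbitrary units)."""
    return list(range(lo, hi + 1, step))

def generate_conformers(unbound, modes, amplitudes):
    """Yield (amplitude combination, displaced coordinates) pairs.

    unbound: flattened coordinate vector of the unbound form.
    modes: list of mode vectors with the same length as `unbound`.
    """
    for combo in product(amplitudes, repeat=len(modes)):
        conf = list(unbound)
        for amplitude, mode in zip(combo, modes):
            for k, component in enumerate(mode):
                conf[k] += amplitude * component
        yield combo, conf
```

With 19 displacement values and three modes the generator produces 19 x 19 x 19 = 6859 conformers, one of which (zero displacement along all modes) is the unbound form itself.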
4.2.3.1
Fitting
In order to obtain a rough estimate of the range and relative contribution of the normal
modes to the protein motion (from the unbound to the bound form), a least-squares analysis
was carried out to estimate the linear combination of the first 10 modes that yields
the best fit. More formally,
\[ M x \approx b - u \tag{4.6} \]
where M is the matrix containing the eigenvectors (arranged in columns), b is the vector
containing the coordinates of the bound form and u the vector with the unbound coordinates; x contains the coefficients of the linear combination of the eigenvectors that yields the best fit in the least-squares sense. In other words,
the vector containing the difference between the bound and the unbound forms is projected
onto the space spanned by the 10 lowest-frequency eigenvectors, obtaining a displacement
for each mode that converts the unbound form to a conformation that is as similar to the
bound form as possible (given the normal modes).
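A minimal Python sketch of this projection is shown below. It relies on two assumptions that are not stated in the text: the modes are orthonormal (as is the case for the eigenvectors of a symmetric Hessian returned by standard eigensolvers), so that the least-squares solution reduces to a dot product of each mode with the difference vector, and the bound and unbound coordinates have already been superposed. The function name is hypothetical.

```python
def fit_modes(modes, unbound, bound):
    """Least-squares fit of bound - unbound using orthonormal modes.

    modes: list of orthonormal mode vectors (flattened coordinates).
    Returns the coefficients x and the fitted coordinates u + M x.
    """
    diff = [b - u for b, u in zip(bound, unbound)]
    # For orthonormal columns, the least-squares solution is x = M^T (b - u).
    x = [sum(m * d for m, d in zip(mode, diff)) for mode in modes]
    fitted = list(unbound)
    for coeff, mode in zip(x, modes):
        for k, component in enumerate(mode):
            fitted[k] += coeff * component
    return x, fitted
```

The part of the difference vector that lies outside the space spanned by the chosen modes cannot be recovered, which is why the RMSD of the fitted conformation is in general greater than zero.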
4.2.4
Side-chain Modeling and MIFs Calculations
The structures generated as described in section 4.2.3 have been processed with SCWRL3[36],
a side-chain modeling software that uses a backbone-dependent rotamer library cast in the
Bayesian framework to account for rarely occurring rotamers. No energy minimization was
performed.
Subsequently, the residues within 5 Å of the ligand in the crystal structure of the reference bound form were extracted and the center of the binding site computed (for each of
the conformers generated with the ENM). Finally, EASYMIFs[46] was used to compute the
MIFs around these centers with the carbon (C), methyl (CMET), nitrogen (N), hydroxyl
oxygen (OA), phospho oxygen (OP) and water oxygen (OW) probes.
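The residue selection and the computation of the binding site center could be sketched as follows. This is an illustration only: the function name and the data layout (a dictionary mapping residue identifiers to atom coordinates) are hypothetical, and the center is assumed to be the unweighted centroid of all atoms of the selected residues, a detail that is not specified above.

```python
import math

def binding_site_center(residues, ligand_atoms, cutoff=5.0):
    """Select residues near the ligand and return them with their centroid.

    residues: dict mapping a residue identifier to a list of (x, y, z) atoms.
    ligand_atoms: list of (x, y, z) ligand atom coordinates.
    cutoff: distance threshold in angstroms.
    """
    selected = [
        res_id for res_id, atoms in residues.items()
        if any(math.dist(a, l) <= cutoff for a in atoms for l in ligand_atoms)
    ]
    coords = [a for res_id in selected for a in residues[res_id]]
    if not coords:
        raise ValueError("no residue within the cutoff of the ligand")
    # Assumed definition of the center: centroid of the selected atoms.
    center = tuple(sum(c[k] for c in coords) / len(coords) for k in range(3))
    return selected, center
```

For the conformers generated with the ENM, the same residue identifiers would be looked up in each conformer and the centroid recomputed from the displaced coordinates.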
4.2.5
Comparing MIFs derived from binding sites
The ensemble of MIFs (6 maps per conformer, one for each probe) was compared against
the MIFs derived from the bound form. In order to deal with some of the problems outlined
in Chapter 1 and circumvent the rotational and translational dependence of many indices
used for comparing maps, a modified version of the algorithm described in Osada et al.[98]
was implemented. The main idea is to derive a multidimensional feature vector from the
probability of observing a given distance between a point and the centroid of the map, where
the probability is a function of the energy of the points (the more favorable the interaction
energy, the higher the chance that a point has of being selected). The size of the maps used
in this analysis allowed an exact enumeration of all the distances between the centroid and
the points, weighted by the probability (energy) of the points. The fingerprint derived in
the aforementioned way is called a centroid shape distribution.
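A possible way to compute such a fingerprint is sketched below in Python. Several details are assumptions made for the sake of illustration and are not taken from this work: the weight of a point is set to the negative of its interaction energy (points with unfavorable, positive energies receive zero weight), the centroid is the unweighted geometric center of the map points, and the distances are collected in a histogram with a fixed bin width. The function name is hypothetical.

```python
import math

def centroid_shape_distribution(points, energies, bin_width=1.0, n_bins=10):
    """Energy-weighted histogram of point-to-centroid distances.

    points: list of (x, y, z) grid points of the MIF map.
    energies: interaction energy of each point (negative = favorable).
    Returns a list of n_bins probabilities that sum to 1.
    """
    weights = [max(-e, 0.0) for e in energies]  # assumed weighting scheme
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no point with a favorable interaction energy")
    n = len(points)
    centroid = tuple(sum(p[k] for p in points) / n for k in range(3))
    hist = [0.0] * n_bins
    for p, w in zip(points, weights):
        # Distances beyond the last bin are accumulated in the last bin.
        b = min(int(math.dist(p, centroid) / bin_width), n_bins - 1)
        hist[b] += w / total
    return hist
```

Because only distances from the centroid enter the histogram, the resulting vector does not change when the map is rotated or translated.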
To quantify the distance between the shape distributions describing the binding site in the
bound form and the ones representing the ensemble of conformations, the Kullback-Leibler
divergence (KLD)[77] has been used:
\[ D_{KL}(P \,\|\, Q) = \sum_i P(i) \log\frac{P(i)}{Q(i)} \tag{4.7} \]
It is worth noting that the KLD is not a metric, since it is
not symmetric. This fact does not represent a problem for the application described here,
since the comparison is always directional: one looks for the top n conformations that
are most similar to the template (the reference bound form in this application). A
schematic representation of the method is given in Figure 4.4.
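The ranking step can be illustrated with a short Python sketch. The choice of taking the template (bound form) as P and each conformer as Q, as well as the small pseudocount added to avoid divisions by zero in empty bins, are assumptions introduced here for illustration; the function names are hypothetical.

```python
import math

def kl_divergence(p, q, pseudocount=1e-9):
    """Kullback-Leibler divergence D_KL(P || Q) between two histograms.

    A small pseudocount (an assumption, not part of eq. 4.7) is added to
    every bin and the histograms are renormalized to avoid empty bins.
    """
    p_s = [x + pseudocount for x in p]
    q_s = [x + pseudocount for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
        for pi, qi in zip(p_s, q_s)
    )

def rank_conformers(template, candidates, top_n=20):
    """Return the names of the top_n candidates closest to the template.

    candidates: dict mapping a conformer name to its shape distribution.
    """
    ranked = sorted(candidates, key=lambda name: kl_divergence(template, candidates[name]))
    return ranked[:top_n]
```

Swapping the two arguments of `kl_divergence` generally gives a different value, which is the asymmetry mentioned above.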
4.3
Results
4.3.1
Normal Mode Fitting
In order to assess the ability of the ENM to generate structures close to the bound conformation, the fitting procedure described in Section 4.2.3.1 has been applied to the 11 pairs
of structures in the dataset. The results (Table 4.1 and Figure 4.3) indicate that the top
2-3 modes are usually able to yield a conformation with an RMSD lower than 3 Å (9 out of
11 cases), with 6 out of 11 cases having an RMSD lower than 2 Å. In one case (1lfg 1lfh, a
lactoferrin), the ENM does not seem to fully capture the motion from the unbound to the
bound form.
As pointed out by Tama et al.[114], when the motion is collective the first mode is adequate
to yield a close fit to the bound form. Indeed, Figure 4.3 clearly illustrates this point, since
the largest drop in RMSD occurs with the topmost mode and, in the case where the fitting
is not successful (1lfg 1lfh), not even 10 modes are sufficient to reduce the RMSD below
3 Å.
Taken together, these results indicate that the ENM applied to this dataset can yield structures that approximately resemble the bound form. Therefore, the bound form identification
procedure described in Section 4.2.5 can be meaningfully applied.
Figure 4.3: Contribution of normal modes to fitting. For all the cases included in the dataset an
ANM fitting was performed, with a number of normal modes ranging from 1 to 10. The resulting RMSD is plotted
against the number of modes employed in the fitting procedure. For most pairs the top three modes
are enough to reach a low RMSD, while in a few cases (e.g. 1lfg 1lfh) more are needed.
4.3.2
Bound Form Identification
For each of the 11 cases in the dataset the top 20 conformers closest to the bound form
in MIFs space have been selected, and the Gaussian-weighted RMSD from the bound form
has been computed. The results are shown in Table 4.2. In most cases it is possible to find
one conformer among the top 20 that is similar to the fitted conformation. The latter represents
a lower bound on the RMSD, being the optimal conformation that can be obtained with a
given set of modes.
Figures 4.5 and 4.6 show two examples of bound form identification, where the lowest
RMSD structures among the top 20 ranking conformers (out of 6859 structures) have been
superimposed on the bound form. Interestingly, not all probes behave identically, and some
probes seem to be better suited to identifying the bound form than others (e.g. the OA
probe for the 1lst 2lao pair).
Figure 4.4: Binding site MIFs comparison. Two MIF maps are shown, one derived from a bound
structure and the other from an unbound structure. The grayscale used to represent the points is
proportional to the interaction energy (darker shades indicate more favorable interaction energies).
The shape distributions derived from the maps as described in Section 4.2.5 are shown in the bottom
plot. The Kullback-Leibler divergence is used to compute the distance between them.
Table 4.2: Overview of the results with the centroid shape function. The initial RMSD and
the RMSDs (in Å) obtained with the best out of the top 20 conformations ranked by similarity
to the bound form using the carbon (C), methyl (CMET), nitrogen (N), hydroxyl oxygen
(OA), phospho oxygen (OP) and water oxygen (OW) probes are reported.

PDB Codes   Initial   Min   MinC   MinCMET   MinN   MinOA   MinOP   MinOW
1ake 4ake    8.2      3.6   3.9    3.9       4.2     5.3    5.3     4.2
1anf 1omp    7.3      1.3   4.0    4.0       4.5     4.5    2.5     2.5
1gky 1ex6    4.4      1.8   4.0    6.7       3.2     3.2    2.7     3.2
1jg6 1jej    2.8      1.0   4.0    3.8       3.6     3.7    3.6     4.2
1lfg 1lfh    8.2      5.1   8.3    5.3       9.2    10.9    5.9     7.4
1lst 2lao    8.6      1.7   2.6    4.0       2.8     1.7    2.8     2.8
1quk 1oib    5.7      1.0   1.1    3.5       2.8     2.3    1.3     2.8
1rpj 1gud    6.2      1.3   1.3    4.5       2.1     2.9    2.0     2.1
1suv 1bp5   12.5      2.3   3.1    3.9       4.2     4.4    4.8     4.1
1wdn 1ggg   10.2      2.3   3.0    2.3       2.3     2.4    2.3     2.3
2dri 1ba2   12.5      2.1   8.2    2.9       2.1     2.6    2.9     2.1
Figure 4.5: Identification of the bound form of a protein. The green structure represents the
bound form of an E. coli phosphate binding protein (PDB code: 1quk[130]). The red structure
superimposed on the left is the corresponding unbound form (PDB code: 1oib[130]) with a backbone
RMSD of 5.7 Å. The blue structure on the right is the 3rd ranking conformation generated from the
unbound form and selected with the shape function approach using a carbon probe (final RMSD:
1.1 Å).
4.4
Discussion
The results presented above provide a proof of concept for the MIFs-based approach
to screen an ensemble of structures and seek conformers that resemble the bound form.
The ENM has been chosen to generate the ensemble of conformations because of its ability
to capture large collective motions and its limited computational demands. However, the
bound form identification approach presented here is by no means restricted to a particular
sampling technique, and alternative methods (such as Monte Carlo or Molecular Dynamics)
could be better suited to deal with other situations. For example, the ENM would not work
well in cases where the motion from the unbound to the bound form is not collective.
Another important point worth mentioning is that the shape distributions used to select the
top 20 structures have been derived directly from the bound forms. This is clearly a
situation that would not occur in real applications, since the very goal of the procedure is to
get information about the possible conformation of the bound form when only the unbound
form is available. The shape distributions would have to be derived from other structures
bound to the same ligand (or perhaps a similar one). It is conceivable that the performance
would deteriorate, but more robust approaches could be implemented to deal with this
issue. For example, multiple structures bound to the same ligand could be analyzed and
clustered, to identify potentially different binding modes. It should be possible to compute
an average shape distribution by including multiple structures whenever available, thereby
increasing the robustness of the approach.
Figure 4.6: Identification of the bound form of a protein. The green structure represents the bound
form of the Salmonella typhimurium lysine-arginine-ornithine binding protein (PDB code: 1lst[96]).
The red structure superimposed on the left is the corresponding unbound form (PDB code: 2lao[96])
with a backbone RMSD of 8.6 Å. The blue structure on the right is the 11th ranking conformation
generated from the unbound form and selected with the shape function approach using the hydroxyl
oxygen (OA) probe (final RMSD: 1.7 Å).
The idea of comparing properties derived from the binding site in a translationally and rotationally invariant fashion could also be applied to the problem of binding site classification.
Binding site classification refers to the possibility of identifying a potential set of ligands
that could bind to a given binding site. Several approaches have been conceived to accomplish this task[56]. The method presented here could be applied to compare an unknown
binding site against a library of well characterized sites, identifying the closest ones in the
MIFs space. Some of the properties of the ligands known to bind to the structures closest
to the query could then be used to infer the type of ligand (and therefore gain functional
information about the unknown site) or to refine a lead compound to target the site.
The advantages of such an approach reside in comparing structurally derived properties
that are directly related to binding, without sequence or structural constraints and in a fast
and computationally inexpensive way.
4.5
Conclusions
The problem of inferring the function of a protein in the context of the complex network of interactions is one of the most crucial challenges faced by Computational Biology
today. Knowing the binding partners of proteins is an essential step to untangle the web
of functional relationships that control cellular processes, and the identification and the
characterization of a protein binding site represent an important step to achieve this goal.
Some of the techniques that have been developed by the bioinformatics community over
the years have been discussed, together with their limitations and applicability range, in
Chapter 1.
This work proposes a framework to perform binding site identification on protein structures
by means of an energy-based approach based on the concept of Molecular Interaction Fields
(MIFs). The approach has been validated on a large set of bound and unbound protein
structures, and a specific application of binding site identification in the context of reverse
virtual screening has also been discussed (Chapter 2). The advantage of using chemically
specific probes to compute the MIFs has been demonstrated by applying the binding site
identification procedure to phosphorylated ligands. Furthermore, an improved version of
the energy-based binding site identification approach that incorporates evolutionary information has been presented, and its advantage in situations where the energy-based signal
is weak has been emphasized (Chapter 3).
As an attempt to move beyond the problem of binding site identification, a methodology
that can be applied to infer the bound conformation of a protein starting from an unbound
form has been introduced, with a preliminary validation on 11 structures (Chapter 4).
Taken together, the results presented in this work indicate that the energy-based approach
with multiple-probe MIFs provides a versatile framework to carry out binding site identification and hint at the possibility of identifying the bound form of structures that undergo
large conformational changes. Furthermore, the problem of predicting the type of ligand
that a binding site can accommodate lies among the future challenges that could benefit
from the methodology described here.
Appendix A
Introduction to Clustering
A.1
Brief overview of clustering in SITEHOUND
The main idea implemented in SITEHOUND is to group the points of the interaction
energy map that have passed the energy filter into clusters and to rank them by Total Interaction Energy (TIE). It is important to understand the options related to the clustering
step in order to use the program effectively. The principles of clustering algorithms and the
relevant parameters used by SITEHOUND are discussed here.
The fundamental goal of a clustering algorithm can be considered as finding a partition
of a set of points, defined in a multidimensional space, according to some optimality criterion (usually, one seeks to minimize intra-cluster distances and maximize inter-cluster
distances). It is worth pointing out that the problem is computationally hard in general (common formulations are NP-hard), because an exhaustive search would have to
evaluate all the possible partitions of the points, a number that grows faster than
exponentially with the number of points. In practice, one can resort to heuristics that make
the problem amenable to computation and yield satisfactory results.
More formally, given
\[ x_1 = \{x_{11}, x_{12}, \ldots, x_{1n}\}, \ldots, x_m = \{x_{m1}, x_{m2}, \ldots, x_{mn}\} \tag{A.1} \]
as a set of m points belonging to an n-dimensional space, we can define the following
two quantities:
\[ D_p(x_1, x_2) \tag{A.2} \]
\[ D_c(R, S) \tag{A.3} \]
that represent the distance between two points $x_1$ and $x_2$ and the distance between
two clusters R and S, respectively. A natural choice for $D_p$ in our problem is the simple
Euclidean distance between the points.
Figure A.1: Effects of linkage on clustering results - a) and b) show the results of average and single
linkage on cyclin-dependent kinase 2 (PDB code 1ke5). Single linkage yields a better coverage of the binding
pocket, which is quite elongated. On the other hand, for human pregnenolone sulfotransferase (PDB code
1q1q) average linkage is the best choice, since it corresponds more closely to the ligand contour.
One of the most widely used heuristics to approach the clustering problem is to proceed
from the bottom to the top by iteratively merging clusters until one cluster containing all
the points is obtained. This is where the $D_c$ quantity plays a role, by defining the distance
between clusters. The name linkage is commonly used to indicate this quantity.
SITEHOUND incorporates two types of linkage, single and average, defined in the following
way:
\[ D_c^{single}(R, S) = \min_{x_1 \in R,\, x_2 \in S} D_p(x_1, x_2) \tag{A.4} \]
\[ D_c^{average}(R, S) = \frac{\sum_{x_1 \in R} \sum_{x_2 \in S} D_p(x_1, x_2)}{|R||S|} \tag{A.5} \]
where the | | notation indicates the cardinality of the set (i.e. the number of points of
the cluster).
An important property shared by these two linkages is that the distance
between the clusters being merged increases monotonically at each step. Therefore, it is possible to cut the
partition at a particular level, obtaining the corresponding clusters. In SITEHOUND this level
is called spatial cutoff. The type of linkage used affects (to some extent) the shape of
the clusters obtained. In general, it can be shown that single linkage tends to yield more
elongated clusters, whereas with average linkage the shape of the clusters is closer to a
sphere. From a practical point of view, using single linkage can be more meaningful with
peptide binding sites or elongated ligands, whereas average linkage performs better with
small chemicals. These effects are illustrated in Figure A.1. In general, it is desirable to
run the calculations with both types of linkage, and compare the results. In some instances,
with average linkage the binding site is split in two regions, whereas single linkage will
tend to show one single site. This information could be valuable in the context of ligand
design, since the two regions that show up with average linkage could both be exploited by
connecting two fragments with a linker.
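The bottom-up procedure with a distance cutoff can be sketched in Python as follows. This is a naive illustration of equations A.4 and A.5, not the SITEHOUND implementation: the function names are hypothetical, and all pairwise cluster distances are recomputed at every step, which is only practical for small sets of points.

```python
import math
from itertools import product

def cluster_distance(R, S, linkage):
    """Single (eq. A.4) or average (eq. A.5) linkage between two clusters."""
    pair_distances = [math.dist(a, b) for a, b in product(R, S)]
    if linkage == "single":
        return min(pair_distances)
    return sum(pair_distances) / len(pair_distances)

def agglomerative(points, cutoff, linkage="average"):
    """Merge the closest clusters until their distance exceeds the cutoff."""
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > 1:
        d, i, j = min(
            (cluster_distance(clusters[i], clusters[j], linkage), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > cutoff:
            break  # cut the hierarchy at the requested level
        merged = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(merged)
    return clusters
```

For five points placed along a line at x = 0, 1, 2, 3 and 10 with a cutoff of 1.5, single linkage chains the first four points into one elongated cluster, whereas average linkage stops at two compact clusters of two points each (plus the isolated point), in line with the behavior described above.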
Appendix B
Focused Docking Setup

B.1
Selection of complexes
Both focused and blind docking experiments were carried out on the same set of complexes obtained from the Astex Diverse Set[53], a published collection of 85 protein-ligand
crystal structures extracted from the Protein Data Bank (PDB) and specifically selected to
evaluate the performance of docking algorithms. All water molecules and heteroatoms (including the ligands) were removed and for the cases that contained identical sets of chains,
only one set was retained.
B.2
Preparation of the proteins and ligands for docking
Gasteiger charges were added to both ligands and proteins, using the programs included
in the AutoDockTools suite (version 1.4.5). At that stage, eight cases that issued warnings
and would have required manual intervention were removed resulting in a final set of 77
complexes. The PDB codes of the selected chains are: 1gkcA, 1gm8, 1hnnA, 1hp0A, 1hq2,
1hvyD, 1hwiA1B, 1hww, 1ia1B, 1ig3, 1j3jA, 1jd0B, 1jjeA, 1jlaA, 1k3u, 1ke5, 1kzk, 1l2sB,
1l7f, 1lpz, 1lrhD, 1m2zA, 1meh, 1mzc, 1n1mA, 1n2jA, 1n2v, 1n46A, 1nav, 1of1B, 1opk,
1oq5, 1owe, 1oyt, 1p2y, 1p62, 1pmn, 1q1gF, 1q41A, 1q4gB, 1r1h, 1r55, 1r58, 1r9o, 1s19,
1s3v, 1sg0B, 1sj0, 1sq5A, 1sqnB, 1t40, 1t46, 1tow, 1tt1A, 1tz8B, 1u1cF, 1uml, 1unlA1D,
1uou, 1v0pA, 1v48, 1v4s, 1vcj, 1w1pB, 1w2gB, 1x8x, 1xm6A, 1xoqB, 1xoz, 1ygc, 1yqy,
1yvf, 1ywr, 1z95, 2bm2B, 2br1, and 2bsm. For each single-chain binding site entry in the
Astex Diverse Set a BLAST9 search was performed against the PDB database selecting
all the entries that had a sequence identity > 95% and a coverage > 95%. Subsequently,
the cases that had mutated residues in the binding site were eliminated from the dataset.
Finally, from the remaining cases only the entries that did not have any ligand in the
binding site were selected. This procedure led to 19 unbound proteins corresponding to a
subset of the 77 complexes described earlier. The PDB codes of the bound-unbound pairs
are: 1hq2 1hka, 1t46 1t45, 1ke5 1hcl, 1v0pA 1ob3A, 1l2sB 2blsA, 1v48 1pbn, 1l7f 1nmaN,
1w1pB 1e15A, 1n1mA 1r9mA, 1yvf 2girA, 1n2v 1pud, 1ywr 2okrA, 1oq5 2cbe, 2br1 1ia8,
1oyt 1vr1H, 2bsm 1uyl, 1q41A 1i09A, 1s3v 1pdb, and 1t40 1xgd. To facilitate the comparison of docking results, the binding site residues in the unbound proteins were superimposed
on the corresponding residues of the bound proteins using the backbone atoms of the residues
that had at least one atom within 6.0 Å of the ligand heavy atoms in the complex. A site
is considered to have been detected if the fraction of overlapping heavy atoms between the
lowest energy pose and the ligand in the complex is at least 0.15.
Appendix C
Publications Resulting From This Thesis
Ghersi D, Sanchez R, EasyMIFs and SiteHound: a toolkit for the identification of
ligand-binding sites in protein structures, Bioinformatics 2009, 25(23): 3185-6
Ghersi D, Sanchez R, Improving accuracy and efficiency of blind protein-ligand docking by focusing on predicted binding sites, Proteins 2009, 74(2): 417-24
Hernandez M, Ghersi D, Sanchez R, SITEHOUND-web: a server for ligand binding
site identification in protein structures, Nucleic Acids Research, 37: W413-16
Ghersi D, Sanchez R, An energy-based computational approach to automatically
identify binding sites for phosphorylated ligands in protein structures, submitted
Bibliography
[1] U. Abele and G.E. Schulz. High-resolution structures of adenylate kinase from yeast
ligated with inhibitor ap5a, showing the pathway of phosphoryl transfer. Prot. Sci.,
4:1262-1271, 1995.
[2] P. Aloy and R. B. Russell. Structural systems biology: modelling protein interactions.
Nat Rev Mol Cell Biol, 7(3):188-97, 2006.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local
alignment search tool. J Mol Biol, 215(3):403-10, 1990.
[4] P. J. Artymiuk, A. R. Poirrette, H. M. Grindley, D. W. Rice, and P. Willett. A
graph-theoretic approach to the identification of three-dimensional patterns of amino
acid side-chains in protein structures. J Mol Biol, 243(2):327-44, 1994.
[5] A. R. Atilgan, S. R. Durell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Bahar.
Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys
J, 80(1):505-15, 2001.
[6] I. Bahar, A. R. Atilgan, and B. Erman. Direct evaluation of thermal fluctuations in
proteins using a single-parameter harmonic potential. Fold Des, 2(3):173-81, 1997.
[7] N. A. Baker, D. Sept, S. Joseph, M. J. Holst, and J. A. McCammon. Electrostatics
of nanosystems: application to microtubules and the ribosome. Proc Natl Acad Sci U
S A, 98(18):10037-41, 2001.
[8] P. Baldi, S. Brunak, Y. Chauvin, C. A. Andersen, and H. Nielsen. Assessing the
accuracy of prediction algorithms for classification: an overview. Bioinformatics,
16(5):412-24, 2000.
[9] M. Barbany, H. Gutierrez-de Teran, F. Sanz, and J. Villa-Freixa. Towards a mip-based alignment and docking in computer-aided drug design. Proteins, 56(3):585-94,
2004.
[10] H. Beach, R. Cole, M. L. Gill, and J. P. Loria. Conservation of mus-ms enzyme
motions in the apo- and substrate-mimicked state. J Am Chem Soc, 127(25):9167-76,
2005.
[11] C. Berger, S. Weber-Bornhauser, J. Eggenberger, J. Hanes, A. Pluckthun, and H. R.
Bosshard. Antigen recognition by conformational selection. FEBS Lett, 450(1-2):149-53, 1999.
[12] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N.
Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res, 28(1):235-42, 2000.
[13] N. Blomberg, R. R. Gabdoulline, M. Nilges, and R. C. Wade. Classification of protein
sequences by homology modeling and quantitative analysis of electrostatic similarity.
Proteins, 37(3):379-87, 1999.
[14] H. R. Bosshard. Molecular recognition by induced fit: how fit is the concept? News
Physiol Sci, 16:171-3, 2001.
[15] R. Brenke, D. Kozakov, G. Y. Chuang, D. Beglov, D. Hall, M. R. Landon, C. Mattos,
and S. Vajda. Fragment-based identification of druggable hot spots of proteins using
fourier domain correlation techniques. Bioinformatics, 25(5):621-7, 2009.
[16] M. M. Brent and R. Marmorstein. Ankyrin for methylated lysines. Nat Struct Mol
Biol, 15(3):221-2, 2008.
[17] B. Brooks and M. Karplus. Harmonic dynamics of proteins: normal modes and fluctuations in bovine pancreatic trypsin inhibitor. Proc Natl Acad Sci U S A, 80(21):6571-5,
1983.
[18] B. Brooks and M. Karplus. Normal modes for specific motions of macromolecules:
application to the hinge-bending mode of lysozyme. Proc Natl Acad Sci U S A,
82(15):4995-9, 1985.
[19] W. M. Brown and D. L. Vander Jagt. Creating artificial binding pocket boundaries to
improve the efficiency of flexible ligand docking. J Chem Inf Comput Sci, 44(4):1412-22, 2004.
[20] Catherine Burt, W. Graham Richards, and Philip Huxley. The application of molecular similarity calculations. J. Comput. Chem., 11(10):1139-1146, 1990.
[21] Y. Cao, R. A. Musah, S. K. Wilcox, D. B. Goodin, and D. E. McRee. Protein
conformer selection by ligand binding observed with crystallography. Protein Sci,
7(1):72-8, 1998.
[22] J. A. Capra and M. Singh. Predicting functionally important residues from sequence
conservation. Bioinformatics, 23(15):1875-82, 2007.
[23] D.A. Case. Normal mode analysis of protein dynamics. Curr Opin Struct Biol, 4:285-90, 1994.
[24] C. N. Cavasotto, J. A. Kovacs, and R. A. Abagyan. Representing receptor flexibility
in ligand docking through relevant normal modes. J Am Chem Soc, 127(26):9632-40,
2005.
[25] P. Comon. Independent component analysis, a new concept? Signal Processing,
36(3):287-314, 1994.
[26] M. L. Connolly. Solvent-accessible surfaces of proteins and nucleic acids. Science,
221(4612):709-13, 2006.
[27] S. D. Copley. Enzymes with extra talents: moonlighting functions and catalytic
promiscuity. Curr Opin Chem Biol, 7(2):265-72, 2003.
[28] Wendy D. Cornell, Piotr Cieplak, Christopher I. Bayly, Ian R. Gould, Kenneth M.
Merz, David M. Ferguson, David C. Spellmeyer, Thomas Fox, James W. Caldwell,
and Peter A. Kollman. A second generation force field for the simulation of proteins,
nucleic acids, and organic molecules. J Am Chem Soc, 117:5179-5197, 1995.
[29] Gabriele Cruciani. Molecular Interaction Fields: Applications in Drug Discovery and
ADME prediction. Wiley-VCH, 2006.
[30] M. Cui, M. Mezei, and R. Osman. Prediction of protein loop structures using a
local move monte carlo approach and a grid-based force field. Protein Eng Des Sel,
21(12):729-35, 2008.
[31] K. L. Damm and H. A. Carlson. Gaussian-weighted rmsd superposition of proteins: a
structural comparison for flexible proteins and predicted protein structures. Biophys
J, 90(12):4558-73, 2006.
[32] S. Das, A. Kokardekar, and C. M. Breneman. Rapid comparison of protein binding
site surfaces with property encoded shape distributions. J Chem Inf Model, 2009.
[33] M. J. de Hoon, S. Imoto, J. Nolan, and S. Miyano. Open source clustering software.
Bioinformatics, 20(9):1453-4, 2004.
[34] F. De Rienzo, R. R. Gabdoulline, M. C. Menziani, and R. C. Wade. Blue copper
proteins: a comparative analysis of their molecular interaction properties. Protein
Sci, 9(8):1439-54, 2000.
[35] W. L. Delano. The pymol molecular graphics system, 2002.
[36] R. L. Dunbrack Jr. and F. E. Cohen. Bayesian statistical analysis of protein side-chain
rotamer preferences. Protein Sci, 6(8):1661-81, 1997.
[37] E. Eyal, L. W. Yang, and I. Bahar. Anisotropic network model: systematic evaluation
and a new web interface. Bioinformatics, 22(21):2619-27, 2006.
82
[38] S. Eyrisch and V. Helms. Transient pockets on protein surfaces involved in
protein-protein interaction. J Med Chem, 50(15):3457-64, 2007.
[39] Tom Fawcett. An introduction to roc analysis. Pattern Recogn. Lett., 27(8):861-874,
2006.
[40] E. Fischer. Einfluss der configuration auf die wirkung der enzyme. Ber Dtsch Chem
Ges, 27:2984-2993, 1894.
[41] J. Foote and C. Milstein. Conformational isomerism and the diversity of antibodies.
Proc Natl Acad Sci U S A, 91(22):10370-4, 1994.
[42] H. Frauenfelder, S. G. Sligar, and P. G. Wolynes. The energy landscapes and motions
of proteins. Science, 254(5038):1598-603, 1991.
[43] T. Frembgen-Kesner and A. H. Elcock. Computational sampling of a cryptic drug
binding site in a protein receptor: explicit solvent molecular dynamics and inhibitor
docking to p38 map kinase. J Mol Biol, 359(1):202-14, 2006.
[44] J. L. Gelpi, S. G. Kalko, X. Barril, J. Cirera, X. de La Cruz, F. J. Luque, and
M. Orozco. Classical molecular interaction potentials: improved setup procedure in
molecular dynamics simulations of proteins. Proteins, 45(4):428-37, 2001.
[45] M. Gerstein and W. Krebs. A database of macromolecular motions. Nucleic Acids
Res, 26(18):4280-90, 1998.
[46] D. Ghersi and R. Sanchez. Easymifs and sitehound: a toolkit for the identification of
ligand-binding sites in protein structures. Bioinformatics, 25(23):3185-6, 2009.
[47] D. Ghersi and R. Sanchez. Improving accuracy and efficiency of blind protein-ligand
docking by focusing on predicted binding sites. Proteins, 74(2):417-24, 2009.
[48] D. Gonzalez-Ruiz and H. Gohlke. Targeting protein-protein interactions with small
molecules: challenges and perspectives for computational binding epitope detection
and ligand finding. Curr Med Chem, 13(22):2607-25, 2006.
[49] P. J. Goodford. A computational procedure for determining energetically favorable
binding sites on biologically important macromolecules. J Med Chem, 28(7):849-57,
1985.
[50] K. Gunasekaran and R. Nussinov. How different are structurally flexible and rigid
binding sites? sequence and structural features discriminating proteins that do and do
not undergo conformational change upon ligand binding. J Mol Biol, 365(1):257-73,
2007.
[51] A. Gutteridge and J. Thornton. Conformational changes observed in enzyme crystal
structures upon substrate binding. Journal of Molecular Biology, 346(1):21-28, 2005.
[52] R. W. Harrison. Variational calculation of the normal modes of a large macromolecule:
methods and some initial results. Biopolymers, 23(12):2943-9, 1984.
[53] M. J. Hartshorn, M. L. Verdonk, G. Chessari, S. C. Brewerton, W. T. M. Mooij, P. N.
Mortenson, and C. W. Murray. Diverse, high-quality test set for the validation of
protein-ligand docking performance. Journal of Medicinal Chemistry, 50(4):726-741,
2007.
[54] H. Hegyi and M. Gerstein. The relationship between protein structure and function: a
comprehensive survey with application to the yeast genome. J Mol Biol, 288(1):147-64,
1999.
[55] M. Hendlich, F. Rippmann, and G. Barnickel. Ligsite: automatic and efficient
detection of potential small molecule-binding sites in proteins. J Mol Graph Model,
15(6):359-63, 389, 1997.
[56] S. Henrich, O. M. Salo-Ahen, B. Huang, F. F. Rippmann, G. Cruciani, and R. C.
Wade. Computational approaches to identifying and characterizing protein binding
sites for ligand design. J Mol Recognit, 2009.
[57] E. J. Henriksen, T. R. Kinnick, M. K. Teachey, M. P. O'Keefe, D. Ring, K. W.
Johnson, and S. D. Harrison. Modulation of muscle insulin resistance by selective
inhibition of gsk-3 in zucker diabetic fatty rats. Am J Physiol Endocrinol Metab,
284(5):E892-900, 2003.
[58] K. A. Henzler-Wildman, V. Thai, M. Lei, M. Ott, M. Wolf-Watz, T. Fenn,
E. Pozharski, M. A. Wilson, G. A. Petsko, M. Karplus, C. G. Hubner, and D. Kern.
Intrinsic motions along an enzymatic reaction trajectory. Nature, 450(7171):838-44,
2007.
[59] M. Hernandez, D. Ghersi, and R. Sanchez. Sitehound-web: a server for ligand
binding site identification in protein structures. Nucleic Acids Res, 37(Web Server
issue):W413-6, 2009.
[60] C. Hetenyi and D. van der Spoel. Efficient docking of peptides to proteins without
prior knowledge of the binding site. Protein Science, 11(7):1729-1737, 2002.
[61] C. Hetenyi and D. van der Spoel. Blind docking of drug-sized compounds to proteins
with up to a thousand residues. Febs Letters, 580(5):1447-1450, 2006.
[62] H. Hishigaki, K. Nakai, T. Ono, A. Tanigami, and T. Takagi. Assessment of prediction
accuracy of protein function from protein-protein interaction data. Yeast, 18(6):523-31,
2001.
[63] B. Huang. Metapocket: a meta approach to improve protein ligand binding site
prediction. OMICS, 13(4):325-30, 2009.
[64] B. Huang and M. Schroeder. Ligsitecsc: predicting ligand binding sites using the
connolly surface and degree of conservation. BMC Struct Biol, 6:19, 2006.
[65] R. Huey, G. M. Morris, A. J. Olson, and D. S. Goodsell. A semiempirical free energy
force field with charge-based desolvation. Journal of Computational Chemistry,
28(6):1145-1152, 2007.
[66] W. Humphrey, A. Dalke, and K. Schulten. VMD - Visual Molecular Dynamics.
Journal of Molecular Graphics, 14:33-38, 1996.
[67] C. J. Jeffery. Moonlighting proteins: old proteins learning new tricks. Trends Genet,
19(8):415-7, 2003.
[68] J. L. Jenkins. In silico target fishing: Predicting biological targets from chemical
structure. Drug Discovery Today: Technologies, 3(4), 2006.
[69] S. Jones and J. M. Thornton. Analysis of protein-protein interaction sites using surface
patches. J Mol Biol, 272(1):121-32, 1997.
[70] B. A. Joughin, B. Tidor, and M. B. Yaffe. A computational method for the analysis
and prediction of protein:phosphopeptide-binding sites. Protein Sci, 14(1):131-9,
2005.
[71] W. Kabsch. A solution for the best rotation to relate two sets of vectors. Acta
Crystallographica Section A, 32(5):922-923, Sep 1976.
[72] M. A. Kastenholz, M. Pastor, G. Cruciani, E. E. Haaksma, and T. Fox. Grid/cpca:
a new computational tool to design selective ligands. J Med Chem, 43(16):3033-44,
2000.
[73] E. Kellenberger, P. Muller, C. Schalon, G. Bret, N. Foata, and D. Rognan. sc-pdb: an
annotated database of druggable binding sites from the protein data bank. J Chem
Inf Model, 46(2):717-27, 2006.
[74] K. Kinoshita and H. Nakamura. Identification of protein biochemical functions by
similarity search using the molecular surface database ef-site. Protein Sci,
12(8):1589-95, 2003.
[75] R. Kolodny, D. Petrey, and B. Honig. Protein structure comparison: implications for
the nature of fold space, and structure and function prediction. Curr Opin Struct
Biol, 16(3):393-8, 2006.
[76] D. E. Koshland. Application of a theory of enzyme specificity to protein synthesis.
Proc Natl Acad Sci U S A, 44(2):98-104, 1958.
[77] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical
Statistics, 22:49-86, 1951.
[78] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan,
H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson,
T. J. Gibson, and D. G. Higgins. Clustal w and clustal x version 2.0. Bioinformatics,
23(21):2947-8, 2007.
[79] R. A. Laskowski. Surfnet: a program for visualizing molecular surfaces, cavities, and
intermolecular interactions. J Mol Graph, 13(5):323-30, 307-8, 1995.
[80] R. A. Laskowski, N. M. Luscombe, M. B. Swindells, and J. M. Thornton. Protein
clefts in molecular recognition and function. Protein Sci, 5(12):2438-52, 1996.
[81] A. T. R. Laurie and R. M. Jackson. Q-sitefinder: an energy-based method for
the prediction of protein-ligand binding sites. Bioinformatics, 21(9):1908-1916, 2005.
[82] L. Leder, C. Berger, S. Bornhauser, H. Wendt, F. Ackermann, I. Jelesarov, and H. R.
Bosshard. Spectroscopic, calorimetric, and kinetic demonstration of conformational
adaptation in peptide-antibody recognition. Biochemistry, 34(50):16509-18, 1995.
[83] D. Lee, O. Redfern, and C. Orengo. Predicting protein function from sequence and
structure. Nat Rev Mol Cell Biol, 8(12):995-1005, 2007.
[84] D. G. Levitt and L. J. Banaszak. Pocket: a computer graphics method for identifying
and displaying protein cavities and their surrounding amino acids. J Mol Graph,
10(4):229-34, 1992.
[85] O. Lichtarge, H. R. Bourne, and F. E. Cohen. An evolutionary trace method defines
binding surfaces common to protein families. J Mol Biol, 257(2):342-58, 1996.
[86] O. Marques and Y. H. Sanejouand. Hinge-bending motion in citrate synthase arising
from normal mode calculations. Proteins, 23(4):557-60, 1995.
[87] C. Mattos and D. Ringe. Locating and characterizing binding sites on proteins. Nat
Biotechnol, 14(5):595-9, 1996.
[88] I. Mayrose, D. Graur, N. Ben-Tal, and T. Pupko. Comparison of site-specific
rate-inference methods for protein sequences: empirical bayesian methods are superior.
Mol Biol Evol, 21(9):1781-91, 2004.
[89] M. Mezei. A new method for mapping macromolecular topography. J Mol Graph
Model, 21(5):463-72, 2003.
[90] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1998.
[91] M. Morita, S. Nakamura, and K. Shimizu. Highly accurate method for ligand-binding
site prediction in unbound state (apo) protein structures. Proteins, 73(2):468-79,
2008.
[92] G. M. Morris, D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew, and
A. J. Olson. Automated docking using a lamarckian genetic algorithm and an
empirical binding free energy function. Journal of Computational Chemistry,
19(14):1639-1662, 1998.
[93] A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia. Scop: a structural
classification of proteins database for the investigation of sequences and structures. J
Mol Biol, 247(4):536-40, 1995.
[94] W. Noble, E. Planel, C. Zehr, V. Olm, J. Meyerson, F. Suleman, K. Gaynor, L. Wang,
J. LaFrancois, B. Feinstein, M. Burns, P. Krishnamurthy, Y. Wen, R. Bhat, J. Lewis,
D. Dickson, and K. Duff. Inhibition of glycogen synthase kinase-3 by lithium
correlates with reduced tauopathy and degeneration in vivo. Proc Natl Acad Sci U S A,
102(19):6990-5, 2005.
[95] M. Novotni and R. Klein. 3d zernike descriptors for content based shape retrieval.
Proceedings of the 8th ACM Symposium on Solid Modeling and Applications, pages
216-225, 2003.
[96] B. H. Oh, J. Pandit, C. H. Kang, K. Nikaido, S. Gokcen, G. F. Ames, and S. H.
Kim. Three-dimensional structures of the periplasmic
lysine/arginine/ornithine-binding protein with and without a ligand. J Biol Chem,
268(15):11348-55, 1993.
[97] C. A. Orengo, A. D. Michie, S. Jones, D. T. Jones, M. B. Swindells, and J. M.
Thornton. Cath - a hierarchic classification of protein domain structures. Structure,
5(8):1093-108, 1997.
[98] R. Osada, T. Funkhouser, B. Chazelle, and D. Dobkin. Shape distributions. ACM
Transactions on Graphics, 21:807-832, 2002.
[99] M. Pastor and G. Cruciani. A novel strategy for improving ligand selectivity in
receptor-based drug design. J Med Chem, 38(23):4637-47, 1995.
[100] N. Paul, E. Kellenberger, G. Bret, P. Muller, and D. Rognan. Recovering the true
targets of specific ligands by virtual screening of the protein data bank.
Proteins-Structure Function and Bioinformatics, 54(4):671-680, 2004.
[101] E. F. Pettersen, T. D. Goddard, C. C. Huang, G. S. Couch, D. M. Greenblatt, E. C.
Meng, and T. E. Ferrin. Ucsf chimera - a visualization system for exploratory research
and analysis. J Comput Chem, 25(13):1605-12, 2004.
[102] R Development Core Team. R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria, 2009.
ISBN 3-900051-07-0.
[103] D. J. Rigden. Understanding the cell in terms of structure and function: insights
from structural genomics. Curr Opin Biotechnol, 17(5):457-64, 2006.
[104] M. Rueda, G. Bottegoni, and R. Abagyan. Consistent improvement of cross-docking
results using binding site ensembles generated with elastic network normal modes. J
Chem Inf Model, 49(3):716-25, 2009.
[105] A. M. Ruvinsky. Role of binding entropy in the refinement of protein-ligand docking
predictions: analysis based on the use of 11 scoring functions. J Comput Chem,
28(8):1364-72, 2007.
[106] L. Sael, D. La, B. Li, R. Rustamov, and D. Kihara. Rapid comparison of properties
on protein surface. Proteins, 73(1):1-10, 2008.
[107] J. R. Schames, R. H. Henchman, J. S. Siegel, C. A. Sotriffer, H. Ni, and J. A.
McCammon. Discovery of a novel binding trench in hiv integrase. J Med Chem,
47(8):1879-81, 2004.
[108] K. Schleinkofer, U. Wiedemann, L. Otte, T. Wang, G. Krause, H. Oschkinat, and
R. C. Wade. Comparative structural and energetic analysis of ww domain-peptide
interactions. J Mol Biol, 344(3):865-81, 2004.
[109] R. P. Sheridan and S. K. Kearsley. Why do we need so many chemical similarity
search methods? Drug Discov Today, 7(17):903-11, 2002.
[110] S. Shima, O. Pilak, S. Vogt, M. Schick, M. S. Stagni, W. Meyer-Klaucke, E. Warkentin,
R. K. Thauer, and U. Ermler. The crystal structure of [fe]-hydrogenase reveals the
geometry of the active site. Science, 321(5888):572-5, 2008.
[111] T. Solmajer and E. L. Mehler. Electrostatic screening in molecular dynamics
simulations. Protein Eng, 4(8):911-7, 1991.
[112] L. Stella, A. M. Caccuri, N. Rosato, M. Nicotra, M. Lo Bello, F. De Matteis,
A. P. Mazzetti, G. Federici, and G. Ricci. Flexibility of helix 2 in the human
glutathione transferase p1-1. time-resolved fluorescence spectroscopy. J Biol Chem,
273(36):23267-73, 1998.
[113] E. J. Sundberg and R. A. Mariuzza. Luxury accommodations: the expanding role of
structural plasticity in protein-protein interactions. Structure, 8(7):R137-42, 2000.
[114] F. Tama and Y. H. Sanejouand. Conformational change of proteins arising from
normal mode calculations. Protein Eng, 14(1):1-6, 2001.
[115] M. M. Tirion. Large amplitude elastic motions in proteins from a single-parameter,
atomic analysis. Phys Rev Lett, 77(9):1905-1908, 1996.
[116] S. Vajda and F. Guarnieri. Characterization of protein-ligand interaction sites using
experimental and computational methods. Curr Opin Drug Discov Devel, 9(3):354-62,
2006.
[117] D. Van Der Spoel, E. Lindahl, B. Hess, G. Groenhof, A. E. Mark, and H. J. Berendsen.
Gromacs: fast, flexible, and free. J Comput Chem, 26(16):1701-18, 2005.
[118] A. C. Wallace, N. Borkakoti, and J. M. Thornton. Tess: a geometric hashing algorithm
for deriving 3d coordinate templates for searching structural databases. application
to enzyme active sites. Protein Sci, 6(11):2308-23, 1997.
[119] R. Wang, X. Fang, Y. Lu, and S. Wang. The pdbbind database: collection of binding
affinities for protein-ligand complexes with known three-dimensional structures. J
Med Chem, 47(12):2977-80, 2004.
[120] Z. Wang, K. S. Smith, M. Murphy, O. Piloto, T. C. Somervaille, and M. L. Cleary.
Glycogen synthase kinase 3 in mll leukaemia maintenance and targeted therapy.
Nature, 455(7217):1205-9, 2008.
[121] M. Weisel, E. Proschak, and G. Schneider. Pocketpicker: analysis of ligand
binding-sites with shape descriptors. Chem Cent J, 1:7, 2007.
[122] T. A. Welch. A technique for high-performance data compression. IEEE Computer,
17(6):8-19, 1984.
[123] S. Wells, S. Menor, B. Hespenheide, and M. F. Thorpe. Constrained geometric
simulation of diffusive motion in proteins. Phys Biol, 2(4):S127-36, 2005.
[124] P. J. Winn, T. L. Religa, J. N. Battey, A. Banerjee, and R. C. Wade. Determinants
of functionality in the ubiquitin conjugating enzyme family. Structure, 12(9):1563-74,
2004.
[125] M. Wolf-Watz, V. Thai, K. Henzler-Wildman, G. Hadjipavlou, E. Z. Eisenmesser,
and D. Kern. Linkage between dynamics and catalysis in a thermophilic-mesophilic
enzyme pair. Nat Struct Mol Biol, 11(10):945-9, 2004.
[126] H. J. Wolfson and I. Rigoutsos. Geometric hashing: an overview. Computational
Science & Engineering, IEEE, 4(4):10-21, 1997.
[127] J. Xu and D. D. Root. Conformational selection during weak binding at the actin
and myosin interface. Biophys J, 79(3):1498-510, 2000.
[128] M. B. Yaffe. Phosphotyrosine-binding domains in signal transduction. Nat Rev Mol
Cell Biol, 3(3):177-86, 2002.
[129] L. Yang, G. Song, and R. L. Jernigan. How well can we understand large-scale protein
motions using normal modes of elastic network models? Biophys J, 93(3):920-9, 2007.
[130] N. Yao, P. S. Ledvina, A. Choudhary, and F. A. Quiocho. Modulation of a salt
link does not affect binding of phosphate to its specific active transport receptor.
Biochemistry, 35(7):2079-85, 1996.
[131] J. Yin, A. E. Beuscher IV, S. E. Andryski, R. C. Stevens, and P. G. Schultz. Structural
plasticity and the evolution of antibody affinity and specificity. Journal of Molecular
Biology, 330(4):651-656, 2003.
[132] Z. Zhang and M. G. Grigorov. Similarity networks of protein binding sites. Proteins,
62(2):470-8, 2006.