
GERF Bulletin of Biosciences 2010, 1(1): 37-40

Review Article

In Silico QSAR Modeling and Drug Development Process

Pooja Mishra¹, Vijay Tripathi¹, Brijesh Singh Yadav²*

¹Center of Bioinformatics, University of Allahabad, Allahabad, India
²Indian Veterinary Research Institute, Izatnagar, Bareilly, U.P., India

This paper focuses on in silico QSAR (quantitative structure-activity relationship) modeling, one of the
well-developed areas of computational chemistry in drug development. Similar molecules with only a slight variation
in their structure can have quite different biological activity. This relationship between molecular structure and
change in biological activity is the central focus of QSAR modeling. QSAR is based on a comparison between some type
of activity and the chemical structure or physicochemical properties of a series of chemicals. It is the process by
which chemical structure is quantitatively correlated with a well-defined process, such as biological activity or
chemical reactivity: Activity = f(physicochemical properties and/or structural properties). A wide variety of
descriptors is available for use in QSAR studies. Descriptors are properties of a molecule expressed in mathematical
form. A subset of these descriptors is potentially useful for predicting ADME/Tox properties and for describing the
disposition of a pharmaceutical compound within an organism. Here ADME is an acronym used in pharmacokinetics and
pharmacology for absorption, distribution, metabolism and excretion. Predicting the ADME properties of a compound is
a very important step in drug design, and it is also useful for predicting the toxicity of particular compounds. This
article also reviews current achievements in the field of QSAR modeling and their impact on modern drug discovery
processes. The applications of QSAR modeling in drug discovery, such as compound selection, virtual library
generation, virtual high-throughput screening, HTS data mining, and in silico ADMET, are discussed. We also present a
quantitative structure-human intestinal absorption relationship obtained by regression analysis with Tsar.
Keywords: Quantitative structure-activity relationship (QSAR), ADME, Descriptors, Clustering, Regression, TSAR

Corresponding Author: brijeshbioinfo@gmail.com

Introduction

A drug is a complex set of molecules. It is not distributed throughout the body in its complex form; after
being absorbed (orally or by another route), it must be broken down into its simplest form, and only then can
it be distributed throughout the body by the blood. After absorption and distribution, the remaining waste
part of the drug is excreted from the body. Historically, drug absorption, distribution, metabolism,
excretion, and toxicity (ADMET) studies in animal models were performed after a lead compound was identified.
Now, pharmaceutical companies are employing higher-throughput, in vitro assays to evaluate the ADMET
characteristics of potential leads at earlier stages of development (Jun Xu, 2002). This is done in order to
eliminate candidates as early as possible, thus avoiding costs that would otherwise have been spent on
chemical synthesis and biological testing. Scientists are developing computational methods to select only
compounds with reasonable ADMET properties for screening. Molecules from these computationally screened
virtual libraries can then be synthesized for high-throughput biological activity screening. As the
predictive ability of ADME/Tox software improves, and as pharmaceutical companies incorporate computational
prediction methods into their R&D programs, the drug discovery process will move from a screening-based to a
knowledge-based paradigm.

In in silico QSAR modeling, feature selection is used to reduce the number of descriptors per compound.
Successful data mining depends on good descriptor selection. If molecules are represented by improper
descriptors, they will not lead to reasonable predictions. Correct descriptor selection relies on
understanding the computational problem that one is trying to solve. Correlation analysis and relevance
analysis approaches can help with this understanding. The criteria for selecting descriptors are as follows:
the selected descriptors should be related to bioactivity (which requires correlation analysis), the selected
descriptors should be informative (they should have diversified value distributions),
the selected descriptors should be independent of each other (if two descriptors are correlated with each
other, the related property will be unfairly weighted), and the selected descriptors should be simple to
extract, easy to explain to a chemist, invariant to irrelevant transformations, insensitive to noise, and
efficient at discriminating patterns in different categories (specificity). After comparing performance and
predictability in high throughput data mining, researchers from multiple groups have consistently.

Computational Methods for QSAR Modeling

Genetic Algorithm (GA)
A genetic algorithm is an optimization algorithm used to find exact or approximate solutions to optimization
and search problems. GAs are categorized as global search heuristics. General applications of GAs include
topology optimization, genetic training algorithms, and control parameter optimization. Genetic methods
represent a powerful class of computational methodologies, since the GA framework covers an unlimited number
of possible algorithms that can be used to examine combinatorial problems. This means that the problem should
dictate the exact form of the algorithm, such as the coding scheme of putative solutions (Niculescu, 2003).
No single form of the algorithm is likely to be best suited to every particular problem.

Artificial neural network (ANN)
An ANN, often just called a "neural network" (NN), is an interconnected group of artificial neurons that uses
a mathematical or computational model. NNs are applied in business through several different approaches as
well as in QSAR modeling. Examples include turnkey applications and NN development tools (commercial packages
such as NeuroShell and BrainMaker). They are also used extensively in PMH (position-specific iterative
predictions), in visualizing protein structure and computing structural properties, and in tools and tasks
such as the Grail gene finder, sequence analysis, database searching, and pairwise alignment. Variable
selection is a particularly important and challenging problem in the development of ANN models. Why are ANNs
useful here? If there is a deterministic relation between some features of the molecules and the property to
be predicted, then QSAR reduces to a regression problem, i.e. to the determination of that unknown relation.
From a statistical point of view, NNs represent a class of non-parametric adaptive models (Kustrin, 2001). In
this framework, an important issue is evaluating the performance of the models. This is done by separating
the data into two sets: the training set and the testing set. The parameters of the network (i.e. the values
of the synaptic weights) are computed using the training set.

Self organizing map (SOM)
The SOM is a subtype of ANN. It is trained using unsupervised learning to produce low-dimensional
representations of the training samples while preserving the topological properties of the input space. SOMs
have been used in research applications such as automatic speech recognition, clinical voice analysis,
monitoring of the condition of industrial plants and processes, cloud classification from satellite images,
microarray analysis, analysis of electrical signals from the brain, organization of and retrieval from large
document collections, and analysis and visualization of large collections of statistical data. SOMs have also
been applied to studies in the field of QSAR. The fundamental premise of QSAR studies is that structurally
related (similar) compounds will have similar properties. Determining similarity is a complex task, and many
methods exist, such as principal component analysis and hierarchical cluster analysis (Guha, 2004). In QSAR
studies, a SOM can be used to choose a subset of molecular descriptors and to reduce the dimensionality of a
dataset by visualizing it as a graphical lower-dimensional display. It also reduces the amount of data by
representing them with a smaller number of models ordered on a discrete map lattice.

Support vector machine (SVM)
SVMs are a set of related supervised learning methods that can be used for classification and regression:
they can perform binary classification (pattern recognition) and real-valued function approximation
(regression estimation) tasks. SVMs non-linearly map their n-dimensional input space into a high-dimensional
feature space, in which a linear classifier is constructed. A special property of SVMs is that they
simultaneously minimize the empirical classification error and maximize the geometric margin. SVMs have been
applied to address challenging problems in QSAR analysis. The goal of QSAR analysis is to predict the
bioactivity of molecules. Each molecule has many potential descriptors that may be highly correlated with
each other or irrelevant to the target bioactivity (Burbidge, 2001). The bioactivity is known for only a few
molecules. These issues make model validation challenging and overfitting easy. The results of SVMs are
somewhat unstable: small changes in the training and validation data or in the model parameters may produce
rather different sets of nonzero-weight attributes.

Decision Forest
A Decision Forest is a decision support tool built on decision trees, which use a graph or model of decisions
and their possible consequences. Decision Forest models often have a degree of accuracy that cannot be
obtained using a large, single-tree model. Decision Forest models are as easy to create as single-tree
models, they can be applied to regression and classification problems, and the stochastic (randomization)
element in the decision tree forest algorithm makes them highly resistant to overfitting. A Decision Forest
can handle hundreds or thousands of predictor variables. It is a novel pattern recognition method that
combines the results of multiple distinct but comparable decision tree models to reach a consensus
prediction. A decision forest
model was developed using a structurally diverse training data set, and the activity of its compounds was
tested. The model was subsequently validated using a test data set of selected compounds and then applied to
a large data set of compounds as a screen.

Partial least square
Partial least squares projection to latent structures (PLS) is a robust multivariate generalized regression
method that uses projections to summarize multitudes of potentially collinear variables (Waterbeemd, 2008).
Multivariate statistics is a set of statistical tools for analyzing data matrices (e.g., chemical and
biological) using regression and/or pattern recognition techniques. The PLS regression technique is
especially useful in the quite common case where the number of descriptors (independent variables) is
comparable to or greater than the number of compounds (data points) and/or there exist other factors leading
to correlations between variables (Khlebnikov, 2007). Many methodologies have been used in QSAR modeling;
among these, PLS handles collinear input data and places no restriction on the number of variables used. PLS
leads to stable, correct and highly predictive models even for correlated descriptors (Gieleciak, 2007).

Multiple linear Regressions
This is a mathematical technique used in both fundamental and technical analysis. It uses a number of
variables to predict some unknown variable. In statistics, regression analysis examines the relation of a
dependent variable (response variable) to specified independent variables (predictors) (Papa, 2007). It
assumes that the underlying relationship is linear and that any deviation from linearity is normally
distributed (a parametric assumption). It also assumes that the drug properties are real numbers and that
they are independent of each other, so that the effect of one variable does not depend on the other
variables. In many QSAR problems, however, it is desirable to learn relationships that are non-linear.

K-means clustering
K-means clustering is an alternative to the hierarchical method. It is a top-down approach and is useful if
there is prior knowledge about the number of clusters that should be represented in the data. In k-means
clustering, objects are partitioned into a fixed number (k) of clusters such that the clusters are internally
similar but externally dissimilar (Mutihac, 2008). The process is as follows. All initial objects are
randomly assigned to one of k clusters (where k is pre-specified). For example, k-means clustering of
experiments with k=2 partitions the data into two groups. An average expression vector is calculated for each
cluster, and this is used to compute the distances between clusters. The expression vectors for each cluster
are then recalculated. The k-means clustering algorithm uses an interchange (or switching) method to divide n
data points into k groups (clusters), where k is known before clustering. The k-means clustering results
depend on the order of the rows in the input data, the k-means initialization options, and the number of
iterations used for minimizing distance. The k-means approach involves NP-hard problems (combinatorial
explosion).

Principal component analysis
PCA (often computed via SVD, or singular value decomposition) is an exploratory technique used to visually
estimate the number of clusters represented in the data. PCA is a powerful technique for the analysis of QSAR
modeling data when used with other classification techniques such as k-means or SOM (Ivanenkov, 2009).

TSAR
Tsar is a fully integrated quantitative structure-activity relationship (QSAR) package for library design and
lead optimization. Tsar can be used throughout drug discovery, from initial compound selection for primary
screening to reagent selection and creation of focused libraries for lead optimization. Tsar's easy-to-use
chemical spreadsheet interface is equally accessible to medicinal chemists, computational chemists and
project team leaders (Ivanenkov, 2009).

Advantages of Tsar
Tsar accelerates the design and selection of single compounds and libraries for screening, and lets you
improve activity and eliminate undesirable properties in lead optimization.

Typical applications of Tsar:
• Exploring physicochemical properties, 2D or 3D, to understand which promote activity
• Reagent selection by sampling substituent or reagent properties
• Designing combinatorial libraries by focusing on desired product properties
• Similar or diverse subset selection
• Developing predictive models of activity

Conclusion

QSAR model prediction depends on good descriptor selection, because similar molecules with a slight variation
in their structure can have quite different biological activity. This relationship between molecular
structure and change in biological activity is the central focus of QSAR modeling. Correlation analysis and
relevance analysis approaches deal with this task efficiently. The criterion used for QSAR modeling was the
biological-activity relationship within the selected descriptors, which was achieved efficiently with the
descriptors of a set of 64 drugs and their experimentally derived intestinal absorption (%) values, showing
95% correlation. The application of QSAR modeling is at a developmental phase
currently. Improper descriptor selection will not lead to reasonable predictions. Correct descriptor
selection relies on understanding the computational problem that one is trying to solve. Regression analysis
and clustering in the field of QSAR modeling have a crucial impact on modern drug discovery processes. The
applications of QSAR modeling in drug discovery, such as compound selection, virtual library generation,
virtual high-throughput screening, HTS data mining, and in silico ADMET, have made it a center of study.

References

1. Jun Xu (2002). Chemoinformatics and Drug Discovery. Molecules. 7(8):566-600.

2. Niculescu SP (2003). Artificial neural networks and genetic algorithms in QSAR. J. Molecular Structure:
Theochem. 622(1-2):71-83.

3. Kustrin SA et al. (2001). ANN modeling of the penetration across a polydimethylsiloxane membrane from
theoretically derived molecular descriptors. J. Pharmaceutical and Biomedical Analysis. 26(2):241-254.

4. Guha R et al. (2004). Generation of QSAR sets with a self-organizing map. J. Molecular Graphics and
Modelling. 23(1):1-14.

5. Burbidge R et al. (2001). Drug Design by Machine Learning: Support Vector Machine for Pharmaceutical Data
Analysis. Computers and Chemistry. 26(1):5-14.

6. Waterbeemd HVD et al. (2008). Glossary of Terms Used in Computational Drug Design (IUPAC Recommendations
1997). Annual Reports in Medicinal Chemistry. 33:397-409.

7. Khlebnikov AI et al. (2007). Improved Quantitative Structure-Activity Relationship Models to Predict
Antioxidant Activity of Flavonoids in Chemical, Enzymatic, and Cellular Systems. Bioorg. Med. Chem.
15(4):1749-1770.

8. Gieleciak R and Polanski J (2007). Modeling Robust QSAR. 2. Iterative Variable Elimination Schemes for
CoMSA: Application for Modeling Benzoic Acid pKa Values. J. Chem. Inf. Model. 47:547-556.

9. Papa E et al. (2007). Linear QSAR regression models for the prediction of bioconcentration factors by
physicochemical properties and structural theoretical molecular descriptors. Chemosphere. 67(2):351-358.

10. Mutihac L and Mutihac R (2008). Mining in chemometrics. Analytica Chimica Acta. 612(1):1-18.

11. Ivanenkov YA (2009). Computational mapping tools for drug discovery. Drug Discovery Today. 14:767-
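
Illustrative sketch: descriptor selection

The descriptor selection criteria reviewed in this article (descriptors should be related to bioactivity, informative, and independent of each other) can be illustrated with a minimal Python sketch. It is not the procedure used in Tsar or in any of the cited studies. The function names, the two thresholds, the descriptor names and the toy values are all assumptions made up for illustration, and a simple greedy correlation filter stands in for the correlation and relevance analysis mentioned in the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def select_descriptors(descriptors, activity, min_relevance=0.5, max_redundancy=0.9):
    """Greedy filter: keep descriptors related to activity and not redundant.

    descriptors maps a descriptor name to its values (one value per compound).
    Both thresholds are illustrative assumptions, not values from the article.
    """
    # Criterion 1: relation to bioactivity, measured by absolute correlation.
    relevance = {name: abs(pearson(vals, activity)) for name, vals in descriptors.items()}
    ranked = sorted(relevance, key=relevance.get, reverse=True)
    selected = []
    for name in ranked:
        # Criterion 2: constant (uninformative) or weakly related descriptors are dropped.
        if relevance[name] < min_relevance:
            continue
        # Criterion 3: skip descriptors nearly collinear with one already kept.
        if any(abs(pearson(descriptors[name], descriptors[kept])) > max_redundancy
               for kept in selected):
            continue
        selected.append(name)
    return selected

# Made-up toy data for five hypothetical compounds (not taken from the article).
activity = [1.0, 2.0, 3.0, 4.0, 5.0]
descriptors = {
    "logP": [0.9, 2.1, 2.9, 4.2, 5.1],            # closely tracks activity
    "mw": [100.0, 130.0, 140.0, 170.0, 210.0],    # relevant but nearly collinear with logP
    "hbd": [1.0, 3.0, 2.0, 3.0, 4.0],             # relevant and fairly independent
    "tpsa": [30.0, 10.0, 50.0, 20.0, 40.0],       # weakly related to activity
    "n_rings": [2.0, 2.0, 2.0, 2.0, 2.0],         # constant, so uninformative
}
print(select_descriptors(descriptors, activity))  # ['logP', 'hbd']
```

On this toy input the filter keeps logP and hbd, drops mw as redundant with logP, and drops tpsa and n_rings as weakly related or constant. In practice the thresholds would have to be chosen for the data set at hand, and a greedy filter of this kind is only one of several possible selection strategies.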