Review Article

Pooja Mishra1, Vijay Tripathi1, Brijesh Singh Yadav2*

1 Center of Bioinformatics, University of Allahabad, Allahabad, India

2* Indian Veterinary Research Institute, Izatnagar, Bareilly, U.P., India

Abstract

This paper focuses on in silico quantitative structure-activity relationship (QSAR) modeling, one of the well-developed areas of computational chemistry in drug development. Similar molecules with only a slight variation in structure can have quite different biological activity, and this relationship between molecular structure and change in biological activity is the central focus of QSAR modeling. QSAR is the process by which chemical structure, or the physicochemical properties of a series of chemicals, is quantitatively correlated with a well-defined process such as biological activity or chemical reactivity: Activity = f(physicochemical properties and/or structural properties). A wide variety of descriptors is available for QSAR studies; descriptors are properties of a molecule represented in mathematical form. Subsets of these descriptors are potentially useful for predicting ADME/Tox properties. ADME is an acronym used in pharmacokinetics and pharmacology for absorption, distribution, metabolism, and excretion, which together describe the disposition of a pharmaceutical compound within an organism. Predicting the ADME properties of a compound is a very important step in drug design and is also useful for predicting the toxicity of particular compounds. This article also reviews current achievements in QSAR modeling and their impact on modern drug discovery processes. Applications of QSAR modeling in drug discovery, such as compound selection, virtual library generation, virtual high-throughput screening, HTS data mining, and in silico ADMET, are discussed. We also present a quantitative structure-human intestinal absorption relationship obtained by regression analysis in Tsar.

Keywords: Quantitative structure-activity relationship (QSAR), ADME, Descriptors, Clustering, Regression, TSAR

GERF Bulletin of Biosciences 2010, 1(1): 37-40

www.gerfbb.com

Corresponding Author: brijeshbioinfo@gmail.com

A drug is a complex set of molecules. It is not distributed throughout the body in its complex form; after being absorbed (orally or by another route), it must be broken down into its simplest form, and only then can it be distributed throughout the body by the blood. After absorption and distribution, the remaining waste part of the drug is excreted from the body. Historically, drug absorption, distribution, metabolism, excretion, and toxicity (ADMET) studies in animal models were performed after a lead compound was identified. Now, pharmaceutical companies are employing higher-throughput in vitro assays to evaluate the ADMET characteristics of potential leads at earlier stages of development (Jun Xu, 2002). This is done in order to eliminate candidates as early as possible, thus avoiding costs that would otherwise have been expended on chemical synthesis and biological testing. Scientists are developing computational methods to select only compounds with reasonable ADMET properties for screening. Molecules from [...] synthesized for high-throughput biological activity screening. As the predictive ability of ADME/Tox software improves, and as pharmaceutical companies incorporate computational prediction methods into their R&D programs, the drug discovery process will move from a screening-based to a knowledge-based paradigm.

In in silico QSAR modeling, feature selection is used to reduce the number of descriptors per compound. Successful data mining depends on good descriptor selection: if molecules are represented by improper descriptors, they will not lead to reasonable predictions. Correct descriptor selection relies on understanding the computational problem that one is trying to solve. Correlation analysis and relevance analysis approaches can help with this understanding. The criteria used for selecting descriptors should be: the selected descriptors should be bioactivity related (requiring correlation analysis), the selected descriptors should be informative (should have diversified value distributions),

the selected descriptors should be independent of each other (if two descriptors are correlated with each other, the related property will be unfairly biased), the selected descriptors should be simple to extract, easy to explain to a chemist, invariant to irrelevant transformations, insensitive to noise, and efficient at discriminating patterns in different categories (specificity). After comparing performance and predictability in high-throughput data mining, researchers from multiple groups have consistently [...].

Computational Methods for QSAR Modeling

Genetic Algorithm (GA)

A genetic algorithm (GA) is an optimization algorithm used to find exact or approximate solutions to optimization and search problems. GAs are categorized as global search heuristics. General applications of GAs include topology optimization, genetic training algorithms, and control parameter optimization. Genetic methods represent a powerful class of computational methodologies; the term GA covers an essentially unlimited number of possible algorithms that can be used to examine combinatorial problems. This means that the problem should dictate the exact form of the algorithm, such as the coding scheme for putative solutions (Niculescu, 2003). A generic GA is therefore probably not the best suited to a particular problem.

Artificial neural network (ANN)

An artificial neural network (ANN), often just called a "neural network" (NN), is an interconnected group of artificial neurons that uses a mathematical or computational model. NNs have been applied to business problems using several different approaches, and to QSAR modeling. Applications include turnkey applications and NN development tools (commercial packages include NeuroShell and BrainMaker). NNs are also used extensively in PMH (position specific iterative predictions), in visualizing protein structure and computing structural properties, and in the GRAIL gene finder, sequence analysis, database searching, and pairwise alignment. Variable selection in particular is an important and challenging problem in the development of ANN models. Why is an ANN useful for QSAR? If there is a deterministic relation between some features of the molecules and the property that must be predicted, then QSAR reduces to a regression problem, i.e., to the determination of that unknown relation. From a statistical point of view, NNs represent a class of non-parametric adaptive models (Kustrin, 2001). In this framework, an important issue is to evaluate the performance of the models. This is done by separating the data into two sets: the training set and the testing set. The parameters of the network (i.e., the values of the synaptic weights) are computed using the training set.

Self organizing map (SOM)

The self-organizing map (SOM) is a subtype of ANN. It is trained using unsupervised learning to produce low-dimensional representations of the training samples while preserving the topological properties of the input space. SOMs have been used in applications such as automatic speech recognition, clinical voice analysis, monitoring of the condition of industrial plants and processes, cloud classification from satellite images, microarray data, analysis of electrical signals from the brain, organization of and retrieval from large document collections, and analysis and visualization of large collections of statistical data. SOMs have also been applied in QSAR studies. The fundamental premise of QSAR studies is that structurally related (similar) compounds will have similar properties. Determining similarity is a complex task, and many methods exist, such as principal component analysis and hierarchical cluster analysis (Guha, 2004). In a QSAR study, a SOM can be used to choose a subset of molecular descriptors, to reduce the dimensionality of a dataset by visualizing it as a lower-dimensional graphical display, and to reduce the amount of data by representing it with a smaller number of models ordered on a discrete map lattice.

Support vector machine (SVM)

Support vector machines (SVMs) are a set of related supervised learning methods that can be used for classification and regression: they perform binary classification (pattern recognition) and real-valued function approximation (regression estimation) tasks. SVMs non-linearly map their n-dimensional input space into a high-dimensional feature space, in which a linear classifier is constructed. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin. SVMs have been applied to challenging problems in QSAR analysis. The goal of QSAR analysis is to predict the bioactivity of molecules. Each molecule has many potential descriptors that may be highly correlated with each other or irrelevant to the target bioactivity (Burbidge, 2001). The bioactivity is known for only a few molecules. These issues make model validation challenging and overfitting easy. The results of SVMs are somewhat unstable: small changes in the training and validation data or in the model parameters may produce rather different sets of nonzero-weight attributes.

Decision Forest

A Decision Forest is a decision support tool that uses a graph or model of decisions and their possible consequences. Decision Forest models often have a degree of accuracy that cannot be obtained using a large, single-tree model. They are as easy to create as single-tree models, and they can be applied to both regression and classification. The stochastic (randomization) element in the decision tree forest algorithm makes it highly resistant to overfitting, and a Decision Forest can handle hundreds or thousands of predictor variables. It is a novel pattern recognition method that combines the results of multiple distinct but comparable decision tree models to reach a consensus prediction. A decision forest


model was developed using a structurally diverse training data set of compounds whose activity was tested. The model was subsequently validated using a test data set of selected compounds and then applied to a large data set of compounds as a screen.

Partial least squares

Partial least squares projection to latent structures (PLS) is a robust multivariate generalized regression method that uses projections to summarize multitudes of potentially collinear variables (Waterbeemd, 2008). Multivariate statistics is a set of statistical tools for analyzing data matrices (e.g., chemical and biological) using regression and/or pattern recognition techniques. PLS regression is especially useful in the quite common case where the number of descriptors (independent variables) is comparable to or greater than the number of compounds (data points) and/or there exist other factors leading to correlations between variables (Khlebnikov, 2007). Among the many methodologies used in QSAR modeling, PLS tolerates collinear input data and places no restriction on the number of variables used. PLS leads to stable, correct, and highly predictive models even for correlated descriptors (Gieleciak, 2007).

Multiple linear regression

This is a mathematical technique used in both fundamental and technical analysis. It uses a number of known variables to predict an unknown variable. In statistics, regression analysis examines the relation of dependent variables (response variables) to specified independent variables (predictors) (Papa, 2007). It assumes that the underlying relationship is linear and that any deviation from linearity is normally distributed (a parametric assumption). It also assumes that the drug properties are real numbers and that they are independent of each other, so that the effect of one variable does not depend on the other variables. In many QSAR problems, however, it is desirable to learn relationships that are non-linear.

K-means clustering

K-means clustering is an alternative to the hierarchical method. It is a top-down approach and is useful if there is prior knowledge about the number of clusters that should be represented in the data. In k-means clustering, objects are partitioned into a fixed number (k) of clusters, such that the clusters are internally similar but externally dissimilar (Mutihac, 2008). The process involved in k-means clustering is as follows. All initial objects are randomly assigned to one of k clusters (where k is pre-specified). For example, applying k-means clustering to a set of experiments with k=2 partitions the data into two groups. An average expression vector is calculated for each cluster, and this is used to compute the distances between clusters. The expression vectors for each cluster are recalculated. The k-means clustering algorithm uses an interchange (or switching) method to divide n data points into k groups (clusters), where k is known before clustering. The k-means clustering results depend on the order of the rows in the input data, the k-means initialization options, and the number of iterations used for minimizing distance. The k-means approach involves NP-hard problems (combinatorial explosion).

Principal component analysis

Principal component analysis (PCA), which is closely related to and commonly computed by singular value decomposition (SVD), is an exploratory technique used to visually estimate the number of clusters represented in the data. PCA is a powerful technique for the analysis of QSAR modeling data when used with other classification techniques such as k-means or SOM (Ivanenkov, 2009).

TSAR

Tsar is a fully integrated quantitative structure-activity relationship (QSAR) package for library design and lead optimization. Tsar can be used throughout drug discovery, from initial compound selection for primary screening to reagent selection and creation of focused libraries for lead optimization. Tsar's easy-to-use chemical spreadsheet interface is equally accessible to medicinal chemists, computational chemists, and project team leaders (Ivanenkov, 2009).

Advantages of Tsar

• Accelerates design and selection of single compounds and libraries for screening

• Lets you improve activity and eliminate undesirable properties in lead optimization

Typical applications of Tsar:

• Exploring physicochemical properties, 2D or 3D, to understand which promote activity

• Reagent selection by sampling substituent or reagent properties

• Designing combinatorial libraries by focusing on desired product properties

• Similar or diverse subset selection

• Developing predictive models of activity

QSAR model prediction depends on good descriptor selection, because similar molecules with a slight variation in their structure can have quite different biological activity. This relationship between molecular structure and change in biological activity is the central focus of QSAR modeling. Correlation analysis and relevance analysis approaches deal with this task efficiently. The criterion used for QSAR modeling was the biological-activity relationship within the selected descriptors; this was achieved with the descriptors of a set of 64 drugs and their experimentally derived intestinal absorption (%) values, which showed 95% correlation. The application of QSAR modeling is at a developmental phase


currently. After comparing performance and predictability in high-throughput data mining, researchers from multiple groups have consistently [...]. Improper descriptor selection will not lead to reasonable predictions. Correct descriptor selection relies on understanding the computational problem that one is trying to solve. Regression analysis and clustering in the field of QSAR modeling have a crucial impact on modern drug discovery processes. The applications of QSAR modeling in drug discovery, such as compound selection, virtual library generation, virtual high-throughput screening, HTS data mining, and in silico ADMET, have made it a center of study.

References

1. Jun Xu (2002). Chemoinformatics and Drug Discovery. Molecules. 7(8):566-600.

2. [...] genetic algorithms in QSAR. J. Molecular Structure: Theochem. 622(1-2):71-83.

3. [...] penetration across a polydimethylsiloxane membrane from theoretically derived molecular descriptors. J. Pharmaceutical and Biomedical Analysis. 26(2):241-254.

4. [...] with a self-organizing map. J. Molecular Graphics and Modelling. 23(1):1-14.

5. [...] Learning: Support Vector Machine for Pharmaceutical Data Analysis. Computers and Chemistry. 26(1):5-14.

6. [...] Used in Computational Drug Design (IUPAC) Recommendations 1997. Annual Reports in Medicinal Chemistry. 33:397-409.

7. [...] Structure-Activity Relationship Models to Predict Antioxidant Activity of Flavonoids in Chemical, Enzymatic, and Cellular Systems. Bioorg. Med. Chem. 15(4):1749-1770.

8. [...] QSAR. 2. Iterative Variable Elimination Schemes for CoMSA: Application for Modeling Benzoic Acid pKa Values. J. Chem. Inf. Model. 47:547-556.

9. Papa E et al. (2007). Linear QSAR regression models for the prediction of bioconcentration factors by physicochemical properties and structural theoretical molecular descriptors. Chemosphere. 67(2):351-358.

10. Mutihac L and Mutihac R (2008). Mining in chemometrics. Analytica Chimica Acta. 612(1):1-18.

11. Ivanenkov YA (2009). Computational mapping tools for drug discovery. Drug Discovery Today. 14:767-775.
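As a supplementary illustration of the procedure outlined under K-means clustering (random initial assignment of objects to k clusters, a mean vector per cluster, distances to those means, and recalculation of the means), the steps might be sketched as follows. This is a minimal sketch and not the implementation used in Tsar or in any cited work. The function names `kmeans`, `mean_vector`, and `squared_distance`, the `seed` and `max_iter` parameters, and the toy data are all assumptions introduced for illustration. The reassignment rule (each object moves to the cluster with the nearest mean) is the standard Lloyd iteration and is assumed here, since the text does not spell it out.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    """Partition `points` into k clusters; returns (labels, centroids)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: random initial assignment; i % k guarantees that
    # no cluster starts empty.
    labels = [i % k for i in range(len(points))]
    rng.shuffle(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: (re)calculate the mean vector of each cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # a cluster that lost all members keeps its old mean
                centroids[j] = mean_vector(members)
        # Step 3: move each object to the cluster whose mean is nearest.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no object moved: converged
            break
        labels = new_labels
    return labels, centroids


# Toy data (hypothetical): two well-separated groups, clustered with k=2.
data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(data, k=2)
print(labels)
print(centroids)
```

Because the initial assignment is random, which group receives label 0 or 1 depends on the seed. This is one concrete instance of the sensitivity to initialization noted in the text.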

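The relation Activity = f(descriptors), the linear assumption described under Multiple linear regression, and the training/testing split described for model evaluation can be combined in one small worked example. The sketch below fits a linear model by ordinary least squares on a training set and reports the correlation between predicted and observed values on a held-out test set. It is an illustration under stated assumptions, not the Tsar workflow and not the 64-drug intestinal absorption data set: the data are synthetic, and the helper names (`fit_mlr`, `predict`, `pearson_r`, `solve_linear_system`), the three descriptors, and the coefficient values are hypothetical.

```python
import random


def solve_linear_system(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [list(row) + [rhs] for row, rhs in zip(M, v)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors may be collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_mlr(X, y):
    """Least-squares fit of y = b0 + b1*x1 + ...; returns [b0, b1, ...]."""
    A = [[1.0] + list(row) for row in X]  # prepend the intercept column
    p = len(A[0])
    AtA = [[sum(r[i] * r[j] for r in A) for j in range(p)] for i in range(p)]
    Aty = [sum(r[i] * yi for r, yi in zip(A, y)) for i in range(p)]
    return solve_linear_system(AtA, Aty)  # normal equations


def predict(coef, row):
    """Predicted activity for one compound's descriptor values."""
    return coef[0] + sum(b * x for b, x in zip(coef[1:], row))


def pearson_r(a, b):
    """Pearson correlation coefficient between two sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5


# Synthetic data (hypothetical): 64 "compounds", 3 descriptors each.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(64)]
true_coef = [50.0, 8.0, -5.0, 2.0]  # assumed intercept and slopes
y = [predict(true_coef, row) + rng.gauss(0, 1) for row in X]

# Parameters are estimated on the training set only; the test set is
# used to evaluate predictive performance.
train_X, test_X = X[:48], X[48:]
train_y, test_y = y[:48], y[48:]
coef = fit_mlr(train_X, train_y)
pred = [predict(coef, row) for row in test_X]
print("coefficients:", [round(b, 2) for b in coef])
print("test-set r:", round(pearson_r(test_y, pred), 3))
```

Solving the normal equations directly keeps the example dependency-free, but it fails when descriptors are collinear, which is the situation where the text recommends PLS instead.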

