out drugs

Quantitative structure–activity
analysis works to isolate drug
candidates from the vast well of data.


With the extraordinary cost and effort it takes to bring a single drug to market, the pharmaceutical industry is striving for ways to improve the efficiency of the discovery process. Technology is making drug discovery and development more efficient through the use of computational methods, particularly in ligand-based design. With today’s high-throughput methods and increasingly robust algorithms, hundreds of thousands of compounds can be screened for binding to protein targets faster and more accurately than ever.

The quantitative structure–activity relationship (QSAR) is a routine tool in drug discovery that computational scientists use to analyze large sets of candidate drug molecules. Sophisticated descriptors have been developed to characterize the three-dimensional geometry and chemistry of small molecules. In rational drug design, the QSAR can help identify the features of a molecule that control activity, which is critical information for the medicinal chemist. The QSAR can also be used to select the best candidate molecules from large compound libraries, reducing testing time and costs.

Drug discovery
QSAR methods date back to the 1800s, when scientists first correlated alcohol toxicity with hydrophobicity (1). Today’s drug design efforts, however, are laboriously quantitative, incorporating molecular structure description, combinatorial mathematics, statistics, computer simulations, and database analysis. In today’s data-rich environment, QSAR methods enable users to maximize their use of available data. Because they can be applied quickly and easily, the methods are useful as a screening tool, identifying the drug candidates that are likely to be most effective so that more costly experimental or computational work can be focused on them. They also help scientists understand complex, multicomponent problems that often defy study by experiment or simulation.

QSARs identify a mathematical relationship between some property of a molecular system, such as its ability to inhibit a family of enzymes, and a series of “descriptors” representing chemical or geometric characteristics. Typical descriptors include thermodynamic properties (such as enthalpies and entropies), electronic properties (such as dipole moment), and measures of molecular size and shape (such as molecular weight, volume, polar surface area, and number of rotatable bonds). A typical QSAR spreadsheet is composed of rows representing the molecules or compounds in the data set and columns representing descriptor values. The property of interest also occupies a column of the table. Figure 1 shows an example of a QSAR study table.
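As a minimal sketch of that row-and-column layout (not the file format of any particular QSAR package), the first three rows of Figure 1 can be held as one record per compound, with the activity and the descriptor values as fields. The names `Compound`, `DESCRIPTORS`, `make_row`, and `column` are illustrative assumptions.

```python
from dataclasses import dataclass

# Descriptor column names as they appear in the Figure 1 study table.
DESCRIPTORS = ("Apol", "Area", "Dipole", "Energy")


@dataclass
class Compound:
    """One row of the study table: a structure, its activity, its descriptors."""
    structure: int      # row index standing in for the drawn structure
    activity: float     # the property of interest
    descriptors: dict   # descriptor name -> value


def make_row(structure, activity, values):
    """Pair the descriptor values with their column names."""
    return Compound(structure, activity, dict(zip(DESCRIPTORS, values)))


# First three rows of Figure 1 (values transcribed from the figure).
study_table = [
    make_row(1, 3.150, (1.06e4, 270.566, 7.139, 133.003)),
    make_row(2, 3.450, (9.55e3, 242.417, 2.056, 100.681)),
    make_row(3, 4.130, (1.17e4, 252.990, 1.037, 103.760)),
]


def column(table, name):
    """Read one column of the table, either "Activity" or a descriptor."""
    if name == "Activity":
        return [row.activity for row in table]
    return [row.descriptors[name] for row in table]


if __name__ == "__main__":
    print(column(study_table, "Dipole"))
```

Keeping the activity in the same record as the descriptors mirrors the spreadsheet, where the property of interest is simply one more column.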
The relationship between structure and activity is derived empirically by analyzing a set of molecules for which values of the property and descriptors are known. A study may involve many descriptors, making derivation of QSARs a complex statistical exercise. But this exercise is easily automated, and its benefits are significant. QSARs identify the key structural and chemical factors that determine the property of interest. They can then be applied to predict that property for new molecules, assisting in the optimized design of materials, drugs, and chemicals.

QSAR tools help explain and predict properties based on statistical correlations. Using these tools, researchers may develop predictive models based on analysis that identifies correlations in their data, or they may apply established models to predict properties. In the example here, the property of interest is the activity of a set of drug molecules, shown in Figure 1 in the column labeled “Activity”. For each molecule, the activity is compiled from experimental data entered by the user or computed using a simulation method. Similarly, each descriptor cell in the table is filled using known data from experiments or databases or by computing the value of the descriptor.

Researchers generate a QSAR by analyzing all of this data to establish an equation that best describes the relationship between the property (activity) and the descriptors. Methods that can be used to establish this correlation include regression techniques, principal-component analyses (PCAs), and genetic algorithms. The QSAR tells you which descriptors are most statistically significant in determining the property, allowing you to focus your studies on the molecular characteristics that those descriptors represent.
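To make the idea of "an equation relating activity to descriptors" concrete, here is a deliberately simplified sketch: it fits a separate one-descriptor least-squares line for each descriptor column of Figure 1 and ranks the descriptors by the fraction of activity variance each explains (r2). This is an assumption-laden illustration, not the multivariate procedure a QSAR package would run, and with only seven compounds the ranking should not be read as a real result. The names `TABLE`, `fit_line`, and `rank_descriptors` are hypothetical.

```python
from statistics import mean

# Activity and descriptor columns transcribed from the Figure 1 study table.
TABLE = {
    "Activity": [3.150, 3.450, 4.130, 3.450, 3.690, 4.010, 4.280],
    "Apol":     [1.06e4, 9.55e3, 1.17e4, 1.17e4, 8.65e3, 1.17e4, 1.17e4],
    "Area":     [270.566, 242.417, 252.990, 257.214, 215.372, 242.563, 251.587],
    "Dipole":   [7.139, 2.056, 1.037, 2.313, 1.028, 2.286, 1.558],
    "Energy":   [133.003, 100.681, 103.760, 109.687, 90.970, 93.813, 100.894],
}


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept.

    Returns (slope, intercept, r2), where r2 is the squared correlation.
    """
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def rank_descriptors(table, target="Activity"):
    """Fit each descriptor against the target; sort by r2, best first."""
    y = table[target]
    fits = {name: fit_line(x, y) for name, x in table.items() if name != target}
    return sorted(fits.items(), key=lambda item: item[1][2], reverse=True)


if __name__ == "__main__":
    for name, (slope, intercept, r2) in rank_descriptors(TABLE):
        print(f"{name:7s} Activity = {slope:.4g} * {name} + {intercept:.4g}  r2 = {r2:.3f}")
```

Fitting one descriptor at a time keeps the sketch dependency-free, but it ignores how descriptors act in combination, which is exactly the limitation that the multivariate methods in the following sections address.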

Statistical methods
QSARs were pioneered in the 1960s by Corwin Hansch and colleagues at Pomona College (www.pomona.edu) and the University of Iowa (www.uiowa.edu), who used multiple linear regression to describe activity as a function of chemical structure (2). However, the method’s limitations included the need for large numbers of compounds to explore structural combinations.
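The compound requirement follows from the form of the model. As a sketch in generic notation (the symbols are assumptions for illustration and are not taken from the cited paper), multiple linear regression writes the activity of compound i as a linear function of p descriptor values and estimates the coefficients by least squares:

```latex
y_i = b_0 + \sum_{j=1}^{p} b_j \, x_{ij} + \varepsilon_i , \qquad i = 1, \dots, n

\hat{\mathbf{b}} = \left( X^{\mathsf{T}} X \right)^{-1} X^{\mathsf{T}} \mathbf{y}
```

Here X is the n-by-(p + 1) matrix of descriptor values with a leading column of ones. The estimate is unique only if the matrix X^T X is invertible, which requires at least n = p + 1 compounds and linearly independent descriptor columns; stable coefficients generally need many more compounds than that.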


Data reduction techniques such as PCA helped overcome this requirement of a high observation-to-parameter ratio (3). PCA reduces the variables that describe biological activity or chemical properties to a smaller number of independent, orthogonal components, and regression is then performed on these principal components. As a result, redundancies are removed and intercorrelation in the data is minimized.

Partial least squares (PLS) goes one step further than PCA by including cross-validation, a technique of leaving out compounds to be predicted by the relationship established by the other compounds. The actual predictive ability of the final model is then evaluated by how well it predicts the unprocessed, unbiased data. Although PCA and PLS can produce highly predictive QSAR models, their main drawback is their limited ability to yield interpretable models.

Stepwise regression methods, such as forward-stepping linear regression, have come into popular use because they can produce models with a reasonable level of interpretability and are easily applied to original descriptor sets. These methods, however, rely on obtaining sufficient response levels from individual variables in isolation. With extremely large data sets, the signal from a single variable does not always stand out from the noise.

Structure   Activity (Y1)   Apol (X1)   Area (X2)   Dipole (X3)   Energy (X4)
1           3.150           1.06E+04    270.566     7.139         133.003
2           3.450           9.55E+03    242.417     2.056         100.681
3           4.130           1.17E+04    252.990     1.037         103.760
4           3.450           1.17E+04    257.214     2.313         109.687
5           3.690           8.65E+03    215.372     1.028          90.970
6           4.010           1.17E+04    242.563     2.286          93.813
7           4.280           1.17E+04    251.587     1.558         100.894

Figure 1. Tabling descriptors. An example of a QSAR study table.

Genetic algorithms
Genetic function approximation (GFA) is part of a powerful class of computational techniques known as genetic algorithms (4, 5). Incorporated into QSAR model development, genetic algorithms help researchers find optimum solutions for combinatorial problems. Genetic algorithms offer a significant advantage in that, unlike the methods described in the preceding section, they consider variables in combination with one another instead of just in isolation. Genetic algorithms also maintain the use of original descriptors without converting the descriptors into principal components, thereby retaining the desired level of interpretability of the final QSAR model.

GFA takes genetic algorithms a step further through automation of the QSAR model search by combining a genetic algorithm with statistical modeling tools, rapidly generating a population of statistically valid structure–activity models rather than a single model.

These algorithms use a “survival of the fittest” strategy to determine if a solution makes it to the next stage. Beginning with a population of randomly constructed QSAR models, GFA rates them by using an error measure that estimates each model’s relative predictiveness. Researchers then “evolve” the population by repeatedly selecting two better-rated models to serve as “parents” and then creating a next-generation or child model by using terms from each of the parent models. This new model replaces the worst-rated model in the population, and as evolution proceeds, the population becomes enriched with models of higher and higher quality. Although one can simply select the best-scoring model from the population, selecting the best models still relies on scientific knowledge and intuition for appropriateness of the features and combinations.

Overall, GFA greatly simplifies identification of the significant variables in statistical analyses. GFA is ideal when a data set contains many more descriptors than samples, when selecting among competing correlated descriptors, or when there may be nonlinear relationships in the data.

In these cases, GFA rapidly points to the most information-rich combinations of features and exposes patterns in the data set that may otherwise remain hidden (see box, “Predicting drug toxicity”, p 32).

Recursive partitioning as implemented in developing QSAR models for drug discovery provides the ability to derive decision-tree-based QSAR models, which can be used to qualitatively predict activities or activity classes in structure–activity relationship analysis or in focused library design.

Recursive partitioning is defined as the division of compound sets into groups of higher and lower response as a function of their descriptors. Recursive partitioning has been used for many years in the credit and insurance world. When someone applies for credit, information “descriptors” about an applicant can be immediately gathered and interpreted, such as gender, income, age, and college education. The decision path determined by a person’s descriptive details and characteristics goes into dictating what his or her percentage rates or premiums will be. This is also true for recursive partitioning in drug discovery and QSAR, but in this case, the decision path provides a way of predicting activity for a given compound.

Recursive partitioning is especially good for large amounts of data that are difficult to sieve into usable divisions of classification. The solution is to partition the data, or divide it into bifurcated decision “trees” or categories. In so doing, you find out what characteristics are unique about the compounds and correlate those qualities with drug activity.

A variation of the recursive partitioning method is multi-Y recursive partitioning, which uses neural networks to screen a library against any number of protein targets (multiple Y, 6). In addition to being able to generate many different solutions, this method offers improved sensitivity when analyzing the importance of variables and higher tolerance for noise and outliers, particularly when evaluating large complex data sets. The advantage of this method is efficiency, allowing for more opportunities to use a single screened data set against, for example, multiple diseases.

The bottom line
Successful applications of QSAR technology to drug discovery research are becoming increasingly commonplace. Computational scientists and experimentalists are adding QSAR methods to their arsenal of discovery tools to complement computational methods.

However, with virtual high-throughput screening comes the challenge of effectively dealing with large, noisy, and often complex data sets. “The industry still struggles to enrich lead portfolios,” says Omoshile Clement, senior product manager of rational and combinatorial drug design at Accelrys (www.accelrys.com). “The challenge remains to improve signal-to-noise ratios while reducing the threat of overtraining from too many variables.” Many approaches loom on the horizon, ready to be adopted, from the incorporation of non-decision-tree variables and the use of artificial intelligence in neural networks (see also Sites and Software, p 23) to the addition of ADME (absorption, distribution, metabolism, and excretion) properties as both variables and descriptors to filter and refine QSAR data sets. Although still in its infancy, ADME shows promise of even further improving and enhancing QSAR methods.

References
(1) Borman, S. Chem. Eng. News 1990, 68, 20–23.
(2) Hansch, C.; et al. J. Amer. Chem. Soc. 1963, 85, 2817–2824.
(3) Sharaf, M. A.; Illman, D. A.; Kowalski, B. R. In Chemometrics; Wiley: New York, 1986; p 179.
(4) Rogers, D.; Hopfinger, A. J. J. Chem. Inf. Comp. Sci. 1994, 34, 854–866.
(5) Rogers, D. In Proc. 7th Intern. Conf. Genetic Algorithms, East Lansing, MI, 1997.
(6) Zupan, J., Gasteiger, J., Eds. Neural Networks in Chemistry and Drug Design, 2nd ed.; Wiley-VCH: Weinheim, 1999.

Nancy Ogihara is a marketing communications specialist for Accelrys (www.accelrys.com). Send your comments or questions about this article to mdd@acs.org or to the Editorial Office address on page 3.

KEY TERMS: automation, high throughput, informatics, medicinal chemistry, modeling, screening

Predicting drug toxicity
The initial process of drug development involves screening candidate molecules for optimal therapeutic index. Screening is greatly facilitated by the use of computational models that circumvent extensive laboratory studies. The following case study illustrates the use of genetic function approximation (GFA) to predict toxicity.

Scientists from a major pharmaceutical company applied GFA, with a range of molecular descriptors and nonlinear functions, to a diverse set of experimental compounds to develop a broadly applicable model for predicting cytotoxicity.

The researchers assayed the viability of human dermal fibroblasts and determined inhibitory concentration 50% (IC50) values (the molar concentration of drug compound required to kill 50% of the fibroblast cells). They then used octanol/water partition coefficient (LogP) values from the Pomona College database and calculated molecular hydrophobicity (ClogP). The researchers calculated other descriptors using Accelrys’s C2.QSAR+ with energy-minimized structures for each molecule.

They tabulated the IC50 values and descriptors for each molecule and performed linear regression of LogP, a stepwise linear regression of the entire descriptor set, a GFA regression with linear operators, and a GFA regression with nonlinear operators using the genetic algorithms module of the Cerius2 program.

The predictive capability of the nonlinear GFA model was significantly better than that of the linear LogP model. With a diverse set of compounds, it is unlikely that a single mechanism defines the toxic effect of all compounds, and the LogP model was insufficient for highly toxic compounds. The GFA model, however, was capable of fitting these data and can therefore be used as a reasonable predictive model for in vitro cytotoxicity.

Figure 2. Nonlinear GFA model (green) versus linear LogP model (red). [Plot not reproduced; horizontal axis labeled “Experimental”, both axes running from 1 to 5.]

What’s in store for QSARs?
Several companies provide computational tools that researchers can use to perform QSAR analysis for drug discovery. These tools incorporate statistical and graphical models of biological activity or properties from molecular structures, which in turn are used to make activity predictions of untested compounds. These tools have stemmed from a number of theoretical approaches developed in recent years to improve the predictive ability of QSAR models. Commercial suppliers include
- Accelrys (www.accelrys.com), whose Cerius2 environment includes C2.GA, C2.QSAR+, C2.CSAR, and C2.NNet; multi-Y recursive partitioning, genetic algorithms, GFA, nonlinear PCA, and PLS.
- Tripos (www.tripos.com), which markets HQSAR and QSAR with CoMFA; and molecular field generation, PCA, PLS regression, and hierarchical clustering.
- Chemical Computing Group (www.chemcomp.com), with its Molecular Operating Environment (MOE), QuaSAR-Binary, and Binary QSAR.