
This article appeared in a journal published by Elsevier.


Author's personal copy


Chemometrics and Intelligent Laboratory Systems 104 (2010) 249–259



Journal homepage: www.elsevier.com/locate/chemolab

QSAR models for tyrosinase inhibitory activity description applying modern statistical classification techniques: A comparative study

Huong Le-Thi-Thu a, Gladys Casas Cardoso b, Gerardo M. Casañola-Martin a,c,d,*, Yovani Marrero-Ponce a,d, Amilkar Puris e, Francisco Torrens d, Antonio Rescigno f, Concepción Abad c

a Unit of Computer-Aided Molecular Biosilico Discovery and Bioinformatic Research (CAMD-BIR Unit), Faculty of Chemistry-Pharmacy, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
b Bioinformatic Laboratory, Informatic Study Center, Faculty of Math, Physical and Computation, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
c Departament de Bioquímica i Biologia Molecular, Universitat de València, E-46100 Burjassot, Spain
d Institut Universitari de Ciència Molecular, Universitat de València, Edifici d'Instituts de Paterna, Poligon la Coma s/n (detras de Canal Nou), P. O. Box 22085, E-46071 (València), Spain
e Artificial Intelligence Laboratory, Informatic Study Center, Faculty of Math, Physical and Computation, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
f Sezione di Chimica Biologica, Dip. Scienze e Tecnologie Biomediche, Università di Cagliari, Cittadella Universitaria, 09042 Monserrato (CA), Italy

Article info

Article history:
Received 5 April 2010
Received in revised form 28 July 2010
Accepted 31 August 2010
Available online 16 September 2010

Keywords:
TOMOCOMD-CARDD Software
Atom-based quadratic indices
Modern statistical methods
Tyrosinase inhibitor
ROC curve
Multiple Comparison Procedures

Abstract

Cluster analysis (CA), Linear and Quadratic Discriminant Analysis (L(Q)DA), Binary Logistic Regression (BLR) and Classification Tree (CT) are applied to two datasets to describe tyrosinase inhibitory activity from molecular structures. The first dataset includes 701 tyrosinase inhibitors (TIs) and is used to discriminate inhibitory from non-inhibitory activity; the second is used to estimate the potency of active compounds. 2D TOMOCOMD-CARDD atom-based quadratic indices are computed as molecular descriptors. CA is used for the rational design of the training (TS) and prediction (PS) sets, but it proves inadequate as a classification technique. On the first dataset, the overall accuracies (Q) are 91.42%, 92.35%, 91.88% and 91.79% for TS, and 91.04%, 92.43%, 88.24% and 89.36% for PS, in the LDA-, QDA-, BLR- and CT-based models, respectively, while the corresponding values obtained on the second dataset are 89.95%, 90.70%, 90.20% and 89.20% for TS and 83.71%, 84.44%, 82.96% and 82.22% for PS. The statistical techniques are compared with respect to the posterior probability generated, accuracy, required assumptions and the form of the predictor variables used. On both datasets, the results depicted by Receiver Operating Characteristic (ROC) curves together with Multiple Comparison Procedures (MCP) show that QDA has, in general, the best behavior as a classification algorithm. The results suggest that a better description of tyrosinase activity can be obtained by applying the statistical techniques presented in this report, which could increase the practicality of in silico data mining for the discovery of novel TIs.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Pigmentation is one of the most obvious phenotypical characteristics in the natural world. Among the pigments, melanin is one of the most widely distributed and is found in bacteria, fungi, plants and animals [1]. The accumulation of an abnormal amount of melanin in specific parts of the skin, as more pigmented patches, can become a psychosocial and cosmetic problem. Epidermal and dermal hyperpigmentation can depend on either an increased number of melanocytes or increased activity of melanogenic enzymes [2]. Tyrosinase (EC 1.14.18.1) is regarded as the key enzyme in the regulation of melanin biosynthesis. TIs have therefore attracted considerable interest in medicinal and cosmetic products, primarily in relation to the treatment of hyperpigmentation [3]. A number of TIs from both natural and synthetic sources that inhibit monophenolase activity, diphenolase activity or both have been identified to date [4], but bleaching compounds have proved fairly ineffective on dermal accumulation of melanin [5]. The discovery of novel TI compounds therefore remains a challenge for the international pharmaceutical industry and cosmetology.

On the other hand, to improve productivity, knowledge base-guided decisions must be incorporated into the discovery and development process. There is a major advantage in using in silico criteria [6], which rely on a virtual world of computer-generated hypotheses that are tested for practicality. Studies of this kind avoid the expensive commitment to actual synthesis and bioassay, which are undertaken only after the initial concepts have been explored with computational Quantitative Structure–Activity Relationship (QSAR) models [7]. In these in silico studies, chemometrics is the most useful tool. It provides a solid basis not only for data analysis but also for modeling in general, and it lies at the intersection of several fields of mathematics, statistics and artificial intelligence. These techniques are used for collecting, elaborating, analyzing and characterizing datasets so that knowledge can be obtained from them.

Using this approach, our research group has obtained various QSAR models that were used effectively in the Virtual Screening (VS) of databases, leading to the identification of candidates that inhibit the tyrosinase enzyme [8–12]. These models used different TOMOCOMD-CARDD descriptors as predictive variables and LDA as the statistical classification technique. LDA was selected among many statistical methods for obtaining classification functions because of its simplicity; it is one of the techniques most used at present in drug discovery and design, and its use has been widely described [13–16]. However, many techniques are now available that have been shown to behave better and to describe biological activity better, and they have not yet been applied to model tyrosinase inhibitory activity.

Taking all this into account, the main goal of this research is to use different statistical modeling techniques, in addition to LDA, to build QSAR models that describe tyrosinase inhibitory activity from molecular structures. These include a non-supervised method (CA), supervised classic multivariate techniques (QDA and BLR) and, finally, a more modern method derived from artificial intelligence (CT). 2D TOMOCOMD-CARDD atom-based quadratic molecular descriptors are used as independent variables. They are numbers containing structural information derived from the topological representation of the molecules. The application of these statistical algorithms is illustrated by comparing the models obtained with each method on two databases: one for modeling tyrosinase inhibitory activity, which includes TIs and inactive compounds, and another for estimating the potency of the active ones, which contains only TI compounds.

* Corresponding author. Tel.: +53 42 211825 (Cuba), +34 963543156 (València). E-mail addresses: gmaikelc@yahoo.es, gmaikelc@gmail.com (G.M. Casañola-Martin).
0169-7439/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2010.08.016
ROC curve theory and a set of statistical procedures are used to compare the performance of the classifiers and provide a suitable criterion for selecting the best model.

2. Dataset and computational methods

2.1. Dataset methodology

2.1.1. Classification dataset

The first dataset comprises a heterogeneous set of 701 active compounds whose tyrosinase inhibitory activity has been reported in 132 different articles. They were chosen to represent not only most of the different structural patterns but also diverse mechanisms of inhibition, and they were isolated and identified from different sources (see Table S1 of Supplementary data). All these aspects increase the possibilities in the process of identification/selection of novel leads. Table S1 of Supplementary data lists the names of these compounds together with their experimental data taken from the literature and the corresponding references. The chosen active compounds are organized by type of inhibition, source of the enzyme and substrate used to measure the activity. The molecular structures of these 701 TIs are given in Table S2 of Supplementary data.

The inactive compounds are 728 drugs with different pharmacological uses that have not been reported in the literature as TIs. These drugs include, for instance, antibiotics, antivirals, sedative/hypnotics, diuretics, anticonvulsants, haemostatics, oral hypoglycemics, antihypertensives, anthelmintics, anticancer and antifungal agents, which also guarantees great structural variability. All chemicals were taken from the Negwer Handbook [17], where their names, synonyms and structural formulas can be found.

2.1.2. Potency dataset

The second database, a subset of the first one, was created by assessing the inhibitory activity in order to obtain models that estimate the potency of TIs. These models are intended to be used hierarchically with the models fitted on the first database, for a more complete description of tyrosinase inhibitory activity. The idea is to estimate the potency of new TIs identified by VS by classifying them into two main classes (strong or moderate-to-weak compounds). The cut-off value of the half-maximal inhibitory concentration (IC50) is 30 µM, but this value is an approximation, not a value that can be extrapolated to a specific assay. This database is very heterogeneous in every sense: the compounds included in it are structurally diverse and were identified under different experimental assay conditions, with dissimilar substrates and different measurement systems, and their activity was reported through varied measurements, such as IC50, percentage of enzymatic inhibition (I%) and inhibition constant (Ki) (for more information, see Table S1 of Supplementary data).

The cut-off value mentioned above is based on the in vitro experimental system that has been used to corroborate our in silico computational studies [8,9,11,12,18], in which the IC50 values of kojic acid and L-mimosine are 16.67 µM and 3.68 µM, respectively. It should be made clear that this value is validated only for that experimental system. For the compounds identified with other systems, we used kojic acid or L-mimosine as the reference compound to establish the potency in each particular case; that is, a compound is considered potent when its activity is comparable with that of the reference compound. In this manner, the database for potency estimation finally comprises 533 active compounds, of which 343 are strong TIs and 190 are moderate-to-weak ones.

2.2. Molecular descriptors calculation

Each structure in the dataset is parameterized by molecular descriptors. Molecular descriptors are terms that characterize a specific aspect of a molecule [19]. In earlier reports, we outlined the features of the calculation of atom-based non-stochastic and stochastic quadratic indices. This method codifies molecular structures by means of mathematical quadratic forms [12,20–23].
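As a rough illustration of what "quadratic form" means here, the sketch below assumes the usual definition from the cited TOMOCOMD-CARDD reports, in which the kth total quadratic index is q_k(x) = x^T M^k x, with M the adjacency matrix of the hydrogen-suppressed molecular pseudograph and x the vector of atomic property values. The function names, the three-atom chain and the weights are hypothetical and are not taken from the paper or from the TOMOCOMD-CARDD software.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def total_quadratic_indices(adjacency, x, k_max):
    """Return [q_0(x), ..., q_k_max(x)], where q_k(x) = x^T M^k x.

    adjacency: square 0/1 (or bond-multiplicity) matrix M of the molecular graph.
    x: one atomic property value per atom (assumed weighting scheme).
    """
    n = len(x)
    # M^0 is the identity matrix, so q_0(x) is simply the sum of x_i^2.
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    indices = []
    for _ in range(k_max + 1):
        q = sum(power[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        indices.append(q)
        power = mat_mul(power, adjacency)  # next power of M
    return indices


# Hypothetical three-atom chain A1-A2-A3 with arbitrary integer weights.
chain = [[0, 1, 0],
         [1, 0, 1],
         [0, 1, 0]]
weights = [1, 2, 3]
print(total_quadratic_indices(chain, weights, 2))
```

Local (atom-type) indices restrict the same sum to a chosen group of atoms, which is why they can be read as fragment-level descriptors.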
These molecular descriptors were generated with the interactive program for molecular design and bioinformatics research TOMOCOMD-CARDD [24]. The main steps for generating atom-based non-stochastic and stochastic quadratic indices with this program can be briefly summarized as follows:

(1) Draw the molecular pseudograph of each molecule in the dataset using the software drawing mode. This is done by selecting the active atom symbols belonging to the different groups of the periodic table.
(2) Use weights to differentiate the atoms of the molecule. In the present report, each atomic nucleus was characterized with the following atomic properties (AP): atomic mass (M), van der Waals atomic volume (V), atomic polarizability (P), Mulliken atomic electronegativity (K) and Pauling electronegativity (G) [25,26].
(3) Compute the total and local (atom and atom-type) quadratic indices in the software calculation mode, where the atomic properties and the descriptor family can be selected before the molecular indices are calculated. The software generates a table in which the rows correspond to the compounds and the columns to the atom-based (both total and local) quadratic descriptor families.

The following descriptors were calculated in this study using the different APs as atom labels:

i) ${}^{AP}q_k(x)$ and ${}^{AP}q_k^{H}(x)$ are the kth total non-stochastic quadratic indices not considering and considering H atoms in the molecule, respectively.
ii) ${}^{AP}q_{kL}(x_E)$ and ${}^{AP}q_{kL}^{H}(x_E)$ are the kth local (group = heteroatoms: S, N, O) non-stochastic quadratic indices not considering and considering H atoms in the molecule, respectively. These local descriptors are putative molecular charge, dipole moment and H-bonding acceptors.
iii) ${}^{AP}q_{kL}^{H}(x_{E\text{-}H})$ are the kth local (group = H atoms bonded to heteroatoms: S, N, O) non-stochastic quadratic indices considering H atoms in the molecule. These local descriptors are putative H-bonding donors (hydrogen-bonding capacity), lipophilicity and so on.
iv) The kth total [${}^{APs}q_k(x)$ and ${}^{APs}q_k^{H}(x)$] and atom-type [${}^{APs}q_{kL}(x_E)$, ${}^{APs}q_{kL}^{H}(x_E)$ and ${}^{APs}q_{kL}^{H}(x_{E\text{-}H})$] stochastic quadratic indices were also computed.

2.3. Chemometric studies

The statistical analysis was carried out using the Statistical Package for the Social Sciences (SPSS) 15.0 for Windows [27] and the Data Analysis Software System (STATISTICA) 6.0 [28].

2.3.1. CA for rational design of the training and test sets

To design the TS and PS so as to guarantee structural and inhibitory variability in both series of the present databases, the CA known as the k-means cluster algorithm (k-MCA) is used. CA is the name of a group of methods used to recognize similarities among cases (objects) or among variables and to single out categories as sets of similar cases (or variables) [29]. CA comprises a number of different classification algorithms and allows the data to be organized into subsystems. k-MCA is a non-hierarchical clustering method that first determines a cluster center and then groups all objects that lie within a certain distance of it. The number of members in each cluster and the standard deviation of the variables in the cluster (kept as low as possible) were taken into account in order to obtain a data partition of acceptable statistical quality. The values of the standard deviation between and within clusters, the respective Fisher ratios and their p-levels of significance were also examined [29–32]. Once the TS is available, it can be used to fit QSAR models that classify chemicals as TI or non-TI, or that estimate the potency of the active ones.
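The selection step of this design can be sketched as follows, assuming that cluster labels have already been obtained from a k-means run on each class and that a fixed fraction of every cluster is drawn at random for training. The function name, the default fraction and the toy labels are illustrative assumptions, not the authors' implementation.

```python
import random


def split_by_cluster(cluster_labels, train_fraction=0.75, seed=0):
    """Randomly assign a fraction of every cluster to the training set.

    cluster_labels: one cluster label per compound (e.g. from k-means).
    Returns (training_indices, prediction_indices), both sorted.
    """
    rng = random.Random(seed)
    members = {}
    for index, label in enumerate(cluster_labels):
        members.setdefault(label, []).append(index)

    training, prediction = [], []
    for label in sorted(members):
        indices = members[label][:]
        rng.shuffle(indices)
        n_train = round(train_fraction * len(indices))
        # Keep at least one compound on each side when the cluster allows it,
        # so that every cluster is represented in both sets.
        if len(indices) > 1:
            n_train = min(max(n_train, 1), len(indices) - 1)
        training.extend(indices[:n_train])
        prediction.extend(indices[n_train:])
    return sorted(training), sorted(prediction)


# Hypothetical labels: three clusters of 8, 8 and 4 compounds.
labels = [0] * 8 + [1] * 8 + [2] * 4
ts, ps = split_by_cluster(labels)
print(len(ts), len(ps))
```

Applying such a split separately to the active and the inactive compounds keeps every structural cluster of each class represented in both series.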
The PS is never used in the development of the models; it is used only to evaluate their predictive power [33], because the statistical parameters obtained for the complete TS provide some assessment of the goodness-of-fit of the models but are not enough to ensure their predictive power.

2.3.2. Classification algorithms

Classification is the process of dividing a dataset into mutually exclusive groups (two classes in our case) so that the members of each group are as close as possible to one another and different groups are as far as possible from each other, where distance is measured with respect to the specific variable(s) involved in the prediction [34]. The fundamentals of the k-MCA, LDA, QDA, BLR and CT classification algorithms used in this work are summarized in Section 1 of Supplementary data.

2.4. ROC theory for studying performances of classification models

The ROC curve was introduced some years ago to evaluate machine learning algorithms [35,36]. ROC curves are so named because they were first used for the detection of radio signals in the presence of noise in the 1940s [37]. An ROC graph is a technique for visualizing, organizing and selecting classifiers based on their performance. ROC graphs are two-dimensional graphs in which the TP (true positive) rate is plotted on the Y-axis and the FP (false positive) rate is plotted on the X-axis. An ROC graph depicts the relative trade-off between benefits (TP) and costs (FP) [38]. The ROC graph lies inside the square (0, 1) × (0, 1).

Several points in ROC space are important to note. The lower left point (0, 0) represents the strategy of never issuing a positive classification; such a classifier commits no FP errors but also gains no TPs. The opposite strategy, of unconditionally issuing positive classifications, is represented by the upper right point (1, 1). The point (0, 1) represents perfect classification because the model classifies all positive and negative cases correctly, so that FP = 0 and TP = 1. The point (1, 0) represents the worst model because all classifications are incorrect. In many cases, a classifier has a parameter that can be adjusted to increase TP at the cost of increasing FP, or to decrease FP at the cost of decreasing TP. As each instance is processed, the TP benefit (FP cost) is incremented by the benefit (cost) of that positive (negative) instance; the ROC points are then the fractions of the total benefits and costs, respectively. The ROC curve compares the performance of the classifiers across the entire range of class distributions and error costs. However, often there is no clear dominating relation between two ROC curves over the entire range [39]. In those situations, to compare classifiers we may want to reduce ROC performance to a single scalar value representing expected performance. A common method is to calculate the area under the ROC curve, abbreviated AUC [38,40,41]. Since the AUC is a portion of the area of the unit square, its value will always be between 0 and 1. The AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [38,41]. A perfect classification model has an area of 1, while a non-discriminating test (one that falls on the diagonal) has an area of 0.5. When two classifiers are compared through their ROC curves, the one with the larger AUC is in general the better classifier; more precisely, the better classifier has a point higher on the Y-axis (larger ordinate) together with a point lower on the X-axis (smaller abscissa).

2.5. Multiple Comparison Procedures

No single accepted and established test exists for Multiple Comparison Procedures (MCP) [42,43]. In our case, to determine the differences between the classifiers presented in this report, we performed several non-parametric statistical tests [44].
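A typical non-parametric comparison of this kind starts from the Friedman rank statistic and its Iman–Davenport correction. The sketch below assumes the standard textbook formulas, with N result sets and k classifiers, and uses made-up scores; it computes only the statistics and their degrees of freedom, and the p-value would still have to be read from an F distribution. The function names are hypothetical.

```python
def average_ranks(scores):
    """scores[i][j]: performance of classifier j on result set i (higher is better).

    Returns the mean rank of each classifier (rank 1 = best, ties share ranks).
    """
    n_sets, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # average rank for tied classifiers
            for t in range(pos, end + 1):
                totals[order[t]] += shared
            pos = end + 1
    return [total / n_sets for total in totals]


def iman_davenport(scores):
    """Return (Friedman chi-square, Iman-Davenport F, degrees of freedom).

    Undefined (division by zero) when all result sets rank the classifiers
    identically, because the denominator then vanishes.
    """
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat, (k - 1, (k - 1) * (n - 1))


# Hypothetical accuracies of three classifiers on four result sets.
demo_scores = [[0.90, 0.80, 0.70],
               [0.95, 0.85, 0.75],
               [0.92, 0.82, 0.72],
               [0.91, 0.93, 0.70]]
print(iman_davenport(demo_scores))
```

A large F value relative to the F distribution with the returned degrees of freedom indicates that at least one classifier ranks differently, which is when a post hoc procedure such as Holm's becomes relevant.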
First, we applied an Iman–Davenport test [45] to check whether the results obtained by the algorithms differ significantly. If differences are found, a Holm test [46] then shows which pairs of algorithms have dissimilar average results. Readers interested in the MCP used in this report can find additional information on the website http://sci2s.ugr.es/sicidm/.

3. Results and discussion

3.1. Classification models for the tyrosinase inhibitory activity

3.1.1. Obtaining of training and test set for the classification dataset

The classification of all compounds in the complete TS provides some assessment of the goodness-of-fit of the models, but it does not provide a thorough criterion of how well the models can predict the biological properties of new compounds. To assess such predictive power, the use of an external test set is essential. An external validation process, most commonly based on a test set, is necessary to ensure the quality and extrapolation power of the QSAR models found in this report [47]. k-MCA was used to split the original dataset (1429 chemicals) into TS and PS [32]. Each set of active and inactive compounds was split into 15 clusters. The TS and PS were then selected by taking compounds at random from each cluster, under the criterion that the TS should contain 75% of the database. The TS was used to develop the discriminant functions and the PS was used for the external validation of the models. The general algorithm used to design the training and test sets is shown in Fig. S1 of Supplementary data.

3.1.2. QSAR models performed by different statistical techniques

3.1.2.1. CA. Here we pooled the active and inactive compounds of the TS and then performed the k-MCA to split the TS (1072 chemicals) into two clusters (k = 2). A random classification was obtained for both groups: the first cluster contained 174 active and 185 inactive compounds, and the second cluster contained 352 active and 361 inactive chemicals. As can be seen, an adequate classification model cannot be built with simple clustering as the technique for separating active from inactive compounds. This means that CA cannot be expected to perform well in classifying TIs apart from chemicals that do not present this activity. More sophisticated discriminant methods are therefore required to model tyrosinase inhibitory activity.

3.1.2.2. LDA. In the LDA technique, the dependent variable was Class(L) and the independent variables were the kth (k ≤ 15) 2D TOMOCOMD-CARDD atom-based non-stochastic and stochastic quadratic molecular descriptors. The best subset technique was used. Wilks' λ was calculated with a significance of 0.000, so the null hypothesis of equal group means was rejected, and we conclude that the developed model separates the groups adequately. The model obtained by LDA, together with its prediction performance and statistical parameters for TS and PS, is given in Eq. (1):
$$
\begin{aligned}
\mathrm{Class(L)} = {} & 0.393 - 0.001\,{}^{V}q_{1L}(x_E) - 0.015\,{}^{V}q^{H}_{2L}(x_{E\text{-}H}) - 1.6\times10^{-8}\,{}^{V}q^{H}_{14L}(x_{E\text{-}H}) \\
& + 0.073\,{}^{P}q^{H}_{5L}(x_{E\text{-}H}) + 4.160\,q^{H}_{15L}(x_{E\text{-}H}) + 0.040\,{}^{K}q_{2L}(x_E) - 1.543\times10^{-7}\,{}^{G}q_{11L}(x_E) \\
& + 0.268\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H}) - 0.195\,{}^{Ks}q^{H}_{3L}(x_E) - 0.201\,{}^{Ks}q^{H}_{8L}(x_E) + 0.130\,{}^{Ks}q_{0L}(x_E)
\end{aligned}
\tag{1}
$$

TS: N = 1072, λ = 0.39, χ² = 1007.93, Rcan = 0.78, C = 0.76, Q = 91.42%, Sensitivity = 89.36%, Specificity = 92.89%, FP Rate = 6.59%, F = 152.02
PS: N = 357, C = 0.82, Q = 91.04%, Sensitivity = 88.57%, Specificity = 92.81%, FP Rate = 6.59%

This model has a good overall accuracy of 91.42%, an ability to detect known active compounds (sensitivity) of 89.36% and non-active compounds (specificity) of 92.89%, a low FP rate and other adequate parameters in the TS. For instance, it has an adequate C of 0.76. The C value quantifies the strength of the linear relation between the molecular descriptors and the classifications, and it may often provide a much more balanced evaluation of the prediction than, for instance, Q [48]. Rcan is the correlation between the linear combination of the independent variables and a linear combination of indicator variables (one and zero) that encode the membership of the chemicals in the groups. In our case, an Rcan of 0.81 indicates that the discriminant variables included in the model allow the groups to be differentiated. Wilks' λ expresses the proportion of the total variability that is not due to the difference between groups; the value of 0.39 is adequate and indicates a large difference between groups. From the structure matrix, which shows the correlation of each variable with the discriminant function, we found that ${}^{Ms}q^{H}_{1L}(x_{E\text{-}H})$ is the most important variable for separating TIs from non-TIs, because it has the highest correlation with the discriminant function (0.359). This indicates that the presence of heteroatoms and the atomic mass are important features of chemicals with tyrosinase inhibitory activity.

The classification of all compounds in the complete TS provides some assessment of the goodness-of-fit of the models, but it does not provide a thorough criterion of how well the models can predict the biological properties of new compounds. One of the most important characteristics of a QSAR model is its predictive power, i.e. the ability of the model to predict accurately the activity of compounds that were not used for model development. An external set is therefore the only way to determine the true predictive power of a QSAR model. In our case, the statistical parameters for the PS given above show a good behavior of the LDA-based model on the PS.

3.1.2.3. QDA. In the QDA technique, the dependent variable was Class(Q) and the continuous predictors were atom-based non-stochastic and stochastic quadratic molecular descriptors. Forward stepwise analysis was used. At each step, the forward stepwise options cause the program to consider simultaneously the addition or removal of a variable or effect, based on the current specification of the p or F to enter/remove. In our case, F-to-enter (F1) and F-to-remove (F2) values of 1.0 (the defaults) were selected. The variables finally included in the model were obtained in step 82. The QDA-based model obtained is given in Eq. (2):

$$
\begin{aligned}
\mathrm{Class(Q)} = {} & 0.227 - 7.717\times10^{-3}\,{}^{V}q_{1L}(x_E) - 0.042\,{}^{V}q^{H}_{2L}(x_{E\text{-}H}) - 1.354\times10^{-8}\,{}^{V}q^{H}_{14L}(x_{E\text{-}H}) \\
& + 0.167\,{}^{P}q^{H}_{5L}(x_{E\text{-}H}) + 0.080\,{}^{K}q_{2L}(x_E) - 3.992\times10^{-7}\,{}^{G}q_{11L}(x_E) + 0.672\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H}) \\
& - 0.380\,{}^{Ks}q^{H}_{3L}(x_E) - 0.736\,{}^{Ks}q^{H}_{8L}(x_E) + 0.492\,{}^{Ks}q_{0L}(x_E) \\
& + 5.549\times10^{-12}\,{}^{V}q^{H}_{2L}(x_{E\text{-}H})\,{}^{V}q^{H}_{14L}(x_{E\text{-}H}) - 5.998\times10^{-5}\,{}^{V}q^{H}_{2L}(x_{E\text{-}H})\,{}^{P}q^{H}_{5L}(x_{E\text{-}H}) \\
& - 1.074\times10^{-5}\,{}^{V}q_{1L}(x_E)\,{}^{K}q_{2L}(x_E) + 6.416\times10^{-5}\,{}^{V}q^{H}_{2L}(x_{E\text{-}H})\,{}^{K}q_{2L}(x_E) \\
& + 9.886\times10^{-4}\,{}^{V}q^{H}_{2L}(x_{E\text{-}H})\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H}) - 0.002\,{}^{P}q^{H}_{5L}(x_{E\text{-}H})\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H}) \\
& - 0.002\,{}^{K}q_{2L}(x_E)\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H}) - 7.483\times10^{-4}\,{}^{V}q^{H}_{2L}(x_{E\text{-}H})\,{}^{Ks}q^{H}_{3L}(x_E) \\
& + 0.001\,{}^{P}q^{H}_{5L}(x_{E\text{-}H})\,{}^{Ks}q^{H}_{3L}(x_E) + 3.617\times10^{-4}\,{}^{K}q_{2L}(x_E)\,{}^{Ks}q^{H}_{3L}(x_E) \\
& + 0.009\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H})\,{}^{Ks}q^{H}_{3L}(x_E) + 1.122\times10^{-4}\,{}^{V}q_{1L}(x_E)\,{}^{Ks}q^{H}_{8L}(x_E) \\
& + 0.002\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H})\,{}^{Ks}q^{H}_{8L}(x_E) - 3.626\times10^{-5}\,{}^{V}q_{1L}(x_E)\,{}^{Ks}q_{0L}(x_E) \\
& + 0.003\,{}^{Ms}q^{H}_{1L}(x_{E\text{-}H})\,{}^{Ks}q_{0L}(x_E) - 0.002\,{}^{Ks}q^{H}_{8L}(x_E)\,{}^{Ks}q_{0L}(x_E) \\
& + 2.307\times10^{-19}\,\bigl[{}^{V}q^{H}_{14L}(x_{E\text{-}H})\bigr]^{2} + 5.453\times10^{-19}\,\bigl[{}^{G}q_{11L}(x_E)\bigr]^{2} - 0.013\,\bigl[{}^{Ms}q^{H}_{1L}(x_{E\text{-}H})\bigr]^{2}
\end{aligned}
\tag{2}
$$

The prediction performance and statistical parameters for TS and PS are given below. The equations were statistically significant (p < 0.0001).

TS: N = 1072, λ = 0.29, χ² = 1324.93, Rcan = 0.85, C = 0.91, Q = 92.35%, Sensitivity = 90.30%, Specificity = 93.88%, FP Rate = 5.68%, D² = 10.02
PS: N = 357, C = 0.85, Q = 92.43%, Sensitivity = 91.43%, Specificity = 93.02%, FP Rate = 6.59%

It can be seen that the QDA-based model performs very well and is superior to the LDA-based model in both series of TS
and PS while it was integrated by the same number of variables. But the quadratic function is more complicated because it included not only predictor variables in linear form but also characterized quadratic terms (see Eq. (2)). 3.1.2.4. BLR. The dependent variable is codied in dichotomous categories 1 (non-TI) and 1 (TI). The response of BLR is a mathematic formula that gives probability p of case to belong to class 1 in function of predictive variables. Forward: Conditional analysis was used to estimate model. Forward Selection (Conditional) is a stepwise selection method with entry testing based on the signicance of the score statistic, and removal testing based on the probability of a likelihood-ratio statistic based on conditional parameter estimates. A variable is entered into the model if the probability of its statistical score is less than 0.05 and is removed if the probability is greater than 0.1 (as default values). The model was obtained nally; in all steps Wald statistical was signicant (p b 0.05), that is to say the coefcients bi differentiate signicantly from 0. BLRbased model with performance parameters for TS and PS is shown in the following Eq. (3) while p exceeds 0.50 (default value) chemical is classied as TI and vice versa: p= 1 1 + eLRC

ones. The obtained trees for TS and PS are depicted in Figs. S2 and S3 of Supplementary data. The CT does not give mathematic equation like LDA, QDA or BLR, but from them it is possible to generate rules and with such system a classication algorithm is established. A summary of classication for chemicals in TIs and non-TIs on the TS following the tree is depicted below, where the compound is classied as TI with probability approximately 1 or at least equal 0.5 when:
Vs i) The chemical has MqH q1LxE 1080.67, 1L xEH 31.32, Gs H q0LxEH 22.74 and Vsq1LxE 216.94 (TI 76.5%). Vs ii) The chemical has MqH q1L xE 1080.67, 1L xEH 31.32, Gs H q 0 L xEH N 22.74, M s q H x 5.48 (TI 76.5%) and 2 L EH Ks H q8LxE 31.15% (TI 95.9%). Vs iii) The chemical has MqH q1LxE 1080.67, 1L xEH 31.32, Gs H q 0 L xEH N 22.74, M s q H 2 L xEH 5.48 (TI 76.5%) but Ks H q8LxE N 31.15% and GqH 1LxE N 76.09% (TI 89.7%). Vs iv) The chemical has MqH q1LxE N 1080.67 1LxEH 31.32 but P H and q3LxEH N 14.44 (TI 53.8%). Vs v) The chemical has M q H q1L xE N 1080.67, 1L xEH 31.32, P H q3LxEH 14.44, Vsq1LxE 1548.78, Kq2LxE 399.64 and Ks H q1LxEH N 44.15% (TI 50%). Vs vi) The chemical has M qH q1L xE N 1080.67, 1L xEH 31.32, P H q3LxEH 14.44, Vsq1LxE 1548.78, Kq2LxE N 399.64 and K q1LxE 113.68 (TI 91.7%). Vs vii) The chemical has M q H q1L xE N 1080.67, 1L xEH 31.32, P H G H q3LxEH 14.44, VEq1s N 1548.78, q1LxE 116.23 and G q11LxE N 12674538.06 (TI 57.1%). G viii) The chemical has MqH 1LxEH N 31.32, q11LxE 28283655.24 P H and q2LxEH 6.48 (TI 93.7%). G ix) The chemical has MqH 1LxEH N 31.32, q11LxE 28283655.24, P H Ms H q2LxEH N 6.48, and q1LxEH N 57.26 (TI 92.8%). G x) The chemical has MqH 1LxEH N 31.32, q11LxE 28283655.24, P H P H q2LxEH N 6.48, MsqH x 57.26, q2LxEH 8.18 and 1L EH Ms H q3LxEH N 30.55 (TI 93.5%).

where
V LRC = 0:952 + 0:007 q1L xE 0:003 q3L xE + 0:002 q3L xE 3 H V H V

+ 1:441 10
Ms H

6G

q11L xE 9:291 10
Ms H

9G Ms

q15L xE
H

+ 0:005 q5 x0:004 q14 x + 0:862 q3L xEH 0:717 q7L xEH 0:430 q3L xE + 0:345 q7L xE
Ms H Ks H Ks H

TS: N = 1072, C = 0.84, Q = 91.88%, Specificity = 92.46%, Sensitivity = 90.87%, FP rate = 7.14%
PS: N = 357, C = 0.76, Q = 88.24%, Specificity = 88%, Sensitivity = 88%, FP rate = 11.54%

This model has a very good overall accuracy of 91.88% and 88.24% in the TS and PS, respectively, and the other parameters also perform well in both series. Considering the coefficients b_i, we can postulate that the variable that contributes most to the prediction of tyrosinase inhibitory activity is ${}^{Ms}q^{H}_{3L}(x_{E-H})$.

3.1.2.5. CT. The CRT (Classification and Regression Trees) technique was used with the variable Class(T) as the dependent variable. CRT splits the data into segments that are as homogeneous as possible with respect to the dependent variable. A terminal node in which all cases have the same value for the dependent variable is a homogeneous, pure node. Growth-limit criteria restrict the number of levels in the tree and set the minimum number of cases for parent and child nodes. In our study, the maximum tree depth is 6, and we specified minimum numbers of cases of 15 for parent nodes and 7 for child nodes. Nodes that do not satisfy these criteria are not split. Twoing was selected as the impurity measure, and a minimum decrease in impurity was required for a node to be split. The root node contains all the studied cases: 49.1% of them are TI chemicals and 50.9% are chemicals without this activity. The tree contains 39 nodes in total; 20 of them are terminal ones.

The parameters used to assess the performance of the CT model for TS and PS are as follows:
TS: N = 1072, Q = 91.79%, Specificity = 94.88%, Sensitivity = 90.86%, FP rate = 8.97%
PS: N = 357, Q = 89.36%, Specificity = 89.14%, Sensitivity = 89.14%, FP rate = 10.44%

From the chart of predictor importance, we concluded that the most important variable in describing tyrosinase activity is ${}^{M}q^{H}_{1L}(x_{E-H})$, indicating that the presence of heteroatoms and the atomic mass are important for inhibiting the enzyme.

3.1.3. Comparison of QSAR classifiers

In this section, we compare the different algorithms used in the QSAR approach. All these classification methods can be used to assign objects to two or more groups, and they employ one or more predictor variables. As shown in the previous sections, the tyrosinase inhibitory activity can be successfully modeled by all these techniques, with good performance for TS and PS. An important difference between the four statistical methods is that LDA, QDA and BLR can be used to generate a distinct probability of being TI for each chemical, whereas CT analysis only generates two

probabilities: an average probability for all chemicals predicted to be TIs and an average probability for all chemicals predicted to be non-TIs. The availability of such probabilities provides a means of overcoming one of the shortcomings of models that employ cut-off values to classify chemicals, namely the treatment of borderline predictions [49].

The classifications obtained by LDA and QDA were good for both sets: the overall accuracies are 91.42% and 91.04% for LDA, and 92.35% and 92.43% for QDA, for TS and PS respectively. Discriminant analysis is known to rely on a number of assumptions that are easily violated in practice. In the case of LDA, the key assumptions are normal distribution of the independent variables, homogeneity of the variance/covariance matrices and the absence of perfect multicollinearity between the predictor variables. LDA is the simpler technique, and if the assumption of multivariate normality is violated, this does not so much invalidate the model as reduce its predictive power; i.e. departures from normality do not necessarily lead to type I errors (in which the null hypothesis of no discrimination between groups is falsely rejected) but could lead to a reduced ability to classify objects when the model is applied to independent data. In contrast, heterogeneity of the variance/covariance matrices is likely to be more important, since objects are more likely to be classified into the group with the greatest dispersion [49]. LDA assumes that structurally similar chemicals share a common mode of action. One of the main limits of applicability of such QSARs is that relationships can only be established, and subsequently used as prediction tools, for structurally similar compounds. Moreover, Russom and co-workers [50] have illustrated that, even within the same chemical class, chemicals may act through different modes of action; therefore, the accuracies of linear QSAR methods are limited [51].
Linearity could therefore be a further weakness of the system because, even though it represents data with a linear correlation well, it cannot model more complex relations. In our case, QDA gave the best results in terms of accuracy and FP rate for TS and PS. QDA is able to model nonlinear relations and is more robust to small deviations from normality, but attributes must not have zero variance within a class, and a large number of parameters must be calculated [34].

BLR also performed well for TS and PS with the same number of variables (11) as LDA and QDA; the Q values are 91.88% and 88.24% in the TS and PS, respectively. BLR is more flexible than LDA and QDA because it does not assume that the observations are normally distributed. It also directly provides the posterior probability that a compound is a TI, and it allows the optimal cut-off value to be obtained from the coordinates of the ROC curve. Here we show how to obtain the optimal cut-off value for the BLR-based model. To do so, the coordinates used to build the ROC curve of the BLR-based model are saved, and the point with the greatest sensitivity and the smallest value of 1 − specificity is selected. In this case, the optimal value of 0.56 was selected, and the model was recalculated with this cut-off point. The results are as follows:
TS: N = 1072, C = 0.84, Q = 92.54%, Specificity = 94.07%, Sensitivity = 90.49%, FP rate = 5.49%
PS: N = 357, C = 0.76, Q = 89.08%, Specificity = 90.48%, Sensitivity = 86.86%

CT gave the poorest results in modeling the tyrosinase activity (accuracies of 91.79% and 89.36% for TS and PS, respectively), but it has the advantage of being a non-parametric classifier; i.e. no assumptions are made about the distributions of the variables. Another difference between the four techniques is that discriminant analysis (linear and quadratic) and BLR use all predictor variables simultaneously in the model, whereas CT analysis uses the predictor variables in a hierarchical manner.
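The performance parameters quoted in this comparison (Q, sensitivity, specificity, FP rate and C) can all be derived from a 2×2 confusion matrix. The following is a minimal sketch using the textbook definitions and assuming that C denotes the Matthews correlation coefficient; this section does not restate the definitions, so the values reported here need not follow exactly these conventions. The function name and the counts are hypothetical and are not taken from the paper.

```python
from math import sqrt


def classification_metrics(tp, fn, tn, fp):
    """Summary statistics from a 2x2 confusion matrix.

    tp/fn: active (TI) compounds predicted active/inactive.
    tn/fp: inactive compounds predicted inactive/active.
    Percentages are returned for Q, sensitivity, specificity and FP rate;
    C (assumed here to be the Matthews correlation coefficient) is unitless.
    """
    total = tp + fn + tn + fp
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Q": 100.0 * (tp + tn) / total,           # overall accuracy
        "sensitivity": 100.0 * tp / (tp + fn),    # true-positive rate
        "specificity": 100.0 * tn / (tn + fp),    # true-negative rate
        "fp_rate": 100.0 * fp / (fp + tn),        # false-positive rate
        "C": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts for illustration only.
example = classification_metrics(tp=90, fn=10, tn=80, fp=20)
print(example)
```

With these textbook definitions the FP rate is the complement of the specificity, which makes it easy to check which convention a reported set of values follows.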

3.1.3.1. Selecting the best classification model using the theory of ROC curves. Here, we use the results obtained from ROC curves to select the best classification model. This procedure is a useful way to evaluate the performance of classification schemes in which subjects are classified by one variable with two categories. LDA, QDA and BLR naturally yield an instance probability, that is, a numeric value that represents the degree to which an instance is a member of a class. CT, in contrast, is designed to produce only a class decision (TI or non-TI) for each instance; to generate a full ROC curve from such a classifier, instead of just a single point, scores rather than class labels have to be generated. Figs. 1 and 2 display the ROC curves of the four classifiers studied above for TS and PS, respectively.

Predictive accuracy has been used as the main, and often the only, evaluation criterion for the predictive performance of classification learning algorithms. In recent years, the AUC has been proposed as an alternative single-number measure for evaluating learning algorithms [38,40,41]. Some authors have shown that AUC is a better measure than accuracy for evaluating learning algorithms [39]. Table 1 lists the AUC of the four classification models for TS and PS. In both series the AUC of QDA is slightly higher than those of BLR and LDA, and considerably higher than that of CT. From the shape of the ROC curves and the results in Table 1, we can conclude that QDA gives better results than the other classifiers in both TS and PS, although LDA, BLR and CT also perform adequately. This result is to be expected, because QDA calculates a large number of parameters. BLR is slightly better than LDA in the TS, which reflects the complexity of the activity being modeled; it is possible that the hypersurface generated by the BLR-based model improves the resolution of the problem.
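To make the AUC comparison concrete, the sketch below computes the area under the ROC curve as the probability that a randomly chosen active compound receives a higher score than a randomly chosen inactive one (ties count one half). It also illustrates why a classification tree needs scores rather than labels: here each compound is assumed to be scored by the proportion of actives in its terminal node, which is one common choice and not necessarily the one used by the authors. All scores and labels are hypothetical.

```python
def auc(scores, labels):
    """Area under the ROC curve from scores and 0/1 labels.

    Uses the rank (Mann-Whitney) formulation: the fraction of
    (active, inactive) pairs in which the active compound has the
    higher score, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]

# Hypothetical per-compound probabilities, as LDA, QDA or BLR would give.
probabilistic_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

# Hypothetical tree scores: every compound in a terminal node shares
# that node's proportion of actives, so only a few distinct values occur.
tree_leaf_scores = [0.9, 0.9, 0.9, 0.2, 0.2, 0.2]

print(auc(probabilistic_scores, labels))
print(auc(tree_leaf_scores, labels))
```

Because a tree yields only as many distinct scores as it has terminal nodes, its ROC curve has few vertices, which is one reason its AUC can lag behind that of methods producing a continuous probability.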

With the cut-off value of 0.56, the FP rate of the BLR-based model in the PS is 8.79%. Overall, the accuracies increased from 91.88% to 92.54% for the TS and from 88.24% to 89.08% for the PS, and the FP rate of the BLR-based model fell from 7.14% to 5.49% for the TS and from 11.54% to 8.79% for the PS.
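The cut-off selection described for the BLR-based model can be sketched as follows. The logistic function converts the linear predictor into a probability, the ROC coordinates are tabulated for every candidate cut-off, and the point with high sensitivity and low 1 − specificity is chosen. Here that choice is implemented by maximizing sensitivity − (1 − specificity) (Youden's index), which is an assumption about how the trade-off is resolved; the function names and the data are hypothetical.

```python
from math import exp


def logistic(linear_predictor):
    """Probability of the positive class from a BLR linear predictor."""
    return 1.0 / (1.0 + exp(-linear_predictor))


def roc_coordinates(probs, labels):
    """Return (cutoff, sensitivity, 1 - specificity) for each distinct cutoff.

    A compound is predicted active when its probability is >= cutoff.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    coords = []
    for cutoff in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
        coords.append((cutoff, tp / n_pos, fp / n_neg))
    return coords


def best_cutoff(coords):
    """Cut-off maximizing sensitivity - (1 - specificity) (assumed criterion)."""
    return max(coords, key=lambda c: c[1] - c[2])[0]


# Hypothetical predicted probabilities and true classes (1 = TI).
probs = [0.10, 0.30, 0.45, 0.52, 0.60, 0.70, 0.90]
labels = [0, 0, 0, 1, 0, 1, 1]

coords = roc_coordinates(probs, labels)
print(best_cutoff(coords))
```

Once a cut-off has been chosen in this way on the training series, the classification of both series is recomputed with it, as done above for the value 0.56.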

Fig. 1. ROC curves associated with the four classification methods, obtained with the TS.

Table 3
Results of Iman–Davenport tests for the classification dataset (α = 0.05).

Series         Test value   Critical value   Hypothesis
Training set   3.575        3.4903           Rejected
Test set       23.7778      3.4903           Rejected

Table 4
Results of Holm's test with QDA as the control algorithm for the classification dataset.

Set              i   Algorithm   z = (R0 − Ri)/SE   p-value    α/i      Hypothesis
Training set     3   LDA         2.4495             0.0143     0.0167   Rejected
                 2   CT          1.9596             0.0500     0.0250   Accepted
                 1   BLR         0.9798             0.3272     0.0500   Accepted
Prediction set   3   BLR         3.4293             6.05E−4    0.0167   Rejected
                 2   CT          2.4495             0.0143     0.0250   Rejected
                 1   LDA         1.4697             0.1416     0.0500   Accepted

Fig. 2. ROC curves associated with the four classification methods, obtained with the PS.

3.1.3.2. Analysis of the results of Multiple Comparison Procedures. First, we computed the mean rankings of the algorithms (Table 2). The classifier with the best average ranking for both subsets (TS and PS) was QDA. We then carried out an Iman–Davenport test to determine whether statistically significant differences exist between the models in the training and prediction sets. According to the results in Table 3, significant differences were detected for both the learning and prediction series: the test values are higher than the critical value (Fisher's distribution) at a significance level of α = 0.05. Next, a powerful procedure, Holm's test, was carried out at the same significance level to check whether there are statistically significant differences between QDA (best ranking) and the other classifiers for the training and prediction sets (see Table 4). Comparing the p-values of the test with the α/i values in Table 4 shows that, for the training set, there is a significant difference between QDA and LDA, whereas no significant differences are observed between QDA and the other algorithms (CT and BLR). For the prediction set the pattern is different: there are significant differences between QDA and both CT and BLR, and no significant difference between QDA and LDA. Therefore, QDA shows generally better behavior for TS and PS, and in some cases this improvement is statistically significant according to the statistical analysis.
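The Iman–Davenport statistic can be recomputed directly from the average rankings in Table 2. The sketch below first forms the Friedman chi-square from the mean ranks and then applies the Iman–Davenport correction, which is compared against an F distribution with (k − 1) and (k − 1)(N − 1) degrees of freedom. The number of data partitions, N = 5, is an assumption: it is not stated in this section, but it is consistent with the critical value 3.4903 (F with 3 and 12 degrees of freedom at α = 0.05) and it reproduces the test values in Table 3. The function name is hypothetical.

```python
def iman_davenport(mean_ranks, n_datasets):
    """Friedman chi-square and Iman-Davenport F statistic from mean ranks.

    mean_ranks: average rank of each of the k algorithms.
    n_datasets: number of data sets (or partitions) that were ranked.
    """
    k = len(mean_ranks)
    n = n_datasets
    sum_sq = sum(r * r for r in mean_ranks)
    chi2 = 12.0 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_stat


N_PARTITIONS = 5  # assumed, see the lead-in

# Average rankings from Table 2 in the order QDA, LDA, BLR, CT.
ranks_ts = [1.40, 3.40, 2.20, 3.00]
ranks_ps = [1.00, 2.20, 3.80, 3.00]

CRITICAL_VALUE = 3.4903  # critical value reported in Table 3

for name, ranks in (("TS", ranks_ts), ("PS", ranks_ps)):
    chi2, f_stat = iman_davenport(ranks, N_PARTITIONS)
    print(name, round(f_stat, 4), "rejected" if f_stat > CRITICAL_VALUE else "accepted")
```

Under the stated assumption, the training-set ranks give a statistic of about 3.576 and the prediction-set ranks about 23.778, both above the critical value, in line with Table 3.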

3.2. QSAR models for the in silico potency estimation of tyrosinase inhibitory activity

With the aim of establishing a hierarchical study for the identification of new leads, the statistical techniques used above were applied to the second dataset to obtain models for estimating the potency of new tyrosinase inhibitors identified by VS with the models fitted to the first dataset. The procedure for obtaining these Potency models is similar to that used to build the previous Class models. First, a k-MCA III was used to split the dataset (533 chemicals) into TS and PS. This set of active compounds was split into 13 clusters [31,32]. The general algorithm used to design the training and test sets is shown in Fig. S4 of the Supplementary data. Then, we performed another k-MCA to split the TS (398 chemicals) into two clusters (k = 2). A random classification was obtained in both groups: the first cluster contained 85 strong-TIs and 23 moderate-to-weak ones, and the second cluster contained 172 strong- and 118 moderate-to-weak TIs. This means that CA cannot be used to obtain models to estimate the potency of TIs. After that, an LDA model was developed, this time with Potency(L) as the dependent variable. Eq. (4) shows the model obtained, together with its performance for TS and PS. The equation was statistically significant (p < 0.0001).

Table 1
Area under the ROC curves (AUC) of the four classifiers for TS and PS.

       LDA (11)a   QDA (11)a,b   BLR (11)a   CT (15)a
TS     0.969       0.979         0.973       0.954
PS     0.955       0.969         0.952       0.922

a The number of molecular descriptors included in each model is given in parentheses.
b Results of the best model.

Table 2
Average ranking of the algorithms for the classification dataset.

Algorithm   Ranking (TS)a   Ranking (PS)b
QDA         1.40            1.00
LDA         3.40            2.20
BLR         2.20            3.80
CT          3.00            3.00

a Algorithm ranking corresponding to the training set.
b Algorithm ranking corresponding to the prediction set.

PotencyL = 1:6726:770 108M q13 x + 1:081 10


H

8M

q15 x

+ 0:003 q5 x1:809 10
Ms H

6K

Ms H q11 x0:007 q1 x Ms H

0:020 q4 x + 0:029 q6 x1:100 q4L xEH + 0:141 q5L xEH + 0:712 q8L xEH 0:309 q2L xE 0:596 q9L xE + 0:247 q14L xE + 0:583 q3L xE 4
Ks H Ks Gs H Ms H Ms H Ks H

Ms H


TS: N = 398, λ = 0.48, χ² = 230.30, C = 0.78, Rcan = 0.67, Q = 89.95%, Specificity = 93.23%, Sensitivity = 91.05%, FP rate = 12.06%, F = 29.95
PS: N = 135, C = 0.65, Q = 83.71%, Specificity = 87.21%, Sensitivity = 87.21%, FP rate = 22.45%

This model has a good overall accuracy and other adequate statistical parameters for TS and PS. From the structure matrix, which shows the correlation of each variable with the discriminant function, we find that the most important variable for separating strong- from moderate-to-weak TIs is ${}^{Ms}q^{H}_{4L}(x_{E-H})$, because it has the highest correlation with the discriminant function (0.207).

For the QDA technique, the dependent variable was Potency(Q). The model obtained, together with its prediction performance and statistical parameters for TS and PS, is given in Eq. (5). The equation was statistically significant (p < 0.0001).

PotencyQ = 5:4393:793 109M q13 x + 6:571 10
H M H 0:247 q2L xEH 9M 2

a mathematical formula that gives the probability p of a case being a strong-TI as a function of the predictor variables. The model obtained by BLR, with its performance parameters for TS and PS, is given in Eq. (6); when p exceeds 0.5 (the default value) the chemical is classified as a strong-TI, and otherwise as a moderate-to-weak TI:

$$p = \frac{1}{1 + e^{-\mathrm{LRP}}}$$

where LRP = 0:226 + 4:005 1010M q13 x + 5:960 10 2:769 10


Ms H 8P 7P H q11 x

q15 x + 1:485 10
Ms H

8P H G q15L xE 0:028 q3 x Ks H

0:009 q2 x + 0:004 q13 x + 0:379 q2 xE

TS: N = 398, C = 0.79, Q = 90.20%, Specificity = 92.25%, Sensitivity = 92.61%, FP rate = 14.18%
PS: N = 135, C = 0.63, Q = 82.96%, Specificity = 84.62%, Sensitivity = 89.53%, FP rate = 28.57%

q13 x

P H 0:425 q0 x

+ 0:008 q5 xE

5:497 10
Ks H

6K

q11 xE +

Ms H 0:458 q5L xEH Gs H

0:760 q2L xE 0:634 q9L xE + 1:017 q3L xE + 1:520 10


M H 11M H P H M P H q13 x q0 x1:884 q13 x q0 x P H 13M

Ks H

+ 0:002 q2L xEH q0 x1:558 10 + 1:363 10 10


5P H K q0 x q5 xE M H K

q13 x q5 xE

5M H K q2L xEH q5 xE 2:934

+ 1:134 10

16M H K q13 x q11 xE

1:065 q2L xEH q11 xE + 4:238 10


11M

q13 x q4L xEH 0:007 q0 x q4L xEH


8K

Ms H

P H

Ms H

2:253 10 10

q11 xE q4L xEH 8:784 + 0:008 q0 x q2L xE


P H Ks H Ks H

Ms H

12M H Ms H q13 x q5L xEH 6K

6:569 10 10
8K

q5 xE q2L xE + 1:139
P H Ks H 20M 2 q13 x

Taking into consideration the coefficients b_i, we can propose that the variable that contributes most to the potency estimation of TI compounds is ${}^{Ks}q^{H}_{2}(x_E)$. Finally, the CRT technique was used. The classification trees obtained for TS and PS are shown in Figs. S5 and S6 of the Supplementary data. The trees contain 33 nodes in total, 17 of which are terminal. The root node contains all the studied cases: 64.57% of them are strong-TIs and 35.43% are moderate-to-weak TIs. A summary of the classification of chemicals as strong-TIs or moderate-to-weak TIs following the tree is given below. A compound is classified as a strong-TI, with a probability of approximately 1 or at least 0.5, when:
Ps H Gs H i) The chemical has PqH 0 x 36.77, q15LxE 3.89, q3LxE N 31.52 (strong-TI 75.0%). Ps H ii) The chemical has PqH q15LxE 3.89, GsqH 0 x 36.77, 3LxE Ks H 31.52, KsqH x 21.34 and q x N 16.7 (strong-TI 100.0%). 2L E 9L E Ps H iii) The chemical has PqH q15LxE N 3.89, MqH 0 x 36.77, 13xE 9.0109 (strong-TI 88.24%). Ps H iv) The chemical has PqH q2LxEH 2.26 and MsqH 0 x N 36.77, 4 x N 2634.75 (strong-TI 93.24%). Ps H v) The chemical has PqH q2LxEH 2.26 and MsqH 0 x N 36.77, 4 x N 2634.75 (strong-TI 93.24%). Ps H Ms H vi) The chemical has PqH q4 x 0 x N 36.77, q2LxEH 2.26 and 8 2634.75 and MqH x 9.310 (strong-TI 96.0%). 13L E Ps H Ms H vii) The chemical has PqH q4 x 0 x N 36.77, q2LxEH 2.26 and M H 8 P H 2634.75 and q13LxE N 9.3 10 , q1LxE N 16.14 (strong-TI 90.91%). Ps H viii) The chemical has PqH q2LxEH 2.26, MsqH 0 x N 36.77, 4 x 8 P H 2634.75 and MqH x N 9.3 10 , q1LxE 16.14 and MsqH 13L E 8L xEH 12.21 (strong-TI 62.50%). Ps H ix) The chemical has PqH q2LxEH N 2.26, PqH 0 x N 36.77, 1LxE M 39.23, KsqH x 86.56 and q x N 32575.54 (strong-TI 81.82%). 2L E 2 Ps H P H x) The chemical has PqH 0 x N 36.77, q2LxEH N 2.26, q1LxE N 39.23 Ks H and q2LxE 207.38 (strong-TI 96.55%).

q11 xE q2L xE + 0:003 q0 x q9L xE


Gs H

Ks H

0:010 q0 x q3L xE 4:777 10 0:002 q0 H x TS:


P 2

P H

N = 398, λ = 0.37, χ² = 383.94, C = 0.80, Rcan = 0.80, Q = 90.70%, Specificity = 93.31%, Sensitivity = 92.22%, FP rate = 12.01%, D = 7.55
PS: N = 135, C = 0.67, Q = 84.44%, Specificity = 89.16%, Sensitivity = 86.05%, FP rate = 18.37%

With BLR, the dependent variable is coded in dichotomous categories (strong- and moderate-to-weak TI). The response of BLR is
2

Table 5
Area under the ROC curves (AUC) of the TI potency estimation models for TS and PS.

       LDA (14)a   QDA (12)a   BLR (8)a,b   CT (13)a
TS     0.945       0.965       0.933        0.934
PS     0.901       0.899       0.879        0.853

a The number of molecular descriptors included in each model is given in parentheses.
b Results of the best model.

Fig. 3. ROC curves associated to models of TI potency estimation obtained with TS.

The parameters used to assess the model's performance for TS and PS are as follows:
TS: N = 398, Q = 89.20%, Specificity = 92.80%, Sensitivity = 90.27%, FP rate = 12.77%
PS: N = 135, Q = 82.22%, Specificity = 85.23%, Sensitivity = 87.21%, FP rate = 10.44%

From the chart of predictor importance, we can conclude that the most important variable in describing tyrosinase activity is ${}^{Ms}q^{H}_{4}(x)$, indicating that atomic mass is the most important feature for tyrosinase inhibitory activity. All these techniques can thus be used to estimate the potency of TIs with adequate performance for TS and PS.

We now use the ROC curves to select the best model. Figs. 3 and 4 display the ROC curves of the four classifiers for TS and PS, respectively, and Table 5 lists the corresponding AUC values. For the TS, the AUC of the QDA-based model is the largest, followed by LDA, and BLR and CT have the lowest values. For the PS, however, LDA gives the best result, followed by QDA, BLR and CT. This behavior differs from that observed for the classification dataset. The BLR-based model uses the smallest number of variables (only 8 descriptors to describe the potency of TIs) and still shows good values of AUC and of the other parameters. From these results, we could conclude that BLR is the most adequate technique to estimate the potency of TIs.

Nevertheless, we also applied a more powerful procedure for model selection, the multiple comparison tests. According to the ranking method, QDA was the best algorithm in the training set, while LDA has the best average ranking in the test set (for more details see Table 6), in agreement with the ROC analysis. In addition, an Iman–Davenport test was carried out to detect whether significant differences exist between the algorithms in the training and test sets (see Table 7). The null hypothesis (no differences) was rejected for the training set, where the test value is higher than the critical value, and was not rejected for the test set. A Holm test was then carried out (Table 8): for the learning set, a statistically significant difference was observed between QDA and CT, and no significant differences between QDA and the rest (LDA and BLR). In the prediction set no significant differences were observed at α = 0.05.
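Holm's step-down comparison against a control algorithm can be sketched from the average rankings alone. Each algorithm is compared with the control through z = (Ri − R0)/SE, with SE = sqrt(k(k + 1)/(6N)); the two-sided normal p-values are sorted in increasing order and compared with α/i for i = k − 1, ..., 1, and once one hypothesis is accepted all the remaining ones are accepted too. As before, N = 5 partitions is an assumption that is not stated in this section; with it, the training-set rankings of Table 6 reproduce the z-values and p-values of Table 8. The function name is hypothetical.

```python
from math import erfc, sqrt


def holm_vs_control(control_rank, other_ranks, n_datasets, alpha=0.05):
    """Holm's step-down test of several algorithms against a control.

    control_rank: average rank of the control algorithm (R0).
    other_ranks: dict mapping algorithm name to its average rank (Ri).
    Returns rows (name, z, p_value, threshold, rejected), smallest p first.
    """
    k = len(other_ranks) + 1
    se = sqrt(k * (k + 1) / (6.0 * n_datasets))
    rows = []
    for name, rank in other_ranks.items():
        z = (rank - control_rank) / se
        p_value = erfc(abs(z) / sqrt(2.0))  # two-sided normal p-value
        rows.append((name, z, p_value))
    rows.sort(key=lambda row: row[2])

    results = []
    still_rejecting = True
    m = len(rows)
    for step, (name, z, p_value) in enumerate(rows):
        threshold = alpha / (m - step)  # alpha/i with i = m, m - 1, ..., 1
        rejected = still_rejecting and p_value < threshold
        if not rejected:
            still_rejecting = False  # step-down: stop rejecting from here on
        results.append((name, round(z, 4), round(p_value, 4), round(threshold, 4), rejected))
    return results


# Training-set rankings from Table 6, with QDA (1.20) as the control.
for row in holm_vs_control(1.20, {"CT": 3.6, "LDA": 2.6, "BLR": 2.6}, n_datasets=5):
    print(row)
```

Under the stated assumption, only the comparison of QDA with CT is rejected for the training set, which matches the outcome described above.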

Table 6
Average ranking of the algorithms for the potency estimation dataset.

Algorithm   Ranking (TS)a   Ranking (PS)b
QDA         1.20            1.80
LDA         2.60            1.80
BLR         2.60            2.80
CT          3.60            3.60

a Algorithm ranking corresponding to the training set.
b Algorithm ranking corresponding to the prediction set.

Table 7
Results of Iman–Davenport tests for the potency estimation dataset (α = 0.05).

Series         Test value   Critical value   Hypothesis
Training set   5.6154       3.4903           Rejected
Test set       23.7778      3.4903           Accepted

Fig. 4. ROC curves associated to models of TI potency estimation obtained with PS.


Table 8
Results of Holm's test with QDA as the control algorithm for the potency estimation dataset.

Set            i   Algorithm   z = (R0 − Ri)/SE   p-value   α/i      Hypothesis
Training set   3   CT          2.9394             0.0033    0.0167   Rejected
               2   LDA         1.7146             0.0864    0.0250   Accepted
               1   BLR         1.7146             0.0864    0.0500   Accepted

Finally, it is important to highlight that the differences observed between the algorithms in the learning series matter, because the different hyperplanes involved could help in choosing which algorithms to use first when developing multi-classifiers for modeling the potency of tyrosinase inhibitory activity.

4. Conclusion

TIs have attracted considerable interest in medicinal and cosmetic products because of their importance in the treatment of hyperpigmentation [52,53]. Recently, in silico chemoinformatics methods have proved particularly rewarding in terms of both cost and time, and they are easily integrated into the modern drug discovery process. Statistical and Artificial Intelligence techniques are important components of this toolset [54]. Here we used the non-stochastic and stochastic quadratic indices and different modeling techniques (LDA, QDA, BLR and CT) to find QSAR-based models that describe the tyrosinase inhibitory activity, classifying chemicals as TI or non-TI and estimating the potency of new TIs. All of the models showed adequate performance. We also made a comparative study of all the methods used. Analyzing each model with ROC curve theory and with methods for the experimental comparison of algorithms, we concluded that QDA was the best model for the classification dataset, although it requires a large number of parameters to be calculated. For the other dataset, QDA was also found to be the most adequate technique to estimate the potency of TIs. Finally, the contribution of this report is encouraging because it provides better results for modeling tyrosinase inhibitory activity, and it leaves the door open for other types of models, with performance as good as or better than LDA (the technique used so far by our research group), to help improve the Virtual Screening procedure. The combination of these different techniques, which rely on different features or hypersurfaces, can increase the practicality of data mining of chemical databases for the discovery or identification of novel TIs.

Acknowledgements

M-P. Y and C-M. G. M. thank the program "Estades Temporals per a Investigadors Convidats" for a fellowship to work at Valencia University (2010). C-M. G.M. also thanks Professor Cosme Santiesteban-Toca and the Departamento de Bioinformática y Automatización de Procesos Biológicos, Centro de Bioplantas, for partial support of this paper. F. T. acknowledges financial support from the Spanish MEC DGI (Project No. CTQ2004-07768-C02-01/BQU) and Generalitat Valenciana (DGEUI INF01-051 and INFRA03-047, and OCYT GRUPOS03-173). The authors also acknowledge partial financial support from the Spanish Ministry of Science and Innovation (Project reference: SAF2009-10399). Last but not least, this work was supported in part by VLIR (Vlaamse InterUniversitaire Raad, Flemish Interuniversity Council, Belgium) under the IUC Program VLIR-UCLV.

Appendix A. Supplementary data

Supplementary data to this article can be found online at doi:10.1016/j.chemolab.2010.08.016.

References

[1] A. Sanchez-Ferrer, J.N. Rodriguez-Lopez, F. Garcia-Canovas, F. Garcia-Carmona, Biochim. Biophys. Acta 1247 (1) (1995) 1–11.
[2] J.P. Ortonne, J.J. Nordlund, in: J.J. Nordlund, R.E. Boissy, V.J. Hearing, R.A. King, J.P. Ortonne (Eds.), Oxford University Press, New York, 1998, pp. 489–502.
[3] I. Kubo, I. Kinst-Hori, J. Agric. Food Chem. 47 (10) (1999) 4121–4125.
[4] Y.J. Kim, H. Uyama, Cell. Mol. Life Sci. 62 (15) (2005) 1707–1723.
[5] S. Briganti, E. Camera, M. Picardo, Pigment Cell Res. 16 (2) (2003) 101–110.
[6] A.K. Ghose, T. Herbertz, J.M. Salvino, J.P. Mallamo, Drug Discov. Today 11 (23/24) (2006) 1107–1114.
[7] E. Estrada, E. Uriarte, Curr. Med. Chem. 8 (13) (2001) 1573–1588.
[8] G.M. Casañola-Martin, M.T. Khan, Y. Marrero-Ponce, A. Ather, M.N. Sultankhodzhaev, F. Torrens, Bioorg. Med. Chem. Lett. 16 (2) (2006) 324–330.
[9] Y. Marrero-Ponce, M.T.H. Khan, G.M. Casañola-Martin, A. Ather, M.N. Sultankhodzhaev, F. Torrens, QSAR Comb. Sci. 26 (4) (2007) 469–487.
[10] Y. Marrero-Ponce, M.T.H. Khan, G.M. Casañola-Martin, A. Ather, M.N. Sultankhodzhaev, R. Garcia-Domenech, F. Torrens, R. Rotondo, J. Comput.-Aided Mol. Des. 21 (4) (2007) 167–188.
[11] Y. Marrero-Ponce, M.T. Khan, G.M. Casañola-Martin, A. Ather, M.N. Sultankhodzhaev, F. Torrens, R. Rotondo, ChemMedChem 2 (4) (2007) 449–478.
[12] G.M. Casañola-Martin, Y. Marrero-Ponce, M.T.H. Khan, A. Ather, S. Sultan, F. Torrens, R. Rotondo, Bioorg. Med. Chem. 15 (3) (2007) 1483–1503.
[13] E. Estrada, E. Uriarte, A. Montero, M. Teijeira, L. Santana, E. De Clercq, J. Med. Chem. 43 (10) (2000) 1975–1985.
[14] H. Gonzalez-Diaz, E. Uriarte, R. Ramos de Armas, Bioorg. Med. Chem. 13 (2) (2005) 323–331.
[15] A. Garcia-Garcia, J. Galvez, J.V. de Julian-Ortiz, R. Garcia-Domenech, C. Munoz, R. Guna, R. Borras, J. Antimicrob. Chemother. 53 (1) (2004) 65–73.
[16] M.T. Cronin, A.O. Aptula, J.C. Dearden, J.C. Duffy, T.I. Netzeva, H. Patel, P.H. Rowe, T.W. Schultz, A.P. Worth, K. Voutzoulidis, G. Schuurmann, J. Chem. Inf. Comput. Sci. 42 (4) (2002) 869–878.
[17] M. Negwer, Organic-Chemical Drugs and Their Synonyms, Akademie-Verlag, Berlin, 1987.
[18] G.M. Casañola-Martin, Y. Marrero-Ponce, M.T. Khan, A. Ather, K.M. Khan, F. Torrens, R. Rotondo, Eur. J. Med. Chem. 42 (11–12) (2007) 1370–1381.
[19] H. van de Waterbeemd, R.E. Carter, G. Grassy, H. Kubinyi, Y.C. Martin, S. Tute, P. Willett, Annu. Rep. Med. Chem. 33 (397) (1998).
[20] Y. Marrero-Ponce, M. Iyarreta-Veitia, A. Montero-Torres, C. Romero-Zaldivar, C.A. Brandt, P.E. Avila, K. Kirchgatter, Y. Machado, J. Chem. Inf. Model. 45 (4) (2005) 1082–1100.
[21] Y. Marrero-Ponce, R. Medina-Marrero, F. Torrens, Y. Martinez, V. Romero-Zaldivar, E.A. Castro, Bioorg. Med. Chem. 13 (8) (2005) 2881–2899.
[22] Y. Marrero-Ponce, Molecules 8 (2003) 687–726.
[23] Y. Marrero-Ponce, Bioorg. Med. Chem. 12 (2004) 6351–6369.
[24] Y. Marrero-Ponce, V. Romero, TOMOCOMD software, Central University of Las Villas, 2002. TOMOCOMD (TOpological MOlecular COMputational Design) for Windows, version 1.0, is a preliminary experimental version; in future a professional version can be obtained upon request to Y. Marrero: yovanimp@qf.uclv.edu.cu or ymarrero77@yahoo.es.
[25] R. Todeschini, V. Consonni, Handbook of Molecular Descriptors, WILEY-VCH Verlag GmbH, D-69469 Weinheim, Federal Republic of Germany, 2000, pp. 303–308, 419–424.
[26] L.B. Kier, L.H. Hall, Molecular Connectivity in Structure–Activity Analysis, Research Studies Press, Letchworth, U.K., 1986.
[27] SPSS (Statistical Software for the Social Sciences) 15, SPSS Inc., Chicago, 2006. For more information about SPSS software products, please visit the Web site at http://www.spss.com.
[28] STATISTICA (data analysis software system) 6.0, StatSoft Inc., Tulsa, OK, 2001.
[29] J. Xu, A. Hagler, Molecules 7 (2002) 566–700.
[30] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[31] J.W. Mc Farland, D.J. Gans, in: H. Waterbeemd (Ed.), VCH Publishers, Weinheim, Ger, 1995, pp. 295–307.
[32] A. Golbraikh, M. Shen, Z. Xiao, Y.D. Xiao, K.H. Lee, A. Tropsha, J. Comput. Aided Mol. Des. 17 (2–4) (2003) 241–253.
[33] R.B. Kowalski, S. Wold, in: P.R. Krishnaiah, L.N. Kanal (Eds.), North Holland Publishing Company, Amsterdam, 1982, pp. 673–697.
[34] P. Mazzatorta, E. Benfenati, P. Lorenzini, M. Vighi, J. Chem. Inf. Comput. Sci. 44 (1) (2004) 105–112.
[35] F. Provost, T. Fawcett, R. Kohavi, in: J.W. Shavlik (Ed.), Fifteenth International Conference on Machine Learning, Proceedings of the Conference, Morgan Kaufmann, 1998, pp. 445–453.
[36] F. Provost, T. Fawcett, in: J.W. Shavlik (Ed.), Third International Conference on Knowledge Discovery and Data Mining (KDD-97), Proceedings of the Conference, Newport Beach, California, August 14–17, 1997, AAAI Press, Menlo Park, CA, pp. 43–48.
[37] L.B. Lusted, Science 171 (1971) 1217–1219.
[38] T. Fawcett, Machine Learning, 2004, pp. 1–38.
[39] C.X. Ling, J. Huang, H. Zhang, in: G. Gottlob, T. Walsh (Eds.), 18th International Joint Conference on Artificial Intelligence, Proceedings of the Conference, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003, pp. 519–524.
[40] A.P. Bradley, Pattern Recognit. 30 (1997) 1145–1159.
[41] J.A. Hanley, B.J. McNeil, Radiology 143 (1) (1982) 29–36.
[42] S. García, F. Herrera, J. Mach. Learn. Res. 9 (2008) 2677–2694.
[43] A. Puris, R. Bello, F. Herrera, Expert Syst. Appl. 37 (2010) 5443–5453.
[44] D. Sheskin, Handbook of Parametric and Nonparametric Statistical Procedures, Chapman and Hall/CRC, Florida, 2006.
[45] R.L. Iman, J.M. Davenport, Commun. Stat. 9 (1980) 571–595.
[46] S. Holm, Scand. J. Stat. 6 (2) (1979) 65–70.
[47] S. Wold, L. Erikson, in: H. van de Waterbeemd (Ed.), VCH Publishers, New York, 1995, pp. 309–318.
[48] P. Baldi, S. Brunak, Y. Chauvin, C.A. Andersen, H. Nielsen, Bioinformatics 16 (2000) 412–424.
[49] A.P. Worth, M.T.D. Cronin, J. Mol. Struct. (Theochem) 622 (2003) 97–111.
[50] C.L. Russom, S.P. Bradbury, S.J. Broderius, R.A. Drummond, D.E. Hammermeister, Environ. Toxicol. Chem. 16 (1997) 948–967.
[51] E. Papa, F. Villa, P. Gramatica, J. Chem. Inf. Model. 45 (2005) 1256–1266.
[52] E.I. Solomon, U.M. Sundaram, T.E. Machonkin, Chem. Rev. 96 (1996) 2563–2605.
[53] K. Jones, J. Hughes, M. Hong, Q. Jia, S. Orndorff, Pigment Cell Res. 15 (5) (2002) 335–340.
[54] R.V.C. Guido, G. Oliva, A.D. Andricopulo, Curr. Med. Chem. 15 (37) (2008) 37–46.
