Abstract
A method is described which is used to construct a descriptor vector of solid catalysts in
the oxidation of propene. Different methods are described which allow one to construct a
correlation between characteristics of the catalysts and their performance in propene
oxidation. Successful descriptor vectors are generated which predict catalytic performance
substantially better than statistically expected. These descriptor vectors do not contain
explicit information on the elemental composition of the catalysts any more, but only
parameters that are either derived from the elemental composition, such as the enthalpy
of oxide formation, or are related to the synthetic method. The general concept can
probably be extended to the development of descriptors for solids to be used in other
applications as well.
78 QSAR Comb. Sci. 2005, 24 DOI: 10.1002/qsar.200420066 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim
Design of Discovery Libraries for Solids Based on QSAR Models
methods based on descriptors have not been applied to catalysts so far, even if data-mining methods have been recently used [8 – 13]. Therefore, there is no method today for the library design of solids that can guarantee that the properties of the solids will be diverse. It is the high complexity of solids, as compared to molecules or drugs, which makes the design of libraries a serious challenge [14]. One of the main hurdles is the description of a solid, especially if the solid has not been synthesized in reality and, therefore, no characterization results are available.
The motivation of this work is to demonstrate for the first time an implementation strategy for the building of a QSAR analogue model for solids. Since the solids are not represented as structures, in the following, the term QPAR (Quantitative Property – Activity Relationship) will be used. The methodology to generate and select relevant descriptors is described. Different QPAR models are compared and discussed with respect to previous knowledge and the available literature. A short account on some aspects of this work has been given recently [15].
2 Materials and Methods
2.1 Workflow Overview
Figure 1 gives a general description of the workflow used in this study, which, however, is generic and should be applicable to other problems as well. The data-processing steps to generate the catalyst descriptors are shown on the left side in Figure 1, while the workflow to assess and classify the catalytic behavior experimentally is depicted on the right side. The whole process consists of the following stages: (1) collection and synthesis of a diverse library of solids, in which the selection is based upon experience and chemical intuition; (2) description of these solids by numerous attributes; (3) selection of a subset of relevant and uncorrelated attributes to create a first possible descriptor vector; (4) high-throughput (HT) testing of the whole library of solids in a catalytic reaction and classification of the catalysts in distinct classes of performance, resulting in clusters containing solids which exhibit similar catalytic properties; and (5) computing QPAR models between descriptor vectors and catalytic performances, where in this process the descriptor vectors are further modified. Details of each step are reported in the following sections.
2.2 Sample Collection and HT Synthesis
For the first stage, we collected approximately 500 different solids with the aim to cover a wide chemical space. Hence, samples had to be as diverse as possible with respect to the elements, the material types, and the synthesis procedures, which also means that compounds were included in the study for which it was clear that they were essentially inactive in the target reaction, such as silica. This initial diverse library was created based on a priori knowledge and intuition. A first selection of 367 catalysts was carried out among solids already available in our laboratories. Then, 100 additional solids were synthesized by means of HT equipment in order to expand the chemical space with respect to the element and support representation.
Figure 1. General workflow for the development of descriptors for solid catalysts; for explanation, see text.
Full Papers D. Farrusseng et al.
Thus, we ensured that most of the elements of the periodic table (excluding exotics) were represented, and that the occurrence of each element was well distributed over the library. As it is far beyond the scope of this work, the preparation method of each solid catalyst will not be discussed in detail. However, synthetic methods include impregnation, ion exchange, precipitation, co-precipitation, deposition – precipitation, the activated-carbon route [16], sol-gel synthesis, and others.
2.3 Computation of Catalyst Attributes
A Microsoft Access database was implemented to record the description of the catalyst synthesis and to calculate automatically thousands of attributes (also called meta data) for each catalyst. A variety of different information is recorded in the database, namely, (1) synthesis parameters of the catalysts, (2) elemental composition of the catalysts, (3) properties of constituent elements, (4) properties of constituent-element oxides, and (5) properties of constituent-element ions (Figure 2). The entries concerning the properties of elements, element oxides, and element ions were collected from the Handbook of Chemistry and Physics [17] and other sources of physico-chemical data. The workflow that describes the combinatorial generation of the attributes is shown in Figure 3. By combining the elemental composition of a catalyst and the respective properties of the elements, ions, or oxides, 3100 attributes were computed in a combinatorial manner using operators such as the average of a certain property for the constituents of the catalyst, the maximum, the minimum, weighted averages, and so on. For instance, from the enthalpies of formation of the element oxides, one can compute the average enthalpy of formation, or the spread between the highest and the lowest enthalpy of formation. Such values should, in some complex way, be related to the availability of oxygen at the surface of a solid catalyst, and thus to the performance in an oxidation reaction. The motivation to generate a vast number of attributes is the fact that relevant and discriminative attributes are a priori not known, and obviously the relationship between properties of a catalyst and its performance is not simple.
Finally, different types of attribute sets were generated: X1 accounted for the composition of all catalysts (60 in total), X2 consists of parameters calculated on the basis of X1 and physical data (3100 in total), and X3 contains synthesis parameters of the solid catalysts. The last attribute set consists of 19 categorical (mostly binary) variables which provide information on the last synthesis step, namely main synthesis parameters, precursor types, additive types, etc.
2.4 Search Space Reduction – the Descriptor Vector
The number of attributes must be reduced since there are no modeling techniques available which are able to deal with 3000 attributes. We assumed that a set of around 75 attributes would be a maximum with respect to the number of catalysts (467), otherwise the risk of self-learning would be too high. The task thus consists of selecting discriminative attributes that are not strongly correlated with one another. A ranking of the 3119 attributes (X2 and X3 attribute sets), based on their discriminative power, was carried out using the Relief algorithm [18]. Then, 56 attributes were selected from the 3100 X2 attributes with respect to their ranking and in order to map different properties (Table 1). This ensemble of 56 X2 attributes plus the 19 X3 attributes forms a first descriptor vector which is used as input for the QPAR model (and which was further reduced during the model building). Note that this descriptor vector does not contain information on the elemental composition any more. This only enters the vector via the computed values that are correlated to the elements present in the catalyst.
2.5 HT Testing and Clustering of Catalytic Performance
The library of solids has been tested in the gas-phase oxidation reaction of propene with oxygen. This reaction offers the advantage of providing a wide spectrum of possible products from C1 to C6, including a number of alkenes and oxygenates. A complete list is given in Table 2. Data analysis serves to classify the performances of all catalysts into a small number of distinct classes that are representative of typical catalytic behavior. An automated HT set-up has been used to evaluate the performance of each catalyst. It consists of a set of mass-flow controllers for oxygen, propene, and nitrogen, which feed the reagents via a common line into a 16-fold plug flow reactor. The principal set-up of this system corresponds to the one described in Ref. [19]. A gas chromatograph equipped with a capillary column and flame-ionization detector in combination with a methanizer has been used for analysis. Catalytic tests were carried out with a gas consisting of 1% propene, 5% oxygen (slightly over-stoichiometric for full oxidation to water and carbon dioxide), and nitrogen as balance, at a space velocity of 225 mL h⁻¹ (g_cat)⁻¹ at five different temperatures (200, 250, 300, 400, and 500 °C). The catalyst was used as grains of between 250 and 500 µm in size. Propene conversion and the selectivity to 27 products were determined for each catalyst. Each test was repeated twice. Each full cycle to analyze 16 catalysts needs 8 h. This means that the catalysts were not analyzed at the same time on stream. However, the activation or deactivation information was also taken into account to some extent by calculating the difference between the propene conversion of the first and the second measurement. Using this procedure, the performance of the 467 catalysts was described by 120 variables: propene conversion, 21 selectivities, deactivation behavior, and mass balance (a measure for possible coking or residue formation on the catalysts) at five different temperatures. Because of obvious strong correlations in product selectivities accompanied with low variance, some composite variables were constructed by the combination of variables to reduce the complexity of the problem: The variables mass balance at all temperatures and temporal behavior at 200 °C, 250 °C, and 300 °C have been discarded, as their information content was extremely low. It is expected that this will be different for other reaction conditions under net reducing conditions, where coke formation can be a serious problem. 17 variables with high information content have been taken directly for analysis: conversion, selectivity to CO and to CO2 at all five temperatures, and temporal behavior at 400 °C and 500 °C. Four variables
Figure 3. Scheme for the generation of the attribute set X2. For the elements only one operation was necessary; for oxides and ions two steps were needed: first the different states for one element had to be taken into account, and in the second step the different constituents in the catalyst were generated (SD = standard deviation).
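The one-step aggregation used for element properties (mean, weighted mean, minimum, maximum, and spread over the constituents of a catalyst) can be illustrated with a minimal Python sketch. This is not the authors' Access implementation: the function name `aggregate`, the example composition, and the attribute names (built by analogy with the code prefixes `meanec_`, `meanbc_`, `wmec_`, `minec_`, `maxec_`, `difec_` of Table 1) are assumptions made for illustration only. The electronegativities are standard Pauling values.

```python
from statistics import mean

# Pauling electronegativities (standard tabulated values).
PAULING_EN = {"Bi": 2.02, "Mo": 2.16, "O": 3.44}
# Elements excluded from the "bc" subset (metals and semi-metals only).
NON_METALS = {"O"}


def aggregate(composition, prop, suffix):
    """Return aggregate attributes of one element property for one catalyst.

    composition: dict element -> molar fraction (assumed to sum to 1)
    prop:        dict element -> property value
    suffix:      short property code appended to each attribute name
    """
    elements = list(composition)
    values = [prop[e] for e in elements]
    base = [prop[e] for e in elements if e not in NON_METALS]
    return {
        f"meanec_{suffix}": mean(values),                 # mean, all elements
        f"meanbc_{suffix}": mean(base),                   # mean, metals/semi-metals
        f"wmec_{suffix}": sum(composition[e] * prop[e] for e in elements),
        f"minec_{suffix}": min(values),
        f"maxec_{suffix}": max(values),
        f"difec_{suffix}": max(values) - min(values),     # highest minus lowest
    }


# Hypothetical composition, used only to exercise the function.
catalyst = {"Bi": 0.2, "Mo": 0.2, "O": 0.6}
attrs = aggregate(catalyst, PAULING_EN, "pe")
for name, value in attrs.items():
    print(f"{name}: {value:.3f}")
```

For oxide and ion properties, the caption above implies an extra first step (collapsing the several oxides or ions of one element into a single value) before an aggregation of this kind is applied.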
with medium information content were taken into account by summing up the respective variable for all five temperatures, giving formaldehyde_{all temperatures}, acetaldehyde_{all temperatures}, acrolein_{all temperatures}, and benzene_{all temperatures}. Variables that were assumed to be important from a chemical point of view, but had a low variance, were taken into account by grouping products with similar chemical function or chemical structure. The values of the respective groups for all temperatures were summed up, giving Σ C3H6O_{all temperatures} (propionaldehyde and acetone), Σ acids_{all temperatures} (acetic acid and acrylic acid), Σ alkanes_{all temperatures} (ethane, butane, and pentane), Σ C4 hydrocarbons_{all temperatures} (1-butene, 2-butene, and 2-methylpropene), Σ C5 hydrocarbons_{all temperatures} (2-methyl-1-butene, 2-methyl-2-butene, and pentenes), and Σ C6 hydrocarbons_{all temperatures} (Σ C6H12). Then, a principal component analysis (PCA) was carried out to decrease the dimensionality and to orthogonalize the dataset. From the scores on the PC axes, distinct catalytic classes were generated by means of hierarchical clustering (Ward's distance) and the k-means technique as implemented in Statistica 6.1.
Table 1. A list of the 56 attributes from the attribute set X2 used for the correlation. 1 – 24 are element properties and 25 – 56 are oxide properties. Since often more than one oxide exists, an additional operation to calculate the value for an element is necessary, for instance, the mean of the densities of different oxides for one element.
Number | Code | Property | Calculation for catalyst
1 | meanbc_1ie | first ionization energy | mean, all metals and semi-metals
2 | meanbc_ar | atomic radius | mean, all metals and semi-metals
3 | difec_ar | atomic radius | difference from highest to lowest value
4 | meanbc_bseo | bond strength element – oxygen | mean, all metals and semi-metals
5 | difec_bseo | bond strength element – oxygen | difference from highest to lowest value
6 | meanbc_bsee | bond strength element – element | mean, all metals and semi-metals
7 | difec_bsee | bond strength element – element | difference from highest to lowest value
8 | meanec_ea | electron affinity | mean, all elements
9 | meanbc_ea | electron affinity | mean, all metals and semi-metals
10 | difec_ea | electron affinity | difference from highest to lowest value
11 | meanec_pe | Pauling electronegativity | mean, all elements
12 | meanbc_pe | Pauling electronegativity | mean, all metals and semi-metals
13 | difec_pe | Pauling electronegativity | difference from highest to lowest value
14 | minec_nffefmsmo | normalized formation free-enthalpy for most stable metal oxide | minimum value, all elements
15 | maxec_nffefmsmo | normalized formation free-enthalpy for most stable metal oxide | maximum value, all elements
16 | minec_sedmsmoom | smallest formation free-enthalpy difference from the most stable metal oxide to another metal oxide | minimum value
17 | maxec_sedmsmoom | smallest formation free-enthalpy difference from the most stable metal oxide to another metal oxide | maximum value, all elements
18 | wmec_ms | molar mass | weighted mean, all elements
19 | meanec_ms | molar mass | mean, all elements
20 | meanbc_ms | molar mass | mean, all metals and semi-metals
21 | wmbc_no | number of element oxides | weighted mean, all metals and semi-metals
22 | minec_no | number of element oxides | minimum value, all elements
23 | maxec_no | number of element oxides | maximum value, all elements
24 | nvec | number of elements in catalyst | number, all elements
2.6 QPAR Model
The task here consists of modeling the correlation between the descriptors (56 + 19) and the performance clusters, which represent typical catalytic behavior. In other words, we seek classification models that enable us to assign a catalyst to a specific cluster. The model quality is assessed by means of prediction-rate criteria. Two classification techniques were used: Artificial Neural Networks (ANN) and Classification tree as implemented in Statistica 6.1.
In the search for appropriate ANN models, both Multi-Layer Perceptron (MLP) and Probabilistic Neural Networks (PNN) were applied using the Intelligent Solver of Statistica. In order to get robust models, we have enabled the solver to discard attributes during the screening of the neural networks. In addition, after models were built, a pruning was performed for discarding irrelevant variables. Pruning refers to a sensitivity analysis, i.e., a ranking of variables based on their discriminative power. The dataset of 467 catalysts was randomly divided into three subsets. The learning step was performed on one half of the whole dataset, the verification step on one quarter, and the independent testing step on the remaining quarter.
For the Classification tree models, we used the C&RT method with the Gini criterion as splitting condition and FACT-style direct stopping as pruning rule (Statistica 6.1). Cross-validation was carried out on one third of the dataset to ensure that models were not prone to overlearning.
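The random division of the 467 catalysts into learning, verification, and test subsets, and the Gini criterion used to choose tree splits, can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not a reproduction of the Statistica 6.1 routines: the function names, the fixed random seed, and the integer rounding of the subset sizes are all choices made here for the example.

```python
import random
from collections import Counter


def split_dataset(items, seed=0):
    """Shuffle and split into learning (1/2), verification (1/4), test (rest)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_learn, n_verify = n // 2, n // 4
    return (shuffled[:n_learn],
            shuffled[n_learn:n_learn + n_verify],
            shuffled[n_learn + n_verify:])


def gini(labels):
    """Gini impurity 1 - sum(p_k^2) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(rows, labels, attribute, threshold):
    """Impurity decrease obtained by splitting on attribute <= threshold."""
    left = [y for x, y in zip(rows, labels) if x[attribute] <= threshold]
    right = [y for x, y in zip(rows, labels) if x[attribute] > threshold]
    if not left or not right:
        return 0.0  # a split that leaves one side empty separates nothing
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Catalysts are represented here only by their index 0..466.
learn, verify, test = split_dataset(range(467))
print(len(learn), len(verify), len(test))
```

A tree builder would evaluate `split_gain` for every candidate attribute and threshold and keep the split with the largest impurity decrease; stopping and pruning rules are outside the scope of this sketch.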
2.7 Quality Assessment of QPAR Models
The assessment was performed by comparing the predictions made by the model with our observations. The results are reported in a so-called confusion matrix. This table reveals the number of correctly classified catalysts, how many catalysts were misclassified, and for which classes the misclassification occurred. The prediction rate is derived from the confusion matrix to estimate the quality of the prediction. It accounts for the correctly classified cases in the respective predicted class and can be considered as a statistical benchmark for the quality of the prediction, since it can directly be compared to the “ratio” given in the confusion matrix. This ratio gives the statistical distribution probabilities into the respective classes for the original dataset. Hence, a model becomes meaningful when the prediction rates are higher than the corresponding ratios.
3 Results
the variance, which indicates the high dimensionality of the search space (Figure 4). In addition, the loading plots PC1 vs. PC2 and PC3 vs. PC4 enable us to visualize the fairly good covering of the variable search space for the attributes which contain most information (Figures 5a and
Figure 6. a) Typical distribution of normalized values shown for the example of the attribute meanecmie (#49). b) Box and whisker
plot for the distribution of the elements in the catalysts. Since most catalysts only contain few elements, most values for a given cata-
lyst are zero. Since most materials are oxides, only the median for oxygen differs substantially from zero. Quartiles significantly differ-
ing from zero are only observed for silicon and aluminium, which occur frequently in the supports.
which represent the different obvious catalytic behavior (coverage), (2) to avoid that a class is represented by fewer than 15 catalysts of the whole population (representativeness), and (3) to get classes as distinct as possible. Additionally, the goal of the clustering was to identify each cluster with distinct chemical behavior. With respect to these criteria, the hierarchical clustering indicates that four distinct classes can be generated when cutting at a linkage distance of 150 (Figure 8). On the other hand, k-means clustering enables us to generate five distinct classes with a higher degree of discrimination. The distribution of the 467 catalysts in the five clusters is shown in Figure 9. It reveals that four classes encompass about 100 catalysts each, whereas the last class contains only 17 catalysts, which represent about 4% of the whole dataset. The results of the k-means clustering are shown on the PCA score plot (Figure 10). For further analysis, these five clusters were investigated, since they fulfill all our requirements. In addition, it is possible to assign a specific catalytic behavior to each class. The chemical significance of each of the clusters is:
cluster #1: low conversion, high selectivity to CO2,
cluster #2: medium conversion, high selectivity to CO2,
cluster #3: low conversion, high selectivity to CO, partial oxidation products,
cluster #4: low selectivity to (CO2 + CO), hydrocarbons, and
cluster #5: high conversion, high selectivity to CO2.
3.3 QPAR Model
After the selection of the 56 (X2) plus 19 (X3) attributes and the identification of five distinct classes of catalytic behavior, a model was built in order to establish Quantitative Property – Activity Relationships between the attributes and the classes. In addition, because models can highlight the most discriminative attributes, a last selection of attrib-
diagonal of a loss matrix is always equal to zero, because a correct classification has zero costs. The value 1 indicates no adjusted preference, while all values > 1 induce a preference for certain classifications. With the loss matrix as set, it becomes more “costly” to misclassify cases from small clusters than from large clusters. In other words, using a loss matrix makes it possible to improve the prediction rates on smaller classes. This is desirable, because these small clusters contain the catalysts producing partial oxidation products or hydrocarbons, which are more valuable products than CO2 and CO. The search for the best combinations of misclassification costs was carried out by trial and error to obtain high prediction rates in all five classes.
Table 4. Loss matrix for the Classification tree analysis, which indicates the “costs” for misclassification of catalysts in the respective classes.
Loss matrix    1     2     3     4     5
1-predicted    0     1     4     5     1
2-predicted    3     0     1.2   5     3
3-predicted    1.1   1     0     1.5   1
4-predicted    1     1     1     0     1
5-predicted    1     1.5   1.1   1.1   0
The best classification tree yielded a model based on only 23 attributes, 33 split nodes, and 34 terminal nodes (leaves). Because the large size of the classification trees results in a complex scheme, a complete graph representing the tree structure cannot be shown here. However, the first nodes and a few leaves of a simpler model are depicted in Figure 11 as an example. The bar chart on top of the tree presents the distribution of the catalysts in the five clusters as already shown in Figure 9. The whole dataset is first divided into two smaller datasets according to the most discriminative splitting rule, i.e., whether catalysts show values for maxec_nffefmsmo (#15) higher or lower than 153.1. Then, successive splitting conditions are used to generate nodes that are aimed at creating clusters which are clearly separated from each other. At the end of each branch, the terminal nodes contain the results of the predictions. For example, when all splitting conditions which define the terminal node * are fulfilled, the prediction rate is 84% with respect to cluster #1, whereas the initial probability of cluster #1 is 25% (node 1). The results of the terminal nodes are gathered by cluster prediction and then reported in the confusion matrix (Table 5), which shows the prediction rates for all catalyst classes and for the learning and cross-validation datasets, respectively.
Judging from the prediction performance, it is obvious that the learning has proceeded rather well since the model can predict all five clusters satisfactorily, with an overall prediction rate of 0.68. In order to validate the model, a prediction test was carried out with independent catalysts (one third of the dataset). Also here, the prediction rates for each class are well above the distribution probabilities, although the prediction performance is significantly inferior with respect to the training set. In addition, the good prediction for the catalytic behavior #4 has to be pointed out. Indeed, the model allows us to predict, with a confidence rate of around 40%, that a catalyst would belong to the class of partial oxidation catalysts. In contrast,
Figure 11. Schematic representation of the first nodes and leaves of the classification tree. The numbers in the boxes and the bars
represent the number of cases left in each class after application of the corresponding splitting rule.
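The quality measures of the confusion matrix (Table 5) follow directly from the matrix counts. The Python sketch below recomputes them for the test block of Table 5, under the reading that rows are predicted classes and columns are observed classes; with that reading the “ratio” is the column sum divided by the total, the prediction rate is the diagonal entry divided by the row sum, and the last column is the diagonal entry divided by the column sum. The function name `summarize` and the dictionary keys are hypothetical.

```python
# Rows: predicted class 1-5; columns: observed class 1-5 (test block of Table 5).
TEST_MATRIX = [
    [18, 7, 14, 0, 7],
    [6, 19, 10, 1, 9],
    [11, 2, 12, 0, 0],
    [2, 3, 2, 5, 0],
    [3, 12, 1, 0, 14],
]


def summarize(matrix):
    """Return per-class ratio, prediction rate, and sensitivity."""
    n = len(matrix)
    total = sum(map(sum, matrix))
    row_sums = [sum(row) for row in matrix]
    col_sums = [sum(row[j] for row in matrix) for j in range(n)]
    return {
        # share of each observed class in the dataset (chance level)
        "ratio": [col_sums[k] / total for k in range(n)],
        # fraction of predictions for class k that were correct
        "pred_rate": [matrix[k][k] / row_sums[k] for k in range(n)],
        # fraction of observed class-k catalysts that were found
        "sensitivity": [matrix[k][k] / col_sums[k] for k in range(n)],
    }


stats = summarize(TEST_MATRIX)
for key, values in stats.items():
    print(key, [round(v, 2) for v in values])

# A model is meaningful when every prediction rate exceeds the class ratio.
meaningful = all(p > r for p, r in zip(stats["pred_rate"], stats["ratio"]))
print("better than chance in every class:", meaningful)
```

The rounded values agree with the test rows of Table 5, which supports the row/column reading assumed here.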
with random selection, one would only have a 4% chance of classifying such a catalyst.
From the tree structure and splitting conditions, 34 rules are derived, yielding an explicit model. A prediction rule corresponds to the ensemble of splitting conditions from the top node to a terminal node. The collection of all pathways to each terminal node in text form results in a “recipe” which enables us to predict straightforwardly the catalytic behavior of a new catalyst, in contrast to ANN. As an example, some of the 34 rules are reported in Table 6.
Table 5. Contingency table for the Classification tree analysis which reports the predictions for the training and testing datasets.
training      1   2   3   4   5   sum   ratio   pred. rate   sensitivity
1 predicted  49  14  13   0   4    80    0.25         0.61          0.64
2 predicted   5  58   0   0   7    70    0.28         0.83          0.67
3 predicted  16   1  54   1   1    73    0.24         0.74          0.74
4 predicted   6   5   5  10   1    27    0.04         0.37          0.91
5 predicted   1   8   1   0  49    59    0.20         0.83          0.79
sum/mean     77  86  73  11  62   309                 0.68          0.75
test          1   2   3   4   5   sum   ratio   pred. rate   sensitivity
1 predicted  18   7  14   0   7    46    0.25         0.39          0.45
2 predicted   6  19  10   1   9    45    0.27         0.42          0.44
3 predicted  11   2  12   0   0    25    0.25         0.48          0.31
4 predicted   2   3   2   5   0    12    0.04         0.42          0.83
5 predicted   3  12   1   0  14    30    0.19         0.47          0.47
sum/mean     40  43  39   6  30   158                 0.44          0.50
4 Discussion
In the course of the work on this project, several issues were encountered which are important and which shall be addressed in this section, together with the discussion of the results and the wider implications this work may have.
4.1 Synthesis Coding
Appropriate encoding of the synthesis procedure, even in a simplified form, is a very difficult task if one deals with solid catalysts. On the other hand, the synthesis protocol often is as important as the chemical composition with respect to the catalytic performance. Great care therefore has to be taken to capture the essentials of the synthetic procedure.
In general, all solids have been synthesized in several steps. However, the number of steps is a matter of definition. For example, Cu/MCM-41 (MCM-41: mesoporous silica as discovered by scientists of Mobil Oil Corporation [20]) would intuitively be classified as a two-step reaction (preparation of the support, which is not commercially available and has to be synthesized, and subsequent impregnation), while Cu/SiO2 made by impregnation of a commercial SiO2 support would normally be assumed to be a one-step reaction. However, although the support was bought from a supplier, its synthesis, for instance via flame pyrolysis, should also be considered as a synthetic step, resulting, altogether, in a two-step reaction. Considering an ion-exchange reaction, it is also a matter of definition whether each exchange step counts as a reaction step or whether the whole exchange procedure is considered as a single step. This creates a problem in coding the synthesis procedure: the more steps that are defined and the more precisely each single step is described, the more entries are obtained that are equal to zero and therefore carry no information. This results from the fact that each catalyst has to be described with the same attributes in order to allow meaningful correlation in the model-building step. For example, assuming that each catalyst should be described by four synthesis steps and that each step consists of 15 parameters (altogether 60 parameters), the entry for a catalyst synthesized in a single step (for instance by precipitation) would contain at least 45 variables without any information. Hence, the goal was to find a good compromise between a precise description and a sufficiently simple coding, by either discarding or regrouping information to avoid the above-mentioned problem. For this study, it was decided to restrict the information to the last synthetic step, which was encoded by 19 different categorical attributes coding the type of the synthesis reaction (such as ion-exchange or impregnation), solvents, precursors, and the presence of supports and additives, such as chlorine, alkali metal, or others.
In a more advanced stage of this technology and on a broader data basis of catalysts, one can – and should – cer-
Table 6. Examples of rules that allow the prediction of the performance cluster into which a catalyst falls. For an explanation of the symbols, see Table 1. Both sets of rules predominantly sort catalysts into cluster #1.
Cluster 1 2 3 4 5 1 2 3 4 5
Terminal node 14 Terminal node 26
Cases 10 3 0 0 0 13 0 3 0 0
Rule 1 maxec_nffefmsmo 153.1 maxec_nffefmsmo 153.1
Rule 2 meanbc_bseo 185.3 meanbc_bseo 185.3
Rule 3 meanec_ea 0.75 meanec_ea > 0.75
Rule 4 meanbc_difie_l 0.09 meanec_difie_l > 0.11
Rule 5 sumec_minvhosie_os 7
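A rule path of this kind is simply a conjunction of threshold conditions on descriptor values, which is what makes the tree model explicit. The Python sketch below shows how such a path could be evaluated for a new catalyst. It is an illustration only: it encodes just the two conditions of terminal node 26 whose comparison operator is legible in Table 6 (the other conditions are deliberately left out rather than guessed), and the function name `matches`, the operator table, and the candidate descriptor values are hypothetical.

```python
import operator

# Supported comparison operators for a splitting condition.
OPS = {">": operator.gt, "<=": operator.le}

# Partial rule path for terminal node 26: only the conditions whose
# operator is legible in Table 6 are included (an incomplete path).
NODE_26_PARTIAL = [
    ("meanec_ea", ">", 0.75),
    ("meanec_difie_l", ">", 0.11),
]


def matches(rule_path, descriptor):
    """True if the descriptor vector satisfies every condition of the path."""
    return all(OPS[op](descriptor[name], threshold)
               for name, op, threshold in rule_path)


# Hypothetical descriptor values for a candidate catalyst.
candidate = {"meanec_ea": 0.9, "meanec_difie_l": 0.2}
print(matches(NODE_26_PARTIAL, candidate))
```

With the full set of 34 paths, a new catalyst would be assigned to the cluster that dominates the terminal node whose path it satisfies.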
the decisive factors that influence the catalytic behavior of solids in certain catalytic reactions.
For the different types of ANN that were tested in this study, certain attributes were consistently selected as input variables by many ANN. If one analyzes these attributes, one can identify certain trends. Considering the attributes related to properties of the elements, variables based on the atomic radius (ar, #2, 3), the electron affinity (ea, #8, 9, 10), the normalized formation free-enthalpy of the most stable metal oxide (nffefmsmo, #14, 15), and the smallest energy difference between the most stable metal oxide and another metal oxide (sedmsmoom, #16, 17) seemed to be of major importance. The number of elements in a catalyst (nvec, #24) was also significant. When examining the attributes related to the element oxides, only two attributes based on the melting point (mp, #31, 32) seemed to have any significance. When analyzing the attributes related to element ions, the ionic radius (ir), coordination number (cn, #35 – 38), and ionic covalent parameter (icp, #39 – 42) seemed to be the most important variables. The list of synthesis attributes contains the highest number of attributes (eight) that were selected as input variables in all networks. This is reasonable from a chemical point of view, as synthesis parameters refer directly to the experimentally tested solid catalyst.
All types of analysis revealed that the stability of oxides has a strong impact on the performance of catalysts in the oxidation of propene. This is a result that chemical intuition would also have given. However, in the framework of this study this conclusion was discovered without additional interference by a chemist, and one could therefore, with some justification, say that the methodology used in the framework of this investigation has implemented chemical intuition on a basic level in a software program.
5 Conclusions
We have described the implementation of a methodology that is the basis for a virtual screening of complex solids with respect to their catalytic properties. These solids are generated at random from available elements via a set of different synthetic procedures. Screening in-silico then allows one to identify those samples which should be experimentally investigated. Since the coding is set up in a manner that corresponds relatively closely to a synthetic procedure, there is a high probability that the suggested samples
information on the relevant properties that a catalyst should have for a specific reaction.
At this point it is not clear just how general descriptor vectors for catalytic reactions will be, i.e., whether a descriptor for propene oxidation catalysts, such as developed in this study, will also be valid for other alkenes or even for hydrocarbon oxidation reactions. The establishment of a much broader database is necessary to verify and further develop the descriptor concepts for catalysts. It is expected that there will be an intimate interplay between the refinement of descriptors and the testing of new solids in catalytic reactions. The broader the experimental database becomes, the more discriminative will be the descriptors developed on this basis, which in turn allows more focused catalytic testing.
In addition, the concept seems to be more broadly applicable. Any field in which the correlation between properties of solids and performance in a given application is complex and development is, to a large extent, empirically governed, may benefit from the possibility of virtual screening as the first stage of a high-throughput program. It will be interesting to see how fast these methods will find their way into the laboratories active in this field.
Acknowledgements
We thank the Marie Curie Fellowship Association and region Rhône-Alpes (Programme EuroDoc) for having supported the students' mobility and training. In addition, we would like to thank the Leibniz program of the DFG and the FCI who provided funding in addition to the basic funding by the Max-Planck-Gesellschaft and the CNRS.
References
[1] I. E. Maxwell, Nature 1998, 394, 325.
[2] B. Jandeleit, D. J. Schaefer, T. S. Powers, H. W. Turner, W. H. Weinberg, Angew. Chem. Int. Ed. 1999, 38, 2494.
[3] W. F. Maier, Angew. Chem. Int. Ed. 1999, 38, 1216.
[4] S. Senkan, Angew. Chem. Int. Ed. 2001, 40, 312.
[5] Y. Yamada, T. Kobayashi, Chem. Sens. 1999, 15, 100.
[6] J. M. Newsam, F. Schüth, Biotechnol. Bioeng. 1999, 61, 203.
[7] C. Klanner, D. Farrusseng, L. Baumes, C. Mirodatos, F. Schüth, QSAR Comb. Sci. 2003, 22, 729.
[8] D. Wolf, O. V. Buyevskaya, M. Baerns, Appl. Catal., A: Gen. 2000, 200, 63.
can indeed be synthesized. [9] U. Rodemerck, M. Baerns, M. Holena, D. Wolf, Appl. Surf.
This approach could be very valuable, especially in reac- Sci. 2004, 223, 168.
tions where no good lead is available as yet, since for such [10] A. Corma, J. M. Serra, A. Chica, in Principles and Methods
reaction thousands of samples may have to be tested be- for Accelerated Catalyst Design and Testing (Eds.: E. G.
fore some activity is discovered at all. Prescreening to re- Derouane, V. Parmon, F. Lemos, F. R. Ribeiro), Kluwer,
Dordrecht, The Netherlands 2002, p. 153.
duce the number of tests to be performed is therefore al-
[11] J. M. Serra, A. Corma, E. Argente, S. Valero, V. Botti,
most mandatory. Moreover, if descriptors with high pre- Appl. Catal., A: Gen. 2003, 254, 133.
dictive power are discovered, one may be able to extract [12] J. M. Serra, A. Corma, D. Farrusseng, L. Baumes, C. Miro-
datos, C. Flego, C. Perego, Catal. Today 2003, 82, 67.
[13] D. Farrusseng, L. Baumes, C. Mirodatos, in High-Throughput Analysis: A Tool For Combinatorial Materials Science (Eds.: R. A. Potyrailo, E. J. Amis), Kluwer Academic/Plenum Publishers, New York 2004, p. 551.
[14] J. Cawse, Experimental Design for Combinatorial and High Throughput Materials Development, John Wiley & Sons, Weinheim, Germany 2002.
[15] C. Klanner, D. Farrusseng, L. Baumes, M. Lengliz, C. Mirodatos, F. Schüth, Angew. Chem. Int. Ed. 2004, 43, 5347.
[16] M. Schwickardi, T. Johann, W. Schmidt, F. Schüth, Chem. Mater. 2002, 14, 3913.
[17] For example: Handbook of Chemistry and Physics, 77th Edition (Eds.: D. R. Lide, H. P. R. Frederikse), CRC Press, Boca Raton 1996–1997.
[18] K. Kira, L. Rendell, A practical approach to feature selection, in Proceedings of the 9th International Conference on Machine Learning (Aberdeen, Scotland, July 1992) (Eds.: D. Sleeman, P. Edwards), Morgan Kaufmann 1992, pp. 249–256.
[19] C. Hoffmann, A. Wolf, F. Schüth, Angew. Chem. Int. Ed. 1999, 38, 2800.
[20] a) C. T. Kresge, M. E. Leonowicz, W. J. Roth, J. C. Vartuli, J. S. Beck, Nature 1992, 359, 710.