
A BACKPROPAGATION NEURAL NETWORK FOR RISK ASSESSMENT

Ray R. Hashemi and Nancy L. Stafford
University of Arkansas at Little Rock
2801 South University Ave., Little Rock, AR 72204

ABSTRACT

The well-known methods that have been used for risk assessment are statistical models, rule-based systems, the rough sets approach, the modified rough sets approach, the ID3 algorithm, and the LA-LEARN algorithm. In this paper a neural network is used as an alternative approach. A three-layer artificial neural network (ANN) was trained using backpropagation; the topology of the network was decided based on a set of trials and errors. This network was trained to perform risk assessment on a set of toxicological data and to give a decision like the decision given by experts. The assessment (classification) ability of the resulting network was compared with the statistical approach of discriminant analysis, and the superiority of the neural network approach was established.

I. INTRODUCTION

In the risk assessment of toxic compounds the goal is to predict whether a specific toxicant has the potential to harm humans. Methods used for risk assessment are statistical models [1, 2], rule-based systems [3], the rough sets approach [4], the modified rough sets approach [5], the ID3 algorithm [6], and the LA-LEARN algorithm [7]. In all of these methods two general steps take place, which we refer to as the training and prediction steps. In the training step, a set of rules (patterns) is generated for a given dataset. These rules are used in the prediction step for risk assessment of a given toxicant. An artificial neural network (ANN) easily lends itself to these two general steps; because of that, the ANN has the potential to be used as a risk assessment tool.

The basic building block of an ANN is an artificial neuron, or node. Nodes are organized into layers, and every node in one layer has weighted connections to every node in the next layer. A node receives inputs $i_1, i_2, \ldots, i_n$ through its $n$ input connections. If the weights associated with these connections are $w_1, w_2, \ldots, w_n$, then the sum of the weighted inputs for the node is $\sum_{j=1}^{n} i_j w_j$. To determine whether the sum is large enough to excite the node, an activation function is applied to the weighted sum of the inputs to generate an output value that represents the excitement level of the node.

The training step for an ANN may be completed using one of several available training paradigms. The training paradigm of our choice is backpropagation, which is explained briefly in the next section. After training is completed, the prediction step may take place by applying the testing dataset, which the network has never encountered before, to ascertain whether the network has learned and can give the correct decision (classification) for each member of the testing dataset.

In this paper, we investigate the credibility of the neural network approach as a viable tool in the field of developmental toxicity risk assessment. To do so, (i) we train an ANN with the backpropagation paradigm and then test its prediction capability by applying the testing dataset to it, (ii) we apply the statistical model of discriminant analysis to the same training and testing datasets, and (iii) we compare the prediction power of the two approaches.
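
As an illustration of the neuron model just described, the following minimal sketch computes the output of a single node in C, the language in which our network was later coded. The function names and the use of double precision are our own choices for illustration, not details taken from the paper's implementation.

#include <math.h>

/* Sigmoid activation: squashes the weighted sum into (0, 1). */
static double sigmoid(double sum)
{
    return exp(sum) / (exp(sum) + 1.0);   /* same value as 1 / (1 + exp(-sum)) */
}

/* Excitement level of one node with n weighted input connections. */
double node_output(const double *inputs, const double *weights, int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += inputs[j] * weights[j];    /* sum of i_j * w_j */
    return sigmoid(sum);
}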

II. BACKPROPAGATION TRAINING PARADIGM


Detailed treatments of backpropagation training can be found in [8, 9]. In this section we briefly introduce backpropagation training for a three-layer net with a sigmoid function as its activation function.

As the first step, the two weight matrices were initially given small random values in the range (0, 1), since learning improves if training does not start with all weights equal. As the second step, the training pairs (the conditions of a record and the corresponding decision make a pair) were applied to the network in a random order; random selection of the training pairs as input helps to avoid local minima. The inputs to the neurons in layer $p+1$ are the outputs of the neurons in layer $p$. If the output of layer $p$ is the vector $OUT_p$ and the weight matrix associated with the input to layer $p+1$ is the matrix $W$, then the sum of the weighted inputs for all nodes in layer $p+1$ is a vector $SUM$ calculated by the matrix multiplication $SUM = OUT_p \cdot W$. For each node in layer $p+1$, the activation function for the forward propagation is applied to the node's sum to compute the output of the node,

$$out = \frac{e^{sum}}{e^{sum} + 1},$$

and the outputs of all nodes in layer $p+1$ make up the vector $OUT_{p+1}$.

As the last step, the difference between the target and the actual output was backpropagated using the derivative of the sigmoid function. This backpropagation modifies every weight in the weight matrices; in other words, to each old value of a weight a change $\Delta w$ is added to constitute its new value ($\Delta w$ is not the same for all the weights in the weight matrix). In matrix notation, $W_{new} = W_{old} + \Delta W$. The weight changes for the weight matrix between the hidden and output layers, $\Delta W_{h,o}$, are computed as

$$\Delta W_{h,o} = \eta \cdot OUT_h \cdot [OUT_o (\mathbf{1} - OUT_o)(T - OUT_o)] \qquad (1)$$

where $\eta$ is the learning rate (a positive value less than 1), $OUT_h$ is the output of the hidden layer, $OUT_o$ is the output of the output layer, $\mathbf{1}$ is a vector of ones, and $T$ is the target output vector; the product of the column vector $OUT_h$ with the row vector that follows it yields one $\Delta w$ per weight. If we refer to $OUT_o (\mathbf{1} - OUT_o)(T - OUT_o)$ as $D_o$, then the weight changes for the weight matrix between the input and hidden layers, $\Delta W_{i,h}$, are

$$\Delta W_{i,h} = \eta \cdot OUT_i \cdot [(D_o \cdot W_{h,o}^{t}) \otimes OUT_h (\mathbf{1} - OUT_h)] \qquad (2)$$

where $OUT_i$ is the output of the input layer, $W_{h,o}^{t}$ is the transpose of the weight matrix between the hidden and output layers, and $\otimes$ stands for the component-by-component multiplication of two vectors without summing.

In the training process, the training pairs were applied randomly to the network and all errors were backpropagated as described above, changing the weights in a direction which minimizes the error. A trainable neuron bias was added to speed convergence of the weights [10]. This is a constant input of +1 connected to a weight; it prevents any neuron from having an input of all zeros, which would cause $OUT$ to be zero in (1) and (2), producing no weight change.
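
The update rules (1) and (2) can be summarized in a short C sketch. The layer sizes anticipate the final topology reported later (48, 10, and 4 nodes); the learning rate value, the array layout, and the function name are illustrative assumptions rather than details of our implementation, and the trainable bias input is omitted for brevity.

#define NI 48     /* input nodes  */
#define NH 10     /* hidden nodes */
#define NO 4      /* output nodes */
#define ETA 0.25  /* assumed learning rate, "a positive value less than 1" */

/* One backpropagation weight update, following equations (1) and (2).
   out_i, out_h, out_o hold the layer outputs from the forward pass;
   target holds T. */
void backprop_update(const double out_i[NI], const double out_h[NH],
                     const double out_o[NO], const double target[NO],
                     double w_ih[NI][NH], double w_ho[NH][NO])
{
    double d_o[NO], d_h[NH];

    /* D_o = OUT_o (1 - OUT_o) (T - OUT_o), component by component. */
    for (int k = 0; k < NO; k++)
        d_o[k] = out_o[k] * (1.0 - out_o[k]) * (target[k] - out_o[k]);

    /* Equation (2) error term: propagate D_o back through the transpose
       of the hidden-to-output weights, then scale by OUT_h (1 - OUT_h). */
    for (int j = 0; j < NH; j++) {
        double back = 0.0;
        for (int k = 0; k < NO; k++)
            back += d_o[k] * w_ho[j][k];          /* D_o . W_ho^t */
        d_h[j] = back * out_h[j] * (1.0 - out_h[j]);
    }

    /* Equation (1): W_ho_new = W_ho_old + eta * (OUT_h outer-product D_o). */
    for (int j = 0; j < NH; j++)
        for (int k = 0; k < NO; k++)
            w_ho[j][k] += ETA * out_h[j] * d_o[k];

    /* Equation (2): W_ih_new = W_ih_old + eta * (OUT_i outer-product D_h). */
    for (int i = 0; i < NI; i++)
        for (int j = 0; j < NH; j++)
            w_ih[i][j] += ETA * out_i[i] * d_h[j];
}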

III. TOXICOLOGICAL DATA

Toxicological data, reported in [5], was used to train the network. Each record of this data (an example) is composed of two parts, the condition set and the decision. The condition set is composed of 16 conditions; based on the condition set of an example, an expert has given a decision for the example. The possible values for conditions and decisions are positive, negative, maybe, and unknown, which are represented by the codes +1, -1, 0.5, and 0, respectively. The conflict-free examples have been identified and kept as an active dataset [11]. Of the 110 examples in the active dataset, ten percent was reserved for testing; in fact, for each decision value we set aside ten percent of its examples as a testing set. Our testing set has 12 examples: three with decision 0, two with decision 1, two with decision 0.5, and four with decision -1. Among the remaining examples in the active dataset, the number of examples with a decision value of 0.5 is the smallest, at 18. As a result, we selected 18 examples for each possible decision value to make our training dataset; having the same number of examples for every possible decision value makes the network unbiased [12].
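
A minimal sketch of how a record and the balanced training set could be represented is given below. The type and function names are hypothetical; only the coding of the values and the 18-examples-per-decision selection come from the description above.

#define N_CONDITIONS 16
#define PER_DECISION 18

typedef struct {
    double condition[N_CONDITIONS]; /* each is +1, -1, 0.5, or 0 */
    double decision;                /* expert decision, same coding */
} Example;

/* Copy at most PER_DECISION examples with the given decision value from
   the active dataset into the training set; returns the count copied.
   Calling this once per decision value yields the balanced training set. */
int select_for_decision(const Example *active, int n_active,
                        double decision, Example *train_out)
{
    int kept = 0;
    for (int i = 0; i < n_active && kept < PER_DECISION; i++)
        if (active[i].decision == decision)
            train_out[kept++] = active[i];
    return kept;
}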

IV. TOPOLOGY OF THE NETWORK

There were three concerns regarding the topology of the network: the number of layers, the number of nodes in each layer, and the choice of an activation function for the network. Regarding the first concern, we decided to have only three layers. This decision stems from the fact that a three-layer network is powerful enough to simulate the prediction [9]. The second concern, the choice of the number of nodes in each layer, is the most challenging. The number of nodes for the input layer is dictated by the input data representation. Originally we chose to have sixteen nodes in the input layer, with each of the sixteen condition values fed into an input node as a single number. Our empirical results showed that the training was poor. To improve the training of the net, as an alternative approach, we turned to bit patterns for representing the input data. The four possible values of a condition could be represented by a two-bit input, which means the number of nodes in the input layer would be 32. A major problem arose from the fact that one combination would be 00: due to the nature of the backpropagation training paradigm, zeros do not train well, giving no change in the weights. Three-bit and four-bit representations for the input (Table 1) were therefore tried. Both representations provided the same accuracy in classification, so the three-bit representation was kept because it necessitates fewer calculations, having fewer weights. It should be mentioned that, out of the 8 possible combinations of 3 bits and the 16 possible combinations of 4 bits, we chose the combinations in which the number of zeros is minimum.

Value            3-Bit Representation    4-Bit Representation
+1 (positive)    111                     1111
0.5 (maybe)      110                     1110
0 (unknown)      101                     1101
-1 (negative)    011                     1011

Table 1: 3-bit and 4-bit patterns representing input data.
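
The chosen three-bit coding can be expressed as a small lookup, sketched below; the function name is ours, while the bit patterns are those of Table 1. With 16 conditions at 3 bits each, this coding yields the 48 input nodes of the final topology.

/* Expand one condition value into three input-node activations,
   using the three-bit patterns of Table 1. */
void encode_condition(double value, double bits[3])
{
    if (value == 1.0)      { bits[0] = 1; bits[1] = 1; bits[2] = 1; } /* +1 : 111 */
    else if (value == 0.5) { bits[0] = 1; bits[1] = 1; bits[2] = 0; } /* 0.5: 110 */
    else if (value == 0.0) { bits[0] = 1; bits[1] = 0; bits[2] = 1; } /* 0  : 101 */
    else                   { bits[0] = 0; bits[1] = 1; bits[2] = 1; } /* -1 : 011 */
}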

In order to determine the number of nodes in the hidden layer, we used a trial and error method. Starting with seven nodes in the hidden layer, the ANN was trained and tested. The process of training and testing was repeated for the cases in which 8, 9, ..., 18 nodes were in the hidden layer. Since each network begins with random weights, two networks with the same number of nodes can have different classification abilities, so more than one network was trained for each configuration. The classification ability of each trained network can be measured by the percent of the examples in the training set which the system has learned and the percent of the examples in the testing set which are correctly classified by the system. These percentages for different numbers of nodes in the hidden layer are illustrated in Figure 1, and the best fit lines for these percentages were calculated and graphed separately.

Figure 1: Plotting the raw data and showing the trend. (Percent correct versus the number of nodes in the hidden layer, with separate markers for the behavior of the ANN on the training set and on the testing set.)

The best fit lines show that, in general, the percent of the records in the training set for which the network is trained increases with the number of nodes in the hidden layer, while the percent of the records in the testing set that are correctly identified decreases with the number of nodes in the hidden layer. This observation is supported by the fact that when the number of nodes in the hidden layer is too large, the network can memorize the inputs instead of learning generalized rules from them. In Figure 2, the average percentage for each number of nodes in the hidden layer is illustrated separately for the training set and for the testing set, and a curve is plotted through these averages. These curves demonstrate that below a certain number of nodes in the hidden layer, the networks cannot learn.

Figure 2: The average percent correct for each network configuration. (Average percent correct versus the number of nodes in the hidden layer, plotted separately for the training set and for the testing set.)

The output layer used the four-bit representations 1000, 0100, 0010, and 0001 for the output values 1, 0.5, 0, and -1, respectively. This representation was more responsive to our problem. For each input pattern, the output node which is stimulated the most is fired and the other three nodes are suppressed. Therefore, the network has 48 input nodes and 4 output nodes.
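
A sketch of this winner-take-all reading of the output layer is given below. The function name and the index-to-decision table are our own; the mapping of the patterns 1000, 0100, 0010, and 0001 to the decisions 1, 0.5, 0, and -1 follows the representation just described.

/* Fire the most stimulated output node and return its decision code. */
double decode_output(const double out_o[4])
{
    static const double decision_of[4] = { 1.0, 0.5, 0.0, -1.0 };
    int winner = 0;
    for (int k = 1; k < 4; k++)
        if (out_o[k] > out_o[winner])
            winner = k;                 /* most stimulated node fires */
    return decision_of[winner];
}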

In regard to the last of our three concerns, the nature of our application suggests two options for the selection of an activation function: the sigmoid function and the hyperbolic tangent function. The sigmoid squashes the output of a node between 0 and 1; the hyperbolic tangent squashes it between -1 and +1. Both are nonlinear, which allows the network to represent more complex problems. The sigmoid function is more valuable when the averaging of training pairs is important, while the hyperbolic tangent function is used when the consideration of training pairs representing special cases is important. Both were tested, and the sigmoid gave better results.
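
For reference, the two candidate activation functions, and their derivatives expressed in terms of the node output as used during backpropagation, can be sketched as follows; the function names are illustrative. Swapping one squashing function for the other changes only this step of the forward pass and the derivative used in the weight updates.

#include <math.h>

double act_sigmoid(double sum) { return exp(sum) / (exp(sum) + 1.0); } /* (0, 1)   */
double act_tanh(double sum)    { return tanh(sum); }                   /* (-1, +1) */

/* Derivatives in terms of the node output out. */
double d_sigmoid(double out) { return out * (1.0 - out); }
double d_tanh(double out)    { return 1.0 - out * out; }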

V. RESULTS
The best configuration for our network was composed of three layers: the input, hidden, and output layers had 48, 10, and 4 nodes, respectively, and the net uses the sigmoid function as its activation function. Figure 2 shows that the network with seven nodes in the hidden layer could not learn the training set; that is the main reason we started our investigation of the number of hidden nodes at seven.

Table 2: The behavior analysis of the neural network approach for the training and testing datasets. (For each decision value 0, 1, 0.5, and -1, and in total, the table reports C, the number of records correctly classified, and I, the number of records incorrectly classified, for a training set analysis and a testing set analysis.)

Our training dataset had 18 examples for each decision, totalling 72. Our testing dataset had 12 examples; the number of examples for each decision in the testing set is not the same, keeping the same proportions as in the active dataset. The network was coded in C and was implemented in a VAX/VMS timesharing environment. The behavior of our net with regard to the training and testing datasets is illustrated in Table 2. The network learned 97% of the members of the training set and gave the correct decision for 67% of the testing set.

To demonstrate the prediction capability of our neural net, we compared its behavior to the statistical approach of discriminant analysis. For the implementation of discriminant analysis we used SAS version 6.07 in a VAX/VMS timesharing environment. The discriminant function used by SAS was a linear function. The same training set was used for calibration, and the same testing set was used to observe the behavior of the discriminant analysis. The results are given in Table 3. A comparison of the results reveals that the network has a better prediction capability than the discriminant analysis, except for the examples with decision zero. Overall, discriminant analysis gave 42% correct responses for the testing dataset, while the network gave 67% correct responses for the same testing dataset.

Table 3: The behavior analysis of the discriminant analysis approach for the training and testing datasets. (Same layout as Table 2: counts of correctly and incorrectly classified records per decision value for the training and testing sets.)

One may claim that the small size of the training dataset is the main factor contributing to the weak classification performed by the discriminant analysis. This claim demonstrates another advantage of the neural network approach.

VI. SUMMARY AND FUTURE RESEARCH


An artificial neural network was trained to perform risk assessment on a set of toxicological data and to give the same decision as the experts. In order to obtain a better network, the number of nodes in the input, hidden, and output layers was determined through trial and error. The classification ability of the resulting network was compared to that of discriminant analysis; the ANN is 25% better at classifying the test set.

As future research we plan to use larger datasets and compare the prediction capability of the neural network approach and the statistical approach. Also, work is in progress to use different training paradigms and to conduct comparisons between neural networks and statistical models.

VII. REFERENCES

1. Jelovsek F., Mattison D., and Chen J., Prediction of risk for human developmental toxicity: How important are animal studies for hazard identification?, Obstet Gynecol, 1989; 74, 624.

2. Enslein K., Lander T., Strange J., Teratogenesis: A statistical structure-activity model, Teratogenesis Carcinog Mutagen, 1983; 3, 289.

3. Mattison D., Jelovsek F., Pharmacokinetics and expert systems as aids for risk assessment in reproductive toxicology, Environ Health Perspect, 1988; 76, 107.

4. Hashemi R., An Automated Rule Generator (RZ) Based on the Rough Sets, Proceedings of the Oklahoma Symposium on Artificial Intelligence, 1989, pp. 107-120.

5. Hashemi R., Jelovsek F. R., Razzaghi M., Developmental Toxicity Risk Assessment: A Rough Sets Approach, The International Journal of Methods of Information in Medicine (in press).

6. Quinlan J. R., Discovering Rules by Induction From Large Collections of Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis, International Journal of Policy Analysis and Information, Edinburgh University Press, Edinburgh, pp. 168-201, 1979.

7. Hashemi R. and Jelovsek F. R., A Look-Ahead Learning Algorithm for Inductive Learning, Proceedings of the 1992 ACM International Symposium on Applied Computing, Kansas City, Missouri, March 1992, pp. 603-605.

8. Hertz J., Krogh A., Palmer R. G., Introduction to the Theory of Neural Computation, Addison-Wesley, Redwood City, CA, 1991, pp. 145-156.

9. Wasserman P. D., Neural Computing, ANZA Research, Inc., Van Nostrand Reinhold, New York, 1989, p. 53.

10. Caudill M., Neural Network Training Tips and Techniques, AI Expert, January 1991, pp. 56-61.

11. Hashemi R., Razzaghi M., Jelovsek F., Talburt J., Conflict Resolution in Learning Through Examples, Proceedings of the 1992 ACM/IEEE International Symposium on Applied Computing, Kansas City, Missouri, March 1992, pp. 509-602.

12. Klimasauskas C. C., Applying Neural Networks, Part II, PC AI, March/April 1991, pp. 27-24.
