
Hidden Unit Reduction of Artificial Neural Network on English Capital Letter Recognition

Kietikul JEARANAITANAKIJ
Department of Computer Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand
Abstract: We present an analysis of the minimum number of hidden units that an artificial neural network requires to recognize English capital letters. The letter font used as a case study is the System font. To obtain the minimum number of hidden units, the number of input features has to be minimized first. We therefore begin by applying our heuristic for pruning unnecessary features from the data set. Because each feature maps one-to-one onto an input unit, the small number of remaining features gives the artificial neural network a correspondingly small number of input units. Next, hidden units are pruned away from the network with the hidden unit pruning heuristic. Both pruning heuristics are based on the notion of information gain, and they efficiently remove unnecessary features and hidden units from the network. The experimental results show the minimum number of hidden units required to train the artificial neural network to recognize English capital letters in System font, and the classification accuracy produced by the network remains practically high. As a result, the final artificial neural network is both compact and reliable.

Keywords: Artificial Neural Network, letter recognition, hidden unit, pruning, information gain

Ouen PINNGERN
Department of Computer Engineering, Faculty of Engineering, King Mongkut's Institute of Technology Ladkrabang, Bangkok, Thailand

I. INTRODUCTION

An artificial neural network can be defined as a model of reasoning based on the human brain. Recent developments in artificial neural networks have made them widely used in character recognition because of their ability to generalize well to unseen patterns [1-8]. Recognition of both printed and handwritten letters is a typical domain where neural networks have been successfully applied. Letter recognition, commonly called OCR (Optical Character Recognition), is the ability of a computer to translate character images into a text file using special software. It allows us to take a printed document and put it into a computer in editable form without retyping the document (Negnevitsky, 2002, [9]).

One issue in letter recognition with an artificial neural network as the learning model is choosing a suitable number of hidden units. The number of neurons in the hidden layer affects both the accuracy of character recognition and the speed of training the network. Complex patterns cannot be detected with too few hidden units, while too many of them can sharply increase the computational burden. Another problem is overfitting: the greater the number of hidden units, the greater the ability of the network to recognize existing patterns, but if the number of hidden units is too large, the network might simply memorize all training examples. This may prevent it from generalizing, producing incorrect outputs when presented with patterns that were not used in training.

Several methods have been proposed to reduce the number of hidden units in an artificial neural network. Sietsma and Dow [10], [11] suggested an interactive method in which a trained network is inspected to identify hidden units with a constant activation over all training patterns; a hidden unit that does not influence the output is then pruned away. Murase et al. [12] measured the Goodness Factors of the hidden units in the trained network and removed the unit with the lowest Goodness Factor from the hidden layer. Hagiwara [13] presented the Consuming Energy and the Weights Power methods for removing hidden units and weights, respectively. Jearanaitanakij and Pinngern [14] proposed an information-gain-based pruning heuristic that can efficiently remove unnecessary hidden units within a nearly minimum period of time.

In this paper, we analyze the reduction of hidden units of an artificial neural network for recognizing English capital letters printed in System font. Each letter image consists of 10x10 pixels, and each pixel (or feature) is represented by either 1 or 0. Our objective is to determine the minimum number of hidden units required to classify the 26 English letters at a practical recognition rate. First, unnecessary features are filtered out of the data set by the feature pruning heuristic [14]. Then the hidden unit pruning heuristic [14] is applied to find a suitable number of hidden units. The analysis of the experimental results shows that an exceedingly low number of hidden units is required for the classification task. In addition, the results support our heuristics [14] in terms of producing a compact network within a nearly minimum pruning time.

The rest of this paper is organized as follows. Section 2 gives a brief review of information gain and our hidden unit pruning heuristic. Section 3 describes the data set of English capital letters. Section 4 presents the experimental results and analysis. Finally, Section 5 discusses the conclusions and possible future work.



II. HIDDEN UNIT PRUNING

We begin by briefly reviewing the notion of information gain and our hidden unit pruning heuristic.

A. Information Gain

Entropy, a measure commonly used in information theory, characterizes the (im)purity of an arbitrary collection of examples. Given a collection S containing examples with each of the C outcomes, the entropy of S is
$$\mathrm{Entropy}(S) = -\sum_{I \in C} \left[\, p(I)\,\log_2 p(I) \,\right], \qquad (1)$$

where p(I) is the proportion of S belonging to class I. Note that S is not a feature but the entire sample set. The entropy is 0 if all members of S belong to the same class; its scale runs from 0 (pure) to 1 (impure). The next measure is information gain, first defined by Shannon and Weaver [15], which measures the expected reduction in entropy. For a particular feature A, Gain(S, A) denotes the information gain of the sample set S with respect to feature A and is defined by the following equation:
$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{values}(A)} \left[ \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \right], \qquad (2)$$

Figure 1. Network notations

where the summation runs over all possible values v of the feature A; S_v is the subset of S for which feature A has value v; |S_v| is the number of elements in S_v; and |S| is the number of elements in S. The merit of the information gain is that it indicates the degree of significance that a particular feature has for the classification output: the higher the information gain of a feature, the more significant the feature is. We always prefer a feature with a high value of information gain over those with lower values.

B. Hidden Unit Pruning Heuristic

We describe the hidden unit pruning heuristic (Jearanaitanakij and Pinngern, 2005, [14]) used as the ordering criterion for pruning hidden units in the artificial neural network. Before performing the hidden unit pruning, we must calculate the information gains of all features and then propagate these gains to the hidden units in the next layer. The hidden unit pruning heuristic is based on these propagated information gains from the feature units. Before going further, let us define the notation used in this section: the information gain of feature unit i ($\mathrm{Gain}_i$), the incoming information gain of a hidden unit ($\mathrm{GainIn}$), the outgoing information gain of a hidden unit ($\mathrm{GainOut}$), the weight from the i-th unit of the (n-1)-th layer to the j-th unit of the n-th layer ($w_{ij}^{n-1,n}$), and, similarly, the weight from the j-th unit of the n-th layer to the k-th unit of the (n+1)-th layer ($w_{jk}^{n,n+1}$). All notations are shown in Fig. 1.

The amount of information received at a hidden unit is the sum, over the training patterns, of the squared products of the weights connecting the feature units to that hidden unit and the information gains of the corresponding feature units. The result is then averaged over the number of training patterns and the number of feature units. We define the incoming information gain of the j-th hidden unit in the n-th layer ($\mathrm{GainIn}_j^n$) as follows:
$$\mathrm{GainIn}_j^n = \frac{1}{P\,I} \sum_{P} \sum_{i} \left( w_{ij}^{n-1,n}\, \mathrm{Gain}_i^{n-1} \right)^2, \qquad (3)$$

where P and I are the number of training patterns and the number of feature units in the (n-1)-th layer, respectively. This $\mathrm{GainIn}_j^n$ is, in turn, used for calculating the outgoing information gain of the j-th hidden unit. The degree of importance of a particular hidden unit can be determined by its outgoing information gain ($\mathrm{GainOut}_j^n$), which is the sum, over the training patterns, of the squared products of the weights connecting that hidden unit to the output units and the incoming information gain of the hidden unit. The result is then averaged over the number of training patterns and the number of output units. The outgoing information gain of the j-th hidden unit in the n-th layer is given by:
$$\mathrm{GainOut}_j^n = \frac{1}{P\,O} \sum_{P} \sum_{k} \left( w_{jk}^{n,n+1}\, \mathrm{GainIn}_j^n \right)^2, \qquad (4)$$

where O is the number of output units in the (n+1)-th layer. Note that the number of training patterns, P, in both (3) and (4) is the number of training patterns that the network has seen so far. The hidden unit with the lowest outgoing information gain should be removed from the trained network first, because removing it has little effect on the convergence time when the network is retrained. Only one hidden unit is removed at a time, until the network can no longer converge; then the last pruned unit and the corresponding network configuration are restored.
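To make Eqs. (1)-(4) concrete, the following Python sketch computes the feature information gains and the incoming and outgoing information gains of the hidden units of a trained single-hidden-layer network, and ranks the hidden units for pruning. The array names (X, y, W_ih, W_ho) and the overall structure are our own illustrative assumptions, not code from the original study; note that, since the weights and gains do not depend on the pattern index, the sum over the P training patterns in (3) and (4) cancels the 1/P factor.

```python
import numpy as np

def entropy(y):
    """Eq. (1): entropy of the class labels in a sample set."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, y):
    """Eq. (2): expected entropy reduction obtained by splitting on one feature."""
    gain = entropy(y)
    for v in np.unique(feature):
        subset = y[feature == v]
        gain -= (len(subset) / len(y)) * entropy(subset)
    return gain

def hidden_unit_gains(X, y, W_ih, W_ho):
    """Eqs. (3) and (4): incoming/outgoing information gains of each hidden unit.

    X    : (P, I) binary feature matrix (P training patterns, I features)
    y    : (P,)   class labels
    W_ih : (I, H) weights from feature units to hidden units
    W_ho : (H, O) weights from hidden units to output units
    """
    P, I = X.shape
    H, O = W_ho.shape
    feat_gain = np.array([information_gain(X[:, i], y) for i in range(I)])

    # Eq. (3): squared products of input weights and feature gains, averaged over I.
    gain_in = np.array([np.sum((W_ih[:, j] * feat_gain) ** 2) / I for j in range(H)])

    # Eq. (4): squared products of output weights and GainIn_j, averaged over O.
    gain_out = np.array([np.sum((W_ho[j, :] * gain_in[j]) ** 2) / O for j in range(H)])
    return gain_in, gain_out

# The hidden unit with the smallest outgoing gain is the first pruning candidate:
#   _, gain_out = hidden_unit_gains(X, y, W_ih, W_ho)
#   prune_first = int(np.argmin(gain_out))
```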

III. DATA SET

The data set used as the case study is the set of twenty-six English capital letters (A to Z) printed in System font. Each letter image is represented by 10x10 pixels, and a particular pixel can be either on (1) or off (0). We scan each pixel in the enhanced letter image from top to bottom and from left to right to locate the printed capital letter on the page, under the assumption that the letters are clearly separated from one another.

As shown in Figure 2, all pixels of an extracted letter are transformed to either 1 or 0. These pixels represent the features of the training set of the artificial neural network, and each pattern corresponds to exactly one letter. The set of the 26 noise-free letter images is shown in Figure 3. To make the data more realistic, we add further letter images in which each pixel is corrupted with a noise probability of 0.05. Each letter has 5 noise-free and 5 noised images; therefore, 260 letter images are used as the data set, which is randomly decomposed into 130 images for the training set and 130 images for the test set.

After a letter image has been transformed into an array of 10x10 binary-valued features, the artificial neural network is brought into the training procedure. Each feature connects to an input unit in a one-to-one relationship. The output is encoded into 26 units, each standing for one English capital letter. For a particular classification, exactly one of the 26 output units takes the value 1 while all other output units are 0.
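As a rough sketch of how such a data set could be assembled, the snippet below builds the 260 flattened 100-bit patterns and their 26-unit one-hot targets, then splits them 130/130. The `letter_bitmaps` argument, a mapping from 'A'..'Z' to 10x10 arrays of 0/1 rasterized from the System font, is assumed to be supplied elsewhere; it and the function name are illustrative only.

```python
import numpy as np

def make_dataset(letter_bitmaps, noise_prob=0.05, clean_copies=5, noisy_copies=5, seed=0):
    """Build flattened binary letter patterns with one-hot targets and a 50/50 split."""
    rng = np.random.default_rng(seed)
    patterns, targets = [], []
    for idx, letter in enumerate(sorted(letter_bitmaps)):      # 'A' .. 'Z'
        bitmap = np.asarray(letter_bitmaps[letter], dtype=np.uint8).reshape(100)
        one_hot = np.zeros(26)
        one_hot[idx] = 1.0
        for _ in range(clean_copies):                          # 5 noise-free copies
            patterns.append(bitmap.astype(float))
            targets.append(one_hot.copy())
        for _ in range(noisy_copies):                          # 5 copies with 5% pixel flips
            flips = rng.random(100) < noise_prob
            patterns.append(np.where(flips, 1 - bitmap, bitmap).astype(float))
            targets.append(one_hot.copy())
    X, Y = np.array(patterns), np.array(targets)               # 260 patterns in total
    order = rng.permutation(len(X))                            # random 130 / 130 split
    train, test = order[: len(X) // 2], order[len(X) // 2 :]
    return (X[train], Y[train]), (X[test], Y[test])
```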

IV. EXPERIMENTAL RESULTS

Figure 2. Image transformation

We train the 26-letter data set with an initial artificial neural network that has 100 input units, 10 hidden units, and 26 output units, with a single hidden layer between the input and output layers. The learning algorithm used in the training process is the standard back-propagation algorithm (Rumelhart et al., 1986, [17]) without momentum. All weights and thresholds are randomly initialized in the range between -1 and +1. To obtain the highest recognition rate, training is considered converged when the sum-squared error falls below 0.3. Note that the initial number of hidden units, 10, is not the minimum number; it is simply a number that allows the artificial neural network to converge easily. Our goal, however, is to find the minimum number of hidden units that still classifies the patterns correctly at a high recognition rate.

Since the number of hidden units depends on the number of input features, it is worthwhile to remove feature units before the hidden unit pruning begins. The idea of feature removal is similar to that of hidden unit pruning, except that the information gain of each feature, rather than the outgoing information gain, is used as the pruning criterion. Once the initial network is trained, the feature with the lowest information gain is removed from the network first. Only one feature unit is removed at a time, until the network can no longer converge; the final number of features is then returned. The experimental result on the number of input features is depicted in Fig. 4.
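The feature-pruning loop just described can be sketched as follows. `train_until_converged` (standard back-propagation run until the sum-squared error drops below the 0.3 target, returning whether it got there) and `information_gain` (Eq. (2), as sketched in Section II) are assumed helpers; the network object with its `copy_without_input` method is likewise a placeholder for whatever implementation is used, not code from the paper.

```python
import numpy as np

def prune_features(network, X_train, Y_train, sse_target=0.3):
    """Drop the lowest-gain feature one at a time until retraining no longer converges."""
    active = list(range(X_train.shape[1]))                    # indices of surviving features
    if not train_until_converged(network, X_train[:, active], Y_train, sse_target):
        raise RuntimeError("the initial network must converge before pruning starts")

    labels = Y_train.argmax(axis=1)                           # class index of each pattern
    while len(active) > 1:
        gains = [information_gain(X_train[:, f], labels) for f in active]
        weakest = int(np.argmin(gains))                       # feature with the lowest gain
        trial_net = network.copy_without_input(weakest)       # remove the matching input unit
        trial_active = active[:weakest] + active[weakest + 1:]
        if train_until_converged(trial_net, X_train[:, trial_active], Y_train, sse_target):
            network, active = trial_net, trial_active         # keep the smaller network
        else:
            break                                             # keep the last converging network
    return network, active                                    # e.g. 37 features in this study
```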

Figure 3. 26 English capital letter images without noises



Figure 4. Number of features during the training

The number of features stays constant at 100 units during the first 1473 training epochs; this is the period in which we train the initial neural network until it converges. Once the network is trained, the number of features keeps decreasing until it settles at 37. This means that the essential number of features for classifying the 26 English capital letters in System font is 37. It is not only the minimum number of features, but also the number of features that still maintains the highest recognition rate of the artificial neural network.

TABLE I. ACCURACY RATE ON THE TEST SET

Method                          Accuracy (%)
Proposed                        97.71
Conventional Neural Network     94.10

Figure 5. Number of hidden units during the training process

Fig. 5 illustrates the number of hidden units during the training process. After the feature pruning has finished, at the 2000th epoch, the hidden unit with the lowest outgoing information gain is pruned away. The hidden unit pruning is then suspended, shown as a horizontal segment, until the network has been retrained. Hidden units are removed in succession until the network can no longer be retrained. We find that the final number of hidden units that maintains the highest recognition rate is 6.
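A sketch of the corresponding hidden-unit pruning loop is given below; it mirrors the feature-pruning loop above, using the outgoing information gain of Eq. (4) as the removal criterion. `hidden_unit_gains` is the routine sketched in Section II, while `train_until_converged` and the network's `copy_without_hidden` method are again assumed placeholders rather than code from the paper.

```python
import numpy as np

def prune_hidden_units(network, X_train, Y_train, sse_target=0.3):
    """Remove hidden units, smallest outgoing information gain first, while retraining succeeds."""
    labels = Y_train.argmax(axis=1)
    while network.num_hidden > 1:
        _, gain_out = hidden_unit_gains(X_train, labels, network.W_ih, network.W_ho)
        weakest = int(np.argmin(gain_out))                 # smallest GainOut_j, Eq. (4)
        trial_net = network.copy_without_hidden(weakest)   # prune one hidden unit at a time
        if train_until_converged(trial_net, X_train, Y_train, sse_target):
            network = trial_net                            # the smaller network still converges
        else:
            break                                          # restore the last converging network
    return network                                         # e.g. 6 hidden units in this study
```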

Figure 6. Sum-squared error during the training

Fig. 6 shows the sum-squared error throughout the training process. The error gradually decreases from 4.75 to 0.3 within 1473 training epochs. When the network reaches a convergence point, the feature pruning process starts removing features one by one. This causes the sum-squared error to rise slightly, but the error quickly decreases again within a short period of time. A similar situation occurs with the hidden unit pruning, which starts at the 2001st training epoch. At this point, the sum-squared error suddenly rises to 3.5. This high error does not prolong the network training, because the error drops abruptly and the network converges almost immediately. The ripples in the sum-squared error indicate the points where the hidden unit reductions take place. The pruning process finishes when the sum-squared error fails to decrease for 500 consecutive epochs; the network is then restored to the previous convergence point. This restoration explains why the sum-squared error at the end of the graph in Fig. 6 suddenly falls below 0.3.

Table I shows the accuracy rates on the test set for the conventional neural network and our proposed method. The conventional network has 10 hidden units and 100 input units, while the proposed method has 6 hidden units and 37 input units. We deliberately use different numbers of hidden units in order to verify that having fewer hidden units does not degrade the accuracy when classifying unseen data. The results in Table I show that the conventional network has a lower accuracy rate than the proposed method. This can be explained by the overfitting that occurs in the conventional network: with unnecessary hidden units, it memorizes the training patterns instead of learning them. Moreover, our approach removes unimportant features from the original feature set, which filters some noise out of the training set.

V. CONCLUSIONS

We have given an analysis of the hidden units that are necessary to recognize the English capital letters printed in System font. The 10x10 pixels of a letter image are the features passed into the input units of the artificial neural network. In the input layer, the information gain indicates the degree of importance of a feature: the feature with the smallest information gain contributes least to the output classification and should be discarded, so that pruning it requires the smallest number of retraining epochs. In the hidden layer, the hidden unit with the smallest outgoing information gain tends to propagate only a small amount of information to the output units in the next layer; consequently, removing that unit has little effect on the retraining time. The experimental results show that the number of hidden units necessary to identify the 26 English capital letters in System font, using the 37 essential features, is 6 units. In addition, this small artificial neural network achieves a testing accuracy rate of 97.17%. Removing unnecessary hidden units reduces the overfitting that may occur in the network: if the network has too many hidden units, it memorizes the training patterns instead of learning them, which may prevent it from generalizing and lead to incorrect outputs when it is presented with patterns that were not used in training.

REFERENCES
[1] B. Widrow et al., "Layered neural nets for pattern recognition," IEEE Trans. on ASSP, vol. 36, no. 7, July 1988.
[2] V. K. Govindan and A. P. Shivaprasad, "Character recognition - a review," Pattern Recognition, vol. 23, no. 2, pp. 671-679, 1990.
[3] B. Boser et al., "Hardware requirements for neural network pattern classifiers," IEEE Micro, pp. 32-40, 1992.
[4] A. Shustorovich and C. W. Thrasher, "Neural network positioning and classification of handwritten characters," Neural Networks, vol. 9, no. 4, pp. 685-693, 1996.
[5] R. Parekh, J. Yang, and V. Honavar, "Constructive neural network learning algorithms for pattern classification," IEEE Transactions on Neural Networks, vol. 11, no. 2, pp. 436-451, 2000.
[6] J. Kamruzzaman, Y. Kumagai, and S. Mahfuzul Aziz, "Character recognition by double backpropagation neural network," Proceedings of IEEE Region 10 Annual Conference, Speech and Image Technologies for Computing and Telecommunications, vol. 1, pp. 411-414, 1997.
[7] J. Kamruzzaman, "Comparison of feed-forward neural net algorithms in application to character recognition," Proceedings of IEEE Region 10 International Conference on Electrical and Electronic Technology, vol. 1, pp. 165-169, 2001.
[8] D. Jacquet and G. Saucier, "Design of a digital neural chip: application to optical character recognition by neural network," Proceedings European Design and Test Conference, pp. 256-260, 1994.
[9] M. Negnevitsky, Artificial Intelligence: A Guide to Intelligent Systems, Addison-Wesley, 2002.
[10] J. Sietsma and R. J. F. Dow, "Creating artificial neural networks that generalize," Neural Networks, vol. 4, no. 1, pp. 67-69, 1991.
[11] J. Sietsma and R. J. F. Dow, "Neural net pruning - why and how," in Proc. IEEE Int. Conf. Neural Networks, vol. I, San Diego, pp. 325-333, 1988.
[12] K. Murase, Y. Matsunaga, and Y. Nakade, "A back-propagation algorithm which automatically determines the number of association units," Proc. IEEE Int. Conf. Neural Networks, vol. 1, pp. 783-788, 1991.
[13] M. Hagiwara, "Removal of hidden units and weights for back propagation networks," Proc. IJCNN'93, vol. 1, pp. 351-354, 1993.
[14] K. Jearanaitanakij and O. Pinngern, "Determining the orders of feature and hidden unit prunings of artificial neural networks," Proc. IEEE 2005 Fifth Int. Conf. on Information, Communications and Signal Processing (ICICS), w3c.3, pp. 353-356, 2005.
[15] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication, University of Illinois Press, Urbana, Illinois, 1949.
[16] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, issue 1, pp. 81-106, 1986.
[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1: Foundations, eds. D. E. Rumelhart and J. L. McClelland, pp. 318-362, The MIT Press, Cambridge, Massachusetts, 1986.
