
Face Detection using Neural Networks

(600.475) Machine Learning Course Project
Instructor: Dr. J. Sheppard


Submitted by

Maneesh Dewan Jim Susinno

1. Introduction
The task of face detection, or more generally object detection, is one of the fundamental problems in computer vision. With recent advances in information technology and media, automated human-computer interaction (HCI) systems are being built that involve face processing tasks such as face detection, face recognition and face tracking [1]. The first step in any face processing system is to detect and localize faces in an image. The face detection task, however, is challenging because of variability in pose, orientation, location and scale; facial expression, lighting conditions and occlusion add further variability.

The face detection task can be broken down into two steps. The first step is a classification task that takes some arbitrary image as input and outputs a binary value of yes or no, indicating whether there are any faces present in the image. The second step is the face localization task that aims to take an image as input and output the location of any face or faces within that image as some bounding box with (x, y, width, height). The localization task assumes that its input image contains a face.

The factors which render the face detection task difficult to solve are:

- Variations in image plane and pose: faces in an image vary due to rotation, translation and scaling of the camera pose or of the face itself. The rotation can be both in and out of plane.
- Facial expression and other structural components: the appearance of a face is largely affected by its expression. The presence or absence of additional features such as glasses, beards and mustaches adds further variability.
- Lighting conditions and background: the lighting, which depends on object and light-source properties, affects the appearance of the face. The background, which defines the profile of the face, is also important and cannot be ignored.
- Occlusion: other objects may occlude the face, or in a group of faces one face may occlude another.

Figure 1: Examples showing the variability of faces in images due to factors such as pose, lighting and expression.

There are many problems closely related to face detection, namely facial feature detection, face recognition (or face identification) and face tracking. All of these involve face detection as the first processing step. Numerous methods have been proposed over the years to detect faces in single intensity (grayscale) or color images. These will be discussed in the next section.

Problem Outline

We propose a neural network approach to accomplish the task of face detection. As discussed above, the task of face detection with neural networks or any machine learning technique suffers from the variability problem. To reduce the variability in the face detection task we assume the following simplifications:

- The images contain only frontal faces with very slight rotations, though the lighting conditions can vary.
- The faces are not occluded.
- The expression of the faces is close to neutral.

The face detection system can be divided into the following steps:

Pre-processing: To reduce the variability in the faces, the images are processed before they are fed into the network. All positive examples, that is the face images, are obtained by cropping images with frontal faces to include only the front view. All the cropped images are then corrected for lighting through standard algorithms. The non-face examples undergo the same corrections.

Classification: Neural networks are trained on these examples to classify images as faces or non-faces. We use both our own implementation of the neural network and the Matlab neural network toolbox for this task, and experiment with different network configurations to optimize the results.

Localization: The trained neural network is then used to search for faces in an image and, if any are present, localize them in a bounding box.

2. Background and related work


The existing techniques to detect faces in an image can be classified broadly into five categories [2]:

1. Knowledge-based methods
2. Feature invariant methods
3. Template matching methods
4. Appearance-based methods
5. Agent-based methods

Knowledge-based methods describe a face with a set of rules; the approach suffers from the difficulty of coming up with well-defined rules. Feature invariant methods extract features like the eyes, nose and mouth and then use them to detect a face; the problem is that these features can be corrupted by illumination, occlusion and noise. Proposed agent-based methods [3] use collaborative evolutionary agents to search skin-like pixels and segment the face region; the shape of the region is then parameterized by height, aspect ratio and orientation, and classified as a face based on color and shape. In template matching methods, standard patterns of a face or its features are stored and input images are correlated with these patterns to detect a face; these methods fail to deal with variations in pose, scale and shape. Appearance-based methods, on the other hand, use statistical and machine learning techniques to learn characteristics of face and non-face images from examples. The learned characteristics take the form of distribution models or discriminant functions that are subsequently used for face detection. Dimensionality reduction is an important step in these methods, reducing computational complexity and improving detection efficacy. As the neural network approach is an appearance-based method, we briefly discuss some appearance-based methods and then the neural network approach in detail.

Some of the appearance-based approaches are:

Eigenfaces: Given a collection of n x m pixel training images represented as vectors, basis vectors spanning an optimal subspace are determined such that the mean squared error between the projections of the training images onto this subspace and the original images is minimized. Principal component analysis, as an extension of this approach, has been applied to face recognition and detection [4].

Support Vector Machines: Support Vector Machines (SVMs) can be considered a new paradigm for training polynomial, neural network, or radial basis function classifiers. An SVM is a linear classifier whose separating hyperplane is chosen to minimize the expected classification error on unseen test patterns. This optimal hyperplane is defined as a weighted combination of a small subset of the training vectors, called the support vectors. Osuna et al. [5] developed an efficient method to train an SVM for large-scale problems and applied it to face detection.

Hidden Markov Models (HMMs): The goal of training an HMM for a pattern recognition problem is to maximize the probability of observing the training data by adjusting the parameters of the HMM with the standard Viterbi segmentation method and the Baum-Welch algorithm. After an HMM has been trained, the output probability of an observation determines the class to which it belongs. In the face detection task, a face pattern can be divided into several regions such as the eyes, nose and mouth; a face pattern can then be recognized by observing these regions in the appropriate order. In contrast to the template matching approach, the idea here is to associate the facial regions with the states of a continuous-density Hidden Markov Model.

Neural Networks: Neural networks have been extensively applied to numerous pattern recognition problems such as character recognition, object recognition and autonomous robot driving. Various networks have been proposed over the years for the face detection task, which is a two-class pattern recognition problem. The advantage of neural networks is their ability to capture the complexity of face patterns. The disadvantage is that the network architecture has to be extensively tuned (number of layers, number of nodes, learning rates, etc.) to obtain good results.

Some of the most significant work on face detection using neural networks has been done by Rowley et al. [6], [7], [8]. [6] gives a robust multi-layer neural network algorithm for frontal face detection, which provides the basis for the simplified algorithm we propose. The algorithm works by applying multiple neural networks to different portions of the image and unifying their results. An image pyramid is used as input, and considerable pre-processing is done to correct the images for lighting and background before they are fed to the network. The specific details of the neural network configurations used in [6] are discussed in [7], which also provides useful tips for acquiring negative training examples and a sensitivity analysis. The neural network framework can be extended to incorporate rotation invariance into the face detection system [8]: a specialized layer determines the rotation offset from horizontal for an input image, and the opposite rotation is then applied before feeding the image into the rest of the system.

Extensive literature is available on the application of neural networks to the task of face detection. [9] provides a good stepwise description of the implementation of a neural network pattern/face classification system. The detection of a face involves a lot of processing before the image can be fed into the network; many useful approaches and techniques for the manipulation and segmentation of images can be found in [10].

3. Approach
The approach taken is briefly discussed in Section 1 where the problem is outlined. The various steps are discussed in detail here.

3.1 Preprocessing

In the first step, all the faces are cropped manually from an image. The cropped image is sub-sampled down to 20 x 20 pixels using bilinear interpolation. Ideally we would want the images to be aligned such that the approximate positions of the facial features are the same. As we crop the images manually this condition does not hold strictly, but we view the images before selecting them for training and reject the ones which do not meet the above criterion within a certain range.

Figure 2: Manually cropping images. Top left: original image. Top right: manually cropped face from the original image. Bottom center: cropped image sub-sampled to 20 x 20 pixels.
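For concreteness, the sub-sampling step can be sketched as below. This is a minimal illustration of bilinear down-sampling, not the code actually used in the project; the function name, the row-major 0-255 image representation, and the default 20 x 20 output size are our assumptions.

#include <algorithm>
#include <vector>

// Shrink a grayscale image (row-major, 0-255 intensities) with bilinear
// interpolation. All names here are illustrative.
std::vector<unsigned char> resizeBilinear(const std::vector<unsigned char>& src,
                                          int srcW, int srcH,
                                          int dstW = 20, int dstH = 20) {
    std::vector<unsigned char> dst((size_t)dstW * dstH);
    for (int y = 0; y < dstH; ++y) {
        for (int x = 0; x < dstW; ++x) {
            // Map the centre of the destination pixel back into the source.
            float sx = std::max(0.0f, (x + 0.5f) * srcW / dstW - 0.5f);
            float sy = std::max(0.0f, (y + 0.5f) * srcH / dstH - 0.5f);
            int x0 = (int)sx, y0 = (int)sy;
            int x1 = std::min(srcW - 1, x0 + 1);
            int y1 = std::min(srcH - 1, y0 + 1);
            float fx = sx - x0, fy = sy - y0;
            // Weighted average of the four neighbouring source pixels; this
            // averaging is why the result looks slightly smoothed.
            float v = (1 - fx) * (1 - fy) * src[y0 * srcW + x0]
                    + fx       * (1 - fy) * src[y0 * srcW + x1]
                    + (1 - fx) * fy       * src[y1 * srcW + x0]
                    + fx       * fy       * src[y1 * srcW + x1];
            dst[y * dstW + x] = (unsigned char)(v + 0.5f);
        }
    }
    return dst;
}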

Preprocessing for brightness and contrast

In order to reduce the variability due to lighting and camera characteristics, we perform two simple image processing steps as discussed in [6]. The first step tries to equalize the intensity values across the image window by fitting a linear function to them. If the intensity value at pixel (x, y) is I(x, y), then it is modeled by a linear function with parameters a, b and c such that

I(x, y) = a*x + b*y + c

We would like to fit this linear model to the whole image window. The choice of this non-constant model follows [6], where it is used to characterize brightness differences across images; the linear model has the advantage of few parameters and quick solvability. Writing one such equation for every pixel in the window gives an over-constrained matrix equation A * X = B,

    [ x1  y1  1 ]   [ a ]   [ I(x1, y1) ]
    [ x2  y2  1 ] * [ b ] = [ I(x2, y2) ]
    [ x3  y3  1 ]   [ c ]   [ I(x3, y3) ]
    [ ...       ]           [ ...       ]

which is solved by the pseudo-inverse method:

    X = (A^T A)^(-1) A^T B
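One possible implementation of this step is sketched below. It accumulates the 3 x 3 normal equations (A^T A) X = A^T B directly and solves them by Cramer's rule rather than forming an explicit pseudo-inverse; the function name, the float image representation, and the choice to preserve the window's mean intensity are our assumptions, not taken from the report.

#include <vector>

// Fit I(x,y) = a*x + b*y + c over a w x h window, then subtract the fitted
// brightness plane while keeping the window's mean intensity.
void correctLighting(std::vector<float>& img, int w, int h) {
    // Accumulate A^T A (symmetric 3x3) and A^T B for rows [x y 1].
    double Sxx = 0, Sxy = 0, Sx = 0, Syy = 0, Sy = 0, N = (double)w * h;
    double Sxi = 0, Syi = 0, Si = 0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            double I = img[y * w + x];
            Sxx += (double)x * x; Sxy += (double)x * y; Sx += x;
            Syy += (double)y * y; Sy += y;
            Sxi += x * I; Syi += y * I; Si += I;
        }
    double M[3][3] = {{Sxx, Sxy, Sx}, {Sxy, Syy, Sy}, {Sx, Sy, N}};
    double rhs[3] = {Sxi, Syi, Si};
    auto det3 = [](double m[3][3]) {
        return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
             - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
             + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
    };
    // Solve the 3x3 system by Cramer's rule: replace one column at a time.
    double D = det3(M), coef[3];
    for (int c = 0; c < 3; ++c) {
        double T[3][3];
        for (int i = 0; i < 3; ++i)
            for (int j = 0; j < 3; ++j)
                T[i][j] = (j == c) ? rhs[i] : M[i][j];
        coef[c] = det3(T) / D;
    }
    // Subtract the plane; adding back the mean (Si/N) keeps the window in a
    // sensible intensity range, since the least-squares fit preserves it.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            img[y * w + x] -= (float)(coef[0] * x + coef[1] * y + coef[2] - Si / N);
}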

Histogram equalization is performed in the next step. The intensity values in the image are non-linearly transformed such that the histogram of the resulting image is flat. This enhances the contrast in the image and also reduces the effect of variable camera gains. The images produced during this algorithm are shown in the figure below: the first row contains the original cropped images, the second row the images corrected for lighting, and the third row the histogram equalized images. The arrows depict each step in the algorithm.

Figure 3: The images produced during the different steps of the image processing algorithm. The first row shows the original cropped images (re-sampled with bilinear interpolation), the second row the light-corrected images, and the third row the histogram equalized images. Corresponding images are stacked column-wise.
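The equalization step itself is the classic CDF-based remapping of an 8-bit window, sketched below as an illustration; this is not the project's own code.

#include <algorithm>
#include <cstdint>
#include <vector>

// Flatten the intensity histogram of an 8-bit window by remapping each
// pixel through the cumulative distribution function (CDF).
void equalizeHistogram(std::vector<uint8_t>& img) {
    int hist[256] = {0};
    for (uint8_t p : img) ++hist[p];
    // Build the CDF; the smallest non-zero CDF value is subtracted so the
    // output stretches over the full 0..255 range.
    int cdf[256], running = 0;
    for (int i = 0; i < 256; ++i) { running += hist[i]; cdf[i] = running; }
    int cdfMin = 0;
    for (int i = 0; i < 256; ++i)
        if (cdf[i] > 0) { cdfMin = cdf[i]; break; }
    int total = (int)img.size();
    for (uint8_t& p : img)
        p = (uint8_t)((cdf[p] - cdfMin) * 255L / std::max(1, total - cdfMin));
}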

It can be seen that the variation in the original cropped images is reduced in the final histogram equalized images. The original cropped images in the figure above do not show much variation to begin with because, after cropping, each image is re-sampled to 20 x 20 pixels using bilinear interpolation, which acts as an averaging filter as well. The final training images have the contrast enhanced and the lighting corrected by a sufficient amount, and the facial features (eyes, mouth and nose) are well defined. The effect of the lighting-correction step can be better visualized from the processed images shown in Figure 4. The initial image (Figure 4a) is very bright because of intense lighting. The light-corrected image (Figure 4d) has the effect of lighting removed to a large extent. The final training image (Figure 4e) has the contrast enhanced and the lighting effect compensated.


Figure 4: Processing of an image with intense lighting. (a) Original cropped image. (b) Original cropped image re-sampled to 20 x 20 pixels. (c) The lighting estimated as a linear function. (d) The light-corrected image. (e) Histogram equalization of the light-corrected image, which gives the final training image.

A similar procedure is used for the non-face images. Figures 5 and 6 show some of the face and non-face images used for training the neural network.

Figure 5: Some of the face images used for training.

As can be seen from the face images in Figure 5, the faces are not aligned with each other: the position of the facial features with respect to the image varies from face to face. Ideally, we would want all the faces to be aligned with the least amount of variation, so that we have a compact space of face images. This variation in the training set makes it harder for the neural network to learn the function space of faces.

The non-face images used for training are much more varied than the face images. In principle any image not containing a face can be characterized as a non-face image, which makes the space of non-face images very large compared to that of face images. For non-face training images we have selected images of scenery containing trees and buildings, and face-free portions of the images in which faces will later be searched.

Figure 6: Non-face training images.

The set of non-face images used for training is much larger than the set of face images: 400 non-face images versus 200 face images.

3.2 Classification

Neural networks are trained for the task of classification/detection of a face. Two approaches have been used. The first implements the neural network using the MATLAB toolbox. The second implements the neural network in C++, with a graphical display of the weights to gain insight into the weight structure of the networks.

3.2.1 First Approach Using MATLAB

Two different neural network architectures are employed. The first is a simple multilayer network, shown in Figure 7. The number of hidden layers and the number of nodes in the hidden layers are varied to obtain the optimum network; the maximum number of hidden layers chosen is 2. The input to both types of network is the 20 x 20 pixel image, so the number of input nodes is 400. The second architecture is a multiple network, comprising three sub-networks each having one hidden layer; Figure 8 provides a view of the network. The first sub-network has 4 hidden nodes and takes the input image as a 2 x 2 grid. The second sub-network has 16 hidden nodes, taking the input image as a 4 x 4 grid. The third sub-network has 20 hidden nodes and takes each row of the input image as one input. The outputs of the three sub-networks are fed into a single output node (one possible wiring is sketched after Figure 8).

Figure 7: Multilayer neural network

Figure 8: Multiple neural network. The multiple network has three sub-networks.
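One plausible reading of this wiring is sketched below: a helper that returns the pixel indices feeding a given hidden node in each sub-network (a 2 x 2 grid of 10 x 10 blocks, a 4 x 4 grid of 5 x 5 blocks, or one 20-pixel image row per node). The grid-to-block interpretation and all names are our assumptions based on the description above.

#include <vector>

// Pixel indices (row-major, 20x20 image) feeding one hidden node of one
// of the three sub-networks of the multiple network.
std::vector<int> receptiveField(int subnet, int node) {
    const int W = 20;
    std::vector<int> px;
    if (subnet == 0) {                        // 2x2 grid: node in 0..3
        int bx = (node % 2) * 10, by = (node / 2) * 10;
        for (int y = by; y < by + 10; ++y)
            for (int x = bx; x < bx + 10; ++x) px.push_back(y * W + x);
    } else if (subnet == 1) {                 // 4x4 grid: node in 0..15
        int bx = (node % 4) * 5, by = (node / 4) * 5;
        for (int y = by; y < by + 5; ++y)
            for (int x = bx; x < bx + 5; ++x) px.push_back(y * W + x);
    } else {                                  // one image row per node: 0..19
        for (int x = 0; x < W; ++x) px.push_back(node * W + x);
    }
    return px;
}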

The backpropagation algorithm is used to train the network. Backpropagation has many variations; we use batch-mode steepest descent with adaptive learning rate and momentum. The network outputs 1 for a face and -1 for a non-face. The hyperbolic tangent is the transfer function chosen for each node to compute its output from its inputs. The training set contained 539 images (180 face and 359 non-face) and the test set contained 107 images (36 face and 71 non-face).

Localization

The neural network trained in the previous step is employed to search for faces in an image. The network is applied to all possible 20 x 20 pixel windows of the image. As the network is trained on 20 x 20 images, it would detect faces of only this size, but the faces in an image can be larger. We therefore use an image pyramid which contains the input image at the top and a sub-sampled image at each lower level. The sub-sampled image at each lower level is obtained by down-sampling the image at the level above by a factor of 1.1765 (1/0.85), so each dimension of the down-sampled image is 0.85 times the corresponding dimension of the image one level up. The task can be easily understood from Figure 9. Each 20 x 20 window in the image is preprocessed before being fed to the network.

Figure 9: The algorithm for face localization. The face is searched for in all the images in the input pyramid.
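The pyramid bookkeeping can be made concrete with the small, self-contained sketch below, which only counts the 20 x 20 window positions per level; the 320 x 240 starting size is an arbitrary example, not a figure from the report.

#include <cstdio>

// Each pyramid level is 0.85x the size of the level above; a 20x20 window
// slides over every position of every level. 'scale' maps level coordinates
// back to the original image, so a hit covers a box of side 20*scale.
int main() {
    int w = 320, h = 240;
    double scale = 1.0;
    long totalWindows = 0;
    for (int level = 0; w >= 20 && h >= 20; ++level) {
        long windows = (long)(w - 19) * (h - 19);   // 20x20 positions
        totalWindows += windows;
        std::printf("level %d: %dx%d, scale %.3f, %ld windows\n",
                    level, w, h, scale, windows);
        w = (int)(w * 0.85);       // down-sample by factor 1/1.1765 = 0.85
        h = (int)(h * 0.85);
        scale *= 1.0 / 0.85;
    }
    std::printf("total windows to classify: %ld\n", totalWindows);
}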

3.2.2 Second Approach using C++


The first step was to implement the neural nets. The general algorithm is presented in Algorithm 1.

To initially test the neural network implementation, simple classification examples such as the XOR problem are employed. Observing the training sessions ensures that the implementation is solid enough to use as a foundation to build upon. A simplified image test case is attempted as an initial image detection step, using hand-drawn images of chevrons [9]. For computational efficiency, the size of the neural network is restricted to 20 hidden nodes; with a 20 x 20 input image field this yields 20 x 20 x 20 = 8000 weights in the first network layer.

OpenGL is used extensively for debugging and visualization. Network weights are drawn in space in variable shades of green and red to indicate positive and negative values respectively. After a successful training run, a set of input images will often leave a recognizable imprint on the weights of the network, providing a cursory indication that the network may have learned the given pattern. While no hard claims are being made about this technique's effectiveness in machine learning, it can offer some intuitive insight into the nature of neural networks, and could evoke interest in future areas of study.

The next step in perfecting the learning process is the addition of negative training examples. Our cursory approach was to add a number of random images, consisting only of noise and expecting an output of 0 (not face), to the negative training set. We quickly found that such images do not benefit the learning process at all, due to their random, structure-less nature: a network presented with a set of positive inputs did no worse than one given the same positive inputs plus some random examples.

Our next approach to negative examples was to add a number of all black images to the training set. This approach proved much more effective, and clearly provided a benefit to the network's ability to classify. The flat black image provided a baseline for the network to compare to when classifying. Intuitively, this makes sense: a black image will "quiet" all input weights in the network equally, in contrast to the face images, which excite ONLY the critical face weights. Training on a random image would have an effect very similar to just adding a random component to all the weights, adding no structure to the classifier.

Now that our classifier's ability to learn images has been proven, we focus on faces specifically. As discussed in the pre-processing section, input images from datasets obtained online are carefully pre-processed using Matlab. A simplified, manual version of bootstrapping is employed to obtain negative images, in which non-face images are hand-cropped and then automatically sent through the face pre-processing steps.

The face detection system is now fully functional. The datasets used are the same as the ones used by the first approach; this set is used as a baseline for evaluating different network parameters and their effectiveness in learning face patterns. The first parameter to vary is the number of hidden layers; the trivial case of zero provides a classifier of minimal classification ability. A second parameter to vary is the size of the network's hidden layers. We conducted a cursory study of networks with only one hidden layer, varying its size from 20 to 60. The results are presented in the graph in the results section.

Algorithm Discussion

The backpropagation algorithm was central to the project's operation. The basic algorithm, implemented using the equations on p. 98 of [11], can be summarized as follows:

Algorithm 1

Let x denote the input and t denote the desired target.
for each <x, t> in the training examples do:
    input instance x to the network and compute the output o_u of every unit u
    for each network output unit k, calculate its error term delta_k:
        delta_k <- o_k * (1 - o_k) * (t_k - o_k)
    for each hidden unit h, calculate its error term delta_h:
        delta_h <- o_h * (1 - o_h) * sum over output units k of (w_kh * delta_k)
    update each network weight w_ji:
        w_ji <- w_ji + Delta_w_ji, where Delta_w_ji = eta * delta_j * x_ji

The momentum extension of backpropagation was also employed, using a separate coefficient alpha to scale the previous iteration's DeltaW values and add them back into the weight matrix. The momentum equation is given by:

Delta_w_ji(n) = eta * delta_j * x_ji + alpha * Delta_w_ji(n-1)    ...(1)
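A self-contained sketch of Algorithm 1 combined with the momentum update of equation (1) is given below, using one hidden layer of sigmoid units (whose derivative o(1-o) appears in the delta terms) and the XOR problem as the training task, the same sanity check described earlier in this section. The network sizes, learning rate and momentum coefficient are illustrative choices, not the project's settings.

#include <cmath>
#include <cstdio>
#include <cstdlib>

const int NIN = 2, NHID = 4, NOUT = 1;
double wH[NHID][NIN + 1], wO[NOUT][NHID + 1];             // last column = bias
double dwH[NHID][NIN + 1] = {}, dwO[NOUT][NHID + 1] = {}; // previous updates

double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

void forward(const double* x, double* oH, double* oO) {
    for (int h = 0; h < NHID; ++h) {           // bias input is fixed at 1.0
        double z = wH[h][NIN];
        for (int j = 0; j < NIN; ++j) z += wH[h][j] * x[j];
        oH[h] = sigmoid(z);
    }
    for (int k = 0; k < NOUT; ++k) {
        double z = wO[k][NHID];
        for (int h = 0; h < NHID; ++h) z += wO[k][h] * oH[h];
        oO[k] = sigmoid(z);
    }
}

int main() {
    const double eta = 0.5, alpha = 0.9;       // learning rate and momentum
    double X[4][2] = {{0,0},{0,1},{1,0},{1,1}}, T[4] = {0,1,1,0};
    auto seed = [] { return (double)std::rand() / RAND_MAX - 0.5; };
    for (int h = 0; h < NHID; ++h)
        for (int j = 0; j <= NIN; ++j) wH[h][j] = seed();
    for (int k = 0; k < NOUT; ++k)
        for (int h = 0; h <= NHID; ++h) wO[k][h] = seed();

    double oH[NHID], oO[NOUT];
    for (int epoch = 0; epoch < 20000; ++epoch) {
        for (int ex = 0; ex < 4; ++ex) {
            forward(X[ex], oH, oO);
            // Error terms, exactly as in Algorithm 1.
            double dO[NOUT], dH[NHID];
            for (int k = 0; k < NOUT; ++k)
                dO[k] = oO[k] * (1 - oO[k]) * (T[ex] - oO[k]);
            for (int h = 0; h < NHID; ++h) {
                double s = 0;
                for (int k = 0; k < NOUT; ++k) s += wO[k][h] * dO[k];
                dH[h] = oH[h] * (1 - oH[h]) * s;
            }
            // Weight updates with momentum, equation (1).
            for (int k = 0; k < NOUT; ++k)
                for (int h = 0; h <= NHID; ++h) {
                    double in = (h < NHID) ? oH[h] : 1.0;
                    dwO[k][h] = eta * dO[k] * in + alpha * dwO[k][h];
                    wO[k][h] += dwO[k][h];
                }
            for (int h = 0; h < NHID; ++h)
                for (int j = 0; j <= NIN; ++j) {
                    double in = (j < NIN) ? X[ex][j] : 1.0;
                    dwH[h][j] = eta * dH[h] * in + alpha * dwH[h][j];
                    wH[h][j] += dwH[h][j];
                }
        }
    }
    for (int ex = 0; ex < 4; ++ex) {           // should print ~0, 1, 1, 0
        forward(X[ex], oH, oO);
        std::printf("%g XOR %g -> %.3f\n", X[ex][0], X[ex][1], oO[0]);
    }
}

As with any randomly seeded run, an unlucky initialization may stall in a local minimum, which is exactly the behaviour the re-seeding heuristic in Section 4 addresses.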

The principle of a feed-forward neural network is to model multi-dimensional dependencies in a compact, concise framework. An input vector of arbitrary size n > 1 is presented to the network. An output vector of fixed size m is calculated by summing the contributions of all the input values multiplied by their corresponding weights and passing each sum through the transfer function. This process is repeated over all the layers of the network, feeding the output of the upstream layer to the input of the downstream layer.

Every layer consists of one n x m matrix of weight values, where n is one more than the number of inputs and m is the number of outputs. The extra input row allows for a bias in the network, which is effectively an extra input that is always set to 1.0.
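The layer computation just described might look like the following sketch, where the (n+1) x m weight matrix is stored row-major and its last row multiplies a constant bias input of 1.0; the tanh transfer function follows the Matlab experiments, and the storage layout is our assumption.

#include <cmath>
#include <vector>

// One layer's forward step: out_j = tanh(sum_i W[i][j]*in_i + W[n][j]*1.0),
// with W stored row-major as (n+1)*m doubles.
std::vector<double> forwardLayer(const std::vector<double>& in,   // size n
                                 const std::vector<double>& W,    // (n+1)*m
                                 int m) {
    int n = (int)in.size();
    std::vector<double> out(m);
    for (int j = 0; j < m; ++j) {
        double z = W[n * m + j];                  // bias row times 1.0
        for (int i = 0; i < n; ++i) z += W[i * m + j] * in[i];
        out[j] = std::tanh(z);                    // transfer function
    }
    return out;
}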

Backpropagation relies on the delta rule, conceptually a very simple learning algorithm for neural networks. The delta rule corrects a network's weights so that the next time it is presented with a particular example (for which the correct classification is known), its output will be closer to that known correct output. It adds the resulting "delta" weights into the network's weight matrix, scaled by a coefficient eta, the learning rate. A small eta (0.1-0.5) is preferable so that the network can find a good compromise of weights among all of its training examples without over-learning the last presented example, which would produce a thrashing effect.

In the multi-layer case, where we are dealing with hidden layers, the delta rule is not as straightforward. In particular, it is difficult to determine what values the hidden nodes should take to produce a desired output; in fact, they could take any number of values and the network could still feasibly learn the pattern. This is the nature of the neural net's learning ability: to abstract hidden details of input patterns into hidden nodes. These details are often not perceivable by humans, but they are nonetheless effective means of learning how to classify inputs. For any given set of input-output pairs, the neural net can (ideally) learn the aspects of the positive set that distinguish all positive instances from the negative set by adjusting its weights according to the error gradient, and it can do so with an entirely different set of hidden weights every time.

Oftentimes, a neural net's error can go up before it comes down. This makes a training run heavily dependent on the initial seeding of the weight matrices. The most common approach is the one we adopted: small random values from -0.5 to 0.5. Varying this parameter is another possible axis of system parameterization, one which was not explored due to time constraints.

3.3 Face Databases used


- BioID Database: http://www.bioid.com/downloads/facedb/facedatabase.html
- AT&T Database: http://www.uk.research.att.com/facesataglance.html
- CMU Database: http://vasc.ri.cmu.edu/idb/html/face/index.html

4. Results

Matlab Approach
As a first step, a neural network with only one hidden layer is used. The number of nodes in the hidden layer is varied from 5 to 100, and the mean squared error on the test set and the training set is recorded over the whole run. The mean squared error of the test set at the end of training is used as the metric of comparison. Figure 10 shows the plot.

Figure 10: Variation of the final test set error with the number of nodes in a network with 1 hidden layer.

To get a better idea of how the test error varies throughout training, Figure 11 shows the test error during training for two node counts.

Figure 11: Test error during training. The top plot is for 20 hidden nodes and the bottom for 40.

Next, we try multiple hidden layers to see whether increasing the number of layers gives better performance. We limit ourselves to 2 hidden layers. Figure 12 shows the test error for two different networks with 2 hidden layers.

Figure 12: Test error for networks with 2 hidden layers. The top plot is for the network with 20 and 5 nodes in layers 1 and 2 respectively; the bottom plot is for the network with 40 and 10 nodes in layers 1 and 2 respectively.

We also implemented the multiple network shown in Figure 8. The test error for the multiple network is shown in Figure 13.

Figure 13: Test error for multiple network

It seems from these figures that multiple hidden layers and the multiple network do not provide any gain in accuracy on the classification task.

Results from C++ Implementation

A series of trial runs on the 1-hidden-layer case was executed, varying the size of the single hidden layer from 13 to 60. The resulting error over test validation sets is graphed below.

Figure 14: Error over test validation sets for a 1-hidden-layer network with the number of nodes varying from 13 to 60.

It should be noted that this is still a preliminary result. Multiple trials were run for each hidden layer size, and all results are plotted. Since it is uncertain how many trials are necessary for the network to complete a training run and achieve its minimum possible test set error rate, each trial was repeated on average two or three times. The resulting graph of single hidden layer size vs. test set error shows a noisy curve with a slight downward slope toward higher numbers of hidden nodes. These results could be enhanced using more detailed statistics, such as error bars, to encompass more data from all of the trial runs.

Qualitative Results

While watching the image-classifying network learn in real time, certain interesting aspects of neural networks became apparent. First of all, they are inherently unstable: the very same neural network under similar (random) initial conditions can sometimes learn and sometimes not learn a particular pattern correctly. To compensate for this, a stopping/re-seeding heuristic was employed: any network whose training set error rate drops below a certain threshold (0.03 in the C++ implementation) is said to have converged and stops learning, while any network whose error has not dropped by a significant percentage (10%) within a certain number of training iterations (30) is re-seeded and begins learning again from random initial weights. Using this heuristic, a training run is not guaranteed to converge, but in practice there have been very few cases where the network has failed to converge.
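The stopping/re-seeding heuristic can be captured in a small helper such as the sketch below; the interface is hypothetical, but the constants (error threshold 0.03, 10% improvement within 30 iterations) are the ones quoted above.

// Tracks training-set error across iterations and decides whether to stop,
// continue, or re-seed. Usage: int s = monitor.update(err);
//   s == 1  -> converged, stop training
//   s == -1 -> re-seed the weights uniformly in [-0.5, 0.5] and restart
//   s == 0  -> keep training
struct ConvergenceMonitor {
    double threshold = 0.03;    // training-set error counted as "converged"
    double minDrop   = 0.10;    // required relative improvement...
    int    window    = 30;      // ...within this many iterations
    double best = 1e9;
    int sinceImprovement = 0;

    int update(double trainError) {
        if (trainError < threshold) return 1;
        if (trainError < best * (1.0 - minDrop)) {
            best = trainError;
            sinceImprovement = 0;
        } else if (++sinceImprovement >= window) {
            best = 1e9; sinceImprovement = 0;   // reset for the fresh run
            return -1;
        }
        return 0;
    }
};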

Further interesting characteristics of the neural networks can be observed when network weights are displayed in more elucidating configurations. One such configuration is to draw every set of input weights to every hidden node in its own local area in space, as shown in Figure 15. This drawing method produces the effect of looking at N separate pseudo-face images in a 1-layer, N-hidden-node network. Such a layout offers some insight into the nature of the hidden nodes and what exactly it is that they learn. One commonly observed phenomenon is two neighboring weight arrays capturing reversed polarities of the input images, i.e. where one is red, the other is green and vice versa. Another common observation is a sequential, increasing horizontal offset of the faces' impression left on the network: one may observe some input arrays with a cropped or cut-off image of a face in them, as if that node has only learned the left half of a face image. The value of the pseudo-image's offset in the weight array may even be related to the location of the hidden node, producing a monotonically increasing offset as each successive hidden node is drawn. This effect can be employed to produce still another representation of the matrix weights, in which weight arrays are drawn at a depth offset from one another, and a pseudo-image of a face can be seen not only in the front face (first weight array) of the matrix, but on its side as well.

Figure 15: Graphical display of the neural network system implemented in C++. The network has 1 hidden layer with 21 nodes. Training set error is graphed in yellow and test set error in red. The array of network connection weights to the 21st hidden node is displayed in the blue box.

Localization results

The network with a single hidden layer of 20 nodes is chosen for searching the image, since Figure 10 shows that it gives the minimum test error, and computation is faster with fewer nodes. The results are presented in Figures 16 and 17.

Figure 16: Localization results

Figure 17: More localization results

5. Discussion and Analysis of Results

Matlab Approach

As can be seen from Figure 10, when the number of hidden nodes is greater than 70 it is hard for the network to learn. Ideally we would want the number of nodes to be as small as possible for faster computation when we search an image during localization, but if the number of hidden nodes is too small the network might not be able to capture the desired characteristics completely. As the plot shows, when the number of nodes is less than 20 the network sometimes has a high test error.

From Figures 12 and 13, we can see that the 2-hidden-layer networks and the multiple network do not increase the accuracy of the results. One might expect the multiple network to perform significantly better, as it can capture more information in each of its sub-networks and combine the results. We think this does not happen because the training face images are manually cropped and not aligned with each other; that is, the features are not in identical positions. The images could be manually processed to accomplish this, but due to time constraints this could not be done. These networks also take much more time to learn and are computationally expensive if they have to be used in the localization search.

C++ Implementation with graphical display

The relationship between the number of hidden nodes and the classification error rate seems to faintly indicate a direct relationship between complexity and accuracy. This relationship, however, is not particularly striking; it is still unclear how heavily this graph is affected by noise, and it has yet to be validated by k-fold cross-validation. The results presented were collected entirely on a 1.4 GHz Athlon CPU running Linux kernel 2.4.13 on RedHat 7.1, using the same training and test sets in each trial.

The results of this project could be enhanced in several ways. The first possible enhancement would be to implement k-fold cross-validation and run every experiment k times on k different held-out subsets of the training data, with k ~= 10; this would also increase the time required to run experiments by a factor of 10. A second possible enhancement is the addition of more training and test data, both positive and negative. A ten-fold increase in the data set size would undoubtedly smooth out a lot of the noise evident in the results, while also increasing the turnaround time of experiments by another factor of 10. A third possible enhancement is an increase in the resolution of the input images; such an increase would also increase the computational complexity of the experiments by (image size)^2, and would undoubtedly require more time than was allotted for this project.
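The k-fold protocol suggested above amounts to the following self-contained sketch, which shuffles the example indices once and holds out one contiguous block per fold. The pooled dataset size (539 training + 107 test images) and k = 10 come from the text; everything else is illustrative.

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const int n = 646, k = 10;            // 539 + 107 images pooled
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0); // 0, 1, ..., n-1
    std::mt19937 rng(42);
    std::shuffle(idx.begin(), idx.end(), rng);
    for (int fold = 0; fold < k; ++fold) {
        // Hold out one contiguous block of the shuffled indices for testing.
        int lo = fold * n / k, hi = (fold + 1) * n / k;
        std::vector<int> test(idx.begin() + lo, idx.begin() + hi);
        std::vector<int> train;
        train.insert(train.end(), idx.begin(), idx.begin() + lo);
        train.insert(train.end(), idx.begin() + hi, idx.end());
        std::printf("fold %d: %zu train, %zu test\n",
                    fold, train.size(), test.size());
        // Train the network on 'train' and record the error on 'test' here.
    }
}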

Localization

Figures 16 and 17 show that the neural network is able to localize faces in an image. The process is computationally very expensive, as we have to sub-sample the image down to very low resolution to extract all the faces. The number of false positives was high: around 20 in these 4 images. The number of false negatives was around 10, and these occurred only in the images in Figure 16. The number of windows the network searches in an image is very large (of the order of 1000), and in comparison the percentage of false positives can be considered minimal. This task is much harder than the face classification task because of the space of faces and non-faces it covers. As our training set is small, we do not expect the neural network to perform very well on the localization task, but we think the results are encouraging for future work. One approach to improving the results is bootstrapping, in which the false positives found during testing are included as negative examples for further training. This approach was used to a very small extent.

6. Conclusion and Future Work


These results indicate that a face detection system is quite feasible and can be efficiently implemented and deployed on today's hardware. Our simplifying assumptions allowed the system to be tested in many trials with acceptable turnaround times.

Both approaches still require a lot of work to make the system robust and efficient. One task that would benefit both approaches is increasing the size of the training set; this would increase the accuracy of face localization as well. The bootstrap method discussed in the previous section will definitely help in expanding the space of non-faces.

The optimal number of hidden layers in the network is not completely determined. We think more experimentation can be done with different numbers of layers (up to 3) and different numbers of nodes to obtain the best results.

Frequency of network convergence could also be explored. A large number of trials could be run on a given network with set parameters, and statistics could be collected on how often it converges, and within how many training iterations. The resulting data could be used as a higher-dimensional component to the above analysis of classification accuracy vs. hidden layer size, indicating not only the best performance of each individual network, but how often it will perform at that level.

One very logical extension to our C++ implementation is the technique of learning rate annealing. The benefits of such a technique could be explored by running two networks in parallel, identical except for their annealing rates. By varying the annealing rate, we should be able to empirically determine the value of such a process for our specific learning task, and experimentally obtain an effective, practical annealing rate for our own purposes. We could also explore different heuristics on top of annealing, determining the conditions in which annealing is best employed and those in which it is best not used.
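As a concrete example, one common annealing schedule is sketched below. The report does not commit to a particular schedule, so the hyperbolic decay and its time constant are purely our assumption.

// Learning rate after n training iterations: eta0 at n = 0, eta0/2 at
// n = T, eta0/3 at n = 2T, and so on; early iterations take large steps
// while later ones settle into a minimum.
double annealedEta(double eta0, long n, double T = 1000.0) {
    return eta0 / (1.0 + n / T);
}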

The exploration of different seeding techniques is a potentially rich domain for future work. Experimental results on the speed and frequency of convergence can be obtained while varying the minimum and maximum values of the random seeding process. Also, entirely different heuristics for seeding weight matrices could be explored, such as sparsely seeding a matrix with a small number of high weights or attempting to capture a portion of some input pattern in the seeding process. This latter technique may help speed convergence, but possibly at the price of over-learning.

We also think the face detection system can be extended to detect facial expressions. Because of the complexity of the problem and the large number of classes involved, further extensions to our existing system would be necessary. The most promising avenue for such extensions is likely feature segmentation and the relative positions of facial features.

References
[1] I. A. Essa, Computers Seeing People. Article in American Association for Artificial Intelligence (AAAI).
[2] M.-H. Yang, D. J. Kriegman and N. Ahuja, Detecting Faces in Images: A Survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
[3] Y. Wang and B. Yuan, Fast method for face location and tracking by distributed behaviour-based agents. IEE Proc. Vision, Image and Signal Processing.
[4] M. Turk and A. Pentland, Eigenfaces for Recognition. J. Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[5] E. Osuna, R. Freund and F. Girosi, Training Support Vector Machines: An Application to Face Detection. Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 130-136, 1997.
[6] H. Rowley, S. Baluja and T. Kanade, Neural Network-Based Face Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, January 1998.
[7] H. Rowley, S. Baluja and T. Kanade, Human Face Detection in Visual Scenes.
[8] H. Rowley, S. Baluja and T. Kanade, Rotation Invariant Neural Network-Based Face Detection.
[9] O. A. Uwechue and A. S. Pandya, Human Face Recognition Using Third-Order Synthetic Neural Networks.
[10] D. A. Forsyth and J. Ponce, Computer Vision: A Modern Approach.
[11] T. M. Mitchell, Machine Learning. McGraw-Hill, 1997.
