
Emotion Recognition through Facial Expressions

John Chia, Department of Computer Science, University of British Columbia, Vancouver, BC. johnchia@cs.ubc.ca

Abstract
In this project, the problem of emotion recognition through visible facial features is examined. A system is proposed that performs L1 feature selection and is shown to perform on par with a state-of-the-art comparison system. The dataset used is the Cohn-Kanade facial expression database, and a system based on a combination of support vector machines and multinomial logistic regression is used for comparison and evaluation.

Introduction

Facial expression analysis involves the recognition and interpretation of movements in human faces. Possibly the earliest known work in the field came from [Darwin, 1872], which claimed that facial expressions were universal. Later work identified the six primary emotions, still commonly used today, that are said to be universal across human cultures: happiness, sadness, fear, disgust, surprise and anger [Ekman and Friesen, 1971]. More recently, driven by the progressing state of technology, there has been interest in automatic facial expression analysis and, more concretely, facial expression recognition. Facial expression recognition is concerned with recognizing certain facial movements without attempting to determine the underlying emotional state of the agent. The justification for this comes from the relationship between facial expressions and underlying emotional states, which do not necessarily map deterministically onto one another. For example, facial expressions may result from physical exertion rather than emotional state; in this case, the emotional state is hidden, or prevented from expressing itself through the face. For this reason, some have argued that facial expression interpretation must rely on more than just visual information. Nevertheless, there is interest in what can be done with visual information alone. This project's focus is on facial expression recognition through visual analysis of still frames from video. It seeks to classify facial expressions into one of the six primary emotions given by [Ekman and Friesen, 1971]. One useful and widely deployed application (in many digital cameras) of a very similar problem is blink detection [Wang et al., 2009]. Additional fields and applications of interest to this problem are psychiatry, lie detection, intelligent environments and emotionally intelligent computer interfaces [Tian et al., 2005].

Related Work

This section provides an overview of the field of facial expression recognition. It first describes the general high-level procedure followed by most approaches, then describes a few of the latest methods in the literature. Most previous work on this problem can be decomposed into the general sequence shown in Figure 1 [Tian et al., 2005].

A preliminary acquisition step detects the face and crops the image so that the facial features are aligned. This step may also determine head pose, but nearly all techniques assume a frontal view. The prevalent method for face detection is that of [Viola and Jones, 2001]. Next, there is an extraction step where the features of interest are acquired from the raw image data. The techniques at this step can be grouped into those using geometric features and those using appearance features. The former group involves estimating the shape of the face and extracting features from that shape estimate. The latter is concerned with texture or pixel intensity; techniques such as SIFT and optical flow fall into this category. Finally, the recognition step classifies these features as one of several emotions. A variety of approaches have been used for this step: neural networks, support vector machines, Bayesian networks and linear discriminant analysis, for example. Table 1 summarizes the performance of two of the best performing systems [Tian et al., 2005]. The methods listed are limited to those with an approach similar to the proposed work: classify a frame into one of six basic emotion-representative expressions.

Figure 1: General high-level procedure for a facial expression recognition system.

The standard database shared by these approaches was assembled by [Kanade et al., 2000] and is comprised of 500 frontal-view image sequences from 100 subjects. Each subject performed a series of 23 facial displays starting from a neutral face. The data is labelled using Facial Action Coding System (FACS) codes, a set of abstract representations for facial expressions introduced by [Ekman and Friesen, 1978]. An inter-observer agreement analysis was performed, and the mean kappa for inter-observer agreement was 0.86.

System   Method                                                      Accuracy
UCSD     Support vector machine + multinomial logistic regression    91.5%
UIUC     Neural network + Gaussian mixture model                     71.0%

Table 1: Performance results from two related methods.

The UCSD system used the combination of a support vector machine and multinomial logistic regression with a ridge regularizer to classify frames into emotions with an accuracy of 91.5%. They also tested several other methods in place of logistic regression, including nearest neighbour and a voting-scheme-based approach, but found that multinomial logistic regression performed best [Littlewort et al., 2002]. The UIUC system also used a two-stage approach, where a neural network was first used to distinguish neutral expressions from non-neutral expressions. They then employed a Gaussian mixture model to classify the remaining expressions, with an accuracy of 71.0% [Wen and Huang, 2003].

Image Features

To collect image features, first the face must be extracted from the image and its features localized. Then, the raw image data may be used directly in classification. In this project, we also use features derived from a set of Gabor filtered images.

3.1 Face extraction

Face extraction was performed with the Viola-Jones face detector as implemented in OpenCV, with a MATLAB interface written by Sreekar Krishna (sreekar.krishna@asu.edu). The output of the face detector was rescaled and cropped to 48x48 pixels. Localization performance is satisfactory. An example of the output of this stage of processing is given in Figure 2.

Figure 2: Result of facial localization using the Viola-Jones algorithm as implemented by OpenCV and the Haar training data.
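As a concrete illustration, the sketch below mirrors this stage using OpenCV's Python API rather than the MATLAB interface actually used in the project; the cascade file and detection parameters are assumptions, not the project's exact settings.

```python
# Sketch: Viola-Jones face detection and 48x48 cropping with OpenCV's Python
# API. The cascade file shipped with opencv-python is assumed; scaleFactor
# and minNeighbors are illustrative defaults, not the project's settings.
import cv2

def extract_face(image_path, size=48):
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; caller must handle this case
    x, y, w, h = max(faces, key=lambda r: r[2] * r[3])  # keep largest detection
    return cv2.resize(gray[y:y + h, x:x + w], (size, size))
```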

3.2 Gabor filter bank

Gabor filters are an image processing technique said to mimic certain processes in the human visual cortex. In 2D, when applied to images, they isolate edges of a certain spatial frequency and direction. A bank of 40 Gabor filters was used, with eight orientations and five spatial frequencies from 4 to 16 pixels. The output of this stage is given in Figure 3.
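A minimal sketch of such a bank, again in Python with OpenCV, is given below; the kernel size, sigma and gamma values are assumptions, since the text only fixes the orientations and wavelengths.

```python
# Sketch: a 40-filter Gabor bank (8 orientations x 5 wavelengths from 4 to 16
# pixels), mirroring the description above; ksize, sigma and gamma are
# assumptions, not the project's exact parameters.
import numpy as np
import cv2

def gabor_features(face):
    responses = []
    for wavelength in np.linspace(4, 16, 5):   # 4, 7, 10, 13, 16 pixels
        for k in range(8):                     # 8 orientations
            theta = k * np.pi / 8
            kernel = cv2.getGaborKernel(
                ksize=(31, 31), sigma=0.5 * wavelength, theta=theta,
                lambd=wavelength, gamma=0.5)
            responses.append(cv2.filter2D(face.astype(np.float32), -1, kernel))
    # 40 filtered 48x48 images -> 92,160 features, matching the Discussion
    return np.concatenate([r.ravel() for r in responses])
```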

Two-stage classification

The skewed class distribution (Table 2), as well as limitations of the support vector machine, inspired a two-stage approach consisting of a support vector machine applied directly to the image data (raw and filtered) followed by multinomial logistic regression. Unfortunately, the original database annotations used in the [Littlewort et al., 2002] paper are not available. An automated mapping was performed from facial expression action units to emotions according to a guide provided with the database. As well, a set of annotations performed manually by Gwen Littlewort was used.

4.1 Support vector machine

The first stage combined each class with another class (e.g. neutral-happy, sad-happy) and solved the binary classification problem of discriminating between that pair of classes and the rest. This was said to improve performance over the non-paired approach, since more training examples were available for the support vector machine. A total of 21 class pairs are present in the dataset, so 21 independent support vector machines are trained. The outputs of these SVMs form a vector that is the input to the next stage of classification. This can be viewed as a feature reduction stage: for example, taking an image consisting of 48x48 = 2304 features (in the raw case) and compressing it down to 21 while preserving information about the class pair distributions. To implement this stage, we used Kevin Murphy's pmtk3 (available at http://code.google.com/p/pmtk3/) and its interface to libsvm.
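A sketch of this stage using scikit-learn in place of pmtk3/libsvm is shown below; the pair-versus-rest labelling follows the description above, while the linear kernel and regularization settings are assumptions.

```python
# Sketch of the pairwise SVM feature-reduction stage, using scikit-learn in
# place of the pmtk3/libsvm implementation actually used. Each SVM separates
# one pair of classes from the rest, as described in the text.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

CLASSES = range(7)  # neutral + six basic emotions -> C(7,2) = 21 pairs

def train_pair_vs_rest_svms(X, y):
    svms = {}
    for a, b in combinations(CLASSES, 2):
        target = np.isin(y, [a, b])  # pair of classes vs. the rest
        svms[(a, b)] = SVC(kernel="linear").fit(X, target)
    return svms

def svm_margin_features(svms, X):
    # One signed margin per class pair: a 21-dimensional feature vector per
    # image, which becomes the input to the logistic regression stage.
    return np.column_stack([clf.decision_function(X) for clf in svms.values()])
```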

Figure 3: Output of the Gabor filter bank on a single image: each row is a different orientation (eight orientations in total) while each column is a different wavelength (five wavelengths in total).

Emotion     Automatic   Gwen
Surprise    74          66
Fear        3           30
Happiness   23          76
Sadness     14          53
Disgust     33          45
Anger       19          34
Total       166         304

Table 2: Number of examples in each class, by annotator. A neutral example exists for each example in each class, so the total number of training examples actually used is double the total given here, with the neutral class severely outweighing the other classes.

4.2 Multinomial logistic regression

In the next stage, the 21 SVM outputs are classified into one of seven emotions using multinomial logistic regression. The weights (Figure 4) when using the automated annotations are extremely sparse. This suggests that the dataset is not complete enough to support reliable conclusions from any subsequent steps; hence, we consider only Gwen's annotations from this point forward.


Figure 4: Optimal multinomial logistic regression weights: automated annotations (left) and Gwen's annotations (right).
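For concreteness, a minimal sketch of this second stage follows; scikit-learn's L2 penalty stands in for the ridge regularizer mentioned in Section 2, and the regularization strength is an assumption.

```python
# Sketch: multinomial logistic regression over the 21 SVM margins. The L2
# penalty plays the role of the ridge regularizer; C and max_iter are
# assumptions, not the project's settings.
from sklearn.linear_model import LogisticRegression

def train_emotion_classifier(svm_margins, y):
    # svm_margins: (n_examples, 21) array from the pairwise SVM stage
    return LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(
        svm_margins, y)
```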

Nearest Shrunken Centroids

Nearest shrunken centroids [Tibshirani et al., 2002] [Hastie et al., 2010] is an approach similar to nearest-centroid classification, but each class centroid is shrunken towards the overall centroid of the data. It encourages the features that don't characterize a class to shrink towards zero. It has been shown to perform well at classification tasks with a disproportionately high number of features compared with the number of training examples, such as gene classification. It is natural to expect that it could perform well in vision tasks, especially given the Gabor bank image representation, which has a high number of features compared to the number of training examples. The objective function minimizes the 2-norm of the example-centroid difference, regularized with an L1 norm on the deviations from the cluster means $m_{cj}$:

$$ J = \frac{1}{2} \sum_{j=1}^{D} \sum_{c=1}^{C} \sum_{i:\, y_i = c} \left( \frac{x_{ij} - m_j - m_{cj}}{s_j} \right)^2 + \lambda \sum_{j=1}^{D} \sum_{c=1}^{C} \frac{\sqrt{N_c}}{s_j}\, |m_{cj}| $$

The MLE for the cluster mean, $m_j$, is:

$$ m_j = \frac{1}{C} \sum_{c=1}^{C} \frac{1}{N_c} \sum_{i:\, y_i = c} (x_{ij} - m_{cj}) $$

The subgradient of $J$ with respect to the cluster offset $m_{cj}$ is given by:

$$ \partial_{m_{cj}} J = -\sum_{i:\, y_i = c} \frac{x_{ij} - m_j - m_{cj}}{s_j^2} + \lambda \frac{\sqrt{N_c}}{s_j}\, \partial |m_{cj}| = \frac{N_c}{s_j^2} \left( m_{cj} - \frac{s_j d_{cj}}{\sqrt{N_c}} \right) + \lambda \frac{\sqrt{N_c}}{s_j}\, \partial |m_{cj}| $$

where we defined $d_{cj} = \sqrt{N_c}\, \frac{\bar{x}_{cj} - \bar{x}_j}{s_j}$. Expanding the subderivative $\partial |m_{cj}|$:

$$ \partial_{m_{cj}} J = \begin{cases} \frac{N_c}{s_j^2} \left( m_{cj} - \frac{s_j d_{cj}}{\sqrt{N_c}} \right) - \lambda \frac{\sqrt{N_c}}{s_j} & m_{cj} < 0 \\[4pt] \left[ -\frac{\sqrt{N_c}\, d_{cj}}{s_j} - \lambda \frac{\sqrt{N_c}}{s_j},\; -\frac{\sqrt{N_c}\, d_{cj}}{s_j} + \lambda \frac{\sqrt{N_c}}{s_j} \right] & m_{cj} = 0 \\[4pt] \frac{N_c}{s_j^2} \left( m_{cj} - \frac{s_j d_{cj}}{\sqrt{N_c}} \right) + \lambda \frac{\sqrt{N_c}}{s_j} & m_{cj} > 0 \end{cases} $$

Setting $\partial_{m_{cj}} J = 0$ and solving for $m_{cj}$ we have:

$$ m_{cj} = \begin{cases} s_j (d_{cj} + \lambda)/\sqrt{N_c} & d_{cj} < -\lambda \\ 0 & |d_{cj}| \le \lambda \\ s_j (d_{cj} - \lambda)/\sqrt{N_c} & d_{cj} > \lambda \end{cases} \;=\; \frac{s_j}{\sqrt{N_c}}\, \mathrm{soft}(d_{cj};\, \lambda) $$

where $\mathrm{soft}(d_{cj};\, \lambda) = \mathrm{sign}(d_{cj})(|d_{cj}| - \lambda)_+$ is a soft thresholding operation which is zero for $|d_{cj}| < \lambda$. In this work, we choose $\lambda$ by cross-validation. The cluster centroids and their respective offsets from the global mean are given in Figure 5.
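The sketch below implements these updates directly, assuming for simplicity that $s_j$ is the overall per-feature standard deviation (Tibshirani et al. pool within-class standard deviations) and that classification uses a plain standardized distance without class priors; scikit-learn's NearestCentroid with shrink_threshold offers a comparable off-the-shelf scheme.

```python
# A minimal sketch of nearest shrunken centroids following the equations
# above: per-class deviations d_cj are soft-thresholded by a cross-validated
# lambda. Simplifications (noted in the lead-in): s_j is the overall std,
# and prediction ignores class priors.
import numpy as np

def fit_shrunken_centroids(X, y, lam):
    classes = np.unique(y)
    m = X.mean(axis=0)            # overall centroid m_j
    s = X.std(axis=0) + 1e-8      # per-feature scale s_j (simplified)
    centroids = {}
    for c in classes:
        Xc = X[y == c]
        d = np.sqrt(len(Xc)) * (Xc.mean(axis=0) - m) / s    # d_cj
        d = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)   # soft(d_cj; lambda)
        centroids[c] = m + s * d / np.sqrt(len(Xc))         # m_j + m_cj
    return centroids, s

def predict(centroids, s, X):
    classes = list(centroids)
    # classify each example to the nearest (standardized) shrunken centroid
    dists = np.stack([(((X - centroids[c]) / s) ** 2).sum(axis=1)
                      for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]
```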


Figure 5: Centroids for each emotion (top) and the cluster offsets from the global mean (bottom).

Discussion

Experiments were run on the Cohn-Kanade database using the aforementioned algorithms and annotations provided by a certified FACS coder (Gwen Littlewort), with both raw image data and Gabor bank filtered image data as features. It should be noted that the interpretation of expressions as emotions is extremely subjective, and the results of these experiments should be taken with a grain of salt. To minimize this subjectivity in future studies, additional annotators should be used and an agreement study performed.

The results of the experiments are summarized in Table 3 and confusion matrices are given in Figure 6. The first thing of note is that both classifiers benefit significantly from Gabor bank filtering; this is unsurprising, as Gabor bank filtering is now a standard tool in computer vision. With just raw image data as input, the nearest shrunken centroids classifier performs significantly worse than the SVM+MLR classifier in testing; however, the Gabor representation gave the centroids classifier equivalent performance. It is likely the significant increase in the dimension of the feature vector (2304 features to 92,160 features) that made this possible. In this high-dimensional space, the nearest shrunken centroids classifier was able to select more effective feature combinations.

Both classifiers had trouble with the fear emotion (Figure 6, 3rd row/column). This stems from the fact that expressing fear through facial expressions is a very subtle action, and it is prone to misinterpretation. Interestingly, the same figure shows that nearest shrunken centroids does considerably better than the SVM+MLR system in terms of percentage correct per class. This suggests that, given a more even class distribution, the nearest shrunken centroids classifier could outperform the SVM+MLR system.

Method                       Features              Training Error   Testing Error
SVM+MLR                      Raw image data        11%              33%
Nearest shrunken centroids   Raw image data        38%              37%
SVM+MLR                      Gabor bank filtered   16%              20%
Nearest shrunken centroids   Gabor bank filtered   16%              21%

Table 3: Mean train and test error with the Gwen annotations.

Figure 6: Normalized confusion matrices for classification: raw pixel features (top), Gabor image representation (bottom); SVM+MLR (left) and nearest shrunken centroids (right). These were normalized such that each column sums to 1.
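For reference, a normalization of this form can be computed as in the short sketch below; treating columns as one class axis summing to 1 follows the caption, while the exact row/column convention of the figure is an assumption.

```python
# Sketch: column-normalized confusion matrix, so each column sums to 1,
# matching the normalization described in the Figure 6 caption.
import numpy as np
from sklearn.metrics import confusion_matrix

def column_normalized_confusion(y_true, y_pred):
    cm = confusion_matrix(y_true, y_pred).astype(float)
    return cm / cm.sum(axis=0, keepdims=True)
```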

Conclusion

The experimental results, while fairly good, cannot stand on their own without better annotations for the database. The reservations stated earlier are important qualifications to any conclusions: a single annotator's subjective judgements cannot be relied upon to support substantial conclusions. The main contribution of this work is some evidence that nearest shrunken centroids can perform on par with the state-of-the-art (in this domain, at least) SVM+MLR system. Some of the results suggest that it may even perform better if the prior class distribution is explicitly handled. Further, in the implementations that were used, the nearest shrunken centroids classifier was faster and less memory intensive than the two-stage 21-SVM with multinomial logistic regression approach we chose for comparison.

Future work in this area may benefit from classifying facial action units rather than emotions. These should be less subjective and ambiguous, for the reasons stated in the introduction. In the emotion recognition category, the effect of registration errors (misalignment of the eyes and mouth, for example) may be addressed. We have not qualitatively assessed this source of error in this work.

Acknowledgements

Special thanks to Gwen Littlewort, who provided the annotations that were ultimately used in this project. Thanks also to the authors of all the packages used in this project; particularly, Kevin Murphy and all the authors who have contributed to pmtk3, and the authors of libsvm. Without these generous contributions, this project would not have been possible.

Additional notes

The project was modified in the following ways since the project proposal:

- Removed the UIUC system from the proposal: it uses an approach based on 3D models that is not comparable to the systems presented here.
- Removed Fisher LDA from the proposal, as this has been done; see [Bartlett et al., 2004].
- Changed the nearest shrunken centroids approach to operate on the Gabor output for emotion pair classification.

References
[Bartlett et al., 2004] Bartlett, M. S., Littlewort, G., Lainscsek, C., Fasel, I., and Movellan, J. (2004). Machine learning methods for fully automatic recognition of facial expressions and facial actions. In Proc. IEEE Int'l Conf. Systems, Man and Cybernetics, pages 592-597.

[Darwin, 1872] Darwin, C. (1872). The Expression of the Emotions in Man and Animals.

[Ekman and Friesen, 1978] Ekman, P. and Friesen, W. (1978). Facial Action Coding System: A Technique for the Measurement of Facial Movement. Consulting Psychologists Press, Palo Alto.

[Ekman and Friesen, 1971] Ekman, P. and Friesen, W. V. (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124-129.

[Hastie et al., 2010] Hastie, T., Tibshirani, R., and Friedman, J. (2010). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, 2nd edition, corrected 3rd printing.

[Kanade et al., 2000] Kanade, T., Cohn, J. F., and Tian, Y. (2000). Comprehensive database for facial expression analysis. In Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition, pages 46-53.

[Littlewort et al., 2002] Littlewort, G., Fasel, I., Bartlett, M. S., and Movellan, J. R. (2002). Fully automatic coding of basic expressions from video. Technical report, University of California, San Diego, INC MPLab.

[Tian et al., 2005] Tian, Y.-L., Kanade, T., and Cohn, J. F. (2005). Facial Expression Analysis. Springer.

[Tibshirani et al., 2002] Tibshirani, R., Hastie, T., Narasimhan, B., and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567-6572.

[Viola and Jones, 2001] Viola, P. and Jones, M. (2001). Robust real-time object detection. International Journal of Computer Vision.

[Wang et al., 2009] Wang, L., Ding, X., Fang, C., Liu, C., and Wang, K. (2009). Eye blink detection based on eye contour extraction. In Proc. SPIE, volume 7245, page 72450R.

[Wen and Huang, 2003] Wen, Z. and Huang, T. S. (2003). Capturing subtle facial motions in 3D face tracking. In ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, page 1343, Washington, DC, USA. IEEE Computer Society.
