
2016 IEEE 9th International Workshop on Computational Intelligence and Applications

November 5, 2016, Hiroshima, Japan

Recognition of Persisting Emotional Valence from EEG Using Convolutional Neural Networks

Miku Yanagimoto
Graduate School of Engineering, Yokohama National University, Yokohama, Japan

Chika Sugimoto
Center for Future Medical Social Infrastructure Based on Information Communications Technology, Yokohama National University, Yokohama, Japan

Abstract—Recently there has been considerable interest in EEG-based emotion recognition (EEG-ER), one application of BCI. It is not easy, however, to realize an EEG-ER system that recognizes emotions with high accuracy, because the important information in EEG signals tends to be concealed by noise. Deep learning is a promising tool for grasping the features concealed in EEG data and enabling highly accurate EEG-ER, because deep neural networks (DNNs) may have recognition capability higher than humans'. The publicly available DEAP dataset, built for emotion analysis using EEG, was used in the experiments. The CNN and a conventional model used for comparison are evaluated by tests following an 11-fold cross-validation scheme. Raw EEG data obtained from 16 electrodes, without the usual preprocessing, were used as input. The models classify EEG signals according to the emotional state, "positive" or "negative", elicited by watching music videos. The results show that the more training data there are, the higher the accuracies of the CNNs become (by over 20%). They also suggest that the added training data need not come from the same person's EEG as the test data for the CNN to recognize emotions accurately. The results therefore indicate that EEG properties exhibit not only considerable interpersonal differences but also interpersonal commonality.

Index Terms—BCI, EEG, Emotion recognition, Convolutional neural networks, Deep learning, Interpersonal difference/commonality

I. INTRODUCTION

With the extension of machines' abilities, people are beginning to have more opportunities to work closely with machines. Brain Computer Interaction (BCI) is the field that uses EEG and other brain-function measurements for this human-machine interaction. In particular, the study of recognizing human emotions from EEG data is called EEG-based Emotion Recognition (EEG-ER). An EEG-ER model that takes EEG data as input and outputs appropriate emotional classes has the potential to become a powerful tool in, for example, future neuromarketing. Although (scalp) EEG has the advantages of high temporal resolution, non-invasiveness, and simplicity of recording, it contains much noise and lacks spatial resolution. To make EEG-ER models practical, massive data must be analyzed in order to compensate for these defects of EEG and to discover the properties that characterize a given mental state. Deep learning is well suited to this analysis because Deep Neural Networks (DNNs) can realize high accuracy and versatility simultaneously by learning representations from massive data. There are only a few studies related to using DNNs in EEG-ER at this time; the major examples use Deep Belief Networks (DBNs)[2] or Stacked Autoencoders (SAEs)[3], both of which perform unsupervised pre-training. These studies indicate that DNNs perform more accurately than conventional models.

DBNs and SAEs usually consist of fully-connected layers. Fully-connected layers make a network hold many parameters, which often leads to overfitting. Moreover, the input data must be flattened, which discards the local structure of the input. Feeding raw data into networks consisting of fully-connected layers has therefore usually been avoided: most previous BCI studies used feature values designed by humans, instead of raw data, as input to DNNs, except for detection of P300 [4] or SSVEP [5]. For DNNs to outperform hand-designed feature extractors, however, methods that let them learn superior features are necessary. In contrast, Convolutional Neural Networks (CNNs) have fewer parameters because of their local connections, and the convolutional layer is suitable for raw-data input because it need not break up the structure of the data. Indeed, CNNs exhibit better results than DNNs consisting of fully-connected layers not only in image processing but also in speech[6].

In this work, we mainly evaluate the accuracy of a CNN fed raw EEG data (multichannel time-series signals). The accuracies obtained with several sets of training data are compared, and through this comparison the existence of interpersonal commonality and difference in EEG data is discussed. Generally, filtering and artifact removal are performed as preprocessing in EEG analysis, and these steps are considerably effective for recognition. We do not apply them at all, however, in order to test whether the CNN can extract features and recognize emotions from raw EEG data alone. EEG signals show considerable variability between recording sessions: roughly speaking, EEG data divide into three groups, data that are easily recognized, data that are hard to recognize, and data that almost cannot be recognized. A model that recognizes emotion accurately on one session's data therefore does not always show high accuracy on another session's data. For that reason we adopt a cross-validation scheme for testing the models, which makes the accuracies a more reliable measure of the models' ability.

II. RELATED WORK

Many works in the field of EEG-ER aim at recognizing long-duration emotions with high accuracy[2][3][7][8]. Although the methodology of ERP (Event-Related Potential) is well established in EEG analysis, ERP is not used in these works because of that duration. In EEG-ER it has so far been common to use feature values derived from the frequency domain (e.g. PSD) together with machine-learning classifiers (e.g. SVMs)[3]. For example, Lu et al. proposed the feature value DE (Differential Entropy)[7] and demonstrated its usefulness by comparing the accuracies of different features combined with an SVM or KNN (k-nearest neighbor)[8]. In ER (Emotion Recognition), some studies aim at recognizing emotional valence or arousal from the viewpoint of a circumplex model of affect[9]. Methods of eliciting the intended emotions include watching music videos[10], looking at images[11], and playing games[12]. We also reported on the possibility of CNNs as EEG-ER models in a previous study[13].

III. EEG-BASED EMOTION RECOGNITION MODELS

We implemented EEG-ER models that take EEG data as input and output the estimated emotional class labels. Models consisting of feature extractors and shallow classifiers, which we call "shallow models", were also implemented for comparison.

A. CNN

CNNs are one kind of feedforward DNN, used mainly in image processing. Convolutional layers and pooling layers, special layers with local connections, characterize CNN architectures. Normally, convolutional layers and pooling layers are alternated several times, after which fully-connected layers are arranged.

1) Convolutional layer: The input of the l-th convolutional layer is a 3D array with K 2D feature maps of size W_1 × W_2, whose components are denoted by z^{l-1}_{ijk} (i = 0, ..., W_1 - 1, j = 0, ..., W_2 - 1, k = 0, ..., K - 1). A filter bank consists of M kinds of kernels, with components denoted by h_{pqkm} (p = 0, ..., H_1 - 1, q = 0, ..., H_2 - 1, k = 0, ..., K - 1, m = 0, ..., M - 1). Each kernel, like the input, has K channels, so its size is H_1 × H_2 × K. A parallel calculation with respect to each kernel (m = 0, ..., M - 1) produces the output u_{ijm}, and an activation function f(·) is then applied to u_{ijm} to give the output z^l_{ijm} of the l-th convolutional layer. These operations are written as

u_{ijm} = \sum_{k=0}^{K-1} \sum_{p=0}^{H_1-1} \sum_{q=0}^{H_2-1} z^{l-1}_{i+p,j+q,k} h_{pqkm} + b_{ijm}   (1)

z^l_{ijm} = f(u_{ijm})   (2)

where b_{ijm} is the bias; generally b_{ijm} = b_m is used.

2) Pooling layer: The input feature maps are subsampled independently by a pooling function, resulting in the same number of feature maps with lower temporal resolution. There are several ways of pooling; for instance, max pooling is written as

u_{ijk} = \max_{(p,q) \in P_{ij}} z_{pqk}   (3)

where P_{ij} is the region of the input feature maps to which the function is applied (average pooling instead takes the mean over P_{ij}). Usually no nonlinearity is applied in pooling layers (z_{ijk} = u_{ijk}), and pooling has no parameters that change with learning.
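To make Eqs. (1)-(3) concrete, the following NumPy sketch implements one convolutional layer followed by max pooling. It only illustrates the equations; the array layout, the function names, and the choice of ReLU for f(·) are our own assumptions, not taken from the paper.

import numpy as np

def conv_layer(z, h, b, f=lambda u: np.maximum(u, 0.0)):
    # z: input maps (W1, W2, K); h: kernels (H1, H2, K, M); b: biases (M,)
    W1, W2, K = z.shape
    H1, H2, _, M = h.shape
    u = np.empty((W1 - H1 + 1, W2 - H2 + 1, M))
    for m in range(M):                                   # one output map per kernel
        for i in range(u.shape[0]):
            for j in range(u.shape[1]):                  # Eq. (1), with b_ijm = b_m
                u[i, j, m] = np.sum(z[i:i + H1, j:j + H2, :] * h[..., m]) + b[m]
    return f(u)                                          # Eq. (2)

def max_pool(z, p1, p2):
    # Eq. (3): non-overlapping max over p1 x p2 regions; no learnable parameters
    W1, W2, K = z.shape
    blocks = z[:W1 - W1 % p1, :W2 - W2 % p2, :].reshape(W1 // p1, p1, W2 // p2, p2, K)
    return blocks.max(axis=(1, 3))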
B. Shallow models

1) Feature extraction: The energy spectral densities (ESD) or power spectral densities (PSD) of EEG signals, as well as values transformed from them, are often used in BCI (Brain Computer Interaction) as features of human emotions or intentions, with measurable results[2][3]. A time-series signal cut out by a rectangular window (size W, stride W_S) is denoted by s(t) (t = 0, ..., W - 1). The PSD is given as

PSD(\omega) = \frac{1}{2\pi} \left| \int_{-\infty}^{\infty} s(t) e^{-i\omega t} dt \right|^2 = \frac{F(\omega) F^*(\omega)}{2\pi}   (4)

where F(ω) is the short-time Fourier transform of s(t). We use sums of PSD(ω) as the explanatory variables composing the feature vector "PS".
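In discrete form, Eq. (4) can be evaluated with an FFT over one window. A minimal sketch follows; the discretization and the normalization convention are our own assumptions, not specified by the paper.

import numpy as np

def psd(s, fs=512.0):
    # s: one windowed signal of length W; returns PSD(w) per Eq. (4) and frequency bins
    F = np.fft.rfft(s)                           # discrete counterpart of F(omega)
    pxx = (F * np.conj(F)).real / (2.0 * np.pi)  # F(w) F*(w) / (2 pi)
    freqs = np.fft.rfftfreq(len(s), d=1.0 / fs)
    return pxx, freqs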
2) Shallow classifier: In conventional BCI, Support Vector Machines (SVMs), decision trees, and Linear Discriminant Analysis (LDA) are used as classifiers. Random Forests (RFs) are, like SVMs, a supervised learning technique for pattern recognition. With a moderate depth, number of trees, and other reasonable parameters, RFs are more robust against noise and can classify and regress faster than other statistical learning models. In the training phase, an RF builds decision trees using random subsets of the features on subsets sampled randomly from the dataset. In the classification phase, each tree receives the test data and predicts its class, and the RF outputs the class on which the largest share of the individual trees' predictions agree.

IV. EXPERIMENT AND EVALUATION METHODS

A. Conditions of EEG data

1) Database: We obtained the EEG data (f_S = 512 [Hz]) used to train and test the models from the DEAP dataset[10]. In this database, each subject watched 40 one-minute highlight music videos in randomized order and performed a self-assessment of the levels of certain kinds of emotions at the end of each video. EEG signals were recorded while the subjects watched the videos, and the emotional information includes emotional levels valued 1-9. Hereafter we refer to the one-minute multichannel EEG recording made while a subject watched one video as "one epoch signal". The original dataset of each subject consists of 40 pairs of an epoch signal and its emotional levels.

2) Making the dataset of one subject: In the analysis we used the EEG data recorded from 16 electrodes (Fp1, Fp2, F3, F4, F7, F8, C3, C4, T7, T8, P3, P4, P7, P8, O1, O2) and the valence levels. The datasets of the 22 subjects (s01-s22) recorded in Twente are used for evaluating the models. The basic idea of the classification task is to recognize from the EEG data whether the emotion of each subject is "positive" or "negative". For each subject, 11 epoch signals were taken in descending order from the highest valence level and 11 in ascending order from the lowest (11 is the maximum number for which no valence value is duplicated between the two groups). The dataset of one subject was made of these 22 epochs. The higher and lower groups of 11 epochs each in one subject's dataset are labeled "positive" and "negative", respectively. A code sketch of this construction follows below.

As preprocessing for the EEG analysis, the average of the 5 [sec] of potentials before each epoch was subtracted from the epoch potentials. Removing artifacts and applying a bandpass filter are also often performed in EEG analysis, but we did not do so at all in this work. Artifact removal is usually not fully automated, and determining the frequency band of the filter sometimes requires preliminary experiments on the frequency characteristics of the EEG; it is preferable to omit these laborious preprocesses. A bandpass filter could indeed be chosen following prior research, e.g. 0.3-50 [Hz][2] or 4-45 [Hz][10], but it is not certain that such filters discard no important features, and the same can be said of artifact removal. We aimed to make the models recognize emotions as accurately as possible under the condition of raw EEG input without these preprocesses. The EEG data in the DEAP dataset were recorded with a Biosemi ActiveTwo; note that noises such as commercial AC interference were rejected by the hardware.
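A sketch of the dataset construction described above, assuming one subject's 40 epoch signals and valence ratings are already loaded as arrays (all names here are hypothetical):

import numpy as np

def split_by_valence(epochs, valence, n=11):
    # epochs: (40, n_channels, n_samples); valence: (40,) self-assessed levels 1-9
    order = np.argsort(valence)
    negative = epochs[order[:n]]      # the 11 epochs with the lowest valence
    positive = epochs[order[-n:]]     # the 11 epochs with the highest valence
    return positive, negative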
3) Size of data: The purpose of the present work is to construct EEG-ER models that can recognize human emotions at high time resolution, in order to follow changes of a person's emotional state. When cutting signals into minimal units of data with a rectangular window, the window size and stride should therefore be set as small as possible. The size and stride of the window are set to W = 512 points (= 1 sec × f_S [Hz]) and W_S = 64 points (= 0.125 sec × f_S [Hz]), respectively. Following Sec. III-A1, the size of one minimal unit of data is 512 × 1 × K. The number of minimal units in one subject's dataset is 10,406 (= 22 × {(60 - 1)/0.125 + 1} = 22 × 473).
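The segmentation itself is a simple sliding window; with a 60 s epoch at 512 Hz (30,720 samples), W = 512 and W_S = 64 indeed give (30720 - 512)/64 + 1 = 473 windows per epoch. A sketch (the names are ours):

import numpy as np

def segment(epoch, W=512, stride=64):
    # epoch: (n_channels, n_samples) -> (n_windows, n_channels, W); 473 windows per epoch
    n = (epoch.shape[1] - W) // stride + 1
    return np.stack([epoch[:, i * stride : i * stride + W] for i in range(n)])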
B. Testing method

Three experiments are arranged to evaluate the models by their accuracies. The accuracy is the percentage of minimal data units in the test data that the model predicts correctly:

Accuracy = \frac{TruePositive + TrueNegative}{Positive + Negative} \times 100   (5)

Positive = TruePositive + FalsePositive   (6)

Negative = TrueNegative + FalseNegative   (7)

Positive + Negative equals the amount of test data; for example, Positive + Negative = 20,812 in Experiment 3 (the details are described below).

In this work a cross-validation scheme is always adopted for accuracy evaluation, because EEG data generally exhibit high variability, so accuracies vary across data from different recording sessions; this is demonstrated in Section V for this EEG-ER classification task. We believe an accuracy does not reflect a model's recognition ability unless it is measured on large test data under a cross-validation scheme.

1) Experiment 1: On each subject's dataset we took epoch signals from each class one by one and formed 11 pairs randomly. Eleven tests following an 11-fold cross-validation scheme were performed on each subject dataset, using 10 pairs as training and validation data and 1 pair as test data. Five subject datasets (s01, s06, s11, s16, s21) are used, and each is tested 11 times. In one test, the amounts of training and test data are 9,460 (= 473 × 2 × 10) and 946 (= 473 × 2 × 1), respectively. A code sketch of the pairing and rotation follows below.

2) Experiment 2: Eleven tests following the 11-fold cross-validation scheme were performed on each group of 5 subject datasets. Of the 11 pairs of each subject dataset, 1 pair and 10 pairs, that is, 5 pairs and 50 pairs in total, are used as test and training data, respectively, in one test. The datasets of 20 subjects (s01-s20) are grouped into s01-s05, s06-s10, s11-s15, and s16-s20. In one test, the amounts of training and test data are 47,300 (= 9,460 × 5) and 4,730 (= 946 × 5), respectively.

3) Experiment 3: Eleven tests following the 11-fold cross-validation scheme were performed on all 22 subject datasets together. Of the 11 pairs of each subject dataset, 1 pair and 10 pairs, that is, 22 pairs and 220 pairs in total, are used as test and training data, respectively, in one test. In one test, the amounts of training and test data are 208,120 (= 9,460 × 22) and 20,812 (= 946 × 22), respectively.
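The pairing and 11-fold rotation of Experiment 1 can be written as follows; this is our own minimal rendering, with training left as a placeholder:

import numpy as np

rng = np.random.default_rng(0)
pairs = list(zip(rng.permutation(11), rng.permutation(11)))  # random positive/negative pairs

for k in range(11):                          # 11-fold cross validation
    test_pair = pairs[k]                     # 1 pair (946 windows) for testing
    train_pairs = pairs[:k] + pairs[k + 1:]  # 10 pairs (9,460 windows) for training/validation
    # ...window the epochs, train on train_pairs, then score with Eq. (5):
    # accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)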
C. Conditions of models

Usually cross-validation with grid search is performed on the training data and the accuracies are compared to optimize the models. In this work, however, we do not do this, because it would take 11 times as long as a single test. We determined the model parameters based on the average accuracy of Experiment 3 and applied them to all experiments.

1) CNN: For training, the values of the learnable parameters that maximize the accuracy on the training data are adopted; the accuracy of the model so obtained on the test data is the reported result. The maximum number of training epochs (this "epoch" differs in meaning from the epoch signal defined in Sec. IV-A1), the batch size, and the learning rate are set to 100, 94, and 0.0001, respectively.

Regarding the architecture, 4-layer models consisting of 2 convolutional-pooling layers and 2 fully-connected layers, for example, can learn without problems, but when we tried to train deeper models, the accuracy on the training data stayed at 50% throughout training.

It was difficult to train such networks fast without any special techniques. To resolve the vanishing gradient problem, which we consider the likely cause, we use PReLU [15] as the activation function and apply batch normalization [16] after each convolutional layer. Applying these techniques enables deeper networks to learn fast and smoothly.

The CNN tested in Sec. V is the 7-layer model shown in TABLE I. This model was determined by comparing the accuracies of models with various numbers of layers and units in a preliminary experiment. We observed an increasing trend in accuracy with the number of convolutional layers, but the total number of layers is set to 7 in consideration of computation time; the architecture used is therefore not perfectly optimized, and experimenting with a model tuned in a better way remains future work. The shape (size) of the data is denoted by K × T × 1 (T: number of time points) because multichannel time-series signals are handled. The patches of convolution and pooling stride only along the time axis (the space of T), and their strides are set to 1 in all operations. The dropout rate of the fully-connected layer is set to 0.5. The initialization of h follows [15].

TABLE I
CONFIGURATION OF THE 7-LAYER NN.

no.  input size           layer
1    16×512×1             conv(72×5×1)
     72×508×1             BN → PReLU
2                         conv(100×5×1)
     100×504×1            BN → PReLU
3                         conv(134×5×1)
     134×500×1            BN → PReLU
4                         conv(182×2×1)
     182×499×1            BN → PReLU
5                         conv(141×2×1)
     141×498×1            BN → PReLU
                          average pool(2×1)
6    141×249×1 = 35109    fc(900)
     900                  PReLU → dropout
7                         fc(2)
     2                    softmax
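TABLE I translates directly into code. The paper does not name its software framework, so the following PyTorch rendering is our sketch; the optimizer is likewise unspecified in the paper, which gives only the batch size (94), learning rate (0.0001), and maximum number of training epochs (100).

import torch.nn as nn

# Input: (batch, 16 channels, 512 time points); stride 1, no padding, as in TABLE I.
model = nn.Sequential(
    nn.Conv1d(16, 72, 5),   nn.BatchNorm1d(72),  nn.PReLU(),  # -> 72 x 508
    nn.Conv1d(72, 100, 5),  nn.BatchNorm1d(100), nn.PReLU(),  # -> 100 x 504
    nn.Conv1d(100, 134, 5), nn.BatchNorm1d(134), nn.PReLU(),  # -> 134 x 500
    nn.Conv1d(134, 182, 2), nn.BatchNorm1d(182), nn.PReLU(),  # -> 182 x 499
    nn.Conv1d(182, 141, 2), nn.BatchNorm1d(141), nn.PReLU(),  # -> 141 x 498
    nn.AvgPool1d(2),                                          # -> 141 x 249
    nn.Flatten(),                                             # -> 35109
    nn.Linear(141 * 249, 900), nn.PReLU(), nn.Dropout(0.5),
    nn.Linear(900, 2),  # the softmax of layer 7 is folded into the cross-entropy loss
)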
2) PS: The feature vector is constructed following [3]. First, the 16 time-series signals from the 16 electrodes and the differences of the 8 bilateral pairs are gathered, giving a multichannel time-series signal with 24 (= 16 + 8) channels. The power spectrum is then derived for each channel, and PSD(ω) is integrated over each frequency band: θ (4-8 [Hz]), α (8-13 [Hz]), β (13-30 [Hz]), and γ (30-64 [Hz]). In this way we obtain a 96 (= 24 × 4)-dimensional feature vector from each data unit. This feature vector is not exactly the same as that of [3], because feature normalization is not performed and the numbers of electrodes differ.
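A sketch of the PS construction; the electrode ordering and the bilateral pairing below, e.g. (Fp1, Fp2), (F3, F4), and so on, are our assumptions, since the paper lists the electrodes but not the pairing order:

import numpy as np

BANDS = [(4, 8), (8, 13), (13, 30), (30, 64)]  # theta, alpha, beta, gamma [Hz]
PAIRS = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11), (12, 13), (14, 15)]

def ps_feature(window, fs=512.0):
    # window: (16, W) raw EEG -> 96-dim PS vector (24 channels x 4 bands)
    diffs = np.stack([window[a] - window[b] for a, b in PAIRS])
    chans = np.vstack([window, diffs])                   # 24 channels
    F = np.fft.rfft(chans, axis=1)
    pxx = (F * np.conj(F)).real / (2.0 * np.pi)          # PSD per Eq. (4)
    freqs = np.fft.rfftfreq(window.shape[1], d=1.0 / fs)
    return np.concatenate([pxx[:, (freqs >= lo) & (freqs < hi)].sum(axis=1)
                           for lo, hi in BANDS])         # band-wise sums of PSD(w)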

3) RF: For the trees, the number of features considered when looking for the best split, the function measuring the quality of a split, the maximum depth and number of nodes, the minimum number of samples required to split an internal node, and the minimum number of samples in newly created leaves are set to floor(√96) = 9, Gini impurity, unlimited, 2, and 2, respectively. For the forest, the number of trees is set to 100.
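These settings map directly onto scikit-learn's RandomForestClassifier; a sketch of an equivalent configuration (max_features="sqrt" yields floor(sqrt(96)) = 9 candidate features per split for the 96-dim PS vector):

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100,     # 100 trees
                            criterion="gini",     # Gini impurity
                            max_features="sqrt",  # floor(sqrt(96)) = 9 features per split
                            max_depth=None,       # unlimited depth
                            min_samples_split=2,
                            min_samples_leaf=2)
# rf.fit(X_train, y_train); accuracy = 100.0 * rf.score(X_test, y_test)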
V. RESULTS

A. Experiment 1

TABLE II shows the accuracies of the shallow model (PS+RF) and the CNN in Experiment 1. In the table, "no." is the number of the test within the 11-fold cross-validation scheme, and "ave" is the average accuracy with standard error over the 11 tests.

TABLE II
THE ACCURACIES OBTAINED IN EXPERIMENT 1.

subject  no.  shallow  CNN    no.  shallow  CNN
s01      1    46.83    1.16   2    43.97    73.47
         3    51.59    96.72  4    79.18    53.70
         5    67.97    59.62  6    60.15    39.96
         7    74.95    46.30  8    92.07    17.44
         9    98.31    28.75  10   96.41    54.76
         11   79.18    39.96  ave  71.87±5.60  46.53±7.52
s06      1    77.27    16.38  2    64.27    54.86
         3    70.19    85.62  4    79.60    55.92
         5    28.22    46.30  6    95.56    13.74
         7    68.50    43.87  8    58.67    41.65
         9    37.84    29.81  10   69.98    50.21
         11   27.27    52.11  ave  61.58±6.29  44.59±5.74
s11      1    19.34    33.51  2    43.55    38.79
         3    66.07    75.05  4    52.54    83.72
         5    70.30    55.92  6    65.54    58.56
         7    66.81    60.89  8    24.52    94.82
         9    43.76    46.62  10   28.44    52.43
         11   22.30    93.55  ave  45.74±5.68  63.08±6.05
s16      1    74.31    43.23  2    51.59    60.36
         3    69.98    77.91  4    53.07    16.81
         5    60.78    59.41  6    17.23    29.28
         7    49.79    40.59  8    43.87    58.62
         9    67.02    44.19  10   41.54    45.24
         11   59.62    52.22  ave  53.53±4.59  47.99±4.75
s21      1    54.76    80.02  2    55.29    46.41
         3    56.03    54.65  4    41.65    37.21
         5    41.33    47.99  6    47.57    37.95
         7    61.73    11.95  8    37.53    56.13
         9    40.27    8.88   10   24.74    62.37
         11   38.37    3.59   ave  45.39±3.10  40.65±6.91
5 subjects ave             55.62±2.68  48.57±2.99

B. Experiment 2

TABLE III shows the accuracies of the shallow model and the CNN in Experiment 2.

TABLE III
THE ACCURACIES OBTAINED IN EXPERIMENT 2.

group        no.  shallow  CNN    no.  shallow  CNN
1 (s01-s05)  1    74.24    79.97  2    74.67    83.99
             3    62.58    75.55  4    54.60    92.58
             5    64.82    80.44  6    55.53    84.62
             7    42.53    81.69  8    66.21    78.66
             9    59.32    84.06  10   57.01    76.86
             11   56.42    82.78  ave  60.72±2.66  81.93±1.33
2 (s06-s10)  1    72.63    82.32  2    40.21    76.95
             3    59.29    77.67  4    58.01    75.70
             5    69.01    88.73  6    84.39    81.33
             7    76.27    82.99  8    74.50    73.85
             9    61.76    82.94  10   74.63    76.84
             11   61.69    78.14  ave  66.59±3.46  79.77±1.24
3 (s11-s15)  1    57.27    80.78  2    44.57    80.08
             3    57.84    80.87  4    60.44    77.63
             5    54.09    74.23  6    64.33    87.29
             7    52.60    88.41  8    42.86    88.07
             9    57.47    75.76  10   44.25    82.62
             11   71.25    88.15  ave  55.18±2.54  82.17±1.49
4 (s16-s20)  1    41.86    80.14  2    35.46    91.51
             3    47.73    74.49  4    47.59    82.32
             5    64.69    81.70  6    57.25    82.00
             7    55.14    77.40  8    54.33    85.32
             9    60.77    76.72  10   51.16    78.05
             11   51.04    79.00  ave  51.55±2.40  80.79±1.34
4 groups ave               58.51±1.64  81.16±0.69

C. Experiment 3

TABLE IV shows the accuracies of the shallow model and the CNN in Experiment 3.

TABLE IV
THE ACCURACIES OBTAINED IN EXPERIMENT 3.

no.  shallow  CNN    no.  shallow  CNN
1    49.70    77.77  2    39.08    77.97
3    57.86    77.52  4    56.03    78.49
5    41.75    73.43  6    57.39    76.04
7    61.05    75.20  8    55.29    78.47
9    56.77    79.00  10   55.06    75.83
11   55.60    73.21  ave  53.23±1.99  76.63±0.58
Experiment 2. The result that the accuracies of Experiment
2 and 3 is much higher than that of Experiment 1, however,
right manner. Therefore, enhancing the accuracies of EEG- shows not only the fact that CNNs are good at learning mass
ER models is meaningful from the viewpoint of engineering data but also the potential of existing substantial commonality
and science. Recognizing emotions which have comparatively of EEG properties between different subjects when they are
long duration from EEG is not easy. In terms of image or stimulated the same kind of emotion (positive or negative).
speech data, we humans can recognize features deriving from We obtained the values of 900 units in 6th layer, reduced the
labels by watching or listening signals. In ERP, we can also dimension from 900 to 2 using t-SNE[17], and plotted them
recognize the rising of the potential caused as the cognitive (Fig. 1). At this time, training data for learning and test data
and psychological response to stimuli by averaging signals of for getting plots are same as no.1 in TABLE IV. For clarity,
many trials. In terms of recognizing mental states which have plotted data of test data are restricted to 785 (= 157 × 5) data
long duration from EEG like this work, however, we can hardly of 5 subjects (s01–s05). The colors represent subjects (e.g. red
identify the critical features deriving from labels by ourselves is s01 and blue is s05) and the shapes of markers represent
because of their complexity. Our idea of unraveling emotional classes (positive: ‘o’, negative: ‘+’). It is seen from Fig. 1 that
features by using machine learning is based on this fact. the distances between the same colors tend to be shorter than
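The visualization can be reproduced along the following lines; the arrays below are random placeholders standing in for the real layer-6 activations and per-window metadata, and the color assignment beyond red = s01 and blue = s05 is our guess:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

activations = np.random.randn(785, 900)  # 157 windows x 5 subjects, 900 units each
subjects = np.repeat(np.arange(5), 157)  # subject id (0 = s01, ..., 4 = s05)
labels = np.random.randint(0, 2, 785)    # 1 = positive, 0 = negative

emb = TSNE(n_components=2).fit_transform(activations)
colors = ["red", "green", "orange", "purple", "blue"]
for s in range(5):
    for cls, marker in ((1, "o"), (0, "+")):  # positive: 'o', negative: '+'
        sel = (subjects == s) & (labels == cls)
        plt.scatter(emb[sel, 0], emb[sel, 1], c=colors[s], marker=marker, s=10)
plt.show()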
We will also consider the possibility that correct recognition is not always due to differences of emotion. In other words, the representations learned by the CNN may relate to differences in physiological responses to watching music videos other than emotional functions. It is necessary to discriminate between these in further work.

VII. CONCLUSION

EEG-ER models using CNNs fed raw EEG data were developed, and their accuracies on several sets of data were compared. General preprocessing such as bandpass filtering or artifact removal was not performed; raw EEG data were input to the CNN so that the CNN alone extracts the features. Tests following a cross-validation scheme were performed to evaluate the models' accuracy while taking the variability of EEG data into account. In the experiment on each single-subject dataset, the CNN could not learn apposite representations for lack of training data and was inferior to the shallow model in accuracy. In the experiments on datasets of plural subjects, however, the CNN could learn discriminatively the components relevant and irrelevant to the emotional classes and could recognize emotions with average accuracies more than 20% higher than the shallow model's. Moreover, the results suggest that there is not only interpersonal difference but also commonality of EEG properties in terms of the emotional valence evoked when people watch music videos.

Further work is needed on methods to enhance the accuracy of CNNs on limited datasets like that of Experiment 1. We will also examine models whose training data do not include data of the same subject as the test data, because the present results show the effectiveness of a model common among users; if this becomes possible at high accuracy, much time can be saved and superior features can be obtained from the model's representations. To apply the method given in this work to human-machine interaction or neuromarketing, it is necessary to use data recorded by light EEG devices, which are mobile and simplified. We used only 16 electrodes, but this condition alone does not amount to assuming light-EEG data: the model should additionally be more robust against noise and able to learn good representations from data sampled at lower rates such as 256 [Hz]. Furthermore, little is yet known about the emotional features that appear persistently in EEG time series, and revealing them concretely is an important subject that would contribute to cognitive science. To that end we will propose a method to interpret the emotional features from the filters of trained CNNs.

ACKNOWLEDGMENT

This work was supported by Japan Society for the Promotion of Science KAKENHI 26330189.

REFERENCES

[1] Y. LeCun and Y. Bengio, "Convolutional networks for images, speech, and time-series," The Handbook of Brain Theory and Neural Networks, MIT Press, pp.255-258, 1995.
[2] W.-L. Zheng, J.-Y. Zhu, Y. Peng, and B.-L. Lu, "EEG-based emotion classification using deep belief networks," 2014 IEEE International Conference on Multimedia and Expo (ICME), pp.1-6, 2014.
[3] S. Jirayucharoensak, S. Pan-Ngum, and P. Israsena, "EEG-based emotion recognition using deep learning network with principal component based covariate shift adaptation," The Scientific World Journal, vol.2014, 2014.
[4] H. Cecotti and A. Graeser, "Convolutional neural networks for P300 detection with application to brain-computer interfaces," IEEE Trans. Pattern Analysis and Machine Intelligence, vol.33, no.3, pp.433-445, 2011.
[5] H. Cecotti, "A time-frequency convolutional neural network for the offline classification of steady-state visual evoked potential responses," Pattern Recognition Letters, vol.32, no.8, pp.1145-1153, 2011.
[6] T. Sainath, A.-R. Mohamed, B. Kingsbury, and B. Ramabhadran, "Deep convolutional neural networks for LVCSR," IEEE International Conference on Acoustics, Speech and Signal Processing, pp.8614-8618, 2013.
[7] L.-C. Shi, Y.-Y. Jiao, and B.-L. Lu, "Differential entropy feature for EEG-based vigilance estimation," Proc. 35th Annu. Int. Conf. IEEE EMBS, pp.6627-6630, 2013.
[8] R.-N. Duan, J.-Y. Zhu, and B.-L. Lu, "Differential entropy feature for EEG-based emotion classification," 2013 6th International IEEE/EMBS Conference on Neural Engineering (NER), pp.81-84, 2013.
[9] J.A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol.39, no.6, pp.1161-1178, 1980.
[10] S. Koelstra, C. Muehl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "DEAP: A database for emotion analysis using physiological signals," IEEE Trans. Affective Computing, vol.3, no.1, pp.18-31, 2012.
[11] M. Li and B.-L. Lu, "Emotion classification based on gamma-band EEG," 2009 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp.1223-1226, 2009.
[12] H.P. Martinez, Y. Bengio, and G.N. Yannakakis, "Learning deep physiological models of affect," IEEE Computational Intelligence Magazine, Special Issue on Computational Intelligence and Affective Computing, vol.8, no.2, pp.20-33, 2013.
[13] M. Yanagimoto and C. Sugimoto, "Convolutional neural networks using supervised pre-training for EEG-based emotion recognition," 8th International Workshop on Biosignal Interpretation (BSI), 2016.
[14] S. Stober, A. Sternin, A.M. Owen, and J.A. Grahn, "Deep feature learning for EEG recordings," arXiv:1511.04306, 2016.
[15] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," IEEE International Conference on Computer Vision (ICCV), pp.1026-1034, 2015.
[16] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," International Conference on Machine Learning (ICML), pp.448-456, 2015.
[17] L.J.P. van der Maaten and G.E. Hinton, "Visualizing high-dimensional data using t-SNE," Journal of Machine Learning Research, vol.9, pp.2579-2605, 2008.
㻟㻞

Vous aimerez peut-être aussi