Investigating the Use of Pretrained Convolutional Neural Network on Cross-Subject and Cross-Dataset EEG Emotion Recognition.

The electroencephalogram (EEG) is highly attractive in emotion recognition studies due to its resistance to deceptive actions of humans. This is one of the most significant advantages of brain signals over visual or speech signals in the emotion recognition context. A major challenge in EEG-based emotion recognition is that EEG recordings exhibit varying distributions for different people as well as for the same person at different time instances. This nonstationary nature of EEG limits its accuracy when subject independence is the priority. The aim of this study is to increase the subject-independent recognition accuracy by exploiting pretrained state-of-the-art Convolutional Neural Network (CNN) architectures. Unlike similar studies that extract spectral band power features from the EEG readings, raw EEG data are used in our study after windowing, pre-adjustments and normalization. Removing manual feature extraction from the training system avoids the risk of eliminating hidden features in the raw data and helps leverage the deep neural network's power in uncovering unknown features. To improve the classification accuracy further, a median filter is used to eliminate false detections along a prediction interval of emotions. This method yields mean cross-subject accuracies of 86.56% and 78.34% on the Shanghai Jiao Tong University Emotion EEG Dataset (SEED) for two and three emotion classes, respectively. It also yields a mean cross-subject accuracy of 72.81% on the Database for Emotion Analysis using Physiological Signals (DEAP) and 81.8% on the Loughborough University Multimodal Emotion Dataset (LUMED) for two emotion classes. Furthermore, the recognition model trained on the SEED dataset was tested with the DEAP dataset, yielding a mean prediction accuracy of 58.1% across all subjects and emotion classes.
Results show that, in terms of classification accuracy, the proposed approach is superior to, or on par with, the reference subject-independent EEG emotion recognition studies identified in the literature, and has limited complexity due to the elimination of the need for feature extraction.


Introduction
The electroencephalogram (EEG) is the measurement of the electrical signals which result from brain activities. The voltage difference is measured between an active electrode and a reference electrode. There are several EEG measurement devices on the market, such as Neurosky, Emotiv, Neuroelectrics and Biosemi [1], which provide different spatial and temporal resolutions. Spatial resolution is related to the number of electrodes, and temporal resolution is related to the number of EEG samples processed per unit time. Generally, EEG has high temporal but low spatial resolution. In terms of spatial resolution,

Literature Review
For the classification of EEG signals, many machine learning methods, such as K-Nearest Neighbor (KNN) [28], Support Vector Machine (SVM) [28][29][30], Decision Tree (DT) [31], Random Forest (RF) [32] and Linear Discriminant Analysis (LDA) [33], have been applied. In the deep learning context, the Deep Belief Network (DBN) [34] and Auto-Encoders (AE) [35] have been studied with promising results. Besides DBN and AE, Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) structures are widely used [36][37][38][39][40]. Most of these models have shown good results for subject-dependent analysis. In [41], the KNN method is employed on the DEAP dataset [42] for different numbers of channels, showing accuracies between 82% and 88%. In the study conducted in [43], a quadratic time-frequency distribution (QTFD) is employed to obtain a high-resolution time-frequency representation of the EEG and capture the spectral variations over time; it reports mean classification accuracies ranging between 73.8% and 86.2%. In [44], four different emotional states (happy, sad, angry and relaxed) are classified. In that study, the Discrete Wavelet Transform (DWT) is applied on the DEAP dataset, and the wavelet features are classified using a Support Vector Machine (SVM) classifier with Particle Swarm Optimization (PSO) [45]. An overall accuracy of 80.63% is reported, with valence and arousal accuracies of 86.25% and 88.125%, respectively.
An important issue in EEG-based emotion detection is the non-linearity and non-stationarity of EEG signals. Feature sets, such as the spectral band powers of EEG channels, extracted from different people for the same emotional states do not exhibit strong correlations. For comparison, the galvanic skin response is a robust indicator of the arousal state, for which different people's responses correlate well with each other. Training and testing data made of EEG channels' spectral band powers and their derivatives have different distributions, and it is difficult to identify sets of features from the EEG recordings of different subjects, sessions and datasets that exhibit more commonality. This makes classification difficult with traditional methods, which assume identical distributions. In order to address this problem and provide subject independence to EEG-based emotion recognition models, deeper networks, domain adaptation and hybrid methods have been applied [46,47]. Furthermore, various feature extraction techniques have been applied, and different feature combinations have been tried [48].
Subject-independent EEG emotion recognition, as a challenging task, has recently gained high interest from many researchers. The Transfer Component Analysis (TCA) method conducted in [46] learns transfer components in a reproducing kernel Hilbert space, on the assumption that there exists a feature mapping between the source and target domains. A Subspace Alignment Auto-Encoder (SAAE), which uses a non-linear transformation and a consistency constraint, is used in [47]. This study compares its results with TCA: it achieves a leave-one-out mean accuracy of 77.88% on the SEED dataset, in comparison with 73.82% for TCA. Moreover, its mean classification accuracy for session-to-session evaluation is 81.81%, an improvement of up to 1.62% compared to the best baseline, TCA. In another study, a CNN with the deep domain confusion technique is applied on the SEED dataset [49] and achieves 90.59% and 82.16% mean accuracy for conventional (subject-dependent) EEG emotion recognition and leave-one-out cross validation, respectively [50]. Variational Mode Decomposition (VMD) [51] is used as a feature extraction technique with a Deep Neural Network as the classifier; it gives 61.25% and 62.50% accuracy on the DEAP dataset for arousal and valence, respectively. Another study, [52], uses a deep convolutional neural network with varying numbers of convolutional layers on raw EEG data collected during music listening. It reports maximum 10-fold-validation mean accuracies of 81.54% and 86.87% for arousal and valence, respectively. It also achieves 56.22% arousal and 68.75% valence accuracy in a one-subject-out test. As can be seen, the reported mean accuracy levels drop considerably in one-subject-out tests due to the nature of the EEG signals.
The study [48] extracts a total of 10 different linear and nonlinear features from EEG signals. The linear features are Hjorth activity, Hjorth mobility, Hjorth complexity, the standard deviation, PSD-Alpha, PSD-Beta, PSD-Gamma and PSD-Theta, and the nonlinear features are sample entropy and wavelet entropy. By using a method called Significance Test/Sequential Backward Selection and the Support Vector Machine (ST-SBSSVM), which is a combination of the significance test, sequential backward selection and the support vector machine, it achieves 72% cross-subject accuracy on the DEAP dataset for high-low valence classification. It also achieves 89% maximum cross-subject accuracy on the SEED dataset for positive-negative emotions. Another study [53] uses the Flexible Analytic Wavelet Transform (FAWT), which decomposes EEG signals into sub-bands; random forest and SVM are used for classification. The mean classification accuracies are 90.48% for positive/neutral/negative (three classes) in the SEED dataset, and, in the DEAP dataset, 79.95% for high arousal (HA)/low arousal (LA) (two classes), 79.99% for high valence (HV)/low valence (LV) (two classes) and 71.43% for HVHA/HVLA/LVLA/LVHA (four classes). In [54], the transfer recursive feature elimination (T-RFE) technique is used to determine a set of the most robust EEG features with a stable distribution across subjects. This method is validated on the DEAP dataset; the classification accuracy and F-score are 0.7867 and 0.7875 for arousal, and 0.7526 and 0.8077 for valence. A regularized graph neural network (RGNN), which includes inter-channel relations, is applied in [55] for EEG-based emotion recognition. The classification accuracy results on the SEED dataset are 64.88%, 60.69%, 60.84%, 74.96%, 77.50% and 85.30% for the delta, theta, alpha, beta, gamma and all bands, respectively. Moreover, it achieves an accuracy of 73.84% on the SEED [49] dataset.
There are several studies which apply transfer learning, which aims to explore common, stable features and apply them to other subjects [56]. In affective computing terms, the goal is to explore common and stable features which are invariant between subjects; this is also called domain adaptation. In [57], the researchers tried to find typical spatial pattern filters from various recording sessions and applied these filters to the subsequent ongoing EEG samples. Subject-dependent spatial and temporal filters were derived from 45 subjects and a representative subset was chosen in [58]. The study [59] uses compound common spatial patterns, which are the sum of covariance matrices; the aim of this technique is to utilize the common information shared between different subjects. Other important studies which apply different domain adaptation techniques to the SEED dataset are [47,[60][61][62]. The common property of these domain adaptation techniques is the exploration of an invariant feature subspace which reduces the inconsistencies of EEG data between subjects or sessions. In the study in [63], the domain adaptation technique is applied not only in the cross-subject context but also across datasets: the model trained on the SEED dataset is tested against the DEAP dataset and vice versa. It reports an accuracy improvement of 7.25-13.40% with domain adaptation compared to the same model without it. The authors of [62] applied adaptive subspace feature matching (ASFM) in order to integrate both the marginal and conditional distributions within a unified framework. This method achieves 83.51%, 76.68% and 81.20% classification accuracies for the first, second and third sessions of the SEED dataset, respectively. This study also conducts testing between sessions; for instance, it trains the model with the data of the first session and tests on the second session's data.
In the domain adaptation method, the conversion of features into a common subspace may lead to data loss. In order to avoid this, a Deep Domain Confusion (DDC) method based on CNN architecture is used [64]. This study uses adaptive layer and domain confusion loss based on Maximum Mean Discrepancy (MMD) to automatically learn a representation, jointly trained to optimize classification and domain invariance. The advantage of this is adaptive classification, retaining the original distribution information.
Having observed that the distributions of commonly derived sets of features from EEG signals show differences between subjects, sessions and datasets, we anticipate that there could be some invariant feature sets that follow common trajectories across subjects, sessions and datasets. There is a lack of studies that investigate such additional feature sets in the EEG signals that can contribute to robust emotion recognition across subjects. The aim of this study is to uncover these kinds of features in order to achieve promising cross-subject EEG-based emotion classification accuracy with manageable processing loads. For this purpose, raw EEG channel recordings after normalization and a state-of-the-art pretrained CNN model are deployed. The main motivation behind choosing a pretrained CNN architecture is its superiority in feature extraction and its inherent exploitation of domain adaptation. To further improve the emotion recognition accuracy, a median filter is additionally used in the test phase in order to reduce evident false alarms.

Contributions of the Work
The contributions of this work to the related literature in EEG-based emotion recognition can be summarized as follows:

• The feature extraction process is left entirely to a pretrained state-of-the-art CNN model, InceptionResnetV2, whose feature extraction capability has been shown to be highly competent in various classification tasks. This enables the model to explore useful, hidden features for classification.

• Data normalization is applied in order to remove the effects of fluctuations in the voltage amplitude and to protect the proposed network against probable ill-conditioned situations.

• Extra pooling and dense layers are added to the pretrained CNN model in order to increase its depth, so that the classification capability is enhanced.

• The output of the network is post-filtered in order to remove false alarms, which may emerge in short intervals of time during which the emotions are assumed to remain mostly unchanged.

Materials
The EEG datasets used in this work are SEED [49], the EEG data of DEAP [42] and our own EEG dataset [65] which is a part of the multimodal emotional database LUMED (Loughborough University Multimodal Emotion Database). All the datasets are open to public access.

Overview of the SEED Dataset
The SEED dataset is a collection of EEG recordings prepared by the Brain-like Computing & Machine Intelligence (BCMI) laboratory of Shanghai Jiao Tong University. A total of 15 clips were chosen for eliciting neutral, negative and positive emotions. Each stimulus session is composed of a 5 s hint of the movie, 4 min of clip, 45 s of self-assessment and 15 s of rest. Fifteen Chinese subjects (7 females and 8 males) participated in this study. Each participant had 3 sessions on different days; in total, 45 sessions of EEG data were recorded. The labels are given according to the clip contents (−1 for negative, 0 for neutral and 1 for positive). The data were collected via 62 channels placed according to the 10-20 system, down-sampled to 200 Hz, passed through a bandpass frequency filter of 0-75 Hz and presented as MATLAB "mat" files.

Overview of the DEAP Dataset
DEAP [42] is a multimodal dataset which includes the electroencephalogram and peripheral physiological signals of 32 participants. For 22 of the 32 participants, frontal face video was also recorded. Data were recorded while minute-long music videos were watched by the participants; a total of 40 videos were shown to each participant. The videos were rated by the participants in terms of levels of arousal, valence, like/dislike, dominance and familiarity, each on a scale from 1 to 9. EEG data were collected with 32 electrodes. The data were down-sampled to 128 Hz, EOG artefacts were removed and a bandpass frequency filter of 4.0-45.0 Hz was applied. The data were segmented into 60 s intervals and 3 s of baseline data were removed.

Overview of the LUMED Dataset
The LUMED (Loughborough University Multimodal Emotion Dataset) is a new multimodal dataset that was created at Loughborough University London (UK) by collecting simultaneous multimodal data from 11 participants (4 females and 7 males). The modalities include visual data (face RGB), peripheral physiological signals (galvanic skin response, heartbeat, temperature) and EEG. These data were collected from the participants while they were presented with audio-visual stimuli loaded with different emotional content. Each data collection session lasted approximately 16 min and consisted of short video clips playing one after the other. The longest clip was approximately 2.5 min and the shortest one was 1 min long. Between clips, in order to allow the participant to refresh and rest, a 20 s-long gray screen was displayed. Although the emotional ground truth of each clip was estimated based on its content, in reality a range of different emotions might be triggered in different participants. For this reason, after each session, the participants were asked to label the clips they watched with the most dominant emotional state they felt. In the current study, we exploited only the EEG modality of the LUMED dataset. For this study, we re-labelled the samples such that only 2 classes were defined: negative valence and positive valence. This was done to enable a fair comparison with other studies. Moreover, each channel's data were filtered in the frequency range of 0.5 Hz to 75 Hz to attenuate the high-frequency components that are not believed to have a meaningful correlation with the emotion classes. Normally, captured EEG signals are contaminated with EMG (electromyogram) and EOG (electrooculogram) type artefacts. EMG artefacts are electrical noise resulting from facial muscle activities, and EOG is electrical noise due to eye movements.
For traditional classification and data analysis methods, in order to prevent heavily skewed results, these kinds of artefacts should be removed from the EEG channel data through several filtering stages. As an example, the study in [66] removes eye movement artefacts from the signal by applying ICA (Independent Component Analysis). The LUMED dataset was created initially for the purpose of training a deep-learning-based emotion recognition system, described in Section 4. Depending on the type and purpose of other supervised machine learning systems, this dataset could require more thorough pre-processing for artefact removal. In LUMED, EEG data were captured based on the 10-20 system by a Neuroelectrics Enobio 8 [67], an 8-channel EEG device with a sampling rate of 500 Hz. The channels used were FP1, AF4, FZ, T7, C4, T8, P3 and OZ, which are spread over the frontal, temporal and central regions of the brain.

Proposed Method
In this work, the emotion recognition model works on raw EEG signals without prior feature extraction. Feature extraction is left to a state-of-the-art CNN model, InceptionResnetV2. The success of this pretrained CNN model on raw data classification was extensively outlined in [68]. Since the distribution of EEG data varies from person to person, session to session and dataset to dataset, it is difficult to identify a feature set that exhibits good accuracy every time. On the other hand, pretrained CNN models are very competent in feature extraction; therefore, this work makes use of this capability.

Windowing of Data
Data are split into fixed-length (N) windows with an overlap of N/6, as shown in Figure 1 (displayed for three random channels). One window of EEG data is given in Figure 2, where M is the number of selected channels and C_ab is the b-th data point of channel a.
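As an illustrative sketch of this step (assuming the EEG is held as a NumPy array of shape (channels, samples), and reading the N/6 overlap as a stride of N − N/6 between consecutive window starts), the windowing could be implemented as:

```python
import numpy as np

def window_eeg(data, n=300):
    """Split a (channels, samples) EEG array into fixed-length windows
    of n samples with an overlap of n // 6 between consecutive windows."""
    hop = n - n // 6                      # stride between window starts
    n_samples = data.shape[1]
    starts = range(0, n_samples - n + 1, hop)
    return np.stack([data[:, s:s + n] for s in starts])

# Example: 9 selected channels, 10 s of SEED data at 200 Hz
eeg = np.random.randn(9, 2000)
windows = window_eeg(eeg, n=300)
print(windows.shape)  # (7, 9, 300)
```

With N = 300 the hop is 250 samples, so 2000 samples yield 7 overlapping windows.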



Data Reshaping
EEG data are reshaped to fit the input layer properties of InceptionResnetV2, as shown in Figure 2. KERAS, an open-source neural network library written in Python, is used for training. Since KERAS is used for training, the minimum input size for InceptionResnetV2 is (N1, N, 3), where N1 ≥ 75 and N ≥ 75 [69]. Depending on the number of selected channels, each channel's data are augmented by creating noisy copies of it. For instance, if the number of selected channels is S, the number of noisy copies per channel is calculated according to Equation (1), where the ceil operator rounds a non-integer number up to the next integer and NNC is the number of noisy copies:

NNC = ceil(N1/S) − 1. (1)
The noisy copies of each channel are created by adding random samples of a Gaussian distribution with mean µ and variance σ², where µ and σ are chosen as 0 and 0.01, respectively. This process is given in Equation (2), where [C'a1, C'a2, . . . , C'aN] is the noisy copy of the original data [Ca1, Ca2, . . . , CaN] and [na1, na2, . . . , naN] is the noise vector:

[C'a1, C'a2, . . . , C'aN] = [Ca1, Ca2, . . . , CaN] + [na1, na2, . . . , naN]. (2)
Since the samples are randomly chosen, each noisy copy is different from the others. N is related to the windowing size; therefore, a window size N greater than or equal to 75 is chosen. We chose N as 300 in order to provide a standard window size across datasets. This corresponds to 1.5 s for the SEED dataset, approximately two seconds for the DEAP dataset and 0.6 s for the LUMED dataset. Moreover, we chose N1 as 80 for all datasets. The augmentation process is repeated three times in order to make the data fit the KERAS input size. This work does not use interpolation between channels because EEG is a nonlinear signal. The noise is added mainly for data augmentation. There are several ways of data augmentation, such as rotation, shifting and adding noise; in the image processing context, rotation, shifting, zooming and noise addition are all used. However, we only use noise addition for EEG data augmentation in order to both keep each channel's original data and create new augmented data with limited Gaussian noise. This is as if there were another electrode very close to the original electrode, whose reading is the original signal with additional noise. We use data augmentation instead of data duplication in order to make the network adapt to noisy data and increase its prediction capability. This also prevents the network from overfitting due to data repetition. This technique was similarly applied in [70].
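A minimal sketch of the reshaping and augmentation step follows; the function name is illustrative, the form NNC = ceil(N1/S) − 1 is our reading of the text, and the three input planes are generated with independent noise, matching the statement that each dimension is created separately:

```python
import math
import numpy as np

def augment_window(window, n1=80, mu=0.0, sigma=0.01, seed=0):
    """Turn one EEG window of shape (S, N) into an (n1, N, 3) tensor:
    each channel is stacked with NNC Gaussian-noise copies of itself,
    and the process is repeated for each of the three input planes."""
    rng = np.random.default_rng(seed)
    s, n = window.shape
    nnc = math.ceil(n1 / s) - 1          # noisy copies per channel
    planes = []
    for _ in range(3):                   # three independently augmented planes
        rows = []
        for ch in window:
            rows.append(ch)              # keep the original channel data
            for _ in range(nnc):
                rows.append(ch + rng.normal(mu, sigma, n))
        planes.append(np.stack(rows)[:n1])  # trim surplus rows to n1
    return np.stack(planes, axis=2)

x = augment_window(np.random.randn(9, 300))
print(x.shape)  # (80, 300, 3)
```

With S = 9 channels and N1 = 80, NNC is 8, so 9 × (1 + 8) = 81 rows are generated and trimmed to 80.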

Normalization
Following windowing, augmentation and reshaping, each channel's data are normalized by subtracting the mean of each window from each sample. This is repeated for all channels and the noisy copies. The aim of removing the mean is to set the mean value of each window to 0, which protects the proposed network against probable ill-conditioned situations. In MATLAB, this process is applied automatically on the input data; in KERAS, we performed it manually just before training the network. Each of the three input dimensions is created separately, so they are different from each other.
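For completeness, the mean removal can be sketched as follows (assuming, as a reading of the text, that each row of the reshaped window is zero-meaned independently):

```python
import numpy as np

def normalize_window(x):
    """Zero-mean each row (channel or noisy copy) of an (n1, N, 3)
    window by subtracting its own mean along the sample axis."""
    return x - x.mean(axis=1, keepdims=True)

w = np.random.randn(80, 300, 3) + 5.0   # raw window with a DC offset
z = normalize_window(w)
print(np.abs(z.mean(axis=1)).max())     # every row is now zero-mean
```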

Channel Selection
In this work, we concentrated on the frontal and temporal lobes of the brain. As stated in [71,72], emotional changes mostly affect the EEG signals on the frontal and temporal lobes. Different numbers of channels were tried in this work, and increasing the number of channels did not help improve the accuracy. This is because, technically, including channels in the model that are not correlated with the emotion changes does not help and, on the contrary, can adversely affect the accuracy. It is also known that the electrical relations between asymmetrical channels are determinants of arousal and valence, and hence of emotion [73,74]. Therefore, we chose asymmetrical pairs of electrodes from the frontal and temporal lobes, AF1, F3, F4, F7, T7, AF2, F5, F8 and T8, which are spread evenly over the skull. The arrangement of these channels in the window is AF1, AF2, F3, F4, F5, F6, F7, T7, T8.
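As a small illustration of this selection step (the montage indices below are hypothetical placeholders, not the datasets' official channel ordering):

```python
import numpy as np

# Hypothetical positions of the chosen electrodes within a 62-channel
# montage -- illustrative indices only, not SEED's official layout.
CHANNEL_INDEX = {"AF1": 1, "AF2": 2, "F3": 5, "F4": 9, "F5": 4,
                 "F6": 10, "F7": 3, "T7": 14, "T8": 22}
WINDOW_ORDER = ["AF1", "AF2", "F3", "F4", "F5", "F6", "F7", "T7", "T8"]

def select_channels(eeg, order=WINDOW_ORDER, index=CHANNEL_INDEX):
    """Keep only the frontal/temporal rows, arranged in window order."""
    return eeg[[index[name] for name in order]]

eeg = np.random.randn(62, 2000)        # full 62-channel recording
print(select_channels(eeg).shape)      # (9, 2000)
```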

Network Structure
In this work, a pretrained CNN network, InceptionResnetV2, is used as the base model. Following InceptionResnetV2, a Global Average Pooling layer is added to decrease the data dimension, and extra dense layers (fully connected layers) are added in order to increase the depth and the success in classifying complex data. The overall network structure is given in Figure 3, and the properties of the layers following the CNN are described in Table 1. The training parameters are specified in Table 2. In Figure 3, Dense Layer-5 determines the number of output classes and argMax selects the one with the maximum probability. We use the "relu" activation function to capture interaction effects and non-linearities, which is very important in our problem when using a deep learning model. ReLU is one of the most widely used and successful activation functions in the field of artificial neural networks. Moreover, at the last dense layer, we use the "softmax" activation in order to produce the class probabilities.
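Since Table 1 is not reproduced in this excerpt, the following NumPy stand-in for the added head uses placeholder layer widths; it only illustrates the flow of global average pooling, ReLU dense layers, softmax and argMax on top of the CNN feature maps:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classification_head(feature_maps, weights, biases):
    """Global average pooling followed by dense layers, mirroring the
    head added after InceptionResnetV2 (layer widths are assumed)."""
    x = feature_maps.mean(axis=(0, 1))         # global average pooling
    for w, b in zip(weights[:-1], biases[:-1]):
        x = relu(w @ x + b)                    # hidden dense layers, ReLU
    logits = weights[-1] @ x + biases[-1]      # final dense layer: class scores
    probs = softmax(logits)                    # softmax -> class probabilities
    return int(np.argmax(probs)), probs        # argMax picks the winner

# Toy example: 1536 CNN feature maps -> two hidden layers -> 3 classes
rng = np.random.default_rng(0)
dims = [1536, 256, 64, 3]
ws = [rng.standard_normal((dims[i + 1], dims[i])) * 0.01 for i in range(3)]
bs = [np.zeros(dims[i + 1]) for i in range(3)]
label, p = classification_head(rng.standard_normal((3, 8, 1536)), ws, bs)
print(label, p.sum())
```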


Filtering on Output Classes
Since EEG is very prone to noise and different types of artifacts, the filtering of EEG signals is widely studied in the EEG recognition context. The study conducted in [75] compares three types of smoothing filters (smooth filter, median filter and Savitzky-Golay) on EEG data for medical diagnostic purposes. The authors concluded that the most useful filter is the classical Savitzky-Golay filter, since it smooths the data without distorting the shape of the waves.
Another EEG data filtering study is provided in [76]. This study employs a moving average filtering on extracted features and then classifies the signal by using an SVM. It achieves very promising accuracy results with limited processing time compared to similar studies.
Emotions change quicker than moods for healthy people [77]. However, in very short time intervals (in the range of few seconds), the emotions show lesser variance in healthy individuals with good emotion regulation. Different from the studies [75,76], the filtering is applied on the output in our method. It is assumed that in a defined small-time interval T the emotion state does not change. Therefore, we apply a median filter on the output data inside a specific time interval with an aim of removing the false alarms and increase the overall emotion classification accuracy. This process is shown in Figure 4 where A and B stands for different classes.
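The output filtering can be sketched as below. For categorical class labels, a median filter reduces to a majority vote inside the window; this is our interpretation as a minimal sketch, not necessarily the exact implementation, and the window of five predictions follows the filter size reported later for the SEED dataset.

```python
# Sketch of the output-class median filter: within a short interval the
# emotion is assumed stable, so isolated mispredictions are voted away.
# On categorical labels the median filter reduces to a majority vote.
from collections import Counter

def median_filter_labels(predictions, size=5):
    """Replace each predicted label with the majority label in a
    centered window of `size` predictions (odd size assumed)."""
    half = size // 2
    filtered = []
    for i in range(len(predictions)):
        window = predictions[max(0, i - half): i + half + 1]
        filtered.append(Counter(window).most_common(1)[0][0])
    return filtered

# A spurious "B" inside a run of "A" predictions is removed:
print(median_filter_labels(["A", "A", "B", "A", "A", "A", "A"]))
```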

Finally, the overall process that describes how model training and testing is carried out is visually depicted in Figure 5.

Results and Discussions
In this work, for the SEED dataset, classification tests are conducted for two categories of classification: two classes, Positive-Negative valence (Pos-Neg), and three classes, Positive-Neutral-Negative valence (Pos-Neu-Neg). The SEED dataset provides the labels as negative, neutral and positive. The DEAP dataset labels valence and arousal between one and nine; valence values above 4.5 are taken as positive and values below 4.5 as negative. For the LUMED dataset, classification is performed as either positive or negative valence. One-subject-out classifications for each dataset are conducted, and the results are compared to several reference studies that provide cross-subject and cross-dataset results. In one-subject-out tests, one subject's data are excluded completely from the training set. The remaining data are divided into training and validation sets. During training, when we do not see an improvement in validation accuracy for six consecutive epochs, we stop the training and apply the test data to the final model. An example is shown in Figure 6. For each user, Table 3 depicts one-subject-out tests for the SEED dataset based on all sessions together, with and without normalization and with and without output filtering. We also obtained the accuracy results without the pooling and dense layers: the mean accuracies dropped by 8.3% and 11.1% without pooling and dense layers, respectively. Applying the median filter to the predicted output improves the mean accuracy by approximately 4% for the SEED dataset. The filter size is determined empirically and set to five, which corresponds to approximately six seconds of data; within this time interval the emotion state is assumed to remain unchanged. It can be seen in Table 3 that the accuracy for some users is high, while for others it is relatively lower. This depends on the modeling of the network with the remaining training data after excluding the test data.
However, the standard deviation is still acceptable. Another issue is that when the number of classes is increased from two (Pos-Neg) to three (Pos-Neu-Neg), the prediction accuracies drop. This is because some samples labelled as neutral might fall into the negative or the positive classes.
One of the most important characteristics of our work is that we provide the accuracy scores for each subject separately, which is not observed in most of the other reference studies that tackle EEG-based emotion recognition.
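The DEAP valence binarization described above can be sketched as below. The handling of a rating of exactly 4.5 (grouped with negative here) is an assumption, since the text only specifies "above" and "smaller than" 4.5.

```python
# Sketch of binarizing DEAP valence self-assessments (1-9 scale) at 4.5.
# A rating of exactly 4.5 is grouped with "negative" here by assumption.
def valence_to_class(valence, threshold=4.5):
    """Map a DEAP valence rating (1-9) to 'positive' or 'negative'."""
    return "positive" if valence > threshold else "negative"

ratings = [2.0, 4.5, 4.6, 7.3, 9.0]
print([valence_to_class(v) for v in ratings])
```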

Table 4 shows the cross-subject accuracy comparison of several top studies that provide results for two classes (Pos-Neg) or three classes (Pos-Neu-Neg). For Pos-Neg, the proposed method achieves 86.5% accuracy, which is slightly lower than ST-SBSSVM [48]. However, our method has far less complexity, since it does not depend on pre-feature extraction and the associated complex calculations. Furthermore, it is not clear in [48] whether the reported maximum accuracy of ST-SBSSVM corresponds to the mean prediction accuracy of all subjects, or the maximum prediction accuracy of any single subject.
Another issue is that many reference cross-subject studies use the excluded users' data for validation during the training process. In domain adaptation methods, the target domain is used together with the source domain to convert the data into an intermediate common subspace that brings the distributions of the target and source domains closer. A similar approach, with an added error function, is used in [50]. Using the target domain together with the source domain can increase the cross-subject accuracy [50,62], because the distributions of the labelled and unlabelled data are matched by empirically tuned cost functions. However, such approaches are not well-directed: in cross-subject and/or cross-dataset classification, only the source domain should be available. We aim to generate a model and feature set from the source domain alone, to be tested on unseen target data (either labelled or unlabelled). Validation and training data should be clearly split, and the excluded subjects' data should not be used in validation. Training should be stopped at the last epoch before overfitting kicks in, and the final trained model should then be tested with the excluded subjects' data. In our study, we respected this rule.
Table 4. "One-subject-out" prediction accuracies of reference studies using the SEED dataset.
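The leave-one-subject-out protocol argued for above can be sketched as below: the held-out subject's recordings never enter training or validation. The subject IDs and the validation fraction are hypothetical.

```python
# Sketch of a leave-one-subject-out split in which the test subject's
# data is fully held out from both training AND validation. Subject IDs
# and the 20% validation fraction are hypothetical.
import random

def one_subject_out_split(data_by_subject, test_subject,
                          val_fraction=0.2, seed=0):
    """data_by_subject: dict mapping subject_id -> list of samples.
    Returns (train, val, test) lists with the test subject held out."""
    test = list(data_by_subject[test_subject])
    pool = [s for subj, samples in data_by_subject.items()
            if subj != test_subject for s in samples]
    rng = random.Random(seed)
    rng.shuffle(pool)                      # mix the remaining subjects
    n_val = int(len(pool) * val_fraction)  # carve validation from them
    return pool[n_val:], pool[:n_val], test

# Five fake subjects with ten samples each; hold out subject "s0":
data = {f"s{i}": [f"s{i}_sample{j}" for j in range(10)] for i in range(5)}
train, val, test = one_subject_out_split(data, "s0")
print(len(train), len(val), len(test))  # 32 8 10
```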

Method    Accuracy (Pos-Neu-Neg)
ST-SBSSVM [48]    89.
[78]    67.5
KPCA [79]    62.4
MIDA [80]    72.4
Table 5 shows the accuracy results of the proposed model on the DEAP database for two classes (Pos-Neg). Generally, the reported accuracies are lower than those achieved on the SEED dataset. This may be due to the poorer labelling quality of the samples in the DEAP dataset. Some reference studies employ varying re-labelling strategies on the samples of the DEAP dataset to revise the class labels, which automatically increases the reported prediction accuracies. However, we decided not to alter, and to respect, the original labelling strategy of the dataset: we only set the threshold at 4.5 on the one-to-nine scale to divide the samples into positive and negative classes. It is therefore acceptable to achieve slightly lower accuracy values than some others, as shown in Table 6. To reiterate, post median filtering improves the mean prediction accuracy by approximately 4%.
Table 5. "One-subject-out" prediction accuracies for the DEAP dataset using two classes (Pos-Neg).
Table 6 shows the prediction accuracies of several studies that use the DEAP dataset for two classes (Pos-Neg). Our proposed method yields promising accuracy results with only limited complexity (e.g., without any pre-feature extraction cycle) when compared to others. For all eight incoming EEG data channels, the windowing, reshaping, normalization and classification processes take on average 0.34 s on the test workstation (Core i9, 3.6 GHz, 64 GB RAM). This is the computational time used by the Python script and the Keras framework. Hence, the data classification can be achieved with roughly half a second of delay, rendering our method usable in real-time systems.
Table 6. One-subject-out accuracy comparison of several studies for the DEAP dataset (Pos-Neg).
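The windowing step timed above can be sketched as below. The window length and overlap are hypothetical parameters, since the exact values depend on the dataset's sampling rate.

```python
# Sketch of cutting raw multi-channel EEG into fixed-length windows
# before reshaping for the CNN. Window length and step (overlap) are
# hypothetical; real values depend on the sampling rate.
def window_eeg(channels, window_len, step):
    """channels: list of per-channel sample lists (equal length).
    Returns a list of windows; each window holds one slice per channel."""
    n = len(channels[0])
    windows = []
    for start in range(0, n - window_len + 1, step):
        windows.append([ch[start:start + window_len] for ch in channels])
    return windows

# Eight channels of 1000 fake samples, 200-sample windows, 50% overlap:
eeg = [[float(i) for i in range(1000)] for _ in range(8)]
wins = window_eeg(eeg, window_len=200, step=100)
print(len(wins), len(wins[0]), len(wins[0][0]))  # 9 8 200
```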
In this work, cross-dataset tests are also conducted between the SEED-DEAP, SEED-LUMED and DEAP-LUMED datasets for positive and negative labels. Table 8 shows the cross-dataset accuracy results between SEED and DEAP. Our model is trained using the SEED dataset and tested separately on the DEAP dataset, yielding a mean prediction accuracy of 58.10%, which is promising in this context. The comparison of the cross-dataset performance of our proposed model with other cross-dataset studies is shown in Table 9; the cross-dataset accuracy of our model is consistently superior. Table 10 shows the cross-dataset results between SEED-LUMED and DEAP-LUMED. Since LUMED is a new dataset, we cannot provide benchmark results against other studies; however, the mean accuracies and standard deviations are promising.

Conclusions
In many recognition and classification problems, the most time- and resource-consuming aspect is the feature extraction process. Many scientists focus on extracting meaningful features from EEG signals in the time and/or frequency domains in order to achieve successful classification results. However, derived feature sets that are useful in the classification problem for one subject, recording session or dataset can fail for different subjects, recording sessions and datasets. Furthermore, since feature extraction is a complex and time-consuming process, it is not particularly suitable for online and real-time classification problems. In this study, we do not rely on a separate pre-feature extraction process and shift this task to the deep learning cycle, which inherently performs it. Hence, we do not manually remove any potentially useful information from the raw EEG channels. Similar approaches, where deep neural networks are utilized for recognition, have been applied in other domains, such as in [83], where electromagnetic sources are recognized. CNNs have already been shown to be highly competent in various classification tasks, especially in the image classification context. Therefore, we deploy a pretrained CNN architecture called InceptionResnetV2 to classify the EEG data, and we have taken the necessary steps to reshape the input data to feed into and train this network.
One of the most important issues influencing the success of deep learning approaches is the data themselves and the quality and reliability of the data labels. The "Brouwer recommendations" on data collection given in [19] are very useful for obtaining accurate data and labelling. Particularly during the EEG data recording process, these recommendations should be double-checked due to the EEG recording device's sensitivity to noise. EEG signals are non-stationary and nonlinear, which makes putting forth a general classification model and a set of features based on the well-studied spectral band powers difficult. It is important to be able to identify feature sets that are stable across subjects, recording sessions and datasets. Since CNNs are very successful at extracting not-so-obvious features from the input data for complex classification problems, we exploit a state-of-the-art pretrained CNN model called InceptionResnetV2 and do not filter out any information from the raw EEG signals. For robustness, we further enrich this deep network by adding fully connected dense layers, which increases the depth and helps prevent the network from falling into ill-conditioning and overfitting problems.
In this work, we applied the model successfully on three different EEG datasets: SEED, DEAP and LUMED. Furthermore, we tested our model in a cross-dataset context: we trained it with the SEED dataset and tested on the DEAP and LUMED datasets, and we trained it with the DEAP dataset and tested on the LUMED dataset. We showed that the results are promising and superior to most of the reference techniques. Once the fully trained model is generated, any online raw data can be fed directly as input to obtain the output class immediately. Since there is no dedicated pre-feature extraction process, our model is well suited to deployment in real-time applications.