Multiple Convolutional Neural Networks Fusion Using Improved Fuzzy Integral for Facial Emotion Recognition

Facial expressions are indispensable in human cognitive behaviors since they can instantly reveal human emotions. Therefore, in this study, Multiple Convolutional Neural Networks using Improved Fuzzy Integral (MCNNs-IFI) are proposed for recognizing facial emotions. Since effective facial expression features are difficult to design, deep learning CNNs are used in this study. Each CNN has its own advantages and disadvantages; thus, combining multiple CNNs can yield superior results. Moreover, the multiple CNNs are combined with an improved fuzzy integral, in which the fuzzy density values are optimized through particle swarm optimization (PSO), overcoming the majority-decision drawback of the traditional voting method. Two databases, Multi-PIE and CK+, and three main CNN structures, namely AlexNet, GoogLeNet, and LeNet, were used in the experiments. To verify the results, a cross-validation method was used, and the experimental results indicated that the proposed MCNNs-IFI exhibited up to 12.84% higher accuracy than the three individual CNNs.


Introduction
In 2016, when a computer program won the Google DeepMind Challenge Match, people started having a different understanding of artificial intelligence (AI). AI is no longer just a sci-fi plot in a movie but is being implemented around us. The key technology behind AlphaGo is deep learning, which has since become even more prominent than the original term AI.
Deep learning is widely used in image recognition. With improvements in hardware, more researchers are investing in convolutional neural networks (CNNs). The origin of CNNs can be traced back to 1998, when LeCun et al. [1] proposed the LeNet-5 model, which uses the back propagation algorithm proposed by Rumelhart et al. [2] in 1986 to adjust the parameters of a neural network. In 2012, Krizhevsky et al. [3] proposed the AlexNet model and won the championship in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). Dropout was used for the first time, and the rectified linear unit (ReLU) was used as the activation function, proving that increasing the number of CNN layers can provide superior results.
In 2014, GoogLeNet proposed by Szegedy et al. [4] and VGGNet proposed by Simonyan et al. [5] were competitors in ILSVRC. The commonality between these two models is that the layers are deeper than those in other neural network methods. VGG-16 inherits LeNet and AlexNet's microarchitecture, whereas GoogLeNet boldly changed the network structure and improved the multilayer perceptron convolution layer to convolve with a 1 × 1 convolution kernel and introduced batch normalization [6].
Although GoogLeNet is 22 layers deep, its overall size is considerably smaller than that of AlexNet and VGGNet: AlexNet has roughly 12 times as many parameters as GoogLeNet (5 million), and VGGNet has roughly three times as many parameters as AlexNet. Therefore, GoogLeNet is a superior choice when memory or computing resources are limited. As the number of network layers deepens, a degradation problem occurs, and networks fail to learn correct features; that is, the accuracy gradually saturates, although this saturation is not caused by overfitting. To solve this problem, He et al. [7] proposed ResNet in 2015, which maps low-level features directly to higher layers so that deep networks do not need to learn everything from scratch. Each residual block therefore only learns the residual relative to the features of the preceding layers, which reduces the degradation problem and eases the training of deep networks.
The last layer, the fully connected layer, is the most heavily parameterized part of a CNN, and overfitting can easily occur there. Lin et al. [8] proposed global average pooling to replace the fully connected layer; it has no parameters to train and reduces the overall burden on the network. In the training process, stochastic gradient descent (SGD) is currently the most common optimization method; it randomly selects a fixed number of training samples for each update. This method learns effectively; however, its disadvantages include a tendency to fall into local optima and an overreliance on the learning rate. To solve these problems, researchers have successively proposed various improved optimization methods [9][10][11][12].
In the area of pattern recognition, there are generally two types of combinations: classifier selection and classifier fusion. The presumption in classifier selection is that each classifier is "an expert" in some local area of the feature space. Classifier fusion assumes that all classifiers are trained over the whole feature space, and they are thereby considered competitive rather than complementary. In this study, we focus on classifier fusion. Combining different classifiers has frequently been considered as a way to achieve better classification performance than using a single classifier. In other words, by combining the outputs of a team of classifiers, the proposed work aims to make more accurate decisions than those made by choosing the single best member of the team. Recently, several classifier combination algorithms [13][14][15] have been proposed. Soto et al. [13] presented an intuitive, rigorous mathematical description of the voting process in an ensemble of classifiers. Kuncheva [14] used soft computing techniques in classifier combination, such as neural networks, fuzzy sets, and evolutionary computation. Kevric et al. [15] developed a combined classifier model based on tree-based algorithms for network intrusion detection; the proposed detection algorithm classifies incoming network traffic as either normal or an attack. Lin et al. [16] presented the fusion of multiple functional neural fuzzy networks using the fuzzy integral (FI). The FI is a nonlinear function defined with respect to a fuzzy measure that can merge information granted from multiple sources. In [16], since the fuzzy density in the traditional fuzzy integral is often set according to user experience, the recognition rate of the fusion model cannot be improved further. The purpose of this study is to improve the traditional FI; that is, the fuzzy density is determined using an evolutionary method.
Furthermore, in the decision-making process, the FI considers both the objective evidence supplied by each information source and the expected worth of each subset of information sources.
In this study, Multiple Convolutional Neural Networks using Improved Fuzzy Integral (MCNNs-IFI) are proposed for recognizing facial emotions. The proposed method feeds the outputs of several trained CNNs as inputs to the fuzzy integral, and classification results are computed using the Sugeno or Choquet output rules. Since effective facial expression features are difficult to design, deep learning CNNs are used in this study. Each CNN has its own advantages and disadvantages; therefore, combining multiple CNNs can yield superior results. Moreover, the multiple CNNs are combined with an improved fuzzy integral whose fuzzy density values are optimized through particle swarm optimization (PSO), overcoming the majority-decision drawback of the traditional voting method.
In Section 2, the proposed MCNNs-IFI method and its architecture are described. In Section 3, experimental results obtained using the Multi-PIE database and CK+ database are presented, and Section 4 presents the conclusion and future prospects.

Proposed MCNNs-IFI
Each CNN has its own advantages and disadvantages; in order to retain their strengths, multiple convolutional neural networks using improved fuzzy integral (MCNNs-IFI) is proposed in this study. There are two kinds of combinations: one combines the same CNN architecture with different optimization methods; the other combines different CNN architectures with the same optimization method. The proposed MCNNs-IFI architecture is presented in Figure 1. First, the CNNs of different architectures or different optimization methods are trained with the same set of training data so that each individual CNN produces different outputs. Then, the fuzzy integral with respect to a fuzzy measure is used to evaluate each classifier. Finally, the combined output of the CNNs is calculated to obtain superior recognition accuracy.
In Figure 1, the outputs from m CNNs are used as the input of the improved fuzzy integral. After calculation, the output is obtained by referencing the worth of the performance in each CNN.

CNN
A basic CNN can be roughly divided into two parts. As depicted in Figure 2, the first part (blue area) uses convolution and pooling to extract features, and the second part (orange area) is a fully connected neural network classifier. Traditional CNNs thus consist of a feature extraction stage and a classifier stage. In recent CNNs, however, the fully connected part has been replaced by average pooling, which reduces overfitting and the number of parameters required during training. The following subsections describe the three important operations in the feature extraction stage, namely convolution, the activation function, and pooling.

Convolution Layer
Convolution is a feature extraction process based on the concept of a receptive field; many traditional feature extractors, such as the Sobel and Gabor filters, are also convolutions. Convolution slides a kernel mask as a window over the input matrix. The formula is as follows:

F_{IJ} = ∑_{i=1}^{K_h} ∑_{j=1}^{K_w} x_{I+i−1, J+j−1} · kw_{ij}

where F_{IJ} is the output matrix, and K_w and K_h are the width and height of the convolution kernel, respectively. Generally, K_w = K_h, i.e., a square kernel; x_{ij} is the input matrix, and kw_{ij} is the weight of the convolution kernel. To maintain the same width and height of the matrix after convolution, the edge of the input matrix is usually padded with 0; since 0 remains 0 after the convolution operation, the padding does not affect the result.
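As an illustration of the sliding-window operation above, the following sketch convolves a small matrix with a Sobel kernel using zero padding. The helper name conv2d_same and the example values are ours, not from the paper; as in most CNN frameworks, the kernel is applied without flipping (cross-correlation).

```python
import numpy as np

def conv2d_same(x, kernel):
    """2D convolution (cross-correlation) with zero padding so the
    output keeps the input size, as described in the text."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    # Pad the edges with zeros so F has the same height/width as x.
    xp = np.pad(x, ((ph, ph), (pw, pw)), mode="constant")
    H, W = x.shape
    out = np.zeros((H, W))
    for I in range(H):
        for J in range(W):
            # Weighted sum of the receptive field under the mask.
            out[I, J] = np.sum(xp[I:I + kh, J:J + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
feat = conv2d_same(img, sobel_x)
print(feat.shape)  # (4, 4): "same" padding preserves the input size
```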

Activation Function
The purpose of the activation function is to obtain nonlinear outputs from linearly combined networks. However, when the network is deep and the number of layers is high, the sigmoid function used in earlier networks suffers from the vanishing gradient problem during back propagation. Therefore, ReLU is used as the activation function in CNNs.
The definition of ReLU is as follows:

f(x) = max(0, x)

Figure 3 illustrates that if the input x is smaller than 0, the output is 0; if the input x is larger than 0, x is output directly.
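A minimal, purely illustrative sketch of this activation function:

```python
import numpy as np

def relu(x):
    # ReLU: f(x) = max(0, x), applied element-wise.
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(z))  # negative inputs are clipped to zero, positives pass through
```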

Pooling Layer
After the convolution operation, the feature map is compressed to reduce network computation and to extract the main features.
Pooling is used to reduce the dimension. Pooling applies a sliding window mask to the input matrix; however, the mask does not overlap during the moving process. That is, the stride is equal to the size of the mask, so an N × N mask reduces the input feature matrix to 1/N² of its original size.
Pooling is generally divided into maximum and average pooling. Figure 4 displays maximum pooling: the largest value in the mask is used as the output, and the other values in the mask are discarded. In the convolution process, it is usually desirable to retain the most prominent features, and features that are not prominent should, in theory, not affect the characteristics; therefore, maximum pooling is used after convolution. Average pooling calculates the average of all values in the mask and is therefore usually used in the output layer. Average pooling has a physical meaning in the calculation process: a large target obtains a high output value, as depicted in Figure 5. In addition to reducing the dimension, pooling provides the network a certain degree of rotation and displacement invariance.
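The non-overlapping pooling described above can be sketched as follows; the helper pool2d and the sample feature map are hypothetical examples, not values from the paper.

```python
import numpy as np

def pool2d(x, n, mode="max"):
    """Non-overlapping n x n pooling (stride == window size),
    shrinking an (H, W) map to (H/n, W/n)."""
    H, W = x.shape
    # Reshape into non-overlapping n x n blocks, then reduce each block.
    blocks = x.reshape(H // n, n, W // n, n)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [5, 6, 1, 2],
                 [7, 2, 9, 4],
                 [1, 0, 3, 8]], dtype=float)
print(pool2d(fmap, 2, "max"))   # keeps the largest value per 2x2 block
print(pool2d(fmap, 2, "mean"))  # averages each 2x2 block
```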

Proposed Improved Fuzzy Integral
The fuzzy density in traditional fuzzy integral is often set according to user experiences. Many optimization algorithms can be used to determine the optimal fuzzy density, such as particle swarm optimization (PSO), genetic algorithm (GA), differential evolution (DE), firefly algorithm (FA), and artificial bee colony (ABC). Without crossover and mutation steps, PSO is relatively simple, with fewer parameters to adjust, and easy to implement in software and hardware. Therefore, in this study, the proposed improved fuzzy integral uses PSO to automatically determine the optimal fuzzy density.
The hypothesis X = {x_i}, i = 1, …, m, represents a set of m classifiers, i.e., the CNNs used in this study, and g(x_i) indicates the fuzzy measure, regarded as the worth of the subset {x_i}. That is, x_i is the output of the i-th classifier and g(x_i) is the confidence degree of this class. A fuzzy measure takes values in [0, 1]: if the value is 1, the output can be fully trusted; if the value is 0, the output has no reference value. The fuzzy measure properties are as follows:

(1) g(X) = 1 indicates that when all classifier outputs are considered together, the result must be trusted.
(2) g(∅) = 0 indicates that an empty set of classifiers provides no support.
(3) The fuzzy measure should be a monotonically increasing function: if A ⊆ B ⊆ X, then g(A) ≤ g(B).

To obtain the fuzzy measure, one starts from the fuzzy densities, which are the measures of subsets containing only one element, for instance g({x_1}) and g({x_2}). The relationship between fuzzy measures and fuzzy densities is presented in Figure 6. In this figure, g({x_1}), g({x_2}), and g({x_3}) are called fuzzy densities, and g({x_1, x_2, x_3}) is the confidence degree of the specific class obtained by combining three CNNs.

PSO [17] is an optimization algorithm based on the concept of foraging birds: birds flying in the air for foraging can be seen as particles in the solution space of a problem, and through multiple iterations the best solution can be obtained. The fuzzy densities are n × k floating-point numbers between 0 and 1, where n is the number of classifiers (i.e., the number of CNNs) and k is the number of categories in the classification problem (i.e., the number of classes); therefore, the solution space has n × k dimensions, each ranging from 0 to 1. In the initialization phase, y particles are randomly generated in the solution space, and each particle p denotes a set of fuzzy densities, as depicted in Figure 8. Then, the fitness value of each particle is evaluated by calculating the precision on the verified data:

fitness = TP / (TP + FP)

where TP is the number of samples that are predicted correctly, and FP is the number of samples that are predicted incorrectly.
In the particle swarm, the position with the best overall fitness value is defined as Gbest, and the position of each particle's best fitness value is defined as Pbest_i, where i = 1, 2, …, y. When updating a particle's position, the current fitness is evaluated to check whether it is better than Gbest and Pbest_i; if so, the corresponding position is updated and recorded:
v_id(t+1) = w·v_id(t) + c_1·rand()·(Pbest_id − p_id(t)) + c_2·rand()·(Gbest_d − p_id(t))    (5)

p_id(t+1) = p_id(t) + v_id(t+1)    (6)

where v_id(t) expresses the velocity of the i-th particle in the d-th dimension at the t-th iteration; w, c_1, and c_2 are variable parameters; p_id(t) is the position of the d-th dimension of the i-th particle at the t-th iteration; and rand() is a random value from 0 to 1. From Equations (5) and (6), it can be observed that each particle, while moving, refers to its own past experience and the group experience to determine the fuzzy density with the highest accuracy, until the maximum number of iterations is reached. Once the fuzzy densities are known, the fuzzy measure can be calculated by the following recursion:

g(A_i) = g_i + g(A_{i−1}) + λ·g_i·g(A_{i−1})    (7)

λ in Equation (7) can be calculated from Equation (8):

λ + 1 = ∏_{i=1}^{n} (1 + λ·g_i)    (8)

The number of λ values in the improved fuzzy integral is equal to the output dimension, which indicates that one λ is calculated for each output category, and λ ∈ (−1, ∞) with the following characteristics:

(1) if ∑_{i=1}^{n} g_i = 1, then λ = 0;
(2) if ∑_{i=1}^{n} g_i < 1, then λ > 0;
(3) if ∑_{i=1}^{n} g_i > 1, then −1 ≤ λ < 0.

In this paper, two output rules, Sugeno and Choquet, are selected to implement the fuzzy integral. Before calculating the two output rules, some parameters and sets must be defined. First, the outputs of the n classifiers h(x_i) are sorted so that h(x_{π_1}) ≥ h(x_{π_2}) ≥ … ≥ h(x_{π_n}), where π_j denotes the classifier with the j-th highest output value. For example, π_2 = 3 indicates that the third classifier's output value is the second largest, and h(x_{π_1}) represents the maximum output value of the classifiers. Then, the following set is defined: A_j = {x_{π_1}, x_{π_2}, …, x_{π_j}}. Equations (11) and (12) present the improved fuzzy integral outputs of the Sugeno and Choquet rules:

Sugeno FI: e = max_{j=1,…,n} [ min( h(x_{π_j}), g(A_j) ) ]    (11)

Choquet FI: e = ∑_{j=1}^{n} [ h(x_{π_j}) − h(x_{π_{j+1}}) ]·g(A_j), where h(x_{π_{n+1}}) = 0    (12)

In this study, the two fuzzy integral rules are calculated simultaneously, and the higher of the two results is taken as the output of the fuzzy integral.
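Assuming the standard λ-fuzzy measure and the Sugeno/Choquet rules described above, the fusion step can be sketched as follows. The helper names (solve_lambda, fuzzy_integral) and the example supports and densities are illustrative, not values from the paper; the nonzero root of Equation (8) is found here by bisection.

```python
import numpy as np

def solve_lambda(g, tol=1e-12):
    """Solve prod(1 + lam*g_i) = 1 + lam for the lambda-fuzzy measure.
    g: fuzzy densities of one class, each in (0, 1)."""
    g = np.asarray(g, dtype=float)
    s = g.sum()
    if abs(s - 1.0) < 1e-9:
        return 0.0                                # densities sum to 1 -> lam = 0
    f = lambda lam: np.prod(1.0 + lam * g) - (1.0 + lam)
    if s < 1.0:                                   # nonzero root lies in (0, inf)
        lo, hi = 1e-9, 1.0
        while f(hi) < 0.0:                        # expand until the root is bracketed
            hi *= 2.0
    else:                                         # nonzero root lies in (-1, 0)
        lo, hi = -1.0 + 1e-9, -1e-9
    while hi - lo > tol:                          # plain bisection
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

def fuzzy_integral(h, g):
    """Sugeno and Choquet fuzzy integrals of classifier supports h
    with respect to the lambda-fuzzy measure built from densities g."""
    h, g = np.asarray(h, float), np.asarray(g, float)
    lam = solve_lambda(g)
    order = np.argsort(-h)                        # sort classifiers by support
    hs, gs = h[order], g[order]
    # g(A_j) via g(A_j) = g_j + g(A_{j-1}) + lam * g_j * g(A_{j-1})
    gA = np.empty_like(gs)
    gA[0] = gs[0]
    for j in range(1, len(gs)):
        gA[j] = gs[j] + gA[j - 1] + lam * gs[j] * gA[j - 1]
    sugeno = max(min(hj, gj) for hj, gj in zip(hs, gA))
    h_next = np.append(hs[1:], 0.0)               # h(x_{pi_{n+1}}) = 0
    choquet = float(np.sum((hs - h_next) * gA))
    return sugeno, choquet

# Three CNNs' support for one class and their (hypothetical) fuzzy densities.
sug, cho = fuzzy_integral([0.9, 0.3, 0.6], [0.4, 0.3, 0.2])
print(sug, cho)
```

Note that g(A_n), the measure of the full classifier set, comes out as 1 regardless of λ, which matches property (1) of the fuzzy measure.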
Experimental Results

To verify whether the superior recognition rates of MCNNs-IFI result from combining different network architectures or different optimization methods, three trainings were performed with the same architecture and optimization method, and a further training that integrated different architectures and optimization methods was performed using the improved fuzzy integral. Comparing the results across various optimization methods and a single optimization method reveals their difference.
Multi-PIE Database

In this study, the Multi-PIE and CK+ databases are used, and seven expressions are initially predefined. To objectively evaluate the accuracy of the fuzzy integral, cross-validation was used on the Multi-PIE database by dividing the data into 10 groups, and the results of the groups were averaged to obtain a precise accuracy. Moreover, each group of data was divided into training, verified, and testing data. Only the training data were involved in the CNN training process, the verified data were used to adjust the fuzzy density, and the testing data were used to report accuracy. When the fuzzy density was optimized, the maximum number of iterations was 1000, w was set to decrease linearly from 0.8 to 0.2 according to the number of iterations, and c_1 and c_2 were set to 2. For the CK+ database, each group of data was split into training and testing data; all the initial parameters, such as w, c_1, c_2, and the maximum number of iterations, were set to the same values as for the Multi-PIE database. Tables 1-6 present the results on the 10 groups of verified data; in these tables, the combination of CNNs using MCNNs-IFI shows higher accuracy.

Table 3. Recognition results of different architectures with stochastic gradient descent (SGD) optimization method.
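The PSO settings above (w decreasing linearly from 0.8 to 0.2, c_1 = c_2 = 2) can be sketched as follows. Since the real fitness requires trained CNNs and verification data, a toy quadratic fitness stands in for the precision TP/(TP + FP), and the particle count and iteration budget are reduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_CNN, N_CLASS = 3, 7             # n classifiers, k classes -> n*k fuzzy densities
DIM = N_CNN * N_CLASS
N_PART, MAX_ITER = 20, 200        # the paper uses 1000 iterations
C1 = C2 = 2.0

def fitness(d):
    """Stand-in for the verification-set precision TP/(TP + FP):
    a toy quadratic target so the sketch runs without trained CNNs."""
    return -np.sum((d - 1.0 / N_CNN) ** 2)

pos = rng.random((N_PART, DIM))               # densities initialized in [0, 1]
vel = np.zeros((N_PART, DIM))
pbest = pos.copy()
pbest_fit = np.array([fitness(p) for p in pos])
gbest = pbest[np.argmax(pbest_fit)].copy()
err0 = -pbest_fit.max()                       # best error before optimization

for t in range(MAX_ITER):
    w = 0.8 - (0.8 - 0.2) * t / (MAX_ITER - 1)    # w decreases linearly 0.8 -> 0.2
    r1, r2 = rng.random((2, N_PART, DIM))
    vel = w * vel + C1 * r1 * (pbest - pos) + C2 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)            # keep densities in [0, 1]
    fit = np.array([fitness(p) for p in pos])
    better = fit > pbest_fit                      # update each particle's Pbest
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[np.argmax(pbest_fit)].copy()    # update the swarm's Gbest

err_final = -fitness(gbest)
print(err0, err_final)            # the error shrinks as the swarm converges
```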

CK+ Database
The CK+ database contains 593 sequences across 123 subjects, and the images were taken under different brightness levels, as shown in Figure 10. All facial photos are categorized into eight expressions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise; the results are displayed in Tables 7-12.
The results of different architectures with the same optimization method and the combination of CNNs using MCNNs-IFI are shown in the first three tables. Tables 10, 11, and 12 reveal the results of the same architecture with different optimization methods and the combination of CNNs using MCNNs-IFI. In these tables, the combination of CNNs using MCNNs-IFI exhibits the highest accuracy among the compared methods. Figure 11 shows that applying the proposed MCNNs-IFI to various CNNs is feasible and effectively provides a high overall recognition rate. Figures 12 and 13 depict two testing photos, and their recognition results using SGD as the optimization method are presented in Tables 13 and 14. Although the complexity increases as more CNNs are added, the advantages of each CNN are retained by the proposed method. Moreover, in the recognition results below, only one CNN predicts the correct class while the other two classifiers are incorrect; if the three classifiers were combined using the traditional voting concept, the final result would be incorrect (columns correspond to the expression classes):

Ground truth       1.00  0.00  0.00  0.00  0.00  0.00  0.00
LeNet [1]          0.99  0.00  0.01  0.01  0.00  0.00  0.00
AlexNet [3]        0.17  0.00  0.00  0.83  0.00  0.00  0.00
GoogLeNet [4]      0.01  0.00  0.00  0.99  0.00  0.00  0.00
Proposed method    0.57  0.00  0.00  0.53  0.00  0.00  0.00

Conclusions
In this paper, the proposed MCNNs-IFI was implemented for recognizing facial emotions. It takes the output data from several trained CNNs as input data to the fuzzy integral, and the classification results are computed using the Sugeno or Choquet output rules. Furthermore, the optimal fuzzy density values in the fuzzy integral were determined through PSO. In the experiments, the Multi-PIE and CK+ databases were used to classify facial emotions, and the results indicated that the proposed MCNNs-IFI achieved at most 12.84% and 14.61% higher accuracy than AlexNet, GoogLeNet, and LeNet on the Multi-PIE and CK+ databases, respectively.
Future studies should automatically adjust the number of CNNs in the fusion process and automatically filter out weaker or redundant classifiers to achieve the best combination of classifiers.