A Bayesian Scene-Prior-Based Deep Network Model for Face Verification

Face recognition/verification has received great attention in both theory and application for the past two decades. Deep learning has been considered as a very powerful tool for improving the performance of face recognition/verification recently. With large labeled training datasets, the features obtained from deep learning networks can achieve higher accuracy in comparison with shallow networks. However, many reported face recognition/verification approaches rely heavily on the large size and complete representative of the training set, and most of them tend to suffer serious performance drop or even fail to work if fewer training samples per person are available. Hence, the small number of training samples may cause the deep features to vary greatly. We aim to solve this critical problem in this paper. Inspired by recent research in scene domain transfer, for a given face image, a new series of possible scenarios about this face can be deduced from the scene semantics extracted from other face individuals in a face dataset. We believe that the “scene” or background in an image, that is, samples with more different scenes for a given person, may determine the intrinsic features among the faces of the same individual. In order to validate this belief, we propose a Bayesian scene-prior-based deep learning model in this paper with the aim to extract important features from background scenes. By learning a scene model on the basis of a labeled face dataset via the Bayesian idea, the proposed method transforms a face image into new face images by referring to the given face with the learnt scene dictionary. Because the new derived faces may have similar scenes to the input face, the face-verification performance can be improved without having background variance, while the number of training samples is significantly reduced. Experiments conducted on the Labeled Faces in the Wild (LFW) dataset view #2 subset illustrated that this model can increase the verification accuracy to 99.2% by means of scenes’ transfer learning (99.12% in literature with an unsupervised protocol). Meanwhile, our model can achieve 94.3% accuracy for the YouTube Faces database (DB) (93.2% in literature with an unsupervised protocol).


Introduction
Face verification or recognition has attracted much attention with rapid progress in the past two decades, particularly equipped with recent deep learning techniques for significant performance improvement. However, its high performance usually depends on features extracted from a large number of labeled training samples as requested by the developed deep learning techniques. In these Although most of the previous works in literature perform very well for popular face databases such as the LFW and YTF datasets [9,10], they still to some extent rely on the background variance of the dataset, which is referred to as the "scene" in this paper. That is to say, they require the training dataset to be large enough, that is, including many samples, to represent sufficient scenes as requested. Currently, many researchers focus on proposing approaches for face recognition/verification on the basis of idealistic scenes and validate their methods on specific datasets while ignoring the fact that the face samples in training and testing sets are frequently present in different scenarios. Therefore, many previous approaches are limited or fail for some applications if the size of the training samples is quite limited in terms of scenes. In order to build a robust face recognition or verification system, effective feature extraction as well as semantic scenes extracted from training samples are highly recommended [11]. The extraction of scene semantics in natural scenes has been well studied in [12] and [13]; we borrow the scene concept for face semantic segmentation tasks. In brief, the motivation of this paper is attributed to the fact that the same person with various backgrounds can improve the face-verification performance. This rationality can be justified by the following two aspects: on one hand, the dataset is effectively augmented via domain transfer learning; on the other hand, the final features learned from deep neural networks (NNs) are facilitated because of the enrichment of individual face scenes used for training. In summary, the main contributions of this paper can be summarized as follows: • We propose a scene model based on the Bayesian deep network technique, which can infer several complicated scenes for the face-verification task.
• A new unsupervised face-verification model is developed on the basis of the scene transfer learning technique. • Experiments on two challenging datasets validated the proposed model in the case of a lack of sufficient training samples.
The organization of the rest of the paper is as follows: In Section 2, we review the literature in this area; Section 3.1 develops the Bayesian prior scene model, and Section 3.2 focuses on the scene inference. Finally, in Section 4, we propose the deep learning model and present some quantitative results of our Bayesian scene based network model for face verification. The conclusions are given in Section 5.

Previous Work
As deep learning NNs are our main concern in this paper, we only review some results related to NNs. In literature, there are three main categories of networks related to the proposed network model: highly deep network based (HDNB), large-dataset-based (LDB), and multimodal-based (MMB) networks. As for the HDNB network, this category needs to use labeled data with a very deep network for training in order to achieve higher accuracy. In practice, the HDNB approaches rely on a deep structure, and they comprise a long sequence of convolutional layers. For example, the deep residual net [14] has more than 1000 layers and achieved 5.71% top-1 error on ImageNet validation. In 2014, Christian Szegedy [15] proposed a 22 layer net, the "Google Net", which is a network with a carefully crafted design that allows for increasing its depth and width while keeping the computational budget constant. However, this type of network model cannot be too deep because of vanishing gradients. Fortunately, Ropes Kumar Srivastava et al. [16] developed an approach to enable the depth of the net to increase without the vanishing-gradients constraint. They use the "transform gate" to transmit the information derived from the input data to keep the gradient descent from vanishing. Even with hundreds of layers, such networks still can be trained directly through simple gradient descent. Their research enlightened the study on extremely deep architectures with efficiency. LDB methods aim to train a classifier with a large dataset. Yana Taiga et al. [14] used 4 million data samples with a bootstrapping process and improved the transferring capability of the network with the aim of discovering the connection between the representation and the capability of discrimination. In 2015, Google [17] proposed a convolutional neural network (CNN) model (FaceNet) that could directly learn a compact Euclidean mapping from face images. As reported, it could achieve a highaccuracy of 99.6% on the LFW benchmark [10]. However, this approach needs to be trained on a large dataset with a data size of about 200 million. Yana Taigman [18] (Deepface) employed a 3D face model to align using locally connected layers without weight sharing. Later they declared [3] that the CNN method has a bottleneck with increasing data. They proposed a solution for alleviating this by replacing the naive random subsampling in the training set with a bootstrapping process. Moreover, a link between the representation norm and the capability of discrimination in a target domain was discovered, and this research sheds lights on how such networks can represent faces. Although it is suggested that the larger the data size is, the higher accuracy one can achieve, it is also clear that the last 0.4% point is hardly to be achieved by only increasing the size of the training dataset. Particularly, 0.1% improvement needs an increase in the data size of 199.99 million; the tendency is shown in Figure 2. MMB approaches are in essence associated with ensemble methods, which involve multi-models to work together. Jintao Liu et al. [16] (the Baidu Face) exploited a two-stage method based on multi-patch features and metric learning with triplet loss. Their method achieved 99.77% for pairwise verification. However, it tends to deplete too many computing sources because it needs too many overlapped computing jobs. Sun Jain et al. [19] tried different structures and different patches to extract the aligned face features. After that, they proposed a classifier to differentiate between different faces (verification or identification) and achieved high accuracy on both the LFW benchmark and Casia [2] dataset. As far as above three categories to be concerned, there is a consensus that the key issues to improve the face recognition/verification performance are to design a reasonable deep network with appropriate data size and then to work with a hybrid process model. However, the reality is that we can often have different methods while lacking an appropriate size of labeled data at hand. This is in fact the well-known problem of a small number of training samples, and it has been investigated extensively by the computer vision community [20]. This problem has not been sufficiently tackled for deep learning NNs, and currently researchers are trying hard to develop new network structures or promote broad applications for the currently existing network structures. This is mainly due to the fact that deep learning NNs are very complicated, as they aim to mimic the human brain functions and are still in a developing stage. To our surprise, the networks for the small data size problem do not perform as well as human beings. That is to say, human beings can distinguish between a large number of faces by using very few datasets in learning. For example, the work by Salakhutdinov, Ruslan et al. [21] could extract new features from very few training examples by learning both lowand high-level generic features, and these features could fully represent correlations between lowand high-level features. Motivated by their approach, we define the high-level features as a scene in this paper and propagate these scene factors to a deep learning network that is exploited to learn a practical model from input face images. Details are presented in following sections. For convenience, the related symbols and concepts are first listed in Table 1.

Symbol Notation
Scene; number is unknown in advance φ mix , φ pure Two feature spaces, the pure and the mixed space C Category p(,) PDF (probability density function) I mix The overlapping image space θ The statistics (µ; Σ) The pixel value at position u, v π k The PDF of the kth scene i,j,k The image, category, and scene orders φ x Features extracted from the image

The Proposed Methodology
To better understand the method presented in this paper, we need to clarify the concepts of the high-level features or scene first. Previously, S. Zheng et al. introduced a new form of CNN that combines the strengths of CNNs and conditional random field (CRF)-based probabilistic graphical modeling for segmentation [13,22,23]. Motivated by the proposed methods in [13,24], we embed the scene concept into the above semantic segmentation task. By "plugging" CRFs into the CNN, we can obtain a new deep network that has desirable properties derived from CNNs and CRFs. Although the network was originally designed for semantic segmentation, it provides great benefits when we integrate this idea into our approach for scene extraction and scene backwards propagation. In Figure 3, the first column is the original scene, and after the semantic segmentation shown in the second column, we can deduce more new scenes from columns 3 to 6, as we explain in the following sections.

The Bayesian Scene-Prior-Based Deep Network Model
First, the proposed pipeline for face verification in this paper is outlined in Figure 4. As shown in Figure 4, the whole process consists of two steps: the training and the verification. Next, we explain each part in detail. We first propose a combination of a Bayesian network and a CNN, which concerns both the global and the local feature distributions. Similarly to a human being's cognitive process, it is believed that a good object classifier should first have a good capability to understand the scenes, and then its capacity for knowing about objects' existence in these scenes can be much improved. In view of this, the proposed method consists of the following steps: (1) to learn the scenes; (2) to express the scenes; (3) to feed scene factors into the CNN training process and optimize the parameters for face verification; and (4) to finally feedback the learnt knowledge into a new learning iteration step if a new scene is given. Inspired by the contributions of a previous study that utilized the latent Dirichlet allocation (LDA) model to learn the natural scene categories [12], we propose a similar new method to learn the scene distributions between the face pairs' overlapping spaces. In this context, we consider the scene variance as continuous and use the mixture Gaussian model instead of the multi-nomial model to describe such distributions. The main idea is to transform a face pair into a very close scene in order to lessen the possible scene variation effect in face verification. As shown in Figure 5, after roughly detecting and aligning the face images from a dataset, the detected faces are segmented by the CRF to build a series of scene candidates. These preliminary candidates are then dealt with by the succeeding CNN feature extraction in order to learn a scene expression. Ultimately, a scene dictionary is output according to the distance measurement among any given face pair for the same person.  For clarity, we further define the scene learning by a rigid mathematical description: where Σ is the covariance matrix, and M is the number of scenes for a given person. The initial dictionary entry is determined on the basis of the distance d between the given face pairs.As noted, the covariance matrix is estimated by the face feature vector for the same person. For the purpose of generating an augmented new dataset according to the learnt scenes, we need to know the relationship between the scenes and features first. Once a scene entry in the dictionary is given, the feature distribution for the same person can be expressed as a mixture Gaussian distribution, as below: Given the parameter θ, and where µ and Σ are the distribution parameters' mean and the covariance to be learnt in the training process, then given the scene S, the feature f , and the scene S, the model described in Equation (4) is followed: When any two faces of a pair have a different θ value, the distance between them will be relatively large; conversely, the distance tends to be small. At the beginning of the procedure for our model, the face categories or scene entries C are randomly initialized. We suppose that there is a dataset containing M images denoted by X = (x 1 , x 2 , . . . , x m ) for a given person i. For convenience, we use the symbol x instead. One image may have K scenes with all the probabilities correlative to the change in the faces. In order to decide which scene a given image belongs to, the nearest-neighbor rule is applied to the set of the probabilities calculated from Equation (4) above. The scenes are not in discrete space, but they are continuous in s ∈ [0, 1]. The closer s is to 1, the more likely the image belongs to the given scene. Therefore, where π k is the distribution of scene s k . Eventually, more faces can be generated for the imbalance training dataset according to Equation (5) on the basis of the learnt scene dictionary, as shown in Figure 6. As for the face pairs shown in Figure 7, Equation (4) can be expressed in another way, as follows: p(x, y) = p(x|s)p(s|y)p(y). Furthermore, on the basis of the conditional distribution in the graph model of p(x|s), we denote this latent variable by γ(s k ); thus, where π k is the prior probability of a scene. The above Bayesian model describes the prior relationships among the images, features, and image categories. An image needs a scene to describe the condition that a face is shown in it; therefore, one person's face may have several scenes to be assigned to. In this context, the scene model is used as a prior to the image, and then we exploit the semantic pixels to determine the face region. This priori information is backward-propagated to the next iteration for generating more reasonable faces. We usually refer to the scenes as the learnt higher-level features, each of which will have a dramatic effect on the distribution of the face images. During this iteration, the face in different scenes is determined semantically [25] by calculating the low-level (pixel-level) distribution, which can be expressed as a Dirichlet distribution. The scene of a given face is one of K random variables subject to Up to here, we have addressed how to create K scenes from training samples; next we explain how to infer a scene for a given face.

Scene Inference
Each image is a mixture of scenes, where the corresponding latent variables γ(s k ) are denoted by an N × K matrix x with s k n rows (for a given person with n face images and K scenes). We let s be the kth scene, such that where µ and Σ are parameters learnt from the ground truth. The PDF of a scene is given by Equation (9). The scene can be considered as a main background factor for the identity of a given object, which is learnt from mixed image spaces. When a new scene is to be learnt, the original imbalanced data will even be augmented by generating new scene images. Eventually, this enlarges the original image space in the direction of the missing data characteristics by using iterative backward propagations. The number of scenes is a latent variable, which is a constant and is determined by the images from the training set. Then, the scene s is subject to s = argmax s (p(s|θ)).
Now the problem is how to propagate the scene information to the data features and make the data richer with the new scene feature. The aim is to obtain p(image|s mix ), where s mix is the hybrid face scene in a verification task. Up to here, we have already been able to calculate p(x), p(c), p(s), p(s|x, c), and p( f ). Hence, the goal can be achieved by the following steps: (1) Perform convolution and pooling on I(u, v).
(2) Determine a mixed feature space ϕ mix or a mixed image space, as shown in Equation (11).
By using the L2-norm, the mixed image space can be derived by the following equation: where i 1 and i 2 are two different images; f is the feature extracted from the images; and ε is an empirical value initialized within [0.3, 0.5], as we need 90% error rate hits when the similarity is in this interval.
As defined previously, θ represents the parameters to be already learnt with θ = (µ,Σ) and s n is the nth scene. When a scene is given or the distribution of the scene is fixed, we solve Equation (12) by maximizing the log of likelihood function; thus we have where µ and Σ are determined by the derivative of the log likelihood with respect to µ. In fact, µ and Σ are estimated by ∧ µ, ∧ Σ, as follows: Because the parameters s and θ are coupled, the variational approximation technique is used to maximize the log likelihood of the data and to minimize the Kullback-Leibler divergence between the approximation and the true posteriors. Here, we use the distribution q(.) to approximate the true distribution. Then we optimize Equation (12)  ≥ ∑ s q(π|s) log p(π, s, x|θ) − ∑ s q(π, s) log q(π, s) By defining L(γ, x, θ) for the right-hand side (R.H.S.) of the above equation, we have log p(x|θ) = L(γ, x, θ) + KL(q(π, s|γ)||p(π, s|x, θ)), where KL() is the Kullback-Leibler divergence between any two distributions, q(π, s|γ) is an arbitrary variational distribution, and γ is the above-mentioned latent variable. The second term on the R.H.S. of the above formula stands for the KL distance of the two probability densities. As shown in Figure 8, the original size of the training datasets is finally enlarged according to the scene inference procedure, and then the enriched datasets are fed into a model training process in order to obtain a model for the next verification task.

Hyperparameter Optimization
By simply considering the So f tmax loss or the Euclidean loss, the loss function is able to be affected by noisy scenes. Thus, we propose a ground truth distribution model based on the mixture Gaussian model. We denote the NN energy by E in and the distribution energy by E scene . Then we form the following distribution: where I(x) = (I 1 , I 2 , . . . , I N ) are the input images; ζ is a penalty item, which is a constant here; and α is the coefficient factor. The energy function has two main components. The first component is a plain NN error, which aims to make the predicted result much closer to the ground truth. However, if we only have this term, a large dataset is required in order to achieve a relatively higher accuracy. The training process also tends to be out of control after learning a batch size of the dataset and then results in an overfitting. As we know that the CNN was originally designed to learn the general features in a class, it cannot fit the variance of the training images. Thus we add the second term to generate the variance for the existing images, and we enlarge the variance by generating different scenes for a given image via the scenes' transformation. The bound becomes tight if and only if p(x) = q(x).
In addition to maximizing the log likelihood of the dataset, the conditional constraint (as shown in Equation (17)) could also select parameters that minimize the Kullback-Leibler divergence between the approximation and true posteriors. For implementation, the energy is approximated by an expected maximum (EM)-based variational learning method.

Overlapping Distributions' Transform
The validation dataset falls into two parts: one is data to be recognized easily with relatively discriminative feature spaces; the other contains some of the less familiar images that lie in a crossing space and that cannot be directly classified. We name all such image datasets in the crossing space as where η or ε is the threshold to determine the marginal of the overlapping space. Before we have the samples (refer to Section 3.1) to be generated for the next iteration and take the scene as a prior for the next time, we treat the difference in the pair as the scene variance. Then we can enlarge the distance in the overlapping classes' space with the dataset s and the scenes in s. Finally, we obtain the easily wrongly labeled faces and scenes for the overlapping space. However the difficulty is that the selected features are not sufficient to discriminate between the current varying scenes. Luckily, we are motivated by the human visual system in which, when people cannot recognize one person by the given low-level features, they would try to find more discriminative high-level features. Usually they extract the high-level features or semantic features. Similarly, we use the face region generated by the CRF to extract high-level semantic features for the recognition task. The semantic feature extraction procedure is as shown in Figure 9. In this procedure, a semantic feature is determined by the CNN extracted feature on the basis of the face region produced by the CRF. As shown in Figure 9, the extracted features will be used as input for subsequent face-verification tasks. We note that the main purpose for the scene dictionary exploited in this context is to help the transformation of a given face into a specified scene.

Scene Backward Propagation
Now we describe how to obtain the scene distribution among the images in the training dataset. As for Equation (17), the gradient of E scene can be expressed as follows: where l represents the layer in the NN, θ is a parameter of the scene distribution, and f l is the output activation function. The scene is propagated by the overlapping space. Then we can extract the scene distribution in Equation (22) and transform it back to the first layer of the NN.
where µ k is the learning rate for scene propagation, and δ is the gradient of a scene in pairs. The whole process is listed in the algorithms in the appendix.

Experiments and Results
For face detection, a face detector is used on each image, and a tight bounding box around each face is generated. These face thumbnails are resized and aligned to a size of 141 × 165 pixels. At the beginning, we use a CNN model to train the deep features. The training-and validation-related topics are given in the following sub-sections.

Datasets and Evaluation
The new method was evaluated for a face-verification task; that is, given a pair of two face images, a squared Cosine metric τ(x i , x j ) was used to determine whether the two images were the same or a different person. We used CASIAWebFace [25], which contains 10,575 subjects and 494,414 face images used to train our model. As for the evaluation, LFW [10], which contains 13,233 images with 5749 identities collected from the Web with large variations in pose, age, expression, illumination, and so forth, and YTF [3], a video dataset containing 3425 videos of 1595 different subjects downloaded from YouTube, were used. We considered the unsupervised protocol and followed the standard setting as described in [26]; in addition to the verification accuracy (Acc.), we used the ROC Receiver Operating Characteristic Curve to evaluatethe performance. We conducted the evaluation under the following setting: a cross-dataset validation, in which external data (CASIAWebFace) exclusive to LFW/YTF was used for training in order to show the generalization ability across different datasets. The datasets we used for validation were the LFW dataset in view #2, which has 6000 pairs, and the YTF face dataset, which has 5000 face pairs. As for the scene dictionary learning, we also exploited CASIAWebFace as the training dataset.

Training Process for the New Model
To find the differences with enough training images (ETI: training datasets augmented using scene transformation) and without enough training images (WETI: using the raw data as the training data), 250,000 iterations were run on both the ETI and WETI datasets. The observed loss (error) is illustrated in Figure 10. The green curve indicates the loss convergence speed for WETI and the red indicates that for ETI. One can observe that the convergence speed for ETI was quite fast. As discussed in the above section, Table 2 illustrates the feasible hyperparameters for our proposed model.

The Semantic Features' Extraction
As mentioned in the model description section, the semantic features are extracted after the CRF-RNN process. Essentially speaking, CRF-RNN is exploited to build a relationship between a scene and face by extracting the face regions from the scene backgrounds in which they co-exist (refer to Figure 6). Then the proposed CNN model can express the face feature with scene information in a semantic way. With the benefit of the representations from intermediate layers, we can turn any face image into a vector containing important scene attributes of the face. As shown in Figure 11, it is indicated that the semantic feature distribution for the same person tends to be very close (middle); on the contrary, the semantic feature distribution for different people varies considerably, as shown in Figure 11 (right), although the faces are transferred to the same scene level.

Distribution of Semantic Features
We randomly selected 10,000 images from the CASIAWebFace dataset and then extracted their features using both the VGGFace model [27] and our proposed model. Because it is difficult to visualize the distributions of extracted features directly, tSNE [28] was usedfor reducing the extracted features' dimension from a high dimension (10,575) to a low dimension (only 2). In this way, the differences in the distribution could be projected onto a two-dimensional plane. As shown in Figure 12, the same colors represent the same individual, and different colors represent different people. According to Figure 12 (left), the VGGFace model had a nearly circular feature distribution; that is, the extracted features of each category tended to gather together in a relatively small radius. As we can see, about 50% of the categories gathered in a compact form, while the rest were scattered (see Figure 12 (left)). Zooming in on the details, there was extensive overlapping among the features from the selected four individuals, as observed in Figure 12 (right).
Next, we look at those features extracted by our proposed model. As we see in Figure 13 (left), the distribution of respective individuals looked more like a strip. It is also observed that about 80% of the categories were dispersed with a relatively larger spacing. Compared to the zoomed-in distribution property shown in Figure 13 (right), Figure 13 (right) illustrates less overlapping, and the distribution for inner classes was much more uniform along certain directions than that the VGGFace model produced.

Comparison with Other Models
In order to validate the performance of the proposed model, an unsupervised protocol [10] was used on both the LFW and YTF datasets. The reason we chose this protocol for comparison was that a strict generalization ability is necessary for the face-verification task in an open scene. Firstly, we compare the Casia [25] and VGGFace [27] models with our model. For the training step, all the models were trained on the CASIAWebFace dataset, and then these three models were validated on the YTF dataset. For the verification step, the proposed model needs the face pair to be transformed into the same scene (see Figure 9). The ROC curves for the three models on the unsupervised protocol were plotted, as shown in Figure 14. One can see that the performance of our proposed model on the LFW dataset was better than that of the Casia [25] model (Area Under Curve (AUC) gap of 0.0427), but the AUC of the VGGFace model was 0.0301, which was slightly better than for our model. However, the VGGFace model has an unrestricted protocol, and a much larger dataset for training is inevitable. Secondly, we compared the AUC with the top 10 of the leading board [9,25,29,30] for the face-verification task on the LFW dataset (see Figure 15). We also compared several up-to-date leading models, such as Casia [25], DeepFace [18], OpenFace [31], and VGGFace [27], in an unsupervised protocol for the YTF dataset (see Figure 16). As observed in Figures 15 and 16, our proposed model performed the best, and its AUC was 0.0075-fold higher than that of the VGGFace model.  Thirdly, we summarize various protocols, training datasets, and networks in Table 3. Some performances listed in Table 3 are from the related publications. Here, it is shown that under the unsupervised protocol, our model achieved the best result on both the LFW and YTF datasets. In order to see the benefits much more clearly, we also compared the performances between ETI and WETI, which illustrated that ETI could gain an advantage of about 0.0062 over WETI on the LFW dataset, as shown in Figure 17.  Figure 17. ROCs with unsupervised verification protocol on LFW dataset for without enough training images (WETI) and enough training images (ETI).

Conclusions and Discussions
In this paper, a new deep learning model is proposed for face verification and is essentially used to solve the small number of training samples requested by current deep learning networks. The main idea is to use scene transfer learning to generate more images for validation. The proposed model was evaluated from multiple perspectives. With the unsupervised protocol, our model performed better than the existing leading algorithms on the LFW dataset. According to Figure 16, its performance was at least 0.7% higher than that of other models under an unsupervised verification protocol. As illustrated by the ROCs, the VGGFace model slightly outperformed our proposed model on the YTF dataset; however, our model performed much better than the VGGFace model on the LFW dataset.
The key point is that our model has much greater generalization capability, as it can deduce many more scenes for the training dataset. As observed in Table 3, it not only lists the performance and protocol, but also the training data size and network used by each model. As we can see, although our model was superior to DeepFace, with AUCs greater than those of LFW and YTF by 3.9% and 2.9%, respectively, the size of the training dataset was just 1/8 that of DeepFace. That is to say, we required much less data for our proposed model to train a better CNN model for the face-verification task. Even with the similar training data size (0.5 Million), in comparison with the CASIAWebFace model on the supervised protocol, our model achieved better results than LFW and YTF by 1.44% and 2.06%, respectively. The proposed model has only a relatively simple network structure. In contrast, the Baidu model uses 10 networks and had only about a 0.6% improvement with the unrestricted protocol. Although our model has only a simple network, it still achieved a steady performance for both LFW and YTF with an unsupervised protocol. The DeepID3 model has a much more complicated network (consisting of 200 models), but it achieved only 0.3% higher than the proposed model for LFW, even with a supervised protocol. The proposed model also outperformed the DeepID3 model by 0.9% for YTF with an unsupervised protocol. Hence, we can draw a conclusion that the proposed model can achieve a better performance than the state-of-the-art models with a relatively small amount of training data. As for the dataset LFW, there were several face pairs judged with great difficulty by the other models that could be distinguished between by the newly proposed model. As shown in Figure 18, the numbers stand for the cosine distance between a given face pair. When we set the threshold τ equal to 0.50, these face pairs could then be identified with less difficulty. However, there were still some face pairs that our model failed to verify properly. Figure 19 shows those pairs that were falsely accepted by our proposed model, and Figure 20 illustrates those pairs that were falsely rejected by the proposed model. As we can see, the key reason for the failure of our model was likely that the face images were subject to facial expressions, shelter, and other factors. Our next work will aim for profound research on the facial expression scene and make our model capable of transforming all facial expression scenes into uniform scenes.