CLEANIR: Controllable Attribute-Preserving Natural Identity Remover

Abstract: We live in an era of privacy concerns. As smart devices such as smartphones, service robots and surveillance cameras spread, preserving our privacy has become one of the major concerns of daily life. Traditionally, the problem was addressed with simple approaches such as image masking or blurring. While these effectively remove identities from images, they limit what can still be recognized from the processed images. For example, one may want to obtain ambient information from a scene even when privacy-related information such as facial appearance is removed or changed. To address the issue, our goal in this paper is not only to modify the identity of faces but also to keep facial attributes such as color, pose and facial expression for further applications. We propose a novel face de-identification method based on a deep generative model in which the output vector of an encoder is designed to be disentangled into two parts: an identity-related part and the rest, which represents facial attributes. We show that by modifying only the identity-related part of the latent vector, our method effectively changes the facial identity to a completely new one while preserving the other attributes that are loosely related to personal identity. To validate the proposed method, we provide results from experiments that measure two different aspects: the effectiveness of personal identity modification and facial attribute preservation.


Introduction
In recent years, cameras have become widespread. Surveillance cameras are deployed in almost every public place (e.g., airports, streets, and buildings) and even in some private spaces such as smart houses and vehicles. Smartphones equipped with high-performance cameras make it convenient to take pictures or record daily events anywhere, anytime. Consequently, a huge number of images and videos are processed and even shared through online social networking services in our daily lives. Furthermore, recent advances in computer vision and artificial intelligence technologies have enabled many image-based applications. Due to their high computational burden, however, these techniques tend to require images to be uploaded to high-capacity servers on public networks, resulting in inevitable vulnerability to attacks. Also, these deep learning-based techniques demand large amounts of data to train properly, but the collection of such data is plagued by privacy concerns.
In the meantime, for many applications, facial attributes such as expression, gender, pose and gaze play an important role, and many research works are dedicated to extracting these attributes. Many early works on face de-identification are based on deteriorating the image by blurring [1], pixelization, image segmentation [2], downsampling [3], deletion of part of the face, or cartoonizing [4]. While these methods can effectively remove identity-related information from images, they also discard useful facial attributes that are only loosely related to personal identity, making the results difficult to use in further applications.
Figure 1. ... [5], (e,f) de-identification results from the proposed method obtained by rotating the disentangled identity-related latent vector by 90 and 180 degrees, respectively. As can be seen from the figures, the proposed method results in more natural facial images while the identity is effectively modified.
To address the issue, we propose a novel de-identification method based on a deep generative model that effectively modifies facial appearance while keeping the useful attributes. There are two main streams in deep generative models: variational auto-encoder (VAE)-based models and generative adversarial network (GAN)-based models. GAN-based models are composed of two competing networks: a generator that outputs desired samples given random vectors and a discriminator that distinguishes between generated and real samples. GAN-based models show impressive results, but they are hard to train to convergence and sometimes generate unnatural results. Also, because the generator in GAN-based methods takes a random vector as input, it is difficult to design an explicit relation from features to outputs. VAE-based models are usually constructed in an encoder-decoder structure, as in Auto-Encoders. However, unlike Auto-Encoders, which directly output latent vectors, the encoder in VAE-based methods outputs the parameters of a probability distribution over the latent feature space and uses them to sample a latent vector from the distribution. Because VAE-based methods learn a distribution from training samples, they tend to generate more natural samples. Also, because the decoders in VAE-based methods take latent vectors from the encoders, we can design explicit guidance between features and outputs inside the network.
Regarding the preservation of facial attributes while modifying personal identity, some research works use additional attribute classifiers or facial landmark detectors [5][6][7][8][9]. These methods can successfully preserve some facial attributes. However, they focus only on specific facial attributes and depend on additional algorithms such as a facial landmark detector, a facial expression estimator, or an action detector.
Another approach is to use face swapping methods for face de-identification [10][11][12][13]. These methods produce more realistic, attribute-preserving outputs. However, the outputs are not images of a new person. Rather, the result is a mapping from one face to another that also exists in the training dataset. Therefore, such methods are hard to regard as privacy-preserving techniques.
In this work, we do not try to estimate facial attributes in order to preserve them. Instead, we aim to extract an identity-related vector, and we show that by modifying only this vector the proposed method can effectively change the appearance of a face while preserving the remaining facial attributes that are loosely related to personal identity. We also show that the amount of de-identification can be controlled with simple transformations. As shown in Figure 1, the proposed method generates more natural faces than the other existing methods. Our method can be applied to smart devices such as service robots and surveillance cameras, or to smartphone applications for social networking services, in which people do not want their privacy to be invaded. Because such applications should still be able to analyze captured images to provide useful information, it is desirable to remove only the privacy-sensitive information from images.
The contribution of this work can be summarized as follows:
• A network architecture that explicitly disentangles the latent vector into a personal identity part and a facial attributes part.

Variational Auto-Encoders (VAE) [14] and Generative Adversarial Networks (GAN) [15] are representative deep learning-based generative models that are able to handle intractable probability distributions and large datasets. Similar to Auto-Encoders (AE), VAEs are usually composed of two parts, an encoder and a decoder. The encoders in VAEs are responsible for capturing the probability distribution of latent features, while the encoders in AEs are designed to directly output latent features. VAEs are therefore well known to be effective in modelling latent probability distributions. Several works [16][17][18] have shown how VAEs can be used to learn structured, disentangled and interpretable representations in the latent space. However, outputs from VAEs tend to be blurry. GAN and its variations [19][20][21] have recently been the most popular generative networks. They alternately train a generative model to create samples and a discriminative model to distinguish between real and fake samples. Compared to VAE-based models, GAN-based models generate high-quality and realistic images, but they are harder to train to convergence and output inconsistent samples in some cases. Furthermore, the inputs to their generative model are meaningless random noise and are thus difficult to manipulate.
VAE and GAN have also been applied to the conditional generation of samples. Based on Conditional Variational Auto-Encoders (CVAE) [22] or Conditional Generative Adversarial Networks (CGAN) [23], several works perform interesting tasks [24][25][26][27]. Odena et al. [24] proposed an image synthesis model that conditionally generates samples for 1000 classes. Reed et al. [26] demonstrated image generation from text descriptions. Yan et al. [25] showed conditioned image generation from visual attributes using a CVAE. Walker et al. [27] proposed a model that generates possible future trajectories conditioned on the present image. Most recently, for person re-identification from images, Zheng et al. [28] proposed a GAN-based architecture containing two distinct encoders that produce an appearance-related latent vector and a structure-related latent vector, respectively. With these latent vectors, similar to our work, they manipulate the appearance and structure of input images to generate new pedestrian images. Additionally, Larsen et al. [29] proposed a method that combines VAE and GAN. In [30], a Conditional VAE-GAN for data augmentation and image inpainting is proposed. These methods show impressive results but also suffer from the aforementioned problems of GANs.

Face Swapping
Face swapping, or face replacement, is the task of transferring a face from a source image to a target image. Early works on face swapping are based on the 3D Morphable Model (3DMM) [31,32]. A drawback is that these methods only work properly when a large number of images of both the target subject and the source subject are available, because a 3D model must first be built. Furthermore, estimating 3D geometry under different lighting conditions using 3DMM is still difficult.
Following the success of deep learning, many deep learning-based methods have emerged. In [33], the authors formulated face swapping as a style transfer task in which facial attributes and identity are considered as a style. In [34], the authors proposed a method that works in more challenging conditions, using a convolutional neural network for blending. Region-Separative GAN (RS-GAN) [35] swaps faces in the latent space by disentangling the latent representations. FSNet [36] uses a latent space that separates identity and geometric components. Face Swapping GAN (FSGAN) [37] is subject-agnostic; in other words, it does not need person-specific training.
Some face swapping works have been proposed because of privacy concerns [10][11][12][13][38][39][40], and their techniques are similar to face de-identification. In terms of privacy preservation, however, they do not adequately protect the privacy of the person on the other side, because the technique just transforms one face into the target face. It is therefore important to obtain explicit consent from the owners of the target facial images before using them for users' facial de-identification. For this reason, these methods are more suitable for recreation or entertainment purposes.

Face De-Identification
Earlier works on face de-identification simply used blurring [1], downsampling, masking, or pixelation [41]. Although these methods are easy to apply and remove privacy-sensitive information successfully, they also delete other useful information. To solve this problem, the k-Same family of methods, motivated by k-Anonymity [42], has been proposed. The vanilla k-Same method [43] creates a new face by averaging the k closest faces in a gallery. It normally suffers from ghosting artifacts in the resulting images. The k-Same-Select method [44] aims at preserving facial attributes; to do so, it partitions a gallery into mutually exclusive subsets. The k-Same-M method [45] tries to avoid the undesirable artifacts caused by misalignment by using Active Appearance Models (AAM) [46] for alignment. In [47,48], the authors also used k-Same family methods, together with the AAM and facial attribute classifiers, to keep facial attributes. The problem with the k-Same family of methods is a lack of generalization. These methods need a large and varied set of faces in which each subject is represented only once. In addition, a method cannot include every kind of facial attribute classifier, and the AAM also has a generalization problem.
Emerging approaches use deep learning-based generative models [5][6][7][8][9]. These methods produce higher-quality images thanks to deep generative models. However, GAN-based methods [5][7][8][9] sometimes generate awkward facial images and cannot control the amount of de-identification. In addition, randomly generated facial images may end up looking similar to the original one. In [6], the authors proposed a method using a VAE with a GAN. This method can control de-identification by using a conditional vector for identity. However, because this control vector is a one-hot encoded vector, the range of de-identification is limited to the training set. In [8], the authors proposed a method that preserves facial pose by using a facial landmark detector to generate a new random face with the same pose. However, due to the aforementioned limitation of GAN-based methods, this method also cannot ensure that the generated random face is different from the input face.
To preserve attributes, many of those methods focus explicitly on one or two attributes and use additional classifier(s) to do so. In [5], the authors focused only on preserving actions and used an action detector. The method of [6] preserves facial expression and uses a facial expression estimator. The authors of [7] tried to preserve the structural similarity index of the image, i.e., luminance, contrast and structural differences. The method of [9] takes a somewhat different perspective. The authors viewed identity as a combination of facial attributes. They used 40 classifiers to predict facial attributes and selected the facial attributes to preserve based on their criterion for protecting privacy. Finally, a new face is generated based on the preserved attributes. Leaving aside the computational cost of running 40 classifiers, this perspective cannot meet our objective of preserving the useful information of an original face.

Proposed Method
The process of the proposed method differs between the training and testing phases. We first describe the proposed network architecture in detail, then describe the training and testing processes in the following subsections.

Network Architecture
As can be seen from Figure 2, the proposed network adopts the VAE architecture with skip connections [49]. Using the VAE architecture, the proposed method can learn an organized latent space, enabling the encoded feature vector to be split into two parts: an identity-related part and an attributes-related part. Furthermore, thanks to the skip connections, the network is able to generate new, high-quality faces that do not exist in the training images (i.e., it does not merely transform one face into another that also exists in the training dataset).
The encoder network contains four blocks using skip connections, as shown in Figure 3a. In this work, we use facial images of shape 64 × 64 × 3 as input to the encoder, and the output latent vector $z$ is 1024-dimensional. We treat the first 512 dimensions of $z$ as an identity-related vector $z_i$ and the remaining 512 dimensions as an attributes-related vector $z_a$. As depicted in Figure 3b, the decoder network also contains four blocks using skip connections. $z$ is input directly to the decoder network, and the shape of the output is the same as that of the input image, 64 × 64 × 3. The detailed architecture of the proposed network is described in Table 1.

Table 1. Architecture of the proposed network, where LReLU is leaky ReLU, BN is batch normalization, FC is a fully connected layer, '3x3 Conv' is a convolution with a 3 by 3 filter, and '3x3 AvgPool' is an average pooling with a 3 by 3 filter. Columns: Name, Operations.
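The split of the latent vector described above can be illustrated with a minimal sketch. It assumes only what the text states (a 1024-dimensional $z$ whose first 512 dimensions are $z_i$ and whose remaining 512 are $z_a$); the function names `split_latent` and `merge_latent` are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

LATENT_DIM = 1024    # size of z stated in the text
IDENTITY_DIM = 512   # the first half is treated as the identity-related part


def split_latent(z):
    """Split latent vectors of shape (..., 1024) into (z_i, z_a)."""
    z = np.asarray(z)
    if z.shape[-1] != LATENT_DIM:
        raise ValueError(f"expected last dimension {LATENT_DIM}, got {z.shape[-1]}")
    return z[..., :IDENTITY_DIM], z[..., IDENTITY_DIM:]


def merge_latent(z_i, z_a):
    """Concatenate identity and attribute parts back into a decoder input."""
    return np.concatenate([z_i, z_a], axis=-1)


# Random stand-in for a batch of encoder outputs (illustrative data only).
z = np.random.default_rng(0).normal(size=(4, LATENT_DIM))
z_i, z_a = split_latent(z)
z_back = merge_latent(z_i, z_a)
```

Because the split is a plain slice, replacing $z_i$ with another 512-dimensional vector before merging leaves $z_a$ untouched, which is the property the method relies on.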

Training Process
An overview of the training process is shown in Figure 2a. The encoder network $E$ maps a facial image $I$ to $z$, which consists of the identity-related vector $z_i$ and the attribute-related vector $z_a$. Then, the decoder network $D$ generates a reconstructed image $I^r$ from $z$. As is common in encoder-decoder architectures based on VAE, we use the binary cross entropy (BCE) loss to measure the reconstruction error as well as the Kullback-Leibler (KL) loss for regularization. Since pixel intensity values in an image follow a conditional probability distribution, we assume that, after being scaled to [0, 1], the values can be interpreted as probabilities of pixels being on or off. Therefore, the BCE loss can be adopted in our formulation. The BCE loss is minimized when the value of a pixel in the input image, $I_j(x, y)$, and the value of the corresponding pixel in the reconstructed image, $I^r_j(x, y)$, are the same. We define the BCE loss $L_r$ and the KL loss $L_{kl}$ as follows, where $N$ is the number of samples, and $W$ and $H$ are the width and height of the image, respectively.
where q is the encoder network.
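For reference, a standard form of the two losses that is consistent with the definitions above can be written as below. This is a sketch under stated assumptions: the averaging over $N$, the standard normal prior $p(z)$, and the notation $q(z \mid I_j)$ for the distribution produced by the encoder are conventional choices and may differ from the exact formulation used in this work.

```latex
L_r = -\frac{1}{N} \sum_{j=1}^{N} \sum_{x=1}^{W} \sum_{y=1}^{H}
      \Big[ I_j(x,y) \log I^r_j(x,y)
          + \big(1 - I_j(x,y)\big) \log\big(1 - I^r_j(x,y)\big) \Big]

L_{kl} = \frac{1}{N} \sum_{j=1}^{N}
         D_{KL}\big( q(z \mid I_j) \,\|\, p(z) \big),
\qquad p(z) = \mathcal{N}\big(\mathbf{0}, \operatorname{diag}(\mathbf{1})\big)
```

Each per-pixel term of $L_r$ is minimized when $I^r_j(x,y) = I_j(x,y)$, which matches the property of the BCE loss stated above.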
Our key idea in this work is to design the latent feature vector produced by the encoder so that it is disentangled into an identity-related part and a remaining facial attribute-related part, enabling effective identity modification by manipulating only the identity-related part of the latent vector. To train the network to produce such disentanglement, we present an embedding loss that uses an external facial embedding extractor $F$. Since we employ an $F$ that is pre-trained for face recognition and verification, we assume that it is well trained to provide plenty of distinctive features for facial identity. Therefore, by transforming a point in the identity space defined by $F$, we expect that the identities of given facial images can be changed with ease, producing new facial images. In this work, we use a Keras implementation [50] of FaceNet [51] as the face embedding extractor $F$. The network architecture is based on Inception-ResNet-v1 [52], and the model was trained on the VGGFace2 dataset [53] using a triplet loss.
To bring $z_i$ produced by the proposed network closer to $z_f$, the output of $F$, we design the embedding loss function using the cosine distance as follows:
$$L_e = \sum_{j=1}^{N} \left( 1 - \frac{z_{f,j} \cdot z_{i,j}}{\lVert z_{f,j} \rVert \, \lVert z_{i,j} \rVert} \right)$$
For a sample $j$, the cosine distance between $z_{f,j}$ and $z_{i,j}$ lies in [0, 2], and the loss is the sum of the distances over the $N$ samples in a batch.
Finally, our training loss is defined as the sum of the loss functions weighted by the control parameters $\lambda_r$, $\lambda_{kl}$ and $\lambda_e$:
$$L = \lambda_r L_r + \lambda_{kl} L_{kl} + \lambda_e L_e$$
Through the embedding loss $L_e$, the latent space related to facial identities is trained, while the non-identity latent space is also learned through the reconstruction loss $L_r$, because reconstructing the input image properly demands the remaining information.
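The three loss terms and their weighted sum can be sketched in NumPy as follows. This is a minimal illustration, not the authors' TensorFlow code: the diagonal Gaussian parameterization (`mu`, `log_var`), the averaging of the BCE and KL terms over the batch, the `EPS` guard, and the default weights of 1.0 are all assumptions made here for the example.

```python
import numpy as np

EPS = 1e-7  # numerical guard for log(); an implementation detail, not from the text


def bce_loss(images, recons):
    """Pixel-wise binary cross entropy, summed per image and averaged over the batch.

    Both arrays have shape (N, H, W, C) with values scaled to [0, 1].
    """
    r = np.clip(recons, EPS, 1.0 - EPS)
    per_pixel = -(images * np.log(r) + (1.0 - images) * np.log(1.0 - r))
    return per_pixel.reshape(len(images), -1).sum(axis=1).mean()


def kl_loss(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, 1), averaged over the batch."""
    per_sample = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return per_sample.mean()


def embedding_loss(z_f, z_i):
    """Sum over the batch of cosine distances 1 - cos(z_f, z_i); each term lies in [0, 2]."""
    cos = np.sum(z_f * z_i, axis=1) / (
        np.linalg.norm(z_f, axis=1) * np.linalg.norm(z_i, axis=1)
    )
    return np.sum(1.0 - cos)


def total_loss(l_r, l_kl, l_e, lam_r=1.0, lam_kl=1.0, lam_e=1.0):
    """Weighted sum of the three losses; the default weights are placeholders."""
    return lam_r * l_r + lam_kl * l_kl + lam_e * l_e
```

In this sketch `embedding_loss` returns 0 when every $z_i$ points in the same direction as its $z_f$ and $2N$ when every pair points in opposite directions, which matches the stated range of the cosine distance.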

Testing Process
An overview of the testing process is shown in Figure 2b. Testing does not require the face embedding extractor $F$; instead, it uses an identity changer $C$ that transforms $z_i$ into $z_m$, a new identity-related vector. For the transformation, we first L2-normalize the identity-related vector $z_i$. Then, we adopt the well-known Gram-Schmidt process to rotate $z_i$ by 90 degrees, obtaining $z_{90}$, where $z_r \in \mathbb{R}^{512}$ is a random vector that determines the rotational axis and $\mathrm{proj}_b(a)$ denotes the projection of the vector $a$ onto $b$. Finally, with $z_0$ (i.e., $z_i$) and $z_{90}$, we can obtain a modified identity-related vector $z_m$ for an arbitrary rotation of $m$ degrees. From the concatenation of $z_a$ and $z_m$, the decoder generates a de-identified facial image, as shown in Figure 2b. Therefore, by modifying $z_i$ we can change the identity of the input image. In particular, the proposed method can generate a new image of a person who does not exist, because the condition vector $z_m$ of the decoder is not a one-hot encoded vector but a face embedding that contains rich information about a face, as mentioned above. Figure 4 shows facial de-identification examples from the proposed method, in which $I_{\{0,60,120,180\}}$ denote the transformation results rotated on a hyper-plane by 0, 60, 120 and 180 degrees, respectively. As can be seen from the figure, the proposed network gradually modifies the identities of the given facial images as the rotation degree increases.
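The identity changer described above can be sketched as follows. The sketch assumes the conventional construction: a Gram-Schmidt step $z_{90} \propto z_r - \mathrm{proj}_{z_0}(z_r)$ followed by normalization, and a planar rotation $z_m = \cos(m)\, z_0 + \sin(m)\, z_{90}$. The normalization of $z_{90}$ and the exact form of the rotation are assumptions made for this example, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)


def proj(a, b):
    """Projection of vector a onto vector b."""
    return (np.dot(a, b) / np.dot(b, b)) * b


def rotate_identity(z_i, degrees, z_r=None, rng=None):
    """Rotate the identity vector by `degrees` in the plane spanned by z_i and z_r."""
    z_0 = l2_normalize(np.asarray(z_i, dtype=np.float64))
    if z_r is None:
        if rng is None:
            rng = np.random.default_rng()
        z_r = rng.normal(size=z_0.shape)  # random vector fixing the rotation plane
    # Gram-Schmidt step: remove the component of z_r along z_0, then normalize,
    # so that z_90 is a unit vector orthogonal to z_0.
    z_90 = l2_normalize(z_r - proj(z_r, z_0))
    theta = np.deg2rad(degrees)
    return np.cos(theta) * z_0 + np.sin(theta) * z_90
```

Under these assumptions a rotation of 0 degrees returns the normalized $z_i$, and a rotation of 180 degrees returns its negation regardless of the choice of $z_r$; for intermediate angles the result depends on the random vector.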

Experiments
To validate the effectiveness of the proposed method, we conduct two types of experiments. The first experiment is to show how well our method can remove identity from a facial image while the second one is to confirm how well it preserves facial attributes.

Experimental Setup
We implemented the proposed system using TensorFlow and conducted all of the experiments in this work on a workstation equipped with four Nvidia GeForce RTX 2080 Ti GPUs, an Intel(R) Core(TM) i9-9900X CPU (3.50 GHz), and 128 GB of RAM. For face part detection in given images, we employ the method of King et al. [54].
To train the proposed model, we use VGGFace2 [53], one of the major large-scale datasets for face recognition. The images in the dataset have significant variations in pose, age, illumination, ethnicity and profession, amounting to 3.31 million images of 9131 identities. Since the official test split by Cao et al. [53] (167,559 images from 500 identities) contains evenly sampled images from the whole dataset, we use this split as our training set in this work. We train our network with Adam [55] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$, a starting learning rate of 0.0001 with a time-based decay of $10^{-6}$, for 300 epochs with a batch size of 32, which takes about 30 h on our system.
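To give a sense of what the stated learning-rate schedule implies, the sketch below evaluates a time-based decay over the whole training run. It assumes the Keras-style rule in which the rate after $t$ parameter updates is the initial rate divided by $(1 + \text{decay} \cdot t)$ and the decay is applied per update step; whether the schedule was applied per step or per epoch is not stated in the text, so this is an assumption.

```python
import math


def time_based_lr(step, initial_lr=1e-4, decay=1e-6):
    """Learning rate after `step` parameter updates under time-based decay."""
    return initial_lr / (1.0 + decay * step)


NUM_IMAGES = 167_559  # training images stated in the text
BATCH_SIZE = 32
EPOCHS = 300

steps_per_epoch = math.ceil(NUM_IMAGES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
final_lr = time_based_lr(total_steps)

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"final lr:        {final_lr:.3e}")
```

Under this per-step assumption the learning rate falls to a little under 40% of its starting value by the end of training.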

Evaluation on De-Identification
To validate the facial de-identification performance of the proposed method, we adopt the state-of-the-art face verification method FaceNet [51], which provides a similarity distance given two facial images. Given two images of the same person, we measure how well the proposed method can separate them as the transformation degree $m$ varies. As the test set, we use a widely adopted public benchmark dataset, the Labeled Faces in the Wild Deep Funneled dataset (LFW) [56][57][58], which provides matched and mismatched facial image pairs of 1680 people. For each of the 3000 matched pairs in the dataset, we de-identify one image of the pair using the proposed method, then compute the similarity distance between the de-identified image and the other image in the pair. Finally, if the resulting distance is lower than a threshold ($\eta$), we count the sample as a matching pair. Example images for the evaluation of de-identification are shown in Figure 5.
Table 2 provides the result, in which we compute the matching rate for the 3000 pairs while varying the threshold and the transformation degree. For comparison, we also compute the matching rate for the original image pairs (i.e., no image in the pair is transformed) and for the reconstruction pairs (i.e., one of the images in the pair is transformed with 0 degrees). The matching rate on the original image pairs shows the performance of the face verification algorithm we used, while the results on the reconstruction pairs show the effect of the degradation caused by the decoder network. As we can see from the table, the FaceNet algorithm performs well on the original and reconstruction pairs. However, after the proposed method is applied, FaceNet cannot identify the same person. Notably, the $I_{180}$ transformation gives the best result even at high thresholds, as we expected.

Table 2. Quantitative de-identification results using 3000 matched facial image pairs from the LFW dataset [58]. From left to right, $\eta$: cosine similarity distance threshold; $I$: matching rate using the original pairs given $\eta$; $I_m$: matching rate in which one of the images in the testing pairs is rotated by $m$ degrees.
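The matching-rate computation described above can be sketched as follows. The sketch assumes that face embeddings have already been extracted by a verification model for both images of every pair and that the similarity distance is the cosine distance; the function names are hypothetical.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance between two (N, D) embedding arrays."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - np.sum(a * b, axis=1)


def matching_rate(emb_deid, emb_other, eta):
    """Fraction of pairs whose distance falls below the threshold eta.

    emb_deid:  embeddings of the de-identified image of each pair, shape (N, D)
    emb_other: embeddings of the untouched image of each pair, shape (N, D)
    """
    return float(np.mean(cosine_distance(emb_deid, emb_other) < eta))
```

A lower matching rate at a given $\eta$ means that the verifier recognizes fewer de-identified images as the same person as their untouched counterparts, which is the desired outcome for de-identification.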

Evaluation on Preserving Facial Attributes
To evaluate how well the proposed method preserves facial attributes during de-identification, we conduct both qualitative and quantitative experiments in this subsection.

Qualitative Analysis
Our goal in this work is to de-identify facial images while preserving facial attributes such as pose, color, gender and expression as much as possible. To confirm the performance qualitatively, we apply our method to three different datasets: VGGFace2, LFW, and the Japanese Female Facial Expression dataset (JAFFE) [59]. The results are shown in Figures 6-8. As we can see from the figures, the proposed method effectively preserves non-identity-related attributes while the identity changes.

Quantitative Analysis
To quantitatively analyze the facial attribute preservation performance of our work, we adopt a facial expression recognition algorithm, the Microsoft Azure Face API [60], and compute confusion matrices against the ground truth labels for four types of image sets: original, $I_0$, $I_{90}$ and $I_{180}$. For this experiment, we choose the Japanese Female Facial Expression (JAFFE) dataset [59], which contains 213 images of 7 facial expressions (i.e., angry, disgust, fear, happy, neutral, sad and surprise) posed by 10 Japanese female models. In this experiment, we use only the four facial expressions (i.e., happy, neutral, sad and surprise) for which the adopted facial expression algorithm shows high accuracy. Figure 9 provides the results. While the original images yield the best accuracy, the transformed results produced by the proposed method provide comparable accuracy, except in the case of 'sad'. We attribute this to the degradation of details caused by our method. Since the proposed network architecture is based on a VAE, it is bounded by the generative power of stochastic sampling methods, although the VAE enables capturing meaningful features from the input domain and thus the disentanglement of the latent feature space that we benefit from in this work.
We also provide F-measures, which are the harmonic means of the precision and recall calculated from the confusion matrices. The average F-measures for the original images, $I_0$, $I_{90}$ and $I_{180}$ are 0.448, 0.338, 0.333 and 0.324, respectively. Since the gaps among the reconstruction and de-identification results ($I_0$, $I_{90}$ and $I_{180}$) are minimal, we can confirm that the proposed method preserves facial expression while the identity is removed, but the degree of preservation is bounded by the generative power of the VAE-based network we adopt in this work.
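The F-measure computation described above can be sketched in plain Python. The sketch assumes a square confusion matrix whose rows are ground-truth classes and whose columns are predicted classes; the 2 × 2 matrix in the example is made-up illustrative data, not a result from this work.

```python
def f_measures(confusion):
    """Per-class F-measure from a square confusion matrix.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return scores


def mean_f_measure(confusion):
    """Unweighted average of the per-class F-measures."""
    scores = f_measures(confusion)
    return sum(scores) / len(scores)


# Hypothetical two-class confusion matrix, for illustration only.
example = [[8, 2],
           [4, 6]]
per_class = f_measures(example)
average = mean_f_measure(example)
```

Averaging the per-class scores without weighting treats every expression equally, which is one reasonable reading of "averages of F-measures"; a weighted average would be an alternative.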

Qualitative Analysis on Videos
Finally, we present experimental results of the proposed method on videos (Supplementary Video S1). In this experiment, we use the Multimedia Understanding Group (MUG) facial expression dataset [61], which consists of image sequences of 86 subjects performing various facial expressions. The proposed method is applied to those image sequences frame by frame to see whether the modified identities are retained throughout a sequence, which is preferable for various applications. As we can see in Figure 10, with the same $m$, the control parameter for the de-identification, the modified identities tend to be retained throughout the sequence, as we intended in this work. However, as can also be seen from the results, there are some limitations. The results show discontinuities at the boundaries of the facial parts processed by the proposed method and lose small details of the input faces, such as moles.

Conclusions
In this work, we proposed a novel facial de-identification method for privacy preservation. Our method aims not only to remove identity-related information from input facial images but also to preserve the remaining facial attributes that are useful for further applications. The proposed method disentangles an identity-related vector and a facial attributes-related vector from a facial image, and then efficiently transforms the identity-related vector to change the identity of the input image to a completely new identity that has not been seen in training. Through various experiments, we have shown that the proposed method can effectively change the identity of input facial images while preserving the remaining attributes, as designed. However, we have also seen that the output of the proposed method suffers from degradation when compared to real images and from discontinuity at facial boundaries. Therefore, we will extend our method with an adversarial architecture that retains the manipulable latent space, in order to overcome the degraded quality and the discontinuity at facial boundaries of the resulting de-identified images.

Conflicts of Interest:
The authors declare no conflict of interest.