Spectral Normalization for Domain Adaptation

Transfer learning is used to extend an existing model to more difficult scenarios, thereby accelerating the training process and improving learning performance. The conditional adversarial domain adaptation method proposed in 2018 is a particular type of transfer learning. It uses a domain discriminator to identify which domain the extracted features belong to; the features are obtained from a feature extraction network. The stability of the domain discriminator directly affects the classification accuracy. Here, we propose a new algorithm to improve the predictive accuracy. First, we introduce the Lipschitz constraint condition into domain adaptation; if this constraint is satisfied, the method will be stable. Second, we analyze how to make the gradient satisfy the condition, thereby deriving a modified gradient via the spectral normalization method. The modified gradient is then used to update the parameter matrix. The proposed method is compared to the ResNet-50, deep adaptation network, domain adversarial neural network, joint adaptation network, and conditional domain adversarial network methods on the Office-31, ImageCLEF-DA, and Office-Home datasets. The simulations demonstrate that the proposed method achieves better accuracy than the other methods.


Introduction
Deep learning is a subset of machine learning, which is itself a subfield of artificial intelligence. Deep learning has been widely applied, for example, in natural language processing, medical image analysis, remote-sensing image analysis, and synthetic aperture radar (SAR) target recognition [1][2][3][4][5]. If the training dataset is insufficient, recognition accuracy suffers. In some applications, it is difficult to collect enough training data, and manual annotation is also considerably time-consuming [6]. To solve these problems, transfer learning is used in deep learning. Transfer learning searches for annotated data that are similar to the target data and transfers what has been learned to them, thereby effectively increasing the size of the target dataset and improving the performance of deep learning. Transfer learning can be used to transfer labels or data structures from the Source domain (the training dataset of other objects) to the Target domain (the training dataset of the current object) in order to improve the learning effect [7].
Transfer learning can be divided into two categories: heterogeneous transfer learning and homogeneous transfer learning. In heterogeneous transfer learning, the Source and Target datasets differ in both their feature and label spaces. In homogeneous transfer learning, the Source and Target datasets share the same feature and label spaces. Domain adaptation is one form of homogeneous transfer learning. It is used to search for the features shared by the Source and Target domains in a high-dimensional space [8]. Domain adaptation can be divided into the distribution adaptation, subspace learning, and feature representation transfer methods. The distribution adaptation method is a commonly used form of domain adaptation; it reduces the discrepancy between the distributions of the Source and Target domains. There are further domain adaptation methods that can be used when searching for domain-invariant features, i.e., features that are shared by the Source and Target domains. The Wasserstein distance guided representation learning (WDGRL) method [23] uses the Wasserstein distance [24] to measure the distance between the Source and Target domains. The cycle-consistent adversarial domain adaptation (CyCADA) method [25] constructs a network to search for optimal shared features between the Source and Target domains. This network is used to generate a Target domain from the Source domain, and to generate a Source domain from the generated Target domain; because the whole domain generation process is cyclic, it is called a cycle-consistent network. In unsupervised image-to-image translation (UNIT) [26], a network framework is designed to learn the joint distribution of different domains from their marginal distributions. The selective adversarial networks (SAN) method [27] proposes partial transfer learning to solve the problem of a Source domain that has far more data categories than the Target domain.
In Section 1, we introduced transfer learning and some related algorithms concerning domain adaptation. The conditional adversarial domain adaptation (CDAN) method is introduced in detail in Section 2. The proposed method is then presented in Section 3. In Section 4, the simulation is presented and discussed.

Conditional Adversarial Domain Adaptation
Although domain adaptation can help when a dataset lacks labels, some problems remain for domain adaptation methods that use an adversarial network. First, previous algorithms only aligned the features of the data without also considering the labels. Second, if the modal structure of the data features is highly complex, the current adversarial domain adaptation methods cannot capture the multimodal structures, which is a cause of negative transfer. Third, the minimax optimization method in the conditional domain discriminator treats different examples as equally important, which may allow hard-to-transfer examples to adversely affect domain adaptation.
In order to solve these problems, the conditional adversarial domain adaptation (CDAN) method was proposed [22]. The framework of CDAN is outlined in Figure 1. In adversarial domain adaptation, the classifier prediction carries discriminative information that reveals the multimodal structure of the data. CDAN uses this information to align the features and capture the multimodal information during network training. Based on this, the CDAN method combines features and labels in domain adaptation to obtain the loss function, which can be expressed as:

$$\min_{G} \; \varepsilon(G) - \lambda \, \varepsilon(D, G) \tag{1}$$

$$\min_{D} \; \varepsilon(D, G) \tag{2}$$

where λ is a hyperparameter that trades off the source risk and the domain adversary between the two objectives; ε(G) is defined on the source classifier G, and ε(D, G) on the source classifier G and the domain discriminator D. They are respectively expressed as:

$$\varepsilon(G) = \mathbb{E}_{(x_i^s, y_i^s) \sim \mathcal{D}_s} \, L\big(G(x_i^s), y_i^s\big) \tag{3}$$

$$\varepsilon(D, G) = -\mathbb{E}_{x_i^s \sim \mathcal{D}_s} \log\big[D(f_i^s, g_i^s)\big] - \mathbb{E}_{x_j^t \sim \mathcal{D}_t} \log\big[1 - D(f_j^t, g_j^t)\big] \tag{4}$$

where f and g are the feature and label (classifier prediction) of an example, L(·) is the cross-entropy loss function, and s and t denote the Source and Target domains, respectively. The CDAN method uses a multilinear map to combine f and g into h = (f, g), which represents the relationship between features and labels. $T_{\otimes}(f, g) = f \otimes g$ denotes the multilinear map, where ⊗ is the tensor (outer) product. However, when the dimensions of the feature and label are large, the dimension of f ⊗ g is $d_f \times d_g$, which leads to a dimension explosion. The authors addressed this issue by randomly projecting f and g to construct a randomized multilinear map:

$$T_{\odot}(f, g) = \frac{1}{\sqrt{d}} \, (R_f f) \odot (R_g g) \tag{5}$$

where ⊙ is the element-wise product, R_f and R_g are random matrices that are sampled only once and kept fixed during training, and d is the output dimension. CDAN shows that T_⊙ preserves the inner products of T_⊗ in expectation, so the randomized map can replace the full multilinear map in a high-dimensional space. Thus, when the dimensional product of f and g is greater than 4096, the randomized strategy is used; otherwise, the normal multilinear map is used:

$$T(h) = \begin{cases} T_{\otimes}(f, g), & d_f \times d_g \le 4096 \\ T_{\odot}(f, g), & \text{otherwise} \end{cases} \tag{6}$$

Furthermore, the CDAN method uses the entropy $H(g) = -\sum_{c=1}^{C} g_c \log g_c$ (where C is the total number of classes and g_c is the predicted probability of class c) to quantify the uncertainty of the classifier prediction. The certainty of the prediction is expressed as:

$$w\big(H(g)\big) = 1 + e^{-H(g)} \tag{7}$$

and then the loss function in (1) and (2) can be expressed as:

$$\min_{G} \; \mathbb{E}_{(x_i^s, y_i^s) \sim \mathcal{D}_s} L\big(G(x_i^s), y_i^s\big) + \lambda \Big( \mathbb{E}_{x_i^s \sim \mathcal{D}_s} w\big(H(g_i^s)\big) \log\big[D(T(h_i^s))\big] + \mathbb{E}_{x_j^t \sim \mathcal{D}_t} w\big(H(g_j^t)\big) \log\big[1 - D(T(h_j^t))\big] \Big) \tag{8}$$

$$\max_{D} \; \mathbb{E}_{x_i^s \sim \mathcal{D}_s} w\big(H(g_i^s)\big) \log\big[D(T(h_i^s))\big] + \mathbb{E}_{x_j^t \sim \mathcal{D}_t} w\big(H(g_j^t)\big) \log\big[1 - D(T(h_j^t))\big] \tag{9}$$

Following the concepts introduced above, the network structure of CDAN is presented in Figure 1. Figure 1a illustrates the network structure in a low-dimensional scenario, while Figure 1b illustrates the network structure in a high-dimensional scenario.
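To make the conditioning strategy in (5)-(7) concrete, the following is a minimal PyTorch sketch of the randomized multilinear map and the entropy-based example weight. It is an illustration under our own assumptions (module and function names such as RandomizedMultilinearMap and entropy_weight, and the output dimension of 1024, are ours), not code released by the CDAN authors.

```python
import torch


class RandomizedMultilinearMap(torch.nn.Module):
    """T_odot(f, g) = (1/sqrt(d)) (R_f f) * (R_g g), cf. Eq. (5).
    R_f and R_g are sampled once and kept fixed during training."""

    def __init__(self, feature_dim, num_classes, output_dim=1024):
        super().__init__()
        self.register_buffer("Rf", torch.randn(feature_dim, output_dim))
        self.register_buffer("Rg", torch.randn(num_classes, output_dim))
        self.output_dim = output_dim

    def forward(self, f, g):
        return (f @ self.Rf) * (g @ self.Rg) / self.output_dim ** 0.5


def conditioning(f, g, randomized_map, max_dim=4096):
    """Use the plain multilinear map f (x) g when d_f * d_g <= 4096,
    otherwise the randomized map, cf. Eq. (6)."""
    if f.size(1) * g.size(1) <= max_dim:
        # outer product per example, flattened to shape (batch, d_f * d_g)
        return torch.bmm(g.unsqueeze(2), f.unsqueeze(1)).flatten(1)
    return randomized_map(f, g)


def entropy_weight(g):
    """w(H(g)) = 1 + exp(-H(g)), cf. Eq. (7); g holds class probabilities."""
    H = -(g * torch.log(g + 1e-8)).sum(dim=1)
    return 1.0 + torch.exp(-H)
```

In the loss of (8) and (9), the discriminator D is applied to the conditioned feature T(h), and each example's domain loss is multiplied by entropy_weight(g).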

Proposed Method
The conditional domain adversarial network (CDAN) adds an adversarial network to domain adaptation in order to search for domain-invariant features. This adversarial network is composed of a label classifier, a feature extractor, and a domain discriminator. The better the domain discriminator performs, the more severely the gradient of the feature extractor vanishes. When the domain discriminator is trained too well, the gradient of the feature extractor approaches zero; when the domain discriminator is trained too poorly, the gradient of the feature extractor does not decrease. Only when the domain discriminator is trained neither too well nor too poorly does the feature extractor perform well.
To overcome this problem, we introduce the 1-Lipschitz constraint condition into the adversarial network of the CDAN method. The Lipschitz constraint condition is:

$$\frac{\| f(x + \delta) - f(x) \|_2}{\| \delta \|_2} \le 1 \tag{10}$$

where f(·) is a function, x is a variable, and δ is a small perturbation. In the proposed method, the output of the nth layer can be expressed as:

$$X_n = f(W_n X_{n-1} + b_n) \tag{11}$$

where X_n is the output of the nth layer, f(·) is an activation function that corresponds to the function f(·) in (10), X_{n−1} is the output of the (n−1)th layer, and W_n and b_n are the parameter matrix and bias of the nth layer of the network, respectively. The activation function is a ReLU (Rectified Linear Unit) function, so the bias term can be ignored. Therefore, the output of the nth layer in (11) can be simplified as follows:

$$X_n = D_n W_n X_{n-1} \tag{12}$$

where D_n is the diagonal matrix induced by using the ReLU function as the activation function in (11). In this way, the relationship between the output of the nth layer and the input can be expressed as:

$$X_n = D_n W_n D_{n-1} W_{n-1} \cdots D_1 W_1 X_0 \tag{13}$$

and the norm of the gradient of X_n can be expressed as:

$$\| \nabla X_n \|_2 = \frac{\| D_n W_n D_{n-1} W_{n-1} \cdots D_1 W_1 \, \delta \|_2}{\| \delta \|_2} \tag{14}$$

where ‖∇X_n‖_2 is the norm of the gradient of X_n, and δ is a small perturbation around X_0. According to the computation of the maximum singular value of a matrix:

$$\sigma(M) = \max_{\delta \ne 0} \frac{\| M \delta \|_2}{\| \delta \|_2} = \| M \|_2 \tag{15}$$

where M is a matrix, we can obtain the maximum value of ‖∇X_n‖_2:

$$\max \| \nabla X_n \|_2 = \| D_n W_n D_{n-1} W_{n-1} \cdots D_1 W_1 \|_2 \tag{16}$$

where ‖D_n W_n ⋯ D_1 W_1‖_2 is the spectral norm of D_n W_n D_{n−1} W_{n−1} ⋯ D_2 W_2 D_1 W_1. According to the submultiplicativity of the spectral norm:

$$\| D W \|_2 \le \| D \|_2 \, \| W \|_2 \tag{17}$$

$$\| D_n W_n \cdots D_1 W_1 \|_2 \le \| D_n \|_2 \| W_n \|_2 \cdots \| D_1 \|_2 \| W_1 \|_2 \tag{18}$$

where ‖D‖_2 is the spectral norm of the diagonal matrix D, and ‖W‖_2 is the spectral norm of the parameter matrix W. Based on (16) and (18), we can deduce that:

$$\max \| \nabla X_n \|_2 \le \| W_n \|_2 \| W_{n-1} \|_2 \cdots \| W_1 \|_2 \tag{19}$$

When the activation function used in our proposed method is the ReLU function, each diagonal element of D is 1 when the corresponding input is greater than 0 and 0 when it is less than 0; therefore, the spectral norm of the diagonal matrix D is 1 [28]. When the activation function is a sigmoid, the value of each diagonal element is between 0 and 1, so the spectral norm of the corresponding diagonal matrix is still no larger than that of the ReLU case. Based on the above analysis, the expression in (19) holds when we use the ReLU function or the sigmoid function as the activation function. Thus, we can conclude that:

$$\| D_n \|_2 \| W_n \|_2 \cdots \| D_1 \|_2 \| W_1 \|_2 \le \| W_n \|_2 \| W_{n-1} \|_2 \cdots \| W_1 \|_2 \tag{20}$$

Based on (14), (19) and (20), we can deduce that:

$$\| \nabla X_n \|_2 \le \| W_n \|_2 \| W_{n-1} \|_2 \cdots \| W_1 \|_2 \tag{21}$$

Finally, in order to make the gradient satisfy the Lipschitz constraint in (10), we use the spectral norms of the parameter matrices to correct the gradient:

$$\| \nabla X_n^* \|_2 = \frac{\| \nabla X_n \|_2}{\| W_n \|_2 \| W_{n-1} \|_2 \cdots \| W_1 \|_2} \le 1 \tag{22}$$

where ‖∇X_n^*‖_2 is the normalized gradient norm of ‖∇X_n‖_2. Based on (22), we can see that the corrected gradient satisfies the 1-Lipschitz constraint. Therefore, if we use this corrected gradient as the new gradient, the domain discriminator will be stable. The parameter matrix is then updated by:

$$W_n = W_{n-1} - \mu \, \nabla X_n^* \tag{23}$$

where μ is the learning rate, W_n is the parameter matrix of the nth layer, and W_{n−1} is the parameter matrix of the (n−1)th layer. Thus, the 1-Lipschitz constraint can be satisfied by dividing the parameter matrix of each layer of the network by the spectral norm of that layer's parameter matrix. The above describes the spectral normalization procedure.
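As a rough illustration of how the spectral norm in (15) and the per-layer normalization implied by (22) can be computed in practice, the following PyTorch sketch estimates ‖W‖_2 by power iteration and divides each weight matrix by it. The helper names and the number of power iterations are our assumptions; PyTorch also ships a built-in torch.nn.utils.spectral_norm wrapper that performs this normalization automatically during the forward pass.

```python
import torch
import torch.nn.functional as F


def spectral_norm_estimate(W, n_iters=5):
    """Estimate the largest singular value ||W||_2 of a 2-D parameter
    matrix W by power iteration, cf. Eq. (15)."""
    u = torch.randn(W.size(0), device=W.device)
    v = torch.randn(W.size(1), device=W.device)
    for _ in range(n_iters):
        v = F.normalize(W.t() @ u, dim=0)
        u = F.normalize(W @ v, dim=0)
    return torch.dot(u, W @ v)


@torch.no_grad()
def normalize_layers(model):
    """Divide each linear layer's weight matrix by its spectral norm so
    that every layer is approximately 1-Lipschitz, cf. Eq. (22)."""
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            module.weight.div_(spectral_norm_estimate(module.weight) + 1e-12)
```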
The specific network architecture of the proposed method is presented in Figure 2, where the images of the source and target domains are taken from the A and W domains in the Office-31 dataset. The basic network in this article is ResNet-50. After the data are processed through the basic network, the features and labels f_s, g_s of the source domain and f_t, g_t of the target domain are obtained. These are then fed into the adversarial network. In the first layer of the adversarial network, the features and labels first pass through Spectrum layer 1, then through the activation function layer ReLU 1, and finally through Dropout 1. The processed data are then put into the second part: they first pass through Spectrum layer 2, are then input to ReLU 2, and are finally treated with Dropout 2. Then, the data are placed into the third part, where they are put into Spectrum layer 3 before finally passing through the sigmoid layer. The final output data are the labels of the source and target domains.
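A possible PyTorch sketch of the adversarial branch in Figure 2 is given below, using the built-in torch.nn.utils.spectral_norm wrapper in the role of the Spectrum layers. The hidden width of 1024, the dropout probability of 0.5, and the function name are our assumptions and are not taken from the paper.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm


def make_domain_discriminator(input_dim, hidden_dim=1024, p_drop=0.5):
    """Spectrum 1 -> ReLU 1 -> Dropout 1 -> Spectrum 2 -> ReLU 2 ->
    Dropout 2 -> Spectrum 3 -> Sigmoid, mirroring Figure 2."""
    return nn.Sequential(
        spectral_norm(nn.Linear(input_dim, hidden_dim)),   # Spectrum layer 1
        nn.ReLU(inplace=True),                             # ReLU 1
        nn.Dropout(p_drop),                                 # Dropout 1
        spectral_norm(nn.Linear(hidden_dim, hidden_dim)),   # Spectrum layer 2
        nn.ReLU(inplace=True),                             # ReLU 2
        nn.Dropout(p_drop),                                 # Dropout 2
        spectral_norm(nn.Linear(hidden_dim, 1)),             # Spectrum layer 3
        nn.Sigmoid(),                                       # domain probability
    )
```

The discriminator input dimension would be the dimension of the conditioned feature T(h) produced in Section 2.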

Simulation and Discussion
We used three datasets to compare our proposed method with the other transfer learning methods ResNet-50 [1], DAN [14], DANN [19], JAN [15], and CDAN [22]. The three datasets were Office-31 [29], ImageCLEF-DA, and Office-Home [2]. The Office-31 dataset is widely used in transfer learning. This dataset contains 4652 images in 31 classes. These images are collected from three domains: Amazon (A), Webcam (W), and DSLR (D). All transfer learning methods were tested in six transfer tasks: A→W, D→W, W→D, A→D, D→A, and W→A. ImageCLEF-DA has twelve classes and comprises three domains: Caltech-256 (C), ILSVRC 2012 (I), and Pascal VOC 2012 (P). On this dataset, our proposed method was compared with the existing transfer learning methods in six transfer learning tasks: I→P, P→I, I→C, C→I, C→P, and P→C. Office-Home, which was first presented at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2017, is more complex than the Office-31 dataset. This dataset consists of 15,500 images, 65 classes, and 4 domains: artistic images (Ar), clip art (Cl), product images (Pr), and real-world images (Rw). The four domains are significantly different. We tested our proposed method against the others in the following 12 transfer learning tasks: Ar→Cl, Ar→Pr, Ar→Rw, Cl→Ar, Cl→Pr, Cl→Rw, Pr→Ar, Pr→Cl, Pr→Rw, Rw→Ar, Rw→Cl, and Rw→Pr. The basic network used in the experiments was ResNet-50, and the deep learning framework was PyTorch [30]. PyTorch is a popular deep learning framework used by Facebook, Twitter, and others. The version of PyTorch we used was PyTorch 1.1.
In all experiments, we used the same protocol as the conditional domain adversarial network, and the classification accuracies were obtained by averaging the results of five randomized experiments. The transfer loss and classifier loss had the same proportion for all methods. We used importance-weighted cross-validation (IWCV) [31] to select all hyperparameters. IWCV is a type of cross-validation that multiplies the loss function by the ratio of the Source-domain marginal distribution to the Target-domain marginal distribution and uses the resulting loss as the basis for selecting all hyperparameters. It is highly robust to covariate shift, i.e., to the data distributions of the Source and Target domains being different, and the conditional adversarial domain adaptation method that we compare against also uses IWCV to select all hyperparameters. Therefore, we also used IWCV as the cross-validation method. Other methods have likewise used IWCV for cross-validation [22,32,33]. The main difference between IWCV and the cross-validation (CV) used in references [18,19,27] is the marginal distribution ratio: CV directly uses the loss function, while IWCV uses the importance-weighted loss function. We used IWCV to select the hyperparameters in task A→W. We first divided the training set into ten parts. Second, in each cross-validation round, we selected one part as the validation set and used the rest as the training set; the number of folds was thus ten. Finally, if the average accuracy was higher than the threshold, the hyperparameters were retained and applied to all datasets. The momentum, batch size, and weight decay were 0.9, 36, and 0.0005, respectively, for the mini-batch SGD (Stochastic Gradient Descent) used in both our proposed method and the CDAN method. The learning rates of the proposed method and the CDAN method were the same (0.001). The operating system we used was Ubuntu 18.04, the CPU was an Intel Xeon E5-2678 v3, and the GPU was an NVIDIA GeForce GTX 1080 Ti.

Table 1 shows the results obtained from our proposed method and the other methods, including the ResNet-50, DAN, DANN, JAN, and CDAN methods, based on the Office-31 dataset. As seen in Table 1, the proposed method had the highest accuracy for all tasks except W→D. The accuracies of the proposed method and the CDAN method were the same for W→D, at 100%. We can also see that, compared to the other methods, the accuracy improvement of our proposed method differed across tasks. Compared with the CDAN method, our method improved the accuracy of task A→D by 4.9%, task W→A by 3.7%, task D→A by 2.5%, task A→W by 2.2%, and task D→W by 0.7%. For our proposed method, the improvement in accuracy was largest for task A→D and smallest for task W→D when compared to the CDAN method. Compared to the CDAN method, the average accuracy of the proposed method increased by 2.3%. Based on the above analysis, the results show that the proposed method had better accuracy than the others on the Office-31 dataset. Table 2 shows the results obtained using our proposed method and the other methods (ResNet-50, DAN, DANN, JAN, and CDAN) on the ImageCLEF-DA dataset. As seen in Table 2, the proposed method also had the highest accuracy for all tasks, as well as the highest average accuracy, compared to the other methods. The accuracy improvement of our proposed method was again different for different tasks.
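For reference, below is a minimal sketch of the optimizer configuration described above (mini-batch SGD with momentum 0.9, weight decay 0.0005, learning rate 0.001, batch size 36). The placeholder model is ours and merely stands in for ResNet-50 plus the adversarial branch of Section 3.

```python
import torch

# Placeholder network; in the experiments the backbone is ResNet-50
# together with the adversarial branch described in Section 3.
model = torch.nn.Linear(2048, 31)

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.001,            # learning rate shared by the proposed method and CDAN
    momentum=0.9,        # momentum reported in the experiments
    weight_decay=0.0005, # weight decay reported in the experiments
)
batch_size = 36          # mini-batch size reported in the experiments
```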
Compared to the CDAN method, the accuracy improvement of our proposed method was highest for C→P (2.1%) and lowest for I→C (0.5%). The average accuracy of the proposed method exceeded that of the CDAN method by 1.3%. Based on the above analysis, the proposed method presented the highest accuracy for all tasks and the highest average accuracy when using the ImageCLEF-DA dataset. Table 3 presents the results obtained using our proposed method and the other methods (ResNet-50, DAN, DANN, JAN, and CDAN) on the Office-Home dataset. The proposed method also had the highest accuracy for all tasks and the highest average accuracy when compared to the other methods. Compared to the CDAN method, the accuracy improvement of our proposed method was largest for Cl→Pr (5.1%) and lowest for Ar→Rw (1.8%), with an improved average accuracy of 3.3%. Based on the above analysis, the proposed method had the highest accuracy for all tasks and the highest average accuracy when using the Office-Home dataset.

Comparison of Accuracy
Based on the analysis of Tables 1-3, the improvement in average accuracy of our proposed method over the CDAN method was largest for the Office-Home dataset, followed by Office-31 and ImageCLEF-DA. The accuracies of all methods were lower for the Office-Home dataset than for the other datasets, so there was greater room for improvement; accordingly, the accuracy improvement for the Office-Home dataset was larger than for the other datasets, both relative to CDAN and in comparison with the other methods. Based on the above analysis, the proposed method had better accuracy for all tasks and the highest average accuracy on all three datasets.

Comparison of the Convergence Speed
In this section, our proposed method is compared to the ResNet-50, DANN, and CDAN methods to test its convergence performance in the test task A→W of the Office-31 dataset. The results of the comparison are shown in Figure 3. Figure 3 indicates that the proposed method had the lowest error, followed by the CDAN, DANN, and ResNet-50 methods. From the perspective of convergence, the DANN and ResNet-50 methods did not converge after 60,000 iterations. The CDAN method converged after 41,000 iterations, and the proposed method converged after 6999 iterations. When the methods converged, the number of iterations by the proposed method was only about 741. Based on the above analysis of Figure 3, the proposed method had faster convergence and a lower error than the other methods.

Comparison of Distribution Discrepancy
In this section, the A-distance [34], which is used by many researchers, is also used to assess the distribution discrepancy performance of the different methods. The A-distance reflects the distribution discrepancy between two datasets. To compute it, a binary classifier is trained to distinguish whether data come from the Source or the Target domain. The A-distance can be expressed as:

$$d_{\mathcal{A}} = 2(1 - 2\varepsilon) \tag{24}$$

where ε is the test error of this classifier. The smaller the A-distance, the closer the Source and Target domains are after being processed by the network of the method under test. We separately tested the distribution discrepancy performance of the different methods on the A→W and W→D tasks of the Office-31 dataset, the I→P and P→I tasks of the ImageCLEF-DA dataset, and the Ar→Cl and Cl→Ar tasks of the Office-Home dataset. The results are presented in Figures 4-6, respectively. In Figure 4, for task A→W, the A-distances of ResNet-50, DANN, CDAN, and the proposed method were 1.8, 1.44, 1.22, and 0.88, respectively; the proposed method presented the smallest A-distance. For task W→D, the A-distances of ResNet-50, DANN, CDAN, and the proposed method were 1.3, 1.2, 0.66, and 0.44, respectively; the proposed method again had the smallest A-distance. In Figures 5 and 6, we can also see that the proposed method retained the smallest A-distances for the different tasks. These results imply that the proposed method has the best performance in extracting features.
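As a small worked example of the A-distance in (24), the helper below (the function name is ours) converts a binary domain classifier's test error into d_A; for instance, a test error of 0.28 corresponds to the value 0.88 reported for the proposed method on A→W.

```python
def a_distance(test_error: float) -> float:
    """A-distance d_A = 2 * (1 - 2 * epsilon), where epsilon is the test
    error of a classifier trained to separate Source from Target data."""
    return 2.0 * (1.0 - 2.0 * test_error)


print(a_distance(0.28))  # 0.88, matching the proposed method on A->W
print(a_distance(0.05))  # 1.80, matching ResNet-50 on A->W
```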

Conclusions
This paper presents the use of spectral normalization in an adversarial-network-based domain adaptation method in order to ensure that the gradient satisfies the Lipschitz constraint condition. The results show that this makes the training of domain adaptation more stable. The proposed method achieves higher accuracy than the other methods on three different datasets, namely Office-31, ImageCLEF-DA, and Office-Home. Further, the proposed method exhibits both a faster convergence speed and a smaller distribution discrepancy than the other methods.
Author Contributions: Conceptualization, formal analysis, investigation, and writing the original draft was done by L.Z. and Y.L. Experimental tests were done by Y.L. All authors have read and approved the final manuscript.