Deep Quadruplet Network for Hyperspectral Image Classification with a Small Number of Samples

Abstract: This study proposes a deep quadruplet network (DQN) for hyperspectral image classification given the limitation of having a small number of samples. A quadruplet network is designed, which makes use of a new quadruplet loss function in order to learn a feature space where the distances between samples from the same class are shortened, while those between samples from different classes are enlarged. A deep 3-D convolutional neural network (CNN) with characteristics of both dense convolution and dilated convolution is then employed and embedded in the quadruplet network to extract spatial-spectral features. Finally, the nearest neighbor (NN) classifier is used to accomplish the classification in the learned feature space. The results show that the proposed network can learn a feature space and undertake hyperspectral image classification using only a limited number of samples. The main highlights of the study include: (1) the proposed approach was found to have high overall accuracy and can be regarded as state-of-the-art; (2) the ablation study suggests that all the modules of the proposed approach are effective in improving accuracy and that the proposed quadruplet loss contributes the most; (3) time analysis shows the proposed methodology has a level of time consumption similar to that of existing methods.


Introduction
A hyperspectral image covers hundreds of bands with high spectral resolution and provides a detailed spectral curve for each pixel [1,2]. Both the spatial and the spectral information are gathered in a hyperspectral image. Hyperspectral image classification is aimed at identifying the specific class (i.e., label) for each pixel (for example, cropland, lake, river, grassland, forest, mineral rocks, buildings, and roads). As the first step in many hyperspectral remote sensing applications, image classification is vital in the fields of agricultural statistics, disaster reduction, mineral exploration, and environmental monitoring.
However, existing methods still require substantial improvements in hyperspectral image classification, especially under the condition of small samples. For supervised classification of remotely sensed images, the training samples are usually acquired by two methods: (1) from field surveys and (2) directly from images with higher resolution. In particular, higher classification accuracy is usually acquired from training samples collected by field surveys. However, compared with laboratory work, field survey is costly, complicated, and time-consuming, which can significantly restrict the number of training samples. A small dataset of training samples can substantially diminish accuracy in hyperspectral image classification. Moreover, hyperspectral images suffer more from data redundancy in the spectral dimension compared with multi-spectral images, which creates additional difficulties for classification.
Few-shot learning addresses problems in which only a limited number of samples are available and has been used for various applications such as image segmentation, image captioning, object recognition, and face identification [22-25]. Given the limited accuracy due to having only a few labeled samples per class, few-shot learning usually trains the model on a well-labeled dataset, and the model is then generalized to new classes [26]. A metric learning strategy is usually adopted to learn the features of the object and distinguish classes based on the absolute distance between samples [27]. In recent years, several few-shot learning methods have been proposed for hyperspectral image classification, e.g., DFSL (deep few-shot learning) [28,29]. However, absolute distance ignores the relationship between inter-class and intra-class distances and limits classification accuracy. The use of relative distance, based on widening the inter-class distance and shortening the intra-class distance, has been proposed in lieu of absolute distance [30]. Proposing new methods that account for the relative relationship between inter-class and intra-class distances is therefore crucial in improving the accuracy of hyperspectral image classification with a limited number of samples.
This study proposes a deep quadruplet network (DQN) for hyperspectral image classification with a small number of samples. To improve the accuracy, we designed a quadruplet network with a new quadruplet loss function, together with a deep 3-D CNN with two branches consisting of dense convolution and dilated convolution.

Training Data
The training data used in this study are four well-known public hyperspectral datasets: "Houston", "Chikusei", "KSC", and "Botswana" [28]. The details of the four hyperspectral datasets used in training are presented in Table 1. The testing data used in this study are three widely-known public hyperspectral datasets: "Salinas", "Indian Pines" (IP), and "University of Pavia" (UP). The details of the three hyperspectral datasets used in testing are summarized in Table 2 [28]. The ground-truth maps of the three hyperspectral datasets are shown in Figures 1-3. The training datasets "Houston" and "Chikusei" can be acquired from the following websites, respectively: "http://hyperspectral.ee.uh.edu/2egf4tg8hial13gt/2013_DFTC.zip" and "http://park.itc.u-tokyo.ac.jp/sal/hyperdata/Hyperspec_Chikusei_MATLAB.zip". The training datasets "KSC" and "Botswana" and all the testing datasets can be acquired from the website "http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes".

Structure of the Proposed Method
The structure of the proposed methodology in this study is shown in Figure 4. A deep quadruplet network is trained to learn a feature space. The testing data is transferred to the learned feature space to extract features. The classification is accomplished using the Euclidean distance and the nearest neighbor (NN) classifier.

Quadruplet Learning
Metric learning refers to the transfer of input data from the original space R^F into a new feature space R^D (i.e., f_θ: R^F → R^D). F and D refer to the dimensions of the original space and the new space, respectively, and θ is the learnable parameter. In the new feature space R^D, samples from the same class are expected to be closer than those from different classes so that the classification can be finished in R^D using the nearest neighbor classifier. Several networks have been developed to accomplish this task, including the siamese network, triplet network [31], and quadruplet network [32].
In a siamese network, a contrastive loss function is designed to train the network to distinguish between pairs of samples from the same class and those from different classes. The designed loss function pulls together the samples within the same class and pushes apart the samples from different classes. However, for classification purposes, the feature space learned by the siamese network is inferior to that of the triplet network. In addition, siamese networks are sensitive to calibration in order to contextualize similarity vs. dissimilarity [31]. The loss function for a siamese network is:

L_s = (1/N_s) Σ_{i=1..N_s} [ y^(i) d(x_1^(i), x_2^(i))² + (1 − y^(i)) (max(0, γ_s − d(x_1^(i), x_2^(i))))² ]    (1)

where (x_1^(i), x_2^(i)) is a pair of samples that has been transferred by f_θ: R^F → R^D; y^(i) = 1 when the two samples come from the same class (e.g., x_a^(i) and x_p^(i)) and y^(i) = 0 otherwise; γ_s is a margin; N_s is the number of siamese pairs; and d(·) is the Euclidean distance between two elements.

A triplet network is trained on the basis of many triplets. A triplet contains three different samples (x_a^(i), x_p^(i), x_n^(i)), where x_a^(i) and x_p^(i) are two samples from the same class (i.e., a positive pair), while x_a^(i) and x_n^(i) are samples from different classes (i.e., a negative pair). Each sample in a triplet has been transferred by f_θ: R^F → R^D. The loss function for the triplet network is given by [32]:

L_t = (1/N_t) Σ_{i=1..N_t} ( d(x_a^(i), x_p^(i)) − d(x_a^(i), x_n^(i)) + γ )_+    (2)

where γ is the value of the margin that segregates the positive pairs from the negative pairs; N_t is the number of triplets; and (z)_+ = max(0, z). The first term inside (·)_+ is intended to shorten the distance between two samples from the same class, while the second term is designed to enlarge the distance between two samples from different classes. For the loss function in triplet networks, each positive pair and negative pair share a given sample (i.e., x_a^(i)), which compels triplet networks to focus more on obtaining the correct ranks for the pair distances.
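
As a concrete illustration, the hinge form of the triplet loss described above can be sketched in NumPy (a minimal sketch; the batch of embedded triplets is assumed to be given as arrays of shape (N_t, D), already mapped by f_θ):

```python
import numpy as np

def triplet_loss(xa, xp, xn, gamma=0.4):
    """Hinge triplet loss over a batch of embedded triplets:
    mean of ( d(x_a, x_p) - d(x_a, x_n) + gamma )_+ ."""
    d_ap = np.linalg.norm(xa - xp, axis=1)  # positive-pair distances
    d_an = np.linalg.norm(xa - xn, axis=1)  # negative-pair distances
    return np.maximum(0.0, d_ap - d_an + gamma).mean()
```

When the negative sample is far enough away (d_an ≥ d_ap + γ), the hinge is inactive and the loss vanishes; otherwise the margin violation is penalized linearly.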
In other words, the triplet loss only considers the relative distances of the positive and negative pairs, which results in poor generalization for the triplet network and difficulty in applying it to tracking tasks [32]. The quadruplet loss (QL) [30] introduces a different negative pair into the triplet loss. The quadruplet loss function contains four different samples (x_a^(i), x_p^(i), x_n1^(i), x_n2^(i)), where x_a^(i) and x_p^(i) are samples from the same class while x_n1^(i) and x_n2^(i) are samples from another two classes. All the samples have been transferred to the feature space by f_θ: R^F → R^D. The quadruplet loss is given by:

L_q = (1/N_q) Σ_{i=1..N_q} [ ( d(x_a^(i), x_p^(i)) − d(x_a^(i), x_n1^(i)) + γ )_+ + ( d(x_a^(i), x_p^(i)) − d(x_n1^(i), x_n2^(i)) + β )_+ ]    (3)

where γ and β are the margins for the two terms, and N_q is the number of quadruplets. The first term in the quadruplet loss is the same as that in the triplet loss (Equations (2) and (3)). The second term constrains the intra-class distances to be smaller than the inter-class distances [30]. However, the loss function in Equation (3) usually performs poorly because the number of quadruplets and quadruplet pairs grows rapidly as the dataset gets larger. Moreover, most samples are not very useful for training the network adequately and can overwhelm the relevant hard-learning samples, leading to poor performance of the network [32]. Hence, this study designed a new quadruplet loss function, as shown in Equation (4):

L_nq = (1/N_nq) Σ_{i=1..N_nq} [ ( d(x_a^(i), x_p^(i)) − d(x_a^(i), x_m^(i)) + γ )_+ + ( d(x_a^(i), x_p^(i)) − d(x_m^(i), x_n^(i)) + γ )_+ ]    (4)

where x_p^(i) is the farthest sample from the reference x_a^(i) in the same class; x_m^(i) and x_n^(i) form the closest negative pair in the whole batch; N_nq is the number of quadruplets in the new loss function; and γ is the value of the margin. Each sample in Equation (4) has been transferred by f_θ: R^F → R^D. The conceptual diagram of the quadruplet network, as proposed in this study, is presented in Figure 5. The proposed loss function compensates for the shortcomings of Equations (1)-(3).
The procedure for batch training using the proposed loss function is shown in Table 3, where T = {t_1, t_2, t_3, ..., t_s} is the training dataset for the batch, and s is the number of labeled samples. In a batch (shown in Table 3), N_nq = s, and t_j or t_k represents a sample in the dataset T. (t_j, t_k) represents a pair of samples, C(t_j) is the class label of the sample t_j, and α is the learning rate. The variables t_a, t_p, t_m, and t_n are the quadruplets before the deep network, while x_a, x_p, x_m, and x_n are the corresponding quadruplets after the deep network.
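
To make the batch-mining step concrete, the following NumPy sketch implements one plausible reading of the proposed loss on a single batch of embedded samples (an assumption about the exact form of Equation (4): the farthest positive per anchor and the single closest negative pair shared across the batch; the helper names are hypothetical):

```python
import numpy as np

def pairwise_dist(X):
    # Euclidean distance matrix between all embedded samples in a batch
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def new_quadruplet_loss(X, y, gamma=0.4):
    """Sketch of a batch-hard quadruplet loss: for each anchor x_a,
    x_p is the FARTHEST same-class sample, x_m is the closest
    different-class sample, and (x_m, x_n) is the closest negative
    pair in the whole batch. X: (s, D) embeddings, y: (s,) labels."""
    D = pairwise_dist(X)
    same = y[:, None] == y[None, :]
    D_neg = np.where(same, np.inf, D)   # distances of negative pairs only
    d_mn = D_neg.min()                  # closest negative pair in the batch
    total, count = 0.0, 0
    for i in range(len(y)):
        pos = same[i].copy()
        pos[i] = False                  # exclude the anchor itself
        if not pos.any():
            continue
        d_ap = D[i, pos].max()          # farthest positive for this anchor
        d_am = D_neg[i].min()           # closest negative to this anchor
        total += max(0.0, d_ap - d_am + gamma) + max(0.0, d_ap - d_mn + gamma)
        count += 1
    return total / max(count, 1)
```

With well-separated class clusters both hinge terms are zero, so only hard samples contribute gradients, which is the motivation given above for the new loss.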

Deep Network Framework
As shown in Figure 6, the overall framework of the proposed deep dense dilated 3-D CNN contains two branches: a dense CNN and a dilated CNN.

Dense CNN
The dense CNN block consists of five convolutional layers (see Figure 7a). "Conv" is a convolutional operation with a 3 × 3 × 3 kernel, while "2 Conv" represents a convolutional layer with two kernels (i.e., two convolutional operations). For a normal CNN block with five layers, there are five connections (a connection is between a layer and its subsequent layer) [33] (see Figure 7b). However, as the network becomes deeper, the problem with a normal CNN is that the features contained in the input can vanish after passing through many layers before reaching the end [33]. So instead of using a normal CNN, a dense CNN was used in this study. Aside from preserving the five connections of the normal CNN block, the dense CNN provides six other connections: three connections are between the 1st layer and the 3rd, 4th, and 5th layers; two connections are between the 2nd layer and the 4th and 5th layers; and one connection is between the 3rd layer and the 5th layer. Figure 7a shows the operation at a connection point in the dense CNN, and "⊕" represents the sum of all the imported connected lines.
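
The summation-based dense connectivity can be sketched as follows, with a toy elementwise layer standing in for the 3-D convolutions (`conv_layer` is a hypothetical stand-in, not the paper's exact layer):

```python
import numpy as np

def conv_layer(x, w):
    # toy stand-in for one 3-D convolutional layer (elementwise here)
    return np.tanh(w * x)

def dense_block(x, weights):
    """Dense block sketch: layer k receives the sum ("⊕") of the outputs
    of ALL preceding layers, so five layers carry 5 + 6 = 11 connections
    instead of the 5 of a plain feed-forward chain."""
    outputs = []
    for k, w in enumerate(weights):
        layer_in = x if k == 0 else np.sum(outputs, axis=0)
        outputs.append(conv_layer(layer_in, w))
    return outputs[-1]
```

Because every earlier output is summed into each later layer's input, early features have a short path to the end of the block and cannot silently vanish.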

Dilated CNN
For a normal convolutional operation, the convolutional kernel covers an image area of the same size as the kernel (Figure 8a). A normal CNN employed in image classification represents the image using many tiny feature scenes, resulting in obscure spatial structures [34]. Moreover, the spatial acuity and details that are lost are almost impossible to restore through upsampling and training. Hence, the image classification accuracy is limited when using a normal CNN, especially for images requiring detailed scene understanding [34]. Dilated convolution is an operation in which the convolutional kernel covers an image area of a bigger size (Figure 8b). For example, a 3 × 3 convolutional kernel only covers a 3 × 3 image area in a normal convolutional operation (Figure 8a), but a dilated 3 × 3 convolutional kernel can enlarge the receptive field to 5 × 5 (Figure 8b), or an even bigger field. The dilated CNN represents the image features on a bigger scale and alleviates the disadvantage of data redundancy in a hyperspectral dataset without increasing the network's depth or complexity [34]. The structure of the dilated branch in this study is shown in Figure 9.
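
The receptive-field growth can be illustrated with a 1-D analogue of dilated convolution (a sketch, not the paper's 3-D implementation): a 3-tap kernel with dilation 2 spans 5 input samples, just as a 3 × 3 kernel with dilation 2 covers a 5 × 5 area.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D dilated convolution: kernel taps are spaced `dilation`
    samples apart, so one output sees (len(kernel)-1)*dilation + 1 inputs."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # receptive field of one output
    out = np.empty(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out
```

With dilation = 1 this reduces to an ordinary valid convolution; increasing the dilation widens the receptive field without adding parameters or depth, which is exactly the property exploited by the dilated branch.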
The use of the dense CNN in this study alleviates the problem of information vanishing when passing through numerous layers and makes full use of the features extracted by all the layers.

In both the dense CNN block and the dilated CNN block, there is an operation called ReLU. ReLU is an activation function defined as ReLU(z) = max(0, z).

Nearest Neighbor (NN) for Classification
In this study, an embedded feature space is learned after training the proposed deep quadruplet CNN using the training data. For the testing data, the supervised samples and the samples to be classified are transferred to the embedded feature space by the trained deep quadruplet CNN. The classification is completed with the nearest neighbor classifier, based on the average Euclidean distance between the supervised samples and the samples to be classified.
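
This classification rule can be sketched as follows (a minimal sketch; the embeddings are assumed to be precomputed by the trained network):

```python
import numpy as np

def nn_classify(support, labels, queries):
    """Assign each query the class whose supervised (support) samples
    have the smallest average Euclidean distance in the learned space."""
    classes = np.unique(labels)
    preds = []
    for q in queries:
        d = np.linalg.norm(support - q, axis=1)        # distances to all supports
        avg = [d[labels == c].mean() for c in classes] # average per class
        preds.append(classes[int(np.argmin(avg))])
    return np.array(preds)
```

Averaging over all supervised samples of a class, rather than taking a single nearest sample, makes the decision less sensitive to one noisy supervised pixel.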

Parameters Setting
Details of the network architecture for the proposed deep quadruplet network are summarized in Table 4. The layer names in Table 4 correspond with the blocks in Figure 6, Figure 7, and Figure 9. N is the band number of the hyperspectral dataset. The N bands are selected by graph representation based band selection (GRBS) [35]. "ceil(N/2)" is the ceiling function, which is equal to the rounded-up integer of N/2. The learning rate α for optimizing the DQN is set to 10^-3 with a weight decay of 10^-4 and a momentum of 0.9. The sensitivity of the margin γ to the classification accuracy was tested with 15 supervised samples per class. The overall accuracy (OA) for all three testing datasets is shown in Figure 10. The value of γ was set to 0.4, which gave the best accuracy for all three testing datasets, as presented in Figure 10.
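
The optimizer settings above correspond to standard SGD with momentum and weight decay; one update step can be sketched as follows (a generic SGD-with-momentum formulation is assumed, not the exact update rule of the paper's training framework):

```python
import numpy as np

def sgd_step(w, grad, velocity, lr=1e-3, weight_decay=1e-4, momentum=0.9):
    """One SGD update with the stated settings: learning rate 10^-3,
    weight decay 10^-4, momentum 0.9."""
    g = grad + weight_decay * w          # L2 weight decay folded into the gradient
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity
```

The momentum term accumulates past gradients so that consistent descent directions are amplified, while the weight decay keeps the embedding weights small.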

Accuracy
For the Salinas and the UP datasets, the number of supervised samples (L) used in the testing experiments was set to 5, 10, 15, 20, and 25. Given the limited total number of labeled samples in the IP dataset, the number of supervised samples (L) used in the testing experiments was set to 5, 10, and 15. For each case of L, the supervised samples were selected randomly and ten runs were performed. The overall accuracy of each run was recorded. The results of the classification using the proposed method (DQN+NN) are presented in Figures 11-13. The overall accuracy of the proposed approach was then compared with other methods, including: SVM [6], LapSVM [36], TSVM [37], SCS3VM [38], SS-LPSVM [39], KNN+SNI [40], MLR+RS [41], SVM+S-CNN, 3D-CNN [19], DFSL+NN, and DFSL+SVM [28]. The average value and the standard deviation (STD) of the overall accuracy over the ten runs for the three testing datasets using the different methods are shown in Tables 5-7. The overall accuracy of 3D-CNN was examined based on the method described by Hamida et al. [19]. The accuracies of SVM, LapSVM, TSVM, SCS3VM, SS-LPSVM, KNN+SNI, MLR+RS, SVM+S-CNN, DFSL+NN, and DFSL+SVM are derived from the study of Liu et al. [28]. The training datasets and testing datasets in our paper are exactly the same as those in Reference [28], which are public and have been widely used for comparing different methods for hyperspectral image classification [28,38-41]. Hence, the comparison of different methods is appropriate.
There are some other results reported in existing publications [28]. The OA can reach 97.81%, 98.35%, and 98.62% for the Salinas, IP, and UP datasets, respectively [28]. However, these results were obtained with 200 supervised samples per class (L = 200), and more supervised samples naturally lead to better accuracy. Our paper focuses on hyperspectral image classification with a small number of samples, so the situation with L = 200 is outside the scope of this discussion.
Here is an explanation of the generalization of the network from some classes to new classes. Traditional artificial neural networks have been successful in data-intensive applications, but they struggle to learn from a limited number of examples. To solve this problem, few-shot learning (FSL) has been proposed [42]. Few-shot learning is a type of machine learning problem in which there is little supervised information for the target. One strategy is for the network to learn prior knowledge that can then be generalized to new tasks with limited supervised samples. In fact, it is an important and well-known branch of machine learning and has been widely used to solve many problems [28,32,42] (e.g., face recognition). The feature space learned from a hyperspectral image has been demonstrated to generalize to new classes [28]. The network trained in this paper is also essentially a generalized feature extractor. The extracted features of the supervised samples and the samples to be classified are put into the classifier (nearest neighbor) to finish the classification.
Here we make it clear that the L specific supervised samples are not exactly the same as those in [28], due to the randomness of selecting supervised samples. However, the way supervised samples are selected and the comparative analysis in this paper are the same as those in [28,39]. It is common to randomly select supervised samples and conduct several runs of experiments (e.g., 10 runs in [28,39], and also in our paper). The purpose is to avoid the reported accuracy being an artifact of a single chance draw.
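
The evaluation protocol (random supervised samples, ten runs, mean ± STD) can be sketched as follows (the `run_once` callback is a hypothetical stand-in for one full train-and-test cycle):

```python
import numpy as np

def evaluate(run_once, n_runs=10, seed=0):
    """Repeat an experiment with freshly drawn random supervised samples
    and report the mean and standard deviation of overall accuracy."""
    rng = np.random.default_rng(seed)
    accs = np.array([run_once(rng) for _ in range(n_runs)])
    return accs.mean(), accs.std()
```

Each run draws its own random supervised samples from the generator, so a single lucky or unlucky selection cannot dominate the reported mean ± STD.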

Ablation Study
There are three key modules in the proposed methodology: the quadruplet loss, the dense branch, and the dilated branch. To demonstrate the effectiveness of each module, the classification accuracy was calculated when one of the modules was replaced; in other words, an ablation study was performed. The proposed quadruplet loss was replaced by the siamese loss, the triplet loss, and the original quadruplet loss. The dense branch or dilated branch was replaced with a normal CNN module. When one module is replaced, the other modules are kept the same. The summary of the average values ± STD of OA over the ten runs is shown in Tables 8-10 for all three testing datasets. In every method where a module was replaced, the accuracy was lower compared with the proposed methodology (see Tables 8-10). The ablation study thus demonstrates that the quadruplet loss, the dense branch, and the dilated branch all contribute to improving the accuracy. In particular, the decrease in accuracy was most substantial when the quadruplet loss was replaced, which suggests that the design of the quadruplet loss contributes the most to improving the accuracy of the proposed methodology. Tables 11-13 show the average accuracy (AA) and Kappa coefficients of the proposed method, 3D-CNN, and the five experiments in the ablation study. There is not enough information in existing publications to report these metrics for the other methods. Tables 11-13 suggest that the proposed method obtains satisfactory results in terms of average accuracy and Kappa coefficient.

Time Consumption
In terms of the overall accuracy on the three datasets, SS-LPSVM, 3D-CNN, DFSL+NN, and DFSL+SVM show the accuracy performance closest to our method. Hence, these methods were selected for the comparative analysis of time consumption. The time consumption of 3D-CNN, DFSL+NN, DFSL+SVM, and the proposed method on the IP dataset is shown in Table 14. Details regarding the computer configuration and program coding used in analyzing the time consumption are presented in Table 15. Based on the comparative analysis, the time consumption of the proposed approach is similar to that of the other classification techniques. SS-LPSVM has been shown to take a much longer time than DFSL+NN and DFSL+SVM on the IP dataset [28] (198.30 s vs. 11.14 s + 0.36 s and 11.14 s + 2.21 s). Hence, it can be inferred that the proposed method shows a clear advantage over SS-LPSVM.

Conclusions
This study integrates the quadruplet loss with a deep 3-D CNN with dense and dilated characteristics to propose a quadruplet deep learning method for few-shot hyperspectral image classification. Verification and comparative analysis were performed using public hyperspectral datasets, and the results suggest the following conclusions: (1) The proposed approach was found to have higher overall accuracy than existing methods, which suggests that the classification method is state-of-the-art.
(2) An ablation study was conducted by replacing each module of the proposed approach (i.e., quadruplet loss, dense branch, and dilated branch) in turn to demonstrate the effectiveness of their contributions. The results show that all modules are effective and necessary in improving classification accuracy, with the proposed quadruplet loss providing the highest contribution.
(3) The time consumption for the different methods was tested under the same operating environment. The analysis shows the proposed methodology has a level of time consumption similar to that of existing methods.
In the future, given the scarcity of training samples in some cases, a sample-synthesis method can be explored for few-shot hyperspectral image classification.