Classification of Hyperspectral Image Based on Double-Branch Dual-Attention Mechanism Network

In recent years, researchers have paid increasing attention to hyperspectral image (HSI) classification using deep learning methods. To improve accuracy and reduce the number of training samples required, we propose a double-branch dual-attention mechanism network (DBDA) for HSI classification. Two branches are designed in DBDA to capture the abundant spectral and spatial features contained in HSI. Furthermore, a channel attention block and a spatial attention block are applied to these two branches respectively, which enables DBDA to refine and optimize the extracted feature maps. A series of experiments on four hyperspectral datasets shows that the proposed framework outperforms state-of-the-art algorithms, especially when training samples are severely lacking.


Introduction
Remote sensing images can be categorized by their spatial, spectral, and temporal resolutions [1], and have been widely studied in many areas such as land-cover mapping [2], water monitoring [3], and anomaly detection [4]. As a particular type of remote sensing image with high spectral resolution, a hyperspectral image (HSI) contains plentiful information in both the spectral and spatial dimensions [5]. HSI has been used in many fields including vegetation cover monitoring [6], atmospheric environmental research [7], and change area detection [8], among others. Supervised classification is an essential task for HSI and is the common technology underlying the above applications. However, the heavy redundancy of spectral band information and the limited number of training samples pose a huge challenge to HSI classification.
Early spectral-based attempts, including support vector machines (SVM) [9], multinomial logistic regression (MLR) [10,11], and random or dynamic subspace methods [12,13], focus on the spectral characteristics of HSI. However, adjacent pixels are likely to belong to the same category, and spectral-based methods ignore this high spatial correlation and local consistency of HSI. Therefore, an increasing number of classification frameworks based on spectral-spatial features have been presented. Two types of low-level features, morphological profiles [14] and Gabor features [15], were designed to represent the spatial information. Based on SVM, the morphological kernel [16] and the composite kernel [17] methods were also proposed to exploit spectral-spatial information. Although the above attempts improve the accuracy of the classifier, these methods depend heavily on hand-crafted descriptors.
Deep learning (DL) has shown powerful capabilities in automatically extracting nonlinear and hierarchical features, and a great number of computer vision tasks have benefited from DL.
The rest of this paper is arranged as follows: In Section 2, we briefly introduce the related work. The detailed structure of DBDA is given in Section 3. In Sections 4 and 5, we provide and analyze the experimental results. Finally, a conclusion of the entire paper with a direction for future work is presented in Section 6.

Related Work
In this section, we briefly introduce the basic modules used in DBDA, including the 3D-cube-based HSI classification framework, 3D-CNN with batch normalization, ResNet and DenseNet, the channel-wise attention mechanism, and the spatial-wise attention mechanism. Since both the HSI spectral bands and the convolutional kernels could be referred to as channels, we call the former bands and the latter channels to avoid confusion.

HSI Classification Framework Based on 3D-Cube
Unlike traditional pixel-based methods that only use spectral features, 3D-cube-based methods such as SSRN [31], FDSSC [32], DBMA [34], and our proposed framework exploit both spectral and spatial information. Pixel-based methods train the network on individual pixels, whereas 3D-cube-based methods take the target pixel together with its adjacent pixels as input. The labels of the adjacent pixels are not fed into the network; only the abundant spatial information around the target pixel is exploited. In short, the input size of pixel-based methods is 1 × 1 × b, while that of 3D-cube-based methods is p × p × b, where p × p is the size of the spatial neighborhood and b denotes the number of spectral bands.
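The construction of a p × p × b input cube around a target pixel can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that p is odd, that the image is stored as a nested list indexed as image[row][col][band], and that neighbors falling outside the image are filled with zeros. The function name `extract_cube` and the toy image are hypothetical.

```python
def extract_cube(image, row, col, p):
    """Return the p x p x b cube centred on (row, col), zero-padded at borders.

    `image` is a nested list indexed as image[row][col][band]; p is assumed odd.
    """
    height, width, bands = len(image), len(image[0]), len(image[0][0])
    half = p // 2
    cube = []
    for r in range(row - half, row + half + 1):
        cube_row = []
        for c in range(col - half, col + half + 1):
            if 0 <= r < height and 0 <= c < width:
                cube_row.append(list(image[r][c]))
            else:
                # Neighbours outside the image are filled with zeros.
                cube_row.append([0.0] * bands)
        cube.append(cube_row)
    return cube


# Toy 4 x 5 image with b = 3 bands; every band of pixel (r, c) holds 10*r + c + 1.
toy = [[[float(10 * r + c + 1)] * 3 for c in range(5)] for r in range(4)]
cube = extract_cube(toy, row=0, col=0, p=3)  # cube around a corner pixel
```

Only the label of the center pixel would be paired with such a cube; the neighbors contribute spectra, not labels.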

3D-CNN with Batch Normalization
3D-CNN with batch normalization (BN) [45] is a common element in 3D-cube-based deep learning models. Given abundant labelled images, deep learning models with multiple nonlinear layers can learn hierarchical representations, and the multilevel convolutional layers empower a CNN to learn more discriminative characteristics under sparsity constraints. A 1D-CNN uses only spectral features, and a 2D-CNN captures only local spatial features of the pixels. Since HSI contains plenty of both spatial and spectral information, a 3D-CNN should be adopted to obtain reasonable results. Therefore, we use 3D-CNN as the basic structure of DBDA. Moreover, we add a BN layer to each 3D-CNN layer to improve numerical stability.
As shown in Figure 1, with $n^m$ input feature maps of size $p^m \times p^m \times b^m$, a 3D-CNN layer contains $k^{m+1}$ channels of size $\alpha^{m+1} \times \alpha^{m+1} \times d^{m+1}$ and generates $n^{m+1}$ output feature maps of size $p^{m+1} \times p^{m+1} \times b^{m+1}$. The $i$th output of the $(m+1)$th 3D-CNN layer with BN can be calculated as:
$$\hat{x}^{m} = \frac{x^{m} - E(x^{m})}{\sqrt{\mathrm{Var}(x^{m})}} \tag{1}$$
$$x_i^{m+1} = R\left(\sum_{j=1}^{n^m} \hat{x}_j^{m} * H_i^{m+1} + b_i^{m+1}\right) \tag{2}$$
in which $x_j^{m} \in \mathbb{R}^{p \times p \times b}$ is the $j$th input feature map of the $(m+1)$th layer, and $\hat{x}^{m}$ is the output after the BN in the $m$th layer. $E(\cdot)$ and $\mathrm{Var}(\cdot)$ denote the expectation and variance functions of the input, respectively. $H^{m+1}$ and $b^{m+1}$ represent the weights and biases of the $(m+1)$th 3D-CNN layer, $*$ is the 3D convolution operation, and $R(\cdot)$ denotes the activation function that introduces nonlinearity into the network.

ResNet and DenseNet
Normally, the more convolutional layers a network has, the better it performs. However, too many layers may aggravate the problems of vanishing and exploding gradients. ResNet [29] and DenseNet [30] are effective ways out of this dilemma.
In ResNet, a skip connection is added to the conventional CNN model. As indicated in Figure 2a, H denotes a hidden block, which is a module containing convolutional layers, activation layers, and BN layers. The skip connection, which can be regarded as an identity mapping, enables the input data to pass directly through the network. The residual block is the basic unit in ResNet, and the output of the $l$th residual block can be calculated as:
$$x_l = H_l(x_{l-1}) + x_{l-1} \tag{3}$$
Based on ResNet, DenseNet connects all layers directly to ensure maximum information flow between the layers of the network. Instead of combining features through summation as in ResNet, DenseNet combines features by concatenating them along the channel dimension. The dense block is the basic unit in DenseNet, and the output of the $l$th dense block can be computed as:
$$x_l = H_l[x_0, x_1, \ldots, x_{l-1}] \tag{4}$$
in which $H_l$ is a module including convolution layers, activation layers, and BN layers, and $x_0, x_1, \ldots, x_{l-1}$ denote the feature maps generated by the preceding dense blocks. As shown in Figure 2b, more connections ensure more information flow in DenseNet. Specifically, a DenseNet with $L$ layers has $L(L+1)/2$ direct connections, while a traditional convolutional network with the same number of layers has only $L$ direct connections.

The structure of the dense connection block used in our framework can be seen in Figure 3. Mish in Figure 3 is the activation function adopted in our framework; details about Mish can be found in Section 3.2.1. Suppose that the input feature maps have shape $p \times p \times b$ with $n$ channels and that each convolution layer is composed of $k$ kernels of shape $1 \times 1 \times d$; then each layer generates feature maps of shape $p \times p \times b$ with $k$ channels.
However, a dense connection concatenates feature maps along the channel dimension, so the number of channels grows linearly with the number of convolution layers. The output with $k_m$ channels generated by an $m$-layer dense block can be formulated as:
$$k_m = b + (m - 1) \times k \tag{5}$$
where $b$ represents the number of channels in the input feature maps.

Attention Mechanism
A shortcoming of the 3D-CNN is that all spatial pixels and spectral bands receive equal weights in the spatial and spectral domains. However, different spectral bands and spatial pixels contribute differently to feature extraction. The attention mechanism is a powerful technique for dealing with this problem. Motivated by the human visual perception process [46], it is designed to focus more on informative areas and less on non-essential ones. The attention mechanism has been used for image categorization [47] and was later shown to be effective in other areas, including image captioning [48], text-to-image synthesis [49], and scene segmentation [44]. In DANet [44], a channel attention block and a spatial attention block are adopted to increase the weight of informative channels and pixels. The two blocks are introduced in detail below.

Spectral Attention Block
As illustrated in Figure 4a, the channel attention map $X \in \mathbb{R}^{c \times c}$ is computed directly from the initial input $A \in \mathbb{R}^{c \times p \times p}$, where $p \times p$ is the patch size of the input and $c$ denotes the number of input channels. Concretely, a matrix multiplication between $A$ and $A^T$ is performed, and a softmax layer is then applied to obtain the channel attention map $X \in \mathbb{R}^{c \times c}$:
$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{c} \exp(A_i \cdot A_j)} \tag{6}$$
in which $x_{ji}$ measures the $i$th channel's influence on the $j$th channel. Then, the result of the matrix multiplication between $X^T$ and $A$ is reshaped into $\mathbb{R}^{c \times p \times p}$. Finally, the reshaped result is weighted by a scale parameter $\alpha$ and added to the input $A$ to obtain the final spectral attention map $E \in \mathbb{R}^{c \times p \times p}$:
$$E_j = \alpha \sum_{i=1}^{c} (x_{ji} A_i) + A_j \tag{7}$$
where $\alpha$ is initialized as zero and is learned gradually. The final map $E$ contains the weighted sum of all channels' features, which describes long-range dependencies and boosts the discriminability of the features.
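The channel attention computation can be illustrated with a small numerical sketch. It assumes that the input A has already been flattened to c rows of n = p × p values, treats α as a fixed number rather than a learned parameter, and uses the softmax normalization over the source channels i for each target channel j, as in DANet. The function name and the toy input are hypothetical.

```python
import math


def channel_attention(A, alpha):
    """A is a list of c channels, each a flat list of n = p * p values.

    Returns the c x c attention map X and the refined features E.
    """
    c, n = len(A), len(A[0])
    X = []
    for j in range(c):
        # Similarity of every channel i to channel j (inner products A_i . A_j).
        energy = [sum(a * b for a, b in zip(A[i], A[j])) for i in range(c)]
        peak = max(energy)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in energy]
        total = sum(exps)
        X.append([v / total for v in exps])  # X[j][i] = x_ji, each row sums to 1
    # E_j = alpha * sum_i x_ji * A_i + A_j
    E = [[alpha * sum(X[j][i] * A[i][t] for i in range(c)) + A[j][t]
          for t in range(n)] for j in range(c)]
    return X, E


# Toy input: c = 3 channels on a 2 x 2 patch (n = 4).
A = [[1.0, 0.0, 2.0, 1.0],
     [0.5, 1.0, 0.0, 1.5],
     [0.0, 1.0, 1.0, 0.0]]
X, E_initial = channel_attention(A, alpha=0.0)  # alpha = 0 leaves A unchanged
_, E_weighted = channel_attention(A, alpha=1.0)
```

With α = 0 the block is an identity mapping, which is why initializing α at zero lets the network start from the unrefined features and add attention gradually.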

Spatial Attention Block
As illustrated in Figure 4b, given an input feature map $A \in \mathbb{R}^{c \times p \times p}$, two convolution layers are adopted to generate new feature maps $B$ and $C$ respectively, where $\{B, C\} \in \mathbb{R}^{c \times p \times p}$. Next, $B$ and $C$ are reshaped into $\mathbb{R}^{c \times n}$, where $n = p \times p$ is the number of pixels. Then a matrix multiplication is performed between $B$ and $C$, and a softmax layer is applied to calculate the spatial attention map $S \in \mathbb{R}^{n \times n}$:
$$s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{n} \exp(B_i \cdot C_j)} \tag{8}$$
where $s_{ji}$ measures the impact of the $i$th pixel on the $j$th pixel. The closer the feature representations of two pixels, the stronger the correlation between them. The initial input feature $A$ is simultaneously fed into a convolution layer to obtain a new feature map $D \in \mathbb{R}^{c \times p \times p}$, which is then reshaped into $\mathbb{R}^{c \times n}$. Then a matrix multiplication is performed between $D$ and $S^T$, and the result is reshaped into $\mathbb{R}^{c \times p \times p}$ as:
$$E_j = \beta \sum_{i=1}^{n} (s_{ji} D_i) + A_j \tag{9}$$
where $\beta$ has a zero initial value and is learned to assign more weight gradually. From Equation (9), it can be inferred that all positions and the original features are added with a certain weight to obtain the final feature $E \in \mathbb{R}^{c \times p \times p}$. Therefore, long-range contextual information in the spatial dimension is modeled in $E$.
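The spatial attention computation can be sketched in the same spirit. The sketch below assumes that A, B, C, and D are already available as c × n matrices (the convolutions that produce B, C, and D are not modeled; the example simply reuses A for all three), and that β is a fixed number. The function name and the toy input are hypothetical.

```python
import math


def spatial_attention(A, B, C, D, beta):
    """All inputs are c x n matrices (lists of c rows with n = p * p columns).

    Returns the n x n attention map S and the refined features E.
    """
    c, n = len(A), len(A[0])

    def pixel(M, i):
        # Feature vector (length c) of pixel i.
        return [M[ch][i] for ch in range(c)]

    S = []
    for j in range(n):
        energy = [sum(b * q for b, q in zip(pixel(B, i), pixel(C, j)))
                  for i in range(n)]
        peak = max(energy)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in energy]
        total = sum(exps)
        S.append([v / total for v in exps])  # S[j][i] = s_ji, each row sums to 1
    # E_j = beta * sum_i s_ji * D_i + A_j, evaluated channel by channel.
    E = [[beta * sum(S[j][i] * D[ch][i] for i in range(n)) + A[ch][j]
          for j in range(n)] for ch in range(c)]
    return S, E


# Toy input: c = 3 channels on a 2 x 2 patch (n = 4); B, C, D stand in for
# the outputs of the three convolution layers.
A = [[1.0, 0.0, 2.0, 1.0],
     [0.5, 1.0, 0.0, 1.5],
     [0.0, 1.0, 1.0, 0.0]]
S, E_initial = spatial_attention(A, A, A, A, beta=0.0)
_, E_weighted = spatial_attention(A, A, A, A, beta=1.0)
```

The attention map here is n × n (pixel to pixel), in contrast to the c × c channel map of the spectral attention block.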

Methodology
The procedure of the DBDA framework contains three steps: dataset generation, training and validation, and prediction. Figure 5 illustrates the whole framework of our method.
An HSI dataset $X$ is supposed to be composed of $N$ labelled pixels $\{x_1, x_2, \ldots, x_N\} \in \mathbb{R}^{1 \times 1 \times b}$, where $b$ represents the number of bands, and the corresponding category label set is $Y = \{y_1, y_2, \ldots, y_N\} \in \mathbb{R}^{1 \times 1 \times c}$, where $c$ denotes the number of land cover classes.
In the dataset generation step, the $p \times p$ neighboring pixels of each center pixel $x_i$ are selected from the original data to generate the set of 3D-cubes $\{z_1, z_2, \ldots, z_N\} \in \mathbb{R}^{p \times p \times b}$. If the target pixel is on the edge of the image, the values of the missing adjacent pixels are set to zero. The patch size $p$ is set to 9 in our framework. Then, the set of 3D-cubes is randomly divided into a training set $Z_{train}$, a validation set $Z_{val}$, and a testing set $Z_{test}$. Accordingly, the corresponding label vectors are divided into $Y_{train}$, $Y_{val}$, and $Y_{test}$. The labels of neighboring pixels are not visible to the network; we use only the spatial information around the target pixel.
In the training and validation steps, the training set is used to update the parameters for many epochs, while the validation set is used to monitor the performance of the models and to select the best-trained model.
In the prediction step, the test set is used to verify the effectiveness of the trained model. The quantitative index commonly used in HSI classification to measure the difference between the predicted results and the real values is the cross-entropy loss function, which is defined as
$$C(\hat{y}, y) = \sum_{i=1}^{L} y_i \left( \log \sum_{j=1}^{L} e^{\hat{y}_j} - \hat{y}_i \right) \tag{10}$$
where $\hat{y} = [\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_L]$ is the label vector predicted by the model and $y = [y_1, y_2, \ldots, y_L]$ represents the ground-truth label vector.
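The cross-entropy loss between a prediction and its ground-truth label can be evaluated numerically as sketched below. The sketch assumes that the model outputs unnormalized class scores that are implicitly passed through a softmax and that the ground truth is a one-hot vector; the function name and the toy scores are illustrative only.

```python
import math


def cross_entropy(scores, one_hot):
    """Cross-entropy between softmax(scores) and a one-hot label vector."""
    peak = max(scores)
    # log(sum_j exp(score_j)), computed stably by factoring out the maximum.
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return sum(y * (log_sum_exp - s) for y, s in zip(one_hot, scores))


scores = [2.0, 1.0, 0.1]   # hypothetical model outputs for L = 3 classes
label = [1, 0, 0]          # ground truth: class 0
loss = cross_entropy(scores, label)

# The same value written as -log of the softmax probability of the true class.
softmax_true = math.exp(scores[0]) / sum(math.exp(s) for s in scores)
reference = -math.log(softmax_true)
```

The loss is small when the score of the true class dominates and equals log L when all scores are equal.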

The Framework of the DBDA Network
The whole structure of the DBDA network can be seen in Figure 6. For convenience, we call the top branch the spectral branch and the bottom branch the spatial branch. The input is fed into the spectral branch and the spatial branch respectively to obtain the spectral feature maps and the spatial feature maps. Then a fusion operation between the spectral and spatial feature maps is adopted to obtain the classification results.
The following parts introduce the spectral branch, the spatial branch, and the spectral and spatial fusion operation, taking the Indian Pines (IP) dataset as an example; the patch size is 9 × 9 × 200. For the matrices mentioned below, such as (9 × 9 × 97, 24), 9 × 9 × 97 represents the height, width, and depth of the 3D-cube, and 24 represents the number of 3D-cubes generated by the 3D-CNN.
The IP dataset contains 145 × 145 pixels with 200 spectral bands, that is, the size of IP is 145 × 145 × 200. The details of IP can be seen in Table 3. Only 10,249 pixels have corresponding labels; the other pixels are background.

Spectral Branch with the Channel Attention Block
First, a 3D-CNN layer with a 1 × 1 × 7 kernel is used. The down-sampling stride is set to (1,1,2), which reduces the number of bands. Then, feature maps of shape (9 × 9 × 97, 24) are obtained. After that, the dense spectral block composed of 3D-CNN layers with BN is attached. Each 3D-CNN layer of the dense spectral block has 12 channels with a 1 × 1 × 7 kernel. After the dense spectral block, the number of channels of the feature maps increases to 60, as calculated by Equation (5). Therefore, we obtain feature maps of size (9 × 9 × 97, 60). Next, after the last 3D-CNN layer with a kernel size of 1 × 1 × 97, a (9 × 9 × 1, 60) feature map is generated. However, the 60 channels make different contributions to the classification. To refine the spectral features, the channel attention block illustrated in Figure 4a and explained in Section 2.4.1 is adopted. The channel attention block reinforces the informative channels and weakens the information-lacking channels. After the weighted spectral feature maps are obtained by channel attention, a BN layer and a dropout layer are applied to enhance numerical stability and mitigate overfitting. Finally, via a global average pooling layer, feature maps of shape 1 × 60 are obtained. The implementation of the spectral branch is given in Table 1.
Figure 6. The structure of the DBDA network. The upper spectral branch, composed of the dense spectral block and the channel attention block, is designed to capture spectral features. The lower spatial branch, composed of the dense spatial block and the spatial attention block, is designed to exploit spatial features.
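The feature-map sizes quoted for the spectral branch can be checked with the standard convolution output-length formula. The sketch below is a bookkeeping aid only; it assumes no padding in the first and last convolutions, a padding of 3 along the spectral axis inside the dense block so that the depth stays at 97, and three 12-channel convolutions in the dense block (the count that reproduces the 60 channels stated above). The helper name `conv_out_len` is hypothetical.

```python
def conv_out_len(length, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (length + 2 * padding - kernel) // stride + 1


bands = 200                       # spectral bands of the IP input cube
growth = 12                       # channels produced by each dense-block layer
num_dense_layers = 3              # assumption, see the lead-in

# First layer: 1 x 1 x 7 kernel with stride 2 along the spectral axis.
depth = conv_out_len(bands, kernel=7, stride=2)
channels = 24
shapes = [(9, 9, depth, channels)]

# Dense spectral block: each layer keeps the depth and its output is
# concatenated with everything before it, so the channel count grows by 12.
for _ in range(num_dense_layers):
    assert conv_out_len(depth, kernel=7, stride=1, padding=3) == depth
    channels += growth
    shapes.append((9, 9, depth, channels))

# Last layer: a 1 x 1 x 97 kernel collapses the spectral axis.
final_depth = conv_out_len(depth, kernel=depth)
shapes.append((9, 9, final_depth, channels))
```

Under these assumptions the list `shapes` runs from (9, 9, 97, 24) through (9, 9, 97, 60) to (9, 9, 1, 60), matching the sizes given in the text.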

Spatial Branch with the Spatial Attention Block
Meanwhile, the input data of shape 9 × 9 × 200 are delivered to the spatial branch, and the kernel size of the initial 3D-CNN layer is set to 1 × 1 × 200, which compresses the spectral bands into one dimension. After that, feature maps of shape (9 × 9 × 1, 24) are obtained. Then, the dense spatial block composed of 3D-CNN layers with BN is attached. Each 3D-CNN layer in the dense spatial block has 12 channels with a 3 × 3 × 1 kernel. Next, the extracted feature maps of shape (9 × 9 × 1, 60) are fed into the spatial attention block, as illustrated in Figure 4b and described in Section 2.4.2. With the attention block, each pixel is weighted to obtain a more discriminative spatial feature.
After capturing the weighted spatial feature maps, a BN layer with a dropout layer is applied. Finally, the spatial feature maps of shape 1 × 60 are obtained via a global average pooling layer. The implementation of the spatial branch is given in Table 2.
Table 2. The implementation details of the spatial branch (columns: Layer Name, Kernel Size, Output Size).

Spectral and Spatial Fusion for HSI Classification
With the spectral branch and the spatial branch, the spectral feature maps and the spatial feature maps are obtained. Then, we concatenate the two sets of features for classification. Concatenation is used instead of addition because the spectral and spatial features lie in unrelated domains: concatenation keeps them independent, whereas addition would mix them together. In the end, the classification result is obtained via a fully connected layer and the softmax activation function.
For other datasets, the network implementation is the same, and the only difference is the number of spectral bands. The whole methodology flowchart of DBDA is shown in Figure 7.
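The fusion step can be sketched as concatenation followed by one fully connected layer and a softmax. The weights, biases, number of classes, and feature values below are placeholders chosen for illustration, not values from the trained network, and the function name is hypothetical.

```python
import math


def fuse_and_classify(spectral_feat, spatial_feat, weights, biases):
    """Concatenate the two feature vectors, apply one dense layer, then softmax."""
    fused = list(spectral_feat) + list(spatial_feat)  # 60 + 60 = 120 features
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return fused, [e / total for e in exps]


num_classes = 4                   # hypothetical; use the dataset's class count
spectral = [0.1] * 60             # stand-in for the 1 x 60 spectral features
spatial = [0.2] * 60              # stand-in for the 1 x 60 spatial features
weights = [[0.01 * (k + 1)] * 120 for k in range(num_classes)]
biases = [0.0] * num_classes
fused, probs = fuse_and_classify(spectral, spatial, weights, biases)
```

Because the two halves of `fused` keep their own positions, the fully connected layer can weight spectral and spatial features separately, which element-wise addition would not allow.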

Measures Taken to Prevent Overfitting
Numerous training parameters and limited training samples make the network prone to overfitting. Thus, we take several measures to prevent overfitting.

A Strong and Appropriate Activation Function
The activation function introduces nonlinearity into a neural network. An appropriate activation function can accelerate backpropagation and the convergence of the network. The activation function we adopt is Mish [50], a self-regularized non-monotonic activation function, instead of the conventional ReLU(x) = max(0, x) [51]. The formula for Mish is:
$$\mathrm{mish}(x) = x \times \tanh(\mathrm{softplus}(x)) = x \times \tanh(\ln(1 + e^{x})) \tag{11}$$
where $x$ represents the input of the activation. The comparison of Mish and ReLU can be seen in Figure 8. Mish is unbounded above and bounded below, with a range of [≈ −0.31, ∞). The derivative of Mish is:
$$f'(x) = \frac{e^{x} \omega}{\delta^{2}} \tag{12}$$
where $\omega = 4(x + 1) + 4e^{2x} + e^{3x} + e^{x}(4x + 6)$ and $\delta = 2e^{x} + e^{2x} + 2$.
ReLU is a piecewise linear function that prunes all negative inputs. Thus, if the input is nonpositive, the neuron "dies" and cannot be activated anymore, even though negative inputs might contain useful information. In contrast, Mish preserves negative inputs as negative outputs, which gives a better trade-off between retaining input information and network sparsity.
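The behavior of Mish and its derivative can be checked numerically. The sketch below implements Mish as x · tanh(softplus(x)) and its closed-form derivative e^x ω / δ² with ω = 4(x + 1) + 4e^{2x} + e^{3x} + e^x(4x + 6) and δ = 2e^x + e^{2x} + 2, then compares the derivative against a central finite difference. The function names, the step size, and the sample points are illustrative choices.

```python
import math


def softplus(x):
    # Numerically stable form of ln(1 + e^x).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    return x * math.tanh(softplus(x))


def mish_grad(x):
    ex = math.exp(x)
    omega = 4 * (x + 1) + 4 * ex**2 + ex**3 + ex * (4 * x + 6)
    delta = 2 * ex + ex**2 + 2
    return ex * omega / delta**2


def relu(x):
    return max(0.0, x)


def finite_difference(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)


sample_points = [-2.0, -0.5, 0.0, 1.0, 3.0]
max_gap = max(abs(mish_grad(x) - finite_difference(mish, x)) for x in sample_points)

# Approximate lower bound of Mish, searched on a grid over [-5, 0].
lower_bound = min(mish(-5 + 0.01 * i) for i in range(501))
```

For a negative input such as x = −1, `relu` returns 0 while `mish` returns a small negative value, which is the information-preserving behavior described above.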

Dropout Layer, Early Stopping Strategy and Dynamic Learning Rate Adjustment
A dropout layer [52] is placed between the last BN layer and the global average pooling layer in both the spatial branch and the spectral branch. Dropout is a simple but effective method to prevent overfitting by randomly dropping units (hidden or visible) with a given probability p during the training phase. We set p to 0.5 in our framework. Dropout makes the presence of other units unreliable, which prevents co-adaptation between units.
In addition, two training techniques, the early stopping strategy and the dynamic learning rate adjustment method, are also introduced into our model. Early stopping means that if the loss no longer decreases for a certain number of epochs (20 in our model), the training process is stopped early to prevent overfitting and reduce the training time.
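The early stopping rule can be sketched as a loop over per-epoch losses. The sketch assumes that the monitored quantity is a loss recorded once per epoch and replaces actual training with a precomputed list of losses; the function name, the `max_epochs` cap, and the toy loss curve are hypothetical.

```python
def run_with_early_stopping(losses, patience=20, max_epochs=200):
    """Return (best_epoch, last_epoch) for a sequence of per-epoch losses.

    Training stops once the loss has not improved for `patience` epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    last_epoch = -1
    for epoch, loss in enumerate(losses[:max_epochs]):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # no improvement for `patience` epochs: stop early
    return best_epoch, last_epoch


# Toy curve: the loss falls for 30 epochs, then stays at a higher plateau.
toy_losses = [1.0 / (epoch + 1) for epoch in range(30)] + [0.5] * 50
best_epoch, last_epoch = run_with_early_stopping(toy_losses)
```

In practice the model parameters from `best_epoch` would be kept, since later epochs did not improve the monitored loss.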
The learning rate is a crucial hyperparameter for training a network, and a dynamic learning rate can help a network escape some local minima. The cosine annealing [53] method is adopted to adjust the learning rate dynamically according to the following equation:
$$\eta_t = \eta_{min}^{i} + \frac{1}{2}\left(\eta_{max}^{i} - \eta_{min}^{i}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right) \tag{13}$$
where $\eta_t$ is the learning rate within the $i$th run and $[\eta_{min}^{i}, \eta_{max}^{i}]$ is the range of the learning rate. $T_{cur}$ is the number of epochs that have been executed, and $T_i$ controls the number of epochs in one cycle of adjustment.
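The cosine annealing schedule can be tabulated directly from its definition, η_t = η_min + ½(η_max − η_min)(1 + cos(π T_cur / T_i)). In the sketch below the maximum learning rate, the minimum learning rate of zero, and the cycle length of 15 epochs are assumptions chosen for illustration; the function name is hypothetical.

```python
import math


def cosine_annealing_lr(t_cur, t_i, eta_min, eta_max):
    """Learning rate after t_cur epochs of a cycle that lasts t_i epochs."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


eta_max = 0.0005   # assumed maximum learning rate
eta_min = 0.0      # assumed minimum learning rate
cycle_length = 15  # assumed number of epochs per cycle

schedule = [cosine_annealing_lr(epoch, cycle_length, eta_min, eta_max)
            for epoch in range(cycle_length + 1)]
```

The schedule starts at `eta_max`, decays slowly at first, fastest in the middle of the cycle, and reaches `eta_min` at the end of the cycle.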

Experimental Results
To verify the accuracy and efficiency of the proposed model, experiments on four datasets are designed to compare the proposed network with other methods. Three quantitative metrics, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (K), are used to measure the accuracy of each method. Concretely, OA is the ratio of correctly classified pixels to all test pixels. AA is the mean of the per-class accuracies. The Kappa coefficient reflects the consistency between the ground truth and the classification result. The higher the three metric values are, the better the classification result is. Meanwhile, we record the running time of each framework to evaluate its efficiency.
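The three metrics can be computed from a confusion matrix as sketched below. The sketch uses the standard definitions (OA as the trace over the total, AA as the mean per-class recall, and Cohen's kappa from observed and chance agreement) and assumes that every class has at least one test sample. The function name and the toy confusion matrix are hypothetical.

```python
def classification_metrics(confusion):
    """confusion[i][j] = number of class-i samples predicted as class j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes))
                for j in range(n_classes)]

    # Overall accuracy: correctly classified samples over all samples.
    oa = sum(confusion[i][i] for i in range(n_classes)) / total
    # Average accuracy: mean of the per-class accuracies.
    aa = sum(confusion[i][i] / row_sums[i] for i in range(n_classes)) / n_classes
    # Kappa: agreement beyond what is expected by chance.
    expected = sum(row_sums[k] * col_sums[k] for k in range(n_classes)) / total**2
    kappa = (oa - expected) / (1 - expected)
    return oa, aa, kappa


toy_confusion = [[45, 5, 0],
                 [10, 30, 10],
                 [0, 5, 95]]
oa, aa, kappa = classification_metrics(toy_confusion)
```

On unbalanced data, AA and Kappa can differ noticeably from OA, which is why all three are reported together.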
For each dataset, a certain percentage of the labelled data is randomly selected as training samples and validation samples, and the rest of the samples are used to test the performance of the model. Since the proposed DBDA maintains excellent performance when training samples are severely lacking, the numbers of training and validation samples are set at a minimal level.

The Introduction about Datasets
In this paper, four widely used HSI datasets, the Indian Pines (IP) dataset, the Pavia University (UP) dataset, the Salinas Valley (SV) dataset, and the Botswana (BS) dataset, are employed in the experiments.
Deep learning algorithms are data-driven and rely on plenty of labelled training samples. The more labelled data are fed into training, the better the accuracy is. However, more data mean more time consumption and higher computational complexity. It is worth noting that the proposed DBDA can maintain excellent performance even when training samples are severely lacking. Therefore, the numbers of training samples and validation samples are set at a minimal level in the experiments. For IP, we select 3% of the samples for training and 3% for validation. As the samples are sufficient for each class of UP and SV, we select only 0.5% of the samples for training and 0.5% for validation. For BS, the proportion of samples for training and validation is 1.2%. The decimal appears because the number of samples in BS is small, so we set the ratio to 1% with a ceiling operation. Tables 3-6 list the numbers of training, validation, and testing samples for the four datasets.
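The effect of the ceiling operation can be illustrated with a small stratified split. The sketch below assumes that the ceiling is applied per class, which would explain why a nominal 1% ratio yields an effective ratio slightly above 1% on a small dataset; the function name `stratified_split` and the toy labels are hypothetical and are not taken from the datasets used in the paper.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, train_ratio, val_ratio, seed=0):
    """Split sample indices per class, rounding each class count up."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Ceiling guarantees at least one training and one validation sample per class.
        n_train = math.ceil(len(indices) * train_ratio)
        n_val = math.ceil(len(indices) * val_ratio)
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    # Toy labels: one class with 150 samples and one with 30 samples.
    labels = [0] * 150 + [1] * 30
    train, val, test = stratified_split(labels, train_ratio=0.01, val_ratio=0.01)
    print(len(train), len(val), len(test))
    print(f"effective training ratio: {len(train) / len(labels):.2%}")
```

With these toy labels, a nominal 1% ratio selects 3 of 180 samples for training, an effective ratio of about 1.67%, because each class count is rounded up.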
CDCNN: The architecture of the CDCNN is shown in [27], which is based on 2D-CNN and ResNet. The size of input is 5 × 5 × b, where b denotes the number of spectral bands.
SSRN: The architecture of the SSRN is proposed in [31], which is based on 3D-CNN and ResNet. The size of the input is 7 × 7 × b.
FDSSC: The architecture of the FDSSC can be seen in [32], which is based on 3D-CNN and DenseNet. The size of the input is 9 × 9 × b.
DBMA: The architecture of the DBMA is presented in [34], which is based on 3D-CNN, DenseNet, and an attention mechanism. The size of the input is 7 × 7 × b.
For CDCNN, SSRN, FDSSC, DBMA, and the proposed method, the batch size is set to 16, and the optimizer is Adam with a learning rate of 0.0005. The maximum number of training epochs is set to 200. If the loss on the validation set does not decline for 20 epochs, the training phase is terminated early.

Classification Maps and Categorized Results for the IP Dataset
The classification results of the different methods for the IP dataset are reported in Table 7, where the best class-specific accuracy is in bold, and the classification maps of the different methods and the ground truth are shown in Figure 9.

Our proposed framework obtains the best results, with 95.38% OA, 96.47% AA, and 0.9474 Kappa, as can be seen from Table 7. CDCNN, which is based on 2D-CNN, achieves the worst accuracy with 62.32% OA, due to the limited training samples and its weak network structure. Although SVM outperforms CDCNN by more than 7% in OA, its salt-and-pepper noise is severe, as can be seen in Figure 9c, because SVM uses no spatial neighborhood information. The 3D-CNN-based models far exceed SVM and CDCNN, owing to their incorporation of both spatial and spectral information in the classification. FDSSC uses dense connections instead of residual connections, which enhances the performance of the network and yields more than 5% improvement in OA compared to SSRN. Based on FDSSC, DBMA extracts the spatial and spectral features in two independent branches and introduces the attention mechanism. However, when training samples are severely lacking, DBMA might overfit the training data. The proposed DBDA achieves stable and reliable performance with limited data due to its flexible and adaptive attention mechanism, the appropriate activation function, and the other measures to prevent overfitting.
Taking class 7, which has only three training samples in the IP dataset, as an example, our method performs well and obtains an acceptable accuracy of 92.59%, while the results of the other methods (SVM: 56.10%, CDCNN: 0.00%, SSRN: 0.00%, FDSSC: 73.53%, and DBMA: 40.00%) are not satisfactory.
Overall, the proposed model improves the OA by 2.23%, the AA by 8.80%, and the Kappa by 0.0225 compared to DBMA.

Classification Maps and Categorized Result for the UP Dataset
The classification results of the different methods for the UP dataset are reported in Table 8, where the best class-specific accuracy is in bold, and the classification maps of the different methods and the ground truth are shown in Figure 10.
We can see from Table 8 that our proposed method obtains the best results on all three metrics. Although our method does not achieve the best accuracy for every class, the accuracy of each class using our method exceeds 89%, which means our method is able to capture the distinctive features between different classes.
Since the samples in the UP dataset are sufficient, there are enough samples for each class even if we choose just 0.5% of the samples for training. Thus, DBMA overcomes overfitting and performs better than FDSSC because of its superior architecture. With ample samples, CDCNN surpasses the performance of SVM.


Classification Maps and Categorized Results for the SV Dataset
The categorized results using the different methods for the SV dataset are demonstrated in Table 9 where the best class-specific accuracy is in bold, and classification maps of the different methods and ground truth are shown in Figure 11.

We can see from Table 9 that our proposed method obtains the best results on all three metrics, and the accuracy of each category classified by our method exceeds 93%.
Similarly, because of the sufficient samples in the SV dataset, 0.5% training samples are enough. Thus, DBMA once again performs better than FDSSC. However, the SV dataset has 16 classes while the UP dataset has only 9 classes, so CDCNN obtains a weaker performance than SVM.

Classification Maps and Categorized Result for the BS Dataset
The classification results of the different methods for the BS dataset are reported in Table 10, where the best class-specific accuracy is in bold, and the classification maps of the different methods and the ground truth are shown in Figure 12.
Since the BS dataset is small, with only 3248 labelled samples, just 40 samples are selected as the training set and 40 samples are chosen as the validation set. Nonetheless, our method achieves 96.24% OA, 2.81% higher than DBMA. One reason is that our method can capture spatial and spectral features more effectively.


Investigation of Running Time
The above experiments show that our proposed method can achieve higher accuracy with less data. However, a good method should properly balance accuracy and efficiency. This part measures the efficiency of each method. Tables 11-14 list the time consumption of the six algorithms on the IP, UP, SV, and BS datasets.
Since we use SVM as a pixel-based model, it spends less time than the 3D-cube-based models in most cases. Because 2D-CNN contains fewer parameters to be trained, CDCNN takes less time than the 3D-CNN-based models.
Among the 3D-CNN-based models, the proposed method consumes less training time than FDSSC and DBMA while obtaining better performance, because of its higher rate of convergence. Even though SSRN is quicker than our method, the accuracy of our method is superior. That is, our method balances accuracy and efficiency better.
Table 11. Training and testing time consumption of support vector machines (SVM), contextual deep convolutional neural networks (CDCNN), spectral-spatial residual network (SSRN), fast dense spectral-spatial convolution (FDSSC), double-branch multi-attention (DBMA), and our method on the IP dataset using 307 training samples (3%) in 16 classes.

Discussion
In this part, further assessments of DBDA are conducted. First, different proportions of training samples are fed into the network, and the results reflect that our method can maintain effectiveness especially when the training samples are severely limited. Second, the results of ablation experiments confirm the necessity of the attention mechanism. Third, the results of the different activation functions show that Mish is a better choice than ReLU for DBDA.

Investigation of the Proportion of Training Samples
As mentioned above, deep learning is a data-driven approach that depends on large amounts of high-quality labelled data. In this part, we investigate scenarios with different proportions of training samples. Figure 13 shows the experimental results. For the IP and BS datasets, we use 0.5%, 1%, 3%, 5%, and 10% of the samples as the training sets, respectively. For the UP and SV datasets, we use 0.1%, 0.5%, 1%, 5%, and 10% of the samples as the training sets, respectively.

As expected, the accuracy improves as the number of training samples increases. All 3D-CNN-based methods, including SSRN, FDSSC, DBMA, and the proposed framework, can obtain near-perfect performance as long as enough samples (about 10% of the whole dataset) are provided. At the same time, the performance gaps between different models narrow as the number of training samples increases. Nevertheless, our method outperforms the other methods, especially when samples are insufficient. Since labelling a dataset is costly, our proposed method can save labor and cost.

Effectiveness of the Attention Mechanism
To verify the effectiveness of the attention mechanism, we remove the spatial attention module, the spectral attention module, and both attention modules from DBDA in turn, and compare the performance of these three "incomplete DBDA" variants with that of the "complete DBDA".
From Figure 14, we can conclude that both the spatial attention mechanism and the spectral attention mechanism improve the accuracy on all four datasets.
On average, the attention mechanism improves the OA by 4.69% across the four datasets. Furthermore, the spatial attention mechanism alone (2.18% average improvement) performs better than the spectral attention mechanism alone (0.97% average improvement) in most cases.
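The ablation comparison can be sketched as follows, under stated assumptions. The variant names and flags in `VARIANTS`, and the helpers `overall_accuracy` and `mean_oa_gain`, are hypothetical and only illustrate how an average OA gain across datasets might be computed; the inputs in the usage lines are toy values, not results from the paper.

```python
import numpy as np

# Hypothetical flags describing the four variants compared in the ablation.
VARIANTS = {
    "complete DBDA": {"spectral_attention": True, "spatial_attention": True},
    "without spatial attention": {"spectral_attention": True, "spatial_attention": False},
    "without spectral attention": {"spectral_attention": False, "spatial_attention": True},
    "without both": {"spectral_attention": False, "spatial_attention": False},
}

def overall_accuracy(y_true, y_pred):
    """OA: fraction of test samples whose predicted class is correct."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def mean_oa_gain(oa_reference, oa_ablated):
    """Average OA difference (reference minus ablated) over shared datasets.

    Both arguments map a dataset name to an OA value on the same scale.
    """
    shared = sorted(set(oa_reference) & set(oa_ablated))
    return float(np.mean([oa_reference[d] - oa_ablated[d] for d in shared]))

# Toy inputs only; these are not the paper's measurements.
print(overall_accuracy([1, 2, 3, 3], [1, 2, 3, 1]))
print(mean_oa_gain({"A": 0.90, "B": 0.80}, {"A": 0.85, "B": 0.70}))
```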

Effectiveness of the Activation Function
In Section 3.2.1, we explained why we adopted Mish as the activation function rather than the commonly used ReLU. Here, we compare the performance of DBDA based on Mish with that of DBDA based on ReLU. Figure 15 shows the classification OA of both variants.
As shown in Figure 15, DBDA based on Mish surpasses DBDA based on ReLU. Specifically, the OA improves by 2.27%, 2.01%, 4.00%, and 1.24% on the IP, UP, SV, and BS datasets, respectively. This difference in performance arises because Mish can speed up back-propagation.
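For reference, the two activation functions can be compared with a small NumPy sketch. It assumes the standard definition of Mish, mish(x) = x · tanh(softplus(x)); the helper names are illustrative and this is not the training code used in the experiments.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), evaluated in a numerically stable way.
    return np.logaddexp(0.0, x)

def mish(x):
    # Standard Mish definition: x * tanh(softplus(x)).
    return x * np.tanh(softplus(x))

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print("x    :", x)
print("mish :", mish(x))
print("relu :", relu(x))
```

Unlike ReLU, which outputs exactly zero for every negative input, Mish is smooth and lets small negative values through, so negative pre-activations still receive a nonzero gradient.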

Conclusions
In this paper, we proposed an end-to-end framework, the double-branch dual-attention mechanism network (DBDA), for HSI classification. The input of the DBDA framework is the original 3D pixel data, without any cumbersome pre-processing to reduce dimensionality. Based on densely connected 3D-CNN layers with BN, we designed two branches that capture spectral and spatial features, respectively. Meanwhile, a flexible and adaptive self-attention mechanism was applied to the spectral branch and the spatial branch, respectively. Mish was introduced as the activation function to accelerate the back-propagation and convergence processes. Dynamic learning rates, early stopping, and dropout layers were also adopted to prevent overfitting.
Extensive experimental results demonstrate that our proposed framework surpasses the state-of-the-art algorithms, especially when training samples are limited. Meanwhile, the time consumption is also lower than that of FDSSC and DBMA, as the attention blocks and the Mish activation function accelerate the convergence of the model. Accordingly, we conclude that the structure of our method is preferable for HSI classification.
A future direction of our work is to apply the proposed framework to other hyperspectral images, rather than only to the open-source datasets mentioned above. Moreover, reducing the training time remains an attractive challenge.