Improved Joint Sparse Models for Hyperspectral Image Classification Based on a Novel Neighbour Selection Strategy

Abstract: Joint sparse representation has been widely used for hyperspectral image classification in recent years; however, the equal weight assigned to each neighbouring pixel is unrealistic, especially in edge areas, and a single fixed scale is not appropriate for the entire image extent. To overcome these problems, we propose an adaptive local neighbour selection strategy suitable for hyperspectral image classification. We also introduce a multi-level joint sparse model based on the proposed adaptive local neighbour selection strategy. This method generates multiple joint sparse matrices at different levels according to the selected parameters, and the multi-level joint sparse optimization can be performed efficiently by a simultaneous orthogonal matching pursuit algorithm. Tests on three benchmark datasets show that the proposed method is superior to conventional sparse representation methods and the popular support vector machines.


Introduction
In recent years, remote sensing images have played an important role in many areas, such as surveillance, land-use classification, forest disturbance monitoring, and urban planning [1]. How to exploit the information in remotely-sensed images has been a popular research problem for decades. Hyperspectral images (HSI) have attracted a significant amount of attention due to their high spectral resolution and wide spectral range, which make it possible to analyse and distinguish various objects with higher accuracy [2].
One of the most important applications of HSI is supervised classification, which assigns a specific class to each pixel based on the spectral information [3]. Various techniques have been employed for this task, such as support vector machines (SVM) [4,5,6], random forests (RF) [7], multinomial logistic regression (MLR) [8], and neural networks (NN) [9]. Among these techniques, SVM has shown its effectiveness for HSI classification, especially when dealing with the Hughes phenomenon of very high-dimensional data [4]. Dimensionality reduction methods were developed to deal with the high dimensionality of HSI data and obtained some promising results [6,10,11]. Although these methods have provided some reasonable solutions to the problem, spatial context has not been fully utilized in those conventional classifiers. Without spatial information being involved, the classification map may exhibit a noisier appearance and a lower accuracy [12]. During the past decade, many attempts have been made to integrate the spatial context into classification tasks. Some methods have focused on feature extraction, such as extended morphological profiles [13,14] and attribute profiles.
By retaining the similar neighbouring pixels of a central pixel and rejecting the dissimilar ones, the information of the correlated spatial context becomes more representative for classification. Hence, we propose an adaptive neighbour selection strategy which computes the weights based on distances between pixels, with the labels of training data as a priori information. The structural similarity between the central pixel and its neighbours can be exploited in a more sensible way by considering the different contribution of each spectral band. Based on this, a novel joint sparse model-based classification approach, namely the 'adaptive weighted joint sparse model' (AJSM), is proposed in this paper. Moreover, we propose a novel classification method named the 'multi-level joint sparse representation model' (MLSR), in order to take advantage of the correlations among neighbouring pixels in a region. The procedure of MLSR is summarized as follows: (1) local matrices are obtained by the proposed adaptive neighbour selection strategy, where different distance thresholds result in different local matrices corresponding to different levels; (2) different joint sparse representations of the test pixel are then constructed from the different levels. Since pixels with similar distances can be simultaneously sparsely represented by features in the same subspace, and pixels from multiple levels may share different sparsity patterns, MLSR learns the dictionary for each joint sparse model separately; and (3) a simultaneous orthogonal matching pursuit (SOMP) algorithm is employed to solve the multi-level classification task.
The weight matrix for AJSM and MLSR is constructed from the ratio of the between-class and within-class distances, with a priori label information taken into consideration. This alleviates the negative impact of classifying mixed pixels and spectrally similar pixels. In addition, the proposed MLSR operates on one region scale with different levels, and the sparse coding procedures at the different levels are independent of each other. To sum up, the main advantage of the proposed multi-level method is that various parameter values can generate multiple sparse models to represent the different inner contextual structures among pixels, thereby improving the HSI classification accuracy.
The remainder of this paper is organized as follows: Section 2 reviews the sparsity representation and joint sparse models briefly. Section 3 describes the proposed MLSR method in detail for HSI classification. Experimental results on three benchmark datasets are presented in Section 4. Finally, conclusions and future work are provided in Section 5.

Sparsity Representation Classification Model
For the sparsity representation classification (SRC) model, assume that there are N training pixels belonging to C classes, and that x is an L-dimensional pixel. Let D = [D_1, D_2, ..., D_C] ∈ R^{L×N} be the dictionary learnt from the training samples; then x can be linearly represented by a combination of the atoms of D:

x = Dr = Σ_{c=1}^{C} D_c r_c,  (1)

where D_c ∈ R^{L×N_c} is the sub-dictionary for the c-th class and r_c ∈ R^{N_c×1} contains the sparse coefficients corresponding to D_c. In an ideal situation, if x belongs to the c-th class, then r_j = 0, ∀ j = 1, ..., C, j ≠ c. Given the dictionary D, the coefficient vector can be recovered by solving the optimization problem:

r̂ = argmin_r ||r||_0 subject to Dr = x.  (2)

Considering an empirical error tolerance σ, Equation (2) can be relaxed with the following inequality constraint:

r̂ = argmin_r ||r||_0 subject to ||Dr − x||_2 ≤ σ.  (3)

Equation (3) can also be replaced by a sparse objective function:

r̂ = argmin_r ||Dr − x||_2 subject to ||r||_0 ≤ P,  (4)

where P is a predefined sparsity parameter corresponding to the number of nonzero entries in r. This nondeterministic polynomial-time hard (NP-hard) problem can be optimized by greedy pursuit algorithms. Orthogonal Matching Pursuit (OMP) [38] is a typical algorithm that solves this NP-hard problem, in which the residual is always orthogonal to the span of the already selected atoms, and r is updated from the residual in each iteration. The problem can also be relaxed to a basis pursuit problem by replacing the ℓ0 norm with another form of regularization:

r̂ = argmin_r ||Dr − x||_2^2 + λ||r||_q,  (5)

where λ is a regularization parameter, and the norm is ℓ1 and ℓ2 when q = 1 and q = 2, respectively. Normally, the ℓ1 norm is more effective than the ℓ0 norm for solving convex optimization problems, and the ℓ2 norm can avoid the overfitting issue. The detailed procedure for solving this convex problem can be found in [39]. The label of x can be directly determined from the recovered sparse coefficients and the reconstruction error. Let e_c represent the residual between the test sample and its reconstruction by sparse representation:

e_c(x) = ||x − D_c r̂_c||_2, c = 1, ..., C,  (6)

where e_c(x) is the residual computed with the sub-dictionary and the optimal sparse coefficients of the c-th class. Then the class label of the test sample x is obtained from the minimum residual:

class(x) = argmin_{c=1,...,C} e_c(x).  (7)
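To make the SRC pipeline concrete, the following Python sketch implements Equations (4), (6), and (7) with a plain OMP solver. It is an illustrative reading of the standard algorithm, not the authors' code; the dictionary simply stacks unit-normalized training pixels, and all names (omp, src_classify) are our own.

```python
import numpy as np

def omp(D, x, sparsity):
    """Orthogonal Matching Pursuit for Equation (4): greedily add the
    atom most correlated with the residual, then refit by least squares
    so the residual stays orthogonal to the selected atoms."""
    residual = x.copy()
    support = []
    r = np.zeros(D.shape[1])
    for _ in range(sparsity):
        support.append(int(np.argmax(np.abs(D.T @ residual))))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    r[support] = coef
    return r

def src_classify(D, labels, x, sparsity=3):
    """Assign x to the class with the smallest reconstruction residual
    e_c(x) = ||x - D_c r_c||_2 (Equations (6) and (7))."""
    r = omp(D, x, sparsity)
    classes = np.unique(labels)
    residuals = [np.linalg.norm(x - D[:, labels == c] @ r[labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

# toy usage: 20 training pixels with 50 bands, two classes
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 20))
D /= np.linalg.norm(D, axis=0)            # unit l2-norm columns
labels = np.repeat([0, 1], 10)
print(src_classify(D, labels, D[:, 3]))   # recovers class 0
```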

Joint Sparsity Model
Since spatial information has been considered very important for HSI classification tasks, it is essential to embed spatial contextual information into the SR model as well. A joint sparsity model (JSM) [28] was proposed to exploit the correlation between neighbouring pixels and the centre pixel. Given a patch of √W × √W pixels centred at the test pixel, let X = [x_1, x_2, ..., x_W] ∈ R^{L×W} be the joint signal matrix consisting of all the pixels in this patch; in other words, the test pixel is located at the centre of the selected region, and the remaining pixels in X are its neighbours. According to [28], X can be expressed as:

X = [x_1, x_2, ..., x_W] = [Dr_1, Dr_2, ..., Dr_W] = DR,  (8)

where R = [r_1, r_2, ..., r_W] ∈ R^{N×W} is the sparsity matrix, and the selected atoms in dictionary D are determined by the nonzero coefficients in R. Therefore, a common sparsity pattern can be enforced on the pixels by aligning the indices of the nonzero rows of the sparse coefficient matrix. Given the dictionary D, the matrix R can be optimized by solving the following objective function:

R̂ = argmin_R ||X − DR||_F subject to ||R||_{row,0} ≤ P,  (9)

where ||·||_F is the Frobenius norm and ||R||_{row,0} denotes the number of nonzero rows of R. Equation (9) is also an NP-hard problem. Simultaneous OMP (SOMP) [36] is a generalized OMP algorithm which can be used to solve it efficiently. The label of x can again be determined from the recovered sparse coefficients and the reconstruction error. Let e_c represent the residual between the test samples and their reconstruction by sparse representation:

e_c(X) = ||X − D_c R̂_c||_F, c = 1, ..., C,  (10)

where R̂_c contains the coefficient rows corresponding to the c-th class. Then the class label of the test sample x is obtained from the minimum residual:

class(x) = argmin_{c=1,...,C} e_c(X).  (11)

JSM can achieve a better classification result than pixel-based SRC by incorporating the contextual information of neighbouring pixels. However, different areas need different region scales, and there may exist some less-correlated pixels in a local patch due to the spectrally heterogeneous features in HSI scenes, even though neighbouring pixels tend to have similar spectral signatures. Another situation that should be considered, according to Li [40], is that a general dictionary constructed from the whole set of training samples may include outliers.
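A minimal SOMP sketch for Equation (9) follows; it generalizes the OMP sketch above by scoring every atom against all columns of the joint signal matrix at once, so one common row support of R is shared by every pixel in the patch. The function name and conventions are ours, hedged as before.

```python
import numpy as np

def somp(D, X, sparsity):
    """Simultaneous OMP for X ~ D R (Equation (9)): enforce a common
    support (set of nonzero rows of R) for all W pixels in the patch."""
    residual = X.copy()                      # L x W residual matrix
    support = []
    R = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(sparsity):
        # total correlation energy of each atom across all pixels
        scores = np.linalg.norm(D.T @ residual, axis=1)
        support.append(int(np.argmax(scores)))
        coef, *_ = np.linalg.lstsq(D[:, support], X, rcond=None)
        residual = X - D[:, support] @ coef
    R[support, :] = coef
    return R
```

The class label then follows from the class-wise residuals ||X − D_c R̂_c||_F exactly as in Equations (10) and (11).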

Adaptive Weight Joint Sparse Model (AJSM) and Multi-Level Sparse Representation Model (MLSR)
We introduce an adaptive weight joint sparse model (AJSM) and a multi-level joint sparse representation model (MLSR) for HSI classification in this paper. Multiple local signal matrices are constructed using different parameters to realize the similarity learning in MLSR; in fact, AJSM is a simple form of MLSR. The proposed AJSM is expected to improve the classification accuracy in heterogeneous areas by not taking all the neighbouring pixels to construct the joint sparse matrix. Additionally, MLSR improves the classification results by selecting neighbouring pixels at various levels using the proposed adaptive neighbour selection strategy.
To better understand the procedure of the proposed method, a flowchart is shown in Figure 1, where each component of the method is explained in detail in the following sections.



Adaptive Local Signal Matrix
In order to select reasonable neighbours to construct the joint matrix, the weighted Euclidean distances between the test pixel and its neighbours are used. We first select a region with a window size of √W × √W centred at the test pixel x_i. Different weights are given to each spectral band according to its contribution to the whole spectral characteristics. The weighting strategy is described as follows:

A⟨x_i, x_j⟩ = ( Σ_{l=1}^{L} w_l (x_{il} − x_{jl})^2 )^{1/2},  with w_l = exp(α I_l) / Σ_{l'=1}^{L} exp(α I_{l'})  and  I_l = Σ_{c=1}^{C} (x̄_{cl} − x̄_l)^2 / ( Σ_{c=1}^{C} Σ_{j=1}^{N} In(y_j = c)(x_{jl} − x̄_{cl})^2 ),  (12)

where A⟨x_i, x_j⟩ is the weighted distance between pixels x_i and x_j, w_l is the weight for the l-th feature, and w_l is determined by the training samples from the different classes. α is a positive parameter that controls the influence of the class-specific distance I_l: if α = 0, the distance between two pixels reduces to the equal-weight Euclidean distance, and if α is large enough, the weights are dominated by I_l. In(·) denotes an indicator function which takes the between-class and within-class distances into account, x̄_cl is the average of the c-th class for the l-th feature, x̄_l is the average of all training samples for the l-th feature, and y_i is the label of pixel x_i. Pixels within a predefined distance can be selected as similar neighbours according to this measure; in other words, the adaptive neighbour selection strategy identifies samples with similar characteristics to form a group. The superiority of this weighting strategy over other weighting schemes is that it considers both the spectral similarities at the pixel level and the discriminative information among different groups, which can be obtained from the training samples.
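A short sketch of the weighting step may help; it assumes the exponential normalization given in Equation (12) above, and the helper names (fisher_scores, band_weights, weighted_distance) are illustrative rather than the authors' code.

```python
import numpy as np

def fisher_scores(X_train, y_train):
    """I_l: per-band ratio of between-class to within-class distance."""
    grand_mean = X_train.mean(axis=0)                 # x_bar_l
    between = np.zeros(X_train.shape[1])
    within = np.zeros(X_train.shape[1])
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        class_mean = Xc.mean(axis=0)                  # x_bar_cl
        between += (class_mean - grand_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)

def band_weights(scores, alpha):
    """w_l of Equation (12); alpha = 0 yields equal weights, i.e.,
    the plain Euclidean distance."""
    e = np.exp(alpha * scores)
    return e / e.sum()

def weighted_distance(xi, xj, w):
    """A<xi, xj>: band-weighted Euclidean distance between two pixels."""
    return np.sqrt(np.sum(w * (xi - xj) ** 2))
```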

Adaptive Weight Joint Sparse Model
The goal of Equation (12) is to find the optimal samples to reconstruct the central pixel. Once the appropriate weights are assigned to each spectral band, the weighted distances between the test pixel and its neighbouring pixels can be evaluated. Based on a top-N-nearest strategy, the N nearest neighbouring pixels are chosen to form the adaptive weight joint signal matrix, which relaxes the joint sparse model (Equation (9)). Here we define S_N as the weighted signal matrix chosen from the original joint signal matrix X = [x_1, x_2, ..., x_W]; in other words, the N nearest pixels are selected from the W pixels based on the adaptive weighting scheme above. The adaptive weight joint sparse model can be expressed as:

R̂ = argmin_R ||S_N − DR||_F subject to ||R||_{row,0} ≤ P.  (13)

The label of the central pixel is identified by minimizing the class residual:

class(x_i) = argmin_{c=1,...,C} ||S_N − D_c R̂_c||_F.  (14)

The procedure of AJSM is summarized in Algorithm 1.

Algorithm 1. AJSM.
1. Compute w_l for each spectral band according to Equation (12).
2. For each test pixel x_i ∈ X_T:
construct the weighted signal matrix S_N according to Equation (12) and normalize the columns of S_N to have unit ℓ2 norm;
calculate the sparse coefficient matrix R with the dictionary D from Equation (13) using SOMP;
determine the class label y_i for the test pixel x_i by Equation (14).
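Given the weighted distances, building S_N is a top-N selection inside the window. A sketch under the same assumptions as the previous snippets (window_pixels holds the W window pixels row-wise, and w comes from band_weights above):

```python
import numpy as np

def select_neighbours(window_pixels, centre, w, n_keep):
    """Form the AJSM signal matrix S_N (Equations (13)-(14)): keep the
    n_keep window pixels closest to the centre under the weighted
    distance, then normalize columns to unit l2 norm."""
    dists = np.array([np.sqrt(np.sum(w * (p - centre) ** 2))
                      for p in window_pixels])
    order = np.argsort(dists)[:n_keep]    # centre itself has distance 0
    S = window_pixels[order].T            # L x n_keep signal matrix
    return S / np.linalg.norm(S, axis=0)
```

The resulting S_N simply replaces X in the SOMP call of the earlier sketch, and the label follows from the class-wise Frobenius residuals as in Equation (14).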
It has been identified that neighbouring pixels in the heterogeneous areas of HSI may consist of different types of materials. JSM cannot perform well in such areas because, by its definition, neighbouring pixels are assumed to have similar labels. The proposed AJSM is expected to improve the classification accuracy in these areas by not taking all the neighbouring pixels to construct the joint sparse matrix.

Multi-Level Weighted Joint Sparse Model
The neighbouring pixels selected at a fixed scale using a single-level criterion, as in JSM and AJSM, may not contain complementary and accurate information, whereas neighbouring pixels selected at different criterion levels can help represent the data more completely. We therefore propose a multi-level weighted joint sparse model to fully integrate the neighbour information, as well as to prevent outliers from dominating the sparse coding. For a test pixel, its neighbouring pixels are selected by the proposed adaptive neighbour selection strategy with different distance threshold values; the multiple joint signal matrices are then constructed from the neighbouring pixels corresponding to each threshold level. The details of this method are described as follows.
Assume that S_{i,k} is the k-th joint signal matrix constructed for pixel x_i. Here we define S_{i,k} using a weight function that determines whether a pixel x_{i,j} is preserved to reconstruct x_i, where x_{i,j} is the j-th sample in the given region restricted by the scale √W × √W. In Equation (12), A⟨x_i, x_{i,j}⟩ is a monotonically increasing function of the weighted distances. Although there are many ways to define the weight function, we define it as a piecewise constant to simplify the selection of the different joint signal matrices:

w_k(x_i, x_{i,j}) = 1 if A⟨x_i, x_{i,j}⟩ ≤ ε_k, and w_k(x_i, x_{i,j}) = 0 otherwise,  (15)

where ε_k is a threshold controlling the value of the corresponding element in S_{i,k}. According to Equations (12) and (15), when a pixel in the window satisfies A⟨x_i, x_{i,j}⟩ ≤ ε_k, the corresponding term is selected to reconstruct the test pixel. In other words, S_{i,k} is constructed from the pixels whose weighted distances to the test pixel x_i are no greater than ε_k. By using the proposed scheme, we can generate different patches with various values of ε:
• ε = 0: this is an independent set. In this situation, only the central pixel itself is selected, which means that the joint sparse model reduces to a pixel-wise sparse representation model.
• ε large enough: all the pixels in the window are selected, and the model reduces to the conventional JSM.
As described above, for each test pixel x_i, when different parameters {ε_1, ..., ε_k, ..., ε_K} are applied, K different patches can be generated to represent this pixel with the inner contextual information involved. Our next task is to construct the multi-level joint sparse representation model for the test pixel.
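The level construction of Equation (15) amounts to thresholding the same weighted distances at the K values of ε; a sketch, again with our own helper names and the row-wise window convention used above:

```python
import numpy as np

def build_levels(window_pixels, centre, w, epsilons):
    """One joint signal matrix S_{i,k} per threshold eps_k: a window
    pixel is kept at level k iff its weighted distance to the centre
    is at most eps_k (Equation (15)); eps = 0 keeps only the centre."""
    dists = np.array([np.sqrt(np.sum(w * (p - centre) ** 2))
                      for p in window_pixels])
    levels = []
    for eps in epsilons:
        S = window_pixels[dists <= eps].T          # L x (kept pixels)
        levels.append(S / np.linalg.norm(S, axis=0))
    return levels
```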

Multi-Level Joint Sparse Representation
The SRC has been successfully used for HSI classification; herein we extend it to a multi-level version for the classification task. After the K different patches are constructed for each pixel, the patches for the test pixel can be arranged as a set of feature matrices {S_{i,1}, ..., S_{i,k}, ..., S_{i,K}}. In this paper, let D = {D^1, ..., D^k, ..., D^K} be a set of dictionaries which can be learnt from all the training data for the K patches, where D^k is the dictionary learnt for the k-th level. Each dictionary D^k is composed of the sub-dictionaries of all the labelled classes, i.e., D^k = [D^k_1, ..., D^k_C], where D^k_c denotes the sub-dictionary of the c-th labelled class. The sparse representation of the test pixel x_i with its k-th patch can be described as:

S_{i,k} = D^k Q^k,  (16)

where Q^k holds the sparse representation coefficients for the specific patch S_{i,k}. Equation (16) expresses how each of the K patches is sparsely represented when its sparse coefficients are given. Considering all the K patches, Equation (16) can be rewritten as:

[S_{i,1}, ..., S_{i,K}] = [D^1 Q^1, ..., D^K Q^K],  (17)

where Q = [Q^1, ..., Q^k, ..., Q^K] is composed of the K coefficient matrices, each of which contains the sparse representation coefficients corresponding to the dictionary of a specific patch.
Since the pixels belonging to the same class should be represented in the same subspace spanned by the training samples, the class-specific multi-level joint representation optimization problem can be written as:

Q̂ = argmin_Q Σ_{k=1}^{K} ||S_{i,k} − D^k Q^k||_F subject to ||Q^k||_{row,0} ≤ P, k = 1, ..., K.  (18)

This problem can be decomposed into K sub-problems. In this paper, SOMP is used to solve the optimization function (Equation (18)), and it does so efficiently within a few iterations. Algorithm 2 presents the implementation of the proposed framework.
After the sparsity coefficients are obtained, a given test pixel x_i is assigned to the class which gives the smallest reconstruction residual:

class(x_i) = argmin_{c=1,...,C} E_c(x_i),  (19)

where E_c(x_i) is the reconstruction residual of x_i:

E_c(x_i) = Σ_{k=1}^{K} ||S_{i,k} − D^k_c Q̂^k_c||_F,  (20)

where D^k_c is the dictionary for the c-th class over the k-th patch, and Q̂^k_c denotes the sparse coefficient matrix corresponding to D^k_c.
Algorithm 2. MLSR.
Input: training dataset of the c-th class: X_c; region scale: W; number of levels: K; distance threshold controlling parameters: ε_1, ..., ε_K; test dataset: X_T.
Initialization: initialize the dictionary D_c = X_c, and normalize the columns of the dictionary to have unit ℓ2 norm.
1. Compute w_l (l = 1, 2, ..., L) according to Equation (12) using the training datasets X_c and the corresponding labels.
2. For each test pixel x_i ∈ X_T, compute the adaptive weighted distances A⟨x_i, x_j⟩ between the test pixel and all the pixels in the selected neighbour region based on Equation (12).
3. Compute S_{i,k} based on Equation (15).
4. For k = 1 : K, compute Q^k for each level and each class using SOMP.
5. Compute the class label y_i for the test pixel based on Equations (19) and (20).
Output: 2-dimensional classification map.
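Putting the pieces together, Algorithm 2 reduces to K independent SOMP problems followed by the summed residual rule of Equations (19) and (20). The sketch below reuses somp and build_levels from the earlier snippets and, like them, is an illustrative reading rather than the authors' implementation (a single shared dictionary D is assumed for all levels here, for simplicity).

```python
import numpy as np

def mlsr_classify(levels, D, labels, sparsity=3):
    """Solve Equation (18) level by level with SOMP, then assign the
    label minimizing the accumulated residual E_c (Eqs. (19)-(20))."""
    classes = np.unique(labels)
    E = np.zeros(len(classes))
    for S in levels:                      # K independent sub-problems
        R = somp(D, S, sparsity)
        for ci, c in enumerate(classes):
            mask = labels == c
            # class-wise Frobenius reconstruction residual
            E[ci] += np.linalg.norm(S - D[:, mask] @ R[mask], ord='fro')
    return classes[int(np.argmin(E))]
```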

Data Description
To validate the proposed methods, three benchmark datasets are used in the experiments.
1. AVIRIS dataset: Indian Pines. The image was acquired by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in northwestern Indiana, USA. It consists of 145 × 145 pixels with a 20 m spatial resolution and 16 labelled classes, mostly agricultural; 200 spectral bands remain after removing the water absorption bands.
2. ROSIS dataset: Pavia University. The image was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) sensor over the University of Pavia, northern Italy. It consists of 610 × 340 pixels with a 1.3 m spatial resolution, 103 spectral bands, and nine labelled classes.
3. AVIRIS dataset: Salinas. The image was also acquired by an AVIRIS sensor, over Salinas Valley, CA, USA. The image is of 512 × 217 size with 224 spectral bands; in the experiments, 20 water absorption bands (no. 108-112, 154-167, and 224) are removed. Salinas has a 3.7 m resolution per pixel and 16 different classes, including vegetables, bare soils, and vineyard fields. Due to the spectral similarity of most classes, this dataset has been frequently used as a benchmark for HSI classification.
The ground truths of three datasets, as well as the false colour composite images are illustrated in Figure 2.


Description of Comparative Classifiers and Parameter Settings
In this paper, the proposed AJSM and MLSR are compared with several benchmark classifiers: pixel-wise SVM (referred to as SVM), EMP with SVM (referred to as EMP), pixel-wise SRC (referred to as SRC), and JSM with a greedy pursuit algorithm [28]. Pixel-wise SVM and pixel-wise SRC classify the images with only spectral information, while JSM, AJSM, and MLSR are sparse representation-based classifiers with spatial information utilized.
During the experiments, the range of each parameter is empirically determined and the optimal values are selected by cross-validation. The parameters for pixel-wise SVM are set as the default ones in [4] and implemented using the SVM library with Gaussian kernels [41]. Parameters for EMP and pixel-wise SRC are set up by following the instructions in [14] and [28], respectively. The selected regions for JSM, AJSM, and MLSR are set as 3 × 3, 5 × 5, 7 × 7, 9 × 9, 11 × 11, 13 × 13, and 15 × 15, and the best result is reported in this paper. For AJSM, the number of pixels selected in the given region is set to 7, 20, 40, 50, 50, 50, and 50 for the abovementioned scales, respectively. For the proposed MLSR, the number of threshold values for ε is set to seven: {0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 1}. The predefined sparsity level is set to 3 for each dataset.
Quantitative metrics, namely the overall accuracy (OA), average accuracy (AA), and Kappa coefficient, are adopted to validate the proposed method. All the experiments in this paper are repeated ten times and the mean accuracy is reported.
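For reference, all three metrics can be read off the confusion matrix; the short utility below follows the standard definitions and is our own illustration, not part of the authors' toolchain.

```python
import numpy as np

def accuracy_metrics(y_true, y_pred, n_classes):
    """Overall accuracy, average (per-class) accuracy, and the Kappa
    coefficient, all computed from the confusion matrix."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    total = cm.sum()
    oa = np.trace(cm) / total                       # overall accuracy
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))      # mean class accuracy
    p_e = (cm.sum(axis=0) @ cm.sum(axis=1)) / total ** 2
    kappa = (oa - p_e) / (1 - p_e)                  # chance-corrected
    return oa, aa, kappa
```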

Experimental Results
The first experiment was performed on the Indian Pines image. We randomly selected 10% of the samples from each class as training data and used the remainder as the test dataset. The optimal parameters in this experiment are set as α = 0.2 and W = 13 × 13. The numbers of training and test samples for each class are described in Table 1. Classification results are listed in Table 2, and the classification maps are shown in Figure 3. One can observe that the classification maps obtained by pixel-wise SVM and pixel-wise SRC have a noisier appearance than those of the other classifiers, which confirms that the contextual information is important for hyperspectral image classification. By considering the spatial information, JSM gives a smoother result; however, it still fails to classify some near-edge areas. EMP, AJSM, and the proposed MLSR deliver better results, and MLSR shows the highest classification accuracy. From Figure 3, one can see that MLSR further provides a smoother classification result and preserves more useful information for HSI.
The proposed AJSM improves the classification capability of JSM by exploring the different contributions of the neighbouring pixels in the selected region, which confirms the effectiveness of the adaptive weight matrix scheme. However, one can see that AJSM produces a relatively lower accuracy for oats, which has limited training samples. The improvement of MLSR-based classification of alfalfa and oats, which are considered small classes, indicates that the proposed method can perform well on classes with fewer training samples. In addition, the adaptive local matrix imposes a local constraint on the sparsity, which improves the performance. As can be observed from the classification maps, our proposed method has a better capability to identify the near-edge areas, which benefits from the selection of the most similar pixels to reconstruct the test pixel. The accuracies for MLSR are very high, which indicates that JSM can be significantly improved by multiple feature extraction approaches.
The second experiment is conducted on the Pavia University image, and Table 3 shows the class information. We randomly selected 250 samples as the training data, and the rest as test data. The optimal parameters in this experiment are set as α = 0.2 and W = 15 × 15. Classification results and maps are illustrated in Table 4 and Figure 4, respectively. It is obvious that the multi-level information can, indeed, improve the classification results for the Pavia University image compared to the other SRC-based methods and the popular SVMs. The improvement of MLSR over JSM suggests that the local adaptive matrix can preserve the most useful information and reduce the redundant information. The result is consistent with the previous experiment on the Indian Pines image, where the edge pixels are predicted more precisely.
The third experiment is conducted on the Salinas imagery. For each class, 1.5% of the samples are selected as the training data, and the remainder as the test dataset. The optimal parameters in this experiment are set as α = 0.2 and W = 15 × 15. The class information and classification results are given in Tables 5 and 6, respectively. The results are also visualized in the classification maps shown in Figure 5. One can observe that the proposed MLSR yields the best accuracy for most of the classes, especially for classes 15 and 16. Furthermore, the proposed MLSR identifies the edge areas best.

Effects of Different Kinds of Parameters
This section focuses on the effects of the parameter settings on the classification performance. We first varied the value of the positive parameter α, which controls the influence of the ratio of the between-class and within-class distances, from 0 to 1 at 0.2 intervals. The experiments were conducted with AJSM on the three datasets, and the window sizes were fixed at the corresponding optimal values. As shown in Figure 6, the overall accuracies for the three datasets fluctuate within a small range, and the best performance was obtained when α was set to 0.2 for all three datasets, although their trends differed. As α only controls the influence of each feature band, it is reasonable to apply the same value for MLSR in the experiments.

The effect of the region scale on JSM, AJSM, and MLSR has also been analysed in the experiments. In order to simply show the trends, the numbers of training and test samples are the same as in the previous experiments. The OA is shown in Figure 7 for region scales ranging from 3 × 3 to 29 × 29 at 2 × 2 intervals. As shown in Figure 7, the best OA is achieved for JSM when the scale is set to 7 × 7, 11 × 11, and 15 × 15 for Indian Pines, Pavia University, and Salinas, respectively; as the scale increases further, the accuracy decreases dramatically. In most situations, AJSM performs better than JSM because the most useful information is preserved and the redundant information is rejected by the selection strategy. The accuracy of MLSR becomes stable when a larger region is selected. More specifically, the proposed MLSR performs better than the other joint sparsity-based models in most regions. This result benefits from its mechanism of discarding outliers in a specific area, which provides a more reliable dictionary.

Another consideration is the number of patches that should be tested, i.e., is having more patches better? To evaluate this, the adaptive framework is used to generate more patches. Specifically, with ε set to {0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1}, together with the independent set ε = 0, we can define 11 patches. In each experiment, we randomly selected a patch subset of size K ∈ {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11} from these 11 patches and evaluated the performance of the method on the three datasets. For each value of K, the experimental procedure was repeated 10 times with different subset selections. Figure 8 shows the average OA over the 10 iterations. As K increases, the performance of the framework also increases when K ≤ 7; however, it slightly decreases when K ≥ 8. This trend shows that a certain number of patches is necessary to improve the performance of the proposed method, whereas too many patches can result in a slight decrease in performance. In the experiments, K is therefore set to seven.
We also conducted experiments to evaluate the impact of the number of training samples per class for pixel-wise SVM, pixel-wise SRC, EMP with SVM, single-scale JSM, and the proposed MLSR. AJSM is not considered in this experiment as it exhibits a trend similar to JSM. Training samples are randomly chosen, and the rest are used as test samples. For the Indian Pines dataset, the amount of training data ranges from 5% to 40% of the whole pixel count at 5% intervals; for the Pavia University dataset, the number of training samples per class ranges from 150 to 500 at intervals of 50; for the Salinas dataset, the number of training samples per class ranges from 50 to 400 at intervals of 50. Figure 9 illustrates the classification results (OA) for these three datasets. As can be observed, less than 5% of the samples per class are needed to obtain an OA over 90% for the Indian Pines dataset using the proposed MLSR. This is very promising because it is often difficult to collect a large training dataset in practice. For the Pavia University dataset, only 150 training samples are needed to obtain an OA of 95%; in fact, this accuracy is 3% higher than that of JSM and 4.5% higher than that of EMP with SVM, owing to the local information exploited by the proposed MLSR. The same trend can be concluded for the Salinas dataset. In addition, the proposed MLSR produces very high accuracy and remains robust as the number of training samples increases, and it can be observed that MLSR performs very well when training samples are limited.

Conclusions and Future Research Lines
In this paper, we have introduced two novel sparse representation-based hyperspectral classification methods. These proposed methods employ an adaptive weight matrix scheme as the neighbour selection strategy for the joint sparse matrix construction. The adaptive weight joint sparse model outperforms the traditional joint sparse models; however, it is designed for simple cases rather than for complicated situations where the number of labelled training samples is insufficient. This was overcome by introducing the second model, i.e., the multi-level joint sparse model, which can solve the complex classification problem in a more effective way. The multi-level joint sparse model consists of two main parts: adaptive locality patches and a multi-level joint sparse representation model. This model is introduced to fully explore the spatial context within a given region for the test pixel. The proposed methods locally smooth the classification maps and preserve the relevant information for most labelled classes. Compared with other spatial-spectral methods and sparse representation-based approaches, the proposed methods provide a better performance on real hyperspectral scenes, which is consistent with the observation from the classification maps. Moreover, the experiments on the impact of the number of training samples also indicate that the proposed multi-level sparse approach leads to a more reliable result when only a limited number of training samples are available.
Author Contributions: Q.G. formulated and directed the methodology. S.L. and X.J. supervised the data processing. Q.G. prepared the manuscript and interpreted the results supported by S.L. and X.J. All authors contributed to the methodology validation, results analysis, and reviewed the manuscript.
