Low-Rank Hypergraph Hashing for Large-Scale Remote Sensing Image Retrieval

Abstract: As the number of remote sensing (RS) images increases dramatically, the demand for remote sensing image retrieval (RSIR) is growing and has received more and more attention. The characteristics of RS images, e.g., large volume, diversity and high complexity, make RSIR challenging in terms of both speed and accuracy. To reduce retrieval complexity, hashing techniques have been widely used for RSIR, mapping high-dimensional data into a low-dimensional Hamming space while preserving the similarity structure of the data. To improve hashing performance, we propose a new hash learning method, named low-rank hypergraph hashing (LHH), for the large-scale RSIR task. First, LHH employs an l2,1-norm to constrain the projection matrix to reduce noise and redundancy among features; low-rankness is also imposed on the projection matrix to exploit its global structure. Second, LHH uses hypergraphs to capture high-order relationships among data, which is well suited to exploring the complex structure of RS images. Finally, an iterative algorithm is developed to generate high-quality hash codes and efficiently solve the proposed optimization problem with a theoretical convergence guarantee. Extensive experiments are conducted on three RS image datasets and one natural image dataset that are publicly available. The experimental results demonstrate that the proposed LHH outperforms existing hashing methods in RSIR tasks.


Introduction
With the development of satellite technology, the quantity and quality of remote sensing (RS) images have increased dramatically. Retrieving similar RS images from large-scale RS datasets is therefore both important and demanding [1][2][3]. Content-based image retrieval (CBIR) [4][5][6] is widely used in many real-world tasks, such as natural image retrieval and network searches. Nevertheless, RS images usually contain large variations due to their large data volume, small object sizes and rich backgrounds [7,8], so how to extract valuable information and adapt existing CBIR methods to remote sensing image retrieval (RSIR) is considered a key issue [9,10].
Hashing learning has become more and more important for large-scale retrieval, due to its advantages in terms of computation and storage [11][12][13]. In recent years, several hashing-based methods have been proposed for large-scale RSIR tasks [14][15][16][17][18]. Partial randomness hashing (PRH) [14] employs random projections to map images to a low-dimensional Hamming space, and trains a linear model for mapping from the Hamming space back to the original space. Demir et al. combine deep learning and hashing learning [16]. To introduce deep hashing neural networks (DHNNs) into large-scale RSIR tasks, Li et al. conduct a comprehensive study of DHNN systems [17]. To capture intra-class distribution and inter-class ranking, Fan et al. propose a distribution consistency loss (DCL) to extract informative data and build a more informative structure [18]. These approaches only use pairwise similarity to capture the relationships among data, although the relationships among RS images are more complex and high-order.
A natural way of capturing the complex structure among RS images is a hypergraph. Hypergraphs [19] generalize conventional graphs, allowing one edge to connect more than two vertices. Therefore, hypergraphs can capture complex and high-order relationships, and have been used in image annotation, image ranking and feature selection [19,20]. Recently, hypergraph spectral hashing (HSH) methods [21][22][23] have received considerable attention. For example, hypergraph spectral learning [21] is proposed for multi-label classification. To further exploit correlation information, [22] introduces a transductive learning framework based on a probabilistic hypergraph. [23] applies a hypergraph in conventional spectral hashing for searching social images. Although these methods improve performance with hypergraphs, none of them exploits label information. In addition, feature and sample noise is ignored in these methods.
To address the aforementioned problems, a new method, low-rank hypergraph hashing (LHH), is proposed to deal with large-scale RSIR tasks. The flowchart of the proposed LHH is shown in Figure 1, where we see that the proposed LHH is a shallow model, although its key components could also be incorporated into deep extensions of LHH. The main contributions of the proposed LHH are summarized as follows: (1) The LHH employs an l2,1-norm to constrain the projection matrix to reduce noise and redundancy among features. In addition, low-rankness is imposed on the projection matrix to exploit its global structure.
(2) The LHH exploits the hypergraph to capture the high-order relationship among data, and is very suitable to explore the complex structure of RS images.
(3) Finally, the proposed LHH is evaluated on three large-scale remote sensing datasets and one natural image dataset. The experimental results show that the proposed LHH outperforms several existing hashing methods in large-scale RSIR tasks. The rest of the paper is organized as follows. The notation and related work are presented in Section 2. The proposed method is discussed in Section 3. The extensive experimental evaluations are presented in Section 4. Finally, a conclusion is given in Section 5.


Notation and Related Work
The notation of the paper and the recent advances of hashing techniques, low-rank analysis and hypergraph learning are reviewed in this section.

Notation
In this paper, we represent matrices and vectors as boldface italic letters, and scalars as normal italic letters. For a matrix X = [x_ij], its i-th row and j-th column are denoted as x^i and x_j, respectively. The transpose, inverse and trace of X are denoted as X^T, X^{-1} and tr(X), respectively. The Frobenius norm and l2,1-norm of X are denoted as ||X||_F and ||X||_{2,1}, respectively. The important notations used in the paper are summarized in Table 1.

Hashing Learning
Hashing has become a key step in facilitating large-scale image retrieval [24]. Essentially, hashing maps high-dimensional data into a low-dimensional Hamming space while preserving the similarity structure among data. Representative hashing methods include locality-sensitive hashing (LSH) [25], spectral hashing (SH) [26], partial randomness hashing (PRH) [14], supervised hashing with kernels (KSH) [27] and supervised discrete hashing (SDH) [12]. LSH [25] obtains hash functions via random projections. The original metrics are theoretically guaranteed to be preserved in the Hamming space, but LSH often requires a long code length to reach high precision. SH [26] preserves the similarity distribution among data, and the binary codes are constrained to be balanced and uncorrelated. PRH [14] establishes a partially stochastic strategy to enable good approximation and fast learning when constructing hashing functions. KSH [27] maps the data into binary codes whose Hamming distances are minimized on similar pairs and maximized on dissimilar pairs. Based on the equivalence between optimizing the code inner product and the Hamming distance, KSH can train the hash function efficiently. SDH [12] aims to learn hash codes that are good for classification. To deal with the non-deterministic polynomial hard (NP-hard) binary constraint, SDH develops a cyclic coordinate descent to generate good hash codes, which admits an analytical solution.
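As a concrete illustration of the random-projection idea behind LSH, the following sketch (our own minimal example, not code from any of the cited papers; function names are ours) hashes samples with random hyperplanes:

```python
import numpy as np

def lsh_hash(X, n_bits, seed=0):
    """Hash the rows of X into n_bits-bit codes in {-1, +1} using
    random hyperplanes (the random-projection idea behind LSH)."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], n_bits))  # one hyperplane per bit
    return np.where(X @ R >= 0, 1, -1)

# Nearby points mostly fall on the same side of each hyperplane,
# so their codes agree on most bits.
rng = np.random.default_rng(42)
X = rng.standard_normal((4, 64))
X_near = X + 0.01 * rng.standard_normal((4, 64))
B1 = lsh_hash(X, 32)
B2 = lsh_hash(X_near, 32)
hamming = (B1 != B2).sum(axis=1)   # small for perturbed copies
```

Because the projections are data-independent, precision only improves with longer codes, which is the drawback noted above.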

Low-Rankness Analysis
The low-rankness property has attracted more and more attention, since it enables finding low-rank structures in high-dimensional data corrupted by noise and outliers [28,29]. With low-rank constraints, the computational complexity can be greatly reduced. It has been proved in [30] that low-rank regression is equivalent to regression in the subspace generated by linear discriminant analysis (LDA). Low-rank representation (LRR) [31] seeks the lowest-rank representation in which data can be represented as linear combinations of the basis in a given dictionary. To further enhance LRR, latent low-rank representation (LatLRR) [32,33] was proposed to recover the unobserved data in LRR.
Since the low-rank optimization problem is NP-hard, the nuclear norm minimization problem is solved instead via the alternating direction method of multipliers (ADMM) [34]. Low-rankness can be used in matrix decomposition due to its advantage in "de-correlation". Specifically, given a matrix X ∈ R^{m×n} with rank r, low-rank decomposition solves the following optimization problem: min_{M,N} ||X − MN||_F^2, where M ∈ R^{m×r} and N ∈ R^{r×n}. When r is small, the size of X is larger than the sum of the sizes of M and N. Therefore, low-rank matrix decomposition can remove the correlation among data, thereby reducing storage.
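The storage argument can be illustrated with a small sketch (our own toy example, using a truncated SVD as one way to obtain the decomposition X = MN):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a 200 x 100 matrix with exact rank 5.
r = 5
X = rng.standard_normal((200, r)) @ rng.standard_normal((r, 100))

# Rank-r decomposition X = M N via truncated SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
M = U[:, :r] * s[:r]          # 200 x 5
N = Vt[:r, :]                 # 5 x 100

# Reconstruction is (numerically) exact, and storage shrinks:
err = np.linalg.norm(X - M @ N)
storage_full = X.size                  # 20000 entries
storage_lowrank = M.size + N.size      # 1500 entries
```

Here 1500 entries suffice in place of 20,000, which is the "de-correlation" storage saving described above.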

Hypergraph Learning
Since conventional graphs fail to fully exploit the relationships among data, the hypergraph has been widely used to characterize the complex relationships among complex data [22,23]. Specifically, a hypergraph generalizes the conventional graph: one edge can connect more than two vertices and thus capture high-order information. Different from simple graphs, hypergraphs contain local grouping information that is beneficial to clustering. In [35], Huang et al. construct hyperedges among images based on shape and appearance features in their regions of interest (ROI), and perform spectral clustering for unsupervised image categorization. In [22], a transductive learning framework is introduced to further explore the correlations. This approach constructs a probabilistic hypergraph, and hypergraph ranking is further employed. To accelerate similarity search, the authors in [23] extend the traditional unsupervised hashing method to a hypergraph to capture high-order information among social images. Besides, Bu et al. have modeled the relationships among different entities using a hypergraph for recommendation in the social-media community [36].

Proposed Low-Rank Hypergraph Hashing (LHH) Method
This section introduces the proposed LHH. Section 3.1 gives the notation and problem statement. The details of the proposed LHH are presented in Section 3.2. Section 3.3 presents the optimization of the proposed LHH. Section 3.4 introduces hash function learning. Finally, Section 3.5 presents the convergence analysis.

Problem Statement
Suppose O = {o_1, . . . , o_n} is a set of images, and we are given its feature matrix X = [x_1; . . . ; x_n] ∈ R^{n×m}, where m is the feature dimensionality and n is the number of images. We represent the hash code matrix as B = [b_1; . . . ; b_n] ∈ {−1, 1}^{n×l}, where b_i is the hash code of o_i and l is the code length. The hash function HF(x) = sgn(F(x)) encodes x into an l-bit hash code, where sgn(·) is the sign function, which outputs +1 for positive numbers and −1 otherwise. LHH aims to learn B and the hash function HF so as to preserve the similarity structure of the images.

Low-Rank Hypergraph Hashing
To consider the supervised information, we learn the binary codes in the context of classification, enabling the binary codes to be optimal for the jointly learned classifier. Thus, good binary codes should also be good for classification.
Given a binary code b, we adopt the following multi-class classification formulation:

y = G(b) = bW = [b w_1, . . . , b w_c], (2)

where w_k ∈ R^{l×1}, k = 1, . . . , c, is the projection vector for class k, and y ∈ R^{1×c} is the label vector, whose maximum entry indicates the assigned class of x.
We choose to optimize the following problem:

min_{B,W} Σ_{i=1}^{n} L(y_i, G(b_i)) + λ1 R(W), s.t. B ∈ {−1, 1}^{n×l}, (3)

where L(·) is the loss function, R(W) is a regularizer and λ1 is a regularization parameter. Y = [y_i]_{i=1}^{n} ∈ R^{n×c} is the ground-truth label matrix, where y_ik = 1 if x_i belongs to class k and 0 otherwise. Equation (3) is flexible, and any loss function can be used for L(·). For simplicity, we choose the simple l2 loss, which minimizes the difference between the label Y and the prediction G(b). The problem in Equation (3) can then be transformed into:

min_{B,W} ||Y − BW||_F^2 + λ1 ||W||_F^2, s.t. B ∈ {−1, 1}^{n×l}. (4)

To enable the coefficients of data in the same space to be highly correlated, we apply a low-rank constraint to capture the global structure of the whole data. In addition, the low-rank structure relieves the impact of noise and makes the regression more accurate [37,38]. To impose the low-rank structure of W, we require:

rank(W) = r. (5)

We decompose W into two low-rank matrices, i.e., W = AC, where A ∈ R^{l×r}, C ∈ R^{r×c}, and r is the rank of W. Then, Equation (4) can be further transformed into:

min_{B,A,C} ||Y − BAC||_F^2 + λ1 ||AC||_F^2, s.t. A^T A = I_r, B ∈ {−1, 1}^{n×l}, (6)

where the orthogonality constraint A^T A = I_r (I_r ∈ R^{r×r}) is introduced for identifiability. Besides, we additionally enforce sparsity, i.e., an l2,1-norm for feature selection [39]. Thus, the above problem is rewritten as:

min_{B,A,C} ||Y − BAC||_F^2 + λ1 ||AC||_{2,1}, s.t. A^T A = I_r, B ∈ {−1, 1}^{n×l}. (7)

In Equation (7), both low-rankness and sparsity are considered to learn the regression coefficient matrix: low-rankness deals with noise, and the l2,1-norm selects features by setting some rows of W to zero. So far, the similarity structure among data has not been considered. If two samples are similar, their corresponding binary codes should also be close. To preserve the original local similarity structure, we aim to minimize

Σ_{i,j} s_ij ||b_i − b_j||_2^2, (8)

where S (s_ij ∈ S) is the similarity matrix that records the similarities among data, in which s_ij represents the relationship between the i-th and the j-th samples.
Normally, graphs are constructed with the following formulation:

s_ij = exp(−||x_i − x_j||_2^2 / σ^2), (9)

where σ is the kernel width and ||x_i − x_j||_2^2 denotes the distance between two samples. Here, we instead use a hypergraph to measure the similarity among data. Figure 2 shows the distinction between a normal graph and a hypergraph. As can be seen, an edge in a normal graph only connects two samples, while a hyperedge can connect more than two samples. Therefore, a hypergraph can reveal more complex relationships among data [23]. We formulate the incidence matrix H between the vertices and the hyperedges of the hypergraph as:

h(v, e) = 1 if v ∈ e, and h(v, e) = 0 otherwise. (10)
The degree d(v) of a vertex v and the degree δ(e) of a hyperedge e are defined as follows:

d(v) = Σ_{e∈E} w(e) h(v, e), δ(e) = Σ_{v∈V} h(v, e), (11)

where w(e) is the weight of hyperedge e. With the above definitions, to preserve the similarity of hash codes, we aim to map data on the same hyperedge into more similar hash codes. Thus, we seek the hash codes by minimizing the average (normalized) Hamming distance between hash codes of data on the same hyperedge:

min_B Σ_{e∈E} Σ_{v_i,v_j∈e} (w(e)/δ(e)) || b_i/√d(v_i) − b_j/√d(v_j) ||_2^2. (12)

By introducing the hypergraph Laplacian, we further rewrite Equation (12) as

min_B tr(B^T L B), (13)

where the hypergraph Laplacian matrix is L = I − D_v^{−1/2} H W_e D_e^{−1} H^T D_v^{−1/2}, I is the identity matrix, H is the incidence matrix, W_e is the diagonal matrix of hyperedge weights, and D_v and D_e are diagonal matrices whose diagonal elements are the vertex degrees d(v_i) and the hyperedge degrees δ(e_i), respectively.
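A minimal sketch of the incidence matrix, degrees and normalized hypergraph Laplacian described above (toy data of our own; the construction of hyperedges from nearest neighbors is not reproduced here):

```python
import numpy as np

# Toy hypergraph: 5 vertices, 3 hyperedges; a hyperedge may join >2 vertices.
# Incidence matrix: H[i, k] = 1 if vertex i belongs to hyperedge e_k.
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)
w = np.ones(3)                       # hyperedge weights w(e)

d_v = H @ w                          # vertex degrees d(v)
d_e = H.sum(axis=0)                  # hyperedge degrees delta(e)

Dv_inv_sqrt = np.diag(1.0 / np.sqrt(d_v))
De_inv = np.diag(1.0 / d_e)
W_e = np.diag(w)                     # diagonal hyperedge weight matrix

# Normalized hypergraph Laplacian:
# L = I - Dv^{-1/2} H W_e De^{-1} H^T Dv^{-1/2}
L = np.eye(5) - Dv_inv_sqrt @ H @ W_e @ De_inv @ H.T @ Dv_inv_sqrt
```

L is symmetric positive semidefinite, so tr(B^T L B) is non-negative and small when codes on the same hyperedge agree.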
Combining Equations (7) and (13), the final objective function of LHH is defined as:

min_{B,A,C} ||Y − BAC||_F^2 + λ1 ||AC||_{2,1} + λ2 tr(B^T L B), s.t. A^T A = I_r, B ∈ {−1, 1}^{n×l}, (14)

where λ2 is a regularization parameter. In Equation (14), to learn high-quality binary codes, the first term learns the classifier with the binary codes, the second term minimizes the l2,1-norm of the projection matrix to exploit its low-rankness and sparsity, and the third term preserves the intrinsic complex structure of the data via the hypergraph.
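For concreteness, the value of this objective can be sketched as follows (a naive evaluation of the three terms on random toy data of our own; the Laplacian here is a placeholder, not a real hypergraph Laplacian):

```python
import numpy as np

def l21_norm(W):
    """l2,1-norm: sum of the Euclidean norms of the rows of W."""
    return np.linalg.norm(W, axis=1).sum()

def lhh_objective(Y, B, A, C, L, lam1, lam2):
    """Objective of Equation (14): classification loss + sparse low-rank
    regularizer on W = AC + hypergraph smoothness on the codes B."""
    W = A @ C
    fit = np.linalg.norm(Y - B @ W) ** 2          # ||Y - BAC||_F^2
    reg = lam1 * l21_norm(W)                      # lambda_1 ||AC||_{2,1}
    smooth = lam2 * np.trace(B.T @ L @ B)         # lambda_2 tr(B^T L B)
    return fit + reg + smooth

rng = np.random.default_rng(0)
n, l, r, c = 8, 6, 2, 3
B = np.where(rng.standard_normal((n, l)) >= 0, 1, -1).astype(float)
A = np.linalg.qr(rng.standard_normal((l, r)))[0]  # orthonormal columns
C = rng.standard_normal((r, c))
Y = np.eye(c)[rng.integers(0, c, size=n)]         # one-hot labels
L = 0.5 * np.eye(n)                               # placeholder Laplacian
val = lhh_objective(Y, B, A, C, L, lam1=1.0, lam2=0.01)
```

The alternating algorithm below decreases exactly this quantity at every step, which is what the convergence analysis relies on.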

Optimization Algorithm
Since Equation (14) is nonconvex, it is difficult to find a global solution. We therefore alternately solve the sub-problems for the following variables.
(1) C-step: Update C by fixing A and B.

Algorithm 1 Curvilinear Search Algorithm Based on Cayley Transformation
Input: initial point A^(0) ∈ R^{l×r}; matrices B, C; hash code length l.
Output: A^(t).
1: Initialize t = 0, ε > 0, λ1 = 1 and λ2 = 1e−2.
2: Repeat
3: Compute the gradient G according to Equation (18);
4: Generate the skew-symmetric matrix F = G^T A − A^T G;
5: Compute a step size τ_t that satisfies the Armijo-Wolfe conditions [33] via line search along the path J_t(τ) defined by Equation (19);
6: Set A^(t+1) = J(τ_t);
7: Update t = t + 1;
8: Until convergence

In this case, the objective function is simplified as:

min_C ||Y − BAC||_F^2 + λ1 ||AC||_{2,1}. (15)

Equation (15) can be rewritten as:

min_C ||Y − BAC||_F^2 + λ1 tr(C^T A^T D_W A C), (16)

where D_W is a diagonal matrix whose i-th diagonal element is 1/(2||w^i||_2), and w^i is the i-th row of W = AC. Setting the derivative of Equation (16) with respect to C to zero, we obtain:

C = (A^T B^T B A + λ1 A^T D_W A)^{−1} A^T B^T Y. (17)

(2) A-step: Update A by fixing B and C. It is hard to obtain an optimal solution of Equation (14) with respect to A due to the orthogonality constraint. Here we apply gradient descent with a curvilinear search to seek a locally optimal solution. First, we denote by G the gradient of Equation (16) with respect to A:

G = 2 B^T (BAC − Y) C^T + 2 λ1 D_W A C C^T. (18)

A skew-symmetric matrix is defined as F = G^T A − A^T G. The next point is determined by a Crank-Nicolson-like scheme,

J(τ) = A − (τ/2)(A + J(τ)) F, (19)

where τ is the step size. We can obtain a closed-form solution of J(τ):

J(τ) = A (I + (τ/2) F)^{−1} (I − (τ/2) F). (20)

Equation (20) is called the Cayley transformation [33,40,41]. The iteration terminates when τ_t satisfies the Armijo-Wolfe conditions. The algorithm solving this sub-problem is summarized in Algorithm 1.
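The orthogonality-preserving Cayley update can be sketched as follows (our own minimal illustration with random toy data; the Armijo-Wolfe line search is omitted and a fixed step size is used):

```python
import numpy as np

def cayley_step(A, G, tau):
    """One curvilinear step for min f(A) s.t. A^T A = I: the Cayley
    transformation moves A along the constraint manifold."""
    F = G.T @ A - A.T @ G                      # skew-symmetric r x r
    I = np.eye(A.shape[1])
    # J(tau) = A (I + tau/2 F)^{-1} (I - tau/2 F) preserves A^T A = I,
    # since the Cayley transform of a skew-symmetric F is orthogonal.
    return A @ np.linalg.solve(I + 0.5 * tau * F, I - 0.5 * tau * F)

rng = np.random.default_rng(0)
A = np.linalg.qr(rng.standard_normal((8, 3)))[0]   # A^T A = I_3
G = rng.standard_normal((8, 3))                    # stand-in gradient
A_next = cayley_step(A, G, tau=0.1)                # still orthonormal
```

Feasibility is thus maintained exactly at every iterate, with no re-orthogonalization needed.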
(3) B-step: Update B by fixing A and C.
The objective function is simplified as follows:

min_B ||Y − BAC||_F^2 + λ2 tr(B^T L B), s.t. B ∈ {−1, 1}^{n×l}. (21)

The above problem is challenging due to the discrete constraint, and it has no closed-form solution. Inspired by recent studies in nonconvex optimization, we optimize Equation (21) with the proximal gradient method, which iteratively optimizes a surrogate function. In the j-th iteration, we define a surrogate function Loss^_j(B) that is a discrete approximation of Loss(B) at the point B^(j). Given B^(j), the next discrete point is obtained by optimizing:

B^(j+1) = arg min_{B ∈ {−1,1}^{n×l}} Loss^_j(B). (22)

Note that ∇Loss(B^(j)) may include zero entries, in which case multiple solutions for B^(j+1) may exist. We thus introduce the function Cf(x, y), with Cf(x, y) = x if x ≠ 0 and Cf(x, y) = y if x = 0, to eliminate the zero entries. The update rule for B^(j+1) is defined as [42,43]:

B^(j+1) = Cf(sgn(−∇Loss(B^(j))), B^(j)). (23)
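A sketch of this discrete update, under our reading of Cf as "keep the previous bit wherever the gradient entry is zero" (toy inputs of our own):

```python
import numpy as np

def update_B(B, grad):
    """Discrete update B <- Cf(sgn(-grad), B): flip each bit toward the
    negative gradient of the surrogate loss; where the gradient entry is
    zero, Cf keeps the previous bit so codes stay in {-1, +1}."""
    step = -grad
    new_B = np.sign(step)
    keep = (step == 0)            # Cf(x, y): take y where x == 0
    new_B[keep] = B[keep]
    return new_B

B = np.array([[1.0, -1.0], [-1.0, 1.0]])
grad = np.array([[2.0, 0.0], [-3.0, 1.0]])
B_next = update_B(B, grad)        # zero-gradient entry keeps its old bit
```

Without the Cf guard, np.sign would emit 0 at zero-gradient entries and the code would leave {-1, +1}.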

Algorithm 2 Low-Rank Hypergraph Hashing
Input: feature matrix X ∈ R^{n×m}; label matrix Y ∈ R^{n×c}; hash code length l; hyperedge number k; parameters λ1, λ2.
Output: hash codes B; matrices A and C.
1: Construct the hypergraph and compute the Laplacian matrix L;
2: Initialize B, A and C;
3: Repeat
4: C-step: update C by Equation (17);
5: A-step: update A by Algorithm 1;
6: B-step: update B by Equation (23);
7: Until convergence
The learning algorithm of LHH is summarized in Algorithm 2.

Hash Function Learning
Having learned the optimal hash codes, we further need a mapping from the original space to the Hamming space. Here we assume a linear mapping between the two spaces, and the transformation matrix P is learned by optimizing the following problem [12]:

min_P ||B − XP||_F^2, (24)

which measures the fitting error between the data and the hash codes. The solution of the problem admits the following form:

P = (X^T X)^{−1} X^T B. (25)

Finally, the hash function is defined as

HF(x) = sgn(xP), (26)

where x is an arbitrary sample.
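A minimal sketch of this hash-function learning step on random toy data (variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, l = 100, 16, 8
X = rng.standard_normal((n, d))                       # training features
B = np.where(rng.standard_normal((n, l)) >= 0, 1, -1).astype(float)

# Least-squares transformation matrix: P = (X^T X)^{-1} X^T B.
P = np.linalg.solve(X.T @ X, X.T @ B)

def hash_fn(x, P):
    """HF(x) = sgn(x P): encode an arbitrary sample as an l-bit code."""
    code = np.sign(x @ P)
    code[code == 0] = 1           # break sign ties deterministically
    return code

query_code = hash_fn(X[0], P)
```

At query time, only this d x l projection and a sign operation are needed per sample, which is what makes out-of-sample encoding cheap.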

Convergence Analysis and Computational Complexity Analysis
Firstly, we discuss the convergence of LHH, which is presented in the following theorem. Theorem 1: The alternating iteration scheme of Algorithm 2 monotonically reduces the objective function value of Equation (14), and Algorithm 2 converges to a local minimum of Equation (14).
Proof: LHH includes three sub-problems. The sub-problem with respect to C is convex and thus has an optimal solution. The sub-problems with respect to A and B are non-convex, but the A- and B-steps both decrease the objective function value. Thus, Algorithm 2 decreases the objective function value in each step. In addition, the objective function value is non-negative and thus bounded below. Therefore, Algorithm 2 converges to a local optimum of LHH.
Then, we present the computational complexity of the proposed LHH method, which mainly consists of the following parts. The step of updating C has complexity O(nlr). In the step of updating A, due to the orthogonality constraint, the Cayley transformation is used. Computing the gradient of A requires O(nl^2 + nlc), and updating A in each iteration costs O(4nl^2 + l^3) [40]. Thus, the complexity of optimizing A is O(t1(nl^2 + nlc + 4nl^2 + l^3)), where t1 is the number of iterations for updating A. The step of updating B has complexity O(n^2 l), which is time-consuming as it involves computing with the hypergraph Laplacian matrix. In summary, the total computational complexity of LHH is O(t(nlr + n^2 l + t1(nl^2 + nlc + 4nl^2 + l^3))), where t is the number of total iterations in Algorithm 2. Finally, computing the hashing mapping matrix P requires O(nd^2 + ndl). For the query part, encoding any query x costs O(dl).

Experiments
We compare the proposed method with some state-of-the-art methods on four benchmark datasets, and their performance is evaluated in large-scale remote-sensing retrieval tasks.

Datasets
We adopt four benchmark datasets: UC Merced Land Use Dataset (UCMD) [44]; SAT4 [45]; SAT6 [45]; and CIFAR10 [46]. Their descriptions are as follows: • UCMD is generated by manually labeling aerial image scenes, and it covers 21 land cover categories. More specifically, each land cover category includes 100 images of 256 × 256 pixels. The spatial resolution of this public domain imagery is 0.3 meters. Here we randomly sample 420 samples as the query set, and use the remaining 1680 samples for training.
• SAT4 consists of a total of 500,000 image patches covering four broad land cover classes: barren land, grassland, trees and a class that consists of all land cover classes other than the above three. Each image patch is size-normalized to 28 × 28 pixels, and the spatial resolution of each pixel is 1 m. We randomly select 100,000 samples as the query set, and the other 400,000 samples as the training set.
• SAT6 consists of a total of 405,000 image patches covering six broad land cover classes: barren land, buildings, grassland, roads, trees and water bodies. The image size and spatial resolution of SAT6 are similar to those of SAT4. We randomly select 81,000 samples as the query set, and the other 324,000 samples as the training set.
• The CIFAR10 dataset consists of sixty thousand 32 × 32 color images from 10 classes, with 6000 images in each class. We randomly select 10,000 samples as the query set, and the remaining 50,000 samples as the training set.
The statistics of the four datasets are summarized in Table 2, and some sample images are presented in Figure 3.
In the experiments, the samples are represented as 512-dimensional GIST feature vectors. The experiments are conducted on a standard PC with an Intel Core i7-8550U CPU at 2.70 GHz and 8 GB RAM. λ1 and λ2 are empirically set to 1 and 0.01, respectively. The other parameter settings for the four datasets are summarized in Table 2.
The retrieval performance is measured with two widely used metrics: mean average precision (mAP) and the Precision-Recall (P-R) curve [47]. The mAP score is calculated as

mAP = (1/|Q|) Σ_{q∈Q} (1/L_q) Σ_r P_q(r) δ_q(r),

where q ∈ Q is a query and |Q| is the size of the query set. L_q is the number of true neighbors in the retrieved list, P_q(r) denotes the precision of the top r retrieved results, and δ_q(r) = 1 if the r-th result is a true neighbor and 0 otherwise [48].
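The mAP computation can be sketched as follows (our own helper functions, operating on boolean relevance lists rather than raw retrieval output):

```python
import numpy as np

def average_precision(relevant):
    """AP for one query: relevant[r] is True if the (r+1)-th retrieved
    item is a true neighbor (the delta_q(r) indicator)."""
    relevant = np.asarray(relevant, dtype=bool)
    L_q = relevant.sum()                            # number of true neighbors
    if L_q == 0:
        return 0.0
    ranks = np.arange(1, len(relevant) + 1)
    precision_at_r = np.cumsum(relevant) / ranks    # P_q(r)
    return float((precision_at_r * relevant).sum() / L_q)

def mean_average_precision(relevance_lists):
    """mAP: average AP over all queries q in Q."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))
```

For example, a ranked list with true neighbors at positions 1 and 3 yields AP = (1/1 + 2/3)/2 = 5/6.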

Qualitative Analysis
We illustrate the retrieval results of several hashing methods on UCMD and CIFAR10 in Figures 4 and 5, respectively. Figure 4 shows the retrieved images for 'building' in UCMD, and Figure 5 the retrieved images for 'dog' in CIFAR10. The top nine images are returned, and false images are marked with red rectangles. From Figures 4 and 5, we can see that the proposed LHH retrieves the most correct images among all the methods. LSH, SH and PRH retrieve 2-3 correct images, while HSH, KSH and SDH retrieve more than five similar images. This experiment validates the effectiveness of LHH.

Quantitative Analysis
(1) The comparison of mAP among several hashing methods on the different datasets is shown in Tables 3-6, where it is clearly observed that LHH generally achieves the best performance. SDH and KSH have similar results: KSH is better than SDH on UCMD, while SDH outperforms KSH on the SAT4, SAT6 and CIFAR10 datasets. HSH performs satisfactorily on both the SAT4 and SAT6 datasets. Of the other methods, PRH is generally superior to LSH and SH. These results indicate that LHH achieves promising retrieval performance on all four datasets.
(2) Figure 6 gives the comparison of the P-R curves of six hashing methods on the different datasets. In Figure 6, the P-R curve of LHH is mostly above those of the other methods; thus LHH obtains a larger area under the curve (AUC), which is important for evaluating information retrieval. The performances of LSH, SH and PRH are the worst, as their AUC areas are the smallest. The AUC areas of HSH are smaller than those of KSH and SDH, indicating that HSH underperforms KSH and SDH. The above results demonstrate the superiority of the proposed LHH over the compared methods in large-scale retrieval tasks.

Convergence Analysis
This section empirically studies the convergence of LHH. Figure 7 illustrates the convergence curves of LHH on these data sets. From Figure 7, we can clearly see that LHH quickly converges within around eight iterations. The empirical results corroborate Theorem 1.

Parameter Analysis
We discuss the sensitivity of the sparse regularization parameter λ1 and the hypergraph regularization parameter λ2 in the proposed LHH, showing their influence on the mAP with 32-bit codes. In the experiments, λ1 and λ2 are varied over the range [10^−4, 10^−2, 10^0, 10^2, 10^4]. From Figure 8, we see that the mAP changes only slightly with the two parameters. As λ1 and λ2 increase, the mAP slowly rises and then drops on the four datasets, and the change with λ2 is larger than that with λ1. In general, LHH achieves acceptable results on the four datasets when λ1, λ2 ∈ [0.01, 1]. These results demonstrate that the sparse and hypergraph terms help improve retrieval.


Author Contributions: All the authors contributed to this study; conceptualization, J.K.; methodology, J.K.; software, J.K.; writing, J.K.; writing-review and editing, Q.S., M.M. and J.L. All the authors have read and agreed to the published version of the manuscript.

Discussion
The experimental results on the four datasets reveal the following interesting points:
• Section 4.3.1 qualitatively shows that the proposed low-rank hypergraph hashing (LHH) has better retrieval performance on large-scale remote sensing (RS) image datasets. Specifically, LHH retrieves more correct images than the comparison methods, as shown in Figures 4 and 5.
• Section 4.3.2 quantitatively reveals that the proposed LHH is clearly superior to the existing methods on four large-scale datasets, including three remote sensing datasets and one natural image dataset. Specifically, Tables 3-6 illustrate that LHH achieves a higher mean average precision (mAP) than the comparison methods, and Figure 6 illustrates that LHH also has better Precision-Recall (P-R) curves.
• Section 4.4 shows that LHH converges very quickly, within around eight iterations, on several datasets. This indicates that LHH may require less training time in real applications.
• Section 4.5 shows that LHH is relatively robust to its parameters. From Figure 8, LHH generally performs well when λ1, λ2 ∈ [0.01, 1], demonstrating the effectiveness of the sparse and hypergraph terms.

• The LHH works very well for efficient large-scale RS image retrieval. It can effectively explore complex structures among RS image datasets and extract more discriminative hash codes.

Conclusions
This work focuses on applying a hashing technique to efficient large-scale remote sensing image retrieval (RSIR) tasks. We propose a new low-rank hypergraph hashing (LHH) method to generate compact hash codes for remote sensing (RS) images. LHH imposes low-rankness and sparsity on the transformation matrix to exploit its global structure and filter out unrelated features. LHH uses hypergraphs to capture high-order relationships among data, which is well suited to exploring the complex structure of RS images. Extensive experiments are conducted on three RS image datasets and one natural image dataset that are publicly available. The experimental results demonstrate that the proposed LHH outperforms existing hashing methods in RSIR tasks. In the future, we will explore a deep learning extension of LHH to further improve the performance of large-scale RS image retrieval.