Region-Wise Deep Feature Representation for Remote Sensing Images

Abstract: Effective feature representations play an important role in remote sensing image analysis tasks. With the rapid progress of deep learning techniques, deep features have been widely applied to remote sensing image understanding in recent years and have shown powerful ability in image representation. The existing deep feature extraction approaches are usually carried out on the whole image directly. However, such deep feature representation strategies may not effectively capture the local geometric invariance of target regions in remote sensing images. In this paper, we propose a novel region-wise deep feature extraction framework for remote sensing images. First, regions that may contain the target information are extracted from one whole image. Then, these regions are fed into a pre-trained convolutional neural network (CNN) model to extract regional deep features. Finally, the regional deep features are encoded by an improved Vector of Locally Aggregated Descriptors (VLAD) algorithm to generate the feature representation for the image. We conducted extensive experiments on remote sensing image classification and retrieval tasks based on the proposed region-wise deep feature extraction framework. The comparison results show that the proposed approach is superior to the existing CNN feature extraction methods.


Introduction
With the development of satellite imaging techniques, it has become much easier to acquire large collections of remote sensing images. In recent years, automatic remote sensing image analysis [1][2][3][4] has become a hot topic due to its wide applications in many fields such as military reconnaissance, agriculture, and environmental monitoring. Feature extraction and representation is the foundation of many remote sensing image processing tasks [5][6][7][8][9]. Developing powerful image feature representation methods helps us understand the image information more accurately.
During the past decades, a variety of feature learning methods for remote sensing images have been proposed. In earlier years, remote sensing image analysis was mainly based on hand-crafted features, which include both global features and local features. Global features [10][11][12] include color, shape, and texture information, which are primary characteristics of a remote sensing image. The global features are extracted from the whole image, so they cannot reflect the local information of the areas of interest. Among the local features [13,14], bag-of-words (BoW) and its variations [15][16][17] have been among the most popular in recent decades, and they comprised the state of the art for several years in the remote sensing community because of their simplicity, efficiency, and invariance to viewpoint changes. In addition to the hand-crafted features, data-driven features have also been developed via unsupervised feature learning for content-based remote sensing image retrieval and classification tasks [18][19][20][21][22]. For example, a saliency-guided unsupervised feature learning approach was proposed in [19] for remote sensing scene classification. A multiple feature-based remote sensing image retrieval approach was proposed in [21] by combining hand-crafted features and data-driven features via unsupervised feature learning. Wang et al. [22] proposed a multilayered graph model for hierarchically refining retrieval results from coarse to fine. However, as the remote sensing image understanding task becomes more challenging, the description capabilities of the above low-level features are limited and may not be effective at capturing high-level semantics.
More recently, various deep learning algorithms [23][24][25][26], especially convolutional neural networks (CNNs), have shown much stronger feature representation power in many fields such as traffic scene analysis [27,28] and bush-fire frequency forecasting [29,30]. CNNs learn high-level semantic features automatically rather than requiring hand-crafted features, and they have achieved great success in many remote sensing image analysis applications [31][32][33][34][35][36][37][38][39]. For example, a low-dimensional convolutional neural network was learned in [34] for high-resolution remote sensing image retrieval, while an unsupervised convolutional feature fusion network was developed for deep representation of remote sensing images in the scene classification task [36]. In these CNN-based remote sensing image feature learning methods, the whole image is usually fed directly into a pre-trained or fine-tuned network to obtain the deep representation. However, there is one problem that is seldom addressed in the existing CNN feature extraction methods. Compared with other images, remote sensing images have several special characteristics. For example, even within the same category, the targets in different images may have varied sizes, colors, and angles. More importantly, other materials and the background around the target area may cause high intraclass variance and low interclass variance. Therefore, if we directly extract the CNN features from the whole image in the traditional manner, the image representations in the feature space may not accurately reflect their true category information (as demonstrated in Figure 1a). In order to address the above problem, we propose a novel region-wise deep CNN feature representation method for remote sensing image analysis, which extracts the CNN features from regions containing the targets instead of from the whole image (see Figure 1b).
The proposed feature extraction approach includes the following steps: First, regions that may contain the targets are generated from one whole image. Then, these regions are fed into a pre-trained CNN model to extract the regional deep features. Finally, the regional deep features are encoded by an improved Vector of Locally Aggregated Descriptors (VLAD) algorithm to generate the feature representation for the image. The image features extracted by our proposed approach have more powerful and effective representation ability to capture the local target information and geometric invariance. The flowchart of the proposed region-wise deep CNN feature learning method is illustrated in Figure 2.

Target Region Proposal
As introduced in Section 1, our proposed feature extraction method is based on target regions instead of the whole image. Thus, we first have to generate the regions that may contain the targets. The regions are expected to reflect the objects from the same category rather than the varied background, such that the features extracted from these regions have more discriminative power between different classes. A region proposal algorithm generates a set of bounding boxes that may contain the targets of interest. In this paper, we apply the edge-boxes algorithm [40] to generate the object-wise bounding boxes. The edge-boxes method is able to cover most objects in one image with a set of bounding boxes as well as their corresponding confidence scores. It generates the target bounding boxes directly from edge information, and each box is scored based on the number of contours wholly enclosed in it. The score is calculated as follows: First, neighboring edge pixels of similar orientation are clustered together to form edge groups. Then, affinities between edge groups are computed based on their relative positions and orientations. Finally, each bounding box is scored by summing the edge strengths of all edge groups within the box and subtracting the edge magnitudes in the center of the box. Specifically, the score h_x of bounding box x in one image is expressed as

h_x = ( Σ_i w_x(s_i) m_i − Σ_{p ∈ x_in} m_p ) / ( 2(x_w + x_h) ) + b, (1)

where s_i denotes the edge group generated from the neighboring edge pixels, and w_x(s_i) denotes the continuous value which indicates whether s_i is wholly contained in the bounding box x. m_i is the sum of the magnitudes of all the edge pixels in edge group s_i. x_w and x_h are the width and height of x. p ∈ x_in denotes the set of edge groups p located in the center of x, which are considered to have no contribution to the score, and b is a bias term. According to the computed scores, we select the K bounding boxes with the highest scores for each image as the candidate target regions.
More details of bounding box generation for target regions can be found in [40].
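The box-scoring rule in Equation (1) can be sketched in a few lines of numpy, assuming the edge groups have already been extracted and summarized as containment weights w_x(s_i), magnitudes m_i, and an in-center flag per group (the function and variable names here are ours, for illustration only, not from [40]):

```python
import numpy as np

def box_score(weights, magnitudes, in_center, box_w, box_h, bias=0.0):
    """Score one candidate box following Equation (1): sum the contained
    edge-group strengths w_x(s_i) * m_i, subtract the magnitudes of groups
    in the box center, normalize by box perimeter, and add a bias."""
    weights = np.asarray(weights, dtype=float)        # w_x(s_i) in [0, 1]
    magnitudes = np.asarray(magnitudes, dtype=float)  # m_i per edge group
    in_center = np.asarray(in_center, dtype=bool)     # flags p in x_in
    contained = np.sum(weights * magnitudes)
    center_penalty = np.sum(magnitudes[in_center])
    return (contained - center_penalty) / (2.0 * (box_w + box_h)) + bias

def top_k_boxes(scores, k):
    """Keep the K candidate boxes with the highest scores."""
    order = np.argsort(scores)[::-1]
    return order[:k]
```

In the full pipeline, `top_k_boxes` with K = 500 would yield the candidate target regions passed on to the CNN feature extraction step.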

Region-Wise CNN Feature Extraction
Based on the edge-boxes algorithm, we obtain a set of candidate target regions and their corresponding scores for each image. In the following step, we extract the deep features from the target regions, because deep features capture high-level semantic information better than hand-crafted low-level features. Among various deep learning algorithms, the CNN is one of the most commonly used architectures for image feature extraction. A typical CNN model is structured as a series of layers, including convolutional layers, pooling layers, and fully connected layers. Many deep CNN models have been developed for image analysis in the past few years, such as AlexNet [23], VGG-Net [24], and GoogLeNet [25]. Without loss of generality, we choose AlexNet as the CNN feature extraction model in this paper, and the candidate regions are fed into AlexNet for deep feature extraction. The AlexNet model has five convolutional layers and three fully-connected layers. We directly copy the model parameters of the convolutional layers conv1-conv5 and the fully-connected layers fc6-fc7, which are pre-trained on the ImageNet dataset [23]. The output 4096-dimensional vector of the fully-connected layer fc7 is extracted as the deep feature for each target region. Therefore, for each image I with K candidate target regions, we obtain K region-wise CNN features I = {r_1, r_2, ..., r_K}, where r_i ∈ R^D (i = 1, ..., K) denotes the D-dimensional fc7 CNN feature vector of the i-th region.
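The cropping-and-stacking logic of this step can be sketched as follows. A full pre-trained AlexNet is beyond the scope of a short example, so the fc7 forward pass is replaced here by a hypothetical stand-in function; only the per-region cropping and the assembly of I = {r_1, ..., r_K} mirror the description above:

```python
import numpy as np

D = 4096  # fc7 dimensionality of AlexNet

def fc7_features_stub(region):
    """Hypothetical stand-in for the pre-trained AlexNet fc7 forward pass.
    In practice one would resize the crop to the network's input size and
    run a real pre-trained CNN truncated at fc7."""
    rng = np.random.default_rng(region.size)
    return rng.standard_normal(D)

def extract_region_features(image, boxes, cnn=fc7_features_stub):
    """Crop each candidate box (x, y, w, h) from the image and stack the
    K regional fc7 vectors into a (K, D) feature matrix."""
    feats = []
    for (x, y, w, h) in boxes:
        region = image[y:y + h, x:x + w]
        feats.append(cnn(region))
    return np.stack(feats)
```

Swapping `fc7_features_stub` for a real network forward pass yields the region-wise deep features used in the rest of the pipeline.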

Image Representation by Improved VLAD
After we have obtained the region-wise CNN features for each image, these massive regional deep features need to be encoded into a single vector for image representation. In this subsection, we propose an improved Vector of Locally Aggregated Descriptors (VLAD, [41]) method to encode the regional feature vectors into a single long vector for each image. Before encoding, we have to generate a set of M visual words C = {c_1, c_2, ..., c_M}, where each visual word c_i is a D-dimensional vector. This can be done simply by running the k-means clustering algorithm on all the regional CNN features of the whole image database; each cluster center is then regarded as one visual word. The traditional VLAD representation [41] aggregates the regional features of an image into M vectors v_1, ..., v_M as

v_i = Σ_{r_k : NN(r_k) = c_i} (r_k − c_i), (2)

where NN(r_k) = c_i denotes that the nearest visual word (cluster center) of regional feature vector r_k is c_i. Thus, v_i is the aggregation of differences between each visual word and its assigned regional features. From Equation (2), we can see that the traditional VLAD approach only calculates the differences between the assigned regional features and their single nearest visual word. However, it is often possible that some regional features have similar or even identical distances to two or more visual words. Assigning a regional vector to only one nearest visual word may not be appropriate and sometimes loses important information. More importantly, each bounding box in one image has a corresponding score obtained in the region proposal stage, which reflects the confidence that the target is contained in the region. If we apply the VLAD encoding method directly, this score information is also neglected. To overcome the above shortcomings, we propose a weighted multi-neighbor assignment strategy for the regional CNN features to improve the representation ability of the traditional VLAD method. Specifically, we calculate the new vector v_i^new in VLAD by the following equation:

v_i^new = Σ_{r_k : c_i ∈ NN(r_k)} h_{r_k} β_{ki} (r_k − c_i), (3)

where h_{r_k} is the score of region r_k computed by Equation (1).
β_{ki} is the weight of the difference between regional feature r_k and visual word c_i, which is simply computed through a Gaussian function of the distance between r_k and c_i. c_i ∈ NN(r_k) denotes that visual word c_i is included in the set of nearest visual words to which regional feature vector r_k has been assigned. By taking the region score and the visual word assignment weight into consideration, the VLAD image representation obtained through our method is more accurate than that of the traditional VLAD approach. The comparison of the improved VLAD with the original VLAD method is demonstrated in Figure 3. By performing the improved VLAD encoding algorithm, each image is encoded into a single MD-dimensional long vector V_new = [v_1^new, v_2^new, ..., v_M^new]. Because MD may be very large, which leads to a quite long V_new, we apply principal component analysis (PCA) on V_new for dimensionality reduction and obtain the final feature representation for each image.
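A minimal numpy sketch of the improved VLAD encoding in Equation (3), assuming the visual words have already been obtained by k-means; the Gaussian bandwidth `sigma` and the function name are our own illustrative choices, as the paper does not fix them:

```python
import numpy as np

def improved_vlad(R, scores, centers, n_neighbors=5, sigma=1.0):
    """Weighted multi-neighbor VLAD encoding (Equation (3)).
    R: (K, D) regional CNN features; scores: (K,) region scores h_{r_k};
    centers: (M, D) visual words from k-means clustering."""
    K, D = R.shape
    M = centers.shape[0]
    V = np.zeros((M, D))
    # Pairwise distances between every region and every visual word.
    dists = np.linalg.norm(R[:, None, :] - centers[None, :, :], axis=2)
    for k in range(K):
        # Assign each region to its n nearest visual words, not just one.
        nn = np.argsort(dists[k])[:n_neighbors]
        # Gaussian weight beta_ki of each assignment (assumed form).
        beta = np.exp(-dists[k, nn] ** 2 / (2.0 * sigma ** 2))
        for b, i in zip(beta, nn):
            # Score-weighted residual, accumulated per visual word.
            V[i] += scores[k] * b * (R[k] - centers[i])
    return V.reshape(-1)  # single MD-dimensional vector
```

Setting `n_neighbors=1`, `sigma` large, and all scores to 1 recovers the behavior of the traditional single-assignment VLAD of Equation (2).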
The proposed region-wise deep CNN feature representation framework can be applied to many kinds of remote sensing image analysis tasks such as scene classification, image retrieval and so on. We will show the superiority of our proposed image representation method to the existing CNN feature extraction approaches for remote sensing images in the following section.

Experiments
In this section, we will conduct extensive experiments to evaluate the performance of the proposed region-wise deep feature representation method on different remote sensing image analysis tasks, i.e., remote sensing scene classification and large-scale remote sensing image retrieval.

Datasets and Settings
Two publicly available remote sensing image datasets are used in the experiments: the UC-Merced Landuse dataset [16] and the Aerial Image Dataset (AID) [42]. The images in the UC-Merced Landuse dataset were manually extracted from large images in the USGS (United States Geological Survey) National Map Urban Area Imagery collection for various urban areas around the country. The pixel resolution of this public domain imagery is 1 foot. The UC-Merced dataset contains 2100 images in total, and each image measures 256 × 256 pixels. There are 100 images for each of the following 21 classes: agricultural, airplane, baseball diamond, beach, buildings, chaparral, dense residential, forest, freeway, golf course, harbor, intersection, medium residential, mobile home park, overpass, parking lot, river, runway, sparse residential, storage tanks, and tennis court. The AID dataset is a new large-scale benchmark for performance evaluation of aerial scene analysis, which was released in 2017 and is much larger than the UC-Merced dataset. Its sample images were collected from Google Earth imagery and cover the following 30 aerial scene types: airport, bare land, baseball field, beach, bridge, center, church, commercial, dense residential, desert, farmland, forest, industrial, meadow, medium residential, mountain, park, parking, playground, pond, port, railway station, resort, river, school, sparse residential, square, stadium, storage tanks, and viaduct. In all, the AID dataset contains 10,000 images within the 30 classes, and each class contains about 200 to 400 samples of size 600 × 600 pixels. Some sample images from the two datasets are shown in Figure 4. For target region proposal in the proposed approach, we directly use the code of the edge-boxes algorithm [40] provided by the authors. The number K of candidate regions for each image is uniformly set to 500 in the experiments.
The AlexNet model is adopted to extract the deep CNN features for all the candidate regions, and the model parameters, pre-trained on the ImageNet dataset, are directly downloaded from the web. The output D-dimensional (D = 4096) vector of the layer fc7 is extracted as the regional deep CNN feature. In the improved VLAD step, the k-means clustering algorithm is first run on all the regional CNN features to obtain M = 64 cluster centers as the visual words. Then, each image is encoded into an MD-dimensional vector based on the improved VLAD algorithm through Equation (3), with the number of nearest neighbor visual words for each region set to 5 in the multi-neighbor assignment stage. Finally, PCA is performed for dimensionality reduction, and a 1024-dimensional feature vector is obtained as the representation for each image. The proposed regional deep CNN feature representation method is denoted by CNN-R in the experiments.
For comparison, we also implement the traditional CNN feature representation, where the whole image is directly fed into the pre-trained AlexNet model and a 4096-dimensional vector of the layer fc7 is extracted for each image. We denote the whole image based CNN feature by CNN-W in the experiments. PCA is also performed to reduce the dimensionality to 1024 for fair comparison.

Results for Remote Sensing Scene Classification
We first evaluate the proposed region-wise deep feature representation method CNN-R on the remote sensing scene classification task using the UC-Merced dataset. CNN-W as well as state-of-the-art remote sensing image classification methods are used as benchmarks in the experiments. SVM is used as the classifier for both CNN-R and CNN-W. Similar to previous works in the literature [36,37], we randomly select 80% of the images from each class to train the SVM model, and the remaining 20% are used for testing. According to [43], overall accuracy and the confusion matrix are usually adopted as the metrics for accuracy assessment. Other related works such as [44] also report measures derived from the confusion matrix, in which the Bradley-Terry model was used to quantify association in remotely sensed images. For a fair comparison with the results reported in previous remote sensing image classification works [19,32,[35][36][37], we adopt the same accuracy assessment measures used in the above literature. Table 1 shows the overall accuracy of the classification results for the different remote sensing image feature learning methods. Comparing CNN-W with CNN-R, we find that the proposed region-wise deep feature representation method CNN-R achieves better classification results than CNN-W. This can be attributed to the candidate regions capturing more effective local geometric information of the target areas, so that the CNN features extracted from these target regions have more discriminative power and are less influenced by the background. Comparing our CNN-R with state-of-the-art remote sensing scene classification results, we observe that the performance of CNN-R is still among the top ones, which further validates the effectiveness of the proposed approach. Figure 5 shows the confusion matrices of the feature representation method CNN-W and our proposed CNN-R on the UC-Merced dataset.
From the figure, we observe that accuracies above 90% are obtained for all 21 classes with our proposed CNN-R approach. Comparing CNN-R with CNN-W, significant improvements are obtained on the classes "building", "denseresidential", and "mediumresidential", where the accuracies are elevated from 81%, 75%, and 85% to 92%, 90%, and 92%, respectively. The reason may be that the images in these three classes have high intraclass variance and low interclass variance, so directly extracting deep features from the whole image may not accurately reflect their true category information. In contrast, our proposed method employs region-wise deep features for image representation, which effectively captures the local geometric invariance and has more discriminative power for remote sensing scene classification.
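The accuracy assessment described above, overall accuracy plus a confusion matrix with per-class accuracies, can be computed with a few lines of numpy (the function names are ours; libraries such as scikit-learn provide equivalents):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true classes, columns the predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def overall_accuracy(cm):
    """Fraction of all samples on the confusion-matrix diagonal."""
    return np.trace(cm) / cm.sum()

def per_class_accuracy(cm):
    """Diagonal divided by each row sum, i.e., recall per class."""
    return np.diag(cm) / cm.sum(axis=1)
```

For the 21-class UC-Merced experiment, `n_classes` would be 21 and the labels would come from the held-out 20% test split.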

Results for Large-Scale Remote Sensing Image Retrieval
Hashing-based methods have recently attracted much attention for handling the large-scale remote sensing image retrieval problem [39,[45][46][47]. Hashing methods map the input images from the feature space to a low-dimensional code space, i.e., Hamming space, where each image is represented by a binary code. The goal of hashing approaches is to generate binary codes for each sample in a database such that similar samples have close codes. One advantage of the binary code representation is that it significantly reduces the amount of memory required for storing the images' content. In addition, it is extremely fast to perform similarity search over such binary codes in large-scale applications, because the Hamming distance between binary codes can be efficiently calculated with the XOR operation on modern CPUs.
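The XOR-based Hamming distance computation mentioned above can be illustrated as follows, with binary codes stored as Python integers (a sketch; production systems would use packed bit arrays and hardware popcount instructions):

```python
def hamming_distance(code1, code2):
    """Hamming distance between two binary codes stored as integers:
    a single XOR, then count the set bits of the result."""
    return bin(code1 ^ code2).count("1")

def hamming_rank(query, codes):
    """Return database indices sorted by Hamming distance to the query,
    i.e., the retrieval ranking used to compare the hashing methods."""
    return sorted(range(len(codes)),
                  key=lambda i: hamming_distance(query, codes[i]))
```

For example, codes `0b1010` and `0b0011` differ in two bit positions, so their Hamming distance is 2; ranking a database by this distance yields the retrieval lists evaluated in this subsection.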
In this subsection, we evaluate the performance of the proposed region-wise deep features for hashing-based large-scale remote sensing image retrieval on the AID dataset. We select three state-of-the-art hashing models for the experiments: kernel supervised hashing (KSH) [48], supervised discrete hashing (SDH) [49], and column sampling based discrete supervised hashing (COSDISH) [50]. The CNN-W feature and our proposed CNN-R feature are used as input to the above models respectively to learn binary hash codes. Finally, image retrieval is carried out by comparing the Hamming distances of the learned codes. The retrieval performance is measured with four widely used metrics: mean average precision (MAP), precision of the top K retrieved images (Precision@K), recall of the top K retrieved images (Recall@K), and precision-recall (P-R) curves. More specifically, precision and recall are defined as follows:

precision = true positives / (true positives + false positives), (4)

recall = true positives / (true positives + false negatives). (5)

The MAP score is calculated by

MAP = (1/|Q|) Σ_{q_i ∈ Q} (1/n_i) Σ_{k=1}^{n_i} Precision(R_ik), (6)

where q_i ∈ Q is a query and n_i is the number of images relevant to q_i in the database. Suppose the relevant images are ordered as {r_1, r_2, ..., r_{n_i}}; then R_ik is the set of ranked retrieval results from the top result down to r_k. Table 2 shows the MAP of the different hashing methods for fast image retrieval on the AID dataset based on varied input deep features and hash code lengths. We observe that nearly all the hashing methods obtain performance improvements when using our proposed region-wise deep feature representation for hash code learning. As illustrated, the KSH, SDH, and COSDISH methods achieve 4%, 4.75%, and 7.75% improvements on average when using CNN-R as the input feature instead of CNN-W.
These results also indicate that learning CNN feature representations from regions is more effective for capturing the target information and can generate powerful hash codes for large-scale image retrieval. The precision of the top K retrieved images, the recall of the top K retrieved images, and the precision-recall curves for the compared methods are shown in Figure 6. From the Precision@K curves (Figure 6a,d,g,j), we find that our CNN-R based hashing methods obtain better results than the CNN-W based methods in most cases as the number of retrieved images grows. The Recall@K scores of the different approaches over varied hash bits are shown in Figure 6b,e,h,k, and show trends similar to the Precision@K curves. These observations demonstrate that the hash codes learned from our CNN-R features are more effective than those from the traditional CNN-W features for the large-scale image retrieval task. The P-R curves, which reflect the overall image retrieval performance of the different methods, are shown in Figure 6c,f,i,l. Comparing the hashing methods using CNN-R features as input with those using CNN-W, we again find that the proposed CNN-R feature representation consistently outperforms CNN-W for hashing-based large-scale image retrieval in most cases. This may be because our proposed CNN-R feature learning scheme generates more informative feature representations for remote sensing images, and thus the hash codes learned from CNN-R features also capture the image contents more accurately. In fact, the MAP score approximates the area under the precision-recall curve; thus, the detailed results in Figure 6 are consistent with the trends observed in the above experiments, which validates the superiority of our CNN-R feature representation strategy.
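The precision, recall, and MAP metrics defined above can be sketched as follows, assuming each query's ranked retrieval list is given as a 0/1 relevance vector (the function names are illustrative):

```python
import numpy as np

def precision_recall_at_k(ranked_relevance, k, n_relevant):
    """Precision@K and Recall@K for one query.
    ranked_relevance: 0/1 relevance flags over the ranked list."""
    hits = sum(ranked_relevance[:k])
    return hits / k, hits / n_relevant

def average_precision(ranked_relevance, n_relevant):
    """AP for one query: mean of the precision values measured at the
    ranks where relevant items occur."""
    hits, ap = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / rank
    return ap / n_relevant

def mean_average_precision(all_rankings, all_n_relevant):
    """MAP: the mean of the per-query average precisions."""
    return np.mean([average_precision(r, n)
                    for r, n in zip(all_rankings, all_n_relevant)])
```

For instance, a ranked list with relevance flags [1, 0, 1, 0] and two relevant images yields AP = (1/1 + 2/3)/2 ≈ 0.83; averaging such values over all queries gives the MAP scores reported in Table 2.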

Future Work
From the above experimental results, we have demonstrated the effectiveness of our proposed region-wise CNN feature representation for remote sensing images. Compared with the traditional CNN features extracted from the whole image, our region-wise CNN feature can keep much more useful information in the final feature representations and thus achieve better performance in remote sensing image classification and retrieval tasks. However, there are also some open issues that remain for future research. For example, the first step of our proposed approach is to locate the target proposals, in which the existing edge-boxes algorithm is directly adopted. The edge-boxes algorithm is a general object proposal method for natural images, which may not be completely suitable for remote sensing targets. Therefore, how to improve the original object proposal algorithm specifically for remote sensing images can be one research direction. Moreover, our proposed feature extraction approach is made up of three individual steps and how to design an end-to-end region-wise deep feature representation for remote sensing images will be another direction for future research.

Conclusions
In this paper, we have proposed a novel region-wise deep feature representation framework for remote sensing images. In our proposed approach, the target-related bounding boxes are first computed for the candidate regions, and a deep CNN model is applied to extract the regional deep features for each image. Then, the regional deep features are encoded into a single feature vector for each image by an improved VLAD algorithm, in which a weighted multi-neighbor assignment strategy is proposed to calculate the VLAD representation. The main advantages of our proposed approach are: (1) representing the images with region-wise deep features captures the local geometric invariance of the target information more accurately and retains more specific content information in the final image features; (2) the improved VLAD algorithm takes the region score and the visual word assignment weight into consideration when encoding the local regional features and thus generates more effective and distinctive feature vectors for the final image representations. Extensive experiments on two different remote sensing image analysis tasks have demonstrated the superiority of our approach over traditional feature representation methods.
Author Contributions: P.L. and P.R. conceived and designed the experiments; Q.W. and X.Z. (Xiaoyu Zhang) performed the experiments; X.Z. (Xiaobin Zhu) and L.W. analyzed the data; P.L. wrote the paper; All authors read and approved the final manuscript.