A New Framework for Automatic Airports Extraction from SAR Images Using Multi-Level Dual Attention Mechanism

The detection of airports from Synthetic Aperture Radar (SAR) images is of great significance in various research fields. However, it is challenging to distinguish the airport from surrounding objects in SAR images. In this paper, a new framework, multi-level and densely dual attention (MDDA) network is proposed to extract airport runway areas (runways, taxiways, and parking lots) in SAR images to achieve automatic airport detection. The framework consists of three parts: down-sampling of original SAR images, MDDA network for feature extraction and classification, and up-sampling of airports extraction results. First, down-sampling is employed to obtain a medium-resolution SAR image from the high-resolution SAR images to ensure the samples (500 × 500) can contain adequate information about airports. The dataset is then input to the MDDA network, which contains an encoder and a decoder. The encoder uses ResNet_101 to extract four-level features with different resolutions, and the decoder performs fusion and further feature extraction on these features. The decoder integrates the chained residual pooling network (CRP_Net) and the dual attention fusion and extraction (DAFE) module. The CRP_Net module mainly uses chained residual pooling and multi-feature fusion to extract advanced semantic features. In the DAFE module, position attention module (PAM) and channel attention mechanism (CAM) are combined with weighted filtering. The entire decoding network is constructed in a densely connected manner to enhance the gradient transmission among features and take full advantage of them. Finally, the airport results extracted by the decoding network were up-sampled by bilinear interpolation to accomplish airport extraction from high-resolution SAR images. To verify the proposed framework, experiments were performed using Gaofen-3 SAR images with 1 m resolution, and three different airports were selected for accuracy evaluation. 
The results showed that the mean pixel accuracy (MPA) and mean intersection over union (MIoU) of the MDDA network were 0.98 and 0.97, respectively, both much higher than those of RefineNet and DeepLabV3. Therefore, MDDA can achieve automatic airport extraction from high-resolution SAR images with satisfactory accuracy.


Introduction
Synthetic Aperture Radar (SAR) can acquire images all day and all night without being affected by the weather and light conditions [1], which is a tremendous advantage that optical remote sensing images cannot offer. Therefore, it plays an increasingly important role in military and civilian applications. The airports are strategic hubs of the national economy and key targets in military missions. It is of great practical significance to implement automatic airport detection from SAR images. Additionally, this work can facilitate the takeoff and landing of aircraft, assist air traffic management, and provide various navigation services. This work is also very helpful to reduce the false alarms generated by aircraft detection by excluding specious targets from SAR images.
Airports share several common features in SAR images [2]: (1) Long, straight runways, taxiways, and parking lots appear mostly black in SAR images; (2) The ground, made of cement and asphalt, looks lighter than the runway in SAR images; (3) Aircraft and buildings such as terminals and hangars appear as highlighted areas in SAR images because of their strong scattering characteristics. However, they are difficult to distinguish from the buildings around the airport, which also appear highlighted. The complex airport runway area plays a pivotal role in airport detection [3]. Based on its distinct visual features in SAR images, the airport runway area is extracted in this paper to achieve automatic airport detection.
The main contributions of this paper are as follows: (1) A new framework for airport extraction is proposed. It includes three parts: down-sampling of the original SAR images, a deep learning network for airport extraction, and bilinear interpolation to obtain the extraction result for high-resolution SAR images. High-resolution SAR images are down-sampled to produce medium-resolution (5 m-10 m) SAR images, from which the datasets are generated. After the deep learning network extracts airports from the medium-resolution SAR images, up-sampling is carried out to produce results with the same size as the original high-resolution SAR images. (2) A new deep neural network, the multi-level and densely dual attention (MDDA) network, is presented to accomplish airport extraction from SAR images. It mainly contains two parts, the encoder and the decoder. The encoder employs ResNet_101 to extract features at different levels. In the decoder, the features of different levels are fully utilized through dense connection, and the essential features of the airport are then extracted by the CRP_Net_x (x = 1, 2, 3) modules and the dual attention fusion and extraction (DAFE) module to realize the airport extraction. In the DAFE module, dual attention is introduced to fuse global semantic information by weighting spatial positions and channels, so that more distinguishing features are extracted. (3) The proposed MDDA framework is implemented, and its airport extraction performance is evaluated using large-scale Gaofen-3 SAR images with a 1-m resolution.
The remainder of the paper is organized as follows. Section 2 reviews the state-of-the-art, describing the development of airport detection and of deep learning in semantic segmentation. Section 3 presents the methodology, which elaborates on the proposed framework (MDDA) and its operating principle for airport extraction. In Section 4, experiments are performed with the MDDA network using Gaofen-3 SAR images with a 1-m resolution covering four airports, and the performance is assessed. Section 5 briefly discusses the proposed network and puts forward future research. Finally, our conclusions are given in Section 6.

State-Of-The-Art
Since airports are important transportation hubs and military facilities, their detection has significant application values. Optical remote sensing images are usually utilized to detect airports [4]. However, it is impossible to obtain optical remote sensing images in bad weather (such as cloud, fog, rain), which has become an important problem restricting its wide application. In this case, the use of SAR images for airport detection has become a favorable choice.
The methods of airport detection can be roughly divided into two categories: one based on low-level features such as airport edges and geometric features, and the other based on high-level features of airport targets. For the first type of features, most researchers use linear feature detection. Kou et al. [5] proposed an airport detection method for remotely sensed images based on line segment detectors, and Xiong et al. [6] presented an algorithm for detecting airports in SAR images based on the Radon transform and hypothesis testing. These methods rely on the line segment detector (LSD), the Radon transform, or other transformation methods to obtain the linear edge segments of airports, which are then stitched together for airport identification. These methods are simple and fast on small scenes. However, linear segmentation in large-scale SAR images is time-consuming and prone to false detection. For the second type of features, airports are usually detected based on the difference between the airport and the surrounding area. Zhu et al. [7] combined a saliency analysis model with edge detection to detect airports in remote sensing images, and Liu et al. [8] integrated line segmentation and saliency analysis to detect airports in SAR images. However, the airports in these experiments were all small airports with few types of objects and obvious edges, and the saliency model often generates more false alarms when applied to SAR images. To accomplish airport detection from large-scale, high-resolution SAR images, Zhang et al. [9] pre-processed the original image to generate the region of interest (RoI) using adaptive threshold segmentation, and extracted the airport via a binary decision tree. This method could perform airport extraction, but was often confused by road networks.
In recent years, deep learning [10] has been widely used in various fields, especially in object detection and segmentation of optical images. It provides a good technical approach for the traditional target detection from SAR images. Therefore, a target detection network or image segmentation network based on deep learning can be introduced to implement airport detection. Among them, some popular networks have been widely applied in semantic segmentation, which labels each pixel in the image with its corresponding category. Airport runway extraction from SAR images classifies the SAR image pixel by pixel and assigns different category labels to the pixels of the airport runway area and the background area. Remarkable progress has been made in semantic segmentation using convolutional neural networks (CNNs) [11]. The fully connected layer of traditional CNNs classifies feature vectors with fixed length, so it can only accept input images of a specific size. To solve this problem, Long et al. [12] proposed fully convolutional networks (FCN) for image segmentation, and Yang et al. [13] used FCN combined with a conditional random field (CRF) to classify SAR images. Since FCN may produce rough segmentation results, Badrinarayanan et al. [14] proposed the SegNet network, but it has a poor segmentation performance on the edges of objects. DeepLab v1 [15] combined deep convolutional nets (DCNs) and a fully connected CRF, and added atrous (hole) convolution to improve the boundary segmentation effect; DeepLab v2 [16] introduced the atrous spatial pyramid pooling (ASPP) structure on top of DeepLab v1 to address its shortcoming in fusing the information of different layers. RefineNet [17] is an encoder-decoder architecture for semantic segmentation, which utilizes the ResNet_101 module in the encoder and the RefineNet block in the decoder. Peng et al. [18] and Zhang et al. [19] used the encoding network to extract the middle and high-level features of the image, and utilized the decoding network to merge and re-extract the features generated by the encoding network to finally implement the segmentation. Several more recent networks further optimize the accuracy or improve the efficiency of segmentation, such as PSPNet [20], DeepLab v3 [21], DeepLab v3+ [22], and Auto-DeepLab [23].
Researchers have also applied deep learning to airport detection. Yu et al. [24] combined CNN based on the You Only Look Once (YOLO) model with salient features to extract airports and achieved good results. Xiao et al. [25] constructed a Google-LF network to fuse multiscale features, and then the generated features were input into support vector machine (SVM) to produce the detected airports. It accomplished airport detection from remote sensing images with complex background information, but the model was often overfit due to insufficient samples. Fan et al. [26] proposed a layered airport detection algorithm based on spatial analysis and Faster R-CNN to achieve large-scale airport detection from optical remote sensing images. Li et al. [27] built an end-to-end airport detection model from remote sensing images based on a deep transferable convolutional neural network, which overcame the shortcomings of traditional CNN models for airport detection under complex backgrounds.
Most of the above studies have focused on optical remote sensing images, but there are very few studies on the application of deep learning methods to airport detection from SAR images as SAR images are hard to understand, and the speckle noise makes it more difficult for airport detection. However, in light of the tremendous advantages of SAR images, it is necessary to study them further for airport detection from SAR images based on deep learning.

Methodology
We propose the multi-level and densely dual attention (MDDA) framework, which includes three components: down-sampling, a deep learning network for feature extraction, and up-sampling using bilinear interpolation. First, high-resolution SAR images are down-sampled to generate medium-resolution images, so that samples can contain adequate information about the airport. Second, the samples are input into the deep learning network, which includes the encoder and decoder. The encoder utilizes ResNet to produce four-level features, which are input into the decoder. In the decoder, the dense connection and dual attention mechanism are introduced to improve the ability of feature extraction. Then, the airport extraction is performed. Finally, the results are up-sampled by bilinear interpolation to accomplish the airport detection of high-resolution SAR images.

Residual Network
The residual network (ResNet) was proposed by He et al. [28,29] in 2016. It solves the problem that the accuracy on the training set decreases as the network deepens, so that the CNN is no longer hindered by the number of layers and deeper networks can learn better representations. The ResNet is formed by stacking numerous residual units, as shown in Figure 1. For a residual unit, the output is

$$y_l = x_l + F(x_l, w_l) \tag{1}$$

where $F$ is the residual function and $w_l$ denotes the weight; $x_l$ and $y_l$ represent the input and output of the $l$-th residual unit. The activation between the two residual units is realized by the residual function. First, the residual function is used to calculate the residual of the input $x_l$, and then the residual is added to $x_l$ to generate the output. Letting $x_{l+1} = y_l$, we can obtain the output of the $L$-th residual unit by recursively using Equation (1):

$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, w_i) \tag{2}$$

Equation (2) indicates that the output of the $L$-th residual unit can be expressed as the sum of the input of a shallow residual unit and the mappings of all residual functions in between, which shows the good back propagation ability of the network. Assuming the loss function of the network is $\alpha$, the back propagation can be obtained via

$$\frac{\partial \alpha}{\partial x_l} = \frac{\partial \alpha}{\partial x_L}\frac{\partial x_L}{\partial x_l} = \frac{\partial \alpha}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, w_i)\right) \tag{3}$$

It can be seen that $\frac{\partial \alpha}{\partial x_L}$ and $\frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, w_i)$ determine the gradient. The gradient can vanish only if the second term equals $-1$ and thus cancels the identity term. This case is very unlikely to occur in practice, so the gradient flow of the network from high to low layers is very smooth, which makes the training of deep networks possible.
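The identity stated by Equation (2), that the output of a stack of residual units equals the shallow input plus the sum of the intermediate residual terms, can be checked numerically. The following is a minimal scalar sketch, not the ResNet_101 encoder itself: the residual function `residual_branch` (a scaled tanh) and the weights are hypothetical choices made only for illustration.

```python
import math

def residual_branch(x, w):
    """Hypothetical residual function F(x, w); a scaled tanh chosen for illustration."""
    return w * math.tanh(x)

def forward(x0, weights):
    """Stack residual units (output = input + F) and also sum the residual terms."""
    x = x0
    residual_sum = 0.0
    for w in weights:
        r = residual_branch(x, w)
        residual_sum += r
        x = x + r  # identity shortcut plus residual
    return x, residual_sum

def input_gradient(x0, weights, eps=1e-6):
    """Central finite-difference estimate of d(output) / d(input)."""
    hi, _ = forward(x0 + eps, weights)
    lo, _ = forward(x0 - eps, weights)
    return (hi - lo) / (2 * eps)

weights = [0.1, 0.05, 0.2, 0.01]  # assumed small positive weights
out, residual_sum = forward(0.5, weights)
grad = input_gradient(0.5, weights)
```

With small weights the residual terms are small, yet the gradient with respect to the input stays close to 1 because of the identity shortcut, which is the behaviour the text attributes to the residual design.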

Dense Connection
The dense connection links each layer to the others in a feedforward cascade, as shown in Figure 2. DenseNet [30] changes the network architecture by adding skip connections and shorter connections to the residual network, and alleviates the loss or disappearance of input or gradient information after it has been transmitted through many layers. Zhang et al. [31] proposed an encoder-decoder network with dense connection to implement the extraction of water and shadow, and good results were achieved. The traditional L-layer neural network has L connections, while the input of the l-th layer of DenseNet consists of the feature maps of all preceding layers. Taking $x_0, \ldots, x_{l-1}$ as input, we can obtain

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}]) \tag{4}$$

where $[x_0, x_1, \ldots, x_{l-1}]$ denotes the concatenation of the feature maps generated at layers $0, \ldots, l-1$, which are finally combined by $H_l$. Each layer thus reads information from all of its preceding layers and writes its output for all subsequent layers. This promotes the transmission of information through the network, strengthens the propagation of features, and enables features to be used sufficiently.
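The connectivity pattern described above, in which every layer receives the outputs of all earlier layers, can be sketched in a few lines. This is a toy illustration on scalars: the layer function returned by `make_layer` is a hypothetical stand-in for the composite function of a real dense layer, and list collection stands in for feature-map concatenation.

```python
def make_layer(scale):
    """Hypothetical layer: scaled mean of everything it receives."""
    def layer(inputs):
        return scale * sum(inputs) / len(inputs)
    return layer

def dense_forward(x0, layers):
    """Feed each layer the list of all earlier outputs, including the input x0."""
    features = [x0]
    for layer in layers:
        features.append(layer(list(features)))
    return features

layers = [make_layer(s) for s in (1.0, 2.0, 3.0)]
features = dense_forward(1.0, layers)

# Layer l has l incoming connections, so L layers have L * (L + 1) / 2 in total.
n_connections = sum(range(1, len(layers) + 1))
```

For three layers this gives six direct connections instead of three, which is why features and gradients have short paths between any pair of layers.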

Dual-Attention Mechanism
The attention module plays an important role in the field of semantic segmentation. It weights the input feature maps, filters useful feature information, and removes redundant feature information. The attention module can fuse the input global information, and is widely used in computer vision. Chen et al. [32] proposed a feature recalibration network with multi-level spatial features (FRN-MSF) to implement scene classification for 11 types of scenes from SAR images, which incorporated the SENet and achieved a satisfactory classification result. Fu et al. [33] extended two types of attention modules based on the self-attention module, and constructed the position attention module (PAM) and channel attention module (CAM), which work in parallel to capture the global information of the image in the spatial and channel dimensions to obtain rich contextual information.

Position Attention Module (PAM)
The key in semantic segmentation is feature recognition. The PAM builds a positional relationship model between features by capturing global feature information, and selectively aggregates the feature at each position via a weighted sum of the features at all positions. Regardless of the distance, similar features will be related to each other, thus enhancing the ability of the PAM to express the features. Figure 3 shows the working mechanism of the PAM. As shown in Figure 3, the input feature map A (C × H × W) passes through a convolution operation with a BN layer and a ReLU layer to produce three new feature maps A1, A2, and A3. They are all single-channel feature maps derived from feature map A; A1 and A2 have the same dimensions, while A3 differs. After performing the reshape operation on the feature maps A1, A2, and A3, the scale of the feature maps becomes H × W, and a matrix multiplication is then performed on the transposed A1 and A2. B is a position attention map, obtained by applying the softmax layer to the transposed feature map generated by the matrix multiplication:

$$B_{ij} = \frac{\exp(A1_i \cdot A2_j)}{\sum_{i=1}^{N} \exp(A1_i \cdot A2_j)} \tag{5}$$

where N = H × W, and $B_{ij}$ denotes how much the j-th position is affected by the i-th position. The more similar the position information of the two features, the larger the value of $B_{ij}$. A matrix multiplication operation is performed between B and A3 after reshaping, then D is obtained. The final output feature map E (C × H × W) is obtained by adding the reshaped D, scaled by the weighting factor α, to the original feature A. The factor α is initialized to 0 and then learned gradually:

$$E_j = \alpha \sum_{i=1}^{N} (B_{ij}\, A3_i) + A_j \tag{6}$$

It can be seen that each position of the final output feature E is a weighted sum of the features of all positions and the features of the original input, so global semantic information is aggregated.
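The position attention computation described above can be sketched on a tiny feature map. This is a simplified illustration under stated assumptions: the convolution, BN, and ReLU projections that produce A1, A2, and A3 are replaced by the identity, the feature map is already flattened into a list of N position vectors, and the function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def attention_weights(a1, a2, j):
    """Weights over all source positions i for target position j."""
    return softmax([dot(a1_i, a2[j]) for a1_i in a1])

def position_attention(a, a1, a2, a3, alpha):
    """Output position j = alpha * (weighted sum of a3 over all positions) + a[j]."""
    out = []
    for j in range(len(a)):
        b = attention_weights(a1, a2, j)
        mixed = [sum(b_i * a3_i[c] for b_i, a3_i in zip(b, a3))
                 for c in range(len(a[j]))]
        out.append([alpha * m + x for m, x in zip(mixed, a[j])])
    return out

# Three positions with two channels each; positions 0 and 1 are similar.
a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
w0 = attention_weights(a, a, 0)
unchanged = position_attention(a, a, a, a, alpha=0.0)
attended = position_attention(a, a, a, a, alpha=1.0)
```

With the weighting factor at its initial value of 0 the module passes the input through unchanged, and as the factor grows each position receives more of the features of the positions it resembles.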
The CAM is mainly oriented to high-level features. Each channel mapping of high-level feature can be regarded as a type of response, and there is a close relationship between these types of responses. To enhance the feature map's ability to express specific semantics, the CAM obtains the interdependence relation between different channel mappings, and its working mechanism is shown in Figure 4.
Unlike the PAM, the three features A1, A2, and A3, which have the same dimensions, are produced directly by reshaping the original feature A. A matrix multiplication operation is performed on A2 and A1, and the result is then processed by softmax to generate the feature map X (C × C):

$$X_{ij} = \frac{\exp(A1_i \cdot A2_j)}{\sum_{i=1}^{C} \exp(A1_i \cdot A2_j)} \tag{7}$$

where $X_{ij}$ represents the effect of the i-th channel on the j-th channel. After a matrix multiplication of X and A3, a reshape operation is performed to generate D. Finally, the output feature map E (C × H × W) is obtained by adding the original feature A and D multiplied by the weighting factor β, which is initialized to 0 and then learned gradually:

$$E_j = \beta \sum_{i=1}^{C} (X_{ij}\, A3_i) + A_j \tag{8}$$

It can be seen that the output feature of each channel is the weighted sum of the features of all channels and the original features. The CAM encodes the global semantic relationship among the feature maps of different channels and improves the discriminability of the feature maps.
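The channel attention computation can be sketched in the same spirit, now with the similarity taken between whole channels rather than between positions. This is a minimal illustration, assuming the feature map is given as C flattened channel maps of length N = H × W; the function names are hypothetical, and the max-subtraction step used in the implementation described later in this paper is omitted here.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def channel_attention(channels, beta):
    """Output channel j = beta * (weighted sum of all channels) + channel j."""
    out = []
    for target in channels:
        # Similarity of every channel to the target channel, normalized by softmax.
        sims = [sum(p * q for p, q in zip(source, target)) for source in channels]
        x = softmax(sims)
        mixed = [sum(x_i * source[n] for x_i, source in zip(x, channels))
                 for n in range(len(target))]
        out.append([beta * m + v for m, v in zip(mixed, target)])
    return out

# Three channels over four positions; channels 0 and 1 carry the same response.
feature = [[1.0, 2.0, 0.0, 1.0], [1.0, 2.0, 0.0, 1.0], [0.0, -1.0, 3.0, 0.0]]
unchanged = channel_attention(feature, beta=0.0)
attended = channel_attention(feature, beta=0.5)
```

The output keeps the C × N shape of the input, and channels with identical responses remain identical after attention because they receive the same weights.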

The Proposed Automatic Airport Extraction Algorithm
To extract the airport, the multi-level and densely dual attention (MDDA) framework shown in Figure 5 is proposed in this paper. The framework mainly includes two parts: the encoding network and the decoding network. The encoding network employs the ResNet_101 [28] residual network to perform multi-level feature extraction on the input dataset. The decoding network introduces a dense connection and dual attention mechanism to fuse the multi-level features and further extract essential features. It mainly consists of four modules: dual attention fusion and extraction (DAFE), CRP_Net_1, CRP_Net_2, and CRP_Net_3. The last three modules have the same internal structure. Each low-resolution feature produced by ResNet is sent to all the previous modules with higher resolution to achieve adequate fusion of features with different resolutions. After the airport segmentation is realized by extracting features in the decoding network, up-sampling is carried out by bilinear interpolation to obtain the large-scale airport segmentation results, where the up-sampling factor equals the factor by which the input SAR image was down-sampled at the beginning. Finally, the airport extraction result is fused with the SAR image to generate a fusion image.
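The final up-sampling step can be sketched as plain bilinear interpolation of a low-resolution result back to the original grid. The sketch below is illustrative only: the pixel-centre sampling convention, the 0.5 threshold, and the function name `bilinear_upsample` are assumptions, and the factor of five simply mirrors the down-sampling factor used for the dataset in this paper.

```python
def bilinear_upsample(image, factor):
    """Up-sample a 2-D list of floats by an integer factor (pixel-centre convention)."""
    h, w = len(image), len(image[0])
    out = []
    for r in range(h * factor):
        # Map the output row centre back to input coordinates and clamp to the grid.
        y = min(max((r + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        fy = y - y0
        row = []
        for c in range(w * factor):
            x = min(max((c + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            fx = x - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

# A 2 x 2 runway probability map: left column background, right column runway.
prob = [[0.0, 1.0], [0.0, 1.0]]
up = bilinear_upsample(prob, 5)
mask = [[1 if v >= 0.5 else 0 for v in row] for row in up]
```

Interpolating the probabilities before thresholding yields a smoother runway boundary than enlarging a hard mask by pixel repetition; whether the original implementation interpolates probabilities or labels is not stated in the text.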

Dense Connection
As shown in Figure 6a, CRP_Net_x (x = 1, 2, 3) is composed of a residual convolutional unit (RCU), multi-resolution fusion (MRF), and chained residual pooling (CRP). The RCU [17] is a residual unit [28] with the BN layer removed. The MRF module (as shown in Figure 6b) consists of a series of parallel convolutions and down-samplings, which are used to fuse the features from different resolutions. The CRP (as shown in Figure 6c) [17] is the core module of CRP_Net_x. It consists of the ReLU activation function, pooling units, and convolution units. The features extracted by the ResNet network are input into CRP_Net_x for further processing. First, an RCU is used to fine-tune the weights trained by ResNet. Then, the MRF module is utilized to fuse the input features from ResNet and the output features from lower resolutions. Next, the CRP module is employed to extract global semantic information, and finally, the result is output through an RCU.
The dense connection is mainly reflected in the connection between the features of different resolutions in the decoding network. As shown in Figure 5, the input of CRP_Net_x (x = 1, 2, 3) contains two parts: one is the feature map from the residual network, and the other is the feature maps from all CRP_Net_x modules with lower resolutions. This allows each CRP_Net_x module to make full use of the previous middle and high-level semantic features, which are finally input into the DAFE module, thus repeatedly fusing and re-extracting the features. The dense connection fuses the features of four resolutions, which makes the training gradient transfer effectively between the CRP_Net_x modules and the DAFE module, and avoids the disappearance of the gradient.

The implementation of PAM
As shown in Figure 7a, the detailed implementation process of the PAM can be divided into three stages. Query1, Key1, and Value1 are all the position variables generated by the input. Query2 is obtained when the 'reshape' operation is performed on Query1, and Query3 is acquired when Query2 is transposed. Reshaping Key1 and Value1, we can gain Key2 and Value2, respectively. All of the element values of the input image can be regarded as a collection of <Query, Key>. In the first stage, a matrix multiplication function is introduced to calculate the similarity of the positional relationship between the two variables.
In the second stage, softmax is introduced to numerically convert the S1 and S2 obtained in the first stage. One purpose is to perform normalization, and the other purpose is to emphasize the weight of elements in important positions, which is more prominent through the internal mechanism of softmax.

$$a1 = \mathrm{Softmax}(S1) = \frac{e^{S1}}{e^{S1} + e^{S2}}, \qquad a2 = \mathrm{Softmax}(S2) = \frac{e^{S2}}{e^{S1} + e^{S2}} \tag{10}$$

The calculated a1 and a2 are the weight coefficients corresponding to Value2 and Value1. In addition, matrix multiplication is performed, and then the position attention values are produced after the operation of the weight and sum.

$$\text{Position Attention} = a1 * Value2 + a2 * Value1 \tag{11}$$

Through the above calculation of the three stages, the position attention value for Query3 can be obtained.

The implementation of CAM
The calculation process for the CAM is shown in Figure 7b. Unlike the PAM, ProjQuery1, ProjKey1, and ProjValue1 are all directly reshaped from the input, and ProjKey2 is generated by transposing ProjKey1. The calculation method of S1 and S2 is the same as that of the PAM, but the difference is that S1 and S2, obtained by the CAM in the first stage, are not directly input to softmax. In the CAM, the maximum value of the elements in each dimension of the channel tensor is selected and expanded to the original dimensions, and the original elements are then subtracted from the expanded maximum.

$$T_{S1} = \mathrm{Expand\_dim}(\mathrm{Max}(S1)) \tag{12}$$

In the second stage, softmax is introduced to perform numerical conversion on $T_{S1} - S1$ and $T_{S2} - S2$ obtained in the first stage (a2 is obtained analogously to a1).

$$a1 = \mathrm{Softmax}(T_{S1} - S1) = \frac{e^{T_{S1} - S1}}{e^{T_{S1} - S1} + e^{T_{S2} - S2}} \tag{13}$$

The calculated values of a1 and a2 are each matrix-multiplied with ProjValue1, and then weighted and summed to obtain the channel attention value.

$$\text{Channel Attention} = (a1 + a2) * ProjValue1 \tag{14}$$

The PAM weights the position features of all semantic features, and selectively aggregates features at each position. Regardless of whether the position is near or far, similar features are related to each other. The CAM integrates the relationships between all feature channels, and selectively emphasizes the interdependent channel features. The entire dual attention weights and selects the position features and channel features, retains useful features, discards low-level features, and further improves feature representation to make the segmentation results more precise.
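The effect of subtracting the scores from their expanded maximum before the softmax can be seen on a small example. The sketch below assumes the scores are arranged as a matrix whose rows are normalized independently, which is an interpretation of the description above rather than a confirmed detail; `cam_weights` is a hypothetical name.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cam_weights(scores):
    """Per row: take the maximum, broadcast it, subtract the scores, then softmax."""
    weights = []
    for row in scores:
        top = max(row)                 # Max(...) followed by expansion to the row length
        shifted = [top - v for v in row]
        weights.append(softmax(shifted))
    return weights

scores = [[4.0, 1.0, 0.0], [1.0, 3.0, 2.0]]
weights = cam_weights(scores)
negated = [softmax([-v for v in row]) for row in scores]
```

Because the softmax is invariant to adding a constant, this step is equivalent to a softmax over the negated scores: the entries with the lowest raw similarity receive the largest weights, and all exponents stay bounded.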

The Training Process of the Framework
The framework of the MDDA network proposed in this paper consists of two parts: one is the encoding network ResNet_101, and the other is the decoding network including the CRP_Net_x (x = 1, 2, 3) modules and the DAFE module. Dense connections are utilized between the CRP_Net_x (x = 1, 2, 3) modules and also between the CRP_Net_x modules and the DAFE module. In the DAFE module, the dual attention mechanism is introduced. The entire training process of the MDDA network is as follows:
Input: Datasets including small SAR images and corresponding ground truth.
Training:
(1) Initialization: the encoding network is initialized from the ImageNet pre-trained model, and the training data are loaded.
(2) The loaded training data are input to ResNet_101 to extract multi-level features.
(3) The decoding network fuses and re-extracts the features extracted by the encoding network. Dense connections enhance gradient propagation between features, and dual attention selects the features by weights.
(4) The back propagation (BP) algorithm performs end-to-end training of the whole network.
(5) The softmax function calculates the probabilities that the network output is mapped to the runway and background categories by the following formula:

$$p_k = \frac{e^{X_k}}{\sum_{j=1}^{K} e^{X_j}} \tag{15}$$

where $X_k$ represents the network output for the pixels corresponding to the k-th category, and K is the number of sample categories. $p_k$ denotes the probability of the k-th category being predicted correctly after the softmax function. The network employs the cross-entropy loss as the optimization function, which is shown as follows:

$$Loss = -\sum_{k=1}^{K} y_k \log(p_k) \tag{16}$$

where $y_k$ is an indicator variable equal to 0 or 1, that is, when the category k is the same as the sample category, $y_k = 1$; otherwise, $y_k = 0$.
Since there are only two types of targets in this paper, runway areas and background, the binary form of the cross-entropy loss function can be used directly.
Output: Trained model for airport extraction.
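Step (5) and the loss can be illustrated for the two-category case of runway and background. This is a minimal sketch, not the training code of the paper: the per-pixel scores in `logit_map` are made-up numbers, the class indices (0 for background, 1 for runway) are an assumed convention, and averaging the loss over pixels is an assumption.

```python
import math

def softmax(logits):
    """Map raw per-class scores to probabilities that sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class of one pixel."""
    return -math.log(softmax(logits)[target])

def mean_pixel_loss(logit_map, label_map):
    """Average the per-pixel cross-entropy over a 2-D map."""
    losses = [cross_entropy(logits, target)
              for logit_row, label_row in zip(logit_map, label_map)
              for logits, target in zip(logit_row, label_row)]
    return sum(losses) / len(losses)

# 2 x 2 map with scores [background, runway] per pixel (illustrative values).
logit_map = [[[2.0, -1.0], [0.0, 0.0]],
             [[-3.0, 3.0], [1.0, 2.0]]]
label_map = [[0, 0],
             [1, 1]]
loss = mean_pixel_loss(logit_map, label_map)
uncertain = cross_entropy([0.0, 0.0], 0)
```

A pixel with equal scores for both categories contributes a loss of log 2, and the loss approaches zero as the score of the true category dominates.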

Dataset Used in the Experiment
To validate the proposed MDDA framework, SAR images with 1-m resolution from the Gaofen-3 system were utilized. Many large-scale SAR images including airports were used in the experiment. First, the SAR images were down-sampled by a factor of five to generate medium-resolution images. Then, the ground truth, which includes runway areas and the background, was produced with the Image Labeler of MATLAB. The runway area, which includes runways, taxiways, and parking lots, is marked red, and other targets are regarded as background. In addition, the down-sampled SAR images and corresponding ground truth were cut into small images of 500 × 500 pixels to generate the dataset. After data augmentation using flip, mirror, and shift, a total of 2479 samples were obtained, and the ratio of the training set to the validation set was 3:1. To test the model generated by training the proposed framework, four SAR images including airports that were not used in making the dataset were utilized to extract the runway areas of airports. Figure 8 shows some samples of the airports in the experiment. Figure 8a-c are the SAR image, the ground truth, and the optical remote sensing image of Hongqiao Airport in Shanghai, China. Figure 8d-f show the same three types of images for Capital Airport. From these samples, we can note that the background of Hongqiao Airport is relatively simple, while the background of Capital Airport is more complicated, which makes the airport harder to extract.
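The dataset preparation described above can be sketched as tiling, simple augmentation, and a 3:1 split. The sketch is illustrative only: the image size in the example, the non-overlapping tiling, and the rule that every fourth sample goes to validation are assumptions, and the shift augmentation is omitted because its parameters are not given in the text.

```python
def tile_origins(height, width, tile=500):
    """Top-left corners of non-overlapping tile x tile crops that fit in the image."""
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]

def flip_vertical(patch):
    """Reverse the row order of a 2-D patch."""
    return patch[::-1]

def mirror_horizontal(patch):
    """Reverse the column order of a 2-D patch."""
    return [row[::-1] for row in patch]

def split_train_val(samples, ratio=3):
    """Send every (ratio + 1)-th sample to validation, giving roughly ratio:1."""
    train = [s for i, s in enumerate(samples) if i % (ratio + 1) != ratio]
    val = [s for i, s in enumerate(samples) if i % (ratio + 1) == ratio]
    return train, val

# Hypothetical 10000 x 15000 pixel scene at 1 m, down-sampled by a factor of five.
origins = tile_origins(10000 // 5, 15000 // 5)
train, val = split_train_val(list(range(2479)))
```

Any augmentation applied to a SAR patch has to be applied identically to its ground-truth patch so that the labels stay aligned with the pixels.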

Evaluation Measurements
To evaluate the extraction precision of airports, pixel accuracy (PA) and intersection over union (IoU) were used following previous works [21-23,34]. PA denotes the ratio of correctly extracted pixels to the total pixels of the type of targets, and IoU represents the ratio of the intersection to the union of the extracted results and the ground truth. Mean pixel accuracy (MPA) is the mean proportion of correctly classified pixels over all categories, and mean intersection over union (MIoU) denotes the mean IoU over all types of targets. The specific calculation formulas are as follows [34].

$$PA_i = \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij}}, \qquad MPA = \frac{1}{k+1} \sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij}}$$

$$IoU_i = \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}, \qquad MIoU = \frac{1}{k+1} \sum_{i=0}^{k} \frac{P_{ii}}{\sum_{j=0}^{k} P_{ij} + \sum_{j=0}^{k} P_{ji} - P_{ii}}$$

where k + 1 is the total number of categories (because the background is also a category). $P_{ij}$ denotes the number of pixels that originally belong to class i but are predicted to be class j, which are false positive samples. $P_{ji}$ indicates the number of pixels that originally belong to class j, but are predicted to be class i, which are false negative samples. $P_{ii}$ means the number of pixels correctly classified in class i.
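The four measures can be computed from a confusion matrix whose entry in row i and column j counts the pixels of class i predicted as class j. The following sketch uses a made-up example of ten pixels with two categories (0 for background, 1 for runway area); the function names are hypothetical, and every category is assumed to occur at least once so that no denominator is zero.

```python
def confusion_matrix(truth, pred, n_classes=2):
    """m[i][j] = number of pixels of true class i predicted as class j."""
    m = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, pred):
        m[t][p] += 1
    return m

def pixel_accuracy(m, i):
    """Correctly classified pixels of class i over all pixels of class i."""
    return m[i][i] / sum(m[i])

def iou(m, i):
    """Intersection over union of the predicted and true regions of class i."""
    predicted_as_i = sum(row[i] for row in m)
    return m[i][i] / (sum(m[i]) + predicted_as_i - m[i][i])

def mean_over_classes(metric, m):
    """Average a per-class metric over all classes."""
    return sum(metric(m, i) for i in range(len(m))) / len(m)

truth = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
pred = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
m = confusion_matrix(truth, pred)
mpa = mean_over_classes(pixel_accuracy, m)
miou = mean_over_classes(iou, m)
```

In this example the per-class pixel accuracies are 3/4 and 5/6 and the per-class IoU values are 3/5 and 5/7, so MIoU is lower than MPA because every misclassified pixel enlarges the union of two classes at once.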

Experiment Analysis and Evaluation
To test the proposed MDDA framework, four SAR images covering airports that were not used in training and validation were utilized. Furthermore, two popular deep neural networks for semantic segmentation (RefineNet [17] and DeepLabV3 [21]) were used as baselines. DeepLabV3 is a network for semantic segmentation whose performance on PASCAL VOC 2012 was among the best of the state-of-the-art models in 2017. RefineNet uses a multi-level structure and a chained residual pooling strategy to accomplish semantic segmentation, and at the time of its publication it also attained much better performance on PASCAL VOC 2012 than the vast majority of networks.

The extraction result of Airport I
Figure 9 shows the extraction results for Airport I. Figure 9a is the SAR image of Airport I from the Gaofen-3 system with a 1-m resolution, and Figure 9b is the image obtained by down-sampling (a) by a factor of five. The texture of the targets in (a) is much clearer than that in (b). Figure 9c is the ground truth of the airport corresponding to (b). Airport I is a civil airport. According to Figure 9a, there are many buildings around Airport I, and the traffic lines are intertwined and complicated. The airport differs relatively clearly from the surrounding ground features, visually showing a large area of gray and black, and the runway area is black. Comparing Figure 9d-f with the ground truth in Figure 9c, we can see that result (f) had the highest overlap with (c), which indicates that MDDA extracted the best result for Airport I. From Figure 9g, RefineNet had a great number of missed detections, most of which were parking lots and some runways. Figure 9h shows that DeepLabV3 had some missed detections and false detections, and the integrity of the edge extraction for the runway area was not high. For MDDA, by contrast, only a small part of the runway was not detected, the missed detection rate was low, and there was no false detection. Compared with Figure 9g-i, much more detailed information is visible in Figure 9j-l because of the high resolution, which also demonstrates that the proposed MDDA framework can obtain a satisfactory airport extraction result for high-resolution SAR images.

• The result of Airport II
Figure 10 shows the extraction results for Airport II. Figure 10a-l are the same types of images as the corresponding images in Figure 9. According to Figure 10a, the buildings around the airport are sparse, but there is a large area of water, and the characteristics of the water and the runway area are very similar. As shown in Figure 10h,k, DeepLabV3 misdetected a large number of water bodies as runway areas, and there were also some missed detections for the airport. Figure 10g,j shows that RefineNet obtained a better extraction result than DeepLabV3, but there were still some false alarms and missed detections. According to Figure 10f,i,l, MDDA achieved the best detection performance for Airport II, with no false alarms and only a few missed detections.

• The result of Airport III
Figure 11 shows the extraction results for Airport III. There are dense buildings and terraces to the west of the airport, and there are also water bodies nearby. Figure 11a-l are the same types of images as the corresponding images in Figure 9. According to the result images (d), (e), (f) and the fusion maps (g), (h), (i), RefineNet and DeepLabV3 both misdetected water areas as runway areas (see the yellow boxes), and there were many missed detections (see the green boxes). Because the extended runways at both ends of the airport runway differ visibly from the main runway area, none of the three networks detected the extended runways. Apart from the missed extended runway areas, MDDA detected all of the remaining runway areas, and there were no false alarms.

• The result of Airport IV
Figure 12 presents the extraction results for Airport IV, which is Hongqiao Airport. The area contains extensive road networks, which are very likely to cause false detections. Figure 12a-l are the same types of images as the corresponding images in Figure 11.
Based on the results shown in Figure 12, all three networks can extract the airport runways, and the crossing roads did not become false alarms. According to Figure 12d,g,j, RefineNet missed a great many runways (marked by green boxes), which gave it the worst detection performance of the three networks. DeepLabV3 greatly reduced the runway areas missed by RefineNet and clearly improved the overall detection performance, but it produced some false alarms (marked by yellow boxes). According to Figure 12f,i,l, the results generated by the MDDA network had the fewest missed runways and no false alarms, so its extraction of the airport runways from SAR images was the best of the three networks.
To analyze the extraction performance for the airports, Table 1 gives the extraction accuracies of the different networks for the four airports. According to Table 1, the proposed MDDA framework had the best extraction performance, with the fewest missed detections and almost no false alarms. The mean pixel accuracy (MPA) of runway areas reached 0.9811 and the MIoU reached 0.9707, which demonstrates the advantage of MDDA. RefineNet produced the most missed detections. Large runway areas were not detected in three of the airports (all except Airport II), and there were a small number of false detections in Airport II, Airport III, and Airport IV. DeepLabV3 had the most false alarms, and its false alarms for Airport II were the most serious among the extraction results generated by the three networks. The experimental results indicate that RefineNet and DeepLabV3 learn airport features less effectively and cannot reliably distinguish runway areas from similar areas, resulting in poor detection integrity and false alarms.
In MDDA, by contrast, the transmission between features is enhanced by introducing dense connections, and redundant features are discarded while useful features are retained by introducing the dual attention mechanism. The network's ability to learn features is therefore improved, which yields runway extraction results that are free of false alarms and highly complete. Compared with the other two networks, MDDA could extract almost the entire runway edge line.
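As a rough illustration of how the two accuracy measures reported in Table 1 can be computed, the sketch below uses the commonly used definitions of mean pixel accuracy (MPA) and mean intersection over union (MIoU) on a binary runway/background mask. The exact definitions used for Table 1 are not restated in this section, so the formulas, the function names `confusion_matrix` and `mpa_and_miou`, and the toy masks are assumptions for illustration only.

```python
import numpy as np

def confusion_matrix(truth, pred, num_classes=2):
    """Count pixels; rows are ground-truth classes, columns are predictions."""
    idx = truth.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mpa_and_miou(cm):
    """Return (MPA, MIoU) from a confusion matrix.

    Assumes every class occurs at least once in the ground truth,
    otherwise the per-class ratios would divide by zero.
    """
    tp = np.diag(cm).astype(float)
    per_class_acc = tp / cm.sum(axis=1)                       # TP / (TP + FN)
    per_class_iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)  # TP / (TP + FN + FP)
    return per_class_acc.mean(), per_class_iou.mean()

# Toy masks (1 = runway area, 0 = background); not data from the paper.
truth = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 0, 0],
                  [0, 0, 0, 0]])
pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 0],   # one runway pixel is missed
                 [0, 0, 0, 0],
                 [0, 0, 0, 0]])

cm = confusion_matrix(truth, pred)
mpa, miou = mpa_and_miou(cm)
print(f"MPA = {mpa:.4f}, MIoU = {miou:.4f}")
```

Under these definitions a missed detection lowers the runway-class accuracy and both class IoUs, while a false alarm lowers the background-class accuracy and both class IoUs, which is why the two measures together reflect both kinds of error discussed above.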
To analyze the details of the extracted airport more clearly, Figure 13 shows an enlarged view of a small area of Airport I. Figure 13a-l are the same types of images as the corresponding images in Figure 12. The views in (g), (h), and (i) show that MDDA can extract the details and edge information well. RefineNet is a typical semantic segmentation network, but its decoding network simply passes features from one level to the next, so the extraction was poor, with many missed detections (shown in the green boxes). DeepLabV3 adds atrous (dilated) convolution to expand the receptive field, but the lack of an attention mechanism makes the learned features redundant. It therefore cannot extract the detailed information well and is prone to false alarms. The proposed MDDA network mitigates these problems, and the detailed images show that MDDA extracts the airport runway area much better. In addition, Figure 13j-l shows more detail than Figure 13g-i because of the higher resolution.
To better verify the performance of the proposed network, more images were used for testing based on data augmentation techniques [34]. Because large-scale airport scenes in high-resolution SAR images are limited, the four airports tested in this paper were horizontally flipped, vertically flipped, rotated 90° clockwise, and rotated 90° counterclockwise to obtain 16 new airport images. We then applied the three networks to the 16 images to obtain the extraction accuracy for each airport. For the four images of each airport, a mean accuracy was computed for every network, as shown in Table 2. Compared with Table 1, the accuracy of each network for every airport was nearly the same, which illustrates the stability of the three networks for extracting airport runways.
Figure 14 shows the extraction results for one of the 16 augmented airport images, namely the horizontally flipped image of Airport I. According to Table 2 and Figure 14, the detection accuracy is nearly the same as that in Figure 9, which also demonstrates the stability of the networks.
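The augmentation used for this stability test (horizontal flip, vertical flip, and 90° rotations in both directions) can be sketched with NumPy as follows. This is a minimal sketch: the function name `augment` and the small stand-in arrays are hypothetical and do not come from the paper, and in practice the ground-truth mask would need to be transformed in exactly the same way as its image.

```python
import numpy as np

def augment(image):
    """Return the four transformed copies of a 2-D image described in the text."""
    return {
        "horizontal_flip": np.fliplr(image),
        "vertical_flip": np.flipud(image),
        "rot90_clockwise": np.rot90(image, k=-1),        # negative k rotates clockwise
        "rot90_counterclockwise": np.rot90(image, k=1),  # default direction of np.rot90
    }

# Small stand-in arrays in place of the four test SAR images (placeholders only).
airports = {
    name: np.arange(6).reshape(2, 3) + 10 * i
    for i, name in enumerate(["Airport I", "Airport II", "Airport III", "Airport IV"])
}

# Four airports times four transforms gives the 16 augmented test images.
augmented = {
    (name, kind): transformed
    for name, image in airports.items()
    for kind, transformed in augment(image).items()
}
print(len(augmented))
```

Averaging a network's accuracy over the four transformed copies of one airport then gives one entry of the kind reported in Table 2.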

Discussion
In this paper, we propose a multi-level and densely dual attention (MDDA) network to extract airport runways from high-resolution SAR images. First, the high-resolution SAR images were down-sampled to generate medium-resolution SAR images, and then the samples were produced. Second, the MDDA network, which integrates ResNet, the dual attention mechanism, dense connection, and a multi-level structure, was used to extract the airport runways by making full use of the effective features of the runway areas. Finally, bilinear interpolation was applied to obtain the runway extraction results for the high-resolution SAR images. According to the experimental results, MDDA achieved satisfactory performance for airport runway extraction and obtained the highest accuracy of the three networks.
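The first and last steps of this pipeline can be sketched in a few lines of NumPy. The sketch below is illustrative only: the text states a down-sampling factor of five and bilinear up-sampling, but not the down-sampling method, so the block averaging in `downsample_mean` is an assumption, and the thresholding line is merely a placeholder for the MDDA network output. The function names and the random stand-in tile are hypothetical.

```python
import numpy as np

def downsample_mean(image, factor=5):
    """Block-average a 2-D image by `factor` (assumed method, see lead-in)."""
    h, w = image.shape
    h, w = h - h % factor, w - w % factor        # crop to a multiple of factor
    blocks = image[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def upsample_bilinear(grid, factor=5):
    """Bilinearly interpolate a 2-D map to `factor` times its size."""
    h, w = grid.shape
    # Centres of the output pixels expressed in input pixel coordinates.
    ys = np.clip((np.arange(h * factor) + 0.5) / factor - 0.5, 0, h - 1)
    xs = np.clip((np.arange(w * factor) + 0.5) / factor - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]                      # vertical interpolation weights
    wx = (xs - x0)[None, :]                      # horizontal interpolation weights
    top = grid[y0][:, x0] * (1 - wx) + grid[y0][:, x1] * wx
    bottom = grid[y1][:, x0] * (1 - wx) + grid[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

# Random stand-in for a high-resolution SAR tile (not real data).
sar = np.random.default_rng(0).random((500, 500))
medium = downsample_mean(sar)                          # medium-resolution image
runway_prob = (medium > 0.5).astype(float)             # placeholder for the MDDA output
full_res_mask = upsample_bilinear(runway_prob) >= 0.5  # back to the original size
print(medium.shape, full_res_mask.shape)
```

Interpolating the network's output map and thresholding afterwards, as done here, is one possible design choice; it tends to give smoother runway edges at full resolution than up-sampling an already binarized mask with nearest-neighbour copying.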
We also noted that the training speed of MDDA was relatively slow, and improving it will be our next research direction. Moreover, only SAR images from the Gaofen-3 system were used in the experiments, so high-resolution SAR images with different bands and different resolutions will be tested in further research. Once the runways are extracted, the aircraft in the airport can be detected more accurately, which not only reduces the false alarms of aircraft detection but also substantially increases the detection speed. This is therefore also part of our future work.

Conclusions
To accomplish automatic airport detection from high-resolution SAR images, a new framework named MDDA was proposed, which integrates ResNet, dense connection, CRP, and the dual attention mechanism. The dense connection takes advantage of the features generated by ResNet at different levels. The dual attention mechanism extracts the position features and channel features separately, and weights them according to their significance for classification. To implement airport detection from high-resolution SAR images, two additional processes are performed. One is down-sampling the original SAR images to medium-resolution ones, so that the samples can contain more spatial features of the airport. The other is up-sampling the extraction result generated by the MDDA network to obtain the airport extraction at the resolution of the original SAR images.
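To make the idea of position and channel attention concrete, the following NumPy sketch shows a simplified dual attention computation in the general style of dual attention networks. It is not the DAFE module of this paper: the learned convolutions are omitted, the scalars `gamma` and `beta` stand in for learned weights, summing the two branches is an assumed fusion rule, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feat, gamma=1.0):
    """Each spatial position aggregates features from all positions.

    feat has shape (C, H, W); gamma stands in for a learned scalar.
    """
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)            # (C, N) with N = H * W
    attn = softmax(x.T @ x, axis=-1)      # (N, N), each row sums to 1
    out = x @ attn.T                      # weighted sum over positions
    return (gamma * out + x).reshape(c, h, w)

def channel_attention(feat, beta=1.0):
    """Each channel aggregates features from all channels."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    attn = softmax(x @ x.T, axis=-1)      # (C, C), each row sums to 1
    out = attn @ x                        # weighted sum over channels
    return (beta * out + x).reshape(c, h, w)

def dual_attention(feat, gamma=1.0, beta=1.0):
    """Sum of the two attention branches (assumed fusion rule)."""
    return position_attention(feat, gamma) + channel_attention(feat, beta)

# Random stand-in for a feature map with 4 channels on a 3 x 3 grid.
feat = np.random.default_rng(0).standard_normal((4, 3, 3))
fused = dual_attention(feat)
print(fused.shape)
```

The position branch lets similar locations reinforce each other regardless of distance, and the channel branch does the same across feature channels, which is the sense in which features are weighted by their relevance.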
Three Gaofen-3 SAR images including different airports were utilized to test the proposed MDDA framework. Compared with two existing semantic segmentation networks, namely RefineNet and DeepLabV3, MDDA achieved much better performance for airport extraction, reaching 0.98 in MPA and 0.97 in MIoU. In addition, the extraction results show few missed detection areas and no false alarms for MDDA, which indicates that it can effectively extract the airport runway areas while preserving the integrity of the details.