Automatic Mapping of Thermokarst Landforms from Remote Sensing Images Using Deep Learning: A Case Study in the Northeastern Tibetan Plateau

Thawing of ice-rich permafrost causes thermokarst landforms on the ground surface. Obtaining the distribution of thermokarst landforms is a prerequisite for understanding permafrost degradation and carbon exchange at local and regional scales. However, because of their diverse types and characteristics, it is challenging to map thermokarst landforms from remote sensing images. We conducted a case study towards automatically mapping a type of thermokarst landforms (i.e., thermo-erosion gullies) in a local area in the northeastern Tibetan Plateau from high-resolution images by the use of deep learning. In particular, we applied the DeepLab algorithm (based on Convolutional Neural Networks) to a 0.15-m-resolution Digital Orthophoto Map (created using aerial photographs taken by an Unmanned Aerial Vehicle). Here, we document the detailed processing flow with key steps including preparing training data, fine-tuning, inference, and post-processing. Validating against the field measurements and manual digitizing results, we obtained an F1 score of 0.74 (precision is 0.59 and recall is 1.0), showing that the proposed method can effectively map small and irregular thermokarst landforms. It is potentially viable to apply the designed method to mapping diverse thermokarst landforms in a larger area where high-resolution images and training data are available.


Introduction
Permafrost is the ground at or below 0 • C for two or more consecutive years. It underlies about 24% of the land surface in the Northern Hemisphere [1]. According to the borehole temperature records, permafrost is undergoing unprecedented warming in both the Arctic and on the Tibetan Plateau [2][3][4][5][6][7]. Ground warming can cause permafrost degradation mainly through widespread active layer thickening and development of thermokarst landforms [7][8][9][10]. Thermokarst landforms and processes (due to thawing of ice-rich permafrost) have complex interactions with hydrology, vegetation, slope aspects, and soil texture in local environment over time, which characterizes the distinctive landforms on the ground [11,12]. Development of thermokarst landforms may accelerate the decay of permafrost carbon and release of greenhouse gases, affecting the global climate [13][14][15][16].
The area is approximately 6 km 2 . The near-surface soils are covered by a peat layer whose thickness ranges from 0.4 to 4 m [37][38][39]. Permafrost thickness varies from a few meters to tens of meters; permafrost is ice-rich; and its temperature is within a fraction of a degree from the melting point [40,41]. The average thickness of the active layer is 1.32 ± 0.29 m with a range from 0.81 to 2.1 m [41,42]. The vegetation is dominated by alpine swamp meadow and alpine meadow, where hummocks are well developed and characterize the microtopography in this area [41,43].   Three types of thermokarst landforms, including thermo-erosion gullies, sinkholes, and thermokarst pits, are present in this area. Sinkholes and thermokarst pits are rare and subtle on remote sensing images. Hence, we only target thermo-erosion gullies (hereafter referred to as "gullies") in this study. These gullies are characterized by erosional features such as collapsed ground, ponds, disturbed vegetation, and exposed soil (Figure 2a). Most of them are narrow features on the images (Figure 2a,e). Their width and length range from around 1 to 30 m and 20 to 500 m, respectively. Most of these gullies also incise more than one meter into the ground (Figure 2a).
In addition to thermokarst landforms, diverse land cover types such as water tracks, bare land, and roads are found in the study area (Figures 1 and 2). Anthropogenic disturbance and natural processes contribute to bare land (Figure 2b,f) and man-made objects (Figure 2c,g). The low permeability of permafrost and saturated organic soil layer lead to many water tracks (Figure 2d), which are also narrow features on remote sensing images (Figure 2h).

Collection of UAV Images and Creation of Digital Orthophoto Map (DOM)
We collected the UAV images in the field in July 2016. The UAV operated on Trimble UX5 platform and was equipped with Sony A5100 camera. We used the Trimble Access Aerial Imaging application software for flight planning and monitoring as well as for collecting aerial photographs. We deployed 12 ground control points (GCPs) that are evenly distributed within the study area and measured their locations using Real-time Kinematic (RTK) with centimeter-level positioning accuracy. We used the Trimble Inpho UASmaster software to create a DOM using the measured GCPs and to generate a digital elevation model (DEM). Table S1 in the Supplementary Materials lists the software parameters and settings. The final DOM ( Figure 1) has a spatial resolution of 0.15 m and contains three spectral bands of red, green, and blue. The DEM covers the entire study area. It was used for assisting the manual digitizing of gullies and simulating stream vectors. The horizontal resolution and vertical accuracy of the DEM are 0.41 m and 0.12 m, respectively.

Collection of Ground Truth Polygons
The ground truth polygons were collected using the RTK measurements in the field and manually delineated (digitizing) on the DOM. In the field, we walked along the edges of the thermo-erosion gullies and measured the boundaries of nine gullies in July 2016. We considered gully banks with collapse features as parts of gullies. We manually delineated seven gullies in QGIS (version 2.18.14, qgis.org). The UAV-based DEM also helped us to distinguish gullies from water tracks because the first ones incise more than one meter into the ground. We obtained 16 polygons (the white polygons in Figure 1) as the ground truths. Figure S1 in the Supplementary Materials gives the method of obtaining each polygon. We note that their boundaries are 1-2 m away from the gully edges because we could not walk exactly along the edges due to safety concerns.

Preparation of Training Data
We chose training polygons to prepare images to train our supervised learning method ( Figure 4). Although some public datasets such as the PASCAL VOC 2012 images [44] are suitable for training and evaluating deep learning algorithms, none of them contain thermo-erosion gullies. Therefore, we prepared the training data specifically for thermo-erosion gullies. Among these training polygons, 11 of them are from the ground truths and 15 from manually identifying non-gully regions (see Figure S2 in the Supplementary Materials for their distribution and sources). The 11 ground truths are representative of all gullies in the study area. The rest ground truths and the other regions on the DOM were used as test data. The 15 non-gully polygons are essential because they help distinguish the gullies from the similar non-gully land covers during the training. To prove that our choice of training polygons is representative, we conducted 21 (three groups) independent experiments using different combinations of training data, which is similar to bootstrapping in statistics. All experiments had the same setting except their training polygons. The first experiment (first group) used all ground truths and non-gully polygons. Both the second and third groups of experiments randomly selected 70% of ground truths. The second group randomly selected 70% of non-gully polygons while the third group used all of them. Using 70% of ground truths as training data is common in machine learning application. In the third group, the number of training polygons equaled the number of representative selection we described above. Lastly, we compared the results of these 21 experiments with the final results reported in this study.
We utilized the training polygons to extract the subsets of the DOM (termed as sub-images) and to generate the corresponding label images. For each training polygon, we adopted the following two steps to extract a sub-image. (1) We expanded the training polygon by adding a 20-m-wide buffer area. (2) We utilized the minimum bounding rectangle of the expanded area to extract a subset of the DOM by using the Geospatial Data Abstraction Library (GDAL, www.gdal.org). After extracting operation, we obtained 26 sub-images. Among them, 15 contain non-gully land covers such as a water track, bare land, and road, as shown in Figure 3a,d. The other 11 sub-images (e.g., Figure 3b,c) contain both gullies and their surrounding non-gully pixels. To obtain label images, we first rasterized all the ground truth polygons into one single image using GDAL. Then, we extracted 26 label images such as those shown in Figure 3a-L to d-L from this image.
We needed to further subdivide the sub-images and label images into small patches due to the input requirement of our method. The deep learning algorithm (more in Section 3.4) was originally developed to process images from everyday scenes such as the ones in the PASCAL VOC 2012 dataset. These images have small dimensions (on the order of 500 × 500 pixels) and file sizes (smaller than one megabyte). High-resolution remote sensing images are larger than everyday images. For example, the dimension and file size of the DOM are 28,162 × 21,003 pixels and 2.4 gigabytes, respectively. The dimension of sub-images ranges from 313 × 455 to 3011 × 2706 pixels depending on the size and shape of the training polygons. Moreover, because of the limitation of Graphics Processing Unit (GPU) and the algorithm, it is not possible to directly input an image with large dimension (such as greater than 1000 × 1000 pixels) to the deep learning algorithm. Therefore, we subdivided each sub-image and the corresponding label image into image patches and label patches. We created a grid for each sub-image, and then subdivided sub-images based on their corresponding grids. Figure 5 gives an example of how we subdivided the sub-image shown in Figure 3c. The size of each black square is 321 × 321 pixels. To utilize the contextual information, we extended each black square in up, down, left, and right directions by 150 pixels (red arrows in Figure 5). The yellow square in Figure 5 shows an example of how we extended one black square (ID #8). The extending operation ensures that each patch overlaps its adjacent ones. Moreover, the overlap is necessary for producing consistent results among the adjacent patches. Then, we extracted the image patches from the sub-images based on the extended squares. In the case the image patches in the last column or row of the grid are too small (the length of the yellow double arrow is smaller than one-third of the black square), we merged them to the patches in the adjacent column (e.g., #27 into #26 in Figure 5) or row (e.g., #52 into #45 in Figure 5). We did not subdivide those sub-images smaller than 321 × 321 pixels and considered them as image patches directly. Lastly, we applied the same subdividing operation to label images (e.g., Figure 3c-L) and obtained the corresponding label patches. To increase the number of the training data and the generalization of the deep learning algorithm, we performed data augmentation on the subdivided image patches and the corresponding label patches. The data augmentation included flipping, blurring, and rotating (see Figure 6 for examples of data augmentation).

Mapping of Thermo-Erosion Gullies Using DeepLab
The core of our method is based on DeepLab, a state-of-the-art semantic image segmentation algorithm building on deep CNN [45]. Semantic segmentation implies that the algorithm assigns each pixel in images to a specific class. We only have two classes, namely gullies and non-gullies, in the semantic segmentation task. DeepLab was developed to label specific objects (e.g., cars, cats, and dogs) on RGB images and outperforms many other deep learning algorithms [45]. Therefore, we developed our method based on DeepLab. We input three-band (i.e., RGB) images to DeepLab, which can learn features automatically. However, it is unclear what features it learned (more discussion in Section 5.5). The network architecture we used is DeepLab-LargeFOV. All parameters in the neural network can be considered as a model.
We fine-tuned the DeepLab model by utilizing image patches and the corresponding label patches. Fine-tuning implies that we initialized the network using a pre-trained model instead of random values, and then we made small adjustments to the model in the training process. The pre-trained model consisted of a set of network parameters that have been obtained by training with the PASCAL VOC 2012 images [45]. It has already learned image features that can be used for identifying targets. Such ability is termed as transfer learning, as the features that have been learned in one setting can be exploited to another setting [46]. By adopting the fine-tuning strategy, we could reduce the training time and lower the requirement of training data quantity. The fine-tuning step took around three hours on one GPU (NVIDIA Quadro P5000) after 6000 iterations and reached convergence (the loss value stabilized at about 0.05).
We utilized the fine-tuned model to map thermo-erosion gullies. We subdivided the entire DOM into many DOM patches (smaller than 650 × 650 pixels) and performed inference on them. Inference implies that we input a DOM patch into the network with the learned parameters, and then obtain a feature map. The feature map indicates the probability of thermokarst landforms at each pixel. It took around 23 min in the inference step for all the DOM patches on GPU Quadro P5000. We converted the feature map to a binary (gully or non-gully) patch and mosaicked them into a single file for the entire study area. Lastly, we converted (polygonized) binary results to polygons using GDAL. Hereafter, we refer to these polygonized results as "candidate gully polygons".

Post-Processing
We set criteria based on the area and shape of thermo-erosion gullies to remove a few erroneously candidate polygons. We set an area threshold of 40 m 2 to remove polygons smaller than it. Moreover, we defined a metric to quantify the narrowness of a gully as: where N is the narrowness metric, and P and S are the perimeter and area of the polygon, respectively. A larger N means a narrower polygon. We removed the candidate polygons with N < 40. The remaining polygons after post-processing are considered as the final output, namely the mapped gully polygons (also simply termed as 'gully polygons').

Validation
We used the ground truth polygons to validate the gully polygons. We assigned each pixel on the DOM to gully or non-gully, but only gullies are our targets. Therefore, we calculated the intersection over union (IoU) value of each gully polygons, then counted the numbers of true positives and false positives. We calculated the IoU value as: where A is a gully polygon and B is any ground truth polygon. In cases when multiple IoU values exist for one gully polygon (i.e., it intersects with several ground truth polygons), we only keep the maximum one. The value of IoU ranges from 0 to 1. A higher IoU value means a higher delineation accuracy for a given gully polygon. We set the IoU threshold as 0.4, slightly lower than the conventional threshold of 0.5 because the ground truth polygons contain uncertainties (e.g., their boundaries are wider than the truth ones). If the IoU value of a gully polygon is greater than the threshold, we count it as a true positive. Otherwise, it is a false positive. A false negative means that a ground truth polygon does not have any corresponding true positive. The number of false negatives is equal to the number of ground truth polygons minus the number of true positives. We then used these numbers to calculate the F1 score as: where TP, FP, and FN are the number of true positives, false positives, and false negatives, respectively.
The F1 score provides a quantitative rubric for assessing the overall accuracy of our mapped results. We also conducted a sensitivity analysis on how the delineation accuracy depends on the choices of IoU threshold (see Table S2 in the Supplementary Materials). The analysis shows that this threshold is reasonable and produces the best results in terms of the overall F1 score.

Results
From the DOM, our method allowed us to delineate 27 thermo-erosion gullies (Figure 7). Among them, 16   The experiments (Table 1) using varying combinations of training polygons show that the highest achievable F1 score is within 0.7-0.8, which also proves the robustness of our main results. Moreover, our choice of representative training polygons outperforms all other random combinations in terms of the F1 score and false negatives. Only one experiment (#16) has a higher F1 score than our final result. However, its FN equals one, which implies that one ground truth is missed in the result. The first group of an experiment (#1) shows that using all ground truths cannot guarantee the highest F1 score. The possible reason is that a larger number (16 gullies vs. 11 non-gullies) of gully samples in the training polygons makes the algorithm more sensitive to gully-like land cover, thus the algorithm produced more false positives. The F1 scores in Groups 2 and 3 are similar, which implies that the same percentage of ground truths can result in similar accuracies.  The thermo-erosion gullies are concentrated at moderate elevations and slopes within the study area. The elevation of gully polygons (true positive) ranges from 3526 m to 3667 m (Figure 9a), while the elevation of the entire study area ranges from 3484 m to 3726 m. Moreover, 94% of the gullies are located on slopes gentler than 17 • but steeper than 8 • (Figure 9b), while the slope values of more than half (55.6%) of the study area are greater than 17 • or smaller than 8 • . Their circularity indexes (namely, 4πS P 2 ) are smaller than 0.25 ( Figure 10a). Compared to thermokarst lakes, which have been well studied, the gullies are small: 87% of them are smaller than 2050 m 2 ( Figure 10b) and 94% of them have perimeters shorter than 1000 m (Figure 10c).

Spatial Distribution of the Thermo-Erosion Gullies
Our results provide insights into the controls of gully appearance. Typically, the spatial distribution of thermo-erosion gullies in the Arctic is controlled by ice-wedge polygon networks and topographic gradient [18]. The growth of ice-wedges is not typical in the study area. Therefore, we focused on examining the topographic gradient as represented by simulated surface streams. We obtained the stream vectors as shown in Figure 7 from a watershed simulation based on the DEM using the Geographic resources analysis support system (GRASS) toolbox (http://grass.osgeo.org). Only counting the true positives, we found that all gully polygons are co-located with the simulated surface streams (Figure 7). Two mechanisms can result in such co-location. Firstly, water run-off heats the ground and accelerates the thawing of ice-rich permafrost. Secondly, stream erosion disturbs or even removes the vegetation and soils, which exposes the organic layer to solar radiation and leads to permafrost thawing. We also note that not all the simulated stream vectors have co-located gullies, implying that ground ice content in the near-surface permafrost can also influence the distribution. However, the exact distribution of ground ice is unknown and difficult to quantify.

Delineation Accuracies of the True Positives
The IoU values indicate high delineation accuracies for most of the mapped gullies. A low IoU value (e.g., 0.41) could be due to the difference between the microtopography observed in the field and on the DOM. In the field, we considered the collapsed ground on the gully bank as part of the gully; however, the collapse features appear subtle at the gully in Figure 8a as well as a few others on the DOM. Therefore, the ground truth polygons of these gullies are wider than the corresponding mapped polygons from the DOM.
The gully size affects the delineation accuracy. Among the six mapped gullies whose areas are greater than 1500 m 2 , one has an IoU value around 0.7 and the other five have IoU values greater than 0.8 ( Figure 11). Conversely, the IoU values of the gullies smaller than 1500 m 2 are lower than 0.8. Obviously, a bigger gully is easier to be identified from the DOM. However, as shown in Figure 11, the dependency of the IoU values on the sizes is non-linear.

Possible Causes of False Positives
The false positives are mainly delineated due to the limitations of the training data and diverse land covers in our study area. Firstly, the formation of thermo-erosion gullies involves complex thermal, hydrological, and geomorphic processes [18,25]. Some of the gullies in our study area may be at an early stage of development. The training polygons we used may not include all the gullies at different development stages because we only selected the ones that we easily identified on the DOM and in the field. Secondly, the non-gully training polygons we selected may not cover all the land cover types. The study area is close to settlements. The residents and their livestock may have disturbed the land surface at some locations (e.g., vegetation damage and vehicle tracks). These anthropogenic disturbances may have been mistakenly mapped as thermo-erosion gullies because of their similar appearance on the DOM. Lastly, as we pointed out in Figure 8, some water tracks, shadows, and bared land also contribute to the false positives.

Mapping of Thermo-Erosion Gullies on the DEM
We tested the capability of mapping thermo-erosion gullies on the DEM because it is natural to use DEM data to map geomorphological features, although we wanted to solely use the three band (RGB) images. We applied the same processing chain on the DEM and obtained mapped gully polygons (as shown in Figure S4). In these results, we obtained the accuracy of F1 score = 0.49 (precision is 0.35 and recall is 0.81). Compared to the results solely using the RGB image, there are more false positives and false negatives. The reasons for this could be: (1) many other landforms or disturbances are similar to thermo-erosion gullies on the DEM in the study area; and (2) many erosional features (e.g., collapsed ground, ponds, and exposed soil) that are critical for identifying thermo-erosion gullies are not presented on the DEM. Further investigation on using DEM is worth exploring but were not included in this study because including DEM in the method would limit its transferability. In many other regions, collecting remote sensing images with the high spatial resolution from airborne platforms or satellites is easier than obtaining or constructing a high-resolution DEM.

Advantages and Limitations of Deep Learning for Mapping Thermokarst Landforms
The most significant advantage of deep learning methods is that they can automatically learn features [31]. Hand-designed features are essential and critical in the traditional automatic mapping methods. These methods require time and expertise to design good features and choose the best combination of them. However, with deep learning methods, we can easily extend the method to other application without corresponding expertise. In this study, we only prepared the training images according to the requirement of DeepLab and adopted a network architecture from other application (i.e., DeepLab-LargeFOV). We did not design features or conduct experiments on different combinations of features.
Our method based on deep learning can output results close to manually identified ones. Typically, traditional automatic or semi-automatic mapping methods detect the boundaries based on the differences in the features including edge, texture, and spectral information. Consequently, these methods output results that cannot delineate the thermokarst landform due to the various spectral reflection inside it [27,28]. Figure S3 of the Supplementary Materials gives one example that compares the results obtained from a supervised classification with the ones presented here. Our method can delineate the landforms and acquire high IoU values (as discussed in Section 5.2).
However, many studies claim that deep learning is a black box, namely the learned features are unknown, and the factors controlling the output are unclear [47,48]. For instance, we used different data augmentation strategies but found that not all strategies gave the same F1 score; some even introduced more false positives (and reduced the F1 score). However, it is unclear how exactly the data augmentation strategies yield the different mapping results. Nonetheless, we need to point out that data augmentation was essential in this study as our training data were limited.
Our deep learning method did not produce exactly repeatable results. We conducted several processes of fine-tuning and inference using the same network setting and training data. However, these processes gave different, yet very similar results; and the F1 score varied between 0.7 and 0.78. Even if we doubled the number of training iterations, the F1 score was still not repeatable. We speculate that the random initialization of parameters (of input and output layers) and dropout (which is a strategy to prevent overfitting) led to this phenomenon. Moreover, the small number (16) of ground truth polygons made the F1 score sensitive to TP, FP, and FN.

Prospects and Future Work for Mapping Thermokarst Landforms in a Large Area
With deep learning, the achievement of image processing has reached a new stage [30,31]. For example, identification of objects such as cats in images was challenging a few years ago, but it is much easier nowadays. Automatic mapping of thermokarst distribution in permafrost areas was regarded as extremely difficult [49]. However, our results show that it is feasible to map thermokarst landforms on remote sensing images with deep learning algorithms automatically.
Despite the great potentials, we need to make a few major improvements to the current method and training data towards mapping thermokarst landforms at regional scales such as the Tibetan Plateau. When extending the method to a large area, we need to collect corresponding high-resolution remote sensing images and training polygons. Due to the diverse types of thermokarst landforms in a large area, obtaining training data is labor-intensive. Therefore, it may require a community effort (from permafrost scientists and engineers). Variation in the size of thermokarst landforms requires images with different spatial resolution, which will increase image heterogeneities (different sources, bands, acquisition times, and spatial resolution). Moreover, a large area also means a large volume of data. To overcome these challenges, we need to innovate and improve the algorithm in the following aspects: (1) adopting state-of-the-art algorithms and trained models because deep learning methods advance very quickly; (2) utilizing multiple bands information by integrating the recurrent neural network and CNN; and (3) improving the algorithm to handle the dataset covering the large area by using multiple GPUs.

Conclusions
We designed and applied an automatic mapping method based on deep learning to map the thermokarst landforms from high-resolution UAV images. Despite a few false positives, we successfully mapped all 16 thermo-erosion gullies in a local area in the Northeastern Tibetan Plateau (the overall F1 score is 0.74, precision is 0.59, and recall is 1.0). These gullies are relatively small and narrowly-shaped. All the gullies are co-located with simulated surface streams, confirming that the formation and development of the gullies are largely affected by the surface streams.
In the future, we will consider more factors to quantitatively analyze the spatial distribution of these gullies. We can extend our method to other thermokarst landforms and larger areas by collecting corresponding training data and remote sensing images. With automatic mapping methods, it is possible to map the thermokarst landforms in a large area such as the Tibetan Plateau and the circumpolar region. Products from such comprehensive mapping efforts will significantly improve observations and understanding of permafrost degradation and its environmental and socio-economic impacts.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/10/12/ 2067/s1, Figure S1: Map of the ground truth polygons and their obtaining methods. Nine of them were collected in the field using GPS Real-Time Kinematic. Seven of them were manually delineated (digitizing) on the DOM, Figure S2: Distribution of the training polygons: 15 of them are non-gullies; seven were collected from field GPS measurements; and four were from manual delineation (digitizing), Figure S3: Comparison between the DeepLab-based results and those obtained from a supervised classification using support vector machine (SVM).  Figure S4: Distribution of mapped gully polygons (the red and yellow ones are true and false positives, respectively) solely using the UAV-based digital elevation model, Table S1: Parameters and settings of the Trimble Inpho UASmaster software for producing the digital elevation model and digital orthophoto map, Table S2: Delineation accuracies obtained with different IoU thresholds.
Author Contributions: L.H. designed and developed the algorithm of this study, conducted the analysis, and wrote the manuscript. L.L. supervised L.H. and co-designed the algorithm. L.J. and T.Z. provided the data and valuable knowledge and revised the manuscript. All co-authors helped write the manuscript.