Review and Evaluation of Deep Learning Architectures for Efficient Land Cover Mapping with UAS Hyper-Spatial Imagery: A Case Study Over a Wetland

Abstract: Deep learning has already proven to be a powerful state-of-the-art technique for many image understanding tasks in computer vision and other applications including remote sensing (RS) image analysis. Unmanned aircraft systems (UASs) offer a viable and economical alternative to conventional sensors and platforms for acquiring high spatial and high temporal resolution data with high operational flexibility. Coastal wetlands are among the most challenging and complex ecosystems for land cover prediction and mapping tasks because land cover targets often show high intra-class and low inter-class variances. In recent years, several deep convolutional neural network (CNN) architectures have been proposed for pixel-wise image labeling, commonly called semantic image segmentation. In this paper, some of the more recent deep CNN architectures proposed for semantic image segmentation are reviewed, and each model's training efficiency and classification performance are evaluated by training it on a limited labeled image set. Training samples are provided using the hyper-spatial resolution UAS imagery over a wetland area and the required ground truth images are prepared by manual image labeling. Experimental results demonstrate that deep CNNs have great potential for accurate land cover prediction using UAS hyper-spatial resolution images. Some simple deep learning architectures perform comparably to or even better than complex and very deep architectures with remarkably fewer training epochs. This performance is especially valuable when limited training samples are available, which is a common case in most RS applications. SegNet achieved the least accuracy among all employed architectures for semantic image segmentation. Results achieved by our MobileNet architecture are based on baseline settings for hyper-parameters, including α = 1 for the width multiplier and ρ = 1 for the resolution multiplier.


Introduction
Remote sensing (RS) is the major source of spatial information related to the earth's surface, offering a wide range of sensors and platforms to monitor land cover and its spatial distribution. Recently, Unmanned Aircraft Systems (UASs) have been widely employed in numerous RS applications including natural resource management [1][2][3]. In comparison with traditional RS, UAS technology stands out for its low-cost operation and ability to acquire image data with high spatial and temporal resolution in a flexible fashion at local scales. A UAS usually flies at low altitudes and captures high spatial resolution (few cm to sub-cm) images. In combination with the recent advancement in image analysis algorithms, those high-quality images may significantly improve the overall accuracy of image-derived products in many different RS tasks. For instance, pixel-level labeling, which is frequently used in computer vision both spectral and contextual image features, it can outperform the pixel-based techniques [14,17]. By exploiting object-based image analysis (OBIA) techniques, geographical objects, instead of individual pixels, form the basic unit for image analysis [14]. Unlike pixel-based analysis, in OBIA, a given image is segmented into relatively homogeneous and semantically coherent objects based on predefined homogeneity criteria at different scales [18]. In other words, spectral information is aggregated per object, and other textural and contextual information becomes available for conducting image classification on objects rather than pixels [22]. Several studies have already shown that object-based image classification techniques perform better than pixel-based methods, especially when high-spatial resolution images are employed [14,22,23]. In general, both pixel-wise and OBIA strategies for land cover or land use classification take advantage of a wide variety of supervised or unsupervised machine learning (ML) classification algorithms [24][25][26][27][28][29].
In recent years, however, due to the striking achievement of deep learning models in outperforming almost all state-of-the-art techniques in a wide range of applications, the RS community is shifting its attention to deep learning models. The large number of publications exploiting these models in different RS image analyses and the reported accuracies demonstrate the potential of deep learning in this field of study [30][31][32][33]. The recent success of deep convolutional neural networks (CNNs) has enabled substantial progress in many image understanding tasks including pixel-wise semantic image segmentation due to a rich hierarchical feature learning process. Hierarchical features are learned through an end-to-end trainable framework in which higher levels of the feature hierarchy are formed by the precise composition of the lower level features [34][35][36][37]. Learned features, at multiple levels of abstraction, provide a unified, highly complex mapping function from input to output, taking only the raw data as input. Such complex mapping not only considers the spectral information of each individual pixel in the image, but also takes all textural, contextual, and spatial information related to each individual pixel into account. Thanks to the recent rise of transfer learning techniques, it is possible to take a pre-trained deep CNN model, trained over a large dataset in a supervised or unsupervised manner, and leverage the highly complex mappings learned by very deep CNN models to perform effectively on downstream tasks [38]. In addition, due to exploiting end-to-end trainable models within the deep learning framework, the need for efficient feature engineering, which is the biggest concern for almost all traditional classification techniques, is entirely eliminated. This paves the path for developing fully autonomous and online land cover prediction systems. All these characteristics are extremely important in many image analyses in different RS tasks.
Specifically, deep CNN models have been successfully used for RGB, multispectral, and hyperspectral RS image analyses in various applications [39][40][41][42]. Very recently, deep CNNs have been specifically applied to wetland studies, including land cover classification. Results and findings confirm that, where adequate labeled training samples are available, deep CNN models usually outperform the traditional and machine learning classification techniques [3][43][44][45][46].
The objectives of this paper include: (1) employing some of the most popular deep CNN architectures extensively used in the computer vision community for semantic image segmentation on hyper-spatial resolution UAS images acquired over a coastal wetland for land cover prediction; (2) investigating the feasibility of deep learning architectures and evaluating the performance of different deep CNN models in pixel-wise image labeling where labeled training samples are limited and natural targets that appear in UAS images with high spatial resolution exhibit high complexity in their spectral and textural information without clear borders to distinguish them from other neighboring targets; (3) identifying a deep learning architecture that offers, among the evaluated models, high performance in terms of both speed and accuracy and can be effectively used in many RS applications where complex pixel-level analyses on high-spatial resolution imagery are required.
It should be emphasized that a comprehensive study on coastal wetland classification to perform detailed analyses of vegetation or other land cover properties is not the objective of this paper. Furthermore, the study of land cover changes over time in the coastal wetland setting due to changes in participating environmental factors is not a goal at this stage. Nonetheless, due to the complexity of the coastal wetland setting relative to many other natural environments, in terms of higher inter-class spectral similarity and higher intra-class spectral variability, variable target boundaries and spatial distributions, and mixed pixels, this environment has been chosen as a suitable and challenging case study. For evaluating the efficiency of the employed deep CNN models, performance metrics commonly employed for semantic image segmentation tasks in computer vision are utilized. These metrics usually take the ground truth images as the existing reality and compare the predicted images with the corresponding ground truth images based on manual labeling of the image data.
The remainder of this paper is organized as follows. Section 2 explains the most popular deep learning architectures in the computer vision community for image understanding tasks and briefly describes transfer learning as a widely used technique to leverage a deep learning architecture trained for a certain task in a different task. Furthermore, different metrics that are usually used in machine learning and deep learning for performance evaluation of a typical classification technique are described at the end of the section. Section 3 introduces the data collection and data pre-processing steps. It also provides some information about the chosen deep CNN architectures for the land cover prediction task and brief details about the employed optimization algorithm and hardware configuration. Sections 4 and 5 report and discuss, respectively, experimental results achieved by implementing some of the most popular deep CNN architectures for pixel-wise labeling on the experimental dataset. Lastly, Section 6 provides conclusions and future work perspectives.

Deep Learning for Semantic Image Segmentation
Advancing deep learning architectures to tackle pixel-wise image labeling is a natural step in the progress from coarse to fine inference [4]. The origin of convolutional neural networks lies in classification tasks, where a single category is predicted for the entire image [47]. Target localization and detection in computer vision tasks was the next necessary step towards fine-grained inference, providing further information in addition to classes. Instance segmentation, which joins detection and segmentation, is an additional improvement towards fine-grained inference [48]. Fully Convolutional Network (FCN) [4] is considered a milestone in transforming classification-purposed CNNs for semantic image segmentation by replacing fully connected layers with convolutional ones to output spatial maps instead of classification scores. Moreover, to compensate for low resolution prediction maps due to several down-sampling steps within pooling layers, FCN includes several fractionally-strided convolutions, also known as deconvolutions or transposed convolutions [49,50], combined with a simple bilinear or any learnable interpolation allowing per-pixel labeled output. FCN can be trained end-to-end to efficiently learn to predict pixels' categories for an image of arbitrary size. This approach achieved significant improvement over traditional methods on the PASCAL Visual Object Classes (VOC) [51] standardized image dataset with high efficiency at inference time. Despite its simplicity and flexibility, the FCN architecture suffers from some critical limitations when it is applied to certain applications. FCN has a fixed receptive field, which makes the network unable to capture contextual information appropriate for pixel-wise labeling of objects that are substantially smaller or larger than the predefined fixed receptive field [37]. As a result, predictions are more uncertain for local ambiguous regions.
Feature maps that are used for prediction in several layers of the CNN architecture have contextual information appropriate for the classification task, not the pixel-wise labeling. Additionally, the entire network is usually trained to be spatially invariant, which does not let the network take useful global context information into account. Furthermore, the network suffers from lack of instance-awareness which is very important in some image understanding tasks [37].
Since the introduction of FCN in 2015, a wide range of research has focused on how to provide dense segmentation maps with pixel-level accuracy from arbitrary sized images. Recently introduced deep learning architectures owe their high performances in precise semantic segmentation to several factors including:
1. introduction of more advanced and deeper CNN feature encoders that are efficiently trained using recently developed advanced optimization algorithms.
2. utilizing a more advanced decoding strategy to the final low-resolution encoded feature maps in an encoder-decoder architecture using deconvolution or dilated convolution to efficiently increase their resolution for pixel-wise prediction.
3. using the skip connection to introduce low-level abstract information to the high-level abstract information to build highly accurate feature maps representing pixel-level feature information.

Feature Encoders
Feature encoders are simply described as a stack of convolution layers in combination with activation functions, usually Rectified Linear Unit (ReLU) [52], and pooling layers, usually Max-Pooling, which construct a hierarchical representation of the input data containing low-level to high-level abstract information [50]. LeNet [47] is considered the first CNN-based feature encoder, introduced by LeCun et al. in 1998. However, AlexNet [53], the first deep CNN architecture, introduced by Krizhevsky et al. in 2012, is a landmark in deep learning history. Several key factors contributed to this progress: (1) the efficient training procedure implemented on modern GPUs [53], (2) the proposal of the ReLU activation function, which contributed significantly to boosting training and made convergence much faster, and (3) the availability of a huge dataset, e.g., ImageNet [54], to train models with high capacity which include millions of trainable parameters. VGG-Net [55], GoogLeNet [56], Residual Network (ResNet) [57], and Densely Connected Network (DenseNet) [58] are a few examples of popular architectures that are frequently employed for feature extraction in very deep CNN models.
• VGG-Net. VGG-Net [55] was introduced in 2014 by Oxford's Visual Geometry Group as a successful effort to build and train a very deep CNN. VGG-Net showed that the depth of a network is a critical component in CNNs to achieve high performance in recognition or classification. By shrinking the convolution kernels to 3 × 3 yet increasing the number of stacked convolutional layers and the number of feature maps in each convolution layer, VGG is able to train a deeper architecture with a receptive field comparable to that of AlexNet for recognition tasks.
• GoogLeNet. The inception module, which forms the building block of GoogLeNet [56], is a combination of 1 × 1, 3 × 3, and 5 × 5 convolutional kernels and a pooling layer. The motivation behind the inception module is to increase the receptive field without losing fine information. By learning and combining features with different scales in parallel in each inception module, GoogLeNet is able to learn the feature hierarchy in a multi-scale manner while its innovative architecture reduces the number of trainable parameters in a very deep framework (22 layers) to less than 5 million parameters in comparison to 62 million and 138 million parameters in AlexNet and VGG-Net, respectively. To train a deep stack of inception modules in an efficient way, a bottleneck approach is exploited in which extra 1 × 1 convolutions reduce the dimensionality of feature maps that enter the inception module from the previous layer. This helps to avoid parameter explosion in inception modules and the overfitting problem in the whole network. Figure 1 illustrates the architecture of the inception module. Other versions of inception modules including BN-Inception [59], Inception V2, and Inception V3 [60] were later proposed. In order to increase the efficiency and performance of inception modules, in 2017, Szegedy et al. proposed a combined version of inception modules and residual network (ResNet) modules known as Inception-ResNet [61].
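The effect of the 1 × 1 bottleneck on the parameter count can be illustrated with simple arithmetic. The following is a minimal sketch; the channel sizes are hypothetical values chosen only for illustration, and biases are ignored.

```python
def conv_params(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Number of weights in a 2D convolution layer (biases ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def bottleneck_params(kernel_size: int, in_channels: int,
                      reduced_channels: int, out_channels: int) -> int:
    """Weights of a 1 x 1 reduction followed by a k x k convolution."""
    reduction = conv_params(1, in_channels, reduced_channels)
    spatial = conv_params(kernel_size, reduced_channels, out_channels)
    return reduction + spatial


# Hypothetical branch: 192 input channels, 32 output channels, 5 x 5 kernel.
direct = conv_params(5, 192, 32)
# Same branch with the input first reduced to 16 channels by a 1 x 1 convolution.
reduced = bottleneck_params(5, 192, 16, 32)

print(direct, reduced)
```

Under these assumed sizes the bottleneck branch needs roughly a tenth of the weights of the direct 5 × 5 convolution, which is the kind of saving the bottleneck approach is meant to provide.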
Xception [62], which stands for extreme version of inception, was proposed by Chollet in 2017. The motivation behind it is to map cross-channel and spatial information in feature maps separately, under the hypothesis that their correlations can be sufficiently decoupled. As a result, the depthwise separable convolutions from inception modules are modified in Xception modules as separable pointwise convolutions followed by depthwise convolutions.
• ResNet. As mentioned above, deeper networks can improve the performance of deep learning approaches on complex visual tasks, but they are also more prone to the notorious problem of vanishing/exploding gradients during training. This may lead not only to saturated accuracy, but also to degradation of training accuracy. ResNet [57], designed by He et al. in 2015, exploits residual blocks to overcome the vanishing gradient problem in very deep CNNs by introducing identity shortcut connections to successive convolution layers as shown in Figure 2. The shortcut connections in residual blocks help gradients flow easily in the back-propagation step, which leads to gaining accuracy during the training phase in a very deep network. Referring to Figure 2, each unit calculates a residual function F(x) = H(x) − x, in which x is the output of the previous residual unit and H(x) denotes the desired underlying mapping. More precisely, if $y_l$ is the output of the $l$th residual unit with weights $w_l$, then $y_l = f\left(y_{l-1} + F(y_{l-1}, w_l)\right)$, where $f(\cdot)$ is the activation function.
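The residual computation can be sketched in a few lines of plain Python. This is an illustrative toy, not the architecture used in the experiments: it assumes the unit output has the form f(x + F(x)) with f = ReLU, as in the original ResNet formulation, and it replaces convolutions with small bias-free dense layers. The helper names are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_unit(x, w1, w2):
    """Compute f(x + F(x)) with F = linear -> ReLU -> linear and f = ReLU."""
    residual = linear(relu(linear(x, w1)), w2)
    return relu([xi + ri for xi, ri in zip(x, residual)])


x = [1.0, 2.0, 3.0]
zero = [[0.0] * 3 for _ in range(3)]

# With all-zero weights the residual branch contributes nothing,
# so the shortcut passes the (non-negative) input through unchanged.
y = residual_unit(x, zero, zero)
print(y)
```

The zero-weight case shows why the shortcut helps optimization: a unit only has to learn a deviation from the identity rather than the full mapping H(x).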
According to Figure 3, different variants of the residual unit were proposed, which consist of different combinations of convolutional layers, batch normalization (BN) [59], and rectified linear unit [63] activation function [57,64]. In our experiment, we use the full pre-activation variant of the residual unit proposed by He et al. [57,64] to build our architectures, which use ResNet as their feature encoder. ResNeXt [65], proposed by Xie et al. in 2017, is a highly modularized version of the ResNet architecture based on the split-transform-aggregate strategy of inception modules for image classification. Its innovative, simple design results in a homogeneous, multi-branch architecture that has only a few hyper-parameters to set. This approach exposes a new dimension called cardinality, the size of the set of transformations, as an essential factor in addition to other critical factors such as depth and width. The network is constructed by stacking repeating building blocks that aggregate a set of transformations with the same topology. Inspired by the residual network, several modifications, new designs, and architectures were proposed for different image understanding tasks [57,[66][67][68]. For instance, Figure 4 illustrates an inception-ResNet block called the Inception ResNet-A module of the Inception ResNet-v2 network [61]. Other variants of inception-ResNet blocks including Inception ResNet-B and Inception ResNet-C modules were also proposed by Szegedy et al. [61] in 2017.
• DenseNet. Inspired by ResNet and the idea that shorter connections between layers close to the input and those close to the output can help to train substantially deeper CNNs more accurately and efficiently, Huang et al. proposed DenseNet [58] in 2017. The architecture consists of densely connected CNN blocks in which the output feature maps of each layer are concatenated with the output feature maps of all preceding layers and passed on to every subsequent layer in a dense block as shown in Figure 5.
If the $l$th layer receives all the feature maps from all preceding layers, $x_0, x_1, \ldots, x_{l-1}$, as input, then $x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$, where $[x_0, x_1, \ldots, x_{l-1}]$ represents simple concatenation of the feature maps produced in layers $0, 1, \ldots, l-1$ and $H_l$ is defined as a composite function of three consecutive operations: BN, followed by a ReLU and a 3 × 3 convolution. A transition layer composed of a batch normalization layer and a 1 × 1 convolution followed by a 2 × 2 pooling operation is introduced between two consecutive dense blocks to reduce the dimensionality and spatial resolution of derived feature maps. The DenseNet architecture consists of several densely connected blocks and transitional blocks, which are placed between two adjacent densely connected blocks. The DenseNet concept alleviates the vanishing gradient problem and encourages feature propagation and feature reuse while substantially reducing network parameters.
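The dense connectivity pattern, in which each layer consumes the concatenation of everything produced before it, can be sketched as follows. This is a toy illustration on flat lists: the layer functions stand in for the BN, ReLU, and convolution composite, and the term "growth rate" (the number of feature maps each layer adds) is borrowed from the DenseNet literature rather than from the text above.

```python
def dense_block_forward(x0, layer_fns):
    """Run a dense block: layer l sees the concatenation [x_0, ..., x_{l-1}]."""
    features = [x0]
    for h in layer_fns:
        concatenated = [v for feature in features for v in feature]
        features.append(h(concatenated))
    return features


def dense_block_input_channels(initial_channels, growth_rate, num_layers):
    """Input channel count of each layer when every layer adds `growth_rate` maps."""
    return [initial_channels + growth_rate * layer for layer in range(num_layers)]


# Toy layers: each one summarizes its whole input into a single value.
outputs = dense_block_forward([1.0, 2.0], [lambda c: [sum(c)]] * 3)
# Hypothetical block: 64 incoming channels, growth rate 32, four layers.
channels = dense_block_input_channels(64, 32, 4)

print(outputs)
print(channels)
```

The channel counts grow linearly with depth inside a block, which is why transition layers are needed between blocks to bring the dimensionality back down.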
• MobileNet. Since the advancement of deep learning, the general trend has been to make deeper and more complicated networks to improve model performance [55,60,61]. However, these advances to improve accuracy do not necessarily make networks more efficient with respect to size and speed. In many real-world applications such as self-driving cars, robotics, and augmented reality, timely or almost real-time prediction and recognition tasks need to be carried out on a computationally limited platform. Inspired by depthwise separable convolutions [69] to reduce the computation in the first few layers, a class of efficient models, called MobileNets [70,71], for mobile and embedded vision applications was introduced by Howard et al. in 2017. This class of models presents a streamlined architecture that uses depthwise separable convolutions to build lightweight deep neural networks. According to Figure 6, the depthwise separable convolution is a form of factorized convolution that factorizes a standard convolution into a depthwise convolution, which applies a single filter to each input channel, and a 1 × 1 convolution, called a pointwise convolution, which changes the dimensions and linearly combines the output feature maps from the depthwise convolution. The depthwise separable convolution technique results in a drastic reduction in computational complexity and model size. Figure 7 illustrates two variants of MobileNet architectures. According to Figure 7, in MobileNetV1 [70], there are two layers: depthwise and pointwise convolutions. M and N are the number of input and output channels, respectively, and $D_F$ and $D_K$ are the spatial sizes of the feature maps and the filters, respectively. BN and the ReLU activation function are both applied after each convolutional layer.
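The reduction in computation can be made concrete by counting multiply-add operations for one layer. The sketch below uses the symbols M, N, D_F, and D_K from the text and follows the usual accounting for depthwise separable convolutions, assuming stride 1 and an output feature map of the same spatial size as the input. The layer dimensions are hypothetical.

```python
def standard_conv_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-adds of a standard D_K x D_K convolution with M inputs, N outputs."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k: int, m: int, n: int, d_f: int) -> int:
    """Multiply-adds of a depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = d_k * d_k * m * d_f * d_f   # one D_K x D_K filter per input channel
    pointwise = m * n * d_f * d_f           # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 feature map.
standard = standard_conv_cost(3, 64, 128, 56)
separable = depthwise_separable_cost(3, 64, 128, 56)
ratio = separable / standard  # algebraically equal to 1/N + 1/D_K**2

print(standard, separable, round(ratio, 4))
```

For 3 × 3 kernels the ratio is dominated by the 1/D_K² term, so the separable form costs roughly eight to nine times less than the standard convolution.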
MobileNet introduces two hyper-parameters to the network: a width multiplier, α ∈ (0, 1], to control the width (number of channels) of each convolutional layer, and a resolution multiplier, ρ ∈ (0, 1], to control the input image resolution of the network. α = 1 and ρ = 1 are the hyper-parameters for the baseline MobileNets, and α < 1 and ρ < 1 are considered for any reduced-computation MobileNets. Computational cost and the number of parameters are reduced by roughly α². However, the accuracy drops off as α and ρ decrease. MobileNetV2 [71] is a significant improvement over MobileNetV1 with high potential of reaching the state-of-the-art performance for mobile visual recognition tasks. It was also built upon the idea of depthwise separable convolution already applied in MobileNetV1 as efficient building blocks. In MobileNetV2, there are two types of blocks: a residual block with a stride of 1 and a block with a stride of 2 for downsampling. Both blocks include three layers. The first layer of each block in MobileNetV2 includes a 1 × 1 convolution with the ReLU activation function. The second layer is a depthwise convolution, and the third layer is another 1 × 1 convolution but without any activation function.
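The roughly quadratic effect of the width multiplier can be seen by writing out the multiply-add cost of one depthwise separable layer with both multipliers applied. The expression below is a sketch using the MobileNetV1 symbols (M and N for input and output channels, D_F and D_K for feature map and filter sizes) and assumes stride 1 with equal input and output feature map sizes; it is not a statement of the exact cost of the networks evaluated in this paper.

```latex
\begin{align}
C(\alpha, \rho)
  &= D_K \cdot D_K \cdot \alpha M \cdot \rho D_F \cdot \rho D_F
   + \alpha M \cdot \alpha N \cdot \rho D_F \cdot \rho D_F \\
  &= \alpha \, \rho^{2} \, D_F^{2} \, M \left( D_K^{2} + \alpha N \right)
\end{align}
```

When αN is much larger than D_K², the pointwise term dominates and the cost scales approximately with α²ρ². The resolution multiplier ρ changes only the computation, not the number of parameters, since it affects feature map sizes rather than filter shapes.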

Decoding Approaches
As explained earlier, an encoder is simply a deep learning architecture such as VGG-Net, GoogLeNet, ResNet, etc., making a hierarchical representation of input data. The final feature maps derived from encoders are usually coarse representations of the input image, which need to be upsampled to higher resolution feature maps. Decoding, on the other hand, is a strategy that aims to efficiently exploit encoded feature maps provided by the encoder to form an output that is the closest match to the intended output, usually the corresponding ground truth. Deconvolution or transposed convolution [49,72] is conceptually required in deep CNN architectures for pixel-wise predictions as feature maps are continuously down-scaled within several convolution and pooling layers. As mentioned earlier, the FCN architecture enables upsampled feature maps with resolution comparable to the input image through a fractionally-strided convolution step in combination with a simple bilinear interpolation. However, due to the lack of an efficient trainable deep deconvolution network, FCN fails to achieve high accuracy in pixel-wise labeling, especially when it is required to reconstruct highly nonlinear structures of object boundaries [73].
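A one-dimensional toy version shows how a fractionally-strided (transposed) convolution increases resolution. This is a minimal sketch without padding; the kernel [0.5, 1.0, 0.5] is chosen here as an assumption because, with stride 2, it reproduces linear interpolation between neighboring samples, the 1D analogue of the bilinear case mentioned above. In a trained network the kernel weights would be learned.

```python
def transposed_conv1d(signal, kernel, stride):
    """Transposed 1D convolution: each input sample scatters a scaled kernel copy."""
    out_len = (len(signal) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, value in enumerate(signal):
        for j, weight in enumerate(kernel):
            output[i * stride + j] += value * weight
    return output


# Upsample a 3-sample signal by a factor of about 2.
upsampled = transposed_conv1d([1.0, 2.0, 3.0], [0.5, 1.0, 0.5], stride=2)
print(upsampled)
```

The original samples reappear at every second output position, and the positions between them hold the averages of their neighbors; the first and last entries are boundary effects of the unpadded operation.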
The deconvolution network was first discussed for image reconstruction from its feature representation by Zeiler et al. [50]. To resolve ambiguity induced by Max-pooling layers, the network stored the pooled locations, which need to be retrieved in an unpooling operation. To predict a pixel-wise segmentation map, in 2015, Noh et al. proposed a trainable deep deconvolution network composed of deconvolution and unpooling layers [73]. SegNet [74], designed by Badrinarayanan et al. in 2015, consists of a deep encoder network and a hierarchy of decoders, one corresponding to each encoder, followed by a pixel-wise classification layer. The decoders are fed by the Max-pooling indices computed in the pooling steps of the corresponding encoders to perform deconvolution with nonlinear upsampling of their input feature maps. To produce dense feature maps in the decoder, the resulting sparse upsampled feature maps are then convolved with trainable filters. U-Net [75], developed in 2015 by Ronneberger et al., is an innovative deep learning architecture first designed for biomedical image segmentation and then extensively used for image segmentation in many other fields with different encoders such as ResNet, DenseNet, and Inception modules. The network has a symmetrical architecture characterized by an encoder with a series of convolution and Max-pooling layers in the contracting path and a decoder containing a mirrored sequence of convolution and upsampling layers in the expanding path of the network. U-Net is able to concatenate low-level abstract information, extracted from the first convolutional layers of the encoder (contracting path), and high-level semantic abstraction information, extracted from the final layers of the encoder, in the decoder (expanding path), resulting in a finer and more accurate prediction map. This strategy resulted in high performance especially when only a limited training dataset is available [75].
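The role of the stored pooling locations can be illustrated with a one-dimensional sketch of Max-pooling with indices followed by unpooling. This is a toy version on a flat list with non-overlapping windows; the function names are hypothetical and it is not the implementation of any particular network.

```python
def max_pool1d_with_indices(signal, window):
    """Non-overlapping max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for start in range(0, len(signal) - window + 1, window):
        block = signal[start:start + window]
        offset = max(range(window), key=lambda k: block[k])
        pooled.append(block[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool1d(pooled, indices, length):
    """Place each pooled value back at its recorded position; the rest stays zero."""
    output = [0.0] * length
    for value, index in zip(pooled, indices):
        output[index] = value
    return output


signal = [1.0, 4.0, 3.0, 2.0, 0.0, 5.0]
pooled, indices = max_pool1d_with_indices(signal, window=2)
sparse = max_unpool1d(pooled, indices, len(signal))

print(pooled, indices)
print(sparse)
```

The unpooled map is sparse, which is why the decoder convolves it with trainable filters afterwards to obtain dense feature maps.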
Motivated by a Laplacian pyramid developed for compact image coding [76], in 2016, Ghiasi et al. proposed a network called Laplacian Pyramid Reconstruction and Refinement (LRR) in which low-resolution feature maps are used to reconstruct a low-frequency segmentation map. Feature maps are then refined by adding high-frequency details. Refinement network (RefineNet) [77], proposed by Lin et al. in 2017, is a generic multi-path network which explicitly exploits all available information along the downsampling path to enable high-resolution image labeling using long-range residual connections. This network consists of three main components: the residual convolution unit (RCU), which exploits features at multiple scales; multi-resolution fusion, which merges multi-resolution features; and chained residual pooling, which aims to capture background context from a large image region by fusing the output feature maps of all pooling blocks together with the input feature map.
Inspired by DenseNet, in 2017, Jegou et al. proposed the One Hundred Layers Tiramisu network, commonly called Fully Convolutional DenseNet (FC-DenseNet) [78]. The architecture extends DenseNet to a fully convolutional network for the semantic segmentation task. The upsampling path includes convolution, upsampling operations called transition up, and skip connections. Transition up modules consist of a transposed convolution to upsample the previous feature maps. Upsampled feature maps are then concatenated with corresponding feature maps in the downsampling path using skip connections to prepare the input for the next upsampling dense block. To mitigate the parameter explosion problem, the input of a dense block is not concatenated with its output in the upsampling path, i.e., the transposed convolution is applied only to the feature maps derived by the last dense block instead of the concatenation of all feature maps derived so far.
Other innovative techniques were also proposed for dense semantic segmentation, which, unlike the convolution/deconvolution design, do not introduce new parameters to upsample feature maps. Atrous convolution [79,80], usually called dilated convolution and originally developed for computing the undecimated wavelet transform (UWT) [81], is employed to effectively enlarge the field of view of feature maps without increasing the number of parameters or the computational complexity. Atrous or dilated convolution in the context of CNNs aims to expand the receptive field of the network. It generates high-resolution feature maps capturing multi-scale contextual information from the input data. Dilated convolution introduces a new hyper-parameter called the dilation rate to the convolution layers, which specifies the expansion rate of the receptive field, enabling the network to exploit a larger receptive field without losing spatial information.
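The effect of the dilation rate can be shown with a one-dimensional sketch. The code below implements a valid (unpadded) dilated convolution on a flat list, written as a cross-correlation as is conventional in CNNs, together with the standard formula for the effective kernel extent, k + (k − 1)(r − 1) for kernel size k and dilation rate r. It is an illustration only; the function names are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid 1D convolution whose kernel taps are spaced `dilation` samples apart."""
    span = (len(kernel) - 1) * dilation + 1  # extent covered by the dilated kernel
    return [
        sum(w * signal[start + j * dilation] for j, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


def effective_kernel_size(kernel_size, dilation):
    """Extent of input covered by one application of the dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


signal = [float(v) for v in range(8)]
kernel = [1.0, 1.0, 1.0]

dense = dilated_conv1d(signal, kernel, dilation=1)    # ordinary convolution
dilated = dilated_conv1d(signal, kernel, dilation=2)  # same 3 weights, wider view

print(dense)
print(dilated)
print(effective_kernel_size(3, 2))
```

With a dilation rate of 2 the same three weights cover five input samples, so the field of view grows while the number of parameters stays fixed.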
In 2014, DeepLab [79], introduced by Chen et al. from Google, proposed atrous convolution instead of deconvolution for feature upsampling. Atrous convolution offers an efficient mechanism to control the receptive field of the network and finds the best trade-off between precise localization, with a small receptive field, and context assimilation, with a large receptive field. The output of the network is interpolated with bilinear interpolation and goes through the fully connected conditional random fields (CRF), which fine-tune the result for a more accurate and detailed segmentation map. Different variants of the DeepLab architecture were later proposed with some modifications to the original network. Atrous Spatial Pyramid Pooling (ASPP) was proposed in DeepLabV2 [34] to robustly segment objects at multiple scales. ASPP probes incoming feature maps at multiple sampling rates and fields of view, capturing objects and image context at multiple scales. In DeepLabV3 [82], to handle the problem of multi-scale object segmentation, a cascade or parallel atrous convolution design is employed to capture multi-scale context by adopting multiple dilation rates. DeepLabV3 outperformed its predecessors without dense CRF post-processing and attained comparable performance with other state-of-the-art models. The authors of DeepLabV3+ [83] added a decoder module to the former variant in which the encoded features are first upsampled by a factor of 4, instead of 16 as in [82], and then the resulting feature maps are concatenated with corresponding mid-level features from the network backbone. Moreover, to reduce computational complexity, they adopted the Xception module [62] and applied depthwise separable convolution to both the ASPP and decoder.
Yu et al. [80] developed a deep learning architecture in 2015 specifically designed for dense prediction based on the dilated convolution concept. This convolutional network module combines multi-scale contextual information without losing spatial resolution. Pyramid scene parsing network (PSPNet) [84], introduced in 2017, exploits the capability of global context information through different-region-based context aggregation, employing a pyramid pooling module in combination with the proposed pyramid scene parsing network. To perform pixel-wise prediction, PSPNet extends the pixel-level feature to a specially designed global pyramid pooling one. Then, the local and global clues jointly form the final prediction.

Transfer Learning
The idea of transfer learning was motivated by the fact that people can intelligently apply knowledge previously learned to solve a task in one domain to solve a new problem in the same or different domain [85]. In the deep learning context, features learned by a CNN architecture to solve a problem in a certain domain are reusable for solving problems in some other domains, as the first layers of the network in related domains usually tend to learn the same sorts of features. Transfer learning is a highly practical approach to tackle the issue of training a very deep architecture where a limited supply of target training data is available. This could be due to data scarcity, or methods to collect and label the data may be time consuming and expensive requiring expert knowledge. In contrast to many computer vision tasks that can take advantage of thousands of freely available images related to the underlying task, in most RS applications, e.g., land cover mapping, satellite or aerial imagery missions can be very expensive or time consuming. Data collections are a function of many participating factors including flight height, ground sampling distance (GSD), environmental conditions at the time of observation, and camera/sensor settings. Furthermore, a limited number of aerial images are acquired in every flight mission and the acquired images are not always available to the public to enable generation of large labeled data repositories for a specific type of environment or land cover. UAS provides a cost effective and flexible means to collect high-resolution aerial imagery over localized geographic extents; however, dense repositories of UAS imagery acquired over a specific type of natural environment that is expertly labeled for training deep CNNs to perform land cover prediction are presently non-existent.
Common practice in transfer learning is to copy the whole or just the first n layers of a network pre-trained on a huge dataset, reuse them in a new task, and then back-propagate the errors from the new task into the copied features to fine-tune them. In another approach, especially where the training sample size is significantly limited or the new task is closely related to the task from which the transferred features are derived, the first n feature layers can be left frozen, meaning that they do not change during training on the new task. The choice of whether or not to fine-tune the copied first n feature layers depends on the size of the available dataset for the new target task. When the target dataset is small, fine-tuning may lead to overfitting, especially when the network contains a large number of parameters. On the other hand, if the target dataset is rich enough or the number of network parameters is small, so that overfitting is not a concern, then fine-tuning the copied features to the new task can greatly improve performance [38]. In such a case, training the network from scratch may also be an option.
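The difference between freezing and fine-tuning the first n layers can be sketched without any deep learning library. The example below is a minimal, framework-free illustration under stated assumptions: the "network" is a list of tiny weight vectors with made-up values, and `freeze_first_n` and `sgd_step` are hypothetical helpers, not functions from any real package.

```python
def freeze_first_n(layers, n):
    """Mark the first n layers as frozen; the remaining layers stay trainable."""
    for i, layer in enumerate(layers):
        layer["trainable"] = i >= n
    return layers


def sgd_step(layers, grads, lr=0.01):
    """Apply one plain gradient-descent step, skipping frozen layers."""
    for layer, grad in zip(layers, grads):
        if layer["trainable"]:
            layer["weights"] = [w - lr * g for w, g in zip(layer["weights"], grad)]


# Toy "pre-trained" network: three layers with two weights each (made-up values).
layers = [{"weights": [0.5, -0.2]}, {"weights": [0.1, 0.3]}, {"weights": [0.0, 0.0]}]
grads = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # made-up gradients from the new task

freeze_first_n(layers, n=2)     # copied feature layers stay fixed
sgd_step(layers, grads, lr=0.1)
print([layer["weights"] for layer in layers])
```

Setting n = 0 in this sketch corresponds to fine-tuning every copied layer, while n equal to the number of feature layers trains only the new task-specific part.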

Performance Metrics
This section describes the most common performance or evaluation metrics used in the context of semantic image segmentation. Usually, the overall performance of a deep learning architecture on a semantic image segmentation task is described in terms of the overall accuracy of pixel-wise labeling, time, and memory usage. Overall accuracy describes the correctness of labeling as the ratio of the number of correctly classified pixels to the total number of manually classified pixels in the ground truth. Pixel-wise or per-class accuracy is a closely related measure that reports the percentage of correctly classified pixels for each individual class. The binary mask employed in per-class accuracy assessment yields four quantities. True positive (TP) represents the number of pixels correctly labeled as the target class, and true negative (TN) represents the number of pixels that are correctly identified as not belonging to that class. False positive (FP) represents the number of pixels belonging to other classes that are misclassified as the target class, and false negative (FN) represents the number of pixels that belong to the target class but are misclassified as belonging to other classes. Accordingly, the overall accuracy per class can be formulated as [86]:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

The pixel-wise accuracy metric is not reliable and may provide misleading results when a certain class is sparsely represented within the whole dataset. Precision and recall are two metrics that can help to interpret the accuracy of each class more reliably, even in the case of unbalanced classes.
Precision, or positive predictive value (PPV), describes the purity of the positive detections, i.e., the fraction of pixels predicted as the target class that truly belong to that class in the ground truth [86]:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall, or true positive rate (TPR), on the other hand, describes the completeness of the positive predictions relative to all pixels that truly belong to the target class in the ground truth [86]:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

The F-score is a widely used performance metric for classification and segmentation tasks, which consists of the weighted harmonic mean of the precision and recall metrics [86]:

$$F_B = (1 + B^2)\,\frac{\mathrm{Precision} \cdot \mathrm{Recall}}{B^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$$

where B is a scaling factor between the precision and recall. The F1 score, one of the more widely used F-measure metrics, is formulated by setting B = 1 [86]:

$$F_1 = 2\,\frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

Intersection over Union (IoU), also known as the Jaccard index, is a standard performance measure for object category segmentation. The IoU measure represents the similarity ratio between the predicted region and the corresponding ground truth region for an object present in the dataset [87]:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$

Mean Intersection over Union (mIoU) is a common performance metric for semantic segmentation that is calculated by averaging the IoU values computed for all existing semantic classes. Other performance metrics, such as time, memory, and power, are highly dependent on the available hardware, software, and the specific deep learning architecture chosen for solving a classification task. Providing such metrics becomes more crucial when a deep learning framework is employed in online applications such as autonomous driving and mobile systems, where memory and power are more limited.
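The metrics described in this section can all be derived from a confusion matrix. The following is a minimal sketch in plain Python, assuming a matrix whose rows are ground truth classes and whose columns are predicted classes; the function names and the two-class toy matrix are illustrative and do not come from the study.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of pixels of ground-truth class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    results = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # class c pixels labeled otherwise
        fp = sum(cm[r][c] for r in range(n)) - tp  # other pixels labeled as class c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        results.append({
            "accuracy": (tp + tn) / total,
            "precision": precision,
            "recall": recall,
            "f1": 2 * precision * recall / denom if denom else 0.0,
            "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
        })
    return results


def overall_accuracy(cm):
    """Correctly classified pixels divided by all labeled pixels."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)


def mean_iou(cm):
    """Average of the per-class IoU values."""
    metrics = per_class_metrics(cm)
    return sum(m["iou"] for m in metrics) / len(metrics)


toy_cm = [[8, 2],   # made-up counts, two classes only
          [1, 9]]
metrics = per_class_metrics(toy_cm)
print(overall_accuracy(toy_cm), mean_iou(toy_cm))
print(metrics[0]["precision"], metrics[0]["recall"], metrics[1]["iou"])
```

For a four-class problem such as the one studied here, the same functions apply unchanged to a 4 × 4 matrix.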

Study Site
The study site, called the Mustang Island Wetland Observatory, is a coastal marsh located on a barrier island along the southern portion of the Texas Gulf Coast, USA, bounded by Corpus Christi Bay, the Laguna Madre, and the Gulf of Mexico, as shown in Figure 8. The study area as imaged by the UAS is 11 hectares. Elevation within the wetland changes gradually and the terrain is nearly flat, with the highest elevation in the study area at about 0.8 m (NAVD88). The wetland is located on the bay side of the island (Figure 8) and is oriented in a northeast to southwest trend, with the Gulf of Mexico located to the east and Corpus Christi Bay to the west. The dominant vegetation species are Schizachyrium littorale (Nash) (coastal bluestem) and Spartina patens (Aiton) (gulf cordgrass), commonly found growing in mats. The second most prevalent environment of this study area is tidal flat, which ranges in elevation from −0.05 m to 0.5 m (NAVD88) [88]. Low, regularly flooded tidal/algal flats are significantly less abundant than high flats in this area. These local tidal flats are designated as wind-tidal flats because flooding occurs mainly due to wind-driven tides [89,90]. Blue-green algae can be prevalent in the lower portion of the tidal flats after long periods of inundation. Furthermore, salt marsh vegetation can be found sparingly in portions of the tidal flat areas. Low marsh areas are very high in biologic productivity and usually range in elevation from −0.1 m to 0.3 m (NAVD88). More frequently inundated areas near tidal creeks are dominated by taller vegetation, primarily Avicennia germinans (L.) (black mangrove). The high marsh environment is the least abundant in the study area imaged by the UAS. It ranges in elevation from approximately 0.2 m to 0.8 m (NAVD88), well above the mean high tide; therefore, it is rarely inundated.
These characteristics briefly illustrate the highly dynamic and complex nature of the coastal wetland and the need for applying accurate algorithms for detailed land cover mapping through analyzing UAS hyper-spatial imagery.

Data Collection and Preparation
A Phantom 3 multi-rotor UAS, manufactured by SZ DJI Technology Co., Ltd., headquartered in Shenzhen, Guangdong province, China, was employed to collect the images required for this study. This platform is equipped with a CMOS RGB sensor that captures 12 megapixel images with a resolution of 4000 × 3000 pixels. The flight was designed at an altitude of 90 m above the ground, resulting in an average GSD of around 3 cm. Imagery was collected at 80% sidelap and endlap, flown in a grid pattern with parallel flight lines and a 90-degree (nadir) camera orientation. This high amount of overlap was used to perform Structure-from-Motion (SfM) photogrammetry processing and to orthorectify the imagery, removing perspective and relief distortion and generating a large orthoimage that covers the study area. The performance and visual quality of land cover prediction using the different deep CNN models are evaluated on a part of the study area for which most of the original images were held out for validation purposes. Because land cover in RS applications is usually predicted on orthorectified images, the visual quality of land cover prediction is illustrated on an orthoimage mosaic of the validation images. The reader is referred to [90,91] for more details on SfM photogrammetry.
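The flight parameters above imply an approximate ground footprint per image and a spacing between flight lines and exposures. The short sketch below works through that arithmetic using only the values stated in the text (4000 × 3000 pixels, a GSD of about 3 cm, 80% overlap). It assumes flat terrain, a nadir view, and the long image side oriented across the flight direction; that orientation and the function names are assumptions made for illustration.

```python
def footprint_m(width_px, height_px, gsd_m):
    """Ground footprint (across, along) in meters for a nadir image over flat terrain."""
    return width_px * gsd_m, height_px * gsd_m


def spacing_m(footprint_side_m, overlap):
    """Distance between neighboring images that yields the requested overlap fraction."""
    return footprint_side_m * (1.0 - overlap)


across, along = footprint_m(4000, 3000, 0.03)   # GSD of about 3 cm
line_spacing = spacing_m(across, 0.80)          # 80% sidelap
exposure_spacing = spacing_m(along, 0.80)       # 80% endlap

print(round(across, 1), round(along, 1))                  # footprint in meters
print(round(line_spacing, 1), round(exposure_spacing, 1)) # spacings in meters
```

Under these assumptions each image covers roughly 120 m × 90 m on the ground, so neighboring flight lines are about 24 m apart and successive exposures about 18 m apart.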
In this work, 300 images were manually selected from the total set of acquired UAS images (about 500 images) covering the whole study area to reduce repetitive information from image overlap. Due to the high resolution of the original imagery, the image set can rapidly exhaust the GPU memory when fed directly to any deep convolutional network. Therefore, we randomly extracted 10,000 image patches of 512 × 512 pixels from the set of 300 raw images. Of those image patches, 1000 are held out as a validation dataset for evaluating the model performance after each training epoch. Every image represents at most four classes: vegetation, water bodies, tidal flat, and road. In our experiment, tidal flat is assigned to surfaces exposed within intertidal areas. All temporarily flooded areas or permanently submerged lands are considered water bodies. Areas covered by any type of vegetation are called vegetated areas. Finally, road represents the artificially elevated dirt surface of exposed ground that has not been affected by tides. The different land cover classes can be observed in the orthoimage mosaic displayed in Figure 8, which was generated from all UAS images acquired over the study area using the SfM photogrammetry software. All ground truth data needed for training and validation were manually prepared through supervised labeling by interpretation and delineation of land cover boundaries in the image patches. This was done by color labeling every pixel in each original image patch with a representative class using a pixel-level image labeling app in MATLAB. According to our predefined color for each target, pixels belonging to vegetation, tidal flat, water, and road are represented by green, orange, blue, and brown, respectively.
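The patch extraction and hold-out split described above can be sketched as follows. The source does not state how patch locations were drawn, so uniform random sampling of the image index and of the top-left corner is an assumption here, as are the function name and the seeds.

```python
import random


def sample_patch_origins(n_patches, n_images, img_w=4000, img_h=3000, patch=512, seed=0):
    """Draw (image index, x, y) triples so that every patch lies fully inside its image.
    Uniform sampling is an assumption; the paper does not specify the scheme."""
    rng = random.Random(seed)
    origins = []
    for _ in range(n_patches):
        img = rng.randrange(n_images)
        x = rng.randint(0, img_w - patch)   # inclusive bounds keep the patch in range
        y = rng.randint(0, img_h - patch)
        origins.append((img, x, y))
    return origins


origins = sample_patch_origins(n_patches=10000, n_images=300)

# Hold out 1000 patches for validation, as in the text; the rest are used for training.
random.Random(1).shuffle(origins)
val, train = origins[:1000], origins[1000:]
print(len(train), len(val))
```

Each triple identifies the raw image and the pixel window to crop, so the image data and the corresponding label mask can be cut with the same coordinates.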
It should also be mentioned that a set of 64 raw UAS images from a portion of the study site that had a representative distribution of the land cover classes were set aside for independent evaluation of model performance results as presented below in Section 4. These images, or patches extracted from them, were not used as part of the training set described above.

Deep Learning Architectures
This subsection introduces the deep learning architectures evaluated in this study for performing pixel-wise image segmentation task (i.e., land cover mapping) with UAS hyper-spatial imagery acquired over a complex coastal wetland environment. The chosen architectures are extensively used in a wide range of applications beyond RS including computer vision and medical image processing.

• Encoder-Decoder (SegNet). The SegNet architecture, displayed in Figure 9, is a relatively old deep learning network for semantic image segmentation. It uses the VGG network as its encoder to hierarchically extract features from input images. The encoder network consists of 13 convolutional layers corresponding to the first 13 convolutional layers in the VGG-16 network. In our experiment, we use weights from a pre-trained VGG-16 network to initialize the training process. Each encoder layer has a corresponding decoder layer that upsamples the feature maps by using the stored pooling indices.
• U-Net. U-Net is a well-known deep architecture based on the encoder-decoder principle that, instead of using pooling indices, transfers the entire feature maps from encoder to decoder. The upsampling strategy can have a great impact on the final accuracy of pixel-wise image classification, so comparing the performance of SegNet and U-Net can tell us more about the effectiveness of those two upsampling strategies. Figure 10 illustrates the U-Net architecture with the ResNet-34 network used for feature extraction in this study.
• FC-DenseNet. To explore the efficiency of the DenseNet architecture in feature learning for pixel-wise classification of coastal wetland images, the one hundred layer tiramisu model (FC-DenseNet), shown in Figure 11, is employed. It uses 56 convolutional layers, with four layers per dense block and a growth rate of 12. Similar to the U-Net architecture, FC-DenseNet exploits a U-shaped encoder-decoder structure with skip connections between the downsampling and upsampling paths to add higher resolution information to the final feature map. The unique characteristics of FC-DenseNet (feature reuse, compactness, and a substantially reduced number of parameters) are evaluated in our experiment based on its performance when training the network from scratch using a limited dataset, which is the case here.
• DeepLabV3+. The effectiveness of ASPP for encoding multi-scale contextual information in images acquired over a complex coastal wetland is investigated by examining the DeepLabV3+ architecture illustrated in Figure 12. This architecture is able to perform several parallel atrous convolutions with different rates.
• PSPNet. As illustrated in Figure 13, PSPNet, which uses a pyramid pooling module for more reliable prediction, is also investigated in this study. Specifically, this module is able to extract global context information by aggregating different regional context information.

In our experiment, we use a pre-trained ResNet-34 network as the feature encoder in all employed architectures except the Encoder-Decoder (SegNet) and FC-DenseNet architectures. To predict the category of each image pixel, all employed deep architectures include a multi-class softmax classifier on top, which is fed by the upsampled feature map output from the final layer of the network to produce pixel-wise class probabilities. Cross-entropy and the Adam optimizer [92] are selected as the loss function and optimization algorithm, respectively.
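The softmax classifier and cross-entropy loss mentioned above operate independently on every pixel. A minimal single-pixel sketch in plain Python is given below; the class order and the raw scores are made up for illustration, and the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to one."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the ground-truth class of one pixel."""
    return -math.log(probs[true_class])


CLASSES = ["vegetation", "tidal flat", "water", "road"]
logits = [2.0, 1.0, 0.1, -1.0]                   # made-up scores for a single pixel

probs = softmax(logits)
predicted = CLASSES[probs.index(max(probs))]     # label with the highest probability
loss = cross_entropy(probs, CLASSES.index("vegetation"))
print(predicted, [round(p, 3) for p in probs], round(loss, 3))
```

In training, this per-pixel loss would typically be averaged over all pixels of a patch and over the patches of a batch before back-propagation.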
The Adam optimizer computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients [92] and realizes the benefits of both AdaGrad [93] and RMSProp [94]. It includes several parameters that need to be carefully set. Popular deep learning libraries generally use the default parameters recommended by the paper, including the learning rate α = 0.001, the two exponential decay rates β1 = 0.9 and β2 = 0.999, and ε = 1e−8, which prevents any division by zero in the implementation. In our experiment, we set all optimization parameters according to those recommended values.
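For reference, a single Adam update with the default values quoted above can be written in a few lines. This is a scalar sketch of the update rule from [92] applied to a made-up objective f(θ) = θ²; it is not the training code used in the experiments.

```python
import math


def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = theta^2 with gradient 2 * theta (made up for illustration).
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
    print(t, theta)
```

Because of the bias correction, the first step moves the parameter by almost exactly α regardless of the gradient magnitude, which is one reason the default learning rate transfers reasonably well across problems.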
Weight initialization is carried out for all employed networks. Except for FC-DenseNet, weight parameters in other networks are initialized by transfer learning. For FC-DenseNet, we decided to train the network from scratch since we did not find a pre-trained FC-DenseNet on large datasets such as ImageNet. FC-DenseNet has very few parameters, about 10 times less than recent state-of-the-art models; thus, it is worth it to train this network from scratch and compare its performance over our limited dataset with the performance of pre-trained encoders. All deep CNN models in this experiment were trained using the same training samples under the same conditions for 200 epochs.
All experiments were carried out on Amazon Web Services (AWS) with one high-performance NVIDIA K80 GPU, with 2496 parallel processing cores and 12 GB of GPU memory, and high-frequency Intel Xeon E5-2686 v4 processors under CUDA version 10.0.

Table 1 presents the land cover prediction results achieved by the different deep CNN architectures employed in the image segmentation experiment. The first two columns represent overall accuracy for training (OA-Train) and validation (OA-Val). Precision, recall, F1, and mIoU are included for evaluating the performance of each architecture, as these are some of the most widely used metrics. Figure 17 displays a cropped orthoimage from the upper portion of the study area and its corresponding ground truth labels. This area was selected for model validation purposes because it provides a representative distribution of the different land cover classes. Images from this area were not included in the training samples. Figure 17 stems from a set of 64 overlapping UAS images that were orthorectified and mosaicked together as part of the SfM photogrammetric processing used to create the full study area orthoimage (Figure 8). To classify the orthoimage, the full image is not fed directly into the model due to its large size. Instead, small image patches (512 × 512 pixels) are extracted and fed into the model to undergo pixel-wise labeling. After the land cover classes contained in each individual image patch are predicted by the model, the patches are reassembled to generate the full resolution image. The land cover maps predicted for this orthoimage by all deep CNN models employed in this study are displayed in Figure 18a-f. Interestingly, the land cover classes predicted by all employed CNN models closely resemble the ground truth image in Figure 17. However, FC-DenseNet, U-Net, and DeepLabV3+ provide the most accurate representations of the ground targets in this complex wetland environment.
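The patch-wise classification and reassembly of the orthoimage can be sketched as below. This is a simplified illustration, not the authors' pipeline: `predict_patch` is a hypothetical stand-in for a trained model, tiles do not overlap, and edge tiles are simply cropped, whereas a real CNN with a fixed 512 × 512 input would need padding or overlapping tiles, which the text does not describe.

```python
def tile_origins(width, height, patch=512):
    """Top-left corners of non-overlapping tiles that together cover the image."""
    return [(x, y) for y in range(0, height, patch) for x in range(0, width, patch)]


def predict_mosaic(image, predict_patch, patch=512):
    """Label a large image tile by tile and write each result back in place.
    image: 2-D list of pixel values; predict_patch: callable returning a label
    tile with the same shape as the tile it receives (hypothetical model stub)."""
    height, width = len(image), len(image[0])
    labels = [[None] * width for _ in range(height)]
    for x0, y0 in tile_origins(width, height, patch):
        tile = [row[x0:x0 + patch] for row in image[y0:y0 + patch]]
        pred = predict_patch(tile)
        for dy, pred_row in enumerate(pred):
            labels[y0 + dy][x0:x0 + len(pred_row)] = pred_row
    return labels


# Toy 5 x 7 "orthoimage" and a thresholding stub standing in for a CNN.
image = [[(r * 7 + c) % 5 for c in range(7)] for r in range(5)]
dummy_model = lambda tile: [[1 if p >= 2 else 0 for p in row] for row in tile]

labels = predict_mosaic(image, dummy_model, patch=3)
print(len(tile_origins(7, 5, patch=3)), labels[0])
```

With a per-pixel stub such as the one above, the reassembled map equals the result of labeling the whole image at once, which makes the tiling logic easy to check.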

Discussion
Referring to Figure 15, FC-DenseNet, U-Net, and DeepLabV3+ show lower training and validation losses than the MobileU-Net, PSPNet, and SegNet models, resulting in higher training and validation accuracies according to Table 1. For the SegNet model, the validation losses stay a certain distance above the training losses, which explains the larger difference between validation and training accuracies reported for this model. Furthermore, still referring to Figure 15, U-Net shows a higher speed of convergence during the training phase. This suggests that the skip connections from encoder to decoder contribute substantially to smoothing the gradient descent path towards the global minimum in the high-dimensional weight space. Additionally, in comparison to FC-DenseNet, the fine-tuning strategy of the transfer learning technique employed by U-Net reduced the number of training epochs required. This approach helps to exploit the advantages of deeper CNNs with a larger number of trainable parameters where the available training resources are limited (as is the case here due to manual labeling). FC-DenseNet also takes advantage of skip connections in its encoders, and between encoders and decoders, which helps with the flow and convergence of gradient descent through reuse of features. However, because the network is trained from scratch, more training steps are necessary for it to converge.
The fine-tuning strategy of transfer learning yielded very good results in all models with pre-trained VGG-16 and ResNet-34 architectures as their encoders for feature learning. This is a promising result given that the low-level and high-level natural/wetland terrain features in our dataset are noticeably different from those that appear in the ImageNet dataset used for pre-training the deep CNN architectures. Furthermore, the overall accuracy achieved by training FC-DenseNet from scratch suggests that the dramatic reduction in the number of parameters of this architecture relative to other state-of-the-art deep learning architectures enables it to learn effective features when presented with relatively limited training samples.
Regarding the F1 score and mIoU values depicted in Table 1, the first three CNN models exhibit the highest performance among the others. According to the confusion matrices displayed in Figure 16, three of the employed networks, FC-DenseNet, U-Net, and DeepLabV3+, were successful in predicting labels for pixels belonging to all existing classes with accuracy above 90%. Almost all deep networks were successful in predicting pixels belonging to the vegetation class with an accuracy greater than 95%. Compared to the other classes of targets, vegetation represents the least confused class. Referring back to Figure 16, especially when SegNet, PSPNet, and MobileU-Net models were employed, road pixels are mostly confused with tidal flat pixels, and pixels belonging to water bodies are more likely to be misclassified as tidal flat and vegetation. It should be noted that discriminating pixels belonging to the tidal flat class from those belonging to the road class at this study site is a difficult task. These two classes exhibit very high inter-class similarities due to the road being a dirt road comprised of similar sand material to that of the exposed ground areas within tidal flat areas but with some mixed gravel.
The comparable overall accuracy of the FC-DenseNet architecture trained from scratch to that of the U-Net and DeepLabV3+ architectures, which use fine-tuned encoders, illustrates that the compactness of FC-DenseNet in terms of its number of parameters makes it a good choice among many recently developed CNN architectures for pixel-wise labeling when training from scratch with limited training samples. The U-Net architecture, trained with the transfer learning technique, was the most accurate and efficient choice among the evaluated networks for pixel-wise labeling. Its performance indicates that the employed transfer learning technique does very well when it is used to learn hierarchical features in high-spatial resolution UAS or RS images over natural terrain like wetlands. Such image sets and features are significantly different from the features of standard image datasets, such as ImageNet [54]. The high performance of the DeepLabV3+ architecture demonstrates the effectiveness of ASPP in this network, which is able to properly encode multi-scale contextual information of the coastal wetland land cover captured in the images. However, this network needs more training steps than U-Net to reach comparable performance. Our experiment with PSPNet at the wetland study site shows that the pyramid pooling module together with the pyramid scene parsing network is more effective in predicting vegetation and tidal flat areas than water and road areas. MobileU-Net and SegNet achieved the lowest accuracy among all architectures employed for semantic image segmentation. The results achieved by our MobileNet architecture are based on baseline settings for its hyper-parameters, namely α = 1 for the width multiplier and ρ = 1 for the resolution multiplier. Decreasing those two hyper-parameters can dramatically decrease the performance of the network. However, MobileNets have the potential to be employed effectively in some real-time RS applications.
As mentioned earlier, MobileNets were built as small, low-latency, low-power models parameterized to meet the resource constraints of a variety of mobile and embedded vision applications. These types of models require less computational power and capacity for near real-time applications compared to very deep architectures, which have a higher capacity for learning due to their larger number of parameters. SegNet, like the other architectures employed in this experiment, performed very well in vegetation areas but was much less accurate in classifying other targets. It is suspected that SegNet's weaker pixel-wise labeling of the other, more challenging targets stems from the network's inefficiency in exploiting low-level and high-level abstract features throughout the network and from its inefficient upsampling method.
It is worth mentioning that the information needed for training any of the evaluated classification architectures was obtained through supervised labeling by interpretation and delineation of land cover boundaries in the UAS images. This interpretation includes labeling a relatively large number of images by a human operator. This may result in different types of errors in the labelling of land cover types, and most notably in those circumstances in which categories are very heterogeneous and the landscape is complex. This is especially worrisome for non-domain experts or practitioners of deep learning who may not be familiar with the key characteristics that differentiate one land cover type or boundary from another. In this case, training was limited to four relatively distinct classes of importance to our wetland monitoring efforts, as opposed to more refined classes, to try and reduce those issues. Although different types of vegetation and land cover exist in the study area, this grouping aided our ability to efficiently label the data and serve the study purpose. However, the high level of classification accuracy reported here, to some degree, may be a function of this class structure. Efforts to classify the land cover into more distinct categories and capture more biodiversity will be posed with greater labeling and training challenges and require more domain expertise. Classification accuracy may be lower in such cases than those reported here, especially if relying on low spectral resolution RGB imagery alone as evaluated in this study. Inevitably, some mixing of classes will occur during the labeling process, regardless of expertise or attention to detail, and these challenges will grow over heterogeneous and complex natural landscapes like coastal wetlands. This problem can be exacerbated when attempting to perform pixel-level labeling using very high resolution imagery, such as created from a low-altitude UAS flight. 
This is due to a large amount of within-class spectral variability when viewing land cover at zoomed-in geographic scales (here cm-level). Labeling errors are most pronounced for pixels along class boundaries because natural targets do not usually have clear borders. In some landscapes, two or more different targets can be so mixed together that the operator cannot decide which label should be given to a specific pixel or area. Inevitably, the decision becomes highly subjective. Such areas can be seen in the lower right part of Figure 17, where a vegetation area has been submerged in shallow water. In this work, it was classified as a water body/submerged landscape. Additionally, at this specific study site, discriminating pixels belonging to edges of the tidal flat class from those belonging to the dirt road was a difficult task because those two classes exhibit very high inter-class similarities at their boundaries. As a result, the uncertainty in labeling road pixels close to the boundaries increased.
Lastly, coastal wetlands are among some of the most dynamic and complex ecosystems on the planet. Many different factors, such as seasonal and climate changes, water temperature, altered flooding and salinity patterns, sea-level rise, and topography [11,13], contribute to the current state of the land cover and its physical properties at the time of recording the remote sensing observations. Thus, the authors emphasize that the classification results shown here, based on the classes chosen to be examined, are valid for the specific dataset acquired at a certain time over the study site. The results cannot necessarily be generalized to the same coastal wetland area imaged at a different time, or at a different land cover state, without further analyses. Ambient environmental conditions, such as lighting or wind, can impact data captured in a UAS image. Similarly, flight design, including altitude above ground and camera perspective (e.g., oblique versus nadir), will impact the GSD and appearance of land cover features. As a result, the visual representation of the same target may deviate from one exposure to another in a single UAS flight mission and across repeat data acquisitions. For this study, UAS data acquisition targeted calm winds and a bright, sunny day. The flight was conducted during the middle part of the day to reduce shadowing and enhance scene brightness. Furthermore, the entire scene was mapped in under thirty minutes, so variation of ambient lighting during flight was minimal. Camera angles were kept at nadir to provide a top-down view for orthoimage generation and reduce shadowing of terrain from oblique perspectives.
Future efforts will need to examine the generalizability and stability of these models when performing repetitive classification using a time series of images captured from repeat UAS flights under varying conditions. However, we believe that the high capacity of deep CNN models to efficiently extract informative and discriminative features from raw UAS images in an end-to-end manner has the potential to be extended further by training deep CNN models using a time series of UAS images acquired over the same area. An efficient deep network trained using appropriate training samples acquired at different times and labeled with expert knowledge will be able to capture more properties of a certain land cover target at different states of the wetland or other environment. Such models could provide a powerful framework for designing automatic or online land cover prediction systems that aim to offer high performance regardless of the conditions at the time of data acquisition.

Conclusions
Wetlands provide a challenging natural environment for performing high accuracy land cover prediction with hyper-spatial resolution UAS imagery due to high intra-class variability and low inter-class disparity often observed between classes. For decades, semantic image segmentation for land cover mapping tasks in the RS field has relied heavily on the tedious procedure of manually designing and extracting the most informative hand-crafted features from the available data, which are then fed into different machine learning techniques for classification or segmentation. On the other hand, the accuracy of any prediction technique is highly dependent on the contribution of those features for discriminating different targets that are captured in high-spatial to hyper-spatial resolution images, such as those acquired by UAS flying at low altitude.
In this research, we exploited state-of-the-art deep learning frameworks, commonly called deep CNNs, to automatically explore high-dimensional hierarchical feature spaces and find the most informative and discriminative features for performing a pixel-wise image labeling task for land cover mapping. Among the many available deep CNN architectures, this study investigated the performance of some of the more recent very deep CNN architectures that are heavily employed for pixel-level labeling in many different applications. Six different networks were evaluated, FC-DenseNet, U-Net, DeepLabV3+, MobileU-Net, PSPNet, and Encoder-Decoder (SegNet), for performing a pixel-wise classification task using UAS hyper-spatial resolution images acquired over a coastal wetland area. Results of the study revealed that hierarchical features learned by the deep learning frameworks are highly efficient for discriminating different targets in a complex wetland environment and providing accurate pixel-level land cover predictions for the target classes investigated (vegetation, tidal flat, water, road). Specifically, fine-tuning of deep architectures with tens of millions of parameters is the best strategy when there is a limited labeled dataset as was the case in this study. This is also the case for most current RS land cover mapping applications where large repositories of relevant labeled datasets are not available. In this study, FC-DenseNet trained from scratch outperformed the other architectures regarding the overall accuracy performances (Table 1) based on the validation dataset. However, U-Net architecture with ResNet34 encoder outperformed the other architectures based on training speed while achieving comparable accuracy to FC-DenseNet. These results suggest that U-Net is the most efficient architecture for the UAS hyper-spatial pixel-wise classification task explored here. 
Skip connections in the FC-DenseNet and U-Net architectures play a significant role in these networks' ability to train faster and/or achieve higher overall accuracy. DeepLabV3+, which uses the atrous spatial pyramid pooling (ASPP) technique to account for objects at multiple scales, was also very successful at pixel-level prediction in our case study. Furthermore, the per-class accuracy results revealed that almost all networks predicted pixels belonging to the vegetation class with high accuracy.
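As an illustration of the two kinds of figures referred to above, the following minimal Python sketch derives overall and per-class accuracy from a confusion matrix of flattened pixel labels. It is not the evaluation code used in this study: the function names, the toy label vectors, and the definition of per-class accuracy as the fraction of each class's ground-truth pixels that are labeled correctly are all assumptions made for the example. Only the four class names come from the text.

```python
from typing import List, Sequence

# Class names taken from the study; the integer ids 0..3 are an assumption.
CLASSES = ["vegetation", "tidal flat", "water", "road"]


def confusion_matrix(truth: Sequence[int], pred: Sequence[int],
                     n_classes: int) -> List[List[int]]:
    """Count pixels: rows are ground-truth classes, columns are predictions."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(truth, pred):
        matrix[t][p] += 1
    return matrix


def overall_accuracy(matrix: List[List[int]]) -> float:
    """Fraction of all pixels whose predicted label matches the ground truth."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_accuracy(matrix: List[List[int]]) -> List[float]:
    """Per class: correctly labeled pixels / ground-truth pixels of that class.

    Assumed definition (per-class recall); classes with no ground-truth
    pixels are reported as NaN.
    """
    scores = []
    for i, row in enumerate(matrix):
        n_truth = sum(row)
        scores.append(row[i] / n_truth if n_truth else float("nan"))
    return scores


# Hypothetical flattened label maps, purely for illustration.
truth = [0, 0, 0, 1, 1, 2, 2, 3]
pred = [0, 0, 0, 1, 2, 2, 2, 0]

cm = confusion_matrix(truth, pred, len(CLASSES))
oa = overall_accuracy(cm)
pca = per_class_accuracy(cm)
for name, score in zip(CLASSES, pca):
    print(f"{name}: {score:.2f}")
print(f"overall accuracy: {oa:.2f}")
```

In this toy example a single well-predicted, frequent class (vegetation) pulls overall accuracy up even though one class is missed entirely, which is why the two measures are worth reading side by side.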
The experiment with the U-Net architecture employing a ResNet34 encoder revealed that fine-tuning using the transfer learning technique works well for hyper-spatial UAS image analyses. Furthermore, transfer learning in combination with skip connections in the CNN architecture significantly reduced the need for the large number of training epochs and the large labeled data resources typically required for training deep CNNs, without sacrificing their high classification performance. In this study, FC-DenseNet, with 56 convolutional layers and trained from scratch, performed comparably to the U-Net architecture in overall classification accuracy evaluated on the training dataset. This suggests that the parameter-based compactness of FC-DenseNet makes it a good choice among deep CNN architectures for accurate pixel-wise labeling in RS applications where transfer learning may not be efficiently applicable and/or a higher level of generalization with a limited training sample is required. However, when trained from scratch, FC-DenseNet needs more training epochs to reach an overall accuracy comparable to that of U-Net using a pre-trained encoder with the same number of training samples.
In conclusion, the results of this study demonstrate the high potential of recent deep CNN architectures for pixel-wise land cover mapping with hyper-spatial resolution imagery acquired from a small UAS equipped with an RGB camera or by other RS methods. Transfer learning is highly applicable for training deep CNNs in RS applications and helps achieve state-of-the-art performance when labeled data resources are limited. Finally, coastal wetlands are highly diverse natural environments that present a range of complexities when attempting to identify more refined land cover classes, such as vegetation types. Such efforts will likely demand more advanced sensors to capture finer spectral information from the different targets. Future work will explore deep CNN architectures for pixel-wise labeling of multispectral and hyperspectral images to predict land cover in a coastal wetland setting.