Comparing Deep Learning and Shallow Learning for Large-Scale Wetland Classification in Alberta, Canada

Advances in machine learning have transformed many fields of study and have drawn increasing attention in a variety of remote sensing applications. In particular, deep convolutional neural networks (CNNs) have proven very useful in fields such as image recognition; however, the use of CNNs in large-scale remote sensing landcover classifications still needs further investigation. We set out to test CNN-based landcover classification against a more conventional XGBoost shallow learning algorithm for mapping a notoriously difficult group of landcover classes: wetland classes as defined by the Canadian Wetland Classification System. We developed two wetland inventory style products for a large (397,958 km²) area in the Boreal Forest region of Alberta, Canada, using Sentinel-1, Sentinel-2, and ALOS DEM data acquired in Google Earth Engine. We then tested the accuracy of these two products against three validation data sets (two photo-interpreted and one field). The CNN-generated wetland product proved to be more accurate than the shallow learning XGBoost wetland product by 5%. The overall accuracy of the CNN product was 80.2% with a mean F1-score of 0.58. We believe that CNNs are better able to capture natural complexities within wetland classes, and thus may be very useful for complex landcover classifications. Overall, this CNN framework shows great promise for generating large-scale wetland inventory data and may prove useful for other landcover mapping applications.


Introduction
Machine learning, a method whereby a computer discovers rules to execute a data-processing task given training examples, can generally be divided into two categories: Shallow learning and deep learning [1]. Deep learning uses many successive layered representations of data (i.e., hundreds of convolutions/filters), while shallow learning typically uses one or two layered representations of the data [1]. Deep learning has shown great promise for tackling many tasks such as image recognition, natural language processing, speech recognition, superhuman Go playing, and autonomous driving [1][2][3].
are recharged through precipitation [32]. Marshes and swamps are also sensitive to climate change due to their reliance on predictable seasonal flooding cycles [33].
Spatial wetland inventories at a country or provincial scale [30][31][32][33][34] are not new, but having data that are reliable for land management and land planning decisions is a challenge. In Canada, mapping of wetlands via remote sensing is a well-studied topic [35]. Initially, inventories were typically built through aerial image interpretation [36]. While accurate, this methodology is usually very time consuming and costly. Given Canada's commitment to, and involvement with, synthetic aperture radar (SAR) data, many studies have used SAR to map and monitor wetlands [37][38][39][40][41][42][43][44][45] with varying degrees of success. It appears SAR data are most useful for monitoring the dynamics of wetlands. Other studies and projects have used moderate resolution optical data such as Landsat or Sentinel-2 to generate wetland inventories [30,46,47]. Most modern approaches to large-scale wetland inventories utilize a fusion of data such as SAR and optical [34,39,48] and, ideally, SAR, optical, plus topographic information [6,7,49]. Theoretically the fusion of SAR, optical, and topographic information should give the most information on wetlands and wetland class because: (1) SAR is sensitive to the physical structure of vegetation and can detect the dynamic nature of wetlands with a rich time series stack; (2) optical data can capture variations in vegetation type and vegetation productivity, as it is sensitive to the molecular structure of vegetation; and (3) topographic information can provide data about hydrological patterns which drive wetland formation and function.
It is our understanding that all of the machine learning studies listed in the paragraph above used shallow learning methods, such as random forest, SVM, or boosted regression trees. It appears that distinguishing wetland class with remote sensing data and shallow machine learning is still a difficult task. This is likely because a single wetland class (e.g., fen) can have several different vegetation types: forested, shrubby, or graminoid [35]. Additionally, wetlands of different classes can have identical vegetation and vegetation structure; these are then distinguishable only through their below-ground hydrological patterns. Finally, wetland classes do not have defined boundaries, since they gradually transition into another class or upland habitat [50]. This makes spatially explicit inventories inherently inaccurate because a hard boundary must be identified. These issues with wetland and wetland class mapping are best summed up in Figure 1. Given three different data sources (SAR, optical, and topographic) with static and multi-temporal remote sensing measures, the four wetland classes do not show any noticeable differences at the pixel level (top panel of Figure 1). The violin plots show the distribution of numerical values for all four wetland classes (essentially a vertical histogram by class). Marshes may show a wider distribution of values, but fens, bogs, and swamps are almost identical. In the bottom panel of Figure 1, even visual identification of these wetland classes with high resolution imagery is difficult. Fens, in the bottom right of the image, can be identified visually (flow lines and lighter tan color), but fens appear very different in the top right corner (dark green color and apparently treed).
With the known difficulty of wetland classification with shallow learning (Figure 1), we believe wetland class mapping is the perfect candidate for deep learning and CNNs. In practice, CNNs trained at the patch level learn low- and high-level features from the remote sensing data. For example, waterline edges that delineate marshes and open water may only need simple edge-detection convolution filters, while fens and bogs may be differentiated by subtle variations in texture or color (i.e., the visible flow lines in fens). Within the last couple of years, a number of studies have used deep learning for wetland mapping in Canada over small areas and have achieved promising results when compared to alternative shallow learning methods [20,51,52].
With the current status of machine learning and the history of Canadian wetland mapping in mind, we propose a simple goal for this study: To compare deep learning (CNN) classifications with shallow learning (XGBoost) classifications for wetland class mapping over a large region of Alberta, Canada (397,958 km²), using the most up-to-date, open-source fusion of data from Sentinel-1 (SAR), Sentinel-2 (optical), and the Advanced Land Observing Satellite (ALOS) digital elevation model (DEM) (topographic). To reach a strong conclusion, we validate our results against three validation data sets (two photo-interpreted and one field).

Study Area
Our study area includes the Boreal Natural Region (BNR) of Alberta, Canada, along with parts of the Canadian Shield, Parkland, and Foothills Natural Regions (Figure 2). The study area comprises 60% (397,958 km²) of the total area of Alberta. Elevations range from 150 m above sea level in the northeast to 1100 m near the Alberta-British Columbia border [53].
The BNR has short summers and long, cold winters [53]. Vegetation consists of vast deciduous, mixed wood, and coniferous forests interspersed with large wetland complexes [53]. Agriculture is limited to the southeast region of the study area (northeast of Edmonton, a large urban center) and areas around Grand Prairie (western portion of the study area) [54]. Other anthropogenic features are from forestry activities and extensive oil and gas development in the regions around Fort McMurray [55].
The Alberta Wetland Classification System recognizes five main wetland classes across the province: Bog, fen, marsh, swamp, and shallow open water. The BNR is dominated by fens and bogs (peatlands), which typically form in cool, flat, low-lying areas with poorly drained soils and peat accumulations of 30-40 cm or more [56,57]. The fens and bogs of this region are classified as wooded coniferous, shrubby, or graminoid, with bogs being relatively acidic and fens ranging from poor acidic to extreme-rich alkaline [55][56][57][58]. Fens and bogs are typically differentiated by their hydrology: fens are fed by flowing ground water and precipitation, while bogs are fed solely by precipitation and have relatively stagnant water. Marshes are periodically inundated areas consisting mainly of emergent graminoid vegetation, while swamps are typically forested or shrubby, hold standing water for longer periods, and support dense coniferous or deciduous vegetation [59].

Figure 2. Location of the study area (red), ABMI plots used for validation (orange), ABMI plots used for training and pre-prediction validation (black), location of the Canadian Centre for Mapping and Earth Observation (CCMEO) validation data (purple), and Alberta Environment and Parks (AEP) field data location (blue) overlain on an elevation background.

Data
Data for these landcover classifications came from three sources: Sentinel-1 SAR, Sentinel-2 optical, and ALOS DEM data. All inputs can be seen in Table 1. All data used in this study were acquired, processed, and downloaded through the Google Earth Engine (GEE) JavaScript API [60]. Each Sentinel-1 ground-range-detected image in GEE was pre-processed with the Sentinel-1 toolbox using the following steps: Thermal noise removal, radiometric calibration, and terrain correction using the Shuttle Radar Topography Mission (SRTM) 30 m DEM. All Sentinel-1 dual-polarization (VV, VH) images over Alberta during the spring/summer period (15 May-15 August) for the years 2017 and 2018 were used, yielding 1123 Sentinel-1 images. These images were then further processed with an angle correction [61], an edge mask for dark strips on the edges of images, and multi-temporal filtering using a two-month window [62]. To obtain the static backscatter inputs, the mean pixel value of the image stack was calculated. Additionally, the polarization ratio was calculated by dividing the VH polarization by the VV polarization. Sentinel-1 time-series metrics were calculated in the same manner, but restricted to certain dates: Delta VH was calculated by subtracting winter backscatter (1 November-31 March) from summer backscatter (1 June-15 August).
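Conceptually, the static and time-series SAR inputs reduce to simple per-pixel arithmetic over the image stack. A minimal numpy sketch with hypothetical backscatter arrays (the actual processing was done in Google Earth Engine; array names and values are illustrative only):

```python
import numpy as np

# Hypothetical backscatter stacks (dB), shape: (n_images, rows, cols)
rng = np.random.default_rng(0)
summer_vh = rng.uniform(-20, -10, size=(8, 4, 4))
summer_vv = rng.uniform(-15, -5, size=(8, 4, 4))
winter_vh = rng.uniform(-22, -12, size=(5, 4, 4))

# Static inputs: per-pixel temporal mean of each polarization
vh_mean = summer_vh.mean(axis=0)
vv_mean = summer_vv.mean(axis=0)

# Polarization ratio: VH divided by VV, as described in the text
# (note: on dB-scaled data a subtraction is the more common convention)
pol_ratio = vh_mean / vv_mean

# Time-series metric: delta VH = summer mean minus winter mean
delta_vh = summer_vh.mean(axis=0) - winter_vh.mean(axis=0)
```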
Sentinel-2 top-of-atmosphere data were acquired over all of Alberta for the same time period as the Sentinel-1 data. Note that Sentinel-2 surface reflectance products were not available in GEE at the start of this study. All images with a cloudy pixel percentage of less than 50% were used, yielding 4479 Sentinel-2 images. All cloud and shadow pixels were masked out using an adapted Google Landsat cloud score algorithm and a Temporal Dark Outlier Mask (TDOM) method. These methods have not yet appeared in a peer-reviewed publication, but are expected to be published in [63]. To obtain the static Sentinel-2 inputs, the median pixel for each band in the pixel stack was chosen; this eliminates any bright or dark outlier pixels. With these median bands, all vegetation indices seen in Table 1 were calculated. The time-series inputs were calculated in the same fashion, with the median summer value (1 June-31 July) subtracted from the median fall value (1 September-30 September). Time-series metrics were calculated for the Sentinel-1 and -2 data to capture the temporal signatures of certain wetland classes (i.e., marshes).
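The median compositing and the delta-NDVI metric can likewise be sketched in numpy (hypothetical reflectance stacks; the fall composite below is a placeholder, not real data):

```python
import numpy as np

# Hypothetical cloud-masked reflectance stacks, shape: (n_images, rows, cols)
rng = np.random.default_rng(0)
red_stack = rng.uniform(0.02, 0.2, size=(10, 4, 4))
nir_stack = rng.uniform(0.2, 0.5, size=(10, 4, 4))

# Static input: per-pixel median suppresses residual bright/dark outliers
red_med = np.median(red_stack, axis=0)
nir_med = np.median(nir_stack, axis=0)
ndvi_summer = (nir_med - red_med) / (nir_med + red_med)

# Time-series input: fall median composite minus summer median composite
# (the fall composite here is a hypothetical scaled copy for illustration)
ndvi_fall = ndvi_summer * 0.8
dndvi = ndvi_fall - ndvi_summer
```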
The ALOS 30 m DEM was acquired over all of Alberta. To match the 10 m resolution of the Sentinel-1 and Sentinel-2 data, the DEM was resampled to 10 m with a bicubic method and converted to a floating-point data type. Additionally, a 5 × 5 pixel spatial mean filter was applied to the DEM to create more realistic hydrological indices [7]. With the 10 m ALOS DEM, topographic indices were then calculated using the open source terrain analysis software SAGA, version 5.0.0 [64].
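A minimal sketch of this DEM preparation, using scipy's `ndimage` as a stand-in for the resampling and smoothing described above (the tile and its values are hypothetical):

```python
import numpy as np
from scipy import ndimage

# Hypothetical 30 m DEM tile (metres above sea level)
dem30 = np.random.default_rng(4).normal(500.0, 25.0, size=(10, 10))

# Resample 30 m -> 10 m with cubic interpolation (order=3) and cast
# to floating point, mirroring the bicubic resample described above
dem10 = ndimage.zoom(dem30.astype(np.float32), 3, order=3)

# 5 x 5 pixel spatial mean filter to smooth the surface before
# deriving hydrological indices
dem10_smooth = ndimage.uniform_filter(dem10, size=5)
```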
The training data for all models came from the Alberta Biodiversity Monitoring Institute's Landcover Photo Plots (henceforth, ABMI plots; see distribution in Figure 2, example plot in Figure 1, and data here: http://bit.ly/326i6V4) [65]. The ABMI plots are attributed, spatially explicit polygons derived from high resolution three-dimensional (3D) image interpretation. They include information on wetland class, wetland form, forest type, and structure. The ABMI plots have undergone ground-truthing and are typically highly accurate (high 90% range) when compared to field data [65]. For this study, we extracted the following classes from the LC3 field: Open water-0; fen-1; bog-2; marsh-3; swamp-4; upland-5. It should be noted that we did not train models with the shallow open water (defined as a maximum of 2 m deep) class because the ABMI plots do not have accurate representations of this class.

Table 1 (excerpt). Descriptions of selected input variables:

NDWI: Normalized Difference Water Index from [69].

VH/VV: The ratio between the VH and VV polarizations.

PSRI: Plant Senescence Reflectance Index. A ratio used to estimate the ratio of bulk carotenoids to chlorophyll [70].

REIP: Red Edge Inflection Point. An approximation of a hyperspectral index for estimating the position (in nm) of the NIR/red inflection point in vegetation spectra [71].

TPI (ALOS; XGB/CNN): Topographic Position Index, generated in SAGA [64]. An index describing the relative position of a pixel within a valley-to-ridge-top continuum, calculated in a given window size. TPI was calculated with a 750 m moving window for this purpose [72]; justification for this size can be seen in [7].
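Two of these inputs lend themselves to a short computational sketch. The snippet below is a hedged illustration, not the study's implementation: REIP is computed with the common linear-interpolation approximation (the Sentinel-2 band mapping used here is an assumption), and TPI is computed as elevation minus a focal mean, whereas the study computed it in SAGA.

```python
import numpy as np
from scipy import ndimage

def reip(b4, b5, b6, b7):
    # Red Edge Inflection Point via the common linear-interpolation
    # approximation; b4/b5/b6/b7 are taken as Sentinel-2 reflectances
    # near 665/705/740/783 nm (band choice is an assumption here)
    return 700 + 40 * (((b4 + b7) / 2 - b5) / (b6 - b5))

def tpi(dem, window_px):
    # Topographic Position Index: elevation minus the mean elevation
    # within a square moving window
    return dem - ndimage.uniform_filter(dem, size=window_px)

# At 10 m pixel size, a 750 m window is 75 pixels
dem = np.random.default_rng(1).normal(300.0, 10.0, size=(100, 100))
tpi_map = tpi(dem, 75)
```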

Machine Learning Models
The shallow learning classification was done with the XGBoost algorithm [75]. XGBoost was used because it has been shown to be one of the better-performing shallow learning models in machine learning competitions, having been the most popular shallow learning algorithm in Kaggle competitions since 2014 [1], although it has seen limited use in the remote sensing literature. Early work on this project showed XGB models slightly outperforming random forest and boosted regression tree models. We used the xgboost package [75] in R Statistical Software [76]. The inputs into the XGBoost model were: Anthocyanin Reflectance Index (ARI), delta Normalized Difference Vegetation Index fall-spring (dNDVI), POLr, Red Edge Inflection Point (REIP), Topographic Position Index (TPI), Topographic Wetness Index (TWI), Multi-Resolution Index of Valley Bottom Flatness (VBF), VH, and dVH (Table 1). These inputs were the indices shown to be important for wetland class mapping while also having low correlation with each other; this variable selection process is described in more detail in [7]. The model was trained on the six classes from the ABMI plot training data using the "multi:softmax" objective setting. The XGBoost parameters were tuned using grid-search functions to find the optimal values as judged by the test error metric. Additionally, we wanted to err on the side of conservative model building, since we knew there was little power in the inputs to discriminate between wetland classes (see Figure 1). The optimized XGBoost parameters were: nrounds = 500, max_depth = 4, eta = 0.03, gamma = 1, min_child_weight = 1, subsample = 0.5, colsample_bytree = 0.8. See [77] for a description of XGB parameter tuning. We then built 15 separate XGBoost models, each with a different subset of 2000 random points spaced anywhere from 900 to 2000 m apart, depending on the relative abundance of each of the six classes.
The selection of this number of points follows the methodology seen in [7]. A total of 15 models were built with minimum point spacing because this prevents model over-fitting and reduces spatial autocorrelation [6,7,78]. The 15 models were then used to predict landcover class across the study area 15 times at a 10 m resolution. To get the final result, the modal value of the 15 predictions was chosen as the final class. Additional smoothing of the product was done with a 7 × 7 pixel modal filter to better match the ecological patterns of wetland classes.
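The modal vote across the 15 predictions and the 7 × 7 modal smoothing can be sketched as follows (hypothetical arrays and a deliberately simple implementation; a sketch of the post-processing idea, not the production code):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(2)
# Hypothetical class predictions (classes 0-5) from 15 models over a tile
preds = rng.integers(0, 6, size=(15, 8, 8))

# Per-pixel modal (most frequent) class across the 15 model runs
final = np.zeros(preds.shape[1:], dtype=np.int64)
for i in range(preds.shape[1]):
    for j in range(preds.shape[2]):
        final[i, j] = np.bincount(preds[:, i, j]).argmax()

# 7 x 7 modal smoothing filter (generic, slow implementation)
def window_mode(values):
    vals, counts = np.unique(values, return_counts=True)
    return vals[np.argmax(counts)]

smoothed = ndimage.generic_filter(final, window_mode, size=7)
```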
The segmentation convolutional neural network was implemented in the Python programming language using the Keras [79] deep learning library. The specific architecture used was a U-Net CNN, originally developed for biomedical image segmentation [22]. The U-Net architecture is based on a fully convolutional network and was used because it typically requires fewer training patches and can be trained in a reasonable time [22]. A sample of the U-Net architecture can be seen in Appendix B. The inputs used by our CNN model were: ARI, Band 2, Band 3, Band 4, DEM, NDVI, Normalized Difference Water Index (NDWI), Plant Senescence Reflectance Index (PSRI), REIP, TPI, Topographic Roughness Index (TRI), TWI, VBF, and VH (Table 1). Note that these inputs differ from the XGB model's, because different inputs are needed to best optimize deep learning models. We feel this leads to the best comparison, since each model should be close to optimized within its respective architecture. We chose not to input only RGB data, as is done in some deep learning remote sensing studies [80], because wetland classes were found to differ very little at an optical level across such a large study area. Every layer, except the DEM, was clipped high and low based on the 95th and 5th percentiles, and then standardized by subtracting the mean and dividing by the standard deviation. The training patch size was 224 × 224 × 14 (14 being the number of input layers) and the label patch was 49 × 49 × 6 (6 being the number of modeled classes). The smaller 49 × 49 label patch was used because there is evidence that prediction error increases slightly as one moves from the center of the input patch to the edge [81]. Furthermore, a label patch smaller than the input patch means there is some overlap in the inputs between adjacent patches, which helps combat patch boundary effects.
Since the error does vary with architecture, we chose a reasonable inner patch size for the entire model exercise, although the optimized patch size was not iteratively tested. The output activations for the CNN were sigmoid units. The model was trained using the Keras Nadam optimizer (Nesterov Adam optimizer [82]) with a combination of binary cross-entropy (a common loss representing the entropy between two probability distributions) and dice coefficient loss (a statistic used to gauge the similarity of two samples) as the objective loss function. Candidate training patch indices were created using a simple moving window with a stride of 10, and simple label counts were generated. During training, patches were randomly selected from the patch list and randomly rotated left or right by 90 degrees, flipped horizontally or vertically, or left as is. Since the marsh and swamp wetland classes were somewhat rarer than the other classes, during batch creation (using a batch size of 24) we ensured that there were at least six patches containing each of those labels. Using a geometrically decaying learning rate, the model was trained for 110 epochs, where each epoch was composed of 4800 training samples. Model training took approximately 3-4 h, and prediction over the whole study area at 10 m resolution took a similar amount of time. Training and prediction were done on a desktop with 64 GB of RAM and one Titan X (Maxwell) GPU. A full comparison of computation time between the models can be seen in Table 2.
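The per-layer normalization described above can be sketched as follows (a minimal sketch; the function name and toy array are ours, not from the study's code):

```python
import numpy as np

def clip_and_standardize(band):
    # Clip at the 5th/95th percentiles, then subtract the mean and
    # divide by the standard deviation (applied to every CNN input
    # layer except the DEM, per the text)
    lo, hi = np.percentile(band, [5, 95])
    clipped = np.clip(band, lo, hi)
    return (clipped - clipped.mean()) / clipped.std()

# Hypothetical single input layer matching the 224 x 224 patch size
band = np.random.default_rng(3).normal(0.3, 0.1, size=(224, 224))
z = clip_and_standardize(band)
```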

Validation
We performed three validation exercises on three different validation data sets: ABMI plot data (photo-interpreted), Canadian Centre for Mapping and Earth Observation (CCMEO) data (photo-interpreted), and Alberta Environment and Parks (AEP) data (field data). The photo-interpreted validation exercises returned the overall accuracy, Kappa statistic, and per-class F1-score. The Kappa statistic is a measure of accuracy that accounts for the random chance of correct classification, with values closer to 1 indicating better-than-chance agreement. The F1-score (range 0-1) can be reported for every class and is useful for unbalanced validation data; it is the harmonic mean of precision and recall. Since the field validation data (AEP data) contained substantially fewer points and covered only three classes, we reported the confusion matrix and overall accuracy for this exercise.
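All three evaluation metrics can be derived directly from a confusion matrix. A minimal sketch with a toy two-class matrix (rows = predicted, columns = reference, matching the convention used for Table 4; the numbers are illustrative, not the study's data):

```python
import numpy as np

def metrics_from_confusion(cm):
    # Overall accuracy, Cohen's kappa, and per-class F1-scores from a
    # confusion matrix with rows = predicted and columns = reference
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    oa = np.trace(cm) / n
    # Expected chance agreement from the row and column marginals
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
    kappa = (oa - pe) / (1 - pe)
    # Per-class precision and recall; F1 is their harmonic mean
    precision = np.diag(cm) / cm.sum(axis=1)
    recall = np.diag(cm) / cm.sum(axis=0)
    f1 = 2 * precision * recall / (precision + recall)
    return oa, kappa, f1

cm = [[50, 10],
      [5, 35]]
oa, kappa, f1 = metrics_from_confusion(cm)
```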
The validation with the ABMI plots was done with 19 plots randomly pulled from the data set. These 19 plots were not used in training or pre-prediction validation. This validation data set totaled 1261 spatially explicit polygons containing information on the six landcover classes. A total of 300,000 random points were then generated within these 19 plots; the "truth" labels were extracted from the polygons, as were the two modeled predictions. With these, the three evaluation metrics were reported for the CNN and XGB models.
The CCMEO validation data set contains 852 polygons, which are spatially well distributed across the study area. A total of 194,873 samples in six landcover classes were used for the accuracy assessment of both the CNN and XGB models. It is worth noting that the CCMEO validation data set was used only for validation and not for building (training) the CNN or XGB models. The three evaluation metrics were then calculated for the pixel labels produced by the CNN and XGB models.
The AEP validation data set contains in-situ information from 22 sites within a 70 km radius of the city of Fort McMurray, during the summer of 2019 (Figure 2). At each site, three 20-25-m-long transects were established near the wetland edge transitioning towards the wetland center. Four or five individual plots, located using high-precision GNSS instrumentation, were established with a 5-10 m separation along each transect, where wetland class was recorded at each and subsequently used to inform site-level wetland class. For bogs (N = 7) and fens (N = 6), the site-level "central" location was determined using the (spatial) "mean center" (i.e., the mean of plot x and y coordinates) of all transect plot locations at each site. For open water sites (N = 9), transects terminated at the water's edge; therefore, the "mean center" location would not represent a true open water location. As a means of mitigation, a single (site-level) plot was manually established in the open water (aided by 2018 SPOT optical imagery), adjacent to transect plot locations. As per AEP's sampling protocol, which focuses on key wetland classes to meet Government objectives, data were acquired at bog, fen, and open water wetlands only. This limitation dictates that no validation was performed for marsh, swamp, or upland landcover classes. Similar to the CCMEO data, AEP data were not used for CNN or XGB model training, and a confusion matrix and overall accuracy were reported for the validation data using labeled predictions, as extracted from CNN and XGB outputs. A confusion matrix is used to evaluate the performance of a classifier. In this case, the confusion matrix was used to observe the accuracy of individual wetland classes and see where class "confusion" occurs (i.e., between bog and fen).

Results
The wetland classification results from the CNN and XGB models were compared against the photo-interpreted validation data sets (see Table 3). The CNN model showed an accuracy of 81.3%, a Kappa statistic of 0.57, and a mean F1-score of 0.56, while the XGB model showed an accuracy of 75.6%, a Kappa statistic of 0.49, and a mean F1-score of 0.52 when compared to the ABMI plot data. The CNN model showed an accuracy of 80.3%, a Kappa statistic of 0.52, and a mean F1-score of 0.59, while the XGB model showed an accuracy of 72.1%, a Kappa statistic of 0.41, and a mean F1-score of 0.52 when compared to the CCMEO data. In terms of overall accuracy, the CNN model was 5.7% more accurate than the XGB model when compared to the ABMI data, and 8.2% more accurate when compared to the CCMEO data (Table 3). Overall accuracy with uplands excluded (i.e., just wetland classes) was 60.0% for the CNN model and 45.6% for the XGB model. Full results of the accuracy assessment can be seen in Appendix A. The per-class F1-scores for the ABMI and CCMEO data are seen in Figure 3 (blue showing the F1-score for the CNN model and orange the F1-score for the XGB model). The open water class shows almost equal F1-scores between the two models. The fen class F1-score is much higher in the CNN model (0.57) than the XGB model (0.35), while the bog score proved to be slightly higher in the XGB model. The marsh and swamp classes both had higher F1-scores in the CNN model, although the swamp scores were quite poor in both (0.25 and 0.21). Finally, the most numerous class, upland, showed a slightly higher F1-score in the CNN model.

When compared to the field validation data (AEP data), both products show a 50% overall accuracy on 22 points. In the CNN data, fen is correctly identified in six sites, but fen is also incorrectly predicted in five out of the seven bog sites (Table 4). In the XGB data, fen is predicted correctly in two of the six sites. XGB bog is more accurate than the CNN model, being correct in three of the seven sites. Open water appears to be similarly predicted in both products, with five out of nine sites being accurately predicted. Open water was incorrectly predicted as swamp or marsh in both models.

Table 4. Confusion matrix for the CNN and XGB accuracy assessment against the AEP field validation data. Columns represent reference wetland class, while the rows represent the predicted wetland class.

Visual results of the two products can be seen in Figures 4 and 5. The four insets zoom into important wetland habitats in Alberta, which are described in the figure captions. An interactive side-by-side comparison of the two products can also be seen on a web map via this link (https://abmigc.users.earthengine.app/view/cnn-xgb). Additionally, the CNN product can be downloaded via this link (https://bit.ly/2X3Ao6N). We encourage readers to assess the visual aspects of these two products.

With reference to Figures 4 and 5, and the web map, it is evident that CNN predictions produce smoother boundaries between wetland classes. The XGB model appears to produce speckled marsh at some locations (see McClelland fen, bottom-right inset of Figure 5). One of the major differences is the amount of bog versus fen predicted in the north-west portion of the province. The XGB model predicts massive areas of bog with small ribbons of fen, while the CNN model predicts about equal parts fen and bog in these areas. Overall, the CNN model predicts: 4.8% open water, 19.0% fen, 3.0% bog, 1.0% marsh, 4.0% swamp, and 68.2% upland. The XGB model predicts: 4.4% open water, 10.8% fen, 9.3% bog, 5.0% marsh, 10.3% swamp, and 60.2% upland.

Discussion
This study produced two large-scale wetland inventory products using a fusion of open-access satellite data and machine learning methods. The two machine learning approaches that were compared, convolutional neural networks and XGBoost, demonstrate a decent ability to predict wetland classes and upland habitat across a large region. Some wetland classes, such as bog and swamp, proved to be much harder to map; this is made clear in the relative F1-scores of the wetland classes (Figure 3). In the comparisons to the photo-interpretation validation data sets (Table 3), it is clear that the CNN model outperforms the XGB model in terms of overall accuracy, Kappa statistic, and per-class F1-score. As expected, accuracies against the ABMI validation data were slightly higher than against the CCMEO validation data, by 1-3%. This is still surprisingly close, given that the CCMEO data were completely independent of the model training process; the ABMI data were also an independent validation set, but a subset of the data used to train the models. The CCMEO validation demonstrated a larger gap between the two models, with the CNN outperforming the XGB model by 8.2%; the ABMI data still showed a large difference of 5.7%. The gap between the models is even more apparent when removing uplands, as the CNN model was 60.0% accurate, while the XGB model was 45.6% accurate. In terms of model development, both models took similar amounts of time to optimize, train, and predict (Table 2).
When comparing the products to field data, the results do not seem as promising. Both products had a 50% overall accuracy over 22 field sites. With reference to the field data, the CNN model clearly over-predicts fen, with 11 of the 13 peatland sites being predicted as fen, while the XGB model does not appear to have much ability to distinguish bog from fen (only 6 out of the 13 predicted correctly). We fully expect the overall accuracy against field data would rise if upland classes were included, but the main goal of this landcover classification was to map wetland classes. We believe less weight should be assigned to this accuracy assessment, given that it comprised just 22 points covering a small portion of the overall study area. Nevertheless, this does raise the question of how well landcover classifications actually match what is seen on the ground. This may be something to test further when a larger field data set can be acquired; at present, we cannot conclude whether this is a real issue or a result of the small sample size. In the end, wetland class on the ground is what actually matters for policy and planning, not what a photo-interpreter sees.
It appears that contextual information, texture, and convolutional filters help the CNN model better predict wetland class. The fen class is predicted much more accurately in the CNN model-0.57 versus 0.35 F1-score. This may be due to the parallel flow lines seen in many fen habitats (Figure 1), which can potentially be captured by certain convolutional filters. Marshes are also predicted more accurately in the CNN model. Here the CNN model likely uses contextual information about marshes, given marshes often surround water bodies. Visually, the CNN model appears to produce more ecologically meaningful wetland class boundaries. Boreal wetland classes are generally large, complex habitats, which can have multiple different vegetation types within one class. A large fen can be treed on the edges, then transition into shrub and graminoid fens at the center. Overall, it appears that the natural complexities of wetlands are better captured with a CNN model than a traditional pixel-based shallow learning (XGB) method. It is possible that an object-based wetland classification may also capture these natural complexities, but that is a question for future studies, as it was not in the scope of this project. We would also like to point out that the reference CNN was not subjected to rigorous optimization, and it is likely that there is still room for improvement in this model. This does tell us, though, that a naïve implementation of a CNN does outperform traditional shallow learning approaches for large-scale wetland mapping. Future work should focus on the ideal inputs for CNN wetland classification (i.e., spectral bands only, spectral bands + SAR, or handcrafted spectral features + SAR + topography).
Other studies have attempted similar deep learning wetland classifications in Canadian Boreal ecosystems. Pouliot et al. [51] tested a CNN wetland classification over a similar region in Alberta using Landsat data and reported 69% overall accuracy. Mahdianpari et al. [52] achieved 96% wetland class accuracy in Newfoundland with RapidEye data and an InceptionResNetV2 algorithm. Mohammadimanesh et al. [20] reported 93% wetland class accuracy, again in Newfoundland, using RADARSAT-2 data and a fully convolutional neural network. All of these studies demonstrated that deep learning outperforms other machine learning methods such as random forest. Among non-neural network methods, Amani et al. [83] reported 71% wetland class accuracy across all of Canada, and Mahdianpari et al. [34] achieved 88% wetland class accuracy using an object-based random forest algorithm across Newfoundland. The comparison between CNNs and shallow learning methods in this study comes to the same conclusion as other recent work: CNN/deep learning algorithms lead to better wetland classification results. The CNN product produced here does not achieve the accuracies over 90% seen in some smaller-scale studies, likely because it is predicted across such a large area (397,958 km2). Compared to other large-scale wetland predictions, however, it appears to be one of the more accurate products, and thus may prove useful for provincial/national wetland inventories. It also comes with the benefit of being produced entirely from open-access Sentinel-1, Sentinel-2, and ALOS data, and thus can be easily updated to capture the dynamics of a changing Boreal landscape.
Large-scale wetland/landcover mapping in Canada seems to be converging towards a common methodology. Most studies are now using a fusion of remote sensing data from SAR, optical, and DEM products [6,7,34,39,49]. The easiest way to access provincial/national-scale data appears to be through Google Earth Engine; thus, many studies use Sentinel-1, Sentinel-2, Landsat, ALOS, or SRTM data [6,7,34,83]. Finally, many machine learning methods have been tested, but it appears that convolutional neural network frameworks produce better, more accurate wetland/landcover classifications [51,52]. This work contributes to these previous studies by confirming the value of CNNs. It also contributes to the greater goal of large-scale wetland mapping by demonstrating the ability to produce an accurate wetland inventory with a CNN and open-access satellite data.
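The data-fusion step this common methodology shares amounts to stacking co-registered SAR, optical, and topographic rasters into a single per-pixel feature cube before classification. The sketch below shows that stacking in miniature; the band names, random values, and tile size are illustrative placeholders, not the study's actual Earth Engine pipeline.

```python
import numpy as np

H, W = 64, 64  # a small illustrative tile, not the study's full extent

# Hypothetical co-registered inputs (same grid, same projection)
s1_vv = np.random.rand(H, W)      # Sentinel-1 SAR backscatter, VV polarization
s1_vh = np.random.rand(H, W)      # Sentinel-1 SAR backscatter, VH polarization
s2_opt = np.random.rand(4, H, W)  # four Sentinel-2 optical bands
dem = np.random.rand(H, W)        # ALOS DEM elevation

def fuse(sar_bands, optical_bands, topo_bands):
    """Stack SAR, optical, and topographic layers into one (bands, H, W) cube."""
    layers = list(sar_bands) + list(optical_bands) + list(topo_bands)
    return np.stack(layers, axis=0)

features = fuse([s1_vv, s1_vh], s2_opt, [dem])
print(features.shape)  # (7, 64, 64): 2 SAR + 4 optical + 1 DEM band per pixel
```

Once stacked this way, the same cube can feed either a pixel-based shallow learner (each pixel's 7-band vector as one sample) or a patch-based CNN (small windows of the cube as samples), which is what makes the two approaches directly comparable.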

Conclusions
The goal of this study was to compare shallow learning (XGB) and deep learning (CNN) methods for the production of a large-scale spatial wetland classification. We encourage readers to view both products via this link: https://abmigc.users.earthengine.app/view/cnn-xgb; one of the products can be downloaded via this link: https://bit.ly/2X3Ao6N. A comparison of the two products against photo-interpreted validation data showed that the CNN product outperforms the shallow learning (XGB) product in accuracy by about 5-8%. The CNN product achieved an average overall accuracy of 80.8% with a mean F1-score of 0.58. When compared to a small set (n = 22) of field data, the results were inconclusive, and both products showed little ability to distinguish between fens and bogs. This finding could simply be due to the small, spatially constrained field data set, or it could highlight a mismatch between on-the-ground conditions and large-scale landcover classifications.
Given the success of the CNN model in terms of accuracy, scalability, and production time, we believe this framework has the potential to provide credible landcover/wetland data for provinces, states, or countries. The use of Google Earth Engine and freely available imagery makes the production of these inventories low cost, with minimal processing time. The CNN deep learning algorithm produces products with ecologically meaningful boundaries, and such algorithms are better able to capture the natural complexities of landcover classes such as wetlands. It appears that state-of-the-science large-scale inventories are moving towards deep learning-based classifications using freely available imagery accessed through Google Earth Engine. We hope other wetland/landcover mapping studies in