Challenges in Estimating Tropical Forest Canopy Height from Planet Dove Imagery

Monitoring tropical forests using spaceborne and airborne remote sensing capabilities is important for informing environmental policies and conservation actions. Developing large-scale machine learning estimation models of forest structure is instrumental in bridging the gap between retrospective analysis and near-real-time monitoring. However, most approaches use moderate spatial resolution satellite data with limited capabilities of frequent updating. Here, we take advantage of the high spatial and temporal resolutions of Planet Dove images and aim to automatically estimate top-of-canopy height (TCH) for the biologically diverse country of Peru from satellite imagery at 1 ha spatial resolution by building a model that associates Planet Dove textural features with airborne light detection and ranging (LiDAR) measurements of TCH. We use and modify features derived from Fourier textural ordination (FOTO) of Planet Dove images using spectral projection and train a gradient boosted regression for TCH estimation. We discuss the technical and scientific challenges involved in the generation of reliable mechanisms for estimating TCH from Planet Dove satellite image spectral and textural features. Our developed software toolchain is a robust and generalizable regression model that provides a root mean square error (RMSE) of 4.36 m for Peru. This represents a helpful advancement towards better monitoring of tropical forests and improves efforts in reducing emissions from deforestation and forest degradation (REDD+), an important climate change mitigation approach.


Introduction
Tropical forests are an important component in the global carbon cycle and for mitigating climate change, but continued forest use [1] is transforming tropical forests into sources of atmospheric carbon [2]. Mapping tropical forests is essential to understanding deforestation and degradation impacts, supporting natural resource monitoring, informing future conservation, and validating carbon sequestration agreements [3,4]. Achieving such a monitoring capability in an economically viable way requires integrating spaceborne and airborne remote sensing capabilities with ground truth data into robust and accurate machine learning models [5].
Mapping forest structure usually relies on measuring and estimating tree characteristics, such as wood density, stem diameter or top-of-canopy height (TCH) [6]. Of these, TCH is one of the easiest to determine remotely using terrestrial [7], airborne [8] or spaceborne [9,10] instruments. When not measured directly, TCH can be estimated from a variety of sensors, like Moderate Resolution Imaging Spectroradiometer (MODIS) [11,12] and Landsat [13,14], which have been used to scale-up light detection and ranging (LiDAR) measurements of TCH. However, their moderate spatial and temporal 2 of 18 resolutions might constitute a drawback for tropical TCH estimation [15]. Developing large-scale estimation models using higher spatial and temporal resolution will be instrumental in bridging the gap between retrospective analysis and near-real-time monitoring [16].
Such a high spatial and temporal capability is provided by the largest fleet of small cube satellites, Planet Dove, imaging the globe daily at 3.7 m resolution [17]. Moving from Landsat collections to the Planet Dove data source presents a substantial challenge in that Planet Dove data does not possess the same spectral resolution or stability. It does, however, provide substantially higher spatial and temporal resolution than Landsat, necessary to detect rapid changes in the underlying tropical forest vertical structure. While orthorectified Planet Dove imagery does not directly contain height information, textures present in high-resolution imagery can capture indications of TCH difference, canopy diameter via shapes, roughness, shadowing, and spectral differences [18]. In this context, some popular texture analysis studies employ gray-level co-occurrence matrices (GLCM) [19,20], Fourier textural ordination (FOTO) [21,22], or Gabor wavelets [23]. A model that predicts TCH from spectral and textural information can be further used to estimate aboveground carbon density (ACD) using previously calibrated and reported relationships [24][25][26][27].
In this study, we take advantage of the high spatial and temporal resolutions of Planet Dove images and aim to automatically estimate TCH for the entire country of Peru from satellite imagery at 100 m (1 ha) spatial resolution by building a model that associates Planet Dove textural features with airborne LiDAR measurements of TCH. We created a country-scale generalizable software toolchain for TCH estimation focusing our methods on (i) spatial texture analysis with modified FOTO textures and (ii) creating a machine learning regression model using gradient boosted regression trees. Our work has implications towards near-real-time monitoring of tropical forest canopy height at high spatial resolution. This will improve efforts in reducing emissions from deforestation and forest degradation (REDD+), an important climate change mitigation approach [4].

Study Area and Datasets
Our study area is the entire country of Peru, which covers more than 128.5 million ha with a wide range of ecosystems, from dry deserts to tropical forests and from lowlands to the mountainous Andes. The various types of vegetation include, but are not limited to, lowland rainforests and palm swamps, montane rainforests, dry forests and Andean forests, grasslands, shrublands and wetlands. Tropical forests of Peru are biologically diverse, with many native and endemic species [28], and some regions have more than 300 tree species per ha [29].
The primary data sets employed in this study were a cloud-free Planet Dove normalized multi-spectral mosaic and a LiDAR derived TCH, together with Shuttle Radar Topography Mission (SRTM) elevation (30 m resolution) [30] (Figure 1). The spatial resolution of the analysis was 1 ha (100 m × 100 m grid cells).
We created a normalized mosaic using 64,075 Planet Dove scenes having cloud-free pixels, with the near-infrared, red, green, and blue bands for the dry season of 2018 (July to September) at a resolution of 3.7 m [31]. The Planet Dove scenes were processed with atmospheric correction routines (see [31] for details) and each cloud-free Dove pixel was calibrated to co-registered Landsat data using a linear fit. This normalization was needed in order to reduce the scene-to-scene variability. To create a seamless normalized mosaic, a seam line removal algorithm was applied in the end, without blurring or affecting the spatial resolution of the images. This algorithm creates a long-wavelength adjustment to intensity near scene boundaries, so that edge lines from adjacent scenes are similar [26]. The final cloud contamination for the mosaic was 0.5% for the entire area of Peru.
The LiDAR data transect samples were acquired by the Global Airborne Observatory (GAO, formerly Carnegie Airborne Observatory) [8] during flight campaigns in 2011 and 2013 [25]. An average-on-the-ground LiDAR points spacing of 4-8 shots per square meter led to the creation of a

Motivation for extending FOTO
Our feature space started as radial Fourier power spectra. Indexes of forest canopy texture, via the FOTO method, as derived from such spectra of aerial imagery have been shown to describe stand structure parameters in local areas [22,32], suggesting that frequency domain representations of areas may be good starting points for machine learning algorithms designed to predict stand structure at large scales.
FOTO consists of two basic steps: (i) computing radially-averaged Fourier power spectra from a grayscale image as features and (ii) using normalized principal component analysis (PCA) to reduce the dimensionality of the spectra [21]. The normalization procedure, as implemented by FOTO, consists of scaling the power spectra such that they have unit norms prior to performing PCA. This eliminates spectral information contained in the features, which may not be beneficial to the general problem of predicting forest structural characteristics. Forest structure diversity, or indeed general terrain diversity, can complicate the PCA strategy for ordination, since it selects eigenvectors of a covariance matrix. Top PCA components will select directions that maximally explain variance in the feature space. For a diverse landscape, the structural differences within forest canopies will be overpowered (in the top eigenvectors) by the vast differences in landscape, resulting in what resembles a complex low-pass filter. It is perhaps for this reason that researchers have suggested dropping the first couple of PCA indexes when employing FOTO [33].
We modified the FOTO method by replacing its PCA stage with a spectral projection technique that has been shown to differentiate image textures well for the purposes of unsupervised segmentation [34] (Figure 2). We also modified the Fourier features avoiding extra normalization and

Motivation for Extending FOTO
Our feature space started as radial Fourier power spectra. Indexes of forest canopy texture, via the FOTO method, as derived from such spectra of aerial imagery have been shown to describe stand structure parameters in local areas [22,32], suggesting that frequency domain representations of areas may be good starting points for machine learning algorithms designed to predict stand structure at large scales.
FOTO consists of two basic steps: (i) computing radially-averaged Fourier power spectra from a grayscale image as features and (ii) using normalized principal component analysis (PCA) to reduce the dimensionality of the spectra [21]. The normalization procedure, as implemented by FOTO, consists of scaling the power spectra such that they have unit norms prior to performing PCA. This eliminates spectral information contained in the features, which may not be beneficial to the general problem of predicting forest structural characteristics. Forest structure diversity, or indeed general terrain diversity, can complicate the PCA strategy for ordination, since it selects eigenvectors of a covariance matrix. Top PCA components will select directions that maximally explain variance in the feature space. For a diverse landscape, the structural differences within forest canopies will be overpowered (in the top eigenvectors) by the vast differences in landscape, resulting in what resembles a complex low-pass filter. It is perhaps for this reason that researchers have suggested dropping the first couple of PCA indexes when employing FOTO [33].
We modified the FOTO method by replacing its PCA stage with a spectral projection technique that has been shown to differentiate image textures well for the purposes of unsupervised segmentation [34] ( Figure 2). We also modified the Fourier features avoiding extra normalization and took their square root. This retained input image intensity units and magnitude, allowing us to retain potentially beneficial spectral information.
Remote Sens. 2020, 12, x FOR PEER REVIEW 4 of 17 took their square root. This retained input image intensity units and magnitude, allowing us to retain potentially beneficial spectral information. The workflow for TCH estimation using Planet Dove, elevation, and airborne LiDAR datasets. The Planet Dove high-resolution mosaic was used as input for the Fourier textural ordination (FOTO) and spectral projection in order to generate features for the gradient boosted regression model weighted based on K-means clusters. The final TCH estimation was validated against hold-out airborne LiDAR measurements of TCH.

Fourier power spectra as a texture measure
Two-dimensional Fourier transforms were used to decompose images into a collection of sinusoidal components with differing wavelengths, representing variations at different size scales. Radially summed, or averaged power spectra may be computed from the transformed images and have been shown useful in describing image texture in a manner suitable for correlating with forest stand properties [22]. This kind of decomposition, illustrated in real space, is shown in Figure 3.
In addition, for an appropriately constructed radial power spectra, Parseval's theorem describes how to relate it to the root-mean-squared (RMS) intensity of the original image; that is, when the forward Fourier transform is chosen to be unitary (i.e., with a normalization factor of 1/2), Parseval's theorem is described in Equation (1).
where is the Fourier transform of (equation presented in 1D). Since this relationship naturally holds for spatially band-passed images constructed by dropping Fourier modes outside of a selected range of wavenumbers, it also holds element-wise for radial power spectra. This means that, as shown in Equation (2), we can interpret corresponding elements of Fourier periodograms computed from each band of a multispectral reflectance image as an RMS measure of reflectance for a particular coarseness length scale in an image. As a result, there was motivation for employing widely used pixel-based multispectral analysis techniques, such as spectral angle mapping [35] on each Fourier mode independently, thereby providing a strategy for combining spectral and textural features in a common analytical framework. Employing this framework on an N-band image patch processed into an M-element power spectrum, results in an M × N element feature matrix, where columns correspond to image bands, and rows to Fourier length scales. The workflow for TCH estimation using Planet Dove, elevation, and airborne LiDAR datasets. The Planet Dove high-resolution mosaic was used as input for the Fourier textural ordination (FOTO) and spectral projection in order to generate features for the gradient boosted regression model weighted based on K-means clusters. The final TCH estimation was validated against hold-out airborne LiDAR measurements of TCH.

Fourier Power Spectra as a Texture Measure
Two-dimensional Fourier transforms were used to decompose images into a collection of sinusoidal components with differing wavelengths, representing variations at different size scales. Radially summed, or averaged power spectra may be computed from the transformed images and have been shown useful in describing image texture in a manner suitable for correlating with forest stand properties [22]. This kind of decomposition, illustrated in real space, is shown in Figure 3.
In addition, for an appropriately constructed radial power spectra, Parseval's theorem describes how to relate it to the root-mean-squared (RMS) intensity of the original image; that is, when the forward Fourier transform is chosen to be unitary (i.e., with a normalization factor of 1/2), Parseval's theorem is described in Equation (1).
where X(k) is the Fourier transform of x(t) (equation presented in 1D). Since this relationship naturally holds for spatially band-passed images constructed by dropping Fourier modes outside of a selected range of wavenumbers, it also holds element-wise for radial power spectra. This means that, as shown in Equation (2), we can interpret corresponding elements of Fourier periodograms computed from each band of a multispectral reflectance image as an RMS measure of reflectance for a particular coarseness length scale in an image. As a result, there was motivation for employing widely used pixel-based multispectral analysis techniques, such as spectral angle mapping [35] on each Fourier mode independently, thereby providing a strategy for combining spectral and textural features in a common analytical framework. Employing this framework on an N-band image patch processed into

Local-scale random forest models
Random forest (RF) models [36] have been an extremely popular tool across a variety of application areas including the estimation of TCH and aboveground carbon density for tropical forests [26]. There are a variety of reasons for this, including ease of application (i.e., minimal parameter tuning requirements), ability to handle large numbers of input variables, and the traditional perception that they are somewhat immune to overfitting [37]. Such models, trained over smaller regions, can produce good accuracy results, but behave poorly when transferring the model to other regions [38,39]. For TCH prediction, a variety of factors can lead to this, including high dimensionality of input features, comparatively low complexity of the output TCH prediction, spatially localized imaging artifacts in source data, and ecologically relevant spatial autocorrelations. While the first two issues are traditional fitting challenges, the other challenges are more specialized to the spatial/remote sensing domain and are worth discussing as related to RF methodologies in general.

Spatial autocorrelation for RF
A number of studies have discussed spatial autocorrelations in ecology [40][41][42]. Of greater relevance are studies discussing the origin and conditions of overfitting in RF models that identify diversity and locality of estimators as crucial conditions for consistency [43]. Of crucial concern is how spatial autocorrelations caused by both natural and artificial factors can undermine RF statistical

Local-Scale Random Forest Models
Random forest (RF) models [36] have been an extremely popular tool across a variety of application areas including the estimation of TCH and aboveground carbon density for tropical forests [26]. There are a variety of reasons for this, including ease of application (i.e., minimal parameter tuning requirements), ability to handle large numbers of input variables, and the traditional perception that they are somewhat immune to overfitting [37]. Such models, trained over smaller regions, can produce good accuracy results, but behave poorly when transferring the model to other regions [38,39]. For TCH prediction, a variety of factors can lead to this, including high dimensionality of input features, comparatively low complexity of the output TCH prediction, spatially localized imaging artifacts in source data, and ecologically relevant spatial autocorrelations. While the first two issues are traditional fitting challenges, the other challenges are more specialized to the spatial/remote sensing domain and are worth discussing as related to RF methodologies in general.

Spatial autocorrelation for RF
A number of studies have discussed spatial autocorrelations in ecology [40][41][42]. Of greater relevance are studies discussing the origin and conditions of overfitting in RF models that identify diversity and locality of estimators as crucial conditions for consistency [43]. Of crucial concern is how spatial autocorrelations caused by both natural and artificial factors can undermine RF statistical infrastructure, as well as cross validation infrastructure, leading to overfit models with undefined uncertainty that evade detection by standard measures.
At their core, RFs are variance reduction techniques that use batch-aggregation (or bagging) to produce an ensemble of trees, each constructed from a subset of the data [36]. They make predictions by voting (in the classification case) or averaging (in the regression case). The ensemble's performance relies on having low-bias trees that are uncorrelated and having the resulting series of uncorrelated errors average to 0. In other words, the individual trees in the forest must be sufficiently different from one another for model variance to be reduced and for generalization to be possible.
Based on a two-dimensional Wigner-Seitz style estimate [44], randomly sampled subsets of a sufficiently large and symmetric region will have an average inter-sample spacing (L) as shown in Equation (3), where A is the sampled area, N is the number of samples, and δ is the pixel size.
In a region with spatial autocorrelations, the expected correlation between two sampled subsets is the value of the spatial autocorrelation function evaluated at this L. For example, this means that 1000 pixels, sampled from a 1000 × 1000 compact region of training data, with a pixel size of 100 m, will have an estimated average inter-sample spacing of 1.8 km. In this idealized case, we would thus expect reasonably independent sampled subsets if the autocorrelation function (1.8 km) is "small", in which case bagging techniques for variance reduction would work as expected. The validity of inferring model generalization performance using a sampled hold-out set is similarly constrained.
RFs use attribute bagging rather than (or in addition to) sample bagging [45]. In this scheme, the variation in decision trees is influenced by random number and order of dimensions considered during each stage of tree generation. An unfortunate consequence of spatial autocorrelations arising from systematic imaging artifacts or true ecological features is that measurable features are expected to be correlated in the region due to a shared local geometry, leading to the random subspace variance reduction scheme being undermined. These issues complicate the interpretation of models trained on smaller regions and suggest that modeling at large scales over diverse environments is necessary for producing meaningful uncertainty estimates.
This realization drove our approach towards employing large-scale gradient-boosted regression trees, with early stopping based on a large scale validation set, which amounts to a bias reduction technique that is far more difficult to overfit [46].

Generalization of TCH Estimation by Controlling Overfitting
Two general strategies for controlling overfitting in order to create a generalizable model are to restrict information available to the model and penalize model complexity. Like the FOTO method, we have chosen to restrict information available to the model by reducing the dimensionality of our features.
The validity of PCA for dimensionality reduction, as employed in FOTO, would suggest that there is redundant information in different length scales and bands. With Fourier decomposition of reflectance-derived data, this may not be such a great assumption, since each extra attribute contributes either a length scale or a band. As a direct alternative to PCA, as employed by FOTO, we used a manifold embedding to reduce the dimensionality of our space. This embedding is constructed by placing each feature in a fully connected, undirected graph, with each edge weighted by an affinity function that we can choose, and then computing a continuous analog of the graph Laplacian (degree -adjacency). The low dimensional space is constructed from eigenvectors of this graph Laplacian. This technique is known as a spectral embedding constructed using Laplacian eigenmaps [47].

Affinity Function
The χ 2 metric was found to be a good affinity function for segmentation based on neighborhood image histograms [34,48]. Since power spectra are length scale histograms, we applied the same metric for our models. In addition, the kernel is positive-definite which was helpful for subsequent steps of the algorithm.
Since our normalized Fourier power spectra represent RMS reflectance values resolved by length scale, we considered a cosine similarity, or spectral angle measure of affinity that operates on each length scale. This is particularly interesting because it means that brightness artifacts that are present in Planet Dove images and mosaics would not impact the similarity between image patches. As such it is expected that models trained based on a mosaic using this affinity would not reflect signatures of cutlines used in generating the mosaic. It also would take advantage of the work done to generate reflectance data. Our constructed affinity is the RMS value of cosine similarity at each length scale (Equation (4)): Unfortunately, this manifold constructed using this affinity, when fit, retained the ability to differentiate between component images of the mosaic for moderate numbers of dimensions, indicating that the source Planet Dove data contained spectral artifacts as well as brightness artifacts. As such, we reverted to the more standard χ 2 metric. We mention this metric because it may be useful in a situation with more stable bands.

Spectral Projection
The Nyström method is a general way to derive an approximation of a full N × N kernel with one derived from a smaller, explicit, sampled set of landmark features [34]. It operates by computing the full kernel over the sampled subset and evaluating new elements in terms of the approximated kernel function evaluated against elements of this subset. This was necessary for our manifold embedding to scale to larger datasets. The combination of manifold embedding and Nyström extension is known as spectral projection and has been used as an effective unsupervised segmentation technique that operates on image textures [34]. In addition, if approximating a positive-definite kernel, the algorithm can operate in a one-shot manner, which enables us to establish a manifold space and embed new vectors into the same space easily.
This means that if we have a well-selected sample set, and our kernel function adequately captures the meaningful similarity between our features, then we should be able to group an arbitrary number of image patches in an unsupervised manner, producing a transformed feature space suitable for training machine learning algorithms. This partially moves the problem of finding a good embedding space to finding a good set of landmark samples. To intelligently sample our output TCH space, we first partitioned our features by TCH and attempted to select an equal number of samples from each bin. Additionally, since right-tail outliers represented a proportionally larger contribution for a subsequent carbon estimation, we also explicitly sampled them to serve as landmark features for spectral projection.

Implicit Terrain Classification
Simple K-means clustering can be applied to the low dimensional manifold constructed using spectral projection [34]. An appropriately fit example revealed different categories of terrain types in the dataset, as well as mixtures of those terrain types, which encoded much of the TCH variation in the Remote Sens. 2020, 12, 1160 8 of 18 data. In addition, these categories were highly unevenly sampled. This was readily seen, qualitatively, by examining a random sampling of source image patches from each cluster, sorted according to cluster averaged TCH (Figure 4).
Remote Sens. 2020, 12, x FOR PEER REVIEW 8 of 17 terrain mixtures fill in the continuous output TCH space, the categorical bias may readily be hidden within regression results and can obscure evaluation of intra-class model performance. In combination with the issues associated with small spatial scale modeling, a terrain categorization and weighting scheme needs to be a part of a large-scale model, and such a scheme would likely need to be unsupervised in order to be practically applicable to large datasets. The combination of spectral projection, K-means clustering, and inverse frequency weighting represents a functional attempt at such a method. Figure 4. Sampled image patches from K-means clustering of the spectral projection. This clustering generated an unsupervised rough terrain classification that appears to group forested, mixture, and various unforested states into categories with a range of different average TCH values. These clusters were used as a proxy for the implicit categorical variable that partially drives the continuous TCH variable and were used to inform an inverse-frequency weighting scheme used during model fitting.

Gradient boosted trees
Once low-dimensional features were generated, we augmented them with SRTM elevation and trained a regression using gradient boosted trees. In contrast to RF techniques, gradient boosted regression combines a sequentially trained series of high bias, low variance learners into an improved ensemble model [46]. As such, it is a bias reduction technique rather than a variance reduction technique, and not subject to the same degree of spatial autocorrelation risk. Since individual trees are produced sequentially, we continually tested their performance against a sampled hold out set and stopped adding to the ensemble when no improvement was shown. This early stopping technique is subject to previously mentioned autocorrelation risk when applied on small regions, but provides protection against overfitting when the sample set is sufficiently independent. Gradient boosting also typically employs much shallower trees than RFs, which is also helpful in avoiding overfitting. This parameter, however, has not been extensively tuned, and doing so may be beneficial. Figure 4. Sampled image patches from K-means clustering of the spectral projection. This clustering generated an unsupervised rough terrain classification that appears to group forested, mixture, and various unforested states into categories with a range of different average TCH values. These clusters were used as a proxy for the implicit categorical variable that partially drives the continuous TCH variable and were used to inform an inverse-frequency weighting scheme used during model fitting.

Software implementation
These clusters were highly unevenly sampled, meaning that statistical models can be expected to have high bias with respect to implicit categorical variables and their mixtures. Further, since terrain mixtures fill in the continuous output TCH space, the categorical bias may readily be hidden within regression results and can obscure evaluation of intra-class model performance. In combination with the issues associated with small spatial scale modeling, a terrain categorization and weighting scheme needs to be a part of a large-scale model, and such a scheme would likely need to be unsupervised in order to be practically applicable to large datasets. The combination of spectral projection, K-means clustering, and inverse frequency weighting represents a functional attempt at such a method.

Gradient Boosted Trees
Once low-dimensional features were generated, we augmented them with SRTM elevation and trained a regression using gradient boosted trees. In contrast to RF techniques, gradient boosted regression combines a sequentially trained series of high bias, low variance learners into an improved ensemble model [46]. As such, it is a bias reduction technique rather than a variance reduction technique, and not subject to the same degree of spatial autocorrelation risk. Since individual trees are produced sequentially, we continually tested their performance against a sampled hold out set and stopped adding to the ensemble when no improvement was shown. This early stopping technique is subject to previously mentioned autocorrelation risk when applied on small regions, but provides protection against overfitting when the sample set is sufficiently independent. Gradient boosting also typically employs much shallower trees than RFs, which is also helpful in avoiding overfitting. This parameter, however, has not been extensively tuned, and doing so may be beneficial.

Software Implementation
The entire workflow (Figure 2) was developed using Python and tested on a high performance computing environment on a node with 24 cores of X86 Intel Xeon E5-2680v4 2.4 Ghz processors with 128 GB DDR4 2400 memory. A source normalized mosaic (e.g., Planet Dove), an augment parameters layer (e.g., SRTM elevation), and a prediction target image (e.g., LiDAR TCH) are needed to train and use the model. The general steps to train and apply a model using our developed routines are: (i) data preparation; (ii) generating training data; (iii) fitting and saving a model; and (iv) applying the new model to new imagery. The software implementation is available as Supplementary Material.

The Performance of the Generalized Model for TCH Estimation
Investigations into estimating TCH in Peruvian tropical rainforests from textural information present in Planet Dove imagery have ultimately yielded a modeling framework and software toolchain that generalized well to country scale, as well as to new data. Using this software, our country-scale predictions of TCH achieved an out-of-sample R 2 of 0.65 and RMSE of 4.36 m, for an area over 128.5 million ha (Figures 5-7). These values were consistent with those during fitting of the algorithm, which resulted in R 2 of 0.66 and RMSE of 4.32 m. In addition, models were qualitatively validated against new Planet Dove scenes and found to produce reasonable results (not shown here). As far as we are aware, this is the first generalizable, country-scale model of TCH estimation. Due to intensive computational needs in deriving textural measures from Planet Dove, the analysis at 1 ha resolution for the entire country of Peru was executed in 30 h.
Remote Sens. 2020, 12, x FOR PEER REVIEW 9 of 17 The entire workflow (Figure 2) was developed using Python and tested on a high performance computing environment on a node with 24 cores of X86 Intel Xeon E5-2680v4 2.4 Ghz processors with 128 GB DDR4 2400 memory. A source normalized mosaic (e.g., Planet Dove), an augment parameters layer (e.g., SRTM elevation), and a prediction target image (e.g., LiDAR TCH) are needed to train and use the model. The general steps to train and apply a model using our developed routines are: (i) data preparation; (ii) generating training data; (iii) fitting and saving a model; and (iv) applying the new model to new imagery. The software implementation is available as Supplementary Material.

The performance of the generalized model for TCH estimation
Investigations into estimating TCH in Peruvian tropical rainforests from textural information present in Planet Dove imagery have ultimately yielded a modeling framework and software toolchain that generalized well to country scale, as well as to new data. Using this software, our country-scale predictions of TCH achieved an out-of-sample R 2 of 0.65 and RMSE of 4.36 m, for an area over 128.5 million ha (Figures 5-7). These values were consistent with those during fitting of the algorithm, which resulted in R 2 of 0.66 and RMSE of 4.32 m. In addition, models were qualitatively validated against new Planet Dove scenes and found to produce reasonable results (not shown here). As far as we are aware, this is the first generalizable, country-scale model of TCH estimation. Due to intensive computational needs in deriving textural measures from Planet Dove, the analysis at 1 ha resolution for the entire country of Peru was executed in 30 h. The country-scale spatial distribution of TCH is highly influenced by elevation, geological substrate, soil fertility, and climate [25], as seen in Figure 6. Our model was able to generalize well for the entire country, correctly detecting areas of intact forest, disturbed forest, and deforestations ( Figure 7). However, the model saturates for TCH values above 25 m, underestimating the values

The uncertainty of TCH estimation
In addition to the R 2 and RMSE values of the TCH estimation, we generated spatially-explicit estimates of absolute RMSE (m) and relative RMSE uncertainty (%) of the estimated TCH (Figure 8, Figure 9). This was done by grouping the TCH into 10 bins by natural breaks and computing the RMSE values for each bin. To create wall-to-wall uncertainty maps for the entire area of Peru, we fitted a polynomial function (R 2 = 0.94, p-value = 0.001) and a logarithmic function (R 2 = 0.62, p-value = 0.011) for the absolute estimated uncertainty (Figure 8a) and the relative estimated uncertainty (Figure 8b), respectively. The absolute RMSE (m) declines from around 5-6 m for estimated TCH of 0-20 m towards 2.5 m for TCH values of 20-25 m. The error then abruptly increases for the highest TCH values due to the saturation in model predictions (Figure 8a). In terms of relative RMSE The country-scale spatial distribution of TCH is highly influenced by elevation, geological substrate, soil fertility, and climate [25], as seen in Figure 6. Our model was able to generalize well for the entire country, correctly detecting areas of intact forest, disturbed forest, and deforestations ( Figure 7). However, the model saturates for TCH values above 25 m, underestimating the values higher than that ( Figure 5).

The Uncertainty of TCH Estimation
In addition to the R 2 and RMSE values of the TCH estimation, we generated spatially-explicit estimates of absolute RMSE (m) and relative RMSE uncertainty (%) of the estimated TCH (Figure 8, Figure 9). This was done by grouping the TCH into 10 bins by natural breaks and computing the RMSE values for each bin. To create wall-to-wall uncertainty maps for the entire area of Peru, we fitted a polynomial function (R 2 = 0.94, p-value = 0.001) and a logarithmic function (R 2 = 0.62, p-value = 0.011) for the absolute estimated uncertainty (Figure 8a) and the relative estimated uncertainty (Figure 8b), respectively. The absolute RMSE (m) declines from around 5-6 m for estimated TCH of 0-20 m towards 2.5 m for TCH values of 20-25 m. The error then abruptly increases for the highest TCH values due to the saturation in model predictions (Figure 8a). In terms of relative RMSE uncertainty, more than 50% error of the estimated TCH is characteristic for TCH values lower than 10 m. The error then lowers towards 10%-20% for estimated TCH values of 20-25 m (Figure 8b).
Remote Sens. 2020, 12, x FOR PEER REVIEW 12 of 17 uncertainty, more than 50% error of the estimated TCH is characteristic for TCH values lower than 10 m. The error then lowers towards 10%-20% for estimated TCH values of 20-25 m (Figure 8b). The spatial distribution of the absolute uncertainty shows that the higher errors (in m) correspond to the transition zones between lowland forest and Andean slopes, as well as to the wetlands and converted land use change areas (Figure 9a). Extremely sparse mountainous vegetation areas have high relative errors (>100%); however, the absolute errors for these areas are very low (<1 m). The areas with high TCH values have lower than 20% relative uncertainty, an important asset when the TCH estimations are further used to derive, for example, aboveground carbon density (Figure 9b). Overall, higher the estimated TCH, the lower the relative uncertainty, leading to high relative uncertainty for mountainous areas and low relative uncertainty for tropical forests.  uncertainty, more than 50% error of the estimated TCH is characteristic for TCH values lower than 10 m. The error then lowers towards 10%-20% for estimated TCH values of 20-25 m (Figure 8b). The spatial distribution of the absolute uncertainty shows that the higher errors (in m) correspond to the transition zones between lowland forest and Andean slopes, as well as to the wetlands and converted land use change areas (Figure 9a). Extremely sparse mountainous vegetation areas have high relative errors (>100%); however, the absolute errors for these areas are very low (<1 m). The areas with high TCH values have lower than 20% relative uncertainty, an important asset when the TCH estimations are further used to derive, for example, aboveground carbon density (Figure 9b). Overall, higher the estimated TCH, the lower the relative uncertainty, leading to high relative uncertainty for mountainous areas and low relative uncertainty for tropical forests.  The spatial distribution of the absolute uncertainty shows that the higher errors (in m) correspond to the transition zones between lowland forest and Andean slopes, as well as to the wetlands and converted land use change areas (Figure 9a). Extremely sparse mountainous vegetation areas have high relative errors (>100%); however, the absolute errors for these areas are very low (<1 m). The areas with high TCH values have lower than 20% relative uncertainty, an important asset when the TCH estimations are further used to derive, for example, aboveground carbon density (Figure 9b). Overall, higher the estimated TCH, the lower the relative uncertainty, leading to high relative uncertainty for mountainous areas and low relative uncertainty for tropical forests.

Discussion
Our proposed high-resolution Peru-wide estimate of TCH resulted in an R 2 of 0.65 and RMSE of 4.36 m, comparable to other approaches available at different extents and resolutions. For example, Lefsky [49] used 500 m MODIS data with Geoscience Laser Altimeter System (GLAS) height transects to create a global forest canopy height map with an RMSE of 5.9 m. Using MODIS and GLAS data, Wang et al. [50] used a balanced RF algorithm at 500 m resolution to estimate a global mean forest canopy height with RMSEs of 2.7-4.4 m. Simard et al. [51] estimated global forest vertical structure at 1 km resolution using GLAS and ancillary data with RMSE of 6.1 m. Hansen et al. [52] used Landsat 7 and 8 with GLAS data to map the tree height in Sub-Saharan Africa with an overall mean absolute error (MAE) of 2.45 m. Qi et al.
[53] estimated forest height by fusing TanDEM-X InSAR, simulated Global Ecosystem Dynamics Investigation (GEDI) canopy height, and digital terrain models to obtain an estimation with RMSEs of 4.0−6.7 m. In the context of the recently launched GEDI spaceborne LiDAR [10], the high spatial and temporal resolution of our approach will have significant contribution towards near-real-time monitoring of forest canopy height.

Model Performance
Parameter selection for model training is crucial for managing the model's prediction performance when working with data prone to artifacts [54]. The addition of extra embedding dimensions, extra clusters, or the modification of landmark and projection chunk parameters can have negative effects on model behavior, despite achieving improved apparent performance. For example, despite utilizing the techniques above, an alternate set of parameters allows the model to differentiate mosaic components and reflect that in predictions. These artifacts persist at the country scale but are less severe.
Models trained over smaller regions (tested up to 100 × 100 km), based on the Fourier features mentioned above, were able to produce strong apparent R 2 and RMSE values, but behaved poorly when transferring the model to other regions. These results were misleading for a variety of factors, including high dimensionality of input features, comparatively low complexity of the output TCH prediction, spatially localized imaging artifacts in source data, and ecologically relevant spatial autocorrelations.
Underlying data and environmental challenges that were identified and addressed during this work suggest that interpreting results and uncertainties in prior modeling efforts requires careful consideration. In addition, despite the relative generalization success of the presented method, it is still an open question as to what degree are structure parameters discernable from Planet Dove textures.

Generalization Challenges
Research towards regional application of FOTO features to estimate aboveground biomass in central African forests have identified two main challenges: (i) variation in sensor and lighting parameters, and (ii) forest structural diversity [21]. When working specifically with Planet Dove normalized mosaics taken over Peru, generalizing a TCH estimation model for tropical rainforests involves addressing imaging challenges such as: (i) lighting inconsistencies across mosaic elements (sensor variation, atmospheric variation, preprocessing variation), and (ii) resolution inconsistencies across mosaic elements (angle from nadir, different sensors).
To operate at a large geographic scale, a model needs to be able to handle some of the challenges of using data collected across a highly varied ecosystem. These include: (i) the presence of forested and non-forested areas of an unspecified, and perhaps unknown variety of classes that are not labeled in advance; (ii) the presence of mixtures of those classes within 100 m cells, which produce continuous changes in output TCH simply due to coverage fraction; and (iii) highly unbalanced training data with respect to those classes and mixtures.
Further, the presence of ecologically-derived spatial correlations can compromise our ability to intuit generalization performance using a sampled validation set, when working on a smaller region. In a related effect, the high degree of similarity between plots can lead to minimal variances between sample distributions of both input attributes and prediction parameters, as well as correlation between input attributes used for prediction. As a result, the mechanisms employed by ensemble methods designed as variance reduction strategies, such as bagging or column subsampling (as part of RFs), may be compromised, leading to misleading estimates of generalization behavior.

Imaging Artifacts
The presence of the two categories of imaging artifacts represents a problematic combination for our texture features. Avoiding learning from inconsistent lighting involves ignoring the large length scale properties of the image, which are captured by the low frequency region of the power spectra. At the same time, avoiding learning from resolution inconsistencies in the Planet Dove mosaic involves ignoring small length scale behavior, which is captured by the high frequency region. In addition, when working with a mosaic, imaging artifacts present in constituent pieces can lead them to be identifiable, which can provide complex model location information, and defeat the model's ability to generalize.

Environmental Challenges
The presence of challenges such as unspecified terrain types, climatic conditions, and regulatory limitations means that properly balanced and stratified sampling is nearly impossible to achieve. As a result, the application of machine learning algorithms without a mechanism for accounting for terrain types and mixtures via a land-use/land-cover (LULC) classification risks being biased towards the behavior of the oversampled class, even with ideally selected features. Decision tree-based machine learning techniques, in particular, are known to be sensitive to imbalanced training data [54]. While at smaller scales it may be possible to manually mask out forested areas for analysis, such efforts are impractical at a large geographic scale. Further, the presence of mixtures of terrain within grid cells of appropriate size for allometry may present a sizable question, because they do not properly represent a categorical variable and the allometric equations are not designed with mixtures in mind. In the future, a mixture modeling approach may be investigated to address this issue.

Limitations
Our proposed method is globally applicable and may be used to make predictions on new Planet Dove collections, enabling continuous monitoring of a region with imagery that has been prepared in the same way. To make predictions on new collections, images must be histogram-matched to the reference training mosaic. This is not ideal for many reasons. First, primary areas of interest are areas of change, where histogram matching is not very valid. Second, predictions are geographically constrained to the coverage of the training mosaic. Nevertheless, this technique has been implemented into the software package.
The time delay between the acquisition of LiDAR data (2011-2013) and the Planet Dove images (2018) may have influenced our results due to changes in land cover that occurred in this timeframe. The influence of this mismatch is minimized by the vast amount of data and by using an ensemble classifier that produces a number of decision trees by randomly selecting subsets of training samples and variables and is less sensitive to outliers. This influence is also minimized by using a resolution of 1 ha for our analysis.

Conclusions
We presented the technical and scientific challenges involved in the generation of reliable mechanisms for estimating TCH from Planet Dove satellite image spectral and textural features. We focused our methods on spatial texture analysis with modified FOTO textures and created a machine learning regression model using gradient boosted regression trees. Our developed software toolchain is a robust and generalizable regression model that provides an RMSE of 4.36 m in Peru. This represents a helpful advancement towards better monitoring of tropical forests, which can be further used for informing environmental policies and conservation actions.