Extracting Impervious Surface from Aerial Imagery Using Semi ‐ Automatic Sampling and Spectral Stability

: The quantification of impervious surface through remote sensing provides critical information for urban planning and environmental management. The acquisition of quality reference data and the selection of effective predictor variables are two factors that contribute to the low accuracies of impervious surface in urban remote sensing. A hybrid method was developed to improve the extraction of impervious surface from high ‐ resolution aerial imagery. This method integrates ancillary datasets from OpenStreetMap, National Wetland Inventory, and National Cropland Data to generate training and validation samples in a semi ‐ automatic manner, significantly reducing the effort of visual interpretation and manual labeling. Satellite ‐ derived surface reflectance stability is incorporated to improve the separation of impervious surface from other land cover classes. This method was applied to 1 ‐ m National Agriculture Imagery Program (NAIP) imagery of three sites with different levels of land development and data availability. Results indicate improved extractions of impervious surface with user’s accuracies ranging from 69% to 90% and producer’s accuracies from 88% to 95%. The results were compared to the 30 ‐ m percent impervious surface data of the National Land Cover Database, demonstrating the potential of this method to validate and complement satellite ‐ derived medium ‐ resolution datasets of urban land cover and land use. signature of impervious surface. This method was applied to three Texas sites with different levels of land development and availability of ancillary data. Results indicate satisfactory separation of impervious surface from barren land, vegetation, and water, with user’s accuracies ranging from 69% to 90% and producer’s accuracies from 88% to 95%. This method was able to significantly reduce the effort of visual interpretation and manual labeling in the generation of quality samples for training and validation. Including measures of spectral stability as predictor variables led to improved classification performance. The comparison of our results to the NLCD percent impervious cover product at the 30 ‐ m level indicates the value of this method as an efficient, comprehensive means to validate and complement satellite ‐ derived medium ‐ resolution datasets of urban land cover and land use at regional or national scales.


Introduction
Over half the Earth's population now resides in cities [1]. Urbanization inevitably alters hydrologic processes, energy balance, and biological composition, resulting in higher nutrient loads, elevated surface temperature, increased peak flow, and accelerated habitat degradation in many urban areas and watersheds [2][3][4]. A characteristic indicator of urbanization is the increase of impervious surface, a unique land cover class that involves paved roads, sidewalks, parking lots, buildings, and other built structures, through which precipitation does not readily infiltrate into the underlying soil [5]. Understanding the spatiotemporal pattern of impervious surface has important implications for many urban studies on stormwater, heat islands, water quality, ecosystem function, population growth, and community resilience [6][7][8][9][10][11]. The quantification of impervious surface provides critical information to city managers and researchers for a range of issues in urban planning and environmental management [5].
Remote sensing has been used for decades to monitor land cover change and map the growth of impervious surface. This has led to the generation of important databases at national or global scales, such as the National Land Cover Database (NLCD) in the United States [12,13], Coordination of Information on the Environment (CORNIE) land cover inventory in Europe [14], Finer Resolution Observation and Monitoring-Global Land Cover (FROM-GLC) database in China [15], and MODIS Land Cover Type Yearly Global dataset [16]. As these databases are derived from medium-resolution satellite imagery with pixel size ranging from 30 m to 500 m, they have to use multiple urban land cover classes (i.e., low-and high-intensity developed areas) to reflect different levels of impervious surface cover at the scale of a satellite image pixel. High-resolution datasets of impervious surface are not available over large spatial extents, but some studies have utilized meter-or submeter-level aerial imagery to characterize urban landscape. For example, object-based feature extraction and regression tree techniques were applied to National Agriculture Imagery Program (NAIP) orthophotography in Minnesota and led to an overall accuracy of 74% [17]. An algorithm of multiple agent segmentation was applied to analyze NAIP imagery of Rhode Island [18]. More recently, random forest (RF) models were combined with geographic object-based image analysis to generate a 1-m land cover dataset of West Virginia with an overall accuracy of 96.7% [19].
However, reported accuracies for impervious surface were often lower than the accuracies of other land cover classes in supervised classification. An important reason is the acquisition of sufficient and reliable samples that are needed to train a classification model. Studies have shown that increasing the size of training data generally improves classification performance and could have a greater influence than the classification model [19,20]. Collecting a large number of quality samples by field surveys or screen digitalization is very complex, costly, and time consuming [21]. This is exacerbated by the acquisition of independent samples for validation, as using the same data for training and validation could result in optimistically biased accuracy assessments [22]. Another influential factor on the extraction of impervious surface is the selection of variables. Including predictor variables in addition to original image bands generally improves classification performance. These additional predictors include a variety of spectral, textural, and feature geometric indices, derived from the targeted image and sometimes with the support of ancillary datasets [2,19,[23][24][25][26][27][28][29]. These measures enhance the signatures of impervious surface across various spatial scales, but they lack the information about the unique temporal behaviors of impervious surface. Although some predictor variables of temporal changes (e.g., phenological features and time series) have been used in studies on vegetation classification [30][31][32], such variables have been rarely included in the extraction of impervious surface.
The goal of this study was to develop a hybrid method for extracting impervious surface from 1-m NAIP imagery. Reference data for training and validation were generated in a semi-automatic scheme that integrated multiple ancillary datasets to minimize human edits. Classification was also improved by analyzing temporal stability of surface reflectance based on a time series of Landsat imagery. We demonstrated this method at three sites in South Texas to address different levels of development intensity and data availability. Furthermore, after the accuracy assessment, the 1-m results of impervious surface were upscaled to 30-m resolution for a comparison to the NLCD percent impervious surface product, demonstrating the potential value of our method to validate and complement this important national land cover database.

Study Area and Data
The study area is located in the Corpus Christi-Kingsville metropolitan region in South Texas, United States (Figure 1). This region is a flat coastal plain that faces the Gulf of Mexico. Elevation varies from sea level to 70 m with an average slope of 0.5%. Soils are mainly Victorian clay and Orelia fine sandy loam. According to 2016 NLCD data, the major land cover classes in this area include developed (26%), open water (24%), cultivated crops (15%), hay and pasture (10%), and shrub and scrub (8%). This area has a humid subtropical climate with an average annual precipitation of 805 mm and an average annual temperature of 22.3 °C.
Three sites were selected for this study (Figure 1). Site 1 was in the south region of Corpus Christi, the urban core with a population of 325,733 in 2016. The landscape of this site was dominated by developed land and large areas of coastal water. Site 2 covered the majority of Kingsville, a satellite city with a population of 25,714 in 2016 and surrounded by agricultural land. Site 3 was an industrial zone in north Corpus Christi with a mixture neighborhood of farmland and wetland. Thus, the three sites reflected a gradient of land development: high (Site 1), medium (Site 2), and low (Site 3). They also addressed different conditions of barren land, ranging from limited mixed barren land (Site 1) to abundant single-type barren land (Site 3) and abundant mixed barren land (Site 2). NAIP imagery of these sites consisted of three 1-m multispectral (blue, green, red, and near-infrared) images acquired in May 2016.
Several ancillary datasets were used in this study. Vector datasets of building and roads were downloaded from the database of the OpenStreetMap (OSM) project (www.openstreetmap.org). Built on the concept of volunteered geographic information, the OSM project relies on volunteers to collect information of various built features, collates them on a central database, and distributes free datasets with sound quality [33]. Vector datasets of freshwater and estuarine wetland boundaries were extracted from the National Wetland Inventory (NWI) database of the United States Fish and Wildlife Service. Created through a hierarchical classification framework that utilizes aerial photography, NWI maps are recognized as the most comprehensive wetland maps in the United States [34]. National cropland data were accessed through Google Earth Engine (GEE). This robust raster dataset is created annually by the National Agricultural Statistics Service (NASS) of the United States Department of Agriculture using moderate resolution satellite imagery and extensive agricultural ground truth [35].
We also directly used satellite data, including 9 Landsat-5 TM images, 41 Landsat-7 ETM+ images, and 25 Landsat-8 OLI images, acquired from 2011 and 2016 in the path-26/row-41 scene of the Landsat World Reference System 2. The Landsat data were Level-2 30-m surface reflectance of six bands (blue, green, red, near-infrared, and two shortwave infrared bands). GEE was used to access and process all remote sensing data in this study.

Sample Generation
We developed a semi-automatic approach to efficiently generate quality training samples for land cover classification ( Figure 2). This method uses multiple ancillary datasets to improve the identification of representative pixels for impervious surface, vegetation, water, and barren land. First, training samples for impervious surface were generated using vector data of buildings and roads extracted from OSM ( Figure 3). Building samples were created by buffering building polygons inward by three meters and then calculating the centroids of the buffered polygons. Using the centroids instead of taking random positions within the building polygons reduced the potential disturbance of vegetation around roof edges. Road samples were generated using the vertices of road polylines in OSM. This comprehensive road dataset included a variety of motorway, primary, secondary, residential, and service roads, as well as the lanes in many commercial parking lots. NAIP pixels at the centroids of building polygons and the vertices of road polylines were extracted to establish the pool of samples for impervious surface. The size of this pool would vary with the local availability of OpenStreetMap data. The three sites in this study reflected a gradient of OSM data availability: abundant (Site 1), medium (Site 2), and scarce (Site 3). OpenStreetMap vector data.
Second, a sampling pool of barren land was established to represent both barren wetland ( Figure  4) and barren farmland ( Figure 5). Barren wetland consists of tidal flats, sand beaches, and sand dunes that are adjacent to natural water bodies. They were first roughly delineated on the NAIP image by overlaying polygons of estuarine and freshwater wetlands extracted from the National Wetland Inventory. Then, barren wetland was identified using masks of NDVI < 0 and NDWI < 0.1, calculated using the green, red, and near-infrared bands of the NAIP image. Barren farmland mainly consists of bare soils in arable lands that are left fallow. Arable lands were first identified using the cultivated areas of the NASS Cropland Data. Then, a threshold of NDVI < 0 was applied to identify pixels without crop cover. We did not attempt to differentiate all non-vegetation patches through NDVI thresholding. The purpose of using an NDVI threshold was to identify representative areas with sufficiently low values of NDVI, which were highly unlikely to be vegetated areas. Samples were randomly generated within these identified areas of barren land. Third, sampling pools of water and vegetation were established using thresholds of NDVI and NDWI, respectively. The thresholds were calculated based on NDVI and NDWI values of impervious surface samples within the same NAIP image: where is the threshold of a normalized difference index (i.e., NDVI or NDWI); , and , denote the mean and standard deviation of this index, respectively, calculated from its values of all impervious surface samples; and is a coefficient with values of 1.0 and 2.0 for NDVI and NDWI, respectively. Pixels with values above the calculated NDVI and NDWI thresholds were determined as vegetation and water samples, respectively. Last, 600 samples were randomly selected from the pool of impervious surface and 300 samples for each of the other classes (vegetation, water, and barren land), leading to a training dataset of 1500 samples for the targeted NAIP image. Using a large sample size for impervious surface was aligned with the emphasis of this study on separating impervious surface from other land cover classes. If a pool did not contain sufficient samples, all available samples were included. Given the 1-m resolution of NAIP imagery, it was reasonable to assign a single land cover class to each pixel, so we did not consider any mixed land cover classes.
A modified scheme was developed to generate a separate dataset of samples for validation. First, a group of 1500 pixels were randomly selected across the scene of the NAIP image. Second, pixels located within inward-buffered building polygons or outward-buffered road polygons were labeled as impervious surface. Third, pixels with NDVI and NDWI values exceeding the above thresholds (Equation (1)) were labeled as vegetation and water, respectively. Finally, unlabeled points were visually inspected to determine its land cover. In doing so, the effort of manual editing in generating validation samples was significantly reduced.

Spectral Stability
Spectral stability (in %) describes the variability (standard deviation ) of a pixel reflectance relative to the average reflectance over the whole time series [36]: This index was calculated using a six-year time series of Landsat imagery that covered the domain of the targeted NAIP image. The time series consisted of Level 2 surface reflectance products from Landsat 5 TM, Landsat 7 ETM+, and Landsat 8 OLI sensors. They are high-level Landsat products generated by the United States Geological Survey (USGS) to eliminate the need for routine reprocessing efforts of geometric, radiometric, and atmospheric corrections [37]. Landsat images with cloud and cloud shadow cover over 50% were excluded. Over the selected satellite images, spectral stability was calculated for each of six bands: Blue, green, red, near-infrared, shortwave infrared 1, and shortwave infrared 2. Details of band characteristics are available on the website of USGS (www.usgs.gov). The generated 30-m spectral stability images were resampled and clipped to align with the 1-m resolution and domain of the targeted NAIP image.

Classification and Accuracy Assessment
Three types of predictor variables were used in classification: (i) four original NAIP brightness bands (red, green, blue, and near-infrared); (ii) two multispectral indices (NDVI and NDWI) calculated using NAIP brightness bands; and (iii) six bands of spectral stability calculated using the time series of Landsat surface reflectance. This led to a 12-band, 1-m composite image for classification. For each site, a random forest model of 100 trees was trained on GEE using the 1500sample training dataset and was then applied to the whole NAIP image. Accuracy was assessed using overall accuracy, the kappa statistics, and the class user's and producer's accuracies based on the independent 1500-sample validation dataset.
After the 1-m classification results were fully validated, we further compared the results to the NLCD imperviousness product that represents impervious surface as a percentage of developed surface over every 30-m pixel. We first converted our result into a 1-m binary impervious/nonimpervious layer. Then, the percent impervious surface was calculated by dividing the number of 1m impervious pixels by 900 within each of 30-m grids that were aligned with NLCD pixels.

Classification Results
For Sites 1 to 3 (Tables 1-3), overall accuracies were 96%, 92%, and 91%, respectively, and kappa values were 0.94, 0.86, and 0.85, respectively. As expected, user's and producer's accuracies were higher for vegetation and water relative to the accuracies for impervious surface and barren land. Producer's accuracies varied from 88% to 95% for impervious surface and from 60% to 84% for barren land. User's accuracies were more variable, ranging from 69% to 90% for impervious surface and 57% to 99% for barren land. Confusion associated with impervious surface was a primary area of disagreement, accounting for 92%, 63%, and 54% of the classification errors at Sites 1 to 3, respectively. In particular, the classification errors were driven by the confusion between impervious surface and barren land, accounting for over 40% of the classification errors at all sites. As shown in Figure 6, it mainly included the misidentification of impervious surface from various natural bare ground along shorelines (Sites 1 and 3), unpaved rural roads (Sites 2 and 3), and the edges of cropland patches (Site 3). Most of these areas represent a transition zone between a homogeneous landscape feature and its neighboring features that are smaller and more fragmented (i.e., estuary water versus narrow beach zones, and cropland versus rural roads). The resolution mismatch between 1-m NAIP imagery and 30-Landsat imagery was evident in such areas and had an apparent impact on the spectral characteristics of training samples. Another noticeable component of disagreement was the confusion between vegetation and impervious surface in residual areas with low or medium development intensity, accounting for ~25% of the classification errors.

Selection of Predictor Variables
As shown in Figure 7, including NAIP-derived multispectral indices (i.e., NDVI and NDWI) and Landsat-derived spectral stability led to improved classification accuracies at all three sites, compared to the performance of only using NAIP four-band brightness. Improvements on the identification of impervious surface and barren land were particularly evident. For example, at Site 1, the user's accuracy of the barren land improved 20% and the producer's accuracy of impervious surface improved 10%. At Site 3, the user's accuracy of the barren land improved 5% and the producer's accuracy of impervious surface improved 13%. This had an important impact on the overall agreement, as the confusion between impervious surface and barren land was the dominant source of classification errors. The results also indicate that including both multispectral indices and spectral stability was more effective than using only one of them. For example, at Site 1, adding multispectral indices alone was better than adding spectral stability alone, but including both further improved the performance. Figure 7. Effect of predictor variables on classification accuracies. In the legend, B, MI, and TS denote brightness, multispectral indices, and temporal spectral stability, respectively. In the y-axis labels of subplots, OA, UA, and PA denote overall accuracy, user's accuracy, and producer's accuracy, respectively.

Training and Validation Samples
The semi-automatic sampling approach resulted in large training and validation datasets for each site (Figure 8). There were 600, 300, 300, and 300 training samples for impervious surface, water, vegetation, and barren land, respectively. This led to a total of 1500 pixels per NAIP image, i.e., one sample per 35,000 pixels or a high sampling fraction of 0.00286. As expected, the spatial distributions of training samples per class and overall were not homogeneous, as the locations of these samples were confined to scene-specific availability of ancillary data (e.g., OSM building polygons and road polylines) and overall land cover pattern. The validation dataset also consisted of 1500 pixels for each site. Because these pixels were selected using random sampling, the fractions of different land cover categories in the validation dataset reflected the actual land cover composition of the site. The size of validation dataset was deemed sufficient to represent minor land cover categories, e.g., barren ground at Site 1 (2.5%) and water at Site 3 (4.3%). The semi-automatic sampling approach significantly reduced the manual effort of assigning labels to individual pixels. In this study, manual labeling was only needed for 315, 570, and 615 pixels of the validation dataset for Sites 1 to 3, respectively, or on average 17% of the whole 3000-pixel reference dataset per site. To ensure the quality of the reference datasets, visual inspection was performed on all samples that were generated automatically from ancillary data in this study, and no correction was needed for any of them. Although the semi-automatic sampling approach could generate even larger reference datasets, using more samples did not always lead to enhanced agreement (Figure 9). The performance based on 50 samples per class was quite close to the performance of using 400 samples per class. This indicates that the quality of samples plays a more important role than the quantity of samples in training the random forest models of this study. This is generally consistent with the findings from other studies on the insensitivity of random forest models to the size of training data [19,38].

NDVI and NDWI Thresholds
The generation of vegetation and water samples relied on the thresholds of NDVI and NDWI, respectively. As shown in Equation (1), the thresholds were determined by: i) The statistics of index values, calculated from the pool of impervious samples; and ii) , the predefined multiplicative coefficient. Figures 10 and 11 show the responses of classification accuracies to different values of , ranging from 0 to 3.0. The increase of for NDVI generally led to decreased producer's accuracy and increased user's accuracy ( Figure 10). This pattern was more evident for impervious surface than barren land. The default value of 1.0 was associated with high values of overall accuracy and kappa coefficient. In comparison, the change of for NDWI appeared to have limited effect on classification (Figure 11). Exceptions were accuracies of impervious surface at Site 2, where the increased NDWI threshold tended to improve producer's accuracy and reduce user's accuracy. The default value of 2.0 for NDWI thresholds yielded good results at all sites.   Figure 12 shows the comparison between the NLCD percent impervious surface and the NAIPderived percent impervious surface from this study. The NLCD dataset was produced using regressions tree models with predictor variables from Landsat images and ancillary data of nighttime lights and road networks [12,13]. Overall, the two datasets agreed quite well for all three sites, with R 2 = 0.87, 0.69, and 0.68, respectively. The best agreement occurred at Site 1 due to the high intensity of development and the easy identification of open water. Visual inspection using the original NAIP images (Figure 6a-c) indicates the overestimation of the NLCD data in residential areas mainly due to the misclassification of urban green space as developed land. The NLCD data had a better performance than our results in identifying barren wetland, leading to fewer low-intensity developed areas along shorelines. The confusion with barren land contributed to the overestimation of lowdensity impervious areas at Site 2 ( Figure 10b) and Site 3 (Figure 10c) in our results. Nevertheless, our results suggest a smaller fraction of developed land with medium intensity (i.e., 50%-80% impervious surface cover) compared to the NLCD estimate (Figure 10g-i). This is consistent with the low producer's accuracy of this class reported by Yang et al. [12].

Comparison to NLCD data
. Figure 12. Estimates of percent impervious surface by this study (a-c) and the NLCD product (d-f). Subplots (g-i) compare the histograms of the two estimates.

Discussion
Our method addressed two important issues in mapping impervious surface: reference data and selection of predictor variables. Obtaining high-quality reference data is one of the most important factors in achieving an accurate classification and accounts for a major component of time and resource demands [19]. Our method was able to significantly reduce the effort of visual interpretation and manual labeling in the generation of quality samples for training and validation (Figure 8). In the era of big data, the fast expansion of satellite and aerial imagery resources is outpacing the capacity of conventional effort of collecting ground truth through field surveys or on-screen digitalization. Our results suggest that semi-automatic sampling approaches have the potential to fill this gap through efficient uses of existing ancillary data with proven accuracy. The benefit of ancillary data also contributed to the improved inclusion of predictor variables (Figure 7). Through extending variables into the temporal perspective, classification models were able to separate landscape features that were spectrally indifferentiable in a snapshot. This is consistent with findings on the positive effect of including temporal measures in classification [39]. Our method echoes the concept of pseudo-invariant features (PIV) [40][41][42] in radiometric calibration and provides an example of integrating satellite and aerial imagery to enhance the extraction of impervious surface.
The proposed method could contribute to the improved production and assessment of NLCD products and other satellite-derived national land cover data that have played a critical role in a variety of environmental and socioeconomic studies [9,12,13,43]. The landmark assessment by Wickham et al. in 2013 [13] found that userʹs accuracies for developed classes in NLCD products increased as the level of urbanization (i.e., percent impervious surface) increased. However, a more recent assessment on NLCD products of 2001-2016 [12] reported an opposite pattern, i.e., increased percent impervious surface was associated with reduced user's accuracies. Although the overall accuracies in the former study (0.78-0.79) were comparable to that of the later study (0.88-0.90), their conflicting findings in the user's accuracies of developed land classes indicate a need to improve the consistence of reference data and validation procedure. The agreement of urban classes in NLCD products is generally lower than other land cover classes. A separated assessment on the NLCD imperviousness product has been highly recommended [12]. Wickham et al. [13] suggested that the ideal assessment of NLCD products would require estimation of percent impervious surface for every 30-m pixel based on NAIP imagery, but they had to use a more limited, indirect method because the cost to obtain such ideal reference data was prohibitive. Although a national assessment on the NLCD imperviousness product has not been reported, a recent pilot study in the Chesapeake Bay region has demonstrated the promising value of using NAIP imagery as reference data [44]. The demand of substantial human intervention in generating pixel-level, full-coverage reference data for impervious surface is a major obstacle for the continuous development and validation of NLCD products and similar medium-resolution satellite-derived datasets in other countries.
As indicated in Figure 12, our method could provide an alternative means to verify NLCD products. The quality of samples was ensured by the proven high quality of ancillary datasets. NWI data has already been used as the proxy of ground truth for the validation of satellite-derived land cover products [12], and studies have indicated the trend of incorporating volunteered geographic information data in environmental remote sensing [45][46][47][48]. As our approach is built on GEE and uses publicly available ancillary datasets, it can be easily applied to the validation of NLCD products for other regions. This method provides an efficient, comprehensive, and transferrable way to evaluate the quality of NLCD land cover and imperviousness products. It is feasible to plan for a full validation of NLCD products at the national level using this method, but that would require the technical support of GEE to upgrade the storage and computation capacity that are available for regular users. Those issues are irrelevant to our methodology and are thus beyond of the scope of this paper.

Conclusions
A hybrid remote sensing method was developed to improve the extraction of impervious surface from aerial imagery. This method took advantage of multiple ancillary datasets to accelerate the acquisition of reference data and incorporated satellite-derived spectral stability to enhance the signature of impervious surface. This method was applied to three Texas sites with different levels of land development and availability of ancillary data. Results indicate satisfactory separation of impervious surface from barren land, vegetation, and water, with user's accuracies ranging from 69% to 90% and producer's accuracies from 88% to 95%. This method was able to significantly reduce the effort of visual interpretation and manual labeling in the generation of quality samples for training and validation. Including measures of spectral stability as predictor variables led to improved classification performance. The comparison of our results to the NLCD percent impervious cover product at the 30-m level indicates the value of this method as an efficient, comprehensive means to validate and complement satellite-derived medium-resolution datasets of urban land cover and land use at regional or national scales.
Author Contributions: H.Z., S.G. and P.Z. conceptualized the methodology, designed the study and collected the data; H.Z. performed the validation and prepared the original draft; H.Z., S.G. and P.Z. reviewed and edited the draft. All authors read and approved the submitted manuscript. All authors have read and agreed to the published version of the manuscript.