Deep Learning in Archaeological Remote Sensing: Automated Qanat Detection in the Kurdistan Region of Iraq

: In this paper, we report the results of our work on automated detection of qanat shafts on the Cold War ‐ era CORONA Satellite Imagery. The increasing quantity of air and space ‐ borne imagery available to archaeologists and the advances in computational science have created an emerging interest in automated archaeological detection. Traditional pattern recognition methods proved to have limited applicability for archaeological prospection, for a variety of reasons, including a high rate of false positives. Since 2012, however, a breakthrough has been made in the field of image recognition through deep learning. We have tested the application of deep convolutional neural networks (CNNs) for automated remote sensing detection of archaeological features. Our case study is the qanat systems of the Erbil Plain in the Kurdistan Region of Iraq. The signature of the underground qanat systems on the remote sensing data are the semi ‐ circular openings of their vertical shafts. We choose to focus on qanat shafts because they are promising targets for pattern recognition and because the richness and the extent of the qanat landscapes cannot be properly captured across vast territories without automated techniques. Our project is the first effort to use automated techniques on historic satellite imagery that takes advantage of neither the spectral imagery resolution nor very high (sub ‐ meter) spatial resolution.


Introduction
Remote sensing of air-borne and satellite imagery has become a key element of archaeological research and cultural resource management. Combined with field research, remote sensing permits rapid and high-resolution documentation of the ancient landscapes [1][2][3][4]. It also allows for the documentation and monitoring of ancient landscapes that are inaccessible for fieldwork, threatened, or permanently destroyed [5][6][7][8]. The volume and variety of air and space-borne imagery is ever increasing, which is, in theory, of great potential for the field. In practice, however, this "data deluge" is still handled by traditional methods of visual inspection and manual marking of the potential archaeological features. On one hand, researchers cannot handle the quantity of the available data. On the other hand, traditional methods will not do justice to the increasing spectral quality of new data because the researchers use the same method, i.e., a visual judgment based on one's experience, to all categories of imagery and at best can afford to experiment with a small subset of the spectral the applicability of CNNs for archaeological prospection by using them to detect the archaeological remains of qanat systems on the Cold War-era CORONA satellite imagery.
A qanat, also known by other names in different countries, is an underground water supply technology composed of horizontal tunnels and a series of vertical shafts [55]. It collects water from an underground water source, usually an aquifer, and transports it underground to the zone of consumption, which might be tens of meters to hundreds of kilometers away from the source ( Figures  1,2). There is much debate about the origin and the history of diffusion or invention of this water supply technology [56][57][58][59]. What is certain is that by the first centuries of the second millennium CE, rural and urban settlements in a wide range of arid and semi-arid regions of Africa and Asia relied on a variant of the qanat technology to thrive. Therefore, it is important for many archaeological research projects as well as cultural heritage preservation projects in many countries to properly document qanat landscapes. For the proper mapping of qanat systems, individual shafts that represent the course and complexity of the underground system need to be documented individually ( Figure 3). However, mapping qanats at shaft-level is most suitable for small systems and small research areas [60]. Large qanat systems are usually mapped either as a point representing its outlet or as a line representing its approximate path from the source to the outlet. The complexity of qanat systems is inevitably lost when mapping is done only at a high-level identification of overall shaft path or outlet ( Figure 4). Recently, Luo et al. [43] experimented with the automation of qanat shaft detection through traditional, template-matching, computer vision, but the model's best performance is only achieved in ideal landscapes, where non-qanat features are absent and feature preservation is good. We anticipated that CNNs would have good performance in the detection of qanat shafts. Circular features have proven to yield relatively good results in automated archaeological prospection, in traditional as well as deep learning methods [9,10,12,16,40,[45][46][47]61]. Furthermore, unlike most archaeological objects, expensive and time-consuming fieldwork is not crucial for sample data collection and model performance validation. Qanat shafts, especially when well-preserved, are generally easy to identify in any imagery. The shafts are originally doughnut-shaped; the spoil heap from the excavation and maintenance of the shaft creates a ring around the central hole. This circular shape is altered in a variety of ways as soon as the underlying tunnel is out of use, but the shafts maintain a (semi)-circular shape until they disappear. In addition, qanat shafts are found in linear groups, a pattern that is rarely found in nature ( Figure 3). Our goal was to test the hypothesis that deep learning can be used for automated detection of archaeological features with relatively uniform signature such as qanat shafts. We aimed at assessing the presence or absence of qanat shafts, not attributes such as shaft size or shape.
Experiments with automated object detection for archaeological prospection have previously been limited to a subset of projects that work with sub-meter and multispectral imagery. This was originally inevitable because traditional algorithms relied heavily on the spectral and high spatial resolution of expensive commercial imagery [11,12,17,22,47,54]. Our second goal was to assess the performance of deep learning in the automated detection on publicly available historic imagery. CORONA satellite imagery is panchromatic with approximately 1.8 m spatial resolution. CORONA is available to all projects and individuals at no or little cost. While the decreasing cost of commercial imagery will make high-resolution imagery more widely available in archaeology, a core area of remote sensing is and will be based on the use of panchromatic low-resolution historic imagery. This is especially the case in developing countries where landscapes are often heavily damaged or destroyed by development projects and thus not captured in the modern imagery [62]. We used CNNs to detect qanat shafts within the area of the Erbil Plain Archaeological Survey.

Case Study
The case study is the area covered by the Erbil Plain Archaeological Survey (EPAS) [63]. EPAS encompasses 3200 km 2 in the center of the Erbil Governorate of the Kurdistan Region of Iraq ( Figure  5). Qanat technology is used in the Kurdistan Region of Iraq, known locally as karez. Karez of Northern Iraq are one of the least-studied qanat systems of the Middle East. The only systematic study of the qanat of Northern Iraq was conducted by Dale Lightfoot under the auspices of UNESCO [64]. He recorded 683 qanat systems, using historic cadastral maps and field observation. One of the largest concentrations of qanat systems is in the EPAS area. Fifty-one qanat systems in the Lightfoot database are within EPAS area. Although the Lightfoot study used cadastral maps, these maps only document the qanat that were functional or recently abandoned in the mid-20th century. Since 2018, EPAS has been investigating the historic qanat systems of the Erbil Plain. The landscape of the plain is heavily altered by anthropological processes, mainly development. Only a small portion, between 5% to 10%, of the shafts visible on the CORONA imagery are preserved and are captured in modern imagery ( Figure 6). Declassified Cold War era CORONA imagery were acquired between 1960-1978 and captured the archaeological landscape right before it was heavily altered. In order to document and study the complexity of this qanat landscape, we created a shaftlevel map through visual inspection of historic CORONA satellite imagery and manual marking of shafts in a GIS database. Our database contains around 12,000 shafts, some of which belong to the 51 qanat documented earlier and others which were previously undocumented ( Figure 4). On the CORONA imagery, the shaft diameter usually ranges between 15 and 25 m, although shafts as small as 10 m and as big as 35 m are found. The process of manual mapping is, however, extremely slow and time consuming. In order to reproduce the depth of this study at a larger scale, we decided to train a deep learning model for automated qanat shaft detection on CORONA imagery that are available for most areas in the Middle East.

Satellite Data
The imagery used for training is the declassified US intelligence imagery from the CORONA program. CORONA acquired photographs from 1960 to 1972 to monitor Soviet missile strength and was declassified in 1995 [65]. Since 1998, archaeologists working in the Middle East have increasingly exploited CORONA imagery because it predates a great deal of landscape destruction by development. In addition, the spatial resolution of the CORONA imagery is high enough to allow for identification of many archaeological features [66]. CORONA imagery can be previewed and ordered on the USGS website (http://earthexplorer.usgs.gov). We downloaded and used the imagery KH-4B (DS1039-2088DA036-039) with a resolution of 1.8 m, acquired in the winter of 1968, when the soil moisture was optimal for remote sensing identification of archaeological landscape features (Table  1).

Deep Learning Workflow
A generic machine learning workflow consists of data gathering, data pre-processing, model development, and model evaluation. In data gathering and pre-processing, an annotated data set is created based on ground truth data or expert judgment. Data pre-processing also included exploratory data analysis (EDA) for identification of missing or erroneous annotation cases. To increase training data and to avoid overfitting, data augmentation can be used. Data augmentation generates more training data from existing training samples by random transformations such as image rotation and flipping. Before model development, an evaluation protocol needs to be determined. Once a model is trained, performance metrics are reported based on the evaluation protocol.

Annotation
Annotation was done by one of the authors who is specialized in remote sensing of landscape irrigation features. For data labeling, we used 3D Slicer, an opensource software platform widely used in medical image processing and annotation [67]. The spatial resolution of the imagery is not enough to create enough padding around each qanat shaft. Therefore, many adjacent qanat shafts were connected in the annotation process, creating clusters of qanat with erroneous variation in size and shape. After the first round of annotation, we conducted exploratory data analysis (EDA) to identify and edit overlapping labels and connected components that would affect the training model performance (Figure 7). Data was labeled for 11 patches that were representative of the diversity of the training contexts. The patches were chosen based on the range of the natural and anthropogenic landscape elements of the Erbil Plain as well as the density and clarity of the qanat shafts (Table 1, Figure S1-S22).

Convolutional Neural Networks
In this work, a binary classification model based on CNNs is proposed for qanat feature segmentation in CORONA images. Solving an object detection problem with a segmentation framework has been done previously using similar architectures [68,69]. The deep network architecture is composed of sequential convolutional layers ∈ 1, . At each convolutional layer , the input feature map (image) is convolved by a set of kernels , . . . , and biases , . . . , to generate a new feature map. A non-linear activation function is then applied to this feature map to generate the output which is the input for the next layer. The th feature map of the output of the layer can be expressed by: Figure 7. During data pre-processing, we conducted exploratory data analysis (EDA) to verify that connected components (individual segments) are of reasonable size. Random distinguishing colors were assigned to annotated cases so that outliers and connected components (in the cases where shaft labels overlapped) can be quickly identified. We used this size insight in the post-processing phase to remove some noise. (Note that the left image is flipped vertically in the course of the data augmentation process.) The concatenation of the feature maps at each layer provides a combination of patterns to the network, which become increasingly complex for deeper layers. Training of the CNN is usually done with several iterations of stochastic gradient descent (SGD), in which samples of training data (a batch) are processed by the network. At each iteration, based on the calculated loss, the network parameters (kernel weights and biases) are optimized by SGD in order to decrease the loss. Fully convolutional neural networks (FCNs) have been successfully applied in segmentation problems [70,71]. The use of FCNs for image segmentation allows for end-to-end learning, with each pixel of the input image being mapped by the FCN to the output segmentation map. This class of neural networks has shown great success for the task of semantic segmentation. During training, the FCN aims to learn representations based on local information. Although patch-based methods have shown promising results in segmentation tasks, FCNs have the advantage of reducing the computational overhead of sliding-window-based computation. The CNN proposed in this paper is a 2D FCN. FCNs for segmentation consist of an encoder (contracting) path and a decoder (expanding) path. The encoder path consists of repeated convolutional layers followed by activation functions with maxpooling layers on selected feature maps. The encoder path decreases the resolution of the feature maps by computing the maximum of small patches of units of the feature maps. However, good resolution is important for accurate segmentation and therefore up-sampling is performed in the decoder path to restore the initial resolution, but the feature maps are concatenated to keep the computation and memory requirements tractable. As a result of multiple convolutional layers and max-pooling operations, the feature maps are reduced, and the intermediate layers of an FCN become successively smaller. Therefore, following the convolutions, an FCN uses inverse convolutions (or backward convolutions) to up-sample the intermediate layers until the input resolution is matched [72,73]. FCNs with skip-connections are able to combine high-level abstract features with low-level high-resolution features which has been shown to be successful in segmentation tasks [74]. Figure 8 illustrates a schematic overview of the anisotropic 3D fully convolutional neural network for qanat feature segmentation in CORONA images. The network architecture is inspired by the 2D U-Net model [52]. The network architecture consists of 22 convolutional, 5 max-pooling, and 5 up-sampling layers. Convolutional layers were applied without padding, while max-pooling layers halved the size of their inputs. The parameters, including the kernel sizes and number of kernels, are explained in each corresponding box. Shortcut connections ensure the combination of low-level and high-level features. As illustrated in Figure 8, each convolution layer has a kernel size of (2 × 2) with a stride of size 1 in the two image dimensions. Since the input of the proposed network is a panchromatic image, the number of channels for the first layer is equal to one. After each convolutional layer, a rectified linear unit (ReLu) 0, is used as the nonlinear activation function except for the last layer where a sigmoid function 1 is used to map the output to a class probability between 0 and 1. There are 5 max-pooling and 5 up-sampling layers of the size 2 × 2 in the encoder and decoder paths, respectively. The network has a total of 31,574,657 trainable parameters. The input to the network is a CORONA image patch 224 224 1, and the output segmentation map size is 100 100 18.

Training
During training of the proposed network, we aimed to minimize a loss function that measures the quality of the segmentation on the training examples. This loss ℒ over N training images can be defined as: where is the output segmentation map, is the ground truth obtained from expert manual segmentation for the training volume, and (set to 5), is the smoothing coefficient which prevents the denominator from being zero. This loss function has demonstrated utility in image segmentation problems where there is a heavy imbalance between the classes, as in our case where most of the data is considered background [75].
We used a stochastic gradient descent (SGD) algorithm with the Adam update rule [76] which was implemented in the Keras framework (https://github.com/keras-team/keras). The initial learning rate was set to 0.001. The learning rate was reduced by a factor of 0.8 if the average of validation Dice score did not improve by 10 in five epochs. The parameters of the convolutional layers were initialized randomly from a Gaussian distribution using the He method [77]. To prevent overfitting, in addition to batch-normalization [78], we used drop-out with 0.3 probability as well as regularization with 10 penalty on convolutional layers except the last one. Training was performed on eleven 2000 × 2000 CORONA image patches (approximately 5.5 × 5.5 km) using fivefold cross validation (Figure 9). Each training sample was a 2D patch of the size of 224 224 1 pixel. Since grayscale image data encoded as integers in a 0-255 range, we divided each pixel value by 255 to scale the data to the range of [0,1] because neural networks work better with small homogeneous range of values. Data augmentation was performed by equally selecting patches that had qanat and those without qanat and then applying random left-right and up-down flipping of the images. Cross-validation was used to monitor overfitting and to find the best epoch (model checkpoint) for the test-time deployment. For each cross-validation fold, we used 100 as the maximum number of epochs for training and an early stopping policy by monitoring validation performance.

Postprocessing
At prediction time, a sliding window of 224 × 224 was moved over the 2000 × 2000 image patches with a stride size of 10, and the predictions were summed up. By using a sliding window, predictions with overlapping regions were created that can be viewed as a form of test-time augmentation that can subsequently improve the results. The resultant concatenated prediction map was then thresholded to create the final binary mask prediction denoting the predicted qanat mask. Since we did not have any annotated qanat feature smaller than 10 pixels, we removed any detected features smaller than 10 pixels.

Results
We used a five-fold cross-validation to evaluate the model prediction. Table 1 shows patch-bypatch results metrics. Table 2 shows the confusion matrix obtained from the cross-validation process. The overall precision and recall for the model are respectively 0.62 and 0.74. Precision (also known as specificity) measures the proportion of correctly classified positives to all predicted positives. Recall (also known as sensitivity) measures the proportion of correctly classified positives to all annotated positives. Simply put, precision concerns how many predicted cases are correct while recall concerns how many cases of interest are among predictions. The F1 score combines precision and recall in order to balance their importance. The F1 score is the harmonic mean of precision and recall.  While preparing data for training the model, we decide to include images with both a low and a high density of qanat features. Identifying qanat shafts in low-density photos is a challenging task, even for domain experts. The model performs better on high-density qanat images, defined as containing more than 100 labeled shafts. For five high-density patches (>100 shafts, avg = 704), performance metrics are: precision = 0.654, recall = 0.764, F1 score = 0.705. For six low-density patches (<=100 features, avg = 55), performance metrics are: precision = 0.340, recall = 0.526, F1 score = 0.413. There is not a significant pattern in false positives. They include a range of man-made or natural bumpy features that exist across all parts of the study area. Where the number of qanat shafts is low, these false positives reduce the model's precision performance. Therefore, there is a strong correlation between model precision and total labeled features in a patch because of the significantly higher proportion of false positives/true positives detected in low-density patches ( Figure 10, 11, Figures  S23-33).

Discussion
Our project demonstrates that deep learning, even with small datasets, can be successfully applied to automate the detection of qanat shafts. Because qanat infrastructures are found in many countries and contexts and their shafts have similar signatures, it is easy to gradually enlarge the training dataset from a wider range of contexts and optimize the CNNs for this task. In order to illustrate the performance of the CNN model, we chose not to pre-process the imagery and to apply minimal post-processing of the results. These two steps are commonly applied to improve the results in archaeological object detection projects [14,17,51,54]. We also wanted to test automation on a real task at hand. Our domain expert labeled all potential shafts including the faint traces that are not easily recognized unless one is highly experienced with the signature and patterns of the qanat on the imagery. We expected that the model would have performed even better if the training dataset were limited to the relatively well-preserved shafts (see above).
We had envisioned that selecting training data from a wide range of locations in the study area would improve the model's performance ( Figure 9). However, the fact that seven out of eleven patches were in the low-density shaft zones seem to have negatively affected our model performance. Probably, with larger datasets, diversifying the annotation context would improve the model. However, in our case, we concluded that limiting the labeled data on the high-density qanat areas would have focused the model on learning local features of those areas and would have yielded better results. Although the number of false positives is reasonably low, the worst performance is observed in low-density patches. Our model demonstrated better recall performance than precision. In ten out of eleven patches, at least half of the labeled objects were detected.
Qanats can be largely validated with visual inspection because shafts are found in long and linear alignments. In most landscapes, including on the Erbil Plain, only a small proportion of the shafts are preserved, and thus it is not possible to ground-truth the qanat beyond visual inspection. The creation of large, high-quality training datasets will be one of the biggest challenges along the way for archaeological prospection tasks using deep learning. In order to increase the recall rate, objects with lower certainty levels need to be labeled. The uncertainties, bias, and error during the labeling process will affect the model performance. Without ground-truthed labeled data, it is hard to judge where errors in training data may have contributed to the weak performance of the model. For example, our model performed best in areas were shafts are abundant and better preserved. The highest rate of false positives to true positives were in areas where qanat shafts are rare or the certainty of identification is low.
The existing model can be improved if a larger training dataset of field-validated data is available. From an engineering point of view, we envision experimentation toward improving the model itself through the following methods. First, we can experiment with a classification method, rather than the segmentation that is used in the existing model. Two deep learning methods of segmentation and classification could be applied to automatic detection of objects in remote sensing data. Segmentation implements a binary classification at pixel level. The outcome is either yes, i.e., the pixel belongs to an object of interest, or no, i.e., the pixel does not. Classification identifies an object, composed of any number of pixels, and determines whether the object belongs to any of the defined object classes (see for example [61]). Since the visibility of the entire object is not crucial in segmentation, we anticipate that it could perform better with lower resolution imagery. However, further work needs to be carried out to assess the capability of deep learning methods for the accurate segmentation of shaft boundaries which can provide useful archaeological information about the size and shape of shafts. Architectures such as the U-Net that was used in this paper or Mask R-CNNs [79] can be used for such segmentation tasks. Using Mask R-CNNs has the advantage of predicting bounding boxes and segmentation masks simultaneously. We also envision experimentation with classification to compare the results of the two approaches. Second, we will also investigate ways to improve the performance of the proposed CNN architecture. In this work, we used a baseline model very similar to the U-Net architecture and we trained the whole network from scratch. Transfer learning by using pre-trained ImageNet models for the encoder section of U-Net has been successful in some segmentation applications [80]. We will run experiments to compare U-Nets with pre-trained encoders with the baseline architecture reported here. Third, we will study the ability of the proposed model in predictive uncertainty estimation and will try confidence calibration to improve it [81]. In the short term, however, we have three immediate concerns. First, we plan to integrate postprocessing measures that may be able to improve the rate of true to false positives. Although detection performance is routinely reported through metrics such as recall and precision, the most important metric is how well the model is suited for a specific task. High metrics can be achieved for models that are trained for a small task with no outside application [12,50]. When models are trained for a complex task across a big region, better performance is achieved if the certainty measures are relaxed to increase recall, and quick post-processing steps are integrated to increase precision [51]. In our case, for example, we may be able to eliminate false positives based on the nearby positives. Second, we aim to build other pieces of the workflow that are necessary in order to transform the pixel-value results of the segmentation process into localized vector files that can be imported into a GIS environment. Third, we plan to test our existing model in automated detection of qanats in other regions without labeling new data in order to compare the performance of our model in familiar and new landscapes.

Conclusions
Deep learning has proved to be a powerful method for challenging computer vision tasks. In fields as sensitive as cancer diagnosis, deep learning has proved to create results superior to novice expert and comparable to experienced experts [53]. It is outdated to debate whether archaeology should embrace automated detection. Instead, the focus must be on how to embrace the soon-to-be ubiquitous deep learning tools in order to overcome the challenges of data overload and to speed up the process of remote-sensing documentation in the face of increasingly rapid destruction of the archaeological landscapes. Automated detection is not a replacement for human experts. By eliminating manual tasks that can be delegated to computers, the researcher's precious and limited time can be focused on the validation and analysis of the results. Furthermore, automation will open new research opportunities, for example for in-depth large-scale landscape study of regional-scale features with relatively uniform appearance [9,11], such as qanats. Automation will open new research opportunities that are currently unfeasible. For example, a regional-scale comparative study of qanat technology is currently impossible because no one has the time to create detailed maps of qanat shafts manually. It is possible to use transfer learning and data augmentation to overcome the problem of small datasets for archaeological prospection tasks.
So far, all archaeological experimentation has used models trained for other tasks. However, once more projects integrate deep learning and larger datasets are available, CNN models could be optimized for archaeological remote-sensing autodetection tasks, and we would witness even better results. The promising performance of deep learning has encouraged some researchers to build inhouse tools for specific archaeological tasks [22]. Another possible direction is that commercial offthe-shelf deep-learning platforms that are emerging for industrial applications can be soon used for archaeological tasks as well.