Large-Scale Oil Palm Tree Detection from High-Resolution Satellite Images Using Two-Stage Convolutional Neural Networks

Abstract: Oil palm is an important economic crop that contributes 35% of the total consumption of vegetable oil, and remote sensing-based quantitative detection of oil palm trees has long been a key research direction for both agricultural and environmental purposes. While existing methods already demonstrate satisfactory effectiveness for small regions, performing the detection over a large region with satisfactory accuracy is still challenging. In this study, we propose a two-stage convolutional neural network (TS-CNN)-based oil palm detection method using high-resolution satellite images (i.e., QuickBird) in a large-scale study area of Malaysia. The TS-CNN consists of one CNN for land cover classification and one CNN for object classification. The two CNNs were trained and optimized independently based on 20,000 samples collected through human interpretation. For large-scale oil palm detection over an area of 55 km², we propose an effective workflow that consists of an overlapping partitioning method for large-scale image division, a multi-scale sliding window method for oil palm coordinate prediction, and a minimum distance filter method for post-processing. Our proposed approach achieves a much higher average F1-score of 94.99% in our study area than existing oil palm detection methods (87.95%, 81.80%, 80.61%, and 78.35% for the single-stage CNN, Support Vector Machine (SVM), Random Forest (RF), and Artificial Neural Network (ANN), respectively), and produces far fewer confusions with other vegetation and buildings in the whole-image detection results.


Introduction
Oil palm is one of the most rapidly expanding crops in tropical regions due to its high economic value [1], especially in Malaysia and Indonesia, the two leading oil palm-producing countries. The palm oil produced from oil palm trees is widely used for many purposes, e.g., cooking oil, food additives, cosmetics, industrial lubricants, and biofuels [2]. In recent years, palm oil has become the world's most consumed vegetable oil, making up 35% of the total consumption of vegetable oil [3]. As a result of the increasing demand for palm oil, a considerable amount of land (e.g., existing arable land and forests) has been converted to oil palm plantation areas [4,5]. This expansion has caused serious environmental problems, e.g., deforestation, the reduction of biodiversity, and the loss of ecosystem functioning [6-8]. Traditional machine learning-based methods do not require manual selection of the tree plantation area. Many studies not only detect tree crowns but also identify tree species. Machine learning-based methods often outperform traditional image processing-based methods in complex regions. However, some feature extraction and tree classification methods place particular demands on the image datasets. For instance, the methods proposed in References [35-37] require very high-resolution UAV images, hyperspectral and thermal imagery, and point cloud and hyperspectral images, respectively. These requirements limit their applicability to large-scale oil palm detection to some extent.
Since 2014, deep learning has been applied to the remote sensing field and has achieved excellent performance in many studies [38-41], including multi-spectral and hyperspectral image-based land cover mapping [42-46], high-resolution image-based scene classification [47-49], object detection [50-54], and semantic segmentation [55-57]. In recent years, several studies have applied deep learning-based methods to oil palm or tree crown detection, all of which are based on the sliding window technique combined with a CNN. In 2016, we proposed the first deep learning-based method for oil palm detection in our previous study and achieved a detection accuracy of 96% in a pure oil palm plantation area in Malaysia [17]. Guirado et al. [58] proposed a deep learning-based method for scattered shrub detection, obtaining detection accuracies of 93.38% and 96.50% in two study areas of 0.053 km². The paper concluded that deep learning-based methods outperform object-based methods in regions where the images for tree detection are relatively different from the images used for training sample collection. Pibre et al. [59] proposed a deep CNN-based approach for detecting tree crowns of different sizes using a multi-scale sliding window technique and a multi-source data fusion method, obtaining a highest tree crown detection accuracy of 75% in a study area smaller than 0.1 km². In summary, the deep learning-based tree crown detection methods proposed in existing studies are all based on the features of the individual tree crown, without taking the features of the plantation region into consideration.
Although existing deep learning-based tree crown detection methods achieve relatively good detection results in small study areas (<5 km²), they are not suitable for oil palm tree detection in large-scale study areas, in which there is great similarity between oil palm trees and other vegetation types in many regions.
In this research, we propose a two-stage convolutional neural network-based method (TS-CNN) for oil palm detection in a large-scale study area (around 55 km²) located in Malaysia, using high-resolution satellite images (i.e., QuickBird). We collected 20,000 samples of four types (i.e., oil palm, background, other vegetation/bare land, and impervious/cloud) from the study region through human interpretation. Two convolutional neural networks were trained and optimized based on the 20,000 samples, one for land cover classification and the other for object classification. We designed a large-scale image division method and a multi-scale sliding-window-based method for oil palm coordinate prediction, and proposed a minimum distance filter-based method for post-processing. Experimental results show that there are far fewer misclassifications between oil palms and other vegetation or buildings in the image detection results of our proposed TS-CNN compared with the other 10 oil palm detection methods. Our proposed TS-CNN achieves an average F1-score of 95% in our study area compared with the manually labeled ground truth, which is 7-16% higher than the existing single-stage classifier-based methods. Moreover, the link to the 5000 manually labeled oil palm sample coordinates and the whole oil palm detection results of our proposed method can be found at the end of this paper. We hope this dataset can be beneficial to other oil palm-related studies.

Study Area and Datasets

Figure 1 shows the location of our study area (103°35′56.82″E, 1°35′47.96″N, in the south of Malaysia) and the QuickBird satellite image (acquired on 21 November 2006) used for the oil palm detection in this research. There are various land cover types in this area, including oil palm plantation areas, forests, grassland, bare land, impervious surfaces, and cloud. The size of the QuickBird satellite image used in this study is 12,188 × 12,576 pixels. The spatial resolutions of its one panchromatic band and four multi-spectral bands are 0.6 m and 2.4 m, respectively. Moreover, we employed a spectral sharpening method (i.e., Gram-Schmidt) and removed the NIR band, resulting in an image with three spectral bands (Red, Green, and Blue) at 0.6-m spatial resolution for further processing.

Twenty thousand sample coordinates were collected from multiple regions selected within the whole study area, including 5000 oil palm sample coordinates, 5000 background sample coordinates, 5000 other vegetation/bare land sample coordinates, and 5000 impervious/cloud sample coordinates. Figure 1 shows several examples of sample coordinates collected from different regions. The oil palm sample coordinates and the background sample coordinates were collected from the two black rectangular areas through manual interpretation. Only the sample coordinates located in the center of an oil palm tree crown were interpreted as oil palm tree sample coordinates (denoted by red points), while other coordinates were interpreted as background sample coordinates (denoted by yellow points). The sample coordinates of the other vegetation/bare land type (denoted by green points) were randomly generated from the five green square areas. Similarly, the sample coordinates of the impervious/cloud type (denoted by blue points) were randomly generated from the five blue square areas and the road area denoted by the blue polygon. All 20,000 sample coordinates were randomly divided into a training dataset with 17,000 coordinates and a validation dataset with 3000 coordinates. In addition, we selected six typical areas (denoted by red squares) for method evaluation, which will be described in Section 4.

Motivations and Overview of Our Proposed Method
In our previous study, we proposed a single CNN-based approach for oil palm detection in a pure oil palm plantation area [17]. However, if we directly apply this single CNN-based method to the large-scale study area with various land cover types, the detection results show many misclassifications between oil palms and other vegetation, even if we collect thousands of samples of other vegetation types. The main reason is that in many regions it is hard to differentiate oil palm trees from other vegetation using only an input image of a similar size to a single oil palm tree. However, because neighboring oil palm trees often have similar spatial distances (i.e., around 7.5-9 m in our study area) while other types of vegetation are often distributed randomly, it is much easier to distinguish oil palms from other types of vegetation if we take the special pattern and texture of the oil palm plantation areas into consideration.
In this research, we propose a two-stage CNN-based method for oil palm detection from a large-scale QuickBird satellite image, which solves the above-mentioned problems of our previous work and greatly improves the detection results. Figure 2 shows the overall flowchart of this method, including the training and optimization of the TS-CNN (Figure 2a-h) and the large-scale oil palm detection (Figure 2a-l). First, we built a sample dataset of three classes (i.e., oil palm plantation area, other vegetation/bare land, and impervious/cloud) with a size of 65 × 65 pixels from the sample coordinates (Figure 2c), and trained and optimized the first CNN for land cover classification (Figure 2d). The sample image size for land cover classification can influence the detection accuracy to some extent, which will be analyzed in Section 4.1. Similarly, we built a sample dataset of four classes (i.e., oil palm, background, other vegetation/bare land, and impervious/cloud) with a size of 17 × 17 pixels (similar to the largest tree crown size of the oil palms in our study area) from the sample coordinates (Figure 2f), and trained and optimized the second CNN for object classification (Figure 2g). For the oil palm tree detection in the large-scale remote sensing image, we first propose an overlapping partitioning method to divide the large-scale image into multiple sub-images (Figure 2i). Second, we propose a multi-scale sliding-window technique to collect image datasets (Figure 2j), and utilize the two optimal CNN models (Figure 2e,h) obtained in the training phases to predict labels for the image datasets and obtain the oil palm tree coordinates. Third, we apply a minimum distance filter-based post-processing method to the predicted oil palm tree coordinates (Figure 2k) and obtain the final oil palm detection results (Figure 2l). The details of each step are described in the following sections.

Training and Optimization of the CNN for Land Cover Classification
For the training and optimization of the TS-CNN, the 20,000 sample coordinates collected in Section 2 are randomly divided into 17,000 training sample coordinates and 3000 validation sample coordinates. To build the sample dataset for land cover classification, we extract the training and validation image samples at a size of 65 × 65 pixels from the whole image based on their corresponding sample coordinates, and combine the samples of the oil palm class and the background class into the oil palm plantation class. The size of the image sample for land-cover classification can influence the detection accuracy to some extent, which will be analyzed in Section 4.1. Then we use the 17,000 training samples and the 3000 validation samples (in three classes, with a size of 65 × 65 pixels) to train the first CNN for land-cover classification and calculate the classification accuracies. Figure 3 shows the framework of this step. The CNN used for land-cover classification is based on the AlexNet architecture [60]. It consists of five convolutional layers (the first, second, and fifth are each followed by a max-pooling layer) and three fully-connected layers. We utilize ReLU [60] as the activation function and adopt the dropout technique to prevent substantial overfitting. Moreover, we optimize the main hyper-parameters of the CNN (e.g., the number of convolution kernels in each convolutional layer, the number of hidden units in each fully-connected layer, and the batch size) to obtain the optimal hyper-parameter setting for land-cover classification (achieving the highest accuracy on the 3000 validation samples of 65 × 65 pixels).
Figure 3. The framework of the CNN for land-cover classification.
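The paragraph above fixes the convolution kernel at 3 × 3 and the pooling kernel at 2 × 2 (the values reported later in Section 4). As a rough illustration that is not taken from the paper, the following sketch traces how a 65 × 65 land-cover sample shrinks through the five-layer stack, assuming stride 1 and no padding for the convolutions and stride 2 for the pooling:

```python
# Sketch (assumptions stated above): feature-map size through the
# AlexNet-style stack for a 65 x 65 input sample.

def conv_out(size, kernel=3, stride=1, pad=0):
    """Output width/height of a square convolution layer."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output width/height of a square max-pooling layer."""
    return (size - kernel) // stride + 1

size = 65
for layer in range(1, 6):        # five convolutional layers
    size = conv_out(size)
    if layer in (1, 2, 5):       # pooling after conv1, conv2, and conv5
        size = pool_out(size)
    print(f"after conv{layer}: {size} x {size}")
```

The final feature map feeds the three fully-connected layers; the actual sizes depend on the padding scheme, which the paper does not state.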

Training and Optimization of the CNN for Object Classification
To build the sample dataset for object classification, we extract the training and validation image samples at a size of 17 × 17 pixels (similar to the largest tree crown size of the oil palms in our study area) from the whole image based on their corresponding sample coordinates. Then we use the 17,000 training samples and the 3000 validation samples (in four classes, with a size of 17 × 17 pixels) to train the second CNN for object classification and to calculate the classification accuracy. Figure 4 shows the framework of this step. The basic architecture of the second CNN is the same as that of the first CNN. The main hyper-parameters of the second CNN for object classification are optimized independently from those of the first CNN described in Section 3.2.1. Similarly, we obtain the optimal hyper-parameter setting for object classification (achieving the highest accuracy on the 3000 validation samples of 17 × 17 pixels).

Overlapping Partitioning Based Large-Scale Image Division
As mentioned in Section 2, the image size of our study area is 12,188 × 12,576 pixels. To improve the efficiency of the large-scale oil palm tree detection, we divide the whole image into multiple sub-images of 600 × 600 pixels so that each sub-image can be processed in parallel. If we directly divided the large-scale image into multiple sub-images without overlaps, some tree crowns would be split between two sub-images and might not be detected correctly. To avoid this situation, we design an overlapping partitioning method for the large-scale image division. In our proposed method, every two neighboring sub-images in the large-scale image have an overlapping region of a user-defined width (20 pixels in this study). In this way, every oil palm tree belongs entirely to at least one sub-image. Correspondingly, we perform a minimum distance filter-based post-processing after the palm tree detection of each sub-image to avoid repeated detection of the same oil palm tree resulting from the overlapping partitioning, which is described in Section 3.3.3.
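The partitioning rule above can be sketched as follows. This is an illustrative implementation under the stated 600-pixel tile size and 20-pixel overlap, not the authors' code; `tile_offsets` and `partition` are hypothetical helper names:

```python
# Sketch: overlapping partitioning of a large image into 600 x 600 tiles
# whose neighbors share a 20-pixel strip.

def tile_offsets(length, tile=600, overlap=20):
    """1-D start offsets so consecutive tiles overlap by `overlap` pixels."""
    step = tile - overlap
    offsets = list(range(0, max(length - tile, 0) + 1, step))
    # make sure the last tile reaches the image border
    if offsets[-1] + tile < length:
        offsets.append(length - tile)
    return offsets

def partition(height, width, tile=600, overlap=20):
    """All (row, col) top-left corners of the overlapping sub-images."""
    return [(r, c)
            for r in tile_offsets(height, tile, overlap)
            for c in tile_offsets(width, tile, overlap)]

tiles = partition(12188, 12576)   # the QuickBird scene used in the paper
```

Because every interior tile starts 580 pixels after its neighbor, any tree crown narrower than the 20-pixel overlap falls entirely inside at least one tile.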

Multi-Scale Sliding Window Based Image Dataset Collection and Label Prediction
After dividing the large-scale image into multiple sub-images, we designed a multi-scale sliding-window technique to collect two image datasets for each sub-image and to predict labels for all samples in the image datasets, as shown in Figure 5. Each sample of the first image dataset corresponds to a window of 65 × 65 pixels, and each sample of the second image dataset corresponds to a window of 17 × 17 pixels. Taking both the processing efficiency and the detection accuracy into consideration, the optimal sliding distance of a window in each step is three pixels in our study area (obtained through experiments). After collecting the two image datasets, we use the first optimal CNN model to predict land-cover types for the first image dataset, and the second optimal CNN model to predict object types for the second image dataset. A sample coordinate is predicted as the coordinate of an oil palm tree only if its corresponding sample in the first image dataset is predicted as the oil palm plantation area type and its corresponding sample in the second image dataset is predicted as the oil palm type.
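The joint decision rule can be sketched as below. `predict_land_cover` and `predict_object` are hypothetical stand-ins for the two trained CNNs, and the 32-pixel margin simply keeps the larger 65 × 65 window inside the sub-image:

```python
# Sketch of the two-window prediction rule: a coordinate is kept only when
# BOTH the 65 x 65 context window and the 17 x 17 object window agree.

def candidate_coords(sub_image_size=600, stride=3, margin=32):
    """Window-centre coordinates visited by the sliding windows."""
    lo, hi = margin, sub_image_size - margin
    return [(r, c)
            for r in range(lo, hi, stride)
            for c in range(lo, hi, stride)]

def detect(sub_image, predict_land_cover, predict_object):
    """Return coordinates where both CNN predictions indicate an oil palm."""
    palms = []
    for r, c in candidate_coords():
        large = sub_image[r - 32:r + 33, c - 32:c + 33]   # 65 x 65 window
        small = sub_image[r - 8:r + 9, c - 8:c + 9]       # 17 x 17 window
        if (predict_land_cover(large) == "oil palm plantation"
                and predict_object(small) == "oil palm"):
            palms.append((r, c))
    return palms
```

In practice the per-window predictions would be batched for GPU efficiency; the loop form is only meant to make the AND rule explicit.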

Minimum Distance Filter Based Post-Processing
After completing the overlapping partitioning-based large-scale image division and the multi-scale sliding-window-based label prediction, we obtain all samples predicted as the oil palm type and their corresponding spatial coordinates in the whole image. At this point, multiple predicted oil palm coordinates are often located within the same oil palm tree crown, because we partition the large-scale image into multiple sub-images with overlap and the sliding distance of the window in each step is smaller than the actual distance between two oil palm trees. To avoid repeated detection of the same oil palm tree, we apply a minimum distance filter-based post-processing method to all predicted oil palm coordinates, as shown in Figure 5c,d. According to the actual situation of the oil palm tree plantation, the Euclidean spatial distance between two neighboring oil palms should not be smaller than 10 pixels (i.e., 6 m) in our study area. Thus, we iteratively merge each group of oil palm coordinates into one coordinate if their Euclidean distance is smaller than a given threshold. In each iteration, we increase the threshold by one pixel (from 3 to 10 pixels); the merging results remained unchanged once the threshold reached 8 pixels. After this post-processing step, we obtain the final oil palm detection results for the whole study area.
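A minimal sketch of this filter is given below, assuming a simple greedy grouping (the paper does not specify exactly how a "group" is formed) with the threshold growing from 3 to 10 pixels as described:

```python
# Sketch: iteratively merge predicted coordinates closer than an
# increasing threshold into their centroids.

from math import dist

def merge_once(coords, threshold):
    """Greedily merge coordinates closer than `threshold` into centroids."""
    merged, used = [], [False] * len(coords)
    for i, p in enumerate(coords):
        if used[i]:
            continue
        group, used[i] = [p], True
        for j in range(i + 1, len(coords)):
            if not used[j] and dist(p, coords[j]) < threshold:
                group.append(coords[j])
                used[j] = True
        merged.append((sum(x for x, _ in group) / len(group),
                       sum(y for _, y in group) / len(group)))
    return merged

def minimum_distance_filter(coords, thresholds=range(3, 11)):
    """Apply the merge with thresholds of 3 to 10 pixels, as in the paper."""
    for t in thresholds:
        coords = merge_once(coords, t)
    return coords
```

Because two genuine neighboring palms are at least 10 pixels apart, the final threshold never collapses distinct trees.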

Hyper-Parameter Setting and Classification Accuracies
In this study, the main hyper-parameters of the CNN for land-cover classification and the CNN for object classification are optimized independently. After the hyper-parameter optimization, we obtain the optimal hyper-parameter settings of the two CNNs with the highest classification accuracy on the 3000 validation samples. For both CNNs, the sizes of the convolution kernel, max-pooling kernel, and mini-batch are 3, 2, and 10, respectively. The maximum number of iterations for CNN training is set to 100,000. For the first CNN for land-cover classification, we obtain the highest classification accuracy of 95.47% on the 3000 validation samples in three classes when the numbers of convolution kernels in the five convolutional layers and of neurons in the three fully-connected layers are 24-64-96-96-64-800-800-3. For the second CNN for object classification, we obtain the highest classification accuracy of 91.53% on the 3000 validation samples in four classes when the numbers of convolution kernels in the five convolutional layers and of neurons in the three fully-connected layers are 25-60-45-45-65-800-800-4. Moreover, the size of the image sample for land-cover classification can influence the detection results to some extent. Experimental results show that the classification accuracies on the 3000 validation samples are 92%, 95%, and 91% when the image sample sizes are 45 × 45 pixels, 65 × 65 pixels, and 85 × 85 pixels, respectively. Thus, we chose 65 × 65 pixels as the sample image size for land-cover classification.

Evaluation of the Oil Palm Tree Detection Results
To evaluate the oil palm detection results of our proposed method quantitatively, we selected six typical regions in the large-scale study area and compared the detection results of these regions with the manually labeled ground truth. The precision, recall, and F1-score of the six selected regions are calculated according to Equations (1)-(3), where True Positive (TP) indicates the number of oil palms that are detected correctly, False Positive (FP) indicates the number of other objects that are detected as oil palms by mistake, and False Negative (FN) indicates the number of oil palms that are not detected. An oil palm is considered correctly detected only if the Euclidean spatial distance between its coordinate and an oil palm coordinate in the ground truth dataset is smaller than or equal to 5 pixels [17]. Table 1 displays the performance of the TS-CNN-based method in the six typical regions selected from the large-scale study area with respect to TP, FP, FN, precision, recall, and F1-score. The locations of the six selected regions for method evaluation can be found in Figure 1 (denoted by the red squares and numbered from top to bottom). Each selected region includes various land cover types, e.g., oil palm plantation areas, other vegetation, bare land, buildings, and roads. Figure 6 shows the image detection results in the six selected regions, where the red points indicate the oil palms detected by our proposed method, the green squares indicate the oil palms that are not detected, and the blue squares indicate other types of objects that are mistakenly detected as oil palms. Our proposed method achieves excellent oil palm detection results with very few confusions with other vegetation or buildings, and the F1-scores of the oil palm detection results are 91-96% in the six regions.
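Equations (1)-(3) are the standard definitions of the three metrics and can be written directly in terms of TP, FP, and FN:

```python
# Precision, recall, and F1-score as defined in the text:
# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall).

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```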

Oil Palm Tree Detection Results in the Whole Study Area

In the whole study area, a total of 264,960 oil palm trees were detected by our proposed method. The link to the predicted oil palm coordinates of the whole study area can be found at the end of this paper. The total processing time for the oil palm detection in the whole study area (including data pre-processing, oil palm detection, and post-processing) is about 6 hours (library: TensorFlow; hardware: a single NVIDIA Tesla K40 GPU). We obtained the density map of our study area by calculating the number of detected oil palms per hectare for each pixel, as shown in Figure 7. For most regions in the oil palm plantation area, the densities of oil palms are between 108 and 180 per hectare, which is consistent with the actual plantation situation of the oil palms (the spatial distance between two neighboring oil palm trees is often 7.5-9 m).
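One plausible way to compute such a per-pixel density map (the paper does not give its exact procedure) is to count detections inside a square 1-ha neighbourhood around each pixel; at 0.6 m resolution a hectare corresponds to roughly a 167 × 167-pixel box:

```python
# Sketch (assumption: square 1-ha box per pixel): palms per hectare around
# each pixel, using an integral image so each box count is O(1).

import numpy as np

def density_map(palm_coords, height, width, window=167):
    """Detected palms per hectare around each pixel (box-filter count)."""
    counts = np.zeros((height, width))
    for r, c in palm_coords:
        counts[int(r), int(c)] += 1
    # integral image: integral[i, j] = sum of counts[:i, :j]
    integral = np.pad(counts.cumsum(0).cumsum(1), ((1, 0), (1, 0)))
    h = window // 2
    density = np.empty_like(counts)
    for r in range(height):
        r0, r1 = max(r - h, 0), min(r + h + 1, height)
        for c in range(width):
            c0, c1 = max(c - h, 0), min(c + h + 1, width)
            density[r, c] = (integral[r1, c1] - integral[r0, c1]
                             - integral[r1, c0] + integral[r0, c0])
    return density
```

For the full 12,188 × 12,576 scene the inner loops would be vectorized (or the counts convolved with a box kernel); the loop form above is only to keep the counting rule explicit.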

Detection Results of Different CNN Architectures
In this section, we analyze the influence of various CNN architectures on the oil palm tree detection results in the six selected regions. For both the first CNN for land cover classification (CNN 1) and the second CNN for object classification (CNN 2), we evaluate the detection accuracies of three widely used CNN architectures: (1) LeNet, with 2 convolutional layers, 2 max-pooling layers, and 1 fully-connected layer; (2) AlexNet, with 5 convolutional layers, 3 max-pooling layers, and 3 fully-connected layers; (3) VGG-19, with 16 convolutional layers, 5 max-pooling layers, and 3 fully-connected layers. Thus, there are nine combinations of CNN architectures for our proposed TS-CNN-based method. Table 2 shows the F1-scores of the six selected regions obtained from the nine combinations of CNN architectures. The numbers in bold type indicate the best case (the highest F1-score) for each region. We find that the AlexNet + AlexNet combination used in this study achieves the highest F1-score in most of the regions (slightly lower than LeNet + AlexNet for Region 3 and Region 6) and the highest average F1-score over the six regions among the different combinations of CNN architectures.

The Detection Result Comparison of Different Methods
To further evaluate the proposed TS-CNN-based method, we implemented four existing oil palm detection methods for comparison, namely the single-stage CNN-, SVM-, Random Forest-, and Artificial Neural Network-based methods (denoted by CNN, SVM, RF, and ANN, respectively). In our previous studies [17], the CNN-based method outperformed other oil palm detection methods (the template matching-based method, the ANN-based method, and the local maximum filter-based method) in all selected regions. SVM, RF, and ANN are broadly used machine learning methods in the remote sensing domain due to their excellent classification accuracy [61,62] and good performance in many tree crown detection or tree species classification studies [33,36,37,63]. Thus, we chose these methods for comparison with the TS-CNN-based method. To further evaluate the two-stage strategy for oil palm detection, we also implemented six other two-stage methods using different combinations of the above classifiers (denoted by SVM+SVM, RF+RF, ANN+ANN, CNN+SVM, CNN+RF, and CNN+ANN, respectively).
We also tried applying the Faster R-CNN [64] method to oil palm detection based on the same sample dataset used for the TS-CNN. Two thousand 500 × 500-pixel sample images were randomly generated from the selected regions (shown in Figure 1) to train the Faster R-CNN. However, the detection results were very poor in all six regions, which might be due to the limited number of labeled sample images, the small size of the oil palm objects, and the relatively low resolution of the satellite images used in this study. Faster R-CNN and other end-to-end object detection methods will be further explored and improved for oil palm detection in our future research.
All of these methods share the same overlapping partitioning-based large-scale image division and minimum distance filter-based post-processing, but differ in classifier training and optimization, image dataset collection, and label prediction. For the single classifier-based methods (i.e., CNN, SVM, RF, and ANN), we trained and optimized a single classifier for object classification and utilized the single-scale sliding window-based method for image dataset collection and label prediction [17]. For the two-stage classifier-based methods (i.e., CNN+CNN, SVM+SVM, RF+RF, ANN+ANN, CNN+SVM, CNN+RF, and CNN+ANN), we trained and optimized one classifier for land cover classification and one classifier for object classification, and utilized the multi-scale sliding window-based method for image dataset collection and label prediction. Tables 3-8 show the detection results of all methods in each selected region in terms of TP, FP, FN, Precision, Recall, and F1-score. The detection results of all six regions (in terms of F1-score) are summarized in Table 9. The numbers in bold type indicate the best case (the highest value for TP, Precision, Recall, and F1-score, and the lowest value for FP and FN) among the different methods. Figures 8 and 9 show the image detection results of all methods in Region 2 and Region 4. We find that our proposed TS-CNN-based method outperforms the other methods in several respects. First, the average F1-score of TS-CNN over the six regions is 94.99%, which is 7.04-16.64% higher than those obtained by the existing one-stage methods (CNN, SVM, RF, and ANN) and 2.51-11.45% higher than those obtained by the other two-stage methods (SVM+SVM, RF+RF, ANN+ANN, CNN+SVM, CNN+RF, and CNN+ANN). Second, from both the detection accuracies and the detection images, we can conclude that the two-stage methods outperform the single classifier-based methods for all classifiers (i.e., CNN, SVM, RF, and ANN).
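The division of labor between the two stages can be sketched as follows. Here `cnn1`, `cnn2`, the class labels, and the stride values are placeholders for illustration; only the 65 × 65 land-cover window and the 17 × 17 object window sizes are taken from the paper, and the real method additionally scans at multiple scales.

```python
import numpy as np

def two_stage_detect(image, cnn1, cnn2, coarse=65, fine=17,
                     coarse_stride=32, fine_stride=3):
    """Sketch of two-stage sliding-window prediction.

    cnn1(patch) -> land-cover label for a coarse 65 x 65 window;
    cnn2(patch) -> object label for a fine 17 x 17 window, evaluated
    only inside windows that cnn1 labels as oil palm plantation.
    Strides and label strings are illustrative assumptions.
    """
    h, w = image.shape[:2]
    detections = []
    for r in range(0, h - coarse + 1, coarse_stride):
        for c in range(0, w - coarse + 1, coarse_stride):
            block = image[r:r + coarse, c:c + coarse]
            if cnn1(block) != "oil_palm_area":      # stage 1: land cover
                continue
            for rr in range(r, r + coarse - fine + 1, fine_stride):
                for cc in range(c, c + coarse - fine + 1, fine_stride):
                    patch = image[rr:rr + fine, cc:cc + fine]
                    if cnn2(patch) == "oil_palm":   # stage 2: object class
                        detections.append((rr + fine // 2, cc + fine // 2))
    return detections
```

The point of the first stage is visible in the control flow: the expensive fine-grained scan is skipped entirely outside plantation blocks, which is also what suppresses confusion with buildings and other vegetation.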
The CNN classifier outperforms the SVM, RF, and ANN classifiers in both the two-stage and one-stage methods in our study area. Moreover, there are far fewer misclassifications between oil palms and other vegetation or buildings in the detection results of our proposed method compared with those of the other 10 methods (see the top areas in Figure 8 and the bottom-left areas in Figure 9). Nevertheless, the oil palm detection results of our proposed method still have some weaknesses. On the one hand, the recalls of the two-stage methods are often slightly lower than those of the one-stage methods. On the other hand, there are still some misclassifications between oil palm trees and other vegetation types, especially around the boundary between two land-cover types. These issues will be addressed in our future study.

Conclusions
In this research, we proposed a two-stage convolutional neural network-based oil palm detection method, using a large-scale QuickBird satellite image located in the south of Malaysia. Twenty thousand sample coordinates in four classes were collected from the study area through human interpretation. For the training and optimization of the TS-CNN, we built a sample dataset of three classes with a size of 65 × 65 pixels to train and optimize the first CNN for land-cover classification, and a sample dataset of four classes with a size of 17 × 17 pixels to train and optimize the second CNN for object classification. For oil palm tree detection in large-scale remote sensing images, we proposed an overlapping partitioning method to divide the large-scale image into multiple sub-images, and a multi-scale sliding-window technique to collect image datasets and predict the label of each sample, yielding the predicted oil palm coordinates. Finally, we applied a minimum distance filter-based post-processing method to obtain the final oil palm detection results.
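The minimum distance filter in the final step can be sketched as a greedy suppression of nearby duplicate detections, which arise because overlapping sliding windows fire several times on the same crown. The 8-pixel threshold and the score-based ordering below are illustrative assumptions; the paper's exact rule and threshold are not restated here.

```python
import numpy as np

def min_distance_filter(coords, scores, min_dist=8.0):
    """Keep the highest-scoring detection within each `min_dist`
    neighbourhood, discarding closer duplicates (a sketch; the
    threshold and ordering are illustrative assumptions).
    """
    order = np.argsort(scores)[::-1]          # best detections first
    kept = []
    for i in order:
        p = np.asarray(coords[i], dtype=float)
        # Keep this detection only if no kept point is too close.
        if all(np.linalg.norm(p - np.asarray(q)) >= min_dist for q in kept):
            kept.append(tuple(coords[i]))
    return kept
```

This is the same greedy scheme as non-maximum suppression in object detection, applied to point coordinates instead of bounding boxes.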
In the analysis of the experimental results, we calculated the detection accuracies of six selected regions by comparing the detection results with the manually labeled ground truth. The TS-CNN-based method achieves the highest average F1-score (about 95%) among the 11 oil palm detection methods in our study area, with much less confusion between oil palms and other vegetation or buildings in the detection images compared with the other methods. Moreover, we obtained the oil palm density map from the detection results over the whole study area, and analyzed the influence of different combinations of CNN architectures on the detection results. In future research, we will explore the potential of more state-of-the-art object detection algorithms for oil palm tree detection, and design more accurate and efficient oil palm tree detection methods for various large-scale study areas.
Supplementary Materials: The 5000 manually-labeled oil palm sample coordinates and the whole oil palm detection results of our proposed method can be found in https://github.com/liweijia/oil-palm-detection.