Higher-Order Conditional Random Fields-Based 3D Semantic Labeling of Airborne Laser-Scanning Point Clouds

: This paper presents a novel framework to achieve 3D semantic labeling of objects (e


Introduction
Outdoor scene labeling of airborne laser-scanning (ALS) point clouds is a key step in many applications such as autonomous driving, urban scene understanding, surveying and mapping, smart city and remote sensing, among others [1][2][3][4][5][6]. Several point cloud classification approaches have been proposed in the last decade. Those classification methods can be generally divided into for region-growing. This method ensures that the differences between the points in each cluster are smallest and that the differences between the clusters are maximized. Although this method is simple and effective, it is greatly influenced by the seed point selection and is constrained by the boundary conditions. Furthermore, since the growth criterion is based on the correlation between the points, the selection of low-level features has a large impact on the segmentation results.
Model-based method: In this method, the point cloud is segmented according to predetermined geometric models. Initial simple assumptions are made, followed by model verification. Subsequently, the points belonging to the same model are assigned to the same class. Among numerous hypothesis-based models, the Random Sample Consensus (RANSAC) algorithm [17] is most frequently used because it is a simple and powerful tool well suited for outlier detection, shape recognition, and registration. For example, Wang and Shi [18] first segmented a point cloud into subunits using a spatial grid. Then a local RANSAC algorithm was used to fit different models in the subunits, e.g., planes, cylinders, and spheres. Finally, the best model was determined using statistical inference. Ni et al. [19] assumed that the inliers were located close to each other and proposed the so-called GroupSAC algorithm that divides the input data by selecting samples from clusters containing many inliers. A detailed discussion regarding the standard RANSAC algorithm and its variants can be found in [20,21]. Rather than using RANSAC, Awadallah et al. [22] projected a point cloud into a two-dimensional grid and obtained a two-dimensional gray-level image based on the mesh density. Then the point cloud was segmented according to the gray-level image segmentation results using a Snake model. In summary, model-based methods are well suited for extracting objects that can be fitted by linear and/or nonlinear parameter estimation but are infeasible for free-form surface because of lack of corresponding nonlinear models. Moreover, the applications of these kinds of methods are limited by the classes of the model.
Clustering method: This method is an unsupervised learning method, which heuristically clusters points with similar attributes into the same class to meet the requirements of cost functions. Representative methods include the classic K-means algorithm [23], Euclidean distance clustering algorithm [24], mean shift clustering [25][26][27], hierarchical clustering [28][29][30], sample density-based clustering [31,32], and mixed kernel density function clustering [33]. For example, Wu et al. [24] introduced a smooth threshold constraint to the traditional Euclidean clustering algorithm to prevent over-and/or under-segmentation problems. The enhanced algorithm was used to segment a Kinect point cloud contaminated with abundant noise and many outliers. Yang et al. [4] used a K-means algorithm to solve segmentation problems in model construction for a reverse engineering project. More specifically, the curvature of each points was first calculated. The Euclidean metric among points was redefined and recalculated in a local rather than a global coordinate system by considering the curvatures and geometric coordinates of points. Based on this newly defined Euclidean metric, the point clouds are segmented by K-means clustering algorithm. The overall performance of the clustering method is satisfactory, but it is significantly affected by different types of distance metrics. In some complex scenes, the proximity of different objects reduces the accuracy of the clustering results. In addition, it is usually impossible to segment objects with different scales, which is a desirable goal because information at different scales provides significant clues to object detection and recognition.
After segmentation, the discriminative features are usually determined by fusing multiple features. Some commonly used features include the view feature histogram (VFH) [34], clustered view feature histogram (CVFH) [35], the length and width of the point set [36], the covariance features and shape features of the point set. These features are regarded as effective descriptors for cluster classification and recognition. These cluster-based features are fed into the classifier to finalize the labeling. From a technical point of view, the mainstream classifiers mainly include unsupervised classifiers, machine learning classifiers, probability and optimization-based classifiers, ensemble learning classifiers, and deep learning classifier [37]. Each kind of method has its own advantages and drawbacks. Especially for processing relatively small-sized point clouds, the probability and optimization-based classifiers, e.g., conditional random fields (CRF) and Markov random fields (MRF) tend to be more promising than other methods because they result in more homogeneous clusters, and they consider the spatial correlation among multiple labels by incorporating contextual information and different types of prior knowledge into the algorithm, thereby improving greater flexibility and degrees of freedom for users. However, most existing MRF and/or CRF algorithms only consider the unary potential that is defined over a specific clique and pairwise potential that is defined over two cliques consisting of the two nearest clusters. In this case, the contextual information and prior knowledge have not been fully explored, thus leading to wrong labels in some complex scenes. To solve this problem, in this paper, we propose a higher-order CRF-based 3D semantic labeling framework. More specifically, the segmented framework simultaneously combines the existing classic algorithms, i.e., the K-means and the density-based spatial clustering of applications with noise (DBSCAN) [31] algorithms with the proposed probability density clustering algorithm to divide the raw ALS point clouds into a set of clusters. This preserves the clusters' homogeneity and neighborhood topology among clusters. It is to be noted that DBSCAN algorithm can be regarded as a kind of region-growing algorithms. After implementation of DBSCAN algorithm, the coarse-grained clusters are obtained. K-means and the proposed density clustering algorithms can be regarded as a kind of clustering methods. A set of fine-grained clusters is generated after implementation of K-means algorithm and the topological relationships between the fine-grained clusters are maintained by the proposed probability density clustering algorithm, i.e., the proposed segmented framework simultaneously combines the region-growing method with the clustering method, which commonly ensures the creation of homogeneous clusters with explicit topology. Based on these clusters and neighborhood topologies, we propose a higher-order CRF energy function that consists of unary, pairwise, and higher-order potentials defined over cliques consisting of more than two clusters to improve the accuracy of cluster labeling. The higher-order cliques are more useful for good perception of contextual information, thus improving the object classification and recognition. Given that our work is based on the conventional CRF framework, we explicitly state our original contributions as follows: • Hierarchical Point Cluster Generation: We propose a multilevel clustering point set construction method. The first-level clustering uses a density clustering method to aggregate the points belonging to the same kind of object into a coarse cluster set. The second-level clustering uses the classic K-means algorithm to over-segment the coarse cluster set into a fine cluster set, from which the point set of the minimum processing unit is constructed. The third-level clustering uses the proposed probability density clustering method to construct a cluster set at a high-level scale, i.e., the third-level cluster set includes one-to-many relationships with the second-level cluster set. The multilevel clusters commonly provide greater use of contextual information, thus improving the accuracy of clustering labeling. • Cluster Topology Maintenance Strategy: We present a strategy of constructing a neighborhood system among clusters by turning the problem of topology maintenance of clusters into a clustering problem based on the proposed probability density clustering method. • Higher-Order Energy Function: We propose a higher-order CRF energy function that considers the constraints of the unary potential, the pairwise potential, and the higher-order potential defined over cliques consisting of more than two clusters.
This remainder of the paper is organized as follows. Section 2 describes the detailed methodology including data set descriptions, coarse-grained clustering, fine-grained clustering, topology maintenance, and CRF labeling optimization. In Section 3, the performance evaluation results of the labeling accuracy and effectiveness of the CRF cost function terms are presented, analyzed, and discussed. Finally, Section 4 concludes the paper along with a few suggestions for future research topics.

Methodology
To accurately assign semantic labels, e.g., trees, buildings, and vehicles to ALS point clouds, we propose a methodology that consists of four steps, which are outlined in Figure 1. The input raw data (Figure 1a) are first segmented into a set of coarse-grained clusters using the classic DBSCAN algorithm, followed by over-segmentation using the K-means algorithm, thereby generating fine-grained clusters (Figure 1b,c). Subsequently, we finalize the topology maintenance of the fine-grained clusters by transforming the problem of topology maintenance into a clustering problem based on the proposed probability density clustering algorithm (Figure 1d,e). In the last step, three types of potentials, including the unary, pairwise, and higher-order potentials are embedded into the CRF model to finalize the classification. In the labeling framework, we use as input the ALS point clouds of residential and urban scenes, and generate point clouds with explicit semantic labels, i.e., trees, buildings, and vehicles ( Figure 1f). It should be noted that the ground points are first filtered using a cloth simulation filtering algorithm [38] because we only focus on non-ground point clouds.

Materials
To evaluate the performance of the proposed classification algorithm, we use scenes (Figure 2a,d) representing residential and urban areas in Tianjin, China to verify the accuracy of the proposed algorithm. The two scenes were acquired in August 2010 using a Leica ALS50 system with a mean flying height of 500 m above ground and a 45 • field of view. The point density is approximately 20-30 points/m 2 . Scene 1 contains large buildings that are adjacent to dense trees and some vehicles scattered in this scene. Scene 2 is dominated by small-sized vehicles, which presents a challenge for the classification algorithm because of the unbalanced classes. Thus, these two scenes are appropriate for assessing the proposed algorithm. Because the proposed method is a supervised learning algorithm, we manually select the different classes in the point clouds as training data. To achieve this goal, registered aerial images and the commercial Terrasolid software package (http://www.terrasolid. com/home.php) are used to collect the training samples in two-and three-dimensional data space to improve the reliability of training data selection. In practice, we first use a stratified sampling scheme for the selection of the most representative point clouds. The stratified sampling guarantees that the selected point clouds are uniformly distributed and derived from different objects. Once the typical points are selected, visual inspection is conducted by combining the registered aerial images and interactive operations in two-and three-dimensional data space using Terrasolid software package. The remaining point clouds in the two scenes are regarded as testing points. The statistics of the training and testing data sets of the different classes in the two scenes are listed in Table 1.

Coarse-Grained Clustering Using DBSCAN
Since the geometric shapes of scanned objects are complex, conventional clustering algorithms that are based on a rigidly fixed-point set/cluster are not suitable for complex scenes. To obtain different numbers of clusters and different sizes of clusters corresponding to different objects, an adaptive clustering algorithm is needed. To this end, we implement the initial clustering based on the local point density and the connectivity of point clouds by using the classic DBSCAN algorithm. After being clustered, a complex scene can be simplified into a series of disposable clusters. This provides a solid foundation for the subsequent generation of a set of fine-grained clusters. The reason for choosing the DBSCAN algorithm is that it is resistant to noise and can find clusters with concave and convex shapes. In addition, the DBSCAN algorithm does not require the number of clusters as input. In fact, DBSCAN only requires two parameters: Eps and MinPts. The parameter Eps specifies how close the points should be to each other to be considered to be a part of the cluster, while the parameter MinPts specifies how many neighbors a point should have to be included into a cluster. After being clustered, the raw point clouds are clustered into three categories: • Core Point: Any point p with several neighborhood points that is greater than or equal to MinPts is regarded as a core point. Based on these definitions, the core points with "density reachability" are clustered together to form one cluster. Here, "density reachability" is defined as follows: given a data set P = {p 1 , p 2 , . . . , p n }, where p = p 1 and q = p n . If a point from p i (i = 1, 2, . . . , n−1) is within the -neighborhood of a point p i+1 , and the p i+1 is a core point, the point p i+1 (i = 1, 2, . . . , n−1) is defined as "density reachable" from a point p i . If the set of connection points leading from p i to p i+1 consists of only core points and is "density reachable", the point q is defined as "density reachable" from point p. The border points are merged with the current cluster's core points to form a class. The noise points are classified as another class.
As shown in the outdoor scene in Figure 3a, even though the point clouds are noisy and have outliers, we can roughly classify the point clouds of the same object into one cluster, as it is evident in Figure 3b. Please note that different colors represent different clusters after implementation of the DBSCAN algorithm. Compared with the labeling reference in Figure 3c, we observed that some trees are close to the house, resulting in under-segmented clusters. To solve this problem, the coarse-grained clusters generated by the DBSCAN algorithm are further refined by dividing them into smaller and fine-grained point sets to guarantee the homogeneity of points within each fine-grained cluster.

Fine-Grained Clustering Using K-Means
After coarse-grained clustering, the size of the segmented clusters is relatively large, and the results do not consider the trends of the point distribution. In addition, the coarse-grained clustering does not provide homogeneous clusters, resulting in objects of multiple categories coexisting in each cluster. To ensure that the segmented clusters provide better discrimination, the clusters should be uniform size, and should contain only one single object or part of an object. To this end, the K-means algorithm is used to aggregate the points with uniform distribution into a series of subclusters. As previously mentioned in Section 1, three criteria are used for ideal segmentation. Fortunately, K-means algorithm meets the first criterion by tuning the number of clusters K. In addition, K-means algorithm is simple, efficient coverage, and easy implementation, which makes it compliance with the second criterion. However, the K-means algorithm do not comply with the third criterion, i.e., the generated clusters using K-means do not maintain the adjacent topological relationships, which are expected to be restored by the proposed probability density clustering algorithm (see Section 2.4). The direct use of the conventional K-means algorithm to segment the non-ground raw points is computationally inefficient and large-scale raw point clouds cannot be processed. In contrast, if we implement the K-means algorithm after each coarse-grained cluster has been created, the computational efficiency is improved significantly based on the concept of "divide and conquer". Furthermore, by using the coarse-grained clusters as inputs of the K-means algorithm, multiple clusters can be processed in parallel, thereby further enhancing the computational efficiency. Therefore, the coarse-grained clusters derived from the DBSCAN algorithm are fed into the K-means algorithm to over-segment them into smaller cluster with relatively uniform size, which is below a predefined threshold T . The detailed algorithm is given below: The K-means algorithm is defined as follows: given a coarse-grained cluster C with a set of point clouds S c = {p i ∈ R 3 , (i = 1, 2, . . . , n)}, the number of predefined K (K < n) fine-grained clusters is obtained by minimizing the following energy function [23]: where S c represents the point set from cluster C and p c is the centroid set of S c . The detailed aggregation process is as follows: (1) Choose an arbitrary cluster S c from the coarse-grained cluster set as a processing unit and then randomly choose K points as the beginning centroids, i.e., p (0) i ∈ S c , i = 1, 2, . . . , K. (2) Assign the point clouds within S c to their associated centroids according to the criterion of the minimum Euclidean metric from a point to its associated centroid. After t iterations, we obtain: where min(·) represents the minimum Euclidean metric between two points. (3) Update the centroid of each class: (4) Repeat steps (2) to (3) until the centroid locations remain stable, i.e., S After the four steps, each coarse-grained cluster has been segmented into K subclusters S c , c = 1, 2, 3 . . . , K. The number of points |S c | of each subcluster S c , c = 1, 2, 3 . . . , K is calculated. If it is less than or equal to the predefined threshold T , the subcluster is remained, otherwise, we regard it as a new input to the K-means algorithm, i.e., we repeat steps (1) to (4) until the sizes of the segmented clusters are less than T . This process is executed iteratively, until all coarse-grained clusters have been segmented. After the fine-grained clustering has been performed using the K-means algorithm, the coarse-grained clusters shown in Figure 4a are segmented into the fine-grained clusters depicted in Figure 4b. It is observed that the clusters are more uniformly sized and have more homogeneous semantic consistency.

Neighborhood Relationship Maintenance between Fine-Grained Clusters
Most methods that use neighborhood relationships (contextual information) for point labeling depend on information from the individual point clouds. This pointwise neighborhood relationship only represents the objects' local rather than the global geometric shapes. In addition, although the features extracted from the fine-grained clusters are more discriminative, the results as shown in Figure 4b do not include neighborhood relationships between the clusters. Intuitively, the adjacent topology between clusters is an important factor in object classification and recognition because it describes the objects' geometric shapes at a relatively high level. Although the KNN algorithm can be used to establish the neighborhood relationships between the fine-grained clusters, it only maintains a fixed number of neighboring clusters. This simple and rigid neighborhood topology may be effective for the description of relatively simple geometric shapes but not for complex scenes. To create a more effective cluster-based neighborhood system that is suitable for the subsequent CRF optimization described in Section 2.5, we propose a probability density clustering method that changes the problem of topology maintenance into a clustering problem between the fine-grained clusters. Because the proposed clustering algorithm requires initial coarse labels of the fine-grained clusters as constraints, we derive the initial coarse labels and execute the proposed clustering to obtain a new cluster set, from which the neighborhood relationship is obtained.

Initial Coarse Labeling of Fine-Grained Clusters
To obtain the initial coarse labels of the fine-grained clusters, we first create distinctive features for these clusters and then use a support vector machine (SVM) supervised classifier to train and label the clusters as trees, buildings, and vehicles. In [29,39], it was demonstrated that the point covariance matrix is an effective measure to describe the object's geometric shapes and point clouds' distribution; thus the covariance matrix and its derived high-level features are selected as feature descriptors. More specifically, we first extract the centroid of each cluster and estimate its covariance matrix. We can easily obtain the three eigenvalues, i.e., λ 2 ≥ λ 1 ≥ λ 0 and their corresponding eigenvectors, namely v 2 , v 1 , and v 0 . The features that are derived by linear or nonlinear combinations of these eigenvalues are regarded as covariance-based feature descriptors denoted as F cov . For the ALS point clouds, the elevation is one of the discriminative features for object recognition, thus the high-based descriptor F z is used. We use the feature F lsh [7] for cluster labeling because it has good discriminative ability. Specifically, to calculate the feature F lsh , we first select a point from a set of points as the center of a sphere, and derive a histogram of the neighborhood points in the latitudinal direction. This feature is used to distinguish different classes of points according to the distribution of the neighborhood points in the latitudinal direction. This feature has the advantages of anti-occlusion, no influence of the local coordinate system, as well as high efficiency [7]. The features used in this study are listed in Table 2. Table 2. The selected features for the initial coarse labeling of fine-grained clusters. Symbol "-" represents the specific descriptor that needs to be defined in each application.

Features
Description Linear property λ 0 /λ 2 Scattered property 1− < e 0 · (0, 0, 1) > Horizontal property For each cluster, we extract these features and concatenate them into high-dimensional feature vectors. The feature vectors are normalized and used in the SVM classifier with a Gaussian kernel to train the model. Once the model has been established, the probability of each class is obtained. The label of the class with the highest probability is the label of the current cluster. Please note that the initial labels are not required to be more precise, but they are needed as an approximate input to the subsequent clustering described in Figure 5. In addition, the initial coarse labeling provides the probability of fine-grained cluster of belonging to the different classes, which presents useful information for the subsequent CRF optimization described in Section 2.5.

Creation of Neighborhood System
Once we assign the coarse labels to the fine-grained clusters, we use these labels as constraints to create the neighborhood system of fine-grained clusters. To this end, we transform the problem of developing a neighborhood system into a problem of clustering based on the fine-grained cluster set. More specifically, inspired by the concept of "probability density clustering" that is used in the classic mean shift algorithm, we use the initial coarse labels and centroids of the fine-grained clusters as inputs. We assume that we have a centroid set X = {x i , i = 1, 2, . . . , K} and we arbitrarily select x i as the centroid of the first cluster. Then we update the probability density of the current cluster as follows: where K(·) is a Gaussian kernel, the symbols l and l i represent the corresponding initial coarse labels of the centroids x and x i , and S h represents the spherical areas with radius h and is defined by The entire clustering algorithm consists of the following seven steps: (1) Select an initial cluster center from the unlabeled set X, and set its class as the current cluster's label. (2) Label the centroids of the cluster in S h . If they are consistent with the label of the current centroid, we update the accumulator. (3) Calculate the deviation vector using Equation (4) and update the center of the current cluster. (4) Repeat steps (2) to (3) until the mean deviation is less than a threshold, i.e., M h (x) < . (5) Determine whether the Euclidean metric from the current cluster's center to the existing center is less than σ. If it is true, these two clusters need to be merged, otherwise the current cluster is regarded as a new cluster. (6) Execute the steps from (1) to (5) until all the centroids of the clusters have been assigned a specific label. (7) For each centroid of a cluster, we assign the label with the highest frequency of visits as an associated label.
In summary, once the initial coarse labels are given (see Figure 5a), the proposed probability density clustering algorithm generates a new cluster set (see Figure 5b), each of which includes one or more fine-grained clusters, i.e., the neighborhood topology between the fine-grained clusters is implicitly embedded in the newly generated clusters. The inclusive relationships are shown in the enlarged view in Figure 5c by the overlap portion of the fine-grained and the newly generated clusters.

CRF Classification with Higher-Order Potentials
As described in Section 2.4.1, the point classification based on the features of the individual clusters is not stable and is subject to noise and outliers. In this section, we describe the establishment of the higher-order CRF model by integrating the label of the fine-grained clusters and the neighborhood topology between the fine-grained clusters to obtain the final class labels. The CRF model is a type of discriminative probabilistic model based on an undirected graph (UG). It is a modeling approach for determining the conditional probability of multiple variables given observation data [40]. This model not only considers the individual cluster's label but also the neighborhood relationship between the clusters. We use the initial labels of the clusters (Section 2.4.1) and the neighborhood topology (Section 2.4.2) to construct the high-level CRF model to finalize the classification of the fine-grained clusters.
The undirected graphical CRF model is a joint probability distribution model that satisfies the global Markov property. In this model, each vertex denoted by variable v represents one or a group of variables. The edge e connecting two vertices indicates the relational dependency between them. All vertices are stored in a set V, and all edges are maintained in a set E , and the UG is denoted as G = (V, E ). If two vertices in the subset of the UG have a connected edge, this subset is called a clique. If the clique does not exist exclusively within the vertex set of a larger clique, the subset of the vertex set constitutes a maximal clique, i.e., a maximum clique is a clique of the largest possible size in a given UG. In our case, the vertices in the UG correspond to the fine-grained clusters and the edges correspond to the neighborhood topology between the fine-grained clusters.
We assume that a set L = {l 1 , l 2 , . . . , l n } represents the initial labeling results of the fine-grained clusters and the set Y = {y 1 , y 2 , . . . , y n } denotes the corresponding real labels of each cluster. The objective of CRF is the creation of conditional probability model P (Y |L), which is defined as [41][42][43]: where Z is a normalization factor that is used to normalize P (Y |L) to a specific scale. ϕ c (y c ) represents a potential function of a group c belonging to a label y c . Therefore, our objective is to obtain the optimal label configuration Y = {y 1 , y 2 , . . . , y n } that maximizes the probability P (Y |L). We use the logarithm of the negative value of Equation (5) and transform the maximization problem into a minimization problem [41][42][43]: The labeling optimization problem is transformed into [41][42][43]: More specifically, our complete model is represented by three types of potentials, including the unary potential, pairwise potential, and higher-order potential: where the first two terms ∑ i∈V ϕ i (y i ) and ∑ (i,j)∈ ϕ ij (y i , y j ) are the unary and pairwise energies and the last term ∑ |c|>2 ϕ c (y c ) is the higher-order energy. The unary term penalizes the discrepancy between the initial cluster and the solution labels and is defined as: where P (y i ) represents the point i's probability belonging to class y i . The unary potential is calculated using the initial labels and the probability of each fine-grained cluster belonging to different classes (see Section 2.4.1). Unlike with other feature descriptors, we directly use the initial probability of the cluster to create more accurate results and simplify the calculation. The pairwise term is used to constraint the discrepancy between the nearest neighborhood clusters to ensure that the labels are identical. In other words, this interaction potential provides a definition of smoothness and penalizes changes between the current cluster and their nearest neighbors. The pairwise term is defined as follows: where λ e is a balanced parameter and dist represents the Euclidean distance between the centroids of two clusters. Studies have shown that the potentials defined by using higher-order cliques are highly suitable [42,43]. Therefore, inspired by these works, in this study, we include the higher-order potential in the CRF model to improve the classification accuracy. Here the higher-order potential means that the potential defined over cliques consists of more than two vertices/nodes. The higher-order term better describes the spatial context of 3D point clouds and it incorporates prior knowledge in the CRF model. In addition, it constrains the inconsistency among vertices and ensures that the labels are similar for higher-order cliques. To ensure smooth change of the hider-order term, we use the robust P n Potts [41] model to design this energy term; it is defined as follows: where N l c (y c ) represents the number of vertices with label l in the cliques. Q is a truncation function threshold. It is used to determine the energy function of the higher-order term by comparing the number of clusters belonging to class c within a higher-order clique; if this number is small, the energy value of the higher-order term is high. γ max is used to smooth the effect of each cluster in the higher-order clique.
To solve Equation (8), a primitive dual algorithm called SoSPD proposed in [44] is used to optimize the objective function. The algorithm uses a so-called "supplementary relaxation" rather than the conventional maximum flow to optimize the energy function. More precisely, this algorithm first performs linear programming relaxation for Equation (8) and executes a supplementary relaxation. During this process, these two steps must satisfy two conditions, i.e., low-order and higher-order relaxations, i.e., the original and dual solutions are similar initially to satisfy the low-order relaxation until it converges to simultaneously satisfy the low-order and higher-order relaxations.

Performance Evaluation and Discussion
In this section, we briefly introduce the hardware configurations and parameter settings of the workflow. Subsequently, we qualitatively evaluate the classification results by conducting a visual comparison with the reference data. In addition, we quantitatively evaluate the classification results using the selected measures, i.e., precision, recall, accuracy, F 1 -score, and mF 1 . We also compare our results with the results generated by the state-or-the-art methods and test the effectiveness of the proposed CRF model by using an "ablation study" strategy.

Implementation
The experiment is conducted using a personal computer equipped with a 4.2 GHz Intel Core i7-7700K CPU and 24-GB of main memory. The implementation of the proposed algorithm is performed using the open source Point Cloud Library, i.e., PCL1.8.0 (http://pointclouds.org/) embedded in the Microsoft Visual C++ integrated development environment. In addition, the relevant parameter setting is based on the characteristics of the objects in these two scenes, e.g., the geometric shapes and the ALS point density of the objects. To obtain the optimal input parameters, we use a trial and error strategy to determine the appropriate values. More specifically, in the DBSCAN clustering, we obtain good results when using the parameter Eps in the range [0.7, 1.5] and the parameter MinPts in the range [6,10]. Similarly, in the over-segmented K-means algorithm and the procedure of topological maintenance using the proposed probability density clustering algorithm, we use the thresholds T in the range [200, 300] and the clustering radius h in the range [0.8, 1.5], i.e., our algorithm is insensitive to changes in the parameter settings over a wide range of values. Based on the above-mentioned analysis, in our scenario, the optimal values Eps = 1 m, MinPts = 8, T = 300, and h = 1 m are used as input to the workflow.

Comparisons
To demonstrate the superiority of the proposed algorithm, we conduct a qualitative comparison of our method with several state-of-the-art methods, namely Method 1 in [45], Method 2 in [46], Method 3 in [7], Method 4 in [47] and Method 5 in [48]. The characteristics of these methods are detailed in Table 3. Table 3. Comparison with five methods in terms of "Basic Unit", "Point Set Construction", "Feature Expression", "Optimization", and "Classifier".

Methods
Basic The comparison results of Scene 1 and Scene 2 are shown in Figures 6 and 7. The results indicate that the other five methods result in numerous omission and commission error. More specifically, the building's step edges and/or ridge edges are mistakenly classified as trees. Some parts of dense vegetation areas are incorrectly classified as buildings. Many small-sized vehicles have not been correctly and completely recognized. In contrast, thanks to the higher-order CRF optimization based on clusters, the point labels of the proposed method are more homogeneous, which makes it suitable for applications that require semantic class labels of point clouds. For example, the building points can be used for 3D building contouring and the tree points can be used for tree modeling and tree species identification. In Figure 7, we observe that this scene is dominated by small-sized vehicles and it is a challenge to segment these small objects from the imbalanced classes. Fortunately, it is evident that our proposed method is superior to the other methods, especially for recognizing the small-sized vehicles. Most of the vehicles in Scene 2 have been correctly labeled and recognized, except for some vehicles consisting of few points or being covered by dense vegetation. Although we obtain a good classification performance in the urban and residential scenes, the buildings in some areas denoted by the red rectangles are mistakenly classified as trees and some red ovals are falsely labeled as buildings. These misclassification errors occur because the features presented in Table 2 are straightforward and relatively simple, although we consider the labels of neighborhood clusters to assist in classification. However, once many neighborhood clusters have been mislabeled, the cluster itself suffers from misclassification errors.  We further use five measures, namely the precision, recall, accuracy, F 1 -score, and mF 1 to quantitatively evaluate the proposed method. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations, i.e., precision = TP/(TP+FP) [49]. Recall is the ratio of correctly predicted positive observations to all observations in the positive class i.e., recall = TP/(TP + FN) [49]. Accuracy is the most intuitive performance measure and is the ratio of correctly predicted observations to the total observations, i.e., accuracy = (TP + TN)/(TP + FP + TN + FN) [50]. F 1 -score is the weighted average of the precision and recall and is defined as: F 1 -score = 2 × (recall × precision)/(recall + precision) [13]. Therefore, this score takes both false positives and false negatives into account. It is more useful than the other three measures, especially if the scene has an uneven class distribution.
In Table 4, we provide the statistics of the four measures. Although our method does not have the highest precision or recall for all classes, we achieve a good tradeoff between precision and recall and the overall accuracies are 95.4% and 95.2% for Scene 1 and Scene 2, respectively. Since the percentage of the training vehicle points is relatively low (see Table 1), the accurate labeling of vehicles has become more challenging, i.e., we confront class imbalanced problem in Scene 1 and Scene 2. However, in this case, we still obtain the highest F 1 -score values of 73.7% and 66.9% for vehicle labeling in Scene 1 and Scene 2. This proves that our method has an advantage in terms of classification of small vehicles over other state-of-the-art methods. The F 1 -score values for trees and buildings in Scene 1 are 95.9% and 95.8% and are the highest values among Methods 1-5, thereby obtaining the highest mF 1 value. Please note that the mF 1 measure is calculated by averaging all classes of the F 1 -scores in the scene. It is a comprehensive indicator to represent the overall performance of scene labeling. Similarly, we obtain the highest mF 1 value of 86.1% for Scene 2 and the F 1 -scores for trees, buildings, and vehicles are 93.4%, 98.0%, and 66.9%. Although 93.4% for the F 1 -score for the tree class is not the highest value, we achieve a good balance among all classes and the highest mF 1 value of 86.1% for Scene 2. It should be noted that although numerous small-sized vehicles are scattered in Scene 2, we still obtain 95.2% accuracy, which is slightly smaller than 95.5% of Method 5 because some building point clouds are falsely classified as vegetation. Although our method for classifying buildings and vehicles outperforms Method 5, the overall number of point clouds with corrected semantic labels is lower than that of Method 5. In fact, the number of tree points in Scene 2 is ten times that of the number of vehicle points; therefore, many misclassified tree points greatly affect the accuracy of the proposed method. This also explains why the F 1 -score of 93.4% for the tree class is lower than that of Method 3 (95.6%) and Method 4 (94.6%). However, it should be noted that a high accuracy does not mean that Method 5 has a good classification performance, i.e., the classification algorithm should consider the performance of all classes. Therefore, some measures such as F 1 -score or mF 1 are recommended when evaluating the overall performance of classification algorithms. Table 4. Quantitative comparison of six classification methods using measures of "precision/recall", "accuracy", "F 1 -score", and "mF 1 ". The values from columns of tree, building, and vehicle correspond to "precision/recall". The measure of "accuracy" means the overall performance of three classes. The values of "F 1 -score" correspond to tree, building, and vehicle. The last column represents the value of "mF 1 " measure, which is calculated by averaging the three classes.

Effectiveness of CRF Model
To demonstrate the superiority of the higher-order CRF classification model, we use an "ablation study" to compare three different configurations including classification without CRF optimization, a low-level CRF model using only first-and second-order terms, and a high-level CRF model using first-and second-order terms, as well as higher-order terms. In these three cases, we use the F 1 -score and mF 1 measures to evaluate the classification accuracy of the two scenes. The results are shown in Table 5. It is evident that without CRF optimization, we obtain the smallest F 1 -score and mF 1 score for all classes in the two scenes, indicating the effectiveness of the CRF optimization. If we use the low-level CRF model in the workflow, it is expected that the results are superior to the methods without using CRF optimization, but the improvement is not large. For the high-level CRF model, we obtain mF 1 score of 88.5% and 86.1% for Scene 1 and Scene 2, indicating that this method outperforms the other two methods.

Discussions
From the above-mentioned qualitative and quantitative analyses, we can conclude that the cluster-based classification algorithms, i.e., Methods 4 and 5 and the proposed method are more accurate than the point-based methods because the former methods include more contextual information and the estimated features are more stable and robust. Our results in Figures 6 and 7 are more homogeneous than those of the other methods and there are few scattered points of other classes within the boundary of a specific class. The class boundaries are better defined. Although Method 3 resulted in relatively high precision and/or recall, the completeness of a specific object cannot be guaranteed. Although the cluster-based Method 5 provides good point labeling, it includes some obvious error and the results are not as homogeneous as the proposed method. In contrast, our method results in more accurate point labels and complete geometry, which is important for detailed applications of semantic point clouds, such as vectorization of buildings and reconstruction of buildings.
The good performance of the proposed method is attributed to the following three reasons: (a) Self-adaptive Segmentation: In outdoor scenes, the objects have various shapes and sizes. The simple segmentation according to rigid number and the size of the clusters cannot adapt to various objects. In this case, the self-adaptive segmentation algorithm is required. The proposed workflow is adaptation-aware and is shown in two aspects: 1 The coarse-grained clustering using the DBSCAN can identify any shapes of clusters according to the distribution of point clouds. The implementation cannot require the number of clusters and can easily find the noise and outliers. 2 We propose probability density clustering algorithm to ensure topological maintenance between adjacent fined-grained clusters. The adaptive characteristics is represented by aggregating different number of fine-grained clusters into a high-level cluster, from which one-to-many relationships are included. That is the number of fine-grained clusters in a generated cluster is not fixed and is determined by the shapes and local properties of the objects. Self-adaptive segmentation ensures reasonableness of generated clusters at different levels. (b) Multilevel Clustering: The feature of cluster is more robust and stable compared to individual point clouds. Therefore, we propose three-level cluster generation strategy: the first level creates coarse-level clusters using DBSCAN algorithm. The second level clustering uses conventional k-means algorithm to over-segment the coarse-level clusters into a set of fine-grained clusters. The proposed probability density clustering algorithm works at the third level; it aggregates the fine-grained cluster set into a high-level scale, i.e., the third-level cluster set includes one-to-many relationships with the second-level clusters. Multilevel clustering can better explore the contextual information between the objects in outdoor scenes. Although there exists a multilevel cluster construction algorithm in [13], the size of generated clusters is linearly increased, causing no apparent feature discrepancy of the objects. (c) Higher-Order CRF Optimization: To optimize the point labeling using the proposed higher-order CRF model, we not only consider the adjacent clusters but a wider local area based on the neighborhood relationship between the clusters. More precisely, we design the higher-order potential and embed it into the CRF optimization. The higher-order potential can fully perceive the prior knowledge and contextual clues, thereby improving the accuracy of classification and recognition.
Through the above analysis, we believe that the three strategies, namely self-adaptive segmentation, multilevel clustering, and higher-order CRF optimization guarantee that the proposed method can theoretically result in a better outcome compared to other point-based and/or cluster-based methods listed in Table 3.

Conclusions
In this paper, we have presented a novel CRF-based 3D semantic labeling algorithm for assigning semantic information in ALS point clouds. Instead of using point-based semantic labeling, this algorithm is based on higher-level clusters that are created by integrating the DBSCAN and the classic K-means algorithms, followed by the proposed probability density clustering algorithm. Using the over-segmented clusters as the basic processing units has two advantages: 1 The computational efficiency is significantly improved during the CRF optimization. 2 All individual points and clusters have explicit contextual information, which facilitates the detection of the object by the classifiers. To determine the neighborhood topology between the generated clusters, we transform the problem of topological maintenance into a clustering problem using the proposed probability density clustering algorithm. This algorithm maintains a flexible neighborhood system in a natural intuitive manner and does not require the rigid and fixed numbers of a neighborhood system used in conventional methods. Finally, these cluster entities and their contextual information are used for labeling the points by incorporating unary, pairwise, and higher-order potentials established on different levels of cliques, thereby achieving better point cloud labeling. Compared with five other state-of-the-art methods, the classification of urban and residential scenes with the proposed method has better classification accuracy and there is a good tradeoff between precision and recall; the mF 1 measures are 88.5% and 86.1% for Scene 1 and Scene 2, respectively. In addition, it is found that the CRF optimization with higher-level potentials defined over cliques consisting of more than two clusters is effective and the highest labeling performance is obtained in terms of the F 1 -score and mF 1 .
Although our algorithm achieves promising classification accuracy for the Tianjin data set, there are some drawbacks and interesting ideas which can be further explored to extend the research reported in this paper. We use the initial labels to design the CRF' unary term. These initial labels are obtained by using the cluster features defined in Table 2. In some cases, the low-level features of the clusters are not stable. In future works, we plan to use a deep learning method to extract the high-level cluster features for better discrimination of the cluster's properties. In addition, if we use higher-order potential terms in the CRF optimization, a large number of iterations is required, and the computational cost will increase along with an increase in the cluster number; this makes the processing of large-scale ALS point clouds impossible. For this problem, we plan to explore the in-depth properties and topologies of different levels of clusters to control the number of participating cluster entities.
Author Contributions: Y.L. and X.D. analyzed the data and wrote the C++ source code. D.C. and S.X. (Shaobo Xia) helped with the project and study design, paper writing, and analysis of the results. Y.W., S.X. (Sheng Xu), and Q.Y. helped with the data analysis, experimental analysis, and comparisons.