UAV Autonomous Localization Using Macro-Features Matching with a CAD Model

Research in the field of autonomous Unmanned Aerial Vehicles (UAVs) has significantly advanced in recent years, mainly due to their relevance in a large variety of commercial, industrial, and military applications. However, UAV navigation in GPS-denied environments continues to be a challenging problem that has been tackled in recent research through sensor-based approaches. This paper presents a novel offline, portable, real-time in-door UAV localization technique that relies on macro-feature detection and matching. The proposed system leverages the support of machine learning, traditional computer vision techniques, and pre-existing knowledge of the environment. The main contribution of this work is the real-time creation of a macro-feature description vector from the UAV captured images which are simultaneously matched with an offline pre-existing vector from a Computer-Aided Design (CAD) model. This results in a quick UAV localization within the CAD model. The effectiveness and accuracy of the proposed system were evaluated through simulations and experimental prototype implementation. Final results reveal the algorithm’s low computational burden as well as its ease of deployment in GPS-denied environments.


Introduction
Due to the proliferation of Unmanned Aerial Vehicles (UAV) applications in the past decade, vision-based localization of UAVs has been an important and active field of research for several years. This is primarily due to the increase in applications where Global Positioning Systems (GPS) and Global Navigation Satellite System (GNSS) are infeasible [1,2]. Furthermore, auxiliary functions such as object recognition can be integrated into the vision system reducing overall complexity. In indoor applications, GPS signal can be weak, inaccurate or unavailable, thus preventing autonomous UAVs from localizing within buildings which gives rise to multiple complications that hamper navigation efforts [2,3].
Non-vision-based localization in GPS-denied environments either rely on wireless sensor network [4,5], Time of Flight, and Received Signal Strength Indicator (RSS) [6]. Such solutions require special building-wide setup consisting of sensors and/or transmitters to be constructed adequate functionality [5]. Furthermore, the increased cost as the setup footprint increases as well as synchronization between different nodes yields a major complexity in this type of setups [6].
As a resolution to such complications, vision-based localization approaches in GPS-denied environments have been widely discussed in literature. Detection and mapping to the corresponding section 2.1) in a 3D CAD model. Once multiple matches are found, the UAV can localize itself within the CAD model and therefore would be able to follow a calculated trajectory to the goal position.

Pre-Processing
Prior to the UAV flight, a Building Information Model (BIM) [14] is loaded using Open Asset Import Library (Assimp) [15] which provides a unified method of accessing the data within. A BIM is a 3D CAD model typically produced during the planning phase of a construction project and includes information about macro-features as well as their locations as shown in Figure 1.
Macro-features are fixed features in the indoor environment such as doors, windows, and vent panels in which the proximity of each feature to the other can be used to identify a particular location on a 3D map. The macro-features from the CAD model are extracted and their centroids' locations are stored in a K-Dimensional Tree (k-d Tree) [16,17] of dimension 3. This facilitates retrieval of information about the nearest neighboring macro-feature of any given reference macro-feature which is required while generating the description vector within the CAD model. Using the k-d Tree, vectors are computed for each macro-feature within the CAD model using an algorithm explained in Section 2.3. These are later used to detect correspondences between the detected macro-features during the UAV flight and those present within the CAD model.

Macro-Feature Detection and Extraction Method
The objective here is to device a method to detect doors and windows (macro-features) reliably and efficiently from observed data while simultaneously registering them to the generated SLAM map. A CNN YOLOv2 [13] that has been previously trained to detect macro-features would ensure a high level of accuracy while maintaining a quick computational speed. Since windows are harder to detect using traditional image processing techniques due to reflections and the presence of objects behind them. YOLOv2 performs real-time detection by applying a single neural network to the entire image. This network divides the image into regions and predicts bounding boxes and probabilities of predictions for each region.

Data Preparation and Training
The training data is collected using short video sequences containing doors and windows from the real datasets. The data is collected from a variety of different sources including the actual indoor flight environment. Care is taken to collect images in different lighting conditions as well as from different distances and angles to ensure training effectiveness. The images are annotated using LabelImg [18] tool. Doors and windows in each image were marked using rectangular boxes and the coordinates were stored in normalized coordinates by expressing them as a fraction of the length and Macro-features are fixed features in the indoor environment such as doors, windows, and vent panels in which the proximity of each feature to the other can be used to identify a particular location on a 3D map. The macro-features from the CAD model are extracted and their centroids' locations are stored in a K-Dimensional Tree (k-d Tree) [16,17] of dimension 3. This facilitates retrieval of information about the nearest neighboring macro-feature of any given reference macro-feature which is required while generating the description vector within the CAD model.
Using the k-d Tree, vectors are computed for each macro-feature within the CAD model using an algorithm explained in Section 2.3. These are later used to detect correspondences between the detected macro-features during the UAV flight and those present within the CAD model.

Macro-Feature Detection and Extraction Method
The objective here is to device a method to detect doors and windows (macro-features) reliably and efficiently from observed data while simultaneously registering them to the generated SLAM map. A CNN YOLOv2 [13] that has been previously trained to detect macro-features would ensure a high level of accuracy while maintaining a quick computational speed. Since windows are harder to detect using traditional image processing techniques due to reflections and the presence of objects behind them. YOLOv2 performs real-time detection by applying a single neural network to the entire image. This network divides the image into regions and predicts bounding boxes and probabilities of predictions for each region.

Data Preparation and Training
The training data is collected using short video sequences containing doors and windows from the real datasets. The data is collected from a variety of different sources including the actual indoor flight environment. Care is taken to collect images in different lighting conditions as well as from different distances and angles to ensure training effectiveness. The images are annotated using LabelImg [18] tool. Doors and windows in each image were marked using rectangular boxes and the coordinates were stored in normalized coordinates by expressing them as a fraction of the length and width of the image. Slight perturbation in angle were introduced to the images during augmentation. This was done by applying a rotation to the input images given by, where R z is a rotation about an axis perpendicular to xy plane and θ is the angle of rotation. This angle is kept to a maximum of ±20 degrees in 5-degree increments to simulate the maximum possible roll. Furthermore, inverse warping is applied to get rid of holes in the warped images. These holes were artifacts of forward warping where pixels in the warped images are not painted in. This is due to the unavailability of one-to-one mapping from the source images to the destination images. A total of 1540 images are collected and a total of 12,320 images are generated through augmentation. A total of 80% of the dataset is randomly selected as the training set and the remaining 20% are to validate the detection accuracy. This is done to compute the prevision-recall curve of the detection which provides information to set the cut-off threshold of the detection confidence. Detections are classified into two classes, one for the doors and the other for windows. In addition, to remove detections that looked like neither, a third class is added. This class is not used in the evaluation but leads to an increase in the neural network accuracy.

Detection Refinement
The YOLOv2 provides rectangular boxes bounding the proposed detection with the image space as shown in Figure 2a. Each of these bounding boxes corresponds to a single detection of either a door or a window. These detections are then translated into the actual macro-features in 3D space. To do this, lines are first detected in the area within each of the bounding boxes using Line Segment Detector (LSD) [19,20]. In practice, even though the algorithm does perform quite well line detection, the output is prone to generate broken lines even after rigorous parameter tuning as in Figure 2b. Further refinement is applied to such an issue by joining vertical side lines within a threshold distance from bounded box provided they meet certain criteria. The LSD outputs lines as point pairs denoting the endpoints of the line segments. These endpoints are used to compute the slopes of line segments. Line segments are joined when such line segments have a slope within a threshold angle as well as close endpoints are within a threshold distance. The threshold for the angle is set to 2 degrees and the threshold distance between close endpoints is set to 10 pixels. The longest lines closest to the two vertical sides of the bounds are then taken to denote the vertical sides of the macro-feature as illustrated in Figure 2c. This is similarly illustrated in Figure 2d,e on one and multiple doors, respectively.

3D Reconstruction
Using stereo image pairs captured during flight, the 2D lines detected and denoted as the vertical edges of each macro-feature are projected into 3D space. The image pairs are used to compute the depth from disparity at each point in the image. This computation is highly parallelizable and is done on board the UAV. The sampled points are shown in Figure 3a along the 2D vertical side lines detected in Figure 2c and the depth of each point are projected into the SLAM map in Figure 3b. This is computed by: where P 3x , P 3y , and P 3z are the coordinates of the point in 3D with respect to the current pose of the camera in the world coordinate system. While P 2x and P 2y are the coordinates of the point in the

3D Reconstruction
Using stereo image pairs captured during flight, the 2D lines detected and denoted as the vertical edges of each macro-feature are projected into 3D space. The image pairs are used to compute the depth from disparity at each point in the image. This computation is highly parallelizable and is done on board the UAV. The sampled points are shown in Figure 3a along the 2D vertical side lines detected in Figure 2c and the depth of each point are projected into the SLAM map in Figure 3b. This is computed by: where P3x, P3y, and P3z are the coordinates of the point in 3D with respect to the current pose of the camera in the world coordinate system. While P2x and P2y are the coordinates of the point in the 2D image coordinate system with a computed depth of d. Camera coordinates and focal lengths are represented by cx, cy and fx, fy, respectively.  The sampled points are used to compute the line in 3D using a least-squared line fitting algorithm relying on orthogonal distance. Outliers are then removed by removing points that are at a threshold distance from the computed line. The line is once again recomputed from the inliers using the orthogonal regression used previously. This two-step computation of the line provides a better estimate and is resistant to outliers in practice. The endpoints of the line are computed by finding the projection of the two endpoints of the line segment in 3D onto the fitted line. The four points that make up these line segments now represent the macro-feature in 3D in the view coordinates. Each of these macrofeatures detected are assigned to the keyframe in which they are first detected during flight.
Keyframes are frames in a graph-based SLAM that are tracked across multiple frames. They The sampled points are used to compute the line in 3D using a least-squared line fitting algorithm relying on orthogonal distance. Outliers are then removed by removing points that are at a threshold distance from the computed line. The line is once again recomputed from the inliers using the orthogonal regression used previously. This two-step computation of the line provides a better estimate and is resistant to outliers in practice. The endpoints of the line are computed by finding the projection of the two endpoints of the line segment in 3D onto the fitted line. The four points that make up these line segments now represent the macro-feature in 3D in the view coordinates. Each of these macro-features detected are assigned to the keyframe in which they are first detected during flight.
Keyframes are frames in a graph-based SLAM that are tracked across multiple frames. They consist of the camera pose at an instant along with map points in view of the keyframe and contain the connection to other keyframes in the SLAM map that share closely correlated map points. Note that assigning the detected macro-features to the keyframes ensures spatial consistency in the event of a bundle adjustment step occurring before the localization step is complete.
A homogenous point in 3D is defined by: where P x , P y , and P z are the three scalar coordinates. A transformation, T ∈ SE(3) is represented by: where R ∈ SO(3) is a rotation matrix and t is a translation. At any point in time, the locations of the macro-feature points in the world coordinates can be computed by, where P w are the homogeneous coordinates of the point in the world coordinate system, P c are the homogeneous coordinates of the point in the view coordinate system, and T c w is the transformation between the camera and the world coordinate systems.
For the course registration phase that takes place after this step, each of the macro-features computed above are further reduced to their respective centroids in the world coordinate system. The description vector for these centroids are computed according to the following subsection.

Macro-Feature Registeration and Matching
In the proposed system, the UAV is already localized within the SLAM coordinate system. This means that the successful alignment of the CAD model with the SLAM map would result in localization of the UAV within the CAD model which is the main goal. This can be achieved by successfully matching the macro-features between both the CAD model and the SLAM map.
A SLAM algorithm that utilizes stereo images ensures that the SLAM map generated has an accurate estimation of scale. Since the two sets of data have the same scale, a rigid transformation is sufficient to align both sets of data. To compute this transformation, point-to-point correspondences is required between both datasets. Once the correspondences are found, a Random Sample Consensus (RANSAC) [21] based approach uses subsets of these correspondences to determine the best set of corresponding macro-features to utilize and the required transformation is computed.
To speed up the process of finding corresponding macro-features, a novel method is being devised that encodes each macro-feature into a binary macro-feature description vector that is orientation and position invariant. This is the main contribution of the paper and is explained in detail in the following subsections. It is to be noted that the use of a binary description vector ensures fast macro-feature matching due to the utilization of Hamming distance as the distance measurement unit. This is because the computation of a Hamming distances can be done in a single XOR operation between the two binary strings/vectors. The orientation and position invariance help in removing the dependence of a vector on the coordinate system.
To obtain matches between the macro-features in both datasets, the information within the vector needs to be decoupled from both coordinate systems and should mainly be dependent on the macro-features themselves. This is done by encoding the distances and angles between the closest 5 macro-features into the description vector making the data within the vector dependent on only the macro-features and the co-relations between them. Since the SLAM coordinate system has close to accurate scale, the relative distances and orientations between the closest macro-features stay the same in both coordinate systems. Figure 4 illustrates when the macro-features are reduced to their respective centroids, the relative distances and orientations between the macro-features stay the same.    Figure 4a shows portion of the CAD model with a macro-feature whose vector is being calculated while Figure 4b shows the SLAM map generated with the detected macro-features and corresponding labels in both images. The description vector is computed by first retrieving the 5 macro-features closest to a macro-feature for which the vector is being computed. This is done efficiently by utilizing the k-d Tree to store the centroids of the detected macro-features. Next, using the macro-feature closest to the macro-feature for which the computation is occurring as a base, the angles to each of the 4 other macro-features are computed using the formula, The vector is clarified in Figure 5, where a is the vector between the current macro-feature and the closest (base) macro-feature, b is the vector between the current macro-feature and the macro-feature whose angle is to be calculated, and α is the angle between the two vectors.
(b) Figure 4. Orientation invariance of the macro-feature description vector. The images show the relative distances and orientation of a few selected macro-features, with 'e' being the microfeature of which the description vector is being calculated. Image (a) shows the macro-features in the 3D CAD model while image (b) shows the same macro-features in the SLAM map. Figure 5. Angle calculation-In this figure, r is the macro-feature for which the description vector is being calculated, p is the closest macro-feature to r (taken as a base for the calculation), m is the macrofeature for which the angle relative to r is being calculated, and α is the angle between vectors a and b that connects r, p, and m.
The encoding of the relative orientations and distances to each of the macro-features is achieved through a lookup table. The lookup table ensures that the values of distances and angles that are close to local means are assigned the same value. This binning process enables the encoding of the distances and the angles into a compact form that can then be used to form the binary descriptive vector. This Figure 5. Angle calculation-In this figure, r is the macro-feature for which the description vector is being calculated, p is the closest macro-feature to r (taken as a base for the calculation), m is the macro-feature for which the angle relative to r is being calculated, and α is the angle between vectors a and b that connects r, p, and m.
The encoding of the relative orientations and distances to each of the macro-features is achieved through a lookup table. The lookup table ensures that the values of distances and angles that are close to local means are assigned the same value. This binning process enables the encoding of the distances and the angles into a compact form that can then be used to form the binary descriptive vector. This process is outlined in Algorithm 1. In addition to the distance and orientation information, the first bit of vector (feature_type) is used to denote the type of the macro-feature i.e., door (0) or window (1). This is further explained in the next subsection.
Sensors 2020, 20, 743 9 of 21 process is outlined in Algorithm 1. In addition to the distance and orientation information, the first bit of vector (feature_type) is used to denote the type of the macro-feature i.e., door (0) or window (1). This is further explained in the next subsection.  Table and Vector Structure The lookup table formed in Algorithm 1 is computed through a binning process. Two lookup tables are created, one for angles and the other for distances. These tables are used to group together values that are close to local means. The boundaries of these bins are computed using a Kernel Density Estimation (KDE) process [22].
First, all the possible values are computed during the pre-processing stage using the CAD model. These values are then sorted and grouped into clusters which is achieved by sorting all the values and then estimating the shape of a probability density function f (using a KDE) that provides values that are close to local means. The boundaries of these bins are computed using a Kernel Density Estimation (KDE) process [22].
First, all the possible values are computed during the pre-processing stage using the CAD model. These values are then sorted and grouped into clusters which is achieved by sorting all the values and then estimating the shape of a probability density function f (using a KDE) that provides a representation of the data. The Kernel Density Estimator,f h is given by: where K is the kernel (in this case a Gaussian kernel), K h is the scaled kernel, h is a smoothing parameter and x 1 , x 2 , . . . , x n are univariate, independent, and identically distributed samples drawn from the distribution function. The bins required for the lookup table are created by estimating and analyzing the shape of the function then dividing the distribution along its local minima as illustrated in Figure 6. The lookup table formed in Algorithm 1 is computed through a binning process. Two lookup tables are created, one for angles and the other for distances. These tables are used to group together values that are close to local means. The boundaries of these bins are computed using a Kernel Density Estimation (KDE) process [22].
First, all the possible values are computed during the pre-processing stage using the CAD model. These values are then sorted and grouped into clusters which is achieved by sorting all the values and then estimating the shape of a probability density function f (using a KDE) that provides a representation of the data. The Kernel Density Estimator, is given by: where K is the kernel (in this case a Gaussian kernel), Kh is the scaled kernel, h is a smoothing parameter and x1, x2, …. xn are univariate, independent, and identically distributed samples drawn from the distribution function. The bins required for the lookup table are created by estimating and analyzing the shape of the function then dividing the distribution along its local minima as illustrated in Figure 6.  By considering the underlying density of this function to be Gaussian andσ as the standard deviation of values, the value of h can be estimated using Silverman's rule of thumb as follows: Each of the bins are then assigned a number between 0 to the total number of bins, the maximum being 255 to fit an 8-bit width in the binary representation of the number. This way, each neighboring macro-feature gives rise to two 8-bit values, one for the distance and one for the angle. As mentioned earlier that 4 neighboring macro-features are considered to define a specific macro-feature, the total size of the descriptive vector is thus 64 bits. The first bit in the 64-bit vector is usually zero as the number of bits used to represent the number of bins in the angles is less than 8. This enables the use of the first bit to store the type of the macro-feature while using the remaining 7 bits to store the first angle. The final structure of the macro-feature description vector is shown in Figure 7. In this study, the macro-feature type bit takes a 0 or 1 value depending on the feature being a door or window. If the macro-feature types match, the vector score is evaluated using the rest of the bits, if not, the maximum value for the distance is returned i.e., 64.
corresponding values are different. This is computed very easily and efficiently using the XOR operator and then summing the total number of set bits in the result. Computers with newer hardware usually have support for counting the total number of set bits in CPU instructions. This results in a computational efficient matching mechanism. The complexity for finding matches between the observed macro-features and the CAD macro-features is thus, O(mn), with m being the number of observed macro-features and n being the number of CAD macro-features.

Initial Registration
During the initial registration phase, macro-features are extracted using the methods outlined in Section 2.2. For each extracted macro-feature, the centroid of the macro-feature is stored in the k-d Tree and the vector is calculated when the macro-feature has more than a threshold number of neighbors. These vectors and the vectors computed from the CAD model during the pre-processing stage are then matches using the methods outlined in Section 2.3.1.
The matching macro-features are then used in a RANSAC which works by selecting the minimum number of parameters needed to estimate a model hypothesis. RANSAC is a robust iterative method that is used to estimate the parameters of a mathematical model from noisy data that has outliers. The generated model is then used to compute the total number of inliers for the model hypothesis. This process is repeated for a total of N times to generate N hypotheses and the hypothesis with the greatest number of inliers is selected as the transformation. The number N is a function of the desired probability of success (p). If is the probability of selecting an inlier for each selected point, and m is the number of points required to estimate the model, then 1 − will be the probability of finding at least one outlier among the m points. Meanwhile, (1 − ) is thus the probability of never selecting m points that are all inliers. Thus, the total number of hypothesis that need to be generated for a desired probability of success is: Since this is a binary vector, matching of the vector is done by finding the Hamming distance between two vectors. The vectors are matched on the basis of the lowest Hamming distance. Hamming distance will compute the number of positions of two equal strings where the corresponding values are different. This is computed very easily and efficiently using the XOR operator and then summing the total number of set bits in the result. Computers with newer hardware usually have support for counting the total number of set bits in CPU instructions. This results in a computational efficient matching mechanism. The complexity for finding matches between the observed macro-features and the CAD macro-features is thus, O(mn), with m being the number of observed macro-features and n being the number of CAD macro-features.

Initial Registration
During the initial registration phase, macro-features are extracted using the methods outlined in Section 2.2. For each extracted macro-feature, the centroid of the macro-feature is stored in the k-d Tree and the vector is calculated when the macro-feature has more than a threshold number of neighbors. These vectors and the vectors computed from the CAD model during the pre-processing stage are then matches using the methods outlined in Section 2.3.1.
The matching macro-features are then used in a RANSAC which works by selecting the minimum number of parameters needed to estimate a model hypothesis. RANSAC is a robust iterative method that is used to estimate the parameters of a mathematical model from noisy data that has outliers. The generated model is then used to compute the total number of inliers for the model hypothesis. This process is repeated for a total of N times to generate N hypotheses and the hypothesis with the greatest number of inliers is selected as the transformation. The number N is a function of the desired probability of success (p). If ω is the probability of selecting an inlier for each selected point, and m is the number of points required to estimate the model, then 1 − ω m will be the probability of finding at least one outlier among the m points. Meanwhile, (1 − ω m ) N is thus the probability of never selecting m points that are all inliers. Thus, the total number of hypothesis that need to be generated for a desired probability of success is: Therefore, N different hypotheses are generated to estimate the model. In this study, the model is a rigid transformation between two 3D point sets. Since this requires at least four point-to-point correspondences, the value of m in this case is four, which are randomly selected from the matched macro-features. The rigid transformation is then computed using a linear least squares method based on Singular Value Decomposition (SVD).
Coplanarity or collinearity of the points give rise to degenerate conditions and must be avoided. Since collinear points are also coplanar it is sufficient to perform the check for coplanarity. The test for coplanarity can be done by computing the scalar triple product, which is given by: where x 1 , x 2 , x 3 , x 4 ∈ R 3 are the four distinct points and ε = 0.01 is the margin of error. Any set of points giving rise to a value less than ε are considered coplanar. If the selected points pass the coplanarity test, they are rejected, and a new set of points are selected for the hypothesis. The hypothesis that generates the greatest number of inliers is then selected as the transformation. The rigid transformation in SE(3) between the two point sets is computed using the following methodology. If A:= {a i |i = 1, 2, . . . , n, a i ∈ R 3 } and B:= {b i |i = 1, 2, . . . , n, b i ∈ R 3 } are two corresponding sets of points, A being the source and B being the destination set, the rigid transformation to be computed is found by minimizing the squared error of the transformed coordinates, which is given by: where R is the rotation in SO(3) and t in the translation while ω i > 0 are the weights assigned to the squared differences for each point pair. The weighted centroids on both sets are computed by: where a and b are the centroids of the two sets. Vectors are then computed using the point sets and the respective centroids by, where x i and y i are the corresponding vectors originating at the respective centroids. The 3 × 3 covariance matrix is then computed using: where X and Y are the vectors of dimension 3 and W = diag (ω 1 , ω 2 , . . . , ω n ), the weight matrix. The singular value decomposition of is computed where, U and V are orthogonal unitary matrices, and σ is a diagonal matrix with non-negative real numbers along the diagonal which are the singular values of S. The rotation matrix R is then given by The translation can then be calculated as After computing the transformation matrix using the least-squares based RANSAC method outlined above, the transformation is refined by using all the points from all the features. This means that all the four points that make up the corners of the macro-features (doors/windows) in the 3D space are used to compute the least-squares rigid transformation. This gives rise to a better estimation of the transformation.

Hardware Implementation
The hardware used to implement and test the proposed system comprised of Parrot's SLAM dunk module mounted on top of a Parrot Bebop 2 drone and a ground control station. The SLAM dunk utilizes NVIDIA's Jetson TK1 board along with integrated stereo camera, an Inertial Measurement Unit (IMU), a front facing ultrasonic sensor, a barometer, and a magnetometer as shown in Figure 8.

Hardware Implementation
The hardware used to implement and test the proposed system comprised of Parrot's SLAM dunk module mounted on top of a Parrot Bebop 2 drone and a ground control station. The SLAM dunk utilizes NVIDIA's Jetson TK1 board along with integrated stereo camera, an Inertial Measurement Unit (IMU), a front facing ultrasonic sensor, a barometer, and a magnetometer as shown in Figure 8. The SLAM dunk module controls the drone by connecting to an access point hosted by the Bebop 2 via USB. On the other hand, the connection between the SLAM dunk and the ground control station is established via Wi-Fi using the Bebop's access point as an interface. This means that all the communication between the SLAM dunk and the ground control station is to be routed via the Parrot's server. This is done to make use of the powerful Wi-Fi hardware present on-board the Bebop 2. The complete network configuration of the setup is illustrated in Figure 9. The UAV is set to have two main modes of operation: exploration mode and localized mode. In the exploration mode, the UAV's operation is based completely on the SLAM map that is being generated during flight while the localizing sequence continuously attempts to localize the UAV with the CAD model. A flow chart of this mode of operation including the pre-processing steps discussed The SLAM dunk module controls the drone by connecting to an access point hosted by the Bebop 2 via USB. On the other hand, the connection between the SLAM dunk and the ground control station is established via Wi-Fi using the Bebop's access point as an interface. This means that all the communication between the SLAM dunk and the ground control station is to be routed via the Parrot's server. This is done to make use of the powerful Wi-Fi hardware present on-board the Bebop 2. The complete network configuration of the setup is illustrated in Figure 9.

Hardware Implementation
The hardware used to implement and test the proposed system comprised of Parrot's SLAM dunk module mounted on top of a Parrot Bebop 2 drone and a ground control station. The SLAM dunk utilizes NVIDIA's Jetson TK1 board along with integrated stereo camera, an Inertial Measurement Unit (IMU), a front facing ultrasonic sensor, a barometer, and a magnetometer as shown in Figure 8. The SLAM dunk module controls the drone by connecting to an access point hosted by the Bebop 2 via USB. On the other hand, the connection between the SLAM dunk and the ground control station is established via Wi-Fi using the Bebop's access point as an interface. This means that all the communication between the SLAM dunk and the ground control station is to be routed via the Parrot's server. This is done to make use of the powerful Wi-Fi hardware present on-board the Bebop 2. The complete network configuration of the setup is illustrated in Figure 9. The UAV is set to have two main modes of operation: exploration mode and localized mode. In the exploration mode, the UAV's operation is based completely on the SLAM map that is being generated during flight while the localizing sequence continuously attempts to localize the UAV with the CAD model. A flow chart of this mode of operation including the pre-processing steps discussed The UAV is set to have two main modes of operation: exploration mode and localized mode. In the exploration mode, the UAV's operation is based completely on the SLAM map that is being generated during flight while the localizing sequence continuously attempts to localize the UAV with the CAD model. A flow chart of this mode of operation including the pre-processing steps discussed in Section 2.1 can be seen in Figure 10. Once localized, the UAV switches to localized mode where it is made to follow a set of calculated waypoints (based on its localized position) to a goal location.
integrate into the process pipeline.
Since the BIM format already contains all the macro-features of the environment Figure 1, the next step was to extract the position of the centroids of the macro-features and store them in a k-d Tree which facilitated the creation of the macro-feature description vector for each of the macrofeatures as discussed explained previously using Algorithm 1. These vectors are later used to find the correspondences between the SLAM and the CAD maps.

Preprocessing
As explained in Section 2.1 for the macro-feature vector matching to occur the BIM of the building needs to be pre-processed. This is done experimentally by loading the BIM into Assimp to be able to access the information within the model. Since Assimp is written in C++ it is easy to integrate into the process pipeline.
Since the BIM format already contains all the macro-features of the environment Figure 1, the next step was to extract the position of the centroids of the macro-features and store them in a k-d Tree which facilitated the creation of the macro-feature description vector for each of the macro-features as discussed explained previously using Algorithm 1. These vectors are later used to find the correspondences between the SLAM and the CAD maps.

Exploration Mode
In exploration mode, the left and right images from the stereo cameras were used to compute depth from disparity using GPU optimized Semi Global Block Matching (SGBM) [23] on-board the SLAM dunk. The left image and the computed depth images were then sent via Wi-Fi to the ground control station at a rate of 30 fps in addition to IMU updates that were sent over at ∼150 Hz. It is to be noted that all communication between the UAV and the ground control including image transfer are handled by Robot Operating System (ROS) [24] communication packets.
On the ground control unit, the system was divided into two main components: the SLAM module and the localization module. The SLAM module estimates the current UAV pose in a map generated using the images and IMU data as illustrated in Figure 11a. ORB-SLAM2 [25,26] was used as the SLAM algorithm and was used in the RGBD mode, which utilized the left and computed depth images from the UAV to construct a map consisting of a sparse set of 3D map points while also estimating the six-degree-of-freedom (6DOF) current UAV pose in it.
depth from disparity using GPU optimized Semi Global Block Matching (SGBM) [23] on-board the SLAM dunk. The left image and the computed depth images were then sent via Wi-Fi to the ground control station at a rate of 30 fps in addition to IMU updates that were sent over at ∼150 Hz. It is to be noted that all communication between the UAV and the ground control including image transfer are handled by Robot Operating System (ROS) [24] communication packets.
On the ground control unit, the system was divided into two main components: the SLAM module and the localization module. The SLAM module estimates the current UAV pose in a map generated using the images and IMU data as illustrated in Figure 11a. ORB-SLAM2 [25,26] was used as the SLAM algorithm and was used in the RGBD mode, which utilized the left and computed depth images from the UAV to construct a map consisting of a sparse set of 3D map points while also estimating the six-degree-of-freedom (6DOF) current UAV pose in it.
The localizer extracted the macro-features discussed earlier (Figure 11b) from the images and tried to localize the UAV in the CAD model by first forming correspondences between the environment's detected macro-features and those extracted from the CAD model. These are then used to estimate a rigid transformation in SE(3) between the two coordinate frames (SLAM and CAD model) using a robust transformation estimation algorithm (Section 2.3.2). The resulting localized UAV is illustrated in Figure 11c.  The localizer extracted the macro-features discussed earlier (Figure 11b) from the images and tried to localize the UAV in the CAD model by first forming correspondences between the environment's detected macro-features and those extracted from the CAD model. These are then used to estimate a rigid transformation in SE(3) between the two coordinate frames (SLAM and CAD model) using a robust transformation estimation algorithm (Section 2.3.2). The resulting localized UAV is illustrated in Figure 11c.

Localized Mode
Once the UAV was localized using the localizer module, the UAV began the process of traveling to the destination. The height of the UAV from the ground level was retrieved using an ultrasound sensor located beneath the Bebop 2 drone. A cross-section of the current floor, roughly at the UAV's flight height was generated from an octomap [27,28] representation of the building. The octomap provided a probabilistic 3D voxel representation of the environment/building that was updated in real-time with objects that did not appear in the original CAD model. The goal position and the current location of the UAV were then projected onto the 2D occupancy grid that is generated from the cross section generated earlier. An A* search algorithm [29] shown in Figure 12, to plan a path from the current location to the goal position was then run on this grid map. This generated a path of shortest route between the current location and the goal location. Dividing the floor into a 2D grid reduces the search space and consequently provided an efficient method to compute the shortest distance. Using the path generated, waypoints were sent to the UAV representing the centroids of the cells in the 2D grid which were connected by the computed path. Since the UAV did not undergo any fast motions, it was sufficient to supply the waypoints as a set of location coordinates with respect to the CAD coordinate system.

Localized Mode
Once the UAV was localized using the localizer module, the UAV began the process of traveling to the destination. The height of the UAV from the ground level was retrieved using an ultrasound sensor located beneath the Bebop 2 drone. A cross-section of the current floor, roughly at the UAV's flight height was generated from an octomap [27,28] representation of the building. The octomap provided a probabilistic 3D voxel representation of the environment/building that was updated in real-time with objects that did not appear in the original CAD model. The goal position and the current location of the UAV were then projected onto the 2D occupancy grid that is generated from the cross section generated earlier. An A* search algorithm [29] shown in Figure 12, to plan a path from the current location to the goal position was then run on this grid map. This generated a path of shortest route between the current location and the goal location. Dividing the floor into a 2D grid reduces the search space and consequently provided an efficient method to compute the shortest distance. Using the path generated, waypoints were sent to the UAV representing the centroids of the cells in the 2D grid which were connected by the computed path. Since the UAV did not undergo any fast motions, it was sufficient to supply the waypoints as a set of location coordinates with respect to the CAD coordinate system. Figure 12. A* algorithm for path finding. A cross section of the floor is used to generate the 2D occupancy grid. The dark green cell is the start location and the red cell is the goal location. The gray cells are occupied, and the white cells are free. A path is computed by applying the A* algorithm and waypoints are generated using the centroids of the grid cells.

Experimental Results and Analysis
The following sections evaluates the performance and accuracy of the different algorithms used. Section 4.1 evaluates the performance of the feature descriptor extraction and matching. Sections 4.2 and 4.3 evaluate the performance of the entire system operation through localization as well as relocalization within the CAD model.

Object Vector Results and Accuracy
A basic method of selecting features based on distances to neighboring features was initially explored. In this method, lookup table of distances from each feature to every other feature was first computed from the CAD model in the preprocessing phase. Two features were then selected at random from the observed dataset and matched with distances computed from the CAD model. On finding a successful match, a new feature was then selected such that the distances to the two selected features was consistent with the distances in the CAD model. This process was repeated until a total of four correspondences were found. This method was found to be resource intensive and slow. Figure 12. A* algorithm for path finding. A cross section of the floor is used to generate the 2D occupancy grid. The dark green cell is the start location and the red cell is the goal location. The gray cells are occupied, and the white cells are free. A path is computed by applying the A* algorithm and waypoints are generated using the centroids of the grid cells.

Experimental Results and Analysis
The following sections evaluates the performance and accuracy of the different algorithms used. Section 4.1 evaluates the performance of the feature descriptor extraction and matching. Sections 4.2 and 4.3 evaluate the performance of the entire system operation through localization as well as re-localization within the CAD model.

Object Vector Results and Accuracy
A basic method of selecting features based on distances to neighboring features was initially explored. In this method, lookup table of distances from each feature to every other feature was first computed from the CAD model in the preprocessing phase. Two features were then selected at random from the observed dataset and matched with distances computed from the CAD model. On finding a successful match, a new feature was then selected such that the distances to the two selected features was consistent with the distances in the CAD model. This process was repeated until a total of four correspondences were found. This method was found to be resource intensive and slow.
The histogram of the distances to the closest five features from each feature whose descriptor is to be calculated is shown in Figure 13a. The probability density function of the Kernel Density Estimator is shown in Figure 13b. The shape of the KDE closely resembles the histogram and thus represents the underlying data. Dividing the probability density function along the local minima yields 24 different bins for the lookup table.
Sensors 2020, 20, 743 16 of 21 The histogram of the distances to the closest five features from each feature whose descriptor is to be calculated is shown in Figure 13a. The probability density function of the Kernel Density Estimator is shown in Figure 13b. The shape of the KDE closely resembles the histogram and thus represents the underlying data. Dividing the probability density function along the local minima yields 24 different bins for the lookup table.
The accuracy of the matches formed based on the descriptor created from this lookup table has been evaluated across 5 separate runs of the algorithm in different settings and has been shown in Table 1. Each of the matches are formed by finding the feature descriptor with the least Hamming distance from the features in the CAD model.  73.6% It is seen that the matching accuracy increases with the number of features detected. This is expected since increasing the number of features results in a more complete picture of neighboring features which are used to build the feature vector. Lower number of features detected leads to greater chance of ambiguity in the feature vectors due to multiple features having similar distances. Run 3 is an example of such a scenario.
The extraction of the 5 nearest neighbors from the k-d Tree has been found to take an average of approximately 0.052ms on the test system consisting of a 6th generation core-i7 processor. The matching of descriptors using the brute-force matching technique has a complexity of O(mn) and takes approximately 0.3ms for 25 descriptors in the observed dataset and 106 descriptors in the CAD dataset.

Localization within the CAD Model
The descriptors computed from both the CAD model and the observed data are used to match the two sets of features and calculate the SE(3) rigid transformation between the SLAM map and the CAD model. Figure 14 shows 5 runs of different lengths within the test setup. AprilTags [30,31] were placed at the goal positions for each run. AprilTags are fiducial markers that are not prone to the ambiguities inherent in other markers like Checkerboards or Circlegrids. To measure the accuracy of Figure 13. The histogram and Kernel Density Estimator (KDE) probability density function for distance. The horizontal axis represents the Hamming distance starting from 0. In (a), the vertical axis represents the number of features at a particular distance. In (b), the vertical axis represents the probability density of a particular distance.
The accuracy of the matches formed based on the descriptor created from this lookup table has been evaluated across 5 separate runs of the algorithm in different settings and has been shown in Table 1. Each of the matches are formed by finding the feature descriptor with the least Hamming distance from the features in the CAD model. It is seen that the matching accuracy increases with the number of features detected. This is expected since increasing the number of features results in a more complete picture of neighboring features which are used to build the feature vector. Lower number of features detected leads to greater chance of ambiguity in the feature vectors due to multiple features having similar distances. Run 3 is an example of such a scenario.
The extraction of the 5 nearest neighbors from the k-d Tree has been found to take an average of approximately 0.052ms on the test system consisting of a 6th generation core-i7 processor. The matching of descriptors using the brute-force matching technique has a complexity of O(mn) and takes approximately 0.3ms for 25 descriptors in the observed dataset and 106 descriptors in the CAD dataset.

Localization within the CAD Model
The descriptors computed from both the CAD model and the observed data are used to match the two sets of features and calculate the SE(3) rigid transformation between the SLAM map and the CAD model. Figure 14 shows 5 runs of different lengths within the test setup. AprilTags [30,31] were placed at the goal positions for each run. AprilTags are fiducial markers that are not prone to the ambiguities inherent in other markers like Checkerboards or Circlegrids. To measure the accuracy of the localization, the distance between the final location of the UAV and the Apriltag located at the goal position was measured. The localization time was calculated based on how long it took the UAV to successfully localize itself in the CAD model from the time the system was started. Table 2 shows the localization time and the error between the system goal position and the actual goal position for each of the runs depicted in Figure 14. The localization time was calculated based on how long it took the UAV to successfully localize itself in the CAD model from the time the system was started. Table 2 shows the localization time and the error between the system goal position and the actual goal position for each of the runs depicted in Figure 14. The UAV was able to successfully localize itself within the CAD model approximately 15 seconds of starting on average. Run 4 had been performed with multiple goal locations to generate the complete map of the floor. This map is then used in a run to test the re-localization mode of the UAV.

Re-Localization within the CAD Model
To test the re-localization capability of the system, a SLAM map was generated from a previous run and its transformation with respect to the CAD model was stored. These were then used when running the UAV a second time. The system was able to quickly localize the UAV within a second of the system initialization. This is possible because of ORB-SLAM2 uses a fast and efficient re-localization module. The initial SLAM map and the view from the perspective of the camera is shown in Figure 15. The real-time experimental results can be viewed in Video S1.  The UAV was able to successfully localize itself within the CAD model approximately 15 seconds of starting on average. Run 4 had been performed with multiple goal locations to generate the complete map of the floor. This map is then used in a run to test the re-localization mode of the UAV.

Re-Localization within the CAD Model
To test the re-localization capability of the system, a SLAM map was generated from a previous run and its transformation with respect to the CAD model was stored. These were then used when running the UAV a second time. The system was able to quickly localize the UAV within a second of the system initialization. This is possible because of ORB-SLAM2 uses a fast and efficient relocalization module. The initial SLAM map and the view from the perspective of the camera is shown in Figure 15. The real-time experimental results can be viewed in Video S1.

Conclusions
The proposed system utilized a pre-existing scale accurate SLAM system using stereo images to estimate depth, and a pre-existing real-time Convolutional Neural Network for object detection coupled with a novel and computationally efficient method of registration of the SLAM map with the CAD model. This paper proposed a novel pipeline combining a real-time CNN based object detection network, a SLAM system and a novel registration mechanism combined with image processing techniques to localize a UAV in a CAD model. Furthermore, the paper proposed a computationally efficient, orientation invariant feature descriptor to match features in the CAD model and the observed data, based solely on the spatial correlation between the features.
A major contribution of this work has been the fast and efficient computation of orientation invariant feature descriptors that were used to form correspondences between the observed features and the features in the CAD model. The use of a k-d Tree enabled the quick extraction neighboring features. The use of KDE to group clusters of discrete values together to reduce the time needed to match features had played a crucial part in improving the efficiency and performance of the algorithm.
The experimental validation shows that leveraging readily available 3D CAD models of buildings provided valuable information regarding the environment of the UAV. It provided the system with all the information needed to deduce the current location of the UAV within the CAD model, and enabled the UAV to find its pose without the use of unreliable GPS in indoor environments or other costly systems that emulate the GPS mechanism indoors. The use of a SLAM system registered to a CAD model also enabled the system to leverage the re-localization feature of the SLAM, thus enabling fast localization in previously visited locations.

Future Work
The system suffers from a few limitations that could be improved. The accuracy of the feature descriptor could be improved by applying a threshold on the computed Hamming distance. This could be calculated by saving all the computed descriptors and then studying the result of using different thresholds varying from 0 to 64 for the descriptor matches. All these matches could then be used to generate a Receiver Operating Characteristics (ROC) curve, which could be used to deduce the best possible value for the threshold distance.
The system performance could also be significantly improved by including other macro-features like drinking fountains, exit signs, posts, and other macro-features. Including more macro-features would enable the UAV to localize quickly due to the abundance of macro-features as well as the variations in the macro-feature types which would give rise to less ambiguity in the macro-feature descriptors. This would also alleviate problems arising from symmetries. Furthermore, since the system was developed for a single floor localization problem, added macro-features unique to each floor can further help the localization problem. Similar floors can be a challenging problem with the presented algorithm which is also in consideration in our future work in progress.