Spherical-Model-Based SLAM on Full-View Images for Indoor Environments

Abstract: SLAM (Simultaneous Localization and Mapping) relies on observations of the surroundings, and a full-view image provides more of them than a limited-view image. In this paper, we present a spherical-model-based SLAM on full-view images for indoor environments. Unlike traditional limited-view images, the full-view image follows its own specific, nonlinear imaging principle and is accompanied by distortions; thus, specific techniques are needed to process it. In the proposed method, we first use a spherical model to express the full-view image. The algorithms are then implemented on this spherical model, including feature point extraction, feature point matching, 2D-3D connection, and projection and back-projection of scene points. Thanks to the full field of view, the experiments show that the proposed method effectively handles sparse-feature or partially non-feature environments and achieves high accuracy in localization and mapping. An experiment is also conducted to show how the accuracy is affected by the field of view.


Introduction
For a mobile robot, finding its location and building environment maps are basic and important tasks. An answer to this need is the development of simultaneous localization and mapping (SLAM) methods. For vision-based SLAM systems, localization and mapping are achieved by observing the features of the environment via a camera. Therefore, the performance of a vision-based SLAM method depends not only on the algorithm, but also on the feature distribution of the environment observed by the camera. Figure 1 shows a sketch of an indoor environment. In such a room, there may be few features within the field of view (FOV); when observing such scenes with a limited-FOV camera, SLAM often fails to work. These problems can be avoided by using a full-view camera. Another reason for using a full-view camera is that it improves the accuracy of SLAM methods. Some different motions may produce similar changes in the image of a limited-FOV camera, whereas these motions can be discriminated in a full-view image. For example, as shown in Figure 2, the translation of a limited-view camera along the horizontal axis and its rotation in the horizontal direction may result in the same movement for some features in the image (e.g., Target 1 in Figure 2). However, for Targets 2-4, the two camera motions cause different movements. This means that, in some cases, different motions may be difficult to decouple from the observation of limited-FOV images, while it is possible to distinguish them from the observation of full-view images.
Figure 2. The translation of a limited-view camera along the horizontal axis and the horizontal rotation may result in the same movement for some features in images (Target 1); however, the same camera motions cause different movements for features in other places (Targets 2-4), which only a full-view camera can capture.
Based on the above observations, a vision-based SLAM method of using full-view images can effectively manage sparse-feature or partially non-feature environments, and also achieve higher accuracy in localization and mapping than conventional limited field-of-view methods. Until now, few SLAM methods for using omnidirectional images have been proposed, and although a wider view results in a better performance in localization and mapping, there are no experiments assessing how accuracy is affected by the view field.
In this paper, we realize simultaneous localization and mapping (SLAM) on full-view images. The principle is similar to that of a typical conventional SLAM method, PTAM (parallel tracking and mapping) [1]. In the proposed method, a full-view image is captured by a Ricoh Theta [2]. Next, feature points are extracted from the full-view image. Then, spherical projection is used to compute the projection and back-projection of scene points. Finally, feature matching is performed using a spherical epipolar constraint. The characteristic of this paper is that a spherical model is used throughout the processing, from feature extraction to the localization and mapping computation.
The rest of this paper is organized as follows. In the next section, we introduce the related research. In Section 3, we introduce a camera model. Section 4 describes our system. Comprehensive validations are shown in Section 5. Finally, we summarize our work in Section 6.

Related Works
In this section, we first introduce the related research in two categories. The first addresses omnidirectional image sensors, while the other addresses the SLAM methods for using omnidirectional image sensors. Then, we explain the characteristics of the proposed method.

Omnidirectional Image Sensor
A wider FOV means more visual information. One method for covering a wide FOV uses a cluster of cameras. However, this requires multiple capture devices and a complicated calibration of the multiple cameras. Conversely, it is desirable to observe a wide scene using a single camera during a real-time task. For this purpose, other than fisheye cameras [3], omnidirectional image sensors have been developed; a popular approach combines a camera with a mirror in a catadioptric omnidirectional camera. Omnidirectional images are acquired using spherical [4], conical [5], convex hyperbolic [6], or convex parabolic [7] mirrors. To acquire a full-view image, a camera with a pair of fisheye lenses was also developed [8].

Omnidirectional-Vision SLAM
Recently, many vision-based SLAM algorithms have been proposed [9]. According to the FOV of the cameras, these approaches can be grouped into two categories: one is based on perspective images with conventional limited FOV, and the other is an approach that relies on omnidirectional images with hemispherical FOV. Here, we focus the discussion of the related work on those omnidirectional SLAM approaches.
One of the most popular techniques for SLAM with omnidirectional cameras is the extended Kalman filter (EKF). For example, Rituerto et al. [10] integrated the spherical camera model into EKF-SLAM by linearizing the direct and inverse projections. Building on [10], Gutierrez et al. [11] introduced a new computation of the descriptor patch for catadioptric omnidirectional cameras that aimed to achieve rotation and scale invariance. A new view initialization mechanism was then presented for the map building process within EKF-based visual SLAM [12]. Gamallo et al. [13] proposed a SLAM algorithm (OV-FastSLAM) for omnidirectional cameras operating with severe occlusions. Chapoulie et al. [14] presented an approach, applied to SLAM, that addressed qualitative loop closure detection using a spherical view. Caruso et al. [15] proposed an extension of LSD-SLAM to a generic omnidirectional camera model; the resulting method was capable of handling central projection systems such as fisheye and catadioptric cameras.
Although the above omnidirectional-vision SLAM methods used omnidirectional images, their images still could not cover a view as wide as that of full-view images. Additionally, since processing such as feature extraction and feature matching was not conducted on the spherical model but instead used perspective projection, these approaches are not considered spherical-model-based approaches, according to a recent survey [16].
Visual odometry and structure from motion are also designed for estimating camera motion and 3D structure in an unknown environment, and some camera pose estimation methods for full-view systems have been proposed [17]. An extrinsic camera parameter recovery method for a moving omnidirectional multi-camera system was proposed, based on shape-from-motion and PnP techniques [18]. Pagani et al. investigated the use of full spherical panoramic images for structure-from-motion algorithms; several error models for solving the pose problem and the relative pose problem were introduced [19]. Scaramuzza et al. obtained a 3D metric reconstruction of a scene from two highly distorted omnidirectional images using image correspondences only [20]. Our spherical model is similar to theirs; their methods also demonstrate that a wide view based on the spherical model has advantages over a limited view.
The aim of this research is to develop a SLAM method for a mobile robot in indoor environments. The proposed method is based on a spherical model. The problem of tracking failure due to limited view can be completely avoided. Furthermore, as described in Section 1, full-view images also result in better performance for SLAM. The processing of the proposed full-view SLAM method is based on a spherical camera model. The effectiveness of the proposed method is shown using real-world experiments in indoor environments. Additionally, an experiment is also conducted to prove that the accuracy is affected by the view field.

Camera Model and Full-View Image
Although the perspective projection model serves as the dominant imaging model in computer vision, several factors make the perspective model far too restrictive. A variety of non-perspective imaging systems have been developed, such as catadioptric sensors that use a combination of lenses and mirrors [21], wide-angle lens systems [3], clusters of cameras [22], and compound cameras [23]. No matter what form of imaging system is used, Grossberg et al. [24] presented a general imaging model to represent an arbitrary imaging system: an imaging system may be modeled as a set of raxels on a sphere surrounding it. Strictly speaking, there is no commercial single-viewpoint full-view camera. In this paper, we approximate the RICOH THETA (Ricoh, 2016) as a single-viewpoint full-view camera, because the distance between the two viewpoints of its fisheye lenses is very small relative to the observed scenes [25]. This approximation is reasonable for a multiple-camera motion estimation approach, as mentioned in [26]. The RICOH THETA camera is only used as a capture device for a single-viewpoint full-view image. The captured full-view image is then transformed into a discrete spherical image, and the proposed method operates on this discrete spherical image (Figure 3a). The discrete spherical model follows [27].

Next, we describe the spherical projection model. Suppose there is a sphere with radius f and a point p in space, as shown in Figure 3b. The intersection of the sphere surface with the line joining point p and the center of the sphere is the projection of the point onto the sphere. Suppose the ray direction of the projection is determined by a polar angle, θ, and an azimuth angle, ϕ, and the coordinates of point p in space are M_c = (x, y, z)^T. The projection of point p in the spherical model can be represented as

m = (f sin θ cos ϕ, f sin θ sin ϕ, f cos θ)^T. (4)

Then, we have the following relation between the two:

m = (f / ||M_c||) M_c.

This means that m is equal to M_c apart from a scale factor. Letting f = 1,

m = M_c / ||M_c|| = (sin θ cos ϕ, sin θ sin ϕ, cos θ)^T, (5)

we now have the normalized spherical coordinate. This corresponds to the projection onto a unit sphere.
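As a concrete illustration, the projection and back-projection of Equations (4) and (5) can be sketched in a few lines of Python. This is a minimal sketch; the function names are ours, not from the paper.

```python
import numpy as np

def project_to_sphere(p):
    """Project a 3D point p onto the unit sphere (Eq. (5), f = 1)."""
    p = np.asarray(p, dtype=float)
    return p / np.linalg.norm(p)

def sphere_to_angles(m):
    """Recover polar angle theta and azimuth phi from a unit vector m."""
    theta = np.arccos(np.clip(m[2], -1.0, 1.0))  # angle from the +z axis
    phi = np.arctan2(m[1], m[0])                 # azimuth in the x-y plane
    return theta, phi

def angles_to_sphere(theta, phi):
    """Eq. (4) with f = 1: unit vector from (theta, phi)."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])
```

The round trip sphere → angles → sphere is exact, which is what makes the angle pair (θ, ϕ) a convenient pixel-independent feature coordinate.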
Compared with the perspective projection of a planar image based on a pinhole camera model, a full-view image is obtained by projecting all visible points onto the sphere. Thus, a full-view image is the surface of a sphere, with the focal point at the center of the sphere. We use the equidistant cylindrical projection to represent a full-view image, as shown in Figure 3c. The corners of squares become more distorted as they approach the poles. Thus, planar image algorithms cannot be applied directly.
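The equidistant cylindrical (equirectangular) mapping between sphere angles and pixel coordinates can be sketched as follows, assuming the 1280 × 640 image size used later in the paper; the layout conventions (ϕ along columns, θ along rows) are our assumption, not stated in the text.

```python
import numpy as np

W, H = 1280, 640  # equirectangular image size used in the experiments

def angles_to_pixel(theta, phi):
    """Map sphere angles to equirectangular pixel coordinates.
    theta in [0, pi] maps to rows, phi in [0, 2*pi) maps to columns."""
    u = (phi % (2 * np.pi)) / (2 * np.pi) * W
    v = theta / np.pi * H
    return u, v

def pixel_to_angles(u, v):
    """Inverse mapping: pixel coordinates back to sphere angles."""
    phi = u / W * 2 * np.pi
    theta = v / H * np.pi
    return theta, phi
```

Because equal pixel steps correspond to equal angle steps, pixel neighborhoods near the poles cover far less solid angle than at the equator, which is exactly the distortion that rules out planar feature algorithms.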
Strictly speaking, the two lenses should be calibrated with a dedicated calibration process rather than approximated by a single spherical model. However, the distance between the two viewpoints of the RICOH THETA is smaller than 1 cm, and the experiments on the absolute errors of the estimated rotation and translation in [26] support the approximation: compared to a purely spherical camera, which does not suffer from approximation-induced errors, the spherical approximation had slightly lower performance, but was comparable when the feature tracking error was large. They also showed that the approximation can even improve the accuracy and stability of the estimated motion over the exact algorithm.

Method Overview
Our method continuously builds and maintains a map, which describes the system's representation of the user's environment. The map consists of a collection of M point features located in a world coordinate frame W. Each point feature represents a local, spherical-textured patch in the world. The jth point p j on the map has a spherical descriptor d j , and its 3D world coordinate p jW = (x jW , y jW , z jW ) T in coordinate frame W.
The map also maintains N keyframes, which are taken in various frame coordinates according to the keyframe insertion strategy. Each keyframe K contains a keyframe index i, and the rotation R_iK and translation T_iK between this frame coordinate and the world coordinate. Each keyframe also stores the frame image I_iK, the index set V_iK of the map points that appear in the keyframe, and those map points' 2D positions in the keyframe image.

The flow of our full-view SLAM system is shown in Figure 4. Like other SLAM systems, our system has two main parts: mapping and tracking. The tracking system receives full-view images and maintains an estimate of the camera pose. At every frame, the system performs two-stage coarse-to-fine tracking. After tracking, the system determines whether a keyframe should be added. Once a keyframe is added, a spherical epipolar search is conducted, followed by a spherical match to insert the new map points. The flow is similar to that of many existing SLAM approaches; the difference is that these steps are handled based on the spherical model. Each step is now described in detail.
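The map representation described above might be organized as follows; this is an illustrative sketch with our own naming, not the authors' implementation.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class MapPoint:
    descriptor: np.ndarray   # spherical (SPHORB) descriptor d_j
    p_world: np.ndarray      # 3D position (x_jW, y_jW, z_jW) in world frame W

@dataclass
class Keyframe:
    index: int               # keyframe index i
    R: np.ndarray            # rotation R_iK between this frame and the world
    T: np.ndarray            # translation T_iK
    image: np.ndarray        # full-view frame image I_iK
    visible_points: list = field(default_factory=list)  # indices V_iK of map points seen

@dataclass
class Map:
    points: list = field(default_factory=list)     # the M map points
    keyframes: list = field(default_factory=list)  # the N keyframes
```

Keeping per-keyframe visibility lists (V_iK) is what later lets the system run epipolar searches and bundle adjustment over only the relevant point-keyframe pairs.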

Image Acquisition
Images are captured from a Ricoh Theta S and are scaled to 1280 × 640. Before tracking, we run SPHORB, a new fast and robust binary feature detector and descriptor for spherical panoramic images [27]. In contrast to the state-of-the-art spherical features, this approach stems from the geodesic grid, a nearly equal-area hexagonal grid parametrization of the sphere used in climate modeling. It enables us to directly build fine-grained pyramids and construct robust features on the hexagonal spherical grid, avoiding the costly computation of spherical harmonics and their associated bandwidth limitation. The SPHORB feature also achieves scale and rotation invariance. Therefore, each feature point has a specific spherical descriptor.

Map Initialization
Our system is based on the map. When the system starts, we take two frames, which have a slight offset position between them, to perform the spherical match. Then, we represent feature points using the unit sphere polar coordinate (Equation (5)). Next, we employ the RANSAC five-point stereo algorithm to estimate an essential matrix. Finally, we triangulate the matched feature pair to compute the map point. The resulting map is refined through bundle adjustment [28].
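The triangulation step of the initialization can be illustrated with a midpoint scheme over two unit bearing vectors. This is a generic sketch under an assumed pose convention (x2 = R·x1 + T), not the paper's exact algorithm, and it presumes the relative pose has already been estimated by the five-point RANSAC step.

```python
import numpy as np

def triangulate_bearings(m1, m2, R, T):
    """Midpoint triangulation of two unit bearing vectors.
    Camera 1 sits at the origin; (R, T) maps camera-1 coordinates into
    camera-2 coordinates (a convention assumed for this sketch)."""
    # Camera-2 center and the second ray direction, in camera-1 coordinates.
    C2 = -R.T @ T
    d2 = R.T @ m2
    # Solve lam1 * m1 - lam2 * d2 = C2 in the least-squares sense.
    A = np.stack([m1, -d2], axis=1)
    lam, *_ = np.linalg.lstsq(A, C2, rcond=None)
    X1 = lam[0] * m1           # closest point on ray 1
    X2 = C2 + lam[1] * d2      # closest point on ray 2
    return 0.5 * (X1 + X2)     # midpoint of the two rays
```

A small usage check: with an identity rotation, a 1-unit baseline, and a point straight ahead of camera 1, the midpoint recovers the point exactly, since both rays intersect.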

Map Point Projection and Pose Update
To project a map point into the image, it is first transformed from the world coordinate frame to the current frame coordinate frame C:

p_jC = R_CW p_jW + T_CW, (6)

where p_jC is the jth map point in the current frame coordinates, p_jW is the jth map point in the world coordinates, and R_CW and T_CW represent the current camera pose relative to the world coordinate frame. Then, we project the map point p_jC onto the unit sphere as (m_jx, m_jy, m_jz), as shown in Equation (5). Next, we detect all the feature points in the current frame with SPHORB and perform a fixed-range search around the unit coordinate (m_jx, m_jy, m_jz). Since the map point p_jC has a spherical descriptor, we can compare it with the candidate feature points to find the best match point m'_j = (m'_jx, m'_jy, m'_jz).
Given a pair of matched feature points, m_j = (m_jx, m_jy, m_jz) and m'_j = (m'_jx, m'_jy, m'_jz), the angle between them can be computed as

e_j = arccos( (m_jx m'_jx + m_jy m'_jy + m_jz m'_jz) / ( √(m_jx² + m_jy² + m_jz²) · √(m'_jx² + m'_jy² + m'_jz²) ) ). (7)

Since the spherical points are unit vectors,

e_j = arccos( m_jx m'_jx + m_jy m'_jy + m_jz m'_jz ). (8)

Substituting Equations (5) and (6) into Equation (8), we obtain a function of the rotation R_CW and translation T_CW.
Since e_j = 0 for a successfully matched feature point under the true rotation R_CW and translation T_CW, we can compute R_CW and T_CW by minimizing the sum of squared errors, Σ_{j=1..n} e_j², over the n matched feature points.
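The angular error of Equations (6)-(8) and the resulting objective can be sketched directly; the function names and conventions here are ours.

```python
import numpy as np

def angular_error(p_w, R_cw, T_cw, m_obs):
    """Angular reprojection error e_j (Eqs. (6)-(8)): transform the map
    point into the current frame, project it onto the unit sphere, and
    measure the angle to the observed unit direction m_obs."""
    p_c = R_cw @ p_w + T_cw                   # Eq. (6)
    m = p_c / np.linalg.norm(p_c)             # Eq. (5): unit-sphere projection
    dot = np.clip(np.dot(m, m_obs), -1.0, 1.0)
    return np.arccos(dot)                     # Eq. (8): both vectors are unit

def objective(points_w, obs, R_cw, T_cw):
    """Sum of squared angular errors over n matched points."""
    return sum(angular_error(p, R_cw, T_cw, m) ** 2
               for p, m in zip(points_w, obs))
```

In a full system this objective would be fed to a nonlinear least-squares solver over a parameterization of (R_CW, T_CW); the sketch only evaluates it.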

Two-Stage Coarse-to-Fine Tracking
Because of the distortion of full-view images, even a slight camera movement can cause a feature point to move a large distance on the full-view image. To be more resilient to this problem, our tracking system uses the same tracking strategy as PTAM. At every frame, a prior pose estimate is generated from a motion model. Then, map points are projected into the image, and their matching points are searched for within a coarse range. In this procedure, the camera pose is updated based on 50 coarse matches. After this, map points are re-projected into the image, and a tight range is used to search for the matching points. The final frame pose is calculated from all the matched pairs.
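The fixed-range (coarse or fine) search around a predicted direction can be sketched as an angular-distance filter over the detected feature directions; the two radii below are illustrative values, not the paper's settings.

```python
import numpy as np

COARSE_RADIUS = 0.10  # angular search radius in radians (assumed value)
FINE_RADIUS = 0.02    # tighter radius for the second stage (assumed value)

def match_in_range(m_pred, features, radius):
    """Return indices of detected unit feature directions lying within an
    angular radius of the predicted unit direction m_pred."""
    out = []
    for i, f in enumerate(features):
        ang = np.arccos(np.clip(np.dot(m_pred, f), -1.0, 1.0))
        if ang < radius:
            out.append(i)
    return out
```

Using an angular radius on the sphere, rather than a pixel radius on the equirectangular image, keeps the search region consistent regardless of where the prediction lands relative to the poles.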

Tracking Recovery
Since tracking is based on map point projection, the camera may fail to localize itself when too few map points are matched to the current frame. To determine whether tracking has failed, we use a threshold on the number of matched point pairs. If tracking fails continuously for more than a few frames, our system initiates a tracking recovery procedure: we compare the current frame with all keyframes stored on the map using a spherical image match. The pose of the closest keyframe is used as the current frame's pose for the next tracking procedure, while the motion model is not used for the next frame.

Keyframe and Map Point Insertion
To maintain our system function, as the camera moves, new keyframes and map points are added to the system to include more information about the environment.
After the tracking procedure, the current frame is reviewed to determine whether it is a qualified keyframe using the following conditions:
1. The tracking did not fail at this frame;
2. The recovery procedure has not been activated in the past few frames;
3. The distance from the nearest keyframe is greater than a minimum distance;
4. The time since the last keyframe was added exceeds a certain number of frames.
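The four keyframe conditions can be combined into a single predicate; the thresholds below are placeholders, as the paper does not state its values.

```python
MIN_DISTANCE = 0.05      # minimum distance to the nearest keyframe (assumed)
MIN_FRAME_GAP = 20       # frames since the last keyframe (assumed)
RECOVERY_COOLDOWN = 10   # frames since the last recovery (assumed)

def should_add_keyframe(tracking_ok, frames_since_recovery,
                        dist_to_nearest_kf, frames_since_last_kf):
    """Evaluate the four keyframe conditions listed above."""
    return (tracking_ok                                  # condition 1
            and frames_since_recovery > RECOVERY_COOLDOWN  # condition 2
            and dist_to_nearest_kf > MIN_DISTANCE          # condition 3
            and frames_since_last_kf > MIN_FRAME_GAP)      # condition 4
```

All four conditions must hold simultaneously, so a single recent recovery or a stationary camera is enough to suppress keyframe insertion.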
Map point insertion is implemented when a new keyframe is added. First, feature points in the new keyframe are considered as candidate map points. Then, we determine whether the candidate point is near successfully observed map points in the current view. This step helps us to discard candidate points that may already be on the map. Next, we select the closest keyframe by distance since map point insertion is not available from a single keyframe. To establish a correspondence between the two views, we use the spherical epipolar search to limit the search range to a small, but accurate range. Based on the tight search range, spherical matching for the candidate map point is more reliable. If a match has been found, the candidate map point is triangulated. However, until this step, we cannot guarantee the candidate map point is a new point which is not on the map. Therefore, we compare the 3D candidate map point with each existing map point by projecting them onto the first and current keyframe image. If both offsets on the two images are small, we conclude they are the same point, and then connect the current frame information to the existing map point. Finally, the remaining candidate map point is inserted into the map.

Spherical Epipolar Search
Suppose that the same point is observed by two spherical-model cameras. Let the coordinates of this point in the two camera coordinate systems be M_1 and M_2. The pose between the two cameras is represented by a rotation matrix, R, and a translation vector, T. According to the coplanarity principle, we have

M_2^T E M_1 = 0, where E = [T]_× R. (10)

By using the spherical projection in Equation (4), we have

m_2^T E m_1 = 0. (12)

Given that m_1 is known, let (A, B, C)^T = E m_1; according to Equation (5), we can represent Equation (12) as

A sin θ_2 cos ϕ_2 + B sin θ_2 sin ϕ_2 + C cos θ_2 = 0.

In practice, we allow an error range ∆e > 0:

|A sin θ_2 cos ϕ_2 + B sin θ_2 sin ϕ_2 + C cos θ_2| < ∆e. (15)

Equation (15) gives the constraint for the spherical epipolar search and identifies the corresponding feature point.
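Equation (12) and the tolerance test of Equation (15) translate directly to code; this sketch operates on unit bearing vectors and uses our own function names.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def epipolar_residual(m1, m2, R, T):
    """|m2^T E m1| with E = [T]_x R (Eq. (12)); zero for a true match."""
    E = skew(T) @ R
    return abs(m2 @ E @ m1)

def satisfies_epipolar(m1, m2, R, T, delta_e=1e-3):
    """Spherical epipolar search test, Eq. (15)."""
    return epipolar_residual(m1, m2, R, T) < delta_e
```

In the epipolar search, this predicate prunes the candidate feature points in the second view to the narrow band around the epipolar curve before descriptor matching is applied.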

Data Refinement
After a new keyframe and new map points are added, bundle adjustment must be used as the last step. In our map, each map point corresponds to at least one keyframe. Using keyframe pose and 2D/3D point correspondences, each map point projection has an associated reprojection error, calculated as Equation (7). Finally, bundle adjustment iteratively adjusts all map points and keyframes to minimize the error.
A newly added map point may be incorrect because of matching errors. To identify outliers, at every frame of tracking we calculate the reprojection errors of the map points visible in the current frame. If a map point is observed with a large reprojection error more often than it is observed correctly, the map point is discarded.
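Our reading of the pruning rule can be sketched as a per-point vote between large-error and small-error observations; the error threshold is an assumed value.

```python
ERROR_THRESHOLD = 0.02  # angular reprojection error threshold in radians (assumed)

def prune_outliers(errors_per_point):
    """For each map point, compare how often its reprojection error was
    large against how often it was small; keep the point only if correct
    observations dominate (our reading of the pruning rule)."""
    keep = []
    for j, errs in enumerate(errors_per_point):
        bad = sum(1 for e in errs if e >= ERROR_THRESHOLD)
        good = len(errs) - bad
        if bad <= good:
            keep.append(j)
    return keep
```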

Validation
The system described above was implemented on a desktop PC with a 2.5 GHz Intel(R) Core(TM) i5-2400S processor and 8 GB RAM. To evaluate the performance of our full-view SLAM system, we first tested the system functions in a small workspace. Then, we ran an accuracy test on a measured route in a sparse-feature environment. Finally, we moved the camera around a room to conduct a real-world test. As this paper focuses on the advantage of the full-view-based method over limited-view-based methods, we also conducted an experiment showing that the accuracy decreases as the field of view decreases.

Functions Test
We prepared a video with 2429 frames. This video, which was recorded from a full-view model camera, explored a desk and its immediate surroundings. First, we let the camera move in front of the desk to produce an overview of the scene. Then, the camera moved away from the desk. Next, we occluded the camera and let it try to recover tracking at a place near the desk. The full test video can be seen at https://youtu.be/_4CFIPA5kZU.
The final map consisted of 31 keyframes and 1243 map points. We selected nine frames (frame index: 134, 356, 550, 744, 1067, 1301, 1639, 1990, and 2103) from the test video to illustrate the performance, as shown in Figure 5. The yellow plane in these figures represents the world coordinate system, which was established using map initialization. The yellow plane was stable in the environment during this test. Whether the camera was far or near to an object, our system functioned well. The final map is shown in Figure 6. From the map, we identified the laptop PC, desktop PC, and other items. We also tested the tracking recovery, which was very important to camera re-localization, in this video. It was also used to evaluate our tracking and mapping system. We occluded the camera at frame 2171 and stopped occluding at frame 2246. Simultaneous with the occluding, we moved the camera from a distant place to a closer place relative to the desk. Our system successfully localized itself after being moved (Figure 7).

Accuracy Test
To further validate our system and compare it with a limited-view method (PTAM), we measured routes in a space with sparse features, as shown in Figure 8. The camera moved from the starting point to the ending point with no rotation, as quickly as possible. The first 10 cm was used for map initialization, for which we provided the real scale to the system. During the test, the limited-view SLAM failed because the environment was sparse and only a few features were captured in the limited view. We tested PTAM [1], one of the best-known limited-view SLAM approaches, on perspective images generated directly from full-view images. Figure 9 shows that PTAM failed to track in the sparse-feature environment, which in this test was a wall. However, our full-view SLAM tracked well in this situation because it captures features from other places besides the wall. The final map is shown in the left image of Figure 10; two boxes can be identified in it. The coordinate system on the map is the world coordinate system, whose origin was also the starting point of this test. Each green point represents a keyframe. The trajectory, which was very close to the ground truth, is shown in the right image of Figure 10. The video can be seen at https://youtu.be/c1ER1OqyjXM.
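One way to generate such perspective images from a full-view image is to cast a ray through each pinhole pixel and sample the panorama along it. A minimal sketch, assuming an equirectangular panorama and nearest-neighbour sampling (`equirect_to_perspective` and its parameters are illustrative; the exact procedure used in the experiment is not specified here):

```python
import numpy as np

def equirect_to_perspective(img, fov_deg=90.0, out_w=320, out_h=240, yaw=0.0):
    """Sample a pinhole (perspective) view from an equirectangular
    panorama by casting a ray for every output pixel.

    Nearest-neighbour sampling; `yaw` rotates the virtual camera
    about the vertical axis (radians)."""
    h, w = img.shape[:2]
    # Focal length in pixels from the desired horizontal field of view
    f = (out_w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    xs, ys = np.meshgrid(np.arange(out_w) - out_w / 2.0,
                         np.arange(out_h) - out_h / 2.0)
    # Unit ray direction for every output pixel
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    x, y, z = dirs[..., 0], dirs[..., 1], dirs[..., 2]
    cy, sy = np.cos(yaw), np.sin(yaw)
    xr, zr = cy * x + sy * z, -sy * x + cy * z     # rotate rays by yaw
    phi = np.arctan2(xr, zr)                       # longitude in (-pi, pi]
    theta = np.arccos(np.clip(-y, -1.0, 1.0))      # polar angle in [0, pi]
    # Look up the corresponding panorama pixel for each ray
    u = ((phi + np.pi) / (2.0 * np.pi) * w).astype(int) % w
    v = np.clip((theta / np.pi * h).astype(int), 0, h - 1)
    return img[v, u]
```

With `yaw = 0`, the centre of the output image samples the centre of the panorama, and larger `fov_deg` values trade angular resolution for coverage.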

Then, we conducted an experiment to test our system's accuracy in a larger environment. We planned a new, measured route, as shown in Figure 11. The camera moved in a clockwise loop through 16 marked points. We randomly placed a few objects around the route to create a sparse space; no objects were placed around points 9-12. Point A was the starting point, and the first 10 cm from A to B was used for map initialization. Using continuous tracking and mapping, we obtained the distance between point A and every marked point. We measured the ground truth using a laser rangefinder. The final errors for every marked point using our method are shown in Table 1 and Figure 12.
As seen in Table 1, the accuracy at marked points 8-12 decreased compared to the other points, because we placed few trackable features around these points to simulate a sparse environment. Since our proposed method is based on full-view images and uses features from every direction, this had only a small influence on our system. However, limited-view methods, such as PTAM [1], immediately failed to track at these points because their limited view contained too few features. The result shows that our method kept tracking in a sparse environment with sufficient accuracy; the average error was 0.02 m.
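The reported errors follow directly from comparing the estimated distances with the rangefinder measurements. A minimal sketch of the computation (the numbers in the usage line are placeholders, not the measured data):

```python
def localization_errors(estimated, ground_truth):
    """Absolute error per marked point and the mean error, in metres."""
    errs = [abs(e - g) for e, g in zip(estimated, ground_truth)]
    return errs, sum(errs) / len(errs)

# Illustrative placeholder values (not the measured data):
# localization_errors([1.00, 2.05], [1.02, 2.00]) -> errors of about
# 0.02 m and 0.05 m, mean about 0.035 m
```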

Route Test
In this section, we demonstrate the performance of our system in real-world situations. We let the camera move along the walls of a room on a square path to build the map. Some sample frames are shown in Figure 13 to illustrate the formation of the map and the trajectory. We marked "P" on the map images to indicate the current camera positions and marked "door" on both the full-view image and the map image; these markers should make the relationship between the map and the real environment easier to understand. The final trajectory was a square, as planned. The video is available at https://youtu.be/QzilX069U0M. The estimated surroundings and the trajectory corresponded well to the real environment.
To further validate the route test, we mounted the full-view camera on an MAV and let it fly along a planned route, as Figure 14a shows. One side of the route is a tall building with distinctive features, while the other side is full of trees with easily confused features. The final trajectory is very close to the planned route. Our estimated trajectory, the trajectory from PTAM, and the trajectory from GPS are compared in Figure 14. PTAM performs poorly and even loses tracking when the MAV rotates, because limited-view SLAM systems do not insert keyframes during pure rotation. The trajectories from our system and from GPS are very close, except around the starting point, where the GPS error is larger than that of our estimate: the GPS trajectory places the MAV almost against the wall, which is not possible, probably because the GPS signal was affected by the tall building nearby. The whole test video from our system is available at https://youtu.be/R8C_W5HY1Dg.

Accuracy Validation with a Decrease in the Field of View
In this section, we validate the inherent disadvantage of limited-view methods by decreasing the field of view and observing how the performance of our system changes. Similar to Section 5.2, we designed a route and collected full-view images for tracking and mapping. Since the full-view image is represented in the spherical model, we decreased the range of theta to obtain limited-view images with the desired view angles (Figure 15). We tested our system with theta = 160, 140, 120, 100, and 80 degrees, running the experiment ten times for each angle. Each time, the limited view was generated randomly, so the tracking result differed depending on the captured features. The accuracy for different theta values is listed in Table 2.
As Table 2 shows, the accuracy decreased as the field of view decreased. When the field of view was less than 100 degrees, some runs failed to track ('X' in Table 2). A limited view means limited information: features from only one direction could not produce a result as accurate as features from all directions, especially in sparse environments. The average error shows that wider views resulted in better performance.
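Generating such a theta-restricted view amounts to discarding feature bearings outside the reduced angular range on the sphere. A minimal sketch, phrased as a cone of opening angle `fov_deg` about a viewing axis (an illustrative equivalent of restricting theta; `restrict_fov` is a hypothetical helper, not the system's actual code):

```python
import numpy as np

def restrict_fov(bearings, axis, fov_deg):
    """Keep only unit bearing vectors within a cone of the given full
    opening angle around the viewing axis, simulating a limited FOV."""
    axis = np.asarray(axis, float)
    axis /= np.linalg.norm(axis)
    b = np.asarray(bearings, float)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    # A bearing lies inside the cone iff its angle to the axis
    # is at most half the opening angle
    cos_half = np.cos(np.radians(fov_deg) / 2.0)
    return b[b @ axis >= cos_half]
```

As `fov_deg` shrinks, fewer bearings survive the test, which mirrors the loss of trackable features observed in Table 2.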