Robust Outlier-Adaptive Filtering for Vision-Aided Inertial Navigation.

With the advent of unmanned aerial vehicles (UAVs), a major area of interest in the research field of UAVs has been vision-aided inertial navigation systems (V-INS). In the front-end of V-INS, image processing extracts information about the surrounding environment and determines features or points of interest. With the extracted vision data and inertial measurement unit (IMU) dead reckoning, the most widely used algorithm for estimating vehicle and feature states in the back-end of V-INS is an extended Kalman filter (EKF). An important assumption of the EKF is Gaussian white noise. In fact, measurement outliers that arise in various realistic conditions are often non-Gaussian. A lack of compensation for unknown noise parameters often leads to a serious impact on the reliability and robustness of these navigation systems. To compensate for uncertainties of the outliers, we require modified versions of the estimator or the incorporation of other techniques into the filter. The main purpose of this paper is to develop accurate and robust V-INS for UAVs, in particular, those for situations pertaining to such unknown outliers. Feature correspondence in image processing front-end rejects vision outliers, and then a statistic test in filtering back-end detects the remaining outliers of the vision data. For frequent outliers occurrence, variational approximation for Bayesian inference derives a way to compute the optimal noise precision matrices of the measurement outliers. The overall process of outlier removal and adaptation is referred to here as "outlier-adaptive filtering". Even though almost all approaches of V-INS remove outliers by some method, few researchers have treated outlier adaptation in V-INS in much detail. Here, results from flight datasets validate the improved accuracy of V-INS employing the proposed outlier-adaptive filtering framework.


Introduction
The most widely used algorithms for estimating the states of a dynamic system are a Kalman Filter [1,2] and its nonlinear versions such as an extended Kalman filter (EKF) [3,4]. After the NASA Ames Research Center implemented the Kalman filter into navigation computers to estimate the trajectory of the Apollo program, engineers have developed a myriad of applications of the Kalman filter in navigation system research areas [5]. For example, Magree and Johnson [6] developed a simultaneous localization and mapping (SLAM) architecture with improved numerical stability based on the UD factored EKF, and Song et al. [7] proposed a new EKF system that loosely fuses both absolute state measurements and relative state measurements. Furthermore, Mostafa et al. [8] integrated radar odometry and visual odometry via EKF to help overcome their limitations in navigation. Despite the development of numerous applications of the Kalman filter in various fields, it suffers from inaccurate estimation when required assumptions fail.
Estimation using a Kalman filter is optimal when process and measurement noise are Gaussian. However, sensor measurements are often corrupted by unmodeled non-Gaussian or heavy-tailed noise. An abnormal value relative to an overall pattern of the nominal Gaussian noise distribution is called an outlier. In other words, in statistics, an outlier is an observation that deviates so much from other observations as to arouse suspicion that it is generated by a different mechanism [9]. Such outliers have many anomalous causes. They arise due to unanticipated changes in system behavior (e.g., temporary sensor failure or transient environmental disturbance) or unmodeled factors (e.g., human errors or unknown characteristics of intrinsic noise). As an example of measurement outliers in many navigation systems, either computer vision data contaminated by outliers or sonar data corrupted by phase noise lead to erroneous measurements. Process outliers also occur by chance. Inertial measurement unit (IMU) dead reckoning and wheel odometry as a proxy often generate inaccurate dynamic models in visual-inertial odometry (VIO) and SLAM algorithms, respectively. Without accounting for outliers, the accuracy of the estimator significantly degrades, and control systems that rely on high-quality estimation can diverge.

State Estimation for Measurements with Outliers
As the performance of the Kalman filter degrades at the presence of measurement outliers, many researchers have investigated other approaches to mitigate the impact of the outliers. Mehra [10] created adaptive filtering with the identification of noise covariance matrices and showed the asymptotic convergence of the estimates towards their true values. Maybeck [11] and Stengel [12] found other noise-adaptive filtering such as covariance matching. However, all of these filters performed only offline and required filter tuning. To estimate parameter values in unknown covariances without the need for manual parameter tuning, Ting et al. [13] used a variational expectation-maximization (EM) framework. That is, they introduced a scalar weight for each observed data sample and modeled the weights to be Gamma distributed random variables. However, it assumed that noise characteristics are homogeneous across all measurements even though sensors have distinct properties. Särkkä and Nummenmaa [14] provided the online learning of the parameters of the measurement noise variance, but to simultaneously track the system states and the levels of sensor noise, they additionally defined a heuristic transition model for the noise parameters. Piché et al. [15] developed Gaussian assumed density filtering and smoothing framework for nonlinear systems using the multivariate Student t-distribution, and Roth et al. [16] included an approximation step for heavy-tailed process noise, but these filters are not applicable in high dimensions. Next, Solin and Särkkä [17] found that the added flexibility of Student-t processes over Gaussian processes robustifies inference in outlier-contaminated noisy data, but they treated only analytic solutions enabled by the noise entanglement.
Recently, Agamennoni et al. developed the outlier-robust Kalman filter (ORKF) [18,19] to obtain the optimal precision matrices of measurement outliers by variational approximation for Bayesian inference [20]. However, this method requires iterations at every time, even when observed data contain no outliers. Graham et al. also established the 1 -norm filter [21] for both types of sparse outliers. However, the filter might not work for nonlinear systems since they derived the constraint of 1 -norm optimization based on only linear system equations. Similar to the ORKF, the 1 -norm filter needs the constrained optimization at all times, even when no additional noise present as outliers. Therefore, these two approaches demand some extensive computational complexity for either iterations or optimization. As outliers do not always arise (i.e., are rare), we reduce such computation cost if a test detects the time when outliers occur. All of the above methods were not validated for complicated systems such as unmanned aerial vehicles or vision-aided inertial navigation [6,22,23] or with sequential measurement updates [24,25].

Outlier Rejection Techniques
One of the primary problems in vision-aided inertial navigation systems (V-INS) is incorrect vision data correspondence or association. Matched features between two different camera views are corrupted by outliers because of image noise, occlusions, and illumination changes that are not modeled by the feature matching techniques. To provide cleaned measurement data to the filter, outlier removal in image processing front-end is essential. One of standard outlier rejection techniques is RANdom SAmple Consensus (RANSAC) [26]. RANSAC is an iterative approach to estimate the parameters of a mathematical model from a set of observed data contaminated by outliers. An underlying assumption is that the data consist of inliers whose distribution is described by some set of the parameters of the model and outliers that do not fit the model. The generated parameters are then verified on the remaining subset of the data, and the model with the highest consensus is a selected solution. In particular, 2-point RANSAC [27,28] is an extended RANSAC-based method for two consecutive views of a camera rigidly mounted on a vehicle platform. Given gyroscopic data from IMU measurements, randomly selected two-feature correspondences hypothesize an ego-motion of the vehicle. This motion constraint discards wrong data associations in the feature matching processes.
For detecting remaining outliers that are not rejected in the image processing front-end, outlier detection tests are required in filtering back-end. Most of the statistical tests [29] that require access to the entire set of data samples for detecting outliers might not be a viable option in real-time applications. For example, the typicality and eccentricity data analysis [30,31] used in [32] is an inadequate measure in V-INS, as computing the means and the variances of each residual of sequential measurements is challenging. In V-INS, the tracking of some measured features is possibly lost due to out of sight, and new feature measurements are coming for initialization.
For the real-time outlier detection of sequential measurements in V-INS, the Mahalanobis gating test [33] is a useful measure based on the analysis of residual and covariance signals at each feature measurement. The approach builds upon each Mahalanobis distance [34] of residuals and compares each value against a threshold given by the quantile of the chi-squared distribution with a certain degree of freedom. The confidence level of the threshold is designated prior to examining the data. Most commonly, the 95% confidence level is used. This hypothesis testing, called goodness of fit, is a commonly used outlier detection method in practice. Because of such suitability of the Mahalanobis gating test to real-time detection in V-INS, this paper combines the test with the ORKF [18,32] to detect and handle measurement outliers in vision-aided estimation problems. Similar to the derivation of update steps for handling measurement outliers in the ORKF, for computing the optimal precision matrices of unmodeled outliers in V-INS, Section 4 will derive feasible update procedures by variational inference. In other words, whenever unexpected outliers appear, the outlier-adaptive filtering in this paper updates and marginalizes measurement outliers to improve the robustness of the navigation systems.

Summary of Contributions
This paper presents improving the use of outlier removal techniques in image processing front-end and the development of a robust and adaptive state estimation framework for V-INS when frequent outliers occur. For outlier removal in the image processing front-end of V-INS, feature correspondence constitutes the following three steps: tracking, stereo matching, and 2-point RANSAC. To estimate the states of V-INS in which vision measurements still contain remaining outliers, we propose a novel approach that combines a real-time outlier detection technique with an extended version of the ORKF in the filtering back-end of V-INS. Therefore, our approach does not restrict noise at either a constant or Gaussian level in filtering. The testing results of benchmark flight datasets show that our approach leads to greater improvement in accuracy and robustness under severe illumination environments.
Starting from the architecture of the existing vision-aided inertial navigation system, this paper more focuses on contributing to the development of red boxes in Figure 3.

A Guide to This Document
The remainder of this document contains the following sections. Section 2 introduces background for all of this study. To estimate the states of V-INS in which frequent outliers arise, Section 3 examines outlier rejection techniques in image processing front-end, and Sections 4 and 5 formulate a novel implementation of robust outlier-adaptive filtering. Section 6 shows testing results of this study on benchmark flight datasets. The last section concludes and plans future work.

The Extended Kalman Filter
The system equations with continuous-time dynamics and a discrete-time sensor are as follows, where x ∈ R n is the state and y ∈ R m a measurement. f (·) and h(·) are the nonlinear dynamic and measurement functions, respectively. Let us assume that these functions are known based on each equation of motion and modeling. To clarify, t denotes continuous time, subscript k represents the k-th time step, and initial condition x(0) = x 0 is given. Moreover, let us assume that both propagation and measurements are corrupted by additive zero-mean white Gaussian noise; that is, η(t) ∼ N (0, Q) and ζ(t k ) ∼ N (0, R).

Time Update
To estimate the state variables of the system, we design a hybrid EKF in the following steps. In the propagation step, state estimatex := E[x] and its error-covariance P : where letx k =x(t k ). Hat "ˆ" denotes an estimate, and superscripts − and + represent a priori and a posteriori estimates, respectively. Here, for one numerical solution of the ordinary differential equation, we use the Heun's method [35] that refers to the improved Euler's method or a similar two-stage Runge-Kutta method. Jacobian A, B, and state transition matrix Φ are defined by where ∆t k−1 = t k − t k−1 . Letting P k = P(t k ), the time update of error covariance is

Measurement Update
Using actual sensor measurements, the measurement update step of the EKF corrects state estimates and its corresponding error covariance after propagation. Letting y k = y(t k ), at every time k when y k is measured, where K is called the Kalman gain and Jacobian C is defined as Equation (7) is the Joseph's form [36] of the covariance measurement update, so this form preserves its symmetry and positive definite. For more details such as optimality and derivation, see References [24,37].

Sequential Measurement Update
When myriad measurements are observed at one time, sequential Kalman filtering is useful. In fact, we obtain N measurements, y 1 , y 2 , · · · , y N , at time k; that is, we first measure y 1 , then y 2 , · · · , and finally y N , shown in Figure 1. We first initialize a posteriori estimate and covariance after zero measurement is processed; that is, they are equal to the a priori estimate and covariance. For i = 1, · · · , N, perform the general measurement update using the i-th measurement. We lastly assign the a posteriori estimate and covariance asx + k ← (x k ) N and P + k ← (P k ) N . Based on Simon [24]'s proof that the sequential Kalman filtering is equivalent formulation of the standard EKF, the order of updates does not affect overall performance of estimation.

Vehicle Model
The nonlinear dynamics of a vehicle is driven by raw micro-electromechanical system (MEMS) IMU sensor data including specific force and angular velocity inputs. The estimated vehicle state is given byx where p b/i , v b/i are the position and velocity of the vehicle with respect to the inertial frame, respectively. δθ is the error quaternion of the attitude of the vehicle, and its more details are explained in [38][39][40]. b a and b ω are the acceleration and gyroscope biases of the IMU, respectively. Left superscript i denotes a vector expressed in the inertial frame. The EKF propagates the vehicle state vector by dead reckoning with data from the IMU. Raw MEMS IMU sensor measurements a raw and ω raw are corrupted by noise and bias as follows, where a true and ω true are the true acceleration and angular rate, respectively, and g is the gravitational acceleration in the inertial frame. η a and η ω are zero-mean, white, Gaussian noise of the accelerometer and gyroscope measurement, and T b/i = T i/b T denotes the rotation matrix from the inertial frame to the body frame.
η b a and η b ω in Equations (9) and (10) are the random walk rate of the acceleration and gyroscope biases. From the works in [40][41][42], the MEMS accelerometer and gyroscope are subject to flicker noise in the electronics and other components susceptible to random flickering. The flicker noise causes their biases to wander over time. Such bias fluctuations are usually modeled as a random walk. In other words, slow variations in the bias of the IMU sensor are modeled with a "Brownian motion" process, also termed a random walk in discrete-time. In practice, the biases are constrained to be within some range, and thus the random walk model is only a good approximation to the true process for short periods of time.
The vehicle dynamics is given by where α × is a skew symmetric matrix, and function Q(·) maps a 3 by 1 vector of the angular velocity into a 4 by 4 matrix [25]. The use of the 4 by 1 quaternion representation in state estimation causes the covariance matrix to become singular, so it requires considerable accounting for the quaternion constraints. To avoid these difficulties, engineers developed the error-state Kalman filter in which 3 by 1 infinitesimal error quaternion δθ is used instead of 4 by 1 quaternion q in the state vector.
In other words, we use attitude error quaternion to express the incremental difference between tracked reference body frame b and actual body frame b for the vehicle. Jacobian matrix A = ∂ẋ ∂ x |x and

Camera Model
An intrinsically calibrated pinhole camera model [27,43] is given by where x is the state vector including the vehicle state and the feature state, and measurement y j is the j-th feature 2D location on the image plane. f u and f v are the horizontal and vertical focal lengths, respectively, and ζ u and ζ v are additive, zero-mean, white, Gaussian noise of the measurement. Vectors p f j /c , p f j /i are the j-th feature 3D position with respect to the camera frame and the inertial frame, respectively. Extrinsic parameter T c/b and b p c/b are known and constant. Jacobian matrix C j = ∂ y j ∂ x |x is computed in Appendix A. In addition, from Equation (18), if j-th measurement y j on an image is a new feature, then i p f j /i is unknown so need to be initialized. Details of feature initialization are explained in Appendix B.

Feature Correspondence
In this study, a feature detector using the Features from Accelerated Segment Test (FAST) algorithm [44,45] maintains a minimum number of features in each image. For each new image, a feature extractor using the Kanade-Lucas-Tomasi (KLT) sparse optical flow algorithm [46] tracks the existing features. Even though Paul et al. [47] proved that descriptor-based methods for temporal feature tracking are more accurate than KLT-based methods, as Sun et al. [48] found that descriptor-based methods require much more computing resource with a small gain in accuracy, we employ the KLT optical flow algorithm in the image processing front-end of this study. Next, our stereo matching using a fixed baseline stereo configuration also applies to the KLT optical flow algorithm for saving computational loads compared to other stereo matching approaches. With the matched features, a 2-point RANSAC [26] is applied to remove remaining outliers by utilizing the RANSAC step in the fundamental matrix test [27]. In the scope of this study, we implement the 2-point RANSAC algorithm by simply running one of open source codes.
Similar to [48,49], our outlier rejection is composed of three steps, shown in Figure 2. We assume that features from previous c 1 and c 2 images are outlier-rejected points, where c 1 and c 2 are left and right camera frames of a stereo camera, respectively. The three steps form a closed loop of previous and current frames of left and right cameras. The first step is the stereo matching of tracked features on current c 1 image to c 2 image. The next steps are applying 2-point RANSAC between previous and current images of the left camera and another 2-point RANSAC between previous and current images of the right camera. For steps 2 and 3, stereo-matched features are directly used in each RANSAC.

Algorithm of Feature Correspondence
Algorithm 1 summarizes the feature correspondence for outlier rejection. For the scope of this paper, the OpenCV library [50] and open source codes of RANSAC are extremely useful and directly applied.
In Algorithm 1, Pyramid is a type of multi-scale signal representation in which an image is subject to repeated smoothing and sub-sampling.

Algorithm 1 Feature Correspondence for Outlier Rejection
Require: Pyramids and outlier-rejected points of previous c 1 , c 2 images 1: Feature Tracking: OpenCV [50] 3: return pyramid of current c 1 or c 2 4: end function 5: function PREDICTFEATURES(outlier-rejected points of previous c 1 ,T curr←prev of c 1 , Intrinsic c 1 ) 6: return predicted features of current c 1 7: end function Initialize c 2 points by projecting the tracked features of current c 1 to c 2 using the rotation from stereo extrinsic 14: function CALCOPTICALFLOWPYRLK(pyramid of current c 1 and c 2 , tracked features of current c 1 , initialized c 2 points) OpenCV [50] 15: end function 16: Further remove outliers based on the essential matrix

Outlier Adaptation in Filtering Back-End
Even though image processing front-end removes outliers by tracking, stereo matching, and 2-point RANSAC, some outlier features still survive and enter the filter as inputs. This section explains the outlier rejection procedure in filtering back-end.

Outlier Removal in Feature Initialization
If a measurement is a new feature, our system initializes its 3D position with respect to the inertial frame. In feature initialization, Gaussian-Newton least-squares minimization first estimates the depth of left c 1 camera. If either the estimated depth of the left or right camera is negative, then the solution of the minimization is invalid since features are always in front of both camera frames observing them. This process of removing features that has the invalid depth is referred to as outlier removal in feature initialization.

Outlier Detection by Chi-Squared Statistical Test
Before operating navigation systems, we initialize the chi-squared test table with a 95% confidence level. While the systems estimate the state variable, if j-th measurement y j at time k is the existing feature, then its residual r j and Jacobian C j are computed. Next, we proceed a Mahalanobis gating test [33] for residual r j to detect remaining outliers. In fact, Mahalanobis distance [34] γ j is a measure of the distance between residual r j and covariance matrix In the statistic test, we compare γ j value against a threshold given by the 95-th percentile of the χ 2 distribution with ν j degree of freedom. Here, ν j is the number of observations of the j-th feature minus one. If the feature passes the test, the EKF uses residual r j to process the measurement update.

Outlier-Adaptive Filtering
Unlike the extended ORKF (EORKF) [32], for a practical estimation approach in V-INS, this study investigates only measurement outliers due to the following reasons. As the measurement update is not the process performed at every time step, the outlier detection by each residual value cannot directly detect the outliers of IMU measurements. Furthermore, in the sequential measurement update, multiple residuals are computed to update at one IMU time stamp. In other words, as only rare observations among feature measurements from one image are corrupted by the remaining outliers, hypothesizing that the outliers come from the IMU may be faulty. Therefore, in the scope of this paper, we handle only measurement outliers.

Student's t-Distribution
Despite the true system with outliers, the classical EKF assumes that each model in the filter is corrupted with additive white Gaussian noise. The levels of the noise are assumed to be constant and encoded by sensor covariance matrices Q and R (i.e., η k ∼ N (0, Q), (ζ j ) k ∼ N (0, R)). However, as outliers arise in the realistic system, now we do not restrict noise at either a constant or Gaussian level. Instead, their levels vary over time, or noise have heavier tails than the normal distribution as follows, where ST(·) denotes a Student's t-distribution, and ν k > m − 1 is degrees of freedom. Covariance matrix R j follows the inverse-Wishart distribution, denoted as W −1 (·). Λ j 0 is m × m precision matrix. In Bayesian statistics, the inverse-Wishart distribution is used as the conjugate prior for the covariance matrix of a multivariate normal distribution [20]. The probability density function (pdf) of the inverse-Wishart is where tr(·) denotes the trace of a square matrix in linear algebra. Moreover, in probability and statistics, a Student's t-distribution is any member of a family of continuous probability distributions that arises when estimating the mean of a normally distributed population in situations where the standard deviation of the population is unknown [51]. Whereas a normal distribution describes a full population, a t-distribution describes samples drawn from a full population; thus, the larger the sample, the more the distribution resembles a normal distribution. Indeed, as the degree of freedom goes to infinity, the t-distribution approaches the standard normal distribution. In other words, when the variance of a normally distributed random variable is unknown and a conjugate prior placed over it that follows an inverse-Wishart distribution, the resulting marginal distribution of the variable follows a Student's t-distribution [52]. Then, the Student-t, a sub-exponential distribution with much heavier tails than the Gaussian, is more prone to producing outlying values that fall far from its mean.

Variational Inference
The purpose of filtering is generally to find the approximations of posterior distributions p(x k | y 1:k ), where y 1:k = [y 1 , y 2 , · · · , y k ] is the histories of sensor measurements obtained up to time k. For systems with the heavy-tailed noise, we also wish to produce another inference about covariance matrices whose priors follow the inverse-Wishart distribution. Hence, our goal in this section is to find both approximations for posterior distribution p(x 1:k ,R 1:k y 1:k ) and model evidence p(y 1:k ). Compared to sampling methods, the variational Bayesian method performs approximate posterior inference at low computational cost for a wide range of models [20,52]. In the method, we decompose log marginal probability ln p(y 1: where KL [ q p ] = q(x 1:k ,R 1:k ) ln q(x 1:k ,R 1:k ) p(x 1:k ,R 1:k y 1:k ) p is the true distribution that is intractable for non-Gaussian noise models, and q is a tractable approximate distribution.
In probability theory, a measure of the difference between two probability distributions p and q is the Kullback-Leibler divergence, denoted as KL[·]. If we allow any possible choice for q such as the Gaussian distribution, then lower bound L[q] is maximum when the KL divergence vanishes; that is, q(x 1:k ,R 1:k ) = p(x 1:k ,R 1:k | y 1:k ). To minimize the KL divergence, we seek the member of a restricted family of q(x 1:k ,R 1:k ). Indeed, maximizing L[q] is equivalent to minimizing another new KL divergence [52], and thus the minimum occurs when factorized distributions q(x 1:k ,R 1:k ) = q(x 1:k ) q(R 1:k ) and the following Equations (25) and (26) hold, where E q(·) represents the expectation with respect to q(·). With assuming that initial state x 1 is Gaussian, the measurement update with varying noise covariance E[R −1 t ] = Λ −1 t , which closely resemble the EKF updates, solve Equation (25). Algorithm 2 describes the details of the updates. Now let us assume that the true priors are IID noise models as the case in this study; that is, p(R k ) follows W −1 (νR, ν) distribution. Then, second term ln p(R k ) in the right-hand side of Equation (26) is computed using the pdf of the inverse-Wishart distribution in Equation (21) with its prior noise model.
As the term is conjugate prior for Equation (20), the approximations of q(R k ) have same mathematical forms as priors; that is, q(R k ) also follows W −1 (ν k Λ k ,ν k ) distribution. As From Equations (26)- (29), to handle measurement outliers, similar to Agamennoni et al. [18,19]'s derivation, we derive how to compute precision matrix Λ k of approximate distribution q(R k ) ofR k as follows,ν where each feature from one image is independent and Next, in Equation (30), where estimation error e j = x k − (x k ) j and the Jacobian C j = ∂h j ∂x (x k ) j . In the sequential measurement update, (x k ) j and (P k ) j are corrected by Kalman gain K j that is a function of (Λ j ) k , so these update steps are coupled. Hence no a closed-form solution exists, and we can only solve iteratively. The purpose of the iteration seems to be similar to that of the online learning of unknown variances of each noise [10]. In addition, similar to Agamennoni et al.'s interpretation [19], the convergence and optimality of the derived update steps for outliers are guaranteed since the variational lower bound is convex with respect to (x k ) j , (P k ) j , and (Λ j ) k . In particular, as the j-th feature is observed countless times (i.e., ν j → ∞), Λ j converges to R in the limit of an infinitely precise noise distribution, so the iterative update steps reduce to the standard sequential measurement update of the EKF. If true state x(t img ) is significantly different from its estimate (x k ) j , then statistics E (ζ j ) k (ζ j ) T k dominates, and (Λ j ) k becomes much larger than R. This ζ j is regarded as a measurement outlier at time k. As Kalman gain K j is a function of the inverse of precision matrix (Λ j ) k , the larger (Λ j ) k values, the smaller the Kalman gain. Therefore, to deal with situations where measurement outliers occur, the iteration for Equations (30) and (31) corrects the state estimates and its covariance with low weights.

Marginalization of Feature States
If measurement outliers often occur, a few numbers of sequential updates in the EKF are proceeded to correct the state estimates. Without a sufficient number of measurement updates, the EKF is not robust and even diverges. Hence, outlier-adaptive filtering introduced in Section 4.3 performs the modified measurement update even when a residual is detected as an outlier. Indeed, to save computation resources, this study operates the outlier-adaptive filtering for only features detected frequently outliers. For implementation, we count how many numbers of features augmented in state variables are detected as outliers. Once updating feature outliers by the outlier-adaptive filtering approach, we prune the used feature states from the state vector ( Figure 3). In addition, similar to that mentioned in Appendix B, to maintain a certain size of the state vector, after the feature initialization, we marginalize the features with the least number of observations among tracked features.

Summarized Algorithm
This section summarizes and describes an implementation of the proposed method. Figure 4 and Algorithm 2 illustrate a flow chart and the pseudocode of the overall process of the outlier-adaptive filtering approach for V-INS, respectively. For the robust outlier-adaptive filter presented in this paper, the blue boxes in Figure 4 are extended from the figure in [25].

Algorithm 2 The Outlier-Adaptive Filtering
Require:x + 0 (=x + V 0 ), P + 0 , Q, R, χ 2 1: for k = 1 : T do 2: Image processing front-end in different thread 3: if new image capture then 4: Image Processing: Algorithm 1 5: Stereo matching between current images of left camera c 1 and right camera c 2 6: RANSAC between previous and current images of camera c 1 7: RANSAC between previous and current images of camera c 2 8: end if 9: Filtering back-end in different thread 10: if new IMU packet arrival then 11: Time Update: 12: State prediction 13: end if 14: if new vision data packet arrival then 15: for j = 1 : of observed features N do 16: if new feature then 17: Feature Initialization:

18:
If any depth of c 1 or c 2 is negative, j-th feature is outlier 19: else tracked feature 20: Outlier Gating Test: while until converged do 28: Update measurement noise given the state 29:r j = y j − h j (x j ) Update the posteriori state given noise 34: : else standard sequential EKF in Section 2 41: end if 48: end for

Flight Datasets Test Results
To examine the influence of outliers in V-INS and validate the reliability and robustness of the proposed outlier-adaptive approach for navigation systems with outliers, we test one of benchmark flight datasets, the so-called "EuRoC MAV datasets" [53]. The visual-inertial sequences of the datasets were recorded onboard a micro aerial vehicle (MAV) while a pilot manually flew around the indoor motion capture environment. For more details, see Appendix C. To articulate the significance of outliers, we select two datasets of the bright scene, called "EuRoC V1 Easy", and motion blur, called "EuRoC V1 Difficult." As the images in the difficult dataset are dark scene or motion blur, we hypothesize that outliers occur more frequently in the difficult dataset.
The EKF estimates the relative location and orientation from a starting point. As we do not know the exact absolute location of the origin of given datasets, to compare with ground-truth data given in the datasets, we require certain evaluation error metrics such as so-call "absolute trajectory error [54]". For more details, see Appendix D. The absolute trajectory error as an evaluation error metric yields the following various comparison plots. Figure 5 illustrates the top-down view of the estimated flight trajectory of the difficult dataset. Similar to the analysis presented in [25], the adaptive filter is a well-tuned estimator since the performance of doing runs with ×3 or ×10 ( /3 or /10) multiplier on the R matrix used in the filter is worse for all of those, shown in Table 1. That is, the fact that using the multipliers reveals larger root mean square (RMS) estimation errors indicates that our filter is well-tuned.   Figure 7 shows the advantages of the addition of the outlier adaptation proposed in this paper by comparing it with a baseline, the standard EKF in V-INS.
As Lee [25] and other researchers showed that the standard EKF is a basic filter in V-INS, we choose the method as a baseline here. The baseline only rejects outliers whenever the chi-squared test fails, whereas the outlier-adaptive filtering follows all proposed adaptive approaches in Algorithm 2.
Although the iteration in the outlier-adaptive filtering might increase computational resources, it significantly improves the accuracy of estimation. Fortunately, the "while" loop iteration in Algorithm 2 rapidly converges to the optimal noise covariance by twice or three times iterations. For sensitivity analysis, RMS position errors resulting from the baseline and the outlier-adaptive filtering are compiled in Table 2.  Motion blur datasets are more sensitive to outliers as the improvement is larger when applied to those datasets. Although the outlier-adaptive filtering is the best choice for motion blur datasets, we can select an adequate mode depending on computation margin and cost.
Although a number of researchers have investigated V-INS of the EuRoC datasets [55], only a few of them thoroughly has focused on vision measurements with outliers. Table 3 reveals that the proposed estimator, the outlier-adaptive filter, outperforms other state-of-the-art V-INS techniques, called "SVO+MSF [56,57]" and "S-MSCKF [48,58] " in which stereo is available. As SVO+MSF is loosely coupled, its algorithm actually gets diverged.

Discussion
This paper has presented practical outlier-adaptive filtering for a vision-aided inertial navigation systems (V-INS) and evaluated its performance with flight datasets testing. In other words, this study develops a robust and adaptive state estimation framework for V-INS under frequent outliers occurrence. In the image processing front-end of the framework, we propose the improved utilization of outlier removal techniques. In filtering back-end, for estimating the states of V-INS with measurement outliers, we implement a novel approach of the outlier-robust extended Kalman filter (EKF) to V-INS, for which we derive iterative update steps for computing the precision noise matrices of vision outliers when the Mahalanobis gating test detects remaining outliers.
To validate the accuracy of the proposed approach and compare it with other state-of-the-art V-INS algorithms, we test the performance of V-INS employing the outlier-adaptive filtering algorithm in the realistic benchmark flight datasets. In particular, to show more improvements of our method over the others' approaches, we use the fast motion and motion blur flight datasets. Results from the flight datasets testing show that the novel navigation approach in this study improves the accuracy and reliability of state estimation in V-INS with frequent outliers. Using the outlier-adaptive filtering reduces the root mean square (RMS) error of the estimates and accelerates the robustness of the estimates, especially for the motion blur datasets.
The primary goals of future work are listed as follows. Since an inertial measurement unit (IMU) is also a sensor, it could generate outliers in V-INS. With accounting for the process outliers, the accuracy and robustness of the estimator would be improved. If we distinguish process outliers from IMU sensors with measurement outliers from vision data, the extended outlier-robust EKF [32] may be an impressive and innovate approach for this case. Furthermore, the investigation of color noise in V-INS is another possible future work. One of the required assumptions of the Kalman filter is the whiteness of measurement noise. As an illustration, during sampling and transmission in image processing, color noise that may be originated from a multiplicity of sources could degrade the quality of images [59]. The vibrational effects of camera sensors might also produce colored measurement noise [60]. That is, if the residuals of vision data are correlated with themselves at different timestamp, then colored measurement noise occurs in V-INS. Therefore, the images with color noise would be filtered for ensuring the accuracy of locating landmarks. As modeling noise without additional prior knowledge of the noise statistics is typically difficult, the machine learning techniques-based state estimator for colored noise [61,62] may handle the unknown correlations in V-INS.
This study showcases the approaches using stereo cameras but is also suitable for monocular V-INS and employable to other filter-based V-INS frameworks. We test benchmark flight datasets to validate the reliability and robustness of this study, but additional validating with other flight datasets or real-time flight tests would help to prove its more robustness. In addition, we can operate unmanned aerial vehicles (UAVs) stacked the navigation algorithms in this study and a controller-in-the-loop. The use of the controller-in-the-loop could be a more important validation criteria due to the potential for navigation-controller coupling. The research objectives and contributions presented here will remarkably advance the state-of-the-art techniques of vision-aided inertial navigation for UAVs.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: where for more detailed derivations, see reference [25,63].

Appendix B. Feature Initialization
In the first step of the measurement update, we employ Gauss-Newton least-squares minimization [58,64] to estimate feature 3D position ip f j /i . To avoid local minima, we apply the inverse depth parameterization of the feature position [65] that is numerically more stable than the Cartesian parameterization.
We assume that the intrinsic and extrinsic parameters of a stereo camera are known and constant values. c 1 , c 2 frames are the left and right camera frame of the stereo, respectively. Since the baseline of the stereo is fixed, rotation T c 2 /c 1 and translation c 2 p c 1 /c 2 between both cameras are constant and known values. Feature coordinates c [X, Y, Z] T with respect to both cameras are where m x , m y , and m z are scalar functions of given j-th measurement and constant extrinsic rotation matrix. Based on Equation (17), as the measurement equations from the stereo camera are right c 2 camera measurements are expressed in Ax = b form.
where let Therefore, the Gauss-Newton least-squares minimization estimates depth c 1 Z of left c 1 camera using the pseudo-inverse of A is If either estimated depth c 1Ẑ or c 2Ẑ is negative, the solution of the minimization is invalid since the feature is always in front of both camera frames observing it. By substituting estimated c 1Ẑ into Equation (A8), wherep f j /c is not related to the pose of the vehicle. Likewise, if a monocular camera is used instead, c 1 is the camera frame in which the feature was observed at the first time, and c 2 is the camera frame at a different time instance. The j-th feature 3D position with respect to the inertial frame is The new feature is initialized using only one image in which the feature is first observed. Although the new feature is initialized, as it still entails uncertainty, the EKF recursively estimates and updates its 3D position by augmenting into the state vector.
wherex V is the estimated vehicle state vector defined in Equation (8). The overall initialization includes the initial value of the feature state and its error covariance assignment. The error covariance of the new feature are initialized using state augmentation with Jacobian J.
where Jacobian J = ∂ p f /i ∂ x x is computed as follows, N f is the number of all features and P f new is the initial uncertainty of the initialized new feature. The error pertains to measurement noise and the error of the least-squares minimization and so on. In fact, since Montiel et al. [65] validate that the initial uncertainty is coded as Gaussian, the EKF including the feature initialization still holds optimality. Once initialized, the EKF processes the feature state in the prediction-update loop. In the time update of the EKF, we propagate P by Equation (4).
where state transition matrix Φ ≈ I + A ∆t.
, P f f is the error covariance of all features, and P V f = P T f V represents vehicle-feature correlations. In addition, we assume that surrounding is static, so the dynamics of featuresṗ f j /i = 0. In the measurement update of the EKF, only tracked features are used for the update. For the efficient management of the map database, if the size of the state vector exceeds than the maximum limit, then the feature with the least number of observations is pruned and marginalized.

Appendix C. Experimental Equipment and Environments
Burri et al. [53] provide the benchmark datasets of UAV flying, and Table A1 illustrates the specifications of the sensors used in the datasets. They obtain the noise model parameters from the IMU at rest and provide them; that is, σ a , σ ω , σ b a , and σ b ω are known. The intrinsic and extrinsic parameters of both cameras are also given; that is, f u , f v , T c/b , and b p c/b are known. The visual-inertial sensor unit is calibrated with Kalibr [66] prior to dataset collection. Furthermore, IMU and cameras are hardware time-synchronized such that the middle of the exposure aligned with the IMU measurements. Visual and inertial data is logged and timestamped onboard the MAV while ground truth is logged in the Vicon 6D motion capture system on the base station. The accuracy of the synchronization between the ground truth and the sensor data is limited by the fact that both sources are recorded on different machines and that the timestamps of the devices are unavailable for the ground-truth system. A maximum likelihood (ML) estimator [11] aligns the data temporally and calibrates the position of the ground-truth coordinate with respect to the body sensor unit. In fact, the ML estimator synchronizes the time-varying temporal offset between the ground truth and the body sensor system. Moreover, it determines the unknown transform between the ground-truth reference frame and the body frame. To obtain the full ML solution, they employ a batch estimator in an offline procedure. Finally, as ground truth, they provide the ML solutions instead of raw data.

Appendix D. Evaluation Error Metric
Sturm et al. [54] provide a set of tools used to preprocess the datasets and evaluate the tracking results. To validate estimation results, we need to evaluate the errors in the estimated trajectory by comparing it with the ground truth. Among various error metrics, two prominent methods are the absolute trajectory error (ATE) and the relative pose error (RPE). In this paper, to evaluate the overall performance of V-INS employing the outlier-adaptive filtering, the ATE measure is selected.

Absolute Trajectory Error (ATE)
The absolute trajectory error directly measures the difference between points of the true and the estimated trajectory. As a preprocessing step, we associate the estimated poses with ground-truth poses using the recorded timestamps. Based on this association, we align the true and the estimated trajectory using the Horn et al. [67]'s closed-form method based on singular value decomposition (SVD). Finally, we compute the differences between each pair of poses and output the mean, median, and standard deviation of these differences.