Simultaneous Robot–World and Hand–Eye Calibration without a Calibration Object

An extended robot–world and hand–eye calibration method is proposed in this paper to estimate the transformation relationship between the camera and the robot device. The approach is intended for mobile or medical robotics applications in which precise, expensive, or difficult-to-sterilize calibration objects, or sufficient movement space, cannot be made available at the work site. Firstly, a mathematical model is established to formulate the robot-gripper-to-camera rigid transformation and the robot-base-to-world rigid transformation using the Kronecker product. Subsequently, a sparse bundle adjustment is introduced to optimize the robot–world and hand–eye calibration, as well as the reconstruction results. Finally, a validation experiment including two kinds of real data sets is designed to demonstrate the effectiveness and accuracy of the proposed approach. The translation relative error of the rigid transformation is less than 8/10,000 for a Denso robot moving within a range of 1.3 m × 1.3 m × 1.2 m, and the mean distance measurement error after three-dimensional reconstruction is 0.13 mm.


Introduction
With the advance of robot vision systems, it is necessary to evaluate the geometric relationships among the robot, the sensors, and a reference frame. This problem is usually called "robot-sensor calibration", and it has been an active area of research for almost 40 years [1]. As research has progressed, the applications of robot-sensor calibration have extended into many domains, such as automobile assembly, robot navigation, and endoscopic surgery. As reported previously [2], the most widespread mathematical representations for the robot-sensor calibration problem can be grouped into two categories: AX = XB and AX = ZB.
The first class, and the most common robot-sensor calibration problem, is hand-eye calibration AX = XB, which was proposed by Tsai et al. [3] and Shiu et al. [4]. The earliest solution strategies estimated the rotation and translation of the homogeneous transformation X separately [5,6]. However, it was found that such methods propagate rotation error into the translation estimation. In later strategies, the rotation and translation of the homogeneous transformation X are solved simultaneously [7][8][9]. The above calibration methods solve the hand-eye relationship with different parametric approaches, such as the quaternion, dual quaternion, and Kronecker product, all of which depend on a known calibration object. However, there are many situations in which using an accurately-manufactured calibration object is impractical. For these particular situations, an extended robot-world and hand-eye calibration approach without a calibration target is proposed for a robotic visual measurement system. First, our approach improves the AX = ZB mathematical model by supposing that the different camera poses are known only up to an unknown scale factor, and proposes a fast linear method that gives an initial estimate of the calibration equation. Then, we combine space intersection and sparse bundle adjustment to refine the robot-world and hand-eye transformations, as well as the 3D reconstruction, simultaneously. Finally, we demonstrate the effectiveness, correctness, and reliability of our approach with relevant synthetic and real data experiments.

Initial Estimate
Supposing that we have an arbitrary position of the robotic system, from Figure 1 we can define the following: the homogeneous transformation matrix A is obtained by calibrating the extrinsic camera parameters with respect to a fixed calibration object; the homogeneous transformation matrix B is computed using the internal-link forward kinematics of the robot arm; X is the robot-gripper-to-camera rigid transformation, which is always constant because the camera is rigidly mounted on the robot gripper; and Z is the robot-base-to-world rigid transformation. These transformations are related by

AX = ZB. (1)

Now, let R_A, R_B, R_X, and R_Z ∈ SO(3) denote the respective 3 × 3 rotation matrices of A, B, X, and Z, and let t_A, t_B, t_X, and t_Z denote the respective 3 × 1 translation vectors, which are measured using the same scale unit. Equation (1) can easily be decomposed into a rotational matrix equation and a translational vector equation:

R_A R_X = R_Z R_B,   R_A t_X + t_A = R_Z t_B + t_Z. (2)

If there is no 3D calibration object, and hence no 2D-to-3D correspondences, we have to use SFM to estimate the camera poses from 2D-to-2D correspondences only. However, due to the lack of a given scale factor, SFM can reconstruct the structure of the scene and the camera poses only up to an unknown scale factor. We can therefore introduce an explicit scale factor α into the robot-world and hand-eye calibration equation, with reference to Andreff [10], and Equation (2) is transformed into

R_A R_X = R_Z R_B,   R_A t_X + α t_A = R_Z t_B + t_Z. (3)

Equation (3) can be used to formulate an objective function f(·) for non-linear optimization, which is based on the objective function for standard robot-world and hand-eye calibration proposed by [21] (Equation (4)), where W(q)^T Q(q) is an orthogonal matrix for quaternion q, and the parameters λ_1 through λ_4 are regularization factors (e.g., λ_1 = λ_2 = 1 and λ_3 = λ_4 = 10^6). In addition to the scale factor α, the rotations and translations associated with X and Z can be estimated simultaneously by solving Equation (4).
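For illustration, the following minimal numpy sketch checks the residuals of the rotational and scale-aware translational parts of Equation (3) for given 4 × 4 homogeneous transformations. It is only an illustration of the model, not the paper's reference implementation; the function names and the explicit α argument are ours.

```python
import numpy as np

def decompose(T):
    """Split a 4x4 homogeneous transform into its rotation and translation."""
    return T[:3, :3], T[:3, 3]

def residuals_eq3(A, B, X, Z, alpha):
    """Residuals of the rotational and scale-aware translational parts of Equation (3)."""
    RA, tA = decompose(A); RB, tB = decompose(B)
    RX, tX = decompose(X); RZ, tZ = decompose(Z)
    r_rot = RA @ RX - RZ @ RB                       # R_A R_X = R_Z R_B
    r_tra = RA @ tX + alpha * tA - (RZ @ tB + tZ)   # R_A t_X + alpha t_A = R_Z t_B + t_Z
    return r_rot, r_tra
```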
Referring to [20], we can also obtain separable solutions to the robot-world and hand-eye calibration problem by means of the Kronecker product. Since R_A and R_B are both orthogonal matrices, the orientation component of Equation (3) can also be represented as a homogeneous linear system in vec(R_X) and vec(R_Z) (Equation (5)), whose solution vectors can be computed efficiently by singular value decomposition (SVD). The symbol ⊗ denotes the Kronecker product, and the column vector operator vec reorders the coefficients of an (m × n) matrix A into an mn-vector vec(A) = (a_11, ..., a_1n, a_21, ..., a_mn) [26]. Once R_Z is calculated from Equation (5), t_X and t_Z are obtained from a linear system (Equation (6)) in t_X, t_Z, and α, whose solution can easily be determined by a least-squares technique.

However, the variation of the additional scale factor α brings instability into Equation (3). To overcome this problem, we propose a novel solution that eliminates α based on the Kronecker product. We define t_A^× as the skew-symmetric matrix corresponding to t_A. Since the scale factor α has no influence on the computation of the rotation, the rotational part of Equation (3) is unchanged, and the translational part of Equation (3) is multiplied on both sides by the skew-symmetric matrix t_A^×. Obviously, t_A^× t_A = [0, 0, 0]^T, so the new equation can be formulated as

t_A^× R_A t_X = t_A^× R_Z t_B + t_A^× t_Z. (7)

By Kronecker product theory, if AXB = C for an unknown matrix X, then the equation can be rewritten as a linear system in vec(X). Thus, Equation (7), together with the rotational part of Equation (3), can be reconstituted into a single homogeneous linear system (8) in the unknowns vec(R_X), vec(R_Z), t_X, and t_Z. The solution of the linear system (8) can be found by SVD, and since R_X and R_Z are rotation matrices, there is a proportionality constraint in that R_X and R_Z have a determinant value of 1; thus, the unique solution can be determined. Supposing that the solution of the linear system (8) is proportional to the right singular vector v corresponding to the minimum singular value, the resulting R_X and R_Z can be estimated as in Equation (9), where V_X = vec^{-1}(v_{1:9}), V_Z = vec^{-1}(v_{10:18}), vec^{-1} is the inverse operator of vec, and the proportionality constants are determined by the unit-determinant constraint (Equation (10)). The calculated robot-world and hand-eye translation vectors then follow from the remaining components of v, scaled by the same proportionality constant. However, the calculated matrices R_X and R_Z may not be strictly orthogonal due to noise; therefore, to ensure that they are indeed rotations, it is necessary to re-orthogonalize the computed rotation matrices.
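The construction of the linear system (8) and the recovery of R_X, R_Z, t_X, and t_Z can be sketched as follows. This is a hedged illustration under our own conventions (a row-major vec matching the definition above, for which vec(AXB) = (A ⊗ B^T) vec(X), and re-orthogonalization by SVD projection); it is not the authors' implementation, and the function names are ours.

```python
import numpy as np

def skew(t):
    """Skew-symmetric matrix t^x with t^x v = t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def nearest_rotation(M):
    """Project a noisy 3x3 matrix onto the closest proper rotation (re-orthogonalization)."""
    U, _, Vt = np.linalg.svd(M)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    return U @ D @ Vt

def kp_initial_estimate(As, Bs):
    """Linear initial estimate of X (gripper-to-camera) and Z (base-to-world).

    As: camera poses from SFM (translations known only up to scale); Bs: robot poses.
    Uses a row-major vec (matching numpy reshape), for which vec(AXB) = (A kron B^T) vec(X).
    The scale factor is eliminated by multiplying the translation equation with skew(t_A).
    """
    rows = []
    for A, B in zip(As, Bs):
        RA, tA = A[:3, :3], A[:3, 3]
        RB, tB = B[:3, :3], B[:3, 3]
        # Rotation part: R_A R_X R_B^T - R_Z = 0
        rows.append(np.hstack([np.kron(RA, RB), -np.eye(9), np.zeros((9, 6))]))
        # Translation part (alpha eliminated): skew(tA) R_A t_X - skew(tA) R_Z t_B - skew(tA) t_Z = 0
        tAx = skew(tA)
        rows.append(np.hstack([np.zeros((3, 9)), -np.kron(tAx, tB.reshape(1, 3)),
                               tAx @ RA, -tAx]))
    L = np.vstack(rows)
    v = np.linalg.svd(L)[2][-1]              # right singular vector of the smallest singular value
    VX, VZ = v[:9].reshape(3, 3), v[9:18].reshape(3, 3)
    c = np.cbrt(np.linalg.det(VX))           # proportionality constant (a rotation has determinant 1)
    RX, RZ = nearest_rotation(VX / c), nearest_rotation(VZ / c)
    tX, tZ = v[18:21] / c, v[21:24] / c
    return RX, tX, RZ, tZ
```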

Data Selection
SFM is a general method for obtaining camera poses from image correspondences, and it mainly consists of feature point detection, feature point matching, camera pose calibration, and reconstruction. Given coarsely matched feature points between two views, there is a significant number of outliers in the estimated camera pose transformations, and these outliers inevitably affect the accuracy of the initial estimate for extended robot-world and hand-eye calibration. RANSAC [27] is a simple but robust algorithm for outlier removal, which has been used widely in computer vision; in this section, we utilize it to enhance the robustness of the initial estimate.

Following the RANSAC method [15], we randomly select a certain number of two-view image correspondences and solve the extended robot-world and hand-eye calibration equation using the linear system (8). Firstly, since three pairs of camera pose solutions are just enough to determine a unique robot-world and hand-eye transformation [20], three pairs of camera orientation results are treated as the minimum sample size. Then, we predict Â_i using Equation (2) and define the rotation error e_R between the predicted and original rotations. Because the predicted translation t_Âi and the original translation t_Ai may not be calculated with the same scale factor, the translation error e_t is defined in a scale-independent manner. In addition, we combine the rotation and translation errors into a total error e; considering that the translation unit is always set to millimeters, we scale the translation error by 0.01 in order to balance the rotation and translation terms. Finally, we calculate the total error e for all valid random samples and determine the largest set of consistent pairs. In this section, we let the error threshold on e be 0.01 and the maximum outlier ratio be 50%. It should be noted that this selection process only provides an initial estimate, and there is no need to spend a substantial amount of time on minor accuracy improvements, so we stop RANSAC when the number of iterations reaches the maximum limit of 100.
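A possible realization of this selection step is sketched below, reusing the kp_initial_estimate function from the previous sketch. The concrete forms of e_R and e_t are assumptions (a Frobenius-norm rotation difference and a unit-normalized translation difference); the paper's exact definitions are not reproduced here.

```python
import numpy as np

def rotation_error(R1, R2):
    """Assumed rotation error e_R: Frobenius norm of the rotation difference."""
    return np.linalg.norm(R1 - R2, "fro")

def translation_error(t1, t2):
    """Assumed translation error e_t on unit-normalized vectors (the scale factors may differ)."""
    return np.linalg.norm(t1 / np.linalg.norm(t1) - t2 / np.linalg.norm(t2))

def ransac_select(As, Bs, thresh=0.01, max_iter=100, seed=0):
    """Keep the largest consensus set of pose pairs under the combined error e = e_R + 0.01 * e_t."""
    rng = np.random.default_rng(seed)
    n, best = len(As), []
    for _ in range(max_iter):
        idx = rng.choice(n, size=3, replace=False)          # minimal sample: three pose pairs
        RX, tX, RZ, tZ = kp_initial_estimate([As[i] for i in idx], [Bs[i] for i in idx])
        inliers = []
        for i in range(n):
            RB, tB = Bs[i][:3, :3], Bs[i][:3, 3]
            R_pred = RZ @ RB @ RX.T                          # rotation of A_hat = Z B X^-1
            t_pred = RZ @ tB + tZ - R_pred @ tX              # translation of A_hat
            e = (rotation_error(R_pred, As[i][:3, :3])
                 + 0.01 * translation_error(t_pred, As[i][:3, 3]))
            if e < thresh:
                inliers.append(i)
        if len(inliers) > len(best):
            best = inliers
    return best
```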

Sparse Bundle Adjustment
Following the initial estimate of the robot-world and hand-eye transformations by the Kronecker product, given by Equations (9) and (10), we employ bundle adjustment to jointly refine the robot-world and hand-eye transformations as well as the reconstruction results. Bundle adjustment is almost invariably used as the last step of feature-based 3D reconstruction and motion estimation algorithms in computer vision to obtain optimal solutions. Generally speaking, the goal of bundle adjustment is to minimize the overall reprojection error between the observed and predicted image points. The mathematical expression can be stated as follows: assume that m 3D points are seen in n views, and let x_ij denote the projection of the ith point in the jth image. Assume also that λ_ij equals 1 if the ith point is observed in the jth image, and 0 otherwise. Moreover, let A_j be the rigid homogeneous transformation from the jth image frame to the world frame, let G_i be the ith 3D point predicted by space intersection, and let P_j(·) be the predicted projection function of the jth image, including the camera-intrinsic parameters. The bundle adjustment model minimizes the reprojection error with respect to all 3D points and camera parameters, specifically:

min_{A_j, G_i} Σ_{i=1}^{m} Σ_{j=1}^{n} λ_ij || x_ij − P_j(A_j, G_i) ||^2. (15)

Problems of this form are typically tackled with non-linear least-squares routines such as the Levenberg-Marquardt or Gauss-Newton methods. Conventional bundle adjustment implementations solve the normal equations repeatedly, with complexity O(n^3) in the number of unknown parameters at each iteration. However, substantial time savings can be achieved by exploiting the sparse block structure of the normal equations [28]. A software implementation of sparse bundle adjustment along these lines was provided by Lourakis and Argyros [29].
In our experiment, we utilize their implementation to solve the extended robot-world and hand-eye calibration problem. In order to refine the initial guess of X and Z using sparse bundle adjustment, the homogeneous transformation A_j(α), which is known only up to an unknown scale factor, is substituted using the inverse of Equation (1), A_j = Z B_j X^{-1}, because the robot arm pose B_j, which is calibrated before delivery, provides real metric units. Then, the initial 3D coordinates of the points are calculated by space intersection (triangulation). Finally, the sparse bundle adjustment optimizes the robot-world transformation Z, the hand-eye transformation X, and the 3D coordinates of the target points G_i simultaneously, while keeping the robot motions B_j and the camera-intrinsic parameters constant. Specifically, the sparse bundle adjustment model can be rewritten as:

min_{X, Z, G_i} Σ_{i=1}^{m} Σ_{j=1}^{n} λ_ij || x_ij − P_j(Z B_j X^{-1}, G_i) ||^2. (16)

Note that the robot-world transformation Z and the hand-eye transformation X together consist of 6 rotation parameters and 6 translation parameters, while each point consists of 3 position parameters, so the total number of minimization parameters in Equation (16) equals 3m + 12. According to specific needs, a termination condition for the iteration can be set. The iterations are terminated when the estimated robot-world translation changes by less than 10^{-3} mm, or the reconstructed 3D points change by less than 10^{-3} mm, compared with the previous iteration, or when the maximum limit of iterations, which is ten in this paper, is reached.
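As an illustration of Equation (16), the following sketch refines X, Z, and the 3D points with scipy's least_squares solver. It is a simplified stand-in for the sparse bundle adjustment implementation of Lourakis and Argyros used in the paper: the axis-angle parameterization, the observation structure obs, the function names, and the evaluation cap are our assumptions, and a production version would also exploit the sparse Jacobian structure (e.g., via the jac_sparsity argument).

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def project(K, R_wc, t_wc, G):
    """Pinhole projection of world points G (Nx3) with a world-to-camera pose (R_wc, t_wc)."""
    Pc = (R_wc @ G.T).T + t_wc
    uv = (K @ Pc.T).T
    return uv[:, :2] / uv[:, 2:3]

def residuals(params, Bs, K, obs):
    """Reprojection residuals of Equation (16): the parameters are X, Z (axis-angle + translation)
    and the 3D points; the robot motions Bs and the intrinsics K stay fixed."""
    rX, tX = params[0:3], params[3:6]
    rZ, tZ = params[6:9], params[9:12]
    G = params[12:].reshape(-1, 3)
    X = np.eye(4); X[:3, :3] = Rotation.from_rotvec(rX).as_matrix(); X[:3, 3] = tX
    Z = np.eye(4); Z[:3, :3] = Rotation.from_rotvec(rZ).as_matrix(); Z[:3, 3] = tZ
    res = []
    for j, B in enumerate(Bs):
        A = Z @ B @ np.linalg.inv(X)                          # camera-to-world pose of view j (metric)
        R_wc, t_wc = A[:3, :3].T, -A[:3, :3].T @ A[:3, 3]     # invert to world-to-camera
        for i, x_ij in obs[j].items():                        # obs[j]: {point index: observed pixel}
            res.append(project(K, R_wc, t_wc, G[i][None, :])[0] - x_ij)
    return np.concatenate(res)

def refine(X0, Z0, G0, Bs, K, obs):
    """Refine X, Z, and the 3D points by non-linear least squares."""
    p0 = np.concatenate([Rotation.from_matrix(X0[:3, :3]).as_rotvec(), X0[:3, 3],
                         Rotation.from_matrix(Z0[:3, :3]).as_rotvec(), Z0[:3, 3],
                         G0.ravel()])
    # max_nfev is only a crude analogue of the ten-iteration cap described in the text
    sol = least_squares(residuals, p0, args=(Bs, K, obs), method="trf",
                        x_scale="jac", max_nfev=10)
    return sol.x
```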

Experiments
This section validates the proposed method for the extended robot-world and hand-eye calibration problem on both synthetic and real datasets. Regarding the data comparison, one cannot expect a method without a calibration object to obtain results as accurate as a method with a calibration object; our main purpose in this paper is to show that the estimation of the robot-world and hand-eye transformations is feasible without a calibration object. We refer to the means of data comparison used by previous extended hand-eye calibration methods, such as those presented by Nicolas Andreff [10], Jochen Schmidt [11], and Jan Heller [13]. We present an experimental evaluation of the extended robot-world and hand-eye methods, in which the estimation of rotation, translation, and scale factor can be formulated using the Kronecker product [20], quaternions [21], or the reprojection error [25]; a standard robot-world and hand-eye calibration method [25] with chessboard pattern calibration was used as an approximate truth-value, since no ground truth is available to compare the accuracy of the different methods. For convenience, in the following experiments, the labels "Dornaika" and "Shah" stand for the estimation of rotation, translation, and scale factor using the quaternions of Equation (4) and the Kronecker product of Equation (5), respectively. The label "KPherwc" stands for the proposed initial calibration method based on the Kronecker product of Equation (8), and the label "BAherwc" stands for the proposed optimization approach based on sparse bundle adjustment, Equation (16). VisualSFM [30], a state-of-the-art open-source SFM implementation, was used to obtain the camera poses of a general object in the real-data experiments. All results were obtained using a hybrid MATLAB 9.0 and C++ reference implementation, and the methods were run on an Intel Core i7-8750H processor running Linux.

Experiments with Synthetic Data
In order to simulate the actual process of robot motions, we used PUMA560 robot kinematics modeling and a camera imaging model, referring to Zhuang [18], to build a synthetic scene and a virtual camera; the PUMA560 is the most classic robot arm kinematics model, its parameters are very well known, and it has been described as the "white rat" of robotics research [31]. As shown in Figure 2, a red PUMA560 robot arm was constantly in movement with a different-colored camera attached to the end gripper. A synthetic scene consisting of 50 3D points was generated randomly inside a gray cube with a side length of 0.5 m, and 8 different virtual camera poses were created such that the cameras faced approximately toward the center of the cube. The intrinsic parameters of the virtual camera and the Denavit-Hartenberg (DH) parameters of the PUMA560 robot are listed in Tables 1 and 2, respectively.
To test the performance of the different methods against projection noise, simulated data were generated with the synthetic scene and the virtual camera. The scene 3D points were projected onto the image plane after each position movement, but projected points falling outside the image plane were neglected. In order to qualitatively analyze and evaluate the results of the synthetic experiment, we defined error evaluation criteria for rotation and translation as the relative errors between the true and estimated quantities, where R represents the true rotation, R̂ the estimated rotation, t the true translation, and t̂ the estimated translation. In the synthetic experiment, since the nominal values for the robot-world and hand-eye transformations can be set up in advance, there is no need to use a standard robot-world and hand-eye calibration method [25] as an approximate truth-value. We set ||t_X||_2 = 0.1 m and ||t_Z||_2 = 1 m.

The entire experiment is a four-step process. Firstly, considering that real-world feature point extraction is generally expected to have an accuracy within 1 pixel, the projection points in the synthetic experiment were corrupted by 6 different levels of Gaussian noise in the image domain, with a standard deviation η ∈ [0, 1] pixel in steps of 0.2 pixel. Secondly, according to the synthetic scene, we defined the nominal values for the hand-eye transformation X and the robot-to-world transformation Z with constant translations t_X and t_Z. Thirdly, we calculated a sequence of camera positions based on space resection, and the corresponding robot motions were calculated by B = Z^{-1} A X. Considering that the noise of the robot motion is fixed after production, we added a constant noise (σ = 0.025 mm) to the robot joint offsets. Finally, we estimated the homogeneous transformations X and Z with the above four methods and compared their rotation and translation errors under the various noise levels. For each noise level, 50 repeated experiments were run with randomly generated data sets, and the final value is the mean of all 50 errors.

Figure 3 illustrates the rotation and translation errors for each noise level using boxplots. Clearly, our optimization method ("BAherwc") exhibits the best behavior in both the rotation and translation estimation of the transformations X and Z, whereas the proposed initial calibration method ("KPherwc") performs worst under noise; thus, it is highly effective to refine the initial calibration results by the follow-up sparse bundle adjustment. Meanwhile, the translation relative errors estimated by "Shah" are slightly better than those estimated by "Dornaika". This is because the "Dornaika" method calculates the rotation and translation transformations of X and Z in a single step: due to noise, the estimated rotations may not be accurate rotation matrices, so a set of non-linear constraints has to be imposed on the rotation matrices, while the estimated translations are not updated with the orthogonality restriction, which causes the larger positional errors illustrated in Figure 3.
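The following small sketch shows one way to corrupt the projections and score an estimate in such a simulation; the concrete error formulas (rotation angle of R̂ R^T and translation error relative to the true norm) are our assumptions, since the paper's exact relative-error definitions are not reproduced in the extracted text.

```python
import numpy as np

def relative_errors(R_true, t_true, R_est, t_est):
    """Assumed criteria: rotation error = angle of R_est R_true^T (rad),
    translation error = ||t_est - t_true|| / ||t_true||."""
    dR = R_est @ R_true.T
    e_R = np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0))
    e_t = np.linalg.norm(t_est - t_true) / np.linalg.norm(t_true)
    return e_R, e_t

def add_pixel_noise(points_2d, sigma, rng=None):
    """Corrupt projected image points with zero-mean Gaussian noise of std sigma pixels."""
    rng = rng or np.random.default_rng(0)
    return points_2d + rng.normal(0.0, sigma, np.shape(points_2d))
```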

Experiments with Real Datasets
In this experiment, a Denso VS-6577GM serial 6-DOF robot arm with a Nikon D300s digital SLR camera and an AF NIKKOR 20 mm lens was used to acquire the real data. Since no ground-truth gripper-camera transformation is available for the real data, it is difficult to give direct error results for the computed robot-world and hand-eye transformations, as was done for the synthetic data. Therefore, it is desirable to measure the quality of the calibration results between the camera and the robot device in some indirect way. In the rest of this section, we arrange two different scenes to complete the accuracy assessment: scene A, with a general object, is used to show the general applicability of the proposed method compared with the standard robot-world and hand-eye calibration approaches, and scene B, with photogrammetric retro-reflective targets, is used to improve the feature point-locating accuracy and decrease the false match rate when calibrating the extrinsic camera parameters. Before the experiment, we used the method of [32] to calibrate the camera together with seven lens distortion parameters, and the images were undistorted before further use in order to improve the sparse bundle adjustment results.

Dataset A
With dataset A, our main purpose is not to prove how high the accuracy of our method is, but to demonstrate the feasibility of estimating the robot-world and hand-eye transformations without a calibration object in a general scene. Two image sets were acquired to assess the performance of the different methods under real-world conditions, as shown in Figure 4. Some consideration is needed for the absence of a ground truth in the real-data experiment: we cannot report errors between the real robot-world transformation and the computed one, as in the synthetic experiment. Since a method with a calibration object can usually obtain more accurate results than a method without one, a chessboard pattern was first used to solve the robot-world transformation Y_bar and the hand-eye transformation X_bar simultaneously by the Tabb method [25], and these were taken as approximate true values. Afterwards, we removed the chessboard pattern and used books as the object instead. We used the above "Dornaika", "Shah", "KPherwc", and "BAherwc" methods to calculate the homogeneous transformations X_scene and Y_scene with the general object of books. Finally, we compared their results with the approximate true values X_bar and Y_bar; the errors of the robot-world and hand-eye relationships are defined as the discrepancies between (X_scene, Y_scene) and (X_bar, Y_bar).

Figure 5a shows that the robot gripper carrying the camera took a series of photos around the center of the books. The gripper was positioned at ten different locations, and it was ensured that the books were entirely within the view in every frame. The camera was set to manual mode, and images of 4288 × 2848 pixels were taken using a PC remote control. After all the photos were taken, a fast open-source SFM implementation was used to obtain the camera poses A_i(α), and the robot motion transformations B_i were obtained from the known gripper-to-robot-base transformations. We then computed the robot-world and hand-eye transformations using the above four methods. Figure 5b shows the resulting 3D model output from bundle adjustment, containing 49,352 points in space, and the poses of all ten cameras. Due to the high number of correspondences, only every hundredth member of the set of corresponding image points was used in our experiment.
Table 3 summarizes the results obtained with the two image sets mentioned above. Compared with the other methods, our "BAherwc" method is closest to the results of the Tabb method [25] based on the chessboard pattern calibration. This is because the "BAherwc" method is initialized with the results of the "KPherwc" method and then directly minimizes the reprojection error, as in Tabb's reprojection method [25]. On the other hand, in the "Dornaika" and "Shah" methods, the variation of the scale factor introduces instability into the solution of the robot-world and hand-eye transformations during the SFM implementation. Of course, one cannot expect to obtain results as accurate as Tabb's standard calibration; however, depending on the application, the advantages of the proposed extended method may outweigh this drawback. This is especially true for the mobile robotics or endoscopy setups that we have in mind, where the robot-world and hand-eye calibration has to be performed under specific constraints, such as limited onboard weight or strict sanitary conditions. In order to achieve a rough qualitative check, we also measured the translation from the gripper to the camera lens center by hand using the known mechanical structure of the gripper and joint parts, which is approximately [0, 58, 66] mm. The translation estimated by our "BAherwc" approach is [0.183, 57.326, 64.910] mm, which is close to the physical measurement, showing the validity of the obtained results.

Dataset B
With dataset B, our main purpose is to provide a mobile benchmark for large-scale digital photogrammetry under industrial conditions, which requires a robot to move along a guide rail to complete multi-station stitching measurements. In order to reduce the noise disturbance caused by SFM, we used photogrammetric retro-reflective targets (RRTs) to obtain accurate feature point matching. As the name suggests, RRTs consist of a dense arrangement of small glass beads (Figure 6a, bottom-right) with good retro-reflective performance: the reflected light intensity in the light source direction is up to hundreds of times larger than that of a general diffuse reflection target. Thus, it is easy to obtain subpixel feature point-locating accuracy in images with complex backgrounds. As indicated in Figure 6a, dozens of RRTs and two yellow invar alloy scale bars S_1 and S_2 constitute a photogrammetric control field; two 6 mm diameter coded RRTs were rigidly fixed at the ends of each yellow scale bar, and the distance between them had been accurately measured by a laser interferometer. Furthermore, the coded RRTs are encoded with specific distribution patterns, which enables automatic matching of corresponding image points. The robot gripper carrying the camera then took a series of photos around the center of the photogrammetric field, ensuring that all the RRTs were within the camera view in every frame. Afterwards, we used the Hartley five-point minimal relative pose method [33] and photogrammetric bundle adjustment to calibrate and optimize the extrinsic camera parameters. Figure 6b shows the distribution of the camera poses and RRTs.
After bundle adjustment, the camera was moved to 20 different poses facing the RRTs and scale bars, with the Denso robot moving within a volume of 1.3 m × 1.3 m × 1.2 m. Since photogrammetric relative orientation yields high-precision camera poses A_i (i = 1, ..., 20), we solved for the hand-eye transformation X and the robot-to-world transformation Z by three methods (the "Dornaika", "Shah", and our "BAherwc" methods) based on the existing camera poses A_i and robot motions B_i. Then, the predicted camera poses Â_i can be computed from Equation (1) as Â_i = Z B_i X^{-1}. In this section, the discrepancy between Â_i and A_i is taken as the basis for the accuracy assessment of the robot-world and hand-eye calibration. Considering the difference in scale factor between t_Âi and t_Ai, all translations are normalized beforehand, and the mean rotation and translation errors over all motions (from 1 to 20) are computed as relative errors.

Comparisons of the accuracy in rotation and translation for the photogrammetric scene data set are provided in Table 4. It can be seen that our method "BAherwc" is almost half an order of magnitude better than the other methods in both rotation and translation estimation. The rotation error is less than 5/10,000, and the translation relative error is less than 8/10,000. Our optimization method "BAherwc" outperforms the "Dornaika" and "Shah" methods mainly because the feature extraction and matching accuracy of the retro-reflective targets is significantly higher than that of the general targets used by SFM; this is expected behavior, as the "BAherwc" method, by minimizing the overall reprojection error, depends on the feature extraction accuracy. This experiment confirms the validity of our approach for calibrating the transformation parameters between the camera and the robot device based on the digital photogrammetric system.
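A sketch of this indirect assessment is given below; the unit-normalization of the translations and the specific rotation and translation error formulas are assumptions rather than the paper's exact definitions.

```python
import numpy as np

def predicted_pose(Z, B, X):
    """Predicted camera pose A_hat = Z B X^-1 from Equation (1)."""
    return Z @ B @ np.linalg.inv(X)

def mean_relative_errors(As, Bs, X, Z):
    """Mean rotation/translation discrepancies between predicted and measured camera poses.
    Translations are unit-normalized first because the photogrammetric poses carry an
    arbitrary scale; the exact relative-error formulas of the paper are assumed here."""
    e_R, e_t = [], []
    for A, B in zip(As, Bs):
        A_hat = predicted_pose(Z, B, X)
        dR = A_hat[:3, :3] @ A[:3, :3].T
        e_R.append(np.arccos(np.clip((np.trace(dR) - 1.0) / 2.0, -1.0, 1.0)))  # rotation angle error
        t_hat = A_hat[:3, 3] / np.linalg.norm(A_hat[:3, 3])
        t = A[:3, 3] / np.linalg.norm(A[:3, 3])
        e_t.append(np.linalg.norm(t_hat - t))
    return np.mean(e_R), np.mean(e_t)
```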
Since the two invar alloy scale bars, S_1 and S_2, provide both feature correspondences and a distance measurement reference, we also evaluated the relative accuracy of the 3D reconstruction of our "BAherwc" approach using distance measurements. The average distance measurement errors of the two scale bars are given in Table 5. For comparison, Ŝ_1 and Ŝ_2 are defined as the scale-bar lengths calculated from the reconstruction results of our "BAherwc" method, and the distance measurement errors are defined as the differences between these calculated lengths and the nominal lengths of S_1 and S_2.

Finally, to show the iterative process of our bundle adjustment method, Figure 7 illustrates the variation of the distance estimation errors of the scale bars S_1 and S_2 at each iteration. One can see that although the initial reconstruction results are clearly inaccurate, the reconstruction errors converge after a finite number of iterations, and the final differences between the nominal and measured values of the two scale bars are close to 0.1 mm. Offline photogrammetry systems offer the highest precision and accuracy levels: the precision of feature point measurement can be as high as 1/50 of a pixel, yielding typical measurement precision on the object in the range of 1:100,000 to 1:200,000 [34], with the former corresponding to 0.01 mm for an object 1 m in size. The absolute accuracy of length measurements is generally 2-3 times lower (e.g., about 0.025 mm for a 1 m long object) than the precision of the object point coordinates, which expresses the relative accuracy of 3D shape reconstruction. The relative accuracy of reconstruction by our bundle adjustment method is also influenced by the robot arm, and it can be improved by follow-up photogrammetric network design.
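As a small illustration of this evaluation, assuming that the reconstructed 3D coordinates of the coded RRTs at the two ends of a scale bar are available from the bundle adjustment output, the distance measurement error can be computed as follows (the function and variable names are ours).

```python
import numpy as np

def scale_bar_error(p_end1, p_end2, nominal_length_mm):
    """Distance measurement error of a scale bar: reconstructed end-to-end distance minus
    the interferometer-calibrated nominal length (all quantities in millimeters)."""
    measured = np.linalg.norm(np.asarray(p_end1) - np.asarray(p_end2))
    return measured - nominal_length_mm
```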

Conclusions
In this paper, we present an extended approach for robot-world and hand-eye calibration without the need for a calibration object. In order to obtain the calibration data, we use two kinds of extrinsic calibration: one for the computer vision system with SFM, and the other for the digital photogrammetric system with RRTs. These two calibration methods both estimate the extrinsic camera parameters without a known scale factor. Meanwhile, the robot end-gripper pose is computed using the manipulator's forward kinematics, whose parameters are generally assumed to be known. We then use a fast initial estimation for the extended robot-world and hand-eye calibration based on the Kronecker product. After the initial guess, to further improve the calibration results, we use sparse bundle adjustment to optimize the robot-world and hand-eye transformation relationships along with the reconstruction. Finally, to evaluate and verify the feasibility of the proposed method, four accuracy assessment solutions were designed for the synthetic-data and real-data experiments. It is shown that our "BAherwc" approach can maintain a certain accuracy and robustness without a calibration object under low noise disturbance, and that the Denso VS-6577GM, rigidly mounted to the floor, can obtain relatively reliable reconstruction results for follow-up photogrammetric stitching measurements. In the future, we will move the industrial robot along the guide rail to expand the measurement range of the calibration procedures.