A New Edge Patch with Rotation Invariance for Object Detection and Pose Estimation

Local patch-based methods of object detection and pose estimation are promising. However, to the best of the authors’ knowledge, traditional red-green-blue and depth (RGB-D) patches contain scene interference (foreground occlusion and background clutter) and have little rotation invariance. To solve these problems, a new edge patch is proposed and experimented with in this study. The edge patch is a local sampling RGB-D patch centered at the edge pixel of the depth image. According to the normal direction of the depth edge, the edge patch is sampled along a canonical orientation, making it rotation invariant. Through a process of depth detection, scene interference is eliminated from the edge patch, which improves the robustness. The framework of the edge patch-based method is described, and the method was evaluated on three public datasets. Compared with existing methods, the proposed method achieved a higher average F1-score (0.956) on the Tejani dataset and a better average detection rate (62%) on the Occlusion dataset, even in situations of serious scene interference. These results showed that the proposed method has higher detection accuracy and stronger robustness.


Introduction
Object detection and pose estimation (ODPE) are important research topics in semantic navigation, robotic intelligent manipulation, and other fields. Although intensive work has been conducted, ODPE tasks remain challenging owing to scene interference problems. In this paper, only two kinds of scene interference, i.e., foreground occlusion and background clutter, are involved. In general, there are ODPE methods based on artificial features (local or global), machine learning, and local patches.
Global feature-based methods are robust to background clutter, but will suffer in situations with occlusion [1][2][3][4][5]. Local feature-based methods are robust to foreground occlusion, but only perform well for objects with enough feature points [6][7][8][9]. Furthermore, the representation ability of artificial features is not adequate for the diversity of objects.
Additionally, ODPE methods based on machine learning have achieved many remarkable results [10][11][12]. Compared with artificial feature-based methods, these learning-based methods are more adaptable to objects with various attributes. The object pose can be learned by random forests [13][14][15] or convolutional neural networks (CNNs) [16][17][18]. These methods directly use raw images for end-to-end learning and prediction, achieving real-time performance. However, the random forests or CNNs used in ODPE tasks need to be retrained for each new target object, which makes the learning-based methods not flexible enough.
Recently, local patch-based methods have been proposed, which use machine learning frameworks to learn adaptive descriptors of local red-green-blue and depth (RGB-D) patches. For instance, Doumanoglou et al. [19] trained a sparse auto-encoder to encode local RGB-D patches extracted from synthetic views and testing scenes. However, the scene interference contained in the patch reduces the matching accuracy between patches, leading to performance degradation during ODPE tasks. To improve the robustness of random forests against scene interference, Tejani et al. [20] integrated a z-check process into the similarity detection of training patches. However, without obviating the scene interference in the patches, the improvement in robustness brought by learning methods is limited. Kehl et al. [21] eliminated regions of scene interference in the depth channel by checking depth values, leaving RGB channels unconsidered.
Moreover, as far as the authors know, the traditional RGB-D patches have little rotation invariance, including those used by Kehl et al. [21]. This is because no canonical directions are selected, and the feature encoders are sensitive to the in-plane rotation of input data. To solve these problems, Zhang et al. [22] expanded the patch dataset by rotating the view of each rendering viewpoint at 10-degree intervals. However, this strategy introduces rotation quantization errors of up to 5 degrees (half of the rotation interval), which affect the accuracy of feature matching.
Therefore, an RGB-D patch with rotation invariance and robustness against scene interference is desired. For this reason, a new edge patch (E-patch) is proposed in this study. The E-patch is a local RGB-D patch centered at the edge pixel of the depth image. The advantages of the E-patch are summarized as follows:

• The E-patch is rotation invariant. In the sampling process, a canonical orientation is extracted to make the E-patch rotation invariant. Thus, it is not necessary to expand the E-patch library by rotating rendering views of the target object, avoiding quantization errors in the process of feature matching.

• The E-patch contains less scene interference. During the depth detection process, the scene interference is eliminated in all four channels of the E-patch. This ensures the robustness of the E-patch against scene interference.

These two advantages result in the proposed E-patch-based method obtaining higher detection accuracy and stronger robustness to scene interference.
The rest of this paper is organized as follows: Section 2 describes the generation, encoding, and usage of E-patch. The experimental results and discussion are presented in Section 3, and Section 4 concludes the paper.

Sampling Center Extraction
A schematic diagram of occlusion between object A (Duck) and object B (Glue) is shown in Figure 1.

Using the gradient filtering algorithm, edges in the depth image were extracted and divided into foreground edges and background edges. These two kinds of depth edges are marked on the RGB image (Figure 1a) and the point cloud (Figure 1b). Because the background edges cannot represent the real contour of object B, only foreground edge pixels were selected as sampling centers. The selection criterion was defined by Equation (1), where z_edge is the depth value of the query edge pixel, z_neighbor are the depth values of edge pixels in the 3 × 3 neighborhood of the query edge pixel, and δ_edge is the threshold used in the abovementioned gradient filtering process. Figure 2 shows an extraction result. The desktop in Figure 2a was first extracted using the random sample consensus (RANSAC) algorithm [23], and irrelevant scene points (black pixels in Figure 2b) below the desktop were removed. Sampling centers are drawn as green pixels in Figure 2b.
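As a concrete illustration, the sampling-center extraction described above can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the gradient filter and the exact form of the elided Equation (1) are assumptions; the criterion is reconstructed here as "the query edge pixel is nearer than some edge pixel in its 3 × 3 neighborhood by more than δ_edge".

```python
import numpy as np

def sampling_centers(depth, delta_edge=30.0):
    """Extract foreground depth-edge pixels as E-patch sampling centers.

    A sketch of the two-step procedure in the text: (1) gradient filtering
    marks depth-edge pixels; (2) an edge pixel is kept as a sampling center
    only if it is a foreground edge, i.e., it is nearer than some edge pixel
    in its 3x3 neighborhood by more than delta_edge (an assumed
    reconstruction of Equation (1)). Border pixels are skipped.
    """
    # Step 1: gradient filtering on the depth image.
    gy, gx = np.gradient(depth.astype(float))
    edges = np.hypot(gx, gy) > delta_edge

    centers = []
    h, w = depth.shape
    for u in range(1, h - 1):
        for v in range(1, w - 1):
            if not edges[u, v]:
                continue
            # Step 2: depths of edge pixels in the 3x3 neighborhood.
            win = edges[u - 1:u + 2, v - 1:v + 2]
            nbh = depth[u - 1:u + 2, v - 1:v + 2][win]
            if nbh.max() - depth[u, v] > delta_edge:
                centers.append((u, v))  # foreground edge pixel
    return edges, centers
```

For example, on a synthetic depth image with a single depth step, only the near side of the step is returned as sampling centers, matching the foreground-edge selection in Figure 1.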

E-Patch Sampling along a Canonical Orientation
The sampling process of the E-patch is shown in Figure 3. An E-patch P with a size of 32 × 32 × 4 was sampled from a square region in the input image I. The image coordinate frame of I is Frame_I, which has the principal axes I_u and I_v. The sampling square's coordinate frame Frame_s is marked with its principal axes (G_u^s, G_v^s). G_n^s is the canonical orientation of Frame_s.

The sampling square is centered at the edge pixel p_0 and has a side length of L. To make the E-patch scale invariant, L was calculated via Equation (2):

L = ⟦L_s · f_c / z_0⟧,

where L_s = 50 mm is a fixed metric size of the E-patch, f_c is the focal length of the camera, z_0 is the depth of p_0, and ⟦·⟧ is the rounding function.

Each neighboring edge pixel of p_0 within the distance of L/2 was collected and denoted as p_i (i = 1, 2, …). To make the E-patch rotation invariant, the canonical orientation G_n^s of the sampling square was aligned with a unit vector n, which was determined by Equation (3):

n = g/‖g‖,

where the weighted sum g of gradient directions was calculated using Equation (4):

g = Σ_i w_i · g_i.

In Equation (4), g_i is the gradient direction of p_i, and the weighting coefficient w_i was calculated by Equation (5), where d_i is the pixel distance between p_0 and p_i.

During the sampling process, a point set G_s = {G_ij}, ∀i, j ∈ {1, …, 32}, was arranged in the sampling square. The coordinates of G_ij were calculated using Equation (6):

(u_ij^I, v_ij^I)^T = R · (u_ij^S, v_ij^S)^T + (u_0, v_0)^T,

where (u_ij^I, v_ij^I) and (u_ij^S, v_ij^S) are the coordinates of G_ij in Frame_I and Frame_s, respectively, and (u_0, v_0) is the coordinate of p_0 in Frame_I. u_ij^S and v_ij^S were respectively calculated by Equations (7) and (8). The rotation matrix R was expressed as Equation (9), where n_u and n_v are the horizontal and vertical components of n in Frame_I.

As described in Equation (10), the E-patch P was obtained by sampling the original image I in the four RGB-D channels using the same rules. In the E-patch, the pixel values in the RGB channels ranged from 0 to 255, while in the depth channel, values ranged from 0 mm to 4000 mm. To balance the pixel values across the four channels, Equations (11) and (12) were applied to each E-patch, where P_depth are the pixel values in the depth channel, P_rgb are the pixel values in the RGB channels, and P′_depth and P′_rgb are the corresponding updated pixel values.
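The scale- and rotation-invariant sampling can be sketched as follows. Equations (2)–(4) follow the definitions in the text; the weighting function w_i (Equation (5)) and the sign convention of the rotation matrix R (Equation (9)) are not reproduced in this extraction, so the linear distance falloff and the R layout below are assumptions.

```python
import numpy as np

def patch_side_length(z0, fc, Ls=50.0):
    # Equation (2): project the fixed metric size Ls (mm) to pixels.
    return int(round(Ls * fc / z0))

def canonical_orientation(grad_dirs, dists, L):
    """Equations (3)-(5): weighted sum of neighboring gradient directions.

    grad_dirs: (N, 2) gradient directions g_i of the edge pixels p_i;
    dists:     (N,) pixel distances d_i to the center p_0.
    The weighting w_i (a linear falloff with distance) is an assumption;
    the paper's Equation (5) is not reproduced verbatim here.
    """
    w = 1.0 - 2.0 * np.asarray(dists) / L                 # hypothetical w_i
    g = (w[:, None] * np.asarray(grad_dirs)).sum(axis=0)  # Equation (4)
    return g / np.linalg.norm(g)                          # Equation (3)

def sample_grid(p0, n, L, size=32):
    """Equation (6): map the 32x32 grid from Frame_s into Frame_I.

    The rotation matrix R (Equation (9)) is built from the components
    (n_u, n_v) of the canonical orientation n; its sign convention is an
    assumption.
    """
    nu, nv = n
    R = np.array([[nu, -nv],
                  [nv,  nu]])                     # Equation (9), assumed form
    s = np.linspace(-L / 2.0, L / 2.0, size)      # grid coordinates in Frame_s
    us, vs = np.meshgrid(s, s, indexing="ij")
    grid_s = np.stack([us, vs], axis=-1)          # (size, size, 2)
    return grid_s @ R.T + np.asarray(p0)          # rotate, then translate
```

Sampling I at the returned grid positions (with the same rule in all four channels, Equation (10)) yields the E-patch; because the grid is aligned with n, an in-plane rotation of the scene rotates the grid with it.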

Depth Detection
The key to ODPE methods based on the E-patch is the similarity matching between E-patches extracted from synthetic views and real scenes. The original E-patch in a realistic scene contains regions of foreground occlusion and background clutter, as shown in Figure 4. This leads to differences between realistic and synthetic E-patches. Therefore, a process of depth detection was used to eliminate the regions of occlusion and clutter. Firstly, the regions of foreground occlusion were detected with the criterion P_depth < −1, and patches with occlusion ratios higher than 30% were abandoned. Then, the criterion P_depth > 1 was used to detect the regions of background clutter. All four channels were set to zero for pixels in the regions of occlusion and clutter, which enhanced the robustness of the E-patch against scene interference.
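The depth-detection step can be sketched directly from the two criteria above. This is a minimal sketch assuming the depth channel has already been normalized so that the object's own surface falls in [−1, 1]; the channel layout (depth last) is an assumption.

```python
import numpy as np

def depth_detection(patch, occ_thresh=0.30):
    """Eliminate scene interference from a normalized E-patch.

    patch: (32, 32, 4) array; channels 0-2 are RGB and channel 3 is the
    normalized depth P_depth. Returns the cleaned patch, or None if the
    patch is discarded because its occlusion ratio exceeds 30%.
    """
    d = patch[..., 3]
    occlusion = d < -1.0   # foreground occlusion: nearer than the object
    clutter = d > 1.0      # background clutter: farther than the object
    if occlusion.mean() > occ_thresh:
        return None        # too occluded to match reliably
    cleaned = patch.copy()
    cleaned[occlusion | clutter] = 0.0   # zero all four channels
    return cleaned
```

Zeroing the interfering pixels in all four channels (rather than only in depth) is what makes the realistic E-patch resemble its synthetic counterpart.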

Encoding Network Training
A CNN-based encoder, Net_coder, was constructed and trained within the Siamese network framework. As shown in Figure 5a, Net_coder takes in an E-patch and computes a 16-dimensional descriptor. It includes two convolutional layers (Conv) and three fully connected layers (FC). Each convolutional layer is followed by a rectified linear unit (ReLU) as the activation function. Since the size of the input E-patch is only 32 × 32 × 4, to avoid information loss, only one maximum pooling layer (Max-pool) was introduced, after the first convolutional layer. Each fully connected layer is followed by a parametric rectified linear unit (PReLU) as the activation function, which avoids the premature death of neurons. Dropout layers (Drop-out) follow the first two fully connected layers to prevent overfitting during training.
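The layer ordering described above can be sketched as a runnable NumPy forward pass. The kernel sizes (5 × 5) and the channel/unit counts (16, 32, 256, 64) are assumptions, since the paper's exact hyperparameters are not reproduced here, and the random weights are placeholders for trained parameters; dropout is omitted at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, w):
    # 'valid' convolution, stride 1; x: (H, W, Cin), w: (k, k, Cin, Cout).
    k, H, W = w.shape[0], x.shape[0], x.shape[1]
    out = np.empty((H - k + 1, W - k + 1, w.shape[3]))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.tensordot(x[i:i + k, j:j + k, :], w, axes=3)
    return out

def maxpool2(x):
    # 2x2 maximum pooling.
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).max(axis=(1, 3))

relu = lambda x: np.maximum(x, 0.0)
prelu = lambda x, a=0.25: np.where(x > 0, x, a * x)

def net_coder(patch):
    """Forward pass following the described Net_coder layer order:
    Conv-ReLU-Maxpool, Conv-ReLU, then three FC layers with PReLU."""
    w1 = rng.normal(0, 0.1, (5, 5, 4, 16))   # Conv1: 32x32x4 -> 28x28x16
    w2 = rng.normal(0, 0.1, (5, 5, 16, 32))  # Conv2: 14x14x16 -> 10x10x32
    x = maxpool2(relu(conv2d(patch, w1)))    # -> 14x14x16
    x = relu(conv2d(x, w2)).ravel()          # -> 3200
    for n_out in (256, 64, 16):              # FC1-FC3 with PReLU
        x = prelu(rng.normal(0, 0.1, (n_out, x.size)) @ x)
    return x                                 # 16-dimensional descriptor
```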

Two parameter-sharing encoders Net_coder were combined into a Siamese network, as shown in Figure 5b. patch_1 and patch_2 are the E-patches in a patch pair, and label_sim is the similarity label of the pair (label_sim = 1 for a similar patch pair and label_sim = 0 for a dissimilar pair). f_1 and f_2 are the features of patch_1 and patch_2, respectively. The contrastive loss function loss_cont is formalized in Equation (13), where N is the number of patch pairs, df_i is the Euclidean distance between the features of the E-patches in the ith pair, and margin is the threshold value (which here was 1).
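Equation (13) is not reproduced in this extraction; the sketch below assumes the standard contrastive form (similar pairs pulled together, dissimilar pairs pushed apart up to the margin), which is consistent with the surrounding definitions.

```python
import numpy as np

def contrastive_loss(f1, f2, label_sim, margin=1.0):
    """Contrastive loss over a batch of N patch pairs (assumed standard form).

    f1, f2: (N, D) feature batches; label_sim: (N,) with 1 for similar
    pairs and 0 for dissimilar ones.
    """
    df = np.linalg.norm(f1 - f2, axis=1)                      # Euclidean df_i
    pull = label_sim * df ** 2                                # similar pairs
    push = (1 - label_sim) * np.maximum(margin - df, 0) ** 2  # dissimilar pairs
    return (pull + push).mean() / 2.0
```

A similar pair with identical features and a dissimilar pair farther apart than the margin both contribute zero loss; dissimilar pairs inside the margin are penalized quadratically.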
A total of 0.6 million patch pairs were generated from the LineMod dataset [4] to train the Siamese network. The ratio of similar to dissimilar patch pairs was 1:1. The parameters of Net_coder were optimized using the root-mean-square propagation (RMSprop) algorithm to minimize the contrastive loss loss_cont. This was equivalent to pulling similar E-patches together and pushing dissimilar ones apart.

Object Detection and Pose Estimation Based on E-patch
The proposed E-patch-based method consists of two phases, offline modeling and online testing, as shown in Figure 6. In the online testing phase, processes of object detection and pose estimation were carried out simultaneously. The same CNN-based encoder was used in both phases to guarantee the consistency of the feature coding principle.

In the offline modeling phase, each target object was uniformly rendered from 1313 perspectives. Note that because of the rotation invariance of the E-patch, no in-plane rotation was needed for the rendering views. The features of all E-patches in the rendering images were computed and used to construct the codebook. To improve retrieval efficiency, the codebook was arranged in a k-d tree according to the Euclidean distances between features, denoted as Tree_F.
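The codebook lookup can be sketched as follows. The paper arranges the features in a k-d tree (Tree_F) for efficiency; this sketch uses an equivalent brute-force search for clarity, and the annotation contents are placeholders.

```python
import numpy as np

def build_codebook(features, annotations):
    # Each entry pairs a descriptor with its annotation info = {obj, set T_o^p}.
    return np.asarray(features, dtype=float), list(annotations)

def query(codebook, f, k=100):
    """Return the k nearest codebook entries to the scene feature f.

    Brute-force nearest-neighbor search; a k-d tree (as in the paper)
    returns the same neighbors with lower query cost.
    """
    feats, infos = codebook
    dists = np.linalg.norm(feats - np.asarray(f, dtype=float), axis=1)
    order = np.argsort(dists)[:min(k, len(infos))]
    return [(infos[i], dists[i]) for i in order]
```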
All the coordinate systems used in the construction of the codebook are shown in Figure 7. C_obj^set, C_p^set, and C_c^set are the local coordinate systems of the target object, the synthetic E-patch, and the rendering camera, respectively. In the codebook, the feature of each E-patch was stored together with an annotation, info = {obj, set T_o^p}. Here, obj is the name of the target object, and set T_o^p is the transformation from C_obj^set to C_p^set, which was obtained by Equation (14):

set T_o^p = set T_c^p · set T_o^c,

where set T_o^c is the known transformation from C_obj^set to C_c^set and set T_c^p is the transformation from C_c^set to C_p^set, which was calculated via Equation (15), where (p_x, p_y, p_z) is the spatial coordinate of the sampling center of the E-patch.

Online Testing
In the online testing phase, the local coordinate systems of the target object, realistic E-patch, and testing camera were denoted as C_obj^scene, C_p^scene, and C_c^scene, respectively. The transformation relationship between the coordinate systems in the scene is expressed as Equation (16):

scene T_o^p = scene T_c^p · scene T_o^c,

where scene T_o^p is the transformation from C_obj^scene to C_p^scene and scene T_o^c is the transformation from C_obj^scene to C_c^scene, i.e., the pose of the target object. scene T_c^p is the transformation from C_c^scene to C_p^scene, which was also determined by the canonical orientation and sampling center of the scene E-patch, similarly to Equation (15).
For matching E-patches, it is reasonable to assume that the transformation relationship between the coordinate systems of the E-patch and the object in the realistic scene is the same as that in the virtual scene (i.e., scene T_o^p = set T_o^p). Therefore, according to Equations (14) and (16), the object pose scene T_o^c was determined by Equation (17):

scene T_o^c = (scene T_c^p)⁻¹ · set T_o^p.

Each E-patch in the testing scene was encoded as a feature f with the same encoder Net_coder used in the offline phase. Its 100 nearest neighbors in Tree_F were searched and denoted as f_j (j = 1, …, 100).
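The pose recovery of Equations (14)–(17) can be verified with homogeneous transforms. This is a minimal sketch: all transform values below are hypothetical, and the scene patch frame is constructed to be consistent with the assumed ground-truth pose.

```python
import numpy as np

def rigid(rz_deg, t):
    # Build a 4x4 homogeneous transform: rotation about z plus translation.
    a = np.deg2rad(rz_deg)
    T = np.eye(4)
    T[:2, :2] = [[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]]
    T[:3, 3] = t
    return T

# Offline (Equation (14)): the codebook stores set T_o^p = set T_c^p @ set T_o^c.
set_T_o_c = rigid(30, [0.1, 0.0, 0.5])    # hypothetical rendering pose
set_T_c_p = rigid(-10, [0.0, 0.2, 0.0])   # camera -> synthetic patch frame
set_T_o_p = set_T_c_p @ set_T_o_c

# Online (Equations (16)-(17)): for a matched patch pair,
# scene T_o^p == set T_o^p, so the pose follows from the scene patch frame.
true_pose = rigid(75, [0.3, -0.1, 0.8])                # hypothetical ground truth
scene_T_c_p = set_T_o_p @ np.linalg.inv(true_pose)     # consistent scene patch
pose = np.linalg.inv(scene_T_c_p) @ set_T_o_p          # Equation (17)
```

Here `pose` reproduces the assumed ground-truth object pose exactly, which is the point of Equation (17): one matched E-patch pair is enough to cast one full 6D pose vote.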

Each neighboring feature f_j generated a vote v_j = {obj_j, scene T_o^c_j} based on its annotation info_j = {obj_j, set T_o^p_j} stored in the codebook. The confidence conf_j of the vote v_j was calculated by Equation (18), where the weighting coefficients w_j and α_j were respectively calculated according to Equations (19) and (20). The mean shift algorithm was used to cluster the voting poses successively in the translational space and the rotational space. For each cluster of votes, the clustering center was regarded as a hypothetical pose, and the total weight of the cluster was regarded as the corresponding confidence. To ensure operational efficiency, only the top 80% of hypothetical poses according to their confidence values were retained. After a hypothesis verification process similar to that used by Li et al. [7], the estimated results were finally obtained.
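The translational stage of the vote clustering can be sketched with a flat-kernel mean shift. This is a minimal sketch: the bandwidth is a hypothetical parameter, and the subsequent rotational-space clustering and hypothesis verification are omitted.

```python
import numpy as np

def mean_shift_modes(votes, weights, bandwidth=0.05, iters=30):
    """Cluster translational votes with a weighted flat-kernel mean shift.

    votes: (N, 3) translation components of the voting poses;
    weights: (N,) vote confidences conf_j.
    Returns the distinct cluster centers and their total weights.
    """
    pts = votes.copy()
    for _ in range(iters):
        for i in range(len(pts)):
            near = np.linalg.norm(votes - pts[i], axis=1) < bandwidth
            w = weights[near]
            pts[i] = (w[:, None] * votes[near]).sum(0) / w.sum()
    # Merge converged points into distinct modes and sum their weights.
    modes, totals = [], []
    for p, w in zip(pts, weights):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < bandwidth:
                totals[k] += w
                break
        else:
            modes.append(p)
            totals.append(w)
    return np.array(modes), np.array(totals)
```

The total weight of each mode plays the role of the hypothesis confidence used for the top-80% retention step.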

Experiments and Discussion
In this section, the robustness of our E-patch-based method to occlusion and clutter is demonstrated through two experiments on public datasets. The results of these two experiments also show the improvement in detection accuracy. In addition, experimental results on a third dataset indicate that our method also achieves high accuracy in cases of slight clutter.

Detection Results
The Tejani dataset [20], which contains the six target objects shown in Figure 8, was chosen to demonstrate the robustness of the proposed method to background clutter. The numbers of testing scenes for the objects are 337, 501, 838, 556, 288, and 604, respectively. Each testing image contains two or three instances of the same kind of target object. Although this dataset contains only slight occlusion, its different levels of background clutter pose a challenge to ODPE tasks. Figure 9 shows the results of our method in three testing scenes. In each row, the left subfigure is a scene image, the middle subfigure is a preprocessed scene overlaid with edge pixels, and the estimated poses are shown in the right subfigure as green transparent models, where the scene is displayed in gray for better visibility.
An estimated pose was considered correct when its intersection over union (IoU) score was higher than 0.5 [16]. The F1-scores of the proposed method are compared with those of state-of-the-art methods in Table 1. The results of the comparison methods were obtained from [7,24]. The proposed method obtained a higher average F1-score (0.956) than the other methods (0.910, 0.885, 0.747, and 0.939), which indicates that the use of the E-patch provided higher detection accuracy. The method in [7] depends only on depth information. Due to the small size of 'Camera', its space points are insufficient, resulting in a significant reduction in its F1-score.
The methods in [16,21] are learning-based methods trained with synthetic models. Therefore, differences between synthetic and realistic scenes caused by the scene interference affect the detection results. This is especially true for the small object 'Camera' and pure white object 'Milk'.
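The IoU-based correctness criterion used in this evaluation can be sketched with 2D axis-aligned boxes. This is a minimal sketch of the standard IoU computation; the evaluation protocol itself follows [16].

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    xa1, ya1, xa2, ya2 = box_a
    xb1, yb1, xb2, yb2 = box_b
    iw = max(0.0, min(xa2, xb2) - max(xa1, xb1))
    ih = max(0.0, min(ya2, yb2) - max(ya1, yb1))
    inter = iw * ih
    union = (xa2 - xa1) * (ya2 - ya1) + (xb2 - xb1) * (yb2 - yb1) - inter
    return inter / union

def pose_correct(box_est, box_gt):
    # An estimated pose counts as correct when the IoU exceeds 0.5.
    return iou(box_est, box_gt) > 0.5
```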
For each object, a clutter index was designed to represent the clutter level quantitatively. It was calculated as the average proportion of the background region in a radial neighborhood of 50 pixels around the projection of the object center. The clutter index of each object is shown in Table 2. Taking the clutter index as the abscissa and the F1-score as the ordinate, curves were drawn (Figure 10) to illustrate the influence of background clutter on the F1-scores of all the mentioned methods.

Taking a clutter index of 77.5% as the dividing point, the objects were divided into two groups: those with slight clutter ('Joystick', 'Milk', and 'Juice Carton') and those with heavy clutter ('Coffee Cup', 'Shampoo', and 'Camera'). For the objects with slight clutter, the average F1-score of the proposed method was 0.962, while those of the methods in [7,16,21,24] were 0.956, 0.899, 0.74, and 0.959, respectively. The pure white color of 'Milk' made RGB-D patches inside the object too similar to distinguish, leading to the failure of [21]. The E-patch is located at the depth edge and contains features of the object contour as well as its RGB-D appearance. Moreover, by sampling along the canonical orientation, descriptor variation caused by in-plane rotation was avoided.
For the objects with heavy clutter, our average F1-score was 0.95, while those of the methods in [7,16,21,24] were 0.864, 0.872, 0.755, and 0.919, respectively. With the aggravation of background clutter, our average F1-score decreased by 0.012, while those of the methods in [7,16,24] decreased by at least 0.027. These data prove that the proposed E-patch achieved stronger robustness against clutter. The reasons for these phenomena are explained in detail later. Note that the average F1-score of the method in [21] increased by 0.015 because of its poor performance on the object 'Milk', which made this method unsuitable for robustness analysis.
The aforementioned improvements in detection accuracy and robustness against clutter are owing to the advantages of the E-patch. Considering a synthetic E-patch P_syn and the corresponding realistic E-patch P_rel, the relationship between them is expressed as Equation (21):
P_rel = R(dθ)·P_syn + ε, (21)
where R(dθ) denotes an in-plane rotation by the deviation angle dθ, and ε is the variation of the E-patch caused by background clutter.
Taking E(·) as the encoding function, the features of the two E-patches are obtained via Equations (22) and (23):
f_syn = E(P_syn), (22)
f_rel = E(P_rel), (23)
where f_syn and f_rel are the features of P_syn and P_rel, respectively. Therefore, the feature distance between f_syn and f_rel can be expressed as Equation (24):
dis = ||f_syn − f_rel||. (24)
The rotation invariance of the E-patch made dθ ≈ 0 ('≈' indicates 'close to'), and the elimination of background clutter in the depth detection process made ε ≈ 0. Both of these led to dis ≈ 0, and a smaller dis means a more accurate feature-matching result. Consequently, the E-patch is beneficial to improving the detection accuracy and robustness to clutter of ODPE methods.
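The effect captured by Equations (21)–(24) can be illustrated with toy stand-ins for the rotation and encoding functions. Both `rotate` and `encode` below are placeholders, not the actual E-patch encoder; the point is only that removing the rotation deviation and the clutter term drives the feature distance toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)


def rotate(patch, dtheta):
    """Toy in-plane rotation: 90-degree steps only, for illustration."""
    return np.rot90(patch, k=int(dtheta // 90))


def encode(patch):
    """Toy encoding function E(.): flatten and L2-normalize."""
    f = patch.astype(float).ravel()
    return f / np.linalg.norm(f)


p_syn = rng.random((8, 8)) + 1.0   # synthetic E-patch
eps = 0.3 * rng.random((8, 8))     # clutter variation epsilon

# Without canonical orientation or clutter removal: dtheta != 0, eps != 0
p_rel = rotate(p_syn, 90) + eps
dis_raw = np.linalg.norm(encode(p_syn) - encode(p_rel))

# With canonical orientation and depth detection: dtheta ~ 0, eps ~ 0
dis_canon = np.linalg.norm(encode(p_syn) - encode(p_syn))
```

A smaller distance means a more reliable feature match, which is the mechanism behind the accuracy gains reported above.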

Computation Time
The average time of our online testing phase on the Tejani dataset was 903.4 ms, which is close to the 774.5 ms in Liu et al. [24]. The online testing phase consisted of four stages, namely, 'Patch sampling', 'Feature encoding', 'Hypothesis generation', and 'Hypothesis verification'. The 'Feature encoding' stage was implemented in a Jupyter notebook environment with an NVIDIA Tesla T4 graphics processing unit (GPU). Other stages were implemented in a MATLAB environment, running on a laptop with an Intel central processing unit (CPU, i7-4720HQ).
In our online testing phase, the computation times of each stage were 153.9 ms, 16.5 ms, 228.7 ms, and 504.3 ms, respectively, while those in [24] were 12.5 ms, 47.4 ms, 186.2 ms, and 528.4 ms, respectively. In 'Feature encoding' and 'Hypothesis verification', our times were roughly the same as those in [24]. The introduction of a canonical orientation led to longer times in 'Patch sampling' and 'Hypothesis generation', which was acceptable considering the improvement in the detection accuracy. In addition, the computation time of the depth detection process was negligible.
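As a quick sanity check, the per-stage times do add up to the reported totals:

```python
# Stage times (ms) reported for each online testing phase.
ours = {"Patch sampling": 153.9, "Feature encoding": 16.5,
        "Hypothesis generation": 228.7, "Hypothesis verification": 504.3}
liu = {"Patch sampling": 12.5, "Feature encoding": 47.4,
       "Hypothesis generation": 186.2, "Hypothesis verification": 528.4}

total_ours = round(sum(ours.values()), 1)  # 903.4 ms
total_liu = round(sum(liu.values()), 1)    # 774.5 ms
```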

Results on the Occlusion Dataset
The Occlusion dataset [14] was used to test the robustness of the proposed method to the occlusion problem. Figure 11 shows the eight objects in the dataset. To compare with the testing results reported in [25], the same 200 scenes were selected. All eight objects coexist and occlude each other in each testing scene, which is challenging for ODPE tasks.
Detection results of our method in three scenes are shown in Figure 12. In each row, the left subfigure is a scene image, the middle subfigure is a preprocessed scene overlaid with edge pixels, and the right subfigure shows estimated poses with green transparent models, where the scene is displayed in gray for better visibility.

Any estimated pose with a visible surface discrepancy (VSD) score of less than 0.3 was considered correct [25]. The detection rates (percentages of correct poses) of all eight objects in the Occlusion dataset were calculated, as shown in Table 3.
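A simplified sketch of the VSD criterion, loosely following [25], is given below. It assumes rendered depth maps and visibility masks for the estimated and ground-truth poses are already available, and it omits details of the full definition (e.g., the exact cost function and tolerances used in [25]):

```python
import numpy as np


def vsd_error(d_est, d_gt, visib_est, visib_gt, tau=20.0):
    """Simplified visible surface discrepancy: average mismatch over
    the union of the two visibility masks (depths in the same units as tau)."""
    union = visib_est | visib_gt
    inter = visib_est & visib_gt
    # Pixels visible in both renders mismatch if depths differ by more than tau;
    # pixels visible in only one render always count as a mismatch.
    diff = np.abs(d_est - d_gt) > tau
    costs = np.where(inter, diff, True)
    return costs[union].mean()
```

Under the criterion used here, a pose would then be accepted when this error is below 0.3.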
The methods used in [8,9] rely only on the point pair feature, and perform well in most scenes with good point-cloud quality. However, when the main plane of a flat object ('Glue') flips, its point-cloud quality deteriorates rapidly. This leads to a significant reduction in the detection rate.
For quantitative analysis, Table 4 shows the occlusion rate of each object, which is the average proportion of occlusion over all testing scenes. The detection rates of all five methods against different occlusion rates are displayed in Figure 13.
Using an occlusion rate of 27.5% as the boundary, the objects were divided into two groups: those with slight occlusion ('Hole Punch', 'Duck', 'Ape', 'Can', and 'Egg Box') and those with heavy occlusion ('Driller', 'Glue', and 'Cat'). For the objects in the first group, our average detection rate was 63%, while those of the methods in [5,8,9,15] were 59.6%, 65.4%, 59.4%, and 52.8%, respectively. This means that the E-patch performs acceptably under slight occlusion (lower only than the method in [8]). In particular, 'Egg Box' was the most difficult object for the proposed method, because its textureless appearance and repeated edges made E-patches too similar to distinguish. This problem may be solved by introducing a more sophisticated process of hypothesis verification, which will be conducted in future work.
For the objects in the second group, our average detection rate was 60.7%, while those of the methods in [5,8,9,15] were 36.7%, 45.7%, 45%, and 48%, respectively. Owing to the heavy occlusion, our average detection rate decreased by 2.3%, which was a lower decrease than those for the aforementioned methods (decreased by at least 4.8%). These results indicate that the E-patch is more robust to occlusion problems. They also prove the effectiveness of eliminating occlusion regions during the depth detection process.
Similar to the theoretical analysis of the first experiment, our improvement in detection accuracy and robustness to occlusion can be explained by Equation (25):
P_rel = R(dθ)·P_syn + ε′, (25)
where ε′ represents the alteration of the E-patch caused by foreground occlusion. The rotation invariance of the E-patch made dθ ≈ 0, and the elimination of occlusion regions in the depth detection process made ε′ ≈ 0. Therefore, dis ≈ 0, and feature matching became more accurate.
Consequently, E-patch is conducive to increased detection accuracy and robustness to occlusion in ODPE methods.

Results on the Doumanoglou Dataset
The Doumanoglou dataset [19] was chosen to demonstrate the effectiveness of the proposed E-patch-based method in the case of light clutter. Figure 14 shows the 10 objects in the dataset, four pairs of which belong to the same category. Compared with the first two datasets, the Doumanoglou dataset contains less background clutter, which is suitable for analyzing the basic detection performance of ODPE methods. The dataset contains 351 testing scenes, each of which has multiple objects placed on the desktop. The detection results of our method in two scenes are shown in Figure 15.
In each row, the left subfigure is a scene image, the middle subfigure is a preprocessed scene overlaid with edge pixels, and estimated poses are shown in the right subfigure with green transparent models, where the scene is displayed in gray for better visibility.
The clutter index of each object is shown in Table 5, which indicates the Doumanoglou dataset has slight clutter.

Objects      Clutter Indexes
Amita        13.6
Colgate      55.5
Sensors 2020, 20, x 15 of 17
Figure 15. Some detection results on the Doumanoglou dataset.
An estimated pose was considered correct when its IoU score was higher than 0.5. As shown in Table 6, our detection rates were generally higher than those of the method in [19], which revealed the high accuracy of the proposed method in the case of slight clutter. It should be noted that the method in [19] has a low detection rate for the 'Colgate' object. This is because the narrow surfaces of 'Colgate' result in too many RGB-D patches near the edge. These patches usually contain background clutter, which cannot be eliminated by the method in [19]. 'Lipton' and 'Oreo' have similar problems.

Conclusions
A new E-patch for ODPE tasks was proposed herein. The advantages of the E-patch were described and evaluated on three public datasets. The proposed method improved the F1-score from 0.939 to 0.956 on the Tejani dataset and improved the detection rate from 58% to 62% on the Occlusion dataset. With intensifying background clutter, the F1-score of the proposed method decreased less (by 0.012) than did those of the comparison methods (by more than 0.027). When the occlusion level increased, the detection rate of the proposed method decreased by 2.3%, while those of the comparison methods decreased by at least 4.8%. These results prove that the proposed method is more accurate and more robust to scene interference. One limitation of the proposed method is that it does not handle textureless objects with repeated edges well, which is worth further study.
