Local Feature-Aware Siamese Matching Model for Vehicle Re-Identification

Abstract: Vehicle re-identification is attracting increasing attention in intelligent transportation and is widely used in public security. Compared with person re-identification, vehicle re-identification is more challenging because vehicles with different IDs are produced by a unified pipeline and can only be distinguished by subtle differences in features such as lights, ornaments, and decorations. In this paper, we propose a local feature-aware Siamese matching model for vehicle re-identification. The model focuses on the informative parts of an image, which are the parts most likely to differ among vehicles with different IDs. In addition, we utilize Siamese feature matching to better supervise the attention. Furthermore, a perspective transformer network, which can eliminate image deformation, is designed for feature extraction. We conducted extensive experiments on three large-scale vehicle re-ID datasets, i.e., VeRi-776, VehicleID, and PKU-VD, and the results show that our method is superior to the state-of-the-art methods.


Introduction
Vehicle re-identification (re-ID) takes a query image of a vehicle and returns the images in a database that show the same vehicle ID. It is widely used in intelligent transportation, public security, and urban computing [1][2][3]. The most straightforward approach to vehicle re-ID is license-plate recognition [4]; however, a license plate is not always visible. Figure 1 shows cases in which a vehicle ID cannot be determined from the license plate. For example, license plates are sometimes occluded, illegally used, or invisible in some views. In particular, the number plates, models, and colors of genuine cars are sometimes copied onto other vehicles for illegal activities, such as smuggling, assembling, scrapping, and stealing vehicles; this practice is known as "car cloning." Vehicle IDs therefore cannot be distinguished based on license plates alone in several scenarios, and vehicle re-ID through other image-based features is urgently needed (code available at https://github.com/WangHonglie/LFASM_pytorch). In recent years, deep learning [5], person re-ID [6][7][8], and fine-grained retrieval [9,10] have achieved remarkable success. Vehicle re-ID datasets, such as VeRi-776 [3], VehicleID [11], and PKU-VD [12], have been released, thereby facilitating research based on deep learning. However, because the differences among vehicles are inconspicuous, vehicle re-ID remains difficult.
The main challenge of vehicle re-ID is distinguishing between two vehicles of the same or similar types. Images of different vehicles captured from the same view may look more alike than images of the same vehicle captured from different angles. Owing to camera resolution and shooting angle, obtaining a high-quality vehicle image is sometimes difficult. Thus, the differences between vehicles are often inconspicuous, as shown in Figure 2. Different vehicles of the same model are similar in global appearance and thus difficult to distinguish. Most existing studies focus on the entire image, from which such subtle differences cannot be easily distinguished.
Figure 2. (a) The two vehicles above are of the same type but have different IDs. They can be distinguished based on the windshield stickers (green box) and ornaments (red box). The two vehicles below are of a similar type but have different engine hoods (pink box), headlights (blue box), and air intakes (yellow box). (b) It is obviously easier to distinguish the vehicles based on their key parts.
Unlike general classification problems, the number of categories in a re-ID problem is uncertain. Therefore, some metric learning methods are committed to reducing the distance between images of the same vehicle, and enlarging the distance between images of different vehicles. Schroff et al. [13] proposed triplet loss, which directly optimizes the feature embedding. Bai et al. [14] combined the local structural constraints to generate feature embedding more effectively. He et al. [15] proposed the Triplet-Center loss, which jointly considers the distances inside a class and relationships between different classes.
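To make the metric-learning objective concrete, the following is a minimal sketch of a triplet loss on toy embeddings. It is an illustration only: the use of plain (non-squared) Euclidean distance, the margin value of 0.3, and the function names are assumptions made here and are not taken from the cited works.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss.

    The loss is zero once the anchor is closer to the positive (same
    vehicle) than to the negative (different vehicle) by at least
    `margin`. The margin value is a hypothetical choice.
    """
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


# A well-separated triplet gives zero loss; a violated one is penalized.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
hard = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

Minimizing this quantity over many triplets pulls images of the same vehicle together and pushes images of different vehicles apart, which is the behavior described above.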
In this paper, we propose an effective feature extractor that finds more fine-grained features by training with two forms of supervision. The first is an end-to-end classification module, in which a local net selects the regions of interest and another extractor, a transposed convolutional layer placed after a convolutional layer (CTL), is proposed to find more implicit features. The second is a Siamese net, which matches the local features of two images and supervises the attention better. Inspired by the spatial transformer network (STN) [16], we propose a perspective transformer network (PTN), which has more degrees of freedom and can eliminate the deformation in images. To further improve retrieval accuracy, we re-ranked the re-ID results using the method of Zhong et al. [17], which places more true matches at the top of the ranking list.
In summary, our major contributions to the literature of this field are threefold.
• We propose a local feature-aware Siamese matching model (LFASM) that can learn the local feature matching of different images. The matching provides additional supervision so that the network is better trained, increasing the distance between classes and reducing the distance within classes.
• To focus on the informative parts, we propose a local feature net that provides supervised attention to the regions of interest, thereby assigning different weights to different parts of the input. Unlike some methods [18][19][20] that rely on additional information (such as spatial, temporal, and part labels), our method is based only on the images of vehicles.
• We also propose a PTN, which can project a picture onto a new view plane and eliminate the deformation of images. Compared with the STN [16], the PTN has greater flexibility for image transformation.
The remainder of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 describes the proposed local feature-aware Siamese matching model for vehicle re-ID and some details about our experiment. In Section 4, we discuss the experimental results, and Section 5 gives our conclusions.

Related Work
In this section, we review the existing studies on vehicle re-ID.

Vehicle Re-ID
Vehicle re-ID has become a major research area over the past decade. Owing to the development of convolutional neural networks (CNNs) [21,22], extracting deeper image features has become easier. Liu et al. [3] released the VeRi-776 dataset, which includes multiview vehicle images, and Liu et al. [11] released the large-scale VehicleID dataset. Yan et al. [12] contributed two richly annotated vehicle datasets, VD1 and VD2, obtained in real time from two cities and containing high-resolution images. Wang et al. [19] utilized 20 key-point locations of vehicles to extract orientation information and proposed an orientation-invariant feature embedding module. De et al. [23] proposed a two-stream Siamese classification model for vehicle re-ID, and Wei et al. [24] proposed a recurrent neural network-based hierarchical attention (RNN-HA) network, which combines a large number of attributes for vehicle re-ID. Bai et al. [14] proposed a group-sensitive triplet embedding approach that can model the interclass differences. Recently, He et al. [20] considered both local and global representations to propose a valid learning framework for vehicle re-ID; however, their method depends on labeled parts and is therefore labor-intensive. Schroff et al. [13] first proposed the use of triplet loss to help the model directly learn the feature embedding. The effect of triplet loss largely depends on the choice of training samples. Therefore, Hermans et al. [25] proposed hard mining, which chooses hard positive and negative samples to train the network better. Furthermore, Chen et al. [26] proposed a quadruplet network to make training more effective.

Fine-Grained Visual Recognition
Although identifying the main category of an object (such as a computer, mobile phone, or water cup) is easy, determining its fine-grained class (such as the species of a bird or the model of a computer) is far more challenging. The greatest challenge is that the visual differences between the subcategories of the same main category are minimal. Vehicle re-ID is a typical example of fine-grained recognition, which is mainly addressed using part-based models and representation learning models. Zhang et al. [27] learned both whole-object and part detectors for fine-grained object recognition. Fully convolutional network (FCN) attention [28] can adaptively select the attention area and efficiently localize multiple object parts. Lin et al. [29] proposed a bilinear structure comprising two feature extractors that can model pairwise feature interactions in an invariant manner.

Attention Mechanisms
The attention mechanism stems from the study of human vision. To make rational use of limited visual-information-processing resources, humans select specific parts of the visual area and then focus on them. For example, when reading, only a few words are noticed and processed at one time. The basic idea of visual-attention mechanisms is to enable a model to ignore irrelevant information and focus on significant information. The attention mechanism has various forms of implementation, which mainly fall into soft and hard attention. Typical examples of soft attention include the STN [16], the residual attention network [30], and two-level attention [31]. Hard attention models, in contrast, must predict the region of interest explicitly and are usually trained through reinforcement learning [32].

Proposed Method
We propose a local feature-aware Siamese matching (LFASM) model for vehicle re-ID. In this section, we provide a brief overview of the problem of vehicle re-ID and put forward our framework (Section 3.1). Then, we present the local feature-aware module, which is capable of learning more significant information (Section 3.2), and describe how we match the corresponding parts (Section 3.3). Finally, we present our feature extractor in Section 3.4 and its implementation in Section 3.5.

Framework and Overview
Given a query vehicle image, the goal of vehicle re-ID is to retrieve the set of gallery images that have the same ID as the query image. We assume that images of vehicles with the same ID have more similar feature embeddings. Therefore, these feature embeddings must be extracted, and the similarity score between the embedding of the query image and those of the gallery images must be calculated. The training set is defined as $\{(x_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the identification label of image $x_i$ and $N$ is the number of training images. The similarity between query image $q$ and gallery image $g$ is defined as $D(\phi(q; \theta), \phi(g; \theta))$, where $\phi(\cdot; \theta)$ is the feature extractor and $D(\cdot)$ is a metric function. To obtain a better feature extractor, the parameter $\theta$ is learned by minimizing a loss function $L$ through gradient descent, where $w$ denotes the weight vector.
Figure 3 shows the framework of the proposed LFASM model for vehicle re-ID. It comprises two branches: one in charge of the ID classification, and the other used for Siamese local feature matching to better supervise our attention module. Each branch comprises two modules. The first is a local net that outputs an attention descriptor, $m \in \mathbb{R}^{C \times H \times W}$; each score in $m$ represents the amount of attention required in the corresponding area. The second is an attention-based feature extractor that extracts the deep features of the input images.

Local Feature
This module aims to determine which informative parts deserve the greatest attention (e.g., outline, lights, windshield stickers, engine hood, and ornaments), as shown in Figure 2. The goal is to make our system more responsive to differences in these parts so that it can effectively distinguish vehicle identities. The local feature net is an additional neural network that assigns different weights to different parts of the input. It outputs an attention descriptor, $m \in \mathbb{R}^{H \times W}$, representing the importance of the different parts of the features. To prevent $m$ from being negative, we use softplus [33] as the activation function. We project the attention descriptor $m$ onto the first feature map, $f_1 \in \mathbb{R}^{C \times H \times W}$, by element-wise multiplication and obtain the masked feature, $f'_1 \in \mathbb{R}^{C \times H \times W}$. For each $f_{i,j} \in \mathbb{R}^{C}$ and $m_{i,j} \in \mathbb{R}$, where $(i, j)$ is the spatial location in $f_1$ and $m$, the corresponding output tensor, $f'_{i,j} \in \mathbb{R}^{C}$, is determined as follows:
$$f'_{i,j} = m_{i,j} \, f_{i,j} \qquad (2)$$
To limit the value of $m_{i,j}$ to the range 0-1, we normalize the attention map $m$ by
$$m = \frac{m}{\max(\lVert m \rVert_p, \epsilon)} \qquad (3)$$
where $p = 2$ and $\epsilon = 1 \times 10^{-12}$.
Figure 4 shows the key parts of the vehicle images selected by our attention model. The model filters out most of the background and the parts of the vehicle that carry little information. White regions in Figure 4 correspond to values of $m$ close to 1, whereas black regions correspond to values close to 0. This module finds the noteworthy parts of the images and reduces the influence of noise in the remainder of the image, so that the model is more focused and can more readily distinguish different vehicles.
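The masking step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions: feature maps are nested lists rather than tensors, the L_p normalization is taken over the whole attention map (the text does not state the dimension), and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def normalize_map(m, p=2, eps=1e-12):
    """Divide an H x W map by max(||m||_p, eps), with p and eps as in the text.

    Assumption: the norm is computed over all H x W entries.
    """
    norm = sum(abs(v) ** p for row in m for v in row) ** (1.0 / p)
    denom = max(norm, eps)
    return [[v / denom for v in row] for row in m]


def apply_attention(f, m):
    """Multiply every channel of a C x H x W feature by the H x W map m."""
    height, width = len(m), len(m[0])
    return [
        [[channel[i][j] * m[i][j] for j in range(width)] for i in range(height)]
        for channel in f
    ]


# Toy example: one channel, 2 x 2 spatial grid.
attention = normalize_map([[0.0, 3.0], [4.0, 0.0]])
masked = apply_attention([[[1.0, 2.0], [3.0, 4.0]]], attention)
```

Because softplus keeps every score positive and the normalization bounds each entry by 1, locations with small scores are suppressed while high-scoring locations pass through almost unchanged.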

Siamese Match
Although local features are widely used in image retrieval [34,35], class labels alone do not provide sufficient supervision for learning local features that distinguish images. To strengthen the training of local features in the network, we propose a Siamese feature matching module, which informs the network whether the two input images belong to the same ID. This additional supervision trains the network better, increasing the distance between classes and reducing the distance within classes.
Consider two images with their labels, $(p, y_p)$ and $(q, y_q)$, where $y$ denotes the identification label. The features of these two images are denoted as $\phi(p; \theta)$ and $\phi(q; \theta)$, respectively. We measure the similarity of the two feature embeddings through a dot product:
$$s(p_i, q_j) = p_i^{\top} q_j \qquad (4)$$
where $i$ and $j$ are the positions of $p$ and $q$ in the feature map, respectively. The target label $y$ can be computed as
$$y = \frac{1}{C(p)} \sum_{\forall j} s(p_i, q_j) \, g(p_j) \qquad (5)$$
where $C(p) = \sum_{\forall j} s(p_i, q_j)$ normalizes the result, $g(p_i) = W_g p_i$, and $W_g$ denotes the weights to be learned for this pair of features. When $y_p = y_q$, the target label $y$ converges to 1; otherwise, it converges to 0.
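As a rough illustration of the matching step, the sketch below computes the pairwise dot-product similarity between the local features of two images and normalizes each row by C(p). It covers only these two ingredients; the learned embedding g and the final weighted sum are omitted. The list-of-vectors layout, the function names, and the assumption that each normalizer is nonzero (e.g., for non-negative features) are all choices made here for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def similarity_matrix(p_feats, q_feats):
    """s[i][j] = dot product of feature p_i with feature q_j."""
    return [[dot(p_i, q_j) for q_j in q_feats] for p_i in p_feats]


def normalized_weights(sim):
    """Divide each row of the similarity matrix by its sum, C(p)."""
    weights = []
    for row in sim:
        c = sum(row)
        if c == 0:
            raise ValueError("normalizer C(p) is zero for this position")
        weights.append([s / c for s in row])
    return weights


# Two positions per image, two channels per position (toy values).
p_feats = [[1.0, 0.0], [0.0, 2.0]]
q_feats = [[1.0, 1.0], [3.0, 1.0]]
sim = similarity_matrix(p_feats, q_feats)
weights = normalized_weights(sim)
```

Each row of `weights` sums to 1, so it can be read as how strongly position i of the first image responds to each position of the second image.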

Attention-Based Feature Extractor
PTN. Vehicle pictures are taken by surveillance cameras and essentially show the projection of the real scene onto the camera chip, as in Figure 5. Owing to different camera parameters and environmental factors, the obtained vehicle pictures often contain varying degrees of distortion. To eliminate the effects of projection transformations in different scenes, we propose a PTN, which predicts the transformation $\theta$ to apply to the input image using Equation (6), as shown in Figure 6. The main structure of the PTN comprises two convolutional networks, both of which output a 3 × 3 transformation matrix. We apply this transformation matrix to the features after the first block. The first two rows of the matrix are identical to the affine matrix, which implements linear transformation and translation, and the third row implements the perspective transformation:
$$\begin{pmatrix} x' \\ y' \\ w' \end{pmatrix} = \begin{pmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \\ \theta_{31} & \theta_{32} & \theta_{33} \end{pmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} \qquad (6)$$
where $(x_i^t, y_i^t)$ are the target coordinates of the regular grid in the output feature map, and $(x_i^s = x'/w', \; y_i^s = y'/w')$ are the source coordinates in the input feature map that define the sample points. The main purpose of the PTN is to eliminate the deformation of vehicles caused by perspective transformations in the images.
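The coordinate mapping performed by a 3 × 3 perspective matrix can be illustrated as follows. This is a minimal sketch of the homogeneous-coordinate arithmetic only; the matrix values and the function name are hypothetical, and the sampling of feature values at the resulting source coordinates is not shown.

```python
def perspective_source_coords(theta, x_t, y_t):
    """Map a target grid point (x_t, y_t) to source coordinates.

    theta is a 3 x 3 matrix given as nested lists. The first two rows act
    like an affine transform; the third row produces the divisor w' that
    makes the mapping a perspective transformation.
    """
    x_p = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
    y_p = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
    w_p = theta[2][0] * x_t + theta[2][1] * y_t + theta[2][2]
    if w_p == 0:
        raise ValueError("point maps to infinity (w' = 0)")
    return x_p / w_p, y_p / w_p


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shift = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [0.0, 0.0, 1.0]]
tilt = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]

same = perspective_source_coords(identity, 2.0, 1.0)
moved = perspective_source_coords(shift, 2.0, 1.0)
squeezed = perspective_source_coords(tilt, 2.0, 0.0)
```

With the third row fixed to (0, 0, 1) the mapping reduces to an affine transform, as in the `shift` example, which is why a full 3 × 3 matrix offers more freedom than the affine case.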

CTL. We used ResNet-50 [36] as the base model of the feature extractor after the PTN. As mentioned earlier, one component of the model is dedicated to extracting explicit key areas of images; however, some implicit features that play an important role in the re-ID task cannot be extracted at the pixel level. Therefore, we applied an attention map $M \in \mathbb{R}^{1 \times H \times W}$ to the intermediate feature map. The activated feature, $f' \in \mathbb{R}^{C \times H \times W}$, can be expressed as follows:
$$f' = M \otimes f \qquad (7)$$
where $\otimes$ and $f$ denote element-wise multiplication and the input feature, respectively. We obtained the attention map $M$ through a convolutional layer followed by a transposed convolutional layer (CTL), as shown in Figure 7. The main purpose of the CTL is to extract the more informative part of the feature.

Implementation Details
In our experiments, ResNet-50 was used as the backbone network for feature extraction. The output of the class block, $x \in \mathbb{R}^{d}$, was used as the image representation, with $d = 512$ in our experiments. We measured the feature distance between two images using the cosine distance. Stochastic gradient descent [37] with the hyper-parameters weight_decay = $5 \times 10^{-4}$, momentum = 0.9, and nesterov = True was adopted for model optimization. We set the learning rate of the fully connected layer to 0.005 and that of the other layers to 0.001, with a gradual decrease. All the images were scaled to 256 × 256 pixels.
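Retrieval with the cosine distance can be sketched as follows. The example assumes that feature vectors have already been extracted and are nonzero; the toy two-dimensional vectors and the function names are hypothetical stand-ins for the 512-dimensional representations.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar."""
    return sorted(
        range(len(gallery)),
        key=lambda k: cosine_distance(query, gallery[k]),
    )


query = [1.0, 0.0]
gallery = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
ranking = rank_gallery(query, gallery)
```

Note that the cosine distance ignores vector length, so the gallery vector [2.0, 0.0] is ranked first even though its magnitude differs from that of the query.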
Even if the features can be effectively clustered, a query that lies at the edge of its category in the feature space inevitably retrieves a considerable number of false positives, as shown in Figure 8. One way to retrieve more true positives is to enlarge the distance between different clusters. For this purpose, we adopted the ArcFace loss [38], which uses the angular distance to represent the distances between features. The scaling factor s was set to 10 in our experiments. Algorithm 1 gives the pseudocode used to train the proposed neural network architecture.
Algorithm 1.
4: Deep feature map: $\phi(i; \theta) = \mathrm{concat}(a_i \times f_i, a_i)$ for all $i = p, q$
5: $y \leftarrow \frac{1}{C(p)} \sum_{\forall j} s(p_i, q_j) \, g(p_j)$ // i and j are the positions of p and q in the feature map
6: $f_{emb} \leftarrow \phi(q; \theta)$
7: Fine tuning: $\min(L(f_{emb}, y_q) + L(y, \mathbb{1}[y_q = y_p]))$
8: end while
Figure 8. When the query is at the edge of the space in its category (left sample), it is more easily recalled as a false positive. One way to avoid this is to enlarge the interclass distance (right sample).
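The effect of an additive angular margin can be illustrated with a small sketch. It takes the cosine similarities between one feature and each class center as given, uses the scaling factor s = 10 from the text, and assumes a margin of 0.5 radians, which is not stated in the text. Practical implementations also handle the case in which the angle plus the margin exceeds π, which is omitted here; the function names are hypothetical.

```python
import math


def arcface_logits(cosines, target, s=10.0, margin=0.5):
    """Scale cosine similarities by s, adding an angular margin to the target.

    cosines[k] is the cosine between the feature and the center of class k.
    The margin value is an assumption made for illustration.
    """
    logits = []
    for k, c in enumerate(cosines):
        c = max(-1.0, min(1.0, c))  # guard against rounding outside [-1, 1]
        if k == target:
            logits.append(s * math.cos(math.acos(c) + margin))
        else:
            logits.append(s * c)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


with_margin = arcface_logits([1.0, 0.0], target=0)
without_margin = arcface_logits([1.0, 0.0], target=0, margin=0.0)
loss_with = cross_entropy(with_margin, 0)
loss_without = cross_entropy(without_margin, 0)
```

Because the margin lowers the target logit, the loss stays larger until the feature moves closer in angle to its own class center, which is what enlarges the gap between clusters.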

Dataset and Metric
To verify the effectiveness of the proposed LFASM method, we conducted experiments on three important datasets, namely VehicleID, VeRi-776, and PKU-VD, and compared our results with those of the state-of-the-art vehicle re-ID methods.
VeRi-776 [3] contains roughly 50,000 images of 776 vehicles, each captured by 2-18 cameras from different view angles. The query set contains 1678 images of 200 vehicles, with images drawn from all the cameras that captured each vehicle.
VehicleID [11] comprises 221,763 images of 26,267 vehicles captured by different cameras and provides three test subsets of different sizes, with 800, 1600, and 2400 gallery images, respectively, so that we can evaluate our model on different data scales. The dataset contains images captured from two view angles: front and back.
PKU-VD [12] contains a large number of images with rich annotations (vehicle model and color). To date, it is the largest dataset for vehicle re-ID and is divided into two subsets: VD1 and VD2. The images in VD1 and VD2 were captured from surveillance videos and traffic cameras, respectively. They comprise approximately 1,098,649 and 807,260 images, respectively.
We computed the mean average precision (mAP) to evaluate the performance of our model. Average precision (AP) is a measure that considers both recall and precision. The AP for query image $q$ can be expressed as
$$AP(q) = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{N_{gt}(q)} \qquad (8)$$
where $n$ is the length of the ranked list, $N_{gt}(q)$ is the number of ground truths, $P(k)$ is the precision at rank $k$, and $rel(k) = 1$ when the test image at rank $k$ matches query image $q$ and $rel(k) = 0$ otherwise. The mAP is the mean value of the APs of all queries and can be expressed as
$$mAP = \frac{\sum_{q=1}^{Q} AP(q)}{Q} \qquad (9)$$
where $Q$ is the number of query images. The mAP combines both precision and recall and is a comprehensive evaluation criterion.
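The two metrics can be computed directly from ranked relevance lists, as in the following minimal sketch. It assumes that each ranked list covers the whole gallery, so that every ground-truth image appears somewhere in the list; the function names and the toy relevance lists are hypothetical.

```python
def average_precision(rel):
    """AP for one query.

    rel[k-1] is 1 if the gallery image at rank k matches the query, else 0.
    Assumes the list contains every ground-truth match.
    """
    n_gt = sum(rel)
    if n_gt == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, r in enumerate(rel, start=1):
        if r:
            hits += 1
            total += hits / k  # precision at rank k, counted only at matches
    return total / n_gt


def mean_average_precision(rel_lists):
    """Mean of the per-query APs."""
    return sum(average_precision(r) for r in rel_lists) / len(rel_lists)


ap_first = average_precision([1, 0, 1, 0])   # matches at ranks 1 and 3
ap_second = average_precision([0, 1])        # single match at rank 2
m_ap = mean_average_precision([[1, 0, 1, 0], [0, 1]])
```

For the first query, the precision is 1/1 at rank 1 and 2/3 at rank 3, so the AP is their mean; pushing true matches toward the top of the list raises the score.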

Main Result
We present our results on three benchmark datasets, VeRi-776 [3], VehicleID [11], and PKU-VD [12], and compare them with those of state-of-the-art vehicle re-ID methods. Table 1 lists the FLOPs of each part of LFASM.
VeRi-776: The total numbers of query and gallery images were 1678 and 11,579, respectively. We compared the proposed LFASM with the following state-of-the-art methods. LOMO [39] utilizes a handcrafted local feature for person re-ID and addresses the problems associated with view and illumination variations. The GoogLeNet fine-tuned on the CompCars dataset [40] can extract high-level semantic attributes of the vehicle appearance. VAMI [41] is a viewpoint-aware attention model that extracts the core area from different views through an adversarial network, and QD-DLF [42] has different directional feature pooling layers. Siamese-CNN + Path-LSTM [18] is a two-stage framework that combines complex spatiotemporal information and effectively regularizes the re-ID results.
The comparison results on VeRi-776, presented in Table 2, show that our proposed LFASM model achieves 61.92%, 90.11%, and 92.91% for mAP, top-1, and top-5, respectively. The ROC curves for VeRi-776 are plotted in Figure 9, and the area under the curve (AUC) is 0.974. The standard deviation (std) of the AP is 0.236. To determine the effect of each component in our model, we conducted an ablation study on VeRi-776. Our framework comprises three components: a PTN, local aware features (LAF), and Siamese feature match (SFM). We removed one component at a time and retrained the remaining network to evaluate the model performance in the absence of the removed component. Without any of the PTN, SFM, and LAF components, our model achieved 52.69%, 83.41%, and 90.81% for mAP, top-1, and top-5, respectively, and these results were taken as the baseline. The results of the other comparative experiments are detailed in Table 3. They show that the attention module has the most significant influence on the learning process; the other modules were also found to improve the experimental results.
VehicleID: VehicleID has a larger number of images than VeRi-776, with both front and rear views of the vehicles. The testing data of VehicleID were split into three subsets, as detailed in Table 4. Table 5 presents the comparison results on the VehicleID dataset. As shown, our model achieves the highest top-1 rate and exhibits robust performance with respect to the other evaluation indices.
PKU-VD: Furthermore, we tested our method on the two subdatasets of PKU-VD, VD1 and VD2, in turn. Each subdataset is further divided into test sets of three sizes: small, medium, and large. We followed the official setting provided by [12] for our model evaluation. Both VD1 and VD2 comprise 2000 query images, and the number of gallery images in each test set is listed in Table 6. Our method also achieved good performance on this large-scale dataset, as detailed in Tables 7 and 8.
Figure 10 shows the results returned by the LFASM. Each row shows a query image and its top-5 retrievals. As shown, the model performs effectively on most data except for those containing vehicles with a dim background.
Feature correspondences. The main feature of LFASM is its focus on the informative parts of images. We therefore demonstrate the feature correspondence between the query and gallery images to reveal how LFASM functions when retrieving images. We extracted the local descriptors using an attention map and utilized the nearest neighbor search (NNS) to find the best matches in each image. As shown in Figure 11, our model can effectively match the key parts (i.e., lights, windshield stickers, and engine hood). Therefore, in some other scenarios, our method can be used to retrieve images according to the number of matches. On the three datasets, however, accurate results can already be obtained by directly using the distance between the feature vectors.
Figure 11. Visualization of the local-feature matches with the highest responsiveness among various pictures obtained by extracting and comparing local features.
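The nearest neighbor search used for the feature correspondences can be sketched as a brute-force loop. This is an illustration under assumptions: descriptors are short lists of floats, squared Euclidean distance is used as the matching criterion (the text does not state the metric), and the function names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_matches(query_desc, gallery_desc):
    """For each query descriptor, return (query index, best gallery index)."""
    matches = []
    for i, q in enumerate(query_desc):
        best_j = min(
            range(len(gallery_desc)),
            key=lambda j: squared_distance(q, gallery_desc[j]),
        )
        matches.append((i, best_j))
    return matches


query_desc = [[0.0, 0.0], [5.0, 5.0]]
gallery_desc = [[4.0, 4.0], [1.0, 0.0]]
matches = nearest_neighbor_matches(query_desc, gallery_desc)
```

Counting how many such matches fall below a distance threshold would give a match-based score of the kind mentioned above; the threshold itself would have to be chosen empirically.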

Conclusions
In this paper, we proposed a model that combines the LAFs of vehicle images. In addition to global features, LFASM emphasizes the significant parts that are most likely to be different in vehicles with different IDs. This encourages the model to focus on more details in local regions. Furthermore, we applied local-feature matching, which compares the local features of two embeddings and helps the local net to better learn an attention map. Moreover, the PTN allows images to be aligned directly without the need to match key points, thereby facilitating image identification by the model. The experimental results on three large vehicle datasets show that LFASM can extract discriminative features and achieve excellent performance. On the other hand, as shown in Figure 10, the model performs well on most data except for those containing a dim background. In some other scenarios, such as in the case of different views of two cars or in the absence of shared parts in the two cars, it is difficult for our model to achieve effective identification. Improving recognition of vehicles with different views is the focus of future work.