Contextual Coefficients Excitation Feature: Focal Visual Representation for Relationship Detection

Abstract: Visual relationship detection (VRD), a challenging task in image understanding, suffers from a vague connection between relationship patterns and visual appearance. This issue is caused by the high diversity of relationship-independent visual appearance, where inexplicit and redundant cues may not contribute to relationship detection and may even confuse the detector. Previous relationship detection models have shown remarkable progress in leveraging external textual information or scene-level interaction to complement relationship detection cues. In this work, we propose the Contextual Coefficients Excitation Feature (CCEF), a focal visual representation that is adaptively recalibrated from the original visual feature responses by explicitly modeling the interdependencies between features and their contextual coefficients. Specifically, the contextual coefficients are computed from both spatial coefficients and generated-label coefficients. In addition, a conditional Wasserstein Generative Adversarial Network (WGAN) regularized with a relationship classification loss is designed to alleviate the inadequate training of generated-label coefficients caused by the long-tail distribution of relationships. Experimental results demonstrate that our method improves relationship detection. In particular, it improves the recall of predicting unseen relationships on the zero-shot set from 8.5% to 23.2%.


Introduction
With the rapid development of deep learning and image recognition [1][2][3][4][5], visual relationship detection [6], a higher-level visual understanding task, has become a popular research topic. Relationships are commonly defined as triplets consisting of a subject, a predicate and an object, which can be represented as ⟨subject, predicate, object⟩. The subject and object are considered the context of the predicate [7]. Visual relationship detection aims to recognize the visually observable predicates between a subject and an object, where the subject and object are a pair of objects in the image.
However, visual relationship detection is challenging because most existing methods [8,9] treat each type of relationship predicate as a class, so the visual appearance within a class varies greatly across relationship instances. This visual diversity weakens the correlation between relationship predicates and visual appearance and confuses the detector. For instance, as Figure 1 shows, when we recognize the predicate "hold" from "people hold phone" and "person hold bear", the person features close to the "phone" or "bear" are more pivotal than other redundant visual features, such as the facial features of different instances. These unrelated features account for the majority of the original visual features and cause the vague connection between visual cues and relationship predicates. Previous methods [8,10] adopt a linguistic model conditioned on the labels of object pairs and predicates, or build a scene graph from the contextual objects in the image, to complement the insufficient correlation between visual cues and relationship predicates. However, these methods ignore the importance of selectively highlighting the visual features associated with the relationship.
To this end, we propose a novel Contextual Coefficients Excitation Feature (CCEF), a focal representation based on a new relationship space. Here, a relation refers to a predicate. By learning the context of the predicate, that is, the feature distribution of both subject and object in the new space, the noise in their raw visual features is suppressed and the representations that are more discriminative for relationship predicates are activated. In particular, the contextual coefficients control the importance of both subject and object to the predicate and are learned from both spatial and semantic information. The semantic information is generated by a conditional WGAN [11] regularized by a relationship classification loss. This additional classification loss forces the generator to produce relationship-relevant coefficients that are more suitable for relationship detection, especially for predicates with unseen or few training instances. After the contextual coefficients are obtained, both subject and object are recalibrated at the feature level so as to activate the visual features that are helpful for predicate selection; the result is the CCEF.
The relationship features in both subject and object are visualized with a deconvolution approach [12], as shown in Figure 1, to illustrate that CCEF is more significant for relationship representation than the original visual features. We summarize our contributions as follows:
• We propose the Contextual Coefficients Excitation Feature (CCEF), which reduces the diversity of unrelated visual features by introducing feature recalibration conditioned on the relationship contextual information.
• We improve the conditional WGAN with a relationship classification loss for producing generated-label coefficients on VRD and significantly improve the prediction of unseen relationships.

Related Works
In this section, the existing works related to our proposal are briefly reviewed. The review is divided into two parts: Visual Relationship Detection and Generative Adversarial Network.

Visual Relationship Detection
Relationship models are commonly divided into two categories: joint models [13] and separate models [7][8][9]. Early works [14] focused on joint models with hand-crafted features and were concerned with classifying relationship combinations. Because there are too many combinations and visual relationships follow a long-tail distribution in the real world, it is impossible to obtain sufficient training images for each combination. As a result, these methods generalize poorly. Therefore, Lu [8] proposed the separate model, formalized visual relationship detection as a task and provided a new dataset, Visual Relationship Detection (VRD). After that, separate models became the mainstream of research. In addition to the visual features and language priors in [8], Zhu [15] added the spatial feature of the relationship, which was ignored by LP [8]. Zhang [16] introduced the Knowledge Transfer to interpret the relationship as a vector translation, which was trained in an end-to-end system.
Besides, recent works focus on object priors or textual priors. Zhuang [7] obtained part of the classifier weights from the context information of the relationship and focused on different image regions via an attention mechanism. Yu [9] utilized massive external textual data and integrated the probability deviation of object pairs into relationship classification. To fully exploit the potential of feature learning, Yin [17] proposed message passing to encourage feature sharing between objects and predicates. Moreover, the latest approaches [10,18,19] introduced scene graph information to tackle both the insufficient visual cues and the quadratic number of possible relationship combinations.

Generative Adversarial Network
The main idea of the Generative Adversarial Network (GAN) [11][20][21][22][23] is to train a generator that captures the real data distribution and a discriminator network that distinguishes whether an instance comes from the true data distribution or from the candidates produced by the generator.
The most critical issue of GANs is the convergence of training. Many works [11,21] address this problem by improving the objective functions of GANs. Arjovsky [11] extended the objective of the original GAN [20], which is related to the Jensen-Shannon divergence, by analyzing the properties of four different divergences or distances between two distributions, and proposed WGAN, which uses the Wasserstein distance to optimize discriminators and generators stably. However, WGAN still suffers from vanishing and exploding gradients due to weight clipping [11]. Thus, Gulrajani [21] improved WGAN with a gradient penalty. Besides, the input to a generator is a "noise" vector z drawn from a latent distribution, such as a multivariate Gaussian, which leaves the generated result uncontrolled. To direct the generation process with additional information, Mirza [24] proposed the conditional GAN, which generates input-related results by feeding specified labels or attributes into both the discriminator and the generator.
Although the generation stability of GANs has been significantly enhanced, improving the quality of generated images is still a challenge. Other works [25][26][27][28][29][30] improve the image quality of GANs by deepening the network structure or increasing the training scale. In addition to generating realistic images, GANs have shown remarkable results in generating image features [31][32][33]. Xian [33] proposed f-CLSWGAN to tackle generalized zero-shot learning by generating Convolutional Neural Network (CNN) features for unseen classes, focusing on image features for classification instead of realistic images.

Methods
We begin by defining the problem of our interest.
where O is the set of localized objects in the image, the subscript i stands for the localized object with index i in object

CCEF: Focal Visual Representation
First, our proposed $R_i$ is the focal visual representation, which adaptively focuses on the local dimensions of each object's visual features that matter for relationships. It is constrained by the label and spatial information to boost the discriminability of the representation. $R_i$ is defined as follows:

$$R_i = F(V_i, M_i) \tag{1}$$

where $F(\cdot)$ refers to the feature-wise excitation and the original visual features $V_i$ are obtained by feeding the image region of the object to the feature extractor [2], as in Figure 2.

Contextual Coefficients. The contextual coefficient $M_i$ indicates the importance of each object's visual features and is based on label and spatial descriptors:

$$M_i = G_i \otimes F_i \tag{2}$$

where $G_i$ is the generated-label coefficient produced from the object label by the conditional WGAN and $F_i$ is the spatial coefficient. $\otimes$ denotes element-wise multiplication, which allows multiple dimensions to be emphasised as opposed to a one-hot activation, as in Figure 2.

Generated-label coefficients. The generated-label coefficient $G_i$, as one part of the importance descriptor, carries relationship-related semantics. It is mapped from the pure semantic feature space to the relationship-related space by the conditional WGAN:

$$G_i = g(z_i, C(y_i); \theta_g) \tag{3}$$

where the function $g$ (parameterized by $\theta_g$) is the generator of the conditional WGAN, $z_i \in \mathbb{R}^{d_z}$ is a random noise vector sampled from a multidimensional centered Gaussian and $C(y_i)$ is the condition vector that directs the coefficient generation process. The discriminator $d$ (parameterized by $\theta_d$) tries to distinguish whether a generated-label coefficient can represent the label coefficient under the condition vector $C(y_i)$ or not. The components $d$ and $g$ iteratively play a two-player minimax game on the objective function, where $d$ tries to maximize the GAN loss and $g$ tries to minimize it. The structure of our conditional WGAN is shown in Figure 3.

Spatial coefficients. The spatial coefficient $F_i$ is the embedding of the object spatial information:

$$F_i = r(S_i; \theta_s) \tag{4}$$

where $r(\cdot)$ is a fully connected (FC) layer with ReLU activation and $\theta_s$ is its parameter. Specifically, $S_i$ is the spatial feature computed from the bounding boxes of the object pair, similar to the one in [16], where $(x_i, y_i, w_i, h_i)$ and $(x_j, y_j, w_j, h_j)$ are the bounding boxes $B_i$ and $B_j$ of the candidate object pair. Note that the subscripts $i$ and $j$ indicate different object instances.

Relationship Representation. Following TransE [16,35], the relationship representation $R_{rel}$ is modeled as the translation vector between the object representations $R_i$ and $R_j$ after mapping them to the relation space, where $\ominus$ denotes element-wise subtraction, as in Figure 2.
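For readers who want a concrete form, one reading of the spatial feature and the translation step that is consistent with the descriptions above is sketched here. The scale-invariant box deltas follow the parameterization commonly used for bounding-box regression, on which [16] builds. The projection matrices $W_i$ and $W_j$, the direction of the subtraction and the exact composition of $S_i$ are assumptions made for illustration, not the authors' stated formulas.

```latex
\begin{align}
S_i &= \left( \frac{x_i - x_j}{w_j},\; \frac{y_i - y_j}{h_j},\; \log\frac{w_i}{w_j},\; \log\frac{h_i}{h_j} \right) \\
R_{rel} &= W_j R_j \ominus W_i R_i
\end{align}
```

Under this reading, $S_i$ depends only on the relative position and scale of the two boxes, so it is unaffected by a common shift or rescaling of the pair.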
Although the focal visual representation encodes the appearance of both objects, it is difficult to model the spatial correlation between predicates and objects directly from the pixel values of the image. Hence, the final prediction of the relationship is obtained by a linear classifier conditioned on the concatenation of $S$ and $R_{rel}$, where $\theta_p$ is the parameter of the classifier and • denotes vector concatenation.
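The pipeline described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the excitation $F(\cdot)$ is taken to be element-wise multiplication of $V_i$ by $M_i$, the translation is taken as object minus subject without a learned mapping, and all function names, weights and vector sizes are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def hadamard(a: Sequence[float], b: Sequence[float]) -> Vector:
    """Element-wise product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x * y for x, y in zip(a, b)]


def focal_representation(v: Sequence[float], g: Sequence[float],
                         f: Sequence[float]) -> Vector:
    """R = V * M with M = G * F (multiplicative excitation is an assumption)."""
    m = hadamard(g, f)  # contextual coefficient M
    return hadamard(v, m)


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_predicates(r_i: Sequence[float], r_j: Sequence[float],
                       spatial: Sequence[float],
                       weights: Sequence[Sequence[float]],
                       bias: Sequence[float]) -> Vector:
    """Linear classifier over the concatenation of S and R_rel."""
    if len(r_i) != len(r_j):
        raise ValueError("subject and object representations must match")
    r_rel = [b - a for a, b in zip(r_i, r_j)]  # assumed direction: object - subject
    features = list(spatial) + r_rel           # vector concatenation
    if any(len(row) != len(features) for row in weights):
        raise ValueError("each weight row must match the feature length")
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


# Toy inputs: 3-dimensional visual features, 2-dimensional spatial feature,
# two candidate predicates.
r_subj = focal_representation([1.0, 2.0, 3.0], [1.0, 0.0, 0.5], [2.0, 1.0, 2.0])
r_obj = focal_representation([0.5, 1.0, 1.5], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
probs = predict_predicates(r_subj, r_obj, [0.1, -0.2],
                           weights=[[0.0] * 5, [1.0] * 5], bias=[0.0, 0.0])
```

In the toy call, the second feature dimension of the subject is switched off because its generated-label coefficient is zero, which is the kind of selective suppression the contextual coefficients are meant to provide.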

Objective Function
Relationship Classification Loss. The visual relationship prediction is constrained by an efficient softmax loss that only rewards the deterministically accurate predicates.

GAN Loss. As described above, the WGAN-GP [21], which constrains the inactivated truth value in the objective function and enforces the Lipschitz constraint with a gradient penalty, is extended to the conditional GAN by integrating a condition vector into both the generator and the discriminator. Besides, the regularization with the relationship classification loss is minimized to encourage the generator to construct label coefficients suitable for relationship detection. Hence, the conditional WGAN is trained with the objective function (9), in which α is the balance hyper-parameter that weights the contribution of the classifier loss. The adversarial loss $L_{WGAN}$ is given in (10), where $E[\cdot]$ is the expected value operator, $L$ is the target label coefficient, $\epsilon \sim U(0, 1)$ and λ is a penalty parameter. The first two terms in (10) represent the Wasserstein distance and the third term enforces the gradient of $d(\cdot)$ to satisfy the Lipschitz constraint.
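Written out in the standard WGAN-GP form of [21] with the condition vector added to both networks, the objective described above would look roughly as follows. This is a sketch assembled from the description and corresponds to (9) and (10) only in structure. The symbols $\tilde{G}$ for the generated coefficient, $\hat{G}$ for the interpolate and $\mathcal{L}_{rel}$ for the relationship classification loss are notational assumptions.

```latex
\begin{align}
&\min_{\theta_g} \max_{\theta_d} \; \mathcal{L}_{WGAN} + \alpha \, \mathcal{L}_{rel} \\
&\mathcal{L}_{WGAN} = \mathbb{E}\big[ d(L, C(y)) \big]
  - \mathbb{E}\big[ d(\tilde{G}, C(y)) \big]
  - \lambda \, \mathbb{E}\Big[ \big( \lVert \nabla_{\hat{G}} \, d(\hat{G}, C(y)) \rVert_2 - 1 \big)^2 \Big] \\
&\tilde{G} = g(z, C(y); \theta_g), \qquad
  \hat{G} = \epsilon L + (1 - \epsilon) \tilde{G}, \qquad \epsilon \sim U(0, 1)
\end{align}
```

The signs match the text: the discriminator maximizes the difference between its scores on trained and generated coefficients, and the penalty is subtracted so that maximizing the loss pushes the gradient norm toward 1.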

Training and Prediction
In practice, we design a two-step procedure for our proposed method. In the first stage, the model shown in Figure 2 is trained with the conditional WGAN replaced by an FC layer to obtain a pre-trained parameter $\theta_p^*$. That is, $G_i$ in (2) is replaced with $L_i$, which is obtained from the label embedding $C(y_i)$ of the object with FC layers similar to (4). $L_i$ is called the label coefficient. It serves as the target of the generated-label coefficients when the discriminator is trained, because adversarial training needs real samples to compare against.
After that, in the second stage, the conditional WGAN removed in the first stage is restored and trained with the parameter $\theta_p^*$. Given the label coefficients $L$ trained in the first stage, the discriminator tries to distinguish the generated-label coefficients $G$ from the trained ones $L$. Then $\theta_g$ and $\theta_d$ are obtained by optimizing the objective function in (9).
Finally, the relationship prediction is computed with the classifier parameter $\theta_p^*$, which is pre-trained in the first step and frozen while training the conditional WGAN.

Results
To demonstrate the effectiveness of our proposed visual representation, a series of ablation variants is compared with existing baseline methods. The experimental setup is as follows:

1. Ours − A: Use $V_i$ directly instead of $R_i$ in (1), without any coefficients.
2. Ours − A + S: Replace $M_i$ with $S_i$ in (1) in Section 3.1.
3. Ours − A + L: Replace $M_i$ with $L_i$ in (1), as described in Section 3.2.
4. Ours − A + S + L: Replace $G_i$ with $L_i$ in (2), as described in Section 3.2.
Here, A is for appearance, S for spatial representation, L for label representation and G for generated-label representation.
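The ablation variants differ only in which coefficients are multiplied into the visual features, which can be made explicit with a small lookup. The sketch below is illustrative: the table `ABLATIONS`, the function name and the convention that the appearance-only variant uses an all-ones coefficient are assumptions, and the key "S" stands for the spatial coefficient used in that variant.

```python
from typing import Dict, List, Sequence, Tuple

# Which coefficients each variant multiplies together (hypothetical encoding).
ABLATIONS: Dict[str, Tuple[str, ...]] = {
    "A": (),               # appearance only: no recalibration
    "A+S": ("S",),         # spatial coefficient only
    "A+L": ("L",),         # label coefficient only
    "A+S+L": ("S", "L"),   # spatial and label coefficients
    "A+S+G": ("S", "G"),   # full CCEF: spatial and generated-label coefficients
}


def contextual_coefficient(variant: str,
                           coefficients: Dict[str, Sequence[float]],
                           dim: int) -> List[float]:
    """Element-wise product of the coefficients selected by the variant."""
    m = [1.0] * dim  # all-ones leaves the visual features unchanged
    for key in ABLATIONS[variant]:
        values = coefficients[key]
        if len(values) != dim:
            raise ValueError(f"coefficient {key!r} must have length {dim}")
        m = [a * b for a, b in zip(m, values)]
    return m


coeffs = {"S": [0.5, 2.0], "L": [1.0, 3.0], "G": [2.0, 0.5]}
identity = contextual_coefficient("A", coeffs, dim=2)
full = contextual_coefficient("A+S+G", coeffs, dim=2)
```

Encoding the variants as data keeps the comparison honest: every variant runs through the same code path and only the set of active coefficients changes.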

Implementation Details
The discriminator consists of two multilayer perceptron (MLP) layers with LeakyReLU activation, while the generator contains one MLP layer with LeakyReLU and an output layer with ReLU. Adam [36], an algorithm for first-order gradient-based optimization of stochastic objective functions based on adaptive estimates of lower-order moments, is used to optimize the classifier. Stochastic gradient descent (SGD) is used to optimize both the generator and discriminator networks, with a learning rate of 0.0001. The balance term α in the loss function is set to 1.0 based on experiments, and λ = 10 as suggested in [21]. The noise z is drawn from a unit Gaussian with the same dimension as the label embedding. The VGG-16 [2] network pre-trained on ImageNet [5] is used to extract the original visual features. Our model has about 135M parameters and requires about 15.47 × 10^9 floating point operations (FLOPs).
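To make the generator description concrete, the following sketch runs one forward pass of a generator with a LeakyReLU hidden layer and a ReLU output layer, using the hyper-parameters quoted above. It is a toy under several assumptions that the text does not specify: the noise and condition vectors are concatenated at the input, the LeakyReLU slope is 0.2, and the layer sizes and weight initialization are arbitrary.

```python
import random
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, bias)

HYPERPARAMS = {
    "learning_rate": 1e-4,  # SGD for generator and discriminator (from the text)
    "alpha": 1.0,           # balance term of the classification loss (from the text)
    "gp_lambda": 10.0,      # gradient penalty weight, as suggested in [21]
}


def linear(x: Sequence[float], layer: Layer) -> List[float]:
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def leaky_relu(x: Sequence[float], slope: float = 0.2) -> List[float]:
    return [v if v > 0 else slope * v for v in x]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


def init_layer(rng: random.Random, n_in: int, n_out: int) -> Layer:
    """Small Gaussian weights and zero bias (arbitrary choice for the sketch)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def generator(z: Sequence[float], condition: Sequence[float],
              hidden: Layer, output: Layer) -> List[float]:
    """One hidden MLP layer with LeakyReLU, then a ReLU output layer."""
    x = list(z) + list(condition)  # assumed: noise and condition are concatenated
    h = leaky_relu(linear(x, hidden))
    return relu(linear(h, output))


rng = random.Random(0)
embed_dim, hidden_dim, coeff_dim = 4, 8, 6  # toy sizes
hidden_layer = init_layer(rng, 2 * embed_dim, hidden_dim)
output_layer = init_layer(rng, hidden_dim, coeff_dim)
noise = [rng.gauss(0.0, 1.0) for _ in range(embed_dim)]  # same size as the embedding
label_embedding = [0.3, -0.1, 0.8, 0.0]
generated = generator(noise, label_embedding, hidden_layer, output_layer)
```

The ReLU output guarantees non-negative coefficients, which fits their role as multiplicative weights on the visual features.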

Evaluation on Visual Relationship Dataset
The Visual Relationship Detection (VRD) dataset [8] is used to evaluate the proposed methods. This dataset contains 5000 images with 100 object categories and 70 predicate categories. In total, there are 37,993 relationship instances with 6672 relationship types and 24.25 predicates per object label. The train/test split is the same as in [8]: 4000 training images containing 30,355 relationships with 6672 types and 1000 test images containing 7638 relationships with 2747 types. Note that 1169 relationships with 1029 types appear only in the test data.
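The split statistics above can be cross-checked with simple arithmetic, which also shows how large the zero-shot subset is relative to the test set. The snippet only uses the counts quoted in the text; the variable names are illustrative.

```python
# Counts quoted in the dataset description.
train_images, test_images = 4000, 1000
train_relationships, test_relationships = 30355, 7638
zero_shot_relationships = 1169  # relationships whose type appears only in the test data

total_images = train_images + test_images
total_relationships = train_relationships + test_relationships

# Share of test relationships that belong to the zero-shot subset.
zero_shot_share = zero_shot_relationships / test_relationships
print(f"images: {total_images}, relationships: {total_relationships}")
print(f"zero-shot share of the test set: {zero_shot_share:.1%}")
```

The totals agree with the 5000 images and 37,993 relationship instances stated for the whole dataset, and roughly 15% of the test relationships fall in the zero-shot subset.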
Our approach is evaluated on three tasks [8]:
• Predicate detection: given an image and a set of ground truth object bounding boxes, the task is to predict a set of possible predicates between pairs of objects. Since the relationship between a pair of objects is what matters, this task reflects the performance of the model directly, without the error of object detection.
• Phrase detection: given an input image, the task is to output the triplets ⟨subject, predicate, object⟩ and localize each entire relationship as one bounding box.
• Relationship detection: given an input image, the task is to output the triplets ⟨subject, predicate, object⟩ and localize the subject and object bounding boxes.
Both phrase and relationship bounding boxes should have at least 0.5 overlap with their ground truth bounding boxes. The performance of phrase and relationship detection is affected by the result of object detection because of the separate detection pipeline. To compare relationship models fairly, the object detection results (both bounding boxes and corresponding detection scores) provided by [8] are used for phrase detection and relationship detection. More details can be found in [8].
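The 0.5 overlap criterion is usually measured as intersection over union (IoU), and the single box used in phrase detection can be taken as the box enclosing both subject and object. The sketch below illustrates both under these assumptions; the corner format (x1, y1, x2, y2) and the function names are choices made here, not taken from the paper.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), corner format


def area(box: Box) -> float:
    """Area of a box; degenerate boxes have zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes."""
    inter_box = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(inter_box)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def union_box(a: Box, b: Box) -> Box:
    """Smallest box enclosing both inputs, as used for a phrase region."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def is_localized(predicted: Box, ground_truth: Box, threshold: float = 0.5) -> bool:
    """True when the predicted box overlaps the ground truth enough."""
    return iou(predicted, ground_truth) >= threshold


subject_box: Box = (0.0, 0.0, 2.0, 2.0)
object_box: Box = (1.0, 1.0, 3.0, 3.0)
phrase_box = union_box(subject_box, object_box)
```

For relationship detection the check is applied to the subject and object boxes separately, while for phrase detection it is applied once to the enclosing box.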
Following the original paper [8], Recall@50 (R@50) and Recall@100 (R@100) are used as our evaluation metrics. Recall@X computes the fraction of correct relationships among the top X predictions. Recall@X is used instead of mean average precision (mAP) because mAP would penalize a correct detection whose ground truth annotation is missing from the dataset. Note that only the predicate with the highest confidence for each pair of objects is considered. For phrase detection and relationship detection, the prediction score is the product of the predicate score and the confidence scores of both subject and object, while for predicate detection the prediction score is the predicate score alone.
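A minimal sketch of the metric may help fix ideas. It ranks the candidate triplets of each image by score, keeps the top K, and divides the number of recovered ground truth triplets by the total number of ground truth triplets. Box matching is left out for brevity, and the data layout and function names are assumptions for illustration.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]             # (subject, predicate, object)
Prediction = Tuple[Triplet, float]         # triplet with its prediction score
ImageResult = Tuple[List[Prediction], Set[Triplet]]


def triplet_score(predicate_score: float, subject_conf: float,
                  object_conf: float) -> float:
    """Score used for phrase and relationship detection: product of the three."""
    return predicate_score * subject_conf * object_conf


def recall_at_k(results: List[ImageResult], k: int) -> float:
    """Fraction of ground truth triplets found in the top-k predictions per image."""
    hits = 0
    total = 0
    for predictions, ground_truth in results:
        top = sorted(predictions, key=lambda item: item[1], reverse=True)[:k]
        hits += len({triplet for triplet, _ in top} & ground_truth)
        total += len(ground_truth)
    if total == 0:
        raise ValueError("no ground truth relationships to evaluate")
    return hits / total


image = (
    [
        (("person", "hold", "phone"), triplet_score(0.9, 1.0, 1.0)),
        (("person", "on", "street"), triplet_score(0.8, 1.0, 1.0)),
        (("person", "wear", "hat"), triplet_score(0.5, 0.5, 0.4)),
    ],
    {("person", "hold", "phone"), ("person", "wear", "hat")},
)
recall_2 = recall_at_k([image], k=2)
recall_3 = recall_at_k([image], k=3)
```

In the toy image, the low detector confidences push the correct "wear" triplet to third place, so it is missed at K = 2 and recovered at K = 3.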

Comparison and Discussion
Predicate Detection. The results of predicate detection are reported in Table 1. The benefit of visual feature excitation is visible across all experiment settings: every kind of coefficient is effective, and CCEF performs best among them (e.g., R@50 improves from 45.2% to 55.5% on the entire set and from 13.2% to 23.2% on the zero-shot set).
In Table 1, (Ours − A) uses only visual information and performs unsatisfactorily. Adding the spatial information in (Ours − A + S), which enhances the local visual features where objects are more likely to interact, improves the performance by 3 points. Introducing the label semantic information instead of the spatial information in (Ours − A + L) reweights the identity information corresponding to the object category in the visual features and improves the performance by 5 points. The performance of (Ours − A + L + S) is 53.0 (≈ 45.2 + 3 + 5), which indicates that label and spatial information are complementary. Finally, CCEF (Ours − A + G + S) replaces the label coefficient (L) in (Ours − A + L + S) with the generated-label coefficient (G) synthesized by the conditional WGAN. The performance of (Ours − A + G + S) improves from 53.0 to 55.5, as shown in Figure 4, which indicates that the generated-label coefficient G improves model generalization and the prediction of unseen relationships, and thus the overall performance.

Table 1. Evaluation of different methods on VRD, including the R@100/50 of predicate detection on the entire and zero-shot sets. "*" marks the results of LK without knowledge distillation, and "-" indicates "not applicable".

Method                 Entire Set R@100/50 1    Zero-Shot R@100/50
Language Priors [8]    47.9                     8.5
VTransE [16]           44.8                     -
STA [37]               48.0                     20.6
VSA-Net [38]           49.2                     -
Zoom-Net [17]          50.

Another interesting finding is that (Ours − A + S) shows a significant improvement on the zero-shot set. We speculate that this is because the spatial coefficients come from spatial features, which are triplet-independent and less susceptible to the long-tail distribution of the training data.
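The additive reading of the ablation gains in the predicate detection discussion can be checked with a few lines of arithmetic. The values are the R@50 figures and rounded gains quoted in that discussion; the variable names are illustrative.

```python
# R@50 on the entire set, as quoted in the predicate detection discussion.
baseline = 45.2           # Ours - A (appearance only)
spatial_gain = 3.0        # rounded gain from adding spatial coefficients
label_gain = 5.0          # rounded gain from adding label coefficients
combined_reported = 53.0  # Ours - A + L + S
ccef_reported = 55.5      # Ours - A + G + S

# If the two gains were perfectly additive, the combined variant would reach:
additive_estimate = baseline + spatial_gain + label_gain
gap = additive_estimate - combined_reported

# Extra gain from replacing the label coefficient with the generated one.
generated_gain = ccef_reported - combined_reported
print(f"additive estimate: {additive_estimate:.1f}, gap: {gap:.1f}")
print(f"gain from generated-label coefficients: {generated_gain:.1f}")
```

The small gap between the additive estimate and the reported combined score supports the reading that the two coefficients contribute largely independent information, up to the rounding of the individual gains.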
Phrase and Relationship Detection. To compare relationship models fairly, the same object detection results [8,15] are used. The relationship for every pair of objects is predicted with the pipeline in Figure 2. Evaluation results on the entire test set and in the zero-shot setting are shown in Table 2. The observations across the experiments are consistent with those for predicate detection. However, the spatial coefficients bring less improvement in phrase and relationship detection than in predicate detection because of the localization error of object detection. Even so, (Ours − A + S + G) still outperforms previous models, especially in the zero-shot setting.
Effect of Generation Methods. An important question about our approach is whether the generative methods succeed in mapping label embedding to suitable coefficients for relationship.
To answer this question, the evolution of the relationship classification loss over epochs is shown in Figure 5. In general, the classification loss $l_{rel}$ decreases steadily during training, which suggests that the model succeeds in learning this mapping.
Another relevant question is whether our method converges more reliably than other methods when generating coefficients. Figure 5 shows the Recall@50 of relationship detection as a function of epochs for different generation methods, e.g., Boundary Equilibrium Generative Adversarial Networks (BEGAN) [41], Least Squares Generative Adversarial Networks (LSGAN) [42] and the original GAN [20], all trained with the same settings and structure. In particular, LSGAN and BEGAN are more prone to training crashes and show more serious training instability, even compared with the original GAN. In contrast, a stable training trend and loss convergence are observed when training WGAN; similar conclusions are drawn in recent work [43].
After verifying that our method trains more stably than the other methods, the generalization ability of the different methods is another critical question. Hence, the results of predicate detection on the entire set and the zero-shot set with different generative models are compared in Table 3. In both settings, our model performs best at generating suitable coefficients. The GAN models generalize better on both sets and converge faster than cVAE. We conjecture that VAEs tend to produce blurred results, which leads to inaccurate coefficients. In addition, the generalization ability of our model is also affected by the balance term α in (9), as shown in Figure 6.
Comparison with Attention Methods. To tackle the problem of relationship-independent visual appearance, another promising approach is the attention mechanism, which encourages the network to focus on the discriminative regions of the feature map while extracting the visual features. Therefore, we compare the detection results of existing attention methods with those of our excitation approach, using ResNet-50 [45] as the feature extractor in Figure 2, to determine the effect of visual feature distributions.
The experimental results in Table 4 show that most attention mechanisms are effective, but inappropriate weight assignments may not contribute to the final detection, e.g., "VSA-Spatial", which always focuses on the center of the object region. These results also suggest that our excitation approach, which makes the network focus on the discriminative features of visual relationship appearance, is more effective than the attention methods. Besides, the results in Table 1 and Table 4 show that our contextual coefficients suit different visual feature distributions and that ResNet features are stronger than VGG features, as expected.

Conclusions
In this paper, we proposed the contextual coefficients excitation feature (CCEF), a dynamic feature recalibration based on context for the VRD task. Instead of an attention mechanism, we use a conditional WGAN to learn the importance distribution of visual features, with semantic descriptions as constraints. Zero-shot experiments show that the feature distribution learned by the WGAN generalizes better when identifying new relationships. In the future, we will further study the joint adversarial learning of visual and semantic representations, so as to improve zero-shot prediction in different tasks.