TextRS: Deep Bidirectional Triplet Network for Matching Text to Remote Sensing Images

Exploring the relevance between images and their natural language descriptions is, owing to its paramount importance, regarded as the next frontier in the general computer vision literature. Several recent works have therefore attempted to map visual attributes onto their corresponding textual tenor with some success. However, this line of research has not been widespread in the remote sensing community. On this point, our contribution is three-pronged. First, we construct a new dataset for text-image matching tasks, termed TextRS, by collecting images from four well-known scene datasets, namely AID, Merced, PatternNet, and NWPU. Each image is annotated with five different sentences, which were written by five people to ensure diversity. Second, we put forth a novel Deep Bidirectional Triplet Network (DBTN) for text-to-image matching. Unlike traditional remote sensing image-to-image retrieval, our paradigm seeks to carry out the retrieval by matching text to image representations. To achieve that, we propose to learn a bidirectional triplet network, which is composed of a Long Short-Term Memory (LSTM) network and pre-trained Convolutional Neural Networks (CNNs) (EfficientNet-B2, ResNet-50, Inception-v3, and VGG16). Third, we top the proposed architecture with an average fusion strategy to fuse the features pertaining to the five image sentences, which enables learning of a more robust embedding. The performance of the method, expressed in terms of Recall@K (the presence of the relevant image among the top K images retrieved for the query text), is promising: it yields 17.20%, 51.39%, and 73.02% for K = 1, 5, and 10, respectively.


Introduction
The steady accessibility of remote sensing data, particularly high resolution images, has animated remarkable research outputs in the remote sensing community. Two of the most active topics in this regard refer to image classification and retrieval [1][2][3][4][5]. Image classification aims to assign scene images to a discrete set of land use/land cover classes depending on the image content [6][7][8][9][10]. Recently, with rapidly expanded remote sensing acquisition technologies, both quantity and quality of remote sensing data have been increased. In this context, content-based image retrieval (CBIR) has become a paramount research subject in order to meet the increasing need for the efficient organization and by training a CNN for semantic segmentation and feature generation. Shao et al. [11] constructed a dense labeling remote sensing dataset to evaluate the performance of retrieval techniques based on traditional handcrafted feature as well as deep learning-based ones. Dai et al. [44] discussed the use of multiple hyperspectral image retrieval labels and introduced a multi-label scheme that incorporates spatial and spectral features.
It is evident that the multi-label scenario is generally favored (over the single label case) on account of its abundant semantic information. However, it remains limited due to the discrete nature of labels pertaining to a given image. This suggests a further endeavor to model the relation among objects/labels using an image description. With the rapid advancement of computer vision and natural language processing (NLP), machines began to understand, slowly but surely, the semantics of images.
Current computer vision literature suggests that, instead of tackling the problem from an image-to-image matching perspective, cross-modal text-image learning seems to offer a more concrete alternative. This concept has manifested itself lately in the form of image captioning, which stems as a crossover where computer vision meets NLP. Basically, it consists of generating a sequential textual narration of visual data, similar to how humans perceive it. In fact, image captioning is considered as a subtle aid for image grasping, as a description generation model should capture not only the objects/scenes presented in the image, but it should also be capable of expressing how the objects/scenes relate to each other in a textual sentence.
The leading deep learning techniques for image captioning can be categorized into two streams. One stream adopts an end-to-end encoder-decoder design [45,46], where a CNN is typically used as the encoder and an RNN, often a Long Short-Term Memory (LSTM) network [47], as the decoder. Rather than translating between various languages, such techniques translate from a visual representation to language. The visual representation is extracted via a pre-trained CNN [48], and translation is achieved by RNN-based language models. The major advantage of this method is that the whole system is learned end to end [47]. Xu et al. [35] went one step further by introducing the attention mechanism, which enables the decoder to concentrate on specific portions of the input image when generating a word. The other stream adopts a compositional framework, such as [49], which divides the task of generating the caption into several parts: detection of the words by a CNN, generation of the caption candidates, and re-ranking of the sentences by a deep multimodal similarity model.
With respect to image captioning, the computer vision literature suggests several contributions mainly based on deep learning. For instance, You et al. [50] combined top-down (i.e., image-to-words) and bottom-up (i.e., joining several relevant words into a meaningful image description) approaches via CNN and RNN models for image captioning, which revealed interesting experimental results. Chen et al. [51] proposed an alternative architecture based on spatial and channel-wise attention for image captioning. In other works, a common deep model called a bi-directional spatial-semantic attention network was introduced [52,53], where an embedding and a similarity network were adopted to model the bidirectional relations between pairs of text and image. Zhang and Lu [54] proposed a projection classification loss that classified the vector projection of representations from one form to another by improving the norm-softmax loss. Huang et al. [52] addressed the problem of image text matching in bi-direction by making use of attention networks.
So far, it can be noted that computer vision has been accumulating a steady research basis in the context of image captioning [47,50,55]. In remote sensing, however, contributions have barely begun to move in this direction, often regarded as the 'next frontier' in computer vision. Lu et al. [56] for instance, proposed a similar concept as in [51] by combining CNNs (for image representation) and LSTM network for sentence generation in remote sensing images. Shi et al. [57] leveraged a fully convolutional architecture for remote sensing image description. Zhang et al. [58] adopted an attribute attention strategy to produce remote sensing image description, and investigated the effect of the attributes derived from remote sensing images on the attention system.
As we have previously reviewed, the mainstream of the remote sensing works focuses mainly on scenarios of single label, whereas in practice images may contain many classes simultaneously.
In the quest for tackling this bottleneck, recent works attempted to allocate multiple labels to a single query image. Nevertheless, coherence among the labels in such cases remains questionable since multiple labels are assigned to an image regardless of their relativity. Therefore, these methods do not specify (or else model) explicitly the relation between the different objects in a given image for a better understanding of its content. Evidently, remote sensing image description has witnessed rather scarce attention in this sense. This may be explained by the fact that remote sensing images exhibit a wide range of morphological complexities and scale changes, which render text to/from image retrieval intricate.
In this paper, we propose a solution based on DBTN for solving the text-to-image matching problem. It is worth mentioning that this work is inspired by [53]. The major contributions of this work can be highlighted as follows:
• Given that text-image retrieval/matching is a new topic in the remote sensing community, we deem it necessary to build a benchmark dataset for remote sensing image description. Our dataset will constitute a benchmark for future research in this respect.
• We propose a DBTN architecture to address the problem of text-image matching, which, to the best of our knowledge, has not been posed in the remote sensing literature thus far.
• We tie the single models into fusion schemes that can improve the overall performance by exploiting the five sentences.
The paper is organized into five sections. In Section 2, we introduce the proposed DBTN method. Section 3 presents the TextRS dataset and the experimental results, followed by discussions in Section 4. Finally, Section 5 provides conclusions and directions for future developments.

Description of the Proposed Method
Assume a training set $D = \{X_i, Y_i\}_{i=1}^{N}$ composed of N images with their matching sentences. In particular, with each training image $X_i$ we associate a set of K matching sentences $Y_i = \{y_i^1, \ldots, y_i^K\}$. In the test phase, given a query sentence $t_q$, we aim to retrieve the most relevant image in the training set D. Figure 1 shows a general description of the proposed DBTN method, which is composed of image and text encoding branches that learn appropriate image and text embeddings $f(X_i)$ and $g(T_i)$, respectively, by optimizing a bidirectional triplet loss. Detailed descriptions are provided in the next sub-sections.
Remote Sens. 2020, 12, 405

Image Encoding Module
The image encoding module uses a pre-trained CNN augmented with an additional network to learn the visual features $f(X_i)$ of the image (Figure 2). To learn informative features and suppress less relevant ones, this extra network applies a channel attention layer, termed squeeze excitation (SE), to the activation maps obtained after the 3 × 3 convolution layer. The goal is to further enhance the feature representation by capturing the significance of each feature map among all extracted feature maps. As illustrated in Figure 2, the squeeze operation produces features of dimension (1,1,128) by means of global average pooling (GAP), which are then fed to a fully connected layer that reduces the dimension by a ratio of 1/16. The resulting feature vector s then calibrates the feature maps of each channel (V) through a channel-wise scale operation. SE works as shown below [59]:
$$V_{SE} = s \odot V$$
where s is the scaling factor, $\odot$ refers to channel-wise multiplication, and V represents the feature maps obtained from a particular layer of the pre-trained CNN. The resulting activation maps $V_{SE}$ are then fed to a GAP layer followed by a fully connected layer and $l_2$-normalization for feature rescaling, yielding the features $f(X_i)$.
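The channel re-weighting described above can be illustrated with a minimal NumPy sketch. It assumes the usual SE design of [59] (a reducing fully connected layer with ReLU followed by an expanding one with a sigmoid); the 7 × 7 spatial size, the random weights `W1` and `W2`, and the function name `squeeze_excitation` are hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excitation(V, W1, W2):
    """Re-weight the channels of V (H, W, C) with an SE-style gate.

    W1 (C, C // ratio) and W2 (C // ratio, C) are assumed weights of the
    two fully connected layers of a standard SE block.
    """
    z = V.mean(axis=(0, 1))            # squeeze: global average pooling -> (C,)
    hidden = np.maximum(z @ W1, 0.0)   # reduce the dimension, then ReLU
    s = sigmoid(hidden @ W2)           # per-channel scaling factors in (0, 1)
    return s * V, s                    # channel-wise scaling, broadcast over H and W

rng = np.random.default_rng(0)
C, ratio = 128, 16                     # 128 channels, reduction ratio 1/16
V = rng.standard_normal((7, 7, C))     # hypothetical activation maps
W1 = 0.1 * rng.standard_normal((C, C // ratio))
W2 = 0.1 * rng.standard_normal((C // ratio, C))
V_se, s = squeeze_excitation(V, W1, W2)
```

Each channel of `V_se` is the corresponding channel of `V` multiplied by a single scalar, so informative feature maps can be emphasized and less relevant ones attenuated without changing the spatial layout.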
As pre-trained CNNs, we adopted VGG16, Inception-v3, ResNet50, and EfficientNet. VGG16 was proposed in 2014 and has 16 layers [27]. This network was trained on the ImageNet dataset to classify 1.2 million RGB images of size 224 × 224 pixels into 1000 classes. The Inception-v3 network [60], introduced by Google, contains 42 layers as well as three kinds of inception modules, which comprise convolution kernels with sizes from 5 × 5 to 1 × 1. Such modules seek to reduce the number of parameters. The residual network (ResNet) [25] is a 50-layer network with shortcut connections. It was proposed to alleviate the vanishing gradient problem in deeper networks. Finally, EfficientNets, which are new state-of-the-art models with up to 10 times better efficiency (faster as well as smaller), were developed recently by a research team from Google [61] to scale up CNNs using a simple compound coefficient. Unlike traditional approaches that scale network dimensions (width, depth, and resolution) individually, EfficientNet scales each dimension in a balanced way using a fixed set of scaling coefficients. In practice, the performance of a model can be enhanced by scaling individual dimensions, but scaling all dimensions uniformly leads to higher accuracy and efficiency.
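To make the compound-coefficient idea concrete, the sketch below computes the depth, width, and resolution multipliers for a given coefficient. It assumes the base coefficients reported in the EfficientNet paper [61] (1.2, 1.1, and 1.15); the function name `compound_scale` is hypothetical, and the exact per-variant settings of released models may differ from this simplified rule.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, flops) multipliers for coefficient phi.

    alpha, beta, gamma are assumed base coefficients for depth, width,
    and input resolution; FLOPs grow roughly as alpha * beta^2 * gamma^2.
    """
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    flops = (alpha * beta ** 2 * gamma ** 2) ** phi
    return depth, width, resolution, flops

depth, width, resolution, flops = compound_scale(phi=1)
```

With these coefficients, one step of the compound coefficient roughly doubles the computational cost while growing all three dimensions together, which is the balance the paragraph above refers to.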

Text Encoding Module
Figure 3 shows the text encoding module, which is composed of K symmetric branches, where each branch encodes one sentence describing the image content. These sub-branches use a word embedding layer followed by an LSTM, a fully connected layer, and $l_2$-normalization.
The word embedding layer receives a sequence of integers representing the words in the sentence and transforms them into representations in which similar words should have similar encodings. The outputs of this layer are then fed to an LSTM [62], which models the entire sentence thanks to its capacity for learning long-term dependencies. Figure 4 shows the architecture of the LSTM, with its four types of gates at each time step t in the memory cell. These gates are the input gate $i_t$, the update gate $c_t$, the output gate $o_t$, and the forget gate $f_t$. For each time step, these gates receive as input the hidden state $h_{t-1}$ and the current input $y_t$. Then, the cell memory recursively updates itself based on its previous values and the forget and update gates.

The working mechanism of the LSTM is given below (for simplicity, we omit the image index i) [62]:
$$i_t = \sigma(W_i [h_{t-1}, y_t])$$
$$f_t = \sigma(W_f [h_{t-1}, y_t])$$
$$c_t = \tanh(W_g [h_{t-1}, y_t])$$
$$o_t = \sigma(W_o [h_{t-1}, y_t])$$
$$r_t = f_t * r_{t-1} + i_t * c_t$$
$$h_t = o_t * \tanh(r_t)$$
where $\sigma$ is the sigmoid function, * denotes the Hadamard product, and $W_i$, $W_f$, $W_g$, and $W_o$ are learnable weights. In general, we can model the hidden state $h_t$ of the LSTM as follows [62]:
$$h_t = \mathrm{LSTM}(h_{t-1}, y_t, r_{t-1})$$
where $r_{t-1}$ indicates the memory cell vector at time step t − 1.
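As an illustration of the gate mechanism, the following NumPy sketch runs one sentence through a standard LSTM cell. It is a minimal sketch under stated assumptions: biases are omitted, each weight matrix acts on the concatenation of the previous hidden state and the current word embedding, and the dimensions and random weights are hypothetical rather than those used in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(y_t, h_prev, r_prev, W_i, W_f, W_g, W_o):
    """One time step of a standard LSTM cell (biases omitted for brevity)."""
    x = np.concatenate([h_prev, y_t])   # gates see h_{t-1} and the input y_t
    i_t = sigmoid(W_i @ x)              # input gate
    f_t = sigmoid(W_f @ x)              # forget gate
    o_t = sigmoid(W_o @ x)              # output gate
    c_t = np.tanh(W_g @ x)              # update (candidate) values
    r_t = f_t * r_prev + i_t * c_t      # memory cell: keep old content, add new
    h_t = o_t * np.tanh(r_t)            # hidden state exposed to the next layer
    return h_t, r_t

rng = np.random.default_rng(0)
embed_dim, hidden_dim, num_words = 8, 16, 5   # hypothetical sizes
W_i, W_f, W_g, W_o = (
    0.1 * rng.standard_normal((hidden_dim, hidden_dim + embed_dim))
    for _ in range(4)
)
sentence = rng.standard_normal((num_words, embed_dim))  # embedded words
h = np.zeros(hidden_dim)
r = np.zeros(hidden_dim)
for y_t in sentence:
    h, r = lstm_step(y_t, h, r, W_i, W_f, W_g, W_o)
```

The final `h` summarizes the whole sentence and is what a subsequent fully connected layer would consume.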
For each branch, the output of the LSTM is fed to an additional fully connected layer, yielding K feature representations $g(y_i^k)$, k = 1, . . . , K. Then, the final outputs of the different branches are fused using an average fusion layer to obtain a feature of dimension 128 [7]:
$$g(T_i) = \frac{1}{K} \sum_{k=1}^{K} g(y_i^k)$$
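The average fusion step amounts to an element-wise mean over the sentence embeddings. The short sketch below assumes K = 5 branch outputs of dimension 128 that have already been $l_2$-normalized; the random embeddings and the helper names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def average_fusion(sentence_embeddings):
    """Fuse K sentence embeddings (K, D) into one text embedding (D,)."""
    return sentence_embeddings.mean(axis=0)

rng = np.random.default_rng(0)
K, D = 5, 128
g_y = l2_normalize(rng.standard_normal((K, D)))  # stand-ins for the K branch outputs
g_T = average_fusion(g_y)
```

Averaging keeps the embedding dimension unchanged, so the fused text feature can be compared directly with the 128-dimensional image feature.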

DBTN Optimization
Many machine learning and computer vision problems are based on learning a distance metric for solving retrieval problems [63]. Inspired by the achievements of deep learning in computer vision [26], deep neural networks have been used to learn how to embed discriminative features [64,65]. These methods learn to project images or texts into a discriminative embedding space, in which the embedded vectors of similar samples are close together and those of dissimilar samples are farther apart. Several loss functions were then developed for optimization, such as triplet [65], quadruplet [66], lifted structure [67], N-pairs [68], and angular [69] losses. In this work, we concentrate on the triplet loss, which aims to learn a discriminative embedding for various applications such as classification [64], retrieval [70][71][72][73][74], and person re-identification [75,76]. It is worth recalling that a standard triplet in image-to-image retrieval is composed of three samples: an anchor, a positive sample (from the same category as the anchor), and a negative sample (from a different category than the anchor). The aim of the triplet loss is to learn an embedding space where anchor samples are closer to positive samples than to negative ones by a given margin.
In our case, unlike standard triplet networks, the network is composed of asymmetric branches, as the anchor, positive, and negative samples are represented in different ways. For instance, triplets can be formed using a text as an anchor, its corresponding image as a positive sample, and an image with different content as a negative. Similarly, one can use an image as an anchor associated with positive and negative textual descriptions. The aim is to learn discriminative features for different textual descriptions and discriminative features for different visual features as well. In addition, we should learn similar features for each image and its corresponding textual representation. For this purpose, we propose a bidirectional triplet loss as a possible solution to the problem. The bidirectional triplet loss is given as follows:
$$L = \lambda_1 \sum_i \left| \|g(T_i^a) - f(X_i^p)\|_2^2 - \|g(T_i^a) - f(X_i^n)\|_2^2 + \alpha \right|_+ + \lambda_2 \sum_i \left| \|f(X_i^a) - g(T_i^p)\|_2^2 - \|f(X_i^a) - g(T_i^n)\|_2^2 + \alpha \right|_+$$
where $|z|_+ = \max(z, 0)$, and $\alpha$ is the margin that ensures the negative is farther away than the positive. $g(T_i^a)$ refers to the embedding of the anchor text, $f(X_i^p)$ is the embedding of the positive image, and $f(X_i^n)$ refers to the embedding of the negative image. On the other side, $f(X_i^a)$ refers to the embedding of the anchor image, $g(T_i^p)$ is the embedding of the positive text, and $g(T_i^n)$ refers to the embedding of the negative text. $\lambda_1$ and $\lambda_2$ are regularization parameters controlling the contribution of both terms.
The performance of DBTN heavily relies on triplet selection. Indeed, the training process is often very sensitive to the selected triplets, i.e., selecting the triplets randomly leads to non-convergence. To surmount this problem, the authors in [77] proposed triplet mining, which utilized only semi-hard triplets, where the positive pair was closer than the negative. Such valid semi-hard triplets are scarce, and therefore semi-hard mining requires a large batch size to search for informative pairs. A framework named smart mining was provided by Harwood et al. [78] to find hard samples from the entire dataset, which suffered from the burden of off-line computation. Wu et al. [79] discussed the significance of sampling and proposed a sampling technique called distance weighted sampling, which uniformly samples negative examples by similarity. Ge et al. [80] built a hierarchical tree of all the classes to find hard negative pairs, which were collected via a dynamic margin. In this paper, we propose to use a semi-hard mining strategy, as shown in Figure 5, although other sophisticated selection mechanisms could be investigated as well. In particular, we selected triplets in an online mode based on the following constraint [77]:
$$d(a, p) < d(a, n) < d(a, p) + \alpha$$
where a, p, and n denote the embeddings of the anchor, positive, and negative samples, respectively, and d(·) is the cosine distance.
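The two directions of the loss and the semi-hard selection rule can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the authors' implementation: squared Euclidean distances are used inside the hinge, the semi-hard test follows the usual formulation of [77] with a cosine distance, and the margin value 0.5 and all function names are hypothetical (only the weights of 0.5 for both terms come from the experimental setup).

```python
import numpy as np

def sq_dist(a, b):
    """Squared Euclidean distance along the last axis."""
    return np.sum((a - b) ** 2, axis=-1)

def cosine_distance(a, b):
    """One minus the cosine similarity along the last axis."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - num / den

def bidirectional_triplet_loss(t_a, x_p, x_n, x_a, t_p, t_n,
                               alpha=0.5, lam1=0.5, lam2=0.5):
    """Hinge loss with a text anchor plus hinge loss with an image anchor."""
    text_anchor = np.maximum(sq_dist(t_a, x_p) - sq_dist(t_a, x_n) + alpha, 0.0)
    image_anchor = np.maximum(sq_dist(x_a, t_p) - sq_dist(x_a, t_n) + alpha, 0.0)
    return lam1 * np.sum(text_anchor) + lam2 * np.sum(image_anchor)

def is_semi_hard(anchor, positive, negative, alpha=0.5):
    """True when the negative is farther than the positive but inside the margin."""
    d_ap = cosine_distance(anchor, positive)
    d_an = cosine_distance(anchor, negative)
    return (d_ap < d_an) & (d_an < d_ap + alpha)

anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
far_negative = np.array([0.0, 1.0])    # already separated by more than the margin
near_negative = np.array([0.8, 0.6])   # farther than the positive, within the margin

easy_loss = bidirectional_triplet_loss(anchor, positive, far_negative,
                                       anchor, positive, far_negative)
collapsed_loss = bidirectional_triplet_loss(anchor, positive, positive,
                                            anchor, positive, positive)
```

In this toy case the well-separated triplet contributes no loss, whereas a negative that coincides with the positive is penalized by the full margin in both directions; only `near_negative` would be retained by the semi-hard rule.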


Dataset Description
We built a dataset, named TextRS, by collecting images from four well-known scene datasets. The AID dataset consists of 10,000 aerial images of size 600 × 600 pixels within 30 classes, collected from Google Earth imagery acquired by different remote sensors. The Merced dataset contains 21 classes; each class has 100 images of size 256 × 256 pixels with a resolution of 30 cm in RGB color. This dataset was collected from USGS. The PatternNet dataset was gathered from high-resolution imagery and includes 38 classes; each class contains 800 images of size 256 × 256 pixels. The NWPU dataset is another scene dataset, which has 31,500 images and is composed of 45 scene classes.
TextRS is composed of 2144 images selected randomly from the above four scene datasets. In particular, 480, 336, 608, and 720 images were selected from AID, Merced, PatternNet, and NWPU, respectively (16 images were selected from each class of these datasets). Each remote sensing image was then annotated with five different sentences; therefore, the total number of sentences was 10,720. All the captions of this dataset were generated by five people to ensure diversity. It is worth recalling that the choice of five sentences was mainly motivated by other datasets developed in the general computer vision literature [47,81]. During the annotation, we followed some rules for generating the sentences:

• Focus on the main dominating objects (tiny ones may be useless).
• Describe what exists instead of what does not exist in the scene.
• Try not to focus on the number of objects too much but use generic descriptions such as several, few, many, etc.
• Try not to emphasize the color of objects (e.g., blue vehicles) but rather their existence and density.
• When mentioning, for instance, a parking lot (in an airport), it is important to mention the word 'airport' as well to distinguish it from any generic parking lot (downtown for example).
• Avoid using punctuation and conjunctions.
Some samples from our dataset are shown in Figure 6.
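The composition of TextRS stated above can be checked with a few lines of arithmetic. The sketch uses only the class counts, the 16 images per class, and the five sentences per image given in the text; the variable names are illustrative.

```python
# Number of scene classes in each source dataset, as stated in the text.
classes_per_dataset = {"AID": 30, "Merced": 21, "PatternNet": 38, "NWPU": 45}
images_per_class = 16
sentences_per_image = 5

# Images drawn from each dataset, then the dataset-wide totals.
images_per_dataset = {
    name: n_classes * images_per_class
    for name, n_classes in classes_per_dataset.items()
}
total_images = sum(images_per_dataset.values())
total_sentences = total_images * sentences_per_image
```

The per-dataset counts come out as 480, 336, 608, and 720 images, for 2144 images and 10,720 sentences in total, which agrees with the figures reported for the dataset.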

Performance Evaluation
We implemented the method using Keras, an open-source deep learning library written in Python. We randomly selected 1714 images for training and the remaining 430 images for testing, corresponding to approximately 80% for training and 20% for testing. For training the DBTN, we used a mini-batch size of 50 images and the Adam optimization method with a fixed learning rate of 0.001 and exponential decay rates for the moment estimates of 0.9 and 0.999. Additionally, we set the regularization parameters to the default values of $\lambda_1 = \lambda_2 = 0.5$. To evaluate the performance of the method, we used the widely adopted recall measure, which is suitable for text-to-image retrieval problems. In particular, we present the results in terms of Recall@K (R@K) for different values of K (1, 5, 10), i.e., the percentage of ground-truth matches appearing in the top K-ranked results. We conducted the experiments on a workstation with an Intel Core i9 processor with a speed of 3.6 GHz and 32 GB of memory, and a Graphical Processing Unit (GPU) with 11 GB of GDDR5X memory.
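The Recall@K measure can be computed as in the following sketch, which assumes that each query text has exactly one ground-truth image and that embeddings are compared by dot product (equivalent to cosine similarity for $l_2$-normalized vectors). The function name `recall_at_k` and the toy embeddings are hypothetical.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, gt_index, ks=(1, 5, 10)):
    """Percentage of queries whose ground-truth image is in the top K results.

    text_emb: (Q, D) query embeddings; image_emb: (N, D) gallery embeddings;
    gt_index: (Q,) index of the matching image for each query.
    """
    sims = text_emb @ image_emb.T              # similarity of every query to every image
    ranking = np.argsort(-sims, axis=1)        # most similar image first
    recalls = {}
    for k in ks:
        hits = (ranking[:, :k] == gt_index[:, None]).any(axis=1)
        recalls[k] = 100.0 * hits.mean()
    return recalls

# Toy gallery of three images and three queries; the third query is
# closer to images 0 and 1 than to its own ground-truth image 2.
image_emb = np.eye(3)
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.6, 0.8, 0.0]])
gt_index = np.array([0, 1, 2])
recalls = recall_at_k(text_emb, image_emb, gt_index, ks=(1, 2, 3))
```

In this toy example two of the three queries rank their image first, so R@1 and R@2 are both about 66.7%, and R@3 reaches 100% once the whole gallery is returned.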

Results
As mentioned in the previous sections, we used four different pre-trained CNNs for the image encoding branch, which were EfficientNet, ResNet50, Inception_v3, and VGG16. Figure 7 illustrates the evolution of the triplet loss function during the training phase for these different networks. We can see that the loss function decreased gradually with an increase in the number of iterations. In general, the model reached stable values after 40 iterations. In Figure 8 we show examples of features obtained by the image and text encoding branches at the end of the training process.
Table 1 illustrates the performance of DBTN using EfficientNet as a pre-trained CNN for encoding the visual features. With one sentence (Sent.1), the method achieved 13.02%, 40%, and 59.30% in R@1, R@5, and R@10, respectively. In contrast, when the five sentences were fused, the performance improved to 17.20%, 51.39%, and 73.02% in R@1, R@5, and R@10, respectively. Further, we computed the average of R@1, R@5, and R@10 for each sentence and for the fusion, and observed that the fusion had the highest average score. Table 2 shows the results obtained using ResNet50 as the image encoder. The performances in R@1, R@5, and R@10 were 10.93%, 38.60%, and 54.41%, respectively, for Sent.1, while the method achieved 13.72%, 50.93%, and 69.06%, respectively, with the fusion. Similarly, Table 3 shows that with Inception_v3 the fusion also performed better than the individual sentences. Finally, the results of using VGG16 are shown in Table 4. For Sent.1, our method achieved 10%, 36.27%, and 51.62% in R@1, R@5, and R@10, respectively, whereas the fusion yielded 11.86%, 44.41%, and 63.72%, respectively.
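The average fusion of the five sentence representations can be sketched as follows. This is a simplified illustration that assumes each sentence has already been encoded into a fixed-length vector by the text branch and that Euclidean distance is used for matching; the helper names and the toy vectors are hypothetical.

```python
import math


def average_fusion(sentence_embeddings):
    """Element-wise mean of the sentence embeddings describing one image."""
    n = len(sentence_embeddings)
    dim = len(sentence_embeddings[0])
    return [sum(vec[i] for vec in sentence_embeddings) / n for i in range(dim)]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_images(text_embedding, image_embeddings):
    """Image indices sorted from the closest to the farthest from the text."""
    return sorted(range(len(image_embeddings)),
                  key=lambda j: euclidean(text_embedding, image_embeddings[j]))


# Toy 2-D embeddings of the five sentences of one image (hypothetical values).
sentences = [[1.0, 0.0], [0.8, 0.2], [1.2, -0.2], [0.9, 0.1], [1.1, -0.1]]
fused = average_fusion(sentences)

# Toy image embeddings; image 1 is the one the sentences describe.
images = [[0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]
ranking = rank_images(fused, images)
print(fused, ranking)
```

Averaging cancels part of the variation between individual descriptions, which is one plausible reason why the fused representation matches the image more reliably than a single sentence.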
According to these preliminary results, fusing the representations of the five sentences produced better matching results than using one sentence. Additionally, EfficientNet performed better than the other three pre-trained networks, which indicates that the visual features learned by EfficientNet were more effective for this task.
To analyze the performance in detail for image retrieval given a query text, we show several successful and failure scenarios. For example, Figure 9 shows a given query text (five sentences) with its image and the top nine retrieved images (from left to right); the image in the red box is the ground-truth image of the query text (true match). We can observe that our method output reasonably relevant images, where all nine images had almost the same content (objects). In these three scenarios, the rank of the retrieved true image was 1, 6, and 1, respectively.
Figure 9. Successful scenarios (a, b and c) of text-to-image retrieval.
In contrast, Figure 10 shows two failure scenarios. In these cases, we obtained both relevant and irrelevant images, but the true matched image was not retrieved. This indicates that the problem is not easy and requires further investigation into improving the alignment of the descriptions with the image content.
Figure 10. Unsuccessful scenarios (a and b) of text-to-image retrieval.

Discussion
In this section, we further analyze the performance of DBTN using other versions of EfficientNet, namely B0, B3, and B5, in addition to the B2 version. B0 contains 5.3M parameters, while B3 and B5 are deeper and have 12M and 30M parameters, respectively. The results reported in Table 5 show that B2 yields slightly better results than the other models. On the other hand, B0 is the least competitive, as it provides an average recall of 45.65 compared to 47.20 for B2.
Table 6 shows a sensitivity analysis for bidirectional text-image matching at multiple margin values. Setting this parameter to α = 0.5 seems to be the most suitable choice. Increasing this value further leads to a decrease in the average recall, as the network tends to select easy negative triplets.
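The average recall used in these comparisons is the mean of R@1, R@5, and R@10. As a quick check, assuming the B2 figure corresponds to the fused EfficientNet scores reported in Table 1, the value of 47.20 can be reproduced as follows (the variable names are illustrative):

```python
# Fused-sentence recall scores (in %) reported for EfficientNet in Table 1.
recall_b2 = {"R@1": 17.20, "R@5": 51.39, "R@10": 73.02}

# Average recall = mean of the three R@K values.
average_recall = sum(recall_b2.values()) / len(recall_b2)
print(f"{average_recall:.2f}")  # 47.20
```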
In Table 7, we report the recall results obtained by training in only one direction instead of bidirectionally, that is, using either text-to-image triplets (anchor text) or image-to-text triplets (anchor image). Relative similarity learned in one direction is useful for retrieval in the other direction, in the sense that the model trained with text-to-image triplets obtains a reasonable result in the image-to-text retrieval task and vice versa. Nevertheless, the model trained with bidirectional triplets achieves the best result, indicating that bidirectional triplets provide more overall information for text-to-image matching.
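The difference between one-directional and bidirectional training can be illustrated with a simplified hinge-based sketch. It is not the exact formulation used in the experiments: it assumes squared Euclidean distances, a margin `alpha`, and that λ1 and λ2 weight the text-anchored and image-anchored terms, respectively. All function and argument names are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance (an assumed choice of metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_term(anchor, positive, negative, alpha):
    """Hinge term: the positive should be closer to the anchor than the
    negative by at least the margin alpha."""
    return max(0.0, squared_distance(anchor, positive)
               - squared_distance(anchor, negative) + alpha)


def bidirectional_triplet_loss(text_pos, image_pos, text_neg, image_neg,
                               alpha=0.5, lambda1=0.5, lambda2=0.5):
    # Text as anchor: matching image versus non-matching image.
    text_to_image = triplet_term(text_pos, image_pos, image_neg, alpha)
    # Image as anchor: matching text versus non-matching text.
    image_to_text = triplet_term(image_pos, text_pos, text_neg, alpha)
    return lambda1 * text_to_image + lambda2 * image_to_text


# Toy 2-D embeddings (hypothetical values).
t_pos, i_pos = [1.0, 0.0], [0.9, 0.1]
t_neg, i_neg = [0.0, 1.0], [0.8, 0.3]

loss = bidirectional_triplet_loss(t_pos, i_pos, t_neg, i_neg)
# One-directional variant under the same assumptions: image anchor only.
loss_image_anchor = bidirectional_triplet_loss(t_pos, i_pos, t_neg, i_neg,
                                               lambda1=0.0, lambda2=1.0)
print(loss, loss_image_anchor)
```

Under these assumptions, setting one of the two weights to zero gives a one-directional variant of the kind compared in Table 7. In the toy example the non-matching image is a hard negative that violates the margin, whereas the non-matching text is already far enough away to contribute nothing.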

Conclusions
In this work, we proposed a novel DBTN architecture for matching textual descriptions to remote sensing images. Different from traditional remote sensing image-to-image retrieval, our network addresses a more challenging problem, namely text-to-image retrieval. The network is composed of image and text encoding branches and is trained using a bidirectional triplet loss.
In the experiments, we validated the method on a new benchmark dataset termed TextRS. The experiments show generally promising results in terms of the recall measure. In particular, better recall scores were obtained by fusing the textual representations rather than using one sentence for each image. In addition, EfficientNet yields better visual representations than the other pre-trained CNNs. For future developments, we propose to investigate image-to-text matching and to develop advanced solutions based on attention mechanisms.