Extracting Representative Images of Tourist Attractions from Flickr by Combining an Improved Cluster Method and Multiple Deep Learning Models

Abstract: Extracting representative images of tourist attractions from geotagged photos is beneficial to many fields in tourist management, such as applications in touristic information systems. This task usually begins with clustering to extract tourist attractions from raw coordinates in geotagged photos. However, most existing cluster methods are limited in the accuracy and granularity of the places of interest, as well as in detecting distinct tags, due to their primary consideration of spatial relationships. After clustering, the challenge still exists for the task of extracting representative images within the geotagged base image data, because of the existence of noisy photos occupied by a large area proportion of humans and unrelated objects. In this paper, we propose a framework containing an improved cluster method and multiple neural network models to extract representative images of tourist attractions. We first propose a novel time- and user-constrained density-joinable cluster method (TU-DJ-Cluster), specific to photos with similar geotags, to detect place-relevant tags. Then we merge and extend the clusters according to the similarity between pairs of tag embeddings, as trained from Word2Vec. Based on the clustering result, we filter noise images with a Multilayer Perceptron and a single-shot multibox detector model, and further select representative images with the deep ranking model. We select Beijing as the study area. The quantitative and qualitative analysis, as well as the questionnaire results obtained from real-life tourists, demonstrate the effectiveness of this framework.


Introduction
An increasing number of studies related to tourism geography have been conducted in recent years because the tourism industry makes a significant contribution to the global economy: the total spending on tourism abroad in 2016 reached $1.23 trillion, and international tourist arrivals in 2017 reached 1.32 billion, with growth of 4% per year over eight years [1]. Among these studies, extracting representative images of tourist attractions is an enduring and practical research topic. It can provide informative descriptions about the tourist attractions [2]. Furthermore, it can be applied in building touristic information systems [3] and generating tourist maps [4], as well as providing image content to multibox detector (SSD) model [22]. Then the similarity of the filtered images is ranked by the deep ranking model [23], returning a ranked location list with representative images. We choose Beijing as the study area. The results of the proposed cluster method and those of the existing methods are compared in terms of the number of clusters, clustering accuracy and semantic distinctiveness, and the results of representative image selection are analyzed qualitatively. Additionally, a questionnaire is conducted to evaluate whether the overall results satisfy tourists in real life. The results achieved demonstrate the effectiveness of the framework.
The remainder of this paper is organized as follows. Section 2 reviews related work on clustering methods for geotagged photos, and selecting representative images for the extracted locations. Section 3 introduces the preliminary and the overall framework of extracting tourist attractions and their representative images. Section 4 describes the study area and discusses the implementation results and evaluation. Section 5 summarizes this paper and points out the directions for future work.

Geotagged Photo Clustering
Clustering is the premise and basis for representative image selection from geotagged photos. Among clustering methods, density-based cluster methods are widely applied in clustering geotagged photos because they do not need the number of clusters to be predefined and can filter noise points [11,24]. These density-based cluster methods include DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [6,8,24] and various modified versions of DBSCAN, such as P-DBSCAN [25], a method that considers a pre-set radius with the minimum number of photo owners [11,26]. After clustering, some studies directly reverse-geocode these clusters and identify the corresponding place names by using tools such as Geonames [12] and the Google Places API [11]. However, the diversity of reverse geocoding results makes it difficult to judge the accuracy, and positioning errors sometimes make it worse [27,28]. Other articles leverage the textual contents attached to geotagged photos to obtain more accurate place names, mainly by calculating TF-IDF or its variants for each cluster and taking the representative tags as place names or information [14,29]. Nevertheless, most of the cluster methods can only generate a coarse-grained cluster result, for instance, areas of interest, which possibly contain more than one tourist attraction, especially in areas with a high density of uploaded photos. Such results are not beneficial to further applications, such as tourist attraction recommendation.
A small proportion of researchers choose to retrieve geotagged photos with standard names of tourist attractions [15,30]. This may lead to low recall and noisy results. A few studies attempted to detect place tags in geotagged photos without referring to any gazetteer by clustering photos with the same tag and analyzing the spatial distribution [18]. As mentioned above, there are plenty of abbreviations and alternate spellings of standard place names in the tag sets, which are non-trivial to resolve. However, when merging the place tags, previous studies mainly consider the similarity of spatial distribution between tags, and few consider the semantic similarity of place tags. Furthermore, some tags related to events taking place in fixed places, or tags frequently used by only a few users, cannot be filtered. Therefore, further improvement is needed in clustering places with geotagged photos.

Representative Image Selection
Representative image selection is a popular, but also challenging, topic due to the existence of noisy images in geotagged photos. A few studies used supervised learning methods to extract representative images of certain places. For instance, Crandall et al. [31] utilize an SVM model to distinguish tourist attraction photos from negative ones obtained from other locations. Similarly, Samany [32] applies a deep belief network to classify landmarks in Tehran, Iran. Kim et al. [33] take another approach, categorizing and analyzing the representative images of major components in each area of interest in Seoul with an Inception v3 model pre-trained on ImageNet. However, labeling images for supervised learning costs extensive manual labor [34]. Besides, it is hardly possible to train classifiers for every tourist attraction in the world [30]. Therefore, a more common approach is to compare image properties and find similar images after clustering or extracting places from geotags and tags. Among those image properties, SIFT is frequently used [18,31]. Other properties are also used, including GIST [29,35], color histograms [36], etc. For better representation, some studies combine more than one image property [36,37]. However, the performance may be limited to a great extent by the representation power of these hand-crafted features [23]. With the development of convolutional neural networks, researchers have gradually started to leverage convolution-based models for this task. For instance, Ding and Fan [38] combine SURF (an algorithm similar to SIFT) and LIFT (a deep-learning model) to find representative images and match untagged images to them.
To select better representative images, some filtering preprocessing has also been attempted in previous studies. Most of these studies aim to filter images with humans, by either applying a sophisticated library (such as OpenCV) [8] or training a deep learning model for image classification [5]. However, all of the previous studies applied an undifferentiated filtering process, which may fail to find the representative image for some tourist attractions. In addition, apart from images with humans, few consider other types of noise images, such as artificial images (e.g., a logo) and images with objects (e.g., an apple) [39]. In summary, more effort is required to tap the potential of convolution-based models in representative image selection for tourist attractions.

Preliminary
We define the set of photos in a certain study area as $P = \{p_1, p_2, \ldots, p_{|P|}\}$, where each $p_i \in P$ consists of a tuple of attributes, represented as $p_i = (id_{p_i}, t_{p_i}, l_{p_i}, u_{p_i}, X_{p_i})$. These attributes include the unique photo ID $id_{p_i}$, taken time $t_{p_i}$, taken location $l_{p_i}$ (represented by latitude $lat_{p_i}$ and longitude $lon_{p_i}$), the user who contributes this photo $u_{p_i}$, and a list of tags $X_{p_i} = \{x_1, x_2, \ldots, x_{|X_{p_i}|}\}$. Note that the number of tags in $X_{p_i}$ could be zero or any positive integer, and a tag $x$ can be attached to one or more photos. We represent the set of tags as $X = \{x_1, x_2, \ldots, x_{|X|}\}$, and the subset of photos that are attached to a specific tag $x$ as $P_x = \bigcup_{p_i \in P,\, x \in X_{p_i}} \{p_i\}$.
Our goal is to detect the set of place-relevant tags, merge and extend the resulting clusters, and, based on the cluster results, find the representative images of each tourist attraction. Figure 1 shows the overall framework, and each step is illustrated in detail in the following sections.
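The notation above can be made concrete with a small data structure. The following is only a minimal sketch: the class name `Photo`, its field names, and the helper `photos_by_tag` are hypothetical and are not part of the framework itself. It shows one way to hold the tuple $p_i$ and to build the per-tag subsets $P_x$.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Photo:
    """One geotagged photo p_i = (id, t, l, u, X). Field names are illustrative."""
    photo_id: str
    taken: datetime
    lat: float
    lon: float
    user: str
    tags: tuple = field(default_factory=tuple)  # X_{p_i}; may be empty


def photos_by_tag(photos):
    """Build P_x for every tag x: the photos whose tag list contains x."""
    index = defaultdict(list)
    for p in photos:
        for x in set(p.tags):  # a photo is counted once per distinct tag
            index[x].append(p)
    return dict(index)
```

A photo without tags simply does not appear in any $P_x$, which matches the note that the tag list may be empty.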

Data Acquisition
Harvesting data is the first step of the framework. As mentioned above, the Flickr geotagged photo dataset is an optimal choice for this study, due to its several advantages. Apart from the Flickr APIs, the data can also be conveniently obtained from the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M) hosted on Amazon AWS. As a part of the Yahoo Webscope program [40], it provides approximately 100 million public Flickr photos, each including user id, longitude, latitude, user tags, capture time, capture device, photo/video page URL, license URL, etc. This provides an adequate amount of data under a Creative Commons Attribution License and frees researchers from troublesome data crawling work. We leverage geotagged photos from YFCC100M whose coordinates fall within the study area and which were taken within a certain time period. The main features we use in this paper are: line number (the unique identification of each geotagged photo), user id (the unique identification of each user), capture time, geotag (longitude and latitude), user tags and the images themselves.

User Filtering
Because we aim to extract tourist attractions in this study, geotagged photos uploaded by natives should be removed, since most of their check-in records concern daily life and events unrelated to tourism. Similar to the study conducted by Sun et al. [24], we leverage an entropy-based method to distinguish tourists from natives in the tourist destination, formulated as Equations (1) and (2):

$$E(u) = -\sum_{m=1}^{Mon(u)} P_m(u) \log P_m(u) \qquad (1)$$

$$P_m(u) = \frac{D_m(u)}{\sum_{m'=1}^{Mon(u)} D_{m'}(u)} \qquad (2)$$

In Equation (2), $D_m(u)$ is the number of days that user $u$ has stayed in the study area in Month $m$, and $Mon(u)$ is the total number of months that user $u$ has stayed in the study area. $P_m(u)$ is the ratio of the number of days in Month $m$ to the total number of days that user $u$ has stayed in the study area. Intuitively, the larger the value of $E(u)$, the more dispersed the visiting distribution of user $u$, and the less likely he/she is a tourist. We therefore define a threshold $E$ and remove the geotagged photos of user $u$ if the value of $E(u)$ is larger than $E$.
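The entropy-based filter can be illustrated with a short sketch. It assumes that a "month" is a calendar month identified by (year, month), that a "day stayed" is any day with at least one photo, and that the natural logarithm is used; none of these choices is stated in the text, and the function names are hypothetical.

```python
import math
from collections import Counter


def visit_entropy(visit_days):
    """Entropy E(u) of a user's monthly visiting distribution.

    visit_days: iterable of (year, month, day) tuples on which the user
    took at least one photo in the study area (duplicates are ignored).
    """
    days_per_month = Counter((y, m) for (y, m, _d) in set(visit_days))  # D_m(u)
    total_days = sum(days_per_month.values())
    if total_days == 0:
        return 0.0
    entropy = 0.0
    for d_m in days_per_month.values():
        p_m = d_m / total_days  # P_m(u)
        entropy -= p_m * math.log(p_m)
    return entropy


def is_tourist(visit_days, threshold):
    """Keep a user when E(u) does not exceed the threshold (assumed rule)."""
    return visit_entropy(visit_days) <= threshold
```

A visitor whose photos all fall in one month gets an entropy of zero, whereas a user active for one day in each of four different months gets $\log 4$, so a threshold between these values would separate the two.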

Tag Processing
Tags must be preprocessed and Word2Vec must be trained before detecting place tags and clustering places. Preprocessing resolves common linguistic ambiguities, such as white spaces, word separation and capitalization, and regularizes terms. After that, we leverage Word2Vec to extract semantic relationships, where all tags in the study area form the corpus, and the tags of each photo form one sentence.
In order to extract the intended semantics, we adapt the word exclusion threshold as a user constraint (i.e., tags that are used by fewer than a minimum number of users will not be trained). Also, we define the word neighbourhood to encapsulate all phrases attached to the same photo. We leverage Skip-gram in Word2Vec to train the tag sets, which mainly aims to maximize the log-likelihood of the contextual word given the center word, formulated as Equation (3):

$$\mathcal{L} = \sum_{x_t \in X} \sum_{x_c \in C(x_t)} \log p(x_c \mid x_t) \qquad (3)$$

where $x_t$ represents the given word, $C(x_t)$ represents the context of $x_t$, and $x_c \in C(x_t)$ (which excludes $x_t$) represents a neighbouring word. Skip-gram defines $p(x_c \mid x_t)$ using the softmax function. However, the cost of computing Equation (3) is impractically large when using the softmax function. Therefore, the hierarchical softmax function and negative sampling have been proposed as two computationally efficient approximations for Equation (3). In this paper, the hierarchical softmax function is used to improve efficiency; it utilizes a binary Huffman tree to represent the output layer with words and explicitly represents the relative probabilities of the child nodes for each node [41].
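The two adaptations described above (a user-based exclusion threshold and a context window covering the whole photo) can be sketched without any Word2Vec library. The snippet below is an illustration under stated assumptions, not the training code used in the study: the function names are hypothetical, and it only prepares the vocabulary and the (center, context) pairs that a Skip-gram trainer would consume.

```python
from collections import defaultdict
from itertools import permutations


def tags_with_min_users(photo_tags, min_users):
    """Keep tags used by at least `min_users` distinct users.

    photo_tags: list of (user, tags) pairs, one per photo.
    """
    users_per_tag = defaultdict(set)
    for user, tags in photo_tags:
        for x in tags:
            users_per_tag[x].add(user)
    return {x for x, users in users_per_tag.items() if len(users) >= min_users}


def skipgram_pairs(photo_tags, vocab):
    """Yield (center, context) pairs.

    The context C(x_t) of a tag is every other kept tag on the same photo,
    so each photo acts as one sentence with an unbounded window.
    """
    for _user, tags in photo_tags:
        kept = [x for x in dict.fromkeys(tags) if x in vocab]  # dedupe, keep order
        yield from permutations(kept, 2)
```

Note that a library-level frequency cutoff usually counts occurrences rather than users, so filtering by user count beforehand, as in this sketch, is one way to enforce the user constraint.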

Photo Clustering
Photo clustering includes place tag extraction, merging, and cluster extension. The process of place tag extraction and merging begins with an arbitrary tag $x$ that has not yet been processed. If the number of photos containing tag $x$ (given as $|P_x|$) is less than the minimum number of photos min_pts, the tag is marked as a noise tag and the next tag is processed; otherwise, the photos are clustered with TU-DJ-Cluster. TU-DJ-Cluster is a modified density-joinable cluster method [20] that is further constrained by a time threshold ∆t and a minimum number of users, and is specific to geotagged photos. The main process of TU-DJ-Cluster is illustrated in Figure 2: (a) extract all points with the same tag as the clustering dataset each time, where different colors represent photos taken by different users; (b) calculate the neighbourhood of each point within a radius of eps; (c) mark points with no neighbourhood as noise points, and join neighbourhoods that share at least one common point; (d) after generating an initial cluster result, check whether each cluster meets the minimum time threshold and the minimum number of users, and mark the points of any cluster that does not as noise points.
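Steps (b) to (d) can be sketched as follows. This is a simplified illustration, not the reference implementation: it assumes planar coordinates in metres, a brute-force neighbour search, and it interprets the time constraint as "the span between the earliest and latest photo in a cluster must be at least ∆t", which is an assumption. The function name and the tuple layout are hypothetical.

```python
import math
from collections import defaultdict


def tu_dj_cluster(points, eps, delta_t, min_users):
    """points: list of (x, y, t, user); x, y in metres, t in seconds.

    Returns (clusters, noise), both holding indices into `points`.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    has_neighbour = [False] * n
    # (b) + (c): two points within eps lie in each other's neighbourhood;
    # neighbourhoods sharing a point are joined via union-find.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i][:2], points[j][:2]) <= eps:
                has_neighbour[i] = has_neighbour[j] = True
                parent[find(i)] = find(j)

    groups = defaultdict(list)
    for i in range(n):
        if has_neighbour[i]:
            groups[find(i)].append(i)

    clusters = []
    noise = [i for i in range(n) if not has_neighbour[i]]
    # (d): enforce the time and user constraints on every initial cluster.
    for members in groups.values():
        times = [points[i][2] for i in members]
        users = {points[i][3] for i in members}
        if max(times) - min(times) >= delta_t and len(users) >= min_users:
            clusters.append(sorted(members))
        else:
            noise.extend(members)
    return clusters, sorted(noise)
```

With `delta_t=0` and `min_users=0` the constraints in step (d) never reject a cluster, which corresponds to an unconstrained density-joinable clustering.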
ISPRS Int. J. Geo-Inf. 2020, 9, 81 7 of 22
After clustering $P_x$ with TU-DJ-Cluster, we obtain the cluster results $C_x$. If no cluster is generated, the tag $x$ is marked as a noise tag. Otherwise, we loop through these clusters and determine whether there is a cluster whose share of the total number of photos $|P_x|$ is larger than the minimum proportion p_pro. If such a cluster $c_x^i$ exists, tag $x$ is marked as a place tag and a convex hull is created from the points in $c_x^i$. We further merge convex hulls according to their spatial relationships and the semantic similarity of their tags: if two convex hulls overlap and the similarity of their place tags is larger than the minimum threshold min_sim, they are merged. As Equation (4) shows, cosine similarity is used to calculate the similarity of two tags $x_i$ and $x_j$:

$$sim(x_i, x_j) = \frac{e_{x_i} \cdot e_{x_j}}{\|e_{x_i}\| \, \|e_{x_j}\|} \qquad (4)$$

where $e_{x_i}$ and $e_{x_j}$ represent the embeddings of tags $x_i$ and $x_j$, respectively, which are obtained from the above Word2Vec training. After processing all the convex hulls, we obtain a set of processed convex hulls with different semantics, $CH_x$.
The above cluster results only contain a small proportion of photos that are attached with place-relevant tags, because a subset of location-related photos is not tagged accordingly. Therefore, we continue to classify the unprocessed photos according to spatial relationships and semantic similarity to improve recall. Photos of touristic places are, by nature, captured within or near the place. We therefore create a buffer with radius $r$ for each convex hull in $CH_x$ generated by the above steps for further use.
Additionally, previous studies show that there is a correlation between tags and geotags [42], so we assume that photos taken at adjacent locations tend to be assigned similar tags. For each unclassified photo, we check whether it is located within any buffered convex hull in $CH_x^b$ and whether any of its attached tags has a similarity with the name of that convex hull larger than min_sim. If both conditions hold, the photo is assigned to the corresponding cluster.
The final output is a set of clusters that represent tourist attractions with different semantics.
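The decision rules used during place tag detection and merging reduce to a few comparisons. The sketch below assumes that the spatial overlap test between two convex hulls has already been computed elsewhere and is passed in as a boolean; the function names are hypothetical, while `p_pro` and `min_sim` follow the parameter names used in the text.

```python
import math


def cosine_similarity(e_i, e_j):
    """Cosine similarity of two tag embeddings, as in Equation (4)."""
    dot = sum(a * b for a, b in zip(e_i, e_j))
    norm = math.sqrt(sum(a * a for a in e_i)) * math.sqrt(sum(b * b for b in e_j))
    return dot / norm if norm else 0.0


def is_place_tag(cluster_sizes, n_tagged_photos, p_pro):
    """Tag x is place-relevant if one cluster holds more than p_pro of P_x."""
    return any(size / n_tagged_photos > p_pro for size in cluster_sizes)


def should_merge(hulls_overlap, e_i, e_j, min_sim):
    """Merge two convex hulls only if they overlap and their tags are similar."""
    return hulls_overlap and cosine_similarity(e_i, e_j) > min_sim
```

The same `cosine_similarity` check against `min_sim` would apply in the cluster extension step, where the tags of an unclassified photo are compared with the name of a buffered convex hull.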

Noise Image Filtering
Images that are place-irrelevant or are largely occupied by humans are removed with multiple pre-trained models. Inspired by the study conducted by Zhang et al. [39], we use the Caltech 101 dataset (an object image dataset) [43] and the Places2 dataset (a scene image dataset with most place types) [44] to train a binary classifier of place-relevant and place-irrelevant images. The two datasets complement one another for the target binary classification: Caltech 101 depicts individual, human-made objects, whereas Places2 explicitly shows geographically locatable landscapes. For training, we randomly select about 4,000 images from each dataset, transform them into 2,048-dimensional features and feed them into a Multilayer Perceptron (MLP), and use about 2,000 images to evaluate the accuracy. The final classification accuracy reaches 98.68%.
Next, we apply a single-shot multibox detector (SSD) model [22] to detect persons in images. It is a convolution-based object detection model, pre-trained on the PASCAL Visual Object Classes (VOC) dataset. We assume that the larger the proportion of an image occupied by at least one person, the more likely the image is a tourist's selfie in front of a tourist attraction. Examples are shown in Figure 3. Although both images show the same tourist attraction (the Great Wall in Beijing, China) and are both detected to contain two persons, Figure 3a seems more likely to be the representative image for this tourist attraction than Figure 3b. Given this assumption, we run detection on each image and filter it out if there is a person whose minimum bounding rectangle covers over 10% of the image.
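The 10% rule can be expressed independently of the detector. The following sketch assumes the person detections are already available as pixel bounding boxes in (xmin, ymin, xmax, ymax) form, which is an assumed format rather than the output format of any particular SSD implementation; the function names are hypothetical.

```python
def person_area_ratio(box, image_width, image_height):
    """Share of the image covered by one bounding box (clipped to the image)."""
    xmin, ymin, xmax, ymax = box
    xmin, ymin = max(0, xmin), max(0, ymin)
    xmax, ymax = min(image_width, xmax), min(image_height, ymax)
    area = max(0, xmax - xmin) * max(0, ymax - ymin)
    return area / (image_width * image_height)


def is_noise_image(person_boxes, image_width, image_height, max_ratio=0.10):
    """Filter the image if any single person covers more than max_ratio of it."""
    return any(
        person_area_ratio(box, image_width, image_height) > max_ratio
        for box in person_boxes
    )
```

Because the rule is applied per person rather than to the summed area, an image with several small, distant figures is kept, while an image dominated by one close-up person is removed.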


Representative Image Selection
After removing noise photos, we train a deep ranking model and find the most representative images of each tourist attraction. The deep ranking model is a convolutional model focusing on fine-grained visual similarity, which differs from most existing models that only focus on category-level similarity [23]. As shown in Figure 4, the model can integrate a commonly used convolutional network (ConvNet), such as VGG nets [45] or ResNet [46], with low-resolution paths and normalize their output features. Image triplets, each consisting of an anchor image, a positive image, and a negative image, are fed independently into three networks with the same architecture and shared parameters. The embedding outputs for these inputs are used to evaluate the hinge loss, and the gradients are back-propagated to the lower layers to optimize their parameters and minimize the hinge loss.
In our study, we leverage ResNet as the ConvNet in the model and Tiny-ImageNet [47] as the training dataset. For each image in the training dataset, we randomly select one image in the same category as the positive image and one image in any other category as the negative one to create the triplet input. To accelerate the training process, we initialize the ConvNet part of the model with ImageNet weights. After training, we obtain the model weights and transfer them to our dataset.
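The triplet construction and the hinge loss can be illustrated on plain feature vectors. This is a sketch under assumptions: the squared Euclidean distance follows the deep ranking formulation [23], while the margin value of 1.0, the exclusion of the anchor itself from the positive candidates, and the function names are illustrative choices not specified in the text.

```python
import random


def sample_triplet(anchor_idx, labels, rng=random):
    """Pick a positive from the anchor's category and a negative from any other.

    Raises IndexError if the anchor has no same-category partner or if
    all images share one category.
    """
    label = labels[anchor_idx]
    positives = [i for i, l in enumerate(labels) if l == label and i != anchor_idx]
    negatives = [i for i, l in enumerate(labels) if l != label]
    return anchor_idx, rng.choice(positives), rng.choice(negatives)


def squared_distance(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_hinge_loss(f_anchor, f_pos, f_neg, margin=1.0):
    """Zero once the negative is farther from the anchor than the positive by `margin`."""
    return max(
        0.0,
        margin + squared_distance(f_anchor, f_pos) - squared_distance(f_anchor, f_neg),
    )
```

A triplet that already satisfies the margin contributes no gradient, so training concentrates on triplets where the negative image is still embedded too close to the anchor.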

Study Area
We select Beijing as the study area to verify the framework. Beijing is the capital of China, and also the second-largest city in China. It has abundant tourism resources and attracts many domestic and international tourists every year [48]. The number of raw images bounded in Beijing is 145,397, and the number of users is 2,846. After filtering users as described in Section 3.3, the number of images is reduced to 140,891, and the number of users to 2,750. Figure 5 shows the photo distribution in Beijing.

Result of Place-Relevant Tag Detection
Before applying Word2Vec to the tag set, we analyzed the frequency distribution of the tags used in the study (Figure 6a), as well as the number of users using these tags (Figure 6b), and represented them as log-log plots. The plots reveal that both approximately follow a power-law distribution similar to the word frequency distribution in natural language, indicating that Word2Vec can reasonably be used to embed these tags under a constraint on user counts. We set the minimum number of users to three and the embedding size to 200. After filtering, the number of tags is reduced from 19,469 to 2,845.

To evaluate the ability of TU-DJ-Cluster to filter place-irrelevant tags, we compare it with a density-joinable cluster method without the time and user constraints, which can be regarded, to some extent, as DBSCAN with MinPts set to 1. We replace TU-DJ-Cluster with it in the photo clustering framework. Table 1 shows the parameter values for all methods in this experiment. The DBSCAN baseline sets both min_users and ∆t to zero, meaning that there is no restriction on the number of users or on time for clustering.
Table 2 lists the detection results of both methods. TU-DJ-Cluster detects 131 place-relevant tags, while DBSCAN, without the time and user constraints, detects 385. To better validate the accuracy of place-relevant tag detection, we invited volunteers who are familiar with Beijing to manually mark place-relevant tags, and used these labels to calculate the recall of TU-DJ-Cluster and DBSCAN, respectively. Recall is defined here as the proportion of true place-relevant tags among the detected tags, which can also be regarded as the hitting ratio (this corresponds to precision in standard terminology). The hitting ratio of TU-DJ-Cluster is much larger than that of DBSCAN: over 85% of its detected tags are true positives.
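The hitting ratio used in this comparison is a simple set computation. The sketch below uses hypothetical tag lists purely for illustration; they are not the tags or counts reported in Table 2, and the function name is an assumption.

```python
def hit_ratio(detected_tags, true_place_tags):
    """Share of detected tags that volunteers marked as true place-relevant tags."""
    detected = set(detected_tags)
    if not detected:
        return 0.0
    return len(detected & set(true_place_tags)) / len(detected)


# Hypothetical example: four detected tags, three of them confirmed as places.
detected = ["forbiddencity", "summerpalace", "tiantan", "office"]
confirmed = ["forbiddencity", "summerpalace", "tiantan", "birdsnest"]
ratio = hit_ratio(detected, confirmed)
```

Because the denominator is the number of detected tags, a method that detects many spurious tags is penalized even if it also finds more true place tags in absolute terms.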
Although TU-DJ-Cluster misses some true place-relevant tags (52 fewer than DBSCAN), many of them can be merged with the detected tags in the subsequent cluster extension, because the majority are alternate spellings or misspellings of the detected tags that are used by few users. By contrast, the false positives detected by DBSCAN generate many trivial clusters. Figure 7 shows some misidentification results of DBSCAN when detecting place-relevant tags. It detects "midi" (Figure 7a, a famous music festival held in Haidian Park, Beijing) and "cnbloggercon" (Figure 7b, a conference related to China's bloggers) as place-relevant tags, as well as tags related to personal places, such as "office" and "home", which are not shown in this figure. Our TU-DJ-Cluster algorithm, on the other hand, filters out these semantically irrelevant tags (see Supplementary Materials).
To demonstrate the effectiveness of applying Word2Vec in tag processing and similarity calculation, we list in Table 3 some pairs of place-relevant tags with high similarity that were found while merging the semantic convex hulls. The analysis shows that the processing detects and merges synonymous place-relevant tags, because synonymous tags are more likely to have high similarity. These include the English name and the "Pinyin" of a certain tourist attraction (for instance, "altarofheaven" and "tiantanpark", "oldsummerpalace" and "yuanmingyuan", etc.), abbreviations (for instance, both "nationalcentrefortheperformingarts" and "ncpa" can represent the National Centre for the Performing Arts) and alternate names (for instance, both "birdsnest" and "nationalstadium" can represent Beijing National Stadium).

Result of Photo Clustering
Following the parameters and process above, we obtain the overall clustering result of our framework. The result contains a total of 30 clusters, most of which are located in Dongcheng District and Xicheng District, including Tiananmen, the Forbidden City, Wangfujing, Jingshan Park, Drum Tower, etc., as shown in Figure 8.
For better illustration, we compare our framework with P-DBSCAN and TF-IDF-UF, following the process in the studies of Kennedy et al. [14] and Vu et al. [26]. As expected, P-DBSCAN extracts fewer clusters (16 clusters) with less distinctiveness than TU-DJ-Cluster, as shown qualitatively on the OpenStreetMap (OSM) base maps of Figure 9. Both P-DBSCAN and our method successfully detect the same places of interest, including the Old Summer Palace (also known as "Yuanmingyuan"), the art district "798" and the Summer Palace (also known as "Yiheyuan"; Figure 9a). However, because of the unbalanced distribution of point density within these places, the P-DBSCAN result in Figure 9a does not include the southwest part of the Summer Palace, which belongs to it according to OSM. Moreover, because the points in the northwest part of the P-DBSCAN result lie close to the high-density area, they are included in the cluster, although a random check of some of these photos shows that they are not semantically relevant to the Summer Palace. Consequently, although both methods detect the same place of interest, a clustering method that considers the semantic difference of photos obtains a finer-grained result and benefits further applications. Figure 9b compares the cluster results of TU-DJ-Cluster and P-DBSCAN around the Forbidden City. Our method detects different places of interest in this area, while P-DBSCAN groups this wide area into a single cluster. Although we tested several combinations of P-DBSCAN parameters during the experiment, most of them tend to group these different places of interest into the same cluster. One possible reason is that these popular tourist attractions of Beijing are densely located around the Forbidden City, which causes a relatively high density of geotagged photos and makes it difficult for P-DBSCAN to distinguish them. In addition, the TF-IDF-UF method chooses "beijing", a relatively unrepresentative tag, for this cluster. Such a cluster result may negatively affect further applications, such as tourist attraction recommendation. The comparison shows the superiority of our method over the traditional P-DBSCAN method in detecting fine-grained places of interest and extracting accurate and representative tags for them.

Result of Representative Image Selection
Based on the above cluster results of TU-DJ-Cluster, we further collect the corresponding images in each cluster to filter them and find representative images. We present the overall filtering result of each cluster as stacked bar charts in Figures 10 and 11. Figure 10 shows the absolute number of images, with tourist attractions sorted by their total number of images, which reflects the popularity of each attraction to some extent. As Figure 10 shows, the Forbidden City is the most popular tourist attraction, since its number of images far exceeds that of the others, followed by the Olympic Park, the Summer Palace, and Tiananmen Square. Figure 11 shows the proportion of different types of image content. Capital Museum, Zoo, and Zhongshan Park attract tourists mainly with historical relics, pandas, and tulips, respectively. Such tourist attractions have a relatively high proportion of images related to objects, which indicates that tourists tend to photograph objects when visiting them. In contrast, images with humans are dominant in tourist attractions such as Wangfujing and Ditan Park. For Wangfujing, this is easy to explain because it is a shopping area with a massive flow of people. Ditan Park likewise attracts tourists with its vibrant temple fairs and thus has many images containing humans. In addition, the chart indicates that tourist attractions with a magnificent appearance lead tourists to take more overall photos of them, because images related to scenes account for over 60% at tourist attractions like CCTV Building, National Theater, and Yuanmingyuan. To sum up, tourists do show different preferences when taking photos of different types of tourist attractions, and the difficulty of representative image selection also varies across different types of tourist attractions.
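The step from the absolute counts of Figure 10 to the proportions of Figure 11 is a simple normalization, sketched below. The sketch assumes three content categories (scene, human, object), as named in the text. The counts are hypothetical and are not taken from the figures.

```python
def content_proportions(counts):
    """Convert per-category image counts into proportions that sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {category: 0.0 for category in counts}
    return {category: n / total for category, n in counts.items()}


def dominant_category(counts):
    """Return the category holding the largest share of images."""
    return max(counts, key=counts.get)


# Hypothetical counts for one attraction; NOT values read from Figure 10.
example_counts = {"scene": 130, "human": 40, "object": 30}
example_shares = content_proportions(example_counts)
example_dominant = dominant_category(example_counts)
print(example_shares, example_dominant)
```

Sorting attractions by `sum(counts.values())` would give the popularity order used for the horizontal axis of such a chart.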
We select the five tourist attractions with the most photos and analyze their results of representative image selection. We compare the result of our representative image selection framework with that of random selection (without the noise image filtering process) in Figure 12. The random selection result shows that the image set contains biased or non-representative photo perspectives (2-a and 5-c in Figure 12, for instance), local parts of the tourist attraction (1-b and 3-b in Figure 12) and even some noisy images (1-a and 5-b in Figure 12). In addition, even after noise image filtering, some unrelated images still exist, which increases the difficulty of ranking and selection for the deep ranking model. Our framework can still select images taken from the most common and representative angles of view, showing the overall look of a tourist attraction. Although some representative images show visual diversity, they reflect the diverse visual preferences of different users to some extent. For instance, unlike the other representative images, 1-f and 3-f in Figure 12 show one of the palaces in the Forbidden City and the Marble Boat in the Summer Palace, respectively.
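To illustrate how a similarity-based ranking can favor common viewpoints over outliers, the following minimal sketch orders images by their mean cosine similarity to all other images of the same attraction. This scoring rule is an assumption made for illustration, not the deep ranking model itself. The embeddings are toy two-dimensional vectors, whereas a trained network would produce high-dimensional ones, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_mean_similarity(embeddings):
    """Order image ids by mean similarity to all other images, highest first."""
    ids = sorted(embeddings)
    scores = {}
    for i in ids:
        others = [cosine(embeddings[i], embeddings[j]) for j in ids if j != i]
        scores[i] = sum(others) / len(others) if others else 0.0
    return sorted(ids, key=lambda i: scores[i], reverse=True)


# Toy embeddings invented for illustration: three images share a similar
# viewpoint, while img_d is an unrelated image that survived filtering.
toy_images = {
    "img_a": [1.0, 0.0],
    "img_b": [0.9, 0.1],
    "img_c": [0.8, 0.2],
    "img_d": [0.0, 1.0],
}
ranking = rank_by_mean_similarity(toy_images)
print(ranking)
```

Under this rule the unrelated image is ranked last, which mirrors the observation that leftover noise makes ranking harder but need not dominate the top results.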

Result of Users' Satisfaction
To better evaluate the overall framework results, we conducted a questionnaire based on a simple tourist map we created, in which Baidu Map serves as the base map and the locations of the extracted tourist attractions and their representative images are shown (Figure 13). Eighty volunteers participated in the survey, including people who live in Beijing, have visited Beijing before, or are potential future tourists to Beijing (note that most tourist attractions in Beijing are famous enough that most people in China are familiar with them to some degree). Using the tourist map, each volunteer rated three items on a Likert scale from 1 (strongly disagree) to 5 (strongly agree): (1) integrity: to what extent do you think the extracted results cover Beijing's famous tourist attractions (Q1); (2) representativeness: to what extent do you think the selected images represent the tourist attractions (Q2); (3) attractiveness: to what extent do you think adding representative images makes you more inclined to visit the tourist attractions (Q3).
The questionnaire statistics are shown in Table 4, where each integer is the number of people who chose that rating. For all three criteria, most volunteers chose "agree", followed by "neutral" or "strongly agree". The high average ratings indicate high user satisfaction, especially for representativeness (about 3.88 out of 5), suggesting that the framework of representative image selection is effective. Based on the above analysis, the overall framework has the potential to be applied in tourism applications and to meet the satisfaction of tourists in real life.
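For reference, an average rating such as those reported above is the count-weighted mean of the Likert scores. The sketch below shows the calculation with a hypothetical distribution of 80 responses, which is not the data of Table 4.

```python
def likert_mean(counts):
    """Mean rating from a mapping {score: number of respondents}."""
    respondents = sum(counts.values())
    if respondents == 0:
        raise ValueError("no responses")
    return sum(score * n for score, n in counts.items()) / respondents


# Hypothetical distribution for 80 respondents; NOT the values of Table 4.
example_counts = {1: 2, 2: 6, 3: 20, 4: 40, 5: 12}
example_mean = likert_mean(example_counts)
print(round(example_mean, 3))
```

Here the weighted sum is 294 over 80 respondents, giving a mean of 3.675.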

Conclusions
In this paper, we propose a framework containing an improved cluster method and multiple neural network models to extract representative images of tourist attractions. Leveraging the Flickr Creative Commons 100 Million Dataset, we choose Beijing as the study area to evaluate our framework. We first filter the dataset with an entropy-based method to remove certain photos uploaded by natives. We then improve a density-based cluster method by adding time and user-number threshold constraints (TU-DJ-Cluster) to extract place-relevant tags, and further merge and extend them according to the spatial relationship between the convex hulls generated by these place-relevant tags and the semantic similarity between tag embeddings obtained from Word2Vec training. Compared with DBSCAN, TU-DJ-Cluster extracts place-relevant tags while simultaneously filtering out unimportant tags unrelated to tourist attractions. In addition, the clustering results of our framework are superior to those of P-DBSCAN in both the number of clusters and the accuracy of cluster boundaries. After that, we select representative images for each tourist attraction by first filtering noise images with pre-trained MLP and SSD models and then ranking the remaining images with the deep ranking model. The comparative analysis further demonstrates the effectiveness of this framework in filtering irrelevant images and selecting representative images. A questionnaire is also conducted to evaluate users' satisfaction with the overall results. The high rating scores indicate that our framework is effective in extracting tourist attractions and can meet real-life tourists' requirements.
Though the results are satisfactory, the framework can still be improved. For instance, even though noise images are filtered before ranking, some thematically unrelated images remain. This influences the ranking results because of the diverse visual preferences of different users. In addition, the deep ranking model used in this paper calculates similarity from embeddings of whole images, whereas convolutional models based on point detectors and descriptors may provide a more accurate selection, given the difficulty of selecting mostly outdoor scene images from noisy geotagged images. In future work, we will attempt to extract places of interest directly from photos or videos with unsupervised or semi-supervised deep learning methods. Furthermore, we will try to analyze the visual content of images taken by tourists to infer their preferences and apply them in further applications, such as recommendation systems.