Minimax Bridgeness-Based Clustering for Hyperspectral Data

: Hyperspectral (HS) imaging has been used extensively in remote sensing applications like agriculture, forestry, geology and marine science. HS pixel classiﬁcation is an important task to help identify different classes of materials within a scene, such as different types of crops on a farm. However, this task is signiﬁcantly hindered by the fact that HS pixels typically form high-dimensional clusters of arbitrary sizes and shapes in the feature space spanned by all spectral channels. This is even more of a challenge when ground truth data is difﬁcult to obtain and when there is no reliable prior information about these clusters (e.g., number, typical shape, intrinsic dimensionality). In this letter, we present a new graph-based clustering approach for hyperspectral data mining that does not require ground truth data nor parameter tuning. It is based on the minimax distance, a measure of similarity between vertices on a graph. Using the silhouette index, we demonstrate that the minimax distance is more suitable to identify clusters in raw hyperspectral data than two other graph-based similarity measures: mutual proximity and shared nearest neighbours. We then introduce the minimax bridgeness-based clustering approach, and we demonstrate that it can discover clusters of interest in hyperspectral data better than comparable approaches.


Introduction
Hyperspectral (HS) imaging combines spectroscopy and imaging to capture the reflectance (or radiance) of surfaces within a scene. It is used in remote sensing applications to determine nutrient deficiency in crops [1], for vegetation mapping [2] or to model phytoplankton dynamics in the ocean [3]. The data produced by HS sensors is, however, very large (spatially) and HS pixels typically form high-dimensional clusters of arbitrary sizes and shapes in the feature space spanned by the spectral channels, which significantly hinders HS data mining. In remote sensing applications, ground truth data is often used for validation and information recovery in what is referred to as the supervised framework. However, obtaining such data is costly and time-demanding. "Ground truthing" involves surveying and processing campaigns with trained personnel, high-end equipment and laboratory tests. Furthermore, preparing ground truth data can be error-prone [4]. Semi-supervised or unsupervised methods are then highly desirable as they require little to no training data. In this paper, we focus on unsupervised classification, also known as clustering. Our objective is that each extracted cluster of pixels corresponds to a meaningful class of material within the scene. Furthermore, we also aim for the method to be easy to use and interpret, with no need for user input.
One of the main challenges in HS data clustering is that traditional distance metrics, such as the Euclidean distance, are meaningless in high-dimensional spaces [5,6]. Dimensionality reduction techniques, such as Principal/Independent Component Analysis [7][8][9], manifold learning [10] or band selection [11], have been used to address this issue, but they are also subject to the aforementioned problems, i.e., the need for ground truth data and the parameter tuning. In particular, the choice of an appropriate number of dimensions remains an open problem. However, we leave dimensionality reduction aspects outside the scope of this paper, as we focus solely on the clustering technique. In terms of comparing high-dimensional vectors, graph-based measures of similarity such as the number of shared nearest neighbours [12] and mutual proximity [13] have been proven useful to alleviate the curse of dimensionality [14]. The minimax distance [15] is another type of such measure, defined as the longest edge on a path between two distinct nodes of a graph, minimised over the range of all possible paths between the nodes. It is also the longest edge on the path between two nodes of a minimum spanning tree. The minimax distance is very powerful to separate clusters of arbitrary shape, size and dimensionality, especially if the data is relatively noise-free [16,17]. Furthermore, unlike the shared nearest neighbours, the minimax distance is completely parameter-free. In this paper, we propose a clustering method that harnesses the advantages of the minimax distance for HS data mining. The method creates a minimum spanning tree and determines which edges are most likely bridges between clusters based on four intuitive and parameter-free heuristics that we introduce. It then works as a sequential graph partitioning approach using the silhouette index as an objective function. In the next sections, we review related work, present the minimax distance and demonstrate its suitability for the task. We then introduce the proposed clustering approach, present results and conclude.

Related Work
Unsupervised classification of HS data has been an active research topic for several decades with many existing methods based on template-matching, spectral decomposition, density analysis or hierarchical representation [18]. Recent trends have also seen the emergence of deep models [19,20] that can model the distribution of information in high-dimensional data. However, these methods currently do not generalise nor scale well and require a tremendous amount of training data. The main drawback with template-matching methods (e.g., centroid-based, mixture of Gaussians) is that the shape of the clusters is presumed known a priori, i.e., they are parametric. For example, K-means and its variants are primarily meant to detect convex-shaped clusters, which occur only rarely in HS data. In spectral clustering [21,22], the eigenvectors of a Laplacian matrix representing the data are used to project the clusters in a subspace where they are more separable. Performance depends on how the Laplacian is defined and how clustering is eventually performed after projecting the data.
Clustering based on density analysis has received a lot of attention in the geoscience and remote sensing communities. Their rationale is that pixels are sampled from a multivariate probability density function (PDF) of unknown shape and parameters, which can be estimated from the data. Most methods are based on a search for the local maxima of the PDF, also referred to as modes. Most existing mode-seeking approaches, such as Mean Shift [23,24] or Fast Density Peak Clustering [25], assume that each cluster contains a single dominant mode, although it may not always be true (consider for instance the case of a ring-shaped cluster). This motivates methods, such as those based on space partitioning [26], or support vector machines [27], that seek local minima of the density functions as they represent the boundaries between clusters. These methods are independent of cluster shape. DBSCAN (Density-based spatial clustering of applications with noise) [28] is another adaptive approach in which "core points" are selected to better separate clusters and facilitate their extraction. Unfortunately, the performance of DBSCAN and most density-based clustering methods depend heavily on parameter tuning, which generally comes down to finding the right amount of smoothing for density estimation. This problem also applies also to k nearest neighbours (kNN)-based methods [29][30][31] as k is generally not an intuitive parameter for end-users. Automatic tuning with the elbow method [32,33] gives no theoretical guarantee of finding the optimal parameter. Adaptive density estimation (e.g., based on diffusion [34]) requires a lot of data to find significant patterns in high dimensions, and there is currently no consensus on whether they can outperform global methods on high-dimensional data [35].
Parameter-laden algorithms present an important pitfall as incorrect settings may prevent the retrieval of the true patterns and instead lead the algorithm to greatly overestimate the significance of less relevant ones [36]. This makes parameter setting a burden to the end-user. Existing parameter-free approaches are based on finding either a natural cutoff or peaks in some distribution [37,38] or rely on maximising a quality criterion [39] or combining strategies [40]. The recently proposed FINCH [41] is based on a simple and intuitive nearest-neighbour rule and hierarchically combines clusters based on the position of their respective means.
To address the problem of similarity in high dimensions, an effective approach consists of comparing points in terms of their neighbourhoods rather than only their coordinates. The number of shared nearest neighbours [14,42] is based on this principle, but it requires setting k, the neighbourhood size. A parameter-free alternative known as mutual proximity [13], has also shown promise for clustering but, as we will demonstrate, it leads to sub-optimal performance, and it comes at a high computational cost. The minimax distance is another type of graph-based similarity measure that shows promise for data classification [15]. In this paper, our main contributions are an evaluation of graph-based similarity measures based on the silhouette coefficient, as well as a new parameter-free clustering algorithm named MBC (Minimax Bridgeness-based Clustering).

Definition
Consider a connected, undirected and edge-weighted graph G(V, E) where V = {v i |i = 1..N} is a set of N vertices and E = {e i |i = 1..M} is a set of M edges, with N < M. Note P v i , v j the set of all loopless paths between vertices i and j in G. The largest of all edge weights along a given path p v i , v j is denoted w max (p). Then, the path that satisfies: is the minimax path between vertices i and j. The edge in which the weight is w max p mnx v i , v j will be referred to as the minimax edge, its weight is the minimax distance between vertices i and j.
The minimax distance matrix, similarly to the mutual proximity and shared nearest neighbours distance matrices, is computed based on E, the weights of the graph. These weights typically represent the Euclidean distance. They can be obtained from the data in O(n 2 ), with n the number of data points, but parallel computing can be used to increase efficiency [43,44]. The minimax distance matrix is then computed based a minimum spanning tree (MST) on G, which constitutes a subset of E. An MST is defined as a set of edges that connects all the data points, without any cycles and with the smallest total edge weight. Each edge of an MST is a minimax edge [15], and the distance matrix can then be obtained in linear time from the MST [45], which can also be constructed in linear time [46].

Minimax Silhouette
An ideal measure of similarity should be so that it returns a small value if the data points belong to the same class and a large value otherwise. This property is captured by the silhouette index, which measures how similar a data point x is to its own cluster compared to other clusters: where s(x) is the silhouette of x, a(x) is the average similarity between x and all other points in the same cluster and b(x) is the smallest average similarity x and all other points in different clusters. A small average silhouette indicates that clusters are poorly separated, and negative values mean that they overlap significantly. While the silhouette index is typically used to assess a clustering result, we use it here to compare similarity measures. We calculated the average silhouette of the ground truth of each of the data-sets described in Section 5.2, with the squared Euclidean norm, the spectral angle mapper and their respective minimax, mutual proximity and shared nearest neighbours versions.
In Table 1, we report the results obtained on the whole data-set (limited to 10,000 points-see Section 5.2) or only on a core set Γ 50 consisting of the 50% data points of highest density (see next paragraph). We then compared the regular measures, i.e., Squared Euclidean norm (SE) and Spectral Angle Mapper (SAM), to their corresponding shared nearest neighbour (SNN), mutual proximity and minimax distance counterparts. Note that SNN requires to tune the parameter k, i.e., the size of the set of NN within which the shared neighbours are searched for each pair of pixels. The results we report are the best from all values of k between 5 and N, with a step size of 5. These results indicate that even the largest silhouette (0.260 with Minimax SE on Γ 50 ) is relatively low, confirming that HS clusters are generally not well separable in the feature space spanned by all spectral bands. Nevertheless, we can observe the clear inferiority of the regular measures, with a full-set silhouette of −0.428 at most. The minimax distance and mutual proximity surpass the shared nearest neighbours overall, except on the Massey data. Further, note the improvement obtained by discarding the 50% least dense (i.e., noisiest) pixels, particularly for the minimax distance, which gives the best results overall. This suggests that the minimax distance is better suited to extract classes of interest from HS data, especially on core sets. Using the core set Γ 50 allows to tackle two drawbacks of the minimax distance: sensitivity to noise and computational complexity. Core sets have previously been used for similar purposes [28,40,47]. To select a representative core set in a computationally efficient and parameter-free manner, we estimate the underlying probability density function of the data and discard the 50% least dense points. We compared several parameter-free and scalable density estimators in terms of their ability to produce core sets with compact and well-separated classes at several threshold values [48]. It is particularly noteworthy that diffusion-based approaches [34] scale poorly to high-dimensional data and suffer from the Hughes phenomenon: they require a tremendous amount of sampling points to correctly estimate the multivariate density. Instead, we found that convolving the data with an isotropic Gaussian kernel with a bandwidth equal to the average distance to a point's nearest neighbour allowed for a good balance between low computational footprint and usefulness in identifying core-sets. We used this approach to estimate density in the remainder of our experiments.

Minimax Distance-Based kNN Clustering
In order to further demonstrate the usefulness of the minimax distance, we compared the performance of two state-of-the-art kNN-based algorithms: KNNCLUST [49] and GWENN [50]. Both rely on a measure of distance to perform clustering. We evaluated whether using the minimax distance can improve performance in terms of overall accuracy (OA) and cluster purity [51] on the five data-sets presented in Section 5.2. Figure 1 shows results obtained on one of these data-sets. Note that, for a given data point x in the feature space, all its neighbours that are at the same minimax distance to x are sorted based on regular distance. Results indicate that using the minimax distance can improve the performance of kNN-based clustering on HS data. On all five data-sets, the peak overall accuracy was improved by at least 3% and up to 8% (on the Kennedy Space Centre scene), and cluster purity was improved for most values of k. We also found that, as k increases, the number of clusters decreases. This should be considered when evaluating cluster purity, which is known to increase with the number of clusters. Purity is a more meaningful measure of clustering quality when this number is low, which is the case in our experiments.

Minimax Bridgeness-Based Clustering
As we demonstrated, the minimax distance is well suited for class separation in HS data. However, a large minimax distance informs only of the existence of a gap on a path between two points, but not so much of its significance. The latter can be established when the gap's existence is confirmed by multiple pairs of end-nodes. The number of these pairs gives what we refer to as the minimax bridgeness: β(E), where E is an edge in G. It can also be defined as the number of paths on which E is a minimax edge. A high bridgeness indicates a consensus between data points that the node crosses a border between clusters. Figure 2 illustrates the concept of minimax bridgeness with a simple example.
The proposed clustering algorithm, hereby referred to as Minimax Bridgeness-based Clustering (MBC), works as a sequential graph partitioning with four main steps (see Figure 3):

•
Step 2: Discard edges that are unlikely to separate clusters.

•
Step 3: Rank the remaining edges based on minimax bridgeness.

•
Step 4: Remove next highest ranking edge that does not significantly decrease the minimax silhouette. Repeat until all edges have been assessed.  For Step 1, there are numerous algorithms to extract the MST of the data efficiently, but we consider these aspects outside the scope of this paper. Note that the MST is unique if all pairwise distances between pixels are different.
In Step 2, to find edges that are unlikely to be inter-cluster bridges (ICBs), we first identify four important properties of ICBs:

1.
They are longer than most edges. 2.
They have a higher minimax bridgeness than most edges. 3.
They have a lower density point at their centre than most edges.

4.
Neither of the vertices they connect is the other's first neighbour, nor do they have the same nearest neighbour (see [41]).
With regards to the first three properties listed above, we found that the distributions of edge length, minimax bridgeness and central point density typically have a single dominant mode each. Specifically, we use diffusion-based density estimation [34] to determine the peak locations and estimate these modes. We observed that, for at least one of these three attributes, ICBs always have a value significantly larger than the mode. We then established that edges that do not satisfy this property are unlikely to be ICBs. Note that we tested various density estimation methods, keeping in mind that we need a parameter-free and computationally efficient method to increase clustering performance, first by allowing for better identification of ICBs, and then by creating a representative subset Γ of the data. The latter can be used to create a pseudo-ground truth of the data and apply a more computationally-efficient clustering on the remaining data points. As previously mentioned, we found that an isotropic Gaussian kernel with a bandwidth equal to the average distance to any point's nearest neighbour gave the best results overall.
Finally, in Steps 3 and 4, we use the minimax silhouette as guiding criterion. In an approach similar to that employed in GWENN [50], candidate edges are ranked in order of descending β and removed one by one in a sequential manner. The edge with the largest β is systematically removed. The minimax silhouette then is calculated after each edge removal. If the edge removed last caused the silhouette to decrease by more than half its value at the previous iteration, or to become negative, the edge is put back in the graph. This ad hoc rule is particularly efficient when the data contains well-separated clusters. As previously demonstrated, the minimax silhouette captures cluster separation better than other measures of distances on HS data. Our experiments showed that removing an edge that does not separate clusters tend to decrease the minimax silhouette by more than half its current value.

Alternative Clustering Methods
To validate the proposed method, we compared it to five state-of-the-art clustering methods, which are summarised in Table 2. Note that the latter two are parameter-free. Fuzzy C-means requires the number of clusters K and a (typically parametrised) de-fuzzification parameter m. For FDPC and GWENN, we manually tuned their respective parameter to obtain the right number of clusters. We also implemented a variant of MBC, which allows specifying the number of clusters K. We named it K-MBC. In this case, the iterative approach of step 4 stops when K − 1 edges have been removed or when all edges not discarded in step 2 have been examined, whichever comes first. Finally, we evaluated MBC on Γ 50 for each data-set to determine how it performs on a smaller and cleaner set of points.

Data
We used five HS images: Pavia University, Kennedy Space Centre, Salinas, Botswana and Massey University.
The KSC image was acquired over the Kennedy Space Centre, Florida (Lat-Long coordinates of scene centre: 28 • 37 50 N, 80 • 46 45 W), by the airborne AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) NASA instrument. It contains 224 spectral bands (400 to 2500 nm), but only 176 of them were kept after water absorption bands removal. The ground resolution is 18 m and the ground truth is made of 13 classes (Scrub, Willow swamp, CP hammock, CP/Oak, Slash pine, Oak/Broadleaf, Hardwood swamp, Graminoid marsh, Spartina marsh, Cattail marsh, Salt marsh, Mud flats, Water).
The Botswana HSI was acquired over the Okavango Delta, Botswana (Lat-Long coordinates of scene centre: 19 • 33 33 S, 23 • 07 37 E), by the Hyperion sensor onboard the EO-1 satellite (within the same spectral range than AVIRIS), with a ground resolution of 30 m. It has 1476 × 256 pixels and 145 bands (from 242 original bands) after water absorption bands removal. The ground truth map includes 14 classes (Water, Hippo grass, Floodplain Grasses 1, Floodplain Grasses 2, Reeds, Riparian, Firescar 2, Island interior, Acacia woodlands, Acacia shrublands, Acacia grasslands, Short mopane, Mixed mopane, Exposed soils).
The Massey University scene [53] was captured in Palmerston North, New Zealand (Lat-Long coordinates of scene centre: 40 • 23 17 S, 175 • 37 07 E), with an airborne AisaFENIX hyperspectral sensor covering visible to short-wave infrared (380 to 2500 nm). It has 339 bands (after removal of water absorption and noisy bands) and 9564 pixels with corresponding ground truth. The latter contains a total of 23 different land-cover classes, which includes nine different types of roof tops, five vegetation types, water, soil, and two different shadows.
The characteristics of these five data sets are summarised in Table 3 and their colour composite are shown in Figure 4. Table 3. Description of data-sets used: number of labelled pixels N, dimensionality D and number of classes.  The data was pre-processed to remove noisy spectral bands, burnt pixels and inconsistent ground truth data (see [4]). Furthermore, the number of pixels was capped to 10,000 per scene, mostly due to the memory complexity incurred by the computation of the similarity matrix. For each data-set with more than 10,000 pixels, we performed 20 random selections (Note that the same proportion of each class was kept after subsampling.) of 10,000 points and computed the average results.

Criteria
The criteria used to validate the proposed approach and compare it to other state-of-the-art methods are as follows: • Overall Accuracy (OA) (see Table 4) is the average number of pixels that are correctly classified. It is calculated as the sum of diagonal elements of the confusion matrix divided by the total number of pixels. Here, the confusion matrix is obtained owing to the Hungarian (Munkres) algorithm [54].

•
Purity [51] (see Table 5) measures the tendency of clusters to contain a single class.

•
Normalised Mutual Information (NMI) (see Table 6) measures the probabilistic resemblance between cluster and class labels.

•
The average difference between the number of classes and the number of clusters (see Table 7).
In terms of computational cost, we found that it takes about 2 min to cluster 10,000 points with MBC with a Matlab implementation (hardware: x64-based Intel Core i7-8750H CPU @ 2.20GHz, 32GB of RAM), mostly spent on the creation of graph G from the data.

Results
From these results, we make the following observations: • K-MBC significantly outperforms FCM, FDPC and GWENN on all data-sets except Botswana in terms of OA and NMI. It also yields the best pixel purity on all data-sets except Massey University, where it is surpassed only by GWENN by 0.08.

•
The clustering maps in Figure 5 confirm that our approach performs particularly well at creating clusters of high purity.

•
Although it is expected that MBC would give the best overall results when applied to the core set Γ 50 , or when the number of clusters is known, this does not appear as obvious from our results. However, we note that MBC tends to find too few clusters in the data (especially on the Botswana data-set, where it misses six classes of pixels). • FINCH and LPC generally perform poorly overall and especially in terms of number of clusters. They each tend to detect too many clusters on all data-sets. Interestingly, they also yield poor pixel purity values. Usually, high purity is expected when the number of clusters is high. The fact that we observe the contrary indicates that these two methods are really not well suited to deal with raw hyperspectral data.

•
On the core-set Γ 50 , MBC performs very well and even finds the right number of clusters in the Pavia University and Salinas scenes. It over-estimates this number by one on the Massey University scene and under-estimate by two on the KSC scene.

•
The Botswana scene seems to the most challenging for the proposed methods. The only case where the MBC clustering surpasses the benchmark with this scene is in terms of pixel purity. We noted that this particular scene contains classes of pixels that are among the least pure in the benchmark, with strong variations in reflectance spectra within classes. We hypothesise that this is the main reason for our method under-performing on this scene. On the other hand, it is well known that clustering is generally an ill-posed problem and that different applications may require different types of clustering approaches. Clearly, in this case, FDPC performs better, but it should be noted that it was tuned manually, unlike MBC. Overall, these results indicate that MBC can handle high-dimensional data well and recognise meaningful classes of materials and surfaces in a scene. While it comes with a certain computational cost, it performs better than existing clustering methods, even without parameter tuning. These results also validate our hypothesis that the minimax distance is well suited for hyperspectral data exploration.

Conclusions
We introduced MBC, a parameter-free clustering algorithm based on a new graph-based measure of similarity, which we named minimax bridgeness. We demonstrated its ability to automatically discover clusters in high-dimensional remote sensing data without user input, as well as its superiority over other graph-based similarity measures such as the number of shared nearest neighbours. The proposed method has two drawbacks: its sensitivity to noise and its high computational cost. To address these, we used a simple approach based on density estimation to select a subset of relevant data points, a so-called core set and applied MBC on it, which resulted in stronger performance across the board. Future work should focus on the selection of core sets for a more efficient exploitation of the minimax distance and minimax bridgeness. Also, the use of unsupervised dimensionality reduction techniques based, for instance, on band selection is expected to improve performance by reducing the curse of dimensionality. Lastly, future work should also focus on developing efficient methods to produce graph-based similarity matrices that scale to large data-sets.
Author Contributions: S.L.M. conceived of the paper, designed the experiments, generated the dataset, wrote the source code, performed the experiments, and wrote the paper. C.C. provided detailed advice during the writing process and revised the manuscript. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Acknowledgments:
The authors wish to thank Reddy Pullanagari and Gabor Kereszturi for providing the Massey University data-set.

Conflicts of Interest:
The authors declare no conflict of interest.