Spectral Imagery Tensor Decomposition for Semantic Segmentation of Remote Sensing Data through Fully Convolutional Networks

Abstract: This work addresses two issues simultaneously: data compression at the input space and semantic segmentation. Semantic segmentation of remotely sensed multi- or hyperspectral images through deep learning (DL) artificial neural networks (ANNs) delivers as output the corresponding matrix of pixels classified element-wise, achieving competitive performance metrics. With technological progress, current remote sensing (RS) sensors have more spectral bands and higher spatial resolution than before, which means a greater number of pixels in the same area. Nevertheless, the more spectral bands and pixels, the higher the computational complexity and the longer the processing time. Therefore, without dimensionality reduction, the classification task is challenging, particularly when large areas have to be processed. To solve this problem, our approach maps an RS image, a third-order tensor, into a core tensor representative of the input image, with the same spatial domain but a lower number of new tensor bands, using a Tucker decomposition (TKD). A new input space with reduced dimensionality is thus built. To find the core tensor, the higher-order orthogonal iteration (HOOI) algorithm is used. A fully convolutional network (FCN) is employed afterwards to classify each core tensor at the pixel level. The whole framework, called HOOI-FCN here, achieves high performance metrics, competitive with state-of-the-art RS multispectral image (MSI) semantic segmentation methods, while significantly reducing computational complexity and, thereby, processing time.
We used a Sentinel-2 image dataset from Central Europe as a case study, for which our framework outperformed other methods (including the FCN itself, which reached an average pixel accuracy (PA) of 90% with nine spectral bands and a computational time of ∼90 s): it achieved a higher average PA of 91.97% (computational time ∼36.5 s) and an average PA of 91.56% (computational time ∼9.5 s) for seven and five new tensor bands, respectively.


Introduction
Remote sensing (RS) images are of great use in many Earth observation applications, such as agriculture, forest monitoring, disaster prevention, security affairs, and others [1]. The recent and upcoming availability of multispectral and hyperspectral satellites eases specific tasks, such as detection, classification, and semantic segmentation. In semantic segmentation, also called pixel-wise classification, each pixel in an RS image is assigned to one class [1]. This classification becomes easier when higher-dimensional spectral information is acquired [1]. Spectral systems split, by physical filters,

Related Work
In recent years, spectral data for Earth surface classification has been a very active research area. Methods proposed by Kemker et al. [11,20], Hamida et al. [21], and López et al. [18] use CNNs for RS-MSI pixel-wise classification. Nevertheless, processing raw spectral data with deep learning (DL) algorithms is computationally very expensive. Wang et al. [22] introduced a salient band selection method for HSIs based on manifold ranking, and Li et al. [23] proposed a band selection method based on spectral shape similarity analysis of RS-HSIs to reduce computational complexity. However, some surface materials are distinguishable from each other only in specific bands, so cutting off spectral bands negatively affects further classification tasks.
More recently, tensor approaches for spectral image compression have been introduced; see Zhang et al. [24]. Many authors adopted dimensionality reduction algorithms, such as PCA [16] and singular value decomposition (SVD), for spectral image compression. Other authors have made efforts to reduce the computational cost of CNNs for image classification by using TD algorithms [25,26]. Astrid et al. [25] proposed a CNN compression method based on CPD and the tensor power method, achieving significant reductions in memory and computational cost. Chien et al. [26] presented a tensor-factorized ANN, which integrates TD and ANNs for multi-way feature extraction and classification. Nevertheless, although the idea is to compress data in order to reduce computational cost and processing time, these works compress or decompose the hyper-parameters within the network, which makes the training of the semantic segmentation or classification network slower due to the change of the weights in the tensor decomposition.
Recently, three works close to our research [27-29] were published. An et al. [27] proposed an unsupervised tensor-based multiscale low-rank decomposition (T-MLRD) method for hyperspectral image dimensionality reduction, and Li et al. [28] proposed a low-complexity compression approach for multispectral images based on convolutional neural networks (CNNs) with nonnegative Tucker decomposition (NTD). Nevertheless, these methods reduce the tensor in every dimension, which is self-defeating for a segmentation CNN. Besides, the non-negative decomposed tensor proposed in [28] causes slower convergence in DL algorithms. An et al. [29] proposed a tensor discriminant analysis (TDA) model via compact feature representation, wherein traditional linear discriminant analysis is extended to tensor space to make the resulting feature representation more discriminant. However, this approach still degrades the spatial resolution, which disturbs CNN performance. See Table 1 for a summary of the related works. Table 1. Related work in spectral imagery semantic segmentation.

Contribution
The contribution of this work is summarized in three main points: 1. RS-MSIs or -HSIs, i.e., third-order tensors, are compressed in the spectral domain through a TKD preprocessing step, preserving the pixel spatial structure and obtaining a core tensor representative of the original. These core tensors, with fewer new tensor bands, which belong to subspaces of the original space, build the new input space for any supervised pixel-level classifier, which delivers the corresponding prediction matrix of pixels classified element-wise. This approach achieves high or competitive performance metrics with less computational complexity and, consequently, lower computational time.

2.
This approach outperforms other methods, such as normalized difference indices and PCA, and in particular the same FCN with the original data. Each core tensor is calculated using the HOOI algorithm, which achieves a high orthogonality degree for the core tensor (all-orthogonality) and for its factor matrices (column-wise orthogonality); besides, it converges faster than alternatives such as TUCKALS3 [17].

3.
The efficiency of this approach can be measured by one or more performance metrics, e.g., pixel accuracy (PA), as a function of the number of new tensor bands, the orthogonality degree of the factor matrices and the core tensor, the reconstruction error of the original tensor, and the execution time.
These results are shown in Section 6: Experimental Results.
The remainder of this work is organized as follows. Section 2 introduces tensor algebra notation and basic concepts to familiarize the reader with the symbology used in this paper. Section 3 presents the problem statement of this work and the mathematical definition. In Section 4, CNN theory is described for classification and semantic segmentation. Section 5 presents the framework proposed for compression and semantic segmentation of spectral images. Experimental results are presented in Section 6. Finally, Sections 7 and 8 present a discussion and conclusions based on the results obtained in the experiments.

Tensor Algebra Basic Concepts
For this work we used conventional tensor algebra notation [15]. Hence, scalars, or zero-order tensors, are represented by italic lowercase letters, e.g., a. Vectors, or first-order tensors, are denoted by boldface lowercase letters, e.g., a. Matrices, or second-order tensors, are denoted by boldface capital letters, e.g., A, and tensors of order three or higher by boldface Euler script letters, e.g., 𝓐. In an N-order tensor 𝓐 ∈ ℝ^{I_1×···×I_N}, where ℝ represents the set of real numbers, I_n indicates the size of the tensor in mode n = 1, . . . , N. An element of 𝓐 is denoted with indices in lowercase letters, e.g., a_{i_1...i_N}, where i_n indexes the n-th mode of 𝓐 [17]. A fiber is a vector obtained by fixing every index of a tensor but one; for a third-order tensor it is denoted by a_{:i_2 i_3}, a_{i_1 : i_3}, and a_{i_1 i_2 :} for column, row, and tube fibers, respectively. A slice is a matrix obtained by fixing every index of a tensor but two; for a third-order tensor it is denoted by A_{i_1 ::}, A_{: i_2 :}, and A_{:: i_3}, or more compactly, A_{i_1}, A_{i_2}, and A_{i_3}, for horizontal, lateral, and frontal slices, respectively. Finally, A^(n) denotes a matrix from a sequence of matrices [17].
It is also necessary to introduce some tensor algebra operations and basic concepts used in later explanations. These definitions are taken from [17].

Matricization
The mode-n matricization is the process of reordering the elements of a tensor into a matrix along axis n; it is denoted as A_(n) ∈ ℝ^{I_n × ∏_{m≠n} I_m}.
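As an illustration, the mode-n matricization can be sketched in a few lines of NumPy. Note that the column ordering of the remaining modes differs across references; any fixed ordering preserves ranks and norms, which is all that matters for the decompositions used later.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n matricization A_(n): bring axis `mode` to the front and
    flatten the remaining axes into one, giving an I_n x (prod of other I_m) matrix."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

A = np.arange(24).reshape(2, 3, 4)
A0 = unfold(A, 0)   # shape (2, 12)
A1 = unfold(A, 1)   # shape (3, 8)
A2 = unfold(A, 2)   # shape (4, 6)
```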

Outer Product
The outer product of N vectors, 𝓧 = a^(1) ∘ ··· ∘ a^(N), produces a tensor 𝓧 ∈ ℝ^{I_1×···×I_N}, where ∘ denotes the outer product and a^(n) denotes a vector in a sequence of N vectors; each element of the tensor is the product of the corresponding vector elements, i.e., x_{i_1...i_N} = a^(1)_{i_1} a^(2)_{i_2} ··· a^(N)_{i_N}.

Inner Product
The inner product of two tensors 𝓐, 𝓑 ∈ ℝ^{I_1×···×I_N} is the sum of the products of their entries, i.e., ⟨𝓐, 𝓑⟩ = Σ_{i_1=1}^{I_1} ··· Σ_{i_N=1}^{I_N} a_{i_1...i_N} b_{i_1...i_N}.
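Both products above can be checked numerically; here np.einsum builds the rank-one outer product of three vectors, and the inner product reduces to an element-wise multiply-and-sum:

```python
import numpy as np

# Outer product of N = 3 vectors -> a rank-one third-order tensor
a = np.array([1.0, 2.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([2.0, 3.0])
X = np.einsum('i,j,k->ijk', a, b, c)   # x_{ijk} = a_i * b_j * c_k

# Inner product of two same-sized tensors: sum of entrywise products
B = np.ones_like(X)
inner = np.sum(X * B)                  # <X, B>
```

As a sanity check, ⟨𝓧, 𝓧⟩ equals the squared Frobenius norm of 𝓧.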

N-Mode Product
The n-mode product is the multiplication of a tensor 𝓐 ∈ ℝ^{I_1×···×I_N} by a matrix U ∈ ℝ^{J×I_n} (or a vector u ∈ ℝ^{I_n}) in mode n, i.e., along axis n. It is represented by 𝓐 ×_n U, with elements (𝓐 ×_n U)_{i_1...i_{n−1} j i_{n+1}...i_N} = Σ_{i_n=1}^{I_n} a_{i_1...i_N} u_{j i_n}.
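A minimal NumPy sketch of the n-mode product, using np.tensordot to contract the selected axis and np.moveaxis to put the new dimension back in place:

```python
import numpy as np

def mode_n_product(tensor, matrix, mode):
    """n-mode product A x_n U: contract axis `mode` of the tensor with
    the rows of U in R^{J x I_n}; the result has size J along that axis."""
    return np.moveaxis(np.tensordot(matrix, tensor, axes=(1, mode)), 0, mode)

A = np.random.rand(4, 5, 6)
U = np.random.rand(3, 5)          # J x I_2 with J = 3
B = mode_n_product(A, U, 1)       # shape (4, 3, 6)
```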

Rank-One Tensor
A tensor 𝓧 ∈ ℝ^{I_1×···×I_N} is rank one if it can be written as the outer product of N vectors, i.e., 𝓧 = a^(1) ∘ ··· ∘ a^(N).

Rank-R Tensor
The rank of a tensor, rank(𝓧), is the smallest number of components in a CPD, i.e., the smallest number of rank-one tensors that generate 𝓧 as their sum [17].

N-Rank
The n-rank of a tensor 𝓧 ∈ ℝ^{I_1×···×I_N}, denoted rank_n(𝓧), is the column rank of X_(n), i.e., the dimension of the vector space spanned by the mode-n fibers. Hence, if R_n ≡ rank_n(𝓧) for n = 1, . . . , N, we say that 𝓧 is a rank-(R_1, . . . , R_N) tensor. All the tensor algebra notation presented up to this point is summarized in Table 2 for easier reference.
Table 2. Tensor algebra notation summary.

𝓐, A, a, a — Tensor, matrix, vector, and scalar, respectively
𝓐 ∈ ℝ^{I_1×···×I_N} — N-order tensor of size I_1 × ··· × I_N
a_{i_1...i_N} — An element of a tensor
a_{:i_2 i_3}, a_{i_1 : i_3}, a_{i_1 i_2 :} — Column, row, and tube fibers of a third-order tensor
A_{i_1 ::}, A_{: i_2 :}, A_{:: i_3} — Horizontal, lateral, and frontal slices of a third-order tensor
A^(n), a^(n) — A matrix/vector from a sequence of matrices/vectors
A_(n) ∈ ℝ^{I_n × ∏_{m≠n} I_m} — Mode-n matricization of a tensor
𝓐 ×_n U — n-mode product of tensor 𝓐 ∈ ℝ^{I_1×···×I_N} by matrix U ∈ ℝ^{J×I_n} along axis n
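The n-rank defined above can be computed directly as the matrix rank of the mode-n matricization; for instance, a rank-one tensor has n-rank 1 in every mode:

```python
import numpy as np

def n_rank(tensor, mode):
    """rank_n(X): column rank of the mode-n matricization X_(n)."""
    X_n = np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)
    return np.linalg.matrix_rank(X_n)

# A rank-one tensor (outer product of three vectors) has n-rank 1 in every mode
X = np.einsum('i,j,k->ijk', np.ones(4), np.arange(1.0, 4.0), np.ones(2))
ranks = [n_rank(X, n) for n in range(3)]   # [1, 1, 1]
```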

Tucker Decomposition (TKD)
The TKD can be seen as a form of higher-order PCA [17]. This method decomposes a tensor 𝓧 ∈ ℝ^{I_1×···×I_N} into a core tensor 𝓖 ∈ ℝ^{J_1×···×J_N} multiplied by a matrix along each mode n = 1, . . . , N:

𝓧 ≈ 𝓖 ×_1 U^(1) ×_2 U^(2) ··· ×_N U^(N),    (1)

where the core tensor preserves the level of interaction for each factor or projection matrix U^(n) ∈ ℝ^{I_n×J_n}. These matrices are usually, but not necessarily, orthogonal, and can be thought of as the principal components in each mode [17] (see Figure 1). J_n represents the number of components in the decomposition, i.e., the rank-(R_1, . . . , R_N), with rank_n(𝓧) = R_n for every mode n; the truncated decomposition generally does not exactly reproduce 𝓧. Starting from (1), the reconstruction of an approximated tensor is given by

𝓧̂ = 𝓖 ×_1 U^(1) ×_2 U^(2) ··· ×_N U^(N),    (2)

where 𝓧̂ is the reconstructed tensor. Then, we can acquire the core tensor 𝓖 by the multilinear projection

𝓖 = 𝓧 ×_1 U^(1)T ×_2 U^(2)T ··· ×_N U^(N)T,    (3)

where U^(n)T denotes the transpose of U^(n) for n = 1, . . . , N. The reconstruction error ξ can be computed as

ξ = ||𝓧 − 𝓧̂||_F / ||𝓧||_F,    (4)

where || · ||_F represents the Frobenius norm. To effectively compress data, the reconstructed lower-rank tensor 𝓧̂ should be close to the original tensor 𝓧; this can be achieved by an iterative algorithm such as HOOI, described in Section 5.1.
Figure 1. Tucker decomposition for a third-order tensor: a core tensor multiplied by a projection matrix along each mode.
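A minimal sketch of the decomposition, using the non-iterative truncated HOSVD to obtain column-wise orthogonal factors and the core by multilinear projection. The spatial-only truncation mirrors the setting of this paper (full spatial modes, compressed spectral mode); HOOI, described in Section 5.1, refines this starting point. The toy sizes here are illustrative, not the actual image dimensions.

```python
import numpy as np

def unfold(t, mode):
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_n_product(t, M, mode):
    return np.moveaxis(np.tensordot(M, t, axes=(1, mode)), 0, mode)

def truncated_hosvd(X, ranks):
    """Rank-(J1,...,JN) Tucker via truncated HOSVD: factor U^(n) holds the
    J_n leading left singular vectors of X_(n); the core is the multilinear
    projection G = X x_1 U^(1)T ... x_N U^(N)T."""
    factors = []
    for n, Jn in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, n), full_matrices=False)
        factors.append(U[:, :Jn])
    G = X
    for n, Un in enumerate(factors):
        G = mode_n_product(G, Un.T, n)
    return G, factors

# Spectral-only compression: keep both spatial modes full, truncate mode 3
X = np.random.rand(16, 16, 9)
G, Us = truncated_hosvd(X, (16, 16, 5))     # core is 16 x 16 x 5
Xhat = G
for n, Un in enumerate(Us):
    Xhat = mode_n_product(Xhat, Un, n)      # reconstruction from core and factors
```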

Problem Statement and Mathematical Definition
Spectral images are third-order arrays, which provide not only spatial, but also spectral features from RS scenes of interest. These properties aid CNNs to easily find features to characterize the behaviors of different materials over the earth's surface. However, the large amount of spectral data causes huge computational load, and therefore, large processing time using machine learning algorithms.
It is important to preserve the three-dimensional array structure of the RS spectral input image in order to effectively classify each pixel of the image. In RS multi- or hyperspectral images, the spectral bands are highly correlated and contain a lot of redundancy. Therefore, we propose a TKD-based method as a preprocessing step to provide a better-suited input for the CNN-based semantic segmentation. This will also considerably reduce the high number of parameters and, in turn, the processing time during training and testing. Our problem statement for RS spectral images can be described as follows.

Problem Statement
Given a pair (𝓧, Y), where tensor 𝓧 ∈ ℝ^{I_1×I_2×I_3} denotes an MSI or HSI and Y ∈ ℝ^{I_1×I_2} its corresponding ground-truth matrix for a specific number of classes C, find another pair (𝓖, Ŷ), where the tensor 𝓖 ∈ ℝ^{J_1×J_2×J_3}, used for classification, is representative of 𝓧, and Ŷ is its associated matrix of predicted classes, preserving the spatial domain (J_1 = I_1, J_2 = I_2) but with fewer new tensor bands, i.e., J_3 < I_3, achieving higher or competitive performance metrics for pixel-wise classification while reducing the dimensionality and, therefore, decreasing the computational complexity of the classification task.

Mathematical Definition
We can describe the problem stated in the previous subsection mathematically as the following optimization problem:

min_{U^(n) ∈ St^{I_n×J_n}} ||𝓧 − 𝓖 ×_1 U^(1) ×_2 U^(2) ×_3 U^(3)||_F²  subject to  ||𝓧 − 𝓧̂||_F ≤ ψ,    (5)

where ψ denotes an error threshold defined depending on the accuracy or performance metrics required for each application and St^{I_n×J_n} represents the Stiefel manifold [30]. Embedding 𝓖 into the objective function, as De Lathauwer proved in [31] (Theorems 3.1, 4.1, and 4.2), (5) can be written equivalently as

max_{U^(1),U^(2),U^(3)} ||𝓧 ×_1 U^(1)T ×_2 U^(2)T ×_3 U^(3)T||_F²,    (6)

under the same constraints as (5).
The subtensors 𝓖_{i_n} of the core tensor 𝓖 satisfy the all-orthogonality property [32], which establishes that two subtensors 𝓖_{i_n=α} and 𝓖_{i_n=β} are orthogonal for all possible values of n, α, and β subject to α ≠ β:

⟨𝓖_{i_n=α}, 𝓖_{i_n=β}⟩ = 0,  α ≠ β,    (7)

and the ordering property:

||𝓖_{i_n=1}||_F ≥ ||𝓖_{i_n=2}||_F ≥ ··· ≥ ||𝓖_{i_n=J_n}||_F ≥ 0.    (8)

Our optimization problem can be solved by several algorithms. In this work, the HOOI algorithm (described in Section 5.1) was selected, due to its convergence and orthogonality performance. Once a tensor 𝓖 is obtained, a classifier f belonging to the hypothesis space H maps input data 𝓖 into output data Ŷ, that is, Ŷ = f(𝓖), where f is a pixel-wise classifier. In this paper, an FCN for semantic segmentation was used as the classifier, due to the need to classify each pixel of the input image and to its pixel accuracy performance. The FCN used in this work is described in Section 4.

Convolutional Neural Networks (CNNs)
CNNs are supervised feed-forward DL-ANNs for computer vision. Applying a sort of convolution of the synaptic weights of a neural network over the input data preserves spatial features, which alleviates the hard task of classification and, in turn, semantic segmentation. This type of ANN works under the same linear regression model as other machine learning (ML) algorithms. Since images are three-dimensional arrays, we can use tensor algebra notation to describe the input of a CNN as a tensor 𝓐 ∈ ℝ^{I_1×I_2×I_3}, where I_1, I_2, and I_3 represent the height, width, and depth of the third-order array, respectively, i.e., the spatial and spectral domains of an image. We can generally write the linear regression model used for ANNs as

ŷ = σ(Wg + b),    (9)

where ŷ represents the output prediction of the network; σ denotes an activation function; g is the input data; and W and b are the matrix of synaptic weights and the bias vector, respectively. These parameters are adjustable, i.e., their values are modified at every iteration, seeking convergence, to minimize the prediction loss through optimization algorithms [33]. For simplicity, the bias vector can be ignored, assuming that matrix W will update until convergence independently of any other parameter [33]:

ŷ = σ(Wg).    (10)

Considering that the input dataset of a CNN is a multidimensional array, we can represent (9) and (10) using tensor algebra notation as

𝓨̂ = σ(𝓖 ⊛ 𝓦),    (11)

where 𝓨̂ represents the prediction output tensor of the ANN (in our case, a second-order tensor or matrix Ŷ), 𝓖 is the input dataset, and 𝓦 is a K_1 × K_2 × F_1 tensor, called a filter or kernel, holding the adaptable synaptic weights. Unlike in conventional ANNs, in CNNs 𝓦 is a shiftable square tensor much smaller in height and width than the input data, i.e., K_1 = K_2 and K_s << I_s for s = 1, 2; F_1 denotes the number of input channels, i.e., F_1 = I_3.
For hidden layers, instead of the prediction tensor 𝓨̂, the output is a matrix called an activation map, M ∈ ℝ^{I_1×I_2}, which preserves features of the original data in each domain. In fact, it is necessary to use as many kernels 𝓦^(f_2) as activation maps, with different initialization values, to preserve diverse features of the image. Hence, we can also define the activation maps as a tensor 𝓜 ∈ ℝ^{I_1×I_2×F_2}, where F_2 denotes the number of activation maps, one produced by each filter (see Figure 2). Kernels are displaced over the whole input image as in a discrete convolution operation. Then, each element of the output activation map, m_{i_1 i_2 f_2}, is computed as the sum of the Hadamard product of the kernel 𝓦^(f_2) and a subtensor of the input tensor 𝓖 centered at position (i_1, i_2) with the same dimensions as 𝓦:

m_{i_1 i_2 f_2} = σ( Σ_{k_1=1}^{K_1} Σ_{k_2=1}^{K_2} Σ_{f_1=1}^{F_1} w^{(f_2)}_{k_1 k_2 f_1} g_{(i_1+k_1−o_1)(i_2+k_2−o_2) f_1} ),    (12)

where m_{i_1 i_2 f_2} denotes the value of activation map f_2 at position (i_1, i_2); σ represents the activation function; and o_1 and o_2 are offsets in the spatial dimensions, which depend on the kernel size and equal (K_1+1)/2 and (K_2+1)/2, respectively (see Figure 2).

Figure 2. Convolutional layer with a K_1 × K_2 × F_1 × F_2 kernel. The number of input channels F_1 must equal the number of spectral bands I_3. To preserve the original dimensions at the output, zero padding is needed [18]. Output dimensions also depend on the stride; S = 1 is used to consider every piece of pixel information and to preserve the original dimensions.
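The per-element computation described above can be sketched as a naive, unvectorized convolutional layer with zero padding and ReLU; real frameworks vectorize this heavily, but the arithmetic is the same:

```python
import numpy as np

def conv_layer(G, W):
    """Naive 'same' convolutional layer: each of the F2 kernels
    (K1 x K2 x F1) is slid over the zero-padded input; every output value
    is the sum of the Hadamard product of the kernel and the co-located
    patch, followed by ReLU."""
    I1, I2, F1 = G.shape
    K1, K2, _, F2 = W.shape
    o1, o2 = K1 // 2, K2 // 2
    Gp = np.pad(G, ((o1, o1), (o2, o2), (0, 0)))   # zero padding
    M = np.empty((I1, I2, F2))
    for f2 in range(F2):
        for i in range(I1):
            for j in range(I2):
                M[i, j, f2] = np.sum(Gp[i:i + K1, j:j + K2, :] * W[..., f2])
    return np.maximum(M, 0.0)                      # ReLU

G = np.ones((4, 4, 2))        # toy input: 4 x 4 pixels, 2 channels
W = np.ones((3, 3, 2, 1))     # one 3 x 3 x 2 kernel
M = conv_layer(G, W)          # shape (4, 4, 1)
```

With all-ones input and kernel, interior outputs equal the full kernel volume (3·3·2 = 18) while corner outputs see only the 2 × 2 × 2 overlap (8), illustrating the effect of zero padding.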
An ANN is trained by using iterative gradient-based optimizers, such as stochastic gradient descent, Momentum, RMSProp, and Adam [33]. These drive the cost function L(𝓦) to a very low value by updating the synaptic weights 𝓦. The cost function can be any function that measures the difference between the training data and the prediction, such as the Euclidean distance or cross-entropy [10]. Besides, the same function is used to measure the performance of the model during testing and validation. In order to avoid overfitting [33], the total cost function used to train an ANN combines one of the cost functions mentioned before with a regularization term:
J(𝓦) = L(𝓦) + R(𝓦),    (13)

where J(𝓦) denotes the total cost function and R(𝓦) represents a regularization function. Then, we can decrease J(𝓦) by updating the synaptic weights in the direction of the negative gradient. This is known as the method of steepest descent, or gradient descent:
𝓦 ← 𝓦 − α ∇_𝓦 J(𝓦),    (14)

where 𝓦 on the left-hand side represents the synaptic weights tensor at the next training iteration, α denotes the learning-rate parameter, and ∇_𝓦 J(𝓦) the cost function gradient. Gradient descent converges when every element of the gradient is zero or, in practice, very close to zero [10]. CNNs have been successfully used in many image classification frameworks. Their architectural differences from other typical ANN models lead the network to learn spatial and spectral features, which are highly profitable for image classification. Besides, FCNs, constructed with only convolutional layers, are able to classify each element of the input image; i.e., they yield pixel-wise classification, or in other words, semantic segmentation.

HOOI-FCN Framework
In this work we propose a TKD-CNN-based framework, called HOOI-FCN, which maps the original highly correlated spectral image into a low-rank core tensor, preserving enough statistical information to alleviate pixel-wise image classification. The aim is to improve performance while reducing processing time in semantic segmentation ANNs by compressing MSI third-order tensors. By applying TD methods, relevant information, acquired mainly from the spectral domain, is preserved, which is convenient for the classification FCN. In summary, this novel framework is a two-step structure composed of an HOOI tensor decomposition and an FCN for semantic segmentation, described below (see Figure 3).

Higher Order Orthogonal Iteration (HOOI) for Spectral Image Compression
Quoting Kolda, "The truncated higher order singular value decomposition (HOSVD) is not optimal in terms of giving the best fit as measured by the norm of the difference, but it is a good starting point for an iterative alternating least squares algorithm" [17]. HOOI is an iterative algorithm to compute a rank-(R_1, . . . , R_N) TKD. Let 𝓧 ∈ ℝ^{I_1×···×I_N} be an N-th order tensor and R_1, . . . , R_N be a set of integers satisfying 1 ≤ R_n ≤ I_n for n = 1, . . . , N; the rank-(R_1, . . . , R_N) approximation problem is to find a set of column-wise orthogonal I_n × R_n matrices U^(n) and an R_1 × ··· × R_N core tensor 𝓖 by computing

min_{𝓖, U^(1),...,U^(N)} ||𝓧 − 𝓖 ×_1 U^(1) ··· ×_N U^(N)||_F²,    (15)

and, from the matrices U^(n), where U^(n)T U^(n) = I^(n), the core tensor 𝓖 is found via the multilinear projection (3) [34]. For a third-order tensor decomposition, we can rewrite (2) as

𝓧̂ = 𝓖 ×_1 U^(1) ×_2 U^(2) ×_3 U^(3),    (16)

where 𝓧̂ denotes the reconstruction approximation of the input spectral image 𝓧, 𝓖 is the J_1 × J_2 × J_3 core tensor, and U^(1) ∈ ℝ^{I_1×J_1}, U^(2) ∈ ℝ^{I_2×J_2}, and U^(3) ∈ ℝ^{I_3×J_3} are the projection matrices. Algorithm 1 shows HOOI for a third-order tensor decomposition; the extension to higher-order tensors is straightforward. Thus, with Algorithm 1 we compute the tensor 𝓖 with rank-(J_1, J_2, J_3) for each spectral image as a third-order tensor.

Figure 3. The big picture of the proposed fast semantic segmentation framework, with a fully convolutional network encoder-decoder architecture and a preprocessing HOOI Tucker decomposition stage.
Algorithm 1: HOOI for MSI. ALS algorithm to compute the core tensor 𝓖.
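A minimal NumPy sketch of Algorithm 1, generalized to N modes, with the truncated-HOSVD initialization and the alternating SVD updates; this is an illustrative implementation with a fixed iteration count, not the exact code used in the experiments (which would typically also include a convergence test on the fit).

```python
import numpy as np

def unfold(t, mode):
    return np.moveaxis(t, mode, 0).reshape(t.shape[mode], -1)

def mode_n_product(t, M, mode):
    return np.moveaxis(np.tensordot(M, t, axes=(1, mode)), 0, mode)

def hooi(X, ranks, n_iter=10):
    """Rank-(J1,...,JN) HOOI: initialize factors with truncated HOSVD, then
    alternate over modes, each time projecting X on all the other factors
    and taking the leading left singular vectors of the mode-n unfolding."""
    N = X.ndim
    U = [np.linalg.svd(unfold(X, n), full_matrices=False)[0][:, :ranks[n]]
         for n in range(N)]
    for _ in range(n_iter):
        for n in range(N):
            Y = X
            for m in range(N):
                if m != n:
                    Y = mode_n_product(Y, U[m].T, m)
            U[n] = np.linalg.svd(unfold(Y, n),
                                 full_matrices=False)[0][:, :ranks[n]]
    # Core tensor by multilinear projection with the final factors
    G = X
    for n in range(N):
        G = mode_n_product(G, U[n].T, n)
    return G, U

X = np.random.rand(8, 8, 6)     # toy MSI: 8 x 8 pixels, 6 bands
G, U = hooi(X, (8, 8, 3))       # spectral compression to 3 tensor bands
```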

FCN for Semantic Segmentation of Spectral Images
We use an FCN model for semantic segmentation based on the one proposed by Badrinarayanan et al. [35], called SegNet. Each core tensor 𝓖 obtained after decomposition is the input to the SegNet for training and testing the network. Hence, the feature activation maps 𝓜 ∈ ℝ^{I_1×I_2×F_2} for each hidden layer of the SegNet encoder-decoder FCN are computed by displacing the filters 𝓦 over the whole input core tensor with stride S = 1. It is worth noting that the kernel 𝓦 is a fourth-order tensor, 𝓦 ∈ ℝ^{K_1×K_2×F_1×F_2}, where K_1 and K_2 represent its spatial dimensions (height and width); F_1 its depth, i.e., the spectral domain; and F_2 the number of filters used to produce F_2 activation maps (Figure 2). We express this convolution operation as

M^(f_2) = σ(𝓖 ⊛ 𝓦^(f_2)),    (17)

where M^(f_2) represents each activation map for f_2 = 1, . . . , F_2, and each value m_{i_1 i_2 f_2} is computed as in (12). σ denotes the rectified linear unit (ReLU) [33] function, i.e., σ(z) = max{0, z}. The symbol ⊛ is used in this paper to represent the convolution, i.e., the whole operation applied in convolutional layers (see Figure 2). These activation maps are the input to the subsequent layer in the SegNet FCN. The last layer uses the softmax activation function [33] to produce a probability distribution and thereby predict values relating each pixel to one of the C classes of interest. Hence, for the last layer we rewrite (17) as

Ŷ = δ(𝓜 ⊛ 𝓦),    (18)

where Ŷ represents the output prediction, 𝓜 is the feature activation map tensor at the previous layer, δ the softmax activation function, and 𝓦 the filter or kernel tensor with the adaptable synaptic weights. The output of the FCN is a matrix Ŷ with the same spatial dimensions as the input, holding the most likely class for each pixel. Figure 4 shows the architecture of the SegNet model used in this work. The experiments present the behavior of this FCN with and without data compression in the spectral domain.

Our Data
As a case study, an MSI dataset of 100 RS images was used for training and 10 for testing, all of them from Central Europe with 128 × 128 pixels. These images are partitions of the original Sentinel-2 images without modification, all semi-manually labeled, with an abundant presence of the elements of interest. In Table 3, the 10 scenarios correspond to our 10 test images. We used only nine of the 13 available spectral bands, spanning visible, NIR, and SWIR wavelengths. Bands 2, 3, 4, and 8 have 10 m resolution, and bands 5, 6, 7, 11, and 12 have 20 m (oversampled to 10 m [18]). These bands provide decisive information for the discrimination of the different classes. Bands 1, 9, and 10 were dismissed because of their lower spatial resolution of 60 m. Band 8A, also with 20 m spatial resolution, was dismissed due to wavelength overlap with band 8. It is worth mentioning that the framework proposed in this work can be applied to any kind of spectral image and to multitemporal datasets [36].

The Training Space
For training, the input data was a tensor 𝓧 ∈ ℝ^{128×128×9×100}, where 128 × 128 are the spatial dimensions, 9 is the number of spectral bands, and 100 is the number of images used for training. Although the number of images seems low, taking into account that we work at the pixel domain, the real number of training points or vectors is high. Indeed, our FCN for semantic segmentation was trained with 128 × 128 × 100 = 1,638,400 samples or vectors. To test whether the size of the training data was sufficiently large, a smaller subtensor of 𝓧, 𝓧_p ∈ ℝ^{128×128×9×80}, equivalent to 1,310,720 points or vectors, was used for a second training, obtaining, for the same test set, an average PA of 91.48%, i.e., only 0.08% less than with 100 images (91.56%). We also tested these results with a third training on an extended dataset of 120 images, 𝓧_q ∈ ℝ^{128×128×9×120}, equivalent to 1,966,080 vectors, and found only a slight variation of +0.01% in the PA (91.57%), while the execution time for the training increased significantly.

The Labels
Our labels were acquired using the scene classification algorithm developed by the ESA [19]; misclassified pixels were subsequently corrected semi-manually.

Downloading Data
Due to the large size of the data, the NumPy .npy format was used. Data are available at the link Dataset.

• The training dataset is in the file S2_TrainingData.npy.
• Labels of the training dataset are in the file S2_TrainingLabels.npy.
• A true color representation of the training dataset can be found in S2_Trainingtruecolor.npy.
• The testing dataset and the corresponding labels are in the file S2_TestData.npy.
• Labels of the test dataset are in the file S2_TestLabels.npy.
• Last, a true color representation of the test data can be found in S2_Testtruecolor.npy.
Code will be delivered by the corresponding author upon request for research purposes only.

Classes
The MSI dataset has been semi-manually labeled for supervised semantic segmentation of C = 5 classes: vegetation, water, cloud, cloud shadow, and soil. These classes were selected according to their impact on RS research areas such as agriculture, forest monitoring, population-growth analysis, and disaster prevention. It is worth mentioning that the detection of clouds and cloud shadows is an important prerequisite for almost all RS applications.

Pixel Accuracy (PA)
We used the PA metric to compute the ratio between the number of correctly classified pixels and the total number of pixels:

PA = Σ_{c=1}^{C} p_cc / Σ_{c=1}^{C} Σ_{d=1}^{C} p_cd,    (19)

where we have a total of C classes, p_cc is the number of pixels of class c correctly assigned to class c (true positives), and p_cd is the number of pixels of class c inferred to belong to class d. Table 3 shows the PA values for our proposed framework in comparison with other state-of-the-art methods. From Table 3, we can see that:

• The NDI indexes are important references for pixel-wise classification, but they show one of the lowest PAs and the highest computational time.
• Classic PCA with five components shows the lowest PA, although its computational time is similar to that of HOOI-FCN with five tensor bands.
• Given the poor results of NDI and classical PCA, the FCN (with raw data and nine components) is a good reference in terms of performance and computational time, and HOOI-FCN with seven and five tensor bands achieves the highest PA and the lowest computational time.

The PA and the computational times for FCN and HOOI-FCN with different numbers of tensor bands are shown in Figure 5.
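Since PA is the sum of the confusion-matrix diagonal over the sum of all its entries, it reduces to a single mean over the prediction matrix:

```python
import numpy as np

def pixel_accuracy(y_true, y_pred):
    """PA: correctly classified pixels over total pixels, equivalent to the
    confusion-matrix diagonal sum divided by the total entry sum."""
    return float(np.mean(y_true == y_pred))

y_true = np.array([[0, 1], [2, 2]])
y_pred = np.array([[0, 1], [2, 0]])
pa = pixel_accuracy(y_true, y_pred)   # 3 of 4 pixels correct -> 0.75
```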

Relative Mean Square Error (rMSE)
In order to compute the reconstruction error of the tensor 𝓧 for the HOOI implementation, the rMSE was used:

rMSE = (1/Q) Σ_{q=1}^{Q} ||𝓧_q − 𝓧̂_q||_F / ||𝓧_q||_F,    (20)

where 𝓧_q represents the q-th MSI from our dataset of Q MSIs and 𝓧̂_q its corresponding reconstruction computed by (2). Figure 6a shows the behavior of the reconstruction rMSE for our 100 training images for J_3 = 1, . . . , I_3. With this metric we can quantify how well the decomposition represents the input data. The rMSE is also one of the decisive parameters for setting the value of rank_3(𝓧) = J_3. To preserve high performance in the pixel-wise classification task, we set the threshold ψ to a value for which the rMSE is less than or equal to 0.05%, since deeper decomposition decreases the PA to less than 90%, as we can see in Figure 5. For a rank-(128, 128, 5) decomposition our rMSE is 0.04%, which means that we reduce the dimensionality of our input data almost by half with a very low loss in performance. Besides, comparing this error with matrix-based methods such as PCA, we can see that our tensor-based decomposition produces a lower rMSE for every value of J_3 except the first.
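A sketch of the rMSE computation, assuming (as the per-image values suggest) that the relative Frobenius error is averaged over the Q images of the dataset; the function names are illustrative:

```python
import numpy as np

def relative_error(X, Xhat):
    """Relative reconstruction error ||X - Xhat||_F / ||X||_F."""
    return float(np.linalg.norm(X - Xhat) / np.linalg.norm(X))

def rmse_dataset(tensors, reconstructions):
    """Average relative reconstruction error over the Q images."""
    return float(np.mean([relative_error(X, Xh)
                          for X, Xh in zip(tensors, reconstructions)]))

X = np.random.rand(8, 8, 6)
err_exact = relative_error(X, X.copy())   # 0.0 for a perfect reconstruction
```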

Orthogonality Degree of Factor Matrices and Tensor Bands
One way to analyze the efficiency of the HOOI algorithm is to compute the orthogonality degree of the core tensor 𝓖 and the projection matrices U^(n). As mentioned in Section 3, we use the all-orthogonality property proposed in [32] and described in (7) and (8) to evaluate the orthogonality degree of our core tensors. Table 4 shows the inner products between each tensor band and the others from one of our training images. We can see that these values are practically zero, which means that our bands are orthogonal. Furthermore, Figure 6b shows that (8) is fulfilled.
It is also important to know the orthogonality degree of our projection matrices. Following Theorem 2 in [32], we start from the condition U^(n)T U^(n) = I^(n); then we create a vector ô whose components are the trace of each resulting matrix, i.e., tr(I^(n)), and compute the MSE with respect to the rank vector o = (J_1, J_2, J_3):

MSE = (1/3) Σ_{n=1}^{3} (tr(U^(n)T U^(n)) − J_n)².    (21)

Using this orthogonality analysis, we obtain MSE values very close to zero, e.g., on the order of 10⁻²⁰, which means that the projection matrices present a high orthogonality degree.
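This orthogonality check is straightforward to reproduce; the helper below (hypothetical name) compares tr(U^(n)T U^(n)) against the target ranks, which is near zero exactly when each factor is column-wise orthogonal:

```python
import numpy as np

def factor_orthogonality_mse(factors, ranks):
    """Build o_hat from tr(U^(n)T U^(n)) per factor and compare it with the
    target rank vector o = (J1, ..., JN); the result is near zero iff every
    factor satisfies U^T U = I."""
    o_hat = np.array([np.trace(U.T @ U) for U in factors])
    return float(np.mean((o_hat - np.asarray(ranks, dtype=float)) ** 2))

U1 = np.linalg.qr(np.random.rand(10, 4))[0]   # column-orthogonal 10 x 4
mse = factor_orthogonality_mse([U1], [4])     # practically zero
```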

FCN Specifications
We used hyperparameter search [33] to set the learning rate to 1 × 10⁻³. The model was run for 100 epochs with 100 CNNMSIs from our dataset. We used the Adam optimizer as our optimization algorithm, and Xavier initialization to set the initial values of the weights in the model. The SegNet FCN was used as the base model, since it achieves very high performance metrics in semantic segmentation [35].
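For reference, Xavier (Glorot) uniform initialization draws each weight from U(−a, a) with a = √(6/(fan_in + fan_out)), which keeps activation variance roughly constant across layers. A hedged NumPy sketch of that rule (not the TensorFlow initializer actually used in our experiments):

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, seed=0):
    """Glorot/Xavier uniform initializer: U(-a, a), a = sqrt(6/(fan_in+fan_out))."""
    a = np.sqrt(6.0 / (fan_in + fan_out))
    rng = np.random.default_rng(seed)
    return rng.uniform(-a, a, size=(fan_in, fan_out))
```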

Hardware/Software Specifications
Our framework was implemented in Python 3.7 with TensorFlow-GPU version 1.13. Experiments were run on an NVIDIA GeForce GTX 1050 Ti GPU. The processor used was an Intel Core i7 with 8 GB RAM, a 128 GB SSD, and a 1 TB HDD.

Table 4. Inner products of each tensor band with the others, from one image of our dataset decomposed by HOOI.

Discussion and Comparison with Other Methods
Original spectral bands (Figure 7a) were transformed or mapped into new tensor bands (Figure 7b,c), which preserved the features of our classes of interest within the first tensor bands, avoiding the use of all the original spectral bands and thereby reducing the computational load of further applications.
From Figure 7b,c we can see that, for the classes of interest in this case study, the selected error margin ψ is indeed a good parameter for restricting the rank in the third mode, since the spectral information needed to differentiate these five classes is concentrated in the first elements of the spectral domain. Nevertheless, if a smaller value of J₃ were used, there would be a trade-off in the performance of the semantic segmentation.
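The rank-selection rule implied here, i.e., taking the smallest J₃ whose reconstruction rMSE falls below the margin ψ, can be sketched as follows (a hypothetical helper, not from the original code):

```python
def select_rank(rmse_per_rank, psi=0.0005):
    """Smallest J3 (1-based) whose rMSE <= psi; psi = 0.0005 encodes 0.05%.
    Falls back to the full spectral dimension if no rank meets the margin."""
    for j3, err in enumerate(rmse_per_rank, start=1):
        if err <= psi:
            return j3
    return len(rmse_per_rank)
```

For example, a curve that first drops below 0.05% at the fourth tensor band yields J₃ = 4, matching the trade-off between rMSE and PA discussed above.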
Quantitative results in Figures 8–12 and Table 3 compare the processing time and PA of our proposed framework with a model without any preprocessing decomposition algorithm and with a normalized-difference-index-based method in different scenarios. The accuracy values obtained by the proposed HOOI-FCN framework are overall better than those obtained by the other methods under the same conditions and scenarios, with a quite significant decrease in processing time, on the order of 10 times. It is worth noting that our HOOI-FCN framework with seven and five tensor bands outperforms, in PA, the same FCN with the original nine bands. This means that the decomposition produces better features for the classification ANN.
The confusion matrix presented in Figure 13 shows the accuracy of the proposed HOOI-FCN framework for each class, as well as the overall accuracy. Rows correspond to the output (predicted) class and columns to the true class. Diagonal cells show the correctly classified pixels, and off-diagonal cells show where the errors come from. The rightmost column shows the accuracy for each predicted class, while the bottom row shows the accuracy for each true class. It is important to note that the vegetation and cloud classes are close to 95% accuracy, while water and cloud shadows are below 90%. The latter can be caused by a lack of training samples with a larger contribution of these elements, as well as by the similarity of these elements to others in the scenes.
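The quantities read off the confusion matrix can be computed as follows, using the same convention as Figure 13 (rows = predictions, columns = ground truth); the function name is ours:

```python
import numpy as np

def confusion_accuracies(cm):
    """Given cm[pred, true], return (per-predicted-class accuracy,
    per-true-class accuracy, overall pixel accuracy)."""
    cm = np.asarray(cm, dtype=float)
    pred_acc = np.diag(cm) / cm.sum(axis=1)   # rightmost column of the matrix
    true_acc = np.diag(cm) / cm.sum(axis=0)   # bottom row of the matrix
    overall = np.trace(cm) / cm.sum()
    return pred_acc, true_acc, overall
```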

Conclusions
Any RS-MSI or -HSI, i.e., a third-order tensor image, is mapped by the TKD to another tensor, called the core tensor, representative of the original, preserving its spatial structure but with fewer tensor bands. In other words, a new subspace embedded in the original space was found, and it was used as the new input space for the task of pixel-level classification or semantic segmentation. Due to the success of DL for image processing, our approach employs an FCN as the classifier, which delivers the corresponding prediction matrix of pixels classified element-wise.
The efficiency of the proposed higher-order orthogonal iteration (HOOI)-FCN framework is measured by metrics such as pixel accuracy (PA) and recall as a function of the number of new tensor bands, which is determined by the reconstruction error computed via the rMSE. Another important parameter of the TKD is the orthogonality degree of each component, i.e., the core tensor and the factor matrices, computed via the inner products of each band with the others.
Our experimental results for a case study show that the proposed HOOI-FCN framework for CNNMSI semantic segmentation reduced the number of spectral bands from nine to seven or five tensor bands, for which PA values converge or are very close to the maximum.
State-of-the-art methods, such as normalized difference indexes, PCA with five principal components, and the same FCN network with the nine original bands, with an average pixel accuracy of 90% (computational time ∼90 s), were outperformed by the HOOI-FCN framework, which achieved a higher average PA of 91.97% (computational time ∼36.5 s) and an average PA of 91.56% (computational time ∼9.5 s) for seven and five new tensor bands, respectively.
These results are very promising in RS, since the use of other algorithms for the calculation of core tensors, as well as a deeper analysis of the weights and initialization of the convolutional neural network (CNN), can further increase the performance metrics of segmentation for RS spectral data. Some limitations to a fuller validation of this approach are that denoising is not included, new cases are needed to enhance the input space, and a greater number of classifiers should be tested.
Finally, this research allows us to emphasize two main, relevant points: (1) RS images are characterized by a large number of bands, high correlation between neighboring bands, and high data redundancy; (2) they are also corrupted by several sources of noise. Some issues related to our approach remain open.

•	Compression affects not only the input data but also the CNN network, to reduce overall complexity and/or create new ANN architectures for specific RS-CNNMSI or HSI image applications.
•	Instead of the HOOI algorithm, use greedy HOOI and other algorithms that determine the core tensor, for a broad comparison.
•	For classification purposes, use other machine learning algorithms, such as an SVM or random forest.
•	Increase the input data with more scenarios and their corresponding ground truth, for a deeper study of the behavior of several classifiers, including those based on ANNs, and of the scope of the TD methods.

Acknowledgments:
We would like to thank the student E. Padilla and his advisor A. Méndez for their help in producing the results for the PCA decomposition and facilitating the comparative analysis between TD and PCA, and F. Hermosillo for useful observations about this work, all of them from CINVESTAV, Guadalajara. We also dedicate this work to the memory of Y. Shkvarko, who was an important mentor in the realization of this research.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: