GPU-Based Soil Parameter Parallel Inversion for PolSAR Data

: With the development of polarimetric synthetic aperture radar (PolSAR), quantitative parameter inversion has been seen great progress, especially in the ﬁeld of soil parameter inversion, which has achieved good results for applications. However, PolSAR data is also often many terabytes large. This huge amount of data also directly affects the efﬁciency of the inversion. Therefore, the efﬁciency of soil moisture and roughness inversion has become a problem in the application of this PolSAR technique. A parallel realization based on a graphics processing unit (GPU) for multiple inversion models of PolSAR data is proposed in this paper. This method utilizes the high-performance parallel computing capability of a GPU to optimize the realization of the surface inversion models for polarimetric SAR data. Three classical forward scattering models and their corresponding inversion algorithms are analyzed. They are different in terms of polarimetric data requirements, application situation, as well as inversion performance. Speciﬁcally, the inversion process of PolSAR data is mainly improved by the use of the high concurrent threads of GPU. According to the inversion process, various optimization strategies are applied, such as the parallel task allocation, and optimizations of instruction level, data storage, data transmission between CPU and GPU. The advantages of a GPU in processing computationally-intensive data are shown in the data experiments, where the efﬁciency of soil roughness and moisture inversion is increased by one or two orders of magnitude.


Introduction
Soil moisture and roughness are important parameters in the fields of agriculture, ecology, meteorology, and hydrology. They are widely used in farmland irrigation management, climate prediction, and drought monitoring [1]. For example, in the agricultural area, soil water content and roughness directly affect crop growth [2]. The correct assessment of soil moisture is also the basis of hydrological modeling [3]. In the meteorological field, soil water content is an essential component of the land-atmosphere boundary energy budget [4,5]. Therefore, the inversion of soil water content and roughness has become a research hotspot for scholars.
With the increase in the number of global satellites, the application of ground exploration has become increasingly common [6][7][8]. But low resolution is also a major problem for exploration [9,10]. The development of science and technology has promoted the rapid development of synthetic aperture radar (SAR) techniques, through which the quality and resolution of radar imaging has been significantly improved. However, in the process of ground detection, the anti-interference of SAR is very low, and the detection process is easily affected. Both GIS and remote sensing assistance information are used for soil moisture estimation [11]. Multi-satellite collaboration can also improve spatio-temporal resolution [12]. By means of dual/multi/full polarization, the polarized synthetic aperture radar (PolSAR) has a good detection effect and high resolution [13]. PolSAR plays an important role in geographic surveying, angle is larger than 30 degrees. The scattering models themselves are a forward model, that is, from the input object parameters and observation parameters to the scattering matrix or coefficients. However, if we want to retrieve the soil roughness and moisture from the data, the corresponding inversion algorithms are needed. We mainly analyze the optimization of the three algorithms for soil inversion based on the GPU parallel method. Therefore, the contributions of this paper are mainly reflected in the parallel design (GPU thread allocation strategy), parallel optimization for computing-intensive issue (instruction optimization), and parallel optimization for data-intensive issue (storage optimization and data type conversion). Through the three aspects of parallel acceleration, 14×-169× speedups can be achieved for the three inversion models.
The rest of this paper is organized as follows. In Section 2, the three inversion algorithms are specifically introduced, including the Dubois model, Oh model, and X-Bragg model. In Section 3, the sequential algorithm analysis and the proposed parallel methods are presented. In Section 4, the experimental results and analysis are presented. The conclusion is given in the final section.

Dubois Model
In 1995, Dubois proposed an empirical model that only requires the same polarization backscatter coefficients σ 0 HH and σ 0 VV to extract the root mean square height and water content of the bare soil. The model was built using the datasets collected by a truck-mounted scatterometer at the University of Michigan and the RASAM scatterometer at the University of Bern. Through the measurement of the scatterometer and the data, the local incident angle and frequency, the dielectric constant and the surface roughness are mapped to the co-polarized scattering coefficient. Studies have found that this relationship is close to the tangent of the angle of incidence. The algorithm is applied to SAR data (AIRSAR and SIR-C) to prove its robustness [36].

Oh Model
At the University of Michigan, based on the analysis of the classical theoretical scattering model Kirchhoff approximation (KA) and SPM, Y. Oh, K.Sarabandi, and F.T. Ulaby developed this semi-empirical model in 1992. The model uses full-polarization data (LCX POLARSCAT) measured by an on-board network analysis scatterometer at three frequencies (1.5, 4.5, and 9.5 GHz), as well as comprehensive and accurate surface measurements, with incident angles ranging from 10 • to 70 • [36,37].
The model proposes a clear cross-polarization and co-polarization backscatter ratio function. The empirical equation is: where P and Q represent the cross-polarization and co-polarization backscatter ratios (i.e., , θ is the local incident angle, and ks is the root mean square height (i.e., roughness) after the wavelength is normalized, and Γ 0 is the Fresnel reflection coefficient.
By combining Equations (4)- (6) we can obtain the mathematical Equation (7): Among them, x is obtained by the iterative method in the program, and then the Fresnel reflectivity Γ 0 and Fresnel reflection coefficient ε r can be obtained, as well as the soil roughness (ks) and soil moisture (mv). In general, the model shows good agreement on ground measurements within a certain range, where ks ∈ [0. 1,6],m v ∈ [9, 31].

X-Bragg Model
X-Bragg model is an SPM-based polarimetric scattering model. It utilizes the coherency matrix of full polarimetric data, including the phase information. Firstly, from the side of polarimetric coherency matrix, which contains the second order moment of scattering process shown in Equation (8), can be diagonalized by an unitary similarity transformation of the following form [38]: [Λ] is a diagonal matrix whose elements are [T] real non-negative eigenvalues 0 is an eigenvector matrix whose columns correspond to orthogonal eigenvectors e 1 , e 2 and e 3 . In this way the coherency matrix T is written as The diagonalization of the coherency matrix directly produces three important physical features. Firstly with the obtained eigenvalues, the scattering probability p i are computed by normalizing the eigenvalues.
Then two of the physical features are defined as follows, which are polarization scattering entropy H and scattering anisotropy A The third important parameter is obtained from the eigenvector of [T]. Each feature vector e i can be represented by five angles [31]. The β i angle can be interpreted as the rotation of the corresponding feature vector e i in a plane perpendicular to the scattering plane, while ϕ 1i , ϕ 2i , and ϕ 3i explain the phase relationship between the e i elements. In this work, the average scattering angle α is more important, which is defined as To extend the Bragg scattering model to a wider range of roughness conditions, the Bragg coherency matrix [T] is rotated around a plane perpendicular to the scattering plane. The rough surface is modeled as a reflective symmetry depolarizer, as shown in Equation (15). A configuration averaging is performed on a given distribution β of P(β): Indeed, Figure 1 shows the corresponding spatial relationship of the surface slope in detail. The width of the assumed distribution corresponds to the amount of roughness disturbance of the modeled surface [38]. Assuming P(β) to be a uniform distribution about zero with width β 1 : The coherency matrix for the rough surface becomes: the coefficients C 1 , C 2 and C 3 describing the Bragg components of the surface are given by For the soil roughness estimation, ks can be calculated by Equation (20) With the obtained roughness ks, the corresponding entropy H and α angle values are stored in the look-up-table (LUT) by Equations (12) and (14). Using this LUT, the dielectric constant value can be obtained directly from the estimated entropy H and α angle values. Thus, the corresponding moisture mv is obtained.

Inversion Algorithms Analysis
These three inversion algorithms based on Dubois, Oh, and X-Bragg scattering algorithms differ in the aspects of input data, valid ranges, features, and computation complexity, as do the parallel processing methods applied to them, as shown in Figures 2 and 3.

Get data and initialize
Pre-processing   For the inversion of the Dubois algorithm, it is straight and simple from the algorithm equations. At first, the dielectric constant is computed then the surface roughness is calculated. It requires only the scattering coefficients of HH and VV channels, hence they could be applied widely in the presence of the dual pol data availability of many airborne and spaceborne platform. However, it should be noticed that only when the incidence angle is larger than 30 degrees, the algorithm has reliable inversion results. According to Equations (1)-(3), the algorithm complexity is calculated as O(n), where n indicates the number of PolSAR image pixels. Although the algorithm complexity is ordinary, there are many time-consuming functions including trigonometric and exponential functions, which may reduce the acceleration efficiency.
The Oh algorithm utilizes the full polarimetric scattering coefficients. While for inversion, the Fresnel coefficient is first obtained by an iterative process, following that, the dielectric constant and roughness are computed consequently. Oh has a large valid range of roughness and moisture among the empirical inversion models. When the amplitudes of full polarimetric SAR data are available, it can be applied. According to Equation (4)-(7), its algorithm complexity can be approximated as O(m · n), where m is the number of iterative calculation, and is set to 100 in the experiments. Compared to the Dubois algorithm in computing efficiency, the advantage is that the trigonometric function calculations are avoided, and the disadvantage is that the iterative calculation should be performed.
The X-Bragg algorithm is considered to extend the Bragg scattering algorithm for a slight roughness in the soil surface. It has a wider valid range for the roughness parameter, and is also not sensitive to the existence of slope. The X-Bragg algorithm is the real full polarimetric algorithm for soil surface, which utilizes both the amplitude and the phase information of full polarimetric channels. However, the inversion of this algorithm is not straightforward. The main steps are to compute the roughness from anisotropy, to construct the two-dimensional space of entropy and mean alpha, then to find out the dielectric constant by use of look-up-table (LUT) under certain conditions of incident angle and roughness. According to Equations (8)-(20), the algorithm complexity can be simplified as d · O(l · n), where d indicates the algorithm complexity of matrix diagonalization, l represents the dimension of lookup table. Based on the above complexity analysis, it can be seen that the X-Bragg algorithm is the most complicated calculation, and is worthy of deep optimization.
According to the differences of the three inversion schemes, the key points of parallel computing are thread allocation, data storage, and instruction optimization. For the Dubois and Oh algorithms, two optimization methods were used in our experiment: thread allocation and instruction optimization. For the X-Bragg algorithm, we used a variety of optimization methods such as thread allocation, storage optimization, and instruction optimization.

GPU-Based Dubois and Oh Parallel Inversion
In principle, the Dubois algorithm can be seen as a simplification of the Oh inversion algorithm, so the optimization methods of the two inversion algorithms are roughly the same. The implementation of these two inversion algorithms includes the following parts: data acquisition, data preprocessing, inversion algorithm implementation, and data output. The calculation process of the inversion model is optimized in parallel, which can efficiently achieve the inversion of soil water content and roughness. The overall framework of the inversion algorithm is as follows: In Figure 2, the black dotted frame is the part that needs to be optimized. The number of cycles of the calculation process is determined by the amount of data. This article uses two ways to optimize: (1) Thread allocation: In the thread allocation process, the computing power of the hardware needs to be considered. In this experimental environment, each block can be allocated up to 1024 threads, which does not mean that the number of threads per block is as high as possible. The amount of data used in the experiment is much larger than 1024, and the pixels remain independent during the calculation, so all threads are independent. Warp is the basic transmission unit of SM (streaming multiprocessor), and a warp has 32 threads. Therefore, the size of each thread block in this experiment is 16 * 16. And the problem of limited storage space for threads is solved. This size ensures the full utilization of each scheduling unit and the threads have sufficient memory. It can make computing more efficient. Figure 3 shows the detailed thread allocation. (2) Instruction optimization: In Equations (1), (2), and (7), there are a large number of trigonometric and power functions. When parallel optimization is used, these functions are not applicable.
In the CUDA runtime, there are some corresponding mathematical functions, and the calculation efficiency is higher under the condition of partial precision loss. For example, to replace the function sin(.) with the function __sin f (.). The calculation time of the inversion can be reduced by using the __sin f (.) function, which is an internal function of GPU.

GPU-Based X-Bragg Parallel Inversion
According to the principle of the analytic algorithm, X-Bragg is different from the other two inversion algorithms, and the lookup table is calculated before all data preprocessing. This table is used to find out the corresponding soil moisture under certain conditions of incidence angle and roughness. Figure 4 below shows the overall framework of the inversion algorithm based on X-Bragg.  There are three main parts in the graph. The first one is to calculate the H and α according to X-Bragg algorithm, then H and α are stored in the lookup table. The second part is that the raw data needs to be spatially averaged for preprocessing etc. The third part is the inversion process of X-Bragg algorithm. This part solves the [T] matrix for each corresponding pixel, and calculates real non-negative eigenvalues λ 1 , λ 2 , λ 3 , and orthogonal eigenvectors e 1 , e 2 , and e 3 . Then the entropy H and α angle are computed corresponding to the eigenvalues and eigenvectors. Following that, the soil roughness can be calculated by scattering anisotropy A (ks), and finally with the obtained entropy H and α angle corresponding to the lookup table, soil moisture (mv) is inverted.
The size of data used in this experiment was 7981 × 1837. After testing and averaging six times, the calculation time for each process is obtained. The calculation time of the first part is about 15 ms, and the calculation time of the second and third parts are about 7247 ms and 330,582 ms, respectively. The total computation time is 337,844 ms, in which the third part accounts for 97.85% of the total time. In the local environment, it takes more than five minutes to proceed with data of size 7981 × 1837, which indicates that X-Bragg algorithm inversion is inefficient and parallelism. Since the third part of the time affects the real-time processing of the inversion, it is considered as the main part for optimization. Figure 5 describes in detail the flowchart of CPU/GPU collaborative processing based on inversion algorithms.
In Figure 5, Ndieli is the step size of the inverted dielectric constant. Nbeta is the step size of the roughness angle in the inversion. Through the preliminary test, the display driver stopped responding during the calculation because of the execution time of the kernel function is too long. So the kernel function is divided into three parts: (1) Entropy H and α angle are calculated by high concurrent multithreading; (2) the position of the pixel corresponding to the scatter table is determined, and the entropy H and α angle are calculated using the zero-start consumption of the kernel function; (3) correspond to the lookup table, the moisture (mv) and roughness (ks) efficiently are calculated.   In pseudo code, the T matrix is calculated. V is the eigenvector and lambda is the corresponding eigenvalue. al and se are calculated as lookup tables. pos is the position in the LUT. Nlig * Ncol is the size of the data. Mmv_out is the output data. In the following part, a detailed optimization analysis of the X-Bragg algorithm is performed through four points.

V1: Thread allocation optimization
The thread allocation optimization is basically consistent with the analysis of Figure 2. In Figure 4, the experiment is divided into three kernel functions. The first reason is to calculate the time limit. In addition, if the single thread independently calculates the entire inversion process, it will lead to parallel branches. Considering the fact that when the entropy H and α angle are calculated by Equations (17) and (19), there is a threshold judgment H ≤ max_H and α ≤ max_α, while the inversion is only performed in range, so those threads do not perform inversion will be idle. The computation resources is wasted in this way. Therefore, in our experiment the kernel function is split before inversion, which can greatly increase the utilization rate of computing resources and avoid the waste of resources caused by parallel branches. V2: Storage optimization In the pseudo code, steps 5 and 9 use three constant arrays lia _blockrange, max _en and max _al of size 901. In the process of calculating and searching lookup tables, these three arrays are read multiple times. In general, data is transported from CPU to GPU global memory. Each thread needs to acquire data from the global memory for multiple times, hence the slow transmission time leads to the bottleneck of data processing. This problem can be solved well in hardware storage configuration. Constant memory has 64 kb of storage, and it is much larger than the size of the three arrays. So it is possible to pre-calculate the indices and addresses of the three arrays on CPU and uploaded them to the cached GPU constant memory, where they can be retrieved by the thread blocks at both high bandwidth and low latency. Besides, constant memory is a good fit for these three arrays in read-only operations. In this way, the transmission objective of the data is changed from 1 to 2, as shown in Figure 6.

V3: Data type conversion optimization
In steps 4 and 8, the diagonalization function is used for multiple times. By implementing Equation (13) [T] matrix is diagonalized by an unitary similarity transformation, and the non-negative eigenvalue matrix [Λ] and the eigenvector matrix [U 3 ] are obtained. This process uses intermediate variables of double-type to make the diagonalization process more precise. But double-type data also limits the speed of GPU operations, while float-type is more suitable than double. Under the AIDA64 software test, the GTX 1080Ti provides a peak throughput of nearly 12.637 teraflop/s in float precision, but is limited to 423 gigaflop/s in double precision. In the GPU, using float data not only reduces memory consumption, but also improves data operation efficiency. Therefore, under the condition of partial accuracy loss, the usage of float-type increases the computational efficiency as well as saving storage resources for the hardware.

V4: Instruction optimization
In addition to the CUDA fast math optimization mentioned in the other two algorithms, loop expansion is also used for the optimization of the X-Bragg algorithm inversion. In general, a GPU is suitable for processing computationally intensive data, however its ability to do logical judgment is weak. In step 4, step 5, and step 8, the diagonalization process is used multiple times. There are a large number of loops in the process for calculating the real non-negative eigenvalue matrix [Λ] and the eigenvector matrix [U 3 ]. Logical judgment has also become a huge bottleneck during GPU computing. By artificially expanding the loop within the kernel function, the instruction consumption is reduced as much as possible. Kernel performance is improved to get efficient calculations. In Figure 7, expanding the loop is vividly displayed. GPU is more suitable for computing than CPU. In terms of logical judgment, CPU has higher efficiency.

Experiment Environment
The hardware environment of experimentation includes Intel(R) core(TM) i5-3470 (CPU) and NVIDIA GeForce GTX 1080 Ti (GPU), and the library of visual studio 2013+ CUDA 9.0 is the software environment of the program.

Accuracy Analysis
Technically, there is no accuracy loss after these algorithms are implemented on GPU. However, there exists a small difference in the accuracy range of CPU and GPU math functions. As for GPU, the computing performance of single-precision floating-point data outperforms double-precision floating-point data by a lot. Therefore, the single-precision floating-point data type is applied for the GPU parallel design. Due to the above two reasons, the proposed methods may bring certain calculation errors, which should be analyzed to guarantee the algorithm accuracy.
Two indicators are employed to validate the parallel methods, mean absolute error (MAE) and root mean square error (RMSE), respectively. The pixel-wise comparisons of m v are carried out among the three inversion models, as shown in Table 1. From the MAE and RMSE results, it can be seen that the errors from GPU parallel are very small and can be ignored. Meanwhile, the visual result comparisons on m v and k s are shown in Figure 8.

Optimization Results of the Dubois and Oh Parallel Inversion
The test site is the Demmin area in northern Germany, and the real PolSAR data is acquired by the ESAR airborne system of the German Aerospace Center. The original size is 7981 × 1837. For comparison, different sizes of data are constructed by the upsampling and downsampling methods. The principle of data construction is to set a size more suitable for each inversion algorithm, which makes the computation more stable. In the above experimental environment, the Dubois and Oh algorithms were tested and compared using the same set of size data. Here, four sizes of data are tested, which are 13600 × 1837, 6800 × 1837, 3400 × 1837, and 1700 × 1837. After six tests, the average calculation time is used as the final results. Tables 1 and 2 below show the calculation time for Dubois and Oh algorithms at different data sizes.
In Tables 2 and 3, the row data represents the calculation time for different sizes. The first column indicates CPU computation time. The second column shows the transmission time of all data from CPU to GPU. The third column is the computation time of the kernel function. The fourth column shows the overall speedup results of the inversion based on the scattering algorithm.   Table 2 shows the optimization results of the Dubois model inversion. The performance of the GPU has only increased by about 15 times. As the amount of data decreases, the speedup effect is gradually weakened. This further illustrates pertinence of GPU for computationally intensive data. The acceleration can reach hundreds of times without considering the data transmission time. In Table 3, GPU is 100 times faster than CPU. Here, the computation time can be accelerated by thousands of times without considering the data transmission.
In terms of algorithm, Oh is more complicated than Dubois. In CPU calculation process, Oh is much slower than Dubois. This shows that the complexity of the calculation process has a profound impact on GPU usage. From the two algorithms optimization results, Oh is more suitable for GPU than Dubois. The Figure 9 is a more intuitive description of the optimization of these two algorithms.

Optimization Result of the X-Bragg Parallel Inversion
In the above experimental environment, The 15862 × 1837, 7981 × 1837, 3990 × 1837, and 1995 × 1837 size data were used in the X-Bragg algorithm. Table 4 details the calculation time of the algorithm in different situations. In Table 4 , CPU time is the time before optimization. The final computation time is the one after the final optimization. As can be seen from the table, the final result is about 150 times faster than CPU. In Figure 10, the acceleration effects of different data sizes can be clearly expressed. This experiment proposes four optimization methods based on the complexity of the algorithm. They are thread allocation, storage optimization, data conversion, and instruction optimization, respectively. Table 5 shows the time tested after each step optimization. In Table 5, the first column is the optimization result after reasonable thread allocation for hardware. The second column shows the effect after using constant memory. The third column shows the time after type conversion optimization. The fourth column is the final optimization result after the loop is expanded. From the first column to the fourth column optimization, the efficiency of the process is more than doubled. The acceleration performances of X-Bragg algorithm at different data size with different optimization methods are also shown in Figure 11. In Figure 11, bar graphs of different colors represent data of different sizes. With the implementation of the four optimization methods, the calculation time is continuously reduced. The number on the graph indicates the detailed calculation time (unit: ms). On the whole, the final optimization can reach more than 150× speedup. Taking 15862 × 1837 data as an example, the inversion time is reduced from 11 min to 4 s.

Conclusions
In this paper, three classical forward scattering models and their corresponding inversion algorithms are analyzed. They are different in polarimetric data requirement, application situation, and performance. Through the further analysis of the structure of the three classical inversion algorithms based on scattering models, each algorithm is optimized, respectively. Then a framework for the parallel inversion method for polarimetric SAR imagery based on GPU is presented. The optimization combines the processing advantages of a GPU with computationally intensive imagery, so as to realize the parallel design of three inversion algorithms, including the entire inversion process from data transmission, computational instruction set, to GPU hardware structure. In the experiments, the calculation efficiency is increased by approximately 100-fold. For all widely used empirical models, the problems of large data volume and low computing efficiency are solved, including dual/multi/full polarization models. Experiments with real data fully demonstrate the tremendous advantages of combining GPU and PolSAR imagery for real-time processing. However, as far as the calculation time is concerned, the parallel X-Bragg method is still two orders of magnitude slower than the other two methods. In the future work, we will further optimize the diagonalization part to make it more efficient.