Deep Learning for SAR Image Despeckling

: Speckle ﬁltering is an unavoidable step when dealing with applications that involve amplitude or intensity images acquired by coherent systems, such as Synthetic Aperture Radar (SAR). Speckle is a target-dependent phenomenon; thus, its estimation and reduction require the individuation of speciﬁc properties of the image features. Speckle ﬁltering is one of the most prominent topics in the SAR image processing research community, who has ﬁrst tackled this issue using handcrafted feature-based ﬁlters. Even if classical algorithms have slowly and progressively achieved better and better performance, the more recent Convolutional-Neural-Networks (CNNs) have proven to be a promising alternative, in the light of the outstanding capabilities in efﬁciently learning task-speciﬁc ﬁlters. Currently, only simplistic CNN architectures have been exploited for the speckle ﬁltering task. While these architectures outperform classical algorithms, they still show some weakness in the texture preservation. In this work, a deep encoder–decoder CNN architecture, focused in the speciﬁc context of SAR images, is proposed in order to enhance speckle ﬁltering capabilities alongside texture preservation. This objective has been addressed through the adaptation of the U-Net CNN, which has been modiﬁed and optimized accordingly. This architecture allows for the extraction of features at different scales, and it is capable of producing detailed reconstructions through its system of skip connections. In this work, a two-phase learning strategy is adopted, by ﬁrst pre-training the model on a synthetic dataset and by adapting the learned network to the real SAR image domain through a fast ﬁne-tuning procedure. During the ﬁne-tuning phase, a modiﬁed version of the total variation (TV) regularization was introduced to improve the network performance when dealing with real SAR data. Finally, experiments were carried out on simulated and real data to compare the performance of the proposed method with respect to the state-of-the-art methodologies.


Introduction
SAR, as any coherent imaging system, generates speckled images. Speckle itself carries crucial information about the observed surface. This information is usually exploited by popular interferometric processing techniques of SAR image pairs [1]. However, as shown in Figure 1, speckle acts as noise on a single detected SAR image since it hides many details of the observed scene. At the beginning of the SAR era, when only detected images were considered, the speckle was referred to as speckle noise. Thus, speckle noise should be removed in all those Earth Observation (EO) applications where only detected images are considered.
Recently, a massive amount of SAR images has been made available by new SAR missions with systematic observation capability. The two Sentinel-1 satellites, for example, provide medium resolution SAR images of a large part of the Earth every six days (or even less if we consider both ascending and descending passes). The number of looks usually exploited to reduce the speckle should be kept as low as possible to guarantee a good spatial resolution. Such an amount of regularly repeated observations can be used to improve land cover classification, environmental monitoring, emergency response, and military surveillance. In most of these applications, despeckled SAR images are requested, as testified by the presence in the literature of many algorithms that have been proposed and used as a pre-processing step to mitigate the effects of speckle noise since the 1980s. Adaptive spatial domain filters like Lee [2], Kuan [3], Frost [4], operate on the value of the pixels by running a window, or kernel, over the entire image. Both Lee and Kuan filters, remove speckle noise by computing a linear combination of the central pixel intensity and the average intensity of neighbour pixels within the moving window. The Frost filter adopts a similar approach by using an exponentially damped kernel that behaves in a fashion similar to a low-pass filter or an identity filter, depending on whether the local coefficient of variation is small or large, respectively [5]. Lope et al. [6] presented the enhanced Lee and Frost filters operating in a similar way but introduced three classes for the coefficient of variation, namely homogeneous regions, heterogeneous regions, and isolated points. Although spatial domain filters proved to be effective in removing speckle from images under specific local conditions, their performance is highly constrained on the choice of the moving window. In general, they are applicable only in homogeneous areas and are characterised by blurring artefacts.
Another class of despeckling methodologies is based on the wavelet transform, in which the noise is reduced by thresholding the coefficients of the discrete wavelet transform (DWT) of the log-transformed single look image. In [7], the wavelet Bayesian denoising technique is integrated with an image regularisation procedure based on Markov random fields (MRF), achieving better performance than the enhanced Lee filter. In [8] authors investigate a despeckling and texture extraction method which uses a Gauss-MRF, while in [9] authors introduce a speckled reduction method for PolSAR imagery based on adaptive Gaussian MRF. On the other hand, the method used by Argenti et al. [10] outperforms the Kuan filter by applying a minimum mean-square error (MMSE) filtering in the undecimated wavelet domain. A different approach by Solbo et al. [11] introduces the homomorphic wavelet maximum a posteriori (Γ-WMAP) filter improving the performance of the original Γ-MAP filter [12]. A good smoothing capability in homogeneous regions is achieved by using a priori statistical information about the radar cross section (RCS). The major weaknesses of wavelet transform methods are still the backscatter mean preservation in homogeneous areas, details preservation, and the generation of artificial artefacts.
In order to overcome the issues mentioned before, more recent nonlocal (NL) filtering methods have been introduced in which noise reduction is performed by assigning each pixel a weight according to its similarity with the reference pixel. The nonlocal means (NL means) filter [13] computes the value of a pixel as a weighted average of all the pixels in the image, where the weights are a function of the Euclidean distance between local patches of fixed size centred in the reference pixels. Deledalle et al. [14] adapt the above method to SAR images by proposing a probabilistic patch-based (PPB) filter in which the similarity between noisy patches is defined from the noise distribution. Then, the obtained weights are refined, including the similarity between the restored patches. The block-matching 3-D (BM3D) image denoising algorithm [15] groups image patches into 3-D arrays based on their similarity and performs a collaborative filtering procedure to obtain the 2-D estimates for all grouped blocks. Parrilli et al. [16] modify the above algorithm to deal with SAR images (SAR-BM3D) by grouping the image patches through an ad hoc similarity measure that takes into account the actual statistics of the speckle and by adopting the local linear MMSE (LLMMSE) criterion in the estimation step. Despite its good results, this framework is one of the most computationally intensive.
Popular approaches are also the ones based on the total variation (TV) denoising procedure [17] that combines a data fitting term with a regularisation term which encourages smooth solutions while preserving edges. TV-based methods differ because of the choice of the data fitting term and of the application domain, i.e., intensity or log-transformed intensity. In [18], authors propose a new variational method in the original intensity domain based on a MAP approach, assuming a Gamma distribution for the speckle and a Gibbs prior for the original image. The variational model by Steidl et al. [19] operates on images contaminated by multiplicative Gamma noise, by considering the I-divergence as data fitting term and applying the Douglas-Rachford splitting techniques to solve the minimisation problem. Other works apply TV regularisation in the logarithmic domain to overcome the difficulty of defining strictly convex TV problems in the original intensity domain [20,21]. A critical task in TV regularisation is the choice of the regularisation parameter that controls the degree of smoothing, i.e., large values for the parameter lead to over-smoothed results while small values do not sufficiently remove the noise. In [22], the authors employ an adaptive TV (ATV) regularisation method consisting in adapting the regularisation parameter based on the speckle intensity. Another work by Palsson et al. [23] proposes to select the regularisation parameter by minimising the Monte Carlo Stein's Unbiased Risk Estimate (MCSURE), showing good results on real SAR images.
Finally, it is worth to mention an additional framework where despeckling is achieved by using the information contained in multi-temporal SAR data stacks. The extracted temporal statistics are used to develop space-adaptive filters which are then used on the single image. Ferretti et al. [24] propose a despeckling algorithm, referred to as DespecKS. Statistically homogeneous population (SHP) is identified by applying the two-sample Kolmogorov-Smirnov (KS) test within an estimation window where pixels share the same statistics of the considered centre pixel. Then, the obtained SHPs identify homogeneous areas in the image and their intensities are averaged to reduce the speckle while preserving point-wise permanent scatters (PS). In [25] Chierchia et al. deal with multi-temporal data by integrating a nonlocal temporal filter (NLTF) with the SAR-BM3D filtering method, thus developing a multi-temporal oriented version of the latter. Another approach by Zhao et al. [26] applies a single-image denoising procedure to the ratio image (residual speckle) obtained dividing the considered speckled image by the super-image computed by temporally averaging the images in the data stack. The speckle-free SAR image is then obtained by multiplying the denoised ratio image with the original super-image. This framework tends to fail in low stationarity scenarios, and it cannot perform denoising in a single-image environment.
Our work aims at the single SAR image despeckling problem using a Deep Learning (DL) approach. The U-Net convolutional neural network [27], initially proposed in 2015 for biomedical image segmentation, has been modified and adapted here for the SAR image despeckling task. Unlike previous DL approaches (see Section 2), which do not explicitly impose to maintain the image structure during the despeckling, the proposed network allows by design to preserve edges, and permanent scatter points while producing smooth solutions in homogeneous regions. This is a desirable property in the specific problem, which requires high-quality filtered images with no additional artefacts. Experimental results demonstrate that the proposed approach achieves better performances than other methods and gives more reliable results even with respect to a multi-temporal despeckling algorithm.
The rest of the paper is organised as follows. Section 2 provides an overview of the existing Deep Learning methods with a particular focus on the SAR-DRN convolutional neural network, which is used as the baseline method in our experiments. Section 3 gives a theoretical background about the SAR speckle modelling with particular emphasis on the model used to generate synthetic speckle in our training procedure. Section 4 introduces the proposed method by detailing both the used architecture and the adopted learning strategy. In Section 5, we describe the suite of experiments, accurately designed to validate our approach, together with the datasets generation process and the analysis of the obtained results. Finally, we derive our conclusions and outline possible future work in Section 6.

Related Works
The increase of data availability, together with the development of more powerful computing devices, led to surprising advances in machine learning (ML) methods allowing systems to reach very high performance in many complex tasks, e.g., image classification [28][29][30][31][32], object detection [33,34], and semantic segmentation [35][36][37]. In particular, Deep Learning methods have been employed in remote sensing tasks [38][39][40][41] due to the capability of Deep Neural Networks to automatically learn suitable features from images in a data-driven approach, without manually setting the parameters of specific algorithms.
In recent years, the use of Deep Learning has also been investigated for image denoising tasks. In [42] the authors propose to train a feed-forward denoising convolutional neural network (DnCNN) following a residual learning approach [30], paired with batch normalisation (BN) [43]. Residual learning methods, by focusing on predicting the residual image, i.e., the noise, instead of directly producing a noise-free image, often allow achieving better performance. Applying a residual learning approach, indeed, helps to improve training, since neural networks tend to learn better when asked to give an output which is significantly different from the input [42].
Chierchia et al. [44] extend the concept of residual learning to SAR images by proposing a convolutional neural network (SAR-CNN) trained in a homomorphic setting, i.e., when the speckle is extracted from the log-transformed original image. The restored image is then obtained by mapping back the speckle-free image to the original domain through the exponential function. On the other hand, the image-despeckling convolutional neural network (ID-CNN) proposed in [45] works directly in the original domain of the image, by assuming a multiplicative speckle noise, recovering the filtered image through a component-wise division-residual layer. Besides, they proposed to jointly minimise both the Euclidean loss and the total variation (TV) loss to provide smooth results.
In contrast to previous approaches, where log-transform is used [44] or the training loss function is modified to solve the specific problem [45], in [46] authors work in an end-to-end fashion. By not relying on any a priori knowledge about either the noise or the image, and by leveraging a significant amount of simulated data during training, this approach is capable of automatically learning the hidden non-linear relations without having to rely on hand-crafted parameters. To accomplish this, they build a dilated residual network (SAR-DRN) showing better performance with respect to the state-of-the-art methods in case of high-level speckle noise. One of the key ingredients of SAR-DRN is the use of dilated convolutions [47], which allows enlarging the receptive field [48] by maintaining the filter size and the depth of the architecture. Skip connections are finally used, as shown in Figure 2, to facilitate the learning process.

Speckle Model for Data Generation
Our approach follows the Supervised Learning (SL) paradigm, that is the task of learning a mapping between pairs of inputs and the corresponding targets (ground truth). This is done in a data-driven fashion by feeding the algorithm with examples coming from a well-designed training set and minimising the error between the network predictions and the expected outputs. In the considered problem, the task is to learn a mapping between the speckled images and the speckle-free ones. The quality of the learned model is strongly related to the quality and variability of examples used during training. A large number of speckled images, coupled with their speckle-free counterparts, are indeed needed to train DL models properly. Fortunately, thanks to the numerous EO missions (e.g., Sentinel-1, COSMO-SkyMed, RADARSAT, TERRASAR-X), petabytes of speckled SAR images are nowadays available. However, it is not physically possible to have their corresponding speckle-free images. Different approaches have been proposed to overcome this issue. Chierchia et al. [44] obtain speckle-free targets by averaging multi-temporal images and considering during training only those regions that do not significantly change over time. Even if the obtained results are qualitatively promising, this approach strictly relies on the choice of patches, and it is limited by the approximation made by considering the average image as the speckle-free reference. Other works [45,46] synthetically generate the speckle images by using prior knowledge about the statistical model of the noise. Our approach builds on the same concept by using prior statistics about the speckle to generate synthetic images in the original SAR high-resolution space. In the following subsections, we briefly describe the speckle model we adopted and how it is applied for building the dataset we used during training.

Speckle Model
The most commonly adopted model for describing SAR speckle for distributed scatterers is the multiplicative noise model [49] where Y ∈ R W×H is the observed intensity image, X ∈ R W×H is the noise-free radar reflectivity image and N ∈ R W×H is the speckle noise random variable. Assuming a L-looks SAR image, N follows a Gamma distribution with unit mean and variance 1/L with the following probability density function [50]: where Γ(·) denotes the Gamma function. The same multiplicative speckle noise model holds for an amplitude image (the square root of the intensity) where x is the speckle-free amplitude value. Now the speckle noise random variable n obeys to the Nakagami distribution with the following probability density function which results in the Rayleigh distribution in the case of single look (L = 1) amplitude image. Figure 3 shows the processing steps employed to simulate speckled images. The noise-free image x is obtained by transforming into grayscale the original image kept from a collection of RGB images. The speckle noise n is generated exploiting the Nakagami-Rayleigh distribution. Finally, the speckled synthetic image is given by applying the multiplicative noise model described by Equation (3). In our setting, this process is performed online. This means that new synthetic images are generated at each iteration during the training phase instead of building the dataset at once at the beginning. The main advantage of using such an approach is mostly related to the over-fitting problem in ML tasks, i.e., the model is more likely to memorise the training samples and loosely generalise on unseen samples. By generating a different noise image at each training step, the model is encouraged to learn the real mapping among input and target pairs since it is more difficult for it to over-fit on already processed samples. As described in Section 5, we also adopt data augmentation in our experiments to further improve generalisation performance.

RGB Dataset
Sampling from Rayleigh distribution

Proposed Method
Inspired by the principles of denoising autoencoders [51,52], we addressed the despeckling problem by learning a mapping between speckled images and their corresponding speckle-free reconstructions. In this framework, an encoder network is used to map a corrupted input to a low-dimensional hidden representation at a specific level of abstraction. This abstract representation is then used to obtain a restored input through a decoding process. By imposing the reconstruction to be equal to its clean reference, i.e., by optimizing the network parameters using a suitable similarity function, the restored input is forced to lie on the manifold of noise-free data. From a geometrical interpretation, a denoising autoencoder learns how to attract the corrupted input to this manifold.
The same concepts can be applied in the SAR image despeckling task, by employing a suitable architecture to extract latent representations from speckled images and by obtaining their corresponding reconstructions. In particular, we developed a modified version of U-Net, which is an encoder-decoder convolutional neural network initially designed for biomedical image segmentation tasks [27]. The U-Net's encoder network allows compressing the information by extracting relevant features from the input image at different scales, thus providing representations at different levels of abstraction. The decoder network performs the corresponding reconstruction by mapping back the latent representation to the input spatial resolution. This process is done by stacking several upsampling layers, each of them restoring the information at different resolution. During the encoding process, part of the information could be lost, especially when using a deep network. Then, the decoder could be not able to recover details of the given image starting from its abstract representation. To this purpose, U-Net provides a set of skip connections which allow preserving relevant information during the decoding stage, thus enhancing the accuracy of the restored image. We provide more insights into the role of skip connections in Section 5.5.1. Figure 4 shows the architecture, derived from the original U-Net. Its symmetric encoder-decoder structure ("U" shape) allows capturing details at different scales as the network depth increases. The additional system of skip connections going from the contracting path, on the left, to the expansive one, on the right, lets the local low-level information to be transmitted to decoding layers and, at the same time, it helps learning by solving the known vanishing gradient problem [53]. This is done by concatenating the output of each encoding layer to the input of the corresponding decoding layer. Thus, for the l-th decoding layer, we have

Architecture
where O l E and O l−1 D are the feature maps extracted by the encoder (E) and by the decoder (D) at the l-th and (l − 1)-th layers respectively. U psample is the upsampling operation computed on O l−1 D to recover the l-th spatial resolution and ⊕ represents the concatenation operator. We modified the original U-Net architecture by using a four-layer structure for the contractive part, which is one level less than the depth of the original network. In addition, we changed the first convolutional layer to deal with one-channel input images. Each layer of the encoding path, except for the input layer, takes as input the output feature maps of the previous layer and, after compressing them through a downsampling operation, it produces as output a tensor of feature maps with double the channels. Thus, at each level where O l−1 E of dimension H l−1 × W l−1 × C l−1 is the output of the previous encoding layer and O l E are the feature maps extracted at the current level with dimension H 2 l−1 × W 2 l−1 × 2 · C l−1 . EncBlock is the encoding block composed by a max-pooling layer followed by two stacked convolutional layers. Since the number of features extracted by the first encoding layer is set to 64, we obtain 512 features as output of the contractive path. In Section 5.5.1 we performed an ablation study on different network configurations, i.e., different depth levels and number of feature maps for each layer.
The decoder network mirrors the encoder. The output of the previous decoding layer O l−1 D is first upsampled by applying a 2 × 2 Transposed Convolution (TConv) layer [54] which produces a set of feature maps of the same dimension of the output of the l-th encoding layer. Then, after applying the concatenation layer (Equation (5)), the final output is obtained by applying two stacked convolutional layers. Each layer uses 3 × 3 filters and stride equal to one except for the final convolutional layer which use a 1 × 1 filter to produce a prediction in the original input space. The non-linear activation employed after each convolution, except for the last one in which a linear activation is used instead, is the rectified linear unit (ReLU) which is defined as

Learning
The procedure adopted to learn the parameters of the proposed network belongs to the residual learning paradigm, which proved to be effective in previous related works. Thus, the architecture is trained to extract the noise from the noisy input image instead of directly learning the mapping between input and corresponding clean image. Contrary to homomorphic approaches, in which the multiplicative speckle model is first converted into an additive one through the logarithmic transformation of the input, here we adopt the same residual strategy as SAR-DRN. The residual noise is expressed as the difference between noisy and clean images. Therefore where n is the residual image, x is the clean target, and y is the speckled input. Accordingly, the network is trained to find the function φ(y, θ) =n (9) where θ are the model parameters, that are trained to approximate as well as possible the real mapping between input y and the target residual noise n. This is done in a supervised fashion by minimising the error between the predicted residual image and the target one. In particular, we used the mean-squared error (MSE) defined as As shown in Figure 5, at inference time the despeckled imagex is obtained from the original noisy image y by subtracting the residual noisen predicted by the network.

Experimental Results
In order to evaluate the performance of our model, we designed a suitable set of experiments which covers synthetic and real image despeckling domains. In particular, we built two types of datasets, one for each domain, to asses the quality of our predictions, both when dealing with synthetic images and real SAR ones. The synthetic dataset allows to numerically compare the quality of our predictions with state-of-the-art approaches, while the latter is intended to demonstrate how the proposed architecture can be successfully employed in a real despeckling scenario. In the following, we give an overview of the datasets we generated to run the experiments, and we present our main results together with the comparison with state-of-the-art methods.

Synthetic Dataset
The synthetic dataset has been built using aerial images from the UC Merced dataset [55] created for land use classification. The original dataset is composed of RGB images taken from the USGS National Map Urban Area Imagery collection and subdivided into 21 classes depending on the type of the land. From this dataset, we selected 1409 images from which we extracted 238,121 patches of 64 × 64 pixels as the training set. Furthermore, we selected 294 images of 256 × 256 pixels from three classes (Airplane, Buildings, and River) that are not used for training to perform the validation. Besides, we built a test set for the very final validation using four images from the PatternNet dataset [56], which are taken from unseen classes (Transformer Station, Basketball Court, Tree Farm, and Bridge). The test set is used to compare our performance with state-of-the-art methods.
As anticipated in Section 3.2, we use the training set to generate single look synthetic speckled images using the online procedure depicted in Figure 3. Furthermore, at each training step the samples are augmented before to be fed to the network by randomly combining horizontal and vertical flipping, 90 • and 270 • rotation, and changes in image contrast. As shown in the ablation study, the performance increases as the data augmentation becomes more severe.

Real SAR Dataset
We built an additional dataset to fine-tune the learned model on the specific real SAR domain, which is characterised by high-resolution images. The challenge to be faced was to create the corresponding pairs of speckled and target images for training that had to be as similar as possible to real images. Since it is not physically possible to obtain a despeckled realisation of SAR images, we adopted an alternative strategy to obtain a speckle free reference to train the model. Specifically, we considered the point-wise temporal average over a stack of SAR images as the target image.
The noise is then generated synthetically as in the case of aerial data. Even if the data is simulated, the obtained dataset provides samples that are close to the real ones, allowing the model to handle high-resolution images better.
We considered two large temporal stacks provided by the Sentinel-1 mission that have been averaged to generate two despeckled images with a resolution of 20 × 5 m (VV polarisation). The dimensions of the two images are 31,576 × 7251 and 40,439 × 15,340 pixels, respectively. We extracted 167,713 patches of 64 × 64 pixels as training set and 16,856 patches as the validation set. The latter has been incremented with additional unseen images computed with the simulation process for a total of 26,901 validation patches. Finally, we built a suitable test set to evaluate the proposed approach on real single look SAR images. To this purpose, we collected other Sentinel-1 images and also images coming from different constellations, in the specific COSMO-SkyMed (CSK) and RADARSAT (HH polarisation), to evaluate the generalisation capability of the network on different resolutions.

Training
Our best result is achieved by using the four-level U-Net configuration with 64 features extracted at the first encoding layer. We employed a minibatch learning procedure by creating batches of 128 samples. The update of the network parameters was performed using the Adam [57] gradient-based optimisation algorithm with β 1 = 0.9, β 2 = 0.999, and = 10 −8 , where β 1 and β 2 are the exponential decay rates for the first and second moment estimates respectively while prevents any division by zero in the implementation. Thus, at each iteration t, the network parameters are updated using the following formulation where η is the learning rate andm t+1 andv t+1 are the bias-corrected first and second raw moment estimates defined asm where g t+1 are the gradients of the objective function at time t with respect to the network parameters θ. In particular, denoting the MSE function introduced in Equation (10) as the data fitting loss L D , the gradients are computed as The network has been trained on the synthetic dataset for 50 epochs, for a total of 93,050 training steps, using η = 0.001 as starting learning rate. We further adopted a learning rate decay schedule by decreasing it of 0.5 at intervals of 10 epochs.
For what concern the experiments on real SAR images, we used the parameters learned on the aerial dataset as initialisation, instead of training a new model from scratch. In this way, we took advantage of the already acquired capability of the model to extract relevant features for noise removal, and we fine-tuned the network parameters to deal with images in the real domain. This process is also known as domain adaptation, and it requires only a few additional training steps. Also, in this case, we employed a minibatch learning using batches of 128 samples each. The network has been fine-tuned for 15 epochs using a learning rate of 5 · 10 −6 and the Adam optimiser with the same hyper-parameters of the synthetic experiments.
In the real domain, the speckle noise has some properties that are not fully captured by the synthetically generated images. For this reason, despite the good capability of the network in filtering the speckle, the generated images still contain some blurry artefacts. Thus, we introduce a Total Variation (TV) regularisation loss during the training with the goal of removing the artefacts and producing smoother results while preserving the structure and the details of the images. The TV loss is defined as where ∇ hx and ∇ vx are the gradients of the reconstructed image on the horizontal and vertical directions, respectively, and are defined as while ∇ h x and ∇ v x are the gradients computed on the speckle-free reference image on the same directions. The role of the former is to minimise the difference among neighbouring pixels while the latter allows avoiding over-smoothed results in correspondence of the edges. The total cost function becomes L = L D + λ TV L TV (18) where L D is the data fitting term defined in Equation (10) while λ TV is the hyper-parameter governing the importance of the regularisation. We initially performed a random search to find the range of values of λ TV for which TV gives a relevant contribution, finding 1 · 10 −4 ≤ λ TV ≤ 5 · 10 −4 . Finally, we found λ TV = 2 · 10 −4 to give the best regularisation incidence in our experiments. Also in this case we compute the gradients g t+1 of the objective function at time t as where ∇ θ L D (θ t ) are computed as in Equation (14), while the gradients of the total variation are given by

Metrics for Evaluation
In the aerial dataset, the target corresponding to each noisy image is available since the latter are generated synthetically. Thus, it is possible to evaluate the quality of the provided approach by computing two kinds of metrics that are typically used in image denoising tasks. They are the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) index. The PSNR expresses the ratio between the maximum signal power and the power of the noise affecting its quality. It measures how closely the reconstructed image represents the original one and is given by where MAX I is the maximum signal power, i.e., 255 for grayscale images and the MSE is computed between the noise-free target image and its reconstruction. The SSIM measures instead of the similarity between two images from the point of view of their structural information, i.e., the structure of objects in the scene, regardless of the changes in contrast and luminance. Given two images x and y, it is defined as where µ i is the mean of i, σ i is the variance of i, and c i is a constant introduced to avoid instabilities. Since a speckle-free reference is not available in the real domain, it is not possible to compute the metrics introduced above. A different approach is thus required. One option remains the visual inspection of the reconstructed images, for which we provide several results on different scenarios. Another common approach, that we adopted in this work, is to evaluate the degree of smoothing in a homogeneous region by computing the equivalent number of looks (ENL) defined as where µ I and σ I are the mean and standard deviation of the considered image portion I respectively.
The main difficulty in the computation of this metric is the ability to find a considerable homogeneous region inside the image under evaluation.

Results on Synthetic Images
We compare the performance of our approach with the one of two methods: SAR-BM3D, belonging to the class of algorithms of non-learned filters, and SAR-DRN, which is, to the best of our knowledge, the state-of-the-art method among DL approaches. For the former we used the publicly available MATLAB code http://www.grip.unina.it/research/80-sar-despeckling/80-sar-bm3d.html, setting the parameters as suggested in the original paper. For what concern SAR-DRN, instead, no public code is available. We thus implemented and trained SAR-DRN from scratch on our dataset, following the specifics given by the authors in the reference paper. Figure 6 shows the qualitative results of our approach on a selection of the test set, together with the ones of selected methods. Notice that, even if SAR-BM3D is quite effective in removing the speckle noise, the filtered images loosely preserve the object details and smoothness in homogeneous regions. SAR-DRN shows better performance w.r.t. SAR-BM3D, filtering out the speckle with higher accuracy. However, the produced images are still contaminated by some blurry artefacts. The proposed network, thanks to the combined action of its encoder-decoder structure and the system of skip-connections, improves the quality of the despeckled images by providing sharper results in correspondence of edges and preserving homogeneity among spatially coherent pixels (e.g., Bridge).
The qualitative improvements are reflected by the numerical results. As shown in Table 1, our approach shows a gain in the PSNR metric of about 1.21 dB, 0.80 dB, 1.68 dB, and 1.87 dB w.r.t. SAR-BM3D algorithm for Bridge, Basketball Court, Christmas Tree Farm, and Transformer Station respectively and outperforms SAR-DRN of about 0.63 dB, 0.20 dB, 0.62 dB, and 0.28 dB in the same images. These improvements testify a better image restoration capability compared with other methods. The proposed approach outperforms the baseline methods also in terms of SSIM index, as highlighted in Table 2, providing more accurate preservation of the image structural information.

Ablation Study
Network complexity and data augmentation. Figure 7 shows the performance of the different networks and data augmentation (DA) configurations on the validation set, evaluated at the end of each training epoch. For the sake of simplicity, only the number of features produced by the first encoding layer indicated as C is reported. We set different experiments to understand how network complexity affects the results together with the importance of data augmentation procedures. Compared with the original U-Net architecture, employing a five-levels depth model, the modified version, with only four encoding layers, shows better performance. As can be observed in the figure, results show how decreasing the number of parameters in the network does not help the learning, so we set the number of starting features as C = 64. The experiments also reveal that the use of strong data augmentation increases performance. The best performance is obtained by doing random image flipping (flip), rotation (rot), and changes in image contrast. Finally, the curves show the better performance of our approach compared to SAR-DRN trained on the same augmented data.
Skip connections. We further studied the importance of skip connections in the despeckling process. We claimed that skip connections allow obtaining far deeper details in the reconstruction since they facilitate the recovering of information that could be partially lost during the encoding stage. Tables 3 and 4 show the numerical results obtained during the ablation study. First, we reproduced our best experiment without using skip connections. As expected, it is possible to observe a drastic drop in performance, both in the PSNR and SSIM metrics. The decoder network is not able to restore the input image by only considering the final latent representation. In addition, in order to verify that the loss of information is less severe when reducing the depth of the network, we conducted an additional experiment in which we considered only three encoding layers and no skip connections. Also, in this case, the results confirm the expectation by showing an improved capability of the network in producing a speckle-free reconstruction from encoded information. However, it is possible to notice how the lack of skip connections mainly affects the preservation of the structural information, as shown by the SSIM index. For completeness, we also report the evaluation of the reduced network with the addition of skip connections. In conclusion, the conducted study shows that the greater the degree of compression of the input image, the more skip connections play a key role in the overall despeckling process, by allowing for a good speckle filtering and, at the same time, for good preservation of image details.    Figure 8 shows the qualitative results on different single look SAR images. It is possible to observe how the SAR-BM3D algorithm is quite efficient in removing the speckle, but the reconstructed images still contain some residual noise as testified by the produced artefacts. The base version of the proposed U-Net architecture, i.e., without TV, shows visually better results by generating sharper images. It is interesting to notice how the proposed approach performs well also on images having a spatial resolution different from the one encountered during the training phase. The performance achieved on the Sentinel-1 image, having a resolution of 20 × 5 m, is also achieved on the COSMO-SkyMed and RADARSAT images, having a resolution of 3 × 3 and 15 × 15 m, respectively. The best results were given by the model learned with the TV regularization whose reconstructions are smoother on homogeneous areas, further removing some of the residual artefacts, while preserving the structure of the images. The quantitative results proposed in Table 5 reflects the above considerations showing how the proposed approach has higher performance in speckle suppression in the considered homogeneous regions. Figure 9 shows a qualitative comparison between our single observation approach and the DespecKS algorithm based on multi-temporal data. As highlighted by the blue circles, the proposed model has a higher capability of preserving the information contained in the image while the reference algorithm tends to lose the details due to the averaging over the temporal dimension. Our approach provides faster and more accurate results by looking only to a single image instead of processing the entire data stack.   Speckled Ours (U-Net + TV) DespecKS Figure 9. Qualitative comparison between the proposed method and the DespecKS multi-temporal algorithm. The regions within the blue circles highlight the main differences between the two approaches.

Conclusions
We presented an adaptation of the U-Net convolutional neural network, originally conceived for semantic segmentation, for the problem of speckle removal in single look SAR images. Its encoder-decoder architecture allowed us to address the problem following the principles of denoising autoencoders. We built an online procedure for synthetic speckle generation, coupled with a well-designed data augmentation pipeline, and we extensively ran experiments to validate the performance of the proposed approach. We built two datasets to first pre-train the network on aerial images and then to fine-tune the model on the real SAR domain. Through the experiments, we showed how our model outperforms the state-of-the-art methods on synthetically generated images, as testified by the metrics that we were able to compute thanks to the availability of the speckle-free targets. We compared our results with SAR-BM3D, a well-known algorithm for SAR speckle filtering, and with SAR-DRN, a deep learning approach which demonstrated competitive results in this field. Compared to these methods, the proposed approach produced sharper reconstructions and showed to be less prone to generate artefacts and blur effects. We demonstrated how the learned model is effective in filtering out the speckle also from real samples with few fine-tuning steps on a specifically designed dataset. We further boosted the performance by introducing a modified version of the total variation regularisation, which allows generating smooth results in homogeneous regions while maintaining the information about the structure in the image. Additionally, the presented results show the generalisation capability of the network, which performs well on spatial resolutions not seen during training. Finally, we compared our approach with an algorithm that uses the temporal information of a whole data stack, showing how the proposed model generates more reliable results.
As future work, we plan to use a generative model to improve the quality of the simulated noise, with the aim of generating training samples more similar to the real data distribution. In addition, we want to investigate further the generalisation capability of the model by building a complete training set containing samples of different spatial resolutions.