Satellite-Based Rainfall Retrieval: From Generalized Linear Models to Artificial Neural Networks

In this study, we develop and compare satellite rainfall retrievals based on generalized linear models and artificial neural networks. Both approaches are used in classification mode in a first step to identify the precipitating areas (precipitation detection) and in regression mode in a second step to estimate the rainfall intensity at the ground (rain rate). The input predictors are geostationary satellite infrared (IR) brightness temperatures and Satellite Application Facility (SAF) nowcasting products, which consist of cloud properties such as cloud top height and cloud type. Additionally, a set of auxiliary location-describing input variables is employed. The output predictand is the ground-based instantaneous rain rate provided by the European-scale radar composite OPERA, which was additionally quality-controlled. We compare our results to a precipitation product that uses a single infrared (IR) channel for the rainfall retrieval. Specifically, we choose the operational PR-OBS-3 hydrology SAF product as a representative example for this type of approach. With generalized linear models, we show that we are able to substantially improve in terms of hits by considering more IR channels and cloud property predictors. Furthermore, we demonstrate the added value of using artificial neural networks to further improve prediction skill by additionally reducing false alarms. In the rain rate estimation, the indirect relationship between surface rain rates and the cloud properties measurable with geostationary satellites limits the skill of all models, which leads to smooth predictions close to the mean rainfall intensity. Probability matching is explored as a tool to recover higher-order statistics and to obtain a more realistic rain rate distribution.

were able to significantly improve over PERSIANN-CCS in the delineation of precipitating areas by employing a deep neural network approach. Additional examples of neural networks being applied for satellite rainfall retrievals include neural network based fusions of LEO MW and GEO IR data [50] and precipitation detection with LEO MW observations [51]. Tapiador et al. [52] provide a general overview on satellite-based rainfall estimations by means of neural networks including a discussion about advantages as well as drawbacks of employing such a statistical-learning approach. In addition, the potential of other machine learning techniques such as random forests has been revealed for precipitation detection based on LEO MW [53] as well as for rainfall retrievals based on GEO IR and visible (VIS) channel input data [54]. Meyer et al. [40] compare several machine learning algorithms (random forests, artificial neural networks, and support vector machines) and show that the choice of machine learning algorithm only marginally affects the skill of the rainfall retrieval.
The main goal of this study is to explicitly investigate the added value of employing multivariate input predictors and machine learning methods for satellite-based rainfall retrievals. For this purpose, we develop and test linear and nonlinear statistical learning methods to estimate the occurrence and instantaneous rainfall intensity provided by the European Operational Program for Exchange of weather RAdar Information (OPERA) network. We use Meteosat Second Generation (MSG) GEO satellite brightness temperatures, cloud information provided by the Nowcasting Satellite Application Facility (NWC-SAF), and a set of auxiliary location-describing variables as predictors. The resulting satellite precipitation retrieval algorithms are compared to a single IR channel input predictor product, namely the operationally available instantaneous rain rate product of the Hydrology Satellite Application Facility (H-SAF). The additional contributions of this study reside in the combination of both brightness temperatures and NWC-SAF products for rainfall retrieval and the calibration of a GEO satellite precipitation retrieval using the European-scale OPERA radar composite.
The paper is organized as follows: Section 2 describes the used datasets and Section 3 the statistical learning methods. In Section 4, the results are presented by means of a representative verification as well as illustrative case studies. In Section 5, the results are discussed and set into a broader context. Finally, the conclusions are drawn in Section 6.

Data
The rainfall retrieval algorithm was derived and tested using satellite and radar data covering Europe during summer 2017 (mid-June to mid-August). In this section, we describe the employed satellite and radar datasets and the operational H-SAF product for instantaneous rainfall retrieval. Additionally, we discuss data quality aspects.

Satellite Data
The MSG Spinning Enhanced Visible and Infrared Imager (SEVIRI) instrument in full disk service performs a scan every 15 min with a spatial resolution of 3 km × 3 km at nadir for IR channels [55]. During the study period, this service was provided by Meteosat-10, which was located at 0°E. We employed its Level 1b derived brightness temperatures and associated Level 2 NWC-SAF products [56,57] as satellite predictors for the rainfall retrieval. Specifically, we considered the following brightness temperature channels: WV 6.2 µm, WV 7.3 µm, IR 8.7 µm, IR 9.7 µm, IR 10.8 µm, IR 12.0 µm, IR 13.4 µm, as well as all the differences between them (corresponding to 21 additional variables), plus the following NWC-SAF products: cloud mask, cloud type, cloud top phase, cloud top temperature, cloud top pressure, and cloud top height. All of these predictors are available during both day- and nighttime and they were all parallax corrected based on the NWC-SAF cloud top height. The satellite data processing was carried out with the software package PyTroll [58].
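The seven listed channels yield the 21 pairwise differences mentioned above (7 choose 2 = 21). This can be sketched as follows; the channel labels and random placeholder fields are ours and merely stand in for real SEVIRI brightness temperature arrays:

```python
import itertools

import numpy as np

# Hypothetical channel labels; the fields are random placeholders
# standing in for SEVIRI brightness temperatures (channel, y, x).
channels = ["WV062", "WV073", "IR087", "IR097", "IR108", "IR120", "IR134"]
bt = np.random.default_rng(0).normal(250.0, 10.0, size=(7, 4, 4))

# All pairwise differences: C(7, 2) = 21 additional predictor fields.
pairs = list(itertools.combinations(range(7), 2))
diff_names = [f"{channels[i]}-{channels[j]}" for i, j in pairs]
diffs = np.stack([bt[i] - bt[j] for i, j in pairs])
```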
In this study, rainfall predictions were made at the Europe-scale inside the NWC-SAF cloud mask. The map projection employed throughout the study shows the full prediction region as seen from the MSG SEVIRI in full disk service. It consists of 548 × 986 satellite pixels. It should be noted that the quality of the satellite products declines with increasing distance from the sub-satellite point due to lower spatial resolution and unfavorable viewing angles.

Auxiliary Variables
A range of auxiliary variables projected on the MSG full disk service grid were further included to characterize the spatial and temporal variability of the statistical relationship linking the satellite measurements to the ground-based radar observations. These additional variables are: the geographical coordinates (latitude/longitude), the local solar time, a land-water mask, and the altitude of topography.

Ground-Based Radar Reference
OPERA provides a European-scale radar composite, which is available every 15 min [4,59]. We employed OPERA surface rain rates re-projected on the MSG full disk service grid as a ground-based radar reference for the satellite rainfall retrieval. The OPERA composite offers a valuable continuous spatial and temporal coverage over land and it even partly extends over the ocean.
The quality of the OPERA radar composite increased substantially in 2017, e.g., it became considerably more successful in reproducing plausible precipitation frequencies across the Alps compared to summer 2016. In this study, precipitation frequency refers to the fraction of time slots during which the radar rain rates are ≥0.3 mm h −1 while the radar is in operation. The improved data quality in summer 2017 can partly be attributed to the inclusion of additional radars. The continuous improvement of the quality and coverage of this radar composite is very valuable but has the drawback of introducing temporal inhomogeneity in the data archive. To limit the effect of inhomogeneity while testing the potential of our statistical learning methods for satellite rainfall retrieval, we only used data from summer 2017.
Despite the improved data quality, some remaining issues needed to be addressed before the analyses. Radar-based QPE becomes more uncertain with increasing distance from the radar location due to beam broadening with distance and visibility effects, e.g., earth curvature and shielding by orographic features [60,61]. This often leads to an underestimation of precipitation frequency and intensity far from the radar. Additionally, residual ground and sea clutter, electromagnetic interferences, and wind farms may lead to falsely detected precipitation [62][63][64].
To eliminate uncertain radar measurements, we computed a map describing the frequency of precipitation ≥0.3 mm h −1 for summer 2017 (upper row in Figure 1). A radar composite mask was then derived by:
1. Removing all pixels that deviate too strongly from their neighborhood by:
   • Applying a 15 × 15 pixel running mean to the raw frequency map of Figure 1.
   • Subtracting this mean value from the actual frequency.
   • Subsequently removing all pixels with a deviation of ≥2.0% from the mean surrounding frequency, based on visual inspection. This step was effective in removing the locations affected by e.g., wind farms and radio interferences.
2. Removing regions with implausibly low precipitation frequencies using a lower absolute frequency cut-off of 0.4%. Thereby, the locations were excluded where e.g., radar beams are blocked by topographic obstacles, the clutter rejection is too strong, or precipitation frequency is heavily underestimated at a large distance from the radar.
3. Eliminating the sea clutter affected region south of France that is connected to the Mistral winds.

Figure 1. Maps depicting the fraction of time in summer 2017 during which the radar product is available and the rain rate is ≥0.3 mm h −1 for the raw OPERA composite (top) and the quality-controlled OPERA composite with unreliable regions masked out and thus shown in white (bottom). A zoomed-in version of the composite is provided on the right.
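The first two masking steps can be sketched as follows; `uniform_filter` stands in for the 15 × 15 running mean, the thresholds are the ones stated above, the hand-drawn Mistral region of step 3 is omitted, and the data are synthetic:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def radar_quality_mask(freq, window=15, rel_thresh=0.02, abs_thresh=0.004):
    """Sketch of the mask derivation: freq is the per-pixel fraction of
    time with rain rate >= 0.3 mm/h (values in [0, 1])."""
    neighborhood_mean = uniform_filter(freq, size=window)
    deviation = freq - neighborhood_mean
    # Keep pixels close to their neighborhood and above the absolute cut-off.
    reliable = (np.abs(deviation) < rel_thresh) & (freq >= abs_thresh)
    return reliable

# Synthetic frequency map with two typical artifacts.
freq = np.full((50, 50), 0.05)
freq[25, 25] = 0.20      # isolated spike, e.g., a wind farm
freq[0:5, 0:5] = 0.001   # implausibly dry, e.g., blocked beams
mask = radar_quality_mask(freq)
```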

As shown, we applied a rigorous data quality control to avoid affecting the training of the statistical learning algorithms. As a consequence, we also accepted discarding a certain fraction of trustworthy precipitation measurements, e.g., in dry regions like the Iberian Peninsula.
The lower row in Figure 1 shows the coverage of the OPERA composite after applying the radar mask. The majority of locations with low quality radar measurements were successfully removed. While the original OPERA composite covers 62.48% of the MSG pixels where predictions are carried out, the derived radar composite mask labels 54.95% of the MSG pixels of the prediction domain as reliable OPERA pixels.
An additional simple thresholding was applied on each individual radar image eliminating all rain rates <0.3 mm h −1 and >130 mm h −1 . It needs to be noted that, even after this procedure, individual radar images sometimes still contain residual clutter.

H-SAF Precipitation Product
The "EUMETSAT SAF on Support to Operational Hydrology and Water Management" or in short H-SAF provides satellite-based products for operational hydrological purposes [8,65]. In this study, we employed its PR-OBS-3 product, which is the instantaneous rain rate at the ground derived from MSG full disk service measurements "calibrated" with LEO MW measurements [8]. This product has the same spatial and temporal resolution as the MSG full disk service variables and we also parallax corrected it with PyTroll [58] based on the NWC-SAF cloud top height.
H-SAF derives the geolocated relationship between IR 10.8 µm brightness temperatures and MW rain rates from coincident IR and MW data at each grid box by a probabilistic histogram matching [8]. This relationship is updated whenever new data become available. Thus, a satisfactory correlation between the IR 10.8 µm channel and the rain rate at the ground is assumed. In the accompanying product user manual [66], it is stated that this is regarded acceptable for convective rain events, but less so for stratiform ones. Hence, the product shows a bias towards convective rain rates. Nevertheless, the authors of the product user manual claim that, at the moment, H-SAF is the only operationally available precipitation retrieval algorithm with a sufficient temporal resolution for nowcasting applications.

General Approach
The input predictors for the statistical learning algorithms for the rainfall retrieval consist of the SEVIRI IR channels, their differences, the NWC-SAF products, and the auxiliary variables described in the previous section, while the output predictand is the OPERA composite's instantaneous rain rate. In total, there are 51 predictors, consisting of 36 continuous variables and 15 dummy variables representing the categorical classes. Due to the collinearity of several input predictors, it was not possible to analyze their respective relevance [67]. Instead, we solely focused on the predictive power of the derived models, which should not be compromised unless the collinearity between the variables is subject to change.
The rainfall retrieval was performed in two steps:
1. Precipitation detection as a classification problem: all partly and fully cloudy pixels (according to the NWC-SAF cloud mask) were classified into either precipitating or non-precipitating pixels.
2. Rainfall intensity retrieval as a regression problem: the instantaneous surface rain rate was derived with a multivariate regression analysis on the previously detected precipitating pixels.
We investigated two different modelling setups. The first consisted of applying generalized linear models (GLM) with the classification being carried out by a logistic regression (LOGREG) and the regression by a linear regression (LINREG). The second considered artificial neural networks (ANN); specifically, the multilayer perceptron (MLP) was used both in classification and regression mode. The GLM and ANN models have been chosen because they are computationally fast, conceptually simple, and widely used in the literature. If there are substantial nonlinear data dependencies, the ANNs should be able to capture them and improve the prediction skill. If not, they are expected to provide similar skill as the GLM models. The precipitation detection and rain rate retrieval were performed using the same set of 51 predictors. The python scikit-learn package was employed for both the GLMs and the machine learning algorithms [68].
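The two-step GLM chain can be sketched with scikit-learn as follows; the data here are synthetic stand-ins for the 51 predictors and the OPERA rain rates, and all variable names are ours:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for the 51 predictors and OPERA rain rates.
X = rng.normal(size=(2000, 51))
raining = (X[:, 0] + 0.5 * rng.normal(size=2000)) > 0
rate = np.where(raining, np.exp(0.5 * X[:, 1]) + 0.3, 0.0)

# Step 1: precipitation detection (classification, LOGREG).
clf = LogisticRegression(solver="sag", max_iter=1000).fit(X, raining)

# Step 2: rain-rate regression (LINREG), trained on precipitating pixels only.
reg = LinearRegression().fit(X[raining], rate[raining])

# Prediction: rain rate only where the classifier detects precipitation.
detected = clf.predict(X)
rr = np.zeros(len(X))
rr[detected] = reg.predict(X[detected])
```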
All models were compared to the operational instantaneous rain rate product of H-SAF to investigate the added value of considering additional satellite predictors by means of GLMs and using a nonlinear modeling approach with ANNs.

Data Splitting
Statistical learning requires splitting the dataset into a training, validation, and test set. The training set was used to derive the model parameters, such as the regression weights or the weights connecting the hidden neurons of the ANN. With the validation set, we selected the best hyper-parameters for the model, such as the number of hidden neurons and the regularization parameter. The test set was employed to estimate the generalization skill of the model on a new set of unseen data.
The time slots for training, validation, and testing were selected randomly, with the additional constraints that each scene contains a minimum of 4000 precipitating pixels and that the selected time slots lie at least 45 min apart, i.e., the typical lifetime of a single-cell thunderstorm. Hence, too highly correlated scenes were avoided. The training, validation, and test sets consist of 800, 400, and 400 time slots during summer 2017, respectively.
To obtain balanced classes for the precipitation detection, we randomly selected 1000 precipitating and 1000 non-precipitating pixels inside the cloud mask for each time slot. For the rain rate retrieval, we substituted the non-precipitating pixels with additional precipitating ones, which resulted in 2000 randomly chosen precipitating pixels per time slot. For the validation and the test set, unbalanced datasets containing the actually observed distribution between precipitating and non-precipitating pixels were additionally generated for the precipitation detection using the same time slots. For this purpose, 10,000 samples were randomly drawn inside the cloud mask at each time slot. The resulting unbalanced validation and test sets contain 7.4% and 7.6% precipitating pixels, respectively. In summary, the training, validation, and test sets contain 1.6 million, 800,000, and 800,000 samples, respectively. For the precipitation detection on unbalanced sets, the validation and test sets contain 4.0 million samples each.
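The balanced sampling per time slot can be sketched as below; the 45 min separation of the time slots is handled upstream and omitted here, and the data are synthetic:

```python
import numpy as np

def balanced_sample(is_precip, n_per_class=1000, rng=None):
    """Draw n_per_class precipitating and n_per_class non-precipitating
    pixel indices from one time slot (is_precip: 1-D boolean array)."""
    if rng is None:
        rng = np.random.default_rng()
    wet = np.flatnonzero(is_precip)
    dry = np.flatnonzero(~is_precip)
    return np.concatenate([rng.choice(wet, n_per_class, replace=False),
                           rng.choice(dry, n_per_class, replace=False)])

# One synthetic time slot with roughly 7.5% precipitating pixels.
rng = np.random.default_rng(1)
is_precip = rng.random(100_000) < 0.075
idx = balanced_sample(is_precip, rng=rng)
```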
All categorical variables in the predictors described in Sections 2.1 and 2.2, such as e.g., cloud type, were converted into dummy variables, while the local solar time was expressed as a circular variable (sine and cosine) to eliminate the discontinuity at midnight. All 51 predictors were furthermore scaled to a mean of zero and a standard deviation of one based on the training set.
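The described preprocessing, dummy encoding of categorical predictors, circular encoding of local solar time, and training-set-based standardization, can be sketched as:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Local solar time (hours) as a circular variable: the sine/cosine
# encoding removes the artificial discontinuity at midnight.
lst = np.array([0.0, 6.0, 12.0, 23.9])
lst_sin = np.sin(2 * np.pi * lst / 24.0)
lst_cos = np.cos(2 * np.pi * lst / 24.0)

# A categorical predictor (e.g., cloud type codes) expanded into
# dummy variables, one column per class present in the data.
cloud_type = np.array([1, 3, 3, 7])
classes = np.unique(cloud_type)
dummies = (cloud_type[:, None] == classes[None, :]).astype(float)

# Continuous predictors scaled to zero mean and unit variance; the
# scaler is fitted on the training set only and reused elsewhere.
X_train = np.random.default_rng(0).normal(250.0, 10.0, size=(100, 3))
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
```

Note that 23.9 h maps almost exactly onto 0.0 h in the sine/cosine representation, which is the desired continuity at midnight.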

Verification Scores
For model validation and testing, a set of scores was considered. We refer to Appendix A for the meaning behind each score and for the respective equations.
The categorical scores for the classification problem were obtained by comparing the satellite-derived precipitating regions with the ground-based radar reference dataset. They all distinguish between events and non-events, which in our case refers to precipitating and non-precipitating pixels, respectively. We used Probability of Detection (POD), False Alarm Ratio (FAR), Probability of False Detection (POFD), Accuracy (ACC), and the Critical Success Index (CSI). Additionally, we calculated skill scores that measure the quality of a prediction system relative to a reference prediction, whereby a negative value indicates worse performance than the reference. In our case, the reference prediction is random chance that is represented by the "climatological" frequency of precipitation by pooling all the data samples in space and time. We computed the Gilbert Skill Score (GSS), the Heidke Skill Score (HSS), and the Hanssen-Kuipers discriminant (HK).
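For reference, the contingency-table scores can be computed as below; the formulas follow the standard verification definitions (cf. Appendix A), with `h_random` the number of hits expected by random chance:

```python
def categorical_scores(hits, misses, false_alarms, correct_rejects):
    """Standard contingency-table scores for a binary event."""
    h, m, f, c = hits, misses, false_alarms, correct_rejects
    n = h + m + f + c
    pod = h / (h + m)                  # Probability of Detection
    far = f / (h + f)                  # False Alarm Ratio
    pofd = f / (f + c)                 # Probability of False Detection
    acc = (h + c) / n                  # Accuracy
    csi = h / (h + m + f)              # Critical Success Index
    h_random = (h + m) * (h + f) / n   # hits expected by chance
    gss = (h - h_random) / (h + m + f - h_random)
    hss = 2 * (h * c - m * f) / ((h + m) * (m + c) + (h + f) * (f + c))
    hk = pod - pofd                    # Hanssen-Kuipers discriminant
    return dict(POD=pod, FAR=far, POFD=pofd, ACC=acc,
                CSI=csi, GSS=gss, HSS=hss, HK=hk)

scores = categorical_scores(hits=60, misses=20, false_alarms=30,
                            correct_rejects=890)
```

With this illustrative contingency table, the HSS exceeds the GSS because the numerous correct rejects enter the HSS, consistent with the behavior discussed in the Results.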
For the regression, the rain rate retrieved by satellite predictions was compared to the OPERA composite surface rain rate and the following continuous scores were considered: the Mean Error (ME), the Root Mean Squared Error (RMSE), the Pearson Correlation Coefficient (PCORR), the Spearman Rank Correlation (SCORR), and the Reduction of Variance skill score (RV). The RV skill score also measures skill compared to a "climatological" reference prediction, which is represented here by the observed sample variance.
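A minimal sketch of the continuous scores, assuming the usual definition RV = 1 − MSE/Var(obs) with the observed sample variance as the climatological reference:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def continuous_scores(obs, pred):
    """ME, RMSE, correlations, and the Reduction of Variance skill
    score (RV = 1 is perfect, RV <= 0 means no skill over climatology)."""
    err = pred - obs
    me = err.mean()
    rmse = np.sqrt((err ** 2).mean())
    pcorr = pearsonr(obs, pred)[0]
    scorr = spearmanr(obs, pred)[0]
    rv = 1.0 - (err ** 2).mean() / obs.var()
    return dict(ME=me, RMSE=rmse, PCORR=pcorr, SCORR=scorr, RV=rv)

# A smooth prediction shrunk halfway toward the observed mean, mimicking
# the behavior of squared-error-trained models discussed in the paper.
rng = np.random.default_rng(3)
obs = rng.gamma(2.0, 1.0, size=500)
pred = 0.5 * obs + 0.5 * obs.mean()
s = continuous_scores(obs, pred)
```

The shrunken prediction is unbiased and perfectly correlated, yet its RV is only 0.75, illustrating how smooth predictions near the mean are rewarded by squared-error scores.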

Model Training and Hyper-Parameter Selection
In the GLM modelling chain, the default parameters of the scikit-learn package were taken for both the LOGREG and the LINREG with the exception of the solver used for the LOGREG, which was changed to the "Stochastic Average Gradient Descent" solver that is suitable for large datasets.
The ANN is a classical feedforward back-propagation MLP, which was trained using the gradient-based stochastic optimization algorithm "Adam" available in scikit-learn (see also [69]). The MLP was trained until convergence of the training set error. Overfitting and model complexity were controlled by optimizing the number of hidden layers, the number of hidden neurons, and the L2 regularization parameter using the respective validation set. A logistic activation function was used to make a conceptual link between the linear logistic regression model and the nonlinear MLP model. Default values for the other MLP parameters were used, for instance a constant learning rate of 0.001 and a batch size of 200 points.
For the MLP classification, we employed the skill scores described in Appendix A.1: GSS, HSS, and HK. A grid search of parameter options was carried out to find the best combination of hyper-parameters based on the balanced validation set. Since this set contained 50% precipitating and 50% non-precipitating pixels, optimizing any one of these skill scores resulted in the same set of hyper-parameters. The best skill was achieved using an MLP with two hidden layers with 100 units each and a regularization parameter alpha equal to 10 −7 .
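Expressed in scikit-learn, the selected classifier configuration would look as follows; the training data here are synthetic, and `random_state` is our addition for reproducibility:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# MLP as selected by the grid search: two hidden layers of 100 units,
# logistic activation, Adam optimizer, alpha = 1e-7, defaults otherwise.
mlp = MLPClassifier(hidden_layer_sizes=(100, 100),
                    activation="logistic",
                    solver="adam",
                    alpha=1e-7,
                    learning_rate_init=0.001,
                    batch_size=200,
                    max_iter=500,
                    random_state=0)

# Quick fit on a small synthetic classification problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))
y = (X[:, 0] > 0).astype(int)
mlp.fit(X, y)
```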
In order to be able to carry out predictions for the actually observed unbalanced problem, the decision boundary, i.e., the probability threshold that is used to categorize model output into precipitating and non-precipitating pixels, needs to be adjusted with the help of the unbalanced validation set for the GLM and the ANN. We chose to optimize the GSS, which automatically implies maximizing the HSS too (see Appendix A.1). This can be nicely observed in Figure 2. For the LOGREG model, the optimal threshold probability amounted to 0.835, and, for the MLP classifier, it was 0.870. It is furthermore evident that HK would be maximized at 0.500, since it considers both classes separately irrespective of their climatological frequency, which would lead to many false alarms in the unbalanced classification problem.

The hyper-parameters of the rainfall retrieval MLP regressor were optimized by maximizing the RV skill score described in Appendix A.2. However, the skill surface structure was not as clear in this case, with several diverging parameter combinations resulting in similar skill for the validation set. We finally decided to use two hidden layers, 50 hidden units each, and a value of 10 −2 for regularization. It is interesting to note that, while it was not possible to overfit on the classification problem, a certain magnitude of regularization was needed in the regression to avoid overfitting there.
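The threshold adjustment can be sketched as a simple grid search over candidate probability thresholds, maximizing the GSS on an (here synthetic) unbalanced validation set:

```python
import numpy as np

def tune_threshold(p_rain, obs, thresholds=np.linspace(0.01, 0.99, 99)):
    """Pick the probability threshold maximizing the GSS
    (p_rain: predicted rain probabilities, obs: boolean observations)."""
    best_t, best_gss = 0.5, -np.inf
    n = len(obs)
    for t in thresholds:
        pred = p_rain >= t
        h = np.sum(pred & obs)          # hits
        m = np.sum(~pred & obs)         # misses
        f = np.sum(pred & ~obs)         # false alarms
        h_r = (h + m) * (h + f) / n     # hits expected by chance
        denom = h + m + f - h_r
        gss = (h - h_r) / denom if denom > 0 else 0.0
        if gss > best_gss:
            best_t, best_gss = t, gss
    return best_t, best_gss

# Synthetic unbalanced validation set (~7.5% precipitating pixels).
rng = np.random.default_rng(7)
obs = rng.random(20_000) < 0.075
p_rain = np.clip(0.6 * obs + 0.3 * rng.random(20_000), 0, 1)
best_t, best_gss = tune_threshold(p_rain, obs)
```

In this artificial, perfectly separable example the optimal threshold sits between the two probability clusters; on real model output the optimum shifts toward high thresholds (0.835 and 0.870 in the paper) because of the class imbalance.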

Model Evaluation
The different methods were objectively compared by means of the representative test set. Additionally, single-scene case studies were carried out on time slots from the test set to analyze the realism of the predicted rainfall maps and the case-to-case variability of prediction skill. While the predicted rainfall is shown on the full map, the associated precipitation detection scores are computed, as for the representative test set, solely on trusted OPERA composite pixels (see Section 2.3) inside the NWC-SAF cloud mask (see Section 2.1). If each pixel inside the trusted OPERA composite were considered instead, the resulting single-scene skill would be higher, since the pixels outside the cloud mask are by construction set to non-precipitating, which leads to many correct rejects.
To avoid punishing the regression algorithms for shortcomings in the precipitation detection, only pixels precipitating according to the quality-controlled OPERA composite were considered in the rain rate verification test set. On each one of these pixels, rain rates were predicted with the LINREG and the MLP regressor. For H-SAF, only a subgroup of the verification test set pixels was evaluated, namely the ones at which H-SAF detected precipitation, which amounts to 39.7% of the rain rate test set. In the case studies, rain rates were solely evaluated at the locations where both the trusted OPERA pixels and the classification algorithm detected rain. This implies that each algorithm was evaluated on a partly different set of pixels depending on the quality of the initial precipitation detection map.

Probability Matching
The LINREG minimizes the residual sum of squares between the OPERA measurements and the predicted rain rate. Similarly, the MLP regression minimizes the squared error loss function between them. Thus, strong deviations from the OPERA reference are heavily penalized and smooth predictions around the mean are favored. This leads to the best possible scores but not to realistic-looking precipitation fields. If visual consistency with radar fields and the ability to represent a more realistic rain rate distribution are valued higher than quantitative prediction skill, probability matching is a helpful tool. Probability matching carries out a nonlinear transformation to recover higher-order statistics (such as variance and skewness) by means of empirical quantile matching of predicted rain rates with observed rain rates. In this study, the transformation from predicted rain rates into OPERA rain rates was derived on the validation set and then applied to the predictions resulting from the test set and the case studies.
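A minimal sketch of such empirical quantile matching, assuming a mapping built from 101 matched quantiles of a calibration (validation) set and applied via linear interpolation:

```python
import numpy as np

def probability_matching(pred, pred_ref, obs_ref):
    """Empirical quantile matching: transform predictions so their
    distribution matches the observed one. pred_ref/obs_ref come from
    the calibration set; the mapping is applied to new predictions."""
    quantiles = np.linspace(0, 1, 101)
    q_pred = np.quantile(pred_ref, quantiles)
    q_obs = np.quantile(obs_ref, quantiles)
    return np.interp(pred, q_pred, q_obs)

# Smooth predictions compressed toward the mean vs. a skewed
# "observed" rain-rate distribution (both synthetic).
rng = np.random.default_rng(5)
obs_ref = rng.gamma(0.8, 2.0, size=5000) + 0.3
pred_ref = 0.2 * obs_ref + 0.8 * obs_ref.mean()
pred_new = probability_matching(pred_ref, pred_ref, obs_ref)
```

Because the transformation is monotone, the ranks of the predictions (and hence SCORR) are preserved, while variance and skewness are recovered, exactly the behavior reported for the probability matched ANN.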

Excluded Approaches
Over the course of the study, several additional approaches were tested but were deemed unsuitable for our purpose. In the following, we shortly mention the most noteworthy ones.
We investigated the potential of employing a range of textural features such as standard deviations computed within moving windows of 3 × 3, 15 × 15, and 20 × 20 pixels. The resulting negligible increase in skill did not justify the associated increase in computational cost. Thus, like Meyer et al. [70], we found no need to introduce structural parameters into our set of predictors.
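For completeness, the moving-window standard deviation used as a textural feature can be computed efficiently with box filters via var = E[x²] − E[x]²; a sketch on synthetic data:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_std(field, window):
    """Standard deviation within a moving window, computed from two
    box filters via the identity var = E[x^2] - E[x]^2."""
    mean = uniform_filter(field, size=window)
    mean_sq = uniform_filter(field ** 2, size=window)
    return np.sqrt(np.maximum(mean_sq - mean ** 2, 0.0))

# Synthetic brightness-temperature field with std = 5 K.
field = np.random.default_rng(2).normal(250.0, 5.0, size=(64, 64))
texture = local_std(field, window=15)
```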
The other excluded approach concerns the choice of the machine learning algorithm. As an alternative to the ANN, we additionally considered support vector machine (SVM) algorithms. However, they were found to be too computationally demanding in both training and more importantly also in prediction mode. The slow prediction speed originates from the need to compute the distances between all support vectors and all prediction points. Thus, SVMs are not suitable to provide near-real-time satellite-derived precipitation maps over Europe. This matches with the findings of Meyer et al. [40], which point out the large computational costs associated with SVMs and slightly worse performance compared to ANNs.

Results
In this section, we analyze the results of the objective verification and support them with illustrative case studies in which we carry out European-scale predictions for a single time slot with each algorithm.

Objective Verification
The objective verification of the precipitation detection reveals that both GLM and ANN have higher skill than the H-SAF product when using the independent test set (Figure 3 circles). While the largest gain in performance is obtained by including the additional variables, applying machine learning further improves the predictions. It is interesting to note that, while the GLM already substantially increases the POD compared to H-SAF, the ANN succeeds much more in decreasing the FAR. Thus, adding input features mainly leads to the increase in hits while the machine learning is helpful in additionally reducing the number of false alarms. Only small improvements are observed in POFD and ACC since both of these scores depend on the number of correct rejects, which are large for each algorithm in this unbalanced classification problem. As expected, CSI and GSS show a similar improvement when increasing the number of input features and applying machine learning techniques, whereby the absolute magnitude of the GSS is smaller since it is additionally corrected for hits due to random chance. The HSS is higher than the GSS because the HSS measures accuracy adjusted for correct predictions obtained by random chance, i.e., it also takes correct rejects into account which are naturally large. To understand the behavior of the HK score, one needs to remember that it can be expressed as POD-POFD. Hence, the large improvement for the GLM compared to H-SAF is explained by its substantial increase in POD. The ANN's additional reduction of false alarms, however, does not strongly affect the HK score since POFD is dominated by the large number of correct rejects.
Instantaneous rain rate retrievals with GEO satellites are challenging and result in only limited skill for any of the tested algorithms, which is due to the complex and indirect relationship between surface rain rates and satellite-observed variables (Table 1 test set). In terms of the RV skill score, the ANN performs best, closely followed by the GLM. As explained in Section 3.6, smooth predictions close to the mean observed rain rate (1.35 mm h −1 ) are rewarded when deriving the GLM and the ANN. The resulting ANN manages to predict a maximum of 15.46 mm h −1 , while the GLM does not exceed 4.19 mm h −1 . Figure 4 shows the 2D histograms of observed against predicted rain rates, which further emphasize the narrow prediction ranges of both models. Visually, the ANN seems to better capture the statistical relationship between the satellite input and the radar rain rates than the GLM. This is also confirmed by its slightly higher PCORR and SCORR (Table 1).
Probability matching recovers higher-order statistics and thus covers the full range of observed values of 0.30-130.00 mm h −1 (Table 1). This procedure, however, leads to a decline in RV to below zero, indicating a worse performance than a "climatological" prediction represented by the observed sample variance would have. Additionally, the best linear fit between the observations and the predicted rain rates of the probability matched ANN deviates from the 1:1 line, i.e., a conditional bias is introduced (Figure 4). The linear relationship between the predicted rain rates and the observations, measured by PCORR, becomes less pronounced compared to the ANN and drops to the level of the GLM. The monotonic relationship, given by SCORR, in turn remains constant since the ranks of the predictions do not change. In summary, despite the negative RV and the introduced conditional bias, there are several advantages to employing probability matched ANNs.

H-SAF shows the lowest correlations of any of the tested approaches and also has an RV skill score below zero (Table 1). It predicts rain rates up to 29.45 mm h −1 . Hence, it is outperformed by each one of our algorithms in both precipitation detection and rain rate estimation. While a clear distinction in model performances is seen in the precipitation detection, the rain rate estimation performance varies less between the models. It should be pointed out that, while the GLM and the ANN were trained on OPERA data, H-SAF was calibrated on MW measurements. To quantify the importance of this effect, we trained an additional GLM with the same input variable as H-SAF, namely the IR 10.8 µm channel. The model we obtained performs very similarly to H-SAF in terms of skill (not shown).
Thus, we conclude that the improvement of our multivariate GLM compared to H-SAF can be predominantly attributed to the additional input variables, and the different "calibration" datasets need not be taken into account.

Illustrative Case Studies
Case studies allow us to put the conclusions drawn from the objective verification in the context of the actually desired single-scene rainfall prediction maps. While our first case study depicts widespread convection over the European continent, the second one contains large, mostly stratiform, rainbands, thus making it possible to compare the algorithms' performances qualitatively in different weather regimes. In terms of precipitation detection, the skill in the convective case is slightly below average for the GLM and the ANN, while it is very low for H-SAF (Figure 3 crosses). This is almost exclusively attributable to H-SAF's large number of false alarms, which is reflected in its large FAR. This strong H-SAF outlier in terms of false alarm performance highlights the large case-to-case variability of the scores and underlines that the average performance derived from the test set is not representative of every single scene. However, it should also be noted that we specifically chose to show an extreme case of a false alarm outlier performance here.
When visually comparing H-SAF to the GLM, a strong decrease in false alarms and increase in hits is observed in the GLM, with the improvements largely limited to continental Europe and most progress being made in northeastern Europe ( Figure 5). The high opaque clouds in northeastern Europe (not shown) were mistaken as precipitating by H-SAF, which is a natural consequence of their cold IR 10.8 µm brightness temperatures. In this case study, the ANN manages to further reduce the false alarms, while the amount of hits remains similar to the GLM. Nevertheless, it is arguable that, visually, the ANN also outperforms the GLM in terms of hits since it is the only algorithm able to detect precipitation in the northern parts of Great Britain. While the ANN does not always identify the full extent of the precipitating areas, it is most successful at identifying regions faced with rainfall in this case study.
The rain rate estimation performance of each product is qualitatively similar to the objective verification, with generally smaller correlations and larger errors (Table 1, convective). The map shown in Figure 5 highlights the main issues of the rain rate retrieval. None of the satellite-based methods is able to capture the fine-scale precipitation structures displayed by the ground-based radar network. The GLM's smooth predictions around the mean make it visually most different from the radar composite. The ANN, on the other hand, reproduces finer spatial structures that deviate more from the mean while still outperforming the GLM in terms of scores. Only probability matching results in an overall rain rate distribution similar to that of the radar composite (Figure 6). However, the structures remain less fine-scale than in the radar composite and, due to the double penalty effect, ME and RMSE increase strongly while RV drops below zero (Table 1).

Figure 5. Case study at 15:00 UTC 10 July 2017 of precipitation detection (left) and rain rate estimation (right). The first row shows the OPERA observations; the subsequent rows show the ANN, GLM, and H-SAF predictions. The precipitation detection is evaluated against the OPERA radar (rad) reference: "yes" means precipitation, "no" means no precipitation according to the system at hand. On reliable OPERA pixels, the predictions are referred to by their contingency table names: hit, false alarm, correct reject, and miss. Predictions on unreliable radar pixels (rad unr, i.e., on the permanent radar mask derived in Section 2.3 and where the radar rain rate is in the range of 0.1-0.3 mm h−1 or >130.0 mm h−1 in this scene) are distinguished from pixels where no radar (no rad) is available. In the rain rate plots, the regions where no radar is available and the permanent radar mask are also marked.
Figure 6. ANN rain rate retrieval product after probability matching of the predicted rain rates. On the left for 15:00 UTC 10 July 2017 (see also Figure 5) and on the right for 06:45 UTC 30 June 2017 (see also Figure 7).

Stratiform Case: 06:45 UTC 30 June 2017
In the precipitation detection of the stratiform rainband case, an unusually high skill is observed for both the GLM and the ANN (Figure 3, pluses). On the one hand, this reveals that our algorithms, similar to H-SAF, are faced with large case-to-case variability in performance. It thus emphasizes the importance of using the test set containing 400 time slots to estimate the average generalization skill of our algorithms instead of solely relying on single case studies. On the other hand, Figure 7 shows that, in this specific case, the verification scores profit from the presence of large regions in which no scores are computed because the radar composite depicts rain rates between 0.1-0.3 mm h−1 (which we do not consider due to reliability issues). There is an uncommonly large buffer separating the precipitating areas from the non-precipitating ones, resulting in a lower FAR than on average for each of the algorithms (Figure 3). Similar to the objective verification, the GLM mostly increases the number of hits, while the ANN adds additional value by further decreasing the number of false alarms. A logical consequence of H-SAF only considering the IR 10.8 µm channel for the precipitation retrieval is its inferior capability to detect stratiform precipitation originating from low- and mid-level clouds. This is especially visible in the large rainband structure extending from Ireland down into France and northern Spain, where the cloud top height is in the 6-8 km range (not shown) and which is almost entirely missed by H-SAF (Figure 7).
In this stratiform case (Table 1, stratiform), the performance of the rain rate retrieval in terms of RV skill is below average for every algorithm, except for the probability matched ANN. The same conclusion is reached in terms of correlations, apart from the H-SAF SCORR, which is unusually high. All algorithms predict lower rain rates, and thus a smaller rain rate range, than they did in the convective case and in the test set, which explains the smaller RMSE. Even probability matching only results in a maximum rain rate of 4.16 mm h−1 inside the quality-controlled radar mask, which is far below the maximum radar-observed 52.6 mm h−1. The resulting transformed predictions are shown in Figure 6.
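At its core, probability matching can be sketched as a quantile (CDF) mapping: each predicted rain rate is replaced by the observed rain rate occupying the same rank in its own sample, so that the transformed predictions inherit the observed distribution. The following is a minimal, illustrative sketch; the function name and the rank-based quantile convention are our own, and the operational implementation used in the study may differ (e.g., in how it interpolates between quantiles):

```python
def probability_match(predicted, observed):
    """Map each predicted value onto the observed-distribution value
    that shares its empirical quantile (CDF matching)."""
    obs_sorted = sorted(observed)
    n_obs = len(obs_sorted)
    # Rank the predictions to obtain their empirical quantiles.
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    matched = [0.0] * len(predicted)
    for rank, idx in enumerate(order):
        # Quantile of this prediction within its own sample.
        q = rank / (len(predicted) - 1) if len(predicted) > 1 else 0.0
        # Observed value at (approximately) the same quantile.
        matched[idx] = obs_sorted[round(q * (n_obs - 1))]
    return matched
```

By construction, the sorted transformed predictions reproduce the observed sample values, which is why higher-order statistics such as variance and skewness are recovered while pixel-wise errors (double penalty) can grow.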

Discussion
In this section, we discuss general features uncovered with case study maps, strengths and limitations of our rainfall retrievals, and possible ways forward together with the associated challenges.
Visual inspection of many precipitation detection maps, including Figures 5 and 7, revealed that the GLM and the ANN do not always perform best in the same regions. For example, some precipitating areas that are captured by the GLM, such as the rain over Latvia in Figure 7, fail to be predicted by the ANN, and vice versa, e.g., for the scattered precipitation over Great Britain in Figure 5. Furthermore, the maps indicate a decrease in performance of the algorithms at high latitudes, especially to the northeast and over Iceland (not shown). Computing the test set scores for only the pixels originating from these regions confirms our visual impression. This is probably caused by the unfavorable viewing geometry of the satellite at these locations and the stronger deformation of the OPERA radar data when projected onto the satellite grid at high latitudes. Potentially, this issue could be somewhat diminished by replacing the predictors from the MSG full disk service with MSG rapid scan ones, since the rapid scan satellite has a more favorable viewing geometry due to its location at 9.5°E, which is more centered above the OPERA domain. Additionally, the mountainous regions in Iceland imply difficult conditions for the radar-based QPE, which is demonstrated by the large number of discarded clutter pixels (see radar mask in Figure 1). Nevertheless, some residual clutter remains in the dataset, which negatively affects the ANN training and predictions over Iceland.
Another issue is the tendency of our rainfall retrievals to also predict high rain rates at the edge of precipitation systems (see, e.g., Figures 5 and 7). H-SAF, on the other hand, seems to struggle less with this. One possible solution could be to integrate, as an additional predictor, a variable characterizing the "dry drift", i.e., the distance from the edge of the closest surrounding dry area (e.g., [71]).
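Such a "dry drift" predictor can be derived from the binary precipitation field with a distance transform. Below is a minimal pure-Python sketch using breadth-first search on a 4-connected grid; the function name is our own, and an operational version would use an efficient distance transform (and physical distances rather than pixel counts) on the satellite grid:

```python
from collections import deque

def dry_drift(rain_mask):
    """For each pixel, the 4-connected grid distance to the nearest
    dry pixel; 0 on dry pixels themselves (None if no dry pixel exists)."""
    rows, cols = len(rain_mask), len(rain_mask[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every dry pixel.
    for r in range(rows):
        for c in range(cols):
            if not rain_mask[r][c]:
                dist[r][c] = 0
                queue.append((r, c))
    # Breadth-first search outward from the dry areas.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist
```

Feeding this distance field to the models as an extra predictor would give them an explicit notion of how far inside a precipitation system a pixel lies, which could help suppress high rain rates near system edges.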
One of the biggest challenges of the current study is the training of the rainfall retrieval algorithms using pixel-based instantaneous rain rates. In fact, it is widely known that correlations between predictions and observations tend to improve with increasing spatial (e.g., [38]) and temporal aggregation (e.g., [36,40,47,54]). We nevertheless decided to focus solely on pixel-based instantaneous rain rates here because they are especially valuable for real-time nowcasting applications, as they can, e.g., be used to fill in gaps during radar downtime and cover regions with poor or no radar coverage. Nonetheless, one should keep in mind that our product is clearly better suited for identifying precipitating regions than for instantaneous QPE.
Several other studies (e.g., [13,45,46]) showed that, during the daytime, satellite-based rainfall retrievals benefit from additionally considering VIS channels. However, when taking VIS channels into account, two separate algorithms need to be used for day- and nighttime, with the nighttime algorithm exhibiting lower skill. An IR-only approach, as employed in this study, on the other hand, results in a single algorithm whose skill does not depend on whether it is day or night, which can be argued to be advantageous.
It would furthermore be valuable to extend the training set to represent a longer time period that also includes winter months, and thus learn seasonal dependencies. Currently, this is complicated by the ongoing efforts to improve the OPERA composite, which creates temporal data inhomogeneity. Another challenge is given by the seasonality of the quality of both radar and satellite products since each of them faces more uncertainties during the cold season. Nonetheless, the work currently invested by EUMETSAT and OPERA results in datasets with increasing quality to train our algorithms, which will in turn likely be reflected in an increase in prediction skill.

Conclusions
In this study, we demonstrated, visually as well as quantitatively, the added value of multivariate input predictors and machine learning methods for geostationary satellite-based instantaneous rainfall retrievals. We trained generalized linear models (GLM) and artificial neural networks (ANN) using infrared brightness temperature information, NWC-SAF cloud property products, and auxiliary variables as input predictors. The target variable for the statistical learning models is represented by rain rates provided by the quality-controlled ground-based European radar composite OPERA, which was also used as the reference for verification. The resulting predictions were compared to an operational single-input-predictor-based instantaneous precipitation product, namely PR-OBS-3 of H-SAF.
The rainfall retrieval process is divided into two steps for which individual models were trained: precipitation detection (a classification problem) and rain rate estimation (a regression problem). The additional input predictors lead to a strong increase in Probability of Detection for the GLM compared to H-SAF. Taking nonlinearities into account with the ANN mostly adds value by further reducing the false alarms compared to the GLM. Despite large case-to-case variability in skill, both the GLM and the ANN outperform H-SAF in all the tested weather situations.
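The two-step design can be summarized as: apply the detection model first, and estimate a rain rate only where precipitation was detected. A schematic sketch, where `detect` and `estimate` are hypothetical stand-ins for the trained classification and regression models (GLM or ANN):

```python
def retrieve_rain_rate(pixels, detect, estimate, threshold=0.5):
    """Two-step retrieval: classify each pixel as raining or not,
    then regress a rain rate only on the detected pixels."""
    rates = []
    for x in pixels:
        if detect(x) >= threshold:              # step 1: precipitation detection
            rates.append(max(0.0, estimate(x))) # step 2: rain rate estimation
        else:
            rates.append(0.0)                   # dry pixel
    return rates
```

The split lets each model specialize: the classifier handles the strongly imbalanced rain/no-rain decision, while the regressor is trained only on (or applied only to) precipitating pixels.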
The indirect relationship between geostationary satellite information and surface precipitation limits the skill of the rain rate estimation. While we achieve superior scores compared to H-SAF, the H-SAF product is better at creating plausible-looking precipitation fields. In fact, the training procedure of our algorithms heavily penalizes large deviations from the observations, which leads to smooth predictions around the mean rain rate. The ANN is only slightly superior to the GLM in terms of scores but is less smooth and creates more fine-scale structures inside the precipitating areas. To improve the visual realism and the representation of higher order statistics (such as variance and skewness) of the predictions, we transformed the predicted rain rates based on probability matching. This, however, results in inferior verification scores.
The performance of the ANN is promising and motivates further research to improve its rainfall product. This could be achieved by using rapid scan MSG data to have a more favorable viewing geometry and extending the training and testing sets to include longer time periods and also the cold season.

Appendix A. Categorical Scores for the Precipitation Detection
The categorical scores explained in the following are all based on the contingency table shown in Table A1, with H denoting hits, F false alarms, M misses, and R correct rejects. More thorough descriptions can be found in the book of Hogan and Mason [72].

The Probability of Detection (POD, also called hit rate) measures the proportion of correctly predicted events among the total number of observed events: POD = H / (H + M).

The False Alarm Ratio (FAR) depicts the proportion of false alarms in all predictions, i.e., it is conditioned on the predictions: FAR = F / (H + F).

The Probability of False Detection (POFD, also called false alarm rate) shows the proportion of non-events predicted as events, i.e., it is conditioned on the observations: POFD = F / (F + R).

The Accuracy (ACC, also called fraction correct) is the proportion of correct predictions in all predictions: ACC = (H + R) / (H + F + M + R).

The Critical Success Index (CSI, also called Threat Score) indicates the proportion of hits in all predicted (i.e., hits and false alarms) and missed events: CSI = H / (H + F + M).

The Gilbert Skill Score (GSS, also called Equitable Threat Score) is equal to the CSI adjusted by the hits due to random chance. It is sensitive to hits, and the error source cannot be determined because misses and false alarms are penalized in the same way: GSS = (H − H_r) / (H + F + M − H_r), with the hits expected by random chance H_r = (H + F) · (H + M) / (H + F + M + R).
The Heidke Skill Score (HSS, also called Cohen's k) measures the proportion of correct predictions, i.e., the accuracy, after removing the predictions that are correct due to random chance. It can be regarded as a rescaled version of the GSS to obtain a linear score, HSS = 2 · GSS / (1 + GSS), and can be expressed as: HSS = (H + R − H_r − R_r) / (H + F + M + R − H_r − R_r), with H_r as in the GSS and R_r = (R + F) · (R + M) / (H + F + M + R).
The Hanssen-Kuipers discriminant (HK, also called Peirce Skill Score) can be interpreted as the "accuracy for events + accuracy for non-events − 1" and can hence be written as HK = POD − POFD. It does not depend on the climatological frequency.
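All categorical scores above can be computed directly from the four contingency-table counts. A small sketch; the variable names H, F, M, R follow the chance-correction formulas above, while the function name and dictionary layout are our own:

```python
def categorical_scores(H, F, M, R):
    """Contingency-table scores for precipitation detection.
    H = hits, F = false alarms, M = misses, R = correct rejects."""
    N = H + F + M + R
    Hr = (H + F) * (H + M) / N   # hits expected by random chance
    Rr = (R + F) * (R + M) / N   # correct rejects expected by random chance
    pod = H / (H + M)            # Probability of Detection
    pofd = F / (F + R)           # Probability of False Detection
    return {
        "POD": pod,
        "FAR": F / (H + F),
        "POFD": pofd,
        "ACC": (H + R) / N,
        "CSI": H / (H + F + M),
        "GSS": (H - Hr) / (H + F + M - Hr),
        "HSS": (H + R - Hr - Rr) / (N - Hr - Rr),
        "HK": pod - pofd,
    }
```

Note that the identities stated above (HSS = 2 · GSS / (1 + GSS) and HK = POD − POFD) provide convenient consistency checks on any implementation.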

Appendix B. Continuous Scores for the Rain Rate Estimation
In this section, the continuous scores employed in this study are explained; additional information on them can be found in the book of Wilks [73,74]. In the following, y_i denotes the individual predictions and o_i the corresponding observations. The sample mean of the predictions is calculated as ȳ = (1/N) ∑_{i=1}^{N} y_i, and analogously for the observations, ō. The sample variance of the predictions is expressed as s²_y = (1/N) ∑_{i=1}^{N} (y_i − ȳ)², and analogously for the observations, s²_o. Based on these variables, all the employed continuous scores can be computed.
The Mean Error (ME, also called Bias) measures the average prediction bias and denotes the average error without indicating the magnitude of the errors or how well the predictions and the observations correspond: ME = ȳ − ō.
The Root Mean Squared Error (RMSE) represents the average magnitude of the prediction error, with each error weighted by its square. It is sensitive to outliers because large errors are penalized heavily: RMSE = √((1/N) ∑_{i=1}^{N} (y_i − o_i)²).

The Reduction of Variance skill score (RV, also called Nash-Sutcliffe Efficiency) measures the mean squared error of the predictions with reference to the mean squared error of a climatological prediction: RV = 1 − MSE / MSE_clim, where MSE_clim is the climatological mean squared error represented by the observed sample variance s²_o.

The Pearson Correlation Coefficient (PCORR) depicts the sign and the strength of the linear relationship between the predicted and the observed variable, with |PCORR| = 1 denoting a perfect linear relationship. PCORR is a measure of potential skill because PCORR² shows the proportion of the variance of the observations that can be explained by a linear relationship with the predictions: PCORR = [(1/N) ∑_{i=1}^{N} (y_i − ȳ)(o_i − ō)] / (s_y · s_o).

The Spearman Rank Correlation (SCORR) measures the sign and the strength of a monotonic relationship, which can also be nonlinear. It is computed in the same way as PCORR but uses the ranks of the data instead of the data themselves, which makes it more robust and resistant to outliers: SCORR is obtained by applying the PCORR formula to the ranks y^r_i and o^r_i, where y^r_i is the rank of the ith prediction and o^r_i that of the ith observation.
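For completeness, the continuous scores can be sketched in a few lines of pure Python as a plain implementation of the definitions above. Note that ties in the rank computation are ignored in this sketch, whereas a full Spearman implementation would average tied ranks:

```python
import math

def continuous_scores(y, o):
    """ME, RMSE, RV, PCORR, and SCORR for predictions y and observations o."""
    n = len(y)
    ybar, obar = sum(y) / n, sum(o) / n
    mse = sum((yi - oi) ** 2 for yi, oi in zip(y, o)) / n
    var_o = sum((oi - obar) ** 2 for oi in o) / n  # climatological MSE

    def pearson(a, b):
        abar, bbar = sum(a) / n, sum(b) / n
        cov = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / n
        sa = math.sqrt(sum((ai - abar) ** 2 for ai in a) / n)
        sb = math.sqrt(sum((bi - bbar) ** 2 for bi in b) / n)
        return cov / (sa * sb)

    def ranks(a):
        # Assign ranks 1..n by sorted order (ties not averaged here).
        order = sorted(range(n), key=lambda i: a[i])
        r = [0] * n
        for rank, idx in enumerate(order):
            r[idx] = rank + 1
        return r

    return {
        "ME": ybar - obar,
        "RMSE": math.sqrt(mse),
        "RV": 1.0 - mse / var_o,
        "PCORR": pearson(y, o),
        "SCORR": pearson(ranks(y), ranks(o)),
    }
```

Since RV compares the MSE to the observed variance, RV drops below zero whenever the predictions are worse than a constant climatological forecast, which is exactly what happens after probability matching in Table 1.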