Forecasting the Intra-Day Spread Densities of Electricity Prices

Intra-day price spreads are of interest to electricity traders, storage and electric vehicle operators. This paper formulates dynamic density functions, based upon skewed-t and similar representations, to model and forecast the German electricity price spreads between different hours of the day, as revealed in the day-ahead auctions. The four specifications of the density functions are dynamic and conditional upon exogenous drivers, thereby permitting the location, scale and shape parameters of the densities to respond hourly to such factors as weather and demand forecasts. The best fitting and forecasting specifications for each spread are selected based on the Pinball Loss function, following the closed-form analytical solutions of the cumulative distribution functions.


Introduction
Whilst day ahead electricity price forecasting has been a topic of substantial and wide-ranging research in terms of methods, the focus has mostly been upon price levels for the delivery periods (usually hourly) the following day. More recently there has been an interest in density forecasts for the hourly prices, motivated by considerations of risk management. See [1,2] for extensive reviews. In this paper, we provide a new formulation with a focus upon price spreads, and specifically, we forecast the density functions for the intraday spreads in the day-ahead prices. The optimal operation of storage facilities, e.g., batteries and electric vehicles, or load shifting programmes, e.g., demand-side management, over daily cycles depends upon these spreads if they are operated as merchants, arbitraging buying and selling from the wholesale market. Furthermore, if the risk is a consideration, analysis of the mean differences in price levels would be inadequate, and we therefore directly estimate the density functions of all hourly spreads in prices at the day-ahead stage. These forecasts ahead of the day-ahead auctions would be needed to help traders decide whether they want to be buyers or sellers at each hour and thereby optimise their bids and offers. Our specification, estimation and forecasting of these arbitrage spreads are new and computationally-intensive.
Based upon day-ahead forecasts for the drivers of electricity prices, such as demand, wind and solar production, gas and coal prices, forecasts for electricity price levels have been proposed from various methods, e.g., [3][4][5][6] and some for price densities [1,7], but apparently no methods have been developed specifically for forecasting intraday spread densities. Until recently storage assets, such as pumped hydro storage, would regularly store energy at night and discharge at the daily peak demand periods, which were quite predictable. However with the penetration of wind and especially solar generating facilities, the peak and trough hours in prices move around the day and in sunny locations with substantial solar energy, e.g., California, the lowest prices may often be in the middle of the day [8]. Thus, the expected daily spreads in prices, and their consequent arbitrage opportunities for storage or load shifting, will depend upon the wind and solar forecasts, as well as demand and supply considerations. Furthermore, the price density functions are non-normal with skewness switching between positive and negative depending upon the dynamics of production of renewable energy [9]. Because of this non-Normality and the non-independence of each hourly price, spread densities cannot be easily derived as the difference of the price densities, and so we estimate and forecast the daily matrix of intraday spreads directly.
We apply our spread densities formulation to the German market. This is the largest and the main daily reference market for wholesale power in Europe. It is also strongly influenced by wind and solar production, as well as providing a context where batteries and demand-side management are active innovations. The day ahead auction has been actively researched and closes at noon each day, with the vector of 24 hourly prices for the next day being released an hour later. For modeling the spread densities, we adapt the Generalised Additive Model for Location, Scale and Shape (GAMLSS) semi-parametric regression model [10], which has already been used effectively to form day-ahead densities of price levels in the German context [9]. Within this framework, the hourly electricity price spreads form a response variable, whose distribution function varies according to multiple exogenous factors. The GAMLSS framework allows choice from a wide range of distributions, whose distribution parameters change according to the exogenous variables specified using (non)linear relationships. The dynamic location, scale and shape parameters (related to the mean, volatility, skewness and kurtosis of price spreads) are therefore explicitly incorporated into the forecasting model. The paper proceeds by first describing the data and the density estimation process. In Section 2, we use the Pinball Loss function to select the best fitting density model with four distribution parameters. Then in Section 3, we undertake a rolling window forecasting evaluation and demonstrate the value of the dynamic, conditional latent parameter. Section 4 concludes.

Data
The German hourly day-ahead electricity price, wind forecast, solar forecast and actual total load data were downloaded from the Open Power System Data website https://data.open-powersystem-data.org for the period of 1 January 2012 to 31 March 2017 (resulting in t = 1, ..., 1917 time steps). The data is comprised of four German control areas in MW: 50 Hertz, Amprion, TenneT and TransnetBW. The summer time hour change was accounted for by creating hour 02 with interpolation between hours 01 and 03. For the clock change back the (later) 02 hour was deleted. The German day-ahead total load data is calculated as an average of 4 × 15 min segments following the beginning of each hour. The "actual total load" for each control area (obtained at the end of the 15 min real bidding time in the balancing markets) is averaged and results across four regions are summed up to give the total actual load. We use the steam coal ARA 1 month forward benchmark index for steam coal (one price per month) and the Germany Gaspool (GPL) natural gas day-ahead forward (one price per day) for gas. The weekly seasonality and holidays are included into a single dummy variable which takes on value 1 for Saturday/Sunday and the following German state holidays: New Year's Day, Good Friday, Easter Monday, Labour Day, Ascension Day, Whit Monday, German Unity Day, Christmas Day, Boxing Day and New Years Eve (see Appendix A for further details).
Day ahead electricity price data are plotted for a selection of hours, showing the evolution of fitted distribution over the year 2017. The distribution of choice 'ST5' (to be described later) was used to fit the histogram data from the GAMLSS functions library. The hourly electricity spot price data display high volatility, fast-changing dynamics and highly skewed distributions (see Figure 1). The day ahead electricity price data also shows moderate/strong correlations between prices at different hour of the day, as depicted by highly-positive off-diagonal entries in Figure 2 (note the smallest value is 0.41).

Model Description
An intraday spread, Y (s) t , between two hours, denoted by spread number s, of the day-ahead electricity spot prices, is calculated by taking away the later hour price information from the earlier one, resulting in a positive spread if later hour is less expensive and negative otherwise. Each day therefore comprises of 276 separate spreads, and with 1917 days in the time series, the full data set is denoted by Y ∈ R 1917×276 . Each electricity spread time series is tested for stationarity using the Augmented Dickey-Fuller test (adf.test() with 10 lags and a linear trend) and is confirmed to be stationary at the 1% significance level.
We plot example histograms for intraday spreads obtained between hours 00-08, 08-12, 12-16, 16-20 for the example year 2017. Figure 3 depicts the high skewness and kurtosis of the spreads and shows the variation between spread distribution shapes for different hours of the day. Note data is symmetrical about the diagonal thus only the upper diagonal is plotted. The hourly spread data possesses a high degree of skewness in the range [−8.31, 7.85] (see Figure 4). The plot shows that typically the spreads are negatively skewed (note spreads are obtained by hr earlier − hr later ).  The hourly spread data displays a high degree of positive excess kurtosis and thus indicates that spreads have Leptokurtic distributions in the range of [3.8, 140] (see Figure 5). Therefore, given the above analysis, we chose to model the full four-parameter distribution of electricity price spreads at each time step (day), t. The spreads between exogenous variables are likewise formed by taking away the later hour values from the earlier ones in all cases except for the gas forward prices, coal forward prices and the dummy variable, for which only the daily, rather than hourly, values are available. Hence the following exogenous variables are considered for modeling Y (s) t : (1) spread of the lagged intraday electricity price, (2) gas Gaspool forward daily price, (3) coal ARA forward daily price, (4) spread of wind day-ahead forecast, (5) spread of solar day-ahead forecast, (6) dummy variable taking the value of 1 for weekends/holidays, (7) spread of the day-ahead total load forecast, and (8) an interaction load variable, calculated as Load spread * 1 2 (Load earlierHr + Load laterHr ) = 1 2 * (Load 2 earlierHr − Load 2 laterHr ) i.e., an average of the load for the two hours from which the spread is calculated, weighted by the load spread obtained for those hours. This variable provides interaction of ∆ Load * avLoad in order to account for the rate of change in load. The full exogenous variables spread data is a 3D matrix denoted by X ∈ R 1917×9×276 , where the first column is a vector of 1s used in the calculation of the off-set.  Estimation using the Generalised Additive Model for Location, Scale and Shape (GAMLSS) framework [11] facilitates modeling each parameter (µ, σ, ν, τ) of the response variable's distribution as a function of explanatory variables. The range of possible distributions includes both exponential family (as per Generalised Linear Model (GLM)) and general family distributions, which allow for both discrete/continuous and highly skewed/kurtotic distributions (unlike GLM). The probability density function of response variable for T observations is given by f Y (y t ] T is a vector of distribution parameters, and D(θ (s) t ) represents the distribution of spread number s on day t. We use parametric linear GAMLSS framework which relates distribution parameters to explanatory variables by where k = 1, 2, 3, 4 specifies the distribution parameter corresponding to µ, σ, ν, τ respectively; θ (s) k ∈ R T is a vector comprised of values for distribution parameter k over t = 1, ..., T time steps (i.e., θ (s) T is the number of observations; g k (·) is the monotonic link function of distribution parameter k; η (s) k ∈ R T is the linear predictor vector for distribution parameter k; β vector of coefficients learnt for parameter k; J k is the number of significant exogenous variables for parameter k obtained at 5% significance level; and X (s) k ∈ R T×J k +1 is the design matrix with each column containing spread data for significant independent variables. Equation (1) can be re-written for each distribution parameter as: The choice of the link function influences how the linear predictor (i.e., systematic component k ) relates to each parameter. For example, a log link function for the standard deviation g 2 (σ (s) ) = log(σ (s) ) results in the relationship log(σ (s) ) = η (s) 2 ). Parameters θ (s) k are calculated by maximizing the penalized likelihood using an RS algorithm which does not require calculation of the likelihood function's cross derivatives, but is a generalisation of the algorithm for fitting Mean and Dispersion Additive Models, (MADAM) as in [11]. The RS algorithm is a simple and fast method for larger data sets which does not require accurate starting values for distribution parameters to ensure convergence (constant default starting values are typically adequate) [12]. The RS algorithm is more suited for cases where the parameters of population probability density functions are information orthogonal and is reported by authors to be successfully used for the convergence of all the suitable distributions defined within the GAMLSS framework, although it may occasionally be slow.
To model electricity spreads within the GAMLSS approach, the distribution parameters sigma and tau should take positive values only but the mu and nu should be able to take either positive or negative values. Therefore distributions with the following link functions for each parameter are considered (see Table 1): location (mu): identity; scale (sigma): log; shape (nu): identity; shape (tau): log.
There are seven candidate distributions which we consider to be sufficiently flexible and which correspond to the necessary link function specifications: Johnson's SU, Johnson's original SU (JSU), skew power exponential type 1 (SEP1), skew power exponential type 2 (SEP2), skew t type 1 (ST1), skew t type 2 (ST2), and skew t type 5 (ST5). The Box-Cox power exponential and Box-Cox t distributions are not suitable due to only being defined on the positive interval Y (s) t ∈ [0, +∞), while the electricity spread data is highly skewed and kurtotic (see Figures 4 and 5). The skew t type 3 and 4 are also unsuitable since they are only defined for positive skewness yet the spread data is often negatively skewed.

Continuous Distribution µ σ ν τ
Johnson's SU (JSU) identity log identity log Johnson's original SU (JSUo) identity log identity log Skew power exponential type 1 (SEP1) identity log identity log Skew power exponential type 2 (SEP2) identity log identity log Skew t type 1 (ST1) identity log identity log Skew t type 2 (ST2) identity log identity log Skew t type 5 (ST5) identity log identity log The expected value of a random variable Y t for each distribution in Table 1 does not in general equal to the location parameter µ t . To be precise, it is given by is the standardized value of y t , the expectation of Z t is not always zero and is specified for each distribution separately (see Appendix Equations (A1)-(A4) for details), and µ t , σ t , ν t , τ t are the location, scale and shape parameters of the given distribution at time step t. In the case of Johnson's SU and skew t type 1 distributions, the expected value does equal the distribution location parameter, E(Y t ) = µ t . Large fitted/forecasted tau values are capped at 100 for algorithmic convergence.

Distribution Selection
This section performs a thorough analysis of candidate distributions, shown in Table 1, to establish which of the distributions fits each spread hour data the best. This allows us to subsequently perform out-of-sample forecasting analysis using a rolling window estimation technique, presented in Section 3, where the best-chosen distribution is used to fit a model to each spread number and perform 1-step ahead forecasts on a rolling basis.
For model specifications, we begin by analysing which of the seven-candidate distributions fits each spread hour the best. We also analyse whether using a single distribution for modeling all of the spread data is a viable possibility as this might be useful for applications requiring analytical generality. The analysis is divided into two main steps: (a) simple distribution fit, where each distribution of Table 1 is fitted to the spreads in the training data set and the Akaike Information Criterion (AIC) is used to assess the goodness of fit; and (b) factor-based distribution fit, where for each spread hour, the candidate distributions resulting from the analysis of step (a) are used to build a model using exogenous factors. The fit of these models is assessed using a validation data set and a number of goodness of fit measures.

Methodology
To perform a thorough analysis when selecting the best distribution for each spread, we use a training data set, comprised of the first 60% of the spread time series. This is used for model fitting, following which a validation data set, comprised of the next 20% of the spread time series, is used for building forecasts and model evaluation. The validation set is used when there is a question of model selection prior to final out-of-sample forecasting in the backtesting set of data [13]. The validation thus ensures that the predictive power of our trained models is evaluated on the unseen (validation) data set rather than on the fitting in-sample data. This is to avoid over-fitting and ensures that the final backtesting 20% of the data has neither been used for model fitting nor model selection.
The simple distribution fit analysis is performed using the training data, Y train ∈ R 1150×276 . Therefore, for each spread number s = 1, ..., 276 and each distribution i = 1, ..., 7 of Table 1 obtained for spread number s, are ranked in ascending order and the distribution corresponding to the model with the lowest AIC criterion is selected as the "distribution of best fit". Occasionally GAMLSS failed to fit the distribution to the spreads data, in which case that distribution was omitted for that spread. This problem was encountered for spread hours and distributions: 00-01 SEP1, 03-04 SEP2, 12-13 SEP1.
The results show that skew t type 5 distribution is selected most often as the distribution of best fit based on this simple selection criterion (see Table 2). It is closely followed by the same family Skew t type 1 distribution. Overall, all of the distributions except JSUo are indicated as potentially of the best fit for the training spread data. A more detailed breakdown of the spread hours for which each distribution was selected is displayed in Figure 6. We use the result to further analyse the six possible distributions with an factor-based distribution fit method. JSU SEP1 SEP2 ST1 ST2 ST5 Figure 6. Continuous four-parameter distribution chosen as "best fit" to training spread data using gamlss <−y ∼ 1 and the AIC criterion.
The results of simple distribution fit indicated that six out of seven continuous four-parameter distributions could be used for modeling electricity spread data (see Table 2), which are D (1) -JSU, D (2) -SEP1, D (3) -SEP2, D (4) -ST1, D (5) -ST2, and D (6) -ST5. Hence we proceed with a factor-based analysis by estimating each time-varying parameters of the candidate distributions using exogenous variables within the GAMLSS framework. The training data set is comprised of the dependent Y train ∈ R 1150×276 and independent X train ∈ R 1150×9×276 variables, where for each spread number s, under distribution D (i) , i = 1, ..., 6, the initial equations relating each distribution parameter to its linear predictor are: where is the initial vector of independent variables where x 1 is the spread of lagged day-ahead electricity price, x 2 is the gas Gaspool forward daily price, x 3 is the coal ARA forward daily price, x 4 is the spread of wind day-ahead forecast, x 5 is the spread of solar day-ahead forecast, x 6 is the dummy variable taking value 1 for weekends/holidays, x 7 is the spread of the day-ahead total load forecast, and x 8 is the interaction load variable. The models are specified using iterative updating of the equations for each distribution parameter, where the refinement is performed by deleting the most insignificant variable one-by-one and re-estimating the model, until all variables are significant at 5% (see Algorithm 1). This results in M ∈ R 276×6 models containing the estimated coefficients. However during the process it became apparent that some distributions were not suitable for modeling certain spreads within the GAMLSS framework. Convergence was not achieved for some distribution parameters, typically τ and occasionally µ (see Appendix C.1) and whenever this happened the distribution was omitted from candidacy for that spread.

Evaluation
The models estimated using the factor-based distribution fit are analysed in four ways: (1) producing the expected value fit over the training data, (2) producing the expected value fit over validation data (comprising next 20% of unseen time series, at data points t = 1151, ..., 1534), (3) analysing the goodness of fit over the validation data using Root Mean Squared Error, and (4) analysing the goodness of fit over validation data using Pinball Loss function measure.
Although the main focus of the methodology is to fit and forecast the density functions, it is a useful specification check to initially look at the calibration of the mean values. Therefore, the fitted distribution parameters θ (i) train ∈ R 1150×4×276 for each distribution i = 1, ..., 6 are used to find the fitted expected value E(Y train ) of the training spread price data for s = 1, ..., 276 and training data points t = 1, ..., 1150 using expected value expression relevant to each distribution (see Appendix B).

Expected Value Fit Over Validation Data
The fitted models M ∈ R 276×6 , containing the estimated coefficients for each spread number s = 1, ..., 276 under each distribution D (i) , i = 1, ..., 6, are used to forecast the distribution parameters, θ (i) validate ∈ R 383×4×276 over validation data. We note that the same estimated model M (s,i) is used to build the forecasts over validation time series, i.e., each vector of estimated coefficients β (s,i) k for distribution parameters k = 1, ..., 4 is re-used at each time step t to make the predictions. Once the distribution parameters are forecasted the expected value of the spread is calculated using Appendix Equations (A1)-(A4).

Goodness of Fit Measure-Root Mean Squared Error
The Root Mean Squared Error (RMSE), was additionally used to check the accuracy of the forecasted expected value of the spreads over the validation data set. The RMSE is calculated for each spread s = 1, ..., 276 and for each distribution D (i) , i = 1, ..., 6, using: The distribution resulting in the lowest RMSE was selected as the best. As the expected value at each time step t is calculated from all of the four forecasted parameters µ the RMSE measure provides a goodness of fit based on the forecast for the entire distribution specification.
The results are summarised in Table 3, which shows that ST5 distribution was used to form models corresponding to the lowest error across 110 276 * 100 = 40% of the spreads. This is in line with the finding of the simple distribution fit over the spread training data which indicated that ST5 is chosen as the best distribution most often (see Table 2). A detailed breakdown of best distribution assignment for each spread hour and the corresponding RMSE values are shown in Figures 7 and 8 respectively, where blue indicates a lower error. The results show that spreads with hours 13.00, 14.00 are the hardest to forecast since they have the highest RMSE error (dark red). Table 3. Factor-based distribution fit (validation data)-best distribution based on root mean squared error (RMSE).  Goodness of Fit Measure-Pinball Loss Score As our main modeling focus is upon estimating the entire four-parameter distributions fitted to the spread data at each time step, we focus next upon the calibration of the estimated quantiles, using the Pinball Loss (PL) function [14]. The PL function is always positive, and has a higher value the further away the quantile estimate is away from the target value. The quantiles are deemed to be fitted the best when the Pinball Loss is at its lowest. The quality of quantile fit is important for Value-at-Risk compliance in the context of risk management applications. We adopt this measure as our main performance metric to select the best distribution for each spread. We follow Algorithm 2 when making an assessment of each predictive density power and outline the steps involved in calculating the performance measure below.

1.
Pinball Loss Values. At each time step t, the target quantiles q a , a = 1, 2, ..., 99 are extracted by inverting the cumulative distribution function of distribution D (i) , i = 1, ..., 6 specified with forecasted distribution parameters µ and compared to the realisation of true spread price, y t , using: This results in Pinball Loss Values vector, L (s,i) t (q, y t ) ∈ R J a , where J a = {99, 97, 95} is the number of quantiles extracted. Certain quantiles failed to converge when calculating the full set of 99 values due to convergence issues around distribution tails. To resolve this we removed the tail quantiles one pair at a time and attempted to extract the quantiles again (i.e., resulting in 97 quantiles: [q 2 , q 98 ] and if still not convergent 95 quantiles [q 3 , q 97 ]). If this still did not resolve the issue, we omitted calculating the Pinball Loss value for that time step and recorded the number of such occurrences, n. Note: JSU and ST5 extracted quantiles 100% of the time, making these two distributions the most stable out of the 6 tested. This is in contrast to ST1 and ST2 distributions which failed to converge often (see Appendix C Table A1 for the number of times each distribution failed to extract a quantile).

2.
Pinball Loss Scores. At each time step, t, the average over Pinball Loss Values is obtained resulting in a single Pinball Loss Score,L (s,i) t (q a , y t ): This results in a vectorL (s,i) (q, y) ∈ R J b , where J b = 383 − n and n denotes the number of occurrences when L The distribution D (i) corresponding to the model with the lowest Pinball Loss Performance Measure, L (s,i) ∈ R, is selected as the distribution of best fit for spread number s. for each quantile q a,t do 8: Calculate Pinball Loss value L (s,i) t (q a , y t ) using Equation (11) 9: Calculate Pinball Loss ScoreL (s,i) t (q a , y t ) using Equation (12) 10: Calculate Pinball Loss Performance Measure L (s,i) using Equation (13) 11: Find n = argmin i L (s) and select corresponding distribution D (n) as best fit for s The Pinball Loss Performance Measures are analysed for each spread s and the results are presented in a 24 × 24 upper diagonal matrix, containing the best distribution number selected for each intraday spread between the hours indicated by the row and column labels (see Figure 9). The figure shows that the majority of the spreads are fitted best with ST5 distribution (number 6, purple colour). This is in line with the results obtained from the factor-based distribution fit using the RMSE measure (see Figure 7), and with the simple best distribution fit given by y ∼ 1 function (see Figure 6), which both favoured the ST5 distribution.  Table 4 specifies the number of times each distribution was selected as the best from the 276 possible spreads. While ST5 is selected for almost 50% of the spreads, the other 5 distributions were selected nearly with the same proportion for the remaining half of the spread data. Further to this, we calculate the % difference of the best distribution PL score vs. that of the next best distribution, , for each spread number s. When ST5 was selected as the best distribution, the average difference of the performance measure is 4.84% compared to, when other distributions were selected as best, the average difference is 1.44% (with approx 1/3 of these cases containing ST5 is the second-best). This makes the case for ST5 to potentially be used as a general best fit distribution across all spreads.

Rolling Window Forecast Analysis
We perform an out of sample forecasting analysis based on the best distribution selected for fitting each spread number using a rolling window estimation technique and compare the results to the performance of a Normal density as a benchmark. We first assess the performance of models fitted using the Pinball Loss function, and then we use the Diebold-Mariano test. The estimation horizon (rolling window size) comprises of 80% of the data (T 0 = 1534 observations), which is moved up 1 time step at a time.
For each spread we re-specify and re-estimate models using the chosen distribution for that spread (given by Figure 9) while following the rolling window approach, outlined in Figure 10. Starting at time step t 1 a model is specified and estimated over a fixed length horizon of T 0 observations (green coloured bracket). The equations for distribution parameters are obtained using iterative improvement, where the most insignificant variables are deleted one-by-one until all of the variables in each parameter's equation are significant at 5% level. A one-step ahead forecast is made and a full predictive density is obtained, f 1 (green colour). The window is moved by a 1-time step (blue colour) and the procedure is repeated until all of the forecasts are created for the current spread (comprising 383 data points). This procedure is applied to other spreads until all rolling window forecasts are obtained for 276 spreads (see Algorithm 3).  while any β (s,t) k,j insignificant at 5% ∀k do 8: j ← j + 1 9: Find most insignificant coeff. β † (except intercept) across all k 10: Remove exogenous variable associated with β † from eq. for k and from X (s,t) train 11: k,j } 4 k=1 using RS algorithm 12: Using model M (s,t) create 1-step ahead forecast resulting in θ  The resulting data structure is of the size: 276 × 383 and contains time-series of forecasts made for each spread. The rolling window forecast procedure was parallelised across 8 cores on an Intel Core i7 2.9 GHz processor and took 216 hours to complete. The same approach is used to form forecasts for the Normal Distribution resulting in a data structure containing 383 forecasts for each of the 276 electricity spreads.

In-Sample Evaluation
To validate the need for dynamic modeling, we display the evolution of fitted parameters for four example spreads obtained for the first rolling window training data (i.e., first 1534 data points used for specification and estimation phase), giving their associated skew type distributions: 00-08 (ST1), 08-12 (ST1), 12-16 (ST1), 16-20 (ST5).

Explanatory Variable Coefficients
The variations in the size and sign of explanatory variable coefficients for different spreads of the same day illustrate the varying impact of drivers (see Tables 5 and 6 which show the estimated coefficients for selected 4 spreads: 00-08, 08-12, 12-16, [16][17][18][19][20]. The coefficients were extracted from the model estimated for the first rolling window frame i.e., t 1 − t 1534 , and obtained on standardized regressors, but without standardizing the dummy variable. The displayed values for the coefficients are all significant at 5% thus forming the equation for that parameter. Missing values indicate that the independent variable was not significant and thus was omitted from the equation specification for that parameter (column). The coefficients correspond to their associated independent variables: β 1 coeff for spread of the lagged day-ahead electricity price; β 2 coeff for gas Gaspool forward daily price; β 3 coeff for coal ARA forward daily price; β 4 coeff for spread of wind day-ahead forecast; β 5 coeff for spread of solar day-ahead forecast; β 6 coeff for dummy variable taking value of 1 for weekends/holidays; β 7 coeff for spread of the day-ahead total load forecast; β 8 coeff for an interaction load variable. Overall the signs and significances of the coefficients are intuitive. In particular, wind and solar production spreads have negative effects on the mu and nu parameters of spreads for the morning and afternoon spread pairs shown. Recall that the spread is defined as the former minus the later hours and so a higher wind and solar production spread will generally reduce the average spreads and also the skewness (since nu is related to skewness). This is consistent with the effects of wind and solar production on price levels reported in [9] and elsewhere.

Parameter Evolution-Year
The evolution of the four parameters throughout four years using 4 spread hours is examined i.e., establishing how selected spreads behave throughout each year. We selected four years: 2012, 2013, 2014, 2015 in order to depict the evolution and the changing dynamics of the parameters with time. The evolution of each distribution parameter is plotted on separate graphs (see Figure 11 for the evolution ofμ, Figure 12 for evolution ofσ, Figure 13 for the evolution ofν, Figure 14 for evolution ofτ). The results show that the fitted location parameter is in line with what would be expected for the realisation of true spread price, y t , where the spreads between 08-12 hours tend to be positive (i.e., later hour is cheaper), while the 16-20 spreads are negative, i.e., later hour is more expensive. The standard deviation is highest for the spreads between night-time (less busy) and early morning/afternoon (green and blue lines). While the parameter nu tends to be positive for the 08-12 hours, and negative for the 00-08 hours. Next, we plot realisations of spread prices, y t , vs. expected value of fitted spread prices E(Y t ) over the four years, picking different spread hours to show a variety of underlying skew type distributions: 00-09 (SEP2), 08-11 (SEP1), 11-19 (ST2), [16][17][18][19][20][21][22]. Figure 15 shows the evolution of realised spread prices, compared to the evolution of expected values produced by the estimated models (see Figure 16, note slightly smaller scale). The fitted expected values follow the realised spread prices pattern throughout each year, for example, the 16-22 spread tends to have more negative values in the summer time (i.e., electricity at earlier hour is less expensive) and positive values in the winter time (i.e., electricity at earlier hour is more expensive). It can be seen that the fitted spread values are slightly under-fitted as indicated by the difference in plot scale. Next we focus on the evolution of the fitted parameter mu for four chosen hours (midnight, 8am, noon, 4pm). Figure 18a zooms in on the dotted grey section of Figure 17a and depicts spreads between the midnight hour with all hours of the day. This zoomed in plot thus allows a closer examination of the evolution of the fitted parameter mu during a day for the 4 chosen days, and highlights the different behaviour not only within the day but during different season's of the year 2015. Figure 18b-d provide similar information for the spreads between hour: (b) 08 (c) noon (d) 16 with all other hours of the day (note the x-axis range decreases since there are less hours between which the spreads are calculated).
Likewise Figures 19-21 provide the plots for selected hours spread with all hours that follow it in that day for volatility, nu and tau respectively.

RMSE-Based Evaluation
The first stage of out-of-sample evaluation of the obtained forecasts involves calculating the Root Mean Squared Error for skew type distributions and the Normal distribution. The RMSE values of skew type distribution models for the forecasted expected value of the rolling window test data are given in Figure 22. The results show that night-time spreads with hours 14.00, 17.00 and 18.00 had the biggest errors when forecasting the level. The RMSE values of the benchmark Normal type models for the forecasted expected value of the rolling window test data are given in Figure 23. The results show that night-time spreads with hours 14.00, 17.00 and 18.00 also had the biggest errors when forecasting the level. In order to thoroughly compare the performance of skew type distributions against the Normal benchmark, formal statistical testing is carried out next using Pinball Loss based measure.

Pinball-Based Evaluation
A more accurate approach would be to use the Pinball Loss function, which was used to compare the accuracy of the full distribution forecast at each time step t of the forecasted time series for each spread. The analysis used Algorithm 2 where instead of six distributions only two distributions are used: D (i) , i = 1, 2, where i = 1 represented the best-fitting distribution for model s, and i = 2 is the benchmark Normal distribution. The estimation of models was not always convergent within the GAMLSS estimation algorithm and certain spreads failed to be fitted either by skewed distribution, by Normal distribution or both, which was mostly due to special characteristics in the dataset. For example, 02-03 hour spread is dominated by noise, with 1 3 of the data containing 0's, which in turn creates a problem during weight update calculations. If convergence issues were experienced, the forecast for that time step was omitted (which was also accounted for during calculation of Pinball Loss averages). If more than 200 out of 383 forecasts were missing for the best skew distribution of a given spread, the result was judged as unreliable and the forecasts for that spread were treated as unavailable. This happened for 4 out of 276 spreads: hours 02-03/12; 03-05; 13-14. The number of times at least 1 forecast was missing for skew type distribution was 46 times (at most 70 times out of 383 time steps). The convergence issue was also experienced while using the Normal distribution (benchmark), however, forecasts with more than 200 time steps were observed more frequently with a total of 13 out of 276 spreads: hours: 01-12/13/14/23; 02-03/23; 05-13; 06-12/13/14/15/16; 07-08. When this happened the skew type distribution was judged to have the better result. The number of times at least 1 forecast was missing for Normal type distribution was 5 times (at most 42 time steps out of 383). The Pinball Loss scoreL was calculated by averaging the Pinball Loss L(q a , y t ) (Equation (11)) across all t rounded to 5 decimal places. If forecast at t was missing due to a failed model, and if total missing values over a forecasted time-series was less than 100 for a given spread, the PL average was adjusted to the relevant number of remaining forecasts.
The results show that for 276 spreads the skewed distributions forecasted the full density more accurately 258 times vs. 18 times for the Normal distribution, when PL score is calculated to 5 decimal places. Reducing decimal places to 3, results in the same number of Normal distributions (18 better than Skew), however, in this case results show that for 5 spreads a Normal distribution produces as good a fit as the skew distribution, and 253 cases of where skew is a better distribution for spreads. The differences between Pinball scores are plotted in Figure 24, where negative values indicate that a skew type distribution has a better forecasting power over Normal. The overwhelming majority preference for skewed distributions is evident through negative values, as expected due to highly skewed and kurtotic nature of spread price data. Specifically, the results show that in 12 cases the skew distribution is strongly better than Normal (dark blue), 90 cases moderately better (navy blue), with the remaining 151 cases where skew is better than Normal.

Diebold-Mariano-Based Evaluation
To draw statistically significant conclusions over the outperformance of skew type distribution forecast accuracy over that of the Normal distribution, the Diebold-Mariano [15] test is performed. The test is applicable to forecast errors that do not have zero mean, that are not Gaussian, and that may be serially/contemporaneously correlated, thus Diebold-Mariano accounts for the correlation structure between the errors. The test is valid for cases where a non-quadratic loss function is used, such as the Pinball Loss function used here. We use a variation of the standard Diebold-Mariano test with the implementation proposed in [16]. For each spread number s we have t = 1, ..., 383 forecasts produced by two models (M 1 -estimated with skew type distribution and M 2 -estimated with Normal distribution), which are tested against each other using a one-sided test at 5% significance level. The null hypothesis is that the two methods have the same forecast accuracy, with a one-sided alternative hypothesis that forecasting power by skew type distribution outperforms that of the Normal, H 0 : E(∆ M 1 ,M 2 ,t,s ) ≤ 0, where the Loss Differential Series ∆ M 1 ,M 2 ,t,s is calculated using where s is the spread number, t is the forecast time step,L t,1 average quantile score of skew type distribution model (M 1 ) forecast at step t,L t,2 average quantile score of Normal distribution (M 2 ) forecast at step t.
The p-values are displayed in Figure 25 and the results show that out of 276 spreads: (a) the skew type distribution is significantly better at forecasting the spreads 161 times at 5% (bright green) and 15 times at 10% (olive green). Note: the distributions which were used to learn these 176 models break into JSU: 9 times, SEP1: 14 times, SEP2: 16 times, ST1: 13 times, ST2: 18 times and ST5: 106 times ( 106 164 × 100 = 64.5%); (b) the Normal and the skew have the same forecasting power 48 times (i.e., the null hypothesis could not be rejected at 10%). Note: the distributions which were used to learn these 48 models break into JSU: 9 times, SEP1: 4 times, SEP2: 8 times, ST1: 8 times, ST2: 4 times and ST5: 15 times ( 15 48 × 100 = 31.25%); (c) the results could not be obtained for the skew distribution model 49 times (white spaces of upper triangular) since at least 1 forecast was missing due to Quantile estimate convergence issues. Note: the distributions which were used to learn these 49 models break into JSU: 4 times, SEP1: 8 times, SEP2: 21 times, ST1: 9 times, ST2: 5 times and ST5: 2 times ( 2 49 × 100 = 4%) (i.e., this typically happened for distributions other than ST5). Note: the Normal distribution approach may have resulted in missing forecasts also, however this was not checked. We focused on selecting the skew distributions which fitted best the individual spread hour data. This is the most accurate approach, however, some distributions experienced issues with convergence during quantile estimation. As a result of the above analysis it was found that: (a) the ST5 distribution was most frequently selected as the best distribution by simple distribution fit based on y ∼ 1 and factor-based distribution fit based on both RMSE and Pinball Loss function; (b) the ST5 distribution was the most reliable for convergence of quantile estimates using GAMLSS function qFUN (e.g., qST5), especially for the extreme quantiles of q 1 , q 2 , q 3 , q 97 , q 98 , q 99 where other distributions such as SEP1 and SEP2 often failed. Note that quantile estimates are important for this research because they are utilised in the statistical testing of performance of two distributions. For example, the Pinball Loss calculations revealed that for 49 spread hours at least one forecast step (out of 383 steps) failed to extract 95 quantiles from the estimated model. Upon further examination, it was revealed that out of the 49 occurrences only 4% had ST5 as the underlying distribution for which the model was estimated. This supports our claim that ST5 is a reliable distribution for the quantile estimates; (c) the best distribution for each spread was chosen based on the Pinball Loss score calculated for the validation data. Each spread had 6 possible distributions from which the best distribution was chosen and the one with the lowest score was taken as the best distribution for that spread. Further analysis reveals that on 1/3 of occasions when other distributions than ST5 were selected as 'best', the ST5 was the second-best distribution, which on average was only worse by 1.44%. However, for the spreads where the ST5 was the best distribution, the pinball loss score was on average better by 4.84%, which points to the possibility that the best distribution did not have a significantly different performance when compared to ST5. Therefore we conclude that if one wished to use a single distribution across all spread hours, the ST5 distribution forms a robust choice. However, our research shows that for a more rigorous analysis the more detailed approach should be used, where individual spreads have distributions of best fit assigned to them.

Conclusions
The methodology developed in this paper is innovative in focusing directly upon modeling the intraday spreads in day ahead prices, rather than the day ahead price levels themselves. The density function for each spread is modeled directly. As the day ahead hourly prices are non-Normal and not independent, it is not generally possible to derive the spread densities from pairs of price densities. This work is motivated by potential applications to short-term arbitrage trading and battery operations with risk-averse traders, but we have restricted this research to considerations of the forecastability of these spreads. In this context, we have demonstrated the value of detailed, computationally intensive modeling of intra-day power price spread densities using a flexible four-parameter distributional form, generally the skew-t. This allows the dynamic conditional parameter estimates to follow stochastic evolutions driven by exogenous factors, most importantly the day ahead demand, wind and solar forecasts. The result is a dynamic specification of a four-parameter density function, adapting its shape to the evolution of demand, wind and solar forecasts. Furthermore, the analytical properties of these four parameter functions allow direct computation of their cumulative distribution functions. This is an attractive alternative to conventional quantile regression methods that require interpolations between a discrete set of quantile estimates to approximate the full distribution function.
With respect to validation and out-of-sample testing, these forecasts performed well in a rolling window backtesting and outperformed baseline comparison to a Normal density model. The significance of outperformance was established, generally at 5%, with Diebold-Mariano tests on the Pinball Loss functions for the densities. Insofar as the Normal baseline was outperformed, this is a validation of the approach against the simpler alternative. However, the model specification and validation process are computationally intensive, and whilst modeling simplifications could be introduced, generally, we think extra accuracy is important. Overall, this formulation and application shows the merits of a computationally intensive approach to accurate specifications and these are likely to be more attractive in practical applications, e.g. to intraday arbitrage trading, than methods based upon analytical simplifications of the stochastic price processes. The value of this methodology should be quite general to most electricity markets where spreads are relevant for short-term, intraday arbitrage trading. The key ingredient is the non-Normality of the prices and the dependence of the price density shapes on potential exogenous variables. With the widespread introduction of wind and solar production facilities, these conditions are likely to be fulfilled in most daily power markets.
Author Contributions: Both authors contributed equally in all stages of the research. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Data Collection and Pre-Processing
The data collection process for each variable is described below.

Appendix A.2. Total Wind Day Ahead Forecast
The total wind day ahead forecasts were obtained as the sum of wind forecasts for the four control areas, each downloaded from https://data.open-power-system-data.org. Missing data for two control areas of the following dates was interpolated: Each control area's forecast is obtained by finding the average of 15 min forecasts starting from the hour (e.g., for 23.00 data point this is the average of 23:00-23:15, 23:15-23:30, 23:30-23:45, 23:45-00:00), with data downloaded from https://data.open-power-system-data.org. The total solar day-ahead forecast is found by summing the average for each of the four control regions. A number of days contained non-zero entries for forecasted sunshine during hours 22.00 and 23.00 in the night (e.g., 30 June 2014 contained entries 224 and 15 respectively). In order aid smooth regression estimation, any non-zero entries for hours 22.00/23.00 have been overwritten with 0 in order to be consistent with the majority of the data for those hours (approx. 10 entries in total). Note: the processed data set therefore did not contain any forecasted sunshine for hours: 22.00, 23.00, 24.00, 00.00, 01.00, 02.00, 03.00. Some data was missing and was therefore replaced for the following control areas: were missing from https://www.transnetbw.com/en/transparency/market-data/key-figures and thus were replaced with an average of the hour from the past 7 days. Note the March dates experience a change in the clock, therefore for consistency of data, the average (for each hour for end of March missing data) was shifted by one hour: i.e., for hour 06, an average of 7 previous days of hour 05 was taken instead of 06. Incorrect data for the 13 December 2014 data shows sunshine at night and for the 24 hours of that day (200+ MW), hence this was replaced with an average of last 7 days of forecasts for each hour.
Substituting a into Equation (A7) we obtain Equation (A10) is re-arranged to have the form f (x) = 0 (see Equation (A11)) and solved for b using Newton-Raphson root searching algorithm.