Energy Commodity Price Forecasting with Deep Multiple Kernel Learning

Abstract: Oil is an important energy commodity. The difficulties of forecasting oil prices stem from the nonlinearity and non-stationarity of their dynamics. However, oil prices are closely correlated with global financial markets and economic conditions, which provides us with sufficient information to predict them. Traditional models are linear and parametric, and are not very effective in predicting oil prices. To address these problems, this study developed a new strategy. Deep (or hierarchical) multiple kernel learning (DMKL) was used to predict the oil price time series. Traditional methods from statistics and machine learning usually involve shallow models; however, they are unable to fully represent complex, compositional, and hierarchical data features. This explains why traditional methods fail to track oil price dynamics. This study aimed to solve this problem by combining deep learning and multiple kernel machines using information from the oil, gold, and currency markets. DMKL is good at exploiting multiple information sources. It can effectively identify the relevant information and simultaneously select an apposite data representation. The kernels of DMKL were embedded in a directed acyclic graph (DAG), which is a deep model that is efficient at representing complex and compositional data features. This provided a solid foundation for extracting the key features of oil price dynamics. Using real data for empirical testing, our new system robustly outperformed traditional models and significantly reduced the forecasting errors.


Introduction
Crude oil is the world's largest energy commodity and is actively traded internationally. The welfare of oil-importing and oil-producing economies is heavily influenced by fluctuations in oil prices, especially when they are unexpectedly large and persistent. As indicated by Abosedra and Baghestani [1], "sharp increases in crude oil prices adversely influence economic growth and accelerate inflation for oil importing economies. Large fall in crude oil prices will generate serious budgetary deficit problems for oil exporting countries". Accurate oil price forecasting is therefore both appealing and important. Nevertheless, it is one of the most difficult tasks in modern time series analysis owing to the complex dynamics of oil prices. Many researchers have tried to develop models to maximize forecasting accuracy; however, until now, they have not achieved a satisfactory level of performance. The failure of traditional approaches stems from their model settings: the model forms adopted are usually linear and parametric (Atsalakis and Valavanis [2,3], Fan and Li [4]), which are not flexible enough.
Due to the rapid development of the Internet and information technology, global financial markets are highly correlated. Oil is both an important energy commodity and a financial instrument that is heavily traded in global markets. Upon reviewing the research on oil or financial price prediction (Ding et al. [22], Iranmanesh et al. [19], Khashman and Nwulu [23], Liu et al. [24], Wang et al. [25], Xie et al. [26], Yu et al. [12]), we can confirm that machine learning or artificial intelligence approaches usually outperform statistical and econometric methods. However, there are still some weaknesses associated with machine learning or artificial intelligence approaches.
Kernel methods have been a prolific theoretical and algorithmic machine learning framework. The success of kernel methods depends on good data representation or kernel design, and this has motivated much research on kernel designs adapted to specific data types. Conversely, there are also several generic kernel-based algorithms for typical learning tasks. The strength of SVMs is that they use structural risk minimization to regularize model complexity, which leads to excellent generalization properties in out-of-sample forecasting. The mathematical formulation of an SVM is ideal because its objective function is convex with a unique solution. Consequently, the solution searching or parameter optimization algorithms are easier than those in neural network (NN) models. The kernels are typically hand-crafted and fixed in advance, and the roles of the kernel in an SVM can be divided into two parts: (1) it defines the similarity between two examples, and (2) it simultaneously acts as a regularization for the objective function.
Hand-tuning kernel parameters is difficult, as the appropriate sets of features need to be selected and combined. On the other hand, traditional SVMs are based on a single kernel, whereas in real-life applications data comes from multiple sources, and therefore, the representation by a single kernel is not sufficient. The combination of multiple kernels is a good solution; however, determining the process to combine them presents another problem. Lanckriet et al. [27] sought to address this problem and proposed an idea to learn the multiple kernels from training data. Their solution was to learn the target kernels as a linear combination of given basis or local kernels. Following Lanckriet et al. [27], various multiple kernel learning (MKL) formulations and modifications have been proposed. The success of MKL stems from the fact that using multiple kernels can enhance the interpretability of the decision function, and thus improve performance (Lanckriet et al. [27]). However, the number of the basis kernels that we need to consider is exponential in the dimension of the input space. Considering this decomposition for MKL directly is intractable. To address the issue of selecting basis kernels more efficiently, Bach [28,29] proposed a useful framework to design the MKL kernels. Owing to the fact that data features of modern time series are complex, compositional, and hierarchical, using the natural hierarchical (or deep) structure of the problem for the kernel design of MKL is a good solution. The suggestion made by Bach [28,29] involves embedding the kernels in a directed acyclic graph (DAG). The kernels embedded in a DAG form provide an excellent deep representation of the data features. Another contribution from Bach [28,29] is the proposal to perform high-dimensional kernel selection through a graph-adapted sparsity-inducing norm. Using the norm, the selection can be completed in polynomial time in the number of selected kernels.
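As a minimal illustration of combining basis kernels (not the model used in this study, and with the kernel weights fixed by hand rather than learned from the training data as in Lanckriet et al. [27]), the following Python sketch builds a convex combination of Gram matrices and feeds it to a precomputed-kernel SVR from scikit-learn. The data and weights are hypothetical.

```python
# Illustrative sketch only: a fixed convex combination of basis kernels,
# not the full MKL optimization, where the weights themselves are learned.
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 14))   # hypothetical lagged market inputs
y_train = rng.normal(size=100)         # hypothetical next-day price targets
X_test = rng.normal(size=(20, 14))

def combined_gram(XA, XB, weights=(0.3, 0.3, 0.4)):
    """Convex combination of basis Gram matrices: K = sum_m mu_m * K_m."""
    kernels = (linear_kernel(XA, XB),
               polynomial_kernel(XA, XB, degree=2),
               rbf_kernel(XA, XB, gamma=0.1))
    return sum(mu * K for mu, K in zip(weights, kernels))

model = SVR(kernel="precomputed", C=1.0, epsilon=0.1)
model.fit(combined_gram(X_train, X_train), y_train)
y_hat = model.predict(combined_gram(X_test, X_train))  # rows: test, cols: train
print(y_hat[:5])
```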
Recently, deep learning (DL, Bengio et al. [30], Schmidhuber [31]) or deep representations (DR) have become very popular. As opposed to task-specific algorithms, DL aims to learn the data representation; consequently, DL is also known as deep structured learning or hierarchical learning. Among machine learning methods, DL has become the new trend for overcoming complex data mining problems. As previously mentioned, kernel methods are usually shallow models that cannot fully represent or capture complex, compositional, and hierarchical data features. This study aimed to combine the strengths of DL (or DR) and MKL. The kernels used in this study were embedded in a hierarchical directed acyclic graph, which is a deep representation form for real data. In the past few years, DL has become very popular in many fields of computer science, and the most recognized applications are in computer vision and natural language processing. With advances in storage technology, considerable quantities of labeled data are available for training a model, which allows large numbers of DL model parameters to be learned without great concern about overfitting. Another factor contributing to the success of DL is the rapid development of the Graphics Processing Unit (GPU). The computing power of GPUs has grown very fast: training a complex DL model that would require weeks of computation on a CPU (Central Processing Unit) can be completed in a day on a GPU (see, e.g., He et al. [32], Ioffe and Szegedy [33], Krizhevsky et al. [34], Simonyan and Zisserman [35]). This study sought to bridge kernel methods and deep representations and ideally achieve the best of both worlds.
The remainder of this paper is organized as follows: Section 2 reviews the weaknesses and strengths of prior research, including support vector regression (a type of SVM), the feedforward neural network (FFNN), the radial basis function (RBF) neural network, the generalized regression neural network (GRNN), and deep learning. Section 3 describes the proposed model. Section 4 introduces the real data we used to test the model and discusses the empirical results. Finally, Section 5 concludes the paper.

Support Vector Regression
Based on the structured risk minimization (SRM) principle, support vector regression (SVR) seeks to minimize an upper bound of the generalization error, instead of the empirical error as neural networks do. The concept of SVR is to find suitable support vectors in the margin and build the model using only a subset of the training data. In the past, SVMs have achieved great performance in various applications, yet in some cases their performance was not satisfactory. SVMs need to overcome the following drawbacks: (1) Similar to NN models, the optimization algorithm needs to tune a large number of model parameters. The general strategy is to employ genetic algorithms (GA) or particle swarm optimization (PSO) algorithms to search for the best parameters (Huang and Wang [36], Ren and Bai [37]). Despite the fact that the objective function of an SVM is convex and has a unique solution, the parameter space is highly nonlinear and non-convex, and typical optimization (or tuning) algorithms are not very effective for searching in it. Although searching for optimal parameters by GA (or PSO) is an effective solution, it is time consuming and computationally intensive. (2) For high-dimensional data, an SVM also cannot escape the curse of dimensionality (Bellman [38]). For large-scale input data, the dimension of the input space is very large, and the distribution of data points becomes very sparse, which results in a sharp deterioration in the SVM's performance.
(3) The representation of an SVM is not compact and concise, and it generally cannot produce sparse models. For example, in a system identification setting, Drezet and Harrison [39] demonstrated that the model built by an SVM is not always parsimonious. (4) To make an SVM successful in many application areas, the choice of a good kernel and features is very important and relies heavily on data processing experience.
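To make the tuning burden in drawback (1) concrete, the following is a minimal epsilon-SVR sketch in scikit-learn with a small grid search over C, gamma, and epsilon. It uses synthetic data and does not reproduce the GA or PSO searches of Huang and Wang [36] or Ren and Bai [37].

```python
# Minimal epsilon-SVR sketch with a small hyper-parameter grid search;
# the data here are synthetic placeholders for lagged price inputs.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 14))   # hypothetical feature matrix (lagged prices)
y = rng.normal(size=300)         # hypothetical targets

param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0], "epsilon": [0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=TimeSeriesSplit(n_splits=5))
search.fit(X, y)
print(search.best_params_)
```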

Feedforward Neural Network
The notion of artificial neural networks was derived from biological neural networks. The neurons process information through a non-linear sigmoid function, and consequently, NNs are effective at non-linear data modeling. The strengths of NNs lie in modeling complex relationships between inputs and outputs and finding patterns in data. However, NN models also have certain weaknesses: (1) they depend on a large number of model parameters; (2) the solution space of an NN is not convex, and the optimization algorithm is often trapped in local minima during training; (3) NN training usually tends to overfit, which results in poor out-of-sample generalization; and (4) traditional NNs are shallow models, and thus their representation power is insufficient. These problems are partially addressed by kernel methods or support vector machines.

Radial Basis Function and Generalized Regression Neural Networks
The radial basis function neural network (RBFNN) is a special class of neural network that consists of an input layer, a hidden layer, and an output layer. The neurons in the hidden layer of an RBFNN contain Gaussian transfer functions whose settings make the outputs inversely proportional to the distance from the center of the neuron. The generalized regression neural network (GRNN) is a variation of the RBFNN. GRNNs are an improvement on neural networks based on nonparametric regression: every training sample serves as the center (mean) of a radial basis neuron. A GRNN can be used for regression, prediction, and classification, and can also be a good solution for online dynamical systems. Similar to the RBFNN, the GRNN has the following advantages: (1) high accuracy in the estimation because it uses Gaussian functions; (2) single-pass learning, so backpropagation is not required; and (3) it is robust to noise in the inputs. However, the GRNN still has some disadvantages; for example, there is no optimal method to improve it, and its size grows quickly with the input dimension, which is computationally expensive.
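The following is a minimal GRNN sketch, assuming the classical formulation in which the prediction is a Gaussian-weighted average of the stored training targets; "training" is a single pass that simply memorizes the samples, and the spread value below is arbitrary.

```python
# Minimal GRNN sketch: prediction is a Gaussian-weighted average of the
# stored training targets (single-pass "training" = memorizing the samples).
import numpy as np

def grnn_predict(X_train, y_train, X_query, spread=1.0):
    """Return GRNN predictions for each query point."""
    preds = []
    for x in X_query:
        d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to all samples
        w = np.exp(-d2 / (2.0 * spread ** 2))     # radial basis activations
        preds.append(np.dot(w, y_train) / (np.sum(w) + 1e-12))
    return np.array(preds)

# Tiny usage example with made-up numbers
X_train = np.array([[0.0], [1.0], [2.0]])
y_train = np.array([0.0, 1.0, 4.0])
print(grnn_predict(X_train, y_train, np.array([[1.5]]), spread=0.5))
```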

Deep Learning
Deep learning is good at feature extraction and representation. It has achieved remarkable performance breakthroughs in several fields (such as speech recognition, natural language processing, and computer vision). In particular, convolutional neural network (CNN) architectures produce state-of-the-art performance on a variety of image analysis tasks. Currently, a weakness of DL is that most DL research has focused on data in 1D, 2D, or 3D Euclidean spaces. However, most data from energy or financial markets lies on high-dimensional non-Euclidean manifolds, so generalizing deep learning methods to non-Euclidean structured data becomes very important. Applying differential geometry to generalize DL is a good solution. Such generalized (or geometric) deep learning can be applied to a variety of domains, such as network analysis, computational social science, computer graphics, and so on. Another weakness of DL is that its computation is quite heavy; multiple GPUs or cloud computing are needed to accelerate it.

Deep (or Hierarchical) Multiple Kernel Learning
Kernel methods are popular learning frameworks, and the basis of the approach can be stated as follows: through non-linear transformations, we can transform the input space into a larger and potentially infinite-dimensional feature space. Typically, the feature space is a reproducing kernel Hilbert space (RKHS), a space of functions in which point evaluation is a continuous linear functional. The advantage of an RKHS is that it is a richer and more flexible space for feature representations than the original input space. Via representer theorems, with the kernel function and appropriate regularization by Hilbertian norms, we can consider larger and potentially infinite-dimensional feature spaces without computing the coordinates of the data in those spaces, but rather by simply computing the inner products between the images of all pairs of data points in the feature space. This approach is called the "kernel trick", and it is computationally cheaper than the explicit computation of the coordinates. This has led to several studies on kernel design adapted to specific data types and on generic kernel-based algorithms for many learning tasks.
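A small numerical check of the kernel trick (illustrative only): for the homogeneous degree-2 polynomial kernel on $\mathbb{R}^2$, the explicit feature map is finite and easy to write down, so the equality between the kernel value and the feature-space inner product can be verified directly.

```python
# For k(x, z) = (x . z)^2 on R^2, the explicit feature map is
# phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2); the kernel value equals the inner
# product of the mapped points, so the map never has to be computed for
# kernels whose feature space is very high or infinite dimensional.
import numpy as np

def phi(x):
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print((x @ z) ** 2)          # kernel trick: 1.0
print(phi(x) @ phi(z))       # explicit feature space: 1.0
```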
In practical applications, data comes from multiple sources. Classical kernel machines are based on a single kernel, which is not capable of representing complex data sources. Consequently, it is more desirable to construct learning machines based on combinations of multiple kernels. Bach [28,29] proposed a large feature space that is the concatenation of smaller feature spaces and, for real-life applications, considered a positive definite kernel that can be expressed as a large sum of positive definite basis or local kernels. After this construction, we can apply multiple kernel learning to select among these kernels. However, directly applying multiple kernel learning to this decomposition is intractable because the number of these smaller kernels increases exponentially in the dimension of the input space. To overcome the difficulty in basis kernel selection, Bach [28,29] arranged for these small kernels to be embedded in a DAG, a hierarchical structure that is effective for deep representations.
The following description of DMKL follows Bach [28,29]. Consider the problem of predicting a random variable $Y$ from a random variable $X$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the spaces in which $X$ and $Y$ take their values. Given $n$ observations $(x_i, y_i)$, $i = 1, \ldots, n$, the empirical risk of a function $f$ from $\mathcal{X}$ to $\mathbb{R}$ is defined as $\frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$, where $\ell$ is a loss function.

Graph-Structured Positive Definite Kernels
To construct a larger kernel, $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we assumed that this positive definite kernel is the sum, over an index set $V$, of basis kernels $k_v$, $v \in V$, with respective feature maps $\phi_v$ and feature spaces $\mathcal{F}_v$. Consequently, the larger feature map $\phi(x)$ and larger feature space $\mathcal{F}$ of $k$ can be expressed as the concatenation of the feature maps $\phi_v(x)$ of each kernel $k_v$, i.e., $\mathcal{F} = \prod_{v \in V} \mathcal{F}_v$ and $\phi(x) = (\phi_v(x))_{v \in V}$. The MKL learning algorithm seeks a $\beta \in \mathcal{F}$ that forms a predictor function $f(x) = \langle \beta, \phi(x) \rangle$, which is equivalent to jointly finding $\beta_v \in \mathcal{F}_v$ for every $v \in V$. The goal of this research was to perform kernel selection among the kernels $k_v$, $v \in V$. To accelerate the search, we only considered specific subsets of $V$: we limited the basis kernels to be embedded in a graph, and, as described by Bach [28], "instead of considering all possible subsets of active (relevant) vertices, we are only interested in estimating correctly the hull of these relevant vertices".
We assumed that the input space $\mathcal{X}$ can be factorized into $p$ components, $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_p$, and that there are $p$ sequences of length $q + 1$ of basis kernels $k_{ij}$, $j = 0, \ldots, q$, each defined on $\mathcal{X}_i \times \mathcal{X}_i$. Thus we had a sum of $(q + 1)^p$ kernels, which could be computed efficiently as a product of $p$ sums. In this scenario, products of kernels were equivalent to interactions between certain variables. Embedding the basis kernels in a DAG implies that an interaction will be selected only after all of its sub-interactions have already been selected. The DAG framework is particularly suited to deep feature representations and non-linear variable selection, especially for polynomial and Gaussian kernels.
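The factorization above can be checked numerically: summing all $(q + 1)^p$ product kernels explicitly gives the same value as the much cheaper product of $p$ per-component sums. The sketch below uses illustrative monomial basis kernels $k_{ij}(x_i, x_i') = (x_i x_i')^j$ and made-up data.

```python
# Numerical check that the sum of (q+1)^p product kernels equals the
# product of p per-component sums (distributive law).
import numpy as np
from itertools import product

p, q = 3, 2
x = np.array([0.5, -1.0, 2.0])
z = np.array([1.5, 0.3, -0.7])

def k_ij(xi, zi, j):
    return (xi * zi) ** j

# Brute force: sum over all (q+1)^p index tuples (j_1, ..., j_p)
brute = sum(np.prod([k_ij(x[i], z[i], js[i]) for i in range(p)])
            for js in product(range(q + 1), repeat=p))

# Efficient: product over components of the per-component sums
fast = np.prod([sum(k_ij(x[i], z[i], j) for j in range(q + 1)) for i in range(p)])

print(brute, fast)   # the two values agree
```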
Consider the linear (monomial) basis kernels $k_{ij}(x_i, x_i') = \binom{q}{j} \langle x_i, x_i' \rangle^{j}$, where $\langle \cdot, \cdot \rangle$ stands for the inner product; the full kernel is then equal to $k(x, x') = \prod_{i=1}^{p} (1 + \langle x_i, x_i' \rangle)^{q}$. Please note that this is not exactly the usual polynomial kernel. Typical polynomial kernels, $k(x, x') = (1 + \langle x, x' \rangle)^{q}$, are multivariate polynomials of total degree at most $q$. Another example is the product of Gaussian basis kernels over the components, which is also known as the all-subset Gaussian kernel. The ANOVA (analysis of variance) kernel, which sums products of basis kernels over subsets of variables of a given size, is also well known in this line of research. The optimal hierarchical multiple kernel learning can then be formulated as a regularized empirical risk minimization problem with a graph-adapted sparsity-inducing norm (Bach [28,29]).
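Since the printed formula was lost in this version, the following is a hedged reconstruction of the hierarchical MKL objective, following the formulation of Bach [28,29]; here $D(v)$ denotes the set of descendants of node $v$ in the DAG (including $v$ itself), $\beta_{D(v)} = (\beta_w)_{w \in D(v)}$, $d_v > 0$ are node weights, and $\lambda > 0$ is a regularization parameter:

$$
\min_{\beta \in \mathcal{F}} \; \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(y_i, \langle \beta, \phi(x_i) \rangle\bigr) \; + \; \frac{\lambda}{2} \Bigl( \sum_{v \in V} d_v \, \| \beta_{D(v)} \| \Bigr)^{2}.
$$

The graph-adapted norm $\sum_{v \in V} d_v \| \beta_{D(v)} \|$ ensures that a kernel is selected only if all of its ancestors in the DAG are selected, which is what makes the selection tractable in polynomial time in the number of selected kernels.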

Data Sets Used for the Research
In modern society, our economy heavily depends on the energy sector. Investors all over the world pay attention to oil prices, which are one of the most important global economic variables. Energy markets are closely correlated with, and economically linked to, financial markets. In determining which variables to include in our study, we note that gold and oil are two commodities commonly used to hedge against inflation. In addition, since both gold and oil are globally traded in U.S. dollars, the currency markets should also be considered. Typically, the U.S. dollar is more sensitive to oil than gold. Consequently, this study proposed to consider the possible economic and financial linkages between the oil, gold, and currency markets. The markets for oil and gold have been extensively studied; however, in this analysis, we attempt to bring together these three markets and use recent methodologies to uncover the emerging relationships.
The testing data used in this study include five major crude oil spot prices: West Texas Intermediate (WTI), Brent, Forties, Dubai, and Oman. Brent and Forties are the references for crude oil in the North Sea, WTI is the reference for the Americas, and Dubai and Oman are the references for the Middle East. This study aimed to forecast these crude oil spot prices while taking the economic and financial linkages among the oil, gold, and currency markets into account. The analysis included the gold price (New York) and the exchange rate between the U.S. dollar (USD) and the Taiwanese dollar (TWD) to enhance the predictions. In total, we had 5 crude oil spot prices (WTI, Brent, Forties, Dubai, and Oman) and 2 financial prices (the gold price and the USD/TWD exchange rate), and for every variable we considered 2 time lags. Consequently, there were 14 ((5 + 2) × 2 = 14) input time series in our model. The data covered the period from 1 May 2009 to 31 December 2010 and comprised 435 daily observations. The descriptive statistics of each variable are provided in Table 1.
Table 2 shows the p-value of the unit root test on every time series. We tested for a unit root against a trend-stationary alternative, augmenting the model with 0, 1, and 2 lagged difference terms. At the 1%, 5%, and 10% significance levels, the tests failed to reject the null hypothesis of a unit root against the trend-stationary alternative, regardless of whether 0, 1, or 2 lagged difference terms were included; namely, these time series are not stationary.
Market information is generated instantly every day, and therefore one-step-ahead forecasting is sufficient for constructing a forecasting system; we needed to adaptively adjust the model for the following day's predictions. Moreover, in online applications, one-step-ahead forecasting also prevents cumulative errors from the previous period, which is important in out-of-sample forecasting. This study used the 300 data points before the day of prediction as the training data. The DMKL model was trained in a batch manner, and the window of the training data set slides with the current prediction. The other models were trained in a similar manner, and the remaining 135 daily oil prices served as the testing data to evaluate the performance of all prediction models. Two lagged prices ($P_{t-1}$ and $P_{t-2}$, i.e., two time lags) of each asset served as the explanatory or input variables for the predictions. The flow diagram of the proposed system is shown in Figure 1.
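The rolling one-step-ahead setup described above can be sketched as follows: two lags of each of the seven series form the 14 inputs, a 300-observation window is refit each day, and the remaining days are predicted one step ahead. The file name, column names, and the placeholder SVR forecaster below are all assumptions for illustration; any of the compared models could be plugged in.

```python
# Hedged sketch of the lagged-input, sliding-window, one-step-ahead setup.
import numpy as np
import pandas as pd
from sklearn.svm import SVR

df = pd.read_csv("markets.csv", parse_dates=["date"], index_col="date")
series = ["wti", "brent", "forties", "dubai", "oman", "gold", "usd_twd"]  # assumed names
target = "wti"

# Build the design matrix with two lags per series (14 input columns)
X = pd.concat({f"{s}_lag{k}": df[s].shift(k) for s in series for k in (1, 2)}, axis=1)
y = df[target]
data = pd.concat([X, y.rename("y")], axis=1).dropna()

window, preds = 300, []
for t in range(window, len(data)):
    train = data.iloc[t - window:t]                           # sliding 300-day window
    model = SVR().fit(train.drop(columns="y"), train["y"])    # placeholder forecaster
    preds.append(model.predict(data.iloc[[t]].drop(columns="y"))[0])

forecasts = pd.Series(preds, index=data.index[window:])       # one-step-ahead forecasts
```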

Model Settings and Performance Measurements
Traditionally, researchers use the mean square error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percent error (MAPE) to measure the performance of a model. Different indices emphasize distinct parts of the errors and are suitable for different applications. This study compared the DMKL model with traditional predictors, including the auto-regressive integrated moving average (ARIMA) model, the feed-forward neural network (FFNN), and the generalized regression neural network (GRNN). This study adopted an ARIMA(1, 1, 1) model for its generally good performance; specifically, the order of the autoregressive part, the degree of differencing, and the order of the moving-average part were all set to one. The FFNN and GRNN are shallow network models with two layers. There are five sigmoid neurons in the first layer of the FFNN, and the initial spread of the radial basis functions of the GRNN was set to 1. The basis kernels used in the DAG of DMKL were the union of ANOVA kernels with full interaction. Since we had 14 input variables (7 original variables, each with two time lags), going from the first-order linear part $k(x_i, y_i)$, to the second-order interactions $k(x_i, y_i) k(x_j, y_j)$, to the third-order interactions $\prod_{i=1}^{3} k(x_i, y_i)$, . . . , up to the full interaction $\prod_{i=1}^{14} k(x_i, y_i)$ and then the outputs, there were 15 ($7 \times 2 + 1 = 15$) layers with hundreds of kernels organized by the DAG. If we were to include more input variables and more time lags, the depth of the DAG network would increase in proportion to the input dimension.
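For completeness, the following is a short sketch of the four error measures used for model comparison; y_true and y_pred stand for the 135 out-of-sample prices and any model's one-step-ahead forecasts, and the numbers in the usage line are made up.

```python
# MSE, RMSE, MAE, and MAPE for a vector of actual and predicted prices.
import numpy as np

def error_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "MAPE": 100.0 * np.mean(np.abs(err / y_true)),  # in percent
    }

print(error_metrics([80.0, 82.5, 81.0], [79.5, 83.0, 80.0]))
```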

Performance Comparison
Tables 3-6 list the results of the four models. Figures 2-6 detail the empirical results of the proposed model, including the actual oil prices, predicted values, and model residuals. These figures display the forecasting capabilities of the DMKL models and demonstrate that the proposed model can track price fluctuations instantaneously. As shown by the four tables, DMKL performed the best, FFNN was second, ARIMA third, and GRNN performed the worst. The DMKL model significantly outperformed the others and substantially reduced the forecasting errors. The FFNN, ARIMA, and GRNN are all shallow models, and they cannot compete with DMKL.

Performance Comparison Using Theil's U

Theil's U coefficient indicates how well a forecasting model performs compared with a naive no-change extrapolation. It differs from the MSE, RMSE, MAE, and MAPE indices, which emphasize only the forecasting errors. As indicated in Theil [40], "Theil's U will equal 1 if a forecasting technique is essentially no better than using a naive forecast. Theil's U values less than 1 indicate that a technique is better than using a naive forecast. Hence, a value equal to zero indicates a perfect fit, and consequently, a better model gives a U value close to zero." Theil's U can be decomposed into three components: bias, variance, and covariance. As the names suggest, the bias part accounts for the bias between actual and predicted values, the variance part represents the inequality accounted for by higher or lower variance in the simulated series, and the covariance part is the residual. Table 7 displays the model performance measured by Theil's U index.
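A hedged sketch of Theil's U in the "relative to a naive no-change forecast" form quoted above, together with the usual bias/variance/covariance decomposition of the mean squared error, is given below; conventions for these statistics vary slightly across texts, so this is one common choice rather than necessarily the exact formulas behind Table 7.

```python
# Theil's U against the naive no-change forecast, plus the bias/variance/
# covariance decomposition of the MSE into proportions.
import numpy as np

def theil_u(y_true, y_pred):
    """U < 1: better than the naive forecast y_hat[t] = y[t-1]; U = 1: no better."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    naive_err = np.diff(y_true)               # errors of the no-change forecast
    model_err = (y_pred - y_true)[1:]         # align with the naive errors
    return np.sqrt(np.sum(model_err ** 2) / np.sum(naive_err ** 2))

def mse_decomposition(y_true, y_pred):
    """Proportions of MSE due to bias, unequal variances, and imperfect covariance."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mse = np.mean((y_pred - y_true) ** 2)
    bias = (y_pred.mean() - y_true.mean()) ** 2
    var = (y_pred.std() - y_true.std()) ** 2
    cov = 2 * (1 - np.corrcoef(y_pred, y_true)[0, 1]) * y_pred.std() * y_true.std()
    return {"bias": bias / mse, "variance": var / mse, "covariance": cov / mse}
```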
As shown in Table 7, DMKL was approximately one order of magnitude better than FFNN, ARIMA, and GRNN based on the Theil's U index. Figure 7 plots the results of Table 7. Table 8 provides the average error of each model, and Figures 8-12 display the details of Table 8. As shown in Table 8, according to the performance ranking measured by average errors, DMKL was the best, followed by FFNN, then ARIMA, and lastly GRNN. The average RMSE, MAE, and MAPE errors of DMKL were approximately one quarter of those of the GRNN, and the reduction was even greater for the MSE.

Conclusions
This study focused on developing advanced techniques for oil price forecasting, which is one basis for implementing an effective hedging or trading strategy. The success of the proposed forecasting model was derived from the combination of multiple kernel machines and deep kernel representation. Deep kernel representation provides a solid foundation for extracting the key features of oil price dynamics. The kernels embedded in a directed acyclic graph provide a deep model that is good at representing complex, compositional, and hierarchical data features. This study used deep multiple kernel learning for oil price forecasting, which eliminated the drawbacks of traditional neural network and support vector machine models. DMKL is successful at high-dimensional data representation and non-linear variable selection. By using DMKL, we can select both which variables should enter the model and their corresponding degrees of interaction complexity. This study used five major crude oil prices for testing. Empirical results showed that our model was robust and systematically outperformed traditional neural networks and regression models. The new model significantly reduced the forecasting errors.
This study developed a highly effective framework for energy commodity price forecasting. The proposed model combines the strengths of kernel methods and deep learning and thereby achieves better performance. The strength of kernel methods is that they can learn a complex decision boundary with only a few parameters by projecting the data onto a potentially infinite-dimensional reproducing kernel Hilbert space. On the basis of kernel methods and deep learning, the proposed model works by combining multiple kernels within each layer to increase the richness of representations, and by stacking many layers to process a signal in an increasingly abstract manner. Oil price dynamics are complex, nonlinear, and non-stationary. Traditional models tend to be linear, parametric, and shallow, and are therefore not suitable for oil price forecasting. Extracting data features in an abstract manner using a directed acyclic graph (as in our study) is a good strategy to handle complex oil price dynamics.
In summary, the effective framework of this study is also suitable for applications in other forecasting problems. With the leverage of cloud computing, or multiple GPUs on the CUDA (Compute Unified Device Architecture) platform, the system can be applied to online forecasting. Energy commodity investors can also apply the proposed system to effectively hedge their risk in global investments.

Implications and Limitations of This Study, and Suggestions for Future Research
Oil is an important energy commodity, and its price is influenced by many factors, which makes capturing its dynamics quite challenging and leads to difficulties in forecasting. However, with the advances in electronic transactions, vast amounts of financial market data can be collected in real time. Owing to this real-time information flow, global markets are closely correlated with instant interactions, especially the oil and financial markets. This study used information from the oil, gold, and currency markets to serve as multiple inputs for our forecasting system. Considering more real-time information from global markets would not be difficult in future research; however, the computational load would be heavy. Implementing the algorithm on an IC (integrated circuit) chip is a good solution for achieving real-time responses.
There are certain limitations in this study, which may in turn provide fruitful avenues for future studies. First, the DMKL model working in the time domain may not be very effective at capturing oil price dynamics. Transforming to a good feature space, such as the wavelet domain, could enhance the prediction. However, this would require more computation, and the load on our algorithm would be heavier. Second, for simplicity and to reduce the computational load, this study employed a global model. The weakness of global models is that they cannot fit each dynamic region very well; however, their strength is that they are easy to implement and are suitable for online applications. Third, this study used data sets from the oil, gold, and currency markets only. Other factors are also influential in oil prices, such as supply, demand, GDP, consumer price levels, and commodities markets, and future studies may consider these variables. Fourth, trading is also an important issue for future research. There are many trading strategies in finance, for example, price trading, volatility trading, paired trading, and hedge trading, which were beyond the scope of this study. Further investigation is required to determine how to effectively use the forecasting power of this study for trading requirements. Finally, the volume of market data that can be collected has become very large. Complex high-dimensional data tends to obscure the essential features of the data. Identifying the intrinsic characteristics and structure of high-dimensional data is important for various fields of research, not limited to oil price forecasting. Due to the curse of dimensionality, considering sparse modeling (coding) or dimensionality reduction (such as manifold learning) for high-dimensional data will be very important for improving performance.