Deep Hybrid Model Based on EMD with Classification by Frequency Characteristics for Long-Term Air Quality Prediction

Abstract: Air pollution (mainly PM2.5) is one of the main environmental problems affecting air quality. Air pollution prediction and early warning are prerequisites for air pollution prevention and control. However, it is not easy to accurately predict the long-term trend, because the collected PM2.5 data have complex nonlinearity with multiple components of different frequency characteristics. This study proposes a hybrid deep learning predictor, in which the PM2.5 data are first decomposed into components by empirical mode decomposition (EMD), and a convolutional neural network (CNN) is built to classify all the components into a fixed number of groups based on their frequency characteristics. Then, a gated-recurrent-unit (GRU) network is trained for each group as a sub-predictor, and the results of the three GRUs are fused to obtain the prediction result. Experiments based on PM2.5 data from Beijing verify the proposed model, and the prediction results show that the decomposition and classification greatly improve the accuracy of the proposed predictor for air pollution prediction.


Introduction
With the rapid development of industrialization and urbanization, urban air pollution has become increasingly serious, greatly affecting the living environment and public health. Presently, air quality monitoring stations record hourly data to monitor a city's PM2.5 and other air pollutants. Based on the collected data, the prediction of air pollution and of the evolution of the PM2.5 concentration is considered a key issue in air quality monitoring for environmental air protection and governance [1].
Accurate long-term prediction provides more time to deal with air pollution in advance, so long-term prediction of air quality is more important than short-term prediction. However, since air quality data, such as PM2.5, PM10, SO2, etc., are complex nonlinear time series with multiple components of different frequency characteristics, long-term prediction remains an open research issue.
Prediction methods have been widely applied to time series data. In general, time series prediction methods can be roughly divided into two categories: traditional probabilistic methods [2] and machine learning models [3]. The disadvantage of traditional time series prediction methods is that the model parameters need to be determined based on theoretical assumptions and prior knowledge of the data. When the actual distribution of the data does not match the model's assumptions, they cannot give accurate long-term prediction results.
Unlike traditional time series prediction methods, machine learning methods do not require prior physical information or linear assumptions. As long as the historical data are known, model parameters can be learned based on preset rules and algorithms, and hidden relationships and knowledge in the data can be obtained. Deep learning networks have surpassed traditional time series prediction methods in many nonlinear modeling fields.
Although, in theory, deep learning networks can model any complex nonlinear time series data, they cannot achieve sufficiently accurate long-term predictions of PM2.5 data due to the limited amount of training data and the size of the network. In the following Section 2, we describe related work on current prediction methods for time series data and discuss the pros and cons of traditional probabilistic methods and machine learning models. At the end of Section 2, the innovative contributions of this paper are discussed. Then, Section 3 introduces each part of the proposed predictor. Experimental results of the proposed predictor are shown and evaluated in Section 4, and finally, Section 5 presents our conclusions.

Related Works
Traditional methods include chemical analysis models and prediction models based on statistical theory. In physical models, the equations are established based on the physical relationships between variables, and the prediction models then analyze and predict the time series data [2]. The statistical models used include the autoregressive moving average (ARMA) [4][5][6], autoregressive integrated moving average (ARIMA) [7], threshold autoregressive (TAR) model [8], hidden Markov model (HMM) [9], etc. In [10], Berrocal et al. discussed geostatistical techniques, such as kriging, and spatial statistical models that use the information contained in air quality model outputs, such as statistical downscaling models.
These traditional time series prediction methods determine the model parameters based on theoretical assumptions and prior knowledge of the data. However, it is often difficult to know the prior parameters because of the lack of background knowledge. When the data distribution does not match the model's hypothesis, a mismatch between the model and the data arises, which seriously affects the accuracy of the prediction. Therefore, traditional methods are seldom applied in complex environmental applications. Different from traditional time series prediction methods, machine learning methods require no prior physical information or linear assumptions. These methods build prediction models based on learning algorithms and historical data, and operate on preset rules and algorithms to adaptively learn model parameters and obtain hidden relationships and knowledge in the data. The trained model is then used to predict the future development trends and patterns of the time series. The prediction is based on mathematical models, and various parameter identification methods can be used, such as iterative algorithms [11][12][13], particle-based algorithms [14], and recursive algorithms [15][16][17].
Regression models were initially used for predictive tasks. Tang et al. [18] applied a linear regression method to predict PM2.5 emissions in the Northeast United States from 2002 to 2013 based on fine-resolution aerosol optical depth. Oteros et al. [19] used different factors of pollen concentration, taking into account the extreme weather events characteristic of the Mediterranean climate, to establish a multivariate regression model. Donnelly et al. [20] presented a real-time approach based on multiple linear regression for air quality prediction. However, these linear regression models all face challenges in prediction tasks with highly nonlinear time series data.
Shallow networks, such as artificial neural networks (ANNs), ensemble extreme learning machines, and gradient boosting machines, play a key role in solving nonlinear problems [21][22][23][24]. Ni et al. [25] developed a short-term prediction method for PM2.5 concentrations based on a back-propagation (BP) neural network using meteorological data, including regional average rainfall, daily mean temperature, average relative humidity, average wind speed, and maximum wind speed, and other pollutant concentration data, including CO, NO2, SO2, PM10, etc. In [26], a prediction model based on classification and regression trees and ensemble extreme learning machines was developed to split the dataset into subsets in a hierarchical fashion and build a prediction model for each leaf. Bai et al. studied a combined prediction method using a shallow nonlinear autoregressive network (NAR) on the basis of BP [27], and proposed a prediction method across the time and space dimensions using shallow networks [28]. Du et al. [29] developed a geographically-weighted gradient boosting machine to address the spatial non-stationarity of the relationships between PM2.5 concentrations, aerosol optical depth (AOD), and meteorological conditions. The above results show that these machine learning methods, which mainly rely on shallow network structures, can only obtain accurate results in short-term prediction. The reason is that their network structures are relatively simple and can only capture short-term changes. Therefore, to achieve accurate long-term prediction performance, more complex networks that can capture long-term changes must be used.
Compared with shallow networks, deep learning networks have shown outstanding ability to model complex time series relations. The recurrent neural network (RNN) [30] for time series prediction has attracted extensive attention from researchers because it can capture the high nonlinearity of time series data. Yadav et al. [31] predicted hourly/daily/monthly average solar radiation with an RNN and proposed an adaptive learning rate for it; the approach was found promising when compared to the multilayer perceptron. As an improved version of the RNN, long short-term memory (LSTM) has replaced it as a popular time series prediction technology [32,33]. The gated recurrent unit (GRU) [34] inherits the advantages of LSTM in automatically learning features and efficiently modeling long-term dependencies, and it also exhibits a significant increase in computational speed.
Although deep learning models can extract more accurate information from complex environments, the accuracy of long-term prediction still needs further improvement, because the PM2.5 concentration series has the characteristics of randomness, strong nonlinearity, and non-stationarity due to the influence of meteorological factors and atmospheric pollutants. For air quality, long-term precision prediction, by which we mean predicting 20 to 30 hourly steps ahead, is more meaningful for the management of environmental air protection and governance. Therefore, obtaining accurate long-term predictions has been an active research field.
The researchers have found that PM2.5 data have complex nonlinearities with multiple components of different frequency characteristics [35]. In recent years, the combined methods based on data decomposition have been proven effective in improving the prediction performance, and various hybrid models have been introduced to predict nonlinear time series data.
For example, wavelet decomposition [36,37] decomposes the data into multi-scale components by choosing a suitable wavelet basis function; each component sequence is then predicted, and the predictions are reconstructed. Another decomposition method, the seasonal-trend decomposition procedure based on loess (STL) [38,39], yields seasonal components and has been used to model air pollen for short-term prediction. In our previous research, we proposed an integrated predictor for PM2.5 long-term prediction based on STL [35], in which STL decomposes PM2.5 into three components, i.e., trend, period, and residual; ARIMA was used to model the trend component, and two GRU networks modeled the period and residual components, respectively. Different from wavelet decomposition and STL, empirical mode decomposition (EMD) can decompose time series data into intrinsic mode function (IMF) components with different frequency characteristics [40]. Each component obtained by EMD is a local characteristic signal at a different timescale of the original series, representing one frequency component of the original signal; the components are arranged from high frequency to low frequency and are independent of each other. The process of EMD decomposition is effectively a simplification of a complex time series. In [41], the EMD decomposition was treated as a denoising procedure for the training data, and prediction results were obtained by support vector regression (SVR) based on different feature vectors; the results showed the superiority of the model in power load prediction. In some other studies, prediction methods combined with EMD treated the first high-frequency IMF sequence, which did not contribute to the prediction result, as a noise term and discarded it [42,43].
Qiu et al. [44] presented an integrated method based on the EMD algorithm and a deep learning method, in which the load demand sequence was first decomposed into several IMFs. The extracted IMFs were then modeled by a deep belief network (DBN) containing two restricted Boltzmann machines (RBMs) to accurately predict the evolution of each IMF. The predictions of the individual models were finally combined by an addition operation to obtain the total load demand output. Wang et al. [45] introduced a feedforward neural network (FNN) into the EMD-based prediction framework, proposed a weighted reorganization strategy, and conducted single-step prediction experiments on four nonlinear non-stationary sequences to verify and compare the effectiveness of the proposed model. Bedi et al. [46] combined EMD with LSTM to estimate the electricity demand for a given time interval; the performance of the proposed approach was evaluated by comparison with the prediction results of RNN and EMD-with-RNN models.
The above combined models [44][45][46] were based on the following framework: the original time series data were first decomposed into several IMFs and one residue by the EMD method; a prediction model, such as LSTM, was then applied independently to each IMF, including the residue; finally, the prediction results of all IMFs were aggregated by simple summation to produce an ensemble prediction of the original time series. We must mention that, because the frequency components in each segment differ, the number of IMF components obtained by EMD decomposition also differs. This results in different numbers of trained models and online prediction models, but the above references do not explain how to solve this problem.
Significantly different from previous studies, we will combine IMF components to achieve the unification of the training model and prediction model in practical applications. Our innovative contributions are highlighted as follows.
(1) After EMD, the obtained IMF components are further analyzed for their frequency characteristics, and all the components are divided into a fixed number of groups by a convolutional neural network (CNN). Different from [44][45][46], the fixed number effectively solves the problem that a variable number of IMF components is obtained when predicting over different time intervals. (2) We present a general framework that predicts PM2.5 data from air quality monitoring systems and obtains accurate long-term predictions that can meet the precision needs of air quality warning.

Hybrid Deep Predictor
The proposed predictor has a hybrid structure: the data are decomposed by EMD to reduce their nonlinear complexity, the frequency characteristics of the IMFs are analyzed, and all the components are divided into a fixed number of groups by a CNN. For each group, a deep learning GRU network is used for modeling and prediction, and finally, all the GRU predictions are added to obtain the prediction result. Next, we detail each part of the proposed model and provide its flowchart.

Decomposition and Analysis of PM2.5 Time Series
EMD decomposes complex signals into a finite number of IMFs automatically based on the frequency characteristics of the data, and the decomposed IMF components contain local characteristic information of different timescales of the original signal, which should satisfy the following conditions: (1) over the entire time range, the absolute value of the difference between the number of zero crossings and the number of extreme points is equal to 0 or 1; and (2) the mean value of the envelope constructed by local maxima and minima must be zero at any point. EMD is an adaptive data processing or mining method and is essentially a smoothing process for time series data (or signals). It can theoretically be applied to the decomposition of any type of time series (or signal).
Assume D_t is the time series to be decomposed and h_t is the component to be extracted. The decomposition process is as follows [47]: (1) Identify the local maxima of the given time series D_t and fit them with a cubic spline interpolation function to form the upper envelope of the original data.
(2) Similarly, find the local minima of D_t and fit all the minima with the cubic spline interpolation function to form the lower envelope of the original data. (3) Compute the mean envelope m_t as the average of the upper and lower envelopes, and subtract it from the series: h_t = D_t − m_t. If h_t satisfies the IMF conditions, it is taken as an IMF; otherwise, the sifting is repeated with h_t as the input. (4) Subtract the extracted IMF from the series to obtain the residue r_t = D_t − h_t, and repeat steps (1)-(3) on r_t until the residue becomes monotonic. In the above decomposition algorithm, m_t is the average value of the upper envelope and the lower envelope, and r_t represents the difference between the sequence and the IMF. The traditional EMD algorithm performs spline interpolation only on the extreme points, making the derivative at the boundary of the IMF component large and resulting in an end defect. Therefore, in this study, the first point and the endpoint of the sequence curve were added as extreme points to the spline to avoid the end defect.
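The sifting loop above can be sketched in a few lines of Python. This is a simplified illustration only (a fixed sifting count instead of the usual stopping criterion), not the implementation used in this study; it does include the endpoint treatment described above.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def sift_once(d):
    """One sifting step: subtract the mean of the upper and lower envelopes."""
    t = np.arange(len(d))
    # interior local maxima and minima
    maxima = np.where((d[1:-1] > d[:-2]) & (d[1:-1] > d[2:]))[0] + 1
    minima = np.where((d[1:-1] < d[:-2]) & (d[1:-1] < d[2:]))[0] + 1
    if len(maxima) < 2 or len(minima) < 2:
        return None  # too few extrema: d is a residue, not an IMF
    # include both endpoints as extrema to reduce the end defect
    up = CubicSpline(np.r_[0, maxima, len(d) - 1], np.r_[d[0], d[maxima], d[-1]])(t)
    lo = CubicSpline(np.r_[0, minima, len(d) - 1], np.r_[d[0], d[minima], d[-1]])(t)
    m = (up + lo) / 2.0          # envelope mean m_t
    return d - m                 # candidate h_t = D_t - m_t

def emd(d, max_imfs=10, n_sift=10):
    """Decompose d into IMFs plus a residue (fixed sifting count)."""
    imfs, residue = [], np.asarray(d, dtype=float)
    for _ in range(max_imfs):
        h, stopped = residue.copy(), False
        for _ in range(n_sift):
            nxt = sift_once(h)
            if nxt is None:
                stopped = True
                break
            h = nxt
        if stopped:
            break
        imfs.append(h)
        residue = residue - h    # r_t = D_t - h_t becomes the new series
    return imfs, residue
```

By construction, the sum of the extracted IMFs plus the final residue reproduces the original series, which is the property the hybrid predictor relies on when the sub-predictions are added back together.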
To show the time- and frequency-domain characteristics of the different components, we take PM2.5 data with 2500 samples from the air quality monitoring systems as an example and give the results of EMD decomposition. In Figure 1a, all the obtained IMFs are shown in the time domain (from IMF-0 to IMF-10), and correspondingly, each sub-picture on the right-hand side (Figure 1b) shows the frequency spectrum of the respective IMF. To further illustrate the different frequency components contained in each IMF, we perform a one-dimensional convolution on the spectrum of each IMF: y(k) = Σ_j S(j) g(k − j), where S(j) is the frequency spectrum of the IMF and g(x) is a Gaussian kernel function. The results of the convolution of the IMFs with the Gaussian kernel function g(x) are shown in Figure 2. It can be clearly seen that the IMF-0 component contains a wide band of frequency components. In contrast, the frequency components contained in IMF-1 and IMF-2 are significantly reduced, but there are still long tails in the cutoff band. For IMF-3 and IMF-4, the downhill slope is significantly steeper, indicating that fewer frequency components are included. Furthermore, for IMF-5 to IMF-10, the downslope is almost vertical, and the fluctuations of these components in the time-domain plot (Figure 1a) are relatively flat. It can be seen that the frequency characteristics of the IMFs differ; compared with the original data, each IMF contains only partial frequency components. Therefore, after decomposing and predicting each component separately, more accurate prediction results can be obtained.
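The spectral smoothing described above can be reproduced with a short numpy sketch; the kernel width and the synthetic test signals below are illustrative assumptions, not the paper's data.

```python
import numpy as np

def smoothed_spectrum(imf, sigma=3.0):
    """Magnitude spectrum of an IMF, smoothed by convolution with a
    Gaussian kernel g(x), as used to compare the IMFs' bandwidths."""
    spec = np.abs(np.fft.rfft(imf))
    x = np.arange(-3 * int(sigma), 3 * int(sigma) + 1)
    g = np.exp(-x ** 2 / (2 * sigma ** 2))
    g /= g.sum()                         # normalize the kernel
    return np.convolve(spec, g, mode='same')

# a fast and a slow oscillation stand in for a high- and low-frequency IMF
t = np.linspace(0, 10, 4000)
s_hi = smoothed_spectrum(np.sin(2 * np.pi * 50 * t))
s_lo = smoothed_spectrum(np.sin(2 * np.pi * 2 * t))
```

Comparing where the smoothed spectra peak (and how quickly they fall off) is what distinguishes wide-band components like IMF-0 from the nearly flat, low-frequency ones.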
Moreover, we found that the number of IMFs varies across time periods. The decomposition result of the EMD algorithm depends on the original time series itself, so the number of IMFs obtained by EMD usually differs between prediction intervals. As shown in Figure 3, we performed EMD on three different intervals, [0, 2400), [2400, 4000), and [4000, 6400), of the PM2.5 data, and the numbers of IMFs obtained were 11, 10, and 9, respectively. Consequently, the number of trained prediction sub-models would differ across prediction intervals. Therefore, it is necessary to combine the unfixed number of IMFs into a fixed number of groups according to their frequency characteristics.
In this study, according to the respective frequency characteristics of the IMFs, we combine all the IMFs into a fixed number of groups. That is, decomposition components with similar frequency characteristics are labeled, grouped, and added together; then one model is trained for each group. Therefore, the number of models in each prediction interval is fixed.
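The combination step itself is a simple per-group summation. In the sketch below the labels are supplied by hand purely for illustration; in the proposed model they come from the CNN classifier of the next subsection.

```python
import numpy as np

def combine_imfs(imfs, labels, n_groups=3):
    """Sum the IMFs that share a frequency-group label, yielding a fixed
    number of series S_t regardless of how many IMFs EMD produced."""
    n = len(imfs[0])
    groups = [np.zeros(n) for _ in range(n_groups)]
    for imf, lab in zip(imfs, labels):
        groups[lab] += imf
    return groups

# e.g. 5 IMFs mapped onto 3 groups (labels here are hand-picked examples)
rng = np.random.default_rng(2)
imfs = [rng.standard_normal(50) for _ in range(5)]
groups = combine_imfs(imfs, labels=[0, 0, 1, 2, 2])
```

Because the grouping is additive, no information is lost: the three group series still sum to the sum of all IMFs, so the final fused prediction remains a prediction of the original series.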

Classification and Combination for IMFs
Convolution can effectively capture the dynamic changes of a signal and extract the mode characteristics of those changes. This approach has achieved good results in many pattern recognition applications, so we use a CNN to classify and group the IMFs.
The IMF components to be processed in this study are one-dimensional discrete time series. Therefore, we use one-dimensional convolution in the convolutional layers to construct a one-dimensional CNN, which is suitable for feature extraction from IMF sequences. Given an input IMF sequence X_t, t = 1, ..., n, and a convolution kernel w, the filters sequentially perform a local convolution operation on the input features of the previous layer. The output of the convolution is c = f(w ∗ X + b), where b is the bias. The convolutional layer needs an activation function f(·) for nonlinear feature mapping. In this study, the rectified linear unit (ReLU), with its fast convergence speed, is selected as the activation function: f(x) = max(0, x). Then, through flattening and a fully connected layer [48], the one-dimensional CNN extracts the frequency characteristics of the IMFs, and the Softmax classifier classifies the features to produce the network output, i.e., the label of each IMF. The schematic of the one-dimensional CNN is shown in Figure 4, where X_t is the input IMF sequence and y is the output label of each IMF. Consequently, the IMFs in each group are added together (denoted S_t) and used as the input data for the GRU network.
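A numpy sketch of the forward pass just described (local convolution, ReLU, flattening, a fully connected layer, and Softmax) is given below. The random weights and the filter shapes are illustrative assumptions; the actual classifier is trained in Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)            # f(x) = max(0, x)

def conv1d(x, kernels, b):
    """Local convolution of the input sequence with each filter:
    c = f(w * X + b), one column per filter."""
    k = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, k)
    return relu(windows @ kernels.T + b)  # shape (n - k + 1, n_filters)

def softmax(z):
    e = np.exp(z - z.max())               # numerically stable
    return e / e.sum()

# toy forward pass: 32 filters of width 5, then flatten + dense + softmax
x = rng.standard_normal(100)                        # one IMF sequence X_t
W_conv = rng.standard_normal((32, 5)) * 0.1
feat = conv1d(x, W_conv, b=np.zeros(32)).ravel()    # flattening
W_fc = rng.standard_normal((3, feat.size)) * 0.01   # 3 group labels
y = softmax(W_fc @ feat)                            # class probabilities
```

The output y is a probability vector over the three groups; the label of the IMF is its argmax, and the IMFs sharing a label are then summed into S_t.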

Deep Prediction Network for Combined IMFs
In this study, the GRU network was designed to train on and predict the IMF groups. Using the known input and output data, the network is trained by the stochastic gradient descent algorithm to obtain the optimal weights. The GRU network was trained on the sum of the IMF sequences in each group. The GRU network consists of multiple GRU cells, and here the number of hidden layers is set to 2. As shown in Figure 5, the GRU uses the update gate to control the degree to which the state information of the previous moment is brought into the current state: the larger the value of the update gate, the more state information is brought in from the previous moment. The reset gate is similar to the forget gate in LSTM and is used to control the degree to which the state information of the previous moment is ignored: the smaller the reset gate value, the more information is neglected.
The forward propagation formulas in each GRU cell are as follows [49]:

z_t = σ(W_z x_t + U_z h_{t−1} + b_z)
r_t = σ(W_r x_t + U_r h_{t−1} + b_r)
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1}) + b_h)
h_t = z_t ⊙ h_{t−1} + (1 − z_t) ⊙ h̃_t

where W and U represent weight matrices; b represents bias vectors; ⊙ is element-wise multiplication; and σ and tanh are activation functions. The GRU is trained by the gradient descent algorithm, and the parameters are continually updated until convergence. The methods proposed in this paper can be combined with other identification approaches [50] to study modeling and prediction problems in other fields [51,52], such as internet of things systems [53,54] and water environment prediction and management control [55,56].
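The cell update can be written directly from these formulas. The sketch below follows the update-gate convention described above (a larger z keeps more of the previous state; some references use the opposite convention) and uses small random weights purely for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, p):
    """One GRU cell update; p holds the weights W_*, U_* and biases b_*."""
    z = sigmoid(p['Wz'] @ x + p['Uz'] @ h_prev + p['bz'])   # update gate
    r = sigmoid(p['Wr'] @ x + p['Ur'] @ h_prev + p['br'])   # reset gate
    h_tilde = np.tanh(p['Wh'] @ x + p['Uh'] @ (r * h_prev) + p['bh'])
    # larger z -> more of the previous state is carried forward
    return z * h_prev + (1.0 - z) * h_tilde

rng = np.random.default_rng(1)
nx, nh = 1, 8                       # scalar input per step, 8 hidden units
p = {k: rng.standard_normal((nh, nx)) * 0.1 for k in ('Wz', 'Wr', 'Wh')}
p.update({k: rng.standard_normal((nh, nh)) * 0.1 for k in ('Uz', 'Ur', 'Uh')})
p.update({k: np.zeros(nh) for k in ('bz', 'br', 'bh')})

h = np.zeros(nh)
for x_t in np.array([0.3, 0.5, 0.2]):   # a short (normalized) input window
    h = gru_step(np.array([x_t]), h, p)
```

Because h_t is a convex combination of h_{t−1} and the tanh-bounded candidate h̃_t, the hidden state stays bounded in (−1, 1), which is one reason GRUs train stably on long sequences.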

Hybrid Model Framework
In conclusion, based on the discussion in Sections 3.1-3.3, the proposed deep learning predictor is shown in Figure 6. The number of groups is fixed at three, a choice determined experimentally: verifying the proposed predictor on PM2.5 data from an air quality monitoring system showed that three groups of IMFs maintain high performance at a low computational cost.
The hybrid predictor includes two processes: training and predicting. The first process trains the CNN and the GRUs based on the IMFs decomposed by EMD. The details are as follows: (1) Decompose the data into IMFs by EMD and label each IMF into one of three groups (Group 1-3) based on its frequency characteristics. (2) Train the CNN with the IMFs and labels, and add the sequences within each group together.
(3) Train GRU models for each group to get three GRU sub-predictors.
The predicting process is to predict the future trends of data by the trained networks, and it is implemented as follows: (1) Decompose the input data into IMFs by EMD.
(2) Use the CNN to classify the IMFs into three groups, and add the sequences of the same group together. (3) Predict each group with its trained GRU sub-predictor, and sum the three predictions to obtain the final result.
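The predicting process can be outlined in code as follows. Every component here (the decomposition, the classifier, and the per-group sub-predictor) is a deliberately simple stand-in, not the EMD/CNN/GRU models of this paper; the sketch only shows how the fixed-group pipeline fits together.

```python
import numpy as np

def decompose(x):
    """Stand-in for EMD: split x into a 'fast' and a 'slow' part."""
    slow = np.convolve(x, np.ones(5) / 5, mode='same')
    return [x - slow, slow]

def classify(imfs, n_groups=3):
    """Stand-in for the CNN: label components by variance rank."""
    order = np.argsort([-np.var(c) for c in imfs])
    return [min(int(np.where(order == i)[0][0]), n_groups - 1)
            for i in range(len(imfs))]

def predict_group(s, horizon):
    """Stand-in sub-predictor: persistence forecast."""
    return np.full(horizon, s[-1])

def hybrid_predict(x, horizon=24, n_groups=3):
    imfs = decompose(x)                       # step (1)
    labels = classify(imfs, n_groups)         # step (2): fixed group count
    groups = [np.zeros(len(x)) for _ in range(n_groups)]
    for c, lab in zip(imfs, labels):
        groups[lab] += c
    # step (3): sum of the sub-predictions gives the final forecast
    return np.sum([predict_group(g, horizon) for g in groups], axis=0)
```

The point of the structure is that the number of sub-predictors never changes, no matter how many components the decomposition produces for a given interval.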

Dataset and Experimental Setup
Our experiments are based on the PM2.5 dataset from the US Department of State [57]. The dataset includes 37,704 records of hourly PM2.5 average concentration data in Beijing from January 2013 to December 2017. To assess the prediction performance of the different models, we selected the first 75% of the data for training and the remaining 25% for testing. The range of the test data is from 2 μg/m³ to 601 μg/m³. The open-source deep learning library Keras, based on TensorFlow, was used to build all of the learning models. All of the experiments were performed on a PC server with an Intel Core i5-4200U CPU at 1.60 GHz and 4 GB of memory. The default parameters in Keras were used for deep neural network initialization (e.g., weight initialization). ReLU was used as the activation function of the CNN, which was set with 32 convolution kernels in each layer. The group labels were encoded by one-hot encoding. For the sub-predictors, we designed GRU networks with two hidden layers, and the Adam algorithm was selected as the GRU's optimization method.
In the experiments, long-term prediction 24 steps ahead was considered, in which we used the PM2.5 data of the previous 24 h to predict the next 24 h. To compare the accuracy of the proposed predictor, five evaluation indicators are used: the root mean square error (RMSE), normalized root mean square error (NRMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE), and Pearson correlation coefficient (R), calculated as

RMSE = sqrt((1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²)
NRMSE = RMSE / (y_max − y_min)
MAE = (1/N) Σ_{i=1}^{N} |ŷ_i − y_i|
SMAPE = (100%/N) Σ_{i=1}^{N} |ŷ_i − y_i| / ((|y_i| + |ŷ_i|)/2)
R = cov(y, ŷ) / (σ_y σ_ŷ)

where N is the number of predicted samples, y_i is the observation, and ŷ_i is the prediction.
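The five indicators can be computed with a few lines of numpy. Note that the normalization used for NRMSE varies between papers; the observed range is used here as an assumption.

```python
import numpy as np

def metrics(y, yhat):
    """RMSE, NRMSE, MAE, SMAPE (%) and Pearson R for one prediction run."""
    y, yhat = np.asarray(y, float), np.asarray(yhat, float)
    err = yhat - y
    rmse = np.sqrt(np.mean(err ** 2))
    nrmse = rmse / (y.max() - y.min())   # range-normalized (assumed convention)
    mae = np.mean(np.abs(err))
    smape = 100.0 * np.mean(np.abs(err) / ((np.abs(y) + np.abs(yhat)) / 2))
    r = np.corrcoef(y, yhat)[0, 1]
    return rmse, nrmse, mae, smape, r
```

RMSE, NRMSE, MAE, and SMAPE all decrease toward 0 for better predictions, while R increases toward 1, which is why the two kinds of indicators are read in opposite directions in the tables that follow.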
The comparison of the two cases shows that our proposed model is effective in predicting PM2.5. From Case 1 (Section 4.2), it can be seen that decomposition and prediction using the EMD technique are effective. In Case 2 (Section 4.3), comparison with the results of different combinations of IMF components shows that the proposed combination method has advantages in the long-term prediction of PM2.5.

Case 1: Prediction Performance Analysis of Different Predictors
In this experiment, in addition to the proposed method, six other models are used to predict the PM2.5 average concentration 24 h ahead: RNN [31], LSTM [32], GRU [34], Decomposition-ARIMA-GRU-GRU [35], and two hybrid variants that first decompose the data by EMD and classify the components with the CNN, then use RNN [31] and LSTM [32] as the sub-predictors, respectively. The PM2.5 data introduced in Section 4.1 are used to show the prediction results. Figures 7 and 8 give the predictions of hourly PM2.5 average concentration data in Beijing from 1 to 20 December 2017 by RNN [31], LSTM [32], GRU [34], EMDCNN_RNN (EMD- and CNN-based RNN), EMDCNN_LSTM (EMD- and CNN-based LSTM), and the proposed method.
To evaluate the performance of each method numerically, Table 1 gives a comparison of the prediction results in terms of RMSE, NRMSE, MAE, SMAPE, and R. The smaller the values of RMSE, NRMSE, MAE, and SMAPE, the more accurate the predictions; a higher value of R indicates a better fit between prediction and observation. It is apparent from the comparison that the decomposition models significantly outperform the undecomposed ones, and the proposed EMDCNN_GRU model gives more accurate predictions than the other models, with the smallest RMSE, NRMSE, MAE, and SMAPE. The correlation coefficient, R, of the EMDCNN_GRU model is above 0.8.
The results show that decomposing PM2.5 into multiple components effectively improves prediction accuracy, as the RMSEs are significantly reduced. The reason is that, affected by weather, pollution emissions, and regional relationships, PM2.5 contains multiple components, which makes long-term prediction difficult. Modeling the decomposed components reduces this difficulty, so that the prediction accuracy of each component can be guaranteed and the accuracy of the overall result improved. Moreover, accurate modeling of each component is necessary: the more accurate the prediction of each component, the more accurate the final prediction result. Accordingly, we found that different sub-predictors obtain different prediction results; consistent with many research results, GRU performs better than LSTM and RNN.

Case 2: Prediction Performance Analysis of Different Combinations for IMFs
In this experiment, the collected hourly PM2.5 data in Beijing from January 2013 to December 2017 are used to show the prediction results. We conclude that mode No. 5, comprising three groups with Group 1: {IMF 0-2}, Group 2: {IMF 3-4}, and Group 3: {IMF 5-10}, is the suitable mode for long-term (24-h ahead) prediction of PM2.5. Table 2 lists the prediction performances of 12 different modes. For each mode, we used a different combination mode to train the CNN and obtained a different number of groups; GRUs were then trained for each group. The details are as follows, where each parenthesis represents a group, and several parentheses indicate that several GRU sub-predictors need to be built.
The RMSEs of modes No. 6 and No. 7 are 46.0065 and 45.0356, respectively. Compared with mode No. 5, the RMSE, NRMSE, MAE, and SMAPE of modes No. 6 and No. 7 are slightly reduced, but 4 or 5 GRUs have to be trained, so the training parameters increase by 1/3 and 2/3, respectively. The results show that decomposition of the data is effective, but the data should not be excessively decomposed. We believe the reason is that the prediction of each decomposed component produces a certain error; if too many decomposed components are used for prediction, the errors of the individual components accumulate and reduce the accuracy of the final prediction result. In addition to prediction accuracy, the amount of computation in the air quality monitoring system is also important. To reduce the cost of the actual system, we choose the method with the lower computational cost. Therefore, to ensure the prediction performance while limiting the parameter cost, mode No. 5 with three groups is a suitable choice for the PM2.5 data in the air quality monitoring application.
Further, the comparison of mode 1 and mode 3 shows that, for the PM2.5 data, the RMSE of mode 1 is 58.7715, which is smaller than the 59.6399 of mode 3. However, this difference is small: since the average value of PM2.5 is about 100, such a small RMSE difference has little impact on the prediction result. Therefore, whether IMF-0 is used as a component for prediction or discarded as mere noise makes little difference to the final performance. We think the reason may be that the PM2.5 data are chaotic, so removing noise cannot fundamentally improve prediction performance. Accordingly, in the proposed prediction structure, IMF-0 is not treated as noise and discarded.

Conclusions
In recent years, environmental problems such as PM2.5 have seriously affected people's normal lives, and air quality has begun to receive more and more public attention. Due to the influence of meteorological factors and atmospheric pollutants, the PM2.5 concentration series has the characteristics of randomness, strong non-linearity, and non-stationarity with multiple components of different frequency characteristics. Therefore, accurate long-term prediction is still a challenge.
This study proposes a hybrid deep learning predictor based on EMD and a GRU group model to predict complex PM2.5 concentration time series. The key issue is combining the IMF components obtained by the EMD algorithm, so as to solve the problem of inconsistent IMF component counts across different periods of the time series and improve the precision of prediction. A CNN was used to classify the IMFs into groups based on their frequency features, and GRUs were trained as sub-predictors for separate predictions. The prediction results of the sub-predictors are finally added to obtain the final prediction result. The proposed predictor can obtain accurate predictions for the next 24 h, which can provide advance information for air pollution prevention and management.
The proposed method has universal characteristics and can be used for the prediction of other data, such as meteorological data (air temperature, wind speed, relative humidity, etc.) and other air pollutant measurements (PM10, SO2, etc.).