Sound Levels Forecasting in an Acoustic Sensor Network Using a Deep Neural Network

Wireless acoustic sensor networks are nowadays an essential tool for monitoring and managing noise pollution in cities. The increased computing capacity of the nodes that make up the network allows the addition of processing algorithms and artificial intelligence that provide more information about the sound sources and environment, e.g., detecting sound events or calculating loudness. Several models to predict sound pressure levels in cities are available, mainly for road, railway and air traffic noise. However, these models are mostly based on auxiliary data, e.g., vehicle flow or street geometry, and predict long-term equivalent levels. Therefore, forecasting short-term sound levels could be a helpful tool for urban planners and managers. In this work, a Long Short-Term Memory (LSTM) deep neural network technique is proposed to model the temporal behavior of sound levels at a certain location, both sound pressure level and loudness level, in order to predict near-future values. The proposed technique can be trained for and integrated in every node of a sensor network to provide novel functionalities, e.g., early warning against noise pollution and backup in case of node or network malfunction. To validate this approach, one-minute equivalent sound levels, captured in a two-month measurement campaign by a node of a deployed network of acoustic sensors, have been used to train it and to obtain different forecasting models. The developed LSTM models and Auto Regressive Integrated Moving Average (ARIMA) models were assessed on predicting sound levels for several time periods, from 1 to 60 min. Comparison of the results shows that the LSTM models outperform the statistics-based models. In general, the LSTM models achieve a root mean square error of less than 4.3 dB for sound pressure level and less than 2 phons for loudness. Moreover, the goodness of fit of the LSTM models and the behavior pattern of the data in terms of prediction of sound levels are satisfactory.


Introduction
Noise pollution is one of the main environmental concerns of modern cities because of its effects on the quality of life, health and livability of cities. The European Commission adopted the European Noise Directive (END) [1], which focuses on the monitoring of environmental noise by generating noise maps of the main population centers and elaborating action plans [2,3]. Noise measurements in urban areas are typically carried out by designated officers that collect data at a few accessible spots, where sound level meters are installed during short time intervals. Collected noise data is often input

Related Work
A significant amount of information generated by sound sources is carried by acoustic signals, and this information can be used to describe and understand human and social activities. Sound signals acquired by acoustic sensors can be processed in two ways: (i) capturing and processing the audio signal (e.g., event detection [28,29], classification of sound sources [30,31], sound source localization [32], etc.) and (ii) calculating values of acoustic parameters from the captured audio signal (e.g., sound pressure level [33], loudness [34], etc.), which are the data collected to generate sound maps.
Several works have applied artificial neural networks to estimate sound source features and/or acoustic parameter values at a certain location for a given period of time, using data obtained through a WASN or other databases. In what follows, we introduce the differences between the proposed work and these previous works. Regarding audio signal processing, in [35,36] a WASN is proposed to monitor and analyze urban noise pollution, deploying a network of sensors to measure sound pressure level and using convolutional neural networks to classify sound sources from captured audio. In other work, Socoró et al. [37] introduced an anomalous noise event detector that removes sound frames unrelated to road traffic sound sources, so that the data captured by a WASN are more reliable. In [38], a convolutional recurrent neural network in a dilated spiral is used as a classifier fed by the energy recording feature in the mel band for the detection of sound events. Regarding parameter calculation, some published papers introduce neural networks to estimate the values of advanced acoustic parameters. Yu and Kang [39] explored the feasibility of using machine learning models to predict soundscape quality in urban open spaces by correlating various physical, behavioral, social, demographic and psychological factors. In [40], a convolutional neural network was implemented to estimate Zwicker's psychoacoustic annoyance model from an input audio signal. In contrast with these related works, our research uses a neural network approach to predict future values of acoustic parameters instead of estimating current values.
Some studies apply neural networks to create a prediction model that estimates sound pressure levels emitted by sound sources across a spatial domain, but they also use geospatial and descriptive information as input parameters. Specifically, in [41], a multi-layer perceptron neural network model trained with the Levenberg-Marquardt algorithm was used to predict the equivalent sound level from road traffic noise. In another publication [42], a system is proposed that uses an ensemble of machine learning techniques to estimate both environmental sound levels and the uncertainty in model predictions by taking geospatial data as input. In addition to making use of auxiliary information, these neural network-based models predict long-term values and do not take into account the temporal composition of the short-term sound environment. An attempt to predict the temporal component of traffic noise levels through back-propagation neural networks is presented in [43]; however, it only estimates index values describing temporal variability and impulsiveness, and it also uses auxiliary data as input. Although noise sources are mainly non-stationary, statistical techniques such as AutoRegressive Integrated Moving Average (ARIMA) [44] have also been used in the literature to model traffic noise pollution.
Finally, it is worth highlighting that several works in the literature predict other pollution factors through deep neural networks, treating the data of these variables as time series. Specifically, the most commonly studied problem is air pollution, e.g., particulate matter and carbon monoxide concentrations, among others [45,46]. However, deep learning models such as LSTM require a configuration and settings optimized for each type of problem, as is carried out in Section 3.5, considering the inputs and their behavior in time.

Wireless Acoustic Sensor Network
In this work, data captured by a node of a deployed WASN were used to train and validate the designed neural network prediction models. This WASN is a scalable and extensible system used to monitor sound levels in a certain environment. It is a static and homogeneous WASN that allows continuous monitoring indoors and outdoors. The network was composed of ten acoustic nodes deployed on the campus of the Catholic University of Murcia. In this WASN, each acoustic node [47] collected and processed the audio signal and then calculated and sent data every minute to the sink node. The low-cost acoustic node design included two main parts: the audio acquisition system and the processing core. The former consisted of the four-microphone array of a Sony PlayStation Eye camera. For the processing core, a Raspberry Pi 3 Model B computer [48] was selected for the acquisition, processing and publishing stages. Although a node is able to compute results for diverse acoustic parameters (see [47] for details), this research focuses on the equivalent sound pressure level (L p ) and loudness level (N) values [49] in a one-minute period. A sink node plays the additional role of transmitting the data to an Internet of Things (IoT) platform, where the overall data are stored and analyzed. The audio signal was neither stored nor transmitted from the node, in order to preserve privacy. Concerning the network design, acoustic nodes transmit data via Wi-Fi using two communication protocols: TCP for communication between nodes and HTTP for communication between the sink node and the IoT platform. Further in-depth control and maintenance of the deployed nodes was provided via a virtual private network that allows remote Secure SHell (SSH) access to each node. The virtual private network also enhances the wireless transmission security of the sensors, as all data and control traffic was routed through this secure network.
Specifically for this research, a data-set with these acoustic parameters, L p and N, was built, as explained in detail in the following section.

Acoustic Data-Set
In this research, acoustic data acquired continuously with a time step of 1 min by a node of the WASN described in the previous section were used to train an LSTM network. This data-set was collected from the beginning of October to the end of November 2019 and contains quantitative and temporal data for two acoustic parameters: the equivalent sound pressure level in decibels (dB) and the loudness level in phons, both with a one-minute integration time. The selected node was located indoors in an open-office room where lecturers and researchers work. Working days are mainly Monday to Friday, but the office is also open on Saturdays. This data-set is representative of a random noise whose main sound sources are speech and human activities. This long-period study can help to analyze and predict the temporal behavior pattern of this type of soundscape.
From the principal data-set, a total of ten data-sets have been generated, five for each parameter, by computing a temporal average of the data over the following periods: 1, 5, 15, 30 and 60 min. The following average has been used for time intervals: where X can be either L p or N, and X i corresponds respectively to the equivalent sound pressure level (L p i ) and loudness level (N i ) for each time step i. For example, the data-set denoted as noise15 in Table 1 indicates that the 1-min values have been averaged over 15 min, generating one value for L p and another for N. The number of samples in each data-set is given in Table 1.
The number of samples in each data-set corresponds to approximately 50 days.
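The generation of the averaged data-sets can be sketched as follows. This is only an illustration under a stated assumption: it uses a plain arithmetic mean over consecutive, non-overlapping blocks of one-minute values, whereas an energetic (logarithmic) average is the other common choice for levels in dB. The function name and the sample values are hypothetical.

```python
def block_average(values, period):
    """Average consecutive, non-overlapping blocks of `period` one-minute values.

    Assumption: arithmetic mean; an incomplete trailing block is dropped.
    """
    n_blocks = len(values) // period
    return [
        sum(values[k * period:(k + 1) * period]) / period
        for k in range(n_blocks)
    ]


# Hypothetical one-minute levels (not taken from the measured data).
one_minute = [40.0, 42.0, 44.0, 46.0, 48.0, 50.0, 52.0, 54.0, 56.0, 58.0]

# Two 5-min values, in the spirit of a "noise5" data-set.
noise5 = block_average(one_minute, 5)
```

Under this scheme, about 50 days of one-minute data (72,000 samples) would yield roughly 1,200 values for the 60-min period; Table 1 gives the actual counts.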

Deep Learning: Long Short-Term Memory
A Recurrent Neural Network (RNN) is very powerful for everything that has to do with sequence analysis, such as text, sound or video analysis. The main feature of an RNN is that information can persist by looping through the network, so it can basically "remember" previous states and use this information to decide what comes next. This feature makes RNNs very suitable for handling time series. However, a conventional RNN presents problems in training because back-propagated gradients tend to grow enormously or fade over time, since the gradient depends not only on the present error but also on past errors. The accumulation of errors makes it difficult to memorize long-term dependencies. These problems are addressed by Long Short-Term Memory (LSTM) neural networks, which incorporate a series of steps to decide which information is stored and which is erased. LSTM networks are composed of LSTM modules, a special type of recurrent neural network described in 1997 by Hochreiter and Schmidhuber [50]. The LSTM module contains three internal gates, known as the input, forget and output gates (shown in more detail in Figure 1), each consisting basically of a sigmoid layer and a multiplication operation. These gates make it possible to remove information from or add information to the cell state, which is a connection that transfers information from one LSTM module to the next. The input gate controls when new information can enter memory; in the standard formulation it is combined with a hyperbolic tangent layer that produces the candidate values to be added. The forget gate controls when a piece of information is forgotten, allowing the cell state to discriminate between important and superfluous data and leaving room for new data. The output gate controls when the information stored in the cell state is used in the result.
The cell state has a weighting optimization mechanism based on the resulting network output error, which controls each gate. The output and the cell state value generated by the LSTM module are transferred to the next LSTM module. Figure 1 shows the gates and operations of an LSTM module graphically for L p (for N it would be the same scheme); it can be observed that the input of one unit is the output of the previous one. In this way, each LSTM module transmits its prediction to the next one, which combines it with its current input to generate the output that is sent as input to the following LSTM module.
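The gate operations described above can be sketched for a single LSTM unit with a scalar input. This is a minimal illustration of the standard LSTM equations, not the configuration used in this work; the parameter names (wf, uf, bf and so on) and the dictionary layout are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell with scalar input x.

    p holds hypothetical scalar weights: w* act on the input, u* on the
    previous output, b* are biases.
    """
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    c = f * c_prev + i * g   # updated cell state
    h = o * math.tanh(c)     # output passed on to the next step
    return h, c


# With all weights and biases at zero, every gate outputs 0.5 and the
# candidate is 0, so the previous cell state is simply halved.
names = ("wf", "uf", "bf", "wi", "ui", "bi", "wg", "ug", "bg", "wo", "uo", "bo")
zero = {name: 0.0 for name in names}
h, c = lstm_step(x=60.0, h_prev=0.0, c_prev=1.0, p=zero)
```

In a trained network the weights are learned from the output error, and each step receives the output and cell state produced by the previous one.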
The network proposed in this work is univariate, that is, it takes a single input variable and produces a single output variable. Since the objective of the work is to predict both the L p sound levels and the loudness N, a different LSTM model is built for each parameter and for each data-set.
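A univariate series is commonly presented to a recurrent network as pairs of past values and the value that follows them. The sketch below shows one such framing; the function name and the window length (lookback) are hypothetical and are not taken from the configuration used in this work.

```python
def make_windows(series, lookback):
    """Pair each run of `lookback` past values with the value that follows it."""
    inputs, targets = [], []
    for t in range(lookback, len(series)):
        inputs.append(series[t - lookback:t])
        targets.append(series[t])
    return inputs, targets


# Hypothetical sound levels; each input window holds 2 past values.
levels = [50.0, 52.0, 51.0, 55.0, 53.0]
X, y = make_windows(levels, lookback=2)
```

The same framing applies unchanged to either parameter, since each model sees only its own series.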

Statistical Approach: Auto Regressive Integrated Moving Average
The classical approach to time series prediction is based on statistics. The Auto Regressive Integrated Moving Average (ARIMA) technique [51] is a statistical model that uses variations and regressions of statistical data in order to find patterns for predicting future values. It has also been applied to the prediction of sound level parameters [44], as introduced in Section 2. ARIMA is a dynamic time series model, i.e., future estimates are explained by past data rather than by independent variables. This model was developed in the late 1960s and systematized by Box and Jenkins (1976) [52]. An ARIMA model is characterized by three terms, (p, d, q), where p is the order of the Auto Regressive (AR) term, q is the order of the Moving Average (MA) term and d is the number of differences needed to make the time series stationary. In this work, an ARIMA model has been created using the same data-set described in Section 3.2 so that its quality metrics can be compared with those of the proposed LSTM models.
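As a compact reminder of what the three orders mean, the general ARIMA(p, d, q) model can be written with the backshift operator B, defined by B y_t = y_{t-1}. The symbols \phi_k, \theta_j and \varepsilon_t are standard textbook notation assumed here, not taken from the source, and the sign placed in front of the moving-average coefficients varies between references.

```latex
\left(1 - \sum_{k=1}^{p} \phi_k B^{k}\right) (1 - B)^{d} \, y_t
  = \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

Here y_t is the observed level at time step t and \varepsilon_t is a white-noise error term. With d = 1, the factor (1 - B) means that the model is fitted to the first differences y_t - y_{t-1} rather than to the levels themselves.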

Experiment Configuration
The viability and suitability of the proposed LSTM technique are assessed using two types of experiments. On the one hand, an experiment was executed using 80% of the data-set to train the model and 20% to test it. This experiment was applied to the five data-sets (different time intervals) described in Section 3.2, for each acoustic parameter. In addition, to validate the LSTM model, we performed a comparison with the Auto Regressive Integrated Moving Average (ARIMA) technique [51]. On the other hand, to analyze the robustness and adaptability of the proposed LSTM model, we performed several types of validation for the 30 and 60 min data-sets, which gave the best results overall. Specifically, for the proposed LSTM model, comparisons are made using 60%, 70% and 80% of the data to train and 40%, 30% and 20% to test, respectively. Thus, depending on the results, the response capacity of the model when training data are scarce can be analyzed. For the ARIMA model used in the comparison, the parameters (p, d, q) were (1,1,14) for the estimation of the acoustic parameter L p and (1,1,10) for the acoustic parameter N. For the LSTM model proposed in this paper, the parameters chosen after a preliminary tuning are shown in Table 2. For the number of neurons, intervals are shown depending on the acoustic parameter. The quality of the proposed model is evaluated by measuring the goodness of the prediction with the following metrics: root mean square error (RMSE), mean absolute error (MAE), Pearson correlation coefficient (PCC) and coefficient of determination (R 2 ).
Experiments were carried out on a GPU-based platform. This platform was composed of an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz, 128 GB of RAM, a 1 TB SSD hard disk and an NVIDIA GeForce GTX 780 GPU (Kepler).
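The four quality metrics reported in the result tables can be computed as in the sketch below. It uses the usual textbook definitions, which are assumed here rather than copied from the source; in particular, PCC is taken to be the Pearson correlation coefficient, and the sample values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def pcc(y_true, y_pred):
    """Pearson correlation coefficient between measured and predicted values."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    cov = sum((a - mt) * (b - mp) for a, b in zip(y_true, y_pred))
    st = math.sqrt(sum((a - mt) ** 2 for a in y_true))
    sp = math.sqrt(sum((b - mp) ** 2 for b in y_pred))
    return cov / (st * sp)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mt = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - mt) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted levels in dB.
measured = [50.0, 52.0, 54.0, 56.0]
predicted = [51.0, 51.0, 55.0, 55.0]
scores = {
    "RMSE": rmse(measured, predicted),
    "MAE": mae(measured, predicted),
    "PCC": pcc(measured, predicted),
    "R2": r2(measured, predicted),
}
```

RMSE and MAE are expressed in the unit of the predicted parameter (dB or phons), while PCC and R 2 are dimensionless.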

Results and Discussion
In this section, the behavior of the LSTM model proposed for the prediction of the sound pressure level and loudness values is discussed and analyzed. The evaluation and analysis are detailed in two subsections. First, a comparison with the ARIMA time series prediction technique was made by performing an experiment with 80% of the data-set to train and 20% to test. Then, to validate the robustness of the proposed LSTM technique, several validations increasing the test percentage and reducing the training percentage were performed. It should be noted that predictions were estimated for both L p and N; therefore, a different model was built for each of these values.

Comparing the LSTM Model with the ARIMA Model
This section presents the results obtained by the LSTM models for the prediction of the parameters L p and N for the different data-sets described in Section 3.2. In addition, LSTM models are compared with ARIMA models for both parameters to validate the results. Both the LSTM and ARIMA models were validated using 80% of the data-sets to train and 20% to test. This is equivalent to about 40 days for training and about 10 consecutive days of prediction for testing. Table 3 shows the values of RMSE, MAE, PCC and R 2 for each data-set of the L p parameter for the LSTM and ARIMA models. For the LSTM models, the calculated metrics are very satisfactory in general, with an RMSE lower than 4.3 dB for L p in all the data-sets. Regarding the fit of the model, R 2 , the longer the averaging interval, the better the fit. This may be caused by the smoothing of isolated noise peaks produced by the averaging. The best fit of the model, 0.75, is obtained for L p when the prediction period is 60 min. With respect to the ARIMA models, the RMSE values increase considerably, which indicates that the ARIMA technique is not adequate for estimating the behavior of the L p parameter in short-term intervals. For all data-sets, the ARIMA model fit is very low and the errors are much higher than for the LSTM model. It must be taken into account that ARIMA may need more days of training to reduce the error and improve the fit of the predicted time series; needing less training data is one of the advantages of the LSTM technique. Table 4 shows the values of RMSE, MAE, PCC and R 2 for each data-set of the N parameter for the LSTM and ARIMA models. For the LSTM models, the calculated metrics are very satisfactory in general, with an RMSE lower than 2 phons for N in all the data-sets. In particular, the metrics show that the RMSE of N is similar for all time intervals.
In addition, the model fit, R 2 , for N is very similar in all cases, which indicates that it is less affected by the time interval considered to predict sound levels. For the ARIMA models, the behavior and results for predicting the N parameter are similar to those for the L p parameter. In this case, the error does not increase as significantly as for the L p parameter. However, the error is always more than double that obtained by the LSTM technique. Moreover, as far as the model fit is concerned, the result is not at all satisfactory. This indicates that the ARIMA models are not able to adapt to the non-stationary behavior of the sound level parameters in short-term intervals. In summary, the results show that the LSTM technique outperforms the ARIMA technique for creating short-term temporal models and predicting the behavior of the L p and N parameters. One aspect to consider about the obtained LSTM models is the difference between the RMSE and MAE values for both N and L p levels. The RMSE value is almost double the MAE value, indicating that there are outliers in the data [53]. These outliers usually appear as peaks. In this case, the outliers can be observed in Figures 2 and 3, for both N and L p levels, in the occasional impulsive sound events that occur throughout the day. Figures 2 and 3 show a temporal graph for a ten-day interval of the captured data, i.e., real data from the test subset, along with the data estimated by the obtained LSTM models for both N and L p . The test subset begins on a Sunday and ends on the Tuesday of the following week. Therefore, the minimum noise level can be observed on Sunday, because the open-office room where the data were collected is closed. The acoustic level then increases over the next five working days during the day and decreases at night.
On Saturday, the activity of people in the office is reduced, so the noise level is lower than on a regular working day. Then, the time sequence starts again with a Sunday having the lowest noise levels. In general, the model obtained by the LSTM technique, as a pattern of sound level behavior for both L p and N, adequately follows the trend of the sound level. The longer the averaging interval, the more the peaks of short, loud events are smoothed, which gives a better prediction and model fit than for shorter intervals.
In order to explore the obtained LSTM models in detail, Figure 4a shows a zoomed view of Figure 2d and Figure 4b shows a zoomed view of Figure 3d, for a two-day interval with a time average of 30 min. It can be observed that the LSTM model has difficulty in precisely estimating short-time events where the sound level increases or decreases drastically, i.e., when the sound level suddenly rises or decays. However, the behavior of the LSTM model is much more stable when the peaks are less relevant, e.g., on Saturdays.
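The relation between RMSE, MAE and outliers discussed in this section can be illustrated with a small numerical sketch. Because RMSE squares the errors, it can never be smaller than MAE, and the gap widens when a few large errors, such as missed impulsive events, are present. The error values below are hypothetical and only serve to show the effect.

```python
import math


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical prediction errors: steady behavior versus one missed peak.
steady = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
peaky = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -12.0]

ratio_steady = rmse(steady) / mae(steady)  # errors of equal size: RMSE equals MAE
ratio_peaky = rmse(peaky) / mae(peaky)     # one large error pushes RMSE well above MAE
```

In this toy case a single large error out of eight is enough to make RMSE almost twice MAE.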

Assessing the Robustness of the Proposed LSTM Model
In the previous section, it was concluded that the LSTM technique can develop precise models for predicting the sound parameters L p and N in the short term. In this section, the behavior, stability and robustness of the LSTM technique are validated through different types of tests. The objective is to analyze the variability of the LSTM models when a larger number of samples is predicted from a smaller number of training samples. The validations that have been made are 60/40, 70/30 and 80/20, i.e., 60%, 70% and 80% of the data to train and 40%, 30% and 20% to test, respectively. Table 5 shows the values of RMSE, MAE, PCC and R 2 of these validations for the noise60 and noise30 data-sets. Analyzing the results for the parameter L p , it can be seen that, regardless of the type of validation, the RMSE is around 4 dB for the noise60 data-set and around 4.3 dB for the noise30 data-set. The variations of the LSTM models for both data-sets are minimal when the type of validation is changed. These minimal variations can be seen in the value of R 2 , which varies by barely 0.04 points. Regarding the N parameter, the results are very similar to those of the L p parameter in terms of model variability. Analyzing the RMSE value of the N parameter, it is around 2 phons for both data-sets and all validations. The same happens with the determination coefficient R 2 , where the differences between models of different validations and data-sets do not exceed 0.05 points. A remarkable aspect of the N parameter is that the 60/40 validation obtains better results than the other validations for both the noise30 and noise60 data-sets. A possible explanation is that, with more test days, the test subset includes more weekend days, when the noise is more stable and there are fewer isolated peaks; hence the model fit is better.
After detailing and analyzing the results of the various validations, together with the comparison with the ARIMA technique in the previous experiment, it can be concluded that the LSTM technique obtains a considerably stable and satisfactory performance for the problem posed. It must be taken into account that the LSTM technique has allowed us to build models that are reliable in terms of error and model fit using very few training samples, and that allow a prediction of 20 consecutive days. Although the LSTM models created follow the trend of the sound with a stable behavior, they present limitations in detecting impulsive short events, i.e., high peak noises at certain times.
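The three validations can be sketched as chronological splits of the series. This assumes that the training part always precedes the test part in time, which is consistent with the consecutive test days described in the text but is not stated as an explicit procedure; the function name and the placeholder series are hypothetical.

```python
def chronological_split(series, train_percent):
    """Return (train, test): the first train_percent % of the samples, then the rest."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


# Placeholder series: about 50 days of 60-min values (only the length matters here).
samples_per_day = 24
series = list(range(50 * samples_per_day))

days = {}
for train_percent in (60, 70, 80):
    train, test = chronological_split(series, train_percent)
    days[train_percent] = (len(train) // samples_per_day, len(test) // samples_per_day)
```

For 50 days of data this gives about 30/20, 35/15 and 40/10 training/test days for the 60/40, 70/30 and 80/20 validations, in line with the roughly 40 and 10 days quoted for the 80/20 case and the 20 consecutive days of prediction mentioned above.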

Conclusions and Future Work
Wireless acoustic sensor networks are an important tool for monitoring and managing noise pollution in cities. In addition to the cost savings compared with the traditional procedure for creating a noise map, these networks help in the design of new noise maps with extended sound source information and enable existing noise maps to be updated dynamically. However, it must be taken into account that sensors within a network can fail or that network signal coverage may drop in certain situations, producing missing values in the IoT platform. Moreover, it would be helpful for local administrations to know in advance the short-term trend in noise levels in cities. Artificial intelligence techniques can help to address these issues and even to decrease the number of nodes needed in a network. This paper proposes the use of a deep neural network, specifically a Long Short-Term Memory (LSTM) neural network, to forecast future values by creating a model that represents the behavior of an acoustic environment at a certain location; specifically, the sound pressure level (L p ) and loudness level (N) parameters are considered. To create this model, values taken from a node of a deployed acoustic sensor network that collects information every minute have been used. Different models have been designed for L p and N using several time periods of up to 60 min, in order to assess and analyze the behavior of the acoustic environment at different time intervals. To validate the model, it has been compared with the Auto Regressive Integrated Moving Average (ARIMA) time series technique, in order to evaluate and discuss the benefits and limitations of the proposed LSTM. In addition, to analyze the stability of the LSTM technique, several types of validation have been performed. The results indicate that LSTM models obtain a lower prediction error and a better model fit than ARIMA.
In general, the results achieved through the application of the LSTM technique are satisfactory, since all the created models correctly predict the rising and falling trends of the sound levels. Moreover, the root mean square error values obtained are lower than 4.3 dB for L p and lower than 2 phons for N in all considered models. Analyzing the parameters separately, the models for N are more robust than those for L p , with smaller error values and no significant differences between the considered time periods. Regarding the L p level models, a more reliable model is achieved when a longer time period is considered. Although L p is a parameter with higher variance than N, the trend of the behavior pattern estimated by the model is satisfactory in terms of the determination coefficient. The results of the different validations indicate that the proposed LSTM technique has little variability and needs little training data to obtain good predictions; therefore, the technique could be applied in any city without the need for long historical records. Regarding the limitations of the proposed LSTM technique, it has been observed that the model has difficulty following the trend of high sound levels of the L p and N parameters.
As future work, an evaluation of the implementation of LSTM models within the nodes of the acoustic sensor network is proposed. Moreover, a study to determine the influence of climatic parameters or other variables in predicting acoustic pollution through a multivariate neural network is of interest.