Probabilistic Forecasting of Nitrogen Dioxide Concentrations at an Urban Road Intersection

The concentration of nitrogen dioxide in the air along a major route in a large city is affected by many interdependent factors. As an alternative to complicated deterministic models based on these complex processes, this study proposes a probabilistic model for predicting NO2 concentrations, using a simple accounting cluster-based method for determining probability distributions for tabulated values of ambient factors. Using the example of hourly values of NO2 concentration and data on wind speed and traffic flow for the main intersection in Wrocław (Poland), a model is constructed to predict the frequency of occurrence of concentrations in the form of a probability distribution, for given values of the input variables. The model was successfully verified on data for the first six months of 2018. A mean continuous ranked probability score (CRPS) of 9.15 µg/m³ was obtained. Although traffic volume has the greater impact on urban NO2 concentrations, as measured for instance by Pearson's correlation coefficient, the model indicates that wind speed is also a very important factor, wind being the principal mechanism causing the evacuation of pollutants. This underlines the importance of sustainable city planning with regard to ensuring suitable conditions for the passage of air.


Introduction
With the increase in levels of anthropogenic air pollution, there is growing interest among scientists in modelling the relationship between pollutant concentrations and various ambient factors. Methods of forecasting pollution levels on various timescales are being simultaneously developed. Literature reports indicate that atmospheric concentrations of pollutants are affected fundamentally both by the pollutant load (originating from various sources) and by meteorological conditions in the area studied. The main source of emissions of nitrogen oxides is waste gases, which come mainly from high-temperature combustion in vehicle engines, as well as from the combustion of fuels in energy production. In transport, the largest emitters of NO 2 are diesel engines. Known deterministic models describing relationships between pollutant concentrations in the air and meteorological and temporal conditions and traffic flow may be divided into two groups, based on localization in time: Current (point-based) and backward (interval-based). Interval-based models require and use data from a certain preceding period of time (t − 1, t − 2, t − 24, t − 48) to determine the predicted value at time t. This is quite a well-understood approach, with over 90% prediction effectiveness e.g., [1][2][3][4]. Point-based models, on the other hand, serve to determine the pollutant concentration value based on data only from the time point t. There is still much to be done with regard to point-based models, because their quality of fit needs improvement, and their predictive capabilities are burdened with a large error. Among point-based models, one may identify several principal methods of mathematical modelling. Still popular are regression models, which have evolved significantly due to the development of computational techniques [5][6][7][8]. More computationally advanced are models based on machine learning. 
Models that have been applied include artificial neural networks [9,10], which may also be combined with a multiple regression model [11]. There is an increasing number of reports on the use of random tree methods to model pollutant concentrations. These include both single random trees [12] and more complex structures based on them: Random forest (RF) [13,14] and boosted regression trees (BRT) [15]. Singh et al. [12] compared the effectiveness of methods based on random trees and on support vector machines (SVMs) in modelling atmospheric pollutant concentrations, and found that the random trees outperformed SVMs. Comparisons of the RF and BRT methods have not yet determined conclusively the advantage of either modelling method over the other. Singh et al. [12] do not indicate that either method is superior. Kamińska [16] states that for the prediction of typical values better results are obtained by the BRT method, while for the prediction of higher values the RF method gives better results.
Deterministic models, particularly those that do not refer to the values of variables at preceding time points (point-based models), are subject to large errors, and the main problem with their application is their poor fit to data. Larkin et al. [17], using data from 5220 air monitors in 58 countries, developed a Land Use Regression model based on 10 land use predictors, with R² = 0.52. Sayegh et al. [15], using input data that included meteorological conditions, temporal data and traffic flow, obtained R² values from 0.49 to 0.54 for different measurement points. Kamińska [18] modified an RF model, which improved the quality of fit to R² = 0.82, although there were difficulties in making predictions. This results from the high variability of concentration values and the significant number of factors on which those values depend. Attempts to forecast a precise numerical value led to very large errors. An alternative to deterministic models is probabilistic models, which may be used to forecast concentrations or the exceeding of threshold values with given probability [19,20].
Probabilistic models serve to generate a forecast in the form of probability values or a probability distribution for the occurrence of an event, given the existence of specified conditions. In the field of air pollution, such models give the probability (or a probability distribution) of a pollutant concentration subject to specified ambient conditions (values of explanatory variables). Probabilistic models perform particularly well in forecasting the values of variables in the case of heavily skewed data, when the accuracy of deterministic models is greatly reduced. The literature describes three main modelling techniques for probabilistic forecasting of air pollution concentration: Quantile regression, modelling based on weather forecasting, and ensemble statistical post-processing. Probabilistic forecasting with quantile regression is based on the conditional sample mean, but instead of the mean the median is estimated. In the case of forecasting of extreme values or concentration peaks, the median is replaced by any quantile [19,20]. Probabilistic models based on weather forecasting are complex models consisting of several modules, including a module for forecasting meteorological conditions, a module for forecasting pollutant emissions, and a final module for forecasting the pollutant concentration. The main drawback of this method is the need to predict the input conditions for the pollutant concentration model, which entails a high degree of uncertainty. Weather forecasts may be obtained using the model of [21] or taken from a reliable source [22]. Such models enable forecasting for arbitrarily large areas. Another difficulty in the application of forecasts of this type is the complex construction of the model and its interdisciplinary nature. Another type of probabilistic forecasting makes use of an ensemble statistical post-processor (ESP). 
An example is the work of Garner and Thompson [23], where an ESP is constructed using a moving-block bootstrap, regression trees and extreme-value theory for the forecasting of concentrations of ozone.
In this study a probabilistic model is proposed that enables the forecasting of atmospheric pollutant concentrations, and which may be easily implemented at any location along a transport route. A basic condition and limitation required for the model to be effective is that traffic flow and air quality measuring points should be situated close to one another. It was reported by Padró-Martínez et al. [24] and Beckerman et al. [25] that the highest NO 2 concentrations are found within 50 m of the road, and then fall off with increasing distance. A second factor determining the effectiveness of the model is the appropriate selection of explanatory variables (predictors). Most studies use sets of variables reflecting traffic and meteorological conditions [15,26,27]. There also exist studies based only on meteorological data [28] or only on traffic data [29]. More precise analysis of the effect of particular explanatory variables is enabled by such methods as random trees. The methodology for identifying tree divisions described by Breiman [30] allows the importance of predictors to be determined on a scale of 0-100. Using this methodology, Kamińska [13] established that the most significant variables affecting NO 2 concentration were traffic flow and wind speed. Anzarte [19], who studied the importance of variables according to a concept based on a direct measure of the impact of each feature on the accuracy of the model [31], identified as the most important factors (apart from variables referring to preceding time points) the temperature and wind speed. One could find papers with different, temporal variables, such as: Month, hour of the day [14], weekday, weekend or holiday [13,32], but their impact on the variability of the pollution concentration is slight.
The idea of the present approach is to partition the entire dataset into clusters by a matrix method, with respect to the values of the explanatory variables that modelling shows to be the most important. For each cluster, a probabilistic analysis is performed, and a forecast is obtained. This takes the form of a probability distribution for NO 2 concentrations or the probability of the exceeding of a set threshold of pollutant concentration (e.g., limit, warning, alarm levels), for given values of the explanatory variables.
Determination of the probability of occurrence of warning and alarm levels of NO 2 subject to given values of traffic flow can help traffic managers to take decisions efficiently, by selecting the most adequate traffic management strategy [33][34][35][36]. In Wrocław, rational spatial planning encouraging the maintenance of favorable environmental conditions is not yet established practice [37]. The adverse impact of new buildings may be observed not only within the city, but also in new suburban developments [38]. Most new buildings in Wrocław do not form compact groups that might ensure free space for the passage of air. Hope for a change to current practices is provided by the adopted Polish National Strategy for Adaptation to Climate Change in the Perspective to 2030, which requires cities with more than 100,000 inhabitants to prepare their own plans for adaptation to climate change [39]. Although the main emphasis in this strategy is on rainwater management, the document creates the possibility of integrated and systematic management of various environmental components, including air quality. A systematic approach to the sustainable management of city development, making use of wind conditions, is effectively implemented in Wrocław and its surroundings with the use of spatial analyses and decision support systems for urban planning [40]. The results of the present study may therefore prove useful in efforts to improve air quality in Wrocław.
The model makes it possible, in a simple manner, to forecast probability distributions for pollution levels depending on the values of the input parameters. This makes it easy to determine what changes in air quality can be expected to result from a given reduction in the number of vehicles driving into the city center. This will enable estimation of the benefits resulting from, for example, the introduction of a charge for vehicles entering the central zone.

Methodology
As was mentioned at the outset, precise numerical forecasting of atmospheric pollutant concentrations without taking past values into account is subject to very large errors. An alternative to deterministic methods is a probabilistic approach. Pollutant concentrations are subject to a high degree of variation. In the present approach, the dataset is partitioned into clusters based on ranges of the values of the independent variables. Assuming that two independent variables are being considered, the clusters will be described by a matrix system. More generally, the system will have a number of dimensions equal to the number of independent variables. Thus, the entire dataset is represented by the matrix $D = [Y, X_1, X_2, \ldots, X_K]$, where $Y$ is a vector of values of the dependent variable, and $X_k$ ($k = 1, \ldots, K$) are vectors of explanatory variables (predictors). This set is divided into subsets (clusters) according to the values of the explanatory variables. In view of limitations resulting from the methodology of analysis of the conformity of a random variable to a theoretical distribution, the number of explanatory variables must be small enough to ensure that each cluster contains a sufficient number of cases. As mentioned above, in the problem under consideration there are two highly significant explanatory variables, and our further considerations will therefore concern this case.
The values of the variables $X_1$, $X_2$ are partitioned into $n_1$ and $n_2$ intervals respectively. This division leads to $n_1 \cdot n_2$ subsets (clusters) of data represented by matrices $D_{ij} = [Y_{ij}, X_{1ij}, X_{2ij}]$, where $i = 1, 2, \ldots, n_1$ and $j = 1, 2, \ldots, n_2$. For each cluster independently, a probability distribution is determined for the variable $Y_{ij}$. Knowledge of the probability distribution of pollutant concentrations within a bounded space of ambient conditions enables one to determine the probability that a given concentration threshold $y_0$ will be exceeded when such conditions exist.
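The exceedance probability described above can be written out explicitly. The relation below is a reconstruction under stated assumptions, not necessarily the authors' own formula: the intervals are taken as closed on the right, and $F_{ij}$ (a symbol introduced here for illustration) denotes the distribution function of $Y_{ij}$. The second line gives the corresponding concentration that is exceeded with a chosen probability $p$.

```latex
\begin{align}
P\left(Y > y_0 \mid a_{1,i-1} < X_1 \le a_{1i},\; a_{2,j-1} < X_2 \le a_{2j}\right)
  &= 1 - F_{ij}(y_0), \\
y_p &= F_{ij}^{-1}(1 - p).
\end{align}
```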
Here $a_{1i}$, $a_{2j}$ are the boundary values of the partition of $X_1$, $X_2$ respectively. Knowing a theoretical probability distribution that conforms to the empirical distribution of the variable $Y_{ij}$, and thus also its distribution function, one may also use this method to calculate a concentration value that will be exceeded with given probability, assuming a defined set of conditions.
The effectiveness of the method may be impaired if the data subsets are of insufficient size. It may then be difficult to fit the data to a theoretical distribution and to obtain statistically significant conformity. The partitioning of values of the explanatory variables must therefore be performed in such a way as to ensure a minimum number of instances of the variable Y ij to enable the procedure of fitting a theoretical probability distribution to be carried out. It is recommended that the size of clusters should be such that, when classes are formed in the process of determining the value of a test statistic, the number of cases in each class is at least 5 [41]. In practice, a minimum number of 30 cases is used. The described procedure may be used for any atmospheric pollutant, once a set of predictors has been identified. It is important that the air quality measuring station should be located not more than 50 m from the road being the source of the pollution.
In this study, the above procedure was applied to the analysis and forecasting of values of nitrogen dioxide concentrations in the air along a major transport route, using the example of a selected road intersection in the city of Wrocław (Poland).

Data Sources
Wrocław is located in the southwestern part of Poland (Europe). The intersection of the streets Hallera and Powstańców Śląskich is subject to traffic flow monitoring, and an air quality measuring station is located in its vicinity. In view of the significant dependence of atmospheric pollutant concentrations on the distance from the road, computations were carried out on the basis of data obtained from this intersection.
Based on values of Pearson's correlation coefficient (Table 1), and the results of studies which included analysis of the impact (importance) of different variables on NO 2 concentrations, carried out at the same intersection for the years 2015-2016 [13] and 2015-2017 [18], it was concluded that the greatest influence on concentrations comes from two variables: Traffic flow and wind speed. These were selected as the independent variables to be used for further analysis.
In the dataset covering a total of 26,081 h of measurements, a small number of missing or erroneous readings were present, due to temporary faults of the automatic measuring systems. Cases for which the value of at least one variable was absent or the values were incorrect (traffic flow exceeding the capacity of the intersection) were excluded from the analysis. The number of cases thus omitted was 310, equal to 1.2% of the original number of cases.

NO 2 Concentrations
Measurements of nitrogen dioxide concentration were made at five points, but only one of them is near the largest crossroads in central Wrocław: the intersection of the streets Hallera and Powstańców Śląskich (Figure 1). The data cover the full years 2015-2017 and were collected by the Provincial Environment Protection Inspectorate. Basic statistical values relating to NO2 concentrations are given in Table 2.
The limit annual average value of NO2 is 40 µg/m³ and was exceeded at that station in each of the years 2015-2017, reaching 53.8, 49.2 and 48.1 µg/m³. The median NO2 concentration (49.4 µg/m³) is lower than the mean (50.4 µg/m³), which indicates right-sided asymmetry (shown in Figure 2). The permissible hourly atmospheric concentration of NO2 in Poland is 200 µg/m³, and this value must not be exceeded more than 18 times in a year. The alarm level of atmospheric NO2 is 400 µg/m³ maintained for at least three consecutive hours [Regulation of the Minister of Environment of 14 August 2012 on levels of certain substances in the air]. In the analysed period, the permissible hourly level was exceeded on three occasions in 2015 (30 August, 1 September and 4 November; two consecutive hours), but the alarm level was not attained. A frequency histogram for nitrogen dioxide concentrations is presented in Figure 2. For the complete dataset, parameters were determined for the following asymmetric continuous theoretical distributions: Weibull, Johnson, generalised extreme value (GEV) and log-normal. The values of the χ² and Anderson-Darling statistics were high, and the corresponding p-values did not differ from zero (to five decimal places). Values of the less restrictive Kolmogorov-Smirnov (K-S) statistic gave p-values of 0.025, 0.029, 0.033 and 0.085 for the respective distributions. However, in the light of the results of the other tests, this does not confirm the conformity of the empirical distribution to any of the analysed theoretical distributions.

Traffic Flow
The traffic data are provided by the Traffic and Public Transport Management Department of the Roads and City Maintenance Board in Wrocław. The Department operates 921 video cameras distributed widely over the area of the city. One of the pieces of information obtained is the number of vehicles passing through the measurement plane on a given traffic lane or lanes. A network of sensors is set up to monitor vehicular traffic at the main intersections of the city road network. A total of 68 intersections are subject to traffic measurement. However, only in one case is a monitored intersection located in the immediate vicinity of an air quality measuring station: At the intersection of Hallera and Powstańców Śląskich. Traffic flow data indicate the total number of vehicles driving onto the intersection from all directions and traffic lanes. The daily and weekly variation in traffic flow values are shown in Figure 3.
The daily variation in traffic volume is bimodal, with peak periods in the morning between 7:00 and 8:00 and in the afternoon between 15:00 and 17:00, although the variation during the morning peak is significantly greater than during the afternoon peak. During night time the traffic flow is significantly lower. The traffic volumes on working days are similar, but there is a clear reduction at weekends.

Probability Distribution
As described in Section 2.2.1, the random variable representing values of NO2 concentration is not found to conform to any of the theoretical distributions. Using the described methodology, the entire set of concentration values was divided into clusters according to the values of the independent variables (in this case, traffic flow and wind speed). The boundary values of these variables were determined a priori based on an analysis of their variability and the basic laws controlling the physics of pollutants in the air. Traffic flow values were partitioned at intervals of 1000 vehicles, mainly in order to ensure that the clusters were of adequate size. Wind speed values were partitioned at intervals of 2 m/s, on the grounds that:

• The first interval reflects conditions of very light wind or no wind, when the wind's role in removing pollutants from the transport route is negligible, and the accumulation of pollutants favors the occurrence of chemical reactions in the air;
• Further intervals reflect conditions of increasing wind strength, which affects the rate of horizontal movement of pollutants;
• The final interval reflects strong winds (relative to normal local conditions), which, by causing movement of air through the city, affect the NO2 concentration values.
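The partition described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the boundary lists are assumptions inferred from the stated step sizes (1000 vehicles and 2 m/s, with open-ended last intervals above 5000 vehicles and above 8 m/s), the function names are hypothetical, and the minimum cluster size of 30 is the practical threshold quoted in the Methodology.

```python
from bisect import bisect_left
from collections import Counter

# Assumed upper edges of all intervals except the last, open-ended one.
TRAFFIC_BOUNDS = [1000, 2000, 3000, 4000, 5000]  # vehicles per hour
WIND_BOUNDS = [2, 4, 6, 8]                       # m/s
MIN_CLUSTER_SIZE = 30


def interval_index(value, upper_bounds):
    """1-based index of the right-closed interval that contains `value`."""
    return bisect_left(upper_bounds, value) + 1


def cluster_index(traffic, wind):
    """Matrix position (traffic interval, wind interval) of one observation."""
    return (interval_index(traffic, TRAFFIC_BOUNDS),
            interval_index(wind, WIND_BOUNDS))


def cluster_sizes(observations):
    """Count observations per cluster; each observation is (traffic, wind)."""
    return Counter(cluster_index(traffic, wind) for traffic, wind in observations)


def too_small(counts, minimum=MIN_CLUSTER_SIZE):
    """Clusters of the full matrix that hold fewer than `minimum` cases."""
    n_traffic = len(TRAFFIC_BOUNDS) + 1
    n_wind = len(WIND_BOUNDS) + 1
    return [(i, j)
            for i in range(1, n_traffic + 1)
            for j in range(1, n_wind + 1)
            if counts.get((i, j), 0) < minimum]


print(cluster_index(2891, 10.0))
```

Using `bisect_left` makes each interval closed on the right, so a traffic flow of exactly 1000 vehicles falls in the first interval, in line with the (0, 1000] notation used in the text.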
The distance of the meteorological measuring station from the intersection (9.6 km) is not insignificant. The relatively coarse step of 2 m/s used to define the wind speed clusters allows for the possibility that the wind speed at the intersection differs from the measured value as a result of this distance. Table 3 shows the number of cases in each cluster. Two clusters, marked in green, were too small to allow the procedure for testing conformity to a theoretical probability distribution to be carried out. For the values of NO2 concentration corresponding to the cases contained in each cluster, parameters were computed for the following asymmetric continuous theoretical distributions: Weibull, Johnson, generalized extreme value (GEV) and log-normal. Next, for each theoretical distribution and each cluster, statistical tests of the fit of the theoretical and empirical distributions were performed: a χ² test and a Kolmogorov-Smirnov (K-S) test. The Anderson-Darling test gave almost the same p-values as the χ² test, and therefore only the results of the latter are shown. The best fit (taking account of all clusters) was obtained for the log-normal distribution, with density function given by (2). Values of the χ² and Kolmogorov-Smirnov (K-S) statistics, together with computed p-values, are given in Table 4.
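The log-normal density referred to as (2) is not reproduced in this text. Assuming the standard two-parameter form (an assumption, since a shifted three-parameter variant would also be possible), with $\mu$ and $\sigma$ denoting the mean and standard deviation of $\ln Y$, it reads:

```latex
\begin{equation}
f(y) = \frac{1}{y\,\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(\ln y - \mu)^2}{2\sigma^2}\right), \qquad y > 0.
\end{equation}
```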
Both statistical tests indicated a lack of conformity to the log-normal distribution in one case only: for the lightest wind conditions, [0, 2] m/s, and the lowest traffic levels, (0, 1000] vehicles. This is the largest of the clusters. If the clusters are labelled according to their position in the matrix, this is cluster (1,1). In three cases the χ² test indicated a lack of conformity to the theoretical distribution (rejection of the null hypothesis H0), while the Kolmogorov-Smirnov test indicated a lack of grounds to reject H0. To make a final determination of the possibility of identifying the empirical distribution with the appropriate theoretical distribution, Q-Q (quantile-quantile) plots were produced and used to assess the deviation between the distributions. The Q-Q plots for the clusters for which the applied test statistics indicated a rejection of the hypothesis H0 (green fields in Table 4) are shown in Figure 5. The largest deviations from the theoretical distribution are found for high concentration values; clearly the greatest differences occurred for cluster (1,1) ([0, 2] m/s; (0, 1000] vehicles). Caution must therefore be applied when interpreting the subsequent results with respect to the lowest values of wind speed and traffic flow. Nonetheless, these are conditions that occur mainly at night, when NO2 concentrations are relatively low. In the remaining 23 cases, neither test gave grounds to reject the conformity of the empirical distribution of NO2 concentrations to the log-normal distribution.
Empirical histograms, along with a graph of the density function of the fitted theoretical distribution, are shown in Figure 6. The arrangement of the histograms is in accordance with the matrix arrangement of Table 3. All of them are characterized by right-sided asymmetry. The kurtosis of the distribution increases as the wind speed increases, and falls as the traffic flow increases.
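The fitting and testing step for a single cluster can be sketched as follows. This is an illustration under several assumptions: the text does not state how the distribution parameters were estimated, so maximum likelihood is assumed here; a two-parameter log-normal is assumed; the sample is synthetic, with arbitrary parameters; and the function names are hypothetical. Only the Kolmogorov-Smirnov distance is computed, using the Python standard library.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(values):
    """Maximum-likelihood estimates (mu, sigma) of a two-parameter log-normal."""
    logs = [math.log(v) for v in values]
    # The ML estimate of sigma divides by n, hence the population form.
    return fmean(logs), pstdev(logs)


def lognormal_cdf(y, mu, sigma):
    """Distribution function F(y) of the log-normal with parameters mu, sigma."""
    if y <= 0:
        return 0.0
    return NormalDist(mu, sigma).cdf(math.log(y))


def ks_statistic(values, mu, sigma):
    """Largest distance between the empirical and the fitted distribution function."""
    ordered = sorted(values)
    n = len(ordered)
    distance = 0.0
    for k, y in enumerate(ordered, start=1):
        f = lognormal_cdf(y, mu, sigma)
        distance = max(distance, k / n - f, f - (k - 1) / n)
    return distance


# Synthetic stand-in for the concentrations of one cluster (illustrative only).
rng = random.Random(0)
sample = [rng.lognormvariate(3.6, 0.5) for _ in range(500)]

mu_hat, sigma_hat = fit_lognormal(sample)
ks = ks_statistic(sample, mu_hat, sigma_hat)
print(round(mu_hat, 2), round(sigma_hat, 2), round(ks, 3))
```

Turning the distance into a p-value requires a reference distribution, and the classical Kolmogorov table is only approximate when the parameters are estimated from the same sample, which is one reason a visual check such as a Q-Q plot is a useful complement.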

Forecasting the Probability
Using knowledge of the theoretical distributions of NO2 concentrations given the tabulated ambient conditions, probabilities were computed for the exceeding of defined concentration values in each of the clusters representing defined conditions (Table 5). The values in the table should be interpreted in the following manner: given a traffic flow not exceeding 1000 vehicles and a wind speed not greater than 2 m/s, the probability that the atmospheric concentration of NO2 will exceed 40 µg/m³ is 39.1%. The probability of exceeding a given concentration decreases as the threshold value increases. The probability of exceeding a concentration of 40 µg/m³ given the least favorable conditions (cluster (6,1): traffic flow > 5000 vehicles, wind speed ≤ 2 m/s) is above 85%, which means that statistically, this concentration is exceeded for 7507 out of the 8760 h in a year. Because the atmospheric NO2 concentration is subject to significant variation in the course of a day, the accepted safe value is exceeded every day. With a fall in traffic flow and an increase in wind speed, the probability of exceeding the mean annual permissible value falls to 3.2% for cluster (1,5), equivalent to 282 h in a year. The probabilities of concentrations in excess of 100 µg/m³ are several times smaller: they range from 0.1% (value exceeded for 10 h in a year) for conditions with strong wind and low traffic flow, i.e., cluster (1,5), to 22.6% (value exceeded for 1997 h in a year) for cluster (3,1), with traffic flow in the interval (2000, 3000] and wind speed not exceeding 2 m/s. The determined probability distributions indicate that for 7 of the 28 described sets of ambient conditions, the permissible value of the NO2 concentration (200 µg/m³) is reached less frequently than once per year.
In unfavorable ambient conditions, i.e., cluster (3,1), the permissible level may be exceeded with a probability of 3.4% (300 h in a year). The probability of exceeding the alarm level of 400 µg/m³ is always lower than 0.2%. The largest probabilities of NO2 concentrations exceeding 100, 200 and 400 µg/m³ were identified for cluster (3,1), with low wind speeds (≤ 2 m/s) and traffic flows in the interval (2000, 3000] vehicles. This cluster contains a total of 985 cases, mainly from the evening hours (70% of cases occur from 20:01 to 22:00), when the pollutants emitted by traffic throughout the day remain accumulated although the traffic flow at that time is not especially high. A low wind speed increases the accumulation of pollutants and favors the occurrence of chemical reactions. During the degradation of volatile organic compounds, NO is transformed into NO2 in addition to ozone being formed. This process is more intense when more substrates are present in the air and when the atmospheric conditions are more favorable, particularly when the wind is light.
Probabilities of exceeding given values were found to decrease more rapidly with an increase in wind speed than with a fall in traffic volume. The very strong influence of wind speed on atmospheric NO2 concentration is largely a result of the geographic location of the intersection. Given Wrocław's prevailing WNW winds, the alignment of the intersection with the wind direction favors the evacuation of pollutants. The surrounding buildings modify the wind speed and direction only to a small degree.
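The exceedance probabilities and the equivalent hours per year quoted in this section can be illustrated with a minimal sketch. It assumes a log-normal distribution for one cluster, parameterized by `mu` and `sigma` on the log scale; the parameter values below are hypothetical and are not taken from the paper's tables. The function names are likewise illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # non-leap year, as used in the text


def lognormal_exceedance(threshold, mu, sigma):
    """P(X > threshold) for X ~ LogNormal(mu, sigma); mu, sigma on the log scale."""
    z = (math.log(threshold) - mu) / sigma
    # Survival function of the standard normal, written with erfc.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def hours_per_year(probability):
    """Expected number of hours per year in which the threshold is exceeded."""
    return probability * HOURS_PER_YEAR


# Hypothetical cluster parameters (assumption, NOT the fitted values of the study).
mu, sigma = math.log(35.0), 0.6

for threshold in (40, 100, 200, 400):  # thresholds in µg/m³ discussed in the text
    p = lognormal_exceedance(threshold, mu, sigma)
    print(f"P(NO2 > {threshold}) = {p:.4f}, about {hours_per_year(p):.0f} h/year")

# Conversion used in the text: a probability of 85.7% corresponds to about 7507 h.
print(round(hours_per_year(0.857)))
```

The conversion to hours is a plain multiplication by 8760, so the hour counts are expected values rather than counts of distinct events.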

Verification
The above method of probabilistic forecasting of atmospheric NO2 concentrations was verified using independent data from the first six months of 2018. The values of NO2 concentrations from the verification period were partitioned into clusters according to the key described in Section 3.1. There were 2586 h in which the concentration exceeded 40 µg/m³. The value was above 100 µg/m³ for only 41 h, which made verification at this threshold impossible because of the low frequencies of occurrence within clusters. No concentrations in excess of 200 µg/m³ were recorded. The statistical significance of the differences between the frequencies of NO2 concentrations in excess of 40 µg/m³ obtained for the whole matrix and the frequency matrix for 2015-2017 was investigated using the t-test. At a significance level of 0.05, the hypothesis of equal mean frequencies cannot be rejected (p-value = 0.86), which indicates that the frequencies determined for 2015-2017 also describe the independent data from 2018 well. An example forecast, in the form of a density function graph for a computed log-normal distribution together with the actual value for 13:00 on 18 March 2018, is shown in Figure 7. At that time the recorded wind speed was 10 m/s and the traffic flow was 2891 vehicles. These values correspond to cluster (3,5) in the matrix of Table 3. The actual recorded NO2 concentration was 22 µg/m³.
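The assignment of an hour to a cluster, as in the 18 March 2018 example, can be sketched as a lookup in two lists of interval boundaries. The traffic boundaries below follow the right-closed intervals mentioned in the text (≤ 1000, (2000, 3000], > 5000 vehicles). The wind boundaries of 2, 4, 6 and 8 m/s are an assumption made only for illustration, since the actual key is defined in Section 3.1 and Table 3.

```python
from bisect import bisect_left

# Upper bounds of the right-closed traffic intervals (vehicles), inferred from the text.
TRAFFIC_EDGES = [1000, 2000, 3000, 4000, 5000]
# Upper bounds of the wind-speed intervals (m/s): ASSUMED, not confirmed by the source.
WIND_EDGES = [2, 4, 6, 8]


def cluster_index(traffic_flow, wind_speed):
    """Return the 1-based (traffic, wind) cluster for right-closed intervals."""
    # bisect_left places a value equal to an edge in the lower interval,
    # which matches intervals of the form (a, b].
    i = bisect_left(TRAFFIC_EDGES, traffic_flow) + 1
    j = bisect_left(WIND_EDGES, wind_speed) + 1
    return i, j


# Example from the text: 2891 vehicles and a wind speed of 10 m/s.
print(cluster_index(2891, 10))
```

Under these assumed boundaries, the example hour falls in cluster (3,5), in agreement with the text.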
For the quantitative evaluation of the forecasts, the continuous ranked probability score (CRPS) was used. This score is designed to evaluate the reliability, resolution and uncertainty of probabilistic forecasts [42,43]. The CRPS value for a single forecast (hour) ranged from 1.38 to 53.2 µg/m³, with a mean of 9.15 µg/m³. Considering the simplicity of the applied forecasting method and its hourly resolution, this result may be considered satisfactory. For comparison, Balashov et al. [22], analysing their complex REGiS model for forecasting daily ozone concentrations, obtained mean CRPS values ranging from 3.6 to 6.3 ppbv.
Mean CRPS values were also obtained for the individual clusters. The largest errors occurred in the forecasting of concentrations for the lightest winds (Table 6). The accumulation of pollutants, which occurs at the lowest wind speeds and is significantly influenced by values from past time points, is by design not taken into account in the model. For wind speeds above 4 m/s, the mean CRPS stays below 10 µg/m³. Verification of the model using independent data confirmed the effectiveness of the cluster-based model for probabilistic forecasting of pollutant concentrations on a major transport route.
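To make the score concrete, the sketch below evaluates the CRPS of a single log-normal forecast against one observation in two ways: by numerical integration of the definition, CRPS = ∫ (F(x) − 1{x ≥ y})² dx, and by a closed-form expression for the log-normal case. The observation of 22 µg/m³ is the value quoted for the example hour; the distribution parameters are hypothetical, and nothing here is claimed to be the implementation used in the study.

```python
import math


def norm_cdf(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def lognormal_cdf(x, mu, sigma):
    """CDF of LogNormal(mu, sigma); zero for non-positive x."""
    if x <= 0.0:
        return 0.0
    return norm_cdf((math.log(x) - mu) / sigma)


def crps_numeric(y, mu, sigma, steps=20000):
    """CRPS from its definition, using the midpoint rule on both sides of y."""
    upper = max(y, math.exp(mu + 8.0 * sigma))  # tail beyond this is negligible

    def midpoint(a, b, g):
        if b <= a:
            return 0.0
        h = (b - a) / steps
        return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

    below = midpoint(0.0, y, lambda x: lognormal_cdf(x, mu, sigma) ** 2)
    above = midpoint(y, upper, lambda x: (1.0 - lognormal_cdf(x, mu, sigma)) ** 2)
    return below + above


def crps_closed_form(y, mu, sigma):
    """Closed-form CRPS of a log-normal forecast for an observation y > 0."""
    omega = (math.log(y) - mu) / sigma
    mean = math.exp(mu + 0.5 * sigma ** 2)
    return (y * (2.0 * norm_cdf(omega) - 1.0)
            - 2.0 * mean * (norm_cdf(omega - sigma)
                            + norm_cdf(sigma / math.sqrt(2.0)) - 1.0))


# Hypothetical forecast parameters (assumption); observation taken from the text.
mu, sigma, observed = math.log(30.0), 0.5, 22.0
print(crps_numeric(observed, mu, sigma), crps_closed_form(observed, mu, sigma))
```

The CRPS is expressed in the units of the forecast variable, and for a forecast with vanishing spread it reduces to the absolute error, which is why it can be read as a mean prediction error in µg/m³.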

Conclusions
This article has presented a cluster-based approach to the problem of probabilistic forecasting of concentrations of nitrogen dioxide on a major transport route. It was shown to be most effective to consider the impact of traffic flow and wind speed on NO2 concentrations using a matrix-based partitioning of ambient conditions. For each cluster, corresponding to an interval of wind speeds and an interval of traffic flow values, parameters were calculated for log-normal distributions of NO2 concentrations. Based on the obtained probability density functions of the theoretical distributions, probabilities were computed for the occurrence of NO2 concentrations in excess of 40 µg/m³, 100 µg/m³, 200 µg/m³ and 400 µg/m³. The probability that the mean annual permissible atmospheric NO2 concentration will be exceeded is as high as 85.7% for the highest traffic volume (in excess of 5000 vehicles) and very low wind speed (not greater than 2 m/s). For higher threshold values, the accumulation of pollutants was observed to become more significant. The fact that the cluster containing moderate traffic flows of (2000, 3000] vehicles gives the highest probabilities of exceeding the concentrations 100 µg/m³, 200 µg/m³ and 400 µg/m³ is a consequence of the time of day to which these cases correspond: most of them are from the hours 20:01-22:00. Verification of the model using independent data from the first six months of 2018 showed no statistically significant difference in the frequencies of occurrence of NO2 concentrations for the tabulated conditions. The mean prediction error measured by the CRPS was 9.15 µg/m³. Analysis of the probabilities of exceeding the permissible and alarm levels of NO2 concentration showed that the number of vehicles must be reduced substantially to achieve a noticeable reduction in NO2 levels along the transport route.
With an increase in wind speed, the probability that a given NO2 concentration threshold will be exceeded is reduced approximately by half for every 2 m/s. Thus, the movement of air through the city contributes significantly to the removal of pollutants, which in turn depends on sustainable building development that preserves the passage of air.
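The halving rule stated above can be written as P(v + Δv) ≈ P(v) · 0.5^(Δv / 2). The sketch below applies it to a reference probability; the function name and the reference value of 40% are illustrative assumptions, and the rule itself is only the approximate relationship reported in the text, not a fitted model.

```python
def scaled_exceedance(p_ref, wind_increase, halving_step=2.0):
    """Approximate exceedance probability after the wind speed rises by wind_increase (m/s)."""
    # The probability is halved for every `halving_step` m/s of additional wind speed.
    return p_ref * 0.5 ** (wind_increase / halving_step)


# Illustrative reference probability of 40% (assumption, not a value from Table 5).
for dv in (0.0, 2.0, 4.0, 6.0):
    print(f"+{dv:.0f} m/s: {scaled_exceedance(0.40, dv):.3f}")
```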
The main limitation on the applicability of the presented methodology is that it is point-based: a forecast generated in this form cannot be extended to a wider area. Further work will investigate the possibility of extending probabilistic forecasting by the cluster method to several points located along the route, and then extending the forecast to the entire length of the road. A further goal of future work will be to identify partition points for the explanatory variables by optimization, for example by minimizing the variance within a cluster.
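One way to picture the proposed optimization of partition points is an exhaustive search for a single threshold on one explanatory variable that minimizes the total within-cluster sum of squared deviations of the NO2 concentration. This is only a hedged sketch of the idea with synthetic data; the function names are hypothetical, and the search over a single split is a simplification of what a full optimization over several variables and partition points would require.

```python
def within_ss(values):
    """Sum of squared deviations from the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Find the threshold t on x that minimizes within-cluster variability of y.

    Returns (t, cost), where the clusters are x <= t and x > t and cost is the
    sum of the two within-cluster sums of squares.
    """
    pairs = sorted(zip(x, y))
    best = None
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold can separate equal x values
        left = [p[1] for p in pairs[:k]]
        right = [p[1] for p in pairs[k:]]
        cost = within_ss(left) + within_ss(right)
        threshold = 0.5 * (pairs[k - 1][0] + pairs[k][0])
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Synthetic data (assumption): wind speed in m/s and NO2 concentration in µg/m³.
wind = [1.0, 1.5, 2.0, 5.0, 6.0, 7.0]
no2 = [80.0, 75.0, 85.0, 30.0, 25.0, 35.0]
print(best_split(wind, no2))
```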