Eigenvector Spatial Filtering Regression Modeling of Ground PM2.5 Concentrations Using Remotely Sensed Data

This paper proposes a regression model using the Eigenvector Spatial Filtering (ESF) method to estimate ground PM2.5 concentrations. Covariates are derived from remotely sensed data including aerosol optical depth, normal differential vegetation index, surface temperature, air pressure, relative humidity, height of planetary boundary layer and digital elevation model. In addition, cultural variables such as factory densities and road densities are also used in the model. With the Yangtze River Delta region as the study area, we constructed ESF-based Regression (ESFR) models at different time scales, using data for the period between December 2015 and November 2016. We found that the ESFR models effectively filtered spatial autocorrelation in the OLS residuals and resulted in increases in the goodness-of-fit metrics as well as reductions in residual standard errors and cross-validation errors, compared to the classic OLS models. The annual ESFR model explained 70% of the variability in PM2.5 concentrations, 16.7% more than the non-spatial OLS model. With the ESFR models, we performed detail analyses on the spatial and temporal distributions of PM2.5 concentrations in the study area. The model predictions are lower than ground observations but match the general trend. The experiment shows that ESFR provides a promising approach to PM2.5 analysis and prediction.


Introduction
PM 2.5 , particles with an aerodynamic diameter of 2.5 µm or less, are harmful to both the natural environment and human health. These small particles are responsible for such environmental issues as corrosion, soiling, damage to vegetation and especially, reduced visibility [1,2], as they are major air pollutants that cause haze [3,4]. Due to their small size, PM 2.5 particles can penetrate deep into the lungs through breathing and seep into the blood system along with toxic substances on their surfaces, posing severe health risks to the human body. Exposure to PM 2.5 for only a few days can bring about adverse health effects [5][6][7][8].
As the number of smog days surged in major cities across China in recent years, the Chinese government has become increasingly concerned about PM 2.5 pollution and has added more ground stations to monitor this contaminant, from 496 stations in 2012 to 1436 in 2016. This is a substantial improvement but they are still far from enough. Ground monitoring stations remain sparse and distributed unevenly, making it difficult to perform detailed analyses on the spatial and temporal

Ground PM2.5 Concentrations
The Qingyue Data Center (https://data.epmap.org/air/nations) provides air pollutants data collected from the China Environmental Monitoring Center (CEMC). Data are derived from ground monitoring sites, which include hourly and daily average mass concentrations of PM2.5 and other air pollutants. PM2.5 concentrations are measured by the tapered element oscillating microbalance (TEOM) method with an accuracy of ±0.5 μg/m 3 for the daily average. PM2.5 measurements from 1 December 2015 to 30 November 2016 over the YRD Region (including 233 stations in total) were collected from this data center. Invalid data records, for example those with null or negative values, were removed from the dataset.

Remotely Sensed Data
MODIS provides daily 3 km resolution ambient AOD products, which will be used to generate PM2.5 raster grids. The MOD04_3k_v6 dataset was downloaded from the NASA website (https://search.earthdata.nasa.gov) and we extracted the Dark Target algorithm retrieved AOD in the study area. AOD data points with quality flag 0 were removed. The NDVI data were also MODIS

Ground PM 2.5 Concentrations
The Qingyue Data Center (https://data.epmap.org/air/nations) provides air pollutants data collected from the China Environmental Monitoring Center (CEMC). Data are derived from ground monitoring sites, which include hourly and daily average mass concentrations of PM 2.5 and other air pollutants. PM 2.5 concentrations are measured by the tapered element oscillating microbalance (TEOM) method with an accuracy of ±0.5 µg/m 3 for the daily average. PM 2.5 measurements from 1 December 2015 to 30 November 2016 over the YRD Region (including 233 stations in total) were collected from this data center. Invalid data records, for example those with null or negative values, were removed from the dataset.

Remotely Sensed Data
MODIS provides daily 3 km resolution ambient AOD products, which will be used to generate PM 2.5 raster grids. The MOD04_3k_v6 dataset was downloaded from the NASA website (https: Int. J. Environ. Res. Public Health 2018, 15, 1228 4 of 24 //search.earthdata.nasa.gov) and we extracted the Dark Target algorithm retrieved AOD in the study area. AOD data points with quality flag 0 were removed. The NDVI data were also MODIS products (MOD13A3) with a spatial resolution of 1 km. Satellite meteorological data were derived from the Goddard Earth Observing System Data Assimilation System (https://gmao.gsfc.nasa.gov/ products/), which include Surface Temperature (ST), Relative Humidity (RH), Pressure (PS), and Planetary Boundary Layer Height (PBLH). Their spatial resolutions were 0.25 • latitude × 0.3125 • longitude. The Shuttle Radar Topography Mission (SRTM) DEM product was downloaded from the http://srtm.csi.cgiar.org website and its spatial resolution was 90 m.

Pollution Source Data
We extracted the roads networks from OpenStreetMap and obtained the factory locations from BaiduMap. They were used in the PM 2.5 model specification as well as analysis of PM 2.5 pollution causes. As PM 2.5 pollution mainly comes from industrial emissions and motor vehicle exhaust [40,41], we simplified the roads networks by keeping only the types of roads used for motor vehicles, such as primary roads, secondary roads, trunks, raceways, motorways and railways. Duplicate factory points were removed, resulting in a total of 23,299 factory points in the study area.

Data Preprocessing
Several steps were followed to prepare the data for regression modeling. The first step is rescaling all data sets to the same spatial resolution of 3 kilometers, consistent with the AOD data. Because the meteorological data (ST, PS, RH and PBLH) have a coarser spatial resolution than the AOD, we interpolated them to 3 km using Ordinary Kriging with the spherical model. NDVI and DEM have a finer resolution than the AOD and they were therefore resampled into the 3 km grid using bilinear-interpolation. In the second step, we created the density grids for factories and roads at 3 km resolution as potential covariates. The point or line density in a grid cell is calculated as the number of points or the total length of lines falling in the buffer of 24 kilometers. To accommodate temporal analysis, annual, seasonal and monthly averages of each variable were calculated. In the final step, data values of the candidate independent variables, which are extracted from the raster grids, were assigned to the corresponding stations. Stations with null value were removed. In the end, we have 3301 records at annual, seasonal and monthly time scales in total, each record with a PM 2.5 value and a vector of the covariates including AOD, ST, PS, PBLH, RH, NDVI, DEM, densities of factories and roads.

Spatial Regression with Eigenvector Spatial Filtering
Geographic variables often exhibit spatial autocorrelation, which affects the accuracy and uncertainty of the parameter estimates in regression models. Recent advances in spatial statistics provide several approaches to remedy the problem by including spatial autocorrelation in the classic statistical models [42,43]. Spatial autoregressive models, signified by the spatial lag and spatial error models in spatial econometrics, are now commonly used [44][45][46]. Meanwhile, Eigenvector Spatial Filtering Regression (ESFR), which represents spatial autocorrelation as a synthetic variable derived from a linear combination of selected eigenvectors of the spatial weights matrix, is gaining recognitions [47,48]. Equation (1) is the general form of the ESFR model [49]: where Y is an n × 1 vector of dependent variable, X is an n × p matrix containing independent variables, E is an n × k matrix containing k selected eigenvectors, α and β are the corresponding vectors of regression coefficients, and ε is a vector of i.i.d random errors. The linear combination of the selected eigenvectors Eα filters the spatial autocorrelation out of the regression residuals. Because the eigenvectors are orthogonal and uncorrelated with each other, we are able to use them as synthetic variables in the regression model and estimate the parameters using conventional methods such as OLS, constructing models with improved accuracy and reduced uncertainty [37,50]. Researchers have begun to use ESF to model PM 2.5 [51] where the independent variables are observations of other air pollutants at the same monitoring stations as PM 2.5 . The application of the model is limited as the number of monitoring stations is insufficient to cover a large area. In this paper, we use remotely sensed data with high resolution to build the ESFR model, making it more practical for PM 2.5 analysis and prediction.

Model Specification, Assessment and Comparison
Estimating the ESFR model for PM 2.5 concentration includes five steps: (1) construction of the spatial weights matrix for the study area; (2) calculation of the eigenvectors from the centered spatial weights matrix; (3) selections of eigenvectors using step-wise regression with all the covariates; (4) OLS estimation of the regression coefficients; (5) removal of the insignificant covariates in the obtained model and repeating steps (3) and (4) to construct the final ESFR model.
The spatial weights matrix C 0 for the stations can be topology-based as well as distance-based [47,52]. In this paper, C 0 is a distance-based matrix whose (i, j)-th element equals exp(-d i,j /r), where d i,j is the Euclidean distance between stations i and j, and r is the longest distance in the minimum spanning tree covering stations. Subsequently, the spatial weights matrix C 0 is centered as follows: where 1 is an n × 1 vector of ones, which means 11 T is an n × n matrix whose elements all equal one, and T denotes the matrix transpose operator, n is the number of monitoring stations and I is an n-dimension identity matrix. To proceed with the ESFR, eigenvectors are calculated for the centered matrix C 1 , followed by a selection process which often involves such variable selection methods as step-wise regression as well as the Least Absolute Shrinkage and Selection Operator (LASSO) [53][54][55]. Candidate eigenvectors were first calculated from the distance-based weights matrix for the study area ( Figure 1). Since all of the variables have positive spatial autocorrelation, the subset of eigenvectors was initially formed through the criteria (λ i /λ 1 ) ≥ 0.10 [48] which then was further selected along with the covariates through step-wise regression. The combination of eigenvector resulting in the highest R-squared is remained. The initial model is specified in the following form: where β 0 is the intercept, β i (i = 1, . . . , 9) are regression coefficients. E k is an n × k matrix of selected eigenvectors, β k is a k × 1 vector of coefficients for the eigenvectors, ε is an n × 1 error vector. The term E k β k is the spatial filter, accounting for the spatial effects in PM 2.5 distribution. After the eigenvectors are selected, OLS is used for estimating the model as specified in Equation (3). Model performance will be assessed through the common metrics for OLS models, including Adjusted R 2 , Residual Standard Errors (RSE), Mean Absolute Percentage Error (MAPE) and Corrected Akaike Information Criterion (AICc). Residuals' global Moran's I was computed to validate if the spatial autocorrelation was filtered out of the residuals and the residuals were spatially random. To assess model fitting and prediction accuracy, leave-one-out cross validation (LOOCV) was conducted. We selects one station for validation and the remaining stations are used as the training data. This is repeated until every station is used as the validation data once. Then we calculate the Mean Squared Error (MSE) of the validation dataset to assess model estimation accuracy. For LOOCV of ESFR model, the R package "spmoran" provides functions for ESF-based spatial interpolation based on a minimization of expected error, which can be used to calculate the eigenvector for the stations to be predicted. The ESFR model is compared with the conventional non-spatial OLS model (denoted as Global Multiple Linear Regression, or GMLR), using the above performance metrics and cross-validations.

PM 2.5 Distribution Mapping and Cause Analysis
The annual and seasonal ESFR models will be applied to generate the ground PM 2.5 maps. As the covariates were all 3 km raster layers, we interpolated the selected eigenvectors into the 3 km grids and applied the ESFR model to producing PM 2.5 maps. In this way, PM 2.5 concentrations on the grid cells with no monitoring stations were computed and we obtained continuous annual and seasonal PM 2.5 distribution maps in the YRD region. We can evaluate the air quality status as well as analyzing spatiotemporal characteristics of PM 2.5 concentrations in the YRD region from a finer scale using these maps.
To explore the causes of the PM 2.5 distribution, we introduced a measurement of pollution sources density, defined as the average of normalized point density of factory POIs and the line density of roads, as is shown in Equation (4): Road Den_Norm is the normalized Road Den , Fact Den_Norm is the normalized Fact Den . The normalization method for Road Den and Fact Den is shown in Equation (5): Max and Min denote the maximum and minimum values of X. The pollution sources density map was compared to the PM 2.5 maps for visual exploration of the relationship between them. Table 1 is an overview of the original dataset. The average PM 2.5 concentration in the YRD region is 51.6 µg/m 3 . The average AOD is 0.54. The average ST, PS, RH, PBLH and NDVI are 292.0 K, 1001.3 hpa, 69.7%, 389.7 m and 62.0%, respectively. The average elevation is 138.3 m. PM 2.5 concentration in the study area is high on the whole according to the Chinese national standards (GB 3095-2012), with an annual average 47.4% greater than the national limit (35 µg/m 3 ).  Figure 2 shows the distribution characteristics of annual PM 2.5 observations, which ranges from 23.2 to 67.9 µg/m 3 . The histogram (Figure 2a) indicates that the annual PM 2.5 distribution is approximately normal. The PM 2.5 observation map shows the spatial difference ( Figure 2b): the high concentration clusters in Hefei and Bengbu of Anhui Province, while the low concentration is located in Huangshan and Zhoushan of Zhejiang Province. Figure 3 depicts the monthly and seasonal mean PM 2.5 concentrations calculated from ground observations. PM 2.5 pollution is most serious in winter (77.9 µg/m 3 ) followed by spring (54.9 µg/m 3 ) and autumn (42.0 µg/m 3 ). The summer average (30.5 µg/m 3 ) is the lowest. The monthly averages from April to October are below the annual average while the other months are above the mean. It can be seen that the monthly average line is U-shaped and the bottom is reached in August, which denotes the best air quality in the year.   Table 2 shows the Pearson Correlation Coefficients (PCC) between the annual and seasonal PM2.5 concentration averages and the covariates. It can be seen from the annual measurements that PM2.5 concentration is moderately and positively correlated with AOD and PS while negatively correlated with PBLH, RH, DEM and ST. NDVI is weakly and negatively correlated with PM2.5. FactDen and RoadDen are weakly and positively correlated with PM2.5. There are variations in seasonal correlation results. Generally, AOD, PS, FactDen and RoadDen are positively correlated with PM2.5 while PBLH, RH, NDVI and DEM are negatively correlated with PM2.5. There are exceptions: some variables don't have significant correlations with PM2.5 in certain period such as AOD in autumn, RH in summer, and RoadDen in autumn. The relationship between ST and PM2.5 is not stationary through time. They are positively correlated with PM2.5 in spring, negatively correlated in autumn and winter, and not correlated in summer. Table 3 shows the monthly results are consistent with those of annual and seasonal analyses on   Table 2 shows the Pearson Correlation Coefficients (PCC) between the annual and seasonal PM2.5 concentration averages and the covariates. It can be seen from the annual measurements that PM2.5 concentration is moderately and positively correlated with AOD and PS while negatively correlated with PBLH, RH, DEM and ST. NDVI is weakly and negatively correlated with PM2.5. FactDen and RoadDen are weakly and positively correlated with PM2.5. There are variations in seasonal correlation results. Generally, AOD, PS, FactDen and RoadDen are positively correlated with PM2.5 while PBLH, RH, NDVI and DEM are negatively correlated with PM2.5. There are exceptions: some variables don't have significant correlations with PM2.5 in certain period such as AOD in autumn, RH in summer, and RoadDen in autumn. The relationship between ST and PM2.5 is not stationary through time. They are positively correlated with PM2.5 in spring, negatively correlated in autumn and winter, and not correlated in summer.  Table 2 shows the Pearson Correlation Coefficients (PCC) between the annual and seasonal PM 2.5 concentration averages and the covariates. It can be seen from the annual measurements that PM 2.5 concentration is moderately and positively correlated with AOD and PS while negatively correlated with PBLH, RH, DEM and ST. NDVI is weakly and negatively correlated with PM 2.5 . Fact Den  RH, NDVI and DEM are negatively correlated with PM 2.5 . There are exceptions: some variables don't have significant correlations with PM 2.5 in certain period such as AOD in autumn, RH in summer, and Road Den in autumn. The relationship between ST and PM 2.5 is not stationary through time. They are positively correlated with PM 2.5 in spring, negatively correlated in autumn and winter, and not correlated in summer. Table 3 shows the monthly results are consistent with those of annual and seasonal analyses on the whole. However, opposite results appear in several months. For example, PBLH is not correlated with PM 2.5 in May and June. NDVI does not have significant correlation with monthly PM 2.5 values except in December, February, July and November. The relationship between ST and PM 2.5 is even more complex, with the PCC varying from −0.591 to 0.235. PCC of RH varies from −0.550 to 0.228. Results show that the correlations between PM 2.5 and covariates change with time, motivating us to construct PM 2.5 models at multi time scales to avoid temporal effects. However, the PCC is limited in correlation analysis between PM 2.5 and a certain variable because it can be influenced by other influential factors. Correlation analysis based on regression models can be more accurate.

Spatial Autocorrelation Analysis
Since spatial effects have an impact on model accuracy and uncertainty, it is important to examine the nature and magnitude of spatial autocorrelation in the data. Table 4 shows the Moran's I values of PM 2.5 concentrations at different time scales, which are all positive, ranging from 0.296 to 0.610 and statistically significant, indicating the geographic distribution of PM 2.5 is highly clustered. The magnitudes of spatial autocorrelation change at different time scales.
The Moran's I of annual mean PM 2.5 is 0.563. As for the seasonal time scale, the spatial autocorrelation of PM 2.5 is strongest in autumn, followed by winter and spring, and becomes weakest in summer. At the monthly scale, spatial autocorrelation is strongest in March and November but weakest in May and June. Nevertheless, the prevailing presence of spatial autocorrelation across time scales suggests that model performances would be improved greatly if spatial information is incorporated in model specifications.

ESFR Model
ESFR models at annual, seasonal and monthly levels were estimated. Tables 5 and 6 report the modeling results, including the coefficient estimates and the p-values of the ESFR models. For ease of comparison, the variables are standardized and standardized coefficients are given instead of the original coefficients. For insignificant variables removed in the initial modeling step, their coefficients and p-value are marked with '/'. The p-values indicate the significance of variables in PM 2.5 modeling changes with time. In the annual ESFR model, the selected variables including AOD, ST, PS, RH, PBLH, NDVI, DEM and Fact Den are all significant at α = 0.1 level. For the seasonal and monthly models, these variables have significance test results different from the annual model. AOD is not significant in Winter, Spring, Summer and some months such as December, January. RH is not significant in the Spring and January models. PBLH is not significant in May and November. Fact Den is not significant in the December model. Although Road Den is not selected in the annual model, it is significant in January and September.
Among the statistically significant variables, it can be seen from the standardized coefficients (denoted as Beta) that they have different effects on PM 2.5 concentration. In the annual model, PM 2.5 concentration is positively correlated with AOD, PS, RH and Fact Den while negatively correlated with ST, PBLH, NDVI and DEM. For the seasonal and monthly models, the effects of PS, DEM, NDVI and Fact Den on PM 2.5 are consistent with the annual model. However, some variables present results opposite to the annual model. For example, the coefficient of AOD in October is −0.12, indicating a negative effect on PM 2.5 . The coefficients of RH in Summer, April, May and July indicate a negative effect on PM 2.5 . Besides, Road Den has positive effect on PM 2.5 according to the monthly models. ESFR models have different performance in different periods. The annual ESFR model has the best fit with an adjusted R 2 of 0.70, an LOOCV MSE of 19.2 and an AICc of 1255.8. ESFR model performs best in autumn (R 2 adj = 0.64) and worst in spring (R 2 adj = 0.49). Among monthly models, ESFR model in November has the best performance (R 2 adj = 0.73) and worst in June (R 2 adj = 0.36). In December and January when PM 2.5 pollution is often serious, the models also perform well.
Though all nine of the covariates are reported to be significant influential factors in previous PM 2.5 studies, some of them are not significant in our case study. In most of the models, AOD, NDVI and Road Den are not useful independent variables for PM 2.5 estimation while PBLH and Fact Den remained significant in most models. Because the relationships between these independent variables and PM 2.5 change with space and time, our wide span of study area and time period may weaken their relationship and lead to insignificant coefficients.  Table 7 shows the performance metrics of the annual and seasonal PM 2.5 ESFR models. Performance indicators for the annual GMLR models are also given for comparison. The adjusted R 2 of annual ESFR model reaches 0.70, 16.7% higher than the GMLR model. The RSE is 4.24 µg/m 3 and the MAPE is 6.66%, which decreases by 12.7% and 13.5%, respectively, compared with the annual GMLR model. The AICc of ESFR model is lower than the GLMR model, which means there is less information loss in the ESFR model.

Model Fit
For the seasonal models, the autumn ESFR model has the best performance with an adjusted R 2 0.65, which increases by 34.1% over the GMLR model. Its RSE decreases 17.3%. Though the spring ESFR model is the worst among the four seasons, it increased 25.4% in adjusted R 2 compared to the GMLR model. Its RSE decreases 8.5%. Considering that serious air pollution events often occur in winter and autumn, the ESFR model is of great significance for practical application. The monthly results are shown in Figure 4. For the GMLR models, their adjusted R 2 ranged from 0.23 to 0.50, with a mean value of 0.39. The RSE ranged from 4.97 to 12.32 µg/m 3 , with a mean value of 8.41 µg/m 3 . The MAPE ranged from 11.73% to 17.01%, with a mean value of 13.62%. The ESFR models have higher adjusted R 2 , lower RSE and MAPE than the GMLR models. The adjusted R 2 ranged from 0.36 to 0.73, with a mean value of 0.54. The RSE ranged from 4.03 µg/m 3 to 10.51 µg/m 3 , with a mean value of 7.26 µg/m 3 . The MAPE ranged from 9.34% to 14.10%, with a mean value of 11.33%. The AICc values of ESFR models are obviously lower than those of GMLR models. In conclusion, ESFR model has significant improvements on model fitting precision compared with GMLR model. The monthly results are shown in Figure 4. For the GMLR models, their adjusted R 2 ranged from 0.23 to 0.50, with a mean value of 0.39. The RSE ranged from 4.97 to 12.32 μg/m 3 , with a mean value of 8.41 μg/m 3 . The MAPE ranged from 11.73% to 17.01%, with a mean value of 13.62%. The ESFR models have higher adjusted R 2 , lower RSE and MAPE than the GMLR models. The adjusted R 2 ranged from 0.36 to 0.73, with a mean value of 0.54. The RSE ranged from 4.03 μg/m 3 to 10.51 μg/m 3 , with a mean value of 7.26 μg/m 3 . The MAPE ranged from 9.34% to 14.10%, with a mean value of 11.33%. The AICc values of ESFR models are obviously lower than those of GMLR models. In conclusion, ESFR model has significant improvements on model fitting precision compared with GMLR model.  Table 8 shows the Moran's I of model residuals. All GMLR residuals have significant spatial autocorrelation with clustered residuals. On the contrary, all ESFR models are able to filter out spatial autocorrelation in the residuals, rendering insignificant Moran's I. As we can see from the performance metrics, filtering spatial autocorrelation from the residuals substantially improves  Table 8 shows the Moran's I of model residuals. All GMLR residuals have significant spatial autocorrelation with clustered residuals. On the contrary, all ESFR models are able to filter out spatial autocorrelation in the residuals, rendering insignificant Moran's I. As we can see from the performance metrics, filtering spatial autocorrelation from the residuals substantially improves model fit and reduces model errors.

Model Cross Validation
Cross validations were conducted to assess model overfitting and prediction accuracy. Table 9 shows that the MSE of the annual ESFR model is 19.2, 22.8% lower than that from GMLR (MSE = 24.9). For the seasonal models, the MSE of winter ESFR model is 12.5% lower than that of GMLR model. The MSE of the spring ESFR model is 14.0% lower than the GMLR model. The MSE of the summer ESFR model is 28.9% lower than the GMLR model. The MSE of the autumn ESFR model is 29.9% lower than the GMLR model. For the monthly models ( Figure 5), the MSE values of the GMLR models range from 25.2 to 157.0 and the mean is 78.5. The MSE values of the ESFR models range from 17.5 to 124.5 and the mean is 61.2. In all the monthly models, ESFR models have substantially lower cross validation errors than GMLR. It can be concluded that overall ESFR performs the best in the estimation of PM 2.5 with the set of predictors where no monitoring stations exist.  Figure 6a,c,e,g,i depict the annual and seasonal PM2.5 spatial distributions. They were derived from the ESFR models. Figure 6b,d,f,h,j were Kriging interpolations of observed PM2.5 at ground stations. It is obvious that PM2.5 spatial distributions based on ESFR models contain more details than the direct interpolations. For example, the junction of Lu'an and Anqing is an area with low PM2.5 concentration. However it is not reflected in the interpolation maps (Figure 6b,d,f,h,j).  Figure 6a,c,e,g,i depict the annual and seasonal PM 2.5 spatial distributions. They were derived from the ESFR models. Figure 6b,d,f,h,j were Kriging interpolations of observed PM 2.5 at ground stations. It is obvious that PM 2.5 spatial distributions based on ESFR models contain more details than the direct interpolations. For example, the junction of Lu'an and Anqing is an area with low PM 2.5 concentration. However it is not reflected in the interpolation maps (Figure 6b,d,f,h,j).  Figure 6a,c,e,g,i depict the annual and seasonal PM2.5 spatial distributions. They were derived from the ESFR models. Figure 6b,d,f,h,j were Kriging interpolations of observed PM2.5 at ground stations. It is obvious that PM2.5 spatial distributions based on ESFR models contain more details than the direct interpolations. For example, the junction of Lu'an and Anqing is an area with low PM2.5 concentration. However it is not reflected in the interpolation maps (Figure 6b,d,f,h,j).   Figure 7. For the whole year, 30.4% of the area has qualified air (level 1). In 43.7% of the area, the PM2.5 is at level 2. 25.8% of the area has level 3 PM2.5 pollution. Clearly, PM2.5 pollution in the YRD region is rather serious.  Figure 7. For the whole year, 30.4% of the area has qualified air (level 1). In 43.7% of the area, the PM 2.5 is at level 2. 25.8% of the area has level 3 PM 2.5 pollution. Clearly, PM 2.5 pollution in the YRD region is rather serious.

PM2.5 Distribution Maps
To demonstrate further the spatial effects on PM 2.5 distribution, we calculated the linear combination of the eigenvectors, E k β k , for both the annual and seasonal models and visualized them in Figure 8b,d,f,h,j. The spatial patterns between each pair are strikingly similar because these eigenvector maps contain local spatial information of PM 2.5 distribution. In Figure 8a, there is an obvious high-high cluster of PM 2.5 concentration: Xuzhou-Suqian and it exactly corresponds to a highlighted area in the eigenvector map in Figure 8b. Besides, there is a low-low cluster region: Lishui-Taizhou in Zhejiang and it also corresponds to a highlighted area. For the seasonal maps, the situations are almost identical to these two cluster regions but there are differences. For example, in winter (Figure 8c), another high-high cluster appears in Hefei-Chuzhou and correspondingly in Figure 8d the E k β k is high in this area. In spring (Figure 8e), Changzhou-Nantong is high in PM 2.5 concentration and it also has a relatively high value in Figure 8f. In autumn (Figure 8i), the high-high cluster expands from Xuzhou-Suqian to Xuzhou-Bengbu-Huaibei, and the highlighted area changes too (Figure 8j). The existence of high-high and low-low clusters of PM 2.5 concentration shows that PM 2.5 distribution is significantly influenced by spatial heterogeneity and spatial autocorrelation in YRD region and the spatial patterns change with time. To demonstrate further the spatial effects on PM2.5 distribution, we calculated the linear combination of the eigenvectors, Ekβk, for both the annual and seasonal models and visualized them in Figures 8b,d,f,h,j. The spatial patterns between each pair are strikingly similar because these eigenvector maps contain local spatial information of PM2.5 distribution. In Figure 8a, there is an obvious high-high cluster of PM2.5 concentration: Xuzhou-Suqian and it exactly corresponds to a highlighted area in the eigenvector map in Figure 8b. Besides, there is a low-low cluster region: Lishui-Taizhou in Zhejiang and it also corresponds to a highlighted area. For the seasonal maps, the situations are almost identical to these two cluster regions but there are differences. For example, in winter (Figure 8c), another high-high cluster appears in Hefei-Chuzhou and correspondingly in Figure 8d the Ekβk is high in this area. In spring (Figure 8e), Changzhou-Nantong is high in PM2.5 concentration and it also has a relatively high value in Figure 8f. In autumn (Figure 8i), the high-high cluster expands from Xuzhou-Suqian to Xuzhou-Bengbu-Huaibei, and the highlighted area changes too (Figure 8j). The existence of high-high and low-low clusters of PM2.5 concentration shows that PM2.5 distribution is significantly influenced by spatial heterogeneity and spatial autocorrelation in YRD region and the spatial patterns change with time. eigenvector maps contain local spatial information of PM2.5 distribution. In Figure 8a, there is an obvious high-high cluster of PM2.5 concentration: Xuzhou-Suqian and it exactly corresponds to a highlighted area in the eigenvector map in Figure 8b. Besides, there is a low-low cluster region: Lishui-Taizhou in Zhejiang and it also corresponds to a highlighted area. For the seasonal maps, the situations are almost identical to these two cluster regions but there are differences. For example, in winter (Figure 8c), another high-high cluster appears in Hefei-Chuzhou and correspondingly in Figure 8d the Ekβk is high in this area. In spring (Figure 8e), Changzhou-Nantong is high in PM2.5 concentration and it also has a relatively high value in Figure 8f. In autumn (Figure 8i), the high-high cluster expands from Xuzhou-Suqian to Xuzhou-Bengbu-Huaibei, and the highlighted area changes too (Figure 8j). The existence of high-high and low-low clusters of PM2.5 concentration shows that PM2.5 distribution is significantly influenced by spatial heterogeneity and spatial autocorrelation in YRD region and the spatial patterns change with time.

PM2.5 Spatial-temporal Analysis in YRD region
The PM2.5 distribution maps show apparent spatial heterogeneity. The north of the YRD region has higher PM2.5 concentration than the south. Jiangsu Province has the highest annual mean PM2.

PM 2.5 Spatial-temporal Analysis in YRD region
The PM 2.5 distribution maps show apparent spatial heterogeneity. The north of the YRD region has higher PM 2.5 concentration than the south. Jiangsu Province has the highest annual mean For different time scales, there exists similar spatial patterns of PM 2.5 concentrations. Table 10 shows the top three cities with the best or the worst air quality in four seasons. The high concentration value always appears in the vicinity of Taizhou in Jiangsu, the junction area of Suqian and Huai'an, Hefei and Ma'anshan. The low concentration always clusters in the junction area of Lu'an and Anqing, Xuancheng, Hangzhou and several cities in southern Zhejiang province such as Lishui and Taizhou. In the perspective of changes over time, PM 2.5 average concentrations in the four seasons can be sorted in descending order as winter (65.2 µg/m 3 ), spring (42.0 µg/m 3 ), autumn (33.6 µg/m 3 ) and summer (24.7 µg/m 3 ). This tendency is consistent with the station monitoring data but the concentrations are undervalued for all the seasons. Besides, according to Figure 7, PM 2.5 pollution is the severest in winter with 60.0% of the area having a level-4 PM 2.5 pollution, which can be detrimental to human health. In spring, the percentage of area with unqualified PM 2.5 concentration reaches up to 71.4% but the pollution levels are not as high as winter. The air is the best in summer, with 98.3% of the area above the national limit. The remaining 1.7% is at level 2 and has minor health consequence. Then in autumn the percentage of area with qualified PM 2.5 concentration decreases to 40.9%. Figure 9 shows the density of likely pollution sources in the YRD region. The density value presents local causes for PM 2.5 pollution to some degree. Shanghai, Changzhou, central Hefei, Jiaxing and the east of Hangzhou are regions with concentrated pollution sources, which can explain the relatively high PM 2.5 concentrations in these areas. However, there are high PM 2.5 concentration areas, such as Chuzhou and Anqing, with low pollution sources density. There are also low PM 2.5 concentration areas with high pollution sources density, such as Taizhou in Zhejiang. These may be the results of particle dispersion influenced by temperature, pressure and so on. Overall, this is a simple quantitative way to find the main pollutant sources in a region, which will be helpful for air pollution treatment. More accurate predictions require modeling atmospheric transmissions, which is beyond the scope of this project.

Pollution Sources Analysis
such as Chuzhou and Anqing, with low pollution sources density. There are also low PM2.5 concentration areas with high pollution sources density, such as Taizhou in Zhejiang. These may be the results of particle dispersion influenced by temperature, pressure and so on. Overall, this is a simple quantitative way to find the main pollutant sources in a region, which will be helpful for air pollution treatment. More accurate predictions require modeling atmospheric transmissions, which is beyond the scope of this project.

Spatial-Temporal Analysis of PM2.5 Concentrations Based on ESFR Models
In modeling ground PM2.5 concentrations, spatial autocorrelation is a main factor that limits the performances of conventional OLS models. In this paper, we developed eigenvector spatial filtering regression models, filtering spatial information from the regression residuals and representing it as a linear combination of selective eigenvectors of the spatial weights matrix in the model. In this way, residuals spatial autocorrelation is substantially reduced and model precision improved, enabling us to perform more accurate and reliable analyses and predictions.
The resulting models shed new lights on the relationships between PM2.5 and the atmospheric conditions, terrain, ground vegetation, and cultural features. By comparisons of models from different periods, we revealed the influences of the independent variables on PM2.5 and the variations of individual impacts over time, most of which are consistent with existing research findings. PBLH is negatively correlated with PM2.5 because Planetary Boundary Layer (PBL) can weaken the exchange between earth surface and free troposphere. The lower the PBLH is, the more particles are restricted near ground [56]. ST and RH pose different influence on PM2.5 in different periods as Tables 5 and 6 show. As for ST, high temperature contributes to photochemical activity to produce more fine particles; however, it can also promote the convection of air and decrease PM2.5 concentrations. The

Spatial-Temporal Analysis of PM 2.5 Concentrations Based on ESFR Models
In modeling ground PM 2.5 concentrations, spatial autocorrelation is a main factor that limits the performances of conventional OLS models. In this paper, we developed eigenvector spatial filtering regression models, filtering spatial information from the regression residuals and representing it as a linear combination of selective eigenvectors of the spatial weights matrix in the model. In this way, residuals spatial autocorrelation is substantially reduced and model precision improved, enabling us to perform more accurate and reliable analyses and predictions.
The resulting models shed new lights on the relationships between PM 2.5 and the atmospheric conditions, terrain, ground vegetation, and cultural features. By comparisons of models from different periods, we revealed the influences of the independent variables on PM 2.5 and the variations of individual impacts over time, most of which are consistent with existing research findings. PBLH is negatively correlated with PM 2.5 because Planetary Boundary Layer (PBL) can weaken the exchange between earth surface and free troposphere. The lower the PBLH is, the more particles are restricted near ground [56]. ST and RH pose different influence on PM 2.5 in different periods as Tables 5 and 6 show. As for ST, high temperature contributes to photochemical activity to produce more fine particles; however, it can also promote the convection of air and decrease PM 2.5 concentrations. The relationship between RH and PM 2.5 is also complex. When RH is low, PM 2.5 concentration increases because of hygroscopic growth [57]. However, when RH is high enough, fine particles cluster together and fall to the ground, causing a decrease of PM 2.5 concentrations in the air [16]. DEM is negatively correlated with PM 2.5 because higher altitudes have good PM 2.5 dispersion conditions [22]. PS has a positive correlation with PM 2.5 concentration because high pressure can cause downdrafts and an accumulation of particles near the ground [58]. NDVI is not as important as reported in the literature, according to our study (Tables 5 and 6). The PCCs in Tables 2 and 3 also indicate a weak correlation between NDVI and PM 2.5 . In several models, NDVI shows a negative effect because vegetation can absorb PM 2.5 emissions. In contrast, Fact Den and Road Den are positively correlated with PM 2.5 concentrations, because they relate to the emission of pollutants. Road Den is also removed from most of the models as transported emissions tend to move up to the boundary layer, hence are likely irrelevant to ground PM 2.5 concentrations in our study area [25].
AOD is an important variable for modeling PM 2.5 concentrations. On the whole as the annual model results in Table 5 show, AOD is positively correlated with PM 2.5 concentration, because AOD tells how much sunlight is absorbed or scattered by aerosol particles. However, as we can see in most of the models, AOD has been removed for being not significant in the first stepwise procedure. We speculate a number of possible causes for this inconsistency. First, most studies on the relationship between AOD and PM 2.5 are based on daily aggregates of stations located close to each other as a means to alleviate the missing data issue. As a result, AOD and PM 2.5 may have significant relations on a daily time scale. This approach however may not work well for a longer duration because there are more chances of missing data due to atmospheric conditions. Second, the amount of missing data differs between stations, so the resulting means are biased. The relationship between PM 2.5 and AOD weakened after calculating annual, seasonal and monthly PM 2.5 averages [59]. If rudimentary interpolation and imputation are to be performed to estimate the missing data, excessive amount of errors will likely to be introduced. Besides, the large span of study area can also weaken their relationship due to the spatial heterogeneity. A previous study showed that there was a linear PM 2.5 -AOD relationship at one site in Italy but the results were different at other sites in Los Angeles and Beijing [60]. Applying more advanced methods of imputation to generate a more reliable AOD aggregates will lead to substantial improvement with the models.
Both in-sample fit and cross-validation show that ESFR model performs better than GMLR. Thus the ESFR model combined with remotely sensed data can be an effective way of estimating PM 2.5 concentrations. PM 2.5 distribution maps show that the northern YRD region has a higher concentration than the south. Jiangsu Province is the most seriously polluted among the four administrative regions, while Zhejiang is the cleanest. PM 2.5 concentrations are the highest in winter and the lowest in summer. These spatial and temporal analyses can be verified by current researches and the station observations, indicating our ESFR model is a good approximation of PM 2.5 processes in the region [38,61].
According to the ESF theory [47], spatial influence on PM 2.5 distribution can be visualized by the linear combination E k β k . In our case study of the YRD region, although we had different eigenvectors for the annual and seasonal models and linear combination of eigenvectors also varied with time, there are similar spatial patterns revealed by the eigenvector maps. By comparing with the PM 2.5 distribution maps, we found a corresponding relationship between high values of E k β k and the clusters of PM 2.5 concentrations. This geographic pattern prevails in the YRD region. In addition, high-high or low-low PM 2.5 clusters such as Xuzhou-Suqian and Lishui-Wenzhou were found. This is another supporting evidence that ESFR model can effectively uncover spatial structures using eigenvector maps, which is a unique advantage of the ESF approach.

Real-Time Monitoring of Ground PM 2.5 Concentrations
In addition to long term PM 2.5 analysis as shown in the annual, seasonal and monthly modeling of the YRD region, the ESFR model provides a way of real-time monitoring of ground PM 2.5 concentrations. In the previous ESFR PM 2.5 models [51], the independent variables are observations of other air pollutants, including PM 10 , SO 2 , NO 2 , CO and O 3 , which are collected from the same monitoring stations as PM 2.5 . They have the same resolutions as PM 2.5 and are of limited relevance in the estimation of PM 2.5 on a finer spatial or temporal scale. Besides, current methods of obtaining PM 2.5 concentration data are through ground measurements at fixed stations or portable PM 2.5 detectors, which can be time consuming and costly, particularly if greater spatial and temporal resolutions are required. In our ESFR model, most of the covariates are remotely sensed data with increasing spatial and temporal resolutions, such as AOD, ST, PS, PBLH and so on. Considering that ESFR model has a good prediction ability, we can construct ESFR models based on real time remotely sensed data and provide short-term or near real-time estimations of PM 2.5 concentrations for a given region, enabling more effective air quality management and more timely environmental guidance for daily activities.

Limitations and Future Enhancements
Missing observations remain a major problem with AOD, which may well be a reason for the weak association with PM 2.5 in the model. In addition to cloud cover, the dark target algorithm has trouble in retrieving AOD in some urban areas. The daily and monthly models are affected the most due to a low AOD coverage. Our future work is to combine AOD products from various sources into a database with large spatial coverage [28] and high precision so that it can be widely used in this field. In addition, we will further improve the model precision experimenting with different specifications of spatial weights matrix and by exploring additional covariates. As the distribution of PM 2.5 is influenced not only by spatial autocorrelation but also time effects, an integrated spatio-temporal regression model should be the ultimate approach to modeling air pollution [62], where the eigenvector spatial filter is a promising direction [50,63]. Lastly, atmospheric transmission and dynamics have yet to be incorporated in the current modeling process, which is clearly a limitation, though such effects were alleviated to some extent because the study area is large and the temporal span is year long hence the variables can be seen as stable and representative in the study area.

Conclusions
As PM 2.5 pollution becomes increasingly severe, it is necessary to develop accurate and reliable models for studying the spatial and temporal characteristics of ground PM 2.5 . The contribution of this paper is two-fold. On the one hand, we develop an ESFR model for ground PM 2.5 concentrations estimation. Spatial influence is incorporated to classic OLS models and performances are substantially improved. Spatial characteristics of PM 2.5 concentrations are effectively shown in the model specification. On the other hand, models with improved accuracies combined with remotely sensed data offer an effective means to estimate PM 2.5 concentrations over time and space. High resolution and real-time PM 2.5 distribution maps can be generated and provide detailed analysis results for chronic and epidemiological studies. However, some of the covariates used in our model are not significant as reported by previous studies. Data with higher quality will be explored and applied in our next study to further analyze spatial-temporal characteristics of ground PM 2.5 concentrations. The spatial differences in the relationships between influential factors and PM 2.5 are not shown in the model. As the next step, we will conduct GWR and extract location varied coefficients from ESF so as to analyze the coefficients' spatial variations.