An Online Contaminant Classification Method Based on MF-DCCA Using Conventional Water Quality Indicators

Abstract: Emergent contamination warning systems are critical to ensuring drinking water supply security. After contaminants have been detected, identifying their type helps in choosing remediation measures. This paper proposes an online contaminant classification method that exploits abnormal fluctuation information and the correlation between 12 water quality indicators to discriminate contaminants comprehensively and accurately. First, multi-fractal detrended fluctuation analysis (MF-DFA) is used to select indicators with abnormal fluctuation, and multi-fractal detrended cross-correlation analysis (MF-DCCA) is used to measure the cross-correlation between indicators. Subsequently, the algorithm fuses the abnormal probability of each indicator and constructs the abnormal probability matrix to further judge the abnormal fluctuation of indicators using D-S evidential theory. Finally, the singularity index of the cross-correlation function and the selected indicators are used for classification by cosine distance. Experiments with five chemical contaminants at three concentration levels were carried out, and the results show that the method can weaken the disturbance of water quality background noise and other interfering factors. It effectively improved the classification accuracy at low concentrations compared with three other methods: a method using a triple standard deviation threshold, a method using single indicator fluctuation analysis only, and a method without fluctuation analysis. The method can be applied to water quality emergency monitoring systems to reduce contaminant misclassification.


Introduction
The frequent occurrence of emergent contamination events in drinking water pipes poses a great threat to drinking water supply security. It is particularly critical to establish a sound emergency warning system for water environmental pollution [1,2]. The accurate and timely classification of contaminants is conducive to taking targeted measures to deal with pollution sources, which is an important prerequisite for water rescue work.
The most commonly used contaminant classification method is laboratory-based analysis, e.g., ICP-MS. It has the advantages of a low detection limit and high precision, and it supports both contaminant classification and quantitative analysis. However, it is time-consuming and can hardly meet the need for online classification in a water quality warning system. Some researchers use online compound-specific sensors to detect the type of contaminants. Although this approach is faster than laboratory-based analysis, it can normally identify only one type or a small group of contaminants [3][4][5]. Because analysis based on conventional water quality indicators is fast and suitable for online monitoring, some scholars have tried to develop online methods for identifying contaminants using these indicators. At present, online classification of contaminants in water pipelines is mainly based on supervised classification methods. Kroll et al. [6] processed five independent water quality parameters (pH, Conductivity, Turbidity, Residual Chlorine, and TOC) into a single trigger signal, and the direction of the deviation signal was related to the nature of the contaminant. A deviation signal library was established from experimental data, and contaminants could be distinguished by comparing the deviation signal with the signals in the library. Yang et al. [7] studied the changes of different water quality indicators caused by 11 kinds of contaminants based on a real-time adaptive signal processing method, established four contaminant classification systems, and used the geometric characteristics of the response of water quality indicators to distinguish the categories of contaminants. Liu et al. [8] used a clustering algorithm to obtain the class center of the contaminant response signal, and identified contaminants by calculating the Mahalanobis distance between the monitored sample value and the class center. The team [9] then employed cosine distance to measure similarity and determine the category of contaminants; compared with the Mahalanobis distance, it better reduces the influence of unknown contaminant concentrations. Huang et al. [10] proposed a multi-classification model based on a support vector machine (SVM) for contaminant classification, and introduced the classification probability to distinguish contaminants. To some extent, this avoided making a single decision when classification features were unclear in the initial phase of contaminant injection.
There are problems of information redundancy and low signal-to-noise ratio in conventional water quality indicators data, affected by the sensors and fluctuation of water quality background. It is difficult to identify the contaminants when the concentration of contaminants is low in the early stage of sudden pollution incidents. The existing methods for online classification of contaminants based on conventional water quality indicators have achieved relatively good classification results. However, multiple water quality indicators show linkage changes during the occurrence of sudden water pollution incidents, and the above methods do not fully explore the correlation and difference among indicators with abnormal fluctuations caused by contaminants, which limits the accuracy of contaminant classification to some extent at low concentrations.
To address the above problem, this paper proposes an online classification method for contaminants in water pipelines based on cross-correlation analysis combined with D-S evidential theory. First, the indicator series with abnormal fluctuation were picked out using multi-fractal detrended fluctuation analysis (MF-DFA), and the cross-correlation between these selected water quality indicator series was measured using multi-fractal detrended cross-correlation analysis (MF-DCCA). Then, the abnormal probability of each indicator was fused and the abnormal probability matrix was constructed to further judge the abnormal fluctuation of indicators using D-S evidential theory. Finally, the singularity index of the cross-correlation function and the time series of the selected indicators formed eigenvectors, and contaminants were classified using cosine distance. Compared with three other methods, the proposed approach effectively filtered out signal noise and other interfering factors by further exploring the cross-correlation between indicators. It revealed the hydrodynamic characteristics of different contaminants hidden in complex data and improved the classification accuracy at low concentrations.

Principles and Methodology
The online contaminant classification method proposed in this paper is mainly based on MF-DCCA (multi-fractal detrended cross-correlation analysis) and D-S evidential theory.
The workflow of the method is shown in Figure 1 and is divided into four parts: abnormal fluctuation analysis of single indicators, cross-correlation analysis of multiple indicators with abnormal fluctuation, abnormal probability information fusion, and classification based on cosine distance. First, the MF-DFA algorithm was used to evaluate the fluctuation of the time series of all indicators. Second, the cross-correlation of the indicators screened in the previous step was analyzed with the MF-DCCA algorithm. Third, the judgment results from the different sensors monitoring water quality indicators were fused at the decision-making level using D-S evidential theory. Finally, the fused information was used as auxiliary evidence and combined with the selected anomalous fluctuation indicator series to form feature vectors for identification.
Multifractal detrended fluctuation analysis (MF-DFA) was proposed by Kantelhardt to detect the multifractal properties of nonlinear and nonstationary time series, and it provides efficient tools for estimating the multifractal spectrum. The technical details of MF-DFA are given in [11]. In this paper, MF-DFA was used to analyze the time series of conventional water quality indicators. For a nonstationary indicator time series $x(i)$ of length $N$, $i = 1, 2, \ldots, N$, the multifractal spectrum $f(\alpha)$ can be obtained after analysis. Figure 2 shows the analysis result for the Ammonia Nitrogen time series after the injection of Copper Sulfate at different concentrations.
$f(\alpha)$ is a smooth convex curve with a single peak, and its x-coordinate $\alpha$ is the singularity index, reflecting the growth probability of the fractal in a small region. $\Delta\alpha = \alpha(-\infty) - \alpha(+\infty)$ indicates the inhomogeneity of the distribution of the Ammonia Nitrogen time series over the whole probability measure. The larger $\Delta\alpha$ is, the stronger the multifractal properties. The $\Delta\alpha$ of each indicator time series was compared with the corresponding threshold, and $\alpha_0$ was used as auxiliary information to judge whether the indicator had abnormal fluctuation. The specific fluctuations of these indicators caused by the current contaminants were measured; water quality indicators with no abnormal fluctuation or with only inconspicuous fluctuation (equivalent to a noise signal) were eliminated, and the time series of indicators with real abnormal fluctuation were passed to the next processing step, which effectively weakens the disturbance of water background fluctuation and other noise.
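The MF-DFA quantities used here can be illustrated with a minimal Python sketch. It follows the standard textbook procedure (profile, segment-wise polynomial detrending, q-th order fluctuation function, Legendre transform) and is not the authors' implementation; the function name `mfdfa_spectrum`, the scale list, the range of q, and the white-noise test series are all assumptions chosen for illustration.

```python
import numpy as np

def mfdfa_spectrum(x, scales, qs, order=1):
    """Standard MF-DFA sketch: returns h(q), alpha, f(alpha) and the width delta_alpha."""
    x = np.asarray(x, dtype=float)
    qs = np.asarray(qs, dtype=float)
    profile = np.cumsum(x - x.mean())              # integrated, mean-removed series
    n = profile.size
    log_fq = np.empty((qs.size, len(scales)))
    for j, s in enumerate(scales):
        n_seg = n // s
        # non-overlapping segments taken from both ends of the profile
        segs = np.vstack([profile[:n_seg * s].reshape(n_seg, s),
                          profile[n - n_seg * s:].reshape(n_seg, s)])
        t = np.arange(s)
        coeffs = np.polyfit(t, segs.T, order)      # one polynomial fit per segment
        trend = (np.vander(t, order + 1) @ coeffs).T
        f2 = np.mean((segs - trend) ** 2, axis=1)  # detrended variance per segment
        for k, q in enumerate(qs):
            if np.isclose(q, 0.0):                 # q = 0 uses the logarithmic average
                log_fq[k, j] = 0.5 * np.mean(np.log(f2))
            else:
                log_fq[k, j] = np.log(np.mean(f2 ** (q / 2))) / q
    log_s = np.log(scales)
    h = np.array([np.polyfit(log_s, row, 1)[0] for row in log_fq])  # slopes = h(q)
    tau = qs * h - 1                               # scaling exponent tau(q)
    alpha = np.gradient(tau, qs)                   # Legendre transform: alpha = d tau / d q
    f_alpha = qs * alpha - tau
    return h, alpha, f_alpha, alpha.max() - alpha.min()

# Illustrative input only: white noise, for which h(2) should be close to 0.5.
rng = np.random.default_rng(0)
noise = rng.standard_normal(4096)
scales = [16, 32, 64, 128, 256]
qs = np.linspace(-5, 5, 21)
h, alpha, f_alpha, delta_alpha = mfdfa_spectrum(noise, scales, qs)
print(round(float(delta_alpha), 3))
```

For a real indicator series, the width returned by such a routine would be the quantity compared with the indicator's threshold; how the thresholds are chosen is not covered by this sketch.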

Cross-correlation Analysis of Multiple Indicators Based on MF-DCCA
Multifractal detrended cross-correlation analysis (MF-DCCA) was proposed by Zhou [12] to reveal the multifractal features of two nonstationary time series, and it has found applications in investment markets [13][14][15], environmental analysis [16][17][18], biomedicine [19][20][21], traffic data [22], and the power industry [23]. The technical details of MF-DCCA are given in [12]. Let $x(i)$ and $y(i)$, $i = 1, 2, \ldots, N$, be the time series of two water quality indicators. The analysis of these two indicators using MF-DCCA proceeds as follows:
1. Divide $x$ and $y$ into $N_s$ pieces of data of length $s$, that is, $N_s = \lfloor N/s \rfloor$.
2. Calculate the cross-correlation fluctuation function between $x$ and $y$ in each piece, after removing the local trends of the time series $x$ and $y$.
3. Calculate the de-trending covariance function of order $q$ between time series $x$ and $y$, $F_q(s)$.
(1) When there is a long-range correlation between time series $x$ and $y$, $F_q(s)$ and $s$ follow a power-law relationship, $F_q(s) \sim s^{h_{xy}(q)}$, where $h_{xy}(q)$ is the generalized cross-correlation Hurst exponent.
(2) The de-trended cross-correlation index of time series $x$ and $y$ is then obtained from $h_{xy}(q)$, and the singularity index $\alpha$ and the multifractal spectrum $f(\alpha)$ follow from it according to the Legendre transformation.
In this paper, for the water quality indicator series with abnormal fluctuation screened out in the previous step, the MF-DCCA algorithm was employed to analyze the multifractal features of the cross-correlation function of each pair of indicator series. Because MF-DCCA fully considers the cross-correlation between data, it can better exclude the influence of signal noise and other factors on the results and effectively explore the dynamic mechanism hidden in the data. Based on the multifractal spectrum, five eigenvalues with explicit physical meaning can be extracted to form an eigenvector, which enables a more complete description of the fluctuations of nonstationary time series. The width of the spectrum $\Delta\alpha$, equivalent to $\alpha(-\infty) - \alpha(+\infty)$, describes the strength of multifractality. The larger $\Delta\alpha$ is, the stronger the cross-correlation between the indicators. If $\Delta\alpha$ for the time series $x$ and $y$ equals the threshold, the time series of the two water quality indicators are at the critical point between relevant and irrelevant. When $\Delta\alpha$ is above the threshold, there is a strong correlation; when it is below the threshold, there is no correlation or only a weak correlation between them.
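The MF-DCCA steps can be sketched in Python as follows. This is a hedged illustration of the standard formulation, not the code used in the paper: the names `detrended_residuals` and `mfdcca_width` are hypothetical, the series are integrated into profiles before segmentation (a common convention that the text does not state), and the absolute value of the detrended covariance is taken so that fractional powers stay real, which is also an assumption.

```python
import numpy as np

def detrended_residuals(series, s, order=1):
    """Cut the profile of `series` into segments of length s and remove a polynomial trend."""
    series = np.asarray(series, dtype=float)
    profile = np.cumsum(series - series.mean())
    n_seg = profile.size // s
    segs = profile[:n_seg * s].reshape(n_seg, s)   # forward segments only, for brevity
    t = np.arange(s)
    coeffs = np.polyfit(t, segs.T, order)
    return segs - (np.vander(t, order + 1) @ coeffs).T

def mfdcca_width(x, y, scales, qs, order=1):
    """Return the generalized cross-correlation Hurst exponents h_xy(q) and the width delta_alpha."""
    qs = np.asarray(qs, dtype=float)
    log_fq = np.empty((qs.size, len(scales)))
    for j, s in enumerate(scales):
        rx = detrended_residuals(x, s, order)
        ry = detrended_residuals(y, s, order)
        # detrended covariance per segment; abs() is an assumption (see lead-in)
        f2 = np.abs(np.mean(rx * ry, axis=1))
        for k, q in enumerate(qs):
            if np.isclose(q, 0.0):
                log_fq[k, j] = 0.5 * np.mean(np.log(f2))
            else:
                log_fq[k, j] = np.log(np.mean(f2 ** (q / 2))) / q
    log_s = np.log(scales)
    h_xy = np.array([np.polyfit(log_s, row, 1)[0] for row in log_fq])
    tau = qs * h_xy - 1
    alpha = np.gradient(tau, qs)                   # Legendre transform
    return h_xy, alpha.max() - alpha.min()

# Illustrative data only: y is a noisy copy of x, so the two series are correlated.
rng = np.random.default_rng(1)
x = rng.standard_normal(4096)
y = x + 0.5 * rng.standard_normal(4096)
scales = [16, 32, 64, 128, 256]
qs = np.linspace(-4, 4, 17)
h_xy, width = mfdcca_width(x, y, scales, qs)
print(round(float(width), 3))
```

When both inputs are the same series, the routine reduces to an MF-DFA of that series, which is a convenient sanity check before applying it to pairs of indicators.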

Abnormal Probability Fusion Based on D-S Evidential Theory
D-S evidential theory was first proposed by professor Dempster [24] of Harvard University in 1967 and further improved by his student Shafer [25] in 1976, which plays an important role in multisource information fusion in the fields of target recognition, water environment monitoring, and medical diagnosis [26][27][28][29][30][31]. In this paper, D-S evidential theory was used to fuse and analyze the abnormal fluctuation of multiple water quality indicators. The recognition frame consists of two elements: Normal and Abnormal.
The basic probability distribution function is defined in terms of the degree to which the scale index $\Delta\alpha_i$ deviates from its threshold, where $\Delta\alpha_i$ is the degree of unevenness of the $i$-th indicator time series based on MF-DFA.
Multi-indicator anomaly probability fusion adopts the form of confidence accumulation: the basic abnormal probabilities of indicators $i$ and $j$ are combined with the normalized cross-correlation index $\Delta\alpha_{ij}$ of the two indicators. The abnormal probability of cross-correlation is computed in this way for each pair of indicators among the N water quality indicators, which gives the abnormal probability matrix.
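To make the evidence-fusion idea concrete, the sketch below applies the classical Dempster rule of combination to the two-element frame {Normal, Abnormal}. It is only an illustration of D-S fusion in general: the paper's own basic probability function and its confidence-accumulation formula are not reproduced here, and the function name `combine`, the dictionary keys, and the mass values are assumptions.

```python
def combine(m1, m2):
    """Dempster's rule for the frame {Normal, Abnormal}.

    Each argument maps 'N' (Normal), 'A' (Abnormal) and 'NA' (either, i.e. ignorance)
    to a mass; the three masses are expected to sum to 1.
    """
    # mass assigned to contradictory pairs (one source says Normal, the other Abnormal)
    conflict = m1["N"] * m2["A"] + m1["A"] * m2["N"]
    if conflict >= 1.0:
        raise ValueError("total conflict: the two pieces of evidence cannot be combined")
    norm = 1.0 - conflict
    return {
        "N": (m1["N"] * m2["N"] + m1["N"] * m2["NA"] + m1["NA"] * m2["N"]) / norm,
        "A": (m1["A"] * m2["A"] + m1["A"] * m2["NA"] + m1["NA"] * m2["A"]) / norm,
        "NA": m1["NA"] * m2["NA"] / norm,
    }

# Hypothetical masses for two indicators that both lean towards "Abnormal".
indicator_1 = {"N": 0.2, "A": 0.7, "NA": 0.1}
indicator_2 = {"N": 0.3, "A": 0.6, "NA": 0.1}
fused = combine(indicator_1, indicator_2)
print(fused)
```

With these made-up inputs, the fused belief in "Abnormal" is higher than either individual belief, which is the reinforcing behaviour that motivates fusing evidence from several indicators.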

Constructing Eigenvector and Classifying Based on Cosine Distance
The abnormal probability matrix is obtained from Equation (11), and a threshold is used to judge whether the fusion probability of two associated indicators is abnormal. If the judgment result is abnormal, the normalized cross-correlation index and the time series of the corresponding indicators are combined to form a feature vector, and the relevant data are retrieved from the contaminant information library. If it is judged to be normal, at least one of the two indicators does not fluctuate abnormally, and there is no need to extract relevant information from the information library. Once the eigenvector is constructed, online classification can be carried out based on cosine distance.
Cosine distance is a measure of similarity based on the cosine of the angle between two vectors [32,33]: $\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$, where the smaller the angle $\theta$ between vector $A$ and vector $B$ is, the closer the cosine distance is to 1, that is, the more similar the two vectors are. Compared with measurement based on Euclidean distance, the cosine distance excludes the influence of the vector's amplitude. When the same contaminant is injected into the water pipeline at different concentrations, the amplitude of the abnormal fluctuation of the indicators differs, so using the cosine distance for classification can weaken the influence of different concentrations.
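A small Python sketch shows why a cosine-based measure is insensitive to concentration: scaling a feature vector changes its length but not its direction. The library entries, the feature values, and the names `cosine_similarity` and `classify` are hypothetical and serve only to illustrate the comparison against a contaminant library.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 means identical direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(feature, library):
    """Return the library label whose reference vector is most similar to `feature`."""
    scores = {name: cosine_similarity(feature, ref) for name, ref in library.items()}
    return max(scores, key=scores.get), scores

# Made-up reference vectors for two contaminants.
library = {
    "contaminant_A": [0.9, 0.1, 0.0, 0.4],
    "contaminant_B": [0.1, 0.8, 0.5, 0.0],
}
# A low-amplitude response that points in nearly the same direction as contaminant_A.
sample = 0.2 * np.array([0.85, 0.15, 0.05, 0.35])
label, scores = classify(sample, library)
print(label, scores)
```

Multiplying `sample` by any positive factor leaves both scores unchanged, which mirrors the argument that the same contaminant at different concentrations should map to the same class.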

Experimental Apparatus
All experiments involved in this paper were conducted in the simulated water pipeline system. The structure of the system is shown in Figure 3, consisting of three parts: automatic chemicals feeding system, chemicals mixing system, and automatic monitoring system. In Figure 3a

Experimental Scheme Design
The experiment was divided into two phases. The first phase built a contaminant library of water quality indicator responses, and the second phase verified the performance of the algorithm using data obtained from new contaminant injection experiments. In this experiment, five of the most common contaminants from three classes were selected: an agricultural contaminant (Ammonium Citrate), chemical contaminants (Potassium Hydrogen Phthalate, Sodium Nitrite, Potassium Ferricyanide), and a heavy metal contaminant (Copper Sulfate). All five contaminants were used to construct the contaminant library, and Copper Sulfate was taken as an example to demonstrate the performance of the classification algorithm in the paper.

Build Contaminants Library
In the first phase, the dilution ratio of the water pipeline system was adjusted to 2%, and five concentration gradients were set for each contaminant, with an injection interval of 30 min. The concentrations of the sample solution were 400, 300, 200, 100, and 50 mg/L, and the concentrations of contaminants actually present in the main pipeline after dilution were 8, 6, 4, 2, and 1 mg/L. The contaminant concentrations and the number of sampling points are shown in Table 1. The sampling interval was set to 1 min. The time series of all water quality indicators were obtained from the sensors, and the characteristic information of the contaminants was added to the knowledge library after analysis using the proposed algorithm.
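As a quick check of these values, the concentration in the main pipeline is the sample concentration multiplied by the dilution ratio. The symbols $c_{\text{pipe}}$, $c_{\text{sample}}$, and $r$ are introduced here only for illustration.

```latex
c_{\text{pipe}} = r \, c_{\text{sample}}, \qquad r = 0.02
\\
400 \times 0.02 = 8, \quad
300 \times 0.02 = 6, \quad
200 \times 0.02 = 4, \quad
100 \times 0.02 = 2, \quad
50 \times 0.02 = 1 \quad (\text{mg/L})
```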

Obtain Contaminant Classification Data
In the second phase of the experiment, the five kinds of contaminants mentioned above were used in new injection experiments to demonstrate the performance of the algorithm. The concentrations of the sample solution were 500, 250, and 80 mg/L, and the concentrations of contaminants actually present in the main pipeline after dilution were 10, 5.0, and 1.6 mg/L. The contaminant concentrations and the number of sampling points are shown in Table 2. The data collected by the sensors after the anomaly was detected were used for online classification of the characteristic contaminants. There were 30 samples in each group of contaminants.

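Table 2. Contaminant concentrations and number of sampling points.

| Contaminants | Concentration of contaminants (sample solution) | Concentration of contaminants (main pipeline) | Number of sampling points |
| --- | --- | --- | --- |
| Five kinds of contaminants | 500 mg/L | 10 mg/L | 30 |
| | 250 mg/L | 5.0 mg/L | 30 |
| | 80 mg/L | 1.6 mg/L | 30 |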

The Result of Single Indicator Fluctuation Analysis
The MF-DFA algorithm was used to evaluate the fluctuation of the time series of the twelve indicators, and the Hurst graph and the multifractal singular spectrum were obtained. Taking the concentration of 10 mg/L as an example, the Hurst graph and the multifractal singular spectrum of Nitrate Nitrogen are shown in Figure 4, and the values of $\Delta\alpha$ extracted from the spectra of all indicators are shown in Table 3. Under normal circumstances, the background fluctuation intensity differed between water quality indicators, and the more severe the fluctuation, the higher the value of $\Delta\alpha$. For example, the background fluctuation of the time series of COD, TOC, and Nitrate Nitrogen was more obvious than that of Turbidity, Permanganate Indicator, Total Phosphorus, and Total Nitrogen. Taking Residual Chlorine as an example, the injection of Potassium Ferricyanide and Copper Sulfate caused obvious fluctuation in its time series, while other contaminants had little effect on it. Substituting the difference between $\Delta\alpha$ and the preset threshold into the basic probability distribution function better described the abnormal fluctuations of a single indicator, and the water quality indicators without abnormal fluctuations (corresponding to noise signals) were eliminated.

Cross-Correlation Analysis of Multiple Indicators and Comparison of Probability Fusion Results
MF-DCCA and D-S evidential theory were used to analyze the multifractal features of each pair of indicator time series and obtain the abnormal probability matrix. Figure 5a,b show the abnormal probability results of the water quality indicators before and after the introduction of multi-indicator correlation information, respectively. The numbers 1 to 12 represent the 12 water quality indicators, namely pH, Conductivity, Turbidity, Dissolved Oxygen, COD, Permanganate Indicator, TOC, Ammonia Nitrogen, Nitrate Nitrogen, Total Phosphorus, Total Nitrogen, and Residual Chlorine. The letters A to E represent the five characteristic contaminants, namely Ammonium Citrate, Potassium Hydrogen Phthalate, Potassium Ferricyanide, Copper Sulfate, and Sodium Nitrite. The abnormal fluctuation probability of the indicators is divided into 10 grades from 0 to 1, and the corresponding color blocks range from white to black: the more obvious the abnormal fluctuation, the darker the color. Taking Ammonium Citrate as an example, the response curves of each indicator at concentrations of 1 mg/L, 2 mg/L, 4 mg/L, 6 mg/L, and 8 mg/L are shown in Figure 6.
Obvious fluctuations appeared in the time series of Ammonia Nitrogen after the injection of Ammonium Citrate. In addition, pH was affected to some extent, while the fluctuation curves of the other indicators did not change significantly. As shown in Figure 5a, before MF-DCCA was introduced to analyze the cross-correlation information between indicators, the indicators with abnormal fluctuation probability exceeding 0.7 were, in descending order, Ammonia Nitrogen, pH, Conductivity, TOC, and Turbidity, which was inconsistent with the results in Figure 6. During the whole experiment, the baselines of Conductivity, TOC, and Turbidity fluctuated obviously due to noise interference. Using the time series of these three indicators for identification would introduce more interference and reduce the classification accuracy. As shown in Figure 5b, once MF-DCCA was introduced into the analysis, the only indicators with abnormal fluctuation probability exceeding 0.7 were, in descending order, pH and Ammonia Nitrogen. This is because the fluctuation correlation of the other indicators was low, and their abnormal probability after fusion did not exceed the alarm threshold. Therefore, cross-correlation analysis of multiple indicators based on MF-DCCA can reduce the interference caused by noise and weaken the influence of Conductivity, TOC, Turbidity, and other indicators whose baselines change or are disturbed by noise. This helps to improve the accuracy of contaminant classification at low concentrations.

Comparison of Contaminant Classification Results
In the experiment, five types of contaminants at 10 mg/L, 5 mg/L, and 1.6 mg/L were selected to verify the classification accuracy: Ammonium Citrate, Potassium Hydrogen Phthalate, Potassium Ferricyanide, Copper Sulfate, and Sodium Nitrite. There were 30 samples for each concentration group and 90 samples for each contaminant. The normalized cross-correlation scale index and the indicator time series with abnormal fluctuations formed the feature vector, and the algorithm based on cosine distance was used to complete the contaminant classification. The recognition results are shown in Figure 7, where A, B, C, D, and E represent Ammonium Citrate, Potassium Hydrogen Phthalate, Potassium Ferricyanide, Copper Sulfate, and Sodium Nitrite, respectively. It can be seen from Figure 7 that the proposed algorithm performed well in identifying contaminants. To further verify its performance, the recognition accuracy before and after the introduction of cross-correlation information analysis based on MF-DCCA is presented in Table 4. The classification accuracy of the five contaminants was significantly improved after the introduction of cross-correlation information between indicators. Copper Sulfate is taken as an example to further illustrate that the proposed method can improve the classification accuracy of contaminants at low concentrations. The classification results of Copper Sulfate are shown in Figures 8-10. As the concentration decreased from 10 mg/L to 5 mg/L and 1.6 mg/L, the classification accuracy became worse, which can also be seen from Table 4. However, comparing panels (a) and (b) of Figures 8, 9, and 10 shows that the classification accuracy was significantly improved after the fusion of the cross-correlation information, especially at low concentrations.
When the concentration of Copper Sulfate was 10 mg/L, as shown in Figure 8a,b, the classification accuracy was relatively high both before and after the introduction of cross-correlation information between indicators. At high concentrations, the signal-to-noise ratio of the time series was high, the abnormal fluctuation was severe, and the features used for recognition were obvious, so the cross-correlation information had little influence on the classification results.
When the concentration was 5 mg/L, the classification results without cross-correlation information are shown in Figure 9a. Misclassification occurred at time points 8, 9, and 10. This is because noise interference and the injection of contaminants caused different water quality fluctuations, and the indicator time series affected by noise interference had no strong correlation with the other indicator series. The classification result with cross-correlation information between indicators is shown in Figure 9b. At time points 9 and 10, misclassification was eliminated, because the abnormal probability of the water quality indicators affected by noise, such as Conductivity and TOC, was lower than the alarm threshold. When the concentration was 1.6 mg/L, the classification results are shown in Figure 10a,b. Here the classification accuracy was worse than in the high concentration group, because as the concentration decreased, the signal-to-noise ratio became too low and the indicator data with abnormal fluctuations could hardly be distinguished from noise and other interference signals. However, the classification accuracy of Copper Sulfate increased from 0.77 before the fusion of cross-correlation information to 0.90 after fusion, an improvement of 16.89%. This demonstrates that the cross-correlation information between the indicators can effectively describe the current abnormal fluctuations and better assist classification, especially when the contaminants are at low concentrations.
To further verify the effectiveness of the proposed algorithm, three other methods were used for comparison: a method that applies cosine distance classification directly to all the pre-processed indicator series without fluctuation analysis, an approach based on the triple standard deviation threshold, and a method that employs single indicator fluctuation analysis only. The recognition results of the four methods are shown in Table 5. They indicate that the proposed algorithm effectively filtered out signal noise and other interfering factors, better revealed the hydrodynamic characteristics of different contaminants hidden in complex data, and improved the accuracy of classification at low concentrations.
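For readers unfamiliar with the triple standard deviation baseline, the sketch below shows one common reading of it: a reading is flagged when it lies more than three standard deviations from the mean of a contaminant-free baseline. The text does not specify how this baseline was implemented, so the function name `three_sigma_flags`, the use of a fixed baseline window, and the synthetic pH-like values are all assumptions.

```python
import numpy as np

def three_sigma_flags(baseline, window):
    """Flag readings that deviate from the baseline mean by more than 3 standard deviations."""
    baseline = np.asarray(baseline, dtype=float)
    window = np.asarray(window, dtype=float)
    mu = baseline.mean()
    sigma = baseline.std(ddof=1)   # sample standard deviation of the baseline
    return np.abs(window - mu) > 3 * sigma

# Synthetic example: a stable baseline around 7.0 and one clearly abnormal reading.
rng = np.random.default_rng(2)
baseline = rng.normal(7.0, 0.05, 200)
readings = [7.00, 7.02, 7.60, 6.98]
flags = three_sigma_flags(baseline, readings)
print(flags.tolist())
```

A rule of this kind looks at each indicator in isolation, which is why it cannot separate a noisy baseline from a correlated, contaminant-driven change in several indicators.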

Conclusions
Because multiple conventional water quality indicators show linkage changes during sudden water pollution incidents, this paper proposes a high-precision online classification method for contaminants in water pipelines based on multi-indicator cross-correlation analysis combined with D-S evidential theory. It aims to improve the classification accuracy at low concentrations in the initial stage of a sudden pollution incident by further mining abnormal fluctuation information and the cross-correlation between water quality indicators. MF-DFA was used to pick out the indicator series with abnormal fluctuation, and MF-DCCA was used to measure the cross-correlation between these selected water quality indicator series. Then, the abnormal probability of each indicator was fused and the abnormal probability matrix was constructed to further judge the abnormal fluctuation of indicators using D-S evidential theory. Finally, the singularity index of the cross-correlation function and the time series of the selected indicators were combined into eigenvectors, and contaminants were classified by means of cosine distance.
A large amount of experimental data was obtained through injection experiments with five kinds of contaminants, and multiple sets of comparative experiments were carried out to verify the performance of the proposed algorithm, including a comparison of the classification results before and after the introduction of cross-correlation information and a comparison of our approach with three other methods. The results show that the proposed method can better filter out signal noise and other interfering factors, weakening the influence of abnormal fluctuations of water quality indicators that are not caused by contaminants. The method effectively reveals the hydrodynamic characteristics of different contaminants hidden in complex data and improves the accuracy of classification at low concentrations.