An Assemble Based on Clustering and Monte Carlo for the Wavelengths Selection of Excitation Emission Fluorescence Spectra

: Excitation-emission ﬂuorescence spectra is very e ﬀ ective to predict the concentration of organics in samples. However, redundant information and noises in the excitation-emission matrix (EEM) decrease the accuracy of the prediction concentration. Here we proposed a method to select more useful excitation and emission spectra from the EEM to increase the accuracy of prediction concentration and reduce the processing time. First, the excitation wavelengths were selected based on the clustering method to limit the redundant information in the EEM. Then the emission wavelengths were selected based on the Monte-Carlo method. To validate this method, we established the concentration prediction model with the spectra corresponding to the selected wavelengths by partial least square regression and predicted the multi component concentrations in the test samples. Our studies indicate that incorporation of this method increases the accuracy of the prediction concentration of organics and reduces the processing time.


Introduction
Spectrofluorimetry has been widely implemented to predict the concentration of organics in samples for its advantages of sensitivity, selectivity, non-invasiveness, and fastness [1,2]. To date, the emission fluorescence spectrum, which is excited by single excitation wavelength, is commonly used to predict the concentration of organics. Alternatively, an excitation-emission matrix (EEM) of the sample acquired by choosing individual excitation wavelengths is better to analyze the organics quantitatively, which consists of multiple emission spectra and includes more information of the samples. However, some of these emission fluorescence spectra are linearly correlated, by which lead to extra redundant information [3,4], and Rayleigh scattering of the light also contributes to the EEM, which is known as noises. All this subordinate information is infaust for predicting the concentration of the organics [5]. To increase the accuracy of concentration prediction, the excitation and emission wavelengths in the EEM should be selected.
To date most reported researches are focused on the selection of the emission wavelengths from the emission spectrum. The methods, for example, partial least squares regression (PLSR), moving window PLS, iterative predictor weighting-PLS, and uninformative variable elimination-PLS [6][7][8][9][10][11], establish multi concentration prediction models with every emission wavelength in turn and the wavelengths corresponding to the lowest prediction concentration error are finally reserved. Comparing to these methods, the genetic algorithm (GA) and Monte Carlo method have more wavelengths selection throughput [12][13][14][15], because emission wavelengths are selected by taking into account their contribution to the concentration prediction. The initial contribution to the concentration of each wavelength is presumably the same for GA, which is not consistent with the actuality; and the predetermined criteria is given for Monte Carlo to reserve the emission wavelengths which contribute most to the concentration. Considering the linear correlation between emission spectra excited by different wavelengths, all emission spectra are categorized into different classes based on their similarity, by aligning the similarity coefficients between different emission spectra in each class. Only the emission spectrum with the maximal similarity coefficient between it and the other emission spectra is reserved to eliminate the influence of the redundant information between different emission spectra.
Overall, we developed a method to increase the accuracy of the prediction concentration based on the EEM, by which the excitation wavelengths and the emission wavelengths are selected respectively. This approach involves three discrete steps: (a) The clustering method is performed to categorize the emission spectra into different classes, only the emission spectrum with the maximal similarity coefficient between it and the other emission spectra in each class is reserved; then (b) all the reserved emission spectra are unfolded to one dimension emission spectrum by sorting their corresponding excitation wavelengths in ascending order, and the Monte Carlo method combined with PLSR is applied to refine the emission wavelengths; and finally (c) the prediction concentration model is established by the spectra corresponding to the selected wavelengths.

Selection of the Excitation Wavelengths by the Clustering Method
The typical EEM is described in Table 1. E x1 , . . . , E xj , . . . , E xJ are the J excitation wavelengths, E m1 , . . . , E mi , . . . , E mI are the I emission wavelengths, and X ij is the fluorescence intensity corresponding to the i th emission wavelength excited by the j th wavelength. Let (X a1 , X a2 , . . . , X aI ) T and (X b1 , X b2 , . . . , X bI ) T be the emission intensity spectra relating to the a th and the b th excitation wavelengths, respectively. The similarity coefficient C ab between these two emission spectra is calculated by Equation (1), as follows: where X a , X b are the mean fluorescence intensity of the a th and the b th emission spectra, respectively. Initially there are J excitation wavelength classes (Figure 1), corresponding to J excitation wavelengths, then the similarity coefficients between any two emission spectra corresponding to their excitation wavelengths are calculated, after the pair of emission spectra with the largest similarity coefficient are designated as one class, J-1 excitation wavelength classes are left. The number of classes is reduced by one after each calculation round of the similarity coefficient. This process is performed until the difference of similarity coefficients between any two pair of classes are under the pre-defined tolerant criterion. When two or more excitation wavelengths are assigned to the same class, the calculation of the similarity coefficient between different classes is different from that between each single emission spectrum. Let M and N be two different classes with m and n emission spectra in each class, the similarity coefficient calculation steps are: First, the similarity coefficients between one emission spectrum in class M and the n emission spectra in class N are calculated; then another emission spectrum in class M is assigned to repeat the above step. Once all × m n similarity coefficients are calculated, the two excitation wavelengths corresponding to the maximum similarity coefficient between the class M and N are accepted to represent each class.
The EEM of pure organic compound sample is used to select its excitation wavelengths. The spectra corresponding to these wavelengths will be served to select the emission wavelengths.

Selection of the Emission Wavelengths by the Monte Carlo Method
The P training samples with known concentrations of each component are used to establish the prediction concentration model. Presumably k excitation wavelengths are retained for each component by the method given in Section 2.1, the k emission spectra corresponding to specific excitation wavelengths are unfolded into a one-dimensional emission spectrum in the ascending order of the excitation wavelengths, that is, ik Ik X X ,..., X ,..., X , X ,..., X ,..., X ,..., X ,..., X ,..., X = Each sample has I emission wavelengths, leading to a total of I × k emission wavelengths. There are P unfolding emission spectra for P training samples.
The Monte Carlo method combined with PLSR is applied to select the emission wavelengths from P unfolding emission spectra as Formula (2). For each iteration, l emission wavelengths are randomly picked out from each X, a l × P sub-emission matrix Xl is formed with P unfolding spectra. The prediction concentration model is constructed with Xl based on PLSR [10,11]. The root mean square error of correction (RMSEC) is calculated by Equation (3): where yc is the known concentration of component c, c ỹ is the predicted concentration of c, and C is the number of components. The reliability of the prediction for each iteration is calculated by dj = 1/RMSEC.
After 20,000 iterations of above step, if the i th emission wavelength is selected Q times, the reliability score Si of the i th emission wavelength is summed by When two or more excitation wavelengths are assigned to the same class, the calculation of the similarity coefficient between different classes is different from that between each single emission spectrum. Let M and N be two different classes with m and n emission spectra in each class, the similarity coefficient calculation steps are: First, the similarity coefficients between one emission spectrum in class M and the n emission spectra in class N are calculated; then another emission spectrum in class M is assigned to repeat the above step. Once all m × n similarity coefficients are calculated, the two excitation wavelengths corresponding to the maximum similarity coefficient between the class M and N are accepted to represent each class.
The EEM of pure organic compound sample is used to select its excitation wavelengths. The spectra corresponding to these wavelengths will be served to select the emission wavelengths.

Selection of the Emission Wavelengths by the Monte Carlo Method
The P training samples with known concentrations of each component are used to establish the prediction concentration model. Presumably k excitation wavelengths are retained for each component by the method given in Section 2.1, the k emission spectra corresponding to specific excitation wavelengths are unfolded into a one-dimensional emission spectrum in the ascending order of the excitation wavelengths, that is, Each sample has I emission wavelengths, leading to a total of I × k emission wavelengths. There are P unfolding emission spectra for P training samples.
The Monte Carlo method combined with PLSR is applied to select the emission wavelengths from P unfolding emission spectra as Formula (2). For each iteration, l emission wavelengths are randomly picked out from each X, a l × P sub-emission matrix X l is formed with P unfolding spectra. The prediction concentration model is constructed with X l based on PLSR [10,11]. The root mean square error of correction (RMSEC) is calculated by Equation (3): where y c is the known concentration of component c, y c is the predicted concentration of c, and C is the number of components. The reliability of the prediction for each iteration is calculated by d j = 1/RMSEC. After 20,000 iterations of above step, if the i th emission wavelength is selected Q times, the reliability score S i of the i th emission wavelength is summed by The reliability scores of the I × k emission wavelengths are calculated and the top 10% emission wavelengths are reserved. These selected emission wavelengths will be used to predict concentrations in the test samples.

Apparatus and Sample Preparation
Three organic compounds used for this study were naphthalene, 1-naphthol, and 2-naphthol. The samples of the three pure compounds were mixed into another 11 samples of different concentrations. We used the training set (nine samples) to establish the prediction concentration model, and the test set (two samples) to validate the accuracy of the concentration prediction. Table 2 listed the components concentrations of all samples. Besides, eleven blank samples were also prepared without the above three organics compounds.
The EEMs of the three pure compounds, the nine training samples, the two test samples, and eleven blank samples were collected over excitation wavelengths between 220 and 300 nm with a 2 nm interval and over emission wavelengths between 325 and 600 nm with a 5 nm interval by a Hitachi F-7000 spectrofluorometer at room temperature of 25 • C. The EEM of each sample was subtracted by the mean EEM of the eleven blank samples to eliminate the influence of the background noises. As an example, Figure 2 shows the original EEM of naphthalene, the mean EEM of the blank samples, and the blank-subtracted naphthalene. The reliability scores of the I × k emission wavelengths are calculated and the top 10% emission wavelengths are reserved. These selected emission wavelengths will be used to predict concentrations in the test samples.

Apparatus and Sample Preparation
Three organic compounds used for this study were naphthalene, 1-naphthol, and 2-naphthol. The samples of the three pure compounds were mixed into another 11 samples of different concentrations. We used the training set (nine samples) to establish the prediction concentration model, and the test set (two samples) to validate the accuracy of the concentration prediction. Table  2 listed the components concentrations of all samples. Besides, eleven blank samples were also prepared without the above three organics compounds.
The EEMs of the three pure compounds, the nine training samples, the two test samples, and eleven blank samples were collected over excitation wavelengths between 220 and 300 nm with a 2 nm interval and over emission wavelengths between 325 and 600 nm with a 5 nm interval by a Hitachi F-7000 spectrofluorometer at room temperature of 25 °C. The EEM of each sample was subtracted by the mean EEM of the eleven blank samples to eliminate the influence of the background noises. As an example, Figure 2 shows the original EEM of naphthalene, the mean EEM of the blank samples, and the blank-subtracted naphthalene.

Results of the Selected Wavelengths
The clustering method was used to select the excitation wavelengths for each pure compound with the pure compound EEM. The tolerant criterion was set to 1.3 × 10 −3 . Table 3 gives the 10 selected excitation wavelengths of each component from 41 excitation wavelengths. The selected excitation wavelengths of each component are different, which can presume it relates to their specific molecular structures. We used the emission spectra corresponding to the 10 selected excitation wavelengths of the nine training samples to find the emission wavelengths based on the method given in Section 2.2. The dots in Figure 3a,c,e are the selected emission wavelengths, which are 63 for naphthalene, 64 for 1-naphthol, and 71 for 2-naphthol. These emission wavelengths fell within the emission spectra peak band of each compound excited by specific wavelength, as plotted in Figure 4. For comparison, we used the same emission spectra corresponding to the selected excitation wavelengths to select the emission wavelengths by the GA method, as shown in Figure 3b,d,f. As it can be seen in Figure 3b,d,f, these wavelengths extended across the entire band of the emission wavelengths. The Monte Carlo method in this study is more effective than the GA method presumably that the emission wavelengths within the peak band contribute most to the concentration prediction.
Appl. Sci. 2020, 10, x FOR PEER REVIEW 5 of 8 The clustering method was used to select the excitation wavelengths for each pure compound with the pure compound EEM. The tolerant criterion was set to 1.3 × 10 −3 . Table 3 gives the 10 selected excitation wavelengths of each component from 41 excitation wavelengths. The selected excitation wavelengths of each component are different, which can presume it relates to their specific molecular structures. We used the emission spectra corresponding to the 10 selected excitation wavelengths of the nine training samples to find the emission wavelengths based on the method given in Section 2.2. The dots in Figure 3a,3c,3e are the selected emission wavelengths, which are 63 for naphthalene, 64 for 1-naphthol, and 71 for 2-naphthol. These emission wavelengths fell within the emission spectra peak band of each compound excited by specific wavelength, as plotted in Figure 4. For comparison, we used the same emission spectra corresponding to the selected excitation wavelengths to select the emission wavelengths by the GA method, as shown in Figure 3b,3d,3f. As it can be seen in Figure  3b,3d,3f, these wavelengths extended across the entire band of the emission wavelengths. The Monte Carlo method in this study is more effective than the GA method presumably that the emission wavelengths within the peak band contribute most to the concentration prediction.

Results of Concentration Prediction
The predicted concentration models were established with all spectra of the training samples and the spectra corresponding to the selected wavelengths based on our proposed method of the training samples by the PLSR to predict the concentrations of the test samples. Table 4 gives the root mean squared error of prediction concentration (RMSEP) with these two different spectra to assess the accuracy of the predicted concentrations. As can be seen, the RMSEP for Naphthalene, 1-naphthol, and 2-naphthol with the selected wavelengths are 1.35%, 1.05%, and 1.62%, respectively. However, the RMSEP for Naphthalene, 1-naphthol, and 2-naphthol with the full wavelengths are 2.89%, 1.77%, and 1.78%, respectively. The RMSEP for three compounds with the selected wavelengths are all lower than those with all wavelengths. The running time for the concentration prediction with all wavelengths and the selected wavelengths is also given in Table 4. The running time for Naphthalene, 1-naphthol, and 2-naphthol with the selected wavelengths is 127.67, 130.98, and 127.26 s, respectively. However, the running time for Naphthalene, 1-naphthol, and 2-naphthol with the full wavelengths is 356.37, 405.86, and 382.53 s, respectively. It is also obvious that the running time with the selected wavelengths is shorter than that with all wavelengths.

Results of Concentration Prediction
The predicted concentration models were established with all spectra of the training samples and the spectra corresponding to the selected wavelengths based on our proposed method of the training samples by the PLSR to predict the concentrations of the test samples. Table 4 gives the root mean squared error of prediction concentration (RMSEP) with these two different spectra to assess the accuracy of the predicted concentrations. As can be seen, the RMSEP for Naphthalene, 1-naphthol, and 2-naphthol with the selected wavelengths are 1.35%, 1.05%, and 1.62%, respectively. However, the RMSEP for Naphthalene, 1-naphthol, and 2-naphthol with the full wavelengths are 2.89%, 1.77%, and 1.78%, respectively. The RMSEP for three compounds with the selected wavelengths are all lower than those with all wavelengths. The running time for the concentration prediction with all wavelengths and the selected wavelengths is also given in Table 4. The running time for Naphthalene, 1-naphthol, and 2-naphthol with the selected wavelengths is 127.67, 130.98, and 127.26 s, respectively. However, the running time for Naphthalene, 1-naphthol, and 2-naphthol with the full wavelengths is 356.37, 405.86, and 382.53 s, respectively. It is also obvious that the running time with the selected wavelengths is shorter than that with all wavelengths.

Conclusions
This study aims to increase the accuracy of the prediction concentration by selecting the excitation wavelengths and emission wavelengths from the EEM. To eliminate the redundant information, the excitation wavelengths were selected based on the clustering method; to refine the most useful emission wavelengths, the emission wavelengths corresponding to the selected excitation wavelengths were determined based on the Monte Carlo method. Comparing to the predicted concentration with all spectra, the accuracy of the prediction concentrations with the selected spectra is increased by selecting the irrelevant excitation spectra and reserving the emission wavelengths contributing most to predict component concentration based on the proposed method. The RMSEP for Naphthalene, 1-naphthol, and 2-naphthol with the selected wavelengths are 1.54%, 0.72%, and 0.16% lower than those with all wavelengths, respectively. Furthermore, the running time of the concentration prediction is also reduced because only the selected spectra were used instead of the full spectra. The running time for Naphthalene, 1-naphthol, and 2-naphthol with the selected wavelengths is 228.7, 274.88, and 255.27 s shorter than that with all wavelengths, respectively.