Student’s-t Mixture Regression-Based Robust Soft Sensor Development for Multimode Industrial Processes

Because of multiple manufacturing phases or operating conditions, a great many industrial processes work with multiple modes. In addition, it is inevitable that some measurements of industrial variables obtained through hardware sensors are incorrectly observed, recorded or imported into databases, resulting in the dataset available for statistic analysis being contaminated by outliers. Unfortunately, these outliers are difficult to recognize and remove completely. These process characteristics and dataset imperfections impose challenges on developing high-accuracy soft sensors. To resolve this problem, the Student’s-t mixture regression (SMR) is proposed to develop a robust soft sensor for multimode industrial processes. In the SMR, for each mixing component, the Student’s-t distribution is used instead of the Gaussian distribution to model secondary variables, and the functional relationship between secondary and primary variables is explicitly considered. Based on the model structure of the SMR, a computationally efficient parameter-learning algorithm is also developed for SMR. Results conducted on two cases including a numerical example and a real-life industrial process demonstrate the effectiveness and feasibility of the proposed approach.


Introduction
In industrial processes, there is a class of quality-related variables that is very important but difficult to measure, such as melt index in the polypropylene process, catalyst activation in chemical reactions, thickness of strip in the hot rolling process, octane number of gasoline, etc. Measurements of these quality variables are conventionally obtained by expensive online analyzers or time-consuming laboratory analysis, which introduces huge investment cost or large time delay [1]. Soft sensors, which are essentially mathematical models, are capable of predicting these key variables (referred to as "primary variables") online using easy-to-measure process variables (referred to as "secondary variables") such as flow rate, temperature, pressure, etc. Therefore, soft sensors are economical and real-time alternatives to conventional measurement of quality variables, and play an important role in process monitoring, closed-loop control, process optimization and so forth [2][3][4][5][6]. Owing to their advantages, in recent years, soft sensors have been intensively researched and extensively applied to industrial processes [7][8][9][10][11].
The methods for soft sensor modeling can generally be categorized into two groups, which are first-principle methods [12] and data-driven methods [13]. As modern industrial processes grow increasingly complex, it is difficult to obtain first-principle models. By contrast, data-driven models can be easily obtained because a large amount of process data that reflects the true operating conditions

Student's-t Distribution
The PDF of a d-dimensional Student's-t distribution, with mean µ, precision matrix Λ and degree of freedom ν, is denoted as where Γ(t) = ∞ 0 z t−1 e −z dz is the Gamma function, and ∆ 2 = (x − µ) T Λ(x − µ) is the squared Mahalanobis distance from x to µ.
The Student's-t distribution can be viewed as an infinite mixture of scaled Gaussian distributions, i.e., (2) where N (·) represents the Gaussian distribution, η stands for the intermediate latent variable which is helpful for deriving the analytical solution, and Gam(·) denotes the Gamma distribution. Figure 1 illustrates the Student's-t distribution with fixed mean vector and covariance matrix but various degrees of freedom. It can be seen that the Student's-t distribution degrades the Gaussian distribution in the limit ν → +∞. Moreover, the tail of the Student's-t distribution tends to be heavier when the degree of freedom ν → 0. Therefore, the Student's-t distribution possesses the potentiality to mitigate the adverse effect of outliers in contrast to the Gaussian distribution.

Student's-t Mixture Model
Assume the secondary variable x follows the mixture distributions with K components as where the mixing coefficients π = {π 1 , π 2 , · · · , π K } satisfy ∑ K k=1 π k = 1 together with 0 π k 1. In addition, let us introduce a K-dimensional assignment latent variable z = (z 1 , · · · , z K ) associated with x, in which z k for k = 1, 2, · · · , K are binary variables, i.e., z k ∈ {0, 1}. In addition, only one of the z k for k = 1, 2, · · · , K can be assigned with value 1, and the rest ones are all 0. Therefore, we have the constraint ∑ k=1 z k = 1. If certain z k = 1, it means that the k-th component is responsible for generating the corresponding observed sample.
The prior distribution over z is specified in accordance with the mixing coefficients π k as p(z k = 1) = π k (4) Using the 1-of-K coding scheme, the prior distribution over z can also be written in the form Similarly, the conditional distribution of x given z is a Student's-t distribution which can also be written as

Methodology
In practical applications, data collected from industrial processes are very likely to be contaminated by outliers, and it is usually non-trivial to completely remove all outliers. It has been demonstrated that the performance of GMM might be rather disappointing with the presence of outliers because the tails of the Gaussian distribution in many applications are shorter than required [22,30]. To this end, we propose the Student's-t distribution mixture regression (SMR) which is detailed in this subsection.

Student's-t Mixture Regression
Let us denote X = {x 1 , · · · , x N } T ∈ R N×d and Y = {y 1 , · · · , y N } T ∈ R N×1 as the input and output space of samples data, and the input variable x is assumed to be generated from Student's-t distribution mixture models with K components as Equation (3).
The SMR is illustrated in Figure 2 in the form of a probabilistic graphical model. For the convenience of mathematical derivation, let us define where η n means the intermediate latent variable associated with the n-th sample of secondary variables (i.e., x n ). Consequently, we have The probability distribution over x n conditioned on two latent variables z n = (z n1 , · · · , z nK ) and η n can be obtained as p(x n |η n , z nk = 1) = N x n µ k , (η nk Λ k ) −1 (10) which can also be written as For each component, linear dependence of y n on x n is introduced. Taking the single-output case for example, for k = 1, · · · , K, we have where ϕ k represents the regression coefficient vector, ε k means zero-mean Gaussian-distributed noise variable with covariance λ −1 k , and x n = [x T n , 1] T . According to Equation (12), for the k-th component, the conditional PDF of y n given x n can be obtained as p(y n |x n , z nk According to Equation (13), we have p(y n |x n , z n ) =

Parameters Learning for the SMR
The parameters for the SMR that need to be learnt are denoted as Θ = {π k , µ k , Λ k , ν k , ϕ k , λ k } K k=1 . The EM algorithm, consisting of the expectation step (E-step) and maximization step (M-step), is an ideal approach to addressing the issues of missing values [31] (corresponding to the latent variables appeared in the SMR). Therefore, we adopt the EM to perform the parameter-learning task for the SMR.
In the E step, the posterior distribution over latent variables z 1 , · · · , z N , which are collectively denoted as Z = (z n ) N n=1 , associated with the training dataset (X, Y) = (x n , y n ) N n=1 can be calculated as p(z nk = 1|x n , y n ) = p(z nk = 1)p(y n |x n , z nk = 1)p(x n |z nk = 1) ∑ K k=1 p(z nk = 1)p(y n |x n , z nk = 1)p(x n |z nk = 1) Therefore, the expectation of z nk based on the posterior distribution can be calculated as Given the latent variable z n and observed variable x n , the posterior distribution over η n can be calculated as p(η n |x n , z nk = 1) ∝ p(x n |η n , z nk = 1)p(η n |z nk = 1) Comparing the definition of the Gamma distribution, we have p(η nk |x n ) = p(η n |x n , z nk = 1) Thus, we can obtain the expectations where ψ(·) is the digamma function defined as ψ(x) = d Γ(x) /dx. Subsequently, in the M step, with the assumption that the samples are independent and identically distributed, the expectation of complete data log-likelihood function is first formulated as L(Θ) = ln p(X, Y, Z, η) = ln p(Y|X, Z) + ln p(X|Z, η) + ln p(η|Z) + ln p(Z) where ln p(Y|X, Z) = and η = (η n ) N n=1 . Setting the derivatives of Equation (21) with respect to µ k to zero leads to Similarly, we have Setting the derivatives of Equation (21) with respect to ϕ k to zero leads to where R k = diag( z 1k , z 1k , · · · , z nk ), X = [X, 1], 1 is the column with all element 1.
The parameter ν k can be obtained by solving the non-linear equation as follows.
Please note that it has been proved that the left-hand side of Equation (31) strictly decreases from +∞ to a minus value as ν k increases in (0,+∞) [32]. Therefore, solving Equation (31) for ν k is not difficult by the means of many one-dimensional search methods, such as the dichotomy method.
Using the constraint ∑ K k=1 π k = 1 and introducing the Lagrange multiplier γ, we can obtain where L(Θ) = L(Θ) + γ(∑ K k=1 π k − 1). In the light of the updated equations such as the derivation above, the robustness of SMR compared with GMR can be clearly seen with the use of degrees of freedom ν. As the degrees of freedom parameter ν is introduced, the outliers with large Mahalanobis distance have small value of the expectation of η nk as can be drawn from Equation (19), resulting in the outliers being down-weighted and the influence of outliers on parameters estimation being significantly reduced. Taking the precision matrix of each component, for example, based on GMR the updated equation will be converted into which means that the data's outliers will highly influence the estimates. However, taking this example to the extreme, the outliers which are extremely different to the majority of dataset are down-weighted to zero because in the SMR the η nk associated with the outliers will be zero, resulting in the influence of outliers on precision matrix estimates being removed.
As the model above-mentioned parameters are updated by iterative learning, the iterative process terminates when L(Θ) converges, and the convergence criterion can be defined as where L(Θ t ) denotes the value of L(Θ) at the tth iteration and ε represents the threshold value, which is specified by the user.
Up to now, we can summarize the detailed procedure for training the SMR in Algorithm 1.

Algorithm 1
Pseudocode for training SMR.

Soft Sensor Development Based on SMR
Based on the SMR, a soft sensor model can be easily developed for predicting the quality variable y q when a sample x q of process variables is available.
To begin with, the posterior distribution of the associated latent variable z q = (z q1 , · · · , z qK ) is calculated as Subsequently, the probability distribution y q conditioned on x q can be obtained as Finally, the prediction of y q can be obtained as

Case Studies
In this section, the proposed method is first evaluated using a numerical example and then applied to develop soft sensors for an industrial primary reformer in an ammonia synthesis plant [33]. For comparison purposes, the performance of multiple dynamic PLS (Multi-DPLS) [34,35] and GMR are also provided as benchmarks. Please note that the Multi-DPLS is realized by first referring to the work in [34], where the GMM is used for data clustering, followed by constructing a sub-PLS model for each data cluster. Then, we extend the PLS model to the DPLS model by augmenting the input vector according to [35].
The root mean squares error (RMSE) is used to evaluate the prediction accuracies of various methods, which is defined as where y n and y n are the true value and predicted value of quality variable, respectively, and N t is the size of the testing dataset.
To deal with the influence of randomness of initial parameters, a total of 100 simulations are carried out for both the GMR and SMR, and their final parameters are selected as those that can minimize the RMSE on the validating dataset, while the generalization performance of various methods are evaluated on the testing dataset. The configurations of the used computer are given as follows: CPU: Core i5-4570 (3.2 GHz × 2), RAM: 8 GB, OS: Windows 10, and Software: MATLAB (R2016b). The CPU time (CPT) spent in offline model training (CPT trn , in seconds) and in online predicting (CPT tst , in seconds) are employed to assess the computational efficiency for different methods. In both case studies, the threshold values for diagnosing the convergence for the SMR and GMR are set as 10 −6 .

Numerical Example
We assume a 2-dimensional input variables x = (x 1 , x 2 ) T and a scalar output y are generated from a mixture of three Student's-t distributions based on Equations (3) and (12), in which the configurations of each component are listed in Table 1. Please note that as the non-diagonal elements for the precision matrices Λ k are not zero, the correlations among the input variables are taken into consideration, which can be captured by the proposed model using Equation (3). In addition, in our model setting, the vector x = (x 1 , x 2 ) T is assumed to obey a mixture of multivariate Student's-t distributions, and we do not need to build one SMM for each of variable. Figure 3 illustrates the data distributions from the input space, which clearly shows the multimode characteristics.   In the simulation, three datasets, namely the training dataset, validating dataset, and testing dataset, each of which consists of 2000 samples, were generated. The training dataset is used for parameter learning, while the validating dataset is used for determining the initialized values of model parameters for the Multi-DPLS, GMR, and SMR models. In this example, the number of mixing components for Multi-DPLS, GMR, and SMR models were set as 3 in advance; in addition, the dimensionality of the latent space for each sub-PLS model in the Multi-DPLS was set as 2.
The performance of various methods are evaluated on the testing dataset, which is unseen at the training stage. Moreover, 1%, 3% and 5% outliers are randomly added into the input data samples, respectively.
According to the proportion of the sample number of each mode, the outliers are generated by transforming a certain coordinate of some sample data randomly selected to the value far away from its center. For example, 3% rate outliers are added to the training dataset containing 2000 samples, which is to say there are 12, 18, and 30 outliers added to each mode, respectively [36].
By using trial and error, the order of the Multi-DPLS is determined as 4, i.e., the values of input variables in the past four moments are also used to estimate the value of the current output. Recall that in this case, data samples were generated independently with each other without dynamics. The reason the Multi-DPLS with the order of 4 achieves the best performance can be explained as follows. The augmented input vector is helpful at improving the classification accuracy for the GMM, because the samples at some augmented sampling instances may be located at non-overlapped areas among the three modes; meanwhile, the PLS can deal with the data-collinearity. That is why the performance of the Multi-DPLS gets enhanced when the order increases. However, as the order further increases, the dimensionality of the input vector significantly increases, too, which leads to inaccurate estimations of the probability density functions. That is why performance of the Multi-DPLS deteriorates when the order is greater than 4.
Predictions for y by the models based on the Multi-DPLS, GMR, and SMR with the outlier rate set as 3% are visualized in Figure 4, from which for the Multi-DPLS large deviations existing in the first mode and third mode can be clearly found. This is because the information of output space in the mode identification step is ignored, and then the performance of clustering the high-dimensional data is rather unsatisfactory, leading to a PLS model built into each mode that cannot explain the true functional dependency between the output and input variables well. In contrast, the GMR and SMR-based models, which treat the input space and output space together, are more powerful at modeling the multimode process. However, intuitively, we can recognize that the SMR performs better compared with the GMR in terms of predicting samples from the first mode.  For more in-depth analyses, predictive accuracies of three methods on the validating dataset and testing dataset are quantified in Table 2. As can be seen, the performance of the Multi-DPLS model is rather disappointing, while the predictive accuracies of the GMR and SMR models are much higher. In addition, one can see that as the number of outliers increases, the performances of both the GMR and SMR-based models deteriorate. However, the deteriorations for the SMR-based model are much slighter compared with those for the GMR-based model. To be specific, as the outlier rate rises from 1% to 3% and 5%, the generalization RMSE for the GMR-based model is increased by 41.8% and 63.8%, respectively; in contrast, the increment of generalization RMSE for the SMR-based model is only 5.1% and 7.7%, respectively, which demonstrate that the SMR-based model is much more robust against outliers compared with the GMR-based model. For probabilistic methods such as the GMR and SMR, correctly estimating the PDFs of process variables is a prerequisite to high predictive accuracy. In this synthetic case, the estimations of PDFs of x 1 and x 2 with different amounts of outliers are illustrated in Figures 5-7. One can readily recognize that due to the long tails of data distributions, the PDFs of x 1 and x 2 estimated by GMR have been significantly skewed compared with the data histograms and true PDFs. In addition, such distortion becomes more severe as the number of outliers increase. In particular, the GMR basically fails to capture the middle peak from the x 2 direction. By contrast, the PDFs estimated by the SMR fit the data histograms well, and are barely affected by the increase of outliers, which is the reason that the SMR-based model can provide satisfactory performance with various numbers of outliers. For the numerical example the time consumed by these three methods are listed in Table 3. It is easily seen from Table 3 that the differences between CPT tst for these three methods can be negligible. The CPT trn for the Multi-DPLS and GMR are comparable. Please note that in the SMR the parameter ν k is estimated by solving a non-linear equation with the help of the dichotomy method, which results in more time for iterative learning. However, the computational efficiency for a soft sensor based on SMR is still acceptable.

Primary Reformer
The primary reformer is an important part of hydrogen-manufacturing units in the ammonia synthesis process for producing NH 3 , which is the main material in the urea synthesis process. The flowchart of the primary reformer is illustrated in Figure 8.
Main transformation reactions set off in the primary reformer are According to the reaction mechanism, the temperature in the furnace plays a significant role in the purity of hydrogen; thus, the temperature should be strictly monitored and controlled, which is realized by manipulating the burning conditions at the dense burner. One of the effective approaches to stabilizing the burning condition is to control the oxygen concentration in the furnace at the specified interval. However, the measurement of oxygen concentration (i.e., the quality-related variable for the primary reformer) in practice is expensive, due to an exorbitant mass spectrometer, or time-consuming, due to offline laboratory analysis, both of which fail to satisfy the requirement of real-time control and production. To cope with this issue, a soft sensor based on a historical dataset is desirable for online estimation of the oxygen concentration, which is illustrated with a dark green block in Figure 8. Based on expert knowledge of process mechanisms and experiences from engineers, 13 process variables, including pressures and temperatures, are selected as secondary variables for soft sensor modeling, which are illustrated with light-green blocks in Figure 8. Detailed descriptions of these secondary variables are presented in Table 4. A total of 7000 samples recorded from January 2015 to July 2015 were collected from the database of distributed control systems of a real-world primary. The collected samples are evenly partitioned into three parts, i.e., 2000 samples serve as the training dataset, 2000 samples are used as the validating dataset for model selection, and the remaining 3000 samples constitute the testing dataset for evaluating the generalization performance of various soft sensors. By taking the testing samples, for example, it is obvious that the process basically involves five large operating conditions, as shown by the dash-dot blue line in Figure 9, which indicates that the primary reformer is characterized by multiple modes. Oxygen concentration(mol%) Figure 9. Visualization of multimode characteristics of the primary reformer.
As with the numerical example, the order of the Multi-DPLS is determined as 3, and Figure 10 shows that the number of components and the dimensionality of latent space are determined as 12 and 8, respectively, in which the Multi-DPLS has the minimum RMSE on the validating dataset. In addition, the initial values of model parameters for the GMR and SMR-based soft sensors, as well as the model selections for them are also completed on the validating dataset. In particular, the best performances for the GMR and SMR-based soft sensors with various numbers of mixing components (i.e., K) are visualized in Figure 11a, which indicates that both the predictive RMSEs of GMR and SMR-based soft sensors reach the minimum at K = 18. However, for the SMR-based soft sensor, we see that as K ≥ 15, the validating RMSE almost stabilize at 0.88. Considering the fact that the larger the K, the higher the model complexity, we determine the optimal K for the SMR-based soft sensor as 15. Based on the same consideration, the optimal K for the GMR-based soft sensor is selected as 18. Meanwhile, for the GMR and SMR-based soft sensors, the generalization performances on the testing dataset are compared in Figure 11b.  From Figure 11a,b we can recognize that: (1) the selected optimal values of K and initialized model parameters upon the validating dataset can basically embody the true generalization performance upon the testing dataset for the GMR and SMR-based soft sensors; (2) upon both the validating and testing dataset, although with small values of K (≤ 7), the performances of the two soft sensors are comparable, and the SMR-based soft sensor starts to show apparent predictive advantage over the GMR-based one as K ≥ 8; (3) the number of components is much larger than the number of operating conditions, because each operating condition may consist of several modes. The underlying reason for this phenomenon is that for complex processes, one Student's-t distribution still does not model one operating condition well, and more Student's-t distributions are required for one operating condition; and (4) the number of components is mainly determined through the division of the spatial pattern of input and output variables rather than the number of input variables, so there is no relationship between the number of components and the number of input variables in the mixture models.
Predictions of the O 2 concentration by soft sensors based on the Multi-DPLS, GMR, and SMR are visualized in Figure 12, where their generalization abilities are also presented in terms of RMSE. As can be seen, the Multi-DPLS model has worst performance. Except for the ignorance of output information in the mode identification, the other reason is that the augmented input vector has high dimensionality (52 dimensions in the primary reformer), resulting in an exponentially increasing number of samples being required to acquire the correct estimations of probability distribution of each mode. In contrast, both the GMR and SMR-based soft sensors, which employ mixture component models, can significantly improve the prediction performance. Scatter plot comparisons among the Multi-DPLS, GMR and SMR presented in Figure 13 could provide more insights. It can be clearly seen that the predictions obtained by the Multi-DPLS are more scattered. However, predictions of soft sensors based on GMR and SMR lean much closer to the black diagonal line, indicating higher predictive accuracy. Moreover, since the SMR takes the robustness against outliers into consideration, the predictions obtained by SMR have tighter scatters around the black diagonal line, which demonstrates the advantages of SMR compared with GMR. The predictive RMSE on the testing dataset also demonstrate that the SMR-based soft sensor has stronger generalization ability than the GMR-based one. For further quantitative analyses, the determination coefficients for the Multi-DPLS, GMR, and SMR are also calculated as 0.7729, 0.8655, and 0.9233, respectively, from which the same conclusion can be drawn. The consumed time by these three methods in the primary reformer process are tabulated in Table 5, from which one can readily find that the Multi-DPLS requires more time to train the model because the dimensionality of the augmented input vector is very high. Although the SMR-based soft sensor is also a time-consuming method due to the dichotomy method, the prediction accuracy is much higher than Multi-DPLS. As for the computational burden based on SMR, we can note that: (1) in the numerical example the input variables are two-dimensional, where the CPT trn is much less than the primary reformer process of which the input variables are 13-dimensional. This is because as more variables are considered, the larger the size of the precision matrices (whose inversions are involved); (2) the computational burden depends on the number of mixing components, and the more mixing components, the more parameters needing to be learnt, which results in more time for model training; and (3) if the input variables are correlated, the non-diagonal elements of covariance are not equal to zero, leading to more time consumed in inverting the covariance matrix.

Conclusions
In this paper, with the aim of dealing with outliers when developing soft sensors for multimode industrial processes, we have proposed a robust modeling approach referred to as the Student's-t mixture model (SMR). Our novel contribution is twofold. First, a regressive model structure with finite mixture of Student's-t distributions has been designed, and the corresponding parameter-learning algorithm based on the EM algorithm has also been developed. Second, case studies have been conducted on both numerical and real-word industrial datasets to evaluate the performance of SMR. The results have demonstrated that SMR can handle multimode characteristics well and is more robust against outliers compared to some state-of-the-art methods.
In our future work, two challenging issues are taken into consideration: (1) how to complete the model-selection and parameter-learning tasks without traversing all candidate numbers of mixing components, and without the validating dataset; and (2) how to deal with the performance degradation of the soft sensor caused by time-variation factors. Our solution is to formulate an adaptive Bayesian SMR (BSMR), which randomizes model parameters (including the number of mixing components K) and updates the BSMR in a recursive fashion online.

Conflicts of Interest:
The authors declare no conflicts of interest.