A Gradient-Based Method for Robust Sensor Selection in Hypothesis Testing

This paper considers robust binary Gaussian hypothesis testing under a Bayesian optimal criterion in wireless sensor networks (WSNs). The covariance matrix of the distribution under each hypothesis is known, while the mean vector under each hypothesis drifts in an ellipsoidal uncertainty set. Because of limited bandwidth and energy, we seek a subset of p out of m sensors that achieves the best detection performance. In this setup, a minimax robust sensor selection problem is proposed to deal with the uncertainty in the distribution means. Following a popular approach, minimizing the maximum overall error probability with respect to the selection matrix can be approximated by maximizing the minimum Chernoff distance between the distributions of the selected measurements under the null and alternative hypotheses. We then utilize Danskin's theorem to compute the gradient of the objective function of the converted maximization problem, and apply the orthogonal constraint-preserving gradient algorithm (OCPGA) to solve the relaxed maximization problem without 0/1 constraints. It is shown that the OCPGA obtains a stationary point of the relaxed problem. Meanwhile, we provide the computational complexity of the OCPGA, which is much lower than that of the existing greedy algorithm. Finally, numerical simulations illustrate that, after the same projection and refinement phases, the OCPGA-based method can obtain better solutions than the greedy algorithm-based method with up to 48.72% shorter runtimes. In particular, for small-scale problems, the OCPGA-based method is able to attain the globally optimal solution.


Introduction
Wireless sensor networks (WSNs) are extensively used to collect and transmit data in many applications, such as autonomous driving [1], disaster detection [2], target tracking [3], etc. In the WSN, it is usually unaffordable to collect and process all sensor data due to the limitations of power and communication resources [4,5]. Therefore, it is of great significance to choose an optimal subset of sensors such that the best performance is attained only based on data collected by the selected sensors, which is the so-called sensor selection problem.
Over the past dozen years, sensor selection has been widely studied in various fields, e.g., estimation [6], target tracking [7], and condition monitoring [8], to name a few. For parameter estimation in a Kalman-filtering dynamic system, [6] chose the optimal subset of sensors in each iteration by minimizing the error covariance matrix of the next iteration. The sensor selection problem for target tracking in large sensor networks was addressed in [7] based on generalized information gain. Work [8] provided an entropy-based sensor selection method for condition monitoring and prognostics of aircraft engines, which can describe the information contained in the sensor data sets.
Meanwhile, the sensor selection problem in hypothesis testing has also attracted a lot of attention [9][10][11]. In this type of hypothesis testing, only a subset of the sensors in the WSN are activated to transmit observation data, and decisions are then made based on the measurements of the selected sensors to achieve the best detection performance. Once the optimal sensor selection matrix is fixed, the corresponding hypothesis testing problem reduces to a standard one, which is easy to deal with. Hence, the crux is to solve the involved sensor selection problem.
Work [9] studied sensor selection for binary Gaussian hypothesis testing in the Neyman-Pearson framework, where the true distribution under each hypothesis is exactly known. It approximately converted the minimization of the false alarm probability into the maximization of the Kullback-Leibler (KL) divergence between the distributions of the selected measurements under the null and alternative hypotheses. Additionally, [9] was the first to propose the relax-then-project sensor selection framework, and provided a greedy algorithm that solves the relaxed problem by optimizing each column vector of the selection matrix.
In practical applications, the events to be detected (i.e., the parameters of the hypothesis testing) are usually estimated from training data and affected by uncertainty factors, such as poor observation environments and system errors. These parameters are then not known precisely, but assumed to lie in given uncertainty sets [10,11]. In these scenarios, a minimax robust sensor selection problem is formulated to cope with the parameter uncertainty. For binary Gaussian hypothesis testing under the Neyman-Pearson framework, following the framework in [9], work [10] investigated the involved sensor selection problem with the distribution mean under each hypothesis falling in an ellipsoidal uncertainty set (the distribution covariance is known). Furthermore, [11] considered the sensor selection problems involved in Gaussian robust hypothesis testing under both the Neyman-Pearson and Bayesian optimal criteria, where the distribution mean under each hypothesis drifts in an ellipsoidal uncertainty set. For the Bayesian framework, minimizing the maximum overall error probability is approximately converted into maximizing the minimum Chernoff distance. A corresponding greedy algorithm, together with projection and refinement phases, is also proposed to solve the resulting robust sensor selection problem.
It has been shown in [11] that the robust sensor selection problem in hypothesis testing is NP-hard under the Bayesian framework. Nevertheless, when the size of the sensor selection problem is small, its optimal solution can be obtained by the exhaustive method via traversing all possible choices. For a large-scale problem, however, the exhaustive method is unaffordable due to its huge computational complexity. Although the aforementioned greedy algorithm-based method (i.e., greedy algorithm, projection and refinement) admits a lower computational complexity than the exhaustive method, it cannot reach the globally optimal solution in many cases, and its computational complexity is still high for large-scale problems. Therefore, it is significant to seek a more efficient algorithm for solving the robust sensor selection problem in WSN hypothesis testing. Surprisingly, even though other general sensor selection problems have been continuously investigated, for instance, sparse sensing [12] and sensor selection in sequential hypothesis testing [13], there has been little progress on this type of sensor selection problem since [11] was published in 2011, which motivates our research.
In this paper, we consider the same robust binary Gaussian hypothesis testing under a Bayesian framework as in [11], where the distribution mean under each hypothesis lies in an ellipsoidal uncertainty set. We aim to select an optimal subset of sensors such that the maximum overall error probability is minimized. Following a similar idea to [11], minimizing the maximum overall error probability is approximated by maximizing the minimum Chernoff distance between the two distributions under the null and alternative hypotheses. Our main contributions can be summarized as follows.
• First, we convert the max-min optimization of the Chernoff distance into a maximization problem, and adopt the orthogonal constraint-preserving gradient algorithm (OCPGA) [14] to obtain a stationary point of the relaxed maximization problem without 0/1 constraints.
• Specifically, when applying the OCPGA to the relaxed maximization problem, we utilize Danskin's theorem [15] to acquire its gradient. Furthermore, an efficient bisection is applied to obtain the means of the distributions under the null and alternative hypotheses corresponding to the minimum Chernoff distance.
• The computational complexity of the OCPGA is shown to be lower than that of the greedy algorithm in [11] from a theoretical point of view, while numerical simulations show that the OCPGA-based method (i.e., OCPGA, projection and refinement) can obtain better solutions than the greedy algorithm-based method (i.e., the R-C algorithm in [11]) with up to 48.72% shorter runtimes.
The remainder of this paper is organized as follows. Section 2 states the problem formulation. The proposed OCPGA, as well as projection and refinement phases, is characterized in Section 3. In Section 4, the existence and computation of gradient are provided. Section 5 presents some numerical experiments to corroborate our theoretical results, while Section 6 concludes the paper.
Notations: Denote R^m and R^{m×p} as the m-dimensional real vector space and the m × p real matrix space, respectively. Let I and 0 be the identity matrix and the zero matrix, whose dimensions will be clear from the context; bold-face lower-case letters are used for vectors, while bold-face upper-case letters are for matrices. N(µ, S) represents the Gaussian distribution with mean µ and covariance S. For a matrix A, tr(A), |A|, ‖A‖_F, A^H and A_{i,j} denote its trace, determinant, Frobenius norm, conjugate transpose and (i, j)-th entry, respectively. For square matrices A and B, A ⪰ B (A ≻ B) represents that A − B is positive semidefinite (positive definite). For a positive semidefinite matrix A, A^{1/2} stands for its square root.

System Model
Define x = (x_1, x_2, · · · , x_m)^T ∈ R^m as the observation vector of all m sensors (each sensor corresponds to a one-dimensional measurement). Consider the same robust binary Gaussian hypothesis testing as in [11]:

H_i : x ∼ N(m_i, S_i), i = 0, 1,    (1)

where the mean vector m_i falls in a given ellipsoidal uncertainty set E(m̂_i, k_i S_i^{-1}), m̂_i denotes the mean estimated from training data, the covariance S_i is a known matrix, and k_i ∈ (0, +∞] is the robustness parameter, i = 0, 1. Obviously, when k_i = +∞, the ellipsoidal uncertainty set reduces to a single point, and thereby m_i = m̂_i, i.e., there is no uncertainty.

Sensor Selection: In the WSN, sensors transmit their observations to a fusion node, which then performs the hypothesis testing based on its received measurements. Due to power constraints, suppose that only p out of m sensors are chosen (p < m) to transmit their observations to the fusion node. We aim at selecting p sensors to guarantee the best detection performance, that is, seeking a selection matrix E ∈ R^{m×p} with 0/1 elements such that the best detection performance based on the measurements y = E^T x ∈ R^p is achieved. It is easy to see that E has exactly one unit entry per column (corresponding to a selected sensor) and at most one unit entry per row (each sensor is selected at most once). Therefore, E is a column-orthogonal matrix, i.e., E^T E = I.
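As a small illustration of the selection mechanism (a minimal sketch; the dimension m = 6 and the chosen sensor indices are hypothetical), the following code builds a 0/1 selection matrix, extracts y = E^T x, and checks the column orthogonality E^T E = I:

```python
import numpy as np

def selection_matrix(m, idx):
    """Build an m x p selection matrix with 0/1 entries whose p columns
    are canonical basis vectors picking the sensors listed in idx."""
    E = np.zeros((m, len(idx)))
    for col, j in enumerate(idx):
        E[j, col] = 1.0
    return E

# Choosing p = 3 out of m = 6 sensors (indices 0, 2, 4):
E = selection_matrix(6, [0, 2, 4])
x = np.arange(6, dtype=float)           # full observation vector
y = E.T @ x                             # selected measurements
assert np.allclose(E.T @ E, np.eye(3))  # column orthogonality E^T E = I
assert np.allclose(y, [0.0, 2.0, 4.0])
```

Each column has exactly one unit entry and distinct columns pick distinct rows, which is exactly why E^T E = I holds.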

Hypothesis Testing Induced by E:
Owing to y = E^T x, the original hypothesis testing (1) about x in the high-dimensional space R^m is converted into one about y in the lower-dimensional space R^p:

H_i : y ∼ N(E^T m_i, E^T S_i E), i = 0, 1.    (2)

Without loss of generality, we assume that

(E^T m_0, E^T S_0 E) ≠ (E^T m_1, E^T S_1 E);    (3)

otherwise, the hypothesis testing (2) makes no sense. For hypothesis testing (2), when E and m_i are determined, the fusion node executes the following likelihood ratio (LR) test: decide H_1 if l(y) := f_1(y; E)/f_0(y; E) > γ and decide H_0 otherwise, where f_i(y; E) is the density of N(E^T m_i, E^T S_i E), i = 0, 1, and γ is the test threshold [16].

Sensor Selection Optimality Criteria: Under a Bayesian framework, detection performance is quantified by the overall error probability P_e, where P_e := P(H_0)P_F + P(H_1)P_M, with P_F := P(l(y) > γ | H_0) and P_M := P(l(y) < γ | H_1) being the false alarm and miss detection probabilities, respectively. The robust sensor selection problem in hypothesis testing under a Bayesian framework is then to seek the selection matrix E such that the maximum overall error probability P_e is minimized when making decisions with respect to hypothesis testing (2), that is, to solve the following optimization problem:

min_E max_{m_0, m_1} P_e  s.t. E has 0/1 entries, E^T E = I, m_i ∈ E(m̂_i, k_i S_i^{-1}), i = 0, 1.    (4)
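For intuition, the LR test and the resulting overall error probability P_e can be checked by a quick Monte Carlo sketch. This assumes known means with no uncertainty, equal priors, identity covariances, and threshold γ = 1; the dimensions and means are hypothetical:

```python
import numpy as np

# Toy LR test with known means (no uncertainty), equal priors, and
# threshold gamma = 1; dimensions and means here are illustrative.
rng = np.random.default_rng(0)
m0, m1 = np.zeros(2), np.array([3.0, 3.0])
n = 20000
x0 = rng.multivariate_normal(m0, np.eye(2), n)  # samples under H0
x1 = rng.multivariate_normal(m1, np.eye(2), n)  # samples under H1

def decide_H1(x):
    # With S0 = S1 = I, the test l(x) > 1 reduces to a linear
    # threshold test on (m1 - m0)^T x.
    return x @ (m1 - m0) > 0.5 * (m1 @ m1 - m0 @ m0)

P_F = np.mean(decide_H1(x0))        # false alarm probability
P_M = np.mean(~decide_H1(x1))       # miss detection probability
P_e = 0.5 * P_F + 0.5 * P_M         # overall error probability
assert P_e < 0.1                    # well-separated means, small error
```

With the means this far apart, the estimated P_e is small; bringing the means closer drives P_e toward 1/2, which is what sensor selection tries to avoid.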

Problem Transformation
Since computing P_e in problem (4) is usually difficult due to the involved integrals, we follow the popular approach in [11] to approximately optimize P_e. By the Chernoff lemma [17], as the number of independent identically distributed (i.i.d.) measurements increases, the exponential decay rate of P_e equals the Chernoff distance between the two distributions N(E^T m_0, E^T S_0 E) and N(E^T m_1, E^T S_1 E). On the other hand, according to the definition of the Chernoff distance between two probability densities f_0 and f_1,

f_C(E, m_0, m_1) := max_{s∈[0,1]} f(E, s, m_0, m_1),    (5)

where f(E, s, m_0, m_1) := −log ∫ f_0^s(y; E) f_1^{1−s}(y; E) dy admits a closed form for the Gaussian densities above, given in Equation (6). Therefore, as in [11], minimizing the maximum overall error probability P_e can be approximately converted into maximizing the minimum Chernoff distance f_C(E, m_0, m_1). Accordingly, problem (4) is transformed into a max-min problem over E and (m_0, m_1), or equivalently (writing the selection constraint as 0/1 entries together with E^T E = I),

max_E min_{m_0, m_1} f_C(E, m_0, m_1)  s.t. E has 0/1 entries, E^T E = I, m_i ∈ E(m̂_i, k_i S_i^{-1}), i = 0, 1.    (7)

Work [11] has proven that problem (7) is NP-hard, and proposed a suboptimal greedy method along with projection and refinement phases to deal with it. Although this greedy algorithm-based method (i.e., the R-C algorithm in [11]) admits a lower computational complexity than the exhaustive method, it cannot reach the globally optimal solution in many cases, and its computational complexity remains high for large-scale problems. Therefore, we endeavor to propose a more efficient method to obtain a better solution of problem (7).
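As a sanity check on the Chernoff distance used above, the sketch below evaluates the standard Gaussian closed form of −log ∫ f_0^s f_1^{1−s} on a grid of s and takes the maximum; the means and covariances are hypothetical. When both covariances equal I, the maximum is attained at s = 1/2 and equals ‖m_1 − m_0‖²/8.

```python
import numpy as np

def chernoff_distance(m0, S0, m1, S1, grid=np.linspace(0.0, 1.0, 101)):
    """max over s in [0,1] of -log ∫ f0^s f1^(1-s) dy, using the
    standard closed form for Gaussian densities f0, f1."""
    d = m1 - m0
    best = -np.inf
    for s in grid:
        Ss = (1 - s) * S0 + s * S1
        quad = 0.5 * s * (1 - s) * d @ np.linalg.solve(Ss, d)
        logdet = 0.5 * np.log(np.linalg.det(Ss)
                              / (np.linalg.det(S0) ** (1 - s)
                                 * np.linalg.det(S1) ** s))
        best = max(best, quad + logdet)
    return best

# Equal identity covariances: distance = ||m1 - m0||^2 / 8 at s = 1/2.
val = chernoff_distance(np.zeros(2), np.eye(2), np.array([2.0, 0.0]), np.eye(2))
assert abs(val - 0.5) < 1e-9
```

A grid search is used here only for transparency; Section 5's setting instead finds the maximizing s by bisection on the derivative, exploiting concavity in s.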
It is not difficult to see that solving problem (7) can be sequentially divided into an inner minimization and an outer maximization. For a given selection matrix E, defining

f̃_C(E) := min_{m_i ∈ E(m̂_i, k_i S_i^{-1}), i=0,1} f_C(E, m_0, m_1),    (9)

and denoting (m̄_0(E), m̄_1(E)) as the optimal solution of the subproblem

min_{m_i ∈ E(m̂_i, k_i S_i^{-1}), i=0,1} f_C(E, m_0, m_1),    (10)

it holds that f̃_C(E) = f_C(E, m̄_0(E), m̄_1(E)). Correspondingly, problem (7) can be equivalently transformed into

max_E f̃_C(E)  s.t. E has 0/1 entries, E^T E = I.    (PP)

Remarkably, the optimal solutions of problems (7) and (PP) are the same. Hence, we discuss how to solve problem (PP) in the following. Taking the orthogonal constraint E^T E = I into account, we adopt the OCPGA-based method to deal with problem (PP).

OCPGA-Based Method
Referring to the greedy algorithm-based method proposed in [11], solving problem (PP) can likewise be divided into three phases: relaxation, projection and refinement. We utilize the OCPGA-based method to solve problem (PP): in the relaxation phase, the OCPGA handles the relaxed problem without 0/1 constraints, while the projection and refinement phases are the same as in the greedy algorithm-based method. We first provide the workflow of the OCPGA-based method for solving problem (PP) in Figure 1; the details are given in Sections 3.1 and 3.2.

Relaxation Phase
By relaxing the 0/1 constraints, problem (PP) reduces to

max_E f̃_C(E)  s.t. E^T E = I.    (RP)

Taking the orthogonal constraint in problem (RP) into consideration, once the gradient ∇f̃_C(E) of the objective function f̃_C(E) exists and is computable, we can implement the OCPGA in [14] to solve problem (RP), as presented in Algorithm 1.

Algorithm 1: OCPGA
The update formula for Y_n(τ) in Algorithm 1 is the Cayley transformation [18], and thereby Y_n(τ) always satisfies the orthogonal constraint Y_n(τ)^T Y_n(τ) = I at each iteration n. Meanwhile, the stepsize τ in Algorithm 1 is chosen by a curvilinear search [19] combined with the Barzilai-Borwein (BB) [20] nonmonotone line search [21]. It has been shown by Lemma 2.2 and Remark 2.3 in [22] that the sequence generated by the OCPGA globally converges to a stationary point. For clarity and completeness, we state the convergence result for Algorithm 1 in the following theorem without proof.

Theorem 1 ([22]). When f̃_C(E) in problem (RP) is differentiable and the gradient ∇f̃_C(E) is available, the point E* obtained by the OCPGA in Algorithm 1 is a stationary point of problem (RP).
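The constraint-preserving update at the heart of Algorithm 1 can be sketched as follows; here a random matrix stands in for the gradient ∇f̃_C, and the stepsize and dimensions are arbitrary. Whatever the search direction, the Cayley curve maps an orthonormal iterate to another orthonormal matrix:

```python
import numpy as np

def cayley_step(Y, G, tau):
    """One curvilinear step along the Cayley curve: with the
    skew-symmetric W built from the gradient G, the update
    (I + tau/2 W)^{-1} (I - tau/2 W) Y keeps Y^T Y = I."""
    m = Y.shape[0]
    W = G @ Y.T - Y @ G.T                  # skew-symmetric by construction
    A = np.eye(m) + (tau / 2.0) * W
    B = np.eye(m) - (tau / 2.0) * W
    return np.linalg.solve(A, B @ Y)

rng = np.random.default_rng(0)
Y, _ = np.linalg.qr(rng.standard_normal((6, 3)))   # random orthonormal start
G = rng.standard_normal((6, 3))                    # stand-in gradient
Y_new = cayley_step(Y, G, 0.1)
assert np.allclose(Y_new.T @ Y_new, np.eye(3))     # feasibility preserved
```

Because W is skew-symmetric, (I + τ/2 W)^{-1}(I − τ/2 W) is an orthogonal matrix, so feasibility holds for every stepsize τ; the ascent sign convention and the BB stepsize rule are omitted in this sketch.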
Moreover, the computational complexity of the OCPGA in Algorithm 1 is O(mp^2) [14]. Since p is generally much smaller than m, the computational complexity of the OCPGA is much lower than the O(m^3 p) complexity of the greedy algorithm in [11]. Hence, our proposed OCPGA is more efficient than the greedy algorithm, particularly for quite large m.

Projection and Refinement Phases
Generally, the solution E * obtained by the OCPGA in Algorithm 1 is not a selection matrix, because the elements of E * are not guaranteed to be 0/1. Thus we need to further execute the projection and refinement phases as in [11].
First, we seek the selection matrix Ẽ closest to the range space of E* by solving the problem

min_{Ẽ} ‖Ẽ Ẽ^T − E*(E*)^T‖_F  s.t. Ẽ has 0/1 entries, Ẽ^T Ẽ = I.    (11)

It has been shown in [11] that problem (11) admits a closed-form solution. Specifically, let (j_1, · · · , j_p) be the indexes of the p largest entries on the diagonal of E*(E*)^T; then Ẽ = (i_{j_1}, · · · , i_{j_p}) is the optimal solution of problem (11), where i_j stands for the j-th column of the identity matrix I_m.

Subsequently, after projecting onto the set of 0/1 selection matrices, we further perform a refinement around Ẽ. Setting E = Ẽ, the first column of E is viewed as the optimization variable, while all other columns are fixed. Then we sweep through all canonical vectors (i.e., columns of the identity matrix I_m) different from the remaining p − 1 columns of E, and choose the one attaining the maximum f̃_C as the first column. In the next step, the procedure is repeated for the second column, and so on, up to the p-th column. At that point one refinement is finished, and the resulting matrix Ê = E is regarded as a solution of problem (PP). Obviously, the more we refine, the better the solution we can achieve. When p = 1, the solution obtained by one refinement is indeed the globally optimal selection matrix of problem (PP). If time permits, we can execute the refinement phase several times until the objective value f̃_C stays unchanged.
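The projection step described above (keep the p largest diagonal entries of E*(E*)^T and form the corresponding identity columns) has a direct implementation; the relaxed solution below is a made-up example whose energy concentrates on rows 0 and 3:

```python
import numpy as np

def project_to_selection(E_star, p):
    """Closed-form projection: the p largest diagonal entries of
    E* E*^T determine which canonical vectors form the columns of
    the projected 0/1 selection matrix."""
    d = np.diag(E_star @ E_star.T)          # row "energies" of E*
    idx = np.sort(np.argsort(d)[::-1][:p])  # indexes of p largest entries
    E = np.zeros((E_star.shape[0], p))
    E[idx, np.arange(p)] = 1.0
    return E

E_star = np.array([[0.9, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 0.99]])
E_tilde = project_to_selection(E_star, 2)
assert set(np.argmax(E_tilde, axis=0)) == {0, 3}
assert np.allclose(E_tilde.T @ E_tilde, np.eye(2))
```

The diagonal of E*(E*)^T measures how much each sensor (row) contributes to the relaxed solution, which is why thresholding it yields the closest selection matrix.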
Notice that the computation cost of the projection is very small (as there exists an analytical solution), while in the refinement we need to evaluate the objective value f̃_C (m − p)p times. Since the OCPGA is more efficient than the greedy algorithm, and the remaining projection and refinement phases are the same, the OCPGA-based method naturally possesses higher efficiency than the greedy algorithm-based method.
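One refinement sweep can be sketched generically; here `obj` stands in for the objective f̃_C, and the toy separable weights make the optimum easy to verify by hand:

```python
import numpy as np

def refine(E, obj):
    """One refinement sweep: for each column in turn, try every
    canonical vector not used by the other columns and keep the
    one maximizing obj, as described in the refinement phase."""
    m, p = E.shape
    E = E.copy()
    for c in range(p):
        used = {int(np.argmax(E[:, k])) for k in range(p) if k != c}
        best_j, best_val = None, -np.inf
        for j in range(m):
            if j in used:
                continue
            E[:, c] = 0.0
            E[j, c] = 1.0
            val = obj(E)
            if val > best_val:
                best_j, best_val = j, val
        E[:, c] = 0.0
        E[best_j, c] = 1.0
    return E

# Toy objective: the value of a selection is the sum of per-sensor
# weights, so refinement should end up picking the largest weights.
w = np.array([0.0, 1.0, 5.0, 2.0, 3.0])
obj = lambda E: float(w @ E.sum(axis=1))
E0 = np.zeros((5, 2)); E0[0, 0] = E0[1, 1] = 1.0   # start with sensors {0, 1}
E1 = refine(E0, obj)
assert set(np.argmax(E1, axis=0)) == {2, 4}        # best pair {2, 4}
```

Each sweep evaluates the objective roughly (m − p)p times, matching the cost estimate above; for the actual problem, each evaluation of f̃_C itself requires solving the inner minimization.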

Remark 1.
For the OCPGA in Algorithm 1, the initial point is randomly chosen from the set of column-orthogonal matrices. Experiments illustrate that different initial points may lead to different outcomes. Therefore, in order to improve performance, we run the OCPGA from several different initial points and, after projection and refinement, choose the best solution as the output.

Remark 2. According to the above discussion, the key to implementing the OCPGA-based method is computing the gradient ∇f̃_C(E) in problem (RP). In the next section, we show how to obtain the gradient ∇f̃_C(E) and discuss when it exists.

Existence and Computation of the Gradient in Problem (RP)
As can easily be seen from Algorithm 1, it is essential to compute the gradient ∇f̃_C(E) in problem (RP). Invoking the definition of f̃_C(E) in Equation (9), we have

f̃_C(E) = min_{m_i ∈ E(m̂_i, k_i S_i^{-1}), i=0,1} max_{s∈[0,1]} f(E, s, m_0, m_1).

To proceed, we first prove the strict concavity of f(E, s, m_0, m_1) with respect to s, which forms a key ingredient of our later arguments.
Then, taking the derivative of ∇_s f(E, s, m_0, m_1) with respect to s once again, we obtain the second derivative ∇²_s f(E, s, m_0, m_1) given in Equation (13).
On the other hand, when E^T m_0 = E^T m_1, it follows from condition (3) that E^T S_0 E ≠ E^T S_1 E. If ∇²_s f(E, s, m_0, m_1) were zero, it would immediately follow that E^T S_0 E − E^T S_1 E = 0, which leads to a contradiction. Therefore, ∇²_s f(E, s, m_0, m_1) < 0 in this case as well. As a consequence, f(E, s, m_0, m_1) is a strictly concave function of s.
Moreover, the following Proposition 2 shows that f_C(E, m_0, m_1), with f(E, s, m_0, m_1) given by Equation (6), is continuously differentiable with respect to E for given (m_0, m_1), which is also necessary for the follow-up analysis.
Note that f(E, 0, m_0, m_1) = f(E, 1, m_0, m_1) = 0 and max_{s∈[0,1]} f(E, s, m_0, m_1) > 0 under condition (3). Hence, s* ∈ (0, 1), which, combined with the strict concavity of f(E, s, m_0, m_1) in Proposition 1, implies that s* is unique and satisfies ∇_s f(E, s*, m_0, m_1) = 0. Recalling the expression of ∇_s f(E, s, m_0, m_1) in (12), since all the involved terms B(s), tr(·), |·|, log(·) and E^T S_i E are continuously differentiable functions of E, ∇_s f(E, s, m_0, m_1), as a composite of these functions, is also continuously differentiable with respect to E. Similarly, ∇²_s f(E, s, m_0, m_1) in (13) is a composition of the functions 1/(1−s), B(s) and tr(·), which are all continuous with respect to s. Hence, ∇²_s f(E, s, m_0, m_1) is continuous with respect to s, that is, ∇_s f(E, s, m_0, m_1) is continuously differentiable with respect to s. Due to ∇²_s f(E, s, m_0, m_1) ≠ 0, it follows from the implicit function theorem [23] that s* is an implicit function of E. Hence, we rewrite s* as

s*(E) := arg max_{s∈[0,1]} f(E, s, m_0, m_1).    (14)

Furthermore, based on the implicit function theorem [23], s*(E) is a continuously differentiable function of E, which means that ∇s*(E) is continuous with respect to E.
In addition, f(E, s, m_0, m_1) is continuously differentiable with respect to s and E. Combined with the fact that s*(E) is a continuously differentiable function of E, f_C(E, m_0, m_1) = f(E, s*(E), m_0, m_1) is differentiable with respect to E. Moreover, by the chain rule [24] and the stationarity condition ∇_s f(E, s*(E), m_0, m_1) = 0, we have

∇_E f_C(E, m_0, m_1) = ∇_E f(E, s, m_0, m_1)|_{s=s*(E)}.

Compute the Gradient in Problem (RP) by Danskin's Theorem
In the sequel, we exploit Danskin's theorem in Appendix A to compute the gradient ∇f̃_C(E), where f̃_C(E) is defined by Equation (9).
On the basis of Proposition 2, the following results hold for the function f_C(E, m_0, m_1) and the uncertainty set E(m̂_0, k_0 S_0^{-1}) × E(m̂_1, k_1 S_1^{-1}):
1. For any given (m_0, m_1), f_C(E, m_0, m_1) is differentiable with respect to E;
2. For arbitrary given (m_0, m_1) and sufficiently small t > 0, since f_C(E, m_0, m_1) is differentiable, there exists a bounded directional derivative D_1 f_C(E + th, m_0, m_1);
3. The mapping (t, m_0, m_1) → D_1 f_C(E + th, m_0, m_1) is continuous at the point (0, m_0, m_1), which follows from the continuity of ∇_E f_C(E, m_0, m_1).
Hence, with the identifications E ∼ u, (m_0, m_1) ∼ v, and −f̃_C(E) ∼ J̃(u), all conditions of Danskin's theorem in Appendix A are satisfied. Subsequently, for a given selection matrix E, if the optimal solution (m̄_0(E), m̄_1(E)) of problem (10) is unique, then the gradient of f̃_C(E) exists. Furthermore, on the basis of Danskin's theorem, we have

∇f̃_C(E) = ∇_E f_C(E, m_0, m_1)|_{(m_0, m_1)=(m̄_0(E), m̄_1(E))}.
Recall f_C(E, m_0, m_1) = max_{s∈[0,1]} f(E, s, m_0, m_1) from Equation (5). For given (m_0, m_1), we can again utilize Danskin's theorem in Appendix A to obtain ∇_E f_C(E, m_0, m_1). Obviously, f(E, s, m_0, m_1) is continuously differentiable with respect to E. Therefore, it is easy to verify that conditions 1)-3) of Danskin's theorem in Appendix A are all satisfied with the identifications E ∼ u, s ∼ v, and f_C(E, m_0, m_1) ∼ J̃(u). Combined with the uniqueness of s*(E) defined by Equation (14), we can deduce

∇_E f_C(E, m_0, m_1) = ∇_E f(E, s, m_0, m_1)|_{s=s*(E)}.

On the other hand, due to the equivalence of problems (7) and (8), for a given matrix E, a solution (m̄_0(E), m̄_1(E)) of problem (10) corresponds to a solution (ŝ(E), m̄_0(E), m̄_1(E)) of the problem

(IP): min_{m_i ∈ E(m̂_i, k_i S_i^{-1}), i=0,1} max_{s∈[0,1]} f(E, s, m_0, m_1).

Compute the Optimal Solution of Problem (IP)
Owing to Sion's minimax theorem [25] and Lemma 5 in [11], for given E, we can exchange the order of the minimization with respect to (m_0, m_1) and the maximization with respect to s in problem (IP). Correspondingly, problem (IP) can be equivalently converted into

max_{s∈[0,1]} min_{m_i ∈ E(m̂_i, k_i S_i^{-1}), i=0,1} f(E, s, m_0, m_1).    (20)

Therefore, we solve problem (20) to attain the solution (ŝ(E), m̄_0(E), m̄_1(E)) of problem (IP). First, for given matrix E and parameter s, denote (m*_0(s), m*_1(s)) as the optimal solution of the inner minimization

min_{m_i ∈ E(m̂_i, k_i S_i^{-1}), i=0,1} f(E, s, m_0, m_1).    (21)

Problem (21) is convex and can be directly solved by the CVX toolbox [26] with computational complexity O(p^3) [27]. Thus, once s is fixed, the corresponding (m*_0(s), m*_1(s)) follows. In particular, (m̄_0(E), m̄_1(E)) = (m*_0(ŝ(E)), m*_1(ŝ(E))).

Next, we show how to determine the optimal ŝ(E) given by Equation (18). Based on Proposition 1, f(E, s, m_0, m_1) is concave with respect to s. Combined with the fact that f(E, s, m*_0(s), m*_1(s)) is the minimum of a family of concave functions f(E, s, m_0, m_1) over the uncertainty sets, it is also a concave function of s [28]. Therefore, ∇_s f(E, s, m*_0(s), m*_1(s)) is monotonically decreasing with respect to s. Subsequently, we apply the efficient bisection method to search for ŝ(E) such that ∇_s f(E, ŝ(E), m*_0(ŝ(E)), m*_1(ŝ(E))) = 0. For given E, we can once again use Danskin's theorem in Appendix A to derive ∇_s f(E, s, m*_0(s), m*_1(s)). Since f(E, s, m_0, m_1) is continuously differentiable with respect to s, all conditions of Danskin's theorem in Appendix A are met. Meanwhile, even if the optimal solution (m*_0(s), m*_1(s)) of problem (21) is not unique, ∇_s f(E, s, m*_0(s), m*_1(s)) in Equation (12) is always the same no matter which (m*_0(s), m*_1(s)) is substituted. Therefore, f(E, s, m*_0(s), m*_1(s)) is differentiable with respect to s, and we have

∇_s f(E, s, m*_0(s), m*_1(s)) = ∇_s f(E, s, m_0, m_1)|_{(m_0, m_1)=(m*_0(s), m*_1(s))},

where ∇_s f(E, s, m_0, m_1) is given by Equation (12), and (m*_0(s), m*_1(s)) is an arbitrary optimal solution of problem (21).
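The bisection search for ŝ(E) relies only on the monotone decrease of the derivative of a concave function. A minimal sketch with a toy concave objective (standing in for f(E, s, m*_0(s), m*_1(s))) is:

```python
def bisect_root(g, lo=0.0, hi=1.0, tol=1e-8):
    """Find s with g(s) = 0 for a monotonically decreasing g on
    [lo, hi], as in the bisection search for s-hat described above."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid    # still on the increasing side of the objective
        else:
            hi = mid    # past the maximizer
    return 0.5 * (lo + hi)

# Toy concave objective f(s) = 2 s (1 - s): its derivative
# g(s) = 2 (1 - 2 s) is decreasing and vanishes at s* = 1/2.
s_star = bisect_root(lambda s: 2.0 * (1.0 - 2.0 * s))
assert abs(s_star - 0.5) < 1e-6
```

In the actual procedure, each evaluation of g(·) at a trial s requires solving the convex inner problem (21) for (m*_0(s), m*_1(s)) before the derivative can be evaluated there.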

Inner Procedure 1: Computing the Optimal Solution of Problem (IP)
Input: E, threshold η 0 , lower bound s l and upper bound s u of s, f , n = 0.

Remark 3.
If the true distribution under each hypothesis is exactly known, i.e., there is no uncertainty in the mean vector, then we omit the process of computing solution (m * 0 (ŝ), m * 1 (ŝ)) of problem (21), and directly regard the true mean vector as (m * 0 (ŝ), m * 1 (ŝ)). Consequently, the robust sensor selection problem is reduced to one without uncertainty.
Based on the previous discussion, if the optimal solution (ŝ(E), m̄_0(E), m̄_1(E)) of problem (IP) is unique, then the gradient ∇f̃_C(E) exists and can be computed by Equation (19). Next, we discuss the existence of the gradient ∇f̃_C(E) in detail.

Existence of the Gradient in Problem (RP)
It has been shown that the uniqueness of the optimal solution (ŝ(E), m̄_0(E), m̄_1(E)) leads to the existence of the gradient ∇f̃_C(E). Therefore, in the following, we show when (ŝ(E), m̄_0(E), m̄_1(E)) is unique. First, the following lemma demonstrates the uniqueness of ŝ(E).
Hence, ∇f̃_C(E) exists and can be obtained by Equation (19). Subsequently, Algorithm 1 can be executed. Furthermore, on the basis of Theorem 1 in Section 3.1, we conclude that E* obtained by the OCPGA in Algorithm 1 is a stationary point of problem (RP), which lays the foundation for attaining a better solution of the original problem (PP).

Remark 5.
If Assumption 1 is not satisfied, we can use the Clarke generalized gradient [30] in place of the gradient in Algorithm 1. Based on Danskin's theorem in Appendix A, the Clarke generalized gradient can also be obtained from Equation (19), where an arbitrary solution of problem (21) is used. Then Algorithm 1 is still applicable and preserves the orthogonal constraint in each iteration. Although the resulting solution is not guaranteed to be a stationary point of problem (RP), numerical simulations illustrate that, after the projection and refinement phases, the performance of the final result is still acceptable.

Numerical Simulations
In this section, numerical examples are carried out to show that the OCPGA-based method can obtain better solutions than the greedy algorithm-based method in [11]. To this end, (1) for fixed-size sensor selection problems, i.e., with the total number and selected number of sensors fixed, over 50 (or 20) randomly generated cases with different uncertainty sets, we report the proportions in which the OCPGA-based method performs better than, the same as, and worse than the greedy algorithm-based method; (2) for small-scale sensor selection problems, we compare the OCPGA-based method with the greedy algorithm-based method and the exhaustive method; (3) for larger-scale sensor selection problems, the OCPGA-based method is compared with the greedy algorithm-based method; (4) for specific small-scale and larger-scale sensor selection problems, the corresponding receiver operating characteristic (ROC) curves are depicted. Notably, in cases (1)-(3), the performance of a method is measured by the resulting Chernoff distance (i.e., f̃_C(E) in problem (PP)); that is, the method with the larger Chernoff distance admits the better performance. All the procedures are coded in MATLAB R2014b on an ASUS notebook with an Intel(R) Core(TM) i3-2310M CPU at 2.10 GHz and 6 GB of memory.
Assume that we need to choose p out of m sensors. In all simulations, for given (m, p), the ellipsoidal uncertainty sets E(m̂_i, k_i S_i^{-1}) in problem (IP), i = 0, 1, which contain the true distribution under each hypothesis, are generated as follows. Elements of the estimated mean vectors m̂_0 and m̂_1 are randomly generated from (0, 1) and (0, 2), respectively. The covariance matrix S_i is generated by S_i = P_i Σ_i P_i^T, where P_i is an orthogonal basis obtained from an m × m matrix whose elements are randomly drawn from (0, 1), and Σ_i is a diagonal matrix with diagonal entries randomly generated in (0, 2), i = 0, 1. The robustness parameters are k_0 = k_1 = 1.
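The covariance construction above can be sketched as follows (the seed and dimension are arbitrary); by construction S = P Σ P^T is symmetric and positive definite whenever the diagonal entries of Σ are positive:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 6
# Orthogonal basis P from the QR factorization of a random matrix with
# entries in (0, 1), and diagonal Sigma with entries in (0, 2),
# mirroring the simulation setup described above.
P, _ = np.linalg.qr(rng.random((m, m)))
Sigma = np.diag(2.0 * rng.random(m))
S = P @ Sigma @ P.T
assert np.allclose(S, S.T)                        # symmetric
assert np.all(np.linalg.eigvalsh(S) > -1e-10)     # PSD up to roundoff
```

The eigenvalues of S are exactly the diagonal entries of Σ, so the spread of the generated covariance spectra is controlled directly by the (0, 2) interval.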
When (m, p) and the ellipsoidal uncertainty sets E(m̂_i, k_i S_i^{-1}) are given, we adopt Inner Procedure 1 to compute (ŝ(E), m̄_0(E), m̄_1(E)), use Equation (19) to obtain the gradient ∇f̃_C(E), and then apply the OCPGA in Algorithm 1 to get a stationary point E* of problem (RP). After the projection and refinement phases described in Section 3.2, the final solution Ê of problem (PP) is achieved. Meanwhile, the greedy algorithm-based method is deterministic, that is, for given (m, p) and ellipsoidal uncertainty sets, its outputs are the same no matter how many times it is invoked. In contrast, since the initial point of the OCPGA in Algorithm 1 is randomly chosen, the outputs of the OCPGA-based method may vary with the initial point. If time permits, the OCPGA-based method can be rerun several times to achieve better performance. Moreover, both the OCPGA-based and greedy algorithm-based methods execute only one refinement.
Fixed-Size Examples: For 8 fixed pairs of small (m, p), we give the proportions in which the OCPGA-based method performs better than, the same as, and worse than the greedy algorithm-based method. For each (m, p), by implementing the two methods on 50 randomly generated ellipsoidal uncertainty sets (only 20 different uncertainty sets for (50, 5), (80, 5), and (100, 5) due to their long runtimes), the corresponding results are listed in Table 1. Here, for each uncertainty set, the OCPGA-based method is run twice and the better result is selected as the output. As shown in Table 1, for each (m, p), the OCPGA-based method performs as well as the greedy algorithm-based method in most cases, while the "better" proportion is more than twice the "worse" proportion. Indeed, simulations show that, for "worse" cases, if we rerun the OCPGA-based method more times, it can perform as well as, or even better than, the greedy algorithm-based method.

Small-Scale Examples: We consider small-scale sensor selection problems, whose globally optimal solutions can be attained by the exhaustive method. Hence, we compare the optimal Chernoff distance obtained by the OCPGA-based method with those of the greedy algorithm-based method and the exhaustive method. Suppose that p = 3, 4, 5 out of m = 10, 12, 15 sensors are chosen. The corresponding outputs of the three methods are given in Table 2, and their runtimes are listed in Table 3. It can easily be seen from Table 2 that the Chernoff distances achieved by the OCPGA-based method are larger than those of the existing greedy algorithm-based method. For the case of (15, 3), the Chernoff distance achieved by the OCPGA-based method is even 100% larger than that of the greedy algorithm-based method. In particular, our proposed OCPGA-based method can attain the same performance as the exhaustive method.
Meanwhile, Table 3 shows that the OCPGA-based method is more efficient than the greedy algorithm-based method, and both possess much shorter runtimes than the exhaustive method, which is consistent with the theoretical computational complexity analysis. A simple computation based on Table 3 shows that the runtime of the OCPGA-based method can be up to 48.72% shorter than that of the greedy algorithm-based method (for the case of (10, 4)). Therefore, our proposed OCPGA-based method performs better in terms of not only the objective value but also the runtime.

Larger-Scale Examples: We now consider larger-scale problems, where m = 50, 80, 100 and p = 5, 10, 15. Since m and p are large, the exhaustive method is no longer tractable, so we only compare the Chernoff distances obtained by our OCPGA-based method with those of the greedy algorithm-based method. The results are exhibited in Table 4, and the corresponding runtimes of the two methods are displayed in Table 5. As we can see from Table 4, the OCPGA-based method attains better solutions than the greedy algorithm-based method. In the case of (50, 5), the Chernoff distance obtained by the OCPGA-based method is 13.14% larger than that of the greedy algorithm-based method. Moreover, Table 5 illustrates that the OCPGA-based method is more efficient than the greedy algorithm-based method, with a runtime up to 42.19% shorter in the case of (50, 5). Compared with the small-scale cases in Table 3, the improvement in runtime in Table 5 is more pronounced, owing to the larger gap between m and p in the larger-scale cases. Hence, for larger-scale cases, the OCPGA-based method also achieves better solutions than the greedy algorithm-based method with shorter runtimes.
ROC Curves for Specific Examples: By Monte Carlo simulations with 200,000 instantiations of the LR tests, we compute the detection probability P_D := 1 − P_M against the false-alarm probability P_F and depict the ROC curves to show the validity of the OCPGA-based method. It is well known that the higher the ROC curve, the better the detection performance. For the case of (m, p) = (10, 3) in Table 2, we display the corresponding ROC curves of the exhaustive method, the greedy algorithm-based method, and the OCPGA-based method. It can be seen from Figure 2 that the OCPGA-based method is superior to the greedy algorithm-based method, while attaining the same performance as the exhaustive method. Similarly, for the case of (m, p) = (50, 5) in Table 4, Figure 3 illustrates that our proposed approach performs better than the existing greedy algorithm-based method.

In summary, compared with the greedy algorithm-based method, the OCPGA-based method not only admits lower theoretical computational complexity but also obtains better solutions with shorter runtimes in numerical simulations.
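The Monte Carlo ROC construction can be illustrated on a minimal scalar example; the means and variance below are made up for illustration and are not the paper's selected-measurement distributions. With equal covariances, the log-likelihood ratio is monotone in the observation, so sweeping a threshold on the observation itself traces the same ROC as thresholding the LR.

```python
import numpy as np

# Illustrative Monte Carlo ROC for a scalar Gaussian LR test:
# N(m0, s^2) under H0 versus N(m1, s^2) under H1 (values assumed).
rng = np.random.default_rng(1)
m0, m1, s, n = 0.0, 1.0, 1.0, 200_000

x0 = rng.normal(m0, s, n)  # instantiations under H0
x1 = rng.normal(m1, s, n)  # instantiations under H1

# Sweep the decision threshold; for equal covariances this is
# equivalent to sweeping the LR-test threshold.
thresholds = np.linspace(-4.0, 5.0, 200)
p_f = np.array([(x0 > t).mean() for t in thresholds])  # false-alarm prob. P_F
p_d = np.array([(x1 > t).mean() for t in thresholds])  # detection prob. P_D = 1 - P_M
```

Plotting `p_d` against `p_f` yields the empirical ROC curve; a curve lying uniformly above another indicates better detection performance, which is the comparison drawn in Figures 2 and 3.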

Conclusions
We address the minimax robust sensor selection problem in binary Gaussian hypothesis testing over a WSN, where the distribution mean vector under each hypothesis drifts in an ellipsoidal uncertainty set. Under a Bayesian optimal criterion, minimizing the maximum overall error probability with respect to the selection matrix is approximately converted to maximizing the minimum Chernoff distance between the distributions of the selected measurements under the null and alternative hypotheses. Then, the gradient of the objective function of the converted maximization problem is computed by Danskin's theorem. Furthermore, we apply the OCPGA to solve the relaxed maximization problem without 0/1 constraints, which obtains a stationary point of the relaxed problem with lower computational complexity than the existing greedy algorithm. Numerical simulations demonstrate that the OCPGA-based method attains better solutions than the greedy algorithm-based method with up to 48.72% shorter runtimes and, for small-scale problems, matches the globally optimal solution obtained by the exhaustive method. Future work may consider cases where the distribution mean lies in other types of uncertainty sets, such as the band model, as well as cases where the distribution covariance is not precisely known.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix: Danskin's Theorem
Let $U$ and $V$ be subsets of a Banach space $\mathcal{U}$ and a topological space $\mathcal{V}$, respectively, and let $J$ be a mapping from $U \times V$ into $\mathbb{R}$. Define $\bar{J}(u) = \sup_{v \in V} J(u, v)$ and $\bar{V}(u) = \{ v \in V \mid J(u, v) = \bar{J}(u) \}$. Moreover, suppose the following conditions hold:
(1) $V$ is compact and, for all $v \in V$, the map $(t, v) \mapsto J(u + th, v)$ is upper semi-continuous at $(0, v)$;
(2) for all $v \in V$ and all $t$ in a right neighborhood of $0$, the directional derivative $D_1 J(u + th, v; h)$ exists and is bounded;
(3) the map $(t, v) \mapsto D_1 J(u + th, v; h)$ is upper semi-continuous at $(0, v)$.
Then the function $\bar{J}$ has a directional derivative at $u$ in the direction $h$, given by the formula
$$ D\bar{J}(u; h) = \max_{v \in \bar{V}(u)} D_1 J(u, v; h). $$
Moreover, if $u \mapsto J(u, v)$ has a Gateaux derivative $J'_u$ and the maximizer is unique, $\bar{V}(u) = \{\bar{v}\}$, then $\bar{J}$ has a Gateaux derivative $\bar{J}'(u)$ given by the simple formula
$$ \bar{J}'(u) = J'_u(u, \bar{v}). $$
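A quick numerical check of Danskin's formula on a toy problem (the function J and set V below are illustrative choices, not from the paper): for J(u, v) = u v with V = [-1, 1], the value function is |u|, the maximizer for u ≠ 0 is sign(u), and Danskin's theorem predicts the derivative sign(u), which a finite difference confirms.

```python
import numpy as np

# Toy instance of Danskin's theorem: J(u, v) = u * v, V = [-1, 1],
# so Jbar(u) = max_v J(u, v) = |u|. For u != 0 the maximizer is
# unique, vbar = sign(u), and Danskin gives Jbar'(u) = vbar.
def Jbar(u, grid=np.linspace(-1.0, 1.0, 2001)):
    # Approximate the max over V by a max over a fine grid of V.
    return np.max(u * grid)

u = 0.7
vbar = np.sign(u)        # unique maximizer for u != 0
danskin_grad = vbar      # d/du J(u, v) = v, evaluated at v = vbar

h = 1e-6
fd_grad = (Jbar(u + h) - Jbar(u - h)) / (2 * h)  # central difference
```

Here `fd_grad` agrees with `danskin_grad`, matching the Gateaux-derivative formula in the unique-maximizer case; this is exactly how the theorem is used in Equation (19), with the inner minimization over the uncertainty sets playing the role of the maximization over V.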