Neighborhood Attribute Reduction: A Multicriterion Strategy Based on Sample Selection

In the rough-set field, the objective of attribute reduction is to regulate the variations of measures by reducing redundant data attributes. However, most previous definitions of attribute reduction were designed around one and only one measure, which means that the obtained reduct may fail to meet the constraints given by other measures. In addition, the widely used heuristic algorithm for computing a reduct needs to scan all samples in the data, so the time consumption may be unacceptably high if the data are large. To alleviate these problems, a framework of attribute reduction based on multiple criteria with sample selection is proposed in this paper. Firstly, cluster centroids are derived from the data, and samples that are far away from the cluster centroids are selected. This step completes the process of sample selection for reducing data size. Secondly, multiple-criteria attribute reduction is designed, and the heuristic algorithm is used over the selected samples to compute a reduct in terms of multiple criteria. Finally, the experimental results over 12 UCI datasets show that the reducts obtained by our framework not only satisfy the constraints given by multiple criteria, but also provide better classification performance and less time consumption.


Introduction
Rough sets [1,2], first proposed by Pawlak, have been demonstrated to be useful in data mining [3,4], artificial intelligence [5], decision analysis [6,7], and so on. As one of the important strategies of feature selection, attribute reduction in rough-set theory plays a key role, since it provides us with clear semantic explanations of the selected attributes. Those semantic explanations can be reflected by constraints in terms of the considered measures, such as approximation quality and conditional entropy. For example, Hu et al. [8] studied uncertainty measures related to fuzzy rough sets, and then further explored approximation quality-based attribute reduction; Dai et al. [9][10][11] investigated attribute reduction with respect to several types of conditional entropies; Wang et al. [12] not only proposed a conditional discrimination index for overcoming the limitations of conditional entropies, but also provided the corresponding approach to attribute reduction.
It must be noted that most of the previous results about attribute reduction are based on the consideration of a single measure. For example, if attribute reduction is designed to preserve the approximation quality, then intuitively it may not perform well in learning tasks. This is mainly because approximation quality is only a measure of uncertainty, which is slightly related to, for example,
Given decision system DS, assume that all decision values in DS are discrete; an indiscernibility relation [22][23][24] $IND_d$ can be defined as:
$$IND_d = \{(x, y) \in U \times U : d(x) = d(y)\}. \tag{1}$$
By $IND_d$, we can obtain a partition over the universe, such that
$$U/IND_d = \{X_1, X_2, \ldots, X_q\}, \tag{2}$$
i.e., universe U is partitioned into q different decision classes. Therefore, ∀$X_i$ ∈ U/$IND_d$, $X_i$ is referred to as the i-th decision class in rough-set theory. The decision class that contains the sample x is denoted by $[x]_d$. The rough-set objective is to approximate the decision classes by the information given by condition attributes. Such information can be represented in the form of information granules [25,26] from the viewpoint of granular computing. For instance, the equivalence class used in traditional rough sets is a typical example of an information granule.
Nevertheless, it should be emphasized that equivalence classes are only suitable for dealing with categorical data, while numerical data [27][28][29] are ubiquitous in real-world applications. To fill such a gap, many different types of information granules have been proposed. As an important information granule used in generalized rough sets, the neighborhood has been widely accepted by researchers. This is mainly because: (1) the construction of a neighborhood is based on a distance that can characterize the similarity between samples with numerical data; (2) different neighborhood scales can be easily obtained by using different radii, so that a multigranularity structure is naturally formed. To make the notion of neighborhood precise, the concept of neighborhood relation is given as follows. Given a decision system DS, ∀A ⊆ AT, suppose that $\Delta_A : U \times U \to \mathbb{R}^+ \cup \{0\}$ is the Euclidean distance function, in which $\mathbb{R}^+$ is the set of positive real numbers; then $\Delta_A(x, y)$ represents the Euclidean distance between samples x and y by using the information over the condition attributes in A. The neighborhood relation is:
$$N_A = \{(x, y) \in U \times U : \Delta_A(x, y) \le \delta\}, \tag{3}$$
in which δ is a given radius such that δ ≥ 0. Based on the neighborhood relation, it is not difficult to obtain the neighborhood of x in terms of A, such that:
$$\delta_A(x) = \{y \in U : \Delta_A(x, y) \le \delta\}. \tag{4}$$

Neighborhood Rough Set and Neighborhood Classifier
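Given a sample x, to avoid the case in which only x itself is in the neighborhood of x, which makes classification difficult, Hu et al. [30] modified the radius such that
$$\delta' = \min_{y \in U \setminus \{x\}} \Delta_A(x, y) + \delta \cdot \Big( \max_{y \in U \setminus \{x\}} \Delta_A(x, y) - \min_{y \in U \setminus \{x\}} \Delta_A(x, y) \Big). \tag{5}$$
Following Equation (5), the modified neighborhood of sample x in terms of A is
$$\delta'_A(x) = \{y \in U : \Delta_A(x, y) \le \delta'\}. \tag{6}$$
Definition 1. Given decision system DS, ∀A ⊆ AT and ∀$X_i$ ∈ U/$IND_d$, the neighborhood lower and upper approximations of $X_i$ in terms of A are defined as
$$\underline{X_i}_A = \{x \in U : \delta'_A(x) \subseteq X_i\}, \tag{7}$$
$$\overline{X_i}_A = \{x \in U : \delta'_A(x) \cap X_i \neq \emptyset\}. \tag{8}$$
The pair $[\underline{X_i}_A, \overline{X_i}_A]$ is referred to as a neighborhood rough set of $X_i$ in terms of A.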
The concept of neighborhood can not only be used to construct a rough set, but can also be applied to design a classifier [31]. Let us consider one of the simplest classifiers, i.e., the K Nearest Neighbors algorithm (KNN), which is effective in many cases. It is a lazy learning method: for a testing sample to be classified, its K nearest neighbors form a neighborhood of that testing sample, and a voting mechanism is used to determine the label of the testing sample based on the real labels of all samples in the neighborhood. For more details about the KNN algorithm, see references [32][33][34].
The main idea of the neighborhood classifier [8] is similar to that of KNN; the difference lies in the fact that the number of neighbors used in a neighborhood classifier is determined by the radius, while the number of neighbors used in KNN is specified by experts. Therefore, different samples may have different numbers of neighbors if a neighborhood classifier is used. The detailed process of the neighborhood classifier [8] is shown in Algorithm 1, as follows.
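The voting idea described above can be illustrated with a minimal Python sketch. It is not a transcription of Algorithm 1: the function name `neighborhood_classify`, the use of NumPy, and the data-dependent radius (minimum distance plus δ times the distance range, in the spirit of the modified radius attributed to Hu et al. [30]) are assumptions made for illustration.

```python
import numpy as np
from collections import Counter

def neighborhood_classify(X_train, y_train, x, delta):
    """Predict the label of x by majority vote among the training samples in its neighborhood."""
    # Euclidean distance from x to every training sample
    dists = np.linalg.norm(X_train - x, axis=1)
    # Assumed radius: min + delta * (max - min), so the nearest sample is always included
    radius = dists.min() + delta * (dists.max() - dists.min())
    neighbors = np.flatnonzero(dists <= radius)
    # The number of voters depends on the radius, not on a fixed K
    votes = Counter(y_train[neighbors].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]])
y_train = np.array([0, 0, 1, 1])
label_a = neighborhood_classify(X_train, y_train, np.array([0.05, 0.0]), delta=0.1)
label_b = neighborhood_classify(X_train, y_train, np.array([1.0, 0.9]), delta=0.1)
```

Unlike KNN, the two test points here may receive different numbers of voters, since the neighborhood size follows from the radius.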

Measures
Approximation quality is frequently used to evaluate the certainty of belongingness in rough-set theory. In a neighborhood rough set, the formal definition is as follows:
Definition 2. Given decision system DS, ∀A ⊆ AT, the approximation quality of d related to A is defined as
$$\gamma(A, d) = \frac{\left| \bigcup_{i=1}^{q} \underline{X_i}_A \right|}{|U|}. \tag{9}$$
This reflects the percentage of samples that belong to one of the decision classes with certainty, following the semantic explanation of the lower approximation. Obviously, 0 ≤ γ(A, d) ≤ 1 holds.
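As a rough illustration of Definition 2, the following Python sketch counts the samples whose neighborhood falls entirely inside their own decision class, which is the same as counting the members of the union of the lower approximations. The function names, the NumPy usage, and the concrete form of the per-sample radius (minimum distance to the other samples plus δ times the distance range) are assumptions for this sketch, not definitions taken from the text.

```python
import numpy as np

def neighborhoods(X, attrs, delta):
    """For each sample, return the indices of the samples in its neighborhood over attrs."""
    Z = X[:, attrs]
    # Pairwise Euclidean distances restricted to the chosen attributes
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    hoods = []
    for i in range(len(Z)):
        others = np.delete(D[i], i)  # distances to all samples except i itself
        # Assumed per-sample radius: min + delta * (max - min)
        radius = others.min() + delta * (others.max() - others.min())
        hoods.append(np.flatnonzero(D[i] <= radius))
    return hoods

def approximation_quality(X, y, attrs, delta):
    """Fraction of samples whose neighborhood lies inside their own decision class."""
    hoods = neighborhoods(X, attrs, delta)
    consistent = sum(bool(np.all(y[h] == y[i])) for i, h in enumerate(hoods))
    return consistent / len(y)

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y_separated = np.array([0, 0, 0, 1, 1, 1])
y_mixed = np.array([0, 1, 0, 1, 0, 1])
gamma_separated = approximation_quality(X, y_separated, [0], delta=0.1)
gamma_mixed = approximation_quality(X, y_mixed, [0], delta=0.1)
```

Because the radius is recomputed from the distances over `attrs`, dropping attributes changes the neighborhoods themselves, which is why the measure need not be monotonic in the attribute set.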

Remark 1. Note that ∀A ⊆ AT, γ(A, d) ≤ γ(AT, d) does not always hold; the reason is that, if some condition attributes are eliminated from AT, then the value of the radius obtained by Equation (5) changes.
Example 1. In the decision system shown in Table 1, $U = \{x_1, x_2, \ldots, x_{10}\}$ is the set of samples, $AT = \{a_1, a_2, \ldots, a_7\}$ is the set of condition attributes, and d is the decision attribute.
The above results tell us that γ(A, d) ≤ γ(AT, d) does not always hold if A ⊆ AT.
Conditional entropy is another widely accepted measure, and it is an effective tool for characterizing the distinguishing ability in a decision system. The lower the value of conditional entropy, the greater the ability of the condition attributes to distinguish samples. Such discrimination can be considered as a type of uncertainty. Presently, many definitions of conditional entropy have been proposed in terms of different requirements [9][10][11][35][36][37]. A typical representation of conditional entropy is shown in Definition 3.

Definition 3 ([30]). Given decision system DS, ∀A ⊆ AT, the conditional entropy of d related to A is defined as:
Note that ∀A ⊆ AT, ENT(A, d) ≥ ENT(AT, d) does not always hold. This is because not only does the monotonic property of Equation (10) not hold [19], but the value of the radius obtained from Equation (5) also changes if some condition attributes are eliminated from AT.
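A minimal Python sketch of a neighborhood-based conditional entropy follows. It assumes one form that appears in the neighborhood rough-set literature, ENT(A, d) = −(1/|U|) Σ_x |δ_A(x) ∩ [x]_d| · log(|δ_A(x) ∩ [x]_d| / |δ_A(x)|); whether this coincides with Equation (10) is an assumption, as are the function name, the natural logarithm, and the use of a fixed radius instead of the modified radius of Equation (5).

```python
import numpy as np

def conditional_entropy(X, y, attrs, delta):
    """Assumed neighborhood conditional entropy of the decision y given attributes attrs."""
    Z = X[:, attrs]
    D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    total = 0.0
    for i in range(len(Z)):
        # Fixed-radius neighborhood (simplification); it always contains sample i itself
        hood = np.flatnonzero(D[i] <= delta)
        # Neighbors that share the decision class of sample i
        same = np.count_nonzero(y[hood] == y[i])
        total += same * np.log(same / len(hood))
    return -total / len(Z)

X = np.array([[0.0], [0.1], [1.0], [1.1]])
ent_pure = conditional_entropy(X, np.array([0, 0, 1, 1]), [0], delta=0.2)
ent_mixed = conditional_entropy(X, np.array([0, 1, 0, 1]), [0], delta=0.2)
```

Under this assumed form, neighborhoods that are pure with respect to the decision contribute nothing, so lower values indicate a stronger distinguishing ability.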

Attribute Reduction
Attribute reduction is one of the key topics in rough-set theory [38]. Generally speaking, the purpose of attribute reduction is to delete redundant attributes subject to some given constraints [39]. These constraints can be constructed from well-known measures such as approximation quality and conditional entropy. Many different definitions of attribute reduction have been proposed with different measures or requirements. Dai et al. [9] proposed extended conditional entropy in interval-valued decision systems and designed the corresponding definitions and algorithms. Yao et al. [7] addressed different measures, such as confidence, coverage, generality, cost, and decision monotonicity, based on decision-theoretic rough-set models. Jia et al. [40] compared the most popular definitions and then proposed a generalized attribute reduction that considers not only the data but also users' preferences. For more details about definitions of attribute reduction, see references [9,41,42,43].
Different from Pawlak's [1,2] traditional definition of attribute reduction for preserving approximation quality, the constraint used in Definition 4 only requires that the approximation quality not be decreased. The reason is shown in Remark 1: the approximation quality is not strictly monotonic with respect to variations of condition attributes. The case of conditional entropy is similar to that of approximation quality. If γ(A ∪ {a}, d) ≤ γ(A, d), then a is redundant; in other words, attribute a makes no contribution to the increase of approximation quality. If γ(A ∪ {a}, d) > γ(A, d), then a can be considered as a member of the reduct set. Similarly, it is trivial to present the semantic explanation of redundant attributes in terms of the conditional-entropy reduct.
Therefore, the significances of attributes in terms of the two different reducts shown in Definition 4 can be defined as follows:
$$Sig_{\gamma}(a_i, A, d) = \gamma(A \cup \{a_i\}, d) - \gamma(A, d), \tag{11}$$
$$Sig_{ENT}(a_i, A, d) = ENT(A, d) - ENT(A \cup \{a_i\}, d). \tag{12}$$
In a decision system, both significances satisfy that the higher the value is, the more important the condition attribute $a_i$ is. Given these significances, algorithms for finding a reduct can be designed directly. Up to now, many algorithms have been proposed to obtain reducts. Considering time efficiency, the forward greedy search strategy has become a common choice. This kind of algorithm starts from an empty set and, in each iteration, adds the attribute with the maximum significance to the candidate attribute subset [44] until the constraint is satisfied. This kind of approach is frequently referred to as the heuristic algorithm.
Take the approximation quality reduct (AQR) as an example; the reduct aims to derive a subset of condition attributes that does not decrease the value of approximation quality. The detailed process of the heuristic algorithm for finding such a reduct is shown in Algorithm 2.
Similarly, a Conditional Entropy Reduct (CER) can be computed. Conditional entropy characterizes the distinguishing information of a subset of attributes: the lower the value of the conditional entropy, the greater the distinguishing ability of the attribute set. Accordingly, the termination condition in Step 3 of Algorithm 2 is replaced by "ENT(A, d) ≤ ENT(AT, d)", and the significance of an attribute in Step 3(1) is replaced by Equation (12), i.e., $Sig_{ENT}(a_i, A, d)$; the attribute that has the maximum significance is then selected.
The time complexity of computing the neighborhood relation is $O(|U|^2)$, in which |U| is the number of samples in a dataset. In the worst case, all |AT| attributes must be added into the reduct, i.e., no attribute is redundant; then, Step 3 in Algorithm 2 is executed |AT| times. In the i-th iteration, |AT| − i + 1 candidate attributes are evaluated. Finally, the time complexity of AQR is $O(|U|^2 \times |AT|^2)$. Similarly, the time complexity of CER is also $O(|U|^2 \times |AT|^2)$.
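The forward greedy search described in this section can be sketched in Python as follows. This is a generic illustration rather than a transcription of Algorithm 2: the function `greedy_reduct`, its `measure` callback, and the `higher_is_better` flag are hypothetical names, and the toy measure in the usage lines stands in for approximation quality or conditional entropy.

```python
def greedy_reduct(attributes, measure, higher_is_better=True):
    """Forward greedy search: add the most significant attribute until the constraint holds."""
    sign = 1.0 if higher_is_better else -1.0

    def score(subset):
        # Flip the sign for measures such as conditional entropy, where lower is better
        return sign * measure(frozenset(subset))

    target = score(attributes)  # value of the measure over all raw attributes
    reduct, remaining = [], list(attributes)
    while remaining and (not reduct or score(reduct) < target):
        # Maximizing score(reduct + [a]) is the same as maximizing the significance,
        # because score(reduct) is a constant within one iteration
        best = max(remaining, key=lambda a: score(reduct + [a]))
        reduct.append(best)
        remaining.remove(best)
    return reduct

def toy_measure(subset):
    """Hypothetical measure: only a2 and a3 carry information."""
    return len(subset & {"a2", "a3"}) / 2

raw = ["a1", "a2", "a3", "a4"]
reduct_max = greedy_reduct(raw, toy_measure)
reduct_min = greedy_reduct(raw, lambda s: 1 - toy_measure(s), higher_is_better=False)
```

Each iteration evaluates every remaining attribute once, which matches the |AT| − i + 1 evaluations counted in the complexity analysis above.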

Limitations of Single Measure
The above algorithm shows a complete process of computing a reduct that is determined by a single measure, i.e., either approximation quality or conditional entropy. However, the derived reduct may fail to meet the constraints under multiple criteria. The following example illustrates this.
Example 2. In the decision system shown in Table 1, suppose that δ = 0.15; then γ(AT, d) = 0.1000 and ENT(AT, d) = 0.6879, both of which are obtained from the raw attributes.
Furthermore, by Definition 4 and the heuristic process, the obtained approximation-quality reduct is $A_1 = \{a_1, a_2, a_3\}$, and the obtained conditional-entropy reduct is $A_2 = \{a_1, a_2, a_3, a_4, a_5\}$.
If the approximation-quality reduct is selected, then $\gamma(A_1, d) = 0.1000$ and $ENT(A_1, d) = 0.7860$. The value of approximation quality is maintained, while the value of conditional entropy increases from 0.6879 to 0.7860. In other words, though the constraint based on approximation quality is satisfied by $A_1$, this subset of attributes does not meet the constraint in terms of conditional entropy.
If the conditional-entropy reduct is selected, then $\gamma(A_2, d) = 0.0000$ and $ENT(A_2, d) = 0.6762$. The value of conditional entropy decreases from 0.6879 to 0.6762, while the value of approximation quality also decreases, from 0.1000 to 0.0000. In other words, the constraint defined by approximation quality cannot be guaranteed if the conditional-entropy reduct is used.
The above example tells us that a reduct in terms of a single measure may not meet the constraints in terms of multiple criteria. Therefore, to alleviate this problem, we propose a multiple-criteria attribute-reduction strategy that considers the evaluations of both approximation quality and conditional entropy.
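The comparison in Example 2 can be checked mechanically. The short Python sketch below uses the values reported in the example and tests the two constraints used in this section, γ(A, d) ≥ γ(AT, d) and ENT(A, d) ≤ ENT(AT, d); the helper name `satisfied_constraints` is hypothetical.

```python
def satisfied_constraints(gamma_a, ent_a, gamma_at, ent_at):
    """Report which single-measure constraints a candidate subset A satisfies."""
    return {
        "approximation quality": gamma_a >= gamma_at,  # quality must not decrease
        "conditional entropy": ent_a <= ent_at,        # entropy must not increase
    }

GAMMA_AT, ENT_AT = 0.1000, 0.6879  # values over the raw attributes in Example 2

check_a1 = satisfied_constraints(0.1000, 0.7860, GAMMA_AT, ENT_AT)  # A1, quality reduct
check_a2 = satisfied_constraints(0.0000, 0.6762, GAMMA_AT, ENT_AT)  # A2, entropy reduct
```

Each single-measure reduct passes exactly one of the two checks, which is the gap the multiple-criteria reduct is meant to close.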

Multiple-Criteria Reduct
Since a single-criterion reduct may fail to meet the constraints given by other measures, the multiple-criteria framework [45] can be a solution. The definition of a multiple-criteria reduct is presented as follows:
Definition 5. Given decision system DS, if A ⊆ AT, A is the multiple-criteria reduct if and only if:
Different from the approximation-quality and conditional-entropy reducts shown in Definition 4, the multiple-criteria reduct shown in Definition 5 is defined by considering the constraints given by both approximation quality and conditional entropy.
Algorithm 3 presents a heuristic process to compute our multiple-criteria reduct. It should be emphasized that, to derive attribute significance, Equations (11) and (12) should be used.
In Step 3, m and n represent the locations of the attributes that have maximal significance in terms of approximation quality and conditional entropy, respectively. In Step 3(2), if m and n are the same, then there is no conflict in voting. Otherwise, the two attributes are in conflict, which means that the maximal significances computed by the measures of approximation quality and conditional entropy are derived from different attributes. Then, a mechanism for selection is required. In this case, we select one attribute without considering the measures, mainly because approximation quality and conditional entropy take the same weight in our algorithm. To make the algorithm more stable, the attribute that ranks lower in the order of the raw attributes is selected instead of a random one. Such thinking is similar to what has been addressed in reference [21]. Similar to AQR, the time complexity of MCR is $O(|U|^2 \times |AT|^2)$, where |U| represents the number of samples in the decision system DS, and |AT| represents the number of condition attributes. However, MCR may spend more time on computing a reduct, because MCR must compute two different types of significance in each iteration.

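Algorithm 3 Multiple-Criteria Reduct (MCR)
∀$a_i$ ∈ AT − A, compute $Sig_{\gamma}(a_i, A, d)$ and $Sig_{ENT}(a_i, A, d)$, select $a_m$ and $a_n$ such that
$$Sig_{\gamma}(a_m, A, d) = \max\{Sig_{\gamma}(a_i, A, d) : a_i \in AT - A\}, \quad Sig_{ENT}(a_n, A, d) = \max\{Sig_{ENT}(a_i, A, d) : a_i \in AT - A\}.$$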
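The selection and conflict-resolution step of the multiple-criteria heuristic can be sketched in Python as below. It is a reading of the prose description, not a transcription of Algorithm 3. Several points are assumptions: the function name `mcr_reduct`, the callbacks `gamma` and `ent`, the termination test (both constraints over the raw attributes must hold), and the interpretation of "ranks lower in the order of the raw attributes" as "appears earlier in the raw order".

```python
def mcr_reduct(attributes, gamma, ent):
    """Greedy multiple-criteria search with a deterministic tie-break on conflicts."""
    order = list(attributes)
    target_gamma, target_ent = gamma(order), ent(order)
    reduct, remaining = [], list(order)
    while remaining:
        # Assumed stopping rule: quality not decreased and entropy not increased
        if reduct and gamma(reduct) >= target_gamma and ent(reduct) <= target_ent:
            break
        # a_m: best attribute for approximation quality (maximal significance)
        m = max(remaining, key=lambda a: gamma(reduct + [a]))
        # a_n: best attribute for conditional entropy (lowest resulting entropy)
        n = min(remaining, key=lambda a: ent(reduct + [a]))
        # On conflict, take the attribute that comes first in the raw order (assumption)
        chosen = m if m == n else min(m, n, key=order.index)
        reduct.append(chosen)
        remaining.remove(chosen)
    return reduct

# Hypothetical toy measures over attribute indices 0..3
toy_gamma = lambda subset: len(set(subset) & {0, 2}) / 2
toy_ent = lambda subset: 1 - len(set(subset) & {1, 2}) / 2
mcr_result = mcr_reduct([0, 1, 2, 3], toy_gamma, toy_ent)
```

A fixed tie-break keeps repeated runs reproducible, which a random choice between the two conflicting attributes would not.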

Multiple-Criteria Reduct with Sample Selection
Obviously, the heuristic algorithm for computing the reduct is still based on scanning all samples in the data. To further improve the time efficiency of the algorithms shown in Sections 3.1 and 3.3, reducing the number of samples may be a feasible solution.
In the field of machine learning and feature selection [46,47], the technique of sample selection has been widely used, and many previous results [20,48,49,50] have shown it to be a useful method. Wilson et al. [48] provided a survey of previous algorithms, and proposed six additional reduction algorithms that can be used to remove samples from the concept description. Brighton et al. [49] pointed out that internal samples positioned away from class boundaries have little or no effect on classification accuracy; on the contrary, samples that lie close to class boundaries hold more information for accurately describing the decision surface. Nikolaidis et al. [50] proposed the Class Boundary Preserving Algorithm (CBP). CBP divides all data into two sets, referred to as the internal-sample set and the boundary-sample set, and focuses more on the boundary samples. Xu et al. [20] further expanded the selection of boundary samples and introduced it into multilabel datasets. From the above analyses, we can find that the samples in the boundary region are more important than other samples. We therefore propose an algorithm to compute a multiple-criteria reduct by using boundary samples instead of all samples in the data.
First of all, we use a K-means clustering algorithm to choose K cluster centroids [51][52][53][54]. This process is executed M times because the result of K-means clustering is not stable. Secondly, we compute the average cluster centroids from those results. Finally, we select the samples that are far away from the average cluster centroids, and construct a new decision system. These samples are selected mainly because: (1) they are more difficult to classify correctly, whereas samples nearer to the average cluster centroids tend to be closer to each other, making it easy for them to be classified correctly; (2) they sometimes fail to be assigned to the lower approximation set, while the samples closer to the average cluster centroids tend to be in the lower approximation set. Therefore, in order to improve classification performance and reduce the time consumption with the neighborhood rough-set model, we apply boundary samples instead of all samples. To judge whether a sample is far away from the cluster centroid, we use Definition 6, as follows:
Definition 6. Given a cluster $C_j$, let $C^*_j$ be the cluster centroid of $C_j$, let $dist(x, C^*_j)$ denote the distance between x ∈ $C_j$ and $C^*_j$, and let the average distance between all samples in $C_j$ and $C^*_j$ be
$$\overline{dist}(C_j) = \frac{1}{|C_j|} \sum_{x \in C_j} dist(x, C^*_j).$$
If $dist(x, C^*_j) \ge \overline{dist}(C_j)$, then x is referred to as a sample that is far away from the cluster centroid $C^*_j$; such a sample is selected for constructing the new decision system.
With all boundary samples selected, a new decision system DS′ can be constructed. Obviously, the size of the data in DS′ is smaller than that in decision system DS. From this point of view, the time consumption of computing the reduct may be reduced. Algorithm 4 shows the heuristic process to compute a multiple-criteria reduct by using sample selection.
Algorithm 4 Multiple-Criteria Reduct with Sample Selection (MCRSS)
...
2. For r = 1 to M
     Execute the K-means clustering algorithm over DS, obtain clusters $C^r = \{C^r_1, C^r_2, \ldots, C^r_K\}$; // In K-means clustering, K is the number of decision classes;
   End For
3. For j = 1 to K
     Obtain the j-th average cluster centroid
...
   compute $Sig_{\gamma}(a_i, A, d)$ and $Sig_{ENT}(a_i, A, d)$, select $a_m$ and $a_n$ such that
...
The first four steps show the process of sample selection, i.e., the process of constructing a new decision system. In Step 1, the universe of the new decision system is initialized. In the following two steps, a K-means clustering algorithm is executed M times, and the average cluster centroids are obtained. In Step 4, samples that are far away from the average cluster centroids are selected, and the new decision system is constructed. The last three steps are used to compute a multiple-criteria reduct over the new decision system. The time complexity of MCRSS is $O(|U'|^2 \times |AT|^2 + K \times |U| \times M)$, where |U′| represents the number of samples in the new decision system DS′; both K and M are constants. It must be noted that |U′| < |U|.
The time complexity of MCR is $O(|U|^2 \times |AT|^2)$, so we compare the time complexities of MCR and MCRSS. Generally speaking, $K \times M \le (|U| - |U'|^2/|U|) \times |AT|^2$ holds for most data, because K and M are constants that are much smaller than the number of samples.
Therefore, it is trivial to show that $O(|U|^2 \times |AT|^2) \ge O(|U'|^2 \times |AT|^2 + K \times |U| \times M)$ holds; in other words, the time consumption of MCRSS is less than that of MCR.
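The step from the condition on K × M to the comparison of the two complexities can be written out explicitly. The derivation below only rearranges the two cost expressions given above and introduces no new quantities.

```latex
\begin{align*}
|U'|^2 \times |AT|^2 + K \times |U| \times M &\le |U|^2 \times |AT|^2 \\
\iff\quad K \times |U| \times M &\le \left(|U|^2 - |U'|^2\right) \times |AT|^2 \\
\iff\quad K \times M &\le \left(|U| - \frac{|U'|^2}{|U|}\right) \times |AT|^2
\end{align*}
```

As a purely hypothetical numerical illustration, with |U| = 1000, |U′| = 500, |AT| = 10, K = 3, and M = 5, the left-hand cost is 500² × 10² + 3 × 1000 × 5 = 25,015,000, against 1000² × 10² = 100,000,000 for MCR.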
The sample-selection strategy shown in Algorithm 4 can also be used in computing approximation-quality and conditional-entropy reducts. It is straightforward to design two algorithms: Approximation-Quality Reduct with Sample Selection (AQRSS) and Conditional-Entropy Reduct with Sample Selection (CERSS). The time complexities of AQRSS and CERSS are also $O(|U'|^2 \times |AT|^2 + K \times |U| \times M)$.

Experimental Analysis
To validate the effectiveness of MCRSS proposed in this paper, 12 UCI datasets were collected to conduct the experiments. The basic descriptions of the datasets are shown in Table 2. All experiments were carried out on a personal computer with Windows 7, a dual-core 1.50 GHz CPU, and 8 GB memory. The programming language was MATLAB R2016a.
Table 2
ID   Dataset                         Samples   Attributes   Decision classes
9    Statlog (Vehicle Silhouettes)   846       18           4
10   Vertebral Column                310       6            2
11   Wine                            178       13           3
12   Yeast                           1484      8            10
In the following, five-fold cross-validation (5-CV) was adopted. In other words, we divided each set of data into five parts of the same size, denoted by $U_1, \ldots, U_5$, such that $U = U_1 \cup \cdots \cup U_5$: for each round of computation, 80% of the samples in the data were regarded as the training samples for computing reducts, and the rest were considered as the test samples for computing measures by the attributes in reducts. Furthermore, in this experiment, M = 5, i.e., the K-means clustering was executed five times to generate the average cluster centroids. Ten different values of δ, namely 0.03, 0.06, ..., 0.30, were also selected.

Comparisons of Approximation Qualities
In Figure 1, we can observe the following:
1. As the value of δ increases, the approximation qualities of the three different reducts tend to decrease, though the decrease is not necessarily monotonic.
2. Compared with AQRSS, MCRSS can preserve or even increase approximation qualities. This is mainly because the constraint designed by the measure of approximation quality is also considered in MCRSS. Take, for instance, the "Ionosphere" dataset; if δ = 0.12, then the approximation qualities derived by MCRSS and AQRSS are 0.6049 and 0.4644, respectively.
3. An interesting observation is that the approximation qualities obtained by CERSS may be greater than those obtained by AQRSS in some datasets. Take, for instance, the "Dermatology" dataset; if δ = 0.06, then the approximation qualities derived by MCRSS, AQRSS, and CERSS are 0.9333, 0.8606, and 0.9151, respectively. Such results tell us that AQRSS does not always derive the highest approximation qualities.

Comparisons of Conditional Entropies
In Figure 2, we can observe the following:
1. As the value of δ increases, the conditional entropies of the three different reducts tend to increase, though the increase is not strictly monotonic.
2. In most cases, there are only slight differences between the conditional entropies generated by MCRSS and CERSS, which can be attributed to the fact that the constraint designed by the measure of conditional entropy is also considered in MCRSS. Take, for instance, the "Breast Tissue" dataset; if δ = 0.15, then the conditional entropies derived by MCRSS and CERSS are 0.6127 and 0.6599, respectively.
3. In most cases, the conditional entropies obtained by AQRSS are greater than those derived by both CERSS and MCRSS. This observation demonstrates that, if we only pay attention to the single measure of approximation quality, the obtained reduct may not be effective in terms of conditional entropy. Take, for instance, the "Forest-Type Mapping" dataset; if δ = 0.21, then the conditional entropies derived by MCRSS, CERSS, and AQRSS are 0.4744, 0.5507, and 0.8951, respectively.

Comparisons of Classification Accuracies
In the following, the neighborhood classifier was used to measure the classification performances of the reducts derived from the three different algorithms. The detailed results are shown in Figure 3. In Figure 3, we can observe that the classification accuracies obtained by MCRSS are greater than those obtained by AQRSS and CERSS. Take the "Statlog (Vehicle Silhouettes)" dataset as an example; if δ = 0.06, then the classification accuracies derived by MCRSS, AQRSS, and CERSS are 0.6205, 0.4352, and 0.6047, respectively. Such results tell us that the MCRSS algorithm provides better classification performance with the use of a neighborhood classifier.
Figure 4 shows the reduct lengths derived from the three different algorithms. In Figure 4, we can observe that the reduct lengths obtained by considering multiple criteria are greater than the lengths of the reducts obtained by a single measure (approximation quality or conditional entropy). This is mainly because the multiple-criteria reduct in this experiment considers two measures, so its constraint is stricter than the constraint defined by only one measure.

Comparisons of Time Consumptions
In the following, we compare the time consumption of several algorithms, AQRSS, CERSS, MCRSS, and MCR, in generating reducts. AQRSS and CERSS are used to find the approximation-quality reduct and conditional-entropy reduct with sample selection, respectively; MCRSS and MCR are used to find multiple-criteria reducts with and without sample selection, respectively. The detailed results are shown in Figure 5.
The following conclusions can be obtained from

Comparisons of Core Attributes
In the following, we compare the three algorithms, AQRSS, CERSS, and MCRSS, from the viewpoint of core attributes. For readers' convenience, we only display the core attributes for one fixed radius, δ = 0.15. We use boundary samples to compute the core attributes [55,56], and the process is similar to the algorithm proposed by Wang et al. [56]. We remove one attribute at a time from the raw attributes (AT) and check whether the subset made up of the remaining attributes fails to satisfy the constraints in the definition. Take the measure of approximation quality (AQRSS) as an example, with $AT = \{a_1, a_2, \ldots, a_n\}$: (1) remove $a_1$ in the first round, so that the remaining attributes form the subset $A = \{a_2, a_3, \ldots, a_n\}$; (2) compute γ(AT, d) and γ(A, d) in the new decision system DS′; (3) if γ(A, d) < γ(AT, d), then $a_1$ is a member of the core set. Then $a_2$ is removed in the second round, and so on, until $a_n$ is removed in the n-th round. Similar procedures are used to compute the core sets for conditional entropy (CERSS) and multiple criteria (MCRSS).
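The leave-one-out procedure for core attributes can be sketched in Python as follows. The function `core_attributes` and its `measure` callback are hypothetical names, the toy measures in the usage lines are not taken from the experiments, and the `higher_is_better=False` branch (an attribute is core if removing it raises the measure) is an assumed analogue for conditional entropy.

```python
def core_attributes(attributes, measure, higher_is_better=True):
    """Return the attributes whose removal makes the measure worse than over all attributes."""
    full_value = measure(list(attributes))
    core = []
    for a in attributes:
        # Remove exactly one attribute and re-evaluate the measure on the rest
        rest = [b for b in attributes if b != a]
        value = measure(rest)
        worse = value < full_value if higher_is_better else value > full_value
        if worse:
            core.append(a)
    return core

# Hypothetical toy measures over attribute indices 1..4
toy_quality = lambda subset: 1.0 if {1, 3} <= set(subset) else 0.5
toy_entropy = lambda subset: 0.0 if 2 in subset else 1.0

core_quality = core_attributes([1, 2, 3, 4], toy_quality)
core_entropy = core_attributes([1, 2, 3, 4], toy_entropy, higher_is_better=False)
```

The procedure calls the measure once per attribute plus once for the full set, so its cost is linear in |AT| evaluations of the measure.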
To compare the results of these three algorithms (AQRSS, CERSS, MCRSS), the attributes are listed by their order. When several consecutive attributes are core attributes, they are listed as a range with the symbol "-". Take the "Hayes Roth" dataset (ID: 6) as an example; the core attributes in terms of the three algorithms are all $\{a_1, a_2, a_3, a_4\}$, which is presented as "1-4" in Table 3. A careful investigation of Table 3 shows that, in general, the core of MCRSS is the union of the cores of AQRSS and CERSS. Take the "Forest Type Mapping" dataset (ID: 5) as an example; the core attributes of AQRSS and CERSS are $\{a_1, a_6, a_7, a_{16}, a_{22}\}$ and $\{a_1, a_2, a_4, a_6, a_{15}, a_{16}, a_{22}, a_{23}, a_{25}\}$, respectively, and the core attributes of MCRSS are $\{a_1, a_2, a_4, a_6, a_7, a_{15}, a_{16}, a_{22}, a_{23}, a_{25}\}$, which is the union of the core attributes of AQRSS and CERSS. In the "Dermatology" dataset (ID: 4), this union relation does not hold, since the core attributes of the three algorithms are $\{a_5, a_{15}\}$, $\{a_5, a_{15}, a_{21}\}$, and $\{a_5, a_9, a_{15}, a_{21}, a_{22}\}$, respectively. More information is shown as follows. Given δ = 0.15, the results in Table 4 were obtained from all the data without using sample selection, and 5-CV was also applied to compute the mean values of the three measures (approximation quality, conditional entropy, and classification accuracy). After a careful investigation of Table 4, it can be seen that MCRSS improves approximation quality and classification accuracy, and it also reduces the conditional entropy.
1. In the comparisons of these three measures, the largest values of approximation quality (classification accuracy) are in bold, and the smallest values of conditional entropy are underlined. It should be emphasized that in Datasets 6 and 10, the values of these three measures are the same. This is mainly because the cores of the three algorithms are the same, which can be seen from Table 3.
2. The results in Table 4 are generally consistent with the results shown above (Figures 1-3), which are obtained from the datasets with sample selection. The reducts obtained by MCRSS can not only preserve approximation quality (Figure 1) and reduce conditional entropy (Figure 2), but also improve classification accuracy (Figure 3).
3. We can find that conditional entropy and approximation quality both play an important role in improving performance. The measure of conditional entropy may contribute a little more to improving classification accuracy. The fact that most of the values derived from CERSS and MCRSS are the same suggests that the constraint of conditional entropy is more helpful in improving classification accuracy.

Conclusions and Future Perspectives
In this paper, a framework of a multiple-criteria reduct with sample selection has been proposed. Different from the traditional attribute-reduction algorithm that only uses one measure, our algorithm is executed based on multiple criteria, which include approximation quality and conditional entropy. Experimental results show that the reduct computed by our algorithm can not only increase approximation quality and preserve conditional entropy, but also provide better classification performance. Since we also applied boundary samples instead of all samples in the data, our algorithm needed less time to find reducts.
The following topics merit further investigation:
1. Only two measures have been used to design multiple criteria; some other measures, such as classification accuracy [57] and neighborhood discrimination index [12], will be further added into the construction of multiple criteria.
2. Multiple-criteria attribute reduction is realized by a neighborhood rough set; it can also be introduced into other rough-set models, such as a fuzzy rough set [19] and a decision-theoretic rough set [58].
3. Attribute reduction can be considered as the first step of data processing, and classification performances in terms of different classifiers [59,60] based on our reducts will be further explored.
Funding: This research was funded by the Natural Science Foundation of China (Nos. 61572242, 61502211, 61503160).

Conflicts of Interest:
The authors declare no conflict of interest.