Sample Reduction Strategies for Protein Secondary Structure Prediction

Predicting the secondary structure from a protein sequence plays a crucial role in estimating the 3D structure, which has applications in drug design and in understanding the function of proteins. As new genes and proteins are discovered, the size of the protein databases and datasets that can be used for training prediction models grows considerably. A two-stage hybrid classifier, which employs dynamic Bayesian networks and a support vector machine (SVM), has been shown to provide state-of-the-art prediction accuracy for protein secondary structure prediction. However, the SVM is not efficient for large datasets due to the quadratic optimization involved in model training. In this paper, two techniques are implemented on the CB513 benchmark for reducing the number of samples in the train set of the SVM. The first method randomly selects a fraction of data samples from the train set using a stratified selection strategy. This approach can remove approximately 50% of the data samples from the train set and reduce the model training time by 73.38% on average without decreasing the prediction accuracy significantly. The second method clusters the feature vectors of the data samples with a hierarchical clustering algorithm and replaces the train set samples with the nearest neighbors of the cluster centers in order to improve the training time, where the number of clusters and the number of nearest neighbors are optimized as hyper-parameters by computing the prediction accuracy on validation sets. It is found that clustering can reduce the size of the train set by 26% without reducing the prediction accuracy. Among the clustering techniques considered, Ward's method provided the best accuracy on test data.


Introduction
The four different levels of protein structure are known as the primary, secondary, tertiary and quaternary structure. The primary structure is the sequence of amino acids, linked by peptide bonds, that makes up the protein. The secondary structure is the local conformation of amino acids formed through hydrogen bonding interactions into regular structures. The three common types of secondary structure are α-helices, β-sheets and coils (or loops). Secondary structure elements and motifs come together to form the tertiary structure, which is the global three-dimensional structure of an amino acid chain or a domain within a protein. Finally, quaternary structure refers to multiple chains uniting via chemical bonds to operate as a single functional unit [1,2].
There are millions of amino acid sequences in protein databases and it is essential to annotate them according to their structural and functional roles [3]. For instance, predicting one-dimensional properties of proteins, such as secondary structure and solvent accessibility, plays a crucial role in predicting the 3D structure and understanding the function of proteins [4,5]. Several classification methods have been proposed in the literature for this purpose, such as neural networks [6,7], support vector machines [8], dynamic Bayesian networks [9] and hybrid methods that combine different classifiers [9,10]. To date, most of the research efforts in this field have concentrated on developing advanced prediction methods. In the meantime, as new genes and proteins are discovered, the size of the protein databases and datasets that can be used for training prediction models grows considerably. Therefore, efficient algorithms and/or data reduction strategies should be developed that can circumvent the computational cost caused by big data conditions while incorporating the useful information into prediction models. Though there are methods for reducing the number of features (i.e., dimensions) of the classifiers by employing feature selection or dimension reduction techniques [11], to the best of our knowledge, there is no work in the literature on reducing the number of train set samples using techniques such as sampling and clustering for predicting one-dimensional structural properties of proteins. Recently, a new database called UniClust has been introduced that is derived by clustering millions of proteins [12]. This database serves as the sequence database of the HHblits method [13], which aligns a query protein against the amino acid sequences in the database.
There are also other databases introduced earlier, such as SCOP [14] and PFAM [15], that organize proteins hierarchically into multiple levels (e.g., family, superfamily, fold, class, domain or clan). Among those, SCOP assigns proteins to families based on multiple criteria and using clustering. However, none of these databases and clustering approaches are employed directly to reduce the size of the train set of a machine learning classifier for predicting structural properties of proteins.
In this paper, the DSPRED method is employed to predict the secondary structure of proteins, which is a two-stage hybrid classifier that combines dynamic Bayesian networks and a support vector machine (SVM). SVMs are known to be effective for combining heterogeneous input features as in DSPRED, which employs PSSM features as well as features in the form of probability distributions (see Section 3.4). It has been shown in Aydin et al. that replacing the SVM with other standard classifiers did not improve the accuracy of DSPRED [16]. One drawback of the SVM is its high computational complexity in model training, which can be prohibitive for large datasets [17]. To address this problem, different approaches have been proposed in the literature, such as stratified sampling [18], random selection [19], clustering analysis [19], de-clustering [20] and Learning Vector Quantization (LVQ) neural networks [21]. In the present study, random stratified sampling and clustering techniques are employed in order to reduce the number of data samples used for training the SVM classifier of the DSPRED method. Note that no matter which classifier is used, reducing the dataset size by reducing the number of data samples will improve the speed of making predictions, which is useful considering the fact that the protein data in public databases is growing rapidly. As an alternative to sample reduction, dimension reduction techniques such as feature selection can also be employed to reduce the training time of the SVM. This is explored in Aydin et al. [11] and Xie et al. [22] and deserves a separate analysis. It should be noted that in Aydin et al. [11] reducing the dimensions by feature selection did not improve the prediction accuracy. Therefore, in this work, we reduce the number of samples not only to reduce the model training time of the SVM but also to explore whether the prediction accuracy will improve.

Related Studies
In this section, we give a brief review of the literature that employs SVMs for secondary structure prediction and of studies that propose methods for reducing the training time of the SVM through sample reduction. Lin et al. proposed a multi-SVM ensemble to improve the performance of secondary structure prediction. Their method contains two layers: the first layer consists of an ensemble of five classifiers and the second layer is built from three SVMs. The multi-SVM ensemble employs bagging to resample the training dataset through bootstrap sampling and achieves improved performance on secondary structure prediction when a seven-fold cross-validation is performed on the RS126 dataset [23]. Hua et al. proposed a method of protein secondary structure prediction based on the support vector machine (SVM), which achieves a three-state per-residue accuracy (Q3) of 73.5% by seven-fold cross-validation on the CB513 dataset [24]. Although there are many other publications that employ SVMs for protein secondary structure prediction, none of them include sample reduction for reducing the training time of the SVM. Therefore, we continue with methods that improve the model training time of the SVM in other problems. Jun employed stratified sampling to select a subset of examples from the training set [18]. In this work, the author selected 10% of the samples from each class, which reduces the size of the training set by 10-fold, and then trained an SVM classifier using the reduced dataset. The method was applied to four datasets from the UCI Machine Learning Repository. Though the prediction accuracy of the models trained by 10% stratified sampling is maintained for the adult and iris datasets, it decreased considerably compared to using all the samples for the letter image recognition and protein localization sites datasets.
In another work, Hens and Tiwari reduced the number of features by F-score and stratified sampling for the credit scoring problem and obtained accuracy similar to other state-of-the-art methods while reducing the computational time significantly [25]. In addition to sampling strategies, there are also methods that employ clustering to reduce the sample size of training sets. Awad et al. [19] employed a hierarchical clustering approach to improve the training time of an SVM, particularly for large datasets. They proposed three techniques, named TCT-SVM, TCTD-SVM and OTC-SVM, which are shown to work efficiently for model training. Among those, TCT-SVM performed better than the others in terms of accuracy but had a higher model training time [19]. Yu et al. proposed a new method called CB-SVM (Clustering-Based SVM) that integrates a scalable clustering method for large datasets while generating high classification accuracy. The authors claim that the CB-SVM algorithm can effectively reduce the total number of data points for training an SVM [20].

Dataset
To evaluate our method, we applied it to the non-homologous CB513 dataset constructed by Reference [26], which contains 513 protein chains and 84,119 amino acids. This dataset is one of the standard benchmarks used to assess the accuracy of protein secondary structure prediction algorithms [27]. It contains protein sequences and structure label assignments obtained using the DSSP program [28] starting from the structure information in the Protein Data Bank (PDB) [29]. The DSSP convention is used to map the 8-state representation of secondary structure labels into the 3-state one by applying the following conversion rule: H, G, I to H; E, B to E; S, T to L.
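As an illustrative sketch (not part of the original pipeline), the conversion rule above can be expressed as a simple lookup table; mapping any DSSP state other than those listed in the rule to L is an assumption of this sketch:

```python
# Hypothetical helper for the DSSP 8-to-3 state conversion rule:
# H, G, I -> H (helix); E, B -> E (strand); S, T and anything else -> L (loop).
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",  # helix types
    "E": "E", "B": "E",            # strand types
    "S": "L", "T": "L",            # mapped to loop by the rule above
}

def to_three_state(labels: str) -> str:
    """Map an 8-state DSSP label string to the 3-state alphabet H/E/L."""
    # States not covered by the rule (e.g., unassigned '-') default to L here.
    return "".join(EIGHT_TO_THREE.get(c, "L") for c in labels)
```

For example, `to_three_state("HGIEBST")` yields `"HHHEELL"`.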

Problem Definition
In the secondary structure prediction problem, the goal is to assign a structural class label from a 3-letter alphabet (H: Helix, E: Strand, L: Loop) to each amino acid of a protein, starting from its amino acid sequence (Figure 1).

Feature Extraction for Protein Secondary Structure Prediction
The input features of our prediction methods include sequence profiles in the form of position-specific scoring matrices (PSSMs) [30] derived by PSI-BLAST [31], HHMAKE PSSMs, as well as structural profile matrices. Each target protein in the CB513 benchmark is aligned with the proteins of the NCBI NR database [32] using the PSI-BLAST method [31] to compute a position-specific scoring matrix (PSSM). In the next step, the proteins that are similar to the target are aligned jointly by a multiple alignment algorithm and a PSSM is computed by normalizing the frequency-of-occurrence counts of amino acids [31]. Similarly, HHMAKE PSSMs are computed by aligning the target proteins against the NR20 database (a reduced version of NR) using the HHblits method (https://toolkit.tuebingen.mpg.de/tools/hhblits) and converting the HMM-profile model's match state distributions to a frequency table. To generate structural profiles, the HMM-profile of the target is aligned against the HMM-profiles in the PDB70 database [33] using the second step of the HHblits method.
The size of the PSI-BLAST and HHMAKE PSSMs is N by 20 and the size of the structural profile matrix is N by 3, where N is the number of amino acids in the target protein. Each row of a PSI-BLAST or HHMAKE PSSM contains the propensity of observing each of the 20 amino acids at a particular position of the target. On the other hand, each row of the structural profile matrix and of the three distributions in Section 3.4 contains the probability of observing the three secondary structure labels at a particular position of the target. An example structural profile matrix is shown in Figure 2. In the present study, only distant templates are used to construct structural profile matrices by removing templates whose percentage of sequence identity with respect to the target is greater than 20%. Once the profile matrices are obtained, they are scaled by a sigmoidal transformation to map the features to the range [0, 1] and sent as input to the DSPRED method for classification. Details of feature extraction can be found in Aydin et al. [9] and the thesis work of Görmez [34]. Details of the weighted frequency computation for deriving structural profile matrices can be found in Reference [35].
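As an illustration of the scaling step, the following sketch applies a logistic sigmoid to map raw PSSM scores into [0, 1]; the exact sigmoidal transformation used here is an assumption, the logistic function being a common choice for PSSM scaling:

```python
import numpy as np

def sigmoid_scale(pssm: np.ndarray) -> np.ndarray:
    """Scale raw PSSM scores into [0, 1] with a logistic sigmoid.

    Note: the precise transform of the paper is not reproduced here;
    the standard logistic function 1 / (1 + exp(-x)) is used as an example.
    """
    return 1.0 / (1.0 + np.exp(-pssm))

# A toy PSSM row: large negative scores map near 0, zero maps to 0.5,
# large positive scores map near 1.
row = np.array([-8.0, 0.0, 8.0])
scaled = sigmoid_scale(row)
```

A zero log-odds score maps exactly to 0.5, so the transform preserves the sign structure of the original scores.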

DSPRED Method
To predict the secondary structure class of each amino acid, the DSPRED method is used, which employs separate dynamic Bayesian network (DBN) classifiers for the PSI-BLAST and HHMAKE PSSMs. Each DBN model produces a marginal a posteriori distribution of class labels given the input features (called Distributions 1 and 2). These distributions are combined with structural profile matrices through model averaging to obtain Distribution 3 [11]. In this work, the one-sided amino acid window of the DBN classifiers is set to L_A = 5 and the one-sided secondary structure history window is set to L_S = 4. In the next step, the PSI-BLAST PSSM, the HHMAKE PSSM and Distributions 1, 2 and 3 are used as input features of the SVM classifier. To predict the secondary structure class, a symmetric window of size 11 is taken around each amino acid and the features in this window are concatenated to obtain a total of 539 features (PSI-BLAST PSSM: 20 × 11 = 220 features, HHMAKE PSSM: 20 × 11 = 220 features, Distributions 1-3: 3 × 3 × 11 = 99 features). The steps of the DSPRED method are shown in Figure 3. Note that in the present work the second structural profile matrix is not employed (i.e., w_4 is set to 0). Details of DSPRED can be found in Aydin et al. [9,11] and the thesis work of Görmez [34].
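The windowing step can be sketched as follows; the zero-padding at the chain ends is an assumption of this illustration. With 49 features per residue (20 + 20 + 9), an 11-wide window yields the 539 SVM inputs mentioned above:

```python
import numpy as np

def window_features(per_residue: np.ndarray, half: int = 5) -> np.ndarray:
    """Concatenate features in a symmetric window of size 2*half + 1
    around each residue, zero-padding at the chain ends (an assumption).

    per_residue: N x D matrix (here D = 49: two 20-dim PSSM rows plus
    three 3-dim distributions per residue); returns N x D*(2*half + 1).
    """
    n, d = per_residue.shape
    pad = np.zeros((half, d))
    padded = np.vstack([pad, per_residue, pad])
    # Each offset contributes one D-dim slice per residue; 11 offsets total.
    return np.hstack([padded[i:i + n] for i in range(2 * half + 1)])

# A 30-residue toy chain with 49 features per residue -> 30 x 539 SVM inputs.
x = window_features(np.random.rand(30, 49))
```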

Training a Support Vector Machine with Large Datasets
The support vector machine is a powerful method for classification and regression problems [36,37]. It has been applied successfully to many real-world problems, including signal processing, image processing and bioinformatics, due to its high accuracy, its ability to work in high dimensions and process non-vectorial data, and its flexibility in modelling diverse sources of data [38]. The SVM maps the input space into a high-dimensional feature space and then constructs an optimal hyperplane in the new space [36]. Although the SVM performs well in complex prediction tasks, it solves a quadratic optimization problem during model training, which can be disadvantageous for large datasets [17]. For instance, it would take years to train an SVM on a dataset of one million records with many features [20,39]. Owing to improvements in data collection, storage and processing technologies, the size of databases is growing at a rapid rate in many disciplines, including bioinformatics [40]. Therefore, efficient methods should be developed for speeding up the training phase of the SVM.
In the following sections, the methods implemented in this work for training the SVM classifier of the DSPRED method are explained in more detail.

Sample Reduction by Stratified Random Sampling
In stratified random sampling, a fixed percentage of the train set samples (i.e., amino acids) is randomly selected from each class type. This approach preserves the ratio of class types in the reduced train set. In this paper, the percentage parameter is increased from 10% to 100% in increments of 10%. For instance, if this parameter is set to 10%, the resulting train set contains approximately 10% of the amino acids in the original train set, and if it is set to 100%, it contains all the data samples. After applying stratified random sampling, the SVM model is trained using the reduced train sets and the prediction accuracy is computed on the test sets (see Section 4.1).
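A minimal sketch of the stratified selection step, assuming per-class sampling without replacement (function and variable names are illustrative, not from the original implementation):

```python
import numpy as np

def stratified_sample(y: np.ndarray, fraction: float, seed: int = 0) -> np.ndarray:
    """Return indices of a stratified random subsample: the same fraction
    is drawn (without replacement) from each class, preserving class ratios."""
    rng = np.random.default_rng(seed)
    keep = []
    for label in np.unique(y):
        idx = np.flatnonzero(y == label)
        n_keep = max(1, int(round(fraction * idx.size)))
        keep.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(keep))

# Keeping 50% of an H/E/L label vector keeps ~50% of each class.
labels = np.array(["H"] * 40 + ["E"] * 20 + ["L"] * 40)
sel = stratified_sample(labels, 0.5)
```

The selected indices can then be used to slice the 539-dimensional feature matrix before training the SVM.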

Sample Reduction by Hierarchical Clustering
The second method clusters the data samples by a hierarchical clustering algorithm and replaces the train set samples with the nearest neighbors of the cluster centers. First, the PSSM feature vectors of the amino acids in the train set are clustered using a hierarchical clustering algorithm. The number of clusters is denoted as N_c. In the next step, the k nearest neighbors of each cluster center are selected as the data samples for the train set of the SVM classifier. Figure 4 summarizes the steps of the sample reduction by clustering procedure. The hyper-parameters N_c and k are optimized by computing the prediction accuracy on validation sets, as explained in the next section. Different methods are employed for hierarchical clustering and, among those, Ward's method provided the best results [41]. Ward's method applies a minimum variance criterion that minimizes the total within-cluster variance. At each step, it merges the pair of clusters that leads to the minimum increase in total within-cluster variance, where this increase is a weighted squared distance between cluster centers. The initial cluster distances are defined to be the squared Euclidean distances between points [41,42].
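The reduction procedure can be sketched with SciPy's hierarchical clustering routines; taking the cluster mean as the center is an assumption of this illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import cdist

def reduce_by_clustering(X: np.ndarray, n_clusters: int, k: int) -> np.ndarray:
    """Cluster X with Ward's method and keep the k nearest neighbors of
    each cluster center (taken here as the cluster mean) as the reduced set."""
    Z = linkage(X, method="ward")                          # Ward linkage tree
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    keep = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        center = X[members].mean(axis=0, keepdims=True)
        dists = cdist(center, X[members]).ravel()
        keep.extend(members[np.argsort(dists)[:k]])        # k closest members
    return np.sort(np.array(keep))

# 200 toy points, 20 clusters, 5 neighbors each -> at most 100 kept samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
idx = reduce_by_clustering(X, n_clusters=20, k=5)
```

Clusters with fewer than k members simply contribute all of their members, so the reduced set size is at most N_c × k.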
According to SciPy's documentation, Ward's method is implemented with the nearest-neighbor chain algorithm, which has O(n^2) time complexity. The other linkage methods are implemented with a naive algorithm that has O(n^3) time complexity. All algorithms use O(n^2) memory [43,44].

Cross-Validation and Hyper-Parameter Optimization for Clustering
The accuracy of the data reduction strategies is evaluated in a cross-validation setting. For this purpose, the proteins in CB513 are randomly assigned to seven folds and the train/test splits are formed accordingly, resulting in a total of seven train/test set pairs. For instance, the first train set contains a total of 73,622 amino acid samples, 34.70% (25,544) of which belong to helix, 22.26% (16,387) to beta strand and 43.04% (31,691) to loop. Based on this assignment, a total of 10,497 amino acids remain for the first test set. In the train and test sets, each amino acid is represented by a total of 539 features.
The number of clusters and the number of nearest neighbors, which are the hyper-parameters of the "sample reduction by hierarchical clustering" approach, are optimized by performing a grid search. The first hyper-parameter, N_c, represents the number of clusters and is optimized over values ranging from 500 to 1500. The second hyper-parameter, k, is the number of nearest neighbors, which is optimized by choosing values from 1 to 19. For this purpose, approximately 10% of the proteins from each train set are randomly selected to form a total of seven validation sets, which are used as secondary test sets to optimize the hyper-parameters. The reason for selecting only 10% of the train set is to keep as many samples as possible in the train set so that the prediction accuracy is not affected. Stratified random selection is not performed when forming the validation sets because there is not a large imbalance between the class types. Once the validation sets are formed, the remaining samples are used to train the SVM models and prediction accuracies are computed on the validation sets for different values of the hyper-parameters. The parameters with the best validation set accuracy are then selected for each iteration of the cross-validation experiment. Once the optimum hyper-parameters are found (a total of seven optimum parameter pairs), the SVM is trained on the original train sets and predictions are computed on the test sets.
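The grid search can be sketched as follows; `train_and_score` is a hypothetical stand-in for "reduce the train set with a given (N_c, k), train the SVM and return the validation accuracy", and the toy scoring surface below is purely illustrative:

```python
def grid_search(train_and_score, nc_values, k_values):
    """Return the (N_c, k) pair with the best validation accuracy.

    train_and_score(nc, k) is assumed to return a scalar accuracy.
    """
    best = None
    for nc in nc_values:
        for k in k_values:
            acc = train_and_score(nc, k)
            if best is None or acc > best[0]:
                best = (acc, nc, k)
    return best[1], best[2]

# Toy scoring surface peaking at N_c = 1500, k = 17 (illustrative only).
score = lambda nc, k: -abs(nc - 1500) - abs(k - 17)
best_nc, best_k = grid_search(score, range(500, 1501, 500), range(1, 20, 2))
```

In the actual experiments this search is repeated once per fold, yielding one optimum (N_c, k) pair per cross-validation iteration.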

System Architecture and Hyper-Parameters of the SVM
The SVM with an RBF kernel, which provides satisfactory results for protein secondary structure prediction, is implemented using the libSVM software (version 3.21). The hyper-parameters of the SVM are selected as α = 0.00781 and C = 1.0, which have been optimized previously by Aydin et al. [9].
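For illustration, an equivalent configuration can be set up through scikit-learn's SVC class, which wraps libSVM; we assume here that the α reported above corresponds to libSVM's RBF kernel parameter γ, and the toy data merely stands in for the 539-dimensional windowed feature vectors:

```python
import numpy as np
from sklearn.svm import SVC

# Sketch of the SVM configuration described above (assumption: the reported
# alpha = 0.00781 plays the role of libSVM's RBF gamma parameter).
clf = SVC(kernel="rbf", gamma=0.00781, C=1.0)

# Toy data standing in for the 539-dimensional windowed features.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 539))
y = np.array(["H", "E", "L"] * 20)

clf.fit(X, y)
pred = clf.predict(X[:5])
```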

Sample Reduction by Stratified Random Selection
Stratified random selection is performed for each train set of the 7-fold cross-validation experiment. For this purpose, a fixed percentage of amino acid samples is selected randomly from the train set using stratified sampling and the SVM model is trained on this reduced set. In the next step, predictions are computed on the test sets. Figures 5 and 6 show the secondary structure prediction accuracy of the SVM classifier and the model training times, respectively, for all folds of the cross-validation. According to these results, it is possible to remove approximately 50% of the data samples from the train sets of CB513 without decreasing the prediction accuracy significantly. Note that the obtained accuracy values are comparable to the state-of-the-art accuracy on the CB513 benchmark [11]. Furthermore, the model training time of the SVM decreases by 73.38% when the training set is reduced by 50% to contain approximately 36,000 amino acid samples. Table 1 summarizes the overall prediction accuracies of the stratified random selection method for the various training sample percentages under 7-fold cross-validation. In this table, D represents the sampled dataset as a percentage, S represents the randomly and uniquely selected rows, Acc_p denotes the overall accuracy in percentages (i.e., Q_3) on validation sets, Time_t is the training time in hours, minutes and seconds and Time_p is the prediction time in minutes and seconds.

Sample Reduction by Hierarchical Clustering
A 7-fold cross-validation on CB513 is also performed for the sample reduction by hierarchical clustering method. In each iteration, the samples in the train set are clustered by a hierarchical clustering algorithm and the train samples are replaced with the nearest neighbors of the cluster centers. We first optimized the number of clusters and the number of nearest neighbors from each cluster center. Table 2 summarizes the experimental results obtained by Ward's hierarchical clustering method. In this table, N_c represents the number of clusters, k represents the number of nearest neighbors from each cluster center, N_tr is the number of train set samples, Acc_v denotes the overall accuracy in percentages (i.e., Q_3) on validation sets and Acc_t is the overall accuracy on test sets. An N_c value of "all" represents the setting in which all the samples are used for model training (i.e., each sample is assigned to a different cluster). The optimum number of clusters for each fold is 1500, except for the third and fourth folds. Typically, the 17 closest samples are selected from each cluster based on the distance from the cluster center; for the first fold, 13 was found to be the optimum number of nearest neighbors. The test set prediction accuracies obtained on the reduced and the whole datasets are almost identical for each fold of the cross-validation experiment and are comparable to the state-of-the-art [11]. Data in the training set are preprocessed before being input to the SVM in order to improve the training time.
As a result of these experiments, the clustering approach can reduce the train set size by 26% without reducing the prediction accuracy significantly. In addition to the prediction accuracy, it is of interest to analyze the running time of the sample reduction by hierarchical clustering method and the running time of the SVM classifier with and without clustering applied. For this purpose, the following experiment is performed on the first fold of the seven-fold cross-validation experiment. The number of clusters N_c is set to 1000 and the number of nearest neighbors k to 13, which results in 36,622 training examples for the SVM. The running times obtained for each step are as follows. Hierarchical clustering: 14.51 s; finding the k = 13 nearest neighbors of the cluster centers: 59.25 s; training the SVM using 36,622 samples: 6 h, 16 min and 43 s. The total running time of the sample reduction by hierarchical clustering approach is thus 6 h, 17 min and 56 s. When the SVM is trained using all of the samples in the first fold's training set, it takes 14 h, 10 min and 22 s. Based on these results, it can be stated that the running time of sample reduction by hierarchical clustering followed by SVM training is typically lower than that of training the SVM using the full training set. Table 3 summarizes the average and standard deviation of the accuracies obtained from the 7 folds of the cross-validation experiment on CB513. In this table, the first two rows include the results obtained for the sample reduction strategies and the last row represents the case where all samples are used to train the SVM classifier. The low standard deviation values demonstrate that the accuracy evaluations are robust and the models are trained with sufficiently large sample sizes. To assess whether the difference between the accuracy values of the sample reduction methods and the method that uses all samples is statistically significant, a two-tailed Z-test is performed with a confidence level of 95%.
Based on this test, the accuracy difference between sample reduction by stratified random selection and the method that uses all samples is not found to be statistically significant, with a Z-score of −0.0217 and a p-value of 0.98404. On the other hand, the accuracy difference between sample reduction by clustering and the method that uses all samples is statistically significant, with a Z-score of −4.8713 and a p-value < 1 × 10^-5. Based on these results, it can be concluded that sample reduction by stratified random sampling is more effective than the sample reduction by hierarchical clustering approach for protein secondary structure prediction.
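As a sketch of the significance test, a standard two-proportion Z-score can be computed as follows; whether the paper uses exactly this pooled-variance form is an assumption of the sketch:

```python
import math

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """Two-proportion Z-score for comparing two classification accuracies
    evaluated on n1 and n2 samples (pooled-variance form)."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)          # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Identical accuracies give Z = 0; a lower first accuracy gives a negative Z.
# The sample size below matches the 84,119 amino acids of CB513; the
# accuracy values are illustrative, not the paper's.
z_equal = two_proportion_z(0.85, 0.85, 84119, 84119)
z_lower = two_proportion_z(0.84, 0.85, 84119, 84119)
```

The resulting Z-score is compared against the two-tailed critical value ±1.96 for a 95% confidence level.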

Conclusions
In this paper, we proposed two data reduction strategies for improving the model training time of a support vector machine classifier. The proposed solutions can reduce the dataset size by 26-50%, down to approximately 36,000 amino acid samples. The accuracy evaluations are performed through cross-validation experiments on the CB513 benchmark. For larger datasets, it may still be sufficient to keep approximately 36,000 samples in the train set to obtain satisfactory prediction accuracy, which would correspond to removing an even higher percentage of data samples from the train set; this will be investigated as future work. Additionally, de-clustering strategies can be implemented so that the clusters are expanded mainly around the decision boundaries, providing a finer-grained expansion of clusters in regions where the classifier has the most confusion. As a third direction, a smaller train set can be formed for each test example using the cluster centers as guides.