An Enhanced Design of Sparse Autoencoder for Latent Features Extraction Based on Trigonometric Simplexes for Network Intrusion Detection Systems

Despite the successful contributions in the field of network intrusion detection using machine learning algorithms and deep networks to learn the boundaries between normal traffic and network attacks, it is still challenging to detect various attacks with high performance. In this paper, we propose a novel mathematical model for further development of robust, reliable, and efficient software for practical intrusion detection applications. In this work, we are concerned with tuning optimal hyperparameters for high-performance sparse autoencoders that optimize features and classify normal and abnormal traffic patterns. The proposed framework allows the parameters of the back-propagation learning algorithm to be tuned with respect to the performance and architecture of the sparse autoencoder through a sequence of trigonometric simplex designs. These hyperparameters include the number of nodes in the hidden layer, the learning rate of the hidden layer, and the learning rate of the output layer. Because different layers of the autoencoder are characterized by different learning rates in the proposed framework, it is expected to achieve better results in extracting features and adapting to various levels of the learning hierarchy. The idea is that every learning rate of a hidden layer is viewed as a dimension in a multidimensional space. Hence, a vector of adaptive learning rates is implemented for the multiple layers of the network to reduce the processing time required for the network to learn the mapping towards a combination of enhanced features and the optimal synaptic weights in the multiple layers for a given problem. The suggested framework is tested on CICIDS2017, a reliable intrusion detection dataset that covers all the common, updated intrusions and cyber-attacks.
Experimental results demonstrate that the proposed architecture for intrusion detection yields superior performance compared to recently published algorithms in terms of classification accuracy and F-measure results.


Background
As a result of the increasing attacks on Internet-connected devices in recent years, the study of Intrusion Detection Systems (IDS) has attracted strong interest from a wide range of research communities, including information systems, security-software companies, and computer science.

Paper Organization
The remainder of this paper is organized as follows. Section 2 briefly reviews prior related work developed using machine learning and deep learning techniques. Section 3 presents the theory and mathematical model of the proposed architecture. In Section 4, the results of applying the proposed idea to a well-known intrusion dataset are provided, together with a comparison with other related works. A discussion is presented in Section 5. Finally, concluding remarks appear in Section 6.

Related Work
This section briefly reviews intrusion detection algorithms that use the CICIDS2017 dataset [10] and are developed mainly with machine learning and deep learning techniques. In the literature, only a few IDS algorithms have used sparse autoencoders to extract features based on latent representation concepts. The major challenge comes from the fact that the high-level features produced by traditional SAEs are designed to activate only a small number of nodes in the hidden layers towards specific attributes of the input instances. Because it imposes the sparsity constraint directly in the hidden layers, this approach to extracting features fails to reflect the relationships among data instances.
In [6,11], the traditional SAE and Support Vector Machine (SVM) have been used as feature extraction techniques, while the Random Forest (RF) classifier was applied to detect malicious attacks. The RF is an ensemble learning algorithm that combines bootstrap aggregation with random feature selection to create a set of decision trees, resulting in a powerful prediction model with controlled variance [12]. In [13], the multilayer perceptron network and payload classifying algorithm (MLP-PC) was used to distinguish between network intrusions and benign traffic. The MLP network is a deep neural network that consists of five layers and utilizes the Adam optimizer. The input layer is composed of 27 nodes, followed by three fully connected hidden layers. Each hidden layer is designed with 64 nodes, a dropout probability of 0.5, and a rectified linear activation function. The output layer is a single node with a sigmoid activation function. The payload classifier (PC) is a deep convolutional neural network (CNN) that consists of a character embedding layer, followed by four convolutional and pooling layers and two standard layers embedded with a sigmoid function for classification. In [14], the Fisher Score (FS) algorithm was utilized for feature selection, and the SVM, K-Nearest Neighbor (KNN), and Decision Tree (DT) algorithms were applied for intrusion detection, classifying two classes: DDoS or benign. The FS is a supervised feature selection algorithm that selects each feature independently according to a score measured by the Fisher criterion [15]. In [16], a distributed model based on Spark was proposed using a collection of a deep belief network (DBN) and multi-layer ensemble support vector machines (MLE-SVMs). The DBN is a greedy layer-wise unsupervised learning model designed with a fine-tuning strategy to learn the relationships among low-level attributes and to represent a good set of hierarchical features.
In [17], a deep-learning-based feature extraction technique and support vector machine (DL-SVM) were used to implement an effective and flexible IDS network. The authors of [18] used RF with recursive feature elimination to keep the most effective features and a deep multilayer perceptron (DMLP) structure to detect intrusion attacks.

Proposed Methodology
In this section, we present the proposed IDS architecture based on an enhanced SAE and the RF algorithm, as depicted in Figure 1. The proposed IDS includes modules for preprocessing a huge amount of network packets, tuning the hyperparameters of the SAE, and producing more mature and discriminating features. Typically, the preprocessing module identifies the minimum (Min) and maximum (Max) values of the basic features and normalizes them between 0 and 1. Moreover, features that have one value for different classes are eliminated. The hyperparameter tuning module selects 5% of the network packets, which are used later as features for the SAE's offline training to adjust the architecture of the SAE based on the HNM algorithm. The feature optimization module produces fewer but more mature features and improves the detection of malicious attacks compared to traditional network features. The main modules of the proposed IDS are described in more detail hereafter.

Data Preprocessing
The data preprocessing module breaks down the Internet Protocol (IP) address and port number of the sender and receiver, respectively, into four features instead of two; the CICIDS2017 dataset uses IP-port-sender and IP-port-receiver features. The benefit of doing so is that most intrusions follow a particular pattern for information gathering over the TCP/IP network. After that, the IP-sender and IP-receiver addresses are mapped to an integer representation. Finally, feature scaling is performed to ensure that all the data lie in the same range between 0 and 1. Feature scaling is a unity-based normalization method and can be obtained by the following equation [19].
$$x_i' = \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$
where $x_{\min}$ and $x_{\max}$ are the minimum and the maximum values for a particular feature $x_i$.
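The scaling step described above, together with the removal of single-valued features mentioned for the preprocessing module, can be sketched in a few lines of plain Python. This is only an illustration: the function name `min_max_scale` and the toy records are hypothetical and not taken from the paper.

```python
def min_max_scale(rows):
    """Scale each column to [0, 1] and drop columns that hold a single value."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    kept = []
    for col in columns:
        lo, hi = min(col), max(col)
        if hi == lo:
            continue  # constant feature: nothing to scale, so it is dropped
        kept.append([(v - lo) / (hi - lo) for v in col])
    return [list(r) for r in zip(*kept)]  # transpose back to one list per record


# Hypothetical toy records with three features; the last one is constant.
sample = [[2.0, 10.0, 1.0], [4.0, 30.0, 1.0], [6.0, 20.0, 1.0]]
scaled = min_max_scale(sample)
```

In this toy input the third feature has the same value in every record, so only two scaled features remain.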

Hassan-Nelder-Mead Algorithm
In this work, we utilize the HNM algorithm [9] to tune the hyperparameters of the SAE in order to mitigate the over-fitting problem arising in the hidden layer and to set optimal learning rates for the different layers of the back-propagation learning algorithm. The HNM algorithm generates a sequence of trigonometric simplexes designed to extract different features of non-isometric reflections. Unlike the traditional hyperplane simplex of the Nelder-Mead (NM) algorithm [20,21], the HNM simplex allows the components of the reflected vertex to fragment into multiple triangular simplexes and to perform different operations of the algorithm. Thus, the resulting sequence of triangular simplexes not only extracts different non-isometric reflections, but also performs rotation through angles specified by the collection of features of the reflected vertex elements in the hyperplane of the remaining vertices. Therefore, the HNM algorithm is shown to be effective for unconstrained optimization problems. The detailed steps for one axial component of the HNM algorithm are as follows:
Step 1. Initialize Triangular Simplex (A, B, and C) and Threshold (Th), as shown in Figure 2: A triangular simplex is a geometrical object that has three vertices. Each vertex has n components, where n is the dimension of the mathematical problem. Since the HNM algorithm is employed to optimize three hyperparameters of the SAE, n equals 3 in our case. Next, we sort the simplex vertices in ascending order according to an error function ($E_F$) that is defined later in the process to obtain four points associated with the lowest, second lowest, second highest, and highest $E_F$ values, such that $E_F(A) < E_F(B) < E_F(C) < E_F(Th)$. Note that each of A, B, C, and Th has three axial components (dimensions). The HNM algorithm optimizes a single component in each iteration, while seeking to explore the curvature of $E_F$ through six basic operations.
Step 2. Reflection D: The HNM performs reflection along the line segment connecting the worst vertex C and the center of gravity H, and evaluates $E_F(D)$. The vector formula for D is given below.
Step 3. Expansion E: If the reflected point D satisfies the descent condition, then the HNM executes expansion because it found a descent in that direction (see Figure 2). E is found by the following equation.
a. If $E_F(E) < E_F(D)$, then we replace the threshold point Th with E, and go to Step 6.
b. Otherwise, Th is replaced by D, and the algorithm goes to Step 4.
Step 4. Contraction F or G: If the contraction condition is met, then another point, F, must be tested. If $E_F(F) < E_F(Th)$, then F is kept and replaces Th. If the condition on F is not met, then perhaps a better point is found somewhere between C and the centroid H. The point G is computed to see whether it has a smaller function value than Th. The vector formulas for F and G are as follows.
a. If either F or G has a smaller $E_F$ value than Th, then Th is updated and the algorithm goes to Step 6.
b. Otherwise, the algorithm moves to Step 5.
Step 5. Reduction H or I: The HNM algorithm performs two types of shrinkage operations. It shrinks the simplex either at the vertex that has the second lowest $E_F$ value to evaluate H or at the second highest vertex to evaluate I. The HNM verifies the value of $E_F(H)$. If the condition on point H is not satisfied, then the HNM shrinks the simplex along the line segment AC and evaluates $E_F(I)$. The HNM then goes to Step 6. The new vertices are given by:
Step 6. Termination Test: The termination tests are problem-based and user-defined. In this work, the stopping criterion is primarily characterized by the designed error function of the SAE. It checks whether the error function deviates from the true minimum by less than $10^{-4}$, as indicated by the inequality below.
The termination criterion is evaluated for a predefined number of iterations (N). Here, x, y, and z correspond to the number of nodes in the hidden layer, the learning rate of the hidden layer, and the learning rate of the output layer, respectively.
If the condition of the termination test is satisfied, the HNM algorithm stops and returns the best architecture of the SAE and the learning rates for the different layers of the back-propagation algorithm. Otherwise, the algorithm sorts the simplex vertices and Th and goes to Step 2.
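To give a feel for the reflect, expand, contract, and shrink cycle that the steps above build on, the sketch below implements the classical Nelder-Mead search, not the HNM algorithm itself. It does not fragment the reflected vertex into per-component triangular simplexes and keeps no threshold point Th; the update formulas are the textbook NM ones, used here as stand-ins because the vector formulas for D through I are not reproduced in this text. The names `nelder_mead` and `toy_error`, the stopping rule (spread of error values below `tol`), and the quadratic test function are all assumptions made for illustration.

```python
def nelder_mead(error_fn, simplex, tol=1e-4, max_iter=500):
    """Classical Nelder-Mead search over a list of n+1 vertices (lists of floats)."""
    simplex = [list(v) for v in simplex]        # work on a copy of the caller's data
    for _ in range(max_iter):
        simplex.sort(key=error_fn)              # best vertex first, worst last
        best, worst = simplex[0], simplex[-1]
        if error_fn(worst) - error_fn(best) < tol:
            break                               # assumed stopping rule: values agree within tol
        others = simplex[:-1]
        centroid = [sum(c) / len(others) for c in zip(*others)]
        reflected = [2 * h - w for h, w in zip(centroid, worst)]
        if error_fn(reflected) < error_fn(best):
            # Descent found: try going twice as far in the same direction.
            expanded = [2 * r - h for r, h in zip(reflected, centroid)]
            simplex[-1] = min(reflected, expanded, key=error_fn)
        elif error_fn(reflected) < error_fn(simplex[-2]):
            simplex[-1] = reflected
        else:
            # Contract toward the centroid from the better of worst/reflected.
            base = min(worst, reflected, key=error_fn)
            contracted = [(h + b) / 2 for h, b in zip(centroid, base)]
            if error_fn(contracted) < error_fn(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway toward the best vertex.
                simplex = [[(b + v) / 2 for b, v in zip(best, vtx)] for vtx in simplex]
    return min(simplex, key=error_fn)


def toy_error(point):
    """Hypothetical stand-in for E_F over three hyperparameters (x, y, z)."""
    x, y, z = point
    return (x - 1.0) ** 2 + (y + 2.0) ** 2 + z ** 2


start = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
best_point = nelder_mead(toy_error, start)
```

In the paper's setting the error function would be the SAE error evaluated for a candidate triple (number of hidden nodes, hidden-layer learning rate, output-layer learning rate); the first of these is an integer, which this continuous sketch does not handle.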

Proposed Sparse Autoencoder
A sparse autoencoder is an unsupervised learning algorithm whose training procedure involves a sparsity penalty, allowing only a few nodes of the hidden layers to activate when a single sample is fed into the network. The intuition behind this idea is that the algorithm is forced to sensitize a small number of individual nodes of the hidden layers towards specific features of the input sample [22,23]. This form of regularization is accomplished by calculating the average activation of the nodes in the hidden layers over a collection of input samples. To satisfy the sparsity constraint, the mean computed over the training samples must be near 0 [22,24]. The main problem, however, is that autoencoders often do not explicitly impose regularization on the weights of the network; instead, they regularize activations. As a result, early designs of sparse autoencoders perform poorly, because sparsity makes it difficult for an autoencoder to reach a zero (or near-zero) error loss [24,25].
In contrast to traditional autoencoders, this work proposes an alternative mathematical model for sparse autoencoders, which provides a new platform for developing a compressed feature extraction based on imposing sparsity regularization on the weights, not the activations. One solution to penalize weights within a network would be to impose regularization by the sparsity constraint in the output layer. As a result, the sparse autoencoder is encouraged to find a connection between the sparsity penalty and the learning to extract the latent features by selectively activating the number of variables (weights) of the network. The template of the proposed SAE is illustrated in Figure 3. As discussed above, sparse autoencoder is an unsupervised learning algorithm and relies on conveying the outputs of one layer to become the inputs of the following layer. For m input attributes and n hidden layer nodes, the equations that describe this operation are as follows.
where $w_1 \in \mathbb{R}^{n \times m}$ is the weight matrix for the hidden layer and $b_1 \in \mathbb{R}^{n \times 1}$ is the bias matrix associated with the hidden layer. As can be seen in Figure 3, the multilayer design of the SAE network has linear activation functions. Thus, the inputs of the output layer are purely represented by the vector $a_1 \in \mathbb{R}^{n \times 1}$.
where $w_2 \in \mathbb{R}^{m \times n}$ and $b_2 \in \mathbb{R}^{m \times 1}$ are the weight and bias matrices of the output layer. The outputs of the neurons in the last layer are considered as the SAE outputs, which are denoted by the vector $a_2 \in \mathbb{R}^{m \times 1}$. As shown in Figure 3, the proposed Error Function ($E_F$) is composed of two different parts. The first term is the Mean Squared Error or Loss Function ($L_F$), which measures the average squared difference between the estimated (output) and the actual (input) values. The second term is the proposed Regularization Function ($R_F$), which employs the sparsity constraint, mainly for penalizing the weight matrices of the hidden and output layers. These terms are calculated as follows:
After propagating the input samples forward through the SAE network and obtaining the output vector ($a_2$), the next step is to evaluate $E_F$ from Equation (13). Since $E_F$ is not an explicit function of the weights and biases in the SAE network, we need to specify a sensitivity measure that captures the changes in $E_F$ and propagates these changes backward through the network from the last layer to the first layer, in a process called the back-propagation learning algorithm.
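The forward pass and the two-part error function described above can be sketched with NumPy as follows. The layer shapes and the linear activations follow the text; the exact form of the regularization term is not reproduced here, so the sketch assumes an L1 penalty on both weight matrices scaled by a hypothetical coefficient `lam`. The function name `sae_error` and the input symbol `x` are likewise assumptions.

```python
import numpy as np


def sae_error(x, w1, b1, w2, b2, lam=1e-3):
    """Forward pass of a linear two-layer autoencoder and its error E_F = L_F + R_F."""
    a1 = w1 @ x + b1                      # hidden-layer output, shape (n, 1), linear activation
    a2 = w2 @ a1 + b2                     # reconstruction of the input, shape (m, 1)
    loss = np.mean((a2 - x) ** 2)         # L_F: mean squared reconstruction error
    # R_F: assumed L1 penalty on the weights (the paper's exact form may differ).
    reg = lam * (np.abs(w1).sum() + np.abs(w2).sum())
    return a2, loss + reg


# Toy check with m = n = 3: identity weights reconstruct the input exactly,
# so only the assumed weight penalty contributes to the error.
x = np.ones((3, 1))
eye, zeros = np.eye(3), np.zeros((3, 1))
_, err_no_penalty = sae_error(x, eye, zeros, eye, zeros, lam=0.0)
_, err_with_penalty = sae_error(x, eye, zeros, eye, zeros, lam=0.5)
```

Penalizing the weights rather than the hidden activations is the design choice the section argues for; the L1 norm is one common way to express such a penalty, chosen here only for illustration.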
To derive the recurrence relationship for the sensitivities, we use the Stochastic Gradient Descent algorithm (SGD) [26]. For the output layer, the SGD for updating the weight and bias matrices can be expressed as follows.
where $\alpha_2$ is the learning rate associated with the output layer. The only complication is that $E_F$ for a multilayer SAE design is an indirect function of the weights and biases. Thus, the chain rule is required to calculate the partial derivatives of $E_F$ with respect to a third variable such as w or b in the hidden and output layers. By applying the chain rule, the derivatives in Equations (14) and (15) can be simplified to the following:
We denote the sensitivity at the output layer as $s_2$, which can be defined as:
Then, Equations (16) and (17) become:
Following the same procedure used for evaluating $s_2$, we can propagate the sensitivities backward from the output layer to the hidden layer as follows.
where $\alpha_1$ is the learning rate associated with the hidden layer and $s_1$ is represented as follows.
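A minimal NumPy sketch of one back-propagation update with a separate learning rate per layer is given below. It assumes linear activations (as stated in the text), a mean-squared-error loss, and an optional L1 weight penalty with a hypothetical coefficient `lam`; the sensitivities here cover the loss term only, and the penalty's subgradient is added separately, which may differ from the paper's derivation. The names `sgd_step` and `mse` are illustrative.

```python
import numpy as np


def sgd_step(x, w1, b1, w2, b2, alpha1, alpha2, lam=0.0):
    """One SGD update with learning rate alpha1 (hidden layer) and alpha2 (output layer)."""
    m = x.shape[0]
    a1 = w1 @ x + b1                       # forward pass, hidden layer
    a2 = w2 @ a1 + b2                      # forward pass, output layer
    s2 = 2.0 * (a2 - x) / m                # output-layer sensitivity of the MSE term
    s1 = w2.T @ s2                         # propagated back through the linear hidden layer
    new_w2 = w2 - alpha2 * (s2 @ a1.T + lam * np.sign(w2))
    new_b2 = b2 - alpha2 * s2
    new_w1 = w1 - alpha1 * (s1 @ x.T + lam * np.sign(w1))
    new_b1 = b1 - alpha1 * s1
    return new_w1, new_b1, new_w2, new_b2


def mse(x, w1, b1, w2, b2):
    """Mean squared reconstruction error of the linear autoencoder."""
    return float(np.mean((w2 @ (w1 @ x + b1) + b2 - x) ** 2))


# Toy run: m = 4 inputs, n = 2 hidden nodes, one sample, hypothetical learning rates.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))
w1, b1 = 0.1 * rng.normal(size=(2, 4)), np.zeros((2, 1))
w2, b2 = 0.1 * rng.normal(size=(4, 2)), np.zeros((4, 1))
initial_mse = mse(x, w1, b1, w2, b2)
for _ in range(50):
    w1, b1, w2, b2 = sgd_step(x, w1, b1, w2, b2, alpha1=0.05, alpha2=0.1)
final_mse = mse(x, w1, b1, w2, b2)
```

Note that `s1` is computed from the weights before they are updated, so both layers use gradients from the same forward pass.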

Experimental Results
As attacks on Internet-connected devices are increasing dramatically, it becomes important to consider a reliable dataset that contains a high volume of diverse traffic and covers a variety of attacks. Following this trend, we tested our proposed IDS architecture on the CICIDS2017 dataset, which covers almost all the common updated attacks such as DDoS, DoS, SQL Injection, Brute Force, XSS, Botnet, Infiltration, and Port Scan attacks. In addition, this section presents two experiments that examine the efficiency and reliability of the proposed SAE network and shows comparisons with other relevant works. To mitigate the effect of the over-fitting problem, we used the HNM algorithm to determine the number of nodes in the hidden layer based on the initial values of the weights and biases in the network.
As shown in Figure 1, data preprocessing is the first step of preparing the records of the dataset, which includes unity-based normalization and eliminating the attributes that have one value in all instances of the dataset. After preprocessing, the dataset was reduced to 70 features. Then, at least 5% of the reduced dataset was randomly selected to be used later by the HNM algorithm. The aim of using the HNM algorithm was to tune the hyperparameters of the SAE architecture, optimize the learning rates for the different layers, and set the percentages of sparsity for the different layers. Because the weights and bias values were initialized randomly, the tuned hyperparameters for the IDS design differed from one run to another. In this paper, we report two experiments to observe how the hyperparameters are tuned based on random initialization; the results are summarized in Table 1. All experiments and simulations were carried out using an Intel Xeon processor at 3.70 GHz with 16 GB RAM, running Windows 10. As illustrated in Table 1, different parametric measures are produced for the first and second experiments. These hyperparameters include: the number of nodes in the hidden layer (n), the learning rate in the hidden layer ($\alpha_1$), the learning rate in the output layer ($\alpha_2$), the percentage of sparsity measured for the hidden layer ($S_P^1$), the percentage of sparsity measured for the output layer ($S_P^2$), the number of epochs (Epoch), and the time in seconds (Time). Additionally, it can be seen that the values of the learning rates can vary from one layer to another. This gives a better feature extraction strategy, where the different layers can adapt to various levels of the learning hierarchy. Meanwhile, the percentages of sparsity, which are computed for the weight matrices, remain almost stable across both experiments.
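The text does not spell out how the percentage of sparsity of a weight matrix is measured. One plausible reading, shown below purely as an assumption, is the share of weights whose magnitude falls below a small threshold. The function name `sparsity_percentage` and the threshold `eps` are hypothetical.

```python
def sparsity_percentage(weights, eps=1e-6):
    """Percentage of entries in a weight matrix (list of rows) with |w| < eps."""
    flat = [w for row in weights for w in row]
    near_zero = sum(1 for w in flat if abs(w) < eps)
    return 100.0 * near_zero / len(flat)


# Hypothetical 2x2 weight matrix with two zeroed connections.
example_weights = [[0.0, 0.5], [0.0, -0.2]]
example_sparsity = sparsity_percentage(example_weights)
```

Under this reading, a value that stays nearly constant between runs would mean that a similar fraction of connections is driven to zero regardless of the random initialization.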

Discussion
As demonstrated in Table 2, the two conducted experiments achieved results that outperform the existing solutions introduced for the updated and different types of network attacks. Thereby, the proposed SAE architecture provides better performance in extracting a good set of features, which could reveal high levels of representation towards various characteristics of the latest intrusion attacks. The test results support this. The features produced by the enhanced SAE technique had learned a latent representation that sensitizes the individual synaptic weights in the hidden layer and generates keys for better classification accuracy and F-measure results. The measurements of true positive rate, false positive rate, precision, recall, total number of epochs required to extract the latent features, and time in seconds (Time) for both experiments are summarized in Table 3. After tuning the hyperparameters of the improved SAE, it required 3925 s to discover 12 latent features in the first experiment and 2034 s to discover five latent features in the second experiment, based on random initialization. Even though the second experiment took less time to represent the latent features, it failed to provide better performance in terms of accuracy and false positive rate.
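The measures reported in the tables follow the standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below computes them from hypothetical counts; the function name `detection_metrics` and the numbers are illustrative only and are not the paper's results.

```python
def detection_metrics(tp, fp, tn, fn):
    """Standard binary detection measures computed from confusion-matrix counts."""
    recall = tp / (tp + fn)                 # true positive rate
    fpr = fp / (fp + tn)                    # false positive rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return {
        "tpr": recall,
        "fpr": fpr,
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only.
metrics = detection_metrics(tp=90, fp=10, tn=890, fn=10)
```

The F-measure is the harmonic mean of precision and recall, so it only approaches 1 when both are high, which is why it is reported alongside accuracy.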

Conclusions
This paper proposes an enhanced design of the SAE architecture for IDS applications. The proposed error function for the SAE is designed to make a trade-off between the latent state representation for more mature features and network regularization by applying the sparsity constraint in the output layer of the proposed SAE network. In addition, the hyperparameters of the SAE are tuned using the HNM algorithm, which was shown to give a better capability of extracting features in comparison with existing algorithms such as MLP-PC, MLE-SVMs, and DMLP. In fact, the proposed SAE can be used not only for network intrusion detection systems, but also for other applications pertaining to feature extraction and pattern analysis. We emphasize that allocating a set of optimal learning rates to the different layers of the proposed SAE network has resulted in an efficient SAE architecture that can be used to extract latent features. The results from experimental tests show that the different layers of the enhanced SAE could efficiently adapt to various levels of the learning hierarchy. Moreover, this provided the SGD algorithm with the ability to dynamically adjust the weights and biases within the network. The test results indicate that it is possible to accelerate the learning process to reach a latent feature representation based on a vector of adaptive learning rates applied to the multiple layers of the proposed SAE network. Finally, additional tests demonstrated that the proposed IDS architecture could provide a more compact and effective immunity system for different types of network attacks, with a detection accuracy of 99.63% and an F-measure of 0.996, on average, when applying the sparsity constraint directly to the synaptic weights within the network.

Conflicts of Interest:
The authors declare no conflict of interest.
