Short Term Traffic State Prediction via Hyperparameter Optimization Based Classifiers

Short-term traffic state prediction has become an integral component of advanced traveler information systems (ATIS) in intelligent transportation systems (ITS). Accurate modeling and short-term traffic prediction are challenging because traffic processes are intricate, stochastic, and dynamic. Existing works in this area follow different modeling approaches that focus on fitting speed, density, or volume data. However, the accuracy of such modeling approaches has frequently been questioned, and short-term traffic state predictions obtained from them are prone to overfitting. We address this issue by modeling short-term future traffic state with state-of-the-art models tuned via hyperparameter optimization. To do so, we focused on different machine learning classifiers: local deep support vector machine (LD-SVM), decision jungles, multi-layer perceptron (MLP), and CN2 rule induction. Moreover, traffic states are evaluated using traffic attributes such as level of service (LOS) horizons and simple if–then rules at different time intervals. Our findings show that hyperparameter optimization via random sweep yielded superior results. The overall prediction performance showed an average improvement of over 95%, with the decision jungle and LD-SVM achieving accuracies of 0.982 and 0.975, respectively. The experimental results show the robustness and superior performance of decision jungles (DJ) over the other methods.


Introduction
Smart cities have emerged at the heart of "next stage urbanization" as they are equipped with fully digital infrastructure and communication technologies to facilitate efficient urban mobility. A smart city fundamentally depends on connected devices, though the real concern is how the collected data are distributed city-wide through sensor technologies via the Internet of Things (IoT). Heterogeneous vehicular networks in a connected infrastructure network are able to sense, compute, and communicate information through various access technologies: Universal Mobile Telecommunications System (UMTS), Fourth Generation (4G), and Dedicated Short-Range Communications (DSRC) [1,2]. In vehicular sensor networks (VSN) and the Internet of Vehicles (IoV), each vehicle acts simultaneously as a receiver, sender, and router to transmit data over the network or to a central transportation agency as an integral part of intelligent transportation systems (ITS) [3,4].

• We extend the exploration of decision jungles and locally deep SVM (LD-SVM) for short-term traffic state prediction using hyperparameter optimization (via random sweep).
• A comprehensive comparison was implemented to demonstrate the ability and the effectiveness of each machine learning model for TSP accuracy.
• Prediction performances were evaluated under different forecasting time-intervals at distinct time scales.
• Short-term traffic state was taken as a function of level of service (LOS) along a basic freeway segment. Study results demonstrated that decision jungles were more efficient and stable at different predicted horizons (time-intervals) than the LD-SVM, MLP, and CN2 rule induction.
The remainder of this paper is organized as follows. Section 2 presents a brief overview of the methods and techniques for TSP in the existing literature. Section 3 describes the preliminaries for different machine learning models used in this study. Section 4 presents study area, data description, and key parameter settings. Section 5 highlights results and discussion. Section 6 includes the comparison of different models. Finally, Section 7 summarizes the conclusions, presents key study limitations, and outlook for future studies.

Related Work
Since the early 1980s, non-linear traffic flow prediction has been the focus of several research studies, as it is regarded as extremely useful for real-time proactive traffic control measures [15,16]. Since their inception in the 1980s, artificial neural networks (ANNs) have been widely used for the analysis and prediction of time series data. They can capture the non-linear relationship between input features and output variables, which in turn can produce effective TSP solutions. For example, Zheng et al. combined Bayesian inference and neural networks to forecast future traffic flow [35]. Ziang and Adeli proposed a time-delay recurrent wavelet neural network, where the periodicity demonstrated its significance for traffic flow forecasting [36]. Parametric methods can obtain better prediction outcomes when the traffic flow data vary temporally. However, these methods assume a variety of restrictive conditions, such as residual normalization and a predefined system structure, and rarely converge because of the stochastic and non-linear characteristics of traffic flow.
To address the limitations of parametric models, recent research studies have proposed different approaches for traffic flow prediction, including linear kernel, polynomial kernel, Gaussian kernel, and optimized multi-kernel SVM (MK-SVM) [37][38][39][40]. MK-SVM predicted the results by mapping the linear parts of historical traffic flow data using the linear kernel, while the residual was mapped using the non-linear kernel. Alternatively, rule induction techniques, which search the training data for propositional if-then rules, can also be used. CN2 is the best-known example of this approach and has been successfully utilized in previous studies for flow prediction [41,42]. Hashemi et al. developed different classification models based on if-then rules for short-term traffic state prediction on a highway segment [43]. Among ANNs, a popular network structure is the multi-layer perceptron (MLP), which has been widely used in many transport applications due to its simplicity and capacity to conduct non-linear pattern classification and function approximation. The MLP model generally works well in capturing complex and non-linear relations, but it usually requires a large volume of data and complex training. Many researchers, therefore, consider it the most commonly implemented network topology [44][45][46]. Recently, Chen et al. adapted a novel approach using dynamic graph hybrid automata for the modeling and estimation of density on an urban freeway in the city of Beijing, China [47]. The authors validated the feasibility of their modeling approach on Beijing's Third Ring Road. A recent study by Zahid et al. proposed a new ensemble-based fast forest quantile regression (FFQR) method for short-term travel speed prediction [48]. It was concluded that the proposed approach yielded robust speed prediction results, particularly at larger time horizons.
Aside from the above-mentioned models, decision trees and forests have a rich history in machine learning and have shown significant progress in TSP, as reported in some of the recent literature [49,50]. Various studies have been conducted to address the shortcomings of traditional decision trees, for example, their sub-optimal efficiency and lack of robustness [51,52]. Similarly, another research study investigated the efficacy of ensemble decision trees for TSP [50] and concluded that trees traditionally generate efficient predictions. At the same time, researchers have concluded that learning with ideal decision trees could be problematic due to overfitting [53]. This approach has a further limitation: the number of nodes in a decision tree increases exponentially with depth, which increases the amount of data required and affects the accuracy [54]. Recently, a study proposed a novel model that couples online seasonal adjustment factors with an adaptive Kalman filter (OSAF-AKF) for estimating real-time seasonal heteroscedasticity in traffic flow series [55].
In contrast, machine learning techniques such as decision jungles and LD-SVM have shown encouraging performance on different classification problems, but they depend heavily on a set of hyperparameters that, in turn, describe different aspects of algorithm behavior [54,56,57]. It is important to note that no suitable default configuration exists for all problem domains. Optimizing the hyperparameters of different models is important for achieving good performance in the realm of TSP [56]. There are two types of hyperparameter optimization: manual and automatic. The manual approach is time-consuming and depends on expert input, while an automatic approach removes the need for expert input. Automatic approaches include the most common methods in practice, such as grid search and random search [58]. Several libraries have recently been introduced to optimize hyperparameters. The Hyperopt library is one that offers different hyperparameter optimization algorithms for machine learning algorithms [59]. Existing EC-based techniques for optimizing hyperparameters [60,61], such as differential evolution (DE) and particle swarm optimization (PSO), are useful since they are conceptually easy and can achieve highly competitive output in various fields [62][63][64][65]. However, these methods require a great deal of calculation and have a low convergence rate in the iterative process. In contrast, hyperparameter optimization methods such as random grid, entire grid, and random sweep have received a great deal of attention. In a random grid, a matrix of all combinations is computed and values are sampled from it for the defined number of iterations, whereas the entire grid evaluates all possible combinations. The difference between the random grid and the random sweep is that the latter selects random parameter values within the defined range, while the former only employs the exact values defined in the algorithm module.
With this understanding, random sweep was chosen for the models conducted in this study for hyperparameter optimization with the intention of improving the accuracy of short-term TSP.
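The difference between the entire grid, the random grid, and the random sweep can be illustrated with a minimal Python sketch. The parameter names (`depth`, `sigma`), their ranges, and the toy scoring function are assumptions made purely for illustration; they are not the settings or the tooling used in this study.

```python
import itertools
import random

# Hypothetical search space (names and ranges are assumptions for illustration).
GRID = {"depth": [1, 3, 5], "sigma": [0.1, 1.0, 10.0]}
RANGES = {"depth": (1, 5), "sigma": (0.1, 10.0)}


def entire_grid(grid):
    """Enumerate every combination of the listed values."""
    keys = sorted(grid)
    combos = itertools.product(*(grid[k] for k in keys))
    return [dict(zip(keys, combo)) for combo in combos]


def random_grid(grid, n_iter, rng):
    """Sample n_iter combinations from the grid; only listed values can occur."""
    return rng.sample(entire_grid(grid), n_iter)


def random_sweep(ranges, n_iter, rng):
    """Sample n_iter configurations with values drawn anywhere inside each range."""
    configs = []
    for _ in range(n_iter):
        d_lo, d_hi = ranges["depth"]
        s_lo, s_hi = ranges["sigma"]
        configs.append({"depth": rng.randint(d_lo, d_hi),
                        "sigma": rng.uniform(s_lo, s_hi)})
    return configs


def best_config(configs, score):
    """Return the configuration with the highest score."""
    return max(configs, key=score)


def toy_score(cfg):
    """Stand-in for a cross-validated accuracy (purely illustrative)."""
    return -abs(cfg["depth"] - 4) - abs(cfg["sigma"] - 2.5)


rng = random.Random(0)
grid_best = best_config(entire_grid(GRID), toy_score)
sweep_best = best_config(random_sweep(RANGES, 20, rng), toy_score)
```

In this sketch the grid can only return values that were listed in advance, whereas the sweep can land on intermediate values such as a `sigma` between 1.0 and 10.0, which is the property the paragraph above attributes to random sweep.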

Preliminaries
Machine learning provides a number of supervised learning techniques for classification and prediction. The objective of a classification problem is to learn a model that can predict the value of the target variable (class label) based on multiple input variables (predictors, attributes). This model is a function that maps an input attribute vector X to the output class label (i.e., $Y \in \{C_1, C_2, C_3, \ldots, C_n\}$). The labeled training set is represented as follows:

$$(X, Y) = (x_0, x_1, x_2, x_3, \ldots, x_n, Y)$$

where Y is the target label class (dependent variable) and the vector X is composed of $x_0, x_1, x_2, x_3, \ldots, x_n$. The macroscopic flow, density, and speed obtained from traffic simulation are the input parameters fed to the machine learning models for short-term traffic prediction. The model learns from these input variables for different time intervals (i.e., 5, 10, and 15 min). The level of service (LOS) of the next time interval is considered as the class label or target variable, so the model predicts the label class one time interval ahead (time duration = 1).

The current study utilized four different machine learning methods for short-term TSP. These methods included LD-SVM, decision jungles, CN2 rule induction, and MLP. The detailed methodology for each technique is presented below.
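As an illustration of how the next-interval LOS can serve as the class label, the following Python sketch pairs the flow, density, and speed of interval t with the LOS of interval t + 1. The field names and the toy records are hypothetical and are not taken from the study's dataset.

```python
def make_supervised(records, horizon=1):
    """Pair features at interval t with the LOS label at interval t + horizon.

    records: time-ordered list of dicts with the (hypothetical) keys
             'flow', 'density', 'speed', and 'los'.
    """
    X, y = [], []
    for t in range(len(records) - horizon):
        current = records[t]
        X.append((current["flow"], current["density"], current["speed"]))
        y.append(records[t + horizon]["los"])
    return X, y


# Toy 5-min aggregated records (made-up numbers, for illustration only).
records = [
    {"flow": 900, "density": 10.0, "speed": 90.0, "los": "A"},
    {"flow": 1400, "density": 17.0, "speed": 82.0, "los": "B"},
    {"flow": 1800, "density": 25.0, "speed": 72.0, "los": "C"},
    {"flow": 2000, "density": 33.0, "speed": 60.0, "los": "D"},
]
X, y = make_supervised(records, horizon=1)
```

The last record has no successor, so it contributes a label but no feature row; the same function could be applied to data aggregated at 10 or 15 min.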

Sensors 2020, 20, 685

Local Deep Support Vector Machine (LD-SVM)
SVM is based on statistical learning theory, as suggested by Vapnik in 1995 for classification and regression [66]. Local deep kernel learning SVM (LD-SVM) is a scheme for efficient non-linear SVM prediction that keeps classification precision above an acceptable limit. Using a local kernel function allows the model to learn arbitrary local feature embeddings, including sparse, high-dimensional, and computationally deep features that bring non-linearity into the model. The model employs efficient routines to optimize the space of local tree-structured feature embeddings for large training sets with more than half a million training points. LD-SVM model training is exponentially quicker and computationally more efficient than traditional SVM model training [57]. LD-SVM can be used for both linear and non-linear classification tasks. It can be regarded as a special type of linear classifier (e.g., logistic regression, LG); however, LG is unable to perform sufficiently well in complicated, non-linear tasks.

The local deep kernel formulation learns a non-linear kernel $K(x_i, x_j) = K_L(x_i, x_j)\, K_G(x_i, x_j)$, where $K_L$ and $K_G$ are the local and global kernels. The product of the local kernel $K_L = \phi_L^{\top} \phi_L$ and the global kernel $K_G = \phi_G^{\top} \phi_G$ leads to the prediction function

$$y(x) = \operatorname{sign}\left(\mathbf{w}^{\top}\left(\phi_G(x) \otimes \phi_L(x)\right)\right) = \operatorname{sign}\left(\phi_L^{\top}(x)\, W^{\top} \phi_G(x)\right)$$

where $W = [\mathbf{w}_1, \ldots, \mathbf{w}_M]$ collects the weight vectors of the M local features, and ⊗ is the Kronecker product. $\phi_L$ is the local feature space and $\phi_G$ is the global feature space; in the loss below, the global features are the inputs themselves, $\phi_G(x) = x$.

Training the LD-SVM and smoothing the tree are shown in Figure 1. Following the LDKL formulation [57], the local features are written as:

$$\phi_{L_k}(x) = \tanh\left(\sigma\, \theta_k'^{\top} x\right) I_k(x) \qquad (8)$$

where $I_k(x)$ is the indicator function for each node k in the tree; θ decides whether a point goes left or right; θ′ is the parameter stacked with the non-linearity; and σ is the sigmoid sharpness for the parameter scaling, which could be set by validation. Higher values imply that the tanh is saturated in the local kernel, while a lower value means a more linear range of operation for θ. The full optimization formula is given in Equation (10). The local deep kernel learning (LDKL) primal for jointly learning θ, θ′, and W from the training data $\{(x_i, y_i)\}_{i=1}^{N}$ can be described as:

$$\min_{W, \theta, \theta'} P(W, \theta, \theta') = \frac{\lambda_W}{2} \operatorname{Tr}\left(W^{\top} W\right) + \frac{\lambda_\theta}{2} \operatorname{Tr}\left(\theta^{\top} \theta\right) + \frac{\lambda_{\theta'}}{2} \operatorname{Tr}\left(\theta'^{\top} \theta'\right) + \sum_{i=1}^{N} L\left(y_i, \phi_L^{\top}(x_i)\, W^{\top} x_i\right) \qquad (10)$$

where $L = \max\left(0,\, 1 - y_i\, \phi_L^{\top}(x_i)\, W^{\top} x_i\right)$; $\lambda_W$ is the weight of the regularization term; $\lambda_\theta$ specifies the amount of space to be left between the region boundary and the nearest data point; and $\lambda_{\theta'}$ controls the amount of curvature allowed in the model's decision boundaries.
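To make the role of the tree-structured local features more tangible, the following Python sketch evaluates the score $\phi_L^{\top}(x) W^{\top} x$ for a complete binary tree stored heap-style. It is an illustrative sketch only: the routing convention (go right when $\theta_k^{\top} x > 0$), the random parameters, and the function names are assumptions, and no training is performed.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def ldkl_score(x, theta, theta_prime, W, sigma, depth):
    """Return phi_L(x)^T W^T x for a complete binary tree stored heap-style.

    theta[k]       : routing vector of node k (used at internal nodes only)
    theta_prime[k] : vector inside the tanh non-linearity of node k
    W[k]           : linear weights attached to node k
    Only nodes on the root-to-leaf path have I_k(x) = 1, so only they contribute.
    """
    score, k = 0.0, 0
    for level in range(depth + 1):
        phi_k = math.tanh(sigma * dot(theta_prime[k], x))  # I_k(x) = 1 on the path
        score += phi_k * dot(W[k], x)
        if level < depth:
            # Assumed routing convention: right child if theta_k . x > 0, else left.
            k = 2 * k + 2 if dot(theta[k], x) > 0 else 2 * k + 1
    return score


def ldkl_predict(x, theta, theta_prime, W, sigma, depth):
    """Binary prediction in {-1, +1} from the sign of the score."""
    return 1 if ldkl_score(x, theta, theta_prime, W, sigma, depth) >= 0 else -1


# Random (untrained) parameters for a tree of depth 2 over 3 input features.
rng = random.Random(0)
depth, dim = 2, 3
n_nodes = 2 ** (depth + 1) - 1
theta = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_nodes)]
theta_prime = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_nodes)]
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_nodes)]
x = [0.5, -1.2, 0.3]
label = ldkl_predict(x, theta, theta_prime, W, sigma=1.0, depth=depth)
```

Because only the nodes on one root-to-leaf path are active, the cost of a prediction grows with the tree depth rather than with the number of nodes, which is consistent with the efficiency argument made above.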

Decision Jungles
Decision jungles are a recent addition to decision forests. They comprise an ensemble of decision directed acyclic graphs (DAGs). Unlike standard decision trees, a DAG in a decision jungle allows multiple paths from the root to each leaf. A decision DAG has a reduced memory footprint and provides superior efficiency compared with a decision tree. Decision jungles are non-parametric models that provide integrated feature selection and classification, and are robust in the presence of noisy features. DAGs have the same structure as decision trees, except that nodes may have multiple parents. DAGs can limit memory consumption by specifying a width at each layer and can potentially help to reduce overfitting [54]. Considering the sets of nodes at two consecutive levels of a DAG, Figure 2 shows that they consist of child nodes $N_c$ and parent nodes $N_p$. Let $\theta_i$ denote the parameters of the split function f for parent node $i \in N_p$, and let $S_i$ denote the set of labeled training samples (x, y) that reach node i. Given $\theta_i$ and $S_i$, the sets of samples that travel from node i through its left and right branches are computed by

$$S_i^{L}(\theta_i) = \{(x, y) \in S_i \mid f(\theta_i, x) \le 0\} \quad \text{and} \quad S_i^{R}(\theta_i) = S_i \setminus S_i^{L}(\theta_i),$$

respectively. $l_i \in N_c$ indicates the left outward edge from parent node $i \in N_p$ to a child node, and $r_i \in N_c$ denotes the right outward edge. Hence, the set of samples reaching any child node $j \in N_c$ is given as:

$$S_j(\{\theta_i\}, \{l_i\}, \{r_i\}) = \Big[\bigcup_{i \in N_p:\ l_i = j} S_i^{L}(\theta_i)\Big] \cup \Big[\bigcup_{i \in N_p:\ r_i = j} S_i^{R}(\theta_i)\Big]$$
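The way several parents can feed the same child node may be easier to follow in code. The sketch below is a minimal illustration under stated assumptions: the axis-aligned threshold split and the toy density samples are hypothetical, and the optimization of the split parameters and edges is not shown.

```python
def split_parent(samples, theta, f):
    """Split S_i into S_i^L (f(theta, x) <= 0) and S_i^R (the remaining samples)."""
    left = [s for s in samples if f(theta, s[0]) <= 0]
    right = [s for s in samples if f(theta, s[0]) > 0]
    return left, right


def child_samples(parents, left_edge, right_edge, f):
    """Collect the samples routed to each child node j.

    parents    : {i: (theta_i, S_i)} for parent nodes i in N_p
    left_edge  : {i: l_i} and right_edge: {i: r_i}, with child ids in N_c
    Several parents may point to the same child, which makes the graph a DAG.
    """
    children = {}
    for i, (theta, samples) in parents.items():
        left, right = split_parent(samples, theta, f)
        children.setdefault(left_edge[i], []).extend(left)
        children.setdefault(right_edge[i], []).extend(right)
    return children


def threshold_split(theta, x):
    """Hypothetical axis-aligned split: theta = (feature index, threshold)."""
    feature, threshold = theta
    return x[feature] - threshold


# Two parents and three children; child 1 is shared by both parents.
parents = {
    0: ((0, 20.0), [((12.0,), "A"), ((30.0,), "D")]),
    1: ((0, 40.0), [((35.0,), "E"), ((50.0,), "F")]),
}
children = child_samples(
    parents,
    left_edge={0: 0, 1: 1},
    right_edge={0: 1, 1: 2},
    f=threshold_split,
)
```

With a tree, two parents would need four children; here three suffice because the right branch of parent 0 and the left branch of parent 1 are merged, which is how a fixed layer width limits memory use.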

Figure 2. Decision jungles (DAGs).

Multi-Layer Perceptron
The most common ANN model is the multi-layer perceptron (MLP). In an MLP, input values are transformed by an activation function f, whose value is the output of the neuron. The MLP is made up of several layers: one input layer, one or more hidden layers, and one output layer. Parameters such as the number of input variables, number of hidden layers, activation function, and learning rate play an important role in the design of the neural network architecture. The MLP is shown in Figure 3. Neurons in the hidden layer and the output layer have activation functions, whereas neurons in the input layer only receive the input dataset and have no activation function. The inputs $x_i$ are multiplied by the weights $w_{ij}$ and summed for each neuron j as

$$net_j = \sum_{i=1}^{n} w_{ij}\, x_i$$

The most commonly applied activation function is the logistic (sigmoid) function, given by the following equation:

$$f(x) = \frac{1}{1 + e^{-x}}$$
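The weighted sum and the sigmoid activation can be combined into a small forward pass. The Python sketch below is illustrative only; the bias terms are an assumption added for generality (set them to zero to match the weighted sum described above), and the weights are made-up values rather than trained parameters.

```python
import math


def sigmoid(z):
    """Logistic activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Weighted sum of the inputs for each neuron, followed by the sigmoid."""
    outputs = []
    for row, bias in zip(weights, biases):
        net = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(sigmoid(net))
    return outputs


def mlp_forward(inputs, layers):
    """Pass the inputs through each (weights, biases) layer in turn."""
    activations = inputs  # the input layer applies no activation function
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# One hidden layer with two neurons and one output neuron (made-up weights).
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.0]),
    ([[1.0, -1.0]], [0.0]),
]
output = mlp_forward([0.2, 0.4, 0.6], layers)
```

Each entry of `layers` holds one weight row per neuron, so adding hidden layers only means appending further `(weights, biases)` pairs.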

CN2 Rule Induction
In this study, rule learning models were also explored for TSP. These models are usually used for classification and prediction. The CN2 algorithm is a classification method designed to efficiently induce simple rules of the form "if condition then predict class," even in domains where noise may occur. Inspired by Iterative Dichotomiser 3 (ID3), the original CN2 uses entropy as the rule evaluation function; the Laplace estimate may be used as an alternative measure of rule quality to correct the undesirable downward bias of entropy, and it is described as follows [67]:

$$\text{LaplaceAccuracy}(R) = \frac{p + 1}{p + n + k}$$

where p represents the number of positive examples in the training set covered by rule R; n represents the number of negative examples covered by R; and k is the number of classes available in the training set.
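A short worked example may help to show why the Laplace estimate is preferred over raw purity. The following Python sketch assumes k = 6 classes (the six LOS levels); the coverage counts are made up for illustration.

```python
def laplace_accuracy(p, n, k):
    """Laplace estimate of rule quality: (p + 1) / (p + n + k).

    p: positive examples covered by the rule
    n: negative examples covered by the rule
    k: number of classes in the training set
    """
    return (p + 1) / (p + n + k)


# A rule covering a single example looks perfect by purity (1/1),
# but the Laplace estimate ranks it below a well-supported rule.
tiny_rule = laplace_accuracy(p=1, n=0, k=6)    # 2 / 7
broad_rule = laplace_accuracy(p=9, n=1, k=6)   # 10 / 16
```

A rule that covers nothing receives the prior 1/k, so the estimate moves away from that prior only as evidence accumulates.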

Study Area
This study was conducted in the city of Beijing, China, which covers an area of 16,410 km² and hosts 21.7 million people. Road transportation is an integral part of the city's routine business, linking most households to workplaces or schools. There are 21,885 km of paved public road in Beijing (as of June 2016), 982 km of which are classified as highways [68]. According to the Beijing census, the number of private cars was close to 5.4 million, in addition to 5.3 million other vehicles in different categories, including 330,100 trucks. The Second Ring Road accounts for six percent of the urban space of Beijing, with clusters of major companies, businesses, and administrative institutions, but generates 30% of the traffic volume per day [68,69]. Within this perspective, integrated urban planning is becoming difficult, so much so that 60% of the historical site of the city lies on the Second Ring Road. Since the traffic hotspots are concentrated mainly in the center of Beijing, we chose a study area at this location [68,70]. The Second Ring Road is approximately 33 km long and includes 37 on-ramps and 53 off-ramps. Figure 4 shows the study area on the Second Ring Road along with the other ring roads. In this study, a basic freeway segment of the Second Ring Road (L = 478.5 m) was selected.

Data Collection and Parameters Setting
The first step in preparing the experiment was to develop a microscopic model using VISSIM (Micro Traffic Simulation Software) to capture all the essential data for the Second Ring Road. When simulating the field conditions, it is essential to calibrate the driving behavior parameters for the traffic simulator, and this was accomplished by standard procedures, as reported in the existing work [71]. In doing so, several simulation iterations were performed, each with a different random seed, to ensure that the model works under the real-time scenario. The proposed methodology for the present study is presented in Figure 5.
In this study, macroscopic traffic parameters (volume, speed, density) were obtained from the VISSIM simulation analysis. Traffic volume or flow rate can be defined as the number of vehicles that pass through a point on a highway or lane during a specific time, and is usually expressed in units of vehicles per hour per lane (v/h/l), while density refers to the number of vehicles occupying a unit length of roadway, and is expressed in vehicles per km or mile per lane (v/m/ln). Occupancy is sometimes used synonymously with density; however, it should be noted that it shows the percentage of time that a road segment is occupied by vehicles. Traffic speed is another important state parameter; it is the distance traversed per unit of time and is typically expressed in km/h or miles/h.

These parameters are further calculated by using the link evaluation in VISSIM. Once the factual freeway architecture is achieved, the key macroscopic characteristics are identified in order to adjust the entire microscopic simulator (e.g., demand flow and split ratio). Demand flow is defined as the traffic volume that utilizes the facility, while split ratio is the directional hourly volume (DHV) in the peak direction, which varies with respect to time, that is, the peak time and off-peak time. Additionally, the real traffic state of the Second Ring Road in this study was obtained from the Beijing Collaborative Innovation Center for Metropolitan Transportation. The model of the road network for the Second Ring Road was then constructed in VISSIM. It has three lanes, each with an average width of 3.75 m, as shown in Figure 6.

Simulations in VISSIM were carried out for 6 h, during the period 6:00 am to 12:00 pm, and a congested regime prevailed for 1.5 to 2 h (i.e., between 7:30 am and 9:30 am), with almost free flow for the remaining hours. Therefore, the transition states from D to F had few labels. Meanwhile, data were collected using different prediction horizons such as 5, 10, and 15 min.

To assess the freeway operations, level of service (LOS), a commonly used performance indicator, was used for qualitative evaluation purposes. The data collected from the VISSIM simulation were further divided into six levels [72], wherein the LOS defines the traffic state of each level. Traffic state is usually characterized by traffic density on a given link and is directly related to the number of vehicles occupying the roadway segment. It also represents the transient boundary conditions between two LOS levels. Moreover, to test the efficacy, classification models were built using Python scripting in Orange and Azure Machine Learning, with the required procedures written to extract the traffic parameters, and the levels of service corresponded to the Highway Capacity Manual (HCM) [43,73]. The data points (in Figure 7) represent different points in time distributed spatially, which together define the LOS at the road segment. In Figure 7, different colors show the states for 15 min, that is, the LOS divided into six sub-levels based on density along the highway segment. We termed these levels as different states (from A to F) and further evaluated them for 5, 10, and 15 min intervals.

Since stratified K-fold cross-validation was adopted to address the issue of imbalanced data, the method aimed to preserve the proportionate frequencies of each LOS class. Thus, it is likely that label D or any other label will be associated with its true representative class.
The actual density-flow captured on a segment of the Second Ring was simulated in VISSIM for a prediction horizon of 15 min and is shown in Figure 7.
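The density-based labeling of traffic states can be expressed as a simple set of if-then rules. The sketch below is a minimal illustration that assumes the density thresholds commonly quoted from the HCM for basic freeway segments (in passenger cars per mile per lane); the exact cut-offs and units used in this study [72] are not reproduced here and may differ.

```python
import bisect

# Assumed upper density bounds (pc/mi/ln) for LOS A to E on basic freeway
# segments, as commonly quoted from the HCM; anything above the last bound is F.
LOS_UPPER_BOUNDS = [11, 18, 26, 35, 45]
LOS_LABELS = ["A", "B", "C", "D", "E", "F"]


def density_to_los(density):
    """Map a density value to its LOS state using the assumed thresholds."""
    return LOS_LABELS[bisect.bisect_left(LOS_UPPER_BOUNDS, density)]


# Densities for consecutive intervals (made-up values) and their states.
densities = [8.0, 11.0, 19.5, 30.0, 44.0, 60.0]
states = [density_to_los(d) for d in densities]
```

Applying the same function to the density of each 5, 10, or 15 min interval yields the sequence of states that the classifiers are trained to predict.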

K-Fold Cross-Validation
We selected the K-fold cross-validation method (using k = 10), which is used to obtain a better-fitted model and to provide appropriate settings for the parameters. The original instances were randomly split into k equal parts. A single part of the k splits was used for validation, and the remaining k − 1 parts were used as the training set to develop the model. This procedure was repeated k times, each time with a distinct validation set, and the model's final accuracy was the average of the accuracies achieved in the individual iterations. This technique has an advantage over repeated random sub-sampling because all the samples are used for both training and validation, and each sample is used exactly once for validation. Several strategies have been suggested by previous studies to avoid the problems of data imbalance and enhance prediction accuracy. In this study, stratified K-fold cross-validation was used to overcome the issues and bias associated with imbalanced and small datasets, since it preserves the percentage of samples for each group or class and is more efficient and robust than other conventional techniques. The parameters that gave the best accuracy were selected using hyperparameter tuning.
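The fold construction described above can be sketched in a few lines of Python. This is an illustrative implementation written for this explanation rather than the routine used in the study; `fit_predict` is a hypothetical callback standing in for any of the classifiers.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, val_idx) pairs whose folds keep class proportions close."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, val


def cross_val_accuracy(labels, k, fit_predict):
    """Average the validation accuracy over the k folds.

    fit_predict(train_idx, val_idx) is a hypothetical callback that trains on
    the training indices and returns one predicted label per validation index.
    """
    scores = []
    for train, val in stratified_k_fold(labels, k):
        predictions = fit_predict(train, val)
        correct = sum(pred == labels[i] for pred, i in zip(predictions, val))
        scores.append(correct / len(val))
    return sum(scores) / k


# Imbalanced toy labels: 20 samples of state A and 10 of state D.
labels = ["A"] * 20 + ["D"] * 10
folds = list(stratified_k_fold(labels, k=10))
baseline = cross_val_accuracy(labels, 10, lambda train, val: ["A"] * len(val))
```

In this toy case every validation fold contains two A samples and one D sample, so a classifier that always predicts A scores two thirds in each fold, which shows why accuracy alone can flatter a model on imbalanced data.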

Model Evaluation
In this study, we selected the most common evaluation metrics in order to assess the performances of the models known as F score and Accuracy. The F score is a measure of the accuracy of a test, also known as the F-1 score or F measure. The F-1 score is defined as the weighted average of recall and precision. To measure the overall performances of the model, the F-1 score was derived as follows: Accuracy is one of the classifications' performance measures, which is defined as the ratio of the correct sample to the total number of samples as follows [74],

K-Fold Cross-Validation
We selected the K-fold cross-validation method (k = 10), which is used to obtain a better-fitted model and provides appropriate parameter settings. The original instances were randomly split into k equal parts. In each round, a single part was used for validation, and the remaining k − 1 parts were used as the training set to develop the model. The procedure was repeated k times, each time with a distinct validation part, and the model's final accuracy was taken as the average of the accuracies achieved across the k iterations. This technique has an advantage over repeated random sub-sampling in that all samples are used for both training and validation, and each sample is used for validation exactly once. Previous studies have suggested several strategies to avoid the problems of data imbalance and enhance prediction accuracy. In this study, K-fold cross-validation was used to overcome the issues and bias associated with imbalanced and small datasets, as the stratified K-fold method preserves the percentage of samples for each class and is more efficient and robust than other conventional techniques. The model parameters were selected through hyperparameter tuning so as to obtain the best accuracy.
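The splitting procedure described above can be sketched in plain Python. This is an illustrative sketch of stratified K-fold, not the Orange or Azure Machine Learning implementation used in the study; the function name `stratified_k_fold` and the toy label counts are assumptions.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, validation_indices) for k stratified folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class round-robin so every fold keeps the class proportions.
        for pos, idx in enumerate(indices):
            folds[(offset + pos) % k].append(idx)
        offset += len(indices)

    for i in range(k):
        validation = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, validation


# Toy, imbalanced LOS labels (hypothetical counts).
labels = ["A"] * 20 + ["D"] * 50 + ["F"] * 30
splits = list(stratified_k_fold(labels, k=10))
```

The reported accuracy would then be the mean of the k per-fold validation accuracies, with every sample used for validation exactly once.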

Model Evaluation
In this study, we selected the most common evaluation metrics, the F score and accuracy, to assess the performance of the models. The F score, also known as the F-1 score or F measure, is a measure of a test's accuracy. It is defined as the weighted (harmonic) average of recall and precision, $F_1 = 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$, and was used to measure the overall performance of the models. Accuracy is a classification performance measure defined as the ratio of correctly classified samples to the total number of samples [74]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (16)$$
where TP and TN denote the numbers of true positives and true negatives, and FP and FN denote the numbers of false positives and false negatives, respectively. P and N denote the numbers of positive and negative samples, respectively, so that TP + FN = P and TN + FP = N.
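Both metrics can be computed directly from the true and predicted labels. The following is a minimal sketch for the multiclass case, assuming the weighted average F-1 score weights each class by its number of true samples (its support); the function names and toy labels are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F-1 scores averaged with weights equal to class support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Toy labels for three LOS states (hypothetical).
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
acc = accuracy(y_true, y_pred)
f1 = weighted_f1(y_true, y_pred)
```

Weighting by support keeps a rare LOS class from dominating the average while still counting its errors.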

Local Deep Kernel Learning SVM (LD-SVM)
The tuning parameters include the LD-SVM tree depth, Lambda W, Lambda theta, Lambda theta prime, number of iterations, and sigmoid sharpness (sigma). Figure 8a shows the impact of the LD-SVM tree depth on accuracy; an accuracy of 92.00% was achieved at a tree depth of 3. The impact of the other parameters (Lambda W, Lambda theta, Lambda theta prime, number of iterations, and sigma) can be seen in Figure 8c. The best tuned values for these parameters, circled in the figure and obtained using 10-fold cross-validation, were 0.00052, 0.34587, 0.1025, 49,247, and 0.0068, respectively. Figure 8b shows the predicted state for the next 15 min.
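The random sweep used for hyperparameter optimization can be illustrated with a generic random search loop. This is a sketch under stated assumptions: the sampling ranges are invented for illustration, and `toy_score` is a hypothetical stand-in for the mean 10-fold cross-validated accuracy of a trained LD-SVM, which is not reproduced here.

```python
import random


def random_sweep(space, score_fn, n_trials=50, seed=0):
    """Sample n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: sampler(rng) for name, sampler in space.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Assumed search ranges (not taken from the paper).
space = {
    "tree_depth": lambda rng: rng.randint(1, 7),
    "lambda_w": lambda rng: 10 ** rng.uniform(-4, 0),   # log-uniform
    "sigma": lambda rng: 10 ** rng.uniform(-3, 0),      # log-uniform
}


def toy_score(params):
    # Placeholder objective; a real sweep would train and cross-validate here.
    return -abs(params["tree_depth"] - 3) - abs(params["lambda_w"] - 0.01)


best_params, best_score = random_sweep(space, toy_score, n_trials=50, seed=0)
```

Fixing the seed makes a sweep repeatable, and running it with several seeds shows how sensitive the selected configuration is to the sampling.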

Decision Jungle
The tuning parameters of the decision jungle model were the maximum depth of the decision directed acyclic graphs (DAGs), the number of decision DAGs, the number of optimization steps per decision DAG layer, and the maximum width of the decision DAGs. Figure 9b shows the impact of the maximum depth of the decision DAGs on accuracy; an accuracy of 92% was achieved at a maximum depth of 77. The best-tuned values of the other parameters, depicted in Figure 9c, were 22 decision DAGs, 5786 optimization steps per decision DAG layer, and a maximum DAG width of 19, all obtained using 10-fold cross-validation. For the 15 min prediction horizon considered here, the DAG structure is illustrated in Figure 9d, which shows 22 DAGs with a maximum depth of 77 levels. The predicted state for the 15 min horizon can be seen in Figure 9a.
Sensors 2020, 20, x FOR PEER REVIEW

CN2 Rule Induction
CN2 uses a statistical significance test to ensure that each new rule represents a real correlation between features and classes. This acts as a pre-pruning technique that discards rules failing the test. At the upper level, it performs a sequential covering approach (also known as separate-and-conquer or cover-and-remove), as previously used by the Algorithm Quasi-optimal (AQ) algorithm. Each CN2 rule returns a class distribution, that is, the number of examples it covers in each class. In the distributions in Table 1 and in the tables of Appendix A, each number is the count of covered examples belonging to class LOS = i, where i ∈ {A, B, C, D, E, F}; together these counts give the observed frequency distribution of examples across the classes. In other words, each count represents the membership of the relevant class. The derived probabilities shown in Table 1 can be used to check the accuracy and efficiency of a particular rule. We adopted exclusive coverage at the upper level, as in unordered CN2 [62], whereas the Laplace estimate was used as the evaluation function at the lower level. Pre-pruning of rules was performed using two methods: (i) likelihood ratio statistic (LRS) tests, and (ii) a minimum threshold for rule coverage. The LRS procedure involves two tests: the first checks a rule against a minimum level of significance α1, and the second compares the rule with its parent rule to check whether the last specialization has a sufficient level of significance α2. The values for the LRS tests and the rules for the different prediction horizons were obtained using 10-fold cross-validation. Figure 10 shows the predicted state for 15 min intervals. The values of α1 and α2 are listed in Table 2. The rules for the next 5 min and 10 min horizons are given in Appendix A, whilst the rules for the next 15 min horizon are given in Table 1.
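The two rule-quality measures named above can be written down compactly. This sketch follows the standard CN2 definitions, the Laplace accuracy estimate (n_c + 1) / (n + k) and the likelihood ratio statistic 2 Σ f_i ln(f_i / e_i); the class counts below are hypothetical and are not taken from Table 1.

```python
import math


def laplace_accuracy(class_counts, target):
    """Laplace estimate (n_c + 1) / (n + k) for the rule's predicted class."""
    n = sum(class_counts.values())
    k = len(class_counts)
    return (class_counts[target] + 1) / (n + k)


def likelihood_ratio_statistic(covered, prior):
    """2 * sum(f_i * ln(f_i / e_i)), with e_i expected under the prior."""
    n_covered = sum(covered.values())
    n_all = sum(prior.values())
    stat = 0.0
    for cls, observed in covered.items():
        if observed == 0:
            continue  # a zero count contributes nothing to the sum
        expected = n_covered * prior[cls] / n_all
        stat += observed * math.log(observed / expected)
    return 2.0 * stat


# Hypothetical class distribution of the examples covered by one rule,
# and of the whole training set, over LOS A-F.
covered = {"A": 0, "B": 0, "C": 2, "D": 18, "E": 0, "F": 0}
prior = {"A": 100, "B": 150, "C": 200, "D": 250, "E": 200, "F": 100}

rule_accuracy = laplace_accuracy(covered, "D")
lrs = likelihood_ratio_statistic(covered, prior)
```

Under the null hypothesis that the rule's class distribution matches the prior, the statistic is approximately chi-square distributed with k − 1 degrees of freedom, so a rule is kept only when the statistic exceeds the critical value for the chosen significance level.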

Multi-Layer Perceptron (MLP)
In neural networks, learning consists of adjusting the connection weights between neurons and the threshold of each functional neuron. We considered one input layer and one hidden layer with 35 neurons. The input layer had four nodes: speed, density, flow, and time duration (interval). The accuracy achieved using 10-fold cross-validation for the different prediction horizons was compared against the learning rate, momentum, activation function, and number of epochs (Table 3). Figure 11a shows the predicted state for the next 15 min horizon. The input layer, hidden layer with its neurons, and output layer of the MLP network are depicted in Figure 11b.
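The architecture described above, four inputs feeding one hidden layer of 35 neurons, can be sketched as a forward pass. This is an illustration only: the sigmoid hidden activation, the six-unit softmax output (one unit per LOS state), the random untrained weights, and the input values are all assumptions.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights plus a zero bias (threshold) per neuron."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(inputs, weights, biases):
    """Weighted sum of the inputs plus bias for every neuron in the layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, hidden, output):
    h = [sigmoid(z) for z in dense(x, *hidden)]
    return softmax(dense(h, *output))


rng = random.Random(0)
hidden = init_layer(4, 35, rng)   # speed, density, flow, interval -> 35 neurons
output = init_layer(35, 6, rng)   # 35 neurons -> LOS A-F (assumed)

x = [0.62, 0.35, 0.48, 1.0]       # hypothetical scaled input features
probs = forward(x, hidden, output)
predicted = "ABCDEF"[probs.index(max(probs))]
```

Training would adjust `weights` and `biases` by backpropagation, governed by the learning rate, momentum, and number of epochs.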

Model Comparison
The weighted average F-1 score and accuracy were evaluated to assess the performance of the different models. The results suggest that decision jungles outperformed the LD-SVM, CN2, and MLP, as shown in Figure 12. The decision jungles and LD-SVM achieved the higher weighted average F-1 scores. In particular, the decision jungle improved on the LD-SVM, CN2, and MLP, obtaining F-1 scores of 0.9777, 0.952, and 0.915 for the 15, 10, and 5 min prediction horizons, respectively. Similarly, the LD-SVM was slightly better than the MLP and CN2, with F-1 scores of 0.904, 0.926, and 0.946 for the 15, 10, and 5 min prediction horizons. However, apart from decision jungles, CN2 rule induction performed better, while the other models failed to achieve a higher F-1 score for the same prediction horizon. Figure 13a,b shows that decision jungles and LD-SVM also achieved higher accuracy than the remaining models, CN2 rule induction and MLP. It can be noted that as the prediction horizon increases, the F-1 score and accuracy decrease. Decision jungles nonetheless remained stable across the 15, 10, and 5 min horizons. Unlike the LD-SVM, the MLP and CN2 were less effective at maintaining stable accuracy across time horizons, although CN2 rule induction (Figure 13c,d) performed well and provided stable results for the 10 and 15 min prediction horizons only.
The experimental results are summarized in Tables 4 and 5, which report the models' F-1 scores and average accuracies, respectively, for the different prediction horizons. Decision jungles clearly achieved a higher F-1 score and a higher accuracy than the other models across prediction horizons, with an average improvement of 95%. The LD-SVM, in turn, performed better than the MLP and CN2 rule induction.

Conclusions
In this study, we improved machine learning models through hyperparameter optimization for short-term traffic state prediction (TSP). The different parameter-tuning schemes were examined by running a number of simulation iterations with different random seeds to ensure that the models worked efficiently under a real-time scenario. The ability of the different machine learning models was comprehensively demonstrated and evaluated using different forecasting intervals at distinct time scales. The short-term traffic state was taken as a function of the level of service (LOS) on a basic freeway segment along the Second Ring Road in Beijing, China. The data used in this study were collected from the traffic simulator VISSIM, and the actual density-flow relationship was captured on the freeway segment for prediction horizons of 15, 10, and 5 min. The experimental results demonstrated the superior and robust performance of decision jungles, which were more efficient and stable across prediction horizons than the LD-SVM, CN2 rule induction, and MLP. Overall prediction performance improved by over 95 percent on average, leading to accuracies of 0.982 and 0.975 for the decision jungle and LD-SVM, respectively. Moreover, the prediction performance of CN2 rule induction, based on if-then rules describing the traffic patterns, was also observed to improve for the different prediction horizons.
This study has some limitations that must be acknowledged. First, it was carried out on a simulated model of a developed urban freeway network, so the simulated data need to be enhanced in future studies. Second, rather than justifying the efficacy of the suggested techniques only on the microscopic simulation platform VISSIM, forthcoming studies may investigate and verify the performance of the proposed methods, with an improved model, on real traffic data.
Future studies may focus on long-term traffic state prediction (hours, days, weeks), which could also be divided into different LOS groups. The study area can be extended from the basic freeway segment to weaving, merging, and diverging segments covering the entire Second Ring Road network. Studies could also incorporate temperature, air quality, weather, and other external factors that are likely to affect travel demand and thus enhance prediction accuracy. In addition, larger and more varied traffic datasets could be used to analyze combinations of flow, occupancy, speed, and other road traffic characteristics, improving predictive accuracy through improved machine learning methods for prediction and analytics.

Acknowledgments:
The authors acknowledge the support of the Beijing University of Technology in providing the essential resources for conducting this study.

Conflicts of Interest:
The authors declare no conflicts of interest.