A Comprehensive Comparative Analysis of the Basic Theory of the Short Term Bus Passenger Flow Prediction

To meet real-time public travel demand, bus operators need to adjust timetables promptly, which requires predicting short-term variations in passenger flow. With the help of advanced public transportation systems, a large amount of real-time passenger flow data is collected from automatic passenger counters (APC), automatic fare collection (AFC) systems, etc. Using these data, different kinds of methods have been proposed to predict future variations of short-term bus passenger flow. Based on their properties and background knowledge, these methods are classified into three categories: linear, nonlinear, and combined methods. Their performance is evaluated in detail in terms of prediction accuracy and the complexity of the training data structure and modeling process. For comparison, some long-term prediction methods are also analyzed briefly. Finally, the paper points out that, as automatic technology collects ever more passenger flow data, using big data technology to speed up data preprocessing and modeling may be a direction worthy of future study.


Introduction
With the rapid expansion of urban populations, transportation problems in big cities are becoming more and more serious [1]. A green, low-cost urban public transportation system [2] is the main choice of urban residents in some densely populated cities [3]. The urban bus transit system [2], which consists of a comprehensive route network and a reasonable departure frequency, is the main component of an urban public transportation system. Compared with the urban rail transit system [2], the urban bus transit system provides the major public transit services in many metropolises of the world, such as London, Beijing, Hong Kong, Singapore, Taipei, New York, Madrid, and Seoul, with Tokyo and Osaka as exceptions [4].
The route network, timetable [5], vehicle schedule, and crew schedule [6] are the four core resources for maintaining normal bus services. A reasonable bus network covering the whole service area facilitates the daily life of urban residents. A rational timetable meets passenger travel demand while taking operating costs into account. In addition, the timetable determines the subsequent vehicle and crew schedules. Consequently, an accurate grasp of the variation of the passenger flow [7] provides the basis for offering quality bus services.
(1) Bus stop passenger flow prediction. The passenger flow at different stops of a bus line is usually unbalanced, and some stops carry far more passengers than others. Stops with high passenger volume are called key stops; because of traffic jams or overloaded buses, passenger delays occur more frequently there. Predicting the passenger flow at these key stops reflects the passenger volume of the whole line, which may help bus enterprises develop a feasible schedule or put on extra express buses to ease the pressure.
(2) Bus line passenger flow prediction. Other studies propose bus line passenger flow prediction models using aggregated transaction records from the AFC system. Passengers may get on or off at any stop of the bus line, and their distribution and trip lengths may affect the bus schedule, so it is useful to predict the passenger flow variation of the whole line. With the help of the APC, relatively accurate passenger data can be collected and used for short-term passenger flow prediction.

Data Source
There are two approaches to obtaining the counts of boarding and alighting passengers at stops or on transit lines: direct and indirect. The direct method mainly uses APC devices to record the boardings and alightings of passengers at a stop; image recognition is another direct way to identify boarding and alighting passengers through the video monitoring systems installed at the stop [13]. The indirect method acquires real-time transaction records through the AFC system. The number of transaction records represents the number of passengers using smart cards, and from the proportion between passengers paying by cash and by smart card, the total number of boarding passengers can be inferred. Both methods yield the passenger volume, but each has its own defects. The direct method obtains the counts of boarding and alighting passengers, but the records do not identify the passengers, so it is hard to reconstruct a particular passenger's travel path. In the indirect method, the transaction record includes the unique ID of the smart card, which can be used to distinguish different passengers. However, on lines with a flat fare policy the smart card is swiped only on boarding, so the records contain the boarding stop but no alighting information. Lu [14] proposes a method to infer the alighting stops from adjacent transaction records of the same smart card, assuming that the stop of the next record is where the passenger alighted on the previous trip. Based on this assumption the round trip can be inferred, but the result is not accurate enough, and because the inferred round trip is not real-time data it cannot be used in short-term passenger flow prediction.
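The alighting-stop inference attributed to Lu [14] can be illustrated with a minimal Python sketch. The record layout (card ID, boarding minute of the day, boarding stop) and the function name are assumptions made for illustration only, not the format or code used in [14].

```python
from collections import defaultdict


def infer_alighting_stops(records):
    """Infer alighting stops from consecutive boardings of the same card.

    records: iterable of (card_id, boarding_minute, boarding_stop) tuples
    from one service day (hypothetical layout).
    Returns (card_id, boarding_minute, boarding_stop, alighting_stop) tuples;
    alighting_stop is None when the card has no later record.
    """
    taps_by_card = defaultdict(list)
    for card_id, minute, stop in records:
        taps_by_card[card_id].append((minute, stop))

    trips = []
    for card_id, taps in taps_by_card.items():
        taps.sort()  # chronological order within the day
        for i, (minute, stop) in enumerate(taps):
            # Assumption from the text: the next boarding stop of the same
            # card is where the passenger alighted on the previous trip.
            next_stop = taps[i + 1][1] if i + 1 < len(taps) else None
            trips.append((card_id, minute, stop, next_stop))
    return trips


records = [("A", 480, "S1"), ("B", 490, "S2"), ("A", 1050, "S5")]
trips = infer_alighting_stops(records)
```

Because the alighting stop only becomes known once the next boarding has been recorded, the sketch also shows why the inferred trips are not available in real time.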
In some cities, such as Beijing or Shanghai [15], the metered fare policy requires passengers to swipe the smart card on both boarding and alighting; in this situation, real-time trip records can be obtained easily.

Data Format of AFC
The AFC system is widely implemented in buses all over the world. A passenger briefly holds the smart card in front of the AFC equipment to complete the payment, and the transaction information is recorded in a certain format. Based on international standards [16], the main properties of the sample data structure are listed in Table 1.
Table 1. The structure of the transaction record.

| Field Name | Description |
|---|---|
| Card ID | The unique number of the smart card |
| Type of smart card | Normal card, coupon card, etc. |
| Driver ID | The unique number of the current bus driver |
| Line ID | The unique number of the bus line |
| Vehicle ID | The unique number of the vehicle |
| Balance | The balance of the smart card after the last transaction |
| Transaction amount | The amount of the last transaction |
| Transaction count | The total number of transactions made with this smart card |
| Transaction time | The time of the last transaction |

Flat fare, sectional or metered fare, and unlimited ride passes are the three main fare policies used in the majority of cities worldwide, such as New York, Washington DC, Berlin, London, Beijing, Shanghai, Hong Kong, Singapore, Madrid, etc. [17]. Under a flat fare, all riders pay for their tickets before the trip. Under a sectional or metered fare, the fare increases with the travel distance. Unlimited ride passes usually include one-day, three-day, or seven-day passes, etc., which allow unlimited rides within a limited time.
The flat fare and unlimited ride pass policies require the passenger to swipe the smart card before the trip, while the sectional fare policy requires the rider to swipe the smart card on both boarding and alighting. Through the AFC system, the number of passengers using smart cards can be obtained. According to the statistics, the proportion of riders using smart cards is over 60%, and during the morning and evening peak hours it may be over 90% [18], so using the transaction records from the AFC system to predict the short-term passenger flow variation is feasible.
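As a hedged illustration of how AFC transaction times might be turned into an interval count series, the following Python sketch aggregates timestamps into fixed-length intervals and scales the card counts by an assumed smart card share. The timestamp format, the function name, and the `card_share` parameter are assumptions for illustration; the share itself would have to come from survey statistics such as those cited in [18].

```python
from collections import Counter
from datetime import datetime


def aggregate_boardings(timestamps, interval_min=30, card_share=1.0):
    """Count transactions per (date, interval index) and scale to all riders.

    timestamps: strings like "2018-03-05 07:05:00" (assumed format).
    card_share: assumed fraction of riders paying by smart card (0 < share <= 1).
    """
    counts = Counter()
    for ts in timestamps:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        slot = (t.hour * 60 + t.minute) // interval_min  # interval index in the day
        counts[(t.date().isoformat(), slot)] += 1
    # Dividing by the card share estimates total boardings, cash riders included.
    return {key: n / card_share for key, n in sorted(counts.items())}


taps = ["2018-03-05 07:05:00", "2018-03-05 07:20:00", "2018-03-05 07:40:00"]
series = aggregate_boardings(taps, interval_min=30, card_share=0.5)
```

A single constant share is a simplification; since the text notes that the share differs between peak and off-peak hours, a per-interval share may be more appropriate in practice.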

Data Format of the APC
The APC records the time and location at which passengers get on or off the bus, together with passenger count statistics. Through pressure, infrared, or image recognition technologies, the APC can distinguish each passenger accurately. The main properties of the sample records from the APC [19] are listed in Table 2.
The vehicle intelligent terminal system is composed of a Global Positioning System (GPS), a wireless network communication system, a vehicle running status collecting system, etc. Using the GPS, information about the real-time vehicle position, heading, speed, etc., is recorded. Using the vehicle running status collecting system, information about fuel consumption, vehicle equipment status, and vehicle CAN data is collected. Through the wireless network communication system, all information collected by the GPS, APC, and AFC can be uploaded to the data center. The main properties of sample records from the vehicle intelligent terminal system [20] are listed in Table 3.
Table 3. The structure of the record from the vehicle intelligent terminal system.

| Field Name | Description |
|---|---|
| Equipment ID | The unique number of the vehicle intelligent terminal system |
| Vehicle ID | The unique number of the vehicle |
| Driver ID | The unique number of the driver |
| Longitude | The longitude of the current vehicle position |
| Latitude | The latitude of the current vehicle position |
| Speed | The real-time speed of the vehicle |
| Heading | The heading direction of the vehicle at the current time |
| Line ID | The unique number of the bus line |
| Stop ID | The unique number of the stop where the bus stops at the current time |
| Distance | The relative distance from the current position to the last station |
| Cumulative distance | The total mileage of the vehicle |
| State | The state of the vehicle intelligent terminal system |

Linear Methods for Short-Term Bus Passenger Flow Forecast
The data collected from the AFC or APC have the natural characteristics of a spatio-temporal sequence with linear relationships. Linear regression or time series analysis methods can be applied directly to analyze the data. Some linear methods or models, such as the Kalman filter method, the wavelet forecast method, and time series analysis methods like the autoregressive (AR), autoregressive moving average (ARMA), or autoregressive integrated moving average (ARIMA) models, have been proposed for short-term forecasting. In this section, two linear forecast methods, the Kalman filter and time series analysis, are introduced in detail.

Kalman Filter
The Kalman filter [21] is a well-known recursive solution to the problem of discrete-data linear filtering. It provides an efficient computational method for estimating the future state of a process by minimizing the mean of the squared error [22]. Recently, some researchers have tried to forecast short-term bus passenger flow with this method. Zhang and Song [23] use only a basic Kalman filter algorithm to estimate the future variation of short-term bus passenger flow at certain bus stops, and they consider the prediction errors acceptable. The Kalman filter algorithm is also used as a component of combination models in some other studies, which will be discussed in Section 5.
The Kalman filter is a learning process that uses a model distinguishing between phenomena and noumena, together with the state of knowledge about the noumena that can be deduced from the phenomena [24]. The recursive algorithm uses the historical state to estimate the present a priori state, uses the phenomena to revise the a priori estimate, and obtains the optimal a posteriori estimate under the minimum mean squared error principle. The concept of the Kalman filter is introduced briefly below; more details can be found in [25,26].
To estimate the state variable $x \in \mathbb{R}^n$ of a discrete-time process, the transition of the system state between adjacent time steps is described by the linear stochastic difference equation, Equation (1):

$$x_k = A x_{k-1} + B u_{k-1} + w_{k-1} \tag{1}$$

The observation variable $z \in \mathbb{R}^m$ is defined as:

$$z_k = H x_k + v_k \tag{2}$$

In Equations (1) and (2), $x_k$ is the state vector of the system at time $k$. The $n \times n$ transition matrix $A$ relates the state at the previous time step $k-1$ to the state at the current step $k$. The $n \times l$ matrix $B$, called the gain matrix, relates the optional $l$-dimensional control input $u \in \mathbb{R}^l$ to the state. The $m \times n$ matrix $H$ relates the state variable $x_k$ to the measurement $z_k$; Equation (2) is also called the measurement equation. The random variable $w_{k-1}$ represents the process noise and $v_k$ represents the measurement noise. They are white noise, assumed independent of each other, with normal probability distributions, defined as Equations (3) and (4):

$$p(w) \sim N(0, Q) \tag{3}$$

$$p(v) \sim N(0, R) \tag{4}$$

where $Q$ and $R$ are the process noise and measurement noise covariance matrices, respectively. According to Equation (1), the a priori state estimate $\hat{x}^-_k$ at time step $k$ is defined as Equation (5):

$$\hat{x}^-_k = A \hat{x}_{k-1} + B u_{k-1} \tag{5}$$

Equation (5) is a sample regression estimation equation, so there is no error term. Using Equations (1)-(5), the following equations can be proved [25,26]:

$$\hat{x}_k = \hat{x}^-_k + K_k \left( z_k - H \hat{x}^-_k \right) \tag{6}$$

In Equation (6), the symbol $K_k$ is called the filter gain matrix, or Kalman gain:

$$K_k = P^-_k H^T \left( H P^-_k H^T + R_k \right)^{-1} \tag{7}$$

where $R_k = E\left[ v_k v_k^T \right]$ is the covariance matrix of the measurement noise.
The symbol $P^-_k$ is the covariance matrix of the error between the a priori estimate and the true value:

$$P^-_k = A P_{k-1} A^T + Q_{k-1} \tag{8}$$

In Equation (8), $Q_{k-1} = E\left[ w_{k-1} w_{k-1}^T \right]$ is the covariance matrix of the process noise. Let $K_k$ be the optimal gain matrix, which minimizes the mean square error matrix $P_k$. Using the extremum principle, $K_k$ can be deduced from Equation (6), and then:

$$P_k = \left( I - K_k H \right) P^-_k \tag{9}$$

The equation group composed of Equations (5)-(9) is called the Kalman filter group. Equations (5) and (8) together are called the time update or prediction equation group. Equations (6), (7), and (9) together are called the measurement update or correction equation group.
According to the derivation above, after setting the initial values $P_0$ and $\hat{x}_0$, the state estimate $\hat{x}_k$ at any time step can be calculated through continuous recursion.
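The recursion between the time update and the measurement update can be sketched in Python for the scalar case. This is a minimal sketch under stated assumptions: the state and measurement are one-dimensional, there is no control input, and the parameter values (`a`, `h`, `q`, `r`, `x0`, `p0`) are illustrative defaults rather than values taken from any cited study.

```python
def kalman_filter_1d(measurements, a=1.0, h=1.0, q=1.0, r=4.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter: returns the a posteriori estimate at each step.

    a, h: scalar counterparts of the matrices A and H (assumed constant).
    q, r: process and measurement noise variances (assumed constant).
    x0, p0: initial state estimate and initial error variance.
    """
    x_hat, p = x0, p0
    estimates = []
    for z in measurements:
        # Time update: a priori state estimate and a priori error variance.
        x_prior = a * x_hat
        p_prior = a * p * a + q
        # Measurement update: gain, a posteriori estimate, a posteriori variance.
        k = p_prior * h / (h * p_prior * h + r)
        x_hat = x_prior + k * (z - h * x_prior)
        p = (1.0 - k * h) * p_prior
        estimates.append(x_hat)
    return estimates


estimates = kalman_filter_1d([10.0] * 50)
```

With a constant measurement sequence the estimates move from the initial value toward the measured level, which shows how the gain weights each new observation against the prior.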

Applications of the Kalman Filter Method in Short-Term Bus Passenger Flow Prediction
Methods or models based on the Kalman filter algorithm are widely used in traffic flow prediction, and almost all studies on bus passenger flow prediction draw on the research achievements of traffic flow prediction.
Zhang and Song [23] use only the Kalman filter algorithm to predict the passenger flow of key bus stops. They count the passenger flow from the AFC transaction records and from the video monitoring systems installed in the buses and at the stops. The service time of stop $l$ is 12 h, which is divided into 30 min intervals, denoted by $t \in \{1, 2, \cdots, T\}$, and the number of passengers arriving at stop $l$ is aggregated in every interval. The study assumes that the passenger count in time interval $t$ on the $N$th day is associated with the passenger counts of the previous $m$ time intervals, from $t-1$ to $t-m$, on the same day and with that of time interval $t$ on the $(N-1)$th day. The average passenger count within time interval $t$ over the previous $n$ days is set as the observation measurement. The covariance matrix of the white noise is calculated from the historical data, and the initial state is set to a zero vector. Based on these settings, the passenger count $\hat{Q}_l(t)$ in time interval $t$ on the $N$th day at stop $l$ is estimated by the Kalman filter method. The study also presents a comparison with the back propagation artificial neural network (BP-ANN). The results show that the Kalman filter method predicts better than the BP-ANN on four performance indices: mean absolute deviation (MAD), mean square error (MSE), mean absolute percentage error (MAPE), and mean square percentage error (MSPE). The authors also point out that the passenger-mass phenomenon can be predicted using the Kalman filter algorithm.
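The four performance indices named above can be computed as in the following Python sketch. The formulas are the commonly used definitions and are an assumption here, since the exact definitions used in [23] are not reproduced in this section; in particular, some authors report the square root of the MSPE instead.

```python
def error_indices(actual, predicted):
    """Return MAD, MSE, MAPE and MSPE for paired actual/predicted counts.

    Assumes every actual value is non-zero (needed for the percentage errors).
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    return {
        "MAD": sum(abs(e) for e in errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MAPE": sum(abs(e) / a for e, a in zip(errors, actual)) / n,
        "MSPE": sum((e / a) ** 2 for e, a in zip(errors, actual)) / n,
    }


indices = error_indices([100.0, 200.0], [110.0, 180.0])
```

MAD and MSE are expressed in passenger units, while MAPE and MSPE are relative, which makes the latter two easier to compare across stops with very different volumes.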

Time Series Theory
A time series is a collection of observations made sequentially in time [27]. A time series composed of a sequence of random variables can be defined as follows [28,29]: a sequence of random variables ordered by time, $X_1, X_2, \cdots, X_t, \cdots$, represents a time series of a random event, denoted $\{X_t \mid t \in T, T = 1, 2, 3, \cdots\}$ or $\{X_t\}$. The values $x_1, x_2, x_3, \cdots, x_n$ represent $n$ sequential observations of the random event, also called an observation sequence of length $n$.
The widely used time series analysis methods can be categorized into two types: general descriptive time series analysis method and statistical time series analysis method [30]. Based on different data processing methods, the statistical time series analysis methods can be classified into the time domain and the frequency domain analysis methods [31]. The time-domain analysis methods are mainly used in short-term passenger flow prediction, and the main modeling steps are shown in Figure 1.
A stationary time series with autocorrelation characteristics can be analyzed by using time domain models, such as AR(p) (autoregressive process with order p), MA(q) (moving average process with order q), and ARMA(p, q) (autoregressive moving average process with orders p and q) [32].
Identifying a proper model to analyze a stationary time series is probably the most difficult task in practice. The orders of the autoregressive and moving average terms need to be obtained before applying a model [33]. The autocorrelation function and partial autocorrelation function (ACF and PACF), used to examine the stationary data, show different features for the AR, MA, and ARMA models; the general rules applied in interpreting these two functions are shown in Table 4 [33].
The next step is to estimate the parameters of the selected model. The most common methods are moment estimation, least squares estimation, maximum likelihood estimation, etc. The parameter estimation can be carried out automatically using R.
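To make the order identification step more concrete, the sample autocorrelation function can be computed as in the Python sketch below. It uses the usual biased estimator and covers only the ACF; the function name is illustrative, and in practice a statistics package (such as R, mentioned above) would normally be used and would also supply the PACF.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation r_0..r_max_lag (biased estimator, r_0 = 1)."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares (must be non-zero)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


acf = sample_acf([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0], max_lag=2)
```

Plotting these values against the lag gives the ACF figure whose shape is then compared with the rules for AR, MA, and ARMA models.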
A non-stationary time series can be transformed into a stationary one by differencing. The autoregressive integrated moving average (ARIMA) model can be used to analyze a non-stationary sequence.
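The differencing operation can be sketched in a few lines of Python. The function name and the `lag` parameter are illustrative assumptions; `lag=1` gives ordinary differencing and a larger lag gives seasonal differencing, for example a lag of 7 for a weekly cycle in daily data.

```python
def difference(series, d=1, lag=1):
    """Apply differencing d times: y_t = x_t - x_{t-lag}."""
    out = list(series)
    for _ in range(d):
        out = [out[t] - out[t - lag] for t in range(lag, len(out))]
    return out


first = difference([1, 3, 6, 10])        # ordinary first difference
second = difference([1, 3, 6, 10], d=2)  # second-order difference
weekly = difference(list(range(10)), lag=7)
```

Each pass shortens the series by `lag` observations, so heavily differenced short series leave little data for fitting.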
The main purpose of the time series method is to predict the future values of the sequence.
For the $\ell$-step prediction, based on the history data $\{x_k, x_{k-1}, \ldots\}$, the variable $x_{k+\ell}$ ($\ell > 0$) at the future time step $k + \ell$ is predicted. The time $k$ is the forecast origin and $\ell$ is the lead time. The symbol $\hat{x}_k(\ell)$ denotes the predicted value of $x_{k+\ell}$, also called the estimated value.
Based on minimum mean square error forecasting, Equation (10) can be proved:

$$\hat{x}_k(\ell) = E\left[ x_{k+\ell} \mid x_k, x_{k-1}, \ldots \right] \tag{10}$$

where the estimate $\hat{x}_k(\ell)$ of $x_{k+\ell}$ is equal to the conditional expectation of $x_{k+\ell}$.
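For an autoregressive model, the conditional expectation in Equation (10) can be evaluated recursively by replacing future noise terms with their expected value of zero. The Python sketch below assumes a zero-mean AR(p) process; the function names and the least-squares AR(1) fit are illustrative and are not the estimation procedure of any cited study.

```python
def fit_ar1(series):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + e_t (zero mean)."""
    num = sum(series[t] * series[t - 1] for t in range(1, len(series)))
    den = sum(series[t - 1] ** 2 for t in range(1, len(series)))
    return num / den


def forecast_ar(history, coeffs, steps):
    """Recursive forecasts for a zero-mean AR(p) model.

    coeffs[0] multiplies the most recent value, coeffs[1] the one before, etc.
    Requires len(history) >= len(coeffs).
    """
    p = len(coeffs)
    window = list(history[-p:])  # last p values, oldest first
    forecasts = []
    for _ in range(steps):
        nxt = sum(c * x for c, x in zip(coeffs, reversed(window)))
        forecasts.append(nxt)
        window = window[1:] + [nxt]  # forecasts feed later steps
    return forecasts


phi = fit_ar1([8.0, 4.0, 2.0, 1.0])
ahead = forecast_ar([8.0], [0.5], steps=3)
```

For AR(1) the forecasts decay geometrically toward the mean as the lead time grows, which is one reason such models are mainly useful at short horizons.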

Applications of Time Series Method in Short-Term Bus Passenger Flow Prediction
Bus passenger flow statistics have the natural characteristics of a time series, so some researchers propose methods based on the ARMA or ARIMA model to predict or analyze the passenger flow variation at bus stops or on bus lines using observation data.
Gu [34] uses the ARMA model to forecast the short-term passenger flow at a transportation hub station in Shanghai, the largest city in China. In that study, the bus hub station, located in Wujiaochang, one of the key areas of Shanghai, is set as the observation point for the passenger flow survey. The survey period is five weeks, and the sampling interval is 10 min. After eliminating weekends, a total of 2575 observations are obtained. The time series composed of the raw observations shows a cyclical pattern and a slowly decaying trend. Using analysis of variance (ANOVA) to eliminate the cyclical and trend effects across different weekdays and different times of day, the authors obtain a stationary sequence. From the ACF and PACF plots of the sequence, the ARMA(2,1) model is selected to predict the hub station passenger flow. Compared with the real data, the prediction accuracy for the hub station is shown to be over 80%, which meets the needs of disseminating passenger flow forecasts to the public and of supporting the operators' hub station management. However, the hub station is a passenger distribution center at the intersection of several lines, and it is difficult to distinguish the directions of the passenger flow. Therefore, the prediction results cannot provide useful information for optimizing or coordinating the scheduling of different lines.
Symmetry 2018, 10, 369
Ma [35] and Xue [36] propose an interactive multiple model (IMM) based approach combined with time series methods to predict the short-term passenger flow of bus lines. In both studies, the source data are the transaction records collected from the AFC system. The data are aggregated at equal time intervals (30 min [35] and 15 min [36]) to compose a time series for each bus line. After correlation, periodicity, and stationarity analysis, three temporally relevant pattern time series are obtained, which are introduced briefly here following [35]. The first is the weekly relevant pattern time series $s^n_w(t) = \{p_{n-7\times 1}(t), p_{n-7\times 2}(t), \cdots, p_{n-7\times n_w}(t)\}$, which consists of the data from the $n_w$ weeks before $p_n(t)$ in the same time interval on the same weekday, where $p_n(t)$ is the passenger count in time interval $(t-1, t]$ on day $n$. The second is the daily relevant pattern time series $s^n_d(t) = \{p_{n-1}(t), p_{n-2}(t), \cdots, p_{n-n_d}(t)\}$, which consists of the data from the $n_d$ days before $p_n(t)$ in the same time interval. The third is the hourly relevant pattern time series $s^n_h(t) = \{p_n(t-1), p_n(t-2), \cdots, p_n(t-n_h)\}$, which consists of the data from the $n_h$ time intervals before $p_n(t)$ on the same day. After the ACF and PACF examination, Ma [35] selects AR(3) for the weekly time series, SARIMA(1,0,0)(0,1,0)$_7$ for the daily time series, and ARIMA(2,1,0) for the hourly time series. Xue [36] selects ARMA(2,2) for the weekly time series, SARIMA(2,0,3)(1,0,0)$_{24}$ for the daily time series, and ARIMA(2,1,0) for the hourly time series. The models selected by Ma [35] and Xue [36] indicate that the weekly time series is a stationary sequence, which suggests that the passenger variation is similar in the same time interval on the same weekday. The daily model reveals the cyclical variation of the passenger flow in the same time interval on different weekdays over a weekly cycle. The hourly model shows the obvious variation trend of the passenger flow in successive time intervals, such as the peak and off-peak periods.
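The construction of the three relevant pattern series can be sketched in Python as follows. The nested-list layout `p[day][interval]` and the function name are assumptions made for illustration; they are not the data structures used in [35] or [36].

```python
def relevant_patterns(p, n, t, n_w, n_d, n_h):
    """Build the weekly, daily and hourly pattern series for count p[n][t].

    p: nested list with p[day][interval] = passenger count (assumed layout).
    n_w, n_d, n_h: number of past weeks, days and intervals to include.
    """
    if n - 7 * n_w < 0 or n - n_d < 0 or t - n_h < 0:
        raise ValueError("not enough history for the requested pattern lengths")
    weekly = [p[n - 7 * j][t] for j in range(1, n_w + 1)]  # same weekday, same interval
    daily = [p[n - j][t] for j in range(1, n_d + 1)]       # previous days, same interval
    hourly = [p[n][t - j] for j in range(1, n_h + 1)]      # previous intervals, same day
    return weekly, daily, hourly


# Synthetic counts encoding day and interval, to make the indexing visible.
counts = [[day * 100 + i for i in range(4)] for day in range(15)]
weekly, daily, hourly = relevant_patterns(counts, n=14, t=3, n_w=2, n_d=3, n_h=2)
```

Each of the three series would then be modeled separately, with the most recent observation listed first in this sketch.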
Time series models with different sampling intervals reveal different variation rules of the passenger flow, and the predicted results differ depending on the factors affecting the passenger flow variation. Both studies propose an IMM-based algorithm that combines the prediction results of the different models and outputs a final prediction to match different situations. The details of the IMM algorithm will be discussed in Section 5.
However, the transaction records from the AFC system are the only data source in Ma [35] and Xue [36], which means that passengers paying by cash are not counted. In addition, transactions are generated only during the bus dwell period, so the data aggregation method used in these studies produces more irregular data.

Other Linear Models for Short-Term Bus Passenger Flow Prediction
In addition to the Kalman filter and time series analysis methods, some researchers use general linear regression [37] and the wavelet model [38] to predict short-term passenger flow.
Yang [37] uses the general linear regression method to predict the short-term passenger flow. The source data are transaction records from the AFC system, aggregated by hour. Using a clustering method, data with a similar variation trend are grouped, and corresponding regression equations are selected to predict the passenger flow trend. Compared with the survey data, the significance test indices are over 0.8, which meets the requirements of bus scheduling operations.

Nonlinear Methods for Short-Term Bus Passenger Flow Prediction
Generally, the longer the sampling interval, the more detailed information is lost and the more stable the sequence becomes. Conversely, the shorter the prediction period, the more strongly various factors affect the passenger flow, leading to more pronounced stochasticity, uncertainty, and nonlinearity [12], which makes linear regression prediction methods difficult to apply. To improve the accuracy, some researchers try to construct complex and comprehensive models to describe the bus passenger flow variation based on artificial neural networks (ANN), support vector machines (SVM), etc.

Support Vector Machine Regression
Support vector machine regression (SVR or SVMR) [39] is a special application of the support vector machine (SVM) to regression. The SVM is a supervised learning model, originally designed for classification; it needs only a small amount of training data to fit the classification boundary or hyperplane and still obtains very good classification results. To keep these good properties of fitting an effective hyperplane, Vapnik [40] proposed an SVM for estimating the regression function, based on the structural risk minimization principle and on minimizing the risk of a quadratic ε-insensitive loss function; this is called support vector regression. Depending on the problem to be solved, SVR can be categorized into linear regression and non-linear regression.
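(1) Linear SVR. The training dataset $S$ consists of $l$ patterns with $n$-dimensional inputs, described as:

$$S = \{(x_1, y_1), (x_2, y_2), \cdots, (x_l, y_l)\}, \quad x_i \in \mathbb{R}^n, \; y_i \in \mathbb{R}$$

Let the regression function $f(x) = \langle \omega, x \rangle + b$, where $\omega \in \mathbb{R}^n$ and $b \in \mathbb{R}$, be estimated from the training dataset $S$. The function $f(x)$ is moved around so as to include the training patterns inside the ε-insensitive tube. By the structural risk minimization (SRM) principle, the generalization accuracy is optimized by the flatness of the regression function, which is guaranteed by a small $\omega$, so the fitting function is chosen to minimize the norm $\|\omega\|^2$ [41]. To cover the data outside the ε-insensitive tube, the problem of finding the best fitting function can be turned into a convex quadratic optimization problem, described as:

$$\min_{\omega, b, \xi, \xi^*} \; \frac{1}{2}\|\omega\|^2 + C \sum_{i=1}^{l} \left( \xi_i + \xi_i^* \right)$$

subject to:

$$y_i - \langle \omega, x_i \rangle - b \le \varepsilon + \xi_i, \quad \langle \omega, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*, \quad \xi_i, \xi_i^* \ge 0$$

where the slack variables $\xi_i$ and $\xi_i^*$ measure how far a pattern lies outside the tube and the constant $C > 0$ balances flatness against these deviations. This regression optimization problem is constructed by optimization theory based on the characteristics of the SVM; the detailed proof can be found in [39]. Ancona [42] interprets SVR based on the ε-insensitive tube from the perspective of SVM classification theory.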
(2) Non-linear SVR. Non-linear SVR follows the same idea as non-linear SVM classification: it uses a kernel function $k(x_i, x_j) = \varphi(x_i) \cdot \varphi(x_j)$ to turn a non-linear regression problem in a low-dimensional space into a linear regression problem in a higher-dimensional space. The detailed proof can be found in [39,42]. Different kinds of kernel functions, such as the linear kernel, polynomial kernel, radial basis function (RBF) kernel, and sigmoid kernel, are used for creating models. More information can be found in [43], which gives a complete introduction to the kernel functions used in SVM applications.
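The ε-insensitive loss behind the tube and the four kernel functions named above can be written down in a short Python sketch. The parameter names and default values (`gamma`, `coef0`, `degree`) follow common conventions and are assumptions for illustration; this sketch shows the building blocks only and does not solve the SVR optimization problem.

```python
import math


def eps_insensitive_loss(y, f_x, eps):
    """Zero inside the tube |y - f(x)| <= eps, linear outside it."""
    return max(0.0, abs(y - f_x) - eps)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(u, v, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


inside = eps_insensitive_loss(10.0, 10.3, eps=0.5)   # within the tube
outside = eps_insensitive_loss(10.0, 12.0, eps=0.5)  # 1.5 beyond the tube
```

Each kernel returns the inner product of two patterns in some feature space without constructing that space explicitly, which is what allows the linear formulation to be reused for non-linear regression.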

Applications of SVR in Short-Term Bus Passenger Flow Prediction
Yang [44] proposes an SVR method based on affinity propagation (AP) to predict the short-term passenger flow of bus stops. The passenger counts are collected manually every 10 min and grouped by weekly cycle. The authors use the AP clustering algorithm to divide the passenger flow observations into cluster subsets based on different principles, such as two groups for weekdays and weekends, or six groups for each weekday and the weekend. The basic idea of using the AP method is to group similar data samples in order to reduce the volatility of the data sequence. Separate SVR models are established to predict the future trend of each subset sequence. The authors show that classifying the data sequence can improve the forecasting accuracy.
Guo [45] adopts least squares support vector machine regression (LS-SVMR) to establish a short-term passenger flow prediction model. The difference between SVR and LS-SVMR is that the former uses inequality constraints while the latter uses equality constraints [46], described as:

$$\min_{\omega, b, e} \; \frac{1}{2}\|\omega\|^2 + \frac{\gamma}{2} \sum_{i=1}^{l} e_i^2 \quad \text{subject to} \quad y_i = \langle \omega, \varphi(x_i) \rangle + b + e_i, \; i = 1, \cdots, l$$

where $e_i$ is the fitting error of the $i$th pattern and $\gamma > 0$ is the regularization constant. In the LS-SVMR algorithm, the QP problem is transformed into solving a system of linear equations, which makes the Lagrange multipliers easy to compute. The LS-SVMR algorithm converges faster than SVR, but its prediction accuracy is lower. Guo [45] sets bus stop A as the observation site with a sampling interval of 5 min, and then manually counts the passengers arriving at stop A and at the adjacent upstream and downstream stops in seven-day cycles. The RBF is selected as the kernel function to construct the LS-SVMR based prediction model. The example given by the authors shows that three factors, namely the passenger flow of the upstream and downstream stops, the number of waiting passengers, and the historical data for the same period, may affect the forecasting accuracy. The prediction model performs best when the adjacent time interval parameter is set to β = 3.
Generally, selecting a proper kernel function requires repeated testing in order to improve the generalization and adaptive capability; more detailed information about multiple kernel learning methods can be found in [47-49]. Based on multiple kernel LS-SVMR, Deng [12] uses three kinds of kernel functions, linear, RBF, and sigmoid, to construct the weighted summation [48] multiple kernel function $K(X, X_i) = \sum_{m=1}^{M} \theta_m k_m(X, X_i)$ (where $M$ is the total number of kernel functions and $\theta_m$ is the balance weight coefficient) to predict the short-term passenger flow. The source data are the transaction records from the AFC system, aggregated every 10 min. The training dataset is constructed from the same time interval on the same weekday of the $m$ previous weeks, the same time interval of the $n$ previous days, and the $s$ successive time intervals before the interval to be predicted. The authors give an example showing that the prediction accuracy with multiple kernel functions is higher than with a single kernel function.
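The weighted summation multiple kernel can be sketched in Python as below. The three base kernels match those named in the text, but their parameter values and the weights are illustrative assumptions; in [12] the weights would be determined as part of the modeling process.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def rbf_kernel(u, v, gamma=0.5):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=0.1, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def multiple_kernel(u, v, kernels, weights):
    """Weighted sum K(u, v) = sum_m theta_m * k_m(u, v)."""
    if len(kernels) != len(weights):
        raise ValueError("one weight is needed per kernel")
    return sum(theta * k(u, v) for theta, k in zip(weights, kernels))


kernels = [linear_kernel, rbf_kernel, sigmoid_kernel]
weights = [0.5, 0.3, 0.2]  # illustrative balance weights theta_m
value = multiple_kernel([0.0, 0.0], [0.0, 0.0], kernels, weights)
```

A non-negative weighted sum of valid kernels is again a valid kernel, so the combined function can be used wherever a single kernel would be; note that the sigmoid kernel is not positive semi-definite for all parameter choices.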

Artificial Neural Network
The artificial neural network (ANN) [50,51] is another non-linear regression method used in the passenger flow prediction field. In essence, an ANN is a layered weighted directed graph [52], which can be divided into an input layer, a middle (or hidden) layer, and an output layer; its structure is shown in Figure 2. The nodes in the directed graph are neurons, such as $x_1$, $s_1$, $y_1$, etc., and the directed edges are nerves. In the ANN graph, the nodes of a lower layer point to the nodes of the layer above by directed edges, and nodes in the same layer do not point to each other. Each directed edge is usually assigned a weight $w_{ij}$, whose subscripts give the indices of the neurons it connects in the two layers. Every neuron may have a linear or nonlinear function $f(\cdot)$ used for neuron transformation, called the neuron function. Taking the neuron $s_1$ as an example, its value is the neuron function applied to the weighted sum of its inputs, $s_1 = f\left( \sum_i w_{i1} x_i \right)$. The input values are transferred upward layer by layer, so that a more complex hyperplane can be constructed for forecasting. Two components of the ANN need to be designed in an application: one is the network structure, and the other is the neuron function $f(\cdot)$. ANNs are generally classified by their network structures or neuron functions, such as the BP-ANN [53], RBF-ANN [54], fuzzy ANN [55], etc.

Applications of ANN in Short-Term Bus Passenger Flow Prediction
In the earlier days, some researchers usually predicted the long-term passenger flow trend for days or years through ANN with manual survey data. For example, Yu [56] uses ANN to forecast the bus passenger trip flow between different city zones. Jiang [57] uses RBF-ANN and BP-ANN to predict the long-term passenger flow in one-year interval respectively, the results show the accuracy of RBF-ANN is better than BP-ANN. Yang [58] proposes a model based on the theory of adaptive neural fuzzy inference system to predict bus line passenger flow in day time interval. Compared with the AR and ARMA, the test results from the fuzzy ANN based model are better in accuracy. The time interval in days is not considered as short term anymore nowadays, but it plays an important guiding role in short-term bus passenger flow prediction, and the authors point out that the next step is to predict the passenger flow in hour interval based on the daytime interval prediction results.
Liu [59] proposes a model based on BP-ANN to predict passenger getting on and off flow at a bus stop. The authors select three layers BP-ANN to construct the predict model. The training data is divided into three groups as the model inputs. The first is the same period on the same weekday of the three weeks before the time prepared to be forecast. The second is the same period of the three days before the time prepared to be forecast. The last group is the three adjacent time intervals before In the ANN graph, the low layer nodes point to upper nodes by directed edges. The nodes in the same layer do not point to each other. The directed edge is usually assigned a weight, like ω 11 , θ 21 , etc. The subscript of the weight represents the number of the neurons in different layer. Every neuron may have a linear or nonlinear function f (·) used for neuron transformation called neuron function. Take the neuron s 1 as an example, the value of s 1 may be described as The input value is transferred upward layer by layer, and a more complex hyperplane could be constructed to forecast. Two components of the ANN need to be designed in the application, one is the network structure; the other is how to design the neuron functions f (·). The categories of the ANN are generally classified by different network structures or neuron functions, such as BP-ANN [53], RBF-ANN [54], fuzzy ANN [55], etc.
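The layer-by-layer transfer described above can be sketched as a forward pass through a three-layer network. The sigmoid neuron function, the linear output layer and the indexing `weights[i][j]` (edge from neuron i in the lower layer to neuron j in the upper layer) are assumptions for illustration; training (for example by back-propagation) is not shown.

```python
import math


def sigmoid(z):
    """An assumed nonlinear neuron function f."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_output(inputs, weights, activation):
    """Compute one layer: neuron j receives sum_i inputs[i] * weights[i][j]."""
    n_out = len(weights[0])
    return [
        activation(sum(inputs[i] * weights[i][j] for i in range(len(inputs))))
        for j in range(n_out)
    ]


def forward(x, omega, theta):
    """Input layer -> hidden layer (weights omega) -> output layer (weights theta)."""
    hidden = layer_output(x, omega, sigmoid)
    return layer_output(hidden, theta, lambda z: z)  # linear output neuron
```

With three inputs and two hidden neurons, `omega` is a 3 x 2 list of lists and `theta` is a 2 x 1 list of lists; the first hidden value is then `sigmoid(x[0] * omega[0][0] + x[1] * omega[1][0] + x[2] * omega[2][0])`.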

Applications of ANN in Short-Term Bus Passenger Flow Prediction
In earlier days, researchers usually used ANN with manual survey data to predict the long-term passenger flow trend over days or years. For example, Yu [56] uses ANN to forecast the bus passenger trip flow between different city zones. Jiang [57] uses RBF-ANN and BP-ANN to predict the long-term passenger flow at one-year intervals, and the results show that RBF-ANN is more accurate than BP-ANN. Yang [58] proposes a model based on the theory of the adaptive neural fuzzy inference system to predict bus line passenger flow at one-day intervals. Compared with AR and ARMA, the fuzzy ANN-based model gives more accurate test results. A one-day interval is no longer considered short term, but this work plays an important guiding role in short-term bus passenger flow prediction, and the authors point out that the next step is to predict the passenger flow at hourly intervals based on the daily prediction results.
Liu [59] proposes a model based on BP-ANN to predict the boarding and alighting passenger flow at a bus stop. The authors select a three-layer BP-ANN to construct the prediction model. The training data are divided into three groups as the model inputs. The first is the same period on the same weekday in each of the three weeks before the time to be forecast. The second is the same period on each of the three days before the time to be forecast. The last group is the three adjacent time intervals immediately before the time to be forecast. A total of 2608 samples are divided into these three groups as BP-ANN inputs; compared with real data, the prediction accuracy is over 90%.
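The three input groups described above (same period in previous weeks, same period in previous days, and adjacent intervals) can be assembled from an equally spaced count series as in the following sketch. The function name, the flat list layout of the series and the `intervals_per_day` parameter are assumptions for illustration.

```python
def build_input(series, t, intervals_per_day, weeks=3, days=3, adjacent=3):
    """Return the input vector for predicting series[t].

    series: passenger counts per time interval, in chronological order.
    t: index of the interval to be forecast.
    """
    week = 7 * intervals_per_day
    if t - weeks * week < 0:
        raise ValueError("not enough history before index t")
    # Same period on the same weekday of the preceding weeks.
    weekly = [series[t - k * week] for k in range(1, weeks + 1)]
    # Same period on the preceding days.
    daily = [series[t - k * intervals_per_day] for k in range(1, days + 1)]
    # Adjacent intervals immediately before t.
    recent = [series[t - k] for k in range(1, adjacent + 1)]
    return weekly + daily + recent
```

Each returned vector has nine components with the default settings and is paired with the observed count `series[t]` to form one training sample.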
Lu [14] proposes a short-term passenger flow prediction model based on RBF-ANN. In this model, the data are transaction records from the AFC system. The records with the same smart card ID, ordered by transaction time, are selected. The travel trace from origin to destination can be deduced from two adjacent records. From the travel trace information, the boarding and alighting counts of passengers are obtained. Using the counts as the training data, the authors use RBF-ANN to predict the passenger flow at stops at one-hour intervals. The paper does not describe the prediction process of the RBF-ANN, but compared with the real data, the absolute relative error of the prediction results is less than 1.5%, which suggests that the model has practical application value.
Wen [60] proposes a fuzzy ANN-based real-time bus passenger flow forecast model. Unlike other models, it uses similarity analysis to calculate the relationship between the stop passenger flow distribution and the line passenger flow distribution, and finds the key stops that affect the line passenger flow distribution. Based on the real-time passenger boarding counts at the key stops, the fuzzy ANN-based model is used to forecast the short-term passenger flow distribution of bus lines at one-hour intervals. The advantage of this model is that using similarity analysis to find the key stops greatly reduces the cost of the passenger flow survey while still giving effective prediction results that meet the precision requirements.
Dong [61] uses BP-ANN, improved BP-ANN and RBF-ANN to predict the passenger flow of a selected bus line from the same transaction records of the AFC system. The records are divided into three categories, the same as the inputs of the ANN in [59]. The results show that the improved BP-ANN and RBF-ANN are more accurate than the traditional BP-ANN model.

Other Nonlinear Methods for Short-Term Passenger Flow Prediction
The grey model (GM) is widely used in the bus passenger flow prediction field. Liu [62] proposes a GM(1,1) model to predict the short-term passenger flow of a bus line using the transaction records from the AFC system. The data are aggregated every 15 min, and the peak-hour data of 10 consecutive Mondays are selected as training data. Compared with real data, the mean relative error of the prediction results is 3.343%, which means that the accuracy of the prediction results is acceptable. Zhang [63] also uses GM(1,1) to predict the time-division passenger flow of a single line, and the relative residual of each group is less than 10%, which meets the second-order accuracy requirement. Shen [64] and Wang [65] describe their results as short-term prediction, but the prediction interval is in years, so they are not considered short-term prediction in this paper. However, the grey model deserves further study in the short-term bus passenger flow prediction field.
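For readers unfamiliar with GM(1,1), the following is a minimal sketch of the standard procedure: accumulate the series, fit the development coefficient a and the grey input b by least squares on the background values, and difference the fitted accumulated series to obtain forecasts. It is the textbook form of the model, not the specific implementation in [62] or [63]; the function names are assumptions.

```python
import math


def fit_gm11(x0):
    """Estimate (a, b) in x0[k] + a * z[k] = b by least squares."""
    x1 = []
    running = 0.0
    for value in x0:
        running += value
        x1.append(running)  # accumulated series
    # Background values: means of consecutive accumulated values.
    z = [0.5 * (x1[k] + x1[k - 1]) for k in range(1, len(x0))]
    y = list(x0[1:])
    m = len(z)
    sum_z, sum_y = sum(z), sum(y)
    sum_zz = sum(v * v for v in z)
    sum_zy = sum(v * w for v, w in zip(z, y))
    slope = (m * sum_zy - sum_z * sum_y) / (m * sum_zz - sum_z ** 2)
    intercept = (sum_y - slope * sum_z) / m
    return -slope, intercept  # a = -slope, b = intercept


def predict_gm11(x0, a, b, steps):
    """Return fitted values for the history plus `steps` forecasts (a must be nonzero)."""
    n = len(x0)
    c = x0[0] - b / a
    x1_hat = [c * math.exp(-a * k) + b / a for k in range(n + steps)]
    return [x0[0]] + [x1_hat[k] - x1_hat[k - 1] for k in range(1, n + steps)]
```

The model needs only a short, positive series, which is one reason it is attractive when few observations are available; it assumes a roughly exponential trend in the data.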

Combined Methods for Short-Term Bus Passenger Flow Prediction
Generally, the short-term passenger flow variation is more random and uncertain than the long-term passenger flow, and it is hard to cover all characteristics of the short-term passenger flow with a single model. In order to make full use of the advantages of different models, researchers combine different kinds of linear or nonlinear models to establish combination models. In this section, the combination models used for short-term passenger flow prediction are reviewed, and some combination models that deserve further study are briefly introduced.
Gong [13] proposes a framework with three sequential stages, including a seasonal ARIMA-based method, an event-based algorithm and a Kalman filter-based algorithm, to predict the short-term passenger flow of bus stops. In the first stage, a time series method is used to predict the arrival passenger count (ArPC) and the empty space count (ESC) of a bus. In the second stage, an event-based method is developed to predict the departure passenger count (DPC) from the stop. In the third stage, a Kalman filter-based method is used to predict the waiting passenger count (WPC) according to the results of the first and second stages. The passenger boarding counts are collected from the APC or from cameras installed in each bus. The WPC data are collected through cameras installed at each bus stop. The researchers suggest that the passenger flow of a bus stop is the number of passengers waiting at the stop, which is strongly related to the bus arrival times and the current passenger capacity of the bus. Based on these principles, the WPC at a bus stop is represented mathematically as:
$$\mathrm{WPC}(t) = \mathrm{WPC}(t-1) + \mathrm{ArPC}(t) - \mathrm{DPC}(t) \qquad (17)$$
which means that the count of passengers waiting at a bus stop at time t depends on the count of passengers waiting at the stop at time t − 1, the count of passengers arriving at the stop at time t, and the count of passengers departing from the stop at time t. Based on the principle represented by (17), the ArPC and DPC need to be predicted before predicting the WPC.
The first stage is to predict the ArPC. The authors analyze the relation between the passenger boarding count data and the arriving data, and propose a passenger allocation approach to compute the historical ArPC. The boarding count can be collected through the APC, while the passenger arrival process is treated as a Poisson distribution with the probability density function $f(x)$. The ArPC at time t is then expressed by (18) in terms of $\tau_i$, the time of the i-th bus arrival event, and $B_{his}(\tau_i)$, the historical boarding count.
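The balance relation (17) can be illustrated with a small simulation. This is a sketch rather than the Kalman filter-based method of [13]: it assumes that all passengers of an interval arrive before the buses of that interval, and it uses the boarding rule stated by the authors for the second stage, namely that the boarding count of a bus arrival event is the minimum of the waiting passengers and the empty space on the bus. The function name and data layout are hypothetical.

```python
def simulate_waiting(initial_waiting, arrivals, empty_spaces):
    """Track waiting (WPC) and departing (DPC) counts per time interval.

    arrivals[t]: passengers arriving at the stop in interval t (ArPC).
    empty_spaces[t]: empty space of each bus arriving in interval t (ESC).
    """
    waiting = initial_waiting
    wpc, dpc = [], []
    for arrived, buses in zip(arrivals, empty_spaces):
        waiting += arrived
        departed = 0
        for space in buses:
            boarded = min(waiting, space)  # limited by queue and by capacity
            waiting -= boarded
            departed += boarded
        dpc.append(departed)
        wpc.append(waiting)  # WPC(t) = WPC(t-1) + ArPC(t) - DPC(t)
    return wpc, dpc
```

For example, starting with 5 waiting passengers, arrivals of `[10, 4, 0]` and bus empty spaces of `[[8], [], [20]]` give waiting counts of `[7, 11, 0]` and departure counts of `[8, 0, 11]`.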
Using (18), the historical ArPC data are obtained, and the data repeat in a weekly cycle. According to the ACF and PACF experiments, the $\mathrm{ARIMA}(1,0,0)(1,0,0)_7$ model is selected to predict the ArPC and ESC. Compared with real data, the average relative errors of the prediction results are 2.94% for the ArPC and 3.02% for the ESC. The second stage is to predict the DPC, which is triggered by the bus arrival events (BAEs) at a bus stop in each time interval. Under the authors' assumption, the boarding count of a BAE is the minimum of the number of passengers waiting at the stop and the empty space available on the bus. The boarding count d is collected from the APC, so the DPC is the sum of the boarding counts of all BAEs that happen during the time interval t. Therefore, predicting the DPC is equivalent to predicting the BAEs at the corresponding stop. Using bus trajectory records from the AVL, an event-based algorithm is proposed to predict the bus arrival time, and by combining it with the prediction results of the ArPC and ESC from the first stage, the DPC can be predicted. The third stage has been discussed in the previous section. At the end of the paper, numerical experiments conducted at three typical bus stops are presented to demonstrate that the proposed framework is robust and accurate.
Ma [35] and Xue [36], discussed in Section 3.2.2, propose three temporally relevant pattern time series (weekly, daily and hourly) to predict the passenger flow on bus lines. The three patterns use AR (ARMA), SARMA and ARIMA, respectively, to capture different characteristics of the time series. In order to maximize the advantages of the single models and optimize the interaction between them, the two papers propose an IMM-based algorithm to combine the predictions of the single models. The output equation of the IMM is defined as $\hat{x}(t|t) = \sum_{j} \mu_j(t)\, \hat{x}_j(t|t)$, where $\hat{x}_j(t|t)$ is the prediction result of model j at time t and $\mu_j(t)$ is the mixed probability.
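The IMM output is a probability-weighted sum of the single-model predictions. The sketch below shows only the model probability update and the hybrid output in their standard IMM form; the re-initialization and Kalman filtering steps are omitted. The scalar Gaussian likelihood, the transition matrix values and the function names are assumptions for illustration, not details taken from [35] or [36].

```python
import math


def gaussian_likelihood(residual, variance):
    """Assumed scalar Gaussian likelihood of a model's measurement residual."""
    return math.exp(-0.5 * residual ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def predicted_probabilities(previous, transition):
    """c_j = sum_i p_ij * mu_i(t-1), with transition[i][j] = p_ij."""
    n = len(previous)
    return [sum(transition[i][j] * previous[i] for i in range(n)) for j in range(n)]


def update_probabilities(predicted, likelihoods):
    """mu_j(t) is proportional to likelihood_j * c_j, normalized to sum to one."""
    weighted = [c * lik for c, lik in zip(predicted, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


def hybrid_output(estimates, probabilities):
    """Combined estimate: sum_j mu_j(t) * x_j(t|t)."""
    return sum(mu * x for mu, x in zip(probabilities, estimates))
```

A model whose prediction is closer to the latest measurement receives a larger likelihood and therefore a larger weight in the combined output at that time step.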
The IMM-based algorithm is a recursive approach with four steps: re-initialization, model filtering, probability updating and hybrid output. The first step calculates the mixed state and covariance at time t based on the transition matrix, using the updated estimates and probabilities from the last recursion. The second step uses the Kalman filter algorithm to update the estimates of each model, and calculates the residual and covariance with the real-time measurement as input. The third step updates the probability of each model at time t based on the likelihood function of each model. The fourth step calculates the final estimate at time t, weighted by the updated probabilities [35]. The IMM algorithm combines the weekly, daily and hourly time series models to match different data states in order to reduce the errors of using one single model. Compared with the single models, the IMM-based hybrid model provides more accurate prediction results.
Liu [38] proposes a short-term passenger flow forecasting method that combines wavelets and time series. Its basic idea is to treat the observation sequence of the passenger flow as a timing signal. The discrete Fourier transformation (DFT) is used to convert the original sequence from the time domain to the frequency domain. The Mallat-based wavelet decomposition method then divides the transformed sequence into one low-frequency main trend signal and five high-frequency interference signals. Compared with the original sequence, each single signal sequence is more stationary. The ARMA model is used to predict the future trend of each series. Finally, the wavelet reconstruction method is used to synthesize these single prediction results into the final prediction result. In the authors' view, the wavelet decomposition and reconstruction method reduces the volatility of the original signal, and therefore the wavelet prediction method can improve the forecast accuracy effectively. However, since the original sequence is not stationary, the result of the ARMA(4,4) model, which is used as the baseline to show that the proposed wavelet method has higher forecast accuracy, needs to be confirmed by further research.
Zhou [66] proposes a sliding window ensemble framework to predict the short-term passenger flow. The framework includes three distinct prediction models. The first is the time-varying Poisson model, which is used to predict the average passenger demand in a fixed time period. The second is the weighted time-varying Poisson model, which is used to predict the passenger flow with seasonal bursts. The third is the ARIMA model, which is used to predict the short-term passenger flow. The three models use long-, medium- and short-term historical data as training data, respectively, and are combined to improve the prediction result. The prediction results show that the accuracy of the ensemble framework is around 79%, which is better than that of the single models. The ensemble framework in [66] is used to develop a prototype mobile phone app that predicts the crowdedness of the bus.
Liu [67] proposes a combined prediction model with BP-ANN and LS-SVM. The combined model includes two steps: first, the BP-ANN is adopted to make an initial prediction with the historical training data, and then the LS-SVM is used to refine the initial prediction. The results show that the combination model improves the prediction accuracy by 1% over the single models. From the authors' perspective, the proposed model combines the nonlinear fitting ability of BP-ANN with the ability of LS-SVM to work with a small amount of training data, which may improve the prediction accuracy.
Pekel [68] develops two hybrid models, the parliamentary optimization algorithm-artificial neural network (POA-ANN) and the intelligent water drops algorithm-artificial neural network (IWD-ANN). POA and IWD are used to optimize the number and the weights of the hidden layer neurons, so as to optimize the model globally.
Some researchers propose other combined methods to predict bus passenger flow, such as combining different linear regression functions, combining the grey model with ANN, combining the grey model with Markov models, etc. However, these methods are used for predicting long-term passenger flow, at intervals of months or years. They are introduced briefly here because their basic ideas deserve further study. Gan [69] gives a combined method with three weighted linear regressions to predict the bus line passenger flow. Ge [70] proposes a combined method based on a genetic algorithm and ANN, and Cai [71] proposes a similar model combining a genetic algorithm and BP-ANN, in which the genetic algorithm is used to define the data weights of the initial input. The grey model (GM) is widely used to construct combined models with other methods; for example, Yang [72] and Shen [64] propose similar combination prediction methods based on GM and Markov models. Wang [65] proposes a random grey ant colony neural network model that combines a random grey model and a recurrent neural network. Ling [73] uses the sum of the absolute values of the predicted sequence and the residual sequence to construct a new sequence as the input of GM(1,1) to predict the passenger flow.

Big Data Technology and Deep Learning Used for Short-Term Bus Passenger Flow Prediction
Short-term bus passenger flow prediction technology is urgently needed by bus companies, but it has developed very slowly. One of the reasons is that it is hard to obtain information about passenger flow. Collecting data by traditional manual survey takes much time and is not suitable for short-term passenger flow prediction. Only in recent years has it become possible to collect passenger flow data in real time with the help of APTS. Since then, the studies proposing short-term passenger flow prediction methods have usually used data from AFC or APC. Another problem is that too much data is collected in a short time. For example, the AFC system used in Beijing produces over 10 million transaction records per day, and traditional methods can hardly handle these data in time. Thanks to big data technology, storing and accessing very large amounts of data are no longer significant problems.
Li [74] uses MapReduce to implement a parallel BP-ANN algorithm. There are two kinds of parallel approaches to implementing the algorithm: one is structure parallelism, and the other is training data parallelism. The second approach is very suitable for big data technology, because the basic idea of MapReduce is to divide the data file into small pieces and use different machines to speed up the processing of the training data. Li [74] uses the data parallel approach to train the BP-ANN, and exchanges the training results through a unified weight list table. Compared with the traditional BP-ANN, the training time of the MapReduce-based algorithm is one-sixth of that of the traditional algorithm, and the prediction accuracy is almost the same.
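The training data parallel idea can be sketched in a few lines. To keep the example short, a linear model trained by gradient descent stands in for the BP-ANN, and the map and reduce steps are run sequentially in one process instead of on a MapReduce cluster; the shared `weights` list plays the role of a common weight table. All names and parameter values are assumptions, and the sketch does not reproduce the algorithm of [74].

```python
def shard_gradient(weights, shard):
    """Map step: sum of squared-error gradients over one data shard."""
    grad = [0.0] * len(weights)
    for features, target in shard:
        error = sum(w * x for w, x in zip(weights, features)) - target
        for i, x in enumerate(features):
            grad[i] += error * x
    return grad, len(shard)


def reduce_gradients(partials):
    """Reduce step: merge the shard results into one mean gradient."""
    total = sum(count for _, count in partials)
    dim = len(partials[0][0])
    return [sum(g[i] for g, _ in partials) / total for i in range(dim)]


def train(shards, dim, rate=0.5, epochs=500):
    """Repeat map and reduce, updating the shared weights after each round."""
    weights = [0.0] * dim
    for _ in range(epochs):
        partials = [shard_gradient(weights, shard) for shard in shards]  # map
        grad = reduce_gradients(partials)  # reduce
        weights = [w - rate * g for w, g in zip(weights, grad)]
    return weights
```

Because the reduce step sums the shard gradients before averaging, the update is the same as the one computed on the undivided dataset, which is consistent with the observation above that the parallel version is faster while the accuracy stays almost the same.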
Deep learning has been successfully applied in many fields with remarkable results. Because it can extract deep, abstract nonlinear features embedded in a dataset, it is attracting some researchers to use it for predicting short-term passenger flow variation.
Liu [75] proposes an unsupervised training model based on a stacked autoencoder (SAE) combined with a supervised training model based on a deep neural network (DNN) to predict hourly passenger flow. Passenger flow data are collected from the AFC over four months. By visualizing the high-level features learned in different hidden layers, the authors explain why the hidden nodes can robustly extract and represent the valuable features embedded in the input data. The experimental results show that the selection and combination of the input features have a great impact on the accuracy of the prediction results. The highest average RMSE is 75% across all scenarios; nevertheless, the authors believe that it is a universal and robust hourly passenger flow prediction model.

Conclusions
In this paper, more than 20 studies about bus passenger flow prediction are discussed. Twenty-two papers about short-term bus passenger flow prediction, listed in Tables 5 and 6, are discussed briefly, and the remaining papers, which concern long-term bus passenger flow prediction and deserve further study, are introduced briefly. The two tables also list the characteristics and the evaluation of the methods used in each paper. Table 5 lists the single linear and nonlinear methods for short-term passenger flow prediction. According to the "Accuracy" column, the nonlinear methods are more accurate than the linear methods. In terms of modeling difficulty, time series and linear regression methods are easier than nonlinear models. The structures of the training or sample datasets used as model inputs are relatively simple, and most of them are original time sequence data or roughly processed data. Therefore, it is easy to use single models to predict the passenger flow and to obtain relatively satisfactory prediction results at low cost. Table 6 lists the combination models used for short-term passenger flow prediction. The first four papers propose methods that combine different types of time series methods to handle different situations in the sample datasets, which are carefully designed and selected from the raw data sequence. Therefore, the complexity of the dataset structure and of the modeling process is very high, which weakens the universality of the combined models. The accuracy of the combination models is better than that of the single models, but the cost is higher.
From the tables listed below, the nonlinear models and combination models are better in prediction accuracy, but the complex modeling process increases the computational complexity, and well-defined data structures require more time to preprocess the raw data. If the computation takes longer than the prediction period, the method loses its practical value. Liu [67] presents a comparison of the computing speed of different methods. In the short-term prediction field, speed will be more important than accuracy, especially for the APTS equipment in buses, which produces a very large amount of data that needs to be processed in time. It is valuable to improve the traditional prediction models with big data technology so as to accelerate the computation.
Since the authors of the references in this paper give no complete experimental datasets, and the evaluation criteria are not uniform, it is difficult to evaluate the methods or models objectively. In further study, all prediction methods mentioned in the references will be tested and comparatively analyzed, using a unified data source under unified evaluation criteria.
The urban bus transit system plays an important role in public transit services, and its service ability has received close and extensive attention. However, traffic jams, weather conditions and even irregular business activities cause sudden changes in the passenger flow. Overcrowded carriages or long waiting times make passengers dissatisfied with the quality of the bus service. As discussed earlier, passenger flow is the basis of public transport operation. Bus operators hope to improve the quality of bus services and increase corporate profits by developing flexible timetables based on short-term passenger flow changes. Because real-time passenger flow statistics were difficult to obtain for a long time, applications based on short-term passenger flow changes could not be put into practice. With the widespread use of APTS equipment, it is now possible to obtain passenger flow statistics in real time, and application research based on short-term passenger flow changes has developed rapidly. DiDi, the largest Chinese ride-sharing company, has launched a shared bus system based on real-time passenger flow demand, which comprehensively considers the number of passengers, travel demand, traffic conditions and other factors to plan routes in real time and dispatch buses to meet diverse public travel needs.
In future work, we will cooperate with a large-scale bus operation enterprise to develop an intelligent bus fleet dispatching platform based on short-term passenger flow changes, helping the enterprise to optimize the timetables and the subsequent bus fleet and crew scheduling. At the same time, a bus fleet operation information release platform will be provided, and the bus arrival information will be pushed through electronic stop boards or a mobile app, so as to help travelers arrange their travel plans reasonably and to improve the bus service level.
Author Contributions: H.Z. conceived the research, proposed the original idea, and wrote most of the paper. L.C. and Y.N. proposed some original ideas of the research and wrote some parts of the paper. X.X. and W.Z. gave related guidance.