Vehicle Counting in Video Sequences: An Incremental Subspace Learning Approach

The counting of vehicles plays an important role in measuring the behavior patterns of traffic flow in cities, as streets and avenues can get crowded easily. To address this problem, some Intelligent Transport Systems (ITSs) have been implemented in order to count vehicles with already established video surveillance infrastructure. With this in mind, in this paper, we present an on-line learning methodology for counting vehicles in video sequences based on Incremental Principal Component Analysis (Incremental PCA). This incremental learning method allows us to identify the maximum variability (i.e., motion detection) between a previous block of frames and the current one by using only the first projected eigenvector. Once the projected image is obtained, we apply dynamic thresholding to perform image binarization. Then, a series of post-processing steps are applied to enhance the binary image containing the objects in motion. Finally, we count the number of vehicles by implementing a virtual detection line in each of the road lanes. These lines determine the instants where the vehicles pass completely through them. Results show that our proposed methodology is able to count vehicles with 96.6% accuracy at 26 frames per second on average, dealing with both camera jitter and sudden illumination changes caused by the environment and the camera auto exposure.


Introduction
Video surveillance systems are multistage computer vision systems capable of performing high-end tasks [1]. Due to the increasing capabilities of hardware and software, the algorithms used to perform motion detection are achieving better performance. However, there is still growing interest in developing new algorithms that are able to overcome limitations produced by human errors, since most of the systems cannot be checked automatically [2].
Video surveillance systems are broadly used on roads and in banks, shops, schools, and other public places in order to protect public security [2,3]. At present, the challenge for these video systems is to provide accuracy and confidence for detecting motion in any scenario. Thus, many applications, for example traffic monitoring, are based on the unsupervised analysis of video sequences.

Incremental Principal Component Analysis
Subspace learning methods have been proposed in order to model the background in video sequences, as described in [27–32]. More specifically, such processes compute the mean image from a set of N available frames, then subtract the mean from all frames, and finally compute their eigenvalues and eigenvectors. As the authors proposed in [27], the most significant eigenvectors are used for computing the difference between the current frame and the previously modeled background frame. These methods are based on a technique known as Batch PCA or Off-line PCA, which is not suitable for working with video sequences. However, novel approaches of incremental subspace learning have proven their effectiveness, allowing the eigenbasis to be updated as soon as new frames are available, making them suitable for prolonged applications in real-time [33]. Nevertheless, as reported in [34], the vast majority of these methods do not take into consideration the update of the mean image as new video frames continue arriving.
The main idea of incremental subspace learning is derived from Singular Value Decomposition (SVD). SVD provides the eigenvectors and eigenvalues sorted in descending order, so that the first eigenvectors provide the maximum variability of the data under analysis. The main contribution of this work is the idea that the first projected eigenvector contains the change of every new incoming frame with respect to previous frames; in other words, the motion present in the frame. In [35], Levy and Lindenbaum proposed the Sequential Karhunen-Loeve (SKL) algorithm for efficiently updating the eigenbasis. However, SKL does not take into account the mean update as new training data arrive. In order to overcome this limitation, Lim and Ross presented a new Incremental PCA (IPCA) algorithm in [34] that properly updates the eigenbasis as well as the mean. Now, suppose that $A = \{I_1, \ldots, I_n\}$ is the block of previous frames, $B = \{I_{n+1}, \ldots, I_{n+m}\}$ is the block containing the new incoming frames, and $C = [A \; B]$ is their concatenation, where $n$ and $m$ are the number of frames in blocks $A$ and $B$, respectively. Then, as formulated in [34], Algorithm 1 is implemented. First, the eigenvectors $U$ and the eigenvalues $\Sigma$ are computed from the SVD of $A - \bar{I}_A$, where $\bar{I}_A$ is the average of the block of previous frames.
Step No. 2 of Algorithm 1 forms the matrix $\hat{B} = B - \bar{I}_B$, where $B$ is the block with incoming frames, and $\bar{I}_B$ is the mean considering the previous and new frames.
Step No. 3 of Algorithm 1 represents in $\tilde{B}$ the components of $B$ orthogonal to $U$. We can also represent $A$ in terms of its eigenvalues and eigenvectors as $A = U \Sigma V^T$, and we can then express the concatenation of $A$ and $B$ as follows [34]:
$$[A \; B] = [U \; \tilde{B}] \begin{bmatrix} \Sigma & U^T B \\ 0 & \tilde{B}^T B \end{bmatrix} \begin{bmatrix} V^T & 0 \\ 0 & I \end{bmatrix}$$
Finally, making a substitution of $\hat{B}$ and $\tilde{B}$, we can represent $R$ and integrate the forgetting factor parameter $f$:
$$R = \begin{bmatrix} f\Sigma & U^T \hat{B} \\ 0 & \tilde{B}^T (\hat{B} - U U^T \hat{B}) \end{bmatrix}$$
The forgetting factor determines the influence of past observations and takes values in the range [0, 1]: a value close to 0 indicates no influence of previous frames, whereas a value close to 1 preserves the influence of all previous frames. Details about the formulation of this algorithm can be found in [34,35].
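A minimal sketch of one update step of this kind is given below, assuming NumPy is available. The function name `ipca_update`, its arguments, the extra mean-correction column appended to the centered block, and the use of a QR decomposition to orthonormalize the residual are illustrative assumptions based on our reading of [34]; they are not taken from this paper's Algorithm 1.

```python
import numpy as np

def ipca_update(U, S, mean_a, n, B, f=1.0, k=None):
    """One incremental PCA update with mean correction (sketch after [34]).

    U: (d, r) current eigenbasis, S: (r,) singular values,
    mean_a: (d, 1) current mean, n: effective number of past frames,
    B: (d, m) block of new frames, f: forgetting factor in [0, 1],
    k: number of components to keep (None keeps all).
    """
    m = B.shape[1]
    mean_b = B.mean(axis=1, keepdims=True)
    # Updated mean of the concatenation, down-weighting the past by f.
    mean_c = (f * n * mean_a + m * mean_b) / (f * n + m)
    # Centered new block plus one extra column correcting for the mean shift
    # (assumed form of the augmented block).
    shift = np.sqrt(n * m / (n + m)) * (mean_b - mean_a)
    B_hat = np.hstack([B - mean_b, shift])
    # Split B_hat into its part inside span(U) and the orthogonal residual.
    proj = U.T @ B_hat
    resid = B_hat - U @ proj
    B_tilde, _ = np.linalg.qr(resid)  # orthonormal basis of the residual
    # Small matrix R whose SVD yields the updated basis.
    R = np.block([
        [f * np.diag(S), proj],
        [np.zeros((B_tilde.shape[1], S.size)), B_tilde.T @ resid],
    ])
    U_r, S_r, _ = np.linalg.svd(R, full_matrices=False)
    U_new = np.hstack([U, B_tilde]) @ U_r
    if k is not None:
        U_new, S_r = U_new[:, :k], S_r[:k]
    return U_new, S_r, mean_c, f * n + m

# Toy data: 4 previous frames and 2 new frames of dimension d = 50.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 4))
B = rng.normal(size=(50, 2))
mean_a = A.mean(axis=1, keepdims=True)
U0, S0, _ = np.linalg.svd(A - mean_a, full_matrices=False)
U1, S1, mean_c, n1 = ipca_update(U0, S0, mean_a, 4, B)
```

With `f = 1.0` and no truncation, the leading singular values returned by the sketch should agree with a batch SVD of the mean-centered concatenation, which is a convenient sanity check before experimenting with smaller values of `f` or with `k = 1`.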

Proposed Methodology
In this section, we present the three main phases of our methodology for counting vehicles: the implementation of Incremental PCA for motion detection in video sequences; the post-processing steps needed to enhance the image obtained from the first projected eigenvector; and finally our algorithm for performing the actual vehicle counting. These phases are summarized in Figure 1 and described in detail in the following subsections.
The block diagram of the proposed methodology takes the video sequence as its input data. The individual frames of the video sequence (Figure 1a) pass through the RGB to grayscale block, which converts each frame from an RGB image to a grayscale image (Figure 1b). The grayscale frames are then the input of the Incremental PCA block, where the first eigenvector is used to project the current grayscale image; the resulting image is shown in Figure 1c, displayed as a heatmap to highlight the variance of the pixels that represent motion. This projected image passes through the Thresholding block, which applies a threshold based on statistical parameters (mean and standard deviation): pixels are set to "one" if they belong to the foreground (movement) and to "zero" if they belong to the background. The result of the Thresholding block is the image in Figure 1d, but this image contains small groups of pixels that do not represent significant movement and can be considered noise in the motion detection. To solve this issue, the Filtering block eliminates the groups of pixels below a minimum size, relative to the size of the frame; the results are presented in Figure 1e. This continuous process uses two frames from the video sequence, the current frame and the previous frame. Another important issue in motion detection is the foreground aperture problem, in which large objects are split because the similarity of their internal pixels causes the motion inside the object to be missed. To overcome this object separation, the Frame fusion block applies the OR operator to the current and previous frames (Figure 1f); the overlapped images form a single object, avoiding the typical split in large moving objects, and the result is shown in Figure 1g.
At this point, some lines in the object are very thin. In order to preserve those lines, the Frame Dilation block is implemented; the results are shown in Figure 1h. In order to avoid double counting of the same object, the internal holes of the moving object must be filled. This task is done in the Binary hole filling block, and the results are depicted in Figure 1i. Finally, a detection line must be set up in each lane of the road. For each lane, an initial terminal point is established. When an object considered as movement passes through the detection line of the corresponding lane, a buffer is initiated to store the number of foreground pixels detected. When the object is no longer on the detection line, the buffer registers the absence of the object as a falling edge, and the Average value block computes the average of the foreground pixels detected, considering the area of the detection line and the number of frames in which the object was detected. Then, a threshold-based Detection block performs the detection and a classification of the object based on the size inferred from the values in the detection line.

Motion Detection Using Incremental PCA
As stated in Section 2, we implemented the Incremental PCA algorithm to find the first eigenvector that contains the maximum data variability between two blocks of images. For our specific purpose, blocks $A$ and $B$ are each made up of one single frame converted to a column vector of size $d \times 1$, where $d$ is the product of the width and height of the frame, that is, $d = I_{width} \times I_{height}$. Note that the frame width and height must be constant throughout the video sequence. As initial parameters, $A$ is set to be a zero matrix of size $d \times 1$, and $B$ is set to be the first input frame of the video sequence in its vector form. Subsequently, the algorithm continues iterating for each new incoming frame. Figure 2 illustrates this iterative process and Figure 3 shows an example of a reconstructed image $I_{proj}$ from the absolute value of the resulting eigenvector $U$. Note that the projected image is shown using a heatmap color representation.
Another important parameter is the forgetting factor $f$. As shown in step 3 of Algorithm 1, this coefficient reduces the contribution of previous observations (frames) as new observations become available incrementally, multiplying $\Sigma$ by $f \in [0, 1]$. Thus, determining this coefficient is a crucial step in the Incremental PCA algorithm, since it is desirable to maintain more information about recent frames than about earlier ones. $f$ can be manually adjusted depending on the application and the speed of the moving objects present in the scene.
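The frame-to-vector conversion and the reconstruction of $I_{proj}$ described at the start of this subsection can be sketched as follows. This is an illustration only: the helper names, the row-major flattening order, and the toy eigenvector values are assumptions and do not come from the paper.

```python
def frame_to_column(frame):
    """Flatten a height x width grayscale frame (list of rows) into a d-vector."""
    # Row-major order is an assumption; any fixed order works if used consistently.
    return [pixel for row in frame for pixel in row]

def column_to_image(vector, width):
    """Reshape a d-vector into rows of `width` pixels, taking absolute values."""
    return [[abs(v) for v in vector[i:i + width]]
            for i in range(0, len(vector), width)]

frame = [[10, 20, 30], [40, 50, 60]]         # toy 2 x 3 frame
column = frame_to_column(frame)              # d = 3 * 2 = 6
eigvec = [0.0, -0.5, 0.1, 0.0, 0.8, -0.2]    # made-up first eigenvector
i_proj = column_to_image(eigvec, width=3)    # projected image, same shape as frame
```

Because the same ordering is used in both directions, a pixel of the reconstructed image corresponds to the same position in the original frame.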
According to our previous experiments in parallel works, a higher $f$ works better for objects moving slower (e.g., pedestrians, animals, slow vehicles in streets), since the objects in motion are retained in the scene for longer intervals of time. In this application, $f = 0.1$ has been established in order to retain just the minimum amount of movement of the vehicles that move at high speed, preventing the "ghosting" effect. Figure 4 shows the effect of the forgetting factor $f$ for different values. It can be noticed that the higher the value of $f$, the better small moving objects in the scene are preserved. However, big moving objects leave a "ghost" behind them, which is not suitable for our application.
Figure 5. Motion detection using Incremental PCA over different scenarios from the Changedetection project: (a) original RGB pedestrian frame; (b) projected heatmap of (a); (c) original RGB twoPositionPTZCam frame; (d) projected heatmap of (c); (e) original RGB streetlight frame; (f) projected heatmap of (e); (g) original RGB PETS2006 frame; (h) projected heatmap of (g).
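To give a rough feeling for the forgetting factor discussed above, the following sketch lists the relative weight that remains on an observation after successive updates. It assumes that the only down-weighting is the multiplication of $\Sigma$ by $f$ at each update and ignores how the SVD re-mixes components, so it is an approximation rather than a description of the full algorithm; `past_weights` is a hypothetical helper name.

```python
def past_weights(f, steps):
    """Relative weight left on a frame after 0, 1, ..., steps - 1 further updates."""
    return [f ** k for k in range(steps)]

# Compare a small, a medium, and a large forgetting factor.
for f in (0.1, 0.5, 0.9):
    print(f, [round(w, 4) for w in past_weights(f, 4)])
```

Under this simplification, $f = 0.1$ leaves about 1% of the weight after two updates, which is consistent with the short memory chosen for fast vehicles, whereas values close to 1 keep earlier frames influential for many updates.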

Post-Processing
Once the projected image $I_{proj}$ is obtained, a binarization process is performed using a threshold value $T$ to obtain a new image $I_{bin}$ that only contains objects in motion. $I_{bin}$ is defined by:
$$I_{bin}(x, y) = \begin{cases} 1, & \text{if } I_{proj}(x, y) > T \\ 0, & \text{otherwise} \end{cases}$$
where $T = 2\bar{\sigma}$. We propose the use of this threshold value because, according to the literature, approximately 95% of the values of a normally distributed variable lie within two standard deviations of its mean. In this case, each frame $I_{proj}$ behaves similarly to a normal distribution with mean 0 (0-valued pixels indicate that no motion has been detected). The remaining data in each frame, approximately 5%, is the "motion" we are interested in. $\bar{\sigma}$ is the mean of the accumulated standard deviation of all previous $I_{proj}$ and is mathematically expressed as:
$$\bar{\sigma} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{std}(I_{proj_i})$$
where $N$ is the number of frames that have been processed incrementally from the beginning, and std() is the standard deviation of the projected image $I_{proj}$ at each instant $i$. std() is expressed by the following formula:
$$\mathrm{std}(X) = \sqrt{\frac{1}{N - 1} \sum_{j=1}^{N} (X_j - \bar{X})^2}$$
Note that the input argument of std(), $X$, must be in vectorized form. Therefore, $I_{proj}$ has to be previously converted to a vector of size $d \times 1$. In addition, note that $\bar{X}$ is the mean value of vector $X$ and that, in this formula, $N = d$.
Allowing $\bar{\sigma}$ to be the dynamic component of the threshold $T$, we make sure that only the motion present in $I_{proj}$ is preserved, removing most of the noise caused by camera jitter, sudden illumination changes in the scene, and the noise induced by the camera itself. $\sigma$ is averaged with its previous values in each new iteration to prevent abrupt changes of $T$, which may lead to highly noisy binarized images. This is mainly because the output eigenvector of the Incremental PCA algorithm does not have a defined range of output values for each iteration. To illustrate this, Figure 6 shows the behavior of $\bar{\sigma}$ and $\sigma$ over time for a given video sequence. We propose the use of $\bar{\sigma}$ because, in our testing video sequences, the vast majority of the scene is occupied by static pixels. This guarantees that most of the histogram distribution of $I_{proj}$ will be centered at its mean value close to 0, indicating no motion.
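The dynamic threshold described above can be sketched in a few lines. This is an illustration under stated assumptions: the class and function names are hypothetical, the sample standard deviation from Python's `statistics.stdev` is used for std(), and a pixel is taken as foreground when it is strictly greater than $T$.

```python
from statistics import stdev

class DynamicThreshold:
    """Keeps the running mean of the per-frame standard deviations."""

    def __init__(self):
        self.total = 0.0   # sum of std(I_proj) over all frames seen so far
        self.count = 0     # number of frames seen so far

    def update(self, i_proj):
        """Add one projected frame (list of rows) and return T = 2 * mean(std)."""
        flat = [v for row in i_proj for v in row]   # vectorized form of the frame
        self.total += stdev(flat)
        self.count += 1
        return 2.0 * self.total / self.count

def binarize(i_proj, threshold):
    """Set pixels above the threshold to 1 and all others to 0."""
    return [[1 if v > threshold else 0 for v in row] for row in i_proj]

tracker = DynamicThreshold()
frame_1 = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]   # one strong response in the center
t1 = tracker.update(frame_1)
i_bin = binarize(frame_1, t1)
frame_2 = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]   # a frame with no response at all
t2 = tracker.update(frame_2)
```

Averaging over all previous frames means that a single unusually flat or unusually busy frame moves the threshold only gradually, which is the smoothing effect the text attributes to $\bar{\sigma}$.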
After the binarization process is completed, we remove small objects in $I_{bin}$ by applying a binary denoising function. This function is described in Algorithm 2.

Algorithm 2 Binary denoising function
1: Determine all the individual objects in $I_{bin}$.
2: Compute the area of each object in pixels.
3: Remove all objects with area less than a threshold value $T_{bin}$ (set their respective pixels to 0).

We have chosen $T_{bin} = 20$ for all our experiments.
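One possible realization of the three steps of the binary denoising function is sketched below in plain Python. The function name, the use of 8-connectivity to define an "individual object", and the toy image are assumptions made for illustration.

```python
from collections import deque

def remove_small_objects(image, min_area=20):
    """Return a copy of a binary image without objects smaller than min_area."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Step 1: breadth-first search collects one 8-connected object.
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Steps 2 and 3: the area is the pixel count; erase small objects.
            if len(pixels) < min_area:
                for py, px in pixels:
                    out[py][px] = 0
    return out

noisy = [
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
clean = remove_small_objects(noisy, min_area=3)   # small threshold for the toy image
```

In the toy image the 2 × 2 block (area 4) is kept and the isolated pixel (area 1) is removed; with the value used in the experiments, `min_area` would be 20.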
Once the binary denoising function has been applied, we perform the OR logical operation between the current binarized frame $I_{bin_k}$ and the previous one $I_{bin_{k-1}}$. The purpose of this operation is to obtain a more "complete" version of the objects in motion present in the current frame. This logical operation is expressed by:
$$I_{OR}(x, y) = I_{bin_k}(x, y) \lor I_{bin_{k-1}}(x, y)$$
where $I_{OR}(x, y)$ is the resulting binary image. This image improves the outcome of the following processing step, as we will show next. Finally, we apply a dilation process to $I_{OR}$ using a small 2 × 2 structuring element, and then we fill all holes present to obtain $I_{fill}$. A hole is basically a "dark" region surrounded by "bright" regions. In a binary image, this translates to dark regions (0's) that cannot be reached from any of the image edges without crossing some bright region (1's). This can be achieved using Algorithm 3, where pixels with value 0 are considered to be the background of the image. Figure 7 summarizes the flow of all post-processing steps, showing their individual outcomes.
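The frame fusion and dilation steps described above can be sketched as follows. The function names are hypothetical, and anchoring the 2 × 2 structuring element at its top-left pixel is an assumption, since the origin of the element is not specified in the text.

```python
def fuse_or(current, previous):
    """Pixel-wise logical OR of two binary frames of equal size."""
    return [[a | b for a, b in zip(row_c, row_p)]
            for row_c, row_p in zip(current, previous)]

def dilate_2x2(image):
    """Dilate with a 2 x 2 structuring element anchored at its top-left pixel."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if image[y][x]:
                # Stamp the 2 x 2 element, clipped at the image border.
                for dy in (0, 1):
                    for dx in (0, 1):
                        if y + dy < h and x + dx < w:
                            out[y + dy][x + dx] = 1
    return out

previous = [[0, 1, 0], [0, 0, 0], [0, 0, 0]]
current = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
i_or = fuse_or(current, previous)
i_dil = dilate_2x2(i_or)
```

In this toy case the two single-pixel detections merge into one vertical segment after the OR, and the dilation thickens it so that thin structures survive the later steps.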

Algorithm 3 Binary hole filling
1: Apply Flood-Fill algorithm using the background edge pixels as its seed.
2: Repeat step 1 until no edge background pixels exist.
3: Create a mask containing only the flood-filled pixels.
4: Set every non-masked pixel in $I_{OR}$ to 1 to obtain $I_{fill}$.
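The steps of Algorithm 3 can be sketched in plain Python as shown below. The function name is hypothetical, the flood fill is seeded from all border background pixels at once (which replaces the repetition in step 2), and 4-connectivity for the background is an assumption.

```python
from collections import deque

def fill_holes(image):
    """Fill background regions that are not connected to the image border."""
    h, w = len(image), len(image[0])
    reached = [[False] * w for _ in range(h)]   # mask of flood-filled pixels
    # Seed the flood fill with every background pixel on the border.
    queue = deque()
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and image[y][x] == 0:
                reached[y][x] = True
                queue.append((y, x))
    # 4-connected flood fill through background pixels.
    while queue:
        cy, cx = queue.popleft()
        for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and image[ny][nx] == 0 and not reached[ny][nx]):
                reached[ny][nx] = True
                queue.append((ny, nx))
    # Every pixel the fill did not reach is either foreground or a hole.
    return [[0 if reached[y][x] else 1 for x in range(w)] for y in range(h)]

ring = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
filled = fill_holes(ring)
```

The background pixel enclosed by the ring cannot be reached from the border, so it is set to 1, while the outer background stays at 0.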

Vehicle Counting
One of the main approaches for vehicle counting is based on extracting information using ROIs (Regions of Interest). In this work, we propose the use of a linear ROI over each individual lane of the avenue to count vehicles. The virtual detection line is highlighted in green and shown in Figure 8.
In order to reduce the number of false positives due to large vehicles detected as multiple vehicles, we consider their presence passing through the detection line by establishing a detection model with the following terms: Detection is a logic variable that indicates whether or not a vehicle exists on the detection line, $I_{bin}(x, y)$ is the image containing the vehicles in motion, $l_1$ and $l_2$ are the initial and final columns of the detection line for a fixed row $x$ of $I_{bin}(x, y)$, $B$ is the number of consecutive frames used to compute the mean value of the area occupied by the vehicles, and $T_{count}$ is the threshold value used to discriminate between noise and actual vehicles [15]. Each individual vehicle is counted only when a falling-edge frame is detected after a previously detected rising-edge frame. This process is illustrated in Figure 10.
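A minimal sketch of counting on one detection line is given below. It is an illustration under assumptions: the function name and arguments are hypothetical, the detection statistic is taken to be the number of foreground pixels on the line averaged over the last `window` frames (the role of $B$ in the text), and the exact normalization used in the paper's model may differ.

```python
from collections import deque

def count_vehicles(frames, row, l1, l2, window, t_count):
    """Count falling edges of a thresholded moving average on a detection line.

    frames: iterable of binary images; row, l1, l2: position of the line;
    window: number of consecutive frames averaged; t_count: detection threshold.
    """
    recent = deque(maxlen=window)   # foreground counts of the last frames
    detected_before = False
    count = 0
    for frame in frames:
        recent.append(sum(frame[row][l1:l2 + 1]))        # pixels on the line
        detected = sum(recent) / len(recent) > t_count   # Detection variable
        if detected_before and not detected:             # falling edge after a rising edge
            count += 1
        detected_before = detected
    return count

def line_frame(n_foreground, width=4):
    """Build a one-row toy frame with n_foreground pixels set on the line."""
    return [[1] * n_foreground + [0] * (width - n_foreground)]

# Two vehicles cross the line, separated by empty frames.
pixels_per_frame = [0, 0, 4, 4, 4, 0, 0, 0, 3, 3, 0, 0, 0]
frames = [line_frame(n) for n in pixels_per_frame]
total = count_vehicles(frames, row=0, l1=0, l2=3, window=2, t_count=1)
```

Averaging over several frames keeps a brief gap inside a long vehicle from producing an early falling edge, which is how this kind of model limits double counting of large vehicles.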

Results
We evaluated the effectiveness of our proposed methodology by using four video sequences. The first three sequences were recorded by ourselves during the daytime in different places in Mexico City. The first two sequences show vehicles moving towards the camera position and the third one shows vehicles moving in the opposite direction. The last video sequence, called Highway, was taken from the Changedetection project, a website that hosts an academic benchmark for testing and ranking existing and new algorithms for change and motion detection, providing several datasets and tools [36]. Previews of the four video sequences are shown in Figure 11. We recorded our three video sequences using a Sony DCR-SR100 video camera. Video No. 1 was recorded using automatic light exposure. This video presents camera jitter, especially when large vehicles pass below the pedestrian bridge where the camera was placed. This video also contains a concrete mixer truck and a large public service vehicle. Due to the auto exposure configuration, artificial illumination changes are induced when those large vehicles are present; this issue arises not only when large vehicles are in the scene, but also alongside the foreground aperture problem. The camera jitter and artificial illumination changes are handled by the IPCA motion detection framework, and the foreground aperture problem is addressed by the post-processing stages described in Section 3.2, particularly by the frame fusion block in Figure 1. Videos No. 2 and No. 3 were recorded using the same video camera, but configured with manual light exposure, so that no changes in illumination are induced by large, bright vehicles. In Video No. 2, the traffic flows from the bottom of the frame to the top; the purpose of this experiment is to evaluate the effectiveness of the vehicle counting process when the object is decreasing in relative size, and the results demonstrate that our framework can handle this scenario. Video No. 2 and Video No. 3 present a small amount of camera jitter and a gradual environmental illumination change caused by a passing cloud; in both cases, the IPCA motion detection framework is able to manage these issues. In Video No. 3, the detection line in the rightmost lane is in the shadow of some bushes, which induces changes due to the natural movement of the leaves and branches, but the combination of the IPCA-based motion detection and the detection line allows the proposed method to tackle the interference in motion detection caused by the shadows of the bushes. In Video No. 4, the angle between the traffic flow and the camera plane is slightly different from that of the other videos; this configuration can cause moving objects in independent lanes to overlap, especially as the angle increases. In order to avoid this problem when counting vehicles, the camera should be positioned so that the camera plane is perpendicular to the traffic flow. All four videos were previously converted to individual frames of size 320 × 240 for convenience of analysis. This size normalization implies $I_{width} = 320$ pixels, $I_{height} = 240$ pixels, and finally $d = I_{width} \times I_{height}$ = 76,800 pixels.
On average, the entire process runs at 26 frames per second (fps) on a standard 2.0 GHz dual core PC. Similarly, from Tables 1-4 it can be seen that the average accuracy of the system was 96.6%.

Discussion
Intelligent transportation systems are becoming very important and will play a vital role in the smart cities of tomorrow. Specifically, vehicle counting is of great importance for many real-world applications, such as urban traffic management. Several methodologies have been proposed in order to improve the overall quality, performance, efficiency, and cost of these kinds of systems. Our proposed methodology only addresses the problem of counting vehicles under some of the most common problems, such as small camera jitter and illumination changes due to the environment or the camera auto exposure time. We acknowledge that there exists an immense number of problems and challenges yet to be solved. However, related works have also addressed very specific challenges, since no general solution exists. In Table 5, we summarize related works as briefly as possible by their performance in terms of accuracy, fps, and type of hardware used. Similarly, in Table 6 we show some comments about the related works.
Table 6. Comments about related works.

Conclusions
In this paper, we presented a methodology based on incremental subspace learning for detecting changes in consecutive frames of video sequences. The resulting vector of this incremental learning process is reconstructed into an image. This image is then post-processed for detecting regions where motion is present. Finally, a statistical algorithm based on the average value of the frames is used to determine the presence of vehicles and also to count them. Our proposed methodology has proven to be useful in real scenarios (as described in the Results section) where light conditions change over time due to the environment and also due to the camera auto exposure. Moreover, it can also handle small camera jitter during several continuous frames with no additional filtering. It is clear that our specific application of Incremental PCA is somewhat similar to the frame differencing methodology for motion detection. However, we make a clear distinction by performing a statistical difference between a frame made up from previous accumulated observations and the current one. Additionally, the fact that the forgetting factor $f$ can "discriminate" earlier observations (frames) to a lesser or greater extent makes this methodology flexible for different applications, as it provides an improved version of a standard frame differencing methodology. Experimental results have demonstrated that, in most cases, our methodology is able to count vehicles effectively with up to 100% accuracy, while maintaining a frame rate suitable for real-time implementation. In future work, descriptive algorithms can be implemented for detecting vehicles given proposed regions of objects in motion in order to perform a more robust and complete segmentation. Lastly, our future scope is to apply Deep Learning models for performing vehicle classification and to mine data for security and video surveillance purposes.