Real-Time Detection and Recognition of Multiple Moving Objects for Aerial Surveillance

Detection of moving objects by unmanned aerial vehicles (UAVs) is an important application in aerial transportation systems. However, many problems must be handled, such as high-frequency jitter from the UAV, small object sizes, low-quality images, computation time reduction, and detection correctness. This paper considers the problem of detecting and recognizing moving objects in a sequence of images captured from a UAV. A new and efficient technique is proposed to achieve this objective in real time and in a real environment. First, feature points between two successive frames are found to estimate the camera movement and stabilize the image sequence. Then, regions of interest (ROIs) of the objects are detected as moving object candidates (foreground). Furthermore, static and dynamic objects are classified based on the most frequent motion vectors occurring in the foreground and background. Based on the experimental results, the proposed method achieves a precision rate of 94% and a computation speed of 47.08 frames per second (fps). In comparison, the performance of the proposed method surpasses that of existing methods.


Introduction
There has been increased worldwide interest in unmanned aerial vehicles (UAVs) used for surveillance in recent years due to their high mobility and flexibility. In general, a UAV with an attached camera flying over the mission area can be controlled manually by an operator or automatically using computer vision. One of the most important tasks of aerial surveillance is the detection of moving objects, which conveys essential information in images, such as pedestrian detection and tracking [1][2][3], vehicle detection and tracking [4,5], object counting [6], estimation and recognition of object activity [7][8][9], human and vehicle interactions [10], intelligent transportation systems [11,12], traffic management [13,14], and autonomous robot navigation [15,16].
Several studies have proposed methods to detect moving objects using stationary cameras, such as the Gaussian Mixture Model (GMM) [17], the Bayesian background model [18], Markov Random Fields (MRF) [19,20], and frame differences [21,22]. These methods extract and identify moving objects by seeking pixel changes in each frame. However, these techniques rely on static pixels in the images and are not suitable for processing images from moving cameras, which have dynamic pixels. Stationary-camera approaches therefore cannot be applied directly to videos from moving platforms, e.g., aerial vehicles, mobile robots, and handheld cameras. Thus, the problem of detecting moving objects using a moving camera has attracted the attention of researchers in recent years [23].
Detecting moving objects using UAVs involves many difficulties for real-time implementation in real environments. These difficulties include camera movement, dynamic backgrounds, abrupt motion of the objects or camera, rapid illumination changes, stationary objects camouflaged as moving objects, changes in moving object appearance, noise from low-quality images, and so on. Several approaches have been proposed to detect moving objects from moving cameras using object segmentation techniques. Saif et al. [24] presented a dynamic motion model using moment invariants and segmentation, which processes one frame per second and is therefore not fast enough for real-time detection; their result also contains some false detections, such as a parked car recognized as a moving object. Maier et al. [25] used the deviations between all pixels of the anticipated geometry of two or more consecutive frames to distinguish moving and static objects, but the result depended on the accuracy of the optical flow calculation and the amount of radial distortion. Kalantar et al. [26] proposed a moving object detection framework without explicitly overlaying frame pairs, where each frame is segmented into regions and subsequently represented as a region adjacency graph (RAG).
Our proposed method aims not only at accurate detection of moving objects with a moving camera but also at real-time processing. Some previous studies used an optical flow approach to define the movement paths of pixels tracked over two consecutive frames. Wu et al. [27] used a coarse-to-fine threshold scheme on particle trajectories in the image sequence to detect moving objects. The background movement is subtracted using an adaptive threshold method to obtain a fine foreground segmentation; then, mean-shift segmentation is used to refine the detected foreground. Cai et al. [28] combined brightness constancy relaxation and intensity normalization within the optical flow to extract moving objects from the background based on the growing region of the velocity field. In this case, the images were obtained from a robot competition arena, which has a homogeneous background. Minaeian et al. [29] used foreground estimation to segment moving targets through the integration of spatiotemporal differences and local motion history. However, these previous methods did not adequately demonstrate reliability in real-time processing. This paper proposes a method for detecting multiple moving objects in a sequence of images taken by a UAV that can be applied in real-time applications. Detection and recognition are performed for different object classes, such as people and cars. In addition, the image sequences tested with this method may contain complex backgrounds. This paper proposes a reliable method for object detection in which the processing time to obtain the foreground is shorter than that of the segmentation methods employed in previous studies [24][25][26]. Aerial image stabilization is proposed to reduce the mixing of camera and object movements, since the background moves due to the camera movement, while the foreground moves due to both camera and object movement.
Furthermore, the unwanted camera movements make the motion vector field estimated between two consecutive frames incompatible with the actual situation. This makes the direction of the motion vectors of static objects differ from that of the background, even though the objects are part of the background. Thus, static objects tend to be recognized as moving objects. To solve this problem, the proposed method provides a motion vector classification to distinguish static and dynamic (moving) objects.
The remainder of the paper is organized as follows. Section 2 introduces materials and the main algorithm. Section 3 illustrates performance results using multiple videos taken from a UAV. Finally, conclusions are drawn in Section 4.

Materials
The experiments were executed using Visual Studio C++ on a 3.40 GHz CPU with 8 GB RAM. The performance of the proposed method is evaluated using three aerial image sequences (action1, action2, and action3) obtained from the UCF aerial action dataset (http://crcv.ucf.edu/data/UCF_Aerial_Action.php) with a resolution of 960 × 540. These image sequences were recorded at flying altitudes ranging from 400 to 450 feet.
Action1.mpg and action2.mpg were taken by the UAV at similar altitudes, where people and cars are the main objects in the image. Action3.mpg was taken at a higher altitude than other videos, so the objects look smaller when compared to other videos.

The Proposed Method
The challenge of detecting moving objects with a moving camera is clear. The proposed framework reduces the problem of distinguishing the foreground from a dynamic background to a simpler formulation. The systematic approach starts with image stabilization to reduce unwanted movement in the sequence of images; the unwanted movements are the motion of the camera as well as any vibration of the UAV. Inaccuracies in motion compensation can cause failure in the estimation of the background and foreground pixels [30]. However, even with image stabilization, the motion vectors of static objects (background) and moving objects (foreground) remain difficult to distinguish.
Additionally, in order to detect several moving objects with different sizes and speeds, we require correct calculation of the motion vector fields. Furthermore, static and dynamic objects are distinguished based on their movement direction (MD). There are two kinds of MD to be estimated: the direction of the object's movement (foreground) and the direction of the background's movement. It should be noted that the background motion is caused by the camera's movement. Figure 1 illustrates how the movement of a UAV affects the camera movement. The background movement corresponding to the motion of the moving camera is affected by UAV movements on the yaw, pitch, and roll axes, so an efficient affine transformation is needed. Figure 2 shows an overview of the structure of the system. The algorithm consists of three steps: Step 1 is aerial image stabilization, Step 2 is object detection and recognition, and Step 3 is motion vector classification. The proposed algorithm handles each frame for moving object detection and recognition so that it can be used in real-time applications with online image processing.
Step 1: Image stabilization is performed to handle the unstable UAV platform. This step aligns each frame with the adjacent frame in a sequence of aerial images to eliminate the effect of camera movement. The stabilization method consists of motion estimation and motion compensation. We use speeded-up robust features (SURF) [31][32][33] and an affine transformation [34] to estimate the camera movement based on the positions of features that match between the previous (t − 1) and current (t) frames. Then, a Kalman filter [35,36] is used to overcome the changes in frame position due to UAV movement, such that the camera movement is compensated in each frame. This image transformation is applied to frame t, so it affects the resulting MD in the background and foreground.
Step 2: People and cars are detected in the images as moving object candidates, or foreground. In this step, Haar-like features [37] and cascade classifiers [38,39] are used to detect and recognize the objects in the images and to determine the region of interest (ROI) of each object. This is followed by labeling the background and foreground. Step 3: The motion vectors between two consecutive images are calculated based on dense optical flow [40]. Background modeling is sometimes incompatible with the actual camera movement due to UAV movements and camera transitions. Note that Step 1 makes the MD of static and dynamic objects easier to distinguish. MD is specified as the most frequently occurring motion vector value in frame t, calculated for the background and for each foreground region. If a foreground region has the same MD as the background, the object is omitted from the foreground. Thus, the final result is the ROI in the image showing the moving objects.
The details of each step are explained as follows.

Step 1: Aerial Image Stabilization
This step uses an affine motion model to handle rotation, scaling, and translation. The affine model can be used to estimate movement between frames under certain conditions in the scene [41][42]. For every two successive frames, the previous frame is defined as f(t − 1) and the current frame as f(t). In order to reduce the computation time, the image is resized to 75% of its original size and converted to gray-scale; f̂(t) denotes the resulting image. The local features in each frame are found using SURF [31] as the feature detector and descriptor. SURF uses an integral image [43] to compute box filters of different sizes to detect feature points in the image. If f(t) is the input image, the integral image at location (x, y) is

I_Σ(x, y) = Σ_{i=0..x} Σ_{j=0..y} f(i, j).

Haar wavelet responses [30] dx and dy are calculated in the x-direction and y-direction, respectively, around each feature point to form a descriptor vector (Σdx, Σdy, Σ|dx|, Σ|dy|). Then, a 4 × 4 array with each cell holding these four values is constructed and centered on the feature point, giving a 64-length descriptor vector for each feature point. The Fast Library for Approximate Nearest Neighbors (FLANN) [44] is used to select a set of feature point pairs between f̂(t − 1) and f̂(t). Then, the distance between each pair of feature points is calculated as the Euclidean distance between their descriptors. A matching pair is accepted when its distance is less than 0.6. If the total number of matching pairs is more than three, the selected feature points are used in the next step; otherwise, the previous trajectory is used as an estimate of the current movement.
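The matching step above can be sketched as follows. This is a minimal NumPy sketch using brute-force nearest neighbours in place of FLANN; the 0.6 distance threshold is the one stated in the text, while the descriptor values in the usage below are illustrative.

```python
import numpy as np

def match_descriptors(desc_prev, desc_curr, max_dist=0.6):
    """Match 64-D SURF-style descriptors between two frames.

    For each descriptor of the previous frame, find the nearest
    descriptor of the current frame by Euclidean distance and keep
    the pair only when that distance is below max_dist (0.6 in the
    paper).  A brute-force stand-in for the FLANN matcher.
    """
    matches = []
    for i, d in enumerate(desc_prev):
        dists = np.linalg.norm(desc_curr - d, axis=1)  # distance to every candidate
        j = int(np.argmin(dists))
        if dists[j] < max_dist:
            matches.append((i, j))
    return matches
```

With slightly perturbed copies of three descriptors, `match_descriptors` pairs each with its counterpart, while any descriptor farther than 0.6 from all candidates is left unmatched.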
In homogeneous coordinates, the relationship between a pair of feature points in f̂(t − 1) and f̂(t) is given by

[x'; y'; 1] = H [x; y; 1],

where H is the homogeneous affine matrix

H = [a11 a12 Tx; a21 a22 Ty; 0 0 1],

in which the aij are the parameters derived from the rotation angle θ, and Tx and Ty are the parameters of the translation T along the x-axis and y-axis, respectively. Stacking the N matched feature point pairs, where N is the number of matched features, the affine parameters h = [a11, a12, Tx, a21, a22, Ty]^T can be represented as a least squares problem A h = b. The optimal estimate h in Equation (5) can be found using Gaussian elimination to minimize the Root Mean Squared Error (RMSE),

RMSE = sqrt((1/2N) ||A h − b||²).

Because the affine transform cannot represent the three-dimensional motion that occurs in the image, outliers are generated in the motion estimation. To solve this problem, Random Sample Consensus (RANSAC) [45] is used to filter the outliers during estimation.
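Under the stacked-pair formulation of Equation (5), the least-squares fit and its RMSE can be sketched with NumPy as below; RANSAC, which would wrap this routine to reject outlier pairs, is omitted for brevity.

```python
import numpy as np

def estimate_affine(pts_prev, pts_curr):
    """Least-squares fit of the six affine parameters
    (a11, a12, Tx, a21, a22, Ty) mapping pts_prev -> pts_curr.
    Each point pair contributes two rows to the design matrix A."""
    pts_prev = np.asarray(pts_prev, dtype=float)
    pts_curr = np.asarray(pts_curr, dtype=float)
    n = len(pts_prev)
    A = np.zeros((2 * n, 6))
    b = pts_curr.reshape(-1)
    for k, (x, y) in enumerate(pts_prev):
        A[2 * k] = [x, y, 1, 0, 0, 0]      # row for x'
        A[2 * k + 1] = [0, 0, 0, x, y, 1]  # row for y'
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    rmse = np.sqrt(np.mean((A @ h - b) ** 2))
    return h.reshape(2, 3), rmse
```

For a pure translation of (2, 3) the fit recovers H = [[1, 0, 2], [0, 1, 3]] with near-zero RMSE.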
Next, the translation and rotation trajectories are compensated to generate a new set of transformations for each frame using the Kalman filter. The Kalman filter consists of two essential parts: prediction and measurement correction. The prediction step estimates the state of the trajectory, where the initial state is defined by z(0) = [0, 0, 0]^T and the predicted error covariance is

P⁻(t) = P(t − 1) + Ω_p,

where P(0) is the initial error covariance and Ω_p is the noise covariance of the process. The optimal Kalman gain can be computed as

K(t) = P⁻(t) (P⁻(t) + Ω_m)^(−1),

where Ω_m is the noise covariance of the measurement. The error covariance is then compensated by

P(t) = (1 − K(t)) P⁻(t).

The measurement correction step compensates the trajectory state at f̂(t), computed as

z(t) = z⁻(t) + K(t) (m(t) − z⁻(t)),

where the new state z(t) contains the compensated trajectory and m(t) is the accumulation of the trajectory measurements over the frames. Therefore, a new trajectory can be obtained from the compensated state. Then, f(t) is warped into the new image plane by applying the new trajectory in Equation (13) to obtain the transformed current frame, using a scale factor computed from the estimated affine parameters.
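The predict/correct cycle above can be sketched as a scalar Kalman filter applied independently to each trajectory component (x, y, θ). The values of q and r below are illustrative stand-ins for the process and measurement noise covariances Ω_p and Ω_m, not the ones used in the paper.

```python
def kalman_smooth(measurements, q=4e-3, r=0.25):
    """Smooth one trajectory component with a scalar Kalman filter.

    q, r -- illustrative process / measurement noise covariances
            (stand-ins for Omega_p and Omega_m).
    """
    z, p = 0.0, 1.0  # initial state z(0) = 0 and error covariance
    out = []
    for m in measurements:
        p_pred = p + q                 # predict error covariance
        k = p_pred / (p_pred + r)      # optimal Kalman gain
        z = z + k * (m - z)            # correct state with measurement
        p = (1.0 - k) * p_pred         # compensated error covariance
        out.append(z)
    return out
```

Fed a constant measured displacement, the filter converges to it while damping frame-to-frame jitter, which is exactly the smoothing effect wanted for the camera trajectory.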

Step 2: Object Detection and Recognition
In this step, the background and foreground are determined in each frame transformed in Step 1. The foreground is made up of the moving object candidates, i.e., people and cars, in the image. The foreground is detected and recognized using Haar-like features and a boosted cascade of classifiers with training and detection stages. The basic idea behind Haar-like features is to detect objects of various sizes in the images. Figure 3 shows the templates of the Haar-like features, where each feature consists of two or three adjacent rectangular groups and can be scaled up or down. The pixel intensity values in the white and black groups are accumulated separately, so the distinction between adjacent groups gives light and dark regions. Therefore, Haar-like features are suitable for encoding image information to find objects at different scales, in which simple patterns are used to identify the existence of objects. The Haar-like feature value is calculated as the weighted sum of the pixel gray level values summed over the black rectangle and the entire feature area. An integral image [41] is used to minimize the number of array references when summing the pixels in a rectangular area of the image. Figure 4a,b show examples of the main objects to be selected. Figure 4c shows examples of the additional objects to be selected, which are non-moving objects, i.e., road signs, fences, boxes, road patterns, grass patterns, power lines, roadblocks, and so on. These additional objects serve to reduce false detections, since such objects often tend to be recognized as foreground. Negative images are images of landscapes and roads taken by a UAV that contain no cars or people. In this study, the minimum and maximum sizes of the positive images to be trained are 16 × 35 and 136 × 106, respectively.
Each stage combines weak classifiers h_i with weighting parameters δ_i, and the final classifier stage W(α) labels each region as positive (1) or negative (0) according to whether the weighted sum Σ_i δ_i h_i exceeds the stage threshold. Figure 5 shows a sub-window that slides over the image to identify regions containing objects. At each classifier stage, the region is labeled either as positive (1) or negative (0). The region passes to the next stage if it is labeled positive, which means the region is recognized as an object; otherwise, the region is labeled negative and rejected. The final stage yields the regions of the moving object candidates. The regions of non-moving objects are not displayed in the image and are used to evaluate the detected objects: if the region of a moving object candidate coincides with that of a non-moving object, the region is eliminated from the foreground. Let the n-th foreground region be denoted F_n. False detections among the moving object candidates are eliminated immediately by comparing each region with the non-moving objects, which speeds up the computation in the next step.

Step 3: Motion Vector Classification
The Farneback optical flow [40] is adopted to obtain the motion vectors between two consecutive images. It uses polynomial expansion to provide high speed and accuracy in field estimation. Suppose there is a 10 × 10 window G(j) and a pixel j is chosen inside the window. By polynomial expansion, each pixel in G(j) can be approximated by a quadratic polynomial, the so-called "local coordinate system", at f(t − 1):

f(t − 1)(p) ≈ p^T A(t − 1) p + b(t − 1)^T p + c(t − 1),

where p is the pixel coordinate vector, A(t − 1) is a symmetric matrix, b(t − 1) is a vector, and c(t − 1) is a scalar. If the window is displaced by d between the two input images, the relation between their local coordinate systems is

f(t)(p) = f(t − 1)(p − d).

Equating the coefficients in Equations (22) and (23) gives A(t) = A(t − 1) and b(t) = b(t − 1) − 2 A(t − 1) d. Therefore, the total displacement extracted in the ROI can be solved as

d = −(1/2) A(t − 1)^(−1) (b(t) − b(t − 1)).

The displacement in Equation (27) gives the direction of motion as θ = (180/π) tan^(−1)(d_y / d_x). Since a motion vector is calculated for each 10 × 10 pixel neighborhood, the total displacement is a matrix of size (w/10) × (h/10) (Equation (29)). Figure 6a shows regions marked with red and blue ROIs, representing the moving object candidates (foreground), identified as a person and a car, respectively. Figure 6b shows an example of the estimated motion vector distribution. In images taken by a static camera, the motion vectors in the background are zero, signifying an MD value of zero; this means there is no movement (represented by the direction of the arrows) between two consecutive frames. In our case (images taken by a moving camera), the motion vectors in the background have several different directions, as shown in Figure 6b. The red ROI is a parked car classified as a non-moving object, whose motion vectors are similar to most motion vectors in the background. The blue ROI shows a person walking, classified as a moving object, whose motion vectors differ from most motion vectors in the background. Thus, the MD of each moving object candidate is obtained as the most frequent motion vector within its ROI.
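The per-window displacement solution in Equation (27) can be sketched with NumPy as below; the coefficient values in the usage are illustrative, and a full implementation (e.g., OpenCV's calcOpticalFlowFarneback) additionally iterates and averages over neighborhoods.

```python
import math
import numpy as np

def displacement(A_prev, b_prev, b_curr):
    """Window displacement from polynomial-expansion coefficients:
    d = -1/2 * A(t-1)^(-1) (b(t) - b(t-1)),
    plus the motion direction in degrees via atan2."""
    d = -0.5 * np.linalg.solve(A_prev, b_curr - b_prev)
    angle = math.degrees(math.atan2(d[1], d[0]))  # motion direction
    return d, angle
```

With A = I and a coefficient change of (−2, −2), the recovered displacement is (1, 1), i.e., a 45-degree motion direction.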
In the background, MD is obtained as the most frequent motion vector in the image outside the foreground regions. If the difference between an object's MD and the background MD is within the threshold Δ, the object is identified as a non-moving object and is not considered a moving object candidate; otherwise, the object is identified as a moving object. Finally, the image shows only the ROIs of the selected objects. The minimum and maximum MD threshold values relative to the background are −5 and +5, respectively. These values are chosen because the MD difference between the background and static objects may be small, remaining within the threshold range [−5, +5].
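The MD computation and threshold test above can be sketched as follows. The 5-degree bin width used to quantize directions is an assumption for illustration; the ±5 threshold is the one stated in the text.

```python
from collections import Counter

def movement_direction(angles, bin_width=5):
    """MD of a region: the most frequent (quantized) motion-vector
    direction, in degrees.  bin_width is an illustrative choice."""
    bins = [round(a / bin_width) * bin_width for a in angles]
    return Counter(bins).most_common(1)[0][0]

def is_moving(md_object, md_background, delta=5):
    """An object whose MD differs from the background MD by more than
    the threshold (+/-5 degrees in the paper) is classified as moving."""
    return abs(md_object - md_background) > delta
```

A parked car whose dominant direction matches the background falls inside the threshold band and is discarded, while a walking person's distinct direction survives as a moving object.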

Result of Motion Vectors
The tested images were unstable due to the movement of the UAV. This made the motion vectors of static (non-moving) and dynamic (moving) objects unsuitable for distinguishing the two. Figures 8 and 9 show the motion vector results without and with image stabilization, respectively. Figures 8a and 9a show the motion vectors in the background. Figures 8b and 9b show the motion vectors in the ROI of a static object (a car). Figures 8c and 9c show the motion vectors in the ROI of dynamic objects (people). Figure 8 shows that, without stabilization, the motion vectors of the dynamic and static objects are almost the same, with only a slight difference from the motion vectors in the background; the motion vector result without image stabilization was therefore incorrect. Figure 9b shows that the motion vectors of the car (static object) are almost the same as those of the background, while Figure 9c shows that the motion vectors of the people (dynamic objects) are very different from the background. Thus, the motion vector results with image stabilization were very suitable for distinguishing between static and dynamic objects.

Result of Moving Objects Detection
Figures 10-12 show the results of the detection and recognition of moving objects. In some cases, false detections among the moving object candidates were omitted because the motion vector classification identified them as undesirable objects. Figures 10 and 11 show the sequences of images obtained from Action1 and Action2, respectively. Occasionally, the algorithm did not detect a small object in the image. For example, the small car in Figure 11a was not detected as foreground; although the motion vector classification showed the car as a moving object, the final result eliminated the car because the object region was not recognized as foreground. Figure 12 shows the result for the sequence of images obtained from Action3, which contains five people playing together and making small movements every once in a while. The detection results showed that if an object made only slight displacements, its motion vector was difficult to distinguish, so the object tended to be detected as a non-moving object. The computation performance is summarized in Table 1 in terms of frames per second (fps). The average processing speed is about 47.08 fps, which is faster than the previous methods in [23][24][25][26][27][28]. Table 2 shows the detection accuracy in terms of True Positives (TP), False Positives (FP), False Negatives (FN), Precision Rate (PR), recall, and f-measure. TP is a detected region that corresponds to a moving object; FP is a detected region that is not related to a moving object; FN is a region associated with a moving object that is not detected. The accuracy measures are computed as PR = TP/(TP + FP), Recall = TP/(TP + FN), and F-measure = 2 × PR × Recall/(PR + Recall). Although many articles have addressed the same problem (moving object detection using a moving camera), the proposed method performed well in real-time computation in a real environment with a complex background.
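The three accuracy measures can be computed from the per-video counts as below; the counts in the usage are illustrative, not the ones reported in Table 2.

```python
def detection_metrics(tp, fp, fn):
    """Precision rate, recall and F-measure from TP/FP/FN counts."""
    precision = tp / (tp + fp)            # PR = TP / (TP + FP)
    recall = tp / (tp + fn)               # Recall = TP / (TP + FN)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 true positives with 2 false positives and 2 false negatives yield a precision, recall, and F-measure of 0.8 each.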
The detection results also showed that the proposed method detected moving objects with high accuracy, even though the UAV had some unwanted motion and vibration. The comparison of computation time and accuracy between the proposed method and the methods in [23][24][25][26] and [28] is reported in Table 3. The proposed method achieved an average precision rate of 0.94 and a recall of 0.91. Action1 had the highest PR and recall compared to the other videos because it contained only a few objects and their sizes were quite large.
Action2 had the lowest PR because it contained many objects similar to a person or car, such as trees, fences, road signs, houses, and bushes. Action3 had the lowest recall due to the small objects with little displacement in the video. The method in [27] did not discuss the accuracy of the detected moving objects or the computation time; it focused on optical flow to describe the direction of pixel movement. However, the method in [27] is suitable for images with a homogeneous background. In our case, the moving camera produced several objects in the background that had no correlation with the moving objects but still exhibited pixel movement. This condition occurs in image sequences with complex backgrounds, such as our datasets; thus, the method in [27] is not suitable for them. In addition, we used a simple dense optical flow, which is sufficient to calculate the motion vector fields between two consecutive frames and has a fast computation time. Then, we used the classification, which is feasible for distinguishing the motion vectors of static and dynamic objects, to determine the MD in the background and foreground.
The proposed method can be used for various moving objects, not only people and cars. In this work, we used people and cars to test the performance of the method because these objects are often investigated as moving objects with moving cameras [23][24][25][26][27][28][29]. High-frequency jitter, small object sizes, and low-quality images make the detection of moving objects using UAVs a difficult task, but the proposed framework resolves these problems. Furthermore, a machine learning approach is used to detect and recognize the foreground because it can be applied on almost any processor without a GPU. This method is proposed for use on a computer or an on-board system. In other words, if the images captured by the UAV can be transmitted to a ground station such as a PC using a wireless camera, or to an additional board such as a Raspberry Pi on the UAV, then the images can be processed online and in real time.
Based on information from the datasets and previous studies, we conclude that the proposed algorithm is applicable under the following conditions: the UAV altitude is less than 500 feet and its speed is less than 15 m/s. In addition, based on our experimental results, the algorithm performed best at video frame rates below 50 fps.

Conclusions
A novel method for the detection of multiple moving objects using UAVs is presented in this paper. The main contribution of the proposed method is to detect and recognize moving objects using a UAV with a moving camera, with excellent accuracy and in real-time applications. An image stabilization method was used to handle unwanted motion in the aerial images so that a significant difference in motion vectors could be obtained to distinguish between static and dynamic objects. The object detection used to determine the regions of the moving object candidates had a fast computation time and good accuracy on complex backgrounds. Some false detections can be handled using the motion vector classification, in which an object whose movement direction is similar to the background is removed as a moving object candidate. Comparing the results on various sequences of aerial images, the proposed method is a potential real-time application in real environments.
Author Contributions: W.R. contributed to the conception of the study and wrote the manuscript, performed the experiment and data analyses, and contributed significantly to algorithm design and manuscript preparation. W.-J.W. and H.-C.C. helped perform the analysis with constructive discussions, writing, review, and editing.