Training-Based Methods for Comparison of Object Detection Methods for Visual Object Tracking

Object tracking in challenging videos is a hot topic in machine vision. Recently, novel training-based detectors, especially those using powerful deep learning schemes, have been proposed to detect objects in still images. However, there is still a semantic gap between these object detectors and higher-level applications such as object tracking in videos. This paper presents a comparative study of outstanding learning-based object detectors, namely ACF, Region-Based Convolutional Neural Network (RCNN), FastRCNN, FasterRCNN and You Only Look Once (YOLO), in the context of object tracking. We use an online and an offline training method for tracking. The online tracker trains the detectors with a synthetic set of images generated from the object of interest in the first frame; the detectors then detect the objects of interest in the following frames, and the detector is updated online using the objects detected in the most recent frames of the video. The offline tracker uses the detector for object detection in still images, and a Kalman-filter-based tracker then associates the objects across video frames. Our research is performed on the TLD dataset, which contains challenging situations for tracking. Source code and implementation details for the trackers are published to allow both the reproduction of the results reported in this paper and the re-use and further development of the trackers by other researchers. The results demonstrate that the ACF and YOLO trackers are more stable than the other trackers.


Introduction
Given the location of a desired object in the first frame, tracking this object is a fascinating topic for video processing from both scientific and industrial viewpoints. The task becomes scientifically more interesting when the video sequences are complex. This complexity can include a moving camera, uncertainty of the object, background clutter, small object size and low resolution, size variation, appearance changes, occlusion, articulated objects, illumination change and out-of-plane rotation. Some of these challenges are shown in Figure 1.
Recently, many methods have been developed to overcome the challenges of object tracking in videos. Some trackers focus on human tracking with techniques such as energy minimization [1,2] and data association [3], while others focus on general object tracking using methods such as point tracking [4] and feature descriptors [5]. Multiple cues such as color, shape and position are selected as human tracking features. In fixed-camera scenarios, the background is static, so background detection and subtraction allow foreground object detection; in other words, searching for moving objects within a limited area is enough to find the objects of interest. There are many other methods for human detection and tracking in the literature [1][2][3][6][7][8][9][10]. These methods often use very simple features to detect humans; the human body is usually described by simple shapes, such as a circular shape for the head and a cylindrical shape for the rest of the body. Other works address the different sub-problems that occur in a non-stationary environment. Guan et al. [27] proposed a modular tracker for event-triggered tracking in the presence of model drift and occlusion. It is composed of a short-term tracker, occlusion and drift identification, target re-detection, short-term tracker updating and online discriminative learning of the detector. The method presented in [28] is a semantics-aware object tracker, which introduces semantics into the tracking procedure to improve robustness. In [29], the authors use FasterRCNN for object detection and sequential color particle filtering for tracking. Kim et al. [30] combine a long short-term memory (LSTM) network, a residual framework and another LSTM to build an attention network for object tracking, using the LSTMs for temporal learning. A learning-based tracker using a Bayesian estimation framework is proposed in [31].
It uses different blurring kernels to increase the robustness of the tracker against blurring.
Yun et al. [32] use an offline-trained convolutional deep neural network (ADNet) for object detection and an action-driven method for temporal tracking. The method presented in [33] is a learning-based tracker which uses deep appearance features learned from large training datasets to build a discriminative appearance model for a reliable association between tracklets and detections.
The method presented in [34] is an object tracker based on transfer learning. It trains an autoencoder offline on auxiliary natural images as a feature extractor and then adds a classification layer online. In [35], the authors also use transfer learning for object tracking: some layers of an offline-trained CNN are transferred to an online classifier whose binary classification layer is updated during tracking. This classifier evaluates candidates generated around the previous target position and outputs the target. The method presented in [36] uses recurrent neural networks to track objects in 2D laser data for robotics applications. In [37], the authors use oblique random forests for object tracking, with HOG features and deep neural network-based features, and incremental update steps to refresh the tracker. The method presented in [38] is a kernel cross-correlator that improves the robustness of linear cross-correlator-based trackers and can handle affine transformations.
Recently, there has been drastic progress in the area of object detection through learning-based techniques such as deep learning. However, this progress has not been extended to trackers. In this paper, five well-known training-based object detectors, i.e., ACF [39], RCNN [40], FastRCNN [41], FasterRCNN [42] and You Only Look Once (YOLO) [43], are considered for object tracking and compared in this context. Two methods are used: offline tracking (training before tracking) and online tracking (training while tracking). The former uses a pre-trained model for object detection in the spatial dimension (i.e., still images) and an offline-trained classifier for the association of the objects in the time dimension. The latter is a short-term tracker with an online training procedure which updates the detector over time. In other words, the offline tracker divides the tracking task into two separate tasks: detecting objects in frames and finding the object of interest among the objects of each frame.
The term object detection refers to finding an object in an image, while object tracking means following an object of interest across the frames of a video. In this paper, the trackers follow the object of interest using a tracking-by-detection approach, meaning that the same object is detected in successive frames of the video.
The object detector performs the first part, and the second part is a time-series analyzer for tracking. The online tracker trains a detector with positive and negative data generated from the first frame; the detector is then applied to the frames of a certain part of the video and re-trained with the objects recently detected within that part.
To the best of the authors' knowledge, the five mentioned detectors have not previously been compared in online and offline trackers. The online tracker differs substantially from two-step trackers [29,30,32,33], which first detect objects in images and then associate the detected objects using another classifier. In contrast with [34,35], the online tracker has no offline phase and does not transfer layers from another network.
The rest of the paper is organized as follows: Section 2 describes and compares the exploited training-based detectors. The object tracking methods are explained in Section 3. The experimental results, evaluations, comparisons and discussion are presented in Section 4. Section 5 concludes the paper.

Training Based Object Detection
This section explains the object detectors used for tracking in this paper, i.e., ACF [39], RCNN [40], FastRCNN [41], FasterRCNN [42] and YOLO [43]. These detectors were selected because they are well known and easy to use.

Aggregate Channel Features
For an ACF (Aggregate Channel Features) detector, a channel is a certain component that defines pixel values in an image [39]. ACF generates many features from the computed channels. These channel features fall into two main types: first-order and higher-order [39]. ACF extracts first-order channel features from a single channel by summing pixels over local blocks, and higher-order channel features by combining two or more first-order features. ACF then uses decision trees for classification.
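As an illustration, the aggregation of first-order and higher-order channel features can be sketched as follows (a minimal sketch assuming a single gradient-magnitude-style channel and 4×4 blocks; the block size and the difference-based pairing are illustrative, not the detector's actual configuration):

```python
import numpy as np

def first_order_features(channel, block=4):
    """Sum pixels over non-overlapping block x block cells of one channel."""
    h, w = channel.shape
    h, w = h - h % block, w - w % block          # crop to a multiple of the block size
    cells = channel[:h, :w].reshape(h // block, block, w // block, block)
    return cells.sum(axis=(1, 3)).ravel()        # one feature per cell

def higher_order_features(feats, pairs):
    """Combine two first-order features (here: their difference)."""
    return np.array([feats[i] - feats[j] for i, j in pairs])

# Toy example: a random 8x8 "channel" standing in for, e.g., gradient magnitude.
rng = np.random.default_rng(0)
channel = rng.random((8, 8))
f1 = first_order_features(channel)               # 4 first-order features
f2 = higher_order_features(f1, [(0, 1), (2, 3)]) # 2 higher-order features
```

In the full detector, many such channels (color, gradient magnitude, gradient orientation) feed the boosted decision trees.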

Region-Based Convolutional Neural Network
Region-Based Convolutional Neural Network (RCNN) is an object detector based on a Convolutional Neural Network (CNN). A CNN performs convolution products on small patches of the input map of each layer, so the extracted features carry information about local patterns [44]. A typical CNN is composed of two main types of layers: convolutional and fully connected [45]. First, RCNN computes region proposals using selective search [46]. Then, it forwards the proposals to a trained CNN.
The region proposals with greater than 0.5 IoU (Intersection over Union) overlap with a ground truth box (defined by a user) are labeled as positive, and the rest of the proposals are labeled as negative. The RCNN used here has 3 convolutional layers and 2 fully connected layers. RCNN is slow because it runs the CNN on each proposal separately, without sharing computation.
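The 0.5-IoU labeling rule can be sketched as follows (a minimal sketch; the box coordinates and helper names are illustrative):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_proposals(proposals, ground_truth, thr=0.5):
    """Label each region proposal positive if its IoU with the ground truth exceeds thr."""
    return [iou(p, ground_truth) > thr for p in proposals]

gt = (10, 10, 50, 50)
proposals = [(12, 12, 52, 52), (40, 40, 90, 90), (10, 10, 50, 50)]
labels = label_proposals(proposals, gt)   # [True, False, True]
```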

Fast Region Based CNN
FastRCNN improves RCNN to increase processing speed by sharing the computation of the convolutional layers among different proposals. Since a convolutional layer does not change the spatial relationship between adjacent pixels, coordinates in the raw image can be projected to the corresponding neurons in the convolutional feature map. Therefore, the convolutional layers are computed once for the whole image, which saves processing time [41]. FastRCNN is up to ten times faster than RCNN, but it is not real-time.

Faster Region-Based CNN
FasterRCNN [42] brings object detection toward real-time application. It adds a region proposal network (RPN) after the last convolutional layer. The RPN takes an image feature map as input and outputs a set of rectangular object proposals. It decides whether the current region, generated from a sliding window and a set of anchors (for each location, proposals with different scales and aspect ratios, parametrized relative to reference boxes), contains the object of interest. FasterRCNN is up to ten times faster than FastRCNN and runs in real time.
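The anchor construction can be sketched as follows (a minimal sketch; the base size, scales and aspect ratios are illustrative placeholders, not the values used by FasterRCNN's RPN):

```python
def make_anchors(center, base=16, scales=(1, 2, 4), ratios=(0.5, 1.0, 2.0)):
    """Generate anchor boxes (x1, y1, x2, y2) around one sliding-window location.

    Each anchor has area (base * scale)^2 and height/width ratio `ratio`.
    """
    cx, cy = center
    anchors = []
    for s in scales:
        area = (base * s) ** 2
        for r in ratios:
            w = (area / r) ** 0.5       # from w * h = area and h / w = r
            h = w * r
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors((100, 100))      # 3 scales x 3 ratios = 9 anchors per location
```

The RPN scores each such anchor and regresses offsets relative to it, rather than predicting box sizes from scratch.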

YOLO
YOLO is a state-of-the-art object detector. It uses deep networks and a hierarchical structure for real-time object labeling and classification [43]. YOLO stands for "you only look once", i.e., it looks at the image once and processes the whole image in a single pass. Instead of one large softmax as in Imagenet [47], YOLO uses several softmaxes arranged as a hierarchical tree, where each softmax decides among a similar group of objects; in this way, YOLO classifies objects more accurately than Imagenet. It is faster than VGG-16 [48] because it uses fewer floating-point operations. It applies a single convolutional neural network to the full image, divides the image into regions, and predicts bounding boxes and a probability for each region; the bounding boxes are weighted by the predicted probabilities. Like FasterRCNN, it adjusts priors on bounding boxes instead of predicting the width and height outright; however, it still predicts the x and y coordinates directly. YOLO is faster and more precise than FasterRCNN.
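The grid-based prediction can be illustrated with a toy decoder (a simplified sketch assuming one box and one confidence score per cell; the grid size and tensor layout are illustrative and do not reproduce YOLO's actual output format):

```python
def decode_predictions(preds, img_size, S=7, conf_thr=0.25):
    """Toy decoder for grid-based detection in the style of YOLO.

    `preds[i][j]` holds one (x, y, w, h, confidence) tuple per grid cell,
    with x, y offsets relative to the cell and w, h in image pixels.
    """
    cell = img_size / S
    boxes = []
    for i in range(S):
        for j in range(S):
            x, y, w, h, conf = preds[i][j]
            if conf < conf_thr:
                continue
            cx, cy = (j + x) * cell, (i + y) * cell    # cell offset -> image coords
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2, conf))
    return boxes

# One confident box in cell (3, 3) of a 448 x 448 image; all other cells are empty.
preds = [[(0.0, 0.0, 0.0, 0.0, 0.0)] * 7 for _ in range(7)]
preds[3][3] = (0.5, 0.5, 100.0, 50.0, 0.9)
boxes = decode_predictions(preds, img_size=448)
```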

Online Tracker
The steps of the online object tracking method are shown in Figure 2. In the first frame, a user selects an object of interest. From this selected object, the online tracker generates synthetic data; this procedure is described in Section 3.1.1. The tracker trains a detector using the synthetic data. It then segments the input video frames into unequal-length groups of frames {G_1, G_2, ..., G_n}, where each group G_i is composed of several frames, i.e., G_i = {f_1, f_2, ..., f_mi}. The trained detector tracks the object through the first group G_1 by detecting the objects of interest within G_1; this process is explained in Section 3.1.1. The objects of interest are, for example, a pedestrian, a panda or a car. At the end of the first iteration, the online tracker copies the objects detected within G_1, as well as the synthetic data, to a training vector T_v (see Figure 2). The tracker uses T_v to train the detector a second time, applies it to the second segment G_2, and then concatenates the objects detected within G_2 to T_v. The tracker keeps updating the training vector before each training iteration until the last frame of the video. It updates the detector using a first in, first out (FIFO) procedure, so the tracker follows recent appearances of the object.
In the experiments section, we investigate the effect of the length of the training vector on accuracy. Let I_{x,y} denote the pixels of the current frame. The object of interest detected at frame t is

I_t^d = D(I_{x,y}, W_n(t)), (1)

where D(.) is the detector response, a function of the image pixels and of a trained network with weights W_n(t). While updating the detector D at a certain interval Δt (shown in Figure 2 using G_1, G_2, ...), the weights W_n(t) are updated using the generated samples as well as the objects detected from the first frame up to frame t − 1, as expressed by Equation (2):

W_n(t) = F(I_1^1, I_1^2, ..., I_1^N, I_2^d, ..., I_{t−1}^d), (2)

The tracker calculates the initial weights for detecting the object of interest in the second frame from N synthetic positive and negative samples generated from the first frame (i.e., W_n(1) = F(I_1^1, I_1^2, ..., I_1^N)).
where I_i^d is the object of interest for frame i; e.g., I_{t−1}^d is the object of interest in the last processed frame. By combining the online tracker with the four detectors ACF, RCNN, FastRCNN and FasterRCNN, we obtain four online trackers, which are compared in Section 4.

Synthetic Data Generation
To obtain an effective tracker, the detectors must be trained on a sufficient number of training samples, and the variety of the data is a very important factor in the training process. To obtain such diversity, the following process is proposed: the tracker generates rotated copies of the object of interest using the rotation angles {−10, −9.9, −9.8, ..., 0, ..., 9.9, 10}, together with copies modified by salt and pepper noise, histogram equalization, contrast adjustment, brightness change and resizing, and combines these items to generate diverse data. We use different total numbers of synthetic samples in our experiments (i.e., 500, 800, 1000, 2000, 4000 and 10,000) to investigate the effect on tracking accuracy. The number and types of the synthetic data are given in Table 1.

Table 1. The number and types of the synthetic data. The first row shows the total number of synthetic samples, and each column shows the number and types of synthetic samples for that total.

The above parameters were selected to preserve both tracker accuracy and speed. Initially, a small training set was chosen and poor results were obtained; the parameters were then gradually optimized to maximize tracking accuracy and speed. Therefore, they are independent of the type of objects and videos. Detailed information regarding the parameter optimization is given in Section 4.2. Figure 3 shows some examples of the synthetic data for "Pedestrain1". Figure 3a shows the first frame and the selected object (object of interest); this frame is corrupted by salt and pepper noise with density 0.008 (Figure 3b), rotated by −10 and 9 degrees (Figure 3c,d), enhanced using histogram equalization and then rotated by 9 and −10 degrees (Figure 3e,f), and enhanced using contrast adjustment and then rotated by 5 and −5 degrees (Figure 3g,h). The object of interest is shown below each image.
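The combination of rotations and photometric transforms can be sketched as follows (the transform names and the sampling scheme are illustrative assumptions; the real generator applies these transforms to image pixels, whereas this sketch only enumerates parameter combinations):

```python
import itertools
import random

# Rotation angles in 0.1-degree steps (-10.0 ... 10.0), as in Section 3.1.1,
# combined with photometric transforms; the transform names are illustrative.
ANGLES = [round(a / 10, 1) for a in range(-100, 101)]
TRANSFORMS = ["none", "salt_pepper_0.008", "hist_eq", "contrast_adj",
              "brightness", "resize"]

def sample_synthetic_specs(n_total, seed=0):
    """Draw n_total (angle, transform) pairs describing synthetic training images."""
    all_specs = list(itertools.product(ANGLES, TRANSFORMS))
    random.Random(seed).shuffle(all_specs)
    return all_specs[:n_total]

specs = sample_synthetic_specs(500)   # e.g., the smallest set size from Table 1
```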

Tracker Updating
The objects detected within the frames of the first group G_1 (i.e., {I_2^d, I_3^d, ..., I_{m1}^d}) are concatenated to the training vector, T_v = {synthetic data, I_2^d, I_3^d, ..., I_{m1}^d}. Using the updated training vector, the tracker re-trains the detector. The updated detector is then used for tracking (detection of the object of interest) in the second group G_2. The objects detected within G_2 are again concatenated to T_v, and the updated T_v is used for tracking in the third group G_3. The lengths of the groups {G_1, G_2, ..., G_n} are obtained from Equation (3). During training, when T_v becomes full, the data at the beginning of T_v is replaced by the data from the current image (FIFO vector). This process continues until the end of the video. The effect of the length of T_v on tracker accuracy is investigated in Section 4.
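The FIFO behavior of the training vector can be sketched as follows (a minimal sketch; the class and sample names are illustrative):

```python
from collections import deque

class TrainingVector:
    """Fixed-length FIFO buffer of training samples used to update the detector.

    When the buffer is full, the oldest samples are discarded, so the detector
    follows recent appearances of the object.
    """
    def __init__(self, max_len, synthetic_data):
        self.buf = deque(synthetic_data, maxlen=max_len)

    def extend(self, detected_objects):
        self.buf.extend(detected_objects)   # oldest entries drop out automatically

    def samples(self):
        return list(self.buf)

tv = TrainingVector(max_len=5, synthetic_data=["s1", "s2", "s3"])
tv.extend(["d2", "d3", "d4"])               # detections from group G_1
# The oldest synthetic sample "s1" has been dropped to respect the length limit.
```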
The length of each group, U_s(k), is calculated from the following piecewise Equation:

U_s(k) = S_1 if k < T_1; S_2 if T_1 ≤ k < T_2; ...; S_6 if T_5 ≤ k < T_6, (3)

where S_1, ..., S_6 are the group lengths, k is the frame index, and T_1, ..., T_6 are the length thresholds given in Table 2. In our experiments with ACF, we choose two parameter sets, SET1 and SET2 (shown in Table 2), and call the resulting trackers ACF1 and ACF2, respectively. We choose SET2 for RCNN, FastRCNN and FasterRCNN.
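The piecewise group-length rule can be sketched as follows (the S and T values below are placeholders for illustration, not the SET1/SET2 values of Table 2):

```python
def group_length(k, S, T):
    """Length of the group that starts at frame index k.

    S = (S1, ..., S6) are group lengths, T = (T1, ..., T6) are frame-index
    thresholds; the first threshold exceeding k selects the length.
    """
    for s_i, t_i in zip(S, T):
        if k < t_i:
            return s_i
    return S[-1]

# Illustrative parameters (placeholders, not the paper's Table 2 values):
# short groups early in the video, longer groups once the tracker has adapted.
S = (10, 20, 30, 40, 50, 60)
T = (50, 100, 200, 400, 800, 10**9)
lengths = [group_length(k, S, T) for k in (0, 60, 150, 300)]
```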

Offline Tracker
The offline tracker steps are shown in Figure 4. The video frames are fed to YOLO and a Kalman filter [49]. The offline tracker output, T(.), is the YOLO response with maximum IoU with respect to the pose estimated by the Kalman filter. If there is no intersection between the two responses, the offline tracker selects the YOLO response with the lowest Euclidean distance to the Kalman filter response. The Kalman filter response is the point with the maximum probability of object presence in the frame; a bounding box of the same size as the object in the last frame (k − 1) is drawn around this point to represent the object region.
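The association rule (maximum IoU, with a nearest-center fallback when nothing overlaps) can be sketched as follows (a minimal sketch with illustrative boxes):

```python
import math

def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def center(b):
    return ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)

def associate(detections, kalman_box):
    """Detection with maximum IoU against the Kalman estimate; if none overlaps,
    fall back to the detection whose center is nearest to the estimate."""
    best = max(detections, key=lambda d: iou(d, kalman_box))
    if iou(best, kalman_box) > 0:
        return best
    kx, ky = center(kalman_box)
    return min(detections, key=lambda d: math.dist((kx, ky), center(d)))

dets = [(0, 0, 20, 20), (100, 100, 140, 140)]
track = associate(dets, (95, 95, 135, 135))   # overlaps the second detection
```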
The offline tracker T_ft can be expressed as

T_ft = D(I_{x,y}, W_d), (4)

where, unlike the online tracker (1), the weights W_d of the offline detector are not updated during tracking. For more detailed information about the Kalman filter we refer to [50,51]. The Kalman filter state consists of the object position p = (p_x, p_y) (center of mass), its velocity v = (v_x, v_y) and its acceleration a = (a_x, a_y), following the constant-acceleration motion model

p(k) = p(k − 1) + v(k − 1)Δt + a(k − 1)Δt²/2, v(k) = v(k − 1) + a(k − 1)Δt. (5)

If YOLO does not detect any object in a frame, or if the Euclidean distance between the Kalman filter response and the nearest YOLO response is more than a predefined threshold (100 pixels), the object is considered to be occluded by other objects or to have left the scene. The YOLO detection threshold is set to 0.15 to minimize missed objects; lower and higher values lead to a high false alarm rate and to missed objects, respectively. Since YOLO has a strong pre-trained model for object detection (available in [52]) and its training is slow, we use YOLO in the offline tracker. Our initial experiments showed that for our dataset the detection rate of YOLO is high, but its classification is not precise; therefore, we ignore the output labels of YOLO.
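A constant-acceleration Kalman filter over the position, velocity and acceleration state can be sketched as follows (a minimal sketch; the process and measurement noise values are tuning assumptions, not the paper's settings):

```python
import numpy as np

# State: [px, py, vx, vy, ax, ay]; constant-acceleration motion model with dt = 1.
dt = 1.0
F = np.array([[1, 0, dt, 0, dt**2 / 2, 0],
              [0, 1, 0, dt, 0, dt**2 / 2],
              [0, 0, 1, 0, dt, 0],
              [0, 0, 0, 1, 0, dt],
              [0, 0, 0, 0, 1, 0],
              [0, 0, 0, 0, 0, 1]], dtype=float)
H = np.eye(2, 6)                      # only the object center (px, py) is measured
Q = np.eye(6) * 1e-2                  # process noise (tuning assumption)
R = np.eye(2) * 1.0                   # measurement noise (tuning assumption)

def kalman_step(x, P, z):
    """One predict/update cycle: x is the state, P its covariance, z the measured center."""
    x, P = F @ x, F @ P @ F.T + Q                      # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)       # Kalman gain
    x = x + K @ (z - H @ x)                            # update with the measurement
    P = (np.eye(6) - K @ H) @ P
    return x, P

x, P = np.zeros(6), np.eye(6)
for z in ([1, 0], [2, 0], [3, 0]):                     # object moving 1 px/frame in x
    x, P = kalman_step(x, P, np.array(z, dtype=float))
```

After a few steps, the estimated position approaches the measurements and the estimated x-velocity becomes positive.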
Intermediate results of the offline tracker are shown in Figure 5. First, YOLO detects all objects in the scene; then, the Kalman filter selects the object of interest within each frame and associates the selections over time. In this case, two adjoining objects (the VW and the white van) are detected as a single object. This problem originates from a YOLO misdetection; however, since the common area between the detected object and the ground truth is more than 50 percent of the ground-truth area, it counts as a true detection.

Dataset and Parameters
To validate the trackers, a set of 10 videos from the TLD [16] and VOT [53] datasets, comprising more than 26,800 frames and various objects of interest, was selected. The datasets contain various types of tracking challenges, such as a moving camera, long videos, partial and full object occlusion, appearance, illumination and scale changes, and similar objects [16,53]. Some videos in the TLD dataset are also available in other datasets, like carchase and pedestrain1 in VOT2018, and some are similar to other videos in VOT2018: for instance, "Volkswagen" is similar to "LiverRun", "Car1", "Yamaha" and "Traffic" are similar to "Motocross", and so on. The videos include various desirable objects for tracking, such as a car, a motorcycle, a pedestrian, a human face, a human body and a panda, as shown in Figure 6.
Green rectangles in Figure 6 show the objects of interest. Among them, the car is rigid and the other objects are articulated. The datasets include short and long videos. To evaluate the trackers, the sequences were manually annotated as ground truth [16]; plane rotation of more than 50% was annotated as "not visible" [16]. The parameters of the online trackers are set according to Table 3.

Table 3. Parameter set 1 regarding the online trackers. These parameters were selected in initial experiments to preserve both accuracy and speed of the trackers. Since the datasets contain various types of objects with different sizes and shapes, in different scenes, backgrounds, illumination conditions and degrees of occlusion, the generality of the selected parameters is supported.

(Table 3 columns: Length of T_v, Length of Synthetic Data, Stage, Epoch.)
For the synthetic data, as mentioned in Section 3.1.1, the first frame of the video is exposed to salt and pepper noise, rotation, intensity adjustment, histogram equalization, brightness change, resizing and contrast enhancement. The total number and the types of the synthetic data are shown in Table 1.

Accuracy Results
The experiments in this section use the following evaluation procedure. Each tracker is initialized in the first frame of a video sequence and tracks the object of interest (shown in Figure 6) up to the end. The produced trajectory is then compared to the ground truth using the recall R, the precision P and the F-measure F, where F = 2PR/(P + R). For each frame with a detected object, the object is considered correctly detected if the common area between the detected object and the ground truth is more than 50 percent [16]. The number of selected videos is equal to that used in other outstanding tracking methods like [16].
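The evaluation metrics can be computed as follows (a minimal sketch from per-frame true-positive, false-positive and false-negative counts; the counts themselves are illustrative):

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from per-frame detection counts.

    A detection counts as a true positive when its overlap with the ground
    truth exceeds 50 percent of the ground-truth area [16].
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

p, r, f = prf(tp=80, fp=20, fn=20)   # a tracker with balanced errors
```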
The detailed results of the trackers on the datasets in terms of Recall, Precision and F-measure are shown in Table 4.
ACF2 performs better than ACF1 for most of the videos. This shows that the minimum segment size of the frame groups should be set to 20 (as shown in Table 2, parameter SET2 gives better results than SET1) and that smaller changes in the step size (i.e., 20, 30, 40) lead to better results. At the beginning of the video, the online tracker is not yet well trained and should therefore be updated at shorter intervals; the tracker gradually adapts itself to the object and the scene, after which longer update intervals become suitable.

We trained the online RCNN1 and RCNN2 trackers with SET1 and SET2 (Table 2), respectively. RCNN2 is better than RCNN1 for most of the videos, which again shows the preference for SET2 over SET1. SET2 was therefore chosen for the other online trackers, i.e., FastRCNN and FasterRCNN. FastRCNN and FasterRCNN show worse results than RCNN and ACF for most of the videos. For the 4 cases of Carchase, Jump, Pedestrain1 and Pedestrain2, FasterRCNN shows better results than RCNN and FastRCNN, but for the 6 other videos RCNN is better; this suggests that FasterRCNN performs better in the presence of similar objects. Among the online trackers, ACF2 has the best overall robustness, but the YOLO tracker is even more stable than ACF2. Figure 7 compares the F-measure of the trackers.

In another experiment, we varied the training vector length T_v, the number of training iterations and the total number of synthetic samples. We tried synthetic data lengths of 500, 1000, 2000, 4000 and 10,000 with 3 and 10 training iterations, and calculated the F-measure of the online trackers.
The trackers' stability in terms of average F-measure is shown in Figure 8. When the synthetic data length increases from 500 to 4000, the performance increases; for lengths greater than 4000, it decreases. Thus, the optimum synthetic data length is 4000. Increasing the number of training iterations from 3 to 10 improves the trackers' stability; with further increases, the tracker speed decreases while the accuracy improvement is very small. The training vector length T_v was increased from 1000 to 10,000; when T_v exceeds 4000 (with the other parameters unchanged), the accuracy drops. Thus, the optimum T_v length is also 4000. In an ablative study on the online ACF tracker, we removed the updating/training process; the results, shown and compared in Figure 9, confirm that updating the tracker increases the performance. In another experiment, the synthetic data generation was removed from the tracking process; in this case, the trackers cannot follow the object of interest at all.

Visualization Result
The visual comparison of the trackers and the ground truth for selected frames of the 10 videos is shown in Figure 10. From each video, two frames were randomly selected and shown in different rows. The sequence in Figure 10a (Pedestrain1) has similar and articulated objects and pose change; the YOLO and ACF trackers follow the pedestrian very well, but the RCNN, FastRCNN and FasterRCNN trackers miss it. The video in Figure 10b (Volkswagen) is a long video which contains similar objects (cars), occlusion and illumination change. The RCNN, ACF and YOLO trackers show better results than FasterRCNN, and FastRCNN tracks the background. In the first frame, all trackers except FastRCNN follow the car, but in the second frame only the YOLO and ACF trackers keep tracking it; the YOLO tracker is more stable than ACF for this case. In this video, there are many frames without the desired car, but the RCNN and FastRCNN trackers mistakenly track the other car present in the scene. The sequence in Figure 10c Figure 10g shows a sequence (Car) which has occlusion and similar objects. In this case, since the object (the white car) is leaving the scene, there is no ground truth for it; RCNN can track the object completely and YOLO and ACF partially, but these are counted as false positives. The FasterRCNN tracker tracks a similar object, and the FastRCNN tracker misses the object in both frames. The sequence in Figure 10h (Pedestrain3) includes similar objects and object occlusion. All trackers except YOLO can track the object of interest; in the first selected frame, YOLO partially detects the human, but it misses the object of interest in the second frame. In this case, YOLO's human detection from the top view is poor. Figure 10i shows a video (Pedestrain2) which has occlusion and similar objects.
For the first frame, the FastRCNN, YOLO and ACF trackers track the object correctly, but the RCNN tracker follows another human and FasterRCNN tracks part of a car instead of the human. In the second frame, there is no human in the scene, yet FasterRCNN tracks a point in the background. The video in Figure 10j (Jumping) contains strong movement and blurring; in this case, the ACF and YOLO trackers track correctly. Overall, among the tested trackers, the ACF and YOLO trackers show the best results.

Algorithm Speed Comparison
The trackers were implemented on a hardware system with the specification shown in Table 5. ACF, RCNN, FastRCNN and FasterRCNN were implemented in MATLAB; except for the ACF tracker, they use the GPU. The detection part of the YOLO tracker was implemented in C++ on the GPU, and the Kalman filter [54] in C++ on the CPU. For each tracker, the average running time over all 10 videos was measured and is shown in Figure 11. The average fps (frames per second) for the RCNN, FastRCNN, FasterRCNN, ACF and YOLO trackers is 0.2, 0.5, 0.24, 4 and 9, respectively. As shown in Figure 11, the YOLO tracker is the fastest because it does not train while tracking. Among the online trackers, the ACF tracker (implemented on the CPU) has a more efficient implementation than RCNN, FastRCNN and FasterRCNN (implemented on the GPU), and is thus the fastest online tracker. The main speed difference among the online trackers occurs in the training phase, where ACF is even faster than FasterRCNN.

Discussion
The comparative study among the trackers is concluded as follows: 1. The ACF tracker has the best results among the online trackers from both the accuracy and speed viewpoints. ACF has an efficient implementation because it runs on the CPU, whereas the other online trackers run on the GPU. 2. Among the RCNN-based trackers (i.e., RCNN, FastRCNN and FasterRCNN), RCNN has the best tracking accuracy. Although FastRCNN and FasterRCNN are very fast in the test phase, their tracking process is slow because they are very slow in the training phase. 3. Since the YOLO tracker was implemented offline, it is the fastest tracker. YOLO is not suitable for online tracking because it is very slow in the training phase. 4. For human tracking from the front and side views, the combination of YOLO and a Kalman filter shows the best results. 5. We recommend using the ACF tracker for tracking unknown objects, because YOLO does not detect them whereas the ACF tracker does. 6. Compared to YOLO and ACF, the RCNN-based trackers show lower accuracy because they do not have a very deep structure (i.e., 3 convolutional layers and 2 fully connected layers). Using a deeper convolutional neural network like YOLO increases the accuracy dramatically.
The training vector length T_v was investigated in Section 3, showing that the online tracker follows recent appearances of the object of interest. The length should be set to an optimal value: if it exceeds this value, the average accuracy decreases, while shorter lengths lead to under-fitting and low accuracy.
Some results of our research are not limited to tracking. They are stated as follows: 1. The YOLO detector detects cars from the top view, but its object classification precision is low. 2. For human detection, since YOLO was biased toward front-view data, its detection results from that view are very good, but its classification results are disappointing. 3. Although deep-learning-based object detection has recently improved, further improvement is still necessary; YOLO should be trained on a large dataset including more varied views of the objects.

Conclusions
In this paper, we performed a comprehensive comparative study in the context of object tracking using five well-known, recently proposed detectors. Two trackers, one online and one offline, were used. The online tracker first generates positive and negative samples from the first frame and then trains the detectors. The detector detects the objects of interest online in the next frames and puts them into a training vector; the detector is updated at certain intervals using this training vector, and detection continues until the last frame. The ACF tracker showed the best results among the examined methods for online tracking from both the speed and accuracy perspectives. In the offline scenario, the YOLO detector generates candidates, and the tracker then follows the object of interest using a Kalman filter. Extensive experiments showed that the YOLO tracker outperforms the rest of the trackers. In future work, the YOLO detector will be trained on an updated dataset to improve detection from the top view, and the experiments will be extended to other videos of the VOT benchmark. We also aim to extend our methods to multiple object tracking, because all the detectors, i.e., ACF, RCNN, FastRCNN, FasterRCNN and YOLO, are capable of multiple object detection. For the online trackers, in each iteration of the training phase we will define multiple objects instead of a single one, and the detectors will output different labels for different objects. For offline tracking, YOLO can already detect multiple objects, as shown in Figure 5b, and one Kalman filter will be used per object.

Conflicts of Interest:
The authors declare no conflict of interest.