Coarse-to-Fine Adaptive People Detection for Video Sequences by Maximizing Mutual Information †

Applying people detectors to unseen data is challenging since pattern distributions, such as viewpoints, motion, poses, backgrounds, occlusions and people sizes, may differ significantly from those of the training dataset. In this paper, we propose a coarse-to-fine framework to adapt frame-by-frame people detectors during runtime classification, without requiring any additional manually labeled ground truth apart from the offline training of the detection model. Such adaptation makes use of the mutual information of multiple detectors, i.e., similarities and dissimilarities of detectors estimated and agreed by pair-wise correlating their outputs. Globally, the proposed adaptation discriminates between relevant instants in a video sequence, i.e., identifies the representative frames for an adaptation of the system. Locally, the proposed adaptation identifies the best configuration (i.e., detection threshold) of each detector under analysis by maximizing the mutual information. The proposed coarse-to-fine approach does not require training the detectors for each new scenario and uses standard people detector outputs, i.e., bounding boxes. The experimental results demonstrate that the proposed approach outperforms state-of-the-art detectors whose optimal threshold configurations are previously determined and fixed from offline training data.


Introduction
Automatic people detection in video sequences is one of the most relevant problems in computer vision, which is essential in many applications such as for video-surveillance, human-computer interaction and mobile robotics. Although generic object detection is maturing very rapidly thanks to the recent widespread use of deep learning [1,2], many challenges still exist for the specific case of detecting people. Video and images of people exhibit a great variation of viewpoints, motion, poses, backgrounds, occlusions, sizes and body-part deformations [3]. Detection performance has a strong dependency on the training data used to build detectors [4] and, therefore, accuracy drops are expected when training and testing data have different patterns [5]. Moreover, people detectors often have many parameters, which are heuristically or experimentally set according to training data. Such parameter setting strategy may have limitations when applied to other data different from the training one.
The adaptation of people detectors is therefore desired to successfully apply such detectors to unseen data [6]. This adaptation can be approached as best algorithm selection [7,8], domain
The remainder of the paper is structured as follows. Section 2 describes the related work. Section 3 describes the proposed coarse-to-fine adaptation framework based on cross-correlations. Section 4 presents the experiments. Finally, Section 5 concludes this paper.

State of the Art
Adapting pedestrian detectors to specific scenes is frequently termed domain adaptation, where the original training dataset (i.e., source domain) is fully annotated. Existing approaches adapt such detectors to unseen data (i.e., target domain) and can be focused on features or models [16].
Feature-based approaches aim to transform feature spaces between the source and target domains, and then apply a classifier. Early approaches annotate data in the target domain to define a grid classifier from scratch [17]. Albeit effective, such annotation is time-demanding and requires many data samples, so it is difficult to repeat for other domains. Most recent feature-based approaches focus on transfer learning, where the knowledge from source domains is extended to semantically similar categories of the target domain by retraining models with few data annotations. Transfer learning can use bounding boxes from both the source and target domains, such as the learning of discriminative models using CNNs and data augmentation [10] and the transfer of shared source-target attributes by feature selection where data distributions of the domains are similar [18]. Moreover, approaches can also assume the absence of annotations for the target domain and, therefore, perform an online self-learning process by determining which samples to select. For example, such selection can use super-pixel region clustering [19], Gaussian regression within a hierarchical adaptive SVM [16], confidence scores within a deep model [3], background modeling [20] and multiple contextual cues [4]. Other strategies may also be applied by weighting the source data to match the distribution of the object categories in the target domain before re-training [4], by propagating labels between frames for good positive instances [20] and by integrating classifiers at image and instance level to maintain semantic consistency between two domains [21]. The image-level classifier aims to determine whether the source or the target domain is analyzed, whereas the instance-level classifier is focused on the feature maps. Finally, transfer learning using synthetic data has recently been proposed [22,23]. However, training complex models still presents challenges due to the visual mismatch with real data [20].
Model-based approaches focus on adapting the parameters of the classifiers or the strategy applied. For example, in [5], a Bayes-based multi-class classifier is adapted by computing the proportion of objects in the target domain during runtime. Such adaptation may focus on correcting detection errors by spatiotemporal filtering [24]. Other approaches make use of context, such as for building a partial belief about the current scene to only execute certain classifiers [25], for applying specific combinations of part-based models based on spatial object information [26] and for modulating object proposals (class prior probabilities) with semantic knowledge [27]. Model-based approaches may also combine different models by learning the weights of predictions for different sensor modalities in an online manner [11], by applying a cascade of detectors designed to combine the confidence of heterogeneous detectors [12], and by automatically selecting the most suitable model for visible or non-visible light images [28]. Another approach focuses on automatically learning classifiers on the target domain without annotated data, which are later evaluated in the source domain with labeled data; finally, the top-performing classifiers are selected as the most reliable for the target domain [29]. Moreover, model-based approaches may perform detector ranking by estimating the similarity between both domains in some feature space to design a cost function for selecting the best algorithm in each situation or domain [7]. Therefore, detector ranking can be efficiently learned for different target domain subsets [8], but it requires full annotation of the source and target domains. Similar to feature-based approaches, model-based detector adaptation may be achieved by coupling detection and tracking for online retraining of single [13] or multiple [14] detectors without annotated data.
However, these approaches share the limitations of transfer learning (detector re-training), impose restrictions on the employed detectors (e.g., high precision and low recall [14]) or require the use of tracking, which may lead to unstable results [13,14]. Table 1 compares the proposed and reviewed approaches. As we can observe, the proposed approach avoids re-training detectors, unlike many model-based and feature-based approaches based on transfer learning, which often require an offline training stage before the final application to the target domain. Instead of selecting accurate samples for re-training, we leverage results from multiple and possibly independent people detectors, assuming that their errors are diverse. The detection threshold of each detector is adjusted according to its similarities to the other employed detectors. Moreover, our proposal applies self-learning in an online fashion without requiring annotated data for the target domain, unlike those in [26,27], and also without requiring a prior analysis of the target domain features [7,8]. Additionally, the proposed approach employs standard outputs of people detectors (i.e., bounding boxes), so it can be applied to a wide variety of existing approaches, unlike other approaches restricted to CNNs [10], Faster R-CNN [21] and SVMs [16,18], or to being coupled with other detectors [11] and trackers [13]. Finally, the proposed approach is applied to video sequences, unlike most of those in the literature, which are focused on image-level classification. Such application to video may determine when and where adaptation might improve performance, and therefore adjust the computational complexity to the particular details of each video sequence.

Detector Adaptation Framework
We propose a coarse-to-fine framework to improve detectors' performance at runtime classification by adapting the configuration of each detector employed (see Figure 2). This proposal is inspired by the maximization of mutual information strategy, where classifiers are combined assuming that their errors are complementary, and which has been successfully applied, for example, to detect shadows [30] and skin [31]. We extend such maximization framework to people detection by introducing pair-wise detector correlation and by adapting their configuration online. Note that we are not re-training detectors at prediction time, which may require data not available in real applications or highly accurate detectors, and may imply high latency [5], i.e., a minimum number of frames to compute accurate decisions over time. Instead, we consider generic threshold-based detectors pre-trained on standard datasets, thus making this proposal applicable to a wide variety of detectors.
Assuming a set of $N$ people detectors $\{D_n\}_{n=1}^{N}$ applied to an image, each detector $D_n$ obtains a confidence map $M_n$ describing the people likelihood for each spatial location $(x, y)$ and scale $s$ in the image. Then, detection candidates are obtained by thresholding this map:
$$T_n(x, y, s) = \begin{cases} 1 & \text{if } M_n(x, y, s) \geq \tau_n \\ 0 & \text{otherwise} \end{cases} \tag{1}$$
where $T_n(x, y, s) \in \{0, 1\}$ and $\tau_n$ is the detection threshold, whose value is heuristically set based on the confidence map. These candidates are later combined across scales and can be post-processed by a variety of techniques such as non-maximum suppression [32] and background-people segmentation [33]. The final result for each detector is a set $B_n^{\tau_n} = \{b_k\}_{k=1}^{K^{\tau_n}}$ with $K^{\tau_n}$ detections (i.e., bounding boxes) representing the output of the detector $D_n$, where each detection $b_k$ is described by its position $(x, y)$ and dimensions $(w, h)$. A key parameter in this procedure is the detection threshold $\tau_n$, which determines the number of detection candidates. Low values of $\tau_n$ generate many detections, increasing the false positive rate, whereas high values generate few detections, which are more likely to be true positives: three examples of $\tau_n$ are shown in Figure 1. We propose to adapt such detection threshold to the image context by exploring similarities with the other detectors. We compare the outputs of the detectors to obtain a set of pair-wise correlation scores (cross-correlation of detectors in Figure 2), which measure the output similarity. This stage is detailed in Section 3.1.
We analyze this similarity at two different levels. First, we propose a coarse analysis to determine relevant frames in a video sequence, where people are present. Second, a fine analysis is applied in those selected frames to adapt the detection system, i.e., adjust the detection thresholds.

Cross-Correlation of Detectors
Firstly, we explore the decision space to determine each detector output by applying multiple thresholds. Then, we correlate these multiple outputs for each pair of detectors (D n and D m ) to obtain a correlation map C n,m which measures the output similarity (see Figure 3).

Multiple Thresholding
To explore the possible detector outputs, we define a set of $L$ thresholds $\{\hat{\tau}_n^j\}_{j=1}^{L}$ for each detector $D_n$, whose values are determined by considering $L$ levels between the extreme values of the confidence map $M_n$ (i.e., minimum and maximum). Then, we perform thresholding with multiple values $\hat{\tau}_n^j$ to obtain a set of outputs as follows:
$$\Omega_n = \left\{ B_n^{\hat{\tau}_n^j} \right\}_{j=1}^{L}$$
where each output $B_n^{\hat{\tau}_n^j}$ is obtained by applying the threshold $\hat{\tau}_n^j$ to Equation (1). Note that each detector $D_n$ may have different threshold values $\hat{\tau}$
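The multiple-thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the L levels are taken as evenly spaced between the minimum and maximum score, thresholding is applied directly to scored bounding boxes rather than to the full confidence map (no non-maximum suppression), and a score equal to the threshold counts as a detection. The names `threshold_levels` and `multi_threshold` are hypothetical.

```python
def threshold_levels(scores, num_levels):
    """Return num_levels evenly spaced thresholds from min(scores) to max(scores)."""
    lo, hi = min(scores), max(scores)
    if num_levels == 1:
        return [lo]
    step = (hi - lo) / (num_levels - 1)
    return [lo + j * step for j in range(num_levels)]


def multi_threshold(detections, num_levels):
    """Build one output set per threshold level.

    detections: list of (box, score) pairs, with box = (x, y, w, h).
    Returns (levels, outputs), where outputs[j] holds the boxes whose
    score is at or above levels[j] (assumed convention).
    """
    levels = threshold_levels([score for _, score in detections], num_levels)
    outputs = [[box for box, score in detections if score >= t] for t in levels]
    return levels, outputs


# Toy detections: three boxes with increasing confidence.
dets = [((10, 10, 40, 80), 0.0), ((60, 12, 38, 82), 0.5), ((120, 8, 42, 85), 1.0)]
levels, outputs = multi_threshold(dets, 5)
sizes = [len(o) for o in outputs]
```

As expected, the number of boxes in each output never increases as the threshold level rises.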

Pair-Wise Correlation
We correlate the $N$ detector outputs $\{\Omega_n\}_{n=1}^{N}$ to estimate their similarity. We compute a correlation map $C_{n,m}$ for each pair of detector outputs $\Omega_n$ and $\Omega_m$. Each element is defined as:
$$C_{n,m}(i, j) = \rho\left(B_n^{\hat{\tau}_n^i}, B_m^{\hat{\tau}_m^j}\right) \tag{3}$$
where $\rho(\cdot, \cdot)$ is a function to compute the similarity between the outputs of the detectors. The number of correlation maps $C_{n,m}$ to be computed for $N$ detectors is $N(N-1)/2$.
We propose computing $\rho(\cdot, \cdot)$ as a one-class classification problem by applying standard evaluation measures. To compare bounding boxes from two outputs, we use three matching criteria [34], namely dr, co and ov, where co and ov employ, respectively, the percentage of spatial bounding box coverage in $B_m^{\hat{\tau}_m^j}$ and the intersection-over-union. A match is accepted if dr ≤ 0.5, co ≥ 0.5 and ov ≥ 0.5, as commonly employed in related works [34], which corresponds to a deviation of up to 25% of the true object size. Only one $b_k \in B_n^{\hat{\tau}_n^i}$ is accepted as correct by matching $b_l \in B_m^{\hat{\tau}_m^j}$ (i.e., true positive), so any additional $b_k \in B_n^{\hat{\tau}_n^i}$ on the same bounding box is considered a false positive. Then, we compute precision and recall measures from the matching results and obtain the FScore as the final similarity measure $\rho(\cdot, \cdot)$ between $B_n^{\hat{\tau}_n^i}$ and $B_m^{\hat{\tau}_m^j}$ as in [35]. Thus, the final correlation map $C_{n,m}$ between two detectors is defined as the FScores $F$:
$$C_{n,m}(i, j) = F(i, j)$$
where $F(i, j)$ is the FScore between $B_n^{\hat{\tau}_n^i}$ and $B_m^{\hat{\tau}_m^j}$, and $i, j \in \{1, \ldots, L\}$. Figure 5 shows one example of a correlation map $C_{1,2}$ and four different outputs of the two detectors, $C_{1,2}(i, j)$ (rows A, B, C and D). Example A corresponds to a low threshold value for both detectors ($\hat{\tau}_1^i$ and $\hat{\tau}_2^j$) and therefore yields a low FScore similarity, $F(i, j) = 0.52$. On the other hand, Example C corresponds to a medium-high threshold value for the first detector $\hat{\tau}_1^i$ and a low-medium threshold value for the second detector $\hat{\tau}_2^j$, and therefore in this case a high FScore
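A compact sketch of the pair-wise correlation may help to fix ideas. It is deliberately simplified and should be read as an assumption-laden illustration: only the intersection-over-union criterion (ov ≥ 0.5) is used, the dr and co criteria are omitted, matching is greedy and one-to-one, and the FScore is set to zero whenever an output is empty. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def fscore(boxes_n, boxes_m, min_iou=0.5):
    """FScore of boxes_n evaluated against boxes_m used as the reference.

    Each reference box can be matched at most once, so extra boxes on the
    same reference count as false positives. Empty outputs give 0.
    """
    if not boxes_n or not boxes_m:
        return 0.0
    unmatched = list(boxes_m)
    tp = 0
    for box in boxes_n:
        hit = next((ref for ref in unmatched if iou(box, ref) >= min_iou), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    if tp == 0:
        return 0.0
    precision = tp / len(boxes_n)
    recall = tp / len(boxes_m)
    return 2 * precision * recall / (precision + recall)


def correlation_map(outputs_n, outputs_m):
    """C[i][j] is the FScore between the i-th output of D_n and the j-th of D_m."""
    return [[fscore(bn, bm) for bm in outputs_m] for bn in outputs_n]


# Two thresholds per detector: a permissive output and a strict one.
A, A2 = (0, 0, 10, 10), (1, 0, 10, 10)   # same person seen by both detectors
B, C = (50, 50, 10, 10), (100, 100, 10, 10)  # spurious boxes
cmap = correlation_map([[A, B], [A]], [[A2, C], [A2]])
```

In this toy case the strict outputs of both detectors agree completely, so the maximum of the map lies at the pair of high thresholds.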

Coarse Adaptation
Assuming that frames without people are not relevant for the adaptation process, we propose to use the correlation map $C_{n,m}$ to determine the relevant frames in a video sequence. In particular, we propose to measure the information entropy as an estimation of the presence of people in every frame (see Figure 6). Based on the principle of maximization of mutual information, we assume that two independent detectors, albeit designed for the same purpose (to detect persons), would be highly correlated in the presence of people when many bounding boxes are matched and, therefore, a high level of true positive detections is expected. On the other hand, low correlation values would have few matches and, therefore, imply an increase in the false positive or false negative detection rate. Note that there is one exception to this assumption when the outputs are empty (i.e., $B_n^{\hat{\tau}_n^i} = B_m^{\hat{\tau}_m^j} = \emptyset$), since both outputs are equal but the FScore cannot be computed. We handle this case by setting the FScore to zero when these sets are empty. In contrast, two independent detectors applied to a frame without people would have low correlation values for every possible configuration in $C_{n,m}$. For that reason, we can assume that frames with people will produce more variable correlation maps $C_{n,m}$ than those without people.
We propose estimating the absence/presence of people using the entropy of the correlation map $C_{n,m}$. Information entropy is defined as the average amount of information produced by a stochastic source of data. The measure of information entropy associated with each possible data value is the negative logarithm of the probability mass function for the value. Entropy is a statistical measure of randomness that can be used to characterize the texture of an input image. In our case, we propose classifying every frame using the entropy over the correlation map $C_{n,m}$ as:
$$E_{n,m} = -\sum_{k} p_k \log p_k$$
where $p_k$ is the probability mass of the $k$-th value in $C_{n,m}$.
Figure 7 shows three different examples (rows) of correlation maps $C_{n,m}$, the output of two detectors for two different threshold values (low and high thresholds) and the corresponding entropy values $E_{n,m}$. Note the three different correlation behaviors: the first example shows an empty scene, almost zero FScore similarity for any possible pair-wise correlation and therefore a low entropy value ($E_{n,m} = 0.6$); the second example shows a scene with five pedestrians, high FScore similarity for a range of pair-wise correlations and therefore a high entropy value ($E_{n,m} = 4.6$); and the third example shows only one person, a medium-high FScore similarity for a range of pair-wise correlations and therefore a medium-high entropy value ($E_{n,m} = 3.3$).
Up to this point, we have a set of hypotheses for the presence of people, one for each compared pair of detectors $E_{n,m}$ (i.e., $D_n$ and $D_m$), which are combined to obtain a final decision (decision fusion in Figure 6). Such hypothesis combination is performed for each detector $D_n$ as a traditional mixture of experts via weighted voting [36]:
$$E = \sum_{m=1, m \neq n}^{N} \omega_{n,m} \cdot E_{n,m} \tag{6}$$
where $\omega_{n,m} \in [0, 1]$ is the weight for the hypothesis $E_{n,m}$ achieved by comparing $D_n$ and $D_m$, and $\sum_{m=1}^{N} \omega_{n,m} = 1$ ($n \neq m$). Although such ensemble voting may benefit from a previous learning stage [37], currently we assume no prior knowledge about the detectors' performance, so we consider equal weighting $\omega_{n,m} = \frac{1}{N-1}$.
In the case of absence of people (i.e., low value of $E$), we assume that the detection outputs are empty (i.e., $B_n^{\hat{\tau}_n^i} = B_m^{\hat{\tau}_m^j} = \emptyset$) and therefore the final configuration for each detector is $\tau_1^* = \ldots = \tau_n^* = \ldots = \tau_N^* = \infty$. This decision has the potential benefit of avoiding any possible false detection but also the possible disadvantage of losing any correct detections (see visual examples in Figure 1). On the other hand, in the case of presence of people (i.e., high value of $E$), a further adaptation process is required, and it is therefore necessary to analyze the fine similarity.
We formulate the detection of frames containing people (i.e., coarse adaptation) as a two-class classification problem where class $q_1$ indicates the absence of people in a frame and $q_2$ is the opposite class. We classify the frame based on the evidence provided by the entropy $E$: we evaluate the posterior probability of each class $P(q_i | E)$ and choose the class with the largest $P(q_i | E)$, i.e.,
$$P(q_1 | E) \underset{q_2}{\overset{q_1}{\gtrless}} P(q_2 | E)$$
Then, applying the Bayes rule results in:
$$\frac{P(E | q_1) P(q_1)}{P(E)} \underset{q_2}{\overset{q_1}{\gtrless}} \frac{P(E | q_2) P(q_2)}{P(E)}$$
$P(E)$ does not affect the decision rule, so it can be eliminated. We simplify to the likelihood ratio $\Lambda(E)$:
$$\Lambda(E) = \frac{P(E | q_1)}{P(E | q_2)} \underset{q_2}{\overset{q_1}{\gtrless}} \frac{P(q_2)}{P(q_1)}$$
Finally, assuming equal priors (absence/presence of people), the decision rule is known as the Likelihood Ratio Test (LRT):
$$\Lambda(E) = \frac{P(E | q_1)}{P(E | q_2)} \underset{q_2}{\overset{q_1}{\gtrless}} 1 \tag{9}$$
which in essence turns into finding the first entropy value $E$ that determines the condition $\frac{P(E | q_1)}{P(E | q_2)} > 1$ and using such value as a threshold for the entropy.
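The entropy of a correlation map can be computed along the following lines. This is only a sketch: the source does not state the histogram resolution or the logarithm base, so the 256 bins and the base-2 logarithm used here are assumptions, as is the name `map_entropy`.

```python
import math


def map_entropy(corr_map, num_bins=256):
    """Shannon entropy (in bits) of the histogram of FScore values in [0, 1].

    num_bins and the base-2 logarithm are illustrative assumptions.
    """
    values = [v for row in corr_map for v in row]
    counts = [0] * num_bins
    for v in values:
        # Clamp v = 1.0 into the last bin.
        counts[min(int(v * num_bins), num_bins - 1)] += 1
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


# A flat map (e.g., empty scene, all FScores zero) versus a varied one.
flat = [[0.0] * 4 for _ in range(4)]
varied = [[0.0, 0.25], [0.5, 0.75]]
flat_entropy = map_entropy(flat)
varied_entropy = map_entropy(varied)
```

A map whose values all fall in one bin has zero entropy, while a map whose values spread over many bins has high entropy, which is the behavior the coarse adaptation relies on.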
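The equal-weight voting and the resulting coarse decision can be illustrated as below. The sketch assumes that pair-wise entropies are stored in a dictionary keyed by detector index pairs and that an entropy threshold is already available (0.7 is the value reported later in the experiments); `fuse_pair_scores` and `coarse_thresholds` are hypothetical names.

```python
import math


def fuse_pair_scores(pair_scores, n, num_detectors):
    """Equal-weight vote over the scores of detector n against every other detector."""
    weight = 1.0 / (num_detectors - 1)
    return sum(weight * pair_scores[(n, m)]
               for m in range(num_detectors) if m != n)


def coarse_thresholds(fused_entropy, entropy_threshold, num_detectors):
    """Return infinite thresholds (no detections) when the frame looks empty.

    Returns None when people are likely present, meaning that the fine
    adaptation should choose the thresholds instead.
    """
    if fused_entropy < entropy_threshold:
        return [math.inf] * num_detectors
    return None


# Toy pair-wise entropies for N = 3 detectors (symmetric by construction).
pair_entropy = {(0, 1): 0.25, (1, 0): 0.25,
                (0, 2): 0.75, (2, 0): 0.75,
                (1, 2): 3.0, (2, 1): 3.0}
fused_0 = fuse_pair_scores(pair_entropy, 0, 3)
fused_1 = fuse_pair_scores(pair_entropy, 1, 3)
empty_frame = coarse_thresholds(fused_0, 0.7, 3)
busy_frame = coarse_thresholds(fused_1, 0.7, 3)
```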
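One possible reading of the likelihood ratio test is sketched below: the class-conditional densities are approximated by normalized histograms of training entropies, and the threshold is the lowest bin edge at which the likelihood of the absence class no longer exceeds that of the presence class. The histogram approximation, the bin edges and all function names are assumptions made for illustration.

```python
def histogram_pdf(samples, edges):
    """Normalized histogram over the bins [edges[k], edges[k+1]).

    Samples outside the covered range are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for s in samples:
        for k in range(len(counts)):
            if edges[k] <= s < edges[k + 1]:
                counts[k] += 1
                break
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def lrt_threshold(absent, present, edges):
    """Lowest bin edge where P(E|q1) stops exceeding P(E|q2) (equal priors)."""
    p1 = histogram_pdf(absent, edges)
    p2 = histogram_pdf(present, edges)
    for k in range(len(p1)):
        if p1[k] == 0 and p2[k] == 0:
            continue  # no evidence in this bin for either class
        if p1[k] <= p2[k]:
            return edges[k]
    return edges[-1]


def classify(entropy, threshold):
    """Class q1 (absence) below the threshold, class q2 (presence) otherwise."""
    return "absence" if entropy < threshold else "presence"


# Toy training entropies for frames without and with people.
edges = [0, 1, 2, 3, 4]
absent = [0.2, 0.4, 0.5, 1.3]
present = [0.8, 1.5, 2.5, 3.1, 3.6]
threshold = lrt_threshold(absent, present, edges)
```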

Fine Adaptation
The aim of the fine adaptation is to find the configuration with the highest similarity (i.e., highest value in $C_{n,m}$) to select the best detection threshold for each detector ($\tau_n^*$ and $\tau_m^*$, respectively). The threshold hypothesis selection requires searching for a single maximum value in $C_{n,m}$, which may contain multiple local maxima. The correlation map $C_{n,m}$ is the similarity $\rho$ between the outputs of each pair of detectors:
$$C_{n,m}(\tau_n, \tau_m) = \rho\left(B_n^{\tau_n}, B_m^{\tau_m}\right)$$
where $\rho(\cdot, \cdot)$ is defined as in Equation (3).
Our problem of finding the optimal global solution can be formulated by following the Maximum Likelihood Estimation (MLE) criterion once $C_{n,m}$ is computed:
$$\left(\tau_n^{n,m}, \tau_m^{n,m}\right) = \arg\max_{\tau_n, \tau_m} C_{n,m}(\tau_n, \tau_m) \tag{11}$$
To find such maximum value, we propose using a sub-optimal global search solution of the threshold hypothesis selection problem with lower computational cost requirements, i.e., Simulated Annealing (SA) [38]. SA is a probabilistic technique for approximating the global optimum of a given function. For problems where finding an approximate global optimum is more important than finding a precise local optimum in a fixed amount of time, SA may be preferable to other iterative alternatives such as gradient descent [39].
Moreover, we may assume that the probability of selecting a pair of thresholds (i.e., choosing a specific configuration) depends on the pair of detectors compared. For example, some detectors may tend to use thresholds with low values, whereas other detectors may use high values. Therefore, we include a function $g(\cdot, \cdot)$ to model the prior distribution of thresholds, which determines the most likely pairs of thresholds given two detectors. It can be defined as follows:
$$\left(\tau_n^{n,m}, \tau_m^{n,m}\right) = \arg\max_{\tau_n, \tau_m} C_{n,m}(\tau_n, \tau_m) \cdot g(\tau_n, \tau_m) \tag{12}$$
Since the solution of Equation (11) or Equation (12) may not be unique, we may obtain various maximum values $\tau_n^{n,m}$ (see the darkest area in the bottom-left image in Figure 5a), as the detectors are never totally independent. Therefore, we currently propose three alternatives: selecting the mean, minimum or maximum value among those thresholds $\tau_n^{n,m}$ maximizing $C_{n,m}$. After finding the best detection thresholds $\tau_n^{n,m}$ for each compared pair of detectors (i.e., $D_n$ and $D_m$), we combine them to obtain a final configuration for each detector (decision fusion in Figure 8).
Such hypothesis combination is performed, as in Equation (6), as a traditional mixture of experts via weighted voting:
$$\tau_n^* = \sum_{m=1, m \neq n}^{N} \omega_{n,m} \cdot \tau_n^{n,m}$$
It is important to note that this equation does not combine people detectors; instead, the proposed approach focuses on independently improving each detector by adapting its detection threshold.
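To make the search concrete, a generic simulated annealing loop over a discrete correlation map is sketched below. The neighborhood (moves of at most one cell per axis), the geometric cooling schedule, the number of steps and the function name `anneal_max` are illustrative choices, not the configuration used in the experiments.

```python
import math
import random


def anneal_max(corr_map, steps=200, start_temp=1.0, cooling=0.98, seed=0):
    """Approximate the argmax of corr_map by simulated annealing.

    Returns ((i, j), value) for the best cell visited during the search.
    """
    rng = random.Random(seed)
    size_i, size_j = len(corr_map), len(corr_map[0])
    cur = (rng.randrange(size_i), rng.randrange(size_j))
    cur_val = corr_map[cur[0]][cur[1]]
    best, best_val = cur, cur_val
    temp = start_temp
    for _ in range(steps):
        # Propose a neighbor at most one cell away per axis, clamped to the grid.
        ni = min(max(cur[0] + rng.choice((-1, 0, 1)), 0), size_i - 1)
        nj = min(max(cur[1] + rng.choice((-1, 0, 1)), 0), size_j - 1)
        new_val = corr_map[ni][nj]
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature decreases.
        if new_val >= cur_val or rng.random() < math.exp((new_val - cur_val) / temp):
            cur, cur_val = (ni, nj), new_val
            if cur_val > best_val:
                best, best_val = cur, cur_val
        temp *= cooling
    return best, best_val


# Toy 6 x 6 map with a single peak at (3, 2).
cm = [[1.0 - 0.1 * (abs(i - 3) + abs(j - 2)) for j in range(6)] for i in range(6)]
(best_i, best_j), best_value = anneal_max(cm)
```

Each step evaluates a single cell, so the search visits at most `steps + 1` cells instead of all L × L cells of an exhaustive search, at the price of not guaranteeing the global maximum.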
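The selection among tied maxima and the final fusion can be sketched as follows. Weighting the correlation map by the prior through an element-wise product is an assumption about how the prior enters the criterion, the search is exhaustive for clarity, and the names `best_threshold` and `fuse_thresholds` are hypothetical.

```python
def best_threshold(corr_map, levels_n, prior=None, strategy="mean"):
    """Threshold of detector n taken from the maxima of the (weighted) map.

    corr_map[i][j] is the similarity for threshold levels_n[i] of detector n
    and the j-th threshold of detector m. When prior is given, the map is
    multiplied element-wise by it (assumed form of the prior weighting).
    Ties are resolved with the mean, min or max strategy.
    """
    rows, cols = len(corr_map), len(corr_map[0])
    score = [[corr_map[i][j] * (prior[i][j] if prior else 1.0)
              for j in range(cols)] for i in range(rows)]
    top = max(max(row) for row in score)
    ties = [levels_n[i] for i in range(rows) for j in range(cols)
            if score[i][j] == top]
    if strategy == "min":
        return min(ties)
    if strategy == "max":
        return max(ties)
    return sum(ties) / len(ties)


def fuse_thresholds(pair_thresholds):
    """Equal-weight average of the thresholds found against each other detector."""
    return sum(pair_thresholds) / len(pair_thresholds)


corr = [[0.2, 0.8], [0.8, 0.4]]   # two tied maxima
levels = [0.25, 0.75]
prior = [[1.0, 0.5], [1.0, 1.0]]  # down-weights the tie at (0, 1)
t_mean = best_threshold(corr, levels)
t_min = best_threshold(corr, levels, strategy="min")
t_max = best_threshold(corr, levels, strategy="max")
t_prior = best_threshold(corr, levels, prior=prior)
t_final = fuse_thresholds([0.5, 0.75, 0.25])
```

In this toy case the prior breaks the tie between the two maxima, which is the kind of disambiguation the prior distribution is meant to provide.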

Experimental Results
This section describes the experimental setup to evaluate the proposed coarse-to-fine framework to adapt people detectors during runtime classification, and the results of each part of the framework: coarse adaptation, fine adaptation, and the complete system (see Figure 2).

Setup
We performed the evaluation using the people detection benchmark repository (PDbm (http://www-vpu.eps.uam.es/PDbm/, last accessed December 2018.)) [40]. It has 19 sequences with ground-truth annotations for traditional indoor and outdoor scenarios in computer vision applications: video surveillance, smart cities, etc.
We quantified detection performance for each video frame by precision, recall and FScore metrics [35]. We report the frame-level mean FScore for all tested images as the final performance value. However, to evaluate the impact of the coarse adaptation in the final system, we evaluated the performance in terms of global FScore, i.e., the resulting video-level FScore of the adaptation process for each video and not only frame by frame results.
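The difference between the two reported figures can be illustrated with a short sketch. It assumes that the frame-level value is the mean of per-frame FScores and that the global (video-level) value is computed from detection counts pooled over all frames; both definitions, the zero score for frames without true positives and the function names are assumptions.

```python
def fscore(tp, fp, fn):
    """FScore from detection counts; 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def frame_mean_fscore(frames):
    """Mean of the per-frame FScores; frames is a list of (tp, fp, fn)."""
    return sum(fscore(*f) for f in frames) / len(frames)


def global_fscore(frames):
    """FScore computed from the counts pooled over the whole video."""
    tp = sum(f[0] for f in frames)
    fp = sum(f[1] for f in frames)
    fn = sum(f[2] for f in frames)
    return fscore(tp, fp, fn)


# One frame with two correct detections, one empty frame with a false alarm.
frames = [(2, 0, 0), (0, 1, 0)]
mean_f = frame_mean_fscore(frames)
global_f = global_fscore(frames)
```

The false alarm in the empty frame halves the frame-level mean but has a milder effect on the pooled score, which is why the two measures can rank systems differently.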
We applied the adaptation system to six people detectors using publicly available implementations. We used two versions for DPM [32] (Inria and Pascal models), ACF [41] (Inria and Caltech models) and Faster R-CNN [1] (VGG and ZF models).

Coarse Adaptation Results
We proposed the estimation of the absence/presence of people for each frame, using the entropy of the correlation map C n,m (see Section 3.2). We first estimated the entropy probability density function (pdf) of both classes (P(E | q 1 ) and P(E | q 2 )) using the training dataset VOC2012 (Visual Object Classes Challenge 2012 [42]). Figure 9a,b shows the estimated entropy pdfs P(E | q 1 ) and P(E | q 2 ), respectively, while Figure 9c shows both pdfs together. After that, we used the LRT (see Equation (9)) to determine the best entropy threshold between the two classes, i.e., E = 0.7.
Then, we validated the absence/presence of people classification approach. We analyzed the results over the evaluation dataset, PDbm [40]. We performed a 10-fold cross-validation evaluation, randomly selecting a balanced set of 1000 frames with and without the presence of people. We analyzed the precision (P), recall (R) and FScore (F) for each class (the absence/presence of people, Classes 1 and 2, respectively) and the final FScore sum. Table 2 shows the classification results obtained by a random classifier, by the six detectors independently and by our proposal with different numbers of thresholds L = {5, 10, 20, 40, 60}. For the independent detectors, the optimal fixed threshold was previously learned with the training dataset VOC2012 (Visual Object Classes Challenge 2012 [42]). The proposed coarse adaptation could classify both classes (absence and presence of people) with around 80% precision and recall. On the other hand, all the other approaches obtained worse results (around 50-60%). The results clearly show that the use of the entropy over the six detectors significantly improves precision, recall and FScore with respect to the use of the detectors independently and, therefore, with respect to a random classifier. In addition, the results show that the performance for the different numbers of thresholds L = {5, 10, 20, 40, 60} is quite homogeneous, all of them obtaining an FScore sum of around 1.6. For that reason, we use the coarse adaptation with L = 5 since it presents a lower computational cost, i.e., a lower number of pair-wise correlations between detectors per frame (see detailed analysis in Section 4.3.3 and Table 8).
Table 3 shows the average results after adapting two and six detectors, ADC2 and ADC6, respectively, with different numbers of thresholds L = {5, 10, 20, 40, 60} and strategies to select a threshold $\tau_n^{n,m}$ from those values maximizing $C_{n,m}$ (mean, minimum or maximum).
In both cases, the results show that the performance increases progressively with the number of thresholds. In addition, the minimum strategy obtained in general the worst results and the mean strategy obtained slightly better results than the maximum one. Figure 10 shows examples of correlation and threshold selection results between pairs of detectors. In the first row, there are three examples of scenes without people and low FScore similarity for any possible pair-wise correlation, while the other two rows include examples with one to five pedestrians and medium-high FScore similarity for a range of pair-wise correlations.
Table 3. Average FScore of adapted detectors for different strategies to select a threshold $\tau_n^{n,m}$ from those values maximizing $C_{n,m}$, obtained with various numbers of thresholds, L = 5, 10, 20, 40 and 60. Bold indicates best result for: (a) ADC2; and (b) ADC6. Data adapted from [15].
[32] using Inria (cyan) and Pascal (orange) models; and (c) ACF [41] using Inria (cyan) and Caltech (orange) models.
Table 4 shows one example of successively adding detectors to the final configuration, from two detectors to six (from ADC2 to ADC6). In general, the results show that the greater the number of detectors, the higher the performance. For example, DPM-I progressively increases its performance from 37.1 (ADC2) to 38.2 (ADC6). As other examples, ACF-I progressively increases its performance from 38.3 (ADC3) to 39.5 (ADC6) and ACF-C progressively increases its performance from 40.0 (ADC4) to 42.0 (ADC6).
Table 4. Average FScore of the five ADC combinations from ADC2 to ADC6. Percentage increase (%∆) calculated for each detector with respect to the previously obtained performance just before the additional detector inclusion in the combination (in bold), from ADC2 to ADC5, respectively. Data adapted from [15].
Table 5 shows the comparative results of our approach (ADC6, all six detectors independently of the order of their inclusion) versus two different fixed thresholding approaches (FT PDbm and FT VOC12). The FT PDbm approach is the ideal case, where the optimal threshold is previously learned with the chosen evaluation dataset (PDbm [40]), and FT VOC12 is a more realistic approach, where the optimal threshold is previously learned with the training dataset VOC2012 (Visual Object Classes Challenge 2012 [42]). The results clearly show that our adaptive threshold approach ADC6 significantly improves the results of any of the individual detectors using a fixed threshold (10.1% and 18.6% average improvement with respect to FT PDbm and FT VOC12, respectively).
Additionally, we also evaluated the fine adaptation stage (ADC6) over a different dataset, the MILAN dataset [43]. This dataset includes eleven challenging, publicly available video sequences with ground truth (TUD-Stadtmitte, TUD-Campus and TUD-Crossing, S1L1 (1 and 2), S1L2 (1 and 2), S2L1, S2L2, S2L3 and S3L1). The first three sequences are recorded in real-world busy streets, where the complexity in terms of crowds or occlusions is medium or low (fewer than 10 pedestrians are present simultaneously). The last eight sequences are part of the PETS 2009/2010 benchmark [44]. They are recorded outdoors from an elevated point of view, corresponding to a typical surveillance setup. These scenarios include higher complexity in terms of crowds and occlusions than the previous ones (generally more than 10 pedestrians are present simultaneously). Table 6 shows the comparative results of our approach (ADC6) versus two different fixed thresholding approaches (FT MILAN and FT VOC12) over the MILAN dataset [43]. As in the previous experiment, the FT MILAN approach is the ideal case and FT VOC12 is a more realistic approach. ADC6 presents results similar to those obtained with the previous dataset.
In this case, the fixed thresholding results are higher and therefore the potential improvement is slightly smaller. Even so, our adaptive approach ADC6 significantly improves the results of any of the individual detectors using a fixed threshold (8.3% and 12.9% average improvement with respect to FT MILAN and FT VOC12, respectively).

Fine Adaptation: Maximum A Posteriori Estimation
As commented in Section 4.3.1, the previous results are for the threshold hypothesis selection using the Maximum Likelihood Estimation (MLE). However, the results can be improved by including the prior distributions of any pair of threshold configurations, i.e., the correlation map $C_{n,m}$. Therefore, we evaluated the results using the Maximum A Posteriori Estimation (MAP). Firstly, while learning the optimal fixed thresholds used for the evaluation comparison, we also learned the prior distributions of each pair of detectors with the training dataset VOC2012 (Visual Object Classes Challenge 2012 [42]), and then we evaluated the results of our approach ADC6 over PDbm including the estimated posterior in the threshold hypothesis selection. Figure 11 includes a visual representation of the 15 different prior distributions, one for each pair of the six detectors, and their 15 mirrored versions. Note the clearly different behavior of the different detectors. While the DPM and ACF versions present a more concentrated range of best thresholds, both FRCNN variations present a sparser range of best thresholds. This is due to the better detection performance of FRCNN itself; therefore, any possible improvement over a predefined fixed threshold will be more difficult. Table 7 shows the comparative results using the MLE versus using the MAP. The results clearly show that our adaptive threshold approach ADC6 with the MAP improves the results of any of the individual detectors without the MAP (3.3% average improvement).

Fine Adaptation: Threshold Hypothesis Selection
We propose using a sub-optimal global search solution of the threshold hypothesis selection problem with lower computational cost requirements, the Simulated Annealing (SA) [38]. We compared SA against other search alternatives; for example, applying a subset of thresholds K = L/k (see Section 3.1.1), being k the sub-sampling factor in the decision space, i.e., k ∈ R and k > 1. In particular, we evaluated four sub-sampling factors from the original decision space L = 60 (Exhaustive Search, ES), the sub-optimal subsets of thresholds are K = {40, 20, 10, 5}. We also evaluated three non-regular sub-sampling patterns, the Three Step Search (TSS) [45], the Four Step Search (FSS) [46], and the Diamond Search (DS) [47]. Finally, we also evaluated two traditional global optimization pattern search approaches: the Pattern/Direct Search (PS) [48] and the Particle Swarm Optimization (PSO) [49]. Table 8 shows the comparative results in terms of FScore and computational cost (number and percentage of operations per each frame), between different threshold hypothesis selection approaches, including regular sub-sampling patterns with sub-optimal subsets of thresholds K = {40, 20, 10, 5}, non-regular sub-sampling patterns (TSS, FSS and DS) and more traditional global optimization approaches (PS, PSO, and SA). The results show clearly how the exhaustive approach, i.e., searching in the original decision space L = 60, obtains the best results but the highest computational cost. Logically, any sub-optimal global search solution of the threshold hypothesis selection problem will obtain worse results in terms of FScore, but also a reduction of the computational cost. The use of different sub-optimal subsets of thresholds (K = {40, 20, 10, 5}), obtained progressively worse FScore results (from 42.5 to 37.4 respectively) but with a strong reduction in terms of percentage of operations (from 44.4% to 0.7%, respectively, being the 100% of operations per each frame required with K = 60). 
The non-regular sub-sampling patterns also obtained worse FScore results (between 32.8 and 39.9), but always with a drastic reduction in the percentage of operations (only between 0.4% and 1.1% of the operations per frame are required). In particular, FSS obtains the best ratio between FScore results and computational cost. Finally, the more traditional global optimization pattern search approaches also obtained worse FScore results (between 35.7 and 42.0), with a drastic reduction in the percentage of operations (only between 0.2% and 5.0% of the operations per frame are required). In particular, SA obtained the best FScore results among them (42.0) together with a strong computational cost reduction (only 5.0% of the operations per frame are required). Note the progressive reduction of FScore and computational cost for the sub-optimal subsets of thresholds (K = {40, 20, 10, 5}), the significant reduction of FScore, combined with a strong computational cost reduction, for the non-regular sub-sampling patterns (TSS, FSS and DS), and the different behaviors of the three more traditional global optimization pattern search approaches, with SA being significantly better.

Table 8. Comparative results between different search approaches for threshold hypothesis selection, including regular sub-sampling patterns with sub-optimal subsets of thresholds K = {40, 20, 10, 5}, non-regular sub-sampling patterns (TSS, FSS and DS) and more traditional global optimization approaches (PS, PSO, SA). Results in terms of FScore and computational cost (number and percentage of operations per frame).
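To make the SA search concrete, the following Python sketch runs a generic simulated annealing over a discrete grid of threshold indices for two detectors. It is a minimal illustration under stated assumptions, not the configuration used in the experiments: the cooling schedule, the neighborhood (one index moved by one step), the iteration budget and `toy_score` (a hypothetical stand-in for the pair-wise correlation score) are all chosen for illustration.

```python
import math
import random


def simulated_annealing(score, n_levels, n_detectors=2, n_iters=200,
                        t_start=1.0, t_end=0.01, seed=0):
    """Maximize score(indices) over threshold indices in [0, n_levels)."""
    rng = random.Random(seed)
    current = [rng.randrange(n_levels) for _ in range(n_detectors)]
    current_score = score(current)
    best, best_score = list(current), current_score
    evaluations = 1
    for i in range(n_iters):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (i / max(1, n_iters - 1))
        # Neighbor: move one detector's threshold index by one step, clamped.
        candidate = list(current)
        d = rng.randrange(n_detectors)
        candidate[d] = min(n_levels - 1, max(0, candidate[d] + rng.choice([-1, 1])))
        cand_score = score(candidate)
        evaluations += 1
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = list(current), current_score
    return best, best_score, evaluations


def toy_score(indices):
    # HYPOTHETICAL objective standing in for the detector correlation score.
    a, b = indices
    return -((a - 42) ** 2 + (b - 17) ** 2) / 100.0


best, best_score, evaluations = simulated_annealing(toy_score, n_levels=60)
print(best, best_score, evaluations)
```

With these illustrative settings the search evaluates 201 threshold pairs instead of the 60 × 60 pairs of an exhaustive scan over two detectors, which is the kind of trade-off between score and number of operations discussed above.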

Final Adaptation System (Coarse and Fine)
We evaluated the whole proposed framework (coarse and fine adaptation), described in Section 3. The coarse and fine adaptation were evaluated at frame-level, as shown in Sections 4.2 and 4.3, respectively. In particular, we evaluated the use of our coarse analysis to identify the representative frames for a possible adaptation of the system; frames without the presence of people were discarded and frames with the presence of people were further analyzed locally. To evaluate the whole coarse-to-fine adaptation process, we compared the results without and with the inclusion of the coarse adaptation stage at video-level. The system without the coarse adaptation corresponds to the proposed fine adaptation ADC6 with MLE or MAP, as evaluated in detail in Sections 4.3.1 and 4.3.2, respectively. We defined the entropy coarse adaptation threshold with L = 5 and according to the Likelihood Ratio Test, i.e., E = 0.7 (see detailed reasoning in Section 4.2). Generally, the inclusion of the coarse adaptation obtained worse results in terms of the number of true positive detections, because frames misclassified as containing no people necessarily produce missed detections. However, the coarse adaptation also obtained better results in terms of false positive detections, since frames correctly classified as containing no people potentially reduce the total number of false detections (see Section 4.2 for further details). In addition, the inclusion of the coarse adaptation significantly reduces the computational cost, since applying the fine adaptation in every frame demands a higher computational cost. Table 9 shows the final adaptation system results for each detection algorithm, with the use of MLE or MAP. In general, the use of the coarse adaptation introduces a significant improvement in the evaluation results (between 21.7% and 90.8%). This is due to the balance between the number of false detections and the number of true positive detections.
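The trade-off described above, fewer true positives but also fewer false positives, can be illustrated with the standard F-score definition. The Python sketch below assumes the usual formulation FScore = 2TP / (2TP + FP + FN); the detection counts are made-up illustrative numbers, not results from Table 9.

```python
def f_score(tp, fp, fn):
    """Standard F-score from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# HYPOTHETICAL counts, chosen only to illustrate the trade-off.
without_coarse = f_score(tp=80, fp=60, fn=20)  # fine adaptation in every frame
with_coarse = f_score(tp=75, fp=25, fn=25)     # some people missed, many false alarms removed

print(f"without coarse stage: {without_coarse:.3f}")
print(f"with coarse stage:    {with_coarse:.3f}")
```

In this toy case the loss of a few true positives is more than compensated by the removed false positives, so the F-score increases; the opposite can happen if too many frames with people are discarded.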
Table 10 shows the comparative results in terms of FScore and computational cost (number and percentage of operations per frame) between the use of a fixed threshold FT VOC12 and the final adaptation system (MLE or MAP). There is also an improvement in FScore performance (10.8% and 16.1% average improvement over the fixed thresholding approach FT VOC12 for MLE and MAP, respectively) and a reduction of almost 50% in computational cost per frame.
To understand the relation between the entropy coarse adaptation threshold (E) and the performance in terms of FScore and computational cost, we analyzed the performance of our final system with MLE (the MAP version presents exactly the same behavior) for different entropy coarse adaptation thresholds, E = 0, 0.1, ..., 1.5. Note that E = 0 corresponds to the absence of coarse adaptation, i.e., only the fine adaptation ADC6. Figure 12 shows the final results versus the corresponding computational cost in terms of percentage of operations. Note the progressive increase in FScore from E = 0 up to the LRT value (E = 0.7) and the subsequent reduction in FScore until E = 1.5. In general, discarding frames without the presence of people improves the results by avoiding false detections up to the LRT value (E = 0.7); beyond this point, the balance between false detections and missed detections starts to degrade the performance.

Figure 12. Comparative video analysis results with different coarse adaptation configurations, absence/presence of people classification decision, from entropy E = 0 to 1.5. Global FScore results for each video versus computational cost in terms of percentage of operations.
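The role of the threshold E as a gate for the fine adaptation can be sketched as follows. This Python example assumes that a per-frame entropy value is already available and that frames whose entropy reaches E are treated as containing people; this direction is an assumption, chosen because it makes E = 0 disable the coarse stage as described above. The function name and the entropy values are hypothetical.

```python
def select_frames_for_fine_adaptation(frame_entropies, entropy_threshold=0.7):
    """Return the frame indices passed to the fine stage and their fraction.

    ASSUMPTION: a frame is kept when its entropy is >= entropy_threshold,
    so a threshold of 0 keeps every frame (no coarse adaptation).
    """
    selected = [i for i, h in enumerate(frame_entropies) if h >= entropy_threshold]
    fraction = len(selected) / len(frame_entropies) if frame_entropies else 0.0
    return selected, fraction


# HYPOTHETICAL per-frame entropy values, for illustration only.
entropies = [0.0, 0.2, 0.9, 1.3, 0.7, 0.4]

for e in (0.0, 0.7, 1.5):
    frames, fraction = select_frames_for_fine_adaptation(entropies, e)
    print(f"E = {e}: frames {frames} ({fraction:.0%} of the frames refined)")
```

Raising E reduces the fraction of frames that reach the fine stage, and therefore the computational cost, at the risk of discarding frames that do contain people.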

Conclusions
We have presented a coarse-to-fine framework to automatically adapt people detectors during runtime classification. This proposal explores multiple thresholding hypotheses and exploits the correlation among pairs of detector outputs to determine the best configuration. The coarse adaptation determines the presence or absence of people in every frame and, therefore, whether the system needs to be adapted. The fine adaptation obtains the optimal detection threshold for each detector in every frame. The proposed approach uses standard state-of-the-art detector outputs (bounding boxes) and can therefore employ various types of detectors. This framework allows automatic threshold adaptation without requiring a re-training process and, therefore, without requiring any additional manually labeled ground truth apart from the offline training of the detection model. The proposed coarse adaptation is able to classify both classes, absence and presence of people, with around 80% precision and recall. The fine adaptation results (both MLE and MAP versions) demonstrate that any correlation of up to six detectors outperforms state-of-the-art detectors whose thresholds are optimally trained in advance. In addition, we explored other sub-optimal threshold hypothesis selection approaches with lower computational cost requirements (number of pair-wise correlations between detectors per frame). In particular, the SA search nearly matches the FScore results of the exhaustive search with a drastic computational cost reduction. Overall, the final coarse-to-fine framework also outperforms state-of-the-art detectors, for both frame-by-frame and video analysis results, with a computational cost reduction of around 50%.
For future work, we will study other threshold selection and fusion alternatives, and we will apply this proposal to other detectors and object types. We will also explore configuration parameters other than the detection threshold, for example the position of the bounding box, the scale of the detected objects, the pose, etc.
We acknowledge that running six detectors significantly increases the required resources as compared to running a single detector. However, this adaptation scheme may not need to be applied for each frame of a video sequence and it may be used periodically (e.g., every 1 or 5 s) or be used on-demand (e.g., when scene conditions change after a camera moves). In this case, the computational cost is considerably decreased as we may not apply our adaptation to each frame. We will consider such applicability in real systems as future work.
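As a rough illustration of the periodic use mentioned above, the following Python sketch lists the frames at which the adaptation would run when it is triggered every fixed number of seconds, with the last selected thresholds reused in between. The function name and the frame rate of 25 frames per second are assumptions made for the example.

```python
def adaptation_frame_indices(n_frames, fps, period_s):
    """Indices of the frames where the adaptation is run every period_s seconds."""
    step = max(1, round(fps * period_s))  # at least one frame between adaptations
    return list(range(0, n_frames, step))


# HYPOTHETICAL sequence: 250 frames at 25 frames per second (10 s of video).
for period in (1.0, 5.0):
    frames = adaptation_frame_indices(250, 25, period)
    print(f"every {period} s: {len(frames)} adaptations out of 250 frames")
```

In this example, adapting every 1 s runs the adaptation in 10 of the 250 frames, and adapting every 5 s in only 2 of them.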