Malicious UAV Detection Using Integrated Audio and Visual Features for Public Safety Applications

Unmanned aerial vehicles (UAVs) have become popular in surveillance, security, and remote monitoring. However, they also pose serious security and privacy threats to the public. The timely detection of a malicious drone remains an open research issue for security provisioning companies. Recently, the problem has been addressed by a plethora of schemes. However, each scheme has limitations, such as poor performance in extreme weather conditions and the need for huge datasets. In this paper, we propose a novel framework that combines handcrafted and deep features to detect and localize malicious drones from their sound and image information. The respective datasets include sounds and occluded images of birds, airplanes, and thunderstorms, with variations in resolution and illumination. Various kernels of the support vector machine (SVM) are applied to classify the features. Experimental results validate the improved performance of the proposed scheme compared to other related methods.


Introduction
Mini drones, also known as unmanned aerial vehicles (UAVs), have played a vital role in the development of smart cities. UAVs have numerous industrial and agricultural applications. The high-resolution images collected through UAVs help in various monitoring applications of the cement industry [1]. Drones are helpful in irrigation [2] and in carrying chemical pesticides or fertilizers to spray on plants [3]. So-called foggy drones use thermal cameras to scan the roads and avoid accidents in foggy weather [4]. UAVs can operate as mobile base transceiver stations (BTS) to serve surges in traffic demand during disasters [5,6]. In smart cities, drones help resolve cybersecurity issues [7]. UAVs also help in the navigation and positioning of military targets during war [8].
Malicious UAVs are those that either carry a restricted explosive payload or collect audiovisual data from restricted private territory. Moreover, a UAV can be considered malicious when its operator loses control and it enters a no-fly zone [9]. The low-altitude flight of a malicious drone enables it to violate the security measures of a restricted zone, as shown in Figure 1. Restricted areas protect sensitive locations, such as prisons and nuclear facilities. The official definition of such a restricted

There is a need for a technology that can detect and disarm such malicious UAVs in a timely manner. Recently, various techniques for UAV detection have been reported in the literature, relying on audio, video, thermal, and radio frequency (RF) signals [10]. Each scheme has its own advantages and limitations. The video- and thermal-based detection techniques fail in adverse weather conditions. The sound of a UAV's motor fan and its images are useful to differentiate the amateur UAV from other objects. The audio-based detectors are cost-effective as they require only an array of microphones to capture the sounds and classify them into their respective classes. However, environmental noise can degrade the performance of sound-based detection [11].
We propose a machine-learning-driven audio- and vision-based UAV detection method. The proposed scheme is capable of detecting UAVs with higher accuracy, even in a noisy environment. The proposed hybrid method consists of acoustic and image processing algorithms for the precise detection of amateur drones [10,11]. The classification accuracy obtained using handcrafted descriptors and deep neural networks is compared with that of the proposed framework. Various handcrafted feature extraction methods for image description, such as Local Binary Pattern (LBP) [12], Histogram of Oriented Gradient (HOG) [13], Locally Encoded Transform Feature Histogram (LETRIST) [14], Gray Level Co-occurrence Matrix (GLCM) [15], Completed Joint-scale Local Binary Pattern (CJLBP) [16], Local Tetra Pattern (LTrP) [17], and Non-Redundant Local Binary Pattern (NRLBP) [18], have been employed to detect objects based on their texture. Moreover, several handcrafted feature extraction methods for audio have been proposed, such as Linear Predictive Cepstral Coefficients (LPCC) [19] and Mel Frequency Cepstral Coefficients (MFCC) [20]. Deep neural network (DNN) models such as AlexNet [21], ResNet-50 [22], VGG-19 [23], Inceptionv3 [24], and GoogLeNet [25] have also been utilized for image feature extraction. The support vector machine (SVM), along with various kernels, has been employed to classify the extracted feature vectors. The proposed scheme is cost-effective as well as highly accurate, even with a small dataset. It integrates the handcrafted sound descriptor with deep features extracted from the image to detect the malicious drone. This hybrid method provides better accuracy even in adverse weather conditions [11].

Sensors 2020, 20, 3923

Related Work
UAVs can efficiently be detected via several intrinsic signals, namely thermal images, the sound of the UAV's motors, and radio frequency (RF) radar [10]. In [26], the authors achieved 81% UAV detection accuracy by extracting features from the input array of cameras and microphones. In [27], a pseudorandom sequence of binary values was presented to detect drones. The results show that the pseudorandom sequence can only detect UAVs within the 100 m range for the 2 GHz band. The technique in [28] proposed a 35 GHz frequency-modulated continuous-wave (FMCW) radar equipped with fixed antennas. The results show that the estimated velocities efficiently detected UAVs. This system can be made more efficient by employing circularly polarized antennas. In [29,30], a deep belief network (DBN) along with convolutional neural networks (CNNs) was reported. The DBN accuracy depends on channel conditions; moreover, these networks require a huge dataset for accurate detection. In [31,32], texture descriptors were developed that can classify surfaces into their respective classes even in the presence of geometric and photometric variations. In [33], a tracker was developed by employing the handcrafted descriptors proposed in [31,32].
Furthermore, the authors in [34] measured the radio signal in cellular networks using logistic regression and decision tree to detect drones. The accuracy of these models is reduced when drones are flying at lower heights. Similarly, in [35], plotted image machine learning (PIL) and K-nearest neighbors (KNN) were developed for acoustic-based drone detection in the real-time scenario. The simulation results show that PIL is 22% more accurate than KNN, while KNN is less complicated than PIL. These approaches require a massive amount of data for better performance.
In [28], the authors present a limited-dataset-dependent algorithm for correlation-based sound detection. The method is cost-effective, but it is not suitable for real-time applications. In [36,37], a video-based mechanism was developed for robust detection of drones. In this scheme, the system is equipped with two cameras with day and night vision sensors. Short-wave infrared (SWIR) cameras along with high-resolution visual-optical (VIS) cameras were added to this system. Still, it failed to improve accuracy. The mechanism in [37] failed to work properly in strong wind. In [38], a Hidden Markov Model (HMM) was used to detect UAVs using acoustic sensors. This model also has limitations, as it performs poorly with a small amount of training data due to the complexity of the classifiers. To the best of the authors' knowledge, no existing scheme can detect UAVs accurately using a small amount of training data and machine learning algorithms. This paper contributes to detecting UAVs through a hybrid approach: the first part is related to the detection of UAVs by their sound, while the second part consists of UAV detection and localization using images.

UAV Detection Methodology
UAVs have specific acoustic features that differ from other sounds in the surrounding environment. These sounds play a vital role in UAV detection if appropriate features are extracted and classified. In addition, UAVs differ in shape from surrounding objects, so images also provide useful information for detecting UAVs. The image features are extracted by a convolutional neural network (CNN) such as AlexNet, and the extracted features are then classified using an efficient classifier.
The proposed malicious UAV detection model depends on the audio and images collected within the restricted zone, as shown in Figure 2. The arrays of microphones and high-resolution cameras capture the audio and video within the restricted zone. First, the ground control stations (GCS) collect the audio and visual information from the respective array of sensors. In the second stage, features are extracted from the audio and visual information through a specified descriptor. In the third step, the extracted features are classified using a trained classifier. In this paper, we have used a machine learning technique to classify the audio and image features extracted through the MFCC and AlexNet model, respectively. The SVM with various kernels is used as a classifier.

Audio Feature Extraction
The audio features are extracted through the Mel Frequency Cepstral Coefficients (MFCC) descriptor. In MFCC, the frequency axis is mapped onto the Mel frequency scale [20]. First, pre-emphasis and windowing are applied to the audio. Second, the fast Fourier transform (FFT) is applied to the filtered sound signal, followed by the Mel filter banks. In the third stage, the log of the filter bank energies is calculated. Finally, the discrete cosine transform (DCT) is applied, and the resulting coefficients 2 to 13 are preserved, while the rest are discarded. The output of the DCT is the MFCCs, and all of these steps are illustrated in Figure 3. The frequency in hertz (Hz) is converted into the Mel frequency scale through Equation (1).
The symbol mel in Equation (1) represents the frequency in the Mel scale, while the symbol f represents the frequency in hertz. The Mel spectrum is the result of taking the log of the filter bank energies. The DCT is applied to the Mel spectrum to get the Mel cepstrum coefficients, as shown in Equation (2).
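Equation (1) itself is not reproduced in this text. As a minimal sketch, assuming the widely used form mel = 2595 · log10(1 + f / 700) (an assumption, not confirmed by the source), the conversion and its inverse could be written as follows. The function names are illustrative only.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert a frequency in hertz to the Mel scale (assumed 2595/700 form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel: convert a Mel value back to hertz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

Under this form, 1000 Hz maps to roughly 1000 mel, which is the usual anchor point of the scale.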
The function c(n) in Equation (2) represents the MFCC coefficients, while the symbol C is the number of MFCC coefficients. The function D(m) denotes the Mel magnitude spectrum, which is the product of the magnitude spectrum and the triangular Mel weighting filters, where m indexes the m-th triangular filter. The variable k in Equation (2) denotes the index of the sample, while M represents the total number of samples.
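Equation (2) is likewise missing from this text. The sketch below assumes the standard type-II DCT over the log filter-bank energies, c(n) = Σ log D(m) · cos(π n (m + 0.5) / M) with m running from 0 to M − 1, and interprets "values between 2 and 13" as 1-based positions, so the first coefficient is dropped and 12 are kept. Both choices are assumptions, and the function names are hypothetical.

```python
import math

def dct_ii(values):
    """Unnormalized type-II DCT of a sequence."""
    M = len(values)
    return [
        sum(v * math.cos(math.pi * n * (m + 0.5) / M) for m, v in enumerate(values))
        for n in range(M)
    ]

def mfcc_from_filterbank(energies, keep_from=2, keep_to=13):
    """Log-compress Mel filter-bank energies, apply the DCT, keep coefficients 2..13."""
    # A small floor avoids log(0) for empty filter banks.
    log_energies = [math.log(max(e, 1e-12)) for e in energies]
    cepstrum = dct_ii(log_energies)
    # Positions are 1-based in the text, hence the -1 on the lower bound.
    return cepstrum[keep_from - 1:keep_to]
```

With 26 filter-bank energies per frame (a common but here hypothetical choice), each frame yields a 12-dimensional feature vector.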

Visual Feature Extraction
AlexNet is used to extract features from the image. It has 25 layers: one input layer, one output layer, and 23 hidden layers. The hidden layers consist of five convolutional layers, three max-pooling layers, seven rectified linear unit (ReLU) layers, three fully connected layers, two cross-channel normalization layers, two dropout layers, and one softmax layer. The feature extraction using AlexNet is shown in Figure 4. The size of the input image is 227 × 227 × 3 at the input layer of AlexNet. This input is fed into the first convolutional (C1) layer, which has 96 kernels and a stride of 4 × 4. The remaining convolutional layers are cascaded to C1 with a stride of 1 × 1.
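To make the effect of the stride concrete, the spatial size of a convolutional layer's output can be computed from the input size, kernel size, stride, and padding. The sketch below assumes an 11 × 11 kernel with no padding for C1, which is the usual AlexNet configuration but is not stated in the text above.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# C1 under the assumed 11 x 11 kernel: (227 - 11) / 4 + 1 = 55,
# so the 96 kernels would produce a 55 x 55 x 96 output volume.
c1_side = conv_output_size(227, kernel_size=11, stride=4)
```

With a stride of 1 and suitable padding, as in the later convolutional layers, the spatial size is preserved.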

Support Vector Machine (SVM)
In this paper, SVM is used to classify the extracted features. The SVM sets its hyperplane based on the positive and negative training feature sets so as to minimize the classification error, as shown in Figure 5. Linear, Gaussian, and polynomial kernels have been used to classify the features. SVM chooses the optimal decision boundary based on the maximum margin, which best separates the data points. The classification error ratio decreases as the margin increases, so maximizing the margin yields the fewest errors [39]. The training points closest to the optimal separating hyperplane are the support vectors [40]. This can be written as in Equation (3).
where β denotes the bias, while the symbols x and ω are vectors representing the input and its weight, respectively. When the extracted features have a higher dimensionality, the learning process selects those variables having a higher interclass variation. This technique is generally known as the kernel trick [41]. The principal advantage of SVM kernels is their capacity to work in any number of dimensions without extra computation or complexity. SVM can perform well even for noisy high-dimensional feature vectors, which motivates us to choose SVM as the classifier. For SVM classification accuracy, selecting a suitable kernel plays an essential role. We compared the classification accuracy of SVM with its linear, Gaussian, and polynomial kernel types. Equation (4) is for the linear kernel. For the polynomial kernel, Equation (5) is used. Here, the dot product of the vectors x_i and x_j is taken, and the vectors are plotted in a space of dimension p. Equation (6) is used for the Gaussian kernel, where ‖x_i − x_j‖ is the Euclidean distance between two different samples. The width of the Gaussian kernel can be controlled by changing the value of the variance σ.
SVM is trained using features extracted from AlexNet for the visual dataset and MFCC-extracted features in the case of the audio dataset. The kernel type is the hyperparameter varied in SVM training.
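Equations (3) to (6) are not reproduced in this text. The sketch below uses the textbook forms as an assumption: a linear decision value ω · x + β, a linear kernel x_i · x_j, a polynomial kernel (x_i · x_j + c)^p, and a Gaussian kernel exp(−‖x_i − x_j‖² / (2σ²)). The offset c and the default parameter values are hypothetical and may differ from the authors' formulation.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def decision_value(w, x, beta):
    """Signed score of x relative to the hyperplane with weights w and bias beta."""
    return dot(w, x) + beta

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, p=3, c=1.0):
    return (dot(xi, xj) + c) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    # Squared Euclidean distance between the two samples.
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

A larger σ widens the Gaussian kernel, so distant samples still receive a non-negligible similarity score.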

Experimental Results
In this section, we evaluate UAV detection with integrated audio and visual features on the audio and image datasets. The datasets are classified with an SVM classifier. Malicious UAVs are localized using handcrafted descriptors such as HOG, LBP, CJLBP, LTrP, GLCM, NRLBP, and LETRIST, as well as deep neural networks such as AlexNet, Inceptionv3, VGG-19, ResNet-50, and GoogLeNet. For the acoustic dataset, malicious UAVs are detected using MFCC, LPCC, and the zero-crossing rate (ZCR), implemented in MATLAB. All the experiments were run on a computer with an Intel(R) Core i7 processor (3.6 GHz) and 16 GB DDR4 RAM (CyberpowerPC, Gamer Supreme Liquid Cool, SLC8260A2).

Image Dataset Description
We implemented the proposed method with a dataset of 506 images. Three hundred fifty images were used for training, while 156 images were used for testing. The images were selected randomly with a ratio of approximately 70% for training and 30% for testing. The dataset consists of five image classes: birds, airplanes, kites, balloons, and drones. The flight scenarios of the dataset are low altitude, high altitude, bad weather, bad visibility, clear weather, and noisy environment. The images of the dataset have variations in their resolution, scale, orientation, and illumination. Moreover, drone images also have environmental occlusions. Several pictures from the dataset are presented in Figure 6.

Audio Dataset Description
We implemented the proposed method with a dataset of 217 audio samples. One hundred fifty-seven audio samples were used for training the model, and 60 audio samples were used for testing. The audio samples were randomly selected with a ratio of approximately 70% for training and 30% for testing. The dataset contained audio samples of drones, airplanes, birds, and thunderstorms. All the audio samples were different in length. The spectrograms of the drone, bird, thunderstorm, and plane audio samples, at a sampling frequency of 44 kHz, are shown in Figure 7a-d, respectively. The drone spectrogram contains a red line, which means that the drone sound is concentrated at specific frequencies, i.e., 2.4 kHz. This red line is not observed in the spectrograms of the other audio samples because they contain both low and high frequencies.

Malicious UAV Detection with Hand-Crafted Descriptors
We used hand-crafted descriptors such as HOG, LBP, CJLBP, NRLBP, GLCM, LTrP, and LETRIST to detect malicious UAVs, with SVM as the classifier. The implemented code of all hand-crafted descriptors is available at [42]. The accuracy of each descriptor with various kernels of SVM is presented in Table 1.

UAV Detection with CNNs
The results show that hand-crafted descriptors are not very efficient in malicious UAV detection, as their maximum accuracy is 82.7%. We therefore used CNNs such as AlexNet, Inceptionv3, ResNet-50, GoogLeNet, and VGG-19 for the detection of malicious UAVs. The CNN models are used as descriptors by collecting feature values from the fully connected layer of each respective model. The accuracy, sensitivity, and specificity of all CNNs are shown in Table 2 for different kernels of the SVM classifier. The source codes of all implemented CNNs are available at [43]. The accuracy of AlexNet using the linear or polynomial kernel of the SVM classifier is the greatest among all CNNs, at 97.4%. The confusion matrices of AlexNet using the linear kernel, Gaussian kernel, and polynomial kernel of SVM are shown in Figure 8a-c, respectively. A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the actual values are known. The diagonal elements of the confusion matrix express the percentage of correct classifications, while the other elements represent the wrong predictions of the classifier. As the accuracy of AlexNet using the linear and polynomial kernels of SVM is 97.4% in both cases, we propose detection of malicious UAVs with AlexNet using the polynomial kernel of SVM, because its sensitivity is higher than that of the linear kernel and it is more robust to image variations such as resolution, scale, orientation, illumination, and occlusions.
The parameters TP, FP, TN, and FN are the true-positive, false-positive, true-negative, and false-negative test samples, respectively. For each threshold, two values are calculated: the true-positive ratio (TPR) and the false-positive ratio (FPR). The TPR is the ratio of TP to the sum of TP and FN and is also known as sensitivity. Equation (7) is used to calculate sensitivity.
Specificity is another parameter, which gives the proportion of correctly identified negative instances. Equation (8) can be used to find specificity.
The overall accuracy and error of the classifier are calculated as in Equations (9) and (10), respectively.
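Equations (7) to (10) are not reproduced in this text. The sketch below assumes the standard definitions: sensitivity = TP / (TP + FN), as stated above, specificity = TN / (TN + FP), accuracy = (TP + TN) / (TP + FP + TN + FN), and error = 1 − accuracy. The function name and the example counts are hypothetical and are not results from the paper.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Sensitivity, specificity, accuracy, and error from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn),   # true-positive ratio (TPR)
        "specificity": tn / (tn + fp),   # proportion of negatives correctly identified
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,      # equals 1 - accuracy
    }

# Illustrative counts only.
example = binary_metrics(tp=45, fp=10, tn=40, fn=5)
```

For a multiclass problem such as the five image classes, these quantities are typically computed per class by treating that class as positive and all others as negative.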
As the accuracy of AlexNet is the greatest, we used it for the localization of malicious UAVs in full images. For localization, we used the AlexNet training features and training labels calculated from the image dataset. The localization procedure is explained in Algorithm 1 and Figure 9. The input image is first scaled into various sizes by creating a scale pyramid, and fixed-size patches are collected from each scale with a 50% overlap. Each local patch is described and classified through the proposed model shown in Figure 9. The size and coordinate values of the detected drone are transformed into the original image coordinates by inverting the scaling process shown in the figure, and a bounding box annotation is created at those coordinates. The results of localization are shown in Figure 10.
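The scale-pyramid search can be sketched as follows. This is a minimal illustration and not the authors' Algorithm 1: the patch size of 227 (matching the AlexNet input), the scale factors, and the `is_drone` callback standing in for the AlexNet plus SVM classifier are all assumptions.

```python
def sliding_windows(width, height, patch=227, scales=(1.0, 0.75, 0.5), overlap=0.5):
    """Yield (scale, x, y) for every fixed-size patch on every pyramid level."""
    step = max(1, int(patch * (1.0 - overlap)))  # 50% overlap -> half-patch step
    for scale in scales:
        scaled_w, scaled_h = int(width * scale), int(height * scale)
        for y in range(0, scaled_h - patch + 1, step):
            for x in range(0, scaled_w - patch + 1, step):
                yield scale, x, y

def to_original_box(scale, x, y, patch=227):
    """Map a patch on a scaled level to (left, top, right, bottom) in the original image."""
    return (round(x / scale), round(y / scale),
            round((x + patch) / scale), round((y + patch) / scale))

def localize(width, height, is_drone, patch=227, scales=(1.0, 0.75, 0.5)):
    """Return original-image boxes for all patches the classifier flags as a drone."""
    return [to_original_box(s, x, y, patch)
            for s, x, y in sliding_windows(width, height, patch, scales)
            if is_drone(s, x, y)]
```

A patch found on a downscaled level covers a larger region of the original image, which is how a fixed-size patch can capture drones of different apparent sizes.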

Detection Using Audio
We used two descriptors, i.e., LPCC and MFCC, to detect UAVs using audio samples. We used the SVM classifier and calculated the accuracy from the confusion matrix in MATLAB. The implemented code is available at [44]. Table 3 shows the accuracy, sensitivity, and specificity of all the descriptors using different kernels of SVM. MFCC proved to be very effective in UAV detection with a Gaussian kernel of SVM. This is because its frequency domain characteristics provide better diversity gain. The confusion matrices of MFCC using the linear kernel, Gaussian kernel, and polynomial kernel of SVM are shown in Figure 11a-c, respectively. We also created a combined dataset of images and audio samples [45]. The dataset contains four classes labeled as Drones, Thunder, Birds, and Planes, and it is divided into two sections. The first is the training data, which includes 885 images and audio samples. The second is the testing data, which consists of 400 images and sounds. We combined the MFCC features of the audio samples with the features extracted from the images by AlexNet. The combined features are given to a multiclass SVM. We observed that the combined approach gives an accuracy of 98.5%. The accuracy of the multiclass SVM for this approach is shown in Figure 12, and its source code is available at [46].
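The text does not say how the two feature sets are combined. One common option, sketched here purely as an assumption, is to normalize each feature vector and concatenate them before passing the result to the classifier. The L2 normalization step and the function names are hypothetical and are not taken from the source.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def fuse_features(audio_features, image_features):
    """Concatenate normalized audio (e.g., MFCC) and image (e.g., AlexNet) features."""
    return l2_normalize(audio_features) + l2_normalize(image_features)
```

Normalizing first keeps the much longer image feature vector from dominating the short audio vector purely through its scale.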

Computational Time
The time taken to extract features of one image of size 227 × 227 × 3 through AlexNet is 1.16 s, while the time taken to extract features of one audio sample with MFCC is 0.3 s. The total time taken to train the model for the visual dataset was 16 min, while the total time taken to train the model for the audio dataset was 2 min. The time taken to train the model with the combined dataset was 30 min. The trained model classifies the objects within 2 s. Table 4 shows a comparison of our proposed method with existing drone detection methods. We also compared our work with existing methods to detect drones, i.e., those using conventional machine learning and those without machine learning, which have detection accuracies of 83% and 79%, respectively. We adopted k-fold validation criteria similar to those in recently published work, with k = 5 for the audio, image, and combined datasets. Figure 13 shows that the proposed method achieved almost 98.5% accuracy for drone detection. In the proposed technique, the challenges were low resolution, occlusion, and noisy audio. These challenges are not considered in previous approaches.
Figure 13. Comparison of proposed UAV detection with conventional scheme and with schemes without using machine learning.
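The k-fold protocol with k = 5 can be sketched as follows. This is a minimal illustration under stated assumptions: the shuffling, the fixed seed, and the function name are hypothetical, and the authors' MATLAB procedure may differ.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in the test set exactly once, and the reported accuracy is then typically the mean over the k folds.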

Conclusions
Malicious UAVs are a challenge for national agencies due to their ability to carry explosive materials. There is a need to detect and localize these UAVs promptly in order to disarm them, which requires a model with a high precision rate. In this paper, we compared the performance of various hand-crafted descriptors and different CNNs to detect and localize malicious UAVs using a relatively small dataset of images, and we also used MFCC and LPCC to detect malicious UAVs using an audio dataset. We used SVM as the classifier. Our goal was to achieve high accuracy, and the experimental results showed that the accuracy of AlexNet is 97.4% using the polynomial kernel of SVM. The accuracy of MFCC was 98.3% using the Gaussian kernel of SVM. We conclude that AlexNet performed accurately for the localization of malicious UAVs, while MFCC had a high precision rate in detecting UAVs based on sound, even in a noisy environment. The combined features of MFCC and AlexNet give an accuracy of 98.5%. The proposed model can readily be adopted and deployed by national security agencies to quickly and accurately detect and localize malicious UAVs. This model is cost-effective, as a relatively small dataset is used. In the future, we plan to include the RCNN technique and wireless communication in the proposed model.