Wind Turbine Multi-Fault Detection and Classification Based on SCADA Data

Abstract: Due to the increasing installation of wind turbines in remote locations, both onshore and offshore, advanced fault detection and classification strategies have become crucial to accomplish the required levels of reliability and availability. In this work, a data-driven multi-fault detection and classification strategy is developed without using specifically tailored condition monitoring devices, only by increasing the sampling frequency of the sensors already available (in all commercial wind turbines) in the Supervisory Control and Data Acquisition (SCADA) system. An advanced wind turbine benchmark is used, in which the turbine is subject to different types of actuator and sensor faults. The main challenges of wind turbine fault detection lie in the non-linearity of the system, unknown disturbances, and significant measurement noise at each sensor. First, the SCADA measurements are pre-processed by group scaling and feature transformation (from the original high-dimensional feature space to a new space of reduced dimensionality) based on multiway principal component analysis through sample-wise unfolding. Then, 10-fold cross-validated support vector machine (SVM) classification is applied. SVMs were chosen as a first option for fault detection, as they have proven their robustness for some particular faults but had never accomplished the detection and classification of all the faults considered in this work. To this end, the choice of the features as well as the selection of the data are of primary importance. Simulation results show that all the studied faults were detected and classified with an overall accuracy of 98.2%. Finally, it is noteworthy that the prediction speed allows this strategy to be deployed for online (real-time) condition monitoring in wind turbines.


Introduction
Wind energy offers many advantages, as it is an inexhaustible, clean energy source. This explains why it is one of the fastest-growing renewable sources in the effort to curb greenhouse gas emissions. Currently, research efforts are aimed at minimizing the overall cost of this energy. The tendency to use larger wind turbines (WTs) in harsh operating environments (e.g., offshore) implies that one of the main cost drivers is directly related to operation and maintenance actions. Thus, fault diagnosis (FD) is crucial for wind power to be cost-competitive, and even more so for offshore wind farms, where bad weather conditions (e.g., storms, high tides) can prevent any repair action for several weeks.
A variety of surveys on FD considering different WT components have recently been published. For example, in [1] a wide variety of WT fault locations are considered (rotor, gearbox, bearing, main shaft, hydraulic system, tower, generator, and sensors), as well as the signal processing methods most frequently used in the literature to deal with these types of faults. Reference [2] mainly surveys the most recent condition and performance monitoring approaches for WTs, with a primary focus on the blades, gearbox, generator, braking system, and rotor. However, the more recent trend in this type of literature review is to focus on a specific WT sub-assembly: the bearings and planetary gearbox [3,4], the generator and power converter [5,6], the blades [7,8], etc. Most of these methods, which focus on a specific part of the WT, require the choice of the most appropriate sensors, their advisable position in the sub-assembly, and the most convenient strategy to extract as much information as possible from the obtained data. These are highly localized strategies, and each one relies on the installation of (costly) extra sensors. However, it should be possible to retrofit a multi-fault condition monitoring package onto existing WTs without requiring additional sensors and wiring on the machines. In fact, there is a large amount of operational Supervisory Control and Data Acquisition (SCADA) data available (already collected at the WT controller) that can be used to diagnose the turbine condition. The remainder of this section addresses the state of the art in the FD of WT faults using SCADA data.
In recent years, there have been efforts to develop FD strategies by analyzing only SCADA data, and the use of machine learning techniques has been crucial in this area. For example, in [9], fault prediction and diagnosis for the WT generator is accomplished using real-world SCADA data from two wind power plants located in China, based on principal component analysis (PCA) and unsupervised clustering methods. In [10], an FD strategy for WT gearboxes based on artificial neural networks (ANNs) is proposed and tested on real-world SCADA data sets from a wind farm in Southern Italy. In [11], a strategy to diagnose WT faults from SCADA data using support vector machines (SVMs) is advised. Generally, the classification methods that deserve special mention are the SVM and the ANN, because of their ability to handle non-linear and noisy data. On one hand, the use of ANNs has drawbacks related to their training time and their dependence on careful fine-tuning of their parameters. In particular, in [12] the correct number of parameters and their corresponding values must be carefully selected to create a normal-behavior model based on an ANN. On the other hand, the SVM is simpler and has successfully proven its suitability for this type of problem. Thus, the SVM is the selected classifier in this paper.
Considerable research has been done on FD methods based on SVM classifiers that analyze only SCADA data. For example, different faults are studied in [13], but faults in the pitch actuators unfortunately could not be detected and, furthermore, the sampling period is unfeasible (0.01 s). Note that SCADA data are typically recorded at 10-min intervals to reduce transmitted data bandwidth and storage. In [14], an SVM could isolate some faults, except those with highly varying dynamics (including a pitch actuator fault), for which the use of a model-based observer was found necessary and, again, the sampling period was 0.01 s. Later references based on SVMs are mainly tailored to a particular type of fault. For example, in [15] an SVM-based method is proposed to classify the misalignment type of fault; generator faults are diagnosed in [9]; only actuator faults are considered in [16]; and generator and power feeder cable faults are diagnosed in [11]. In this paper, we widen the number and type of the studied faults with a unique strategy to cope with them all: three different pitch actuator faults (i.e., high air content in the oil, pump wear, and hydraulic leakage), a generator speed sensor fault (gain factor of 1.2), three different pitch sensor faults (stuck at 5 deg, stuck at 10 deg, and a gain factor of 1.2), and a torque actuator offset fault.
As noted previously, one of the major drawbacks of using SCADA data is the 10-min sampling period. This low frequency resolution negatively affects the diagnosis capabilities and may hide short-lived events. On the other hand, high-resolution (but feasible) SCADA data should allow the dynamic turbine behavior to be identified with higher fidelity and thus improve detection efficiency. Following [17,18], this work proposes a research framework that takes SCADA data at an additional high but feasible frequency (sampling period of 1 s) from the sensors; that is, the only requirement is to increase the sampling rate of the SCADA data from the already-available sensors. We thus propose a strategy to detect and classify (through SVM) multiple WT faults using only conventional SCADA data with this additional, feasible, higher-frequency sampling, and without the added cost of retrofitting additional sensors to the turbine. This paper is organized as follows. In Section 2, the WT benchmark model is introduced and the proposed FD strategy is described. The obtained results are presented and discussed in Section 3. Section 4 states the conclusions and future work.

Model Overview
As mentioned in the introduction, the utmost importance of WT fault diagnosis stimulated the proposal of a benchmark model, given in Reference [19], encompassing the prevailing faults encountered in practice. This early version of the model described a generic 4.8 MW three-blade horizontal-axis variable-speed WT, and it was issued by the company KK Wind Solutions [20], together with MathWorks, Inc. [21] and Aalborg University, to launch an international competition on fault detection and isolation in WTs. Several teams participated in the contest, and five of the solutions are compared in [22]. A second, enhanced model was presented in [23] that incorporated a more realistic WT simulated using the FAST software. FAST is an aeroelastic WT simulator developed by the U.S. National Renewable Energy Laboratory's (NREL) National Wind Technology Center (Golden, Colorado, USA) and is widely used by the research community. Several FAST models of WTs of varying sizes are available in the public domain, including NREL's 5 MW baseline turbine, which is the one used by the benchmark model. It is noteworthy that this simulator is able to consider the WT flexible modes that are present in practice, making fault detection more difficult compared with simpler models that neglect these modes (as in [19]). Thus, the second, enhanced model stated in [23] is the one utilized in this work.
The model proposes to simulate the sensors in the block-diagram environment Simulink by adding, to the actual variables provided by the FAST software, signals from band-limited white noise blocks parameterized by noise power. These random noise blocks represent measurement noise due either to electrical noise in the system or to the measuring principle. The different sensors provided in the model are shown in Table 1, with the measurement noise modeled as Gaussian white noise. Finally, a sampling period of 0.0125 s is used in the simulations. The most important features of the WT are detailed in Table 2. In this paper, we deal with the full-load region of operation, in the sense that the main objective of the proposed controller is that the electric power closely follows the rated power. A set of fault scenarios is defined in the WT model. These scenarios are primarily introduced in sensors and actuators. More precisely, the types of faults are gain factors, offsets, changes in the system dynamics, and stuck values, as shown in Table 3. These faults are inspired by research in both proprietary and public-domain sources [23]. As an extra reference, the interested reader can find a comprehensive description of these faults and their importance in [24]. The stochastic, full-field, turbulent-wind simulator TurbSim, developed by NREL, was used to generate the wind velocity fields applied in the simulations. It employs a stochastic model (as opposed to a physics-based model) to numerically simulate time series of three-component wind-speed vectors. It provides the ability to drive simulations of complex turbine designs with realistic but simulated inflow turbulence environments that combine many of the main fluid-dynamic features known to negatively affect turbine aeroelastic response and loading.
In this work, the generated wind data had the following features: a Kaimal turbulence model with the intensity set to 10%, a mean speed of 18.2 m/s simulated at hub height, a logarithmic wind profile, and a roughness length of 0.01 m. Each simulation was run with a different wind data set. More precisely, 260 different wind data sets, each of 600 s duration, were used.
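For reference, these wind-field settings correspond roughly to the following TurbSim input parameters (a hedged sketch: the parameter names follow the TurbSim input-file format, but the exact file layout depends on the TurbSim version):

```
TurbModel       "IECKAI"    - Kaimal spectral model
IECturbc        "10"        - turbulence intensity (10%)
URef            18.2        - mean wind speed at the reference (hub) height [m/s]
WindProfileType "LOG"       - logarithmic wind profile
Z0              0.01        - surface roughness length [m]
AnalysisTime    600         - duration of each generated wind field [s]
```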

Noise Handling
To deal with noise in a data set, two broad approaches can be considered: (i) the noise can be filtered out; or (ii) it can be left as it is. Obviously, either approach has pros and cons. When filtering noisy instances out of the data, there is a trade-off between the amount of information available for building the classifier and the amount of noise retained in the data set. Robust algorithms do not require preprocessing of the data (the data set is taken as is, with the noisy instances), but a classifier built from a noisy data set may be less predictive, and its representation may be less compact than it could have been if the data were not noisy. The second approach was used in this work. Since multiway PCA (a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components) was used for the pre-treatment of the data, the strategy can be considered robust. Besides, 10-fold cross-validation (a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set) was also used, and therefore the impact of any particular noisy subset of data was minimized.

Data Collection
A total of 260 simulations were conducted in this work: 100 with a healthy WT and 20 for each studied fault; that is (recall that there are 8 types of faults, see Table 3), a total of 160 simulations with a faulty WT. All simulations had a duration of 600 s. However, only the last 400 s of each simulation were used, to avoid the transient due to the initialization of the numerical simulations, as in [25]. Measurements were taken from the nine available SCADA sensors, see Table 1. Observe that the wind sequence was not used as a known measurement.
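As an illustration, the data-collection step above can be sketched as follows (a minimal sketch with synthetic arrays standing in for the simulator outputs; the function name and in-memory layout are assumptions, not the authors' code):

```python
import numpy as np

def collect_sensor_matrices(simulations, n_sensors=9, t_start=200, t_end=600):
    """Stack the last 400 s (at 1 sample/s) of each simulation into one
    260 x 400 matrix per sensor, discarding the initialization transient.

    `simulations` is assumed to be a list of arrays of shape
    (600, n_sensors): one row per second, one column per sensor.
    """
    kept = [sim[t_start:t_end, :] for sim in simulations]  # keep last 400 s
    data = np.stack(kept)                 # shape (n_sims, 400, n_sensors)
    # One (n_sims x 400) matrix per sensor
    return [data[:, :, k] for k in range(n_sensors)]

# Example with synthetic data standing in for the 260 runs
sims = [np.random.randn(600, 9) for _ in range(260)]
matrices = collect_sensor_matrices(sims)
print(len(matrices), matrices[0].shape)  # 9 (260, 400)
```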
Note that a time step of 0.0125 s was needed in the simulations due to the fixed-step-size time-integration scheme used by the FAST simulation software [26]. However, the data used for FD were down-sampled to a sampling period of 1 s. Traditional SCADA data have a 10-min averaging period. In this paper, following [18], it is proposed to make use of conventional SCADA data with a realistic additional higher-frequency sampling from the sensors (i.e., 1 sample/s). Some condition monitoring systems may justify the expense of the necessary additional equipment, but they can also exhibit high rates of false-positive alarms, and their diagnosis is dedicated to a single component or assembly rather than being system-wide [27]. In this work, a cost-effective multi-fault monitoring tool is obtained without using specifically tailored condition monitoring devices, only by increasing the sampling rate, to a feasible frequency, of the already-available sensors of the SCADA system.
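The down-sampling from the 0.0125 s integration step to the 1 s SCADA period can be sketched as follows (a hedged sketch using plain decimation, i.e., keeping every 80th point; averaging over each 1 s window would be an alternative):

```python
import numpy as np

def downsample(signal, dt_sim=0.0125, dt_scada=1.0):
    """Down-sample a FAST output signal from the 0.0125 s integration
    step to the proposed 1 s SCADA sampling period by decimation."""
    step = int(round(dt_scada / dt_sim))  # 80 simulation steps per SCADA sample
    return signal[::step]

raw = np.arange(48000) * 0.0125   # 600 s simulated at a 0.0125 s step
scada = downsample(raw)
print(scada.size)                 # 600 samples at 1 sample/s
```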

Data Reshape and Tensor Unfolding
The main objective was to minimize detection time while preserving overall accuracy, using all the available SCADA information. Recall that after the classification model is built, in order to diagnose a WT, a sample has to be given as an input to the model. The smaller the required sample, the smaller the detection time (as less time is needed to collect the data from the sensors). Here, detection time refers to the time from when the fault occurs to when it is detected. Assuming T_d is the detection time, the fault detection requirements given in the model [23] for the corresponding faults are described in terms of the sampling time of the control system, T_s, which in this case is equal to 1 s. In particular:
• Fault 8 (F8) is required to fulfill T_d < 3T_s. This is the most restrictive detection time, as this is the most severe fault. It is related to the torque actuator, and it is noteworthy that the torque rate limit for the NREL 5-MW WT is 15,000 Nm/s [26].
• Fault 1 (F1) is required to fulfill T_d < 8T_s. This fault has highly varying dynamics and is related to the pitch actuator (i.e., high air content in the oil). In this case, the blade-pitch rate limit for the NREL 5-MW WT is 8 deg/s, as this is speculated to be the blade-pitch rate limit of conventional 5-MW machines based on General Electric (GE) Wind's long blade test program [26].
• Faults 4 to 7 (F4, F5, F6, F7) are required to fulfill T_d < 10T_s. These faults are related to the generator speed sensor and the pitch sensors.
• Finally, Faults 2 and 3 (F2, F3) are only required to satisfy T_d < 100T_s, as these faults have very slow dynamics. They are related to the pitch actuator (i.e., pump wear and hydraulic leakage).
Using the three most restrictive requirements, it is proposed to organize the available data from the simulations in three different manners: samples of J = 3, J = 8, and J = 10 time steps. The goal of the remainder of this section is to show how the data were reshaped into samples of J time steps. As said before, the data came from 260 simulations of 400 s duration each (with a time step of 1 s), with nine sensors available. These data were initially stored, for each sensor, in a matrix

X^(k) ∈ M (260)×(400) (R), k = 1, 2, . . . , 9, (1)

where the super-index (k) refers to the different sensors; that is, there is one such matrix for each sensor. Each matrix has as many rows as simulations (260) and as many columns as time steps (400). Each row is then split into consecutive samples of J time steps, where J defines the number of seconds of each sample. The total number of samples is thus I = 260 · (400/J); for instance, I = 10,400 samples when J = 10. Figure 1 illustrates how the available data from the 260 long-run simulations (see Equation (1)) were reorganized into a third-order tensor (a multidimensional array with three indices) with short time samples of J time steps (see Equation (2)). The first J data points determine the first sample (represented by the light blue box in Figure 1). Immediately after, the next J data points determine the second sample (red box), and so on. After the last J data points of the first simulation (light green box), the first J data points of the second simulation (orange box) define the next sample, and so on. In general, let us consider that we have different sensors k = 1, 2, . . . , K stored at j = 1, 2, . . . , J time instants, with similar data generated for samples i = 1, 2, . . . , I. This results in the third-order tensor X (I × J × K) illustrated in Figure 1. The crux of the matter for fault detection by SVM is the definition of the features to be used for classification [13].
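The reshaping described above can be sketched as follows (a minimal sketch; the variable names are illustrative):

```python
import numpy as np

def build_tensor(sensor_matrices, J):
    """Reshape K per-sensor matrices of shape (260, 400) into the
    third-order tensor X of shape (I, J, K), with I = 260 * (400 // J)."""
    K = len(sensor_matrices)
    n_sims, n_steps = sensor_matrices[0].shape
    samples_per_sim = n_steps // J
    I = n_sims * samples_per_sim
    X = np.empty((I, J, K))
    for k, M in enumerate(sensor_matrices):
        # consecutive, non-overlapping windows of J time steps per simulation
        X[:, :, k] = M[:, :samples_per_sim * J].reshape(I, J)
    return X

matrices = [np.random.randn(260, 400) for _ in range(9)]
X = build_tensor(matrices, J=10)
print(X.shape)  # (10400, 10, 9)
```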
In this work, statistical analysis by multiway PCA is used for the pretreatment of the raw data. This is equivalent to implementing basic PCA on a large two-dimensional matrix assembled by unfolding the third-order tensor X, see Figure 1. There are three possible ways of unfolding this tensor, as suggested in [28]. In general, sample-wise unfolding facilitates the analysis of the variability among samples by summarizing the information related to the measured variables (sensors) and their variation over time. Thus, in this work, sample-wise unfolding is used (see Figure 2): the K slices of size I × J, one per sensor, are concatenated side by side into a large two-dimensional matrix

X ∈ M (I)×(JK) (R). (2)

In summary, multiway PCA of the third-order tensor X in Figure 1 is implemented by applying PCA to the sample-wise unfolded matrix X in Equation (2).
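Sample-wise unfolding then amounts to flattening the last two tensor modes into one (a sketch; the exact column ordering is a convention and is immaterial as long as it is applied consistently):

```python
import numpy as np

def samplewise_unfold(X):
    """Unfold the (I, J, K) tensor into the I x (J*K) matrix used for
    multiway PCA: each row holds the J time steps of all K sensors."""
    I, J, K = X.shape
    return X.reshape(I, J * K)

X = np.random.randn(10400, 10, 9)
Xu = samplewise_unfold(X)
print(Xu.shape)  # (10400, 90)
```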
Figure 2. Unfolding of the third-order tensor X into the matrix X.

Autoscaling or Standardization
There are two reasons for autoscaling the raw data: to deal with data that come from different sensors and with different magnitudes, and to simplify the computations for the multiway PCA decomposition.
Autoscaling is a relatively common pre-processing method that applies column-wise mean-centering followed by division of each column by the standard deviation of that column of matrix X. The result is that each column of the new autoscaled matrix, X̃, has a mean of zero and a standard deviation of one. This idea is used in this work to correct for the different sensor measurements, magnitudes, and units, under the assumption that the prevalent source of variance is the signal itself rather than noise. In particular, each element is computed as

x̃_ij = (x_ij − µ_j)/σ_j,

where µ_j and σ_j are the mean and the standard deviation, respectively, of all the measurements in column j. Accordingly, the elements of matrix X are normalized to create the new matrix X̃.
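Autoscaling can be sketched as follows (a minimal sketch; note that the column means and standard deviations must be stored so that new measurements can be scaled identically at diagnosis time):

```python
import numpy as np

def autoscale(X, eps=1e-12):
    """Column-wise mean-centering followed by scaling to unit standard
    deviation; returns the scaled matrix and the (mu, sigma) to reuse."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    return (X - mu) / (sigma + eps), mu, sigma

X = np.random.randn(10400, 90) * 5.0 + 3.0   # synthetic multi-sensor data
Xs, mu, sigma = autoscale(X)
print(np.allclose(Xs.mean(axis=0), 0), np.allclose(Xs.std(axis=0), 1))
```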

Multiway PCA
Recall that before using a classifier, the raw data coming from the sensors must be processed to obtain the most suitable features. In this work, after the autoscaling step, multiway PCA was selected, as the main objective was to keep as much information as possible with the minimum amount of data.
Since the input data are given in the mean-centered (autoscaled) matrix X̃, the empirical covariance matrix S can be computed as

S = (1/(I − 1)) X̃ᵀX̃.

Then, the singular value decomposition of the symmetric matrix S is computed as

S = P D Pᵀ,

where D is a diagonal matrix composed of the eigenvalues λ_1 ≥ λ_2 ≥ . . . ≥ λ_JK in decreasing order, and P ∈ M (JK)×(JK) (R) is an orthogonal matrix that contains the corresponding eigenvectors. Matrix P is usually called the loading matrix. As the main objective is to reduce the overall size of the data set, only a reduced number d < JK of principal components is used. In this work, the number of principal components was selected so as to keep 99.98% of the variance. The proportion of the variance directed along (explained by) the first d components is given by

(λ_1 + λ_2 + . . . + λ_d)/(λ_1 + λ_2 + . . . + λ_JK).

In the first case, when J = 3, from a total of J × K = 3 × 9 = 27 components, 99.98% of the variance is accomplished by the first d = 16 components. When J = 8, from a total of J × K = 8 × 9 = 72 components, the first d = 42 components are needed to keep 99.98% of the variance. Finally, when J = 10, from a total of J × K = 10 × 9 = 90 components, the demanded variance is accomplished by the first d = 52 components. Thus, the matrix P_d ∈ M (JK)×(d) (R), with only the first d columns of P, is used. Finally, the score matrix Y ∈ M (I)×(d) (R) (the transformed coordinates of the X̃ data in the new basis given by the first d principal components), whose columns will be used as features by the SVM strategy, is computed as

Y = X̃ P_d.
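The PCA-based feature transformation can be sketched as follows (a minimal sketch via the eigendecomposition of the covariance matrix; the 99.98% variance threshold follows the text, while the matrix sizes here are illustrative):

```python
import numpy as np

def pca_reduce(Xs, var_target=0.9998):
    """PCA via eigendecomposition of the empirical covariance matrix,
    keeping the smallest d that retains `var_target` of the variance."""
    I = Xs.shape[0]
    S = (Xs.T @ Xs) / (I - 1)            # empirical covariance matrix
    lam, P = np.linalg.eigh(S)           # eigenvalues in ascending order
    lam, P = lam[::-1], P[:, ::-1]       # reorder to decreasing
    explained = np.cumsum(lam) / np.sum(lam)
    d = int(np.searchsorted(explained, var_target) + 1)
    return Xs @ P[:, :d], d              # score matrix Y and component count

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 27))
Xs = X - X.mean(axis=0)                  # PCA assumes mean-centered input
Y, d = pca_reduce(Xs)
print(Y.shape, d)
```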

Support Vector Machines
Since their introduction by Vladimir Vapnik [29], SVMs have been successfully applied to a number of real-world problems such as face detection, object detection, and handwritten digit and character recognition in machine vision. SVMs exhibit a remarkable resistance to overfitting, and their training is performed by maximizing a convex functional, which means that there is a unique solution that can always be found in polynomial time [30]. In this section, basic information about SVM classification is given.
SVM classification is fundamentally a binary classification technique. Let us consider a training set of samples x_i and their corresponding labels y_i ∈ {−1, +1}. Figure 3 shows these data, where one class is labeled as (+) and the other one as (−). The main goal is to find the optimal hyperplane that defines the widest margin separating both classes, see Figure 3. Formally, the hyperplane is given by

h(x) = ωᵀx + b = 0,

where b is known as the bias term and ω is the weight vector. The optimal hyperplane can be characterized in an infinite number of different ways by scaling b and ω. As a matter of agreement, among all the possible descriptions of the hyperplane, the so-called canonical hyperplane is chosen, which satisfies

ωᵀx_sv+ + b = +1, ωᵀx_sv− + b = −1,

where x_sv+ and x_sv− symbolize the (+) and (−) training samples closest to the hyperplane, that is, the so-called support vectors, see Figure 3. The distance between a point x and the hyperplane h is given by

d(x, h) = |ωᵀx + b| / ‖ω‖.

In particular, for the canonical hyperplane, when x is a support vector, the numerator |ωᵀx + b| is equal to one and the distance to the support vector is 1/‖ω‖. The width of the margin is twice this distance (i.e., 2/‖ω‖). Thus, maximizing the margin is equivalent to minimizing ‖ω‖², which is equivalent to the following minimization problem:

min_(ω,b) (1/2)‖ω‖², subject to ωᵀx_i + b ≥ +1 if y_i = +1, and ωᵀx_i + b ≤ −1 if y_i = −1.

The two previous restrictions can be rewritten as one single inequality by taking the product h(x_i)y_i:

y_i(ωᵀx_i + b) ≥ 1.

This problem, finding the extrema of a function subject to constraints, can be solved using Lagrange multipliers, thus leading to

L(ω, b, α) = (1/2)‖ω‖² − Σ_i α_i [y_i(ωᵀx_i + b) − 1], (18)

where α_i ≥ 0 are the Lagrange multipliers. Setting the partial derivative with respect to ω equal to zero,

ω = Σ_i α_i y_i x_i. (19)

This equation states that the decision vector, ω, is a linear combination of the data samples.
Setting the partial derivative with respect to b equal to zero yields

Σ_i α_i y_i = 0. (20)

Finally, substitution of Equations (19) and (20) into Equation (18) leads to the dual formulation, which can be rewritten as the maximization problem

max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j, subject to α_i ≥ 0 and Σ_i α_i y_i = 0. (22)

If the data do not admit a separating hyperplane, the SVM can use a soft margin, meaning a hyperplane that separates many, although not all, data points. Consequently, the previous problem is generalized by means of slack variables, ε_i, and a penalty parameter, C. The general formulation for the linear kernel is in this case

min_(ω,b,ε) (1/2)‖ω‖² + C Σ_i ε_i, subject to y_i(ωᵀx_i + b) ≥ 1 − ε_i and ε_i ≥ 0. (23)

In this case, using Lagrange multipliers, the dual problem reads

max_α Σ_i α_i − (1/2) Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j, subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0. (24)

The final set of restrictions shows why the penalty parameter C is frequently called a box constraint: it keeps the admissible values of the Lagrange multipliers in a bounded region. In this work, the box constraint value was tuned to optimize the performance of the SVM, as shown in Section 3.
From Equations (22) and (24), it is obvious that the optimization depends only on dot products of pairs of samples. Additionally, the decision rule depends only on the dot product. Furthermore, the optimization problem is solved in a convex space (in contrast to neural networks), so it never gets trapped in a local extremum but always attains the global one. When the space is not linearly separable (the classification problem does not have a simple hyperplane as a useful separating criterion, even using a soft margin), a transformation to another space, φ(·), can be used. In fact, the transformation itself is not needed, only the dot product, the so-called kernel function:

K(x_i, x_j) = φ(x_i)ᵀφ(x_j).

The kernel function permits the computation of the inner product between the mapped vectors without explicitly calculating the mapping. This is advantageous, as it implies that if data are transformed into a higher-dimensional space (which can make the classes easier to separate), there is no need to compute the exact transformation of the data, but only the inner product of the data in that higher-dimensional space (which is computationally cheaper). This is known as the "kernel trick" [31]. Different kernels can be used, namely polynomial, hyperbolic tangent, or Gaussian radial basis function. On one hand, the feature space mapping of the Gaussian kernel has infinite dimensionality. On the other hand, the Gaussian kernel has a ready interpretation as a similarity measure, as its value decreases with distance and ranges between zero and one. For these reasons, the Gaussian kernel is used in this work, namely

K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²),

where γ is a free parameter, hereafter denoted as the kernel scale, related to the width of the Gaussian kernel. In this work, the kernel scale is computed as the inverse of the square root of the number of features. Note that the same features and the same kernel scale value for the Gaussian kernel are used to detect all faults.
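As a small numeric illustration of the Gaussian kernel and the kernel-scale choice (a sketch assuming the kernel scale enters the kernel directly as γ, with d = 52 features as in the J = 10 case):

```python
import numpy as np

def gaussian_kernel(x, z, gamma):
    """Gaussian RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

n_features = 52                      # d = 52 principal components (J = 10)
gamma = 1.0 / np.sqrt(n_features)    # kernel scale: 1 / sqrt(#features)

x = np.zeros(n_features)
print(gaussian_kernel(x, x, gamma))         # identical points -> similarity 1.0
z = x.copy(); z[0] = 100.0
print(gaussian_kernel(x, z, gamma) < 1e-6)  # distant points -> similarity near 0
```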
In other words, a single trained SVM is able to classify among all the studied classes (i.e., eight faulty classes and one healthy class). That is not the case in the previous literature related to WT fault detection (e.g., [13,32]), where the features and the variance were adjusted case-by-case to detect each different fault, thus leading to a much more complex strategy that needed as many different SVM classifiers as faults to detect. Regarding computational effort, there is a clear advantage related to feature computation, as only one set of features is needed in our proposed approach. As mentioned earlier, SVM classification is essentially a binary (i.e., two-class) classification technique, which has to be adapted to deal with multi-fault classification. The two most common methods to enable this adaptation are the one-vs.-one and one-vs.-all approaches. The one-vs.-all technique is the earliest and most common SVM multiclass approach [33]; it divides an N-class dataset into N two-class cases and chooses the class whose machine classifies the test point with the greatest margin. The one-vs.-one strategy constructs a machine for each pair of classes, resulting in N(N − 1)/2 machines; when applied to a test point, each classifier gives one vote to its winning class, and the point is labeled with the class having the most votes. The one-vs.-one strategy is more computationally demanding because the results of more SVM pairs need to be computed. In this work, the one-vs.-all approach is used.
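The one-vs.-all decision rule can be sketched as follows (a minimal sketch with hypothetical margins; in practice, each margin would come from one of the nine trained binary SVMs):

```python
import numpy as np

def one_vs_all_predict(margins):
    """Given the decision margins of N one-vs.-all binary SVMs for one
    test point, pick the class whose machine classifies it with the
    greatest margin."""
    return int(np.argmax(margins))

# Hypothetical margins of 9 binary machines (class 0 = healthy, 1-8 = faults)
margins = [-1.2, 0.3, -0.5, 2.1, -0.9, 0.0, -2.0, 0.7, -0.1]
print(one_vs_all_predict(margins))  # class 3 wins with the greatest margin
```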

k-Fold Cross-Validation
Normally, a data-based classifier is inferred from training data using a classifier learning algorithm. A prediction error, also known as the true error, is associated with each classifier. However, this prediction error is usually unknown, cannot be computed directly, and must be estimated from data. Different estimators of the prediction error can be considered, from the simple hold-out [34] and resubstitution [35] to the more sophisticated bootstrap [36]. One of these techniques, and possibly the most popular, is k-fold cross-validation [37]. In k-fold cross-validation, the data set is partitioned into k folds; the classifier is then learned using k − 1 folds, and the prediction error is computed by testing the classifier on the fold that was not used in the learning step. In the end, the estimate of the error is the numerical mean of the errors committed in each fold. In this paper, 10-fold cross-validation is used to estimate the performance of the proposed FD strategy.
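The k-fold estimate of the prediction error can be sketched as follows (a minimal sketch; the majority-class learner is a hypothetical stand-in for the SVM training routine):

```python
import numpy as np

def kfold_error(X, y, train_and_score, k=10, seed=0):
    """Estimate the prediction error as the mean of the test errors
    committed over the k folds (k-fold cross-validation)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        errors.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(errors))

# Hypothetical stand-in for the SVM: always predicts the majority class
def majority_learner(Xtr, ytr, Xte, yte):
    pred = np.bincount(ytr).argmax()
    return float(np.mean(yte != pred))

X = np.zeros((100, 2))
y = np.array([0] * 90 + [1] * 10)
print(kfold_error(X, y, majority_learner))  # close to 0.1, the minority fraction
```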

Results, Analysis, and Discussion
This section presents the results of applying the proposed multi-fault diagnosis strategy introduced in Section 2 to the dataset under study.
It is worth mentioning that, as seen in Section 1, there is an increasing number of studies in the fields of structural health monitoring and condition monitoring based on machine learning approaches. In this work, the contribution resides essentially in how the data are collected, arranged, and pre-processed, and in how the SVM is applied. For instance, as noted in [38,39], there are six possible ways of arranging a third-order tensor, and each choice leads to a different overall performance of the applied strategy. The same applies to the SVM: as pointed out in Section 2.7, it depends on several parameters and kernels. In this work, the same features and the same kernel scale for the Gaussian kernel are used to detect all the faults, therefore leading to a single trained classifier.
First, a flowchart of the proposed approach and how it is applied is given in Figure 4. When a WT has to be diagnosed, the data coming from the WT sensors are scaled and then, using the already-computed PCA projection, the features are computed. The already-trained SVM then classifies the data. The box constraint value is tuned to optimize the SVM performance. Making this value large increases the weight of misclassification, see Equation (23), which leads to a stricter separation. However, increasing its value also leads to longer training times. The value C = 50 was used in this work because, as shown in Figure 5, smaller values degraded the overall accuracy, while larger values obtained similar results (with longer training times). Table 4 summarizes the results obtained with the proposed strategy. It presents not only the overall accuracy but also the training time and prediction speed, as both parameters are critical in real applications. Notice that in all cases, the prediction speed allows this strategy to be deployed for online (real-time) condition monitoring in WTs. In addition, a comprehensive decomposition of the error between the true classes and the predicted classes is shown by means of the so-called confusion matrices, see Figures 6-8 (an empty square means 0%). In these matrices, each row represents the instances of a true class, while each column represents the instances of a predicted class (according to the classifier). In particular, the first row (and first column) is labeled 0 and corresponds to the healthy case. The next labels (for rows and columns) correspond to each fault (from Fault 1 to Fault 8). From the confusion matrices and Table 4, the following issues can be highlighted.
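The per-class TPR/FNR figures and the overall accuracy discussed below are obtained from the confusion matrices as follows (a sketch with a toy 3-class matrix; the paper's matrices have nine classes):

```python
import numpy as np

def per_class_rates(conf):
    """Per-class true positive rate (TPR) and false negative rate (FNR)
    from a confusion matrix whose rows are the true classes."""
    row_sums = conf.sum(axis=1)
    tpr = np.diag(conf) / row_sums
    return tpr, 1.0 - tpr

# Toy 3-class confusion matrix (counts); row 0 = healthy, rows 1-2 = faults
conf = np.array([[99, 1, 0],
                 [2, 95, 3],
                 [0, 5, 95]])
tpr, fnr = per_class_rates(conf)
acc = np.trace(conf) / conf.sum()            # overall accuracy
print(tpr.round(2), round(acc, 3))           # [0.99 0.95 0.95] 0.963
```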
When the detection time was approximately 3 s (J = 3), the overall accuracy was 95.5%. In this case, the healthy class had a true positive rate (TPR, the percentage of correctly classified instances) higher than 99% and a false negative rate (FNR, the percentage of incorrectly classified instances) smaller than 1%. Fault 1 (the most difficult to classify in previous references and related to the pitch actuator fault with high dynamics) had a TPR of 77% and an FNR of 23%. This FNR was mainly composed of 17% missed faults and 6% confusion with Fault 2, which is also a fault located in the pitch actuator. Fault 6, related to a stuck value (10 deg) of the pitch sensor measurement, was misclassified as healthy 5% of the time, confused with the same type of fault but with a 5 deg stuck value (Fault 5) 3% of the time, and misclassified as Fault 2 (pitch actuator fault) 2% of the time. The other faults had a TPR higher than 92%. Note that Fault 8, the most severe one and related to the torque actuator, achieved a 100% TPR even with this most restrictive detection time.
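The TPR and FNR values discussed here can be read directly off a confusion matrix. A minimal sketch, using hypothetical counts whose second row echoes the Fault 1 figures above (rows are true classes, columns predicted classes):

```python
import numpy as np

# Hypothetical 3-class confusion matrix (illustrative counts only);
# class 0 plays the role of "healthy".
cm = np.array([[99,  1,  0],
               [17, 77,  6],
               [ 5,  3, 92]])

row_totals = cm.sum(axis=1)
# TPR of each class: correctly classified instances over the row total.
tpr = np.diag(cm) / row_totals
# FNR: all off-diagonal mass in the row, i.e. 1 - TPR.
fnr = 1.0 - tpr

print(np.round(tpr, 2))  # [0.99 0.77 0.92]
```

Row-wise normalization of the counts in Figures 6-8 yields the percentages quoted in the text.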
When the detection time was approximately 8 s (J = 8), the overall accuracy was 98%. As in the previous case, the healthy class had a TPR higher than 99%. Fault 1 increased its TPR to 79% (with 16% missed faults and 5% confusion with Fault 2), and all the other classes increased their TPR to values higher than 98%. Note that Fault 4, related to the generator speed sensor, reached a 100% TPR. The generator speed measurement is used as input to the torque and pitch controllers, so being able to correctly diagnose this type of fault is extremely important. As in the previous case, Fault 8 kept its 100% TPR.
Finally, when J = 10, the overall accuracy was 98.2%. Fault 1 improved its TPR to 80%, and all misclassifications were 1% or lower, except for Fault 1, which was misclassified as healthy 15% of the time and as Fault 2 (recall that this is also a pitch actuator fault) 5% of the time. Observe that Faults 4 and 8 maintained a remarkable 100% TPR.

Access to real SCADA datasets is often proprietary, and therefore they are not accessible to the scientific community. To overcome this difficulty, in this work simulated data were obtained with one of the most widely accepted WT simulators in the scientific community (FAST). The drawback of using simulated data is that there is no possibility to evaluate the proposed method on a full test set representing the true distribution of real-world data, where class imbalance is a challenging problem [40]. However, there are several references (e.g., [11]) where this problem is addressed in the training stage by under/oversampling the training data.
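As an illustration of the oversampling mentioned above, the following sketch (with hypothetical class sizes and arbitrary features; not taken from the cited references) randomly resamples each minority class with replacement until all classes match the majority count:

```python
import numpy as np

# Toy imbalanced training set: 90 healthy samples (class 0) and
# 10 faulty ones (class 1); features are arbitrary.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = np.array([0] * 90 + [1] * 10)

# Random oversampling: draw indices per class, with replacement,
# until every class reaches the majority-class count.
classes, counts = np.unique(y, return_counts=True)
target = counts.max()
idx = np.concatenate([
    rng.choice(np.flatnonzero(y == c), size=target, replace=True)
    for c in classes
])
X_bal, y_bal = X[idx], y[idx]

print(np.bincount(y_bal))  # [90 90]
```

Such balancing is applied to the training split only, so that the test set still reflects the original class distribution.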

Conclusions
Because of its standard low sampling rate, the potential of SCADA data for condition monitoring remains largely unexplored. In this work, a promising strategy to detect and classify multiple WT faults was presented using only conventional SCADA data with an additional, but feasible, high-frequency sampling of the sensors (1 sample/s). That is, the FD strategy does not require the supplementary installation of costly purpose-built data sensing equipment in wind power plants.
Note that in this work, in contrast to the previous literature, the same features and the same variance for the Gaussian kernel were used to detect all the faults detailed in the benchmark, thus leading to a single trained classifier capable of coping with all the studied faults by computing only one set of features from the data to be diagnosed. Consequently, the proposed strategy outperformed other approaches.
As future work, other faults will be included involving misalignment, ice accumulation, and tower damage. Finally, we will study the contribution of an effective predictive maintenance strategy based on this same principle in order to further optimize operation and maintenance in WTs.