Improved Appliance Classification in Non-Intrusive Load Monitoring Using Weighted Recurrence Graph and Convolutional Neural Networks

Appliance recognition is one of the vital sub-tasks of NILM, in which a machine learning classifier is used to detect and recognize active appliances from power measurements. The performance of the appliance classifier highly depends on the signal features used to characterize the loads. Recently, different appliance features derived from the voltage–current (V–I) waveforms have been extensively used to describe appliances. However, the performance of V–I-based approaches is still unsatisfactory, as these features are not distinctive enough to recognize devices that fall into the same category. Instead, we propose an appliance recognition method utilizing the recurrence graph (RG) technique and convolutional neural networks (CNNs). We introduce weighted recurrence graph (WRG) generation that, given one-cycle current and voltage, produces an image-like representation with more values than the binary output created by RG. Experimental results on three different sub-metered datasets show that the proposed WRG-based image representation provides superior feature representation and, therefore, improves classification performance compared to V–I-based features.


Introduction
The introduction of smart meters as part of smart grids will produce large quantities of energy consumption data at very fast rates. Analysis of these data streams offers a lot of exciting opportunities for understanding energy consumption patterns. Understanding the consumption pattern of individual loads at consumer premises plays an essential role in the design of customized energy efficiency and energy demand management strategies [1]. It is also useful for improving energy consumption awareness among households, which is likely to stimulate energy-saving behavior and engage energy users towards sustainable energy consumption [2,3]. Non-intrusive Load Monitoring (NILM), also known as energy disaggregation, is a useful technique for analyzing energy consumption data monitored from a single-point source such as a smart meter. This is because the method can be easily integrated with buildings. The operation of NILM relies on signal processing and machine learning techniques to extract individual load profiles from the aggregate signal [4,5]. Considerable research attention has lately been devoted to deep neural networks (DNN) to solve energy disaggregation problems [6][7][8][9][10][11][12]. The presented approaches can be classified into event-based and non-event-based methods [13]. The former approaches seek to disaggregate appliances through detecting and classifying their transitions in the aggregated signal [9,10,12,14]. In contrast, the non-event-based methods attempt to disaggregate every sample of the aggregate signal without explicit event detection. Unlike binary image features, the RG representation captures structural patterns in the signal; as a consequence, the RG feature representation also depends on the magnitude of the current and voltage signals.
The use of RGs for the characterization of appliance features was introduced in [26], and later Rajabi and Estebsari [27] applied RGs to estimating the power consumption of individual loads. However, similar to other RG methods for time-series classification, the proposed RG uses a compressed distance function that represents all recurrences in the form of a binary matrix. Binarizing the recurrence plot through thresholding is likely to cause information loss and degrade classification performance. To avoid the information loss caused by binarization, we propose the generation of an RG that gives a few more values instead of a binary output. To classify the generated RG, we follow the approach used in [10] and apply a CNN for this task. Experimental evaluation on three sub-metered datasets shows that the proposed WRG feature representation offers superior performance when compared to the V-I-based image feature. The source code used in our experiments can be found on a GitHub repository (https://github.com/sambaiga/WRG-NILM).
The main contributions of this paper are as follows:

1. We present a recurrence graph feature representation (WRG) that gives a few more values instead of the binary output, which improves the robustness of appliance recognition. The WRG representation of the activation current and voltage not only enhances appliance classification performance but also guarantees the uniqueness of the appliance feature, which is highly desirable for generalization purposes.

2. We present a novel pre-processing procedure for extracting the steady-state cycle activation current from current and voltage measurements. The pre-processing method ensures that the selected activation current is not a transient signal.

3. We conduct evaluations on three public sub-metered datasets, comparing with the V-I image, which is the most direct competitor. We also conduct an empirical investigation of how different parameters of the proposed WRG influence classification performance.

Proposed Methods
Recognizing appliances from the aggregate power signal is a vital sub-task of NILM. The goal of the appliance classifier in NILM is to identify the active appliances k ∈ {1, 2, . . . , M} from the aggregate signal x_t, where M indicates the number of appliances. This is a multi-class classification problem. The aggregate signal x_t at any time t is assumed to be x_t = ∑_{i=1}^{M} s_t^i · y_t^i + σ_t, where y_t^i is the signal of the i-th appliance, s_t^i ∈ {0, 1} is its state, and σ_t represents both any contribution from appliances not accounted for and measurement noise. The proposed approach is summarized in Figure 1 and consists of the following main building blocks: feature extraction and pre-processing, WRG generation, and the CNN classifier.

Figure 1. Block diagram of the proposed approach. It consists of the feature extraction and pre-processing, WRG generation, and CNN classifier blocks.

Feature Extraction and Pre-Processing
Appliance features used for appliance recognition can be categorized into snapshot-form or delta-form features [22]. Snapshot form refers to appliance features extracted from aggregate power measurements as the result of more than one appliance being active. Delta-form, on the other hand, expresses load characteristics in brief windows of time containing only a single event. In this work, we consider delta-form appliance features and define an activation signal as a one-cycle steady-state signal extracted from the current or voltage waveform in a brief time after a state transition.
To obtain an activation signal from the monitored power signals, we measure N_s = 20 cycles of v and i before and after a state transition of an appliance has been detected, as shown in Figure 2a,c. The N_s cycles correspond to steady-state behavior and are equivalent to T_s × N_s samples, where T_s = f_s/f, f_s is the sampling frequency, and f is the mains frequency. Since in this work we only consider sub-metered data, the activation current i and voltage v are obtained from the current-voltage signals as follows: i = i^(a) and v = v^(a) if the event is caused by the activation of an appliance, and i = i^(b) and v = v^(b) if the event is caused by the de-activation of an appliance, as illustrated in Figure 2b,d. To remove noise and ensure that the obtained activation signal is a complete cycle of size T_s, we propose a pre-processing procedure summarized in Algorithm 1. This is an empirical method based on the engineering knowledge that a steady-state activation current should have at least two zero-crossings.
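As an illustration, the zero-crossing check described above can be sketched in Python as follows. This is a minimal sketch; the function names and the exact cycle-selection loop are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def zero_crossings(x):
    """Indices where the signal changes sign (candidate cycle boundaries)."""
    s = np.signbit(np.asarray(x, dtype=float))
    return np.where(s[:-1] != s[1:])[0]

def extract_one_cycle(v, i, T_s):
    """Return the first full voltage cycle (of length T_s samples) whose
    current segment has at least two zero crossings, i.e. the steady-state
    check described in the text."""
    for start in zero_crossings(v):
        cycle_v, cycle_i = v[start:start + T_s], i[start:start + T_s]
        if len(cycle_i) == T_s and len(zero_crossings(cycle_i)) >= 2:
            return cycle_v, cycle_i
    return None  # no valid steady-state cycle found
```

For a 50 Hz mains signal sampled at f_s = 1 kHz, T_s = f_s/f = 20 samples per cycle, and the function returns one complete voltage/current cycle.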

Algorithm 1: Feature pre-processing

for each candidate cycle j do
    Get voltage zero crossings: zc_v;
    Extract the j-th current cycle i^j of length T_s^j between consecutive voltage zero crossings;
    Get current zero crossings: zc_i;
    if T_s^j = T_s and zc_i ≥ 2 and len(i^j) = T_s then
        break;
    end
end

Once the activation waveforms have been extracted, piece-wise aggregate approximation (PAA) is used to reduce the dimensionality of the signal from T_s to a predefined size w with minimal information loss. The PAA algorithm reduces the dimensionality of i and v from T_s to the embedding size w before generating the D_{w×w} distance matrix. It works by dividing the data into n segments of equal size; the approximation is then a vector of the median values of the data readings per segment. The embedding size w is a hyper-parameter that needs to be selected in advance. Empirically, it was found that the choice of w does not significantly influence the classification performance. However, large values of w impact the learning speed, and small values of w will most likely lead to larger information loss, as depicted in Figure 3. Note that for Figure 3b the embedding size of w = 50 does not change the shape of the input signal, whereas in Figure 3c an embedding size of w = 10 deforms the input shape.
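A minimal PAA sketch consistent with the description above (segment the signal and keep the per-segment median; the function name is illustrative):

```python
import numpy as np

def paa(signal, w):
    """Piecewise aggregate approximation: split the signal into w roughly
    equal segments and keep the median of each segment, reducing the
    dimensionality from len(signal) to w."""
    segments = np.array_split(np.asarray(signal, dtype=float), w)
    return np.array([np.median(seg) for seg in segments])
```

For example, `paa(np.arange(10), 5)` collapses ten samples into five segment medians.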

Weighted Recurrence Graph (WRG)
The RG feature representation uses a distance similarity matrix D_{w×w} to represent and visualize structural patterns in the signal. The distance similarity matrix provides a relationship metric between each pair of elements in the time-series signal [28]. It has been recommended as a pre-processing step for many machine learning approaches such as K-means clustering and K-nearest neighbor algorithms. Consider T_s points of an activation signal x = {x_1, x_2, . . . , x_{T_s}}. The distance similarity between x_k and x_j is given as d_{k,j} = ||x_k − x_j||_2, where ||·||_2 denotes the Euclidean norm, and the distance similarity matrix is D_{w×w} = [d_{k,j}].

For a classification problem, a compressed distance similarity matrix that represents all recurrences in the form of a binary matrix RG_{w×w} = [r_{k,j}] is usually used. The r_{k,j} function is defined as follows:

r_{k,j} = 1 if d_{k,j} ≤ ε, and r_{k,j} = 0 otherwise, (3)

where ε ∈ (0, 1] is the recurrence threshold. Equation (3) implies that a dot will be drawn on a w × w grid if two values within a signal x = {x_1, x_2, . . . , x_w} are closer than ε. This can be interpreted as an unweighted graph G with vertex set V and edge set E, where RG_{w×w} = [r_{k,j}] is the adjacency matrix that depicts the graph between the data points. It should be noted that the structural representation of RG provides the similarity between two adjacent points in the time series, which is necessary for classification [29]. However, binarizing the distance matrix D_{w×w} through thresholding can lead to information loss and therefore degrade classification performance. Thus, in this work, we propose the generation of a WRG_{w×w} that goes beyond the traditional binary output. More precisely, we introduce the parameter δ ≥ 1 that enforces r_{k,j} to take values between 0 and δ such that:

r_{k,j} = δ if ⌊d_{k,j} · λ⌋ > δ, and r_{k,j} = ⌊d_{k,j} · λ⌋ otherwise, (4)

where, for computational stability, we apply the parametrization λ = 1/ε. The matrix WRG_{w×w} can be interpreted as a weighted graph G = (V, E) where each value represents an edge weight.
Since d_{k,j} > 0, Equation (4) reduces to RG for δ ≤ 1. The recurrence threshold ε and δ are hyper-parameters that need to be optimized. Figure 4 illustrates the process of generating the WRG and RG from the distance similarity matrix D. We see that the RG image representation in Figure 4c carries more limited information compared to the WRG image representation in Figure 4b.
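Under one possible reading of the weighting rule above (scale pairwise distances by λ = 1/ε, floor, and cap at δ), WRG generation can be sketched as follows. The function name and default values are illustrative, not taken from the authors' code:

```python
import numpy as np

def wrg(x, eps=0.1, delta=10):
    """Weighted recurrence graph sketch for a 1-D activation signal x:
    pairwise Euclidean distances scaled by lambda = 1/eps, floored,
    and capped at delta (values in {0, 1, ..., delta})."""
    x = np.asarray(x, dtype=float)
    d = np.abs(x[:, None] - x[None, :])   # pairwise distances d_{k,j}
    lam = 1.0 / eps                       # lambda = 1 / epsilon
    return np.minimum(np.floor(d * lam), delta)
```

Setting `delta=1` collapses the output back to a binary matrix, matching the observation that the WRG reduces to the RG for δ ≤ 1.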

Classifier and Training Procedure
Once the appliance features are extracted, a generic machine learning classifier can be used to learn the patterns from labeled data. We consider a convolutional neural network (CNN) for this task. CNNs are specific kinds of neural networks for processing visual data. They leverage local connectivity and equivariant representations that make CNNs useful for computer vision tasks. Each hidden unit of a CNN layer is connected only to a subregion of the input image. This allows CNNs to exploit spatially local correlations between neurons of adjacent layers while reducing the number of parameters. Thus, at each CNN layer, the classifier learns several small filters (feature maps). These filters are then applied across the entire layer, allowing features to be detected regardless of their position in the image.
The CNN network applied in this work consists of three 2D CNN stages with 16, 32, and 64 feature maps, respectively, each using a 3 × 3 filter, a 2 × 1 stride, and padding of 1. Each CNN layer is followed by a batch normalization (BN) block and a Leaky ReLU activation function. The final stage consists of one flatten layer and two fully connected (FC) layers. The FC layers have hidden sizes of 1024 and K, respectively, where the number of available appliances determines the number of classes K. The final predicted class is obtained by applying a softmax activation function. To learn the model parameters, standard back-propagation is used to optimize the cross-entropy objective function defined in Equation (5):

L(θ) = −(1/N) ∑_{n=1}^{N} ∑_{k=1}^{K} y_{n,k} log ŷ_{n,k} (5)

where y_{n,k} is the one-hot encoded target and ŷ_{n,k} the predicted probability for sample n and class k. Specifically, mini-batch Stochastic Gradient Descent (SGD) with a momentum of 0.9, a learning rate of 0.001, and a batch size of 16 is used to train the model for 100 iterations. To avoid over-fitting, early stopping with patience is used, where the training is terminated once the validation performance does not improve for 20 iterations.
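The cross-entropy objective of Equation (5) can be computed directly from one-hot targets and predicted probabilities; a minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def cross_entropy(y_true, y_prob):
    """Cross-entropy objective of Equation (5): the average over N samples
    of -sum_k y_{n,k} * log(yhat_{n,k}), with y one-hot encoded and
    y_prob the softmax output."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())
```

For a single sample with a uniform two-class prediction, the loss is log 2 ≈ 0.693; confident correct predictions drive it toward 0.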

Datasets
The proposed method is tested on three publicly accessible datasets: the Plug Load Appliance Identification Dataset (PLAID v1) [30], the Worldwide Household and Industry Transient Energy Data Set (WHITED v1.1) [31], and the Controlled On/Off Loads Library (COOLL) dataset [32]. PLAID v1 contains 1074 instances of current and voltage measurements sampled at 30 kHz from 11 different appliance types in Pittsburgh, Pennsylvania, USA. Each appliance type is represented by various samples of different makes/models. WHITED consists of sub-metered current and voltage measurements recorded in households and small industry settings at a 44.1 kHz sampling frequency. In this work, we use WHITED v1.1, which comprises 11259 instances of 110 different appliances that can be grouped into 47 types (classes).
The COOLL dataset, on the other hand, consists of 840 current and voltage measurements for 42 controllable appliances sampled at 100 kHz. Unlike the PLAID and WHITED datasets, the COOLL dataset provides twenty turn-on transient signals for each appliance, each corresponding to a different turn-on instant (with a controlled delay relative to the zero-crossing of the mains voltage). The appliances are of 12 different types with a certain number of examples each [32].

Evaluation Metrics
Several performance metrics have been proposed in the NILM literature [13]. This work uses the macro-averaged F_1 score, the zero-loss score (ZL), and the Matthews correlation coefficient (MCC), as these are known for being less sensitive to class imbalance [33]. We also use the confusion matrix, which shows the correct predictions (the diagonal) and provides a clear view of which appliances are confused with each other.
The macro-averaged F_1 score is F_1macro = (1/M) ∑_{i=1}^{M} F_1^i, where M is the number of appliances and F_1^i is the harmonic mean of the precision and recall of the i-th appliance.
The zero-loss score gives the number of mis-classifications, with the best performance being 0, and is defined as ZL = ∑_i I(y_i ≠ ŷ_i), where I(·) is the indicator function. The Matthews correlation coefficient (MCC) provides a balanced measure of the quality of a classification algorithm; it takes into account true and false positives and negatives. Given a confusion matrix C for M different classes, the MCC can be defined as

MCC = (c · s − ∑_k p_k t_k) / sqrt((s² − ∑_k p_k²)(s² − ∑_k t_k²))

where c = ∑_k C_{kk} is the number of correctly predicted samples, s = ∑_k ∑_l C_{kl} is the total number of samples, p_k = ∑_l C_{lk} is the number of times class k was predicted, and t_k = ∑_l C_{kl} is the number of times class k truly occurred. The maximum MCC score is +1 and the minimum value can be between −1 and 0; a score of +1 represents a perfect prediction.
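The multiclass MCC can be computed from the confusion matrix using the generalized formula; a minimal sketch (function name illustrative, rows taken as true classes and columns as predicted classes):

```python
import numpy as np

def mcc_from_confusion(C):
    """Multiclass Matthews correlation coefficient from confusion matrix C
    (rows: true classes, cols: predicted classes)."""
    C = np.asarray(C, dtype=float)
    t = C.sum(axis=1)          # t_k: true occurrences per class
    p = C.sum(axis=0)          # p_k: predicted occurrences per class
    c = np.trace(C)            # correctly predicted samples
    s = C.sum()                # total samples
    num = c * s - (p * t).sum()
    den = np.sqrt((s ** 2 - (p ** 2).sum()) * (s ** 2 - (t ** 2).sum()))
    return num / den if den else 0.0
```

A diagonal confusion matrix (perfect prediction) yields MCC = 1, while a uniformly confused matrix yields MCC = 0.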

Experimental Description
We are interested in answering the following two research questions: (1) how to pick a suitable set of WRG hyper-parameters? And (2) how do the graph features extracted by WRG compare against the V-I-based approach with respect to classification performance? We investigate the first question by altering the WRG hyper-parameters w, δ, and λ = 1/ε on the PLAID and COOLL sub-metered datasets. We first investigate how the λ and δ parameters influence the performance measure when the embedding size is set to 50 (w = 50). We then analyze the impact of the embedding size w on classification performance for given values of λ and δ. We further compare the general performance between the binary RG and the WRG.
In the second experiment, we establish a baseline in which the V-I binary image is used as the appliance feature. The baseline is then compared with the WRG feature representation. The V-I image of size w × w is obtained by first resizing the activation current i and voltage v to the corresponding scales d_i and d_v, respectively, where d_i = max(|min(i)|, max(i)) and d_v = max(|min(v)|, max(v)). The scaled current and voltage are then converted into a w × w image by meshing the V-I trajectory and assigning each cell a binary value that denotes whether it is traversed by the trajectory, as described in De Baets et al. [10]. Figure 5 illustrates the generation of the V-I image from the microwave activation current and voltage in the PLAID dataset. The objective of this experiment is to compare the generalization performance of the proposed approach with that of the V-I image across buildings. To achieve this, we employ leave-one-house-out cross-validation as presented in [21]. A classifier is trained on data from N_b − 1 houses of a dataset and then tested on the unseen house from the same dataset. However, unlike PLAID, the WHITED and COOLL datasets do not have household annotations. Therefore, we adopt the method used in [10], which consists of assigning appliances randomly to artificial homes. The total number of houses is set to 9 for the WHITED dataset and 8 for the COOLL dataset, corresponding to the minimum number of appliance types in each dataset.
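The baseline V-I binary image described above can be sketched as follows, under stated assumptions: the grid mapping here simply rounds each scaled sample to the nearest cell rather than rasterizing full trajectory segments, and the function name is illustrative:

```python
import numpy as np

def vi_binary_image(i, v, w=16):
    """Baseline V-I binary image sketch (after De Baets et al.): scale the
    one-cycle current i and voltage v by d_i and d_v, map each sample onto
    a w x w grid, and mark the cells visited by the V-I trajectory."""
    i, v = np.asarray(i, dtype=float), np.asarray(v, dtype=float)
    d_i = max(abs(i.min()), i.max())   # current scale
    d_v = max(abs(v.min()), v.max())   # voltage scale
    # map [-d, d] onto grid indices [0, w-1]
    rows = np.clip(((i / d_i + 1) / 2 * (w - 1)).round().astype(int), 0, w - 1)
    cols = np.clip(((v / d_v + 1) / 2 * (w - 1)).round().astype(int), 0, w - 1)
    img = np.zeros((w, w))
    img[rows, cols] = 1.0
    return img
```

In contrast to the WRG, every traversed cell receives the same value 1, which is precisely the binarization the proposed representation avoids.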
The parameters used in this experiment are presented in Table 1.

Results and Discussion
This section presents and discusses the results obtained with respect to the two research objectives of this paper.

Objective 1: WRG Analysis
In the first objective, we investigate how the WRG parameters w, λ, and δ influence the performance measure. Figure 6a,b shows the relationship between λ and the MCC score for different values of δ on the COOLL and PLAID datasets.
From Figure 6b, we observe that in PLAID a maximum score of 0.981 MCC is reached when λ = 10^1 and δ = 20. It can also be observed from Figure 6a that in COOLL a maximum score of 1.0 MCC is reached for λ = 10^3 and δ = 50. We further see that the binary RG (when δ = 1) achieves a maximum score of 0.97 MCC (when λ = 10) in COOLL, and 0.90 MCC (when λ = 5) in PLAID. However, the performance drops rapidly as λ increases and eventually becomes zero in PLAID. We also observe that the influence of δ on the performance score depends on the selected value of λ. For larger values of λ, the performance increases as δ increases. In contrast, for small values of λ, δ does not significantly impact the performance score. We also investigate the influence of the embedding size w on the classification performance, as depicted in Figure 6c. We see that a higher value of w does not significantly improve classification performance. This result is in line with the one obtained in [9] for the V-I image, which concluded that, once a particular resolution is reached, adding information by increasing the embedding size does not improve performance. Nevertheless, a significantly high value of w impacts the learning speed, as shown in Figure 6d. Also, as discussed in the feature extraction and pre-processing subsection, a low value of w might lead to information loss, thereby degrading the performance score. Finally, we compare the general performance between the binary RG and WRG, as tabulated in Table 2. We see that, compared to binary RG, the proposed WRG improves the classification performance from 98.96% to 99.86% F_1 score for the COOLL dataset and from 88.18% to 94.35% F_1 score for the PLAID dataset.

Objective 2: Comparison against V-I Image Method
In this experiment, we compare the generalization performance of the WRG and V-I image representations across multiple buildings. We first present and discuss the overall performance on the three sub-metered datasets, as listed in Table 3. From the results presented in Table 3, we see that WRG outperforms the V-I image on all three datasets, with 0.92, 8.5, and 4.5 percentage-point increases in F_1 macro for the COOLL, WHITED, and PLAID datasets, respectively.
For benchmarking purposes, the results presented in this paper are compared with the ones presented in [10] for the WHITED and PLAID datasets. We see an increase in F_1 macro score from 77% to 88.53% on PLAID and from 75.46% to 97.23% on the WHITED dataset. Ultimately, these results demonstrate the effectiveness of the WRG feature in characterizing appliances across multiple buildings. We also see improved performance for the presented V-I-based CNN; this increase in F_1 macro score is attributed to the improved pre-processing procedure and the developed CNN model architecture. We also present and discuss the per-appliance performance on the three datasets. Figure 7 shows the F_1 macro (%) per appliance for the COOLL dataset. It can be observed in Figure 7a that, except for two appliances (Saw and Hedge), the F_1 macro (%) is above 99.0% for WRG. Examining the confusion matrix for the V-I image in Figure 7b, we see that the V-I image makes four confusions between Vacuum and Drill (all having rotating components), and one confusion each between Vacuum and Grinder and between Drill and Lamp. The use of WRG reduces these to only one confusion, between Saw and Drilling machine (both having rotating components), as depicted in Figure 7c. Figure 8a presents the per-appliance F_1 macro (%) for the PLAID dataset. From Figure 8a, we see that, with the exception of Washer, Heater, Fridge, AC, and Fan, the WRG reaches at least an 88% F_1 macro score for all other appliances. Observing the confusion matrices for WRG in Figure 8c and the V-I image in Figure 8b, we see that WRG reduces most of the confusions; more precisely, between Fan and Hairdryer (from 19 to 0), Fan and Bulb (from 19 to 6), Fridge and Washer (from 6 to 1), and between AC and other appliances (from 25 to 23). However, despite the increased performance, the WRG makes four confusions between Washer and AC, and five confusions between Fan and Vacuum (both having motors).
Finally, Figure 9a presents the results for the WHITED dataset. We see that, for the WRG, most appliances achieve a 97.0% F_1 macro score and above; the exceptions are PowerSupply, Shredder, Hairdryer, Flat Iron, and CFL. From the confusion matrix in Figure 9b, we observe that, for the V-I image representation, the HairDryer is confused with the Iron (9 times) and the Kettle (6 times) (both having heating elements); the WRG reduces these confusions to 5, as shown in Figure 9c.

Conclusion and Future Work Directions
In this paper, we presented a WRG-based feature representation for appliance classification in NILM. Specifically, we proposed a variation of the RG that goes beyond the traditional binary output. By following this non-binary approach, the proposed method ensures that more information is preserved in the RG, thus improving its discriminative power.
Extensive evaluations using CNNs for classification, and three public sub-metered datasets show that the proposed WRG feature consistently improves the appliance classification performance compared to the commonly used V-I image representation.
We further assessed how WRG's hyper-parameters influence classification performance. We found that the hyper-parameters are dataset-dependent, which raises another fundamental research question: how these parameters can be selected and whether they are related to data characteristics such as the sampling frequency. In future work, we will investigate appropriate methods for choosing these parameters. Specifically, we will investigate whether the WRG hyper-parameters could be treated as learnable parameters like standard neural network weights.
Finally, even though the proposed approach was evaluated against three public datasets, it is essential to remark that these are all sub-metered. Therefore, future work should also assess the WRG on aggregated datasets. Furthermore, when considering aggregated data, it is also essential to determine the impact of the event detection algorithms (e.g., [34,35]) in the extraction of the current activation waveforms. Moreover, relying on aggregate datasets also presents the opportunity of exploring the applicability of the proposed WRG feature for multilabel appliance classification.