FedOpt: Towards Communication Efficiency and Privacy Preservation in Federated Learning

Abstract: Artificial Intelligence (AI) has been applied to solve various real-world problems in recent years. However, the emergence of new AI technologies has brought several problems, especially with regard to communication efficiency, security threats and privacy violations. Towards this end, Federated Learning (FL) has received widespread attention due to its ability to facilitate the collaborative training of local learning models without compromising the privacy of data. However, recent studies have shown that FL still consumes considerable amounts of communication resources, which are vital for updating the learning models. In addition, the privacy of data could still be compromised when the parameters of the local learning models are shared in order to update the global model. To address both issues, we propose a new approach, namely, Federated Optimisation (FedOpt), in order to promote communication efficiency and privacy preservation in FL. In order to implement FedOpt, we design a novel compression algorithm, namely, the Sparse Compression Algorithm (SCA), for efficient communication, and then integrate additively homomorphic encryption with differential privacy to prevent data from being leaked. Thus, the proposed FedOpt smoothly trades off communication efficiency and privacy preservation in order to adapt to the learning task. The experimental results demonstrate that FedOpt outperforms the state-of-the-art FL approaches. In particular, we consider three different evaluation criteria: model accuracy, communication efficiency and computation overhead. Then, we compare the proposed FedOpt with the baseline configurations and the state-of-the-art approaches.


Introduction
Artificial Intelligence (AI) has been employed in a plethora of application fields in recent years [1]. In this context, as a notable branch of AI, Deep Learning (DL) has been broadly used to empower many data-driven real-world applications, such as facial recognition, autonomous driving and smart grid systems [2][3][4]. These DL-based applications usually demand the gathering of large quantities of data from various IoT edge-devices for training high-quality learning models. However, traditional centralised DL models require the local edge-devices to upload their private data to a central cloud server, which may cause serious privacy threats [5]. These privacy threats can be mitigated by distributing the local training among multiple edge-devices, which has led to the emergence of Federated Learning (FL) [6]. FL resolves this problem by allowing the edge-devices to collaboratively train a DL model on their individually gathered data. To this end, in this paper, we propose a novel approach, namely, Federated Optimisation (FedOpt), based on Distributed Stochastic Gradient Descent (DSGD) optimisation. The major contributions of this approach are summarised as follows:

1. FedOpt utilises the novel Sparse Compression Algorithm (SCA) in order to reduce the communication overhead. In particular, SCA extends the existing top-k gradient compression technique and enables downstream compression with a novel mechanism.
2. FedOpt adopts a lightweight homomorphic encryption for efficient and secure aggregation of the gradients. In particular, FedOpt provides a concrete abstract, where additively homomorphic encryption is completely utilised in order to eliminate the key-switching operation and to increase the space for plain-text.
3. To further ensure the privacy of local users from the collusion of adversaries, FedOpt uses a differential-privacy scheme based on the Laplace mechanism in order to keep the originality of local gradients.
4. FedOpt tolerates user drops during the training process with negligible amounts of accuracy losses. Furthermore, the performance evaluation demonstrates the training accuracy of FedOpt in real-life scenarios as well as its efficient communication and low computation overhead.
The remainder of this paper is organised as follows: The system model and the problem statement are presented in Section 2. Federated learning and the primary techniques of cryptography are briefly explained in Section 3. Afterwards, we introduce FedOpt in Section 4 and conduct the experimental evaluations in Section 5. We discuss the related work and provide a comprehensive comparison to the existing approaches in Section 6. Finally, Section 7 concludes the paper with future directions.

System Model and Problem Statement
Below, we first describe the system model and then define the problem statement of the proposed approach.

System Model
In the proposed FL environment, two main entities constitute the basic parts of the whole system: the users and the cloud server. The major objective of the proposed approach is to minimise the communication cost and to protect the privacy of individual users from honest-but-curious adversaries during the training process. In particular, the cloud server honestly executes the whole data aggregation process, but it is also curious to infer private data from the inputs of the users. Therefore, the proposed approach is designed to prevent collusion between the users and the cloud server. For this, we require that the cloud server receives only the encrypted aggregated result of the local gradients, in order to avoid the harmful use of private information. In this model, we assume that all the users agree on the same learning task with the same objectives and parameters, as shown in Figure 2. Specifically, these users are required to compute the local gradients from their private training datasets and then upload them to the cloud server. Afterwards, these users receive the aggregated global gradient from the cloud server. To ensure privacy, each local gradient is encrypted before being uploaded to the cloud server. Meanwhile, the cloud server is assigned the primary task of computing the global gradient based on the encrypted local gradients. After computing the global gradient, the cloud server broadcasts it to all the users, and then the training begins on the proposed model. The proposed approach proceeds by repeating this iterative collaboration between the cloud server and the users.

Problem Statement
As mentioned in Section 1, massive communication overhead and malicious users can make FL infeasible. In this context, we consider the typical FL environment, where local users collaboratively learn a global parametric neural network. Thus, we propose an approach that uses a data compression technique for efficient communication and integrates additively homomorphic encryption with differential privacy to prevent data from being compromised. The major objective of this approach is to obtain a parameter vector ν of a Deep Neural Network (DNN) that minimises the expected loss. As described in the system model, the users learn their local models on their personal datasets and then upload their gradients, which are calculated using this loss function, to the cloud server. Here, the loss is evaluated on samples (x, y), and each user computes its local gradient using the gradient function f on its private dataset $D_i$. In order to further ensure privacy, we apply differential privacy with additively homomorphic encryption to the uploaded gradients during the training process, while aiming to achieve the highest accuracy.

Preliminaries
In this section, we first briefly explain FL and then discuss the primary cryptographic techniques that serve as a foundation of the proposed FedOpt.

Federated Learning
Federated Learning (FL) is an emerging privacy-protecting and decentralised learning scheme that enables edge-devices (local users) to learn a shared global model without disclosing their personal and private data to the cloud server. In FL, users download a shared global model from the cloud server, train this global model over their local data, and then send the updated gradients back to the cloud server. Afterwards, the cloud server aggregates these updated gradients in order to compute a new global model. The following are some unique features of FL compared to traditional centralised learning.

1. The learned model is shared between the users and the cloud server. However, the training data, which is distributed across the users, is not available to the cloud server.
2. Instead of taking place on the cloud server, the training of the learning model occurs on each user. The cloud server receives the local gradients, aggregates them to obtain a global gradient and then sends this global gradient back to all the users.

In this paper, we consider the standard settings of FL, where a large number of local users train the global learning model in a distributed and collaborative manner.
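The download, local training, upload and aggregation loop described above can be illustrated with a minimal sketch. The linear model, the synthetic data, the learning rate and all function names below are assumptions chosen for illustration; they are not the model or configuration used in this paper.

```python
import random

def local_gradient(weights, data):
    """Gradient of the mean squared error of y ~ w*x + b on one user's private data."""
    w, b = weights
    n = len(data)
    grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / n
    grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / n
    return [grad_w, grad_b]

def aggregate(gradients):
    """Cloud server: average the uploaded local gradients coordinate-wise."""
    return [sum(column) / len(gradients) for column in zip(*gradients)]

def federated_round(weights, user_datasets, learning_rate):
    # Each user computes a gradient locally; only gradients leave the device.
    local_grads = [local_gradient(weights, data) for data in user_datasets]
    # The server aggregates and broadcasts the global gradient.
    global_grad = aggregate(local_grads)
    # Every user applies the same global update to its copy of the model.
    return [w - learning_rate * g for w, g in zip(weights, global_grad)]

def make_dataset(rng, size):
    """Hypothetical private dataset drawn from y = 2x + 1."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(size)]
    return [(x, 2.0 * x + 1.0) for x in xs]

rng = random.Random(0)
user_datasets = [make_dataset(rng, 20) for _ in range(3)]
weights = [0.0, 0.0]
for _ in range(500):
    weights = federated_round(weights, user_datasets, learning_rate=0.1)
print(weights)  # close to [2.0, 1.0]
```

The server in this sketch only ever sees gradients, never the `(x, y)` pairs, which is the property the two features above describe.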

Additively Homomorphic Encryption
Homomorphic encryption allows a set of mathematical computations to be performed directly on cipher-texts, such that the decrypted result is the same as if the computations had been performed on the plain-texts. In particular, additively homomorphic encryption supports addition over multiple cipher-texts, whose result decrypts to the sum of the underlying plain-texts [13]. Therefore, local users can send encrypted data for processing on the cloud server without revealing the private information. For instance, consider two plain-texts $\xi_1$, $\xi_2$, such that

$$E_\delta(\xi_1 + \xi_2) = \tau_1 \oplus \tau_2, \qquad E_\delta(\alpha \cdot \xi_1) = \alpha \otimes \tau_1,$$

where $E_\delta$ represents encryption under the secret key δ, $\tau_1$, $\tau_2$ denote the cipher-texts of $\xi_1$, $\xi_2$, respectively, and α is a constant.
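To make the additive property concrete, the sketch below uses a toy Paillier cryptosystem, a standard additively homomorphic scheme. This is an assumption made for illustration only: it is not the lightweight scheme of FedOpt, the primes 1009 and 1013 are far too small to be secure, and all function names are hypothetical. In Paillier, adding plain-texts corresponds to multiplying cipher-texts, and multiplying a plain-text by a constant corresponds to raising the cipher-text to that constant.

```python
import math
import random

def keygen(p, q):
    """Toy Paillier key generation with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse (Python 3.8+)
    return n, (lam, mu)

def encrypt(n, m, rng):
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(n, private_key, c):
    lam, mu = private_key
    u = pow(c, lam, n * n)
    return (((u - 1) // n) * mu) % n

def add_encrypted(n, c1, c2):
    """Cipher-text of the sum of the two underlying plain-texts."""
    return (c1 * c2) % (n * n)

def scale_encrypted(n, c, alpha):
    """Cipher-text of alpha times the underlying plain-text."""
    return pow(c, alpha, n * n)

rng = random.Random(1)
n, private_key = keygen(1009, 1013)  # insecure toy primes
c1 = encrypt(n, 12, rng)
c2 = encrypt(n, 30, rng)
total = decrypt(n, private_key, add_encrypted(n, c1, c2))
scaled = decrypt(n, private_key, scale_encrypted(n, c1, 5))
print(total, scaled)  # 42 60
```

The party that combines `c1` and `c2` never needs the private key, which is what lets a server aggregate values it cannot read.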

Differential Privacy
Differential privacy is a privacy-preserving technique that ensures the overall statistics of a dataset remain nearly the same, regardless of a change in a single tuple. For example, an algorithm Λ satisfies ε-differential privacy (ε-DP) if it satisfies the following:

$$P[\Lambda(D) \in T] \le e^{\varepsilon} \cdot P[\Lambda(D') \in T],$$

where P denotes probability, D and D′ represent any two neighbouring datasets that differ in only a single element, T denotes a set of tuples, and ε represents the privacy budget. The privacy budget is an important factor in differential privacy and ranges from 0 (minimum ε) to 1 (maximum ε) [14].

Laplace Mechanism
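A gradient function f satisfies ε-DP if its output is perturbed as follows:

$$\Lambda(D) = f(D) + \mathrm{Lap}\left(\frac{\Delta f}{\varepsilon}\right),$$

where $\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1$ is the sensitivity of f over neighbouring datasets D and D′, and the function f determines the gradients for each user during the epoch [15].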
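A minimal sketch of the Laplace mechanism follows, assuming the usual noise scale of sensitivity divided by privacy budget. The function names, the seed and the example values (a gradient of 0.5, sensitivity 1 and privacy budget 0.5) are hypothetical. The noise is drawn as the difference of two exponential variates, which is Laplace-distributed with the same scale.

```python
import random

def laplace_noise(scale, rng):
    """One sample from a zero-mean Laplace distribution with the given scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(value, sensitivity, epsilon, rng):
    """Perturb a value with Laplace noise of scale sensitivity / epsilon."""
    return value + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(42)
true_gradient = 0.5
samples = [laplace_mechanism(true_gradient, 1.0, 0.5, rng) for _ in range(50000)]

# The noise is symmetric around zero, so the average stays near the true value,
# while a typical single sample is off by about the scale (here 1.0 / 0.5 = 2.0).
mean_value = sum(samples) / len(samples)
mean_abs_noise = sum(abs(s - true_gradient) for s in samples) / len(samples)
print(mean_value, mean_abs_noise)
```

A smaller privacy budget gives a larger scale and therefore noisier individual values, which matches the accuracy and privacy trade-off discussed in the evaluation.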

Federated Optimisation (FedOpt)
In this section, we propose a new FedOpt approach based on DSGD optimisation in order to promote communication efficiency and privacy preservation in FL.

Sparse Compression Algorithm (SCA)
In the existing literature, the sparse top-k compression algorithm has shown significant performance in the distributed training of data [16][17][18]. Therefore, we use this observation as a starting point to construct a communication-efficient protocol for FL. To this end, we design a Sparse Compression Algorithm (SCA) for FedOpt, to reduce the number of communication bits during model training. In particular, in SCA, we introduce temporal sparsity into DSGD, inspired by [6], to reduce the communication delay. SCA allows each user to perform multiple epochs of SGD in order to compute more informative updates. These updates are given by

$$\Delta\nu = \mathrm{SGD}_n(\nu, G) - \nu,$$

where $\mathrm{SGD}_n(\nu, G)$ refers to the set of gradient updates after n epochs of SGD on the DNN parameters ν while sampling mini-batches from the local data G. Based on the experiments in Sections 5.1-5.3, we conclude that the communication delay is reduced drastically, with marginal degradation of accuracy. For details about the impact of existing compression techniques on communication delay, we refer the reader to [18].

SCA Technique
We use only a proportion of each user's gradient in the full gradient update. To implement this, we set all the gradient updates to zero except the fraction q of biggest values and the fraction q of smallest values. Then, we compute the mean Ψ of the remaining negative and positive gradient updates separately. Afterwards, if the absolute negative mean $|\Psi^-|$ is smaller than the positive mean $\Psi^+$, we set all the positive values to the positive mean $\Psi^+$ and all the negative values to zero. Otherwise, if the absolute negative mean $|\Psi^-|$ is bigger than the positive mean $\Psi^+$, we set all the positive values to zero and all the negative values to the negative mean $\Psi^-$. The detailed technique is formalised in Algorithm 1. In order to find the biggest and smallest fraction q of values in a parameter vector ν, SCA requires $O(|\nu_n|)$ operations, where $\nu_n$ refers to the total number of parameters in ν. Following the above technique, SCA reduces the required number of value bits $b_{num}$ from 32 to 0 by setting the non-zero values of the sparse gradient update to the mean Ψ. This results in a reduction of the communication cost of up to ×3.
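The steps above can be sketched as follows. This is an illustration under stated assumptions rather than Algorithm 1 itself: it assumes, as in top-k sparsification, that only the fraction q of biggest values and the fraction q of smallest values survive and everything else is set to zero; the function name `sca_compress` is hypothetical; and it uses a full sort for brevity, which costs more than the linear-time selection mentioned in the text.

```python
def sca_compress(update, q):
    """Sparsify a gradient update and quantise the survivors to a single mean value."""
    n = len(update)
    k = max(1, int(q * n))
    order = sorted(range(n), key=lambda i: update[i])

    # Survivors: the k smallest (negative) and the k biggest (positive) entries.
    negatives = [i for i in order[:k] if update[i] < 0]
    positives = [i for i in order[-k:] if update[i] > 0]

    mean_neg = sum(update[i] for i in negatives) / len(negatives) if negatives else 0.0
    mean_pos = sum(update[i] for i in positives) / len(positives) if positives else 0.0

    compressed = [0.0] * n
    if abs(mean_neg) < mean_pos:
        # Positive side dominates: keep positives at their mean, drop negatives.
        for i in positives:
            compressed[i] = mean_pos
    else:
        # Negative side dominates: keep negatives at their mean, drop positives.
        for i in negatives:
            compressed[i] = mean_neg
    return compressed

update = [0.9, -0.1, 0.05, -0.8, 0.7, 0.02, -0.03, 0.6, -0.5, 0.01]
compressed = sca_compress(update, q=0.2)
print(compressed)  # only positions 0 and 4 are non-zero, both near 0.8
```

Because every non-zero entry carries the same value, a sender only needs to transmit the positions of the non-zero entries plus one mean, which is where the saving in value bits comes from.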

Gradient Aggregation in FedOpt
Secure gradient aggregation in the form of cipher-text can be achieved through homomorphic encryption. However, the large amounts of communication resources required and the computation overhead of public-key encryption might delay training and degrade accuracy [19,20]. To this end, we utilise additively homomorphic encryption in FedOpt in order to achieve efficiency throughout the learning process. Furthermore, differential privacy is used in order to add calibrated noise to each gradient before encryption and to tolerate the dropout of local users. In this context, each user uses a small batch from the local dataset $D_i$ and trains the model to compute the local gradient G in each epoch. In order to protect their local gradients, the local users first perturb them using the Laplace mechanism and then encrypt the result as $E = E_\delta(G + \mathrm{Lap}(\Delta f / \varepsilon))$. Once the cloud server receives all the encrypted gradients, it homomorphically adds them into a single cipher-text $E_{add}$, which encrypts $\sum_i \left(G_i + \mathrm{Lap}(\Delta f / \varepsilon)\right) \approx \sum_i G_i$, since the noise is nearly eliminated due to the symmetry of the Laplace mechanism. In the end, all the users decrypt the encrypted global gradient $E_{add}$ that is received from the cloud server. The detailed pseudocode of the privacy preservation technique, in which differential privacy is integrated with additively homomorphic encryption, is formalised in Algorithm 2.

[Algorithm 2: users train their local models on $D_i$ to obtain the local gradients G and add ε-DP noise to G; the cloud server aggregates the encrypted local gradients; users update the existing parameters and send the new parameters to the cloud server.]

Efficiency and Privacy in FedOpt
The efficiency and privacy preservation of FedOpt are set in each epoch and the complete process of each epoch is divided into multiple phases as follows:

Initialisation Phase
In the beginning, the global parameters o and the learning rate ℘ are initialised by the cloud server. Then, all the users copy the global training model to their private devices. Given the security parameter σ, a secret key δ is assigned to each user, which comprises two big prime numbers j, k (|j| = |k| = σ), whose product M = jk is given as the public parameter.

Encryption Phase
In this phase, all the users jointly choose the same privacy budget ε in order to maintain differential privacy. Specifically, in each epoch, the users derive their initial parameters and obtain their local gradients G from their individual datasets. Afterwards, each user applies a privacy measure by randomly drawing noise from the Laplace distribution $\mathrm{Lap}(\Delta f / \varepsilon)$ and adding it to the local gradient, which gives $G + \mathrm{Lap}(\Delta f / \varepsilon)$.
Both the privacy budget ε and the sensitivity $\Delta f$ of the Laplace distribution play important roles in differential privacy. Here, $\Delta f$ can be set to 1, since each gradient is assumed to lie in 0 ≤ G ≤ 1 after min-max normalisation [21].
Subsequently, the users encrypt their noisy local gradients using the secret key δ derived from j, k, where M = jk is the public parameter and $j^{-1}$, $k^{-1}$ denote the inverses of j, k, respectively. In the end, these encrypted local gradients E from all the users are sent to the cloud server.
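The role of min-max normalisation in fixing the sensitivity can be seen in a short sketch. The helper name and the sample values are hypothetical; the point is only that, once every gradient lies in [0, 1], no two values can differ by more than 1, so a sensitivity of 1 is a safe bound.

```python
def min_max_normalise(values):
    """Rescale values linearly so that the smallest maps to 0 and the biggest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Degenerate case: all values equal, map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw_gradients = [-2.0, 0.0, 2.0, 1.0]
normalised = min_max_normalise(raw_gradients)
print(normalised)  # [0.0, 0.5, 1.0, 0.75]

# Worst-case gap between any two normalised gradients, which bounds the sensitivity.
worst_case_gap = max(normalised) - min(normalised)
print(worst_case_gap)  # 1.0
```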

Aggregation Phase
Once all the encrypted gradients E are received by the cloud server, it initialises the secure aggregation process by homomorphically adding them into a single encrypted global gradient $E_{add}$. Afterwards, the cloud server begins communication with all the local users and broadcasts the encrypted global gradient $E_{add}$, in order to avoid collusion from adversaries.
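The claim that the Laplace noise largely cancels during aggregation, even when some users drop out, can be checked numerically. The sketch below leaves encryption out entirely and works on plain values, which is a simplification; the number of users, the privacy budget of 0.5, the dropout of 2000 users and all names are hypothetical choices for illustration.

```python
import random

def noisy_gradient(gradient, sensitivity, epsilon, rng):
    """Add zero-mean Laplace noise of scale sensitivity / epsilon to one gradient."""
    scale = sensitivity / epsilon
    return gradient + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

rng = random.Random(7)
n_users = 10000
# One normalised gradient value in [0, 1) per user.
true_grads = [rng.random() for _ in range(n_users)]
uploads = [noisy_gradient(g, 1.0, 0.5, rng) for g in true_grads]

# Aggregation over all users: the symmetric noise averages out.
true_mean = sum(true_grads) / n_users
noisy_mean = sum(uploads) / n_users
print(abs(noisy_mean - true_mean))

# Aggregation when the last 2000 users drop out before uploading.
survivors = 8000
true_mean_drop = sum(true_grads[:survivors]) / survivors
noisy_mean_drop = sum(uploads[:survivors]) / survivors
print(abs(noisy_mean_drop - true_mean_drop))
```

Each individual upload is off by roughly the noise scale of 2, while the aggregated error shrinks with the square root of the number of contributing users, so losing a minority of users changes little.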

Decryption Phase
Once the local users receive the encrypted global gradient $E_{add}$, each user begins the decryption process using its secret key δ. Following this procedure, the local users utilise the Chinese Remainder Theorem (CRT) in order to obtain the final decrypted global gradients B [22]. Since the number of users is sufficiently large in real-world scenarios, FedOpt tolerates users that drop out at any time, with nearly zero effect on the elimination of the noise. In the end, each user updates the parameters according to $o \leftarrow o - \frac{\wp}{N} E_{add}$, where N is received from the cloud server. Afterwards, the whole operation is repeated until the target value of the loss function is reached.
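As background for the CRT step, the sketch below shows the generic reconstruction of a value modulo M = jk from its residues modulo j and modulo k, using the inverses of k modulo j and of j modulo k. It is not the decryption procedure of FedOpt, whose exact equations are not reproduced here; the tiny primes and the function name are assumptions for illustration.

```python
def crt_combine(residue_j, residue_k, j, k):
    """Recover x mod (j * k) from x mod j and x mod k, for coprime j and k."""
    modulus = j * k
    k_inv = pow(k, -1, j)  # inverse of k modulo j (Python 3.8+)
    j_inv = pow(j, -1, k)  # inverse of j modulo k
    return (residue_j * k * k_inv + residue_k * j * j_inv) % modulus

j, k = 1009, 1013  # toy primes, far too small for real security
a, b = 123456, 654321

# A single value is recovered from its two residues.
recovered = crt_combine(a % j, a % k, j, k)
print(recovered)  # 123456

# Residues can be added component-wise, and the sum is recovered the same way.
sum_j = (a % j + b % j) % j
sum_k = (a % k + b % k) % k
recovered_sum = crt_combine(sum_j, sum_k, j, k)
print(recovered_sum)  # 777777
```

The second part shows why a residue representation is compatible with additive aggregation: sums taken separately modulo j and modulo k still combine into the sum modulo M, as long as the true sum stays below M.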
The complete FedOpt approach that features two-way (upstream and downstream) compression via SCA and performs optimal encryption through differential privacy is shown in Algorithm 3.

Algorithm 3: FedOpt: Communication-Efficiency and Privacy-Preserving
Input: Initial parameters
Output: Global model with improved parameters o
User i executes:
Cloud Server CS executes:

FedOpt Evaluation
In this section, we conduct the experimental evaluation of the proposed FedOpt in terms of model accuracy, communication efficiency and computational overhead. We conduct our experiments on a server with an Intel(R) Core(TM) i7-4980HQ CPU (2.80 GHz) and 16 GB of RAM. The compression and privacy-preserving algorithms are simulated with TensorFlow in Python. For evaluation, we consider the baseline configuration of FL, Federated Averaging (FedAvg) [23] and Privacy Preserving Deep Learning (PPDL) [24]. In particular, we evaluate the performance of FedOpt on the MNIST dataset, which consists of 60,000 training examples, where each example is a 28 × 28 image. Similarly, we assess the performance of FedOpt on the CIFAR-10 dataset, which consists of 50,000 training examples and 10,000 testing examples, where each example is a 32 × 32 image with three RGB channels. The baseline configuration setup is given in Table 1.

Accuracy Test
Accuracy is an important factor in measuring the performance of any DL model. In this regard, the proposed FedOpt achieves an accuracy of 99.6% and 98.4% after 500 epochs on the MNIST and CIFAR-10 datasets, respectively. As shown in Figures 3a and 4a, we conduct the experiments with several privacy budgets ε, i.e., 0.2, 0.4, 0.6, 0.8 and 1.0, in order to test the accuracy of FedOpt on the MNIST and CIFAR-10 datasets, respectively. Compared with FedAvg and PPDL, FedOpt achieves an accuracy of 92.3% at ε = 0.2 (lowest ε) and 99.6% at ε = 1.0 (highest ε) on the MNIST dataset. Similarly, FedOpt achieves an accuracy of 91.2% at ε = 0.2 (lowest ε) and 98.7% at ε = 1.0 (highest ε) on the CIFAR-10 dataset. These results demonstrate that the privacy budget has a huge impact on the prediction accuracy. Therefore, we conclude that higher privacy budgets produce higher accuracy, but provide lower levels of privacy. Furthermore, we also conduct accuracy tests with regard to the impact of the number of users, e.g., 200, 400, 600, 800 and 1000, at a constant privacy budget of ε = 0.5. For example, in Figures 3b and 4b, the accuracy increases with the number of users on the MNIST and CIFAR-10 datasets, respectively. Specifically, in Figure 3b, FedOpt achieves an accuracy of 97.1% with 200 users (minimum) and 99.7% with 1000 users (maximum) on the MNIST dataset. Similarly, as shown in Figure 4b, FedOpt achieves an accuracy of 93.4% with 200 users (minimum) and 98.6% with 1000 users (maximum) on the CIFAR-10 dataset. As shown in Figures 3 and 4, the proposed FedOpt achieves the highest level of accuracy when compared with FedAvg and PPDL. This is attributed to the fact that a large part of the noise is eliminated through the symmetry of the Laplace mechanism and the complete utilisation of SCA. Furthermore, differential privacy provides protection to the gradients during the training process.

Communication Efficiency
In our experiments, we consider the communication efficiency between the cloud server and the users, as they are the main entities of the whole system. Specifically, during the aggregation phase, we assume that there are n epochs in the whole training process, that each user has a single thread, that the security parameter σ is set to 512 and that the size of each local gradient G is 32 bits. In each epoch, the users upload the encrypted local gradients E to the cloud server and receive the shared parameters $E_{add}$ from the cloud server. Figures 5 and 6 show the comparison of communication efficiency between FedOpt, FedAvg and PPDL on the MNIST and CIFAR-10 datasets, respectively. Specifically, we consider different numbers of gradients and different numbers of users for evaluation in Figures 5a,b and 6a,b, respectively. The results show that increasing the number of gradients with the maximum number of users yields the maximum communication efficiency. Compared to FedAvg and PPDL, FedOpt has 56% and 38% more communication efficiency, respectively, on the MNIST dataset. Similarly, FedOpt outperforms FedAvg and PPDL on the CIFAR-10 dataset with 54% and 32% more communication efficiency, respectively. The major reason behind this higher communication efficiency is that FedOpt completely utilises Paillier encryption [25], which helps in the rapid growth of cipher-text volume. In addition, SCA helps FedOpt to converge faster in terms of training epochs, with a significant compression rate.

Analysis of Communication Efficiency w.r.t Accuracy
In this subsection, we evaluate the proposed compression algorithm SCA with respect to the number of epochs and the communicated bits that are required to achieve the targeted accuracy on an FL task. In the above subsections, FedOpt performed significantly better than FedAvg and PPDL. In order to have a meaningful comparison, we choose 100 users for 50 and 100 epochs, where every user holds 10 different classes and uses a batch size of 20 during training. This setup, with fewer users and epochs, favours FedAvg and PPDL. The rest of the parameters of the learning environment are the same as given in Table 1. We train on the datasets until the targeted accuracy is achieved in the given number of epochs and measure the total communicated bits for both upload and download. The amounts of upstream and downstream communication required to achieve the targeted accuracy are given in megabytes (MB) in Table 2. As shown in Table 2, FedOpt communicates 14.6 MB and 172.3 MB of data on the MNIST and CIFAR-10 datasets, which is a reduction in communication by a factor of ×152 and ×207, respectively, as compared to the baseline configurations. Meanwhile, FedAvg and PPDL (epochs = 100) require 84.73 and 63.74 MB of data on the MNIST dataset and 1665.7 and 958.3 MB of data on the CIFAR-10 dataset, which shows that the proposed FedOpt has the minimum delay period in achieving the targeted accuracy within a given number of training epochs.

Computation Overhead
Finally, we discuss the computation cost of FedOpt on the MNIST and CIFAR-10 datasets, as shown in Figures 7 and 8, respectively. We only consider the running time of the cipher-text operations in order to support our main contribution. Considering the security requirement, we select the plain-text $\xi_1 = 2^{16}$ with a security parameter of σ = 128 bits, and analyse the computational cost for each user and for the cloud server in each phase mentioned in Section 4.3. Specifically, in each subfigure of Figures 7 and 8, the computational cost increases linearly with the number of gradients, because FedOpt encrypts every single packet in each aggregation. Therefore, the computational overhead of the encryption process depends on the total number of gradients, regardless of the number of users. Furthermore, increased security (a higher security parameter σ) leads to lower efficiency. In this regard, as shown in Figure 7, on the MNIST dataset FedOpt incurs 74% and 53% less computational overhead than FedAvg and PPDL, respectively, at the encryption phase, 72% and 45% less at the aggregation phase, and 86% and 31% less at the decryption phase. Similarly, Figure 8 shows the computation overhead on the CIFAR-10 dataset, where FedOpt incurs 61% and 52% less computational overhead than FedAvg and PPDL at the encryption phase, 43% and 31% less at the aggregation phase, and 72% and 48% less at the decryption phase. The overall computational overhead for users at the encryption phase with the security parameter of σ = 128 bits is about ×2.8 lower than that of the baseline configurations, because FedOpt requires fewer addition and multiplication operations. Similarly, the overall computational overhead for the cloud server with the security parameter of σ = 128 bits is about ×9.3 lower than that of the baseline configurations.
This lower computational overhead at the server end is because FedOpt decrypts every single packet in each aggregation, so that the number of decryption operations increases linearly with the number of gradients. Therefore, the proposed FedOpt is able to support learning scenarios with large numbers of users.

Related Work and Discussions
Stochastic Gradient Descent (SGD) is a very popular optimisation technique that supports the training of DNN models in various DL applications. On one end of the spectrum, SGD can be used to reduce the convergence time in large-scale applications of DL models by using device-level parallelism [26][27][28]. On the other end of the spectrum, SGD can be used to enable and enhance privacy preservation in DL algorithms [29]. Since the users are only required to share the gradient updates, SGD helps in training a model on the combined data of all the users without revealing any individual's local data to a centralised cloud server [30]. However, despite the tremendous advantages and the extensive applications of SGD-based DL, existing research shows that learning the model updates suffers from a massive communication overhead [31,32]. In order to reduce this communication overhead, a wide variety of methods have been proposed in the past. For example, in [23], the authors proposed a novel approach, i.e., FedAvg, where each user computes the gradient updates by performing multiple epochs of SGD, which increases the number of gradient evaluations and causes delay in communication. In order to minimise this communication delay, the authors in [33] use probabilistic quantisation and random sparsification. In particular, the authors force random sparsity on the users or restrict them to learning random-sparse gradient updates (structured and sketched updates) and combine probabilistic quantisation with this sparsification. However, this method is not suitable for SGD epochs, as it slows down the convergence speed significantly. To overcome this convergence issue, the authors in [34] propose a compression technique, namely, SignSGD, which theoretically guarantees convergence over iid data. SignSGD quantises each gradient from each user to a binary sign and reduces the bit-size per gradient update by ×32.
This compression of the gradients is done by means of a majority vote, which may result in the loss of important updates. In addition, the compression rate and empirical performance does not

Conclusions
This paper proposes a novel approach, namely, Federated Optimisation (FedOpt), that is able to simultaneously decrease the communication cost and increase the privacy in federated learning settings. In particular, we design a Sparse Compression Algorithm (SCA) for communication efficiency and integrate additively homomorphic encryption with differential privacy in order to prevent data from being leaked. Compared to the existing approaches, the proposed FedOpt compresses both the upstream and the downstream communication and reduces the communication overhead. In general, FedOpt is especially advantageous in networks where communication is costly or bandwidth is constrained, as it achieves the targeted accuracy with fewer communication bits. Furthermore, the proposed FedOpt is able to mitigate the security threats for both the local users and the cloud server. In addition, the proposed FedOpt is completely non-interactive, which provides higher levels of privacy at the aggregation phase, even when the adversaries collude with honest users. The experimental evaluation on both the MNIST and CIFAR-10 datasets shows that the proposed FedOpt outperforms the state-of-the-art approaches in terms of accuracy, efficiency and privacy. In the future, we will consider the virtualisation of this work through Docker to make it useful in real-life environments. In addition, we plan to investigate further approaches for communication efficiency and privacy preservation while maintaining robustness in federated learning, especially with complex neural networks and high-dimensional datasets for diverse learning tasks and their models.
Author Contributions: All the authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding: This research has been partially supported by NITech Frontier Institute.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations and notations are widely used in this manuscript, while the arithmetic operations and their notations are to be understood element-wise: