Federated Learning for Commercial Image Sources

Shreyansh Jain
IIIT Delhi
New Delhi, India
shreyansh21089@iiitd.ac.in
   Koteswar Rao Jerripothula
IIIT Delhi
New Delhi, India
koteswar@iiitd.ac.in
Abstract

Federated Learning is a collaborative machine learning paradigm that enables multiple clients to learn a global model without exposing their data to each other. Consequently, it provides a secure learning platform with privacy-preserving capabilities. This paper introduces a new dataset containing 23,326 images collected from eight different commercial sources and classified into 31 categories, similar to the Office-31 dataset. To the best of our knowledge, this is the first image classification dataset specifically designed for Federated Learning. We also propose two new Federated Learning algorithms, namely Fed-Cyclic and Fed-Star. In Fed-Cyclic, a client receives weights from its previous client, updates them through local training, and passes them to the next client, thus forming a cyclic topology. In Fed-Star, a client receives weights from all other clients, updates its local weights through pre-aggregation (to address statistical heterogeneity) and local training, and sends its updated local weights to all other clients, thus forming a star-like topology. Our experiments reveal that both algorithms perform better than existing baselines on our newly introduced dataset.

1 Introduction

Federated Learning (FL) is a distributed learning paradigm that can learn a global model from decentralized data without having to exchange sensitive data across the clients [25] [30].

Footnote: Published in the IEEE/CVF WACV 2023 Proceedings with DOI: 10.1109/WACV56688.2023.00647. Please cite this work as [18]. © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Traditional machine learning requires users to upload their data to a centralized server for learning and inference. The end-user has no power or control over how the data is used [11]. Moreover, uploading the data to a central server incurs severe costs: maintaining such a vast volume of data and communicating the learned parameters back to the users is expensive. To overcome the privacy challenges and the burden of maintaining a large amount of data in the centralized setting, Google [23] proposed the Federated Learning paradigm, which handles these issues well.

The Federated Learning framework addresses sensitive data privacy and data access issues [42]. Federated Learning models are trained via model aggregation rather than data aggregation: the model is trained locally on the data owner's machine or local edge devices, and only the model parameters are shared. Federated Learning has found successful applications in IoT, healthcare, finance, etc. [24] [32]. Traditional Federated Learning optimization involves each client training locally on its dataset for a fixed number of epochs using an SGD optimizer. The local clients then upload the model weights to the central server, where the weights are averaged to form a global model whose parameters are shared back with the local clients. This method, known as FedAvg [30], lets a local client learn features from other clients while preserving privacy. However, FedAvg may fail to converge when the clients exhibit statistical heterogeneity [26] [22] [16]. Thus, the simple FedAvg algorithm may not be helpful when dealing with device-level heterogeneity.

In this work, we aim to understand the real-world scenario where different commercial image sources can collaborate in a Federated setting to perform the image classification task with privacy preservation.

Most previous research applies Federated Learning algorithms to a single dataset distributed among the clients in an IID or non-IID manner. This is far from a real-world scenario, where different clients may have different data distributions due to domain shift (statistical heterogeneity) among them [35] [31]. Another challenge in Federated Learning is convergence when the data distribution differs among the clients, which may increase the communication cost between the clients and the central server and lead to suboptimal model performance.

Motivated by the above challenges, we propose a dataset in which each client's data is sampled from a different commercial image source to simulate the real-world scenario where each client exhibits a domain shift: each commercial image source has its own unique image set, causing domain shift amongst clients. The dataset is inspired by the Office-31 dataset [38]. We also propose two novel algorithms, namely Fed-Cyclic and Fed-Star. Fed-Cyclic is a simple algorithm in which a client gets weights from the previous client, trains the model locally, and passes the weights to the next client in a cyclic fashion. In this way, a global model is simply passed from one client to another in a cyclic manner. The global server need not be involved here; even if we involve it to preserve anonymity, it does not have to perform any computation and is used only to pass the parameters of one client to another. Fed-Star requires each client to receive weights from all the other clients in a star-like manner during pre-aggregation, after the local training of each client on its train set. While pre-aggregating, every client prioritizes learning the outlier features present at other clients while retaining the common features, so as to train a more robust model that is impervious to statistical heterogeneity among the clients' data distributions; this is followed by aggregation via a global server after a fixed number of periods. Experiments show that our algorithms have better convergence and accuracy.

To summarize, our contributions are three-fold:

  • We propose an image classification dataset specifically designed for Federated Learning, which is close to a real-world scenario in that each client has a unique dataset exhibiting domain shift.

  • We propose Fed-Cyclic, a simple algorithm that is communication-efficient and attains higher accuracy than the baselines.

  • We propose the Fed-Star algorithm, which prioritizes learning both generalized and outlier features to obtain a model personalized to each client's heterogeneous data distribution, and which attains faster convergence than the baselines with higher accuracy.

2 Related Works

Many distributed optimization algorithms have been developed to process and draw inferences from the uploaded data [48][37][9] [2] [1]. However, such distributed methods require uploading data to a central server, which incurs a considerable cost for maintaining the data centrally and requires a lot of power for processing [33] [8]. Also, the privacy issue persists, as the user has no control over how personal data is used and shared.

The first application of the Federated Learning paradigm is FedAvg, proposed by McMahan et al. [30]. FedAvg performs reasonably well when the data distribution is IID among the clients and shows faster convergence of the global model. The issue arises in real-world scenarios when the data follow a non-IID distribution, as shown by Zhao et al. [49].

Federated Learning with data heterogeneity: The vanilla FedAvg faces convergence issues when there is data heterogeneity among the clients. Different methods have been proposed to tackle this challenge. FedProx [25] adds a proximal term, the squared distance between the global (server) and local (client) weights, to the local loss to better optimize the global model. FedNova [44] proposes normalized averaging to eliminate objective inconsistency in heterogeneous Federated optimization. FedMax, proposed by Chen et al. [7], aims to mitigate activation divergence by making activation vectors of the same classes across different devices similar. FedOpt, proposed by Reddi et al. [36], applies different optimizers at the server, like Adam, Yogi, and AdaGrad. VRL-SGD [27] incorporates variance reduction into local SGD and attains higher accuracy while decreasing the communication cost. FedCluster [5] proposes grouping local devices into different clusters so that each cluster can implement any Federated algorithm. RingFed, proposed by Yang et al. [47], minimizes the communication cost by pre-aggregating the parameters among the local clients before uploading them to the central server. SCAFFOLD [21] uses variance reduction to mitigate client drift. Chen et al. [6] propose FedSVRG, which uses a stochastic variance-reduced gradient-based method to reduce the communication cost between clients and server while maintaining accuracy. Jeong et al. [19] propose Federated augmentation (FAug), where local clients jointly train a generative model to augment their local datasets and generate IID data. In this paper, we aim to train models robust to data heterogeneity with faster convergence and lower communication costs via our proposed algorithms. One of our proposed methods, Fed-Star, is similar to RingFed [47]: RingFed involves simple pre-aggregation of weights between adjacent clients, whereas our method involves pre-aggregation of weights between all the local clients using the accuracy metric.

Personalized Federated Learning: FedAvg also suffers from producing an overly generalized global model, since averaging the parameters gives a poor representation to clients with heterogeneous data. Personalized Federated Learning involves training the global model using any vanilla Federated algorithm, followed by personalizing the model for each client by training it locally on that client [20] [29] [49] [11]. Data heterogeneity among clients is the motivation for personalized Federated Learning. Data augmentation is explored to account for local data heterogeneity and involves local clients jointly generating an IID data distribution [12] [45]. Wang et al. [43] propose FAVOR, which selects a subset of clients at each round to mitigate the bias introduced by non-IID data. Chai et al. [4] propose the TiFL method, which clusters the clients into different tiers and trains clients belonging to the same tier together for faster training. Sattler et al. [39] propose a hierarchical clustering-based approach based on the cosine similarity of the client gradients, grouping similar clients into the same cluster so that the local clients are trained properly. Xie et al. [46] propose Expectation Maximization to derive an optimal matching between the local clients and the global models for personalized training. Deng et al. [10] propose the APFL algorithm to find an optimal client-server pair for personalized learning. There are also different clustering-based approaches explored by various authors [3] [14] [17] [13]. Tan et al. [41] provide a deeper analysis of the personalized Federated framework. In our work, we aim to personalize the model at the global level by proposing an algorithm that better captures the outlier features of each client while retaining the generalized features.

Federated image classification datasets: Most Federated Learning algorithms are simulated on datasets belonging to a single domain with an artificial partition among the clients, or they use existing public datasets. In a real-world scenario, the data distribution may differ across clients, as the clients exhibit domain shift. The first work proposing a real-world image dataset for Federated Learning [28] contains more than 900 images belonging to 7 different object categories, captured via street cameras and annotated with detailed bounding boxes; it targets object detection. In our work, we propose the first real-world image dataset for the image classification task, to better understand the performance of Federated algorithms in a real-world setting.

3 Proposed Methods

3.1 Objective

In Federated Learning (FL), different clients (say $K$ clients) collaborate to learn a global model without having to share their data. Let the weights of such a model be $w$, and let the loss value of the model for sample $(x_i, y_i)$ be $\mathcal{L}(x_i, y_i; w)$. The objective now is to find the optimal $w$ such that the following objective is achieved:

\min \frac{1}{|D|} \sum_{i=1}^{|D|} \mathcal{L}(x_i, y_i; w) \qquad (1)

where $D$ denotes the union of all the data owned by different clients, as shown below:

D = \bigcup_{k=1}^{K} D_k \qquad (2)

where $D_k$ denotes the data owned by the $k^{th}$ client. Given this, we can rewrite our objective function as follows:

\min \frac{1}{|D|} \sum_{k=1}^{K} \sum_{i=1}^{|D_k|} \mathcal{L}(x_i, y_i; w) \qquad (3)

If we represent the average local loss $\mathbf{L}_k$ of the $k^{th}$ client using

\mathbf{L}_k = \frac{1}{|D_k|} \sum_{i=1}^{|D_k|} \mathcal{L}(x_i, y_i; w), \qquad (4)

we can reformulate the objective as follows:

\min \sum_{k=1}^{K} \frac{|D_k|}{|D|} \mathbf{L}_k \qquad (5)

which suggests that our objective is to minimize the weighted sum of the local losses incurred by the clients, where the weights are proportional to the number of data samples each client holds. This objective function is the same as that of [30]; we discuss it here to make our paper self-contained.

This formulation motivated FedAvg [30] to aggregate the local model weights in a weighted-averaging manner to obtain $w$, as shown below:

w = \sum_{k=1}^{K} \frac{|D_k|}{|D|} w_k \qquad (6)

This aggregation happens iteratively, where, in a given iteration, the central server sends the global model to the local clients, where local models are updated and then sent back to the global server for aggregation, as shown in Figure 1. This happens until the global model converges.
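For concreteness, the following is a minimal sketch of this weighted-averaging step (equation (6)) in PyTorch, assuming each client's model is available as a state dict with identical keys; the function name and arguments are illustrative, not taken from any particular implementation.

def fedavg_aggregate(client_states, client_sizes):
    """Weighted average of client model weights, as in equation (6).

    client_states: list of state dicts with identical keys and shapes.
    client_sizes:  list of |D_k|, the number of samples at each client.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        # Accumulate (|D_k| / |D|) * w_k over all clients, parameter by parameter.
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state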

Figure 1: FedAvg algorithm [30].

However, it can take several communication rounds for the FedAvg algorithm to converge, especially when there is statistical heterogeneity in clients’ datasets. As a result, the accuracy drops too [15]. Also, FedAvg creates a generalized model by averaging the parameters from the local clients, forcing the local model with statistical heterogeneity to learn a generalized representation that may differ from its data distribution, leading to poorly trained local clients. In that case, local clients do not find the global model satisfactory.

Considering these limitations of FedAvg, we propose Fed-Cyclic and Fed-Star algorithms, which cater to the statistical heterogeneity of data across the clients and ensure the satisfactory local performance of global models despite that.

3.2 Fed-Cyclic

We propose the Fed-Cyclic algorithm to overcome a key limitation of FedAvg: the communication bottleneck caused by a large number of edge devices uploading their parameters to the central server, which congests the network. The model is visualized in Figure 2.

Figure 2: Proposed model using Fed-Cyclic, showing the cyclic passing of weights by the clients after local training.
Algorithm 1 Proposed Fed-Cyclic Algorithm

Input: Initial weights w_init, number of global rounds R, number of local epochs E, learning rate η, K clients indexed by k (with local data D_k and local weights w_k), and local minibatch size b.
Output: Global weights w^R (after R rounds)

Algorithm:

Initialize w^0 ← w_init   // Global weights initialized
w_1^0 ← w^0
for r = 0 to R-1 do
    for k = 1 to K-1 do
        w_k^{r+1} ← ClientUpdate(w_k^r, k) using SGD
        w_{k+1}^r ← w_k^{r+1}
    w_K^{r+1} ← ClientUpdate(w_K^r, K) using SGD
    w_1^{r+1} ← w_K^{r+1}
    w^{r+1} ← w_K^{r+1}

function ClientUpdate(w, k)
    B ← (split D_k into batches of size b)
    for e = 1 to E do
        for d ∈ B do
            w ← w − η ∇g(w; d)
    return w

In the Fed-Cyclic algorithm, we use a global model to initialize the weights of one of the clients in the network, and then train that client's local model for E local epochs using an SGD optimizer. The updated weights are then used to initialize the next client in the network, as shown in equation (7), and the process continues until all the clients have been trained cyclically in this manner, constituting one training round. In Fed-Cyclic, the clients can either directly pass the weights to the next client or involve the global server to do so, to preserve the anonymity of the last client. At the end of each round, the global weights w get updated.

It is a communication-efficient algorithm, since the clients can directly pass the weights to the next client without involving the global server. Even if the global server is involved in passing the weights from one client to another, no processing is done there, and only the parameters of a single client are passed at a time. We explain the procedure in greater detail in Algorithm 1. The most important step is the following:

w_{k+1}^{r} \leftarrow w_k^{r+1} \qquad (7)

where we use the $k^{th}$ client's updated weights to initialize the $(k+1)^{th}$ client.

The algorithm is robust to statistical heterogeneity, as every client gets an opportunity to train the global model on its local data. Moreover, we can take the view that the global model is being periodically trained on different portions of the dataset ($D$), as if they were mini-batches ($D_k$). Hence, from the point of view of the global model, this algorithm is somewhat analogous to a typical deep learning approach. As a result, convergence is also ensured, unlike FedAvg, where convergence is expected from simple aggregation of weights.
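As an illustration of Algorithm 1, the sketch below simulates Fed-Cyclic in PyTorch; here make_model, the per-client data loaders, and the hyperparameter defaults are placeholders for a local simulation, not the paper's released training code.

import copy
import torch.nn as nn
import torch.optim as optim

def client_update(model, loader, epochs=3, lr=3e-4, device="cpu"):
    """Local SGD training at one client (ClientUpdate in Algorithm 1)."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            criterion(model(x), y).backward()
            optimizer.step()
    return model

def fed_cyclic(make_model, client_loaders, rounds=150, epochs=3, lr=3e-4):
    """Pass one global model around the clients in a cycle (Algorithm 1)."""
    global_model = make_model()
    for _ in range(rounds):
        # Clients 1..K, visited in a fixed cyclic order within each round.
        for loader in client_loaders:
            global_model = client_update(global_model, loader, epochs, lr)
        # The weights leaving client K become the global weights for this
        # round and are handed to client 1 at the start of the next round.
    return copy.deepcopy(global_model.state_dict())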

3.3 Fed-Star

Algorithm 2 Proposed Fed-Star Algorithm

Input: Initial weights w_init, number of global rounds R, number of local epochs E, learning rate η, K clients indexed by k (with local data D_k and local weights w_k), local minibatch size b, number of periods P, and weightage matrix M of dimension K × K.
Output: Global weights w^R (after R rounds)

Algorithm:

Initialize w^0 ← w_init   // Global weights initialized
for r = 0 to R-1 do
    w_k^{r,0} ← w^r, ∀ k ∈ {1, ..., K}
    for p ∈ {0, ..., P-1} do
        for k ∈ {1, ..., K} in parallel do
            w_k^{r,p+1} ← ClientUpdate(w_k^{r,p}, k)
            for j ∈ {1, ..., K} do
                transfer w_j^{r,p+1} to the k-th client
                M(k, j) = 1 − Acc(w_j^{r,p+1}, D_k)/100
            w_k^{r,p+1} = (Σ_{j=1}^{K} M(k,j) w_j^{r,p+1}) / (Σ_{j=1}^{K} M(k,j))
    w^{r+1} = (1/K) Σ_{k=1}^{K} (|D_k|/|D|) w_k^{r,P}

function ClientUpdate(w, k)
    B ← (split D_k into batches of size b)
    for e = 1 to E do
        for d ∈ B do
            w ← w − η ∇g(w; d)
    return w

function Acc(w, D)
    return accuracy of w on the train set of D

Although Fed-Cyclic can converge faster than FedAvg, it is a very simple algorithm, much like FedAvg, and it lacks aggregation of any kind. We therefore propose the Fed-Star algorithm to address these limitations.

In Fed-Star, the local models are first trained locally in parallel for a given number of epochs. Once these epochs are complete, we perform pre-aggregation of weights locally at each client by sharing the models with one another. Each client receives the models of every other client and evaluates them on its local training set. The resulting accuracies determine how much weightage each client's model receives during pre-aggregation. The pre-aggregated weights are then used to reinitialize the local model for further training. These steps are carried out iteratively, and each iteration is called a period. This model-sharing interaction is analogous to a star network topology, where every client interacts with every other client. After a certain number of periods P, the local weights are aggregated at the central server; thus, a round consists of P periods, in each of which the local models are shared, pre-aggregated at each client, and used to initialize training for the next period.

In any period p of round r, a weightage matrix M is developed, which is computed as follows:

M(k,j) = 1 - Acc(w_j^{r,p+1}, D_k)/100 \qquad (8)

where $Acc(w_j^{r,p+1}, D_k)$ denotes the training accuracy of $w_j^{r,p+1}$ on the training set of dataset $D_k$. $M(k,j)$ is the weightage given to the model coming from the $j^{th}$ client while pre-aggregating at the $k^{th}$ client.

Note that, during pre-aggregation, we give more weightage to clients that are significantly different from the reference client, because we want each client to learn the outlier features from the other clients while retaining the generalized features.
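For illustration, with hypothetical numbers: if the model received from client $j_1$ attains 90% training accuracy on $D_k$ while the model from client $j_2$ attains only 60%, then $M(k, j_1) = 1 - 0.90 = 0.10$ and $M(k, j_2) = 1 - 0.60 = 0.40$, so the more dissimilar model from $j_2$ receives four times the weightage of $j_1$'s model during pre-aggregation at client $k$.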

The pre-aggregated weights for the $k^{th}$ client are computed as follows:

w_k^{r,p+1} = \frac{\sum_{j=1}^{K} M(k,j) \, w_j^{r,p+1}}{\sum_{j=1}^{K} M(k,j)} \qquad (9)

where we normalize the weightages with their sum while performing the pre-aggregation of weights.

Figure 3: Proposed model using Fed-Star, demonstrating pre-aggregation of the parameters among the local clients in a star-topology manner, as given by equation (9). This is followed by transferring the weights to the global server, where aggregation of the weights is performed.

Thus, each local model tends to learn more from the clients that are significantly different from it. The Fed-Star algorithm is explained further in Algorithm 2 and can be visualized through Figure 3.

This algorithm is communication-intensive, since each client has to interact with every other client. Still, the total communication overhead is reduced significantly, as the pre-aggregation step among clients decreases the reliance on the global server for convergence. Our algorithm attains faster convergence than FedAvg, with less communication overhead with the global server and higher accuracy. Fed-Star retains outlier features well and helps create a global model that is also personalized to the local clients.
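To make the pre-aggregation step concrete, the sketch below implements equations (8) and (9) for one period in PyTorch, assuming each client's model is available as a state dict and that training-set accuracy is computed locally; the helper names (evaluate_accuracy, make_model) are ours, not the paper's.

import torch

def evaluate_accuracy(state_dict, make_model, loader, device="cpu"):
    """Training-set accuracy (in %) of the model defined by state_dict."""
    model = make_model().to(device)
    model.load_state_dict(state_dict)
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in loader:
            preds = model(x.to(device)).argmax(dim=1).cpu()
            correct += (preds == y).sum().item()
            total += y.numel()
    return 100.0 * correct / total

def fed_star_preaggregate(client_states, make_model, train_loaders):
    """One pre-aggregation step: every client weighs every model by
    1 - Acc/100 (equation (8)) and takes a normalized average (equation (9))."""
    K = len(client_states)
    new_states = []
    for k in range(K):
        weightages = [
            1.0 - evaluate_accuracy(client_states[j], make_model, train_loaders[k]) / 100.0
            for j in range(K)
        ]
        norm = sum(weightages)
        mixed = {
            key: sum(m * client_states[j][key].float() for j, m in enumerate(weightages)) / norm
            for key in client_states[k]
        }
        new_states.append(mixed)
    return new_states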

4 Proposed Dataset

Figure 4: Sample images from our dataset.
Classes Mean Std. dev. Total Img.
back_pack 108.50 37.98 868
bike 108.62 31.73 869
bike_helmet 94.25 26.54 754
book_shelf 77.13 25.93 617
bottle 89.38 11.61 715
calculator 111.50 31.43 892
desk_chair 125.38 27.61 1003
desk_lamp 106.87 24.87 855
desktop_computer 99.00 19.13 792
file_cabinet 65.38 20.13 523
headphone 105.63 22.36 845
keyboard 76.38 35.14 611
laptop 111.38 25.39 891
letter_tray 26.38 13.47 211
mobile_phone 84.88 15.06 679
monitor 112.38 37.35 899
mouse 116.87 25.97 935
mug 117.38 25.19 939
notebook 98.00 23.78 784
pen 108.50 25.20 868
phone 113.63 21.80 909
printer 108.25 22.64 866
projector 72.63 33.86 581
puncher 66.88 32.37 535
ring_binder 65.63 31.26 525
ruler 90.13 19.50 721
scissors 127.13 51.77 1017
speaker 67.63 18.6 541
stapler 105.25 21.75 842
tape_dispenser 82.38 37.9 659
trashcan 103.25 33.36 826
Table 1: The mean, standard deviation, and total number of images per class across the different commercial sources.

We propose a dataset containing 23,326 images collected from 8 different image-hosting websites. Each of the 8 sources represents a different client in our Federated Learning setting. The average number of images per source is approximately 2,916, divided across 31 categories. The dataset is inspired by the Office-31 dataset [38], which contains common objects found in office settings, such as keyboards, printers, monitors, and laptops. Our dataset includes the same categories of images as the Office-31 dataset. We took extra care to ensure that only relevant and high-quality images were taken from each source, manually curating the dataset to remove poor-quality, duplicate, or irrelevant images. Statistics showing how images are distributed within the classes and across the sources are summarized in Tables 1 and 2. Sample images are shown in Figure 4.

Source Mean Std. dev. Total Img.
123rf 94.16 26.96 2897
Adobe Stock 101.16 27.25 3104
Alamy 102.80 40.65 3155
CanStockPhotos 95.06 33.24 2915
Depositphotos 101.41 38.59 3112
Getty Images 63.06 21.99 1923
iStock 90.10 33.01 2761
Shutterstock 112.61 36.34 3459
Table 2: The mean and standard deviation of the number of images per class, together with the total number of images, for each source.
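To simulate this federated setting in code, each source can be treated as one client. Below is a minimal sketch, assuming the images are organized on disk as dataset/<source>/<class>/ (a directory layout we assume purely for illustration), that builds an 80:20 train/test split per client with torchvision (the split used in our experiments; see Section 5.1).

import torch
from torchvision import datasets, transforms

SOURCES = ["123rf", "Adobe Stock", "Alamy", "CanStockPhotos",
           "Depositphotos", "Getty Images", "iStock", "Shutterstock"]

transform = transforms.Compose([
    transforms.Resize((224, 224)),   # VGG-19 input size
    transforms.ToTensor(),
])

def load_clients(root="dataset", split=0.8, batch_size=64, seed=0):
    """Return {source: (train_loader, test_loader)} with an 80:20 split."""
    clients = {}
    gen = torch.Generator().manual_seed(seed)
    for source in SOURCES:
        data = datasets.ImageFolder(f"{root}/{source}", transform=transform)
        n_train = int(split * len(data))
        train_set, test_set = torch.utils.data.random_split(
            data, [n_train, len(data) - n_train], generator=gen)
        clients[source] = (
            torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True),
            torch.utils.data.DataLoader(test_set, batch_size=batch_size),
        )
    return clients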

5 Experiments

5.1 Implementation Detail

In this section, we describe the experiments we performed to evaluate our Federated image classification algorithms. We leverage the pretrained VGG-19 [40] network available from the PyTorch pretrained model library [34] for initialization purposes. We freeze its convolutional layers and replace the rest of the network with three new fully connected layers (of size 1024, 256, and 31) and a softmax layer. In the first two fully connected layers, we use ReLU activation and a dropout rate of 0.5. We use the SGD optimizer with a batch size of 64 and an 80:20 train-test split of the data at every client. The primary evaluation metric is classification accuracy, but we also report the macro F1 and weighted F1 scores. The default learning rate is 3e-4 (3×10^{-4}). The different iteration parameters used are given in Table 3.
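A minimal sketch of this local model in PyTorch follows: a frozen VGG-19 feature extractor with a new head of three fully connected layers (1024, 256, 31), ReLU and dropout of 0.5 after the first two, and an SGD optimizer on the trainable parameters; the exact layer arrangement is our reading of the description above rather than the released code.

import torch.nn as nn
import torch.optim as optim
from torchvision import models

def build_model(num_classes=31, lr=3e-4):
    # Pretrained VGG-19 (torchvision >= 0.13 weights API).
    model = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
    # Freeze the pretrained convolutional layers.
    for param in model.features.parameters():
        param.requires_grad = False
    # Replace the classifier with three new fully connected layers.
    model.classifier = nn.Sequential(
        nn.Linear(512 * 7 * 7, 1024), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(1024, 256), nn.ReLU(), nn.Dropout(0.5),
        nn.Linear(256, num_classes),  # softmax is applied inside CrossEntropyLoss
    )
    optimizer = optim.SGD(
        [p for p in model.parameters() if p.requires_grad], lr=lr)
    return model, optimizer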

Method Local Epochs (E) Period (P) Global Rounds (R)
FedAvg 3 - 250
RingFed 3 2 50
Fed-Cyclic 3 - 150
Fed-Star 3 2 50
Table 3: Iteration Parameters
γ Accuracy
0.2 89.22%
0.5 88.96%
0.8 89.65%
1.0 89.39%
Table 4: γ vs. accuracy for RingFed
Method Accuracy Weighted F1 Macro F1
FedAvg [30] 89.11% 88.98% 88.47%
RingFed [47] 89.65% 89.53% 89.14%
Fed-Cyclic (ours) 91.15% 90.89% 90.33%
Fed-Star (ours) 91.72% 91.17% 90.58%
Table 5: Experimental results show that both Fed-Star and Fed-Cyclic attain higher accuracy and F1-scores than FedAvg and RingFed. Here, red denotes the best value and blue denotes the second-best value.

5.2 Results

We evaluate all four algorithms (FedAvg [30], RingFed [47], Fed-Cyclic, Fed-Star) on our dataset. For RingFed, we set γ = 0.8, as this value gave the best accuracy for RingFed, as shown in Table 4.

We provide results of both global evaluation and local evaluation. While local test sets are used for local evaluation, their union is used for global evaluation. As we can see in Table 5, where we provide the global evaluation results, our two proposed methods, Fed-Cyclic and Fed-Star, perform better than FedAvg and RingFed. Fed-Star performs the best among the four, with an accuracy of 91.72%. Moreover, our methods converge very fast: Fed-Star requires only 50 global rounds on our dataset, as mentioned in Table 3. Although RingFed also requires the same number of global rounds, its accuracy is lower than Fed-Star's.

Dataset Model Test Accuracy
123rf Local Model 85.51%
FedAvg 83.79%
RingFed 84.47%
Fed-Cyclic (Ours) 86.38%
Fed-Star (Ours) 88.90%
Adobe Stock Local Model 92.43%
FedAvg 91.46%
RingFed 92.07%
Fed-Cyclic (Ours) 94.52%
Fed-Star (Ours) 94.96%
Alamy Local Model 86.84%
FedAvg 87.32%
RingFed 87.98%
Fed-Cyclic (Ours) 88.90%
Fed-Star (Ours) 90.14%
CanStockPhotos Local Model 89.53%
FedAvg 89.20%
RingFed 89.56%
Fed-Cyclic (Ours) 90.40%
Fed-Star (Ours) 91.68%
Depositphotos Local Model 98.07%
FedAvg 96.95%
RingFed 97.76%
Fed-Cyclic (Ours) 97.91%
Fed-Star (Ours) 98.33%
Getty Images Local Model 89.35%
FedAvg 90.13%
RingFed 91.06%
Fed-Cyclic (Ours) 92.72%
Fed-Star (Ours) 93.87%
iStock Local Model 85.71%
FedAvg 83.72%
RingFed 84.68%
Fed-Cyclic (Ours) 84.89%
Fed-Star (Ours) 86.47%
Shutterstock Local Model 89.45%
FedAvg 89.60%
RingFed 90.16%
Fed-Cyclic (Ours) 91.56%
Fed-Star (Ours) 92.11%
Table 6: On every source, Fed-Star outperforms the locally trained model (traditional ML), the baselines, and Fed-Cyclic.

In Table 6, where we provide the results of the local evaluation, we compare our methods with the competing Federated Learning methods and with the baseline of the respective locally trained model (using E = 250). As can be seen, our Fed-Star algorithm achieves the best results on all the clients. Our Fed-Cyclic algorithm comes second in 6 out of 8 cases, losing to the local model on Depositphotos and iStock. As far as FedAvg and RingFed are concerned, they lose to the local model on 5 and 4 clients, respectively. This suggests that FedAvg and RingFed do not help much from a personalization point of view.

Figure 5: The graph shows how the accuracy of FedAvg, RingFed, Fed-Cyclic, and Fed-Star changes with different learning rates (lr).

5.3 Learning Rate (η) Experiments

We also performed our experiments while varying the learning rate. For FedAvg, the maximum accuracy of 91.43% is obtained at a learning rate of 3e-3, while the default learning rate of 3e-4 gives the lowest accuracy of 89.11%. For RingFed, accuracy increases with the learning rate from 1e-3 to 7e-3 (91.81% to 92.17%), whereas the learning rate of 3e-4 gives only 89.65%. For the Fed-Cyclic algorithm, the maximum accuracy of 92.52% is obtained at a learning rate of 3e-3 and the minimum of 91.15% at 3e-4. Fed-Star attains its maximum accuracy of 92.77% at a learning rate of 1e-3, and its accuracy drops to 91.72% at 3e-4. The detailed results are captured in Figure 5 and Table 7.

Learning Rate (η) FedAvg [30] RingFed [47] Fed-Cyclic (Ours) Fed-Star (Ours)
1e-3 91.39% 91.81% 92.42% 92.77%
3e-3 91.43% 92.09% 92.52% 92.68%
7e-3 91.33% 92.17% 92.35% 92.53%
3e-4 89.11% 89.65% 91.15% 91.72%
Table 7: The table shows the accuracy value after convergence attained by FedAvg, RingFed, Fed-Cyclic and Fed-Star for different values of learning rates.

6 Conclusion

We proposed two Federated Learning algorithms, Fed-Cyclic and Fed-Star, along with a new federated image classification dataset collected from 8 commercial image sources, making the setup much closer to a real-world scenario than other image classification setups where an existing dataset itself is artificially divided. Our algorithms have better convergence and better accuracy than FedAvg and RingFed algorithms. Also, they perform much better from the personalization point of view, making them very relevant for meaningful collaboration amongst clients having statistical heterogeneity (domain shift).

References

  • [1] Mohammad Amiri-Zarandi, Rozita A Dara, and Evan Fraser. A survey of machine learning-based solutions to protect privacy in the internet of things. Computers & Security, 96:101921, 2020.
  • [2] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
  • [3] Christopher Briggs, Zhong Fan, and Peter Andras. Federated learning with hierarchical clustering of local updates to improve training on non-iid data. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–9. IEEE, 2020.
  • [4] Zheng Chai, Ahsan Ali, Syed Zawad, Stacey Truex, Ali Anwar, Nathalie Baracaldo, Yi Zhou, Heiko Ludwig, Feng Yan, and Yue Cheng. Tifl: A tier-based federated learning system. In Proceedings of the 29th International Symposium on High-Performance Parallel and Distributed Computing, pages 125–136, 2020.
  • [5] Cheng Chen, Ziyi Chen, Yi Zhou, and Bhavya Kailkhura. Fedcluster: Boosting the convergence of federated learning via cluster-cycling. In 2020 IEEE International Conference on Big Data (Big Data), pages 5017–5026. IEEE, 2020.
  • [6] Dawei Chen, Choong Seon Hong, Yiyong Zha, Yunfei Zhang, Xin Liu, and Zhu Han. Fedsvrg based communication efficient scheme for federated learning in mec networks. IEEE Transactions on Vehicular Technology, 70(7):7300–7304, 2021.
  • [7] Wei Chen, Kartikeya Bhardwaj, and Radu Marculescu. Fedmax: Mitigating activation divergence for accurate and communication-efficient federated learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 348–363. Springer, 2020.
  • [8] Emiliano De Cristofaro. An overview of privacy in machine learning. arXiv preprint arXiv:2005.08679, 2020.
  • [9] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(1), 2012.
  • [10] Yuyang Deng, Mohammad Mahdi Kamani, and Mehrdad Mahdavi. Adaptive personalized federated learning. arXiv preprint arXiv:2003.13461, 2020.
  • [11] Georgios Drainakis, Konstantinos V Katsaros, Panagiotis Pantazopoulos, Vasilis Sourlas, and Angelos Amditis. Federated vs. centralized machine learning under privacy-elastic users: A comparative analysis. In 2020 IEEE 19th International Symposium on Network Computing and Applications (NCA), pages 1–8. IEEE, 2020.
  • [12] Moming Duan, Duo Liu, Xianzhang Chen, Renping Liu, Yujuan Tan, and Liang Liang. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Transactions on Parallel and Distributed Systems, 32(1):59–71, 2020.
  • [13] Moming Duan, Duo Liu, Xinyuan Ji, Renping Liu, Liang Liang, Xianzhang Chen, and Yujuan Tan. Fedgroup: Efficient federated learning via decomposed similarity-based clustering. In 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), pages 228–237. IEEE, 2021.
  • [14] Avishek Ghosh, Jichan Chung, Dong Yin, and Kannan Ramchandran. An efficient framework for clustered federated learning. Advances in Neural Information Processing Systems, 33:19586–19597, 2020.
  • [15] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149, 2015.
  • [16] Tzu-Ming Harry Hsu, Hang Qi, and Matthew Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
  • [17] Li Huang, Andrew L Shea, Huining Qian, Aditya Masurkar, Hao Deng, and Dianbo Liu. Patient clustering improves efficiency of federated machine learning to predict mortality and hospital stay time using distributed electronic medical records. Journal of biomedical informatics, 99:103291, 2019.
  • [18] Shreyansh Jain and Koteswar Rao Jerripothula. Federated learning for commercial image sources. In 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6523–6532, 2023.
  • [19] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-iid private data. arXiv preprint arXiv:1811.11479, 2018.
  • [20] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2):1–210, 2021.
  • [21] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
  • [22] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank J Reddi, Sebastian U Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for on-device federated learning. 2019.
  • [23] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [24] Li Li, Yuxi Fan, Mike Tse, and Kuo-Yi Lin. A review of applications in federated learning. Computers & Industrial Engineering, 149:106854, 2020.
  • [25] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • [26] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
  • [27] Xianfeng Liang, Shuheng Shen, Jingchang Liu, Zhen Pan, Enhong Chen, and Yifei Cheng. Variance reduced local sgd with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
  • [28] Jiahuan Luo, Xueyang Wu, Yun Luo, Anbu Huang, Yunfeng Huang, Yang Liu, and Qiang Yang. Real-world image datasets for federated learning. arXiv preprint arXiv:1910.11089, 2019.
  • [29] Yishay Mansour, Mehryar Mohri, Jae Ro, and Ananda Theertha Suresh. Three approaches for personalization with applications to federated learning. arXiv preprint arXiv:2002.10619, 2020.
  • [30] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial intelligence and statistics, pages 1273–1282. PMLR, 2017.
  • [31] Jose G Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V Chawla, and Francisco Herrera. A unifying view on dataset shift in classification. Pattern recognition, 45(1):521–530, 2012.
  • [32] Dinh C Nguyen, Ming Ding, Pubudu N Pathirana, Aruna Seneviratne, Jun Li, and H Vincent Poor. Federated learning for internet of things: A comprehensive survey. IEEE Communications Surveys & Tutorials, 23(3):1622–1658, 2021.
  • [33] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael Wellman. Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814, 2016.
  • [34] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [35] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset shift in machine learning. Mit Press, 2008.
  • [36] Sashank Reddi, Zachary Charles, Manzil Zaheer, Zachary Garrett, Keith Rush, Jakub Konečnỳ, Sanjiv Kumar, and H Brendan McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
  • [37] Peter Richtárik and Martin Takáč. Distributed coordinate descent method for learning with big data. The Journal of Machine Learning Research, 17(1):2657–2681, 2016.
  • [38] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European conference on computer vision, pages 213–226. Springer, 2010.
  • [39] Felix Sattler, Klaus-Robert Müller, and Wojciech Samek. Clustered federated learning: Model-agnostic distributed multitask optimization under privacy constraints. IEEE transactions on neural networks and learning systems, 32(8):3710–3722, 2020.
  • [40] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [41] Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • [42] Stacey Truex, Nathalie Baracaldo, Ali Anwar, Thomas Steinke, Heiko Ludwig, Rui Zhang, and Yi Zhou. A hybrid approach to privacy-preserving federated learning. In Proceedings of the 12th ACM workshop on artificial intelligence and security, pages 1–11, 2019.
  • [43] Hao Wang, Zakhary Kaplan, Di Niu, and Baochun Li. Optimizing federated learning on non-iid data with reinforcement learning. In IEEE INFOCOM 2020-IEEE Conference on Computer Communications, pages 1698–1707. IEEE, 2020.
  • [44] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in neural information processing systems, 33:7611–7623, 2020.
  • [45] Qiong Wu, Xu Chen, Zhi Zhou, and Junshan Zhang. Fedhome: Cloud-edge based personalized federated learning for in-home health monitoring. IEEE Transactions on Mobile Computing, 2020.
  • [46] Ming Xie, Guodong Long, Tao Shen, Tianyi Zhou, Xianzhi Wang, Jing Jiang, and Chengqi Zhang. Multi-center federated learning. arXiv preprint arXiv:2005.01026, 2020.
  • [47] Guang Yang, Ke Mu, Chunhe Song, Zhijia Yang, and Tierui Gong. Ringfed: Reducing communication costs in federated learning on non-iid data. arXiv preprint arXiv:2107.08873, 2021.
  • [48] Sixin Zhang, Anna E Choromanska, and Yann LeCun. Deep learning with elastic averaging sgd. Advances in neural information processing systems, 28, 2015.
  • [49] Yue Zhao, Meng Li, Liangzhen Lai, Naveen Suda, Damon Civin, and Vikas Chandra. Federated learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.