On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

arXiv cs.LG Papers

Summary

This paper presents PushCen-ADFL, a communication-efficient asynchronous decentralized federated learning framework that uses centroid-based messaging and bias-correction to improve accuracy and reduce communication overhead under heterogeneous conditions.

arXiv:2605.26162v1 Announce Type: new Abstract: Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6\% while reducing per-push communication cost by more than 80\%, achieving a favorable accuracy-communication trade-off.
Original Article
View Cached Full Text

Cached at: 05/27/26, 09:04 AM

# On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach
Source: [https://arxiv.org/html/2605.26162](https://arxiv.org/html/2605.26162)
Jiahui BaiSchool of Computer Technologies, RMIT UniversityMelbourneVICAustralia[jiahui\.bai2@student\.rmit\.edu\.au](https://arxiv.org/html/2605.26162v1/mailto:[email protected])Hai DongSchool of Computer Technologies, RMIT UniversityMelbourneVICAustralia[hai\.dong@rmit\.edu\.au](https://arxiv.org/html/2605.26162v1/mailto:[email protected])andA\. K\. QinSchool of Science, Computing and Engineering Technologies, Swinburne University of TechnologyHawthornVICAustralia[kqin@swin\.edu\.au](https://arxiv.org/html/2605.26162v1/mailto:[email protected])

\(2026\)

###### Abstract\.

Asynchronous decentralized federated learning \(ADFL\) eliminates central coordination and global synchronization, making it attractive for large\-scale and heterogeneous systems\. However, frequent peer\-to\-peer communication, asynchronous updates on directed topologies, and non\-IID data jointly lead to excessive communication overhead, biased aggregation and severe model drift\. We propose PushCen\-ADFL, a communication\-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation\. PushCen\-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization\. Clients exchange centroid\-form messages, apply average\-preserving push\-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness\. A bounded, sender\-deduplicated buffer further improves robustness under irregular asynchronous arrivals\. Experiments on vision datasets demonstrate that PushCen\-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per\-push communication cost by more than 80%, achieving a favorable accuracy\-communication trade\-off\.

Federated learning, Decentralized optimization, Asynchronous communication, Distributed machine learning

††journalyear:2026††copyright:cc††conference:Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2; August 09–13, 2026; Jeju Island, Republic of Korea††booktitle:Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2 \(KDD ’26\), August 09–13, 2026, Jeju Island, Republic of Korea††doi:10\.1145/3770855\.3817925††isbn:979\-8\-4007\-2259\-2/2026/08††ccs:Computing methodologies Neural networks††ccs:Computing methodologies Regularization††ccs:Computing methodologies Massively parallel algorithms††ccs:Computing methodologies Self\-organization††ccs:Computing methodologies Mobile agents## 1\.Introduction

Federated Learning \(FL\) has emerged as a privacy\-preserving distributed machine learning paradigm that enables collaborative model training when data cannot be centrally shared\(McMahanet al\.,[2017](https://arxiv.org/html/2605.26162#bib.bib14); Liet al\.,[2018](https://arxiv.org/html/2605.26162#bib.bib15); Mughalet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib16)\)\. By performing training on local devices and exchanging only model updates, FL allows distributed data resources to be effectively utilized without direct access to raw data and has demonstrated significant potential in applications such as mobile intelligence, the Internet of Things and cross\-organizational collaboration\(Liet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib17); Liuet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib18)\)\.

However, most existing federated learning frameworks adopt a centralized and synchronous training paradigm, where a central server collects model updates from all clients and aggregates them in each training round\. While such designs perform well under ideal network conditions, they face multiple challenges in real\-world systems\. First, the central server introduces a single point of bottleneck and failure, which limits scalability due to constraints on communication bandwidth, computation capacity and reliability in large\-scale deployments\(Daiet al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib19)\)\. Second, synchronous training requires strict temporal alignment of clients at round boundaries, making the system vulnerable to device heterogeneity and system variability and leading to the well\-known straggler problem that significantly degrades training efficiency\(Langet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib20); Jianget al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib21)\)\. Finally, centralized architectures are often difficult to deploy in cross\-organizational or cross\-domain scenarios and conflict with the growing demand for trustless and autonomous systems\(Daiet al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib19); Yuanet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib22)\)\.

To overcome these limitations, Decentralized Federated Learning \(DFL\) has attracted increasing attention\. In DFL, clients exchange model information only with their neighboring nodes without relying on a central server, thereby improving system robustness and scalability\(Sunet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib23); Yuanet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib22)\)\. Nevertheless, most existing decentralized approaches still rely on synchronous update mechanisms, assuming that clients remain aligned across logical training rounds, an assumption that is difficult to satisfy in dynamic network environments\(Bornsteinet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib1); Liuet al\.,[2024a](https://arxiv.org/html/2605.26162#bib.bib24)\)\.

Consequently, Asynchronous Decentralized Federated Learning \(ADFL\) has been recognized as a more practical training paradigm for real\-world deployments\(Liuet al\.,[2024a](https://arxiv.org/html/2605.26162#bib.bib24); Dhasadeet al\.,[2025](https://arxiv.org/html/2605.26162#bib.bib25)\)\. ADFL allows clients to perform local updates and communications at different times without global synchronization, while supporting model collaboration solely through neighbor\-to\-neighbor communication\(Liuet al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib26); Jeong and Kountouris,[2025](https://arxiv.org/html/2605.26162#bib.bib27)\)\. This setting naturally accommodates device heterogeneity, network variability and partial participation and enables clients to join the training process at arbitrary times\(Rivero\-Angeleset al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib29); Ameuret al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib28)\)\. As a result, ADFL is well suited for edge computing, peer\-to\-peer networks and large\-scale distributed systems, offering significant advantages in terms of system throughput, fault tolerance and deployment flexibility\.

Despite these advantages, such a highly flexible training paradigm also introduces new challenges\. Specifically, ADFL systems often face three fundamental difficulties simultaneously\. First, since clients communicate only with their neighbors and model exchanges may occur frequently, directly transmitting full model parameters incurs prohibitive communication overhead, which is difficult to sustain in bandwidth\-constrained or large\-scale systems\(McMahanet al\.,[2017](https://arxiv.org/html/2605.26162#bib.bib14); Chenet al\.,[2021](https://arxiv.org/html/2605.26162#bib.bib30)\)\. Compared to centralized settings, decentralized architectures lack unified communication scheduling and aggregation, making communication efficiency a more prominent concern\(Yuanet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib22); Lalithaet al\.,[2018](https://arxiv.org/html/2605.26162#bib.bib31)\)\. Second, asynchronous updates combined with asymmetric communication topologies can lead to biased aggregation\(Kempeet al\.,[2003](https://arxiv.org/html/2605.26162#bib.bib32); Assranet al\.,[2019](https://arxiv.org/html/2605.26162#bib.bib33)\)\. In practice, clients may push models at different times and communication links can be unbalanced, under which naive neighbor averaging fails to guarantee unbiased consensus\(Jeong and Kountouris,[2025](https://arxiv.org/html/2605.26162#bib.bib27); Franceschelliet al\.,[2009](https://arxiv.org/html/2605.26162#bib.bib34)\)\. Finally, the widely observed statistical heterogeneity \(non\-IID data\) across clients is further amplified in asynchronous decentralized environments\(Liuet al\.,[2024a](https://arxiv.org/html/2605.26162#bib.bib24),[2022](https://arxiv.org/html/2605.26162#bib.bib26); Maet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib3)\)\. Local updates computed on non\-identically distributed data tend to induce significant model drift, while asynchronous communication may delay or mismatch these updates, exacerbating training instability and degrading final performance\(Wanget al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib35); Liuet al\.,[2024b](https://arxiv.org/html/2605.26162#bib.bib36)\)\.

These challenges are not independent\. Communication constraints necessitate compression\. Compression can amplify client drift under non\-IID data\. Asynchrony further delays and skews these updates\. Together, these effects exacerbate aggregation bias in decentralized training\. Accordingly, the central research question addressed in this work is:

*How can we jointly design compression, de\-biased aggregation, and local\-update stabilization for accurate and stable ADFL under asymmetric communication and non\-IID data?*

To address this research question, we propose a new ADFL framework, termed PushCen\-ADFL \(Push\-Sum Centroid Asynchronous Decentralized Federated Learning\)\. The proposed framework is built around lightweight centroid structures and integrates centroid\-based compression with push\-sum aggregation, where both aggregation and local optimization are constrained in the same centroid representation space, enabling communication\-efficient and stable federated training without relying on a central server or global synchronization\.

Our main contributions are summarized as follows:

- •We propose PushCen\-ADFL, an asynchronous framework that unifies centroid\-based communication, push\-sum de\-biasing and buffered neighbor aggregation into a single decentralized training pipeline, where both aggregation and local optimization operate in the same centroid representation space\.
- •We design an average\-preserving push\-sum aggregation with mass splitting and a deduplicated bounded buffer, mitigating bias from asymmetric information flow and preventing stale messages from dominating aggregation\.
- •We develop a centroid\-aligned proximal regularizer that stabilizes local updates in the same compressed centroid space used for communication\. This alignment mitigates non\-IID data distributions by contracting local trajectories toward a shared compressed reference\.
- •We provide a convergence analysis for PushCen\-ADFL and validate it on CIFAR\-10, CIFAR\-100 and Tiny\-ImageNet, improving accuracy by up to 6% over communication\-efficient baselines, while reducing per\-push transmitted payload by more than 80% less than full\-model communication\.

## 2\.Related Work

### 2\.1\.Asynchronous Decentralized FL

A representative direction is to design*wait\-free*or fully asynchronous communication mechanisms to mitigate stragglers\. SWIFT proposes a wait\-free decentralized FL algorithm that allows each client to proceed without synchronization barriers, while maintaining standard iteration complexity guarantees\(Bornsteinet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib1)\)\. Complementarily, A2CiD2studies asynchronous decentralized deep learning from an optimization perspective and introduces a continuous local momentum to accelerate randomized gossip\-style communication\(Nabliet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib2)\)\. Beyond communication primitives, a second line of work explicitly addresses the*staleness*and*heterogeneity*issues that arise in ADFL\. Ma*et al\.*propose dynamic staleness control for ADFL in decentralized topologies, aiming to balance training efficiency with model quality by regulating the tolerated staleness degree and deriving scheduling/control policies with theoretical support\(Maet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib3)\)\. For heterogeneous devices, Liao*et al\.*develop AsyDFL, which integrates neighbor selection and gradient pushing to reduce communication cost and completion time under non\-IID data and system heterogeneity\(Liaoet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib4)\)\. More recently, DSpodFL provides a unified viewpoint by modeling both local updates and pairwise exchanges as sporadic random events, thereby capturing time\-varying computation/communication patterns\(Zehtabiet al\.,[2025](https://arxiv.org/html/2605.26162#bib.bib5)\)\. In addition, AEDFL proposes an ADFL framework that leverages reinforcement learning for neighbor model selection in heterogeneous, serverless training settings\(Liuet al\.,[2024a](https://arxiv.org/html/2605.26162#bib.bib24)\)\. Despite these advances, most existing ADFL methods focus on asynchronous scheduling or staleness control while assuming full\-precision communication, and thus fail to jointly address frequent peer\-to\-peer exchanges, aggregation bias on asymmetric communication, and model drift under non\-IID data, limiting their practicality in communication\-constrained decentralized systems\.

### 2\.2\.Communication\-Efficient Federated Learning

A prominent line of research leverages*knowledge distillation*to reduce the communication payload\. FedKD adopts a local teacher\-student scheme and transmits only a compact student \(or distilled knowledge\), substantially lowering uplink/downlink costs while preserving accuracy via distillation\(Wuet al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib6)\)\. Another direction exploits*sparsity*to shrink communicated updates: SpaFL induces structured sparsity with low overhead through trainable thresholds, reducing both communication and computation\(Kimet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib7)\)\. In addition,*low\-precision*methods cut the bit\-width of local training and transmitted updates; Li*et al\.*show that low\-precision local training can already achieve competitive accuracy in FL\(Liet al\.,[2024](https://arxiv.org/html/2605.26162#bib.bib8)\)\. Communication efficiency has also been explored under asynchronous decentralized settings: DivShare exchanges model fragments per interaction to better tolerate stragglers and bandwidth heterogeneity, improving wall\-clock convergence in peer\-to\-peer training\(Biswaset al\.,[2025](https://arxiv.org/html/2605.26162#bib.bib9)\)\. Overall, prior communication\-efficient FL methods reduce communication cost but are not designed to handle the coupled challenges of aggregation bias and training instability in fully asynchronous decentralized environments\.

## 3\.Problem Formulation

We consider an ADFL system withNNclients, denoted by𝒱=\{1,2,…,N\}\\mathcal\{V\}=\\\{1,2,\\dots,N\\\}\. Client communications are modeled by a directed graph𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\)induced by asynchronous gossip, where\(j→i\)∈ℰ\(j\\rightarrow i\)\\in\\mathcal\{E\}indicates that clientiican receive information from clientjj\. The communication topology can be asymmetric and unbalanced and there is no central server; clients communicate only with their neighbors\. We denote the in\-neighbors and out\-neighbors of clientiias𝒩i−\\mathcal\{N\}\_\{i\}^\{\-\}and𝒩i\+\\mathcal\{N\}\_\{i\}^\{\+\}, respectively\.

Each clientiiholds a private local dataset𝒟i\\mathcal\{D\}\_\{i\}with potentially heterogeneous \(non\-IID\) data distributions, and maintains a local model copywiw\_\{i\}\. Letfi​\(w\)f\_\{i\}\(w\)denote the empirical risk on𝒟i\\mathcal\{D\}\_\{i\}\. We adopt a*uniform client\-level objective*and formulate decentralized learning as a consensus optimization problem:

min\{wi\}i=1N⁡1N​∑i=1Nfi​\(wi\)s\.t\.wi=wj,∀\(j→i\)∈ℰ\.\\min\_\{\\\{w\_\{i\}\\\}\_\{i=1\}^\{N\}\}\\ \\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}f\_\{i\}\(w\_\{i\}\)\\quad\\text\{s\.t\.\}\\quad w\_\{i\}=w\_\{j\},\\ \\forall\(j\\rightarrow i\)\\in\\mathcal\{E\}\.
where the constraintswi=wjw\_\{i\}=w\_\{j\}specify the desired agreement among local models at convergence, while local models may differ during asynchronous training\. The system operates in a fully asynchronous manner without global synchronization\. Clients perform local computation and communication at arbitrary times and received messages may correspond to stale model states due to heterogeneous computation speeds and network delays\. Moreover, clients may become active at arbitrary times, and only active clients generate updates and participate in communication\.

Under directed and potentially unbalanced topologies, achieving the above consensus objective requires decentralized aggregation mechanisms that preserve the uniform client\-level average\. Accordingly, the problem addressed in this work is to design a communication\-efficient ADFL framework that enables stable and robust training under asymmetric communication and non\-IID data distributions\.

## 4\.Push\-Sum Centroid Asynchronous Decentralized Federated Learning

### 4\.1\.Overview of PushCen\-ADFL

![Refer to caption](https://arxiv.org/html/2605.26162v1/images/framework_diagram2.png)

Workflow of PushCen\-ADFL: Push\-Sum Centroid Asynchronous Decentralized Federated Learning

Figure 1\.Workflow of PushCen\-ADFL: Push\-Sum Centroid Asynchronous Decentralized Federated LearningThe overall workflow of PushCen\-ADFL is illustrated in Figure[1](https://arxiv.org/html/2605.26162#S4.F1)and summarized in Algorithm[1](https://arxiv.org/html/2605.26162#alg1)\.

Clients exchange information only with their neighbors over a peer\-to\-peer communication, without relying on a central coordinator or round\-based synchronization\.

PushCen\-ADFL is built on three coupled principles\. First, each push transmits a centroid message\(Vi,Ai\)\(V\_\{i\},A\_\{i\}\)instead of dense parameters, so the communication payload is controlled by the centroid budget\. Second, aggregation follows a mass split push\-sum update, using the mass push\-sumsis\_\{i\}to correct bias and preserve the averaging under directed asynchronous mixing\. Third, local optimization is stabilized by a centroid proximal term anchored at the current dictionary estimateGiG\_\{i\}, so that both communication and stabilization operate in the same centroid representation space\. In addition, a deduplicated bounded buffer retains only the most recent message per sender, preventing stale or repeated messages from dominating the event\-driven aggregation\.

Concretely, each clientiimaintains a local modelwiw\_\{i\}, a push\-sum masssis\_\{i\}that tracks the mixing weight for de\-biased aggregation, and a local centroid dictionary estimateGiG\_\{i\}that defines the shared compressed representation space, together with a bounded bufferℬi\\mathcal\{B\}\_\{i\}that stores recent neighbor messages \(line 1\)\. When a neighbor message\(Vj,Aj,σj→i,j\)\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\)arrives, clientiiupdatesℬi\\mathcal\{B\}\_\{i\}viaBufferUpdate\(lines 3–5\), which replaces older entries from the same sender and enforces a fixed buffer capacity; see Section[4\.5](https://arxiv.org/html/2605.26162#S4.SS5)\. Upon a local compute event, clientiiaggregates buffered information usingPushSumAgg\(line 6\), which jointly updates\(wi,Gi,si\)\(w\_\{i\},G\_\{i\},s\_\{i\}\)in an average\-preserving manner under directed and potentially unbalanced communication; see Section[4\.4](https://arxiv.org/html/2605.26162#S4.SS4)\. Clientiithen performsLocalUpdateon𝒟i\\mathcal\{D\}\_\{i\}\(line 7\)\. For communication, the model is encoded in a centroid form\(Vi,Ai\)\(V\_\{i\},A\_\{i\}\)\(Section[4\.3](https://arxiv.org/html/2605.26162#S4.SS3)\); for optimization, the update is anchored to a centroid\-based reference constructed fromGiG\_\{i\}, yielding a centroid\-aligned proximal effect that contracts local trajectories and mitigates non\-IID drift; see Section[4\.2](https://arxiv.org/html/2605.26162#S4.SS2)\. Finally, clientiisplits its push\-sum mass \(line 8\) and sends\(Vi,Ai,σi,i\)\(V\_\{i\},A\_\{i\},\\sigma\_\{i\},i\)to all out\-neighbors \(line 9\), ensuring that subsequent asynchronous aggregations remain average\-preserving\.

Algorithm 1PushCen\-ADFL: Asynchronous Decentralized Training at Clientii1:Out\-neighbors

𝒩i\+\\mathcal\{N\}\_\{i\}^\{\+\}; local data

𝒟i\\mathcal\{D\}\_\{i\}; \#clusters

KK; local epochs

EE; learning rate

η\\eta; reg weight

λ\\lambda; total events

TT; buffer limit

LL\.

2:Init:

wiw\_\{i\};

si←1s\_\{i\}\\leftarrow 1;

Gi←∅G\_\{i\}\\leftarrow\\emptyset;

ℬi←∅\\mathcal\{B\}\_\{i\}\\leftarrow\\emptyset
3:for

t=1t=1to

TTdo⊳\\trianglerightevent\-driven; no global synchronization

4:Upon receiving

\(Vj,Aj,σj→i,j\)\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\):

5:BufferUpdate\(

ℬi,\(Vj,Aj,σj→i,j\),L\\mathcal\{B\}\_\{i\},\\ \(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\),\\ L\)⊳\\trianglerightAlg\.[5](https://arxiv.org/html/2605.26162#alg5)

6:Upon a local compute event:

7:PushSumAgg\(

wi,Gi,si,ℬiw\_\{i\},G\_\{i\},s\_\{i\},\\mathcal\{B\}\_\{i\}\)⊳\\trianglerightAlg\.[4](https://arxiv.org/html/2605.26162#alg4)

8:

\(wi,Vi,Ai\)←\(w\_\{i\},V\_\{i\},A\_\{i\}\)\\leftarrowLocalUpdate\(

wi,𝒟i,Gi,K,E,λw\_\{i\},\\mathcal\{D\}\_\{i\},G\_\{i\},K,E,\\lambda\)⊳\\trianglerightAlg\.[2](https://arxiv.org/html/2605.26162#alg2)

9:

σi←si/\(\|𝒩i\+\|\+1\)\\sigma\_\{i\}\\leftarrow s\_\{i\}/\(\|\\mathcal\{N\}\_\{i\}^\{\+\}\|\+1\);

si←σis\_\{i\}\\leftarrow\\sigma\_\{i\}⊳\\trianglerightmass splitting

10:Send

\(Vi,Ai,σi,i\)\(V\_\{i\},A\_\{i\},\\sigma\_\{i\},i\)to all

k∈𝒩i\+k\\in\\mathcal\{N\}\_\{i\}^\{\+\}
11:endfor

Algorithm 2LocalUpdate at Clientii1:Model

wiw\_\{i\}, data

𝒟i\\mathcal\{D\}\_\{i\}, centroid dictionary

GiG\_\{i\}, \#clusters

KK, epochs

EE, reg

λ\\lambda\.

2:

\(Vi,Ai,Mi\)←\(V\_\{i\},A\_\{i\},M\_\{i\}\)\\leftarrowWCP\(

wi,K,Giw\_\{i\},K,G\_\{i\}\)⊳\\trianglerightinitialize assignments

3:Build centroid anchor

w~i←Anchor​\(Gi,Ai\)\\tilde\{w\}\_\{i\}\\leftarrow\\mathrm\{Anchor\}\(G\_\{i\},A\_\{i\}\)
4:for

e=1e=1to

EEdo

5:Enforce pruning:

wi←wi⊙Miw\_\{i\}\\leftarrow w\_\{i\}\\odot M\_\{i\}
6:Take SGD steps on

fi​\(wi;𝒟i\)\+λ​‖wi−w~i‖22f\_\{i\}\(w\_\{i\};\\mathcal\{D\}\_\{i\}\)\+\\lambda\\\|w\_\{i\}\-\\tilde\{w\}\_\{i\}\\\|\_\{2\}^\{2\}
7:endfor

8:

\(Vi,Ai,Mi\)←\(V\_\{i\},A\_\{i\},M\_\{i\}\)\\leftarrowWCP\(

wi,K,Giw\_\{i\},K,G\_\{i\}\)⊳\\trianglerightrefresh for communication

9:return

\(wi,Vi,Ai\)\(w\_\{i\},V\_\{i\},A\_\{i\}\)

### 4\.2\.Centroid Regularization

![Refer to caption](https://arxiv.org/html/2605.26162v1/x1.png)Illustration of Centroid Regularization

Figure 2\.Illustration of Centroid RegularizationNon\-IID data can cause client drift, which is further amplified by asynchronous execution\. To mitigate this drift, PushCen\-ADFL introduces a lightweight centroid regularization in the local update step \(Algorithm[2](https://arxiv.org/html/2605.26162#alg2)\)\. Clientiifirst applies WCP to initialize the assignmentsAiA\_\{i\}and pruning maskMiM\_\{i\}\(line 1\), and then constructs a centroid anchor from its current centroid dictionary estimateGiG\_\{i\}\(line 2\), whereAiA\_\{i\}specifies how centroids are indexed to form a layer\-wise reference\. We instantiate the centroid anchorw~i\\tilde\{w\}\_\{i\}by decoding the assignment mapAiA\_\{i\}with the centroid dictionaryGiG\_\{i\}through an element\-wise lookup:

\(1\)w~i,l≜Gi,l​\[Ai,l\],\\tilde\{w\}\_\{i,l\}\\triangleq G\_\{i,l\}\[A\_\{i,l\}\],where each entry ofw~i,l\\tilde\{w\}\_\{i,l\}takes the centroid value indexed by the corresponding entry inAi,lA\_\{i,l\}, yielding the same shape aswi,lw\_\{i,l\}\.

Clientiithen performs local optimization with a proximal\-style regularizer while enforcing pruning:

\(2\)minwi⁡fi​\(wi;𝒟i\)\+λ​‖wi−w~i‖22,\\min\_\{w\_\{i\}\}\\ f\_\{i\}\(w\_\{i\};\\mathcal\{D\}\_\{i\}\)\+\\lambda\\\|w\_\{i\}\-\\tilde\{w\}\_\{i\}\\\|\_\{2\}^\{2\},whereMiM\_\{i\}is a binary mask and masked coordinates are kept inactive viawi←wi⊙Miw\_\{i\}\\leftarrow w\_\{i\}\\odot M\_\{i\}\(lines 3–5\)\. Here,λ\\lambdacontrols the regularization strength andw~i\\tilde\{w\}\_\{i\}is fixed during the local update\. Unlike conventional proximal terms that pull toward a global model,w~i\\tilde\{w\}\_\{i\}is decoded from the current centroid dictionary, keepingwiw\_\{i\}close to a centroid\-structured reference and reducing re\-encoding distortion before communication\. After local training, clientiiapplies Weight Clustering Pruning \(WCP\) to refresh the centroid representation\(Vi,Ai\)\(V\_\{i\},A\_\{i\}\)for subsequent communication \(line 6\) and returns\(wi,Vi,Ai\)\(w\_\{i\},V\_\{i\},A\_\{i\}\)\(line 7\)\.

The motivation is to stabilize asynchronous local optimization without decoupling it from centroid\-based compression\. In fully asynchronous decentralized training, a client can take many local steps before receiving fresh information; under non\-IID data, this can movewiw\_\{i\}into regions poorly represented by the current centroid dictionary, leading to unstable trajectories and larger distortion when the model is re\-encoded\. The proximal termλ​‖wi−w~i‖22\\lambda\\\|w\_\{i\}\-\\tilde\{w\}\_\{i\}\\\|\_\{2\}^\{2\}provides a local contraction towardw~i\\tilde\{w\}\_\{i\}, curbing drift while keeping the iterate close to the dictionary\-induced quantized manifold\. Figure[2](https://arxiv.org/html/2605.26162#S4.F2)provides a visualization of centroid regularization\. Becausew~i\\tilde\{w\}\_\{i\}is constructed from\(Gi,Ai\)\(G\_\{i\},A\_\{i\}\), the regularizer directly ties the optimization path to the centroid structure used for communication, yielding more stable local updates under asynchrony\.

### 4\.3\.Centroid\-based Model Compression

Algorithm 3WCP: Weight Clustering Pruning1:Model

wiw\_\{i\}; \#clusters

KK; optional init centroids

GiG\_\{i\}; max iter

TmaxT\_\{\\max\}\.

2:

Vi←∅;Ai←∅;Mi←𝟏V\_\{i\}\\leftarrow\\emptyset;\\ A\_\{i\}\\leftarrow\\emptyset;\\ M\_\{i\}\\leftarrow\\mathbf\{1\}
3:foreach weight layer

llin

wiw\_\{i\}do

4:

θ←vec​\(wi,l\)\\theta\\leftarrow\\mathrm\{vec\}\(w\_\{i,l\}\)
5:if

Gi,lG\_\{i,l\}availablethen

6:

μ←\[0;Gi,l\[1:K−1\]\]\\mu\\leftarrow\[\\,0;\\ G\_\{i,l\}\[1\{:\}K\\\!\-\\\!1\]\\,\]
7:else

8:

μ←\[0;Rand\(θ\)\[1:K−1\]\]\\mu\\leftarrow\[\\,0;\\ \\mathrm\{Rand\}\(\\theta\)\[1\{:\}K\\\!\-\\\!1\]\\,\]
9:endif

10:for

t=1t=1to

TmaxT\_\{\\max\}orconvergeddo

11:

a←arg⁡minj∈\[K\]⁡‖θ−μj‖2a\\leftarrow\\arg\\min\_\{j\\in\[K\]\}\\\|\\theta\-\\mu\_\{j\}\\\|\_\{2\}⊳\\trianglerightelement\-wise

12:for

j=1j=1to

K−1K\-1do

13:

μj←mean​\(θ​\[a=j\]\)\\mu\_\{j\}\\leftarrow\\mathrm\{mean\}\(\\theta\[a=j\]\)
14:endfor

15:

μ0←0\\mu\_\{0\}\\leftarrow 0
16:endfor

17:

\(Vi,l,Ai,l\)←SortRemap​\(μ,a\)\(V\_\{i,l\},A\_\{i,l\}\)\\leftarrow\\mathrm\{SortRemap\}\(\\mu,a\)
18:

wi,l←Vi,l​\[Ai,l\]w\_\{i,l\}\\leftarrow V\_\{i,l\}\[A\_\{i,l\}\];

Mi,l←𝕀​\(wi,l≠0\)M\_\{i,l\}\\leftarrow\\mathbb\{I\}\(w\_\{i,l\}\\neq 0\)
19:endfor

20:return

\(Vi,Ai,Mi\)\(V\_\{i\},A\_\{i\},M\_\{i\}\)

PushCen\-ADFL compresses communication by encoding each weight layer in a centroid form rather than transmitting dense parameters\. For layerll, clientiivectorizes weightsθ=vec​\(wi,l\)∈ℝdl\\theta=\\mathrm\{vec\}\(w\_\{i,l\}\)\\in\\mathbb\{R\}^\{d\_\{l\}\}and clusters its entries intoKKscalar centroids, producing a centroid tableVi,l∈ℝKV\_\{i,l\}\\in\\mathbb\{R\}^\{K\}and an assignment mapAi,l∈\{1,…,K\}dlA\_\{i,l\}\\in\\\{1,\\dots,K\\\}^\{d\_\{l\}\}\. Reconstruction uses an element\-wise lookupwi,l≈Vi,l​\[Ai,l\]w\_\{i,l\}\\approx V\_\{i,l\}\[A\_\{i,l\}\], so the payload scales withKK\(typicallyK≪dlK\\ll d\_\{l\}\) rather thandld\_\{l\}\.

Algorithm[3](https://arxiv.org/html/2605.26162#alg3)implements this encoding via WCP\. When available, WCP warm\-starts centroids from the local dictionary estimateGi,l∈ℝKG\_\{i,l\}\\in\\mathbb\{R\}^\{K\}by settingμ=\[0;Gi,l\[1:K−1\]\]\\mu=\[0;\\,G\_\{i,l\}\[1\{:\}K\\\!\-\\\!1\]\], fixing one centroid at0and refining the others by alternating nearest\-centroid assignment and cluster\-mean updates\. It returns\(Vi,l,Ai,l\)\(V\_\{i,l\},A\_\{i,l\}\)and a binary pruning maskMi,lM\_\{i,l\}, whereSortRemapreorders centroid indices into a canonical order and remapsAi,lA\_\{i,l\}accordingly to ensure consistent encoding\. Full cost details are deferred to Appendix[B](https://arxiv.org/html/2605.26162#A2)\.

### 4\.4\.Asynchronous Push\-Sum Aggregation

Algorithm 4PushSumAgg1:Local

\(wi,Gi,si\)\(w\_\{i\},G\_\{i\},s\_\{i\}\); buffer

ℬi=\{\(Vj,Aj,σj→i,j\)\}\\mathcal\{B\}\_\{i\}=\\\{\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\)\\\}\.

2:if

ℬi=∅\\mathcal\{B\}\_\{i\}=\\emptysetthen

3:return

4:endif

5:For each buffered message, let

w^j←ℛ​\(Vj,Aj\)\\hat\{w\}\_\{j\}\\leftarrow\\mathcal\{R\}\(V\_\{j\},A\_\{j\}\)⊳\\trianglerightimplicit reconstruction

6:Compute total mass

SSby Eq\. \([3](https://arxiv.org/html/2605.26162#S4.E3)\)

7:Update model

wiw\_\{i\}by Eq\. \([4](https://arxiv.org/html/2605.26162#S4.E4)\) using

\{w^j\}\\\{\\hat\{w\}\_\{j\}\\\}
8:Update centroid dictionary

GiG\_\{i\}by Eq\. \([5](https://arxiv.org/html/2605.26162#S4.E5)\) using

\{Vj\}\\\{V\_\{j\}\\\}
9:

si←Ss\_\{i\}\\leftarrow S; clear

ℬi\\mathcal\{B\}\_\{i\}

Asynchronous decentralized training is inherently imbalanced: fast clients \(due to higher compute speed or lower latency\) can generate and disseminate updates more frequently, which may cause their local trajectories to dominate aggregation when using naive neighbor averaging\. This issue is compounded on directed or unbalanced graphs, where information flow is asymmetric\. PushCen\-ADFL addresses both effects with an average\-preserving push\-sum style aggregation that tracks a scalar mass and uses it to normalize heterogeneous message rates and topology asymmetry\.

Each clientiimaintains a local modelwiw\_\{i\}, a push\-sum masssi\>0s\_\{i\}\>0, a local estimate of the centroid dictionaryGiG\_\{i\}and a bufferℬi\\mathcal\{B\}\_\{i\}that stores neighbor messages received\. A buffered message from neighborjjis denoted by\(Vj,Aj,σj→i,j\)\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\), where\(Vj,Aj\)\(V\_\{j\},A\_\{j\}\)is the centroid representation andσj→i\\sigma\_\{j\\rightarrow i\}is the mass share sent toii\. Upon aggregation, clientiiimplicitly reconstructs a quantized neighbor modelw^j=ℛ​\(Vj,Aj\)\\hat\{w\}\_\{j\}=\\mathcal\{R\}\(V\_\{j\},A\_\{j\}\)\(Algorithm[4](https://arxiv.org/html/2605.26162#alg4), line 4\)\. It then computes the total mass

\(3\)S←si\+∑\(Vj,Aj,σj→i,j\)∈ℬiσj→i,\\displaystyle S\\leftarrow s\_\{i\}\+\\sum\_\{\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\)\\in\\mathcal\{B\}\_\{i\}\}\\sigma\_\{j\\rightarrow i\},which serves as the normalization constant for mass\-conserving averaging\. Clientiiupdates its model via

\(4\)wi←siS​wi\+∑\(Vj,Aj,σj→i,j\)∈ℬiσj→iS​w^j,\\displaystyle w\_\{i\}\\leftarrow\\frac\{s\_\{i\}\}\{S\}w\_\{i\}\+\\sum\_\{\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\)\\in\\mathcal\{B\}\_\{i\}\}\\frac\{\\sigma\_\{j\\rightarrow i\}\}\{S\}\\,\\hat\{w\}\_\{j\},
so that a sender’s influence is determined by conserved mass rather than raw message frequency\. The same weights are applied to update the centroid dictionary estimate using the received centroid tables\.

\(5\)Gi←siS​Gi\+∑\(Vj,Aj,σj→i,j\)∈ℬiσj→iS​Vj\.\\displaystyle G\_\{i\}\\leftarrow\\frac\{s\_\{i\}\}\{S\}G\_\{i\}\+\\sum\_\{\(V\_\{j\},A\_\{j\},\\sigma\_\{j\\rightarrow i\},j\)\\in\\mathcal\{B\}\_\{i\}\}\\frac\{\\sigma\_\{j\\rightarrow i\}\}\{S\}\\,V\_\{j\}\.aligning the shared centroid structure with the aggregated model trajectory\. After aggregation, clientiisetssi←Ss\_\{i\}\\leftarrow Sand clearsℬi\\mathcal\{B\}\_\{i\}\.

After local training, clientiiperforms mass splitting before communication\. Letdi\+=\|𝒩i\+\|d\_\{i\}^\{\+\}=\|\\mathcal\{N\}\_\{i\}^\{\+\}\|be the out\-degree; clientiisendsσi=si/\(di\+\+1\)\\sigma\_\{i\}=s\_\{i\}/\(d\_\{i\}^\{\+\}\+1\)to each out\-neighbor and updatessi←σis\_\{i\}\\leftarrow\\sigma\_\{i\}\. This mass\-splitting rule prevents fast clients from dominating the network dynamics under asynchrony\.

### 4\.5\.Buffered Neighbor Updates

Algorithm 5BufferUpdate \(Deduplicate by Sender \+ Bounded FIFO\)1:Buffer

ℬi\\mathcal\{B\}\_\{i\}; incoming

\(wj,Vj,σj→i,j\)\(w\_\{j\},V\_\{j\},\\sigma\_\{j\\rightarrow i\},j\); limit

LL\.

2:Remove the existing entry from sender

jjin

ℬi\\mathcal\{B\}\_\{i\}\(if any\)

3:Append

\(wj,Vj,σj→i,j\)\(w\_\{j\},V\_\{j\},\\sigma\_\{j\\rightarrow i\},j\)to

ℬi\\mathcal\{B\}\_\{i\}
4:if

L\>0L\>0and

\|ℬi\|\>L\|\\mathcal\{B\}\_\{i\}\|\>Lthen

5:Drop the oldest

\|ℬi\|−L\|\\mathcal\{B\}\_\{i\}\|\-Lentries from

ℬi\\mathcal\{B\}\_\{i\}
6:endif

In a fully asynchronous decentralized setting, message arrivals are decoupled from local compute events\. A client may receive neighbor messages at arbitrary times, possibly in bursts, and some messages can be stale by the time aggregation is triggered\. PushCen\-ADFL therefore maintains a local bufferℬi\\mathcal\{B\}\_\{i\}at each clientiito cache received neighbor information until the next compute event, when aggregation is performed\.

The buffer is designed to be sender\-deduplicated and capacity\-bounded\. When a new message from neighborjjarrives, clientiiremoves any existing buffered entry fromjjand keeps only the latest one\. This enforces the principle that each sender contributes at most one message per aggregation event, preventing bursty arrivals from being counted multiple times and distorting the intended neighbor weighting, which per\-message push\-sum normalization alone does not prevent\.

Empirically, removing sender deduplication consistently degrades performance \(Section[5\.2\.4](https://arxiv.org/html/2605.26162#S5.SS2.SSS4)\), indicating that discarding older messages from the same neighbor is beneficial in practice\. Finally, to control memory and limit the influence of long\-delayed messages under high communication rates, we impose a buffer size limit and drop the oldest entries upon overflow\. Together, these rules provide a lightweight interface between asynchronous communication and push\-sum aggregation, as summarized in Algorithm[5](https://arxiv.org/html/2605.26162#alg5)\.

### 4\.6\.Convergence Analysis of PushCen\-ADFL

#### 4\.6\.1\.Assumptions

We summarize the assumptions used in our convergence analysis; full definitions, update rules and proofs are deferred to Appendix[C](https://arxiv.org/html/2605.26162#A3)\.

##### Assumption 1: Smoothness and lower boundedness\(Liet al\.,[2020](https://arxiv.org/html/2605.26162#bib.bib40); Karimireddyet al\.,[2020](https://arxiv.org/html/2605.26162#bib.bib41)\)\.

Eachfif\_\{i\}isLL\-smooth andF​\(w\)≜1N​∑i=1Nfi​\(w\)F\(w\)\\triangleq\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}f\_\{i\}\(w\)is lower bounded byF⋆F^\{\\star\}\.

##### Assumption 2: Stochastic gradients\(Liet al\.,[2020](https://arxiv.org/html/2605.26162#bib.bib40); Karimireddyet al\.,[2020](https://arxiv.org/html/2605.26162#bib.bib41)\)\.

𝔼​\[gi​\(w;ξ\)∣w\]=∇fi​\(w\)\\mathbb\{E\}\[g\_\{i\}\(w;\\xi\)\\mid w\]=\\nabla f\_\{i\}\(w\)and𝔼​\[‖gi​\(w;ξ\)−∇fi​\(w\)‖22∣w\]≤σ2\\mathbb\{E\}\[\\\|g\_\{i\}\(w;\\xi\)\-\\nabla f\_\{i\}\(w\)\\\|\_\{2\}^\{2\}\\mid w\]\\leq\\sigma^\{2\}\.

##### Assumption 3: Uniform heterogeneity\(Karimireddyet al\.,[2020](https://arxiv.org/html/2605.26162#bib.bib41)\)\.

For alliiandww,‖∇fi​\(w\)−∇F​\(w\)‖22≤ζ2\\\|\\nabla f\_\{i\}\(w\)\-\\nabla F\(w\)\\\|\_\{2\}^\{2\}\\leq\\zeta^\{2\}\.

##### Assumption 4: Bounded staleness\(Nguyenet al\.,[2022](https://arxiv.org/html/2605.26162#bib.bib37); Assranet al\.,[2019](https://arxiv.org/html/2605.26162#bib.bib33)\)\.

Any message used at eventttis generated within the lastτ\\tauevents\.

##### Assumption 5: Directed push\-sum mixing\(Nedić and Olshevsky,[2014](https://arxiv.org/html/2605.26162#bib.bib38)\)\.

The time\-varying directed communication isBB\-window strongly connected and push\-sum masses stay positive\.

##### Assumptions 6: Absolute compression error\(Danilova and Gorbunov,[2022](https://arxiv.org/html/2605.26162#bib.bib39)\)\.

‖ℛ​\(𝒞​\(w\)\)−w‖2≤εc\\\|\\mathcal\{R\}\(\\mathcal\{C\}\(w\)\)\-w\\\|\_\{2\}\\leq\\varepsilon\_\{c\}for any transmitted modelww\.

To simplify constants, we further assume bounded push\-sum masses, bounded iterates, and bounded gradient second moments\. These are given in Assumptions[C\.6\.1](https://arxiv.org/html/2605.26162#A3.SS6.SSS1.Px1)in Appendix[C](https://arxiv.org/html/2605.26162#A3)\.

#### 4\.6\.2\.De\-biased reference and stationarity metric

On directed and potentially unbalanced graphs, the arithmetic mean is not preserved\. We therefore analyze the de\-biased push\-sum referencew¯t≜Xtott/Ytott\\bar\{w\}^\{t\}\\triangleq X\_\{\\mathrm\{tot\}\}^\{t\}/Y\_\{\\mathrm\{tot\}\}^\{t\}\(defined via system\-level accounting that includes buffered/in\-flight messages; Appendix[C](https://arxiv.org/html/2605.26162#A3)\) and measure convergence by1T​∑t=0T−1𝔼​‖∇F​\(w¯t\)‖22\\frac\{1\}\{T\}\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\.

#### 4\.6\.3\.Main theorem

###### Theorem 4\.1 \(Stationarity of PushCen\-ADFL\)\.

Suppose Assumptions1 \- 6 and Assumptions[C\.6\.1](https://arxiv.org/html/2605.26162#A3.SS6.SSS1.Px1)hold\. Let the stepsize satisfyη≤min⁡\{18​L​E,14​λ\}\\eta\\leq\\min\\Big\\\{\\frac\{1\}\{8LE\},\\ \\frac\{1\}\{4\\lambda\}\\Big\\\}\. Then

\(6\)1T​∑t=0T−1𝔼​\[‖∇F​\(w¯t\)‖22\]\\displaystyle\\frac\{1\}\{T\}\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\\big\[\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\\big\]≤2​\(F​\(w¯0\)−F⋆\)η​E​γ¯​T\+2​\(σ2\+ζ2\)\+2​L​c¯2​εc2\\displaystyle\\leq\\;\\frac\{2\\big\(F\(\\bar\{w\}^\{0\}\)\-F^\{\\star\}\\big\)\}\{\\eta E\\,\\underline\{\\gamma\}\\,T\}\+2\(\\sigma^\{2\}\+\\zeta^\{2\}\)\+2L\\,\\overline\{c\}^\{\\,2\}\\varepsilon\_\{c\}^\{2\}\+4​L2​Nγ¯​\(𝔼​\[ℰcon0\]T​\(1−ρ\)\+Bcon1−ρ\)\+Lγ¯​DΔ\.\\displaystyle\\quad\+\\;\\frac\{4L^\{2\}N\}\{\\underline\{\\gamma\}\}\\left\(\\frac\{\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{0\}\]\}\{T\(1\-\\rho\)\}\+\\frac\{B\_\{\\mathrm\{con\}\}\}\{1\-\\rho\}\\right\)\+\\frac\{L\}\{\\underline\{\\gamma\}\}\\,D\_\{\\Delta\}\.
whereγ¯=yminN​ymax\\underline\{\\gamma\}=\\frac\{y\_\{\\min\}\}\{Ny\_\{\\max\}\},ρ∈\(0,1\)\\rho\\in\(0,1\)is the contraction factor in the directed push\-sum consensus recursion \(Lemma[C\.3](https://arxiv.org/html/2605.26162#A3.Thmtheorem3)in Appendix[C](https://arxiv.org/html/2605.26162#A3)\) and

\(7\)c¯≜maxt⁡dit\+dit\+\+1⋅yitt\+13Ytot≤ymaxN​ymin\.\\overline\{c\}\\triangleq\\max\_\{t\}\\frac\{d\_\{i\_\{t\}\}^\{\+\}\}\{d\_\{i\_\{t\}\}^\{\+\}\+1\}\\cdot\\frac\{y\_\{i\_\{t\}\}^\{t\+\\frac\{1\}\{3\}\}\}\{Y\_\{\\mathrm\{tot\}\}\}\\leq\\frac\{y\_\{\\max\}\}\{Ny\_\{\\min\}\}\.Moreover,

\(8\)DΔ\\displaystyle D\_\{\\Delta\}≜2​E2​η2​\(G2\+16​λ2​W2\),\\displaystyle\\triangleq 2E^\{2\}\\eta^\{2\}\\Big\(G^\{2\}\+16\\lambda^\{2\}W^\{2\}\\Big\),\(9\)Bcon\\displaystyle B\_\{\\mathrm\{con\}\}≜\(C1\+C2​τ2\)​DΔ\+C3​εc2,\\displaystyle\\triangleq\\big\(C\_\{1\}\+C\_\{2\}\\tau^\{2\}\\big\)D\_\{\\Delta\}\+C\_\{3\}\\varepsilon\_\{c\}^\{2\},with\(C1,C2,C3\)\(C\_\{1\},C\_\{2\},C\_\{3\}\)from Lemma[C\.3](https://arxiv.org/html/2605.26162#A3.Thmtheorem3)\.

## 5\.Experiments

Table 1\.Test accuracy \(%\) and standard deviation \(SD\) under different data heterogeneity levelsα\\alpha\.Table 2\.Per\-push communication cost and relative overhead \(normalized to PushCen\-ADFL\)\.### 5\.1\.Experimental Setup

#### 5\.1\.1\.Experimental Environment

We conduct experiments on a Slurm\-managed cluster with two compute nodes\. Node A has an Intel Core i9\-13900K CPU, 64GB RAM and one NVIDIA GeForce RTX 4090 GPU; Node B has an Intel Xeon Silver 4309Y CPU, 128GB RAM and two NVIDIA A40 GPUs\. All methods are implemented in Python\. Our method and all baselines are implemented in PyTorch111https://pytorch\.organd trained/inferred on GPUs\. To emulate client asynchrony and message arrivals in asynchronous decentralized FL, we implement an event\-driven simulator that schedules local computation and communication events independently for each client, allowing updates and message deliveries to occur at arbitrary times\. Unless otherwise stated, all experiments use a fixed random seed, setK=32K=32, and cap the FIFO buffer atL=16L=16to bound memory under bursty asynchronous arrivals\. Each client pushes to 10 random neighbors via a gossip protocol\. The code was released at222https://github\.com/SHVleV9CYWkK/ADFLab\.

#### 5\.1\.2\.Datasets and Models

We evaluate our method on three standard vision classification benchmarks: CIFAR\-10, CIFAR\-100 and Tiny\-ImageNet\. Following common practice, we pair datasets with architectures of appropriate capacity: LeNet for CIFAR\-10 and ResNet\-18 for CIFAR\-100 and Tiny\-ImageNet\(Krizhevskyet al\.,[2009](https://arxiv.org/html/2605.26162#bib.bib10); Le and Yang,[2015](https://arxiv.org/html/2605.26162#bib.bib11); LeCunet al\.,[1998](https://arxiv.org/html/2605.26162#bib.bib12); Heet al\.,[2016](https://arxiv.org/html/2605.26162#bib.bib13)\)\. All methods are trained under the same data preprocessing and training pipeline to ensure fair comparisons\.

#### 5\.1\.3\.Data Partitioning and Heterogeneity

To emulate the statistical heterogeneity \(non\-IID data\) commonly observed in real\-world federated learning, we partition each dataset using a Dirichlet distribution with concentration parametersα∈\{0\.1,0\.4,1\.0\}\\alpha\\in\\\{0\.1,0\.4,1\.0\\\}, where smallerα\\alphainduces more skewed client data distributions\. CIFAR\-10 is partitioned across 100 clients, while CIFAR\-100 and Tiny\-ImageNet are each distributed among 50 clients to avoid overly small per\-client datasets under Dirichlet splits, which can make training unstable\. For evaluation under each client’s local distribution, we further split each client’s data into local training and test sets and report performance on the corresponding local test data\.

#### 5\.1\.4\.Delayed Client Scenario and Evaluation Protocol

To evaluate delayed client participation in asynchronous decentralized federated learning, we adopt a round\-free \(pseudo\-time–based\) evaluation protocol consistent with ADFL\. Training proceeds over a fixed global pseudo\-time horizon, uniformly divided into 60 evaluation intervals, at which the average Top\-1 accuracy of all online clients is reported\. For each dataset, 10% of clients are randomly designated as delayed clients prior to training, while all remaining clients participate from the beginning\. Each delayed client joins the system at a randomly sampled time during training\. To ensure a fair comparison, both the identities of delayed clients and their join times are fixed and shared across all compared methods\.

#### 5\.1\.5\.Baseline Methods

We compare against four representative baselines\.Async\-DFedAvgserves as a canonical benchmark, capturing a standard asynchronous decentralized adaptation of the FedAvg\-style aggregation paradigm\(McMahanet al\.,[2017](https://arxiv.org/html/2605.26162#bib.bib14)\)\.DivSharerepresents a state\-of\-the\-art communication\-efficient approach in asynchronous decentralized settings and is included to benchmark communication efficiency\(Biswaset al\.,[2025](https://arxiv.org/html/2605.26162#bib.bib9)\)\.Independentperforms purely local training without any model exchange, providing a non\-collaborative lower bound\. Finally,SWIFTis a representative asynchronous decentralized FL algorithm and is commonly used as a baseline for handling statistical heterogeneity under asynchrony\(Bornsteinet al\.,[2023](https://arxiv.org/html/2605.26162#bib.bib1)\)\.

### 5\.2\.Experimental Results

#### 5\.2\.1\.Model Accuracy

Table[1](https://arxiv.org/html/2605.26162#S5.T1)reports the Top\-1 accuracy together with the corresponding standard deviation \(SD\) under different levels of non\-IID data\. In addition to mean performance, the SD provides a sensitivity analysis that characterizes the variability of model accuracy across clients and training trajectories in asynchronous decentralized settings\. On CIFAR\-10, PushCen\-ADFL achieves the best accuracy under high heterogeneity, with particularly strong gains as data become more non\-IID\. Atα=0\.4\\alpha=0\.4andα=0\.1\\alpha=0\.1, PushCen\-ADFL reaches 66\.22% and 58\.30%, respectively, outperforming the communication\-efficient baseline DivShare by clear margins, while exhibiting SDs that are comparable to and often lower than, those of competing methods on CIFAR\-10\. When heterogeneity is mild \(α=1\.0\\alpha=1\.0\), PushCen\-ADFL remains highly competitive with full\-communication baselines and shows similar variability\. On more challenging datasets \(CIFAR\-100 and Tiny\-ImageNet\), PushCen\-ADFL consistently outperforms DivShare across all heterogeneity levels, with more pronounced improvements under stronger non\-IID distributions\. For example, atα=0\.1\\alpha=0\.1, PushCen\-ADFL improves accuracy by 2\.22% on CIFAR\-100 and 6\.18% on Tiny\-ImageNet, while maintaining overall variability comparable to the baselines across heterogeneity levels\. PushCen\-ADFL achieves competitive mean accuracy and comparable SDs relative to full\-communication methods, demonstrating that its centroid\-based compression and regularization enable strong, robust performance in heterogeneous and asynchronous decentralized training without increasing sensitivity to updates\.

#### 5\.2\.2\.Communication Cost

Table[2](https://arxiv.org/html/2605.26162#S5.T2)reports the communication cost of a single client push event, which is the total payload transmitted when a client broadcasts its message to 10 neighbors\. This metric isolates the*per\-push*communication footprint from the overall pseudo\-time horizon and thus directly reflects the bandwidth burden induced by one communication action in our asynchronous decentralized protocol\. We focus on the per\-push cost because total system communication scales with the per\-push payload times the number of push events\. The event count depends on asynchrony, client delays, and scheduling, rather than the message format itself\. As expected, full\-precision methods that transmit dense model information \(Async\-DFedAvg and SWIFT\) incur substantially larger per\-push traffic, reaching 10\.27 MB on CIFAR\-10 and about 449 to 451 MB on CIFAR\-100 and Tiny\-ImageNet\. In contrast, communication\-efficient baselines significantly reduce the payload size\. DivShare lowers the per\-push cost to 2\.06 MB \(CIFAR\-10\) and 89\.82 to 90\.23 MB \(CIFAR\-100 and Tiny\-ImageNet\), while PushCen\-ADFL achieves the smallest per\-push footprint across all datasets, requiring only 1\.62 MB on CIFAR\-10 and 76\.33 to 76\.65 MB on CIFAR\-100 and Tiny\-ImageNet\.

We further normalize the per\-push communication volume to PushCen\-ADFL, where PushCen\-ADFL is 1\.00 by definition\. Under this normalization, DivShare requires 1\.27×\\timesmore bandwidth per push on CIFAR\-10 and 1\.18×\\timeson CIFAR\-100 and Tiny\-ImageNet, whereas Async\-DFedAvg and SWIFT require 6\.33×\\timeson CIFAR\-10 and about 5\.88 to 5\.89×\\timeson CIFAR\-100 and Tiny\-ImageNet\. Overall, these results demonstrate that PushCen\-ADFL substantially reduces the communication footprint of each decentralized push, which is particularly important in asynchronous settings where communication are frequent and uncoordinated\.

#### 5\.2\.3\.Computational overhead

Table 3\.Computational cost \(GFLOPs\) of different methods on different datasets\.Table[3](https://arxiv.org/html/2605.26162#S5.T3)reports the computational cost \(GFLOPs\) of different methods on three datasets\. Overall, PushCen\-ADFL maintains a lightweight computation footprint comparable to baselines\. On CIFAR\-10, PushCen\-ADFL slightly reduces the cost to 4\.03 GFLOPs, marginally lower than Async\-DFedAvg, DivShare and SWIFT\. On Tiny\-ImageNet, it achieves 42\.41 GFLOPs, essentially matching the baseline range\. On CIFAR\-100, PushCen\-ADFL reaches 11\.93 GFLOPs versus 10\.61–10\.82 GFLOPs for other methods, remaining within a manageable range\.

We observe a slightly higher computation cost on CIFAR\-100, mainly due to WCP\-related operations \(centroid clustering and re\-encoding\) and the centroid\-based anchor construction during local updates\. This effect is more visible withK=32K=32, which yields less aggressive sparsity and thus provides limited training\-time computation reduction while still incurring clustering overhead\.

#### 5\.2\.4\.Ablation Study

Table 4\.Ablation study on different datasets\.Table[4](https://arxiv.org/html/2605.26162#S5.T4)evaluates the contribution of two key components in PushCen\-ADFL: the regularization term and the buffer\-based message handling mechanism\. Removing the regularization \(*No Reg*\) leads to a consistent decrease in accuracy across all datasets, from 66\.22% to 65\.30% on CIFAR\-10, from 44\.21% to 42\.82% on CIFAR\-100 and from 44\.62% to 43\.93% on Tiny\-ImageNet\. This indicates that the regularization is beneficial for stabilizing training and improving generalization under heterogeneous and asynchronous decentralized updates\. Disabling the buffer mechanism \(*No Buffer*\) also degrades performance, with the most pronounced effect observed on CIFAR\-100 \(44\.21% to 32\.01%\), suggesting that buffer\-based handling of incoming messages plays an important role in mitigating adverse effects caused by asynchronous arrivals \(e\.g\., redundant or highly stale information\) on more challenging tasks\. Overall, both components contribute to the final performance and the full PushCen\-ADFL achieves the best results across all datasets\.

### 5\.3\.Delayed Client Experiment Results

Table 5\.Delayed\-client performance in terms of the mean maximum test accuracy \(%\) and standard deviation \(SD\)\.Table[5](https://arxiv.org/html/2605.26162#S5.T5)reports delayed\-client performance measured by the mean of each delayed client’s maximum test accuracy during training, together with the corresponding standard deviation \(SD\)\. On CIFAR\-10, PushCen\-ADFL reaches an accuracy of 72\.77%, exceeding both the full\-communication baselines and the communication\-efficient\. On CIFAR\-100 and Tiny\-ImageNet, PushCen\-ADFL attains 46\.50% and 43\.08%, respectively, consistently improving over DivShare while remaining competitive with full\-communication methods\. These results indicate that PushCen\-ADFL is particularly effective at incorporating delayed clients and enabling them to achieve strong personalized performance under their local distributions\. Beyond the mean accuracy, SD shows that PushCen\-ADFL also yields low variability across delayed clients, achieving the smallest SD on CIFAR\-10 \(0\.356\) and comparable to full\-communication baselines on CIFAR\-100 and Tiny\-ImageNet\. Detailed training curves for delayed clients are provided in Appendix[D\.1\.2](https://arxiv.org/html/2605.26162#A4.SS1.SSS2)\.

### 5\.4\.Hyperparameter Sensitivity

#### 5\.4\.1\.Effect of centroid numberKK\.

Table[6](https://arxiv.org/html/2605.26162#S5.T6)shows that increasingKKimproves accuracy at the cost of higher communication and computation\. Even withK=32K=32, PushCen\-ADFL keeps per\-push cost and GFLOPs strictly below all baselines, confirming thatKKprovides a practical knob to trade accuracy for efficiency\. We useK=32K=32by default as it yields the best accuracy across all datasets\.

Table 6\.Effect of centroid numberKKon accuracy and per\-push communication cost\.
#### 5\.4\.2\.Effect of regularization weightλ\\lambda\.

Table[7](https://arxiv.org/html/2605.26162#S5.T7)shows that enabling centroid regularization \(λ\>0\\lambda\>0\) consistently outperformsλ=0\\lambda=0across all datasets, confirming its role in stabilizing local updates under heterogeneity and asynchrony\. On CIFAR\-10, performance is stable forλ∈\[0\.01,0\.10\]\\lambda\\in\[0\.01,0\.10\]\(≈66%\{\\approx\}66\\%\)\. On CIFAR\-100 and Tiny\-ImageNet, smallerλ\\lambdais slightly preferred, yet gains persist across a broad range\. We adoptλ=0\.1\\lambda=0\.1as a simple default\.

Table 7\.Effect of regularization weightλ\\lambda\(top\-1 accuracy %\)\.λ=0\\lambda=0: No Reg\.

### 5\.5\.Supplementary Experiments

We provide additional learning\-dynamics analyses to complement the main results\. The global accuracy curves show that PushCen\-ADFL follows stable convergence trajectories across all datasets, supporting the final averaged accuracies reported in Table[1](https://arxiv.org/html/2605.26162#S5.T1)\. The delayed\-client accuracy curves on a pseudo\-time axis further show that late\-joining clients improve rapidly after entering the system and reach performance comparable to online clients within a short pseudo\-time window, corroborating the delayed\-client results summarized in Table[5](https://arxiv.org/html/2605.26162#S5.T5)\. Together, these results reinforce the convergence stability and robustness of PushCen\-ADFL under asynchronous and heterogeneous decentralized settings\.

## 6\.Conclusion

This work investigated asynchronous decentralized federated learning, where frequent peer\-to\-peer, asymmetric and unbalanced communication and the statistical heterogeneity jointly induce high bandwidth cost, aggregation bias and model drift\. We proposed PushCen\-ADFL, an asynchronous decentralized framework that couples communication, de\-biased aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization\. It integrates centroid\-based model representation, average\-preserving push\-sum aggregation, centroid regularization, and bounded sender\-deduplicated buffering to enable stable and robust training with delayed client participation\. Extensive experiments demonstrate that PushCen\-ADFL consistently improves performance over existing asynchronous decentralized baselines while substantially reducing per\-push communication cost, achieving a favorable accuracy\-communication trade\-off under dynamic settings\.

## References

- A\. I\. Ameur, A\. Lakas, M\. B\. Yagoubi, and O\. S\. Oubbati \(2022\)Peer\-to\-peer overlay techniques for vehicular ad hoc networks: survey and challenges\.Vehicular Communications34,pp\. 100455\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p4.1)\.
- M\. Assran, N\. Loizou, N\. Ballas, and M\. Rabbat \(2019\)Stochastic gradient push for distributed deep learning\.InInternational Conference on Machine Learning,pp\. 344–353\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1),[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px4)\.
- S\. Biswas, A\. Kermarrec, A\. Marouani, R\. Pires, R\. Sharma, and M\. De Vos \(2025\)Boosting asynchronous decentralized learning with model fragmentation\.InProceedings of the ACM on Web Conference 2025,pp\. 685–696\.Cited by:[§2\.2](https://arxiv.org/html/2605.26162#S2.SS2.p1.1),[§5\.1\.5](https://arxiv.org/html/2605.26162#S5.SS1.SSS5.p1.1)\.
- M\. Bornstein, T\. Rabbani, E\. Wang, A\. S\. Bedi, and F\. Huang \(2023\)SWIFT: rapid decentralized federated learning via wait\-free model communication\.arXiv preprint arXiv:2210\.14026\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.26162#S2.SS1.p1.2),[§5\.1\.5](https://arxiv.org/html/2605.26162#S5.SS1.SSS5.p1.1)\.
- C\. Chen, H\. Xu, W\. Wang, B\. Li, B\. Li, L\. Chen, and G\. Zhang \(2021\)Communication\-efficient federated learning with adaptive parameter freezing\.In2021 IEEE 41st International Conference on Distributed Computing Systems \(ICDCS\),Vol\.,pp\. 1–11\.External Links:[Document](https://dx.doi.org/10.1109/ICDCS51616.2021.00010)Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- R\. Dai, L\. Shen, F\. He, X\. Tian, and D\. Tao \(2022\)Dispfl: towards communication\-efficient personalized federated learning via decentralized sparse training\.arXiv preprint arXiv:2206\.00187\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p2.1)\.
- M\. Danilova and E\. Gorbunov \(2022\)Distributed methods with absolute compression and error compensation\.InInternational Conference on Mathematical Optimization Theory and Operations Research,pp\. 163–177\.Cited by:[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px6)\.
- A\. Dhasade, A\. Kermarrec, E\. Lavoie, J\. Pouwelse, R\. Sharma, and M\. De Vos \(2025\)Practical federated learning without a server\.InProceedings of the 5th Workshop on Machine Learning and Systems,pp\. 1–11\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p4.1)\.
- M\. Franceschelli, A\. Giua, and C\. Seatzu \(2009\)Consensus on the average on arbitrary strongly connected digraphs based on broadcast gossip algorithms\.IFAC Proceedings Volumes42\(20\),pp\. 66–71\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 770–778\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.26162#S5.SS1.SSS2.p1.1)\.
- E\. Jeong and M\. Kountouris \(2025\)DRACO: decentralized asynchronous federated learning over row\-stochastic wireless networks\.IEEE Open Journal of the Communications Society6\(\),pp\. 4818–4839\.External Links:[Document](https://dx.doi.org/10.1109/OJCOMS.2025.3574098)Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p4.1),[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- Z\. Jiang, W\. Wang, B\. Li, and Q\. Yang \(2022\)Towards efficient synchronous federated training: a survey on system optimization strategies\.IEEE Transactions on Big Data9\(2\),pp\. 437–454\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p2.1)\.
- S\. P\. Karimireddy, S\. Kale, M\. Mohri, S\. Reddi, S\. Stich, and A\. T\. Suresh \(2020\)Scaffold: stochastic controlled averaging for federated learning\.InInternational conference on machine learning,pp\. 5132–5143\.Cited by:[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px1),[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px2),[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px3)\.
- D\. Kempe, A\. Dobra, and J\. Gehrke \(2003\)Gossip\-based computation of aggregate information\.In44th Annual IEEE Symposium on Foundations of Computer Science, 2003\. Proceedings\.,Vol\.,pp\. 482–491\.External Links:[Document](https://dx.doi.org/10.1109/SFCS.2003.1238221)Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- M\. Kim, W\. Saad, M\. Debbah, and C\. S\. Hong \(2024\)SpaFL: communication\-efficient federated learning with sparse models and low computational overhead\.Advances in Neural Information Processing Systems37,pp\. 86500–86527\.Cited by:[§2\.2](https://arxiv.org/html/2605.26162#S2.SS2.p1.1)\.
- A\. Krizhevsky, G\. Hinton,et al\.\(2009\)Learning multiple layers of features from tiny images\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.26162#S5.SS1.SSS2.p1.1)\.
- A\. Lalitha, S\. Shekhar, T\. Javidi, and F\. Koushanfar \(2018\)Fully decentralized federated learning\.InThird workshop on bayesian deep learning \(NeurIPS\),Vol\.12\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- N\. Lang, A\. Cohen, and N\. Shlezinger \(2024\)Stragglers\-aware low\-latency synchronous federated learning via layer\-wise model updates\.IEEE Transactions on Communications\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p2.1)\.
- Y\. Le and X\. Yang \(2015\)Tiny imagenet visual recognition challenge\.CS 231N7\(7\),pp\. 3\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.26162#S5.SS1.SSS2.p1.1)\.
- Y\. LeCun, L\. Bottou, Y\. Bengio, and P\. Haffner \(1998\)Gradient\-based learning applied to document recognition\.Proceedings of the IEEE86\(11\),pp\. 2278–2324\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.26162#S5.SS1.SSS2.p1.1)\.
- H\. Li, K\. Ota, and M\. Dong \(2018\)Learning iot in edge: deep learning for the internet of things with edge computing\.IEEE Network32\(1\),pp\. 96–101\.External Links:[Document](https://dx.doi.org/10.1109/MNET.2018.1700202)Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p1.1)\.
- Q\. Li, Z\. Wen, Z\. Wu, S\. Hu, N\. Wang, Y\. Li, X\. Liu, and B\. He \(2023\)A survey on federated learning systems: vision, hype and reality for data privacy and protection\.IEEE Transactions on Knowledge and Data Engineering35\(4\),pp\. 3347–3366\.External Links:[Document](https://dx.doi.org/10.1109/TKDE.2021.3124599)Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p1.1)\.
- X\. Li, K\. Huang, W\. Yang, S\. Wang, and Z\. Zhang \(2020\)On the convergence of fedavg on non\-iid data\.InInternational Conference on Learning Representations,Cited by:[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px1),[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px2)\.
- Z\. Li, Y\. Li, B\. Lin, Z\. Jin, and W\. Zhang \(2024\)Low precision local training is enough for federated learning\.Advances in Neural Information Processing Systems37,pp\. 90160–90197\.Cited by:[§2\.2](https://arxiv.org/html/2605.26162#S2.SS2.p1.1)\.
- Y\. Liao, Y\. Xu, H\. Xu, M\. Chen, L\. Wang, and C\. Qiao \(2024\)Asynchronous decentralized federated learning for heterogeneous devices\.IEEE/ACM Transactions on Networking\.Cited by:[§2\.1](https://arxiv.org/html/2605.26162#S2.SS1.p1.2)\.
- J\. Liu, T\. Che, Y\. Zhou, R\. Jin, H\. Dai, D\. Dou, and P\. Valduriez \(2024a\)Aedfl: efficient asynchronous decentralized federated learning with heterogeneous devices\.InProceedings of the 2024 SIAM International Conference on Data Mining \(SDM\),pp\. 833–841\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p3.1),[§1](https://arxiv.org/html/2605.26162#S1.p4.1),[§1](https://arxiv.org/html/2605.26162#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.26162#S2.SS1.p1.2)\.
- J\. Liu, J\. Jia, T\. Che, C\. Huo, J\. Ren, Y\. Zhou, H\. Dai, and D\. Dou \(2024b\)Fedasmu: efficient asynchronous federated learning with dynamic staleness\-aware model update\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 13900–13908\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- Q\. Liu, B\. Yang, Z\. Wang, D\. Zhu, X\. Wang, K\. Ma, and X\. Guan \(2022\)Asynchronous decentralized federated learning for collaborative fault diagnosis of pv stations\.IEEE Transactions on Network Science and Engineering9\(3\),pp\. 1680–1696\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p4.1),[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- T\. Liu, Z\. Wang, H\. He, W\. Shi, L\. Lin, R\. An, and C\. Li \(2023\)Efficient and secure federated learning for financial applications\.Applied Sciences13\(10\),pp\. 5877\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p1.1)\.
- Q\. Ma, J\. Liu, Q\. Jia, X\. Zhou, Y\. Hu, and R\. Xie \(2024\)Dynamic staleness control for asynchronous federated learning in decentralized topology\.InInternational Conference on Wireless Artificial Intelligent Computing Systems and Applications,pp\. 99–117\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1),[§2\.1](https://arxiv.org/html/2605.26162#S2.SS1.p1.2)\.
- B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. A\. y Arcas \(2017\)Communication\-efficient learning of deep networks from decentralized data\.InArtificial intelligence and statistics,pp\. 1273–1282\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p1.1),[§1](https://arxiv.org/html/2605.26162#S1.p5.1),[§5\.1\.5](https://arxiv.org/html/2605.26162#S5.SS1.SSS5.p1.1)\.
- F\. R\. Mughal, J\. He, B\. Das, F\. A\. Dharejo, N\. Zhu, S\. B\. Khan, and S\. Alzahrani \(2024\)Adaptive federated learning for resource\-constrained iot devices through edge intelligence and multi\-edge clustering\.Scientific Reports14\(1\),pp\. 28746\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p1.1)\.
- A\. Nabli, E\. Belilovsky, and E\. Oyallon \(2023\)A2​C​i​D2A^\{2\}CiD^\{2\}: Accelerating asynchronous communication in decentralized deep learning\.Advances in Neural Information Processing Systems36,pp\. 47451–47474\.Cited by:[§2\.1](https://arxiv.org/html/2605.26162#S2.SS1.p1.2)\.
- A\. Nedić and A\. Olshevsky \(2014\)Distributed optimization over time\-varying directed graphs\.IEEE Transactions on Automatic Control60\(3\),pp\. 601–615\.Cited by:[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px5)\.
- J\. Nguyen, K\. Malik, H\. Zhan, A\. Yousefpour, M\. Rabbat, M\. Malek, and D\. Huba \(2022\)Federated learning with buffered asynchronous aggregation\.InInternational conference on artificial intelligence and statistics,pp\. 3581–3607\.Cited by:[§4\.6\.1](https://arxiv.org/html/2605.26162#S4.SS6.SSS1.Px4)\.
- M\. E\. Rivero\-Angeles, I\. Y\. Orea\-Flores, A\. Lucas\-Bravo, I\. Villordo\-Jiménez, M\. F\. Mata\-Rivera, L\. A\. Macedo Santiago, and M\. L\. Morales\-Varela \(2022\)Data dissemination performance in p2p\-based vehicular communications for smart city environments\.Wireless Communications and Mobile Computing2022\(1\),pp\. 7202412\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p4.1)\.
- T\. Sun, D\. Li, and B\. Wang \(2023\)Decentralized federated averaging\.IEEE Transactions on Pattern Analysis and Machine Intelligence45\(4\),pp\. 4289–4301\.External Links:[Document](https://dx.doi.org/10.1109/TPAMI.2022.3196503)Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p3.1)\.
- Y\. Wang, Y\. Cao, J\. Wu, R\. Chen, and J\. Chen \(2024\)TACKLING the data heterogeneity in asynchronous federated learning with cached update calibration\.In12th International Conference on Learning Representations, ICLR 2024,Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- C\. Wu, F\. Wu, L\. Lyu, Y\. Huang, and X\. Xie \(2022\)Communication\-efficient federated learning via knowledge distillation\.Nature communications13\(1\),pp\. 2032\.Cited by:[§2\.2](https://arxiv.org/html/2605.26162#S2.SS2.p1.1)\.
- L\. Yuan, Z\. Wang, L\. Sun, P\. S\. Yu, and C\. G\. Brinton \(2024\)Decentralized federated learning: a survey and perspective\.IEEE Internet of Things Journal11\(21\),pp\. 34617–34638\.Cited by:[§1](https://arxiv.org/html/2605.26162#S1.p2.1),[§1](https://arxiv.org/html/2605.26162#S1.p3.1),[§1](https://arxiv.org/html/2605.26162#S1.p5.1)\.
- S\. Zehtabi, D\. Han, R\. Parasnis, S\. Hosseinalipour, and C\. G\. Brinton \(2025\)Decentralized sporadic federated learning: a unified algorithmic framework with convergence guarantees\.InICLR,Cited by:[§2\.1](https://arxiv.org/html/2605.26162#S2.SS1.p1.2)\.

## Appendix AAppendix Overview

The appendix provides complementary analyses and additional experimental results that support the main paper:

- •Communication and computation properties of WCP\.Appendix[B](https://arxiv.org/html/2605.26162#A2)revisits Weight Clustering Pruning \(WCP\) and derives its per\-push communication cost as the sum of centroid transmission and index assignment, yielding an interpretable compression ratio mainly controlled by the number of centroidsKK\. It further compares the clustering overhead with standard training cost and shows that, under typical settings, WCP adds only a small fraction of extra computation\.
- •Detailed convergence analysis of PushCen\-ADFL\.Appendix[C](https://arxiv.org/html/2605.26162#A3)presents a full convergence analysis under an event\-driven asynchronous model on directed graphs\. It formalizes push\-sum aggregation with buffered \(possibly stale\) messages, centroid compression/reconstruction error, and the centroid proximal regularization used in local updates\. The analysis establishes key lemmas on system\-level mass conservation, numerator perturbation induced by lossy re\-encoding, a consensus\-error recursion under directed mixing with staleness and compression, and bounds showing how the centroid proximal term suppresses local deviation from its anchor\. These components are combined into a constant\-closed stationarity guarantee that makes the impacts of heterogeneity, staleness, directed mixing, and compression distortion explicit\.
- •Supplementary experiments and ablations\.Appendix[D](https://arxiv.org/html/2605.26162#A4)reports additional empirical evidence beyond the main tables, including global test\-accuracy trajectories and delayed\-client accuracy curves to visualize learning dynamics under asynchrony\. It also provides ablations on the number of centroidsKKand the regularization weightλ\\lambda, quantifying the trade\-offs between accuracy, per\-push communication cost, and computational overhead, and validating the effectiveness and robustness of centroid regularization\.

## Appendix BCommunication and Computational Analysis of Weight Clustering Pruning

In this section, we provide a unified analysis of the communication and computational properties of Weight Clustering Pruning \(WCP\)\. Although WCP was originally introduced in our previous work, it remains a core component of the proposed method and plays an important role in reducing both communication cost and practical system overhead\. For completeness, we revisit its theoretical properties and highlight why WCP remains efficient and well\-suited for asynchronous federated learning\.

### B\.1\.Communication Efficiency of Weight Clustering Pruning

We first analyze how WCP reduces communication overhead compared to standard federated learning\. In conventional federated optimization, each client transmits a full model parameter vectorθ∈ℝN\\theta\\in\\mathbb\{R\}^\{N\}at every communication event\. Assuming each parameter is represented usingBBbits, the communication cost per client per round is

\(10\)Cfull=N×Bbits\.C\_\{\\text\{full\}\}=N\\times B\\quad\\text\{bits\}\.
Under WCP, model parameters are represented using a clustered form\. Specifically, weights are quantized intokkclusters and represented by a centroid set and an index assignment\. One centroid is fixed at zero to naturally encode pruned weights and therefore does not need to be transmitted\.

The total communication cost consists of two components\. The first is the transmission of centroid values\. Onlyk−1k\-1non\-zero centroids need to be communicated, leading to

\(11\)Ccentroid=\(k−1\)×Bbits\.C\_\{\\text\{centroid\}\}=\(k\-1\)\\times B\\quad\\text\{bits\}\.
The second component corresponds to the index sequence that maps each weight to its associated centroid\. Each index requires⌈log2⁡k⌉\\lceil\\log\_\{2\}k\\rceilbits, resulting in

\(12\)Cindex=N×⌈log2⁡k⌉bits\.C\_\{\\text\{index\}\}=N\\times\\lceil\\log\_\{2\}k\\rceil\\quad\\text\{bits\}\.
Combining the two terms, the total communication cost of WCP is

\(13\)CWCP=\(k−1\)×B\+N×⌈log2⁡k⌉\.C\_\{\\text\{WCP\}\}=\(k\-1\)\\times B\+N\\times\\lceil\\log\_\{2\}k\\rceil\.
To quantify the communication reduction, we define the compression ratio

\(14\)ρ=CWCPCfull=\(k−1\)​B\+N​⌈log2⁡k⌉N​B\.\\rho=\\frac\{C\_\{\\text\{WCP\}\}\}\{C\_\{\\text\{full\}\}\}=\\frac\{\(k\-1\)B\+N\\lceil\\log\_\{2\}k\\rceil\}\{NB\}\.
Since modern neural networks typically satisfyN≫kN\\gg k, the centroid transmission term becomes negligible\. Thus, the compression ratio can be approximated as

\(15\)ρ≈⌈log2⁡k⌉B\.\\rho\\approx\\frac\{\\lceil\\log\_\{2\}k\\rceil\}\{B\}\.
This result shows that communication efficiency is primarily controlled by the number of clusterskk\. A moderate choice ofkkyields substantial bandwidth savings while maintaining sufficient representational capacity, which is essential for preserving model accuracy\.

### B\.2\.Computational Overhead of Weight Clustering Pruning

We now examine the computational cost introduced by WCP and compare it to the standard training cost in federated learning\.

Consider a neural network withLLlayers, where the parameter matrix of layerllcontainsPl=nl−1​nlP^\{l\}=n\_\{l\-1\}n\_\{l\}parameters\. The total number of parameters is

\(16\)Ptot=∑l=1LPl\.P\_\{\\text\{tot\}\}=\\sum\_\{l=1\}^\{L\}P^\{l\}\.
##### Training Cost\.

For each training sample, the forward pass, backward pass and parameter update all incur costs proportional to the number of parameters\. Thus, the per\-sample training cost satisfies

\(17\)Ctrain=𝒪​\(∑l=1LPl\)\.C\_\{\\text\{train\}\}=\\mathcal\{O\}\\\!\\left\(\\sum\_\{l=1\}^\{L\}P^\{l\}\\right\)\.For a local dataset of sizeNNandEEtraining epochs, the total training cost becomes

\(18\)Ctotal\_train=𝒪​\(N​E​Ptot\)\.C\_\{\\text\{total\\\_train\}\}=\\mathcal\{O\}\\\!\\left\(NEP\_\{\\text\{tot\}\}\\right\)\.

##### Clustering Cost\.

WCP applies clustering independently to each layer\. For a layer withPlP^\{l\}parameters, a K\-means iteration consists of an assignment step with complexity𝒪​\(Pl​k\)\\mathcal\{O\}\(P^\{l\}k\)and a centroid update step with complexity𝒪​\(Pl\)\\mathcal\{O\}\(P^\{l\}\)\. AssumingTTclustering iterations, the cost for layerllis

\(19\)Cclusterl=𝒪​\(T​k​Pl\)\.C\_\{\\text\{cluster\}\}^\{l\}=\\mathcal\{O\}\(TkP^\{l\}\)\.Summing over all layers yields

\(20\)Ctotal\_cluster=𝒪​\(T​k​Ptot\)\.C\_\{\\text\{total\\\_cluster\}\}=\\mathcal\{O\}\(TkP\_\{\\text\{tot\}\}\)\.

##### Cost Comparison\.

The ratio between training cost and clustering cost is therefore

\(21\)Ctotal\_trainCtotal\_cluster=N​ET​k\.\\frac\{C\_\{\\text\{total\\\_train\}\}\}\{C\_\{\\text\{total\\\_cluster\}\}\}=\\frac\{NE\}\{Tk\}\.
Under typical experimental settings \(e\.g\.,E=1E=1,k=32k=32and smallTT\), this ratio is substantially larger than one\. This indicates that the computational overhead introduced by WCP is dominated by the standard training process and remains a small fraction of the overall computation\.

##### Summary\.

Overall, WCP introduces negligible additional computation while providing significant communication savings\. Its computational cost scales linearly with the number of parameters and clusters, ensuring good scalability\. These properties make WCP particularly suitable for asynchronous and decentralized federated learning, where both communication efficiency and lightweight local computation are critical\.

## Appendix CDetailed convergence analysis of PushCen\-ADFL

### C\.1\.Notation and Event\-Driven Model

##### Network and directed communication\.

We considerNNclients indexed by𝒱=\{1,…,N\}\\mathcal\{V\}=\\\{1,\\dots,N\\\}connected by a directed graph𝒢=\(𝒱,ℰ\)\\mathcal\{G\}=\(\\mathcal\{V\},\\mathcal\{E\}\)\. For clientii, denote its in\-neighbors and out\-neighbors by𝒩i−\\mathcal\{N\}\_\{i\}^\{\-\}and𝒩i\+\\mathcal\{N\}\_\{i\}^\{\+\}and letdi\+≜\|𝒩i\+\|d\_\{i\}^\{\+\}\\triangleq\|\\mathcal\{N\}\_\{i\}^\{\+\}\|\. Clients exchange messages only along directed edges; there is no central server\.

##### Event\-driven asynchronous timeline\.

Training evolves on a global*event*timelinet=0,1,2,…t=0,1,2,\\dots\. At each eventtt, either \(i\) a message arrives at some client and is appended into its local buffer, or \(ii\) a single client is*activated*and performs an*aggregate→\\rightarrowlocal update→\\rightarrowbroadcast*cycle\. Letit∈𝒱i\_\{t\}\\in\\mathcal\{V\}denote the activated client at eventtt\(if any\)\. No global synchronization is assumed and different clients may be activated at different rates\.

##### Local client states\.

Each clientiimaintains the following states at eventtt:

- •Modelwit∈ℝdw\_\{i\}^\{t\}\\in\\mathbb\{R\}^\{d\}, the local trainable parameters \(vectorized for analysis\)\.
- •Push\-sum massyit\>0y\_\{i\}^\{t\}\>0, corresponding tops\_massin the implementation\.
- •centroid dictionaryGitG\_\{i\}^\{t\}, the layer\-wise centroid tables maintained/updated by aggregation\.
- •Bufferℬit\\mathcal\{B\}\_\{i\}^\{t\}, storing the most recent neighbor messages received since the last compute event\.

Remark \(local\-only parameters\)\.In the implementation, BatchNorm\-related parameters/statistics are*local\-only*\(FedBN\-style\): they are not aggregated and remain purely local\. Our analysis focuses on the shared trainable vectorwitw\_\{i\}^\{t\}\(excluding such local\-only components\)\.

##### Centroid compression and reconstruction \(implementation\-aligned\)\.

When clientjjtransmits, it sends a centroid\-compressed payload𝒞​\(wjt\)≜\(Vjt,Ajt,Ujt\),\\mathcal\{C\}\(w\_\{j\}^\{t\}\)\\triangleq\(V\_\{j\}^\{t\},A\_\{j\}^\{t\},U\_\{j\}^\{t\}\),whereVjtV\_\{j\}^\{t\}is the centroid table,AjtA\_\{j\}^\{t\}is the assignment map andUjtU\_\{j\}^\{t\}contains*uncompressed shared tensors*that are communicated without clustering \(e\.g\., biases or non\-compressible layers\)\. Local\-only parameters \(e\.g\., BatchNorm statistics\) are*not*transmitted\. Upon reception, clientiireconstructs an approximate modelw^jt=ℛ​\(𝒞​\(wjt\)\),\\hat\{w\}\_\{j\}^\{t\}=\\mathcal\{R\}\(\\mathcal\{C\}\(w\_\{j\}^\{t\}\)\),whereℛ\\mathcal\{R\}indexesVjtV\_\{j\}^\{t\}withAjtA\_\{j\}^\{t\}and copiesUjtU\_\{j\}^\{t\}\. We define the reconstruction \(compression\) error:

\(22\)ejt≜w^jt−wjt\.e\_\{j\}^\{t\}\\triangleq\\hat\{w\}\_\{j\}^\{t\}\-w\_\{j\}^\{t\}\.

##### Buffered messages\.

A message sent fromjjtoiiis represented asmj→it=\(w^jt,Vjt,yj→it,j\),m\_\{j\\to i\}^\{t\}=\(\\hat\{w\}\_\{j\}^\{t\},V\_\{j\}^\{t\},y\_\{j\\to i\}^\{t\},j\),whereyj→it\>0y\_\{j\\to i\}^\{t\}\>0is the push\-sum mass share attached to the message \(fieldps\_mass\_share\)\. At eventtt, clientii’s buffer isℬit=\{mj→it:j∈𝒮it\},\\mathcal\{B\}\_\{i\}^\{t\}=\\\{m\_\{j\\to i\}^\{t\}:j\\in\\mathcal\{S\}\_\{i\}^\{t\}\\\},where𝒮it⊆𝒩i−\\mathcal\{S\}\_\{i\}^\{t\}\\subseteq\\mathcal\{N\}\_\{i\}^\{\-\}indexes the set of senders currently stored\. The implementation keeps only the newest message per sender \(sender deduplication\) and may enforce a capacity bound; in analysis we treatℬit\\mathcal\{B\}\_\{i\}^\{t\}as an arbitrary finite set of most\-recent neighbor messages\.

##### Push\-sum reformulation \(numerator/denominator variables\)\.

To make average\-preservation explicit under directed, potentially unbalanced communication, we introduce the standard push\-sum*numerator*variable

\(23\)xit≜yit​wit∈ℝd\.x\_\{i\}^\{t\}\\triangleq y\_\{i\}^\{t\}\\,w\_\{i\}^\{t\}\\in\\mathbb\{R\}^\{d\}\.The de\-biased local estimate is always recovered aswit=xit/yitw\_\{i\}^\{t\}=x\_\{i\}^\{t\}/y\_\{i\}^\{t\}\.

##### Mass splitting \(code\-consistent\)\.

After finishing a local update, clientiisplits its current mass uniformly among its out\-neighbors and itself:

\(24\)yi→kt=yitdi\+\+1,∀k∈𝒩i\+,yit\+=yitdi\+\+1\.y\_\{i\\to k\}^\{t\}\\;=\\;\\frac\{y\_\{i\}^\{t\}\}\{d\_\{i\}^\{\+\}\+1\},\\quad\\forall k\\in\\mathcal\{N\}\_\{i\}^\{\+\},\\qquad y\_\{i\}^\{t\+\}\\;=\\;\\frac\{y\_\{i\}^\{t\}\}\{d\_\{i\}^\{\+\}\+1\}\.Each outgoing message tokkcarriesyi→kty\_\{i\\to k\}^\{t\}and clientiikeeps the same share locally\. Hence, the sender\-side conservation holds*exactly*:di\+⋅yitdi\+\+1\+1⋅yitdi\+\+1=yit\.d\_\{i\}^\{\+\}\\cdot\\frac\{y\_\{i\}^\{t\}\}\{d\_\{i\}^\{\+\}\+1\}\+1\\cdot\\frac\{y\_\{i\}^\{t\}\}\{d\_\{i\}^\{\+\}\+1\}=y\_\{i\}^\{t\}\.Analogously, the transmitted numerator share isxi→kt≜yi→kt​w^itx\_\{i\\to k\}^\{t\}\\triangleq y\_\{i\\to k\}^\{t\}\\hat\{w\}\_\{i\}^\{t\}and the locally retained numerator isxit\+≜yit\+​wit\+x\_\{i\}^\{t\+\}\\triangleq y\_\{i\}^\{t\+\}w\_\{i\}^\{t\+\}\.

##### System\-level accounting for in\-flight/buffered mass\.

Under asynchronous message passing, a portion of mass may reside in buffers \(or be in transit\) between send and aggregation events\. Letℳt\\mathcal\{M\}^\{t\}denote the set of all messages that are currently buffered or in transit in the system at eventtt\. For each messagem∈ℳtm\\in\\mathcal\{M\}^\{t\}, letymy\_\{m\}be its mass share andxmx\_\{m\}its numerator share\. We define the*total system mass*and*total system numerator*:

\(25\)Ytott≜∑i=1Nyit\+∑m∈ℳtym,Xtott≜∑i=1Nxit\+∑m∈ℳtxm\.Y\_\{\\mathrm\{tot\}\}^\{t\}\\triangleq\\sum\_\{i=1\}^\{N\}y\_\{i\}^\{t\}\+\\sum\_\{m\\in\\mathcal\{M\}^\{t\}\}y\_\{m\},\\qquad X\_\{\\mathrm\{tot\}\}^\{t\}\\triangleq\\sum\_\{i=1\}^\{N\}x\_\{i\}^\{t\}\+\\sum\_\{m\\in\\mathcal\{M\}^\{t\}\}x\_\{m\}\.The mass\-splitting rule \([24](https://arxiv.org/html/2605.26162#A3.E24)\) and the aggregation rule \(which transfers buffered shares into the local states\) together imply that\(Ytott,Xtott\)\(Y\_\{\\mathrm\{tot\}\}^\{t\},X\_\{\\mathrm\{tot\}\}^\{t\}\)are invariant over time; this will be formalized in the mass\-conservation lemma\.

##### De\-biased global reference sequence\.

Using the conserved system totals, we define the canonical de\-biased global reference model:

\(26\)w¯t≜XtottYtott\.\\bar\{w\}^\{t\}\\triangleq\\frac\{X\_\{\\mathrm\{tot\}\}^\{t\}\}\{Y\_\{\\mathrm\{tot\}\}^\{t\}\}\.This sequence is well\-defined on asymmetric and unbalanced graphs and serves as the main reference trajectory for optimality analysis \(whilewitw\_\{i\}^\{t\}are local de\-biased estimates with a consensus error\)\.

##### Local objective and centroid proximal target\.

Clientiiholds a private dataset𝒟i\\mathcal\{D\}\_\{i\}and local objectivefi​\(w\)≜𝔼ξ∼𝒟i​\[ℓ​\(w;ξ\)\]\.f\_\{i\}\(w\)\\triangleq\\mathbb\{E\}\_\{\\xi\\sim\\mathcal\{D\}\_\{i\}\}\[\\ell\(w;\\xi\)\]\.Before each compute event, clientiiconstructs a proximal targetw~it\\tilde\{w\}\_\{i\}^\{t\}from its current centroid dictionary estimateGitG\_\{i\}^\{t\}and its current assignment map \(obtained by clustering the current local model\) and performs local stochastic optimization on

\(27\)fi​\(w\)\+λ​‖w−w~it‖22,f\_\{i\}\(w\)\+\\lambda\\\|w\-\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}^\{2\},whereλ≥0\\lambda\\geq 0is the regularization weight\. The construction ofw~it\\tilde\{w\}\_\{i\}^\{t\}is implementation\-aligned: it is produced by indexing centroid tables using the assignment map so thatw~it\\tilde\{w\}\_\{i\}^\{t\}matches the shape ofwitw\_\{i\}^\{t\}\.

### C\.2\.Assumptions

We list the standard assumptions required to establish the convergence of PushCen\-ADFL\. These assumptions cover stochastic optimization with heterogeneous data distributions, asynchronous communication with message delays and centroid\-based lossy communication over directed graphs\. All random variables are defined on a common probability space induced by data sampling, client activations and message arrivals\.

##### \(A1\) Smoothness and lower boundedness\.

Each local objectivefi:ℝd→ℝf\_\{i\}:\\mathbb\{R\}^\{d\}\\rightarrow\\mathbb\{R\}isLL\-smooth:

\(28\)‖∇fi​\(u\)−∇fi​\(v\)‖2≤L​‖u−v‖2,∀u,v∈ℝd,\\\|\\nabla f\_\{i\}\(u\)\-\\nabla f\_\{i\}\(v\)\\\|\_\{2\}\\leq L\\\|u\-v\\\|\_\{2\},\\quad\\forall u,v\\in\\mathbb\{R\}^\{d\},and the global objectiveF​\(w\)≜1N​∑i=1Nfi​\(w\)F\(w\)\\triangleq\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}f\_\{i\}\(w\)is lower bounded, i\.e\.,F​\(w\)≥F⋆F\(w\)\\geq F^\{\\star\}for some finiteF⋆F^\{\\star\}\.

##### \(A2\) Unbiased stochastic gradients with bounded variance\.

At clientii, a stochastic gradient computed from a mini\-batchξ∼𝒟i\\xi\\sim\\mathcal\{D\}\_\{i\}satisfies

\(29\)𝔼​\[gi​\(w;ξ\)∣w\]=∇fi​\(w\),𝔼​\[‖gi​\(w;ξ\)−∇fi​\(w\)‖22∣w\]≤σ2,\\mathbb\{E\}\\\!\\left\[g\_\{i\}\(w;\\xi\)\\mid w\\right\]=\\nabla f\_\{i\}\(w\),\\qquad\\mathbb\{E\}\\\!\\left\[\\\|g\_\{i\}\(w;\\xi\)\-\\nabla f\_\{i\}\(w\)\\\|\_\{2\}^\{2\}\\mid w\\right\]\\leq\\sigma^\{2\},for someσ2<∞\\sigma^\{2\}<\\infty\. When centroid proximal regularization is applied, we equivalently view the client as optimizing the regularized objectivefiλ​\(w\)≜fi​\(w\)\+λ​‖w−w~i‖22f\_\{i\}^\{\\lambda\}\(w\)\\triangleq f\_\{i\}\(w\)\+\\lambda\\\|w\-\\tilde\{w\}\_\{i\}\\\|\_\{2\}^\{2\}, and the stochastic gradient is taken w\.r\.t\.fiλf\_\{i\}^\{\\lambda\}\.

##### \(A3\) Uniform client heterogeneity\.

There exists a constantζ2<∞\\zeta^\{2\}<\\inftysuch that for allw∈ℝdw\\in\\mathbb\{R\}^\{d\}and all clientsi∈\[N\]i\\in\[N\],

\(30\)‖∇fi​\(w\)−∇F​\(w\)‖22≤ζ2,whereF​\(w\)≜1N​∑j=1Nfj​\(w\)\.\\\|\\nabla f\_\{i\}\(w\)\-\\nabla F\(w\)\\\|\_\{2\}^\{2\}\\leq\\zeta^\{2\},\\qquad\\text\{where\}\\qquad F\(w\)\\triangleq\\frac\{1\}\{N\}\\sum\_\{j=1\}^\{N\}f\_\{j\}\(w\)\.
Remark\.Assumption A3 is a standard strengthening of the mean\-heterogeneity condition and is commonly used in non\-convex FL analyses to control one\-step descent for the activated client under asynchrony\.

##### \(A4\) Bounded asynchrony \(message staleness\)\.

Messages aggregated at eventttmay correspond to stale model states due to heterogeneous computation/communication\. We assume there exists an integerτ≥0\\tau\\geq 0such that any message used by an activated client at eventttwas generated within the lastτ\\tauactivations/events\. Equivalently, if a buffered messagemj→itm\_\{j\\to i\}^\{t\}carries a reconstructionw^jκ\\hat\{w\}\_\{j\}^\{\\kappa\}, then its generation timeκ\\kappasatisfies

\(31\)t−κ≤τ\.t\-\\kappa\\leq\\tau\.

##### \(A5\) Directed mixing and push\-sum well\-posedness\.

The directed communication graph is strongly connected and the sequence of effective communication events is sufficiently mixing\. Concretely, there exists an integerB≥1B\\geq 1such that for any window ofBBconsecutive events, the union of directed edges along which messages are delivered is strongly connected\. Moreover, the push\-sum masses are uniformly bounded away from zero: there existsymin\>0y\_\{\\min\}\>0such that

\(32\)yit≥ymin,∀i∈𝒱,∀t\.y\_\{i\}^\{t\}\\geq y\_\{\\min\},\\quad\\forall i\\in\\mathcal\{V\},\\ \\forall t\.This ensures the de\-biased ratioswit=xit/yitw\_\{i\}^\{t\}=x\_\{i\}^\{t\}/y\_\{i\}^\{t\}and the global referencew¯t\\bar\{w\}^\{t\}are well\-defined\.

##### \(A6\) Bounded absolute compression error \(C1\)\.

The centroid compression\-reconstruction operatorℛ∘𝒞\\mathcal\{R\}\\circ\\mathcal\{C\}induces a bounded absolute error: there existsεc≥0\\varepsilon\_\{c\}\\geq 0such that for any communicated modelww,

\(33\)‖ℛ​\(𝒞​\(w\)\)−w‖2≤εc\.\\\|\\mathcal\{R\}\(\\mathcal\{C\}\(w\)\)\-w\\\|\_\{2\}\\leq\\varepsilon\_\{c\}\.Equivalently, withejte\_\{j\}^\{t\}defined in \([22](https://arxiv.org/html/2605.26162#A3.E22)\), we have‖ejt‖2≤εc\\\|e\_\{j\}^\{t\}\\\|\_\{2\}\\leq\\varepsilon\_\{c\}for all transmitted messages\. This assumption captures the deterministic quantization/clustering distortion introduced by WCP with a fixed number of centroidsKK\.

Remark\.Assumptions \(A1\)\-\(A3\) are standard in stochastic non\-convex optimization and federated learning\. Assumptions \(A4\)\-\(A5\) formalize bounded staleness and directed\-graph mixing needed for asynchronous push\-sum\. Assumption \(A6\) matches our implementation, where WCP\-based centroid encoding introduces deterministic reconstruction distortion controlled by the clustering resolution \(e\.g\.,K=32K=32\)\.

### C\.3\.Update Rules

We formalize the event\-driven update rules of PushCen\-ADFL \(ADFLCenReg\) by separating \(i\) asynchronous push\-sum aggregation over buffered neighbor messages and \(ii\) local stochastic optimization with centroid proximal regularization\. Throughout, we use the push\-sum numerator\-denominator representationxit=yit​witx\_\{i\}^\{t\}=y\_\{i\}^\{t\}w\_\{i\}^\{t\}introduced in \([23](https://arxiv.org/html/2605.26162#A3.E23)\)\.

##### Message format and staleness\-aware notation\.

When clientjjis activated and broadcasts to an out\-neighbori∈𝒩j\+i\\in\\mathcal\{N\}\_\{j\}^\{\+\}, it transmits a tuple

\(34\)mj→i\(κ\)=\(𝒞​\(wjκ\),yj→i\(κ\),j\),m\_\{j\\to i\}^\{\(\\kappa\)\}\\;=\\;\\big\(\\mathcal\{C\}\(w\_\{j\}^\{\\kappa\}\),\\;y\_\{j\\to i\}^\{\(\\kappa\)\},\\;j\\big\),whereκ\\kappadenotes the*generation event index*at which the message is produced,𝒞​\(wjκ\)=\(Vjκ,Ajκ,Ujκ\)\\mathcal\{C\}\(w\_\{j\}^\{\\kappa\}\)=\(V\_\{j\}^\{\\kappa\},A\_\{j\}^\{\\kappa\},U\_\{j\}^\{\\kappa\}\)is the centroid\-compressed payload andyj→i\(κ\)\>0y\_\{j\\to i\}^\{\(\\kappa\)\}\>0is the attached mass share\. Upon reception at some later eventt≥κt\\geq\\kappa, clientiireconstructs the \(possibly stale\) neighbor model

\(35\)w^j→it≜ℛ​\(𝒞​\(wjκ\)\),\\hat\{w\}\_\{j\\to i\}^\{t\}\\;\\triangleq\\;\\mathcal\{R\}\\\!\\left\(\\mathcal\{C\}\(w\_\{j\}^\{\\kappa\}\)\\right\),and buffers the tuple\(w^j→it,Vjκ,yj→i\(κ\),j\)\(\\hat\{w\}\_\{j\\to i\}^\{t\},V\_\{j\}^\{\\kappa\},y\_\{j\\to i\}^\{\(\\kappa\)\},j\)\.333Herew^j→it\\hat\{w\}\_\{j\\to i\}^\{t\}denotes the content*available at clientiiat eventtt*\. Due to asynchrony, it generally represents a stale state of clientjj, i\.e\.,w^j→it≈wjt−τ\\hat\{w\}\_\{j\\to i\}^\{t\}\\approx w\_\{j\}^\{t\-\\tau\}under Assumption[31](https://arxiv.org/html/2605.26162#A3.E31)\.At the time clientiiis activated at eventtt, letℬit\\mathcal\{B\}\_\{i\}^\{t\}denote its buffer, i\.e\., the set of currently stored messages to be aggregated \(sender\-deduplicated and possibly capacity\-bounded\)\.

##### C1\. Asynchronous push\-sum aggregation \(buffer processing\)\.

When clienti=iti=i\_\{t\}is activated at eventtt, it aggregates all buffered neighbor messagesℬit=\{\(w^j→it,Vjκ,yj→i\(κ\),j\)\}\\mathcal\{B\}\_\{i\}^\{t\}=\\\{\(\\hat\{w\}\_\{j\\to i\}^\{t\},V\_\{j\}^\{\\kappa\},y\_\{j\\to i\}^\{\(\\kappa\)\},j\)\\\}\. Define the post\-buffer total mass atii:

\(36\)Yit≜yit\+∑\(⋅,⋅,yj→i\(κ\),⋅\)∈ℬityj→i\(κ\)\.Y\_\{i\}^\{t\}\\;\\triangleq\\;y\_\{i\}^\{t\}\+\\sum\_\{\(\\cdot,\\cdot,y\_\{j\\to i\}^\{\(\\kappa\)\},\\cdot\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}y\_\{j\\to i\}^\{\(\\kappa\)\}\.The de\-biased model update is

\(37\)wit\+13=yitYit​wit\+∑\(w^j→it,⋅,yj→i\(κ\),⋅\)∈ℬityj→i\(κ\)Yit​w^j→it,w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\;=\\;\\frac\{y\_\{i\}^\{t\}\}\{Y\_\{i\}^\{t\}\}\\,w\_\{i\}^\{t\}\\;\+\\;\\sum\_\{\(\\hat\{w\}\_\{j\\to i\}^\{t\},\\cdot,y\_\{j\\to i\}^\{\(\\kappa\)\},\\cdot\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\frac\{y\_\{j\\to i\}^\{\(\\kappa\)\}\}\{Y\_\{i\}^\{t\}\}\\,\\hat\{w\}\_\{j\\to i\}^\{t\},with the mass update and buffer clearing

\(38\)yit\+13=Yit,ℬit\+13=∅\.y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\;=\\;Y\_\{i\}^\{t\},\\qquad\\mathcal\{B\}\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\;=\\;\\emptyset\.Equivalently, in numerator\-denominator form, lettingxit=yit​witx\_\{i\}^\{t\}=y\_\{i\}^\{t\}w\_\{i\}^\{t\}andxj→i\(κ\)≜yj→i\(κ\)​w^j→itx\_\{j\\to i\}^\{\(\\kappa\)\}\\triangleq y\_\{j\\to i\}^\{\(\\kappa\)\}\\hat\{w\}\_\{j\\to i\}^\{t\}, the aggregation is additive:

\(39\)xit\+13=xit\+∑\(⋅,⋅,yj→i\(κ\),⋅\)∈ℬitxj→i\(κ\),yit\+13=yit\+∑\(⋅,⋅,yj→i\(κ\),⋅\)∈ℬityj→i\(κ\)\.x\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\;=\\;x\_\{i\}^\{t\}\+\\sum\_\{\(\\cdot,\\cdot,y\_\{j\\to i\}^\{\(\\kappa\)\},\\cdot\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}x\_\{j\\to i\}^\{\(\\kappa\)\},\\qquad y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\;=\\;y\_\{i\}^\{t\}\+\\sum\_\{\(\\cdot,\\cdot,y\_\{j\\to i\}^\{\(\\kappa\)\},\\cdot\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}y\_\{j\\to i\}^\{\(\\kappa\)\}\.Finallywit\+13=xit\+13/yit\+13w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}=x\_\{i\}^\{t\+\\frac\{1\}\{3\}\}/y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\.

##### centroid dictionary aggregation\.

Clientiiupdates its centroid dictionary estimate using the same push\-sum weights:

\(40\)Git\+13=yitYit​Git\+∑\(⋅,Vjκ,yj→i\(κ\),⋅\)∈ℬityj→i\(κ\)Yit​Vjκ\.G\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\;=\\;\\frac\{y\_\{i\}^\{t\}\}\{Y\_\{i\}^\{t\}\}\\,G\_\{i\}^\{t\}\\;\+\\;\\sum\_\{\(\\cdot,V\_\{j\}^\{\\kappa\},y\_\{j\\to i\}^\{\(\\kappa\)\},\\cdot\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\frac\{y\_\{j\\to i\}^\{\(\\kappa\)\}\}\{Y\_\{i\}^\{t\}\}\\,V\_\{j\}^\{\\kappa\}\.\(We treatGitG\_\{i\}^\{t\}as an auxiliary state whose role is to form the proximal target in the local update\.\)

##### C2\. Local stochastic optimization with centroid proximal regularization\.

After aggregation, clientiiperformsEElocal epochs/steps of SGD on

\(41\)Φit​\(w\)≜fi​\(w\)\+λ​‖w−w~it‖22,\\Phi\_\{i\}^\{t\}\(w\)\\;\\triangleq\\;f\_\{i\}\(w\)\+\\lambda\\\|w\-\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}^\{2\},wherew~it\\tilde\{w\}\_\{i\}^\{t\}is the centroid\-based proximal target constructed from the current centroid dictionaryGit\+13G\_\{i\}^\{t\+\\frac\{1\}\{3\}\}and the current assignment map \(obtained by applying WCP towit\+13w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\)\. Let\(Vit,Ait,Mit\)\(V\_\{i\}^\{t\},A\_\{i\}^\{t\},M\_\{i\}^\{t\}\)be the centroid table, assignment map and pruning mask produced by WCP at the beginning of the local update\. Fore=0,…,E−1e=0,\\dots,E\-1, we write the masked proximal\-SGD step as

\(42\)wi,e\+1\\displaystyle w\_\{i,e\+1\}=ΠMit\(wi,e−ηgi\(wi,e;ξi,e\)\\displaystyle=\\Pi\_\{M\_\{i\}^\{t\}\}\\\!\\Bigl\(w\_\{i,e\}\-\\eta\\,g\_\{i\}\\\!\\left\(w\_\{i,e\};\\xi\_\{i,e\}\\right\)−2ηλ\(wi,e−w~it\)\),wi,0=wit\+13\.\\displaystyle\\qquad\\quad\-2\\eta\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\\Bigr\),\\qquad w\_\{i,0\}=w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\.
whereξi,e\\xi\_\{i,e\}is a mini\-batch sampled from𝒟i\\mathcal\{D\}\_\{i\}andgi​\(⋅;ξ\)g\_\{i\}\(\\cdot;\\xi\)is the stochastic gradient ofℓ​\(⋅;ξ\)\\ell\(\\cdot;\\xi\)\. The masking operatorΠMit​\(⋅\)\\Pi\_\{M\_\{i\}^\{t\}\}\(\\cdot\)enforces pruning by element\-wise multiplication,

\(43\)ΠMit​\(u\)≜u⊙Mit,\\Pi\_\{M\_\{i\}^\{t\}\}\(u\)\\triangleq u\\odot M\_\{i\}^\{t\},matching the implementation \(w←\\leftarroww \* M\)\. We denote the post\-local\-update model by

\(44\)wit\+23≜wi,E\.w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\\;\\triangleq\\;w\_\{i,E\}\.

##### C3\. Post\-training re\-encoding and broadcast \(mass splitting\)\.

After local optimization, clientiire\-encodes its model via WCP to obtain an updated centroid representation𝒞​\(wit\+23\)=\(Vit\+23,Ait\+23,Uit\+23\)\\mathcal\{C\}\(w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\)=\(V\_\{i\}^\{t\+\\frac\{2\}\{3\}\},A\_\{i\}^\{t\+\\frac\{2\}\{3\}\},U\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\)to be transmitted\. It then performs push\-sum mass splitting:

\(45\)yi→kt\+23=yit\+13di\+\+1,∀k∈𝒩i\+,yit\+1=yit\+13di\+\+1,y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\;=\\;\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{d\_\{i\}^\{\+\}\+1\},\\quad\\forall k\\in\\mathcal\{N\}\_\{i\}^\{\+\},\\qquad y\_\{i\}^\{t\+1\}\\;=\\;\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{d\_\{i\}^\{\+\}\+1\},and sends\(𝒞​\(wit\+23\),yi→kt\+23,i\)\(\\mathcal\{C\}\(w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\),y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\},i\)to eachk∈𝒩i\+k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\. Finally,

\(46\)wit\+1≜wit\+23,xit\+1≜yit\+1​wit\+1\.w\_\{i\}^\{t\+1\}\\;\\triangleq\\;w\_\{i\}^\{t\+\\frac\{2\}\{3\}\},\\qquad x\_\{i\}^\{t\+1\}\\;\\triangleq\\;y\_\{i\}^\{t\+1\}w\_\{i\}^\{t\+1\}\.For any clientℓ≠it\\ell\\neq i\_\{t\}that is not activated at eventtt, we keep its local state unchanged:\(wℓt\+1,xℓt\+1,yℓt\+1,Gℓt\+1\)=\(wℓt,xℓt,yℓt,Gℓt\)\(w\_\{\\ell\}^\{t\+1\},x\_\{\\ell\}^\{t\+1\},y\_\{\\ell\}^\{t\+1\},G\_\{\\ell\}^\{t\+1\}\)=\(w\_\{\\ell\}^\{t\},x\_\{\\ell\}^\{t\},y\_\{\\ell\}^\{t\},G\_\{\\ell\}^\{t\}\)\.

### C\.4\.Key Quantities

To analyze PushCen\-ADFL under directed asynchronous communication, we introduce a de\-biased global reference sequence together with two key error measures: an*optimality*metric that tracks progress toward stationary points and a*consensus*metric that tracks network disagreement\.

##### De\-biased global reference \(push\-sum average\)\.

Recall the total system numerator and mass defined in \([25](https://arxiv.org/html/2605.26162#A3.E25)\):XtottX\_\{\\mathrm\{tot\}\}^\{t\}andYtottY\_\{\\mathrm\{tot\}\}^\{t\}\. We define the \(virtual\) de\-biased global reference model as

\(47\)w¯t≜XtottYtott\.\\bar\{w\}^\{t\}\\triangleq\\frac\{X\_\{\\mathrm\{tot\}\}^\{t\}\}\{Y\_\{\\mathrm\{tot\}\}^\{t\}\}\.Unlike the naive arithmetic mean1N​∑i=1Nwit\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}w\_\{i\}^\{t\},w¯t\\bar\{w\}^\{t\}is the correct average\-preserving reference on directed and potentially unbalanced graphs under push\-sum dynamics\.

##### Local de\-biased models \(node states\)\.

Each client maintains a local push\-sum numeratorxitx\_\{i\}^\{t\}and massyity\_\{i\}^\{t\}\. The local de\-biased model held at nodeiiis

\(48\)wit≜xityit\.w\_\{i\}^\{t\}\\triangleq\\frac\{x\_\{i\}^\{t\}\}\{y\_\{i\}^\{t\}\}\.Important:We strictly distinguish the local modelwitw\_\{i\}^\{t\}from the centroid proximal anchorw~it\\tilde\{w\}\_\{i\}^\{t\}defined in \([41](https://arxiv.org/html/2605.26162#A3.E41)\), wherew~it\\tilde\{w\}\_\{i\}^\{t\}serves as the regularization target during local updates\.

##### Consensus error \(network disagreement\)\.

We quantify network disagreement by the mean\-squared deviation of local models from the de\-biased global reference:

\(49\)ℰcont≜1N​∑i=1N‖wit−w¯t‖22\.\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\\triangleq\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\left\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\right\\\|\_\{2\}^\{2\}\.This term captures the combined effects of directed mixing, asynchronous staleness and lossy reconstruction on consensus\.

##### Optimality \(stationarity\) measure\.

For the global objectiveF​\(w\)≜1N​∑i=1Nfi​\(w\)F\(w\)\\triangleq\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}f\_\{i\}\(w\), we use the standard non\-convex stationarity metric evaluated at the de\-biased global reference sequence:

\(50\)ℰoptt≜𝔼​\[‖∇F​\(w¯t\)‖22\]\.\\mathcal\{E\}\_\{\\mathrm\{opt\}\}^\{t\}\\triangleq\\mathbb\{E\}\\left\[\\left\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\right\\\|\_\{2\}^\{2\}\\right\]\.Bounding1T​∑t=0T−1ℰoptt\\frac\{1\}\{T\}\\sum\_\{t=0\}^\{T\-1\}\\mathcal\{E\}\_\{\\mathrm\{opt\}\}^\{t\}implies convergence to a stationary neighborhood\.

Remark\.Our analysis proceeds by \(i\) deriving a recursion forℰcont\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}under push\-sum mixing with perturbations \(compression and staleness\) and \(ii\) proving a descent inequality forF​\(w¯t\)F\(\\bar\{w\}^\{t\}\)whose error terms are controlled byℰcont\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\. Combining the two yields a bound on the averaged stationarity measure in \([50](https://arxiv.org/html/2605.26162#A3.E50)\)\.

### C\.5\.Key Lemmas

#### C\.5\.1\.Lemma 1

###### Lemma C\.1 \(System\-level mass conservation\)\.

Recall the set of in\-flight/buffered messagesℳt\\mathcal\{M\}^\{t\}and the total system mass

\(51\)Ytott≜∑i=1Nyit\+∑m∈ℳtym,Y\_\{\\mathrm\{tot\}\}^\{t\}\\triangleq\\sum\_\{i=1\}^\{N\}y\_\{i\}^\{t\}\+\\sum\_\{m\\in\\mathcal\{M\}^\{t\}\}y\_\{m\},whereyity\_\{i\}^\{t\}is the local push\-sum mass at nodeiiandymy\_\{m\}is the mass share carried by messagemm\. Under the update rules in Section[C\.3](https://arxiv.org/html/2605.26162#A3.SS3), the total system mass is invariant:

\(52\)Ytott\+1=Ytott,∀t≥0\.Y\_\{\\mathrm\{tot\}\}^\{t\+1\}=Y\_\{\\mathrm\{tot\}\}^\{t\},\\qquad\\forall t\\geq 0\.Consequently, the de\-biased global referencew¯t=Xtott/Ytott\\bar\{w\}^\{t\}=X\_\{\\mathrm\{tot\}\}^\{t\}/Y\_\{\\mathrm\{tot\}\}^\{t\}is well\-defined for allttsinceYtottY\_\{\\mathrm\{tot\}\}^\{t\}is constant and strictly positive \(Assumption \([32](https://arxiv.org/html/2605.26162#A3.E32)\)\)\.

###### Proof\.

We show thatYtotY\_\{\\mathrm\{tot\}\}does not change under either type of event:*\(i\) message generation/sending*and*\(ii\) message aggregation/consumption*\. All other operations \(e\.g\., local SGD onwiw\_\{i\}\) do not modifyyiy\_\{i\}and thus do not affectYtotY\_\{\\mathrm\{tot\}\}\.

##### \(i\) Message generation/sending \(mass splitting\)\.

Suppose clientiiis activated and performs mass splitting as in \([45](https://arxiv.org/html/2605.26162#A3.E45)\)\. Letdi\+=\|𝒩i\+\|d\_\{i\}^\{\+\}=\|\\mathcal\{N\}\_\{i\}^\{\+\}\|and denote the pre\-splitting local mass byyiy\_\{i\}\. The splitting rule sets

yi′=yidi\+\+1,yi→k=yidi\+\+1,∀k∈𝒩i\+,y\_\{i\}^\{\\prime\}\\;=\\;\\frac\{y\_\{i\}\}\{d\_\{i\}^\{\+\}\+1\},\\qquad y\_\{i\\to k\}\\;=\\;\\frac\{y\_\{i\}\}\{d\_\{i\}^\{\+\}\+1\},\\ \\forall k\\in\\mathcal\{N\}\_\{i\}^\{\+\},and createsdi\+d\_\{i\}^\{\+\}new outgoing messages, each carrying massyi→ky\_\{i\\to k\}\. Letℳ\\mathcal\{M\}be the message set immediately before splitting andℳ′\\mathcal\{M\}^\{\\prime\}the set immediately after splitting\. Thenℳ′=ℳ∪\{mi→k:k∈𝒩i\+\}\\mathcal\{M\}^\{\\prime\}=\\mathcal\{M\}\\cup\\\{m\_\{i\\to k\}:k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\\\}and we have

\(yi′\+∑m∈ℳ′ym\)−\(yi\+∑m∈ℳym\)\\displaystyle\\Big\(y\_\{i\}^\{\\prime\}\+\\sum\_\{m\\in\\mathcal\{M\}^\{\\prime\}\}y\_\{m\}\\Big\)\-\\Big\(y\_\{i\}\+\\sum\_\{m\\in\\mathcal\{M\}\}y\_\{m\}\\Big\)=\(yi′−yi\)\+∑k∈𝒩i\+yi→k\\displaystyle=\(y\_\{i\}^\{\\prime\}\-y\_\{i\}\)\+\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}=\(yidi\+\+1−yi\)\+di\+⋅yidi\+\+1\\displaystyle=\\left\(\\frac\{y\_\{i\}\}\{d\_\{i\}^\{\+\}\+1\}\-y\_\{i\}\\right\)\+d\_\{i\}^\{\+\}\\cdot\\frac\{y\_\{i\}\}\{d\_\{i\}^\{\+\}\+1\}\(53\)=0\.\\displaystyle=0\.All other nodesℓ≠i\\ell\\neq ikeepyℓy\_\{\\ell\}unchanged during this event, henceYtotY\_\{\\mathrm\{tot\}\}is invariant\.

##### \(ii\) Message aggregation/consumption\.

When an activated clientiiaggregates its buffer \(Eq\. \([38](https://arxiv.org/html/2605.26162#A3.E38)\)\), it consumes a subset of messagesℬi⊆ℳ\\mathcal\{B\}\_\{i\}\\subseteq\\mathcal\{M\}and transfers their mass shares into the local state:

yi′=yi\+∑m∈ℬiym,ℳ′=ℳ∖ℬi\.y\_\{i\}^\{\\prime\}\\;=\\;y\_\{i\}\+\\sum\_\{m\\in\\mathcal\{B\}\_\{i\}\}y\_\{m\},\\qquad\\mathcal\{M\}^\{\\prime\}\\;=\\;\\mathcal\{M\}\\setminus\\mathcal\{B\}\_\{i\}\.Therefore,

\(54\)\(yi′\+∑m∈ℳ′ym\)−\(yi\+∑m∈ℳym\)\\displaystyle\\Big\(y\_\{i\}^\{\\prime\}\+\\sum\_\{m\\in\\mathcal\{M\}^\{\\prime\}\}y\_\{m\}\\Big\)\-\\Big\(y\_\{i\}\+\\sum\_\{m\\in\\mathcal\{M\}\}y\_\{m\}\\Big\)=∑m∈ℬiym−∑m∈ℬiym=0\.\\displaystyle=\\sum\_\{m\\in\\mathcal\{B\}\_\{i\}\}y\_\{m\}\\;\-\\;\\sum\_\{m\\in\\mathcal\{B\}\_\{i\}\}y\_\{m\}=0\.Again, all other nodes keep their masses unchanged, soYtotY\_\{\\mathrm\{tot\}\}remains invariant\.

##### Conclusion\.

SinceYtotY\_\{\\mathrm\{tot\}\}is unchanged by both message splitting and message aggregation, we obtainYtott\+1=YtottY\_\{\\mathrm\{tot\}\}^\{t\+1\}=Y\_\{\\mathrm\{tot\}\}^\{t\}for all eventstt\. ∎

Remark\.In contrast to the mass variable, the system numeratorXtottX\_\{\\mathrm\{tot\}\}^\{t\}may be perturbed by lossy compression: in the implementation, a sender retains the full\-precision local modelwiw\_\{i\}while transmitting a compressed representation that reconstructs tow^i≠wi\\hat\{w\}\_\{i\}\\neq w\_\{i\}at receivers\. This perturbation will be explicitly handled in subsequent lemmas when deriving consensus and optimality recursions under Assumption \([33](https://arxiv.org/html/2605.26162#A3.E33)\)\.

#### C\.5\.2\.Lemma 2

###### Lemma C\.2 \(Numerator perturbation induced by lossy compression\)\.

Recall the total system massYtottY\_\{\\mathrm\{tot\}\}^\{t\}in \([51](https://arxiv.org/html/2605.26162#A3.E51)\), which is invariant by Lemma[C\.1](https://arxiv.org/html/2605.26162#A3.Thmtheorem1)\. Let

\(55\)Xtott≜∑i=1Nxit\+∑m∈ℳtxmX\_\{\\mathrm\{tot\}\}^\{t\}\\triangleq\\sum\_\{i=1\}^\{N\}x\_\{i\}^\{t\}\+\\sum\_\{m\\in\\mathcal\{M\}^\{t\}\}x\_\{m\}be the total system numerator, wherexit≜yit​witx\_\{i\}^\{t\}\\triangleq y\_\{i\}^\{t\}w\_\{i\}^\{t\}and each messagemmcarries numerator sharexm≜ym​w^mx\_\{m\}\\triangleq y\_\{m\}\\hat\{w\}\_\{m\}with reconstructed contentw^m\\hat\{w\}\_\{m\}\.

Consider an activation eventttat clienti=iti=i\_\{t\}\. Letwit\+23w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}be the post\-local\-update model before broadcasting \(Section[C\.3](https://arxiv.org/html/2605.26162#A3.SS3), C2\) and define the*message model*produced by centroid re\-encoding as

\(56\)wˇit≜ℛ​\(𝒞​\(wit\+23\)\),eit≜wˇit−wit\+23\.\\check\{w\}\_\{i\}^\{t\}\\triangleq\\mathcal\{R\}\\\!\\left\(\\mathcal\{C\}\(w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\)\\right\),\\qquad e\_\{i\}^\{t\}\\triangleq\\check\{w\}\_\{i\}^\{t\}\-w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\.Then the only communication operation that changesXtotX\_\{\\mathrm\{tot\}\}is the mass splitting / broadcast step, and the induced numerator change satisfies the exact identity

\(57\)Xtott\+1−Xtott\+23=∑k∈𝒩i\+yi→kt\+23​eit\.X\_\{\\mathrm\{tot\}\}^\{t\+1\}\-X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}\\;=\\;\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\,e\_\{i\}^\{t\}\.Under Assumption \(A6\) \(bounded absolute compression error\), i\.e\.,‖eit‖2≤εc\\\|e\_\{i\}^\{t\}\\\|\_\{2\}\\leq\\varepsilon\_\{c\}, we further have

\(58\)‖Xtott\+1−Xtott\+23‖2≤\(∑k∈𝒩i\+yi→kt\+23\)​εc=di\+di\+\+1​yit\+13​εc,\\big\\\|X\_\{\\mathrm\{tot\}\}^\{t\+1\}\-X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}\\big\\\|\_\{2\}\\;\\leq\\;\\Big\(\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\Big\)\\,\\varepsilon\_\{c\}\\;=\\;\\frac\{d\_\{i\}^\{\+\}\}\{d\_\{i\}^\{\+\}\+1\}\\,y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\,\\varepsilon\_\{c\},where the last equality uses the mass splitting rule \([45](https://arxiv.org/html/2605.26162#A3.E45)\)\. Consequently, sinceYtotY\_\{\\mathrm\{tot\}\}is invariant, the de\-biased global referencew¯t=Xtott/Ytott\\bar\{w\}^\{t\}=X\_\{\\mathrm\{tot\}\}^\{t\}/Y\_\{\\mathrm\{tot\}\}^\{t\}satisfies

\(59\)‖w¯t\+1−w¯t\+23‖2=‖Xtott\+1−Xtott\+23‖2Ytot≤di\+di\+\+1⋅yit\+13Ytot​εc\.\\big\\\|\\bar\{w\}^\{t\+1\}\-\\bar\{w\}^\{t\+\\frac\{2\}\{3\}\}\\big\\\|\_\{2\}\\;=\\;\\frac\{\\big\\\|X\_\{\\mathrm\{tot\}\}^\{t\+1\}\-X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}\\big\\\|\_\{2\}\}\{Y\_\{\\mathrm\{tot\}\}\}\\;\\leq\\;\\frac\{d\_\{i\}^\{\+\}\}\{d\_\{i\}^\{\+\}\+1\}\\cdot\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{Y\_\{\\mathrm\{tot\}\}\}\\,\\varepsilon\_\{c\}\.

###### Proof\.

We focus on the broadcast \(mass splitting\) step, since: \(i\) message arrivals only move a message into some buffer and do not changeℳt\\mathcal\{M\}^\{t\}as a multiset of in\-flight shares and \(ii\) buffer aggregation transfers message shares fromℳt\\mathcal\{M\}^\{t\}into a local state without changing the total numerator \(it is a pure re\-allocation of existing shares\)\. Therefore, the only communication\-induced change toXtotX\_\{\\mathrm\{tot\}\}arises from creating new outgoing messages while scaling down the sender’s local mass\.

Fix eventttwhere clientiibroadcasts after local training\. Let the local mass right after aggregation beyit\+13y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\(Eq\. \([38](https://arxiv.org/html/2605.26162#A3.E38)\)\), which is the mass to be split\. Before splitting, the sender contributes local numeratorxipre=yit\+13​wit\+23x\_\{i\}^\{\\mathrm\{pre\}\}=y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\,w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}to the system total and the message set does not contain the newly created outgoing shares\. After splitting \(Eq\. \([45](https://arxiv.org/html/2605.26162#A3.E45)\)\), the sender retains massyit\+1=yit\+13di\+\+1y\_\{i\}^\{t\+1\}=\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{d\_\{i\}^\{\+\}\+1\}while creatingdi\+d\_\{i\}^\{\+\}outgoing messages each with massyi→kt\+23=yit\+13di\+\+1\.y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}=\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{d\_\{i\}^\{\+\}\+1\}\.Crucially, the implementation sends a centroid\-compressed payload, whose receiver\-side reconstruction equals the message modelwˇit\\check\{w\}\_\{i\}^\{t\}in \([56](https://arxiv.org/html/2605.26162#A3.E56)\)\. Hence, immediately after splitting:

- •the sender’s retained local numerator becomesxipost=yit\+1​wit\+23,x\_\{i\}^\{\\mathrm\{post\}\}=y\_\{i\}^\{t\+1\}\\,w\_\{i\}^\{t\+\\frac\{2\}\{3\}\},because the local model is*not*replaced by its compressed reconstruction;
- •the newly created message set contributes additional numerator∑k∈𝒩i\+yi→kt\+23​wˇit\.\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\,\\check\{w\}\_\{i\}^\{t\}\.

Therefore, the net change in the system total numerator caused by splitting is

Xtott\+1−Xtott\+23\\displaystyle X\_\{\\mathrm\{tot\}\}^\{t\+1\}\-X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}=\(xipost\+∑k∈𝒩i\+yi→kt\+23​wˇit\)−xipre\\displaystyle=\\Big\(x\_\{i\}^\{\\mathrm\{post\}\}\+\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\,\\check\{w\}\_\{i\}^\{t\}\\Big\)\-x\_\{i\}^\{\\mathrm\{pre\}\}=\(yit\+13di\+\+1​wit\+23\+di\+⋅yit\+13di\+\+1​wˇit\)−yit\+13​wit\+23\\displaystyle=\\Big\(\\tfrac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{d\_\{i\}^\{\+\}\+1\}w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\+d\_\{i\}^\{\+\}\\cdot\\tfrac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{d\_\{i\}^\{\+\}\+1\}\\check\{w\}\_\{i\}^\{t\}\\Big\)\-y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}=di\+di\+\+1​yit\+13​\(wˇit−wit\+23\)\\displaystyle=\\frac\{d\_\{i\}^\{\+\}\}\{d\_\{i\}^\{\+\}\+1\}\\,y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\big\(\\check\{w\}\_\{i\}^\{t\}\-w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\\big\)\(60\)=∑k∈𝒩i\+yi→kt\+23​eit,\\displaystyle=\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\,e\_\{i\}^\{t\},which is exactly \([57](https://arxiv.org/html/2605.26162#A3.E57)\)\.

Taking norms and using‖eit‖2≤εc\\\|e\_\{i\}^\{t\}\\\|\_\{2\}\\leq\\varepsilon\_\{c\}yields

‖Xtott\+1−Xtott\+23‖2≤∑k∈𝒩i\+yi→kt\+23​‖eit‖2≤\(∑k∈𝒩i\+yi→kt\+23\)​εc,\\big\\\|X\_\{\\mathrm\{tot\}\}^\{t\+1\}\-X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}\\big\\\|\_\{2\}\\leq\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\,\\\|e\_\{i\}^\{t\}\\\|\_\{2\}\\leq\\Big\(\\sum\_\{k\\in\\mathcal\{N\}\_\{i\}^\{\+\}\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}\\Big\)\\varepsilon\_\{c\},and substituting∑kyi→kt\+23=di\+di\+\+1​yit\+13\\sum\_\{k\}y\_\{i\\to k\}^\{t\+\\frac\{2\}\{3\}\}=\\frac\{d\_\{i\}^\{\+\}\}\{d\_\{i\}^\{\+\}\+1\}y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}gives \([58](https://arxiv.org/html/2605.26162#A3.E58)\)\. Finally, Lemma[C\.1](https://arxiv.org/html/2605.26162#A3.Thmtheorem1)impliesYtott\+1=Ytott=YtotY\_\{\\mathrm\{tot\}\}^\{t\+1\}=Y\_\{\\mathrm\{tot\}\}^\{t\}=Y\_\{\\mathrm\{tot\}\}, so dividing by the constantYtotY\_\{\\mathrm\{tot\}\}yields \([59](https://arxiv.org/html/2605.26162#A3.E59)\)\. ∎

#### C\.5\.3\.Lemma 3

###### Lemma C\.3 \(Consensus error recursion under asynchronous push\-sum with perturbations\)\.

Recall the consensus errorℰcont=1N​∑i=1N‖wit−w¯t‖22\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}in \([49](https://arxiv.org/html/2605.26162#A3.E49)\), wherewit=xit/yitw\_\{i\}^\{t\}=x\_\{i\}^\{t\}/y\_\{i\}^\{t\}andw¯t=Xtott/Ytott\\bar\{w\}^\{t\}=X\_\{\\mathrm\{tot\}\}^\{t\}/Y\_\{\\mathrm\{tot\}\}^\{t\}\. Consider an activation eventttat clienti=iti=i\_\{t\}\.

Define the activated\-client drift

\(61\)Δit≜wit\+23−wit\+13,\\Delta\_\{i\}^\{t\}\\triangleq w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\},and for any buffered message from senderjjgenerated at eventκ\\kappaand available at clientiiat eventtt, define the receiver\-side reconstruction error

\(62\)ej→i\(κ\)≜w^j→it−wjκ,‖ej→i\(κ\)‖2≤εcby Assumption \(A6\)\.e\_\{j\\to i\}^\{\(\\kappa\)\}\\triangleq\\hat\{w\}\_\{j\\to i\}^\{t\}\-w\_\{j\}^\{\\kappa\},\\qquad\\\|e\_\{j\\to i\}^\{\(\\kappa\)\}\\\|\_\{2\}\\leq\\varepsilon\_\{c\}\\ \\ \\text\{by Assumption \(A6\)\}\.
Under Assumptions \(A4\)\-\(A6\) and the directed mixing Assumption \(A5\), there exist constantsρ∈\(0,1\)\\rho\\in\(0,1\)andC1,C2,C3\>0C\_\{1\},C\_\{2\},C\_\{3\}\>0such that

\(63\)𝔼​\[ℰcont\+1\]≤\\displaystyle\\mathbb\{E\}\\\!\\left\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\+1\}\\right\]\\;\\leqρ​𝔼​\[ℰcont\]\+C1​𝔼​\[‖Δitt‖22\]\+C2​τ​∑r=t−τt−1𝔼​\[‖Δirr‖22\]\+C3​εc2\.\\displaystyle\\rho\\,\\mathbb\{E\}\\\!\\left\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\\right\]\+C\_\{1\}\\,\\mathbb\{E\}\\\!\\left\[\\\|\\Delta\_\{i\_\{t\}\}^\{t\}\\\|\_\{2\}^\{2\}\\right\]\+C\_\{2\}\\,\\tau\\sum\_\{r=t\-\\tau\}^\{t\-1\}\\mathbb\{E\}\\\!\\left\[\\\|\\Delta\_\{i\_\{r\}\}^\{r\}\\\|\_\{2\}^\{2\}\\right\]\+C\_\{3\}\\,\\varepsilon\_\{c\}^\{2\}\.

###### Proof\.

We bound the deviationw−w¯w\-\\bar\{w\}across one event by separating \(i\) buffer aggregation \(mixing\) with stale/quantized inputs, \(ii\) client driftΔit\\Delta\_\{i\}^\{t\}and \(iii\) the reference drift ofw¯\\bar\{w\}due to lossy broadcast \(Lemma[C\.2](https://arxiv.org/html/2605.26162#A3.Thmtheorem2)\)\.

##### Step 1: deviation after aggregation\.

At activation eventtt, clienti=iti=i\_\{t\}aggregates its buffer via \([37](https://arxiv.org/html/2605.26162#A3.E37)\)\. LetYitY\_\{i\}^\{t\}be defined in \([36](https://arxiv.org/html/2605.26162#A3.E36)\) and define normalized weights

\(64\)αit≜yitYit,αj→it≜yj→i\(κ\)Yitfor each​\(w^j→it,⋅,yj→i\(κ\),j\)∈ℬit,\\alpha\_\{i\}^\{t\}\\triangleq\\frac\{y\_\{i\}^\{t\}\}\{Y\_\{i\}^\{t\}\},\\qquad\\alpha\_\{j\\to i\}^\{t\}\\triangleq\\frac\{y\_\{j\\to i\}^\{\(\\kappa\)\}\}\{Y\_\{i\}^\{t\}\}\\ \\ \\text\{for each \}\(\\hat\{w\}\_\{j\\to i\}^\{t\},\\cdot,y\_\{j\\to i\}^\{\(\\kappa\)\},j\)\\in\\mathcal\{B\}\_\{i\}^\{t\},so thatαit\+∑\(j→i\)∈ℬitαj→it=1\\alpha\_\{i\}^\{t\}\+\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}=1\. Using staleness\-aware notation, each buffered model satisfies

\(65\)w^j→it=wjκ\+ej→i\(κ\),t−κ≤τ​\(Assumption \(A4\)\)\.\\hat\{w\}\_\{j\\to i\}^\{t\}=w\_\{j\}^\{\\kappa\}\+e\_\{j\\to i\}^\{\(\\kappa\)\},\\qquad t\-\\kappa\\leq\\tau\\ \\text\{\(Assumption \(A4\)\)\}\.Thus

\(66\)wit\+13=αit​wit\+∑\(j→i\)∈ℬitαj→it​wjκ\+∑\(j→i\)∈ℬitαj→it​ej→i\(κ\)\.w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}=\\alpha\_\{i\}^\{t\}w\_\{i\}^\{t\}\+\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\\,w\_\{j\}^\{\\kappa\}\+\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\\,e\_\{j\\to i\}^\{\(\\kappa\)\}\.Subtractw¯t\\bar\{w\}^\{t\}and apply‖a\+b‖2≤2​‖a‖2\+2​‖b‖2\\\|a\+b\\\|^\{2\}\\leq 2\\\|a\\\|^\{2\}\+2\\\|b\\\|^\{2\}and Jensen’s inequality:

‖wit\+13−w¯t‖2≤\\displaystyle\\\|w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\\;\\leq\\;2​‖αit​\(wit−w¯t\)\+∑\(j→i\)∈ℬitαj→it​\(wjκ−w¯t\)‖2\+2​‖∑\(j→i\)∈ℬitαj→it​ej→i\(κ\)‖2\\displaystyle 2\\Big\\\|\\alpha\_\{i\}^\{t\}\(w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\)\+\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\(w\_\{j\}^\{\\kappa\}\-\\bar\{w\}^\{t\}\)\\Big\\\|^\{2\}\+2\\Big\\\|\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}e\_\{j\\to i\}^\{\(\\kappa\)\}\\Big\\\|^\{2\}\(67\)≤\\displaystyle\\;\\leq\\;2​αit​‖wit−w¯t‖2\+2​∑\(j→i\)∈ℬitαj→it​‖wjκ−w¯t‖2\+2​∑\(j→i\)∈ℬitαj→it​‖ej→i\(κ\)‖2\.\\displaystyle 2\\alpha\_\{i\}^\{t\}\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\\\|w\_\{j\}^\{\\kappa\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\\\|e\_\{j\\to i\}^\{\(\\kappa\)\}\\\|^\{2\}\.
By Assumption \(A6\),‖ej→i\(κ\)‖≤εc\\\|e\_\{j\\to i\}^\{\(\\kappa\)\}\\\|\\leq\\varepsilon\_\{c\}, hence

\(68\)2​∑\(j→i\)∈ℬitαj→it​‖ej→i\(κ\)‖2≤2​εc2\.2\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\\\|e\_\{j\\to i\}^\{\(\\kappa\)\}\\\|^\{2\}\\leq 2\\varepsilon\_\{c\}^\{2\}\.

##### Step 2: bounding staleness via drift increments\.

Between eventsκ\\kappaandtt, at mostτ\\tauactivations occur\. Only the activated client changes its model at each event\. Therefore, for any nodejjand anyκ\\kappawitht−κ≤τt\-\\kappa\\leq\\tau,

\(69\)‖wjt−wjκ‖≤∑r=κt−1‖Δirr‖\.\\\|w\_\{j\}^\{t\}\-w\_\{j\}^\{\\kappa\}\\\|\\leq\\sum\_\{r=\\kappa\}^\{t\-1\}\\\|\\Delta\_\{i\_\{r\}\}^\{r\}\\\|\.Using\(∑r=κt−1ar\)2≤\(t−κ\)​∑r=κt−1ar2≤τ​∑r=t−τt−1ar2\(\\sum\_\{r=\\kappa\}^\{t\-1\}a\_\{r\}\)^\{2\}\\leq\(t\-\\kappa\)\\sum\_\{r=\\kappa\}^\{t\-1\}a\_\{r\}^\{2\}\\leq\\tau\\sum\_\{r=t\-\\tau\}^\{t\-1\}a\_\{r\}^\{2\}, we obtain

\(70\)‖wjt−wjκ‖2≤τ​∑r=t−τt−1‖Δirr‖2\.\\\|w\_\{j\}^\{t\}\-w\_\{j\}^\{\\kappa\}\\\|^\{2\}\\leq\\tau\\sum\_\{r=t\-\\tau\}^\{t\-1\}\\\|\\Delta\_\{i\_\{r\}\}^\{r\}\\\|^\{2\}\.Consequently,

‖wjκ−w¯t‖2\\displaystyle\\\|w\_\{j\}^\{\\kappa\}\-\\bar\{w\}^\{t\}\\\|^\{2\}≤2​‖wjt−w¯t‖2\+2​‖wjt−wjκ‖2\\displaystyle\\leq 2\\\|w\_\{j\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\\|w\_\{j\}^\{t\}\-w\_\{j\}^\{\\kappa\}\\\|^\{2\}\(71\)≤2​‖wjt−w¯t‖2\+2​τ​∑r=t−τt−1‖Δirr‖2\.\\displaystyle\\leq 2\\\|w\_\{j\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\tau\\sum\_\{r=t\-\\tau\}^\{t\-1\}\\\|\\Delta\_\{i\_\{r\}\}^\{r\}\\\|^\{2\}\.Substituting \([71](https://arxiv.org/html/2605.26162#A3.E71)\) and \([68](https://arxiv.org/html/2605.26162#A3.E68)\) into \([67](https://arxiv.org/html/2605.26162#A3.E67)\) yields

\(72\)‖wit\+13−w¯t‖2\\displaystyle\\\|w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|^\{2\}≤2​αit​‖wit−w¯t‖2\+4​∑\(j→i\)∈ℬitαj→it​‖wjt−w¯t‖2\+4​τ​∑r=t−τt−1‖Δirr‖2\+2​εc2\.\\displaystyle\\leq 2\\alpha\_\{i\}^\{t\}\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+4\\sum\_\{\(j\\to i\)\\in\\mathcal\{B\}\_\{i\}^\{t\}\}\\alpha\_\{j\\to i\}^\{t\}\\\|w\_\{j\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+4\\tau\\sum\_\{r=t\-\\tau\}^\{t\-1\}\\\|\\Delta\_\{i\_\{r\}\}^\{r\}\\\|^\{2\}\+2\\varepsilon\_\{c\}^\{2\}\.Remark \(optional simplification\)\.For cleaner constants, one may upper bound2​αit​‖wit−w¯t‖2≤4​αit​‖wit−w¯t‖22\\alpha\_\{i\}^\{t\}\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\\leq 4\\alpha\_\{i\}^\{t\}\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\\|^\{2\}and combine the first two terms into a single factor44; we keep \([72](https://arxiv.org/html/2605.26162#A3.E72)\) to avoid unnecessary slack\.

##### Step 3: adding the client drift at the activated client\.

After local SGD,wit\+23=wit\+13\+Δitw\_\{i\}^\{t\+\\frac\{2\}\{3\}\}=w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\+\\Delta\_\{i\}^\{t\}, hence

\(73\)‖wit\+23−w¯t‖2≤2​‖wit\+13−w¯t‖2\+2​‖Δit‖2\.\\\|w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\\leq 2\\\|w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\\|\\Delta\_\{i\}^\{t\}\\\|^\{2\}\.For allℓ≠i\\ell\\neq i,wℓt\+23=wℓtw\_\{\\ell\}^\{t\+\\frac\{2\}\{3\}\}=w\_\{\\ell\}^\{t\}\.

##### Step 4: accounting for the reference drift ofw¯\\bar\{w\}\.

At the end of eventtt, the consensus error is measured w\.r\.t\.w¯t\+1\\bar\{w\}^\{t\+1\}\. For any nodeℓ\\ell\(activated or not\),

\(74\)‖wℓt\+1−w¯t\+1‖2≤2​‖wℓt\+1−w¯t‖2\+2​‖w¯t\+1−w¯t‖2\.\\\|w\_\{\\ell\}^\{t\+1\}\-\\bar\{w\}^\{t\+1\}\\\|^\{2\}\\leq 2\\\|w\_\{\\ell\}^\{t\+1\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\\|\\bar\{w\}^\{t\+1\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\.Lemma[C\.2](https://arxiv.org/html/2605.26162#A3.Thmtheorem2)implies‖w¯t\+1−w¯t‖≤cps​εc\\\|\\bar\{w\}^\{t\+1\}\-\\bar\{w\}^\{t\}\\\|\\leq c\_\{\\mathrm\{ps\}\}\\varepsilon\_\{c\}for a constantcps\>0c\_\{\\mathrm\{ps\}\}\>0depending on\(di\+,yi/Ytot\)\(d\_\{i\}^\{\+\},y\_\{i\}/Y\_\{\\mathrm\{tot\}\}\), thus the second term in \([74](https://arxiv.org/html/2605.26162#A3.E74)\) contributes an additiveO​\(εc2\)O\(\\varepsilon\_\{c\}^\{2\}\)term to the network average, absorbed intoC3​εc2C\_\{3\}\\varepsilon\_\{c\}^\{2\}\.

##### Step 5: push\-sum mixing contraction

We formalize the mixing part of the event dynamics by a \(random\)*column\-stochastic*matrix sequence\. LetPt∈ℝN×NP^\{t\}\\in\\mathbb\{R\}^\{N\\times N\}denote the effective push\-sum mixing matrix at eventttsuch that, in the*no\-perturbation*case \(i\.e\., ignoring local training drift, staleness and compression noise\), the push\-sum numerator/mass evolve as

\(75\)xt\+1=Pt​xt,yt\+1=Pt​yt,x^\{t\+1\}=P^\{t\}x^\{t\},\\qquad y^\{t\+1\}=P^\{t\}y^\{t\},wherext=\[x1t;…;xNt\]x^\{t\}=\[x\_\{1\}^\{t\};\\dots;x\_\{N\}^\{t\}\]andyt=\[y1t;…;yNt\]y^\{t\}=\[y\_\{1\}^\{t\};\\dots;y\_\{N\}^\{t\}\]\. By construction of mass splitting and aggregation, eachPtP^\{t\}is nonnegative and column\-stochastic:

\(76\)Pt≥0,𝟏⊤​Pt=𝟏⊤\.P^\{t\}\\geq 0,\\qquad\\mathbf\{1\}^\{\\top\}P^\{t\}=\\mathbf\{1\}^\{\\top\}\.Moreover, due to the fixed out\-neighbor pushing rule and the mass\-splitting scheme in the implementation, there exists a uniform constantβ∈\(0,1\)\\beta\\in\(0,1\)such that for any eventtt,

\(77\)\(Pt\)i​j\>0⟹\(Pt\)i​j≥β,\(P^\{t\}\)\_\{ij\}\>0\\ \\Longrightarrow\\ \(P^\{t\}\)\_\{ij\}\\geq\\beta,and\(Pt\)i​j\>0\(P^\{t\}\)\_\{ij\}\>0only if eitheri=ji=j\(self\-retention\) orj→ij\\to iis an active directed communication in the underlying graph\.

Let theBB\-step product beΦt:s≜Pt​Pt−1​⋯​Ps\\Phi^\{t:s\}\\triangleq P^\{t\}P^\{t\-1\}\\cdots P^\{s\}fort≥st\\geq s\. Under Assumption \(A5\) \(bounded connectivity: the union of communication graphs over any window of lengthBBis strongly connected\), standard results on products of nonnegative column\-stochastic matrices imply that everyBB\-step product is*scrambling*\(also called*uniformly ergodic*\): there exists a constantη=η​\(N,B,β\)∈\(0,1\)\\eta=\\eta\(N,B,\\beta\)\\in\(0,1\)such that

\(78\)τ​\(Φt\+B−1:t\)≤1−η,∀t≥0,\\tau\\\!\\left\(\\Phi^\{t\+B\-1:t\}\\right\)\\leq 1\-\\eta,\\qquad\\forall t\\geq 0,whereτ​\(⋅\)\\tau\(\\cdot\)denotes the \(Hajnal/Dobrushin\) coefficient of ergodicity for column\-stochastic matrices, e\.g\.,τ​\(A\)≜12​maxp,q​∑k=1N\|Ak​p−Ak​q\|\.\\tau\(A\)\\triangleq\\frac\{1\}\{2\}\\max\_\{p,q\}\\sum\_\{k=1\}^\{N\}\|A\_\{kp\}\-A\_\{kq\}\|\.It satisfies the submultiplicativityτ​\(A​B\)≤τ​\(A\)​τ​\(B\)\\tau\(AB\)\\leq\\tau\(A\)\\tau\(B\)for any column\-stochasticA,BA,B\. Consequently,

\(79\)τ​\(Φt:0\)≤∏r=0⌊t/B⌋−1τ​\(Φ\(r\+1\)​B−1:r​B\)≤\(1−η\)⌊t/B⌋\.\\tau\(\\Phi^\{t:0\}\)\\leq\\prod\_\{r=0\}^\{\\lfloor t/B\\rfloor\-1\}\\tau\(\\Phi^\{\(r\+1\)B\-1:rB\}\)\\leq\(1\-\\eta\)^\{\\lfloor t/B\\rfloor\}\.\(See, e\.g\., classical ergodicity theorems for stochastic matrix products and push\-sum / ratio\-consensus analyses\.\)

Now define the push\-sum*de\-biased ratio*wit=xit/yitw\_\{i\}^\{t\}=x\_\{i\}^\{t\}/y\_\{i\}^\{t\}andw¯t=Xtott/Ytot\\bar\{w\}^\{t\}=X\_\{\\mathrm\{tot\}\}^\{t\}/Y\_\{\\mathrm\{tot\}\}as in Section[C\.4](https://arxiv.org/html/2605.26162#A3.SS4)\. Under Assumption[C\.6\.1](https://arxiv.org/html/2605.26162#A3.SS6.SSS1.Px1)\(mass bounded away from0\), the ratio map is Lipschitz:

\(80\)‖xy−x′y‖2≤1ymin​‖x−x′‖2,∀y∈\[ymin,ymax\]N,\\left\\\|\\frac\{x\}\{y\}\-\\frac\{x^\{\\prime\}\}\{y\}\\right\\\|\_\{2\}\\leq\\frac\{1\}\{y\_\{\\min\}\}\\\|x\-x^\{\\prime\}\\\|\_\{2\},\\qquad\\forall\\,y\\in\[y\_\{\\min\},y\_\{\\max\}\]^\{N\},where division is element\-wise\. Combining \([79](https://arxiv.org/html/2605.26162#A3.E79)\)\-\([80](https://arxiv.org/html/2605.26162#A3.E80)\) with the standard relation betweenτ​\(⋅\)\\tau\(\\cdot\)and disagreement, we obtain that the \(unperturbed\) push\-sum dynamics contracts disagreement geometrically: there existsρ∈\(0,1\)\\rho\\in\(0,1\)\(e\.g\.,ρ≜\(1−η\)1/B\\rho\\triangleq\(1\-\\eta\)^\{1/B\}\) such that

\(81\)1N​∑i=1N‖wit\+1−w¯t\+1‖22≤ρ⋅1N​∑i=1N‖wit−w¯t‖22,\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|w\_\{i\}^\{t\+1\}\-\\bar\{w\}^\{t\+1\}\\\|\_\{2\}^\{2\}\\leq\\rho\\cdot\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\\|w\_\{i\}^\{t\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\},whereρ\\rhodepends only on\(N,B,β\)\(N,B,\\beta\)\(and the ratio\-Lipschitz constant viayminy\_\{\\min\}\)\.

Finally, returning to the*perturbed*dynamics of our algorithm \(local update drift, bounded staleness and compression\), the contraction \([81](https://arxiv.org/html/2605.26162#A3.E81)\) applies to the mixing part, while the remaining effects enter as additive perturbations\. This justifies the appearance of the termρ​𝔼​\[ℰcont\]\\rho\\,\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\]in the recursion and closes Step 5\.

##### Conclusion\.

Combining Steps 1\-5, averaging over nodes and taking expectations yields the recursion \([63](https://arxiv.org/html/2605.26162#A3.E63)\), with constantsC1,C2,C3C\_\{1\},C\_\{2\},C\_\{3\}absorbing the fixed numerical factors from Steps 1\-4 and the mixing constants from \([79](https://arxiv.org/html/2605.26162#A3.E79)\)\. ∎

#### C\.5\.4\.Lemma 4

###### Lemma C\.4 \(Bounding the local\-update drift by regularized stochastic gradients\)\.

Fix an activation eventttat clienti=iti=i\_\{t\}\. Letwi,0=wit\+13w\_\{i,0\}=w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}be the post\-aggregation model and letw~it\\tilde\{w\}\_\{i\}^\{t\}be the centroid proximal anchor used during the subsequent local update \(Section[C\.3](https://arxiv.org/html/2605.26162#A3.SS3), C2\)\. ClientiiperformsEEmasked proximal\-SGD steps:

\(82\)wi,e\+1\\displaystyle w\_\{i,e\+1\}=ΠMit\(wi,e−ηgi\(wi,e;ξi,e\)\\displaystyle=\\Pi\_\{M\_\{i\}^\{t\}\}\\\!\\Bigl\(w\_\{i,e\}\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)−2ηλ\(wi,e−w~it\)\),e=0,…,E−1\.\\displaystyle\\qquad\\quad\-2\\eta\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\\Bigr\),\\qquad e=0,\\dots,E\-1\.
and outputswit\+23≜wi,Ew\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\\triangleq w\_\{i,E\}\. Define the client driftΔit≜wit\+23−wit\+13=wi,E−wi,0\\Delta\_\{i\}^\{t\}\\triangleq w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}=w\_\{i,E\}\-w\_\{i,0\}\.

Assume the mask operator is non\-expansive:

\(83\)‖ΠMit​\(u\)−ΠMit​\(v\)‖2≤‖u−v‖2,∀u,v,\\\|\\Pi\_\{M\_\{i\}^\{t\}\}\(u\)\-\\Pi\_\{M\_\{i\}^\{t\}\}\(v\)\\\|\_\{2\}\\leq\\\|u\-v\\\|\_\{2\},\\quad\\forall u,v,\(which holds forΠM​\(u\)=u⊙M\\Pi\_\{M\}\(u\)=u\\odot MwithM∈\{0,1\}dM\\in\\\{0,1\\\}^\{d\}as in the implementation\), and Assumption \(A2\) \(unbiased stochastic gradients with varianceσ2\\sigma^\{2\}\)\. Then the drift satisfies the deterministic bound

\(84\)‖Δit‖22≤E​η2​∑e=0E−1‖gi​\(wi,e;ξi,e\)\+2​λ​\(wi,e−w~it\)‖22\.\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\\;\\leq\\;E\\,\\eta^\{2\}\\sum\_\{e=0\}^\{E\-1\}\\Big\\\|g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\+2\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\\Big\\\|\_\{2\}^\{2\}\.Moreover, conditioning on the history up to the start of the local update \(denote the filtration byℱt\\mathcal\{F\}\_\{t\}\),

\(85\)𝔼​\[‖Δit‖22∣ℱt\]\\displaystyle\\mathbb\{E\}\\\!\\left\[\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]≤2​E​η2​∑e=0E−1\(‖∇fi​\(wi,e\)\+2​λ​\(wi,e−w~it\)‖22\+σ2\)\\displaystyle\\leq 2E\\,\\eta^\{2\}\\sum\_\{e=0\}^\{E\-1\}\\Big\(\\big\\\|\\nabla f\_\{i\}\(w\_\{i,e\}\)\+2\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\\big\\\|\_\{2\}^\{2\}\+\\sigma^\{2\}\\Big\)\(86\)≤4​E​η2​∑e=0E−1\(‖∇fi​\(wi,e\)‖22\+4​λ2​‖wi,e−w~it‖22\+σ2\)\.\\displaystyle\\leq 4E\\,\\eta^\{2\}\\sum\_\{e=0\}^\{E\-1\}\\Big\(\\\|\\nabla f\_\{i\}\(w\_\{i,e\}\)\\\|\_\{2\}^\{2\}\+4\\lambda^\{2\}\\\|w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\+\\sigma^\{2\}\\Big\)\.If additionally Assumption \(A3\) \(bounded heterogeneity\) holds, then for allww,‖∇fi​\(w\)‖22≤2​‖∇F​\(w\)‖22\+2​ζ2\\\|\\nabla f\_\{i\}\(w\)\\\|\_\{2\}^\{2\}\\leq 2\\\|\\nabla F\(w\)\\\|\_\{2\}^\{2\}\+2\\zeta^\{2\}and thus

\(87\)𝔼​\[‖Δit‖22∣ℱt\]≤8​E​η2​∑e=0E−1\(‖∇F​\(wi,e\)‖22\+ζ2\+2​λ2​‖wi,e−w~it‖22\+12​σ2\)\.\\mathbb\{E\}\\\!\\left\[\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\leq 8E\\,\\eta^\{2\}\\sum\_\{e=0\}^\{E\-1\}\\Big\(\\\|\\nabla F\(w\_\{i,e\}\)\\\|\_\{2\}^\{2\}\+\\zeta^\{2\}\+2\\lambda^\{2\}\\\|w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\+\\tfrac\{1\}\{2\}\\sigma^\{2\}\\Big\)\.

###### Proof\.

We prove the drift bound by first controlling each*single*masked proximal\-SGD step via non\-expansiveness and then aggregating theEEstep increments using Cauchy\-Schwarz\. For the expectation bound, we apply the standard variance decomposition under unbiased stochastic gradients \(Assumption \(A2\)\) and finally separate the regularized gradient into the loss gradient and the proximal term\.

We emphasize that the maskΠMit​\(u\)=u⊙Mit\\Pi\_\{M\_\{i\}^\{t\}\}\(u\)=u\\odot M\_\{i\}^\{t\}does not increase distances \(Eq\. \([83](https://arxiv.org/html/2605.26162#A3.E83)\)\), hence it can only reduce the step length compared with the unmasked update; this is the only place where pruning enters the analysis\.

##### Step 1: one\-step increment bound\.

Letui,e≜wi,e−η​gi​\(wi,e;ξi,e\)−2​η​λ​\(wi,e−w~it\)\.u\_\{i,e\}\\triangleq w\_\{i,e\}\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-2\\eta\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\.By the update rule \([82](https://arxiv.org/html/2605.26162#A3.E82)\) and non\-expansiveness \([83](https://arxiv.org/html/2605.26162#A3.E83)\),

‖wi,e\+1−wi,e‖2\\displaystyle\\\|w\_\{i,e\+1\}\-w\_\{i,e\}\\\|\_\{2\}=‖ΠMit​\(ui,e\)−ΠMit​\(wi,e\)‖2≤‖ui,e−wi,e‖2\\displaystyle=\\\|\\Pi\_\{M\_\{i\}^\{t\}\}\(u\_\{i,e\}\)\-\\Pi\_\{M\_\{i\}^\{t\}\}\(w\_\{i,e\}\)\\\|\_\{2\}\\leq\\\|u\_\{i,e\}\-w\_\{i,e\}\\\|\_\{2\}\(88\)=η​‖gi​\(wi,e;ξi,e\)\+2​λ​\(wi,e−w~it\)‖2\.\\displaystyle=\\eta\\,\\\|g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\+2\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\\\|\_\{2\}\.

##### Step 2: summing increments \(Cauchy\-Schwarz\)\.

SinceΔit=∑e=0E−1\(wi,e\+1−wi,e\)\\Delta\_\{i\}^\{t\}=\\sum\_\{e=0\}^\{E\-1\}\(w\_\{i,e\+1\}\-w\_\{i,e\}\), Cauchy\-Schwarz gives

\(89\)‖Δit‖22=‖∑e=0E−1\(wi,e\+1−wi,e\)‖22≤E​∑e=0E−1‖wi,e\+1−wi,e‖22\.\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}=\\Big\\\|\\sum\_\{e=0\}^\{E\-1\}\(w\_\{i,e\+1\}\-w\_\{i,e\}\)\\Big\\\|\_\{2\}^\{2\}\\leq E\\sum\_\{e=0\}^\{E\-1\}\\\|w\_\{i,e\+1\}\-w\_\{i,e\}\\\|\_\{2\}^\{2\}\.Substituting \([88](https://arxiv.org/html/2605.26162#A3.E88)\) into \([89](https://arxiv.org/html/2605.26162#A3.E89)\) yields \([84](https://arxiv.org/html/2605.26162#A3.E84)\)\.

##### Step 3: taking conditional expectation\.

By Assumption \(A2\),𝔼​\[gi​\(wi,e;ξi,e\)∣wi,e\]=∇fi​\(wi,e\)\\mathbb\{E\}\[g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\\mid w\_\{i,e\}\]=\\nabla f\_\{i\}\(w\_\{i,e\}\)and𝔼​\[‖gi​\(wi,e;ξi,e\)−∇fi​\(wi,e\)‖22∣wi,e\]≤σ2\\mathbb\{E\}\[\\\|g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-\\nabla f\_\{i\}\(w\_\{i,e\}\)\\\|\_\{2\}^\{2\}\\mid w\_\{i,e\}\]\\leq\\sigma^\{2\}\. Letai,e≜∇fi​\(wi,e\)\+2​λ​\(wi,e−w~it\)a\_\{i,e\}\\triangleq\\nabla f\_\{i\}\(w\_\{i,e\}\)\+2\\lambda\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)andδi,e≜gi​\(wi,e;ξi,e\)−∇fi​\(wi,e\)\\delta\_\{i,e\}\\triangleq g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-\\nabla f\_\{i\}\(w\_\{i,e\}\)\. Then

gi​\(wi,e;ξi,e\)\+2​λ​\(wi,e−w~it\)=ai,e\+δi,e\.g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\+2\\lambda\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)=a\_\{i,e\}\+\\delta\_\{i,e\}\.Using‖a\+δ‖22≤2​‖a‖22\+2​‖δ‖22\\\|a\+\\delta\\\|\_\{2\}^\{2\}\\leq 2\\\|a\\\|\_\{2\}^\{2\}\+2\\\|\\delta\\\|\_\{2\}^\{2\}and taking conditional expectation givenℱt\\mathcal\{F\}\_\{t\}\(which fixes the trajectory up towi,ew\_\{i,e\}\), we obtain

𝔼​\[‖ai,e\+δi,e‖22∣ℱt\]≤2​‖ai,e‖22\+2​σ2\.\\mathbb\{E\}\\\!\\left\[\\\|a\_\{i,e\}\+\\delta\_\{i,e\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\leq 2\\\|a\_\{i,e\}\\\|\_\{2\}^\{2\}\+2\\sigma^\{2\}\.Plugging this into \([84](https://arxiv.org/html/2605.26162#A3.E84)\) yields \([85](https://arxiv.org/html/2605.26162#A3.E85)\)\. Finally, applying‖u\+v‖2≤2​‖u‖2\+2​‖v‖2\\\|u\+v\\\|^\{2\}\\leq 2\\\|u\\\|^\{2\}\+2\\\|v\\\|^\{2\}toai,e=∇fi​\(wi,e\)\+2​λ​\(wi,e−w~it\)a\_\{i,e\}=\\nabla f\_\{i\}\(w\_\{i,e\}\)\+2\\lambda\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)gives \([86](https://arxiv.org/html/2605.26162#A3.E86)\)\. Under Assumption \(A3\), the standard bound‖∇fi​\(w\)‖2≤2​‖∇F​\(w\)‖2\+2​ζ2\\\|\\nabla f\_\{i\}\(w\)\\\|^\{2\}\\leq 2\\\|\\nabla F\(w\)\\\|^\{2\}\+2\\zeta^\{2\}yields \([87](https://arxiv.org/html/2605.26162#A3.E87)\)\. ∎

#### C\.5\.5\.Lemma 5

###### Lemma C\.5 \(Centroid proximal regularization suppresses local deviation from the anchor\)\.

Fix an activation eventttat clienti=iti=i\_\{t\}and consider theEEmasked proximal\-SGD steps in \([82](https://arxiv.org/html/2605.26162#A3.E82)\)\. Let the \(time\-tt\) centroid proximal anchor bew~it\\tilde\{w\}\_\{i\}^\{t\}and define the anchor deviation

\(90\)ui,e≜wi,e−w~it,e=0,…,E,u\_\{i,e\}\\triangleq w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\},\\qquad e=0,\\dots,E,wherewi,0=wit\+13w\_\{i,0\}=w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}andwi,E=wit\+23w\_\{i,E\}=w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\. Assume \(i\) the masking operator is non\-expansive as in \([83](https://arxiv.org/html/2605.26162#A3.E83)\) and \(ii\) the stepsize satisfies

\(91\)η​λ≤12\.\\eta\\lambda\\leq\\frac\{1\}\{2\}\.Then, for every local stepe=0,…,E−1e=0,\\dots,E\-1, we have the*one\-step anchor\-contraction inequality*

\(92\)‖ui,e\+1‖22≤\(1−η​λ\)​‖ui,e‖22\+ηλ​‖gi​\(wi,e;ξi,e\)‖22\.\\\|u\_\{i,e\+1\}\\\|\_\{2\}^\{2\}\\;\\leq\\;\(1\-\\eta\\lambda\)\\,\\\|u\_\{i,e\}\\\|\_\{2\}^\{2\}\\;\+\\;\\frac\{\\eta\}\{\\lambda\}\\,\\big\\\|g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\\big\\\|\_\{2\}^\{2\}\.Moreover, taking conditional expectation given the filtrationℱt\\mathcal\{F\}\_\{t\}at the start of the local update and using Assumption \(A2\), we obtain

\(93\)𝔼​\[‖ui,e\+1‖22∣ℱt\]≤\(1−η​λ\)​𝔼​\[‖ui,e‖22∣ℱt\]\+ηλ​\(‖∇fi​\(wi,e\)‖22\+σ2\)\.\\mathbb\{E\}\\\!\\left\[\\\|u\_\{i,e\+1\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;\\leq\\;\(1\-\\eta\\lambda\)\\,\\mathbb\{E\}\\\!\\left\[\\\|u\_\{i,e\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;\+\\;\\frac\{\\eta\}\{\\lambda\}\\Big\(\\\|\\nabla f\_\{i\}\(w\_\{i,e\}\)\\\|\_\{2\}^\{2\}\+\\sigma^\{2\}\\Big\)\.Consequently, by unrolling the recursion,

\(94\)𝔼​\[‖ui,e‖22∣ℱt\]≤\\displaystyle\\mathbb\{E\}\\\!\\left\[\\\|u\_\{i,e\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;\\leq\(1−η​λ\)e​‖ui,0‖22\\displaystyle\(1\-\\eta\\lambda\)^\{e\}\\,\\\|u\_\{i,0\}\\\|\_\{2\}^\{2\}\+ηλ​∑r=0e−1\(1−η​λ\)e−1−r​\(‖∇fi​\(wi,r\)‖22\+σ2\)\.\\displaystyle\\;\+\\;\\frac\{\\eta\}\{\\lambda\}\\sum\_\{r=0\}^\{e\-1\}\(1\-\\eta\\lambda\)^\{e\-1\-r\}\\Big\(\\\|\\nabla f\_\{i\}\(w\_\{i,r\}\)\\\|\_\{2\}^\{2\}\+\\sigma^\{2\}\\Big\)\.
and the cumulative anchor deviation over the whole local update satisfies

\(95\)∑e=0E−1𝔼​\[‖ui,e‖22∣ℱt\]≤1η​λ​‖ui,0‖22\+ηλ2​∑e=0E−1\(‖∇fi​\(wi,e\)‖22\+σ2\)\.\\sum\_\{e=0\}^\{E\-1\}\\mathbb\{E\}\\\!\\left\[\\\|u\_\{i,e\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;\\leq\\;\\frac\{1\}\{\\eta\\lambda\}\\,\\\|u\_\{i,0\}\\\|\_\{2\}^\{2\}\+\\frac\{\\eta\}\{\\lambda^\{2\}\}\\sum\_\{e=0\}^\{E\-1\}\\Big\(\\\|\\nabla f\_\{i\}\(w\_\{i,e\}\)\\\|\_\{2\}^\{2\}\+\\sigma^\{2\}\\Big\)\.

###### Proof\.

The key idea is to track the deviation from the \(fixed\-in\-this\-event\) centroid anchorui,e≜wi,e−w~itu\_\{i,e\}\\triangleq w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\. Rewriting the masked proximal\-SGD step in terms ofui,eu\_\{i,e\}reveals a linear contraction component induced by the quadratic regularizer and an additive forcing term driven by the stochastic gradient\. We then use \(i\) non\-expansiveness of the masking operator to drop the mask without loosening the bound in the wrong direction and \(ii\) a sharp Young/AM\-GM inequality to control the cross term betweenui,eu\_\{i,e\}and the stochastic gradient\. Finally, we take conditional expectation under Assumption \(A2\) and unroll the resulting recursion to obtain the cumulative deviation bound\.

##### Step 1: rewrite the masked proximal\-SGD step around the anchor\.

Defineui,e=wi,e−w~itu\_\{i,e\}=w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}and note thatw~it\\tilde\{w\}\_\{i\}^\{t\}is fixed during theEElocal steps at eventtt\. From \([82](https://arxiv.org/html/2605.26162#A3.E82)\),

wi,e\+1=ΠMit​\(wi,e−η​gi​\(wi,e;ξi,e\)−2​η​λ​\(wi,e−w~it\)\)\.w\_\{i,e\+1\}=\\Pi\_\{M\_\{i\}^\{t\}\}\\\!\\left\(w\_\{i,e\}\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-2\\eta\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\\right\)\.Subtractw~it\\tilde\{w\}\_\{i\}^\{t\}from both sides and useΠM​\(v\)−ΠM​\(w~\)=ΠM​\(v−w~\)\\Pi\_\{M\}\(v\)\-\\Pi\_\{M\}\(\\tilde\{w\}\)=\\Pi\_\{M\}\(v\-\\tilde\{w\}\)forΠM​\(u\)=u⊙M\\Pi\_\{M\}\(u\)=u\\odot M:

ui,e\+1\\displaystyle u\_\{i,e\+1\}=ΠMit​\(ui,e−η​gi​\(wi,e;ξi,e\)−2​η​λ​ui,e\)\\displaystyle=\\Pi\_\{M\_\{i\}^\{t\}\}\\\!\\Bigl\(u\_\{i,e\}\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-2\\eta\\lambda\\,u\_\{i,e\}\\Bigr\)\(96\)=ΠMit​\(\(1−2​η​λ\)​ui,e−η​gi​\(wi,e;ξi,e\)\)\.\\displaystyle=\\Pi\_\{M\_\{i\}^\{t\}\}\\\!\\Bigl\(\(1\-2\\eta\\lambda\)u\_\{i,e\}\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\\Bigr\)\.

##### Step 2: non\-expansiveness and a sharp Young\-type inequality\.

By non\-expansiveness \([83](https://arxiv.org/html/2605.26162#A3.E83)\),

\(97\)‖ui,e\+1‖22≤‖\(1−2​η​λ\)​ui,e−η​gi​\(wi,e;ξi,e\)‖22\.\\\|u\_\{i,e\+1\}\\\|\_\{2\}^\{2\}\\leq\\\|\(1\-2\\eta\\lambda\)u\_\{i,e\}\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\\\|\_\{2\}^\{2\}\.Leta=η​λ∈\(0,12\]a=\\eta\\lambda\\in\(0,\\frac\{1\}\{2\}\]andg=gi​\(wi,e;ξi,e\)g=g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\. Expanding the RHS gives

\(98\)‖\(1−2​a\)​u−η​g‖22\\displaystyle\\\|\(1\-2a\)u\-\\eta g\\\|\_\{2\}^\{2\}=\(1−2​a\)2​‖u‖22\+η2​‖g‖22−2​η​\(1−2​a\)​⟨u,g⟩\.\\displaystyle=\(1\-2a\)^\{2\}\\\|u\\\|\_\{2\}^\{2\}\+\\eta^\{2\}\\\|g\\\|\_\{2\}^\{2\}\-2\\eta\(1\-2a\)\\langle u,g\\rangle\.Apply the inequality2​⟨u,g⟩≤a​‖u‖22\+1a​‖g‖222\\langle u,g\\rangle\\leq a\\\|u\\\|\_\{2\}^\{2\}\+\\frac\{1\}\{a\}\\\|g\\\|\_\{2\}^\{2\}to the cross term:

−2​η​\(1−2​a\)​⟨u,g⟩≤η​\(1−2​a\)​\(a​‖u‖22\+1a​‖g‖22\)\.\-2\\eta\(1\-2a\)\\langle u,g\\rangle\\leq\\eta\(1\-2a\)\\left\(a\\\|u\\\|\_\{2\}^\{2\}\+\\frac\{1\}\{a\}\\\|g\\\|\_\{2\}^\{2\}\\right\)\.Substituting into \([98](https://arxiv.org/html/2605.26162#A3.E98)\) yields

‖\(1−2​a\)​u−η​g‖22≤\\displaystyle\\\|\(1\-2a\)u\-\\eta g\\\|\_\{2\}^\{2\}\\;\\leq\\;\(\(1−2​a\)2\+a​\(1−2​a\)\)​‖u‖22\+\(η2\+η​\(1−2​a\)​1a\)​‖g‖22\\displaystyle\\Big\(\(1\-2a\)^\{2\}\+a\(1\-2a\)\\Big\)\\,\\\|u\\\|\_\{2\}^\{2\}\+\\Big\(\\eta^\{2\}\+\\eta\(1\-2a\)\\tfrac\{1\}\{a\}\\Big\)\\,\\\|g\\\|\_\{2\}^\{2\}\(99\)=\\displaystyle=\\;\(1−3​a\+2​a2\)​‖u‖22\+\(η2\+η​\(1a−2\)\)​‖g‖22\.\\displaystyle\(1\-3a\+2a^\{2\}\)\\,\\\|u\\\|\_\{2\}^\{2\}\+\\Big\(\\eta^\{2\}\+\\eta\\big\(\\tfrac\{1\}\{a\}\-2\\big\)\\Big\)\\,\\\|g\\\|\_\{2\}^\{2\}\.
Sincea∈\(0,1\]a\\in\(0,1\]implies1−3​a\+2​a2≤1−a1\-3a\+2a^\{2\}\\leq 1\-aandη2\+η​\(1a−2\)≤η⋅1a\\eta^\{2\}\+\\eta\(\\tfrac\{1\}\{a\}\-2\)\\leq\\eta\\cdot\\tfrac\{1\}\{a\}fora≤1a\\leq 1, we obtain

\(100\)‖\(1−2​a\)​u−η​g‖22≤\(1−a\)​‖u‖22\+ηλ​‖g‖22\.\\\|\(1\-2a\)u\-\\eta g\\\|\_\{2\}^\{2\}\\leq\(1\-a\)\\\|u\\\|\_\{2\}^\{2\}\+\\frac\{\\eta\}\{\\lambda\}\\\|g\\\|\_\{2\}^\{2\}\.Combining \([97](https://arxiv.org/html/2605.26162#A3.E97)\) and \([100](https://arxiv.org/html/2605.26162#A3.E100)\) gives \([92](https://arxiv.org/html/2605.26162#A3.E92)\)\.

##### Step 3: conditional expectation and unrolling\.

Taking conditional expectation of \([92](https://arxiv.org/html/2605.26162#A3.E92)\) givenℱt\\mathcal\{F\}\_\{t\}and using Assumption \(A2\),𝔼​\[‖gi​\(wi,e;ξi,e\)‖22∣ℱt\]≤‖∇fi​\(wi,e\)‖22\+σ2,\\mathbb\{E\}\[\\\|g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\]\\leq\\\|\\nabla f\_\{i\}\(w\_\{i,e\}\)\\\|\_\{2\}^\{2\}\+\\sigma^\{2\},yields \([93](https://arxiv.org/html/2605.26162#A3.E93)\)\. Unrolling the linear recursion gives \([94](https://arxiv.org/html/2605.26162#A3.E94)\)\. Finally, summing \([93](https://arxiv.org/html/2605.26162#A3.E93)\) overe=0,…,E−1e=0,\\dots,E\-1and using∑e=0E−1\(1−η​λ\)e≤1η​λ\\sum\_\{e=0\}^\{E\-1\}\(1\-\\eta\\lambda\)^\{e\}\\leq\\frac\{1\}\{\\eta\\lambda\}and∑e=rE−1\(1−η​λ\)e−r≤1η​λ\\sum\_\{e=r\}^\{E\-1\}\(1\-\\eta\\lambda\)^\{e\-r\}\\leq\\frac\{1\}\{\\eta\\lambda\}yields \([95](https://arxiv.org/html/2605.26162#A3.E95)\)\. ∎

#### C\.5\.6\.Lemma 6

###### Lemma C\.6 \(One\-event optimization descent of the de\-biased global reference \(tight, complete\)\)\.

LetF​\(w\)≜1N​∑i=1Nfi​\(w\)F\(w\)\\triangleq\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}f\_\{i\}\(w\)\. Suppose Assumption \(A1\) \(LL\-smoothness\), Assumption \(A2\) \(unbiased stochastic gradients with varianceσ2\\sigma^\{2\}\), Assumption \(A3\) \(bounded heterogeneity with constantζ2\\zeta^\{2\}\) and Assumption \(A6\) \(bounded absolute compression errorεc\\varepsilon\_\{c\}\) hold\.

Consider an eventttwhere clienti=iti=i\_\{t\}is activated\. Letw¯t=Xtott/Ytot\\bar\{w\}^\{t\}=X\_\{\\mathrm\{tot\}\}^\{t\}/Y\_\{\\mathrm\{tot\}\}be the de\-biased global reference and define

\(101\)γt≜yit\+13Ytot∈\(0,1\]\.\\gamma\_\{t\}\\triangleq\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{Y\_\{\\mathrm\{tot\}\}\}\\in\(0,1\]\.Let\{wi,e\}e=0E\\\{w\_\{i,e\}\\\}\_\{e=0\}^\{E\}denote the local trajectory during the subsequent local update, wherewi,0=wit\+13w\_\{i,0\}=w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}andwi,E=wit\+23w\_\{i,E\}=w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\. DefineΔit≜wit\+23−wit\+13=wi,E−wi,0\\Delta\_\{i\}^\{t\}\\triangleq w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}=w\_\{i,E\}\-w\_\{i,0\}\.

Thenw¯t\+1\\bar\{w\}^\{t\+1\}admits the exact decomposition

\(102\)w¯t\+1=w¯t\+γt​Δit\+δt,\\bar\{w\}^\{t\+1\}=\\bar\{w\}^\{t\}\+\\gamma\_\{t\}\\Delta\_\{i\}^\{t\}\+\\delta\_\{t\},where the broadcast\-induced perturbationδt\\delta\_\{t\}satisfies

\(103\)‖δt‖2≤ct​εc,ct≜di\+di\+\+1⋅yit\+13Ytot\\\|\\delta\_\{t\}\\\|\_\{2\}\\leq c\_\{t\}\\,\\varepsilon\_\{c\},\\qquad c\_\{t\}\\triangleq\\frac\{d\_\{i\}^\{\+\}\}\{d\_\{i\}^\{\+\}\+1\}\\cdot\\frac\{y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\}\{Y\_\{\\mathrm\{tot\}\}\}by Lemma[C\.2](https://arxiv.org/html/2605.26162#A3.Thmtheorem2)\.

Moreover, conditioning on the filtrationℱt\\mathcal\{F\}\_\{t\}at the start of the local update, the following descent inequality holds:

𝔼​\[F​\(w¯t\+1\)∣ℱt\]≤\\displaystyle\\mathbb\{E\}\\\!\\left\[F\(\\bar\{w\}^\{t\+1\}\)\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;\\leq\\;F​\(w¯t\)−η​γt​E2​‖∇F​\(w¯t\)‖22\\displaystyle F\(\\bar\{w\}^\{t\}\)\-\\frac\{\\eta\\gamma\_\{t\}E\}\{2\}\\,\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\+η​γt​E​\(ζ2\+σ2\)\\displaystyle\\;\+\\;\\eta\\gamma\_\{t\}E\(\\zeta^\{2\}\+\\sigma^\{2\}\)\+ηγtL2∑e=0E−1𝔼\[∥wi,e−w¯t∥22\|ℱt\]\\displaystyle\\;\+\\;\\eta\\gamma\_\{t\}L^\{2\}\\sum\_\{e=0\}^\{E\-1\}\\mathbb\{E\}\\\!\\left\[\\\|w\_\{i,e\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}\\,\\middle\|\\,\\mathcal\{F\}\_\{t\}\\right\]\+L2γt2𝔼\[∥Δit∥22\|ℱt\]\\displaystyle\\;\+\\;\\frac\{L\}\{2\}\\gamma\_\{t\}^\{2\}\\,\\mathbb\{E\}\\\!\\left\[\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\\,\\middle\|\\,\\mathcal\{F\}\_\{t\}\\right\]\(104\)\+2​L​‖δt‖22\.\\displaystyle\\;\+\\;2L\\\|\\delta\_\{t\}\\\|\_\{2\}^\{2\}\.
Finally, the trajectory term in \([104](https://arxiv.org/html/2605.26162#A3.E104)\) admits the explicit tightening

\(105\)∑e=0E−1𝔼\[∥wi,e−w¯t∥22\|ℱt\]≤\\displaystyle\\sum\_\{e=0\}^\{E\-1\}\\mathbb\{E\}\\\!\\left\[\\\|w\_\{i,e\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}\\,\\middle\|\\,\\mathcal\{F\}\_\{t\}\\right\]\\;\\leq\\;2E∥wit\+13−w¯t∥22\+4E∥ui,0∥22\+4∑e=0E−1𝔼\[∥ui,e∥22\|ℱt\]\.\\displaystyle 2E\\\|w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}\+4E\\\|u\_\{i,0\}\\\|\_\{2\}^\{2\}\+4\\sum\_\{e=0\}^\{E\-1\}\\mathbb\{E\}\\\!\\left\[\\\|u\_\{i,e\}\\\|\_\{2\}^\{2\}\\,\\middle\|\\,\\mathcal\{F\}\_\{t\}\\right\]\.
whereui,e≜wi,e−w~itu\_\{i,e\}\\triangleq w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}andw~it\\tilde\{w\}\_\{i\}^\{t\}is the centroid proximal anchor used at eventtt\.

###### Proof\.

We analyze a single event by separating buffer aggregation, local update and broadcast\. Aggregation preserves the total push\-sum mass and only redistributes existing numerators\. The local update changes the system numerator through the client driftΔit\\Delta\_\{i\}^\{t\}without affecting its mass, while the subsequent broadcast introduces a bounded perturbation due to lossy centroid re\-encoding\. This yields the exact decompositionw¯t\+1=w¯t\+γt​Δit\+δt\\bar\{w\}^\{t\+1\}=\\bar\{w\}^\{t\}\+\\gamma\_\{t\}\\Delta\_\{i\}^\{t\}\+\\delta\_\{t\}\. ApplyingLL\-smoothness ofFFto this update and bounding the resulting terms along the local trajectory lead to the stated one\-event descent inequality\.

##### Step 1: exact evolution ofw¯\\bar\{w\}within the event\.

By Lemma[C\.1](https://arxiv.org/html/2605.26162#A3.Thmtheorem1),YtotY\_\{\\mathrm\{tot\}\}is invariant\. Buffer aggregation only transfers existing mass/numerators from buffers to nodeiiand hence does not changeXtotX\_\{\\mathrm\{tot\}\}\. During the local update, only nodeiichanges its model fromwit\+13w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}towit\+23w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}while its mass remainsyit\+13y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\(mass splitting happens after local training\)\. Thus,

Xtott\+23−Xtott=yit\+13​\(wit\+23−wit\+13\)=yit\+13​Δit,X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}\-X\_\{\\mathrm\{tot\}\}^\{t\}=y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\big\(w\_\{i\}^\{t\+\\frac\{2\}\{3\}\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\big\)=y\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\Delta\_\{i\}^\{t\},and therefore

\(106\)w¯t\+23−w¯t=Xtott\+23−XtottYtot=γt​Δit\.\\bar\{w\}^\{t\+\\frac\{2\}\{3\}\}\-\\bar\{w\}^\{t\}=\\frac\{X\_\{\\mathrm\{tot\}\}^\{t\+\\frac\{2\}\{3\}\}\-X\_\{\\mathrm\{tot\}\}^\{t\}\}\{Y\_\{\\mathrm\{tot\}\}\}=\\gamma\_\{t\}\\Delta\_\{i\}^\{t\}\.The subsequent broadcast step induces the perturbationδt=w¯t\+1−w¯t\+23\\delta\_\{t\}=\\bar\{w\}^\{t\+1\}\-\\bar\{w\}^\{t\+\\frac\{2\}\{3\}\}, which satisfies \([103](https://arxiv.org/html/2605.26162#A3.E103)\) by Lemma[C\.2](https://arxiv.org/html/2605.26162#A3.Thmtheorem2)\. Combining gives \([102](https://arxiv.org/html/2605.26162#A3.E102)\)\.

##### Step 2: smoothness descent forF​\(w¯t\+1\)F\(\\bar\{w\}^\{t\+1\}\)\.

ByLL\-smoothness ofFF\(Assumption \(A1\)\) and \([102](https://arxiv.org/html/2605.26162#A3.E102)\),

F​\(w¯t\+1\)≤\\displaystyle F\(\\bar\{w\}^\{t\+1\}\)\\;\\leq\\;F​\(w¯t\)\+⟨∇F​\(w¯t\),γt​Δit\+δt⟩\+L2​‖γt​Δit\+δt‖22\\displaystyle F\(\\bar\{w\}^\{t\}\)\+\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\gamma\_\{t\}\\Delta\_\{i\}^\{t\}\+\\delta\_\{t\}\\right\\rangle\+\\frac\{L\}\{2\}\\\|\\gamma\_\{t\}\\Delta\_\{i\}^\{t\}\+\\delta\_\{t\}\\\|\_\{2\}^\{2\}≤\\displaystyle\\;\\leq\\;F​\(w¯t\)\+γt​⟨∇F​\(w¯t\),Δit⟩\+⟨∇F​\(w¯t\),δt⟩\\displaystyle F\(\\bar\{w\}^\{t\}\)\+\\gamma\_\{t\}\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\Delta\_\{i\}^\{t\}\\right\\rangle\+\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\delta\_\{t\}\\right\\rangle\(107\)\+L2​γt2​‖Δit‖22\+L​‖δt‖22\.\\displaystyle\\;\+\\;\\frac\{L\}\{2\}\\gamma\_\{t\}^\{2\}\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\+L\\\|\\delta\_\{t\}\\\|\_\{2\}^\{2\}\.
where we used‖a\+b‖2≤2​‖a‖2\+2​‖b‖2\\\|a\+b\\\|^\{2\}\\leq 2\\\|a\\\|^\{2\}\+2\\\|b\\\|^\{2\}to upper bound the quadratic term\.

Next, bound the linearδt\\delta\_\{t\}term via2​⟨a,b⟩≤‖a‖2\+‖b‖22\\langle a,b\\rangle\\leq\\\|a\\\|^\{2\}\+\\\|b\\\|^\{2\}:

\(108\)⟨∇F​\(w¯t\),δt⟩≤14​‖∇F​\(w¯t\)‖22\+‖δt‖22\.\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\delta\_\{t\}\\right\\rangle\\leq\\frac\{1\}\{4\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\+\\\|\\delta\_\{t\}\\\|\_\{2\}^\{2\}\.Substituting \([108](https://arxiv.org/html/2605.26162#A3.E108)\) into \([107](https://arxiv.org/html/2605.26162#A3.E107)\) yields

F​\(w¯t\+1\)≤\\displaystyle F\(\\bar\{w\}^\{t\+1\}\)\\;\\leq\\;F​\(w¯t\)\+γt​⟨∇F​\(w¯t\),Δit⟩\\displaystyle F\(\\bar\{w\}^\{t\}\)\+\\gamma\_\{t\}\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\Delta\_\{i\}^\{t\}\\right\\rangle\(109\)\+14​‖∇F​\(w¯t\)‖22\+L2​γt2​‖Δit‖22\+\(L\+1\)​‖δt‖22\.\\displaystyle\\;\+\\;\\frac\{1\}\{4\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\+\\frac\{L\}\{2\}\\gamma\_\{t\}^\{2\}\\\|\\Delta\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\+\(L\+1\)\\\|\\delta\_\{t\}\\\|\_\{2\}^\{2\}\.
Absorbing constants \(replace\(L\+1\)\(L\+1\)by2​L2Lw\.l\.o\.g\. forL≥1L\\geq 1\) gives theδt\\delta\_\{t\}term in \([104](https://arxiv.org/html/2605.26162#A3.E104)\)\.

##### Step 3: expressing⟨∇F​\(w¯t\),Δit⟩\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\Delta\_\{i\}^\{t\}\\ranglevia local steps\.

Ignoring the non\-expansive mask for notational simplicity \(it can only reduce the step length\), the local update satisfies

wi,e\+1−wi,e=−η​gi​\(wi,e;ξi,e\)−2​η​λ​\(wi,e−w~it\)\.w\_\{i,e\+1\}\-w\_\{i,e\}=\-\\eta\\,g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-2\\eta\\lambda\\,\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\.Summing overe=0,…,E−1e=0,\\dots,E\-1gives

\(110\)Δit=−η​∑e=0E−1gi​\(wi,e;ξi,e\)−2​η​λ​∑e=0E−1\(wi,e−w~it\)\.\\Delta\_\{i\}^\{t\}=\-\\eta\\sum\_\{e=0\}^\{E\-1\}g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\-2\\eta\\lambda\\sum\_\{e=0\}^\{E\-1\}\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\.Taking the inner product with∇F​\(w¯t\)\\nabla F\(\\bar\{w\}^\{t\}\)and conditioning onℱt\\mathcal\{F\}\_\{t\}, we use𝔼​\[gi​\(wi,e;ξi,e\)∣ℱt\]=∇fi​\(wi,e\)\\mathbb\{E\}\[g\_\{i\}\(w\_\{i,e\};\\xi\_\{i,e\}\)\\mid\\mathcal\{F\}\_\{t\}\]=\\nabla f\_\{i\}\(w\_\{i,e\}\)\(Assumption \(A2\)\) to obtain

\(111\)𝔼​\[⟨∇F​\(w¯t\),Δit⟩∣ℱt\]=\\displaystyle\\mathbb\{E\}\\\!\\left\[\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\Delta\_\{i\}^\{t\}\\right\\rangle\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;=\\;−η​∑e=0E−1⟨∇F​\(w¯t\),∇fi​\(wi,e\)⟩−2​η​λ​∑e=0E−1⟨∇F​\(w¯t\),wi,e−w~it⟩\.\\displaystyle\-\\eta\\sum\_\{e=0\}^\{E\-1\}\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\nabla f\_\{i\}\(w\_\{i,e\}\)\\right\\rangle\-2\\eta\\lambda\\sum\_\{e=0\}^\{E\-1\}\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\\right\\rangle\.
We now lower bound the first inner product and upper bound the second in absolute value\.

##### Step 4: lower bounding⟨∇F​\(w¯t\),∇fi​\(wi,e\)⟩\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\nabla f\_\{i\}\(w\_\{i,e\}\)\\rangle\.

Decompose

∇fi​\(wi,e\)=∇F​\(w¯t\)\+\(∇fi​\(wi,e\)−∇fi​\(w¯t\)\)\+\(∇fi​\(w¯t\)−∇F​\(w¯t\)\)\.\\nabla f\_\{i\}\(w\_\{i,e\}\)=\\nabla F\(\\bar\{w\}^\{t\}\)\+\\big\(\\nabla f\_\{i\}\(w\_\{i,e\}\)\-\\nabla f\_\{i\}\(\\bar\{w\}^\{t\}\)\\big\)\+\\big\(\\nabla f\_\{i\}\(\\bar\{w\}^\{t\}\)\-\\nabla F\(\\bar\{w\}^\{t\}\)\\big\)\.Then

⟨∇F​\(w¯t\),∇fi​\(wi,e\)⟩=\\displaystyle\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\nabla f\_\{i\}\(w\_\{i,e\}\)\\right\\rangle\\;=\\;‖∇F​\(w¯t\)‖22\+⟨∇F​\(w¯t\),∇fi​\(wi,e\)−∇fi​\(w¯t\)⟩\+⟨∇F​\(w¯t\),∇fi​\(w¯t\)−∇F​\(w¯t\)⟩\\displaystyle\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\\;\+\\;\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\nabla f\_\{i\}\(w\_\{i,e\}\)\-\\nabla f\_\{i\}\(\\bar\{w\}^\{t\}\)\\right\\rangle\\;\+\\;\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ \\nabla f\_\{i\}\(\\bar\{w\}^\{t\}\)\-\\nabla F\(\\bar\{w\}^\{t\}\)\\right\\rangle≥\\displaystyle\\;\\geq\\;34​‖∇F​\(w¯t\)‖22−‖∇fi​\(wi,e\)−∇fi​\(w¯t\)‖22−‖∇fi​\(w¯t\)−∇F​\(w¯t\)‖22\\displaystyle\\frac\{3\}\{4\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\\;\-\\;\\\|\\nabla f\_\{i\}\(w\_\{i,e\}\)\-\\nabla f\_\{i\}\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\\;\-\\;\\\|\\nabla f\_\{i\}\(\\bar\{w\}^\{t\}\)\-\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\(112\)≥\\displaystyle\\;\\geq\\;34​‖∇F​\(w¯t\)‖22−L2​‖wi,e−w¯t‖22−ζ2\.\\displaystyle\\frac\{3\}\{4\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\-L^\{2\}\\\|w\_\{i,e\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}\-\\zeta^\{2\}\.
where we usedLL\-smoothness \(‖∇fi​\(w\)−∇fi​\(v\)‖≤L​‖w−v‖\\\|\\nabla f\_\{i\}\(w\)\-\\nabla f\_\{i\}\(v\)\\\|\\leq L\\\|w\-v\\\|\) and Assumption \(A3\)\.

##### Step 5: bounding the proximal cross term\.

For the second term in \([111](https://arxiv.org/html/2605.26162#A3.E111)\), use2​⟨a,b⟩≤‖a‖2\+‖b‖22\\langle a,b\\rangle\\leq\\\|a\\\|^\{2\}\+\\\|b\\\|^\{2\}witha=∇F​\(w¯t\)a=\\nabla F\(\\bar\{w\}^\{t\}\)andb=2​λ​\(wi,e−w~it\)b=2\\lambda\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\):

\(113\)−2​λ​⟨∇F​\(w¯t\),wi,e−w~it⟩≤14​‖∇F​\(w¯t\)‖22\+4​λ2​‖wi,e−w~it‖22\.\-2\\lambda\\left\\langle\\nabla F\(\\bar\{w\}^\{t\}\),\\ w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\\right\\rangle\\leq\\frac\{1\}\{4\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\+4\\lambda^\{2\}\\\|w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\.

##### Step 6: combining bounds\.

Substitute \([112](https://arxiv.org/html/2605.26162#A3.E112)\) and \([113](https://arxiv.org/html/2605.26162#A3.E113)\) into \([111](https://arxiv.org/html/2605.26162#A3.E111)\), sum overe=0,…,E−1e=0,\\dots,E\-1, and multiply byγt\\gamma\_\{t\}\. The14​‖∇F​\(w¯t\)‖2\\frac\{1\}\{4\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|^\{2\}terms in \([109](https://arxiv.org/html/2605.26162#A3.E109)\) and \([113](https://arxiv.org/html/2605.26162#A3.E113)\) are dominated by the leading negative term after summation \(yielding the coefficient12\\frac\{1\}\{2\}in front ofη​γt​E​‖∇F​\(w¯t\)‖2\\eta\\gamma\_\{t\}E\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|^\{2\}\)\. Collecting the remaining terms yields \([104](https://arxiv.org/html/2605.26162#A3.E104)\) after conditional expectation\.

##### Step 7: tightening the trajectory term \(explicit\)\.

For eachee,

wi,e−w¯t=\(wit\+13−w¯t\)\+\(wi,e−wit\+13\),w\_\{i,e\}\-\\bar\{w\}^\{t\}=\(w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\)\+\(w\_\{i,e\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\),so‖wi,e−w¯t‖2≤2​‖wit\+13−w¯t‖2\+2​‖wi,e−wit\+13‖2\\\|w\_\{i,e\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\\leq 2\\\|w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\+2\\\|w\_\{i,e\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\\|^\{2\}\. Moreover,wi,e−wit\+13=\(wi,e−w~it\)−\(wit\+13−w~it\)=ui,e−ui,0w\_\{i,e\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}=\(w\_\{i,e\}\-\\tilde\{w\}\_\{i\}^\{t\}\)\-\(w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\-\\tilde\{w\}\_\{i\}^\{t\}\)=u\_\{i,e\}\-u\_\{i,0\}, hence‖wi,e−wit\+13‖2≤2​‖ui,e‖2\+2​‖ui,0‖2\\\|w\_\{i,e\}\-w\_\{i\}^\{t\+\\frac\{1\}\{3\}\}\\\|^\{2\}\\leq 2\\\|u\_\{i,e\}\\\|^\{2\}\+2\\\|u\_\{i,0\}\\\|^\{2\}\. Summing overe=0,…,E−1e=0,\\dots,E\-1yields \([105](https://arxiv.org/html/2605.26162#A3.E105)\)\. ∎

### C\.6\.Main Theorem and Proof

We now present a fully explicit \(constant\-closed\) stationarity guarantee for PushCen\-ADFL\.

#### C\.6\.1\.Standing assumptions\.

We use Assumptions \(A1\)\-\(A6\) in Appendix[C\.2](https://arxiv.org/html/2605.26162#A3.SS2)and additionally adopt the following standard technical conditions that close all constants\.

##### Mass positivity and boundedness

There exist constants0<ymin≤ymax<∞0<y\_\{\\min\}\\leq y\_\{\\max\}<\\inftysuch that for all eventsttand all nodesii,yit∈\[ymin,ymax\]\.y\_\{i\}^\{t\}\\in\[y\_\{\\min\},y\_\{\\max\}\]\.Consequently, for the activated nodeiti\_\{t\}, the normalized mass weight

\(114\)γt≜yitt\+13Ytotsatisfiesγ¯≜yminN​ymax≤γt≤1\.\\gamma\_\{t\}\\triangleq\\frac\{y\_\{i\_\{t\}\}^\{t\+\\frac\{1\}\{3\}\}\}\{Y\_\{\\mathrm\{tot\}\}\}\\quad\\text\{satisfies\}\\quad\\underline\{\\gamma\}\\triangleq\\frac\{y\_\{\\min\}\}\{Ny\_\{\\max\}\}\\leq\\gamma\_\{t\}\\leq 1\.

##### Bounded stochastic gradient second moment

There existsG2<∞G^\{2\}<\\inftysuch that for alli,wi,wand any sampleξ\\xi,𝔼​\[‖gi​\(w;ξ\)‖22\]≤G2\.\\mathbb\{E\}\\big\[\\\|g\_\{i\}\(w;\\xi\)\\\|\_\{2\}^\{2\}\\big\]\\leq G^\{2\}\.

##### Bounded iterates

There existsW<∞W<\\inftysuch that for all eventsttand all nodesii,

\(115\)‖wit‖2≤Wand‖w~it‖2≤W\.\\\|w\_\{i\}^\{t\}\\\|\_\{2\}\\leq W\\qquad\\text\{and\}\\qquad\\\|\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}\\leq W\.In particular,‖wit−w~it‖22≤4​W2\\\|w\_\{i\}^\{t\}\-\\tilde\{w\}\_\{i\}^\{t\}\\\|\_\{2\}^\{2\}\\leq 4W^\{2\}for alli,ti,t\.

#### C\.6\.2\.Notation\.

LetF​\(w\)≜1N​∑i=1Nfi​\(w\)F\(w\)\\triangleq\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}f\_\{i\}\(w\)andF⋆≜infwF​\(w\)F^\{\\star\}\\triangleq\\inf\_\{w\}F\(w\)\. Letρ∈\(0,1\)\\rho\\in\(0,1\)andC1,C2,C3\>0C\_\{1\},C\_\{2\},C\_\{3\}\>0be the constants in Lemma[C\.3](https://arxiv.org/html/2605.26162#A3.Thmtheorem3)\. Letτ\\taube the staleness bound \(Assumption \(A4\)\) and letεc\\varepsilon\_\{c\}be the bounded absolute compression error \(Assumption \(A6\)\)\. Letdmax≜maxi⁡di\+d\_\{\\max\}\\triangleq\\max\_\{i\}d\_\{i\}^\{\+\}\. Define the uniform broadcast perturbation coefficient \(Lemma[C\.2](https://arxiv.org/html/2605.26162#A3.Thmtheorem2)\)

\(116\)c¯≜maxt⁡dit\+dit\+\+1⋅yitt\+13Ytot≤ymaxN​ymin\.\\overline\{c\}\\triangleq\\max\_\{t\}\\frac\{d\_\{i\_\{t\}\}^\{\+\}\}\{d\_\{i\_\{t\}\}^\{\+\}\+1\}\\cdot\\frac\{y\_\{i\_\{t\}\}^\{t\+\\frac\{1\}\{3\}\}\}\{Y\_\{\\mathrm\{tot\}\}\}\\leq\\frac\{y\_\{\\max\}\}\{Ny\_\{\\min\}\}\.
###### Theorem C\.7 \(Fully explicit stationarity bound \(constant\-closed\)\)\.

Suppose Assumptions \(A1\)\-\(A6\) and Assumptions[C\.6\.1](https://arxiv.org/html/2605.26162#A3.SS6.SSS1.Px1)hold\. Let the stepsize satisfy

\(117\)η≤min⁡\{18​L​E,14​λ\}\.\\eta\\leq\\min\\Big\\\{\\frac\{1\}\{8LE\},\\ \\frac\{1\}\{4\\lambda\}\\Big\\\}\.Run PushCen\-ADFL forTTevents\. Then the averaged stationarity measure obeys

1T​∑t=0T−1𝔼​\[‖∇F​\(w¯t\)‖22\]≤\\displaystyle\\frac\{1\}\{T\}\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\\big\[\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\\big\]\\;\\leq\\;2​\(F​\(w¯0\)−F⋆\)η​E​γ¯​T⏟optimization term\\displaystyle\\underbrace\{\\frac\{2\\big\(F\(\\bar\{w\}^\{0\}\)\-F^\{\\star\}\\big\)\}\{\\eta E\\,\\underline\{\\gamma\}\\,T\}\}\_\{\\text\{optimization term\}\}\+2​\(σ2\+ζ2\)⏟stochasticity & heterogeneity\\displaystyle\\;\+\\;\\underbrace\{2\(\\sigma^\{2\}\+\\zeta^\{2\}\)\}\_\{\\text\{stochasticity \\& heterogeneity\}\}\(118\)\+2​L​c¯2​εc2⏟compression drift\\displaystyle\\;\+\\;\\underbrace\{2L\\,\\overline\{c\}^\{\\,2\}\\varepsilon\_\{c\}^\{2\}\}\_\{\\text\{compression drift\}\}\+4​L2​Nγ¯​\(𝔼​\[ℰcon0\]T​\(1−ρ\)\+Bcon1−ρ\)⏟directed mixing / asynchrony\\displaystyle\\;\+\\;\\underbrace\{\\frac\{4L^\{2\}N\}\{\\underline\{\\gamma\}\}\\Bigg\(\\frac\{\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{0\}\]\}\{T\(1\-\\rho\)\}\+\\frac\{B\_\{\\mathrm\{con\}\}\}\{1\-\\rho\}\\Bigg\)\}\_\{\\text\{directed mixing / asynchrony\}\}\(119\)\+Lγ¯​DΔ⏟client drift accumulation\.\\displaystyle\\;\+\\;\\underbrace\{\\frac\{L\}\{\\underline\{\\gamma\}\}\\,D\_\{\\Delta\}\}\_\{\\text\{client drift accumulation\}\}\.
where the constantsDΔD\_\{\\Delta\}andBconB\_\{\\mathrm\{con\}\}are explicitly given by

\(120\)DΔ\\displaystyle D\_\{\\Delta\}≜2​E2​η2​\(G2\+16​λ2​W2\),\\displaystyle\\triangleq 2E^\{2\}\\eta^\{2\}\\Big\(G^\{2\}\+16\\lambda^\{2\}W^\{2\}\\Big\),\(121\)Bcon\\displaystyle B\_\{\\mathrm\{con\}\}≜\(C1\+C2​τ2\)​DΔ\+C3​εc2\.\\displaystyle\\triangleq\\big\(C\_\{1\}\+C\_\{2\}\\tau^\{2\}\\big\)D\_\{\\Delta\}\+C\_\{3\}\\varepsilon\_\{c\}^\{2\}\.In particular, asT→∞T\\to\\infty, the iterates converge to a stationary neighborhood whose radius is upper bounded by the right\-hand side of \([119](https://arxiv.org/html/2605.26162#A3.E119)\) without the1/T1/Tterms\.

###### Proof\.

We proceed by combining the descent inequality forF​\(w¯t\)F\(\\bar\{w\}^\{t\}\)with the consensus recursion\.

##### Step 1: one\-event descent\.

By Lemma[C\.6](https://arxiv.org/html/2605.26162#A3.Thmtheorem6)\(tight form, after bounding the trajectory term inside its proof\), there exist absolute constantsα0,α1,α2,α3\>0\\alpha\_\{0\},\\alpha\_\{1\},\\alpha\_\{2\},\\alpha\_\{3\}\>0such that for each eventttwith activated nodeiti\_\{t\},

𝔼​\[F​\(w¯t\+1\)∣ℱt\]≤\\displaystyle\\mathbb\{E\}\\\!\\left\[F\(\\bar\{w\}^\{t\+1\}\)\\mid\\mathcal\{F\}\_\{t\}\\right\]\\;\\leq\\;F​\(w¯t\)−α0​η​γt​E​‖∇F​\(w¯t\)‖22\\displaystyle F\(\\bar\{w\}^\{t\}\)\-\\alpha\_\{0\}\\,\\eta\\gamma\_\{t\}E\\,\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\+α1​η​γt​E​\(σ2\+ζ2\)\\displaystyle\\;\+\\;\\alpha\_\{1\}\\,\\eta\\gamma\_\{t\}E\\,\(\\sigma^\{2\}\+\\zeta^\{2\}\)\+α2​η​γt​L2​E​‖witt\+13−w¯t‖22\\displaystyle\\;\+\\;\\alpha\_\{2\}\\,\\eta\\gamma\_\{t\}L^\{2\}\\,E\\,\\\|w\_\{i\_\{t\}\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}\+L2​γt2​𝔼​\[‖Δitt‖22∣ℱt\]\\displaystyle\\;\+\\;\\frac\{L\}\{2\}\\gamma\_\{t\}^\{2\}\\,\\mathbb\{E\}\\\!\\left\[\\\|\\Delta\_\{i\_\{t\}\}^\{t\}\\\|\_\{2\}^\{2\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\(122\)\+α3​εc2\.\\displaystyle\\;\+\\;\\alpha\_\{3\}\\,\\varepsilon\_\{c\}^\{2\}\.
Moreover, Lemma[C\.2](https://arxiv.org/html/2605.26162#A3.Thmtheorem2)implies the broadcast perturbation contributes at most2​L​c¯2​εc22L\\overline\{c\}^\{\\,2\}\\varepsilon\_\{c\}^\{2\}after taking expectations, which is absorbed into theα3​εc2\\alpha\_\{3\}\\varepsilon\_\{c\}^\{2\}term\.

##### Step 2: explicit drift bound\.

By Lemma[C\.4](https://arxiv.org/html/2605.26162#A3.Thmtheorem4)and Lemma[C\.5](https://arxiv.org/html/2605.26162#A3.Thmtheorem5), for each eventtt,

\(123\)𝔼​\[‖Δitt‖22\]≤2​E2​η2​\(G2\+4​λ2​supe𝔼​‖wit,e−w~itt‖22\)\.\\mathbb\{E\}\\big\[\\\|\\Delta\_\{i\_\{t\}\}^\{t\}\\\|\_\{2\}^\{2\}\\big\]\\leq 2E^\{2\}\\eta^\{2\}\\Big\(G^\{2\}\+4\\lambda^\{2\}\\,\\sup\_\{e\}\\mathbb\{E\}\\\|w\_\{i\_\{t\},e\}\-\\tilde\{w\}\_\{i\_\{t\}\}^\{t\}\\\|\_\{2\}^\{2\}\\Big\)\.Under Assumption[C\.6\.1](https://arxiv.org/html/2605.26162#A3.SS6.SSS1.Px3),‖wit,e−w~itt‖2≤4​W2\\\|w\_\{i\_\{t\},e\}\-\\tilde\{w\}\_\{i\_\{t\}\}^\{t\}\\\|^\{2\}\\leq 4W^\{2\}for allee, hence

\(124\)𝔼​\[‖Δitt‖22\]≤2​E2​η2​\(G2\+16​λ2​W2\)≜DΔ\.\\mathbb\{E\}\\big\[\\\|\\Delta\_\{i\_\{t\}\}^\{t\}\\\|\_\{2\}^\{2\}\\big\]\\leq 2E^\{2\}\\eta^\{2\}\\Big\(G^\{2\}\+16\\lambda^\{2\}W^\{2\}\\Big\)\\;\\triangleq\\;D\_\{\\Delta\}\.Usingγt≤1\\gamma\_\{t\}\\leq 1and taking full expectation in \([122](https://arxiv.org/html/2605.26162#A3.E122)\) yields

α0​η​E​𝔼​\[γt​‖∇F​\(w¯t\)‖22\]≤\\displaystyle\\alpha\_\{0\}\\,\\eta E\\,\\mathbb\{E\}\\\!\\left\[\\gamma\_\{t\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|\_\{2\}^\{2\}\\right\]\\;\\leq\\;𝔼​\[F​\(w¯t\)−F​\(w¯t\+1\)\]\\displaystyle\\mathbb\{E\}\\\!\\left\[F\(\\bar\{w\}^\{t\}\)\-F\(\\bar\{w\}^\{t\+1\}\)\\right\]\+α1​η​E​\(σ2\+ζ2\)\\displaystyle\\;\+\\;\\alpha\_\{1\}\\,\\eta E\\,\(\\sigma^\{2\}\+\\zeta^\{2\}\)\+α2​η​L2​E​𝔼​\[‖witt\+13−w¯t‖22\]\\displaystyle\\;\+\\;\\alpha\_\{2\}\\,\\eta L^\{2\}E\\,\\mathbb\{E\}\\\!\\left\[\\\|w\_\{i\_\{t\}\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|\_\{2\}^\{2\}\\right\]\(125\)\+L2​DΔ\+α3​εc2\.\\displaystyle\\;\+\\;\\frac\{L\}\{2\}\\,D\_\{\\Delta\}\\;\+\\;\\alpha\_\{3\}\\,\\varepsilon\_\{c\}^\{2\}\.

##### Step 3: summing the consensus\-related terms\.

Since‖witt\+13−w¯t‖2≤N​ℰcont\\\|w\_\{i\_\{t\}\}^\{t\+\\frac\{1\}\{3\}\}\-\\bar\{w\}^\{t\}\\\|^\{2\}\\leq N\\,\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}up to an absolute factor, it suffices to control∑t𝔼​\[ℰcont\]\\sum\_\{t\}\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\]\. Lemma[C\.3](https://arxiv.org/html/2605.26162#A3.Thmtheorem3)and \([124](https://arxiv.org/html/2605.26162#A3.E124)\) imply

\(126\)𝔼​\[ℰcont\+1\]≤ρ​𝔼​\[ℰcont\]\+\(C1\+C2​τ2\)​DΔ\+C3​εc2≜ρ​𝔼​\[ℰcont\]\+Bcon\.\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\+1\}\]\\leq\\rho\\,\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\]\+\\Big\(C\_\{1\}\+C\_\{2\}\\tau^\{2\}\\Big\)D\_\{\\Delta\}\+C\_\{3\}\\varepsilon\_\{c\}^\{2\}\\;\\triangleq\\;\\rho\\,\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\]\+B\_\{\\mathrm\{con\}\}\.Summing \([126](https://arxiv.org/html/2605.26162#A3.E126)\) overt=0,…,T−1t=0,\\dots,T\-1and using∑t=0T−1ρt≤11−ρ\\sum\_\{t=0\}^\{T\-1\}\\rho^\{t\}\\leq\\frac\{1\}\{1\-\\rho\}yields

\(127\)∑t=0T−1𝔼​\[ℰcont\]≤𝔼​\[ℰcon0\]1−ρ\+T​Bcon1−ρ\.\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{t\}\]\\leq\\frac\{\\mathbb\{E\}\[\\mathcal\{E\}\_\{\\mathrm\{con\}\}^\{0\}\]\}\{1\-\\rho\}\+\\frac\{T\\,B\_\{\\mathrm\{con\}\}\}\{1\-\\rho\}\.

##### Step 4: telescoping and usingγt≥γ¯\\gamma\_\{t\}\\geq\\underline\{\\gamma\}\.

Summing \([125](https://arxiv.org/html/2605.26162#A3.E125)\) overt=0,…,T−1t=0,\\dots,T\-1telescopes the objective values:

∑t=0T−1𝔼​\[F​\(w¯t\)−F​\(w¯t\+1\)\]=F​\(w¯0\)−𝔼​\[F​\(w¯T\)\]≤F​\(w¯0\)−F⋆\.\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\\\!\\left\[F\(\\bar\{w\}^\{t\}\)\-F\(\\bar\{w\}^\{t\+1\}\)\\right\]=F\(\\bar\{w\}^\{0\}\)\-\\mathbb\{E\}\[F\(\\bar\{w\}^\{T\}\)\]\\leq F\(\\bar\{w\}^\{0\}\)\-F^\{\\star\}\.By Assumption[C\.6\.1](https://arxiv.org/html/2605.26162#A3.SS6.SSS1.Px1),γt≥γ¯\\gamma\_\{t\}\\geq\\underline\{\\gamma\}, hence

∑t=0T−1𝔼​‖∇F​\(w¯t\)‖2≤1γ¯​∑t=0T−1𝔼​\[γt​‖∇F​\(w¯t\)‖2\]\.\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|^\{2\}\\leq\\frac\{1\}\{\\underline\{\\gamma\}\}\\sum\_\{t=0\}^\{T\-1\}\\mathbb\{E\}\\\!\\left\[\\gamma\_\{t\}\\\|\\nabla F\(\\bar\{w\}^\{t\}\)\\\|^\{2\}\\right\]\.Plugging \([127](https://arxiv.org/html/2605.26162#A3.E127)\) into the sum of \([125](https://arxiv.org/html/2605.26162#A3.E125)\), dividing byα0​η​E​γ¯​T\\alpha\_\{0\}\\eta E\\,\\underline\{\\gamma\}\\,Tand absorbing fixed numerical factors into the explicit coefficients yields \([118](https://arxiv.org/html/2605.26162#A3.E118)\)\-\([119](https://arxiv.org/html/2605.26162#A3.E119)\) withDΔD\_\{\\Delta\}andBconB\_\{\\mathrm\{con\}\}given in \([120](https://arxiv.org/html/2605.26162#A3.E120)\)\-\([121](https://arxiv.org/html/2605.26162#A3.E121)\)\. ∎

## Appendix DSupplementary Experiments

### D\.1\.Accuracy Curves

To provide a more complete view of the learning dynamics beyond the final averaged accuracy, we report accuracy curves for all datasets\. Specifically, we include \(i\)*global*test accuracy trajectories across communication rounds and \(ii\) accuracy trajectories of*delayed clients*measured on a pseudo\-time axis that reflects asynchronous progress\. These curves complement the main results by visualizing convergence speed, stability and robustness under system asynchrony\.

#### D\.1\.1\.Global Accuracy Curves

This subsection reports the evolution of global test accuracy across communication rounds\. Figures[3\(a\)](https://arxiv.org/html/2605.26162#A4.F3.sf1)\-[3\(c\)](https://arxiv.org/html/2605.26162#A4.F3.sf3)show the global accuracy curves on CIFAR\-10, CIFAR\-100 and Tiny\-ImageNet, respectively\. These curves illustrate the overall convergence behavior of different methods, including their convergence speed and final performance, under identical experimental settings\.

![Refer to caption](https://arxiv.org/html/2605.26162v1/images/cifar10_global.png)Global test accuracy curves on CIFAR\-10\.

\(a\)CIFAR\-10
![Refer to caption](https://arxiv.org/html/2605.26162v1/images/cifar100_global.png)Global test accuracy curves on CIFAR\-100\.

\(b\)CIFAR\-100
![Refer to caption](https://arxiv.org/html/2605.26162v1/images/tiny_imagenet_global.png)Global test accuracy curves on Tiny\-ImageNet\.

\(c\)Tiny\-ImageNet

Figure 3\.Global test accuracy curves on CIFAR\-10, CIFAR\-100, and Tiny\-ImageNet\.
#### D\.1\.2\.Delayed Client Accuracy Curves

This subsection presents accuracy curves for delayed clients, evaluated along a pseudo\-time axis that reflects asynchronous local progress\. Figures[4](https://arxiv.org/html/2605.26162#A4.F4),[6](https://arxiv.org/html/2605.26162#A4.F6)and[8](https://arxiv.org/html/2605.26162#A4.F8)report the average test accuracy across delayed clients on CIFAR\-10, CIFAR\-100 and Tiny\-ImageNet, respectively\. In addition, Figures[5](https://arxiv.org/html/2605.26162#A4.F5),[7](https://arxiv.org/html/2605.26162#A4.F7)and[9](https://arxiv.org/html/2605.26162#A4.F9)visualize the accuracy trajectories of representative individual delayed clients, providing a fine\-grained view of training stability under client delays\.

![Refer to caption](https://arxiv.org/html/2605.26162v1/images/cifar10_delayed_clients_avg_acc.png)

Average test accuracy of delayed clients on Cifar10\.

Figure 4\.Average test accuracy of delayed clients on Cifar10\.![Refer to caption](https://arxiv.org/html/2605.26162v1/images/cifar10_delayed_clients_acc.png)Accuracy curves of representative delayed clients on CIFAR\-10 \(pseudo\-time axis\)\.

Figure 5\.Accuracy curves of representative delayed clients on CIFAR\-10 \(pseudo\-time axis\)\.![Refer to caption](https://arxiv.org/html/2605.26162v1/images/cifar100_delayed_clients_avg_acc.png)

Average test accuracy of delayed clients on Cifar100\.

Figure 6\.Average test accuracy of delayed clients on Cifar100\.![Refer to caption](https://arxiv.org/html/2605.26162v1/images/cifar100_delayed_clients_acc.png)Accuracy curves of representative delayed clients on CIFAR\-100\.

Figure 7\.Accuracy curves of representative delayed clients on CIFAR\-100\.![Refer to caption](https://arxiv.org/html/2605.26162v1/images/tiny_imagenet_delayed_clients_avg_acc.png)

Average test accuracy of delayed clients on Tiny\-ImageNet\.

Figure 8\.Average test accuracy of delayed clients on Tiny\-ImageNet\.![Refer to caption](https://arxiv.org/html/2605.26162v1/images/tiny_imagenet_delayed_clients_acc.png)Accuracy curves of representative delayed clients on Tiny\-ImageNet\.

Figure 9\.Accuracy curves of representative delayed clients on Tiny\-ImageNet\.

Similar Articles

Federated Learning

ML at Berkeley

The article explains the concept of Federated Learning as a privacy-preserving machine learning technique that trains models on local devices rather than central servers. It details the process of encrypted parameter updates and aggregation to mitigate data leakage risks while maintaining model performance.