OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization

arXiv cs.LG 05/21/26, 04:00 AM Papers
Summary
OmniISR proposes a unified framework combining centralized and federated learning via intermediate supervision and regularization at hidden layers, offering theoretical convergence guarantees and reducing the CL–FL gap by 22.60%.
arXiv:2605.20276v1 Announce Type: new Abstract: The global deployment of edge intelligence operates across heterogeneous legal frameworks. While some regions permit centralized learning (CL) via cloud data aggregation, others enforce strict data localization, necessitating federated learning (FL). This operational dichotomy introduces two incompatible optimization regimes (i.e., unbiased global gradients yet coupled with internal covariate shift in CL versus biased, drift-prone local updates in FL), resulting in that any naive integration of the two lacks rigorous theoretical guarantees. To fill this gap, we propose OmniISR, a unified framework that fuses pure CL, pure FL, and hybrid CL-FL training modes via equipping intermediate supervision and regularization (ISR) signals at multiple hidden layers. Specifically, we propose (i) to use mutual-information (MI) as intermediate supervision to align shifting internal covariate in CL and client-drifting representations in FL, and (ii) to adopt negative-entropy (NE) as intermediate regularizer to penalize overconfident prediction, preserve representational uncertainty, and avoid device-specific collapse. On the theory side, we derive (i) a unified, ISR-agnostic, and non-asymptotic O(1/sqrt(T)) convergence bound that shows the introduced ISR does not violate standard SGD convergence, (ii) a federated drift-bound that quantifies the ISR-reduced client drift, (iii) a gradient-alignment guarantee that ensures non-conflicting CL and FL updates under mild bias, and (iv) an explicit escape-time bound that indicates that CL-FL hybrid mixing enlarges effective stochasticity and accelerates escape from strict saddles. Extensive experiments demonstrate that OmniISR consistently improves model performance in both centralized and federated paradigms, reduces the CL-FL gap by 22.60%, and yields 37/48 paired metric wins across multiple FL algorithms.
Cached at: 05/21/26, 06:21 AM
# OmniISR: A Unified Framework for Centralized and Federated Learning via Intermediate Supervision and Regularization
Source: [https://arxiv.org/html/2605.20276](https://arxiv.org/html/2605.20276)
Wei-Bin Kou, Guangxu Zhu, Ming Tang, Chen Zhang, Lisheng Wu, Lei Zhou, Yujiu Yang∗

Wei-Bin Kou and Yujiu Yang are with Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China. Guangxu Zhu is with Shenzhen Research Institute of Big Data, Shenzhen, China. Ming Tang is with the Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China. Chen Zhang is with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China. Lisheng Wu and Lei Zhou are with Yinwang Intelligent Technology Co. Ltd., Shenzhen, China. Corresponding author: Yujiu Yang.

###### Abstract

The global deployment of edge intelligence (e.g., autonomous driving) operates across heterogeneous legal frameworks. While some regions permit centralized learning (CL) via cloud data aggregation, others enforce strict data localization, necessitating federated learning (FL). This operational dichotomy introduces two fundamentally incompatible optimization regimes (i.e., unbiased global gradients coupled with internal covariate shift in CL versus biased, drift-prone local updates in FL), so that any naive integration of the two lacks rigorous theoretical guarantees. To fill this gap, we propose OmniISR, a unified framework that fuses pure CL, pure FL, and hybrid CL–FL training modes by attaching intermediate supervision and regularization (ISR) signals at multiple hidden layers. Specifically, we propose (i) to use mutual information (MI) as intermediate supervision to align shifting internal covariates in CL and client-drifting representations in FL, and (ii) to adopt negative entropy (NE) as an intermediate regularizer to penalize overconfident predictions, preserve representational uncertainty, and avoid device-specific collapse. On the theory side, we derive (i) a unified, ISR-agnostic, and non-asymptotic $\mathcal{O}(1/\sqrt{T})$ convergence bound showing that the introduced ISR does not violate standard SGD convergence, (ii) a federated drift bound that quantifies the ISR-reduced client drift, (iii) a gradient-alignment guarantee that ensures non-conflicting CL and FL updates under mild bias, and (iv) an explicit escape-time bound indicating that CL–FL hybrid mixing enlarges effective stochasticity and accelerates escape from strict saddles. Extensive experiments across multiple model architectures, datasets, and FL algorithms demonstrate that OmniISR consistently improves model performance in both centralized and federated paradigms, reduces the CL–FL gap by 22.60%, and yields 37/48 paired metric wins across multiple FL algorithms.

###### Index Terms:

Unified Learning Framework, Theoretical Guarantees, Intermediate Supervision, Intermediate Regularization, Federated Client-drift Control, Saddle-escape Time Bound, $\epsilon$-Stationarity Complexity Analysis

## 1 Introduction

The advent of edge intelligence has revolutionized large-scale distributed systems, with autonomous driving (AD) serving as a paramount application [[1](https://arxiv.org/html/2605.20276#bib.bib1), [2](https://arxiv.org/html/2605.20276#bib.bib2), [3](https://arxiv.org/html/2605.20276#bib.bib3)]. To continuously refine AD models, AD fleets must collect vast amounts of driving data. However, this data collection is increasingly constrained by divergent data governance and privacy regulations. In certain jurisdictions, data can be aggregated in the cloud for centralized learning (CL). Conversely, in regions governed by stringent privacy frameworks, such as the General Data Protection Regulation (GDPR) in the European Union and the Data Security Law in China, raw data is classified as sensitive information and is strictly prohibited from cross-device or cross-border transmission. In these regulatory domains, Federated Learning (FL) [[4](https://arxiv.org/html/2605.20276#bib.bib4), [5](https://arxiv.org/html/2605.20276#bib.bib5), [6](https://arxiv.org/html/2605.20276#bib.bib6)] emerges as the legally compliant methodology for model enhancement. Consequently, globally deployed edge intelligence systems must operate under a compatible training paradigm. This necessitates a unified optimization framework that seamlessly integrates CL and FL to ensure consistent and robust model performance across diverse international markets.

However, building such a unified framework is far from straightforward\. The two paradigms differ fundamentally in their data distribution and optimization dynamics\. Naively alternating between centralized and federated updates without principled coordination offers no convergence guarantees and may introduce pathological gradient interference\. A rigorous unification must confront three deeply intertwined challenges that span both the theoretical and structural dimensions of the learning process\.

First, fundamentally divergent optimization dynamics between CL and FL impede unified training. In pure CL, data is generally assumed to be independent and identically distributed (IID), allowing optimizers (e.g., Adam, SGD) to descend smoothly along the loss landscape. In contrast, FL in edge scenarios is characterized by highly non-IID data distributions [[7](https://arxiv.org/html/2605.20276#bib.bib7), [8](https://arxiv.org/html/2605.20276#bib.bib8)] and system heterogeneity (e.g., stragglers) [[9](https://arxiv.org/html/2605.20276#bib.bib9), [10](https://arxiv.org/html/2605.20276#bib.bib10), [11](https://arxiv.org/html/2605.20276#bib.bib11)]. Formulating a unified mechanism requires rigorously answering: How can we mathematically guarantee convergence across such disparate optimization landscapes? Furthermore, it is imperative to prove that the exact gradients derived from centralized data do not engage in “gradient conflict” with the aggregated pseudo-gradients from federated clients, but rather operate synergistically to accelerate escape from local optima.

![Refer to caption](https://arxiv.org/html/2605.20276v1/omni_isr_overview.jpg)

Figure 1: Overview of the mechanism of intermediate supervision and regularization in the proposed OmniISR framework.

Second, non-IID data exacerbates the independent drift of latent representations across distributed clients in FL [[12](https://arxiv.org/html/2605.20276#bib.bib12)]. In deep learning architectures, supervision is conventionally applied only at the output layer. While this single-point supervision suffices for centralized IID data, in a federated non-IID context, the lack of explicit constraints on hidden layers causes the latent features of different clients to drift independently. This representation drift severely degrades the aggregated global model’s performance, an issue that output-layer-only supervision is mathematically ill-equipped to resolve.

Third, mere intermediate supervision compromises generalization. A seemingly intuitive solution to representation drift is to introduce supervision at intermediate layers to anchor the hidden features. However, enforcing strict intermediate supervision forces the hidden representations to become highly scenario- or task-informed. This deterministic feature alignment reduces the model’s flexibility, causing it to overfit to the specific distributions of the training clients and drastically reducing its generalization capability in unseen scenarios.

To systematically address these three challenges, we propose OmniISR, a unified training framework for edge intelligence that seamlessly operates across pure CL, pure FL, and hybrid CL–FL paradigms. The core design principle of OmniISR is that deep networks deployed with heterogeneous distributions require explicit, diversified guidance at hidden layers rather than merely at the output. Concretely, OmniISR embodies an intermediate supervision and regularization (ISR) mechanism comprising the following three integral strategies.

1. Architecture-Agnostic Intermediate Layer Selection: we select multiple ISR layers within the network based on architecture-agnostic criteria, such as transitional layers between blocks (e.g., ResNet stages or Transformer layers), downsampling layers, bottleneck layers, or feature fusion layers, ensuring applicability across CNN-based, Transformer-based, and hybrid architectures without requiring architecture-specific redesign.
2. Heterogeneous Supervision at Intermediate Layers: at each selected intermediate point, OmniISR computes the mutual information (MI) between the latent features and the ground truth, using this as a heterogeneous supervision signal distinct from the output layer’s cross-entropy objective. By maximizing the shared information between hidden representations and task labels, MI supervision guides intermediate layers to learn discriminative yet hierarchically diverse features, effectively constraining representation drift without forcing hidden features to specialize prematurely.
3. Negative Entropy (NE) Regularization on Intermediate Activations: OmniISR imposes NE regularization on the latent activation distributions at each selected intermediate layer, explicitly penalizing peaked, overconfident hidden features, injecting necessary uncertainty, and thereby enhancing the model’s generalization to unseen scenarios.

These three integral strategies are illustrated in [Fig. 1](https://arxiv.org/html/2605.20276#S1.F1).

![Refer to caption](https://arxiv.org/html/2605.20276v1/OmniISR_three_modes.jpg)

Figure 2: Illustration of the three modes of the proposed OmniISR framework.

OmniISR applies the aforementioned ISR mechanism across three modes: Pure CL, Pure FL, and Hybrid CL–FL. In the CL mode, all intermediate losses and regularizers are combined with the output-layer loss into a single objective, which is optimized until convergence. In the FL mode, training proceeds over multiple communication rounds. Within each round, every client first minimizes a weighted combination of intermediate losses, regularizers, and the final loss on its private data for several local iterations, after which the central server aggregates the updated local models across all participants without exposing raw data. In the Hybrid CL–FL mode, each update blends an exact cloud gradient with a federated pseudo-gradient through a mixing weight, scheduled via one of three strategies: alternating between pure CL and FL rounds, a fixed mixing ratio, or an adaptive weight adjusted by gradient similarity. Beyond leveraging larger data volumes, this hybrid procedure combines centralized and federated noise sources, increasing effective stochasticity and thereby helping the model escape saddle points and explore the loss landscape more effectively. The three modes are illustrated in [Fig. 2](https://arxiv.org/html/2605.20276#S1.F2).
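To make the FL aggregation and the hybrid blending step more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the mixing weight $\alpha_t$ multiplies the cloud gradient and $1-\alpha_t$ multiplies the federated pseudo-gradient, that parameters are flat lists of floats, and that the adaptive rule simply rescales cosine similarity to $[0,1]$. The function names (`fedavg_aggregate`, `mixing_weight`, `hybrid_step`) and the adaptive rule are hypothetical illustrations.

```python
import math

def fedavg_aggregate(client_params, client_sizes):
    """Size-weighted average of client parameter vectors,
    using weights w_n = |D_n| / sum_k |D_k|."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(n / total * params[j] for params, n in zip(client_params, client_sizes))
        for j in range(dim)
    ]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def mixing_weight(strategy, t, g_cl, g_fl, fixed=0.5):
    """Return alpha_t in [0, 1]. Assumed convention: alpha_t weights the
    cloud gradient, 1 - alpha_t weights the federated pseudo-gradient."""
    if strategy == "alternate":
        # Pure CL on even rounds, pure FL on odd rounds.
        return 1.0 if t % 2 == 0 else 0.0
    if strategy == "fixed":
        return fixed
    if strategy == "adaptive":
        # Hypothetical rule: map cosine similarity from [-1, 1] to [0, 1].
        return 0.5 * (1.0 + cosine(g_cl, g_fl))
    raise ValueError(f"unknown strategy: {strategy}")

def hybrid_step(theta, g_cl, g_fl, alpha, lr):
    """One blended update: theta - lr * (alpha * g_cl + (1 - alpha) * g_fl)."""
    return [p - lr * (alpha * a + (1.0 - alpha) * b)
            for p, a, b in zip(theta, g_cl, g_fl)]

# Usage: two clients holding 1 and 3 samples, then one blended update.
global_params = fedavg_aggregate([[1.0, 2.0], [3.0, 4.0]], [1, 3])
alpha = mixing_weight("adaptive", 0, [1.0, 0.0], [0.0, 1.0])
theta_next = hybrid_step([1.0, 1.0], [1.0, 0.0], [0.0, 1.0], alpha, lr=0.1)
```

Under this convention the alternating schedule is the special case where $\alpha_t$ switches between 1 and 0, so all three strategies share the same update routine.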

On the theoretical front, we derive a unified, ISR-independent, and non-asymptotic $\mathcal{O}(1/\sqrt{T})$ convergence bound for OmniISR across CL, FL, and hybrid modes, proving that OmniISR maintains standard non-convex SGD rates. Our analysis explicitly characterizes how finite-time constants scale with the number of intermediate points $M$ and their associated weights $\{\alpha_m, \lambda_m\}_{m=1}^{M}$. In the FL setting, we establish an $\mathcal{O}(E^{2}H)$ drift bound, where $E$ denotes local epochs and $H$ quantifies non-IID data heterogeneity, and demonstrate that ISR reduces effective client drift by stabilizing hidden representations. For the hybrid mode, the bound isolates an explicit bias floor $B_{\mathrm{eff}}$ caused by cloud–device representativeness gaps, alongside an effective variance $\sigma_{\mathrm{eff}}^{2}$ resulting from mixed noise sources. Furthermore, we prove that CL and FL gradients satisfy $\mathbb{E}[\langle \mathbf{g}_{\mathrm{CL}}, \mathbf{g}_{\mathrm{FL}} \rangle] \geq 0$ under mild assumptions on the overlap between centralized and federated data distributions. This confirms that both gradient sources cooperate to escape local optima, providing the first formal justification for mixed-paradigm training. Ultimately, these theoretical guarantees translate hyperparameters (e.g., $M$, $E$, and the mixing weight $\alpha$) into actionable design principles for stable and unified training.

In summary, the main contributions of this work are highlighted as follows:

- We propose OmniISR, a unified optimization framework for CL, FL, and hybrid CL–FL training under one objective, with architecture-agnostic introduction of ISR. This directly targets real deployments where the training mode is policy-dependent across regions.
- We introduce a coupled intermediate design that combines heterogeneous MI supervision (not output-layer CE replication) and NE regularization. This coupling is intended to jointly address two competing requirements in non-IID training: representation-drift suppression and generalization preservation.
- Theoretically, we demonstrate that (i) the ISR signals in OmniISR do not violate standard SGD convergence in any working mode, (ii) OmniISR reduces heterogeneity-caused client drift, (iii) cloud and on-device hybrid updates operate synergistically rather than destructively, and (iv) the combined stochasticity of hybrid CL–FL training accelerates escape from suboptimal saddle points.
- Empirically, we evaluate OmniISR across multiple model architectures, datasets, and FL algorithms. Beyond absolute gains, OmniISR narrows the CL–FL performance gap by 22.60%, shows broad cross-FL-algorithm positive transferability with 37/48 metric wins in paired comparisons, and clarifies how intermediate point number, spacing, and placement impact OmniISR’s effectiveness via comprehensive ablations.

The remainder of this paper proceeds as follows. [Section 2](https://arxiv.org/html/2605.20276#S2) reviews related work. [Section 3](https://arxiv.org/html/2605.20276#S3) details the proposed OmniISR and its theoretical guarantees. [Section 4](https://arxiv.org/html/2605.20276#S4) presents experiments and ablations. [Section 5](https://arxiv.org/html/2605.20276#S5) concludes this paper.

## 2 Related Works

### 2.1 Centralized Learning Optimization

Centralized learning (CL) optimization encompasses a broad suite of algorithms for minimizing loss functions over high-dimensional parameter spaces. Stochastic gradient descent (SGD) and back-propagation lay the groundwork [[13](https://arxiv.org/html/2605.20276#bib.bib13)], while adaptive methods such as Adam [[14](https://arxiv.org/html/2605.20276#bib.bib14)] and RMSprop provide adaptive learning rates that improve convergence stability across architectures. Regularization strategies, including dropout [[15](https://arxiv.org/html/2605.20276#bib.bib15)], L1/L2 penalties [[16](https://arxiv.org/html/2605.20276#bib.bib16)], and sharpness-aware minimization [[17](https://arxiv.org/html/2605.20276#bib.bib17)], are critical for preventing overfitting and ensuring robust generalization. Normalization techniques such as batch normalization [[18](https://arxiv.org/html/2605.20276#bib.bib18)] and layer normalization [[19](https://arxiv.org/html/2605.20276#bib.bib19)] further stabilize training dynamics.

Despite these advances, training very deep networks still suffers from progressively weakening gradients [[20](https://arxiv.org/html/2605.20276#bib.bib20), [21](https://arxiv.org/html/2605.20276#bib.bib21)] and under-optimized intermediate features [[22](https://arxiv.org/html/2605.20276#bib.bib22)], particularly when supervision is applied solely at the output layer. This motivates the study of intermediate supervision, discussed next.

### 2.2 Federated Learning Optimization

Federated Learning (FL) [[23](https://arxiv.org/html/2605.20276#bib.bib23), [24](https://arxiv.org/html/2605.20276#bib.bib24), [25](https://arxiv.org/html/2605.20276#bib.bib25)] enables distributed devices to collaboratively train a shared model while keeping raw data private [[4](https://arxiv.org/html/2605.20276#bib.bib4)]. FedAvg [[4](https://arxiv.org/html/2605.20276#bib.bib4)] is the de facto baseline, where clients perform local SGD and upload updates for server-side aggregation. However, the non-IID nature of distributed data, arising from diverse geographic environments, weather conditions, and traffic patterns, causes significant performance degradation and slow convergence.

A rich line of work has been proposed to mitigate these challenges. FedProx [[26](https://arxiv.org/html/2605.20276#bib.bib26)] adds a proximal term to penalize local deviation from the global model. SCAFFOLD [[12](https://arxiv.org/html/2605.20276#bib.bib12)] introduces control variates to correct client drift. FedDyn [[27](https://arxiv.org/html/2605.20276#bib.bib27)] uses dynamic regularization. FedAvgM [[28](https://arxiv.org/html/2605.20276#bib.bib28)] applies server-side momentum. FedIR [[29](https://arxiv.org/html/2605.20276#bib.bib29)] addresses class imbalance through importance reweighting. MOON [[30](https://arxiv.org/html/2605.20276#bib.bib30)] leverages model-contrastive learning. BalanceFL [[31](https://arxiv.org/html/2605.20276#bib.bib31)] addresses long-tail class imbalance. In the AD-specific context, FedRC [[32](https://arxiv.org/html/2605.20276#bib.bib32)] and FedGau [[33](https://arxiv.org/html/2605.20276#bib.bib33)] accelerate hierarchical FL convergence; FedEMA [[34](https://arxiv.org/html/2605.20276#bib.bib34)] integrates exponential moving averaging with negative entropy regularization; pFedLVM [[35](https://arxiv.org/html/2605.20276#bib.bib35)] personalizes federated models via large vision model features; and FedDrive [[36](https://arxiv.org/html/2605.20276#bib.bib36)] generalizes FL to semantic segmentation in AD. Communication-constrained and hierarchical FL for AD has also been studied [[37](https://arxiv.org/html/2605.20276#bib.bib37)], along with contrastive-divergence approaches to non-IID mitigation [[38](https://arxiv.org/html/2605.20276#bib.bib38)].

Despite these algorithmic advances, existing FL methods address non-IID challenges primarily through aggregation-level or loss-level corrections *at the output layer*. None of these works introduces supervision and regularization at intermediate layers within the federated training loop. OmniISR fills this gap.

### 2.3 Intermediate Supervision

Intermediate supervision [[39](https://arxiv.org/html/2605.20276#bib.bib39)] augments output-layer-only supervision by injecting auxiliary losses at intermediate layers, thereby providing explicit guidance for hidden representations. Early instances include GoogLeNet’s two auxiliary classifiers at intermediate stages [[40](https://arxiv.org/html/2605.20276#bib.bib40)] and DSN’s auxiliary supervision branches [[41](https://arxiv.org/html/2605.20276#bib.bib41)]. Subsequent work has extended the paradigm to dense-prediction tasks. For example, PSPNet [[42](https://arxiv.org/html/2605.20276#bib.bib42)] adds an auxiliary classifier for pixel-wise cross-entropy on pyramid pooling features. BiSeNet [[43](https://arxiv.org/html/2605.20276#bib.bib43)] applies supervised branches to spatial and context paths. Gated-SCNN [[44](https://arxiv.org/html/2605.20276#bib.bib44)] introduces shape-based intermediate losses. ICNet [[45](https://arxiv.org/html/2605.20276#bib.bib45)] attaches auxiliary losses to low-resolution intermediate predictions in a cascaded framework. More recently, contrastive intermediate supervision [[46](https://arxiv.org/html/2605.20276#bib.bib46)] has been explored, and a comprehensive review of intermediate supervision theories and applications is provided in [[47](https://arxiv.org/html/2605.20276#bib.bib47)].

While intermediate supervision has proven effective, three limitations persist. First, existing techniques are tightly coupled with specific model architectures, lacking generality across CNN-based and Transformer-based architectures. Second, the auxiliary losses applied to intermediate layers are typically *identical* to the output-layer loss (e.g., cross-entropy), which forces intermediate layers to prioritize output-specific features prematurely and limits the learning of generalizable representations. Third, no explicit regularization is imposed on hidden activations, risking overconfident predictions that degrade out-of-distribution generalization. OmniISR addresses all three issues by introducing architecture-agnostic intermediate point selection, heterogeneous mutual-information supervision, and negative-entropy regularization.

### 2.4 Unified Centralized and Federated Learning

A fundamental yet under\-explored challenge in practical edge intelligence is the coexistence of centralized and federated training paradigms\. In globally deployed systems \(for instance, AD fleets operating across jurisdictions with different data\-privacy regulations\), some regions permit cloud\-based centralized training while others mandate strictly local, federated training\. This bifurcation necessitates a framework that can seamlessly support both modes under a single optimization formulation\. To the best of our knowledge, no prior work has provided such a unified framework with formal convergence guarantees\. Existing methods treat centralized and federated learning as separate paradigms, each with its own optimization analysis and algorithmic pipeline\. The potential synergy between exact centralized gradients and aggregated federated pseudo\-gradients has not been formally analyzed\.

The proposed OmniISR bridges this gap. The novelty of OmniISR does not lie in using intermediate supervision or intermediate regularization in isolation. The contribution lies in their *joint coupling across paradigms*: (i) heterogeneous intermediate MI supervision (not output-layer CE replication), (ii) intermediate NE uncertainty regularization, and (iii) a unified CL/FL/hybrid optimization with formal theoretical guarantees. This three-fold coupling defines the technical distinction from prior FL corrections and architecture-specific intermediate supervision designs.

## 3 Methodology

We first introduce the key notations of the proposed OmniISR framework in [Tab. I](https://arxiv.org/html/2605.20276#S3.T1), and then present the ISR mechanism in [Section 3.1](https://arxiv.org/html/2605.20276#S3.SS1), the three working modes of OmniISR in [Section 3.2](https://arxiv.org/html/2605.20276#S3.SS2), the theoretical guarantees in [Section 3.3](https://arxiv.org/html/2605.20276#S3.SS3), and its $\epsilon$-stationarity complexity analysis in [Section 3.4](https://arxiv.org/html/2605.20276#S3.SS4). Finally, we compare OmniISR’s working modes in [Section 3.5](https://arxiv.org/html/2605.20276#S3.SS5).

TABLE I: Key notations used throughout this paper

| Symbol | Description |
| --- | --- |
| $\theta \in \mathbb{R}^{d}$ | Global model parameters |
| $\theta_{t}$ | Model parameters at iteration/round $t$ |
| $D$ | The depth (layer number) of model $\theta$ |
| $\mathcal{D}$ | Centralized training dataset |
| $\mathcal{D}_{n}$ | Local private dataset on client $n$ |
| $N$ | Number of clients in FL |
| $M$ | Number of intermediate supervision points |
| $G_{m}$ | $m$-th intermediate point (layer index) |
| $z^{m}_{i} = G_{m}(\theta; x_{i})$ | Latent feature map at point $G_{m}$ for input $x_{i}$ |
| $q^{m}(\cdot\,; \varphi_{m})$ | Lightweight dimension adapter at $G_{m}$ |
| $C_{m}, w_{m}, h_{m}$ | Channel count, width, and height of $z^{m}$ |
| $K$ | Number of semantic classes |
| $\mathcal{L}_{\mathrm{CE}}$ | Output-layer cross-entropy loss |
| $\mathcal{L}_{\mathrm{MI}}^{(m)}$ | MI-based intermediate supervision at $G_{m}$ |
| $\mathcal{L}_{\mathrm{NE}}^{(m)}$ | Negative entropy regularizer at $G_{m}$ |
| $\alpha_{m}, \lambda_{m}$ | Weights for MI loss and NE regularizer at $G_{m}$ |
| $E$ | Number of local training epochs per FL round |
| $T$ | Total CL training iterations or FL rounds |
| $w_{n} = \lvert\mathcal{D}_{n}\rvert / \sum_{k} \lvert\mathcal{D}_{k}\rvert$ | Aggregation weight for client $n$ |
| $\alpha_{t} \in [0,1]$ | Hybrid CL–FL mixing weight at round $t$ |
| $g_{\mathrm{CL}}^{t}, g_{\mathrm{FL}}^{t}$ | Cloud gradient and federated pseudo-gradient |
| $\eta, \eta_{t}$ | Learning rate (constant or scheduled) |

### 3.1 The Proposed ISR Mechanism

The core philosophy of OmniISR is that deep networks deployed across CL and FL paradigms require *explicit, diversified guidance at hidden layers* rather than mere supervision at the output. This intermediate guidance is intended to balance representation alignment (constraining drift across clients or training stages) and representational flexibility (preserving the uncertainty needed for generalization). OmniISR instantiates this principle through three integral strategies: (i) architecture-agnostic intermediate layer selection, (ii) intermediate heterogeneous mutual-information (MI) supervision, and (iii) intermediate negative-entropy (NE) regularization. We elaborate on each below.

#### 3.1.1 Architecture-Agnostic Intermediate Layer Selection

Let the network $f_{\theta}$ be composed of $D$ layers. We select $M$ ISR points $\{G_{1}, \dots, G_{M}\}$ ($M \ll D$) at key architectural transition boundaries. Unlike prior intermediate supervision methods that are tightly coupled to a specific backbone (e.g., ICNet’s cascaded branches [[45](https://arxiv.org/html/2605.20276#bib.bib45)]), OmniISR defines *architecture-agnostic* selection criteria:

- Scale transitions: before or after spatial downsampling (pooling, strided convolutions), capturing changes in spatial resolution and feature granularity.
- Block boundaries: between major computational blocks (e.g., ResNet stages, Transformer encoder layers), leveraging hierarchical feature abstraction.
- Bottleneck layers: where feature dimensionality is compressed, highlighting the critical information pathways.
- Attention or fusion points: before/after attention mechanisms or multi-branch feature fusion, capturing how information is redistributed.

This design ensures that OmniISR is directly applicable to CNN-based architectures (e.g., DeepLabv3+ [[48](https://arxiv.org/html/2605.20276#bib.bib48)]), Transformer-based architectures (e.g., TopFormer [[49](https://arxiv.org/html/2605.20276#bib.bib49)]), and CNN-Transformer hybrid architectures (e.g., SeaFormer [[50](https://arxiv.org/html/2605.20276#bib.bib50)]) without any architecture-specific redesign. In practice, we recommend $M = \mathcal{O}(\log D)$ to balance supervision granularity against the overhead it introduces.
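As a rough illustration of how the $M = \mathcal{O}(\log D)$ recommendation could be applied, the sketch below picks ISR points from a list of candidate transition layers. It is only one possible reading: the budget $\lceil \log_2 D \rceil$ (base 2, constant 1), the even spreading over candidates, and the function name `select_isr_points` are assumptions made for this example and are not specified in the text.

```python
import math

def select_isr_points(candidates, depth):
    """Pick about log2(depth) ISR points, spread evenly over the candidates.

    candidates: sorted layer indices that satisfy one of the selection
                criteria (scale transition, block boundary, bottleneck, fusion).
    depth:      total number of layers D in the network.
    """
    # Assumed budget: ceil(log2(D)); the constant and log base are illustrative.
    budget = max(1, math.ceil(math.log2(depth)))
    m = min(budget, len(candidates))
    if m == len(candidates):
        return list(candidates)
    if m == 1:
        return [candidates[len(candidates) // 2]]
    last = len(candidates) - 1
    # Integer arithmetic keeps the chosen positions distinct and reproducible.
    return [candidates[(i * last) // (m - 1)] for i in range(m)]

# Usage: a 16-layer network with six candidate transition layers.
points = select_isr_points([3, 7, 13, 16, 20, 24], depth=16)
```

The candidate list itself would still come from the four criteria above, so this step only thins the candidates down to the recommended budget.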

![Refer to caption](https://arxiv.org/html/2605.20276v1/omniisr_loss_landscape.jpg)

Figure 3: Illustration of the benefits of the ISR mechanism, assuming three intermediate points within the model.

For each selected point $G_{m}$, we denote the latent feature map extracted from input $x_{i}$ as

$$
z^{m}_{i} = G_{m}(\theta; x_{i}) \in \mathbb{R}^{C_{m} \times w_{m} \times h_{m}}, \tag{1}
$$

where $C_{m}$, $w_{m}$, and $h_{m}$ are the channel count, width, and height of the feature map at $G_{m}$, respectively. Since $z^{m}_{i}$ typically differs in spatial resolution and channel dimensionality from the ground truth $y_{i} \in \{1, \dots, K\}^{W \times H}$, we attach a lightweight *dimension adapter* $q^{m}(\cdot\,; \varphi_{m})$ at each point. The adapter consists of a $1 \times 1$ convolution followed by bilinear upsampling to $W \times H$, producing class-probability maps $q^{m}_{k}(z^{m}_{i,p}; \varphi_{m})$ for each pixel $p$ and class $k$.
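The adapter described above can be sketched in plain Python as a per-pixel linear map (the $1 \times 1$ convolution), followed by bilinear upsampling and a softmax over classes. This is a toy illustration under stated assumptions, not the authors' code: the softmax placement after upsampling, the corner-aligned interpolation, the `[channel][row][column]` nested-list layout, and all function names are choices made for this example.

```python
import math

def conv1x1(z, weight, bias):
    """Per-pixel linear map. z: [C][h][w], weight: [K][C], bias: [K]."""
    C, h, w = len(z), len(z[0]), len(z[0][0])
    return [[[bias[k] + sum(weight[k][c] * z[c][i][j] for c in range(C))
              for j in range(w)]
             for i in range(h)]
            for k in range(len(weight))]

def bilinear_upsample(plane, H, W):
    """Resize one [h][w] plane to [H][W] with corner-aligned bilinear weights."""
    h, w = len(plane), len(plane[0])
    out = []
    for I in range(H):
        y = I * (h - 1) / (H - 1) if H > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for J in range(W):
            x = J * (w - 1) / (W - 1) if W > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            top = (1 - dx) * plane[y0][x0] + dx * plane[y0][x1]
            bottom = (1 - dx) * plane[y1][x0] + dx * plane[y1][x1]
            row.append((1 - dy) * top + dy * bottom)
        out.append(row)
    return out

def softmax_over_classes(logits):
    """Normalize [K][H][W] logits into per-pixel class probabilities."""
    K, H, W = len(logits), len(logits[0]), len(logits[0][0])
    probs = [[[0.0] * W for _ in range(H)] for _ in range(K)]
    for i in range(H):
        for j in range(W):
            peak = max(logits[k][i][j] for k in range(K))
            exps = [math.exp(logits[k][i][j] - peak) for k in range(K)]
            total = sum(exps)
            for k in range(K):
                probs[k][i][j] = exps[k] / total
    return probs

def dimension_adapter(z, weight, bias, H, W):
    """1x1 convolution, bilinear upsampling to H x W, then softmax (assumed order)."""
    logits = conv1x1(z, weight, bias)
    upsampled = [bilinear_upsample(plane, H, W) for plane in logits]
    return softmax_over_classes(upsampled)

# Usage: a 2-channel 2x2 feature map mapped to K = 3 classes at 4x4 resolution.
z = [[[0.0, 1.0], [2.0, 3.0]], [[1.0, 0.0], [0.0, 1.0]]]
weight = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0]
q = dimension_adapter(z, weight, bias, H=4, W=4)
```

In a real training pipeline the same structure would normally be built from a deep learning framework's convolution and interpolation layers so that the adapter parameters $\varphi_{m}$ receive gradients.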

#### 3.1.2 Intermediate Mutual Information Supervision

A fundamental limitation of classical intermediate supervision [[39](https://arxiv.org/html/2605.20276#bib.bib39), [40](https://arxiv.org/html/2605.20276#bib.bib40), [41](https://arxiv.org/html/2605.20276#bib.bib41)] is that it applies the *same* loss function (typically cross-entropy) to both intermediate and output layers. While this provides gradient signal to hidden layers, it also forces intermediate representations to converge prematurely toward output-layer decision boundaries. From an information-theoretic perspective [[51](https://arxiv.org/html/2605.20276#bib.bib51)], an ideal intermediate representation should preserve maximal information about the task label $Y$ while maintaining diverse feature structures that differ from the final prediction head. This motivates the use of a *heterogeneous* intermediate loss that maximizes the shared information between hidden features and labels without collapsing onto the output-layer objective.

Let $\Omega$ denote the set of all pixels in an image. The output-layer cross-entropy (CE) loss is defined as

$$
\mathcal{L}_{\mathrm{CE}}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x_{i}, y_{i}) \in \mathcal{D}} \sum_{p \in \Omega} \sum_{k=1}^{K} y_{i,p,k} \log p_{k}(x_{i,p}; \theta), \tag{2}
$$

where $\mathcal{D}$ denotes the training dataset on a device and $p_{k}(x_{i,p}; \theta)$ is the output-layer softmax probability for class $k$ at pixel $p$.
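A small pure-Python sketch of Eq. (2) may help fix the indexing: the one-hot term $y_{i,p,k}$ selects the log-probability of the true class at each pixel, the result is summed over pixels, and then averaged over samples. The nested-list layout, the 0-based class labels (the paper indexes classes from 1), the clamping constant `eps`, and the function name are assumptions for illustration.

```python
import math

def pixelwise_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy summed over pixels and averaged over samples, as in Eq. (2).

    probs:  list over samples of [K][H][W] class-probability maps.
    labels: list over samples of [H][W] integer class indices in 0..K-1.
    eps:    lower clamp on probabilities to avoid log(0) (illustrative choice).
    """
    total = 0.0
    for sample_probs, sample_labels in zip(probs, labels):
        for i, row in enumerate(sample_labels):
            for j, k in enumerate(row):
                # The one-hot y_{i,p,k} keeps only the true-class term.
                total -= math.log(max(sample_probs[k][i][j], eps))
    return total / len(probs)

# Usage: one sample, K = 2 classes, a 1x2 image with uniform predictions.
uniform = [[[[0.5, 0.5]], [[0.5, 0.5]]]]
loss = pixelwise_cross_entropy(uniform, [[[0, 1]]])
```

Fed with the adapter outputs $q^{m}_{k}$ instead of the output-layer softmax, the same routine would evaluate an expression of the form used for the intermediate loss in Eq. (3).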

For intermediate pointGmG\_\{m\}, we define the MI supervision loss\. Recall that the MI between the latent featureZmZ^\{m\}and labelYYisI\(Zm;Y\)=H\(Y\)−H\(Y\|Zm\)I\(Z^\{m\};Y\)=H\(Y\)\-H\(Y\|Z^\{m\}\), whereH\(Y\)H\(Y\)is a data\-dependent constant\. MaximizingI\(Zm;Y\)I\(Z^\{m\};Y\)is therefore equivalent to minimizingH\(Y\|Zm\)H\(Y\|Z^\{m\}\)\. Since the true conditionalP\(Y\|Zm\)P\(Y\|Z^\{m\}\)is intractable, we introduce a variational approximationqm\(⋅;φm\)q^\{m\}\(\\cdot\\,;\\varphi\_\{m\}\)and optimize the variational upper bound onH\(Y\|Zm\)H\(Y\|Z^\{m\}\)\[[52](https://arxiv.org/html/2605.20276#bib.bib52)\]:

ℒMI\(m\)\(θ,φm\)=−1\|𝒟\|∑\(xi,yi\)∈𝒟∑p∈Ω∑k=1Kyi,p,klog⁡qkm\(zi,pm;φm\)\.\\mathcal\{L\}\_\{\\mathrm\{MI\}\}^\{\(m\)\}\(\\theta,\\varphi\_\{m\}\)\\\!=\\\!\-\\frac\{1\}\{\|\\mathcal\{D\}\|\}\\\!\\sum\_\{\(x\_\{i\},y\_\{i\}\)\\in\\mathcal\{D\}\}\\\!\\sum\_\{p\\in\\Omega\}\\\!\\sum\_\{k=1\}^\{K\}y\_\{i,p,k\}\\,\\\!\\log\\,\\\!q^\{m\}\_\{k\}\\big\(z^\{m\}\_\{i,p\};\\varphi\_\{m\}\\big\)\.\(3\)Although[Eq\.3](https://arxiv.org/html/2605.20276#S3.E3)superficially resembles a CE loss, it differs fromℒCE\\mathcal\{L\}\_\{\\mathrm\{CE\}\}in three critical respects:

1. Operating on latent features instead of predictions: $\mathcal{L}_{\mathrm{MI}}^{(m)}$ is computed on the intermediate feature map $z^m$ passed through a small adapter $q^{m}(\cdot\,;\varphi_m)$, rather than on the full network's output. This means the gradient signal flows directly into the hidden layers at point $G_m$, providing localized supervision that does not traverse the entire downstream subnetwork.
2. Distinct parameterization: The adapter $q^{m}(\cdot\,;\varphi_m)$ has its own parameters $\varphi_m$, decoupled from the output prediction head. This prevents the intermediate loss from merely replicating the output-layer objective and instead encourages $G_m$ to learn representations that are *independently discriminative*.
3. Hierarchical diversity: Because each $G_m$ operates at a different depth and spatial resolution, the MI losses collectively enforce a hierarchy of increasingly abstract yet task-relevant features, aligning with the information-theoretic principle of progressive information refinement \[[51](https://arxiv.org/html/2605.20276#bib.bib51)\].
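As a rough illustration of Eq. 3 for a single sample, the sketch below treats the adapter's $1\times 1$ convolution as a per-pixel linear map followed by a softmax. It assumes the feature map is already at label resolution, so the bilinear upsampling step is omitted. The names `adapter_probs` and `mi_loss` and all toy values are hypothetical and not taken from the paper.

```python
import math

def adapter_probs(z_pixel, weight, bias):
    """Per-pixel 1x1 convolution (linear map from C_m channels to K logits) plus softmax."""
    logits = [sum(w * z for w, z in zip(row, z_pixel)) + b
              for row, b in zip(weight, bias)]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mi_loss(features, labels, weight, bias):
    """Single-sample form of Eq. 3: summed negative log-probability of the true class."""
    return -sum(math.log(adapter_probs(z, weight, bias)[y])
                for z, y in zip(features, labels))

# Toy example (assumed values): 2 pixels, C_m = 2 channels, K = 2 classes.
features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
bias = [0.0, 0.0]
untrained = mi_loss(features, labels, [[0.0, 0.0], [0.0, 0.0]], bias)
trained = mi_loss(features, labels, [[2.0, 0.0], [0.0, 2.0]], bias)
print(untrained, trained)
```

With all-zero adapter weights the prediction is uniform and the loss equals the number of pixels times $\log K$; an adapter that separates the two classes lowers it.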

#### 3.1.3 Intermediate Negative Entropy Regularization

While MI supervision ensures that intermediate features are well-optimized, it does not prevent them from becoming *overconfident*, i.e., producing peaked, low-entropy distributions that are overly specialized to the training data distribution. This problem is amplified in FL, where each client's local training can push intermediate features toward client-specific overconfidence, exacerbating representation drift upon aggregation.

To counteract this tendency, we impose an NE regularizer on the latent activation distribution at each intermediate point. For point $G_m$, the NE regularizer is defined as

$$\mathcal{L}_{\mathrm{NE}}^{(m)}(\theta)=\frac{1}{|\mathcal{D}|}\sum_{(x_i,y_i)\in\mathcal{D}}\sum_{p\in\Omega}\sum_{c=1}^{C_m}p^{m}_{c}(z^{m}_{i,p};\theta)\,\log p^{m}_{c}(z^{m}_{i,p};\theta),\tag{4}$$

where $p^{m}_{c}(z^{m}_{i,p};\theta)=\mathrm{softmax}(z^{m}_{i,p})_c$ is the softmax probability over the $C_m$ channels at pixel $p$ of the latent feature. Minimizing $\mathcal{L}_{\mathrm{NE}}^{(m)}$ is equivalent to *maximizing the entropy* of the channel-wise activation distribution, thereby penalizing peaked activations and injecting controlled uncertainty into hidden features.
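A minimal sketch of Eq. 4 for one sample, assuming the feature map is stored as a list of pixels, each holding $C_m$ channel activations. The function names and toy inputs are illustrative assumptions.

```python
import math

def softmax(values):
    top = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def negative_entropy(feature_map):
    """Single-sample form of Eq. 4: sum over pixels and channels of p * log(p)."""
    total = 0.0
    for pixel in feature_map:
        probs = softmax(pixel)
        total += sum(p * math.log(p) for p in probs if p > 0.0)
    return total

flat = [[0.0, 0.0, 0.0, 0.0]]     # uniform channel distribution
peaked = [[10.0, 0.0, 0.0, 0.0]]  # one dominant channel
ne_flat = negative_entropy(flat)
ne_peaked = negative_entropy(peaked)
print(ne_flat, ne_peaked)
```

The uniform pixel attains the minimum value $-\log C_m$, while the peaked pixel yields a value close to zero, so minimizing this term pushes activations away from the peaked case.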

#### 3.1.4 The MI–NE synergy

The MI supervision and NE regularization play complementary roles that resolve a fundamental tension in intermediate layer optimization:

- MI supervision pulls features toward *discriminativeness*: it ensures that hidden representations carry sufficient information about the task labels, preventing under-optimized or task-irrelevant intermediate features.
- NE regularization pushes features toward *uncertainty*: it prevents features from collapsing into overconfident, overly specialized representations.

Together, they carve out a “sweet spot” in the representation space, where features are *well-optimized, task-informed but not task-overfit*. This synergy is the key design principle underlying OmniISR's effectiveness in both CL and FL settings.

#### 3.1.5 OmniISR's Unified Optimization Objective

Combining the output-layer loss with all intermediate terms, the total optimization objective of OmniISR at each device is

$$\mathcal{L}_T(\theta,\Phi)=\mathcal{L}_{\mathrm{CE}}(\theta)+\sum_{m=1}^{M}\Big(\alpha_m\,\mathcal{L}_{\mathrm{MI}}^{(m)}(\theta,\varphi_m)+\lambda_m\,\mathcal{L}_{\mathrm{NE}}^{(m)}(\theta)\Big),\tag{5}$$

where $\Phi=\{\varphi_1,\dots,\varphi_M\}$ denotes the collective adapter parameters and $\alpha_m,\lambda_m>0$ are weights controlling the relative contribution of MI supervision and NE regularization at each point $G_m$.
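The aggregation in Eq. 5 is a plain weighted sum. The sketch below shows it on scalar loss values; the helper name `total_loss` and the numbers are assumptions chosen only for illustration.

```python
def total_loss(ce, mi_losses, ne_losses, alphas, lambdas):
    """Eq. 5: CE plus weighted MI and NE terms, one pair per intermediate point G_m."""
    if not (len(mi_losses) == len(ne_losses) == len(alphas) == len(lambdas)):
        raise ValueError("need one MI loss, NE loss, alpha and lambda per intermediate point")
    intermediate = sum(a * mi + lam * ne
                       for a, mi, lam, ne in zip(alphas, mi_losses, lambdas, ne_losses))
    return ce + intermediate

# Two intermediate points (M = 2) with assumed loss values and weights.
value = total_loss(ce=1.0,
                   mi_losses=[0.5, 0.25],
                   ne_losses=[-1.0, -0.5],
                   alphas=[0.1, 0.2],
                   lambdas=[0.01, 0.02])
print(value)
```

Because the NE terms are non-positive, they lower the total, which is how the regularizer rewards higher-entropy activations.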

The total loss $\mathcal{L}_T(\cdot,\cdot)$ induces a richer gradient landscape than output-layer-only training. The gradient with respect to model parameters $\theta$ decomposes as

$$\nabla_\theta\mathcal{L}_T=\nabla_\theta\mathcal{L}_{\mathrm{CE}}+\sum_{m=1}^{M}\Big(\alpha_m\,\nabla_\theta\mathcal{L}_{\mathrm{MI}}^{(m)}+\lambda_m\,\nabla_\theta\mathcal{L}_{\mathrm{NE}}^{(m)}\Big).\tag{6}$$

The intermediate gradient terms $\nabla_\theta\mathcal{L}_{\mathrm{MI}}^{(m)}$ and $\nabla_\theta\mathcal{L}_{\mathrm{NE}}^{(m)}$ inject supervision directly at $G_m$, effectively shortening the back-propagation path for layers preceding $G_m$. This mitigates the vanishing-gradient problem while simultaneously providing diverse optimization signals that prevent all layers from converging to the same output-layer-centric features. The benefits of the ISR mechanism are illustrated in [Fig. 3](https://arxiv.org/html/2605.20276#S3.F3).

![Refer to caption](https://arxiv.org/html/2605.20276v1/omni_isr_client_drift_arch.jpg) (a) Gradient drift comparison at intermediate layers
![Refer to caption](https://arxiv.org/html/2605.20276v1/omni_isr_client_drift_landscape.jpg) (b) Convergence trajectory comparison on objective landscapes

Figure 4: Illustration of why the proposed OmniISR framework can reduce client drift in the federated setting, using three clients as a toy example.

### 3.2 Three Working Modes of OmniISR

#### 3.2.1 Pure Centralized Training (CL Mode)

When a centralized dataset $\mathcal{D}$ is available, OmniISR optimizes $\mathcal{L}_T(\cdot,\cdot)$ via stochastic gradient descent until convergence. Concretely, each training iteration proceeds through four steps: (i) Forward Pass: For an input image $x_i$, the network computes intermediate features $\{z_i^1,\dots,z_i^M\}$ and the final prediction $\hat{y}_i$. (ii) Loss Computation: The total loss $\mathcal{L}_T$ is computed via [Eq. 5](https://arxiv.org/html/2605.20276#S3.E5), i.e., aggregating the output-layer loss $\mathcal{L}_{\mathrm{CE}}$, the intermediate MI losses $\{\mathcal{L}_{\mathrm{MI}}^{(m)}\}_{m=1}^{M}$, and the NE regularizers $\{\mathcal{L}_{\mathrm{NE}}^{(m)}\}_{m=1}^{M}$. (iii) Back-Propagation: Gradients of $\mathcal{L}_T$ are computed with respect to both the model parameters $\theta$ and the auxiliary dimension adapters $\{\varphi_m\}_{m=1}^{M}$. (iv) Parameter Update: All parameters are updated using the Adam optimizer \[[14](https://arxiv.org/html/2605.20276#bib.bib14)\]:

$$\theta\leftarrow\theta-\eta\,\nabla_\theta\mathcal{L}_T,\quad\varphi_m\leftarrow\varphi_m-\eta\,\nabla_{\varphi_m}\mathcal{L}_T,\quad\forall\,m=1,\dots,M,\tag{7}$$

where $\eta$ is the learning rate. This procedure iterates over all mini-batches for $T$ epochs until convergence.

#### 3.2.2 Pure Federated Training (FL Mode)

In the federated setting, each of the $N$ clients holds a private local dataset $\mathcal{D}_n$ reflecting its own working environment. The global objective becomes

$$\min_{\theta,\Phi}\;\mathcal{L}_T(\theta,\Phi)=\sum_{n=1}^{N}w_n\,\mathcal{L}_T^{n}(\theta,\Phi_n;\mathcal{D}_n),\tag{8}$$

where $\mathcal{L}_T^{n}$ is the total OmniISR loss ([Eq. 5](https://arxiv.org/html/2605.20276#S3.E5)) evaluated on $\mathcal{D}_n$, and $w_n=|\mathcal{D}_n|/\sum_k|\mathcal{D}_k|$ is the aggregation weight for client $n$.
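The aggregation weights $w_n$ in Eq. 8 are dataset-size proportions, as in standard federated averaging. The sketch below applies them to flat parameter vectors; the function name `fedavg` and the toy values are assumptions made for illustration.

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client parameter vectors with w_n = |D_n| / sum_k |D_k|."""
    total = sum(client_sizes)
    weights = [size / total for size in client_sizes]
    dim = len(client_params[0])
    aggregated = [sum(w * params[j] for w, params in zip(weights, client_params))
                  for j in range(dim)]
    return aggregated, weights

# Two clients holding 100 and 300 samples (assumed sizes).
aggregated, weights = fedavg([[1.0, 0.0], [3.0, 2.0]], [100, 300])
print(weights, aggregated)
```

The client with three times as much data contributes three times the weight to the aggregated vector.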

In standard FL with output-layer-only supervision, the non-IID nature of client data causes a well-documented problem of *intermediate representation drift* \[[30](https://arxiv.org/html/2605.20276#bib.bib30),[12](https://arxiv.org/html/2605.20276#bib.bib12)\]. Because hidden layers receive supervision only indirectly via back-propagation from the output, clients in diverse environments (e.g., urban vs. rural) develop independently drifting latent features. When these models are aggregated on the server, the misaligned intermediate representations produce a global model whose hidden features represent an incoherent “average” that serves no client well.

OmniISR's intermediate MI supervision directly addresses this problem by providing *explicit, task-relevant anchor points* at each $G_m$. Because all clients optimize toward the same MI objective at each intermediate depth, their hidden representations are guided toward a shared semantic subspace. Simultaneously, the NE regularizer prevents any single client from developing overconfident, locally specialized features that would resist aggregation. The combination encourages the aggregated global model to inherit coherent, generalizable features across the network hierarchy. [Fig. 4](https://arxiv.org/html/2605.20276#S3.F4) illustrates why the proposed OmniISR framework can reduce client drift in the federated setting.

Algorithm 1: Unified OmniISR (CL / FL / Hybrid)

Input: $\mathtt{mode}\in\{\mathrm{CL},\mathrm{FL},\mathrm{Hybrid}\}$; centralized dataset $\mathcal{D}_{\mathrm{C}}$; clients $\{1,\dots,N\}$ with datasets $\{\mathcal{D}_n\}_{n=1}^{N}$; intermediate points $\{G_m\}_{m=1}^{M}$; adapters $\Phi=\{\varphi_m\}_{m=1}^{M}$; learning rates $\{\eta_t\}$; local epochs $E$; total rounds $T$; mixing schedule $\{\alpha_t\}$; adaptation rate $\beta>0$

Output: trained global model $\theta^{*}$

- Initialization: initialize $\theta^{0}$, $\Phi^{0}$, $\{\alpha_m,\lambda_m\}_{m=1}^{M}$
- for round $t=0$ to $T-1$ do
  - // Used by CL and Hybrid
  - if $\mathtt{mode}\in\{\mathrm{CL},\mathrm{Hybrid}\}$ then
    - Sample minibatch $\mathcal{B}_{\mathrm{c}}\subseteq\mathcal{D}_{\mathrm{C}}$
    - Forward: $\hat{y}_i,\{z_i^1,\dots,z_i^M\}\leftarrow f_{\theta^{t}}(x_i)$ for $(x_i,y_i)\in\mathcal{B}_{\mathrm{c}}$
    - Compute $\mathcal{L}_{\mathrm{CE}}$ ([Eq. 2](https://arxiv.org/html/2605.20276#S3.E2)); for $m=1,\dots,M$ compute $\mathcal{L}_{\mathrm{MI}}^{(m)}$ ([Eq. 3](https://arxiv.org/html/2605.20276#S3.E3)), $\mathcal{L}_{\mathrm{NE}}^{(m)}$ ([Eq. 4](https://arxiv.org/html/2605.20276#S3.E4)); aggregate $\mathcal{L}_T$ ([Eq. 5](https://arxiv.org/html/2605.20276#S3.E5))
    - Backward: $g_{\mathrm{CL}}^{t}\leftarrow\nabla_\theta\mathcal{L}_T(\theta^{t},\Phi^{t};\mathcal{B}_{\mathrm{c}})$; $g_{\varphi_m,\mathrm{CL}}^{t}\leftarrow\nabla_{\varphi_m}\mathcal{L}_T(\theta^{t},\Phi^{t};\mathcal{B}_{\mathrm{c}})$
    - if $\mathtt{mode}=\mathrm{CL}$ then
      - $\theta^{t+1}\leftarrow\theta^{t}-\eta_t\,g_{\mathrm{CL}}^{t}$; $\varphi_m^{t+1}\leftarrow\varphi_m^{t}-\eta_t\,g_{\varphi_m,\mathrm{CL}}^{t}$ for all $m$
      - continue
  - // Used by FL and Hybrid
  - Sample participating clients $S^{t}\subseteq\{1,\dots,N\}$; broadcast $(\theta^{t},\Phi^{t})$ to $n\in S^{t}$
  - for each client $n\in S^{t}$ do in parallel
    - $\theta_n\leftarrow\theta^{t}$; $\varphi_{m,n}\leftarrow\varphi_m^{t}$ for all $m$
    - for local epoch $e=1$ to $E$ do
      - for each minibatch $\mathcal{B}\subseteq\mathcal{D}_n$ do
        - Forward: $\hat{y}_i,\{z_i^1,\dots,z_i^M\}\leftarrow f_{\theta_n}(x_i)$ for $(x_i,y_i)\in\mathcal{B}$
        - Compute $\mathcal{L}_{\mathrm{CE}}^{n}$ ([Eq. 2](https://arxiv.org/html/2605.20276#S3.E2)); $\mathcal{L}_{\mathrm{MI}}^{(m),n}$, $\mathcal{L}_{\mathrm{NE}}^{(m),n}$ for all $m$; aggregate $\mathcal{L}_T^{n}$ ([Eq. 5](https://arxiv.org/html/2605.20276#S3.E5))
        - $\theta_n\leftarrow\theta_n-\eta_t\,\nabla_{\theta_n}\mathcal{L}_T^{n}$; $\varphi_{m,n}\leftarrow\varphi_{m,n}-\eta_t\,\nabla_{\varphi_{m,n}}\mathcal{L}_T^{n}$ for all $m$
    - Send $(\theta_n,|\mathcal{D}_n|,\{\varphi_{m,n}\}_{m=1}^{M})$ to server
  - $w_n\leftarrow|\mathcal{D}_n|\big/\sum_{k\in S^{t}}|\mathcal{D}_k|$; $\tilde{\theta}_{t+1}\leftarrow\sum_{n\in S^{t}}w_n\,\theta_n$; $\tilde{\varphi}_m^{t+1}\leftarrow\sum_{n\in S^{t}}w_n\,\varphi_{m,n}$ *(optional)*
  - if $\mathtt{mode}=\mathrm{FL}$ then
    - $\theta^{t+1}\leftarrow\tilde{\theta}_{t+1}$; $\Phi^{t+1}\leftarrow\tilde{\Phi}_{t+1}$
    - continue
  - // Used by Hybrid only
  - $g_{\mathrm{FL}}^{t}\leftarrow(\theta^{t}-\tilde{\theta}_{t+1})/\eta_t$; $g_{\Phi,\mathrm{FL}}^{t}\leftarrow(\Phi^{t}-\tilde{\Phi}_{t+1})/\eta_t$ *(optional)*
  - $\theta^{t+1}\leftarrow\theta^{t}-\eta_t\big(\alpha_t\,g_{\mathrm{CL}}^{t}+(1-\alpha_t)\,g_{\mathrm{FL}}^{t}\big)$
  - $\varphi_m^{t+1}\leftarrow\varphi_m^{t}-\eta_t\big(\alpha_t\,g_{\varphi_m,\mathrm{CL}}^{t}+(1-\alpha_t)\,g_{\varphi_m,\mathrm{FL}}^{t}\big)$ for all $m$ *(optional)*
  - if adaptive mixing enabled then
    - $s\leftarrow\mathrm{sim}\big(g_{\mathrm{CL}}^{t},\,g_{\mathrm{FL}}^{t}\big)$; $\alpha_{t+1}\leftarrow\mathrm{clip}\big(\alpha_t+\beta(1-s),\,0,\,1\big)$
- return $\theta^{*}\leftarrow\theta^{T}$

#### 3.2.3 Hybrid CL–FL Training (Hybrid Mode)

In real-world edge intelligence deployments, it is common for a system to have access to both a cloud-curated dataset (e.g., a representative replay buffer collected from diverse regions) and distributed on-device data that must remain local. OmniISR naturally supports a hybrid update that combines centralized and federated gradient information:

$$\theta_{t+1}=\theta_t-\eta_t\Big(\alpha_t\,g_{\mathrm{CL}}^{t}+(1-\alpha_t)\,g_{\mathrm{FL}}^{t}\Big),\tag{9}$$

where $g_{\mathrm{CL}}^{t}=\nabla_\theta\mathcal{L}_T(\theta_t;\mathcal{D}_{\mathrm{cloud}})$ is the exact gradient on cloud data, $g_{\mathrm{FL}}^{t}=(\theta_t-\theta_{t+1}^{\mathrm{fed}})/\eta_t$ is the pseudo-gradient derived from the federated aggregation step, and $\alpha_t\in[0,1]$ is the mixing weight that controls the balance between centralized and federated contributions.

We identify three practical mixing regimes:

1. Alternating schedule ($\alpha_t\in\{0,1\}$): The system alternates between pure CL rounds and pure FL rounds. This is simplest to implement and incurs no synchronization overhead between the two data sources.
2. Fixed mixing ($\alpha_t=\alpha$): Both gradient sources contribute in every round with a constant ratio. This provides stable optimization dynamics and is recommended when both data sources are always available.
3. Adaptive mixing: $\alpha_t$ is adjusted based on the similarity between $g_{\mathrm{CL}}^{t}$ and $g_{\mathrm{FL}}^{t}$. When the two directions are well-aligned (high similarity), a larger federated weight captures more data diversity; when they diverge, a larger centralized weight provides corrective guidance.
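One hybrid round can be sketched as follows, combining Eq. 9 with the adaptive rule $\alpha_{t+1}=\mathrm{clip}(\alpha_t+\beta(1-s),0,1)$ from Algorithm 1. The source does not specify the similarity function, so cosine similarity is an assumption here, as are the function names and toy values.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0

def hybrid_step(theta, g_cl, theta_fed, eta, alpha, beta=None):
    """One hybrid update (Eq. 9); if beta is given, also adapt the mixing weight."""
    # Pseudo-gradient recovered from the federated aggregate.
    g_fl = [(t - tf) / eta for t, tf in zip(theta, theta_fed)]
    new_theta = [t - eta * (alpha * gc + (1.0 - alpha) * gf)
                 for t, gc, gf in zip(theta, g_cl, g_fl)]
    if beta is not None:
        s = cosine_similarity(g_cl, g_fl)  # assumed choice of sim(., .)
        alpha = min(1.0, max(0.0, alpha + beta * (1.0 - s)))
    return new_theta, alpha

theta = [1.0, 1.0]
g_cl = [1.0, 0.0]       # centralized gradient (assumed)
theta_fed = [1.0, 0.9]  # federated aggregate (assumed)
new_theta, new_alpha = hybrid_step(theta, g_cl, theta_fed, eta=0.1, alpha=0.5, beta=0.2)
fed_only, _ = hybrid_step(theta, g_cl, theta_fed, eta=0.1, alpha=0.0)
print(new_theta, new_alpha, fed_only)
```

Setting `alpha=0.0` reproduces the federated aggregate, and `alpha=1.0` reduces to a pure centralized step, so the fixed and alternating regimes are special cases of the same function.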

Beyond the intuitive benefit of combining more data, the hybrid scheme has a deeper theoretical advantage: the mixture of two independent noise sources (centralized sampling noise and federated aggregation noise) increases the effective stochasticity of the optimization trajectory, which accelerates escape from saddle points (see [Theorem 6](https://arxiv.org/html/2605.20276#Thmtheorem6)) and improves exploration of the loss landscape.

The training procedure of the proposed OmniISR framework is given in [Algo. 1](https://arxiv.org/html/2605.20276#algorithm1), including pure CL mode (blue), pure FL mode (green), and hybrid CL–FL mode (red).

### 3.3 Theoretical Guarantees

We now provide theoretical guarantees for OmniISR under all three training modes. The full proofs, detailed explanations, and further remarks are deferred to the appendices in the supplementary material; here we discuss the assumptions, state the results, and analyze the most significant implications.

#### 3.3.1 Common Assumptions

For the theoretical analysis, we adopt the following standard assumptions for non-convex stochastic optimization.

###### Assumption 1 ($L$-Smoothness).

For each component $\mathcal{L}_s$ with $s\in\{\mathrm{CE},\{\mathrm{MI}^{(m)}\}_{m=1}^{M},\{\mathrm{NE}^{(m)}\}_{m=1}^{M}\}$, there exists a constant $L_s>0$ such that for all $\theta,\theta'$,

$$\|\nabla\mathcal{L}_s(\theta)-\nabla\mathcal{L}_s(\theta')\|\leq L_s\|\theta-\theta'\|.$$

Consequently, the total objective $\mathcal{L}_T(\cdot)$ is $L_{\max}$-smooth, where $L_{\max}=\max\{L_{\mathrm{CE}},\,\{\alpha_m L_{\mathrm{MI}}^{(m)}\}_{m=1}^{M},\,\{\lambda_m L_{\mathrm{NE}}^{(m)}\}_{m=1}^{M}\}$.

###### Assumption 2 (Bounded Stochastic Gradients).

For each component $\mathcal{L}_s$ with $s\in\{\mathrm{CE},\{\mathrm{MI}^{(m)}\}_{m=1}^{M},\{\mathrm{NE}^{(m)}\}_{m=1}^{M}\}$, there exists a constant $G_s>0$ such that for all $\theta$ and any minibatch,

$$\mathbb{E}\|\nabla\mathcal{L}_s(\theta)\|^{2}\leq G_s^{2}.$$

The aggregate bound of $\mathcal{L}_T(\cdot)$ is therefore $G_T^{2}=G_{\mathrm{CE}}^{2}+\sum_{m=1}^{M}\big(\alpha_m^{2}(G_{\mathrm{MI}}^{(m)})^{2}+\lambda_m^{2}(G_{\mathrm{NE}}^{(m)})^{2}\big)$.

###### Assumption 3 (Unbiasedness and Bounded Variance).

For each component $\mathcal{L}_s$ with $s\in\{\mathrm{CE},\{\mathrm{MI}^{(m)}\}_{m=1}^{M},\{\mathrm{NE}^{(m)}\}_{m=1}^{M}\}$, the stochastic gradient $g_s$ is unbiased and has bounded variance: there exists $\sigma_s^{2}>0$ such that for all $\theta$,

$$\mathbb{E}[g_s\mid\theta]=\nabla\mathcal{L}_s(\theta),\qquad\mathbb{E}\|g_s-\nabla\mathcal{L}_s(\theta)\|^{2}\leq\sigma_s^{2}.$$

Consequently, the stochastic gradient $g_t$ of $\mathcal{L}_T(\cdot)$ satisfies $\mathbb{E}[g_t\mid\theta_t]=\nabla\mathcal{L}_T(\theta_t)$ and $\mathbb{E}\|g_t-\nabla\mathcal{L}_T(\theta_t)\|^{2}\leq\sigma_T^{2}$, where the aggregate variance is $\sigma_T^{2}=\sigma_{\mathrm{CE}}^{2}+\sum_{m=1}^{M}\big(\alpha_m^{2}(\sigma_{\mathrm{MI}}^{(m)})^{2}+\lambda_m^{2}(\sigma_{\mathrm{NE}}^{(m)})^{2}\big)$.

#### 3.3.2 Convergence of Centralized OmniISR

###### Theorem 1 (Centralized OmniISR Convergence).

Under Assumptions [1](https://arxiv.org/html/2605.20276#Thmassumption1)–[3](https://arxiv.org/html/2605.20276#Thmassumption3), let the optimization run for $T$ rounds with the update $\theta_{t+1}=\theta_t-\eta_t g_t$ and step-size $\eta_t=\eta/\sqrt{T}$. Define $\Delta=\mathcal{L}_T(\theta_0)-\mathcal{L}_T^{*}$, where $\theta_0$ denotes the initial model parameters and $\mathcal{L}_T^{*}$ is the theoretical optimal loss. Then we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla\mathcal{L}_T(\theta_t)\|^{2}\leq\underbrace{\frac{2\Delta}{\eta\sqrt{T}}}_{\text{initial gap}}+\underbrace{\frac{L_{\max}\,\eta}{\sqrt{T}}\,(G_T^{2}+\sigma_T^{2})}_{\text{variance term}}=\mathcal{O}\!\left(\frac{1}{\sqrt{T}}\right).\tag{10}$$

The bound in [Theorem 1](https://arxiv.org/html/2605.20276#S3.Ex5) consists of two terms. The *initial gap* term $2\Delta/(\eta\sqrt{T})$ reflects how far the initial loss is from the optimal loss, and it decreases with a larger step-size or more iterations. The *variance term* reflects the noise introduced by mini-batch sampling and scales with the aggregate gradient bound and variance $G_T^{2}+\sigma_T^{2}$. Crucially, the asymptotic rate $\mathcal{O}(1/\sqrt{T})$ matches standard non-convex SGD \[[13](https://arxiv.org/html/2605.20276#bib.bib13)\], showing that introducing ISR does *not* degrade the convergence rate; it only affects the finite-$T$ constants via $\sigma_T^{2}$, $G_T^{2}$, and $L_{\max}$.
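To make the $1/\sqrt{T}$ scaling concrete, the sketch below simply evaluates the right-hand side of Eq. 10. All constants are assumed placeholder values, not quantities reported in the paper.

```python
import math

def centralized_bound(delta, eta, T, L_max, G2, sigma2):
    """Right-hand side of Eq. 10: initial-gap term plus variance term."""
    initial_gap = 2.0 * delta / (eta * math.sqrt(T))
    variance_term = L_max * eta / math.sqrt(T) * (G2 + sigma2)
    return initial_gap + variance_term

# Assumed constants: Delta = 1, eta = 0.1, L_max = 2, G_T^2 = 1, sigma_T^2 = 0.5.
bound_100 = centralized_bound(1.0, 0.1, 100, 2.0, 1.0, 0.5)
bound_400 = centralized_bound(1.0, 0.1, 400, 2.0, 1.0, 0.5)
print(bound_100, bound_400)
```

Quadrupling $T$ halves the bound, since both terms carry the same $1/\sqrt{T}$ factor; the ISR weights enter only through $L_{\max}$, $G_T^{2}$, and $\sigma_T^{2}$.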

#### 3.3.3 Convergence of Federated OmniISR

In the federated setting, OmniISR runs for $T$ rounds until the global model converges. Specifically, in round $t$, the central server first broadcasts $\theta_t$ to all clients. Then, each client runs $E$ local SGD steps starting from $\theta_t$ with step-size $\eta_t$, producing $\theta_{t,E}^{n}$. Finally, the server aggregates the models received from all participating clients, i.e.,

$$\theta_{t+1}=\sum_{n=1}^{N}w_n\theta_{t,E}^{n}=\theta_t+\sum_{n=1}^{N}w_n\delta_t^{n},\tag{11}$$

$$\delta_t^{n}=\theta_{t,E}^{n}-\theta_t.\tag{12}$$

We also define the heterogeneity measure as

$$H_t=\sum_{n=1}^{N}w_n\|\nabla\mathcal{L}_T^{n}(\theta_t)-\nabla\mathcal{L}_T(\theta_t)\|^{2}.\tag{13}$$
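The heterogeneity measure in Eq. 13 is a weighted variance of client gradients around the global gradient. The sketch below assumes the global gradient is the $w_n$-weighted sum of client gradients, consistent with Eq. 8; the function name and inputs are illustrative.

```python
def heterogeneity(client_grads, weights):
    """Eq. 13: weighted squared distance of client gradients from the global gradient."""
    dim = len(client_grads[0])
    # Assumption: the global gradient is the weighted sum of client gradients (Eq. 8).
    global_grad = [sum(w * g[j] for w, g in zip(weights, client_grads))
                   for j in range(dim)]
    return sum(w * sum((g[j] - global_grad[j]) ** 2 for j in range(dim))
               for w, g in zip(weights, client_grads))

h_iid = heterogeneity([[1.0, 0.0], [1.0, 0.0]], [0.5, 0.5])       # identical clients
h_non_iid = heterogeneity([[1.0, 0.0], [-1.0, 0.0]], [0.5, 0.5])  # opposing clients
print(h_iid, h_non_iid)
```

Identical client gradients give $H_t=0$, while opposing gradients of unit norm give $H_t=1$.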

Building upon the aforementioned assumptions and definitions, [Lemma 2](https://arxiv.org/html/2605.20276#S3.E14) establishes a drift bound for each federated round. This bound guarantees that the clients' local models do not deviate significantly from the broadcast model $\theta_t$ during any given round.

###### Lemma 2 (Drift bound).

There exists an absolute constant $C$ such that for every round $t$

$$\sum_{n=1}^{N}w_n\mathbb{E}\|\delta_t^{n}\|^{2}\leq C\,\eta_t^{2}E^{2}\bigl(\mathbb{E}\|\nabla\mathcal{L}_T(\theta_t)\|^{2}+H_t+\sigma_T^{2}\bigr).\tag{14}$$

Building on this drift bound, [Theorem 3](https://arxiv.org/html/2605.20276#Thmtheorem3) gives the convergence upper bound of federated OmniISR.

###### Theorem 3 (Federated OmniISR Convergence).

Under Assumptions [1](https://arxiv.org/html/2605.20276#Thmassumption1)–[3](https://arxiv.org/html/2605.20276#Thmassumption3) applied per-client, with $E$ local SGD steps, step-size $\eta_t=\eta/\sqrt{T}$, and the drift-control condition $L_{\max}\eta\sqrt{T}\,C\,E^{2}<1$ (where the constant $C>0$), after $T$ federated aggregation rounds we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla\mathcal{L}(\theta_t)\|^{2}\leq\underbrace{\frac{2\Delta}{\eta\sqrt{T}}}_{\text{initial gap}}+\underbrace{\frac{L_{\max}\,\eta}{\sqrt{T}}\big(G_T^{2}+\sigma_T^{2}\big)}_{\text{variance}}+\underbrace{\frac{L_{\max}\,\eta}{\sqrt{T}}\,\Gamma_{\mathrm{drift}}}_{\text{client drift}},\tag{15}$$

where the drift contribution satisfies $\Gamma_{\mathrm{drift}}\leq c\,E^{2}(A+\bar{H})$ with $A=\frac{1}{T}\sum_{t}\mathbb{E}\|\nabla\mathcal{L}(\theta_t)\|^{2}$ and $\bar{H}=\frac{1}{T}\sum_{t}H_t$.

The federated bound ([Theorem 3](https://arxiv.org/html/2605.20276#S3.Ex6)) differs from the centralized bound ([Theorem 1](https://arxiv.org/html/2605.20276#S3.Ex5)) by the additional drift term $\Gamma_{\mathrm{drift}}$. This term reveals a three-fold interaction:

1. Local epochs $E$: The drift scales as $E^{2}$, quantifying the cost of communicating less frequently. Doubling $E$ quadruples the drift, motivating the choice of moderate $E$ in practice.
2. Data heterogeneity $\bar{H}$: The drift is amplified by inter-client gradient dissimilarity. In highly non-IID scenarios, $\bar{H}$ is large, making convergence slower. OmniISR's intermediate MI supervision helps here by aligning client representations (reducing the effective $\bar{H}$ at hidden layers), which is consistent with the faster convergence observed in our experiments.
3. Controllability: Under the drift-control condition $L_{\max}\eta\sqrt{T}\,C\,E^{2}<1$, the drift term can be absorbed into the left-hand side, preserving the $\mathcal{O}(1/\sqrt{T})$ rate. This condition is satisfied for sufficiently small step-size or moderate $E$.
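The $E^{2}$ scaling can be checked on a toy problem. The sketch below assumes two clients with scalar quadratic losses $\frac{1}{2}(\theta-c_n)^{2}$ and noiseless local gradient descent, neither of which comes from the paper. In this setting the weighted drift stays below $\eta^{2}E^{2}(\|\nabla\mathcal{L}\|^{2}+H_t)$, which has the shape of Eq. 14 with $C=1$ and $\sigma_T=0$; this is a property of the toy problem, not a statement about the paper's constant.

```python
def local_drift(theta, center, eta, local_steps):
    """Run noiseless local gradient descent on 0.5*(x - center)^2 and return delta."""
    local = theta
    for _ in range(local_steps):
        local -= eta * (local - center)  # gradient of the quadratic is (local - center)
    return local - theta

theta, eta = 0.0, 0.1
centers = [-1.0, 3.0]   # assumed client optima (non-IID)
weights = [0.5, 0.5]

grads = [theta - c for c in centers]
global_grad = sum(w * g for w, g in zip(weights, grads))
h_t = sum(w * (g - global_grad) ** 2 for w, g in zip(weights, grads))

def weighted_sq_drift(local_steps):
    return sum(w * local_drift(theta, c, eta, local_steps) ** 2
               for w, c in zip(weights, centers))

def drift_rhs(local_steps):
    return eta ** 2 * local_steps ** 2 * (global_grad ** 2 + h_t)

for steps in (1, 2, 4, 8):
    print(steps, weighted_sq_drift(steps), drift_rhs(steps))
```

The measured drift grows with the number of local steps and never exceeds the quadratic envelope, which is tight at one local step.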

#### 3.3.4 Convergence of Hybrid CL–FL OmniISR

For the hybrid update ([Eq. 9](https://arxiv.org/html/2605.20276#S3.E9)), we additionally model the bias structure of each gradient source.

###### Assumption 4 (Bias Decomposition).

There exist bias vectors $b_c^{t},b_f^{t}$ such that $\mathbb{E}[g_{\mathrm{CL}}^{t}|\theta_t]=\nabla\mathcal{L}_T(\theta_t)+b_c^{t}$ and $\mathbb{E}[g_{\mathrm{FL}}^{t}|\theta_t]=\nabla\mathcal{L}_T(\theta_t)+b_f^{t}$, with $\|b_c^{t}\|\leq B_c$ and $\|b_f^{t}\|\leq B_f$ for all $t$. The conditional variances satisfy $\mathbb{E}\|g_{\mathrm{CL}}^{t}-\mathbb{E}[g_{\mathrm{CL}}^{t}|\theta_t]\|^{2}\leq\sigma_c^{2}$ and $\mathbb{E}\|g_{\mathrm{FL}}^{t}-\mathbb{E}[g_{\mathrm{FL}}^{t}|\theta_t]\|^{2}\leq\sigma_f^{2}$. Furthermore, the centralized and federated noise terms are conditionally independent given $\theta_t$, i.e.,

$$\mathbb{E}\bigl[\langle g_{\mathrm{CL}}^{t}-\mathbb{E}_t[g_{\mathrm{CL}}^{t}],\;g_{\mathrm{FL}}^{t}-\mathbb{E}_t[g_{\mathrm{FL}}^{t}]\rangle\;\big|\;\theta_t\bigr]=0,$$

which is natural since the cloud data and the distributed on-device data are collected from independent sources using independent sensory devices.

The bias $b_c^{t}$ captures representativeness gaps between the cloud dataset and the true data distribution; $b_f^{t}$ captures client drift, staleness, and non-IID effects. We define the effective quantities as follows:

$$B_{\mathrm{eff}}=\max_{t}\|\alpha_t b_c^{t}+(1-\alpha_t)b_f^{t}\|,\tag{16}$$

$$\sigma_{\mathrm{eff}}^{2}=\alpha_{\min}^{2}\,\sigma_c^{2}+(1-\alpha_{\min})^{2}\,\sigma_f^{2},\tag{17}$$

where $\alpha_{\min}=\min\limits_{t}\alpha_t>0$ ensures the persistent participation of centralized training.
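Eqs. 16 and 17 can be evaluated directly once per-round biases and variances are available. In the sketch below the bias vectors, mixing schedule, and variances are made-up inputs; in practice these quantities are not directly observable and would have to be estimated.

```python
import math

def effective_bias(alphas, cloud_biases, fed_biases):
    """Eq. 16: largest norm over rounds of alpha_t*b_c^t + (1 - alpha_t)*b_f^t."""
    norms = []
    for a, b_c, b_f in zip(alphas, cloud_biases, fed_biases):
        mixed = [a * c + (1.0 - a) * f for c, f in zip(b_c, b_f)]
        norms.append(math.sqrt(sum(v * v for v in mixed)))
    return max(norms)

def effective_variance(alpha_min, sigma_c2, sigma_f2):
    """Eq. 17, evaluated at the smallest mixing weight alpha_min."""
    return alpha_min ** 2 * sigma_c2 + (1.0 - alpha_min) ** 2 * sigma_f2

alphas = [0.5, 0.25]  # assumed two-round mixing schedule
b_eff = effective_bias(alphas,
                       cloud_biases=[[0.2, 0.0], [0.2, 0.0]],
                       fed_biases=[[0.0, 0.4], [0.0, 0.4]])
sigma_eff2 = effective_variance(min(alphas), sigma_c2=1.0, sigma_f2=4.0)
print(b_eff, sigma_eff2)
```

Here the second round dominates $B_{\mathrm{eff}}$ because it places more weight on the larger federated bias.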

###### Theorem 4 (Hybrid OmniISR Convergence).

Under Assumptions [1](https://arxiv.org/html/2605.20276#Thmassumption1)–[4](https://arxiv.org/html/2605.20276#Thmassumption4), with step-size $\eta_t = \eta/\sqrt{T}$, CL–FL mixing weights $\alpha_t \geq \alpha_{\min} > 0$, and a constant $C > 0$, we have

$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla\mathcal{L}_T(\theta_t)\|^2 \leq \underbrace{\frac{C\Delta}{\alpha_{\min}\,\eta\sqrt{T}}}_{\text{initial gap}} + \underbrace{C\,\frac{\eta\,\sigma_{\mathrm{eff}}^2}{\sqrt{T}}}_{\text{variance}} + \underbrace{C\,B_{\mathrm{eff}}^2}_{\text{bias floor}}. \tag{18}$$

The hybrid framework introduces a non-vanishing bias floor $C\,B_{\mathrm{eff}}^2$, which persists even as the number of iterations $T \to \infty$. This irreducible error arises from the inherent imperfections in the individual gradient sources: the centralized cloud data may not fully capture the true global distribution, and the federated updates are typically subject to client drift. However, this trade-off is manageable. By actively mitigating these biases, for example through representative cloud sampling and drift-reduction algorithms such as SCAFFOLD [[12](https://arxiv.org/html/2605.20276#bib.bib12)], the bias floor can be kept sufficiently small. Under these controlled conditions, the hybrid scheme retains the standard $\mathcal{O}(1/\sqrt{T})$ convergence rate up to this floor.
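To make the three terms of Eq. (18) concrete, the following minimal sketch evaluates the effective quantities of Eqs. (16) and (17) and the right-hand side of the bound. The function names, the constant `C = 1`, and all numeric inputs are assumptions made for illustration; in practice the bias vectors are not observable and the constant is not specified here.

```python
import math


def effective_quantities(alphas, b_c, b_f, sigma_c, sigma_f):
    """Return (B_eff, sigma_eff^2) as defined in Eqs. (16)-(17).

    alphas: mixing weights alpha_t, one per iteration.
    b_c, b_f: per-iteration bias vectors (hypothetical inputs).
    """
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    b_eff = max(
        norm([a * c + (1.0 - a) * f for c, f in zip(bc, bf)])
        for a, bc, bf in zip(alphas, b_c, b_f)
    )
    a_min = min(alphas)
    sigma_eff_sq = a_min ** 2 * sigma_c ** 2 + (1.0 - a_min) ** 2 * sigma_f ** 2
    return b_eff, sigma_eff_sq


def hybrid_bound_terms(T, delta0, eta, alpha_min, sigma_eff_sq, b_eff, C=1.0):
    """Return the (initial gap, variance, bias floor) terms of Eq. (18)."""
    initial_gap = C * delta0 / (alpha_min * eta * math.sqrt(T))
    variance = C * eta * sigma_eff_sq / math.sqrt(T)
    bias_floor = C * b_eff ** 2
    return initial_gap, variance, bias_floor


# Illustrative values only (not taken from the paper).
b_eff, s2 = effective_quantities(
    alphas=[0.5, 0.5],
    b_c=[[0.1, 0.0], [0.1, 0.0]],
    b_f=[[0.0, 0.2], [0.0, 0.2]],
    sigma_c=1.0,
    sigma_f=2.0,
)
for T in (10**2, 10**4, 10**6):
    terms = hybrid_bound_terms(T, delta0=1.0, eta=0.1, alpha_min=0.5,
                               sigma_eff_sq=s2, b_eff=b_eff)
    print(T, [round(x, 5) for x in terms], round(sum(terms), 5))
```

The first two terms shrink as $1/\sqrt{T}$ while the third stays fixed, which is the floor behavior described above.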

![Refer to caption](https://arxiv.org/html/2605.20276v1/omniisr_gradient_nonconflict.jpg)Figure 5: Illustration of how OmniISR helps to escape saddle points.

#### 3.3.5 Gradient Non-Conflict Between CL and FL

A critical concern in hybrid CL-FL OmniISR is whether the centralized and federated gradient directions might *conflict*, i.e., push the model parameters in opposing directions and thereby cause oscillating or degraded convergence. The following lemma provides a formal guarantee.

###### Lemma 5 (No Gradient Conflict in Hybrid OmniISR).

Under Assumption [4](https://arxiv.org/html/2605.20276#Thmassumption4), the expected inner product between centralized and federated gradients decomposes as

$$
\begin{aligned}
\mathbb{E}[\langle g_{\mathrm{CL}}^t, g_{\mathrm{FL}}^t\rangle] ={}& \underbrace{\|\nabla\mathcal{L}_T(\theta_t)\|^2}_{\text{positive definite}} + \underbrace{\langle\nabla\mathcal{L}_T(\theta_t),\, b_c^t + b_f^t\rangle}_{\text{bias--gradient cross term}} \\
&+ \underbrace{\langle b_c^t, b_f^t\rangle}_{\text{bias--bias interaction}} + \underbrace{\epsilon_t}_{\text{noise residual}},
\end{aligned} \tag{19}
$$

where $|\epsilon_t| \leq \frac{1}{2}(\sigma_c^2 + \sigma_f^2)$. In particular, when the biases are small relative to the gradient norm (i.e., $\|b_c^t\| + \|b_f^t\| \leq c\,\|\nabla\mathcal{L}_T(\theta_t)\|$ for some $c < 1$), the dominant $\|\nabla\mathcal{L}_T(\theta_t)\|^2$ term ensures

$$\mathbb{E}[\langle g_{\mathrm{CL}}^t, g_{\mathrm{FL}}^t\rangle] \geq 0. \tag{20}$$

This result provides the first formal justification for mixed CL–FL training in edge intelligence. It demonstrates that, under mild and verifiable conditions, the two gradient sources act synergistically to guide the optimization trajectory toward stationary points, rather than producing conflicting updates. The required condition (i.e., $\|b_c^t\| + \|b_f^t\| \ll \|\nabla\mathcal{L}_T(\theta_t)\|$) holds naturally throughout most of the training process, when the gradients are large. It is typically violated only near convergence, where the gradients diminish and the model approaches a stationary point. It is in this near-convergence regime that the saddle-escape mechanism becomes critical, as discussed in [Section 3.3.6](https://arxiv.org/html/2605.20276#S3.SS3.SSS6).
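As a numerical sanity check on the deterministic part of Eq. (19), the sketch below draws random gradients and biases satisfying $\|b_c^t\| + \|b_f^t\| = c\,\|\nabla\mathcal{L}_T(\theta_t)\|$ and records the smallest value of $\langle \nabla\mathcal{L}_T + b_c^t,\ \nabla\mathcal{L}_T + b_f^t\rangle / \|\nabla\mathcal{L}_T\|^2$. The noise residual $\epsilon_t$ is set aside, and the comparison value $1 - c$ is not stated in the lemma: it is our own elementary lower bound, obtained from the polarization identity $\langle u, v\rangle = \tfrac{1}{4}(\|u+v\|^2 - \|u-v\|^2)$. All names and parameter values are illustrative assumptions.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def random_direction(dim, rng):
    """Unit vector with a random direction."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [x / n for x in v]


def min_mean_alignment(c, dim=8, trials=2000, seed=0):
    """Smallest observed <grad + b_c, grad + b_f> / ||grad||^2 over random
    draws with ||b_c|| + ||b_f|| = c * ||grad|| (noise-free means only)."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        g = norm(grad)
        split = rng.random()  # share of the bias budget given to the CL bias
        d_c, d_f = random_direction(dim, rng), random_direction(dim, rng)
        b_c = [split * c * g * x for x in d_c]
        b_f = [(1.0 - split) * c * g * x for x in d_f]
        mean_cl = [a + b for a, b in zip(grad, b_c)]
        mean_fl = [a + b for a, b in zip(grad, b_f)]
        worst = min(worst, dot(mean_cl, mean_fl) / (g * g))
    return worst


for c in (0.2, 0.5, 0.9):
    print(c, round(min_mean_alignment(c), 4), ">=", round(1.0 - c, 4))
```

The adversarial choice $b_c^t = -c\,\nabla\mathcal{L}_T(\theta_t)$, $b_f^t = 0$ attains the value $1 - c$ exactly, so the margin shrinks to zero as $c \to 1$, in line with the remark that the condition weakens near stationary points.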

#### 3.3.6 Accelerated Saddle Escape via Hybrid Stochasticity

Near strict saddle points, gradient norms are small and the gradient non\-conflict condition may weaken\. However, stochasticity becomes beneficial\. Specifically, noisy gradients help the optimizer escape the saddle, and more importantly the hybrid OmniISR scheme amplifies this effect\.

###### Assumption 5 (Hessian Lipschitz).

There exists a constant $\rho > 0$ such that the Hessian is $\rho$-Lipschitz: for all $\theta, \theta'$,

$$\|\nabla^2\mathcal{L}_T(\theta) - \nabla^2\mathcal{L}_T(\theta')\| \leq \rho\,\|\theta - \theta'\|.$$

###### Theorem 6 (Saddle Escape).

Suppose Assumptions [1](https://arxiv.org/html/2605.20276#Thmassumption1), [4](https://arxiv.org/html/2605.20276#Thmassumption4), and [5](https://arxiv.org/html/2605.20276#Thmassumption5) hold, and let the model $\theta_t$ be located near a strict saddle point exhibiting negative curvature $-\gamma$ (where $\gamma > 0$). With probability at least $1 - \delta$, the hybrid optimization scheme escapes a neighborhood of radius $R$ around this saddle point within a number of iterations bounded by

$$T_{\mathrm{esc}}(\delta) \leq \frac{1}{\eta\gamma}\ln\left(\frac{C_0 R}{y_0 - \eta C_B - \eta C_\sigma(\delta)}\right) = \mathcal{O}\left(\frac{1}{\eta\gamma}\log\frac{R}{\delta}\right), \tag{21}$$

where $C_B = B_{\mathrm{eff}}/\gamma$, $C_\sigma(\delta) = (\sigma_{\mathrm{eff}}/\sqrt{\eta\gamma})\sqrt{\ln(2/\delta)}$, and $y_0$ represents the initial displacement along the direction of negative curvature. The constant $C_0$ is determined by $L_{\max}$ and $\rho$.

The time required to escape the saddle point is governed by the effective second moment $S^2 = B_{\mathrm{eff}}^2 + \sigma_{\mathrm{eff}}^2$. Notably, the hybrid framework yields a larger $S^2$ than either pure mode alone, because the compound gradient introduces additional variance from an independent data distribution. This induces a larger effective perturbation along the direction of negative curvature, thereby accelerating the escape process. [Fig. 5](https://arxiv.org/html/2605.20276#S3.F5) illustrates this process.

Furthermore, our bound on the escape time aligns with the perturbed gradient descent (PGD) framework [[53](https://arxiv.org/html/2605.20276#bib.bib53)]. The connection is made by equating their perturbation radius to our effective initial displacement $y_0$, their noise scale to $S^2$, and their negative-curvature magnitude to $\gamma$. By setting the learning rate and incorporating the relevant dimension-dependent geometric factors, we can recover the iteration complexity of [[53](https://arxiv.org/html/2605.20276#bib.bib53)] directly from [Eq. 21](https://arxiv.org/html/2605.20276#S3.E21).
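The first expression in Eq. (21) can be evaluated directly. The sketch below does so and returns infinity when the denominator $y_0 - \eta C_B - \eta C_\sigma(\delta)$ is not positive, in which case the logarithm is undefined. The function name, the default $C_0 = 1$, and every numeric value are assumptions chosen for illustration, not settings reported in the paper.

```python
import math


def escape_time_bound(eta, gamma, R, delta, y0, b_eff, sigma_eff, C0=1.0):
    """Evaluate the first right-hand side of Eq. (21).

    Returns math.inf when y0 - eta*C_B - eta*C_sigma(delta) <= 0,
    i.e. when the bound is undefined.
    """
    c_b = b_eff / gamma
    c_sigma = sigma_eff / math.sqrt(eta * gamma) * math.sqrt(math.log(2.0 / delta))
    denom = y0 - eta * c_b - eta * c_sigma
    if denom <= 0.0:
        return math.inf
    return math.log(C0 * R / denom) / (eta * gamma)


# Hypothetical parameter values, for illustration only.
params = dict(R=100.0, delta=0.1, y0=1.0, b_eff=0.01, sigma_eff=0.1)
moderate = escape_time_bound(eta=0.1, gamma=0.1, **params)
weak = escape_time_bound(eta=0.5, gamma=0.01, **params)
print(moderate, weak)
```

With these inputs the moderate-curvature case gives a finite bound, while the weak-curvature case with a larger step size drives the denominator negative and the bound is undefined.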

### 3.4 Complexity Analysis for $\epsilon$-Stationarity

To find the number of iterations $T(\epsilon)$ required for $\epsilon$-stationarity, we invert the bounds of [Theorems 1](https://arxiv.org/html/2605.20276#Thmtheorem1), [3](https://arxiv.org/html/2605.20276#Thmtheorem3) and [4](https://arxiv.org/html/2605.20276#Thmtheorem4) by requiring

$$\frac{1}{T}\sum\nolimits_{t=1}^{T}\mathbb{E}\bigl\|\nabla\mathcal{L}_T(\theta_t)\bigr\|^2 \;\leq\; \epsilon. \tag{22}$$

#### 3.4.1 Pure Centralized Mode ([Theorem 1](https://arxiv.org/html/2605.20276#Thmtheorem1))

Setting the right-hand side of [Theorem 1](https://arxiv.org/html/2605.20276#S3.Ex5) to $\epsilon$ gives the number of rounds $T_{\text{CL}}^{\text{ISR}}(\epsilon)$ required to reach $\epsilon$-stationarity:

$$T_{\text{CL}}^{\text{ISR}}(\epsilon) = \mathcal{O}\Bigl(\frac{\Delta\,L_{\max}\,(G_T^2 + \sigma_T^2)}{\epsilon^2}\Bigr). \tag{23}$$

Owing to $G_T^2 = \mathcal{O}(M)$ and $\sigma_T^2 = \mathcal{O}(M)$, the absolute iteration count grows *at most linearly in $M$*, while the asymptotic rate $\mathcal{O}(1/\sqrt{T})$ is preserved.

#### 3.4.2 Pure Federated Mode ([Theorem 3](https://arxiv.org/html/2605.20276#Thmtheorem3))

Based on [Theorem 3](https://arxiv.org/html/2605.20276#Thmtheorem3), the number of iterations required for the FL mode of OmniISR to reach $\epsilon$-stationarity is

$$T_{\mathrm{FL}}^{\text{ISR}}(\epsilon) = \mathcal{O}\Bigl(\frac{\Delta\,L_{\max}\,\bigl(G_T^2 + \sigma_T^2 + E^2\,\bar{H}\bigr)}{\epsilon^2}\Bigr). \tag{24}$$

ISR augments each client's loss with MI supervision and NE regularization that push the learned representations of each client toward a shared representation. This representation alignment translates into a *gradient-level alignment* bound, i.e., there exists a constant $\kappa \in (0,1)$ such that $\bar{H} \leq \kappa\bar{H}_0$, where $\bar{H}_0$ is the expected gradient heterogeneity in vanilla FL algorithms without any representation alignment. The contraction factor $\kappa$ depends on the ISR strength (i.e., $\{\alpha_m, \lambda_m\}_{m=1}^{M}$). In general, stronger ISR produces a smaller $\kappa$, with $\kappa \to 1$ when the ISR weights vanish (recovering vanilla FL) and $\kappa \to 0$ when representations are perfectly aligned.

For *vanilla FL* without ISR, substituting $\bar{H} = \bar{H}_0$ into [Eq. 24](https://arxiv.org/html/2605.20276#S3.E24) gives its iteration complexity

$$T_{\text{FL}}^{0}(\epsilon) \;\triangleq\; \mathcal{O}\Bigl(\frac{\Delta\,L_{\max}\,\bigl(G_T^2 + \sigma_T^2 + E^2\,\bar{H}_0\bigr)}{\epsilon^2}\Bigr). \tag{25}$$

With *ISR in place*, replacing $\bar{H}$ by $\kappa\,\bar{H}_0$ yields

$$T_{\text{FL}}^{\text{ISR}}(\epsilon) = \mathcal{O}\Bigl(\frac{\Delta\,L_{\max}\,\bigl(G_T^2 + \sigma_T^2 + E^2\,\kappa\,\bar{H}_0\bigr)}{\epsilon^2}\Bigr). \tag{26}$$

Based on [Eq. 25](https://arxiv.org/html/2605.20276#S3.E25) and [Eq. 26](https://arxiv.org/html/2605.20276#S3.E26), we can conclude that

- In the heterogeneity-dominated regime where $\bar{H}_0 \gg G_T^2 + \sigma_T^2$, this simplifies to $$T_{\text{FL}}^{\text{ISR}}(\epsilon) \;\approx\; \kappa\;T_{\text{FL}}^{0}(\epsilon), \tag{27}$$ indicating that ISR yields a reduction factor of $\kappa < 1$ in the number of communication rounds required to reach $\epsilon$-stationarity. [Eq. 27](https://arxiv.org/html/2605.20276#S3.E27) formalizes the communication-efficiency gains: ISR does not change the asymptotic rate $\mathcal{O}(1/\sqrt{T})$, but it reduces the *prefactor* by the contraction factor $\kappa$, directly translating into fewer communication rounds for the same target accuracy.
- In the noise-dominated regime where $G_T^2 + \sigma_T^2 \gg \bar{H}_0$, the gain from ISR is smaller, since heterogeneity is already a sub-leading term.
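The two regimes can be illustrated by taking the ratio of the expressions inside Eqs. (25) and (26). This is a rough sketch that assumes the constants hidden in the $\mathcal{O}(\cdot)$ notation are identical in both equations, so that the common factor $\Delta L_{\max}/\epsilon^2$ cancels; the function name and numeric values are hypothetical.

```python
def fl_rounds_ratio(G_sq, sigma_sq, E, H0, kappa):
    """Ratio T_FL^ISR / T_FL^0 implied by Eqs. (25)-(26),
    assuming equal hidden constants so shared factors cancel."""
    base = G_sq + sigma_sq
    return (base + E ** 2 * kappa * H0) / (base + E ** 2 * H0)


# Heterogeneity-dominated: H0 is much larger than G_T^2 + sigma_T^2.
print(fl_rounds_ratio(G_sq=1.0, sigma_sq=1.0, E=5, H0=1e4, kappa=0.5))
# Noise-dominated: G_T^2 + sigma_T^2 is much larger than H0.
print(fl_rounds_ratio(G_sq=1.0, sigma_sq=1.0, E=5, H0=1e-4, kappa=0.5))
```

In the first case the ratio is close to $\kappa$, matching Eq. (27); in the second it is close to 1, so ISR saves few rounds.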

#### 3.4.3 Hybrid CL–FL Mode ([Theorem 4](https://arxiv.org/html/2605.20276#Thmtheorem4))

In the hybrid CL–FL regime, the *irreducible bias floor* $B_{\text{eff}}^2$ stems from the mismatch between the sampled cloud distribution and the real distribution, and from data heterogeneity among FL clients. For target accuracy $\epsilon$, it yields

$$T_{\text{Hyb}}^{\text{ISR}}(\epsilon) = \mathcal{O}\Bigl(\frac{\Delta\,\sigma_{\text{eff}}^2}{\alpha_{\min}^2\,(\epsilon - C\,B_{\text{eff}}^2)^2}\Bigr), \tag{28}$$

revealing two regimes:

- *Above the floor* ($\epsilon > C\,B_{\text{eff}}^2$): convergence is achievable in $\mathcal{O}\bigl((\epsilon - C B_{\text{eff}}^2)^{-2}\bigr)$ rounds, slightly worse than the centralized $\mathcal{O}(\epsilon^{-2})$ due to the shifted denominator.
- *Below the floor* ($\epsilon \leq C\,B_{\text{eff}}^2$): the target is *unattainable* regardless of $T$, and the algorithm asymptotically oscillates.

The hybrid bound interpolates between CL and FL. Specifically, it inherits the lower variance of CL (via $\alpha_{\min}$) and the richer data coverage of FL (via $1 - \alpha_{\min}$), at the cost of a bias floor that reflects the mismatch between the two gradient sources. When ISR is active, the federated bias $B_f$ and heterogeneity $\bar{H}$ are both reduced, further tightening the hybrid bound.
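One way to see where the shifted denominator comes from is to solve Eq. (18) for $T$. The steps below are a sketch under the assumption that $\epsilon > C\,B_{\mathrm{eff}}^2$ and that $\eta$ may be tuned freely; the particular choice $\eta^\star$ is ours and is not stated in the source.

$$
\begin{aligned}
\frac{C}{\sqrt{T}}\Bigl(\frac{\Delta}{\alpha_{\min}\eta} + \eta\,\sigma_{\mathrm{eff}}^2\Bigr) + C\,B_{\mathrm{eff}}^2 \leq \epsilon
\;&\Longleftrightarrow\;
T \geq \frac{C^2\bigl(\frac{\Delta}{\alpha_{\min}\eta} + \eta\,\sigma_{\mathrm{eff}}^2\bigr)^2}{(\epsilon - C\,B_{\mathrm{eff}}^2)^2}, \\
\eta^\star = \sqrt{\frac{\Delta}{\alpha_{\min}\sigma_{\mathrm{eff}}^2}}
\;&\Longrightarrow\;
T \geq \frac{4\,C^2\,\Delta\,\sigma_{\mathrm{eff}}^2}{\alpha_{\min}\,(\epsilon - C\,B_{\mathrm{eff}}^2)^2}.
\end{aligned}
$$

Because $\alpha_{\min} \leq 1$, the factor $1/\alpha_{\min}$ obtained here is no larger than the $1/\alpha_{\min}^2$ in Eq. (28), so this calculation is consistent with Eq. (28) as an upper bound. When $\epsilon \leq C\,B_{\mathrm{eff}}^2$ the first inequality has no solution for any $T$, which is the below-the-floor regime.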

### 3.5 Comparison among Training Modes of OmniISR

[Tab. II](https://arxiv.org/html/2605.20276#S3.T2) compares the three training modes of the proposed OmniISR framework from multiple perspectives, including the role of the intermediate loss and regularizer and the theoretical convergence guarantees.

TABLE II: Comprehensive comparison of the working modes of OmniISR: pure CL, pure FL, and hybrid CL–FL

| Mode | Centralized (CL) OmniISR | Federated (FL) OmniISR | Hybrid CL–FL OmniISR |
| --- | --- | --- | --- |
| Update rule | $\theta_{t+1} = \theta_t - \eta_t g_t$ | $\theta_{t+1} = \theta_t + \sum_n w_n \delta_t^n$ | $\theta_{t+1} = \theta_t - \eta_t[\alpha_t g_{\mathrm{CL}}^t + (1-\alpha_t) g_{\mathrm{FL}}^t]$ |
| Explicit bound on $\tfrac{1}{T}\sum_t \mathbb{E}\lVert\nabla\mathcal{L}_T(\theta_t)\rVert^2$ | $\dfrac{2\Delta}{\eta\sqrt{T}} + \dfrac{L_{\max}\eta(G_T^2 + \sigma_T^2)}{\sqrt{T}}$ | $\dfrac{2\Delta}{\eta\sqrt{T}} + \dfrac{L_{\max}\eta(G_T^2 + \sigma_T^2 + \Gamma_{\mathrm{drift}})}{\sqrt{T}}$ | $\dfrac{C\Delta}{\alpha_{\min}\eta\sqrt{T}} + \dfrac{C\eta\sigma_{\mathrm{eff}}^2}{\sqrt{T}} + C B_{\mathrm{eff}}^2$ |
| Convergence rate | $\mathcal{O}(1/\sqrt{T})$ | $\mathcal{O}(1/\sqrt{T})$ | $\mathcal{O}(1/\sqrt{T}) + C\,B_{\mathrm{eff}}^2$ |
| Step-size condition | $L_{\max}\eta \leq \sqrt{T}$ | $L_{\max}\eta\sqrt{T}\,cE^2 < 1$ | $L_{\max}\eta \leq \sqrt{T}/4$ |
| Rounds to $\epsilon$-stationarity | $\mathcal{O}\bigl(\Delta L_{\max}(G_T^2 + \sigma_T^2)/\epsilon^2\bigr)$ | $\mathcal{O}\bigl(\Delta L_{\max}(G_T^2 + \sigma_T^2 + E^2\bar{H})/\epsilon^2\bigr)$ | $\mathcal{O}\bigl(\Delta\sigma_{\text{eff}}^2/(\alpha_{\min}^2(\epsilon - C B_{\text{eff}}^2)^2)\bigr)$ |

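## 4 Experiments

TABLE III: Quantitative performance evaluation of OmniISR across multiple model architectures and datasets (all values in %)

| Models | Remark | Setting | OmniISR? | Cityscapes mIoU | Cityscapes mF1 | Cityscapes mPre | Cityscapes mRec | CamVid mIoU | CamVid mF1 | CamVid mPre | CamVid mRec | SynthiaSF mIoU | SynthiaSF mF1 | SynthiaSF mPre | SynthiaSF mRec |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepLabv3+ | ResNet18 (Backbone) | Centralized | ✗ | 47.91 | 56.31 | 59.99 | 55.58 | 76.02 | 82.43 | 83.07 | 82.46 | 33.28 | 37.27 | 38.98 | 36.45 |
| DeepLabv3+ | ResNet18 (Backbone) | Centralized | ✓ | 50.49 | 59.70 | 65.00 | 57.82 | 76.13 | 82.52 | 83.09 | 82.57 | 34.28 | 39.13 | 42.74 | 37.60 |
| DeepLabv3+ | ResNet18 (Backbone) | Federated | ✗ | 43.76 | 50.40 | 51.54 | 50.77 | 72.78 | 80.24 | 81.33 | 79.47 | 26.65 | 31.94 | 40.30 | 30.46 |
| DeepLabv3+ | ResNet18 (Backbone) | Federated | ✓ | 47.78 | 56.28 | 59.64 | 55.64 | 73.24 | 80.70 | 81.60 | 80.34 | 29.03 | 34.90 | 41.12 | 32.92 |
| SeaFormer | Axial Transformer | Centralized | ✗ | 27.40 | 30.99 | 30.55 | 32.14 | 50.69 | 56.00 | 55.40 | 56.89 | 24.74 | 29.70 | 32.68 | 29.19 |
| SeaFormer | Axial Transformer | Centralized | ✓ | 29.82 | 34.19 | 33.80 | 35.45 | 55.83 | 62.39 | 64.19 | 62.54 | 24.20 | 29.23 | 32.20 | 29.00 |
| SeaFormer | Axial Transformer | Federated | ✗ | 27.32 | 30.95 | 32.78 | 31.85 | 47.08 | 53.39 | 53.86 | 53.91 | 16.29 | 21.05 | 28.98 | 21.29 |
| SeaFormer | Axial Transformer | Federated | ✓ | 29.77 | 34.18 | 33.84 | 35.39 | 51.54 | 59.40 | 60.07 | 60.36 | 18.43 | 23.20 | 28.33 | 22.88 |
| TopFormer | Normal Transformer | Centralized | ✗ | 32.76 | 37.64 | 36.92 | 39.24 | 63.10 | 70.22 | 71.88 | 70.25 | 28.37 | 33.75 | 36.97 | 32.99 |
| TopFormer | Normal Transformer | Centralized | ✓ | 34.28 | 39.96 | 40.41 | 40.60 | 66.38 | 74.50 | 77.47 | 73.60 | 28.70 | 34.04 | 37.22 | 33.20 |
| TopFormer | Normal Transformer | Federated | ✗ | 32.29 | 37.23 | 36.84 | 38.35 | 56.60 | 63.21 | 71.66 | 63.26 | 21.60 | 26.88 | 32.22 | 26.71 |
| TopFormer | Normal Transformer | Federated | ✓ | 32.33 | 37.24 | 36.75 | 38.43 | 58.85 | 66.20 | 72.02 | 65.20 | 21.63 | 26.98 | 31.80 | 25.81 |

### 4.1 Datasets, Metrics, and Implementation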

Datasets. We take the AD semantic segmentation task as an example and evaluate the proposed OmniISR framework on three widely used benchmarks. The Cityscapes dataset [[55](https://arxiv.org/html/2605.20276#bib.bib55)] consists of 2,975 training images and 500 validation images, each annotated with masks. It covers 19 semantic classes, such as vehicles and pedestrians. The CamVid dataset [[56](https://arxiv.org/html/2605.20276#bib.bib56)] comprises 701 images across 11 semantic classes. For our experiments, we randomly select 600 samples for training and use the remaining 101 samples as the test set. The SynthiaSF dataset [[57](https://arxiv.org/html/2605.20276#bib.bib57)] offers a collection of synthetic yet photorealistic images that emulate urban scenarios. It provides pixel-level annotations for 23 semantic classes, with 1,596 images designated for training and 628 for testing.

Metrics. We assess the proposed OmniISR framework using four commonly used metrics: mIoU, the mean intersection over union; mPrecision (mPre for short), the mean ratio of true positive pixels to all predicted positive pixels; mRecall (mRec for short), the mean ratio of true positive pixels to all positive ground-truth pixels; and mF1, the mean of the harmonic mean of precision and recall, which balances these two metrics. Each metric is averaged over all semantic classes, offering a comprehensive view of OmniISR's performance.
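The four metrics can be computed from a class confusion matrix as sketched below. This is a generic illustration rather than the evaluation code used in the experiments; the function name, the row/column convention, and the choice to score a class with no pixels as 0 are assumptions (evaluation toolkits differ on that last point).

```python
def segmentation_metrics(conf):
    """Class-averaged metrics (in %) from a confusion matrix where
    conf[i][j] counts pixels of ground-truth class i predicted as class j."""
    k = len(conf)
    ious, precs, recs, f1s = [], [], [], []
    for c in range(k):
        tp = conf[c][c]
        fp = sum(conf[r][c] for r in range(k)) - tp  # predicted c, truth differs
        fn = sum(conf[c]) - tp                       # truth c, predicted otherwise
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        ious.append(iou)
        precs.append(prec)
        recs.append(rec)
        f1s.append(f1)

    def mean_pct(values):
        return 100.0 * sum(values) / len(values)

    return {"mIoU": mean_pct(ious), "mF1": mean_pct(f1s),
            "mPre": mean_pct(precs), "mRec": mean_pct(recs)}


# Toy two-class example (made-up pixel counts).
print(segmentation_metrics([[8, 2], [1, 9]]))
```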

Implementation. All models and baselines are implemented in PyTorch and trained on two NVIDIA GeForce 4090 GPUs. For optimization, we employ the Adam optimizer with beta values of 0.9 and 0.999 and a weight decay of 1e-4. Our experiments cover the CNN-based DeepLabv3+ [[48](https://arxiv.org/html/2605.20276#bib.bib48)], the Transformer-based TopFormer [[49](https://arxiv.org/html/2605.20276#bib.bib49)], and the CNN-Transformer hybrid SeaFormer [[50](https://arxiv.org/html/2605.20276#bib.bib50)] on the Cityscapes, CamVid, and SynthiaSF datasets. To systematically evaluate the versatility of OmniISR, we contrast its CL training mode against the traditional output-layer supervision approach, while its FL mode is benchmarked against a diverse set of established FL algorithms, including FedProx [[58](https://arxiv.org/html/2605.20276#bib.bib58)], FedDyn [[59](https://arxiv.org/html/2605.20276#bib.bib59)], FedAvgM [[60](https://arxiv.org/html/2605.20276#bib.bib60)], FedIR [[61](https://arxiv.org/html/2605.20276#bib.bib61)], MOON [[62](https://arxiv.org/html/2605.20276#bib.bib62)], SCAFFOLD [[12](https://arxiv.org/html/2605.20276#bib.bib12)], FedAvg [[63](https://arxiv.org/html/2605.20276#bib.bib63)], BalanceFL [[31](https://arxiv.org/html/2605.20276#bib.bib31)], and FedGau [[33](https://arxiv.org/html/2605.20276#bib.bib33)].

### 4.2 Evaluation and Empirical Analyses

#### 4.2.1 Quantitative Evaluation

We carry out extensive experiments to compare the quantitative performance of OmniISR against standard output-layer-only supervision baselines under both centralized and federated settings. The evaluated architectures are the CNN-based DeepLabv3+, the Transformer-based TopFormer, and the CNN-Transformer hybrid SeaFormer. The results on the Cityscapes, CamVid, and SynthiaSF datasets are presented in [Tab. III](https://arxiv.org/html/2605.20276#S4.T3). Rather than reporting numbers case by case, we organize the empirical results around four research questions. Specifically, we ask: RQ1, whether OmniISR yields consistent gains for both CL and FL paradigms; RQ2, whether OmniISR-achieved gains track heterogeneity and optimization difficulty; RQ3, how model architecture influences effectiveness; and RQ4, whether OmniISR truly improves both paradigms' *unification quality* (not only absolute accuracy). This framing is designed to evaluate this paper's core claims, i.e., cross-paradigm applicability, architecture-agnostic deployment, and the drift–generalization trade-off.

TABLE IV: Matrix-level mIoU gain summary derived from [Tab. III](https://arxiv.org/html/2605.20276#S4.T3)

| Setting | Mean Absolute Gain | Mean Relative Gain |
| --- | --- | --- |
| Centralized (9 pairs) | +1.76 | +4.04% |
| Federated (9 pairs) | +2.03 | +6.06% |

### RQ1: Does OmniISR provide coherent cross\-paradigm gains?

Yes\. Matrix\-level statistics in[Tab\.IV](https://arxiv.org/html/2605.20276#S4.T4)show that OmniISR improves mIoU in both paradigms, with larger average gain in FL \(\+2\.03 absolute, \+6\.06%\) than in CL \(\+1\.76 absolute, \+4\.04%\)\. This asymmetry of improvement is theoretically meaningful: if intermediate constraints primarily mitigate representation drift, the benefit should be amplified under non\-IID FL, which is exactly what we observe\. At the exemplar level, DeepLabv3\+ on Cityscapes increases from 47\.91 to 50\.49 in CL and from 43\.76 to 47\.78 in FL, confirming that OmniISR improves model performance for both paradigms and does not trade one paradigm for the other\.
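The matrix-level statistics can be recomputed from the mIoU columns of Table III. The sketch below assumes that the mean relative gain is the average of the nine per-pair relative gains (rather than the relative gain of the averaged mIoU); under that reading the reported figures are reproduced. Variable and function names are our own.

```python
# mIoU (%) copied from Table III as (without OmniISR, with OmniISR).
# Within each model the order is Cityscapes, CamVid, SynthiaSF.
MIOU = {
    "Centralized": [
        (47.91, 50.49), (76.02, 76.13), (33.28, 34.28),  # DeepLabv3+
        (27.40, 29.82), (50.69, 55.83), (24.74, 24.20),  # SeaFormer
        (32.76, 34.28), (63.10, 66.38), (28.37, 28.70),  # TopFormer
    ],
    "Federated": [
        (43.76, 47.78), (72.78, 73.24), (26.65, 29.03),  # DeepLabv3+
        (27.32, 29.77), (47.08, 51.54), (16.29, 18.43),  # SeaFormer
        (32.29, 32.33), (56.60, 58.85), (21.60, 21.63),  # TopFormer
    ],
}


def gain_summary(pairs):
    """Return (mean absolute gain, mean per-pair relative gain in %)."""
    abs_gains = [with_isr - without for without, with_isr in pairs]
    rel_gains = [100.0 * (with_isr - without) / without
                 for without, with_isr in pairs]
    return sum(abs_gains) / len(pairs), sum(rel_gains) / len(pairs)


for setting, pairs in MIOU.items():
    mean_abs, mean_rel = gain_summary(pairs)
    print(f"{setting}: +{mean_abs:.2f} absolute, +{mean_rel:.2f}% relative")
```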

TABLE V: Qualitative performance comparison of the centralized setting of the proposed OmniISR against the conventional output-supervision-only training method

| | | | | | |
| --- | --- | --- | --- | --- | --- |
| Raw RGBs | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_048355.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_059642.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/lindau_000029_000019.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000086_000019.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000173_000019.jpg) |
| Ground Truth | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_048355_gt.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_059642_gt.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/lindau_000029_000019_gt.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000086_000019_gt.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000173_000019_gt.jpg) |
| w/o OmniISR | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_048355_dis.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_059642_dis.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/lindau_000029_000019_dis.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000086_000019_dis.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000173_000019_dis.jpg) |
| w/ OmniISR | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_048355_en.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/frankfurt_000001_059642_en.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/lindau_000029_000019_en.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000086_000019_en.jpg) | ![[Uncaptioned image]](https://arxiv.org/html/2605.20276v1/munster_000173_000019_en.jpg) |

### RQ2: Do OmniISR-achieved gains increase with heterogeneity and optimization difficulty?

The answer is generally yes, but with informative exceptions. Under DeepLabv3+, Cityscapes exhibits substantially larger relative mIoU gains (CL: +5.39%, FL: +9.19%) than CamVid (CL: +0.14%, FL: +0.63%), consistent with the hypothesis that OmniISR contributes more when client drift and class-boundary ambiguity are more severe. At the full-matrix level, this pattern is not perfectly monotonic across all models, because dataset difficulty and heterogeneity interact with the inductive bias of each architecture. The key implication is that OmniISR should be interpreted as a *difficulty/heterogeneity-adaptive* mechanism: it brings a stronger benefit in harder non-IID regimes and a near-saturated benefit in easier regimes.

![Refer to caption](https://arxiv.org/html/2605.20276v1/federated_mIoU.jpg)(a) mIoU
![Refer to caption](https://arxiv.org/html/2605.20276v1/federated_mF1.jpg)(b) mF1

Figure 6: Practical escape-time comparison between OmniISR-enabled and OmniISR-disabled settings.

![Refer to caption](https://arxiv.org/html/2605.20276v1/T_escape_panels.jpg)Figure 7: Evaluation of the saddle-point escape time for the proposed OmniISR framework.

### RQ3: Is OmniISR architecture-agnostic or architecture-invariant?

The evidence supports architecture-agnostic *deployability* but not architecture-invariant *effect size*. DeepLabv3+ shows the most stable gains, while Transformer-related models exhibit stronger condition dependence. For example, SeaFormer on SynthiaSF decreases slightly in CL (24.74 to 24.20 mIoU) but improves clearly in FL (16.29 to 18.43 mIoU). TopFormer gains are strong on CamVid (CL: 63.10 to 66.38; FL: 56.60 to 58.85) but almost neutral on SynthiaSF. This is a critical design message for practitioners: OmniISR can be inserted broadly, but the placement of the supervision points and the coefficients of the MI loss and NE regularizer should be architecture-aware, especially for Transformer families.

### RQ4: Does OmniISR improve cross\-paradigm unification \(reduce the CL–FL performance gap\) beyond raising absolute accuracy?

We quantify “unification quality” using the performance gap between CL and FL, defined as $D = \mathrm{mIoU}_{\mathrm{CL}} - \mathrm{mIoU}_{\mathrm{FL}}$ (where a smaller $D$ indicates superior unification). Across the nine evaluated model–dataset pairs, we analyze the individual gap $D$ with and without OmniISR, the aggregate mean and median gaps, and the proportion of pairs exhibiting a reduced gap. Taking DeepLabv3+ as a representative example, OmniISR consistently reduces this dataset-level gap: from 4.15 to 2.71 on Cityscapes ($\Delta = 1.44$, 34.7% reduction), 3.24 to 2.89 on CamVid ($\Delta = 0.35$, 10.8%), and 6.63 to 5.25 on SynthiaSF ($\Delta = 1.38$, 20.8%). Consequently, the average gap for DeepLabv3+ decreases from 4.67 to 3.62 ($\Delta = 1.05$, 22.6%). Importantly, this improvement is *not universal*. Across all nine architecture–dataset pairs, OmniISR reduces the gap in five instances but increases it in four. This variability highlights that the efficacy of CL–FL unification is highly sensitive to both the underlying data distribution and the specific model architecture, rather than being universally guaranteed.
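The gap analysis can be reproduced from the mIoU columns of Table III. The sketch below computes $D$ with and without OmniISR for all nine pairs, counts how many gaps shrink, and averages the DeepLabv3+ gaps; the data layout and names are our own.

```python
# mIoU (%) from Table III as (CL w/o, CL w/, FL w/o, FL w/) OmniISR.
ROWS = {
    ("DeepLabv3+", "Cityscapes"): (47.91, 50.49, 43.76, 47.78),
    ("DeepLabv3+", "CamVid"): (76.02, 76.13, 72.78, 73.24),
    ("DeepLabv3+", "SynthiaSF"): (33.28, 34.28, 26.65, 29.03),
    ("SeaFormer", "Cityscapes"): (27.40, 29.82, 27.32, 29.77),
    ("SeaFormer", "CamVid"): (50.69, 55.83, 47.08, 51.54),
    ("SeaFormer", "SynthiaSF"): (24.74, 24.20, 16.29, 18.43),
    ("TopFormer", "Cityscapes"): (32.76, 34.28, 32.29, 32.33),
    ("TopFormer", "CamVid"): (63.10, 66.38, 56.60, 58.85),
    ("TopFormer", "SynthiaSF"): (28.37, 28.70, 21.60, 21.63),
}


def gaps(rows):
    """Map each pair to (D without OmniISR, D with OmniISR)."""
    return {key: (cl0 - fl0, cl1 - fl1)
            for key, (cl0, cl1, fl0, fl1) in rows.items()}


all_gaps = gaps(ROWS)
reduced = [key for key, (d0, d1) in all_gaps.items() if d1 < d0]
print(f"gap reduced in {len(reduced)} of {len(all_gaps)} pairs")

deeplab = [v for (model, _), v in all_gaps.items() if model == "DeepLabv3+"]
mean_without = sum(d0 for d0, _ in deeplab) / len(deeplab)
mean_with = sum(d1 for _, d1 in deeplab) / len(deeplab)
reduction_pct = 100.0 * (mean_without - mean_with) / mean_without
print(f"DeepLabv3+ mean gap: {mean_without:.2f} -> {mean_with:.2f} "
      f"({reduction_pct:.1f}% reduction)")
```

The pairs with a reduced gap are the three DeepLabv3+ rows plus SeaFormer on Cityscapes and SynthiaSF; the remaining SeaFormer row and all three TopFormer rows show a wider gap with OmniISR.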

#### 4.2.2 Qualitative Evaluation

[Tab. V](https://arxiv.org/html/2605.20276#S4.T5) compares predictions with and without OmniISR on five representative urban scenes. Beyond visual "cleanliness", the key qualitative evidence is structural consistency. Specifically, OmniISR better preserves thin and boundary-sensitive regions (e.g., poles, object contours, and small foreground instances) while reducing fragmented misclassification in cluttered backgrounds. This observation is consistent with the design purpose of OmniISR: MI supervision strengthens semantic discriminability at intermediate layers, and NE regularization discourages overconfident early commitments that often produce brittle boundaries. The qualitative gains therefore corroborate the quantitative improvements by showing how the improvement is realized in pixel space. In addition, the quantitative pattern in [Tab. III](https://arxiv.org/html/2605.20276#S4.T3) shows that mPre often increases at least as much as mRec, suggesting that OmniISR primarily reduces false positives from ambiguous regions while maintaining recall, which is desirable for safety-critical AD perception pipelines.

#### 4.2.3 Escape Time Evaluation

[Fig\.6](https://arxiv.org/html/2605.20276#S4.F6)reports the practical escape\-time comparison between OmniISR\-enabled and OmniISR\-disabled training\. The curves show that OmniISR reaches the post\-saddle acceleration phase earlier on both mIoU and mF1 \(black circles\)\. This result is an optimization\-speed gain, where fewer rounds to exit flat/unstable regions translate into lower communication and computation budgets\. This behavior supports our theoretical claim of gradient synergy: intermediate supervision stabilizes useful directions, while regularization\-induced stochasticity prevents overconfident trapping, jointly improving the probability of fast escape from poor stationary regions\.

To further illustrate the theoretical properties of the proposed OmniISR framework, [Fig. 7](https://arxiv.org/html/2605.20276#S4.F7) analyzes its saddle escape time $T_{\mathrm{escape}}$. Specifically, the figure evaluates the sensitivity of $T_{\mathrm{escape}}$ with respect to four key parameters: the curvature $\gamma$, the step size $\eta$, the neighborhood radius $R$, and the confidence level $\delta$. Each subfigure sweeps one primary variable while holding the others constant, with multiple curves showing the compounding influence of a secondary parameter.

### (a) Effect of curvature $\gamma$

[Fig. 7](https://arxiv.org/html/2605.20276#S4.F7)(a) illustrates the effect of the curvature parameter $\gamma$ on the saddle escape time $T_{\mathrm{escape}}$. The log–log plot reveals that $\gamma$ is the dominant factor governing the escape dynamics. Specifically, $T_{\mathrm{escape}}$ exhibits a strong inverse dependence on the curvature, decreasing across the evaluated curvature range. Furthermore, the escape time scales inversely with the step size $\eta$, as evidenced by the vertical spacing between the curves. However, in the weak-curvature regime (i.e., for small values of $\gamma$), the curves bend sharply upward and terminate prematurely. This divergence indicates that stochastic noise begins to dominate the gradient information, causing the theoretical bound to break down. Overall, these findings emphasize that even a marginal increase in curvature drastically accelerates the saddle escape process.

### (b) Effect of step size $\eta$

[Fig. 7](https://arxiv.org/html/2605.20276#S4.F7)(b) illustrates the influence of the step size $\eta$ on the escape time $T_{\mathrm{escape}}$. For most curvature values, $T_{\mathrm{escape}}$ decreases gradually as $\eta$ increases, exhibiting a gentle downward slope on the logarithmic scale. However, in the weak-curvature regime (e.g., $\gamma \in \{0.01, 0.02, 0.05\}$), the dynamics display a pronounced non-monotonic behavior. Beyond a critical step size threshold, the escape time rises abruptly, forming a distinct U-shape with a well-defined minimum. The location of this minimum, which represents the optimal step size, shifts toward larger values of $\eta$ as the curvature $\gamma$ increases. Ultimately, these findings demonstrate a critical trade-off: while larger step sizes generally accelerate the escape process, exceeding the optimal $\eta$ becomes highly detrimental when the curvature is small, leading to a severe divergence in escape time.
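The shift of the optimal step size with curvature can be read off Eq. (21) through a change of variable. The following is a sketch under the assumption that $y_0$, $C_0$, $R$, $B_{\mathrm{eff}}$, $\sigma_{\mathrm{eff}}$, and $\delta$ do not themselves depend on $\eta$ or $\gamma$, and that $C_0 R > y_0$; the ratio $s$ is notation introduced here for illustration. With $s = \eta/\gamma$, the definitions of $C_B$ and $C_\sigma(\delta)$ give $\eta C_B = s\,B_{\mathrm{eff}}$ and $\eta C_\sigma(\delta) = \sigma_{\mathrm{eff}}\sqrt{s\ln(2/\delta)}$, so that

$$T_{\mathrm{esc}}(\delta) \leq \frac{1}{\gamma^2}\cdot\frac{1}{s}\ln\left(\frac{C_0 R}{y_0 - s\,B_{\mathrm{eff}} - \sigma_{\mathrm{eff}}\sqrt{s\ln(2/\delta)}}\right).$$

For fixed $\gamma$, the bound depends on $\eta$ only through $s$. It diverges both as $s \to 0$ and as the denominator approaches zero, so it has an interior minimizer $s^\star$ that does not depend on $\gamma$. The minimizing step size is therefore $\eta^\star = s^\star\gamma$, which grows linearly with the curvature, and the minimum value of the bound scales as $1/\gamma^2$. The bound is defined only while the denominator is positive, so the admissible range of $\eta$ also shrinks in proportion to $\gamma$, which is one reading of why the blow-up appears at smaller step sizes for weaker curvature.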

### (c) Effect of neighborhood radius $R$

[Fig. 7](https://arxiv.org/html/2605.20276#S4.F7)(c) illustrates the impact of the neighborhood radius $R$ on the escape time $T_{\mathrm{escape}}$. When plotted on a semi-logarithmic scale (linear $T_{\mathrm{escape}}$ versus logarithmic $R$), the curves exhibit an approximately linear trajectory, indicating that $T_{\mathrm{escape}}$ grows only logarithmically with $R$. This logarithmic dependence demonstrates that the neighborhood radius exerts a relatively weak influence on the escape dynamics. Instead, the overall magnitude of the escape time is governed predominantly by the curvature $\gamma$. This is evident from the distinct vertical stratification of the curves: the weak-curvature case ($\gamma = 0.02$) dominates the plot, reaching $T_{\mathrm{escape}} \approx 3 \times 10^{4}$ at $R = 100$, whereas the curves corresponding to stronger curvatures ($\gamma \gtrsim 0.1$) remain essentially flat and near zero across the entire range of $R$. Furthermore, in the extreme weak-curvature scenario ($\gamma = 0.01$), the curve is absent at small values of $R$. This absence indicates that the theoretical bound becomes physically meaningless there.
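The semi-logarithmic reading above can be made explicit with a short worked relation. This is only a sketch: the constants $a$ and $b$ are hypothetical fit parameters introduced here for illustration (they would depend on $\gamma$, $\eta$, and $\delta$) and are not taken from the paper's escape-time bound.

```latex
% Assumption: a and b are illustrative fit constants, not quantities from the paper.
T_{\mathrm{escape}}(R) \approx a + b \log R
\quad\Longrightarrow\quad
T_{\mathrm{escape}}(cR) - T_{\mathrm{escape}}(R) \approx b \log c .
```

Under this form, plotting $T_{\mathrm{escape}}$ against $\log R$ gives a straight line with slope $b$, and multiplying $R$ by a constant factor $c$ adds only the fixed increment $b \log c$. This is consistent with the radius being a weak lever compared with the curvature.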

### (d) Effect of confidence $\delta$

[Fig. 7](https://arxiv.org/html/2605.20276#S4.F7)(d) illustrates the influence of the confidence level $\delta$ on the saddle escape time $T_{\mathrm{escape}}$. For moderate to large values of the curvature parameter $\gamma$, the curves remain essentially flat across the evaluated range of $\delta$. This indicates that the required confidence level has a negligible impact on the escape dynamics, with the corresponding escape times remaining compressed near the bottom of the plot and practically indistinguishable from zero on a linear scale. However, a stark contrast emerges in the extreme weak-curvature regime. Specifically, for $\gamma = 0.01$, the escape time rises steeply as $\delta$ decreases, peaking at approximately $9 \times 10^{4}$ near $\delta \approx 10^{-1}$. The curve corresponding to $\gamma = 0.02$ exhibits a similar, albeit significantly less pronounced, upward trajectory. Ultimately, these observations demonstrate that demanding higher statistical confidence incurs almost no penalty in escape time, except when the curvature is exceptionally small, where the cost becomes severe.

### Overall Synthesis

TABLE VI: Qualitative summary of the dependencies of the escape time $T_{\mathrm{esc}}$.

| Parameter | Observed Trend | Practical Impact |
|---|---|---|
| $\gamma$ | Pronounced log–log decay | Dominant |
| $\eta$ | Non-monotonic U-shaped dependence | Strong (exhibits an optimum) |
| $R$ | Mild logarithmic growth | Weak |
| $\delta$ | Negligible variation; diverges only for minimal $\gamma$ | Mostly negligible |

A summary of the four parametric studies, provided in [Tab. VI](https://arxiv.org/html/2605.20276#S4.T6), highlights the key factors of the escape time. Most importantly, the curvature $\gamma$ is the dominant factor controlling the overall escape time $T_{\mathrm{escape}}$. It determines the baseline magnitude of the escape time and causes the largest variations across all of our evaluations. Additionally, the step size $\eta$ plays a critical role, possessing a clear optimal value for minimizing the escape time. If the step size exceeds this optimal threshold, the escape time rises sharply, especially when the curvature is weak. In contrast, the neighborhood radius $R$ and the confidence level $\delta$ act as minor factors. Their impact on $T_{\mathrm{escape}}$ is mostly negligible, only becoming significant in extreme, weak-curvature scenarios where the theoretical bounds are highly sensitive.

TABLE VII: Quantitative performance comparison of OmniISR-enabled case against OmniISR-disabled case across multiple FL algorithms

| FL Algorithm | OmniISR? | mIoU | mF1 | mPre | mRec |
|---|---|---|---|---|---|
| FedAvg | ✗ | 47.91 | 56.31 | 59.99 | 55.58 |
| | ✓ | 50.49 | 59.70 | 65.00 | 57.82 |
| FedProx (0.005) | ✗ | 41.41 | 47.52 | 47.69 | 48.31 |
| | ✓ | 42.45 | 48.53 | 48.61 | 49.36 |
| FedProx (0.01) | ✗ | 31.74 | 35.74 | 35.82 | 37.31 |
| | ✓ | 32.84 | 36.82 | 36.35 | 38.38 |
| FedDyn (0.005) | ✗ | 28.63 | 31.95 | 31.93 | 33.95 |
| | ✓ | 29.75 | 33.05 | 32.31 | 35.01 |
| FedDyn (0.01) | ✗ | 25.19 | 27.92 | 26.63 | 29.43 |
| | ✓ | 26.19 | 28.92 | 27.64 | 30.43 |
| FedAvgM (0.7) | ✗ | 51.49 | 60.74 | 65.37 | 58.96 |
| | ✓ | 52.59 | 61.84 | 66.66 | 59.93 |
| FedAvgM (0.9) | ✗ | 51.65 | 60.91 | 65.55 | 59.12 |
| | ✓ | 52.74 | 61.93 | 66.63 | 60.39 |
| FedGau | ✗ | 52.03 | 62.23 | 68.98 | 59.36 |
| | ✓ | 54.40 | 64.43 | 71.94 | 61.49 |
| FedIR | ✗ | 25.83 | 28.33 | 27.91 | 29.02 |
| | ✓ | 25.81 | 28.32 | 27.90 | 28.98 |
| MOON | ✗ | 51.48 | 60.75 | 65.37 | 59.02 |
| | ✓ | 52.49 | 61.67 | 66.63 | 59.70 |
| SCAFFOLD | ✗ | 24.19 | 27.22 | 26.32 | 28.27 |
| | ✓ | 23.63 | 26.79 | 26.38 | 27.49 |
| BalanceFL | ✗ | 52.15 | 61.36 | 65.82 | 59.53 |
| | ✓ | 51.48 | 60.78 | 65.54 | 58.95 |

### 4.3 OmniISR-Augmented FL Algorithms

To assess whether OmniISR is merely compatible with FedAvg-like updates or genuinely *optimizer-agnostic*, we perform a stress test across 12 FL algorithm configurations (Tab. VII) spanning the FedAvg baseline, proximal correction (FedProx), drift correction (SCAFFOLD), dynamic regularization (FedDyn), momentum aggregation (FedAvgM), contrastive alignment (MOON), imbalance-aware optimization (FedIR/BalanceFL), and our prior AD-oriented variant (FedGau).

Result 1: broad positive transfer. Across 48 paired comparisons (12 algorithm configurations $\times$ 4 metrics), OmniISR improves performance in 37 cases (77.08%). The gains also remain substantial on strong representative baselines. For instance, adding OmniISR to FedGau yields relative improvements of 4.56%, 3.54%, 4.29%, and 3.59% in mIoU, mF1, mPre, and mRec, respectively. These gains indicate that OmniISR’s intermediate supervision and regularization mechanism is adaptable and not restricted to any single FL aggregation algorithm.

Result 2: gain stratification reveals when OmniISR is most useful. We observe distinct tiers of performance improvement: (i) *high-gain* methods (e.g., FedAvg, FedAvgM, FedGau, and MOON); (ii) *moderate-gain* methods (e.g., FedProx and FedDyn); and (iii) *marginal-to-negative* methods (e.g., FedIR, SCAFFOLD, and BalanceFL). This stratification indicates that OmniISR yields the most significant improvements when the base FL algorithm lacks robust, built-in representation-drift control mechanisms. Conversely, the benefits are less pronounced when drift correction is already heavily encoded into the base FL algorithm itself.
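The paired comparison behind Results 1 and 2 can be re-derived directly from Table VII. The following is a minimal sketch, not the authors' evaluation code: the numbers are transcribed from Table VII, while the names `TABLE_VII`, `paired_wins`, and `relative_gain` are hypothetical helpers introduced here for illustration. A "win" is assumed to mean a strictly higher score with OmniISR enabled.

```python
# Table VII, transcribed as (OmniISR disabled, OmniISR enabled) per configuration.
# Metric order in every tuple: mIoU, mF1, mPre, mRec.
METRICS = ("mIoU", "mF1", "mPre", "mRec")
TABLE_VII = {
    "FedAvg":          ((47.91, 56.31, 59.99, 55.58), (50.49, 59.70, 65.00, 57.82)),
    "FedProx (0.005)": ((41.41, 47.52, 47.69, 48.31), (42.45, 48.53, 48.61, 49.36)),
    "FedProx (0.01)":  ((31.74, 35.74, 35.82, 37.31), (32.84, 36.82, 36.35, 38.38)),
    "FedDyn (0.005)":  ((28.63, 31.95, 31.93, 33.95), (29.75, 33.05, 32.31, 35.01)),
    "FedDyn (0.01)":   ((25.19, 27.92, 26.63, 29.43), (26.19, 28.92, 27.64, 30.43)),
    "FedAvgM (0.7)":   ((51.49, 60.74, 65.37, 58.96), (52.59, 61.84, 66.66, 59.93)),
    "FedAvgM (0.9)":   ((51.65, 60.91, 65.55, 59.12), (52.74, 61.93, 66.63, 60.39)),
    "FedGau":          ((52.03, 62.23, 68.98, 59.36), (54.40, 64.43, 71.94, 61.49)),
    "FedIR":           ((25.83, 28.33, 27.91, 29.02), (25.81, 28.32, 27.90, 28.98)),
    "MOON":            ((51.48, 60.75, 65.37, 59.02), (52.49, 61.67, 66.63, 59.70)),
    "SCAFFOLD":        ((24.19, 27.22, 26.32, 28.27), (23.63, 26.79, 26.38, 27.49)),
    "BalanceFL":       ((52.15, 61.36, 65.82, 59.53), (51.48, 60.78, 65.54, 58.95)),
}


def paired_wins(table):
    """Count, per configuration, the metrics on which the enabled row is strictly higher."""
    return {
        name: sum(enabled > disabled for disabled, enabled in zip(base, with_isr))
        for name, (base, with_isr) in table.items()
    }


def relative_gain(table, name):
    """Relative improvement in percent, per metric, for one configuration."""
    base, with_isr = table[name]
    return {
        metric: round(100.0 * (enabled - disabled) / disabled, 2)
        for metric, disabled, enabled in zip(METRICS, base, with_isr)
    }


wins = paired_wins(TABLE_VII)
total_wins = sum(wins.values())
total_pairs = len(TABLE_VII) * len(METRICS)
low_win_configs = [name for name, count in wins.items() if count <= 1]

print(f"{total_wins}/{total_pairs} paired wins")
print("FedGau relative gains (%):", relative_gain(TABLE_VII, "FedGau"))
print("Configurations with at most one win:", low_win_configs)
```

Counting wins per configuration also makes the marginal-to-negative tier visible: the configurations with at most one win are exactly FedIR, SCAFFOLD, and BalanceFL. The boundary between the high-gain and moderate-gain tiers is not specified numerically in the text, so the sketch does not attempt to reproduce it.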

![Figure 8(a): mIoU](https://arxiv.org/html/2605.20276v1/abl_number_mIoU.jpg) (a) mIoU
![Figure 8(b): mPrecision](https://arxiv.org/html/2605.20276v1/abl_number_mPre.jpg) (b) mPrecision
![Figure 8(c): mRecall](https://arxiv.org/html/2605.20276v1/abl_number_mRec.jpg) (c) mRecall
![Figure 8(d): mF1](https://arxiv.org/html/2605.20276v1/abl_number_mF1.jpg) (d) mF1

Figure 8: The impact of the number of intermediate points on OmniISR’s training performance.

### 4.4 Ablation Studies

This subsection details three ablation studies designed to evaluate the configuration of intermediate points in OmniISR. We assess how OmniISR’s inference performance is affected by (i) the number of intermediate points deployed, (ii) the interval distance between adjacent points, and (iii) the structural placement of these points.

#### 4.4.1 Impact of the number of intermediate points

To quantify how supervision granularity affects OmniISR, we evaluate 1–5 intermediate points in both CL and FL settings (i.e., “Intermediate Point (1)” to “Intermediate Point (5)”). As shown in [Fig. 8](https://arxiv.org/html/2605.20276#S4.F8), performance peaks at a moderate number of points, revealing a clear bias–variance style trade-off in representation shaping: too few points under-constrain latent drift, while too many points over-constrain feature evolution and reduce representational flexibility. This observation is consistent with our core claim that intermediate MI supervision should guide rather than dominate hidden representations. In practice, OmniISR deployments should therefore use a moderate point count rather than maximal insertion.

![Figure 9(a): mIoU](https://arxiv.org/html/2605.20276v1/abl_dist_mIoU.jpg) (a) mIoU
![Figure 9(b): mPrecision](https://arxiv.org/html/2605.20276v1/abl_dist_mPre.jpg) (b) mPrecision
![Figure 9(c): mRecall](https://arxiv.org/html/2605.20276v1/abl_dist_mRec.jpg) (c) mRecall
![Figure 9(d): mF1](https://arxiv.org/html/2605.20276v1/abl_dist_mF1.jpg) (d) mF1

Figure 9: The impact of the distance between adjacent intermediate points on OmniISR’s training performance.

#### 4.4.2 Impact of the distance between adjacent points

To examine spatial coupling between supervision points, we define a base layer distance and test three configurations: 1-base, 2-base, and 3-base spacing for both CL and FL. [Fig. 9](https://arxiv.org/html/2605.20276#S4.F9) shows that the 2-base spacing consistently yields the best performance. A plausible interpretation is that 1-base spacing introduces redundant constraints on highly correlated adjacent features, while 3-base spacing is too sparse to propagate stable intermediate guidance across depth. Thus, moderate spacing appears to best balance local feature consistency and global semantic abstraction, which is precisely where OmniISR is expected to be most effective.

TABLE VIII: The impact of the position of intermediate points on OmniISR’s training performance

| | mIoU | mPre | mRec | mF1 |
|---|---|---|---|---|
| Single-Point | ![Single-point, mIoU](https://arxiv.org/html/2605.20276v1/abl_position_one_mIoU.jpg) | ![Single-point, mPre](https://arxiv.org/html/2605.20276v1/abl_position_one_mPre.jpg) | ![Single-point, mRec](https://arxiv.org/html/2605.20276v1/abl_position_one_mRec.jpg) | ![Single-point, mF1](https://arxiv.org/html/2605.20276v1/abl_position_one_mF1.jpg) |
| Two-Point | ![Two-point, mIoU](https://arxiv.org/html/2605.20276v1/abl_position_two_mIoU.jpg) | ![Two-point, mPre](https://arxiv.org/html/2605.20276v1/abl_position_two_mPre.jpg) | ![Two-point, mRec](https://arxiv.org/html/2605.20276v1/abl_position_two_mRec.jpg) | ![Two-point, mF1](https://arxiv.org/html/2605.20276v1/abl_position_two_mF1.jpg) |
| Three-Point | ![Three-point, mIoU](https://arxiv.org/html/2605.20276v1/abl_position_three_mIoU.jpg) | ![Three-point, mPre](https://arxiv.org/html/2605.20276v1/abl_position_three_mPre.jpg) | ![Three-point, mRec](https://arxiv.org/html/2605.20276v1/abl_position_three_mRec.jpg) | ![Three-point, mF1](https://arxiv.org/html/2605.20276v1/abl_position_three_mF1.jpg) |

#### 4.4.3 Impact of the position of intermediate points

To determine the most effective locations for intermediate constraints, we evaluate three placement strategies:

- Series I: A single intermediate point located near the input, middle, or output stage.
- Series II: Two intermediate points, following the same positional biases.
- Series III: Three intermediate points, again comparing input-, middle-, and output-biased configurations.

The results, summarized in [Tab. VIII](https://arxiv.org/html/2605.20276#S4.T8), reveal a consistent pattern regarding the network’s learning dynamics. Across all three series, input-biased placement yields the best performance. This suggests that applying guidance at early layers effectively stabilizes shared, low-level feature representations before client-specific drift can accumulate in the deeper layers. Conversely, the performance of output-biased placement degrades as the number of constraint points increases. This decline likely occurs because excessive constraints at later stages interfere with task-specific adaptation and limit the flexibility of the final decision layers.

Interestingly, middle-biased placement improves as more constraint points are added. This indicates that distributing supervision across the intermediate layers (the network’s “waist”) enhances semantic consolidation without overly restricting the final output logits. Ultimately, these findings offer a practical guideline for deployment: intermediate constraints should be prioritized in the early-to-middle stages of the network, while stacking multiple constraints near the output head should be avoided.

To move from ablation analysis to actionable practice, we summarize a suite of concrete configuration policies:

- Step 1: choose a point count in the mid-range to avoid both under-guidance and over-constraint.
- Step 2: choose moderate spacing to balance local consistency and global abstraction.
- Step 3: bias placement toward early-to-middle stages, especially under strong non-IID drift.
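The three steps above can be sketched as a small placement helper. This is an illustrative sketch under stated assumptions, not part of OmniISR: the function `place_points`, its parameters, and the 12-stage backbone in the usage lines are all hypothetical. Here `spacing` is assumed to be measured in multiples of the base layer distance, and `bias` selects the input-, middle-, or output-biased configurations from the position ablation.

```python
def place_points(num_stages, num_points, spacing=2, bias="input"):
    """Return 0-based stage indices at which to attach intermediate points.

    Hypothetical helper: `num_stages` is the number of candidate hidden stages,
    `spacing` is the gap between adjacent points in base-distance units, and
    `bias` is one of "input", "middle", or "output".
    """
    if num_points < 1 or spacing < 1:
        raise ValueError("num_points and spacing must be positive")
    span = (num_points - 1) * spacing  # distance from the first to the last point
    if span >= num_stages:
        raise ValueError("points do not fit into the available stages")
    slack = num_stages - 1 - span  # stages left over once the span is placed
    offsets = {"input": 0, "middle": slack // 2, "output": slack}
    if bias not in offsets:
        raise ValueError(f"unknown bias: {bias!r}")
    start = offsets[bias]
    return [start + i * spacing for i in range(num_points)]


# Illustrative 12-stage backbone with three points at 2-base spacing.
print(place_points(12, 3, spacing=2, bias="input"))   # [0, 2, 4]
print(place_points(12, 3, spacing=2, bias="middle"))  # [3, 5, 7]
print(place_points(12, 3, spacing=2, bias="output"))  # [7, 9, 11]
```

The defaults (`spacing=2`, `bias="input"`) mirror the ablation findings that 2-base spacing and input-biased placement performed best. The point count of three is only an example of a mid-range value, since the text does not name a specific optimum.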

## 5 Conclusion

We presented OmniISR, a unified CL/FL/hybrid framework that couples intermediate MI supervision and NE regularization under one optimization objective. Rather than claiming novelty from individual components in isolation, we emphasize the innovation in their cross-paradigm coupling and corresponding optimization theory.

1. We formulate a unified objective that can be instantiated in centralized, federated, and hybrid modes without architecture-specific redesign.
2. We show that heterogeneous intermediate supervision plus uncertainty-aware regularization can jointly mitigate representation drift and preserve generalization, especially under non-IID federated conditions.
3. Theoretically, we establish four key results: (i) a unified, ISR-agnostic, and non-asymptotic $\mathcal{O}(1/\sqrt{T})$ convergence bound applicable across pure and hybrid CL–FL paradigms; (ii) a federated drift-bound quantifying the ISR-reduced client drift; (iii) a gradient-alignment guarantee ensuring update consistency; and (iv) an explicit time bound demonstrating that hybrid mixing accelerates the escape from strict saddle points.
4. Empirically, OmniISR improves a broad benchmark matrix, narrows the CL–FL quality gap in representative settings, and exhibits cross-optimizer positive transfer.
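As a reminder of what the NE regularizer rewards and penalizes, the sketch below evaluates negative entropy on a uniform and on a near one-hot prediction. It is a minimal illustration assuming the standard definition $\mathrm{NE}(p) = \sum_k p_k \log p_k$. The helper names, the four-class example, and the `eps` stabilizer are assumptions introduced here, and the sketch does not reproduce how OmniISR weights or attaches this term.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_entropy(probs, eps=1e-12):
    """Return sum_k p_k * log(p_k), i.e. minus the Shannon entropy.

    `eps` is an illustrative stabilizer that avoids log(0).
    """
    return sum(p * math.log(p + eps) for p in probs)


uniform = softmax([0.0, 0.0, 0.0, 0.0])     # maximally uncertain prediction
confident = softmax([10.0, 0.0, 0.0, 0.0])  # near one-hot prediction

ne_uniform = negative_entropy(uniform)
ne_confident = negative_entropy(confident)
print("NE(uniform)   =", ne_uniform)
print("NE(confident) =", ne_confident)
```

Negative entropy is smallest (equal to $-\log K$ for $K$ classes) on the uniform distribution and approaches zero for a one-hot prediction, so adding it to a minimized loss raises the cost of overconfident outputs.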

Three future extensions appear most impactful: (i) comprehensive multi-seed statistical intervals, (ii) explicit hybrid-mode empirical sweeps across cloud/client data ratios and mixing policies, and (iii) full accuracy–communication–computation Pareto reporting under realistic edge constraints.
