Accurate and Resource-Efficient Federated Continual Learning
Summary
FedRAN is a resource-aware analytic federated continual learning framework that replaces gradient-based updates with compact random feature statistics, achieving high accuracy with significantly lower communication and computation costs.
# Accurate and Resource-Efficient Federated Continual Learning

Source: [https://arxiv.org/html/2606.11480](https://arxiv.org/html/2606.11480)

Dhruv Parikh (dhruvash@usc.edu), University of Southern California, Los Angeles, USA; Jayashree Adivarahan (adivarah@usc.edu), University of Southern California, Los Angeles, USA; Rajgopal Kannan (rajgopal.kannan.civ@army.mil), DEVCOM Army Research Office, USA; and Viktor Prasanna (prasanna@usc.edu), University of Southern California, Los Angeles, USA (2027)

###### Abstract

Federated continual learning (FCL) must learn from distributed task streams under limited resources, such as communication, computation, memory, and label availability. Existing FCL methods often rely on repeated local optimization, replay, and full supervision. Analytic alternatives avoid iterative training and replay, but using high-dimensional random features to improve accuracy requires a second-order feature statistic, the Gram matrix, which has a quadratic communication cost in the random feature size $M$. We propose FedRAN, a resource-aware analytic FCL framework that replaces gradient-based updates with compact random feature statistics. Each client transmits a truncated-SVD summary of its Gram matrix, reducing the dominant second-order upload from quadratic to linear in $M$ for fixed rank. The server performs a two-level QR-SVD subspace merge, spatially across clients and temporally across tasks, and solves a ridge classifier in closed form. FedRAN further supports label scarcity through prototype-based pseudo-labeling.
Across CIFAR-100, ImageNet-R, and VTAB datasets, FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6–121.8$\times$ less per-client communication than optimization-based FCL, and is 190.3$\times$ faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points. These results show that FedRAN enables accurate and resource-efficient FCL under communication, computation, and label constraints. The source code is available at [https://github.com/JebacyrilArockiaraj/Fed-RAN-SSL](https://github.com/JebacyrilArockiaraj/Fed-RAN-SSL).

Keywords: Federated continual learning, Analytic continual learning, Semi-supervised learning

CCS Concepts: Computing methodologies → Machine learning; Lifelong machine learning; Semi-supervised learning settings

## 1. Introduction

Figure 1. Overview of federated continual learning.

Table 1. Comparing FedRAN against representative FCL and analytic-learning families. FedRAN targets the middle ground between full second-order analytic aggregation and first-order/statistic-only communication: it communicates low-rank spectral summaries of random-feature Gram statistics and merges them spatially across clients and temporally across tasks.

Federated learning (FL) enables multiple clients to collaboratively learn a model while keeping raw data local (McMahan et al., [2017a](https://arxiv.org/html/2606.11480#bib.bib1); Kairouz and McMahan, [2021](https://arxiv.org/html/2606.11480#bib.bib2)). This privacy-preserving learning paradigm is well-suited to settings where data are distributed across clients and cannot be centralized due to ownership, privacy, or communication constraints.
Most FL methods, however, assume a static learning problem: clients optimize a model for a fixed data distribution, and the learned model is then deployed. In many practical deployments, data continue to arrive after deployment, and local client distributions evolve as new classes, domains, or sensing conditions appear (Jothimurugesan et al., [2023](https://arxiv.org/html/2606.11480#bib.bib8)). Continual learning (CL) studies how models can learn from such non-stationary streams while preserving performance on previously observed tasks (Kirkpatrick et al., [2017b](https://arxiv.org/html/2606.11480#bib.bib6); Schwarz et al., [2018](https://arxiv.org/html/2606.11480#bib.bib5); De Lange et al., [2021](https://arxiv.org/html/2606.11480#bib.bib7)). As shown in Figure [1](https://arxiv.org/html/2606.11480#S1.F1), federated continual learning (FCL) combines these two axes: clients learn from private sequential task streams while a server maintains a global model over all classes observed so far (Yoon et al., [2021](https://arxiv.org/html/2606.11480#bib.bib3); Gholizade et al., [2026](https://arxiv.org/html/2606.11480#bib.bib4)).

A central challenge in FCL is that practical deployments are constrained not only by privacy and catastrophic forgetting, but also by the resources required to keep learning. Recent benchmarking of resource-constrained FCL shows that existing methods often assume unrestricted training overhead and degrade substantially when memory buffer, computational budget, communication rounds, and label rate are limited (Li et al., [2026](https://arxiv.org/html/2606.11480#bib.bib15)). This motivates an FCL design that treats communication, computation, memory, and label availability as primary constraints rather than secondary implementation details.
Such a method should avoid repeatedly updating large client models, avoid storing or generating old data, remain robust under non-IID client partitions, and still exploit useful information from sparsely labeled streams.

Existing FCL methods address catastrophic forgetting through several mechanisms, but most remain optimization-centric. Yoon et al. ([2021](https://arxiv.org/html/2606.11480#bib.bib3)) decompose model parameters into global and task-adaptive components for weighted inter-client transfer. Dong et al. ([2023](https://arxiv.org/html/2606.11480#bib.bib13)) use global-local forgetting compensation with class-aware losses and relation distillation. Qi et al. ([2023](https://arxiv.org/html/2606.11480#bib.bib10)) and Zhang et al. ([2023b](https://arxiv.org/html/2606.11480#bib.bib9)) rely on generative replay or exemplar-free distillation to preserve old knowledge. Bakman et al. ([2023](https://arxiv.org/html/2606.11480#bib.bib12)) constrain global updates to be orthogonal to previous-task activation subspaces. More recent pretrained-model approaches, such as those of Bagwe et al. ([2023](https://arxiv.org/html/2606.11480#bib.bib32)) and Guo et al. ([2024](https://arxiv.org/html/2606.11480#bib.bib33)), reduce trainable parameters through prompts or LoRA modules. These methods improve continual accuracy, but they still require iterative local optimization, repeated communication rounds, auxiliary replay or distillation mechanisms, task-specific modules, or full supervision. Moreover, when the global model is updated on non-IID client data, local updates can induce feature drift across clients, so that the same client's data map to different representations after other clients update the global model (Venkatesha et al., [2022](https://arxiv.org/html/2606.11480#bib.bib16)).

Analytic learning offers a different approach.
Instead of repeatedly optimizing model parameters with gradients, analytic methods freeze a feature extractor and update the classifier from feature statistics in closed form. Zhuang et al. ([2022](https://arxiv.org/html/2606.11480#bib.bib39)) show that recursive analytic updates can match the performance of joint training in centralized continual learning without storing past samples. Fanì et al. ([2024b](https://arxiv.org/html/2606.11480#bib.bib40)) and Fanì et al. ([2024a](https://arxiv.org/html/2606.11480#bib.bib41)) bring closed-form ridge classifiers to federated learning, showing that additive statistics can reduce the sensitivity of federated heads to client partitioning. Tang et al. ([2025b](https://arxiv.org/html/2606.11480#bib.bib42)) extend this idea to FCL, replacing gradient-based updates with analytic aggregation over frozen features. Guan et al. ([2026](https://arxiv.org/html/2606.11480#bib.bib38)) further show that feature statistics can be aggregated spatially across clients and temporally across tasks, giving a closed-form classifier update for FCL. These works establish that analytic statistics are a powerful alternative to iterative FCL. However, they expose a key accuracy–resource tradeoff: stronger analytic classifiers depend on second-order feature statistics, but the full Gram matrix grows quadratically with feature dimension.

This tradeoff becomes sharper when using high-dimensional random features. McDonnell et al. ([2023](https://arxiv.org/html/2606.11480#bib.bib18)) show that frozen nonlinear random projections can significantly improve accuracy by increasing feature separability and enabling second-order prototype decorrelation. Peng et al. ([2024](https://arxiv.org/html/2606.11480#bib.bib43)) further show that random-feature matrices can be ill-conditioned, and that truncated SVD improves stability for centralized continual learning.
However, directly applying a random-feature analytic pipeline to a federated setting is expensive: if clients upload full second-order Gram matrices, communication grows as $M^{2}$ in the random-feature dimension $M$. Communication-efficient alternatives reduce this cost but typically estimate the second-order structure. For example, Guan et al. ([2026](https://arxiv.org/html/2606.11480#bib.bib38)) propose STSA-E, which estimates the global Gram matrix from first-order label-feature statistics and label counts, and uses dummy clients when the number of real clients is small. Such estimators improve communication, but they do not directly preserve the dominant feature-feature directions of the Gram matrix.

We propose FedRAN, a resource-aware analytic framework for FCL. FedRAN keeps the backbone frozen and replaces iterative client training with compact random-feature statistics. Each client computes a truncated-SVD summary of its local random-feature Gram matrix. The server then performs a two-level QR-SVD subspace merge. This produces a rank-bounded global spectral state that approximates the dominant second-order geometry without transmitting a full $M \times M$ Gram matrix. To address sparse labels, FedRAN further introduces a prototype-based pseudo-labeling variant, FedRAN-SSL, allowing high-confidence unlabeled samples to contribute to the analytic update.

FedRAN therefore occupies a distinct point in the FCL design space. As illustrated in Table [1](https://arxiv.org/html/2606.11480#S1.T1), compared with gradient-based FCL, it avoids local backpropagation, repeated model exchange, and trainable representation drift. Compared with analytic FCL, it avoids full second-order communication. Compared with first-order or plug-in statistic methods, it preserves dominant Gram directions rather than estimating the Gram matrix solely from class-wise summaries.
Compared with centralized low-rank analytic CL, it introduces the missing federated component: a spatial-temporal QR-SVD merge that turns local low-rank random-feature summaries into a global continual classifier state.

The major contributions of our work include:

- We propose FedRAN, a resource-aware analytic FCL framework that replaces iterative client training with random-feature statistics and closed-form classifier updates.
- We design a two-level QR-SVD subspace merge of low-rank truncated-SVD random-feature Gram summaries, enabling spatial aggregation across non-IID clients and temporal aggregation across class-incremental tasks without transmitting full second-order matrices.
- We formulate FedRAN as a subspace-constrained ridge classifier and provide deterministic bounds relating its Gram approximation, classifier approximation, and prediction-score stability to the retained spectral subspace.
- We extend FedRAN to label-scarce streams using prototype-based pseudo-labeling, enabling confident unlabeled samples to contribute to the analytic label-feature statistic.
- We evaluate FedRAN on CIFAR-100, ImageNet-R, and VTAB datasets under non-IID FCL settings. FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6–121.8$\times$ less per-client communication than representative optimization-based FCL, and is 190.3$\times$ faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points.

Figure 2. Motivation for resource-constrained FCL. Left and middle: sequential non-IID client updates of trainable baselines increase feature drift on a fixed Client-1 Task-2 reference set and reduce test accuracy, while FedRAN keeps the representation fixed.
Right: per-client communication for iterative methods grows with communication rounds and exchanged model size, whereas FedRAN uses a fixed one-shot statistic upload per task; the inset zooms in on this fixed-upload regime.

## 2. Preliminaries

This section introduces the FCL setting, defines the evaluation metrics, and formalizes the resource constraints and problem setting for this work.

### 2.1. Class-Incremental FCL Setting

We consider $K$ clients learning over a sequence of tasks $\{\mathcal{T}_t\}_{t=1}^{T}$ in a class-incremental FCL setting. At task $t$, client $k$ has a private local stream $\mathcal{D}_{k,t} = \mathcal{D}^{\ell}_{k,t} \cup \mathcal{D}^{u}_{k,t}$, where $\mathcal{D}^{\ell}_{k,t} = \{(x_{k,t,i}, y_{k,t,i})\}_{i=1}^{n^{\ell}_{k,t}}$ is the labeled subset and $\mathcal{D}^{u}_{k,t} = \{x_{k,t,i}\}_{i=1}^{n^{u}_{k,t}}$ is the unlabeled subset; in the fully supervised setting $n^{u}_{k,t} = 0$. Raw samples remain on the client and are never centralized. Each task introduces a new set of classes $\mathcal{C}_t^{\mathrm{new}}$, and $\mathcal{C}_{1:t} = \bigcup_{\tau=1}^{t} \mathcal{C}_{\tau}^{\mathrm{new}}$ represents all classes observed up to task $t$, with $C_t = |\mathcal{C}_{1:t}|$. After completing task $t$, the global predictor must classify samples from any class in $\mathcal{C}_{1:t}$ without receiving the task identity at inference time. We use $X_{k,t}$ to denote the local inputs available to client $k$ at task $t$. We use $Y_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times C_t}$ to denote the one-hot label matrix used by the learning algorithm, where $\tilde{n}_{k,t}$ is the number of labeled or accepted pseudo-labeled samples used in the update.
When new classes arrive, previously accumulated label-dependent matrices are padded with zero columns so that their width matches $C_t$. Clients may have different sample counts, class coverage, and label availability. We simulate label-skewed non-IID partitions using a Dirichlet distribution with concentration parameter $\beta$, where smaller $\beta$ corresponds to stronger class imbalance across clients (Hsu et al., [2019](https://arxiv.org/html/2606.11480#bib.bib22)).

### 2.2. Accuracy Metrics

Let $\psi_t$ denote the global predictor after completing task $t$, and let $\mathcal{E}_{1:t}$ denote the test set over all classes observed up to task $t$. The accuracy after task $t$ is $A_t = \frac{1}{|\mathcal{E}_{1:t}|} \sum_{(x,y) \in \mathcal{E}_{1:t}} \mathbf{1}\{\psi_t(x) = y\}$. We report the final accuracy $A_T$ and the average accuracy $A_{\mathrm{avg}} = \frac{1}{T} \sum_{t=1}^{T} A_t$, which measures performance across the full task stream.

### 2.3. Resource Constraints and Problem Setting

We characterize each FCL algorithm $\mathcal{A}$ using communication, computation, label rate, and feature drift.

**Communication.** Communication is the maximum upload by any client for a single task, accumulated over all rounds used for that task: $\mathrm{Comm}(\mathcal{A}) = \max_{k,t}\, \mathrm{bytes}^{\mathrm{upload}}_{k,t}(\mathcal{A})$.

**Computation.** Computation is measured by the wall-clock time required to incorporate a task. Let $T^{\mathrm{client}}_{k,t}(\mathcal{A})$ be the local model update time for client $k$ at task $t$, and let $T^{\mathrm{server}}_{t}(\mathcal{A})$ be the corresponding server-side aggregation and update time.
We define

$$\mathrm{Time}(\mathcal{A}) = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{1}{K} \sum_{k=1}^{K} T^{\mathrm{client}}_{k,t}(\mathcal{A}) + T^{\mathrm{server}}_{t}(\mathcal{A}) \right). \tag{1}$$

**Label rate.** For label-scarce streams, the local label rate is $\rho_{k,t} = |\mathcal{D}^{\ell}_{k,t}| / (|\mathcal{D}^{\ell}_{k,t}| + |\mathcal{D}^{u}_{k,t}|)$; a smaller $\rho_{k,t}$ means that fewer local samples have ground-truth labels at task $t$.

**Feature drift.** For methods that update a shared feature extractor, feature drift measures how much the representation of a fixed reference set changes after later updates (Venkatesha et al., [2022](https://arxiv.org/html/2606.11480#bib.bib16)). Let $g^{(s)}(\cdot)$ be the feature extractor after update stage $s$, and let $g^{(0)}(\cdot)$ be the reference extractor. For a reference set $\mathcal{R}_{k,\tau}$ from client $k$ and task $\tau$, we define

$$\mathrm{Drift}_{k,\tau}(s) = \frac{1}{|\mathcal{R}_{k,\tau}|} \sum_{x \in \mathcal{R}_{k,\tau}} \left\| g^{(s)}(x) - g^{(0)}(x) \right\|_2. \tag{2}$$

For methods with a fixed feature extractor, this quantity is zero.

**Problem setting.** Given client streams $\{\mathcal{D}_{k,t}\}_{k=1,t=1}^{K,T}$ and label rates $\{\rho_{k,t}\}$, an FCL algorithm must output a predictor $\psi_t$ after each task while keeping raw data local and performing inference over $\mathcal{C}_{1:t}$ without task identity. We evaluate methods by their average and final accuracy, $A_{\mathrm{avg}}$ and $A_T$, together with the resource constraints above.
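The quantities defined in Sections 2.2 and 2.3 are simple averages, and a small sketch may make the indexing concrete. The code below is illustrative only: the function names and the nested-list layout (`client_times[t][k]`, `server_times[t]`) are assumptions made for this example and are not taken from the FedRAN code base.

```python
import math

def average_accuracy(acc_per_task):
    """A_avg = (1/T) * sum_t A_t; the last entry of the list is the final accuracy A_T."""
    return sum(acc_per_task) / len(acc_per_task)

def label_rate(n_labeled, n_unlabeled):
    """rho_{k,t} = |D^l| / (|D^l| + |D^u|) for one client and one task."""
    return n_labeled / (n_labeled + n_unlabeled)

def mean_task_time(client_times, server_times):
    """Eq. (1): average over tasks of (mean client time + server time), in seconds.

    client_times[t][k] is the update time of client k at task t (assumed layout);
    server_times[t] is the server aggregation time at task t.
    """
    per_task = [sum(c) / len(c) + s for c, s in zip(client_times, server_times)]
    return sum(per_task) / len(per_task)

def feature_drift(ref_feats, new_feats):
    """Eq. (2): mean Euclidean distance between g^(s)(x) and g^(0)(x) over a reference set."""
    dists = [math.dist(a, b) for a, b in zip(new_feats, ref_feats)]
    return sum(dists) / len(dists)

# Toy numbers, chosen only to exercise the formulas.
acc = average_accuracy([0.9, 0.8, 0.7])
rho = label_rate(20, 80)
time_per_task = mean_task_time([[1.0, 3.0], [2.0, 2.0]], [0.5, 1.5])
drift = feature_drift([[0.0, 0.0], [1.0, 1.0]], [[3.0, 4.0], [1.0, 1.0]])
```

Passing the same features as both arguments of `feature_drift` returns zero, which mirrors the statement that drift vanishes for a fixed feature extractor.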
A target deployment may impose budgets such as $\Gamma_{\mathrm{comm}}$, $\Gamma_{\mathrm{time}}$, and $\Gamma_{\mathrm{drift}}$; within such budgets, the goal is to maintain high continual accuracy with low communication, fast task updates, stable representations, and effective use of the available labels.

## 3. Motivation

Federated continual learning must update a global classifier using private, non-IID task streams under tight limits on communication, computation, and supervision, while keeping the learned representation stable. Recent resource-constrained FCL benchmarks show that existing methods degrade when computational budget and label rate are restricted (Li et al., [2026](https://arxiv.org/html/2606.11480#bib.bib15)). We examine four aspects that shape FedRAN's design: the cost of repeated updates, representation stability, retaining second-order information under a communication budget, and learning under sparse labels.

**Repeated update cost.** Most FCL methods remain optimization-centric: clients repeatedly adapt a model, prompt, adapter, or auxiliary component, and the server aggregates these updates over communication rounds (McMahan et al., [2017a](https://arxiv.org/html/2606.11480#bib.bib1); Yoon et al., [2021](https://arxiv.org/html/2606.11480#bib.bib3); Dong et al., [2023](https://arxiv.org/html/2606.11480#bib.bib13); Bagwe et al., [2023](https://arxiv.org/html/2606.11480#bib.bib32); Guo et al., [2024](https://arxiv.org/html/2606.11480#bib.bib33)). This couples communication to the number of rounds and exchanged parameters, while local computation grows with backpropagation, replay, distillation, or generation mechanisms (Qi et al., [2023](https://arxiv.org/html/2606.11480#bib.bib10); Zhang et al., [2023b](https://arxiv.org/html/2606.11480#bib.bib9); Bakman et al., [2023](https://arxiv.org/html/2606.11480#bib.bib12)).
The right panel of Fig. [2](https://arxiv.org/html/2606.11480#S1.F2) shows this scaling for ResNet-18 and ViT-B/16. This motivates replacing repeated local optimization with forward-only statistic construction.

**Representation stability.** Updating a shared trainable backbone on the server using the non-IID client streams can also move the feature space itself. Later client updates may change the representation of earlier clients' data even when their raw data are unchanged (Venkatesha et al., [2022](https://arxiv.org/html/2606.11480#bib.bib16)). The left and middle panels of Fig. [2](https://arxiv.org/html/2606.11480#S1.F2) show this effect: as feature drift increases on Client-1 Task-2 data, the corresponding test accuracy of optimization-based baselines drops. This motivates freezing the backbone and performing continual adaptation via classifier state.

**Second-order information under a budget.** Analytic FCL avoids iterative training by aggregating feature statistics and solving the classifier in closed form (Fanì et al., [2024b](https://arxiv.org/html/2606.11480#bib.bib40); Tang et al., [2025b](https://arxiv.org/html/2606.11480#bib.bib42); Guan et al., [2026](https://arxiv.org/html/2606.11480#bib.bib38)). However, with high-dimensional random features, useful second-order information is captured by an $M \times M$ feature-feature Gram matrix, whose cost is quadratic in the random-feature dimension $M$ (McDonnell et al., [2023](https://arxiv.org/html/2606.11480#bib.bib18); Peng et al., [2024](https://arxiv.org/html/2606.11480#bib.bib43)). A cheaper first-order route estimates this Gram matrix from class-wise feature sums and counts, but the second-order structure then becomes a statistical estimate: its variance grows quadratically with the task sample count, is sensitive to small client counts, and is unbiased only under an i.i.d. assumption (Guan et al., [2026](https://arxiv.org/html/2606.11480#bib.bib38)).
FedRAN instead transmits a rank-$r$ summary of the actual second-order statistic, so its approximation error is a deterministic quantity that additional communication can directly reduce, rather than an estimation variance inherent to reconstructing second-order structure from first-order statistics (Theorem [1](https://arxiv.org/html/2606.11480#Thmfedranthm1)).

**Sparse labels.** Resource constraints also include supervision. Many FCL methods assume fully labeled client streams, but labels may be delayed, expensive, or unavailable in deployment. If analytic updates use only labeled samples, unlabeled data cannot contribute to the label-feature statistic. Federated semi-supervised learning studies this sparse-label regime (Jeong et al., [2020](https://arxiv.org/html/2606.11480#bib.bib46)), and prototype-based federated learning provides a lightweight way to summarize class structure in feature space (Tan et al., [2022](https://arxiv.org/html/2606.11480#bib.bib47)). Since FedRAN keeps the backbone fixed, prototype-based pseudo-labeling is a natural way to use confident unlabeled samples without updating the backbone.

**Design implication.** These aspects define the design target for FedRAN: client updates should be forward-only, the backbone should remain fixed, second-order information should be retained without full Gram matrix transmission, and unlabeled samples should be usable when labels are scarce. FedRAN addresses this through compact random-feature statistics, spatial-temporal QR-SVD aggregation, and prototype-based pseudo-labeling.

## 4. System Design

Figure 3. Overview of the proposed FedRAN system. Each client extracts frozen pretrained features, optionally assigns pseudo-labels to confident unlabeled samples, applies a fixed random projection, and computes compact local statistics.
The server aggregates label-feature statistics exactly, merges low-rank feature-feature summaries using a two-level QR-SVD update, and obtains the classifier through a closed-form ridge solve in the retained subspace.

FedRAN is a resource-aware federated analytic continual learning system for the setting defined in Section [2](https://arxiv.org/html/2606.11480#S2). At each task, clients keep raw data local, compute compact random-feature statistics, and communicate these statistics to the server. The server maintains exact class-wise label-feature statistics, merges low-rank spectral summaries of feature-feature statistics, and computes the classifier through a closed-form analytic update.

**Setting.** We use the spatial-temporal notation from Section [2.1](https://arxiv.org/html/2606.11480#S2.SS1): at task $t$, client $k$ holds local data $\mathcal{D}_{k,t}$, the current class set has size $C_t$, and $Y_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times C_t}$ denotes the one-hot label matrix used by the algorithm. FedRAN proceeds in four steps. First, each client maps local samples to a fixed nonlinear random-feature space. Second, the client computes a label-feature statistic and a low-rank spectral summary of its feature-feature statistic. Third, the server merges these summaries spatially across clients and temporally across tasks. Finally, the server solves a ridge classifier in closed form.

**Analytic reference objective.** Let $H_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times M}$ denote the random-feature matrix produced by client $k$ at task $t$, and let $H_{1:t}$ and $Y_{1:t}$ denote the row-wise concatenation of all random features and labels from clients $1{:}K$ and tasks $1{:}t$.
If all random features were centralized, the full analytic classifier would solve

$$W^{\star}_{1:t} = \arg\min_{W \in \mathbb{R}^{M \times C_t}} \|H_{1:t} W - Y_{1:t}\|_F^2 + \lambda \|W\|_F^2, \tag{3}$$

with closed-form solution

$$W^{\star}_{1:t} = \left(G^{\star}_{1:t} + \lambda I_M\right)^{-1} B_{1:t}, \qquad G^{\star}_{1:t} = H_{1:t}^{\top} H_{1:t}, \quad B_{1:t} = H_{1:t}^{\top} Y_{1:t}. \tag{4}$$

Here, $G^{\star}_{1:t}$ is the full feature-feature Gram statistic and $B_{1:t}$ is the label-feature statistic. FedRAN keeps $B_{1:t}$ exact, but avoids transmitting the full $M \times M$ Gram matrix by replacing $G^{\star}_{1:t}$ with a merged low-rank spectral summary.

Algorithm 1. FedRAN: Federated Analytic Continual Learning

- 1: Clients $k \in \{1, \ldots, K\}$, task stream $\{\mathcal{T}_t\}_{t=1}^{T}$, frozen backbone $f(\cdot)$ with feature size $d$, projection $P \in \mathbb{R}^{d \times M}$, retained rank $r$, ridge parameter $\lambda$, SSL threshold $\tau$
- 2: Server initializes $(V_{1:0}, \sigma_{1:0}) \leftarrow \emptyset$ and $B_{1:0} \leftarrow \emptyset$
- 3: **for** $t = 1, \ldots, T$ **do**
  - 4: Let $C_t$ be the number of classes observed up to task $t$
  - 5: *Client-side local summarization*
  - 6: **for each** client $k$ **in parallel do**
    - 7: Extract frozen features for labeled and unlabeled data: $Z^{\ell}_{k,t} = f(X^{\ell}_{k,t})$, $Z^{u}_{k,t} = f(X^{u}_{k,t})$
    - 8: Optionally augment labels using prototype pseudo-labeling: $(\widetilde{Z}_{k,t}, \widetilde{y}_{k,t}) \leftarrow \mathrm{PseudoLabel}(Z^{\ell}_{k,t}, y^{\ell}_{k,t}, Z^{u}_{k,t}, \tau)$
    - 9: Compute random features $H_{k,t} \leftarrow \mathrm{ReLU}(\widetilde{Z}_{k,t} P)$, where $H_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times M}$
    - 10: Form one-hot labels $Y_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times C_t}$ from $\widetilde{y}_{k,t}$
    - 11: Compute truncated SVD summary $(V_{k,t}, \sigma_{k,t}) \leftarrow \mathrm{ClientSVD}(H_{k,t}, r)$
    - 12: Compute label-feature statistic $B_{k,t} \leftarrow H_{k,t}^{\top} Y_{k,t}$
    - 13: Send $(V_{k,t}, \sigma_{k,t}, B_{k,t})$ to the server
  - 14: *Server-side spatial-temporal aggregation*
  - 15: Initialize task spectral summary $(V_t, \sigma_t) \leftarrow \emptyset$ and task label statistic $B_t \leftarrow 0 \in \mathbb{R}^{M \times C_t}$
  - 16: **for each** received summary $(V_{k,t}, \sigma_{k,t}, B_{k,t})$ **do**
    - 17: $(V_t, \sigma_t) \leftarrow \mathrm{MergeSVD}((V_t, \sigma_t), (V_{k,t}, \sigma_{k,t}), r)$
    - 18: $B_t \leftarrow B_t + B_{k,t}$
  - 19: Pad $B_{1:t-1}$ with zero columns if needed, and set $B_{1:t} \leftarrow B_{1:t-1} + B_t$
  - 20: $(V_{1:t}, \sigma_{1:t}) \leftarrow \mathrm{MergeSVD}((V_{1:t-1}, \sigma_{1:t-1}), (V_t, \sigma_t), r)$
  - 21: *Analytic classifier update*
  - 22: $\widetilde{W}_{1:t} \leftarrow \left(\mathrm{diag}(\sigma_{1:t}^{2}) + \lambda I_r\right)^{-1} V_{1:t}^{\top} B_{1:t}$
  - 23: $W_{1:t} \leftarrow V_{1:t} \widetilde{W}_{1:t}$
- 24: **return** Predictor $x \mapsto \arg\max_{c \in \{1, \ldots, C_T\}} \left[\mathrm{ReLU}(f(x) P)\, W_{1:T}\right]_c$

### 4.1. System Overview

Figure [3](https://arxiv.org/html/2606.11480#S4.F3) illustrates the FedRAN system. During each task, clients receive local data while keeping raw samples on-device. Each client first extracts features using a frozen pretrained backbone, applies a shared random projection followed by a ReLU nonlinearity, and constructs local random-feature statistics. Under label scarcity, FedRAN-SSL assigns pseudo-labels to confident unlabeled samples using prototype-based cosine similarity before computing the label-feature statistic.

**What the server needs.** Analytic ridge classification requires two sufficient statistics: a feature-feature statistic $G = H^{\top} H$ and a label-feature statistic $B = H^{\top} Y$. These statistics are additive across clients and tasks when the feature map is fixed, which makes them naturally suited to federated continual learning. However, the full Gram matrix has size $M \times M$, which becomes prohibitive when the random-feature dimension $M$ is large.

**What FedRAN communicates.** Instead of transmitting the full local Gram matrix $G_{k,t} = H_{k,t}^{\top} H_{k,t}$, each client computes a truncated SVD of $H_{k,t}$ and sends a low-rank spectral summary $(V_{k,t}, \sigma_{k,t})$ together with the exact label-feature statistic $B_{k,t} = H_{k,t}^{\top} Y_{k,t}$. The server merges the spectral summaries spatially across clients and temporally across tasks using a QR-SVD update, producing a bounded global subspace $(V_{1:t}, \sigma_{1:t})$. The final classifier is then solved analytically in this retained subspace.

### 4.2. Client-Side Feature Extraction

**Frozen feature extraction.** For task $t$, client $k$ receives local samples $X_{k,t}$.
FedRAN uses a frozen pretrained backbone $f(\cdot)$ to extract features

$$Z_{k,t} = f(X_{k,t}), \qquad Z_{k,t} \in \mathbb{R}^{n_{k,t} \times d}, \tag{5}$$

where $d$ is the backbone feature dimension. The backbone remains fixed throughout the task stream; clients do not update $f(\cdot)$ during federated learning. This avoids client-induced feature drift from repeated updates of a shared feature extractor (Venkatesha et al., [2022](https://arxiv.org/html/2606.11480#bib.bib16)) and removes client-side backpropagation through the backbone.

Shared random-feature map. After feature extraction, each client applies a shared random projection followed by ReLU:

$$H_{k,t} = \mathrm{ReLU}(Z_{k,t}P), \qquad P \in \mathbb{R}^{d \times M}, \quad H_{k,t} \in \mathbb{R}^{n_{k,t} \times M}. \tag{6}$$

The projection matrix $P$ is generated once and shared across clients and tasks, so all clients compute statistics over aligned random-feature dimensions. The random projection expands features from $d$ to $M$ dimensions without introducing trainable client-side parameters. The ReLU nonlinearity creates nonlinear random-feature interactions that improve linear separability for the downstream analytic classifier (Rahimi and Recht, [2007](https://arxiv.org/html/2606.11480#bib.bib17); McDonnell et al., [2023](https://arxiv.org/html/2606.11480#bib.bib18)).

Client computation. For task $t$, client $k$ performs one forward pass through the frozen backbone, a matrix multiplication $Z_{k,t}P$ with cost $O(n_{k,t} d M)$, and a ReLU activation over $n_{k,t} M$ entries. Unlike gradient-based FCL, clients do not perform multiple local epochs, compute backbone gradients, or transmit model updates.
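The random-feature map in Eq. (6) can be illustrated with a minimal NumPy sketch. The helper names `make_projection` and `random_features` are hypothetical, the Gaussian distribution of $P$ is an assumption of this sketch (this section does not state how $P$ is drawn), and the array `Z` is a synthetic stand-in for the frozen backbone output $f(X_{k,t})$.

```python
import numpy as np

def make_projection(d, M, seed=0):
    """Draw the shared projection P (d x M) once; every client reuses it.

    Gaussian entries and a fixed seed are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((d, M))

def random_features(Z, P):
    """Eq. (6): H = ReLU(Z P), with Z of shape (n, d) and P of shape (d, M)."""
    return np.maximum(Z @ P, 0.0)

# Synthetic stand-in for backbone features Z = f(X): n = 4 samples, d = 8.
rng = np.random.default_rng(1)
Z = rng.standard_normal((4, 8))
P = make_projection(d=8, M=32, seed=0)
H = random_features(Z, P)
print(H.shape)  # (4, 32)
```

Because every client rebuilds the same $P$ from the same seed, the columns of $H_{k,t}$ are aligned across clients, which is what allows the statistics to be summed later.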
### 4.3. Client-Side Low-Rank Statistical Summarization

Exact local statistics. Given the random features $H_{k,t}$ and labels $Y_{k,t}$, client $k$ can form the exact local feature-feature and label-feature statistics

$$G_{k,t} = H_{k,t}^{\top} H_{k,t} \in \mathbb{R}^{M \times M}, \qquad B_{k,t} = H_{k,t}^{\top} Y_{k,t} \in \mathbb{R}^{M \times C_t}. \tag{7}$$

The $c$-th column of $B_{k,t}$ is the sum of the random features of the samples assigned to class $c$ at client $k$ and task $t$. Thus, $B_{k,t}$ is an unnormalized class-wise prototype statistic over random features.

###### Proposition 1 (Exact spatial-temporal additivity).

Let $H_{1:t}$ and $Y_{1:t}$ denote the row-wise concatenation of all random features and labels from clients $1{:}K$ and tasks $1{:}t$. Then

$$G^{\star}_{1:t} = H_{1:t}^{\top} H_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} G_{k,\tau}, \qquad B_{1:t} = H_{1:t}^{\top} Y_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} B_{k,\tau}. \tag{8}$$

Consequently, if the full $G_{k,t}$ and $B_{k,t}$ were uploaded, the server would recover the same ridge solution as centralized training on all data seen up to task $t$.

Why low-rank summaries are needed. Although $G_{k,t}$ is additive, transmitting it is expensive: its communication cost is $M^2$ floating-point values per client per task. For $M = 10{,}000$, this is $10^8$ values, or roughly 400 MB in fp32. FedRAN therefore keeps $B_{k,t}$ exact but compresses $G_{k,t}$ through the dominant singular directions of $H_{k,t}$.
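Proposition 1 and the cost estimate above can be checked numerically. The sketch below uses small synthetic matrices in place of real client features. The helper names `local_stats` and `full_gram_megabytes` are hypothetical, and 1 MB is taken as $10^6$ bytes.

```python
import numpy as np

rng = np.random.default_rng(0)
M, C = 16, 3  # toy random-feature dimension and number of classes

def local_stats(n):
    """Return (G, B, H, Y) for one synthetic client/task pair with n samples."""
    H = np.maximum(rng.standard_normal((n, M)), 0.0)  # stand-in for ReLU features
    Y = np.eye(C)[rng.integers(0, C, size=n)]          # one-hot labels
    return H.T @ H, H.T @ Y, H, Y

parts = [local_stats(n) for n in (5, 7, 4)]

# What the server would accumulate from exact uploads.
G_sum = sum(p[0] for p in parts)
B_sum = sum(p[1] for p in parts)

# Centralized view: statistics of the row-wise concatenation (Eq. 8).
H_all = np.vstack([p[2] for p in parts])
Y_all = np.vstack([p[3] for p in parts])
assert np.allclose(G_sum, H_all.T @ H_all)
assert np.allclose(B_sum, H_all.T @ Y_all)

def full_gram_megabytes(M, bytes_per_value=4):
    """Upload size of an M x M Gram matrix in MB (1 MB = 10**6 bytes)."""
    return M * M * bytes_per_value / 1e6

print(full_gram_megabytes(10_000))  # 400.0
```

The additivity holds for any split of the rows, so the order in which clients or tasks arrive does not affect the exact statistics.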
Truncated SVD summary. Each client computes a rank-$r_{k,t}$ truncated SVD of its random-feature matrix,

$$H_{k,t} \approx U_{k,t}\, \mathrm{diag}(\sigma_{k,t})\, V_{k,t}^{\top}, \qquad V_{k,t} \in \mathbb{R}^{M \times r_{k,t}}, \quad \sigma_{k,t} \in \mathbb{R}^{r_{k,t}}, \tag{9}$$

where $r_{k,t} \leq \min\{r, n_{k,t}, M\}$. This induces a rank-$r_{k,t}$ approximation to the local Gram matrix:

$$\widetilde{G}_{k,t} = V_{k,t}\, \mathrm{diag}(\sigma_{k,t}^{2})\, V_{k,t}^{\top} \approx H_{k,t}^{\top} H_{k,t} = G_{k,t}. \tag{10}$$

The client sends $(V_{k,t}, \sigma_{k,t}, B_{k,t})$ to the server. The upload cost becomes

$$M r_{k,t} + r_{k,t} + M C_t \tag{11}$$

values instead of $M^2 + M C_t$. For a per-message budget of $\mathcal{B}$ bytes and $b$ bytes per scalar, a feasible rank satisfies

$$r_{k,t} \leq \left\lfloor \frac{\mathcal{B}/b - M C_t}{M + 1} \right\rfloor. \tag{12}$$

### 4.4. Prototype-Based Pseudo-Labeling

Need for pseudo-labeling. The statistic $B_{k,t} = H_{k,t}^{\top} Y_{k,t}$ requires labels. In resource-constrained FCL, labels may be sparse, delayed, or unavailable for a large fraction of incoming client samples (Li et al., [2026](https://arxiv.org/html/2606.11480#bib.bib15)). If only labeled data are used, the analytic update may under-utilize the local stream. FedRAN-SSL therefore assigns pseudo-labels to high-confidence unlabeled samples before constructing $B_{k,t}$.

Local prototypes in frozen feature space. Let $Z^{\ell}_{k,t}$ and $y^{\ell}_{k,t}$ denote the labeled backbone features and labels at client $k$, and let $Z^{u}_{k,t}$ denote unlabeled backbone features.
For each class $c$ present in the labeled subset, the client computes

$$p_{k,t,c} = \frac{1}{|\mathcal{I}_{k,t,c}|} \sum_{i \in \mathcal{I}_{k,t,c}} Z^{\ell}_{k,t}[i], \qquad \mathcal{I}_{k,t,c} = \{i : y^{\ell}_{k,t}[i] = c\}, \tag{13}$$

followed by $\ell_2$ normalization $p_{k,t,c} \leftarrow p_{k,t,c} / \|p_{k,t,c}\|_2$.

Confidence filtering. For each unlabeled feature $u \in Z^{u}_{k,t}$, FedRAN computes $\bar{u} = u / \|u\|_2$ and assigns the candidate pseudo-label

$$c^{\star}(u) = \arg\max_{c}\ \bar{u}^{\top} p_{k,t,c}. \tag{14}$$

The pseudo-label is accepted only if

$$\bar{u}^{\top} p_{k,t,c^{\star}(u)} \geq \tau, \tag{15}$$

where $\tau$ is a confidence threshold. Accepted pseudo-labeled samples are combined with labeled samples to form $(\widetilde{Z}_{k,t}, \widetilde{y}_{k,t})$, and the corresponding random features and one-hot labels are then used in Eq. ([6](https://arxiv.org/html/2606.11480#S4.E6)) and Eq. ([7](https://arxiv.org/html/2606.11480#S4.E7)). The theoretical statements below condition on the label matrix actually used by the algorithm; when SSL is enabled, this is the augmented label matrix after confidence filtering.

### 4.5. Server-Side QR-SVD Aggregation

Two-level aggregation. The server maintains two global objects: the exact accumulated label-feature statistic $B_{1:t}$ and the low-rank spectral summary $(V_{1:t}, \sigma_{1:t})$ of the accumulated Gram matrix. Aggregation occurs in two levels. First, client summaries from the current task are merged spatially into $(V_t, \sigma_t)$. Second, this task-level summary is merged temporally into the global summary $(V_{1:t}, \sigma_{1:t})$.
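The prototype-based pseudo-labeling of Section 4.4 (Eqs. 13 to 15) can be sketched in a few lines of NumPy. This is a minimal illustration on hand-made two-dimensional features, not the authors' implementation. The function name `pseudo_label` is hypothetical, and the sketch assumes every unlabeled feature has nonzero norm.

```python
import numpy as np

def pseudo_label(Z_lab, y_lab, Z_unl, tau=0.5):
    """Return (indices of accepted unlabeled rows, their pseudo-labels)."""
    classes = np.unique(y_lab)
    # Eq. (13): class means of labeled features, then L2 normalization.
    protos = np.stack([Z_lab[y_lab == c].mean(axis=0) for c in classes])
    protos /= np.linalg.norm(protos, axis=1, keepdims=True)
    # Normalize unlabeled features (assumes no all-zero rows).
    U = Z_unl / np.linalg.norm(Z_unl, axis=1, keepdims=True)
    sims = U @ protos.T              # cosine similarities, (n_unlabeled, n_classes)
    best = sims.argmax(axis=1)       # Eq. (14): most similar prototype
    keep = sims.max(axis=1) >= tau   # Eq. (15): confidence filter
    return np.flatnonzero(keep), classes[best[keep]]

Z_lab = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_lab = np.array([0, 0, 1, 1])
Z_unl = np.array([[2.0, 0.1], [0.1, 3.0], [-1.0, -1.0]])
idx, labels = pseudo_label(Z_lab, y_lab, Z_unl, tau=0.5)
print(idx, labels)  # [0 1] [0 1]
```

The third unlabeled point has negative cosine similarity to both prototypes, so it falls below $\tau$ and is left out of the augmented set.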
QR-SVD merge. Consider merging two spectral summaries $(V_a, \sigma_a)$ and $(V_b, \sigma_b)$, where $V_a$ and $V_b$ have orthonormal columns and $\sigma_a, \sigma_b$ contain nonnegative singular values in descending order. FedRAN forms

$$A = \left[ V_a\, \mathrm{diag}(\sigma_a),\; V_b\, \mathrm{diag}(\sigma_b) \right] \in \mathbb{R}^{M \times (r_a + r_b)}. \tag{16}$$

The key identity is

$$A A^{\top} = V_a\, \mathrm{diag}(\sigma_a^{2})\, V_a^{\top} + V_b\, \mathrm{diag}(\sigma_b^{2})\, V_b^{\top}. \tag{17}$$

Thus, the unnormalized covariance $A A^{\top}$ is exactly the sum of the two input low-rank Gram approximations. FedRAN then computes

$$A = Q R, \qquad R = U_R\, \mathrm{diag}(\bar{\sigma})\, W_R^{\top}, \tag{18}$$

keeps the top $r$ singular components, and outputs

$$V = Q\, U_R^{(:,1:r)}, \qquad \sigma = \bar{\sigma}_{1:r}. \tag{19}$$

This operation avoids materializing any $M \times M$ matrix. It operates on $A \in \mathbb{R}^{M \times 2r}$ when both inputs have rank $r$, and therefore costs $O(M r^2 + r^3)$ per merge. The same operation is used for spatial client aggregation and temporal task aggregation. Similar QR-SVD subspace merging ideas appear in federated PCA and streaming subspace tracking (Grammenos et al., [2020](https://arxiv.org/html/2606.11480#bib.bib19); Řehůřek, [2011](https://arxiv.org/html/2606.11480#bib.bib20); Eftekhari et al., [2019](https://arxiv.org/html/2606.11480#bib.bib21)); FedRAN adapts this mechanism to random-feature analytic continual learning.

###### Proposition 2 (QR-SVD merge recovers the summed sketch covariance).

Let $A$ be defined as in Eq. ([16](https://arxiv.org/html/2606.11480#S4.E16)), and let $S = A A^{\top}$ have eigenvalues $\lambda_1(S) \geq \lambda_2(S) \geq \cdots \geq 0$. If no rank truncation is applied after the SVD of $R$, then the merged factors $(V, \sigma)$ satisfy

$$V\, \mathrm{diag}(\sigma^{2})\, V^{\top} = A A^{\top} = V_a\, \mathrm{diag}(\sigma_a^{2})\, V_a^{\top} + V_b\, \mathrm{diag}(\sigma_b^{2})\, V_b^{\top}. \tag{20}$$

If only the top $r$ components are retained, then $V\, \mathrm{diag}(\sigma^{2})\, V^{\top} = \mathcal{T}_r(S)$, the best rank-$r$ approximation to $S$ in both spectral and Frobenius norms. Its residual satisfies

$$\|S - \mathcal{T}_r(S)\|_2 = \lambda_{r+1}(S), \qquad \|S - \mathcal{T}_r(S)\|_F = \left( \sum_{i > r} \lambda_i(S)^2 \right)^{1/2}. \tag{21}$$

The proof follows from the QR-SVD factorization and the Eckart–Young–Mirsky theorem; details are given in Appendix [A.5](https://arxiv.org/html/2606.11480#A1.SS5).

Approximation to the true Gram matrix. Let

$$\widetilde{G}_{1:t} = V_{1:t}\, \mathrm{diag}(\sigma_{1:t}^{2})\, V_{1:t}^{\top} \tag{22}$$

be the final FedRAN Gram sketch after all local SVD summaries and QR-SVD merges up to task $t$. The following theorem makes the approximation error explicit in terms of discarded local singular values and discarded merge eigenvalues.

###### Theorem 1 (FedRAN Gram approximation error).

Let

$$G^{\star}_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} G_{k,\tau} \tag{23}$$

be the exact accumulated Gram matrix. For each local random-feature matrix $H_{k,\tau}$, let $s_{k,\tau,1} \geq s_{k,\tau,2} \geq \cdots$ denote its full singular-value spectrum.
For each QR-SVD merge $j$, let $S_j = A_j A_j^{\top}$ denote the untruncated covariance being merged, with eigenvalues $\lambda_{j,1} \geq \lambda_{j,2} \geq \cdots \geq 0$. Then the FedRAN sketch satisfies

$$\|G^{\star}_{1:t} - \widetilde{G}_{1:t}\|_2 \leq \sum_{\tau=1}^{t} \sum_{k=1}^{K} s_{k,\tau,r_{k,\tau}+1}^{2} + \sum_{j \in \mathcal{M}_t} \lambda_{j,r+1} \triangleq \varepsilon_G(t), \tag{24}$$

where $\mathcal{M}_t$ is the set of server-side spatial and temporal QR-SVD merges performed up to task $t$. If the relevant omitted singular value or eigenvalue does not exist, the corresponding term is defined as $0$.

Theorem [1](https://arxiv.org/html/2606.11480#Thmfedranthm1) separates FedRAN's approximation error into two interpretable sources: local truncation at the clients and merge truncation at the server. Increasing the retained rank reduces both terms, at the cost of more communication and server-side computation. The proof and an equivalent discarded-energy form are given in Appendix [A.6](https://arxiv.org/html/2606.11480#A1.SS6).

### 4.6. Analytic Classifier Update

Subspace-constrained ridge objective. After aggregation, the server has the exact accumulated label-feature statistic $B_{1:t}$ and the low-rank spectral summary $(V_{1:t}, \sigma_{1:t})$. A full ridge solve with $G^{\star}_{1:t} \in \mathbb{R}^{M \times M}$ would require materializing and inverting an $M \times M$ matrix.
FedRAN instead constrains the classifier to the retained spectral subspace:

$$W_{1:t} = V_{1:t} \widetilde{W}_{1:t}, \qquad \widetilde{W}_{1:t} \in \mathbb{R}^{r \times C_t}. \tag{25}$$

The corresponding subspace ridge objective is

$$\widetilde{W}_{1:t} = \arg\min_{\widetilde{W} \in \mathbb{R}^{r \times C_t}} \|H_{1:t} V_{1:t} \widetilde{W} - Y_{1:t}\|_F^{2} + \lambda \|\widetilde{W}\|_F^{2}. \tag{26}$$

If the exact Gram matrix were available, the closed-form solution of the above objective would be

$$\widetilde{W}_{1:t}^{\star} = \left( V_{1:t}^{\top} G^{\star}_{1:t} V_{1:t} + \lambda I_r \right)^{-1} V_{1:t}^{\top} B_{1:t}. \tag{27}$$

FedRAN uses the merged spectral sketch

$$\widetilde{G}_{1:t} = V_{1:t}\, \mathrm{diag}(\sigma_{1:t}^{2})\, V_{1:t}^{\top}, \tag{28}$$

for which $V_{1:t}^{\top} \widetilde{G}_{1:t} V_{1:t} = \mathrm{diag}(\sigma_{1:t}^{2})$. The classifier update is therefore

$$\widetilde{W}_{1:t} = \left( \mathrm{diag}(\sigma_{1:t}^{2}) + \lambda I_r \right)^{-1} V_{1:t}^{\top} B_{1:t}, \qquad W_{1:t} = V_{1:t} \widetilde{W}_{1:t}. \tag{29}$$

This update solves ridge regression in the retained subspace and sets the classifier's component in the orthogonal complement of $V_{1:t}$ to zero. As a result, the server only inverts a diagonal matrix rather than the full $M \times M$ matrix.

Weight approximation bound. The following theorem quantifies how close the FedRAN classifier is to the full ridge solution in Eq. ([4](https://arxiv.org/html/2606.11480#S4.E4)).
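The client summary (Eq. 10), the QR-SVD merge (Eqs. 16 to 19), and the classifier update (Eq. 29) can be sketched together in NumPy. This is a minimal illustration on synthetic data under stated assumptions, not the authors' implementation. The function names `client_svd`, `merge_svd`, and `ridge_in_subspace` are hypothetical, and the rank is chosen large enough that nothing is truncated, so the sketch can be compared against the exact ridge solution.

```python
import numpy as np

def client_svd(H, r):
    """Rank-r summary (V, sigma) with H^T H ~ V diag(sigma^2) V^T (Eq. 10)."""
    _, s, Vt = np.linalg.svd(H, full_matrices=False)
    r = min(r, len(s))
    return Vt[:r].T, s[:r]

def merge_svd(Va, sa, Vb, sb, r):
    """QR-SVD merge of two summaries (Eqs. 16-19), truncated to rank r."""
    A = np.hstack([Va * sa, Vb * sb])   # Eq. (16): columns scaled by sigma
    Q, R = np.linalg.qr(A)              # reduced QR, no M x M matrix formed
    U, s, _ = np.linalg.svd(R)
    r = min(r, len(s))
    return Q @ U[:, :r], s[:r]          # Eq. (19)

def ridge_in_subspace(V, sigma, B, lam):
    """Eq. (29): W = V (diag(sigma^2) + lam I)^{-1} V^T B."""
    W_tilde = (V.T @ B) / (sigma**2 + lam)[:, None]
    return V @ W_tilde

rng = np.random.default_rng(0)
M, C, lam, r = 12, 3, 1e-3, 8
H1 = np.maximum(rng.standard_normal((3, M)), 0.0)  # client 1: 3 samples
H2 = np.maximum(rng.standard_normal((4, M)), 0.0)  # client 2: 4 samples
Y1 = np.eye(C)[rng.integers(0, C, 3)]
Y2 = np.eye(C)[rng.integers(0, C, 4)]

V1, s1 = client_svd(H1, r)
V2, s2 = client_svd(H2, r)
V, s = merge_svd(V1, s1, V2, s2, r)

# r is at least the combined rank (3 + 4), so no truncation occurs here and
# the sketch should reproduce the exact Gram matrix and the full ridge solution.
G_exact = H1.T @ H1 + H2.T @ H2
B = H1.T @ Y1 + H2.T @ Y2
W = ridge_in_subspace(V, s, B, lam)
W_full = np.linalg.solve(G_exact + lam * np.eye(M), B)
assert np.allclose((V * s**2) @ V.T, G_exact)
assert np.allclose(W, W_full)
```

With a smaller `r`, both the client summaries and the merge discard singular directions, and the two assertions would in general no longer hold exactly; that gap is what the error bounds in this section quantify.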
The bound has two terms: one from the low-rank Gram sketch and one from restricting the classifier to the retained spectral subspace.

###### Theorem 2 (Approximation to full ridge).

Let $W^{\star}_{1:t} = (G^{\star}_{1:t} + \lambda I_M)^{-1} B_{1:t}$ be the full ridge solution and let $W_{1:t}$ be the FedRAN classifier in Eq. ([29](https://arxiv.org/html/2606.11480#S4.E29)). Let

$$\widetilde{G}_{1:t} = V_{1:t}\, \mathrm{diag}(\sigma_{1:t}^{2})\, V_{1:t}^{\top}, \tag{30}$$

and let $\varepsilon_G(t)$ be the bound defined in Theorem [1](https://arxiv.org/html/2606.11480#Thmfedranthm1). Then

$$\|W^{\star}_{1:t} - W_{1:t}\|_F \leq \frac{\varepsilon_G(t)}{\lambda^{2}} \|B_{1:t}\|_F + \frac{1}{\lambda} \|(I_M - V_{1:t} V_{1:t}^{\top}) B_{1:t}\|_F. \tag{31}$$

The first term in Eq. ([31](https://arxiv.org/html/2606.11480#S4.E31)) is controlled by the spectral error of the Gram sketch. The second term measures the amount of label-feature signal outside the retained subspace. Thus, FedRAN closely approximates the full ridge classifier when the Gram sketch is accurate and $B_{1:t}$ is well aligned with the retained spectral directions. The proof is provided in Appendix [A.8](https://arxiv.org/html/2606.11480#A1.SS8).

###### Corollary 1 (Score and prediction stability).

For a test random feature $h \in \mathbb{R}^{M}$, let $s^{\star} = h^{\top} W^{\star}_{1:t}$ and $s = h^{\top} W_{1:t}$. If $\varepsilon_W(t)$ denotes the right-hand side of Eq. ([31](https://arxiv.org/html/2606.11480#S4.E31)), then

$$\|s^{\star} - s\|_2 \leq \|h\|_2\, \varepsilon_W(t). \tag{32}$$

Moreover, if the full ridge classifier has margin

$$s^{\star}_{y} - \max_{c \neq y} s^{\star}_{c} > 2 \|h\|_2\, \varepsilon_W(t), \tag{33}$$

then FedRAN predicts the same class as the full ridge classifier for $h$.

Inference. For a test sample $x$, FedRAN computes

$$h(x) = \mathrm{ReLU}(f(x)P), \qquad s(x) = h(x)\, W_{1:t}, \tag{34}$$

and predicts

$$\hat{y} = \arg\max_{c \in \{1,\ldots,C_t\}} [s(x)]_c. \tag{35}$$

After feature extraction, inference is a single matrix-vector score computation with the analytic classifier.

Cost summary. FedRAN replaces iterative model-update communication with a one-shot statistic upload per task and replaces the full $M \times M$ analytic solve with rank-$r$ QR-SVD merging and a subspace ridge update. A detailed communication and computation analysis is provided in Appendix [A.2](https://arxiv.org/html/2606.11480#A1.SS2).

## 5. Evaluation

Figure 4. Task-wise accuracy comparison across CIFAR-100, ImageNet-R, and VTAB with $\beta = 0.1$. FedRAN consistently achieves higher accuracy across sequential tasks, demonstrating lower catastrophic forgetting than existing FCL baselines.

### 5.1. Experimental Setup

We evaluate FedRAN in the class-incremental learning (CIL) setting, where new classes arrive over a sequence of tasks. Each task introduces a new subset of classes, and the model is evaluated across all classes observed up to that task. At each task, clients collaboratively update the global model under the federated setting. For the federated setup, we use $K = 5$ clients.
To simulate non-IID client data, the training samples of each task are partitioned across clients using a Dirichlet distribution (Hsu et al., [2019](https://arxiv.org/html/2606.11480#bib.bib22)) with concentration parameter $\beta \in \{0.1, 0.5, 1.0\}$. A smaller $\beta$ produces more skewed class distributions across clients, resulting in stronger non-IID partitions. We evaluate two backbone settings, ResNet-18 (He et al., [2016](https://arxiv.org/html/2606.11480#bib.bib23)) and ViT-B/16 (Dosovitskiy et al., [2020](https://arxiv.org/html/2606.11480#bib.bib24)), pretrained on ImageNet. We set the batch size to 128 and the ridge parameter to $\lambda = 10^{-3}$. The ResNet setting uses projection dimension $M = 8192$ with rank $r = 2048$ on CIFAR-100 and ImageNet-R, and rank $r = 512$ on VTAB. The ViT setting uses $M = 2048$ and $r = 512$. For prototype-based pseudo-labeling, the confidence threshold is set to $\tau = 0.5$. All experiments are implemented in Python 3.8.20 with PyTorch 2.4.1 and run on an NVIDIA RTX A5000 GPU.

### 5.2. Datasets

We evaluate FedRAN on three widely used continual learning image classification benchmarks. CIFAR-100 is a standard image classification benchmark. ImageNet-R evaluates robustness under out-of-distribution visual renditions. VTAB spans diverse visual domains, including natural, medical, and remote sensing images. Table [2](https://arxiv.org/html/2606.11480#S5.T2) summarizes the datasets and task splits used in our evaluation. For CIFAR-100, we use 5 tasks in the ResNet setting and 10 tasks in the ViT setting, consistent with prior work.

Table 2. Datasets used for evaluation.

### 5.3. Metrics

We evaluate FedRAN using two sets of metrics: accuracy and resource efficiency. Accuracy metrics: (i) final accuracy $A_T$, measured after the last task over all observed classes; and (ii) average accuracy $A_{\mathrm{avg}}$, computed by averaging task accuracy across the task stream.
At each task, accuracy is evaluated over all classes observed up to that task. Resource-efficiency metrics: (i) communication cost, measured in MB as the maximum amount of data transmitted by a client in a single task; and (ii) runtime, measured as wall-clock training time in seconds, including both client-side and server-side computation for each task.

### 5.4. Baselines

We compare FedRAN against representative baselines for federated continual learning. Finetune (McMahan et al., [2017b](https://arxiv.org/html/2606.11480#bib.bib36)) sequentially trains client models on new tasks and aggregates them using FedAvg (McMahan et al., [2017a](https://arxiv.org/html/2606.11480#bib.bib1)). FedEWC (Kirkpatrick et al., [2017a](https://arxiv.org/html/2606.11480#bib.bib28)), FedLwF (Li and Hoiem, [2017](https://arxiv.org/html/2606.11480#bib.bib29)), and iCaRL (Rebuffi et al., [2017](https://arxiv.org/html/2606.11480#bib.bib37)) adapt standard continual learning strategies to the federated setting: FedEWC regularizes changes to weights that are important for previous tasks, FedLwF uses distillation to preserve previous-task behavior, and iCaRL relies on exemplar replay. TARGET (Zhang et al., [2023a](https://arxiv.org/html/2606.11480#bib.bib35)) is an exemplar-free federated class-continual learning method based on distillation and synthetic data generation. STSA (Guan et al., [2026](https://arxiv.org/html/2606.11480#bib.bib38)) aggregates spatial-temporal feature statistics for FCL. For pretrained ViT-based comparisons, we also include prompt- and adapter-based continual learning methods.
DualPrompt (Wang et al., [2022](https://arxiv.org/html/2606.11480#bib.bib30)) learns complementary prompts for rehearsal-free continual learning. CodaPrompt (Smith et al., [2023](https://arxiv.org/html/2606.11480#bib.bib31)) uses decomposed attention-based prompts for continual adaptation. Fed-CPrompt (Bagwe et al., [2023](https://arxiv.org/html/2606.11480#bib.bib32)) extends prompt learning to rehearsal-free federated continual learning using contrastive prompts. PiLoRA (Guo et al., [2024](https://arxiv.org/html/2606.11480#bib.bib33)) uses prototype-guided LoRA adaptation for federated class-incremental learning.

### 5.5. Results

#### 5.5.1. Overall accuracy

Table [3](https://arxiv.org/html/2606.11480#S5.T3) compares FedRAN with ResNet-based FCL baselines under different Dirichlet settings, using the same number of clients, task splits, data partitioning across clients, and backbone setting. Fig. [4](https://arxiv.org/html/2606.11480#S5.F4) further shows accuracy across the task stream. FedRAN achieves the best $A_{\mathrm{avg}}$ and $A_T$ across all datasets and all $\beta$ values. STSA is the strongest baseline, but FedRAN consistently improves over it, with gains in $A_{\mathrm{avg}}$ of up to 4.80 points on CIFAR-100, 4.24 points on ImageNet-R, and 2.68 points on VTAB.

The task accuracy curves in Fig. [4](https://arxiv.org/html/2606.11480#S5.F4) show that optimization-based baselines, such as Finetune, exhibit larger accuracy drops as the number of tasks increases, indicating stronger forgetting under sequential client updates. In contrast, STSA and FedRAN maintain higher accuracy across the task stream. This suggests that freezing the backbone helps limit client drift and reduces catastrophic forgetting of previously learned tasks.
FedRAN further improves over STSA by using an SVD-based low-rank summary of the projected feature matrix, which preserves the dominant feature directions needed for the analytic classifier update. Additionally, FedRAN is stable across different Dirichlet settings. Its accuracy varies only slightly from $\beta = 0.1$ to $\beta = 1.0$, whereas several optimization-based baselines change more noticeably across client partitions. This robustness comes from aggregating client-side random-feature statistics rather than averaging locally trained model parameters.

Table 3. Performance comparison under different Dirichlet settings. Each dataset reports average accuracy ($A_{\mathrm{avg}}$) and final accuracy ($A_T$). Improvement denotes the absolute percentage-point gain of FedRAN over the strongest baseline for each metric. Best and second-best results are shown in bold and underlined, respectively.

#### 5.5.2. Communication Overhead

Table [4](https://arxiv.org/html/2606.11480#S5.T4) reports the communication cost per client per task. TARGET is shown as a representative optimization-based FCL method, since Finetune, FedLwF, FedEWC, and FediCaRL also communicate model updates through iterative training rounds and therefore have comparable communication costs. FedRAN substantially reduces communication costs compared with TARGET, whose costs are $31.90\times$, $30.65\times$, and $121.75\times$ higher on CIFAR-100, ImageNet-R, and VTAB, respectively. Compared with STSA, FedRAN also reduces communication by $1.45\times$ on CIFAR-100 and $2.77\times$ on both ImageNet-R and VTAB. This reduction comes from the client summary design. In our experimental setup, we consider the communication-efficient STSA method, which communicates class-wise feature sums and class counts, including the dummy-class statistics used to estimate the Gram matrix.
In contrast, FedRAN directly transmits a low-rank spectral summary $(V_{k,t}, \sigma_{k,t})$ together with the label-feature statistic $B_{k,t} = H_{k,t}^{\top} Y_{k,t}$, with communication cost $O(Mr + r + MC_t)$. Combined with the accuracy results in Table [3](https://arxiv.org/html/2606.11480#S5.T3), FedRAN provides a stronger accuracy-communication tradeoff than FCL baselines.

Table 4. Communication cost per client per task. Costs are reported in MB, with the relative cost normalized to FedRAN shown in parentheses; lower is better. TARGET is shown as a representative of all optimization-based methods such as Finetune, FedLwF, FedEWC, and FediCaRL.

#### 5.5.3. Computation Overhead

Table [5](https://arxiv.org/html/2606.11480#S5.T5) reports the runtime per client per task, including the computation performed by a single client and the corresponding server-side update for that task. Gradient-based baselines such as Finetune, FedLwF, TARGET, FedEWC, and FediCaRL require local backpropagation over multiple rounds and take hundreds of seconds per task. In contrast, FedRAN completes each task in a few seconds by avoiding iterative local training over multiple rounds. Across the gradient-based baselines, FedRAN is on average $190.3\times$ faster, with speedups of $96.9\times$, $246.9\times$, and $227.2\times$ on CIFAR-100, ImageNet-R, and VTAB, respectively. STSA is the closest runtime baseline, as it avoids expensive gradient-based training over multiple rounds. FedRAN is slightly slower than STSA on CIFAR-100 and ImageNet-R due to the additional SVD and QR-SVD spectral summarization, but it achieves consistently higher accuracy, as shown in Table [3](https://arxiv.org/html/2606.11480#S5.T3). Overall, FedRAN provides much lower runtime than gradient-based FCL methods and comparable runtime to the fastest baseline, while improving accuracy.
Table 5. Runtime per client per task (seconds). Lower is better.

Figure 5. Effect of proxy-labeling under limited labeled data. FedRAN with proxy labels consistently improves average accuracy at low labeled rates, with the largest gains at 20% labeled data.

#### 5.5.4. Effect of Proxy-Labeling

Figure [5](https://arxiv.org/html/2606.11480#S5.F5) evaluates FedRAN under limited labeled data at $\beta = 0.1$. The labeled-only setting forms $Y_{k,t}$ and $B_{k,t}$ using only the labeled subset of data at each client, while the proxy-label setting assigns pseudo-labels to confident unlabeled samples using prototype similarity with threshold $\tau = 0.5$ and includes them in the local statistics. Proxy-labeling provides the largest gains when only 20% of the data is labeled, improving $A_{\mathrm{avg}}$ by 3.26 points on CIFAR-100, 6.61 points on ImageNet-R, and 5.76 points on VTAB. As the label rate increases, the gain decreases because the labeled-only class-prototype matrix becomes more reliable. These results show that proxy-labeling improves FedRAN under label scarcity by allowing confident unlabeled samples to contribute to $B_{k,t}$.

Figure 6. Effect of the SVD rank under different random projection dimensions $M$ on FedRAN for CIFAR-100 with $\beta = 0.5$.

#### 5.5.5. Effect of Projection Dimension and Rank

Figure [6](https://arxiv.org/html/2606.11480#S5.F6) shows the effect of SVD rank $r$ and random projection dimension $M$ on CIFAR-100 with $\beta = 0.5$. Increasing $r$ improves $A_{\mathrm{avg}}$ as the client's low-rank spectral summary retains more dominant directions of the local Gram matrix. However, the gain saturates at larger ranks. For example, when $M = 2048$, increasing $r$ from 256 to 512 improves $A_{\mathrm{avg}}$ by 3.0 points, whereas increasing $r$ from 1024 to 2048 improves it by only 1.0 point while communication nearly doubles from 17.6 MB to 33.6 MB.
This suggests that moderate ranks already capture most of the useful spectral information. Increasing $M$ also improves accuracy by increasing feature separability, but it raises communication cost further. For example, at $r = 1024$, $A_{\mathrm{avg}}$ increases from 74.0 at $M = 2048$ to 75.2 at $M = 8192$, while communication increases from 17.6 MB to 70.3 MB. These results show the trade-off between accuracy and communication cost. Larger values of $M$ and $r$ improve accuracy, but moderate ranks offer most of the benefit at substantially lower communication cost.

#### 5.5.6. Ablation on Model Components

Figure 7. Ablation of FedRAN components on CIFAR-100 with ViT-B/16. Components are added sequentially to demonstrate the construction of the final FedRAN framework.

To evaluate the necessity of each FedRAN component, we perform an ablation study on CIFAR-100 with the ViT backbone for $\beta = 1.0$, as shown in Figure [7](https://arxiv.org/html/2606.11480#S5.F7). We establish a baseline without random projection, ReLU, and low-rank SVD, achieving 87.71% accuracy with raw ViT features. Linearly expanding this feature space to $M = 1250$ using random projection increases the model's representational capacity, thereby increasing the average accuracy to 90.21%. Applying the ReLU activation improves feature separability and further boosts accuracy to 93.95%, validating the necessity of the nonlinear operation and projection. We finally integrate the low-rank SVD approximation to complete the FedRAN framework. By transmitting only the dominant components rather than the full Gram matrix, FedRAN reduces the communication cost from quadratic $O(M^2)$ to linear $O(Mr)$ in $M$ for a fixed $r$. This reduction in communication overhead incurs no accuracy penalty, with the final model achieving an average accuracy of 93.96%.
Our ablation study shows that FedRAN preserves the key second-order geometry required for high-accuracy continual learning while satisfying resource constraints.

#### 5.5.7. Ablation on ViT as Backbone Model

Table [6](https://arxiv.org/html/2606.11480#S5.T6) compares FedRAN with ViT-based FCL baselines under different Dirichlet settings, using the same number of clients, task splits, and data partitioning across clients. All methods use the same pretrained ViT-B/16 model, with task 0 used for finetuning following the prior ViT-based setting. FedRAN achieves the best $A_{\mathrm{avg}}$ across all $\beta$ values on both datasets. Compared with STSA, the strongest baseline, FedRAN improves $A_{\mathrm{avg}}$ by up to 1.03 percentage points on CIFAR-100 and 2.40 percentage points on ImageNet-R. FedRAN also uses less communication than STSA in the ViT setting. On CIFAR-100, FedRAN requires 9.57 MB per client per task compared with 10.50 MB for STSA. On ImageNet-R, FedRAN requires 11.13 MB compared with 21.00 MB for STSA. Thus, FedRAN provides a stronger trade-off between accuracy and communication cost.

Table 6. ViT-B/16 results under different Dirichlet settings. Each dataset reports average accuracy ($A_{\mathrm{avg}}$) and final accuracy ($A_T$).

## 6. Applications

FedRAN is suited to privacy-sensitive and resource-constrained FCL scenarios where client data arrives continuously after model deployment. In clinical imaging, hospitals and edge medical devices cannot share raw patient data due to privacy risks, labels often require expert annotation, and new diseases or imaging conditions may emerge over time. In drone and remote-sensing networks, clients operate with limited bandwidth, battery, and onboard compute while monitoring diverse terrains, weather conditions, and object categories.
Similar constraints arise in monitoring systems such as human activity recognition, where local data is privacy-sensitive and labels may be sparse or delayed. These applications require continual updates without iteratively training large client models, often with only limited labeled data. FedRAN satisfies these constraints by keeping the backbone frozen, communicating compact low-rank statistics, updating the classifier via a closed-form solution, and using prototype-based pseudo-labeling to exploit unlabeled client data when supervision is limited.

## 7. Related Works

### 7.1. Federated Continual Learning

Existing FCL methods mainly address forgetting through optimization-based mechanisms: task-adaptive transfer across clients, global–local forgetting compensation, replay or synthetic data, distillation, orthogonal update constraints, and spatial–temporal heterogeneity modeling (Yoon et al., [2021](https://arxiv.org/html/2606.11480#bib.bib3); Dong et al., [2023](https://arxiv.org/html/2606.11480#bib.bib13); Qi et al., [2023](https://arxiv.org/html/2606.11480#bib.bib10); Zhang et al., [2023b](https://arxiv.org/html/2606.11480#bib.bib9); Bakman et al., [2023](https://arxiv.org/html/2606.11480#bib.bib12); Yu et al., [2025](https://arxiv.org/html/2606.11480#bib.bib34); Li and Bidkhori, [2025](https://arxiv.org/html/2606.11480#bib.bib51)).
Other approaches adapt standard continual-learning tools such as regularization, exemplar replay, and distillation, or reduce the trainable state with prompts, adapters, and parameter-efficient modules built on pretrained models (Kirkpatrick et al., [2017a](https://arxiv.org/html/2606.11480#bib.bib28); Rebuffi et al., [2017](https://arxiv.org/html/2606.11480#bib.bib37); Li and Hoiem, [2017](https://arxiv.org/html/2606.11480#bib.bib29); Bagwe et al., [2023](https://arxiv.org/html/2606.11480#bib.bib32); Guo et al., [2024](https://arxiv.org/html/2606.11480#bib.bib33); Xu et al., [2026](https://arxiv.org/html/2606.11480#bib.bib48)). Recent resource-aware and on-device studies further emphasize that FCL accuracy must be maintained under constrained client computation, communication, and label availability (Zhong et al., [2025](https://arxiv.org/html/2606.11480#bib.bib49); Deng et al., [2023](https://arxiv.org/html/2606.11480#bib.bib50); Li et al., [2026](https://arxiv.org/html/2606.11480#bib.bib15)). FedRAN differs by replacing iterative client optimization with a forward-only statistic upload per task. Its frozen backbone also avoids client-induced representation drift, while continual adaptation is handled only through the analytic classifier state.

### 7.2. Analytic Learning and Low-Rank Aggregation

Analytic continual learning replaces iterative classifier training with closed-form updates over frozen features, including recursive, forward-only, dual-stream, generalized, and joint-training-oriented formulations (Zhuang et al., [2022](https://arxiv.org/html/2606.11480#bib.bib39), [2024c](https://arxiv.org/html/2606.11480#bib.bib52), [2024b](https://arxiv.org/html/2606.11480#bib.bib53), [2024a](https://arxiv.org/html/2606.11480#bib.bib54); Momeni et al., [2026](https://arxiv.org/html/2606.11480#bib.bib56)).
Random-feature methods and pretrained representations provide a complementary route to stronger fixed features, while low-rank random-feature updates and truncated SVD improve stability in centralized continual learning (Rahimi and Recht, [2007](https://arxiv.org/html/2606.11480#bib.bib17), [2008](https://arxiv.org/html/2606.11480#bib.bib65); McDonnell et al., [2023](https://arxiv.org/html/2606.11480#bib.bib18); Prabhu et al., [2024](https://arxiv.org/html/2606.11480#bib.bib55); Peng et al., [2024](https://arxiv.org/html/2606.11480#bib.bib43)). Federated analytic methods aggregate closed-form heads, class statistics, or spatial–temporal feature statistics for one-round, continual, personalized, unlearning, nearest-class, and low-rank Gram settings (Fanì et al., [2024b](https://arxiv.org/html/2606.11480#bib.bib40); Tang et al., [2025b](https://arxiv.org/html/2606.11480#bib.bib42); Guan et al., [2026](https://arxiv.org/html/2606.11480#bib.bib38); Tang et al., [2026](https://arxiv.org/html/2606.11480#bib.bib57); Fan et al., [2025](https://arxiv.org/html/2606.11480#bib.bib58); Quan et al., [2026](https://arxiv.org/html/2606.11480#bib.bib59); Goswami et al., [2026](https://arxiv.org/html/2606.11480#bib.bib44); Legate et al., [2023](https://arxiv.org/html/2606.11480#bib.bib45); Tang et al., [2025a](https://arxiv.org/html/2606.11480#bib.bib60); Turazza et al., [2026](https://arxiv.org/html/2606.11480#bib.bib61); Meng et al., [2026](https://arxiv.org/html/2606.11480#bib.bib62); Alsulaimawi, [2026](https://arxiv.org/html/2606.11480#bib.bib63)).
Separately, federated PCA, streaming SVD, deterministic sketches, randomized SVD, hierarchical SVD, and distributed eigenspace estimation provide tools for compact low-rank subspace merging (Grammenos et al., [2020](https://arxiv.org/html/2606.11480#bib.bib19); Řehůřek, [2011](https://arxiv.org/html/2606.11480#bib.bib20); Eftekhari et al., [2019](https://arxiv.org/html/2606.11480#bib.bib21); Iwen and Ong, [2016](https://arxiv.org/html/2606.11480#bib.bib66); Ghashami et al., [2016](https://arxiv.org/html/2606.11480#bib.bib67); Halko et al., [2011](https://arxiv.org/html/2606.11480#bib.bib68); Vasudevan and Ramakrishna, [2017](https://arxiv.org/html/2606.11480#bib.bib69); Fan et al., [2019](https://arxiv.org/html/2606.11480#bib.bib70); Shamrai, [2026](https://arxiv.org/html/2606.11480#bib.bib64)). FedRAN extends these into a two-level spatial–temporal QR-SVD merge of random-feature Gram summaries for FCL. It occupies a middle ground between transmitting the full $M \times M$ Gram matrix and estimating second-order structure from first-order class statistics.

### 7.3. Label-Efficient Federated Learning

Label-efficient federated learning studies how clients can learn when only part of the distributed data is labeled, often by enforcing consistency across clients or selectively using unlabeled examples (Jeong et al., [2020](https://arxiv.org/html/2606.11480#bib.bib46); Park et al., [2025](https://arxiv.org/html/2606.11480#bib.bib71)). Prototype-based federated methods summarize class structure through feature-space representatives rather than raw data (Tan et al., [2022](https://arxiv.org/html/2606.11480#bib.bib47)). FedRAN's semi-supervised variant uses the frozen feature space in the same spirit: confident unlabeled samples are assigned prototype-based pseudo-labels and then added to the analytic label–feature statistic. This uses unlabeled data without updating the backbone or adding an iterative semi-supervised training loop.
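As a rough illustration of the prototype-based pseudo-labeling idea described above, the following NumPy sketch builds normalized class prototypes from labeled features and accepts an unlabeled sample only when its cosine similarity to the nearest prototype reaches a threshold. The function name `pseudo_label`, the toy data, and the threshold value are illustrative assumptions rather than the authors' implementation, and the sketch assumes no feature vector has zero norm.

```python
import numpy as np


def pseudo_label(Z_lab, y_lab, Z_unl, tau):
    """Hypothetical helper: augment labeled data with confident pseudo-labels."""
    classes = np.unique(y_lab)
    # L2-normalized class prototypes: mean labeled feature per class.
    protos = np.stack([Z_lab[y_lab == c].mean(axis=0) for c in classes])
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    # Cosine similarity between each unlabeled feature and each prototype.
    U = Z_unl / np.linalg.norm(Z_unl, axis=1, keepdims=True)
    sims = U @ protos.T
    best = sims.argmax(axis=1)
    # Accept a pseudo-label only if the best similarity is at least tau.
    keep = sims[np.arange(len(U)), best] >= tau
    Z_aug = np.vstack([Z_lab, Z_unl[keep]])
    y_aug = np.concatenate([y_lab, classes[best[keep]]])
    return Z_aug, y_aug


# Toy data (assumed for illustration): two classes in a 2-D feature space.
Z_lab = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_lab = np.array([0, 0, 1, 1])
Z_unl = np.array([[2.0, 0.1], [0.1, 3.0], [1.0, 1.0]])
Z_aug, y_aug = pseudo_label(Z_lab, y_lab, Z_unl, tau=0.9)
```

In this toy case the first two unlabeled points lie close to one prototype and are accepted, while the ambiguous point `[1.0, 1.0]` falls below the threshold and is discarded, so only confident samples would enter the label–feature statistic.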
## 8. Conclusion and Future Work

We presented FedRAN, a resource-aware analytic framework for federated continual learning. Instead of iterative client training, each client freezes a pretrained backbone, forms random features through a fixed projection, and uploads two compact statistics: an exact label–feature statistic and a truncated-SVD summary of its random-feature Gram matrix. The server merges these summaries spatially across clients and temporally across tasks with a two-level QR-SVD update, then solves the classifier in closed form, with deterministic bounds tying the approximation to the retained spectral subspace. A prototype-based pseudo-labeling variant brings confident unlabeled samples into the analytic update without any backbone training. Across CIFAR-100, ImageNet-R, and VTAB under non-IID streams, FedRAN improves average accuracy by up to $4.8$ points over the strongest baseline while using $30.6$–$121.8\times$ less per-client communication and training $190.3\times$ faster, and gains up to $6.61$ points from pseudo-labeling at a $20\%$ label rate. In future work, we plan to study privacy-preserving variants of FedRAN, such as secure aggregation for the uploaded statistics. We also plan to extend analytic random-feature aggregation beyond class-incremental image classification and to incorporate pseudo-label noise into the theoretical analysis.

## References

- Z\. Alsulaimawi \(2026\)One\-shot federated ridge regression: exact recovery via sufficient statistic aggregation\.arXiv preprint arXiv:2601\.08216\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- G\. Bagwe, X\. Yuan, M\. Pan, and L\.
Zhang \(2023\)Fed\-cprompt: contrastive prompt for rehearsal\-free federated continual learning\.arXiv preprint arXiv:2307\.04869\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.3.2.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - Y\. F\. Bakman, D\. N\. Yaldiz, Y\. H\. Ezzeldin, and S\. Avestimehr \(2023\)Federated orthogonal training: mitigating global catastrophic forgetting in continual federated learning\.arXiv preprint arXiv:2309\.01289\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.2.1.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - M\. De Lange, R\. Aljundi, M\. Masana, S\. Parisot, X\. Jia, A\. Leonardis, G\. Slabaugh, and T\. Tuytelaars \(2021\)A continual learning survey: defying forgetting in classification tasks\.IEEE transactions on pattern analysis and machine intelligence44\(7\),pp\. 3366–3385\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1)\. - Y\. Deng, S\. Yue, T\. Wang, G\. Wang, J\. Ren, and Y\. Zhang \(2023\)Fedinc: an exemplar\-free continual federated learning framework with small labeled data\.InProceedings of the 21st ACM Conference on Embedded Networked Sensor Systems,pp\. 56–69\.Cited by:[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - J\. Dong, H\. Li, Y\. Cong, G\. Sun, Y\. Zhang, and L\. Van Gool \(2023\)No one left behind: real\-world federated class\-incremental learning\.IEEE Transactions on Pattern Analysis and Machine Intelligence46\(4\),pp\. 2054–2070\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.2.1.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - A\. Dosovitskiy, L\. 
Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly,et al\.\(2020\)An image is worth 16x16 words: transformers for image recognition at scale\.arXiv preprint arXiv:2010\.11929\.Cited by:[§5\.1](https://arxiv.org/html/2606.11480#S5.SS1.p2.8)\. - A\. Eftekhari, R\. A\. Hauser, and A\. Grammenos \(2019\)MOSES: a streaming algorithm for linear dimensionality reduction\.IEEE transactions on pattern analysis and machine intelligence42\(11\),pp\. 2901–2911\.Cited by:[§4\.5](https://arxiv.org/html/2606.11480#S4.SS5.p2.11),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - J\. Fan, D\. Wang, K\. Wang, and Z\. Zhu \(2019\)Distributed estimation of principal eigenspaces\.Annals of statistics47\(6\),pp\. 3009\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - K\. Fan, J\. Tang, Z\. Yang, F\. Han, J\. Li, R\. He, Y\. Huang, A\. Liu, H\. H\. Song, Y\. Liu,et al\.\(2025\)Apfl: analytic personalized federated learning via dual\-stream least squares\.arXiv preprint arXiv:2508\.10732\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - E\. Fanì, R\. Camoriano, B\. Caputo, and M\. Ciccone \(2024a\)Accelerating heterogeneous federated learning with closed\-form classifiers\.arXiv preprint arXiv:2406\.01116\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.4.3.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p4.1)\. - E\. Fanì, R\. Camoriano, B\. Caputo, and M\. Ciccone \(2024b\)Accelerating heterogeneous federated learning with closed\-form classifiers\.arXiv preprint arXiv:2406\.01116\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.4.3.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p4.1),[§3](https://arxiv.org/html/2606.11480#S3.p4.3),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - M\. Ghashami, E\. Liberty, J\. M\. Phillips, and D\. P\. 
Woodruff \(2016\)Frequent directions: simple and deterministic matrix sketching\.SIAM Journal on Computing45\(5\),pp\. 1762–1792\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - M\. Gholizade, F\. Ruffini, P\. Ducange, and F\. Marcelloni \(2026\)Federated continual learning: a comprehensive survey on lifelong and privacy\-preserving learning over distributed and non\-stationary data\.Neurocomputing,pp\. 133929\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1)\. - D\. Goswami, S\. Magistri, K\. Wang, B\. Twardowski, A\. Bagdanov, and J\. van de Weijer \(2026\)Covariances for free: exploiting mean distributions for training\-free federated learning\.Advances in Neural Information Processing Systems38,pp\. 65081–65115\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.5.4.1.1.1),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - A\. Grammenos, R\. Mendoza Smith, J\. Crowcroft, and C\. Mascolo \(2020\)Federated principal component analysis\.Advances in neural information processing systems33,pp\. 6453–6464\.Cited by:[§4\.5](https://arxiv.org/html/2606.11480#S4.SS5.p2.11),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - Z\. Guan, G\. Zhu, Y\. Zhou, W\. Liu, W\. Wang, J\. Luo, and X\. Gu \(2026\)Enhancing federated class\-incremental learning via spatial\-temporal statistics aggregation\.InProceedings of the ACM Web Conference 2026,pp\. 5356–5367\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.4.3.1.1.1),[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.5.4.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p4.1),[§1](https://arxiv.org/html/2606.11480#S1.p5.2),[§3](https://arxiv.org/html/2606.11480#S3.p4.3),[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - H\. Guo, F\. Zhu, W\. Liu, X\. Zhang, and C\. 
Liu \(2024\)Pilora: prototype guided incremental lora for federated class\-incremental learning\.InEuropean Conference on Computer Vision,pp\. 141–159\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.3.2.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - N\. Halko, P\. Martinsson, and J\. A\. Tropp \(2011\)Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions\.SIAM review53\(2\),pp\. 217–288\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 770–778\.Cited by:[§5\.1](https://arxiv.org/html/2606.11480#S5.SS1.p2.8)\. - D\. Hendrycks, S\. Basart, N\. Mu, S\. Kadavath, F\. Wang, E\. Dorundo, R\. Desai, T\. Zhu, S\. Parajuli, M\. Guo,et al\.\(2021\)The many faces of robustness: a critical analysis of out\-of\-distribution generalization\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 8340–8349\.Cited by:[Table 2](https://arxiv.org/html/2606.11480#S5.T2.1.3.2.1)\. - T\. H\. Hsu, H\. Qi, and M\. Brown \(2019\)Measuring the effects of non\-identical data distribution for federated visual classification\.arXiv preprint arXiv:1909\.06335\.Cited by:[§2\.1](https://arxiv.org/html/2606.11480#S2.SS1.p2.14),[§5\.1](https://arxiv.org/html/2606.11480#S5.SS1.p1.3)\. - M\. A\. Iwen and B\. W\. Ong \(2016\)A distributed and incremental svd algorithm for agglomerative data analysis on large networks\.SIAM Journal on Matrix Analysis and Applications37\(4\),pp\. 1699–1718\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - W\. Jeong, J\. Yoon, E\. Yang, and S\. J\. 
Hwang \(2020\)Federated semi\-supervised learning with inter\-client consistency & disjoint learning\.arXiv preprint arXiv:2006\.12097\.Cited by:[§3](https://arxiv.org/html/2606.11480#S3.p5.1),[§7\.3](https://arxiv.org/html/2606.11480#S7.SS3.p1.1)\. - E\. Jothimurugesan, K\. Hsieh, J\. Wang, G\. Joshi, and P\. B\. Gibbons \(2023\)Federated learning under distributed concept drift\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 5834–5853\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1)\. - P\. Kairouz and H\. B\. McMahan \(2021\)Advances and open problems in federated learning\.Foundations and trends in machine learning14\(1\-2\),pp\. 1–210\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1)\. - J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska,et al\.\(2017a\)Overcoming catastrophic forgetting in neural networks\.Proceedings of the national academy of sciences114\(13\),pp\. 3521–3526\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska,et al\.\(2017b\)Overcoming catastrophic forgetting in neural networks\.Proceedings of the national academy of sciences114\(13\),pp\. 3521–3526\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1)\. - A\. Krizhevsky, G\. Hinton,et al\.\(2009\)Learning multiple layers of features from tiny images\.Cited by:[Table 2](https://arxiv.org/html/2606.11480#S5.T2.1.2.1.1)\. - G\. Legate, N\. Bernier, L\. Page\-Caccia, E\. Oyallon, and E\. Belilovsky \(2023\)Guiding the last layer in federated learning with pre\-trained models\.Advances in Neural Information Processing Systems36,pp\. 
69832–69848\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.5.4.1.1.1),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - H\. Li and H\. Bidkhori \(2025\)FedGTEA: federated class\-incremental learning with gaussian task embedding and alignment\.arXiv preprint arXiv:2510\.12927\.Cited by:[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - Y\. Li, Y\. Wang, J\. Dong, H\. Wang, Y\. Qi, R\. Zhang, and R\. Li \(2026\)Resource\-constrained federated continual learning: what does matter?\.Advances in Neural Information Processing Systems38,pp\. 79493–79518\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p2.1),[§3](https://arxiv.org/html/2606.11480#S3.p1.1),[§4\.4](https://arxiv.org/html/2606.11480#S4.SS4.p1.2),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - Z\. Li and D\. Hoiem \(2017\)Learning without forgetting\.IEEE transactions on pattern analysis and machine intelligence40\(12\),pp\. 2935–2947\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - M\. D\. McDonnell, D\. Gong, A\. Parvaneh, E\. Abbasnejad, and A\. Van den Hengel \(2023\)Ranpac: random projections and pre\-trained models for continual learning\.Advances in Neural Information Processing Systems36,pp\. 12022–12053\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p5.2),[§3](https://arxiv.org/html/2606.11480#S3.p4.3),[§4\.2](https://arxiv.org/html/2606.11480#S4.SS2.p2.3),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. A\. y Arcas \(2017a\)Communication\-efficient learning of deep networks from decentralized data\.InArtificial intelligence and statistics,pp\. 1273–1282\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1)\. - B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. A\. 
y Arcas \(2017b\)Communication\-efficient learning of deep networks from decentralized data\.InArtificial intelligence and statistics,pp\. 1273–1282\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1)\.
- C\. Meng, M\. Tang, and V\. W\. Wong \(2026\)FLoRG: federated fine\-tuning with low\-rank gram matrices and procrustes alignment\.arXiv preprint arXiv:2602\.17095\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- S\. Momeni, C\. Xiao, and B\. Liu \(2026\)Anacp: toward upper\-bound continual learning via analytic contrastive projection\.Advances in Neural Information Processing Systems38,pp\. 72923–72944\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- B\. Park, P\. P\. B\. de Gusmão, D\. Ji, and M\. Kim \(2025\)CATCHFed: efficient unlabeled data utilization for semi\-supervised federated learning in limited labels environments\.arXiv preprint arXiv:2511\.11778\.Cited by:[§7\.3](https://arxiv.org/html/2606.11480#S7.SS3.p1.1)\.
- L\. Peng, J\. Elenter, J\. Agterberg, A\. Ribeiro, and R\. Vidal \(2024\)Loranpac: low\-rank random features and pre\-trained models for bridging theory and practice in continual learning\.arXiv preprint arXiv:2410\.00645\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.6.5.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p5.2),[§3](https://arxiv.org/html/2606.11480#S3.p4.3),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- A\. Prabhu, S\. Sinha, P\. Kumaraguru, P\. H\. Torr, O\. Sener, and P\. K\. Dokania \(2024\)RanDumb: random representations outperform online continually learned representations\.Advances in Neural Information Processing Systems37,pp\. 37988–38006\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- D\. Qi, H\. Zhao, and S\.
Li \(2023\)Better generative replay for continual federated learning\.arXiv preprint arXiv:2302\.13001\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.2.1.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\.
- Y\. Quan, W\. Wu, and G\. Montana \(2026\)Exact federated continual unlearning for ridge heads on frozen foundation models\.arXiv preprint arXiv:2603\.12977\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- A\. Rahimi and B\. Recht \(2007\)Random features for large\-scale kernel machines\.Advances in neural information processing systems20\.Cited by:[§4\.2](https://arxiv.org/html/2606.11480#S4.SS2.p2.3),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- A\. Rahimi and B\. Recht \(2008\)Weighted sums of random kitchen sinks: replacing minimization with randomization in learning\.Advances in neural information processing systems21\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- S\. Rebuffi, A\. Kolesnikov, G\. Sperl, and C\. H\. Lampert \(2017\)Icarl: incremental classifier and representation learning\.InProceedings of the IEEE conference on Computer Vision and Pattern Recognition,pp\. 2001–2010\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\.
- R\. Řehůřek \(2011\)Subspace tracking for latent semantic analysis\.InEuropean Conference on Information Retrieval,pp\. 289–300\.Cited by:[§4\.5](https://arxiv.org/html/2606.11480#S4.SS5.p2.11),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- J\. Schwarz, W\. Czarnecki, J\. Luketina, A\. Grabska\-Barwinska, Y\. W\. Teh, R\. Pascanu, and R\. Hadsell \(2018\)Progress & compress: a scalable framework for continual learning\.InInternational conference on machine learning,pp\. 4528–4537\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p1.1)\.
- M\.
Shamrai \(2026\)Concatenated matrix svd: compression bounds, incremental approximation, and error\-constrained clustering\.arXiv preprint arXiv:2601\.11626\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - J\. S\. Smith, L\. Karlinsky, V\. Gutta, P\. Cascante\-Bonilla, D\. Kim, A\. Arbelle, R\. Panda, R\. Feris, and Z\. Kira \(2023\)Coda\-prompt: continual decomposed attention\-based prompting for rehearsal\-free continual learning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 11909–11919\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p2.1)\. - Y\. Tan, G\. Long, L\. Liu, T\. Zhou, Q\. Lu, J\. Jiang, and C\. Zhang \(2022\)Fedproto: federated prototype learning across heterogeneous clients\.InProceedings of the AAAI conference on artificial intelligence,Vol\.36,pp\. 8432–8440\.Cited by:[§3](https://arxiv.org/html/2606.11480#S3.p5.1),[§7\.3](https://arxiv.org/html/2606.11480#S7.SS3.p1.1)\. - J\. Tang, Y\. Huang, K\. Fan, F\. Han, J\. Li, J\. Xu, R\. He, A\. Liu, H\. H\. Song, H\. Zhuang,et al\.\(2026\)DeepAFL: deep analytic federated learning\.arXiv preprint arXiv:2603\.00579\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - J\. Tang, Z\. Yang, J\. Wang, K\. Fan, J\. Xu, H\. Zhuang, A\. Liu, H\. H\. Song, L\. Wang, and Y\. Liu \(2025a\)FedHiP: heterogeneity\-invariant personalized federated learning through closed\-form solutions\.arXiv preprint arXiv:2508\.04470\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - J\. Tang, H\. Zhuang, J\. He, R\. He, J\. Wang, K\. Fan, A\. Liu, T\. Wang, L\. Wang, Z\. 
Zhu,et al\.\(2025b\)Afcl: analytic federated continual learning for spatio\-temporal invariance of non\-iid data\.arXiv preprint arXiv:2505\.12245\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.4.3.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p4.1),[§3](https://arxiv.org/html/2606.11480#S3.p4.3),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - F\. Turazza, M\. Picone, and M\. Mamei \(2026\)The gaussian\-head ofl family: one\-shot federated learning from client global statistics\.arXiv preprint arXiv:2602\.01186\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - V\. Vasudevan and M\. Ramakrishna \(2017\)A hierarchical singular value decomposition algorithm for low rank matrices\.arXiv preprint arXiv:1710\.02812\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\. - Y\. Venkatesha, Y\. Kim, H\. Park, Y\. Li, and P\. Panda \(2022\)Addressing client drift in federated continual learning with adaptive optimization\.Available at SSRN 4188586\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§2\.3](https://arxiv.org/html/2606.11480#S2.SS3.p5.6),[§3](https://arxiv.org/html/2606.11480#S3.p3.1),[§4\.2](https://arxiv.org/html/2606.11480#S4.SS2.p1.6)\. - Z\. Wang, Z\. Zhang, S\. Ebrahimi, R\. Sun, H\. Zhang, C\. Lee, X\. Ren, G\. Su, V\. Perot, J\. Dy,et al\.\(2022\)Dualprompt: complementary prompting for rehearsal\-free continual learning\.InEuropean conference on computer vision,pp\. 631–648\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p2.1)\. - K\. Xu, Y\. Feng, J\. Li, Y\. Qi, and J\. Zhou \(2026\)C2prompt: class\-aware client knowledge interaction for federated continual learning\.Advances in Neural Information Processing Systems38,pp\. 44109–44139\.Cited by:[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - J\. Yoon, W\. Jeong, G\. Lee, E\. Yang, and S\. J\. 
Hwang \(2021\)Federated continual learning with weighted inter\-client transfer\.InInternational conference on machine learning,pp\. 12073–12086\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.2.1.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - H\. Yu, X\. Yang, L\. Zhang, H\. Gu, T\. Li, L\. Fan, and Q\. Yang \(2025\)Handling spatial\-temporal data heterogeneity for federated continual learning via tail anchor\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 4874–4883\.Cited by:[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - X\. Zhai, J\. Puigcerver, A\. Kolesnikov, P\. Ruyssen, C\. Riquelme, M\. Lucic, J\. Djolonga, A\. S\. Pinto, M\. Neumann, A\. Dosovitskiy,et al\.\(2019\)A large\-scale study of representation learning with the visual task adaptation benchmark\.arXiv preprint arXiv:1910\.04867\.Cited by:[Table 2](https://arxiv.org/html/2606.11480#S5.T2.1.4.3.1)\. - J\. Zhang, C\. Chen, W\. Zhuang, and L\. Lyu \(2023a\)Target: federated class\-continual learning via exemplar\-free distillation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 4782–4793\.Cited by:[§5\.4](https://arxiv.org/html/2606.11480#S5.SS4.p1.1)\. - J\. Zhang, C\. Chen, W\. Zhuang, and L\. Lyu \(2023b\)Target: federated class\-continual learning via exemplar\-free distillation\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 4782–4793\.Cited by:[Table 1](https://arxiv.org/html/2606.11480#S1.T1.1.2.1.1.1.1),[§1](https://arxiv.org/html/2606.11480#S1.p3.1),[§3](https://arxiv.org/html/2606.11480#S3.p2.1),[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\. - Z\. Zhong, W\. Bao, J\. Wang, J\. Chen, L\. Lyu, and W\. Y\. B\. 
Lim \(2025\)Sacfl: self\-adaptive federated continual learning for resource\-constrained end devices\.IEEE Transactions on Neural Networks and Learning Systems\.Cited by:[§7\.1](https://arxiv.org/html/2606.11480#S7.SS1.p1.1)\.
- H\. Zhuang, Y\. Chen, D\. Fang, R\. He, K\. Tong, H\. Wei, Z\. Zeng, and C\. Chen \(2024a\)GACL: exemplar\-free generalized analytic continual learning\.Advances in neural information processing systems37,pp\. 83024–83047\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- H\. Zhuang, R\. He, K\. Tong, Z\. Zeng, C\. Chen, and Z\. Lin \(2024b\)DS\-al: a dual\-stream analytic learning for exemplar\-free class\-incremental learning\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 17237–17244\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- H\. Zhuang, Y\. Liu, R\. He, K\. Tong, Z\. Zeng, C\. Chen, Y\. Wang, and L\. Chau \(2024c\)F\-oal: forward\-only online analytic learning with fast training and low memory footprint in class incremental learning\.Advances in Neural Information Processing Systems37,pp\. 41517–41538\.Cited by:[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.
- H\. Zhuang, Z\. Weng, H\. Wei, R\. Xie, K\. Toh, and Z\. Lin \(2022\)ACIL: analytic class\-incremental learning with absolute memorization and privacy protection\.Advances in Neural Information Processing Systems35,pp\. 11602–11614\.Cited by:[§1](https://arxiv.org/html/2606.11480#S1.p4.1),[§7\.2](https://arxiv.org/html/2606.11480#S7.SS2.p1.1)\.

## Appendix A Additional FedRAN Algorithms and Proofs

This appendix provides helper algorithms and proofs for the statements in Section [4](https://arxiv.org/html/2606.11480#S4). All results are deterministic and condition on the feature matrix and label matrix used by FedRAN. When pseudo-labeling is enabled, the label matrix is the augmented label matrix produced by pseudo-labeling.
### A.1. Helper Algorithms

This subsection details the three routines invoked by Algorithm [1](https://arxiv.org/html/2606.11480#alg1). Algorithm [2](https://arxiv.org/html/2606.11480#alg2) computes a client's rank-$r$ spectral summary: it factorizes $H_{k,t}$ through whichever of $H_{k,t}$ or $H_{k,t}^{\top}$ has the smaller leading dimension, so the thin SVD is always taken on the $\min\{\tilde{n}_{k,t}, M\}$ side, and returns the top-$r_{k,t}$ right singular vectors $V_{k,t}$ with their singular values $\sigma_{k,t}$. Algorithm [3](https://arxiv.org/html/2606.11480#alg3) is the server-side QR–SVD subspace merge used for both spatial (across clients) and temporal (across tasks) aggregation: given two summaries, it concatenates their scaled bases, orthonormalizes the result by QR, re-diagonalizes the small $R$ factor by SVD, and truncates back to rank $r$; on the first merge, when the previous summary is empty, the new summary is returned unchanged (up to rank truncation). Algorithm [4](https://arxiv.org/html/2606.11480#alg4) performs prototype-based pseudo-labeling: it forms $\ell_{2}$-normalized class prototypes from the labeled features, assigns each unlabeled feature to its nearest prototype by cosine similarity, and accepts the resulting pseudo-label only when the similarity is at least the confidence threshold $\tau$.
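The client summary and server merge described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `client_summary` and `merge_summaries`, the use of `None` for an empty summary, and the random test matrices are all hypothetical choices made for the example.

```python
import numpy as np


def client_summary(H, r):
    """Rank-limited spectral summary of local features H (n x M).

    Returns the top right singular vectors V (M x r_kt) and the
    matching singular values sigma (r_kt,)."""
    n, M = H.shape
    r_kt = min(r, n, M)
    if n <= M:
        # Thin SVD of H; the rows of Vt are the right singular vectors.
        _, sigma, Vt = np.linalg.svd(H, full_matrices=False)
        V = Vt.T
    else:
        # Thin SVD of H^T; its left singular vectors are the right
        # singular vectors of H.
        V, sigma, _ = np.linalg.svd(H.T, full_matrices=False)
    return V[:, :r_kt], sigma[:r_kt]


def merge_summaries(Va, sa, Vb, sb, r):
    """QR-SVD merge of two summaries, truncated back to rank r."""
    if Va is None or Va.shape[1] == 0:
        k = min(r, Vb.shape[1])
        return Vb[:, :k], sb[:k]
    # Concatenate scaled bases: [V_a diag(s_a), V_b diag(s_b)].
    A = np.hstack([Va * sa, Vb * sb])
    Q, R = np.linalg.qr(A)  # reduced QR
    U_R, sigma, _ = np.linalg.svd(R, full_matrices=False)
    k = min(r, sigma.shape[0])
    return (Q @ U_R)[:, :k], sigma[:k]


# Toy check (assumed data): with no truncation, the merged summary
# reproduces the sum of the two clients' Gram matrices.
rng = np.random.default_rng(0)
M = 16
H1 = rng.standard_normal((5, M))
H2 = rng.standard_normal((6, M))
V1, s1 = client_summary(H1, r=M)
V2, s2 = client_summary(H2, r=M)
V, s = merge_summaries(V1, s1, V2, s2, r=M)
G_exact = H1.T @ H1 + H2.T @ H2
G_merged = (V * s**2) @ V.T
```

When the target rank is large enough that nothing is discarded, `G_merged` matches `G_exact` up to floating-point error; with a smaller `r`, the merge keeps only the dominant directions, which is the source of the approximation that the bounds in this appendix address.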
Algorithm 2. Client-Side Truncated SVD Summary

1. Input: local random features $H_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times M}$, target rank $r$
2. Set $r_{k,t} \leftarrow \min\{r, \tilde{n}_{k,t}, M\}$
3. if $\tilde{n}_{k,t} \leq M$ then
4. Compute thin SVD $H_{k,t} = U_{k,t}\,\mathrm{diag}(\sigma_{k,t})\,V_{k,t}^{\top}$
5. else
6. Compute thin SVD $H_{k,t}^{\top} = V_{k,t}\,\mathrm{diag}(\sigma_{k,t})\,\widetilde{U}_{k,t}^{\top}$
7. Keep $V_{k,t}[:, 1{:}r_{k,t}] \in \mathbb{R}^{M \times r_{k,t}}$ and $\sigma_{k,t}[1{:}r_{k,t}] \in \mathbb{R}^{r_{k,t}}$
8. return $(V_{k,t}, \sigma_{k,t})$

Algorithm 3. Server-Side QR–SVD Subspace Merge

1. Input: previous summary $(V_a, \sigma_a)$, new summary $(V_b, \sigma_b)$, target rank $r$
2. if $V_a$ is empty then
3. return $(V_b[:, 1{:}\min\{r, \mathrm{cols}(V_b)\}],\; \sigma_b[1{:}\min\{r, \mathrm{len}(\sigma_b)\}])$
4. Form $A \leftarrow [V_a\,\mathrm{diag}(\sigma_a),\; V_b\,\mathrm{diag}(\sigma_b)]$
5. Compute QR factorization $A = QR$
6. Compute SVD $R = U_R\,\mathrm{diag}(\sigma)\,W_R^{\top}$
7. Keep top $r$ components: $V \leftarrow Q U_R[:, 1{:}r]$, $\sigma \leftarrow \sigma[1{:}r]$
8. return $(V, \sigma)$

Algorithm 4. Prototype-Based SSL Pseudo-Labeling

1. Input: labeled backbone features $Z^{\ell}_{k,t}$, labels $y^{\ell}_{k,t}$, unlabeled backbone features $Z^{u}_{k,t}$, confidence threshold $\tau$
2. for each class $c$ in $y^{\ell}_{k,t}$ do
3. $p_{k,t,c} \leftarrow \mathrm{mean}(Z^{\ell}_{k,t}[y^{\ell}_{k,t} = c])$
4. $p_{k,t,c} \leftarrow p_{k,t,c} / \|p_{k,t,c}\|_2$
5. Initialize accepted pseudo-labeled set $(Z^{p}_{k,t}, y^{p}_{k,t}) \leftarrow \emptyset$
6. for each unlabeled feature $u \in Z^{u}_{k,t}$ do
7. $\bar{u} \leftarrow u / \|u\|_2$
8. $c^{\star} \leftarrow \arg\max_{c} \bar{u}^{\top} p_{k,t,c}$
9. if $\bar{u}^{\top} p_{k,t,c^{\star}} \geq \tau$ then
10. Add $(u, c^{\star})$ to $(Z^{p}_{k,t}, y^{p}_{k,t})$
11. return $(\widetilde{Z}_{k,t}, \widetilde{y}_{k,t}) = (Z^{\ell}_{k,t} \cup Z^{p}_{k,t},\; y^{\ell}_{k,t} \cup y^{p}_{k,t})$

### A.2. Communication and Computation Cost

We summarize the per-task cost of FedRAN and compare it with gradient-based FCL and exact analytic FCL. The costs below exclude the one-time distribution of the frozen backbone and shared random projection. We assume that after each task the server returns a dense classifier $W_{1:t} \in \mathbb{R}^{M \times C_t}$ to each client for local inference. Thus, analytic methods have a downlink cost of $M C_t$ values per client per task. If inference is performed only on the server, this downlink can be omitted.

Let $K$ be the number of clients, $R_t$ the number of communication rounds for task $t$, $E$ the number of local epochs, $S_{\mathrm{model}}$ the number of trainable model parameters, $C_{\mathrm{bp}}$ the cost of one forward/backward pass per sample through the trainable model, $C_f$ the cost of one frozen-backbone forward pass per sample, $\tilde{n}_{k,t}$ the number of samples used by client $k$ at task $t$ after pseudo-labeling, $d$ the frozen-backbone feature dimension, $M$ the random-feature dimension, $r$ the retained rank, and $C_t$ the number of classes observed up to task $t$.

Communication. Table [7](https://arxiv.org/html/2606.11480#A1.T7) reports per-client per-task communication in transmitted floating-point values. Gradient-based FCL exchanges a trainable model in every communication round. Exact analytic FCL avoids iterative model exchange, but each client uploads the full local Gram matrix $G_{k,t} \in \mathbb{R}^{M \times M}$ and the label-feature statistic $B_{k,t} \in \mathbb{R}^{M \times C_t}$. FedRAN instead uploads a low-rank spectral summary $(V_{k,t}, \sigma_{k,t})$ and the exact label-feature statistic $B_{k,t}$, reducing the dominant uplink term from $M^{2}$ to $Mr + r$.

Table 7. Per-client per-task communication cost in transmitted floating-point values. The downlink assumes the server returns the dense classifier $W_{1:t} \in \mathbb{R}^{M \times C_t}$ to each client.

Computation. Table [8](https://arxiv.org/html/2606.11480#A1.T8) reports dominant computation costs. Client-side cost is per client per task, while server-side cost is per task. For gradient-based FCL, client computation is dominated by repeated local backpropagation over $R_t$ communication rounds and $E$ local epochs, while server computation is dominated by aggregating model parameters.
Exact analytic FCL avoids backpropagation but constructs the full local Gram matrix on the client and solves a full $M$-dimensional ridge system on the server. FedRAN avoids both full Gram construction and full Gram inversion. Each client computes random features, constructs $B_{k,t}$ by sparse class-wise accumulation, and computes a truncated SVD summary of $H_{k,t}$. The server performs QR-SVD merges over rank-$r$ summaries and solves the classifier in the retained subspace. For the client-side SVD, an exact thin SVD of $H_{k,t} \in \mathbb{R}^{\tilde{n}_{k,t} \times M}$ costs

$$O\!\left(\min\{\tilde{n}_{k,t}^{2} M,\; \tilde{n}_{k,t} M^{2}\}\right),$$

depending on whether the computation is performed through $H_{k,t}$ or $H_{k,t}^{\top}$. In the common regime $\tilde{n}_{k,t} \ll M$, this becomes $O(\tilde{n}_{k,t}^{2} M)$.

Table 8. Dominant computation cost. Client-side cost is per client per task; server-side cost is per task.

Discussion. The dominant uplink cost of exact analytic FCL is quadratic in $M$ because each client transmits $G_{k,t}$. FedRAN reduces this term to $Mr + r$ by transmitting the rank-$r$ spectral summary instead. On the computation side, exact analytic FCL constructs the full local Gram matrix at cost $O(\tilde{n}_{k,t} M^{2})$ and solves a full ridge system with $O(M^{3} + M^{2} C_t)$ server cost. FedRAN replaces these operations with local SVD summarization and server-side QR-SVD merging over rank-$r$ factors. Since $r \ll M$, this avoids full second-order inversion while retaining the dominant random-feature geometry used by the analytic classifier.
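
To put the uplink terms side by side, the following sketch counts transmitted floating-point values per client per task. It assumes the totals are simply the sums of the quantities named in the text ($G_{k,t}$ or $(V_{k,t}, \sigma_{k,t})$, plus $B_{k,t}$); the function names and the example sizes are hypothetical and are not taken from the paper's tables.

```python
def uplink_values(M, C_t, r=None):
    """Floating-point values one client uploads for one task.

    r=None models exact analytic FCL (full Gram matrix plus B);
    an integer r models a rank-r summary (V, sigma) plus B.
    """
    label_stat = M * C_t             # B in R^{M x C_t}
    if r is None:
        return M * M + label_stat    # full M x M Gram matrix
    return M * r + r + label_stat    # V in R^{M x r} and sigma in R^r

def downlink_values(M, C_t):
    """Values in the dense classifier W in R^{M x C_t} returned to a client."""
    return M * C_t

# Hypothetical sizes, chosen only for illustration.
exact = uplink_values(M=1000, C_t=10)
low_rank = uplink_values(M=1000, C_t=10, r=50)
```

For these illustrative sizes the exact upload is 1,010,000 values and the rank-50 upload is 60,050 values, so the label-feature statistic, not the spectral summary, becomes a visible share of the low-rank total.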
### A.3. Notation for the Proofs

Let $G^{\star}_{1:t}$ denote the exact Gram matrix accumulated over all clients and tasks up to $t$, and let $B_{1:t}$ denote the corresponding label-feature statistic:

$$G^{\star}_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} H_{k,\tau}^{\top} H_{k,\tau}, \qquad B_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} H_{k,\tau}^{\top} Y_{k,\tau}. \tag{36}$$

For a symmetric positive semidefinite matrix $S$, let $\mathcal{T}_{r}(S)$ denote its best rank-$r$ truncation obtained by keeping its $r$ largest eigenvalues and associated eigenvectors. Unless stated otherwise, singular values and eigenvalues are sorted in descending order and indexed from $1$. If an omitted singular value or eigenvalue does not exist, we set it to $0$ by convention. For each local feature matrix $H_{k,\tau}$, let

$$H_{k,\tau} = U_{k,\tau}\,\mathrm{diag}(s_{k,\tau,1}, s_{k,\tau,2}, \ldots)\,V_{k,\tau}^{\top} \tag{37}$$

denote its full SVD, with $s_{k,\tau,1} \geq s_{k,\tau,2} \geq \cdots \geq 0$. The vector $\sigma_{k,\tau}$ used in the main algorithm contains only the retained top singular values, while $s_{k,\tau,i}$ denotes the full spectrum used for analysis.

### A.4. Proof of Proposition [1](https://arxiv.org/html/2606.11480#Thmfedranprop1)

###### Proof.

Let $H_{1:t}$ be the row-wise concatenation of $H_{k,\tau}$ over all $k \in \{1, \ldots, K\}$ and $\tau \in \{1, \ldots, t\}$. Then

$$H_{1:t}^{\top} H_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} H_{k,\tau}^{\top} H_{k,\tau}, \tag{38}$$

because multiplying a row-wise concatenation by its transpose sums the per-block Gram matrices.
Similarly, if $Y_{1:t}$ is the corresponding row-wise concatenation of labels, then

$$H_{1:t}^{\top} Y_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} H_{k,\tau}^{\top} Y_{k,\tau}. \tag{39}$$

Thus, full spatial-temporal aggregation recovers exactly the centralized ridge statistics. Substituting these statistics into the closed-form ridge solution proves the final claim. ∎

### A.5. Proof of Proposition [2](https://arxiv.org/html/2606.11480#Thmfedranprop2)

###### Proof.

By construction,

$$A = [V_a\,\mathrm{diag}(\sigma_a),\; V_b\,\mathrm{diag}(\sigma_b)]. \tag{40}$$

Therefore,

$$AA^{\top} = V_a\,\mathrm{diag}(\sigma_a^{2})\,V_a^{\top} + V_b\,\mathrm{diag}(\sigma_b^{2})\,V_b^{\top}. \tag{41}$$

This proves that the covariance of the concatenated scaled basis equals the sum of the two input sketch covariances. Now let $A = QR$ be a QR factorization with $Q^{\top} Q = I$, and let

$$R = U_R\,\mathrm{diag}(\bar{\sigma})\,W_R^{\top} \tag{42}$$

be the SVD of $R$, with singular values $\bar{\sigma}_1 \geq \bar{\sigma}_2 \geq \cdots \geq 0$. Then

$$AA^{\top} = Q R R^{\top} Q^{\top} = Q U_R\,\mathrm{diag}(\bar{\sigma}^{2})\,U_R^{\top} Q^{\top}. \tag{43}$$

Thus, without rank truncation, taking $V = Q U_R$ and $\sigma = \bar{\sigma}$ gives

$$V\,\mathrm{diag}(\sigma^{2})\,V^{\top} = AA^{\top}. \tag{44}$$
If FedRAN retains only the top $r$ components, then

$$V = Q U_R^{(:,1:r)}, \qquad \sigma = \bar{\sigma}_{1:r}, \tag{45}$$

and

$$V\,\mathrm{diag}(\sigma^{2})\,V^{\top} = \mathcal{T}_{r}(AA^{\top}). \tag{46}$$

Since $AA^{\top}$ is symmetric positive semidefinite, the Eckart–Young–Mirsky theorem implies that $\mathcal{T}_{r}(AA^{\top})$ is the best rank-$r$ approximation to $AA^{\top}$ in any unitarily invariant norm, including spectral and Frobenius norms. Writing $\lambda_i(AA^{\top}) = \bar{\sigma}_i^{2}$ for the eigenvalues of $AA^{\top}$, which follows from Eq. (43), the residual satisfies

$$\|AA^{\top} - \mathcal{T}_{r}(AA^{\top})\|_2 = \lambda_{r+1}(AA^{\top}), \tag{47}$$

and

$$\|AA^{\top} - \mathcal{T}_{r}(AA^{\top})\|_F = \left(\sum_{i>r} \lambda_i(AA^{\top})^{2}\right)^{1/2}. \tag{48}$$

This proves the claim. ∎

### A.6. Proof of Theorem [1](https://arxiv.org/html/2606.11480#Thmfedranthm1)

###### Proof.

We track exactly where information is discarded. The first source is local truncation at each client. For client $k$ and task $\tau$, the exact local Gram matrix is

$$G_{k,\tau} = H_{k,\tau}^{\top} H_{k,\tau}. \tag{49}$$

Let $\widetilde{G}_{k,\tau}$ be the rank-$r_{k,\tau}$ local SVD approximation used by FedRAN:

$$\widetilde{G}_{k,\tau} = V_{k,\tau}\,\mathrm{diag}(\sigma_{k,\tau}^{2})\,V_{k,\tau}^{\top}. \tag{50}$$

Because $\widetilde{G}_{k,\tau}$ keeps the top $r_{k,\tau}$ eigen-directions of $G_{k,\tau}$, the local residual

$$R^{\mathrm{loc}}_{k,\tau} = G_{k,\tau} - \widetilde{G}_{k,\tau} \tag{51}$$

is positive semidefinite.
Its spectral, Frobenius, and trace errors are

$$\|R^{\mathrm{loc}}_{k,\tau}\|_2 = s_{k,\tau,r_{k,\tau}+1}^{2}, \tag{52}$$

$$\|R^{\mathrm{loc}}_{k,\tau}\|_F = \left(\sum_{i>r_{k,\tau}} s_{k,\tau,i}^{4}\right)^{1/2}, \qquad \mathrm{tr}(R^{\mathrm{loc}}_{k,\tau}) = \sum_{i>r_{k,\tau}} s_{k,\tau,i}^{2}. \tag{53}$$

The index $r_{k,\tau}+1$ denotes the first omitted singular value of the full matrix $H_{k,\tau}$, not an entry of the retained vector $\sigma_{k,\tau}$.

The second source of approximation is server-side merge truncation. Let $\mathcal{M}_t$ denote the set of all spatial and temporal QR-SVD merges performed up to task $t$. For each merge $j \in \mathcal{M}_t$, let $A_j$ be the concatenated scaled basis used in Eq. ([16](https://arxiv.org/html/2606.11480#S4.E16)), and define

$$S_j = A_j A_j^{\top}. \tag{54}$$

Before truncation, $S_j$ is exactly the sum of the two input sketch covariances by Proposition [2](https://arxiv.org/html/2606.11480#Thmfedranprop2). FedRAN replaces $S_j$ with $\mathcal{T}_{r}(S_j)$, introducing the residual

$$R^{\mathrm{merge}}_{j} = S_j - \mathcal{T}_{r}(S_j). \tag{55}$$

If $\lambda_{j,1} \geq \lambda_{j,2} \geq \cdots \geq 0$ are the eigenvalues of $S_j$, then

$$\|R^{\mathrm{merge}}_{j}\|_2 = \lambda_{j,r+1}, \tag{56}$$

$$\|R^{\mathrm{merge}}_{j}\|_F = \left(\sum_{i>r} \lambda_{j,i}^{2}\right)^{1/2}, \qquad \mathrm{tr}(R^{\mathrm{merge}}_{j}) = \sum_{i>r} \lambda_{j,i}. \tag{57}$$

We now relate these discarded terms to the final FedRAN sketch.
Before any server-side merge, replacing each exact local Gram matrix by its local sketch discards $\sum_{\tau,k} R^{\mathrm{loc}}_{k,\tau}$. During each QR-SVD merge, replacing $S_j$ by $\mathcal{T}_{r}(S_j)$ discards $R^{\mathrm{merge}}_{j}$. Since all these residuals lie in the same ambient $M$-dimensional random-feature space, the final error decomposes as

$$G^{\star}_{1:t} - \widetilde{G}_{1:t} = \sum_{\tau=1}^{t} \sum_{k=1}^{K} R^{\mathrm{loc}}_{k,\tau} + \sum_{j \in \mathcal{M}_t} R^{\mathrm{merge}}_{j}. \tag{58}$$

Taking spectral norms, applying the triangle inequality, and using the residual identities above gives

$$\|G^{\star}_{1:t} - \widetilde{G}_{1:t}\|_2 \leq \sum_{\tau=1}^{t} \sum_{k=1}^{K} s_{k,\tau,r_{k,\tau}+1}^{2} + \sum_{j \in \mathcal{M}_t} \lambda_{j,r+1}. \tag{59}$$

This proves Eq. ([24](https://arxiv.org/html/2606.11480#S4.E24)). ∎

### A.7. An Interpretable Relative-Tail Bound

Theorem [1](https://arxiv.org/html/2606.11480#Thmfedranthm1) gives an instance-dependent bound in terms of the exact discarded singular values and eigenvalues. We can further express this bound through aggregate feature energy under a standard relative-tail assumption.

Assumption. Suppose there exist $\eta_{\mathrm{loc}}, \eta_{\mathrm{merge}} \in [0,1]$ such that every local truncation discards at most an $\eta_{\mathrm{loc}}$ fraction of local feature energy,

$$\sum_{i>r_{k,\tau}} s_{k,\tau,i}^{2} \leq \eta_{\mathrm{loc}} \|H_{k,\tau}\|_F^{2}, \qquad \forall k, \tau, \tag{60}$$

and every server-side merge truncation discards at most an $\eta_{\mathrm{merge}}$ fraction of the trace energy being merged,

$$\sum_{i>r} \lambda_{j,i} \leq \eta_{\mathrm{merge}}\,\mathrm{tr}(S_j), \qquad \forall j \in \mathcal{M}_t. \tag{61}$$

Let

$$E_t = \sum_{\tau=1}^{t} \sum_{k=1}^{K} \|H_{k,\tau}\|_F^{2} \tag{62}$$

be the total random-feature energy observed up to task $t$. Let $L_t$ be the maximum number of rank-truncating server merges in which any local client summary can participate before task $t$.

###### Corollary 2 (Relative-tail Gram bound).

Under the relative-tail assumption,

$$\|G^{\star}_{1:t} - \widetilde{G}_{1:t}\|_2 \leq \mathrm{tr}(G^{\star}_{1:t} - \widetilde{G}_{1:t}) \leq \left(\eta_{\mathrm{loc}} + \eta_{\mathrm{merge}} L_t\right) E_t. \tag{63}$$

If additionally $\|h(x)\|_2^{2} \leq R^{2}$ for every random feature vector and $N_t = \sum_{\tau=1}^{t} \sum_{k=1}^{K} \tilde{n}_{k,\tau}$, then

$$\|G^{\star}_{1:t} - \widetilde{G}_{1:t}\|_2 \leq \left(\eta_{\mathrm{loc}} + \eta_{\mathrm{merge}} L_t\right) R^{2} N_t. \tag{64}$$

###### Proof.

From Eq. ([58](https://arxiv.org/html/2606.11480#A1.E58)), the error is a sum of positive semidefinite residuals.
Hence its spectral norm is at most its trace. The local residual trace is $\sum_{i>r_{k,\tau}} s_{k,\tau,i}^{2}$, which is bounded by Eq. ([60](https://arxiv.org/html/2606.11480#A1.E60)). The merge residual trace is $\sum_{i>r} \lambda_{j,i}$, which is bounded by Eq. ([61](https://arxiv.org/html/2606.11480#A1.E61)). Since each local summary can participate in at most $L_t$ rank-truncating server merges, the total trace energy exposed to merge truncation is at most $L_t E_t$. Therefore,

$$\mathrm{tr}(G^{\star}_{1:t} - \widetilde{G}_{1:t}) \leq \eta_{\mathrm{loc}} E_t + \eta_{\mathrm{merge}} L_t E_t. \tag{65}$$

The final bound follows from $E_t = \sum_{\tau,k} \|H_{k,\tau}\|_F^{2} \leq R^{2} N_t$ when every random feature vector has squared norm at most $R^{2}$. ∎

### A.8. Proof of Theorem [2](https://arxiv.org/html/2606.11480#Thmfedranthm2)

###### Proof.

We suppress the subscript $1{:}t$ for readability. Let

$$W^{\star} = (G^{\star} + \lambda I_M)^{-1} B \tag{66}$$

be the full ridge solution. Let

$$\widetilde{G} = V \Lambda V^{\top}, \qquad \Lambda = \mathrm{diag}(\sigma^{2}), \tag{67}$$

be the FedRAN Gram sketch. Define the full ridge solution associated with the sketched Gram matrix:

$$W_{\widetilde{G}} = (\widetilde{G} + \lambda I_M)^{-1} B. \tag{68}$$

Then, with $W$ denoting the FedRAN classifier, the triangle inequality gives

$$\|W^{\star} - W\|_F \leq \|W^{\star} - W_{\widetilde{G}}\|_F + \|W_{\widetilde{G}} - W\|_F. \tag{69}$$
For the first term, we use the standard inverse-difference identity

$$(G^{\star} + \lambda I_M)^{-1} - (\widetilde{G} + \lambda I_M)^{-1} = (G^{\star} + \lambda I_M)^{-1} (\widetilde{G} - G^{\star}) (\widetilde{G} + \lambda I_M)^{-1}. \tag{70}$$

Since $G^{\star}$ and $\widetilde{G}$ are positive semidefinite, both inverse factors have spectral norm at most $1/\lambda$. Therefore,

$$\|W^{\star} - W_{\widetilde{G}}\|_F \leq \frac{\|G^{\star} - \widetilde{G}\|_2}{\lambda^{2}} \|B\|_F \leq \frac{\varepsilon_G(t)}{\lambda^{2}} \|B\|_F. \tag{71}$$

For the second term, because $V^{\top} V = I$,

$$\widetilde{G} + \lambda I_M = V (\Lambda + \lambda I_r) V^{\top} + \lambda (I_M - V V^{\top}). \tag{72}$$

Hence,

$$(\widetilde{G} + \lambda I_M)^{-1} = V (\Lambda + \lambda I_r)^{-1} V^{\top} + \frac{1}{\lambda} (I_M - V V^{\top}). \tag{73}$$

FedRAN uses the subspace-constrained component

$$W = V (\Lambda + \lambda I_r)^{-1} V^{\top} B. \tag{74}$$

Substituting Eq. ([73](https://arxiv.org/html/2606.11480#A1.E73)) gives

$$W_{\widetilde{G}} - W = \frac{1}{\lambda} (I_M - V V^{\top}) B, \tag{75}$$

and therefore

$$\|W_{\widetilde{G}} - W\|_F = \frac{1}{\lambda} \|(I_M - V V^{\top}) B\|_F. \tag{76}$$

Combining Eq. ([69](https://arxiv.org/html/2606.11480#A1.E69)), Eq. ([71](https://arxiv.org/html/2606.11480#A1.E71)), and Eq. ([76](https://arxiv.org/html/2606.11480#A1.E76)) proves the theorem. ∎

### A.9. Proof of Corollary [1](https://arxiv.org/html/2606.11480#Thmfedrancor1)

###### Proof.
Let $\Delta W = W^{\star}_{1:t} - W_{1:t}$. Then

$$\|s^{\star} - s\|_2 = \|h^{\top} \Delta W\|_2 \leq \|h\|_2 \|\Delta W\|_F \leq \|h\|_2\,\varepsilon_W(t), \tag{77}$$

which proves the score perturbation bound. Let $y = \arg\max_{c} s^{\star}_{c}$ and let $\delta = \|h\|_2\,\varepsilon_W(t)$. The score perturbation bound implies $|s_c - s^{\star}_c| \leq \delta$ for every class $c$. Thus,

$$s_y \geq s^{\star}_y - \delta, \qquad s_c \leq s^{\star}_c + \delta \quad \text{for all } c \neq y. \tag{78}$$

If

$$s^{\star}_y - \max_{c \neq y} s^{\star}_c > 2\delta, \tag{79}$$

then $s_y > s_c$ for all $c \neq y$, so FedRAN and the full ridge classifier predict the same class. ∎

### A.10. Remarks on the Bounds

Local versus merge approximation. Theorem [1](https://arxiv.org/html/2606.11480#Thmfedranthm1) separates the approximation error into two interpretable sources. Local errors arise because clients retain only the top singular directions of $H_{k,t}$. Merge errors arise because the server repeatedly truncates merged spectral summaries back to rank $r$. Increasing $r$ decreases both terms but increases communication and server-side computation.

Subspace residual. The second term in Theorem [2](https://arxiv.org/html/2606.11480#Thmfedranthm2), $\|(I - V V^{\top}) B\|_F / \lambda$, appears because FedRAN intentionally solves the classifier in the retained subspace. If label-feature statistics are mostly contained in this subspace, FedRAN closely matches the full ridge solution. This term also clarifies the role of the rank hyperparameter: a larger retained subspace can reduce the residual at the cost of larger communication.
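
The subspace residual can be checked numerically. The sketch below, under the assumption that $V$ has orthonormal columns, builds the subspace-constrained classifier of Eq. (74) and the full ridge solution for the sketched Gram matrix of Eq. (68), so that their gap can be compared with $\|(I - VV^{\top})B\|_F / \lambda$. The function names are illustrative only.

```python
import numpy as np

def subspace_ridge(V, sigma, B, lam):
    """W = V (Lambda + lam I_r)^{-1} V^T B with Lambda = diag(sigma^2)."""
    # Dividing row i of V^T B by (sigma_i^2 + lam) applies the diagonal inverse.
    return V @ ((V.T @ B) / (sigma**2 + lam)[:, None])

def sketched_full_ridge(V, sigma, B, lam):
    """(G_tilde + lam I_M)^{-1} B for G_tilde = V diag(sigma^2) V^T."""
    M = V.shape[0]
    G_tilde = (V * sigma**2) @ V.T
    return np.linalg.solve(G_tilde + lam * np.eye(M), B)

def subspace_residual(V, B, lam):
    """(1 / lam) * ||(I - V V^T) B||_F; norm of a 2-D array is Frobenius."""
    return np.linalg.norm(B - V @ (V.T @ B)) / lam
```

Only the $r \times r$ diagonal system is inverted in `subspace_ridge`; the $M \times M$ solve in `sketched_full_ridge` exists here purely as a reference for comparison.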

Pseudo-labeling. The theory above conditions on the label matrix used in the analytic update. If pseudo-labeling introduces label noise, then $B$ changes because it is computed from the augmented labels. A separate label-noise analysis can be layered on top of these deterministic bounds by controlling $\|B_{\mathrm{pseudo}} - B_{\mathrm{true}}\|_F$, but this is orthogonal to the spectral aggregation analysis.
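
For reference, the pseudo-labeling routine whose output these bounds condition on (Algorithm 4) can be sketched in NumPy as follows. This is a vectorized illustration, not the authors' code: the function name `pseudo_label` is hypothetical, and the sketch assumes every feature vector and every class mean has nonzero norm.

```python
import numpy as np

def pseudo_label(Z_lab, y_lab, Z_unl, tau):
    """Return labeled data augmented with confident pseudo-labeled samples.

    Assumes nonzero feature norms and nonzero class means (sketch only).
    """
    classes = np.unique(y_lab)
    # One l2-normalized prototype per class, from the labeled features.
    P = np.stack([Z_lab[y_lab == c].mean(axis=0) for c in classes])
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    # Cosine similarity of each unlabeled feature to each prototype.
    U = Z_unl / np.linalg.norm(Z_unl, axis=1, keepdims=True)
    sims = U @ P.T
    best = sims.argmax(axis=1)
    # Accept a pseudo-label only if its similarity is at least tau.
    keep = sims[np.arange(len(U)), best] >= tau
    Z_aug = np.vstack([Z_lab, Z_unl[keep]])
    y_aug = np.concatenate([y_lab, classes[best[keep]]])
    return Z_aug, y_aug
```

Note that the accepted samples keep their original, unnormalized features; normalization is used only to compute the similarity, as in steps 7 to 10 of the algorithm.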