Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation

arXiv cs.LG Papers

Summary

Proposes Federated Nested Learning (FedNL), a framework that reformulates federated learning as a three-level nested optimization system, enabling collaborative training of self-referential memories for test-time adaptation to handle Non-IID data and long-tail distributions.

arXiv:2605.16350v1 Announce Type: new

Cached at: 05/19/26, 06:41 AM

# Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation
Source: [https://arxiv.org/html/2605.16350](https://arxiv.org/html/2605.16350)
Hong Chen, HKUST (GZ), hchen763@connect.hkust-gz.edu.cn; Pengcheng Wu, Nanyang Technological University, pengchengwu@ntu.edu.sg; Yuanguo Lin, Jimei University, xdlyg@jmu.edu.cn; Peilin Zhao, Shanghai Jiao Tong University, peilinzhao@sjtu.edu.cn; Xiuze Zhou, HKUST (GZ), xzhou154@connect.hkust-gz.edu.cn; Fan Lin, Xiamen University, iamafan@xmu.edu.cn; Han Yu, Nanyang Technological University, han.yu@ntu.edu.sg

###### Abstract

We rethink Federated Learning \(FL\) from a nested learning perspective, framing the core challenge as how to collaboratively learn optimization rules, not just static models, to tackle Non\-IID client data\. To address this, we propose Federated Nested Learning \(FedNL\), a novel framework that reformulates FL as a three\-level nested optimization system\. FedNL embeds Titans\-based linear attention into FL, enabling clients to perform lightweight, zero\-shot test\-time adaptation by treating a delta rule as an online gradient step\. Experiments on Non\-IID MMLU and long\-context benchmarks show that FedNL achieves competitive performance in short\-context reasoning, enhances the performance of long\-context retrieval and streaming Cross\-Entropy, and maintains constant inference memory\.

## 1 Introduction

Federated Learning (FL) has emerged as a privacy-preserving paradigm for collaboratively training large language models (LLMs) across distributed edge devices (Kuang et al., [2024](https://arxiv.org/html/2605.16350#bib.bib1); Ye et al., [2024](https://arxiv.org/html/2605.16350#bib.bib2)). By keeping raw data local and aggregating model updates, FL promises to harness the collective intelligence of massive, decentralized datasets. However, the real-world deployment of federated LLMs faces two persistent and intertwined challenges: data heterogeneity (Non-IID) and long-tail distributions. In realistic scenarios, client data distributions are highly skewed (e.g., a medical client vs. a coding assistant), and critical knowledge often resides in the long tail of these distributions, where it is easily overshadowed by head classes during global aggregation (Shuai et al., [2022](https://arxiv.org/html/2605.16350#bib.bib4)).

To address the above challenges, existing approaches primarily focus on regularizing or personalizing the static model weights. For instance, FedProx (Li et al., [2020](https://arxiv.org/html/2605.16350#bib.bib5)) introduces proximal terms to restrict local deviation, while recent state-of-the-art methods like FedSSI (Li et al., [2025](https://arxiv.org/html/2605.16350#bib.bib6)) employ synaptic intelligence to selectively preserve important parameters, effectively mitigating catastrophic forgetting. Despite their success, these methods share a fundamental limitation: they treat the global model as a container of static knowledge. When such a static model is deployed to a client with unseen, highly heterogeneous data, it lacks test-time plasticity, that is, the ability to adapt to the current context without gradient updates. Consequently, static weights often fail to capture the nuances of long-tail distributions that are context-dependent, leading to suboptimal performance on domain-specific tasks (Wang et al., [2023](https://arxiv.org/html/2605.16350#bib.bib7)).

In this paper, we argue that solving the Non-IID and long-tail dilemma requires a paradigm shift from aggregating static knowledge to aggregating learning capabilities. Drawing inspiration from the emerging theory of Nested Learning (NL) (Behrouz et al., [2025a](https://arxiv.org/html/2605.16350#bib.bib8)), we posit that learning should not be dichotomized into “training” and “inference”, but rather viewed as a hierarchy of nested optimization processes operating at different frequencies. From this perspective, the “inference” phase of a sequence model can be reframed as a high-frequency “inner-loop training” process, where the model actively compresses the current context into a transient memory state. If a model possesses a powerful mechanism to construct this memory at test time, it can dynamically adapt to heterogeneous local distributions without altering its global weights.

Building on this insight, we propose Federated Nested Learning (FedNL), a novel framework that reformulates FL as a three-level nested optimization system. Instead of aggregating the memory content itself (which is private and heterogeneous), FedNL aggregates the meta-rules governing how memory is constructed and updated. Specifically, we leverage the Titans architecture (Behrouz et al., [2025b](https://arxiv.org/html/2605.16350#bib.bib9)), which utilizes a linearized attention mechanism equipped with a Delta Rule. In our framework, the server aggregates the projection matrices and gating coefficients (Level 0), while clients utilize these global rules to instantiate private, context-aware memory states $S_t$ during local inference (Level 2). This self-referential mechanism allows the model to “learn to memorize” the specific patterns of local long-tail data on-the-fly, effectively bypassing the limitations of static weight aggregation.

Our approach offers a parameter-efficient way to address aspects of traditional FL heterogeneity. By decoupling general linguistic capabilities (frozen backbone) from memory construction rules (trainable adapters), FedNL achieves superior adaptability with minimal communication overhead. We implement FedNL using the computationally efficient LiZAttention module (Furfaro, [2025](https://arxiv.org/html/2605.16350#bib.bib10)) and validate it on diverse benchmarks. Our main contributions are summarized as follows.

- Theoretical Reframing: We introduce the Nested Learning perspective to FL, formalizing the problem as a collaborative training of optimization rules rather than static representations. This provides a theoretical basis for addressing Non-IID issues via test-time adaptation.
- The FedNL Framework: We propose a practical algorithm that integrates Titans-based linear attention into the FL pipeline. By treating the Delta Rule as an online gradient descent step, we enable clients to perform Zero-Shot Test-Time Adaptation on unseen domains without computational heavy lifting.
- Empirical Superiority: Experiments on Non-IID MMLU and long-context benchmarks show that FedNL is competitive with strong federated baselines on short-context reasoning and obtains larger gains on long-context retrieval and streaming Cross-Entropy (CE) diagnostics. Notably, a 16K Needle In A Haystack (NIAH) streaming CE probe shows that FedNL continues to reduce normalized loss as context unfolds while FedAvg accumulates uncertainty, and does so while maintaining constant inference memory complexity.

## 2 Methodology

### 2.1 Preliminaries

In this section, we formalize the problem of FL with Test\-Time Adaptation constraints\. We then review the formulation of Linear Attention mechanisms \(specifically Titans\) as associative memory optimization\. Finally, drawing on NL theory, we formally define our proposed federated nested optimization framework\.

Federated Learning with Test\-Time Adaptation\.

Consider a federated learning system with $K$ clients, where each client $k$ holds a private dataset $\mathcal{D}_k=\{(x^{(i)},y^{(i)})\}_{i=1}^{N_k}$ drawn from a local distribution $\mathcal{P}_k$. The standard goal of FL is to find a global parameter vector $\theta^{*}$ that minimizes the weighted empirical risk over all clients:

$$\theta^{*}=\arg\min_{\theta}\sum_{k=1}^{K}\frac{N_k}{N}\mathcal{L}_k(\theta;\mathcal{D}_k), \tag{1}$$

where $\mathcal{L}_k$ is the local loss function (e.g., Cross-Entropy) and $N=\sum_{k} N_k$.

The Challenge of Static Weights. In traditional settings, once $\theta^{*}$ is deployed to a client $k$ for inference (test-time), the parameters remain fixed. Let $x_{1:T}=(x_1,\dots,x_T)$ be a test sequence on client $k$. A static model computes predictions $p(x_{t+1}\mid x_{1:t};\theta^{*})$. If the test distribution $\mathcal{P}_{test}$ significantly shifts from the training distributions (i.e., extreme Non-IID or Long-Tail scenarios), the static $\theta^{*}$ struggles to adapt.

Test-Time Adaptation (TTA). To address this, we consider a setting where the model maintains a dynamic state $S_t$ during inference. The prediction becomes $p(x_{t+1}\mid x_{1:t};S_t,\theta)$, where $S_t$ is updated online based on the context $x_{1:t}$. Our goal in FedNL is to learn the optimal update rules (encoded in $\theta$) such that $S_t$ rapidly converges to a representation that minimizes local prediction error at test time, without requiring gradient updates to $\theta$ itself.

Neural Memory as Online Optimization\.

We leverage the Titans architecture (Behrouz et al., [2025b](https://arxiv.org/html/2605.16350#bib.bib9)), which treats the attention mechanism as a Neural Memory module. Unlike standard Softmax attention, which requires storing the full history buffer, Titans compresses history into a fixed-size memory state $\mathbf{S}\in\mathbb{R}^{d\times d}$.

From the NL perspective (Behrouz et al., [2025a](https://arxiv.org/html/2605.16350#bib.bib8)), the update of this memory state is not merely a heuristic recurrence, but an online optimization step. Specifically, let $\mathbf{k}_t,\mathbf{v}_t\in\mathbb{R}^{d}$ be the key and value vectors projected from input $x_t$ using parameters $\theta$. The memory state $\mathbf{S}_t$ is updated to map keys to values by minimizing a momentary associative memory objective:

$$\mathbf{S}_t=\arg\min_{\mathbf{S}}\left(\frac{1}{2}\|\mathbf{S}\mathbf{k}_t-\mathbf{v}_t\|^{2}+\frac{1}{2\eta}\|\mathbf{S}-\mathbf{S}_{t-1}\|^{2}\right). \tag{2}$$

Solving Eq. ([2](https://arxiv.org/html/2605.16350#S2.E2)) via one step of Gradient Descent, started from $\mathbf{S}_{t-1}$, yields the Delta Rule update:

$$\mathbf{S}_t=\mathbf{S}_{t-1}-\eta\nabla_{\mathbf{S}}\mathcal{L}_{mem}=\mathbf{S}_{t-1}+\eta(\mathbf{v}_t-\mathbf{S}_{t-1}\mathbf{k}_t)\mathbf{k}_t^{\top}, \tag{3}$$

where $\mathcal{L}_{mem}(\mathbf{S})=\frac{1}{2}\|\mathbf{S}\mathbf{k}_t-\mathbf{v}_t\|^{2}$ is the associative term of Eq. ([2](https://arxiv.org/html/2605.16350#S2.E2)) with its gradient evaluated at $\mathbf{S}_{t-1}$, and $\eta$ is a learnable step size (or gating factor) derived from $\theta$. This formulation reveals that inference is effectively a high-frequency training process, where the model “learns” the current context by optimizing $\mathbf{S}_t$.
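As a minimal sketch of Eq. 3 (not the authors' implementation), the snippet below applies one Delta Rule step in NumPy and checks that the recall error for the written key shrinks. The function names, the dimension `d = 8`, the unit-norm key, and the fixed scalar `eta` are all illustrative assumptions. For comparison it also includes the exact minimizer of Eq. 2, which follows from the Sherman-Morrison identity and differs from Eq. 3 only through an effective step size `eta / (1 + eta * ||k||^2)`.

```python
import numpy as np

def delta_rule_step(S_prev, k, v, eta):
    """One gradient step on 0.5 * ||S k - v||^2, taken at S_prev (Eq. 3)."""
    error = v - S_prev @ k                 # what the memory currently gets wrong for key k
    return S_prev + eta * np.outer(error, k)

def proximal_step(S_prev, k, v, eta):
    """Exact minimizer of Eq. 2 (closed form via Sherman-Morrison)."""
    eta_eff = eta / (1.0 + eta * (k @ k))  # effective step size of the exact solution
    return S_prev + eta_eff * np.outer(v - S_prev @ k, k)

# Hypothetical toy setup: random memory, unit-norm key, random value.
rng = np.random.default_rng(0)
d = 8
S0 = rng.normal(size=(d, d))
k = rng.normal(size=d)
k /= np.linalg.norm(k)
v = rng.normal(size=d)
eta = 0.5

S1 = delta_rule_step(S0, k, v, eta)
err_before = np.linalg.norm(S0 @ k - v)
err_after = np.linalg.norm(S1 @ k - v)
print(f"recall error before: {err_before:.4f}, after: {err_after:.4f}")
```

With a unit-norm key, the recall error for that key contracts by exactly the factor `1 - eta` per step, which is one way to see why `eta` behaves like a write-strength gate.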

The Federated Nested Optimization Framework\.

Building on the concepts above, we formalize FedNL as a three-level nested optimization problem. This framework decouples the global learning of rules from the local construction of memory. We define the system tuple $\mathcal{N}=\{(L_0,\mathcal{T}_0),(L_1,\mathcal{T}_1),(L_2,\mathcal{T}_2)\}$, representing the three levels of optimization loops:

(1) The Inner Loop: Test-Time Adaptation (Client-Side). This loop runs during inference on the client. It is “unsupervised” in the sense that it does not require ground-truth labels $y$; instead, it is self-supervised by the memory objective. For a given client $k$ and input stream $x_{1:T}$, the memory state trajectory $\mathcal{S}_k=(\mathbf{S}_0,\dots,\mathbf{S}_T)$ is generated by recursively solving the inner objective defined in Eq. ([2](https://arxiv.org/html/2605.16350#S2.E2)):

$$\mathbf{S}_t(\theta)=\text{OnlineOptimizer}(\mathbf{S}_{t-1};\theta,x_t), \tag{4}$$

where OnlineOptimizer corresponds to the Delta Rule update. Note that $\mathbf{S}_t$ is strictly a function of the local context and the parameters $\theta$. This state is transient and private, never leaving the device.
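The recursion in Eq. 4 can be sketched as a loop over a token stream. In this hedged example, `theta` is reduced to two hypothetical projection matrices `W_k` and `W_v` plus a scalar `eta`, and keys are normalized to unit length; none of these choices are taken from the paper. The point it illustrates is that the state stays `d x d` no matter how long the stream is.

```python
import numpy as np

def project(x, W_k, W_v):
    """Map a token embedding to a (key, value) pair; key normalization is an assumption."""
    k = W_k @ x
    k = k / (np.linalg.norm(k) + 1e-8)
    return k, W_v @ x

def run_inner_loop(tokens, W_k, W_v, eta):
    """Roll the Delta Rule (Eq. 3) over a stream, starting from S_0 = 0."""
    d = W_k.shape[0]
    S = np.zeros((d, d))
    for x in tokens:
        k, v = project(x, W_k, W_v)
        S = S + eta * np.outer(v - S @ k, k)   # OnlineOptimizer step of Eq. 4
    return S

# Hypothetical sizes: 16-dim token embeddings, 8-dim memory.
rng = np.random.default_rng(0)
d_model, d = 16, 8
W_k = rng.normal(size=(d, d_model))
W_v = rng.normal(size=(d, d_model))

S_short = run_inner_loop(rng.normal(size=(10, d_model)), W_k, W_v, eta=0.5)
S_long = run_inner_loop(rng.normal(size=(1000, d_model)), W_k, W_v, eta=0.5)
print(S_short.shape, S_long.shape)   # same state size for 10 and 1000 tokens
```

A softmax-attention cache would instead grow linearly with the number of tokens, which is the contrast the text draws when it refers to constant inference memory.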

(2) The Intermediate Loop: Rule Learning (Client-Side). This loop runs during the local training phase. The client optimizes the parameters $\theta$ (e.g., LoRA weights, gating mechanisms) to ensure that the Inner Loop produces a memory state $\mathbf{S}_t$ that is useful for the downstream task (e.g., next-token prediction). The local objective for client $k$ is:

$$\min_{\theta}\mathcal{J}_k(\theta)=\mathbb{E}_{(x,y)\sim\mathcal{D}_k}\left[\sum_{t}\mathcal{L}_{\text{task}}(f(x_t,\mathbf{S}_{t-1}(\theta)),y_t)\right], \tag{5}$$

where $f$ is the prediction head. Crucially, calculating the gradient $\nabla_{\theta}\mathcal{J}_k$ requires differentiating through the Inner Loop process (Eq. [4](https://arxiv.org/html/2605.16350#S2.E4)), a technique known as Backpropagation Through Time (BPTT) in RNNs or Meta-Gradients in meta-learning. This ensures $\theta$ learns how to construct memory for the specific data distribution of client $k$.
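To make "differentiating through the Inner Loop" concrete, here is a toy sketch in which the only meta-parameter is the scalar step size `eta`, the head is the linear readout `f(q, S) = S q`, and the task loss is squared error. All three are simplifying assumptions, not the paper's setup. Rather than reverse-mode BPTT, the sketch carries the sensitivity `dS/d eta` forward alongside `S`, which yields the same gradient, and compares it with a finite-difference estimate.

```python
import numpy as np

def unrolled_loss(eta, K, V, Q, Y):
    """Toy version of Eq. 5: predict with S_{t-1}, then write (k_t, v_t) via Eq. 3."""
    d = K.shape[1]
    S = np.zeros((d, d))
    loss = 0.0
    for k, v, q, y in zip(K, V, Q, Y):
        loss += 0.5 * np.sum((S @ q - y) ** 2)
        S = S + eta * np.outer(v - S @ k, k)
    return loss

def meta_gradient(eta, K, V, Q, Y):
    """d(loss)/d(eta), propagating dS/d(eta) through every memory update."""
    d = K.shape[1]
    S = np.zeros((d, d))
    dS = np.zeros((d, d))                       # sensitivity of S_{t-1} to eta
    grad = 0.0
    for k, v, q, y in zip(K, V, Q, Y):
        grad += (S @ q - y) @ (dS @ q)          # chain rule through the readout
        err = v - S @ k
        dS = dS + np.outer(err, k) - eta * np.outer(dS @ k, k)
        S = S + eta * np.outer(err, k)
    return grad

# Hypothetical toy data: 6 steps, 4-dim memory, unit-norm keys.
rng = np.random.default_rng(0)
T, d = 6, 4
K = rng.normal(size=(T, d))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V, Q, Y = (rng.normal(size=(T, d)) for _ in range(3))
eta = 0.3

h = 1e-5
fd = (unrolled_loss(eta + h, K, V, Q, Y) - unrolled_loss(eta - h, K, V, Q, Y)) / (2 * h)
print(meta_gradient(eta, K, V, Q, Y), fd)
```

In practice an automatic differentiation framework would handle this for all of `theta` at once; the hand-written recursion is only meant to show where the dependence on earlier memory states enters.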

(3) The Outer Loop: Collaborative Generalization (Server-Side). This loop runs on the server to aggregate the locally learned rules. Since $\theta$ represents the “physics” of memory construction rather than the memory itself, aggregating $\theta$ allows diverse clients to share learning capabilities. The global objective is:

$$\min_{\theta}\mathcal{G}(\theta)=\sum_{k=1}^{K}\frac{N_k}{N}\mathcal{J}_k(\theta). \tag{6}$$

The server performs the update $\theta^{r+1}\leftarrow\text{Aggregate}(\{\theta_k^{r+1}\}_k)$, typically via weighted averaging (FedAvg).
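The server-side step can be sketched as a size-weighted average over named parameter arrays. The dictionary layout and the parameter names `lora_B` and `gate` are hypothetical, and the weights are normalized over the clients passed in, which is an assumption about how `N` is defined when only a subset participates.

```python
import numpy as np

def aggregate(client_params, client_sizes):
    """FedAvg-style weighted average of client meta-parameters with weights N_k / N."""
    total = float(sum(client_sizes))   # N, taken over the participating clients
    names = client_params[0].keys()
    return {
        name: sum((n / total) * params[name] for params, n in zip(client_params, client_sizes))
        for name in names
    }

# Two hypothetical clients holding 1 and 3 samples.
clients = [
    {"lora_B": np.zeros((2, 2)), "gate": np.array(0.0)},
    {"lora_B": np.full((2, 2), 4.0), "gate": np.array(1.0)},
]
theta_global = aggregate(clients, client_sizes=[1, 3])
print(theta_global["lora_B"], theta_global["gate"])
```

Only these adapter-sized arrays cross the network in this sketch; the memory states produced during inference never appear in `client_params`.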

Unified View. By nesting these loops, FedNL effectively trains a distributed optimizer. The global model $\theta$ is not a static knowledge base, but a meta-learner. When deployed to a new client with a Non-IID distribution (e.g., medical records), the meta-learner $\theta$ executes the Inner Loop to rapidly build a medical-specific memory $\mathbf{S}_t$ from the context, achieving zero-shot adaptation without explicit gradient updates.

Detailed derivations of the gradient flow through the memory states and the implementation of efficient chunk\-wise parallelization are provided in Appendix A\.

![Refer to caption](https://arxiv.org/html/2605.16350v1/x1.png)

Figure 1: The three-level nested optimization framework of FedNL. L2: Memory state $S_t$ updated via the Delta Rule for test-time adaptation. L1: Meta-parameters $\theta$ (LoRA adapters) trained with frozen backbone. L0: Server aggregates rules $\theta$, not private memory. Red: parameter flow; Blue: meta-gradient flow.

![Refer to caption](https://arxiv.org/html/2605.16350v1/x2.png)

Figure 2: Unrolled computation graph of FedNL. L2: Token-level memory updates $s_t=s_{t-1}-\nabla\mathcal{L}_{\text{surp}}$ via Delta Rule. L1: Meta-gradients $\nabla_{\theta}$ backpropagated through memory trajectory. L0: M3 aggregation of global meta-rules $\theta$. Memory states remain strictly local.

Based on the theoretical framework established in Section [2.1](https://arxiv.org/html/2605.16350#S2.SS1), we present the implementation of FedNL. We first detail the model architecture that decouples static linguistic capabilities from dynamic memory rules. We then describe the training algorithm that coordinates the three-level nested optimization. Finally, we provide an analytical understanding of why this self-referential mechanism is inherently robust to data heterogeneity.

### 2.2 Architecture: Decoupling Knowledge and Rules

To implement FedNL efficiently under resource-constrained settings, we build upon the LiZAttention mechanism (Furfaro, [2025](https://arxiv.org/html/2605.16350#bib.bib10)), which integrates linear attention into pretrained Transformers. We define the global model $\mathcal{M}$ as a composition of three distinct components:

1. The Frozen Backbone ($\Theta_{\text{fixed}}$): We utilize a pretrained Large Language Model (e.g., Llama-3.2-1B) as the backbone. All original weights (Self-Attention, FFN, Norms) remain frozen throughout the federated lifecycle. This component provides general linguistic knowledge and feature extraction capabilities, acting as a shared basis across all clients.

2. The Dynamic Memory Module ($\mathbf{S}_t$): We replace standard Softmax Attention with a dual-path mechanism. The Linear Path, which runs alongside the pretrained Softmax path, maintains the transient memory state $\mathbf{S}_t$ updated via the Delta Rule (Eq. [3](https://arxiv.org/html/2605.16350#S2.E3)). This state acts as a private, context-specific container that captures local distribution patterns during inference.

3. The Trainable Meta-Parameters ($\theta$): These are the only parameters communicated and updated in FedNL. They consist of:

- Low-Rank Projections (LoRA): We inject low-rank matrices $A,B$ into the query, key, and value projections: $W^{\prime}=W_{\text{fixed}}+BA$. These learnable adapters determine what information should be written into the memory $\mathbf{S}_t$ and how it should be retrieved.
- Memory Gating ($\alpha$): A learnable scalar or vector that controls the mixing weight between the static Softmax attention (general knowledge) and dynamic Linear attention (local context memory).
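The two trainable pieces above can be sketched as follows. The LoRA projection follows $W' = W_{\text{fixed}} + BA$ as stated; the gate is written here as a sigmoid-weighted convex blend of the two paths, which is an assumption, since the text only says that $\alpha$ controls the mixing weight. The helper names and the sizes (`d = 2048`, rank 8) are hypothetical.

```python
import numpy as np

def lora_project(x, W_fixed, A, B):
    """Apply W' = W_fixed + B A without materializing the full W'."""
    return W_fixed @ x + B @ (A @ x)

def gated_mix(softmax_out, linear_out, alpha_logit):
    """Blend the Softmax and Linear paths; the sigmoid form is an illustrative assumption."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))
    return (1.0 - alpha) * softmax_out + alpha * linear_out

def lora_param_count(d_in, d_out, rank):
    """Number of trainable values in one adapter pair (A: rank x d_in, B: d_out x rank)."""
    return rank * (d_in + d_out)

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 32, 4
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(rank, d_in))
B = rng.normal(size=(d_out, rank))
x = rng.normal(size=d_in)

y = lora_project(x, W, A, B)
mixed = gated_mix(np.ones(3), np.zeros(3), alpha_logit=0.0)

# Communication per projection at a hypothetical d = 2048, rank 8:
full = 2048 * 2048
adapter = lora_param_count(2048, 2048, 8)
print(full, adapter, full // adapter)
```

At these hypothetical sizes one adapter pair holds 32,768 values against 4,194,304 for the dense matrix, a 128x reduction for that projection alone.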

By restricting the learnable parameters $\theta$ to the adapters, FedNL reduces communication overhead by orders of magnitude compared to full-model aggregation, while the dynamic $\mathbf{S}_t$, although fixed in size, can keep compressing test-time context of arbitrary length.

### 2.3 The FedNL Algorithm

The training procedure of FedNL simulates the nested learning process. The core innovation lies in the client’s local update step, where the gradient calculation must account for the trajectory of the dynamic memory $\mathbf{S}_t$.

Forward Pass (Inner Loop Execution): During local training on a sequence $x$, the client executes the model forward pass. Crucially, this is not just a function evaluation but an optimization process. For each token step $t$, the Delta Rule updates $\mathbf{S}_{t-1}\to\mathbf{S}_t$ using the current rules $\theta$. The prediction $\hat{y}_{t+1}$ depends on $\mathbf{S}_t$, which in turn depends on the history $x_{1:t}$ and $\theta$.

Backward Pass (Rule Optimization): To optimize $\theta$, we compute the gradient of the task loss $\mathcal{L}_{\text{task}}$. Since $\mathbf{S}_t$ is a function of $\theta$ (recursively), the gradient flows through time:

$$\frac{\partial\mathcal{L}_{\text{task}}}{\partial\theta}=\sum_{t}\frac{\partial\mathcal{L}_t}{\partial\hat{y}_t}\left(\frac{\partial\hat{y}_t}{\partial\theta}+\frac{\partial\hat{y}_t}{\partial\mathbf{S}_{t-1}}\underbrace{\frac{\partial\mathbf{S}_{t-1}}{\partial\theta}}_{\text{Recursive Term}}\right). \tag{7}$$

Modern automatic differentiation frameworks handle this BPTT (Backpropagation Through Time) naturally. By minimizing this loss, $\theta$ learns to generate update rules that maximize the predictive power of the memory $\mathbf{S}_t$. The full procedure is detailed in Algorithm [1](https://arxiv.org/html/2605.16350#alg1).

Algorithm 1: Federated Nested Learning (FedNL)

1: Input: Pretrained backbone $\Theta_{\text{fixed}}$, clients $K$, rounds $R$, local epochs $E$.

2: Server Initialize: Meta-parameters $\theta^{(0)}$ (LoRA + Gating).

3: for round $r=1$ to $R$ do

4: Server selects subset of clients $\mathcal{K}_r$.

5: Broadcast $\theta^{(r-1)}$ to clients in $\mathcal{K}_r$.

6: for client $k\in\mathcal{K}_r$ in parallel do

7: $\theta_k\leftarrow\theta^{(r-1)}$

8: for epoch $e=1$ to $E$ do

9: for batch $B=(x,y)$ in $\mathcal{D}_k$ do

10: Initialize memory state $\mathbf{S}_0=\mathbf{0}$.

11: // Level 2 Loop (Implicit)

12: for token $t$ in sequence do

13: Generate $k_t,v_t,q_t$ using $\Theta_{\text{fixed}}+\theta_k$.

14: Update $\mathbf{S}_t\leftarrow\mathbf{S}_{t-1}+\text{DeltaRule}(k_t,v_t)$.

15: Compute output using $\mathbf{S}_t$.

16: end for

17: Compute Loss $\mathcal{L}=\text{CrossEntropy}(\text{output},y)$.

18: // Level 1 Loop

19: Update $\theta_k\leftarrow\theta_k-\eta\nabla_{\theta_k}\mathcal{L}$.

20: end for

21: end for

22: Return $\theta_k$ to Server.

23: end for

24: // Level 0 Loop

25: $\theta^{(r)}\leftarrow\sum_{k\in\mathcal{K}_r}\frac{N_k}{N}\theta_k$.

26: end for
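The three loops of Algorithm 1 can be traced in a deliberately tiny simulation. This is a sketch under strong assumptions, not the paper's training code: the only meta-parameter is a scalar memory step size `eta`, keys are the standard basis vectors, the loss is squared error on a linear readout, and all clients participate in every round. Each client stores different private values, yet the rule they learn is the same, which is the intuition behind aggregating rules instead of memory.

```python
import numpy as np

def loss_and_grad(eta, K, V):
    """Level 2 unroll: predict v_t from S_{t-1} k_t, then write via the Delta Rule.
    Returns the summed loss and its derivative with respect to eta."""
    d = K.shape[1]
    S, dS = np.zeros((d, d)), np.zeros((d, d))
    loss, grad = 0.0, 0.0
    for k, v in zip(K, V):
        resid = S @ k - v
        loss += 0.5 * resid @ resid
        grad += resid @ (dS @ k)
        dS = dS - np.outer(resid, k) - eta * np.outer(dS @ k, k)
        S = S - eta * np.outer(resid, k)
    return loss, grad

def make_client(seed, d=4):
    """Private data: every key is shown twice with a client-specific unit-norm value."""
    rng = np.random.default_rng(seed)
    keys = np.eye(d)
    vals = rng.normal(size=(d, d))
    vals /= np.linalg.norm(vals, axis=1, keepdims=True)
    return np.vstack([keys, keys]), np.vstack([vals, vals])

def fednl_toy(num_clients=3, rounds=5, local_steps=2, lr=0.1):
    clients = [make_client(seed) for seed in range(num_clients)]
    sizes = [len(K) for K, _ in clients]
    eta_global = 0.0                                   # server initialization
    for _ in range(rounds):
        local = []
        for K, V in clients:
            eta_k = eta_global                         # broadcast
            for _ in range(local_steps):               # Level 1: rule learning
                _, g = loss_and_grad(eta_k, K, V)
                eta_k -= lr * g
            local.append(eta_k)
        eta_global = float(np.average(local, weights=sizes))  # Level 0: aggregation
    return eta_global, clients

eta_final, clients = fednl_toy()
print(f"learned step size: {eta_final:.5f}")
```

In this toy the loss for every client is `2 + 2 * (1 - eta)^2`, so each local step shrinks `1 - eta` by a factor of 0.6 and the aggregated `eta` approaches 1 (write the value fully on first sight), independent of which values a client holds.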

### 2.4 Theoretical Analysis

Standard FL degrades in Non-IID settings because it tries to find a single static parameter set $\theta^{*}$ that satisfies conflicting local distributions. Specifically, let $\mathcal{P}_1$ and $\mathcal{P}_2$ be two disparate distributions (e.g., Code vs. Medical). A static model attempts to find $\theta^{*}\in\arg\min(\mathcal{L}_{\mathcal{P}_1}(\theta)+\mathcal{L}_{\mathcal{P}_2}(\theta))$, often resulting in a solution that is suboptimal for both (the “average” model).

In FedNL, the prediction for a sample $x$ is not determined by $\theta$ alone, but by the tuple $(\theta,\mathbf{S}_x)$, where $\mathbf{S}_x$ is the memory state dynamically constructed from the context of $x$ itself.

Proposition 1 (Instance-Specific Approximation). Let $\theta^{*}$ be the aggregated meta-parameters in FedNL. For any client $k$ with distribution $\mathcal{P}_k$, and for any instance $x\sim\mathcal{P}_k$, the effective model used for prediction is $\mathcal{M}(x;\theta^{*})\approx\mathcal{M}_{static}(\theta^{*}+\Delta\theta_x)$, where $\Delta\theta_x$ represents an implicit gradient step taken on the memory state $\mathbf{S}$ during inference.

Proof Sketch. The Delta Rule update $\mathbf{S}_t=\mathbf{S}_{t-1}-\eta\nabla\mathcal{L}_{mem}$ in Level 2 is mathematically equivalent to a gradient descent step in the function space of linear layers (Von Oswald et al., [2023](https://arxiv.org/html/2605.16350#bib.bib27); Behrouz et al., [2025a](https://arxiv.org/html/2605.16350#bib.bib8)). Therefore, when FedNL processes a medical text, the memory $\mathbf{S}$ moves in the direction that minimizes the reconstruction error of medical tokens. This is functionally equivalent to fine-tuning the model on the current context at inference time.
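One way to read the proof sketch is to unroll the Delta Rule from an empty memory. The lines below are a worked sketch that assumes a constant step size $\eta$ and a linear readout $\mathbf{S}\mathbf{q}$ for a query $\mathbf{q}$; both are simplifications introduced here, and the symbol $\mathbf{e}_t$ is notation added for the write error.

```latex
% Unrolling Eq. 3 from S_0 = 0, with e_t the error corrected at step t:
\mathbf{S}_T = \eta \sum_{t=1}^{T} \mathbf{e}_t \mathbf{k}_t^{\top},
\qquad \mathbf{e}_t = \mathbf{v}_t - \mathbf{S}_{t-1}\mathbf{k}_t .

% Reading the memory with a query q:
\mathbf{S}_T \mathbf{q} = \eta \sum_{t=1}^{T} (\mathbf{k}_t^{\top}\mathbf{q})\, \mathbf{e}_t .

% For a layer whose static path computes W q, the combined output is
W\mathbf{q} + \mathbf{S}_T\mathbf{q} = (W + \Delta W_x)\,\mathbf{q},
\qquad \Delta W_x := \mathbf{S}_T .
```

Under these assumptions the memory acts as a context-dependent weight offset $\Delta W_x$ accumulated from per-token gradient steps, which is the sense in which the effective model resembles $\mathcal{M}_{static}(\theta^{*}+\Delta\theta_x)$.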

Implication. The global meta-parameters $\theta^{*}$ do not need to encode the conflict between Code and Medical knowledge. Instead, $\theta^{*}$ only needs to encode the universal rule: “If context involves Python syntax, update $\mathbf{S}$ to store code logic; if context involves Anatomy, update $\mathbf{S}$ to store biological relations”. Since this rule is consistent across domains, the Non-IID conflict in the parameter space is significantly alleviated.

Consequently, FedNL achieves Zero-Shot Test-Time Adaptation: even if the global model has never seen a specific local distribution during training, it can adapt to it during the first few tokens of inference, purely by executing the learned memory update rules. This property may improve robustness to certain forms of heterogeneity, especially when useful information can be absorbed from the test-time context.

## 3 Experiments

We evaluate FedNL on two federated settings that stress different aspects of heterogeneity: a five-client Non-IID MMLU (Hendrycks et al., [2021b](https://arxiv.org/html/2605.16350#bib.bib30)) split for domain-specialized reasoning, and long-context NIAH tasks for sparse retrieval and streaming adaptation.

### 3.1 Experimental Setup

Setup Summary. We evaluate FedNL on two federated settings: a five-client Non-IID MMLU split for domain-specialized short-context reasoning, and long-context NIAH tasks for multi-needle retrieval and streaming CE diagnostics. We additionally use PG-19 to isolate component-level effects in the ablation study. All experiments were conducted on 4 NVIDIA L20 48GB GPUs. The full data format, client partitions, and implementation details are provided in Appendix [C](https://arxiv.org/html/2605.16350#A3).

Baselines. We compare FedNL with six representative federated methods spanning algorithmic and architectural axes. FedAvg (McMahan et al., [2017](https://arxiv.org/html/2605.16350#bib.bib11)) is the canonical FL baseline that averages local LoRA updates across clients. FedProx (Li et al., [2020](https://arxiv.org/html/2605.16350#bib.bib5)) augments FedAvg with a proximal term to mitigate client drift under heterogeneity. FedSSI (Li et al., [2025](https://arxiv.org/html/2605.16350#bib.bib6)) represents the current continual-FL regularization family, using synaptic-intelligence-style importance weights to preserve parameters across clients. FedALA (Zhang et al., [2023](https://arxiv.org/html/2605.16350#bib.bib28)) personalizes the global model by locally calibrating aggregated weights on each client’s data via a few SGD steps before evaluation. FFA-LoRA (Sun and others, [2024](https://arxiv.org/html/2605.16350#bib.bib29)) freezes the LoRA $A$ matrices at initialization and only averages the $B$ matrices across clients, reducing communication cost by half. Fed-Mamba is a backbone-comparison baseline that applies FedAvg to a Mamba-1.4B (Gu and Dao, [2024](https://arxiv.org/html/2605.16350#bib.bib21)) state-space backbone, isolating the difference between SSM-style and Titans-style memory under federation.

### 3.2 Federated Generalization on Non-IID MMLU

We first study domain\-level heterogeneity on MMLU\. The benchmark is split into five clients, each corresponding to one super\-category: Law/Ethics, Humanities, STEM, Math/CS, and Medical/Psychology\. Each client fine\-tunes on its own domain\-specific training questions\. The server then aggregates the client updates and redistributes the federated model back to all clients\. Evaluation is performed on each client’s held\-out test questions from the same domain; these examples are unseen during training, and no gradient updates are performed at inference time\.

Table [1](https://arxiv.org/html/2605.16350#S3.T1) reports test accuracy on this five-client Non-IID partition, while Figure [3](https://arxiv.org/html/2605.16350#S3.F3) visualizes the client-level aggregation drop. On Qwen2.5-1.5B, FedNL obtains the highest average accuracy, 58.88%, slightly above FedSSI at 58.70%. The main gains appear on STEM and Math/CS, where FedNL improves over FedSSI by +2.0 and +3.8 percentage points, respectively. On the smaller Llama-3.2-1B backbone, FedNL reaches 42.64%, compared with 42.10% for FedSSI, with the largest gain again on Math/CS. Fed-Mamba, which replaces the Titans-style memory with an SSM backbone, obtains 26.70% average accuracy. These results suggest that memory-rule aggregation can be integrated into federated training without degrading short-context MMLU performance, while providing modest gains on the more shifted client domains.

Table 1: Test accuracy (%) on the five-client Non-IID MMLU partition. Each client trains on its own domain and is evaluated on held-out questions from that domain after federated aggregation. FedNL is evaluated on Titans-Qwen2.5-1.5B and Titans-Llama-3.2-1B against matched-backbone FL baselines; Fed-Mamba uses Mamba-1.4B as a non-Titans memory architecture.

![Refer to caption](https://arxiv.org/html/2605.16350v1/x3.png)

Figure 3: Per-client MMLU aggregation drop from each client’s locally fine-tuned adapter to the corresponding federated adapter. Lower is better: FedNL keeps the drop near zero across client domains, whereas static aggregation baselines show larger client-specific degradation under Non-IID shifts.

### 3.3 Long-Tail Retrieval and Catastrophic Forgetting

We next evaluate long-tail retrieval with the NIAH suite under a seven-client Non-IID split. Each client corresponds to one retrieval template: MK-NIAH, MV-NIAH (Hsieh et al., [2024](https://arxiv.org/html/2605.16350#bib.bib25)), Passkey, UUID code, Name-date, Phrase code, or Counter state. Each client fine-tunes on its own template-specific training pool. The server then aggregates the client updates and redistributes the federated model back to all clients. Evaluation is performed on each client’s held-out examples from the same retrieval template. All examples use multi-needle contexts at target depths from 1K to 16K tokens, so the task requires binding the queried key, rank, or state to the correct value rather than merely detecting that a needle-like value appeared in context. The full prompt construction and insertion rules are provided in Appendix [C.3](https://arxiv.org/html/2605.16350#A3.SS3).

Table [2](https://arxiv.org/html/2605.16350#S3.T2) reports personalized accuracy on this seven-client NIAH partition, while Figure [4](https://arxiv.org/html/2605.16350#S3.F4) averages the same evaluation over needle types at each target depth. FedNL obtains the highest average accuracy, 29.7%, compared with the strongest baseline FedALA at 28.6%. The largest gains appear on MK-NIAH, MV-NIAH, and UUID code, where FedNL reaches 32.0%, 40.0%, and 48.0%, respectively. The depth-stratified view shows that FedNL is strongest at 1K to 4K and remains tied for the best result at 16K, while the harder 8K setting narrows the gap across methods. To complement the accuracy view with a loss-based streaming diagnostic, Figure [5](https://arxiv.org/html/2605.16350#S3.F5) evaluates normalized next-token CE on 16K held-out NIAH prompts. FedAvg’s CE increases by 3.1% as the prompt unfolds, whereas FedNL decreases by 2.1%, indicating that the recurrent memory state continues to absorb useful context over long streams rather than accumulating uncertainty.
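The streaming CE diagnostic can be sketched as a binning step over per-token losses. This is one plausible reading, assuming that a "1K bin" means 1024 consecutive tokens and that "relative to the first bin" means a ratio; the paper's exact normalization may differ, and the function name is hypothetical.

```python
import numpy as np

def normalized_streaming_ce(token_losses, bin_size=1024):
    """Mean cross-entropy per bin of consecutive tokens, divided by the first bin's mean."""
    losses = np.asarray(token_losses, dtype=float)
    n_bins = len(losses) // bin_size
    binned = losses[: n_bins * bin_size].reshape(n_bins, bin_size).mean(axis=1)
    return binned / binned[0]

# Synthetic per-token losses for a 16K-token prompt (illustrative values only).
flat = np.full(16 * 1024, 2.0)
improving = np.linspace(2.0, 1.9, 16 * 1024)

print(normalized_streaming_ce(flat))
print(normalized_streaming_ce(improving)[-1])
```

Values below 1 in later bins mean the model predicts later tokens better than early ones, which is the behaviour the text attributes to the recurrent memory absorbing context.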

Table 2: Per\-client NIAH accuracy \(%\) under the 7\-client Non\-IID partition with multi\-needle retrieval, averaged over target depths 1K, 2K, 4K, 8K, and 16K \(final round, personalized\)\. FedAvg, FedProx, FedSSI, FedALA, and FFA\-LoRA use the Llama\-3\.2\-1B Transformer backbone; FedNL uses Titans\-Llama\-3\.2\-1B; Fed\-Mamba uses Mamba\-1\.4B\.

![Refer to caption](https://arxiv.org/html/2605.16350v1/x4.png)Figure 4: NIAH accuracy by target insertion depth, averaged over the seven needle clients\.
![Refer to caption](https://arxiv.org/html/2605.16350v1/x5.png)Figure 5: 16K NIAH streaming CE relative to each method’s first 1K bin\.

### 3\.4 Ablation Study

Figure [6](https://arxiv.org/html/2605.16350#S3.F7) isolates the contribution of the main FedNL components on PG\-19\. The full model keeps both the optimization\-based Delta Rule and the Memory\-as\-Gate \(MaG\) path while training LoRA adapters together with the memory parameters\. We compare it with three controlled variants: \(i\) w/o Delta Rule, which replaces the Delta Rule with a Hebbian\-style update, testing whether simple associative accumulation is sufficient; \(ii\) w/o MaG, which removes the learned memory gate, forcing the model to rely on the memory path without the same fallback control; and \(iii\) w/o LoRA, which freezes the LoRA adapters and trains only the 32K memory parameters\. The Delta Rule is the most critical component: replacing it with a Hebbian update raises perplexity \(PPL\) from 29\.82 to 1576\.42, showing that simple accumulation cannot reliably correct noisy or overwritten memory values during streaming updates\. Removing MaG increases PPL to 348\.97 because the model loses a learned fallback that controls when to trust the memory path\. Freezing LoRA gives 149\.46 PPL, indicating that the small memory\-rule parameters still need adapter\-level alignment to make the recurrent state useful for language modeling\.
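The difference between accumulation and correction can be seen in a small NumPy sketch\. It assumes the common textbook forms of the two updates \(the exact Hebbian variant used in the ablation is not specified here\), a unit\-norm key, and a write strength of 1; all variable names are illustrative\. The same key is written twice with different values, and the memory is then read back with that key\.

```python
import numpy as np

def hebbian_write(S, k, v, beta=1.0):
    # Hebbian-style accumulation: add the outer product, never erase.
    return S + beta * np.outer(v, k)

def delta_write(S, k, v, beta=1.0):
    # Delta Rule: write only the error between the target v and the current read S @ k.
    return S + beta * np.outer(v - S @ k, k)

d = 4
k = np.zeros(d)
k[0] = 1.0                              # unit-norm key (assumption)
v_old = np.array([1.0, 0.0, 0.0, 0.0])  # first value bound to k
v_new = np.array([0.0, 1.0, 0.0, 0.0])  # later value bound to the same k

S_heb = hebbian_write(hebbian_write(np.zeros((d, d)), k, v_old), k, v_new)
S_del = delta_write(delta_write(np.zeros((d, d)), k, v_old), k, v_new)

read_heb = S_heb @ k   # old and new values are superimposed
read_del = S_del @ k   # the old value has been overwritten
print(read_heb, read_del)
```

Under these assumptions the Hebbian read returns the sum of both values, while the Delta Rule read returns only the most recent one, which matches the overwriting behavior described above\.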

### 3\.5 Communication and Resource Efficiency

A critical requirement for FL is efficiency\. We analyze the inference memory footprint and the per\-round communication cost\.

Inference Memory \(VRAM\)\. Peak VRAM measurements compare memory usage across sequence lengths\. The memory of static\-attention baselines grows with sequence length due to the KV cache, and these baselines encounter out\-of\-memory errors at 16K tokens on resource\-constrained accelerators\. FedNL instead maintains a constant $O(1)$ memory footprint by storing only the fixed\-size state $\mathbf{S}_{t}$, making long\-context deployment more practical on edge devices\.
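A back\-of\-the\-envelope sketch shows why the two footprints scale differently\. The shapes below are assumptions chosen for illustration \(a 16\-layer decoder with 8 key\-value heads of width 64, and one 64×64 state matrix per layer and head\); they are not the paper’s measured configuration, and weights and activations are ignored\.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Keys and values (factor 2) are stored for every layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

def fixed_state_bytes(n_layers, n_heads, d_k, d_v, bytes_per_elem=2):
    # One d_v x d_k state matrix per layer and head, independent of sequence length.
    return n_layers * n_heads * d_k * d_v * bytes_per_elem

# Assumed shapes (illustrative only, not taken from the paper).
cfg = dict(n_layers=16, n_kv_heads=8, head_dim=64)
for T in (1024, 4096, 16384):
    print(T, kv_cache_bytes(T, **cfg) / 2**20, "MiB of KV cache")

state_mib = fixed_state_bytes(n_layers=16, n_heads=8, d_k=64, d_v=64) / 2**20
print("recurrent state:", state_mib, "MiB at any length")
```

The cache term is linear in the sequence length, while the state term has no length dependence at all, which is the property the paragraph above relies on\.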

Communication Efficiency\. Because FedNL aggregates only the memory\-update meta\-rules across clients, which make up a small fraction of the trainable parameter count, the per\-round client\-to\-server payload shrinks substantially\. On the NIAH 7\-client setup \(Llama\-3\.2\-1B \+ Titans\-Llama, $r=16$ LoRA\), the effective memory rules amount to $\sim$32,768 parameters \($\sim$0\.26 MB at fp16\), compared to $\sim$11\.3M LoRA parameters \($\sim$22\.5 MB at fp16\) for the FedAvg Transformer baseline\. This is a $\sim\mathbf{350\times}$ reduction in per\-round communication \(Figure [7](https://arxiv.org/html/2605.16350#S3.F7)\)\. Aggregated over the full $7\times 2$ training schedule \(seven clients, two rounds\), FedNL exchanges only 3\.6 MB of cross\-device traffic against 1\.26 GB for FedAvg, a property that is essential for deployment on bandwidth\-constrained edge networks\.
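As a rough sanity check, the reduction factor can be recomputed from the two quoted parameter counts\. The sketch below assumes a payload of exactly 2 bytes per fp16 parameter with no headers or compression; the quoted megabyte figures may reflect precision or overhead that this sketch does not model, so only the parameter\-count ratio is checked\. The helper name is illustrative\.

```python
BYTES_FP16 = 2

def payload_mb(n_params, bytes_per_param=BYTES_FP16):
    # Raw tensor payload in megabytes (10**6 bytes); ignores headers and compression.
    return n_params * bytes_per_param / 1e6

memory_rule_params = 32_768     # quoted in the text
lora_params = 11_300_000        # quoted in the text (~11.3M)

ratio = lora_params / memory_rule_params
print(f"parameter ratio: {ratio:.0f}x")
print(f"LoRA payload:    {payload_mb(lora_params):.1f} MB")
```

The parameter ratio comes out just under 350, in line with the stated reduction factor\.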

![Refer to caption](https://arxiv.org/html/2605.16350v1/x6.png)Figure 6: PG\-19 ablation \(PPL, lower is better\)\.
![Refer to caption](https://arxiv.org/html/2605.16350v1/x7.png)Figure 7: Per\-round communication on NIAH \(Llama\-3\.2\-1B, fp16\)\.

## 4 Conclusion

This paper proposes Federated Nested Learning \(FedNL\), a three\-level nested optimization framework that redefines collaborative training\. By casting FL as the collaborative learning of self\-referential update rules, FedNL addresses the Non\-IID challenge at the level of the optimization rule rather than the static weights\. FedNL implements this idea through a Titans\-based linear attention mechanism, enabling efficient zero\-shot test\-time adaptation\. Empirical validation on Non\-IID MMLU and long\-context benchmarks shows competitive short\-context reasoning, higher long\-context retrieval accuracy, and improved streaming CE behavior\. This work points to a new direction for FL, in which models evolve from static repositories of knowledge into continuous, context\-aware learners\.

## 5 Limitations

While our experiments demonstrate consistent gains at the 1B–1\.5B scale, extending FedNL to larger foundation models remains an important next step to validate the generality of memory\-rule aggregation\. The empirical evaluation spans reasoning and retrieval tasks on MMLU and NIAH, and broadening the benchmark suite would further solidify the practical scope of the framework\.

## References

- A\. Behrouz, M\. Razaviyayn, P\. Zhong, and V\. Mirrokni \(2025a\) Nested learning: the illusion of deep learning architectures\. arXiv preprint arXiv:2512\.24695\. Cited by: [§A\.2](https://arxiv.org/html/2605.16350#A1.SS2.p1.1), [§1](https://arxiv.org/html/2605.16350#S1.p3.1), [§2\.1](https://arxiv.org/html/2605.16350#S2.SS1.p8.4), [§2\.4](https://arxiv.org/html/2605.16350#S2.SS4.p4.2)\.
- A\. Behrouz, P\. Zhong, and V\. Mirrokni \(2025b\) Titans: learning to memorize at test time\. In Advances in Neural Information Processing Systems, External Links: [Link](https://research.google/pubs/titans-learning-to-memorize-at-test-time-2/) Cited by: [§A\.2](https://arxiv.org/html/2605.16350#A1.SS2.p1.1), [§1](https://arxiv.org/html/2605.16350#S1.p4.1), [§2\.1](https://arxiv.org/html/2605.16350#S2.SS1.p7.1)\.
- C\. Finn, P\. Abbeel, and S\. Levine \(2017\) Model\-agnostic meta\-learning for fast adaptation of deep networks\. In International conference on machine learning, pp\. 1126–1135\. Cited by: [§A\.2](https://arxiv.org/html/2605.16350#A1.SS2.p1.1)\.
- F\. Furfaro \(2025\) TPTT: transforming pretrained transformer into titans\. arXiv preprint arXiv:2506\.17671\. Cited by: [§C\.1](https://arxiv.org/html/2605.16350#A3.SS1.p2.6), [§1](https://arxiv.org/html/2605.16350#S1.p5.1), [§2\.2](https://arxiv.org/html/2605.16350#S2.SS2.p1.1)\.
- A\. Gu and T\. Dao \(2024\) Mamba: linear\-time sequence modeling with selective state spaces\. In First conference on language modeling, Cited by: [§A\.2](https://arxiv.org/html/2605.16350#A1.SS2.p1.1), [§3\.1](https://arxiv.org/html/2605.16350#S3.SS1.p2.2)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021a\) Measuring massive multitask language understanding\. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=d7KBjmI3GmQ) Cited by: [§C\.1](https://arxiv.org/html/2605.16350#A3.SS1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021b\) Measuring massive multitask language understanding\. Proceedings of the International Conference on Learning Representations \(ICLR\)\. Cited by: [§3](https://arxiv.org/html/2605.16350#S3.p1.1)\.
- C\. Hsieh, S\. Sun, S\. Kriman, S\. Acharya, D\. Rekesh, F\. Jia, Y\. Zhang, and B\. Ginsburg \(2024\) RULER: what’s the real context size of your long\-context language models? arXiv preprint arXiv:2404\.06654\. Cited by: [§3\.3](https://arxiv.org/html/2605.16350#S3.SS3.p1.2)\.
- S\. P\. Karimireddy, S\. Kale, M\. Mohri, S\. Reddi, S\. Stich, and A\. T\. Suresh \(2020\) Scaffold: stochastic controlled averaging for federated learning\. In International conference on machine learning, pp\. 5132–5143\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p1.1)\.
- J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska, et al\. \(2017\) Overcoming catastrophic forgetting in neural networks\. Proceedings of the national academy of sciences 114\(13\), pp\. 3521–3526\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p2.1)\.
- W\. Kuang, B\. Qian, Z\. Li, D\. Chen, D\. Gao, X\. Pan, Y\. Xie, Y\. Li, B\. Ding, and J\. Zhou \(2024\) Federatedscope\-llm: a comprehensive package for fine\-tuning large language models in federated learning\. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp\. 5260–5271\. Cited by: [§1](https://arxiv.org/html/2605.16350#S1.p1.1)\.
- T\. Li, A\. K\. Sahu, M\. Zaheer, M\. Sanjabi, A\. Talwalkar, and V\. Smith \(2020\) Federated optimization in heterogeneous networks\. Proceedings of Machine learning and systems 2, pp\. 429–450\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p1.1), [§1](https://arxiv.org/html/2605.16350#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.16350#S3.SS1.p2.2)\.
- Y\. Li, Y\. Wang, H\. Wang, Y\. Qi, T\. Xiao, and R\. Li \(2025\) FedSSI: rehearsal\-free continual federated learning with synergistic synaptic intelligence\. In Forty\-second International Conference on Machine Learning, Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p2.1), [§1](https://arxiv.org/html/2605.16350#S1.p2.1), [§3\.1](https://arxiv.org/html/2605.16350#S3.SS1.p2.2)\.
- X\. Liu, C\. Wu, M\. Menta, L\. Herranz, B\. Raducanu, A\. D\. Bagdanov, S\. Jui, and J\. v\. de Weijer \(2020\) Generative feature replay for class\-incremental learning\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp\. 226–227\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p2.1)\.
- B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. A\. y Arcas \(2017\) Communication\-efficient learning of deep networks from decentralized data\. In Artificial intelligence and statistics, pp\. 1273–1282\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p1.1), [§3\.1](https://arxiv.org/html/2605.16350#S3.SS1.p2.2)\.
- D\. Qi, H\. Zhao, and S\. Li \(2023\) Better generative replay for continual federated learning\. arXiv preprint arXiv:2302\.13001\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p2.1)\.
- X\. Shuai, Y\. Shen, S\. Jiang, Z\. Zhao, Z\. Yan, and G\. Xing \(2022\) BalanceFL: addressing class imbalance in long\-tail federated learning\. In 2022 21st ACM/IEEE International Conference on Information Processing in Sensor Networks \(IPSN\), pp\. 271–284\. Cited by: [§1](https://arxiv.org/html/2605.16350#S1.p1.1)\.
- Y\. Sun et al\. \(2024\) FFA\-lora: federated fine\-tuning of large language models with fedavg on lora\. arXiv preprint arXiv:2407\.03039\. Cited by: [§3\.1](https://arxiv.org/html/2605.16350#S3.SS1.p2.2)\.
- Y\. Sun, X\. Li, K\. Dalal, J\. Xu, A\. Vikram, G\. Zhang, Y\. Dubois, X\. Chen, X\. Wang, S\. Koyejo, et al\. \(2024\) Learning to \(learn at test time\): rnns with expressive hidden states\. arXiv preprint arXiv:2407\.04620\. Cited by: [§A\.3](https://arxiv.org/html/2605.16350#A1.SS3.p1.1), [§B\.2](https://arxiv.org/html/2605.16350#A2.SS2.p4.8)\.
- Y\. Sun, X\. Wang, Z\. Liu, J\. Miller, A\. Efros, and M\. Hardt \(2020\) Test\-time training for robust generalization under covariate shifts\. In Advances in Neural Information Processing Systems, Vol\. 33, pp\. 9229–9248\. Cited by: [§A\.3](https://arxiv.org/html/2605.16350#A1.SS3.p1.1)\.
- J\. Von Oswald, E\. Niklasson, E\. Randazzo, J\. Sacramento, A\. Mordvintsev, A\. Zhmoginov, and M\. Vladymyrov \(2023\) Transformers learn in\-context by gradient descent\. In Proceedings of the 40th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 202, pp\. 35151–35174\. External Links: [Link](https://proceedings.mlr.press/v202/von-oswald23a.html) Cited by: [§2\.4](https://arxiv.org/html/2605.16350#S2.SS4.p4.2)\.
- D\. Wang, E\. Shelhamer, S\. Liu, B\. Olshausen, and T\. Darrell \(2021\) Tent: fully test\-time adaptation by entropy minimization\. In International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=uXl3bZLkr3c) Cited by: [§A\.3](https://arxiv.org/html/2605.16350#A1.SS3.p1.1)\.
- H\. Wang, Y\. Li, W\. Xu, R\. Li, Y\. Zhan, and Z\. Zeng \(2023\) DaFKD: domain\-aware federated knowledge distillation\. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp\. 20412–20421\. Cited by: [§1](https://arxiv.org/html/2605.16350#S1.p2.1)\.
- R\. Ye, W\. Wang, J\. Chai, D\. Li, Z\. Li, Y\. Xu, Y\. Du, Y\. Wang, and S\. Chen \(2024\) Openfedllm: training large language models on decentralized private data via federated learning\. In Proceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pp\. 6137–6147\. Cited by: [§1](https://arxiv.org/html/2605.16350#S1.p1.1)\.
- J\. Yoon, W\. Jeong, G\. Lee, E\. Yang, and S\. J\. Hwang \(2021\) Federated continual learning with weighted inter\-client transfer\. In International conference on machine learning, pp\. 12073–12086\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p2.1)\.
- F\. Zenke, B\. Poole, and S\. Ganguli \(2017\) Continual learning through synaptic intelligence\. In International conference on machine learning, pp\. 3987–3995\. Cited by: [§A\.1](https://arxiv.org/html/2605.16350#A1.SS1.p2.1)\.
- X\. Zhang, D\. Li, et al\. \(2023\) FedALA: local adaptive aggregation for heterogeneous federated learning\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 37, pp\. 11205–11213\. Cited by: [§3\.1](https://arxiv.org/html/2605.16350#S3.SS1.p2.2)\.

## Appendix A Related Work

### A\.1 Continual Federated Learning

Standard Federated Learning \(FL\) aggregates local updates to train a global model \[McMahan et al\., [2017](https://arxiv.org/html/2605.16350#bib.bib11)\]\. To handle statistical heterogeneity \(Non\-IID\), methods like FedProx \[Li et al\., [2020](https://arxiv.org/html/2605.16350#bib.bib5)\] and SCAFFOLD \[Karimireddy et al\., [2020](https://arxiv.org/html/2605.16350#bib.bib12)\] introduce regularization or control variates\. However, these methods assume a static data distribution over time\.

Continual Federated Learning \(CFL\) addresses the scenario where clients face streaming tasks \[Yoon et al\., [2021](https://arxiv.org/html/2605.16350#bib.bib13)\]\. Existing approaches fall into two main categories: \(1\) Replay\-based methods \[Liu et al\., [2020](https://arxiv.org/html/2605.16350#bib.bib19), Qi et al\., [2023](https://arxiv.org/html/2605.16350#bib.bib14)\] store or generate past samples to rehearse old tasks\. While effective, they fundamentally contradict the privacy\-preserving ethos of FL and incur significant storage costs on edge devices\. \(2\) Regularization\-based methods aim to constrain weight updates to protect important parameters\. EWC \[Kirkpatrick et al\., [2017](https://arxiv.org/html/2605.16350#bib.bib16)\] and Synaptic Intelligence \(SI\) \[Zenke et al\., [2017](https://arxiv.org/html/2605.16350#bib.bib17)\] are classic examples\. Recently, FedSSI \[Li et al\., [2025](https://arxiv.org/html/2605.16350#bib.bib6)\] advanced this direction by introducing Personalized Surrogate Models \(PSM\) to calibrate local SI regularization with global information, achieving state\-of\-the\-art performance in preventing catastrophic forgetting\.

Limitations of Current CFL: Despite their sophistication, methods ranging from FedAvg to FedSSI share a common premise: they treat the global model as a container of static knowledge\. They aim to find a set of weights $\theta^{*}$ that creates a compromise between conflicting tasks\. In contrast, FedNL fundamentally departs from this “static weight” paradigm\. Instead of regularizing weights to prevent them from changing, we design the model to actively change its internal state \($\mathbf{S}_{t}$\) during inference, enabling it to embrace heterogeneity rather than compromise with it\.

### A\.2 Nested Learning and Neural Memory

The concept of Nested Learning \(NL\) \[Behrouz et al\., [2025a](https://arxiv.org/html/2605.16350#bib.bib8)\] posits that intelligent systems should be modeled as hierarchies of optimization loops operating at different frequencies\. This framework unifies meta\-learning \[Finn et al\., [2017](https://arxiv.org/html/2605.16350#bib.bib20)\] and in\-context learning under a single theoretical umbrella\. A practical realization of NL is the Titans architecture \[Behrouz et al\., [2025b](https://arxiv.org/html/2605.16350#bib.bib9)\], which utilizes a linearized attention mechanism equipped with a memory module\. Unlike standard Recurrent Neural Networks \(RNNs\) or State Space Models such as Mamba \[Gu and Dao, [2024](https://arxiv.org/html/2605.16350#bib.bib21)\], which use fixed heuristic updates, Titans updates its memory via the Delta Rule, which is mechanistically equivalent to an online gradient descent step\. To our knowledge, FedNL is the first work to apply the NL perspective to Federated Learning\. We reinterpret the client’s inference process as the “inner\-loop” optimization defined in NL, and the federated aggregation as the “outer\-loop” meta\-learning\. This allows us to decouple the memory content \(local, private, transient\) from the memory update rules \(global, shared, persistent\)\.
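The content/rule split can be pictured with a minimal aggregation sketch\. Everything here is hypothetical: the parameter names, the `memory_rule.` prefix, and the choice of FedAvg\-style weighting by local example count are assumptions for illustration, not the paper’s implementation\. The point is only that the server averages the shared rule parameters and never reads the per\-client state\.

```python
def aggregate_shared(client_params, client_sizes, shared_prefix="memory_rule."):
    """Weighted average of parameters whose names start with shared_prefix.

    client_params: list of dicts mapping parameter name -> list of floats.
    client_sizes:  number of local examples per client (FedAvg-style weights).
    Entries that do not match shared_prefix (e.g., a transient memory state)
    are never read here, so they stay on the client.
    """
    total = sum(client_sizes)
    names = [n for n in client_params[0] if n.startswith(shared_prefix)]
    merged = {}
    for name in names:
        dim = len(client_params[0][name])
        merged[name] = [
            sum(w * p[name][i] for p, w in zip(client_params, client_sizes)) / total
            for i in range(dim)
        ]
    return merged


# Two toy clients; "state.S" stands in for local memory content.
clients = [
    {"memory_rule.beta_proj": [0.0, 2.0], "state.S": [9.0, 9.0]},
    {"memory_rule.beta_proj": [4.0, 2.0], "state.S": [-1.0, 5.0]},
]
global_rules = aggregate_shared(clients, client_sizes=[1, 3])
print(global_rules)
```

Filtering by name prefix is just one simple way to express the split; a real system would more likely select parameter groups explicitly\.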

### A\.3 Test\-Time Training and Adaptation

Test\-Time Training \(TTT\) \[Sun et al\., [2020](https://arxiv.org/html/2605.16350#bib.bib22), Wang et al\., [2021](https://arxiv.org/html/2605.16350#bib.bib23)\] refers to the paradigm of updating model parameters during inference to adapt to distribution shifts\. Recent advancements, such as TTT\-Linear \[Sun et al\., [2024](https://arxiv.org/html/2605.16350#bib.bib24)\], bake this optimization directly into the forward pass of sequence models\. FedNL can be viewed as a Federated Collaborative TTT framework\. While standard TTT focuses on adapting a single isolated model, FedNL aggregates the experience of multiple clients to learn how to adapt efficiently\. By learning the meta\-parameters $\theta$ \(projections and gating\), FedNL aims to make test\-time adaptation \(via the Delta Rule\) robust and fast to converge on unseen Non\-IID domains\. This mitigates the “cold\-start” problem often faced by TTT methods in zero\-shot scenarios\.

## Appendix B Derivations and Implementation Details

In this appendix, we provide the detailed mathematical derivations supporting the theoretical framework of Federated Nested Learning \(FedNL\)\. Specifically, we analyze the gradient flow through the dynamic memory states to validate the meta\-learning interpretation of our method \(Level 1 Loop\)\. We also detail the efficient chunk\-wise parallel implementation of the Delta Rule used in the Inner Loop \(Level 2\)\.

### B\.1 Gradient Flow Analysis: Optimizing the Learning Rule

In Section [2\.1](https://arxiv.org/html/2605.16350#S2.SS1), we defined the local training objective for a client $k$ as finding the optimal meta\-parameters $\theta$ that minimize the cumulative prediction loss over a sequence $x_{1:T}$\. The loss is given by:

$$\mathcal{J}(\theta)=\sum_{t=1}^{T}\ell\left(f(x_{t};\mathbf{S}_{t-1},\theta),x_{t+1}\right), \tag{8}$$

where $\mathbf{S}_{t}$ evolves according to the Delta Rule \(Eq\. [3](https://arxiv.org/html/2605.16350#S2.E3)\):

$$\mathbf{S}_{t}=\mathbf{S}_{t-1}+\beta_{t}(\mathbf{v}_{t}-\mathbf{S}_{t-1}\mathbf{k}_{t})\mathbf{k}_{t}^{\top}. \tag{9}$$

Here, $\mathbf{k}_{t},\mathbf{v}_{t},\beta_{t}$ are functions of the input $x_{t}$ and parameters $\theta$ \(specifically the LoRA adapters and gating networks\)\.
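The reading of Eq\. \(9\) as an online gradient step can be checked numerically\. The sketch below assumes the standard associative\-memory reconstruction loss $\tfrac{1}{2}\lVert\mathbf{v}_{t}-\mathbf{S}\mathbf{k}_{t}\rVert^{2}$ as the inner objective \(an assumption of this example\) and uses random toy tensors\. It compares one Delta Rule update with one gradient step of size $\beta_{t}$ on that loss, and verifies the analytic gradient against a finite difference\.

```python
import numpy as np

def delta_rule_step(S, k, v, beta):
    # Eq. (9): S_t = S_{t-1} + beta_t (v_t - S_{t-1} k_t) k_t^T
    return S + beta * np.outer(v - S @ k, k)

def recon_loss(S, k, v):
    # Assumed inner objective: 0.5 * ||v - S k||^2
    return 0.5 * np.sum((v - S @ k) ** 2)

def recon_grad(S, k, v):
    # Analytic gradient of recon_loss with respect to S.
    return -np.outer(v - S @ k, k)

rng = np.random.default_rng(0)
d_v, d_k = 3, 5
S = rng.normal(size=(d_v, d_k))
k = rng.normal(size=d_k)
v = rng.normal(size=d_v)
beta = 0.3

# Finite-difference check of one gradient entry.
grad = recon_grad(S, k, v)
eps = 1e-6
E = np.zeros_like(S)
E[1, 2] = eps
fd = (recon_loss(S + E, k, v) - recon_loss(S - E, k, v)) / (2 * eps)

S_delta = delta_rule_step(S, k, v, beta)
S_sgd = S - beta * grad                      # one gradient-descent step
print(np.allclose(S_delta, S_sgd), abs(fd - grad[1, 2]) < 1e-6)
```

Under this assumed loss, $\beta_{t}$ plays the role of a per\-token learning rate, which is why learning to produce $\beta_{t}$ amounts to learning part of the optimization rule\.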

To update $\theta$ using gradient descent \(Level 1 Loop\), we require the total derivative $\frac{d\mathcal{J}}{d\theta}$\. Applying the chain rule through time \(BPTT\), the gradient at step $t$ depends on the state $\mathbf{S}_{t-1}$, which in turn depends on $\theta$ through all previous timesteps\.

The total gradient can be expanded as:

$$\frac{d\mathcal{J}}{d\theta}=\sum_{t=1}^{T}\left(\underbrace{\frac{\partial\ell_{t}}{\partial\theta}}_{\text{Direct}}+\underbrace{\frac{\partial\ell_{t}}{\partial\mathbf{S}_{t-1}}\cdot\frac{d\mathbf{S}_{t-1}}{d\theta}}_{\text{Recursive}}\right). \tag{10}$$

The Direct Term captures how $\theta$ affects the immediate prediction \(e\.g\., through the output projection layer\)\. The Recursive Term captures the “meta\-learning” signal: how $\theta$ influences the construction of the memory\.

We can expand the recursive state derivative $\frac{d\mathbf{S}_{t}}{d\theta}$ using Eq\. \([9](https://arxiv.org/html/2605.16350#A2.E9)\):

$$\frac{d\mathbf{S}_{t}}{d\theta}=\frac{\partial\mathbf{S}_{t}}{\partial\mathbf{S}_{t-1}}\frac{d\mathbf{S}_{t-1}}{d\theta}+\frac{\partial\mathbf{S}_{t}}{\partial\theta}\bigg|_{\mathbf{S}_{t-1}\text{ fixed}}. \tag{11}$$

1\. The Transition Jacobian \($\frac{\partial\mathbf{S}_{t}}{\partial\mathbf{S}_{t-1}}$\): Differentiating Eq\. \([9](https://arxiv.org/html/2605.16350#A2.E9)\) w\.r\.t\. $\mathbf{S}_{t-1}$:

$$\frac{\partial\mathbf{S}_{t}}{\partial\mathbf{S}_{t-1}}=\mathbf{I}-\beta_{t}\mathbf{k}_{t}\mathbf{k}_{t}^{\top}. \tag{12}$$

This term acts as a “forgetting gate” or contraction map\. It determines how much of the gradient flows back to previous memories\. In FedNL, $\theta$ learns to generate $\mathbf{k}_{t}$ and $\beta_{t}$ such that this Jacobian preserves gradients for relevant long\-term dependencies while dampening noise\.
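The “forgetting gate” reading of Eq\. \(12\) follows from the spectrum of the matrix\. The sketch below assumes a unit\-norm key and a scalar $\beta_{t}\in[0,1]$ \(both assumptions of this example\)\. In that case the matrix scales the direction of $\mathbf{k}_{t}$ by $1-\beta_{t}$ and leaves every orthogonal direction untouched, so it never amplifies the signal passing through it\.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 6
k = rng.normal(size=d)
k /= np.linalg.norm(k)                     # unit-norm key (assumption)
beta = 0.4

W = np.eye(d) - beta * np.outer(k, k)      # the matrix in Eq. (12)
eigvals = np.sort(np.linalg.eigvalsh(W))   # W is symmetric, so eigvalsh applies

# Smallest eigenvalue is 1 - beta (along k); the remaining d - 1 are 1.
print(eigvals)
```

More generally the eigenvalue along $\mathbf{k}_{t}$ is $1-\beta_{t}\lVert\mathbf{k}_{t}\rVert^{2}$, so the map is non\-expansive whenever $0\le\beta_{t}\lVert\mathbf{k}_{t}\rVert^{2}\le 2$\.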

2\. The Update Jacobian \($\frac{\partial\mathbf{S}_{t}}{\partial\theta}\big|_{\text{direct}}$\): This term represents how a change in $\theta$ alters the content written into memory at step $t$\. Since $\mathbf{k}_{t},\mathbf{v}_{t},\beta_{t}$ are functions of $\theta$:

$$\frac{\partial\mathbf{S}_{t}}{\partial\theta}\bigg|_{\text{direct}}\approx\beta_{t}\left(\frac{\partial\mathbf{v}_{t}}{\partial\theta}\mathbf{k}_{t}^{\top}+\mathbf{v}_{t}\frac{\partial\mathbf{k}_{t}^{\top}}{\partial\theta}\right)+(\dots). \tag{13}$$

By optimizing this term, FedNL explicitly trains the projection matrices \(e\.g\., LoRA $A,B$\) to produce keys and values that maximize the utility of the resulting memory trace\.

Conclusion: The gradient descent update on $\theta$ in the Local Loop effectively solves a meta\-optimization problem: “Find the projection rules $\theta$ such that executing the Delta Rule \(Inner Loop\) yields the sequence of states $\mathbf{S}_{0:T}$ that minimizes prediction error\.”

### B\.2 Efficient Chunk\-wise Parallelization

While the Delta Rule \(Eq\. [9](https://arxiv.org/html/2605.16350#A2.E9)\) is recurrent and seemingly sequential \($O(T)$\), we leverage the properties of linear recurrence to parallelize computation, making FedNL feasible for edge devices\.

We divide the input sequence of length $T$ into chunks of size $C$ \(e\.g\., $C=128$\)\. The computation is decomposed into Intra\-Chunk \(parallel\) and Inter\-Chunk \(recurrent\) operations\.

Matrix Formulation of Delta Rule\. Eq\. \([9](https://arxiv.org/html/2605.16350#A2.E9)\) can be rewritten as a linear recurrence:

$$\mathbf{S}_{t}=\mathbf{S}_{t-1}\mathbf{W}_{t}+\mathbf{U}_{t}, \tag{14}$$

where $\mathbf{W}_{t}=(\mathbf{I}-\beta_{t}\mathbf{k}_{t}\mathbf{k}_{t}^{\top})$ is the decay matrix and $\mathbf{U}_{t}=\beta_{t}\mathbf{v}_{t}\mathbf{k}_{t}^{\top}$ is the update term\.

1\. Intra\-Chunk Computation \(Parallel\): For a chunk $b$ spanning time steps $i$ to $j$, we can compute the aggregate transition matrix $\mathbf{W}_{b}$ and aggregate update $\mathbf{U}_{b}$ in parallel\. Because $\mathbf{W}_{t}$ is a rank\-1 perturbation of the identity, the cumulative product over the chunk can be computed efficiently using the WY representation \[Sun et al\., [2024](https://arxiv.org/html/2605.16350#bib.bib24)\] or parallel associative scans\. Specifically, we compute the local memory states $\tilde{\mathbf{S}}_{t}$ within the chunk assuming a zero initial state \($\mathbf{S}_{i-1}=\mathbf{0}$\)\. This can be implemented via standard causal self\-attention masks within the chunk:

$$\text{ChunkOutput}_{b}=\text{CausalDotProduct}(\mathbf{Q}_{b},\mathbf{K}_{b},\mathbf{V}_{b},\beta_{b}). \tag{15}$$

2\. Inter\-Chunk Recurrence: Once the aggregate effect of each chunk is computed, we update the boundary states $\mathbf{S}_{b}$ sequentially:

$$\mathbf{S}_{b}=\mathbf{S}_{b-1}\mathbf{W}_{\mathrm{chunk}\_b}+\mathbf{U}_{\mathrm{chunk}\_b}. \tag{16}$$

Since the number of chunks $T/C$ is small, this sequential step is negligible\.
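The algebra behind Eqs\. \(14\) and \(16\) can be verified with a short NumPy sketch\. It is deliberately naive: the per\-chunk aggregates are built with an explicit loop rather than the WY representation or an associative scan, so it checks correctness of the decomposition and says nothing about speed\. Sizes, random inputs, and function names are illustrative assumptions\.

```python
import numpy as np

def sequential(S0, K, V, B):
    # Token-by-token Delta Rule, Eq. (9).
    S = S0.copy()
    for k, v, b in zip(K, V, B):
        S = S + b * np.outer(v - S @ k, k)
    return S

def chunk_aggregate(K, V, B):
    """Return (W_chunk, U_chunk) such that S_out = S_in @ W_chunk + U_chunk."""
    d_k, d_v = K.shape[1], V.shape[1]
    W = np.eye(d_k)
    U = np.zeros((d_v, d_k))                     # local state from a zero start
    for k, v, b in zip(K, V, B):
        W_t = np.eye(d_k) - b * np.outer(k, k)   # decay matrix of Eq. (14)
        U = U @ W_t + b * np.outer(v, k)         # update term of Eq. (14)
        W = W @ W_t
    return W, U

rng = np.random.default_rng(2)
T, C, d_k, d_v = 12, 4, 5, 3
K = rng.normal(size=(T, d_k))
K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(T, d_v))
B = rng.uniform(0.1, 0.9, size=T)

S = np.zeros((d_v, d_k))
for s in range(0, T, C):                         # Eq. (16): one step per chunk
    W_c, U_c = chunk_aggregate(K[s:s + C], V[s:s + C], B[s:s + C])
    S = S @ W_c + U_c

S_ref = sequential(np.zeros((d_v, d_k)), K, V, B)
print(np.allclose(S, S_ref))
```

Because each call to `chunk_aggregate` depends only on its own chunk, the aggregates for different chunks could be computed independently before the short sequential pass over chunk boundaries\.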

3\. Final Output: The query $\mathbf{q}_{t}$ at any time $t$ inside chunk $b$ interacts with both the intra\-chunk local memory and the passed\-down inter\-chunk memory:

$$\mathbf{o}_{t}=\underbrace{\Big(\mathbf{S}_{b-1}\prod_{\tau=i}^{t}\mathbf{W}_{\tau}\Big)\mathbf{q}_{t}}_{\text{Long-term History}}+\underbrace{\tilde{\mathbf{S}}_{t}\mathbf{q}_{t}}_{\text{Local Context}}. \tag{17}$$

Efficiency Analysis\. For training at Level 1, this chunk\-wise formulation allows us to train on long sequences \(e\.g\., 4K tokens\) with high GPU utilization, as the heavy lifting is done by parallel matrix multiplications \(Tensor Cores\)\. For inference at Level 2, token\-by\-token generation reverts to the $O(1)$ recurrent form \(Eq\. [9](https://arxiv.org/html/2605.16350#A2.E9)\), ensuring constant memory usage and low latency on edge devices\.

## Appendix C Benchmark Data Format

This appendix describes the concrete example format used in the two federated benchmarks\. In both cases, the client identity is tied to the data domain rather than sampled from a global mixture\.

### C\.1 Experimental Setup Details

We construct two scenarios to simulate extreme heterogeneity and long\-tail distributions\. The first is a Non\-IID MMLU split, where the MMLU benchmark \[Hendrycks et al\., [2021a](https://arxiv.org/html/2605.16350#bib.bib26)\] is partitioned into five disjoint super\-categories, one per client: Law/Ethics, Humanities, STEM, Math/CS, and Medical/Psychology\. Each client holds approximately 2,000 training and up to 200 test questions drawn from its assigned domain, producing a setting where no client observes the global subject mixture\. The second scenario combines the Needle In A Haystack benchmark and PG\-19 to evaluate long\-tail recall and language modeling over long horizons, up to 16K tokens\.

For MMLU, we evaluate FedNL on two backbone scales, Qwen2\.5\-1\.5B and Llama\-3\.2\-1B, against matched\-backbone FL baselines; Fed\-Mamba uses Mamba\-1\.4B as a non\-Titans memory comparison\. Long\-context experiments use Llama\-3\.2\-1B for the Transformer baselines and Titans\-Llama\-3\.2\-1B for FedNL\. Trainable parameters are restricted to LoRA adapters and, for FedNL, the LiZAttention \[Furfaro, [2025](https://arxiv.org/html/2605.16350#bib.bib10)\] memory parameters\. For MMLU we use LoRA rank $r=32$, $\alpha=64$ at learning rate $5\times 10^{-5}$; for the NIAH and PG\-19 experiments we use $r=16$, $\alpha=32$ at learning rate $3\times 10^{-4}$\. All methods train for one local epoch per round\. The NIAH experiments run for two federated rounds and report the final personalized accuracy\.

### C\.2 MMLU Federated Split

Each MMLU example is a four\-choice multiple\-choice question with a question stem, four answer options, a subject label, and a zero\-indexed gold option\. The gold index 0–3 corresponds to choices A–D\. During loading, the row is converted into the prompt

> Question: <question\> A\. <choice 0\> B\. <choice 1\> C\. <choice 2\> D\. <choice 3\> Answer:

Local training appends the gold answer letter and applies the loss to that answer token\. Evaluation uses the same prompt and scores the first generated answer letter\.
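The row\-to\-prompt conversion can be sketched as follows\. The function names and the row layout are hypothetical, and two details are assumptions of this example: fields are separated by newlines \(the template above is shown on a single line\), and a single space precedes the appended gold letter during training\.

```python
LETTERS = "ABCD"

def format_mmlu_prompt(question, choices):
    # Layout follows the template quoted above; newline separators are an assumption.
    if len(choices) != 4:
        raise ValueError("MMLU rows have exactly four choices")
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)

def training_text(question, choices, gold_index):
    # Local training appends the gold letter; the loss would cover only that token.
    return format_mmlu_prompt(question, choices) + " " + LETTERS[gold_index]

# Toy row with a zero-indexed gold option (1 corresponds to "B").
row = {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "6"], "answer": 1}
prompt = format_mmlu_prompt(row["question"], row["choices"])
text = training_text(row["question"], row["choices"], row["answer"])
print(text)
```

At evaluation time only `prompt` would be fed to the model, and the first generated letter would be compared with `LETTERS[row["answer"]]`\.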

Table 3: MMLU client partition used in the federated experiments\. The train/test columns report the examples selected by the loader after shuffling and capping each client at 2,000 training and 200 test rows\.

### C\.3 NIAH Federated Split

The main NIAH setting uses seven clients: Passkey, UUID code, Name\-date, Phrase code, Counter state, MK\-NIAH, and MV\-NIAH\. Each client contains 750 training examples, a 15\-example held\-out test split balanced over target depths $\{1024, 2048, 4096\}$, and an additional long\-depth test split at $\{8192, 16384\}$\. Each example records the target depth, insertion positions, full prompt, four answer candidates, gold answer, correct answer letter, task metadata, and the inserted needle events\.

The prompt is built by sampling a slice of WikiText\-103 filler, inserting several needle events at controlled depth fractions, and appending a four\-choice question:

> <filler prefix\> <event 1\> <filler\> \.\.\. <event m\> <filler suffix\> Question: <retrieval question\> A\) <candidate 1\> B\) <candidate 2\> C\) <candidate 3\> D\) <candidate 4\> Answer:

The correct candidate is randomly assigned to one of A–D\. Distractors are hard negatives from the same haystack whenever possible, so selecting a value that appeared in context is insufficient unless it is bound to the queried key, rank, or state\.

Table 4: Needle templates in the seven\-client NIAH split\. Each ordinary retrieval client inserts four target\-like events per haystack; the question selects one event and uses the other observed values as distractors\.

For clients with four events, events are placed at the first four canonical depth fractions $\{0.10, 0.30, 0.50, 0.70\}$ within the filler budget\. Stateful clients such as Counter state can contain more events; in that case event positions are spread approximately uniformly across the haystack\. This construction makes all seven clients multi\-needle retrieval tasks rather than single\-needle lookup tasks\.
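The four\-event placement can be sketched with a small prompt builder\. This is an illustrative reconstruction, not the paper’s loader: the function name is hypothetical, filler is counted in whitespace\-separated words rather than tokenizer tokens, newline separators are assumed, and the uniformly spread placement used by stateful clients is not covered\.

```python
def build_niah_prompt(filler_tokens, events, question, candidates,
                      fractions=(0.10, 0.30, 0.50, 0.70)):
    """Insert events into filler at depth fractions and append a 4-way question.

    filler_tokens: list of filler words (stand-in for a WikiText-103 slice).
    events: needle sentences, at most len(fractions) of them.
    Returns the prompt string and the word offsets where events were inserted.
    """
    if len(events) > len(fractions) or len(candidates) != 4:
        raise ValueError("expected at most 4 events and exactly 4 candidates")
    n = len(filler_tokens)
    positions = [round(f * n) for f in fractions[:len(events)]]
    parts, prev = [], 0
    for pos, event in zip(positions, events):
        parts.append(" ".join(filler_tokens[prev:pos]))
        parts.append(event)
        prev = pos
    parts.append(" ".join(filler_tokens[prev:]))
    body = " ".join(p for p in parts if p)
    options = [f"{letter}) {c}" for letter, c in zip("ABCD", candidates)]
    prompt = "\n".join([body, f"Question: {question}", *options, "Answer:"])
    return prompt, positions


filler = [f"w{i}" for i in range(20)]
events = [f"The code for key{i} is {1000 + i}." for i in range(4)]
prompt, positions = build_niah_prompt(
    filler, events, "What is the code for key2?",
    ["1000", "1001", "1002", "1003"])
print(positions)
```

Using the other inserted values as candidates, as in this toy call, mirrors the hard\-negative idea described above: every option appears in the haystack, so only the key binding identifies the right one\.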

## Appendix D Broader Impacts

FedNL aims to improve federated learning for language models under heterogeneous and long\-context client data\. Its potential positive impacts include enabling more adaptive on\-device or edge language models, reducing the need to centralize raw user data, and lowering communication costs by exchanging compact memory\-update rules rather than full model parameters\. These properties may make privacy\-preserving and bandwidth\-efficient collaborative training more accessible in settings such as personalized assistants, domain\-specific reasoning tools, and resource\-constrained deployments\.

At the same time, FedNL inherits several risks associated with adaptive language models and federated learning\. First, improved test\-time adaptation may make models more effective in benign applications, but it could also improve the capability of systems used for harmful purposes, such as generating misleading content, automating social engineering, or adapting to user\-specific contexts in manipulative ways\. Second, although FedNL keeps raw data and transient memory states local, federated updates can still carry privacy risks through model\-update leakage or membership\-inference attacks\. Therefore, practical deployments should consider standard privacy protections such as secure aggregation, differential privacy, careful logging policies, and auditing of communicated updates\.

Third, heterogeneous client distributions can create fairness and reliability concerns\. A model that adapts strongly to local context may perform unevenly across domains, dialects, demographic groups, or low\-resource settings, especially when some client distributions are underrepresented during federated training\. In sensitive applications such as medicine, law, or education, incorrect test\-time adaptation could lead to misleading or harmful outputs even when the system is used as intended\. Deployments should therefore include domain\-specific evaluation, uncertainty monitoring, human oversight, and safeguards against overconfident predictions\.

Finally, FedNL is a methodological contribution rather than a deployed system\. We do not release user data or propose an application\-specific decision\-making pipeline\. Nevertheless, because the method can improve efficient adaptation of language models, downstream uses should be evaluated for privacy, security, fairness, and misuse risks before deployment\.

## Appendix E Asset Licenses and Terms of Use

Our experiments use existing pretrained model backbones, benchmark datasets, and software components\. We cite the original sources for all assets used in the paper and use them only for research purposes in accordance with their respective licenses and terms of use\.

The pretrained language\-model backbones include Llama\-3\.2\-1B, Qwen2\.5\-1\.5B, and Mamba\-1\.4B\. The federated and long\-context experiments use public benchmark datasets including MMLU, NIAH/RULER\-style retrieval tasks, PG\-19, and WikiText\-103 filler text for prompt construction\. The implementation further builds on publicly described components such as LoRA, Titans\-style memory, and LiZAttention\. We do not claim ownership over these assets\.
