FedQHD: Closed-Form Function-Space Federated Reinforcement Learning

arXiv cs.LG Papers

Summary

This paper proposes FedQHD, a novel federated Q-learning method using hyperdimensional random-feature state encoders with linear readouts to enable closed-form function-space aggregation, addressing the federation gap due to heterogeneous client encoders.

arXiv:2605.29002v1 Announce Type: new Abstract: Federated reinforcement learning enables decentralized agents to collaboratively improve policies or value estimates without exchanging raw trajectories. However, FedAvg-style parameter averaging is not function-space consistent: when clients use heterogeneous encoders or even identical nonlinear networks, averaged parameters need not correspond to the weighted average of client value functions in any common function space. We propose FedQHD, a federated Q-learning method using hyperdimensional (random-feature) state encoders with a linear readout, so that Q-functions are nonlinear in state yet linear in trainable parameters. This linear structure enables closed-form aggregation. With a shared encoder, the function-space consensus update coincides exactly with weighted averaging of local readout matrices. With heterogeneous encoders, the server constructs a global teacher by averaging client Q-values on a shared anchor-state set, and each client compiles this teacher into its local representation via a single ridge projection. We formalize the federation gap -- the error incurred when compiling a federated teacher into a heterogeneous client representation -- relative to a client-specific oracle projection. We show that this gap decomposes into subspace misalignment, anchor-set conditioning, and regularization bias. We further identify the anchor-to-dimension ratio $m \geq D_i$ as the well-conditioned regime in which the gap reduces to a multiple of the encoder heterogeneity floor. On four continuous-state, discrete-action control benchmarks, FedQHD matches or outperforms FedAvg-style baselines and distillation-based alternatives while requiring substantially less computation, and the empirical dependence of the federation gap on encoder dimension matches our theoretical analysis.

Cached at: 05/29/26, 09:14 AM

# FedQHD: Closed-Form Function-Space Federated Reinforcement Learning
Source: [https://arxiv.org/html/2605.29002](https://arxiv.org/html/2605.29002)
Yuchen Hou¹, Yongshan Chen¹, Zhuowen Zou², Calvin Yeung², Mohsen Imani², Tian Lan³, Mahdi Imani¹

¹Northeastern University, ²University of California, Irvine, ³The George Washington University

\{hou\.yuchen, chen\.yongs, m\.imani\}@northeastern\.edu, \{zhuowez1, chyeung2, m\.imani\}@uci\.edu, tlan@gwu\.edu

###### Abstract

Federated reinforcement learning enables decentralized agents to collaboratively improve policies or value estimates without exchanging raw trajectories\. However, FedAvg\-style parameter averaging is not function\-space consistent: when clients use heterogeneous encoders or even identical nonlinear networks, averaged parameters need not correspond to the weighted average of client value functions in any common function space\. We propose *FedQHD*, a federated Q\-learning method using hyperdimensional \(random\-feature\) state encoders with a linear readout, so that Q\-functions are nonlinear in state yet linear in trainable parameters\. This linear structure enables closed\-form aggregation\. With a shared encoder, the function\-space consensus update coincides exactly with weighted averaging of local readout matrices\. With heterogeneous encoders, the server constructs a global teacher by averaging client Q\-values on a shared anchor\-state set, and each client compiles this teacher into its local representation via a single ridge projection\. We formalize the *federation gap*, the error incurred when compiling a federated teacher into a heterogeneous client representation, relative to a client\-specific oracle projection\. We show that this gap decomposes into subspace misalignment, anchor\-set conditioning, and regularization bias\. We further identify the anchor\-to\-dimension ratio $m \geq D_i$ as the well\-conditioned regime in which the gap reduces to a multiple of the encoder heterogeneity floor\. On four continuous\-state, discrete\-action control benchmarks, FedQHD matches or outperforms FedAvg\-style baselines and distillation\-based alternatives while requiring substantially less computation, and the empirical dependence of the federation gap on encoder dimension matches our theoretical analysis\.

## 1 Introduction

Reinforcement learning \(RL\) systems in autonomous vehicles \(Liang et al\., [2022](https://arxiv.org/html/2605.29002#bib.bib8); Chellapandi et al\., [2023](https://arxiv.org/html/2605.29002#bib.bib7)\), industrial robots \(Liu et al\., [2019](https://arxiv.org/html/2605.29002#bib.bib6)\), and resource\-constrained edge devices \(Yu et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib5)\) often learn from on\-device interaction data that cannot be centralized due to communication costs, privacy requirements, and the volume of on\-device experience\. Federated reinforcement learning \(FedRL\) targets this setting by allowing agents to improve jointly without sharing raw trajectories \(Zhuo et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib29); Qi et al\., [2021](https://arxiv.org/html/2605.29002#bib.bib10)\)\.

Most FedRL pipelines inherit parameter averaging \(FedAvg\) from supervised federated learning \(McMahan et al\., [2017](https://arxiv.org/html/2605.29002#bib.bib30)\): clients train locally, the server averages parameters, and the averaged model is broadcast back\. However, federated *Q*\-learning exposes two structural obstacles\. First, weight averaging of nonlinear value networks is not value\-function averaging; achieving function\-space agreement typically requires additional optimization\. Second, practical deployments are structurally heterogeneous: clients may use different encoders, feature dimensions, or architectures, making parameter averaging algebraically undefined \(Fan et al\., [2023](https://arxiv.org/html/2605.29002#bib.bib33); Jiang et al\., [2025](https://arxiv.org/html/2605.29002#bib.bib22)\)\.

The dominant approach to heterogeneous federation is knowledge distillation \(Li and Wang, [2019](https://arxiv.org/html/2605.29002#bib.bib23); Lin et al\., [2021](https://arxiv.org/html/2605.29002#bib.bib34); Jiang et al\., [2025](https://arxiv.org/html/2605.29002#bib.bib22)\), which exchanges predictions on shared query states and iteratively trains local students toward an ensembled teacher\. Distillation introduces per\-round iterative optimization, hyperparameter sensitivity, and instability under the nonstationary Bellman targets of online RL \(Czarnecki et al\., [2019](https://arxiv.org/html/2605.29002#bib.bib19)\)\. We pursue an alternative that remains well\-defined under heterogeneous representations without iterative teacher–student training\.

Hyperdimensional computing \(HDC\), and more broadly fixed random\-feature value approximation, offers an alternative value representation: states are mapped through a fixed high\-dimensional feature map and action values are produced by a linear readout \(Kanerva, [2009](https://arxiv.org/html/2605.29002#bib.bib43)\)\. This linear readout enables closed\-form least\-squares\-style updates and avoids backpropagation in hyperdimensional Q\-learning \(QHD\) \(Ni et al\., [2022a](https://arxiv.org/html/2605.29002#bib.bib38)\)\. The linearity in the trainable parameters also simplifies federation: for linear\-in\-parameters value functions, averaging in value function space coincides exactly with averaging parameters \(Lagoudakis and Parr, [2003](https://arxiv.org/html/2605.29002#bib.bib17); Bhandari et al\., [2018](https://arxiv.org/html/2605.29002#bib.bib15)\), and heterogeneous aggregation reduces to a projection step rather than iterative distillation\.

We propose *FedQHD*, a federated $Q$\-learning framework that aggregates clients through their $Q$\-values and remains well\-defined under encoder heterogeneity\. With a shared encoder, the federated update reduces exactly to weighted averaging of readout matrices, recovering FedAvg in closed form\. With heterogeneous encoders, the server forms a teacher by averaging client $Q$\-values on a shared *anchor* set, and each client compiles this teacher into its own representation via a single ridge\-regression solve per round, without exchanging trajectories and without iterative optimization\.

Our contributions are:

- Closed\-form federation under heterogeneous encoders\. We propose a closed\-form federated $Q$\-learning algorithm that handles heterogeneous encoders in a single step, compiling a function\-space teacher into each client’s local representation via anchor\-based ridge regression and recovering FedAvg exactly when encoders are shared\.
- Pointwise bound on the federation gap\. We derive a pointwise bound that decomposes the gap into three interpretable terms \(encoder heterogeneity, anchor conditioning, and ridge shrinkage\) and identify $m \geq D_i$ as the well\-conditioned regime in which the gap reduces to a multiple of the heterogeneity floor\.
- Empirical validation on four continuous\-state control benchmarks\. We conduct experiments on four continuous\-state, discrete\-action control tasks under both homogeneous and heterogeneous encoders, showing that FedQHD matches or exceeds federated DQN baselines while running substantially faster than distillation\-based alternatives, with ablations confirming the predicted dependence on encoder dimension and anchor\-set size\.

## 2 Related Work

#### Federated RL with shared parameterizations\.

Federated learning was popularized by FedAvg, which aggregates client models through iterative parameter averaging \(McMahan et al\., [2017](https://arxiv.org/html/2605.29002#bib.bib30)\)\. Early federated RL systems applied this paradigm by sharing neural value or policy network parameters across agents \(Zhuo et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib29); Nadiger et al\., [2019](https://arxiv.org/html/2605.29002#bib.bib28)\), including applications such as autonomous driving under distribution shift \(Liang et al\., [2022](https://arxiv.org/html/2605.29002#bib.bib8)\) and Byzantine\-robust policy gradients \(Fan et al\., [2021](https://arxiv.org/html/2605.29002#bib.bib27)\)\. More recent work established finite\-time guarantees for federated TD and Q\-learning under Markovian sampling \(Khodadadian et al\., [2022](https://arxiv.org/html/2605.29002#bib.bib26)\) and analyzed performance degradation under environment heterogeneity \(Jin et al\., [2022](https://arxiv.org/html/2605.29002#bib.bib39)\)\. However, these approaches assume a *shared parameterization* across clients: FedAvg\-style aggregation requires identical parameter shapes and is undefined when clients use different encoders or feature dimensions\.

#### Variance reduction and personalization\.

Several works address optimization drift in FedAvg\. FedProx \(Li et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib3)\) introduces proximal regularization, while SCAFFOLD \(Karimireddy et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib2)\) uses control variates to reduce client variance\. Personalized federated learning methods further allow each client to maintain a locally adapted model \(Fallah et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib1)\)\. In contrast, FedQHD eliminates client drift entirely in the homogeneous case \(exact aggregation\) and handles heterogeneous encoders through a closed\-form ridge projection rather than iterative optimization\.

#### Distillation\-based federation under heterogeneity\.

Knowledge distillation aggregates models in output space rather than parameter space \(Hinton et al\., [2015](https://arxiv.org/html/2605.29002#bib.bib31)\), enabling federation across heterogeneous architectures\. In supervised federated learning \(Li and Wang, [2019](https://arxiv.org/html/2605.29002#bib.bib23); Lin et al\., [2021](https://arxiv.org/html/2605.29002#bib.bib34); Zhu et al\., [2021](https://arxiv.org/html/2605.29002#bib.bib21); Chen and Chao, [2020](https://arxiv.org/html/2605.29002#bib.bib20)\), methods differ in proxy\-data assumptions but all rely on iterative gradient\-based fitting\. In RL, policy distillation \(Rusu et al\., [2016](https://arxiv.org/html/2605.29002#bib.bib32)\) and Distral \(Teh et al\., [2017](https://arxiv.org/html/2605.29002#bib.bib35)\) introduced function\-space transfer mechanisms\. Recent heterogeneous FedRL approaches adopt similar principles: FedHQL aggregates models through server\-side queries \(Fan et al\., [2023](https://arxiv.org/html/2605.29002#bib.bib33)\), SCCD distills ensembles using pseudo\-data \(Mai et al\., [2023](https://arxiv.org/html/2605.29002#bib.bib4)\), and FedHPD matches action distributions on shared anchor states \(Jiang et al\., [2025](https://arxiv.org/html/2605.29002#bib.bib22)\)\. These approaches require iterative teacher–student optimization and can be sensitive to design and hyperparameter choices, particularly under nonstationary Bellman targets \(Czarnecki et al\., [2019](https://arxiv.org/html/2605.29002#bib.bib19)\)\.

#### Linear function approximation, kernels, and random features in RL\.

Linear value\-function approximation has long provided stable and analyzable RL algorithms\. Least\-squares approaches such as LSPI and fitted Q\-iteration formulate Bellman updates as regression problems with closed\-form solutions \(Lagoudakis and Parr, [2003](https://arxiv.org/html/2605.29002#bib.bib17); Ernst et al\., [2005](https://arxiv.org/html/2605.29002#bib.bib16)\), while finite\-time guarantees for linear TD have been established under both i\.i\.d\. and Markovian sampling \(Bhandari et al\., [2018](https://arxiv.org/html/2605.29002#bib.bib15)\)\. Kernel and basis\-function methods extend this framework to nonlinear state representations while retaining linear parameter structure \(Ormoneit and Sen, [2002](https://arxiv.org/html/2605.29002#bib.bib12); Konidaris et al\., [2011](https://arxiv.org/html/2605.29002#bib.bib11)\)\. Random Fourier features provide scalable kernel approximations \(Rahimi and Recht, [2007](https://arxiv.org/html/2605.29002#bib.bib41)\), and regret analyses connect reproducing kernel Hilbert space \(RKHS\) geometry to RL sample complexity \(Jin et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib14)\)\. HDC \(Kanerva, [2009](https://arxiv.org/html/2605.29002#bib.bib43)\) can be viewed as a high\-dimensional random\-feature instantiation; QHD and HDPG demonstrate that HDC encoders enable efficient RL with linear readouts and least\-squares\-style updates \(Ni et al\., [2022a](https://arxiv.org/html/2605.29002#bib.bib38), [b](https://arxiv.org/html/2605.29002#bib.bib13)\)\.

#### Positioning of FedQHD\.

FedQHD addresses *structural heterogeneity* in federated Q\-learning, where clients may use different encoders and parameter dimensions and parameter averaging becomes ill\-defined \(Fan et al\., [2023](https://arxiv.org/html/2605.29002#bib.bib33); Jiang et al\., [2025](https://arxiv.org/html/2605.29002#bib.bib22)\)\. Instead of iterative distillation, FedQHD aggregates Q\-values on a shared anchor\-state interface and compiles the resulting consensus into each client representation via a one\-shot ridge projection\. In the homogeneous limit \(shared encoder\), this procedure reduces exactly to parameter averaging, connecting classical federated learning with heterogeneous value\-function aggregation\.

## 3 Preliminaries

### 3\.1 Markov decision processes and off\-policy value learning

We consider a Markov decision process $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a finite action set, $P: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ is the transition kernel, $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, and $\gamma \in (0,1)$ is the discount factor\. The optimal action\-value function $Q^{\star}$ is the unique fixed point of the Bellman optimality operator,

$$Q^{\star}(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\left[\max_{a' \in \mathcal{A}} Q^{\star}(s',a')\right],$$

with optimal policy $\pi^{\star}(s) = \arg\max_{a} Q^{\star}(s,a)$\. Since $\mathcal{S}$ is continuous, $Q^{\star}$ is approximated using a standard off\-policy temporal\-difference framework: transitions $(s,a,r,s')$ are stored in a replay buffer and $Q^{\star}$ is estimated by semi\-gradient updates against a periodically frozen target network \(Ni et al\., [2022a](https://arxiv.org/html/2605.29002#bib.bib38)\)\.

### 3\.2 Hyperdimensional computing \(HDC\)

HDC is a brain\-inspired computational paradigm in which symbols and structured entities are represented as high\-dimensional vectors, called *hypervectors*, with components drawn independently from simple distributions \(Kanerva, [2009](https://arxiv.org/html/2605.29002#bib.bib43)\)\. In such spaces, independently sampled hypervectors are nearly orthogonal with high probability, a geometric property that underlies classical HDC operations such as *bundling* \(superposition via addition\), *binding* \(association via elementwise multiplication or permutation\), and *permutation* \(role shifting\)\. Because information is distributed holographically across all dimensions, HDC representations exhibit strong robustness to noise, quantization, and partial corruption\.

Given an input $x$, an HDC encoder produces a bounded hypervector $\phi(x) \in \mathbb{K}^{D}$ using random projections or compositional schemes built from a small set of base hypervectors\. These encoders admit a precise random\-feature interpretation: the empirical kernel $k_D(x,x') = \langle \phi(x), \phi(x') \rangle$ converges to a smooth limiting kernel $k_{\ast}(x,x')$ as $D \to \infty$, with uniform concentration at rate $O(D^{-1/2})$ \(Rahimi and Recht, [2007](https://arxiv.org/html/2605.29002#bib.bib41); Bach, [2015](https://arxiv.org/html/2605.29002#bib.bib42); Rudi and Rosasco, [2017](https://arxiv.org/html/2605.29002#bib.bib44)\)\. As a consequence, linear prediction in the hypervector domain serves as a computationally efficient, finite\-dimensional approximation to kernel methods in the RKHS associated with $k_{\ast}$\.
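The random\-feature interpretation can be checked numerically\. The sketch below is only an illustration under stated assumptions: it uses a random Fourier feature encoder whose limiting kernel is a Gaussian kernel, which is one possible choice and not necessarily the encoder used in the paper\. The names `rff_encoder`, `gaussian_kernel`, and the bandwidth value are hypothetical\.

```python
import numpy as np

def rff_encoder(dim_in, D, bandwidth=1.0, seed=0):
    """Random Fourier feature map phi: R^dim_in -> R^D (assumed encoder)."""
    rng = np.random.default_rng(seed)
    omega = rng.normal(scale=1.0 / bandwidth, size=(D, dim_in))
    phase = rng.uniform(0.0, 2.0 * np.pi, size=D)

    def phi(x):
        x = np.atleast_2d(x)
        return np.sqrt(2.0 / D) * np.cos(x @ omega.T + phase)

    return phi

def gaussian_kernel(x, y, bandwidth=1.0):
    """Limiting kernel k_*(x, x') for the feature map above."""
    sq_dist = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(1)
states = rng.uniform(-1.0, 1.0, size=(20, 4))
k_star = gaussian_kernel(states, states)

# Worst-case gap between the empirical kernel k_D and k_* for growing D.
errors = {}
for D in (16, 256, 8192):
    feats = rff_encoder(4, D)(states)
    errors[D] = float(np.max(np.abs(feats @ feats.T - k_star)))
    print(D, errors[D])
```

The printed gap should shrink as $D$ grows, in line with the $O(D^{-1/2})$ rate quoted above; the exact values depend on the seed\.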

## 4 Problem Formulation

#### QHD: hyperdimensional Q\-learning for continuous\-state control\.

We consider an MDP for each agent $i$ with continuous states $s \in \mathcal{S}$ and discrete actions $a \in \mathcal{A}$\. Agent $i$ selects a hyperdimensional encoder $\Phi_i: \mathcal{S} \to \mathbb{K}^{D_i}$, $\mathbb{K} \in \{\mathbb{R}, \mathbb{C}\}$, mapping states to $D_i$\-dimensional hypervectors\. For each action $a$, it maintains a parameter hypervector $\mathbf{w}_{i,a} \in \mathbb{K}^{D_i}$\. We use $\langle \cdot, \cdot \rangle$ to denote the standard Hermitian inner product on $\mathbb{K}^{D_i}$, $\langle u, v \rangle = u^{\mathrm{H}} v = \sum_{k=1}^{D_i} \overline{u_k}\, v_k$, and we write $\Re(z)$ for the real part of $z \in \mathbb{C}$, with the convention $\Re(z) = z$ when $z \in \mathbb{R}$\. The approximated action\-value is

$$Q_i(s,a) = \Re\bigl(\langle \Phi_i(s), \mathbf{w}_{i,a} \rangle\bigr), \tag{1}$$

so that stacking $W_i = [\mathbf{w}_{i,a}]_{a \in \mathcal{A}} \in \mathbb{K}^{D_i \times |\mathcal{A}|}$ yields $Q_i(s,\cdot) = \Re(\Phi_i(s)^{\mathrm{H}} W_i)$\. Although $\Phi_i$ may be nonlinear in $s$, the model \([1](https://arxiv.org/html/2605.29002#S4.E1)\) is linear in parameters\. QHD updates $W_i$ via standard off\-policy TD with a delayed target network \(Ni et al\., [2022a](https://arxiv.org/html/2605.29002#bib.bib38)\); the explicit semi\-gradient update rule is given in Appendix [A](https://arxiv.org/html/2605.29002#A1)\.
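To make the readout in Eq\. \(1\) concrete, the following sketch evaluates $Q_i(s,\cdot) = \Re(\Phi_i(s)^{\mathrm{H}} W_i)$ and applies one textbook semi\-gradient Q\-learning step to a single readout column\. It is a minimal illustration, not the update rule from Appendix A, which is not reproduced here\. The unit\-norm complex phasor encoder, the step size `alpha`, the discount `gamma`, and all function names are assumptions made for this example\.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, D, n_actions = 4, 256, 3
gamma, alpha = 0.99, 0.5  # illustrative values, not taken from the paper

# Assumed encoder: complex phasors with unit norm,
# Phi(s)_k = exp(i * omega_k . s) / sqrt(D).
omega = rng.normal(size=(D, state_dim))

def encode(s):
    return np.exp(1j * (omega @ s)) / np.sqrt(D)

def q_values(s, W):
    """Q(s, .) = Re(Phi(s)^H W); returns one value per action."""
    return np.real(np.conj(encode(s)) @ W)

def td_step(W, W_target, s, a, r, s_next):
    """One semi-gradient step on column a against a frozen target readout."""
    target = r + gamma * np.max(q_values(s_next, W_target))
    delta = target - q_values(s, W)[a]
    W_new = W.copy()
    # The gradient of Re(Phi(s)^H w_a) with respect to w_a is Phi(s).
    W_new[:, a] += alpha * delta * encode(s)
    return W_new, delta

W = 0.1 * (rng.normal(size=(D, n_actions)) + 1j * rng.normal(size=(D, n_actions)))
W_target = W.copy()
s = rng.uniform(-1.0, 1.0, size=state_dim)
s_next = rng.uniform(-1.0, 1.0, size=state_dim)
a, r = 1, 1.0

W_new, delta_before = td_step(W, W_target, s, a, r, s_next)
_, delta_after = td_step(W_new, W_target, s, a, r, s_next)
print(delta_before, delta_after)
```

Because the assumed encoder has unit norm, one step shrinks the TD error on the visited pair by the factor $1 - \alpha$, which is a direct consequence of the model being linear in $W_i$\.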

#### Federated value learning objective\.

We consider $N$ agents \(clients\) learning in parallel from private data\. They do not share raw data; instead, after local training, each client $i$ has learned a Q\-function $Q_i(s,a) = \Re(\Phi_i(s)^{\mathrm{H}} W_i)$\. The server defines a global *function\-space* objective by aggregating these models under an agreement distribution $\mu$ \(implemented via a shared anchor set; see below\)\.

Concretely, given weights $\{\pi_i\}$ with $\sum_i \pi_i = 1$, the server seeks the global value function $Q^{\mathrm{glob}}$ that minimizes

$$\sum_{i=1}^{N} \pi_i\, \mathbb{E}_{(s,a) \sim \mu}\bigl[(Q_i(s,a) - Q(s,a))^{2}\bigr]. \tag{2}$$

By standard projection arguments, the minimizer is the weighted average in function space:

$$Q^{\mathrm{glob}}(s,a) = \sum_{i} \pi_i Q_i(s,a) \quad (\mu\text{-a.e.}). \tag{3}$$

Here $\mu$ denotes the agreement distribution over state\-action pairs; in practice, we implement $\mu$ via a discrete anchor set\. The server thus obtains a global action\-value model in closed form without accessing client trajectories\. The problem is to compile this function\-space teacher into a parameter vector for each client’s representation\.
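The projection argument behind Eq\. \(3\) can be spelled out pointwise\. Writing $Q^{\mathrm{avg}} = \sum_i \pi_i Q_i$ \(a shorthand introduced only for this remark\) and using $\sum_i \pi_i = 1$, the integrand of Eq\. \(2\) splits as

$$\sum_{i=1}^{N} \pi_i \bigl(Q_i - Q\bigr)^{2} = \sum_{i=1}^{N} \pi_i \bigl(Q_i - Q^{\mathrm{avg}}\bigr)^{2} + \bigl(Q^{\mathrm{avg}} - Q\bigr)^{2},$$

because the cross term $2\,(Q^{\mathrm{avg}} - Q) \sum_i \pi_i (Q_i - Q^{\mathrm{avg}})$ vanishes\. The first term does not depend on $Q$, so the expectation under $\mu$ is minimized exactly when $Q = Q^{\mathrm{avg}}$ $\mu$\-almost everywhere\.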

Although Eq\. \([2](https://arxiv.org/html/2605.29002#S4.E2)\) is defined in function space, FedAvg\-style aggregation either requires iterative optimization \(for nonlinear approximators\) or becomes algebraically undefined \(under heterogeneous encoders\)\. FedQHD resolves both by exploiting the linearity in parameters of \([1](https://arxiv.org/html/2605.29002#S4.E1)\)\.

## 5 FedQHD: Function\-Space Federation with Closed\-Form Compilation

We present FedQHD in two regimes\. When clients share an encoder, the federation step is exact in closed form\. When clients use heterogeneous encoders, the server aggregates client predictions on a shared anchor\-state set, and each client compiles the resulting teacher into its local feature space via a closed\-form ridge solve\.

### 5\.1 Homogeneous encoders

If all clients share an encoder $\Phi$, substituting into \([3](https://arxiv.org/html/2605.29002#S4.E3)\) gives the global model $Q^{\mathrm{glob}}(s,a) = \Re\bigl(\Phi(s)^{\mathrm{H}} W^{\mathrm{glob}}\bigr)$, where $W^{\mathrm{glob}} \in \mathbb{K}^{D \times |\mathcal{A}|}$ is the minimizer of the quadratic objective $\sum_{i=1}^{N} \pi_i\, \mathbb{E}_{(s,a) \sim \mu}\bigl[\langle \Phi(s), (W_i - W^{\mathrm{glob}})_{:,a} \rangle^{2}\bigr]$\. The closed\-form solution is:

$$W^{\mathrm{glob}} = \sum_{i=1}^{N} \pi_i W_i.$$

That is, when $\Phi_i = \Phi$, function\-space consensus coincides exactly with parameter averaging, which is independent of $\mu$\. The server simply computes the weighted average of the local weight matrices without requiring raw data and sends $W^{\mathrm{glob}}$ back to the clients as their updated parameters\. The homogeneous FedQHD procedure is presented as pseudocode in Appendix [B](https://arxiv.org/html/2605.29002#A2)\.
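The equivalence between parameter averaging and function\-space averaging under a shared encoder is easy to verify numerically\. The sketch below assumes a real\-valued random Fourier encoder and randomly drawn client readouts; these choices, and the variable names, are illustrative only and do not reproduce the pseudocode of Appendix B\.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, D, n_actions, n_clients = 3, 128, 4, 5

# Shared encoder Phi (assumed: real random Fourier features).
omega = rng.normal(size=(D, state_dim))
phase = rng.uniform(0.0, 2.0 * np.pi, size=D)

def encode(states):
    return np.sqrt(2.0 / D) * np.cos(states @ omega.T + phase)

# Stand-ins for locally trained readouts W_i and aggregation weights pi_i.
client_W = rng.normal(size=(n_clients, D, n_actions))
weights = rng.dirichlet(np.ones(n_clients))

# Server step: weighted average of readout matrices.
W_glob = np.tensordot(weights, client_W, axes=1)  # shape (D, n_actions)

# Compare with averaging the client Q-functions on arbitrary states.
test_states = rng.uniform(-1.0, 1.0, size=(50, state_dim))
features = encode(test_states)
q_from_avg_params = features @ W_glob
q_avg_of_functions = sum(p * (features @ W) for p, W in zip(weights, client_W))

max_gap = float(np.max(np.abs(q_from_avg_params - q_avg_of_functions)))
print(max_gap)
```

The gap is zero up to floating\-point rounding for any choice of test states, which reflects the independence from $\mu$ noted above\.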

### 5\.2 Heterogeneous encoders

When clients use different encoders $\Phi_i$ with possibly different dimensions $D_i$, direct parameter averaging is not defined\. Instead, we align client representations through a *shared anchor set*\. The server samples a set of reference states $\mathcal{S}_{\mathrm{ref}} = \{s_1, \dots, s_m\}$\. The anchor set can be obtained from random rollouts, a shared unlabeled dataset, or states encountered during local training; in our experiments we use random rollouts\.

Each client $i$ evaluates its Q\-function on the anchors, $Q^{\mathrm{ref}}_i(s_j,a) = Q_i(s_j,a)$, forming a matrix $Q^{\mathrm{ref}}_i \in \mathbb{R}^{m \times |\mathcal{A}|}$\. The server aggregates these predictions to obtain the anchor teacher

$$Q^{\mathrm{glob}}_{\mathrm{ref}} = \sum_{i=1}^{N} \pi_i Q^{\mathrm{ref}}_i.$$

Client $i$ then fits parameters $W_i^{\mathrm{glob}}$ by solving

$$\min_{W \in \mathbb{K}^{D_i \times |\mathcal{A}|}} \sum_{j=1}^{m} \sum_{a \in \mathcal{A}} \Bigl(\Re(\Phi_i(s_j)^{\mathrm{H}} w_a) - Q^{\mathrm{glob}}_{\mathrm{ref}}(s_j,a)\Bigr)^{2} + \lambda \|W\|_F^{2}.$$

Since the model is linear in $W$, the solution has the closed form

$$W^{\mathrm{glob}}_i = \left(X_i^{\mathrm{H}} X_i + \lambda I_{D_i}\right)^{-1} X_i^{\mathrm{H}} Q^{\mathrm{glob}}_{\mathrm{ref}},$$

when $\lambda > 0$ or $X_i^{\mathrm{H}} X_i$ is full rank\. Here $X_i \in \mathbb{K}^{m \times D_i}$ is the anchor feature matrix whose $j$\-th row is $\Phi_i(s_j)^{\mathrm{H}}$\. This corresponds to ridge regression projecting the global teacher onto client $i$’s feature space\. After updating to $W_i^{\mathrm{glob}}$, the client resumes local RL updates until the next federation round\. Equivalently, using the Woodbury identity, $W_i^{\mathrm{glob}} = X_i^{\mathrm{H}} (G_i + \lambda I_m)^{-1} Q_{\mathrm{ref}}^{\mathrm{glob}}$, where $G_i = X_i X_i^{\mathrm{H}}$ is the anchor Gram matrix; this form requires inverting an $m \times m$ matrix rather than a $D_i \times D_i$ one\. Algorithm [2](https://arxiv.org/html/2605.29002#alg2) is provided in Appendix [B](https://arxiv.org/html/2605.29002#A2)\.
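One federation round under heterogeneous encoders can be sketched as follows\. This is a minimal illustration restricted to the real case $\mathbb{K} = \mathbb{R}$, with assumed random Fourier encoders of two different dimensions and random stand\-ins for the locally trained readouts; it is not Algorithm 2 from Appendix B\. The helper names `make_encoder`, `compile_primal`, and `compile_dual` are hypothetical\.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, n_actions, m, lam = 3, 2, 64, 1e-2
dims = [32, 128]                 # heterogeneous encoder dimensions D_i
weights = np.array([0.4, 0.6])   # aggregation weights pi_i

def make_encoder(D, seed):
    """Assumed client encoder: real random Fourier features of dimension D."""
    r = np.random.default_rng(seed)
    omega = r.normal(size=(D, state_dim))
    phase = r.uniform(0.0, 2.0 * np.pi, size=D)
    return lambda S: np.sqrt(2.0 / D) * np.cos(S @ omega.T + phase)

encoders = [make_encoder(D, seed=10 + k) for k, D in enumerate(dims)]
local_W = [rng.normal(size=(D, n_actions)) for D in dims]  # stand-in readouts

# Shared anchor states and per-client anchor feature matrices X_i (m x D_i).
anchors = rng.uniform(-1.0, 1.0, size=(m, state_dim))
X = [enc(anchors) for enc in encoders]

# Server: average the client Q-values on the anchors to form the teacher.
teacher = sum(p * (Xi @ Wi) for p, Xi, Wi in zip(weights, X, local_W))

def compile_primal(Xi, target, lam):
    """Ridge solve in feature space: (X^T X + lam I_D)^{-1} X^T target."""
    D = Xi.shape[1]
    return np.linalg.solve(Xi.T @ Xi + lam * np.eye(D), Xi.T @ target)

def compile_dual(Xi, target, lam):
    """Equivalent anchor-space form: X^T (X X^T + lam I_m)^{-1} target."""
    n_anchor = Xi.shape[0]
    return Xi.T @ np.linalg.solve(Xi @ Xi.T + lam * np.eye(n_anchor), target)

compiled = [compile_primal(Xi, teacher, lam) for Xi in X]
compiled_dual = [compile_dual(Xi, teacher, lam) for Xi in X]
for Wp, Wd in zip(compiled, compiled_dual):
    print(Wp.shape, float(np.max(np.abs(Wp - Wd))))
```

Both forms return the same readout up to rounding, so a client can pick whichever system is smaller: the $D_i \times D_i$ solve when $D_i < m$, and the $m \times m$ solve otherwise\.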

As analyzed in Sec\. [6](https://arxiv.org/html/2605.29002#S6), the resulting compilation error depends on representation mismatch, anchor conditioning, and regularization\.

## 6 Theoretical Analysis

We analyze the static error induced when a federated teacher is compiled into a client\-specific linear representation\. The results characterize representation/compilation error conditional on fixed local predictors; they are not a convergence theorem for the full online FedQHD training dynamics\.

### 6\.1 Federated Representation Mismatch

Let $Q^{*}$ denote the true optimal action\-value function\. For each client $i$, define its *oracle projection* $\hat{Q}_i$ as the best possible approximation to $Q^{*}$ within its function class:

$$\hat{Q}_i(s,a) \coloneqq \arg\min_{Q \in \mathcal{F}_i} \mathbb{E}_{(s,a) \sim \mu}\bigl(Q(s,a) - Q^{*}(s,a)\bigr)^{2},$$

where $\mathcal{F}_i \coloneqq \{Q_i(s,a) = \Re(\Phi_i(s)^{\mathrm{H}} W_i) : W_i \in \mathbb{K}^{D_i \times |\mathcal{A}|}\}$ is the client’s QHD function class\. In linear approximation theory, this is the orthogonal projection of $Q^{*}$ onto the span of $\Phi_i$, i\.e\., the best representation of the true value function that client $i$ can achieve given its encoder\.

Equivalently, client $i$’s oracle parameters $\hat{W}_i$ minimize this $\mu$\-weighted mean\-squared error to $Q^{*}$ over the readout weights\. We then define the federation gap

$$\Delta_i(s,a) \coloneqq \hat{Q}_i(s,a) - Q_i\bigl(s,a;\, W_i^{\mathrm{glob}}\bigr), \tag{4}$$

where $W_i^{\mathrm{glob}}$ is the globally compiled weight returned to client $i$\. Thus $\Delta_i$ measures how far FedQHD’s aggregated model falls short of the best possible value function in client $i$’s function class\.

### 6\.2 Projection Residual Invisibility in Function Space

Even if the global teacher $Q^{\mathrm{glob}}_{\mathrm{ref}}$ aggregates knowledge from all clients, only the portion representable within client $i$’s function class is absorbed during compilation\. Formally, let $P_i$ denote the orthogonal projector onto the subspace spanned by client $i$’s anchor evaluations, and define the *projection residual* $R_{i,0} = (I - P_i) Q^{\mathrm{glob}}_{\mathrm{ref}}$ as the component of the global teacher that lies outside this subspace\. The following result shows that the ridge solve automatically discards the unrepresentable portion of the teacher\.

###### Theorem 1 \(Projection residual invisibility\)\.

For any $\lambda \geq 0$, the compiled Q\-function depends only on the in\-subspace component of the global teacher:

$$Q_i\bigl(s,a;\, W_i^{\mathrm{glob}}(\lambda)\bigr) = \mathbf{k}_i(s)^{\top} (G_i + \lambda I)^{-1} P_i\, Q^{\mathrm{glob}}_{\mathrm{ref},a} \qquad \forall\, s \in \mathcal{S},\; a \in \mathcal{A},$$

where $[\mathbf{k}_i(s)]_{\ell} = \langle \Phi_i(s_{\ell}), \Phi_i(s) \rangle$ for $\ell = 1, \ldots, m$, and $G_i \in \mathbb{K}^{m \times m}$ is the anchor Gram matrix with $[G_i]_{\ell j} = \langle \Phi_i(s_{\ell}), \Phi_i(s_j) \rangle$\. Hence the projection residual is invisible: $\mathbf{k}_i(s)^{\top} (G_i + \lambda I)^{-1} R_{i,0,a} = 0$\.

This theorem shows that the compiled predictor depends only on the component of the teacher that lies in client $i$’s anchor\-feature subspace\. Representation mismatch, therefore, appears as unrepresented residual information rather than direct contamination of the fitted coefficients\. The resulting loss is quantified in Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2)\.
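Theorem 1 can be illustrated numerically for a single client and a single action column\. The sketch below is restricted to real features and $\lambda > 0$, uses an assumed random Fourier encoder with $m > D_i$ so that the anchor\-feature subspace is a proper subspace of $\mathbb{R}^m$, and draws an arbitrary teacher vector; none of these choices come from the paper\.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, D, m, lam = 3, 10, 40, 0.1

# Assumed client encoder: real random Fourier features.
omega = rng.normal(size=(D, state_dim))
phase = rng.uniform(0.0, 2.0 * np.pi, size=D)

def encode(S):
    return np.sqrt(2.0 / D) * np.cos(np.atleast_2d(S) @ omega.T + phase)

anchors = rng.uniform(-1.0, 1.0, size=(m, state_dim))
X = encode(anchors)   # anchor features, shape (m, D)
G = X @ X.T           # anchor Gram matrix, shape (m, m)

# Orthogonal projector onto the column space of X.
basis, _ = np.linalg.qr(X)
P = basis @ basis.T

teacher = rng.normal(size=m)       # arbitrary teacher values for one action
residual = teacher - P @ teacher   # projection residual R_{i,0,a}

queries = rng.uniform(-1.0, 1.0, size=(25, state_dim))
K = encode(queries) @ X.T          # each row is k_i(s)^T

def predict(target):
    """k_i(s)^T (G + lam I)^{-1} target for every query state."""
    return K @ np.linalg.solve(G + lam * np.eye(m), target)

gap = float(np.max(np.abs(predict(teacher) - predict(P @ teacher))))
residual_effect = float(np.max(np.abs(predict(residual))))
print(np.linalg.norm(residual), gap, residual_effect)
```

Even though the residual has a clearly nonzero norm, it changes no prediction: the compiled Q\-values computed from the full teacher and from its projection agree up to rounding\.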

### 6\.3 Non\-Asymptotic Geometric Decomposition of the Federation Gap

We give a non\-asymptotic, pointwise bound on the federation gap defined in Sec\. [6\.1](https://arxiv.org/html/2605.29002#S6.SS1) under encoder heterogeneity\. Here we introduce standard regularity assumptions ensuring that \(i\) all predictors lie in a common function space, \(ii\) client features are non\-degenerate, and \(iii\) the anchor\-based ridge regression is well posed\.

###### Assumption 1 \(Common RKHS\)\.

There exists a continuous, symmetric, positive\-definite kernel $\kappa: \mathcal{S} \times \mathcal{S} \to \mathbb{R}$ with $\kappa(s,s) \leq 1$ for all $s \in \mathcal{S}$, inducing a separable RKHS $\mathcal{H}$ such that:

1. \(RKHS norm bound\) $\|Q^{*}(\cdot,a)\|_{\mathcal{H}} \leq B < \infty$ for all $a \in \mathcal{A}$\. This is a standard regularity condition in kernel\-based RL \(Jin et al\., [2020](https://arxiv.org/html/2605.29002#bib.bib14)\); it is implied when $Q^{*}$ lies in the RKHS of $\kappa$, which holds under standard smoothness conditions on the MDP dynamics and reward;
2. \(Common embedding\) all client function classes embed as $\mathcal{F}_i \hookrightarrow \mathcal{H}$, with the reproducing property $f(s) = \langle f, \kappa(\cdot,s) \rangle_{\mathcal{H}}$ holding for all $f \in \mathcal{H}$, $s \in \mathcal{S}$;
3. \(Feature boundedness\) $\|\Phi_i(s)\|_2 \leq 1$ for all $s \in \mathcal{S}$, $i = 1, \ldots, N$\.

###### Assumption 2 \(Feature covariance\)\.

For each client $i$, let $\nu_i$ denote the marginal state distribution of client $i$’s local experience\. The feature covariance $\Sigma_i \coloneqq \mathbb{E}_{s \sim \nu_i}[\Phi_i(s)\Phi_i(s)^{\mathrm{H}}]$ satisfies $\Sigma_i \succ 0$\.

This holds generically for random Fourier feature \(RFF\) encoders with continuous state distributions, since the induced features span a $D_i$\-dimensional subspace with probability one when $|\mathrm{supp}(\nu_i)| \geq D_i$\.

###### Assumption 3\(Anchor set\)\.

The anchor set𝒮ref=\{s1,…,sm\}\\mathcal\{S\}\_\{\\mathrm\{ref\}\}=\\\{s\_\{1\},\\ldots,s\_\{m\}\\\}is drawn i\.i\.d\. from a distributionνref\\nu\_\{\\mathrm\{ref\}\}withνref≪νi\\nu\_\{\\mathrm\{ref\}\}\\ll\\nu\_\{i\}for allii, andλ\>0\\lambda\>0\.

Under these assumptions, we first isolate the component of error that arises solely from representation heterogeneity, independent of the anchor-based compilation step. Lemma [1](https://arxiv.org/html/2605.29002#Thmlemma1) quantifies how geometric subspace misalignment translates into disagreement between client oracle predictors.

###### Lemma 1 (Representation bias).

Under Assumption [1](https://arxiv.org/html/2605.29002#Thmassumption1), for any client $i$ and action $a$,

$$|\hat{Q}_i(s,a)-\bar{Q}(s,a)|\;\leq\;2B\sum_{j\neq i}\pi_j\sin(\theta_{ij})\;+\;\sum_{j\neq i}\pi_j\bigl(\varepsilon_i^{\mathrm{rep}}+\varepsilon_j^{\mathrm{rep}}\bigr),$$

where $\bar{Q}=\sum_j\pi_j\hat{Q}_j$ is the weighted oracle mean and $\theta_{ij}$ is the largest principal angle between the subspaces of clients $i$ and $j$. This term corresponds to component (I) in the federation-gap decomposition.

Lemma [1](https://arxiv.org/html/2605.29002#Thmlemma1) characterizes the irreducible heterogeneity floor: even with perfect anchor conditioning and $\lambda=0$, federation cannot eliminate oracle disagreement induced by subspace misalignment and representation error. This geometric term forms the baseline component of the final federation-gap bound in Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2).
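As an illustration of the quantities appearing in Lemma 1, the following NumPy sketch computes the largest principal angle between two feature subspaces and the resulting heterogeneity term. It assumes, purely for illustration, that each client's subspace is represented by the column space of a real, full-column-rank feature matrix evaluated on common states; the function names `largest_principal_angle` and `heterogeneity_floor` and all inputs are hypothetical and not part of the paper.

```python
import numpy as np

def largest_principal_angle(X_i, X_j):
    """Largest principal angle (radians) between col(X_i) and col(X_j).

    Assumes real matrices with full column rank (illustrative assumption).
    """
    Q_i, _ = np.linalg.qr(X_i)  # orthonormal basis of col(X_i)
    Q_j, _ = np.linalg.qr(X_j)  # orthonormal basis of col(X_j)
    # Singular values of Q_i^T Q_j are the cosines of the principal angles.
    cosines = np.linalg.svd(Q_i.T @ Q_j, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against round-off
    return float(np.arccos(cosines.min()))

def heterogeneity_floor(i, theta, pi, B, eps_rep):
    """h_bar_i = sum_{j != i} pi_j * (2 B sin(theta_ij) + eps_i + eps_j)."""
    total = 0.0
    for j in range(len(pi)):
        if j != i:
            total += pi[j] * (2.0 * B * np.sin(theta[i, j]) + eps_rep[i] + eps_rep[j])
    return float(total)

# Two orthogonal 2-D subspaces of R^4 are maximally misaligned.
E = np.eye(4)
angle = largest_principal_angle(E[:, :2], E[:, 2:])
theta = np.array([[0.0, angle], [angle, 0.0]])
print(angle, heterogeneity_floor(0, theta, [0.5, 0.5], B=1.0, eps_rep=[0.0, 0.0]))
```

With aligned subspaces and zero representation error the returned term is zero, which matches the role of this quantity as a floor that federation alone cannot remove.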

###### Theorem 2 (Federation gap bound).

Under Assumptions [1](https://arxiv.org/html/2605.29002#Thmassumption1)–[3](https://arxiv.org/html/2605.29002#Thmassumption3), for each client $i$, any $(s,a)$, and regularization $\lambda\geq 0$, define the anchor feature matrix $X_i\in\mathbb{K}^{m\times D_i}$ by stacking the encoded anchor states row-wise, $[X_i]_{\ell,:}=\Phi_i(s_\ell)^{\top}$ for $\ell=1,\ldots,m$, and the anchor Gram matrix $G_i=X_iX_i^{\mathrm{H}}\in\mathbb{K}^{m\times m}$ with entries $[G_i]_{\ell j}=\langle\Phi_i(s_\ell),\Phi_i(s_j)\rangle$. The aggregation error defined in ([4](https://arxiv.org/html/2605.29002#S6.E4)) satisfies

$$|\Delta_i(s,a)|\;\leq\;\underbrace{\bar{h}_i}_{\textbf{(I) encoder heterogeneity}}\;+\;\underbrace{\frac{\sqrt{m}}{\sqrt{\gamma_i+\lambda}}\;\bar{h}_i}_{\textbf{(II) anchor amplification}}\;+\;\underbrace{\frac{\lambda}{\gamma_i+\lambda}\,\|\hat{W}_i\|_F}_{\textbf{(III) ridge shrinkage}}\tag{5}$$

where $\gamma_i\coloneqq\lambda_{\min}^{+}(G_i)$ is the smallest positive eigenvalue of client $i$'s anchor feature Gram matrix, $\bar{h}_i=\sum_{j\neq i}\pi_j\,(2B\sin(\theta_{ij})+\varepsilon_i^{\mathrm{rep}}+\varepsilon_j^{\mathrm{rep}})$, and $\|\hat{W}_i\|_F$ is the Frobenius norm of client $i$'s oracle parameters.
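To make the three terms of the bound concrete, the sketch below evaluates the right-hand side of (5) for a given anchor feature matrix. It is only a numerical reading of the formula: the function name `federation_gap_bound`, the eigenvalue tolerance `tol`, and the example inputs are assumptions introduced here, and $\bar{h}_i$ and $\|\hat{W}_i\|_F$ are taken as given numbers rather than estimated.

```python
import numpy as np

def federation_gap_bound(X_i, lam, h_bar, w_norm, tol=1e-10):
    """Evaluate terms (I)-(III) of the bound for one client.

    X_i    : (m, D_i) anchor feature matrix
    lam    : ridge regularization (lambda >= 0)
    h_bar  : heterogeneity term h_bar_i (assumed given)
    w_norm : Frobenius norm of the oracle parameters (assumed given)
    tol    : relative threshold for "positive" eigenvalues (assumption)
    """
    m = X_i.shape[0]
    G_i = X_i @ X_i.conj().T                 # (m, m) anchor Gram matrix
    eigvals = np.linalg.eigvalsh(G_i)        # real, ascending
    positive = eigvals[eigvals > tol * eigvals.max()]
    gamma_i = float(positive.min())          # smallest positive eigenvalue
    term_I = h_bar
    term_II = np.sqrt(m) / np.sqrt(gamma_i + lam) * h_bar
    term_III = lam / (gamma_i + lam) * w_norm
    return {"gamma": gamma_i, "I": term_I, "II": float(term_II),
            "III": float(term_III), "total": float(term_I + term_II + term_III)}

# Term (III) vanishes as lambda -> 0, while terms (I) and (II) scale with h_bar.
X = np.eye(2)
print(federation_gap_bound(X, lam=0.0, h_bar=0.5, w_norm=2.0))
print(federation_gap_bound(X, lam=1.0, h_bar=0.5, w_norm=2.0))
```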

From Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2) we obtain the following corollaries:

###### Corollary 1 (Zero-gap conditions).

$\Delta_i=0$ iff all clients' subspaces align ($\theta_{ij}=0$) and have no representation error, $\lambda\to 0$, and the anchor feature matrix has full column rank ($\mathrm{rank}(X_i)=D_i$).

###### Corollary 2 (Heterogeneity-dominated regime).

Suppose $m\geq D_i$, so that $G_i$ has full rank $D_i$ on $\mathrm{col}(X_i)$. By Remark [2](https://arxiv.org/html/2605.29002#Thmremark2), $\sqrt{m}/\sqrt{\gamma_i+\lambda}$ is asymptotically $m$-independent. Sending $\lambda\to 0$ in Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2) collapses the ridge shrinkage Term (III), leaving

$$|\Delta_i(s,a)|\;\leq\;\mathcal{O}(\bar{h}_i).$$

In this regime the federation gap is governed entirely by the encoder heterogeneity $\bar{h}_i$, and is reduced primarily by smaller $\sin\theta_{ij}$ or $\varepsilon_i^{\mathrm{rep}}$.

Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2) decomposes the federation gap into three interpretable terms: (I) an irreducible encoder heterogeneity $\bar{h}_i$ driven by principal-angle misalignment and per-client representation error, (II) an anchor-amplified term whose apparent $\sqrt{m}$ growth cancels asymptotically when $m\geq D_i$, and (III) a ridge shrinkage term controlled by $\lambda$. Consistent with the theory, experiments (Sec. [7](https://arxiv.org/html/2605.29002#S7)) show that under RFF encoder heterogeneity with $m\geq D_i$, the gap collapses to $\mathcal{O}(\bar{h}_i)$.

## 7 Experiments

We evaluate FedQHD on four continuous-state, discrete-action control benchmarks under both homogeneous and heterogeneous encoder settings. Our experiments address:

- (Q1) Does FedQHD improve policy learning over independent training and existing federated RL baselines?
- (Q2) Under heterogeneous encoders, can anchor-based aggregation recover a meaningful shared value function?
- (Q3) Does FedQHD reduce computation compared with backpropagation-based and distillation-based federation?
- (Q4) How does the encoder dimension $D$ affect FedQHD quality?

Learning curves on CartPole and LunarLander, the ablation on anchor set size $m$, the scalability study with respect to $N$, and the full experimental setup are provided in Appendices [C](https://arxiv.org/html/2605.29002#A3)–[F](https://arxiv.org/html/2605.29002#A6).

### 7.1 Experimental Setup

We evaluate FedQHD on four continuous-state, discrete-action control benchmarks from OpenAI Gym (Brockman et al., [2016](https://arxiv.org/html/2605.29002#bib.bib36)): CartPole-v1, Acrobot-v1, LunarLander-v3, and MountainCar-v0.

In homogeneous experiments, all clients share a random Fourier feature (RFF) encoder of dimension $D=10{,}000$ and fixed bandwidth $\sigma$. In heterogeneous experiments, each client independently samples an RFF encoder with bandwidth $\sigma_i\sim\mathrm{Unif}[0.5\sigma_0,1.5\sigma_0]$ and dimension $D_i$ ranging from $5\times 10^{2}$ to $10^{4}$. Unless otherwise stated, anchor-based aggregation uses a server-constructed reference set of $m=200$ anchors; Appendix [E](https://arxiv.org/html/2605.29002#A5) studies the effect of varying $m$.

We compare FedQHD against the following baselines:

- (i) Independent QHD: the performance of a randomly selected local model in each of the $N$ involved environments.
- (ii) Oracle QHD†: a single QHD trained on data pooled from all clients.
- (iii) Oracle DQN†: a single DQN trained on pooled data from all clients.
- (iv) FedAvg-DQN (Jin et al., [2022](https://arxiv.org/html/2605.29002#bib.bib39)): federated deep Q-learning with parameter averaging.
- (v) Truncate FedAvg-QHD: client weight matrices $W_i\in\mathbb{R}^{D_i\times|A|}$ are truncated to the minimum dimension $D_{\min}$ before averaging, then zero-padded back to each client's original dimension.
- (vi) Distillation FedDQN (Jiang et al., [2025](https://arxiv.org/html/2605.29002#bib.bib22)): heterogeneous aggregation by distilling a client DQN toward an anchor-set teacher updated via averaged soft predictions.
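To make baseline (v) concrete, the sketch below implements one reading of the truncate, average, and zero-pad step. It assumes that truncation keeps the first $D_{\min}$ rows of each $W_i$, which the description above does not specify; the function name `truncate_fedavg` and the toy inputs are hypothetical.

```python
import numpy as np

def truncate_fedavg(weights, pi):
    """Average heterogeneous readouts after truncating to the smallest dimension.

    weights : list of (D_i, |A|) arrays, one per client
    pi      : federation weights, one per client
    Returns one zero-padded (D_i, |A|) array per client.
    """
    d_min = min(W.shape[0] for W in weights)
    # Assumption: keep the first d_min rows of every client's readout.
    averaged = sum(p * W[:d_min] for p, W in zip(pi, weights))
    padded = []
    for W in weights:
        out = np.zeros_like(W)
        out[:d_min] = averaged  # rows beyond d_min stay zero
        padded.append(out)
    return padded

clients = [np.ones((2, 3)), 3.0 * np.ones((4, 3))]
result = truncate_fedavg(clients, pi=[0.5, 0.5])
print([W.shape for W in result])
```

Because independently sampled random features have no shared coordinate meaning across clients, row `k` of one client's readout is unrelated to row `k` of another's, which is consistent with the degradation this baseline shows under heterogeneous encoders.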

We report \(i\) average reward: mean episodic return across clients; \(ii\) compilation error: maximum Q\-value deviation between FedQHD and the client oracle on held\-out anchors; and \(iii\) policy value gap: the return difference between greedy policies induced by Oracle QHD and FedQHD\.

### 7.2 Results

#### Performance Comparison (Q1 and Q2)

Table [1](https://arxiv.org/html/2605.29002#S7.T1) reports final average reward after training $N=5$ clients for 600 episodes. Full results with standard deviations across 3 seeds are reported in Appendix [D](https://arxiv.org/html/2605.29002#A4). FedQHD achieves the best non-oracle performance in 5 of 8 tasks and even surpasses Oracle QHD on LunarLander under homogeneous encoders.

Under heterogeneous encoders, aggregation strategies diverge sharply. Truncate FedAvg-QHD collapses when moving from the homogeneous setting (Q1) to the heterogeneous setting (Q2) (e.g., CartPole drops 73%), confirming that naive dimension matching destroys the RFF feature structure. In contrast, FedQHD remains competitive because anchor-based projection preserves the geometry of client representations. Distillation FedDQN also transfers across heterogeneous architectures but requires iterative optimization and is substantially slower (Table [2](https://arxiv.org/html/2605.29002#S7.T2)), whereas FedQHD performs a single closed-form compilation step.

Table 1: Final average reward (last 100 episodes), mean over $N=5$ clients and 3 seeds. **Bold**: best non-oracle. †: oracle (pooled data). –: not applicable.

| Method | CartPole (Homo) | CartPole (Hetero) | Acrobot (Homo) | Acrobot (Hetero) | LunarLander (Homo) | LunarLander (Hetero) | MountainCar (Homo) | MountainCar (Hetero) |
|---|---|---|---|---|---|---|---|---|
| Indep. QHD | 26.1 | 24.0 | -306.1 | -495.4 | -187.3 | -174.6 | -200.0 | -200.0 |
| Trunc. FedAvg-QHD | 230.8 | 112.6 | -106.5 | -280.8 | 65.0 | -24.8 | -148.3 | -172.9 |
| FedAvg-DQN | 120.8 | – | **-85.3** | – | 200.9 | – | -167.8 | – |
| Distill. FedDQN | 169.9 | 149.4 | -87.5 | **-89.2** | 122.5 | **131.4** | -156.0 | -174.3 |
| FedQHD | **466.3** | **351.1** | -105.0 | -102.5 | **224.1** | 101.5 | **-143.7** | **-162.3** |
| Oracle QHD† | 373.2 | 458.5 | -93.6 | -101.2 | 87.6 | 232.9 | -141.3 | -140.0 |
| Oracle DQN† | 139.1 | 204.4 | -75.3 | -79.8 | 183.5 | 202.6 | -125.7 | -122.8 |

#### Computation Cost.

Table [2](https://arxiv.org/html/2605.29002#S7.T2) reports the wall-clock training time of FedQHD with 5 clients after 600 episodes, complementing the performance results in Table [1](https://arxiv.org/html/2605.29002#S7.T1). FedQHD is consistently faster than all DQN-based methods, as it entirely replaces backpropagation with closed-form TD(0) updates and, unlike Distillation FedDQN, requires no additional per-round gradient loops through a teacher network. Under homogeneous encoders with $D=10{,}000$, FedQHD takes 9.7–20.3 min across environments, compared to 25.9–85.6 min for FedAvg-DQN and 14.3–69.2 min for Distillation FedDQN. A more revealing finding emerges under Q2 (heterogeneous encoders): despite requiring an extra anchor-based ridge-regression solve at each aggregation round, FedQHD's Q2 wall-clock time (1.4–5.9 min) is 3–12$\times$ lower than its own Q1 cost on the same environment. This is because Q2 clients operate with mixed encoder dimensions ($D_i$ ranging from $5\times 10^{2}$ to $10^{4}$, assigned cyclically) versus the fixed $D=10{,}000$ used uniformly in Q1, demonstrating that in FedQHD representation dimensionality is the primary driver of computation cost.

Table 2: Wall-clock training time (minutes per run, 600 episodes, $N=5$ clients) reported as mean $\pm$ std across 3 seeds. CP = CartPole, Acr = Acrobot, LL = LunarLander, MC = MountainCar.

#### Effect of Encoder Dimension $D$.

We study CartPole and vary $D$ from 16 to 2048 while fixing $m=4D$, keeping all runs in the over-determined regime ($m>D$) so that the encoder heterogeneity induced by the RFF feature map dominates the compiled error, which is then expected to follow the geometry floor $|\Delta_i|=\mathcal{O}(D^{-1/2})$ (Rahimi and Recht, [2007](https://arxiv.org/html/2605.29002#bib.bib41)).

Figure [1](https://arxiv.org/html/2605.29002#S7.F1) confirms the prediction of Corollary [2](https://arxiv.org/html/2605.29002#Thmcorollary2). The center panel shows the compiled error with a log-log slope of $-0.525$, validating that FedQHD's anchor-based projection inherits the standard RFF approximation guarantee. The left panel shows that lower $D$ leads to both slower convergence and lower final reward, while $D\geq 128$ consistently reaches the CartPole ceiling ($V\approx 500$) within 600 episodes. The right panel directly links the policy value gap to the compiled error curve, confirming that the federation gap is a reliable proxy for practical policy degradation.
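A log-log slope such as the one reported above can be estimated with an ordinary least-squares line fit in log space. The sketch below shows this on synthetic errors that follow $D^{-1/2}$ exactly; the data are placeholders generated here, not the paper's measurements, and the helper name `loglog_slope` is hypothetical.

```python
import numpy as np

def loglog_slope(dims, errors):
    """Least-squares slope of log(error) against log(dimension)."""
    slope, _intercept = np.polyfit(np.log(dims), np.log(errors), deg=1)
    return float(slope)

# Synthetic placeholder data: an exact D^{-1/2} decay (NOT measured results).
dims = np.array([16, 32, 64, 128, 256, 512, 1024, 2048], dtype=float)
synthetic_errors = 3.0 * dims ** -0.5
print(loglog_slope(dims, synthetic_errors))
```

A fitted slope close to $-0.5$ on measured compiled errors would be consistent with the $\mathcal{O}(D^{-1/2})$ scaling discussed above.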

![Refer to caption](https://arxiv.org/html/2605.29002v1/x1.png)

Figure 1: Ablation 1 (vary $D$, set $m=4D$): learning curves (left), Q-error vs. $D$ (middle), and final policy value (right).

## 8 Conclusion

We presented FedQHD, a federated $Q$-learning framework that replaces parameter-space synchronization with closed-form function-space aggregation for linear-in-parameter hyperdimensional (random-feature) value representations. With a shared encoder, FedQHD reduces exactly to weighted averaging of local readout matrices, recovering FedAvg in closed form. With heterogeneous encoders, the server aggregates client $Q$-values on a shared anchor-state interface, and each client compiles the resulting teacher into its local representation via a one-shot ridge projection, avoiding per-round iterative distillation and gradient-based teacher–student optimization. We further derived a pointwise bound on the federation gap, decomposing it into encoder heterogeneity, anchor-set conditioning, and ridge shrinkage, and identified $m\geq D_i$ as the well-conditioned regime in which the gap reduces to a multiple of the heterogeneity floor. On four continuous-state, discrete-action benchmarks, FedQHD matches or exceeds federated DQN baselines while requiring substantially less computation, with the empirical dependence of the federation gap on encoder dimension matching our theoretical analysis. Extending closed-form heterogeneous federation to learned encoders, richer observation spaces, and actor–critic settings remains an important direction for future work.

## References

- On the equivalence between quadrature rules and random features\.arXiv preprint arXiv:1502\.06800135\.Cited by:[§3\.2](https://arxiv.org/html/2605.29002#S3.SS2.p2.7)\.
- J\. Bhandari, D\. Russo, and R\. Singal \(2018\)A finite time analysis of temporal difference learning with linear function approximation\.InConference on learning theory,pp\. 1691–1692\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p4.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1)\.
- G\. Brockman, V\. Cheung, L\. Pettersson, J\. Schneider, J\. Schulman, J\. Tang, and W\. Zaremba \(2016\)Openai gym\.arXiv preprint arXiv:1606\.01540\.Cited by:[§7\.1](https://arxiv.org/html/2605.29002#S7.SS1.p1.1)\.
- V\. P\. Chellapandi, L\. Yuan, C\. G\. Brinton, S\. H\. Żak, and Z\. Wang \(2023\)Federated learning for connected and automated vehicles: a survey of existing approaches and challenges\.IEEE Transactions on Intelligent Vehicles9\(1\),pp\. 119–137\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p1.1)\.
- H\. Chen and W\. Chao \(2020\)Fedbe: making bayesian model ensemble applicable to federated learning\.arXiv preprint arXiv:2009\.01974\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- W\. M\. Czarnecki, R\. Pascanu, S\. Osindero, S\. Jayakumar, G\. Swirszcz, and M\. Jaderberg \(2019\)Distilling policy distillation\.InThe 22nd international conference on artificial intelligence and statistics,pp\. 1331–1340\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p3.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Ernst, P\. Geurts, and L\. Wehenkel \(2005\)Tree\-based batch mode reinforcement learning\.Journal of Machine Learning Research6\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Fallah, A\. Mokhtari, and A\. Ozdaglar \(2020\)Personalized federated learning with theoretical guarantees: a model\-agnostic meta\-learning approach\.Advances in neural information processing systems33,pp\. 3557–3568\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px2.p1.1)\.
- F\. X\. Fan, Y\. Ma, Z\. Dai, C\. Tan, B\. K\. H\. Low, and R\. Wattenhofer \(2023\)FedHQL: federated heterogeneous q\-learning\.External Links:2301\.11135,[Link](https://arxiv.org/abs/2301.11135)Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p2.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px5.p1.1)\.
- X\. Fan, Y\. Ma, Z\. Dai, W\. Jing, C\. Tan, and B\. K\. H\. Low \(2021\)Fault\-tolerant federated reinforcement learning with theoretical guarantee\.Advances in neural information processing systems34,pp\. 1007–1021\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.External Links:1503\.02531,[Link](https://arxiv.org/abs/1503.02531)Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- W\. Jiang, J\. Wang, X\. Zhang, W\. Bao, C\. Tan, and F\. X\. Fan \(2025\)Fedhpd: heterogeneous federated reinforcement learning via policy distillation\.arXiv preprint arXiv:2502\.00870\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p2.1),[§1](https://arxiv.org/html/2605.29002#S1.p3.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px5.p1.1),[§7\.1](https://arxiv.org/html/2605.29002#S7.SS1.p3.5)\.
- C\. Jin, Z\. Yang, Z\. Wang, and M\. I\. Jordan \(2020\)Provably efficient reinforcement learning with linear function approximation\.InConference on learning theory,pp\. 2137–2143\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1),[item 1](https://arxiv.org/html/2605.29002#S6.I1.i1.p1.4)\.
- H\. Jin, Y\. Peng, W\. Yang, S\. Wang, and Z\. Zhang \(2022\)Federated reinforcement learning with environment heterogeneity\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 18–37\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1),[§7\.1](https://arxiv.org/html/2605.29002#S7.SS1.p3.5)\.
- P\. Kanerva \(2009\)Hyperdimensional computing: an introduction to computing in distributed representation with high\-dimensional random vectors\.Cognitive computation1\(2\),pp\. 139–159\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p4.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1),[§3\.2](https://arxiv.org/html/2605.29002#S3.SS2.p1.1)\.
- S\. P\. Karimireddy, S\. Kale, M\. Mohri, S\. Reddi, S\. Stich, and A\. T\. Suresh \(2020\)Scaffold: stochastic controlled averaging for federated learning\.InInternational conference on machine learning,pp\. 5132–5143\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Khodadadian, P\. Sharma, G\. Joshi, and S\. T\. Maguluri \(2022\)Federated reinforcement learning: linear speedup under markovian sampling\.InInternational conference on machine learning,pp\. 10997–11057\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Konidaris, S\. Osentoski, and P\. Thomas \(2011\)Value function approximation in reinforcement learning using the fourier basis\.InProceedings of the AAAI conference on artificial intelligence,Vol\.25,pp\. 380–385\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1)\.
- M\. G\. Lagoudakis and R\. Parr \(2003\)Least\-squares policy iteration\.Journal of machine learning research4\(Dec\),pp\. 1107–1149\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p4.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Li and J\. Wang \(2019\)FedMD: heterogenous federated learning via model distillation\.External Links:1910\.03581,[Link](https://arxiv.org/abs/1910.03581)Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p3.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- T\. Li, A\. K\. Sahu, M\. Zaheer, M\. Sanjabi, A\. Talwalkar, and V\. Smith \(2020\)Federated optimization in heterogeneous networks\.Proceedings of Machine learning and systems2,pp\. 429–450\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Liang, Y\. Liu, T\. Chen, M\. Liu, and Q\. Yang \(2022\)Federated transfer reinforcement learning for autonomous driving\.InFederated and transfer learning,pp\. 357–371\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p1.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Lin, L\. Kong, S\. U\. Stich, and M\. Jaggi \(2021\)Ensemble distillation for robust model fusion in federated learning\.External Links:2006\.07242,[Link](https://arxiv.org/abs/2006.07242)Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p3.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Liu, L\. Wang, and M\. Liu \(2019\)Lifelong federated reinforcement learning: a learning architecture for navigation in cloud robotic systems\.IEEE Robotics and Automation Letters4\(4\),pp\. 4555–4562\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p1.1)\.
- W\. Mai, J\. Yao, G\. Chen, Y\. Zhang, Y\. Cheung, and B\. Han \(2023\)Server\-client collaborative distillation for federated reinforcement learning\.ACM Transactions on Knowledge Discovery from Data18\(1\),pp\. 1–22\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- B\. McMahan, E\. Moore, D\. Ramage, S\. Hampson, and B\. A\. y Arcas \(2017\)Communication\-efficient learning of deep networks from decentralized data\.InArtificial intelligence and statistics,pp\. 1273–1282\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p2.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Nadiger, A\. Kumar, and S\. Abdelhak \(2019\)Federated reinforcement learning for fast personalization\.In2019 IEEE Second International Conference on Artificial Intelligence and Knowledge Engineering \(AIKE\),pp\. 123–127\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Ni, D\. Abraham, M\. Issa, Y\. Kim, P\. Mercati, and M\. Imani \(2022a\)Qhd: a brain\-inspired hyperdimensional reinforcement learning algorithm\.arXiv preprint\.Cited by:[Appendix A](https://arxiv.org/html/2605.29002#A1.p1.3),[§1](https://arxiv.org/html/2605.29002#S1.p4.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1),[§3\.1](https://arxiv.org/html/2605.29002#S3.SS1.p1.12),[§4](https://arxiv.org/html/2605.29002#S4.SS0.SSS0.Px1.p1.20)\.
- Y\. Ni, M\. Issa, D\. Abraham, M\. Imani, X\. Yin, and M\. Imani \(2022b\)Hdpg: hyperdimensional policy\-based reinforcement learning for continuous control\.InProceedings of the 59th ACM/IEEE Design Automation Conference,pp\. 1141–1146\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Ormoneit and Ś\. Sen \(2002\)Kernel\-based reinforcement learning\.Machine learning49\(2\),pp\. 161–178\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Qi, Q\. Zhou, L\. Lei, and K\. Zheng \(2021\)Federated reinforcement learning: techniques, applications, and open challenges\.arXiv preprint arXiv:2108\.11887\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p1.1)\.
- A\. Rahimi and B\. Recht \(2007\)Random features for large\-scale kernel machines\.Advances in neural information processing systems20\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px4.p1.1),[§3\.2](https://arxiv.org/html/2605.29002#S3.SS2.p2.7),[§7\.2](https://arxiv.org/html/2605.29002#S7.SS2.SSS0.Px3.p1.4)\.
- A\. Rudi and L\. Rosasco \(2017\)Generalization properties of learning with random features\.Advances in neural information processing systems30\.Cited by:[§3\.2](https://arxiv.org/html/2605.29002#S3.SS2.p2.7)\.
- A\. A\. Rusu, S\. G\. Colmenarejo, C\. Gulcehre, G\. Desjardins, J\. Kirkpatrick, R\. Pascanu, V\. Mnih, K\. Kavukcuoglu, and R\. Hadsell \(2016\)Policy distillation\.External Links:1511\.06295,[Link](https://arxiv.org/abs/1511.06295)Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- Y\. Teh, V\. Bapst, W\. M\. Czarnecki, J\. Quan, J\. Kirkpatrick, R\. Hadsell, N\. Heess, and R\. Pascanu \(2017\)Distral: robust multitask reinforcement learning\.Advances in neural information processing systems30\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Yu, X\. Chen, Z\. Zhou, X\. Gong, and D\. Wu \(2020\)When deep reinforcement learning meets federated learning: intelligent multitimescale resource management for multiaccess edge computing in 5g ultradense network\.IEEE Internet of Things Journal8\(4\),pp\. 2238–2251\.Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p1.1)\.
- Z\. Zhu, J\. Hong, and J\. Zhou \(2021\)Data\-free knowledge distillation for heterogeneous federated learning\.InInternational conference on machine learning,pp\. 12878–12889\.Cited by:[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px3.p1.1)\.
- H\. H\. Zhuo, W\. Feng, Y\. Lin, Q\. Xu, and Q\. Yang \(2020\)Federated deep reinforcement learning\.External Links:1901\.08277,[Link](https://arxiv.org/abs/1901.08277)Cited by:[§1](https://arxiv.org/html/2605.29002#S1.p1.1),[§2](https://arxiv.org/html/2605.29002#S2.SS0.SSS0.Px1.p1.1)\.

## Appendix A QHD Semi-Gradient Update

For completeness, we record the semi-gradient TD update rule used by QHD [Ni et al., [2022a](https://arxiv.org/html/2605.29002#bib.bib38)]. Given a transition $(s,a,r,s')$, a delayed target weight $W_i^{-}$, and learning rate $\eta$, define the bootstrapped target

$$y\;=\;r+\gamma\,\Re\!\bigl(\Phi_i(s')^{\mathrm{H}}\,\mathbf{w}^{-}_{i,a^{\star}}\bigr),\qquad a^{\star}=\arg\max_{a'\in\mathcal{A}}Q_i(s',a').$$

The semi-gradient update on the readout vector for action $a$ is

$$\mathbf{w}_{i,a}\;\leftarrow\;\mathbf{w}_{i,a}+\eta\,\bigl(y-Q_i(s,a)\bigr)\,\Phi_i(s).$$

The target weights $W_i^{-}$ are periodically synchronized with $W_i$, in the standard fashion of off-policy $Q$-learning with target networks.
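The update rule above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes real-valued features (so the real part is a no-op), stores the readout as a matrix `W` of shape `(D, |A|)` with one column per action, and adds a `done` flag for terminal transitions, which the text does not discuss. The names `q_values` and `td_update` are hypothetical.

```python
import numpy as np

def q_values(phi, W):
    """Q(s, .) = Re(phi(s)^H W) for a readout W of shape (D, |A|)."""
    return np.real(phi.conj() @ W)

def td_update(W, W_target, phi_s, a, r, phi_next,
              gamma=0.99, eta=0.01, done=False):
    """One in-place semi-gradient TD step on column a of W; returns the TD error.

    The greedy action a* is chosen with the online weights W and evaluated
    with the delayed target weights W_target, as in the target above.
    `done` (assumption, not in the text) zeroes the bootstrap at terminal states.
    """
    a_star = int(np.argmax(q_values(phi_next, W)))
    bootstrap = 0.0 if done else q_values(phi_next, W_target)[a_star]
    y = r + gamma * bootstrap
    delta = y - q_values(phi_s, W)[a]
    W[:, a] += eta * delta * phi_s  # only the taken action's readout moves
    return float(delta)

# Toy transition with D = 3 features and |A| = 2 actions.
W = np.zeros((3, 2))
W_target = W.copy()
phi_s = np.array([1.0, 0.0, 0.0])
phi_next = np.array([0.0, 1.0, 0.0])
print(td_update(W, W_target, phi_s, a=0, r=1.0, phi_next=phi_next))
```

Because the features are fixed, each step is a rank-one update of a single column, which is why no backpropagation is needed.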

## Appendix B Algorithm

The pseudocode for the FedQHD procedures is given below.

**Algorithm 1** FedQHD: Homogeneous Protocol

1. **Init:** shared encoder $\Phi:\mathcal{S}\to\mathbb{K}^{D}$; $W^{\mathrm{glob}}_{0}=\mathbf{0}$
2. **for** round $k=0,\ldots,T-1$ **do**
3. &emsp;Broadcast $W^{\mathrm{glob}}_{k}$ to all clients
4. &emsp;**for** client $i=1,\ldots,N$ **in parallel do**
5. &emsp;&emsp;$W_i,W_i^{-}\leftarrow W^{\mathrm{glob}}_{k}$
6. &emsp;&emsp;**for** $e=1,\ldots,K$ **do**
7. &emsp;&emsp;&emsp;$\delta\leftarrow\text{TD}(s,a,r,s';W_i,W_i^{-})$
8. &emsp;&emsp;&emsp;$W_i[:,a]\leftarrow W_i[:,a]+\eta\,\delta\,\Phi(s)$
9. &emsp;&emsp;&emsp;Sync $W_i^{-}$ every $\tau$ eps; decay $\epsilon$
10. &emsp;&emsp;**end for**
11. &emsp;&emsp;Upload $W_i$ to server
12. &emsp;**end for**
13. &emsp;$W^{\mathrm{glob}}_{k+1}\leftarrow\sum_{i}\pi_iW_i$
14. **end for**
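The server step of the homogeneous protocol is a weighted average of readout matrices, and with a shared encoder this coincides with averaging the clients' Q-values pointwise. The sketch below checks this numerically; the random readouts and the random feature vector standing in for $\Phi(s)$ are placeholders introduced here, and the helper name `aggregate_readouts` is hypothetical.

```python
import numpy as np

def aggregate_readouts(client_W, pi):
    """Server step: W_glob = sum_i pi_i * W_i (all W_i share the same shape)."""
    return sum(p * W for p, W in zip(pi, client_W))

rng = np.random.default_rng(0)
D, n_actions, n_clients = 64, 3, 5
pi = np.full(n_clients, 1.0 / n_clients)          # uniform federation weights
client_W = [rng.normal(size=(D, n_actions)) for _ in range(n_clients)]

W_glob = aggregate_readouts(client_W, pi)

# Stand-in for the shared feature vector Phi(s) at some state s.
phi = rng.normal(size=D) / np.sqrt(D)

q_from_avg_weights = phi @ W_glob
avg_of_client_q = sum(p * (phi @ W) for p, W in zip(pi, client_W))
print(np.allclose(q_from_avg_weights, avg_of_client_q))
```

The agreement follows from linearity of the readout in the parameters; it would not hold in general if each client used a different encoder or a nonlinear network.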

**Algorithm 2** FedQHD: Heterogeneous Protocol

1. **Init:** anchors $\mathcal{S}_{\mathrm{ref}}=\{s_1,\ldots,s_m\}$; each client $i$: encoder $\Phi_i$, $W_i=\mathbf{0}$
2. **for** round $k=0,\ldots,T-1$ **do**
3. &emsp;**for** client $i=1,\ldots,N$ **in parallel do**
4. &emsp;&emsp;$W_i^{-}\leftarrow W_i$
5. &emsp;&emsp;**for** $e=1,\ldots,K$ **do**
6. &emsp;&emsp;&emsp;$\delta\leftarrow\text{TD}(s,a,r,s';W_i,W_i^{-})$
7. &emsp;&emsp;&emsp;$W_i[:,a]\leftarrow W_i[:,a]+\eta\,\delta\,\Phi_i(s)$
8. &emsp;&emsp;&emsp;Sync $W_i^{-}$ every $\tau$ eps; decay $\epsilon$
9. &emsp;&emsp;**end for**
10. &emsp;&emsp;Upload $Q^{\mathrm{ref}}_{i}=\Re(X_iW_i)$
11. &emsp;**end for**
12. &emsp;**Server:** $Q^{\mathrm{glob}}_{\mathrm{ref}}\leftarrow\sum_{i}\pi_iQ^{\mathrm{ref}}_{i}$
13. &emsp;**for** client $i$ **in parallel do**
14. &emsp;&emsp;$W_i^{\mathrm{glob}}\leftarrow X_i^{\mathrm{H}}(G_i+\lambda I)^{-1}Q^{\mathrm{glob}}_{\mathrm{ref}}$
15. &emsp;**end for**
16. **end for**
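The aggregation steps of the heterogeneous protocol (upload anchor Q-values, average on the server, compile by ridge projection) can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: anchor feature matrices and readouts are random placeholders, $\lambda>0$ so the linear system is well posed, and the helper names `anchor_q`, `server_teacher`, and `compile_teacher` are hypothetical.

```python
import numpy as np

def anchor_q(X_i, W_i):
    """Client upload: Q_i^ref = Re(X_i W_i), shape (m, |A|)."""
    return np.real(X_i @ W_i)

def server_teacher(q_refs, pi):
    """Server step: weighted average of client anchor Q-values."""
    return sum(p * q for p, q in zip(pi, q_refs))

def compile_teacher(X_i, q_glob, lam):
    """Client step: W_i^glob = X_i^H (G_i + lam I)^{-1} Q_ref^glob."""
    m = X_i.shape[0]
    G_i = X_i @ X_i.conj().T                       # (m, m) anchor Gram matrix
    # Solve the linear system instead of forming an explicit inverse.
    alpha = np.linalg.solve(G_i + lam * np.eye(m), q_glob)
    return X_i.conj().T @ alpha                    # (D_i, |A|)

rng = np.random.default_rng(0)
m, n_actions = 20, 2
dims = [8, 12, 16]                                 # heterogeneous D_i (placeholders)
pi = [1.0 / 3.0] * 3
X = [rng.normal(size=(m, D)) / np.sqrt(D) for D in dims]
W = [rng.normal(size=(D, n_actions)) for D in dims]

q_glob = server_teacher([anchor_q(Xi, Wi) for Xi, Wi in zip(X, W)], pi)
W_glob = [compile_teacher(Xi, q_glob, lam=1e-3) for Xi in X]
print([Wg.shape for Wg in W_glob])
```

Only the $m\times|A|$ anchor Q-values cross the network, so clients never need to agree on an encoder dimension. For real features the same solution can also be written in the standard ridge form $(X_i^{\top}X_i+\lambda I)^{-1}X_i^{\top}Q^{\mathrm{glob}}_{\mathrm{ref}}$, which solves a $D_i\times D_i$ system instead of an $m\times m$ one.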

## Appendix C Experimental Setup

For homogeneous experiments, all clients share a random Fourier feature (RFF) encoder of dimension $D=10{,}000$ with fixed bandwidth $\sigma$: $\Phi(s)=\tfrac{1}{\sqrt{D}}\big[\cos(\omega_1^{\top}s+b_1),\dots,\cos(\omega_D^{\top}s+b_D)\big]^{\top}$. For heterogeneous experiments, each client independently samples $\Phi_i$ with bandwidth $\sigma_i\sim\mathrm{Unif}[0.5\sigma_0,1.5\sigma_0]$ and dimension $D_i\in\{500,1000,2000,5000,10000\}$. The anchor set $\mathcal{S}_{\text{ref}}$ contains $m=200$ states collected from random rollouts.
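A minimal sketch of an encoder with the form of $\Phi(s)$ above is given below. The sampling distributions are assumptions made here for illustration, since this section does not state them: frequencies $\omega_k$ are drawn from a Gaussian with standard deviation $1/\sigma$ and phases $b_k$ uniformly from $[0,2\pi]$, as is common for RFF approximations of an RBF kernel. The class name `RFFEncoder` and its arguments are hypothetical.

```python
import numpy as np

class RFFEncoder:
    """Phi(s) = D^{-1/2} [cos(w_1^T s + b_1), ..., cos(w_D^T s + b_D)]^T."""

    def __init__(self, state_dim, D, sigma, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: Gaussian frequencies with scale 1/sigma.
        self.omega = rng.normal(scale=1.0 / sigma, size=(D, state_dim))
        # Assumption: phases uniform on [0, 2*pi].
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=D)
        self.D = D

    def __call__(self, s):
        s = np.asarray(s, dtype=float)
        return np.cos(self.omega @ s + self.b) / np.sqrt(self.D)

# Heterogeneous clients: different bandwidths and dimensions per client.
rng = np.random.default_rng(0)
sigma0 = 1.0
encoders = [
    RFFEncoder(state_dim=4, D=D_i, sigma=rng.uniform(0.5 * sigma0, 1.5 * sigma0), seed=i)
    for i, D_i in enumerate([500, 1000, 2000])
]

# Stack encoded anchor states row-wise to form each client's anchor feature matrix.
anchors = rng.normal(size=(200, 4))                 # placeholder anchor states
X_0 = np.stack([encoders[0](s) for s in anchors])   # shape (m, D_0)
print(X_0.shape, float(np.linalg.norm(encoders[0](anchors[0]))))
```

With the $1/\sqrt{D}$ scaling every entry has magnitude at most $1/\sqrt{D}$, so $\|\Phi(s)\|_2\leq 1$ holds for any state, consistent with the feature-boundedness condition in Assumption 1.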

All methods use $\epsilon$-greedy exploration with $\epsilon$ annealed from 1.0 to 0.001. We use learning rate $\eta=0.01$, discount $\gamma=0.99$, replay buffer size 10,000 per client, and uniform federation weights $\pi_i=1/N$. Federated aggregation occurs every $K=50$ local episodes with $N=5$ clients unless stated otherwise. We report the mean over 3 random seeds. All DQN-based baselines use a 2-layer MLP with 128 hidden units per layer.

## Appendix D Performance Comparison

Table 3: Final average reward (last 100 episodes), mean $\pm$ std over 3 seeds. **Bold**: best non-oracle. †: oracle (pooled data). –: N/A.

Figure [2](https://arxiv.org/html/2605.29002#A4.F2) shows that FedQHD converges within 600 episodes on CartPole and LunarLander, whereas DQN-based methods plateau earlier. This suggests that function-space aggregation transfers more useful information per communication round than parameter averaging in deep networks.

![Refer to caption](https://arxiv.org/html/2605.29002v1/x2.png)

Figure 2: Learning curves for CartPole (top) and LunarLander (bottom) under homogeneous (left) and heterogeneous (right) encoders.

## Appendix E Effect of Anchor Set Size $m$

We fixD=512D=512and vary the anchor sizemmfrom5151to20482048on CartPole\. Theorem[2](https://arxiv.org/html/2605.29002#Thmtheorem2)predicts a regime change aroundm=Dm=Dwhenm<Dm<D, the anchor feature matrix is rank\-deficient, Term \(II\) is unbounded, and compiled weights generalize poorly\. Oncem≥Dm\\geq D, the residual reduces to the heterogeneity floor𝒪​\(D−1/2\)\\mathcal\{O\}\(D^\{\-1/2\}\)\.

Figure[3](https://arxiv.org/html/2605.29002#A5.F3)shows the phase transition: the compiled error in the middle panel drops steeply within the under\-determined regime; oncem≥Dm\\geq D, the error stabilizes near the geometry floor, and FedQHD performance becomes reliable\. The learning curve and policy value are unstable whenm<Dm<D, but converge reliably toV≈500V\\approx 500oncem≥Dm\\geq D, reflecting the boundary sensitivity of the ridge solve\.

Together with the ablation on dimension, this confirms the design rule $m\geq D$: the encoder dimension sets the geometry floor ($D^{-1/2}$ scaling), and the anchor count determines the projection regime, both predicted by Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2).
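The rank argument behind the $m\geq D$ rule can be illustrated numerically. The following is a minimal sketch, not the paper's experiment: it assumes real-valued i.i.d. Gaussian features as a stand-in for the hyperdimensional encoder, and the helper name `anchor_rank_report` is hypothetical. It reports the rank of an $m\times D$ anchor feature matrix and the smallest eigenvalue of its $D\times D$ Gram matrix on either side of $m=D$.

```python
import numpy as np

def anchor_rank_report(m, D, seed=0):
    """Return (rank of X, smallest eigenvalue of X^T X) for an m x D anchor matrix.

    X is a hypothetical stand-in for anchor features: i.i.d. Gaussian rows
    scaled by 1/sqrt(D). This is an assumption, not the paper's encoder.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((m, D)) / np.sqrt(D)
    rank = np.linalg.matrix_rank(X)
    # eigvalsh returns eigenvalues in ascending order, so index 0 is the smallest
    min_eig = np.linalg.eigvalsh(X.T @ X)[0]
    return rank, min_eig

if __name__ == "__main__":
    D = 32
    for m in (8, 16, 64, 128):
        rank, min_eig = anchor_rank_report(m, D)
        regime = "m < D (rank-deficient)" if m < D else "m >= D (full column rank)"
        print(f"m={m:4d}  rank={rank:3d}  min eig of X^T X={min_eig:.3e}  {regime}")
```

With $m<D$ the rank is capped at $m$, so the Gram matrix has a null space and only the ridge term $\lambda$ keeps the solve invertible; with $m\geq D$ generic features give full column rank.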

![Refer to caption](https://arxiv.org/html/2605.29002v1/x3.png)

Figure 3: Ablation 2 (vary $m$, fix $D=512$): learning curves (left), Q-error vs. $m/D$ with the $m=D$ transition (middle), and final policy value (right).

## Appendix F Scalability Analysis

We evaluate the scalability of FedQHD by varying $N\in\{1,2,5,20\}$ on CartPole and LunarLander, measuring the average reward over the final 30 episodes after 500 episodes. Following standard benchmarks, we define a task as *solved* when the rolling average reward is $\geq 475$ for CartPole and $\geq 200$ for LunarLander. The dashed red line in each panel marks this threshold; bars that cross it indicate a qualitatively reliable policy.

Figure [4](https://arxiv.org/html/2605.29002#A6.F4) shows that the final reward of FedQHD improves consistently with $N$, yielding gains of $+30.6\%$ on CartPole and $+29.5\%$ on LunarLander relative to $N=1$, with no degradation at large $N$. This indicates that federation is a reliable lever for performance in FedQHD. Notably, a modest $N=5$ already recovers most of the scalability benefit on both environments, suggesting that FedQHD is well-suited to practical deployments where large client counts are often infeasible.

![Refer to caption](https://arxiv.org/html/2605.29002v1/x4.png)

Figure 4: FedQHD performance at 575-600 episodes vs. number of clients $N$.

Across homogeneous and heterogeneous settings, FedQHD consistently matches or exceeds federated DQN baselines while requiring substantially less computation. Its performance degradation under encoder mismatch follows the projection regime predicted by Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2), and its scalability saturates once the dominant value subspace is sufficiently covered. Together, the results validate the geometric design principles underlying FedQHD.

## Appendix G Proof of Theorem [1](https://arxiv.org/html/2605.29002#Thmtheorem1)

###### Proof\.

By the Woodbury identity applied to $W^{\mathrm{glob}}_{i}=\left(X_{i}^{\mathrm{H}}X_{i}+\lambda I_{D_{i}}\right)^{-1}X_{i}^{\mathrm{H}}Q^{\mathrm{glob}}_{\mathrm{ref}}$, the compiled Q-function evaluates as:

$$Q_{i}(s,a;\,W^{\mathrm{glob}}_{i})=\mathbf{k}_{i}(s)^{\top}(G_{i}+\lambda I_{m})^{-1}Q^{\mathrm{glob}}_{\mathrm{ref},a}. \tag{6}$$

Decompose the teacher: $Q^{\mathrm{glob}}_{\mathrm{ref},a}=P_{i}\,Q^{\mathrm{glob}}_{\mathrm{ref},a}+R_{i,0,a}$, where $P_{i}\,Q^{\mathrm{glob}}_{\mathrm{ref},a}\in\mathrm{col}(X_{i})$ and $R_{i,0,a}\in\mathrm{col}(X_{i})^{\perp}=\ker(G_{i})$. Since $(G_{i}+\lambda I_{m})^{-1}$ preserves the eigenspaces of $G_{i}$, the vector $(G_{i}+\lambda I_{m})^{-1}R_{i,0,a}$ remains in $\ker(G_{i})$, which is orthogonal to $\mathbf{k}_{i}(s)\in\mathrm{col}(X_{i})$. Hence $\mathbf{k}_{i}(s)^{\top}(G_{i}+\lambda I)^{-1}R_{i,0,a}=0$. ∎
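Both steps of this proof can be checked numerically. The sketch below is an illustration under simplifying assumptions: it uses real-valued Gaussian features in place of the complex encoder (so $X^{\mathrm{H}}$ becomes `X.T`), takes $G=XX^{\top}$ and $\mathbf{k}(s)=X\phi(s)$, and the function names `primal_predict` and `dual_predict` are hypothetical. It confirms that the primal ridge form and the Woodbury (dual) form give the same prediction, and that the teacher component orthogonal to $\mathrm{col}(X)$ contributes nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
m, D, lam = 12, 5, 0.1            # m > D so that col(X)^perp is non-trivial
X = rng.standard_normal((m, D))   # rows: anchor features (real-valued stand-in)
phi = rng.standard_normal(D)      # features of a query state s
q = rng.standard_normal(m)        # teacher values on the anchors for one action

def primal_predict(X, q, phi, lam):
    """phi^T (X^T X + lam I_D)^{-1} X^T q."""
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ q)
    return phi @ W

def dual_predict(X, q, phi, lam):
    """k(s)^T (G + lam I_m)^{-1} q with G = X X^T and k(s) = X phi."""
    G = X @ X.T
    k = X @ phi
    return k @ np.linalg.solve(G + lam * np.eye(X.shape[0]), q)

# Orthogonal projector onto col(X) from a reduced QR factorization
Q_basis, _ = np.linalg.qr(X)
P = Q_basis @ Q_basis.T
q_in = P @ q          # component inside col(X)
q_res = q - q_in      # residual in col(X)^perp = ker(G)

pred_primal = primal_predict(X, q, phi, lam)
pred_dual = dual_predict(X, q, phi, lam)
pred_in_only = dual_predict(X, q_in, phi, lam)
pred_residual = dual_predict(X, q_res, phi, lam)

print("primal vs dual difference:", abs(pred_primal - pred_dual))
print("contribution of residual: ", abs(pred_residual))
```

The residual contribution is zero up to floating-point error, so compiling the full teacher and compiling only its in-subspace component give the same predictor.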

## Appendix H Proof of Lemma [1](https://arxiv.org/html/2605.29002#Thmlemma1): Encoder Heterogeneity Bias

We bound the deviation between a client oracle $\hat{Q}_{i}$ and the federation average $\bar{Q}=\sum_{j}\pi_{j}\hat{Q}_{j}$.

###### Proof\.

Recall $\hat{Q}_{k}(\cdot,a)=\Pi_{k}Q^{*}(\cdot,a)$. Using the triangle inequality through $Q^{*}$,

$$\|\hat{Q}_{i}-\hat{Q}_{j}\|_{\mathcal{H}}\leq\|(I-\Pi_{i})Q^{*}\|_{\mathcal{H}}+\|(I-\Pi_{j})Q^{*}\|_{\mathcal{H}}. \tag{7}$$

We bound the first term (the second follows symmetrically). Decompose

$$(I-\Pi_{i})Q^{*}=(I-\Pi_{i})\Pi_{j}Q^{*}+(I-\Pi_{i})(I-\Pi_{j})Q^{*}.$$

Cross-subspace term. Since $\Pi_{j}Q^{*}\in\mathcal{F}_{j}$, the definition of principal angle gives

$$\|(I-\Pi_{i})\Pi_{j}\|_{\mathcal{H}\to\mathcal{H}}=\sin(\theta_{ij}).$$

Thus

$$\|(I-\Pi_{i})\Pi_{j}Q^{*}\|_{\mathcal{H}}\leq\sin(\theta_{ij})\|\Pi_{j}Q^{*}\|_{\mathcal{H}}\leq B\sin(\theta_{ij}),$$

using $\|\Pi_{j}\|\leq 1$ and Assumption [1](https://arxiv.org/html/2605.29002#Thmassumption1)(i).

Residual term. Since $\|I-\Pi_{i}\|\leq 1$,

$$\|(I-\Pi_{i})(I-\Pi_{j})Q^{*}\|_{\mathcal{H}}\leq\|(I-\Pi_{j})Q^{*}\|_{\mathcal{H}}\leq\varepsilon_{j}^{\mathrm{rep}}.$$

Combining the two terms, we get $\|(I-\Pi_{i})Q^{*}\|_{\mathcal{H}}\leq B\sin(\theta_{ij})+\varepsilon_{j}^{\mathrm{rep}}$.

By symmetry in $i,j$,

$$\|\hat{Q}_{i}-\hat{Q}_{j}\|_{\mathcal{H}}\leq 2B\sin(\theta_{ij})+\varepsilon_{i}^{\mathrm{rep}}+\varepsilon_{j}^{\mathrm{rep}}. \tag{8}$$

Since $\bar{Q}=\sum_{j}\pi_{j}\hat{Q}_{j}$ and $\sum_{j}\pi_{j}=1$, the deviation from the federation average can be written as

$$\hat{Q}_{i}-\bar{Q}=\sum_{j\neq i}\pi_{j}(\hat{Q}_{i}-\hat{Q}_{j}).$$

Taking the $\mathcal{H}$-norm and applying the triangle inequality,

$$\|\hat{Q}_{i}-\bar{Q}\|_{\mathcal{H}}\leq\sum_{j\neq i}\pi_{j}\|\hat{Q}_{i}-\hat{Q}_{j}\|_{\mathcal{H}}.$$

Substituting ([8](https://arxiv.org/html/2605.29002#A8.E8)), we get

$$\|\hat{Q}_{i}-\bar{Q}\|_{\mathcal{H}}\leq 2B\sum_{j\neq i}\pi_{j}\sin(\theta_{ij})+\sum_{j\neq i}\pi_{j}(\varepsilon_{i}^{\mathrm{rep}}+\varepsilon_{j}^{\mathrm{rep}}). \tag{9}$$
By the reproducing property,

$$|f(s)|=|\langle f,\kappa(\cdot,s)\rangle_{\mathcal{H}}|\leq\|f\|_{\mathcal{H}}\sqrt{\kappa(s,s)}.$$

Under $\kappa(s,s)\leq 1$, $|\hat{Q}_{i}(s,a)-\bar{Q}(s,a)|\leq\|\hat{Q}_{i}-\bar{Q}\|_{\mathcal{H}}$.

Combining with ([9](https://arxiv.org/html/2605.29002#A8.E9)) yields the desired bound on Term (A). ∎
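Two ingredients of this proof are easy to sanity-check in finite dimensions. The sketch below is illustrative only: it assumes $\mathbb{R}^{n}$ with the Euclidean inner product as a stand-in for $\mathcal{H}$, two random subspaces of equal dimension, and $\theta_{ij}$ read as the largest principal angle. It checks that $\|(I-\Pi_{i})\Pi_{j}\|=\sin(\theta_{ij})$ and that $\hat{Q}_{i}-\bar{Q}=\sum_{j\neq i}\pi_{j}(\hat{Q}_{i}-\hat{Q}_{j})$ when the weights sum to one.

```python
import numpy as np

def projector(basis):
    """Return (orthogonal projector onto span(basis), orthonormal basis)."""
    Q, _ = np.linalg.qr(basis)
    return Q @ Q.T, Q

rng = np.random.default_rng(1)
n, d = 20, 4                       # ambient dimension and subspace dimension
Pi_i, Qi = projector(rng.standard_normal((n, d)))
Pi_j, Qj = projector(rng.standard_normal((n, d)))

# Cosines of the principal angles are the singular values of Qi^T Qj;
# the largest angle corresponds to the smallest cosine.
cosines = np.linalg.svd(Qi.T @ Qj, compute_uv=False)
sin_max = np.sqrt(1.0 - cosines.min() ** 2)
op_norm = np.linalg.norm((np.eye(n) - Pi_i) @ Pi_j, ord=2)

# Cross-subspace bound applied to an arbitrary target vector
q_star = rng.standard_normal(n)
cross = np.linalg.norm((np.eye(n) - Pi_i) @ Pi_j @ q_star)
cross_bound = sin_max * np.linalg.norm(Pi_j @ q_star)

# Deviation-from-average identity with weights summing to one
pis = np.array([0.5, 0.3, 0.2])
Q_hats = rng.standard_normal((3, n))          # hypothetical client oracles
Q_bar = pis @ Q_hats
lhs = Q_hats[0] - Q_bar
rhs = sum(pis[j] * (Q_hats[0] - Q_hats[j]) for j in range(1, 3))

print("operator norm:", op_norm, " sin(theta_max):", sin_max)
print("identity residual:", np.abs(lhs - rhs).max())
```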

## Appendix I Proof of Theorem [2](https://arxiv.org/html/2605.29002#Thmtheorem2)

We introduce $\bar{Q}(s,a)\coloneqq\sum_{i=1}^{N}\pi_{i}\hat{Q}_{i}(s,a)$ as the federation-weighted mean of the oracles, which serves as the global objective defined in ([3](https://arxiv.org/html/2605.29002#S4.E3)).

$$|\Delta_{i}(s,a)|\;\leq\;\underbrace{|\hat{Q}_{i}(s,a)-\bar{Q}(s,a)|}_{\text{(A): Representation Bias}}\;+\;\underbrace{|\bar{Q}(s,a)-Q_{i}(s,a;\,W^{\mathrm{glob}}_{i}(\lambda))|}_{\text{(B): Aggregation Distortion}}. \tag{10}$$

(A) is bounded by Lemma [1](https://arxiv.org/html/2605.29002#Thmlemma1) and corresponds to Term (I) in the main theorem.

(B) captures how accurately FedQHD's anchor-set ridge regression recovers the mean oracle.

We insert an intermediate predictor that uses the same ridge map as the global model but fits client $i$'s own oracle anchor labels. Define the oracle ridge solution

$$\hat{W}_{i}^{\mathrm{ridge}}\coloneqq(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}X_{i}^{\mathrm{H}}\hat{Q}_{i}^{\mathrm{ref}},\qquad\hat{Q}_{i}^{\mathrm{ref}}=\Re(X_{i}\hat{W}_{i})\in\mathbb{R}^{m\times|\mathcal{A}|}.$$

Then by the triangle inequality,

$$(B)\leq|\bar{Q}(s,a)-\hat{Q}_{i}(s,a)|+|\hat{Q}_{i}(s,a)-Q_{i}(s,a;\hat{W}_{i}^{\mathrm{ridge}})|+|Q_{i}(s,a;\hat{W}_{i}^{\mathrm{ridge}})-Q_{i}(s,a;W_{i}^{\mathrm{glob}}(\lambda))|.$$

The first term is exactly Term (A), already bounded in Appendix [H](https://arxiv.org/html/2605.29002#A8). Below we bound the remaining two terms and denote them by $(B_{1})$ and $(B_{2})$.

### Step 1: Bound $(B_{1})$ (oracle ridge shrinkage)

Since $Q_{i}(s,a;\hat{W}_{i}^{\mathrm{ridge}})=\Re(\Phi_{i}(s)^{\mathrm{H}}\hat{W}_{i}^{\mathrm{ridge}})=\Re\!\Big(\Phi_{i}(s)^{\mathrm{H}}(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}X_{i}^{\mathrm{H}}X_{i}\hat{W}_{i,a}\Big)$, we obtain

$$(B_{1})=\hat{Q}_{i}(s,a)-Q_{i}(s,a;\hat{W}_{i}^{\mathrm{ridge}})=\Re\!\Big(\Phi_{i}(s)^{\mathrm{H}}\bigl[I-(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}X_{i}^{\mathrm{H}}X_{i}\bigr]\hat{W}_{i}\Big)=\Re\!\Big(\Phi_{i}(s)^{\mathrm{H}}\lambda(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}\hat{W}_{i}\Big).$$

Decompose $\hat{W}_{i,a}=P_{X_{i}^{\mathrm{H}}}\hat{W}_{i}+P_{X_{i}^{\mathrm{H}}}^{\perp}\hat{W}_{i}$, where $P_{X_{i}^{\mathrm{H}}}$ projects onto $\mathrm{col}(X_{i}^{\mathrm{H}})=\mathrm{row}(X_{i})$. The two pieces behave differently under the operator $\lambda(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}$:

*(a) On $\mathrm{col}(X_{i}^{\mathrm{H}})$:* the operator $X_{i}^{\mathrm{H}}X_{i}$ has eigenvalues bounded below by $\lambda_{\min}^{+}(X_{i}^{\mathrm{H}}X_{i})=\lambda_{\min}^{+}(G_{i})\coloneqq\gamma_{i}$, so

$$\bigl\|\lambda(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}P_{X_{i}^{\mathrm{H}}}\bigr\|_{\mathrm{op}}=\frac{\lambda}{\gamma_{i}+\lambda}.$$

*(b) On $\mathrm{col}(X_{i}^{\mathrm{H}})^{\perp}$:* the operator acts as the identity with factor $\lambda/\lambda=1$, but this component is invisible to the predictor $\Phi_{i}(s)^{\mathrm{H}}$ in the following sense. By Theorem [1](https://arxiv.org/html/2605.29002#Thmtheorem1) (projection residual invisibility), the compiled predictor depends only on the in-subspace component of any teacher; equivalently, only the component of $\hat{W}_{i,a}$ in $\mathrm{col}(X_{i}^{\mathrm{H}})$ contributes to $\hat{Q}_{i}$ and $Q_{i}(\cdot;\hat{W}_{i}^{\mathrm{ridge}})$ at any state, modulo a feature-space residual that is absorbed into $\varepsilon_{i}^{\mathrm{rep}}$.

Restricting attention to the effective parameters $\hat{W}_{i,a}^{\parallel}\coloneqq P_{X_{i}^{\mathrm{H}}}\hat{W}_{i,a}$ and using $\|\Phi_{i}(s)\|_{2}\leq 1$ (Assumption [1](https://arxiv.org/html/2605.29002#Thmassumption1)(iii)),

$$(B_{1})\leq\|\Phi_{i}(s)\|_{2}\,\bigl\|\lambda(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}P_{X_{i}^{\mathrm{H}}}\bigr\|_{\mathrm{op}}\,\|\hat{W}_{i,a}^{\parallel}\|_{2}\leq\frac{\lambda}{\gamma_{i}+\lambda}\,\|\hat{W}_{i}\|_{F}.$$
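The operator facts used in Step 1 can be verified on a small example. This is a sketch under assumptions: real-valued Gaussian features replace the complex encoder, and $m<D$ is chosen so that $\mathrm{row}(X)$ is a proper subspace of $\mathbb{R}^{D}$. It checks the identity $I-(A+\lambda I)^{-1}A=\lambda(A+\lambda I)^{-1}$ with $A=X^{\top}X$, the shrinkage factor $\lambda/(\gamma+\lambda)$ on the row space, and the factor $1$ on its orthogonal complement.

```python
import numpy as np

rng = np.random.default_rng(2)
m, D, lam = 6, 10, 0.5            # m < D: row(X) is a proper subspace of R^D
X = rng.standard_normal((m, D))   # real-valued stand-in for anchor features
A = X.T @ X
I_D = np.eye(D)

# Shrinkage operator lam * (A + lam I)^{-1}
shrink = lam * np.linalg.inv(A + lam * I_D)

# Identity used in the derivation of (B_1)
identity_gap = np.abs((I_D - np.linalg.solve(A + lam * I_D, A)) - shrink).max()

# pinv(X) @ X is the orthogonal projector onto row(X)
P_row = np.linalg.pinv(X) @ X

# X has full row rank here, so every eigenvalue of G = X X^T is positive
# and the smallest one is gamma = lambda_min^+(G).
gamma = np.linalg.eigvalsh(X @ X.T)[0]

op_on_row = np.linalg.norm(shrink @ P_row, ord=2)
op_on_perp = np.linalg.norm(shrink @ (I_D - P_row), ord=2)

print("identity gap:          ", identity_gap)
print("norm on row(X):        ", op_on_row, " vs lam/(gamma+lam) =", lam / (gamma + lam))
print("norm on row(X)^perp:   ", op_on_perp)
```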

### Step 2: Bound $(B_{2})$ (teacher mismatch on anchors)

Now $(B_{2})$ measures how ridge fitting reacts when we change the training targets from $\hat{Q}_{i}^{\mathrm{ref}}$ to $Q_{\mathrm{ref}}^{\mathrm{glob}}=\sum_{j}\pi_{j}\hat{Q}_{j}^{\mathrm{ref}}$. Define the anchor-level mismatch for action $a$:

$$\Delta^{\mathrm{ref}}_{i,a}\coloneqq Q^{\mathrm{glob}}_{\mathrm{ref},a}-\hat{Q}^{\mathrm{ref}}_{i,a}\in\mathbb{R}^{m}.$$

Because ridge is linear in the targets, the parameter difference is

$$W_{i}^{\mathrm{glob}}(\lambda)-\hat{W}_{i}^{\mathrm{ridge}}=(X_{i}^{\mathrm{H}}X_{i}+\lambda I)^{-1}X_{i}^{\mathrm{H}}\Delta_{i}^{\mathrm{ref}}.$$

Evaluating at $s$ and using the Woodbury form (equivalently the representer form for linear ridge),

$$Q_{i}(s,a;W_{i}^{\mathrm{glob}})-Q_{i}(s,a;\hat{W}_{i}^{\mathrm{ridge}})=\mathbf{k}_{i}(s)^{\top}(G_{i}+\lambda I)^{-1}\Delta^{\mathrm{ref}}_{i,a}.$$

Only the component of $\Delta^{\mathrm{ref}}_{i,a}$ inside $\mathrm{col}(X_{i})$ matters because $\mathbf{k}_{i}(s)\in\mathrm{col}(X_{i})$; therefore we may replace $\Delta^{\mathrm{ref}}_{i,a}$ by $P_{i}\Delta^{\mathrm{ref}}_{i,a}$:

$$\mathbf{k}_{i}(s)^{\top}(G_{i}+\lambda I)^{-1}\Delta^{\mathrm{ref}}_{i,a}=\mathbf{k}_{i}(s)^{\top}(G_{i}+\lambda I)^{-1}P_{i}\Delta^{\mathrm{ref}}_{i,a}.$$

To bound this scalar, apply Cauchy–Schwarz after inserting $(G_{i})^{\dagger/2}(G_{i})^{1/2}$:

$$\bigl|\mathbf{k}_{i}(s)^{\top}(G_{i}+\lambda I)^{-1}P_{i}\Delta^{\mathrm{ref}}_{i,a}\bigr|\leq\sqrt{\mathbf{k}_{i}(s)^{\top}G_{i}^{\dagger}\mathbf{k}_{i}(s)}\;\sqrt{(P_{i}\Delta^{\mathrm{ref}}_{i,a})^{\top}G_{i}(G_{i}+\lambda I)^{-2}(P_{i}\Delta^{\mathrm{ref}}_{i,a})}.$$

The first factor satisfies $\mathbf{k}_{i}(s)^{\top}G_{i}^{\dagger}\mathbf{k}_{i}(s)=\|P_{X_{i}^{\mathrm{H}}}\Phi_{i}(s)\|_{2}^{2}\leq\|\Phi_{i}(s)\|_{2}^{2}\leq 1$. For the second factor, on $\mathrm{col}(X_{i})$ the eigenvalues of $G_{i}$ are bounded below by $\gamma_{i}=\lambda_{\min}^{+}(G_{i})$. Since $t/(t+\lambda)^{2}\leq 1/(t+\lambda)$ for $t>0$,

$$\max_{t\geq\gamma_{i}}\frac{t}{(t+\lambda)^{2}}\leq\frac{1}{\gamma_{i}+\lambda}.$$

This gives

$$(P_{i}\Delta^{\mathrm{ref}}_{i,a})^{\top}G_{i}(G_{i}+\lambda I)^{-2}(P_{i}\Delta^{\mathrm{ref}}_{i,a})\leq\frac{1}{\gamma_{i}+\lambda}\|P_{i}\Delta^{\mathrm{ref}}_{i,a}\|_{2}^{2}.$$

Combining yields the bound

$$(B_{2})\leq\frac{1}{\sqrt{\gamma_{i}+\lambda}}\,\|P_{i}\Delta^{\mathrm{ref}}_{i,a}\|_{2}.$$
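The Step 2 bound can be spot-checked numerically. The sketch below is illustrative and rests on assumptions: real-valued Gaussian anchor features, a query feature vector normalized to unit norm so that $\|\Phi_{i}(s)\|_{2}\leq 1$ holds, and a random vector standing in for the mismatch $\Delta^{\mathrm{ref}}_{i,a}$. It compares the scalar $|\mathbf{k}^{\top}(G+\lambda I)^{-1}P\Delta|$ against $\|P\Delta\|_{2}/\sqrt{\gamma+\lambda}$ and checks the scalar inequality on a grid.

```python
import numpy as np

rng = np.random.default_rng(3)
m, D, lam = 15, 6, 0.2
X = rng.standard_normal((m, D))            # real-valued stand-in for anchor features
phi = rng.standard_normal(D)
phi /= np.linalg.norm(phi)                 # enforce ||phi||_2 <= 1
delta = rng.standard_normal(m)             # hypothetical anchor-level mismatch

G = X @ X.T                                # m x m Gram matrix, rank D < m
k = X @ phi                                # k(s) lies in col(X)

# Orthogonal projector onto col(X)
Q_basis, _ = np.linalg.qr(X)
P = Q_basis @ Q_basis.T

# gamma = smallest strictly positive eigenvalue of G (drop numerical zeros)
eigs = np.linalg.eigvalsh(G)
gamma = eigs[eigs > 1e-9 * eigs.max()].min()

lhs = abs(k @ np.linalg.solve(G + lam * np.eye(m), P @ delta))
rhs = np.linalg.norm(P @ delta) / np.sqrt(gamma + lam)

# Scalar inequality t / (t + lam)^2 <= 1 / (gamma + lam) for t >= gamma
t = np.linspace(gamma, 100.0 * gamma, 1000)
scalar_ok = bool(np.all(t / (t + lam) ** 2 <= 1.0 / (gamma + lam) + 1e-15))

print("lhs:", lhs, " rhs:", rhs, " scalar inequality holds:", scalar_ok)
```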

### Step 3: Bound the projected mismatch $\|P_{i}\Delta^{\mathrm{ref}}_{i,a}\|_{2}$

We first expand $\Delta^{\mathrm{ref}}_{i,a}$ using $Q^{\mathrm{glob}}_{\mathrm{ref},a}=\sum_{j}\pi_{j}\hat{Q}^{\mathrm{ref}}_{j,a}$:

$$\Delta^{\mathrm{ref}}_{i,a}=\sum_{j\neq i}\pi_{j}(\hat{Q}^{\mathrm{ref}}_{j,a}-\hat{Q}^{\mathrm{ref}}_{i,a}),\qquad\|P_{i}\Delta^{\mathrm{ref}}_{i,a}\|_{2}\leq\|\Delta^{\mathrm{ref}}_{i,a}\|_{2}\leq\sum_{j\neq i}\pi_{j}\|\hat{Q}^{\mathrm{ref}}_{j,a}-\hat{Q}^{\mathrm{ref}}_{i,a}\|_{2}.$$

Each vector difference collects evaluations of the RKHS function $h_{ij,a}(s)\coloneqq\hat{Q}_{j}(s,a)-\hat{Q}_{i}(s,a)$ on the $m$ anchors, so by the reproducing property bound $|h_{ij,a}(s_{\ell})|\leq\|h_{ij,a}\|_{\mathcal{H}}$ (using $\kappa(s_{\ell},s_{\ell})\leq 1$),

$$\|\hat{Q}^{\mathrm{ref}}_{j,a}-\hat{Q}^{\mathrm{ref}}_{i,a}\|_{2}\leq\sqrt{m}\,\|\hat{Q}_{j}(\cdot,a)-\hat{Q}_{i}(\cdot,a)\|_{\mathcal{H}}.$$

Plugging in the pairwise oracle gap bound from Appendix [H](https://arxiv.org/html/2605.29002#A8),

$$\|\hat{Q}_{j}(\cdot,a)-\hat{Q}_{i}(\cdot,a)\|_{\mathcal{H}}\leq 2B\sin(\theta_{ij})+\varepsilon_{i}^{\mathrm{rep}}+\varepsilon_{j}^{\mathrm{rep}},$$

we obtain

$$\|P_{i}\Delta^{\mathrm{ref}}_{i,a}\|_{2}\leq\sqrt{m}\sum_{j\neq i}\pi_{j}\bigl(2B\sin(\theta_{ij})+\varepsilon_{i}^{\mathrm{rep}}+\varepsilon_{j}^{\mathrm{rep}}\bigr)=\sqrt{m}\,\bar{h}_{i}.$$

Combining Steps 1–3 with the Term (A) bound from Lemma [1](https://arxiv.org/html/2605.29002#Thmlemma1) (which gives $|\hat{Q}_{i}(s,a)-\bar{Q}(s,a)|\leq\bar{h}_{i}$), we have

$$|\Delta_{i}(s,a)|\leq\underbrace{\bar{h}_{i}}_{(A)=\text{Term (I)}}+\underbrace{\frac{\sqrt{m}}{\sqrt{\gamma_{i}+\lambda}}\bar{h}_{i}}_{(B_{2})=\text{Term (II)}}+\underbrace{\frac{\lambda}{\gamma_{i}+\lambda}\|\hat{W}_{i}\|_{F}}_{(B_{1})=\text{Term (III)}}.$$
∎
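To make the final three-term bound concrete, the sketch below evaluates it for given inputs. The function names `heterogeneity_floor` and `federation_gap_terms` and all numeric values are hypothetical, chosen only to illustrate how the terms combine; they are not results from the paper.

```python
import math

def heterogeneity_floor(pis, sines, eps, i, B):
    """h_bar_i = sum_{j != i} pi_j * (2 B sin(theta_ij) + eps_i + eps_j).

    pis:   federation weights pi_j
    sines: sin(theta_ij) for the fixed client i, indexed by j
    eps:   representation errors eps_j^rep
    """
    return sum(
        pis[j] * (2.0 * B * sines[j] + eps[i] + eps[j])
        for j in range(len(pis))
        if j != i
    )

def federation_gap_terms(h_bar, m, gamma, lam, w_norm):
    """Return (Term I, Term II, Term III) of the bound on |Delta_i(s, a)|."""
    term_i = h_bar                                          # representation bias
    term_ii = math.sqrt(m) / math.sqrt(gamma + lam) * h_bar  # anchor-set conditioning
    term_iii = lam / (gamma + lam) * w_norm                  # regularization bias
    return term_i, term_ii, term_iii

if __name__ == "__main__":
    # Illustrative two-client example with made-up values
    h = heterogeneity_floor(pis=[0.5, 0.5], sines=[0.0, 0.2],
                            eps=[0.01, 0.03], i=0, B=1.0)
    terms = federation_gap_terms(h_bar=h, m=4, gamma=3.0, lam=1.0, w_norm=2.0)
    print("h_bar:", h)
    print("terms:", terms, " total bound:", sum(terms))
```

The sketch makes the trade-off in $\lambda$ visible: increasing $\lambda$ shrinks Term (II) through the $\sqrt{\gamma_{i}+\lambda}$ denominator but pushes the Term (III) factor $\lambda/(\gamma_{i}+\lambda)$ toward one.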
