TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination
Summary
This paper identifies a structural failure mode in sequential fine-tuning of shared-context multi-agent LLM teams, formalized as compounding occupancy shift, and proposes TeamTR, a trust-region framework that resamples trajectories and enforces per-agent divergence control, achieving 7.1% average improvement over baselines.
# TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination
Source: [https://arxiv.org/html/2605.15207](https://arxiv.org/html/2605.15207)
###### Abstract
Multi-agent LLM systems have shown promise for complex reasoning, yet recent evaluations reveal they often underperform single-model baselines. We identify a structural failure mode in sequential fine-tuning of shared-context teams: updating one agent shifts the team's context distribution, and when subsequent updates are evaluated on cached rollouts, this mismatch compounds. We formalize this as the *compounding occupancy shift* and prove that stale-occupancy evaluation incurs a penalty that scales quadratically with the number of agents. In contrast, intermediate-occupancy evaluation reduces this to linear scaling. We propose TeamTR, a trust-region framework that resamples trajectories after each component update and enforces per-agent divergence control, yielding rigorous per-update and per-stage improvement lower bounds. Experiments show that TeamTR outperforms single-agent and sequential baselines by 7.1% on average, mitigates coordination regressions, and supports plug-and-play component replacement. Code is available at [https://github.com/Yydc/TeamTR](https://github.com/Yydc/TeamTR).
Machine Learning, ICML
## 1 Introduction
Multi-agent LLM systems coordinate role-specialized components for complex reasoning and task execution (Yao et al., [2022](https://arxiv.org/html/2605.15207#bib.bib16); Wu et al., [2023](https://arxiv.org/html/2605.15207#bib.bib14); Hong et al., [2023](https://arxiv.org/html/2605.15207#bib.bib36); Du et al., [2023](https://arxiv.org/html/2605.15207#bib.bib34)). Despite their success, recent evaluations reveal that such teams often underperform single strong models with best-of-$N$ sampling (Kim et al., [2025](https://arxiv.org/html/2605.15207#bib.bib1)). These failures are attributed to suboptimal coordination protocols (Cemri et al., [2025](https://arxiv.org/html/2605.15207#bib.bib7)). We identify that the training process of multi-agent LLM (MA-LLM) systems can itself introduce this problem and undermine coordination.
Most multi-agent LLM systems are *shared-context* teams, in which agents interact turn by turn over a common textual state (Wu et al., [2023](https://arxiv.org/html/2605.15207#bib.bib14); Hong et al., [2023](https://arxiv.org/html/2605.15207#bib.bib36)). Fine-tuning such teams may proceed via joint updates (all agents simultaneously) or sequential updates (one agent at a time). The instability of joint updates is known in MARL, where coupled policy changes make optimization difficult to control (Foerster et al., [2017](https://arxiv.org/html/2605.15207#bib.bib62); Kuba et al., [2021](https://arxiv.org/html/2605.15207#bib.bib61)), and has also been reported for MA-LLMs (Liu et al., [2025a](https://arxiv.org/html/2605.15207#bib.bib48)) (Fig. [1](https://arxiv.org/html/2605.15207#S1.F1), left). Sequential training offers a more stable alternative and is increasingly adopted for modular optimization (Subramaniam et al., [2025](https://arxiv.org/html/2605.15207#bib.bib50)). However, sequential methods introduce a failure mode: each update shifts the context distribution seen by subsequent agents, and when rollouts are cached at stage start to reduce sampling cost, this mismatch compounds (Fig. [1](https://arxiv.org/html/2605.15207#S1.F1), middle).
We formalize this as *compounding occupancy shift*: when later agents are updated using rollouts collected before earlier agents were updated, the resulting distribution mismatch accumulates across the update sequence. Under per-step trust regions of radii $\{\delta_i\}_{i=1}^{n}$, stale-occupancy evaluation incurs a penalty scaling as $O(n^{2}\sqrt{\bar{\delta}})$, whereas intermediate-occupancy evaluation reduces this to $O(n\sqrt{\bar{\delta}})$. This gap explains why naive sequential fine-tuning can regress coordination even when each update appears locally beneficial.
To address this, we propose*TeamTR*, a stage\-wise trust\-region framework that resamples trajectories after each component update\. This ensures that each agent is trained under the distribution induced by the partially updated team, thereby eliminating the stale\-occupancy penalty\. Per\-agent trust regions control the occupancy drift introduced by each update, keeping the distribution shift bounded\. TeamTR yields rigorous lower bounds on improvement: surrogate gain minus explicit penalties for occupancy shift and estimation error\. The certificate applies to any update order, and penalties can be tracked empirically in the training process\.
Figure 1: Update trajectories on the team objective landscape. Left: Joint updates suffer from coupled drift and uncoordinated leaps. Middle: Naive sequential updates with cached rollouts drift away from the target due to stale occupancy. Right: TeamTR resamples under fresh occupancy after each update, reaching the target stably. Inset: Occupancy distributions (top), penalty term scaling (bottom).

We instantiate TeamTR under a turn-taking protocol where one agent is active per step. This yields a factorization: team-level divergence reduces to single-agent divergence on visited states, enabling tractable trust regions. A token-decomposable behavior-to-updated KL yields a monitorable trust-region constraint, estimated from on-policy rollouts via sampled log-probability differences. Empirically, TeamTR improves performance relative to sequential baselines, stabilizes coordination, and supports plug-and-play replacement of components. Our contributions are summarized as follows:
- We formalize the *compounding occupancy shift* in shared-context multi-LLM fine-tuning and prove that stale-occupancy evaluation incurs an $O(n^{2})$ penalty (Sec. [3](https://arxiv.org/html/2605.15207#S3)).
- We propose TeamTR, which reduces this penalty to $O(n)$ via intermediate-occupancy evaluation and provides rigorous per-update improvement lower bounds (Sec. [4](https://arxiv.org/html/2605.15207#S4)).
- We empirically validate that TeamTR outperforms existing single and multi-agent baselines by stabilizing training and enabling components to be plug-and-play (Sec. [5](https://arxiv.org/html/2605.15207#S5)).
## 2 Related Work
##### Multi-agent LLM systems.
Multi\-agent LLM systems have evolved along two axes: inference\-time orchestration and training\-time coordination\. Inference\-time approaches deploy frozen models under hand\-crafted protocols, including debate and consensus frameworks\(Duet al\.,[2023](https://arxiv.org/html/2605.15207#bib.bib34); Lianget al\.,[2024](https://arxiv.org/html/2605.15207#bib.bib35)\), structured role\-play pipelines\(Honget al\.,[2023](https://arxiv.org/html/2605.15207#bib.bib36); Qianet al\.,[2024](https://arxiv.org/html/2605.15207#bib.bib37)\), and general\-purpose orchestration libraries\(Wuet al\.,[2024](https://arxiv.org/html/2605.15207#bib.bib38)\)\. Training\-time approaches aim to internalize coordination through supervised fine\-tuning on interaction trajectories\(Chenet al\.,[2023](https://arxiv.org/html/2605.15207#bib.bib41); Zenget al\.,[2024](https://arxiv.org/html/2605.15207#bib.bib57)\)or reinforcement learning with social or preference signals\(Liuet al\.,[2023a](https://arxiv.org/html/2605.15207#bib.bib43); Leeet al\.,[2023](https://arxiv.org/html/2605.15207#bib.bib56)\)\. These methods typically assume frozen components or joint/independent training; none explicitly addresses the occupancy shift that arises under sequential fine\-tuning\.
##### Trust regions in multi-agent RL.
Trust-region methods guarantee monotonic improvement by constraining policy divergence (Schulman et al., [2015](https://arxiv.org/html/2605.15207#bib.bib58), [2017](https://arxiv.org/html/2605.15207#bib.bib59)). Extending these guarantees to multi-agent LLM settings is nontrivial due to non-stationarity from simultaneous updates. HATRPO and HAPPO (Kuba et al., [2021](https://arxiv.org/html/2605.15207#bib.bib61)) derive a multi-agent advantage decomposition showing that sequential updates preserve monotonic improvement in heterogeneous teams, but they operate on low-dimensional continuous control and do not address the autoregressive, token-level structure of LLM message spaces. TeamTR adapts these insights to shared-context LLM teams by defining trust regions based on token-level KL divergences computed from on-policy rollouts.
##### Distribution shift and modular evolution\.
Distribution shift under sequential updates is a known challenge in both MARL\(Foersteret al\.,[2017](https://arxiv.org/html/2605.15207#bib.bib62)\)and RLHF\(Casperet al\.,[2023](https://arxiv.org/html/2605.15207#bib.bib64)\), often addressed via importance sampling or replay buffers\. For modular system evolution, model\-merging techniques\(Ilharcoet al\.,[2022](https://arxiv.org/html/2605.15207#bib.bib66); Wortsmanet al\.,[2022](https://arxiv.org/html/2605.15207#bib.bib67)\)compose capabilities by manipulating weight vectors but assume independent training distributions and static composition\. TeamTR provides a dynamic alternative: it bounds the distribution shift induced by each sequential update via intermediate\-occupancy trust regions, yielding theoretically guaranteed improvement that extends to agent plug\-and\-play\. Extended discussions are provided in Appendix[A](https://arxiv.org/html/2605.15207#A1)\.
## 3 Theoretical Framework
We develop a framework for fine-tuning multi-agent LLM teams under sequential component updates. The central challenge is *compounding occupancy shift*: each update changes the team's state distribution, and reusing stale rollouts for the next agent incurs an additional certificate penalty term that scales as $O(n^{2}\sqrt{\bar{\delta}})$ (for fixed trust-region radii), where $n$ is the number of agents. Our analysis identifies this failure mode and shows that intermediate-occupancy evaluation reduces the dominant penalty to $O(n\sqrt{\bar{\delta}})$.
### 3.1 Shared-context team as a message-action MDP
We model team execution as a discounted MDP $\mathcal{M}=(\mathcal{S},\{\mathcal{A}_j\}_{j=1}^{n},P,r,\gamma)$ with $\gamma\in(0,1)$ and bounded reward $|r|\leq R_{\max}$. The state $s\in\mathcal{S}$ is the shared textual context (the prompt plus accumulated messages) and can optionally include the active agent's ID selected by a router. We treat the router as fixed and fold it into the environment dynamics. Each agent $j$ chooses a macro-action $a_j\in\mathcal{A}_j$ corresponding to a message (token sequence), and the team policy factorizes as $\pi(\mathbf{a}\mid s)=\prod_{j=1}^{n}\pi^{(j)}(a_j\mid s)$. We focus on the turn-taking protocol: at each decision step, exactly one agent is active and emits a message that augments the shared context; all other agents take a fixed no-op action.
###### Lemma 3.1 (Active-factor reduction in turn-taking protocol).
Assume the turn-taking protocol: for each state $s$, there is an active agent index $j(s)$ such that all inactive agents deterministically take $\mathrm{noop}$ under both $\pi$ and $\pi'$. Then for any two team policies $\pi',\pi$,
$$D_{\mathrm{KL}}\big(\pi'(\cdot\mid s)\,\|\,\pi(\cdot\mid s)\big)=D_{\mathrm{KL}}\big(\pi^{\prime(j(s))}(\cdot\mid s)\,\|\,\pi^{(j(s))}(\cdot\mid s)\big),$$
and expectations of functions that depend only on the active action reduce to expectations over the active agent's action.
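A brief sketch of why this reduction holds may be useful. It assumes only the product factorization of Sec. 3.1 and the deterministic no-op assumption of the lemma, and it is an informal reading rather than the authors' proof (which is in Appendix B.1). The symbol $\mathbf{1}_{\mathrm{noop}}$, a point mass on the no-op action, is notation introduced here for illustration. The KL divergence between product distributions is additive over factors, and every inactive factor is the same point mass under both policies, so it contributes zero:

```latex
\begin{aligned}
D_{\mathrm{KL}}\big(\pi'(\cdot\mid s)\,\|\,\pi(\cdot\mid s)\big)
&= \sum_{j=1}^{n} D_{\mathrm{KL}}\big(\pi^{\prime(j)}(\cdot\mid s)\,\|\,\pi^{(j)}(\cdot\mid s)\big) \\
&= D_{\mathrm{KL}}\big(\pi^{\prime(j(s))}(\cdot\mid s)\,\|\,\pi^{(j(s))}(\cdot\mid s)\big)
 + \sum_{j\neq j(s)} \underbrace{D_{\mathrm{KL}}\big(\mathbf{1}_{\mathrm{noop}}\,\|\,\mathbf{1}_{\mathrm{noop}}\big)}_{=\,0}.
\end{aligned}
```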
Lemma [3.1](https://arxiv.org/html/2605.15207#S3.Thmtheorem1) addresses a practical obstacle: a team-level trust region over the joint action space $\prod_j\mathcal{A}_j$ is intractable for LLM message distributions, while the turn-taking protocol reduces the divergence constraint to a single-agent quantity. The detailed proof can be found in Appendix [B.1](https://arxiv.org/html/2605.15207#A2.SS1). We study within-stage sequential updates. Let $\pi_{\mathrm{cur}}$ be the team at stage start and $\sigma$ an update order (any permutation). Denote by $\pi[j\leftarrow\pi']$ the team obtained by replacing agent $j$ in $\pi$ with $\pi'$. We define the intermediate policies as follows:
$$\pi_{\mathrm{cur}}=\hat{\pi}^{0}\xrightarrow[\text{Update}]{\sigma(1)\leftarrow\pi_{\mathrm{tar}}^{\sigma(1)}}\hat{\pi}^{1}\to\cdots\to\hat{\pi}^{n-1}\xrightarrow[\text{Update}]{\sigma(n)\leftarrow\pi_{\mathrm{tar}}^{\sigma(n)}}\hat{\pi}^{n}=\bar{\pi}\tag{1}$$
where $\pi_{\mathrm{tar}}^{\sigma(i)}$ denotes the updated policy for agent $\sigma(i)$ obtained by (approximately) maximizing the surrogate objective in Section [3.3](https://arxiv.org/html/2605.15207#S3.SS3). Let $d^{\pi}$ be the discounted occupancy induced by $\pi$, and $J(\pi)$ its discounted return. Within each stage, each agent is updated at most once; thus at step $i$ the pre-update policy for agent $j=\sigma(i)$ is the corresponding factor in $\hat{\pi}^{\,i-1}$, which we denote by $\pi_{\mathrm{cur}}^{\sigma(i)}$ for brevity.
### 3.2 Token-decomposed trust regions
To control occupancy shift, we constrain the *behavior-to-updated* KL divergence, which can be estimated directly from trajectories generated by the pre-update policy, without requiring samples from the updated policy.
For a reference policy $\rho$ and policies $\pi,\pi'$, we define:
$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\rho}(\pi\|\pi')\;\coloneqq\;\mathbb{E}_{s\sim d^{\rho}}D_{\mathrm{KL}}\big(\pi(\cdot\mid s)\,\|\,\pi'(\cdot\mid s)\big).\tag{2}$$
Here $\mathrm{tok}$ emphasizes *token-decomposability* for autoregressive messages: the message-level KL decomposes exactly into token-level KLs via the chain rule. Let $\pi_u\coloneqq\pi^{(j)}(\cdot\mid s,x_{<u})$ and $\pi'_u\coloneqq\pi^{\prime(j)}(\cdot\mid s,x_{<u})$. Then we have:
$$D_{\mathrm{KL}}\big(\pi^{(j)}(\cdot\mid s)\,\|\,\pi^{\prime(j)}(\cdot\mid s)\big)=\mathbb{E}_{m\sim\pi^{(j)}(\cdot\mid s)}\sum_{u=1}^{T(m)}D_{\mathrm{KL}}(\pi_u\,\|\,\pi'_u),\tag{3}$$
where $T(m)$ is the (random) message length. Computing each token-level KL term exactly requires a full-vocabulary sum; in practice, we therefore use a sampled estimator based on token log-probability differences along rollouts from the behavior policy (details in Appendix [B.2](https://arxiv.org/html/2605.15207#A2.SS2) and [F.6](https://arxiv.org/html/2605.15207#A6.SS6)).
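The chain-rule decomposition in Eq. (3) can be checked numerically on a toy model. The sketch below is illustrative only: it assumes a binary vocabulary, a fixed message length of two tokens, and hand-picked probability tables (`pi_old`, `pi_new`), none of which come from the paper. It compares the KL between the full message distributions with the expected sum of per-token KLs.

```python
import math
from itertools import product

def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy autoregressive policies over length-2 messages with a binary vocabulary.
# Each maps a prefix (tuple of earlier tokens) to a next-token distribution.
# The numbers are arbitrary illustrative values.
pi_old = {(): [0.7, 0.3], (0,): [0.6, 0.4], (1,): [0.2, 0.8]}
pi_new = {(): [0.5, 0.5], (0,): [0.9, 0.1], (1,): [0.3, 0.7]}

def message_prob(policy, msg):
    """Probability of a whole message as a product of next-token probabilities."""
    prob = 1.0
    for u, tok in enumerate(msg):
        prob *= policy[msg[:u]][tok]
    return prob

messages = list(product([0, 1], repeat=2))

# Left-hand side of Eq. (3): KL between the full message distributions.
message_kl = kl([message_prob(pi_old, m) for m in messages],
                [message_prob(pi_new, m) for m in messages])

# Right-hand side: expected sum of per-token KLs along messages drawn from pi_old.
token_kl = sum(
    message_prob(pi_old, m)
    * sum(kl(pi_old[m[:u]], pi_new[m[:u]]) for u in range(len(m)))
    for m in messages
)

print(message_kl, token_kl)  # the two values agree up to floating-point error
```

For real LLM vocabularies the per-token KL is too expensive to sum exactly, which is why the text falls back on a sampled estimator.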
At update step $i$, TeamTR constrains the updated factor:
$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\big(\pi_{\mathrm{cur}}^{\sigma(i)}\,\|\,\pi_{\mathrm{tar}}^{\sigma(i)}\big)\leq\delta_i.\tag{4}$$
The radius $\delta_i$ controls the allowed per-step policy change: smaller $\delta_i$ reduces occupancy-shift penalties (which scale as $\sqrt{\delta_i}$) but restricts policy movement. Since each within-stage step updates a single factor under the turn-taking protocol, Eq. ([4](https://arxiv.org/html/2605.15207#S3.E4)) also implies a team-level step bound $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}(\hat{\pi}^{\,i-1}\,\|\,\hat{\pi}^{\,i})\leq\delta_i$ (formalized in Lemma [3.5](https://arxiv.org/html/2605.15207#S3.Thmtheorem5)).
### 3.3 Intermediate-occupancy surrogates and error
Let $A^{\pi}(s,\mathbf{a})$ be the macro-action advantage. When updating agent $\sigma(i)$, we consider the population surrogate evaluated under the intermediate occupancy:
$$L_i^{\mathrm{seq}}\;=\;\frac{1}{1-\gamma}\;\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}(\cdot\mid s)}\Big[\widehat{A}_{i-1}(s,\mathbf{a})\Big],\tag{5}$$
where $\widehat{A}_{i-1}$ is an estimator constructed from rollout data collected under $\hat{\pi}^{\,i-1}$. In practice, TeamTR approximately maximizes a clipped objective (see Appendix [E.1](https://arxiv.org/html/2605.15207#A5.SS1) for a concrete instantiation); the certificate statements below hold for any realized update that satisfies the trust-region constraint. We track the mismatch between the estimator and the true advantage. Let $\mathcal{D}_{i-1}$ denote the rollout data used to construct $\widehat{A}_{i-1}$ under $\hat{\pi}^{\,i-1}$. Assuming $|A^{\pi}(s,\mathbf{a})|\leq A_{\max}$ and enforcing $|\widehat{A}_{i-1}|\leq A_{\max}$ via clipping, we define:
$$\zeta_i\coloneqq\bigg|\mathbb{E}_{\substack{s\sim d^{\hat{\pi}^{\,i-1}}\\ \mathbf{a}\sim\hat{\pi}^{\,i}(\cdot\mid s)}}\Big[\mathbb{E}_{\mathcal{D}_{i-1}}\big[\widehat{A}_{i-1}(s,\mathbf{a})\big]-A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})\Big]\bigg|.\tag{6}$$
### 3.4 Occupancy-shift and improvement lower bounds
The following bound connects expected divergence to occupancy shift, explaining why trust regions yield explicit control over distribution mismatch. Proofs for this subsection are provided in Appendix [B.3](https://arxiv.org/html/2605.15207#A2.SS3), [B.4](https://arxiv.org/html/2605.15207#A2.SS4), [C.1](https://arxiv.org/html/2605.15207#A3.SS1), [C.2](https://arxiv.org/html/2605.15207#A3.SS2), and [C.3](https://arxiv.org/html/2605.15207#A3.SS3).
###### Lemma 3.2 (Occupancy shift under expected divergence).
For any policies $\pi',\pi$ and any bounded measurable $f$,
$$\big|\mathbb{E}_{d^{\pi'}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{\sqrt{2}\gamma}{1-\gamma}\;\sqrt{D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\pi}(\pi\|\pi')}\;\|f\|_{\infty}.\tag{7}$$
###### Proposition 3.3 (Quadratic-to-linear reduction).
Let $\hat{\pi}^{0}$ be the stage-start team and $\hat{\pi}^{\,i-1}$ the intermediate team before step $i$. Assume the per-step trust regions in Eq. ([4](https://arxiv.org/html/2605.15207#S3.E4)). Then for any bounded measurable $f$ and any $i\geq 1$,
$$\big|\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{0}}}[f]\big|\;\leq\;\frac{\sqrt{2}\gamma}{1-\gamma}\;\|f\|_{\infty}\sum_{k<i}\sqrt{\delta_k}.\tag{8}$$
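One plausible route to Eq. (8), sketched here under the assumption that Lemma 3.2 and the team-level step bound of Lemma 3.5 are applied to each consecutive pair of intermediate teams, is a telescoping argument. This is an informal reading and not a substitute for the authors' proof in the appendix:

```latex
\begin{aligned}
\big|\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{0}}}[f]\big|
&\leq \sum_{k=1}^{i-1}\big|\mathbb{E}_{d^{\hat{\pi}^{\,k}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,k-1}}}[f]\big|
&& \text{(triangle inequality)} \\
&\leq \frac{\sqrt{2}\gamma}{1-\gamma}\,\|f\|_{\infty}\sum_{k=1}^{i-1}
\sqrt{D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,k-1}}\big(\hat{\pi}^{\,k-1}\,\|\,\hat{\pi}^{\,k}\big)}
&& \text{(Lemma 3.2)} \\
&\leq \frac{\sqrt{2}\gamma}{1-\gamma}\,\|f\|_{\infty}\sum_{k<i}\sqrt{\delta_k}
&& \text{(Eq. (4) and Lemma 3.5)}.
\end{aligned}
```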
We define the stale-occupancy surrogate by evaluating on the stage-start occupancy:
$$L_i^{\mathrm{stale}}\;=\;\frac{1}{1-\gamma}\;\mathbb{E}_{s\sim d^{\hat{\pi}^{0}},\,\mathbf{a}\sim\hat{\pi}^{\,i}(\cdot\mid s)}\Big[\widehat{A}_{i-1}(s,\mathbf{a})\Big].\tag{9}$$
Applying Eq. (8) to the inner expectation over $\mathbf{a}$, whose magnitude is at most $A_{\max}$, gives
$$\big|L_i^{\mathrm{seq}}-L_i^{\mathrm{stale}}\big|\;\leq\;\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}\;A_{\max}\sum_{k<i}\sqrt{\delta_k}.\tag{10}$$
Thus, stale-occupancy evaluation incurs a cumulative penalty of $O(n^{2}\sqrt{\bar{\delta}})$ with $\bar{\delta}=\max_k\delta_k$, whereas intermediate-occupancy evaluation achieves $O(n\sqrt{\bar{\delta}})$.
###### Theorem 3.4 (Single-step improvement lower bound).
For step $i\in\{1,\ldots,n\}$, assume the trust region in Eq. ([4](https://arxiv.org/html/2605.15207#S3.E4)). Then
$$J(\hat{\pi}^{\,i})-J(\hat{\pi}^{\,i-1})\geq L_i^{\mathrm{seq}}-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}\,A_{\max}\,\sqrt{\delta_i}-\frac{\zeta_i}{1-\gamma}.\tag{11}$$
###### Lemma 3.5 (Per-step KL reduction under factorized updates).
The per-step reduction satisfies:
$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\big(\hat{\pi}^{\,i-1}\,\|\,\hat{\pi}^{\,i}\big)=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\big(\pi_{\mathrm{cur}}^{\sigma(i)}\,\|\,\pi_{\mathrm{tar}}^{\sigma(i)}\big)\leq\delta_i,\quad\forall i.\tag{12}$$
Consequently, summing Theorem [3.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4) over $i=1,\ldots,n$ yields the stage-wise certificate in Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6).
###### Theorem 3.6 (Stage-wise improvement lower bound).
The improvement is bounded as follows:
$$\begin{split}J(\bar{\pi})-J(\pi_{\mathrm{cur}})\;\geq\;&\sum_{i=1}^{n}L_i^{\mathrm{seq}}\quad(\text{under Eq. (4) }\forall i)\\ &-\frac{\sqrt{2}\gamma A_{\max}}{(1-\gamma)^{2}}\sum_{i=1}^{n}\sqrt{\delta_i}-\frac{1}{1-\gamma}\sum_{i=1}^{n}\zeta_i.\end{split}\tag{13}$$
A high-probability empirical version (accounting for minibatch estimation and ratio clipping) is in Appendix [E.3](https://arxiv.org/html/2605.15207#A5.SS3).
###### Theorem 3.7 (Stage-wise bound under stale-occupancy surrogates).
Using stale surrogates, we have:
$$\begin{split}J(\bar{\pi})-J(\pi_{\mathrm{cur}})\geq\;&\sum_{i=1}^{n}\left(L_i^{\mathrm{stale}}-\frac{\zeta_i}{1-\gamma}\right)\quad(\text{under Eq. (4)})\\ &-\frac{\sqrt{2}\gamma A_{\max}}{(1-\gamma)^{2}}\sum_{i=1}^{n}\left(\sqrt{\delta_i}+\sum_{k<i}\sqrt{\delta_k}\right).\end{split}\tag{14}$$
For $\delta_i\equiv\bar{\delta}$, the stale-occupancy penalty scales as $\Theta(n^{2}\sqrt{\bar{\delta}})$, versus $\Theta(n\sqrt{\bar{\delta}})$ for intermediate-occupancy evaluation.
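To make the quadratic-versus-linear gap concrete, the following sketch evaluates the occupancy-shift penalty terms of Eqs. (13) and (14) for given radii. The function names and the default values of `delta`, `gamma`, and `a_max` are illustrative assumptions, not settings from the paper.

```python
import math

def occupancy_penalties(deltas, gamma, a_max):
    """Return (intermediate, stale) occupancy-shift penalty totals from Eqs. (13) and (14)."""
    coeff = math.sqrt(2) * gamma * a_max / (1 - gamma) ** 2
    roots = [math.sqrt(d) for d in deltas]
    # Intermediate-occupancy evaluation: one sqrt(delta_i) term per step.
    intermediate = coeff * sum(roots)
    # Stale evaluation adds, at step i, the drift accumulated over all steps k < i.
    stale = coeff * sum(roots[i] + sum(roots[:i]) for i in range(len(roots)))
    return intermediate, stale

def penalty_ratio(n, delta=0.01, gamma=0.9, a_max=1.0):
    """Ratio of stale to intermediate penalty for n agents with equal radii."""
    intermediate, stale = occupancy_penalties([delta] * n, gamma, a_max)
    return stale / intermediate

for n in (2, 4, 8):
    print(n, penalty_ratio(n))
```

With equal radii the inner sums give $n(n+1)/2$ terms for the stale bound against $n$ for the intermediate one, so the ratio is $(n+1)/2$ regardless of the other constants.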
### 3.5 Plug-and-play upgrades
Because the trust region is defined per agent and evaluated on the team's occupancy, plug-and-play replacement can be handled by aligning the new component within a trust region around the replaced agent and then resuming intermediate-occupancy updates. The proofs are detailed in Appendix [D.1](https://arxiv.org/html/2605.15207#A4.SS1).
###### Proposition 3.8 (Certified resumability after replacement).
Suppose agent $j$ is replaced by a new parameterization and the subsequent TeamTR updates satisfy Eq. ([4](https://arxiv.org/html/2605.15207#S3.E4)). Then Theorems [3.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4)–[3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) continue to hold for those subsequent updates. This does not guarantee that the replacement step itself is non-decreasing in $J$; it only states that subsequent TeamTR updates preserve the same lower bound. Details are provided in Appendix [D.2](https://arxiv.org/html/2605.15207#A4.SS2).
###### Proposition 3.9 (Certificate tightening).
If an upgraded agent achieves a higher surrogate value within the same radius, or the same surrogate value with a smaller radius, then the lower bound in Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) does not decrease.
## 4 Algorithm
We instantiate TeamTR as a stage-wise training loop that implements the theoretical framework in Sec. [3](https://arxiv.org/html/2605.15207#S3). Each within-stage update resamples rollouts under the current intermediate team, optimizes a clipped surrogate objective, and enforces a per-agent trust region via a decomposable KL constraint. Algorithm [1](https://arxiv.org/html/2605.15207#alg1) summarizes the procedure.
Algorithm 1 TeamTR: Stage-wise Sequential Fine-tuning

Input: Team $\pi_{\mathrm{cur}}=\{\pi^{(j)}\}_{j=1}^{n}$, trust-region radii $\{\delta_i\}$, prompt distribution $\mathcal{D}$, group size $G$, router $\mathcal{R}$

Output: Updated team $\pi_{\mathrm{cur}}$

1: for stage $k=1,2,\ldots$ do

2: Sample batch $\mathcal{B}\subset\mathcal{D}$; choose update order $\sigma$

3: $\hat{\pi}^{0}\leftarrow\pi_{\mathrm{cur}}$

4: for $i=1$ to $n$ do

5: Collect rollouts under $\hat{\pi}^{\,i-1}$ with router $\mathcal{R}$

6: Compute advantages $\tilde{A}$ via Eq. ([15](https://arxiv.org/html/2605.15207#S4.E15))

7: Update $\pi^{(\sigma(i))}$ via Eq. ([17](https://arxiv.org/html/2605.15207#S4.E17)) until $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}\leq\delta_i$

8: $\hat{\pi}^{\,i}\leftarrow\hat{\pi}^{\,i-1}[\sigma(i)\leftarrow\pi^{(\sigma(i))}_{\mathrm{new}}]$

9: end for

10: $\pi_{\mathrm{cur}}\leftarrow\hat{\pi}^{\,n}$

11: end for
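The control flow of one stage in Algorithm 1 can be sketched as below. This is a structural illustration only: `run_stage`, `collect_rollouts`, and `update_agent` are hypothetical names, policies are stood in for by strings, and the stub functions do no learning. The point it shows is that rollouts are recollected under the partially updated team before every agent update.

```python
from typing import Callable, List, Sequence

Policy = str  # placeholder type: a real agent would be an LLM policy object

def run_stage(
    team: List[Policy],
    order: Sequence[int],
    radii: Sequence[float],
    collect_rollouts: Callable[[List[Policy]], object],
    update_agent: Callable[[Policy, object, float], Policy],
) -> List[Policy]:
    """One TeamTR-style stage: resample under the intermediate team before every update."""
    current = list(team)  # plays the role of pi_hat^0
    for step, agent_idx in enumerate(order):
        # Fresh data under the intermediate team pi_hat^{i-1}.
        rollouts = collect_rollouts(current)
        # Update a single agent; the callback is expected to respect radius delta_i.
        current[agent_idx] = update_agent(current[agent_idx], rollouts, radii[step])
    return current  # pi_hat^n, which becomes the next stage's team

# Stub usage: record which team generated each rollout batch.
seen = []

def fake_collect(team):
    seen.append(tuple(team))
    return None

def fake_update(policy, rollouts, radius):
    return policy + "'"  # mark the policy as updated

final_team = run_stage(["a", "b", "c"], [2, 0, 1], [0.01] * 3, fake_collect, fake_update)
print(seen)
print(final_team)
```

A naive cached variant would call `collect_rollouts` once before the loop and reuse that batch for every agent, which corresponds to the stale-occupancy setting analyzed in Sec. 3.4.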
##### Group-normalized advantages.
We use a sequence-level REINFORCE signal with group normalization. For each prompt, we sample $G$ rollouts and compute
$$\begin{split}\tilde{A}_g&\coloneqq\mathrm{clip}\!\left(\frac{R_g-\mu}{\sigma},-A_{\mathrm{clip}},A_{\mathrm{clip}}\right),\\ \mu&=\frac{1}{G}\sum_{g=1}^{G}R_g,\quad\sigma=\sqrt{\frac{1}{G}\sum_{g=1}^{G}(R_g-\mu)^{2}+\epsilon_{\mathrm{norm}}},\end{split}\tag{15}$$
where $R_g$ is the discounted episode return. The scalar $\tilde{A}_g$ is applied to all message log-probabilities within the trajectory. Bias from normalization and clipping is absorbed into $\zeta_i$.
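Eq. (15) translates almost directly into code. The sketch below is a plain-Python illustration; the function name and the default values for `a_clip` and `eps_norm` are assumptions chosen for the example, not the paper's hyperparameters.

```python
import math
from typing import List

def group_normalized_advantages(
    returns: List[float], a_clip: float = 5.0, eps_norm: float = 1e-8
) -> List[float]:
    """Group-normalized, clipped advantages for the G rollouts of one prompt (Eq. 15)."""
    g = len(returns)
    mu = sum(returns) / g
    # eps_norm sits inside the square root, as in Eq. (15), so sigma is never zero.
    sigma = math.sqrt(sum((r - mu) ** 2 for r in returns) / g + eps_norm)
    return [max(-a_clip, min(a_clip, (r - mu) / sigma)) for r in returns]

advs = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
print(advs)
```

When every rollout in a group receives the same return, all advantages are zero, so that prompt contributes no gradient signal in the update.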
##### Per-agent update with trust-region control.
At step $i$, we update agent $j=\sigma(i)$ using rollouts from $\hat{\pi}^{\,i-1}$. The likelihood ratio for a message $m$ emitted by agent $j$ at state $s$ is:
$$w(m;s)\coloneqq\frac{\pi^{(j)}_{\mathrm{new}}(m\mid s)}{\pi^{(j)}_{\mathrm{cur}}(m\mid s)}=\prod_{u=1}^{|m|}\frac{\pi^{(j)}_{\mathrm{new}}(x_u\mid s,x_{<u})}{\pi^{(j)}_{\mathrm{cur}}(x_u\mid s,x_{<u})}.\tag{16}$$
We optimize the following objective with an adaptive KL penalty, where $\operatorname{clip}_{\epsilon}(w)\triangleq\operatorname{clip}(w,1-\epsilon,1+\epsilon)$:
$$\mathcal{L}^{(j)}=\mathbb{E}_{m\sim\hat{\pi}^{\,i-1}}\Big[\min\big(w\tilde{A},\;\operatorname{clip}_{\epsilon}(w)\tilde{A}\big)\Big]-\beta\,\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}.\tag{17}$$
Here the expectation is taken over messages emitted by agent $j$ in the rollout batch under $\hat{\pi}^{\,i-1}$, $\epsilon$ is the ratio clip threshold, and $\beta$ is adjusted to satisfy the trust-region constraint.
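A minimal numeric sketch of Eqs. (16) and (17) follows, assuming per-token log-probabilities are already available as plain lists. The function names and the defaults `eps=0.2` and `beta=0.1` are illustrative assumptions; a real implementation would operate on tensors and differentiate through the new policy's log-probabilities.

```python
import math
from typing import List, Sequence

def message_log_ratio(new_logps: Sequence[float], cur_logps: Sequence[float]) -> float:
    """log w(m; s): sum of per-token log-probability differences (log of Eq. 16)."""
    return sum(n - c for n, c in zip(new_logps, cur_logps))

def clipped_objective(
    log_ratios: List[float],
    advantages: List[float],
    kl_estimate: float,
    eps: float = 0.2,
    beta: float = 0.1,
) -> float:
    """Batch average of min(w*A, clip(w)*A) minus beta times the KL estimate (Eq. 17)."""
    total = 0.0
    for log_w, adv in zip(log_ratios, advantages):
        w = math.exp(log_w)
        w_clipped = max(1.0 - eps, min(1.0 + eps, w))
        # The min makes the surrogate pessimistic: gains from large ratios are capped,
        # while losses from large ratios are kept.
        total += min(w * adv, w_clipped * adv)
    return total / len(log_ratios) - beta * kl_estimate

log_w = message_log_ratio([-1.0, -2.0], [-1.5, -2.5])
print(log_w, clipped_objective([log_w], [1.0], kl_estimate=0.05))
```

Working with the summed log-ratio rather than the raw product in Eq. (16) is a common way to avoid numerical underflow on long messages.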
##### Token-decomposed KL monitoring.
Let $\mathcal{M}_j$ denote the messages emitted by agent $j$ in the rollout batch. We monitor the sampled forward KL, and either tune $\beta$ adaptively or stop early when $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}$ approaches $\delta_i$:
$$\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}\coloneqq\frac{1}{|\mathcal{M}_j|}\sum_{m\in\mathcal{M}_j}\sum_{u=1}^{|m|}\log\frac{\pi^{(j)}_{\mathrm{cur}}(x_u\mid s,x_{<u})}{\pi^{(j)}_{\mathrm{new}}(x_u\mid s,x_{<u})}\leq\delta_i.\tag{18}$$
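The estimator in Eq. (18) needs only the log-probabilities of the sampled tokens under the current and updated policies. The sketch below assumes these are given as paired lists per message; the `should_stop` helper and its `margin` parameter are hypothetical, illustrating one simple way to stop "when the estimate approaches $\delta_i$".

```python
from typing import Sequence, Tuple

Message = Tuple[Sequence[float], Sequence[float]]  # (cur_logps, new_logps) per token

def sampled_token_kl(messages: Sequence[Message]) -> float:
    """Eq. (18): mean over messages of the summed per-token log(pi_cur / pi_new)."""
    total = 0.0
    for cur_logps, new_logps in messages:
        total += sum(c - n for c, n in zip(cur_logps, new_logps))
    return total / len(messages)

def should_stop(kl_hat: float, delta: float, margin: float = 0.9) -> bool:
    """Hypothetical early-stop rule: halt once the estimate reaches margin * delta."""
    return kl_hat >= margin * delta

batch = [([-1.0, -2.0], [-1.1, -2.1]), ([-0.5], [-0.5])]
kl_hat = sampled_token_kl(batch)
print(kl_hat, should_stop(kl_hat, delta=0.1))
```

Because this is a single-sample estimate of each token-level KL, it can be negative on a finite batch even though the true KL is non-negative.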
##### Agent plug-and-play replacement.
For agent replacement (Sec. [3.5](https://arxiv.org/html/2605.15207#S3.SS5)), we first align the new agent to satisfy the trust-region constraint on a probe set sampled from the current team's occupancy, then resume standard TeamTR updates. Details are provided in Appendix [D](https://arxiv.org/html/2605.15207#A4).
## 5 Experiments
We evaluate TeamTR to validate the theory in Sec\.[3](https://arxiv.org/html/2605.15207#S3)\. We first report end\-task performance under matched budgets, then diagnose the within\-stage staleness predicted by Proposition[3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3), analyze training stability and certificate tracking under the monitored token\-KL trust region, and finally test plug\-and\-play component replacement\. Additional experiments and analyses are deferred to Appendix[G](https://arxiv.org/html/2605.15207#A7), including scaling with team size, compute, and wall\-clock time, comparisons with parameter\-matched single models, inference\-time sampling, and selected ablation studies\.
### 5.1 Setup
#### 5.1.1 Models and Datasets.
We fine\-tune LLMs with 1\.7B\-8B parameters, including Qwen2\.5 \(1\.5B/3B/7B\-Instruct\)\(Qwenet al\.,[2025](https://arxiv.org/html/2605.15207#bib.bib5)\), Qwen3 \(1\.7B/4B/8B\-Instruct\)\(Yanget al\.,[2025](https://arxiv.org/html/2605.15207#bib.bib6)\), and LLaMA\-3\.3 \(3B\-Instruct\)\(Grattafioriet al\.,[2024](https://arxiv.org/html/2605.15207#bib.bib4)\)\. For reference, we also compare against larger models in the 30B\-72B range, including Qwen2\.5 \(32B/72B\-Instruct\), Qwen3 \(30B\-A3B, 32B\), LLaMA\-3\.3 \(70B\-Instruct\), and QwQ\(Team,[2025](https://arxiv.org/html/2605.15207#bib.bib9)\), DeepSeek\-R1\(Guoet al\.,[2025](https://arxiv.org/html/2605.15207#bib.bib8)\), GPT\-o4\-mini\. We evaluate on six benchmarks spanning mathematical reasoning \(AIME 2024, AIME 2025, MATH\-500\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.15207#bib.bib23)\)\), logical reasoning \(ZebraLogic\(Linet al\.,[2025](https://arxiv.org/html/2605.15207#bib.bib22)\), AutoLogi\(Zhuet al\.,[2025](https://arxiv.org/html/2605.15207#bib.bib49)\)\), and active reasoning \(ARBench\(Zhouet al\.,[2025](https://arxiv.org/html/2605.15207#bib.bib20)\)and PlanBench\(Valmeekamet al\.,[2023](https://arxiv.org/html/2605.15207#bib.bib19)\)\)\. Details of benchmarks are provided in Appendix[H](https://arxiv.org/html/2605.15207#A8)\.
#### 5\.1\.2Training and Evaluation Protocol\.
We implement training with VeRL \(Sheng et al\., [2025](https://arxiv.org/html/2605.15207#bib.bib3)\), using a temperature of 0\.8, a top\-p of 1\.0, and a maximum output length of 32,768 tokens, unless otherwise specified\. For math reasoning, we use training data from DeepScaleR and DAPO \(Luo et al\., [2025](https://arxiv.org/html/2605.15207#bib.bib2); Yu et al\., [2025](https://arxiv.org/html/2605.15207#bib.bib18)\); for ARBench and PlanBench, we use their official training sets\. For ZebraLogic, which is comparably challenging to MATH and lacks dedicated training or development sets, we train for each iteration using the same number of epochs that achieved the best performance on MATH \(Prasad et al\., [2024](https://arxiv.org/html/2605.15207#bib.bib55)\)\. Following standard practice for high\-variance reasoning tasks, we sample multiple solutions per instance and report both pass@K \(the proportion of instances with at least one correct solution among K samples\) and avg@K \(the average correctness across K samples\)\. We set K=64 for AIME and ZebraLogic, K=4 for MATH\-500, K=25 for ARBench, and K=8 for PlanBench\. Mathematical verification uses the DeepScaleR verifier; external baseline numbers are reported as citations when verifiers may differ\. Within each benchmark, we match the rollout and sampling budgets\.
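As a concrete reading of the two metrics, the sketch below computes pass@K and avg@K from per\-instance lists of verifier outcomes\. The function names and the toy `outcomes` data are illustrative assumptions and are not taken from the released code\.

```python
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of instances with at least one correct sample among the K drawn."""
    return sum(any(samples) for samples in results) / len(results)


def avg_at_k(results: Sequence[Sequence[bool]]) -> float:
    """Per-instance mean correctness over K samples, averaged over instances."""
    return sum(sum(samples) / len(samples) for samples in results) / len(results)


# Hypothetical verifier outcomes: 3 instances, K=4 samples each.
outcomes = [
    [True, False, False, True],
    [False, False, False, False],
    [True, True, True, True],
]
print(f"pass@4 = {pass_at_k(outcomes):.3f}")
print(f"avg@4  = {avg_at_k(outcomes):.3f}")
```

pass@K rewards a single lucky sample, whereas avg@K reflects how reliably each sample is correct, which is why the two are reported together\.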
#### 5\.1\.3Compared Methods\.
We compare TeamTR against baselines spanning inference\-time prompting, single\-agent fine\-tuning, multi\-agent coordination, and strong reasoning systems\. Inference\-time baselines include Chain\-of\-Thought prompting \(Wei et al\., [2022](https://arxiv.org/html/2605.15207#bib.bib11)\), Self\-Consistency with majority voting \(Wang et al\., [2022](https://arxiv.org/html/2605.15207#bib.bib10)\), and Tree\-of\-Thoughts \(Yao et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib17)\)\. Single\-agent fine\-tuning baselines include PPO \(Schulman et al\., [2017](https://arxiv.org/html/2605.15207#bib.bib59)\), GRPO \(Shao et al\., [2024](https://arxiv.org/html/2605.15207#bib.bib24)\), and DAPO \(Yu et al\., [2025](https://arxiv.org/html/2605.15207#bib.bib18)\), all trained under the same budget as the per\-agent allocation in TeamTR\. Multi\-agent baselines include debate with and without a judge \(Du et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib34)\) and naive sequential fine\-tuning, which reuses stage\-start rollouts for all within\-stage updates; the latter tests whether intermediate\-occupancy evaluation mitigates the compounding degradation in Proposition [3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3)\.
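Table 1: Performance comparison across reasoning and planning benchmarks \(%\)\. Values in parentheses give the difference relative to TeamTR \(3×8B\)\. Column groups: General Reasoning \(AIME24, AIME25, MATH\-500, ZebraLogic, AutoLogi\), Active Reasoning \(DC, SP, GN\), and Planning \(PlanBench\)\.

| Method | Size | AIME24 | AIME25 | MATH\-500 | ZebraLogic | AutoLogi | DC | SP | GN | PlanBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Proprietary Models* | | | | | | | | | | |
| GPT\-4o\-mini | – | 8\.1 \(\-80\.0\) | 8\.8 \(\-69\.5\) | 78\.2 \(\-21\.1\) | 20\.1 \(\-74\.6\) | 62\.5 \(\-28\.0\) | 44\.0 \(\-2\.7\) | 40\.8 \(\-2\.3\) | 43\.6 \(\-1\.7\) | 34\.7 \(\-9\.4\) |
| o4\-mini | – | 79\.6 \(\-8\.5\) | 74\.8 \(\-3\.5\) | 98\.0 \(\-1\.3\) | 88\.9 \(\-5\.8\) | 86\.3 \(\-4\.2\) | 46\.7 | 43\.1 | 45\.1 \(\-0\.2\) | 41\.2 \(\-2\.9\) |
| *Open\-Source Thinking Models* | | | | | | | | | | |
| Qwen3\-A3B | 30B | 80\.4 \(\-7\.7\) | 70\.9 \(\-7\.4\) | 98\.0 \(\-1\.3\) | 89\.5 \(\-5\.2\) | 88\.1 \(\-2\.4\) | 43\.1 \(\-3\.6\) | 38\.4 \(\-4\.7\) | 42\.3 \(\-3\.0\) | 39\.1 \(\-5\.0\) |
| QwQ | 32B | 79\.5 \(\-8\.6\) | 69\.5 \(\-8\.8\) | 98\.0 \(\-1\.3\) | 76\.8 \(\-17\.9\) | 86\.3 \(\-4\.2\) | 41\.9 \(\-4\.8\) | 36\.1 \(\-7\.0\) | 39\.5 \(\-5\.8\) | 37\.9 \(\-6\.2\) |
| Qwen3 \(thinking\) | 32B | 81\.4 \(\-6\.7\) | 72\.9 \(\-5\.4\) | 97\.2 \(\-2\.1\) | 88\.8 \(\-5\.9\) | 87\.3 \(\-3\.2\) | 42\.7 \(\-4\.0\) | 38\.9 \(\-4\.2\) | 41\.1 \(\-4\.2\) | 40\.3 \(\-3\.8\) |
| DeepSeek\-R1 Distill | 70B | 70\.0 \(\-18\.1\) | 56\.3 \(\-22\.0\) | 94\.5 \(\-4\.8\) | 71\.3 \(\-23\.4\) | 83\.5 \(\-7\.0\) | 42\.1 \(\-4\.6\) | 35\.4 \(\-7\.7\) | 40\.7 \(\-4\.6\) | 39\.1 \(\-5\.0\) |
| *Single\-Agent Fine\-tuning \(Qwen3\-8B\)* | | | | | | | | | | |
| GRPO | 8B | 39\.1 \(\-49\.0\) | 27\.9 \(\-50\.4\) | 90\.7 \(\-8\.6\) | 35\.4 \(\-59\.3\) | 80\.3 \(\-10\.2\) | 40\.1 \(\-6\.6\) | 35\.2 \(\-7\.9\) | 38\.9 \(\-6\.4\) | 33\.1 \(\-11\.0\) |
| DAPO | 8B | 41\.3 \(\-46\.8\) | 30\.1 \(\-48\.2\) | 91\.5 \(\-7\.8\) | 36\.7 \(\-58\.0\) | 80\.9 \(\-9\.6\) | 40\.7 \(\-6\.0\) | 35\.9 \(\-7\.2\) | 39\.3 \(\-6\.0\) | 33\.5 \(\-10\.6\) |
| *Multi\-Agent Fine\-tuning \(3×Qwen3\-8B\)* | | | | | | | | | | |
| Debate | 3×8B | 68\.7 \(\-19\.4\) | 59\.1 \(\-19\.2\) | 95\.2 \(\-4\.1\) | 84\.1 \(\-10\.6\) | 85\.1 \(\-5\.4\) | 43\.1 \(\-3\.6\) | 38\.0 \(\-5\.1\) | 42\.3 \(\-3\.0\) | 38\.3 \(\-5\.8\) |
| Debate \+ Judge | 3×8B | 71\.1 \(\-17\.0\) | 61\.7 \(\-16\.6\) | 95\.9 \(\-3\.4\) | 86\.3 \(\-8\.4\) | 85\.9 \(\-4\.6\) | 44\.3 \(\-2\.4\) | 39\.4 \(\-3\.7\) | 43\.5 \(\-1\.8\) | 39\.1 \(\-5\.0\) |
| Role\-play | 3×8B | 69\.9 \(\-18\.2\) | 60\.5 \(\-17\.8\) | 95\.7 \(\-3\.6\) | 85\.1 \(\-9\.6\) | 85\.5 \(\-5\.0\) | 43\.7 \(\-3\.0\) | 38\.7 \(\-4\.4\) | 43\.1 \(\-2\.2\) | 38\.7 \(\-5\.4\) |
| *TeamTR \(Ours\)* | | | | | | | | | | |
| Homogeneous | 3×8B | 88\.1 | 78\.3 | 99\.3 | 94\.7 | 90\.5 | 46\.7 | 43\.1 | 45\.3 | 44\.1 |
| Heterogeneous | 1\.7B\+8B\+14B | 89\.7 \(\+1\.6\) | 80\.1 \(\+1\.8\) | 98\.1 \(\-1\.2\) | 96\.3 \(\+1\.6\) | 88\.7 \(\-1\.8\) | 45\.5 \(\-1\.2\) | 42\.3 \(\-0\.8\) | 44\.5 \(\-0\.8\) | 44\.5 \(\+0\.4\) |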
#### 5\.1\.4TeamTR Setup\.
We use a team of three agents with a fixed round\-robin router\. The trust\-region radius $\delta_i$ controls the trade\-off between update magnitude and certificate tightness \(Theorem [3\.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4)\): smaller values yield tighter bounds but slower learning\. We set $\delta_i = 0.01$ based on preliminary experiments; the sensitivity analysis in Sec\. [5\.6](https://arxiv.org/html/2605.15207#S5.SS6) shows that performance is stable across $\delta_i \in [0.005, 0.02]$\. The KL penalty coefficient $\beta$ in Eq\. \([17](https://arxiv.org/html/2605.15207#S4.E17)\) is tuned adaptively to satisfy the constraint\. All agents share the same base architecture but have separate parameter sets\. For plug\-and\-play experiments, Stage\-0 alignment uses 500 probe contexts from the current team’s rollout distribution\.
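The adaptation rule for $\beta$ is not spelled out in this section\. One common choice, shown here purely as an assumption, is the multiplicative adaptive\-KL rule from PPO \(Schulman et al\., 2017\): increase $\beta$ when the measured KL overshoots the target and decrease it when the KL is well below the target\. The function name, tolerance, scaling factor, and clipping bounds below are hypothetical\.

```python
def update_beta(
    beta: float,
    measured_kl: float,
    delta: float = 0.01,    # trust-region radius used as the KL target
    tol: float = 1.5,       # dead-band around the target (assumed value)
    factor: float = 2.0,    # multiplicative step (assumed value)
    beta_min: float = 1e-4,
    beta_max: float = 1e2,
) -> float:
    """PPO-style adaptive KL coefficient; a sketch, not the TeamTR rule."""
    if measured_kl > tol * delta:
        beta *= factor      # update moved too far: penalize KL more strongly
    elif measured_kl < delta / tol:
        beta /= factor      # update is overly conservative: relax the penalty
    return min(max(beta, beta_min), beta_max)


beta = 1.0
for kl in [0.020, 0.018, 0.009, 0.004]:  # hypothetical monitored token-KL values
    beta = update_beta(beta, kl)
    print(f"kl={kl:.3f} -> beta={beta}")
```

The dead\-band keeps $\beta$ from oscillating when the measured KL sits close to $\delta_i$\.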
### 5\.2Main Results
Table [1](https://arxiv.org/html/2605.15207#S5.T1) reports performance across general reasoning, active reasoning, and planning benchmarks, alongside single\-agent fine\-tuning, multi\-agent fine\-tuning, and strong thinking\-LLM baselines\. Across benchmarks, TeamTR outperforms sequential fine\-tuning baselines instantiated with PPO/GRPO/DAPO and multi\-agent fine\-tuning baselines \(debate and role\-play\) under the same team budget\. The heterogeneous configuration \(1\.7B\+8B\+14B\) excels on challenging benchmarks \(AIME, ZebraLogic\), whereas the homogeneous configuration \(3×8B\) performs better on simpler tasks \(MATH\-500, AutoLogi\), where consistent capacity across agents is beneficial\.
### 5\.3Verifying Compounding Occupancy Shift
Section[3](https://arxiv.org/html/2605.15207#S3)predicts that, under within\-stage sequential updates, evaluating later updates on stage\-start rollouts can introduce a growing mismatch between the stage\-start occupancy and the current intermediate occupancy\. We quantify this effect by measuring the surrogate disagreement between rollouts collected at stage start and rollouts collected under the current intermediate team within the same training stage\.
For a training stage $k$ and within\-stage update index $i$, let $\widehat{L}^{\mathrm{stale}}_{i}$ denote the surrogate evaluated using rollouts from the stage\-start team \($d^{\hat{\pi}^{0}}$\), and let $\widehat{L}^{\mathrm{inter}}_{i}$ denote the surrogate evaluated using rollouts from the intermediate team before the $i$\-th update \($d^{\hat{\pi}^{i-1}}$\)\. We define the stale\-occupancy gap as $\mathrm{Gap}_{i} \coloneqq \big|\widehat{L}^{\mathrm{stale}}_{i} - \widehat{L}^{\mathrm{inter}}_{i}\big|$\. Figure [2](https://arxiv.org/html/2605.15207#S5.F2) reports $\mathrm{Gap}_{i}$ across training stages for $i = 2$ and $i = 3$\. For baselines that reuse stage\-start rollouts, the gap is systematically larger for the later within\-stage update \($i = 3$\) than for $i = 2$, indicating that stage\-start rollouts become increasingly misaligned after earlier updates are applied\. This trend is consistent with the compounding mismatch implied by Remark [3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3)\. TeamTR avoids optimizing on stale rollouts by resampling under the intermediate team before each update, and empirically maintains a much smaller surrogate disagreement\.
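The gap diagnostic can be sketched in a few lines\. The code assumes, for illustration only, that the surrogate is a sample mean of importance ratio times advantage; the exact surrogate is the one defined earlier in the paper, and the batch values below are hypothetical\.

```python
from statistics import fmean


def surrogate_estimate(ratios, advantages):
    """Sample mean of importance ratio times advantage over one rollout batch."""
    return fmean([r * a for r, a in zip(ratios, advantages)])


def stale_occupancy_gap(stale_batch, inter_batch):
    """|L_stale - L_inter| for one within-stage update index i."""
    return abs(surrogate_estimate(*stale_batch) - surrogate_estimate(*inter_batch))


# Hypothetical (ratios, advantages) pairs for the same candidate update.
stale_batch = ([1.0, 1.2, 0.8], [0.5, -0.5, 1.0])  # rollouts from the stage-start team
inter_batch = ([1.0, 1.0, 1.0], [0.5, 0.5, 0.5])   # rollouts from the intermediate team
gap = stale_occupancy_gap(stale_batch, inter_batch)
print(f"Gap_i = {gap:.4f}")
```

Logging this quantity for each update index $i$ within a stage gives curves of the kind plotted in Figure 2\.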
Figure 2: Stale\-occupancy gap within a training stage\. We plot the occupancy gap across training stages $k$ for within\-stage update indices $i = 2$ and $i = 3$\. For baselines that reuse stage\-start rollouts, the gap is larger for the later update \($i = 3$\), consistent with Remark [3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3)\.
### 5\.4Scaling, Ablations, and Compute Trade\-offs
Proposition [3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3) predicts a gap between stale\- and intermediate\-occupancy evaluation that grows with team size\. We examine this prediction beyond the default three\-agent setting and summarize the results in Table [2](https://arxiv.org/html/2605.15207#S5.T2)\. On MATH\-500 with homogeneous Qwen3\-1\.7B teams, stale sequential updates exhibit near\-quadratic growth in both the stale surrogate gap and empirical occupancy drift, with fitted exponents 1\.94 and 1\.91, respectively\. Under TeamTR, the corresponding exponents are 1\.07 and 1\.05, matching the expected near\-linear behavior\. The scaling difference is also reflected in end\-task performance: at $n = 8$, TeamTR reaches 87\.9% accuracy, while naive sequential training drops to 58\.7%\.
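A scaling exponent of this kind is typically obtained as the slope of a log\-log least\-squares fit\. The sketch below illustrates that procedure on synthetic data; the fitting method used in the paper is not stated in this section, so both the function and the data are assumptions\.

```python
import math


def fit_power_law_exponent(team_sizes, values):
    """Least-squares slope of log(value) vs. log(n), i.e. alpha in value ~ c * n**alpha."""
    xs = [math.log(n) for n in team_sizes]
    ys = [math.log(v) for v in values]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    return cov / var


# Synthetic gaps, not measurements: exactly quadratic and exactly linear in n.
sizes = [2, 3, 4, 6, 8]
quadratic_gap = [0.01 * n ** 2 for n in sizes]
linear_gap = [0.02 * n for n in sizes]
print(f"alpha (quadratic data) = {fit_power_law_exponent(sizes, quadratic_gap):.2f}")
print(f"alpha (linear data)    = {fit_power_law_exponent(sizes, linear_gap):.2f}")
```

An exponent near 2 indicates the quadratic regime predicted for stale\-occupancy evaluation, and an exponent near 1 indicates the linear regime predicted for intermediate\-occupancy evaluation\.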
Table 2: Scaling behavior, ablations, and overhead\. Panel \(a\) reports team\-size scaling and larger\-model results\. For model\-scale rows, $\Delta_{\mathrm{stale}}$ is reported as AIME24 / MATH\-500\. Panel \(b\) isolates the effects of intermediate\-occupancy resampling and trust\-region control\.

\(a\) Team\-size and model\-scale behavior

| Setting | Method | AIME24 | MATH\-500 | Drift summary |
| --- | --- | --- | --- | --- |
| $n$\-scaling | Naive Seq\. | – | 58\.7 \($n = 8$\) | $\alpha_{\Delta} = 1.94,\ \alpha_{D} = 1.91$ |
| $n$\-scaling | TeamTR | – | 87\.9 \($n = 8$\) | $\alpha_{\Delta} = 1.07,\ \alpha_{D} = 1.05$ |
| 3×8B | Naive Seq\. | 71\.1 | 95\.1 | 0\.31 / 0\.18 |
| 3×8B | TeamTR | 88\.1 | 99\.3 | 0\.08 / 0\.03 |
| 1\.7B\+8B\+14B | TeamTR | 89\.7 | 98\.1 | 0\.07 / 0\.04 |
| 8B\+14B\+32B | Naive Seq\. | 77\.8 | 97\.1 | 0\.29 / 0\.16 |
| 8B\+14B\+32B | TeamTR | 92\.5 | 99\.4 | 0\.06 / 0\.02 |

\(b\) Ablations on AIME24

| Variant | Acc\. \(%\) | $\Delta_{\mathrm{stale}}$ ↓ | Stab\. ↓ | Coord\. \(%\) |
| --- | --- | --- | --- | --- |
| TeamTR \(full\) | 88\.1 ± 1\.2 | 0\.08 | 1\.9 | 89\.1 |
| KL\-penalty only, adaptive $\beta$ | 84\.9 ± 1\.6 | 0\.14 | 2\.7 | 83\.1 |
| KL\-penalty only, fixed $\beta$ | 80\.7 ± 2\.0 | 0\.20 | 3\.5 | 77\.2 |
| Resample every 2 updates | 79\.3 ± 1\.8 | 0\.18 | 2\.8 | 79\.2 |
| No resampling \(= Naive Seq\.\) | 71\.1 ± 2\.8 | 0\.31 | 4\.2 | 71\.5 |
| No trust region \($\delta \to \infty$\) | 68\.3 ± 3\.5 | 0\.42 | 6\.1 | 62\.3 |
The same occupancy\-shift pattern appears at larger model scales\. For 3×Qwen3\-8B teams, TeamTR improves AIME24 accuracy from 71\.1% to 88\.1% and reduces $\Delta_{\mathrm{stale}}$ from 0\.31 to 0\.08\. For heterogeneous 8B\+14B\+32B teams, TeamTR improves AIME24 accuracy from 77\.8% to 92\.5% and reduces $\Delta_{\mathrm{stale}}$ from 0\.29 to 0\.06\. Thus, stronger individual models do not remove the stale\-occupancy mismatch; controlling intermediate occupancy remains beneficial for larger and heterogeneous teams\.
Table [2](https://arxiv.org/html/2605.15207#S5.T2)\(b\) further separates the two components of TeamTR\. Removing intermediate\-occupancy resampling recovers the naive sequential baseline and leads to a large stale gap\. Resampling every two updates reduces this mismatch, but remains 8\.8 points below full TeamTR on AIME24, indicating that stale drift still accumulates within a stage\. Trust\-region control is also necessary\. An adaptive KL\-penalty\-only variant improves over naive sequential training, but it trails full TeamTR by 3\.2 points and exhibits larger stale gaps and higher instability\. Without any trust\-region constraint, the update becomes the least stable and produces the weakest coordination score\. These ablations suggest that TeamTR’s gains come from the combination of fresh intermediate rollouts and explicit token\-level trust\-region enforcement\.
The additional sampling cost is modest relative to the stability gain\. On AIME24 with 3×Qwen3\-8B over 40 stages, TeamTR uses 158\.3M tokens and 49\.2K rollouts, compared with 142\.5M tokens and 44\.3K rollouts for naive sequential training\. This corresponds to about \+11\.1% overhead in both token and rollout accounting\. The measured per\-update wall\-clock cost is 192\.8s for TeamTR and 172\.0s for naive sequential training, giving a 1\.12× overhead\. KL monitoring contributes 6\.2s per update, so the dominant cost remains rollout generation rather than trust\-region monitoring\.
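The overhead figures follow directly from the reported totals\. The short check below reproduces the arithmetic; the helper name is an illustrative assumption, and the KL\-monitoring share is derived here from the reported 6\.2s and 192\.8s rather than stated in the text\.

```python
def relative_overhead(new: float, base: float) -> float:
    """Fractional increase of `new` over `base`."""
    return new / base - 1.0


token_overhead = relative_overhead(158.3, 142.5)    # millions of tokens
rollout_overhead = relative_overhead(49.2, 44.3)    # thousands of rollouts
wall_clock_ratio = 192.8 / 172.0                    # seconds per update
kl_monitor_share = 6.2 / 192.8                      # fraction of TeamTR update time

print(f"token overhead:    {token_overhead:+.1%}")
print(f"rollout overhead:  {rollout_overhead:+.1%}")
print(f"wall-clock ratio:  {wall_clock_ratio:.2f}x")
print(f"KL monitor share:  {kl_monitor_share:.1%}")
```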
The turn\-taking assumption remains the formal scope of the guarantee\. Lemma [3\.1](https://arxiv.org/html/2605.15207#S3.Thmtheorem1) relies on a single active agent at each decision step, and does not directly cover concurrent generation, overlapping tool calls, or branching blackboard\-style interaction\. The stale\-rollout diagnosis, however, is not restricted to text\-only reasoning\. In a sequential tool\-mediated setting on $\tau$\-bench, TeamTR improves average pass\^1 / pass\^4 \(where pass\^k counts a task as solved only if all k trials succeed\) from 45\.6/13\.8 under naive sequential training to 52\.9/19\.7 with the same 3×8B team\. This suggests that intermediate\-occupancy control remains useful when the shared context includes tool\-mediated state changes, even though the formal theorem is stated for turn\-taking interaction\.
The empirical certificate in Sec\. [5\.6](https://arxiv.org/html/2605.15207#S5.SS6) should be interpreted as a training diagnostic rather than a literal per\-stage guarantee\. It combines logged surrogate gains, monitored KL terms, and a proxy for the estimation\-error penalty $\zeta_i$\. In our logs, the $\widehat{\zeta}$ proxy contributes 18% of the total penalty on average, and the near\-threshold flip rate of the sampled token\-KL monitor is 2\.4% under 50% token\-position subsampling\. The certificate therefore provides a useful conservative signal during training, while its tightness depends on the proxy quality and on the reliability of sampled token\-KL monitoring\.
Figure 3: Training dynamics under matched rollout budgets\. We compare TeamTR against sequential baselines instantiated with PPO/GRPO/DAPO\. Baselines exhibit occasional regressions during training, whereas TeamTR maintains more stable improvement by evaluating each update on intermediate rollouts and enforcing per\-update trust regions\. Shaded regions indicate variation across seeds\.
### 5\.5Training Dynamics
Figure[3](https://arxiv.org/html/2605.15207#S5.F3)shows held\-out accuracy as a function of training steps under matched rollout budgets\. Across benchmarks, TeamTR improves steadily and achieves higher final performance than sequential baselines trained with PPO/GRPO/DAPO\. The baselines exhibit non\-monotonic trajectories with occasional regressions, which are accompanied by large stale\-occupancy gaps measured in Sec\.[5\.3](https://arxiv.org/html/2605.15207#S5.SS3)\. These trends align with Remark[3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3): reusing stage\-start rollouts can lead to compounding mismatch, while intermediate\-occupancy evaluation localizes the occupancy\-shift penalty\.
### 5\.6Trust\-Region Enforcement and Certificate Tracking
Theorems [3\.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4)–[3\.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) lower\-bound improvement under the per\-update trust\-region condition $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} \leq \delta_i$\. We examine whether updates stay within the prescribed KL region and whether the resulting empirical certificate is informative\.
Figure [4](https://arxiv.org/html/2605.15207#S5.F4)\(a\) shows the distribution of per\-update token\-level KL divergences on AIME25, with the trust\-region threshold $\delta$ marked by the red dashed line\. The percentages above each method report the out\-of\-region rate, i\.e\., the fraction of updates with $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} > \delta$\. TeamTR keeps most updates in\-region, whereas PPO/GRPO/DAPO variants exhibit substantially higher out\-of\-region rates\. Figure [4](https://arxiv.org/html/2605.15207#S5.F4)\(b\) compares the cumulative measured improvement across training stages with the cumulative certificate lower bound obtained by plugging logged surrogates and KL terms into Theorem [3\.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6)\. Figure [4](https://arxiv.org/html/2605.15207#S5.F4)\(c\) further plots the per\-stage certificate value against the corresponding empirical improvement, reporting rank correlation and the violation rate \(the fraction of points where the certificate exceeds the measured improvement\)\. On AIME25, the certificate remains conservative while tracking progress with a consistent gap\.
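The three diagnostics named above \(out\-of\-region rate, violation rate, and rank correlation\) are simple to compute from training logs\. The sketch below is a minimal illustration on hypothetical logged values; it uses the closed\-form Spearman formula, which assumes no tied values, and none of the names come from the released code\.

```python
def out_of_region_rate(kl_values, delta):
    """Fraction of updates whose monitored token-KL exceeds the radius delta."""
    return sum(kl > delta for kl in kl_values) / len(kl_values)


def violation_rate(certificates, improvements):
    """Fraction of stages where the certificate exceeds the measured improvement."""
    return sum(c > m for c, m in zip(certificates, improvements)) / len(certificates)


def _ranks(values):
    """1-based ranks in ascending order (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)); no ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(_ranks(xs), _ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Hypothetical logs for four updates / stages.
kls = [0.004, 0.009, 0.012, 0.008]
certs = [0.10, 0.30, 0.20, 0.50]
gains = [0.40, 0.60, 0.15, 0.90]
print(out_of_region_rate(kls, delta=0.01))
print(violation_rate(certs, gains))
print(spearman_rho(certs, gains))
```

A conservative certificate shows a low violation rate, and a useful one additionally shows a high rank correlation with measured improvement\.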
Figure 4: Trust\-region enforcement and certificate tracking \(AIME25\)\. \(a\) Distribution of per\-update token\-level KL divergence; the red dashed line marks the threshold $\delta$, and percentages indicate out\-of\-region rates \($\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} > \delta$\)\. \(b\) Cumulative measured improvement versus certificate lower bound \(Theorem [3\.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6)\); $\rho$ denotes rank correlation\. \(c\) Per\-stage calibration of certificate values against empirical improvements; “viol” indicates the fraction of stages where the bound is violated\.
### 5\.7Token\-Level Analysis of Shift
Our per\-update trust region constrains behavior\-to\-updated divergence through the token\-level decomposition in Eq\. \([3](https://arxiv.org/html/2605.15207#S3.E3)\)\. To localize where probability mass moves, Fig\. [5](https://arxiv.org/html/2605.15207#S5.F5) compares token logits before and after an update at a representative ARBench checkpoint, with tokens ordered by their pre\-update probabilities \(remaining vocabulary aggregated as “other”\)\. In\-region updates \(monitored by $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} \leq \delta$\) primarily modify a small set of high\-probability tokens, resulting in controlled, localized changes\. Out\-of\-region updates \($\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} > \delta$\) induce larger reshuffling among close alternatives and can flip the dominant token, consistent with the instabilities observed in Sec\. [5\.5](https://arxiv.org/html/2605.15207#S5.SS5)\.
Figure 5: Token\-level logit shifts by pre\-update probability rank\. Left: an in\-region update \($\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} \leq \delta$\) produces localized changes on top tokens\. Right: an out\-of\-region update \($\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} > \delta$\) reshuffles probability mass among alternatives\.
### 5\.8Plug\-and\-Play Component Replacement
Proposition [3\.8](https://arxiv.org/html/2605.15207#S3.Thmtheorem8) implies that after replacing an agent, once the new component is aligned to satisfy the per\-agent trust\-region interface, subsequent TeamTR updates continue to admit the same improvement\-certificate form \(the replacement step itself is not certified\)\. We evaluate this plug\-and\-play property by initializing a team with Qwen2\.5\-Instruct \(1\.5B/3B/7B\) and, at stage $k_{\mathrm{swap}} = 20$, replacing the 1\.5B agent with Qwen3\-8B, which requires protocol alignment \(see Appendix [G\.7](https://arxiv.org/html/2605.15207#A7.SS7) for details\)\. We compare three strategies: *Direct swap*, which replaces the agent and immediately resumes training; *Stage\-0 aligned swap*, which first aligns the new agent on 500 probe contexts so that the monitored token\-KL satisfies $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}} \leq \delta$ on the probe distribution before resuming training; and *Retrain\-from\-scratch*, which resets the team after the swap and retrains for the remaining budget\. Figure [6](https://arxiv.org/html/2605.15207#S5.F6) reports scores on AIME24 and ARBench\-DC\. A direct swap causes an immediate performance drop at $k_{\mathrm{swap}}$ \(−18% on AIME24\), followed by a gradual recovery\. In contrast, Stage\-0 alignment largely eliminates the swap shock and enables post\-swap improvement, achieving substantially higher final performance under the same post\-swap budget \(\+27% and \+24% on AIME24 and ARBench\-DC, respectively\)\. Retraining from scratch avoids reusing pre\-swap components but is less efficient: within the constrained budget, it remains below the aligned\-swap trajectory\.
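The Stage\-0 acceptance test can be pictured as a gate on a sampled token\-KL estimate over the probe contexts\. The sketch below assumes a Monte Carlo estimate of KL\(reference ‖ candidate\) from tokens sampled under the reference policy, averaged over probe contexts; the KL direction, the averaging scheme, and all names are assumptions for illustration, with details of the actual procedure deferred to Appendix G\.7\.

```python
from statistics import fmean


def sampled_token_kl(logp_ref, logp_new):
    """Monte Carlo KL(ref || new) from log-probs of tokens sampled under ref."""
    return fmean([r - n for r, n in zip(logp_ref, logp_new)])


def stage0_aligned(probe_logps, delta=0.01):
    """True if the mean sampled token-KL over probe contexts is within delta.

    probe_logps: list of (logp_ref, logp_new) pairs, one pair per probe context,
    each holding per-token log-probabilities of the same sampled tokens.
    """
    per_context = [sampled_token_kl(ref, new) for ref, new in probe_logps]
    return fmean(per_context) <= delta


# Hypothetical log-probabilities for two probe contexts.
aligned_probes = [([-1.0, -2.0], [-1.0, -2.01]), ([-0.5], [-0.5])]
misaligned_probes = [([-1.0], [-3.0])]
print(stage0_aligned(aligned_probes))     # small divergence on the probes
print(stage0_aligned(misaligned_probes))  # large divergence on the probes
```

In the experiments above the probe set has 500 contexts; training would resume only once the gate passes\.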
Figure 6: Plug\-and\-play component replacement\. A Qwen2\.5\-Instruct \(1\.5B/3B/7B\) team is trained for 20 stages, after which the 1\.5B agent is replaced with Qwen3\-8B\. Stage\-0 alignment mitigates the swap shock and achieves the best performance\.
## 6Conclusion and Limitations
We analyzed sequential fine\-tuning of shared\-context LLM teams and identified *compounding occupancy shift*: later component updates can be optimized or evaluated under stale rollouts, yielding a certificate penalty that scales quadratically with team size\. TeamTR addresses this by evaluating each update under the partially updated team’s occupancy and enforcing per\-agent trust regions via a token\-decomposable behavior\-to\-updated divergence monitored by a sampled estimator\. This yields lower bounds on per\-update and per\-stage improvement and improves coordination stability and accuracy in practice, including plug\-and\-play component replacement with a brief alignment step\. Limitations: Our current analysis targets a single\-active\-agent protocol, and we rely on per\-update resampling to avoid long\-horizon importance weighting\.
## Impact Statement
This paper presents work aimed at advancing the field of machine learning\. There are many potential societal consequences of our work, none of which we consider necessary to highlight here\.
## References
- S\. Casper, X\. Davies, C\. Shi, T\. K\. Gilbert, J\. Scheurer, J\. Rando, R\. Freedman, T\. Korbak, D\. Lindner, P\. Freire, et al\. \(2023\) Open problems and fundamental limitations of reinforcement learning from human feedback\. arXiv preprint arXiv:2307\.15217\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Cemri, M\. Z\. Pan, S\. Yang, L\. A\. Agrawal, B\. Chopra, R\. Tiwari, K\. Keutzer, A\. Parameswaran, D\. Klein, K\. Ramchandran, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica \(2025\) Why do multi\-agent llm systems fail? arXiv preprint arXiv:2503\.13657\. External Links: 2503\.13657, [Document](https://dx.doi.org/10.48550/arXiv.2503.13657), [Link](https://arxiv.org/abs/2503.13657)\. Cited by: [§1](https://arxiv.org/html/2605.15207#S1.p1.1)\.
- B\. Chen, C\. Shu, E\. Shareghi, N\. Collier, K\. Narasimhan, and S\. Yao \(2023\) Fireact: toward language agent fine\-tuning\. arXiv preprint arXiv:2310\.05915\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Du, S\. Li, A\. Torralba, J\. B\. Tenenbaum, and I\. Mordatch \(2023\) Improving factuality and reasoning in language models through multiagent debate\. In Forty\-first International Conference on Machine Learning\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.15207#S1.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1), [§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- L\. Espeholt, H\. Soyer, R\. Munos, K\. Simonyan, V\. Mnih, T\. Ward, Y\. Doron, V\. Firoiu, T\. Harley, I\. Dunning, et al\. \(2018\) Impala: scalable distributed deep\-rl with importance weighted actor\-learner architectures\. In International conference on machine learning, pp\. 1407–1416\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1)\.
- J\. Foerster, N\. Nardelli, G\. Farquhar, T\. Afouras, P\. H\. Torr, P\. Kohli, and S\. Whiteson \(2017\) Stabilising experience replay for deep multi\-agent reinforcement learning\. In International conference on machine learning, pp\. 1146–1155\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1), [§1](https://arxiv.org/html/2605.15207#S1.p2.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px3.p1.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\. Cited by: [§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\) Measuring mathematical problem solving with the math dataset\. arXiv preprint arXiv:2103\.03874\. Cited by: [Appendix H](https://arxiv.org/html/2605.15207#A8.SS0.SSS0.Px2.p1.1), [§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- S\. Hong, M\. Zhuge, J\. Chen, X\. Zheng, Y\. Cheng, J\. Wang, C\. Zhang, Z\. Wang, S\. K\. S\. Yau, Z\. Lin, et al\. \(2023\) MetaGPT: meta programming for a multi\-agent collaborative framework\. In The twelfth international conference on learning representations\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1), [§1](https://arxiv.org/html/2605.15207#S1.p1.1), [§1](https://arxiv.org/html/2605.15207#S1.p2.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Huang, Q\. Liu, B\. Y\. Lin, T\. Pang, C\. Du, and M\. Lin \(2023\) Lorahub: efficient cross\-task generalization via dynamic lora composition\. arXiv preprint arXiv:2307\.13269\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1)\.
- G\. Ilharco, M\. T\. Ribeiro, M\. Wortsman, S\. Gururangan, L\. Schmidt, H\. Hajishirzi, and A\. Farhadi \(2022\) Editing models with task arithmetic\. arXiv preprint arXiv:2212\.04089\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Jiang, X\. Ren, and B\. Y\. Lin \(2023\) Llm\-blender: ensembling large language models with pairwise ranking and generative fusion\. arXiv preprint arXiv:2306\.02561\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- S\. Kakade and J\. Langford \(2002\) Approximately optimal approximate reinforcement learning\. In Proceedings of the nineteenth international conference on machine learning, pp\. 267–274\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1)\.
- Y\. Kim, K\. Gu, C\. Park, C\. Park, S\. Schmidgall, A\. A\. Heydari, Y\. Yan, Z\. Zhang, Y\. Zhuang, M\. Malhotra, P\. P\. Liang, H\. W\. Park, Y\. Yang, X\. Xu, Y\. Du, S\. Patel, T\. Althoff, D\. McDuff, and X\. Liu \(2025\) Towards a science of scaling agent systems\. arXiv preprint arXiv:2512\.08296\. External Links: 2512\.08296, [Document](https://dx.doi.org/10.48550/arXiv.2512.08296), [Link](https://arxiv.org/abs/2512.08296)\. Cited by: [§1](https://arxiv.org/html/2605.15207#S1.p1.1)\.
- J\. G\. Kuba, R\. Chen, M\. Wen, Y\. Wen, F\. Sun, J\. Wang, and Y\. Yang \(2021\) Trust region policy optimisation in multi\-agent reinforcement learning\. arXiv preprint arXiv:2109\.11251\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1), [§1](https://arxiv.org/html/2605.15207#S1.p2.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Lee, S\. Phatale, H\. Mansoor, T\. Mesnard, J\. Ferret, K\. Lu, C\. Bishop, E\. Hall, V\. Carbune, A\. Rastogi, et al\. \(2023\) Rlaif vs\. rlhf: scaling reinforcement learning from human feedback with ai feedback\. arXiv preprint arXiv:2309\.00267\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- G\. Li, H\. Hammoud, H\. Itani, D\. Khizbullin, and B\. Ghanem \(2023\) Camel: communicative agents for “mind” exploration of large language model society\. Advances in Neural Information Processing Systems 36, pp\. 51991–52008\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1)\.
- T\. Liang, Z\. He, W\. Jiao, X\. Wang, Y\. Wang, R\. Wang, Y\. Yang, S\. Shi, and Z\. Tu \(2024\) Encouraging divergent thinking in large language models through multi\-agent debate\. In Proceedings of the 2024 conference on empirical methods in natural language processing, pp\. 17889–17904\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Y\. Lin, R\. L\. Bras, K\. Richardson, A\. Sabharwal, R\. Poovendran, P\. Clark, and Y\. Choi \(2025\) Zebralogic: on the scaling limits of llms for logical reasoning\. arXiv preprint arXiv:2502\.01100\. Cited by: [Appendix H](https://arxiv.org/html/2605.15207#A8.SS0.SSS0.Px3.p1.1), [§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- R\. Liu, R\. Yang, C\. Jia, G\. Zhang, D\. Zhou, A\. M\. Dai, D\. Yang, and S\. Vosoughi \(2023a\) Training socially aligned language models in simulated human society\. arXiv preprint arXiv:2305\.16960\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1), [§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Liu, T\. Chen, Z\. Liang, X\. Lyu, and C\. Amato \(2025a\) Llm collaboration with multi\-agent reinforcement learning\. arXiv preprint arXiv:2508\.04652\. Cited by: [§1](https://arxiv.org/html/2605.15207#S1.p2.1)\.
- S\. Liu, Z\. Chen, Y\. Liu, Y\. Wang, D\. Yang, Z\. Zhao, Z\. Zhou, X\. Yi, W\. Li, W\. Zhang, et al\. \(2023b\) Improving generalization in visual reinforcement learning via conflict\-aware gradient agreement augmentation\. In Proceedings of the IEEE/CVF international conference on computer vision, pp\. 23436–23446\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1)\.
- S\. Liu, Y\. Liu, Z\. Chen, Z\. Zhou, Z\. Zhao, Y\. Xie, W\. Li, and Z\. Gan \(2025b\) Improving robotic grasp detection under sparse annotations via grasp transformer with pixel\-wise contrastive learning\. IEEE Transactions on Industrial Electronics\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1)\.
- R\. Lowe, Y\. I\. Wu, A\. Tamar, J\. Harb, O\. Pieter Abbeel, and I\. Mordatch \(2017\) Multi\-agent actor\-critic for mixed cooperative\-competitive environments\. Advances in neural information processing systems 30\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1)\.
- M\. Luo, S\. Tan, J\. Wong, X\. Shi, W\. Y\. Tang, M\. Roongta, C\. Cai, J\. Luo, L\. E\. Li, R\. A\. Popa, and I\. Stoica \(2025\) DeepScaleR: surpassing o1\-preview with a 1\.5b model by scaling rl\. Note: [https://pretty\-radio\-b75\.notion\.site/DeepScaleR\-Surpassing\-O1\-Preview\-with\-a\-1\-5B\-Model\-by\-Scaling\-RL\-19681902c1468005bed8ca303013a4e2](https://pretty-radio-b75.notion.site/DeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2), Notion Blog\. Cited by: [§5\.1\.2](https://arxiv.org/html/2605.15207#S5.SS1.SSS2.p1.1)\.
- J\. S\. Park, J\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proceedings of the 36th annual acm symposium on user interface software and technology, pp\. 1–22\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\) Gorilla: large language model connected with massive apis\. Advances in Neural Information Processing Systems 37, pp\. 126544–126565\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- E\. Perez, S\. Ringer, K\. Lukosiute, K\. Nguyen, E\. Chen, S\. Heiner, C\. Pettit, C\. Olsson, S\. Kundu, S\. Kadavath, et al\. \(2023\) Discovering language model behaviors with model\-written evaluations\. In Findings of the association for computational linguistics: ACL 2023, pp\. 13387–13434\. Cited by: [Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- A\. Prasad, W\. Yuan, R\. Y\. Pang, J\. Xu, M\. Fazel\-Zarandi, M\. Bansal, S\. Sukhbaatar, J\. Weston, and J\. Yu \(2024\) Self\-consistency preference optimization\. arXiv preprint arXiv:2411\.04109\. Cited by: [§5\.1\.2](https://arxiv.org/html/2605.15207#S5.SS1.SSS2.p1.1)\.
- C\. Qian, W\. Liu, H\. Liu, N\. Chen, Y\. Dang, J\. Li, C\. Yang, W\. Chen, Y\. Su, X\. Cong,et al\.\(2024\)Chatdev: communicative agents for software development\.arXiv preprint arXiv:2307\.07924\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- Qwen, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, H\. Lin, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Lin, K\. Dang, K\. Lu, K\. Bao, K\. Yang, L\. Yu, M\. Li, M\. Xue, P\. Zhang, Q\. Zhu, R\. Men, R\. Lin, T\. Li, T\. Tang, T\. Xia, X\. Ren, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Cui, Z\. Zhang, and Z\. Qiu \(2025\)Qwen2\.5 technical report\.External Links:2412\.15115,[Link](https://arxiv.org/abs/2412.15115)Cited by:[§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- T\. Rashid, M\. Samvelyan, C\. S\. De Witt, G\. Farquhar, J\. Foerster, and S\. Whiteson \(2020\)Monotonic value function factorisation for deep multi\-agent reinforcement learning\.Journal of Machine Learning Research21\(178\),pp\. 1–51\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1)\.
- J\. Schulman, S\. Levine, P\. Abbeel, M\. Jordan, and P\. Moritz \(2015\)Trust region policy optimization\.InInternational conference on machine learning,pp\. 1889–1897\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px2.p1.1),[§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- M\. Sharma, M\. Tong, T\. Korbak, D\. Duvenaud, A\. Askell, S\. R\. Bowman, N\. Cheng, E\. Durmus, Z\. Hatfield\-Dodds, S\. R\. Johnston,et al\.\(2023\)Towards understanding sycophancy in language models\.arXiv preprint arXiv:2310\.13548\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\)Hybridflow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,pp\. 1279–1297\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.15207#S5.SS1.SSS2.p1.1)\.
- P\. Singhal, T\. Goyal, J\. Xu, and G\. Durrett \(2023\)A long way to go: investigating length correlations in rlhf\.arXiv preprint arXiv:2310\.03716\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1)\.
- V\. Subramaniam, Y\. Du, J\. B\. Tenenbaum, A\. Torralba, S\. Li, and I\. Mordatch \(2025\)Multiagent finetuning: self improvement with diverse reasoning chains\.arXiv preprint arXiv:2501\.05707\.Cited by:[§1](https://arxiv.org/html/2605.15207#S1.p2.1)\.
- Q\. Team \(2025\)QwQ\-32b: embracing the power of reinforcement learning\.External Links:[Link](https://qwenlm.github.io/blog/qwq-32b/)Cited by:[§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- K\. Valmeekam, M\. Marquez, A\. Olmo, S\. Sreedharan, and S\. Kambhampati \(2023\)Planbench: an extensible benchmark for evaluating large language models on planning and reasoning about change\.Advances in Neural Information Processing Systems36,pp\. 38975–38987\.Cited by:[Appendix H](https://arxiv.org/html/2605.15207#A8.SS0.SSS0.Px6.p1.1),[§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- T\. Wang, P\. Yu, X\. E\. Tan, S\. O’Brien, R\. Pasunuru, J\. Dwivedi\-Yu, O\. Golovneva, L\. Zettlemoyer, M\. Fazel\-Zarandi, and A\. Celikyilmaz \(2023\)Shepherd: a critic for language model generation\.arXiv preprint arXiv:2308\.04592\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\)Self\-consistency improves chain of thought reasoning in language models\.arXiv preprint arXiv:2203\.11171\.Cited by:[§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- J\. Wei, D\. Huang, Y\. Lu, D\. Zhou, and Q\. V\. Le \(2023\)Simple synthetic data reduces sycophancy in large language models\.arXiv preprint arXiv:2308\.03958\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- M\. Wortsman, G\. Ilharco, S\. Y\. Gadre, R\. Roelofs, R\. Gontijo\-Lopes, A\. S\. Morcos, H\. Namkoong, A\. Farhadi, Y\. Carmon, S\. Kornblith,et al\.\(2022\)Model soups: averaging weights of multiple fine\-tuned models improves accuracy without increasing inference time\.InInternational conference on machine learning,pp\. 23965–23998\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px3.p1.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu, A\. H\. Awadallah, R\. W\. White, D\. Burger, and C\. Wang \(2023\)AutoGen: enabling next\-gen llm applications via multi\-agent conversation\.arXiv preprint arXiv:2308\.08155\.External Links:2308\.08155,[Document](https://dx.doi.org/10.48550/arXiv.2308.08155),[Link](https://arxiv.org/abs/2308.08155)Cited by:[§1](https://arxiv.org/html/2605.15207#S1.p1.1),[§1](https://arxiv.org/html/2605.15207#S1.p2.1)\.
- Q\. Wu, G\. Bansal, J\. Zhang, Y\. Wu, B\. Li, E\. Zhu, L\. Jiang, X\. Zhang, S\. Zhang, J\. Liu,et al\.\(2024\)Autogen: enabling next\-gen llm applications via multi\-agent conversations\.InFirst Conference on Language Modeling,Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Xie, Z\. Zhou, C\. Ouyang, S\. Liu, L\. Hu, and Z\. Gan \(2025a\)ACORN: acyclic coordination with reachability network to reduce communication redundancy in multi\-agent systems\.InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems,pp\. 2190–2198\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1)\.
- Y\. Xie, Z\. Zhou, C\. Ouyang, S\. Liu, L\. Hu, and Z\. Gan \(2025b\)Heuristics\-assisted experience replay strategy for cooperative multi\-agent reinforcement learning\.InProceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems,pp\. 2798–2800\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px4.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)ReAct: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.External Links:2210\.03629,[Document](https://dx.doi.org/10.48550/arXiv.2210.03629),[Link](https://arxiv.org/abs/2210.03629)Cited by:[§1](https://arxiv.org/html/2605.15207#S1.p1.1)\.
- X\. Yi, Z\. Zhou, C\. Cao, Q\. Niu, T\. Liu, and B\. Han \(2025\)From debate to equilibrium: belief\-driven multi\-agent llm reasoning via bayesian nash equilibrium\.arXiv preprint arXiv:2506\.08292\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1)\.
- C\. Yu, A\. Velu, E\. Vinitsky, J\. Gao, Y\. Wang, A\. Bayen, and Y\. Wu \(2022\)The surprising effectiveness of ppo in cooperative multi\-agent games\.Advances in neural information processing systems35,pp\. 24611–24624\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px3.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)Dapo: an open\-source llm reinforcement learning system at scale\.arXiv preprint arXiv:2503\.14476\.Cited by:[§5\.1\.2](https://arxiv.org/html/2605.15207#S5.SS1.SSS2.p1.1),[§5\.1\.3](https://arxiv.org/html/2605.15207#S5.SS1.SSS3.p1.1)\.
- A\. Zeng, M\. Liu, R\. Lu, B\. Wang, X\. Liu, Y\. Dong, and J\. Tang \(2024\)Agenttuning: enabling generalized agent abilities for llms\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 3053–3077\.Cited by:[Appendix A](https://arxiv.org/html/2605.15207#A1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.15207#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Zhou, X\. Feng, Z\. Zhu, J\. Yao, S\. Koyejo, and B\. Han \(2025\)From passive to active reasoning: can large language models ask the right questions under incomplete information?\.arXiv preprint arXiv:2506\.08295\.Cited by:[Appendix H](https://arxiv.org/html/2605.15207#A8.SS0.SSS0.Px5.p1.1),[§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
- Q\. Zhu, F\. Huang, R\. Peng, K\. Lu, B\. Yu, Q\. Cheng, X\. Qiu, X\. Huang, and J\. Lin \(2025\)AutoLogi: automated generation of logic puzzles for evaluating reasoning abilities of large language models\.arXiv preprint arXiv:2502\.16906\.Cited by:[Appendix H](https://arxiv.org/html/2605.15207#A8.SS0.SSS0.Px4.p1.1),[§5\.1\.1](https://arxiv.org/html/2605.15207#S5.SS1.SSS1.p1.1)\.
## Appendix contents
1. [A\. Extended related work](https://arxiv.org/html/2605.15207#A1)
2. [B\. Occupancy shift and token\-level trust regions](https://arxiv.org/html/2605.15207#A2)
3. [C\. Certificates: full proofs and order invariance](https://arxiv.org/html/2605.15207#A3)
4. [D\. Plug\-and\-play upgrades and Stage\-0 alignment](https://arxiv.org/html/2605.15207#A4)
5. [E\. Group\-based advantages and finite\-sample concentration](https://arxiv.org/html/2605.15207#A5)
6. [F\. Additional theory and practical notes \(tightness, implementation, diagnostics\)](https://arxiv.org/html/2605.15207#A6)
7. [G\. Additional Experiments](https://arxiv.org/html/2605.15207#A7)
8. [H\. Benchmarks](https://arxiv.org/html/2605.15207#A8)
## Appendix A Extended Related Work
##### Inference\-time multi\-agent LLM systems\.
Recent advancements have shifted the deployment of LLMs from monolithic predictors to collaborative multi\-agent systems \(MAS\) that operate entirely at inference time\. In this paradigm, frozen LLMs serve as cognitive controllers, interacting via natural language without parameter updates\. Unlike traditional MARL, which relies on training task\-specific policies from scratch \(Lowe et al\., [2017](https://arxiv.org/html/2605.15207#bib.bib29); Rashid et al\., [2020](https://arxiv.org/html/2605.15207#bib.bib30); Yu et al\., [2022](https://arxiv.org/html/2605.15207#bib.bib31)\), these systems orchestrate collaboration solely through prompt engineering and context management\. Pioneering works such as CAMEL \(Li et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib32)\) and Generative Agents \(Park et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib33)\) have demonstrated that, by using role\-playing prompts and memory streams, agents can autonomously decompose tasks and simulate complex social behaviors using only their pre\-trained knowledge\. Beyond simulation, training\-free collaborative strategies have been deployed to enhance reasoning\. Du et al\. \([2023](https://arxiv.org/html/2605.15207#bib.bib34)\) and Liang et al\. \([2024](https://arxiv.org/html/2605.15207#bib.bib35)\) propose test\-time debate and consensus\-building frameworks, which reduce hallucinations by aggregating diverse perspectives from fixed models\. To address complex tasks such as software engineering, systems like MetaGPT \(Hong et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib36)\) and ChatDev \(Qian et al\., [2024](https://arxiv.org/html/2605.15207#bib.bib37)\) incorporate hard\-coded Standard Operating Procedures \(SOPs\) into the context\. By enforcing structured sequential handoffs via system prompts, these frameworks stabilize the output of heterogeneous teams without requiring gradient\-based optimization\. 
General\-purpose libraries like AutoGen \(Wu et al\., [2024](https://arxiv.org/html/2605.15207#bib.bib38)\) further formalize these inference\-only interaction topologies\. Despite these empirical successes, such deployments remain constrained by the capabilities of the base models and by fixed interaction rules, thereby forfeiting opportunities for training\-time optimization that could yield more intrinsic and stable collaboration\.
##### Training multi\-agent LLM systems and learning to coordinate\.
Beyond inference\-time orchestration, recent research has pivoted towards updating model parameters to internalize coordination capabilities\. Early efforts largely used Supervised Fine\-Tuning \(SFT\) to specialize agents for distinct roles or to enhance general agentic capabilities\. For instance, large\-scale interaction trajectories are constructed to fine\-tune Llama\-based models, which significantly outperform prompting baselines on complex planning and tool\-use tasks \(Chen et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib41); Zeng et al\., [2024](https://arxiv.org/html/2605.15207#bib.bib57); Yi et al\., [2025](https://arxiv.org/html/2605.15207#bib.bib47)\)\. Other works focus on distilling specific functional roles; for example, Shepherd \(Wang et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib39)\) and Gorilla \(Patil et al\., [2024](https://arxiv.org/html/2605.15207#bib.bib45)\) demonstrate that fine\-tuning dedicated modules can establish effective heterogeneous teams via static supervision\. Moving toward reinforcement learning and preference optimization, researchers aim to teach agents to coordinate dynamically\. Frameworks such as LLM\-Blender \(Jiang et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib42)\) and RLAIF \(Lee et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib56)\) employ ranking mechanisms and AI\-generated feedback to fuse outputs or refine behaviors, thereby aligning agents with collective goals\. Furthermore, Liu et al\. \([2023a](https://arxiv.org/html/2605.15207#bib.bib43)\) explore training socially aligned agents within simulated societies using reinforcement signals derived from social interactions\. However, these training paradigms pose significant challenges due to non\-stationarity\. 
Sequential updates in a shared context often lead to pathological behaviors such as sycophancy, where agents agree with users or teammates regardless of correctness in order to maximize rewards \(Wei et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib44); Perez et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib46); Sharma et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib52)\)\. Existing methods typically rely on naive joint training or independent fine\-tuning, and lack a unified framework to explicitly control these distribution shifts and guarantee monotonic improvement in collaboration\.
##### Trust regions and block\-coordinate optimization in multi\-agent RL\.
Trust\-region methods establish the theoretical foundation for stable reinforcement learning by guaranteeing monotonic policy improvement\. Seminal works like TRPO \(Schulman et al\., [2015](https://arxiv.org/html/2605.15207#bib.bib58)\) and PPO \(Schulman et al\., [2017](https://arxiv.org/html/2605.15207#bib.bib59)\) rely on the monotonic improvement bounds originally derived by Kakade and Langford \([2002](https://arxiv.org/html/2605.15207#bib.bib60)\), utilizing KL\-divergence constraints to prevent destructive parameter updates\. In the multi\-agent domain, extending these guarantees is non\-trivial due to the non\-stationarity introduced by simultaneous updates\. While methods like MAPPO \(Yu et al\., [2022](https://arxiv.org/html/2605.15207#bib.bib31)\) and ACORN \(Xie et al\., [2025a](https://arxiv.org/html/2605.15207#bib.bib40)\) empirically demonstrate the effectiveness of PPO in cooperative games, they often lack strict convergence guarantees in heterogeneous settings\. To address this, recent theoretical advancements have formulated multi\-agent learning as a block\-coordinate ascent problem\. Most notably, HATRPO and HAPPO \(Kuba et al\., [2021](https://arxiv.org/html/2605.15207#bib.bib61)\) derive a multi\-agent advantage decomposition lemma, proving that sequential updates of agents, rather than simultaneous joint optimization, are necessary to preserve the monotonic improvement property in heterogeneous teams\. This sequential paradigm has been successfully applied to cooperative multi\-agent benchmarks such as StarCraft II \(Rashid et al\., [2020](https://arxiv.org/html/2605.15207#bib.bib30); Kuba et al\., [2021](https://arxiv.org/html/2605.15207#bib.bib61)\)\. However, these algorithms operate on low\-dimensional state spaces and have not yet been adapted to the high\-dimensional, discrete, and autoregressive nature of Large Language Models, where “trust regions” must be defined over token\-level probability distributions within an evolving textual context\.
##### Distribution shift under sequential updates and modular system evolution\.
Training collaborative systems introduces non\-stationarity, where sequential updates create a moving target for optimization\. While classical MARL addressed this with importance sampling \(Foerster et al\., [2017](https://arxiv.org/html/2605.15207#bib.bib62); Espeholt et al\., [2018](https://arxiv.org/html/2605.15207#bib.bib63); Xie et al\., [2025b](https://arxiv.org/html/2605.15207#bib.bib54); Liu et al\., [2025b](https://arxiv.org/html/2605.15207#bib.bib53), [2023b](https://arxiv.org/html/2605.15207#bib.bib51)\), recent analyses in RLHF \(Casper et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib64); Singhal et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib65)\) reveal that such distribution shifts in LLMs often lead to mode collapse and reward hacking\. Controlling this shift is crucial not only for stability but also for enabling modular system evolution, that is, combining or upgrading capabilities without retraining\. In the era of LLMs, modularity has evolved from simple adapters to model merging techniques\. Seminal works like Task Arithmetic \(Ilharco et al\., [2022](https://arxiv.org/html/2605.15207#bib.bib66)\) and Model Soups \(Wortsman et al\., [2022](https://arxiv.org/html/2605.15207#bib.bib67)\) demonstrate that agent capabilities can be composed by linearly manipulating weight vectors\. Furthermore, LoRAHub \(Huang et al\., [2023](https://arxiv.org/html/2605.15207#bib.bib68)\) extends this to the dynamic composition of parameter\-efficient modules\. However, these static merging techniques assume independent training distributions and often require heuristic tuning of merging coefficients\. TeamTR provides a complementary dynamic framework: it leverages trust regions to certify that the distribution shift induced by updating or swapping one module \(agent\) remains within a safe bound, effectively enabling “certified model arithmetic” during active collaboration\.
## Correspondence Between Main Text and Appendix
Table 3: Main theoretical results and their detailed proofs\.

| Main Text | Appendix | Description |
| --- | --- | --- |
| Lemma [3\.2](https://arxiv.org/html/2605.15207#S3.Thmtheorem2) | [B\.3](https://arxiv.org/html/2605.15207#A2.SS3) | Occupancy shift bounds \(expected divergence form; reverse\-KL compatible\) |
| Proposition [3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3) | [B\.4](https://arxiv.org/html/2605.15207#A2.SS4) | Stale\-occupancy surrogate mismatch and quadratic\-in\-$n$ compounding term |
| Theorem [3\.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4) | [C\.1](https://arxiv.org/html/2605.15207#A3.SS1) | Single\-step improvement lower bound under sequential trust\-region update |
| Lemma [3\.5](https://arxiv.org/html/2605.15207#S3.Thmtheorem5) | [C\.2](https://arxiv.org/html/2605.15207#A3.SS2) | KL identity for single\-factor update; token\-KL reduction \(reverse KL\) |
| Theorem [3\.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) | [C\.2](https://arxiv.org/html/2605.15207#A3.SS2) | Joint\-stage improvement lower bound \(order\-robust form\) |
| Theorem [3\.7](https://arxiv.org/html/2605.15207#S3.Thmtheorem7) | [C\.3](https://arxiv.org/html/2605.15207#A3.SS3) | Stage\-wise improvement lower bound under stale\-occupancy surrogates \(explicit double\-sum penalty\) |
| Proposition [3\.8](https://arxiv.org/html/2605.15207#S3.Thmtheorem8) | [D\.2](https://arxiv.org/html/2605.15207#A4.SS2) | Certified resumability after replacement \(post\-alignment\) |
| Proposition [3\.9](https://arxiv.org/html/2605.15207#S3.Thmtheorem9) | [D\.2](https://arxiv.org/html/2605.15207#A4.SS2) | Certificate tightening |
| **Supporting / Implementation Material** | | |
| Token\-level KL trust region | [B\.2](https://arxiv.org/html/2605.15207#A2.SS2) | Reverse token\-level KL functional and exact autoregressive decomposition \(rollout\-estimable\) |
| Stage\-0 alignment | [D\.1](https://arxiv.org/html/2605.15207#A4.SS1) | Reverse\-KL projection / distillation objective for heterogeneous upgrades |
| Group\-based advantages | [E\.1](https://arxiv.org/html/2605.15207#A5.SS1) | Group\-standardized, hard\-clipped sequence\-level advantages and bias proxy |
| Concentration bounds | [E\.3](https://arxiv.org/html/2605.15207#A5.SS3) | Finite\-sample concentration for group\-based surrogates, and ratio\-clipping bias control |
| Implementation notes | [F\.6](https://arxiv.org/html/2605.15207#A6.SS6) | Token\-KL token\-sum, std normalization, hard clip, ratio clip |
| Diagnostics | [F\.7](https://arxiv.org/html/2605.15207#A6.SS7) | Certificate tightness plots and training monitors |
| Information\-theoretic bounds | [F\.1](https://arxiv.org/html/2605.15207#A6.SS1), [F\.2](https://arxiv.org/html/2605.15207#A6.SS2) | Oracle upper envelope and budget\-aware lower envelope \(auxiliary\) |
| Smoothness & optimization | [F\.4](https://arxiv.org/html/2605.15207#A6.SS4) | Smoothness and projected\-gradient ascent convergence \(auxiliary\) |

### Notation
We work with a discounted MDP $\mathcal{M}=(\mathcal{S},\mathcal{A},P,r,\gamma)$ with $\gamma\in(0,1)$ and bounded rewards $|r|\leq R_{\max}$\. The analysis also applies to finite\-horizon sequence generation by setting an effective horizon $H\approx(1-\gamma)^{-1}$\.

- $\pi$: joint \(factorized\) team policy\. At each state $s$, $\pi(\cdot\mid s)=\prod_{k=1}^{n}\pi^{(k)}(\cdot\mid s)$\.
- $\mu$: initial\-state distribution at $t=0$ \(used in the discounted visitation fixed\-point equation\)\.
- $d^{\pi}$: discounted visitation distribution: $d^{\pi}(s)=(1-\gamma)\sum_{t\geq 0}\gamma^{t}\Pr_{\pi}(s_{t}=s)$\.
- $J(\pi)$: discounted return: $J(\pi)=\mathbb{E}_{\tau\sim\pi}\big[\sum_{t\geq 0}\gamma^{t}r_{t}\big]$\.
- $Q^{\pi},V^{\pi},A^{\pi}$: action\-value, value, and advantage under $\pi$\.
- $\sigma$: update order \(a permutation of $\{1,\ldots,n\}$\) within a stage\.
- $\hat{\pi}^{i}$: intermediate joint policy after $i$ factor updates in a stage: $\hat{\pi}^{0}=\pi_{\mathrm{cur}}$ and $\hat{\pi}^{i}=\hat{\pi}^{\,i-1}[\sigma(i)\leftarrow\pi_{\mathrm{tar}}^{\sigma(i)}]$; $\bar{\pi}=\hat{\pi}^{n}$\.
- $D_{\mathrm{TV}}{}^{\max}(\pi^{\prime}\|\pi)$: $\sup_{s\in\mathcal{S}}D_{\mathrm{TV}}(\pi^{\prime}(\cdot\mid s),\pi(\cdot\mid s))$\.
- $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\rho}(\pi\|\pi^{\prime})$: token\-level \(expected statewise\) *reverse* KL with reference policy $\rho$ \(Definition [B\.1](https://arxiv.org/html/2605.15207#A2.Thmtheorem1)\)\.
- $\delta_{i}$: trust\-region radius for step $i$: $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}(\pi_{\mathrm{cur}}^{\sigma(i)}\|\pi_{\mathrm{tar}}^{\sigma(i)})\leq\delta_{i}$ \(Eq\. \([4](https://arxiv.org/html/2605.15207#S3.E4)\) in the main text\)\.
- $\widehat{A}_{\textsc{grp}}^{i-1}$: group\-standardized, hard\-clipped sequence\-level advantage \(Appendix [E\.1](https://arxiv.org/html/2605.15207#A5.SS1)\)\.
- $\zeta_{i}$: surrogate\-estimation error term at step $i$: any quantity satisfying
  $$\zeta_{i}\ \geq\ \left|\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}(\cdot\mid s)}\Big[\mathbb{E}\big[\widehat{A}_{i-1}(s,\mathbf{a})\big]-A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})\Big]\right|.$$
  \(A conservative sufficient choice is $\zeta_{i}=\sup_{s,\mathbf{a}}|\mathbb{E}[\widehat{A}_{i-1}(s,\mathbf{a})]-A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})|$\.\)
- $A_{\max}$: advantage bound: $|A^{\pi}(s,\mathbf{a})|\leq A_{\max}$ with $A_{\max}\leq 2R_{\max}/(1-\gamma)$\.
- $A_{\mathrm{clip}}$: clipping threshold ensuring $|\widehat{A}_{\textsc{grp}}^{i-1}|\leq A_{\mathrm{clip}}$ almost surely\.
- $N_{i}$: number of independent prompt\-groups used for step $i$ \(each group contains $G$ rollouts\)\.
- $G$: number of rollouts sampled per prompt\-group\.
- $w_{i}(s,\mathbf{a})$: step\-$i$ per\-decision probability ratio used for importance weighting: $w_{i}(s,\mathbf{a})=\hat{\pi}^{\,i}(\mathbf{a}\mid s)/\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)$ \(under the turn\-taking protocol, this depends only on the updated factor\)\.
- $w_{i}^{\mathrm{clip}}$: ratio\-clipped weight $w_{i}^{\mathrm{clip}}=\mathrm{clip}(w_{i},1-\epsilon_{w},1+\epsilon_{w})$ \(Appendix [E\.3](https://arxiv.org/html/2605.15207#A5.SS3), [F\.6](https://arxiv.org/html/2605.15207#A6.SS6)\)\.
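To make the last few symbols concrete, the following is a minimal Python sketch of a group\-standardized, hard\-clipped advantage and of the ratio\-clipped weight\. It is an illustration under stated assumptions, not the paper's implementation: the function names, the use of the population standard deviation, and the `eps` stabilizer are hypothetical choices, and the exact estimator is the one specified in Appendix E\.1 and F\.6\.

```python
import math
from statistics import mean, pstdev


def group_standardized_advantages(rewards, a_clip, eps=1e-8):
    """Standardize the G sequence-level rewards of one prompt-group, then hard-clip.

    The population standard deviation and the eps stabilizer are assumptions of
    this sketch; the hard clip keeps every value inside [-a_clip, a_clip].
    """
    mu = mean(rewards)
    sd = pstdev(rewards)
    standardized = [(r - mu) / (sd + eps) for r in rewards]
    return [max(-a_clip, min(a_clip, z)) for z in standardized]


def clipped_ratio(logp_new, logp_old, eps_w):
    """Return clip(w, 1 - eps_w, 1 + eps_w) for w = exp(logp_new - logp_old)."""
    w = math.exp(logp_new - logp_old)
    return max(1.0 - eps_w, min(1.0 + eps_w, w))


# Hypothetical group of G = 4 rollouts with binary task rewards.
advantages = group_standardized_advantages([1.0, 0.0, 0.0, 1.0], a_clip=0.5)
ratio = clipped_ratio(logp_new=math.log(2.0), logp_old=0.0, eps_w=0.2)
```

Working in log\-probabilities mirrors how sequence likelihoods are usually stored, and the hard clip is what makes the bound $|\widehat{A}_{\textsc{grp}}^{i-1}|\leq A_{\mathrm{clip}}$ hold for every sample rather than only in expectation\.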
## Appendix B Occupancy Shift and Token\-Level Trust Regions
### B\.1 Proof of Lemma [3\.1](https://arxiv.org/html/2605.15207#S3.Thmtheorem1)
###### Proof\.
Fix a state $s$ and let $j=j(s)$ be the active agent\. Under the turn\-taking protocol, for all $k\neq j$ and for both team policies $\pi,\pi^{\prime}$, the inactive factor is deterministic: $\pi^{(k)}(a_{k}\mid s)=\pi^{\prime(k)}(a_{k}\mid s)=\mathbbm{1}\{a_{k}=\text{noop}\}$\.
Therefore, the joint action distributions at state $s$ factorize as
$$\pi(\mathbf{a}\mid s)=\pi^{(j)}(a_{j}\mid s)\prod_{k\neq j}\mathbbm{1}\{a_{k}=\text{noop}\},\qquad\pi^{\prime}(\mathbf{a}\mid s)=\pi^{\prime(j)}(a_{j}\mid s)\prod_{k\neq j}\mathbbm{1}\{a_{k}=\text{noop}\}.$$
The support of both distributions is restricted to $\{a_{k}=\text{noop}\ \forall k\neq j\}$, and on this set the ratio simplifies:
$$\log\frac{\pi^{\prime}(\mathbf{a}\mid s)}{\pi(\mathbf{a}\mid s)}=\log\frac{\pi^{\prime(j)}(a_{j}\mid s)}{\pi^{(j)}(a_{j}\mid s)}.$$
Thus,
$$D_{\mathrm{KL}}\!\big(\pi^{\prime}(\cdot\mid s)\|\pi(\cdot\mid s)\big)=\sum_{\mathbf{a}}\pi^{\prime}(\mathbf{a}\mid s)\log\frac{\pi^{\prime}(\mathbf{a}\mid s)}{\pi(\mathbf{a}\mid s)}=\sum_{a_{j}}\pi^{\prime(j)}(a_{j}\mid s)\log\frac{\pi^{\prime(j)}(a_{j}\mid s)}{\pi^{(j)}(a_{j}\mid s)}=D_{\mathrm{KL}}\!\big(\pi^{\prime(j)}(\cdot\mid s)\|\pi^{(j)}(\cdot\mid s)\big).$$
The expectation reduction for functions that depend only on the active action follows from the same support argument\. ∎
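The identity in this proof can be checked numerically on a toy state\. The sketch below is illustrative only: the three\-agent team, the action names in `ACTIONS`, and the probability values are hypothetical\. It builds the full joint distribution under the turn\-taking factorization and compares the joint KL with the KL of the active factor alone\.

```python
import math
from itertools import product

ACTIONS = ["noop", "ask", "answer"]  # hypothetical macro-actions


def joint_policy(active_probs, active_idx, n_agents):
    """Joint distribution at one state under turn-taking.

    Agent `active_idx` samples from `active_probs`; every other agent emits
    "noop" with probability one.
    """
    dist = {}
    for joint in product(ACTIONS, repeat=n_agents):
        prob = 1.0
        for k, action in enumerate(joint):
            if k == active_idx:
                prob *= active_probs[ACTIONS.index(action)]
            else:
                prob *= 1.0 if action == "noop" else 0.0
        dist[joint] = prob
    return dist


def kl(p, q):
    """KL(p || q) for two dicts over the same keys (0 log 0 treated as 0)."""
    return sum(pv * math.log(pv / q[key]) for key, pv in p.items() if pv > 0.0)


old_active = [0.2, 0.5, 0.3]  # pi^(j)(. | s), hypothetical values
new_active = [0.1, 0.3, 0.6]  # pi'^(j)(. | s), hypothetical values

# KL over all 27 joint actions versus KL over the active agent's 3 actions.
joint_kl = kl(joint_policy(new_active, 1, 3), joint_policy(old_active, 1, 3))
active_kl = kl(dict(zip(ACTIONS, new_active)), dict(zip(ACTIONS, old_active)))
```

Because the inactive factors contribute only factors of one on the shared support, `joint_kl` and `active_kl` agree up to floating\-point error, which is the practical reason a per\-agent divergence constraint suffices at each state\.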
### B\.2Token\-Level KL Trust Regions and Autoregressive Chain Rule
This section formalizes the bridge between the implementation choice (*token-level reverse KL*) and the theoretical trust-region radius used in our certificates. The key point is that when each macro-action is an autoregressive message, the *reverse* message-level KL equals the expected *sum* of token-level reverse KLs (chain rule), with the expectation taken under the reference/current policy. This matches on-policy rollouts under the intermediate team and avoids a token-occupancy mismatch.
###### Definition B\.1\(Discounted token\-level reverse\-KL functional with a reference policy\)\.
For a reference policy $\rho$ and two policies $\pi,\pi^{\prime}$, define
$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\rho}(\pi\|\pi^{\prime})\coloneqq\mathbb{E}_{\tau\sim\rho}\left[(1-\gamma)\sum_{t\geq 0}\gamma^{t}\,D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s_{t})\|\pi^{\prime}(\cdot\mid s_{t})\big)\right].$$
###### Lemma B.2 (Token-level KL equals expected statewise KL under $d^{\rho}$).
For any $\rho,\pi,\pi^{\prime}$,
$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\rho}(\pi\|\pi^{\prime})=\mathbb{E}_{s\sim d^{\rho}}D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big).$$
###### Proof\.
Recall that the discounted state-visitation distribution induced by $\rho$ is the (discounted) occupancy measure
$$d^{\rho}(\mathrm{d}s)\;\coloneqq\;(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr_{\tau\sim\rho}(s_{t}\in\mathrm{d}s),$$
i.e., for any measurable set $B\subseteq\mathcal{S}$,
$$d^{\rho}(B)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\Pr_{\tau\sim\rho}(s_{t}\in B).$$
Equivalently, for any measurable function $g:\mathcal{S}\to[0,\infty]$,
$$\begin{aligned}\mathbb{E}_{s\sim d^{\rho}}[g(s)]&=\int_{\mathcal{S}}g(s)\,d^{\rho}(\mathrm{d}s)\\&=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\int_{\mathcal{S}}g(s)\,\Pr_{\tau\sim\rho}(s_{t}\in\mathrm{d}s)\\&=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{\tau\sim\rho}\big[g(s_{t})\big].\end{aligned}\tag{19}$$
(The interchange of the sum and the integral/expectation is justified by Tonelli's theorem since $g\geq 0$; if one works in finite spaces, this follows from elementary manipulations.)
Now apply ([19](https://arxiv.org/html/2605.15207#A2.E19)) to
$$g(s)\;=\;D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big)\in[0,\infty],$$
(where the value may be $+\infty$ if $\pi(\cdot\mid s)$ is not absolutely continuous w.r.t. $\pi^{\prime}(\cdot\mid s)$; the identity still holds in the extended reals). Then
$$\begin{aligned}\mathbb{E}_{s\sim d^{\rho}}D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big)&=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{\tau\sim\rho}\!\left[D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s_{t})\|\pi^{\prime}(\cdot\mid s_{t})\big)\right]\\&=\mathbb{E}_{\tau\sim\rho}\!\left[(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s_{t})\|\pi^{\prime}(\cdot\mid s_{t})\big)\right]\\&=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\rho}(\pi\|\pi^{\prime}),\end{aligned}$$
where the second equality again uses Tonelli/linearity (the summand is nonnegative). This proves the claim. ∎
##### Autoregressive messages\.
Fix an agent $j$ and a state $s$. Let a macro-action (message) be a variable-length token sequence $m=(x_{1},\ldots,x_{T})$ terminated by an EOS token, with conditional distributions
$$\pi^{(j)}(m\mid s)\;=\;\prod_{u=1}^{T(m)}\pi^{(j)}(x_{u}\mid s,x_{<u}).$$
Define $\pi^{\prime(j)}$ similarly.
###### Lemma B\.3\(Chain rule \(reverse\): message KL equals expected sum of token KLs\)\.
For any $s$,
$$D_{\mathrm{KL}}\!\big(\pi^{(j)}(\cdot\mid s)\|\pi^{\prime(j)}(\cdot\mid s)\big)=\mathbb{E}_{m\sim\pi^{(j)}(\cdot\mid s)}\left[\sum_{u=1}^{T(m)}D_{\mathrm{KL}}\!\big(\pi^{(j)}(\cdot\mid s,x_{<u})\|\pi^{\prime(j)}(\cdot\mid s,x_{<u})\big)\right].$$
###### Proof\.
Write
$$\log\frac{\pi^{(j)}(m\mid s)}{\pi^{\prime(j)}(m\mid s)}=\sum_{u=1}^{T(m)}\log\frac{\pi^{(j)}(x_{u}\mid s,x_{<u})}{\pi^{\prime(j)}(x_{u}\mid s,x_{<u})}.$$
Taking expectation over $m\sim\pi^{(j)}(\cdot\mid s)$ gives the KL on the left. Then apply the tower property, conditioning on the prefix $x_{<u}$ at each position $u$, to recognize each term as a conditional KL. ∎
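The chain rule can be verified by brute force on a tiny autoregressive model. The following sketch makes several assumptions that are not part of the paper: a three-symbol vocabulary, a hard length cap `MAX_LEN` (the last position always emits `<eos>`), and a toy `next_token_dist` whose `bias` parameter stands in for the difference between $\pi^{(j)}$ and $\pi^{\prime(j)}$. It enumerates every EOS-terminated message and compares the message-level KL with the expected sum of per-token KLs.

```python
import math

VOCAB = ("a", "b", "<eos>")
MAX_LEN = 3  # hypothetical cap: the final position always emits <eos>


def next_token_dist(prefix, bias):
    """Toy next-token distribution; `bias` distinguishes the two policies."""
    if len(prefix) == MAX_LEN - 1:
        return {"a": 0.0, "b": 0.0, "<eos>": 1.0}
    weights = {
        "a": 1.0 + bias * len(prefix),
        "b": 1.0 + (0.5 if prefix and prefix[-1] == "a" else 0.0),
        "<eos>": 0.5 + bias,
    }
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def enumerate_messages(bias):
    """Return {message: probability} over all EOS-terminated messages."""
    messages = {}
    stack = [((), 1.0)]
    while stack:
        prefix, prob = stack.pop()
        for tok, p in next_token_dist(prefix, bias).items():
            if p == 0.0:
                continue
            msg = prefix + (tok,)
            if tok == "<eos>":
                messages[msg] = prob * p
            else:
                stack.append((msg, prob * p))
    return messages


def token_kl(prefix, bias_p, bias_q):
    """KL between the two next-token distributions at a given prefix."""
    p = next_token_dist(prefix, bias_p)
    q = next_token_dist(prefix, bias_q)
    return sum(pv * math.log(pv / q[t]) for t, pv in p.items() if pv > 0)


def message_kl(bias_p, bias_q):
    """Message-level KL computed directly from full-message probabilities."""
    p = enumerate_messages(bias_p)
    q = enumerate_messages(bias_q)
    return sum(pv * math.log(pv / q[m]) for m, pv in p.items())


def expected_token_kl_sum(bias_p, bias_q):
    """Expected sum of per-token KLs, with messages drawn from the first policy."""
    total = 0.0
    for msg, prob in enumerate_messages(bias_p).items():
        total += prob * sum(
            token_kl(msg[:u], bias_p, bias_q) for u in range(len(msg))
        )
    return total


print(message_kl(0.2, 0.6), expected_token_kl_sum(0.2, 0.6))
```

Both quantities coincide up to floating-point error. Note that the expectation is taken under the first argument of the KL, which is why sampling from the reference/current policy is the matching choice in practice.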
##### Practical estimator\.
Lemma [B.3](https://arxiv.org/html/2605.15207#A2.Thmtheorem3) shows that summing per-token reverse KL along tokens *sampled from the reference/current policy* yields an unbiased estimate of the reverse message-level KL at that state. This aligns with TeamTR's per-step trust-region monitoring during rollouts from $\hat{\pi}^{\,i-1}$.
###### Lemma B\.4\(Unweighted token\-average vs\. token\-sum \(explicit constants\)\)\.
Assume message lengths are bounded: $1\leq T(m)\leq T_{\max}$ almost surely under $\pi^{(j)}(\cdot\mid s)$. Let $\kappa_{u}(m,s)\geq 0$ denote the token KL at position $u$ as in Lemma [B.3](https://arxiv.org/html/2605.15207#A2.Thmtheorem3). Define the token-sum and token-average random variables
$$S(m,s)\coloneqq\sum_{u=1}^{T(m)}\kappa_{u}(m,s),\qquad A(m,s)\coloneqq\frac{1}{T(m)}\sum_{u=1}^{T(m)}\kappa_{u}(m,s).$$
Then $S(m,s)=T(m)\,A(m,s)$ and hence
$$\mathbb{E}[S(m,s)]\leq T_{\max}\,\mathbb{E}[A(m,s)],\qquad\mathbb{E}[S(m,s)]\geq\mathbb{E}[A(m,s)].$$
Consequently, enforcing an average-token cap $\mathbb{E}[A(m,s)]\leq\bar{\delta}$ implies a message-KL cap $D_{\mathrm{KL}}(\pi^{(j)}(\cdot\mid s)\|\pi^{\prime(j)}(\cdot\mid s))\leq T_{\max}\bar{\delta}$.
###### Proof\.
Fix a state $s$ and draw a message $m\sim\pi^{(j)}(\cdot\mid s)$. By assumption, the random length $T(m)$ satisfies $1\leq T(m)\leq T_{\max}$ almost surely. Moreover, $\kappa_{u}(m,s)\geq 0$ for every $u$, hence
$$S(m,s)=\sum_{u=1}^{T(m)}\kappa_{u}(m,s)\geq 0,\qquad A(m,s)=\frac{1}{T(m)}\sum_{u=1}^{T(m)}\kappa_{u}(m,s)\geq 0.$$
By definition, for every realized $(m,s)$ with $T(m)\geq 1$,
$$S(m,s)=T(m)\,A(m,s).$$
Using the bounds on $T(m)$ and nonnegativity of $A(m,s)$, we obtain the following *pointwise* inequalities:
$$A(m,s)\;\leq\;S(m,s)\;=\;T(m)\,A(m,s)\;\leq\;T_{\max}A(m,s).$$
Taking expectation with respect to $m\sim\pi^{(j)}(\cdot\mid s)$ (equivalently, with respect to the joint token-generation randomness underlying $m$) and using monotonicity/linearity of expectation yields
$$\mathbb{E}[S(m,s)]\leq T_{\max}\,\mathbb{E}[A(m,s)],\qquad\mathbb{E}[S(m,s)]\geq\mathbb{E}[A(m,s)].$$
(These inequalities remain valid in the extended reals even if some KL terms are $+\infty$.)
Finally, by Lemma [B.3](https://arxiv.org/html/2605.15207#A2.Thmtheorem3) (the chain rule decomposition of message-level KL into token-level KLs),
$$D_{\mathrm{KL}}\!\big(\pi^{(j)}(\cdot\mid s)\|\pi^{\prime(j)}(\cdot\mid s)\big)=\mathbb{E}_{m\sim\pi^{(j)}(\cdot\mid s)}\!\left[\sum_{u=1}^{T(m)}\kappa_{u}(m,s)\right]=\mathbb{E}[S(m,s)].$$
Therefore, enforcing an average-token cap $\mathbb{E}[A(m,s)]\leq\bar{\delta}$ implies
$$D_{\mathrm{KL}}\!\big(\pi^{(j)}(\cdot\mid s)\|\pi^{\prime(j)}(\cdot\mid s)\big)=\mathbb{E}[S(m,s)]\leq T_{\max}\,\mathbb{E}[A(m,s)]\leq T_{\max}\bar{\delta},$$
as claimed. ∎
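To make the constants concrete, the sketch below checks the pointwise sandwich $A\leq S\leq T_{\max}A$ on a few made-up per-token KL sequences and converts an average-token cap into the implied message-level cap. The sample values, `t_max`, and the helper names are hypothetical and chosen only for illustration.

```python
def token_sum_and_average(token_kls):
    """Return (S, A): the sum and the mean of per-token KL values of one message."""
    total = sum(token_kls)
    return total, total / len(token_kls)


def message_kl_cap(avg_token_cap, t_max):
    """Message-level KL cap implied by an average-token cap (Lemma B.4)."""
    return t_max * avg_token_cap


# Hypothetical per-token KL values for three sampled messages of length <= 4.
samples = [[0.01, 0.03], [0.02], [0.0, 0.04, 0.01, 0.03]]
t_max = 4

pairs = [token_sum_and_average(kls) for kls in samples]
mean_sum = sum(s for s, _ in pairs) / len(pairs)   # Monte Carlo estimate of E[S]
mean_avg = sum(a for _, a in pairs) / len(pairs)   # Monte Carlo estimate of E[A]

for s, a in pairs:
    # Pointwise sandwich A <= S <= T_max * A (small slack for rounding).
    assert a <= s + 1e-12 and s <= t_max * a + 1e-12

print(mean_avg, mean_sum, message_kl_cap(mean_avg, t_max))
```

The factor $T_{\max}$ is the price of monitoring a length-normalized quantity: with long messages, a small average-token cap can still permit a large message-level KL, so the cap $\bar{\delta}$ should be chosen with the maximum generation length in mind.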
###### Lemma B\.5\(From expected \(token\) KL to an occupancy shift penalty\)\.
Let $\bar{\delta}=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\pi}(\pi\|\pi^{\prime})$ with reference $\pi$. For any bounded $f$,
$$\big|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\sqrt{\frac{\bar{\delta}}{2}}\,\|f\|_{\infty}.$$
###### Proof\.
Let $f:\mathcal{S}\to\mathbb{R}$ be bounded, and denote $\|f\|_{\infty}=\sup_{s}|f(s)|$. By Lemma [B.8](https://arxiv.org/html/2605.15207#A2.Thmtheorem8), we have an occupancy-shift bound of the form
$$\big|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\,\|f\|_{\infty}\;\mathbb{E}_{s\sim d^{\pi}}\!\Big[D_{\mathrm{TV}}\!\big(\pi(\cdot\mid s),\pi^{\prime}(\cdot\mid s)\big)\Big].\tag{20}$$
Next, apply Pinsker's inequality pointwise in $s$:
$$D_{\mathrm{TV}}\!\big(\pi(\cdot\mid s),\pi^{\prime}(\cdot\mid s)\big)\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big)}.$$
Taking expectation over $s\sim d^{\pi}$ and using Jensen's inequality (since $x\mapsto\sqrt{x}$ is concave on $\mathbb{R}_{+}$) gives
$$\begin{aligned}\mathbb{E}_{s\sim d^{\pi}}\!\Big[D_{\mathrm{TV}}\!\big(\pi(\cdot\mid s),\pi^{\prime}(\cdot\mid s)\big)\Big]&\leq\mathbb{E}_{s\sim d^{\pi}}\!\left[\sqrt{\frac{1}{2}D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big)}\right]\\&\leq\sqrt{\frac{1}{2}\mathbb{E}_{s\sim d^{\pi}}D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big)}.\end{aligned}$$
Finally, by Lemma [B.2](https://arxiv.org/html/2605.15207#A2.Thmtheorem2) with reference policy $\rho=\pi$,
$$\mathbb{E}_{s\sim d^{\pi}}D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s)\big)=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\pi}(\pi\|\pi^{\prime})=\bar{\delta}.$$
Plugging the above into ([20](https://arxiv.org/html/2605.15207#A2.E20)) yields
$$\big|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\sqrt{\frac{\bar{\delta}}{2}}\;\|f\|_{\infty},$$
which proves the lemma. ∎
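As a quick numerical illustration of Lemma B.5, the helper below evaluates the right-hand side of the bound. The function name and the example values of $\gamma$, $\bar{\delta}$, and $\|f\|_{\infty}$ are assumptions made for this sketch, not settings reported in the paper.

```python
import math


def occupancy_shift_penalty(delta_bar, gamma, f_max):
    """Right-hand side of Lemma B.5: (2*gamma/(1-gamma)) * sqrt(delta_bar/2) * f_max."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if delta_bar < 0.0 or f_max < 0.0:
        raise ValueError("delta_bar and f_max must be nonnegative")
    return (2.0 * gamma / (1.0 - gamma)) * math.sqrt(delta_bar / 2.0) * f_max


# Hypothetical values: gamma = 0.9, expected token KL 0.02, |f| <= 1.
penalty = occupancy_shift_penalty(delta_bar=0.02, gamma=0.9, f_max=1.0)
print(penalty)  # approximately 1.8
```

Since any difference of expectations of $f$ is trivially at most $2\|f\|_{\infty}$, the bound is only informative when the penalty falls below that value; with $\gamma$ close to 1 this requires a small trust-region radius $\bar{\delta}$.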
###### Lemma B\.6\(Token\-level KL reduction under a single\-factor update \(reverse\)\)\.
At step $i$, $\hat{\pi}^{\,i}$ and $\hat{\pi}^{\,i-1}$ differ only in factor $\sigma(i)$. With reference $\hat{\pi}^{\,i-1}$,
$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\big(\hat{\pi}^{\,i-1}\|\hat{\pi}^{\,i}\big)=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\Big(\pi_{\mathrm{cur}}^{\sigma(i)}\|\pi_{\mathrm{tar}}^{\sigma(i)}\Big).$$
###### Proof\.
Let $k\coloneqq\sigma(i)$. We view the (joint) action/message space as a product $\mathcal{M}=\prod_{j=1}^{J}\mathcal{M}^{(j)}$, and write $m=(m^{k},m^{-k})$ for the $k$-th component and the remaining components. By assumption, $\hat{\pi}^{\,i}$ and $\hat{\pi}^{\,i-1}$ differ only in factor $k$, i.e., for every state $s$,
$$\begin{aligned}\hat{\pi}^{\,i-1}(\mathrm{d}m\mid s)&=\pi_{\mathrm{cur}}^{k}(\mathrm{d}m^{k}\mid s)\;\otimes\;\Big(\bigotimes_{j\neq k}\pi^{(j)}(\mathrm{d}m^{(j)}\mid s)\Big),\\\hat{\pi}^{\,i}(\mathrm{d}m\mid s)&=\pi_{\mathrm{tar}}^{k}(\mathrm{d}m^{k}\mid s)\;\otimes\;\Big(\bigotimes_{j\neq k}\pi^{(j)}(\mathrm{d}m^{(j)}\mid s)\Big),\end{aligned}$$
where all factors with $j\neq k$ are identical.
Fix an arbitrary $s$. If $\pi_{\mathrm{cur}}^{k}(\cdot\mid s)\not\ll\pi_{\mathrm{tar}}^{k}(\cdot\mid s)$, then also $\hat{\pi}^{\,i-1}(\cdot\mid s)\not\ll\hat{\pi}^{\,i}(\cdot\mid s)$ (because the common product factor on $m^{-k}$ cannot restore absolute continuity), hence both $D_{\mathrm{KL}}(\hat{\pi}^{\,i-1}(\cdot\mid s)\|\hat{\pi}^{\,i}(\cdot\mid s))$ and $D_{\mathrm{KL}}(\pi_{\mathrm{cur}}^{k}(\cdot\mid s)\|\pi_{\mathrm{tar}}^{k}(\cdot\mid s))$ equal $+\infty$, and the desired equality holds.
Otherwise, $\pi_{\mathrm{cur}}^{k}(\cdot\mid s)\ll\pi_{\mathrm{tar}}^{k}(\cdot\mid s)$ and thus $\hat{\pi}^{\,i-1}(\cdot\mid s)\ll\hat{\pi}^{\,i}(\cdot\mid s)$. By the Radon–Nikodym rule for product measures, the likelihood ratio cancels all identical factors:
$$\frac{\mathrm{d}\hat{\pi}^{\,i-1}(\cdot\mid s)}{\mathrm{d}\hat{\pi}^{\,i}(\cdot\mid s)}(m)=\frac{\mathrm{d}\pi_{\mathrm{cur}}^{k}(\cdot\mid s)}{\mathrm{d}\pi_{\mathrm{tar}}^{k}(\cdot\mid s)}(m^{k}).$$
Therefore,
$$\begin{aligned}D_{\mathrm{KL}}\!\big(\hat{\pi}^{\,i-1}(\cdot\mid s)\|\hat{\pi}^{\,i}(\cdot\mid s)\big)&=\int_{\mathcal{M}}\log\!\left(\frac{\mathrm{d}\hat{\pi}^{\,i-1}}{\mathrm{d}\hat{\pi}^{\,i}}(m)\right)\hat{\pi}^{\,i-1}(\mathrm{d}m\mid s)\\&=\int_{\mathcal{M}}\log\!\left(\frac{\mathrm{d}\pi_{\mathrm{cur}}^{k}}{\mathrm{d}\pi_{\mathrm{tar}}^{k}}(m^{k})\right)\pi_{\mathrm{cur}}^{k}(\mathrm{d}m^{k}\mid s)\Big(\bigotimes_{j\neq k}\pi^{(j)}(\mathrm{d}m^{(j)}\mid s)\Big)\\&=\int_{\mathcal{M}^{(k)}}\log\!\left(\frac{\mathrm{d}\pi_{\mathrm{cur}}^{k}}{\mathrm{d}\pi_{\mathrm{tar}}^{k}}(m^{k})\right)\pi_{\mathrm{cur}}^{k}(\mathrm{d}m^{k}\mid s)\\&=D_{\mathrm{KL}}\!\big(\pi_{\mathrm{cur}}^{k}(\cdot\mid s)\|\pi_{\mathrm{tar}}^{k}(\cdot\mid s)\big),\end{aligned}$$
where the third line integrates out the common factor on $m^{-k}$ (it is a probability measure, so its total mass is $1$). (In the discrete case, the same conclusion follows by expanding the sum over $m=(m^{k},m^{-k})$ and observing direct cancellation.)
Applying Lemma [B.2](https://arxiv.org/html/2605.15207#A2.Thmtheorem2) with reference $\rho=\hat{\pi}^{\,i-1}$, we obtain
$$\begin{aligned}D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\big(\hat{\pi}^{\,i-1}\|\hat{\pi}^{\,i}\big)&=\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}}}D_{\mathrm{KL}}\!\big(\hat{\pi}^{\,i-1}(\cdot\mid s)\|\hat{\pi}^{\,i}(\cdot\mid s)\big)\\&=\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}}}D_{\mathrm{KL}}\!\big(\pi_{\mathrm{cur}}^{k}(\cdot\mid s)\|\pi_{\mathrm{tar}}^{k}(\cdot\mid s)\big)\\&=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\Big(\pi_{\mathrm{cur}}^{k}\|\pi_{\mathrm{tar}}^{k}\Big),\end{aligned}$$
where the last line uses the same definition of $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\rho}$ but applied to the $k$-th factor, with the *same* reference occupancy distribution $d^{\rho}$ (here $\rho=\hat{\pi}^{\,i-1}$). This proves the lemma. ∎
### B\.3Occupancy Shift Bounds
This section proves Lemma [3.2](https://arxiv.org/html/2605.15207#S3.Thmtheorem2) from the main text and provides both max-divergence and expected-divergence variants. All statements apply to the macro-action MDP; i.e., $\pi(\cdot\mid s)$ denotes the (joint) macro-action distribution at state $s$.
###### Lemma B.7 (Occupancy shift via $D_{\mathrm{TV}}{}^{\max}$).
For any policies $\pi^{\prime},\pi$ and bounded $f:\mathcal{S}\to\mathbb{R}$,
$$\big|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\,D_{\mathrm{TV}}{}^{\max}(\pi^{\prime}\|\pi)\,\|f\|_{\infty}.$$
###### Proof\.
The proof follows the standard resolvent argument. Let $P_{\pi}(s^{\prime}\mid s)=\sum_{\mathbf{a}}\pi(\mathbf{a}\mid s)P(s^{\prime}\mid s,\mathbf{a})$ be the induced state transition kernel. The discounted visitation satisfies $d^{\pi}=(1-\gamma)\mu+\gamma P_{\pi}^{\top}d^{\pi}$. Taking the difference yields
$$(I-\gamma P_{\pi^{\prime}}^{\top})(d^{\pi^{\prime}}-d^{\pi})=\gamma(P_{\pi^{\prime}}^{\top}-P_{\pi}^{\top})d^{\pi}.$$
Since $\|(I-\gamma P_{\pi^{\prime}}^{\top})^{-1}\|_{1\to 1}\leq\frac{1}{1-\gamma}$,
$$\|d^{\pi^{\prime}}-d^{\pi}\|_{1}\leq\frac{\gamma}{1-\gamma}\|(P_{\pi^{\prime}}^{\top}-P_{\pi}^{\top})d^{\pi}\|_{1}.$$
For each state $s$, since $\|P(\cdot\mid s,\mathbf{a})\|_{1}=1$,
$$\|P_{\pi^{\prime}}(\cdot\mid s)-P_{\pi}(\cdot\mid s)\|_{1}\leq\sum_{\mathbf{a}}|\pi^{\prime}(\mathbf{a}\mid s)-\pi(\mathbf{a}\mid s)|\cdot\|P(\cdot\mid s,\mathbf{a})\|_{1}=2\,D_{\mathrm{TV}}(\pi^{\prime}(\cdot\mid s),\pi(\cdot\mid s)).$$
Because $d^{\pi}$ is a probability distribution, averaging over $s\sim d^{\pi}$ and bounding each term by its supremum gives $\|(P_{\pi^{\prime}}^{\top}-P_{\pi}^{\top})d^{\pi}\|_{1}\leq 2\,D_{\mathrm{TV}}{}^{\max}(\pi^{\prime}\|\pi)$. Finally,
$$|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]|\leq\|f\|_{\infty}\|d^{\pi^{\prime}}-d^{\pi}\|_{1}\leq\frac{2\gamma}{1-\gamma}D_{\mathrm{TV}}{}^{\max}(\pi^{\prime}\|\pi)\|f\|_{\infty}.$$
∎
###### Lemma B.8 (Occupancy shift via expected $D_{\mathrm{TV}}$).
For any policies $\pi^{\prime},\pi$ and bounded $f$,
$$\big|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}}\!\left[D_{\mathrm{TV}}(\pi^{\prime}(\cdot\mid s),\pi(\cdot\mid s))\right]\|f\|_{\infty}.$$
Consequently, by Pinsker and Jensen,
$$\big|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\,\sqrt{\frac{1}{2}\mathbb{E}_{s\sim d^{\pi}}D_{\mathrm{KL}}(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s))}\,\|f\|_{\infty}.$$
###### Proof\.
We follow the same resolvent identity as in Lemma [B.7](https://arxiv.org/html/2605.15207#A2.Thmtheorem7) but do not take a supremum over $s$.
From the fixed-point equations,
$$(I-\gamma P_{\pi^{\prime}}^{\top})(d^{\pi^{\prime}}-d^{\pi})=\gamma(P_{\pi^{\prime}}^{\top}-P_{\pi}^{\top})d^{\pi},$$
hence
$$\|d^{\pi^{\prime}}-d^{\pi}\|_{1}\leq\frac{\gamma}{1-\gamma}\|(P_{\pi^{\prime}}^{\top}-P_{\pi}^{\top})d^{\pi}\|_{1}.$$
Now,
$$\begin{aligned}\|(P_{\pi^{\prime}}^{\top}-P_{\pi}^{\top})d^{\pi}\|_{1}&=\sum_{s^{\prime}}\left|\sum_{s}d^{\pi}(s)\big(P_{\pi^{\prime}}(s^{\prime}\mid s)-P_{\pi}(s^{\prime}\mid s)\big)\right|\\&\leq\sum_{s}d^{\pi}(s)\sum_{s^{\prime}}\left|P_{\pi^{\prime}}(s^{\prime}\mid s)-P_{\pi}(s^{\prime}\mid s)\right|\\&=\mathbb{E}_{s\sim d^{\pi}}\left[\|P_{\pi^{\prime}}(\cdot\mid s)-P_{\pi}(\cdot\mid s)\|_{1}\right].\end{aligned}$$
As in Lemma [B.7](https://arxiv.org/html/2605.15207#A2.Thmtheorem7), for each $s$ we have $\|P_{\pi^{\prime}}(\cdot\mid s)-P_{\pi}(\cdot\mid s)\|_{1}\leq 2\,D_{\mathrm{TV}}(\pi^{\prime}(\cdot\mid s),\pi(\cdot\mid s))$, thus
$$\|d^{\pi^{\prime}}-d^{\pi}\|_{1}\leq\frac{2\gamma}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi}}\!\left[D_{\mathrm{TV}}(\pi^{\prime}(\cdot\mid s),\pi(\cdot\mid s))\right].$$
Finally, $|\mathbb{E}_{d^{\pi^{\prime}}}[f]-\mathbb{E}_{d^{\pi}}[f]|\leq\|f\|_{\infty}\|d^{\pi^{\prime}}-d^{\pi}\|_{1}$ gives the first inequality.
For the second inequality, apply Pinsker pointwise: $D_{\mathrm{TV}}(P,Q)\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}(P\|Q)}$, with $P=\pi(\cdot\mid s)$ and $Q=\pi^{\prime}(\cdot\mid s)$, then use Jensen:
$$\mathbb{E}_{s\sim d^{\pi}}\left[D_{\mathrm{TV}}(\pi^{\prime}(\cdot\mid s),\pi(\cdot\mid s))\right]\leq\mathbb{E}_{s\sim d^{\pi}}\left[\sqrt{\tfrac{1}{2}D_{\mathrm{KL}}(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s))}\right]\leq\sqrt{\tfrac{1}{2}\mathbb{E}_{s\sim d^{\pi}}D_{\mathrm{KL}}(\pi(\cdot\mid s)\|\pi^{\prime}(\cdot\mid s))}.$$
∎
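The first inequality of Lemma B.8 can be sanity-checked on a small tabular MDP. Everything in the sketch below is a hypothetical example (two states, two actions, made-up transition probabilities and policies); it computes the discounted occupancies by truncating the defining series and compares the $\ell_1$ shift with the expected-TV bound.

```python
GAMMA = 0.9
MU = [1.0, 0.0]  # initial state distribution (hypothetical)

# P[s][a] is the next-state distribution after taking action a in state s.
P = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
]
PI = [[0.7, 0.3], [0.4, 0.6]]         # pi(a | s)
PI_PRIME = [[0.6, 0.4], [0.5, 0.5]]   # pi'(a | s)


def state_kernel(policy):
    """Induced state-to-state kernel P_pi(s' | s) = sum_a pi(a|s) P(s'|s,a)."""
    n = len(P)
    return [
        [sum(policy[s][a] * P[s][a][s2] for a in range(len(policy[s])))
         for s2 in range(n)]
        for s in range(n)
    ]


def discounted_occupancy(policy, horizon=2000):
    """d^pi = (1 - gamma) * sum_t gamma^t Pr(s_t = .), truncated at `horizon`."""
    kernel = state_kernel(policy)
    n = len(MU)
    dist = MU[:]
    occupancy = [0.0] * n
    weight = 1.0 - GAMMA
    for _ in range(horizon):
        for s in range(n):
            occupancy[s] += weight * dist[s]
        dist = [sum(dist[s] * kernel[s][s2] for s in range(n)) for s2 in range(n)]
        weight *= GAMMA
    return occupancy


def total_variation(p, q):
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


d = discounted_occupancy(PI)
d_prime = discounted_occupancy(PI_PRIME)

l1_shift = sum(abs(a - b) for a, b in zip(d_prime, d))
expected_tv = sum(d[s] * total_variation(PI_PRIME[s], PI[s]) for s in range(len(d)))
bound = 2.0 * GAMMA / (1.0 - GAMMA) * expected_tv

print(l1_shift, bound)
```

In this example the observed shift is far below the bound, which is typical: the $\frac{2\gamma}{1-\gamma}$ factor is a worst-case constant over all transition kernels.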
###### Lemma B\.9\(Cumulative within\-stage occupancy shift\)\.
Assume a sequence of intermediate policies $\hat{\pi}^{0},\hat{\pi}^{1},\ldots,\hat{\pi}^{i-1}$ satisfies $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,k-1}}(\hat{\pi}^{\,k-1}\|\hat{\pi}^{\,k})\leq\delta_{k}$ for $k=1,\ldots,i-1$. Then for any bounded measurable $f$,
$$\big|\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,0}}}[f]\big|\leq\frac{\sqrt{2}\gamma}{1-\gamma}\,\|f\|_{\infty}\sum_{k=1}^{i-1}\sqrt{\delta_{k}}.$$
###### Proof\.
Fix any bounded measurable $f:\mathcal{S}\to\mathbb{R}$ and let $\|f\|_{\infty}=\sup_{s}|f(s)|$. Write the telescoping decomposition
$$\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,0}}}[f]=\sum_{k=1}^{i-1}\Big(\mathbb{E}_{d^{\hat{\pi}^{\,k}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,k-1}}}[f]\Big).$$
Taking absolute values and applying the triangle inequality gives
$$\big|\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,0}}}[f]\big|\leq\sum_{k=1}^{i-1}\big|\mathbb{E}_{d^{\hat{\pi}^{\,k}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,k-1}}}[f]\big|.\tag{21}$$
For each $k\in\{1,\dots,i-1\}$, apply Lemma [B.8](https://arxiv.org/html/2605.15207#A2.Thmtheorem8) in its Pinsker+Jensen form (to the adjacent pair $(\hat{\pi}^{\,k-1},\hat{\pi}^{\,k})$ with reference $\hat{\pi}^{\,k-1}$), yielding
$$\big|\mathbb{E}_{d^{\hat{\pi}^{\,k}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,k-1}}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\sqrt{\frac{1}{2}D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,k-1}}\!\big(\hat{\pi}^{\,k-1}\|\hat{\pi}^{\,k}\big)}\;\|f\|_{\infty}.$$
Using the assumption $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,k-1}}(\hat{\pi}^{\,k-1}\|\hat{\pi}^{\,k})\leq\delta_{k}$ gives
$$\big|\mathbb{E}_{d^{\hat{\pi}^{\,k}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,k-1}}}[f]\big|\leq\frac{2\gamma}{1-\gamma}\sqrt{\frac{\delta_{k}}{2}}\;\|f\|_{\infty}=\frac{\sqrt{2}\gamma}{1-\gamma}\,\|f\|_{\infty}\,\sqrt{\delta_{k}}.$$
Substituting this bound into ([21](https://arxiv.org/html/2605.15207#A2.E21)) and summing over $k=1,\dots,i-1$ yields
$$\big|\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,0}}}[f]\big|\leq\frac{\sqrt{2}\gamma}{1-\gamma}\,\|f\|_{\infty}\sum_{k=1}^{i-1}\sqrt{\delta_{k}},$$
as claimed. (The inequality remains valid in the extended reals if some $\delta_{k}=+\infty$.) ∎
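The cumulative bound of Lemma B.9 is a sum of single-step penalties, which the following sketch evaluates directly. The function names and the sample radii `deltas` are illustrative assumptions, not values from the paper.

```python
import math


def single_step_bound(delta, gamma, f_max):
    """Per-step penalty (2*gamma/(1-gamma)) * sqrt(delta/2) * f_max."""
    return (2.0 * gamma / (1.0 - gamma)) * math.sqrt(delta / 2.0) * f_max


def cumulative_shift_bound(deltas, gamma, f_max):
    """Lemma B.9: (sqrt(2)*gamma/(1-gamma)) * f_max * sum_k sqrt(delta_k)."""
    return (
        math.sqrt(2.0) * gamma / (1.0 - gamma) * f_max
        * sum(math.sqrt(d) for d in deltas)
    )


# Hypothetical trust-region radii for three earlier updates within a stage.
deltas = [0.01, 0.01, 0.04]
total = cumulative_shift_bound(deltas, gamma=0.5, f_max=1.0)
stepwise = sum(single_step_bound(d, gamma=0.5, f_max=1.0) for d in deltas)
print(total, stepwise)
```

Both numbers agree, reflecting the telescoping argument: the bound grows with the sum of $\sqrt{\delta_k}$, so many small updates can accumulate as much shift as one large update.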
### B\.4Stale\-occupancy evaluation: surrogate mismatch and compounding term
###### Proof of Proposition[3\.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3)\.
The general within-stage occupancy shift bound in Eq. ([8](https://arxiv.org/html/2605.15207#S3.E8)) in the main text is exactly Lemma [B.9](https://arxiv.org/html/2605.15207#A2.Thmtheorem9) (with the index shift $k=1,\ldots,i-1$ corresponding to $k<i$).
For the surrogate mismatch bound in Eq. ([10](https://arxiv.org/html/2605.15207#S3.E10)), define
$$f_{i}(s)\coloneqq\mathbb{E}_{\mathbf{a}\sim\hat{\pi}^{\,i}(\cdot\mid s)}\big[\widehat{A}_{i-1}(s,\mathbf{a})\big].$$
If $|\widehat{A}_{i-1}(s,\mathbf{a})|\leq A_{\max}$ almost surely, then $\|f_{i}\|_{\infty}\leq A_{\max}$. By the definitions of $L_{i}^{\mathrm{seq}}$ (Eq. ([5](https://arxiv.org/html/2605.15207#S3.E5)) in the main text) and $L_{i}^{\mathrm{stale}}$ (Eq. ([9](https://arxiv.org/html/2605.15207#S3.E9))),
$$L_{i}^{\mathrm{seq}}-L_{i}^{\mathrm{stale}}=\frac{1}{1-\gamma}\Big(\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}}}[f_{i}(s)]-\mathbb{E}_{s\sim d^{\hat{\pi}^{0}}}[f_{i}(s)]\Big).$$
Apply Lemma [B.9](https://arxiv.org/html/2605.15207#A2.Thmtheorem9) to $f_{i}$ and multiply by $\frac{1}{1-\gamma}$ to obtain
$$\big|L_{i}^{\mathrm{seq}}-L_{i}^{\mathrm{stale}}\big|\leq\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}\sum_{k<i}\sqrt{\delta_{k}},$$
which is Eq. ([10](https://arxiv.org/html/2605.15207#S3.E10)). Summing over $i$ yields the stated $\sum_{i=1}^{n}\sum_{k<i}\sqrt{\delta_{k}}$ compounding term, and for $\delta_{k}\equiv\bar{\delta}$ the scaling is $O(n^{2}\sqrt{\bar{\delta}})$. ∎
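The quadratic growth of the compounding term can be seen by evaluating the double sum directly. The sketch below is illustrative: the function names and the uniform radius are hypothetical, and the linear comparison assumes that intermediate-occupancy evaluation pays a single $\sqrt{\delta_i}$ term per update, as the per-update bound in Appendix C.1 suggests.

```python
import math


def stale_compounding_sum(deltas):
    """sum_{i=1}^{n} sum_{k<i} sqrt(delta_k): the stale-occupancy compounding term."""
    total = 0.0
    prefix = 0.0  # running value of sum_{k<i} sqrt(delta_k)
    for delta in deltas:
        total += prefix
        prefix += math.sqrt(delta)
    return total


def per_update_sum(deltas):
    """sum_{i=1}^{n} sqrt(delta_i): one penalty term per update (assumed linear case)."""
    return sum(math.sqrt(delta) for delta in deltas)


delta_bar = 0.01  # hypothetical uniform trust-region radius
for n in (2, 4, 8, 16):
    deltas = [delta_bar] * n
    # Stale term equals n(n-1)/2 * sqrt(delta_bar); per-update term equals n * sqrt(delta_bar).
    print(n, stale_compounding_sum(deltas), per_update_sum(deltas))
```

With a uniform radius the stale term equals $\tfrac{n(n-1)}{2}\sqrt{\bar{\delta}}$, so doubling the number of agents roughly quadruples it, while the per-update sum only doubles.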
## Appendix CCertificates: Full Proofs and Order Dependence
### C\.1Single\-Step Improvement Lower Bound \(Full Proof\)
###### Proof of Theorem[3\.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4)\.
The performance difference lemma gives
$$J(\hat{\pi}^{\,i})-J(\hat{\pi}^{\,i-1})=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})\right].$$
Add and subtract the same quantity under $d^{\hat{\pi}^{\,i-1}}$:
$$\begin{aligned}J(\hat{\pi}^{\,i})-J(\hat{\pi}^{\,i-1})={}&\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})\right]\\&+\frac{1}{1-\gamma}\left(\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[A^{\hat{\pi}^{\,i-1}}\right]-\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[A^{\hat{\pi}^{\,i-1}}\right]\right).\end{aligned}$$
Define $f(s)\coloneqq\mathbb{E}_{\mathbf{a}\sim\hat{\pi}^{\,i}(\cdot\mid s)}[A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})]$. Since $|A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})|\leq A_{\max}$, we have $\|f\|_{\infty}\leq A_{\max}$. By Lemma [B.8](https://arxiv.org/html/2605.15207#A2.Thmtheorem8) and Lemma [B.2](https://arxiv.org/html/2605.15207#A2.Thmtheorem2) (with $\pi=\hat{\pi}^{\,i-1}$ and $\pi^{\prime}=\hat{\pi}^{\,i}$),
$$\left|\mathbb{E}_{d^{\hat{\pi}^{\,i}}}[f]-\mathbb{E}_{d^{\hat{\pi}^{\,i-1}}}[f]\right|\leq\frac{2\gamma}{1-\gamma}\sqrt{\frac{1}{2}D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}(\hat{\pi}^{\,i-1}\|\hat{\pi}^{\,i})}\cdot A_{\max}.$$
Therefore,
$$J(\hat{\pi}^{\,i})-J(\hat{\pi}^{\,i-1})\geq\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})\right]-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}\sqrt{D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}(\hat{\pi}^{\,i-1}\|\hat{\pi}^{\,i})}.$$
For the main term, add and subtract the estimator $\widehat{A}_{i-1}$:
$$\mathbb{E}[A^{\hat{\pi}^{\,i-1}}]=\mathbb{E}[\widehat{A}_{i-1}]+\mathbb{E}[A^{\hat{\pi}^{\,i-1}}-\widehat{A}_{i-1}].$$
By the definition of $\zeta_{i}$ (Notation; Eq. ([6](https://arxiv.org/html/2605.15207#S3.E6)) in the main text),
$$\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}[A^{\hat{\pi}^{\,i-1}}-\widehat{A}_{i-1}]\geq-\zeta_{i}.$$
Finally, enforce $D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}(\hat{\pi}^{\,i-1}\|\hat{\pi}^{\,i})\leq\delta_{i}$ via Lemma [B.6](https://arxiv.org/html/2605.15207#A2.Thmtheorem6) to obtain the claim. ∎
### C.2 Joint-Stage Improvement and Order Dependence
###### Proof of Lemma [3.5](https://arxiv.org/html/2605.15207#S3.Thmtheorem5).
At step $i$, $\hat{\pi}^{\,i}$ and $\hat{\pi}^{\,i-1}$ differ only in factor $\sigma(i)$. For fixed $s$, let $\psi^{(k)}(\cdot\mid s)$ denote the (common) factor of agent $k\neq\sigma(i)$ in both $\hat{\pi}^{\,i-1}$ and $\hat{\pi}^{\,i}$. Then

$$\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)=\pi^{\sigma(i)}_{\mathrm{cur}}(a_{\sigma(i)}\mid s)\prod_{k\neq\sigma(i)}\psi^{(k)}(a_{k}\mid s),\qquad\hat{\pi}^{\,i}(\mathbf{a}\mid s)=\pi^{\sigma(i)}_{\mathrm{tar}}(a_{\sigma(i)}\mid s)\prod_{k\neq\sigma(i)}\psi^{(k)}(a_{k}\mid s).$$

Taking KL and expanding the definition,

$$\begin{aligned}
D_{\mathrm{KL}}\!\big(\hat{\pi}^{\,i-1}(\cdot\mid s)\,\|\,\hat{\pi}^{\,i}(\cdot\mid s)\big)
&=\sum_{\mathbf{a}}\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)\log\frac{\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)}{\hat{\pi}^{\,i}(\mathbf{a}\mid s)}\\
&=\sum_{\mathbf{a}}\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)\log\frac{\pi^{\sigma(i)}_{\mathrm{cur}}(a_{\sigma(i)}\mid s)}{\pi^{\sigma(i)}_{\mathrm{tar}}(a_{\sigma(i)}\mid s)}\\
&=\sum_{a_{\sigma(i)}}\pi^{\sigma(i)}_{\mathrm{cur}}(a_{\sigma(i)}\mid s)\log\frac{\pi^{\sigma(i)}_{\mathrm{cur}}(a_{\sigma(i)}\mid s)}{\pi^{\sigma(i)}_{\mathrm{tar}}(a_{\sigma(i)}\mid s)}\\
&=D_{\mathrm{KL}}\!\big(\pi^{\sigma(i)}_{\mathrm{cur}}(\cdot\mid s)\,\|\,\pi^{\sigma(i)}_{\mathrm{tar}}(\cdot\mid s)\big),
\end{aligned}$$

where the third line uses $\sum_{\mathbf{a}_{-\sigma(i)}}\prod_{k\neq\sigma(i)}\psi^{(k)}(a_{k}\mid s)=1$. Taking expectation over $s\sim d^{\hat{\pi}^{\,i-1}}$ and using Lemma [B.2](https://arxiv.org/html/2605.15207#A2.Thmtheorem2) yields

$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\big(\hat{\pi}^{\,i-1}\,\|\,\hat{\pi}^{\,i}\big)=D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\big(\pi^{\sigma(i)}_{\mathrm{cur}}\,\|\,\pi^{\sigma(i)}_{\mathrm{tar}}\big)\leq\delta_{i}.$$

∎
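The per-state identity in this proof can be checked numerically. The following is a minimal sketch on a toy three-agent team at one fixed state; the distributions and the helper names `kl`, `joint`, and `joint_kl` are illustrative assumptions, not part of the TeamTR code.

```python
import itertools
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint(factors):
    """Product distribution over joint actions, one factor per agent."""
    return {
        actions: math.prod(f[a] for f, a in zip(factors, actions))
        for actions in itertools.product(*(range(len(f)) for f in factors))
    }

def joint_kl(factors_p, factors_q):
    """KL between two product policies, computed over the full joint action space."""
    p, q = joint(factors_p), joint(factors_q)
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

# Toy numbers (assumed): agents 0 and 2 are shared, only agent 1 is updated.
shared_0 = [0.7, 0.3]
shared_2 = [0.2, 0.5, 0.3]
cur_1 = [0.6, 0.4]
tar_1 = [0.5, 0.5]

team_kl = joint_kl([shared_0, cur_1, shared_2], [shared_0, tar_1, shared_2])
factor_kl = kl(cur_1, tar_1)
print(team_kl, factor_kl)  # the two values agree up to floating-point error
```

The shared factors cancel inside the log-ratio and sum to one, so only the updated agent contributes to the joint divergence.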
###### Proof of Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6).
Apply Theorem [3.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4) to each step $i$ and sum over $i=1,\ldots,n$. The left-hand side telescopes:

$$\sum_{i=1}^{n}\left(J(\hat{\pi}^{\,i})-J(\hat{\pi}^{\,i-1})\right)=J(\hat{\pi}^{\,n})-J(\hat{\pi}^{\,0})=J(\bar{\pi})-J(\pi_{\mathrm{cur}}).$$

The inequality holds for any update order $\sigma$ because it is applied to the realized sequence of intermediate policies. However, the intermediate occupancies $d^{\hat{\pi}^{\,i-1}}$, and hence $L_{i}^{\mathrm{seq}}$ and $\zeta_{i}$, may depend on $\sigma$. ∎
### C.3 Stage-wise bound under stale-occupancy surrogates
###### Proof of Theorem [3.7](https://arxiv.org/html/2605.15207#S3.Thmtheorem7).
Start from the intermediate-occupancy stage certificate (Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6)):

$$J(\bar{\pi})-J(\pi_{\mathrm{cur}})\geq\sum_{i=1}^{n}L_{i}^{\mathrm{seq}}-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}\sum_{i=1}^{n}\sqrt{\delta_{i}}-\frac{1}{1-\gamma}\sum_{i=1}^{n}\zeta_{i}.$$

By Proposition [3.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3) (Eq. ([10](https://arxiv.org/html/2605.15207#S3.E10))),

$$L_{i}^{\mathrm{seq}}\geq L_{i}^{\mathrm{stale}}-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}\sum_{k<i}\sqrt{\delta_{k}}.$$

Summing over $i=1,\ldots,n$ gives

$$\sum_{i=1}^{n}L_{i}^{\mathrm{seq}}\geq\sum_{i=1}^{n}L_{i}^{\mathrm{stale}}-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}\sum_{i=1}^{n}\sum_{k<i}\sqrt{\delta_{k}}.$$

Substitute this into Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) to obtain Eq. ([14](https://arxiv.org/html/2605.15207#S3.E14)). ∎
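To see how the extra double sum scales, the sketch below evaluates both trust-region penalties for equal radii. It is an illustration only; the function names and the numeric values of `gamma`, `a_max`, and the radii are assumptions chosen for the example.

```python
import math

def penalty_constant(gamma, a_max):
    """c = sqrt(2) * gamma / (1 - gamma)^2 * A_max, the constant in the certificates."""
    return math.sqrt(2.0) * gamma / (1.0 - gamma) ** 2 * a_max

def intermediate_penalty(deltas, gamma, a_max):
    """Penalty of the intermediate-occupancy certificate: c * sum_i sqrt(delta_i)."""
    return penalty_constant(gamma, a_max) * sum(math.sqrt(d) for d in deltas)

def stale_extra_penalty(deltas, gamma, a_max):
    """Extra penalty from stale surrogates: c * sum_i sum_{k<i} sqrt(delta_k)."""
    roots = [math.sqrt(d) for d in deltas]
    c = penalty_constant(gamma, a_max)
    return c * sum(sum(roots[:i]) for i in range(len(roots)))

# Equal radii (assumed values): the extra term grows like n(n-1)/2, the base term like n.
for n in (2, 4, 8):
    deltas = [0.01] * n
    base = intermediate_penalty(deltas, gamma=0.9, a_max=1.0)
    extra = stale_extra_penalty(deltas, gamma=0.9, a_max=1.0)
    print(n, base, extra, extra / base)
```

With equal radii the ratio of the extra term to the base term is $(n-1)/2$, which is the linear-versus-quadratic gap between the two certificates.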
## Appendix D Plug-and-Play Upgrades and Stage-0 Alignment
### D.1 Stage-0 alignment as reverse-KL projection / distillation
This section specifies a practical Stage\-0 alignment objective that is*compatible with the reverse\-KL trust region*\. We assume that the upgraded agent shares the same tokenization/vocabulary as the replaced agent, so that KL is well defined\. \(If tokenizers differ, a shared action space must be introduced; we leave this to future work\.\)
##### Goal\.
Suppose agent $j$ is replaced: we have the old policy $\pi^{(j)}_{\mathrm{old}}$ and a new parameterization $\pi^{(j)}_{\theta}$. We aim to select $\theta$ such that the new agent lies within a reverse-KL trust region around the old agent in representative contexts.
##### Probe distribution\.
Let $\nu$ be a distribution over states/contexts where the replaced agent is expected to act (e.g., contexts collected by running the *pre-swap* team on a probe prompt set and extracting the states where $j$ is active). Define the Stage-0 target constraint

$$\mathbb{E}_{s\sim\nu}\,D_{\mathrm{KL}}\!\big(\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)\,\|\,\pi^{(j)}_{\theta}(\cdot\mid s)\big)\leq\delta_{\mathrm{align}}.$$
##### Reverse\-KL projection objective\.
A natural alignment objective is to minimize the left-hand side:

$$\min_{\theta}\;\mathbb{E}_{s\sim\nu}\,D_{\mathrm{KL}}\!\big(\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)\,\|\,\pi^{(j)}_{\theta}(\cdot\mid s)\big).\tag{22}$$

This is equivalent (up to an additive constant independent of $\theta$) to minimizing the cross-entropy under the teacher distribution:

$$\min_{\theta}\;\mathbb{E}_{s\sim\nu}\,\mathbb{E}_{a\sim\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)}\big[-\log\pi^{(j)}_{\theta}(a\mid s)\big].$$
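The additive constant can be made explicit by expanding the KL for a fixed context $s$. This is a short worked step using only the definitions above; the entropy symbol $H$ is introduced here for illustration and does not appear in the original text.

$$D_{\mathrm{KL}}\!\big(\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)\,\|\,\pi^{(j)}_{\theta}(\cdot\mid s)\big)=\mathbb{E}_{a\sim\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)}\big[-\log\pi^{(j)}_{\theta}(a\mid s)\big]-H\big(\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)\big),\qquad H(p)\coloneqq-\sum_{a}p(a)\log p(a).$$

Since $H\big(\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)\big)$ does not depend on $\theta$, averaging over $s\sim\nu$ shows that the two objectives share the same minimizers.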
##### Autoregressive implementation\.
By the reverse chain rule (Lemma [B.3](https://arxiv.org/html/2605.15207#A2.Thmtheorem3)), the message-level reverse KL decomposes into a token-level sum with expectation under the old policy. Thus, if we collect teacher-forced rollouts/messages from $\pi^{(j)}_{\mathrm{old}}$ on contexts $s\sim\nu$, we can optimize Eq. ([22](https://arxiv.org/html/2605.15207#A4.E22)) via standard distillation:

$$\min_{\theta}\;\mathbb{E}_{s\sim\nu,\;m\sim\pi^{(j)}_{\mathrm{old}}(\cdot\mid s)}\left[\sum_{u=1}^{T(m)}D_{\mathrm{KL}}\!\big(\pi^{(j)}_{\mathrm{old}}(\cdot\mid s,x_{<u})\,\|\,\pi^{(j)}_{\theta}(\cdot\mid s,x_{<u})\big)\right].$$
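The decomposition can be sanity-checked on a tiny autoregressive model. The sketch below uses two-token messages over a binary vocabulary with made-up probabilities; it is not the paper's training code, only a check that the message-level KL equals the token-level sum when prefixes are drawn from the teacher.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as equal-length lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy next-token distributions (assumed values). The second-token
# distribution is conditioned on the first token.
teacher_first = [0.7, 0.3]
teacher_second = {0: [0.9, 0.1], 1: [0.4, 0.6]}
student_first = [0.6, 0.4]
student_second = {0: [0.8, 0.2], 1: [0.5, 0.5]}

# Message-level KL, computed directly from joint message probabilities.
message_kl = 0.0
for x1 in (0, 1):
    for x2 in (0, 1):
        p = teacher_first[x1] * teacher_second[x1][x2]
        q = student_first[x1] * student_second[x1][x2]
        message_kl += p * math.log(p / q)

# Token-level sum: per-position KL, with the prefix x1 weighted by the teacher.
token_sum_kl = kl(teacher_first, student_first) + sum(
    teacher_first[x1] * kl(teacher_second[x1], student_second[x1])
    for x1 in (0, 1)
)
print(message_kl, token_sum_kl)  # equal up to floating-point error
```

In practice the expectation over prefixes is replaced by sampled teacher messages, so the per-position KL terms are summed along each sampled message and averaged over the batch.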
##### Stopping rule\.
During alignment, we track the empirical token-sum reverse-KL monitor in Eq. ([18](https://arxiv.org/html/2605.15207#S4.E18)) on the probe set and stop early once it falls below $\delta_{\mathrm{align}}$. After this, TeamTR resumes with the standard intermediate-occupancy updates.
##### Scope note\.
Stage-0 alignment enforces the trust region on the probe distribution $\nu$; it does not, by itself, guarantee a trust region on future intermediate occupancies. Empirically, we find that aligning on representative contexts substantially reduces the swap shock and improves resumability.
### D.2 Certified Resumability and Certificate Tightening
###### Proof of Proposition [3.8](https://arxiv.org/html/2605.15207#S3.Thmtheorem8).
The proofs of Theorem [3.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4) and Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) depend only on: the surrogate quantities $L_{i}^{\mathrm{seq}}$, the trust-region radii $\delta_{i}$ through the token-KL term, and the surrogate-estimation error bounds $\zeta_{i}$. They do not depend on how $\pi^{\sigma(i)}_{\mathrm{tar}}$ is parameterized. Therefore, after replacing an agent and re-establishing the trust-region bookkeeping for subsequent updates, the same lower-bound form continues to apply. ∎
###### Proof of Proposition [3.9](https://arxiv.org/html/2605.15207#S3.Thmtheorem9).
The stage certificate in Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) takes the form

$$\mathrm{LB}\big(\{L_{i}^{\mathrm{seq}}\},\{\delta_{i}\},\{\zeta_{i}\}\big)=\sum_{i=1}^{n}L_{i}^{\mathrm{seq}}-c\sum_{i=1}^{n}\sqrt{\delta_{i}}-\frac{1}{1-\gamma}\sum_{i=1}^{n}\zeta_{i},\qquad c=\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}.$$

If, after an upgrade, some step attains a higher surrogate value with the same $\delta_{i}$ (and all other terms unchanged), the right-hand side increases. Likewise, achieving the same surrogate value with a smaller $\delta_{i}$ decreases the penalty term $c\sqrt{\delta_{i}}$ and hence weakly increases the bound. ∎
## Appendix E Group-Based Advantages and Finite-Sample Concentration
### E.1 Group-Standardized and Clipped Message-Level Advantages
This section specifies the message\- and sequence\-level advantage estimator used in our algorithm\. It is inspired by group\-based baselines in LLM RL methods, but here it is used as a plug\-in estimator inside a sequential trust\-region framework\.
#### E.1.1 Definition: Group standardization and hard clipping
Fix a prompt (or initial state) $x$. At step $i$, we sample $G$ rollouts $\{\tau^{(g)}\}_{g=1}^{G}$ from the current intermediate policy $\hat{\pi}^{\,i-1}(\cdot\mid x)$. Let $\widehat{R}^{(g)}$ denote the scalar message-/sequence-level return for $\tau^{(g)}$ (e.g., terminal verifier reward). Define the group mean and standard deviation

$$\widehat{\mu}\coloneqq\frac{1}{G}\sum_{g=1}^{G}\widehat{R}^{(g)},\qquad\widehat{\sigma}\coloneqq\sqrt{\frac{1}{G}\sum_{g=1}^{G}\left(\widehat{R}^{(g)}-\widehat{\mu}\right)^{2}+\varepsilon_{\mathrm{std}}}.$$

The standardized, hard-clipped group advantage is

$$\widehat{A}_{\mathrm{grp}}^{\,i-1}(\tau^{(g)})\coloneqq\mathrm{clip}\!\left(\frac{\widehat{R}^{(g)}-\widehat{\mu}}{\widehat{\sigma}},\,-A_{\mathrm{clip}},\,A_{\mathrm{clip}}\right).$$

When used as a token-level weight, $\widehat{A}_{\mathrm{grp}}^{\,i-1}(\tau^{(g)})$ is broadcast to all tokens controlled by the updated factor.
##### Boundedness\.
By construction, $|\widehat{A}_{\mathrm{grp}}^{\,i-1}(\tau^{(g)})|\leq A_{\mathrm{clip}}$ almost surely.
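A minimal sketch of this estimator for a single prompt group is given below. The function name `group_advantages` and the default values of `eps_std` and `a_clip` are assumptions for illustration; the source does not state the values used in the experiments.

```python
import math

def group_advantages(returns, eps_std=1e-6, a_clip=5.0):
    """Group-standardized, hard-clipped advantages for one prompt group.

    returns : scalar returns R^(g) of the G rollouts in the group
    eps_std : stabilizer added under the square root (assumed default)
    a_clip  : hard clipping threshold A_clip (assumed default)
    """
    g = len(returns)
    mean = sum(returns) / g
    var = sum((r - mean) ** 2 for r in returns) / g  # 1/G normalization, as in the definition
    std = math.sqrt(var + eps_std)
    return [max(-a_clip, min(a_clip, (r - mean) / std)) for r in returns]

# One successful rollout among four, with a tight clip so the clipping is visible.
advantages = group_advantages([1.0, 0.0, 0.0, 0.0], a_clip=1.0)
print(advantages)

# A group with identical returns carries no learning signal.
flat = group_advantages([0.5, 0.5, 0.5])
print(flat)
```

Each value would then be broadcast to every token produced by the agent being updated, as described above.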
### E.2 Bias components aligned with $\zeta_{i}$
The main-text term $\zeta_{i}$ is intended to capture (or upper bound) the aggregate surrogate-estimation error. This section records simple, explicit bias components that can be upper-bounded and monitored.
###### Lemma E.1 (Self-included group mean yields shrinkage for i.i.d. rollouts).
Fix a prompt $x$ and let $\tau^{(1)},\ldots,\tau^{(G)}$ be i.i.d. from $\hat{\pi}^{\,i-1}(\cdot\mid x)$. Let $R^{(g)}\coloneqq R(\tau^{(g)})$ be a measurable return with $\mathbb{E}[|R^{(g)}|]<\infty$, and define $\widehat{\mu}=\frac{1}{G}\sum_{g=1}^{G}R^{(g)}$. Let $\mu(x)\coloneqq\mathbb{E}[R(\tau)\mid x]$. Then for any fixed index $g$,

$$\mathbb{E}\!\left[R^{(g)}-\widehat{\mu}\,\big|\,x,\tau^{(g)}\right]=\left(1-\frac{1}{G}\right)\left(R^{(g)}-\mu(x)\right).$$

###### Proof.
Write $\widehat{\mu}=\frac{1}{G}R^{(g)}+\frac{1}{G}\sum_{h\neq g}R^{(h)}$. Conditioned on $(x,\tau^{(g)})$, the random variables $\{R^{(h)}\}_{h\neq g}$ remain i.i.d. with $\mathbb{E}[R^{(h)}\mid x,\tau^{(g)}]=\mathbb{E}[R^{(h)}\mid x]=\mu(x)$ by independence and identical distribution. Hence

$$\mathbb{E}[\widehat{\mu}\mid x,\tau^{(g)}]=\frac{1}{G}R^{(g)}+\frac{G-1}{G}\mu(x),$$

and therefore

$$\mathbb{E}[R^{(g)}-\widehat{\mu}\mid x,\tau^{(g)}]=R^{(g)}-\left(\frac{1}{G}R^{(g)}+\frac{G-1}{G}\mu(x)\right)=\left(1-\frac{1}{G}\right)\left(R^{(g)}-\mu(x)\right).$$

∎
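The shrinkage factor can be verified exactly for a small discrete return distribution by enumerating the other $G-1$ rollouts. The sketch below assumes a binary reward with success probability 0.25 and $G=4$; these numbers and the helper name are illustrative only.

```python
import itertools

def conditional_centered_mean(r_g, values, probs, group_size):
    """Exact E[R_g - mu_hat | R_g = r_g] when the other G-1 returns
    are i.i.d. draws from the discrete distribution (values, probs)."""
    total = 0.0
    for combo in itertools.product(range(len(values)), repeat=group_size - 1):
        weight = 1.0
        others = 0.0
        for idx in combo:
            weight *= probs[idx]
            others += values[idx]
        mu_hat = (r_g + others) / group_size  # self-included group mean
        total += weight * (r_g - mu_hat)
    return total

values, probs, G = [0.0, 1.0], [0.75, 0.25], 4
mu = sum(v * p for v, p in zip(values, probs))

lhs = conditional_centered_mean(1.0, values, probs, G)
rhs = (1 - 1 / G) * (1.0 - mu)
print(lhs, rhs)  # both sides of Lemma E.1 for a successful rollout
```

The self-included baseline shrinks the centered return by the factor $1-1/G$, which is the source of the $O(1/G)$ bias discussed next.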
###### Corollary E.2 (A simple $O(1/G)$ bias proxy for group-mean baselines).
Under Lemma [E.1](https://arxiv.org/html/2605.15207#A5.Thmtheorem1), define the centered return advantage $A_{\mathrm{ret}}(\tau;x)\coloneqq R(\tau)-\mathbb{E}[R(\tau)\mid x]$. Then for any fixed $g$,

$$\left|\mathbb{E}\!\left[R^{(g)}-\widehat{\mu}\,\big|\,x,\tau^{(g)}\right]-A_{\mathrm{ret}}(\tau^{(g)};x)\right|=\frac{1}{G}\,|A_{\mathrm{ret}}(\tau^{(g)};x)|.$$

If $|A_{\mathrm{ret}}(\tau;x)|\leq A_{\mathrm{ret},\max}$ almost surely, then the baseline-induced bias is at most $A_{\mathrm{ret},\max}/G$.
###### Proof.
Lemma [E.1](https://arxiv.org/html/2605.15207#A5.Thmtheorem1) implies $\mathbb{E}[R^{(g)}-\widehat{\mu}\mid x,\tau^{(g)}]=(1-\frac{1}{G})A_{\mathrm{ret}}(\tau^{(g)};x)$. Subtracting $A_{\mathrm{ret}}(\tau^{(g)};x)$ and taking absolute values yields the claim. ∎
###### Lemma E.3 (Clipping bias is controlled by the overflow probability).
Let $Z$ be any random variable and define $\tilde{Z}=\mathrm{clip}(Z,-c,c)$. Then

$$|\mathbb{E}[\tilde{Z}]-\mathbb{E}[Z]|\leq\mathbb{E}\!\left[|Z-\tilde{Z}|\right]\leq\mathbb{E}\!\left[|Z|\,\mathbf{1}\{|Z|>c\}\right].$$

###### Proof.
The first inequality follows from Jensen: $|\mathbb{E}[\tilde{Z}-Z]|\leq\mathbb{E}|\tilde{Z}-Z|$. The second inequality uses $\tilde{Z}=Z$ when $|Z|\leq c$ and $|Z-\tilde{Z}|\leq|Z|$ otherwise. ∎
###### Corollary E.4 (Clip bias via overflow probability / second moment).
Under Lemma [E.3](https://arxiv.org/html/2605.15207#A5.Thmtheorem3), for any $c>0$,

$$\mathbb{E}\!\left[|Z|\,\mathbf{1}\{|Z|>c\}\right]\leq\|Z\|_{\infty}\,\Pr(|Z|>c),\qquad\mathbb{E}\!\left[|Z|\,\mathbf{1}\{|Z|>c\}\right]\leq\sqrt{\mathbb{E}[Z^{2}]\,\Pr(|Z|>c)}.$$

###### Proof.
The first bound uses $|Z|\leq\|Z\|_{\infty}$ on $\{|Z|>c\}$. The second bound is Cauchy–Schwarz. ∎
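The bounds of Lemma E.3 and Corollary E.4 can be compared on a small discrete example. The following sketch computes the exact clipping bias and the three upper bounds; the distribution, the threshold, and the function name `clip_bias_report` are assumptions for illustration.

```python
import math

def clip(z, c):
    """Hard clip of z to the interval [-c, c]."""
    return max(-c, min(c, z))

def clip_bias_report(values, probs, c):
    """Exact clipping bias and its upper bounds for a discrete random variable Z."""
    mean = sum(p * z for z, p in zip(values, probs))
    clipped_mean = sum(p * clip(z, c) for z, p in zip(values, probs))
    overflow = sum(p for z, p in zip(values, probs) if abs(z) > c)
    second_moment = sum(p * z * z for z, p in zip(values, probs))
    return {
        "bias": abs(clipped_mean - mean),
        # E[|Z| 1{|Z| > c}], the tail term of Lemma E.3
        "tail": sum(p * abs(z) for z, p in zip(values, probs) if abs(z) > c),
        # ||Z||_inf * Pr(|Z| > c)
        "sup_bound": max(abs(z) for z in values) * overflow,
        # sqrt(E[Z^2] * Pr(|Z| > c)), via Cauchy-Schwarz
        "cs_bound": math.sqrt(second_moment * overflow),
    }

report = clip_bias_report(values=[-1.0, 0.5, 4.0], probs=[0.3, 0.6, 0.1], c=2.0)
print(report)
```

Only the overflow probability needs to be monitored in practice; the second-moment form is useful when a uniform bound on $Z$ is unavailable or very loose.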
##### How this enters the certificate\.
The certificates require a quantity $\zeta_{i}$ that upper bounds the occupancy-weighted mismatch between the estimator and the true advantage. A conservative sufficient choice is $\zeta_{i}=\sup_{s,\mathbf{a}}\left|\mathbb{E}[\widehat{A}_{i-1}(s,\mathbf{a})]-A^{\hat{\pi}^{\,i-1}}(s,\mathbf{a})\right|$. Corollary [E.2](https://arxiv.org/html/2605.15207#A5.Thmtheorem2) provides an explicit $O(1/G)$ proxy for the bias introduced by the self-included group-mean baseline at the *return level*. Additional effects (standardization by $\widehat{\sigma}$, token-broadcasting, PPO approximation/ratio clipping, and the mismatch between sequence-level return advantages and the true state-action advantage in a sequential decision process) are conservatively absorbed into $\zeta_{i}$. Hard clipping is separated via Lemma [E.3](https://arxiv.org/html/2605.15207#A5.Thmtheorem3).
### E.3 Concentration for Group-Based Surrogates (Empirical Certificates)
This section provides a finite-sample, high-probability correction that converts the population lower bound into an empirical one usable with minibatch surrogates. It also makes explicit the additional bias introduced by *ratio clipping* (PPO-style), and shows how the reverse-KL trust region controls that bias. Alternatively, one may absorb the ratio-clipping bias into $\zeta_{i}$ by letting $\zeta_{i}$ upper bound the total surrogate mismatch.
##### Setup\.
At stepii, letL^iseq\\widehat\{L\}\_\{i\}^\{\\mathrm\{seq\}\}denote the empirical estimate of the step\-iisurrogate contribution formed by averaging overNiN\_\{i\}independent prompt\-groups, each containingGGrollouts\. In practice we use ratio clipping withwiclip=clip\(wi,1−ϵw,1\+ϵw\)w\_\{i\}^\{\\mathrm\{clip\}\}=\\mathrm\{clip\}\(w\_\{i\},1\-\\epsilon\_\{w\},1\+\\epsilon\_\{w\}\)\.
##### Boundedness\.
Hard clipping ensures\|A^grpi−1\|≤Aclip\|\\widehat\{A\}\_\{\\textsc\{grp\}\}^\{i\-1\}\|\\leq A\_\{\\mathrm\{clip\}\}\. With ratio clipping,wiclip∈\[1−ϵw,1\+ϵw\]w\_\{i\}^\{\\mathrm\{clip\}\}\\in\[1\-\\epsilon\_\{w\},1\+\\epsilon\_\{w\}\]\. Therefore, the per\-sample contribution is bounded by\(1\+ϵw\)Aclip/\(1−γ\)\(1\+\\epsilon\_\{w\}\)A\_\{\\mathrm\{clip\}\}/\(1\-\\gamma\)\.
###### Lemma E.5 (Hoeffding bound for group-level averages).
Assume prompt-groups are i.i.d. across $j\in\{1,\ldots,N_{i}\}$. Let $Y_{j}$ be the group-level contribution and define $\widehat{L}_{i}^{\mathrm{seq}}\coloneqq\frac{1}{N_{i}}\sum_{j=1}^{N_{i}}Y_{j}$. Assume $Y_{j}\in[-B,B]$ almost surely. Then for any $\epsilon>0$,

$$\Pr\!\left(|\widehat{L}_{i}^{\mathrm{seq}}-\mathbb{E}[\widehat{L}_{i}^{\mathrm{seq}}]|>\epsilon\right)\leq2\exp\!\left(-\frac{N_{i}\epsilon^{2}}{2B^{2}}\right).$$

###### Proof.
This is Hoeffding's inequality for the average of i.i.d. bounded random variables with range length $2B$. ∎
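Inverting the tail bound gives the confidence radius used later in the empirical certificate. The sketch below solves $2\exp(-N_i\epsilon^2/(2B^2))=\delta$ for $\epsilon$; the function names and the numeric inputs are assumptions chosen for illustration.

```python
import math

def per_group_bound(eps_w, a_clip, gamma):
    """B = (1 + eps_w) * A_clip / (1 - gamma), the bound on each group-level term."""
    return (1.0 + eps_w) * a_clip / (1.0 - gamma)

def hoeffding_radius(bound, num_groups, fail_prob):
    """Smallest eps such that 2 * exp(-N * eps^2 / (2 * B^2)) <= fail_prob."""
    return bound * math.sqrt(2.0 * math.log(2.0 / fail_prob) / num_groups)

# Example values (assumed): clip range 0.2, advantage clip 5, discount 0.9.
B = per_group_bound(eps_w=0.2, a_clip=5.0, gamma=0.9)
for n_groups in (100, 400, 1600):
    print(n_groups, hoeffding_radius(B, n_groups, fail_prob=0.05))
```

The radius shrinks as $1/\sqrt{N_i}$, so quadrupling the number of prompt groups halves the correction, while a discount close to one inflates it through $B$.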
##### Ratio clipping bias and reverse KL\.
Let $w=w_{i}(s,\mathbf{a})$ denote the per-decision ratio for the updated factor at step $i$: $w=\hat{\pi}^{\,i}(\mathbf{a}\mid s)/\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)$. Assume absolute continuity (true for softmax LLMs): if $\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)>0$ then $\hat{\pi}^{\,i}(\mathbf{a}\mid s)>0$.
###### Lemma E.6 (Ratio deviation equals total variation).
Fix a state $s$ and let $P(\mathbf{a})=\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)$ and $Q(\mathbf{a})=\hat{\pi}^{\,i}(\mathbf{a}\mid s)$. Assume $Q\ll P$ and define $w(\mathbf{a})=\frac{\mathrm{d}Q}{\mathrm{d}P}(\mathbf{a})$. Then

$$\mathbb{E}_{\mathbf{a}\sim P}\!\left[|1-w(\mathbf{a})|\right]=2\,D_{\mathrm{TV}}(P,Q).$$

###### Proof.
By definition of total variation, $2D_{\mathrm{TV}}(P,Q)=\int|\mathrm{d}P-\mathrm{d}Q|$. Since $Q\ll P$, $\mathrm{d}Q=w\,\mathrm{d}P$, and thus

$$\int|\mathrm{d}P-\mathrm{d}Q|=\int|1-w|\,\mathrm{d}P=\mathbb{E}_{\mathbf{a}\sim P}[|1-w(\mathbf{a})|].$$

∎
###### Lemma E.7 (Ratio clipping bias controlled by reverse KL).
Let $w^{\mathrm{clip}}=\mathrm{clip}(w,1-\epsilon_{w},1+\epsilon_{w})$. For any random variable $A$ with $|A|\leq A_{\mathrm{clip}}$ almost surely,

$$\left|\mathbb{E}_{\mathbf{a}\sim P}[wA]-\mathbb{E}_{\mathbf{a}\sim P}[w^{\mathrm{clip}}A]\right|\leq A_{\mathrm{clip}}\,\mathbb{E}_{\mathbf{a}\sim P}\!\left[|w-w^{\mathrm{clip}}|\right]\leq A_{\mathrm{clip}}\,\mathbb{E}_{\mathbf{a}\sim P}\!\left[|1-w|\right]\leq A_{\mathrm{clip}}\sqrt{2\,D_{\mathrm{KL}}(P\,\|\,Q)}.$$

###### Proof.
The first inequality is by $|A|\leq A_{\mathrm{clip}}$. For the second, note that clipping moves $w$ toward $1$, hence $|w-w^{\mathrm{clip}}|\leq|w-1|$ pointwise. The third inequality combines Lemma [E.6](https://arxiv.org/html/2605.15207#A5.Thmtheorem6) with Pinsker's inequality, $D_{\mathrm{TV}}(P,Q)\leq\sqrt{D_{\mathrm{KL}}(P\,\|\,Q)/2}$: $\mathbb{E}_{P}|1-w|=2D_{\mathrm{TV}}(P,Q)\leq\sqrt{2D_{\mathrm{KL}}(P\,\|\,Q)}$. ∎
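The chain of inequalities can be traced numerically for a single state. The sketch below uses made-up action distributions and advantage weights; the function name `ratio_clip_check` and all numbers are illustrative assumptions, and it requires every entry of `p` to be positive.

```python
import math

def ratio_clip_check(p, q, advantages, eps_w):
    """Evaluate each term in Lemmas E.6 and E.7 for discrete P (old) and Q (new)."""
    a_clip = max(abs(a) for a in advantages)
    w = [qi / pi for pi, qi in zip(p, q)]  # importance ratio dQ/dP
    w_clip = [max(1 - eps_w, min(1 + eps_w, wi)) for wi in w]
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return {
        # |E_P[w A] - E_P[w_clip A]|, the ratio-clipping bias
        "bias": abs(sum(pi * (wi - wc) * a
                        for pi, wi, wc, a in zip(p, w, w_clip, advantages))),
        # E_P|1 - w| and 2 * TV(P, Q), equal by Lemma E.6
        "ratio_dev": sum(pi * abs(1 - wi) for pi, wi in zip(p, w)),
        "two_tv": sum(abs(pi - qi) for pi, qi in zip(p, q)),
        # A_clip * sqrt(2 KL(P || Q)), the final bound of Lemma E.7
        "kl_bound": a_clip * math.sqrt(2 * kl_pq),
        "a_clip": a_clip,
    }

check = ratio_clip_check(
    p=[0.5, 0.3, 0.2], q=[0.3, 0.3, 0.4], advantages=[1.0, -0.5, 2.0], eps_w=0.2
)
print(check)
```

The Pinsker step is typically the loosest link, so the KL-based bound should be read as a worst-case envelope rather than an estimate of the realized clipping bias.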
###### Corollary E.8 (Clipping bias bound under the token-level trust region).
At step $i$, suppose the trust region holds:

$$D_{\mathrm{KL}}{}_{\mathrm{tok}}^{\hat{\pi}^{\,i-1}}\!\big(\hat{\pi}^{\,i-1}\,\|\,\hat{\pi}^{\,i}\big)\leq\delta_{i}.$$

Let $A(s,\mathbf{a})$ be any scalar weight with $|A|\leq A_{\mathrm{clip}}$ almost surely under $s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i-1}(\cdot\mid s)$. Then the ratio-clipping bias in the importance-weighted surrogate is bounded as

$$\left|\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i-1}}[w_{i}(s,\mathbf{a})A(s,\mathbf{a})]-\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i-1}}[w_{i}^{\mathrm{clip}}(s,\mathbf{a})A(s,\mathbf{a})]\right|\leq A_{\mathrm{clip}}\sqrt{2\delta_{i}}.$$

###### Proof.
Apply Lemma [E.7](https://arxiv.org/html/2605.15207#A5.Thmtheorem7) pointwise in $s$ with $P_{s}=\hat{\pi}^{\,i-1}(\cdot\mid s)$ and $Q_{s}=\hat{\pi}^{\,i}(\cdot\mid s)$, then average over $s\sim d^{\hat{\pi}^{\,i-1}}$. Use Jensen exactly as in Lemma [B.8](https://arxiv.org/html/2605.15207#A2.Thmtheorem8) to move the square root outside the state expectation, and then apply the trust-region constraint:

$$\mathbb{E}_{s}\sqrt{D_{\mathrm{KL}}(P_{s}\,\|\,Q_{s})}\leq\sqrt{\mathbb{E}_{s}D_{\mathrm{KL}}(P_{s}\,\|\,Q_{s})}\leq\sqrt{\delta_{i}}.$$

∎
###### Corollary E.9 (Empirical-to-population surrogate correction with ratio clipping).
Fix $\delta\in(0,1)$. Assume $|\widehat{A}_{\mathrm{grp}}^{\,i-1}|\leq A_{\mathrm{clip}}$ and ratio clipping $w_{i}^{\mathrm{clip}}\in[1-\epsilon_{w},1+\epsilon_{w}]$. Let $B=\frac{(1+\epsilon_{w})A_{\mathrm{clip}}}{1-\gamma}$. Then with probability at least $1-\delta$,

$$L_{i}^{\mathrm{seq}}\ \geq\ \widehat{L}_{i}^{\mathrm{seq}}-B\sqrt{\frac{2\log(2/\delta)}{N_{i}}}-\frac{A_{\mathrm{clip}}}{1-\gamma}\sqrt{2\delta_{i}},$$

where the last term is the (deterministic) ratio-clipping bias bound under the trust-region radius $\delta_{i}$.
###### Proof.
By Lemma [E.5](https://arxiv.org/html/2605.15207#A5.Thmtheorem5), with probability at least $1-\delta$,

$$\mathbb{E}[\widehat{L}_{i}^{\mathrm{seq}}]\geq\widehat{L}_{i}^{\mathrm{seq}}-B\sqrt{\frac{2\log(2/\delta)}{N_{i}}}.$$

Next, $\mathbb{E}[\widehat{L}_{i}^{\mathrm{seq}}]$ corresponds to the clipped importance-weighted surrogate contribution (using $w_{i}^{\mathrm{clip}}$). The population surrogate $L_{i}^{\mathrm{seq}}$ (defined with the unclipped ratio $w_{i}$) differs from it by at most $\frac{A_{\mathrm{clip}}}{1-\gamma}\sqrt{2\delta_{i}}$ by Corollary [E.8](https://arxiv.org/html/2605.15207#A5.Thmtheorem8) (applied to $A=\widehat{A}_{\mathrm{grp}}^{\,i-1}$). Combining the two bounds yields the claim. ∎
###### Theorem E.10 (Stage-wise improvement lower bound, high-probability empirical form).
Fix $\delta_{\mathrm{conf}}\in(0,1)$. Under the trust regions in Eq. ([4](https://arxiv.org/html/2605.15207#S3.E4)) and the boundedness assumptions of Corollary [E.9](https://arxiv.org/html/2605.15207#A5.Thmtheorem9), with probability at least $1-\delta_{\mathrm{conf}}$,

$$\begin{aligned}
J(\bar{\pi})-J(\pi_{\mathrm{cur}})\ \geq{}&\sum_{i=1}^{n}\widehat{L}_{i}^{\mathrm{seq}}-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}\,A_{\max}\sum_{i=1}^{n}\sqrt{\delta_{i}}-\frac{1}{1-\gamma}\sum_{i=1}^{n}\zeta_{i}\\
&-\sum_{i=1}^{n}\frac{(1+\epsilon_{w})A_{\mathrm{clip}}}{1-\gamma}\sqrt{\frac{2\log(2n/\delta_{\mathrm{conf}})}{N_{i}}}-\sum_{i=1}^{n}\frac{A_{\mathrm{clip}}}{1-\gamma}\sqrt{2\delta_{i}}.
\end{aligned}$$

###### Proof.
Apply Corollary [E.9](https://arxiv.org/html/2605.15207#A5.Thmtheorem9) with $\delta=\delta_{\mathrm{conf}}/n$ for each step and union bound over $i$. Substitute the resulting lower bounds on $L_{i}^{\mathrm{seq}}$ into Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6). ∎
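The right-hand side of Theorem E.10 is a plain sum of five terms and can be evaluated from logged quantities. The following is a minimal sketch of that evaluation; the function name, argument names, and example numbers are assumptions and do not come from the TeamTR repository.

```python
import math

def empirical_stage_lower_bound(l_hat, deltas, zetas, num_groups,
                                gamma, a_max, a_clip, eps_w, delta_conf):
    """Right-hand side of Theorem E.10 for one stage of n sequential updates."""
    n = len(l_hat)
    c = math.sqrt(2.0) * gamma / (1.0 - gamma) ** 2 * a_max
    b = (1.0 + eps_w) * a_clip / (1.0 - gamma)
    surrogate = sum(l_hat)
    trust_region = c * sum(math.sqrt(d) for d in deltas)
    estimation = sum(zetas) / (1.0 - gamma)
    # Union bound over the n steps: each step gets failure probability delta_conf / n.
    concentration = sum(
        b * math.sqrt(2.0 * math.log(2.0 * n / delta_conf) / m) for m in num_groups
    )
    ratio_clip = sum(a_clip / (1.0 - gamma) * math.sqrt(2.0 * d) for d in deltas)
    return surrogate - trust_region - estimation - concentration - ratio_clip

# Example inputs (assumed): two agents, equal radii, no estimation error.
common = dict(deltas=[0.01, 0.01], zetas=[0.0, 0.0], gamma=0.9,
              a_max=1.0, a_clip=1.0, eps_w=0.2, delta_conf=0.05)
few = empirical_stage_lower_bound([0.5, 0.5], num_groups=[100, 100], **common)
many = empirical_stage_lower_bound([0.5, 0.5], num_groups=[10000, 10000], **common)
print(few, many)
```

With these illustrative inputs the bound is negative because the $(1-\gamma)^{-2}$ penalty dominates; increasing the number of prompt groups tightens only the concentration term.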
## Appendix F Additional Theory and Practical Notes
This appendix collects auxiliary theory and practical notes to support the interpretation, monitoring, and implementation of the main certificates\. Unless stated otherwise, these results are*not required*for the main guarantees in the paper, but they provide useful intuition and additional diagnostics\.
### F.1 Information-Theoretic Upper Bounds Under Trust Regions
##### Note on KL orientation.
The main text and certificates are built around *reverse* KL trust regions (of the form $D_{\mathrm{KL}}(\pi_{\mathrm{old}}\,\|\,\pi_{\mathrm{new}})$), because they admit rollout-estimable monitors and clean token-level decompositions. In this subsection, we additionally record a *forward*-KL (Donsker–Varadhan) envelope as an intuition tool. These auxiliary forward-KL bounds are *not needed* for the main results.
#### F.1.1 Centering identity
###### Lemma F.1 (Centering).
For any fixed $s$ and any policy $\pi$ such that $Q^{\pi}(s,\cdot)$ is integrable under $\pi(\cdot\mid s)$,

$$\mathbb{E}_{\mathbf{a}\sim\pi(\cdot\mid s)}\!\left[A^{\pi}(s,\mathbf{a})\right]=0.$$

###### Proof.
By definition, $A^{\pi}(s,\mathbf{a})=Q^{\pi}(s,\mathbf{a})-V^{\pi}(s)$ and $V^{\pi}(s)=\mathbb{E}_{\mathbf{a}\sim\pi(\cdot\mid s)}Q^{\pi}(s,\mathbf{a})$. Taking expectation over $\mathbf{a}\sim\pi(\cdot\mid s)$ yields $0$. ∎
#### F.1.2 Donsker–Varadhan and oracle envelopes
###### Lemma F.2 (Donsker–Varadhan variational inequality).
Let $P,Q$ be distributions on a measurable space with $Q\ll P$, and let $f$ be measurable. Then for any $\eta>0$ such that $\mathbb{E}_{P}[e^{\eta f}]<\infty$,
$$\mathbb{E}_{Q}[f]\leq\frac{1}{\eta}\left(D_{\mathrm{KL}}(Q\|P)+\log\mathbb{E}_{P}[e^{\eta f}]\right).$$
Moreover, if $\mathbb{E}_{P}[f]=0$ and $|f|\leq A_{\max}$ almost surely under $P$, then
$$\mathbb{E}_{Q}[f]\leq A_{\max}\sqrt{2\,D_{\mathrm{KL}}(Q\|P)}.$$
###### Proof.
Define the tilted distribution $P_{g}$ for any measurable $g$ by
$$\frac{\mathrm{d}P_{g}}{\mathrm{d}P}=\frac{e^{g}}{\mathbb{E}_{P}[e^{g}]},\qquad\log\frac{\mathrm{d}P_{g}}{\mathrm{d}P}=g-\log\mathbb{E}_{P}[e^{g}].$$
A KL decomposition yields
$$D_{\mathrm{KL}}(Q\|P)=D_{\mathrm{KL}}(Q\|P_{g})+\mathbb{E}_{Q}[g]-\log\mathbb{E}_{P}[e^{g}],$$
hence $\mathbb{E}_{Q}[g]\leq D_{\mathrm{KL}}(Q\|P)+\log\mathbb{E}_{P}[e^{g}]$ since $D_{\mathrm{KL}}(Q\|P_{g})\geq 0$. Setting $g=\eta f$ gives the first inequality.
If additionally $\mathbb{E}_{P}[f]=0$ and $|f|\leq A_{\max}$, then Hoeffding's lemma implies $\log\mathbb{E}_{P}[e^{\eta f}]\leq\eta^{2}A_{\max}^{2}/2$. Therefore
$$\mathbb{E}_{Q}[f]\leq\frac{1}{\eta}D_{\mathrm{KL}}(Q\|P)+\frac{\eta A_{\max}^{2}}{2}.$$
Optimizing over $\eta>0$ gives $\eta^{\star}=\sqrt{2D_{\mathrm{KL}}(Q\|P)}/A_{\max}$ and thus $\mathbb{E}_{Q}[f]\leq A_{\max}\sqrt{2D_{\mathrm{KL}}(Q\|P)}$. ∎
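As a quick numerical sanity check of the bounded case of Lemma F.2, the following sketch evaluates both sides of the inequality on a small discrete example. The distributions `p`, `q`, the values `f`, and the helper names are illustrative assumptions, not taken from the paper; `f` is centered under $P$ inside the helper so that the zero-mean condition holds.

```python
import math

def kl(q, p):
    """KL(q || p) for discrete distributions given as probability lists (requires q << p)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def dv_envelope(q, p, f):
    """Return (E_Q[f_c], A_max * sqrt(2 KL(Q||P))) where f_c is f centered under P."""
    mean_p = sum(pi * fi for pi, fi in zip(p, f))
    f_centered = [fi - mean_p for fi in f]        # enforce E_P[f_c] = 0
    a_max = max(abs(fi) for fi in f_centered)     # sup-norm bound on the centered function
    lhs = sum(qi * fi for qi, fi in zip(q, f_centered))
    rhs = a_max * math.sqrt(2.0 * kl(q, p))
    return lhs, rhs

# Hypothetical example values.
p = [0.5, 0.3, 0.2]
q = [0.6, 0.25, 0.15]
f = [1.0, -0.5, -2.0]
lhs, rhs = dv_envelope(q, p, f)
print(f"E_Q[f] = {lhs:.4f} <= envelope = {rhs:.4f}")
```

The envelope is typically loose for a fixed $f$, which is consistent with its role here as an intuition tool rather than a certificate.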
###### Proposition F.3 (Oracle single-step upper bound (max-KL envelope)).
Let $\pi$ be the current intermediate policy and $\pi'$ be the policy after updating only factor $\sigma(i)$. Assume the advantage is uniformly bounded: $|A^{\pi}(s,\mathbf{a})|\leq A_{\max}$ for all $(s,\mathbf{a})$. Define the per-state forward KL envelope
$$D_{\mathrm{KL}}^{\max}(\pi'\|\pi)\coloneqq\sup_{s}D_{\mathrm{KL}}(\pi'(\cdot\mid s)\|\pi(\cdot\mid s)).$$
If $D_{\mathrm{KL}}^{\max}(\pi'\|\pi)\leq\delta_{i}^{\max}<\infty$, then
$$J(\pi')-J(\pi)\leq\frac{A_{\max}}{1-\gamma}\sqrt{2\,\delta_{i}^{\max}}.$$
###### Proof.
By the performance difference lemma,
$$J(\pi')-J(\pi)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi'},\,\mathbf{a}\sim\pi'}[A^{\pi}(s,\mathbf{a})].$$
Fix $s$ and define $P_{s}(\mathbf{a})=\pi(\mathbf{a}\mid s)$ and $Q_{s}(\mathbf{a})=\pi'(\mathbf{a}\mid s)$. Since $D_{\mathrm{KL}}(Q_{s}\|P_{s})\leq\delta_{i}^{\max}<\infty$, we have $Q_{s}\ll P_{s}$. By Lemma [F.1](https://arxiv.org/html/2605.15207#A6.Thmtheorem1), $\mathbb{E}_{\mathbf{a}\sim P_{s}}[A^{\pi}(s,\mathbf{a})]=0$, and by boundedness $|A^{\pi}(s,\mathbf{a})|\leq A_{\max}$. Applying Lemma [F.2](https://arxiv.org/html/2605.15207#A6.Thmtheorem2) gives
$$\mathbb{E}_{\mathbf{a}\sim Q_{s}}[A^{\pi}(s,\mathbf{a})]\leq A_{\max}\sqrt{2\,D_{\mathrm{KL}}(Q_{s}\|P_{s})}\leq A_{\max}\sqrt{2\,\delta_{i}^{\max}}.$$
Averaging over $s\sim d^{\pi'}$ and dividing by $(1-\gamma)$ yields the result. ∎
### F.2 Budget-Aware Lower Bound via Information Geometry
This subsection relates *achievable* (local) surrogate gains to information geometry inside a KL ball. It complements the certificates by giving a principled “budget allocation” view: larger KL radii can enable larger local gains, modulo curvature/smoothness penalties.
##### Standing convention.
Throughout this subsection, the state distribution $d^{\pi}$ used in expected KL expressions is treated as a fixed reference distribution (e.g., induced by the previous iterate); we do *not* differentiate through $d^{\pi}$.
##### Assumptions.
Assume $\theta\mapsto\log\pi_{\theta}(a\mid s)$ is three times continuously differentiable in a neighborhood of $\theta_{\mathrm{cur}}$. Assume bounded derivatives $\|\nabla_{\theta}\log\pi_{\theta}(a\mid s)\|\leq B_{1}$, $\|\nabla_{\theta}^{2}\log\pi_{\theta}(a\mid s)\|_{\mathrm{op}}\leq B_{2}$, and a uniform third-derivative bound (stated explicitly below). Assume Fisher regularity $\lambda_{\min}(F)\geq\lambda_{0}>0$, possibly via regularization $F^{\mathrm{reg}}=F+\epsilon I$.
#### F.2.1 KL–Fisher bridge
###### Lemma F.4 (Taylor expansion of expected KL with uniform remainder).
Fix a reference state distribution $d^{\pi}$ (independent of $\theta$). Define the (state-averaged) Fisher information
$$F(\theta)\coloneqq\mathbb{E}_{s\sim d^{\pi},\ a\sim\pi_{\theta}(\cdot\mid s)}\Big[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\,\nabla_{\theta}\log\pi_{\theta}(a\mid s)^{\top}\Big].$$
For each state $s$, define
$$D_{s}(\Delta)\coloneqq D_{\mathrm{KL}}(\pi_{\theta+\Delta}(\cdot\mid s)\|\pi_{\theta}(\cdot\mid s)).$$
Assume there exist $r>0$ and $B_{3}<\infty$ such that for all $\|\Delta\|\leq r$,
$$\sup_{s}\ \sup_{\|u\|=1}\ \big|\mathrm{D}^{3}D_{s}(\Delta)[u,u,u]\big|\leq B_{3}.$$
Then for any $\|\Delta\|\leq r$,
$$\mathbb{E}_{s\sim d^{\pi}}D_{\mathrm{KL}}(\pi_{\theta+\Delta}(\cdot\mid s)\|\pi_{\theta}(\cdot\mid s))=\frac{1}{2}\Delta^{\top}F(\theta)\Delta+R(\Delta),\qquad|R(\Delta)|\leq\frac{B_{3}}{6}\|\Delta\|^{3}.$$
###### Proof.
Fix $s$ and consider $D_{s}(\Delta)\geq 0$ with $D_{s}(0)=0$. Therefore $\nabla_{\Delta}D_{s}(0)=0$ (a local minimum at $\Delta=0$). Moreover, a standard calculation gives the Hessian at $0$ as the per-state Fisher:
$$\nabla_{\Delta}^{2}D_{s}(0)=\mathbb{E}_{a\sim\pi_{\theta}(\cdot\mid s)}\Big[\nabla_{\theta}\log\pi_{\theta}(a\mid s)\,\nabla_{\theta}\log\pi_{\theta}(a\mid s)^{\top}\Big].$$
By Taylor's theorem with remainder and the assumed uniform bound on the third directional derivative, for $\|\Delta\|\leq r$,
$$D_{s}(\Delta)=\frac{1}{2}\Delta^{\top}\nabla_{\Delta}^{2}D_{s}(0)\Delta+r_{s}(\Delta),\qquad|r_{s}(\Delta)|\leq\frac{B_{3}}{6}\|\Delta\|^{3}.$$
Taking the expectation over $s\sim d^{\pi}$ yields
$$\mathbb{E}_{s\sim d^{\pi}}[D_{s}(\Delta)]=\frac{1}{2}\Delta^{\top}F(\theta)\Delta+\mathbb{E}_{s\sim d^{\pi}}[r_{s}(\Delta)].$$
Setting $R(\Delta)\coloneqq\mathbb{E}_{s\sim d^{\pi}}[r_{s}(\Delta)]$, the uniform bound on $r_{s}(\Delta)$ implies $|R(\Delta)|\leq\frac{B_{3}}{6}\|\Delta\|^{3}$. ∎
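To make the quadratic KL model concrete, the sketch below compares the exact KL with $\frac{1}{2}\Delta^{\top}F\Delta$ for a single-state softmax policy parameterized directly by its logits. This tabular setting is an illustrative assumption (the paper's policies are LLMs); for softmax logits the Fisher matrix is $\mathrm{diag}(p)-pp^{\top}$, so the quadratic form reduces to half the variance of $\Delta$ under $p$. All names and values are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def fisher_quadratic(theta, delta):
    """0.5 * delta^T F(theta) delta with F = diag(p) - p p^T (softmax-logit Fisher)."""
    p = softmax(theta)
    mean = sum(pi * di for pi, di in zip(p, delta))
    return 0.5 * sum(pi * (di - mean) ** 2 for pi, di in zip(p, delta))

theta = [0.2, -0.1, 0.5]          # hypothetical current logits
delta = [0.01, -0.02, 0.005]      # hypothetical small update
shifted = softmax([t + d for t, d in zip(theta, delta)])
exact = kl(shifted, softmax(theta))   # D_s(Delta) = KL(pi_{theta+Delta} || pi_theta)
approx = fisher_quadratic(theta, delta)
print(f"exact KL = {exact:.3e}, quadratic model = {approx:.3e}")
```

For small $\|\Delta\|$ the two values agree closely, and the gap shrinks cubically as the step is scaled down, in line with the remainder bound.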
#### F.2.2 Local gain in a KL ball
Let $g=\nabla_{\theta}L(\theta_{\mathrm{cur}})$ and $F^{\mathrm{reg}}=F+\epsilon I$.
###### Lemma F.5 (Maximization of linear gain under quadratic constraint).
Assume $F^{\mathrm{reg}}\succ 0$ and $g\neq 0$. The problem
$$\max_{\Delta}\ g^{\top}\Delta\quad\text{s.t.}\quad\frac{1}{2}\Delta^{\top}F^{\mathrm{reg}}\Delta\leq\delta$$
has optimizer
$$\Delta^{\star}=\sqrt{\frac{2\delta}{g^{\top}(F^{\mathrm{reg}})^{-1}g}}\ (F^{\mathrm{reg}})^{-1}g,$$
with optimal value
$$\sup_{\frac{1}{2}\Delta^{\top}F^{\mathrm{reg}}\Delta\leq\delta}g^{\top}\Delta=\sqrt{2\delta\,g^{\top}(F^{\mathrm{reg}})^{-1}g}\eqqcolon\kappa^{\mathrm{reg}}\sqrt{\delta}.$$
###### Proof.
Since $F^{\mathrm{reg}}\succ 0$, define $\langle u,v\rangle_{F^{\mathrm{reg}}}=u^{\top}F^{\mathrm{reg}}v$ and $\|u\|_{F^{\mathrm{reg}}}=\sqrt{u^{\top}F^{\mathrm{reg}}u}$. The constraint is $\|\Delta\|_{F^{\mathrm{reg}}}\leq\sqrt{2\delta}$. Rewrite
$$g^{\top}\Delta=\big\langle(F^{\mathrm{reg}})^{-1}g,\ \Delta\big\rangle_{F^{\mathrm{reg}}}\leq\|(F^{\mathrm{reg}})^{-1}g\|_{F^{\mathrm{reg}}}\ \|\Delta\|_{F^{\mathrm{reg}}}=\sqrt{g^{\top}(F^{\mathrm{reg}})^{-1}g}\ \sqrt{2\delta},$$
by Cauchy–Schwarz. Equality holds when $\Delta$ is proportional to $(F^{\mathrm{reg}})^{-1}g$, with scaling chosen to make the constraint active. ∎
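The closed form in Lemma F.5 can be sketched numerically as follows. This is a minimal illustration on a small dense matrix, assuming a hypothetical Fisher estimate `F`, gradient `g`, and damping `eps`; the tiny Gaussian-elimination solver stands in for whatever linear solver (for example conjugate gradient) a real implementation would use.

```python
import math

def solve_linear(F, g):
    """Solve F x = g by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(g)
    M = [row[:] + [g[i]] for i, row in enumerate(F)]   # augmented matrix [F | g]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            factor = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= factor * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                     # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def kl_ball_step(F, g, delta, eps=1e-3):
    """Return (step, gain, F_reg) maximizing g^T step s.t. 0.5 step^T F_reg step <= delta."""
    n = len(g)
    F_reg = [[F[i][j] + (eps if i == j else 0.0) for j in range(n)] for i in range(n)]
    nat = solve_linear(F_reg, g)                       # F_reg^{-1} g
    quad = sum(gi * ni for gi, ni in zip(g, nat))      # g^T F_reg^{-1} g (positive for g != 0)
    scale = math.sqrt(2.0 * delta / quad)
    return [scale * ni for ni in nat], math.sqrt(2.0 * delta * quad), F_reg

F = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical Fisher estimate
g = [1.0, -1.0]                # hypothetical surrogate gradient
step, gain, F_reg = kl_ball_step(F, g, delta=0.01)
```

At the returned step the quadratic constraint is active and the linear gain equals $\kappa^{\mathrm{reg}}\sqrt{\delta}$.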
#### F.2.3 Regularized budget-aware stage lower bound
###### Theorem F.6 (Regularized budget-aware stage lower bound).
Assume each step-$i$ population surrogate $L_{i}^{\mathrm{seq}}(\theta)$ is locally $L_{i}^{\mathrm{loc}}$-smooth in $\theta$ (under the Euclidean norm) in a neighborhood of the current iterate. Assume the (regularized) quadratic KL model is used as the trust-region constraint at radius $\delta_{i}$:
$$\frac{1}{2}\Delta_{i}^{\top}F_{i}^{\mathrm{reg}}\Delta_{i}\leq\delta_{i},\qquad F_{i}^{\mathrm{reg}}=F_{i}+\epsilon I,\quad\lambda_{\min}(F_{i}^{\mathrm{reg}})>0.$$
Let $g_{i}=\nabla_{\theta}L_{i}^{\mathrm{seq}}(\theta_{\mathrm{cur}})$ and define
$$\kappa_{i}^{\mathrm{reg}}=\sqrt{2\,g_{i}^{\top}(F_{i}^{\mathrm{reg}})^{-1}g_{i}},\qquad a_{i}^{\mathrm{reg}}=\frac{L_{i}^{\mathrm{loc}}}{\lambda_{\min}(F_{i}^{\mathrm{reg}})}.$$
Assume the same boundedness conditions as in the empirical correction (e.g., $|\widehat{A}|\leq A_{\mathrm{clip}}$ and, if used, ratio clipping $w^{\mathrm{clip}}\in[1-\epsilon_{w},1+\epsilon_{w}]$). Then with probability at least $1-\delta_{\mathrm{conf}}$,
$$\begin{aligned}J(\bar{\pi})-J(\pi_{\mathrm{cur}})\ \geq\ &\sum_{i=1}^{n}\Big(\kappa_{i}^{\mathrm{reg}}\sqrt{\delta_{i}}-a_{i}^{\mathrm{reg}}\delta_{i}\Big)-\frac{\sqrt{2}\gamma A_{\max}}{(1-\gamma)^{2}}\sum_{i=1}^{n}\sqrt{\delta_{i}}-\frac{1}{1-\gamma}\sum_{i=1}^{n}\zeta_{i}\\&-\sum_{i=1}^{n}\frac{(1+\epsilon_{w})A_{\mathrm{clip}}}{1-\gamma}\sqrt{\frac{2\log(2n/\delta_{\mathrm{conf}})}{N_{i}}}\;-\;\mathsf{Bias}_{\mathrm{ratio}},\end{aligned}$$
where $\mathsf{Bias}_{\mathrm{ratio}}$ is the total ratio-clipping bias correction:
$$\mathsf{Bias}_{\mathrm{ratio}}=\begin{cases}\displaystyle\sum_{i=1}^{n}\frac{A_{\mathrm{clip}}}{1-\gamma}\sqrt{2\delta_{i}},&\text{if ratio clipping is used and bounded via the reverse-KL trust region,}\\[6pt]0,&\text{if no ratio clipping is used, or if its bias is absorbed into }\zeta_{i}.\end{cases}$$
###### Proof.
Fix a step $i$ and consider a parameter update $\Delta$. Local $L_{i}^{\mathrm{loc}}$-smoothness implies the standard lower bound
$$L_{i}^{\mathrm{seq}}(\theta_{\mathrm{cur}}+\Delta)\geq L_{i}^{\mathrm{seq}}(\theta_{\mathrm{cur}})+g_{i}^{\top}\Delta-\frac{L_{i}^{\mathrm{loc}}}{2}\|\Delta\|^{2}.$$
Under $F_{i}^{\mathrm{reg}}\succ 0$ and the quadratic constraint $\frac{1}{2}\Delta^{\top}F_{i}^{\mathrm{reg}}\Delta\leq\delta_{i}$, we have $\|\Delta\|^{2}\leq\frac{1}{\lambda_{\min}(F_{i}^{\mathrm{reg}})}\Delta^{\top}F_{i}^{\mathrm{reg}}\Delta\leq\frac{2\delta_{i}}{\lambda_{\min}(F_{i}^{\mathrm{reg}})}$. Therefore, restricting to feasible $\Delta$,
$$\sup_{\frac{1}{2}\Delta^{\top}F_{i}^{\mathrm{reg}}\Delta\leq\delta_{i}}\Big(L_{i}^{\mathrm{seq}}(\theta_{\mathrm{cur}}+\Delta)-L_{i}^{\mathrm{seq}}(\theta_{\mathrm{cur}})\Big)\geq\sup_{\frac{1}{2}\Delta^{\top}F_{i}^{\mathrm{reg}}\Delta\leq\delta_{i}}g_{i}^{\top}\Delta-\frac{L_{i}^{\mathrm{loc}}}{2}\cdot\frac{2\delta_{i}}{\lambda_{\min}(F_{i}^{\mathrm{reg}})}.$$
By Lemma [F.5](https://arxiv.org/html/2605.15207#A6.Thmtheorem5), the first supremum equals $\kappa_{i}^{\mathrm{reg}}\sqrt{\delta_{i}}$, yielding the per-step bound $\kappa_{i}^{\mathrm{reg}}\sqrt{\delta_{i}}-a_{i}^{\mathrm{reg}}\delta_{i}$. Summing over $i=1,\dots,n$ and plugging this attainable surrogate gain into the stage certificate (Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6)) yields the deterministic part of the RHS, including the occupancy-shift penalty and the $\zeta_{i}$ terms.
Finally, apply the empirical-to-population correction (Corollary [E.9](https://arxiv.org/html/2605.15207#A5.Thmtheorem9)) stepwise with confidence $\delta=\delta_{\mathrm{conf}}/n$ and a union bound to replace each population surrogate term by its empirical estimate and add the sampling error term. If ratio clipping is used and bounded via the reverse-KL trust region, include the deterministic ratio-bias term from Corollary [E.9](https://arxiv.org/html/2605.15207#A5.Thmtheorem9); otherwise, set it to $0$ or absorb it into $\zeta_{i}$. ∎
##### Remark (KL–Fisher remainder).
Lemma [F.4](https://arxiv.org/html/2605.15207#A6.Thmtheorem4) makes explicit that the quadratic KL model is accurate up to $O(\|\Delta\|^{3})$. If desired, one can translate the uniform remainder into an additional conservative penalty term of order $O(\delta_{i}^{3/2})$ under the constraint $\frac{1}{2}\Delta^{\top}F_{i}^{\mathrm{reg}}\Delta\leq\delta_{i}$. We omit this term in Theorem [F.6](https://arxiv.org/html/2605.15207#A6.Thmtheorem6) for simplicity, since the theorem is intended as a practical budget-allocation intuition rather than a primary certificate.
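One way to read the budget-allocation view of Theorem F.6 is that each step contributes a term of the form $c\sqrt{\delta_{i}}-a_{i}^{\mathrm{reg}}\delta_{i}$, which is concave in $\delta_{i}$ and maximized at $\delta_{i}=(c/(2a_{i}^{\mathrm{reg}}))^{2}$ when $c>0$. The sketch below assumes $c=\kappa_{i}^{\mathrm{reg}}-\texttt{penalty\_coef}$, where `penalty_coef` is a hypothetical argument collecting the $\sqrt{\delta_{i}}$-proportional penalties (for example $\sqrt{2}\gamma A_{\max}/(1-\gamma)^{2}$); the $\delta$-independent sampling and $\zeta_{i}$ terms are ignored, and the numeric values are illustrative only.

```python
import math

def per_step_bound(kappa, a, delta, penalty_coef=0.0):
    """(kappa - penalty_coef) * sqrt(delta) - a * delta for a single update step."""
    return (kappa - penalty_coef) * math.sqrt(delta) - a * delta

def best_radius(kappa, a, penalty_coef=0.0):
    """Radius maximizing per_step_bound; zero when the net sqrt coefficient is not positive."""
    net = kappa - penalty_coef
    if net <= 0.0:
        return 0.0          # no positive certified gain is available at any radius
    return (net / (2.0 * a)) ** 2

kappa, a = 0.8, 10.0        # hypothetical gain and curvature coefficients
delta_star = best_radius(kappa, a)
best_value = per_step_bound(kappa, a, delta_star)
```

Under these assumptions the maximal per-step value is $c^{2}/(4a_{i}^{\mathrm{reg}})$, so steps with a larger gain coefficient or smaller curvature justify a larger radius.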
### F.3 Finite-Sample Concentration Under $\beta$-Mixing
This subsection is optional and only needed when prompt groups are not independent, e.g., when a single long on-policy stream is reused to form multiple groups.
##### Setup.
Let $\{Y_{j}\}_{j\geq 1}$ be a strictly stationary process with $\beta$-mixing coefficients
$$\beta(t)=\sup_{k\geq 1}\ \mathbb{E}\Big[\sup_{B\in\sigma(Y_{k+t},Y_{k+t+1},\ldots)}\big|\Pr(B\mid\sigma(Y_{1},\ldots,Y_{k}))-\Pr(B)\big|\Big],$$
and assume $\beta(t)\to 0$ as $t\to\infty$. Assume boundedness $|Y_{j}|\leq B$ almost surely.
###### Lemma F.7 (Blocking bound for a $\beta$-mixing sequence).
Fix a block length $\ell\geq 1$ and let $m=\lfloor N/(2\ell)\rfloor$. Define the *odd-block* average
$$\widehat{\mu}_{\mathrm{odd}}\coloneqq\frac{1}{m}\sum_{k=1}^{m}\ \frac{1}{\ell}\sum_{t=(2k-2)\ell+1}^{(2k-1)\ell}Y_{t}.$$
Then for any $\epsilon>0$,
$$\Pr\!\left(\big|\widehat{\mu}_{\mathrm{odd}}-\mathbb{E}[Y_{1}]\big|>\epsilon\right)\leq 2\exp\!\left(-\frac{m\epsilon^{2}}{2B^{2}}\right)+2(m-1)\beta(\ell).$$
###### Proof.
Let $Z_{k}=\frac{1}{\ell}\sum_{t=(2k-2)\ell+1}^{(2k-1)\ell}Y_{t}$, so $Z_{k}\in[-B,B]$. A standard coupling (or total-variation) argument for $\beta$-mixing sequences implies that the joint law of $(Z_{1},\ldots,Z_{m})$ is within total variation distance at most $(m-1)\beta(\ell)$ of the product law of $m$ i.i.d. copies of $Z_{1}$. Therefore, for any event $\mathcal{E}$ depending only on $(Z_{1},\ldots,Z_{m})$,
$$\Pr(\mathcal{E})\leq\Pr_{\mathrm{iid}}(\mathcal{E})+(m-1)\beta(\ell).$$
Under the i.i.d. product law, Hoeffding's inequality applied to $\frac{1}{m}\sum_{k=1}^{m}Z_{k}$ yields
$$\Pr_{\mathrm{iid}}\!\left(\left|\frac{1}{m}\sum_{k=1}^{m}Z_{k}-\mathbb{E}[Z_{1}]\right|>\epsilon\right)\leq 2\exp\!\left(-\frac{m\epsilon^{2}}{2B^{2}}\right).$$
Combining the two bounds and using $\mathbb{E}[Z_{1}]=\mathbb{E}[Y_{1}]$ by stationarity yields the result (up to a factor of $2$ in the dependence term from symmetrizing the two tails). ∎
##### Practical use.
Lemma [F.7](https://arxiv.org/html/2605.15207#A6.Thmtheorem7) suggests a conservative recipe when prompt groups are dependent: subsample groups with a lag $\ell$ so that $\beta(\ell)$ is small, and treat the number of retained blocks $m$ as the effective sample size in the Hoeffding correction. When i.i.d. prompt groups are available, this subsection is unnecessary.
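The blocking recipe can be sketched as below. This is a minimal illustration, assuming the per-group statistics are already available as a list `y` in stream order and that a value (or upper bound) for $\beta(\ell)$ is supplied by the user; the paper does not prescribe how to estimate $\beta(\ell)$, and the function names are hypothetical.

```python
import math

def odd_block_mean(y, block_len):
    """Average over odd blocks of length block_len; returns (mean, number_of_blocks m)."""
    m = len(y) // (2 * block_len)
    if m == 0:
        raise ValueError("need at least 2 * block_len samples")
    block_means = []
    for k in range(m):
        start = 2 * k * block_len                 # 0-indexed start of the (k+1)-th odd block
        block = y[start:start + block_len]        # the following even block is skipped as a gap
        block_means.append(sum(block) / block_len)
    return sum(block_means) / m, m

def blocking_tail_bound(m, eps, bound, beta_ell):
    """Right-hand side of Lemma F.7: Hoeffding term with m blocks plus the dependence term."""
    return 2.0 * math.exp(-m * eps ** 2 / (2.0 * bound ** 2)) + 2.0 * (m - 1) * beta_ell

y = [float(v) for v in range(1, 13)]              # toy stream: 1, 2, ..., 12
mean, m = odd_block_mean(y, block_len=2)          # uses blocks [1,2], [5,6], [9,10]
```

The effective sample size is `m`, not `len(y)`, which is the price paid for discarding the gap blocks.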
### F.4 Smoothness and Projected-Gradient Convergence
Let $G(\theta)=\sum_{i=1}^{n}L_{i}^{\mathrm{seq}}(\theta)$ be the stage objective in parameters, under a closed convex constraint set $\Theta$ enforcing trust-region radii (or other feasibility constraints).
###### Lemma F.8 (Gradient and Hessian representations).
Assume the advantage estimator $\widehat{A}(s,\mathbf{a})$ is treated as fixed w.r.t. $\theta$ within an update step (e.g., computed from on-policy sampling under the previous iterate). Let $\sigma(i)$ denote the updated factor at step $i$. Then, for the (unclipped) importance-weighted surrogate,
$$\nabla_{\theta}L_{i}^{\mathrm{seq}}(\theta)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[\widehat{A}(s,\mathbf{a})\,\nabla_{\theta}\log\pi_{\theta}(a_{\sigma(i)}\mid s)\right],$$
and
$$\nabla_{\theta}^{2}L_{i}^{\mathrm{seq}}(\theta)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i}}\!\left[\widehat{A}(s,\mathbf{a})\left(\nabla_{\theta}^{2}\log\pi_{\theta}(a_{\sigma(i)}\mid s)+\nabla_{\theta}\log\pi_{\theta}(a_{\sigma(i)}\mid s)\,\nabla_{\theta}\log\pi_{\theta}(a_{\sigma(i)}\mid s)^{\top}\right)\right].$$
###### Proof.
Write the step surrogate in importance-weighted form under the reference distribution:
$$L_{i}^{\mathrm{seq}}(\theta)=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\hat{\pi}^{\,i-1}},\,\mathbf{a}\sim\hat{\pi}^{\,i-1}(\cdot\mid s)}\big[w_{\theta}(s,\mathbf{a})\,\widehat{A}(s,\mathbf{a})\big],$$
where $w_{\theta}(s,\mathbf{a})=\pi_{\theta}(a_{\sigma(i)}\mid s)/\pi_{\mathrm{cur}}(a_{\sigma(i)}\mid s)$ and $\pi_{\mathrm{cur}}$ is the frozen denominator from the previous iterate. Since $\nabla_{\theta}w_{\theta}=w_{\theta}\nabla_{\theta}\log\pi_{\theta}(a_{\sigma(i)}\mid s)$, boundedness of $\widehat{A}$ and the score derivatives (assumed elsewhere for implementation) justifies interchanging $\nabla_{\theta}$ and expectation (Leibniz rule / dominated convergence), yielding
$$\nabla_{\theta}L_{i}^{\mathrm{seq}}(\theta)=\frac{1}{1-\gamma}\mathbb{E}_{s,\mathbf{a}\sim\hat{\pi}^{\,i-1}}\big[w_{\theta}\,\widehat{A}\,\nabla_{\theta}\log\pi_{\theta}(a_{\sigma(i)}\mid s)\big].$$
The change-of-measure identity $w_{\theta}(s,\mathbf{a})\,\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)=\hat{\pi}^{\,i}(\mathbf{a}\mid s)$ converts this to the stated form. Differentiating once more and using $\nabla_{\theta}^{2}w_{\theta}=w_{\theta}\left(\nabla_{\theta}^{2}\log\pi_{\theta}+\nabla_{\theta}\log\pi_{\theta}\nabla_{\theta}\log\pi_{\theta}^{\top}\right)$ gives the Hessian expression, again followed by the same change of measure. ∎
###### Lemma F.9 (Uniform smoothness).
Assume $\|\nabla_{\theta}\log\pi_{\theta}(a\mid s)\|\leq B_{1}$ and $\|\nabla_{\theta}^{2}\log\pi_{\theta}(a\mid s)\|_{\mathrm{op}}\leq B_{2}$ for all $(s,a)$ in the relevant region, and $|\widehat{A}|\leq A_{\mathrm{clip}}$ almost surely. Then $G$ is $L$-smooth with
$$L\leq\frac{nA_{\mathrm{clip}}}{1-\gamma}(B_{2}+B_{1}^{2}).$$
###### Proof.
For each step $i$, Lemma [F.8](https://arxiv.org/html/2605.15207#A6.Thmtheorem8) implies
$$\|\nabla_{\theta}^{2}L_{i}^{\mathrm{seq}}(\theta)\|_{\mathrm{op}}\leq\frac{1}{1-\gamma}\mathbb{E}\Big[|\widehat{A}|\big(\|\nabla_{\theta}^{2}\log\pi_{\theta}\|_{\mathrm{op}}+\|\nabla_{\theta}\log\pi_{\theta}\|^{2}\big)\Big]\leq\frac{A_{\mathrm{clip}}}{1-\gamma}(B_{2}+B_{1}^{2}).$$
Summing over $i=1,\dots,n$ yields $\|\nabla_{\theta}^{2}G(\theta)\|_{\mathrm{op}}\leq\frac{nA_{\mathrm{clip}}}{1-\gamma}(B_{2}+B_{1}^{2})$, hence $G$ is $L$-smooth with the stated $L$. ∎
###### Theorem F.10 (Projected-gradient ascent convergence).
Let $G$ be $L$-smooth and let $\Theta$ be closed and convex. Define projected-gradient iterates
$$\theta^{t+1}=\mathrm{Proj}_{\Theta}\big(\theta^{t}+\eta\nabla G(\theta^{t})\big),\qquad\eta\leq\frac{1}{L},$$
and the projected gradient mapping
$$\mathcal{G}_{\eta}(\theta^{t})\coloneqq\frac{1}{\eta}\big(\theta^{t+1}-\theta^{t}\big).$$
Then
$$G(\theta^{t+1})\geq G(\theta^{t})+\frac{\eta}{2}\|\mathcal{G}_{\eta}(\theta^{t})\|^{2},\qquad\frac{1}{T}\sum_{t=0}^{T-1}\|\mathcal{G}_{\eta}(\theta^{t})\|^{2}\leq\frac{2(G^{\star}-G(\theta^{0}))}{\eta T},$$
where $G^{\star}=\sup_{\theta\in\Theta}G(\theta)$.
###### Proof.
Let $\Delta^{t}=\theta^{t+1}-\theta^{t}$. By $L$-smoothness,
$$G(\theta^{t+1})\geq G(\theta^{t})+\langle\nabla G(\theta^{t}),\Delta^{t}\rangle-\frac{L}{2}\|\Delta^{t}\|^{2}.$$
The Euclidean projection optimality condition implies for all $\theta\in\Theta$,
$$\langle\theta^{t}+\eta\nabla G(\theta^{t})-\theta^{t+1},\,\theta-\theta^{t+1}\rangle\leq 0.$$
Taking $\theta=\theta^{t}$ yields $\langle\nabla G(\theta^{t}),\Delta^{t}\rangle\geq\frac{1}{\eta}\|\Delta^{t}\|^{2}$. Therefore
$$G(\theta^{t+1})\geq G(\theta^{t})+\Big(\frac{1}{\eta}-\frac{L}{2}\Big)\|\Delta^{t}\|^{2}\geq G(\theta^{t})+\frac{1}{2\eta}\|\Delta^{t}\|^{2}=G(\theta^{t})+\frac{\eta}{2}\|\mathcal{G}_{\eta}(\theta^{t})\|^{2},$$
where we used $\eta\leq 1/L$. Summing over $t=0,\dots,T-1$ and using $G(\theta^{T})\leq G^{\star}$ gives the averaged bound. ∎
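The monotone-improvement inequality of Theorem F.10 can be observed on a toy problem. The sketch below runs projected-gradient ascent on a hypothetical concave quadratic over a box (so $\Theta$ is closed and convex and the projection is a coordinate-wise clamp); the objective, step size, and bounds are illustrative assumptions and are unrelated to the actual stage objective $G$.

```python
def project_box(theta, lo, hi):
    """Euclidean projection onto the box [lo, hi]^d (coordinate-wise clamp)."""
    return [min(max(t, lo), hi) for t in theta]

def projected_gradient_ascent(grad, objective, theta0, eta, steps, lo, hi):
    """Return final iterate, objective values, and squared norms of the gradient mapping."""
    theta = project_box(theta0, lo, hi)
    values = [objective(theta)]
    map_norms = []
    for _ in range(steps):
        g = grad(theta)
        nxt = project_box([t + eta * gi for t, gi in zip(theta, g)], lo, hi)
        map_norms.append(sum(((n - t) / eta) ** 2 for n, t in zip(nxt, theta)))
        theta = nxt
        values.append(objective(theta))
    return theta, values, map_norms

# Toy objective G(x, y) = -0.5 * (4 x^2 + y^2) + 2 x + 3 y, which is L-smooth with L = 4.
def objective(th):
    return -0.5 * (4.0 * th[0] ** 2 + th[1] ** 2) + 2.0 * th[0] + 3.0 * th[1]

def grad(th):
    return [2.0 - 4.0 * th[0], 3.0 - th[1]]

eta = 0.25  # eta = 1 / L
theta, values, map_norms = projected_gradient_ascent(
    grad, objective, [0.0, 0.0], eta=eta, steps=50, lo=-1.0, hi=1.0)
```

On this example the unconstrained maximizer lies outside the box in the second coordinate, so the iterates settle on the boundary where the gradient mapping vanishes even though the gradient does not.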
### F.5 KL Allocation and Tightness
This section records a simple KL allocation principle and a practical tightness diagnostic for the certificate.
##### Allocation heuristic.
The stage certificate in Theorem [3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) subtracts a penalty proportional to $\sum_{i}\sqrt{\delta_{i}}$. For a fixed total KL budget $\sum_{i}\delta_{i}\leq\Delta$, concavity of $\sqrt{\cdot}$ (Jensen) implies
$$\sum_{i=1}^{n}\sqrt{\delta_{i}}\leq n\sqrt{\frac{1}{n}\sum_{i=1}^{n}\delta_{i}}\leq\sqrt{n\Delta},$$
with equality at $\delta_{i}=\Delta/n$. Thus, equal allocation *maximizes* the penalty term; in practice, allocating larger radii to steps with larger observed surrogate gain and smaller radii to low-gain steps can improve the certified lower bound.
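A two-line numerical check of the Jensen step, using a hypothetical budget and a hypothetical skewed allocation with the same total:

```python
import math

def sqrt_penalty(deltas):
    """Sum of sqrt(delta_i), the quantity the occupancy-shift penalty is proportional to."""
    return sum(math.sqrt(d) for d in deltas)

total_budget, n = 0.04, 4                    # illustrative values
equal = [total_budget / n] * n               # delta_i = Delta / n
skewed = [0.025, 0.005, 0.005, 0.005]        # same total, unequal radii
print(sqrt_penalty(equal), sqrt_penalty(skewed), math.sqrt(n * total_budget))
```

The equal allocation attains the upper bound $\sqrt{n\Delta}$, while the skewed allocation incurs a strictly smaller penalty for the same total budget.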
##### Tightness diagnostic.
For each stage, report the pair
$$\big(\,\Delta J_{\mathrm{emp}}\,,\,\mathrm{LB}_{\mathrm{cert}}\,\big)\quad\text{where}\quad\Delta J_{\mathrm{emp}}=\widehat{J}(\bar{\pi})-\widehat{J}(\pi_{\mathrm{cur}})$$
and
$$\mathrm{LB}_{\mathrm{cert}}=\sum_{i=1}^{n}\widehat{L}_{i}^{\mathrm{seq}}-\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}A_{\max}\sum_{i=1}^{n}\sqrt{\widehat{\delta}_{i}}-\frac{1}{1-\gamma}\sum_{i=1}^{n}\widehat{\zeta}_{i},$$
using the empirical KL monitor $\widehat{\delta}_{i}$ and any conservative proxy $\widehat{\zeta}_{i}$. Plotting $\Delta J_{\mathrm{emp}}$ vs. $\mathrm{LB}_{\mathrm{cert}}$ across stages yields a direct certificate tightness check.
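The certified lower bound is a direct plug-in computation from logged quantities. A minimal sketch, assuming the per-step surrogate estimates, KL monitors, and $\zeta$ proxies are available as lists (the argument names and the numbers in the example are hypothetical):

```python
import math

def certified_lower_bound(surrogates, kl_monitors, zeta_proxies, a_max, gamma):
    """LB_cert = sum L_i - sqrt(2)*gamma*A_max/(1-gamma)^2 * sum sqrt(delta_i) - sum zeta_i/(1-gamma)."""
    shift_coef = math.sqrt(2.0) * gamma * a_max / (1.0 - gamma) ** 2
    return (sum(surrogates)
            - shift_coef * sum(math.sqrt(d) for d in kl_monitors)
            - sum(zeta_proxies) / (1.0 - gamma))

# Illustrative two-step stage.
lb = certified_lower_bound(
    surrogates=[0.5, 0.4], kl_monitors=[0.01, 0.01], zeta_proxies=[0.0, 0.0],
    a_max=1.0, gamma=0.5)
```

Logging `lb` next to the measured improvement for each stage gives the scatter plot described above; points far above the diagonal indicate a loose certificate.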
### F.6 Implementation Notes
This section records the practical conventions used to align with the theory.
##### Token-level reverse-KL monitor.
To estimate $D_{\mathrm{KL},\mathrm{tok}}^{\hat{\pi}^{\,i-1}}(\pi_{\mathrm{cur}}^{\sigma(i)}\|\pi_{\mathrm{new}}^{\sigma(i)})$, sample trajectories $\tau\sim\hat{\pi}^{\,i-1}$ and compute the per-token reverse KL $D_{\mathrm{KL}}(\pi_{\mathrm{cur}}(\cdot\mid\cdot)\|\pi_{\mathrm{new}}(\cdot\mid\cdot))$ on the same prefixes. If only the active agent emits tokens (turn-taking protocol), the sum is taken over the tokens controlled by agent $\sigma(i)$; all other steps contribute zero (no-op). A generic Monte Carlo estimator for the discounted token-level functional (Definition [B.1](https://arxiv.org/html/2605.15207#A2.Thmtheorem1)) is
$$\widehat{D}_{\mathrm{KL},\mathrm{tok}}^{\rho}(\pi\|\pi')=\frac{1}{M}\sum_{m=1}^{M}(1-\gamma)\sum_{t\geq 0}\gamma^{t}\,D_{\mathrm{KL}}\!\big(\pi(\cdot\mid s_{t}^{(m)})\|\pi'(\cdot\mid s_{t}^{(m)})\big),\tag{23}$$
where $\{s_{t}^{(m)}\}$ are states along trajectories sampled from the reference policy $\rho$. This estimator is unbiased for $D_{\mathrm{KL},\mathrm{tok}}^{\rho}(\pi\|\pi')$ when the inner KL is computed exactly.
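Estimator (23) can be sketched in a few lines. The version below is an illustration under simplifying assumptions: each trajectory is given as a list of per-position pairs of full next-token distributions `(p_cur, p_new)`, and positions not controlled by the updated agent are marked `None` so that they contribute zero while still advancing the discount index. The data layout and function names are hypothetical, not the paper's released code.

```python
import math

def categorical_kl(p, q):
    """Exact KL(p || q) between two next-token distributions (requires p << q)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def token_kl_monitor(trajectories, gamma):
    """Monte Carlo estimate of the discounted token-level KL over sampled trajectories."""
    total = 0.0
    for traj in trajectories:
        discounted = 0.0
        for t, pair in enumerate(traj):
            if pair is None:          # token emitted by another agent: no-op
                continue
            p_cur, p_new = pair
            discounted += (gamma ** t) * categorical_kl(p_cur, p_new)
        total += (1.0 - gamma) * discounted
    return total / len(trajectories)

p_cur, p_new = [0.5, 0.5], [0.9, 0.1]             # toy two-token vocabulary
estimate = token_kl_monitor([[None, (p_cur, p_new)]], gamma=0.9)
```

In practice the inner KL would be computed from model logits over the full vocabulary; a sampled-token approximation would no longer be exact per position and changes the unbiasedness statement accordingly.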
##### Weight ratio $w_{i}(s,\mathbf{a})$.
For the updated factor $\sigma(i)$, $w_{i}(s,\mathbf{a})=\hat{\pi}^{\,i}(\mathbf{a}\mid s)/\hat{\pi}^{\,i-1}(\mathbf{a}\mid s)$. For numerical stability, we compute $\log w_{i}$ and exponentiate as needed. Optionally apply ratio clipping $w_{i}^{\mathrm{clip}}=\mathrm{clip}(w_{i},1-\epsilon_{w},1+\epsilon_{w})$.
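A minimal sketch of the log-space ratio computation. It assumes the updated agent's action is a token sequence whose log-probability is the sum of per-token log-probabilities, and that all other factors cancel in the ratio because only factor $\sigma(i)$ changes; the function and argument names are hypothetical.

```python
import math

def factor_ratio(logp_new_tokens, logp_old_tokens, eps_w=None):
    """w_i computed as exp(sum log pi_new - sum log pi_old), optionally clipped."""
    log_w = sum(logp_new_tokens) - sum(logp_old_tokens)   # subtract in log space first
    w = math.exp(log_w)
    if eps_w is None:
        return w
    return min(max(w, 1.0 - eps_w), 1.0 + eps_w)

unclipped = factor_ratio([-0.5], [-1.0])                 # exp(0.5)
clipped = factor_ratio([-0.5], [-1.0], eps_w=0.2)        # capped at 1 + eps_w
```

Subtracting summed log-probabilities before exponentiating avoids underflow when individual sequence probabilities are extremely small.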
##### Standardization and clipping.
We use $\varepsilon_{\mathrm{std}}>0$ in the group standard deviation to avoid division by zero. Hard clipping $|\widehat{A}_{\mathrm{grp}}^{i-1}|\leq A_{\mathrm{clip}}$ enforces boundedness required by the certificates. The fraction of clipped samples is logged as a bias/tightness indicator.
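The standardization and clipping conventions might look as follows. This is a sketch under assumptions: the advantage is taken to be the group-standardized reward with $\varepsilon_{\mathrm{std}}$ added to the standard deviation (the exact placement of $\varepsilon_{\mathrm{std}}$ is not specified here), and the default values are placeholders.

```python
import math

def group_advantages(rewards, eps_std=1e-6, a_clip=5.0):
    """Standardize rewards within one prompt group, hard-clip, and report the clipped fraction."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    raw = [(r - mean) / (std + eps_std) for r in rewards]    # eps_std guards against std == 0
    clipped = [max(-a_clip, min(a_clip, a)) for a in raw]    # enforces |A| <= a_clip
    clip_frac = sum(1 for a in raw if abs(a) > a_clip) / n   # logged as a bias indicator
    return clipped, clip_frac

adv, frac = group_advantages([0.0, 0.0, 0.0, 1.0], a_clip=1.0)
flat, flat_frac = group_advantages([1.0, 1.0, 1.0, 1.0])     # degenerate group: all zeros
```

A persistently high clipped fraction suggests that `a_clip` is too small for the reward scale, which would make the boundedness assumption binding rather than nominal.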
##### Enforcing the trust region.
We adjust the penalty coefficient $\beta$ (or perform backtracking) until the empirical token-level reverse KL $\widehat{D}_{\mathrm{KL},\mathrm{tok}}(\pi_{\mathrm{cur}}^{\sigma(i)}\|\pi_{\mathrm{new}}^{\sigma(i)};\hat{\pi}^{\,i-1})\leq\delta_{i}$ holds on the current minibatch.
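The backtracking variant can be sketched generically. The code below assumes a user-supplied callable `kl_at_scale` that applies a candidate update scaled by `scale` and returns the empirical token-level reverse KL on the current minibatch; this callable, the shrink factor, and the retry limit are all hypothetical choices rather than the paper's settings.

```python
def backtrack_to_trust_region(kl_at_scale, delta, scale=1.0, shrink=0.5, max_tries=20):
    """Shrink the step scale until the measured KL is within the radius delta."""
    for _ in range(max_tries):
        if kl_at_scale(scale) <= delta:
            return scale          # accepted: constraint holds on this minibatch
        scale *= shrink           # otherwise back off and re-measure
    return 0.0                    # reject the update if no feasible scale was found

# Toy model in which KL grows quadratically with the step scale.
accepted = backtrack_to_trust_region(lambda s: 0.08 * s * s, delta=0.01)
```

An adaptive penalty coefficient $\beta$ plays the same role inside the objective: it is increased when the monitor exceeds $\delta_{i}$ and decreased when the constraint is slack.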
### F.7 Experimental Checklists and Diagnostics
- KL monitors: $\widehat{\delta}_{i}$ per step and its distribution across prompts.
- Ratio clipping rate: fraction of decisions/tokens where $w_{i}$ hits the clip bound.
- Advantage clipping rate: fraction of rollouts where $|\widehat{A}_{\mathrm{grp}}^{i-1}|=A_{\mathrm{clip}}$.
- Certificate tightness: plot empirical improvement vs. certified lower bound (Appendix [F.5](https://arxiv.org/html/2605.15207#A6.SS5)).
- Order sensitivity: compare certified bound under different update orders $\sigma$.
- Plug-in feasibility: after Stage-0 alignment, report the probe-set reverse KL to the old agent and the achieved surrogate.
## Appendix GAdditional Experiments and Analysis
This appendix reports additional analyses from Sec\.[5](https://arxiv.org/html/2605.15207#S5)due to space\.
Table 4:Overview of additional experiments in Appendix[G](https://arxiv.org/html/2605.15207#A7)\.TopicGoalSetupReported outputsTeam\-size scaling \(Sec\.[G\.1](https://arxiv.org/html/2605.15207#A7.SS1)\)Empirically validate theO\(n2\)O\(n^\{2\}\)vs\.O\(n\)O\(n\)scaling trend in within\-stage drift under stale vs\. intermediate occupancyHomogeneous Qwen3\-1\.7B teams;n∈\{2,3,4,5,6,8\}n\\in\\\{2,3,4,5,6,8\\\}; 30 stages; MATH\-500Δstale\\Delta\_\{\\mathrm\{stale\}\},DoccD\_\{\\mathrm\{occ\}\}, accuracy, stability, coordination; fitted exponentα\\alphaRollout/token accounting \(Sec\.[G\.2](https://arxiv.org/html/2605.15207#A7.SS2)\)Quantify sampling overhead from intermediate\-occupancy resampling using hardware\-agnostic countersAIME24; 3×\\timesQwen3\-8B; 40 stagesTokens, rollouts; relative overhead vs\. stale\-rollout baselineProxy logging forζi\\zeta\_\{i\}\(Sec\.[G\.3](https://arxiv.org/html/2605.15207#A7.SS3)\)Provide a conservative, log\-based proxy for the estimation\-error term in the certificateSame runs as main experiments; no extra rolloutsComponent\-wise proxy statistics and contribution\-to\-bound summarySampled token\-KL reliability \(Sec\.[G\.4](https://arxiv.org/html/2605.15207#A7.SS4)\)Check statistical stability of the sampled token\-KL monitor via subsampling/bootstrap \(no full\-vocab “exact KL”\)Same rollout batches; no extra rolloutsBootstrap/subsample variability; near\-threshold flip rateAblations \+ IS degeneracy \(Secs\.[G\.5](https://arxiv.org/html/2605.15207#A7.SS5)–[G\.6](https://arxiv.org/html/2605.15207#A7.SS6)\)Identify key design choices; show why importance weighting is unstable without resamplingAIME24Ablation metrics; ESS and tail stats of importance weightsCross\-generation replacement \(Sec\.[G\.7](https://arxiv.org/html/2605.15207#A7.SS7)\)Protocol unification and Stage\-0 alignment for Qwen2\.5→\\toQwen3 swapAIME24; stage\-40 evaluationSwap shock and final accuracy##### Shared settings\.
Unless stated otherwise, training and evaluation follow Sec. [5](https://arxiv.org/html/2605.15207#S5). We use 3 random seeds and report the mean ± standard deviation. Token counts include both prompts and generated tokens, aggregated across all agents/models used by the method.
### G.1 Scaling Behavior with Team Size
Proposition [3.3](https://arxiv.org/html/2605.15207#S3.Thmtheorem3) predicts that stale-occupancy effects compound with team size $n$, while intermediate-occupancy evaluation mitigates this growth. We vary $n\in\{2,3,4,5,6,8\}$ on MATH-500 using homogeneous teams (all agents: Qwen3-1.7B, identical initialization), fixed $\delta_{i}=0.01$, and 30 stages. We compare Naive Sequential (cached stage-start rollouts reused within a stage) with TeamTR (resampling under the intermediate team before each within-stage update), while maintaining the same total rollout budget per stage.
We report two within\-stage drift proxies measured at the final stage:
$$\Delta_{\mathrm{stale}}=\sum_{i=1}^{n}\left|\widehat{L}_{i}^{\mathrm{seq}}-\widehat{L}_{i}^{\mathrm{stale}}\right|,\qquad D_{\mathrm{occ}}=\sum_{i=1}^{n}\mathrm{TV}(\widehat{d}^{\,\hat{\pi}^{\,i-1}},\widehat{d}^{\,\hat{\pi}^{0}}),\quad\mathrm{TV}(p,q)=\tfrac{1}{2}\sum_{s}|p(s)-q(s)|.$$

Here $\widehat{d}^{\,\pi}$ is the empirical distribution over hashed shared-context strings observed in rollouts under $\pi$ (a coarse but consistent proxy for shared-context drift).
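As a rough illustration of the $D_{\mathrm{occ}}$ proxy, the sketch below hashes shared-context strings, builds empirical distributions, and sums total-variation distances to the stage-start batch. The hash function (truncated SHA-256), the helper names, and the toy batches are assumptions made for this example; the paper does not specify its hashing scheme.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List


def context_hash(context: str) -> str:
    """Map a shared-context string to a short, stable key (assumed scheme)."""
    return hashlib.sha256(context.encode("utf-8")).hexdigest()[:16]


def empirical_dist(contexts: Iterable[str]) -> Dict[str, float]:
    """Empirical distribution over hashed contexts seen in one rollout batch."""
    counts = Counter(context_hash(c) for c in contexts)
    total = sum(counts.values())
    return {key: cnt / total for key, cnt in counts.items()}


def tv_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """TV(p, q) = 1/2 * sum_s |p(s) - q(s)| over the union of supports."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)


def d_occ(stage_start: List[str], intermediates: List[List[str]]) -> float:
    """Sum of TV distances between each intermediate batch and the stage-start batch."""
    d0 = empirical_dist(stage_start)
    return sum(tv_distance(empirical_dist(batch), d0) for batch in intermediates)


# Toy batches: the first intermediate batch matches stage start, the second has drifted.
start_batch = ["a", "a", "b", "b"]
drift = d_occ(start_batch, [["a", "a", "b", "b"], ["a", "a", "a", "c"]])
print(drift)
```

Exact-string hashing treats any two distinct contexts as unrelated, which is why the text describes this as a coarse proxy.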
Table 5: Scaling with team size $n$ on MATH-500. Lower is better for $\Delta_{\mathrm{stale}}$, $D_{\mathrm{occ}}$, and Stability. Entries are mean ± std over 3 seeds; Coord. is reported as mean. The exponent row reports the fitted power-law scaling ($\propto n^{\alpha}$) for Naive/TeamTR.

| $n$ | Method | $\Delta_{\mathrm{stale}}$ ↓ | $D_{\mathrm{occ}}$ ↓ | Acc. (%) | Stab. ↓ | Coord. (%) |
|---|---|---|---|---|---|---|
| 2 | Naive Seq. | 0.08 ± 0.01 | 0.05 ± 0.01 | 82.1 ± 0.9 | 1.8 ± 0.3 | 76.3 |
| 2 | TeamTR | 0.03 ± 0.01 | 0.02 ± 0.00 | 83.5 ± 0.7 | 1.1 ± 0.2 | 78.9 |
| 3 | Naive Seq. | 0.21 ± 0.02 | 0.14 ± 0.02 | 79.5 ± 1.2 | 3.2 ± 0.5 | 71.5 |
| 3 | TeamTR | 0.05 ± 0.01 | 0.04 ± 0.01 | 85.2 ± 0.8 | 1.4 ± 0.2 | 81.3 |
| 4 | Naive Seq. | 0.41 ± 0.04 | 0.28 ± 0.03 | 75.8 ± 1.5 | 5.1 ± 0.7 | 65.2 |
| 4 | TeamTR | 0.07 ± 0.01 | 0.05 ± 0.01 | 86.1 ± 0.9 | 1.6 ± 0.3 | 82.7 |
| 5 | Naive Seq. | 0.68 ± 0.06 | 0.47 ± 0.05 | 71.2 ± 1.8 | 7.5 ± 0.9 | 58.1 |
| 5 | TeamTR | 0.09 ± 0.01 | 0.07 ± 0.01 | 86.8 ± 1.0 | 1.9 ± 0.3 | 83.5 |
| 6 | Naive Seq. | 1.02 ± 0.09 | 0.71 ± 0.07 | 66.3 ± 2.1 | 10.3 ± 1.2 | 51.7 |
| 6 | TeamTR | 0.11 ± 0.02 | 0.08 ± 0.01 | 87.2 ± 1.1 | 2.1 ± 0.3 | 84.1 |
| 8 | Naive Seq. | 1.89 ± 0.15 | 1.31 ± 0.12 | 58.7 ± 2.8 | 15.8 ± 1.8 | 42.3 |
| 8 | TeamTR | 0.15 ± 0.02 | 0.11 ± 0.02 | 87.9 ± 1.2 | 2.5 ± 0.4 | 84.8 |
| Exponent $\alpha$ (Naive / TeamTR) | | 1.94 / 1.07 | 1.91 / 1.05 | – | 1.89 / 0.93 | – |

For the scaling study we match the per-stage rollout budget by reducing per-update batch sizes; Table [6](https://arxiv.org/html/2605.15207#A7.T6) reports the overhead under the default setting used in main runs.
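A power-law exponent of the form $y\propto n^{\alpha}$ can be estimated by ordinary least squares on log-transformed values. The sketch below shows one such fit on synthetic data; the function name and the toy inputs are assumptions for illustration. The paper does not state how its exponents were fitted, so this sketch should not be expected to reproduce the reported values exactly.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(ns: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log y = log c + alpha * log n; returns (alpha, c)."""
    xs = [math.log(n) for n in ns]
    ls = [math.log(y) for y in ys]
    mean_x = sum(xs) / len(xs)
    mean_l = sum(ls) / len(ls)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, ls))
    alpha = sxy / sxx
    c = math.exp(mean_l - alpha * mean_x)
    return alpha, c


# Synthetic drift curves: one exactly quadratic in n, one exactly linear in n.
team_sizes = [2, 3, 4, 5, 6, 8]
alpha_quadratic, _ = fit_power_law(team_sizes, [0.02 * n ** 2 for n in team_sizes])
alpha_linear, _ = fit_power_law(team_sizes, [0.02 * n for n in team_sizes])
print(round(alpha_quadratic, 3), round(alpha_linear, 3))
```

With only six team sizes, the fitted exponent is sensitive to the endpoints, so it is best read as a trend indicator rather than a precise estimate.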
### G.2 Rollout/Token Accounting
Intermediate-occupancy resampling can introduce extra sampling relative to stale-rollout baselines. To provide hardware-agnostic accounting, we report the total number of sampled tokens and rollout episodes aggregated across all agents. All methods are run for 40 stages on AIME24 with the same training length as in Sec. [5](https://arxiv.org/html/2605.15207#S5) (3$\times$Qwen3-8B).

Table 6: Rollout/token accounting on AIME24 (3$\times$Qwen3-8B, 40 stages). Tokens and rollouts are aggregated across agents.

| Method | Tokens (M) | Rollouts (K) |
|---|---|---|
| Naive Sequential | 142.5 | 44.3 |
| Joint Update | 145.1 | 45.1 |
| TeamTR (Ours) | 158.3 | 49.2 |
| Relative overhead of TeamTR vs. Naive Sequential | +11.1% | +11.1% |
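The relative-overhead row follows directly from the token and rollout counts in Table 6. The short check below recomputes it; the helper name is an assumption for illustration.

```python
def relative_overhead(method: float, baseline: float) -> float:
    """Relative overhead of `method` over `baseline`, in percent."""
    return 100.0 * (method - baseline) / baseline


# Values from Table 6: TeamTR vs. Naive Sequential.
tokens_overhead = relative_overhead(158.3, 142.5)   # tokens, in millions
rollouts_overhead = relative_overhead(49.2, 44.3)   # rollouts, in thousands
print(f"{tokens_overhead:.1f}% {rollouts_overhead:.1f}%")  # prints "11.1% 11.1%"
```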
### G.3 Proxy Logging for $\zeta_{i}$
The certificates in Theorems [3.4](https://arxiv.org/html/2605.15207#S3.Thmtheorem4)–[3.6](https://arxiv.org/html/2605.15207#S3.Thmtheorem6) include an estimation-error term $\zeta_{i}$ (Eq. ([6](https://arxiv.org/html/2605.15207#S3.E6))), which depends on the (unobserved) true advantage. In practice, we log a conservative proxy, $\widehat{\zeta}_{i}$, computed from quantities already available in our training pipeline, which captures three dominant sources of surrogate mismatch in our implementation.
##### Per\-step proxy components\.
At within-stage step $i$, on the rollout batch collected under $\hat{\pi}^{\,i-1}$, we compute:

$$\widehat{\zeta}_{i}^{\mathrm{clip}}=\frac{1}{|\mathcal{B}|G}\sum_{b,g}\left|a_{b,g}-\tilde{A}_{b,g}\right|,\qquad\widehat{\zeta}_{i}^{\mathrm{ratio}}=\frac{1}{|\mathcal{B}|G}\sum_{b,g}|\tilde{A}_{b,g}|\cdot|w_{b,g}-\bar{w}_{b,g}|.$$

Here $a_{b,g}$ and $\tilde{A}_{b,g}$ are the unclipped and clipped group-normalized advantages (Eq. ([15](https://arxiv.org/html/2605.15207#S4.E15))), and $w_{b,g}$ and $\bar{w}_{b,g}$ are the PPO likelihood ratio and its clipped version. We also log a normalization-uncertainty term, $\widehat{\zeta}_{i}^{\mathrm{norm}}$, via a half-split estimate within each prompt group (computed without additional rollouts). We aggregate

$$\widehat{\zeta}_{i}^{\mathrm{proxy}}=\widehat{\zeta}_{i}^{\mathrm{norm}}+\widehat{\zeta}_{i}^{\mathrm{clip}}+\widehat{\zeta}_{i}^{\mathrm{ratio}}.$$

We emphasize that $\widehat{\zeta}_{i}^{\mathrm{proxy}}$ is a diagnostic proxy rather than an unbiased estimator of $\zeta_{i}$.
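A minimal sketch of the clip and ratio components is given below, operating on advantages and likelihood ratios flattened over prompts $b$ and group samples $g$. The clipping ranges (`adv_clip`, `ratio_eps`), the symmetric form of both clips, and the function name are assumptions; the normalization term is passed in as a precomputed number because the half-split estimate is not specified here.

```python
from typing import Sequence, Tuple


def clamp(x: float, lo: float, hi: float) -> float:
    """Clip x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def zeta_proxy(
    advantages: Sequence[float],   # unclipped group-normalized advantages a_{b,g}
    ratios: Sequence[float],       # PPO likelihood ratios w_{b,g}
    adv_clip: float,               # assumed symmetric advantage clip range [-c, c]
    ratio_eps: float,              # assumed PPO ratio clip range [1 - eps, 1 + eps]
    zeta_norm: float = 0.0,        # precomputed normalization-uncertainty term
) -> Tuple[float, float, float]:
    """Return (zeta_clip, zeta_ratio, zeta_proxy) averaged over all (b, g) entries."""
    count = len(advantages)
    clipped_adv = [clamp(a, -adv_clip, adv_clip) for a in advantages]
    clipped_ratio = [clamp(w, 1.0 - ratio_eps, 1.0 + ratio_eps) for w in ratios]
    zeta_clip = sum(abs(a - ac) for a, ac in zip(advantages, clipped_adv)) / count
    zeta_ratio = sum(
        abs(ac) * abs(w - wc)
        for ac, w, wc in zip(clipped_adv, ratios, clipped_ratio)
    ) / count
    return zeta_clip, zeta_ratio, zeta_norm + zeta_clip + zeta_ratio


# Toy batch with one heavily clipped advantage and two clipped ratios.
z_clip, z_ratio, z_total = zeta_proxy(
    [0.5, -3.0, 2.5], [1.0, 1.5, 0.7], adv_clip=2.0, ratio_eps=0.2
)
print(z_clip, z_ratio, z_total)
```

Both components vanish when no clipping is active, so nonzero values indicate how far the surrogate departs from its unclipped form on that batch.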
##### Reported summary\.
Table [7](https://arxiv.org/html/2605.15207#A7.T7) summarizes proxy magnitudes (per update and per stage) and the relative contribution of the $\widehat{\zeta}$ term to the stage-wise certificate components.

Table 7: Logged proxy statistics for $\widehat{\zeta}_{i}$.

| Metric | Mean | P50 | P90 |
|---|---|---|---|
| $\widehat{\zeta}_{i}^{\mathrm{clip}}$ (per update) | 0.003 | 0.001 | 0.010 |
| $\widehat{\zeta}_{i}^{\mathrm{ratio}}$ (per update) | 0.012 | 0.008 | 0.035 |
| $\widehat{\zeta}_{i}^{\mathrm{norm}}$ (per update) | 0.028 | 0.022 | 0.070 |
| $\widehat{\zeta}_{i}^{\mathrm{proxy}}$ (per update) | 0.043 | 0.034 | 0.110 |
| $\sum_{i}\widehat{\zeta}_{i}^{\mathrm{proxy}}$ (per stage) | 0.13 | 0.10 | 0.30 |
| $\frac{\sum_{i}\widehat{\zeta}_{i}^{\mathrm{proxy}}/(1-\gamma)}{\text{total penalty}}$ | 0.18 | 0.14 | 0.45 |
### G.4 Reliability of the Sampled Token-KL Monitor
Our trust\-region monitor uses a sampled token\-level KL \(behavior\-to\-updated\) computed from on\-policy rollouts \(Eq\. \([18](https://arxiv.org/html/2605.15207#S4.E18)\)\)\. To assess statistical stability without requiring expensive full\-vocabulary “exact KL” computations, we perform a log\-based subsampling check on the same rollout batches used for updates\.
##### Subsampling check\.
For each update, we compute $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}$ using all token positions in the rollout batch, and also compute a subsampled estimate $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}^{(q)}$ using a random fraction $q\in\{25\%,50\%\}$ of token positions (repeated with multiple random seeds per batch). We report the normalized absolute deviation $|\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}^{(q)}-\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}|/\delta$ and a near-threshold flip rate:

$$\Pr\big[\mathbf{1}\{\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}\leq\delta\}\neq\mathbf{1}\{\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}^{(q)}\leq\delta\}\ \big|\ \widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}\in[0.8\delta,1.2\delta]\big].$$

Both quantities are computed from logged token-level log-probabilities (no extra rollouts).
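The subsampling check can be sketched as follows. This example assumes the sampled token-KL is the mean of behavior-minus-updated log-probabilities over sampled token positions; the exact estimator is given by Eq. (18) in the main text and may differ. Function names, the number of repeats, and the toy log-probabilities are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def sampled_token_kl(logp_old: Sequence[float], logp_new: Sequence[float]) -> float:
    """Assumed estimator: mean of log pi_old(token) - log pi_new(token) over positions."""
    return sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)


def near_threshold(kl: float, delta: float) -> bool:
    """True when the full estimate lies in [0.8 * delta, 1.2 * delta]."""
    return 0.8 * delta <= kl <= 1.2 * delta


def subsample_check(
    logp_old: Sequence[float],
    logp_new: Sequence[float],
    delta: float,
    q: float,
    repeats: int = 20,
    seed: int = 0,
) -> Tuple[float, List[float], float]:
    """Return (full KL, normalized deviations, fraction of repeats whose decision flips)."""
    rng = random.Random(seed)
    n = len(logp_old)
    k = max(1, round(q * n))
    full = sampled_token_kl(logp_old, logp_new)
    deviations: List[float] = []
    flips = 0
    for _ in range(repeats):
        idx = rng.sample(range(n), k)  # token positions drawn without replacement
        sub = sampled_token_kl([logp_old[i] for i in idx], [logp_new[i] for i in idx])
        deviations.append(abs(sub - full) / delta)
        if (full <= delta) != (sub <= delta):
            flips += 1
    return full, deviations, flips / repeats


# Toy logged log-probabilities for four token positions.
full_kl, devs, flip_rate = subsample_check(
    [-1.0, -2.0, -0.5, -1.5], [-1.1, -2.05, -0.6, -1.45], delta=0.1, q=0.5
)
print(full_kl, max(devs), flip_rate)
```

To match the conditional probability above, the per-update flip fractions would be averaged only over updates for which `near_threshold(full_kl, delta)` holds.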
Table 8: Stability of the sampled token-KL monitor under token-position subsampling.

| Subsample $q$ | Median $\lvert\Delta\rvert/\delta$ | P90 $\lvert\Delta\rvert/\delta$ | Near-threshold flip rate |
|---|---|---|---|
| 25% | 0.06 | 0.18 | 5.7% |
| 50% | 0.03 | 0.11 | 2.4% |
| 100% (full) | 0 | 0 | 0 |
### G.5 Ablation Studies
Table 9: Ablations on AIME24 (3$\times$Qwen3-8B, 30 stages). $\Delta_{\mathrm{stale}}$ is measured within-stage at the final stage; Stability is the std of per-stage return improvements; Coord. is consensus-on-correct.

| Variant | Acc. (%) | $\Delta_{\mathrm{stale}}$ ↓ | Stab. ↓ | Coord. (%) |
|---|---|---|---|---|
| TeamTR (full) | 88.1 ± 1.2 | 0.08 | 1.9 | 89.1 |
| *Resampling strategy* | | | | |
| No resampling (= Naive Seq.) | 71.1 ± 2.8 | 0.31 | 4.2 | 71.5 |
| Resample every 2 updates | 79.3 ± 1.8 | 0.18 | 2.8 | 79.2 |
| Importance weighting (no resample) | 74.5 ± 2.3 | 0.25 | 3.5 | 74.8 |
| *Trust region* | | | | |
| No trust region ($\delta\to\infty$) | 68.3 ± 3.5 | 0.42 | 6.1 | 62.3 |
| Fixed $\delta=0.001$ (too small) | 82.5 ± 1.5 | 0.05 | 1.5 | 85.2 |
| Fixed $\delta=0.1$ (too large) | 75.1 ± 2.5 | 0.28 | 4.8 | 70.1 |
| Adaptive $\delta$ (target $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}=0.01$) | 87.8 ± 1.3 | 0.09 | 2.0 | 88.5 |
| *Update order* | | | | |
| Fixed order (1, 2, 3) | 87.2 ± 1.4 | 0.09 | 2.1 | 87.8 |
| Reverse order (3, 2, 1) | 86.9 ± 1.5 | 0.10 | 2.2 | 87.1 |
| Random order (each stage) | 87.5 ± 1.3 | 0.08 | 1.9 | 88.3 |
| *Advantage estimation* | | | | |
| No group normalization | 83.1 ± 2.1 | 0.12 | 3.1 | 81.5 |
| No hard clipping | 84.5 ± 1.9 | 0.11 | 2.7 | 83.2 |
### G.6 Why We Resample: Importance-Weight Degeneracy
An alternative to resampling under intermediate occupancy is to reuse stage\-start rollouts and correct via importance weighting\. We empirically show that trajectory\-level weights become heavy\-tailed and the effective sample size collapses as within\-stage updates accumulate\.
##### Setup\.
For each within-stage step $i\in\{2,\ldots,n\}$, we compute trajectory-level importance weights that would be needed to reweight stage-start rollouts $\tau\sim d^{\hat{\pi}^{0}}$ to approximate the intermediate occupancy $d^{\hat{\pi}^{i-1}}$:

$$w^{0\to i-1}(\tau)\;=\;\prod_{t:\,j_{t}\in U_{i-1}}\frac{\hat{\pi}^{i-1,(j_{t})}(m_{t}\mid s_{t})}{\hat{\pi}^{0,(j_{t})}(m_{t}\mid s_{t})},\qquad U_{i-1}=\{\sigma(1),\ldots,\sigma(i-1)\}.$$

We measure degeneracy via normalized effective sample size $\mathrm{ESS}/B=(\sum_{b}w_{b})^{2}/(B\sum_{b}w_{b}^{2})$ and tail statistics.
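The weight and ESS definitions above can be sketched in a few lines. The example works in log space for numerical stability and treats each trajectory as a list of per-message records; the record layout and function names are assumptions for illustration.

```python
import math
from typing import Iterable, Sequence, Set, Tuple

# (acting agent j_t, log-prob under the intermediate policy, log-prob under the stage-start policy)
Step = Tuple[int, float, float]


def trajectory_log_weight(steps: Iterable[Step], updated: Set[int]) -> float:
    """log w(tau): sum of log-ratios over messages produced by already-updated agents."""
    return sum(lp_new - lp_old for agent, lp_new, lp_old in steps if agent in updated)


def normalized_ess(log_weights: Sequence[float]) -> float:
    """ESS/B = (sum_b w_b)^2 / (B * sum_b w_b^2), computed after a max-shift in log space."""
    shift = max(log_weights)
    weights = [math.exp(lw - shift) for lw in log_weights]  # rescaling leaves ESS/B unchanged
    return sum(weights) ** 2 / (len(weights) * sum(w * w for w in weights))


# Only agent 0 has been updated, so agent 1's message does not contribute to the weight.
log_w = trajectory_log_weight([(0, -1.0, -1.5), (1, -2.0, -1.0), (0, -0.5, -0.5)], {0})
ess_uniform = normalized_ess([0.0, 0.0, 0.0, 0.0])                # all weights equal
ess_collapsed = normalized_ess([0.0, -1000.0, -1000.0, -1000.0])  # one trajectory dominates
print(log_w, ess_uniform, ess_collapsed)  # prints "0.5 1.0 0.25"
```

Because the weight is a product over every message from updated agents, its log is a sum whose variance grows with both trajectory length and $|U_{i-1}|$, which is consistent with the collapse reported in Table 10.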
Table 10: Importance-weight degeneracy when reusing stage-start rollouts (AIME24, Qwen3-8B, $\delta=0.01$). ESS$/B$ is the normalized effective sample size (higher is better); P99$(w)$ and $\max(w)$ measure tail heaviness (lower is better).

| Step $i$ | $\lvert U_{i-1}\rvert$ | ESS$/B$ ↑ | P95$(w)$ | P99$(w)$ ↓ | $\max(w)$ | $\Pr[w>10]$ |
|---|---|---|---|---|---|---|
| *Team size $n=3$* | | | | | | |
| 2 | 1 | 0.42 ± 0.03 | 3.2 | 12.8 | 87 | 2.1% |
| 3 | 2 | 0.18 ± 0.02 | 6.1 | 38.5 | 312 | 5.8% |
| *Team size $n=5$* | | | | | | |
| 2 | 1 | 0.41 ± 0.03 | 3.3 | 13.1 | 92 | 2.2% |
| 3 | 2 | 0.17 ± 0.02 | 6.2 | 41.2 | 338 | 6.1% |
| 4 | 3 | 0.08 ± 0.01 | 11.5 | 127.3 | 1,247 | 10.3% |
| 5 | 4 | 0.04 ± 0.01 | 21.8 | 385.1 | 4,582 | 15.2% |
### G.7 Cross-Generation Replacement: Qwen2.5 to Qwen3
Replacing an agent across model generations (e.g., Qwen2.5$\to$Qwen3) can be more tractable than cross-family replacement (e.g., Qwen$\to$LLaMA), because the swap can often be performed under a largely shared chat protocol and tokenizer interface. However, even within the same model lineage, protocol-level mismatches can induce substantial occupancy shift in shared-context teams if left unaddressed. We summarize the main sources of mismatch we encountered and the corresponding mitigations.
##### Canonical shared\-context protocol\.
In shared\-context teams, seemingly “out\-of\-band” choices \(system prompts, templates, tool\-call formatting\) become part of the effective state\. We treat these choices as part of a canonical team protocol and enforce them uniformly for all agents before and after replacement\.
##### System prompt defaults\.
Qwen2\.5\-Instruct deployments may prepend a non\-empty default system prompt, while Qwen3 deployments may not\. We remove this ambiguity by explicitly setting the system prompt for all agents \(either fixed or explicitly empty\)\.
##### Reasoning-tag mode (e.g., <think\>...</think\>).
Some Qwen3 configurations optionally emit explicit reasoning\-tag blocks\. We enforce a uniform policy across the team; in our experiments, we disable reasoning\-tag mode to maintain compatibility with Qwen2\.5 agents\.
##### Tool\-call serialization\.
We introduce a lightweight adapter that normalizes tool\-call outputs to a canonical JSON schema before writing to the shared context\.
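One way such an adapter might look is sketched below. The canonical schema (`name` plus `arguments`), the two accepted input shapes, and the function name are all hypothetical; the paper does not specify its schema or the raw formats it normalizes.

```python
import json
from typing import Any, Dict


def normalize_tool_call(raw: Dict[str, Any]) -> str:
    """Serialize a tool call into one canonical JSON string (hypothetical schema)."""
    call = raw.get("function", raw)  # accept an optional {"function": {...}} wrapper
    args = call.get("arguments", call.get("parameters", {}))
    if isinstance(args, str):        # arguments may arrive as a JSON-encoded string
        args = json.loads(args)
    canonical = {"name": call["name"], "arguments": args}
    # Sorted keys and fixed separators make the output byte-identical across agents.
    return json.dumps(canonical, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


# Two hypothetical raw shapes that describe the same call.
call_a = normalize_tool_call({"function": {"name": "calc", "arguments": "{\"y\": 2, \"x\": 1}"}})
call_b = normalize_tool_call({"name": "calc", "parameters": {"x": 1, "y": 2}})
print(call_a == call_b)  # prints "True"
```

Deterministic serialization matters here because the serialized call becomes part of the shared context, so formatting differences alone would otherwise register as occupancy shift.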
##### Tokenizer interface and KL monitoring\.
In our setup, the official Qwen2\.5 and Qwen3 tokenizers produce identical token ID sequences on a held\-out set of 1,000 shared\-context strings, enabling direct reuse of our sampled token\-level KL monitor for Stage\-0 alignment\.
##### Stage\-0 alignment procedure\.
We sample 500 probe contexts from the pre-swap team’s occupancy and fine-tune the new Qwen3 agent (via supervised distillation on the replaced agent’s outputs) until $\widehat{D_{\mathrm{KL}}}_{\mathrm{tok}}\leq\delta$ on the probe distribution. After alignment, standard TeamTR updates resume; Proposition [3.8](https://arxiv.org/html/2605.15207#S3.Thmtheorem8) applies to subsequent updates (the replacement step itself is not certified).
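The control flow of the alignment procedure can be sketched as a loop that alternates a distillation step with a KL check on the probe set. The callables `distill_step` and `estimate_kl`, the step budget, and the toy stand-ins used in the demonstration are assumptions; the actual training step and KL estimator are those of the paper's pipeline.

```python
from typing import Callable, Dict


def stage0_align(
    distill_step: Callable[[], None],   # one supervised distillation update (hypothetical hook)
    estimate_kl: Callable[[], float],   # sampled token-KL on the probe contexts (hypothetical hook)
    delta: float,
    max_steps: int = 1000,              # assumed safety budget, not from the paper
) -> int:
    """Distill until the probe-set KL is at most delta; return the number of steps taken."""
    steps = 0
    while estimate_kl() > delta:
        if steps >= max_steps:
            raise RuntimeError("Stage-0 alignment did not reach the KL threshold")
        distill_step()
        steps += 1
    return steps


# Toy stand-in: each distillation step halves the measured KL.
state: Dict[str, float] = {"kl": 0.08}


def toy_step() -> None:
    state["kl"] *= 0.5


steps_used = stage0_align(toy_step, lambda: state["kl"], delta=0.015)
print(steps_used)  # prints "3"
```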
Table 11: Ablation of protocol unification for Qwen2.5$\to$Qwen3 replacement. Swap shock is the magnitude of the immediate accuracy drop at $k_{\mathrm{swap}}$; lower is better. Final accuracy is measured at stage 40 on AIME24.

| Configuration | Swap Shock (↓) | Final Acc. (%) |
|---|---|---|
| No unification (direct swap) | 18.3 | 58.2 |
| + Fixed system prompt | 12.1 | 67.5 |
| + Reasoning-tag mode disabled | 8.7 | 73.1 |
| + Tool-call adapter | 6.2 | 76.8 |
| + Stage-0 alignment (full) | 2.9 | 85.3 |
## Appendix H Benchmarks
##### AIME 2024 and AIME 2025\.
The American Invitational Mathematics Examination (AIME) is an invitational math contest administered by the MAA for top performers on the AMC series. Each AIME form is a 15-problem, 3-hour exam with integer answers in $[0,999]$ (often written with leading zeros), and calculators are prohibited. The problems span major pre-college topics (e.g., algebra, geometry, number theory, and combinatorics) and typically require multi-step reasoning and creative problem solving. We evaluate on the official 2024 and 2025 problem sets (AIME I and AIME II for each year; 30 problems/year, 60 total), grading by exact match on the final integer answer. We report pass@64 and avg@64.
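The sketch below shows one common reading of these metrics together with exact-match grading that tolerates leading zeros. It assumes pass@k counts a problem as solved if any of its k samples is correct and avg@k is the mean per-sample accuracy; these definitions are not spelled out in this appendix, and the function names are illustrative.

```python
from typing import Sequence


def grade_aime(prediction: str, answer: int) -> bool:
    """Exact match on the final integer, so '042' and '42' are both accepted for 42."""
    text = prediction.strip()
    return text.isdecimal() and int(text) == answer


def pass_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Assumed definition: fraction of problems with at least one correct sample."""
    return sum(any(row) for row in correct) / len(correct)


def avg_at_k(correct: Sequence[Sequence[bool]]) -> float:
    """Assumed definition: per-problem mean sample accuracy, averaged over problems."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


# Three toy problems with k = 4 samples each.
outcomes = [
    [True, False, False, False],
    [False, False, False, False],
    [True, True, False, False],
]
pass_score = pass_at_k(outcomes)
avg_score = avg_at_k(outcomes)
print(pass_score, avg_score)
```

Under these definitions pass@k can only rise as k grows, while avg@k stays at the per-sample accuracy, so the two together separate coverage from consistency.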
##### MATH\-500\.
MATH-500 is a 500-problem held-out subset derived from the MATH benchmark (Hendrycks et al., [2021](https://arxiv.org/html/2605.15207#bib.bib23)) (competition-style problems with LaTeX solutions and standardized final answers). The underlying MATH dataset covers seven subjects (prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus) and difficulty levels 1–5. This 500-problem split is widely used as a representative evaluation subset of MATH, and is commonly adopted in modern LLM math-evaluation pipelines. We use the provided ground-truth final answers for automatic grading after normalization (e.g., stripping formatting such as \\boxed\{\} when applicable). We report pass@4 and avg@4.
##### ZebraLogic\.
ZebraLogic (Lin et al., [2025](https://arxiv.org/html/2605.15207#bib.bib22)) is a logical reasoning benchmark of logic-grid (Einstein/Zebra) puzzles derived from constraint satisfaction problems (CSPs). It contains 1,000 automatically generated puzzles with *controllable and quantifiable* complexity, including a wide range of search-space sizes and logical constraint structures (e.g., measured via SMT-solver conflict statistics). Each instance provides a narrative and a set of clues; the model must output a complete, globally consistent assignment (we use the benchmark’s structured output format for parsing). We report pass@64 and avg@64.
##### AutoLogi\.
AutoLogi (Zhu et al., [2025](https://arxiv.org/html/2605.15207#bib.bib49)) is a bilingual (English/Chinese) *open-ended* logic-puzzle benchmark designed to avoid multiple-choice guessing effects. It reformulates problems from established logical-reasoning assessments (e.g., AR-LSAT and LogiQA) into constraint-based puzzles, and pairs each puzzle with *programmatic verification*: a format specification (JSON schema), a format verifier, a constraint verifier, and a traversal/enumeration procedure to validate solvability and filter invalid instances. The base AutoLogi benchmark (testing data, Stage 2) comprises 206 English and 139 Chinese puzzles; an augmented version expands the set via constraint expansion/reduction to create a range of difficulties. In our evaluation, we use the official verifiers for automatic grading and report pass@64 and avg@64.
##### ARBench\.
AR-Bench (ARBench) (Zhou et al., [2025](https://arxiv.org/html/2605.15207#bib.bib20)) evaluates *active reasoning under incomplete information*, where an LLM must interact to acquire missing evidence before answering. AR-Bench contains 6,040 interactive puzzles spanning three task families: Detective Cases (DC) (interrogation-style cases with multiple suspects and noisy/role-dependent feedback), Situation Puzzles (SP) (lateral-thinking mysteries solved through yes/no questioning), and Guessing Numbers (GN) (deducing a hidden 4-digit code from structured match/misplacement feedback). For each episode, the model alternates between proposing an information-seeking question (or a guess) and receiving environmental feedback; success depends on both the quality of the question and the reasoning based on the acquired information. We allow up to 25 interaction rounds and report pass@25 and avg@25.
##### PlanBench\.
PlanBench (Valmeekam et al., [2023](https://arxiv.org/html/2605.15207#bib.bib19)) is an extensible benchmark suite for evaluating planning and reasoning about actions and change, grounded in classical planning (IPC-style) domains represented in PDDL and rendered as natural-language prompts. PlanBench tests eight planning-related capabilities: plan generation, cost-optimal planning, plan verification, reasoning about plan execution, robustness to goal reformulation, plan reuse, replanning under unexpected events, and plan generalization. The benchmark is initialized with domains such as Blocksworld and Logistics (including obfuscated “mystery” variants), and provides automated executors/validators for scoring. In our experiments, we use the Blocksworld plan-generation subset and report pass@8 and avg@8.