Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

arXiv cs.AI 05/20/26, 04:00 AM Papers
Summary
This paper formalizes workflow learning in multi-agent LLM pipelines as an interface-constrained semi-Markov decision process (IC-SMDP) and proposes IC-$Q$, an asynchronous decentralized $Q$-learning algorithm with a finite-sample bound that decomposes into three error sources. The authors describe it as, to their knowledge, the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability.
arXiv:2605.19140v1 Announce Type: new Abstract: We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.
Cached at: 05/20/26, 08:28 AM
# Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
Source: [https://arxiv.org/html/2605.19140](https://arxiv.org/html/2605.19140)
- Jiayu Li, Stern School of Business, New York University, New York, NY 10012, jl15681@stern.nyu.edu
- Enpei Zhang, Department of Computer Science, Dartmouth College, Hanover, NH 03755, enpei.zhang.gr@dartmouth.edu
- Dawei Zhou, Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, zhoud@vt.edu
- Elynn Chen, Stern School of Business, New York University, New York, NY 10012, elynn.chen@stern.nyu.edu
- Yujun Yan, Department of Computer Science, Dartmouth College, Hanover, NH 03755, yujun.yan@dartmouth.edu

###### Abstract

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories. This is the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments (a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming) show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

## 1 Introduction

Multi-agent LLM systems coordinate specialized agents such as planners, coders, testers, retrievers, and verifiers [hong2023metagpt](https://arxiv.org/html/2605.19140#bib.bib19); [huang2023agentcoder](https://arxiv.org/html/2605.19140#bib.bib21); [gottweis2025aiscientist](https://arxiv.org/html/2605.19140#bib.bib18); [yao2023react](https://arxiv.org/html/2605.19140#bib.bib40) into workflows whose end-to-end performance is shaped by how the workflow is structured: which agent acts first, when control transfers, and what information passes between them. Two distinct regimes for designing such workflows have emerged.

The centralized regime assumes a single designer with global visibility into agent trajectories. Hand-designed workflows [hong2023metagpt](https://arxiv.org/html/2605.19140#bib.bib19); [wu2023autogen](https://arxiv.org/html/2605.19140#bib.bib37) fix the structure in advance, while learned workflow generators [zhuge2024gptswarm](https://arxiv.org/html/2605.19140#bib.bib44); [zhang2024aflow](https://arxiv.org/html/2605.19140#bib.bib41); [hu2024adas](https://arxiv.org/html/2605.19140#bib.bib20); [fan2024workflowllm](https://arxiv.org/html/2605.19140#bib.bib15); [yue2025daao](https://arxiv.org/html/2605.19140#bib.bib33) optimize it from joint trajectory data. This regime is well developed and, where its centralization assumptions hold, it works.

The sequential decentralized regime is structurally different. Production systems are increasingly built as pipelines of role-specialized agents that transfer one artifact at a time, as in the assembly-line architecture of MetaGPT [hong2023metagpt](https://arxiv.org/html/2605.19140#bib.bib19), the programmer-to-tester pipeline of AgentCoder [huang2023agentcoder](https://arxiv.org/html/2605.19140#bib.bib21), and the turn-based message-passing of AutoGen [wu2023autogen](https://arxiv.org/html/2605.19140#bib.bib37). Emerging agent-to-agent protocols [anthropic2024mcp](https://arxiv.org/html/2605.19140#bib.bib1); [google2025a2a](https://arxiv.org/html/2605.19140#bib.bib17) standardize sequential message exchange as the coordination primitive. When such pipelines span organizational, vendor, or trust boundaries [yang2025agentnet](https://arxiv.org/html/2605.19140#bib.bib39), the centralized assumptions fail: no party holds joint trajectories at training time, agents see only the artifact they receive together with their own private state, and chains of thought, scratchpads, and proprietary prompt templates are not exposed across agent boundaries by design, by safety filtering, or by API contract. The resulting decision problem combines partial observability with the cross-environment heterogeneity studied in latent-heterogeneous RL [chen2024reinforcement](https://arxiv.org/html/2605.19140#bib.bib14), where each agent's local view effectively induces a distinct learning problem. Workflow learning in this regime has, to date, been addressed only by hand-designed adaptive rules [yang2025agentnet](https://arxiv.org/html/2605.19140#bib.bib39) without a decision-theoretic foundation or convergence guarantee.

This paper provides such a foundation. We study workflow learning in the sequential decentralized regime under four operating conditions that distinguish it from the centralized setting: (i) sequential handoff-based control: exactly one agent acts at each step and transfers control through a shared artifact; (ii) decentralization in both training and execution: no party accesses joint trajectories at any stage; (iii) interface-limited observation: each agent decides on the basis of the artifact it receives plus its own private state, with no exposure of internal state across agent boundaries; and (iv) finite-sample guarantees: deployments run under bounded API and compute budgets, so a designer needs to know how much sample budget suffices, not merely that convergence eventually obtains. Finite-sample analysis at this granularity has recently driven theoretical progress in adjacent areas, including transfer $Q$-learning [chen2025transfer](https://arxiv.org/html/2605.19140#bib.bib13); [chai2025transfer](https://arxiv.org/html/2605.19140#bib.bib8) and high-dimensional sequential decision-making under structured latent heterogeneity [chen2025highdim](https://arxiv.org/html/2605.19140#bib.bib11), and motivates an analogous treatment for decentralized agentic workflows.

Why existing frameworks do not extend. Four lines of prior theory bear on this regime; each violates at least one of the four conditions above.

- Decentralized POMDPs and CTDE. Dec-POMDPs [bernstein2002decpomdp](https://arxiv.org/html/2605.19140#bib.bib3); [nair2005networked](https://arxiv.org/html/2605.19140#bib.bib26); [oliehoek2016decpomdp](https://arxiv.org/html/2605.19140#bib.bib27); [oliehoek2008exploiting](https://arxiv.org/html/2605.19140#bib.bib28) model cooperative multi-agent control under local observations but assume concurrent action with a joint reward, violating sequentiality. Centralized training with decentralized execution [lowe2017maddpg](https://arxiv.org/html/2605.19140#bib.bib25); [rashid2020qmix](https://arxiv.org/html/2605.19140#bib.bib30); [foerster2018coma](https://arxiv.org/html/2605.19140#bib.bib16); [sunehag2018vdn](https://arxiv.org/html/2605.19140#bib.bib35); [iqbal2019actorattentioncritic](https://arxiv.org/html/2605.19140#bib.bib22) relaxes execution but requires joint-trajectory access at training, violating decentralization. Related transfer-RL formulations that share information across heterogeneous tasks [chen2026data](https://arxiv.org/html/2605.19140#bib.bib10); [chai2026optimistic](https://arxiv.org/html/2605.19140#bib.bib9) likewise presume a coordinator who observes either joint trajectories or task-level structure unavailable across vendor boundaries.

- The options framework. Options [sutton1999options](https://arxiv.org/html/2605.19140#bib.bib36); [bacon2017optioncritic](https://arxiv.org/html/2605.19140#bib.bib2); [bradtke1994smdp](https://arxiv.org/html/2605.19140#bib.bib5); [precup2000options](https://arxiv.org/html/2605.19140#bib.bib29) provide the natural temporal-extension primitive but assume a single meta-controller observing the full state at each decision epoch, violating interface-limited observation. Recent work on prior-aligned meta-RL with finite-horizon guarantees [zhou2025prior](https://arxiv.org/html/2605.19140#bib.bib43) inherits the same full-observation assumption at the meta-controller level.

- Approximate information states. The AIS framework [subramanian2022ais](https://arxiv.org/html/2605.19140#bib.bib34); [kara2022finite](https://arxiv.org/html/2605.19140#bib.bib24); [sinha2024agentstate](https://arxiv.org/html/2605.19140#bib.bib32); [sinha2024periodic](https://arxiv.org/html/2605.19140#bib.bib31) treats single-agent partial observability in primitive-step MDPs. Lifting it to decentralized AIS observations across $N$ agents composing through handoffs requires controlling the AIS error under the random discount $\gamma^{\tau_{k+1}}$ and over a disjoint observation space; neither has been done. The common-information extension of [kao2022common](https://arxiv.org/html/2605.19140#bib.bib23) addresses concurrent Dec-POMDPs through a fictitious coordinator and is not applicable to sequential handoff control. Closely related are non-stationary RL settings where the environment itself shifts across episodes [chai2025deep](https://arxiv.org/html/2605.19140#bib.bib7), for which transfer guarantees require additional structural assumptions that the IC-SMDP does not impose.

- Multi-agent LLM orchestration. Recent systems [yang2025agentnet](https://arxiv.org/html/2605.19140#bib.bib39); [zhuge2024gptswarm](https://arxiv.org/html/2605.19140#bib.bib44); [zhang2024aflow](https://arxiv.org/html/2605.19140#bib.bib41); [hu2024adas](https://arxiv.org/html/2605.19140#bib.bib20) route tasks through learned or hand-designed DAGs of LLM agents. AgentNet [yang2025agentnet](https://arxiv.org/html/2605.19140#bib.bib39) is closest in spirit but operates via hand-designed adaptive rules (moving averages over edge weights, retrieval-based heuristics, capability-vector updates) with no Bellman recursion and no convergence guarantee, so it does not meet the finite-sample condition. Workflow learning has also been studied for economic and operational decision-making, where finite-sample optimality under cross-market heterogeneity [chen2026transfer](https://arxiv.org/html/2605.19140#bib.bib12); [zhang2025transfer](https://arxiv.org/html/2605.19140#bib.bib42) has driven recent theoretical advances, none of which apply directly to the handoff-based interface regime we study. A full discussion of related work appears in the appendix.

Contributions. We provide the first framework that respects all four conditions and prove finite-sample convergence inside it.

- Formal model. We introduce the interface-constrained semi-Markov decision process (IC-SMDP) (§[2](https://arxiv.org/html/2605.19140#S2)) and show that it induces a well-defined SMDP at handoff times to which the AIS framework lifts with a quantifiable interface gap $(\varepsilon_{\phi}, \delta_{\phi})$.

- Decentralized algorithm. We design IC-$Q$ (§[3](https://arxiv.org/html/2605.19140#S3)), an asynchronous $Q$-learning algorithm in which cross-agent coordination at every handoff is a single scalar, giving minimal communication overhead under the bandwidth and API-call constraints of cross-vendor deployment.

- Finite-sample convergence. We prove a finite-sample bound (Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1), §[4](https://arxiv.org/html/2605.19140#S4)) decomposing into three independently controllable error sources: neural approximation, interface representation gap, and mixing-time residual. Three challenges arise that prior analyses do not face: a Bellman contraction under random rather than fixed discount, an AIS gap propagating at the option scale rather than the primitive-step scale, and Markovian-noise control under random option duration. To our knowledge this is the first such guarantee under decentralized partial observability through approximate information states.

- Empirical validation. On four tasks (a controlled synthetic IC-SMDP that isolates each error term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming), IC-$Q$ matches a centralized oracle and recovers the best predefined workflow without any agent observing joint trajectories. The synthetic IC-SMDP further validates the bound term-by-term, with each of the three error sources scaling along its corresponding axis as predicted (§[5](https://arxiv.org/html/2605.19140#S5)).

## 2 Interface-Constrained Semi-Markov Decision Process

Modern multi-agent LLM systems coordinate by passing artifacts such as messages, intermediate solutions, and scratchpad snippets between specialized agents. A planner LLM hands a task description to a coder LLM, which hands code to a tester LLM. No single component sees the full picture: the planner does not see the coder's chain of thought, and no centralized observer sees joint trajectories. We formalize this as an interface-constrained semi-Markov decision process (IC-SMDP). Although the system evolves at the primitive time scale of individual agent steps, its decision structure operates at the coarser scale of handoff times, where each agent invocation is a temporally extended option and the semi-Markov property emerges naturally.

### 2.1 Formal framework

An IC-SMDP with $N$ agents $[N] := \{1, \dots, N\}$ is a tuple

$$\mathcal{I} = \big(\mathcal{X}, \mathcal{M}, \{\mathcal{L}_i, \mathcal{A}_i, \phi_i\}_{i \in [N]}, P, r, \gamma, \rho_0\big).$$

States. There are three layers of state, distinguished by who can observe them. The global latent state $x_t \in \mathcal{X}$ governs underlying task dynamics and is observed by no agent. The interface state $m_t \in \mathcal{M}$ is the artifact passed between agents and is the only channel of cross-agent information flow. The private state $\ell_t^{(i)} \in \mathcal{L}_i$ is local to agent $i$ and unobserved by others. At step $t$, exactly one agent $c_t \in [N]$ is active.

Observation: the interface constraint. Agent $c_t$ observes only

$$o_t = \phi_{c_t}\!\big(m_t, \ell_t^{(c_t)}\big) \in \mathcal{O}_{c_t}, \tag{1}$$

where $\phi_i \colon \mathcal{M} \times \mathcal{L}_i \to \mathcal{O}_i$ is an agent-specific observation map. Two structural restrictions are implicit. (IC1) Channel restriction: when $c_t$ hands off to $c_{t+1} = j$, the only information $j$ receives is $m_{t+1}$. (IC2) Representation restriction: agent $i$'s decision is a function of $\phi_i(m_t, \ell_t^{(i)})$ rather than of $m_t$ directly. (IC1) rules out CTDE-style learning [lowe2017maddpg](https://arxiv.org/html/2605.19140#bib.bib25); [rashid2020qmix](https://arxiv.org/html/2605.19140#bib.bib30); [foerster2018coma](https://arxiv.org/html/2605.19140#bib.bib16), which assumes joint trajectory access; (IC2) precludes direct application of the options framework [sutton1999options](https://arxiv.org/html/2605.19140#bib.bib36), which assumes full state observation at every decision epoch.

Actions: local operation and successor selection. Agent $c_t$ selects a local action $a_t \in \mathcal{A}_{c_t}$ and a successor $c_{t+1} \in \mathcal{C} := [N] \cup \{\textsf{STOP}\}$, jointly written $u_t = (a_t, c_{t+1})$. A stationary decentralized policy is a collection $\pi_i(u \mid o) \in \Delta(\mathcal{U}_i)$, where $\mathcal{U}_i$ denotes the set of such action-successor pairs available to agent $i$, with $u_t \sim \pi_{c_t}(\cdot \mid o_t)$. Agent $i$ is pre-configured if $a_t$ is fixed (e.g., a frozen LLM) and only the successor distribution is learnable; otherwise it is adaptable.

Transitions and handoff dynamics. Conditioned on $u_t$, the system evolves under a Markov kernel $P$. If $c_{t+1} = c_t$, the active agent does not change; if $c_{t+1} = j \neq c_t$, control transfers to $j$, whose private state is preserved from its last activation. Handoff times $0 = t_0 < t_1 < \cdots$ are the steps where $c_{t+1} \neq c_t$; the interval $[t_k, t_{k+1})$ is the invocation of agent $c_{t_k}$ with random duration $\tau_{k+1} := t_{k+1} - t_k$.
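The bookkeeping of handoff times and durations can be sketched in a few lines of Python. This is only an illustration: the function name `handoff_segments` and the example trace are assumptions, not part of the paper, and the sketch takes $t_k$ to be the first step of each invocation so that it matches the interval $[t_k, t_{k+1})$.

```python
from typing import List, Tuple


def handoff_segments(active: List[int]) -> List[Tuple[int, int, int]]:
    """Split a trace of active-agent indices c_0, c_1, ... into invocations.

    Returns one (agent, start step t_k, duration tau_{k+1}) triple per
    invocation. Hypothetical helper, written for illustration only.
    """
    segments = []
    start = 0
    for t in range(1, len(active) + 1):
        # An invocation ends when the active agent changes or the trace ends.
        if t == len(active) or active[t] != active[t - 1]:
            segments.append((active[start], start, t - start))
            start = t
    return segments


# Example trace: agent 0 acts for 2 steps, agent 1 for 3, agent 2 for 1, agent 0 for 1.
trace = [0, 0, 1, 1, 1, 2, 0]
segments = handoff_segments(trace)
```

Each duration in `segments` plays the role of $\tau_{k+1}$ for the corresponding invocation.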

Objective. The reward $r_t = r(x_t, m_t, c_t, a_t, c_{t+1})$ typically combines task utility with workflow costs. The decentralized workflow control problem is

$$\max_{\pi \in \Pi} \mathbb{E}^{\pi}_{\rho_0}\!\left[\sum_{t=0}^{H-1} \gamma^{t} r_t\right], \tag{2}$$

subject to ([1](https://arxiv.org/html/2605.19140#S2.E1)), maximized over factored policies, with no algorithm accessing $x_t$, joint trajectories, or other agents' private states.

### 2.2 Semi-Markov reduction and AIS structure

The IC-SMDP's decision structure lives at handoff times $\{t_k\}$. Each agent invocation is an option in the sense of [sutton1999options](https://arxiv.org/html/2605.19140#bib.bib36): agent $i$'s internal policy $\pi_i^{\text{int}}$ executes from primitive step $t_k$ until the next handoff $t_{k+1}$, with effective discount $\gamma^{\tau_{k+1}}$. Writing $m_k := m_{t_k}$, the latent decision-epoch SMDP has state space $\mathcal{M}$, action space $\mathcal{C}$, transition kernel $P_{\text{lat}}(m', \tau \mid m, i) = \mathbb{P}(m_{k+1} = m', \tau_{k+1} = \tau \mid m_k = m, c_{t_k} = i)$, and option reward $R_{\text{lat}}(m, i) = \mathbb{E}\!\left[\sum_{s=0}^{\tau_{k+1}-1} \gamma^{s} r_{t_k+s} \mid m_k = m, c_{t_k} = i\right]$.
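For a single observed invocation, the quantity inside the expectation defining $R_{\text{lat}}$ and the effective discount $\gamma^{\tau_{k+1}}$ can be computed as in the sketch below. The function name `option_return` and the sample numbers are hypothetical; averaging the first output over many invocations from the same $(m, i)$ would give a Monte Carlo estimate of the option reward.

```python
from typing import Sequence, Tuple


def option_return(rewards: Sequence[float], gamma: float) -> Tuple[float, float]:
    """Within-invocation discounted reward and effective discount.

    `rewards` holds r_{t_k}, ..., r_{t_k + tau - 1} for one invocation,
    so its length is the duration tau. Illustrative helper only.
    """
    total = 0.0
    for s, r in enumerate(rewards):
        total += (gamma ** s) * r  # gamma^s * r_{t_k + s}
    effective_discount = gamma ** len(rewards)  # gamma^tau
    return total, effective_discount


# One invocation lasting three primitive steps.
R_k, disc = option_return([1.0, 0.0, 2.0], gamma=0.5)
```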

No agent observes $m_k$ directly; agent $i$ observes only $\phi_i(m_k, \ell_{t_k}^{(i)})$. The AIS-induced SMDP replaces $(P_{\text{lat}}, R_{\text{lat}})$ with their conditional counterparts $(\hat{P}_i, \hat{R}_i)$ given the AIS observation. The two perspectives describe the same physical process at the decision-epoch scale; they differ only in what is observed.

###### Assumption 1 (Structural conditions).

(i) The joint process $(x_t, m_t, \ell_t^{(1:N)}, c_t)$ is Markov under any stationary decentralized policy. (ii) Each invocation terminates almost surely: $\mathbb{P}(\tau_{k+1} < \infty) = 1$. (iii) The interface state $m$ determines the admissible successor set.

###### Assumption 2 (AIS conditions on $\{\phi_i\}$).

There exist $\varepsilon_{\phi}, \delta_{\phi} \geq 0$ such that for every $i$, $(m, \ell^{(i)})$, and $o = \phi_i(m, \ell^{(i)})$: (B1) Reward sufficiency: $|R_{\text{lat}}(m, i) - \hat{R}_i(o, i)| \leq \varepsilon_{\phi}$. (B2) Evolution sufficiency: $d_{\mathcal{F}}\big(P_{\text{lat}}(\cdot \mid m, i), \hat{P}_i(\cdot \mid o, i)\big) \leq \delta_{\phi}$, where $d_{\mathcal{F}}$ is an integral probability metric on probability measures over $\mathcal{M} \times \mathbb{N}$.

Assumption [1](https://arxiv.org/html/2605.19140#Thmassumption1) is mild: (i) is automatic from the IC-SMDP construction, (ii) holds for any finite-horizon IC-SMDP, and (iii) holds whenever $\mathcal{I}_i = \mathcal{M}$. Together they ensure the handoff-time process $\{(m_k, c_{t_k})\}$ is a well-defined SMDP (the proof is given in the appendix).

Assumption [2](https://arxiv.org/html/2605.19140#Thmassumption2) is the substantive restriction. Conditions (B1) and (B2) are the option-level analogs of the AIS definition in [subramanian2022ais](https://arxiv.org/html/2605.19140#bib.bib34), originally formulated for primitive-step single-agent partially observable MDPs. We strengthen this in two ways. First, (B1) and (B2) are stated in terms of the option reward $R_{\text{lat}}$ and option transition $P_{\text{lat}}$, which integrate primitive-step rewards under the random discount $\gamma^{\tau_{k+1}}$ rather than the constant primitive discount $\gamma$. Reward sufficiency at the option scale is therefore a strictly stronger condition than primitive-step reward sufficiency: $\varepsilon_{\phi}$ controls error in a quantity that already reflects the option's full execution. Second, the conditions hold for every agent $i$ with its own observation map $\phi_i$, not for a single global observation. This decentralization means the AIS structure must compose across agents at handoff times, which is what allows the policy correspondence below to operate over the disjoint union $\bigsqcup_{i} \mathcal{O}_i$ rather than a single observation space. When $\phi_i = \mathrm{id}_{\mathcal{M}}$ for all $i$, $\varepsilon_{\phi} = \delta_{\phi} = 0$. The choice of integral probability metric $d_{\mathcal{F}}$ is left to the user; total variation is the natural instantiation for finite $\mathcal{M}$, with $L_V \leq R_{\max}/(1 - \bar{\gamma})$, while Wasserstein metrics yield tighter $L_V$ when value functions are smooth.

Examples. In multi-LLM mathematical reasoning (§[5](https://arxiv.org/html/2605.19140#S5)), $m$ is the conversation history and $\phi_i$ projects to a 4-dimensional vector encoding answer status. (B1) holds because option reward is determined by answer status; (B2) holds because the next answer status given the current one is approximately independent of the verbatim history. In multi-agent routing, $\phi_i$ is the exact interface state and $\varepsilon_{\phi} = \delta_{\phi} = 0$. In multi-agent CPU programming, $\phi_i$ projects to a visible register block, and $\delta_{\phi}$ is controlled by the locality of the active agent's register usage.

Policy correspondence (informal). Under Assumptions [1](https://arxiv.org/html/2605.19140#Thmassumption1) and [2](https://arxiv.org/html/2605.19140#Thmassumption2), every stationary decentralized policy on AIS observations corresponds to a stationary policy on the AIS-induced SMDP, and conversely. The cost of partial observability is bounded by an additive AIS gap of order $(\varepsilon_{\phi} + \bar{\gamma} L_V \delta_{\phi})/(1 - \bar{\gamma})$, where $\bar{\gamma} := \sup_{m,i} \mathbb{E}[\gamma^{\tau_{k+1}} \mid m_k = m, c_{t_k} = i] \in [0, 1)$ is the effective per-epoch discount and $L_V$ is the $d_{\mathcal{F}}$-Lipschitz constant of the AIS-induced optimal value function. The formal statement and proof appear as a proposition in the appendix; the bound itself is absorbed into Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1) in §[4](https://arxiv.org/html/2605.19140#S4), where it appears as the interface representation gap term.
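To see how the gap expression behaves, the sketch below evaluates $(\varepsilon_{\phi} + \bar{\gamma} L_V \delta_{\phi})/(1 - \bar{\gamma})$ for given inputs. The gap is stated only up to order, so this omits any constants; the function name `ais_gap` and the example values are assumptions chosen for illustration.

```python
def ais_gap(eps_phi: float, delta_phi: float, gamma_bar: float, lipschitz_v: float) -> float:
    """Evaluate (eps_phi + gamma_bar * L_V * delta_phi) / (1 - gamma_bar).

    Illustrative only: the paper states the gap up to order, so absolute
    constants are not reproduced here.
    """
    if not 0.0 <= gamma_bar < 1.0:
        raise ValueError("gamma_bar must lie in [0, 1)")
    if eps_phi < 0.0 or delta_phi < 0.0:
        raise ValueError("eps_phi and delta_phi must be non-negative")
    return (eps_phi + gamma_bar * lipschitz_v * delta_phi) / (1.0 - gamma_bar)


# Exact interface observation (phi_i is the identity): the gap vanishes.
gap_exact = ais_gap(0.0, 0.0, gamma_bar=0.9, lipschitz_v=10.0)

# A lossy interface with hypothetical values of eps_phi and delta_phi.
gap_lossy = ais_gap(0.25, 0.5, gamma_bar=0.5, lipschitz_v=2.0)
```

With these hypothetical inputs, the exact interface gives a gap of zero, and shrinking either `eps_phi` or `delta_phi` lowers the lossy value linearly.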

Comparison with adjacent formalisms. Single-agent MDPs are recovered when $N = 1$ and $\phi_1 = \mathrm{id}_{\mathcal{M}}$. Dec-POMDPs [bernstein2002decpomdp](https://arxiv.org/html/2605.19140#bib.bib3); [oliehoek2016decpomdp](https://arxiv.org/html/2605.19140#bib.bib27) permit concurrent action with joint reward, whereas the IC-SMDP enforces sequential control through an interface artifact; the converse encoding does not preserve the handoff structure. The options framework [sutton1999options](https://arxiv.org/html/2605.19140#bib.bib36); [bacon2017optioncritic](https://arxiv.org/html/2605.19140#bib.bib2) provides the temporal-extension primitive but assumes full state observation. The IC-SMDP is thus a strict generalization of MDPs, an alternative to (rather than special case of) Dec-POMDPs, and an interface-constrained extension of options.

## 3 Decentralized $Q$-Learning under the IC-SMDP

The AIS-induced SMDP of §[2](https://arxiv.org/html/2605.19140#S2) specifies what a decentralized agent should estimate: the optimal AIS-induced option-value $\hat{Q}^{\star}_i(o, c')$ for each agent $i$, observation $o$, and successor $c'$. We derive an asynchronous decentralized $Q$-learning algorithm, IC-$Q$, that estimates $\hat{Q}^{\star}_i$ from samples without violating the interface constraint. Each agent maintains a local value estimator over its own AIS observations; cross-agent information flow at handoff is restricted to a single scalar bootstrap target.

By the AIS-induced SMDP construction, the optimal value $\hat{Q}^{\star}_i$ satisfies the SMDP Bellman equation

$$\hat{Q}^{\star}_i(o, c') = \hat{R}_i(o, c') + \mathbb{E}\!\left[\gamma^{\tau_{k+1}} \max_{c'' \in \mathcal{C}} \hat{Q}^{\star}_{c'}(o', c'') \,\Big|\, o, c'\right], \tag{3}$$

where $o' = \phi_{c'}(m_{k+1}, \ell_{t_{k+1}}^{(c')})$ is the successor's AIS observation. Two features encode the interface constraint at the level of the Bellman recursion: the maximization is over the successor's value function $\hat{Q}^{\star}_{c'}$, not the current agent's, and $o'$ is computed under the successor's observation map $\phi_{c'}$. A naive update that locally maximizes $Q_i$ would solve the wrong fixed-point equation; the correct target requires information that, by construction, the predecessor cannot observe.

Value passing at handoffs. At handoff time $t_{k+1}$, the successor agent $c'$, which has just become active and observed $o'_{k+1}$, computes

$$b_{k+1} = \max_{c'' \in \mathcal{C}} \hat{Q}_{c'}(o'_{k+1}, c''), \tag{4}$$

and transmits $b_{k+1}$ back to predecessor $i$ together with the option duration $\tau_{k+1}$ and accumulated reward $R_k$. Predecessor $i$ updates its local estimator using

$$y_k = R_k + \gamma^{\tau_{k+1}} b_{k+1} \tag{5}$$

as the regression target. The cross-agent information flow at each handoff is exactly three numbers: $b_{k+1}$, $\tau_{k+1}$, and $R_k$, with no exposure of internal parameters, latent state, or local observations. This realizes the right-hand side of ([3](https://arxiv.org/html/2605.19140#S3.E3)) without violating either (IC1) or (IC2): the successor's $\max$ is computed locally where the successor's $\phi_{c'}$ and parameters live, and only the resulting scalar crosses the interface boundary.
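The value-passing step in (4) and (5) can be sketched with a tabular stand-in for the successor's estimator. The paper uses neural $Q$-functions; the dictionary-based table, the default value of 0.0 for unseen entries, and the names `bootstrap_scalar` and `regression_target` are all assumptions made for this illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def bootstrap_scalar(q_successor: QTable, obs_next: Hashable,
                     successors: Iterable[Hashable]) -> float:
    """Eq. (4): computed by the successor on its own observation and table."""
    # Unseen (observation, successor) pairs default to 0.0 (an assumption).
    return max(q_successor.get((obs_next, c), 0.0) for c in successors)


def regression_target(R_k: float, tau: int, b_next: float, gamma: float) -> float:
    """Eq. (5): formed by the predecessor from the three passed numbers."""
    return R_k + (gamma ** tau) * b_next


# Successor side: only the scalar b leaves the successor.
q_succ: QTable = {("o1", 0): 1.0, ("o1", "STOP"): 3.0}
b = bootstrap_scalar(q_succ, "o1", [0, 1, "STOP"])

# Predecessor side: it never sees q_succ or "o1", only b, tau, and R_k.
y = regression_target(R_k=1.5, tau=3, b_next=b, gamma=0.5)
```

The split into two functions mirrors the interface boundary: the first runs where the successor's observation map and parameters live, the second where the predecessor's do.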

Successor-selection updates. Each agent maintains a local successor-selection $Q$-function $Q_i^{\beta}(o, c'; \theta_i^{\beta})$, with greedy successor policy $\pi_i^{\beta}(o) = \arg\max_{c'} Q_i^{\beta}(o, c'; \theta_i^{\beta})$. At each handoff, $\theta_i^{\beta}$ is updated by minimizing $(Q_i^{\beta}(o_k, c'; \theta_i^{\beta}) - y_k)^2$. When $\tau_{k+1} = 1$, this reduces to standard $Q$-learning over interface observations; when $\tau_{k+1}$ is genuinely random, it implements asynchronous SMDP $Q$-learning with random discount $\gamma^{\tau_{k+1}}$.
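As a rough illustration of the update, the sketch below applies one gradient step on the squared error for a tabular $Q_i^{\beta}$, which reduces to $Q \leftarrow Q + \eta\,(y_k - Q)$ when the factor of 2 is absorbed into the step size. The tabular form, the function name `sgd_step`, and the zero initialization are assumptions; the paper's estimator is a neural network trained by SGD.

```python
from typing import Dict, Hashable, Tuple

QTable = Dict[Tuple[Hashable, Hashable], float]


def sgd_step(q: QTable, obs: Hashable, successor: Hashable,
             target: float, eta: float) -> float:
    """One step on (Q(obs, successor) - target)^2 for a tabular Q.

    Moves the entry a fraction eta of the way toward the target and
    returns the updated value. Hypothetical tabular stand-in.
    """
    old = q.get((obs, successor), 0.0)  # zero initialization (assumption)
    new = old + eta * (target - old)
    q[(obs, successor)] = new
    return new


# The predecessor updates only the entry for the successor it actually chose.
q_pred: QTable = {}
first = sgd_step(q_pred, obs="o0", successor=1, target=2.0, eta=0.5)
second = sgd_step(q_pred, obs="o0", successor=1, target=2.0, eta=0.5)
```

Repeated steps toward a fixed target close half the remaining distance each time with `eta=0.5`; in the algorithm the target itself changes as the successor's estimate evolves.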

Adaptable agents. When the local-action policy is also learned, each agent maintains a second $Q$-function $Q_i^{\alpha}(o, a; \theta_i^{\alpha})$. The two estimators are coupled through their bootstrap targets: $Q_i^{\alpha}$ regresses against $\max_{c'} Q_i^{\beta}(o_k^{+}, c'; \theta_i^{\beta})$ at the post-action observation, while $Q_i^{\beta}$ regresses against $\max_{a'} Q_{c'}^{\alpha}(o'_{k+1}, a'; \theta_{c'}^{\alpha})$ from the successor. The value-passing structure remains symmetric, with a single scalar at each handoff, preserving the interface constraint. The two-timescale dynamics of $\theta_i^{\alpha}$ and $\theta_i^{\beta}$ require additional regularity beyond what the single-$Q$ analysis of §[4](https://arxiv.org/html/2605.19140#S4) provides; we treat the adaptable case empirically and leave its formal convergence to future work.

We summarize the unified procedure in Algorithm[1](https://arxiv.org/html/2605.19140#alg1)\.

Algorithm 1 Decentralized $Q$-learning for IC-SMDPs (IC-$Q$)

Require: Agents $[N]$; AIS observation maps $\{\phi_i\}$; learning rate $\eta$; discount $\gamma$; horizon $H$; exploration $\epsilon$.

1: Initialize $\{\theta_i^{\beta}\}_{i=1}^{N}$ {and $\{\theta_i^{\alpha}\}$ in the adaptable regime}

2: for episode $= 1, 2, \ldots$ do

3: Sample $(x_0, m_0, \{\ell_0^{(i)}\}, c_0) \sim \rho_0$; set $k \leftarrow 0$, $t \leftarrow 0$, $R_k \leftarrow 0$

4: while $t < H$ and $c_t \neq \mathsf{STOP}$ do

5: Active agent $i \leftarrow c_t$ observes $o_t = \phi_i(m_t, \ell_t^{(i)})$

6: Select $a_t$ from fixed $\pi_i^{\mathrm{int}}$ (pre-configured) or via $\epsilon$-greedy on $Q_i^{\alpha}$ (adaptable)

7: Select successor $c_{t+1} \leftarrow \epsilon\text{-greedy}(Q_i^{\beta}(o_t, \cdot\,; \theta_i^{\beta}))$

8: Environment transitions; receive $r_t$; $R_k \leftarrow R_k + \gamma^{t - t_k} r_t$

9: if $c_{t+1} \neq c_t$ {handoff} then

10: $c' \leftarrow c_{t+1}$; $o_{t+1} \leftarrow \phi_{c'}(m_{t+1}, \ell_{t+1}^{(c')})$

11: Successor computes $b_{k+1} \leftarrow \max_{c''} Q_{c'}^{\beta}(o_{t+1}, c''; \theta_{c'}^{\beta})$ and returns $(b_{k+1}, \tau_{k+1}, R_k)$

12: Form $y_k^{\beta} \leftarrow R_k + \gamma^{\tau_{k+1}} b_{k+1}$ and update $\theta_i^{\beta}$ by SGD on $(Q_i^{\beta}(o_{t_k}, c'; \theta_i^{\beta}) - y_k^{\beta})^2$

13: if adaptable regime then

14: Successor returns $b_{k+1}^{\alpha} \leftarrow \max_{a'} Q_{c'}^{\alpha}(o_{t+1}, a'; \theta_{c'}^{\alpha})$

15: Form $y_k^{\alpha} \leftarrow r_{t_k} + \gamma \max_{c''} Q_i^{\beta}(o_{t_k}^{+}, c''; \theta_i^{\beta})$ and update $\theta_i^{\alpha}$ analogously

16: end if

17: $k \leftarrow k + 1$; $t_k \leftarrow t + 1$; $R_k \leftarrow 0$

18: end if

19: $t \leftarrow t + 1$

20: end while

21: end for
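The bookkeeping in lines 8, 9, and 17 of Algorithm 1 (accumulating the option reward and resetting it at each handoff) can be sketched as follows. The function name `segment_options`, the trace format, and the reading of the duration as $\tau_{k+1} = t_{k+1} - t_k$ are assumptions for illustration; the sketch covers only reward accumulation, not action selection or the $Q$-updates.

```python
def segment_options(controllers, rewards, gamma):
    """Split a primitive-step trace into handoff segments.

    controllers[t] is the active agent at step t and has len(rewards) + 1
    entries, so controllers[t + 1] is the successor chosen at step t.
    Returns one (agent, R_k, tau) tuple per completed handoff; a trailing
    segment that never hands off is dropped.
    """
    segments = []
    t_k, R_k = 0, 0.0
    for t, r in enumerate(rewards):
        R_k += gamma ** (t - t_k) * r              # line 8: discounted option reward
        if controllers[t + 1] != controllers[t]:   # line 9: handoff detected
            tau = t + 1 - t_k                      # assumed duration t_{k+1} - t_k
            segments.append((controllers[t], R_k, tau))
            t_k, R_k = t + 1, 0.0                  # line 17: start the next option
    return segments


# Agent 0 holds control for two steps, then hands off to agent 1.
trace = segment_options(controllers=[0, 0, 1, 1], rewards=[1.0, 1.0, 1.0], gamma=0.5)
```

Each returned tuple carries the $R_k$ and duration that enter the target $y_k^{\beta}$ on line 12, once the successor supplies $b_{k+1}$.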

## 4 Finite-Sample Convergence

This section presents the paper’s main theoretical result: a finite-sample convergence bound for IC-$Q$ in the pre-configured regime. The bound decomposes into three interpretable terms: the AIS representation gap (the price of the interface constraint), neural function-approximation error, and a mixing-time residual that decays at rate $\widetilde{O}(1/T)$. Full setup, assumptions, and proofs are deferred to the appendix.

### 4.1 Assumptions

We analyze IC-$Q$ under standard local-linearization conditions on the neural function approximator. The full setup, including the tangent-feature map $g_0$, linearized Bellman operator, trust region $B(\theta_0, \omega)$, and local stationary set $\Xi_\omega$, is given in the appendix. The analysis requires six finite-sample assumptions, each standard in its own subdomain but combining nontrivially here:

- A1 (finite spaces, bounded rewards/durations): $\mathcal{O}, \mathcal{C}$ finite; $|r_t| \le r_{\max}$; $1 \le \tau_{k+1} \le \tau_{\max}$.
- A2 (uniform ergodicity): the behavior-induced chain $\{(o_k, c_k)\}$ is uniformly ergodic with stationary distribution $\mu$ ($\mu_{\min} > 0$) and $\varepsilon$-mixing time $t_{\mathrm{mix}}(\varepsilon)$.
- A3 (bounded $Q$-network and projected stability): $|Q|, \|\nabla_\theta Q\|, \|g_0\|$ uniformly bounded on the trust region; iterates remain in the trust region under projected SGD.
- A4 (uniform local linearization): $|Q - \overline{Q}| \le \varepsilon_0$ and $\|\nabla_\theta Q - g_0\| \le \varepsilon_0$ on the trust region.
- A5 (well-conditioned features): $\Sigma_\mu := \mathbb{E}_\mu[g_0 g_0^{\top}]$ has smallest nonzero eigenvalue $\lambda_0 > 0$.
- A6 (SMDP-level Bellman contraction): there exists $\nu \in (0,1)$ such that $(1-\nu)^2 \Sigma_\mu - \Sigma_{\mu,\tau}^{\max}(v) \succeq 0$ for every nonzero $v \in \mathcal{R}(\Sigma_\mu)$, where $\Sigma_{\mu,\tau}^{\max}(v)$ is the worst-case next-state feature covariance under random discount $\gamma^{\tau_{k+1}}$.

A1–A4 are standard for asynchronous neural $Q$-learning [bhandari2018finite](https://arxiv.org/html/2605.19140#bib.bib4); [cai2023neural](https://arxiv.org/html/2605.19140#bib.bib6); [xu2020finite](https://arxiv.org/html/2605.19140#bib.bib38). A5 ensures a well-conditioned feature covariance over the disjoint observation space $\bigsqcup_i \mathcal{O}_i$ and is what permits SGD to make progress in the parameter directions visible at the AIS scale. A6 is the genuinely SMDP-level condition: it controls the variance of the linearized Bellman operator under the random discount $\gamma^{\tau_{k+1}}$ rather than under a constant $\gamma$, and is what allows the contraction argument to close at the option scale. Without A6, the random discount could in principle inflate the next-state feature covariance to a point where the linearized operator is no longer a contraction, and finite-sample analysis fails. We discuss A5–A6 in detail in the appendix.
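As a rough illustration of what A5 asks for, the following sketch estimates $\Sigma_\mu$ from a finite feature table and reads off its smallest nonzero and largest eigenvalues. The function name, the tolerance `tol`, and the toy feature matrix are assumptions; in the paper the features are the tangent features $g_0$ of the neural network, which are not reproduced here.

```python
import numpy as np


def feature_covariance_spectrum(features, mu, tol=1e-10):
    """Return (smallest nonzero eigenvalue, largest eigenvalue) of
    Sigma_mu = sum_j mu[j] * g_j g_j^T, where row j of `features` is g_j.

    `tol` is a hypothetical threshold separating numerically zero eigenvalues.
    """
    features = np.asarray(features, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = (features * mu[:, None]).T @ features   # weighted second-moment matrix
    eigs = np.linalg.eigvalsh(sigma)                # symmetric matrix, real spectrum
    nonzero = eigs[eigs > tol]
    return float(nonzero.min()), float(eigs.max())


# Toy example: two orthogonal features visited with probabilities 0.25 and 0.75.
lam_0, lam_max = feature_covariance_spectrum([[1.0, 0.0], [0.0, 1.0]], [0.25, 0.75])
```

In this toy case a rarely visited feature direction directly lowers $\lambda_0$, which is consistent with A2 requiring $\mu_{\min} > 0$.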

### 4.2 Convergence theorem

Under Assumptions A1–A6 and the AIS conditions of §[2](https://arxiv.org/html/2605.19140#S2), IC-$Q$ admits a non-asymptotic bound on the expected squared error against the latent SMDP optimum, decomposing into three independently controllable terms. The formal theorem is as follows.

###### Theorem 1 (Finite-sample convergence of IC-$Q$).

Suppose Assumptions [1](https://arxiv.org/html/2605.19140#Thmassumption1)–[2](https://arxiv.org/html/2605.19140#Thmassumption2) and A1–A6 hold. Fix $T \ge 1$, and let $\{\theta_k\}_{k=0}^{T}$ be generated by IC-$Q$ in the pre-configured regime with step sizes $\eta_k = \frac{1}{2\nu\lambda_0(k+1)}$. Set $t_{\mathrm{mix}} := t_{\mathrm{mix}}(\eta_T)$. Suppose there exists $\theta^{\star} \in \Xi_\omega$ such that $\|\overline{Q}(\,\cdot\,;\theta^{\star}) - \widehat{Q}^{\star}\|_\infty^2 \le \varepsilon_{\mathrm{app}}$, where $\widehat{Q}^{\star}$ is the optimal AIS-induced option-value function. Define the AIS action-value gap

$$\alpha_Q(\varepsilon_\phi, \delta_\phi, \bar{\gamma}) := \frac{\varepsilon_\phi + \bar{\gamma} L_Q \delta_\phi}{1 - \bar{\gamma}},$$

where $\bar{\gamma} := \sup_{m,i} \mathbb{E}[\gamma^{\tau_{k+1}} \mid m_k = m, c_{t_k} = i]$ and $L_Q$ is the $d_{\mathcal{F}}$-Lipschitz constant of $\widehat{Q}^{\star}$. Then there exist constants $C_0, C_1 > 0$ depending only on the problem parameters in A1–A6 such that

$$\begin{aligned}
\mathbb{E}\!\left[\bigl\|Q(\,\cdot\,;\theta_T) - Q^{\star}_{\mathrm{lat}}\bigr\|_\mu^2\right]
&\le \underbrace{2\alpha_Q(\varepsilon_\phi, \delta_\phi, \bar{\gamma})^2}_{\textsf{interface representation gap}}
+ \underbrace{6\varepsilon_{\mathrm{app}} + 6\varepsilon_0 + 6\lambda_{\max}(\Sigma_\mu) C_1 \varepsilon_0}_{\textsf{neural approximation}} \\
&\quad + \underbrace{6\lambda_{\max}(\Sigma_\mu) \cdot \frac{C_0 (1 + t_{\mathrm{mix}})(1 + \log(T+1))}{T}}_{\textsf{mixing-time residual}},
\end{aligned} \tag{6}$$

where $Q^{\star}_{\mathrm{lat}}$ is the latent decision-epoch SMDP optimal option-value function and $\|f\|_\mu^2 := \sum_{(o,c')} \mu(o,c')\, f(o,c')^2$.
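To make the structure of (6) concrete, the sketch below evaluates the three terms separately for user-supplied values. The theorem only asserts that $C_0, C_1$ exist, so they are treated here as free inputs, as is $t_{\mathrm{mix}}$ (which in the theorem depends on $T$ through $\eta_T$); all numerical values in the usage lines are illustrative assumptions, not quantities from the paper.

```python
import math


def alpha_q(eps_phi, delta_phi, gamma_bar, L_Q):
    """AIS action-value gap: (eps_phi + gamma_bar * L_Q * delta_phi) / (1 - gamma_bar)."""
    return (eps_phi + gamma_bar * L_Q * delta_phi) / (1.0 - gamma_bar)


def bound_terms(eps_phi, delta_phi, gamma_bar, L_Q,
                eps_app, eps_0, lam_max, C0, C1, t_mix, T):
    """Return (interface gap, neural approximation, mixing residual) as in (6)."""
    interface = 2.0 * alpha_q(eps_phi, delta_phi, gamma_bar, L_Q) ** 2
    neural = 6.0 * eps_app + 6.0 * eps_0 + 6.0 * lam_max * C1 * eps_0
    mixing = 6.0 * lam_max * C0 * (1.0 + t_mix) * (1.0 + math.log(T + 1)) / T
    return interface, neural, mixing


# Illustrative values only: the interface and neural terms do not depend on T,
# while the mixing residual shrinks as T grows.
common = dict(eps_phi=0.1, delta_phi=0.0, gamma_bar=0.5, L_Q=1.0,
              eps_app=0.0, eps_0=0.0, lam_max=1.0, C0=1.0, C1=1.0, t_mix=0.0)
short_run = bound_terms(T=100, **common)
long_run = bound_terms(T=1000, **common)
```

Holding `t_mix` fixed while varying `T` is a simplification of the theorem's $t_{\mathrm{mix}}(\eta_T)$.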

Proof outline and technical novelty. The proof (given in the appendix) is not a concatenation of three known results: three structural challenges arise that prior analyses do not face.

(i) *Random-discount Bellman contraction.* The classical $Q$-learning contraction argument relies on a fixed primitive discount $\gamma$ multiplying the next-state value. Here the discount is the random variable $\gamma^{\tau_{k+1}}$, with $\tau_{k+1}$ depending on the predecessor’s option and the joint dynamics. Establishing the strong-monotonicity inequality $\langle \Delta_k, \overline{h}(\theta_k) - \overline{h}(\theta^{\star}) \rangle \ge \nu\lambda_0 \|\Delta_k\|_2^2$ (the strong-monotonicity lemma) requires controlling the second moment of the next-state feature covariance weighted by $\gamma^{2\tau_{k+1}}$, which is the worst-case quantity $\Sigma_{\mu,\tau}^{\max}$ in A6. The contraction argument therefore has to absorb the duration distribution into the operator-level inequality, not merely scale a fixed discount.

(ii) *AIS gap propagation through the SMDP Bellman operator.* The interface representation gap $\alpha_Q(\varepsilon_\phi, \delta_\phi, \bar{\gamma})$ enters through a Bellman fixed-point comparison: we must bound $\|\widehat{Q}^{\star} - Q^{\star}_{\mathrm{lat}}\|_\infty$ where the two operators differ in both their reward and transition terms, both restricted to the option scale. The standard primitive-step AIS bound ([subramanian2022ais](https://arxiv.org/html/2605.19140#bib.bib34), Theorem 9) does not apply because the option reward $R_{\mathrm{lat}}$ and option transition $P_{\mathrm{lat}}$ are themselves expectations under the random duration $\tau_{k+1}$. Our argument (in the appendix) lifts the AIS sufficiency condition to the option scale, using $\bar{\gamma}$ in place of $\gamma$ and a $d_{\mathcal{F}}$-Lipschitz argument on the AIS-induced value function.

(iii) *Markovian noise control under random duration.* The mixing-time analysis of [bhandari2018finite](https://arxiv.org/html/2605.19140#bib.bib4) requires Lipschitz continuity of the linearized update $h_k(\theta)$, governed by $\overline{\delta}_k(\theta)$. Here, that update depends on $(o_k, c_k, R_k, \tau_{k+1}, o_{k+1})$, and both $R_k$ and the bootstrap term $\gamma^{\tau_{k+1}} \max_{c'} \overline{Q}(o_{k+1}, c'; \theta)$ depend on the random duration $\tau_{k+1}$. We bound $|\overline{\delta}_k(\theta) - \overline{\delta}_k(\theta')| \le 2 G_0 \|\theta - \theta'\|_2$ with constant $G_0$ uniformly in $\tau_{k+1}$ (the constant is independent of $\tau_{\max}$ because $\gamma^{\tau_{k+1}} \le 1$), and this uniform bound is what closes the mixing argument. The duration $\tau_{\max}$ enters the bound only through $R_{\max} = r_{\max}(1 - \gamma^{\tau_{\max}})/(1 - \gamma)$ inside the constants $C_0, C_1$.
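The expression for $R_{\max}$ is a geometric-sum bound on the option reward. A short derivation, assuming only A1 (bounded rewards and durations), $\gamma \in (0,1)$, and that $R_k$ accumulates the $\tau_{k+1}$ primitive rewards of option $k$ as in line 8 of Algorithm 1:

```latex
|R_k|
  = \Bigl|\sum_{j=0}^{\tau_{k+1}-1} \gamma^{j}\, r_{t_k + j}\Bigr|
  \le r_{\max} \sum_{j=0}^{\tau_{k+1}-1} \gamma^{j}
  = r_{\max}\,\frac{1 - \gamma^{\tau_{k+1}}}{1 - \gamma}
  \le r_{\max}\,\frac{1 - \gamma^{\tau_{\max}}}{1 - \gamma}
  = R_{\max}.
```

The last inequality uses $\tau_{k+1} \le \tau_{\max}$; as $\tau_{\max} \to \infty$ the bound approaches the familiar $r_{\max}/(1-\gamma)$.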

### 4.3 Discussion of the bound

Three-term decomposition is interpretable and tight. Each term in ([6](https://arxiv.org/html/2605.19140#S4.E6)) has a distinct physical source. The neural-approximation term $\widetilde{O}(\varepsilon_0)$ vanishes in the wide-network regime ($\varepsilon_0 \to 0$). The AIS gap term $\alpha_Q(\varepsilon_\phi, \delta_\phi, \bar{\gamma})^2$ vanishes when interface observations are exact ($\varepsilon_\phi = \delta_\phi = 0$). The mixing residual vanishes at rate $\widetilde{O}(t_{\mathrm{mix}}/T)$. Each can be controlled independently: by widening the network, choosing better interface representations, or running the algorithm longer, respectively.

The AIS gap is the price of the interface constraint. When $\phi_i = \mathrm{id}_{\mathcal{M}}$ for all $i$, $\varepsilon_\phi = \delta_\phi = 0$ and we recover the standard finite-sample bound for neural SMDP $Q$-learning. The interface constraint introduces a bias term that is additive (rather than multiplicative) and bounded a priori by the AIS conditions. A system designer choosing $\{\phi_i\}$ can directly trade representation richness against privacy/computation through $(\varepsilon_\phi, \delta_\phi)$.

The mixing-time dependence is intrinsic to SMDP learning. Compared to the primitive-step bound of [bhandari2018finite](https://arxiv.org/html/2605.19140#bib.bib4), equation ([6](https://arxiv.org/html/2605.19140#S4.E6)) carries an additional multiplicative factor of $(1 + \tau_{\max})$ implicit in $C_0, C_1$, reflecting that option durations enter the variance of the linearized stochastic update. This is a fundamental consequence of the random discount $\gamma^{\tau_{k+1}}$, not slack in the analysis. The $\tau_{\max}$ dependence is in fact necessary: the option reward $R_k$ is a sum of up to $\tau_{\max}$ primitive rewards, so its variance scales with $\tau_{\max}$.

Comparison with prior bounds. Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1) specializes to known results in limiting regimes: $N = 1$, $\phi_i = \mathrm{id}$, $\tau_{k+1} \equiv 1$ recovers [cai2023neural](https://arxiv.org/html/2605.19140#bib.bib6); $N = 1$ with arbitrary $\phi$ recovers a neural-network analog of the finite-memory POMDP bound of [kara2022finite](https://arxiv.org/html/2605.19140#bib.bib24) up to constants. Recent finite-sample analyses for transfer and structured $Q$-learning [chen2025transfer](https://arxiv.org/html/2605.19140#bib.bib13); [chen2026data](https://arxiv.org/html/2605.19140#bib.bib10) obtain rates of comparable form but under either single-agent observability or task-level structure shared by a central learner; neither covers the decentralized handoff regime. The contribution is the simultaneous treatment of (i) decentralized AIS observations across $N$ agents with composition through handoffs, (ii) handoff-induced random duration $\tau_{k+1}$ entering both the contraction (A6) and noise (the gradient-bounds lemma), and (iii) finite-sample neural function approximation with explicit constants, a combination not previously analyzed.

Corollary: AIS value gap. Setting $\varepsilon_0, \varepsilon_{\mathrm{app}} \to 0$ and taking $T \to \infty$ in ([6](https://arxiv.org/html/2605.19140#S4.E6)) recovers the AIS value-gap bound for decentralized observations and SMDP discount $\bar{\gamma}$:

$$\sup_{i, m, \ell^{(i)}} \bigl| V^{\star}_{\mathrm{lat}}(m) - \widehat{V}^{\star}(\phi_i(m, \ell^{(i)})) \bigr| \le \frac{\varepsilon_\phi + \bar{\gamma} L_V \delta_\phi}{1 - \bar{\gamma}},$$

generalizing ([subramanian2022ais](https://arxiv.org/html/2605.19140#bib.bib34), Theorem 9) from single-agent primitive-step MDPs to multi-agent SMDPs (the AIS value-gap theorem, stated in the appendix). Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1) thus simultaneously delivers the asymptotic AIS gap and the finite-sample estimation error in a single bound, the first such joint result we are aware of.

## 5 Empirical Results

We evaluate IC-$Q$ on four tasks chosen to exercise distinct aspects of the framework: a controlled synthetic IC-SMDP that isolates each term in Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1); multi-LLM mathematical reasoning, the headline application with a nontrivial AIS observation; multi-agent routing, instantiating the exact-AIS regime $\varepsilon_\phi = \delta_\phi = 0$; and multi-agent CPU programming, probing the adaptable-agent extension beyond Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1)’s pre-configured scope. Throughout, no agent observes joint trajectories, no centralized critic is used, and no parameters are shared across agents. Each agent maintains $Q_i^{\beta}$ over its own AIS observations $\mathcal{O}_i$ and updates via the option-level target ([5](https://arxiv.org/html/2605.19140#S3.E5)). Full hyperparameters and protocols are in the appendix.

### 5.1 Theory validation: synthetic IC-SMDP

We construct a controlled IC-SMDP with $N = 10$ pre-configured agents, $|\mathcal{X}| = 120$, $|\mathcal{M}| = 50$, and horizon $H = 60$. The interface observation map is parameterized by a single retention ratio $\rho \in (0, 1]$: the active agent observes $\tilde{m}_t = m_t \bmod n_{\mathrm{bins}}$ with $n_{\mathrm{bins}} = \lceil \rho |\mathcal{M}| \rceil$, so $\rho = 1$ recovers the full interface state and smaller $\rho$ induces stronger AIS aliasing. This gives a single knob that monotonically increases the AIS gap proxies $(\hat{\varepsilon}_\phi, \hat{\delta}_\phi)$ as $\rho$ decreases, isolating the AIS term in the bound from sampling and approximation effects.
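The retention-ratio observation map can be written in a few lines. This is a sketch of the formula as stated, with $|\mathcal{M}| = 50$ taken from the text; the function names and the rounding guard against floating-point error in `rho * num_states` are assumptions added for illustration.

```python
import math


def n_bins(rho: float, num_states: int = 50) -> int:
    """Number of observation bins, ceil(rho * |M|).

    The round() call is an added guard so that products such as 0.1 * 50
    are not pushed past an integer by floating-point error.
    """
    return math.ceil(round(rho * num_states, 9))


def observe(m_t: int, rho: float, num_states: int = 50) -> int:
    """Aliased interface observation m_t mod n_bins."""
    return m_t % n_bins(rho, num_states)


# rho = 1 keeps every interface state distinct; smaller rho merges states.
full = observe(37, rho=1.0)
half = observe(37, rho=0.5)
coarse = observe(37, rho=0.05)
```

At $\rho = 0.5$, for instance, interface states 12 and 37 map to the same observation, which is the aliasing the sweep over $\rho$ is designed to induce.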

T1: AIS gap predicts the value gap. We sweep $\rho \in \{1.0, 0.9, \ldots, 0.1, 0.05\}$ (5 seeds) and measure both the empirical AIS gap $\hat{\alpha}(\rho)$ and the value gap $\mathrm{Gap}_V(\rho)$ of the converged IC-$Q$ policy relative to the highest-retention baseline. Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1) predicts $\mathrm{Gap}_V$ should grow with $\hat{\alpha}$ once $T$ is large. We observe Pearson correlation $\rho(\mathrm{Gap}_V, \hat{\alpha}) \approx 0.916$ (T1 figure; protocol in the appendix), confirming that the AIS gap term is not worst-case slack but tracks the actual loss as the interface degrades.

T2: finite-sample behavior across the three sources of error. We isolate each term in ([6](https://arxiv.org/html/2605.19140#S4.E6)) along its corresponding axis. Sample budget $T \in \{50, \ldots, 3300\}$ at $\rho = 1$ confirms the predicted $\widetilde{O}(t_{\mathrm{mix}}/T)$ residual decay; varying handoff probability $p_{\mathrm{handoff}} \in [0.10, 0.55]$ traces the $t_{\mathrm{mix}}$ envelope, with error expanding as mixing slows; varying $\rho$ at fixed $T$ shows the error floors at a level controlled by $\hat{\alpha}(\rho)$, separating the AIS term from the sample-dependent residual (T2 figure; protocol in the appendix). Together, the three scans cash out the three terms in ([6](https://arxiv.org/html/2605.19140#S4.E6)) as distinguishable empirical phenomena rather than artifacts of the proof.

### 5.2 Multi-LLM mathematical reasoning under interface constraints

Four LLM agents collaborate sequentially to solve multiple-choice mathematics problems: a commissioner who initiates and concludes the discussion, an editor who writes the final answer, a thinker who constructs step-by-step rationales, and a checker who evaluates correctness. The interface state $m_t = \{Q, K_t, m_t^{\mathrm{msg}}, k_t\}$ is high-dimensional free-form text. The AIS observation $\phi_i(m_t, \ell_t^{(i)}) = z_t \in \{0,1\}^4$ encodes whether the answer is given, modified, checked, and judged correct. The successor-selection policy is conditioned only on $z_t$; agents do not read the discussion to decide where to pass the task. Each agent’s local-action policy is fixed by the LLM and prompt template; only the routing policy is learned. Reward is $R$ if the editor concludes with the correct answer, zero otherwise.
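The four-bit observation $z_t$ can be viewed as an index into a sixteen-entry table. The sketch below shows one such encoding; the field names, the bit order, and the helper `to_index` are hypothetical, since the text specifies only which four facts are encoded, not how they are extracted from the conversation or laid out.

```python
from itertools import product
from typing import NamedTuple


class InterfaceFlags(NamedTuple):
    """Hypothetical container for the four binary facts named in the text."""
    answer_given: bool
    answer_modified: bool
    answer_checked: bool
    judged_correct: bool


def to_index(z: InterfaceFlags) -> int:
    """Pack the four flags into an integer in [0, 16), first field as lowest bit."""
    return sum(int(bit) << position for position, bit in enumerate(z))


# Every combination of flags maps to a distinct table index.
all_indices = {to_index(InterfaceFlags(*bits)) for bits in product([False, True], repeat=4)}
example = to_index(InterfaceFlags(True, False, True, False))
```

A routing $Q$-function over such an index needs only $16 \times |\mathcal{C}|$ inputs per agent, regardless of how long the underlying discussion grows.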

We use GPT-4o-mini for all four agents, parameterize each $Q_i^{\beta}(z, c')$ by a 3-layer MLP with hidden dimension 512, and evaluate on three datasets of increasing difficulty. Baselines are four hand-designed predefined workflows (described in the appendix and the workflows figure): zero-shot Chain-of-Thought (CoT), reflection prompting, and two combinations.

Table 1: Accuracy of predefined workflows and the workflow learned by IC-$Q$. On every dataset, IC-$Q$ converges to the highest-performing workflow without observing joint trajectories.

Table [1](https://arxiv.org/html/2605.19140#S5.T1) reports accuracy. IC-$Q$ recovers the optimal predefined workflow on every dataset, despite no agent observing joint trajectories: the policy-correspondence claim of §[2](https://arxiv.org/html/2605.19140#S2) cashed out empirically. The optimal workflow is dataset-dependent and IC-$Q$ discovers this without manual tuning: on harder datasets (MathQA, MMLU high-school) it engages the checker, matching Combine-E; on easier data (MMLU elementary) the checker introduces errors and IC-$Q$ correctly converges to simpler CoT. Combine-T (editor receives feedback only via the thinker) is consistently dominated and never selected, showing IC-$Q$ identifies that information loss in the editor’s pipeline degrades performance. The four-bit AIS observation is a sixteen-state projection of an unbounded conversation, so $(\varepsilon_\phi, \delta_\phi)$ are nontrivial; their empirical smallness, evidenced by IC-$Q$ matching the centralized oracle, is what enables the algorithm’s success.

### 5.3 Routing and adaptable agents

Two further experiments instantiate regimes complementary to the multi-LLM headline. Multi-agent routing treats $N = 100$ agents as routers in randomly generated graphs (Erdős-Rényi, Barabási-Albert, Watts-Strogatz, Chain), each observing only its local neighborhood. The AIS observation is exact ($\varepsilon_\phi = \delta_\phi = 0$, since destination plus detain-flag suffices), so this experiment validates IC-$Q$ in the regime where Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1)’s AIS term collapses. IC-$Q$ achieves 100% routing accuracy across all graph distributions (details in the appendix). Multi-agent CPU programming uses six agents (starter, two loaders, ALU, selector, writer) representing CPU components that collaboratively transform an initial memory state into a target state. Both local actions and successor selection are learned: this is the adaptable regime of §[3](https://arxiv.org/html/2605.19140#S3) that Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1) does not formally cover. IC-$Q$ achieves $\sim$80% accuracy on previously unseen target memory states even when trained on as little as 20% of the integer range (details in the appendix), evidence that the learned workflows generalize compositionally and that the framework empirically extends beyond the pre-configured regime our theory currently treats.

## 6 Conclusion

We introduced the IC-SMDP as a formal model of decentralized agentic workflow control, designed an asynchronous $Q$-learning algorithm with single-scalar value passing at handoffs (Algorithm [1](https://arxiv.org/html/2605.19140#alg1)), and proved a finite-sample convergence bound decomposing into neural approximation, AIS gap, and mixing-time residual (Theorem [1](https://arxiv.org/html/2605.19140#Thmtheorem1)). The framework covers a regime prior decentralized RL theory does not, namely sequential handoffs with interface-limited observation, and yields the first joint asymptotic-and-finite-sample bound for neural $Q$-learning under decentralized partial observability. Empirically, four experiments spanning a synthetic IC-SMDP, multi-LLM mathematical reasoning, multi-agent routing, and adaptable CPU programming cash out the bound’s three error terms as distinguishable phenomena and show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories.

## References

- \[1\]Anthropic\.Introducing the Model Context Protocol\.[https://www\.anthropic\.com/news/model\-context\-protocol](https://www.anthropic.com/news/model-context-protocol), 2024\.
- \[2\]Pierre\-Luc Bacon, Jean Harb, and Doina Precup\.The option\-critic architecture\.InProceedings of the AAAI Conference on Artificial Intelligence, 2017\.
- \[3\]Daniel S\. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein\.The complexity of decentralized control of Markov decision processes\.Mathematics of Operations Research, 27\(4\):819–840, 2002\.
- \[4\]Jalaj Bhandari, Daniel Russo, and Raghav Singal\.A finite time analysis of temporal difference learning with linear function approximation\.InProceedings of the 31st Conference on Learning Theory, volume 75 ofProceedings of Machine Learning Research, pages 1691–1692\. PMLR, 2018\.
- \[5\]Steven J\. Bradtke and Michael O\. Duff\.Reinforcement learning methods for continuous\-time Markov decision problems\.InAdvances in Neural Information Processing Systems \(NeurIPS\), 1994\.
- \[6\]Qi Cai, Zhuoran Yang, Jason D\. Lee, and Zhaoran Wang\.Neural temporal difference and Q learning provably converge to global optima\.Mathematics of Operations Research, 49\(1\):619–651, 2023\.
- \[7\]Jinhang Chai, Elynn Chen, and Jianqing Fan\.Deep transferqq\-learning for offline non\-stationary reinforcement learning\.arXiv preprint arXiv:2501\.04870, 2025\.
- \[8\]Jinhang Chai, Elynn Chen, and Lin Yang\.Transfer Q\-learning with composite MDP structures\.InProceedings of the 42nd International Conference on Machine Learning \(ICML\), volume 267 ofProceedings of Machine Learning Research, pages 7089–7106\. PMLR, 2025\.
- \[9\]Jinhang Chai, Enpei Zhang, Elynn Chen, and Yujun Yan\.Optimistic transfer under task shift via Bellman alignment\.arXiv preprint arXiv:2601\.21924, 2026\.
- \[10\]Elynn Chen, Xi Chen, and Wenbo Jing\.Data\-driven knowledge transfer in batchq∗q^\{\*\}learning\.Journal of the American Statistical Association, 2026\.Accepted; published online 05 Jan 2026\.
- \[11\]Elynn Chen, Xi Chen, Wenbo Jing, and Xiao Liu\.High\-dimensional linear bandits under stochastic latent heterogeneity\.arXiv preprint arXiv:2502\.00423, 2025\.
- \[12\]Elynn Chen, Xi Chen, and Yi Zhang\.Transfer learning for contextual joint assortment\-pricing under cross\-market heterogeneity\.arXiv preprint arXiv:2603\.18114, 2026\.
- \[13\]Elynn Chen, Sai Li, and Michael I\. Jordan\.Transfer Q\-learning for finite\-horizon Markov decision processes\.Electronic Journal of Statistics, 19\(2\):5289–5312, 2025\.
- \[14\]Elynn Chen, Rui Song, and Michael I\. Jordan\.Reinforcement learning in latent heterogeneous environments\.Journal of the American Statistical Association, 119\(548\):3113–3126, 2024\.
- \[15\]Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun\.WorkflowLLM: Enhancing workflow orchestration capability of large language models\.arXiv preprint arXiv:2411\.05451, 2024\.
- \[16\]Jakob Foerster, Gregory Farquhar, Triantafyllos Afouras, Nantas Nardelli, and Shimon Whiteson\.Counterfactual multi\-agent policy gradients\.InProceedings of the AAAI Conference on Artificial Intelligence, 2018\.
- \[17\]Google\.Agent2Agent \(A2A\) protocol specification, 2025\.
- \[18\]Juraj Gottweis, Wei\-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, et al\.Towards an AI co\-scientist\.arXiv preprint arXiv:2502\.18864, 2025\.
- \[19\]Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al\.MetaGPT: Meta programming for multi\-agent collaborative framework\.arXiv preprint arXiv:2308\.00352, 2023\.Cited as Hong et al\., 2024 per ICLR publication\.
- \[20\]Shengran Hu, Cong Lu, and Jeff Clune\.Automated design of agentic systems\.arXiv preprint arXiv:2408\.08435, 2024\.
- \[21\]Dong Huang, Qingwen Bu, Jie M\. Zhang, Michael Luck, and Heming Cui\.AgentCoder: Multi\-agent\-based code generation with iterative testing and optimisation\.arXiv preprint arXiv:2312\.13010, 2023\.
- \[22\]Shariq Iqbal and Fei Sha\.Actor\-attention\-critic for multi\-agent reinforcement learning\.InProceedings of the 36th International Conference on Machine Learning \(ICML\), volume 97 ofProceedings of Machine Learning Research, pages 2961–2970\. PMLR, 2019\.
- \[23\] Hsu Kao and Vijay Subramanian\. Common information based approximate state representations in multi\-agent reinforcement learning\. In Artificial Intelligence and Statistics \(AISTATS\), 2022\.
- \[24\] Ali Devran Kara and Serdar Yüksel\. Convergence of finite memory Q\-learning for POMDPs and near optimality of learned policies under filter stability\. Mathematics of Operations Research, 48\(4\):2066–2093, 2022\.
- \[25\] Ryan Lowe, Yi I\. Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch\. Multi\-agent actor\-critic for mixed cooperative\-competitive environments\. In Advances in Neural Information Processing Systems \(NeurIPS\), 2017\.
- \[26\] Ranjit Nair, Pradeep Varakantham, Milind Tambe, and Makoto Yokoo\. Networked distributed POMDPs: A synthesis of distributed constraint optimization and POMDPs\. In Proceedings of the 20th National Conference on Artificial Intelligence \(AAAI\), pages 133–139, 2005\.
- \[27\] Frans A\. Oliehoek and Christopher Amato\. A Concise Introduction to Decentralized POMDPs\. Springer, 2016\.
- \[28\] Frans A\. Oliehoek, Matthijs T\. J\. Spaan, Shimon Whiteson, and Nikos Vlassis\. Exploiting locality of interaction in factored Dec\-POMDPs\. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems \(AAMAS\), pages 517–524, 2008\.
- \[29\] Doina Precup\. Temporal Abstraction in Reinforcement Learning\. PhD thesis, University of Massachusetts Amherst, 2000\.
- \[30\] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson\. Monotonic value function factorisation for deep multi\-agent reinforcement learning\. Journal of Machine Learning Research, 21\(178\):1–51, 2020\.
- \[31\] Amit Sinha, Matthieu Geist, and Aditya Mahajan\. Periodic agent\-state based Q\-learning for POMDPs\. In Advances in Neural Information Processing Systems \(NeurIPS\), 2024\.
- \[32\] Amit Sinha and Aditya Mahajan\. Agent\-state based policies in POMDPs: Beyond belief\-state MDPs\. In IEEE Conference on Decision and Control \(CDC\), 2024\.
- \[33\] Jinwei Su, Qizhen Lan, Yinghui Xia, Lifan Sun, Weiyou Tian, Tianyu Shi, and Lewei He\. Difficulty\-aware agentic orchestration for query\-specific multi\-agent workflows\. pages 2060–2070, 2026\.
- \[34\] Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan\. Approximate information state for approximate planning and reinforcement learning in partially observed systems\. Journal of Machine Learning Research, 23\(12\):1–83, 2022\. arXiv:2010\.08843\.
- \[35\] Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z\. Leibo, Karl Tuyls, et al\. Value\-decomposition networks for cooperative multi\-agent learning based on team reward\. In Autonomous Agents and Multi\-Agent Systems \(AAMAS\), 2018\.
- \[36\] Richard S\. Sutton, Doina Precup, and Satinder Singh\. Between MDPs and semi\-MDPs: A framework for temporal abstraction in reinforcement learning\. Artificial Intelligence, 112\(1–2\):181–211, 1999\.
- \[37\] Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W\. White, Doug Burger, and Chi Wang\. AutoGen: Enabling next\-gen LLM applications via multi\-agent conversation\. arXiv preprint arXiv:2308\.08155, 2023\.
- \[38\] Pan Xu and Quanquan Gu\. A finite\-time analysis of Q\-learning with neural network function approximation\. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 10555–10565\. PMLR, 2020\.
- \[39\] Yingxuan Yang, Huacan Chai, Shuai Shao, Yuanyi Song, Siyuan Qi, Renting Rui, and Weinan Zhang\. AgentNet: Decentralized evolutionary coordination for LLM\-based multi\-agent systems\. In Advances in Neural Information Processing Systems \(NeurIPS\), 2025\. arXiv:2504\.00587\.
- \[40\] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao\. ReAct: Synergizing reasoning and acting in language models\. In International Conference on Learning Representations \(ICLR\), 2023\.
- \[41\] Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xiongwei Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, et al\. AFlow: Automating agentic workflow generation\. In International Conference on Learning Representations \(ICLR\), 2025\. arXiv:2410\.10762\.
- \[42\] Yi Zhang, Elynn Chen, and Yujun Yan\. Transfer faster, price smarter: Minimax dynamic pricing under cross\-market preference shift\. In Advances in Neural Information Processing Systems \(NeurIPS\), 2025\. Spotlight; arXiv:2505\.17203\.
- \[43\] Runlin Zhou, Chixiang Chen, and Elynn Chen\. Prior\-aligned meta\-RL: Thompson sampling with learned priors and guarantees in finite\-horizon MDPs\. arXiv preprint arXiv:2510\.05446, 2025\.
- \[44\] Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber\. GPTSwarm: Language agents as optimizable graphs\. In Proceedings of the 41st International Conference on Machine Learning \(ICML\), 2024\.