State-Centric Decision Process
Summary
Introduces the State-Centric Decision Process (SDP), a runtime framework that enables language agents to construct task-induced state spaces, observation-to-state mappings, certified transitions, and termination criteria from raw text environments, achieving state-of-the-art training-free results on five benchmarks.
View Cached Full Text
Cached at: 05/14/26, 06:14 AM
# State-Centric Decision Process
Source: [https://arxiv.org/html/2605.12755](https://arxiv.org/html/2605.12755)
SUNGHEON JEONG1RYOZO MASUKAWA1SANGGEON YUN1 MAHDI IMANI2MOHSEN IMANI1 1University of California, Irvine,2Northeastern Universitysungheoj@uci\.edu
###### Abstract
Language environments such as web browsers, code terminals, and interactive simulations emit raw text rather than states, and provide none of the runtime structure that MDP analysis requires\. No explicit state space, no observation\-to\-state mapping, no certified transitions, and no termination criterion\. We introduce the State\-Centric Decision Process \(SDP\), a runtime framework that constructs these missing inputs by having the agent build them, predicate by predicate, as it acts\. At each step the agent commits to a natural\-language predicate describing how the world should look, takes an action to make it true, and checks the observation against it\. Predicates that pass become certified states, and the resulting trajectory carries the four objects language environments do not provide, namely a task\-induced state space, an observation\-to\-state mapping, certified transitions, and a termination criterion\. We evaluate SDP on five benchmarks spanning planning, scientific exploration, web reasoning, and multi\-hop question answering\. SDP achieves the best training\-free results on all five, with the advantage widening as the horizon grows\. The certified trajectories additionally support analyses unavailable to reactive agents, including per\-predicate credit assignment, failure localization, partial\-progress measurement, and modular operator replacement\.
## 1Introduction
Language agents operate in environments that were never designed for autonomous decision\-making, from web browsers and code terminals to interactive simulations and multi\-step tool\-use pipelines\(Yoranet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib3); Zhouet al\.,[2023b](https://arxiv.org/html/2605.12755#bib.bib34); Jimenezet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib40); Wanget al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib2); Xieet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib1)\)\. Large language models have made this possible, absorbing enough world knowledge from vast training corpora to select reasonable actions from raw observations without environment\-specific engineering\(Brownet al\.,[2020](https://arxiv.org/html/2605.12755#bib.bib41); Weiet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib42); Achiamet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib43); Huanget al\.,[2022a](https://arxiv.org/html/2605.12755#bib.bib44); Xiet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib46)\)\. Yet success at action selection does not give the trajectory any formal structure: no explicit states, no verified transitions, nothing on which downstream methods can define transitions or assign credit\. Language environments cannot supply this structure even in principle\. The same situation admits unboundedly many valid natural\-language descriptions, and which counts as a useful state is determined by the goal, a choice the environment has no access to\. Without a state space, sequential decision\-making has no surface to operate on\.
Existing language agents address parts of this gap but none closes it\. Reactive agents\(Yaoet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib6); Schicket al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib47)\)interleave reasoning with action selection yet operate directly on raw observations without constructing an explicit state\. Reflective agents go further, accumulating verbal lessons or causal memories across episodes\(Shinnet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib8); Majumderet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib25); Zhaoet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib28)\), but these summaries are open\-ended text rather than states linked by certified transitions\. Action planners\(Wanget al\.,[2023b](https://arxiv.org/html/2605.12755#bib.bib7); Yaoet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib29); Zhouet al\.,[2023a](https://arxiv.org/html/2605.12755#bib.bib30)\)deliberate over candidate action sequences before or during execution, gaining the benefit of lookahead, but the plan entries are things to do rather than conditions to verify, so progress cannot be checked against the environment\. World\-model approaches\(Wanget al\.,[2023c](https://arxiv.org/html/2605.12755#bib.bib32); Haoet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib48); Liuet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib49); Sunet al\.,[2023a](https://arxiv.org/html/2605.12755#bib.bib15)\)construct internal descriptions of the environment, moving closest to explicit state, but the descriptions are consumed by the same module that selects actions, with no per\-step certification that they actually hold\. In every case the trajectory remains a sequence of raw observations and actions, never a sequence of verified states over which transitions and credit can be defined\. On short tasks this is tolerable; on long\-horizon problems, where errors compound and intermediate progress must be tracked, agents have no formal signal that progress is being made\.
We propose to close this gap by having the agent construct its own Markov Decision Process \(MDP\) at runtime\. We call the resulting framework the*State\-Centric Decision Process*\(SDP\)\. Instead of choosing actions and treating state as a byproduct, the agent first commits to a natural\-language predicate describing how the world should look after the next action, a checkable condition on the resulting observation\. It then takes an action intended to make that predicate true and checks the observation against it\. Predicates that pass this check become certified states\. By committing to a predicate before acting, the agent forces its intent into a form the environment can falsify, and the resulting trajectory carries precisely the four objects that MDP analysis requires but language environments do not provide: states, transitions, actions, and credit\. SDP is therefore not a competitor to MDP\-based formulations of agency, but the interface layer they presuppose\.
Contributions\.
1. 1\.Identifying the missing inputs\.We formalize a specification problem unstated in the literature: MDP\-based analysis requires four objects no language environment supplies, and the gap is one of specification rather than sample complexity\.
2. 2\.The SDP framework\.We introduce the State\-Centric Decision Process, a runtime framework producing Markov trajectories by decomposing agency into operatorsPropose,Realize,Validate, andReplanover natural\-language predicates\.
3. 3\.Empirical evaluation\.SDP achieves the best training\-free results on all five benchmarks, with the gap widening as task horizon grows, and ablations isolating each operator’s contribution\.
4. 4\.Trajectory as diagnostic artifact\.Certified trajectories support analyses unavailable to prior agents: failure localization, partial\-progress tracking, cascade recording, and validator auditing\.
## 2Preliminaries: the MDP gap in language environments
MDP analysis requires a state space, an observation\-to\-state mapping, a transition kernel, and a termination criterion\(Puterman,[2014](https://arxiv.org/html/2605.12755#bib.bib50); Suttonet al\.,[1998](https://arxiv.org/html/2605.12755#bib.bib51)\)\. Language environments provide none of these\. They emit unstructured text such as web pages, terminal outputs, and API responses rather than states, and their interface does not commit to what the agent should treat as stateXiet al\.\([2025](https://arxiv.org/html/2605.12755#bib.bib46)\); Wanget al\.\([2024](https://arxiv.org/html/2605.12755#bib.bib52)\); Sumerset al\.\([2023](https://arxiv.org/html/2605.12755#bib.bib53)\)\. Without a fixed state space the remaining MDP constructs cannot even be stated, let alone estimated or optimized\. This section identifies the gap and the four inputs any framework must supply before MDP analysis\.
Useful state abstractions are goal\-dependent\.Let𝒪\\mathcal\{O\}denote the space of raw environment outputs andℋ=⋃t≥0\(𝒪×A\)t×𝒪\\mathcal\{H\}=\\bigcup\_\{t\\geq 0\}\(\\mathcal\{O\}\\times A\)^\{t\}\\times\\mathcal\{O\}the space of interaction histories\. An MDP state is the image of a history under an abstractionϕ:ℋ→S\\phi:\\mathcal\{H\}\\to Sthat preserves the Markov property\(Liet al\.,[2006](https://arxiv.org/html/2605.12755#bib.bib56); Givanet al\.,[2003](https://arxiv.org/html/2605.12755#bib.bib57)\)\. For a singleϕ\\phito suffice, the distinctions it draws must be adequate for every goal the agent might pursue\. In language environments this fails, since the distinctions that matter depend on the goal\. Two histories that a summarization policy can safely identify must be separated by a checkout policy, and vice versa\. Anyϕ\\phifine enough to be Markov across all goals collapses toward the identity onℋ\\mathcal\{H\}, while any coarserϕ\\phiis goal\-specific\. The useful abstraction is therefore not a singleϕ\\phibut a family\{ϕg\}g∈𝒢\\\{\\phi\_\{g\}\\\}\_\{g\\in\\mathcal\{G\}\}indexed by goals\(Abelet al\.,[2018](https://arxiv.org/html/2605.12755#bib.bib58); Andrychowiczet al\.,[2017](https://arxiv.org/html/2605.12755#bib.bib59); Schaulet al\.,[2015](https://arxiv.org/html/2605.12755#bib.bib60)\), and the MDP formalism provides no mechanism to select one at runtime\.
Once the state space goes, the rest follows\.A state space that varies with the goal makes the transition kernelT:S×A→Δ\(S\)T:S\\times A\\to\\Delta\(S\)undefined as a mathematical object, because its domain and codomain are not fixed sets\. This is not a problem more data could resolve; function approximation requires a target object to approximate, and there is none\. Every construct built onTT, including value functions, Bellman backups, and policy gradients\(Suttonet al\.,[1998](https://arxiv.org/html/2605.12755#bib.bib51); Williams,[1992](https://arxiv.org/html/2605.12755#bib.bib54)\), inherits the same gap\. A termination criterion faces the same absence, since the goal is supplied with the task rather than the environment, and nothing in the raw output stream signals when the task is done\. Two responses might be expected, and neither works\. Fixing some maximally fineϕ\\phiand treating the resulting space as truth is what the POMDP relaxation effectively does\(Kaelblinget al\.,[1998](https://arxiv.org/html/2605.12755#bib.bib55); Younget al\.,[2013](https://arxiv.org/html/2605.12755#bib.bib36); Murphy and others,[2000](https://arxiv.org/html/2605.12755#bib.bib61)\), but its filtering equations presuppose precisely theSSandTTwhose existence is in question\. Skippingϕ\\phientirely and letting a neural network operate on raw history works empirically and is the dominant approach in language agents\(Yaoet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib6); Wanget al\.,[2023a](https://arxiv.org/html/2605.12755#bib.bib26)\), but it recovers no analytic object on which transitions, value, or progress can be defined\.
The four missing inputs\.The preceding argument exposes four inputs that MDP analysis requires but that no language environment provides:
1. 1\.State space\.A setSSover which policies, values, and transitions are defined\. No fixedSSexists because the useful abstraction is goal\-indexed\.
2. 2\.Observation\-to\-state mapping\.A function that turns each new observation into a state update\. No singleϕ\\phiworks across goals\.
3. 3\.Certified transitions\.Tuples\(s,a,s′\)\(s,a,s^\{\\prime\}\)whose validity has been checked, not merely assumed from temporal adjacency\. Without a sharedSSthere is no space in which to express them\.
4. 4\.Termination criterion\.A predicate over the state space that signals task completion\. Language environments emit no such signal; the goal is supplied with the task, not by the environment\.
## 3Method: State\-Centric Decision Process
We introduce the*State\-Centric Decision Process*\(SDP\), a runtime framework that supplies the four missing inputs by having the agent construct them, predicate by predicate, as it acts\. The constructed states are not observations, not summaries of observations, and not latent vectors\. They are natural\-language predicates that the agent commits to before acting, each describing how the world should look at a future step\. The trajectory that results carries precisely what language environments do not provide, namely a state space populated by the task, a mapping from observations to state updates, transitions whose validity has been checked, and a termination test grounded in the goal \(Figure[1](https://arxiv.org/html/2605.12755#S3.F1)\)\.
### 3\.1The State\-Centric Decision Process
The framework rests on four operators over a state spaceΣ\\Sigma, formalized below\.
###### Definition 1\(State\-Centric Decision Process\)\.
An SDP is a tuple
\(Σ,A,𝒪,𝒯,g,Propose,Realize,Validate,Replan\),\(\\Sigma,A,\\mathcal\{O\},\\mathcal\{T\},g,\\textsc\{Propose\},\\textsc\{Realize\},\\textsc\{Validate\},\\textsc\{Replan\}\),whereΣ\\Sigmais a space of natural\-language predicates over𝒪\\mathcal\{O\},AAis the action space,𝒪\\mathcal\{O\}is the space of raw environment outputs,𝒯\\mathcal\{T\}is the space of certified trajectories accumulated during execution, andg∈Σg\\in\\Sigmais a goal predicate given with the task\. The four functions are:
Propose:Σ×Σ→Σ,\\displaystyle:\\Sigma\\times\\Sigma\\to\\Sigma,\(st,g\)\\displaystyle\(s\_\{t\},g\)↦s^t\+1,\\displaystyle\\mapsto\\hat\{s\}\_\{t\+1\},\(1\)Realize:Σ×Σ→A,\\displaystyle:\\Sigma\\times\\Sigma\\to A,\(st,s^t\+1\)\\displaystyle\(s\_\{t\},\\hat\{s\}\_\{t\+1\}\)↦at,\\displaystyle\\mapsto a\_\{t\},\(2\)Validate:Σ∗×𝒪→ℕ,\\displaystyle:\\Sigma^\{\*\}\\times\\mathcal\{O\}\\to\\mathbb\{N\},\(s^t\+1,…,s^n;o\)\\displaystyle\(\\hat\{s\}\_\{t\+1\},\\ldots,\\hat\{s\}\_\{n\};\\,o\)↦k,\\displaystyle\\mapsto k,\(3\)Replan:Σ×Σ×𝒯→Σ∗,\\displaystyle:\\Sigma\\times\\Sigma\\times\\mathcal\{T\}\\to\\Sigma^\{\*\},\(st,g,τt\)\\displaystyle\(s\_\{t\},g,\\tau\_\{t\}\)↦\(s^t\+1,…,s^n\)\.\\displaystyle\\mapsto\(\\hat\{s\}\_\{t\+1\},\\ldots,\\hat\{s\}\_\{n\}\)\.\(4\)Propose\(Eq\.[1](https://arxiv.org/html/2605.12755#S3.E1)\) sets the next target from the current state and the goal\.Realize\(Eq\.[2](https://arxiv.org/html/2605.12755#S3.E2)\) selects an action to make the target hold\.Validate\(Eq\.[3](https://arxiv.org/html/2605.12755#S3.E3)\) takes the resulting observation and returns the numberk≥0k\\geq 0of consecutive targets froms^t\+1\\hat\{s\}\_\{t\+1\}that the observation satisfies, withk=0k=0meaning the immediate target is unmet\.Replan\(Eq\.[4](https://arxiv.org/html/2605.12755#S3.E4)\) produces a new target sequence from the current state toggwhen the plan is unrecoverable\. Herennis the current plan length\.
By construction, onlyValidateconsumes raw observations;ProposeandRealizeoperate entirely onΣ\\Sigma, so targets reflect the agent’s intent rather than reactions to whatever the environment presents\. The three operators in the normal loop each read only local inputs, never the preceding history\.Replanis the sole exception, receivingτt\\tau\_\{t\}as the recovery mechanism invoked when the normal loop cannot advance\. This locality yields the Markov property \(Proposition[1](https://arxiv.org/html/2605.12755#Thmproposition1)\)\.
Optimizing over states, not actions\.Definition[1](https://arxiv.org/html/2605.12755#Thmdefinition1)reorganizes the decision problem\. A reactive agent solvesat∗=argmaxa∈AP\(success∣ht,a\)a\_\{t\}^\{\*\}=\\arg\\max\_\{a\\in A\}P\(\\text\{success\}\\mid h\_\{t\},a\)at each step, withAAas its decision space\. SDP takes the*state plan*as decision variable, decomposing the problem into two coupled stages:
\(s^1,…,s^n\)∗\\displaystyle\(\\hat\{s\}\_\{1\},\\ldots,\\hat\{s\}\_\{n\}\)^\{\*\}=argmax\(s^1,…,s^n\)∈ΣnP\(plan reachesg\|s0\),\\displaystyle=\\arg\\max\_\{\(\\hat\{s\}\_\{1\},\\ldots,\\hat\{s\}\_\{n\}\)\\in\\Sigma^\{n\}\}\\;P\\bigl\(\\text\{plan reaches \}g\\,\\big\|\\,s\_\{0\}\\bigr\),\(5\)at∗\\displaystyle a\_\{t\}^\{\*\}=argmaxa∈AP\(Validate\(s^t\+1,Env\(a\)\)≥1\|st,s^t\+1\)\.\\displaystyle=\\arg\\max\_\{a\\in A\}\\;P\\bigl\(\\textsc\{Validate\}\(\\hat\{s\}\_\{t\+1\},\\textsc\{Env\}\(a\)\)\\geq 1\\,\\big\|\\,s\_\{t\},\\hat\{s\}\_\{t\+1\}\\bigr\)\.\(6\)The outer objective \(Eq\.[5](https://arxiv.org/html/2605.12755#S3.E5)\) searches overΣn\\Sigma^\{n\}for a predicate chain froms0s\_\{0\}togg\. The inner objective \(Eq\.[6](https://arxiv.org/html/2605.12755#S3.E6)\) searches overAAfor an action whose environment responseot\+1o\_\{t\+1\}satisfies the next predicate\.Proposecommits to the outer choice before execution begins,Realizesolves the inner problem at each step, andValidatesupplies the per\-step signal that closes the loop\. In this paper, both stages are realized through prompted LLM calls rather than explicit optimization\.Replanexists to repair cases where the resulting plan turns out to be unreachable\. Making the optimization explicit is the subject of the research program SDP is intended to enable, discussed in Section[6](https://arxiv.org/html/2605.12755#S6)\.
Figure 1:The SDP execution structure\.Proposebuilds the predicate chain froms0s\_\{0\}togg\.Realizeacts toward the next predicate\.Validatechecks the observation and is the sole interface for𝒪\\mathcal\{O\}\. When failures at a target exceedbb,Replandiscards and builds a new state\.
### 3\.2The execution loop
The SDP execution loop separates planning from stepping\. The agent first appliesProposefroms0s\_\{0\}and feeds each result back intoProposeuntil it reachesgg, producing a state plan\(s^1,…,s^n=g\)\(\\hat\{s\}\_\{1\},\\ldots,\\hat\{s\}\_\{n\}=g\)\. Algorithm[2](https://arxiv.org/html/2605.12755#S3.F2)gives the complete loop\. At each step the agent selects an action viaRealizeand sends the environment’s response toValidate\(L\. 4–6\)\. If validation certifies one or more predicates, the cursor advances and the certified states enter the trajectory \(L\. 7–9\)\. Otherwise the agent retries the same target, and afterbbconsecutive failuresReplanreplaces the remaining plan suffix \(L\. 10–14\)\.
Cascade\. One action can certify several targets\.Validatereturns an integer rather than a binary verdict because a single action sometimes satisfies multiple consecutive predicates at once\. Such cascades arise from LLM knowledge spanning multiple sub\-goals, predicate overlap, or overly fine\-grained decomposition byPropose\. For example, if the next two targets are “the athlete’s average pace is computed” and “the total distance is derived from that pace,” a single Python snippet can satisfy both simultaneously\. Treating the first target alone as satisfied would force a redundant second action whose outcome is already in hand\.Validatetherefore checks the observation against the head of the remaining plan, reports how many consecutive predicates are jointly satisfied, and advances the cursor by that many positions\. The certified transitions take the form\(st,at,st\+k\)\(s\_\{t\},a\_\{t\},s\_\{t\+k\}\), generalizing single\-step MDP transitions \(Figure[2](https://arxiv.org/html/2605.12755#S3.F2)\)\.
Algorithm 1The SDP execution loop\.
1:initial state
s0s\_\{0\}, goal
gg, attempt budget
bb
2:
\(s^1,…,s^n\)←BuildPlan\(s0,g\)\(\\hat\{s\}\_\{1\},\\ldots,\\hat\{s\}\_\{n\}\)\\leftarrow\\textsc\{BuildPlan\}\(s\_\{0\},g\)
3:⊳\\trianglerightPropose untils^n=g\\hat\{s\}\_\{n\}=g
4:
τ←\(s0\)\\tau\\leftarrow\(s\_\{0\}\),
t←0\\;t\\leftarrow 0,
r←0\\;r\\leftarrow 0
6:
at←Realize\(st,s^t\+1\)a\_\{t\}\\leftarrow\\textsc\{Realize\}\(s\_\{t\},\\hat\{s\}\_\{t\+1\}\)
7:
ot\+1←Env\(at\)o\_\{t\+1\}\\leftarrow\\textsc\{Env\}\(a\_\{t\}\)
8:
k←Validate\(\(s^t\+1,…,s^n\),ot\+1\)k\\leftarrow\\textsc\{Validate\}\(\(\\hat\{s\}\_\{t\+1\},\\ldots,\\hat\{s\}\_\{n\}\),o\_\{t\+1\}\)
10:
st\+1:t\+k←s^t\+1:t\+ks\_\{t\+1:t\+k\}\\leftarrow\\hat\{s\}\_\{t\+1:t\+k\}
11:⊳\\trianglerightcertifykkpredicates
12:
τ←τ∪\{\(at,st\+k\)\}\\tau\\leftarrow\\tau\\cup\\\{\(a\_\{t\},s\_\{t\+k\}\)\\\}
13:
t←t\+kt\\leftarrow t\+k,
r←0\\;r\\leftarrow 0
14:else
17:
\(s^t\+1,…,s^n\)←Replan\(st,g,τ\)\(\\hat\{s\}\_\{t\+1\},\\ldots,\\hat\{s\}\_\{n\}\)\\leftarrow\\textsc\{Replan\}\(s\_\{t\},g,\\tau\)
19:endif
20:endif
21:endwhile
Figure 2:Example SDP run, with a cascade \(k≥2k\\geq 2\) and a replan opportunity\.
#### Failure in execution and failure in planning are corrected separately\.
When an action fails to realize a target,Realizeretries with a different action against the same predicate\. The rest of the plan is untouched\. The plan itself is revised only when repeated attempts at the same target exhaust the budgetbb\.Replanthen discards the plan tail and proposes a new sequence fromsts\_\{t\}togg\(Figure[1](https://arxiv.org/html/2605.12755#S3.F1)\)\. This separation matters most in long\-horizon tasks\. A broken action consumes one attempt and the agent retries, while a broken plan is locally repaired without restarting froms0s\_\{0\}\. Reactive agents must reconsider both whenever either fails\. SDP corrects each on its own timescale\.
### 3\.3What an SDP run produces
An SDP run terminates with more than an answer to the task\. It returns a structured artifact that records every certified state, the action that produced it, and the cascade depth recorded by each validation\. The artifact has the form
\(s0,a0,st1,at1,…,aT−1,sT\),sT⊧g,\\bigl\(s\_\{0\},\\;a\_\{0\},\\;s\_\{t\_\{1\}\},\\;a\_\{t\_\{1\}\},\\;\\ldots,\\;a\_\{T\-1\},\\;s\_\{T\}\\bigr\),\\qquad s\_\{T\}\\models g,where the indicest1<t2<⋯<Tt\_\{1\}<t\_\{2\}<\\cdots<Tmark the steps at which validation discharged at least one target\. Each gapti\+1−tit\_\{i\+1\}\-t\_\{i\}is the cascade depthkkat that step\. This artifact closes the gap identified in Section[2](https://arxiv.org/html/2605.12755#S2)by supplying, item by item, the four objects that MDP analysis requires but that no language environment provides\. Table[1](https://arxiv.org/html/2605.12755#S3.T1)states the correspondence explicitly\.
Table 1:MDP inputs absent from language environments and the artifacts SDP produces\. The last row adds a credit signal beyond the four\.Table[1](https://arxiv.org/html/2605.12755#S3.T1)should not be read as a claim that we have applied any particular downstream method to the artifact\. We have not\. The claim is that each downstream method is now a*well\-posed problem*on this artifact, in a way it is not on raw histories of language environments\. Making these problems well\-posed is the contribution of this paper\. Translating them into algorithms is the substance of the program described in Section[6](https://arxiv.org/html/2605.12755#S6)\. The artifact also carries a structural guarantee\.
###### Proposition 1\(Markov property of SDP trajectories\)\.
Letτ=\(s0,a0,st1,at1,…,sti\)\\tau=\(s\_\{0\},a\_\{0\},s\_\{t\_\{1\}\},a\_\{t\_\{1\}\},\\ldots,s\_\{t\_\{i\}\}\)be a prefix of a trajectory generated by Algorithm[2](https://arxiv.org/html/2605.12755#S3.F2), letprefix=\(s0,a0,…,ati−1\)\\text\{prefix\}=\(s\_\{0\},a\_\{0\},\\ldots,a\_\{t\_\{i\}\-1\}\)denote the history beforestis\_\{t\_\{i\}\}, and letPi=\(s^ti\+1,…,s^n\)P\_\{i\}=\(\\hat\{s\}\_\{t\_\{i\}\+1\},\\ldots,\\hat\{s\}\_\{n\}\)be the plan tail at steptit\_\{i\}\. Assume the environment’s responseP\(oti\+1∣ati,prefix\)P\(o\_\{t\_\{i\}\+1\}\\mid a\_\{t\_\{i\}\},\\text\{prefix\}\)depends on the prefix only through\(sti,Pi\)\(s\_\{t\_\{i\}\},P\_\{i\}\)\. Then the next certified state satisfies
P\(sti\+1\|sti,Pi,prefix\)=P\(sti\+1\|sti,Pi\)\.P\\bigl\(s\_\{t\_\{i\+1\}\}\\,\\big\|\\,s\_\{t\_\{i\}\},P\_\{i\},\\text\{prefix\}\\bigr\)=P\\bigl\(s\_\{t\_\{i\+1\}\}\\,\\big\|\\,s\_\{t\_\{i\}\},P\_\{i\}\\bigr\)\.\(7\)
###### Proof sketch\.
The next certified statesti\+1s\_\{t\_\{i\+1\}\}depends only onatia\_\{t\_\{i\}\}and the valuek=Validate\(Pi,oti\+1\)k=\\textsc\{Validate\}\(P\_\{i\},o\_\{t\_\{i\}\+1\}\)\. By Definition[1](https://arxiv.org/html/2605.12755#Thmdefinition1),Realize\(sti,s^ti\+1\)\\textsc\{Realize\}\(s\_\{t\_\{i\}\},\\hat\{s\}\_\{t\_\{i\}\+1\}\)reads only the current state and the next target, andValidatereads only the plan tailPiP\_\{i\}and the new observation\. Neither consumes the prefix, andoti\+1o\_\{t\_\{i\}\+1\}depends on the prefix only through\(sti,Pi\)\(s\_\{t\_\{i\}\},P\_\{i\}\)by assumption\.Replan, when invoked, receivesτ\\tauonly to produce a new plan tailPi′P\_\{i\}^\{\\prime\}, which then plays the role ofPiP\_\{i\}for subsequent steps, and the same independence holds withPi′P\_\{i\}^\{\\prime\}in place ofPiP\_\{i\}\. The prefix therefore enterssti\+1s\_\{t\_\{i\+1\}\}only through\(sti,Pi\)\(s\_\{t\_\{i\}\},P\_\{i\}\)\. ∎
The Markov property here follows from how state is defined, given a mild conditional independence assumption on the environment, rather than from an empirical claim about the environment’s full dynamics\. SDP states are predicates the agent has certified to hold, not environment configurations whose latent dynamics might leak across step boundaries\. Trajectories with the same certified state and remaining plan are indistinguishable for predicting the next certified state, which depends only on what the agent intends to certify next and the action it tries\. Proposition[1](https://arxiv.org/html/2605.12755#Thmproposition1)establishes exactly the conditional independence of Eq\.[7](https://arxiv.org/html/2605.12755#S3.E7), with the pair\(sti,Pi\)\(s\_\{t\_\{i\}\},P\_\{i\}\)playing the role of the Markov state\.
## 4Experiments
We evaluate SDP along two axes\. First, whether organizing an agent around a state plan improves task performance \(Section[4\.1](https://arxiv.org/html/2605.12755#S4.SS1)\)\. Second, whether the certified trajectories support analyses unavailable to reactive agents \(Section[4\.2](https://arxiv.org/html/2605.12755#S4.SS2)\)\. Section[4\.3](https://arxiv.org/html/2605.12755#S4.SS3)isolates the contribution of each loop\-level mechanism via ablation\. The five benchmarks span distinct combinations of environment structure and goal type, including constraint satisfaction over structured option sets \(TravelPlanner\(Xieet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib1)\)\), open\-ended web reasoning with free\-form retrieval \(AssistantBench\(Yoranet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib3)\)\), interactive scientific exploration in a text\-based simulator \(ScienceWorld\(Wanget al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib2)\)\), and multi\-hop question answering at varying chain depths \(HotpotQA\(Yanget al\.,[2018](https://arxiv.org/html/2605.12755#bib.bib4)\), MuSiQue\(Trivediet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib5)\)\)\. For each benchmark we compare SDP against the published baselines evaluated on it, retaining each baseline with the LLM reported in its original paper\. We report benchmark\-selection rationale and baseline fairness criteria, including LLM backbone matching and sample size alignment, in Appendices[B](https://arxiv.org/html/2605.12755#A2)and[A](https://arxiv.org/html/2605.12755#A1)\.
### 4\.1Task performance across language environments
TravelPlanner\.Table[2](https://arxiv.org/html/2605.12755#S4.T2)reports results on TravelPlanner\(Xieet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib1)\)\. SDP’s largest gain is on hard constraints, achieving 97\.4% Micro and 93\.8% Macro and exceeding ATLAS\(Choiet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib9)\)by 14\.8 and 19\.4 points despite ATLAS using a larger backbone \(Gemini\-2\.5\-Pro\)\. This advantage traces to SDP’s predicate structure: each constraint \(budget, accommodation, transportation\) becomes a separate predicate that must be certified before the plan advances, so violations are caught at the step where they arise rather than discovered after assembly\. Two failure modes of plan\-as\-text baselines reinforce this gap\. In our reproduction runs, 12–18% of outputs from ReAct, Plan\-and\-Solve, and Reflexion were malformed, and 20–30% of feasible tasks suffered budget overflow from single\-pass generation\. SDP eliminates both, since eachRealizecall receives only constraint\-satisfying options and the LLM selects among guaranteed\-feasible candidates rather than checking constraints itself\.
Table 2:Results on TravelPlannerXieet al\.\([2024](https://arxiv.org/html/2605.12755#bib.bib1)\)\. Metrics are Delivery Rate, Commonsense constraint satisfaction \(Micro/Macro\), Hard Constraint satisfaction \(Micro/Macro\), and Final Pass Rate \(%\)\.AssistantBench\.Table[3](https://arxiv.org/html/2605.12755#S4.T3)reports results on AssistantBench\(Yoranet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib3)\)\. SDP achieves the highest overall accuracy, precision, and EM using only a search API and URL\-level scraping, without browser rendering, JavaScript, or form interaction\. Its predicate structure decomposes each constraint into a separate verification step, and roughly 40% of certified steps required at least one rejection\-and\-retry cycle\. A recurring pattern in multi\-entity tasks is that propagating partial findings with uncertainty markers outperforms forcing complete results, since validation pressure for completeness drives the agent to fill gaps with plausible but ungrounded values\. SDP does not lead on Hard\-tier accuracy, where Magentic\-One\(Fourneyet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib19)\)wins, because Hard\-tier tasks require multi\-page browsing sessions a search\-only interface cannot replicate\. Where this limitation does not apply, the advantage is decisive: on Easy\-tier tasks SDP achieves 92\.8%, exceeding the next best by over 10 points, confirming that structured validation rather than richer tool access is the binding factor\.
Table 3:Results on AssistantBenchYoranet al\.\([2024](https://arxiv.org/html/2605.12755#bib.bib3)\)\. We report overall Accuracy, Precision, Exact Match \(EM\), and Accuracy stratified by difficulty \(Easy/Medium/Hard\)\.ScienceWorld\.Table[5](https://arxiv.org/html/2605.12755#S4.T5)reports results on ScienceWorld\(Wanget al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib2)\)\. Among training\-free methods on the standard 30\-task GPT\-4 protocol, SDP achieves the highest overall score \(59\.16\), improving over the previous best \(Plan\-and\-Act\(Erdoganet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib11)\)\) by 11\.3 points\. The gap widens with task length\. On Long tasks SDP reaches 50\.41, leading Plan\-and\-Act by 15\.6 points and Reflexion by over 20 points, consistent with the expectation that explicit state tracking matters more as the horizon grows\. The predicate structure also yields plan compression\. The LLM’s plan for a task like “boil water” uses 7 predicates that subsume the oracle’s 36\-step action sequence while preserving causal ordering, a compression derived from the task description alone without in\-context trajectory examples\. Plan granularity proved more consequential than length\. Overly abstract predicates confused the validator, while overly fine\-grained ones inflated LLM calls without improving accuracy, and predicates pitched at a single observable state change performed most reliably\. The dominant failure mode here is not planning or action selection but inaccurate validation, analyzed further in Section[4\.2](https://arxiv.org/html/2605.12755#S4.SS2)\.
Table 4:Results on ScienceWorldWanget al\.\([2022](https://arxiv.org/html/2605.12755#bib.bib2)\)across task lengths, reported as average score\. All methods are training\-free with GPT\-4 on the standard 30\-task protocol\.
Table 5:Results on multi\-hop QA in the fullwiki open\-domain setting\. Both benchmarks use BM25 retrieval over a 5M\-paragraph Wikipedia corpus and 1,000 validation examples\.
Multi\-hop question answering\.Table[5](https://arxiv.org/html/2605.12755#S4.T5)reports results on HotpotQA\(Yanget al\.,[2018](https://arxiv.org/html/2605.12755#bib.bib4)\)and MuSiQue\(Trivediet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib5)\)\. SDP’s advantage is most pronounced on MuSiQue, where chains reach three or four hops, consistent with the expectation that explicit state tracking matters more as the reasoning horizon grows\. The primary strength here is thatValidateblocks hallucinated intermediate findings before they propagate\. In HotpotQA’s distractor setting, where gold paragraphs are guaranteed alongside irrelevant ones, this is decisive: validation rejects findings from distractor passages and the agent re\-attempts with the correct paragraph, while reactive baselines lack per\-hop verification and are vulnerable to plausible but incorrect evidence that silently corrupts the chain\. The converse is the primary challenge: when the retriever fails to surface relevant evidence,Validaterejects each unsupported finding but the loop has no better paragraph to try, exhausting its budget\. Roughly 76% of MuSiQue failures trace to this retrieval bottleneck rather than planning or validation errors\. SDP achieves these results using BM25 alone, whereas the strongest baselines employ hybrid retrieval; integrating learned retrievers intoRealizeis a natural extension the framework accommodates\.
### 4\.2Anatomy of SDP trajectories
The previous results evaluated SDP as a prompting strategy\. This section examines the artifact it produces\. The certified trajectory carries structured information that a reactive trajectory does not, namely whether single actions advanced multiple predicates, where the agent stalls, how much progress was made before failure, and how well the validator’s verdicts agree with ground truth\.
Figure 3:Anatomy of SDP trajectories\.\(a\)Cascade depth\.\(b\)Score rate vs\. replan count\.\(c\)Certified progress in failed runs\.\(d\)Validatecalibration\.Cascade and replan rates are environment properties \(Figure[3](https://arxiv.org/html/2605.12755#S4.F3)a,b\)\.Whether a single action satisfies multiple predicates, and whether a failed plan is locally repairable, are environment\-determined\. Cascade ranges from 0% on TravelPlanner, where each slot demands separate selection, to 37% on ScienceWorld, where bundled sub\-goals are common\. The replan curve tracks environment recoverability\. ScienceWorld holds full success through one replan, while TravelPlanner declines steadily as replanning rises, since replan cannot rescue infeasible option sets\. This variation confirms SDP records these rates honestly rather than imposing a single pattern\.
Where and how far runs fail \(Figure[3](https://arxiv.org/html/2605.12755#S4.F3)c\)\.Because every rejection attaches to a specific predicate, the trajectory records not only that a run failed but where and how far\. TravelPlanner failures concentrate on hotel selection and return\-transport; MuSiQue failures typically certify the first hop or two before stalling on a bridge entity\. A reactive run is binary; an SDP run is a fraction\. Failed TravelPlanner runs certify 44% of their plan, and failed MuSiQue runs certify 60–64% of their hops\. The pattern is most revealing in MuSiQue, where certified hops grow with chain length, so longer chains fail farther in, surfacing more partial information rather than less\.
Calibration is auditable \(Figure[3](https://arxiv.org/html/2605.12755#S4.F3)d\)\.BecauseValidatecertifies the goal predicate as a discrete decision, every run yields a ground\-truth check\. The agreement between goal certification and a correct final answer is 79% on HotpotQA and 60% on MuSiQue\. In both, runs that fall back to forced finalization show sharply lower precision \(41% and 19% respectively\), confirming that the certification signal carries information beyond LLM parametric guessing\. The gap is not a hidden weakness but a measured one, since runs whereValidateaccepts the full chain but the answer is wrong form a small, identifiable subset whose patterns can guide tighter validator design\.
### 4\.3Ablation study
The framework’s three loop\-level mechanisms,Validate,Replan, and cascade, are each removable\. We replaceValidatewith always\-pass,Replanwith termination on budget exhaust, and cascade with a cap ofk≤1k\\leq 1\. Scores are rescaled using conversion factors derived from each trajectory \(action fidelity forValidate, certified\-prefix ratio forReplan, both defined in Appendix[C](https://arxiv.org/html/2605.12755#A3)\)\.
Table 6:Ablation\.Effect of removing each loop\-level mechanism\.Validateis the dominant contributor, except the environment substitutes for it\.RemovingValidateproduces the largest drop on four of five benchmarks, confirming it is the only mechanism catching uncertified findings before they propagate\. TravelPlanner is the exception: its environment pre\-filters candidates to exclude budget\- and constraint\-violating options, performing part ofValidate’s role structurally\. Without such a filter,Validateis the principal source of correctness\.
Replanpays off in proportion to environment recoverability\.Replanmatters most where local re\-attempts can recover progress and least where no alternative exists\. ScienceWorld shows the strongest effect since most successful runs needed at least one replan to navigate an unexpected obstacle, while AssistantBench shows the weakest as nearly all tasks complete on the original plan\. TravelPlanner presents a distinct pattern: the score drops not because replanning would have rescued the task, but because the certified prefix is shorter when the agent terminates on budget exhaustion\.
Cascade is decisive only under tight attempt budgets\.Cascade acts as a budget\-multiplier, advancing the cursor byk≥2k\\geq 2and preventing budget exhaustion when actions bundle\. ScienceWorld shows the largest effect without cascade \(59\.2→\\to32\.0\), consistent with its 37% cascade rate\. On benchmarks where cascade is rare or the budget is generous, disabling it has little measurable impact\.
## 5Related work
Reactive and reflective language agents\.The dominant paradigm maps observation history directly to action\(Yaoet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib6); Schicket al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib47)\)\. Subsequent work adds verbal reflections\(Shinnet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib8); Lippmannet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib24)\), persistent memories\(Majumderet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib25)\), or reusable skills\(Zhaoet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib28); Wanget al\.,[2023a](https://arxiv.org/html/2605.12755#bib.bib26)\)\. These systems still share a structural gap: they do not construct states distinct from observations, certify transitions against those states, or produce trajectories whose Markov property can be stated\. SDP instead produces a discrete certification at every step\.
Action\-planning agents\.A second line equips language agents with explicit planning, but plans over action sequences rather than state sequences\(Wanget al\.,[2023b](https://arxiv.org/html/2605.12755#bib.bib7); Yaoet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib29); Zhouet al\.,[2023a](https://arxiv.org/html/2605.12755#bib.bib30); Liet al\.,[2026](https://arxiv.org/html/2605.12755#bib.bib12); Erdoganet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib11)\)\. These approaches share the insight that committing to a plan before acting improves over reactive behavior, and SDP inherits it\. The difference is what is planned\. Action planners searchAnA^\{n\}for things to do, while SDP searchesΣn\\Sigma^\{n\}for conditions that should hold\. Two consequences follow\. First, plan entries are verifiable against observations, since predicates have truth values\. Second, the plan is decoupled from the action space, so the same predicate chain can be realized by different actions without replanning\.
World models and state abstraction\.Several recent systems construct internal representations of the environment rather than raw observations, including temporal knowledge graphs\(Dinhet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib31)\), free\-form state descriptions\(Wanget al\.,[2023c](https://arxiv.org/html/2605.12755#bib.bib32); Sunet al\.,[2023a](https://arxiv.org/html/2605.12755#bib.bib15)\), and feedback\-triggered replanning\(Huanget al\.,[2022b](https://arxiv.org/html/2605.12755#bib.bib33)\)\. These systems share the intuition that an agent benefits from an explicit model of what is true in the world\. The gap is in what follows from that model\. In each case the internal representation is consumed by the action\-producing module, with no per\-step certification\. SDP’s four\-operator decomposition enforces this separation architecturally rather than relying on the language model to maintain it implicitly\.
Decision\-theoretic formulations for language environments\.The standard approach posits an MDP or POMDP and learns or plans within it\. This has been productive in dialogue systems\(Younget al\.,[2013](https://arxiv.org/html/2605.12755#bib.bib36); Williams and Young,[2007](https://arxiv.org/html/2605.12755#bib.bib37)\), game playing\(Silveret al\.,[2017](https://arxiv.org/html/2605.12755#bib.bib38)\), and instruction following in gridworlds\(Branavanet al\.,[2009](https://arxiv.org/html/2605.12755#bib.bib39)\), where the state space is fixed in advance\. Applying the same template to open\-ended language environments requires fixing a state space and a transition kernel before the task is known—the obstacle of Section[2](https://arxiv.org/html/2605.12755#S2)\. SDP sidesteps this by not assuming a pre\-existing MDP\. The agent constructs the state space at runtime, predicate by predicate, and the certified trajectory is the MDP, populated by the task rather than posited\.
## 6Discussion
Limitations\.SDP inherits the limits of its LLM\-based operators\.Validatecan produce false positives on superficially matching observations, so the certified\-state guarantee is only as strong as its soundness; Section[4\.2](https://arxiv.org/html/2605.12755#S4.SS2)quantifies this through calibration\.Proposesimilarly bounds plan quality by the LLM’s ability to decompose goals into reachable predicates, andReplancannot fully recover from consistently unreachable targets\. Natural\-language predicates also restrict which conditions can become states, especially for continuous quantities or conditions the LLM cannot articulate\. Finally, SDP uses more LLM calls per environment step than reactive baselines\.
What the framework makes possible\.In classical settings, MDP structure is valuable because it makes a class of operations well\-defined, not because it enables one specific algorithm\. SDP brings three such operations to language agents: per\-predicate credit assignment through the integerkkfromValidate, modular operator replacement through the fixed interfaces of Definition[1](https://arxiv.org/html/2605.12755#Thmdefinition1), and formal progress measurement through the certified trajectory \(Section[4\.2](https://arxiv.org/html/2605.12755#S4.SS2)\)\. Follow\-up work can extend each direction by training a state generator forPropose, learning an action policy overΣ\\SigmaforRealize, or applying offline RL to certified tuples\(st,at,k,st\+k\)\(s\_\{t\},a\_\{t\},k,s\_\{t\+k\}\)\. The present paper provides the structural foundation, and the fixed interfaces make these extensions natural\.
## 7Conclusion
Language environments provide none of the runtime structure MDP analysis requires\. No state space, no observation\-to\-state mapping, no certified transitions, and no termination criterion\. SDP closes this gap by having the agent construct all four, predicate by predicate, producing a certified trajectory that carries exactly the objects downstream analysis presupposes\. Experiments on five benchmarks confirm this structure improves task performance, with the advantage widening as the horizon grows, and supports analyses unavailable to reactive agents\. SDP operates on the formal MDP and rigorously certifies each predicate, bringing decision\-theoretic rigor to language agents\.
## Acknowledgement
This work was supported in part by the DARPA Young Faculty Award, the National Science Foundation \(NSF\) under Grants \#2127780, \#2319198, \#2321840, \#2312517, and \#2235472, the Semiconductor Research Corporation \(SRC\), the Office of Naval Research through the Young Investigator Program Award, and Grants \#N00014\-21\-1\-2225 and N00014\-24\-1\-2547, Army Research Office Grant \#W911NF2410360\. Additionally, support was provided by the Air Force Office of Scientific Research under Award \#FA9550\-22\-1\-0253\.
## References
- \[1\]\(2018\)State abstractions for lifelong reinforcement learning\.InInternational conference on machine learning,pp\. 10–19\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p2.9)\.
- \[2\]J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat,et al\.\(2023\)Gpt\-4 technical report\.arXiv preprint arXiv:2303\.08774\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p1.1)\.
- \[3\]M\. Ahn, A\. Brohan, N\. Brown, Y\. Chebotar, O\. Cortes, B\. David, C\. Finn, C\. Fu, K\. Gopalakrishnan, K\. Hausman,et al\.\(2022\)Do as i can, not as i say: grounding language in robotic affordances\.arXiv preprint arXiv:2204\.01691\.Cited by:[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1.1.2.1.1)\.
- \[4\]M\. Andrychowicz, F\. Wolski, A\. Ray, J\. Schneider, R\. Fong, P\. Welinder, B\. McGrew, J\. Tobin, O\. Pieter Abbeel, and W\. Zaremba\(2017\)Hindsight experience replay\.Advances in neural information processing systems30\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p2.9)\.
- \[5\]D\. Bhardwaj, A\. Beniwal, S\. Chaudhari, A\. Kalyan, T\. Rajpurohit, K\. R\. Narasimhan, A\. Deshpande, and V\. Murahari\(2025\)Agent context protocols enhance collective inference\.arXiv preprint arXiv:2505\.14569\.Cited by:[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.2.2.9.5.1)\.
- \[6\]S\. R\. Branavan, H\. Chen, L\. Zettlemoyer, and R\. Barzilay\(2009\)Reinforcement learning for mapping instructions to actions\.InProceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP,pp\. 82–90\.Cited by:[§5](https://arxiv.org/html/2605.12755#S5.p4.1)\.
- \[7\]T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p1.1)\.
- \[8\]J\. Chen, H\. Li, J\. Yang, Y\. Liu, and Q\. Ai\(2025\)Enhancing llm\-based agents via global planning and hierarchical execution\.arXiv preprint arXiv:2504\.16563\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.7.7.1)\.
- \[9\]J\. Choi, J\. Yoon, J\. Chen, S\. Jha, and T\. Pfister\(2025\)ATLAS: constraints\-aware multi\-agent collaboration for real\-world travel planning\.arXiv preprint arXiv:2509\.25586\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.11.11.1)\.
- \[10\]M\. P\. Dinh, M\. Syed, G\. Y\. Michael, and T\. W\. Ford\(2024\)ReasonPlanner: enhancing autonomous planning in dynamic environments with temporal knowledge graphs and llms\.arXiv preprint arXiv:2410\.09252\.Cited by:[§5](https://arxiv.org/html/2605.12755#S5.p3.1)\.
- \[11\]L\. E\. Erdogan, N\. Lee, S\. Kim, S\. Moon, H\. Furuta, G\. Anumanchipalli, K\. Keutzer, and A\. Gholami\(2025\)Plan\-and\-act: improving planning of agents for long\-horizon tasks\.arXiv preprint arXiv:2503\.09572\.Cited by:[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p3.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1.1.7.6.1),[§5](https://arxiv.org/html/2605.12755#S5.p2.2)\.
- \[12\]A\. Fourney, G\. Bansal, H\. Mozannar, C\. Tan, E\. Salinas, F\. Niedtner, G\. Proebsting, G\. Bassman, J\. Gerrits, J\. Alber,et al\.\(2024\)Magentic\-one: a generalist multi\-agent system for solving complex tasks\.arXiv preprint arXiv:2411\.04468\.Cited by:[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p2.1),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.2.2.8.4.1)\.
- \[13\]R\. Givan, T\. Dean, and M\. Greig\(2003\)Equivalence notions and model minimization in markov decision processes\.Artificial intelligence147\(1\-2\),pp\. 163–223\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p2.9)\.
- \[14\]Z\. Guo, B\. Xu, X\. Wang, and Z\. Mao\(2025\)Mirror: multi\-agent intra\-and inter\-reflection for optimized reasoning in tool learning\.arXiv preprint arXiv:2505\.20670\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.6.6.1)\.
- \[15\]S\. Hao, Y\. Gu, H\. Ma, J\. Hong, Z\. Wang, D\. Wang, and Z\. Hu\(2023\)Reasoning with language model is planning with world model\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 8154–8173\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1)\.
- \[16\]W\. Huang, P\. Abbeel, D\. Pathak, and I\. Mordatch\(2022\)Language models as zero\-shot planners: extracting actionable knowledge for embodied agents\.InInternational conference on machine learning,pp\. 9118–9147\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p1.1)\.
- \[17\]W\. Huang, F\. Xia, T\. Xiao, H\. Chan, J\. Liang, P\. Florence, A\. Zeng, J\. Tompson, I\. Mordatch, Y\. Chebotar,et al\.\(2022\)Inner monologue: embodied reasoning through planning with language models\.arXiv preprint arXiv:2207\.05608\.Cited by:[§5](https://arxiv.org/html/2605.12755#S5.p3.1)\.
- \[18\]C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. Narasimhan\(2023\)Swe\-bench: can language models resolve real\-world github issues?\.arXiv preprint arXiv:2310\.06770\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p1.1)\.
- \[19\]L\. P\. Kaelbling, M\. L\. Littman, and A\. R\. Cassandra\(1998\)Planning and acting in partially observable stochastic domains\.Artificial intelligence101\(1\-2\),pp\. 99–134\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p3.6)\.
- \[20\]D\. Lee, Y\. Jo, H\. Park, and M\. Lee\(2025\)Shifting from ranking to set selection for retrieval augmented generation\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 17606–17619\.Cited by:[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.6.6.1)\.
- \[21\]L\. Li, T\. J\. Walsh, and M\. L\. Littman\(2006\)Towards a unified theory of state abstraction for mdps\.\.AI&M1\(2\),pp\. 3\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p2.9)\.
- \[22\]Y\. Li, B\. Xu, X\. Tian, X\. Xu, and H\. Shen\(2026\)Beyond entangled planning: task\-decoupled planning for long\-horizon agents\.arXiv preprint arXiv:2601\.07577\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.8.8.1),[§5](https://arxiv.org/html/2605.12755#S5.p2.2)\.
- \[23\]B\. Y\. Lin, Y\. Fu, K\. Yang, F\. Brahman, S\. Huang, C\. Bhagavatula, P\. Ammanabrolu, Y\. Choi, and X\. Ren\(2023\)Swiftsage: a generative agent with fast and slow thinking for complex interactive tasks\.Advances in Neural Information Processing Systems36,pp\. 23813–23825\.Cited by:[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px2.p1.1)\.
- \[24\]P\. Lippmann, M\. T\. Spaan, and J\. Yang\(2025\)Positive experience reflection for agents in interactive text environments\.InProceedings of the 1st Workshop for Research on Agent Language Models \(REALM 2025\),pp\. 131–142\.Cited by:[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[25\]Z\. Liu, H\. Hu, S\. Zhang, H\. Guo, S\. Ke, B\. Liu, and Z\. Wang\(2023\)Reason for future, act for now: a principled framework for autonomous llm agents with provable sample efficiency\.arXiv preprint arXiv:2309\.17382\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1)\.
- \[26\]B\. P\. Majumder, B\. D\. Mishra, P\. Jansen, O\. Tafjord, N\. Tandon, L\. Zhang, C\. Callison\-Burch, and P\. Clark\(2023\)Clin: a continually learning language agent for rapid task adaptation and generalization\.arXiv preprint arXiv:2310\.10134\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[27\]K\. P\. Murphyet al\.\(2000\)A survey of pomdp solution techniques\.environment2\(10\),pp\. 1–12\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p3.6)\.
- \[28\]M\. M\. H\. Nahid and D\. Rafiei\(2025\)Prism: agentic retrieval with llms for multi\-hop question answering\.arXiv preprint arXiv:2510\.14278\.Cited by:[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px2.p1.1),[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.8.8.1)\.
- \[29\]R\. Pradeep, S\. Sharifymoghaddam, and J\. Lin\(2023\)Rankzephyr: effective and robust zero\-shot listwise reranking is a breeze\!\.arXiv preprint arXiv:2312\.02724\.Cited by:[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.3.3.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.4.4.1)\.
- \[30\]O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis\(2023\)Measuring and narrowing the compositionality gap in language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 5687–5711\.Cited by:[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.2.2.6.2.1)\.
- \[31\]M\. L\. Puterman\(2014\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p1.1)\.
- \[32\]R\. G\. Reddy, S\. Mukherjee, J\. Kim, Z\. Wang, D\. Hakkani\-Tur, and H\. Ji\(2025\)Infogent: an agent\-based framework for web information aggregation\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 5745–5758\.Cited by:[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.2.2.5.1.1)\.
- \[33\]T\. Schaul, D\. Horgan, K\. Gregor, and D\. Silver\(2015\)Universal value function approximators\.InInternational conference on machine learning,pp\. 1312–1320\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p2.9)\.
- \[34\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\(2023\)Toolformer: language models can teach themselves to use tools\.Advances in neural information processing systems36,pp\. 68539–68551\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[35\]N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao\(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.4.4.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1.1.5.4.1),[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[36\]D\. Silver, J\. Schrittwieser, K\. Simonyan, I\. Antonoglou, A\. Huang, A\. Guez, T\. Hubert, L\. Baker, M\. Lai, A\. Bolton,et al\.\(2017\)Mastering the game of go without human knowledge\.nature550\(7676\),pp\. 354–359\.Cited by:[§5](https://arxiv.org/html/2605.12755#S5.p4.1)\.
- \[37\]T\. Sumers, S\. Yao, K\. R\. Narasimhan, and T\. L\. Griffiths\(2023\)Cognitive architectures for language agents\.Transactions on Machine Learning Research\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p1.1)\.
- \[38\]H\. Sun, Y\. Zhuang, L\. Kong, B\. Dai, and C\. Zhang\(2023\)Adaplanner: adaptive planning from feedback with language models\.Advances in neural information processing systems36,pp\. 58202–58245\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p3.1)\.
- \[39\]W\. Sun, L\. Yan, X\. Ma, S\. Wang, P\. Ren, Z\. Chen, D\. Yin, and Z\. Ren\(2023\)Is chatgpt good at search? investigating large language models as re\-ranking agents\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 14918–14937\.Cited by:[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px3.p1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.5.5.1)\.
- \[40\]R\. S\. Sutton, A\. G\. Barto,et al\.\(1998\)Reinforcement learning: an introduction\.MIT press Cambridge\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p1.1),[§2](https://arxiv.org/html/2605.12755#S2.p3.6)\.
- \[41\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal\(2022\)MuSiQue: multihop questions via single\-hop question composition\.Transactions of the Association for Computational Linguistics10,pp\. 539–554\.Cited by:[§A\.6](https://arxiv.org/html/2605.12755#A1.SS6.p1.1),[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p4.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.1.1.4),[§4](https://arxiv.org/html/2605.12755#S4.p1.1)\.
- \[42\]H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal\(2023\)Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.InProceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 10014–10037\.Cited by:[§A\.6](https://arxiv.org/html/2605.12755#A1.SS6.p3.1),[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px2.p1.1),[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px3.p1.1),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.1.1.1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.7.7.1)\.
- \[43\]G\. Wang, Y\. Xie, Y\. Jiang, A\. Mandlekar, C\. Xiao, Y\. Zhu, L\. Fan, and A\. Anandkumar\(2023\)Voyager: an open\-ended embodied agent with large language models\.arXiv preprint arXiv:2305\.16291\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p3.6),[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[44\]L\. Wang, C\. Ma, X\. Feng, Z\. Zhang, H\. Yang, J\. Zhang, Z\. Chen, J\. Tang, X\. Chen, Y\. Lin,et al\.\(2024\)A survey on large language model based autonomous agents\.Frontiers of Computer Science18\(6\),pp\. 186345\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p1.1)\.
- \[45\]L\. Wang, W\. Xu, Y\. Lan, Z\. Hu, Y\. Lan, R\. K\. Lee, and E\. Lim\(2023\)Plan\-and\-solve prompting: improving zero\-shot chain\-of\-thought reasoning by large language models\.InProceedings of the 61st annual meeting of the association for computational linguistics \(volume 1: long papers\),pp\. 2609–2634\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.5.5.1),[§5](https://arxiv.org/html/2605.12755#S5.p2.2)\.
- \[46\]R\. Wang, P\. Jansen, M\. Côté, and P\. Ammanabrolu\(2022\)Scienceworld: is your agent smarter than a 5th grader?\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 11279–11298\.Cited by:[§A\.3](https://arxiv.org/html/2605.12755#A1.SS3.p1.1),[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.12755#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p3.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1),[§4](https://arxiv.org/html/2605.12755#S4.p1.1)\.
- \[47\]Z\. Wang, S\. Cai, G\. Chen, A\. Liu, X\. Ma, and Y\. Liang\(2023\)Describe, explain, plan and select: interactive planning with large language models enables open\-world multi\-task agents\.arXiv preprint arXiv:2302\.01560\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p3.1)\.
- \[48\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.12755#S1.p1.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1.1.4.3.1)\.
- \[49\]J\. D\. Williams and S\. Young\(2007\)Partially observable markov decision processes for spoken dialog systems\.Computer Speech & Language21\(2\),pp\. 393–422\.Cited by:[§5](https://arxiv.org/html/2605.12755#S5.p4.1)\.
- \[50\]R\. J\. Williams\(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine learning8\(3\),pp\. 229–256\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p3.6)\.
- \[51\]Z\. Xi, W\. Chen, X\. Guo, W\. He, Y\. Ding, B\. Hong, M\. Zhang, J\. Wang, S\. Jin, E\. Zhou,et al\.\(2025\)The rise and potential of large language model based agents: a survey\.Science China Information Sciences68\(2\),pp\. 121101\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p1.1),[§2](https://arxiv.org/html/2605.12755#S2.p1.1)\.
- \[52\]J\. Xie, K\. Zhang, J\. Chen, T\. Zhu, R\. Lou, Y\. Tian, Y\. Xiao, and Y\. Su\(2024\)Travelplanner: a benchmark for real\-world planning with language agents\.arXiv preprint arXiv:2402\.01622\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.12755#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2),[§4](https://arxiv.org/html/2605.12755#S4.p1.1)\.
- \[53\]Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning\(2018\)HotpotQA: a dataset for diverse, explainable multi\-hop question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2369–2380\.Cited by:[§A\.5](https://arxiv.org/html/2605.12755#A1.SS5.p1.1),[§B\.4](https://arxiv.org/html/2605.12755#A2.SS4.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p4.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig2.1.1.1.3),[§4](https://arxiv.org/html/2605.12755#S4.p1.1)\.
- \[54\]S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan\(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p2.2)\.
- \[55\]S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao\(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§2](https://arxiv.org/html/2605.12755#S2.p3.6),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.3.3.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1.1.3.2.1),[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[56\]O\. Yoran, S\. J\. Amouyal, C\. Malaviya, B\. Bogin, O\. Press, and J\. Berant\(2024\)Assistantbench: can web agents solve realistic and time\-consuming tasks?\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 8938–8968\.Cited by:[§A\.4](https://arxiv.org/html/2605.12755#A1.SS4.p1.1),[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px1.p1.1),[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[§1](https://arxiv.org/html/2605.12755#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.12755#S4.SS1.p2.1),[Table 3](https://arxiv.org/html/2605.12755#S4.T3),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.2.2.7.3.1),[§4](https://arxiv.org/html/2605.12755#S4.p1.1)\.
- \[57\]S\. Young, M\. Gašić, B\. Thomson, and J\. D\. Williams\(2013\)Pomdp\-based statistical spoken dialog systems: a review\.Proceedings of the IEEE101\(5\),pp\. 1160–1179\.Cited by:[§2](https://arxiv.org/html/2605.12755#S2.p3.6),[§5](https://arxiv.org/html/2605.12755#S5.p4.1)\.
- \[58\]S\. Yuan, K\. Song, J\. Chen, X\. Tan, D\. Li, and D\. Yang\(2025\)Evoagent: towards automatic multi\-agent generation via evolutionary algorithms\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 6192–6217\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[§B\.2](https://arxiv.org/html/2605.12755#A2.SS2.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.9.9.1),[Table 5](https://arxiv.org/html/2605.12755#S4.T5.fig1.1.6.5.1)\.
- \[59\]C\. Zhang, X\. D\. Goh, D\. Li, H\. Zhang, and Y\. Liu\(2025\)Planning with multi\-constraints via collaborative language agents\.InProceedings of the 31st International Conference on Computational Linguistics,pp\. 10054–10082\.Cited by:[§B\.1](https://arxiv.org/html/2605.12755#A2.SS1.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.12755#S4.T2.1.1.10.10.1)\.
- \[60\]A\. Zhao, D\. Huang, Q\. Xu, M\. Lin, Y\. Liu, and G\. Huang\(2024\)Expel: llm agents are experiential learners\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19632–19642\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p1.1)\.
- \[61\]B\. Zheng, B\. Gou, J\. Kil, H\. Sun, and Y\. Su\(2024\)Gpt\-4v \(ision\) is a generalist web agent, if grounded\.arXiv preprint arXiv:2401\.01614\.Cited by:[§B\.3](https://arxiv.org/html/2605.12755#A2.SS3.SSS0.Px3.p1.2),[Table 3](https://arxiv.org/html/2605.12755#S4.T3.2.2.2.1)\.
- \[62\]A\. Zhou, K\. Yan, M\. Shlapentokh\-Rothman, H\. Wang, and Y\. Wang\(2023\)Language agent tree search unifies reasoning acting and planning in language models\.arXiv preprint arXiv:2310\.04406\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p2.1),[§5](https://arxiv.org/html/2605.12755#S5.p2.2)\.
- \[63\]S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried,et al\.\(2023\)Webarena: a realistic web environment for building autonomous agents\.arXiv preprint arXiv:2307\.13854\.Cited by:[§1](https://arxiv.org/html/2605.12755#S1.p1.1)\.
## Appendix AImplementation details
### A\.1Common Settings
All five benchmarks share the same SDP core loop; only the adapter layer differs per benchmark\. This subsection describes the shared infrastructure\.
#### Core loop and adapter protocol\.
The SDP core executes a single algorithm regardless of benchmark: given a task, it calls the adapter’sProposeto obtain an initial predicate plan, then iterates through the plan by callingRealizeto produce an action for the current predicate andValidateto certify whether the action satisfies it\. If validation succeeds with cascade countk≥1k\\geq 1, the firstkkpredicates are removed from the plan and the state is updated\. If validation fails, the attempt is recorded andRealizeis retried up to the attempt budget\. If the budget is exhausted,Replangenerates a new plan tail from the stuck predicate onward\. The loop terminates when the goal predicate is certified or the global step cap is reached\.
#### Attempt history and failure propagation\.
Failed attempts are recorded in an append\-only history that is threaded through subsequentRealizeandReplancalls\. Each entry contains the target predicate, the attempted action, the validation outcome, and the rejection reason\. This history is surfaced in the LLM prompt so that the model avoids repeating previously failed choices\. Crucially, failed attempts do not modify the certified state: only successful validations produce state updates\. This separation preserves the MDP property that the state at any point reflects exactly the set of certified decisions\.
#### Validation modes\.
SDP does not prescribe how validation is implemented; the framework supports both deterministic and LLM\-based validators depending on the domain\. In practice, we observe a spectrum across the five benchmarks\. TravelPlanner uses fully deterministic validators \(Python constraint checks over typed option objects\), with no LLM involvement in validation at all\. HotpotQA and MuSiQue use LLM\-based verification \(checking whether a finding is supported by a cited paragraph\) combined with deterministic rejection filters \(negation phrases, garbage answers, type mismatches\)\. AssistantBench uses a three\-phase pipeline \(tool execution, LLM\-based finding extraction, LLM\-based verdict\)\. ScienceWorld delegates validation entirely to the LLM after three hard\-coded shortcuts \(task completion, invalid action, goal\-at\-head\)\. The choice of validation mode is an adapter\-level design decision driven by domain structure: when constraints are formally specifiable \(as in TravelPlanner\), deterministic validation eliminates noise; when success criteria are semantic \(as in ScienceWorld\), LLM judgment is necessary\.
#### LLM backbone and inference\.
Unless otherwise noted, all experiments use a single LLM backbone per experimental condition \(the same model forPropose,Realize,Validate, andReplan\)\. Temperature is set to 0 for all calls except where explicitly stated \(AssistantBench finalize retries use temperature 0\.3\)\. All LLM outputs are requested in JSON format and parsed with a tolerant parser that strips code fences, attempts structured extraction, and falls back to regex\-based field extraction\. No few\-shot examples are provided in any prompt; all prompts are zero\-shot with detailed natural\-language instructions\.
### A\.2TravelPlanner
#### State and predicate plan\.
The certified statests\_\{t\}contains task constants \(origin, destination, days, travelers, budget, local constraints\), a deterministically extracted trip topology \(visited\-city sequence, day\-to\-city mapping, inter\-city travel days\), and accumulated decisions \(chosen transports, accommodations, per\-day meals and attractions, running total cost, used restaurant/attraction names, collected cuisine types\)\. All updates are immutable and the running cost is maintained as an invariant computed identically to the official evaluator’s cost arithmetic\.
The state plan is a deterministic function of the task shape and city sequence\. Eight predicate kinds are defined: \(1\) outbound transport, \(2\) accommodation \(one per visited city\), \(3\) inter\-city transport \(one per consecutive city pair\), \(4\) day meals \(one per day, covering all three slots\), \(5\) day attractions \(one per day\), \(6\) return transport, \(7\) budget check \(aggregate: total cost≤\\leqbudget and cuisine coverage\), and \(8\) render \(assembly into official JSON format\)\. The total count is1\+n\+\(n−1\)\+2N\+31\+n\+\(n\{\-\}1\)\+2N\+3, yielding 11 predicates for 3d/1c, 17 for 5d/2c, and 23 for 7d/3c\. Predicates are ordered so that cost\-incurring global decisions precede per\-day content, which precedes aggregate checks, ensuring that the remaining budget is well\-defined at each filtering step\.
#### Constraint\-upfront option filtering\.
Before eachRealizecall, the adapter constructs the set of options that already satisfy every constraint the official evaluator will check\. Five filters are applied in sequence: \(1\) sandbox membership \(the option appears in the reference data\), \(2\) local\-constraint satisfaction \(transport mode, room type, house rules\), \(3\) transport\-mode consistency \(self\-driving must not co\-occur with flight or taxi across the trip\), \(4\) trip\-wide uniqueness \(restaurants and attractions used on earlier days are excluded\), and \(5\) affordability \(option cost plus running total≤\\leqbudget\)\. The filtered set is sorted cheapest\-first and capped at 30 options\. Because every presented option is guaranteed valid, any LLM selection passes validation, eliminating both malformed\-output and constraint\-violation failures\.
For meal predicates, the three slots are decided sequentially within a single predicate: after each slot, the state is updated so that the next slot’s options exclude the just\-chosen restaurant and reflect the updated budget\. When the option set is empty, a diagnostic identifies which filter stage eliminated all candidates and surfaces this to the replan mechanism\.
#### Validation, replan, and hyperparameters\.
All validators are deterministic and receive typed selections rather than raw LLM output; parsing is confined to theRealizestep\. The budget\-check predicate is the only cross\-predicate validator, confirming structural completeness, total cost, and cuisine coverage\. Replan supports retry only: rollback to earlier predicates is not available because state updates are additive and not reversible\. The attempt history is surfaced in the retry prompt so that the LLM avoids repeating failed choices\.
Table 7:SDP configuration for TravelPlanner\.
### A\.3ScienceWorld
ScienceWorld\(Wanget al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib2)\)is an interactive text\-based environment in which the agent performs multi\-step science experiments \(e\.g\., testing conductivity, boiling substances, growing plants\)\. Tasks range from 5 to 50\+ oracle steps, with partial\-credit scoring that is milestone\-based rather than monotonically increasing\. The environment accepts free\-form text actions and rejects unrecognized inputs with a fixed error string\.
ScienceWorld’s instantiation of SDP differs from TravelPlanner in every component\. Where TravelPlanner uses deterministic predicate plans and pre\-filtered option sets, ScienceWorld requires LLM\-generated plans, free\-form action generation, and LLM\-based validation\. This contrast demonstrates that the SDP framework accommodates both structured\-selection and open\-ended\-generation settings without changes to the core loop\.
#### State and predicate plan\.
The statests\_\{t\}caches the latest environment observation, the full room description \(which persists across actions regardless of the last command\), the agent’s inventory, the current score, a list of objects the agent has focused on, and a rolling action history\. Unlike TravelPlanner, state updates are not purely additive: the environment itself is mutable, and the state snapshot reflects whatever the simulator reports after each action\.
All predicates are free\-form natural\-language condition assertions generated by the LLM at the start of each episode\. A single unified predicate type replaces the eight\-kind taxonomy \(NAVIGATE, ACQUIRE, FOCUS, etc\.\) used in an earlier version of the adapter, which proved brittle: adding a new benchmark would have required defining a new kind set, new dispatch handlers, and new validation heuristics\. Under the current design, the LLM produces declarative sub\-goals such as “The agent’s inventory contains a metal pot” or “The water has been heated to a boil,” and validation is delegated entirely to a separate LLM call\. The only special\-cased predicate is the terminal goal \(“task score reaches 100”\), which triggers a hard shortcut when the environment reports completion\.
A grounding mechanism prevents the LLM from hallucinating environment\-specific names\. TheProposeprompt receives an explicit list of reachable rooms parsed from the simulator’s valid\-action set, and the LLM is instructed to reference only rooms from this list\. During replanning, a cumulative list of targets that the engine has previously rejected is appended to the prompt so that the LLM avoids repeating proven\-invalid references\.
#### Free\-form action generation and LLM\-based validation\.
Realizeissues a single LLM call that receives the current room description \(the simulator’s persistent state view, not the terse last\-action response\), inventory, the target predicate’s description, recent action history, and the set of rooms already visited\. The LLM returns a free\-form action string; if the simulator rejects it, the attempt is recorded as a failure and the target is retried\.
Validateexecutes the action in the environment, then applies three hard shortcuts before resorting to an LLM call: \(1\) if the environment reports task completion \(done flag or score≥100\\geq 100\), all remaining predicates are certified; \(2\) if the engine rejects the action, the attempt fails withk=0k=0and the rejected target is cached for future replan context; \(3\) if the goal predicate is at the head of the plan but the task is not complete,k=0k=0\. In all other cases, the LLM receives the target predicate description, the action, the resulting observation, the score delta, and a flag indicating whether the agent entered a previously unvisited room\. The LLM returns a cascade countkk\(how many consecutive head predicates are now satisfied\) and a reason string\. The cascade count is capped so that it cannot advance past natural\-language predicates into the goal predicate\.
The visited\-room flag addresses exploration sub\-goals: predicates of the form “The location of X is known to the agent” are credited withk≥1k\\geq 1when the agent enters a new room, even if the target object has not yet been directly observed\. Without this relaxation, exploration predicates became unreachable whenever the target was hidden inside a closed container in an unvisited room\.
#### Replan and hyperparameters\.
Replanning reuses theProposeprompt with two additions: a context block describing the stuck predicate, the number of failed attempts, and the most recent action–failure pairs; and the cumulative invalid\-targets list\. The LLM generates a fresh sub\-goal sequence from the current state\.
Table 8:SDP configuration for ScienceWorld\.The attempt budget is substantially higher than TravelPlanner’s \(30 vs\. 3\) because ScienceWorld actions are free\-generated and frequently rejected by the engine, requiring more retries per predicate\. The global step cap of 500 accommodates long tasks whose oracle trajectories already exceed 40 actions\.
### A\.4AssistantBench
AssistantBench\(Yoranet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib3)\)is an open\-ended web\-information QA benchmark: given a natural\-language question \(e\.g\., “List all gyms within half a mile of Tompkins Square Park”\), the agent must search the web, scrape pages, and synthesize a final answer\. Unlike ScienceWorld, there is no interactive environment—the agent orchestrates external tool calls \(web search and page scraping\) and submits a single answer\.
AssistantBench instantiates SDP as a*narrowing pipeline*: the LLM decomposes the question into a sequence of evidence\-gathering and reasoning steps, each of which is certified before the next begins\. This design prevents a common failure mode of single\-pass QA agents, in which partial evidence is silently dropped and the final answer omits qualifying candidates\.
#### State and predicate plan\.
The statests\_\{t\}accumulates raw evidence and validated conclusions across the episode\. It contains a search cache \(mapping query strings to search\-engine responses\), a scrape cache \(mapping URLs to page content\), a list of certified step findings \(each recording the step’s goal, the evidence it produced, and its validated conclusion\), a domain blacklist \(sites that consistently fail scraping\), and the current final answer\.
The predicate plan is generated by aDecomposecall that breaks the question into up to five*narrowing steps*, each labeled with one of four kinds:*collect*\(gather raw candidates\),*filter*\(narrow an existing set by a criterion\),*extract*\(pull a specific fact from known evidence\), or*compute*\(derive the answer by calculation or comparison\)\. After the step predicates, a*finalize*predicate synthesizes the answer from accumulated findings, and a*goal*predicate submits it\.
#### Tool\-augmented realize and three\-phase validation\.
Realizefor each step predicate consists of a planning LLM call that, given the question, prior certified findings, accumulated evidence, and previous failures on this step, produces a short list of tool actions \(web searches and page scrapes\)\. The adapter executes these actions via the Serper API, which provides both a search endpoint \(returning ranked results with snippets\) and a scraping endpoint \(returning full\-page markdown\)\. Search results automatically trigger scraping of the top\-ranked URL unless its domain appears on the blacklist\. Domains are blacklisted at runtime when scraping fails \(timeout, access denial, or content shorter than 100 characters\), preventing repeated attempts against consistently unreachable sites such as TripAdvisor or Yelp\.
When all tool actions return only cached results \(no fresh evidence\), a fallback mechanism invokes GPT\-4o with web\-search to gather additional data\. This fallback also fires after two consecutive validation failures on the same step, providing a qualitatively different evidence source\.
Validation proceeds in three phases\. First, the tool actions are executed and their results are appended to the state’s evidence caches\. Second, aFindingLLM call reads the step’s goal, the fresh evidence, and all prior certified findings, and produces a natural\-language conclusion for this step\. Third, aValidateLLM call judges whether the finding adequately addresses the step goal, returning a pass/fail verdict with a reason\. If validation fails, the step’s failure counter \(maintained on the original state to persist across attempts\) is incremented; after two consecutive rejections, the step is force\-passed if the finding contains any substantive content, on the principle that partial data scores higher than an infinite retry loop\.
#### Replan, finalize, and hyperparameters\.
Replanning generates new search queries via an LLM call that receives the stuck predicate, a diagnosis of what has been tried, and a strategy\-shift prompt\. It also proposes scraping of URLs from existing search results that have not yet been fetched, subject to domain\-blacklist filtering\. The original step predicates are re\-added to give them another attempt with the newly gathered evidence\.
The finalize predicate synthesizes the final answer from all accumulated evidence\. If the answer is weak \(empty, null, or a refusal string\), the adapter retries with an elevated temperature and an explicit instruction to return closest matches\. After three weak attempts, a force\-accept mechanism produces the best available answer rather than returning nothing\.
Table 9:SDP configuration for AssistantBench\.The conservative step cap reflects that each step involves multiple external API calls \(search plus auto\-scrape plus potential fallback\), making per\-step cost substantially higher than in the interactive benchmarks\.
### A\.5HotpotQA
HotpotQA\(Yanget al\.,[2018](https://arxiv.org/html/2605.12755#bib.bib4)\)is a multi\-hop question answering benchmark in which each question requires reasoning over two Wikipedia paragraphs\. We evaluate under two settings:*distractor*, where ten paragraphs \(two gold, eight distractors\) are provided with the question, and*open\-domain*, where the agent retrieves evidence from a BM25 index over the full Wikipedia corpus\.
#### State, predicate plan, and retrieval\.
The statests\_\{t\}contains the question, all available paragraphs \(provided in distractor mode or accumulated via retrieval in open\-domain mode\), a list of certified hop findings \(each recording the sub\-question, the supporting paragraph title, the extracted finding, and a bridge entity for chaining\), a search cache mapping query strings to retrieved paragraphs, and the current final answer\.
The predicate plan follows a fixed four\-predicate structure: two*hop*predicates \(one per sub\-question\), an*answer*predicate that synthesizes the final response from certified findings, and a*goal*predicate\. The two sub\-questions are generated by an LLM decomposition call at the start of each episode, and the bridge entity extracted from the first hop is available to the second hop’s realize call, enabling the characteristic two\-hop reasoning chain\.
In the open\-domain setting,Realizefor each hop first generates a search query via a dedicated LLM call that sees the original question, the current sub\-question, prior findings, and previously tried queries\. The query is executed against a BM25 index built over approximately 5\.2 million Wikipedia abstract articles, with title boosting: a two\-stage retrieval first matches query n\-grams against paragraph titles \(with prefix matching to handle Wikipedia disambiguation suffixes such as “\(film\)” or “\(album\)”\), then falls back to standard BM25 content scoring\. Retrieved paragraphs accumulate in the search cache across hops, so the second hop can draw on evidence gathered during the first\.
#### Two\-stage validation with paragraph fallback\.
Validatefor hop predicates applies a two\-stage verification process to address the most frequent failure mode: the LLM cites the correct finding but attributes it to a wrong or generic paragraph title\. In the primary stage, the adapter resolves the cited title against available paragraphs \(exact match, then substring match, then search\-cache lookup\) and asks an LLM verifier whether the finding is supported by the resolved paragraph’s text and correctly answers the sub\-question\. Findings that contain explicit negation phrases \(“does not provide,” “not mentioned,” “not specified”\) are rejected before the LLM call\.
If the primary stage fails—either because the cited title cannot be resolved or because the LLM verifier rejects the attribution—a fallback stage scans all accumulated paragraphs for one that supports the finding\. Candidate paragraphs are ranked by keyword overlap with the finding \(extracting content words of four or more characters, excluding stopwords, and counting matches in each paragraph’s text\)\. The top candidate is then submitted to the same LLM verifier\. If confirmed, the finding is accepted with the corrected title\. This fallback recovers cases where the LLM extracted the right fact from the right paragraph but hallucinated the paragraph’s title\.
Answer\-predicate validation rejects garbage responses \(“unknown,” “not available,” “cannot determine”\) and, when the answer predicate is immediately followed by the goal predicate, certifies both in a single cascade \(k=2k=2\)\. Replanning re\-decomposes the question with failure context while preserving any hops already certified, so only the remaining sub\-questions are regenerated\.
#### Hyperparameters\.
Table 10:SDP configuration for HotpotQA\.The low global step cap reflects the fixed plan length \(four predicates\): most episodes complete in 4–8 realize calls, with additional steps consumed only by retries on validation failure or replan\-triggered re\-decomposition\.
### A\.6MuSiQue
MuSiQue\(Trivediet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib5)\)extends multi\-hop QA to 2–4 hop compositional chains\. The SDP adapter shares the same architecture as HotpotQA \(Section[A\.5](https://arxiv.org/html/2605.12755#A1.SS5)\)—LLM\-based decomposition, hop\-by\-hop evidence extraction with bridge\-entity chaining, LLM\-based hop validation, and BM25 retrieval for the open\-domain setting—with three adaptations\.
First, the predicate plan is dynamic: the hop count is parsed from each task’s identifier, yielding plans of 4–6 predicates \(2–4 hops, plus answer and goal\)\. The decomposition prompt requests the corresponding number of sub\-questions, and replanning preserves already\-certified hops and regenerates only the remaining tail\.
Second, the open\-domain retriever operates over a corpus of approximately 84,000 unique paragraphs collected from all MuSiQue splits following the IRCoT protocol\(Trivediet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib23)\), rather than the 5\.2\-million\-article Wikipedia corpus used for HotpotQA\. The retriever uses the same BM25 architecture with title boosting and prefix matching\. To handle the longer reasoning chains, the adapter employs escalating search diversity: after four unsuccessful queries, it falls back to entity\-only queries extracted from the sub\-question; after six, it asks the LLM to guess the answer and searches for the guess directly\.
Third, answer validation is stricter to match MuSiQue’s short\-span answer format\. Beyond the garbage\-answer rejection shared with HotpotQA, the adapter rejects self\-referential answers \(entities already present in the question\), intermediate bridge entities that are not the final answer, answers exceeding 15 words, and question–answer type mismatches \(e\.g\., a non\-numeric answer to a “when” question\)\. A final LLM sanity check verifies that the proposed answer is consistent with the certified hop chain before acceptance\.
Table 11:SDP configuration for MuSiQue\.The global step cap is set slightly higher than HotpotQA’s \(25 vs\. 20\) to accommodate 3\- and 4\-hop tasks, which require proportionally more realize calls\.
## Appendix BBenchmarks
### B\.1TravelPlanner
#### Benchmark selection\.
TravelPlanner\(Xieet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib1)\)is a constraint\-satisfaction benchmark in which an agent must assemble a multi\-day travel itinerary that jointly satisfies hard constraints \(budget, transportation mode, accommodation type\) and commonsense constraints \(meal variety, attraction count\)\. We selected it because these constraints map directly onto SDP predicates\. Each constraint becomes a separate predicate that must be certified before the plan advances, making TravelPlanner a natural testbed for whether per\-predicate certification prevents the constraint violations that dominate plan\-as\-text baselines\. The benchmark spans multiple task scales \(3\-day/1\-city through 7\-day/3\-city\), allowing us to observe how SDP behaves as plan complexity grows\.
#### Evaluation protocol\.
We evaluate on all 180 validation tasks using the tool\-use setting, in which the agent queries search APIs \(flights, hotels, restaurants, attractions, ground transport\) to discover options before assembling the itinerary\. Evaluation follows the official TravelPlanner scoring pipeline, reporting Delivery Rate, Commonsense constraint satisfaction \(Micro/Macro\), Hard Constraint satisfaction \(Micro/Macro\), and Final Pass Rate\.
#### Baseline selection\.
We include nine baselines spanning four categories\. ReAct\(Yaoet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib6)\), Reflexion\(Shinnet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib8)\), Plan\-and\-Solve\(Wanget al\.,[2023b](https://arxiv.org/html/2605.12755#bib.bib7)\), GoalAct\(Chenet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib14)\), and TDP\(Liet al\.,[2026](https://arxiv.org/html/2605.12755#bib.bib12)\)are reproduced by us using Gemini\-3\.1\-flash\-lite on all 180 tasks with the same evaluation pipeline\. MIRROR\(Guoet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib10)\)\(GPT\-4o\), EvoAgent\(Yuanet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib16)\), PMC\(Zhanget al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib17)\), and ATLAS\(Choiet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib9)\)\(all Gemini\-2\.5\-Pro\) report numbers from their original papers on the same 180\-task validation split and official scoring\. SDP is evaluated with both Gemini\-3\.1\-flash\-lite and GPT\-4o to enable fair comparison at both scales\. The Gemini\-3\.1\-flash\-lite runs allow direct comparison against the five reproduced baselines under identical conditions, while the GPT\-4o runs allow comparison against MIRROR at the same backbone scale\.
### B\.2ScienceWorld
#### Benchmark selection\.
ScienceWorld\(Wanget al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib2)\)is an interactive text\-based simulator in which an agent must perform scientific experiments by navigating rooms, manipulating objects, and reasoning about physical processes\. We selected it because its tasks are inherently long\-horizon \(oracle action sequences reach 36 steps\) and require strict causal ordering, making it a natural testbed for whether SDP’s per\-predicate certification and replan mechanism improve performance as the horizon grows\. The benchmark groups its 30 tasks into Short, Medium, and Long categories, enabling direct measurement of how each method scales with task length\.
#### Evaluation protocol\.
We follow the standard 30\-task evaluation protocol established by SwiftSage\(Linet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib66)\)and adopted by subsequent training\-free methods\. All methods use GPT\-4 as the backbone\. Performance is reported as average score across Short, Medium, Long, and Overall\.
#### Baseline selection\.
We include six training\-free baselines\. SayCan\(Ahnet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib45)\), ReAct\(Yaoet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib6)\), CoT\(Weiet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib42)\), Reflexion\(Shinnet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib8)\), EVOAGENT\(Yuanet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib16)\), and Plan\-and\-Act\(Erdoganet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib11)\)all report results using GPT\-4 on the same 30\-task protocol\. Numbers are taken from their original papers\. CoT, Plan\-and\-Act report only Overall scores\. We do not include TDP\(Liet al\.,[2026](https://arxiv.org/html/2605.12755#bib.bib12)\)or RPMS in the table because their evaluation protocols differ from the standard 30\-task setup, making direct comparison unreliable\.
### B\.3AssistantBench
#### Benchmark selection\.
AssistantBench\(Yoranet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib3)\)is an open\-ended web reasoning benchmark consisting of 214 tasks that require retrieving, filtering, and integrating information from multiple web sources\. We selected it because a large class of its tasks involves filtering entities through several independent constraints, a structure that maps naturally onto SDP predicates\. Each constraint becomes a separate verification step, andValidateblocks ungrounded findings before they propagate to subsequent steps\. The benchmark also stratifies tasks by difficulty \(Easy/Medium/Hard\), enabling fine\-grained analysis of where SDP’s advantage concentrates and where tool limitations become binding\.
#### Evaluation protocol\.
We evaluate on the full test split \(181 tasks\) via submission to the official HuggingFace leaderboard\. The agent uses Serper search API and URL\-level scraping as its tool interface, without browser rendering, JavaScript execution, or form interaction\. Scoring follows the official AssistantBench metrics, reporting Accuracy, Precision, Exact Match \(EM\), and Accuracy stratified by difficulty\.
#### Baseline selection\.
We include seven baselines from the official AssistantBench leaderboard\. Infogent\(Reddyet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib20)\), Magentic\-One\(Fourneyet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib19)\), and ACP\-Domain Agents\(Bhardwajet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib18)\)use GPT\-4o\. RALM\-1S→\\toCB\(Trivediet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib23)\), CB\-1S\(Presset al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib22)\), SeeAct→\\toCB\(Zhenget al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib21)\), and SPA\-CB\(Yoranet al\.,[2024](https://arxiv.org/html/2605.12755#bib.bib3)\)use GPT\-4\-Turbo\. All numbers are taken from the official leaderboard under the same test split and scoring pipeline\. SDP uses GPT\-4o, enabling direct comparison against the three GPT\-4o baselines\. The tool interface differs across methods\. Magentic\-One employs a dedicated browser\-controlling sub\-agent, while SDP relies solely on search API and scraping\. This difference is most visible on Hard\-tier tasks, where multi\-page browsing sessions are required\.
### B\.4HotpotQA and MuSiQue
#### Benchmark selection\.
HotpotQA\(Yanget al\.,[2018](https://arxiv.org/html/2605.12755#bib.bib4)\)and MuSiQue\(Trivediet al\.,[2022](https://arxiv.org/html/2605.12755#bib.bib5)\)are multi\-hop question answering benchmarks that require chaining evidence across multiple Wikipedia paragraphs\. We selected them because multi\-hop reasoning creates a natural chain of predicates, one per hop, where each hop’s finding feeds into the next\. This structure makesValidate’s per\-hop verification directly consequential\. HotpotQA requires two hops, while MuSiQue extends to two, three, and four hops, enabling measurement of how SDP scales with reasoning depth\. We use the answerable subset of MuSiQue as defined byTrivediet al\.\([2022](https://arxiv.org/html/2605.12755#bib.bib5)\)\.
#### Evaluation protocol\.
Both benchmarks are evaluated in the fullwiki \(open\-domain\) setting\. The agent retrieves evidence from a 5M\-paragraph Wikipedia corpus using BM25 retrieval, matching the setup of IRCoT\(Trivediet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib23)\)and PRISM\(Nahid and Rafiei,[2025](https://arxiv.org/html/2605.12755#bib.bib62)\)\. We evaluate on 1,000 validation examples for each benchmark\. Performance is reported as Exact Match \(EM\) and token\-level F1\. SDP uses GPT\-4o as the backbone\.
#### Baseline selection\.
We include six baselines from three families\. RankZephyr, RankZephyr \+ CoT\(Pradeepet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib63)\), and RankGPT\(Sunet al\.,[2023b](https://arxiv.org/html/2605.12755#bib.bib65)\)represent retrieve\-then\-read approaches with reranking\. SETR\-CoT & IRI\(Leeet al\.,[2025](https://arxiv.org/html/2605.12755#bib.bib64)\)extends this with iterative retrieval\. IRCoT\(Trivediet al\.,[2023](https://arxiv.org/html/2605.12755#bib.bib23)\)interleaves retrieval with chain\-of\-thought reasoning\. PRISM\(Nahid and Rafiei,[2025](https://arxiv.org/html/2605.12755#bib.bib62)\)uses agentic retrieval with multi\-step planning\. All numbers are taken from their original papers\. IRCoT uses GPT\-3 while all other baselines and SDP use GPT\-4o\. SDP uses BM25 retrieval alone, whereas PRISM employs hybrid retrieval \(BM25 \+ learned reranker\), making the comparison conservative for SDP\.
## Appendix CAblation Details
The ablation estimates in Table[6](https://arxiv.org/html/2605.12755#S4.T6)are computed by replaying recorded trajectories under counterfactual rules rather than re\-executing the environment\. Each counterfactual modifies one mechanism and rescales the observed score using a conversion factor derived from the trajectory itself\.
#### Action fidelity \(for−\-Validate\)\.
WhenValidateis replaced by always\-pass, every proposed predicate is accepted regardless of whether the observation actually supports it\. We define*action fidelity*as the fraction ofRealizeattempts in the original trajectory whose action would have produced a correct outcome without validation\. Formally, for a trajectory withmmcertified steps, action fidelity isf=mcorrect\-on\-first/mf=m\_\{\\text\{correct\-on\-first\}\}/m, wheremcorrect\-on\-firstm\_\{\\text\{correct\-on\-first\}\}counts the steps whose firstRealizeattempt was subsequently certified byValidate\. The ablated score is the original score multiplied byff, reflecting the expected degradation when incorrect findings propagate unchecked\.
#### Certified\-prefix ratio \(for−\-Replan\)\.
WhenReplanis removed, the agent terminates as soon as the per\-target attempt budgetbbis exhausted at any predicate\. We define the*certified\-prefix ratio*asr=tstall/nr=t\_\{\\text\{stall\}\}/n, wheretstallt\_\{\\text\{stall\}\}is the cursor position at which the first budget exhaustion occurs andnnis the plan length\. For runs that complete without triggeringReplan,r=1r=1and the score is unchanged\. For runs that required replanning, the ablated score is the original score multiplied byrr, reflecting the fraction of the plan that would have been certified before termination\.
#### Cascade \(−\-Cascade\)\.
When cascade is disabled by cappingValidateatk≤1k\\leq 1, each action certifies at most one predicate\. The additional steps that would have been required are counted from the original trajectory by summing\(ki−1\)\(k\_\{i\}\-1\)over all steps whereki≥2k\_\{i\}\\geq 2\. If the total step count under this rule exceeds the agent’s budget, the run is marked as incomplete and scored proportionally\. On benchmarks where cascade rarely occurs \(e\.g\., TravelPlanner withk=1k=1on all steps\), the ablated score equals the original\.
#### Limitations of replay estimation\.
Replay estimates assume that removing a mechanism does not alter the distribution of subsequent actions or observations\. In practice, an unchecked error at stepttmay cause downstream actions to diverge from the recorded trajectory\. The estimates therefore represent an optimistic bound on the true ablation effect, since cascading errors would likely degrade performance further\. The direction and relative ordering of the estimates across benchmarks are nonetheless informative for identifying which mechanism contributes most in each environment\.Similar Articles
The Context Gathering Decision Process: A POMDP Framework for Agentic Search
This paper introduces the Context Gathering Decision Process (CGDP), a POMDP framework to model LLM agent search behavior, proposing interventions that improve multi-hop reasoning and reduce token usage without performance degradation.
I'd like to share an updated methodology for building agents.[P]
Introduces Spice, an open-source decision layer that acts as a 'brain' above execution agents like Claude Code and Codex, enabling context-aware task delegation and structured decision-making.
SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch
SDOF is a framework that treats multi-agent execution as a constrained state machine, using an online-RLHF specialized intent router and state-aware dispatcher to enforce business process stage constraints, achieving 86.5% task completion on a recruitment system with 6,000+ enterprises.
Self-Programmed Execution for Language-Model Agents
This paper introduces Self-Programmed Execution (SPE), an agent architecture where the language model generates its own orchestration program rather than relying on a fixed external harness. It presents 'Spell', a Lisp-based language enabling this self-editing and re-evaluation, demonstrating that frontier models can successfully perform agentic tasks using this method.
What are MDPs? And how can we Solve them?
This article explains the fundamentals of Markov Decision Processes (MDPs), a core framework in deep reinforcement learning, using an educational example of a student's daily decisions.