# A POMDP Framework for Agentic Search
Source: [https://arxiv.org/html/2605.07042](https://arxiv.org/html/2605.07042)
## The Context Gathering Decision Process: A POMDP Framework for Agentic Search
Chinmaya Kausik (University of Michigan, ckausik@umich.edu) · Adith Swaminathan (Netflix, aswaminathan@netflix.com) · Nathan Kallus (Netflix, nkallus@netflix.com)
###### Abstract
Large Language Model (LLM) agents are deployed in complex environments – such as massive codebases, enterprise databases, and conversational histories – where the relevant state far exceeds their context windows. To navigate these spaces, an agent must iteratively explore the environment to find relevant information. However, without explicit infrastructure, an agent's working memory can degrade into lossy representations of the search state, resulting in redundant work (e.g. repetitive looping) and premature stopping. In this work, we formalize this challenge as the Context Gathering Decision Process (CGDP), a specialized Partially Observable Markov Decision Process, where an agent's objective is to adaptively refine its belief state to isolate the information necessary for a task. We model an LLM's behavior as approximate Thompson Sampling within this CGDP, and introduce a predicate-based method that decomposes an LLM's implicit search into explicit and modular operations. We then derive two plug-and-play interventions for iterative LLM agents: a persistent, predicate-based belief state that bounds context while preserving multi-hop reasoning, and a programmatic exhaustion gate that halts unproductive search without premature stopping. Across four methods and three question-answering domains, we empirically validate that replacing an LLM's implicit state with our CGDP-motivated belief state improves multi-hop reasoning by up to 11.4%, while the modular programmatic exhaustion detection saves up to 39% of tokens without any degradation in agent performance. Ultimately, we argue that framing the LLM agent loop as a CGDP can guide the design of modular, non-interfering improvements to agentic search harnesses.
## 1 Introduction
Large Language Model (LLM) agents are increasingly deployed in complex, real-world environments where the relevant state far exceeds their reliable working context. Deep-research agents search the web [[44](https://arxiv.org/html/2605.07042#bib.bib1)]; coding assistants search repositories and run shell tools [[41](https://arxiv.org/html/2605.07042#bib.bib2)]; support agents retrieve from enterprise knowledge bases [[16](https://arxiv.org/html/2605.07042#bib.bib3)]. All of these systems face the same fundamental challenge: the agent cannot load the entire environment into its prompt. Instead, it must iteratively interact with an observation function – such as a Python REPL, a search engine API, or a vector database – to gather the necessary information.
Despite active research into extending LLM context windows [[17](https://arxiv.org/html/2605.07042#bib.bib4), [38](https://arxiv.org/html/2605.07042#bib.bib5)], long-context models still exhibit position sensitivity [[21](https://arxiv.org/html/2605.07042#bib.bib6)], degrade with hard negatives [[11](https://arxiv.org/html/2605.07042#bib.bib7)], are unreliable in multi-turn interaction [[14](https://arxiv.org/html/2605.07042#bib.bib8)], and stop prematurely during long-horizon search [[43](https://arxiv.org/html/2605.07042#bib.bib9)]. Building agentic harnesses – infrastructure that allows the LLM to actively manage and search external context – has therefore been studied extensively [[33](https://arxiv.org/html/2605.07042#bib.bib10), [37](https://arxiv.org/html/2605.07042#bib.bib11), [42](https://arxiv.org/html/2605.07042#bib.bib12), [26](https://arxiv.org/html/2605.07042#bib.bib18), [29](https://arxiv.org/html/2605.07042#bib.bib30)].
Agentic harnesses typically append raw observations directly to the LLM prompt [[33](https://arxiv.org/html/2605.07042#bib.bib10), [37](https://arxiv.org/html/2605.07042#bib.bib11), [42](https://arxiv.org/html/2605.07042#bib.bib12)]. Over long horizons, this implicit state tracking leads to severe failures: agents lose track of their original objective, fall into repetitive query loops, hallucinate parametric knowledge instead of citing corpus evidence, and fail to reliably recognize when a search is completely exhausted. For instance, a coding agent resolving a software bug may repeatedly retrieve the same file, oscillating between identical hypotheses without realizing that its search has stagnated.
To address these failures, we formalize the interactive information-seeking agent loop as the Context Gathering Decision Process (CGDP). The CGDP is a specialized Partially Observable Markov Decision Process (POMDP) [[13](https://arxiv.org/html/2605.07042#bib.bib17)] where the hidden state is the vast external corpus, actions are tool calls, observations are retrieved information, and the objective is to adaptively refine the agent's belief state to identify task-relevant information from the hidden state. As shown in Table [1](https://arxiv.org/html/2605.07042#S1.T1), this mathematical framework abstracts the details of many modern agents.
Table 1: Across diverse agentic applications, the underlying challenge remains to navigate a massive hidden state via a constrained observation function to satisfy a user query, which we model as a Context Gathering Decision Process (CGDP).

Through the lens of the CGDP, we model an LLM's behavior as approximate Thompson Sampling [[35](https://arxiv.org/html/2605.07042#bib.bib16), [30](https://arxiv.org/html/2605.07042#bib.bib34), [25](https://arxiv.org/html/2605.07042#bib.bib15)], where the model implicitly samples a hypothesis and takes information-gathering actions conditioned on it. To make this process explicit, we introduce Predicate-Based Adaptive Identification (PBAI), an abstract algorithm that decomposes agentic search into modular operations: assess stopping, select action, observe, and update belief. By mapping state-of-the-art agent harnesses to PBAI, we can pinpoint precisely where their implicit reasoning lacks the necessary infrastructure for reliable, long-horizon search.
Based on our framework, we derive two modular interventions for agentic search harnesses:
1. The Predicate-Based Belief State: A persistent, explicitly managed data structure that forces the agent to extract findings and track open questions iteratively, bounding the context footprint while preserving multi-hop reasoning performance.
2. The Programmatic Exhaustion Gate: A stopping mechanism that halts unproductive search. Rather than relying on an LLM's self-assessment (which may be poorly calibrated [[4](https://arxiv.org/html/2605.07042#bib.bib19), [9](https://arxiv.org/html/2605.07042#bib.bib29)]), the mechanism uses programmatic signals such as action similarity and observation novelty to prevent redundant looping and premature stopping.

We empirically validate these interventions across four agent search harnesses and three domains (multi-session conversational question answering, multi-hop Wikipedia QA, and code-repository QA). We find that replacing an LLM's implicit search state with the predicate-based belief state never degrades agent performance and improves multi-hop reasoning by up to 11.4%. Furthermore, the programmatic exhaustion gate safely reduces token consumption by up to 39% on stateful harnesses without sacrificing task accuracy. Ultimately, we demonstrate that formalizing LLM agentic search as a CGDP provides a blueprint for designing reliable, modular harness improvements.
## 2 Related Work
Our contributions sit at the intersection of retrieval-augmented generation, long-horizon agentic search, and LLM metacognition. While prior work has empirically improved specific components of the agent loop, we provide a unifying framework for understanding how these components interact.
#### Iterative RAG and Agent Harnesses.
Multi-round methods generally accumulate state implicitly through a growing context window. IRCoT [[37](https://arxiv.org/html/2605.07042#bib.bib11)] interleaves retrieval with chain-of-thought reasoning, Iter-RetGen [[33](https://arxiv.org/html/2605.07042#bib.bib10)] uses previous generations to guide subsequent retrieval queries, and ReAct [[42](https://arxiv.org/html/2605.07042#bib.bib12)] tracks history through structured Thought-Action-Observation trajectories. Although these methods perform agentic search, their implicit state tracking frequently degrades over long episodes. More recent approaches introduce explicit per-retrieval interventions: Self-RAG [[2](https://arxiv.org/html/2605.07042#bib.bib20)] adds retrieval-time self-reflection, Corrective RAG [[40](https://arxiv.org/html/2605.07042#bib.bib21)] corrects retrieval queries based on relevance assessment, and FAIR-RAG [[1](https://arxiv.org/html/2605.07042#bib.bib22)] introduces gap analysis via an evidence checklist. Similarly, MemGPT [[26](https://arxiv.org/html/2605.07042#bib.bib18)] provides LLMs with explicit memory management tools. While approaches like StateAct [[29](https://arxiv.org/html/2605.07042#bib.bib30)] or FAIR-RAG [[1](https://arxiv.org/html/2605.07042#bib.bib22)] rely on LLM-managed summaries or checklists, our framework demonstrates that *orchestrator-enforced*, strictly curated belief states outperform an LLM's self-managed tool use or memory edits.
#### Long-Horizon Search and Corpus Organization.
To address unbounded context, offline memory-organization methods like A-MEM [[39](https://arxiv.org/html/2605.07042#bib.bib23)], HippoRAG [[8](https://arxiv.org/html/2605.07042#bib.bib24)], HopRAG [[20](https://arxiv.org/html/2605.07042#bib.bib26)], and GraphRAG [[5](https://arxiv.org/html/2605.07042#bib.bib25)] structure the underlying corpus into graphs to facilitate easier retrieval, whereas we organize the agent's understanding of what it has found from the corpus online. For long-horizon online search, SLIM [[43](https://arxiv.org/html/2605.07042#bib.bib9)] separates search from browsing and periodically summarizes trajectories to manage context, while AggAgent [[15](https://arxiv.org/html/2605.07042#bib.bib13)] generates parallel retrieval trajectories and synthesizes them on demand. These systems support our core premise: reliability improves when the external context is actively managed rather than passively appended.
#### Stopping Criteria and Metacognition.
A critical challenge in iterative search is knowing when to stop. To decide whether to initiate retrieval, models such as FLARE [[10](https://arxiv.org/html/2605.07042#bib.bib27)] use token-level confidence scores and DRAGIN [[34](https://arxiv.org/html/2605.07042#bib.bib28)] uses attention-based scores. In contrast, our work addresses post-retrieval stagnation. Relying on LLMs to self-assess stopping criteria has proven brittle; recent evidence shows that LLMs cannot reliably self-correct reasoning without external feedback [[9](https://arxiv.org/html/2605.07042#bib.bib29)] and that their verbalized confidence is poorly calibrated [[4](https://arxiv.org/html/2605.07042#bib.bib19)]. Motivated by this, our programmatic exhaustion gate replaces LLM self-assessment with heuristic stagnation signals and improves token efficiency without degrading search accuracy.
To understand why iterative LLM agents often fail at long-horizon search, we seek to separate the formulation of information-seeking problems from the specific prompt engineering used to solve them. In this section, we formalize agentic search as a specific decision-making process, define the notion of success, and diagnose why LLMs become sub-optimal agents for this process.
### 3.1 The Context Gathering Decision Process
A Context Gathering Decision Process (CGDP) can be viewed as a POMDP [[13](https://arxiv.org/html/2605.07042#bib.bib17)] with terminal rewards and per-action costs, closely related to sequential identification [[31](https://arxiv.org/html/2605.07042#bib.bib31), [7](https://arxiv.org/html/2605.07042#bib.bib33)]. It is defined by a task $q$ (e.g. a user query), a massive hidden world state $c \in \mathcal{C}$ (e.g. a codebase), an action space $\mathcal{A}$ (e.g. LLM-callable tools), an observation function $F : \mathcal{A} \times \mathcal{C} \to \mathcal{O}$ that maps an agent's actions to observable text chunks, and a per-action cost $\lambda$. At each timestep $t$, the agent selects an action $a_t \in \mathcal{A}$, which can either be an environment query (e.g. a BASH command) that incurs cost $\mathrm{Cost}(a_t)$ and creates observation $o_t = F(a_t, c)$, or a termination action that returns a final answer $a_{\text{final}}$. The environment evaluates the agent's final answer via a binary success function $\mathrm{Success}(q, c, a_{\text{final}}) \in \{0, 1\}$. Let $a^*(q, c)$ denote an optimal answer, i.e., $\mathrm{Success}(q, c, a^*(q, c)) = 1$.
Prompted with the query $q$ alone, an LLM lacks the context to produce $a^*(q, c)$ and will abstain or hallucinate. Unable to process the entire hidden state $c$ at once, it must iteratively interact with $F$ to gather a sufficient subset of information.
The optimal CGDP agent maximizes expected success while minimizing exploration cost (e.g. LLM token budget and/or latency) across a task distribution $\mathcal{D}$:
$$\operatorname*{argmax}_{\mathrm{Policy}} \; \mathbb{E}_{(q,c)\sim\mathcal{D}}\left[\mathrm{Success}(q, c, \mathrm{Policy}(q, c)) - \lambda \sum_{t=1}^{T} \mathrm{Cost}(a_t)\right]. \qquad (1)$$
Crucially, the true state of the environment includes $c$, but the observation $o_t$ at each step is only the fragment of $c$ that the agent explicitly chose to observe via $F(a_t, c)$. Therefore, a successful CGDP agent must maintain an internal belief state $b_t$ [[13](https://arxiv.org/html/2605.07042#bib.bib17)] that synthesizes its historical observations, tracks its progress towards $a^*(q, c)$, and guides the selection of the next action $a_t$.
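To make the formalism concrete, the interaction protocol above can be sketched as a small Python interface. This is an illustrative sketch, not code from the paper: the names (`CGDP`, `run_episode`, the `FINAL:` termination convention) are our own, and the per-action cost is folded in with the weight $\lambda$ as in the objective above.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CGDP:
    """Hypothetical container for the CGDP tuple (q, c, F, Cost, lambda)."""
    query: str                              # task q
    hidden_state: dict                      # massive hidden world state c
    observe: Callable[[str, dict], str]     # observation function F(a, c) -> o
    cost: Callable[[str], float]            # per-action Cost(a)
    lam: float = 0.01                       # cost weight lambda

def run_episode(env: CGDP, policy, max_steps: int = 10) -> Tuple[str, float]:
    """Roll out a policy until it emits a termination action or exhausts steps."""
    history: List[Tuple[str, str]] = []
    total_cost = 0.0
    for _ in range(max_steps):
        action = policy(env.query, history)
        if action.startswith("FINAL:"):     # termination action returning a_final
            return action[len("FINAL:"):].strip(), total_cost
        total_cost += env.lam * env.cost(action)
        history.append((action, env.observe(action, env.hidden_state)))
    return "", total_cost
```

A policy that queries the environment once and then terminates would return its answer together with a single weighted action cost.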
### 3.2 LLMs' Behavior in CGDPs
Agentic harnesses (e.g. ReAct [[42](https://arxiv.org/html/2605.07042#bib.bib12)], IRCoT [[37](https://arxiv.org/html/2605.07042#bib.bib11)]) solve a CGDP by deploying an LLM as a policy, maintaining the belief state implicitly by concatenating the observation history. At step $t$, the state is
$$s_{t+1} = \mathrm{Truncate}(s_t \oplus \{a_t, o_t\}), \qquad (2)$$
where the truncation programmatically drops older steps to obey the limits of LLM context windows. The policy is implemented simply through autoregressive generation, $a_{t+1} = \mathrm{LLM}(s_{t+1})$. Viewing LLM agents as CGDP policies reveals that they can suffer from failure modes because they lack fundamental mechanisms known to be beneficial for navigating POMDPs:
1. Lossy Representation (Goal Forgetting): A vanilla LLM agent relies on the growing history $s_t$ as its belief state. As $t$ increases, the LLM must implicitly infer "what is the goal?" and "what has been figured out so far?" at every step. Empirical studies show that LLMs can ignore evidence in the middle of long contexts [[21](https://arxiv.org/html/2605.07042#bib.bib6)], which can cause goal drift.
2. Premature Stopping: Optimal agents in POMDPs stop exploring when the expected information gain of the next action is outweighed by its cost [[31](https://arxiv.org/html/2605.07042#bib.bib31), [7](https://arxiv.org/html/2605.07042#bib.bib33)]. However, we conjecture that LLMs are trained on instruction-following datasets where the most rewarded behavior is to immediately produce an answer once a plausible one is found. Therefore, when the corpus $c$ contains irrelevant distractors or adversarial honeypots, the LLM agent can trigger premature termination [[43](https://arxiv.org/html/2605.07042#bib.bib9)].
3. Insufficient Exploration: Many sequential decision-making algorithms explore via optimism [[3](https://arxiv.org/html/2605.07042#bib.bib32)], but LLMs have no such mechanism. When an LLM samples an uninformative observation $o_t$, it can repeatedly generate similar actions $a_{t+1} \approx a_t$ (mode collapse), without any structural awareness that its search has stagnated.
To overcome these LLM-specific failure modes, we next describe an abstract algorithm for CGDPs that uses explicit state tracking, and study how state-of-the-art LLM harnesses map to it.
## 4 Abstract Algorithm
Figure 1: The PBAI loop with our two harness interventions. The Belief State (blue) is updated by the Extractor and injected into the agent prompt. The Gate (orange) evaluates programmatic signals to detect stagnation.
Algorithm 1: The PBAI Loop. The agent selects actions based on unresolved predicates and updates its belief state until the query is satisfied or the budget is exhausted.

```
Input: query q, budget B
 1: Initialize belief state b_0 from q
 2: while cost < B do
 3:   1. Stop?: if facts in b_t satisfy q, break the while loop
 4:   2. Act: generate action a_t targeting the top open predicate in b_t
 5:   3. Observe: execute a_t, get observation o_t
 6:   4. Update Belief (b_{t+1}):
 7:      extract new facts from o_t
 8:      mark satisfied predicates True
 9:      append new sub-questions
10: end while
11: return best answer a_final using b_t
```
To build a reliable agent for the CGDP, we define the operations that should be performed. In this section, we introduce Predicate-Based Adaptive Identification (PBAI), an explicit algorithm template for a CGDP agent loop. By mapping state-of-the-art agent harnesses to the PBAI operations, we can diagnose where their implicit approximations fall short of the desired behavior.
### 4.1 Predicate-Based Adaptive Identification
In a CGDP, the full hidden state $c$ is too large for an LLM to hold in its context. To successfully navigate the CGDP, the agent's belief state $b_t$ must compress its understanding of the hidden state into a finite discrete object that fits compactly in context.
We define a *predicate* as a logical proposition relevant to the task $q$ that can be evaluated as *True* or *False* by querying the environment (e.g. "The target function returns a list" in a codebase $c$). A useful belief state is one that tracks which predicates have been resolved by observed evidence and which remain unresolved. To maintain and act upon this state, the agent iterates through a four-step loop that we call Predicate-Based Adaptive Identification (PBAI):
1. Stop? The agent evaluates its belief state $b_t$. If the necessary predicates relevant to $q$ are resolved unambiguously, the agent outputs a final answer and terminates.
2. Select Action: If unresolved predicates remain, the agent formulates a new hypothesis about the hidden state. It selects an action $a_t$ to resolve the highest-priority unresolved predicate.
3. Observe: The agent executes $a_t$ against the environment $F(\cdot, c)$ and receives observation $o_t$.
4. Update Belief: The agent updates its belief state $b_t$ given $o_t$. Belief updates can revise beliefs about the final answer, eliminate hypotheses inconsistent with $o_t$, update the agent's understanding of the observation function $F(\cdot, c)$, or introduce new predicates to be resolved.
Algorithm [1](https://arxiv.org/html/2605.07042#alg1) provides pseudocode for the PBAI loop. While POMDP solvers maintain probability distributions over states, LLMs instead autoregressively generate a textual action string based on their prompt. PBAI abstracts this textual generation in the Select Action step as a form of approximate Thompson Sampling [[30](https://arxiv.org/html/2605.07042#bib.bib34)]. While our experiments use greedy decoding (temperature 0) for reproducibility, conceptually the generation of a specific search query corresponds to the agent traversing its internal hypothesis space, seeking evidence to test its predicates, and updating its internal state.
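The four PBAI operations can be condensed into a short orchestrator loop. The sketch below is illustrative rather than the paper's implementation: `select_action`, `extract`, and `answer` are hypothetical callables standing in for LLM and tool calls, and the belief state is modeled as a plain dict of facts and open predicates.

```python
def pbai_loop(query, observe, select_action, extract, answer, budget=10):
    """Minimal sketch of the PBAI loop; callables stand in for LLM/tool calls."""
    belief = {"facts": [], "open": [query]}    # initialize b_0 from q
    cost = 0
    while cost < budget:
        if not belief["open"]:                 # 1. Stop?: all predicates resolved
            break
        action = select_action(query, belief)  # 2. Act: target top open predicate
        obs = observe(action)                  # 3. Observe: execute a_t, get o_t
        belief = extract(belief, obs)          # 4. Update belief: facts, sub-questions
        cost += 1
    return answer(query, belief)               # best answer a_final using b_t
```

In this shape, each of the four operations can be swapped independently, which is the modularity the framework argues for.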
### 4.2 Mapping Agent Harnesses via PBAI
Existing Retrieval-Augmented Generation (RAG) and agentic memory methods perform all four PBAI operations to some extent, but they do so implicitly. Viewing these harnesses through PBAI lets us identify where they fall short (Table [2](https://arxiv.org/html/2605.07042#S4.T2)).
Table 2: Four agent harnesses mapped to the PBAI algorithm. $\dagger$ denotes operations where infrastructure gaps can arise. Our experiments in Section [6](https://arxiv.org/html/2605.07042#S6) confirm these gaps.
#### IRCoT (implicit belief state)
The concatenation of all past CoT sentences and retrieved passages serves as IRCoT's belief state. While the CoT sentences act as an implicit Update Belief step, the state is not curated or explicitly maintained. As the trajectory grows, earlier CoT sentences fall out of the LLM context, causing lossy representation and goal drift over long horizons.
#### ReAct (inefficient exploration)
The explicit *Finish[answer]* action lets an LLM agent execute the Stop? step. However, ReAct agents frequently suffer from goal displacement, because they lack a dedicated mechanism to anchor the unresolved predicates of the original user query $q$.
#### MemGPT (unreliable metacognition)
Explicit memory management tools enable an explicit Update Belief step. However, the Stop? step is left entirely to the LLM's self-assessment. Relying on weak metacognition leads to high rates of premature stopping or infinite tool-calling loops.
#### Iter-RetGen (memoryless)
With no persistent belief state, no adaptive stopping (it executes for a fixed number of rounds), and no belief update mechanism, this is the crudest PBAI approximation.
In summary, current agent harnesses ask an LLM to act simultaneously as the belief state, the CGDP policy, and the metacognitive evaluator. To solve the CGDP more reliably, we propose that the PBAI operations be unbundled, with dedicated infrastructure for specific operations.
## 5 Interventions
In this section, we derive two modular, orchestrator-level interventions that explicitly implement the PBAI operations. Because these interventions exist outside the agent's LLM prompt, they can be inserted into standard harnesses without difficulty.
### 5.1 Predicate-Based Belief State
The most severe limitation of standard agent harnesses is the unbounded accumulation of context. To resolve this, we introduce a predicate-based belief state and an explicit implementation of the PBAI Update Belief step. This belief state $b_t$ replaces the unbounded trajectory history with a tightly constrained, persistent data structure. It consists of two conceptually distinct elements:
1. Facts: A curated list of confirmed propositions (i.e., predicates resolved with evidence) extracted from past observations.
2. Open Predicates: A queue of unresolved sub-questions that must be answered to satisfy $q$.
Rather than relying on an LLM to remember everything it has read during an episode, the orchestrator actively manages the belief state using a modular Extractor. At timestep $t$, the orchestrator passes $(b_t, o_t)$ to a lightweight LLM extraction call ($<500$ tokens). The extractor parses $o_t$, appends newly discovered facts, marks resolved predicates, and appends sub-questions if $o_t$ reveals missing context. To ensure a bounded context footprint, the extractor compresses older findings and curates the state down to $K \leq 6$ items (see Appendix [H.4](https://arxiv.org/html/2605.07042#A8.SS4) for a token overhead analysis of this step). The updated state $b_{t+1}$ is injected into the agent's prompt for the next timestep, discarding the raw $o_t$. In contrast to the naïve truncation in Equation [2](https://arxiv.org/html/2605.07042#S3.E2), $b_{t+1} = \mathrm{Extract}(b_t, o_t)$ distills a compact representation while preserving sufficient information.
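As a concrete sketch of the orchestrator-side update $b_{t+1} = \mathrm{Extract}(b_t, o_t)$: the function below assumes a hypothetical `llm_extract` callable standing in for the lightweight extraction call, returning new facts, resolved predicates, and new sub-questions from $(b_t, o_t)$. The cap $K = 6$ matches the bound described above, but the compression rule (keep the newest $K$ items) is our simplification of the extractor's curation.

```python
K = 6  # maximum items kept per belief-state field, as described above

def update_belief(belief, observation, llm_extract, k=K):
    """One Update Belief step with a bounded context footprint."""
    new_facts, resolved, new_questions = llm_extract(belief, observation)
    facts = (belief["facts"] + new_facts)[-k:]    # compress older findings
    open_qs = [q for q in belief["open"] if q not in resolved] + new_questions
    return {"facts": facts, "open": open_qs[:k]}  # raw o_t is then discarded
```

Because only the returned dict is injected into the next prompt, the agent's context footprint stays bounded regardless of episode length.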
To understand how an LLM processes an explicit $b_t$, we experiment (in Section [6](https://arxiv.org/html/2605.07042#S6)) with two different textualizations of the predicate-based belief state, obtained by changing the extractor's output format:
- Structured object: A JSON schema containing key-value fields for facts and open predicates.
- Freeform text: A natural language summary that the LLM can use as a scratchpad.
We provide the orchestrator prompts, schema definitions, and all extractor details in Appendix [E](https://arxiv.org/html/2605.07042#A5).
### 5.2 Exhaustion Gate
A secondary failure mode in the CGDP is improper stopping (the first step in PBAI). Relying on an LLM's self-assessment leads to two extremes: premature stopping (halting before sufficient context is gathered, e.g. after generating a plausible hallucination) or infinite looping (repeatedly issuing similar queries, e.g. due to mode collapse). To prevent this, we introduce the Exhaustion Gate.
Instead of relying on the LLM's metacognition to realize that its search has stagnated, the orchestrator explicitly monitors the search loop. We later compare a focused lightweight LLM call (analogous to the extractor) against programmatic heuristics, and find that the programmatic heuristics are more reliable and more performant. The programmatic gate tracks two quantities across the agent's trajectory:
1. Action similarity: Measures the lexical overlap between the current action $a_t$ and recent actions, e.g. using Jaccard similarity. For example, with retrieval actions we compute the Jaccard similarity over the strings queried in the actions; this helps detect looping behavior.
2. Novelty: Measures the overlap between the current observation $o_t$ and previous observations, e.g. using the Unique Passage Rate (UPR). UPR is the percentage of newly retrieved chunks in $o_t$ that have not been seen in previous rounds. A low UPR indicates that the agent's recent actions are no longer surfacing novel text from the hidden state $c$, suggesting that the current search hypothesis has run dry.
At each timestep $t$, the orchestrator evaluates $\mathrm{Stagnated}_t := (\mathrm{Jaccard}_t \geq \tau_J) \wedge (\mathrm{UPR}_t \leq \tau_U)$. If $\mathrm{Stagnated}_t$ remains *True* for $p$ consecutive rounds, the orchestrator interrupts the PBAI loop and forces the agent to give a final answer based on the textualization of the current belief state $b_t$. To ensure our gate is not overly sensitive to specific hyperparameters or chunking strategies, we evaluated 16 different threshold configurations (both discrete and smooth exponential moving averages). The exact values ($\tau_J, \tau_U, p$) used in our experiments, along with our robustness sweep, are detailed in Appendix [G](https://arxiv.org/html/2605.07042#A7), which shows that the programmatic gate provides stable gains across a wide range of settings.
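A runnable sketch of the two stagnation signals and the gate condition: the Jaccard similarity below is computed over whitespace tokens, and the thresholds `tau_j` and `tau_u` are hypothetical placeholders rather than the values actually used (those are swept in Appendix G).

```python
def jaccard(a: str, b: str) -> float:
    """Lexical overlap between two action strings (token-set Jaccard)."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0

def upr(obs_chunks, seen) -> float:
    """Unique Passage Rate: fraction of chunks not retrieved in prior rounds."""
    if not obs_chunks:
        return 0.0
    return sum(c not in seen for c in obs_chunks) / len(obs_chunks)

def stagnated(curr_action, prev_action, obs_chunks, seen,
              tau_j=0.8, tau_u=0.2) -> bool:
    """Stagnated_t := (Jaccard_t >= tau_J) and (UPR_t <= tau_U)."""
    return (jaccard(curr_action, prev_action) >= tau_j
            and upr(obs_chunks, seen) <= tau_u)
```

An orchestrator would call `stagnated` each round and interrupt the loop once it has returned True for $p$ consecutive rounds.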
## 6 Experiments and Analysis
To validate the CGDP framework, we apply the PBAI interventions to four agent harnesses: IRCoT [[37](https://arxiv.org/html/2605.07042#bib.bib11)], ReAct [[42](https://arxiv.org/html/2605.07042#bib.bib12)], MemGPT [[26](https://arxiv.org/html/2605.07042#bib.bib18)], and Iter-RetGen [[33](https://arxiv.org/html/2605.07042#bib.bib10)]. We design experiments to answer two questions:
1. Does explicit extraction of a predicate-based belief state prevent performance degradation over long horizons?
2. Does the programmatic exhaustion gate save tokens and prevent inefficient search better than LLM self-assessment?
### 6.1 Setup
#### Datasets:
We evaluate three complex retrieval-augmented QA domains requiring sophisticated exploration: LoCoMo [[22](https://arxiv.org/html/2605.07042#bib.bib35)] (conversational QA), MuSiQue [[36](https://arxiv.org/html/2605.07042#bib.bib37)] (multi-hop reasoning), and SWE-QA-Pro [[28](https://arxiv.org/html/2605.07042#bib.bib38)] (code repository QA).
#### Model and Metrics:
All agents share identical zero-shot prompts and retriever configurations. Retrieval combines BM25 and all-MiniLM-L6-v2 embeddings, matching published harness weights (0.9 for IRCoT, 0.5 otherwise). Agents and orchestrators use gpt-4o-mini (temperature 0). Performance is evaluated via an LLM-as-a-Judge (gpt-4o, temperature 0), which measures Correctness and Completeness, penalizes Irrelevance, and normalizes to 0–100%. Cost is measured in total LLM API tokens. To ensure the LLM judge is not biased toward specific formatting, we verified standard string-matching metrics (Token F1, Exact Match, ROUGE, and SBERT). As detailed in Appendix [F.2](https://arxiv.org/html/2605.07042#A6.SS2), these lexical metrics broadly agree with the rubric judge rankings, confirming that the accuracy gains are substantive. We use paired $t$-tests with Holm–Bonferroni correction.
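For illustration, hybrid sparse-dense retrieval of this kind is commonly implemented as a convex combination of normalized scores. The sketch below assumes that interpretation of the published weights (0.9 for IRCoT, 0.5 otherwise); the paper does not spell out the fusion rule, so treat this as one plausible reading rather than the authors' implementation.

```python
def hybrid_score(bm25: float, dense: float, w: float = 0.5) -> float:
    """Convex combination of a normalized BM25 score and an embedding score."""
    return w * bm25 + (1.0 - w) * dense

def rank(docs, w=0.5):
    """docs: iterable of (doc_id, bm25_score, dense_score), scores in [0, 1]."""
    return sorted(docs, key=lambda d: hybrid_score(d[1], d[2], w), reverse=True)
```

With $w = 0.9$ the ranking is dominated by lexical BM25 matches; with $w = 0.5$ sparse and dense evidence are weighted equally.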
### 6.2 Lobotomize-then-Replace Methodology
To isolate the impact of our interventions, we evaluate each agent harness under four controlled memory conditions:
1. Baseline: The harness runs as-is, accumulating its standard observation trajectory.
2. Lobotomized: The harness's native memory is wiped at every timestep; it sees only the current observation $o_t$ and the original query $q$. This tests how dependent the harness is on the trajectory history.
3. PBBS (Structured, $b_t^{\text{struct}}$): The lobotomized harness is injected with a JSON object that maintains up to $K = 6$ key-value pairs of facts and open predicates.
4. PBBS (Freeform, $b_t^{\text{free}}$): The lobotomized harness is injected with a natural language paragraph, rewritten at each timestep by the extractor to distill facts and open predicates.
### 6.3 Studying the Predicate-Based Belief State
Stripping an agent’s memory \(Lobotomized\) causes severe degradation \(Table[3](https://arxiv.org/html/2605.07042#S6.T3)\), particularly on MuSiQue\. However, replacing native memory management with an explicit predicate\-based belief state completely prevents this degradation \(allbtfreeb\_\{t\}^\{free\}improvements are statistically significant,p<0\.05p<0\.05\)\. Crucially, by distilling trajectories and preventing early evidence from falling out of the context window, it frequently outperforms baselines \(e\.g\.\+11\.4%\+11\.4\\%on MuSiQue,p<0\.001p<0\.001\)\.
#### Freeform text outperforms Structured JSON\.
We observed that textualizing the belief state as a natural language running summary or freeform scratchpad \(btfreeb\_\{t\}^\{free\}\) consistently outperformed \(1010in1212experimental settings\) textualizing it with a rigid JSON schema \(btstructb\_\{t\}^\{struct\}\)\. To understand why freeform text wins, we analyze the extraction outputs\. When forced into a rigid JSON schema, the LLM generates artifacts like ’no evidence’ for open questions \(occurring in up to56%56\\%of episodes, see Table[4](https://arxiv.org/html/2605.07042#S6.T4)\)\. This artificially fragments the LLM’s reasoning and misleadingly frames the state on answerable questions\. Freeform text allows the LLM to organize the content in whatever natural language form best preserves the information\.
Table 3: Performance (Rubric Judge Accuracy, %) across memory conditions. Adding the PBBS ($b_t$) to lobotomized agents fully recovers and often improves over the baselines. Bold indicates best per harness-task pair. Iter-RetGen (memoryless by design) has no lobotomized condition.

Table 4: Prevalence of "no evidence" open questions in structured $b_t^{\text{struct}}$ across harnesses and tasks.

Figure 2: Pareto frontiers of quality vs. token efficiency across harnesses. Lobotomization (orange) collapses the frontier; $b_t$ (green) recovers much of it toward baseline levels (blue).

While adding orchestrator steps might intuitively seem to increase overall token costs, Figure [2](https://arxiv.org/html/2605.07042#S6.F2) shows that the belief state fundamentally improves the quality-cost trade-off. Lobotomization (orange) collapses the Pareto frontier; injecting the belief state (green) recovers the frontier toward baseline levels (blue), so the accuracy gains do not come at an increased token cost.
### 6.4 Studying the Programmatic Exhaustion Gate
As shown in Table [6](https://arxiv.org/html/2605.07042#S6.T6), using the programmatic exhaustion gate on stateful baselines improves accuracy across all three domains (up to +6.9% on SWE-QA-Pro). Importantly, the gate relies on the belief state $b_t$ to generate the final answer; when applied to lobotomized agents (which do not have accumulated facts), stopping the loop early results in a slight accuracy drop.
#### Programmatic heuristic outperforms LLM metacognition\.
Table [6](https://arxiv.org/html/2605.07042#S6.T6) compares the programmatic heuristic against prompting the LLM to self-assess stagnation. The LLM can be prompted in two distinct ways and exhibits two extreme failures. If prompted to act conservatively, it spirals into infinite loops (unfounded skepticism), costing 159% more tokens than the baseline. If prompted neutrally, it triggers premature stopping that damages accuracy by 5.0%. In contrast, the programmatic gate requires no additional LLM calls and saves up to 39% of the total tokens without sacrificing accuracy.
Table 5: Exhaustion Gate impact on task accuracy (averaged across IRCoT, ReAct, MemGPT). Per-harness results are in Appendix [G](https://arxiv.org/html/2605.07042#A7). Bold indicates statistical significance ($p<0.05$).
Table 6: Programmatic vs. LLM-assessed exhaustion gates on IRCoT in SWE-QA-Pro. The programmatic gate saves tokens without degrading accuracy; the LLM gate costs more tokens than it saves. Bold indicates statistical significance ($p<0.05$).
## 7 Discussion and Conclusion
Formalizing iterative agentic search as a Context Gathering Decision Process (CGDP) provides a blueprint for designing reliable LLM harnesses. Viewing agent behavior through approximate Thompson Sampling, our framework identifies modular infrastructure interventions. We demonstrated that by explicitly unbundling the operations of the search loop, we can systematically mitigate catastrophic failures. The two interventions derived from this framework – a persistent belief state and a programmatic exhaustion gate – compose with each other in state-of-the-art harnesses. They are empirically effective in preventing context degradation and halting unproductive search, saving up to 39% of tokens while improving multi-hop reasoning across three domains.
Our empirical evaluation surfaced a design principle: orchestrators should dictate *operations*, not *representations*. While the orchestrator enforces state tracking steps (extraction, curation), imposing rigid structures (e.g. JSON) on intermediate representations interferes with LLM reasoning. The framework is most effective when it guides control flow while letting the LLM freely synthesize evidence in a natural language scratchpad.
#### Limitations and Future Work\.
All evaluations in this study use a single agent LLM \(GPT\-4o\-mini\)\. While our framework predicts even stronger gains on weaker models where implicit state tracking degrades faster, cross\-model generalization remains an open question for future work\. Furthermore, while the CGDP models general hidden states \(e\.g\. codebases with BASH tools\), our empirical validation focuses on complex, long\-horizon retrieval to establish baseline efficacy\. Finally, the CGDP serves as a conceptual and empirical framework to guide infrastructure design, rather than providing formal mathematical bounds\.
Ultimately, the CGDP abstraction is valuable for pinpointing infrastructure gaps before they cause silent failures in practice. Our analysis reveals several unaddressed gaps in current harnesses – such as LLM priors misaligned with the observation function and error compounding across trajectories (detailed in Appendix [A](https://arxiv.org/html/2605.07042#A1)). By mapping them to the operations of the Predicate-Based Adaptive Identification loop, we can find fruitful ways to address each of them in future work.
## Acknowledgments and Disclosure of Funding
We thank Ambuj Tewari, the Netflix Machine Learning and Inference Research group, Ding Tong, Aditya Sinha, Maya Ravichandran, and Anuj Phadke for their insightful feedback on this project\.
## References
- [1] M. Aghajani Asl, M. Asgari-Bidhendi, and B. Minaei-Bidgoli (2025). FAIR-RAG: faithful adaptive iterative refinement for retrieval-augmented generation. arXiv preprint arXiv:2510.22344.
- [2] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024). Self-RAG: learning to retrieve, generate, and critique through self-reflection. In ICLR.
- [3] P. Auer, N. Cesa-Bianchi, and P. Fischer (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), pp. 235–256.
- [4] Y. Chen, L. Yuan, G. Cui, Z. Liu, and H. Ji (2023). A close look into the calibration of pre-trained language models. In ACL, pp. 1343–1367.
- [5] D. Edge, H. Trinh, N. Cheng, J. Bradley, A. Chao, A. Mody, S. Truitt, D. Metropolitansky, R. O. Ness, and J. Larson (2025). From local to global: a graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130.
- [6] S. Feng, W. Shi, Y. Wang, W. Ding, V. Balachandran, and Y. Tsvetkov (2024). Don't hallucinate, abstain: identifying LLM knowledge gaps via multi-LLM collaboration. In ACL.
- [7] A. Garivier and E. Kaufmann (2016). Optimal best arm identification with fixed confidence. In COLT, pp. 998–1027.
- [8] B. J. Gutierrez, Y. Shu, Y. Gu, M. Yasunaga, and Y. Su (2024). HippoRAG: neurobiologically inspired long-term memory for large language models. In NeurIPS.
- [9] J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou (2024). Large language models cannot self-correct reasoning yet. In ICLR.
- [10] Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig (2023). Active retrieval augmented generation. In EMNLP, pp. 7969–7992.
- [11] B. Jin, J. Yoon, J. Han, and S. O. Arik (2025). Long-context LLMs meet RAG: overcoming challenges for long inputs in RAG. In ICLR.
- [12] B. Jin, H. Zeng, Z. Yue, J. Yoon, S. O. Arik, D. Wang, H. Zamani, and J. Han (2025). Search-R1: training LLMs to reason and leverage search engines with reinforcement learning. In COLM.
- [13] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence 101(1), pp. 99–134.
- [14] P. Laban, H. Hayashi, Y. Zhou, and J. Neville (2026). LLMs get lost in multi-turn conversation. In ICLR.
- [15] Y. Lee, H. Yen, X. Ye, and D. Chen (2026). Agentic aggregation for parallel scaling of long-horizon agentic tasks. arXiv preprint arXiv:2604.11753.
- [16] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In NeurIPS.
- [17] J. Li, M. Wang, Z. Zheng, and M. Zhang (2024). LooGLE: can long-context language models understand long contexts? In ACL, pp. 16304–16333.
- [18] X. Li, G. Dong, J. Jin, Y. Zhang, Y. Zhou, Y. Zhu, P. Zhang, and Z. Dou (2025). Search-o1: agentic search-enhanced large reasoning models. In EMNLP, pp. 5420–5438.
- [19] X. Li, J. Jin, G. Dong, H. Qian, Y. Wu, J. Wen, Y. Zhu, and Z. Dou (2026). WebThinker: empowering large reasoning models with deep research capability. In NeurIPS.
- [20] H. Liu, Z. Wang, X. Chen, Z. Li, F. Xiong, Q. Yu, and W. Zhang (2025). HopRAG: multi-hop reasoning for logic-aware retrieval-augmented generation. In ACL Findings, pp. 1897–1913.
- [21] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang (2024). Lost in the middle: how language models use long contexts. Transactions of the Association for Computational Linguistics 12, pp. 157–173.
- [22] A. Maharana, D. Lee, S. Tulyakov, M. Bansal, F. Barbieri, and Y. Fang (2024). Evaluating very long-term conversational memory of LLM agents. In ACL, pp. 13851–13870.
- [23] E. Meyerson, G. Paolo, R. Dailey, H. Shahrzad, O. Francon, C. F. Hayes, X. Qiu, B. Hodjat, and R. Miikkulainen (2025). Solving a million-step LLM task with zero errors. arXiv preprint arXiv:2511.09030.
- [24] N. Mündler, J. He, S. Jenko, and M. Vechev (2024). Self-contradictory hallucinations of large language models: evaluation, detection and mitigation. In ICLR.
- [25] I. Osband, D. Russo, and B. Van Roy (2013). (More) efficient reinforcement learning via posterior sampling. In NeurIPS.
- [26] C. Packer, S. Wooders, K. Lin, V. Fang, S. G. Patil, I. Stoica, and J. E. Gonzalez (2024). MemGPT: towards LLMs as operating systems. In ICLR.
- [27] J. S. Park, J. C. O'Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein (2023). Generative agents: interactive simulacra of human behavior. In UIST.
- [28] W. Peng, Y. Shi, Y. Wang, X. Zhang, B. Shen, and X. Gu (2025). SWE-QA: can language models answer repository-level code questions? arXiv preprint arXiv:2509.14635.
- [29] N. Rozanov and M. Rei (2025). StateAct: enhancing LLM base agents via self-prompting and state-tracking. In REALM, pp. 367–385.
- [30] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. Wen (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning.
- [31] D. Russo (2016). Simple Bayesian algorithms for best arm identification. In COLT, pp. 1417–1418.
- [32] T. Schnabel, K. Tomlinson, A. Swaminathan, and J. Neville (2025). Lost in transmission: when and why LLMs fail to reason globally. In NeurIPS.
- [33] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen (2023). Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In EMNLP Findings, pp. 9248–9274.
- [34] W. Su, Y. Tang, Q. Ai, Z. Wu, and Y. Liu (2024). DRAGIN: dynamic retrieval augmented generation based on the real-time information needs of large language models. In ACL, pp. 12991–13013.
- [35] W. R. Thompson (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, pp. 285–294.
- [36] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2022). MuSiQue: multihop questions via single hop question composition. Transactions of the Association for Computational Linguistics 10, pp. 539–554.
- [37] H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal (2023). Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In ACL, pp. 10014–10037.
- [38] Y. Wu, Y. Gu, X. Feng, W. Zhong, D. Xu, Q. Yang, H. Liu, and B. Qin (2024). Extending context window of large language models from a distributional perspective. In EMNLP, pp. 7288–7301.
- [39] W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, and Y. Zhang (2025). A-Mem: agentic memory for LLM agents. In NeurIPS.
- [40] S. Yan, J. Gu, Y. Zhu, and Z. Ling (2024). Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884.
- [41] J. Yang, C. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press (2024). SWE-agent: agent-computer interfaces enable automated software engineering. In NeurIPS, pp. 50528–50652.
- [42] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao (2023). ReAct: synergizing reasoning and acting in language models. In ICLR.
- [43] H. Yen, A. Paranjape, M. Xia, T. Venkatesh, J. Hessel, D. Chen, and Y. Zhang (2025). Lost in the maze: overcoming context limitations in long-horizon agentic search. arXiv preprint arXiv:2510.18939.
- [44] Y. Zheng, D. Fu, X. Hu, X. Cai, L. Ye, P. Lu, and P. Liu (2025). DeepResearcher: scaling deep research via reinforcement learning in real-world environments. In EMNLP, pp. 414–431.
## Appendix A: Observed Failure Modes and Example Traces
### A.1 Observed Failure Modes
This section summarizes the main failure modes \(G1–G7\) surfaced by our evaluation tasks, together with several additional failure modes that the PBAI framework suggests may arise in longer episodes, larger corpora, or harder reasoning settings\.
#### Coverage unawareness\.
The agent cannot tell the difference between “I did not find it” and “it is not in the corpus”\. Current methods usually have little notion of corpus coverage, so they cannot estimate how much of the search space has already been explored\. A natural fix would be explicit coverage tracking\.
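A minimal sketch of such coverage tracking, under the assumption that retrieved documents carry stable identifiers and the corpus size is known (the class and method names are our own, not from the paper):

```python
class CoverageTracker:
    """Sketch of explicit coverage tracking: distinguish 'not found yet'
    from 'not in the corpus' by measuring how much of the searchable
    corpus the agent has actually observed."""

    def __init__(self, corpus_size: int):
        self.corpus_size = corpus_size
        self.seen: set[str] = set()

    def observe(self, retrieved_ids: list[str]) -> None:
        # Record which documents this round's retrieval touched.
        self.seen.update(retrieved_ids)

    def fraction_explored(self) -> float:
        return len(self.seen) / self.corpus_size


tracker = CoverageTracker(corpus_size=100)
tracker.observe(["doc-1", "doc-2"])
tracker.observe(["doc-2", "doc-3"])  # overlap is not double-counted
print(f"{tracker.fraction_explored():.2f}")  # 0.03
```

An orchestrator could surface this fraction in the prompt, letting the agent justify "it is not in the corpus" only once exploration is high.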
#### Aggregation breakdown\.
Some questions require combining many weak signals across passages, such as counting entities, summarizing trends, or comparing sources. LLMs become less reliable as the number of relevant passages grows [[32](https://arxiv.org/html/2605.07042#bib.bib14)], so the state update can fail even when the evidence has already been retrieved.
#### Error compounding\.
Small extraction mistakes in early rounds can become premises for later reasoning\. Persistent state helps by keeping each extraction step short and allowing later revisions, but it does not fully remove this problem\.
#### Action\-observation misalignment\.
The agent may search the wrong part of the corpus because its internal picture of how actions map to observations is wrong\. In practice, this means it chooses actions that are locally sensible but poorly matched to the corpus structure\. A natural fix would be a lightweight predict\-before\-retrieve step\.
#### Unjustified confidence\.
The agent may stop with high confidence because the answer sounds plausible, not because the evidence is complete\. This happens when stopping depends more on fluent output than on whether the key predicates have actually been resolved\. A natural fix would be evidence\-grounded stopping or calibrated answer generation\.
### A.2 Annotated Agent Traces
All traces below are from baseline conditions of our experiments (no $b_t$ injection, no exhaustion gate).
#### G5/G6: Retrieval Stagnation \(IRCoT, MuSiQue\)\.
Retrieval stagnation occurs when the agent issues the same or near-identical search actions across rounds, receiving the same passages and making no progress. We measure this as episodes where the token-level Jaccard similarity between consecutive actions exceeds 0.5 for at least 2 consecutive pairs. Example. User query: *"What city shares a border with the place where the person who went to the state known for its Mediterranean climate during the gold rush worked?"* Ground truth: Rio Linda. IRCoT runs 10 rounds (the maximum) and answers incorrectly.
> R1: To answer the question, we need to identify a city that shares a border with a location associated with the Mediterranea... R2: The Black Hills Gold Rush took place in Dakota Territory, which is not associated with a Mediterranean climate. R3: To answer the question, we need to identify a city that shares a border... R5–R10: Repeat of the R1/R3 pattern with minor variations.
Annotation. The agent alternates between two framings without advancing the multi-hop chain. Because it has no explicit mechanism for detecting stagnation or trying a new direction, it remains stuck in the loop. The exhaustion gate would flag this quickly via high action overlap.
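The stagnation criterion above is simple enough to implement programmatically, with no LLM call. A minimal sketch (the function names are our own; the paper specifies the 0.5 threshold over at least 2 consecutive action pairs):

```python
def jaccard(action_a: str, action_b: str) -> float:
    """Token-level Jaccard similarity between two search actions."""
    tokens_a = set(action_a.lower().split())
    tokens_b = set(action_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_stagnating(actions: list[str],
                  threshold: float = 0.5,
                  min_pairs: int = 2) -> bool:
    """Flag retrieval stagnation: Jaccard similarity between consecutive
    actions exceeds `threshold` for at least `min_pairs` consecutive pairs."""
    streak = 0
    for prev, cur in zip(actions, actions[1:]):
        if jaccard(prev, cur) > threshold:
            streak += 1
            if streak >= min_pairs:
                return True
        else:
            streak = 0  # progress resets the streak
    return False
```

A gate built on this check can halt the loop and answer from the accumulated belief state instead of burning further rounds.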
#### G4: Premature Stopping \(MemGPT, LoCoMo\)\.
Premature stopping occurs when the agent stops after 1–2 rounds with an incorrect answer despite insufficient evidence. Example. User query: *"Who performed at the concert at Melanie's daughter's birthday?"* Ground truth: Matt Patterson. MemGPT runs 1 round, retrieves 5 passages with the action "concert Melanie's daughter's birthday", and answers UNANSWERABLE. Annotation. The agent does not find the answer in the first retrieval and stops instead of refining the search. With a persistent belief state, the unresolved question "who performed?" would remain explicit, making early termination less attractive.
#### G2: Goal Displacement \(ReAct, MuSiQue\)\.
Goal displacement occurs when the agent's actions drift from the original user query, pursuing tangential leads. Example. User query: *"How were people from whom new coins were a proclamation of independence by the Somali Muslim Ajuran Empire expelled from the natural boundary between Thailand and A Magne's country?"* Ground truth: The dynasty regrouped and defeated the Portuguese. ReAct runs 5 rounds.
> R1 action: Ajuran Empire proclamation of independence and expulsion of people. R5 action: Ajuran Empire Southeast Asia history.
Annotation. The agent follows a salient thread but drifts away from the intermediate sub-questions required to answer the original query. Without explicit state tracking for what remains unresolved, tangential directions are easy to pursue.
#### G1: Evidence Not Persisted \(IRCoT, MuSiQue\)\.
Evidence forgetting occurs when relevant information appears in early rounds but is absent from later reasoning. Example. User query: *"Who is the mother of the screenwriter of WarGames?"* Ground truth: Jane Greer. IRCoT runs 10 rounds and answers incorrectly. The round-2 passage contains: *"…He is the son of actress Jane Greer and producer Edward Lasker…"* By rounds 8–10, the chain-of-thought discusses the screenwriters without mentioning their mothers. Annotation. The relevant evidence appears early but is not preserved in a durable form, so it no longer shapes later reasoning. A good persistent belief state would have written "Jane Greer" into the running state as soon as it appeared.
#### G3: Parametric Dominance \(ReAct, MuSiQue\)\.
Parametric dominance occurs when the agent produces an answer grounded in pretraining knowledge rather than the retrieved evidence. Example. User query: *"How many mandatory transmitters of the Canadian Broadcasting Centre's owner were updated before the deadline?"* Ground truth: only about half. ReAct runs 3 rounds.
> R2 Thought: The observations indicate that the CBC did not convert all of its mandatory transmitters to digital by the original deadline of August 31, 2011... Answer: 15 mandatory transmitters were updated before the deadline.
Annotation. The final answer introduces a specific number that is not supported by the retrieved evidence. Here the model's parametric prior wins over the retrieved passages because the evidence is not kept in an explicit, persistent state.
#### G7: Evidence\-Action Misalignment \(LoCoMo\)\.
Agents incorporate retrieved evidence without verifying that it actually pertains to the specific entities in the user query. Example. User query: *"What was grandma's gift to Melanie?"* (entity-swapped; the original conversation discusses a gift to a different person). IRCoT retrieves passages about the necklace gift and answers "A necklace" in 2 rounds without verifying that the evidence pertains to Melanie specifically. Annotation. The agent finds a relevant gift passage but does not verify that the gift is attached to the correct entity in the user query. Entity-grounded extraction or a lightweight alignment check would catch this mismatch before answer generation.
Table 7: Gap prevalence across baseline conditions. Each cell shows the count of episodes exhibiting the pattern (columns: IRCoT, ReAct, MemGPT, Iter-RetGen).
- G4 (premature stop): MuSiQue 744620; LoCoMo 68101430; SWE-QA-Pro 15030
- G5/G6 (stagnation): MuSiQue 115146173139; LoCoMo 2487024986; SWE-QA-Pro 197517137
- G2 (goal displacement): MuSiQue 1142361800; LoCoMo 765460; SWE-QA-Pro 961200
- G7 (evidence-action misalign.): LoCoMo una. 125/440, 73/440, 229/440, —
## Appendix B: Additional Interventions Suggested by the Framework
The gap analysis suggests several additional orchestrator\-level interventions that are outside the scope of this paper but follow naturally from the framework:
#### Gap\-aware stopping\.
Instead of asking only whether search has stalled, the orchestrator could prompt a focused LLM call to identify which predicates remain unresolved\. This would directly target premature stopping and would complement the exhaustion gate, which targets the opposite failure mode of stagnation\.
#### Entity\-grounded extraction and alignment checking\.
The extractor could be required to ground each finding against the entities and relations in the user query, and a separate verification step could catch remaining mismatches before answer generation\.
#### Triggered resampling\.
When stagnation is detected, the orchestrator could branch to one or more new hypotheses that are still consistent with the accumulated observations, rather than terminating immediately\. This would make direction changes explicit rather than leaving them to the model’s chain\-of\-thought\.
#### Observation filtering\.
Before extraction, the orchestrator could filter or compress the raw observation itself, removing clearly irrelevant text before it ever reaches the belief\-update step\.
#### Parallel hypothesis aggregation\.
Another way to explore multiple hypotheses is to run several retrieval trajectories in parallel and aggregate them afterward [[15](https://arxiv.org/html/2605.07042#bib.bib13)]. This turns hypothesis diversity into an explicit orchestrator decision rather than an accidental byproduct of a single trajectory.
## Appendix C: Extended Related Work
### C.1 Agentic RAG Landscape
Recent work has explored giving retrieval agents more sophisticated decision-making capabilities. Search-R1 [[12](https://arxiv.org/html/2605.07042#bib.bib50)] and Search-o1 [[18](https://arxiv.org/html/2605.07042#bib.bib51)] train retrieval agents with reinforcement learning. WebThinker [[19](https://arxiv.org/html/2605.07042#bib.bib52)] extends this to web-scale retrieval. SLIM [[43](https://arxiv.org/html/2605.07042#bib.bib9)] targets long-horizon search by periodically summarizing trajectories. These approaches are complementary to ours: they improve the agent's action selection policy or trajectory management through training and tool design, while we improve the infrastructure around the policy (persistent state, exhaustion detection) through orchestration.
### C.2 LLM Metacognition and Self-Assessment
Our finding that programmatic exhaustion detection is more token-efficient than LLM-judged stopping is consistent with a growing body of evidence on LLM metacognitive limitations. Feng et al. [[6](https://arxiv.org/html/2605.07042#bib.bib55)] showed that multi-LLM probing improves abstention by 19.3% over single-model self-assessment. Chen et al. [[4](https://arxiv.org/html/2605.07042#bib.bib19)] demonstrated that verbalized confidence is poorly calibrated across model families. These findings motivate our design choice: the stagnation signals we measure (action Jaccard and UPR) are observable facts about the retrieval process that do not require the LLM to assess its own search progress.
### C.3 Consistency and Contradiction Detection
Evidence-action misalignment connects to a literature on contradiction detection in LLM outputs. Mündler et al. [[24](https://arxiv.org/html/2605.07042#bib.bib43)] showed that pairwise contradiction detection achieves approximately 80% F1, substantially better than document-level detection. The key insight is that contradiction detection is effective when the relevant facts are presented together in a short context, but degrades when they must be identified from a long history. This motivates the orchestrator's role: maintaining a persistent state ($b_t$) so that new extractions can be compared against existing facts in focused pairwise calls.
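Under this view, the orchestrator's consistency check reduces to short pairwise comparisons between each new extraction and the facts already held in the belief state. The sketch below illustrates the control flow only; `contradicts` is a stand-in for a focused LLM call, replaced here by a toy negation heuristic so the example runs:

```python
def contradicts(fact_a: str, fact_b: str) -> bool:
    """Stand-in for a focused pairwise LLM contradiction check.
    Toy heuristic: the same claim stated with and without a negation."""
    a, b = fact_a.lower(), fact_b.lower()
    same_core = a.replace("not ", "") == b.replace("not ", "")
    return same_core and (("not " in a) != ("not " in b))


def check_new_extraction(belief_facts: list[str], new_fact: str) -> list[str]:
    """Compare one new extraction against each stored fact, pairwise,
    so both claims always appear together in a short context."""
    return [fact for fact in belief_facts if contradicts(fact, new_fact)]
```

In a real harness, `contradicts` would issue one short LLM call per stored fact, keeping each comparison inside the short-context regime where detection is reported to work well.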
### C.4 Generative Agents and Long-Term Memory
Park et al. [[27](https://arxiv.org/html/2605.07042#bib.bib59)] introduced generative agents with long-term memory for simulated social environments. Their memory architecture (observation → reflection → planning) shares structural similarities with our belief state management (observation → extraction → state update), but operates in a different setting: their agents interact with a changing social world (full POMDP), while our agents search a static database (sequential identification). The MAKER framework [[23](https://arxiv.org/html/2605.07042#bib.bib36)] extends agentic memory to long-horizon creative tasks.
## Appendix D: Experimental Setup and Reproducibility
#### Datasets and Splits\.
We evaluated three datasets: LoCoMo \(1,275 tasks spanning conversational memory\), MuSiQue \(500 answerable tasks spanning multi\-hop Wikipedia routing\), and SWE\-QA\-Pro \(260 tasks spanning software repository structures\)\. We utilized the standard test/validation splits provided by the authors of the respective benchmarks\.
#### Compute Resources and LLM Usage\.
All agent trajectories, modular extractions, and rubric evaluations were executed via OpenAI's API using the gpt-4o-mini and gpt-4o endpoints. The total compute expenditure for the experiments, including baseline runs, lobotomization sweeps, and gate ablations, was approximately $1,500 in API credits.
#### Licenses\.
We use LoCoMo (CC BY-NC 4.0), MuSiQue (CC BY 4.0), and SWE-QA-Pro Bench (MIT) as evaluation datasets, and all-MiniLM-L6-v2 (Apache-2.0) for dense retrieval embeddings. We accessed gpt-4o-mini and gpt-4o through the OpenAI API under the applicable OpenAI Services Agreement and Service Terms.
## Appendix E: Orchestrator and Judge Prompts
We list the key prompts designed for our interventions. All prompts are in the experiment codebase under src/orchestrator/. Harness-specific prompts and rubric judge prompts are in src/methods/prompts/ and src/scoring/.
### E.1 Structured Extraction Prompt
The structured prompt separates output into facts, resolved questions, and new questions\. The instruction to “state precisely what evidence is still missing” produces the “no evidence” artifacts analyzed in Section 6\.
> New retrieved passages: {observation}
>
> What we already know (DO NOT repeat any of these): {established_facts}
>
> <scratchpad>
> First, what do the passages actually state? Extract the key claims exactly as the evidence presents them --- preserve who did what, which entity is involved, what values are mentioned. Do not paraphrase in a way that changes attribution or meaning.
> Then, given the question "{question}", what is new here compared to what we already know? Do any of these claims resolve our open questions? If a fact is already listed above, skip it entirely.
> </scratchpad>
>
> Open questions we are still investigating: {open_questions}
>
> Output ONLY genuinely new facts not already listed above. If the passages contain nothing new beyond what we already know, write "Nothing relevant."
>
> New facts:
> - The claim exactly as the evidence states it (source: document or passage identifier)
>
> Resolved questions:
> - Which open questions are now answered by the evidence?
>
> New questions:
> - What specific evidence is still missing? State precisely what has NOT been found.
### E.2 Freeform Extraction Prompt
The freeform prompt produces notes and memories without fact/question separation\.
> New retrieved passages: {observation}
>
> Your current notes (DO NOT repeat any of these): {existing_notes}
>
> <scratchpad> First, what do the passages actually state? Extract claims exactly as presented --- preserve who did what, which entities are involved. Do not paraphrase in a way that changes attribution. Then, how do these relate to the question "{question}" and your current notes? What is genuinely new? If a note already covers this information, skip it entirely. </scratchpad>
>
> Write ONLY genuinely new notes not already covered above. If the passages contain nothing new beyond your existing notes, write "Nothing relevant."
>
> Notes:
> - A finding exactly as the evidence presents it
>
> Memories to keep:
> - Verbatim quote or key passage worth preserving
Both prompts operate on $b_t + o_t$ only (approximately 500 tokens of context), not the full history.
### E.3 Reorganization Prompt
To strictly bound the size of the belief state, the orchestrator allows the state to grow up to $K_{\text{trigger}} = 10$ items before pausing to execute a reorganization call. This call curates the state back down to a target size of $K_{\text{target}} = 6$ (the capacity limit reported in the main text).
> You are curating an investigation state. Given the original question and all facts and questions gathered so far, produce a compact, prioritized state.
>
> Original question: {question}
> Current facts: {facts}
> Open questions: {questions}
>
> Instructions:
> - Keep at most {k_target} facts and {n_questions} open questions
> - Merge redundant facts into single comprehensive claims
> - Drop facts irrelevant to the question
> - PRESERVE facts that form multi-hop reasoning chains, even if individually they seem tangential
> - Each fact must retain its source attribution
> - Rewrite open questions to reflect what is actually still unknown
> - Remove questions that have been answered by the facts
> - Order facts by importance (most important first)
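The trigger-then-curate control flow can be sketched as follows. `maybe_reorganize` and `stub_curator` are hypothetical stand-ins, assuming the real `reorganize_call` wraps an LLM call built from the prompt above:

```python
def maybe_reorganize(facts, questions, reorganize_call,
                     k_trigger=10, k_target=6):
    """Curate the belief state only once it grows past k_trigger items,
    asking the curator to shrink it to k_target. Below the trigger,
    no call is made and no tokens are spent."""
    if len(facts) <= k_trigger:
        return facts, questions
    return reorganize_call(facts, questions, k_target)

def stub_curator(facts, questions, k_target):
    """Illustrative curator: keeps the first k_target facts. The real
    call merges, drops, and reprioritizes as the prompt instructs."""
    return facts[:k_target], questions

facts = [f"fact {i}" for i in range(11)]   # 11 items exceeds K_trigger = 10
facts, questions = maybe_reorganize(facts, [], stub_curator)
print(len(facts))  # curated down to K_target = 6
```

The asymmetric trigger/target pair means the expensive curation call amortizes over several subsequent rounds of growth instead of firing on every new fact.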
### E.4 LLM-based Exhaustion Gate Prompts
Two LLM\-based stagnation detection variants were evaluated against the programmatic gate\. Both see the current investigation state and recent retrieval rounds\.
#### Conservative \(v3\)\.
Defaults to CONTINUE; requires concrete evidence of stagnation to recommend stopping\.
> You are deciding whether further retrieval will meaningfully improve the answer to this question.
>
> QUESTION: {question}
> CURRENT INVESTIGATION STATE: {current_state}
> RECENT RETRIEVAL (last {window} rounds): {recent_rounds}
>
> Your DEFAULT is to CONTINUE retrieval. Only recommend stopping if you can point to CONCRETE evidence of stagnation:
> - The same passages or near-paraphrases are appearing across multiple rounds
> - Search queries have covered the obvious angles and a meaningfully different direction is hard to identify
> - The current answer already addresses the question and further evidence is unlikely to change it
>
> Do NOT recommend stopping just because:
> - The answer seems plausible (it may still be incomplete)
> - A few passages overlap (some overlap is normal)
> - You are uncertain about the answer quality (uncertainty means more retrieval could help)
>
> VERDICT: PRODUCTIVE / QUERY_STALE / EXHAUSTED
> REASON: [explanation]
#### Neutral \(v3\_neutral\)\.
No default direction; presents the three verdicts symmetrically\.
> You are evaluating whether an information retrieval investigation is making progress or has stagnated.
>
> QUESTION: {question}
> CURRENT INVESTIGATION STATE: {current_state}
> RECENT RETRIEVAL (last {window} rounds): {recent_rounds}
>
> Based on the evidence above, choose one of the following verdicts:
> VERDICT: PRODUCTIVE --- if new, relevant information is still being discovered each round.
> VERDICT: QUERY_STALE --- if the current search direction is exhausted but a specific untried angle could yield new information.
> VERDICT: EXHAUSTED --- if retrieval has stalled and further rounds are unlikely to surface new relevant content.
>
> VERDICT:
> REASON:
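Both gate prompts ask the model to emit a `VERDICT:` line, which the harness must parse into one of three actions. A hedged sketch; the function name and the fall-back-to-PRODUCTIVE policy (i.e. keep searching when the reply is malformed) are assumptions, not the paper's code:

```python
import re

def parse_verdict(reply: str, default: str = "PRODUCTIVE") -> str:
    """Extract the three-way gate verdict from a free-text LLM reply.
    Assumed policy: an unparseable reply defaults to PRODUCTIVE,
    i.e. the safe choice of continuing retrieval."""
    match = re.search(r"VERDICT:\s*(PRODUCTIVE|QUERY_STALE|EXHAUSTED)", reply)
    return match.group(1) if match else default

print(parse_verdict("VERDICT: QUERY_STALE\nREASON: same passages repeating"))
# -> QUERY_STALE
```

QUERY_STALE would typically reset the search direction rather than halt the episode, while EXHAUSTED hands off to the final-answer call.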
## Appendix F Full Results and Additional Metrics
### F.1 Rubric Judge Scores and Statistical Tests
Tables [8](https://arxiv.org/html/2605.07042#A6.T8) and [9](https://arxiv.org/html/2605.07042#A6.T9) provide the comprehensive paired $t$-tests and split-condition scores that support the summarized findings in Section 6 of the main text.
Table 8: Full paired $t$-tests for persistent belief state ($b_t$). Each cell reports the left condition minus the right condition in pp; positive means the left condition scores higher. Iter-RetGen has no lobotomized condition. Bold blue = $p < .05$.

Table 9: LoCoMo $b_t$ effect (pp) split by answerable (835q) and unanswerable (440q entity-swap). Each cell reports the $b_t$ condition minus the comparison condition, so positive means $b_t$ scores higher. Structured $b_t$ improves unanswerable detection (abstention) while freeform $b_t$ improves answerable accuracy. Iter-RetGen has no lobotomized condition.
### F.2 Lexical and Embedding-based Metrics
String metrics (Token F1, Exact Match, ROUGE-1, METEOR, SentenceBERT cosine similarity) broadly agree with the rubric judge rankings, confirming that the performance gains are not driven by judge preference for $b_t$-conditioned answers. On MuSiQue, $b_t^{\text{free}}$ is the best condition for every method on every string metric.
The few divergences between string metrics and the rubric judge justify our primary use of the rubric judge (Table [8](https://arxiv.org/html/2605.07042#A6.T8)). For example, perfectly correct abstentions (e.g., answering "UNANSWERABLE") have zero token overlap with gold answers, artificially deflating F1 scores on LoCoMo despite being the optimal agent behavior. Similarly, SWE-QA-Pro string metrics favor lobotomized models because longer, verbose code explanations coincidentally have higher token overlap with the ground truth. The rubric judge accurately captures these task-specific quality dimensions (abstention, correctness, entity attribution) that surface-level overlap cannot.
Table 10: Token F1 and Exact Match: $b_t$ effect (pp). Iter-RetGen (memoryless by design) has no lobotomized condition. Bold blue = $p < .05$.

Table 11: ROUGE-1 and METEOR: $b_t$ effect (pp). Iter-RetGen has no lobotomized condition. Bold blue = $p < .05$.

Table 12: SBERT and Rubric Judge: $b_t$ effect (pp). Iter-RetGen has no lobotomized condition. Rubric judge significance is from Table [8](https://arxiv.org/html/2605.07042#A6.T8); SBERT significance is not shown.

Table 13: Exhaustion gate effect on string metrics (pp): blended (gate3 answer if triggered, natural otherwise) minus natural. Positive = gate improves the metric. Bold blue = $p < .05$. IRCoT benefits most on LoCoMo/MuSiQue; SWE shows negative F1 diffs because early stopping truncates verbose code answers that have high token overlap.
### F.3 Additional Statistics
Tables [14](https://arxiv.org/html/2605.07042#A6.T14) and [15](https://arxiv.org/html/2605.07042#A6.T15) detail the distribution of retrieval rounds and the effect of task complexity (hop count) on the interventions.
Table 14: Mean and median retrieval rounds per episode. Max rounds: IRCoT 10, ReAct 7, Iter-RetGen 4 (fixed), MemGPT 12. Iter-RetGen always uses all 4 rounds (omitted).

Table 15: MuSiQue $b_t$ effect (pp) by hop count. Iter-RetGen has no lobotomized condition. Bold blue = $p < .05$.
## Appendix G Exhaustion Gate Deep Dive
### G.1 Per-method gate results
Table [16](https://arxiv.org/html/2605.07042#A7.T16) reports the exhaustion gate effect for each method individually across all tasks and conditions. Bold blue = $p < .05$.
Table 16: Exhaustion gate (global best configuration, `f_j0.6_u0.3_p2`). $\Delta$: score difference in pp. Fire: % of episodes where the gate triggers. Save: net token savings (%). Bold blue = $p < .05$. †Iter-RetGen (memoryless by design) has no lobotomized condition.

IRCoT benefits most because it has the most retrieval rounds (up to 10) and thus the most opportunity for stagnation to develop. ReAct (up to 7 rounds) shows moderate benefits on baseline and lobotomized conditions. MemGPT has variable round counts and less predictable stagnation patterns; the gate hurts on MuSiQue lobotomized (−9.0 pp) because it triggers during productive search episodes. Iter-RetGen’s 4 fixed rounds mean the gate fires at the last round with no opportunity for early stopping, producing 0/9 significant positives and 3/9 significant negatives.
### G.2 Smooth vs. discrete configurations
We swept 16 configurations: 7 discrete (hard thresholds on Jaccard and UPR) and 9 smooth (exponentially-weighted moving averages). On IRCoT, smooth configurations produce slightly better mean improvement (+7.2 pp) than discrete (+6.4 pp), consistent with the intuition that stagnation is gradual.
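Both families can be phrased over the same per-round novelty signal. The sketch below uses only the Jaccard overlap of retrieved passage-ID sets (omitting the UPR signal), and its threshold, smoothing factor, and patience values are assumptions loosely modeled on the `f_j0.6_u0.3_p2` naming rather than the paper's exact semantics:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets of retrieved passage IDs."""
    return len(a & b) / len(a | b) if a | b else 0.0

def discrete_gate(rounds, j_thresh=0.6, patience=2):
    """Discrete variant: fire after `patience` consecutive rounds whose
    retrieval overlaps the previous round by more than j_thresh."""
    streak = 0
    for prev, cur in zip(rounds, rounds[1:]):
        streak = streak + 1 if jaccard(prev, cur) > j_thresh else 0
        if streak >= patience:
            return True
    return False

def smooth_gate(rounds, j_thresh=0.6, alpha=0.5, patience=2):
    """Smooth variant: threshold an exponentially-weighted moving
    average of the overlap, so stagnation accumulates gradually
    instead of tripping on a single repeated round."""
    ewma, streak = 0.0, 0
    for prev, cur in zip(rounds, rounds[1:]):
        ewma = alpha * jaccard(prev, cur) + (1 - alpha) * ewma
        streak = streak + 1 if ewma > j_thresh else 0
        if streak >= patience:
            return True
    return False

stagnant = [{"p1", "p2", "p3"}] * 5          # identical passages every round
fresh = [{"p1"}, {"p2"}, {"p3"}, {"p4"}]     # new passages every round
print(discrete_gate(stagnant), discrete_gate(fresh))  # True False
```

The EWMA's lag is why the smooth variant fits gradual stagnation: one coincidentally repeated round barely moves the average, while sustained repetition pushes it past the threshold.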
### G.3 Anchoring: why the gate helps stateful methods
Table [17](https://arxiv.org/html/2605.07042#A7.T17) shows the pooled condition-level impact: the gate significantly helps baseline, $b_t^{\text{struct}}$, and $b_t^{\text{free}}$ conditions ($p < 0.001$) but is neutral on lobotomized ($p = 0.248$). Table [18](https://arxiv.org/html/2605.07042#A7.T18) supports the anchoring hypothesis: in the lobo-vs-$b_t$ comparisons shown below, the per-question variance of the gate’s effect is consistently higher without persistent state. The belief state appears to anchor the method’s reasoning so that the early-stop answer is more consistent with what the method would have produced naturally, while memoryless methods produce higher-variance answers from similar passages.
Table 17: Pooled condition-level exhaustion gate impact (global best config, all methods). The gate significantly helps base, $b_t^{\text{struct}}$, and $b_t^{\text{free}}$ conditions but is neutral on lobotomized.

Table 18: Per-question std of gate effect (rubric judge). Lobo has higher std than both $b_t$ conditions in all six displayed method-task rows, supporting the view that persistent state anchors reasoning and makes early stopping more reliable.
### G.4 State diff is not informative
Comparing `full` mode (checks Jaccard + UPR + state diff) vs. `query_and_full` mode (checks Jaccard + UPR only) with matched thresholds: 194 of 225 cells are identical (0 pp difference). The diff check only matters for discrete $U = 0.3$ configs, where it reduces fire rate by 1.8–5.8 pp. For smooth configs ($U < 0.3$), the diff threshold never binds. Mean delta between modes is −0.04 pp. In this sweep, Jaccard and UPR capture the stagnation signal; the state diff added little additional signal.
Table 19: Exhaustion gate fire round distribution (global best config). Iter-RetGen always fires at round 4 (its maximum). †Pre-lobotomized.

| Harness | Task | Mean BL | Mean Lobo | Mean $b_t^{s}$ | Mean $b_t^{f}$ | Med BL | Med Lobo | Med $b_t^{s}$ | Med $b_t^{f}$ |
|---|---|---|---|---|---|---|---|---|---|
| IRCoT | LoC | 4.4 | 4.5 | 5.5 | 5.0 | 4 | 4 | 5 | 4 |
| | MuS | 5.8 | 5.9 | 6.5 | 6.0 | 6 | 6 | 6 | 6 |
| | SWE | 4.2 | 6.2 | 5.8 | 6.0 | 4 | 6 | 6 | 6 |
| ReAct | LoC | 3.8 | 4.5 | 5.0 | 4.9 | 3 | 4 | 5 | 5 |
| | MuS | 5.5 | 5.9 | 5.3 | 5.3 | 6 | 6 | 5 | 5 |
| | SWE | 5.1 | 6.1 | 5.3 | 5.2 | 5 | 7 | 5 | 5 |
| Iter-RetGen† | LoC | 4.0 | — | 4.0 | 4.0 | 4 | — | 4 | 4 |
| | MuS | 4.0 | — | 4.0 | 4.0 | 4 | — | 4 | 4 |
| | SWE | 4.0 | — | 4.0 | 4.0 | 4 | — | 4 | 4 |
| MemGPT | LoC | 5.4 | 3.4 | 3.3 | 3.3 | 5 | 3 | 3 | 3 |
| | MuS | 7.1 | 4.9 | 5.3 | 4.5 | 7 | 5 | 5 | 4 |
| | SWE | 5.7 | 3.6 | 4.1 | 3.7 | 6 | 3 | 4 | 4 |

#### Token overhead estimation methodology.
Final\-answer \(FA\) call tokens are estimated homogeneously for all gate variants \(programmatic and LLM\):
- *Prompt tokens:* exact via tiktoken on actual evidence + question + system prompt + template. Validated against actual API prompt tokens from an FA token ablation: 0.93–1.06× ratio across all 45 method/task/condition cells.
- *Completion tokens:* per-method/task/condition average from the FA token ablation (100 random triggered questions per cell, actual gpt-4o-mini API calls).
Gate decision overhead: programmatic gates use 0 tokens (metric-based, no LLM call). LLM gates use actual logged `gate_tokens` (per-round “should I stop?” calls). Total overhead = gate decision + FA call estimate. Using the same FA estimator for all gates ensures the only difference between variants is how often and where they fire, not how their cost is estimated.
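A minimal sketch of the prompt-token half of this estimator. The `o200k_base` encoding (the gpt-4o tokenizer family) is assumed, the prompt template and argument names are illustrative, and the whitespace fallback exists only to keep the sketch runnable without tiktoken installed:

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken's o200k_base encoding when available;
    otherwise fall back to a crude whitespace count (illustrative only)."""
    try:
        import tiktoken
        return len(tiktoken.get_encoding("o200k_base").encode(text))
    except ImportError:
        return len(text.split())

def estimate_fa_prompt_tokens(evidence: str, question: str,
                              system_prompt: str, template: str) -> int:
    """Assemble the final-answer prompt exactly as it would be sent,
    then count its tokens offline (no API call)."""
    prompt = template.format(system=system_prompt, evidence=evidence,
                             question=question)
    return count_tokens(prompt)

template = "{system}\n\nEvidence:\n{evidence}\n\nQuestion: {question}"
n = estimate_fa_prompt_tokens("Alice joined the team in 2021.",
                              "When did Alice join?",
                              "Answer using only the evidence.", template)
print(n > 0)
```

Counting the assembled prompt offline is what makes the 0.93–1.06× validation against real API prompt-token counts possible without issuing extra calls.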
#### Configuration clustering\.
L1 distance on the 45-cell mean-diff vector reveals two main clusters:

1. *Conservative discrete* ($U = 0.3$, $p = 2$–3): moderate fire rates, fewer negatives. Contains the global best (`f_j0.6_u0.3_p2`).
2. *Aggressive smooth* ($U = 0.10$–$0.15$, $p = 2$): high fire rates, more negatives on non-IRCoT methods but larger gains on IRCoT.

Three pairs of configurations are functionally identical (L1 = 0 pp): the `full` vs. `query_and_full` trigger mode makes no difference when the diff threshold is not binding. The global config is near-optimal: L1 distance between the global and per-method+condition diff vectors is only 0.64 pp per cell on average.
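The per-cell L1 distance used for this clustering reduces to a mean absolute difference over the 45 method/task/condition cells. A toy sketch, with 3 cells standing in for the 45 (the vectors are made up for illustration):

```python
def l1_per_cell(u, v):
    """Mean absolute difference (in pp) between two mean-improvement
    vectors, one entry per method/task/condition cell."""
    assert len(u) == len(v), "vectors must cover the same cells"
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

# Hypothetical 3-cell improvement vectors for two gate configurations.
global_best = [7.0, 2.0, -1.0]
per_method = [7.5, 2.5, -0.5]
print(l1_per_cell(global_best, per_method))  # -> 0.5
```

Two configurations with L1 = 0 pp per cell behave identically on every cell, which is how the three functionally identical pairs above were detected.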
## Appendix H Ablations
### H.1 Retriever type (alpha)
We test whether the $b_t$ pattern holds under extreme retriever configurations: ReAct with $\alpha = 1.0$ (pure BM25, keyword-only) and MemGPT with $\alpha = 0.0$ (pure dense embeddings, semantic-only). Table [20](https://arxiv.org/html/2605.07042#A8.T20) reports the key comparisons.
Table 20: $b_t$ effect (pp) under extreme retriever alpha. The recovery-from-lobotomized pattern persists across retriever types, although baseline comparisons vary by task and state format. Bold blue = $p < .05$.

The recovery-from-lobotomized pattern is strongest for freeform $b_t$: it improves over lobo in all six method-task rows and is significant in five. Structured $b_t$ is more mixed under these extreme retriever settings, especially on SWE-QA-Pro. Overall, the qualitative benefit of persistent belief state management is not tied to a single retriever type.
### H.2 Belief state capacity (K=6 vs. K=12)
We test whether the default curated belief state capacity of $K_{\text{target}} = 6$ items is a bottleneck by doubling to $K_{\text{target}} = 12$ on IRCoT and ReAct across all three tasks.
Table 21: K=6 vs. K=12 structured belief state ($b_t^{\text{struct}}$).

Five of six pairs show no significant difference. The one marginally significant result (ReAct SWE +2.5 pp, $p = 0.028$) is a small effect. K=6 is not a bottleneck: the reorganization step that curates the belief state to $K_{\text{target}} = 6$ items preserves the information needed for the task.
### H.3 Retrieval depth (k=3, k=5, k=10)
We test whether the $b_t$ pattern holds at different retrieval depths on IRCoT × LoCoMo.
Table 22: $b_t$ effect (pp) at different retrieval depths (IRCoT × LoCoMo). Recovery from lobotomized holds at all depths, while comparisons to baseline narrow as retrieval depth increases. Bold blue = $p < .05$.

Across retrieval depths, both $b_t$ formats significantly recover from lobotomized at every $k$. Higher retrieval depth ($k = 10$) improves all conditions, narrowing the gap between baseline and $b_t$.
### H.4 Belief State Reorganization – Token Overhead Analysis
To quantify the cost-benefit trade-off of the orchestrator’s reorganization step (which triggers when the belief state exceeds $K_{\text{trigger}} = 10$ items to curate it down to $K_{\text{target}} = 6$), we analyzed the token usage across all 16,280 episodes run under the explicit belief state ($b_t$) conditions.
We found that the reorganization prompt is highly efficient and only triggers when strictly necessary on complex, long\-horizon tasks:
- **Trigger Frequency:** 20.8% of episodes (3,385/16,280) triggered the reorganization step at least once.
- **Call Overhead:** For episodes that did require reorganization, the orchestrator made an average of 1.9 reorganization calls, with each call consuming on average 979 tokens.
- **Relative Cost:** The total mean overhead for reorganization was 1,824 tokens per triggered episode. This accounted for 11.4% of the total token consumption in those episodes.
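The reported percentages follow directly from these counts. A quick check of the arithmetic (all inputs are the paper's reported figures; note the per-episode averages are rounded in the text, which is why 1.9 calls × 979 tokens only approximately matches the 1,824-token mean):

```python
episodes_total = 16_280
episodes_triggered = 3_385
overhead_per_triggered_episode = 1_824   # mean reorganization tokens
triggered_episode_total_tokens = 16_010  # mean total tokens when triggered

trigger_rate_pct = 100 * episodes_triggered / episodes_total
relative_cost_pct = (100 * overhead_per_triggered_episode
                     / triggered_episode_total_tokens)
print(round(trigger_rate_pct, 1), round(relative_cost_pct, 1))  # 20.8 11.4
```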
While episodes that triggered reorganization had a substantially higher average total token cost (16,010 tokens) compared to those that did not (5,887 tokens), this difference is largely confounded by underlying task complexity. Episodes that exceed the $K = 10$ threshold are inherently longer-running, multi-hop searches that require more environment steps. The reorganization step itself remains lightweight (∼1,824 tokens), ensuring the agent’s context window stays bounded and focused without imposing a prohibitive cost.