# Inference-Time Budget Control for LLM Search Agents
Source: [https://arxiv.org/html/2605.05701](https://arxiv.org/html/2605.05701)
###### Abstract

LLM search agents increasingly rely on tools at inference time, but their trajectories are often constrained by hard limits on both tool calls and generated tokens. Under such dual budgets, better answers require not only stronger models, but also explicit control over which search action should receive the next budget unit and when the accumulated evidence is sufficient to commit a final answer. We study this problem in multi-hop question answering (QA) and formulate it as two-stage inference-time budget control. At search time, our controller assigns each feasible action a task-level Value-of-Information (VOI) score, defined as an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget, and uses this score to choose among retrieval, decomposition, and answer commitment. After search, a selective evidence-grounded finalizer compares the trajectory answer with a refined candidate and rewrites only when the residual error appears to be a low-risk answer-form error. Across four multi-hop QA benchmarks, three LLM backbones, and four budget levels, the method yields positive aggregate gains over four audited baselines under the same hard dual-budget protocol. Ablations show that search-time budget control, especially the budget-dependent penalty, provides the main performance gain, while answer-time control helps mainly when the retrieval path is already adequate. These results suggest that inference-time budget control for LLM search agents should govern both how budget is spent during search and how the final answer is committed.


## 1 Introduction

Large language models (LLMs) are increasingly deployed as search agents that use external tools during inference. ReAct [[69](https://arxiv.org/html/2605.05701#bib.bib69)], Toolformer [[43](https://arxiv.org/html/2605.05701#bib.bib43)], BATS [[28](https://arxiv.org/html/2605.05701#bib.bib28)], and Search-o1 [[21](https://arxiv.org/html/2605.05701#bib.bib21)] treat retrieval as part of the inference trajectory rather than a fixed preprocessing step. Broader agent systems add reflection, tree search, collaboration, and software tools [[46](https://arxiv.org/html/2605.05701#bib.bib46), [68](https://arxiv.org/html/2605.05701#bib.bib68), [7](https://arxiv.org/html/2605.05701#bib.bib7), [53](https://arxiv.org/html/2605.05701#bib.bib53), [78](https://arxiv.org/html/2605.05701#bib.bib78), [77](https://arxiv.org/html/2605.05701#bib.bib77), [30](https://arxiv.org/html/2605.05701#bib.bib30)]. In this paradigm, answer quality depends not only on model capability, but also on how the agent spends tool-call and token budgets while inference is still unfolding.

This paradigm also changes test-time scaling from a token-only problem into an inference-time control problem. Prior work studies how additional inference compute improves outputs through repeated sampling, self-refinement, and adaptive allocation [[56](https://arxiv.org/html/2605.05701#bib.bib56), [32](https://arxiv.org/html/2605.05701#bib.bib32), [47](https://arxiv.org/html/2605.05701#bib.bib47), [12](https://arxiv.org/html/2605.05701#bib.bib12), [25](https://arxiv.org/html/2605.05701#bib.bib25)]. For search agents, however, compute is not token-only: extra budget can create more branches, intermediate states, and low-yield actions. Systems such as Search-R1 [[17](https://arxiv.org/html/2605.05701#bib.bib17)], BATS [[28](https://arxiv.org/html/2605.05701#bib.bib28)], and recent browsing agents [[80](https://arxiv.org/html/2605.05701#bib.bib80), [57](https://arxiv.org/html/2605.05701#bib.bib57), [45](https://arxiv.org/html/2605.05701#bib.bib45), [15](https://arxiv.org/html/2605.05701#bib.bib15)] already operate in this tool-heavy regime. Deployment studies likewise show that unconstrained tool use leads to budget-inefficient failures, including redundant retrieval, wasted branching, and brittle context growth [[6](https://arxiv.org/html/2605.05701#bib.bib6), [72](https://arxiv.org/html/2605.05701#bib.bib72), [19](https://arxiv.org/html/2605.05701#bib.bib19), [31](https://arxiv.org/html/2605.05701#bib.bib31)]. Under explicit budgets, more inference-time compute and better answers are no longer equivalent.

This raises the central inference-time control problem of *how to decide, under explicit tool-call and output-token budgets, what to do next, and how to commit a final answer once search stops.* Existing budget-aware approaches either route across models to reduce cost [[6](https://arxiv.org/html/2605.05701#bib.bib6), [72](https://arxiv.org/html/2605.05701#bib.bib72)] or inject budget tracking into the agent loop [[28](https://arxiv.org/html/2605.05701#bib.bib28), [74](https://arxiv.org/html/2605.05701#bib.bib74)], but none of them explicitly controls both action allocation during search and answer resolution after search under the same hard dual-budget audit, which motivates this work. Three difficulties make this control problem nontrivial.

**Challenge 1: Budget must be allocated across heterogeneous actions.** At each step a search agent can retrieve evidence, decompose the question, or answer directly. These actions differ in cost and return, and the right choice depends on the evidence state and remaining budget. Infrastructure analyses show that naive scaling leads to over-search and diminishing returns [[19](https://arxiv.org/html/2605.05701#bib.bib19), [3](https://arxiv.org/html/2605.05701#bib.bib3)], while token-level budgeting cannot resolve this action-level trade-off under coupled tool-call and output-token budgets [[12](https://arxiv.org/html/2605.05701#bib.bib12), [25](https://arxiv.org/html/2605.05701#bib.bib25)].

**Challenge 2: Final-answer errors persist after successful search.** Even with the right evidence, the final answer can fail on exactness: yes/no polarity, binary choice, typed slots, or alias completion. Self-refinement [[32](https://arxiv.org/html/2605.05701#bib.bib32)], stepwise verification [[5](https://arxiv.org/html/2605.05701#bib.bib5), [48](https://arxiv.org/html/2605.05701#bib.bib48)], and reasoning-aware selection [[52](https://arxiv.org/html/2605.05701#bib.bib52)] can help, but they target open-ended generation rather than budgeted search where every extra call has a cost.

**Challenge 3: Rewriting introduces intervention risk.** Unconditional answer rewriting can erase bridge structure or reverse comparative semantics in multi-hop settings [[51](https://arxiv.org/html/2605.05701#bib.bib51), [50](https://arxiv.org/html/2605.05701#bib.bib50)]. Reflexion-style feedback loops [[46](https://arxiv.org/html/2605.05701#bib.bib46)] mitigate some errors but do not explicitly model the risk of damaging a correct answer. Thus, an answer controller must intervene only when the expected gain outweighs risk.

To address these challenges, we propose a training-free, two-stage inference-time controller built on a tree-search backbone inspired by BAVT [[23](https://arxiv.org/html/2605.05701#bib.bib23)]. At search time, the controller scores each feasible action by task-level VOI: an operational estimate of marginal task value per unit budget under the current trajectory state and remaining dual budget, rather than Shannon information gain or Bayesian posterior value. The score combines critic-derived value change, structural signals, cost normalization, budget penalty, and conservative guards to choose among retrieval, decomposition, and answer commitment. After search, an evidence-grounded finalizer rewrites only when the expected exactness gain outweighs the risk of damaging an otherwise adequate trajectory answer. The method therefore controls two distinct decision points: how budget is spent during search, and how the final answer is committed after search. We evaluate VOI under a shared hard dual-budget audit across four multi-hop QA benchmarks, three LLM backbones, and a four-level budget ladder. The results support explicit two-stage control while also exposing its boundary conditions across datasets, budgets, and backbones. Our main contributions are as follows:

- ❶ **Two-stage inference-time budget-control formulation.** We formulate budget-aware inference in LLM search agents as a two-stage inference-time control problem: search-time action allocation under coupled tool-call and output-token budgets, followed by answer-time finalization under intervention risk.
- ❷ **Task-level VOI control for search actions.** We introduce a training-free scorer that ranks retrieval, decomposition, and answer commitment by estimated marginal task value per unit budget. The scorer combines critic-derived progress, structural signals, cost normalization, budget-dependent penalty, and conservative guards.
- ❸ **Risk-controlled answer finalization.** We introduce an evidence-grounded finalizer that rewrites only for low-risk answer-form errors, such as yes/no polarity, binary choices, typed-slot mismatches, or supported factoid completion, and abstains when bridge or comparative reasoning remains unresolved.
- ❹ **Empirical analysis under hard dual-budget audit.** We evaluate across four benchmarks, three backbones, and four budget levels under the same hard dual-budget protocol. The results show positive aggregate gains and many low- and mid-budget improvements, while ablations identify the budget-dependent penalty as the dominant component and expose backbone- and dataset-dependent boundary conditions where the gains diminish.

## 2 Related Work

**Search agents and tool-augmented inference.** Tool use shifted LLMs from static text generation to interactive decision-making. ReAct [[69](https://arxiv.org/html/2605.05701#bib.bib69)] interleaves reasoning and acting, Toolformer [[43](https://arxiv.org/html/2605.05701#bib.bib43)] studies self-supervised tool use, and later systems add reflection, tree search, collaboration, environment grounding, and software tools [[46](https://arxiv.org/html/2605.05701#bib.bib46), [68](https://arxiv.org/html/2605.05701#bib.bib68), [7](https://arxiv.org/html/2605.05701#bib.bib7), [29](https://arxiv.org/html/2605.05701#bib.bib29), [53](https://arxiv.org/html/2605.05701#bib.bib53), [78](https://arxiv.org/html/2605.05701#bib.bib78), [77](https://arxiv.org/html/2605.05701#bib.bib77), [39](https://arxiv.org/html/2605.05701#bib.bib39), [63](https://arxiv.org/html/2605.05701#bib.bib63), [66](https://arxiv.org/html/2605.05701#bib.bib66), [30](https://arxiv.org/html/2605.05701#bib.bib30)]. Agent evaluation has also expanded toward real-world and proactive-assistance settings [[49](https://arxiv.org/html/2605.05701#bib.bib49)]. Search-oriented agents such as Search-R1 [[17](https://arxiv.org/html/2605.05701#bib.bib17)], BATS [[28](https://arxiv.org/html/2605.05701#bib.bib28)], Search-o1 [[21](https://arxiv.org/html/2605.05701#bib.bib21)], BrowseComp [[57](https://arxiv.org/html/2605.05701#bib.bib57)], and recent browsing systems [[62](https://arxiv.org/html/2605.05701#bib.bib62), [45](https://arxiv.org/html/2605.05701#bib.bib45), [79](https://arxiv.org/html/2605.05701#bib.bib79), [20](https://arxiv.org/html/2605.05701#bib.bib20), [11](https://arxiv.org/html/2605.05701#bib.bib11), [54](https://arxiv.org/html/2605.05701#bib.bib54), [10](https://arxiv.org/html/2605.05701#bib.bib10), [59](https://arxiv.org/html/2605.05701#bib.bib59), [36](https://arxiv.org/html/2605.05701#bib.bib36)] push toward open-ended information seeking. We instead study QA agents under explicit tool-call and output-token budgets.

![Refer to caption](https://arxiv.org/html/2605.05701v1/x1.png)

Figure 1: Problem sketch. The agent chooses the next search action from the observed trajectory, evidence, and remaining budget; future evidence and exact next cost are unknown.

**Test-time scaling and budget-aware agent scaling.** Test-time scaling work turns extra inference compute into better outputs through repeated sampling, self-refinement, early stopping, meta-generation, adaptive compute, efficient task adaptation, and token-budgeted reasoning [[56](https://arxiv.org/html/2605.05701#bib.bib56), [32](https://arxiv.org/html/2605.05701#bib.bib32), [22](https://arxiv.org/html/2605.05701#bib.bib22), [58](https://arxiv.org/html/2605.05701#bib.bib58), [47](https://arxiv.org/html/2605.05701#bib.bib47), [52](https://arxiv.org/html/2605.05701#bib.bib52), [12](https://arxiv.org/html/2605.05701#bib.bib12), [25](https://arxiv.org/html/2605.05701#bib.bib25), [14](https://arxiv.org/html/2605.05701#bib.bib14), [34](https://arxiv.org/html/2605.05701#bib.bib34), [2](https://arxiv.org/html/2605.05701#bib.bib2), [5](https://arxiv.org/html/2605.05701#bib.bib5), [33](https://arxiv.org/html/2605.05701#bib.bib33), [4](https://arxiv.org/html/2605.05701#bib.bib4), [38](https://arxiv.org/html/2605.05701#bib.bib38), [73](https://arxiv.org/html/2605.05701#bib.bib73)]. Budget-aware inference adds a harder question: not just how much extra compute to spend, but where. FrugalGPT [[6](https://arxiv.org/html/2605.05701#bib.bib6)] and EcoAssistant [[72](https://arxiv.org/html/2605.05701#bib.bib72)] route across models, while agent-scaling work [[80](https://arxiv.org/html/2605.05701#bib.bib80), [28](https://arxiv.org/html/2605.05701#bib.bib28), [19](https://arxiv.org/html/2605.05701#bib.bib19), [31](https://arxiv.org/html/2605.05701#bib.bib31)] argues that tool use changes the scaling problem itself. Related edge and mobile intelligence studies further highlight that LLM and AI deployment is constrained by communication, caching, model downloading, continual adaptation, and split or federated execution costs [[40](https://arxiv.org/html/2605.05701#bib.bib40), [41](https://arxiv.org/html/2605.05701#bib.bib41), [61](https://arxiv.org/html/2605.05701#bib.bib61), [60](https://arxiv.org/html/2605.05701#bib.bib60), [8](https://arxiv.org/html/2605.05701#bib.bib8)]. We follow this resource-aware framing at a finer granularity: step-level action control during search and selective intervention at finalization.

**Workflow search and tree-based control.** DSPy [[18](https://arxiv.org/html/2605.05701#bib.bib18)], PromptAgent [[55](https://arxiv.org/html/2605.05701#bib.bib55)], Promptbreeder [[9](https://arxiv.org/html/2605.05701#bib.bib9)], ADAS [[16](https://arxiv.org/html/2605.05701#bib.bib16)], AgentSquare [[44](https://arxiv.org/html/2605.05701#bib.bib44)], AFlow [[71](https://arxiv.org/html/2605.05701#bib.bib71)], AutoFlow [[24](https://arxiv.org/html/2605.05701#bib.bib24)], EvoFlow [[70](https://arxiv.org/html/2605.05701#bib.bib70)], MermaidFlow [[75](https://arxiv.org/html/2605.05701#bib.bib75)], and HyEvo [[64](https://arxiv.org/html/2605.05701#bib.bib64)] optimize prompts, module graphs, or workflows before deployment. Related LLM-based automatic algorithm-design work also uses language models to evolve heuristics or heuristic sets before inference [[26](https://arxiv.org/html/2605.05701#bib.bib26), [27](https://arxiv.org/html/2605.05701#bib.bib27)]. A separate line treats inference as structured search over states, thoughts, or plans, as in Tree of Thoughts [[68](https://arxiv.org/html/2605.05701#bib.bib68)], Graph of Thoughts [[1](https://arxiv.org/html/2605.05701#bib.bib1)], Language Agent Tree Search [[77](https://arxiv.org/html/2605.05701#bib.bib77)], BAVT [[23](https://arxiv.org/html/2605.05701#bib.bib23)], and CATS [[74](https://arxiv.org/html/2605.05701#bib.bib74)]. Our method belongs to this second family: given an existing search backbone, it controls which action spends the next budget unit and how the final answer is resolved.

**Answer verification and selective intervention.** Self-Refine [[32](https://arxiv.org/html/2605.05701#bib.bib32)] shows that post-hoc editing can improve outputs, while stepwise debugging [[76](https://arxiv.org/html/2605.05701#bib.bib76)], VerifAI [[48](https://arxiv.org/html/2605.05701#bib.bib48)], Reasoning-Aware Self-Consistency [[52](https://arxiv.org/html/2605.05701#bib.bib52)], and SETS [[5](https://arxiv.org/html/2605.05701#bib.bib5)] show that revision quality depends on trigger timing and intervention risk. Our final-answer layer applies that lesson to budgeted search QA: the finalizer keeps the trajectory answer unless the expected gain clearly outweighs the intervention risk, since aggressive answer replacement can erase bridge structure or comparative semantics even when the trajectory is adequate.

## 3 Problem Formulation

We study inference-time budget control for an LLM search agent answering a question $x$ under dual budgets $B=(B_{\mathrm{tool}},B_{\mathrm{tok}})$, where $B_{\mathrm{tool}}$ limits tool calls and $B_{\mathrm{tok}}$ limits budgeted output tokens. As illustrated in Figure [1](https://arxiv.org/html/2605.05701#S2.F1), the agent must choose the next search action from the currently observed trajectory, collected evidence, candidate answer state, and remaining budget, before future evidence and the exact next token cost are known. The agent faces two coupled decisions. During search, it must decide how to allocate the remaining budget across retrieval, decomposition, and answer commitment. After search terminates, it must decide whether to keep the trajectory answer or replace it with a refined answer derived from the collected evidence. The problem is therefore to control both budget allocation during search and answer commitment after search.

Let $\mu$ denote the search-time decision rule. Given $x$ and $B$, the search process produces a trajectory and base answer $(\mathcal{T},a_{\mathrm{base}})=\mathcal{M}(x,B;\mu)$. The decision rule is applied online. Before choosing the next operation, the agent observes the question, the partial trajectory, the evidence and decompositions collected so far, any candidate answer states, and the remaining tool-call and token budgets. It does not know future evidence, final answer quality, or the exact output-token cost of the next operation. Therefore, action selection uses an estimated local budget charge available before execution, while the realized tool-call and output-token costs are debited after the operation is executed.
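A minimal sketch of this estimate-then-debit accounting is below. It is our own illustration of the bookkeeping described above, not the paper's released code, and the class and method names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DualBudget:
    """Hypothetical ledger for the dual budget b_t = (b_tool, b_tok)."""
    tool_remaining: int  # remaining tool calls
    tok_remaining: int   # remaining budgeted output tokens

    def allows(self, est_tool: int, est_tok: int) -> bool:
        # Pre-execution check: action selection only sees an *estimated* charge.
        return est_tool <= self.tool_remaining and est_tok <= self.tok_remaining

    def debit(self, used_tool: int, used_tok: int) -> None:
        # Post-execution: the *realized* costs are charged after the operation runs.
        self.tool_remaining -= used_tool
        self.tok_remaining -= used_tok

    def exhausted(self) -> bool:
        return self.tool_remaining <= 0 or self.tok_remaining <= 0
```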

The completed trajectory $\mathcal{T}$ records the executed operations, collected evidence, intermediate states, candidate answers, and realized costs. After the trajectory is complete, a refinement function $g_{\mathrm{ref}}$ constructs a candidate refined answer $a_{\mathrm{ref}}=g_{\mathrm{ref}}(\mathcal{T},a_{\mathrm{base}})$, and a second policy $\sigma$ chooses the final prediction $\hat{a}$ from $\{a_{\mathrm{base}},a_{\mathrm{ref}}\}$. This stage is intentionally narrow: it is not unrestricted rewriting, but selective intervention between preserving the trajectory answer and replacing it with a refined one. This distinction matters in multi-hop QA, where an unnecessary rewrite can remove bridge entities, alter comparison structure, or replace a precise answer with a broader but less faithful one. We use $R(a,y)\in[0,1]$ as an abstract bounded QA reward, where $a$ is a predicted answer and $y$ is the gold answer. In experiments, answer quality is reported using exact match and token-level F1; the formulation only requires a bounded scalar reward. The expectation is over the evaluation distribution of $(x,y)$ and the search trajectory induced by the policies. The cost functions $C_{\mathrm{tool}}(\mathcal{T})$ and $C_{\mathrm{tok}}(\mathcal{T})$ denote realized trajectory costs. We formulate the overall problem as

$$\max_{\mu,\sigma}\;\mathbb{E}\!\left[R(\hat{a},y)\right]\quad\text{s.t.}\quad C_{\mathrm{tool}}(\mathcal{T})\leq B_{\mathrm{tool}},\; C_{\mathrm{tok}}(\mathcal{T})\leq B_{\mathrm{tok}},\;\mathbb{E}\!\left[L_{\mathrm{harm}}(\hat{a},a_{\mathrm{base}},y)\right]\leq\rho_{\mathrm{harm}},\tag{1}$$

where $\rho_{\mathrm{harm}}$ is the allowable expected harm from answer replacement.

The harm term measures the damage caused by replacing the base answer with a worse final answer:

$$L_{\mathrm{harm}}(\hat{a},a_{\mathrm{base}},y)=\max\{0,\,R(a_{\mathrm{base}},y)-R(\hat{a},y)\}.$$

This quantity is zero when the final decision preserves or improves answer quality and positive only when finalization makes the answer worse than the original trajectory answer. Under this formulation, the first stage is an online budget-allocation problem during search, and the second stage is a risk-controlled answer-finalization problem after search. Our method instantiates these two decisions using a search-time controller and an answer-time selector with abstention.
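As a worked example of the harm term (our illustration; `r_base` and `r_final` stand in for the bounded reward $R$, e.g., EM or token-level F1):

```python
def harm(r_base: float, r_final: float) -> float:
    """L_harm = max{0, R(a_base, y) - R(a_hat, y)}."""
    return max(0.0, r_base - r_final)

# Keeping or improving on the base answer incurs zero harm ...
assert harm(r_base=0.6, r_final=0.9) == 0.0
# ... while rewriting a correct answer into a wrong one incurs harm 1.0.
assert harm(r_base=1.0, r_final=0.0) == 1.0
```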

## 4 Method

![Refer to caption](https://arxiv.org/html/2605.05701v1/x2.png)

Figure 2: Two-stage budget control with task-level VOI. Stage 1 uses a controller based on the task-level VOI score to choose whether the next step should retrieve, decompose, or answer under the remaining dual budget. Stage 2 finalizes the answer conservatively, rewriting only when the case is safe and the expected gain outweighs rewrite risk.

Our method instantiates the two decisions in Section [3](https://arxiv.org/html/2605.05701#S3) with two lightweight layers on top of a generic search backbone. Figure [2](https://arxiv.org/html/2605.05701#S4.F2) summarizes the full pipeline. In Stage 1, the backbone maintains candidate trajectories and evidence, while the VOI controller scores feasible actions under the remaining dual budget and selects the next operation. The selected operation is executed by the tree-search substrate, which updates the trace and debits realized tool-call and token costs. In Stage 2, the completed trace yields a base answer and a refined candidate; the finalizer keeps the base answer unless a safe, evidence-supported refinement passes the gain–risk check. The search-time layer chooses among evidence acquisition, question decomposition, and answer commitment under the remaining budget. The answer-time layer then chooses between a base answer $a_{\mathrm{base}}$ extracted from the trace and a refined answer $a_{\mathrm{ref}}$ constructed from the same trace. The final output is $\hat{a}$.

### 4.1 Search-Time Budget Allocation

#### Task-level VOI control.

At each search step, the controller ranks feasible actions using a task-level VOI score: an operational estimate of marginal task value per unit budget under the current search state and remaining dual budget. This score is used only for action selection; it is not Shannon information gain, Bayesian posterior value, or a learned value model.

We instantiate the online decision rule in Section [3](https://arxiv.org/html/2605.05701#S3) as follows. At step $t$, the observed search status is represented by $s_t$, the remaining budget by $b_t=(b_{\mathrm{tool},t},b_{\mathrm{tok},t})$, and the feasible action set by $\mathcal{K}^{\mathrm{feas}}_t\subseteq\mathcal{K}$, where $\mathcal{K}=\{\text{Search},\text{Decompose},\text{Answer}\}$. For each feasible action $k$, the controller assigns a pre-execution charge vector $g_t(k)=(g_{\mathrm{tool},t}(k),g_{\mathrm{tok},t}(k))^{\top}$ and extracts fixed features from $s_t$ summarizing unresolved evidence, compositional structure, answer readiness, stagnation, loop pressure, and premature-answer risk.

The controller scores each feasible action by first forming a budget-shaped utility, then normalizing it as value per budget, and finally applying guards. First, it forms

$$u_t(k)=\underbrace{\widehat{\Delta}_t(k)}_{\text{progress signal}}+\underbrace{\Psi_t(k)}_{\text{structural signals}}-\underbrace{\Pi_t(k;\,b_t)}_{\text{budget-dependent penalty}},\tag{2}$$

where $\widehat{\Delta}_t(k)$ is a critic-derived progress signal, $\Psi_t(k)$ aggregates structural signals such as bridge structure, loop pressure, readiness, and early-answer risk, and $\Pi_t(k;b_t)$ is a signed budget-dependent penalty term: it is positive for Search and Decompose under tight budgets. $\Pi_t(k;b_t)$ serves as a tractable proxy for the oracle budget shadow-cost $\lambda_t^{\star\top}g_t(k)$, with approximation error $\varepsilon_t^{\Pi}$ formalized in Assumption [C.2](https://arxiv.org/html/2605.05701#A3.SS2). Second, this task-level action utility is clipped and normalized into the task-level VOI score used for action selection:

$$r_t(k)=\frac{[u_t(k)]_{+}}{d_t(k;\,b_t)+\epsilon},\tag{3}$$

where $[\cdot]_{+}=\max\{0,\cdot\}$, $\epsilon>0$ is a small constant, and $d_t(k;b_t)>0$ is an action-specific budget scale derived from the pre-execution charge and current budget state. We call $r_t(k)$ the task-level VOI score: an operational estimate of marginal task value per unit budget under the current state and remaining dual budget.

###### Definition 4.1 (Task-level VOI score).

For feasible $k\in\mathcal{K}^{\mathrm{feas}}_t$, the task-level VOI score is $r_t(k)=[u_t(k)]_{+}/(d_t(k;b_t)+\epsilon)$, where $u_t(k)$ is the task-level action utility and $d_t(k;b_t)$ is the budget-aware action scale. This score estimates marginal task value per unit budget for action selection under the current search state and remaining dual budget.

In the released implementation, $u_t(k)$ is instantiated by fixed coefficients over explicit trajectory features, $d_t(k;b_t)$ by action-specific budget-aware cost terms, and the executable score $\widetilde{\mathcal{J}}_t(k)$ by these normalized scores after conservative guard adjustments.

Third, action-specific guards produce the executable score:

$$\widetilde{\mathcal{J}}_t(k)=\mathfrak{G}_t\!\bigl(k;\,r_t,\,s_t,\,b_t\bigr),\tag{4}$$

where $\mathfrak{G}_t$ is a deterministic guard operator that masks or adjusts raw scores, including premature-answer suppression, factoid-decomposition suppression, and minimum-search enforcement on compositional cases. The controller selects $k_t=\arg\max_{k\in\mathcal{K}_t^{\mathrm{feas}}}\widetilde{\mathcal{J}}_t(k)$.

This design is adaptive, training-free, and interpretable: scores are recomputed at every step, no extra value model is learned, and each term corresponds to an explicit search or budget signal. Appendix [A](https://arxiv.org/html/2605.05701#A1) provides empirical controller diagnostics.
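To make the three-step scoring concrete, the sketch below instantiates Eqs. (2)–(4) for a single step. It is our own illustration under simplifying assumptions: the feature extraction, coefficients, and guard set of the released implementation are richer, and every helper name here is hypothetical.

```python
import math

EPS = 1e-6
ACTIONS = ("search", "decompose", "answer")

def voi_scores(progress, structural, penalty, scale):
    """Eqs. (2)-(3): u = progress + structural - penalty, r = [u]_+ / (d + eps).

    Each argument maps an action name to its per-step signal or budget scale."""
    scores = {}
    for k in ACTIONS:
        u = progress[k] + structural[k] - penalty[k]  # budget-shaped utility u_t(k)
        scores[k] = max(0.0, u) / (scale[k] + EPS)    # task-level VOI score r_t(k)
    return scores

def apply_guards(scores, state):
    """Eq. (4): a deterministic guard operator (masking shown; adjustment omitted)."""
    guarded = dict(scores)
    if state.get("premature_answer_risk"):  # suppress early answer commitment
        guarded["answer"] = -math.inf
    if state.get("factoid_question"):       # suppress needless decomposition
        guarded["decompose"] = -math.inf
    if state.get("min_search_unmet"):       # enforce minimum search on compositional cases
        guarded["answer"] = -math.inf
    return guarded

def select_action(progress, structural, penalty, scale, state):
    """k_t = argmax over feasible actions of the guard-adjusted score."""
    guarded = apply_guards(voi_scores(progress, structural, penalty, scale), state)
    return max(guarded, key=guarded.get)
```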

### 4.2 Answer-Time Finalization

After search terminates, the system extracts a base answer $a_{\mathrm{base}}$ from the best trajectory and constructs a refined candidate $a_{\mathrm{ref}}$ from the same trace $\mathcal{T}$. Finalization is narrow but important: the search path may be largely correct while the final answer still contains local answer-form errors, such as wrong yes/no polarity, wrong binary choice, incomplete alias, or typed-slot mismatch. Unnecessary rewriting remains risky because fluent refinements can erase bridge entities, distort comparisons, or replace precise answers with broader but less faithful ones.

We therefore formulate finalization as selective intervention over the two-candidate set $\{a_{\mathrm{base}},a_{\mathrm{ref}}\}$. From the completed trace and the candidate pair, we construct a consistency feature vector $z=z(\mathcal{T},a_{\mathrm{base}},a_{\mathrm{ref}})$. This vector records evidence relevant to safe revision: support type, slot type, contradiction indicators, and unresolved bridge or comparative reasoning. For exposition, finalization is written as a gain–risk threshold over this feature representation; in implementation, it is a deterministic selector over explicit feature conditions, not a learned scorer or extra LLM call:

$$F(z)=G(z)-\eta H(z),\qquad \hat{a}=\begin{cases}a_{\mathrm{ref}},&\text{if }z\in\mathcal{S}_{\mathrm{safe}}\text{ and }F(z)\geq\tau,\\ a_{\mathrm{base}},&\text{otherwise},\end{cases}\tag{5}$$

where $G(z)$ measures the potential gain, $H(z)$ measures the intervention risk, $\eta>0$ controls risk aversion, $\tau$ is an abstention threshold, and $\mathcal{S}_{\mathrm{safe}}$ is the set of reliable revision cases. In the released implementation, the abstract map $g$ is instantiated by a fixed construction $a_{\mathrm{ref}}=g_{\mathrm{ref}}(\mathcal{T},a_{\mathrm{base}})$, and the selector is a deterministic rule over explicit feature conditions. Appendix [C](https://arxiv.org/html/2605.05701#A3) characterizes the exact rule in Proposition [C.17](https://arxiv.org/html/2605.05701#A3.Thmtheorem17).

The safe set $\mathcal{S}_{\mathrm{safe}}$ excludes unresolved bridge structure, unresolved comparative semantics, and missing direct support, where rewriting is most likely to harm a correct trajectory answer. Thus the finalizer is a conservative answer selector, not a generic editor: it intervenes only when the remaining error appears local to answer-form correction rather than retrieval or path selection. It adds no tool calls and issues no additional LLM call during finalization.
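A minimal sketch of the gain–risk selector in Eq. (5) is below, for illustration only: the released implementation is a deterministic rule over explicit feature conditions, and the fields, thresholds, and names here are hypothetical stand-ins.

```python
from dataclasses import dataclass

@dataclass
class ConsistencyFeatures:
    """Hypothetical stand-in for the feature vector z = z(T, a_base, a_ref)."""
    unresolved_bridge: bool       # bridge entity still unresolved in the trace
    unresolved_comparative: bool  # comparative semantics still unresolved
    has_direct_support: bool      # refined answer is directly supported by evidence
    gain: float                   # G(z): estimated exactness gain from rewriting
    risk: float                   # H(z): estimated risk of damaging a_base

def finalize(a_base: str, a_ref: str, z: ConsistencyFeatures,
             eta: float = 1.0, tau: float = 0.0) -> str:
    """Eq. (5): return a_ref only on safe, positive gain-risk cases; else abstain."""
    in_safe_set = (not z.unresolved_bridge
                   and not z.unresolved_comparative
                   and z.has_direct_support)          # z in S_safe
    if in_safe_set and z.gain - eta * z.risk >= tau:  # F(z) >= tau
        return a_ref
    return a_base                                     # keep the trajectory answer
```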

### 4.3 End-to-End Inference

At inference time, the search procedure maintains a frontier of trajectories while the controller repeatedly selects the next feasible operation using Eqs. ([2](https://arxiv.org/html/2605.05701#S4.E2))–([4](https://arxiv.org/html/2605.05701#S4.E4)). It debits realized tool-call and output-token costs after each operation and stops when the frontier or budget is exhausted, or when the search procedure's termination condition fires. The system then extracts $a_{\mathrm{base}}$, constructs $a_{\mathrm{ref}}$, and applies Eq. ([5](https://arxiv.org/html/2605.05701#S4.E5)). Full pseudocode is in Appendix [B](https://arxiv.org/html/2605.05701#A2).
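Condensed into code, the loop looks roughly as follows. This is our own compression of the pipeline (the paper's full pseudocode is in Appendix B): `backbone`, `controller`, and `finalizer` stand in for the tree-search substrate, the Stage 1 scorer, and the Stage 2 selector, and all of their methods are hypothetical interfaces.

```python
def run_agent(question, budget, backbone, controller, finalizer):
    """Two-stage inference sketch: VOI-controlled search, then selective finalization."""
    state = backbone.init(question)
    while backbone.has_frontier(state) and not budget.exhausted():
        actions = backbone.feasible_actions(state, budget)  # K_t^feas under b_t
        if not actions:
            break
        k = controller.select(state, budget, actions)       # argmax of Eqs. (2)-(4)
        result = backbone.execute(k, state)                 # run the chosen operation
        budget.debit(result.tool_calls, result.out_tokens)  # charge realized costs
        state = backbone.update(state, result)
        if backbone.terminated(state):                      # e.g., answer committed
            break
    a_base = backbone.extract_base_answer(state)
    a_ref = backbone.refine_answer(state, a_base)           # a_ref = g_ref(T, a_base)
    return finalizer.select(state, a_base, a_ref)           # gain-risk rule of Eq. (5)
```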

#### Theoretical support.

Appendix [C](https://arxiv.org/html/2605.05701#A3) provides the formal support for the two controller layers. For search-time allocation, Appendix [C.2](https://arxiv.org/html/2605.05701#A3.SS2) shows that the task-level utility in Eq. ([2](https://arxiv.org/html/2605.05701#S4.E2)) locally approximates an oracle budget-charged one-step lookahead value, yielding ranking consistency under a margin condition and a one-step value-gap bound when guards are compatible with the utility ranking. For answer-time finalization, Appendix [C.4](https://arxiv.org/html/2605.05701#A3.SS4) derives the gain–risk threshold in Eq. ([5](https://arxiv.org/html/2605.05701#S4.E5)) from the harm-constrained objective in Eq. ([1](https://arxiv.org/html/2605.05701#S3.E1)). The analysis supports budget-aware action selection and conservative answer replacement, but does not claim global optimality of the full search tree.

## 5 Experiments

### 5.1 Setup

We instantiate the LLM search agent with three backbones: Qwen3-32B [[65](https://arxiv.org/html/2605.05701#bib.bib65)], Qwen3.5-122B [[42](https://arxiv.org/html/2605.05701#bib.bib42)], and GPT-5.4-Mini [[35](https://arxiv.org/html/2605.05701#bib.bib35)], chosen to span different model scales, families, and deployment regimes rather than tuning to a single LLM. We evaluate on four multi-hop QA benchmarks: HotpotQA [[67](https://arxiv.org/html/2605.05701#bib.bib67)], 2WikiMultihopQA [[13](https://arxiv.org/html/2605.05701#bib.bib13)], MuSiQue [[50](https://arxiv.org/html/2605.05701#bib.bib50)], and Bamboogle [[37](https://arxiv.org/html/2605.05701#bib.bib37)]. All methods use the same retrieval configuration: Search-R1 retrieval, question-only queries, and top_k=5. We evaluate under four dual-budget levels, *low*, *lower-mid*, *upper-mid*, and *high*, corresponding to (1, 100), (2, 200), (2, 300), and (3, 500), where each pair denotes the tool-call cap and output-token cap.

The audit is strict at the example level: all five methods are scored under the same hard tool/output-token budget constraint, and any example exceeding either constraint is counted as failed. We compare VOI with BAVT [[23](https://arxiv.org/html/2605.05701#bib.bib23)], BATS [[28](https://arxiv.org/html/2605.05701#bib.bib28)], AFlow [[71](https://arxiv.org/html/2605.05701#bib.bib71)], and Search-o1 [[21](https://arxiv.org/html/2605.05701#bib.bib21)]. We report EM, token-level F1, average tool calls, and average budget output tokens; accounting details, cross-backbone confidence intervals, and feasible-only usage diagnostics are provided in Appendix [H](https://arxiv.org/html/2605.05701#A8).
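For concreteness, a sketch of our reading of this example-level audit is below (the budget ladder is taken from the setup above; the function and argument names are hypothetical):

```python
# Budget ladder from the setup: (tool-call cap, output-token cap).
BUDGET_LEVELS = {
    "low": (1, 100),
    "lower-mid": (2, 200),
    "upper-mid": (2, 300),
    "high": (3, 500),
}

def audit_example(used_tool: int, used_tok: int, level: str,
                  em: float, f1: float) -> tuple[float, float]:
    """Hard dual-budget audit: exceeding either cap scores the example as failed."""
    cap_tool, cap_tok = BUDGET_LEVELS[level]
    if used_tool > cap_tool or used_tok > cap_tok:
        return 0.0, 0.0
    return em, f1

# A correct answer that overspends tool calls still counts as a failure.
assert audit_example(3, 250, "upper-mid", em=1.0, f1=1.0) == (0.0, 0.0)
```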

### 5.2 Main Results

![Refer to caption](https://arxiv.org/html/2605.05701v1/x3.png)

Figure 3: Cross-model budget scaling curves across four datasets. Rows report Qwen3-32B and GPT-5.4-Mini; columns report the four multi-hop QA benchmarks. All methods use the shared hard dual-budget audit. Qwen3.5-122B results are reported in Appendix [F](https://arxiv.org/html/2605.05701#A6) as a backbone-sensitivity analysis.

Figure [3](https://arxiv.org/html/2605.05701#S5.F3) shows the main budget-scaling pattern on two representative backbones. On Qwen3-32B, whose full EM/F1 table is reported in Appendix [E](https://arxiv.org/html/2605.05701#A5), VOI is best-F1 in 7 of 16 dataset-budget cells and tied best in 2 additional cells. It improves over AFlow, BATS, Search-o1, and BAVT in 14/16, 10/16, 16/16, and 15/16 cells, respectively. The strongest gains appear in low and lower-mid budgets, where choosing the next action carefully is most important because wasted retrieval or premature answering quickly exhausts the trajectory. The pattern is positive but not a dominance claim. BATS remains strong in several upper-budget regimes, especially when broader beam-style evidence accumulation is useful, and AFlow leads on 2WikiMultihopQA at *lower-mid*. These exceptions are important because they show that budget control interacts with dataset structure and budget level. The main claim is therefore not that VOI wins every cell, but that explicit action-level budget control improves most audited settings under the same hard budget constraints.

On GPT-5.4-Mini, VOI remains consistently competitive despite the stronger base model. This is useful evidence that the controller is not merely compensating for a weak backbone. The gains are smaller in some cells than on Qwen3-32B, which is expected: stronger backbones can already resolve more answer-form and evidence-selection errors without as much controller help. Still, the curves show that the two-stage controller continues to provide value when the model family changes.

We report Qwen3.5-122B separately in Appendix [F](https://arxiv.org/html/2605.05701#A6). That backbone exhibits a more mixed regime, with several cells led by BATS or BAVT at higher budgets. We treat it as a backbone-sensitivity case rather than the main trend. For completeness, Appendix [H](https://arxiv.org/html/2605.05701#A8) reports cross-backbone macro deltas over all three backbones, four benchmarks, and four budgets; the aggregate effect remains favorable, while the detailed curves confirm substantial backbone dependence.

### 5.3 Final-Answer Control Improves Exactness

Final-answer control is most useful when residual errors are answer-form errors rather than search-depth failures, which is clearest on Bamboogle. Evidence-grounded finalization improves the measured Bamboogle budget ladder without changing tool usage: F1 rises from 0.4382 to 0.4628 at *low*, from 0.5786 to 0.6047 at *upper-mid*, and from 0.5056 to 0.5576 at *high*. An additional (1, 200) probe shows the same trend, improving from 0.4631 to 0.4911 on Bamboogle and from 0.4907 to 0.5174 on 2WikiMultihopQA. Thus, answer-time control yields visible gains when the trajectory already contains adequate evidence but the final answer still has an answer-form error.

Successful Stage 2 interventions are typed-slot corrections, binary-choice repairs, yes/no polarity repairs, and supported factoid completions. This matches the design target: the finalizer is a conservative exactness layer, not a generic rewrite module. It creates no new evidence and launches no additional search or LLM call during finalization. Its gains therefore concentrate on answer-form errors, while path-level failures remain outside its scope. This also explains the weaker contribution on MuSiQue, where residual errors more often involve unresolved bridge structure rather than a local answer span.

### 5.4 Search-Time Controller Component Ablation

Section [4.1](https://arxiv.org/html/2605.05701#S4.SS1) defines the Stage 1 pipeline: task-level action utility, value-per-cost normalization, and guard-adjusted executable scoring in Eqs. ([2](https://arxiv.org/html/2605.05701#S4.E2))–([4](https://arxiv.org/html/2605.05701#S4.E4)). We ablate these components under the *upper-mid* budget on Qwen3-32B by removing the budget-dependent penalty $\Pi_t(k;b_t)$, replacing $r_t(k)$ with $[u_t(k)]_{+}$, removing $\Psi_t(k)$, or bypassing $\mathfrak{G}_t$. All variants use the same search procedure, Search-R1 retrieval, question-only queries, top_k=5, and hard dual-budget audit.
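The four ablation variants above correspond to simple toggles on the Stage 1 scoring sketch from Section 4.1; schematically (our illustration only, with hypothetical names; the released ablation code may differ):

```python
EPS = 1e-6

# Illustrative mapping from the ablation rows onto the Stage 1 scoring sketch
# of Section 4.1. The "w/o guards" variant additionally skips the guard
# operator of Eq. (4) before taking the argmax.
def variant_scores(progress, structural, penalty, scale, variant):
    """Per-action scores under one Stage 1 ablation toggle."""
    s = dict.fromkeys(structural, 0.0) if variant == "w/o struct." else structural
    p = dict.fromkeys(penalty, 0.0) if variant == "w/o penalty" else penalty
    scores = {}
    for k in progress:
        u = progress[k] + s[k] - p[k]                   # Eq. (2): shaped utility
        if variant == "w/o norm.":
            scores[k] = max(0.0, u)                     # rank by [u]_+ alone
        else:
            scores[k] = max(0.0, u) / (scale[k] + EPS)  # Eq. (3): value per budget
    return scores
```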

Table 1: Stage-1 component ablation on Qwen3-32B (*upper-mid* budget). Each cell reports EM/F1 under the hard dual-budget audit; higher is better. The best EM/F1 per benchmark is in bold.

| Benchmark | BAVT | w/o penalty | w/o norm. | w/o struct. | w/o guards | VOI (Full) |
|---|---|---|---|---|---|---|
| HotpotQA | 0.34/0.41 | 0.35/0.40 | 0.34/0.38 | 0.32/0.37 | 0.34/0.39 | **0.39/0.47** |
| 2WikiMultihopQA | 0.40/0.45 | 0.38/0.43 | 0.44/0.50 | 0.39/0.45 | 0.42/0.48 | **0.54/0.63** |
| MuSiQue | 0.26/0.34 | 0.24/0.30 | 0.25/0.33 | 0.30/0.39 | 0.29/0.38 | **0.36/0.43** |
| Bamboogle | 0.33/0.42 | 0.31/0.40 | 0.35/0.43 | 0.28/0.37 | 0.28/0.36 | **0.46/0.56** |

Table [1](https://arxiv.org/html/2605.05701#S5.T1) reports the benchmark-level results. The full controller achieves the best F1 on all four benchmarks, while every component removal reduces the macro average. The largest drop comes from removing the budget-dependent penalty, showing that the remaining-budget term is not a cosmetic penalty but a central part of the action-allocation rule. Removing normalization, structural signals, or guards also hurts, although the affected benchmark differs. This pattern supports the method design: the improvement is tied to the three-stage task-level VOI controller rather than to a generic prompt constraint, search-procedure change, or answer-time finalization effect.

#### Inference-time cost.

Table [2](https://arxiv.org/html/2605.05701#S5.T2) reports an end-to-end wall-clock timing probe under the Qwen3-32B *upper-mid* setting with 100 examples per dataset. The full VOI controller reduces mean inference time relative to BAVT on all four benchmarks, from 20.91 s to 15.23 s on average, a 27.2% reduction. Thus the added deterministic controller does not introduce a visible latency burden in this setting; its action choices often shorten trajectories by avoiding low-value operations. The 2WikiMultihopQA row also shows a boundary case where some ablated variants are faster than the full controller, indicating dataset-dependent runtime behavior.

Table 2: End-to-end inference time under Qwen3-32B and the *upper-mid* budget. Each cell reports mean wall-clock seconds per example (lower is better); parenthesized values give the relative change against BAVT. The best mean time per row is in bold.

| Benchmark | BAVT | w/o penalty | w/o norm. | w/o struct. | w/o guards | VOI (Full) |
|---|---|---|---|---|---|---|
| Bamboogle | 19.92 | 16.25 (-18.4%) | 16.38 (-17.8%) | 17.15 (-13.9%) | 16.34 (-18.0%) | **14.41 (-27.7%)** |
| HotpotQA | 22.01 | 17.44 (-20.7%) | 17.79 (-19.1%) | 17.57 (-20.1%) | 17.45 (-20.7%) | **14.41 (-34.5%)** |
| MuSiQue | 20.23 | 16.65 (-17.7%) | 16.98 (-16.1%) | 16.47 (-18.6%) | 16.22 (-19.8%) | **14.79 (-26.9%)** |
| 2WikiMultihopQA | 21.49 | 15.81 (-26.4%) | **15.25 (-29.0%)** | 15.53 (-27.7%) | 15.64 (-27.2%) | 17.30 (-19.5%) |
| Average | 20.91 | 16.54 (-20.9%) | 16.60 (-20.6%) | 16.68 (-20.2%) | 16.41 (-21.5%) | **15.23 (-27.2%)** |

### 5.5 Two-Stage Component Ablation

Figure [4](https://arxiv.org/html/2605.05701#S6.F4) isolates the two stages under the same audited setting. Stage 1 alone improves F1 on all four benchmarks, confirming that most of the gain comes from search-time budget allocation. The full method yields relative F1 gains of +5.7%, +11.8%, +14.7%, and +18.4% over BAVT on HotpotQA, 2WikiMultihopQA, MuSiQue, and Bamboogle, respectively. Stage 2 contributes no additional gain on HotpotQA, but accounts for 13.4%, 41.9%, and 27.8% of the total gain on 2WikiMultihopQA, MuSiQue, and Bamboogle. This supports the intended division of labor: Stage 1 provides broad budget-aware search control, while Stage 2 acts as a sparse exactness layer when the retrieved trajectory is already mostly adequate. The finalizer is sparse, conservative, and exactness-oriented: it avoids broad rewriting and provides targeted corrections when the retrieved trajectory is adequate but the answer still has an answer-form error.
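One plausible way to read the Stage 2 share figures above is as the fraction of the full method's F1 gain over BAVT that is not already delivered by Stage 1 alone. A sketch of that arithmetic (our reconstruction, not the paper's evaluation script):

```python
def stage2_share(f1_bavt: float, f1_stage1: float, f1_full: float) -> float:
    """Fraction of the total F1 gain over BAVT attributable to Stage 2."""
    total_gain = f1_full - f1_bavt
    return (f1_full - f1_stage1) / total_gain if total_gain > 0 else 0.0

# e.g., a 41.9% Stage 2 share means Stage 1 alone realizes ~58.1% of the full gain.
assert abs(stage2_share(f1_bavt=0.50, f1_stage1=0.5581, f1_full=0.60) - 0.419) < 1e-3
```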

## 6 Discussion and Limitations

![Refer to caption](https://arxiv.org/html/2605.05701v1/x4.png)

Figure 4: Two-stage component ablation. Top: absolute F1 for BAVT, search-time-only VOI, and full VOI. Bottom: relative F1 gain over BAVT, split into search-time and final-answer contributions; labels report total gain and the Stage 2 share.

Our results show that budget-aware search is not a monotone scaling problem. Larger tool-call and output-token budgets can improve evidence coverage, but may also introduce redundant search or noisier finalization. This is visible on 2WikiMultihopQA and Bamboogle, where middle budgets sometimes outperform *high*. The task-level VOI controller mitigates this trade-off by making budget spending state-dependent, but does not remove the tension between exploration and commitment.

The main boundary appears in high-budget regimes and across backbones. BATS remains competitive in several upper-budget cells, and Qwen3.5-122B shows a more mixed regime in Appendix [F](https://arxiv.org/html/2605.05701#A6) because stronger backbones narrow the relative gap, consistent with the diminishing-returns regime expected for any inference-time controller layered on a more capable base model. The finalizer is also exactness-oriented rather than reasoning-complete: it can repair local answer-form errors, but cannot recover from wrong retrieval paths or unresolved bridge relations. Extending the controller to stronger retrieval backends, richer query rewriting, and longer-horizon browsing remains future work.

## 7 Conclusion

This paper studies tool-augmented LLM search under explicit tool-call and output-token budgets. We formulate the problem as two-stage constrained inference: a search-time decision over how to spend the next unit of budget, followed by an answer-time decision over whether a refined answer is worth the intervention risk. Our method instantiates this formulation with a training-free controller based on the task-level VOI score over Search, Decompose, and Answer, together with a conservative evidence-grounded finalizer that rewrites only under low-risk exactness conditions. Across four multi-hop QA benchmarks, four budget levels, and three LLM backbones, the results show that budget control is useful but not uniform. The search-time controller provides the main gains by improving action allocation under strict budgets, while the finalizer adds a sparse exactness benefit when the trajectory already contains adequate evidence. The remaining failures clarify the boundary of the approach: more budget is not always better, backbone behavior matters, and answer-time control cannot repair incorrect retrieval paths.

**Broader impacts.** Budget-aware search can improve the deployability of tool-augmented agents by reducing unnecessary tool use, making inference cost more predictable, and exposing when an agent chooses to search, decompose, or answer. These properties are useful for applications where latency, token usage, or external tool calls must be controlled. At the same time, more efficient search agents could also be used in harmful information-seeking workflows. We therefore emphasize hard budget audits, explicit accounting, conservative finalization, and failure analysis, so that budgeted agent behavior is easier to inspect rather than easier to hide.

## References

- Besta et al. [2024] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, et al. Graph of thoughts: Solving elaborate problems with large language models. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 38, pages 17682–17690, 2024.
- Brown et al. [2024] Bradley C. A. Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher Re, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. *CoRR*, abs/2407.21787, 2024. [10.48550/ARXIV.2407.21787](https://arxiv.org/doi.org/10.48550/ARXIV.2407.21787).
- Cemri et al. [2025] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, et al. Why do multi-agent LLM systems fail? *arXiv preprint arXiv:2503.13657*, 2025.
- Chen et al. [2025] Jianhao Chen, Zishuo Xun, Bocheng Zhou, Han Qi, Qiaosheng Zhang, Yang Chen, Wei Hu, Yuzhong Qu, Wanli Ouyang, and Shuyue Hu. Do we truly need so many samples? Multi-LLM repeated sampling efficiently scales test-time compute. *CoRR*, abs/2504.00762, 2025. [10.48550/ARXIV.2504.00762](https://arxiv.org/doi.org/10.48550/ARXIV.2504.00762).
- Chen et al. [2025b] Jiefeng Chen, Jie Ren, Xinyun Chen, Chengrun Yang, Ruoxi Sun, and Sercan Ö. Arık. SETS: Leveraging self-verification and self-correction for improved test-time scaling. *CoRR*, abs/2501.19306, 2025. [10.48550/ARXIV.2501.19306](https://arxiv.org/doi.org/10.48550/ARXIV.2501.19306).
- Chen et al. [2023] Lingjiao Chen, Matei Zaharia, and James Zou. FrugalGPT: How to use large language models while reducing cost and improving performance. *arXiv preprint arXiv:2305.05176*, 2023.
- Chen et al. [2023b] Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors. In *The Twelfth International Conference on Learning Representations*, 2023.
- Ding et al. [2026] Zihao Ding, Beining Wu, Jun Huang, and Shiwen Mao. Application-aware twin-in-the-loop planning for federated split learning over wireless edge networks. *arXiv preprint arXiv:2604.26105*, 2026.
- Fernando et al. [2024] Chrisantha Fernando, Dylan Banarse, Henryk Michalewski, Simon Osindero, and Tim Rocktaschel. Promptbreeder: Self-referential self-improvement via prompt evolution. In *ICML*. OpenReview.net, 2024.
- Gao et al. [2025] Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, and Yi Wu. Beyond ten turns: Unlocking long-horizon agentic search with large-scale asynchronous RL. *CoRR*, abs/2508.07976, 2025. [10.48550/ARXIV.2508.07976](https://arxiv.org/doi.org/10.48550/ARXIV.2508.07976).
- Han et al. [2025] Rujun Han, Yanfei Chen, Zoey CuiZhu, Lesly Miculicich, Guan Sun, Yuanjun Bi, Weiming Wen, Hui Wan, Chunfeng Wen, Solene Maitre, George Lee, Vishy Tirumalashetty, Emily Xue, Zizhao Zhang, Salem Haykal, Burak Gokturk, Tomas Pfister, and Chen-Yu Lee. Deep researcher with test-time diffusion. *CoRR*, abs/2507.16075, 2025. [10.48550/ARXIV.2507.16075](https://arxiv.org/doi.org/10.48550/ARXIV.2507.16075).
- Han et al. [2025b] Tingxu Han, Zhenting Wang, Chunrong Fang, Shiyu Zhao, Shiqing Ma, and Zhenyu Chen. Token-budget-aware LLM reasoning. In *Findings of the Association for Computational Linguistics: ACL 2025*, pages 24842–24855, 2025.
- Ho et al. [2020] Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In *Proceedings of the 28th International Conference on Computational Linguistics*, pages 6609–6625, 2020.
- Hu et al. [2025] Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, and Yuguang Fang. Distribution-aligned decoding for efficient LLM task adaptation. *arXiv preprint arXiv:2509.15888*, 2025.
- Hu et al. [2026] Senkang Hu, Yong Dai, Yuzhi Zhao, Yihang Tao, Yu Guo, Zhengru Fang, Sam Tak Wu Kwong, and Yuguang Fang. Optimizing agentic reasoning with retrieval via synthetic semantic information gain reward. *arXiv preprint arXiv:2602.00845*, 2026.
- Hu et al. [2024] Shengran Hu, Cong Lu, and Jeff Clune. Automated design of agentic systems, 2024.
- Jin et al. [2025] Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Zamani, and Jiawei Han. Search-R1: Training LLMs to reason and leverage search engines with reinforcement learning. *arXiv preprint arXiv:2503.09516*, 2025.
- Khattab et al. [2024] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. DSPy: Compiling declarative language model calls into state-of-the-art pipelines. In *The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024*. OpenReview.net, 2024.
- Kim et al. [2025] Jiin Kim, Byeongjun Shin, Jinha Chung, and Minsoo Rhu. The cost of dynamic reasoning: Demystifying AI agents and test-time scaling from an AI infrastructure perspective. *arXiv preprint arXiv:2506.04301*, 2025.
- Li et al. [2025] Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, and Jingren Zhou. WebSailor: Navigating super-human reasoning for web agent. *CoRR*, abs/2507.02592, 2025. [10.48550/ARXIV.2507.02592](https://arxiv.org/doi.org/10.48550/ARXIV.2507.02592).
- Li et al. [2025b] Xiaoxi Li, Guanting Dong, Jiajie Jin, Yuyao Zhang, Yujia Zhou, Yutao Zhu, Peitian Zhang, and Zhicheng Dou. Search-o1: Agentic search-enhanced large reasoning models. *CoRR*, abs/2501.05366, 2025. [10.48550/ARXIV.2501.05366](https://arxiv.org/doi.org/10.48550/ARXIV.2501.05366).
- Li et al. [2024] Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Xinglin Wang, Bin Sun, Heda Wang, and Kan Li. Escape sky-high cost: Early-stopping self-consistency for multi-step reasoning. In *The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024*. OpenReview.net, 2024.
- Li et al. [2026] Yushu Li, Wenlong Deng, Jiajin Li, and Xiaoxiao Li. Spend less, reason better: Budget-aware value tree search for LLM agents, 2026.
- Li et al. [2024b] Zelong Li, Shuyuan Xu, Kai Mei, Wenyue Hua, Balaji Rama, Om Raheja, Hao Wang, He Zhu, and Yongfeng Zhang. AutoFlow: Automated workflow generation for large language model agents. *CoRR*, abs/2407.12821, 2024.
- Li et al. [2025c] Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, and Zhifang Sui. SelfBudgeter: Adaptive token allocation for efficient LLM reasoning. *arXiv preprint arXiv:2505.11274*, 2025.
- Liu et al. [2024] Fei Liu, Xialiang Tong, Mingxuan Yuan, Xi Lin, Fu Luo, Zhenkun Wang, Zhichao Lu, and Qingfu Zhang. Evolution of heuristics: Towards efficient automatic algorithm design using large language model. *arXiv preprint arXiv:2401.02051*, 2024.
- Liu et al. [2026] Fei Liu, Yilu Liu, Qingfu Zhang, Tong Xialiang, and Mingxuan Yuan. EoH-S: Evolution of heuristic set using LLMs for automated heuristic design. In *Proceedings of the AAAI Conference on Artificial Intelligence*, volume 40, pages 37090–37098, 2026.
- Liu et al. [2025] Tengxiao Liu, Zifeng Wang, Jin Miao, I-Hung Hsu, Jun Yan, Jiefeng Chen, Rujun Han, Fangyuan Xu, Yanfei Chen, Ke Jiang, Samira Daruki, Yi Liang, William Yang Wang, Tomas Pfister, and Chen-Yu Lee. Budget-aware tool-use enables effective agent scaling, 2025.
- Liu et al. [2024b] Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents. In *The Twelfth International Conference on Learning Representations*, 2024.
- Lu et al. [2025] Pan Lu, Bowen Chen, Sheng Liu, Rahul Thapa, Joseph Boen, and James Zou. OctoTools: An agentic framework with extensible tools for complex reasoning. *arXiv preprint arXiv:2502.11271*, 2025.
- Lu et al. [2025b] Ruofan Lu, Yichen Li, and Yintong Huo. Exploring autonomous agents: A closer look at why they fail when completing tasks. *arXiv preprint arXiv:2508.13143*, 2025.
- Madaan et al. [2023] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-Refine: Iterative refinement with self-feedback. In *Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10-16, 2023*, 2023.
- Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel J. Candes, and Tatsunori Hashimoto. s1: Simple test-time scaling. *CoRR*, abs/2501.19393, 2025. [10.48550/ARXIV.2501.19393](https://arxiv.org/doi.org/10.48550/ARXIV.2501.19393).
- Nayab et al. [2024] Sania Nayab, Giulio Rossolini, Giorgio C. Buttazzo, Nicolamaria Manes, and Fabrizio Giacomelli. Concise thoughts: Impact of output length on LLM reasoning and cost. *CoRR*, abs/2407.19825, 2024. [10.48550/ARXIV.2407.19825](https://arxiv.org/doi.org/10.48550/ARXIV.2407.19825).
- OpenAI [2026] OpenAI. Introducing GPT-5.4 mini and nano. [https://openai.com/index/introducing-gpt-5-4-mini-and-nano/](https://openai.com/index/introducing-gpt-5-4-mini-and-nano/), March 2026. Accessed: 2026-05-05.
- Pang et al. [2025] Xianghe Pang, Shuo Tang, Rui Ye, Yuwen Du, Yaxin Du, and Siheng Chen. BrowseMaster: Towards scalable web browsing via tool-augmented programmatic agent pair. *CoRR*, abs/2508.09129, 2025. [10.48550/ARXIV.2508.09129](https://arxiv.org/doi.org/10.48550/ARXIV.2508.09129).
- Press et al. [2023] Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. Measuring and narrowing the compositionality gap in language models. In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 5687–5711, 2023.
- Pu et al. [2025] Xiao Pu, Michael Saxon, Wenyue Hua, and William Yang Wang. THOUGHTTERMINATOR: Benchmarking, calibrating, and mitigating overthinking in reasoning models. *CoRR*, abs/2504.13367, 2025. [10.48550/ARXIV.2504.13367](https://arxiv.org/doi.org/10.48550/ARXIV.2504.13367).
- Qiao et al. [2024] Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor Jiang, Chengfei Lv, and Huajun Chen. AutoAct: Automatic agent learning from scratch via self-planning. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics*, 2024.
- Qu et al. [2025] Guanqiao Qu, Qiyuan Chen, Wei Wei, Zheng Lin, Xianhao Chen, and Kaibin Huang. Mobile edge intelligence for large language models: A contemporary survey. *IEEE Communications Surveys & Tutorials*, 27(6):3820–3860, 2025.
- Qu et al. [2026] Guanqiao Qu, Zheng Lin, Qian Chen, Jian Li, Fangming Liu, Xianhao Chen, and Kaibin Huang. TrimCaching: Parameter-sharing edge caching for AI model downloading. *IEEE Transactions on Networking*, 2026.
- Qwen Team [2026] Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL [https://qwen.ai/blog?id=qwen3.5](https://qwen.ai/blog?id=qwen3.5).
- Schick et al\. \[2023\]Timo Schick, Jane Dwivedi\-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom\.Toolformer: Language models can teach themselves to use tools\.*Advances in Neural Information Processing Systems*, 36:68539–68551, 2023\.
- Shang et al\. \[2024\]Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li\.Agentsquare: Automatic llm agent search in modular design space, 2024\.
- Shen et al\. \[2025\]Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, and Aviral Kumar\.Thinking vs\. doing: Agents that reason by scaling test\-time interaction, 2025\.
- Shinn et al\. \[2023\]Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\.Reflexion: Language agents with verbal reinforcement learning\.*Advances in neural information processing systems*, 36:8634–8652, 2023\.
- Snell et al\. \[2025\]Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar\.Scaling llm test\-time compute optimally can be more effective than scaling parameters for reasoning\.In*The Thirteenth International Conference on Learning Representations*, 2025\.
- Tang et al\. \[2024\]Nan Tang, Chenyu Yang, Ju Fan, Lei Cao, Yuyu Luo, and Alon Y\. Halevy\.Verifai: Verified generative AI\.In*CIDR*\. www\.cidrdb\.org, 2024\.
- Tang et al\. \[2026\]Yuanbo Tang, Huaze Tang, Tingyu Cao, Lam Nguyen, Anping Zhang, Xinwen Cao, Chunkang Liu, Wenbo Ding, and Yang Li\.Proagentbench: Evaluating llm agents for proactive assistance with real\-world data\.*arXiv preprint arXiv:2602\.04482*, 2026\.
- Trivedi et al\. \[2022\]Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal\.Musique: Multi\-hop questions via single\-hop question composition\.*Transactions of the Association for Computational Linguistics*, 10:539–554, 2022\.[10\.1162/tacl\_a\_00475](https://arxiv.org/doi.org/10.1162/tacl_a_00475)\.
- Trivedi et al\. \[2023\]Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal\.Interleaving retrieval with chain\-of\-thought reasoning for knowledge\-intensive multi\-step questions\.In Anna Rogers, Jordan Boyd\-Graber, and Naoaki Okazaki, editors,*Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pages 10014–10037, Toronto, Canada, July 2023\. Association for Computational Linguistics\.[10\.18653/v1/2023\.acl\-long\.557](https://arxiv.org/doi.org/10.18653/v1/2023.acl-long.557)\.
- Wan et al\. \[2025\]Guangya Wan, Yuqi Wu, Jie Chen, and Sheng Li\.Reasoning aware self\-consistency: Leveraging reasoning paths for efficient LLM sampling\.In*Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 \- Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 \- May 4, 2025*, pages 3613–3635\. Association for Computational Linguistics, 2025\.[10\.18653/V1/2025\.NAACL\-LONG\.184](https://arxiv.org/doi.org/10.18653/V1/2025.NAACL-LONG.184)\.
- Wang et al\. \[2023\]Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar\.Voyager: An open\-ended embodied agent with large language models\.*arXiv preprint arXiv:2305\.16291*, 2023\.
- Wang et al\. \[2025\]Ningning Wang, Xavier Hu, Pai Liu, He Zhu, Yue Hou, Heyuan Huang, Shengyu Zhang, Jian Yang, Jiaheng Liu, Ge Zhang, Changwang Zhang, Jun Wang, Yuchen Eleanor Jiang, and Wangchunshu Zhou\.Efficient agents: Building effective agents while reducing cost\.*CoRR*, abs/2508\.02694, 2025\.[10\.48550/ARXIV\.2508\.02694](https://arxiv.org/doi.org/10.48550/ARXIV.2508.02694)\.
- Wang et al\. \[2024\]Xinyuan Wang, Chenxi Li, Zhen Wang, Fan Bai, Haotian Luo, Jiayou Zhang, Nebojsa Jojic, Eric P\. Xing, and Zhiting Hu\.Promptagent: Strategic planning with language models enables expert\-level prompt optimization\.In*ICLR*\. OpenReview\.net, 2024\.
- Wang et al\. \[2022\]Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\.Self\-consistency improves chain of thought reasoning in language models\.*arXiv preprint arXiv:2203\.11171*, 2022\.
- Wei et al\. \[2025\]Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese\.Browsecomp: A simple yet challenging benchmark for browsing agents\.*arXiv preprint arXiv:2504\.12516*, 2025\.
- Welleck et al\. \[2024\]Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, and Zaid Harchaoui\.From decoding to meta\-generation: Inference\-time algorithms for large language models\.*Trans\. Mach\. Learn\. Res\.*, 2024, 2024\.
- Wong et al\. \[2025\]Ryan Wong, Jiawei Wang, Junjie Zhao, Li Chen, Yan Gao, Long Zhang, Xuan Zhou, Zuo Wang, Kai Xiang, Ge Zhang, Wenhao Huang, Yang Wang, and Ke Wang\.Widesearch: Benchmarking agentic broad info\-seeking\.*CoRR*, abs/2508\.07999, 2025\.[10\.48550/ARXIV\.2508\.07999](https://arxiv.org/doi.org/10.48550/ARXIV.2508.07999)\.
- Wu and Huang \[2026\]Beining Wu and Jun Huang\.Lifecycle\-aware federated continual learning in mobile autonomous systems\.*arXiv preprint arXiv:2604\.20745*, 2026\.
- Wu et al\. \[2026\]Beining Wu, Zihao Ding, and Jun Huang\.A review of continual learning in edge ai\.*IEEE Transactions on Network Science and Engineering*, 2026\.
- Wu et al\. \[2025\]Jialong Wu, Baixuan Li, Runnan Fang, Wenbiao Yin, Liwen Zhang, Zhengwei Tao, Dingchu Zhang, Zekun Xi, Yong Jiang, Pengjun Xie, Fei Huang, and Jingren Zhou\.Webdancer: Towards autonomous information seeking agency\.*CoRR*, abs/2505\.22648, 2025\.[10\.48550/ARXIV\.2505\.22648](https://arxiv.org/doi.org/10.48550/ARXIV.2505.22648)\.
- Xie et al\. \[2024\]Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al\.Osworld: Benchmarking multimodal agents for open\-ended tasks in real computer environments\.*Advances in Neural Information Processing Systems*, 37:52040–52094, 2024\.
- Xu et al\. \[2026\]Beibei Xu, Yutong Ye, Chuyun Shen, Yingbo Zhou, Cheng Chen, and Mingsong Chen\.Hyevo: Self\-evolving hybrid agentic workflows for efficient reasoning, 2026\.
- Yang et al\. \[2025\]An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al\.Qwen3 technical report\.*arXiv preprint arXiv:2505\.09388*, 2025\.
- Yang et al\. \[2024\]John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press\.Swe\-agent: Agent\-computer interfaces enable automated software engineering\.*Advances in Neural Information Processing Systems*, 37:50528–50652, 2024\.
- Yang et al\. \[2018\]Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D\. Manning\.Hotpotqa: A dataset for diverse, explainable multi\-hop question answering\.In*Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing*, pages 2369–2380, 2018\.
- Yao et al\. \[2023\]Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan\.Tree of thoughts: Deliberate problem solving with large language models\.*Advances in neural information processing systems*, 36:11809–11822, 2023\.
- Yao et al\. \[20232\]Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao\.React: Synergizing reasoning and acting in language models\.In*The Eleventh International Conference on Learning Representations*, 20232\.
- Zhang et al\. \[2025\]Guibin Zhang, Kaijie Chen, Guancheng Wan, Heng Chang, Hong Cheng, Kun Wang, Shuyue Hu, and Lei Bai\.Evoflow: Evolving diverse agentic workflows on the fly, 2025\.
- Zhang et al\. \[2024\]Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, and Chenglin Wu\.Aflow: Automating agentic workflow generation, 2024\.
- Zhang et al\. \[2023\]Jieyu Zhang, Ranjay Krishna, Ahmed H Awadallah, and Chi Wang\.Ecoassistant: Using llm assistant more affordably and accurately\.*arXiv preprint arXiv:2310\.03046*, 2023\.
- Zhang et al\. \[20252\]Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma\.What, how, where, and how well? A survey on test\-time scaling in large language models\.*CoRR*, abs/2503\.24235, 20252\.[10\.48550/ARXIV\.2503\.24235](https://arxiv.org/doi.org/10.48550/ARXIV.2503.24235)\.
- Zhang et al\. \[20253\]Zihao Zhang, Hui Wei, Kenan Jiang, Shijia Pan, Shu Kai, and Fei Liu\.Cost\-awareness in tree\-search llm planning: A systematic study, 20253\.
- Zheng et al\. \[2025\]Chengqi Zheng, Jianda Chen, Yueming Lyu, Wen Zheng Terence Ng, Haopeng Zhang, Yew\-Soon Ong, Ivor Tsang, and Haiyan Yin\.Mermaidflow: Redefining agentic workflow generation via safety\-constrained evolutionary programming, 2025\.
- Zhong et al\. \[2024\]Li Zhong, Zilong Wang, and Jingbo Shang\.Debug like a human: A large language model debugger via verifying runtime execution step by step\.In*ACL \(Findings\)*, pages 851–870\. Association for Computational Linguistics, 2024\.
- Zhou et al\. \[2023\]Andy Zhou, Kai Yan, Michal Shlapentokh\-Rothman, Haohan Wang, and Yu\-Xiong Wang\.Language agent tree search unifies reasoning acting and planning in language models\.*arXiv preprint arXiv:2310\.04406*, 2023\.
- Zhou et al\. \[20232\]Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al\.Webarena: A realistic web environment for building autonomous agents\.*arXiv preprint arXiv:2307\.13854*, 20232\.
- Zhu et al\. \[2025\]He Zhu, Tianrui Qin, King Zhu, Heyuan Huang, Yeyi Guan, Jinxiang Xia, Yi Yao, Hanhao Li, Ningning Wang, Pai Liu, Tianhao Peng, Xin Gui, Xiaowan Li, Yuhui Liu, Yuchen Eleanor Jiang, Jun Wang, Changwang Zhang, Xiangru Tang, Ge Zhang, Jian Yang, Minghao Liu, Xitong Gao, Jiaheng Liu, and Wangchunshu Zhou\.Oagents: An empirical study of building effective agents\.*CoRR*, abs/2506\.15741, 2025\.[10\.48550/ARXIV\.2506\.15741](https://arxiv.org/doi.org/10.48550/ARXIV.2506.15741)\.
- Zhu et al\. \[20252\]King Zhu, Hanhao Li, Siwei Wu, Tianshun Xing, Dehua Ma, Xiangru Tang, Minghao Liu, Jian Yang, Jiaheng Liu, Yuchen Eleanor Jiang, et al\.Scaling test\-time compute for llm agents\.*arXiv preprint arXiv:2506\.12928*, 20252\.

## Appendix A Controller Behavior Analysis

This appendix complements Section [4](https://arxiv.org/html/2605.05701#S4) with descriptive evidence on the realized behavior of the search-time controller. These results are not additional algorithmic components; they visualize how the implemented controller scores feasible actions and activates guards across budget and compositionality states.

![Figure 5](https://arxiv.org/html/2605.05701v1/x5.png)

Figure 5: Empirical behavior of the search-time controller. (a) Mean controller scores for Search, Decompose, and Answer across realized remaining-budget bands. (b) The mean Decompose score increases with question compositionality, consistent with the intended role of decomposition in resolving bridge structure. (c) The below-floor guard activates mainly for high-compositionality states, reflecting the conservative minimum-search behavior used to avoid premature answer commitment. All quantities are descriptive statistics computed from realized audited trajectories.

Table 3: Summary of the search-time controller.

| Action | Main positive signals | Main suppressing signals |
| --- | --- | --- |
| Search | unresolved evidence, missing new support, low closure, loop pressure | high budget pressure, repeated search saturation |
| Decompose | high compositionality, unresolved bridge structure, stagnation after search | factoid-like question, repeated decomposition saturation |
| Answer | high closure, strong answer support, candidate answer present, tighter budget | early-answer risk, unresolved bridge structure, weak support |

As summarized in Table [3](https://arxiv.org/html/2605.05701#A1.T3), the search-time controller is a fixed rule-based scorer over three actions: Search, Decompose, and Answer. At each decision step, it evaluates four aspects of the current search state. First, it estimates whether important evidence remains unresolved and whether another retrieval step is still likely to help. Second, it checks whether the question appears structurally compositional, for example through unresolved bridge structure, in which case decomposition becomes more valuable than ordinary search. Third, it evaluates answer readiness through closure, answer support, and the presence of a candidate answer span, while penalizing premature commitment when support remains weak. Fourth, it applies budget pressure and action-specific costs so that expensive actions become less attractive as the remaining tool and token budget shrinks. In addition to these smooth score terms, the controller uses a small number of hard guards: it enforces a minimum amount of search on compositional cases, suppresses unnecessary decomposition on factoid-like questions, and blocks overly early answer commitment when structural risk remains high.
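To make this concrete, the following is a minimal Python sketch of such a rule-based scorer with hard guards. Every name, coefficient, and threshold below (`SearchState`, `W`, `MIN_SEARCH_FLOOR`, and so on) is an illustrative assumption, not the released implementation; only the signal structure mirrors Table 3.

```python
from dataclasses import dataclass

@dataclass
class SearchState:
    closure: float            # [0, 1]: estimated answer readiness
    compositionality: float   # [0, 1]: estimated bridge/multi-hop structure
    unresolved_evidence: float
    has_candidate_answer: bool
    search_steps: int
    budget_frac: float        # remaining budget / initial budget

# Illustrative fixed coefficients: (gain weight, budget-cost weight).
W = {
    "search":    (1.0, 0.8),
    "decompose": (0.9, 1.0),
    "answer":    (1.2, 0.1),
}

def score_actions(s: SearchState) -> dict:
    """Deterministic rule-based scores mirroring Table 3's signals."""
    pressure = 1.0 - s.budget_frac  # budget pressure grows as budget shrinks
    scores = {
        "search": W["search"][0] * s.unresolved_evidence
                  - W["search"][1] * pressure,
        "decompose": W["decompose"][0] * s.compositionality
                     - W["decompose"][1] * pressure,
        "answer": W["answer"][0] * s.closure
                  + (0.3 if s.has_candidate_answer else -0.3)
                  + 0.5 * pressure,  # answering becomes more attractive late
    }
    # Hard guards, applied after the smooth score terms.
    MIN_SEARCH_FLOOR = 2
    if s.compositionality > 0.7 and s.search_steps < MIN_SEARCH_FLOOR:
        scores["answer"] = float("-inf")     # below-floor guard
    if s.compositionality < 0.2:
        scores["decompose"] = float("-inf")  # factoid suppression
    return scores
```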

## Appendix B Inference Algorithm and Prompt Modules

**Algorithm 1: Two-Stage Budget-Aware Inference**

Input: question $x$, tool budget $B_{\mathrm{tool}}$, output-token budget $B_{\mathrm{tok}}$.

1. Initialize the search procedure, the frontier $\mathcal{F}$, the trace $\mathcal{T}$, and the remaining budget $b_0 = (B_{\mathrm{tool}}, B_{\mathrm{tok}})$.
2. While $\mathcal{F} \neq \emptyset$, $b_t$ is not exhausted, and termination has not fired:
   1. Select the current frontier state $s_t$ according to the search procedure.
   2. Construct the feasible operation set $\mathcal{K}^{\mathrm{feas}}_{t}$.
   3. For each feasible action $k \in \mathcal{K}^{\mathrm{feas}}_{t}$, compute action-specific costs and controller features from the current search state, then score $k$ through the three-stage pipeline: task-level action utility $u_t(k)$ (Eq. ([2](https://arxiv.org/html/2605.05701#S4.E2))), normalized raw score $r_t(k)$ (Eq. ([3](https://arxiv.org/html/2605.05701#S4.E3))), and guarded executable score $\widetilde{\mathcal{J}}_t(k)$ (Eq. ([4](https://arxiv.org/html/2605.05701#S4.E4))).
   4. Select the highest-scoring feasible action $k_t$ and execute it.
   5. Update the trace $\mathcal{T}$, the frontier $\mathcal{F}$, and the remaining budget $b_{t+1}$.
3. Extract the base answer $a_{\mathrm{base}}$ from the best available trajectory.
4. Construct the refined candidate $a_{\mathrm{ref}}$ from the completed trace $\mathcal{T}$.
5. Build $z = z(\mathcal{T}, a_{\mathrm{base}}, a_{\mathrm{ref}})$ and apply the finalization rule (Eq. ([5](https://arxiv.org/html/2605.05701#S4.E5))).
6. If $z \in \mathcal{S}_{\mathrm{safe}}$ and $F(z) \geq \tau$, return $a_{\mathrm{ref}}$; otherwise return $a_{\mathrm{base}}$.
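For concreteness, here is a self-contained toy rendering of the Stage-1 loop in Python. The feature dynamics, coefficients, and the stubbed execute step are invented for illustration; only the control flow mirrors Algorithm 1, and the Stage-2 selection rule is sketched separately in Section B.4.

```python
from dataclasses import dataclass, field

ACTIONS = ("search", "decompose", "answer")

@dataclass
class Trace:
    steps: list = field(default_factory=list)
    candidate: str = ""

def guarded_score(action, closure, comp, budget_frac):
    """Stand-in for the guarded executable score J~_t(k); toy coefficients."""
    pressure = 1.0 - budget_frac
    base = {
        "search": (1.0 - closure) - 0.8 * pressure,
        "decompose": comp - 1.0 * pressure,
        "answer": closure + 0.5 * pressure,
    }[action]
    # Guard: block premature answer commitment on compositional questions.
    if action == "answer" and comp > 0.7 and closure < 0.5:
        return float("-inf")
    return base

def run_stage1(b_tool: int, comp: float = 0.8) -> Trace:
    trace, closure, used = Trace(), 0.1, 0
    while used < b_tool:
        frac = 1.0 - used / b_tool
        k = max(ACTIONS, key=lambda a: guarded_score(a, closure, comp, frac))
        if k == "answer":          # commit: stop spending budget
            break
        trace.steps.append(k)      # stubbed execute: one tool call
        used += 1
        closure = min(1.0, closure + 0.3)  # evidence accumulates
        trace.candidate = "toy answer span"
    return trace

trace = run_stage1(b_tool=4)
print(trace.steps, "->", trace.candidate or "budget backstop answer")
```

On the toy dynamics above, the controller searches while evidence is thin, switches to decomposition when ordinary search stagnates under rising budget pressure, and commits an answer once closure clears the guard.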

### B.1 System Prompt

Our method reuses the same single-backbone search procedure as BAVT for planning, generation, and value estimation. The central difference is not that we replace the generator with a new workflow, but that we intervene *before* generation with an explicit search-time controller. Concretely, the controller scores feasible actions through the three-stage pipeline in Eqs. ([2](https://arxiv.org/html/2605.05701#S4.E2))–([4](https://arxiv.org/html/2605.05701#S4.E4)) and selects

$$k_t^{\star} = \arg\max_{k \in \mathcal{K}^{\mathrm{feas}}_{t}} \widetilde{\mathcal{J}}_t(k),$$

and only then materializes the selected action as a deterministic prompt insertion. The generator therefore still follows a standard system-prompt plus user-turn format, but its behavior is constrained by a controller-selected instruction whose choice is determined by the task-level VOI score outside the LLM. This is the key interface between the theory and the prompt design: the theory decides *which* action should spend the next unit of budget, and the prompt enforces *how* that chosen action must be executed.

System Prompt: Budget\-Aware Generator with VOI\-Guided Control

You are a precise AI assistant for budget-constrained multi-hop question answering.

Rules:

- Think step by step and explicitly show your reasoning.
- In each turn, do exactly one action: either call one tool or provide the final answer.
- You may use tools multiple times across turns if needed.
- Follow the current dynamic instruction exactly; it defines what kind of action is allowed in this turn.
- If the instruction requires retrieval, you must produce exactly one <tool_call>.
- If the instruction requires answering, you must not call tools.
- Keep the final answer to a single short grounded span, not a full sentence.
- If an entity may be ambiguous, disambiguate the query with a type or role already supported by the question or evidence.

<budget>
Tool Budget Used: {tool_budget_used}/{tool_budget_total}
Tool Budget Remaining: {b_tool}
Output Token Budget Used: {token_budget_used}/{token_budget_total}
Output Token Budget Remaining: {b_token}
</budget>

{plan_context}

{dynamic_instruction}

### B.2 Planning Module

As in BAVT, the agent maintains an abstract plan before and during search. The role of the planning module is not to guess the answer, but to identify the missing pieces of evidence, keep track of bridge facts, and prevent the search from revisiting already exhausted directions. The new contribution of our method is not the planner itself, but the fact that the remaining budget explicitly changes which kind of step the agent is allowed to take next.

Planning Module: High\-Level Search Plan

Before taking stepwise actions, write a brief high-level plan.

- Identify the entities, relations, or attributes that still need evidence.
- Separate direct answer clues from bridge facts that require an intermediate hop.
- Keep the plan abstract; do not invent facts.
- Update the plan only when new evidence changes the remaining uncertainty.

### B.3 Stage 1: Search-Time Budget Allocation

Stage 1 is the main search-time contribution of our method. Instead of relying only on BAVT's widen-versus-deepen routing, we explicitly allocate the next unit of budget among three action types: Search, Decompose, and Answer. The action decision is made by the three-stage pipeline based on the task-level VOI score in Eqs. ([2](https://arxiv.org/html/2605.05701#S4.E2))–([4](https://arxiv.org/html/2605.05701#S4.E4)); the prompt is only the execution layer that instantiates this decision inside the generator. In other words, the prompt does not choose the action. The controller computes a deterministic task-level VOI score from explicit trajectory features, fixed coefficients, budget-aware action costs, and conservative guards, and then injects the chosen action as a deterministic instruction.
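The sketch below instantiates this three-stage pipeline with invented numbers; `utility`, `voi_score`, and `guarded` stand in for the terms of Eqs. (2)–(4), and the per-action feature values are hypothetical.

```python
# Illustrative three-stage scoring pipeline: utility -> normalized
# VOI score -> guarded executable score. All numbers are invented.
EPS = 1e-6

def utility(delta_hat, psi, pi_budget):
    """Eq. (2)-style task-level action utility u_t(k)."""
    return delta_hat + psi - pi_budget

def voi_score(u, denom):
    """Eq. (3)-style normalized score r_t(k) = [u]_+ / (d + eps)."""
    return max(u, 0.0) / (denom + EPS)

def guarded(r, blocked):
    """Eq. (4)-style guarded executable score J~_t(k)."""
    return float("-inf") if blocked else r

# One decision step with three feasible actions:
# (delta_hat, psi, pi_budget, denom, guard_blocked)
candidates = {
    "search":    (0.30, 0.10, 0.15, 1.0, False),
    "decompose": (0.20, 0.25, 0.20, 1.2, False),
    "answer":    (0.50, 0.00, 0.05, 0.2, True),  # guard blocks early answer
}
scores = {k: guarded(voi_score(utility(d, p, c), dn), b)
          for k, (d, p, c, dn, b) in candidates.items()}
print(max(scores, key=scores.get))  # -> "search"
```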

User Turn Format for Stage 1

{history}

Continue reasoning.

If evidence is not yet sufficient, call exactly one search tool.

If evidence is sufficient, answer now.

Output format:

<thought>your reasoning</thought>
<tool_call>search query</tool_call>
<answer>FINAL_ANSWER</answer>

Rules for this turn:

- Emit either <tool_call> or <answer>, not both, unless the dynamic instruction explicitly allows answering now.
- When you use <answer>, include only the shortest correct answer span and no explanation.
- When you use <tool_call>, the query must be concrete, grounded in the question, and non-redundant with the immediately preceding failed path.
- If the question asks for a specific place, date, person, title, or slot value, the answer must use that grounded string rather than a broader paraphrase.
- If the current hypothesis is ambiguous, disambiguate the next query explicitly with a type or role.

The controller-to-prompt interface is therefore

$$(s_t, b_t) \xrightarrow{\ \text{Eqs. (2)–(4)}\ } k_t^{\star} \in \{\text{Search}, \text{Decompose}, \text{Answer}\} \xrightarrow{\ \text{prompt injection}\ } \texttt{dynamic\_instruction}.$$

The four action templates below are the paper-specific prompt components that implement this interface.
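As a minimal sketch of this interface (with hypothetical, abbreviated template strings and a caller-supplied `guarded_score` standing in for $\widetilde{\mathcal{J}}_t$; the full instruction texts are the four cards that follow):

```python
# Hypothetical, abbreviated templates; the full texts are the cards below.
TEMPLATES = {
    "search": "Instruction: SEARCH. You must call exactly one search tool now. ...",
    "decompose": "Instruction: DECOMPOSE. Target the missing bridge fact. ...",
    "answer": "Instruction: ANSWER. Do not call tools. ...",
}
BACKSTOP = "Instruction: BUDGET BACKSTOP. Answer now from gathered evidence. ..."

def dynamic_instruction(state, budget, guarded_score) -> str:
    """Materialize k*_t = argmax_k J~_t(k) as a deterministic prompt insertion."""
    if budget["tool"] <= 0:  # hard backstop: no meaningful retrieval left
        return BACKSTOP
    k_star = max(TEMPLATES, key=lambda k: guarded_score(k, state, budget))
    return TEMPLATES[k_star]  # the LLM never chooses the action itself
```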

Stage 1 Instruction: Search

Instruction: SEARCH.

The expected marginal utility of one more retrieval step is highest.

You must call exactly one search tool now.

Do not answer and do not continue with tool-free reasoning.

Inside <thought>, do all of the following:

- State the most specific unresolved fact that still blocks the answer.
- Explain briefly why another retrieval step is preferable to answering now.
- Reuse concrete entities already grounded by the question or evidence.
- If there is ambiguity, state how the query will disambiguate it.

Then emit exactly one <tool_call>.

The query should be targeted, entity-specific, and aimed at closing one missing factual gap rather than broad exploration.

Do not output <answer> in this turn.

Stage 1 Instruction: Decompose

Instruction: DECOMPOSE.

You have been searching without enough progress, and the remaining uncertainty appears to be compositional.

You must call exactly one search tool now.

Do not answer and do not continue with tool-free reasoning.

Inside <thought>, do all of the following:

- Identify the missing bridge entity, intermediate fact, or latent relation that connects the hops.
- Summarize what has already been established from the current trajectory.
- Explain why another ordinary search step is less useful than an explicit bridge-fact query.
- Formulate one precise retrieval target that would resolve the compositional gap.

Then emit exactly one <tool_call> with that targeted query.

The query must focus on the missing bridge information, not on re-searching the full question from scratch.

Do not output <answer> in this turn.

Stage 1 Instruction: Answer

Instruction: ANSWER.

The controller judges that answer commitment has higher utility-per-cost than further retrieval.

Do not call tools.

Inside <thought>, briefly verify that the answer is directly supported by the evidence already gathered.

Then return exactly one line in the form <answer>FINAL_ANSWER</answer>.

Answer requirements:

- The answer must be the shortest grounded answer span.
- Do not output explanation outside the tag.
- If the question is yes/no, FINAL_ANSWER must be exactly yes or no.
- If the question is a binary choice, output only the selected option.

Stage 1 Instruction: Budget Backstop

Instruction: BUDGET BACKSTOP.

The remaining budget is exhausted or too small for another meaningful retrieval step.

You must answer now and you must not call tools.

Inside <thought>, use only the evidence already present in the trajectory.

Do not speculate beyond the grounded information.

If the evidence is partial, prefer the most directly supported short answer rather than an expansive paraphrase.

Return exactly one line in the form <answer>FINAL_ANSWER</answer>.

Keep the answer as short as possible.

If the question is yes/no, FINAL_ANSWER must be exactly yes or no.

Table [4](https://arxiv.org/html/2605.05701#A2.T4) summarizes the Stage-1 interface. The key distinction from vanilla BAVT is that our controller makes an explicit budget-allocation decision over action types before generation. In the released code, we also include simple guardrails to prevent tool-free stalling and to enforce a required search step whenever the controller allocates budget to retrieval, but we do not treat these guardrails as separate method components.

Table 4: Stage 1 compared with related prompting styles. Our method differs from prior systems not by adding more free-form prompt complexity, but by using an explicit controller to choose which prompt constraint is injected at each step.

| Method | Interleaved retrieval | Step-level instruction | Budget-aware action choice | Conservative finalization |
| --- | --- | --- | --- | --- |
| Search-o1 | ✓ | Partial | ✗ | ✗ |
| BAVT | ✓ | ✓ | Partial | ✗ |
| Ours | ✓ | ✓ | ✓ | ✓ |

Here "budget-aware action choice" means that the action type is selected before generation by an explicit controller rather than inferred from free-form continuation. In our case, the controller first selects Search, Decompose, or Answer; the corresponding instruction is then injected into the generator.

### B.4 Stage 2: Answer-Time Finalization

Stage 2 is an answer-selection module applied after search terminates. Starting from the completed trajectory, the system compares the trajectory answer with a refined candidate derived from the same evidence. The goal is not to launch another round of free-form rewriting, but to correct local exactness errors when the evidence clearly supports a safer, more specific answer span. If the case still looks structurally risky, such as unresolved bridge reasoning or comparative semantics, the module abstains and keeps the trajectory answer. Importantly, this stage adds no new tool calls and no additional LLM call during finalization.

We therefore present Stage 2 as an answer-selection card rather than as another generator prompt. The point is precisely that this module is *not* a fresh LLM rewrite call. It is a conservative selection policy over two candidates already available from the completed trace.

Stage 2 Selection Card: Conservative Answer Finalization

Inputs:

- Question
- Trajectory answer $a_{\mathrm{base}}$
- Refined candidate $a_{\mathrm{ref}}$
- Supporting evidence from the completed trajectory

Decision principle:

- If the remaining error is only local answer exactness, prefer the better-supported candidate.
- If the case still contains bridge ambiguity, comparative semantics, or path-level uncertainty, abstain.
- Never add a new tool call.
- Never launch an additional LLM call during finalization.
- Treat abstention as the default whenever the gain-risk trade-off is unclear.

Typical positive cases:

- yes/no polarity repair
- binary choice repair
- typed-slot correction
- supported factoid completion

Typical abstention cases:

- unresolved bridge reasoning
- comparative questions
- decomposition-heavy trajectories
- cases where rewriting would likely change semantics rather than improve exactness

Table [5](https://arxiv.org/html/2605.05701#A2.T5) summarizes this Stage-2 policy. The organizing principle is conservative answer finalization: intervene only when the remaining error appears local to answer-form exactness, and keep the trajectory answer whenever the evidence suggests the real failure is path-level.

Table 5: Stage 2: Answer-time finalization policy. This module is conservative by design: it rewrites only when the refined candidate is better supported and the risk of changing the semantics remains low.

| Case type | Finalization behavior |
| --- | --- |
| Unresolved bridge or comparative reasoning | Abstain. If the trajectory still reflects unresolved bridge structure, comparison logic, or other path-level uncertainty, keep the trajectory answer. |
| Yes/no or binary choice | Finalize when the evidence clearly resolves the polarity or the choice between the listed options. This is the cleanest type of answer-time repair. |
| Slot-filling exactness | Finalize only when the refined answer adds a small but meaningful amount of specificity, such as a capacity, date, or year range, without changing the underlying fact. |
| Supported factoid completion | Finalize when the refined candidate simply completes an already supported fact span, for example by restoring a canonical phrase or an omitted attribute. |
| General fallback | Otherwise finalize only when the refined candidate is better supported by the trajectory evidence and remains comparably concise; abstain in all remaining cases. |

This policy is the appendix-level counterpart of Eq. ([5](https://arxiv.org/html/2605.05701#S4.E5)). In words, the answer-time module behaves like a high-precision selector rather than a generic editor: it prefers abstention whenever the risk of disturbing a correct trajectory answer outweighs the likely exactness gain.
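A minimal Python sketch of this selection rule follows, assuming a hypothetical feature record `FinalizationFeatures` and a calibrated confidence standing in for $F(z)$; the real system derives these signals from the completed trace and, as stated above, issues no new tool or LLM calls.

```python
from dataclasses import dataclass

@dataclass
class FinalizationFeatures:
    bridge_risk: bool           # unresolved bridge / comparative structure
    case_type: str              # "yes_no", "binary", "slot", "factoid", "other"
    ref_better_supported: bool  # refined span better grounded in the trace
    ref_concise: bool           # refined span comparably concise
    confidence: float           # stands in for F(z), in [0, 1]

SAFE_CASE_TYPES = {"yes_no", "binary", "slot", "factoid"}
TAU = 0.7  # illustrative acceptance threshold

def finalize(a_base: str, a_ref: str, z: FinalizationFeatures) -> str:
    """Accept a_ref only for low-risk answer-form repairs; default to abstain."""
    if z.bridge_risk:                        # path-level uncertainty: abstain
        return a_base
    if z.case_type not in SAFE_CASE_TYPES:   # z outside S_safe: abstain
        return a_base
    if not (z.ref_better_supported and z.ref_concise):
        return a_base                        # unclear gain-risk trade-off
    return a_ref if z.confidence >= TAU else a_base
```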

### B.5 Representative Cases

Instead of showing long traces, we summarize representative Stage-2 behaviors as compact decision cards. The first three cards are positive exactness repairs, and the last card is an abstention example showing when the finalizer deliberately refuses to rewrite.

Case Card 1: Binary\-Choice Exactness Repair

Question:
Which writer was from England, Henry Roth or Robert Erskine Childers?

Trajectory answer:
Robert Erskine Childers

Refined candidate:
Robert Erskine Childers DSC

Evidence:
The supporting evidence identifies Childers as the English writer and also supports the fuller canonical name span.

Why Stage 2 finalizes:

- This is a binary-choice question rather than an open-ended generation case.
- The refinement does not change which option is selected.
- The refined string is better supported by the trajectory evidence.
- The residual error is local answer exactness, not path-level reasoning.

Outcome:
Finalize to the fuller supported answer span.

Case Card 2: Supported Factoid Completion

Question:
What distinction is held by the former NBA player who was a member of the Charlotte Hornets during their 1992–93 season and was head coach for the Charlotte Sting?

Trajectory answer:
shortest player ever to play in the NBA

Refined candidate:
shortest player ever to play in the National Basketball Association

Evidence:
The trajectory evidence ties Muggsy Bogues to both team clues and supports the fuller canonical statement of the distinction.

Why Stage 2 finalizes:

- The refined candidate is not a different claim.
- It is a completion of the same already-supported fact.
- The intervention improves exactness while preserving semantics.
- The gain is local and the rewrite risk is low.

Outcome:
Finalize to the fuller supported factoid span.

Case Card 3: Typed\-Slot Exactness Repair

Question:
The arena where the Lewiston Maineiacs played their home games can seat how many people?

Trajectory answer:
3,677

Refined candidate:
3,677 seated

Evidence:
The supporting passage contains the exact capacity-bearing phrase rather than the bare number alone.

Why Stage 2 finalizes:

- This is a slot-value correction problem rather than a search failure.
- The refined candidate adds only minimal slot-bearing wording.
- The underlying fact is unchanged.
- The support for the refined phrase is more direct than for the shorter numeric span alone.

Outcome:
Finalize to the typed-slot correction.

Case Card 4: Abstention under Bridge Risk

Question:
Which performance act has a higher instrument to person ratio, Badly Drawn Boy or Wolf Alice?

Trajectory answer:
Badly Drawn Boy

Refined candidate:
Badly Drawn Boy

Evidence:
The trajectory already supports the current answer, but the question still has comparative and bridge-sensitive structure.

Why Stage 2 abstains:

- The remaining risk is in comparative multi-hop reasoning, not answer wording.
- A rewrite would offer little exactness gain.
- In this regime, conservative abstention is safer than intervention.
- The finalizer therefore preserves the trajectory answer instead of forcing an unnecessary change.

Outcome:
Abstain and keep the trajectory answer.

These cards illustrate the intended role of answer-time control. Positive interventions repair answer-form errors that remain after the search trajectory is already essentially correct. The abstention example shows the opposite boundary: when the residual uncertainty is still about multi-hop structure rather than wording, Stage 2 keeps the original trajectory answer.

## Appendix C Theoretical Statements and Proofs


This appendix provides the formal analysis summarized in Section [4](https://arxiv.org/html/2605.05701#S4). We first state the main theorem-level results, then give the definitions, assumptions, auxiliary lemmas, and proofs. The analysis supports the two controller layers: search-time budget allocation and answer-time finalization. It is local and decision-level; it does not claim global optimality of the full search tree.

### C.1 Main Theoretical Statements

For search-time control, fix step $t$, state $s_t$, budget $b_t$, and feasible action set $\mathcal{K}_t^{\mathrm{feas}}$. Let $V_{t+1}(s,b)$ be the analysis-only oracle continuation value: the scalar maximum expected final reward attainable from state $s$ with remaining budget $b$ under the oracle continuation policy. Define

$$Q_t^{\star}(k) := \mathbb{E}\left[V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t - g_t(k)\bigr) \,\middle|\, s_t, b_t\right]$$

as the oracle one-step lookahead value, where $S_{t+1}^{(k)}$ is the next state and $g_t(k)$ is the budget charge. With $V_t^{\mathrm{stop}}$ denoting the immediate-stop value, define $\widetilde{Q}_t^{\star}(k) := Q_t^{\star}(k) - V_t^{\mathrm{stop}}(s_t,b_t)$ and decompose

$$\widetilde{Q}_t^{\star}(k) = \Delta_t^{\star}(k) + \Psi_t^{\star}(k) - \Lambda_t^{\star}(k;b_t) + \xi_t(k),$$

where $\Delta_t^{\star}(k)$ is the oracle immediate-gain term, $\Psi_t^{\star}(k)$ the structural residual, $\Lambda_t^{\star}(k;b_t)$ the local oracle budget shadow-cost term, and $\xi_t(k)$ a smoothness remainder. This decomposition is only an analysis device mirroring Eq. ([2](https://arxiv.org/html/2605.05701#S4.E2)). Let $k_t$ be the implemented action and $k_t^{\mathrm{opt}}$ an oracle-best feasible action. Define

$$\Gamma_t(k) := \varepsilon_t^{\Delta} + \varepsilon_t^{\Psi} + \varepsilon_t^{\Pi} + \beta_t\,\|g_t(k)\| + \tfrac{L_t}{2}\,\|g_t(k)\|^2,$$

where $\varepsilon_t^{\Delta}, \varepsilon_t^{\Psi}, \varepsilon_t^{\Pi}$ bound the gain, structural, and budget-penalty approximation errors, $\beta_t$ bounds the shadow-price mismatch, and $L_t$ is the smoothness constant. The oracle counterpart of the task-level VOI score in Eq. ([3](https://arxiv.org/html/2605.05701#S4.E3)) is $r_t^{\star}(k) = [U_t^{\star}(k;b_t)]_+ / (d_t(k;b_t) + \epsilon)$, where $U_t^{\star}$ is defined below.

###### Theorem C.1 (Local approximation of the task-level action utility).

Under the budget smoothness, shadow-price homogeneity, and component-wise approximation conditions stated in Section [C.2](https://arxiv.org/html/2605.05701#A3.SS2), every feasible action $k \in \mathcal{K}_t^{\mathrm{feas}}$ satisfies

$$\left|u_t(k) - \widetilde{Q}_t^{\star}(k)\right| \leq \Gamma_t(k). \tag{6}$$

Theorem [C.1](https://arxiv.org/html/2605.05701#A3.Thmtheorem1) covers the task-level action utility $u_t(k)$ and the normalized task-level VOI score $r_t(k)$ before guards. Corollary [C.6](https://arxiv.org/html/2605.05701#A3.Thmtheorem6) below gives a margin condition under which the ranking by the normalized task-level VOI score is preserved before guards, and Corollary [C.8](https://arxiv.org/html/2605.05701#A3.Thmtheorem8) gives a one-step value-gap bound when the guard layer is compatible with the utility ranking. The controller should be reliable when the score clearly separates feasible actions, while failures are expected when the approximation terms in $\Gamma_t(k)$ are large or when guards intentionally override the raw ranking.

For answer-time finalization, let $\mathcal{Z}$ be the range of $z$, let $Z := z(\mathcal{T}, a_{\mathrm{base}}, a_{\mathrm{ref}})$, and let $\Delta_{\mathrm{fin}} := R(a_{\mathrm{ref}}, y) - R(a_{\mathrm{base}}, y)$ be the score change from accepting the refined answer. With $(v)_+ = \max\{v, 0\}$, define

$$G^{\star}(z) := \mathbb{E}[(\Delta_{\mathrm{fin}})_+ \mid Z = z], \qquad H^{\star}(z) := \mathbb{E}[(-\Delta_{\mathrm{fin}})_+ \mid Z = z]$$

as the oracle gain and harm. We consider policies $\pi : \mathcal{Z} \to \{0,1\}$, where $\pi(z) = 1$ accepts $a_{\mathrm{ref}}$; safe policies satisfy $\pi(z) = 0$ for $z \notin \mathcal{S}_{\mathrm{safe}}$ and the harm constraint in Eq. ([1](https://arxiv.org/html/2605.05701#S3.E1)).

###### Theorem C.2 (Optimal threshold form of safe answer replacement).

Under the strong-duality and multiplier-attainment condition in Assumption [C.14](https://arxiv.org/html/2605.05701#A3.Thmtheorem14), there exists $\eta^{\star} \geq 1$ such that the policy

$$\pi^{\star}(z) = \mathbf{1}\left\{z \in \mathcal{S}_{\mathrm{safe}} \text{ and } G^{\star}(z) - \eta^{\star} H^{\star}(z) \geq 0\right\} \tag{7}$$

maximizes $\mathbb{E}[R(\hat{a},y)]$ subject to $\mathbb{E}[L_{\mathrm{harm}}(\hat{a}, a_{\mathrm{base}}, y)] \leq \rho_{\mathrm{harm}}$ and $\pi(z) = 0$ for $z \notin \mathcal{S}_{\mathrm{safe}}$.
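As a toy numeric illustration of this threshold rule (all gain/harm values, buckets, and the multiplier below are made up, not estimates from the paper), the policy reduces to a per-bucket gate:

```python
# Toy instantiation of pi* from Theorem C.2 with invented numbers.
ETA_STAR = 1.5  # illustrative multiplier eta* >= 1

BUCKETS = {
    # z              (G(z),  H(z),  in S_safe?)
    "yes_no":        (0.20,  0.02,  True),
    "slot":          (0.10,  0.05,  True),
    "bridge_risk":   (0.15,  0.30,  False),  # outside the safe set
}

def pi_star(z: str) -> int:
    """Accept a_ref (1) iff z is safe and G(z) - eta* * H(z) >= 0."""
    g, h, safe = BUCKETS[z]
    return int(safe and g - ETA_STAR * h >= 0)

for z in BUCKETS:
    print(z, pi_star(z))
# yes_no      -> 1  (0.20 - 1.5 * 0.02 = 0.17 >= 0)
# slot        -> 1  (0.10 - 1.5 * 0.05 = 0.025 >= 0)
# bridge_risk -> 0  (unsafe: forced abstention)
```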

### C.2 Search-Time Controller: Definitions, Assumptions, and Proof

Fix step $t$, frontier state $s_t$, remaining budget $b_t$, and feasible action set $\mathcal{K}_t^{\mathrm{feas}}$. For each feasible action $k$, let $S_{t+1}^{(k)}$ be the random next state after executing $k$ and let $g_t(k)$ be its budget charge, which is $(s_t,b_t)$-measurable. For token costs, this means using the conditional expectation of realized tokens given $(s_t,b_t,k)$ and absorbing the residual variance into the remainder. Let $V_{t+1}(s,b)$ be the analysis-only oracle continuation value, i.e., the scalar maximum expected final reward from state $s$ and remaining budget $b$ under the oracle continuation policy and induced future-state/evaluation distribution. The oracle one-step lookahead value is

$$Q_t^{\star}(k) := \mathbb{E}\left[V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t - g_t(k)\bigr) \,\middle|\, s_t, b_t\right], \tag{8}$$

and $k_t^{\mathrm{opt}} \in \arg\max_{k \in \mathcal{K}_t^{\mathrm{feas}}} Q_t^{\star}(k)$ denotes an oracle-optimal feasible action. Define

$$V_t^{\mathrm{stop}}(s_t,b_t) := \mathbb{E}\left[R(a_t^{\mathrm{stop}}, y) \,\middle|\, s_t, b_t\right], \tag{9}$$

$$\Delta_t^{\star}(k) := \mathbb{E}\left[R(a_t^{\mathrm{stop},(k)}, y) - R(a_t^{\mathrm{stop}}, y) \,\middle|\, s_t, b_t\right], \tag{10}$$

$$\Psi_t^{\star}(k) := \mathbb{E}\left[V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t\bigr) \,\middle|\, s_t, b_t\right] - V_t^{\mathrm{stop}}(s_t,b_t) - \Delta_t^{\star}(k), \tag{11}$$

where $a_t^{\mathrm{stop}}$ is the answer if the trajectory is finalized immediately and $a_t^{\mathrm{stop},(k)}$ is the answer after revealing $S_{t+1}^{(k)}$ and applying the fixed immediate-finalization backstop policy at the pre-charge budget $b_t$. The budget-adjusted oracle one-step gain is $\widetilde{Q}_t^{\star}(k) := Q_t^{\star}(k) - V_t^{\mathrm{stop}}(s_t,b_t)$.

We further define the remaining-budget shaped oracle utility

$$U_t^{\star}(k;b_t) := \Delta_t^{\star}(k) + \Psi_t^{\star}(k) - \Lambda_t^{\star}(k;b_t), \tag{12}$$

where $\Lambda_t^{\star}(k;b_t) := \lambda_t^{\star\top} g_t(k)$ is the local oracle budget shadow-cost term. The implemented budget-penalty term $\Pi_t(k;b_t)$ may be signed, for example to encourage Answer near exhaustion; its deviation from this oracle shadow cost is absorbed by $\varepsilon_t^{\Pi}$ in the component-wise approximation assumption below. The remainder $\xi_t(k) := \widetilde{Q}_t^{\star}(k) - U_t^{\star}(k;b_t)$ is bounded by the smoothness constants below.

###### Assumption (Budget smoothness and cost measurability).

Fix a norm $\|\cdot\|$ on $\mathbb{R}^2$ and let $\|\cdot\|_{\ast}$ denote its dual norm. There exists $L_t \geq 0$ such that, for every feasible action $k \in \mathcal{K}_t^{\mathrm{feas}}$: (i) the map $b \mapsto V_{t+1}(S_{t+1}^{(k)}, b)$ is almost surely twice continuously differentiable near $b_t$ with an $L_t$-Lipschitz gradient, so the Taylor remainder

$$\varrho_t(k) := V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t - g_t(k)\bigr) - V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t\bigr) + \nabla_b V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t\bigr)^{\top} g_t(k) \tag{13}$$

satisfies $|\varrho_t(k)| \leq \frac{L_t}{2}\|g_t(k)\|^2$ almost surely; and (ii) $g_t(k)$ is $(s_t,b_t)$-measurable.

###### Assumption (Local shadow-price homogeneity).

There exist $\lambda_t^{\star} \in \mathbb{R}_+^2$ and $\beta_t \geq 0$ such that, for every feasible action $k \in \mathcal{K}_t^{\mathrm{feas}}$,

$$\left\|\mathbb{E}\left[\nabla_b V_{t+1}\bigl(S_{t+1}^{(k)},\, b_t\bigr) \,\middle|\, s_t, b_t\right] - \lambda_t^{\star}\right\|_{\ast} \leq \beta_t. \tag{14}$$

###### Assumption (Component-wise approximation).

There exist $\varepsilon_t^{\Delta}, \varepsilon_t^{\Psi}, \varepsilon_t^{\Pi} \geq 0$ such that, for every $k \in \mathcal{K}_t^{\mathrm{feas}}$,

$$\bigl|\widehat{\Delta}_t(k) - \Delta_t^{\star}(k)\bigr| \leq \varepsilon_t^{\Delta}, \quad \bigl|\Psi_t(k) - \Psi_t^{\star}(k)\bigr| \leq \varepsilon_t^{\Psi}, \quad \bigl|\Pi_t(k;b_t) - \Lambda_t^{\star}(k;b_t)\bigr| \leq \varepsilon_t^{\Pi}. \tag{15}$$

###### Assumption (Ranking-preserving guard condition).

The guard operator $\mathfrak{G}_t$ preserves the relevant $u_t$ ranking in the following sense: the implemented action $k_t \in \arg\max_{k \in \mathcal{K}_t^{\mathrm{feas}}} \widetilde{\mathcal{J}}_t(k)$ satisfies $u_t(k_t) \geq u_t(k_t^{\mathrm{opt}})$, where $k_t^{\mathrm{opt}} \in \arg\max_{k \in \mathcal{K}_t^{\mathrm{feas}}} Q_t^{\star}(k)$. Equivalently, the guard layer does not promote an action with a strictly lower $u_t$ score above the oracle-optimal action.

###### Lemma C.3 (Budget linearization).

Under the budget-smoothness and shadow-price assumptions above, for every $k \in \mathcal{K}_t^{\mathrm{feas}}$,

$$\widetilde{Q}_t^{\star}(k) = \Delta_t^{\star}(k) + \Psi_t^{\star}(k) - \lambda_t^{\star\top} g_t(k) + \xi_t(k),$$

where $|\xi_t(k)| \leq \beta_t\|g_t(k)\| + \frac{L_t}{2}\|g_t(k)\|^2$.

###### Proof C.4.

Write $S := S_{t+1}^{(k)}$, $g := g_t(k)$, and $V := V_{t+1}$. By part (i) of the budget-smoothness assumption,

$$V(S, b_t - g) = V(S, b_t) - \nabla_b V(S, b_t)^{\top} g + \varrho_t(k), \qquad |\varrho_t(k)| \leq \tfrac{L_t}{2}\|g\|^2.$$

Taking the conditional expectation and using part (ii), so that $g$ pulls out of the expectation,

$$Q_t^{\star}(k) = \mathbb{E}[V(S,b_t) \mid s_t, b_t] - \mathbb{E}[\nabla_b V(S,b_t) \mid s_t, b_t]^{\top} g + \bar{\varrho}_t(k),$$

where $|\bar{\varrho}_t(k)| \leq \frac{L_t}{2}\|g\|^2$. By the definition of $\Psi_t^{\star}(k)$, $\mathbb{E}[V(S,b_t) \mid s_t, b_t] = V_t^{\mathrm{stop}} + \Delta_t^{\star}(k) + \Psi_t^{\star}(k)$. Subtracting $V_t^{\mathrm{stop}}$ and writing $\xi_t(k) := (\lambda_t^{\star} - \mathbb{E}[\nabla_b V(S,b_t) \mid s_t, b_t])^{\top} g + \bar{\varrho}_t(k)$ gives the identity. The bound follows from Hölder's inequality and the shadow-price assumption.

### C.3 Proof of Theorem [C.1](https://arxiv.org/html/2605.05701#A3.Thmtheorem1) and Corollary [C.8](https://arxiv.org/html/2605.05701#A3.Thmtheorem8)

###### Proof C.5 (Proof of Theorem [C.1](https://arxiv.org/html/2605.05701#A3.Thmtheorem1)).

Fix $k \in \mathcal{K}_t^{\mathrm{feas}}$. By the definition of $u_t(k)$ in Eq. ([2](https://arxiv.org/html/2605.05701#S4.E2)) and of $U_t^{\star}(k;b_t)$ in Eq. ([12](https://arxiv.org/html/2605.05701#A3.E12)),

$$u_t(k) - U_t^{\star}(k;b_t) = \bigl(\widehat{\Delta}_t(k) - \Delta_t^{\star}(k)\bigr) + \bigl(\Psi_t(k) - \Psi_t^{\star}(k)\bigr) - \bigl(\Pi_t(k;b_t) - \Lambda_t^{\star}(k;b_t)\bigr).$$

Applying the triangle inequality and the component-wise approximation assumption gives $|u_t(k) - U_t^{\star}(k;b_t)| \leq \varepsilon_t^{\Delta} + \varepsilon_t^{\Psi} + \varepsilon_t^{\Pi}$. By Lemma [C.3](https://arxiv.org/html/2605.05701#A3.Thmtheorem3), $|\widetilde{Q}_t^{\star}(k) - U_t^{\star}(k;b_t)| = |\xi_t(k)| \leq \beta_t\|g_t(k)\| + \frac{L_t}{2}\|g_t(k)\|^2$. The triangle inequality then gives

$$|u_t(k) - \widetilde{Q}_t^{\star}(k)| \leq \varepsilon_t^{\Delta} + \varepsilon_t^{\Psi} + \varepsilon_t^{\Pi} + \beta_t\|g_t(k)\| + \tfrac{L_t}{2}\|g_t(k)\|^2 = \Gamma_t(k),$$

which is Eq. ([6](https://arxiv.org/html/2605.05701#A3.E6)).

Although Theorem [C.1](https://arxiv.org/html/2605.05701#A3.Thmtheorem1) is stated for the unnormalized utility, action selection uses the normalized score in Eq. ([3](https://arxiv.org/html/2605.05701#S4.E3)). The next corollary shows that the normalized ranking is stable when the oracle score margin dominates the approximation terms.

###### Corollary C.6 (Ranking consistency under a margin condition).

If

$$r_t^{\star}(i) - r_t^{\star}(j) > \frac{\Gamma_t(i) + \Gamma_t(j)}{d_{\min} + \epsilon} \tag{16}$$

for some $d_{\min} > 0$ lower-bounding all denominators, then $r_t(i) > r_t(j)$.

###### Proof C.7.

Define $r_t^{\star}(k) := [U_t^{\star}(k;b_t)]_+ / (d_t(k;b_t) + \epsilon)$. Since $[\cdot]_+$ is $1$-Lipschitz,

$$|r_t(k) - r_t^{\star}(k)| = \frac{\bigl|[u_t(k)]_+ - [U_t^{\star}(k;b_t)]_+\bigr|}{d_t(k;b_t) + \epsilon} \leq \frac{|u_t(k) - U_t^{\star}(k;b_t)|}{d_{\min} + \epsilon} \leq \frac{\Gamma_t(k)}{d_{\min} + \epsilon}.$$

If $r_t^{\star}(i) - r_t^{\star}(j) > (\Gamma_t(i) + \Gamma_t(j))/(d_{\min} + \epsilon)$, then

$$r_t(i) \geq r_t^{\star}(i) - \frac{\Gamma_t(i)}{d_{\min} + \epsilon} > r_t^{\star}(j) + \frac{\Gamma_t(j)}{d_{\min} + \epsilon} \geq r_t(j),$$

which proves the result.
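As a concrete sanity check with illustrative numbers (not taken from any experiment): suppose $r_t^{\star}(i) = 0.9$, $r_t^{\star}(j) = 0.5$, $\Gamma_t(i) = 0.06$, $\Gamma_t(j) = 0.04$, $d_{\min} = 0.4$, and $\epsilon = 0.1$. Then

```latex
\[
  r_t^{\star}(i) - r_t^{\star}(j) = 0.4
  \;>\;
  \frac{\Gamma_t(i) + \Gamma_t(j)}{d_{\min} + \epsilon}
  = \frac{0.06 + 0.04}{0.5} = 0.2,
\]
\[
  r_t(i) \;\geq\; 0.9 - \frac{0.06}{0.5} = 0.78
  \;>\;
  0.58 = 0.5 + \frac{0.04}{0.5} \;\geq\; r_t(j),
\]
```

so the implemented controller still ranks $i$ above $j$ despite the approximation error.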

The ranking statement does not yet include the guard layer. Under the ranking-preserving guard condition, the same local bound implies a one-step value-gap guarantee for the implemented action. This result is conditional: it isolates the loss due to score approximation when the guard layer does not overturn the utility ordering relevant to the oracle action, and it is not a global optimality guarantee for arbitrary guard settings.

###### Corollary C.8 (One-step value gap under ranking-preserving guards).

Under the conditions of Theorem [C.1](https://arxiv.org/html/2605.05701#A3.Thmtheorem1) and the ranking-preserving guard condition in Section [C.2](https://arxiv.org/html/2605.05701#A3.SS2), the implemented action satisfies

$$Q_t^{\star}(k_t^{\mathrm{opt}}) - Q_t^{\star}(k_t) \leq 2 \max_{k \in \mathcal{K}_t^{\mathrm{feas}}} \Gamma_t(k). \tag{17}$$

###### Proof C.9 (Proof of Corollary [C.8](https://arxiv.org/html/2605.05701#A3.Thmtheorem8)).

Let $\bar{\Gamma}_t := \max_{k \in \mathcal{K}_t^{\mathrm{feas}}} \Gamma_t(k)$. From Eq. ([6](https://arxiv.org/html/2605.05701#A3.E6)), $\widetilde{Q}_t^{\star}(k) \leq u_t(k) + \bar{\Gamma}_t$ and $\widetilde{Q}_t^{\star}(k) \geq u_t(k) - \bar{\Gamma}_t$ for every $k$. By the ranking-preserving guard condition, $u_t(k_t) \geq u_t(k_t^{\mathrm{opt}})$. Therefore,

$$\widetilde{Q}_t^{\star}(k_t^{\mathrm{opt}}) \leq u_t(k_t^{\mathrm{opt}}) + \bar{\Gamma}_t \leq u_t(k_t) + \bar{\Gamma}_t \leq \widetilde{Q}_t^{\star}(k_t) + 2\bar{\Gamma}_t.$$

Adding $V_t^{\mathrm{stop}}$ to both sides gives $Q_t^{\star}(k_t^{\mathrm{opt}}) - Q_t^{\star}(k_t) \leq 2\bar{\Gamma}_t$, which is Eq. ([17](https://arxiv.org/html/2605.05701#A3.E17)).

### C\.4Answer\-Time Finalization: Reward–Harm Decomposition and Optimality

We now collect the answer\-time quantities used in Theorem[C\.2](https://arxiv.org/html/2605.05701#A3.Thmtheorem2)\. Let\(x,𝒯,abase,aref,y\)\(x,\\mathcal\{T\},a\_\{\\mathrm\{base\}\},a\_\{\\mathrm\{ref\}\},y\)be jointly distributed according to the evaluation distribution, wherearef=gref​\(𝒯,abase\)a\_\{\\mathrm\{ref\}\}=g\_\{\\mathrm\{ref\}\}\(\\mathcal\{T\},a\_\{\\mathrm\{base\}\}\), let𝒵\\mathcal\{Z\}denote the range ofzz, letZ:=z​\(𝒯,abase,aref\)Z:=z\(\\mathcal\{T\},a\_\{\\mathrm\{base\}\},a\_\{\\mathrm\{ref\}\}\), and let an answer\-replacement policy be a measurable mapπ:𝒵→\{0,1\}\\pi:\\mathcal\{Z\}\\to\\\{0,1\\\}\. We restrict attention to safe policies satisfyingπ​\(z\)=0\\pi\(z\)=0forz∉𝒮safez\\notin\\mathcal\{S\}\_\{\\mathrm\{safe\}\}\. DefineΔfin:=R​\(aref,y\)−R​\(abase,y\)\\Delta\_\{\\mathrm\{fin\}\}:=R\(a\_\{\\mathrm\{ref\}\},y\)\-R\(a\_\{\\mathrm\{base\}\},y\)and

$$G^{\star}(z):=\mathbb{E}\bigl[(\Delta_{\mathrm{fin}})_{+}\mid Z=z\bigr],\qquad H^{\star}(z):=\mathbb{E}\bigl[(-\Delta_{\mathrm{fin}})_{+}\mid Z=z\bigr].\tag{18}$$
We now turn from search\-time action selection to answer\-time answer replacement\. The first step is to express finalization as a gain–harm trade\-off relative to the base trajectory answer\.

###### Lemma C.11 (Reward–harm decomposition).

For any safe answer-replacement policy $\pi$,

$$\mathbb{E}[R(\hat{a},y)]-\mathbb{E}[R(a_{\mathrm{base}},y)]=\mathbb{E}\bigl[\pi(Z)\bigl(G^{\star}(Z)-H^{\star}(Z)\bigr)\bigr],\tag{19}$$

$$\mathbb{E}[L_{\mathrm{harm}}(\hat{a},a_{\mathrm{base}},y)]=\mathbb{E}\bigl[\pi(Z)H^{\star}(Z)\bigr].\tag{20}$$

###### Proof C\.12\.

Fix a safe policy $\pi$. By definition, $\hat{a}=a_{\mathrm{ref}}$ if $\pi(Z)=1$ and $\hat{a}=a_{\mathrm{base}}$ if $\pi(Z)=0$, so $R(\hat{a},y)-R(a_{\mathrm{base}},y)=\pi(Z)\Delta_{\mathrm{fin}}$. Taking expectations, conditioning on $Z$, and using $\Delta_{\mathrm{fin}}=(\Delta_{\mathrm{fin}})_{+}-(-\Delta_{\mathrm{fin}})_{+}$ yields Eq. ([19](https://arxiv.org/html/2605.05701#A3.E19)). Likewise, $L_{\mathrm{harm}}(\hat{a},a_{\mathrm{base}},y)=\pi(Z)(-\Delta_{\mathrm{fin}})_{+}$, and conditioning on $Z$ yields Eq. ([20](https://arxiv.org/html/2605.05701#A3.E20)).
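The decomposition also suggests how the gain and harm curves could be estimated from held-out trajectories when gold answers are available. The sketch below is illustrative only: the arrays of per-example rewards and the discretized feature values are hypothetical inputs, not quantities released with the paper.

```python
import numpy as np

def gain_harm_by_feature(r_base, r_ref, z_bins):
    """Monte Carlo estimates of G*(z) and H*(z) on held-out examples.

    r_base, r_ref : per-example rewards of the base and refined answers
    z_bins        : discretized finalization feature value per example
    """
    delta = np.asarray(r_ref, dtype=float) - np.asarray(r_base, dtype=float)
    gain_plus = np.maximum(delta, 0.0)    # (Delta_fin)_+
    harm_plus = np.maximum(-delta, 0.0)   # (-Delta_fin)_+
    z_bins = np.asarray(z_bins)
    G, H = {}, {}
    for z in np.unique(z_bins):
        mask = z_bins == z
        G[z] = gain_plus[mask].mean()     # estimate of E[(Delta_fin)_+ | Z = z]
        H[z] = harm_plus[mask].mean()     # estimate of E[(-Delta_fin)_+ | Z = z]
    return G, H
```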

By Lemma [C.11](https://arxiv.org/html/2605.05701#A3.Thmtheorem11), maximizing the answer-time part of Eq. ([1](https://arxiv.org/html/2605.05701#S3.E1)) is equivalent to

$$\max_{\pi}\;\mathbb{E}\bigl[\pi(Z)\bigl(G^{\star}(Z)-H^{\star}(Z)\bigr)\bigr]\quad\text{s.t.}\quad\mathbb{E}\bigl[\pi(Z)H^{\star}(Z)\bigr]\leq\rho_{\mathrm{harm}},\qquad\pi(z)=0\ \text{for }z\notin\mathcal{S}_{\mathrm{safe}},\tag{21}$$

where the maximum is over measurable $\{0,1\}$-valued policies. For $\eta>0$ and $\tau\in\mathbb{R}$, write $F_{\eta,\tau}^{\star}(z):=G^{\star}(z)-\eta H^{\star}(z)-\tau$.

###### Lemma C.13 (Pointwise maximizer of the penalized objective).

Fix $\eta>0$ and $\tau\in\mathbb{R}$. Among all measurable policies satisfying $\pi(z)=0$ for $z\notin\mathcal{S}_{\mathrm{safe}}$, the maximizer of $\mathbb{E}[\pi(Z)F_{\eta,\tau}^{\star}(Z)]$ is

$$\pi_{\eta,\tau}^{\star}(z)=\mathbf{1}\bigl\{z\in\mathcal{S}_{\mathrm{safe}}\ \text{and}\ F_{\eta,\tau}^{\star}(z)\geq 0\bigr\}.\tag{22}$$

###### Proof C\.14\.

Fix $z\in\mathcal{S}_{\mathrm{safe}}$. Because $\pi(z)\in\{0,1\}$, the pointwise contribution $\pi(z)F_{\eta,\tau}^{\star}(z)$ equals either $0$ or $F_{\eta,\tau}^{\star}(z)$. Hence the maximizing choice is $\pi(z)=1$ when $F_{\eta,\tau}^{\star}(z)\geq 0$ and $\pi(z)=0$ otherwise. Outside $\mathcal{S}_{\mathrm{safe}}$, feasibility forces $\pi(z)=0$. Since the objective is the expectation of this pointwise-separable integrand, the resulting rule is a global maximizer.
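In code, the penalized rule is a single comparison per feature value. A minimal sketch, assuming gain/harm estimates keyed by feature value (as above) and a hypothetical safe-set predicate:

```python
def penalized_policy(z, G, H, is_safe, eta=1.0, tau=0.0):
    """Pointwise maximizer of E[pi(Z) * (G(Z) - eta*H(Z) - tau)].

    Returns 1 (replace the answer) exactly when z is in the safe set and
    the penalized score F(z) = G(z) - eta*H(z) - tau is nonnegative.
    """
    if not is_safe(z):
        return 0  # feasibility forces abstention outside the safe set
    return int(G[z] - eta * H[z] - tau >= 0.0)
```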

###### Assumption C.14 (Strong duality and multiplier attainment).

There exists a multiplier $\gamma^{\star}\geq 0$ such that: (i) strong duality holds, i.e., the constrained problem in Eq. ([21](https://arxiv.org/html/2605.05701#A3.E21)) and its Lagrangian relaxation have the same optimal value; (ii) the Lagrangian relaxation attains its maximum at $\gamma^{\star}$; (iii) the policy $\pi_{\gamma^{\star}}(z):=\mathbf{1}\{z\in\mathcal{S}_{\mathrm{safe}}\ \text{and}\ G^{\star}(z)-(1+\gamma^{\star})H^{\star}(z)\geq 0\}$ is primal-feasible, i.e., $\mathbb{E}[\pi_{\gamma^{\star}}(Z)H^{\star}(Z)]\leq\rho_{\mathrm{harm}}$; and (iv) $\pi_{\gamma^{\star}}$ attains the constrained optimum, i.e., it achieves the maximum of Eq. ([21](https://arxiv.org/html/2605.05701#A3.E21)). This is a population-level regularity condition for the oracle constrained selection problem; the released deterministic finalizer below is a conservative feature-rule instantiation and is not claimed to know $G^{\star}$ or $H^{\star}$.

###### Proof C.15 (Proof of Theorem [C.2](https://arxiv.org/html/2605.05701#A3.Thmtheorem2)).

By Lemma [C.11](https://arxiv.org/html/2605.05701#A3.Thmtheorem11), the safe answer-replacement problem is Eq. ([21](https://arxiv.org/html/2605.05701#A3.E21)). For any multiplier $\gamma\geq 0$, its Lagrangian is

$$\mathcal{L}(\pi,\gamma):=\mathbb{E}\bigl[\pi(Z)\bigl(G^{\star}(Z)-H^{\star}(Z)\bigr)\bigr]-\gamma\bigl(\mathbb{E}\bigl[\pi(Z)H^{\star}(Z)\bigr]-\rho_{\mathrm{harm}}\bigr)\tag{23}$$

$$=\mathbb{E}\bigl[\pi(Z)\bigl(G^{\star}(Z)-(1+\gamma)H^{\star}(Z)\bigr)\bigr]+\gamma\rho_{\mathrm{harm}}.\tag{24}$$

For fixed $\gamma$, the term $\gamma\rho_{\mathrm{harm}}$ is constant in $\pi$, so maximizing $\mathcal{L}(\pi,\gamma)$ is equivalent to maximizing $\mathbb{E}[\pi(Z)(G^{\star}(Z)-(1+\gamma)H^{\star}(Z))]$ subject to the safe-set constraint. Applying Lemma [C.13](https://arxiv.org/html/2605.05701#A3.Thmtheorem13) with $\eta=1+\gamma$ and $\tau=0$ shows that the pointwise maximizer is

$$\pi_{\gamma}(z)=\mathbf{1}\bigl\{z\in\mathcal{S}_{\mathrm{safe}}\ \text{and}\ G^{\star}(z)-(1+\gamma)H^{\star}(z)\geq 0\bigr\}.\tag{25}$$

By Assumption [C.14](https://arxiv.org/html/2605.05701#A3.Thmtheorem14)(i)–(ii), there exists $\gamma^{\star}\geq 0$ at which the Lagrangian relaxation attains the constrained optimum. By Assumption [C.14](https://arxiv.org/html/2605.05701#A3.Thmtheorem14)(iii), $\pi_{\gamma^{\star}}$ is primal-feasible. By Assumption [C.14](https://arxiv.org/html/2605.05701#A3.Thmtheorem14)(iv), $\pi_{\gamma^{\star}}$ achieves the primal optimum. Setting $\eta^{\star}:=1+\gamma^{\star}$ gives $\eta^{\star}\geq 1$, and substituting into Eq. ([25](https://arxiv.org/html/2605.05701#A3.E25)) yields exactly the threshold rule in Eq. ([7](https://arxiv.org/html/2605.05701#A3.E7)).
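Given plug-in estimates of the gain and harm curves, one plausible way to instantiate the multiplier is a grid sweep over $\gamma$ that returns the smallest value whose induced threshold policy satisfies the harm cap. This is an illustrative sketch over a discrete feature distribution, not the released finalizer (which is the feature rule of Appendix C.5):

```python
import numpy as np

def fit_multiplier(z_vals, probs, G, H, is_safe, rho_harm):
    """Sweep gamma >= 0 and return the smallest multiplier whose policy
    pi_gamma(z) = 1{is_safe(z) and G(z) - (1+gamma)*H(z) >= 0}
    keeps the expected harm E[pi(Z) * H(Z)] within rho_harm."""
    for gamma in np.linspace(0.0, 10.0, 201):   # hypothetical search grid
        harm = sum(p * H[z] for z, p in zip(z_vals, probs)
                   if is_safe(z) and G[z] - (1.0 + gamma) * H[z] >= 0.0)
        if harm <= rho_harm:
            return gamma                         # first primal-feasible multiplier
    return 10.0                                  # fall back to most conservative grid value
```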

### C.5 Released Rule-Based Finalizer: Exact Rule Characterization

The oracle threshold rule uses unknown conditional quantities $G^{\star}$ and $H^{\star}$. The reported system therefore uses a deterministic conservative feature rule. The next proposition characterizes that rule at the paper level.

Define the following paper-level features. Let $m_{\mathrm{ref}}$ indicate that a refined candidate is available and passes the preliminary plausibility filter; let $c_{\mathrm{risk}}$ be the refinement-risk category; let $n_{\mathrm{dec}}$ be the number of decomposition steps in the trajectory; let $q_{\mathrm{type}}$ be the detected question/answer type; let $q_{\mathrm{slot}}$ be the detected slot type; let $\Delta_{\mathrm{sup}}$ be the support gain of the refined candidate over the base answer; and let $\ell_{\mathrm{base}}$ and $\ell_{\mathrm{ref}}$ be the token lengths of the base and refined candidates. Let $\mathcal{C}_{\mathrm{block}}$ denote the high-risk refinement categories blocked by the finalizer, let $\mathcal{Q}_{\mathrm{bin}}$ denote yes/no and binary-choice cases, and let $\mathcal{Q}_{\mathrm{typed}}$ denote typed-slot cases such as capacity, date, and year range. Define four branch condition sets:

$$\mathcal{B}_{\mathrm{bin}}:=\{q_{\mathrm{type}}\in\mathcal{Q}_{\mathrm{bin}}\},$$

$$\mathcal{B}_{\mathrm{typed}}:=\{q_{\mathrm{slot}}\in\mathcal{Q}_{\mathrm{typed}},\;\Delta_{\mathrm{sup}}\geq 0.50,\;\ell_{\mathrm{ref}}\leq\ell_{\mathrm{base}}+1\},$$

$$\mathcal{B}_{\mathrm{explicit}}:=\{\text{explicit factoid-slot case},\;\Delta_{\mathrm{sup}}\geq 0,\;\ell_{\mathrm{ref}}\leq\ell_{\mathrm{base}}+3\},$$

$$\mathcal{B}_{\mathrm{compact}}:=\{\Delta_{\mathrm{sup}}>0,\;\ell_{\mathrm{ref}}\leq\ell_{\mathrm{base}}+2\}.$$

Define the priority-ordered branch selector:

$$\mathcal{B}_{\mathrm{accept}}:=\begin{cases}\mathcal{B}_{\mathrm{bin}},&\text{if }q_{\mathrm{type}}\in\mathcal{Q}_{\mathrm{bin}},\\\mathcal{B}_{\mathrm{typed}},&\text{if }q_{\mathrm{type}}\notin\mathcal{Q}_{\mathrm{bin}}\text{ and }q_{\mathrm{slot}}\in\mathcal{Q}_{\mathrm{typed}},\\\mathcal{B}_{\mathrm{explicit}},&\text{if }q_{\mathrm{type}}\notin\mathcal{Q}_{\mathrm{bin}},\;q_{\mathrm{slot}}\notin\mathcal{Q}_{\mathrm{typed}},\;\text{and the case is an explicit factoid slot},\\\mathcal{B}_{\mathrm{compact}},&\text{otherwise.}\end{cases}$$

The accept set is not an unconditional union of the branches: when an earlier applicable branch fails its safety test, the finalizer abstains instead of falling through to a lower-priority branch. The accept set is

$$\mathcal{S}_{\mathrm{det}}:=\{m_{\mathrm{ref}}=1,\;c_{\mathrm{risk}}\notin\mathcal{C}_{\mathrm{block}},\;n_{\mathrm{dec}}=0\}\cap\mathcal{B}_{\mathrm{accept}}.$$
###### Proposition C.17 (Exact rule of the released deterministic finalizer).

Let $z$ denote the finalization feature vector extracted from the completed trajectory and the two candidate answers. The deterministic finalizer used in our experiments is equivalent to the indicator rule $\sigma_{\mathrm{det}}(z)=\mathbf{1}\{z\in\mathcal{S}_{\mathrm{det}}\}$ for the priority-ordered safe set $\mathcal{S}_{\mathrm{det}}$. Therefore, the final answer is $a_{\mathrm{ref}}$ when $z\in\mathcal{S}_{\mathrm{det}}$ and $a_{\mathrm{base}}$ otherwise.

###### Proof C\.18\.

Trace the rule in its evaluation order. The first gate requires that a refined candidate is available and passes the preliminary plausibility filter, which enforces $m_{\mathrm{ref}}=1$. The second gate blocks high-risk refinement categories, enforcing $c_{\mathrm{risk}}\notin\mathcal{C}_{\mathrm{block}}$. The third gate abstains on decomposition-heavy trajectories, enforcing $n_{\mathrm{dec}}=0$. Conditional on these gates, the branch selector is evaluated in priority order. Binary and yes/no cases are handled first, giving the branch set $\mathcal{B}_{\mathrm{bin}}$. If the case is not binary but has a typed slot, the typed-slot safety test is applied; acceptance requires $\Delta_{\mathrm{sup}}\geq 0.50$ and $\ell_{\mathrm{ref}}\leq\ell_{\mathrm{base}}+1$, and failure causes abstention rather than falling through. If no earlier branch applies and the case is an explicit factoid slot, acceptance requires nonnegative support gain and $\ell_{\mathrm{ref}}\leq\ell_{\mathrm{base}}+3$; again, failure causes abstention. The final fallback accepts only compact support-improving refinements, requiring $\Delta_{\mathrm{sup}}>0$ and $\ell_{\mathrm{ref}}\leq\ell_{\mathrm{base}}+2$. Thus the rule accepts exactly when the three gates hold and the priority-ordered branch selector accepts, which is precisely $z\in\mathcal{S}_{\mathrm{det}}$.
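The gate-then-branch structure maps directly onto a short decision function. The sketch below mirrors the evaluation order traced in the proof; the feature field names and the category-set members are hypothetical placeholders, not the released implementation.

```python
# Hypothetical category sets; the paper names the roles, not the members.
C_BLOCK = {"unresolved_bridge", "unresolved_comparative", "missing_support"}
Q_BIN   = {"yes_no", "binary_choice"}
Q_TYPED = {"capacity", "date", "year_range"}

def sigma_det(z):
    """Deterministic finalizer sketch: 1 accepts a_ref, 0 keeps a_base."""
    # Three gates: refined candidate present, risk category allowed, no decomposition.
    if not (z["m_ref"] == 1 and z["c_risk"] not in C_BLOCK and z["n_dec"] == 0):
        return 0
    d_sup = z["delta_sup"]                 # support gain of a_ref over a_base
    d_len = z["len_ref"] - z["len_base"]   # token-length difference
    # Priority-ordered branches; a failed applicable branch abstains (no fall-through).
    if z["q_type"] in Q_BIN:
        return 1                                 # binary / yes-no repair branch
    if z["q_slot"] in Q_TYPED:
        return int(d_sup >= 0.50 and d_len <= 1)  # typed-slot safety test
    if z["explicit_factoid"]:
        return int(d_sup >= 0 and d_len <= 3)     # explicit factoid-slot test
    return int(d_sup > 0 and d_len <= 2)          # compact fallback branch
```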

### C.6 Plug-in Approximation of the Finalizer

The previous proposition describes the released rule. For completeness, we also record a generic perturbation result showing how approximate gain–harm scores affect threshold decisions: Lemma [C.19](https://arxiv.org/html/2605.05701#A3.Thmtheorem19) controls the pointwise score error, and Theorem [C.21](https://arxiv.org/html/2605.05701#A3.Thmtheorem21) converts it into a penalized-value bound.

Fix $\eta>0$ and $\tau\in\mathbb{R}$. Write $F_{\eta,\tau}^{\star}(z):=G^{\star}(z)-\eta H^{\star}(z)-\tau$ and $F_{\eta,\tau}(z):=G(z)-\eta H(z)-\tau$, define the oracle rule by $\pi_{\eta,\tau}^{\star}(z):=\mathbf{1}\{z\in\mathcal{S}_{\mathrm{safe}}\ \text{and}\ F_{\eta,\tau}^{\star}(z)\geq 0\}$, and define the plug-in rule by $\hat{\pi}_{\eta,\tau}(z):=\mathbf{1}\{z\in\mathcal{S}_{\mathrm{safe}}\ \text{and}\ F_{\eta,\tau}(z)\geq 0\}$.

###### Lemma C.19 (Finalizer score perturbation).

If $|G(z)-G^{\star}(z)|\leq\delta_G$ and $|H(z)-H^{\star}(z)|\leq\delta_H$ for all $z\in\mathcal{S}_{\mathrm{safe}}$, then

$|F_{\eta,\tau}(z)-F_{\eta,\tau}^{\star}(z)|\leq\delta_G+\eta\,\delta_H$ for all $z\in\mathcal{S}_{\mathrm{safe}}$.

###### Proof C\.20\.

For $z\in\mathcal{S}_{\mathrm{safe}}$, write $F_{\eta,\tau}(z)-F_{\eta,\tau}^{\star}(z)=\bigl(G(z)-G^{\star}(z)\bigr)-\eta\bigl(H(z)-H^{\star}(z)\bigr)$ and apply the triangle inequality.

###### Theorem C.21 (Plug-in excess penalized value).

Fix $\eta>0$ and $\tau\in\mathbb{R}$, and let $\delta:=\delta_G+\eta\,\delta_H$. Under the bounds of Lemma [C.19](https://arxiv.org/html/2605.05701#A3.Thmtheorem19),

$$\mathbb{E}\bigl[\pi_{\eta,\tau}^{\star}(Z)F_{\eta,\tau}^{\star}(Z)\bigr]-\mathbb{E}\bigl[\hat{\pi}_{\eta,\tau}(Z)F_{\eta,\tau}^{\star}(Z)\bigr]\leq\mathbb{E}\bigl[|F_{\eta,\tau}^{\star}(Z)|\,\mathbf{1}\bigl\{|F_{\eta,\tau}^{\star}(Z)|\leq\delta,\;Z\in\mathcal{S}_{\mathrm{safe}}\bigr\}\bigr].$$

If, in addition, there exist constants $C>0$ and $\alpha>0$ such that

$$\mathbb{P}\bigl(|F_{\eta,\tau}^{\star}(Z)|\leq u,\;Z\in\mathcal{S}_{\mathrm{safe}}\bigr)\leq Cu^{\alpha}\qquad\text{for all }u\geq 0,$$

then

$$\mathbb{E}\bigl[\pi_{\eta,\tau}^{\star}(Z)F_{\eta,\tau}^{\star}(Z)\bigr]-\mathbb{E}\bigl[\hat{\pi}_{\eta,\tau}(Z)F_{\eta,\tau}^{\star}(Z)\bigr]\leq C\delta^{1+\alpha}.$$

###### Proof C\.22\.

Let $D:=\{Z\in\mathcal{S}_{\mathrm{safe}}:\pi_{\eta,\tau}^{\star}(Z)\neq\hat{\pi}_{\eta,\tau}(Z)\}$. Because $\pi_{\eta,\tau}^{\star}$ is the pointwise maximizer of $\pi(Z)F_{\eta,\tau}^{\star}(Z)$ over $\pi(Z)\in\{0,1\}$, the difference

$$\mathbb{E}\bigl[\pi_{\eta,\tau}^{\star}(Z)F_{\eta,\tau}^{\star}(Z)\bigr]-\mathbb{E}\bigl[\hat{\pi}_{\eta,\tau}(Z)F_{\eta,\tau}^{\star}(Z)\bigr]$$

is supported on $D$ and equals $\mathbb{E}[|F_{\eta,\tau}^{\star}(Z)|\mathbf{1}_D]$. On $D$, the oracle and plug-in scores have opposite signs, so $|F_{\eta,\tau}^{\star}(Z)|\leq|F_{\eta,\tau}^{\star}(Z)-F_{\eta,\tau}(Z)|$. Applying Lemma [C.19](https://arxiv.org/html/2605.05701#A3.Thmtheorem19) gives $D\subseteq\{|F_{\eta,\tau}^{\star}(Z)|\leq\delta,\;Z\in\mathcal{S}_{\mathrm{safe}}\}$ and therefore the first bound. For the second bound, use $|F_{\eta,\tau}^{\star}(Z)|\leq\delta$ on the boundary set to obtain

$$\mathbb{E}\bigl[|F_{\eta,\tau}^{\star}(Z)|\,\mathbf{1}\bigl\{|F_{\eta,\tau}^{\star}(Z)|\leq\delta,\;Z\in\mathcal{S}_{\mathrm{safe}}\bigr\}\bigr]\leq\delta\,\mathbb{P}\bigl(|F_{\eta,\tau}^{\star}(Z)|\leq\delta,\;Z\in\mathcal{S}_{\mathrm{safe}}\bigr)\leq C\delta^{1+\alpha}.$$
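As a concrete illustration with hypothetical numbers: if the margin condition holds with $C=2$ and $\alpha=1$, and the plug-in estimates satisfy $\delta_G=\delta_H=0.05$ with $\eta=1$, then $\delta=0.1$ and the plug-in rule loses at most $C\delta^{1+\alpha}=2\times 0.1^{2}=0.02$ in penalized value relative to the oracle threshold rule.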

### C.7 Finite Termination of Algorithm [1](https://arxiv.org/html/2605.05701#alg1)

Finally, Corollary [C.23](https://arxiv.org/html/2605.05701#A3.Thmtheorem23) records that the inference procedure is finite under a minimal budget-decrement condition. The sum $B_{\mathrm{tool}}+B_{\mathrm{tok}}$ is used only as a bookkeeping potential over two nonnegative budget coordinates, not as a single physical cost metric.

###### Corollary C.23 (Finite termination).

Algorithm [1](https://arxiv.org/html/2605.05701#alg1) terminates in at most $\lceil(B_{\mathrm{tool}}+B_{\mathrm{tok}})/\zeta\rceil$ iterations under the minimum-decrement condition below.

###### Lemma C.24 (Finite termination lemma).

Assume that every executed action has nonnegative cost in each budget coordinate and that, whenever the while-loop in Algorithm [1](https://arxiv.org/html/2605.05701#alg1) continues, at least one coordinate of the remaining budget decreases by at least a fixed amount $\zeta>0$. If the initial budget $b_0=(B_{\mathrm{tool}},B_{\mathrm{tok}})$ is finite, then the while-loop terminates after at most $\bigl\lceil(B_{\mathrm{tool}}+B_{\mathrm{tok}})/\zeta\bigr\rceil$ iterations. This proves Corollary [C.23](https://arxiv.org/html/2605.05701#A3.Thmtheorem23).

###### Proof C\.25\.

Write $S_t:=b_{\mathrm{tool},t}+b_{\mathrm{tok},t}$. Since each continuing iteration decreases at least one coordinate by at least $\zeta$ and neither coordinate increases, we have $S_{t+1}\leq S_t-\zeta$. Because $S_0=B_{\mathrm{tool}}+B_{\mathrm{tok}}$ and $S_t\geq 0$ for all $t$, the loop can continue at most $\lceil(B_{\mathrm{tool}}+B_{\mathrm{tok}})/\zeta\rceil$ times. The post-loop extraction and finalization steps are finite, so Algorithm [1](https://arxiv.org/html/2605.05701#alg1) always returns in finitely many steps.
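The potential argument corresponds to the loop skeleton below; the action-cost and stopping functions are placeholders, and $\zeta$ stands for whatever minimum per-step charge the budget accounting enforces.

```python
import math

def run_search_loop(B_tool, B_tok, pick_action, should_stop, zeta=1.0):
    """Loop skeleton matching Lemma C.24: the potential b_tool + b_tok drops
    by at least zeta per continuing iteration, so the loop runs at most
    ceil((B_tool + B_tok) / zeta) times."""
    max_iters = math.ceil((B_tool + B_tok) / zeta)
    b_tool, b_tok = float(B_tool), float(B_tok)
    iters = 0
    while b_tool > 0 and b_tok > 0 and not should_stop(b_tool, b_tok):
        cost_tool, cost_tok = pick_action(b_tool, b_tok)  # nonnegative costs
        assert min(cost_tool, cost_tok) >= 0 and max(cost_tool, cost_tok) >= zeta
        b_tool -= cost_tool
        b_tok -= cost_tok
        iters += 1
    assert iters <= max_iters   # guaranteed by the potential argument
    return b_tool, b_tok
```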


## Appendix D Experimental Protocol and Implementation Details

#### Fixed controller implementation\.

All audited experiments were run on a workstation with 4 NVIDIA RTX 5880 Ada GPUs. The controller itself is deterministic and lightweight; the dominant cost comes from LLM inference and retrieval calls. The Stage-1 controller is a fixed, training-free scoring rule added on top of the BAVT search procedure. BAVT provides the planner, generator, critic interface, search procedure, and remaining-budget state; our added component is the task-level VOI action scorer over *Search*, *Decompose*, and *Answer*. No coefficients are learned from labeled trajectories, no regression model is fitted, and no benchmark-specific hyperparameter search is performed over the reported audit cells. The same controller parameters are used across all benchmarks, budget levels, and LLM backbones.

#### Budget pressure and action scoring\.

We use the normalized remaining\-budget pressure

$$\rho_t=1-\min\left\{\frac{b_{\mathrm{tool},t}}{B_{\mathrm{tool}}},\frac{b_{\mathrm{tok},t}}{B_{\mathrm{tok}}}\right\},$$

clipped to $[0,1]$. This scalar increases as either the tool-call or output-token budget becomes tight. It enters both the budget-dependent penalty term $\Pi_t(k;b_t)$ and the action-specific cost scale $d_t(k;b_t)$: additional *Search* and *Decompose* steps are penalized more strongly under tight budgets, while *Answer* becomes relatively more attractive when the trajectory has sufficient support.
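As a minimal sketch, the pressure term is a one-line computation over the two budget coordinates, assuming remaining and initial budgets as plain numbers:

```python
def budget_pressure(b_tool, b_tok, B_tool, B_tok):
    """Normalized remaining-budget pressure rho_t, clipped to [0, 1].

    rho_t is 0 when both budgets are untouched and approaches 1 as either
    the tool-call or the output-token budget nears exhaustion.
    """
    frac_remaining = min(b_tool / B_tool, b_tok / B_tok)  # tightest coordinate
    return min(max(1.0 - frac_remaining, 0.0), 1.0)
```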

#### Fixed coefficients and guards\.

The main fixed Stage-1 coefficients are a cost-penalty scale of 0.7, a decomposition bonus of 0.14, and an early-answer penalty of 0.18. These values remain fixed in all reported experiments. After forming the utility numerator, the controller divides positive utility by an action-specific cost scale and then applies deterministic guards. The guards suppress premature answer commitment under weak support, suppress decomposition for low-compositionality or factoid-like questions, downweight repeated decomposition after stagnation, and enforce minimum search for sufficiently compositional states. These guards only change the executable action score; they do not change the retrieval backend, generator, sample order, or budget accounting protocol.
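A schematic rendering of this pipeline under the stated fixed coefficients is sketched below; the utility, cost-scale, guard, and support-sufficiency inputs are placeholders for the paper's components, not the released code, and the exact way the coefficients combine is an assumption.

```python
COST_PENALTY_SCALE = 0.70   # fixed Stage-1 coefficients stated above
DECOMPOSE_BONUS    = 0.14
EARLY_ANSWER_PEN   = 0.18

def stage1_score(action, state, rho, utility, cost_scale, guards):
    """Schematic VOI-style action score: utility numerator with a
    budget-dependent penalty, cost normalization, then deterministic guards."""
    u = utility(action, state)                    # task-level utility estimate
    u -= COST_PENALTY_SCALE * rho                 # budget-dependent penalty grows with rho_t
    if action == "Decompose":
        u += DECOMPOSE_BONUS                      # structural decomposition bonus
    if action == "Answer" and not state["sufficient_support"]:
        u -= EARLY_ANSWER_PEN                     # discourage premature commitment
    r = max(u, 0.0) / (cost_scale(action, rho) + 1e-6)  # divide positive utility by cost
    return guards(action, state, r)               # guard layer adjusts executable score
```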

#### Stage\-2 finalization\.

The Stage-2 finalizer is a deterministic answer-selection rule over the completed trajectory. It compares the base answer $a_{\mathrm{base}}$ with a refined candidate $a_{\mathrm{ref}}$ derived from the same trace and accepts the refined candidate only under low-risk answer-form conditions, such as yes/no polarity repair, binary-choice repair, typed-slot correction, or supported factoid completion. It abstains under unresolved bridge structure, unresolved comparative reasoning, or missing direct support. This stage adds no tool calls and issues no additional LLM call during finalization.

## Appendix E Full Qwen3-32B Main Results

**Table 6: Main strict-budget results on Qwen3-32B.** Each method reports EM/F1 (higher is better) under the same symmetric strict dual-budget protocol with tool-call and output-token caps (1,100), (2,200), (2,300), and (3,500). VOI denotes our full two-stage budget-control method.

| Benchmark | Budget | BATS (EM/F1) | BAVT (EM/F1) | AFlow (EM/F1) | Search-o1 (EM/F1) | VOI (EM/F1) |
|---|---|---|---|---|---|---|
| Bamboogle | low | 0.02/0.03 | 0.11/0.17 | 0.03/0.03 | 0.00/0.02 | 0.15/0.21 |
| Bamboogle | lower-mid | 0.05/0.06 | 0.21/0.28 | 0.33/0.42 | 0.03/0.05 | 0.33/0.42 |
| Bamboogle | upper-mid | 0.42/0.54 | 0.33/0.42 | 0.33/0.42 | 0.20/0.24 | 0.43/0.53 |
| Bamboogle | high | 0.45/0.58 | 0.39/0.47 | 0.37/0.50 | 0.33/0.44 | 0.48/0.62 |
| HotpotQA | low | 0.01/0.02 | 0.11/0.14 | 0.01/0.01 | 0.00/0.00 | 0.14/0.17 |
| HotpotQA | lower-mid | 0.02/0.04 | 0.28/0.34 | 0.01/0.01 | 0.03/0.04 | 0.36/0.41 |
| HotpotQA | upper-mid | 0.40/0.49 | 0.34/0.41 | 0.01/0.01 | 0.09/0.14 | 0.39/0.47 |
| HotpotQA | high | 0.43/0.52 | 0.34/0.40 | 0.29/0.36 | 0.22/0.31 | 0.43/0.52 |
| MuSiQue | low | 0.02/0.03 | 0.11/0.17 | 0.01/0.01 | 0.00/0.02 | 0.12/0.16 |
| MuSiQue | lower-mid | 0.06/0.08 | 0.26/0.33 | 0.03/0.04 | 0.03/0.06 | 0.28/0.35 |
| MuSiQue | upper-mid | 0.35/0.43 | 0.26/0.34 | 0.06/0.08 | 0.16/0.19 | 0.33/0.40 |
| MuSiQue | high | 0.37/0.48 | 0.34/0.42 | 0.11/0.16 | 0.29/0.35 | 0.40/0.50 |
| 2WikiMultihopQA | low | 0.02/0.02 | 0.08/0.10 | 0.07/0.07 | 0.00/0.04 | 0.09/0.11 |
| 2WikiMultihopQA | lower-mid | 0.05/0.05 | 0.28/0.33 | 0.47/0.55 | 0.03/0.04 | 0.44/0.49 |
| 2WikiMultihopQA | upper-mid | 0.53/0.64 | 0.40/0.45 | 0.47/0.55 | 0.15/0.20 | 0.54/0.62 |
| 2WikiMultihopQA | high | 0.62/0.73 | 0.56/0.63 | 0.58/0.70 | 0.27/0.36 | 0.64/0.71 |
| **Average** | | 0.24/0.30 | 0.28/0.34 | 0.20/0.24 | 0.11/0.16 | 0.35/0.42 |

## Appendix F Qwen3.5-122B Backbone-Sensitivity Results

Figure [6](https://arxiv.org/html/2605.05701#A6.F6) reports the full budget scaling curves for the Qwen3.5-122B backbone across all four benchmarks. Compared with Qwen3-32B and GPT-5.4-Mini (Figure [3](https://arxiv.org/html/2605.05701#S5.F3)), Qwen3.5-122B exhibits a more mixed competitive regime: BATS and BAVT lead in several cells, particularly at upper budgets, while VOI maintains an advantage at lower budgets, where the explicit budget penalty provides the largest marginal benefit.

![Refer to caption](https://arxiv.org/html/2605.05701v1/x6.png)

**Figure 6: Budget scaling curves for Qwen3.5-122B across four datasets.** Each column corresponds to one benchmark; all methods are evaluated under the shared dual-budget protocol. See Figure [3](https://arxiv.org/html/2605.05701#S5.F3) for Qwen3-32B and GPT-5.4-Mini results.
## Appendix G Stage-1 Component Ablation Protocol

The ablation in Section [5.4](https://arxiv.org/html/2605.05701#S5.SS4) is designed to match the Stage-1 controller in Section [4.1](https://arxiv.org/html/2605.05701#S4.SS1). Each variant removes exactly one component from the three-stage scoring pipeline while keeping the search procedure, prompts, retrieval backend, sample set, and hard-budget audit unchanged. The *w/o penalty* variant removes the budget-dependent penalty term $\Pi_t(k;b_t)$ from Eq. ([2](https://arxiv.org/html/2605.05701#S4.E2)); *w/o normalization* replaces Eq. ([3](https://arxiv.org/html/2605.05701#S4.E3)) with the unnormalized score $[u_t(k)]_{+}$; *w/o structural* removes $\Psi_t(k)$ from Eq. ([2](https://arxiv.org/html/2605.05701#S4.E2)); and *w/o guards* bypasses $\mathfrak{G}_t$ in Eq. ([4](https://arxiv.org/html/2605.05701#S4.E4)). The full benchmark-level results are reported in Table [1](https://arxiv.org/html/2605.05701#S5.T1) in the main text.
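One way to keep such single-component ablations honest is to express each variant as exactly one toggled flag over an otherwise frozen pipeline; a hypothetical configuration sketch:

```python
# Hypothetical ablation toggles: each variant flips exactly one flag while
# prompts, retrieval backend, sample set, and the hard-budget audit stay fixed.
FULL = {"penalty": True, "normalization": True, "structural": True, "guards": True}

VARIANTS = {
    "w/o penalty":       {**FULL, "penalty": False},        # drop Pi_t(k; b_t)
    "w/o normalization": {**FULL, "normalization": False},  # score is [u_t(k)]_+
    "w/o structural":    {**FULL, "structural": False},     # drop Psi_t(k)
    "w/o guards":        {**FULL, "guards": False},         # bypass guard layer G_t
}
```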

## Appendix H Additional Audit Results and Usage Diagnostics

#### Token accounting\.

For BAVT-family runs, *avg API tokens* includes prompt and completion tokens across planner, generator, and critic calls. *Avg budget output tokens* is the quantity debited from the controller-side budget and reflects only the output-token budgeted component. All methods in the main table are evaluated under the same dual-budget constraint; the budget output token column reflects the output tokens charged against the budget in each case.

#### Method\-specific audit notes\.

AFlow, Search-o1, BATS, BAVT, and VOI are all scored under the same hard dual-budget audit used in the main table: the same Search-R1 retrieval backend, question-only queries, `top_k=5`, and the same tool-call and output-token caps. Any example whose realized tool calls or output tokens exceed the target budget is counted as failed for that cell. For AFlow, this audit is implemented over replayed workflows: a replayed sample contributes to the scored cell only if its realized tool calls and answer output tokens remain within the target cap. Search-o1 and BATS are reported under the same audit semantics, and BAVT/VOI share the same BAVT-family search backbone under the same budget ladder.
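The audit rule itself reduces to a per-example feasibility check; a minimal sketch, assuming per-example records of realized usage and metric scores:

```python
def audited_cell_score(examples, cap_tools, cap_tokens):
    """Hard dual-budget audit: an example contributes its EM/F1 only if its
    realized tool calls and output tokens both stay within the caps;
    otherwise it counts as failed (scored 0) for the cell."""
    em_sum = f1_sum = 0.0
    for ex in examples:  # ex: {"tool_calls", "output_tokens", "em", "f1"}
        feasible = ex["tool_calls"] <= cap_tools and ex["output_tokens"] <= cap_tokens
        em_sum += ex["em"] if feasible else 0.0
        f1_sum += ex["f1"] if feasible else 0.0
    n = len(examples)
    return em_sum / n, f1_sum / n
```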

#### Main\-table comparator set\.

The primary empirical comparator set in this paper is AFlow, BATS, Search\-o1, BAVT, and VOI\. All five appear in the main table because all five are evaluated under the same hard tool/output\-token audit\.

**Table 7: Cross-backbone macro deltas for VOI against four baselines.** ΔEM and ΔF1 are averaged over the 48 audited cells (3 backbones × 4 benchmarks × 4 budgets); each 95% CI is a cell-level bootstrap confidence interval.

| Comparison | ΔEM (est.) | 95% CI | ΔF1 (est.) | 95% CI |
|---|---|---|---|---|
| VOI vs AFlow | 0.2456 | [0.2100, 0.2819] | 0.3021 | [0.2556, 0.3491] |
| VOI vs BATS | 0.1081 | [0.0437, 0.1725] | 0.1219 | [0.0439, 0.1999] |
| VOI vs Search-o1 | 0.2734 | [0.2360, 0.3112] | 0.2993 | [0.2625, 0.3370] |
| VOI vs BAVT | 0.0492 | [0.0382, 0.0603] | 0.0530 | [0.0423, 0.0639] |
#### Feasible\-only usage audit\.

Table[8](https://arxiv.org/html/2605.05701#A8.T8)reports realized usage on the feasible subset of each dataset\-budget cell\. For scoring, all main\-table methods use the same hard tool/output\-token audit\. This table is a secondary feasibility\-conditioned diagnostic: for each benchmark and budget level, we retain only examples satisfying both caps and then average realized tool calls and tokens over the retained subset\. Different methods can therefore have different feasible subsets\. For VOI, BATS, BAVT, and Search\-o1, tokens are controller\-debited output tokens\. In these Qwen3\-32B runs, AFlow feasibility is mainly tool\-cap limited\.

**Table 8: Feasible-only average resource usage by dataset and budget.** Each method cell reports average tool calls / average output tokens (Avg. Tools / Avg. Tokens), computed only over examples that remain feasible under the corresponding tool/output-token cap. Budget labels follow the four-level budget ladder defined in Section [5.1](https://arxiv.org/html/2605.05701#S5.SS1). For scoring, all methods are subject to the same hard tool/output-token caps; the table is a feasible-only usage diagnostic rather than a scoring rule, and token entries report the output tokens available from each audited execution. †For AFlow, which is evaluated through replayed workflows, the token entry corresponds to answer-output tokens from the replay.

| Benchmark | Budget | VOI (Ours) | BATS | BAVT | AFlow† | Search-o1 |
|---|---|---|---|---|---|---|
| Bamboogle | low | 0.97 / 73.7 | 0.93 / 86.4 | 0.94 / 69.7 | 1.00 / 2.0 | 0.02 / 100.0 |
| Bamboogle | lower-mid | 1.61 / 138.8 | 1.00 / 190.9 | 1.43 / 135.4 | 1.96 / 7.8 | 0.69 / 193.9 |
| Bamboogle | upper-mid | 1.64 / 171.3 | 1.58 / 240.2 | 1.62 / 179.7 | 1.96 / 7.8 | 0.87 / 249.6 |
| Bamboogle | high | 1.91 / 195.9 | 1.91 / 268.5 | 1.89 / 211.3 | 2.19 / 7.0 | 1.06 / 350.5 |
| HotpotQA | low | 0.92 / 76.6 | 0.88 / 84.9 | 0.84 / 67.7 | 1.00 / 2.0 | 0.04 / 100.0 |
| HotpotQA | lower-mid | 1.49 / 134.2 | 1.00 / 189.6 | 1.56 / 138.6 | 1.00 / 2.0 | 0.63 / 193.7 |
| HotpotQA | upper-mid | 1.61 / 162.7 | 1.62 / 241.2 | 1.61 / 158.8 | 1.00 / 2.0 | 0.93 / 262.1 |
| HotpotQA | high | 1.91 / 197.7 | 1.91 / 269.0 | 1.88 / 218.9 | 2.98 / 8.9 | 1.09 / 361.5 |
| MuSiQue | low | 0.95 / 64.6 | 0.91 / 82.7 | 0.92 / 67.3 | 1.00 / 2.1 | 0.03 / 100.0 |
| MuSiQue | lower-mid | 1.60 / 141.1 | 1.00 / 193.8 | 1.46 / 137.3 | 1.48 / 4.6 | 0.68 / 192.5 |
| MuSiQue | upper-mid | 1.62 / 174.5 | 1.49 / 239.8 | 1.55 / 178.9 | 1.76 / 5.8 | 0.88 / 251.2 |
| MuSiQue | high | 1.85 / 191.5 | 1.95 / 278.7 | 1.87 / 226.0 | 2.21 / 7.2 | 1.09 / 364.4 |
| 2WikiMultihopQA | low | 0.81 / 74.8 | 0.86 / 85.6 | 0.73 / 69.7 | 1.00 / 2.3 | 0.13 / 100.0 |
| 2WikiMultihopQA | lower-mid | 1.73 / 140.7 | 1.17 / 189.8 | 1.64 / 132.9 | 1.90 / 6.8 | 0.50 / 193.6 |
| 2WikiMultihopQA | upper-mid | 1.82 / 166.2 | 1.91 / 249.7 | 1.72 / 176.9 | 1.90 / 6.6 | 1.00 / 268.5 |
| 2WikiMultihopQA | high | 2.04 / 193.9 | 2.09 / 271.4 | 1.91 / 222.2 | 2.14 / 6.1 | 1.22 / 376.4 |
