BAGEN: Are LLM Agents Budget-Aware?
Summary
This paper introduces BAGEN, a framework for evaluating budget awareness in LLM agents, defining budget estimation as internal and external budgets and formalizing progressive interval estimation. Experiments show that strong agents lack budget awareness, are over-optimistic, and that early stopping can save tokens while training improves alerting behavior.
# BAGEN: Are LLM Agents Budget-Aware?
Source: [https://arxiv.org/html/2606.00198](https://arxiv.org/html/2606.00198)
Yuxiang Lin$^{1,2,*}$, Zihan Wang$^{1,2,*,\dagger}$, Mengyang Liu$^{3,*}$, Yuxuan Shan$^{2,*}$, Longju Bai$^{4,*}$, Junyao Zhang$^{2}$, Xing Jin$^{3}$, Boshan Chen$^{2}$, Jinyan Su$^{5}$, Xingyao Wang$^{6}$, Jiaxin Pei$^{7,8}$, Manling Li$^{1}$
$^{*}$Core contributors. $^{\dagger}$Project lead.
$^{1}$Northwestern University, $^{2}$O2 Lab, $^{3}$Independent, $^{4}$University of Michigan, $^{5}$Cornell, $^{6}$All Hands AI, $^{7}$Stanford, $^{8}$UT Austin
[https://ragen-ai.github.io/bagen](https://ragen-ai.github.io/bagen)
###### Abstract
While agents are spending ever more resources, agent cost today is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal rather than a passive cost metric. We first systematically define budget estimation over internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget awareness as progressive interval estimation: at each step of a plan, an agent should predict a lower and an upper bound on the remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns across four environments and five frontier agents: (1) Strong agents do not necessarily have strong budget awareness, with correlation $r \approx 0.35$. (2) Frontier models are consistently over-optimistic and continue spending on tasks that are unlikely to succeed instead of alerting the user early. (3) The budget-aware signal is actionable and trainable: early stopping saves 28–64% of tokens on failed trajectories, and SFT+RL strengthens early-stop and alert behavior. (4) Precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL.
###### Abstract
Foundation-model agents are deployed under growing resource constraints such as token, money, and time budgets, yet it remains unclear whether they know how much budget they will spend. We call this capability *budget awareness* and formalize it for Budget-Aware Agents (BAGEN) as *progressive interval estimation*: whether the agent can, mid-execution, provide an interval on how much budget is still needed and judge whether the task is still finishable. We score this with a rollout-replay protocol that re-queries the same agent on every prefix of an unconstrained rollout, and decompose estimation into three sub-capabilities: feasibility prediction, early failure detection, and interval calibration. We evaluate five frontier models on four environments spanning *internal* budgets (token consumption on Sokoban, Search-R1, and SWE-bench) and *external* budgets (cost, time, and warehouse occupancy in a supply-chain environment curated from real enterprise data); we further train Qwen-7B budget estimators with SFT and RL on Sokoban, and deploy their predictions through a simple early-stop policy. Across these axes, we find that budget awareness (1) decouples from task performance, (2) fails in structured ways, and (3) is already actionable and trainable as a control signal that resource-limited agents currently lack. Code for this project will be open-sourced.
## 1 Introduction
Foundation\-model agents are increasingly deployed in longer horizons and higher\-stakes tasks: a coding agent consumes tokens per reasoning step, a web agent spends API calls per search query, and a supply\-chain agent commits real dollars and warehouse capacity per procurement decision\. The budget their generation consumes \(their*internal budget*, primarily tokens\) and the budget their actions commit \(their*external budget*, including money, time, and inventory\) are both growing rapidly with deployment horizon\. Yet existing benchmarks track this budget only after the fact, rarely asking whether the agent itself knew, mid\-execution, what it was about to spend\.
Figure 1: Progressive interval estimation in BAGEN. We record an unconstrained rollout, then re-query the same agent on every prefix to predict either an interval over remaining budget or an `impossible` declaration; predictions are scored against the realized remaining budget and outcome.
This raises a fundamental question: *does the agent know how much budget it needs?* We call this capability *budget awareness*: the ability of a Budget-Aware Agent (BAGEN) to estimate, mid-execution, how much budget remains and whether the task is still finishable. An agent that cannot estimate its own resource requirements cannot decide when to abort a hopeless task, when to request more resources, or how to allocate budget across sub-goals.
Two gaps prevent systematic study of this capability. First, agent research (liu2026budgetconstrained; ding2026calibratethenact; mccleary2026quantifying) usually reports token consumption as a post-hoc metric, but rarely asks whether the agent could *self-estimate* how much budget it would need. Second, most evaluation protocols collect a single-point prediction at task start, which mismatches long-horizon agentic tasks where feasibility evolves turn by turn. By analogy, a project manager re-estimates the remaining timeline at every milestone, gives a range rather than a point, and flags when completion becomes infeasible.
To address these limitations, we propose *progressive interval estimation* as a new agent capability that is confidence-aware and progressive throughout execution. We record a full agent rollout without any budget constraint, then query the same agent separately at every turn: *given current progress, how much budget remains to finish? Provide an interval with confidence, or declare the task impossible.* We decompose this capability into three sub-capabilities (feasibility prediction, early failure detection, and interval calibration) and ask five frontier models to perform it across four environments (Sokoban, Search-R1, SWE-bench, and a Warehouse environment with three coupled budget dimensions curated from real enterprise data) covering both internal and external budget modalities. Across these experiments, we find:
- Budget awareness is a distinct capability from task performance, separated by interval calibration rather than feasibility prediction. Task success correlates only weakly with interval hit rate ($r \approx 0.35$). On Search-R1, Opus achieves the highest task success rate (75.8%) but Sonnet produces better intervals (36.5% vs. 23.1%); no model dominates all three sub-capabilities (§[4](https://arxiv.org/html/2606.00198#S4)).
- Binary feasibility is a calibration problem; interval estimation is a reasoning problem, and all models are optimistically biased. SFT alone raises Qwen-7B feasibility accuracy from 25.5% to $\approx 90\%$, indicating the capability was latent; interval coverage only reaches 47% after SFT+RL. All twenty model–environment pairs underestimate remaining budget more often than they overestimate it, and weaker models are *more* optimistic, not less (§[5.1](https://arxiv.org/html/2606.00198#S5.SS1), §[5.2](https://arxiv.org/html/2606.00198#S5.SS2)).
- Estimates update over turns, but failure is recognized too late to act on. Estimates shift as execution progresses, but not reliably toward the truth. On failed trajectories, models predict feasibility above 70% even after 60% of the budget is consumed; the alarm fires only in the final 20% (§[5.3](https://arxiv.org/html/2606.00198#S5.SS3), §[5.4](https://arxiv.org/html/2606.00198#S5.SS4)).
- The signal is actionable but training is fragile. An early-stop policy keyed on `impossible` predictions saves between 28% and 64% of tokens on failed trajectories at a cost of only 1.6 to 4.2 percentage points in success rate. SFT followed by RL improves estimation on Sokoban; RL without an SFT warm-start collapses entirely (§[6](https://arxiv.org/html/2606.00198#S6)).
In general, our results show that budget is less a metric for after\-the\-fact accounting and more a control signal that resource\-limited agents currently lack\.
## 2 Two Failures of Single-Point Budget Estimation
We start with the simplest form of budget estimation: at task start, ask the agent for a single\-point estimate of the total tokens it will spend, and score it against the realized rollout budget\. As a pilot probe, we elicit such estimates from five frontier models on two internal\-budget environments, Sokoban and Search\-R1, collecting both a first\-turn estimate \(at task start\) and later\-turn estimates \(after replaying logged prefixes\); full setup is in App\.[B\.3](https://arxiv.org/html/2606.00198#A2.SS3)\. Two failures hold across all models, and they motivate the protocol of §[3](https://arxiv.org/html/2606.00198#S3)\.
The first failure is systematic optimism. Across all five models on both tasks, first-turn predictions underestimate the realized budget more often than they overestimate it (Figure [5](https://arxiv.org/html/2606.00198#S5.F5)); the bias tracks model confidence rather than task difficulty, with weaker models on a task being *more* optimistic, not less.
The second failure is that first-turn and later-turn estimates do not agree. On Sokoban, Gemini's feasibility macro-$F_1$ improves by 21.9 points from first-turn to all-turn evaluation; on Search-R1, Qwen3-235B moves the opposite way by 9.3 points (Figure [6](https://arxiv.org/html/2606.00198#S5.F6)). First-turn judgment is therefore neither a consistent over- nor underestimate of all-turn judgment; it is just a different judgment, and which side it falls on is model- and task-specific.
Together these failures argue against single-point estimation. A point estimate cannot express the optimism the model is exhibiting, but an interval can, and an explicit `impossible` option lets the model declare infeasibility rather than silently underestimating it. A single query at $k=0$ misses the refinement that comes from observing partial progress, but a query that fires every turn captures it. §[3](https://arxiv.org/html/2606.00198#S3) formalizes both changes as *progressive interval estimation*.
## 3 Agent Budget Awareness
The previous section surfaced two structural failures of single\-point budget estimation: systematic optimism and instability between first\-turn and later\-turn estimates\. We now formalize the capability for Budget\-Aware Agents \(BAGEN\) that addresses both\. We first introduce two budget modalities, then describe the progressive interval estimation protocol with its rollout\-replay procedure, and finally decompose estimation quality into three sub\-capabilities with concrete metrics\. The section closes with the experimental setup\.
### 3.1 Budget Modalities: Internal and External
Consider an agent executing a multi-turn trajectory $\tau=\{(o_t,z_t,a_t)\}_{t=1}^{T}$, where $o_t$ is the observation (e.g., environment state or tool output) received at turn $t$, $z_t$ is the agent's reasoning trace, and $a_t$ is the action. At each turn, the agent commits budget in two fundamentally different ways.
Internal budget. Internal budget is the compute generated by the model's own reasoning. Let $c_t^{\text{in}}$ denote the fresh token count at turn $t$. The cumulative internal cost up to turn $k$ is $C_k^{\text{in}}=\sum_{t=1}^{k}c_t^{\text{in}}$, and the remaining internal cost from turn $k+1$ onward is $R_k^{\text{in}}=C_T^{\text{in}}-C_k^{\text{in}}$. The agent operates under a total cap $B^{\text{in}}$, requiring $C_T^{\text{in}}\leq B^{\text{in}}$. A trajectory that exceeds this cap is truncated and counted as a failure. Internal-budget estimation tests whether the model can predict its own compute footprint: how many tokens will the remaining reasoning, planning, and action steps cost?
External budget. External budget is the cost the agent commits in the environment. It is generally multi-dimensional: let $\mathbf{c}_t^{\text{ex}}\in\mathbb{R}^{D}$ denote the $D$-dimensional cost vector at turn $t$, with cumulative usage $\mathbf{C}_k^{\text{ex}}=\sum_{t=1}^{k}\mathbf{c}_t^{\text{ex}}$ and remaining cost $\mathbf{R}_k^{\text{ex}}=\mathbf{C}_T^{\text{ex}}-\mathbf{C}_k^{\text{ex}}$, subject to per-dimension constraints $C_T^{\text{ex},(d)}\leq B^{(d)}$ for $d=1,\ldots,D$. External budget forces the agent to reason about coupled resource constraints: holding more inventory raises revenue but eats warehouse capacity, while drawing credit improves short-term cash but creates repayment pressure later.
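As a minimal sketch of the bookkeeping defined above (not the authors' code), the following Python computes the cumulative cost $C_k$ and remaining cost $R_k$ from per-turn costs, and checks the per-dimension caps for an external budget. The function names and all numbers are hypothetical and chosen only for illustration.

```python
from itertools import accumulate

def cumulative_and_remaining(per_turn_costs):
    """Return (C, R) where C[k-1] = C_k and R[k-1] = R_k = C_T - C_k for k = 1..T."""
    cumulative = list(accumulate(per_turn_costs))
    total = cumulative[-1] if cumulative else 0
    remaining = [total - c for c in cumulative]
    return cumulative, remaining

def within_caps(per_turn_vectors, caps):
    """External-budget check: every dimension d must satisfy C_T^(d) <= B^(d)."""
    totals = [sum(dim) for dim in zip(*per_turn_vectors)]
    return all(t <= b for t, b in zip(totals, caps))

# Hypothetical token counts for a 4-turn rollout (internal budget).
tokens = [120, 300, 80, 200]
C, R = cumulative_and_remaining(tokens)
print(C)  # [120, 420, 500, 700]
print(R)  # [580, 280, 200, 0]

# Hypothetical (cost, weeks, occupancy) vectors for a 2-turn rollout (external budget).
print(within_caps([(100.0, 2, 30), (50.0, 2, 40)], (200.0, 24, 100)))  # True
```

Note that $R_T = 0$ by construction, which is why the protocol only queries non-terminal turns.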
### 3.2 Progressive Interval Estimation and Rollout-Replay
Progressive interval estimation. At every turn $k$, given the trajectory prefix $\tau_{\leq k}=\{(o_t,z_t,a_t)\}_{t=1}^{k}$ and cumulative usage $C_k$, the estimator returns
$$\hat{y}_k=\begin{cases}[\hat{R}_k^{\text{lo}},\;\hat{R}_k^{\text{hi}}]&\text{if the agent predicts the task is still feasible},\\[4pt] \texttt{impossible}&\text{if the agent predicts completion is no longer achievable}.\end{cases}\tag{1}$$
The output captures three properties simultaneously: *uncertainty* (interval width); *progressiveness* (the estimate updates each turn); and *infeasibility awareness* (an explicit `impossible` option, enabling early stopping or rerouting). The first two address the two single-point failures; the third makes the signal actionable downstream.
Rollout-replay protocol. To separate budget estimation ability from task completion ability, we use a two-phase procedure (Figure [1](https://arxiv.org/html/2606.00198#S1.F1)). (1) Rollout generation. The agent executes the task without any budget constraint. We log the full trajectory $\tau$ together with per-turn cost $c_t$ and final outcome. (2) Prefix replay and estimation. For each non-terminal turn $k\in\{1,\ldots,T-1\}$, we replay the logged prefix $\tau_{\leq k}$ as history, append a cumulative-usage summary, and ask the agent to estimate $\hat{y}_k$ via Eq. [1](https://arxiv.org/html/2606.00198#S3.E1). Each prediction is scored against the true remaining cost $R_k=C_T-C_k$ and the true outcome.
A natural alternative is to let the agent estimate budget*during*its own rollout, interleaving estimation with execution\. We avoid this because estimation itself consumes tokens, which would conflate task\-completion cost with self\-assessment cost\. Online estimation is left to future work\.
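The two-phase rollout-replay procedure of this subsection can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's implementation: the `Estimate` and `Sample` types, the `replay` function, and the `toy_estimator` stand-in for the LLM query are all hypothetical names, and an `impossible` prediction is encoded as `interval=None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Estimate:
    """y_hat_k from Eq. 1: an interval over remaining budget, or `impossible`."""
    interval: Optional[Tuple[float, float]]  # None encodes `impossible`

@dataclass
class Sample:
    turn: int
    estimate: Estimate
    true_remaining: float
    succeeded: bool

def replay(turns: List[dict], costs: List[float], succeeded: bool,
           estimator: Callable[[List[dict], float], Estimate]) -> List[Sample]:
    """Phase 2: query the estimator on every non-terminal prefix of a logged rollout."""
    total = sum(costs)                       # C_T
    samples, used = [], 0.0
    for k in range(1, len(turns)):           # k = 1 .. T-1
        used += costs[k - 1]                 # C_k
        est = estimator(turns[:k], used)     # prefix tau_{<=k} plus usage summary
        samples.append(Sample(k, est, total - used, succeeded))
    return samples

def toy_estimator(prefix: List[dict], used: float) -> Estimate:
    # Hypothetical stand-in for the LLM call: guess that the remaining
    # cost equals what has been spent so far, plus or minus 50%.
    return Estimate((0.5 * used, 1.5 * used))

logged_turns = [{"t": i} for i in range(4)]  # phase 1 output (placeholder turns)
samples = replay(logged_turns, [100, 200, 100, 50], True, toy_estimator)
print([(s.turn, s.true_remaining) for s in samples])
# [(1, 350.0), (2, 150.0), (3, 50.0)]
```

Because the estimator only sees logged prefixes, its own token use never enters `costs`, which mirrors the separation of task-completion cost from self-assessment cost argued for above.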
### 3.3 Sub-Capability Metrics
Figure 2: Left and middle: estimation quality is only weakly related to task performance for both internal and external budgets. Right: on failed Sokoban trajectories, estimation accuracy increases as more task progress is observed, with the largest gains appearing later in the trajectory.
A Budget-Aware Agent (BAGEN) must do three things: tell whether the task can succeed under budget, recognize failure early enough to act on it, and provide a calibrated cost range when success is possible. We score each as a separate sub-capability.
(1) Feasibility prediction. *Can the agent tell whether the task will succeed under the remaining budget?* This is the coarsest, binary level of budget awareness. Let $y_k\in\{\text{feasible},\text{impossible}\}$ be the ground-truth label at turn $k$ (whether the agent succeeds within the budget), and $\hat{y}_k\in\{[\cdot,\cdot],\;\texttt{impossible}\}$ the model's output. We treat any interval as a "feasible" prediction and report
$$\text{Macro-}F_1=\tfrac{1}{2}\big(F_1^{\text{feasible}}+F_1^{\text{impossible}}\big),\tag{2}$$
computed over the full trajectory (*all-turn*) or the first turn only (*first-turn*).
(2) Early failure detection. *For tasks that ultimately fail, can the agent recognize that early?* This is critical for budget control: catching failure early makes early stopping and budget saving possible. Restricted to the positive class $y_k=\texttt{impossible}$,
$$\text{Fail-}F_1=\frac{2\cdot\text{Prec}_{\text{imp}}\cdot\text{Rec}_{\text{imp}}}{\text{Prec}_{\text{imp}}+\text{Rec}_{\text{imp}}}.\tag{3}$$
A high Fail-$F_1$ means the alarm fires when failure is real and stays silent on feasible trajectories.
(3) Remaining budget estimation. *When the task actually succeeds, can the agent predict that success and provide an accurate cost range?* Restricted to trajectories that ultimately succeed, we score interval quality only when the model also correctly predicts feasibility. An `impossible` prediction on a successful trajectory scores zero:
$$S_k=\begin{cases}\underbrace{\mathbf{1}\!\big[R_k\in[\hat{R}_k^{\text{lo}},\,\hat{R}_k^{\text{hi}}]\big]}_{\text{coverage}}\cdot\underbrace{\max\!\Big(0,\;1-\tfrac{\hat{R}_k^{\text{hi}}-\hat{R}_k^{\text{lo}}}{R_k}\Big)}_{\text{tightness}}&\text{if interval},\\[8pt] 0&\text{if }\texttt{impossible}.\end{cases}\tag{4}$$
Coverage is 1 iff the realized remaining cost $R_k$ falls inside the predicted interval. Tightness penalizes wide intervals: a perfect prediction $[R_k,R_k]$ scores 1, while an interval as wide as $R_k$ itself scores 0. For diagnostics, we also report interval hit rate (the fraction of success-case samples covered) and midpoint relative error, $\big|\tfrac{\hat{R}_k^{\text{lo}}+\hat{R}_k^{\text{hi}}}{2}-R_k\big|/R_k$, at the 50th and 90th percentiles.
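To make Eq. 4 concrete, here is a small Python sketch of the interval score and the midpoint relative error for a single sample on a successful trajectory. The function names and the example values are hypothetical; the code assumes $R_k > 0$ and encodes an `impossible` prediction as `None`.

```python
from typing import Optional, Tuple

def interval_score(interval: Optional[Tuple[float, float]],
                   true_remaining: float) -> float:
    """S_k from Eq. 4 for one sample on a successful trajectory.

    `interval` is (lo, hi), or None when the model answered `impossible`.
    Assumes true_remaining > 0.
    """
    if interval is None:
        return 0.0
    lo, hi = interval
    covered = lo <= true_remaining <= hi                      # coverage indicator
    tightness = max(0.0, 1.0 - (hi - lo) / true_remaining)    # width penalty
    return tightness if covered else 0.0

def midpoint_relative_error(interval: Tuple[float, float],
                            true_remaining: float) -> float:
    lo, hi = interval
    return abs((lo + hi) / 2 - true_remaining) / true_remaining

R = 400  # hypothetical realized remaining tokens
print(interval_score((300, 500), R))   # 0.5  (covered, width is half of R)
print(interval_score((400, 400), R))   # 1.0  (perfect point interval)
print(interval_score((200, 600), R))   # 0.0  (covered, but as wide as R)
print(interval_score((100, 300), R))   # 0.0  (misses R)
print(interval_score(None, R))         # 0.0  (`impossible` on a success)
print(midpoint_relative_error((100, 300), R))  # 0.5
```

The third case shows why coverage alone is not rewarded: a very wide interval hits the truth but carries no usable budget information.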
### 3.4 Experimental Setup
Internal-budget environments. We use three environments. Sokoban (junghanns1998sokoban) is a planning task where agents push boxes to targets on an $8\times 8$ grid, capped at 2,500 tokens. Search-R1 (jin2025searchr1) is multi-hop information retrieval, capped at 3,500 tokens. SWE-bench (jimenez2024swebench) has agents resolve GitHub issues, capped at 160 turns.
External-budget environment. We develop Warehouse (Appendix [C](https://arxiv.org/html/2606.00198#A3)), a supply-chain environment curated from real enterprise data, with three coupled budget dimensions: cumulative cost (USD), time (weeks), and warehouse occupancy (item-weeks). The agent manages inventory over a 24-week horizon (12 turns), making procurement and allocation decisions against all budgets simultaneously. As the task is continuous cash maximization rather than naturally binary, we evaluate budget awareness via *challenge-conditioned feasibility probes* (App. [B.4](https://arxiv.org/html/2606.00198#A2.SS4)), keeping the task success rate balanced 50/50 between reachable and unreachable so that $F_1$ and calibration are identifiable on both sides.
Training and evaluation. We evaluate five frontier models: GPT-5.2 Instant (openai2026gpt52instant), Claude Opus 4.7 (anthropic2026adaptiveThinking), Claude Sonnet 4.6 (anthropic2026sonnet46), Gemini 3.1 Pro (google2026gemini31pro), and Qwen3-235B (yang2025qwen3). We use Qwen2.5-7B-Instruct (qwen2-5) for SFT and RL, and use a combined reward for RL to prevent collapse (Appendix [E](https://arxiv.org/html/2606.00198#A5)).
Scale. We generate 128 rollouts per model on Sokoban, Search-R1, and Warehouse, and 64 rollouts on SWE-bench. Each non-terminal turn yields one estimation sample via the rollout-replay protocol, totaling 2,000 to 3,000 estimation samples per model-task pair.
## 4 Budget Awareness Decouples from Task Performance
Budget awareness is a distinct capability: excelling at completing tasks does not imply accurately estimating what those tasks will cost. We show this decoupling holds across all four environments and all three sub-capabilities.
Table 1: Overall rollout and budget-estimation results. Success and Turns report task performance; F1@1, F1@All, and Fail-$F_1$ report feasibility prediction; Hit and Reward report interval quality. F1@1 and F1@All denote first-turn and all-turn feasibility macro-$F_1$, respectively. Fail-$F_1$ measures detection of `impossible` cases. Warehouse uses n/a for rollout success because it does not have a separate task-success label.

| Environment | Model | Success | Turns | F1@1 | F1@All | Fail-$F_1$ | Hit | Reward |
|---|---|---|---|---|---|---|---|---|
| SWE-bench | Claude Opus 4.7 | 71.9% | 12.12 | 41.1% | 51.1% | 48.8% | 30.3% | 0.160 |
| SWE-bench | Claude Sonnet 4.6 | 68.8% | 20.66 | 32.2% | 37.7% | 23.4% | 22.3% | 0.130 |
| SWE-bench | Gemini 3.1 Pro Preview | 68.8% | 37.36 | 39.2% | 58.2% | 52.0% | 23.2% | 0.112 |
| SWE-bench | GPT-5.2 Instant | 57.8% | 21.52 | 43.5% | 40.2% | 21.2% | 44.3% | 0.115 |
| SWE-bench | Qwen3 235B | 33.3% | 62.92 | 47.6% | 35.1% | 32.8% | 6.5% | 0.021 |
| Search-R1 | Claude Opus 4.7 | 75.8% | 1.78 | 39.4% | 40.5% | 5.6% | 23.1% | 0.114 |
| Search-R1 | Claude Sonnet 4.6 | 71.1% | 1.87 | 37.9% | 33.3% | 0.0% | 36.5% | 0.154 |
| Search-R1 | GPT-5.2 Instant | 68.0% | 3.69 | 40.2% | 38.3% | 0.0% | 21.4% | 0.031 |
| Search-R1 | Gemini 3.1 Pro Preview | 53.9% | 2.09 | 24.5% | 24.8% | 0.0% | 20.7% | 0.079 |
| Search-R1 | Qwen3 235B | 35.2% | 9.94 | 33.2% | 23.9% | 30.9% | 0.0% | 0.000 |
| Sokoban | Claude Opus 4.7 | 56.2% | 5.04 | 46.3% | 45.6% | 16.0% | 46.4% | 0.112 |
| Sokoban | Claude Sonnet 4.6 | 51.6% | 5.65 | 46.4% | 53.6% | 33.9% | 45.1% | 0.148 |
| Sokoban | Gemini 3.1 Pro Preview | 39.1% | 5.63 | 40.0% | 61.9% | 79.9% | 8.8% | 0.313 |
| Sokoban | GPT-5.2 Instant | 35.2% | 9.02 | 27.7% | 40.6% | 32.8% | 36.0% | 0.167 |
| Sokoban | Qwen3 235B | 7.0% | 10.77 | 6.3% | 12.6% | 20.4% | 10.8% | 0.029 |
| Warehouse | GPT-5.2 Instant | n/a | 12.00 | 35.0% | 63.4% | 56.9% | 24.7% | 0.577 |
| Warehouse | Claude Opus 4.7 | n/a | 12.00 | 33.3% | 63.2% | 55.7% | 35.9% | 0.690 |
| Warehouse | Claude Sonnet 4.6 | n/a | 12.00 | 33.3% | 64.9% | 59.0% | 17.3% | 0.572 |
| Warehouse | Gemini 3.1 Pro Preview | n/a | 12.00 | 42.0% | 67.0% | 62.8% | 50.2% | 0.698 |
| Warehouse | Qwen3 235B | n/a | 12.00 | 41.0% | 60.8% | 56.0% | 17.3% | 0.483 |

### 4.1 Task Success Rate and Budget Estimation Are Decoupled
The best actor is not the best estimator. On Search-R1, Opus achieves the highest task success rate (75.8%), yet Sonnet produces better interval estimates (36.5% hit rate vs. 23.1% for Opus). On SWE-bench, the rankings split three ways: Opus leads task success, Gemini leads feasibility prediction, and GPT-5.2 leads interval estimation. On Warehouse, rollout success is not reported as a separate outcome, but estimation quality still varies substantially: Gemini achieves the highest interval hit rate (50.2%), while Sonnet and Qwen are lowest (17.3%).
The correlation is weak. Across 20 model-environment pairs, task success rate correlates only weakly with feasibility prediction ($r\approx 0.35$). The left and middle panels of Figure [2](https://arxiv.org/html/2606.00198#S3.F2) visualize this separation. A model that completes more tasks does not reliably produce better budget estimates. This decoupling suggests that budget awareness draws on different capabilities than task execution, perhaps metacognitive monitoring rather than problem-solving skill.
Agent estimates are not just linear extrapolation. On Warehouse, we compare each agent midpoint with a deterministic extrapolation baseline, $\widehat{R}_{\mathrm{lin}}=(C_k/k)\,T-C_k$, and report paired error reduction (Figure [3](https://arxiv.org/html/2606.00198#S4.F3)). Agent predictions reduce error for warehouse occupancy in most model-progress bins, especially early in the rollout. The pattern is mixed for cumulative cost, where extrapolation often remains competitive. Thus, the advantage of budget awareness is real but budget-dimension dependent.
Figure 3:Agent midpoint estimates versus linear extrapolation on Warehouse, where positive cells indicate lower absolute error for the agent than for the extrapolation baseline\.
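The extrapolation baseline $\widehat{R}_{\mathrm{lin}}=(C_k/k)\,T-C_k$ assumes every remaining turn costs the running per-turn average. The sketch below computes it and a paired error reduction against an agent midpoint. The reduction is taken here as baseline absolute error minus agent absolute error, which matches the sign convention in the Figure 3 caption, but any normalization the paper applies is not specified in this section, so treat that detail as an assumption. Function names and numbers are hypothetical.

```python
def linear_extrapolation(cumulative_cost: float, k: int, horizon: int) -> float:
    """R_lin = (C_k / k) * T - C_k: every turn is assumed to cost the running average."""
    return (cumulative_cost / k) * horizon - cumulative_cost

def error_reduction(agent_midpoint: float, cumulative_cost: float, k: int,
                    horizon: int, true_remaining: float) -> float:
    """Baseline absolute error minus agent absolute error (positive favors the agent)."""
    baseline_err = abs(linear_extrapolation(cumulative_cost, k, horizon) - true_remaining)
    agent_err = abs(agent_midpoint - true_remaining)
    return baseline_err - agent_err

# Hypothetical numbers: 12-turn horizon, 3000 USD spent after 3 turns,
# 8000 USD actually still to be spent, agent midpoint of 8500 USD.
print(linear_extrapolation(3000, 3, 12))          # 9000.0
print(error_reduction(8500, 3000, 3, 12, 8000))   # 500.0
```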
### 4.2 No Model Dominates All Sub-Capabilities
We decompose budget awareness into three sub-capabilities: feasibility prediction (binary), early failure detection, and interval calibration. Table [1](https://arxiv.org/html/2606.00198#S4.T1) reports them across all model-environment pairs.
No single model leads on all three. Gemini achieves the highest binary $F_1$ on Sokoban (61.9%) but the lowest interval hit rate (8.8%). On Warehouse, Gemini leads Fail-$F_1$ (62.8%) yet Qwen produces a low hit rate (17.3%). Each frontier model has its own profile: a different combination of when it judges feasibility correctly, when it raises infeasibility alarms in time, and how tight its intervals are when the task succeeds.
What separates good estimators is calibration, not feasibility prediction. Interval hit rate correlates strongly with midpoint bias ($r\approx -0.67$) and width adequacy ($r\approx 0.62$), but only weakly with feasibility $F_1$ ($r\approx 0.35$). Models that correctly predict feasibility often still produce poorly calibrated intervals. Feasibility prediction is necessary but not sufficient for budget awareness, a split that foreshadows the calibration-versus-reasoning analysis in §[5](https://arxiv.org/html/2606.00198#S5).
## 5 Why Does Budget Estimation Fail?
Figure 4: Training tradeoffs for interval estimation. Left: coverage versus midpoint error for SFT checkpoints and their SFT+RL continuations. Right: reward before and after RL from the same SFT starts. This indicates that RL can improve estimation performance, but only when it starts from a suitable SFT initialization; without an appropriate SFT warm-start, training collapses.
Figure 5: Models generally estimate budget too optimistically throughout the rollout. Conservative bias increases with rollout progress, but remains secondary to optimism overall.
Budget awareness is decoupled from task performance, but we have not yet explained why estimation fails. We test four hypotheses for the underlying mechanism:
- Hypothesis 1 (capability gap). Models lack the reasoning required to predict budget.
- Hypothesis 2 (optimistic prior). Models have a general bias to underestimate remaining budget.
- Hypothesis 3 (static estimation). Models cannot leverage execution outcomes to update estimates.
- Hypothesis 4 (late recognition). Models recognize doomed trajectories only after most of the budget has already been spent.
The picture that emerges is mixed\. Hypotheses 2 and 4 are supported across all five frontier models\. Hypothesis 1 is supported for interval estimation but rejected for binary feasibility, which is a calibration problem\. Hypothesis 3 is rejected outright: estimates do update over turns, though the direction of the update is model\-specific\.
### 5.1 Binary Feasibility Is Calibration; Interval Estimation Is Reasoning (Hypothesis 1)
We train Qwen\-7B on Sokoban budget estimation using supervised fine\-tuning \(SFT\) followed by reinforcement learning \(RL\)\. If Hypothesis 1 holds, fine\-tuning should leave performance largely unchanged because capability is the bottleneck\. If Hypothesis 1 fails, fine\-tuning should close the gap, indicating the capability was already there\.
Binary feasibility is easy to train; precise estimation is hard. The base Qwen-7B reaches 25.5% feasibility prediction accuracy on Sokoban. SFT alone raises accuracy to roughly 90%, with no RL needed. Feasibility prediction appears to be a calibration problem: the model already has the capability, but needs the right format. Hypothesis 1 is therefore rejected for feasibility prediction.
Interval estimation improves more slowly. The base model achieves 10.5% coverage; SFT raises coverage to between 26% and 53% depending on target interval width; SFT+RL reaches 47% coverage with a median midpoint relative error of 28% (Figure [4](https://arxiv.org/html/2606.00198#S5.F4)). Even after training, more than half of the intervals still miss the true remaining budget. Hypothesis 1 is supported for interval estimation: the bottleneck is reasoning, not output format.
RL without SFT collapses. RL alone, without an SFT warm-start, fails outright: the model either outputs `impossible` for everything, or exploits the reward by emitting invalid formats. SFT supplies a format prior that RL cannot recover from sparse reward on its own. This fragility is consistent with Hypothesis 1: when the underlying capability is thin, supervision matters more than reward.
Training transfers imperfectly beyond Sokoban. Table [2](https://arxiv.org/html/2606.00198#S5.T2) extends the SFT-then-GRPO pipeline to five budget-estimation settings. Training improves reward in every in-task setting, with the largest gain on Warehouse and the strongest token-budget interval quality on Search-R1. Sokoban and SWE-bench remain harder: models learn feasibility more readily than calibrated intervals. Cross-task evaluation retains only 17–36% of in-task reward, suggesting that output format transfers, but budget estimation is largely task-specific.
Table 2: Extended SFT+GRPO budget-estimator results. In-task rows compare the untrained estimator reward $R_{\mathrm{base}}$ with the trained reward $R_{\mathrm{train}}$. Cross-task rows report the source-task trained reward $R_{\mathrm{src}}$, the held-out target-task reward $R_{\mathrm{cross}}$, and reward retention. Cover is interval coverage on possible cases; MRE50 is median midpoint relative error. A dash indicates that the trained model produced no possible intervals to score.

| Block | Setting | Learner / transfer | Unit | $R_{\mathrm{base/src}}$ | $R_{\mathrm{train/cross}}$ | Gain / retention | Format | Acc. | Cover | MRE50 |
|---|---|---|---|---|---|---|---|---|---|---|
| In-task training | Search-R1 | Qwen2.5-7B | Tokens | 0.029 | 0.258 | +0.229 | 100.0% | 56.2% | 56.7% | 8.8% |
| In-task training | Sokoban | Llama-3.1-8B | Tokens | 0.006 | 0.155 | +0.149 | 100.0% | 78.8% | 0.0% | 85.2% |
| In-task training | Sokoban | Qwen3-4B | Tokens | 0.002 | 0.143 | +0.141 | 88.5% | 73.0% | 0.0% | 2712.5% |
| In-task training | SWE-bench | Qwen2.5-7B | Tokens | 0.000 | 0.058 | +0.058 | 52.8% | 28.9% | 0.0% | – |
| In-task training | Warehouse | Qwen2.5-7B | Cost | 0.004 | 0.555 | +0.551 | 100.0% | 84.6% | 78.5% | 16.1% |
| Cross-task transfer | Sokoban $\rightarrow$ Search-R1 | Qwen3-4B | Tokens | 0.143 | 0.052 | 36% | 87.7% | 40.8% | 10.5% | 92.4% |
| Cross-task transfer | Sokoban $\rightarrow$ Search-R1 | Llama-3.1-8B | Tokens | 0.155 | 0.026 | 17% | 86.9% | 44.6% | 0.0% | 94.4% |
| Cross-task transfer | Search-R1 $\rightarrow$ Sokoban | Qwen2.5-7B | Tokens | 0.258 | 0.090 | 35% | 84.4% | 50.0% | 0.0% | 99.6% |

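The retention column in the cross-task rows of Table 2 is consistent with reading retention as $R_{\mathrm{cross}}/R_{\mathrm{src}}$. That definition is an assumption inferred from the reported numbers rather than stated in the caption; the short check below reproduces the three reported percentages under it. The dictionary and function names are hypothetical.

```python
# (R_src, R_cross, reported retention in %) copied from the cross-task rows of Table 2.
cross_task_rows = {
    "Sokoban -> Search-R1, Qwen3-4B": (0.143, 0.052, 36),
    "Sokoban -> Search-R1, Llama-3.1-8B": (0.155, 0.026, 17),
    "Search-R1 -> Sokoban, Qwen2.5-7B": (0.258, 0.090, 35),
}

def retention_percent(r_src: float, r_cross: float) -> int:
    """Assumed definition: held-out reward as a rounded percentage of source reward."""
    return round(100 * r_cross / r_src)

for name, (r_src, r_cross, reported) in cross_task_rows.items():
    # Prints the recomputed value next to the value reported in the table.
    print(name, retention_percent(r_src, r_cross), reported)
```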
Table 3: Aggregate early-stopping tradeoff by estimator model. False-abort rate is the fraction of successful-prefix samples incorrectly labeled `impossible`. Saved token share is the fraction of failed-rollout tokens saved after the first `impossible` prediction.

| Model | False-abort rate | Saved tokens | False-abort count | Stopped failed rollouts |
|---|---|---|---|---|
| GPT-5.2 Instant | 6.6% | 64.1% | 183 / 2,776 | 124 / 215 |
| Claude Opus 4.7 | 2.2% | 28.2% | 50 / 2,234 | 62 / 169 |
| Claude Sonnet 4.6 | 3.3% | 49.6% | 76 / 2,294 | 101 / 183 |
| Gemini 3.1 Pro | 2.8% | 55.7% | 63 / 2,266 | 123 / 221 |
| Qwen3 235B | 4.9% | 38.8% | 190 / 3,909 | 140 / 306 |

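The two quantities defined in the Table 3 caption can be sketched as follows. This is one plausible reading, not the paper's evaluation code: it assumes that stopping at the first `impossible` prediction after turn $k$ saves the tokens of turns $k+1,\ldots,T$, and that the saved share is pooled over all failed rollouts. Function names and the toy rollouts are hypothetical.

```python
def saved_token_share(failed_rollouts):
    """Fraction of failed-rollout tokens saved by stopping at the first `impossible`.

    failed_rollouts: list of (per_turn_tokens, per_turn_predictions), where
    per_turn_predictions[k-1] is the label predicted after turn k (non-terminal turns).
    """
    saved = total = 0
    for tokens, preds in failed_rollouts:
        total += sum(tokens)
        for k, pred in enumerate(preds, start=1):
            if pred == "impossible":
                saved += sum(tokens[k:])   # turns k+1..T are never executed
                break
    return saved / total if total else 0.0

def false_abort_rate(success_prefix_preds):
    """Fraction of successful-prefix samples incorrectly labeled `impossible`."""
    return sum(p == "impossible" for p in success_prefix_preds) / len(success_prefix_preds)

# Two hypothetical failed rollouts: the first is stopped after turn 2, the second never.
failed = [
    ([100, 100, 100, 100], ["feasible", "impossible", "impossible"]),
    ([50, 150], ["feasible"]),
]
print(saved_token_share(failed))                                        # 200 of 600 tokens
print(false_abort_rate(["feasible", "impossible", "feasible", "feasible"]))  # 0.25

# Consistency check against Table 3 (GPT-5.2 Instant): 183 / 2,776 false aborts.
print(round(100 * 183 / 2776, 1))  # 6.6
```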
### 5.2 Optimistic Bias Is Universal Across Models and Tasks (Hypothesis 2)
On every progressive interval estimation prediction across Sokoban, Search\-R1, SWE\-bench, and Warehouse, we record whether the predicted interval misses the realized remaining budget on the optimistic side \(predicted budget below realized\) or the conservative side \(predicted budget above realized\)\. Hypothesis 2 predicts that optimistic misses dominate uniformly\.
Almost all models underestimate remaining budget\. Across all 20 model\-environment pairs, optimistic misses outnumber conservative ones at every rollout\-progress bin \(Figure [5](https://arxiv.org/html/2606.00198#S5.F5)\)\. Gemini and Qwen are the most optimistic; Sonnet and Opus are closest to calibrated, but still skew low\. The bias is not eliminated by averaging over more turns\. Hypothesis 2 is supported\.
The bias tracks model confidence, not task difficulty\. Within an environment, weaker models are *more* optimistic about finishing the task, not less\. This is the opposite of what would be expected if optimism reflected limited reasoning about hard tasks\. The pattern is consistent with overconfidence under limited self\-awareness: the model does not know what it does not know\.
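The miss classification used in this subsection can be sketched in a few lines. This is a minimal illustration, not the paper's scoring code: the function names, the five-bin default, and the record layout are assumptions made for the example.

```python
from collections import Counter

def classify_miss(est_low, est_high, realized):
    """Label one interval prediction against the realized remaining budget."""
    if realized > est_high:
        return "optimistic"    # predicted budget below realized
    if realized < est_low:
        return "conservative"  # predicted budget above realized
    return "hit"

def miss_counts_by_bin(records, n_bins=5):
    """records: iterable of (progress in [0, 1], est_low, est_high, realized)."""
    counts = [Counter() for _ in range(n_bins)]
    for progress, lo, hi, realized in records:
        b = min(int(progress * n_bins), n_bins - 1)  # progress == 1.0 goes in the last bin
        counts[b][classify_miss(lo, hi, realized)] += 1
    return counts

# Hypothetical predictions: two early in the rollout, one late.
records = [(0.1, 200, 400, 900), (0.1, 500, 800, 450), (0.9, 100, 300, 250)]
bins = miss_counts_by_bin(records)
```

Comparing the `optimistic` and `conservative` counts within each bin gives the per-bin comparison described above.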
### 5\.3 Estimates Update Over Turns, but Not Always Toward the Truth \(Hypothesis 3\)
Figure 6: First\-turn vs all\-turn feasibility $F_1$\. Points scatter on both sides of the equality line, so first\-turn predictions do not summarize what the same agent would say after seeing partial progress\.
For each model and environment, we compare feasibility macro\-$F_1$ computed using only the first\-turn prediction against macro\-$F_1$ using all\-turn predictions\. If Hypothesis 3 holds, the two should be similar because the model cannot use mid\-execution evidence\. If Hypothesis 3 fails, all\-turn $F_1$ should differ from first\-turn $F_1$\.
Later turns produce different estimates than early ones, but not always better\. On Sokoban, Gemini’s macro\-$F_1$ improves by 21\.9 points from first\-turn to all\-turn evaluation\. On SWE\-bench, Qwen moves in the opposite direction, dropping by 12\.5 points\. Across model\-environment pairs, points scatter on both sides of the equality line in Figure [6](https://arxiv.org/html/2606.00198#S5.F6)\. Hypothesis 3 is therefore rejected: estimates do update with execution\. The follow\-up question is whether updates are refinements toward truth, and the answer is mixed: refinement happens for some model\-environment combinations, the opposite for others\. The protocol cannot rely on later turns being uniformly better\.
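As an illustration of the comparison, the sketch below computes a standard two-class macro-F1 over first-turn samples only and over all samples. It assumes the textbook macro-F1 definition (the paper's own definition is its Eq. 2, which may differ in details), and the sample tuples are invented for the example.

```python
def f1_for_class(y_true, y_pred, cls):
    """Per-class F1; returns 0.0 when the class has no true positives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(samples, classes=("feasible", "impossible")):
    """samples: list of (rollout_id, turn_index, true_label, predicted_label)."""
    y_true = [s[2] for s in samples]
    y_pred = [s[3] for s in samples]
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical samples: rollout "b" is only recognized as impossible on its second turn.
samples = [
    ("a", 0, "feasible", "feasible"),
    ("a", 1, "feasible", "feasible"),
    ("b", 0, "impossible", "feasible"),
    ("b", 1, "impossible", "impossible"),
]
first_turn_f1 = macro_f1([s for s in samples if s[1] == 0])
all_turn_f1 = macro_f1(samples)
```

Plotting `first_turn_f1` against `all_turn_f1` for each model-environment pair gives one point of the kind shown in Figure 6.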
### 5\.4 Failure Is Recognized Too Late to Act On \(Hypothesis 4\)
Figure 7: Failure is recognized late: across environments, models often label failed trajectories as `impossible` only after much of the token budget has already been spent\.
On trajectories that ultimately fail \(the agent runs out of budget without solving the task\), we measure when within the trajectory the model first predicts `impossible`\. Hypothesis 4 predicts that the alarm fires only late, after most of the budget has been spent\. The right panel of Figure [2](https://arxiv.org/html/2606.00198#S3.F2) shows this pattern on Sokoban: failure prediction accuracy improves as the rollout progresses, but the largest gains appear only after substantial budget has already been consumed\.
Models keep predicting “feasible” long after the task is doomed\. On failed trajectories, models predict feasibility at rates above 70% even after 60% of the budget has been consumed\. The prediction drops sharply only in the final 20% of execution, which is too late for meaningful intervention \(Figure [7](https://arxiv.org/html/2606.00198#S5.F7)\)\. Hypothesis 4 is supported\. This late recognition wastes substantial compute, which we quantify in §[6\.1](https://arxiv.org/html/2606.00198#S6.SS1)\.
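One way to measure when the alarm first fires on a failed trajectory is sketched below. It assumes that `predictions[k]` is the label produced after the estimator has seen turns `0..k`, and that progress is measured as the share of the trajectory's own realized tokens; both conventions are assumptions of this example, not details confirmed by the text.

```python
def first_alert_fraction(turn_tokens, predictions):
    """Share of a trajectory's tokens already spent when 'impossible' is first predicted.

    turn_tokens[i] is the token usage of turn i; predictions has one label per
    non-terminal prefix. Returns None if the estimator never raises the alarm.
    """
    total = sum(turn_tokens)
    for k, label in enumerate(predictions):
        if label == "impossible":
            return sum(turn_tokens[:k + 1]) / total
    return None

# Hypothetical failed rollout with four turns and three prefix predictions.
frac = first_alert_fraction([300, 400, 500, 800], ["feasible", "feasible", "impossible"])
```

In this toy case the alarm fires after 1,200 of 2,000 tokens, so only the final 800 tokens could still be saved.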
## 6 The Signal Is Actionable and Trainable
The four tests in §[5](https://arxiv.org/html/2606.00198#S5) cluster failures into a *calibration regime* \(binary feasibility, late recognition\) and a *reasoning regime* \(precise intervals, persistent optimism\)\. The two admit different remedies, which we develop in this section\.
### 6\.1 Early Stopping Saves Tokens at Low Risk
A simple policy: stop when the model predicts infeasible\. At each turn, if the model’s estimate is `impossible`, we terminate the trajectory\. The policy has two error types: a *false abort* stops a trajectory that would have succeeded, and a *false continue* fails to stop one that ultimately fails\. False aborts trade success for compute; false continues are missed opportunities to save compute\.
The savings are substantial with minimal cost\. Across models, early stopping saves between 28% and 64% of tokens on failed trajectories while reducing overall success rate by only 1\.6 to 4\.2 percentage points \(Table [3](https://arxiv.org/html/2606.00198#S5.T3)\)\. The estimation signal is already present in the model’s predictions; the policy simply acts on it\. GPT\-5\.2 achieves the highest token savings \(64%\) but also the highest false\-abort rate \(6\.6%\)\. Opus is more conservative, saving 28% of tokens with only 2\.2% false aborts\. Per\-benchmark behavior is consistent with the aggregate pattern \(Table [6](https://arxiv.org/html/2606.00198#A4.T6)\): savings are largest on Warehouse and SWE\-bench, smaller on Sokoban, and near zero on Search\-R1, where rollouts are short enough that infeasibility detection rarely fires before the run ends naturally\.
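The two quantities reported in Table 3 can be computed from logged rollouts roughly as follows. This is a sketch under stated assumptions: the rollout dictionary layout is hypothetical, and tokens are counted as saved from the turn after the first `impossible` prediction onward.

```python
def early_stop_report(rollouts):
    """rollouts: list of dicts with 'success' (bool), 'turn_tokens' (list of int),
    and 'predictions' (one label per non-terminal prefix)."""
    false_aborts = success_samples = 0
    saved = failed_total = 0
    for r in rollouts:
        preds = r["predictions"]
        if r["success"]:
            # False abort: a successful-prefix sample labeled impossible.
            success_samples += len(preds)
            false_aborts += sum(p == "impossible" for p in preds)
        else:
            failed_total += sum(r["turn_tokens"])
            if "impossible" in preds:
                k = preds.index("impossible")            # first alarm, after turn k
                saved += sum(r["turn_tokens"][k + 1:])   # turns that would be skipped
    return {
        "false_abort_rate": false_aborts / success_samples if success_samples else 0.0,
        "saved_token_share": saved / failed_total if failed_total else 0.0,
    }

# Hypothetical logs: one successful rollout, two failed ones.
report = early_stop_report([
    {"success": True, "turn_tokens": [100, 100, 100], "predictions": ["feasible", "impossible"]},
    {"success": False, "turn_tokens": [200, 300, 500], "predictions": ["impossible", "impossible"]},
    {"success": False, "turn_tokens": [500, 500], "predictions": ["feasible"]},
])
```

The third rollout is a false continue: it fails without ever being flagged, so it adds to the failed-token total but contributes no savings.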
Figure 8:SFT choices control the behavior of the Sokoban budget estimator\. Wider interval targets improve coverage but increase midpoint error, while longer SFT makes the model more conservative by reducing feasible predictions and improving recall on impossible cases\.
### 6\.2 Training Dynamics: Width, Epoch, and Initialization
Training does more than improve accuracy; it shifts where the model places its confidence\. §[5\.1](https://arxiv.org/html/2606.00198#S5.SS1)reported the headline result that SFT closes most of the binary feasibility gap and SFT followed by RL caps interval coverage near 47%\. Here we characterize three aspects of the training procedure that determine where on the coverage, tightness, and optimism tradeoff curves a trained model lands\.
SFT interval width controls the coverage\-tightness tradeoff\. Training with narrow target intervals \(10% of remaining budget\) produces tight but low\-coverage predictions: 26% coverage with 45% midpoint relative error\. Training with wide intervals \(1,000 tokens fixed\) achieves 90%\+ coverage but poor precision \(170% MRE\)\. The sweet spot is moderate width: 100\-token fixed intervals achieve 47% coverage with 49% MRE \(Figure [8](https://arxiv.org/html/2606.00198#S6.F8)\)\. Wider targets buy coverage at the cost of precision; narrower targets do the reverse\.
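To make the two metrics and the width knob concrete, here is a small sketch. It assumes target intervals are centered on the realized remaining budget and that "width" means total interval width; neither detail is specified in the text, and the function names are illustrative.

```python
from statistics import median

def make_target(remaining, width=None, rel_width=None):
    """Centered SFT target interval; pass a fixed token width or a relative width."""
    w = width if width is not None else rel_width * remaining
    return (max(0.0, remaining - w / 2), remaining + w / 2)

def coverage(intervals, realized):
    """Fraction of intervals that contain the realized remaining budget."""
    hits = [lo <= y <= hi for (lo, hi), y in zip(intervals, realized)]
    return sum(hits) / len(hits)

def mre50(intervals, realized):
    """Median relative error of the interval midpoint."""
    errs = [abs((lo + hi) / 2 - y) / y for (lo, hi), y in zip(intervals, realized) if y > 0]
    return median(errs)

# Hypothetical predictions against realized remaining budgets.
preds = [(900, 1100), (400, 600), (100, 200)]
realized = [1000, 700, 180]
cov, err = coverage(preds, realized), mre50(preds, realized)
```

Widening every interval in `preds` can only raise `cov`, while `err` depends only on the midpoints, which is why a model trained on wide targets can gain coverage without becoming any more precise.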
## 7 Related Work
Budgets and self\-monitoring in LLM agents\. A large body of prior work studies how to *spend* a budget well: constrained and budgeted sequential decision making optimizes reward under resource constraints \(achiam2017constrained;tessler2019reward;carrara2019budgeted;wachi2020safe;zheng2020constrained;liu2022constrainedvariational;li2023nearoptimal;mazumdar2024safe\); budget\-aware LLM agents add token, tool\-use, or monetary limits and study cost\-aware planning \(liu2026budgetconstrained;ding2026calibratethenact;mccleary2026quantifying\); and adaptive\-compute, early\-exit, and test\-time\-scaling methods decide when to stop or skip computation for efficiency \(graves2016adaptive;teerapittayanon2016branchynet;wang2018skipnet;kaya2019shallowdeep;chen2020learningtostop;zeng2023learningtoskip;zhou2026adaptive;snell2025scaling;muennighoff2025s1\)\. A complementary line asks whether models can assess their own correctness, using reflection, self\-feedback, and tool\-based critique \(kadavath2022language;xiong2024can;kapoor2024large;yona2024can;deng2024towards;manakul2023selfcheckgpt;madaan2023selfrefine;shinn2023reflexion\) and self\-correction methods \(gou2024critic;liu2024intrinsicselfcorrection;hu2024uncertainty;wang2024devils\), and analyzes when and why long\-horizon agents fail rather than just final outcomes \(wang2024learningfromfailure;barke2026agentrx;yao2023react;yao2022webshop;zhou2023webarena;liu2024agentbench;shridhar2021alfworld;ma2019selfmonitoring;ma2019theregretful\)\. We ask the dual question: can an agent self\-estimate, mid\-trajectory, how much budget it will still need? Budget awareness is thus a deployment\-relevant slice of self\-monitoring where the dominant failure mode is selective over\-optimism on failed trajectories \(bai2026ai\); we expand this in Appendix [A](https://arxiv.org/html/2606.00198#A1)\.
Prediction intervals and calibration\. The closest methodological precedent is prediction\-interval and calibration work, which studies finite\-sample coverage, interval sharpness, conformal validity, and post\-hoc calibration \(lei2018distributionfree;pearce2018highquality;kuleshov2018accurate;romano2019conformalized;amini2020deep;chung2021beyond;levi2022evaluating\)\. These methods assume a fixed predictor on static data; in our setting the remaining\-budget distribution evolves along the trajectory and is partly induced by the agent’s own future actions, closer to online belief tracking than offline calibration\.
## 8 Conclusion and Limitations
We introduce Budget\-Aware Agents \(BAGEN\) as a new capability requirement for agents, formalize it as *progressive interval estimation*, and evaluate it on five frontier models across four internal\- and external\-budget environments via a rollout\-replay protocol\. The capability decouples from task performance, fails in structured ways \(optimistic bias, late failure recognition, calibration\-bound feasibility vs\. reasoning\-bound intervals\), and is already actionable through a simple early\-stop policy and trainable on Sokoban via SFT\-then\-RL\. Limitations and open directions include extending the training analysis to additional model scales and environments, closing the loop so estimator predictions feed back into actor decisions beyond early stopping, supporting fungible budgets across dimensions rather than independent hard constraints, and closing the interval\-estimation gap, which our results pinpoint as the central open problem\.
## References
## Appendix
## Appendix A Extended Related Work
Budgeted decision making versus budget estimation\. Prior work on constrained and budgeted sequential decision making typically formulates budget as an external constraint or environment\-provided cost signal, and studies how to maximize reward while satisfying safety, resource, or episode\-wise constraints \[achiam2017constrained,tessler2019reward,carrara2019budgeted,wachi2020safe,zheng2020constrained,liu2022constrainedvariational,li2023nearoptimal,mazumdar2024safe\]\. Recent LLM\-agent work further introduces explicit resource limits into agentic systems, including token budgets, tool\-use limits, monetary API costs, adaptive stopping rules, budget\-aware planning, and cost\-aware exploration \[zhou2026adaptive,liu2026budgetconstrained,ding2026calibratethenact,mccleary2026quantifying\]\. However, these studies primarily ask how an agent should act when a budget is known, whereas our work asks whether the agent can estimate the remaining budget required to complete a task from an intermediate trajectory state, with uncertainty\-aware intervals and infeasibility warnings rather than only realized cost or point forecasts \[romano2019conformalized,kuleshov2018accurate\]\. This distinction is especially important for deployed agents that must manage multiple coupled resources, including model\-side internal budgets such as tokens and tool calls and environment\-side external budgets such as money, time, inventory, and warehouse capacity\. It shows why budget estimation should be evaluated as a standalone capability, rather than inferred from task success or total budget use \[zhou2026adaptive,liu2026budgetconstrained,ding2026calibratethenact,mccleary2026quantifying\]\.
Prediction intervals, calibration, and uncertainty\-aware budget estimation\. Prediction\-interval and calibration work provides the closest methodological precedent because it studies uncertainty\-aware prediction: given an input $x$, the predictor outputs an interval $[\hat{y}_{\mathrm{low}}, \hat{y}_{\mathrm{high}}]$ that should contain the true target $y$ with a desired probability, while also remaining as sharp as possible \[lei2018distributionfree,pearce2018highquality,kuleshov2018accurate,romano2019conformalized,amini2020deep,chung2021beyond,levi2022evaluating\]\. Our setting also uses interval\-valued predictions\. However, the prediction target is fundamentally different\. Given a trajectory prefix $\tau_{\leq t}$, the agent must estimate how much additional budget is needed to complete the task from the current state\. This remaining requirement depends on partial progress, accumulated mistakes, environment feedback, and the agent’s future continuation behavior\. Thus, the target is not a static supervised label attached to an input, but an online, trajectory\-dependent quantity that changes as the agent acts; standard calibration assumptions for fixed predictors on static data therefore only partially capture the difficulty of budget estimation \[lei2018distributionfree,romano2019conformalized,levi2022evaluating\]\.
Moreover, interval coverage alone is insufficient for deployed agents\. A remaining\-budget interval may cover the realized cost, yet still be operationally poor if it is too wide, systematically optimistic, or fails to warn that the task has become infeasible under the available budget\. For example, an agent that predicts a broad interval only after exhausting most of its budget is less useful than one that identifies infeasibility early and recommends stopping, requesting more resources, or changing strategy\. We therefore evaluate uncertainty\-aware budget estimation not only through interval hit rate, but also through feasibility prediction and early infeasibility warning, treating budget estimation as an online self\-assessment capability rather than a standard offline calibration problem\.
Adaptive compute as control rather than budget estimation\. Adaptive\-compute and test\-time scaling methods are closely related in that they explicitly reason about computational resources, but their objective is usually to learn a compute\-control policy rather than to evaluate an agent’s self\-estimation ability\. Early\-exit, dynamic\-routing, learning\-to\-stop, and test\-time scaling methods decide whether to halt, continue, route to a different module, or allocate more inference compute in order to improve the performance–cost tradeoff \[graves2016adaptive,teerapittayanon2016branchynet,wang2018skipnet,kaya2019shallowdeep,chen2020learningtostop,zeng2023learningtoskip,snell2025scaling,muennighoff2025s1\]\. Such decisions can be effective without requiring the model to explicitly state how much additional budget is needed, how uncertain that estimate is, or whether task completion remains feasible under the remaining budget\. Moreover, this line primarily focuses on internal computation, such as layers, tokens, samples, reasoning steps, or inference\-time compute, whereas deployed agents also incur external action\-induced costs through tool calls, API usage, elapsed time, monetary spending, inventory changes, or other environment commitments\. Our setting therefore treats remaining\-budget estimation as a distinct capability: the agent must expose an uncertainty\-aware belief about future resource requirements across both model\-side computation and environment\-side action costs, rather than merely choosing a stopping or routing action\.
Self\-monitoring and corrective reasoning\. A line of work studies whether language models can assess their own correctness, express uncertainty, recognize when they do not know an answer, or detect inconsistencies in generated content \[kadavath2022language,xiong2024can,kapoor2024large,yona2024can,deng2024towards,manakul2023selfcheckgpt,hu2024uncertainty\]\. Related methods use reflection, self\-feedback, intrinsic critique, or tool\-based criticism to improve model outputs and agent performance over multiple attempts \[madaan2023selfrefine,shinn2023reflexion,gou2024critic,liu2024intrinsicselfcorrection,wang2024devils\]\. These works are closely related because they treat self\-assessment as a core model capability, but their focus is usually correctness awareness or corrective reasoning: whether the current answer is likely correct, whether the model should revise it, or how feedback can help produce a better next response\. Budget awareness asks a different question: even if the agent can judge or improve its answer, can it estimate how much resource is still needed to complete the task, how uncertain that estimate is, and whether completion remains feasible under the available budget?
Trajectory\-level agent evaluation and budget self\-monitoring\. Interactive agent benchmarks provide the long\-horizon, tool\-mediated settings in which budget estimation becomes meaningful, since agents must reason, act, replan, and use memory over trajectories whose costs accumulate across turns \[yao2023react,yao2022webshop,zhou2023webarena,liu2024agentbench,shridhar2021alfworld\]\. Recent failure\-analysis work further treats failure as a trajectory\-level phenomenon, studying root causes, negative rollouts, and where agents fail rather than only final success labels \[wang2024learningfromfailure,barke2026agentrx\]\. Closest in spirit, progress\-monitoring methods use auxiliary estimates of task progress to guide search, backtracking, and action choice \[ma2019selfmonitoring,ma2019theregretful\]\. Our focus is different from both retrospective failure diagnosis and progress estimation: given a partial trajectory, we ask whether the agent can estimate the remaining token or financial budget, express uncertainty, and warn that completion may be infeasible under the available resources\. This exposes selective over\-optimism on failed rollouts, a failure mode not captured by standard success, path\-efficiency, or progress metrics\.
| Similar Work | | | | | | | | Failure\-specific analysis |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Can LLMs Express Their Uncertainty? \[xiong2024can\] | ✗ | ✗ | ✗ | ∼ | ✗ | ✗ | ✗ | ∼ |
| Conformalized Quantile Regression \[romano2019conformalized\] | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| Adaptive Stopping for Multi\-Turn LLM Reasoning \[zhou2026adaptive\] | ∼ | ✓ | ✓ | ✓ | ∼ | ✗ | ∼ | ✗ |
| Budget\-Constrained Agentic LLMs \(INTENT\) \[liu2026budgetconstrained\] | ✓ | ✓ | ✓ | ✗ | ∼ | ✓ | ✓ | ✗ |
| Calibrate\-Then\-Act \[ding2026calibratethenact\] | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ∼ | ✗ |
| Budget\-Constrained Agentic Search \(BCAS\) \[mccleary2026quantifying\] | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ∼ | ✗ |
| Our work | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |

Figure 9: Delta map between our formulation and the closest prior work\. Rows are representative neighboring papers; columns are the main components of the problem definition\. ✓ means the component is directly modeled, ∼ means partially related, and ✗ means largely absent\.
The central gap highlighted by Figure [9](https://arxiv.org/html/2606.00198#A1.F9) is that prior work typically covers only one or two sides of the space at a time\. Our work is positioned at the intersection of these dimensions and formulates online remaining\-budget interval estimation as a distinct capability of agents\.
## Appendix B Detailed Experimental Settings
### B\.1 Environments and Tasks
Our benchmark spans four environments: SearchR1, Sokoban, and SWE\-bench use token budgets, while Warehouse uses a joint financial budget \(cost/time/inventory\) over multiple resources\.
Sokoban\. We use a Sokoban environment to study budget estimation under irreversible planning\. The rollout agent solves procedurally generated $8 \times 8$ puzzles with two boxes and search depth 30 under a 2,500\-token budget\. For rollout generation, we cap each response at 800 tokens and truncate dialogue history at 2,500 context tokens\.
SearchR1\. We use the SearchR1 environment backed by a HotpotQA\-derived parquet dataset and a retrieval server\. At each turn, the agent must either issue one search query or submit a final answer under a 3,500\-token budget\.
SWE\-bench\. SWE\-bench evaluates coding agents on realistic GitHub issue\-resolution tasks\. We include it because coding agents are common in practice, and coding tasks are token\-intensive\. Their token usage often comes from repeated repository inspection, targeted edits, test execution, error analysis, and repair loops\. This makes SWE\-bench a natural benchmark for evaluating remaining\-budget estimation\.
Warehouse\. We use a warehouse\-management control environment with coupled resource constraints\. At each step, the agent must allocate inventory, financing, and cash\-flow decisions so the business can continue toward a profitable outcome without violating limits\. This benchmark tests a different notion of budget awareness: instead of internal reasoning cost, the agent must estimate whether it can still finish while respecting external operational budgets\. Detailed budget construction appears in Section [B\.4](https://arxiv.org/html/2606.00198#A2.SS4)\.
### B\.2 Evaluation Protocol
Estimator and rollout roles are separated in all experiments\. Each rollout is first generated and logged by a rollout model, and the estimator is then evaluated on prefixes of that logged trajectory\. In the main experiments, the estimator uses the same model as the rollout generator, corresponding to self\-estimation\.
At each evaluation point, we replay the rollout prefix as dialogue history and ask the estimator to predict the budget needed from the next turn onward\. The estimator only sees information available up to the current turn, including the replayed history, the budget limits, and summaries of completed turns\. For token\-budget tasks, these summaries include per\-turn token usage; for Warehouse, they include cumulative progress and resource usage so far\. The estimator does not receive future turns, future tool outputs, terminal success labels, or the realized remaining budget\. We also exclude the final terminal step from estimation, since there is no future budget to predict after the trajectory has ended\.
The estimator must output either an interval `<answer>[est_low, est_high]</answer>` for the remaining total token usage from the next turn onward, or `<answer>impossible</answer>` if it predicts that the trajectory can no longer finish successfully within the budget\. For Warehouse, the estimator receives the target cash threshold and three resource budgets, `time_weeks`, `warehouse_item_weeks`, and `cumulative_cost_usd`, together with cumulative usage so far\. It must predict whether the trajectory can still reach the target cash while satisfying all three constraints; if feasible, it outputs one interval for each remaining resource, otherwise it outputs `impossible`\.
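A parser for the single-interval (token-budget) output format could look like the sketch below. The regular expressions, the `invalid` fallback label, and the rule that a reversed or negative interval counts as malformed are assumptions of this example; the Warehouse format with one interval per resource is not covered.

```python
import re

_ANSWER = re.compile(r"<answer>\s*(.*?)\s*</answer>", re.DOTALL | re.IGNORECASE)
_INTERVAL = re.compile(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]")

def parse_estimate(text):
    """Return ('interval', (low, high)), ('impossible', None), or ('invalid', None)."""
    match = _ANSWER.search(text)
    if not match:
        return ("invalid", None)
    body = match.group(1)
    if body.lower() == "impossible":
        return ("impossible", None)
    interval = _INTERVAL.fullmatch(body)
    if not interval:
        return ("invalid", None)
    low, high = float(interval.group(1)), float(interval.group(2))
    if low > high:  # treat a reversed interval as a format error (assumption)
        return ("invalid", None)
    return ("interval", (low, high))

parsed = parse_estimate("About 300 tokens left. <answer>[120, 450]</answer>")
```

Counting the share of responses that do not parse as `invalid` is one plausible way to obtain a format-validity rate for an estimator.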
Each non\-terminal rollout prefix yields one evaluation sample\. If a trajectory has $T$ turns, it contributes $T-1$ samples, because the final turn is not estimated\. This construction preserves the online estimation setting: the estimator always predicts from a partial trajectory using only information available at that point\.
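The sample construction can be illustrated as follows. The sketch assumes prefixes consist of the first k completed turns for k from 1 to T-1 and that the ground-truth target is the token total of the remaining turns; the tuple layout is hypothetical.

```python
def prefix_samples(turn_tokens):
    """One sample per non-terminal prefix: (turns_seen, tokens_used, realized_remaining)."""
    total = sum(turn_tokens)
    samples, used = [], 0
    for k, tokens in enumerate(turn_tokens[:-1]):  # the final turn is not estimated
        used += tokens
        samples.append((k + 1, used, total - used))
    return samples

# A three-turn rollout yields two evaluation samples.
samples = prefix_samples([300, 400, 500])
```

The third field is what the estimator's interval is scored against, and it is never shown to the estimator.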
SWE\-bench requires additional subsampling because coding\-agent trajectories are much longer than those in the other benchmarks\. In the current SWE\-bench runs, this produces 712–3,715 candidate prefixes per rollout family, so using all prefixes would make long coding trajectories dominate the evaluation set\. We therefore cap each SWE\-bench rollout family at 512 estimation prefixes\. To construct this split, we sort rollouts by their number of assistant turns, partition them into eight nearly equal\-sized length buckets, and sample prefixes with a seeded randomized round\-robin strategy across buckets and rollouts using random seed 42\. This procedure buckets at the rollout level rather than the prefix level, so longer rollouts do not receive proportionally more sampling mass simply because they contain more prefixes\. This makes the split fair because it balances coverage across rollout\-length regimes and rollout instances, preventing unusually long trajectories from contributing a disproportionate number of highly correlated prefixes while keeping the SWE\-bench evaluation size comparable to the other benchmarks\.
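The subsampling procedure above can be approximated by the sketch below. It is one plausible reading of "seeded randomized round-robin across buckets and rollouts", not the authors' implementation: the tie-breaking rule, the shuffling order, and the data layout (`turn_counts` mapping rollout id to assistant-turn count) are assumptions.

```python
import random
from collections import deque

def sample_prefixes(turn_counts, cap=512, n_buckets=8, seed=42):
    """Return up to `cap` (rollout_id, prefix_length) pairs, balanced across length buckets."""
    rng = random.Random(seed)
    # A rollout with T turns has T - 1 non-terminal prefixes; skip rollouts with none.
    ordered = sorted((r for r in turn_counts if turn_counts[r] > 1),
                     key=lambda r: (turn_counts[r], r))
    base, extra = divmod(len(ordered), n_buckets)
    buckets, start = [], 0
    for b in range(n_buckets):  # contiguous, nearly equal-sized length buckets
        size = base + (1 if b < extra else 0)
        members = ordered[start:start + size]
        rng.shuffle(members)
        buckets.append(deque(members))
        start += size
    remaining = {}
    for r in ordered:  # shuffled queue of candidate prefix lengths per rollout
        prefixes = list(range(1, turn_counts[r]))
        rng.shuffle(prefixes)
        remaining[r] = prefixes
    picked = []
    while len(picked) < cap and any(buckets):
        for bucket in buckets:  # one prefix per bucket per pass
            if not bucket:
                continue
            r = bucket.popleft()
            picked.append((r, remaining[r].pop()))
            if remaining[r]:
                bucket.append(r)  # rotate the rollout to the back of its bucket
            if len(picked) == cap:
                break
    return picked

# Hypothetical rollout family with 16 rollouts of increasing length.
turns = {f"r{i:02d}": 5 + 10 * i for i in range(16)}
picked = sample_prefixes(turns, cap=100)
```

Because each pass takes one prefix per bucket and one per rollout within a bucket, a rollout with 150 turns receives no more draws than a bucket-mate with 140, which is the property the paragraph above describes.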
### B\.3 Pilot Study Setup
The pilot study in §[2](https://arxiv.org/html/2606.00198#S2) establishes the two failures of single\-point budget estimation that motivate progressive interval estimation\. We describe its setup here\.
Tasks\. We use two internal\-budget environments\. Sokoban \[junghanns1998sokoban\] is a planning task on an $8 \times 8$ grid with a 2,500\-token cap; it admits a clean ground\-truth optimal budget per puzzle\. Search\-R1 \[jin2025searchr1\] is a multi\-hop information\-retrieval task with a 3,500\-token cap; the realized budget is variable and driven by retrieval depth\. The two tasks are deliberately different in shape, so a failure showing up on both is unlikely to be an artifact of one task’s structure\.
Elicitation\. For each rollout we elicit two estimates from the same model\. A *first\-turn estimate* is taken at task start, before any action, asking the model to predict the total tokens it will spend on the task; this matches the single\-point prompting protocol used in prior work\. A *later\-turn estimate* is taken after replaying the logged prefix $\tau_{\leq k}$ of the same rollout, asking the model to predict the *remaining* tokens from turn $k+1$ onward\. The later\-turn estimate is therefore a sequence of single\-point predictions, one per non\-terminal turn, made by the same model that produced the rollout\.
Models and scale\. We evaluate the same five frontier models as in the main paper: Claude Opus 4\.7 \[anthropic2026adaptiveThinking\], Claude Sonnet 4\.6 \[anthropic2026sonnet46\], GPT\-5\.2 Instant \[openai2026gpt52instant\], Gemini 3\.1 Pro \[google2026gemini31pro\], and Qwen3\-235B \[yang2025qwen3\]\. We sample 128 rollouts per \(model, task\) pair\.
Scoring\. *Optimism analysis* \(Figure [5](https://arxiv.org/html/2606.00198#S5.F5), first failure\): for each prediction $\hat{B}$ we compare it to the realized rollout cost $B$ and classify it as *optimistic* \($\hat{B} < B$\) or *conservative* \($\hat{B} > B$\); we report the share of each across rollout\-progress bins\. *First\-turn vs\. later\-turn comparison* \(Figure [6](https://arxiv.org/html/2606.00198#S5.F6), second failure\): we collapse each single\-point prediction into a binary feasibility label by checking whether $\hat{B}$ stays under the budget cap, and compute feasibility macro\-$F_1$ over either first\-turn predictions only or all per\-turn predictions \(definition in Eq\. [2](https://arxiv.org/html/2606.00198#S3.E2)\)\. The two are then plotted against each other on equal axes\.
### B\.4 Warehouse Budget Construction
Warehouse is evaluated as a multi\-resource financial estimation task that models a manufacturing firm making weekly operational decisions, such as producing goods, replenishing inventory, drawing supplier credit, repaying debt, and factoring accounts receivable to improve cash flow\. Full details on the underlying data source, design goals, calibration choices, and reward structure are deferred to Appendix [C](https://arxiv.org/html/2606.00198#A3)\. For each rollout prefix, the estimator receives the dialogue history, current cash, completed\-week summaries, cumulative resource usage, and per\-step resource consumption so far\. It must predict whether the rollout can still reach the target final cash threshold while staying within all resource budgets\. If feasible, it outputs one interval for each remaining resource; otherwise, it outputs `impossible`\. This setting creates coupled trade\-offs: increasing inventory may improve sales but raises warehouse occupancy, drawing credit may improve short\-term cash but adds repayment pressure, and delaying production may reduce costs but lower final cash\. Thus, Warehouse tests whether estimators can reason about realistic multi\-resource constraints rather than a single token budget\.
Warehouse contains three budget dimensions\. The first is `time_weeks`, which limits the total planning horizon\. The second is `warehouse_item_weeks`, which measures cumulative warehouse occupancy over time, i\.e\., inventory integrated across weeks\. The third is `cumulative_cost_usd`, which limits total operational spending\.
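As a concrete reading of the three dimensions, the sketch below accumulates usage from weekly logs. It assumes that "inventory integrated across weeks" is a discrete sum of one inventory count per week and that cost is a simple running total; the argument names are illustrative.

```python
def warehouse_usage(weekly_inventory, weekly_cost):
    """Cumulative usage along the three Warehouse budget dimensions."""
    return {
        "time_weeks": len(weekly_inventory),
        # Item-weeks: each unit held for one week contributes 1 (assumed discrete sum).
        "warehouse_item_weeks": sum(weekly_inventory),
        "cumulative_cost_usd": sum(weekly_cost),
    }

def within_budgets(usage, budgets):
    """True when every dimension stays at or under its budget."""
    return all(usage[name] <= budgets[name] for name in budgets)

# Hypothetical four-week log.
usage = warehouse_usage([120, 80, 0, 50], [1000.0, 250.0, 0.0, 400.0])
ok = within_budgets(usage, {"time_weeks": 22, "warehouse_item_weeks": 300,
                            "cumulative_cost_usd": 1500.0})
```

Here the rollout is within its time and occupancy budgets but has overspent on cost, so `ok` is false even though two of the three constraints hold.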
Why we construct feasibility probes\. Warehouse is a continuous\-optimization task \(more cash is better, fewer resources used is better\) and therefore has no natural binary success label\. To evaluate budget awareness in this environment, we construct *challenge\-conditioned feasibility probes*: each instance pairs a logged rollout with a sampled \(target cash, time, warehouse, cost\) tuple, and feasibility is defined as whether that rollout still satisfies the target and all three resource constraints\. We balance reachable and unreachable probes 50/50 so that macro\-$F_1$, Fail\-$F_1$, and calibration are identifiable on both sides; under a heavily skewed split \(e\.g\., 90/10\), a model that always predicts “feasible” would score deceptively well\. This 50/50 balance is an evaluation\-design choice, not a claim about deployment prevalence: we do not assert that real supply\-chain decisions fail half the time, only that controlled coverage of clearly\-feasible, borderline\-feasible, borderline\-infeasible, and clearly\-infeasible regions is needed to test whether models know their own state\.
Sampling procedure\. The estimator uses the `half_reachable` budget preset with random seed 42\. We first shuffle rollouts and assign half to a reachable group and half to an unreachable group\. For reachable rollouts, the target cash is sampled from $U(0.50, 1.00) \times$ final cash, clipped so that it does not exceed final cash; the time budget is set to the realized trajectory length, and the warehouse and cost budgets are sampled uniformly between $1.0\times$ and $1.2\times$ their realized final totals\. For unreachable rollouts, the target\-cash, time\-budget, warehouse\-budget, and cost\-budget ratios are independently sampled from $U(0.50, 2.00)$ until the logged rollout fails the feasibility check\. Thus unreachable cases may be caused by an overly high target cash, too\-tight resource budgets, or a combination of both\.
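The sampling procedure can be sketched as below. This follows the description in the paragraph above but is not the `half_reachable` preset itself: the dictionary keys, the helper names, and the assumption that final cash is positive are all choices made for the example.

```python
import random

KEYS = ("time_weeks", "warehouse_item_weeks", "cumulative_cost_usd")

def satisfies(final, probe):
    """Feasibility check: target cash reached and all three budgets respected."""
    return final["cash"] >= probe["target_cash"] and all(final[k] <= probe[k] for k in KEYS)

def reachable_probe(final, rng):
    probe = {"target_cash": min(rng.uniform(0.50, 1.00) * final["cash"], final["cash"]),
             "time_weeks": final["time_weeks"]}  # time budget = realized length
    for k in KEYS[1:]:
        probe[k] = rng.uniform(1.0, 1.2) * final[k]
    return probe

def unreachable_probe(final, rng):
    while True:  # rejection sampling until the logged rollout fails the check
        probe = {"target_cash": rng.uniform(0.50, 2.00) * final["cash"]}
        for k in KEYS:
            probe[k] = rng.uniform(0.50, 2.00) * final[k]
        if not satisfies(final, probe):
            return probe

def build_probes(finals, seed=42):
    """Shuffle rollouts, then assign the first half reachable and the rest unreachable."""
    rng = random.Random(seed)
    order = list(range(len(finals)))
    rng.shuffle(order)
    half = len(order) // 2
    probes = {}
    for rank, i in enumerate(order):
        reachable = rank < half
        make = reachable_probe if reachable else unreachable_probe
        probes[i] = (reachable, make(finals[i], rng))
    return probes

# Hypothetical final states of 20 logged rollouts.
finals = [{"cash": 1000.0 + 50 * i, "time_weeks": 22,
           "warehouse_item_weeks": 400.0 + i, "cumulative_cost_usd": 9000.0}
          for i in range(20)]
probes = build_probes(finals)
```

With independent $U(0.50, 2.00)$ ratios, a draw is infeasible roughly nine times out of ten, so the rejection loop usually stops on the first or second try.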
### B\.5 Experiment Matrix
The experiment matrix is shown in Table[4](https://arxiv.org/html/2606.00198#A2.T4)\.
| Benchmark | Budget Type | Rollouts | Horizon | Estimation Target |
| --- | --- | --- | --- | --- |
| SearchR1 | Token | 128 | Up to 3,500 tokens; one search/action per turn | Remaining tokens within 3,500\. |
| Sokoban | Token | 128 | Up to 2,500 tokens; up to 3 actions per turn | Remaining tokens within 2,500\. |
| SWE\-bench | Token | 64 | Up to 160 turns | Remaining tokens under rollout\-family\-specific median budgets\. |
| Warehouse | Financial | 128 | 2 weeks per turn; 11 turns in total | Remaining time, warehouse item\-weeks, and cost while reaching target cash\. |

Table 4: Experiment matrix for the four evaluation benchmarks\. Each row states the budget modality, rollout count, horizon constraint, and remaining\-budget target predicted from each logged prefix\.
### B\.6 Model Identifiers and Inference Settings
To ensure reproducibility, we report exact deployment metadata for every model used in rollout generation and estimation\. The display names in the main text are shorthand; the appendix table gives the exact API model identifier and invocation configuration\.
For each model, we log display name, provider, exact API model id or local checkpoint hash, date queried, endpoint or region, reasoning configuration, max output tokens, temperature, top\-$p$ if used, stop settings, and sampling\-seed policy\.
The term “Low Thinking” denotes a constrained provider\-native reasoning configuration\. Depending on the backend, this corresponds to one of: `reasoning_effort=low` \(OpenAI\), `output_config.effort=low` \(Anthropic\), or `thinking_mode=low` \(Google/Qwen\)\. The appendix table reports the exact control for each model\.
Table[5](https://arxiv.org/html/2606.00198#A2.T5)lists the concrete entries used in this paper\.
| Display Name | YAML Key | Provider | Model ID | Reasoning | Max Tokens | Temp\. |
| --- | --- | --- | --- | --- | --- | --- |
| GPT\-5\.2 Instant | `OpenAI-5.2-Instant` | openai | `gpt-5.2` | `reasoning_effort: none` | 800 | N/A |
| Claude Opus 4\.7 Low Thinking | `Claude-Opus-4.7-low-thinking` | anthropic | `claude-opus-4-7` | `output_config.effort: low` | 800 | N/A |
| Claude Sonnet 4\.6 Low Thinking | `Claude-Sonnet-4.6-low-thinking` | anthropic | `claude-sonnet-4-6` | `output_config.effort: low` | 800 | N/A |
| Gemini 3\.1 Pro Preview | `OpenRouter-Gemini-3.1-Pro-Preview` | openrouter | `google/gemini-3.1-pro-preview` | `thinking_mode: low` | 800 | 0 |
| Qwen3 235B | `qwen/qwen3-235b-a22b-2507` | openrouter | `qwen/qwen3-235b-a22b-2507` | `thinking_mode: low` | 800 | 0 |

Table 5: Model deployment identifiers and inference settings used for rollout generation and budget estimation\. The table maps each display name to its provider, exact model ID, reasoning configuration, maximum output length, and temperature\.
## Appendix C Warehouse Environment Details
Figure 10: Overview of the Warehouse environment. The agent operates a manufacturing firm that orders from OEMs, ships internationally to its own warehouse, and fulfills weekly demand from five retailer accounts under joint money/space/time constraints.

### C.1 Data Source
The demand panel is sourced from a real mid-sized US consumer-electronics distributor (anonymized as Acme) and its five downstream retail accounts, collapsed into archetypes (MegaMart, TechZone, ValueCo, OfficePlus, SupplyDirect) covering mass discount, electronics specialty, value/household, office/B2B, and small-account direct channels. We obtained 22 consecutive weeks of unit-level weekly sell-through per (retailer, SKU) pair across five high-volume product families (USB-C hub, 4K dock, USB-C/HDMI cable, USB display adapter, travel dock). Manufacturers and order quantities are not in the dataset; those are agent decisions in our environment. Wholesale prices, retailer warehouse capacities, payment terms (Net-30 across the board), and stockout/overstock chargeback rates were taken from retail vendor agreements with light rounding; manufacturing costs, MOQs, production lead times, and unit weights were chosen to reflect typical consumer-electronics OEM contract terms. We did not synthesize or smooth the demand series: week-to-week variance, step-changes, and the occasional collapsing series (e.g., one OfficePlus SKU) are present in the raw data.
At each episode reset the five demand series belonging to a retailer are randomly permuted across the five game SKUs, so SKU identities do not inherit fixed demand profiles and the agent must read demand from the observation rather than from any learned prior over individual SKUs\. Per\-step base demand is the sum of two consecutive weeks of the assigned series with additive Gaussian noise scaled by the series standard deviation\. We do not layer additional seasonality on top—the 22\-week window already captures it\.
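The reset-time permutation and per-step demand described above can be sketched as follows. This is a minimal illustration, not the environment's actual code: the function names, the `noise_scale` factor, and the clipping of negative demand to zero are assumptions introduced here.

```python
import random
import statistics

def assign_series(retailer_series, rng):
    """Randomly permute a retailer's demand series across the game SKUs."""
    shuffled = list(retailer_series)
    rng.shuffle(shuffled)
    return shuffled

def step_base_demand(series, step, rng, noise_scale=0.1):
    """Sum two consecutive weeks, then add Gaussian noise scaled by the series std.

    noise_scale and the clip at zero are illustrative assumptions.
    """
    week = 2 * step
    base = series[week] + series[week + 1]
    sigma = noise_scale * statistics.pstdev(series)
    return max(0.0, base + rng.gauss(0.0, sigma))

if __name__ == "__main__":
    rng = random.Random(42)
    weekly = [120, 135, 90, 160, 150, 140]  # hypothetical weekly sell-through
    print([round(step_base_demand(weekly, s, rng), 1) for s in range(3)])
```

Because the permutation is redrawn at every reset, a policy cannot memorize which SKU carries which demand profile and has to infer it from observations.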
### C.2 Design Goal: Stressing Money, Space, and Time
A competent policy must trade off three orthogonal pressures at once: money (tight $500K initial cash, 30/70 split mfg payment, Net-30 receivables, holding + OpEx burn, a $200K credit line at 0.8% per step, and 5% AR factoring), space (50,000-unit Acme warehouse plus per-retailer DCs of 7K–25K units, with overstock chargebacks above 95% occupancy), and time (production lead 25–45 days, ocean vs. air international transit at 32 vs. 6 days, domestic 3–4 days, Net-30 cash collection). Every mechanism is anchored to at least one of these axes.
### C.3 Calibration Decisions
Two parameter choices were materially tuned away from naive defaults during development; the rest follow from real\-world anchoring\.
Initial inventory is set to 0. Earlier calibrations pre-stocked Acme and the retailers with several weeks of expected demand. Trajectories were then largely shaped by the starting conditions rather than by the policy: pre-stocked inventory absorbed the early turns of demand regardless of what the agent did, and roughly 3 of 11 turns ran on autopilot before policy choices began to matter. We now start every (retailer, SKU) pair and the Acme warehouse at zero units so that outcomes are driven by model decisions end-to-end and the early-episode cash crunch (no AR matured yet, no inventory to ship) is genuine.
International transit and production lead times are compressed. Real Shenzhen → US-West-Coast door-to-door shipping is 45–60 days ocean and 7–10 days air; OEM production for the more complex SKUs realistically runs 60–90 days once material procurement is included. Plugged in directly, a single ocean order placed at $t=0$ would arrive around day $90+60=150$, which is turn 11 of an 11-turn episode at our 14-day step granularity, meaning 4–5 turns elapse before any production decision surfaces. We instead use ocean $=32\pm 4$ days, air $=6\pm 1$ days, and production lead $=25$–$45$ days by SKU. The fastest end-to-end loop (cable, air) then lands in $\approx 2$ steps and the slowest (dock, ocean) in $\approx 5.5$ steps: short enough that the agent observes consequences within the episode, long enough that time pressure is load-bearing.
Other calibration choices, briefly. Initial cash is $500K (not $2M) so the working-capital decision actually binds: at ∼$116K weekly GMV and Net-30 terms, ∼$464K of cash is tied up in any one AR cycle. The mfg payment is a 30/70 split (deposit at order, balance on completion) rather than 100%-on-order, so cash flow is sensitive to the projected trajectory across the production horizon and orders can stall in queue if cash is short on completion. Production quantities are quantized to 1×/2×/3× MOQ with 0% / 8% / 15% volume discounts, mirroring how factories actually quote. Episode length is 22 weeks ÷ 14 days per step = 11 steps: short enough that endgame myopia is diagnosable, long enough for two full ocean-production cycles. Bankruptcy is a two-stage floor ($50K per-step penalty whenever cash < 0, hard truncation at cash < −$200K) rather than a single hard rule, so a small early-turn underprediction does not end the episode prematurely.
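Two of these mechanisms, the quantized MOQ order with its 30/70 payment split and the two-stage bankruptcy floor, can be sketched in a few lines. The function names are hypothetical, and applying the volume discount to the whole order's manufacturing cost is an assumption; the source only states the discount tiers.

```python
# Volume discount by MOQ multiple (1x / 2x / 3x), as stated in the text.
VOLUME_DISCOUNT = {1: 0.00, 2: 0.08, 3: 0.15}

def order_cost(moq, unit_cost, multiple):
    """Return (units, deposit, balance) for an order of `multiple` x MOQ.

    Assumes the discount applies to the full order cost (illustrative).
    """
    if multiple not in VOLUME_DISCOUNT:
        raise ValueError("orders are quantized to 1x, 2x, or 3x MOQ")
    units = moq * multiple
    total = units * unit_cost * (1.0 - VOLUME_DISCOUNT[multiple])
    deposit = 0.30 * total      # paid when the order is placed
    balance = total - deposit   # paid on production completion
    return units, deposit, balance

def bankruptcy_check(cash):
    """Two-stage floor: per-step penalty below 0, hard truncation below -200K."""
    penalty = 50_000 if cash < 0 else 0
    truncated = cash < -200_000
    return penalty, truncated

if __name__ == "__main__":
    print(order_cost(moq=1000, unit_cost=10.0, multiple=2))
    print(bankruptcy_check(-25_000))
```

The split payment is what lets an order stall: the deposit can clear at order time while the balance fails later if cash has run short by completion.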
### C.4 Reward and Termination
The per-step reward is

$$r_t \;=\; \pi_t \;-\; p^{\text{stockout}}_t \;-\; p^{\text{overstock}}_t \;-\; 50{,}000\cdot\mathbf{1}[\text{cash}_t<0], \tag{5}$$

where $\pi_t$ is within-step operating profit (revenue minus holding cost minus fixed OpEx), and $p^{\text{stockout}}_t$, $p^{\text{overstock}}_t$ are the chargeback penalties from unfilled retailer demand and from $>95\%$-occupied retailer DCs. Manufacturing costs, international and domestic shipping costs, credit interest, and AR-factoring fees affect cash but do not enter the reward; they are investments and financing costs whose payoff arrives, if at all, through future operating profit. This is the central credit-assignment challenge: a production decision at $t$ pays back through revenue at $t+4$ to $t+6$. Including these costs in the reward collapses the environment to a near-myopic control problem and removes precisely the long-horizon coordination that motivated the design. The episode terminates at $t=T$ (default 11) and is force-truncated if cash drops below $-\$200$K.
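As a small worked instance of the per-step reward, the sketch below computes it from already-aggregated step quantities. The argument names are illustrative and not taken from the environment's code; manufacturing, shipping, and financing costs are deliberately absent because they affect cash only.

```python
def step_reward(revenue, holding_cost, opex,
                stockout_penalty, overstock_penalty, cash):
    """Per-step reward: operating profit minus chargebacks minus cash penalty."""
    operating_profit = revenue - holding_cost - opex
    cash_penalty = 50_000 if cash < 0 else 0
    return operating_profit - stockout_penalty - overstock_penalty - cash_penalty

if __name__ == "__main__":
    # Hypothetical step: same operations, with cash just above and just below zero.
    print(step_reward(100_000, 5_000, 16_000, 2_000, 1_000, cash=10))
    print(step_reward(100_000, 5_000, 16_000, 2_000, 1_000, cash=-10))
```

The two calls differ only in the sign of cash, which isolates the $50K indicator term.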
### C.5 Theoretical Maximum Reward (Defined but Unused)
For completeness we derive a hypothetical ceiling on cumulative reward under an idealized policy that \(i\) realizes mean demand each step with no noise, \(ii\) keeps Acme inventory exactly equal to the step’s total demand so holding cost is minimal but no stockouts occur, and \(iii\) faces no cash constraint\. Under those assumptions, period revenue, holding cost, and OpEx all become deterministic, and the cumulative ceiling is
$$R_{\max} \;=\; \sum_{t=1}^{T}\left[\sum_{r,s} d_{r,s}\cdot p_{r,s}\cdot w \;-\; h\cdot I_t\cdot w \;-\; \text{OpEx}\cdot w\right], \tag{6}$$

where $d_{r,s}$ is mean weekly demand of SKU $s$ at retailer $r$, $p_{r,s}$ is the wholesale price, $w$ is the number of weeks per step (default 2), $h=\$0.35$/unit/week is the holding rate, $I_t=\sum_{s} d_{s}\cdot w$ is the per-step ideal Acme warehouse level (just enough to cover the step's demand), and OpEx $=\$8{,}000$/week is the fixed operating cost. A “combined score” variant adds back ending cash and the manufacturing-cost value of ending inventory at both Acme and the retailers, treating leftover stock as recoverable capital.
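The ceiling can be evaluated directly once mean demand and prices are known. The sketch below is one reading of the formula: it assumes $d_s$ is the mean weekly demand of SKU $s$ summed over retailers, which the text does not state explicitly, and the dictionary layout and function name are hypothetical.

```python
def max_reward(mean_weekly_demand, wholesale_price, steps=11,
               weeks_per_step=2, holding_rate=0.35, opex_per_week=8_000.0):
    """Cumulative reward ceiling under the idealized, noise-free policy.

    Both dicts are keyed by (retailer, sku). Assumes d_s aggregates demand
    over retailers (an interpretation, not confirmed by the source).
    """
    w = weeks_per_step
    revenue = sum(mean_weekly_demand[k] * wholesale_price[k]
                  for k in mean_weekly_demand) * w
    ideal_inventory = sum(mean_weekly_demand.values()) * w  # I_t
    holding = holding_rate * ideal_inventory * w
    per_step = revenue - holding - opex_per_week * w
    # Every term is deterministic, so the sum over t is steps * per_step.
    return steps * per_step

if __name__ == "__main__":
    demand = {("MegaMart", "hub"): 100, ("TechZone", "hub"): 50}   # hypothetical
    price = {("MegaMart", "hub"): 10.0, ("TechZone", "hub"): 20.0}  # hypothetical
    print(max_reward(demand, price))
```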
## Appendix D Additional Experimental Visualizations
This appendix presents additional analysis of budget-estimation behavior. Figure [12](https://arxiv.org/html/2606.00198#A4.F12) separates the internal token-budget setting from the external Warehouse setting. For internal tasks, interval width is normalized by the total token budget. For Warehouse, we report the same normalized quantity for time, warehouse item-weeks, and cost. Predicted interval widths change as the rollout progresses, which suggests that the estimators partly learn to update their uncertainty as more rollout evidence becomes available.
Figure [11](https://arxiv.org/html/2606.00198#A4.F11) analyzes the direction of estimation errors across rollout progress. Across models, optimistic misses dominate conservative misses, meaning that agents more often underestimate the remaining budget than overestimate it. This bias persists throughout execution, although conservative misses become more visible in later rollout stages. Together, these results show that budget-estimation errors are not symmetric: models are systematically biased toward thinking the remaining task will be cheaper than it actually is.
Table [6](https://arxiv.org/html/2606.00198#A4.T6) reports the per-benchmark early-stopping tradeoff when trajectories are stopped at the first impossible prediction. The results show that impossible predictions can save substantial budget on failed rollouts, especially in Warehouse and SWE-bench, while false-abort rates remain low for most model-benchmark pairs. Search-R1 shows smaller savings because its rollouts are shorter, leaving less room for early stopping before the trajectory ends. This supports the main result that budget estimates are actionable, but their utility depends on how early failure can be detected within each environment.
Figure [13](https://arxiv.org/html/2606.00198#A4.F13) further separates task success from budget-estimation quality. Across benchmarks and models, higher rollout success does not necessarily imply better feasibility prediction. This shows that budget awareness is not simply a byproduct of stronger task-solving ability, but a distinct capability that must be evaluated directly.
Figure 11: Optimistic misses dominate single-point budget estimates. Orange regions (predicted budget below realized) exceed yellow (predicted budget above realized) for most models, across all rollout-progress bins.

Table 6: Per-benchmark early-stopping tradeoff by estimator model. Columns report the false-abort prediction rate, the share of failed-rollout tokens saved, the false-abort count, and the number of failed rollouts stopped at least once.

| Benchmark | Model | False-abort rate | Saved tokens | False-abort count | Stopped failed rollouts |
| --- | --- | --- | --- | --- | --- |
| SearchR1 | GPT-5.2 | 0.0% | 0.0% | 0 / 324 | 0 / 41 |
| SearchR1 | Opus 4.7 | 0.0% | 3.5% | 0 / 87 | 1 / 31 |
| SearchR1 | Sonnet 4.6 | 1.0% | 0.0% | 1 / 102 | 0 / 37 |
| SearchR1 | Gemini 3.1 | 1.2% | 0.0% | 1 / 85 | 0 / 59 |
| SearchR1 | Qwen3 235B | 0.0% | 19.4% | 0 / 1,079 | 40 / 83 |
| Sokoban | GPT-5.2 | 1.1% | 30.5% | 7 / 661 | 48 / 83 |
| Sokoban | Opus 4.7 | 0.0% | 10.8% | 0 / 355 | 9 / 56 |
| Sokoban | Sonnet 4.6 | 1.5% | 27.9% | 6 / 400 | 29 / 62 |
| Sokoban | Gemini 3.1 | 0.3% | 48.0% | 1 / 389 | 53 / 78 |
| Sokoban | Qwen3 235B | 0.1% | 12.8% | 1 / 1,038 | 33 / 119 |
| SWE-bench | GPT-5.2 | 0.0% | 19.5% | 0 / 512 | 12 / 27 |
| SWE-bench | Opus 4.7 | 0.0% | 29.1% | 0 / 512 | 7 / 18 |
| SWE-bench | Sonnet 4.6 | 0.0% | 45.2% | 0 / 512 | 8 / 20 |
| SWE-bench | Gemini 3.1 | 0.0% | 50.7% | 0 / 512 | 12 / 20 |
| SWE-bench | Qwen3 235B | 0.0% | 7.0% | 0 / 512 | 5 / 40 |
| Warehouse | GPT-5.2 | 13.8% | 90.9% | 176 / 1,279 | 64 / 64 |
| Warehouse | Opus 4.7 | 3.9% | 41.9% | 50 / 1,280 | 45 / 64 |
| Warehouse | Sonnet 4.6 | 5.4% | 74.3% | 69 / 1,280 | 64 / 64 |
| Warehouse | Gemini 3.1 | 4.8% | 70.5% | 61 / 1,280 | 58 / 64 |
| Warehouse | Qwen3 235B | 14.8% | 85.4% | 189 / 1,280 | 62 / 64 |

Figure 12: Predicted interval width changes over task progress. Internal-task widths are normalized by token budget; Warehouse reports separate normalized widths for time, warehouse item-weeks, and cost.

Figure 13: Progressive interval estimation evaluates an agent’s budget-estimation ability separately from its task performance. The weak relationship between task success and estimation quality shows that doing the task well does not mean knowing how much budget remains.
## Appendix E Detailed Settings for Estimation RL and SFT
The SFT and RL experiments use Qwen/Qwen2\.5\-7B\-Instruct as the base model\. The probe task is Sokoban 6x6 with one box and a 2500\-token budget\. The SFT training set contains 794 samples\. The balanced test set contains 380 samples, split evenly between possible and impossible states\.
The raw rollouts are first converted into alternating user/assistant message histories. For each trajectory, a probe is inserted before the first action and after every completed turn. At probe-after-turn $t$, the prompt contains the Sokoban system prompt and the conversation prefix up to $t$. For $t=0$, the prompt contains the system prompt and the initial grid only. The completed-turn token usage is computed from the visible user messages and the recorded API output-token counts, so that post-processed assistant text does not undercount hidden generation cost. The true remaining-token label $r$ is the sum of all future assistant output tokens and all future user input tokens after the visible prefix. A probe is labeled possible only if the original rollout succeeds and the visible-prefix tokens plus $r$ fit within the budget; otherwise it is labeled impossible. The ablation data use a deterministic trajectory split with seed 42: 40% for SFT, 50% for RL, and 10% for held-out evaluation. Training splits are balanced with a 1:1 possible/impossible ratio after duplicating possible samples twice; trivial samples with $r=0$ are removed.
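The labeling rule can be illustrated with a short sketch. It simplifies the setup by treating each turn as an (input tokens, output tokens) pair and by folding the system prompt and initial grid into the first turn's input; the function name, record layout, and `drop_trivial` flag are assumptions made for illustration.

```python
def label_probes(turn_tokens, succeeded, budget, drop_trivial=False):
    """Build one probe before the first action and one after every turn.

    turn_tokens: list of (input_tokens, output_tokens) per turn.
    Returns dicts with the probe index, remaining-token label r, and class.
    """
    per_turn = [inp + out for inp, out in turn_tokens]
    total = sum(per_turn)
    probes = []
    used = 0  # tokens in the visible prefix
    for t in range(len(per_turn) + 1):
        remaining = total - used  # r: all future input + output tokens
        fits = used + remaining <= budget
        label = "possible" if (succeeded and fits) else "impossible"
        if not (drop_trivial and remaining == 0):
            probes.append({"turn": t, "remaining": remaining, "label": label})
        if t < len(per_turn):
            used += per_turn[t]
    return probes

if __name__ == "__main__":
    rollout = [(100, 50), (120, 60)]  # hypothetical two-turn rollout
    for probe in label_probes(rollout, succeeded=True, budget=2500):
        print(probe)
```

Since the visible prefix plus $r$ always equals the rollout's total usage, every probe of a trajectory shares one class under this rule: possible only when the rollout both succeeds and fits the budget.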
SFT targets. SFT turns each probe into a full chat record by appending the ground-truth assistant answer. The main runs use the no-thinking output template, so the assistant target is either `<answer>impossible</answer>` or `<answer>[L, H]</answer>`. For a possible probe with remaining-token label $r$, the percentage-width targets are

$$L=\max(1,\lfloor r(1-w)\rfloor),\qquad H=\lfloor r(1+w)\rfloor,$$

with $w\in\{0.1,0.3,0.5\}$. The fixed-width targets are

$$L=\max(1,\lfloor r-w\rfloor),\qquad H=\lfloor r+w\rfloor,$$

with $w\in\{100,500,1000\}$ tokens. We also prepared a point-estimation SFT variant that predicts one integer, but the reported RL interval experiments warm start from interval SFT checkpoints. All SFT runs use FSDP training for 5 epochs with learning rate $5\times 10^{-6}$, global batch size 16, and per-GPU micro-batch size 2. The default SFT script sets maximum sequence length to 9216, while the v3 batch launcher overrides it to 16384. Checkpoints are converted after epochs 2, 3, and 5.
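The two target constructions translate directly into code. The sketch below follows the stated formulas; the function names and the answer-string helper are illustrative, not the authors' training script.

```python
import math

def pct_target(r, w):
    """Percentage-width target: half-width is a fraction w of the label r."""
    return max(1, math.floor(r * (1 - w))), math.floor(r * (1 + w))

def fixed_target(r, w):
    """Fixed-width target: half-width is w tokens, lower bound clipped at 1."""
    return max(1, math.floor(r - w)), math.floor(r + w)

def format_answer(interval=None):
    """Render the no-thinking SFT target string for a probe."""
    if interval is None:
        return "<answer>impossible</answer>"
    low, high = interval
    return f"<answer>[{low}, {high}]</answer>"

if __name__ == "__main__":
    r = 1000  # hypothetical remaining-token label
    print(format_answer(pct_target(r, 0.5)))
    print(format_answer(fixed_target(r, 100)))
    print(format_answer(None))
```

The clip at 1 matters for fixed-width targets on short remainders: with $r=250$ and $w=500$ the lower bound would otherwise be negative.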
Training-name convention. Run names encode the supervision target and checkpoint used for evaluation. `pct10`, `pct30`, and `pct50` denote interval SFT targets whose half-width is 10%, 30%, or 50% of the true remaining-token label. `fix100`, `fix500`, and `fix1000` denote interval SFT targets with fixed half-widths of 100, 500, or 1000 tokens. The suffixes `e2`, `e3`, and `e5` identify checkpoints after 2, 3, and 5 SFT epochs. An entry such as "SFT pct30 e5 → RL KL 0.05" means that the epoch-5 `pct30` SFT checkpoint initializes GRPO, with actor KL coefficient 0.05. "Zero RL" denotes GRPO started directly from the base model, without SFT warm start. "no think" and "+ think" distinguish whether the output template omits or includes a `<think>...</think>` field. "scalar bug" denotes an invalid zero-RL run that learned scalar outputs instead of the required interval format.
RL setup. RL is applied with GRPO on top of selected SFT checkpoints. The RL parquet records contain the same prompt without the assistant answer, together with the rule-reward ground truth and the numeric `remaining_tokens` field. The final GRPO runs use 8 GPUs, global batch size 64, 16 rollouts per prompt, learning rate $5\times 10^{-7}$, 5 epochs, maximum prompt length 8192, maximum response length 1024, entropy coefficient 0, and no KL term inside the reward. The actor still uses a low-variance KL loss against the reference policy; the reported sweep uses KL coefficients 0.05 and 0.1. Evaluation decodes with vLLM at temperature 0 and a 512-token response cap.
RL reward. The rule reward first extracts the content inside `<answer>...</answer>`. Outputs without this tag, malformed intervals, reversed intervals, and scalar predictions receive zero reward for possible probes. Impossible probes receive reward 0.2 only when the extracted answer is exactly `impossible`. For a possible probe with true remaining tokens $r$, an interval $[L,H]$ receives reward only when $L\leq r\leq H$:

$$R=1.8\cdot\max\left(0,\,1-\frac{H-L}{r}\right).$$

This reward favors tight intervals that still cover the true value. The 1.8-to-0.2 weighting discourages the all-impossible policy while keeping impossible-state recognition in the objective.
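A minimal sketch of this rule reward follows. The regular expressions and function name are assumptions about how the parsing might be done; only the reward values and conditions come from the description above, and the sketch assumes $r>0$ because trivial $r=0$ probes are removed.

```python
import re

ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
INTERVAL_RE = re.compile(r"\[\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\]")

def rule_reward(output, possible, remaining_tokens=None):
    """Score one model output against the ground-truth probe label."""
    match = ANSWER_RE.search(output)
    if match is None:
        return 0.0  # no <answer> tag
    answer = match.group(1).strip()
    if not possible:
        return 0.2 if answer == "impossible" else 0.0
    interval = INTERVAL_RE.fullmatch(answer)
    if interval is None:
        return 0.0  # scalar, malformed, or "impossible" on a possible probe
    low, high = float(interval.group(1)), float(interval.group(2))
    if low > high or not (low <= remaining_tokens <= high):
        return 0.0  # reversed interval or a miss
    return 1.8 * max(0.0, 1.0 - (high - low) / remaining_tokens)

if __name__ == "__main__":
    print(rule_reward("<answer>[900, 1100]</answer>", True, 1000))
    print(rule_reward("<answer>impossible</answer>", False))
```

Under this rule, a policy that always answers impossible earns 0.2 on impossible probes and 0 on possible ones, so it averages 0.1 on a balanced set; that appears consistent with the R = 0.100 reported for the collapsed zero-RL runs in Table 7.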
## Appendix F Additional Data for RL
Table [7](https://arxiv.org/html/2606.00198#A6.T7) reports the full SFT/RL sweep. The zero-RL runs show that RL from scratch does not reliably learn the interval format. Both no-thinking and thinking variants collapse to all-impossible answers. The scalar-bug run has high classification accuracy, but it does not produce valid intervals. These results show that SFT is needed before RL.
Table 7: Sokoban budget-probe results for the SFT and RL sweep. Acc. is feasible/impossible classification accuracy; Rec. pos/imp are recalls for feasible and impossible probes; Cover is interval coverage on feasible probes; MRE is midpoint relative error; R is the combined reward.

| Model | Acc. | Rec. pos | Rec. imp | Cover | MRE P50 | MRE P90 | R |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-7B-Instruct | 25.5% | 46.3% | 4.7% | 10.5% | 65.5% | 277.2% | 0.021 |
| Zero RL scalar bug | 90.5% | 93.2% | 87.9% | 0.0% | 0.0% | 0.0% | 0.088 |
| Zero RL no think | 50.0% | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% | 0.100 |
| Zero RL + think | 50.0% | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% | 0.100 |
| SFT pct10 e2 | 87.6% | 100.0% | 75.3% | 26.3% | 45.1% | 84.0% | 0.265 |
| SFT pct10 e3 | 89.5% | 95.3% | 83.7% | 19.5% | 62.0% | 135.4% | 0.224 |
| SFT pct10 e5 | 90.3% | 93.7% | 86.8% | 26.3% | 45.1% | 84.2% | 0.277 |
| SFT pct30 e2 | 88.2% | 99.5% | 76.8% | 39.5% | 53.5% | 86.7% | 0.204 |
| SFT pct30 e3 | 90.0% | 96.3% | 83.7% | 38.4% | 51.8% | 84.3% | 0.205 |
| SFT pct30 e5 | 91.1% | 95.3% | 86.8% | 37.4% | 53.9% | 86.4% | 0.210 |
| SFT pct50 e2 | 86.6% | 100.0% | 73.2% | 41.1% | 57.9% | 848.1% | 0.094 |
| SFT pct50 e3 | 89.7% | 96.3% | 83.2% | 54.2% | 48.0% | 86.0% | 0.099 |
| SFT pct50 e5 | 90.5% | 94.7% | 86.3% | 35.8% | 55.2% | 303.4% | 0.101 |
| SFT fix100 e2 | 86.6% | 100.0% | 73.2% | 53.7% | 48.6% | 83.7% | 0.085 |
| SFT fix100 e3 | 90.5% | 97.4% | 83.7% | 52.6% | 42.9% | 83.6% | 0.095 |
| SFT fix100 e5 | 90.0% | 92.6% | 87.4% | 46.8% | 49.1% | 88.4% | 0.105 |
| SFT fix500 e2 | 86.8% | 99.5% | 74.2% | 90.0% | 65.9% | 427.3% | 0.074 |
| SFT fix500 e3 | 90.0% | 96.8% | 83.2% | 85.3% | 68.4% | 294.0% | 0.083 |
| SFT fix500 e5 | 90.3% | 93.7% | 86.8% | 84.7% | 67.9% | 441.8% | 0.087 |
| SFT fix1000 e2 | 86.3% | 100.0% | 72.6% | 96.8% | 182.7% | 620.0% | 0.073 |
| SFT fix1000 e3 | 90.0% | 96.3% | 83.7% | 92.6% | 169.6% | 627.3% | 0.084 |
| SFT fix1000 e5 | 90.0% | 93.2% | 86.8% | 89.5% | 169.5% | 639.7% | 0.087 |
| SFT pct10 e5 → RL KL 0.05 | 88.9% | 90.0% | 87.9% | 26.8% | 38.9% | 84.0% | 0.281 |
| SFT pct10 e5 → RL KL 0.1 | 89.2% | 90.5% | 87.9% | 27.9% | 38.9% | 84.0% | 0.289 |
| SFT fix100 e3 → RL KL 0.05 | 90.3% | 93.7% | 86.8% | 13.2% | 177.6% | 350.0% | 0.126 |
| SFT pct30 e5 → RL KL 0.05 | 90.0% | 92.6% | 87.4% | 46.8% | 28.2% | 87.8% | 0.264 |
Notes. SFT denotes supervised fine-tuning; RL denotes the subsequent GRPO stage; and Zero RL denotes GRPO from the base model without an SFT warm start. `pct10`, `pct30`, and `pct50` are SFT target intervals with half-widths equal to 10%, 30%, or 50% of the true remaining token count. `fix100`, `fix500`, and `fix1000` are target intervals with fixed half-widths of 100, 500, or 1000 tokens. The suffixes `e2`, `e3`, and `e5` mark checkpoints after 2, 3, and 5 SFT epochs; KL is the RL KL-penalty coefficient. "+ think" includes a short reasoning field before the answer, while "no think" omits it. "scalar bug" is an invalid zero-RL run that outputs scalar values instead of required intervals. Acc. is binary feasibility accuracy. Rec. pos and Rec. imp are recall on possible and impossible states. Cover is computed on truly possible states where the model outputs an interval $[L,H]$ and measures whether the true remaining count $r$ satisfies $L\leq r\leq H$. MRE is midpoint relative error, $\mathrm{MRE}=|((L+H)/2)-r|/r$; P50 and P90 are its median and 90th percentile. R is the average sparse reward used for RL evaluation.
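The two interval metrics defined in these notes are simple to compute. The sketch below assumes predictions are already parsed into (L, H) pairs for truly possible probes; the function names are illustrative.

```python
def coverage(intervals, truths):
    """Fraction of intervals [L, H] that contain the true remaining count r."""
    hits = [low <= r <= high for (low, high), r in zip(intervals, truths)]
    return sum(hits) / len(hits)

def midpoint_relative_error(interval, r):
    """MRE = |((L + H) / 2) - r| / r for one interval."""
    low, high = interval
    return abs((low + high) / 2 - r) / r

if __name__ == "__main__":
    preds = [(900, 1100), (100, 200)]  # hypothetical predictions
    truth = [1000, 300]                # hypothetical remaining-token labels
    print(coverage(preds, truth))
    print([midpoint_relative_error(p, r) for p, r in zip(preds, truth)])
```

Coverage and MRE can move in opposite directions: a very wide interval covers the truth easily while its midpoint may sit far from it, which matches the high-coverage, high-MRE pattern of the `fix1000` rows.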
## Appendix G Prompts for Agent Estimation and SFT/RL
This section summarizes the prompt format used for agent budget estimation\. All variants follow the same structure\. The prompt defines the budget unit, lists the context fields, and ends with strict output rules\. The compact templates below show the key instructions\.
SearchR1 token\-budget prompt\.
Based on the provided SearchQA rollout context, you are given the following:

1. You have completed `{completed_turns}` turns.
2. Each turn, your token consumption is `{turn_token_usage_text}`.
3. You need to finish the task within `{max_context_window_tokens}` total tokens.

Now, estimate:

1. Whether the search agent can still finish the task successfully within `{max_context_window_tokens}` total tokens (input + output).
2. If yes, how many additional tokens (input + output) are still needed to finish the task, starting from the next turn. Return an estimation interval: at least est_low tokens and at most est_high tokens.
3. If no, answer "impossible".
4. Prioritize the can-finish judgment over interval tightness. If you think the task can finish within budget, make the interval as tight as possible while still covering the true remaining token budget.

Example: For a three-turn interaction, suppose only Turn 1 has been completed. The full interaction is: Turn 1: input X1 tokens, output Y1 tokens; Turn 2: input X2 tokens, output Y2 tokens; Turn 3: input X3 tokens, output Y3 tokens.
You will receive: turn_token_usage_text: Turn 1: input X1 tokens, output Y1 tokens
You should estimate: X2 + Y2 + X3 + Y3

Output exactly one of the following:

`<think>[YOUR THINKING]</think><answer>[est_low, est_high]</answer>`

or

`<think>[YOUR THINKING]</think><answer>impossible</answer>`
Sokoban token\-budget prompt\.
Based on the provided rollout context, you are provided below information:

1. You have completed `{completed_turns}` turns.
2. Each turn, your token consumption is `{turn_token_usage_text}`.
3. You need to finish the task within `{max_context_window_tokens}` tokens.

Now, estimate:

1. Whether you can finish the task successfully within `{max_context_window_tokens}` total tokens (input + output).
2. If yes, how many additional tokens (input + output) are still needed to finish the task, starting from the next turn. Return an estimation interval: at least est_low tokens and at most est_high tokens.
3. If no, answer "impossible".
4. You should try your best to estimate whether the task can finish within budget (most important). If you think the task can finish within budget, your interval should be as tight as possible while still covering the true remaining token budget.

Example: For a three-turn interaction, suppose only Turn 1 has been completed. The full interaction is: Turn 1: input X1 tokens, output Y1 tokens; Turn 2: input X2 tokens, output Y2 tokens; Turn 3: input X3 tokens, output Y3 tokens.
You will receive: turn_token_usage_text: Turn 1: input X1 tokens, output Y1 tokens
You should estimate: X2 + Y2 + X3 + Y3

Output exactly one of the following:

`<think>[YOUR THINKING]</think><answer>[est_low, est_high]</answer>`

or

`<think>[YOUR THINKING]</think><answer>impossible</answer>`
SWE\-bench token\-budget prompt\.
Based on the provided SWE-bench rollout context, you are given the following:

1. The coding agent has completed `{completed_turns}` turns.
2. Per-turn token usage so far, excluding reused history from earlier turns, is: `{turn_token_usage_text}`.
3. The full task must finish within `{max_context_window_tokens}` total tokens.

Estimate:

1. Whether the agent can still finish the software issue successfully within `{max_context_window_tokens}` total tokens.
2. If yes, how many additional tokens (input + output) are still needed from the next turn onward. Return an interval: at least est_low tokens and at most est_high tokens.
3. If no, answer "impossible".
4. Prioritize the can-finish judgment over interval tightness. If the task still looks finishable, keep the interval as tight as possible while still covering the true remaining token budget.

Think about typical SWE-bench costs such as repository inspection, targeted code edits, running validation commands, reading failures, and one or two repair iterations.

Output exactly one of the following:

`<think>[YOUR THINKING]</think><answer>[est_low, est_high]</answer>`

or

`<think>[YOUR THINKING]</think><answer>impossible</answer>`
External Warehouse prompt\.
System prompt. You are an evaluation agent for historical warehouse-management rollouts. Determine whether the rollout can still finish successfully within the remaining resource budgets while reaching the target cash threshold. If it can, estimate the remaining time, warehouse cumulative occupancy, and cumulative cost still needed from the next turn onward. Follow the required output format exactly.

Based on the provided warehouse rollout context, you are given the following information:

1. You have completed `{completed_weeks}` weeks in `{completed_turns}` turns.
2. Current cumulative usage so far:
   - time_weeks: `{current_time_weeks}`
   - warehouse_item_weeks: `{current_warehouse_item_weeks}`
   - cumulative_cost_usd: `{current_cost_usd}`
3. Current cash is `{current_cash_usd}` USD. To count as finished, final cash must reach at least `{target_cash_usd}` USD.
4. Historical resource consumption by completed step is: `{resource_consumption_text}`
5. The rollout must finish within all three budgets:
   - time_weeks <= `{budget_time_weeks}`
   - warehouse_item_weeks <= `{budget_warehouse_item_weeks}`
   - cumulative_cost_usd <= `{budget_cost_usd}`

Now, estimate:

1. Whether the rollout can still finish successfully within all three budgets while also reaching the target cash.
2. If yes, how much additional usage is still needed from the next turn onward. Return one interval for each metric.
3. If no, answer "impossible".
4. Prioritize the can-finish judgment over interval tightness. If you think the rollout can finish within budget, make each interval as tight as possible while still covering the true remaining value.

Output exactly one of the following:

`<think>[YOUR THINKING]</think><answer>time_weeks:[est_low, est_high], warehouse_item_weeks:[est_low, est_high], cumulative_cost_usd:[est_low, est_high]</answer>`

or

`<think>[YOUR THINKING]</think><answer>impossible</answer>`
Sokoban SFT/RL budget\-probe prompt\.
System prompt. You’re a helpful assistant. You are solving the Sokoban puzzle. Push all boxes to targets. You are given the grid and zero-indexed coordinates of the player, boxes, and targets. You can push but not pull boxes, and cannot push a box through a wall. Your available actions are: Up, Down, Left, Right. You may output at most 3 action(s) in a single turn, separated by the action separator " || ".

User estimation prompt. Based on the provided rollout context, you are provided below information:

1. You have completed `{completed_turns}` turns.
2. Each turn, your token consumption is `{turn_token_usage_text}`.
3. You need to finish the task within `{max_context_window_tokens}` tokens.

Now, estimate:

1. Whether you can finish the task successfully within `{max_context_window_tokens}` total tokens (input + output).
2. If yes, how many additional tokens (input + output) are still needed to finish the task, starting from the next turn. Return an estimation interval: at least est_low tokens and at most est_high tokens.
3. If no, answer "impossible".
4. You should try your best to estimate whether the task can finish within budget (most important). If you think the task can finish within budget, your interval should be as tight as possible while still covering the true remaining token budget.

Example: For a three-turn interaction, suppose only Turn 1 has been completed. The full interaction is: Turn 1: input X1 tokens, output Y1 tokens; Turn 2: input X2 tokens, output Y2 tokens; Turn 3: input X3 tokens, output Y3 tokens.
You will receive: turn_token_usage_text: Turn 1: input X1 tokens, output Y1 tokens
You should estimate: X2 + Y2 + X3 + Y3

Output exactly one of the following:

`<answer>[est_low, est_high]</answer>`

or

`<answer>impossible</answer>`