Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study
Summary
This paper presents an empirical catalog of 63 confirmed LLM-agent budget-overrun incidents drawn from 21 orchestration frameworks and organizes them into a failure taxonomy. As a case study, it introduces a Rust crate that uses affine-typed ownership to prevent token and cost budget violations at compile time rather than at runtime.
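To make the affine-ownership idea concrete, here is a minimal sketch. It is not the paper's crate: the names `TokenBudget`, `Overrun`, `spend`, and `split` are hypothetical and chosen only for illustration. The sketch assumes a budget is a value that is moved, never copied, so that each spend consumes the old budget and hands back a reduced one. In this simplified form the compiler rejects any reuse of a stale budget handle, while the arithmetic check on the remaining amount still happens at runtime.

```rust
/// Hypothetical token budget. Deliberately NOT `Clone` or `Copy`,
/// so a value can be used at most once (affine use).
#[derive(Debug)]
pub struct TokenBudget {
    remaining: u64,
}

/// Hypothetical error returned when a request exceeds what is left.
#[derive(Debug, PartialEq, Eq)]
pub struct Overrun {
    pub requested: u64,
    pub available: u64,
}

impl TokenBudget {
    pub fn new(limit: u64) -> Self {
        TokenBudget { remaining: limit }
    }

    pub fn remaining(&self) -> u64 {
        self.remaining
    }

    /// Consumes `self` and returns the reduced budget, or an error
    /// if `tokens` exceeds what is left.
    pub fn spend(self, tokens: u64) -> Result<TokenBudget, Overrun> {
        match self.remaining.checked_sub(tokens) {
            Some(rest) => Ok(TokenBudget { remaining: rest }),
            None => Err(Overrun {
                requested: tokens,
                available: self.remaining,
            }),
        }
    }

    /// Consumes `self` and carves out a child budget (for example, for
    /// a subagent). The parent keeps only what the child did not take.
    pub fn split(self, child: u64) -> Result<(TokenBudget, TokenBudget), Overrun> {
        let parent = self.spend(child)?;
        Ok((parent, TokenBudget::new(child)))
    }
}

fn main() {
    let budget = TokenBudget::new(1_000);
    let budget = budget.spend(400).expect("within budget");
    let (parent, child) = budget.split(500).expect("within budget");
    println!("parent={} child={}", parent.remaining(), child.remaining());

    // let stale = budget.spend(1); // would not compile: `budget` was moved

    // Asking for more than the parent has left is reported as an error.
    assert!(parent.spend(200).is_err());
}
```

Because `split` takes `self` by value, a parent cannot hand the same allowance to two subagents: the second call would refer to a moved value and fail to compile. A crate with stronger compile-time guarantees would likely need to encode amounts in the type system as well, which this sketch does not attempt.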
Similar Articles
Subagents Account for Most Token Costs in Long Agent Runs: Fixes That Cut Usage 70 to 90 Percent in Practice
The article analyzes a 2026 paper by Bai et al. showing that subagents and context bloat cause token costs in long agent runs to be ~1000x higher than chat, and presents three practical fixes (PLAN.md, read budget, out-of-band notes) that reduce token usage by 70-90%.
BAGEN: Are LLM Agents Budget-Aware?
This paper introduces BAGEN, a framework for evaluating budget awareness in LLM agents. It defines budget estimation in terms of internal and external budgets and formalizes progressive interval estimation. Experiments show that strong agents lack budget awareness and are over-optimistic, that early stopping can save tokens, and that training improves alerting behavior.
Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs
BACR introduces adaptive token budgeting and curriculum-aware scheduling to prevent LLMs from overthinking easy problems and underthinking hard ones, cutting token use by 34% while boosting accuracy by up to 8.3%.
Toward Reliable Design of LLM-Enabled Agentic Workflows: Optimizing Latency-Reliability-Cost Tradeoffs
This paper analyzes tradeoffs between latency, reliability, and cost in LLM-enabled agentic workflows, introducing performance models and deriving optimal resource allocation policies such as water-filling token allocation.
Inference-Time Budget Control for LLM Search Agents
This paper introduces a two-stage inference-time budget control method for LLM search agents, using Value-of-Information scores to optimize tool-call and token allocation during multi-hop question answering.