Token Budgets: An Empirical Catalog of 63 LLM-Agent Budget-Overrun Incidents, with an Affine-Typed Rust Mitigation as a Case Study

Hugging Face Daily Papers Papers

Summary

This paper presents an empirical catalog of 63 confirmed LLM-agent budget overrun incidents from 21 orchestration frameworks, organized into a failure taxonomy, and introduces a Rust crate using affine type ownership to prevent token/cost budget violations at compile time rather than runtime.

LLM-agent budget overruns are a documented production failure class: a single retry loop can spend thousands of dollars before an operator notices, and the in-process integrity properties that would prevent it (no aliasing, no double-spend, no use-after-delegation of a cost-bearing value) are enforced, if at all, by ad-hoc wrappers rather than by the type system. Our central contribution is empirical: a catalog of 63 confirmed production incidents from 21 orchestration frameworks (2023-2026), each backed by a quoted GitHub issue and, where reported, a dollar loss, organized into an eight-cluster failure taxonomy (inter-rater Cohen's kappa = 0.837, N = 113), plus 47 supplementary structural entries. As one mitigation evaluated against this taxonomy, we build token-budgets, an 1,180-line Rust crate (no unsafe) that operationalizes affine ownership so that cloning, double-spending, or using a budget after delegating it are compile errors rather than runtime hazards an operator must remember to avoid. The dollar cap is runtime arithmetic under an estimator assumption; the affine layer makes that arithmetic non-bypassable. On single-agent workloads a 4-line Python counter matches the crate at 0/30 overshoot, so the distinguishing value is non-bypassability under operator error in multi-agent delegation: the delegation-fanout race documented in 11 incidents is rejected by the borrow checker at compile time, while the same pattern under asyncio overshoots 30/30 and three disciplined alternatives overshoot 0/30. Across five runtimes, three providers, and a temperature-stratified live-API test (N = 160), the approach reports zero cap violations and zero false refusals, at operational parity with concurrent work. Static over-reservation is 4-6x (2.11x adaptive). Binary-level cap-soundness on the running binary is left open.
Original Article

Similar Articles

BAGEN: Are LLM Agents Budget-Aware?

arXiv cs.LG

This paper introduces BAGEN, a framework for evaluating budget awareness in LLM agents, defining budget estimation as internal and external budgets and formalizing progressive interval estimation. Experiments show that strong agents lack budget awareness, are over-optimistic, and that early stopping can save tokens while training improves alerting behavior.

Inference-Time Budget Control for LLM Search Agents

arXiv cs.AI

This paper introduces a two-stage inference-time budget control method for LLM search agents, using Value-of-Information scores to optimize tool-call and token allocation during multi-hop question answering.