When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
Summary
The ToolMaze benchmark evaluates LLM agents' ability to handle real-world tool failures, revealing that implicit semantic failures cause the largest performance drops and that dynamic replanning remains a critical bottleneck not addressed by scaling or prompting.
Cached at: 06/08/26, 03:29 AM
Source: https://huggingface.co/papers/2606.05806
Abstract
The ToolMaze benchmark reveals that real-world tool failures significantly degrade Tool-Integrated Reasoning (TIR) performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck.
Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized “happy paths”, largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
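The 2×2 perturbation taxonomy can be illustrated with a small Python sketch. This is not ToolMaze's implementation: the names `Perturbation`, `perturb`, `double`, and `call_with_retry` are hypothetical, and the "corrupt the result by adding 1" rule is an assumption chosen only to make the implicit case visible. The sketch shows why a naive retry loop can recover from an explicit transient failure but has no signal to act on when the failure is implicit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Perturbation:
    """One cell of the 2x2 taxonomy (hypothetical representation)."""
    explicit: bool       # True: the tool raises an error; False: it returns a wrong value silently
    permanent: bool      # True: every call fails; False: only the first `fail_count` calls fail
    fail_count: int = 1  # used only for transient perturbations


def perturb(tool: Callable[[int], int], p: Perturbation) -> Callable[[int], int]:
    """Wrap `tool` so that it fails according to `p`."""
    calls = 0

    def wrapped(x: int) -> int:
        nonlocal calls
        calls += 1
        failing = p.permanent or calls <= p.fail_count
        if not failing:
            return tool(x)
        if p.explicit:
            raise RuntimeError("tool unavailable")
        return tool(x) + 1  # assumed corruption: plausible but wrong output

    return wrapped


def double(x: int) -> int:
    """A toy tool standing in for a real API call."""
    return 2 * x


def call_with_retry(tool: Callable[[int], int], x: int, retries: int = 2) -> Optional[int]:
    """Blind trial-and-error: retry on visible errors, trust any returned value."""
    for _ in range(retries + 1):
        try:
            return tool(x)
        except RuntimeError:
            continue
    return None  # gave up; a replanning agent would look for an alternative path here


explicit_transient = call_with_retry(perturb(double, Perturbation(True, False)), 3)
explicit_permanent = call_with_retry(perturb(double, Perturbation(True, True)), 3)
implicit_transient = call_with_retry(perturb(double, Perturbation(False, False)), 3)
implicit_permanent = call_with_retry(perturb(double, Perturbation(False, True)), 3)
```

In this toy setup, retrying recovers only the explicit transient case. The explicit permanent case exhausts its retries, which is where rerouting through an alternative tool would be needed, and both implicit cases return a corrupted value that the caller accepts without question.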
Datasets citing this paper: 1
dongsheng/ToolMaze • Updated 3 days ago • 1.14k • 4
Similar Articles
LLM Agents Already Know When to Call Tools -- Even Without Reasoning
This paper introduces When2Tool, a benchmark to study when LLM agents actually need to call tools, and reveals that models already know tool necessity from hidden states but fail to act. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.
Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling
This paper argues that full-horizon planning with lazy replanning is more efficient than step-by-step execution for data-centric LLM agent tasks, using fewer tokens while maintaining accuracy.
AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios
This paper introduces AsyncTool, a benchmark for evaluating LLM-based agents' asynchronous function calling abilities in multi-task scenarios with delayed tool responses. It proposes efficiency-oriented metrics and identifies key failure modes of current tool-using agents.
Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
This paper addresses the problem of tool failures in medical AI agents by proposing a GRPO-based reinforcement learning framework that leverages instance-level selection, disagreement-aware synergy learning, and entropy-guided sampling to correct erroneous tool consensus and improve reliability across seven medical benchmarks.
How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
This paper systematically measures behavioral reproducibility of LLM agents in multi-step tool-calling pipelines across 1,140 traces, finding a 'structural consistency, parametric variance' pattern where agents reliably select tools in the same order but vary in arguments, and that structural consistency predicts task success.