When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Hugging Face Daily Papers 06/04/26, 12:00 AM Papers

benchmark tool-failures llm-agents anomaly-recovery replanning tool-integrated-reasoning

Summary

The ToolMaze benchmark evaluates LLM agents' ability to handle real-world tool failures, revealing that implicit semantic failures cause the largest performance drops and that dynamic replanning remains a critical bottleneck not addressed by scaling or prompting.

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
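To make the 2×2 perturbation taxonomy concrete, here is a minimal Python sketch of a wrapper that injects one of the four failure types into a tool. It is not taken from the ToolMaze code. The names `PerturbedTool`, `ToolError`, `transient_failures`, and `lookup` are hypothetical, and the semantics are an assumed reading of the abstract: an explicit failure raises an error, an implicit failure silently returns corrupted output, a transient failure clears after a fixed number of calls, and a permanent failure never clears.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # assumed: the failure is signalled with an error
    IMPLICIT = "implicit"  # assumed: the call "succeeds" but the output is corrupted


class Persistence(Enum):
    TRANSIENT = "transient"  # assumed: clears after a few calls
    PERMANENT = "permanent"  # assumed: never clears


class ToolError(RuntimeError):
    """Hypothetical error type used for explicit failures."""


@dataclass
class PerturbedTool:
    tool: Callable[[str], str]
    visibility: Visibility
    persistence: Persistence
    transient_failures: int = 1  # hypothetical knob: failing calls before recovery
    calls: int = 0

    def _is_failing(self) -> bool:
        if self.persistence is Persistence.PERMANENT:
            return True
        return self.calls <= self.transient_failures

    def __call__(self, query: str) -> str:
        self.calls += 1
        if not self._is_failing():
            return self.tool(query)
        if self.visibility is Visibility.EXPLICIT:
            raise ToolError(f"tool unavailable (call {self.calls})")
        # Implicit failure: no error is raised, the output is just wrong.
        # Reversing the string stands in for semantic corruption.
        return self.tool(query)[::-1]


def lookup(query: str) -> str:
    """Hypothetical clean tool."""
    return f"result for {query}"
```

Under these assumptions, an explicit transient tool can be handled by a retry, while an implicit permanent tool returns a well-formed but wrong string on every call, so an agent that trusts tool outputs has no signal that anything failed. That is the case the abstract associates with the sharpest drops.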

Original Article

Cached at: 06/08/26, 03:29 AM

Source: https://huggingface.co/papers/2606.05806

Abstract

ToolMaze benchmark reveals that real-world tool failures significantly degrade TIR performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck.

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
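The contrast between systematic replanning and blind trial-and-error can be illustrated with a small Python sketch over a tool DAG. This is an illustration under stated assumptions, not the benchmark's implementation: the graph `EDGES`, the tool names, and the functions `plan` and `run_with_replanning` are all hypothetical, and the sketch only covers explicit permanent failures, where the agent can observe that a tool is broken.

```python
from collections import deque

# Hypothetical task graph: nodes are intermediate states, edges are tool calls.
# EDGES[state] lists (tool_name, next_state) pairs.
EDGES = {
    "query": [("search_api", "raw_docs"), ("local_index", "raw_docs")],
    "raw_docs": [("summarizer", "answer")],
    "answer": [],
}


def plan(edges, start, goal, blocked):
    """Breadth-first search for the shortest tool sequence, skipping blocked tools."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for tool, nxt in edges.get(state, []):
            if tool in blocked or nxt in seen:
                continue
            seen.add(nxt)
            queue.append((nxt, path + [tool]))
    return None  # no route to the goal avoids the blocked tools


def run_with_replanning(edges, start, goal, broken):
    """Follow a plan; when a tool fails, block it and replan from the current state."""
    blocked, state, trace = set(), start, []
    path = plan(edges, state, goal, blocked)
    while path:
        tool = path[0]
        if tool in broken:  # assumed: the failure is explicit and permanent
            blocked.add(tool)
            path = plan(edges, state, goal, blocked)
            continue
        state = next(n for t, n in edges[state] if t == tool)
        trace.append(tool)
        path = path[1:]
    return trace if state == goal else None
```

With `search_api` broken, the run routes through `local_index` instead of retrying the dead tool; with `summarizer` broken, no alternative path exists and the run returns `None` rather than looping. An implicit failure would not trigger the `broken` check at all, which is why detection has to come before replanning in that case.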

Datasets citing this paper: 1

dongsheng/ToolMaze • Updated 3 days ago • 1.14k • 4

Similar Articles

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Do Agents Need to Plan Step-by-Step? Rethinking Planning Horizon in Data-Centric Tool Calling

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines
