When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Hugging Face Daily Papers

Summary

The ToolMaze benchmark evaluates LLM agents' ability to handle real-world tool failures, revealing that implicit semantic failures cause the largest performance drops and that dynamic replanning remains a critical bottleneck not addressed by scaling or prompting.

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a 2×2 taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale 3.66 times slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
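
To make the 2×2 perturbation taxonomy concrete, the following is a minimal Python sketch, not ToolMaze's actual code or API. The names `Visibility`, `Duration`, `PerturbedTool`, and `call_with_fallback`, the integer tool signature, the `transient_failures` parameter, and the `check` validator are all hypothetical and chosen only for illustration. It shows why implicit failures are harder than explicit ones: an explicit failure raises an error the caller can see, while an implicit failure returns a plausible but wrong value that is caught only if the output is validated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"    # the tool raises an error the agent can see
    IMPLICIT = "implicit"    # the tool returns a plausible but wrong value


class Duration(Enum):
    TRANSIENT = "transient"  # fails for a few calls, then recovers
    PERMANENT = "permanent"  # never recovers; the agent must route around it


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one of the four failure types."""
    fn: Callable[[int], int]
    visibility: Visibility
    duration: Duration
    transient_failures: int = 1  # assumed: failing calls before recovery
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        recovered = (self.duration is Duration.TRANSIENT
                     and self.calls > self.transient_failures)
        if recovered:
            return self.fn(x)
        if self.visibility is Visibility.EXPLICIT:
            raise RuntimeError("tool unavailable")
        return self.fn(x) + 1  # silently corrupted output, no error raised


def call_with_fallback(primary, fallback, x, check, max_retries=1):
    """Retry the primary tool, validate its output, then switch tools."""
    for _ in range(max_retries + 1):
        try:
            result = primary(x)
        except RuntimeError:
            continue  # explicit failure: visible, so a retry is cheap
        if check(x, result):
            return result, "primary"
        # implicit failure: detected only because the output was validated
    return fallback(x), "fallback"


def double(x: int) -> int:
    return 2 * x


def is_even(_x: int, result: int) -> bool:
    # Assumed sanity check: a doubled integer must be even.
    return result % 2 == 0


flaky = PerturbedTool(double, Visibility.EXPLICIT, Duration.TRANSIENT)
lying = PerturbedTool(double, Visibility.IMPLICIT, Duration.PERMANENT)
print(call_with_fallback(flaky, double, 3, is_even))  # recovers on retry
print(call_with_fallback(lying, double, 3, is_even))  # needs the fallback
```

In this sketch a transient explicit failure is resolved by a single retry, whereas a permanent implicit failure is resolved only by combining output validation with a switch to another tool. An agent that trusts the returned value without checking it would accept the corrupted result, which is the over-trust behavior the abstract describes.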

Cached at: 06/08/26, 03:29 AM


Source: https://huggingface.co/papers/2606.05806

Abstract

The ToolMaze benchmark reveals that real-world tool failures significantly degrade Tool-Integrated Reasoning (TIR) performance, with implicit semantic failures causing the most severe drops and dynamic replanning emerging as a key bottleneck.





Datasets citing this paper: 1

dongsheng/ToolMaze


Similar Articles

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Hugging Face Daily Papers

This paper introduces When2Tool, a benchmark to study when LLM agents actually need to call tools, and reveals that models already know tool necessity from hidden states but fail to act. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

arXiv cs.AI

This paper addresses the problem of tool failures in medical AI agents by proposing a GRPO-based reinforcement learning framework that leverages instance-level selection, disagreement-aware synergy learning, and entropy-guided sampling to correct erroneous tool consensus and improve reliability across seven medical benchmarks.