Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

arXiv cs.CL 06/25/26, 04:00 AM Papers

tool-use benchmarks agents reliability hazards llm-agents evaluation

Summary

Introduces ToolBench-X, a benchmark for evaluating large language model agents under various tool-environment reliability hazards, revealing a substantial gap in performance compared to clean environments.

arXiv:2606.25819v1 Announce Type: new Abstract: Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across diverse domains and sequential, parallel, and mixed workflows, each paired with deterministic tools and a canonical final answer for automatic evaluation. Starting from clean tool environments, ToolBench-X injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Crucially, each injected instance remains solvable through at least one valid recovery path, such as retrying, fallback, verification, or cross-checking. Experiments reveal a substantial reliability gap: agents that perform well with reliable tools often fail under recoverable hazards. Further analysis shows that failures are driven less by tool-use volume or inference budget than by limited hazard diagnosis and ineffective recovery. Targeted recovery hints recover many failed tasks, while test-time scaling yields more limited gains. These results suggest that tool-use evaluation should move beyond function-call accuracy toward task completion under unreliable tool environments. The code and data are available at https://github.com/Foreverskyou/ToolBench-X.


# Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
Source: [https://arxiv.org/abs/2606.25819](https://arxiv.org/abs/2606.25819)
[View PDF](https://arxiv.org/pdf/2606.25819)

> Abstract: Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across diverse domains and sequential, parallel, and mixed workflows, each paired with deterministic tools and a canonical final answer for automatic evaluation. Starting from clean tool environments, ToolBench-X injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Crucially, each injected instance remains solvable through at least one valid recovery path, such as retrying, fallback, verification, or cross-checking. Experiments reveal a substantial reliability gap: agents that perform well with reliable tools often fail under recoverable hazards. Further analysis shows that failures are driven less by tool-use volume or inference budget than by limited hazard diagnosis and ineffective recovery. Targeted recovery hints recover many failed tasks, while test-time scaling yields more limited gains. These results suggest that tool-use evaluation should move beyond function-call accuracy toward task completion under unreliable tool environments. The code and data are available at [https://github.com/Foreverskyou/ToolBench-X](https://github.com/Foreverskyou/ToolBench-X).
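
To make the idea of a recoverable hazard concrete, the following is a minimal Python sketch, not the ToolBench-X implementation. It assumes a toy deterministic tool (`clean_price_lookup`), a hypothetical wrapper that injects an Execution Failure which clears after a fixed number of calls, and two illustrative recovery paths: retrying, and cross-checking several sources by strict majority. All names, the exception type, and the majority rule are assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, List


class TransientToolError(RuntimeError):
    """Hypothetical error raised when an injected Execution Failure fires."""


def clean_price_lookup(item: str) -> float:
    """Toy deterministic tool: the same input always yields the same output."""
    prices = {"widget": 4.0, "gadget": 7.5}
    return prices[item]


def inject_execution_failure(tool: Callable, fail_first_n: int) -> Callable:
    """Wrap a tool so its first `fail_first_n` calls raise, then it behaves cleanly.

    The hazard stays recoverable: enough retries reach the clean tool.
    """
    state = {"calls": 0}

    def hazardous_tool(*args, **kwargs):
        state["calls"] += 1
        if state["calls"] <= fail_first_n:
            raise TransientToolError(f"simulated failure on call {state['calls']}")
        return tool(*args, **kwargs)

    return hazardous_tool


def call_with_retry(tool: Callable, *args, max_attempts: int = 3, **kwargs):
    """Recovery path 1 (retrying): call the tool until it succeeds or attempts run out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return tool(*args, **kwargs)
        except TransientToolError as err:
            last_error = err
    raise last_error


def cross_check(sources: List[Callable[[str], float]], item: str) -> float:
    """Recovery path 2 (cross-checking): accept an answer only on a strict majority."""
    answers = [source(item) for source in sources]
    value, count = Counter(answers).most_common(1)[0]
    if count * 2 <= len(answers):
        raise ValueError(f"no majority among sources: {answers}")
    return value


def drifted_price_lookup(item: str) -> float:
    """Toy conflicting source: returns a value that disagrees with the clean tool."""
    return clean_price_lookup(item) + 1.0


# Execution Failure, recovered by retrying.
flaky = inject_execution_failure(clean_price_lookup, fail_first_n=2)
recovered = call_with_retry(flaky, "widget", max_attempts=3)

# Cross-source Conflict, recovered by cross-checking three sources.
agreed = cross_check(
    [clean_price_lookup, drifted_price_lookup, clean_price_lookup], "gadget"
)
```

In this sketch an agent that calls `flaky` once and gives up fails the task, while one that retries reaches the canonical answer. That mirrors the distinction the abstract draws between function-call accuracy and task completion under unreliable tools.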

## Submission history

From: Yang Tian [[view email](https://arxiv.org/show-email/2e4e1311/2606.25819)] **[v1]** Wed, 24 Jun 2026 13:34:34 UTC (8,775 KB)

Similar Articles

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
