Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability
Summary
Introduces ToolBench-X, a benchmark for evaluating large language model agents under various tool-environment reliability hazards, revealing a substantial gap in performance compared to clean environments.
# Beyond Function Calling: Benchmarking Tool-Using Agents under Tool-Environment Unreliability

Source: [https://arxiv.org/abs/2606.25819](https://arxiv.org/abs/2606.25819) [View PDF](https://arxiv.org/pdf/2606.25819)

> Abstract: Large language models are increasingly deployed as agents that solve tasks by interacting with external tool environments. Although recent tool-use benchmarks increasingly cover complex task settings, they still largely assume clean, stable, and trustworthy tool environments, leaving tool-environment unreliability insufficiently examined. We introduce ToolBench-X, a benchmark for evaluating agents under recoverable reliability hazards. ToolBench-X contains executable multi-step tasks across diverse domains and sequential, parallel, and mixed workflows, each paired with deterministic tools and a canonical final answer for automatic evaluation. Starting from clean tool environments, ToolBench-X injects five structured hazard types: Specification Drift, Invocation Error, Execution Failure, Output Drift, and Cross-source Conflict. Crucially, each injected instance remains solvable through at least one valid recovery path, such as retrying, fallback, verification, or cross-checking. Experiments reveal a substantial reliability gap: agents that perform well with reliable tools often fail under recoverable hazards. Further analysis shows that failures are driven less by tool-use volume or inference budget than by limited hazard diagnosis and ineffective recovery. Targeted recovery hints recover many failed tasks, while test-time scaling yields more limited gains. These results suggest that tool-use evaluation should move beyond function-call accuracy toward task completion under unreliable tool environments. The code and data are available at [this https URL](https://github.com/Foreverskyou/ToolBench-X).
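
The idea of a recoverable hazard can be illustrated with a minimal sketch. This is not ToolBench-X code: every name below (`ToolExecutionError`, `inject_execution_failure`, `call_with_recovery`, `clean_lookup`) is hypothetical, and the sketch only assumes that an Execution Failure hazard can be modeled as a wrapper that makes a deterministic tool fail on its first few calls, with retrying and fallback as the recovery paths.

```python
from typing import Callable

Tool = Callable[[str], str]


class ToolExecutionError(Exception):
    """Hypothetical error standing in for an injected Execution Failure."""


def inject_execution_failure(tool: Tool, fail_first_n: int) -> Tool:
    """Wrap a deterministic tool so that its first `fail_first_n` calls raise."""
    calls = {"count": 0}

    def wrapped(query: str) -> str:
        calls["count"] += 1
        if calls["count"] <= fail_first_n:
            raise ToolExecutionError("injected transient failure")
        return tool(query)

    return wrapped


def call_with_recovery(primary: Tool, fallback: Tool, query: str,
                       max_attempts: int = 2) -> str:
    """Try the primary tool up to `max_attempts` times, then use the fallback."""
    for _ in range(max_attempts):
        try:
            return primary(query)
        except ToolExecutionError:
            continue  # recovery path 1: retry the same tool
    return fallback(query)  # recovery path 2: switch to a fallback tool


def clean_lookup(city: str) -> str:
    """A clean, deterministic tool (illustrative data only)."""
    return {"Paris": "France", "Kyoto": "Japan"}[city]


# Hazard that a single retry resolves.
flaky = inject_execution_failure(clean_lookup, fail_first_n=1)
retry_answer = call_with_recovery(flaky, clean_lookup, "Kyoto")

# Hazard that outlasts the retry budget, so the fallback tool is needed.
broken = inject_execution_failure(clean_lookup, fail_first_n=5)
fallback_answer = call_with_recovery(broken, clean_lookup, "Paris")
```

In both cases the task stays solvable, which mirrors the abstract's requirement that each injected instance keeps at least one valid recovery path. An agent that stops at the first error would fail both tasks even though the underlying tool is correct.
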
## Submission history

From: Yang Tian

**[v1]** Wed, 24 Jun 2026 13:34:34 UTC (8,775 KB)
Similar Articles
When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
The ToolMaze benchmark evaluates LLM agents' ability to handle real-world tool failures, revealing that implicit semantic failures cause the largest performance drops and that dynamic replanning remains a critical bottleneck not addressed by scaling or prompting.
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
TOBench is a new benchmark for evaluating AI agents on real-world, task-oriented tool use with multimodal inputs and closed-loop verification. Experiments show top models like Qwen 3.5 Plus achieve only 41% success, far below the 94% human benchmark, highlighting a significant gap.
GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
GTA-2 introduces a hierarchical benchmark for evaluating general tool agents across atomic tool-use and open-ended workflows, revealing a significant capability cliff where frontier models achieve only 14.39% success on complex tasks despite reasonable atomic performance.
Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
This paper introduces Agent-ValueBench, a comprehensive benchmark designed to evaluate the values of autonomous agents, revealing that agent values diverge from their underlying language models.
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench introduces a benchmark to evaluate Agent-as-a-Judge systems that interact with environments to verify agent behaviors across 155 tasks in search, data systems, and GUI domains.