PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems
Summary
PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
Cached at: 06/23/26, 05:41 AM
Source: https://huggingface.co/papers/2606.22388
Abstract
PlanBench-XL evaluates large language model agents’ ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions.
LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
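To make the setting in the abstract concrete, here is a minimal toy sketch of an agent that sees only the top-k retrieved tools per query, infers a missing input as an implicit sub-goal, and switches to an alternative tool when a call fails without an explicit error. It is not the PlanBench-XL harness or API: the `Tool` class, the keyword retriever, the tool names, and the `blocked` flag are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    # Hypothetical tool record; PlanBench-XL's real tool schema is not shown in the source.
    name: str
    description: str
    needs: frozenset      # evidence keys required before the tool can be called
    gives: str            # evidence key the tool produces
    blocked: bool = False  # assumed stand-in for a "failing" tool

    def call(self, evidence):
        if self.blocked:
            return None  # silent failure: no explicit error signal
        return {self.gives: f"<{self.gives} via {self.name}>"}


def retrieve(tools, query, k, exclude):
    """Toy retriever: rank by keyword overlap with the query, return only the top k."""
    words = set(query.split())
    pool = [t for t in tools if t.name not in exclude]
    ranked = sorted(pool, key=lambda t: -len(words & set(t.description.split())))
    return ranked[:k]


def run_agent(tools, evidence, goal, k=2, max_steps=20):
    evidence = dict(evidence)
    wanted = [goal]   # stack of sub-goals, final goal at the bottom
    dead = set()      # tools the agent has observed to fail
    trace = []        # (tool name, succeeded?) for every call made
    for _ in range(max_steps):
        if not wanted:
            break
        target = wanted[-1]
        if target in evidence:
            wanted.pop()
            continue
        visible = retrieve(tools, target, k, dead)
        candidates = [t for t in visible if t.gives == target]
        if not candidates:
            break  # no visible path to this sub-goal
        ready = [t for t in candidates if t.needs.issubset(evidence)]
        if ready:
            result = ready[0].call(evidence)
            trace.append((ready[0].name, result is not None))
            if result is None:
                dead.add(ready[0].name)  # adapt: avoid this tool from now on
            else:
                evidence.update(result)
        else:
            # Implicit sub-goal: obtain a missing input first.
            wanted.append(sorted(candidates[0].needs.difference(evidence))[0])
    return goal in evidence, trace


def make_tools(block_customer_lookup):
    return [
        Tool("lookup_order_by_customer", "returns order_id given customer_id",
             frozenset({"customer_id"}), "order_id", blocked=block_customer_lookup),
        Tool("lookup_order_by_email", "returns order_id given email",
             frozenset({"email"}), "order_id"),
        Tool("get_refund_status", "returns refund_status given order_id",
             frozenset({"order_id"}), "refund_status"),
        Tool("get_store_hours", "returns store_hours given store_id",
             frozenset({"store_id"}), "store_hours"),  # distractor
    ]


start = {"customer_id": "C1", "email": "a@example.com"}
print(run_agent(make_tools(False), start, "refund_status"))
print(run_agent(make_tools(True), start, "refund_status"))
```

In this toy, the block-free run reaches the goal in two calls, while the blocked run needs three because it must first notice the silent failure and then retrieve the email-based lookup. That mirrors, in miniature, the abstract's point that recovery costs extra steps and depends on detecting failures that carry no error message.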
Similar Articles
AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints
AdaPlanBench is a dynamic benchmark for evaluating LLM agents' ability to adaptively plan under progressively revealed world and user constraints through multi-turn interactions, showing current models struggle especially with user constraints.
PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models
PlanningBench is a framework for generating scalable, diverse, and verifiable planning data to evaluate and train large language models, featuring a constraint-driven synthesis pipeline with adaptive difficulty control and quality filtering. Experiments show that frontier LLMs struggle with coupled constraints, and reinforcement learning on PlanningBench data improves performance on unseen planning tasks.
When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents
The ToolMaze benchmark evaluates LLM agents' ability to handle real-world tool failures, revealing that implicit semantic failures cause the largest performance drops and that dynamic replanning remains a critical bottleneck not addressed by scaling or prompting.
CEO-Bench: Can Agents Play the Long Game?
CEO-Bench introduces a simulation benchmark that evaluates language model agents' ability to manage a startup over 500 days, testing long-term planning, noise handling, adaptability, and multi-task coordination. Results show that even the strongest models struggle, with only Claude Opus 4.8 and GPT-5.5 finishing above the starting balance.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
WildClawBench evaluates language and vision-language models on realistic long-horizon tasks using actual CLI environments with real tools. The benchmark reveals that even the best model achieves only 62.2% accuracy, indicating long-horizon agent evaluation remains challenging.