PlanBench-XL: Evaluating Long-Horizon Planning of LLM Tool-Use Agents in Large-Scale Tool Ecosystems

Hugging Face Daily Papers 06/21/26, 12:00 AM Papers

benchmark llm-agents tool-use planning evaluation long-horizon

Summary

PlanBench-XL is a new benchmark that evaluates LLM agents' ability to plan and adapt in large tool ecosystems with limited visibility and dynamic disruptions. Experiments show GPT-5.4 achieves only 51.9% accuracy in block-free settings and collapses to 11.36% under severe blocking, highlighting significant challenges in long-horizon planning.
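To put the reported collapse in perspective, the two accuracy figures above can be turned into an absolute and a relative drop. The short calculation below uses only the two numbers quoted in the summary; the variable names are illustrative.

```python
# Accuracy figures quoted in the summary (percent).
block_free_acc = 51.90   # GPT-5.4, block-free setting
severe_block_acc = 11.36  # GPT-5.4, most severe blocking condition

# Absolute drop in percentage points and relative drop versus the block-free score.
absolute_drop = block_free_acc - severe_block_acc
relative_drop = absolute_drop / block_free_acc * 100

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"relative drop: {relative_drop:.1f}%")
```

In other words, the most severe blocking condition removes roughly three quarters of the accuracy the model reaches without blocking.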

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.
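The retrieve-then-invoke loop and the blocking mechanism described above can be illustrated with a toy sketch. This is not PlanBench-XL's implementation or API: the tool names, the keyword-overlap retriever, the scripted sub-goal queries, and the `apply_blocking` helper are all hypothetical, and only the "missing" and "failing" modes are modeled (the "distracting" mode is omitted). A real agent would also generate its own queries rather than follow a fixed list.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[[dict], dict]  # takes the evidence gathered so far, returns new evidence


def retrieve(registry: Dict[str, Tool], query: str, k: int = 2) -> List[Tool]:
    """Retrieval-limited visibility: expose at most k tools, ranked by word overlap."""
    words = set(query.lower().split())
    scored = [(len(words & set(t.description.lower().split())), t) for t in registry.values()]
    scored = [(score, t) for score, t in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [t for _, t in scored[:k]]


def apply_blocking(registry: Dict[str, Tool], blocked: Dict[str, str]) -> Dict[str, Tool]:
    """Hypothetical blocking: 'missing' removes a tool, 'failing' makes it return an error."""
    out = {}
    for name, tool in registry.items():
        mode = blocked.get(name)
        if mode == "missing":
            continue
        if mode == "failing":
            out[name] = Tool(name, tool.description, lambda state: {"error": "tool unavailable"})
        else:
            out[name] = tool
    return out


def run_episode(registry: Dict[str, Tool], queries: List[str], k: int = 2) -> Optional[dict]:
    """Scripted agent: for each sub-goal, try retrieved tools until one yields evidence."""
    state: dict = {}
    for query in queries:
        for tool in retrieve(registry, query, k):
            result = tool.fn(state)
            if "error" in result:
                continue  # explicit error signal: fall back to the next retrieved tool
            state.update(result)
            break
        else:
            return None  # no retrieved tool made progress on this sub-goal
    return state


# Toy retail registry (all tools are invented for illustration).
REGISTRY = {
    "find_order": Tool("find_order", "find order id for customer email",
                       lambda s: {"order_id": "A1"}),
    "find_order_backup": Tool("find_order_backup", "search order history for customer email",
                              lambda s: {"order_id": "A1"}),
    "get_refund_status": Tool("get_refund_status", "get refund status for order id",
                              lambda s: {"refund": "approved"} if "order_id" in s
                              else {"error": "missing order_id"}),
}
QUERIES = ["find order for customer email", "refund status for order"]

block_free = run_episode(REGISTRY, QUERIES)
recovered = run_episode(apply_blocking(REGISTRY, {"find_order": "failing"}), QUERIES)
stuck = run_episode(
    apply_blocking(REGISTRY, {"find_order": "missing", "find_order_backup": "missing"}),
    QUERIES,
)
print(block_free, recovered, stuck)
```

In this toy setup the agent recovers from a single failing tool because the error is explicit and a backup tool is visible within the top-k results; once both order-lookup tools are missing, the second sub-goal can never obtain the intermediate evidence it depends on and the episode fails.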

Original Article

Cached at: 06/23/26, 05:41 AM

Source: https://huggingface.co/papers/2606.22388

Abstract

PlanBench-XL evaluates large language model agents’ ability to plan and adapt in complex tool-rich environments with limited visibility and dynamic disruptions.

LLM agents increasingly operate in large tool ecosystems, where real-world tasks require discovering relevant tools, inferring implicit sub-goals, and adapting to dynamic environments over long horizons. However, existing benchmarks rarely evaluate planning under retrieval-limited tool visibility. To address this gap, we introduce PlanBench-XL, an interactive benchmark of 327 retail tasks over 1,665 tools that tests whether agents can iteratively retrieve usable tools and invoke them to uncover intermediate evidence for subsequent calls toward the final goal. PlanBench-XL further features an optional blocking mechanism that simulates real-world unpredictability through missing, failing, or distracting tool functions, forcing agents to detect disrupted paths and adapt at runtime. Experiments on ten leading LLMs show that massive-tool planning remains challenging: while GPT-5.4 achieves 51.90% accuracy in block-free settings, it collapses to 11.36% under the most severe blocking condition. Further analysis shows that agents are especially vulnerable when failures lack explicit error signals or when recovery requires longer alternative tool-use paths. These results establish PlanBench-XL as a testbed for diagnosing agentic planning failures and highlight the need for robust adaptive planning in long-horizon tasks with large, imperfect tool environments.

Similar Articles

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

PlanningBench: Generating Scalable and Verifiable Planning Data for Evaluating and Training Large Language Models

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

CEO-Bench: Can Agents Play the Long Game?

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
