LLM Agents Already Know When to Call Tools -- Even Without Reasoning
Summary
This paper introduces When2Tool, a benchmark for studying when LLM agents actually need to call tools. It shows that tool necessity can be decoded from a model's hidden states, yet models fail to act on that signal during generation. The proposed Probe&Prefill method reduces tool calls by 48% with minimal accuracy loss.
Cached at: 05/13/26, 08:14 PM
Source: https://huggingface.co/papers/2605.09252
Abstract
The When2Tool benchmark identifies the conditions under which tool calls are necessary for LLM agents. It reveals that models can predict tool necessity from their hidden states but fail to act on this knowledge, which motivates the Probe&Prefill method that reduces unnecessary calls by 48% with minimal accuracy loss.
Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models’ hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model’s own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model’s response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5 times higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
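The probe-then-prefill idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the hidden states here are synthetic Gaussian vectors rather than real model activations, the probe is a plain logistic regression trained by gradient descent, and the two steering sentences, the 0.5 threshold, and all function names (`train_linear_probe`, `auroc`, `choose_prefill`) are hypothetical choices made for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def make_synthetic_states(n, dim, rng):
    # ASSUMPTION: stand-in for pre-generation hidden states. Label 1 means
    # "tool needed"; those vectors are shifted along the first two dimensions.
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if label == 1:
            x[0] += 2.0
            x[1] += 2.0
        X.append(x)
        y.append(label)
    return X, y


def train_linear_probe(X, y, lr=0.1, epochs=200):
    # Logistic regression fitted with full-batch gradient descent.
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(dot(w, x) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    # Probability that a random positive outscores a random negative
    # (ties count as one half).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical steering sentences; the source does not give the wording used.
STEER_TOOL = "I need to call a tool for this."
STEER_DIRECT = "I can answer this directly without a tool."


def choose_prefill(hidden_state, w, b, threshold=0.5):
    # Read the probe, then pick the sentence to prefill into the response.
    p_tool = sigmoid(dot(w, hidden_state) + b)
    return STEER_TOOL if p_tool >= threshold else STEER_DIRECT


rng = random.Random(0)
train_X, train_y = make_synthetic_states(200, 8, rng)
test_X, test_y = make_synthetic_states(100, 8, rng)
w, b = train_linear_probe(train_X, train_y)
test_scores = [sigmoid(dot(w, x) + b) for x in test_X]
print("held-out AUROC:", round(auroc(test_scores, test_y), 3))
print("prefill:", choose_prefill(test_X[0], w, b))
```

In a real setup, the synthetic vectors would be replaced by activations extracted from the model before it generates its first token, and the chosen sentence would be placed at the start of the assistant turn so that generation continues from it. The threshold is the natural knob for trading tool-call reduction against accuracy.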
Datasets citing this paper: cesun/When2Tool
Similar Articles
Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
This paper introduces a model-adaptive definition of tool necessity for LLMs, revealing a substantial mismatch between when a model should use a tool and when it actually does. The authors decompose tool use into cognition and action stages, finding that the majority of errors occur in translating recognition into action, identifying a 'knowing-doing gap' in LLM tool use.
@omarsar0: Interesting interpretability paper on tool-using agents. The authors probe hidden states and find the model often recog…
This paper introduces a model-adaptive definition of tool necessity and finds a 26-54% mismatch between LLMs' internal recognition that a tool is needed and their actual tool-call actions, concentrated in the cognition-to-action transition. It reveals a 'knowing-doing gap' where the model often knows it should call a tool but fails to do so due to late-layer geometry rotating the signal nearly orthogonal to the action.
Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning
Introduces AutoTool, a model that adaptively decides whether to invoke tools for multimodal LLM reasoning, achieving significant accuracy and efficiency gains through reinforcement learning and dual-mode reasoning.
Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents
This paper introduces Contract2Tool, a framework for automatically inferring lightweight tool contracts (preconditions, effects, risk) from tool metadata, documentation, and execution traces, enabling reliable causal tool filtering for LLM agents. Experiments show learned contracts achieve near-gold contract performance in downstream multi-step agent tasks, significantly reducing token usage.
Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice [D]
A practitioner discusses the calibration vs. utility tradeoff in LLM agents, sharing experience with a verifier-based pipeline that reduces hallucinated tool calls by ~60% but introduces latency costs and drops easy correct answers.