LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Hugging Face Daily Papers

Summary

This paper introduces When2Tool, a benchmark for studying when LLM agents actually need to call tools. It shows that tool necessity is already decodable from the models' hidden states, yet the models fail to act on it. The proposed Probe&Prefill method reduces tool calls by 48% with only 1.7% accuracy loss.

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
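
To make "linearly decodable with AUROC 0.89--0.96" concrete, here is a minimal sketch. It is not the authors' code. It assumes synthetic 8-dimensional vectors as a stand-in for pre-generation hidden states, and a hypothetical logistic-regression probe trained with plain gradient descent. The paper's actual probe, layer choice, and training setup are not described in this abstract, and every function name below is illustrative.

```python
import math
import random


def make_synthetic_states(n, dim, seed):
    """Hypothetical stand-in for hidden states; label 1 means 'tool needed'."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label == 1 else -1.0
        # Assumption: only the first two dimensions carry signal, the rest are noise.
        vec = [rng.gauss(shift if i < 2 else 0.0, 1.0) for i in range(dim)]
        data.append((vec, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Linear probe: a dot product plus bias, squashed to (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(data, epochs=200, lr=0.1):
    """Fit logistic regression with full-batch gradient descent."""
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in data:
            err = probe_score(w, b, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(data)
    return w, b


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


train = make_synthetic_states(400, 8, seed=0)
test = make_synthetic_states(200, 8, seed=1)
w, b = train_linear_probe(train)
test_scores = [probe_score(w, b, x) for x, _ in test]
test_labels = [y for _, y in test]
print(f"held-out AUROC: {auroc(test_scores, test_labels):.3f}")
```

AUROC depends only on how the probe ranks examples, so it measures how well one linear direction separates tool-necessary from tool-unnecessary inputs without committing to a decision threshold.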

Cached at: 05/13/26, 08:14 PM

Source: https://huggingface.co/papers/2605.09252

Abstract

The When2Tool benchmark identifies the conditions under which tool calls are necessary for LLM agents. It shows that models can predict tool necessity from their hidden states but fail to act on this knowledge. This motivates the Probe&Prefill method, which reduces tool calls by 48% with only 1.7% accuracy loss.

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models’ hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model’s own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model’s response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5× higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
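
The Probe&Prefill control flow described above can be sketched roughly as follows. This is an illustrative sketch, not the released implementation. The 0.5 threshold, the wording of both steering sentences, the use of a steering sentence in both directions, the plain-text prompt layout, and all function names are assumptions. The probe score is passed in as a number rather than computed from a real model's hidden state.

```python
from dataclasses import dataclass

# Hypothetical steering sentences; the paper's exact wording is not given in the abstract.
STEER_DIRECT = "I can answer this directly without calling a tool."
STEER_TOOL = "I need to call a tool to answer this reliably."


@dataclass
class PrefillDecision:
    tool_needed: bool
    prefill: str


def choose_prefill(probe_score: float, threshold: float = 0.5) -> PrefillDecision:
    """Map a probe probability in [0, 1] to a steering sentence (threshold is assumed)."""
    if not 0.0 <= probe_score <= 1.0:
        raise ValueError("probe_score must be a probability in [0, 1]")
    tool_needed = probe_score >= threshold
    return PrefillDecision(tool_needed, STEER_TOOL if tool_needed else STEER_DIRECT)


def build_prompt(user_message: str, decision: PrefillDecision) -> str:
    """Open the assistant turn with the steering sentence so generation continues from it."""
    # Assumed plain-text layout; a real chat template would use model-specific markers.
    return f"User: {user_message}\nAssistant: {decision.prefill}"


def tool_call_reduction(baseline_calls: int, steered_calls: int) -> float:
    """Fraction of tool calls removed relative to the unsteered agent."""
    return (baseline_calls - steered_calls) / baseline_calls


# Example: a low probe score steers the model toward answering directly.
print(build_prompt("What is 17 + 25?", choose_prefill(0.08)))
# Example: a high probe score steers the model toward a tool call.
print(build_prompt("What is the current EUR/USD rate?", choose_prefill(0.93)))
```

Because the steering happens through the prefilled text, no weights change and no extra reasoning tokens are generated, which fits the training-free setting the abstract describes.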

Get this paper in your agent:

hf papers read 2605.09252

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Datasets citing this paper: 1

#### cesun/When2Tool • Updated about 20 hours ago • 3.78k • 23

Similar Articles

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

arXiv cs.AI

This paper introduces a model-adaptive definition of tool necessity for LLMs, revealing a substantial mismatch between when a model should use a tool and when it actually does. The authors decompose tool use into cognition and action stages, finding that the majority of errors occur in translating recognition into action, identifying a 'knowing-doing gap' in LLM tool use.

@omarsar0: Interesting interpretability paper on tool-using agents. The authors probe hidden states and find the model often recog…

X AI KOLs Following

This paper introduces a model-adaptive definition of tool necessity and finds a 26-54% mismatch between LLMs' internal recognition that a tool is needed and their actual tool-call actions, concentrated in the cognition-to-action transition. It reveals a 'knowing-doing gap' where the model often knows it should call a tool but fails to do so due to late-layer geometry rotating the signal nearly orthogonal to the action.

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

arXiv cs.AI

This paper introduces Contract2Tool, a framework for automatically inferring lightweight tool contracts (preconditions, effects, risk) from tool metadata, documentation, and execution traces, enabling reliable causal tool filtering for LLM agents. Experiments show learned contracts achieve near-gold contract performance in downstream multi-step agent tasks, significantly reducing token usage.