LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Hugging Face Daily Papers 05/10/26, 12:00 AM Papers

llm-agents tool-calling benchmark probing hidden-states efficiency

Summary

This paper introduces When2Tool, a benchmark to study when LLM agents actually need to call tools, and reveals that models already know tool necessity from hidden states but fail to act. The proposed Probe&Prefill method reduces unnecessary tool calls by 48% with minimal accuracy loss.

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy reduces tool calls by only 6%, or achieves a similar tool call reduction but incurs a 5 times higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool
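The probe-then-prefill idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hidden states here are synthetic toy vectors standing in for a model's pre-generation representation, the two steering sentences and the 0.5 threshold are hypothetical, and the probe is a plain logistic regression trained by gradient descent. The sketch only shows the shape of the pipeline: fit a linear probe, check its AUROC on held-out data, then use its score to pick the sentence that would be prefilled into the model's response.

```python
import math
import random

# Hypothetical steering sentences; the abstract does not give the actual wording.
STEER_NO_TOOL = "I can answer this directly without calling a tool."
STEER_TOOL = "I should call a tool to answer this reliably."


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, h):
    """Linear probe: estimated probability that hidden state h needs a tool."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)


def train_probe(states, labels, lr=0.1, epochs=200):
    """Fit logistic regression with plain batch gradient descent."""
    dim, n = len(states[0]), len(states)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(states, labels):
            err = probe_score(w, b, h) - y
            for j in range(dim):
                grad_w[j] += err * h[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def choose_prefill(w, b, h, threshold=0.5):
    """Pick the steering sentence to prefill at the start of the response."""
    return STEER_TOOL if probe_score(w, b, h) >= threshold else STEER_NO_TOOL


def synthetic_states(n, dim, rng):
    """Toy stand-in for hidden states: the label shifts the mean of every axis."""
    states, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        shift = 1.0 if y else -1.0
        states.append([rng.gauss(shift, 1.0) for _ in range(dim)])
        labels.append(y)
    return states, labels


rng = random.Random(0)
train_x, train_y = synthetic_states(200, 8, rng)
test_x, test_y = synthetic_states(100, 8, rng)
w, b = train_probe(train_x, train_y)
test_auroc = auroc([probe_score(w, b, h) for h in test_x], test_y)
print(f"held-out AUROC on toy data: {test_auroc:.3f}")
print("prefill for first test example:", choose_prefill(w, b, test_x[0]))
```

With a real model, the toy vectors would be replaced by hidden states extracted before generation begins, and the chosen sentence would be placed at the start of the assistant turn so that decoding continues from it. Which layer and token position to read, and where to set the threshold, are choices this sketch leaves open.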

Original Article

Cached at: 05/13/26, 08:14 PM

Source: https://huggingface.co/papers/2605.09252

Abstract

The When2Tool benchmark identifies the conditions under which LLM agents actually need tool calls. It reveals that models can predict tool necessity from their hidden states but fail to act on this knowledge, which motivates the Probe&Prefill method; the method reduces unnecessary calls by 48% with minimal accuracy loss.



Datasets citing this paper: 1

cesun/When2Tool



Similar Articles

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

@omarsar0: Interesting interpretability paper on tool-using agents. The authors probe hidden states and find the model often recog…

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Contract2Tool: Learning Preconditions and Effects for Reliable Tool-Augmented LLM Agents

Faithful uncertainty in LLM agents: calibration vs utility tradeoff in practice[D]
