Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Summary
This paper introduces a model-adaptive definition of tool necessity for LLMs, revealing a substantial mismatch between when a model should use a tool and when it actually does. The authors decompose tool use into cognition and action stages, finding that the majority of errors occur in translating recognition into action, identifying a 'knowing-doing gap' in LLM tool use.
# Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
Source: [https://arxiv.org/html/2605.14038](https://arxiv.org/html/2605.14038)
Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi
University of Maryland, College Park
{yzcheng, cfan42, krezaei, mahdij, sfeizi}@umd.edu
Project: [https://github.com/chengez/Tool-Cognition-Action](https://github.com/chengez/Tool-Cognition-Action)
###### Abstract
Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly and when to invoke external tools. Prior work on adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by humans or an LLM judge, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced because capability boundaries diverge across models: a problem that a strong model can solve on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5–54.0% and 30.8–41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples through the two-stage process, we further find that the majority of the mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a *knowing–doing gap* in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
## 1Introduction
Large language models (LLMs) are increasingly deployed as autonomous agents that interact with external tools such as search engines, calculators, and APIs \[[20](https://arxiv.org/html/2605.14038#bib.bib1), [24](https://arxiv.org/html/2605.14038#bib.bib2), [26](https://arxiv.org/html/2605.14038#bib.bib25), [19](https://arxiv.org/html/2605.14038#bib.bib26)\]. A central challenge in building reliable autonomous LLM agents is achieving adaptive tool use: the LLM needs to determine *when* it should rely on such tools and when it should answer directly \[[8](https://arxiv.org/html/2605.14038#bib.bib27), [22](https://arxiv.org/html/2605.14038#bib.bib29), [27](https://arxiv.org/html/2605.14038#bib.bib28)\]. Prior work on adaptive tool use \[[8](https://arxiv.org/html/2605.14038#bib.bib27), [22](https://arxiv.org/html/2605.14038#bib.bib29), [13](https://arxiv.org/html/2605.14038#bib.bib31)\] has largely treated tool necessity as a static, model-agnostic property, typically relying on human annotators or strong LLM judges to determine whether a query requires a tool. It has also focused primarily on polarized cases where the answer is obvious, such as fetching real-time weather data versus paraphrasing a static paragraph. However, tool necessity in the wild is fundamentally more nuanced because capability boundaries naturally diverge across models. A problem that a state-of-the-art model solves easily from its internal weights alone may completely exceed the capabilities of a smaller or less capable model, making tool use strictly necessary for the latter but redundant for the former.
In this work, we argue that tool necessity must be intrinsically tied to the specific capabilities of the model in question\. We introduce a model\-adaptive definition of tool necessity, grounded not in static annotations, but in each individual model’s empirical performance\. By evaluating necessity relative to a model’s intrinsic capabilities, we establish a more accurate characterization for when a specific LLM should seek external help\. Following this definition, we compare the actual necessity against the observed tool\-call behavior across four distinct models on arithmetic and factual question\-answering \(QA\) datasets\. Our findings reveal substantial mismatches: models exhibit a 26\.5–54\.0% necessity\-action mismatch in arithmetic tasks and a 30\.8–41\.8% necessity\-action mismatch in factual QA, frequently calling tools when capable of answering directly, or attempting to answer directly when lacking the requisite internal knowledge\.
Figure 1: Overview of the two-stage cognition-execution modeling of LLM tool use. (Left) Necessity: We introduce a model-adaptive definition of tool necessity based on a model's empirical ability to consistently answer a query correctly on its own, contrasting with prior model-agnostic approaches. (Middle) Cognition: By probing the model's internal hidden states $h$, we identify a linear cognition direction $w_c$ that successfully distinguishes when a tool is necessary. (Right) Action: We also train a probe $w_a$ to predict the actual tool-call execution. We find that $w_c$ and $w_a$ become nearly orthogonal in late layers, and that the majority of the necessity-action mismatch stems from the execution stage (translating awareness into action) rather than the internal cognition stage.
To diagnose the underlying mechanisms of this failure, we propose a two-stage decomposition of the tool-use process: an internal cognition stage, which reflects whether the model's internal representations encode the belief that a tool is necessary, and an execution stage, which determines whether the model actually outputs the tool-triggering tokens. Building on prior advancements in mechanistic interpretability and representation engineering \[[35](https://arxiv.org/html/2605.14038#bib.bib15)\] and following recent literature on adaptive tool use \[[13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)\], we probe the LLM hidden states and find that both the cognition of necessity and the execution intent are often linearly decodable. Yet, intriguingly, their respective probe directions become nearly orthogonal in the late-layer, last-token regime.
By tracing the trajectory of samples through this two-stage process, we uncover a *knowing-doing gap* in LLM tool use: the majority of the observed necessity-action mismatch cases originate in the transition from cognition to action, rather than in the cognition stage. Models frequently generate internal representations indicating awareness of their own limitations, but fail to translate this into the syntactic execution of a tool call. Our main contributions can be summarized as follows:
- We introduce a model-adaptive definition of tool necessity grounded in empirical performance, challenging the traditional reliance on static, model-agnostic annotations.
- We evaluate four distinct LLMs across arithmetic and factual QA datasets, revealing substantial behavioral mismatches (up to 54.0%) between actual tool necessity and observed tool-call actions.
- By dividing tool use into an internal cognition stage and an execution stage, we use representation probing to demonstrate that while both intent and necessity are linearly decodable, their probe directions become nearly orthogonal in the late-layer, last-token regime.
- Through trajectory tracing, we discover that tool-use failures predominantly occur during the transition from cognition to action, highlighting a knowing-doing gap in LLM tool use.
## 2Related work
#### Tool calling in LLM agents\.
To extend LLM capabilities beyond parametric knowledge, researchers have introduced function/tool calling \[[20](https://arxiv.org/html/2605.14038#bib.bib1), [24](https://arxiv.org/html/2605.14038#bib.bib2), [26](https://arxiv.org/html/2605.14038#bib.bib25), [19](https://arxiv.org/html/2605.14038#bib.bib26)\], enabling interaction with external resources and expanding task coverage. Standardized protocols like MCP \[[1](https://arxiv.org/html/2605.14038#bib.bib21)\] and A2A \[[6](https://arxiv.org/html/2605.14038#bib.bib8)\] further streamline communication and access within tool ecosystems. In parallel, various works have examined tool-use accuracy \[[12](https://arxiv.org/html/2605.14038#bib.bib7), [21](https://arxiv.org/html/2605.14038#bib.bib23)\], hallucinated calls \[[33](https://arxiv.org/html/2605.14038#bib.bib6), [23](https://arxiv.org/html/2605.14038#bib.bib4)\], and robustness to tool descriptions \[[25](https://arxiv.org/html/2605.14038#bib.bib5), [5](https://arxiv.org/html/2605.14038#bib.bib10)\]. However, while these efforts aim at teaching and evaluating *how* to use tools, an important and understudied challenge in building reliable LLM agents is determining *when* to use tools. Existing works that do study this challenge \[[8](https://arxiv.org/html/2605.14038#bib.bib27), [22](https://arxiv.org/html/2605.14038#bib.bib29), [13](https://arxiv.org/html/2605.14038#bib.bib31)\] treat tool necessity as a static property of the query, labeling instances as either tool-necessary or tool-unnecessary using human annotators or a proprietary LLM. This ignores the inherent difference in capability boundaries between models. While Wang et al. \[[27](https://arxiv.org/html/2605.14038#bib.bib28)\] have also advocated for model-dependent tool necessity, to the best of our knowledge, we are the first to provide a pipeline that empirically grounds tool necessity in the actual capabilities of a given model.
#### Meta\-cognition of LLMs and the “knowing\-doing gap”\.
The ability of LLMs to accurately assess their own capability boundaries, often referred to as meta-cognition or self-assessment, has been a topic of long-standing interest \[[10](https://arxiv.org/html/2605.14038#bib.bib11), [30](https://arxiv.org/html/2605.14038#bib.bib22)\]. To measure this self-awareness, early work primarily relies on explicit self-assessment, teaching models to express their knowledge boundaries \[[2](https://arxiv.org/html/2605.14038#bib.bib19), [31](https://arxiv.org/html/2605.14038#bib.bib18)\] or to directly verbalize confidence \[[15](https://arxiv.org/html/2605.14038#bib.bib20)\]. However, recent work has shown that the ability of models to verbalize their internal activations is limited \[[17](https://arxiv.org/html/2605.14038#bib.bib34), [9](https://arxiv.org/html/2605.14038#bib.bib33)\]. Moreover, self-assessment and actual problem solving are fundamentally different tasks. When explicitly prompted about its capability boundary, the model focuses on self-assessment. But when faced with actual problem solving, the prompt is usually task-oriented, and hence the self-assessing process becomes implicit and subconscious. This is akin to the distinction between System 1 and System 2 thinking \[[14](https://arxiv.org/html/2605.14038#bib.bib35)\]. Therefore, in this work, we follow recent work that uses internal state probing to measure models' cognition of tool necessity \[[13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)\], and also empirically show in Appendix [B](https://arxiv.org/html/2605.14038#A2) how model tool-call actions change when explicitly prompted for self-assessment.
Meanwhile, papers in other domains of LLM research that leverage hidden states to study a model's internal cognition have found that the model's action can diverge from its internal belief. For example, Zhao et al. \[[34](https://arxiv.org/html/2605.14038#bib.bib3)\] find that LLMs may fail to refuse harmful queries despite internally recognizing their harmfulness, and Zhang et al. \[[32](https://arxiv.org/html/2605.14038#bib.bib36)\] show that models can internally recognize their inability to solve certain math problems yet still expend tokens on unproductive reasoning. In this work, we show that this "knowing-doing gap" similarly exists in tool calling, and that it can constitute an even larger proportion of end-to-end errors.
## 3Defining model\-adaptive tool necessity and two\-stage modeling of tool\-call
To study tool\-use behavior in LLMs, we introduce a simple decomposition that separates recognizing the need for a tool from acting on that recognition\. This distinction will serve as the foundation for the evaluation, diagnosis, and analysis throughout the rest of this paper\.
#### Defining model\-adaptive tool necessity\.
Existing work typically assumes a fixed notion of tool necessity, assigning each query a static label independent of the model being evaluated. However, we argue that since different models have different capability boundaries, the tool necessity label should adapt to the model. To characterize a model's capability boundary, given a model $f$ and query $x$, we perform $N$ independent inference runs without access to external tools at temperature $T$. If the model $f$ solves the problem $x$ correctly in all $N$ runs, we assume that $x$ falls within $f$'s capability boundary and therefore the tool necessity, $n_f(x)$, is $0$. Otherwise, the model cannot reliably solve this query, and hence $n_f(x)$ is $1$. The parameters $N$ and $T$ control the strictness of this criterion. Specifically, larger values of $N$ and $T$ yield a more conservative and robust estimate of whether a query truly falls within the model's capability boundary, as they demand that the model output the correct answer more consistently.
This formulation captures a key aspect of real\-world deployment: reliability under uncertainty\. In practical settings, a model that only occasionally produces the correct answer without tools may still benefit from external assistance to ensure consistent performance\. By grounding tool necessity in empirical behavior rather than static annotation, our approach provides a more faithful characterization of when tool use is genuinely required for a given model\.
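The labeling rule above can be sketched in a few lines. This is a minimal illustration, assuming the per-run correctness flags have already been collected from $N$ no-tool samples at temperature $T$; the function name `tool_necessity` and its input format are hypothetical and are not taken from the authors' code.

```python
from typing import Sequence


def tool_necessity(run_correct: Sequence[bool]) -> int:
    """Return n_f(x) for one query.

    run_correct holds one flag per no-tool inference run (N flags in total),
    each True if that run produced the correct answer.
    Returns 0 (tool-unnecessary) only if every run was correct, else 1.
    """
    if len(run_correct) == 0:
        raise ValueError("need at least one inference run")
    return 0 if all(run_correct) else 1


# Hypothetical example with N = 10 runs.
always_right = [True] * 10
one_failure = [True] * 9 + [False]
print(tool_necessity(always_right))  # 0
print(tool_necessity(one_failure))   # 1
```

Because a single failure is enough to flip the label, increasing the number of runs can only move queries from tool-unnecessary to tool-necessary, which is why a larger $N$ gives a stricter criterion.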
#### The cognition\-execution modeling of tool\-call\.
We conceptualize tool use as a two\-stage process:
$$x \rightarrow z_f(x) \rightarrow a_f(z_f(x)), \tag{1}$$
where $z_f(x)$ represents the model's internal cognition of whether a tool is needed, and $a_f(z_f(x))$ denotes whether the model actually invokes a tool based on its cognition. This two-stage decomposition mirrors the human cognitive process and the behavior we desire from the model. It distinguishes between *meta-cognition* (the model's internal belief about its capability boundary) and *execution ability* (how the model acts based on its cognition).
#### End\-to\-end error diagnosis\.
Under our model-dependent definition of tool necessity $n_f(x)$ and the two-stage modeling in Equation [1](https://arxiv.org/html/2605.14038#S3.E1), we can decompose the end-to-end necessity-action mismatch, $D(n_f(x), a_f(z_f(x)))$, into the mismatch between actual necessity and cognition, $D(n_f(x), z_f(x))$, and the mismatch between the model's cognition and its actual decision, $D(z_f(x), a_f(z_f(x)))$, where $D(m, n)$ denotes the discrepancy between $m$ and $n$.
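To make the decomposition concrete, the sketch below attributes a single sample's mismatch to one of the two stages. It rests on assumptions not stated in the text: all three quantities are treated as binary labels, $D$ is taken to be a 0/1 disagreement indicator, and the cognition label $z$ is assumed to come from a thresholded probe output. The function name `attribute_mismatch` is hypothetical.

```python
def attribute_mismatch(n: int, z: int, a: int) -> str:
    """Locate the stage responsible for a necessity-action mismatch.

    n: model-adaptive necessity (0 or 1)
    z: decoded internal cognition (0 or 1), assumed to come from a probe
    a: observed tool-call action (0 or 1)
    """
    for value in (n, z, a):
        if value not in (0, 1):
            raise ValueError("labels must be binary")
    if n == a:
        # No end-to-end mismatch. If z differs from both, the two stage
        # errors cancel out, which this sketch still reports as aligned.
        return "aligned"
    # With binary labels, n != a means z agrees with exactly one of them,
    # so exactly one stage disagrees.
    return "cognition" if z != n else "execution"


print(attribute_mismatch(n=1, z=1, a=0))  # execution: knew, but did not call
print(attribute_mismatch(n=1, z=0, a=0))  # cognition: did not recognize need
```

Under these assumptions, every end-to-end mismatch falls into exactly one of the two stage-level buckets, which is what allows mismatched samples to be traced to either cognition or execution.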
## 4Dataset curation
We cover two representative domains, arithmetic and factual question answering, using two widely used model families: Qwen3 (Qwen3-8B and Qwen3-4B) \[[29](https://arxiv.org/html/2605.14038#bib.bib17)\] and Llama (Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct) \[[7](https://arxiv.org/html/2605.14038#bib.bib14)\]. These domains provide natural testbeds in which some queries can be reliably solved by the model alone, while others may require external assistance (i.e., a calculator for arithmetic tasks and a search API for factual queries). For the arithmetic dataset, we mix problem types that vary in both surface form and actual difficulty. It includes simple one- and two-step addition and subtraction problems, along with harder examples involving multi-digit multiplication, modulo, parentheses, operator precedence, and longer addition/subtraction chains, for a total of 4,000 instances. This gives us problems that range from very simple to extremely difficult, enabling us to measure the capability boundary of each model. More details about the curation of our arithmetic dataset can be found in Appendix [A](https://arxiv.org/html/2605.14038#A1). For factual question answering, we adopt TruthfulQA \[[16](https://arxiv.org/html/2605.14038#bib.bib24)\], a widely used dataset with 817 instances designed to evaluate the factual reliability of language models.
### 4\.1Grounding tool necessity to model\-specific capability boundaries
We follow our definition in Section [3](https://arxiv.org/html/2605.14038#S3) and run $N=10$ independent inferences at temperature $T=0.7$ without access to external tools. For a specific model, we count samples where the model fails at least once as *tool-necessary*, and samples where the model consistently gives correct answers across all $N=10$ runs as *tool-unnecessary*. Figure [2](https://arxiv.org/html/2605.14038#S4.F2) shows that different models have substantially different capability boundaries, which would be obscured by the model-agnostic definition of tool necessity. Specifically, the clean boundary in the first row is induced by our sorting procedure, while the red-green disagreements across rows show that the same sample groups can fall on different sides of different models' capability boundaries. This pattern appears in both arithmetic and factual question answering, suggesting that tool necessity depends not only on task type or dataset membership, but also on the particular model being deployed. This motivates using $n_f(x)$ rather than a single global necessity label when evaluating tool-use judgment and downstream call behavior.

Figure 2: Model-dependent tool-call necessity. Each vertical bar represents 0.5% of samples. Green indicates samples answered correctly in all $N=10$ no-tool runs; red indicates at least one failure. Within each dataset, samples share the same order across rows, obtained by recursively sorting within each previous model's correctness partition.
### 4\.2Collecting tool\-call behaviors on*tool\-necessary*and*tool\-unnecessary*instances
We run inference on the LLMs using both the tool-necessary and tool-unnecessary instances obtained in Section [4.1](https://arxiv.org/html/2605.14038#S4.SS1). In this setting, models are provided access to external tools: a calculator for arithmetic question answering and a search API for factual queries. To facilitate the diagnostic interpretation efforts in Section [5](https://arxiv.org/html/2605.14038#S5), greedy decoding is used when collecting tool-call actions. To better reflect real-world deployment, we follow existing practice \[[21](https://arxiv.org/html/2605.14038#bib.bib23), [3](https://arxiv.org/html/2605.14038#bib.bib9)\] and implement model-specific handlers that expose these tools in the syntax expected by each model. We then further divide *tool-necessary* and *tool-unnecessary* samples based on the model's actual tool-call behavior, and obtain four sets of data: *Necessary-Called* (N-C), *Necessary-NotCalled* (N-NC), *Unnecessary-Called* (UN-C), and *Unnecessary-NotCalled* (UN-NC). The first and last are aligned with the optimal behavior under our model-dependent definition of necessity, while the middle two correspond to the end-to-end necessity–action mismatch $D(n_f(x), a_f(z_f(x)))$ defined in Section [3](https://arxiv.org/html/2605.14038#S3).
Table 1: Breakdown of tool-call behavior across the four categories defined by the model-dependent necessity $n_f(x)$ and the observed action. Aligned cells (N-C: Necessary-Called, UN-NC: Unnecessary-NotCalled) are shaded green; misaligned cells (N-NC, UN-C) are shaded red and together form the end-to-end mismatch, summarized in the gray Mis. column.
#### End-to-end mismatch is substantial.
Table [1](https://arxiv.org/html/2605.14038#S4.T1) reports the distribution of the four categories across four models and two domains. The aggregated mismatch rate (gray Mis. column) ranges from 26.5% to 54.0% on arithmetic and from 30.8% to 41.8% on TruthfulQA, indicating that with a model-specific notion of tool necessity, between roughly one quarter and one half of all queries result in a tool-use action that is inconsistent with the model's actual capability. This mismatch rate between actual tool necessity and model tool-use action further highlights the importance of determining *when* to use tools, an issue that is often overlooked in prior work that only emphasizes *how* to use them.
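The four categories and the aggregated mismatch rate can be computed as in the following sketch. It assumes binary necessity and action labels per sample; the helper names `categorize` and `mismatch_rate` and the toy labels are illustrative only and do not reproduce any number from Table 1.

```python
from collections import Counter
from typing import Sequence


def categorize(necessity: int, called: int) -> str:
    """Map (n_f(x), observed action) to N-C, N-NC, UN-C, or UN-NC."""
    prefix = "N" if necessity == 1 else "UN"
    suffix = "C" if called == 1 else "NC"
    return f"{prefix}-{suffix}"


def mismatch_rate(necessity: Sequence[int], called: Sequence[int]) -> float:
    """Fraction of samples in the two misaligned categories."""
    if len(necessity) != len(called) or len(necessity) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    counts = Counter(categorize(n, c) for n, c in zip(necessity, called))
    # N-NC is tool underuse, UN-C is tool overuse.
    return (counts["N-NC"] + counts["UN-C"]) / len(necessity)


# Toy labels for four samples, one per category.
necessity = [1, 1, 0, 0]
called = [1, 0, 1, 0]
print([categorize(n, c) for n, c in zip(necessity, called)])
print(mismatch_rate(necessity, called))  # 0.5
```

Keeping N-NC and UN-C as separate counts before summing them preserves the distinction between underuse and overuse.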
#### The dominant failure mode is highly model\- and domain\-dependent\.
Beyond the overall mismatch rates, the specific types of errors vary significantly across both models and domains. On arithmetic, Qwen3-8B suffers from tool overuse (UN-C at 38.2% vs. N-NC at 3.5%). In contrast, Qwen3-4B and both Llama models exhibit clear tool underuse, with N-NC rates of 14.5% (Qwen3-4B), 30.1% (Llama-3.1-8B-Instruct), and 39.0% (Llama-3.2-3B-Instruct), exceeding their respective UN-C rates. Interestingly, these tendencies are not consistent even within a single model. On TruthfulQA, Qwen3-8B reverses its trend entirely, showing tool underuse (N-NC at 17.9% vs. UN-C at 13.2%), while Qwen3-4B now shows tool overuse (UN-C at 23.1% vs. N-NC at 18.7%). Because these models shift between being overly eager and overly conservative in tool calling depending on the context, no single, uniform bias can fully explain these mismatch errors. Therefore, in the next section, we leverage our two-stage modeling of LLM tool use defined in Section [3](https://arxiv.org/html/2605.14038#S3) for a more fine-grained diagnosis.
## 5From meta\-cognition to execution ability: What went wrong?
Having measured the model-dependent tool necessities for each model (i.e., their capability boundaries) and collected their actual tool-call behaviors, we now examine where the breakdown between actual necessity and final action occurs, following the two-stage decomposition in Section [3](https://arxiv.org/html/2605.14038#S3). We first show that each stage (the internal cognition of necessity and the executed action) is individually linearly decodable from the model's hidden states (Section [5.1](https://arxiv.org/html/2605.14038#S5.SS1) and Section [5.2](https://arxiv.org/html/2605.14038#S5.SS2)), and then characterize the geometric relationship between the two (Section [5.3](https://arxiv.org/html/2605.14038#S5.SS3)). Finally, through per-sample tracing, we find that the majority of the error originates in the execution stage (Section [5.4](https://arxiv.org/html/2605.14038#S5.SS4)).
### 5\.1Probing for model’s cognition
Linear probing is a standard method for studying how concepts are represented in a model's hidden-state space. Recent works \[[13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)\] have used it as a proxy for models' internal belief of tool necessity and reported that, despite substantial end-to-end mismatch, the hidden states of *tool-necessary* and *tool-unnecessary* samples are *almost linearly separable*. Because that conclusion was drawn under a static, query-only definition of tool necessity, it is unclear whether it survives the model-dependent definition introduced in Section [3](https://arxiv.org/html/2605.14038#S3), where the necessity label $n_f(x)$ varies across models with different capability boundaries.
Concretely, we train a linear classifier with weight $\mathbf{w}_c$ and bias $b_c$ on the model's hidden states, using a learning rate of $0.01$ with the Adam \[[11](https://arxiv.org/html/2605.14038#bib.bib16)\] optimizer, minimizing the following objective:
$$\mathcal{L} = -\frac{1}{K}\sum_{k=1}^{K}\left[ n_f(x_k)\log\sigma\left(\mathbf{w}_c^{\top} h_t^{(l)}(x_k) + b_c\right) + \left(1 - n_f(x_k)\right)\log\left(1 - \sigma\left(\mathbf{w}_c^{\top} h_t^{(l)}(x_k) + b_c\right)\right) \right], \tag{2}$$
where $x_k$ is a sample in the dataset and $h_t^{(l)}(x_k)$ is the hidden state at token position $t$ and layer $l$. $\mathbf{w}_c$ also serves as the normal vector of the separating hyperplane, indicating the direction from "unnecessary" to "necessary" in the model's representation space. We sweep $(t, l)$ over all layers and over the last $20$ query tokens; negative indices denote token positions relative to the start of generation, e.g., $t=-1$ is the final query token. As the class distribution is imbalanced (Table [1](https://arxiv.org/html/2605.14038#S4.T1)), we report the probe performance using the Matthews Correlation Coefficient (MCC) \[[18](https://arxiv.org/html/2605.14038#bib.bib12)\] on the held-out test set (30% of data), which is a more robust metric than accuracy or F1 under skewed labels:
$$\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{3}$$
Typically, an MCC value between $0.3$ and $0.5$ is considered moderate to good performance, and an MCC of $0.5$ or more is considered good to strong performance. Figure [3](https://arxiv.org/html/2605.14038#S5.F3) shows the MCC of probes trained at each $(t, l)$ position for all four models on Arithmetic and TruthfulQA.
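A dependency-free sketch of the probe and the metric follows. It fits a logistic probe on precomputed hidden-state vectors by minimizing the mean binary cross-entropy with Adam at a learning rate of 0.01, then scores predictions with MCC. Several details are assumptions rather than the authors' setup: full-batch updates, the standard Adam defaults ($\beta_1=0.9$, $\beta_2=0.999$, $\epsilon=10^{-8}$), the number of steps, the zero initialization, and the convention of returning an MCC of 0 when the denominator vanishes. The names `train_probe`, `predict`, and `mcc` are hypothetical, and a real pipeline would more likely use a tensor library.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(u: float) -> float:
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def train_probe(H: Sequence[Sequence[float]], y: Sequence[int],
                lr: float = 0.01, steps: int = 500) -> Tuple[List[float], float]:
    """Fit (w, b) on hidden states H and binary labels y with full-batch Adam."""
    d, k = len(H[0]), len(H)
    theta = [0.0] * (d + 1)                # d weights followed by the bias
    m, v = [0.0] * (d + 1), [0.0] * (d + 1)
    beta1, beta2, eps = 0.9, 0.999, 1e-8   # assumed Adam defaults
    for t in range(1, steps + 1):
        grad = [0.0] * (d + 1)
        for h, label in zip(H, y):
            logit = sum(theta[j] * h[j] for j in range(d)) + theta[d]
            err = (sigmoid(logit) - label) / k   # d(mean BCE) / d(logit)
            for j in range(d):
                grad[j] += err * h[j]
            grad[d] += err
        for j in range(d + 1):
            m[j] = beta1 * m[j] + (1 - beta1) * grad[j]
            v[j] = beta2 * v[j] + (1 - beta2) * grad[j] ** 2
            m_hat = m[j] / (1 - beta1 ** t)      # bias-corrected moments
            v_hat = v[j] / (1 - beta2 ** t)
            theta[j] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta[:d], theta[d]


def predict(w: Sequence[float], b: float, H: Sequence[Sequence[float]]) -> List[int]:
    """Predict 1 when the logit is non-negative, i.e. sigmoid(logit) >= 0.5."""
    return [1 if sum(wj * hj for wj, hj in zip(w, h)) + b >= 0 else 0 for h in H]


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews Correlation Coefficient; returns 0.0 if the denominator is 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

In use, one such probe would be trained per $(t, l)$ position on 70% of the samples and scored with `mcc` on the remaining 30%. The same routine applies to any binary label, so a probe for a different target only requires swapping `y`.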
Figure 3: Necessity probe performance across token-layer positions. Each cell reports the held-out MCC of a linear probe trained to predict the model-adaptive necessity from the hidden state at a given layer and token position. Darker blue indicates stronger linear separability. The linear separability is strongly task-dependent, and the heatmap structure appears similar within model families.
**Linear separability of necessity is strongly task-dependent.** Under our model-adaptive definition, the prior "almost linearly separable" picture partially holds. On Arithmetic, necessity is linearly separable for most models, with broad regions of mid-to-late layers crossing $\mathrm{MCC}=0.4$. This aligns with the findings of prior works \[[13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)\]. On TruthfulQA, however, the region where MCC exceeds $0.4$ is noticeably smaller, with only the near-last tokens in mid-to-late layers of the Llama models still displaying decent separability. This contrast highlights the challenge of distinguishing model-adaptive *tool-necessary* and *tool-unnecessary* samples, which is more nuanced than the obvious cases that prior work focuses on \[[8](https://arxiv.org/html/2605.14038#bib.bib27), [13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)\]. It also suggests that tool-necessity signals are easier to surface in tasks where problem difficulty is reflected in the input's surface structure, such as arithmetic, where complexity grows with the expression itself. In open-domain factual QA, however, surface form provides little cue about underlying difficulty, making tool necessity or epistemic uncertainty harder to linearly separate. The heatmap structure also appears similar within model families, with the two Qwen models and the two Llama models each sharing similar patterns.
**Decent internal signal coexists with large end-to-end mismatch.** The probe reaches decent MCC at many $(t, l)$ positions, indicating that information about the model's capability boundary is in fact present in the residual stream. Yet the same models still exhibit substantial end-to-end necessity–action mismatch (Table [1](https://arxiv.org/html/2605.14038#S4.T1)), meaning this internal signal is not effectively converted into the right tool-call decision at generation time. This mismatch between "what the hidden states know" and "what the model does" is a first hint of a knowing–doing gap, and motivates the next two questions: does the model encode its action in a similarly separable way (Section [5.2](https://arxiv.org/html/2605.14038#S5.SS2)), and how does the action representation relate to the cognition representation (Section [5.3](https://arxiv.org/html/2605.14038#S5.SS3))?
### 5\.2Probing for action
Having characterized how necessity is represented internally, we now ask the parallel question for the model's actual decision: how linearly separable is the executed action (whether the model invokes a tool or not) from the same hidden states? Concretely, we train a linear classifier $(\mathbf{w}_a, b_a)$ with the same objective as in Equation [2](https://arxiv.org/html/2605.14038#S5.E2), replacing $(\mathbf{w}_c, b_c)$ with $(\mathbf{w}_a, b_a)$ and the necessity label with the observed tool-call action. The probe performance on the held-out test set (30% of data), measured in MCC, is shown in Figure [4](https://arxiv.org/html/2605.14038#S5.F4).
Figure 4: Action probe performance (MCC) across token-layer positions. Each cell reports the held-out MCC of a linear probe trained to predict the tool-call action from the hidden state at a given layer and token position. Darker blue indicates stronger linear separability. The action signal appears highly linearly separable in the hidden states, particularly in near-end tokens and late layers.
**The action is highly separable from hidden states.** Figure [4](https://arxiv.org/html/2605.14038#S5.F4) shows that, on both the Arithmetic and TruthfulQA datasets, the action probe attains $\mathrm{MCC} \geq 0.4$ over broad regions for nearly every model. The signal spans most layers and token positions rather than being confined to a narrow band, indicating that whether the model is about to call a tool is strongly decodable from its residual stream, consistent with a recent finding \[[4](https://arxiv.org/html/2605.14038#bib.bib13)\].
### 5\.3The gap between cognition and execution
The two probes give us, at every $(t, l)$, a pair of direction vectors: $\mathbf{w}_c$ pointing from "unnecessary" to "necessary" in the model's representation space, and $\mathbf{w}_a$ pointing from "no-call" to "call." If tool-use behavior were a direct readout of the model's internal necessity assessment, the two directions should align, at least in the layers where both probes succeed. We test this by computing the cosine similarity $\mathrm{CosSim}(\mathbf{w}_c, \mathbf{w}_a)$ between $\mathbf{w}_c$ and $\mathbf{w}_a$ at each position: a value near $\pm 1$ means necessity and action are encoded along (anti-)parallel directions, while a value near $0$ means the two are represented in geometrically independent subspaces.
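The comparison reduces to a standard cosine similarity between two weight vectors. The sketch below is a plain-Python illustration; the function name `cosine_similarity` and the toy vectors are hypothetical stand-ins for the probe weights learned at one $(t, l)$ position.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two probe directions of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy directions standing in for w_c and w_a.
w_c = [1.0, 0.0]
print(cosine_similarity(w_c, [0.0, 1.0]))   # 0.0: orthogonal directions
print(cosine_similarity(w_c, [-1.0, 0.0]))  # -1.0: anti-parallel directions
```

Because the measure ignores vector norms, it compares only the directions of the two probes, not the scale of their weights.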
Figure 5: Cosine similarity between $\mathbf{w}_c$ and $\mathbf{w}_a$ at different positions. The similarity between the two probe directions is small across most of the area. Although some models show moderate similarity at late-token, middle-layer positions, the two directions fall back to a near-orthogonal relationship in the late layers of the last token (bottom-right corner).
Partial alignment between $\mathbf{w}_c$ and $\mathbf{w}_a$ exists in intermediate token-layer positions. Figure [5](https://arxiv.org/html/2605.14038#S5.F5) shows that the cosine similarity between $\mathbf{w}_c$ and $\mathbf{w}_a$ is fairly high in notable regions of several heatmaps, covering a particularly sizable area for the two Qwen models on Arithmetic. Necessity and action are therefore not encoded in entirely disjoint subspaces: in some intermediate token-layer positions, the directions share meaningful alignment.
Alignment collapses at the position that drives generation. The picture changes sharply at the position that actually determines the next token: the late layers of the final query token ($t=-1$, large $l$). For the two models with the strongest mid-stream alignment, Qwen3-8B and Qwen3-4B, the cosine similarity falls back to small values exactly in the bottom-right corner of the heatmap, so $\mathbf{w}_c$ and $\mathbf{w}_a$ become close to orthogonal precisely where they would need to interact to translate “I should call a tool” into the actual call token. The same trend toward low cosine similarity at late layers and the last token holds, more uniformly, for the other models and for TruthfulQA. Whatever partial coupling exists in earlier layers therefore does not survive to the readout.
The previous two subsections established that the model’s hidden states often contain a usable necessity signal yet still produce mismatched tool-call actions. Figure [5](https://arxiv.org/html/2605.14038#S5.F5) explains *why*: even when necessity and action share some structure in intermediate representations, the two directions become nearly orthogonal in the late-layer, last-token regime that ultimately drives the next-token decision.
### 5.4 Two-stage error diagnosis and attribution
So far we have established two facts: end-to-end necessity–action mismatch is substantial (Table [1](https://arxiv.org/html/2605.14038#S4.T1)), and the cognition and action directions are nearly orthogonal at the readout (Section [5.3](https://arxiv.org/html/2605.14038#S5.SS3)). However, the results in Section [5.3](https://arxiv.org/html/2605.14038#S5.SS3) tell us only that the two stages are *decoupled*, not which of them is responsible for the mismatch we observe. To attribute the error, we trace each sample along the Factual $\rightarrow$ Cognition $\rightarrow$ Action chain, taking Cognition to be the necessity probe $(\mathbf{w}_c, b_c)$ read out at the last query token and last layer, the same position that drives the next-token decision. Each sample then falls into one of four categories: correct in both stages (green), stage-one-only error (red), stage-two-only error (orange, the knowing–doing gap), or compensating errors that cancel at the action (purple). We show the full Sankey flow diagram in Figure [6](https://arxiv.org/html/2605.14038#S5.F6).
Figure 6: Per-sample two-stage decomposition of tool-call behavior on Arithmetic and TruthfulQA, for Qwen3-8B (top) and Llama-3.1-8B-Instruct (bottom). Each flow tracks a sample through three nodes: ground-truth necessity (*Factual*), the model’s internal cognition of necessity (*Cognition*), and the executed action (*Action*). The end-to-end error is dominated by the orange flow, where cognition is correct but the action flips away from it: the knowing–doing gap.
Stage two carries the majority of error. In all four panels of Figure [6](https://arxiv.org/html/2605.14038#S5.F6), the orange flow (samples with a stage-two error only) is by far the largest error category, while red (samples with a stage-one error only) is rather thin. Given that cognition and action are decoupled, the asymmetry orange $\gg$ red localizes the failure: the end-to-end mismatch is overwhelmingly produced in the cognition $\rightarrow$ action stage rather than in forming cognition itself. The bottleneck is therefore not knowing whether a tool is needed, but converting that knowledge into the call/no-call action. Deciding correctly *when* to use tools thus depends not only on having the correct tool-necessity cognition but, more importantly, on translating that cognition into a matching action.
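The four-way attribution used in this subsection can be written as a small decision rule over the (Factual, Cognition, Action) triple of each sample. The sketch below is illustrative only: the function name `attribute` and the toy samples are hypothetical, and each value is a boolean meaning “tool necessary” (Factual, Cognition) or “tool called” (Action).

```python
from collections import Counter

def attribute(factual: bool, cognition: bool, action: bool) -> str:
    """Assign one sample to a category of the two-stage decomposition."""
    stage_one_ok = cognition == factual  # Factual -> Cognition
    stage_two_ok = action == cognition   # Cognition -> Action
    if stage_one_ok and stage_two_ok:
        return "correct (green)"
    if stage_two_ok:
        return "stage-one-only error (red)"
    if stage_one_ok:
        return "stage-two-only error (orange)"
    # Both stages err; with binary labels the action matches the ground truth.
    return "compensating errors (purple)"

# Hypothetical (factual, cognition, action) triples, for illustration only.
samples = [
    (True, True, True),
    (True, True, False),
    (False, False, True),
    (True, False, False),
    (False, True, False),
]
counts = Counter(attribute(*s) for s in samples)
for category, count in counts.items():
    print(category, count)
```

Note that only the red and orange categories contribute to the end-to-end mismatch: in the purple category the two errors cancel, so the action agrees with the ground-truth necessity.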
Figure 7: Confidence of cognition versus tool-calling behavior. The x-axis denotes the probe output after the sigmoid function, and the y-axis represents the tool-call probability quantified by Equation [4](https://arxiv.org/html/2605.14038#S5.E4). The mismatch can persist even when the internal representation strongly indicates that a sample is either *tool-necessary* or *tool-unnecessary*.
#### The cognition–execution mismatch is not associated with cognition confidence.
Given the substantial gap between a model’s internal cognition and its final tool-call behavior, illustrated by the large orange band in Figure [6](https://arxiv.org/html/2605.14038#S5.F6), a natural question is whether this mismatch is caused by uncertainty in the meta-cognitive belief itself. In other words, does the mismatch primarily occur on samples where the model is uncertain about whether a tool is necessary? To investigate this question, we quantify the confidence of meta-cognition using the post-sigmoid output of the cognition probe, $\sigma(\mathbf{w}_c h + b_c)$. We then plot, for all samples, the relationship between the “confidence of tool necessity” and the “probability of making a tool call” in Figure [7](https://arxiv.org/html/2605.14038#S5.F7). The probability of making a tool call is defined as
$$\mathrm{P}(\text{call}) = \frac{\mathrm{p}(\langle\text{tool-token}\rangle)}{\mathrm{p}(\langle\text{tool-token}\rangle) + \mathrm{p}(\text{best non-tool token})}, \tag{4}$$
where $\mathrm{p}(\cdot)$ denotes the softmax probability assigned by the language model to a candidate next token, $\langle\text{tool-token}\rangle$ denotes the model-specific token that initiates a tool call (e.g., <tool\_call\> in Qwen models), and the “best non-tool token” refers to the highest-logit token among all tokens that are not tool-call tokens. This formulation normalizes the probability to the range $[0, 1]$. Since greedy decoding is used when collecting tool-call behaviors (Section [4.2](https://arxiv.org/html/2605.14038#S4.SS2)), a value of $\mathrm{P}(\text{call}) > 0.5$ corresponds to an actual tool call being generated. As shown in Figure [7](https://arxiv.org/html/2605.14038#S5.F7), the cognition–execution mismatch does not primarily occur near the uncertain region where $\sigma(\mathbf{w}_c h + b_c) \approx 0.5$. Instead, many orange points occur in regions where $\sigma(\mathbf{w}_c h + b_c)$ is close to $0$ or $1$. This observation suggests that the cognition–execution mismatch is not driven by low confidence in the model’s internal cognition. Rather, the mismatch can persist even when the internal representation strongly indicates that a sample is either *tool-necessary* or *tool-unnecessary*.
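As an illustration of Equation (4), the sketch below computes $\mathrm{P}(\text{call})$ from a vector of next-token logits, together with the post-sigmoid cognition confidence. It assumes a single tool-call token id and a toy three-token vocabulary; the names `p_call` and `cognition_confidence` are hypothetical.

```python
import numpy as np

def p_call(logits, tool_token_id):
    """Equation (4): tool-token probability renormalized against the best
    non-tool token. Assumes exactly one tool-call token id."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()  # softmax over the vocabulary
    p_tool = probs[tool_token_id]
    p_best_other = np.delete(probs, tool_token_id).max()
    return float(p_tool / (p_tool + p_best_other))

def cognition_confidence(h, w_c, b_c):
    """Post-sigmoid output of the cognition probe for one hidden state h."""
    return float(1.0 / (1.0 + np.exp(-(np.dot(w_c, h) + b_c))))

logits = [2.0, 1.0, 0.0]  # toy 3-token vocabulary
print(p_call(logits, tool_token_id=0))  # tool token has the highest logit
print(p_call(logits, tool_token_id=1))  # tool token is outscored by token 0
```

Because the softmax normalizer cancels in the ratio, $\mathrm{P}(\text{call})$ equals the sigmoid of the logit difference between the tool token and the best non-tool token, which is why the value exceeds 0.5 exactly when greedy decoding would emit the tool token.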
## 6 Conclusion
In this work, we introduced a model-adaptive definition of tool necessity that grounds evaluation in empirical capabilities, and revealed a substantial mismatch between when models actually need tools and when they invoke them. By decomposing the tool-use process into internal cognition and execution stages, and analyzing hidden-state representations, we identified a fundamental “knowing–doing gap” in LLMs. While models sometimes internally recognize tool necessity, the cognition direction becomes nearly orthogonal to the action direction in the late layers, leading to failures in taking the appropriate action. Our findings demonstrate that improving autonomous agents requires not just better internal meta-cognition, but also bridging the knowing–doing gap so that self-awareness translates into reliable execution.
## References
- \[1\]Anthropic\(2024\-11\-25\)Introducing the model context protocol\.External Links:[Link](https://www.anthropic.com/news/model-context-protocol)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[2\]L\. Chen, Z\. Liang, X\. Wang, J\. Liang, Y\. Xiao, F\. Wei, J\. Chen, Z\. Hao, B\. Han, and W\. Wang\(2024\)Teaching large language models to express knowledge boundary from their own signals\.External Links:2406\.10881,[Link](https://arxiv.org/abs/2406.10881)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[3\]Y\. Cheng, A\. S\. Moakhar, C\. Fan, P\. Hosseini, K\. Faghih, Z\. Sodagar, W\. Wang, and S\. Feizi\(2026\)Your llm agents are temporally blind: the misalignment between tool use decisions and human time perception\.External Links:2510\.23853,[Link](https://arxiv.org/abs/2510.23853)Cited by:[§4\.2](https://arxiv.org/html/2605.14038#S4.SS2.p1.1)\.
- \[4\]E\. Esakkiraja, S\. Rajeswar, D\. Akhiyarov, and R\. Venkatesaramani\(2026\)Therefore i am\. i think\.External Links:2604\.01202,[Link](https://arxiv.org/abs/2604.01202)Cited by:[§5\.2](https://arxiv.org/html/2605.14038#S5.SS2.p2.1)\.
- \[5\]K\. Faghih, W\. Wang, Y\. Cheng, S\. Bharti, G\. Sriramanan, S\. Balasubramanian, P\. Hosseini, and S\. Feizi\(2025\)Gaming tool preferences in agentic llms\.arXiv preprint arXiv:2505\.18135\.Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[6\]Google\(2025\)Agent2Agent \(a2a\) protocol\.Note:[https://google\.github\.io/A2A/](https://google.github.io/A2A/)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[7\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§4](https://arxiv.org/html/2605.14038#S4.p1.1)\.
- \[8\]Y\. Huang, J\. Shi, Y\. Li, C\. Fan, S\. Wu, Q\. Zhang, Y\. Liu, P\. Zhou, Y\. Wan, N\. Z\. Gong, and L\. Sun\(2024\)MetaTool benchmark for large language models: deciding whether to use tools and which to use\.External Links:2310\.03128,[Link](https://arxiv.org/abs/2310.03128)Cited by:[Appendix B](https://arxiv.org/html/2605.14038#A2.p4.1),[Appendix B](https://arxiv.org/html/2605.14038#A2.p6.1),[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p3.2)\.
- \[9\]L\. Ji\-An, H\. Xiong, R\. C\. Wilson, M\. G\. Mattar, and M\. K\. Benna\(2025\)Language models are capable of metacognitive monitoring and control of their internal activations\.External Links:2505\.13763,[Link](https://arxiv.org/abs/2505.13763)Cited by:[Appendix B](https://arxiv.org/html/2605.14038#A2.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[10\]S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, S\. Johnston, S\. El\-Showk, A\. Jones, N\. Elhage, T\. Hume, A\. Chen, Y\. Bai, S\. Bowman, S\. Fort, D\. Ganguli, D\. Hernandez, J\. Jacobson, J\. Kernion, S\. Kravec, L\. Lovitt, K\. Ndousse, C\. Olsson, S\. Ringer, D\. Amodei, T\. Brown, J\. Clark, N\. Joseph, B\. Mann, S\. McCandlish, C\. Olah, and J\. Kaplan\(2022\)Language models \(mostly\) know what they know\.External Links:2207\.05221,[Link](https://arxiv.org/abs/2207.05221)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[11\]D\. P\. Kingma and J\. Ba\(2017\)Adam: a method for stochastic optimization\.External Links:1412\.6980,[Link](https://arxiv.org/abs/1412.6980)Cited by:[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p2.3)\.
- \[12\]M\. Li, Y\. Zhao, B\. Yu, F\. Song, H\. Li, H\. Yu, Z\. Li, F\. Huang, and Y\. Li\(2023\)Api\-bank: a comprehensive benchmark for tool\-augmented llms\.arXiv preprint arXiv:2304\.08244\.Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]W\. Li, D\. Li, K\. Dong, C\. Zhang, H\. Zhang, W\. Liu, Y\. Wang, R\. Tang, and Y\. Liu\(2025\)Adaptive tool use in large language models with meta\-cognition trigger\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13346–13370\.Cited by:[Appendix B](https://arxiv.org/html/2605.14038#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.14038#A2.p4.1),[Appendix B](https://arxiv.org/html/2605.14038#A2.p6.1),[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§1](https://arxiv.org/html/2605.14038#S1.p3.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p3.2)\.
- \[14\]Z\. Li, D\. Zhang, M\. Zhang, J\. Zhang, Z\. Liu, Y\. Yao, H\. Xu, J\. Zheng, P\. Wang, X\. Chen, Y\. Zhang, F\. Yin, J\. Dong, Z\. Li, B\. Bi, L\. Mei, J\. Fang, X\. Liang, Z\. Guo, L\. Song, and C\. Liu\(2025\)From system 1 to system 2: a survey of reasoning large language models\.External Links:2502\.17419,[Link](https://arxiv.org/abs/2502.17419)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[15\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)Teaching models to express their uncertainty in words\.External Links:2205\.14334,[Link](https://arxiv.org/abs/2205.14334)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[16\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)TruthfulQA: measuring how models mimic human falsehoods\.External Links:2109\.07958,[Link](https://arxiv.org/abs/2109.07958)Cited by:[§4](https://arxiv.org/html/2605.14038#S4.p1.1)\.
- \[17\]J\. Lindsey, W\. Gurnee, E\. Ameisen, B\. Chen, A\. Pearce, N\. L\. Turner, C\. Citro, D\. Abrahams, S\. Carter, B\. Hosmer, J\. Marcus, M\. Sklar, A\. Templeton, T\. Bricken, C\. McDougall, H\. Cunningham, T\. Henighan, A\. Jermyn, A\. Jones, A\. Persic, Z\. Qi, T\. B\. Thompson, S\. Zimmerman, K\. Rivoire, T\. Conerly, C\. Olah, and J\. Batson\(2025\)On the biology of a large language model\.Transformer Circuits Thread\.External Links:[Link](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)Cited by:[Appendix B](https://arxiv.org/html/2605.14038#A2.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[18\]B\.W\. Matthews\(1975\)Comparison of the predicted and observed secondary structure of t4 phage lysozyme\.Biochimica et Biophysica Acta \(BBA\) \- Protein Structure405\(2\),pp\. 442–451\.External Links:ISSN 0005\-2795,[Document](https://dx.doi.org/https%3A//doi.org/10.1016/0005-2795%2875%2990109-9),[Link](https://www.sciencedirect.com/science/article/pii/0005279575901099)Cited by:[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p2.11)\.
- \[19\]G\. Mialon, R\. Dessì, M\. Lomeli, C\. Nalmpantis, R\. Pasunuru, R\. Raileanu, B\. Rozière, T\. Schick, J\. Dwivedi\-Yu, A\. Celikyilmaz,et al\.\(2023\)Augmented language models: a survey\.arXiv preprint arXiv:2302\.07842\.Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]A\. Parisi, Y\. Zhao, and N\. Fiedel\(2022\)Talm: tool augmented language models\.arXiv preprint arXiv:2205\.12255\.Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez\(2025\)The berkeley function calling leaderboard \(bfcl\): from tool use to agentic evaluation of large language models\.InForty\-second International Conference on Machine Learning,Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1),[§4\.2](https://arxiv.org/html/2605.14038#S4.SS2.p1.1)\.
- \[22\]C\. Qian, E\. C\. Acikgoz, H\. Wang, X\. Chen, A\. Sil, D\. Hakkani\-Tür, G\. Tur, and H\. Ji\(2025\)SMART: self\-aware agent for tool overuse mitigation\.External Links:2502\.11435,[Link](https://arxiv.org/abs/2502.11435)Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[23\]H\. Ross, A\. S\. Mahabaleshwarkar, and Y\. Suhara\(2025\)When2Call: when \(not\) to call tools\.arXiv preprint arXiv:2504\.18851\.Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[24\]T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom\(2023\)Toolformer: language models can teach themselves to use tools\.Advances in Neural Information Processing Systems36,pp\. 68539–68551\.Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[25\]J\. Shi, Z\. Yuan, G\. Tie, P\. Zhou, N\. Z\. Gong, and L\. Sun\(2025\)Prompt injection attack to tool selection in llm agents\.arXiv preprint arXiv:2504\.19793\.Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[26\]Y\. Song, W\. Xiong, D\. Zhu, W\. Wu, H\. Qian, M\. Song, H\. Huang, C\. Li, K\. Wang, R\. Yao,et al\.\(2023\)Restgpt: connecting large language models with real\-world restful apis\.arXiv preprint arXiv:2306\.06624\.Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[27\]H\. Wang, C\. Qian, M\. Li, J\. Qiu, B\. Xue, M\. Wang, H\. Ji, A\. Storkey, and K\. Wong\(2026\)Position: agent should invoke external tools only when epistemically necessary\.External Links:2506\.00886,[Link](https://arxiv.org/abs/2506.00886)Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p1.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[28\]Y\. Wang, R\. Zhou, R\. Fu, S\. Cao, H\. Zeng, J\. Lu, S\. Fan, J\. Zhao, and L\. Pan\(2026\)ASA: training\-free representation engineering for tool\-calling agents\.External Links:2602\.04935,[Link](https://arxiv.org/abs/2602.04935)Cited by:[Appendix B](https://arxiv.org/html/2605.14038#A2.p1.1),[Appendix B](https://arxiv.org/html/2605.14038#A2.p4.1),[§1](https://arxiv.org/html/2605.14038#S1.p3.1),[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.14038#S5.SS1.p3.2)\.
- \[29\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Link](https://arxiv.org/abs/2505.09388)Cited by:[§4](https://arxiv.org/html/2605.14038#S4.p1.1)\.
- \[30\]Z\. Yin, Q\. Sun, Q\. Guo, J\. Wu, X\. Qiu, and X\. Huang\(2023\-07\)Do large language models know what they don’t know?\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 8653–8665\.External Links:[Link](https://aclanthology.org/2023.findings-acl.551/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.551)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[31\]H\. Zhang, S\. Diao, Y\. Lin, Y\. R\. Fung, Q\. Lian, X\. Wang, Y\. Chen, H\. Ji, and T\. Zhang\(2024\)R\-tuning: instructing large language models to say ‘i don’t know’\.External Links:2311\.09677,[Link](https://arxiv.org/abs/2311.09677)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p1.1)\.
- \[32\]Q\. Zhang, Y\. Fu, Y\. Wang, L\. Yan, T\. Wei, K\. Xu, M\. Huang, and H\. Qiu\(2026\)Stop before you fail: operational capability boundaries for mitigating unproductive reasoning in large reasoning models\.External Links:2509\.24711,[Link](https://arxiv.org/abs/2509.24711)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p2.1)\.
- \[33\]Y\. Zhang, J\. Chen, J\. Wang, Y\. Liu, C\. Yang, C\. Shi, X\. Zhu, Z\. Lin, H\. Wan, Y\. Yang,et al\.\(2024\)Toolbehonest: a multi\-level hallucination diagnostic benchmark for tool\-augmented large language models\.arXiv preprint arXiv:2406\.20015\.Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]J\. Zhao, J\. Huang, Z\. Wu, D\. Bau, and W\. Shi\(2025\)LLMs encode harmfulness and refusal separately\.External Links:2507\.11878,[Link](https://arxiv.org/abs/2507.11878)Cited by:[§2](https://arxiv.org/html/2605.14038#S2.SS0.SSS0.Px2.p2.1)\.
- \[35\]A\. Zou, L\. Phan, S\. Chen, J\. Campbell, P\. Guo, R\. Ren, A\. Pan, X\. Yin, M\. Mazeika, A\. Dombrowski, S\. Goel, N\. Li, M\. J\. Byun, Z\. Wang, A\. Mallen, S\. Basart, S\. Koyejo, D\. Song, M\. Fredrikson, J\. Z\. Kolter, and D\. Hendrycks\(2025\)Representation engineering: a top\-down approach to ai transparency\.External Links:2310\.01405,[Link](https://arxiv.org/abs/2310.01405)Cited by:[§1](https://arxiv.org/html/2605.14038#S1.p3.1)\.
## Appendix A More details on arithmetic dataset curation
We generate math arithmetic problems grouped into three types\. The easy group contains one\-step addition/subtraction, short two\-step chains, and small modulo problems\. These give cases where a calculator is usually not needed\. The larger short group keeps the expressions simple, but uses larger operands, including multi\-digit subtraction, four\-digit addition/subtraction, and two\- or three\-digit multiplication\. These problems are still short enough to invite a direct answer, but they are more likely to cause digit errors\. The multi\-step group contains precedence chains, parenthesized expressions, multiplication chains, and long addition/subtraction chains\. These examples test whether the model can track intermediate values and apply the order of operations, especially in cases that look simple but are easy to miscompute\. We sample the dataset with a fixed random seed\. During generation, we skip repeated expressions and resample until each family reaches its assigned sampling share\. Table[2](https://arxiv.org/html/2605.14038#A1.T2)gives the problem families and their sampling shares\.
Table 2: Breakdown of the math arithmetic dataset. Shares show the fixed sampling share for each problem family. We use 4000 samples overall.

| Group | Problem Family | Share | Detail | Example |
|---|---|---|---|---|
| Easy | Single-step arithmetic | 8% | Two operands from 1–99 with an addition or subtraction operator. | 21 + 59 |
| Easy | Two-step arithmetic | 5% | Three operands from 1–99 with addition or subtraction operators. | 70 - 47 + 68 |
| Easy | Small modulo | 5% | Three-digit dividend with divisor from 3–19. | 562 % 8 |
| Larger short | Negative subtraction | 7% | Three- or four-digit operand minus a larger operand. | 390 - 554 |
| Larger short | Four-digit addition/subtraction | 6% | Two four-digit operands. | 4921 - 9108 |
| Larger short | Two-digit multiplication | 9% | Two two-digit operands. | 34 \* 75 |
| Larger short | Three-by-two multiplication | 9% | One three-digit factor and one two-digit factor. | 504 \* 61 |
| Larger short | Three-by-three multiplication | 7% | Two three-digit factors. | 867 \* 671 |
| Multi-step | Precedence chain | 12% | Five two- or three-digit operands with addition, subtraction, or multiplication operators. | 84 \* 82 - 755 - 805 - 29 |
| Multi-step | One-digit addition/subtraction chain | 11% | 16–39 one-digit terms with addition or subtraction operators. | 3 + 4 + 1 + 3 - 8 - … |
| Multi-step | Small addition/subtraction chain | 10% | 21–27 terms from 1–30 with addition or subtraction operators. | 9 + 5 - 1 - 3 + 23 + … |
| Multi-step | Parenthesized expression | 6% | Four two-digit operands in $(a+b)\times(c-d)$ form. | (67 + 68) \* (52 - 88) |
| Multi-step | Multiplication chain | 5% | Five two-digit operands in $a+b\times c-d\times e$ form. | 94 + 40 \* 50 - 24 \* 87 |
Algorithms [1](https://arxiv.org/html/2605.14038#alg1)–[13](https://arxiv.org/html/2605.14038#alg13) specify the exact procedures used for generating the data samples in each family. In these algorithms, $\mathcal{U}\{m,\ldots,n\}$ denotes the discrete uniform distribution over integers from $m$ to $n$, and $\mathcal{U}(0,1)$ denotes the continuous uniform distribution on the unit interval.
Algorithm 1 SingleStepArithmetic
1: $a, b \sim \mathcal{U}\{1,\ldots,99\}$
2: $op \sim \{+,-\}$
3: return “$a\ op\ b$”
Algorithm 2 TwoStepArithmetic
1: $a, b, c \sim \mathcal{U}\{1,\ldots,99\}$
2: $u \sim \mathcal{U}(0,1)$
3: if $u < 0.5$ then
4:   return “$a+b-c$”
5: else
6:   return “$a-b+c$”
7: end if
Algorithm 3 SmallModulo
1: $a \sim \mathcal{U}\{100,\ldots,999\}$
2: $b \sim \mathcal{U}\{3,\ldots,19\}$
3: return “$a\ \%\ b$”
Algorithm 4 NegativeSubtraction
1: $u \sim \mathcal{U}(0,1)$
2: if $u < 0.55$ then
3:   $a \sim \mathcal{U}\{100,\ldots,500\}$
4:   $b \sim \mathcal{U}\{a+10,\ldots,a+250\}$
5: else
6:   $a \sim \mathcal{U}\{1000,\ldots,5000\}$
7:   $b \sim \mathcal{U}\{a+100,\ldots,a+3000\}$
8: end if
9: return “$a-b$”
Algorithm 5 FourDigitAdditionSubtraction
1: $a, b \sim \mathcal{U}\{1000,\ldots,9999\}$
2: $u \sim \mathcal{U}(0,1)$
3: if $u < 0.6$ then
4:   return “$a+b$”
5: else
6:   return “$a-b$”
7: end if
Algorithm 6 TwoDigitMultiplication
1: $u \sim \mathcal{U}(0,1)$
2: if $u < 0.45$ then
3:   $a, b \sim \mathcal{U}\{15,\ldots,50\}$
4: else
5:   $a, b \sim \mathcal{U}\{30,\ldots,99\}$
6: end if
7: return “$a\times b$”
Algorithm 7 ThreeByTwoMultiplication
1: $a \sim \mathcal{U}\{100,\ldots,999\}$
2: $b \sim \mathcal{U}\{10,\ldots,99\}$
3: return “$a\times b$”
Algorithm 8 ThreeByThreeMultiplication
1: $a, b \sim \mathcal{U}\{100,\ldots,999\}$
2: return “$a\times b$”
Algorithm 9 PrecedenceChain
1: $a_1,\ldots,a_5 \sim \mathcal{U}\{10,\ldots,999\}$
2: $op_1,\ldots,op_4 \sim \{+,-,\times\}$
3: return “$a_1\ op_1\ a_2\ op_2\ a_3\ op_3\ a_4\ op_4\ a_5$”
Algorithm 10 OneDigitAdditionSubtractionChain
1: $u \sim \mathcal{U}(0,1)$
2: if $u < 0.4$ then
3:   $n \sim \mathcal{U}\{16,\ldots,22\}$
4: else
5:   $n \sim \mathcal{U}\{29,\ldots,39\}$
6: end if
7: $a_1,\ldots,a_n \sim \mathcal{U}\{1,\ldots,9\}$
8: For each $i$, sample $op_i$ with $\Pr(op_i = +) = 0.53$ and $\Pr(op_i = -) = 0.47$
9: return “$a_1\ op_1\ a_2\ op_2\ \cdots\ op_{n-1}\ a_n$”
Algorithm 11 SmallAdditionSubtractionChain
1: $n \sim \mathcal{U}\{21,\ldots,27\}$
2: $a_1,\ldots,a_n \sim \mathcal{U}\{1,\ldots,30\}$
3: $op_1,\ldots,op_{n-1} \sim \{+,-\}$
4: return “$a_1\ op_1\ a_2\ op_2\ \cdots\ op_{n-1}\ a_n$”
Algorithm 12 ParenthesizedExpression
1: $a, b, c, d \sim \mathcal{U}\{10,\ldots,99\}$
2: return “$(a+b)\times(c-d)$”
Algorithm 13 MultiplicationChain
1: $a, b, c, d, e \sim \mathcal{U}\{10,\ldots,99\}$
2: return “$a+b\times c-d\times e$”
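For readers who prefer code, the sketch below implements three of the families (Algorithms 1, 4, and 12) together with a sampling loop that skips repeated expressions until each family reaches its quota. This is a rough illustration under stated assumptions rather than the released generator: the seed value, the order of random draws, the rounding of quotas, and all function names are hypothetical, and the demo uses a reduced total of 200 with only three of the thirteen families.

```python
import random

def single_step(rng):
    """Algorithm 1: two operands from 1-99 joined by + or -."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    return f"{a} {rng.choice('+-')} {b}"

def negative_subtraction(rng):
    """Algorithm 4: a three- or four-digit operand minus a larger operand."""
    if rng.random() < 0.55:
        a = rng.randint(100, 500)
        b = rng.randint(a + 10, a + 250)
    else:
        a = rng.randint(1000, 5000)
        b = rng.randint(a + 100, a + 3000)
    return f"{a} - {b}"

def parenthesized(rng):
    """Algorithm 12: four two-digit operands in (a + b) * (c - d) form."""
    a, b, c, d = (rng.randint(10, 99) for _ in range(4))
    return f"({a} + {b}) * ({c} - {d})"

# (name, generator, share); shares follow Table 2 for these three families.
FAMILIES = [
    ("single_step", single_step, 0.08),
    ("negative_subtraction", negative_subtraction, 0.07),
    ("parenthesized", parenthesized, 0.06),
]

def sample_dataset(families, total, seed=0):
    """Draw round(total * share) unique expressions per family.

    The seed value and the rounding rule are assumptions for this sketch.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility
    seen, dataset = set(), []
    for name, generate, share in families:
        quota = round(total * share)
        produced = 0
        while produced < quota:
            expr = generate(rng)
            if expr in seen:  # skip repeated expressions and resample
                continue
            seen.add(expr)
            dataset.append((name, expr))
            produced += 1
    return dataset

dataset = sample_dataset(FAMILIES, total=200)
print(len(dataset), dataset[0], dataset[-1])
```

The resampling loop terminates only if a family can produce at least as many distinct expressions as its quota, which holds comfortably for the operand ranges above.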
## Appendix B Explicitly prompting for verbalized belief of tool-necessity
Because LLMs are limited in verbalizing their internal decision processes [[17](https://arxiv.org/html/2605.14038#bib.bib34), [9](https://arxiv.org/html/2605.14038#bib.bib33)], and because self-assessment is a fundamentally different task from actual problem solving, in this paper we follow recent work that uses internal-state probing to measure models’ cognition of tool necessity [[13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)]. Nevertheless, for completeness, we also report results obtained using explicit self-assessment prompts.
Specifically, we adopt a two\-stage inference procedure\. In the first stage, the model is given the same questions from the Arithmetic and TruthfulQA datasets, but instead of solving the problem directly, it is prompted to first decide “whether it is necessary to invoke an external tool” and to answer only with ‘yes’ or ‘no’\. In the second stage, the model is instructed to “Now answer the original user request\.”
Table[3](https://arxiv.org/html/2605.14038#A2.T3)reports: \(1\) the MCC between the model’s ‘yes’/‘no’ responses and the actual capability\-grounded tool necessity defined in Section[4\.1](https://arxiv.org/html/2605.14038#S4.SS1); \(2\) the cognition–execution mismatch rate, defined as the proportion of samples where the model answered ‘yes’ but did not invoke a tool, or answered ‘no’ but eventually invoked one; and \(3\) the proportion of samples whose eventual tool\-calling behavior changed relative to the direct task\-oriented prompting setup used in Section[4\.2](https://arxiv.org/html/2605.14038#S4.SS2)\.
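The three quantities described above can be computed from per-sample boolean records, as in the following sketch. It is an illustration on made-up toy data, not the evaluation script: the function names are hypothetical, and returning `None` for an undefined MCC is a design choice of this sketch.

```python
import math

def mcc(y_true, y_pred):
    """MCC for boolean sequences; None when the denominator is zero,
    e.g. when every prediction is the same label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t and p for t, p in pairs)
    tn = sum((not t) and (not p) for t, p in pairs)
    fp = sum((not t) and p for t, p in pairs)
    fn = sum(t and (not p) for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return None if denom == 0 else (tp * tn - fp * fn) / denom

def disagreement_rate(xs, ys):
    """Fraction of positions where two boolean sequences differ."""
    pairs = list(zip(xs, ys))
    return sum(x != y for x, y in pairs) / len(pairs)

# Toy records for four samples (hypothetical values).
necessary     = [True, False, True, False]    # capability-grounded necessity
said_yes      = [False, False, False, False]  # stage-one verbal answer
called_after  = [False, False, False, False]  # tool call after self-assessment
called_direct = [True, False, False, True]    # tool call under direct prompting

print("MCC:", mcc(necessary, said_yes))
print("mismatch rate:", disagreement_rate(said_yes, called_after))
print("changed rate:", disagreement_rate(called_direct, called_after))
```

The toy data mirror the degenerate case in which a model answers ‘no’ to every sample: the MCC is undefined because no positive predictions exist, while the mismatch rate is zero.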
The results show that explicit ‘yes’/‘no’ judgments, as measured by MCC, are substantially worse at capturing the actual capability-grounded notion of tool necessity. In particular, Llama-3.2-3B-Instruct achieves a negative MCC on TruthfulQA, while Llama-3.1-8B-Instruct simply answers ‘no’ for every TruthfulQA sample, resulting in an undefined MCC. This behavior implies that the model judges no sample to require a tool, which is clearly inconsistent with the capability measurements reported in Section [4.1](https://arxiv.org/html/2605.14038#S4.SS1). These poor MCC results further show the challenge of distinguishing model-adaptive tool-necessary from tool-unnecessary samples, which is more nuanced than the obvious cases that prior work focuses on [[8](https://arxiv.org/html/2605.14038#bib.bib27), [13](https://arxiv.org/html/2605.14038#bib.bib31), [28](https://arxiv.org/html/2605.14038#bib.bib32)].
In contrast, the cognition–execution mismatch rates are noticeably lower than those reported in Section[5\.4](https://arxiv.org/html/2605.14038#S5.SS4)\. Llama\-3\.1\-8B\-Instruct even achieves a 0 mismatch rate, meaning that it not only answered ‘no’ for all samples, but also consistently refrained from making any tool calls\. This outcome is somewhat expected: once the ‘yes’/‘no’ response becomes part of the model’s context, the model is more likely to remain consistent with that earlier commitment during subsequent generation\.
Most importantly, however, Table [3](https://arxiv.org/html/2605.14038#A2.T3) shows a large “Changed” rate in the third column of each dataset. Relative to direct problem solving with the task-oriented prompts used in our main experiments, explicit self-assessment changes tool-calling behavior on up to nearly 50% of samples. In practical deployments, prompts are typically task-oriented and designed to maximize task performance, rather than to elicit explicit self-assessment. Therefore, this substantial shift in behavior suggests that evaluations based on explicit prompts such as “decide whether it is necessary to invoke an external tool and answer ‘yes’ or ‘no’”, as used in some prior work [[8](https://arxiv.org/html/2605.14038#bib.bib27), [13](https://arxiv.org/html/2605.14038#bib.bib31)], may produce results that diverge significantly from models’ actual tool-use behavior under realistic task settings.
Table 3: Tool\-call evaluation summary across datasets\. For each model and dataset, we report the Matthews Correlation Coefficient \(MCC\), the mismatch rate $(\text{yes, false}) + (\text{no, true})$, and the proportion of changed tool\-call behavior across variants\.
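The other two quantities in Table 3 can be sketched as follows. This assumes one plausible reading of the caption: the mismatch rate counts samples where the explicit ‘yes’/‘no’ answer disagrees with the subsequent tool\-call action, and the “Changed” rate counts samples whose tool\-call action differs between the task\-oriented prompt and the self\-assessment prompt. The function names and toy values are hypothetical.

```python
def mismatch_rate(said_yes, called_tool):
    """Fraction of samples in the (yes, false) or (no, true) cells:
    the stated judgment disagrees with the actual tool-call action."""
    pairs = list(zip(said_yes, called_tool))
    yes_false = sum(1 for s, c in pairs if s and not c)
    no_true = sum(1 for s, c in pairs if not s and c)
    return (yes_false + no_true) / len(pairs)


def changed_rate(calls_task_prompt, calls_self_assess):
    """Fraction of samples whose tool-call action differs between
    the task-oriented prompt and the self-assessment prompt."""
    pairs = list(zip(calls_task_prompt, calls_self_assess))
    return sum(1 for a, b in pairs if a != b) / len(pairs)


# Hypothetical toy run: the model answers 'no' everywhere and then never
# calls a tool, yet it did call tools on half the samples when prompted
# with the task-oriented prompt.
said_yes = [False, False, False, False]
calls_self_assess = [False, False, False, False]
calls_task_prompt = [True, False, True, False]

print(mismatch_rate(said_yes, calls_self_assess))           # 0.0
print(changed_rate(calls_task_prompt, calls_self_assess))   # 0.5
```

The toy run illustrates why a low mismatch rate under self\-assessment is not reassuring on its own: the stated judgment and the action can agree perfectly while both diverge from the behavior observed under the task\-oriented prompt.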
## Appendix C Limitations
In this paper, we instantiated the model\-adaptive definition of tool necessity using $N=10$ and $T=0.7$ \(as defined in Section[4\.1](https://arxiv.org/html/2605.14038#S4.SS1)\)\. It would be informative to explore other instantiations of this definition with different $N$ and $T$ values to see how the necessity\-action mismatch rate changes under different settings\. Moreover, because an integral part of this work relies on probing model hidden states, our analysis is inapplicable to closed\-source state\-of\-the\-art LLMs such as GPT or Gemini\.