When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning


Summary

This paper introduces Adaptive Tool Trust Calibration (ATTC), a framework that improves tool-integrated reasoning models by enabling them to adaptively decide when to trust or ignore tool results based on code confidence scores. The approach addresses the "Tool Ignored" problem where models incorrectly dismiss correct tool outputs, achieving 4.1-7.5% performance improvements across multiple models and datasets.

arXiv:2604.08281v2 Announce Type: replace Abstract: Large reasoning models (LRMs) have achieved strong performance enhancement through scaling test time computation, but due to the inherent limitations of the underlying language models, they still have shortcomings in tasks that require precise computation and extensive knowledge reserves. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool call and execution within the reasoning trajectory. Although recent works have released some powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when the reasoning of the model conflicts with the tool results, the model tends to believe in its own reasoning. And there are cases where the tool results are correct but are ignored by the model, resulting in incorrect answers, which we define as "Tool Ignored". This indicates that the model does not know when to trust or ignore the tool. To overcome these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides the model to adaptively choose to trust or ignore the tool results based on the confidence score of generated code blocks. The experimental results from various open-source TIR models of different sizes and across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, resulting in a performance increase of 4.1% to 7.5%.

# When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning

Source: https://arxiv.org/html/2604.08281

Ruotao Xu1, Yixin Ji1, Yu Luo2, Jinpeng Li2, Dong Li2, Peifeng Li1, Juntao Li1, Min Zhang3,1

1School of Computer Science and Technology, Soochow University
2Department of Foundation Model, 2012 Labs, Huawei
3Harbin Institute of Technology, Shenzhen (HITSZ)

{xuruotao007, jiyixin169}@gmail.com, {ljt,minzhang}@suda.edu.cn

###### Abstract

Large reasoning models (LRMs) have achieved strong performance gains through scaling test-time computation, but due to inherent limitations of underlying language models, they still fall short in tasks requiring precise computation and extensive knowledge. Tool-Integrated Reasoning (TIR) has emerged as a promising paradigm that incorporates tool calling and execution within the reasoning trajectory. Although recent works have released powerful open-source TIR models, our analysis reveals that these models still suffer from critical deficiencies. We find that when model reasoning conflicts with tool results, the model tends to trust its own reasoning. There are cases where tool results are correct but ignored by the model, resulting in incorrect answers, which we define as "Tool Ignored". This indicates that the model does not know when to trust or ignore tools. To address these limitations, we introduce Adaptive Tool Trust Calibration (ATTC), a novel framework that guides models to adaptively choose to trust or ignore tool results based on the confidence scores of generated code blocks. Experimental results from various open-source TIR models of different sizes across multiple datasets demonstrate that ATTC effectively reduces the "Tool Ignored" issue, achieving performance improvements of 4.1% to 7.5%. Our code is available at https://github.com/00Dreamer00/ATTC.

## 1 Introduction

![Figure 1: Case of the "Tool Ignored" phenomenon. The model arrives at an incorrect answer of 18 through reasoning, while the tool provides the correct result of 17. The model ignores the tool and gives an incorrect answer.]

The rapid evolution of Large Reasoning Models (LRMs) (OpenAI et al., 2024; DeepSeek-AI et al., 2025a; Yang et al., 2025; Team et al., 2025; Comanici et al., 2025) represents a transformative milestone in the history of Large Language Models (LLMs). By scaling test-time computation (Ji et al., 2025), these models have achieved substantial performance gains in handling challenging reasoning tasks. Unlike conventional LLMs that typically generate responses directly, LRMs engage in long Chain-of-Thought (CoT) reasoning before generating an answer. This shift toward systematic deliberation allows the model to refine its logic and arrive at more robust final outputs. Nevertheless, LRMs are still constrained by inherent limitations of underlying language models (Zhao et al., 2025; Yuee et al., 2025), most notably in areas requiring precise numerical computation and comprehensive knowledge coverage.

To mitigate these limitations, Tool-Integrated Reasoning (TIR) (Gou et al., 2024; Wang et al., 2023; Liao et al., 2024) has emerged as a promising paradigm that incorporates tool calling and execution within the reasoning trajectory. By drawing on external tools such as code executors and search engines, TIR lets models move past the performance bottlenecks of pure reasoning. Early TIR works rely primarily on prompt engineering (Wang et al., 2025; Yuan et al., 2024; Qian et al., 2024; Chen et al., 2023; Yang et al., 2024b) to guide LLMs in tool calling. However, these approaches depend heavily on meticulously crafted prompts, which limits their scalability and generalizability. Later works use supervised fine-tuning (SFT) (Chen et al., 2025; Qian et al., 2025; Yang et al., 2024a; Yao et al., 2023; Wang et al., 2023) on specialized datasets enriched with tool-calling demonstrations, so that models internalize the habit of calling tools during reasoning. Nevertheless, SFT-based methods constrain models to the tool-usage patterns present in the training distribution, and such models often fail to develop adaptive strategies.

To address these issues, several recent works apply reinforcement learning (Feng et al., 2025; Li et al., 2025; Jiang et al., 2025; Bai et al., 2025; Xuee et al., 2025) to improve models' tool-use abilities. These works enable models to develop more flexible strategies for calling tools based on task complexity. In this study, we focus on TIR scenarios in which code executors serve as the tools. Although existing works have enabled models to perform TIR, our analysis reveals that existing open-source TIR models still suffer from critical deficiencies. Most notably, these models have persistent difficulty balancing external tool results against their own reasoning.

By analyzing reasoning trajectories of TIR models, we observe widespread contradictions between model reasoning and results provided by external tools in error cases. When such conflicts arise, models often lack a robust mechanism to reconcile divergent information, frequently choosing to ignore tool output. As shown in Figure 1, the model fails to arrive at the correct answer precisely because it ignores a valid tool result; we define this phenomenon as "Tool Ignored". This behavior indicates that current TIR models struggle to accurately discern when to trust or dismiss tool results, leading to redundant reasoning paths and erroneous conclusions.

To overcome these limitations, we propose Adaptive Tool Trust Calibration (ATTC), a novel framework that guides models to adaptively choose to trust or ignore tool results based on confidence scores of generated code blocks. When a model calls a tool, ATTC scores the code blocks generated by the model using a specific confidence scoring formula. If the confidence score exceeds an empirically determined threshold, ATTC guides the model to trust the tool results; otherwise, ATTC guides the model to reconsider.

We conduct extensive experiments across multiple open-source TIR models. The results show that ATTC effectively reduces the "Tool Ignored" issue, achieving performance improvements of 4.1% to 7.5%.

Overall, our contributions are as follows:

- We identify that TIR models do not know when to trust tool results and define the "Tool Ignored" issue.
- We propose a novel framework ATTC for tool-integrated reasoning that guides models to adaptively choose to trust or ignore tool results based on confidence scores of generated code blocks.
- Extensive experiments on open-source TIR models of various sizes demonstrate that ATTC improves model performance.

## 2 Phenomenon Analysis

### 2.1 Contradictions Between Reasoning and Tools

After careful examination of many Tool-Integrated Reasoning trajectories, we observe that conclusions produced by model reasoning are not always consistent with outputs returned by external tools; conflicts between the two arise frequently. To determine the exact proportion of such conflicting instances and characterize their resolution, we conduct an LLM-based audit of over 32k cases followed by detailed quantitative evaluation. Specifically, we measure the prevalence of conflicts separately in true and false cases. In cases of conflict, we further analyze whether models tend to rely on their own reasoning or defer to tool outputs. The prompts used to guide this analysis are provided in Appendix B.

Figure 2 shows that between 40% and 60% of false cases display contradictions between model reasoning and external tool output. In over half of the conflicting cases, the model sides with its own reasoning rather than the result produced by the tool. This pattern reveals a limitation in the model's metacognition: it cannot reliably determine when to trust tool output over its own reasoning.

![Figure 2: The proportion of contradictions between model reasoning and tool results in true and false cases. The color proportions in each column indicate the proportion where the model chooses to believe its own reasoning or tool results.]

### 2.2 Tool Ignored

By examining a large set of false cases exhibiting reasoning-tool contradictions, we identify a counterintuitive failure mode: when a conflict arises, the tool output is correct, yet the model ignores it and adheres to its own reasoning, producing an incorrect answer. We name this failure mode "Tool Ignored". This phenomenon indicates a systematic preference for self-generated reasoning over externally provided evidence, which prevents the model from fully leveraging tool augmentation and ultimately degrades task accuracy. A specific case is shown in Figure 1.

To assess the prevalence of this failure mode, we conduct systematic analysis of false cases across four distinct models and four challenging datasets. The results summarized in Figure 3 show that in every model–dataset combination, "Tool Ignored" accounts for at least 15% of errors. This phenomenon undermines both accuracy and computational efficiency. When verifiably correct tool output is ignored, the model often generates redundant reasoning steps or tool calls, yielding an incorrect prediction while incurring unnecessary computational cost.

It is also important to avoid blindly accepting results generated by tools. Therefore, the key challenge is to endow the model with metacognitive calibration to decide when to trust an external tool and when to rely on its reasoning.
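The "Tool Ignored" criterion described in this section can be made concrete with a small sketch. It is a simplification: the analysis above relies on an LLM-based audit of trajectories, whereas the code below assumes each case has already been reduced to four answer strings and compares them by exact match. The `Case` fields and both function names are hypothetical, and the sample cases are illustrative values, not data from the paper (the first mirrors the situation in Figure 1).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Case:
    """One audited trajectory, reduced to answer strings (hypothetical schema)."""
    reasoning_answer: str  # conclusion reached by the model's own reasoning
    tool_answer: str       # result returned by the code executor
    final_answer: str      # answer the model finally commits to
    gold_answer: str       # reference answer


def is_tool_ignored(case: Case) -> bool:
    """Conflict exists, the tool is right, and the final answer is still wrong."""
    conflict = case.reasoning_answer != case.tool_answer
    tool_correct = case.tool_answer == case.gold_answer
    final_wrong = case.final_answer != case.gold_answer
    return conflict and tool_correct and final_wrong


def tool_ignored_rate(cases: Iterable[Case]) -> float:
    """Share of error cases that fall under "Tool Ignored"."""
    errors = [c for c in cases if c.final_answer != c.gold_answer]
    if not errors:
        return 0.0
    return sum(is_tool_ignored(c) for c in errors) / len(errors)


# Illustrative cases only.
cases = [
    Case("18", "17", "18", "17"),  # tool correct but ignored
    Case("5", "5", "5", "5"),      # correct, no conflict
    Case("3", "4", "3", "9"),      # conflict, but the tool is also wrong
    Case("7", "7", "7", "8"),      # wrong without any conflict
]
rate = tool_ignored_rate(cases)
```

Here three of the four cases are errors and one of those three meets the criterion, so `rate` is one third. Exact string matching is a deliberate shortcut; a real audit would need answer normalization or a judge model.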

![Figure 3: Proportion of "Tool Ignored" phenomenon for different TIR models across four datasets.]

## 3 Methodology

![Figure 4: An overview of the Adaptive Tool Trust Calibration (ATTC) method.]

| Model | MATH 500 | Minerva Math | Olympiad | AIME24 | AMC23 | Avg |
|-------|----------|--------------|---------|--------|-------|-----|
| **Models based on Qwen2.5-7B** | | | | | | |
| ToRL-7B | 82.2 | 33.5 | 49.9 | 43.3 | 65.0 | 54.8 |
| +ATTC | 84.8 | 43.8 | 52.4 | 46.7 | 72.5 | 60.0 (+5.2) |
| Effective TIR-7B | 82.8 | 30.5 | 51.9 | 42.3 | 70.0 | 55.5 |
| +ATTC | 85.8 | 42.3 | 53.5 | 46.7 | 77.5 | 61.2 (+5.7) |
| VerlTool-7B | 82.0 | 31.6 | 49.8 | 40.0 | 67.5 | 54.2 |
| +ATTC | 83.4 | 44.1 | 50.5 | 43.3 | 70.0 | 58.3 (+4.1) |
| SimpleTIR-7B | 82.1 | 30.1 | 47.4 | 46.7 | 75.0 | 56.3 |
| +ATTC | 83.2 | 46.7 | 49.2 | 50.0 | 77.5 | 61.3 (+5.0) |
| **Models based on Qwen2.5-32B** | | | | | | |
| ReTool-32B | 84.6 | 30.5 | 60.1 | 53.3 | 80.0 | 61.7 |
| +ATTC | 87.4 | 36.8 | 62.5 | 66.7 | 92.5 | 69.2 (+7.5) |
| SimpleTIR-32B | 85.2 | 33.8 | 53.8 | 50.0 | 80.0 | 60.6 |
| +ATTC | 88.2 | 36.8 | 56.9 | 56.7 | 85.0 | 64.7 (+4.1) |
| **Models based on Qwen3-4B** | | | | | | |
| ReTool-4B | 57.0 | 16.2 | 27.7 | 16.7 | 42.5 | 32.0 |
| +ATTC | 61.8 | 23.9 | 32.0 | 16.7 | 52.5 | 37.4 (+5.4) |
| DemyAgent-4B | 71.4 | 17.3 | 51.9 | 40.0 | 75.0 | 51.1 |
| +ATTC | 79.4 | 22.4 | 53.5 | 43.3 | 77.5 | 55.2 (+4.1) |

Table 1: Pass@1 performance of the proposed ATTC method applied to various tool-integrated reasoning models on five math benchmarks. Avg is the mean over the five benchmarks; values in parentheses give the gain in Avg over the corresponding base model.
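The averages and gains in Table 1 can be checked with a few lines of arithmetic. The sketch below uses the ToRL-7B rows and assumes that Avg is the unweighted mean of the five benchmark scores rounded to one decimal, and that the gain is the difference of the two rounded averages; both assumptions are consistent with the table but are not stated explicitly in the text.

```python
def average(scores):
    """Unweighted mean of benchmark scores (assumed definition of Avg)."""
    return sum(scores) / len(scores)


# ToRL-7B rows from Table 1:
# MATH 500, Minerva Math, Olympiad, AIME24, AMC23
base_scores = [82.2, 33.5, 49.9, 43.3, 65.0]
attc_scores = [84.8, 43.8, 52.4, 46.7, 72.5]

base_avg = round(average(base_scores), 1)  # reported as 54.8
attc_avg = round(average(attc_scores), 1)  # reported as 60.0
gain = round(attc_avg - base_avg, 1)       # reported as +5.2

print(f"base={base_avg}, with ATTC={attc_avg}, gain=+{gain}")
```

The same check reproduces the Avg and gain values of the other model pairs in the table.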

### 3.1 Preliminaries

In this work, we consider a Tool-Integrated Reasoning (TIR) setting where a language model interacts with an external Python execution environment during test-time reasoning. Under this paradigm, the model autonomously decides when to generate code during its reasoning. Crucially, the model is trained to use this generated code to assist, augment, and validate its reasoning.

Formally, the TIR model maintains a reasoning trajectory $\mathcal{T}^{(t)}$ at iteration $t$ as:

$$\mathcal{T}^{(t)} = \bigl\{(r^{(1)}, c^{(1)}, o^{(1)}), \dots, (r^{(t)}, c^{(t)}, o^{(t)})\bigr\}$$

where $r^{(t)}$ denotes the natural language reasoning, $c^{(t)}$ represents the generated executable code, and $o^{(t)}$ is the execution result returned by the external environment. The iterative generation process follows:

$$(r^{(t)}, c^{(t)}) \sim M_{\mathrm{tir}}\left(Q, \mathcal{T}^{(t-1)}\right)$$

$$o^{(t)} = \mathcal{E}\left(c^{(t)}\right)$$

$$\mathcal{T}^{(t)} = \mathcal{T}^{(t-1)} \cup \bigl\{(r^{(t)}, c^{(t)}, o^{(t)})\bigr\}$$

Given the input prompt $Q$ and accumulated trajectory $\mathcal{T}^{(t-1)}$, the TIR model $M_{\mathrm{tir}}$ continues to generate. The generated code is then executed by an external code execution environment $\mathcal{E}$ to obtain the corresponding output. This iterative process continues until a termination condition is satisfied, at which point the model produces the final answer.
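The iterative process above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_tir`, `execute`, `Step`, and `scripted_model` are hypothetical names, the model is a scripted stub standing in for $M_{\mathrm{tir}}$, and termination is assumed to happen when the model returns no code. The toy executor runs code in-process, which a real environment $\mathcal{E}$ should never do without sandboxing.

```python
import contextlib
import io
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    reasoning: str         # r^(t): natural language reasoning
    code: Optional[str]    # c^(t): generated code, None on the final step
    output: Optional[str]  # o^(t): execution result, None on the final step


Model = Callable[[str, List[Step]], Tuple[str, Optional[str]]]


def execute(code: str) -> str:
    """Toy stand-in for the environment E: run code and capture stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # illustration only; real executors must be sandboxed
    return buffer.getvalue().strip()


def run_tir(question: str, model: Model, max_steps: int = 8) -> List[Step]:
    """Alternate generation and execution until the model stops calling the tool."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        reasoning, code = model(question, trajectory)  # (r, c) ~ M_tir(Q, T)
        if code is None:  # assumed termination condition: no further tool call
            trajectory.append(Step(reasoning, None, None))
            break
        output = execute(code)                         # o = E(c)
        trajectory.append(Step(reasoning, code, output))
    return trajectory


def scripted_model(question: str, trajectory: List[Step]):
    """Hypothetical stub: one tool call, then a final answer."""
    if not trajectory:
        return "Compute the sum with code.", "print(sum(range(1, 11)))"
    return f"The tool returned {trajectory[-1].output}; answer accordingly.", None


steps = run_tir("What is 1 + 2 + ... + 10?", scripted_model)
```

With the scripted stub, `steps` holds one tool-calling step whose output is `"55"` followed by a final step with no code.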

To address the "Tool Ignored" phenomenon discussed in Section 2, the next section introduces the Adaptive Tool Trust Calibration (ATTC) method, which guides the TIR model on whether to trust or ignore the tool.

### 3.2 Adaptive Tool Trust Calibration

The main idea behind ATTC is that the model's confidence in its generated code block indicates how sufficient its prerequisite reasoning is before making a tool call. We observe that code blocks resulting from incomplete or flawed reasoning processes tend to exhibit markedly lower confidence levels. Conversely, when the preceding thought process is comprehensive and logically sound, the model generates code with significantly higher certainty. Specific quantitative experiments can be found in Section 4.3.

This pattern suggests that the TIR model implicitly recognizes the trustworthiness of potential tool output but lacks an explicit mechanism to use this awareness in ongoing reasoning, which frequently leads it to wrongly trust or wrongly disregard tool results. ATTC is designed to bridge this gap by converting this implicit awareness into an explicit decision-making process.
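The trust decision can be illustrated with a small gate over the token log-probabilities of a generated code block. The paper's confidence scoring formula is not reproduced in this section, so the sketch assumes one plausible choice, the geometric mean of token probabilities. The threshold value of 0.8, the two guidance strings, and the function names are likewise hypothetical placeholders and not the authors' settings.

```python
import math
from typing import Sequence

# Hypothetical guidance messages; the actual wording used by ATTC is not given here.
TRUST_HINT = "The code was generated with high confidence; trust the tool result."
RECONSIDER_HINT = "The code was generated with low confidence; reconsider before relying on it."


def code_confidence(token_logprobs: Sequence[float]) -> float:
    """Assumed score: geometric mean of the code block's token probabilities."""
    if not token_logprobs:
        raise ValueError("code block has no tokens")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def attc_guidance(token_logprobs: Sequence[float], threshold: float = 0.8) -> str:
    """Return the guidance to append after the tool output.

    The threshold default is a placeholder; the text describes it as
    empirically determined.
    """
    if code_confidence(token_logprobs) > threshold:
        return TRUST_HINT
    return RECONSIDER_HINT


# Illustrative log-probabilities, not measured values.
confident_block = [-0.01, -0.02, -0.05]
uncertain_block = [-0.9, -1.2, -0.4]

high = attc_guidance(confident_block)
low = attc_guidance(uncertain_block)
```

In this sketch the first block scores above the threshold and receives the trust hint, while the second falls below it and receives the reconsider hint. Any monotone aggregate of token probabilities would fit the same gate, so the scoring function is the part to swap out once the actual formula is known.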
