Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

arXiv cs.AI Papers

Summary

This paper proposes TRUST, a method that integrates uncertainty quantification into reinforcement learning reward design to improve LLM agents' tool-calling decisions, enhancing decision quality and maintaining reliable uncertainty estimates.

arXiv:2606.06976v1 Announce Type: new Abstract: Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

Cached at: 06/08/26, 09:14 AM

# Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning
Source: [https://arxiv.org/html/2606.06976](https://arxiv.org/html/2606.06976)
Yijin Zhou1,2,3, Linqian Zeng1, Xiaoya Lu1,2, Wenyuan Xie1, Dongrui Liu2,†, Junchi Yan1,3, Jing Shao2,†

1Shanghai Jiao Tong University, China; 2Shanghai Artificial Intelligence Laboratory, China; 3Shanghai Innovation Institute, China

###### Abstract

Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches mainly improve these behaviors through inference-time correction or coarse-grained reward signals based on decision outcomes and structured checklists, leaving the uncertainty characteristics of agent decisions underexplored. We observe that decision-oriented reinforcement learning tends to weaken the uncertainty separation between correct and incorrect actions, resulting in overconfident mistakes and weaker exploration signals. Therefore, we propose TRUST, which incorporates uncertainty quantification into reward design as a repulsive force for maintaining uncertainty separation, and labels lightweight key-turn annotations for unified post-training of multi-turn trajectories. Experimental results across diverse tool-use benchmarks show that TRUST consistently enhances both decision quality and agent performance while maintaining more reliable uncertainty estimates during optimization.

Code: [https://github.com/yjzscode/TRUST](https://github.com/yjzscode/TRUST)


## 1 Introduction

Large language model (LLM)-based agents extend the capabilities of language models by invoking external tools for knowledge retrieval, computation, and interaction with real-world environments (Yao et al., [2023](https://arxiv.org/html/2606.06976#bib.bib1); Schick et al., [2023](https://arxiv.org/html/2606.06976#bib.bib2); Qin et al., [2024](https://arxiv.org/html/2606.06976#bib.bib3); Lu et al., [2026a](https://arxiv.org/html/2606.06976#bib.bib39); Chen et al., [2026](https://arxiv.org/html/2606.06976#bib.bib41)). However, agents frequently exhibit tool-calling decision failures at specific action turns, either invoking tools when tool use is unsupported or unnecessary, or fabricating direct answers without issuing the required tool call (Ross et al., [2025](https://arxiv.org/html/2606.06976#bib.bib4); Healy et al., [2026](https://arxiv.org/html/2606.06976#bib.bib30); Zhou et al., [2026](https://arxiv.org/html/2606.06976#bib.bib38)). Unlike conventional textual hallucinations in LLMs, tool-calling decision failures can corrupt intermediate states and propagate errors across subsequent interaction turns, thereby incurring financial costs, execution failures, and information leakage in real-world agentic tasks (Lin et al., [2025](https://arxiv.org/html/2606.06976#bib.bib31); Su et al., [2025](https://arxiv.org/html/2606.06976#bib.bib32); Zhang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib33); Lu et al., [2026b](https://arxiv.org/html/2606.06976#bib.bib37)).

![Refer to caption](https://arxiv.org/html/2606.06976v1/x1.png)

Figure 1: Comparison between tool-calling decision failures, where wrong decisions receive low uncertainty, and our TRUST solution.

Recent studies address tool-calling decision failures by extracting internal uncertainty signals or structured reasoning patterns during inference (Zhang et al., [2026b](https://arxiv.org/html/2606.06976#bib.bib17); Stoisser et al., [2025](https://arxiv.org/html/2606.06976#bib.bib34)). However, they primarily focus on post-hoc intervention at inference time, instead of improving the agent's intrinsic decision capability through policy optimization. Ross et al. ([2025](https://arxiv.org/html/2606.06976#bib.bib4)) and Suri et al. ([2025](https://arxiv.org/html/2606.06976#bib.bib9)) mitigate such failures through reinforcement learning (RL) for tool-calling decisions, typically relying on coarse-grained reward formulations from decision classification or rule-based checklists. Nevertheless, these approaches lack a quantitative analysis of the policy exploration dynamics underlying tool-calling decision optimization, leaving substantial room for improvement.

From the perspective of Uncertainty Quantification (UQ) of model responses, we observe that existing RL-based approaches for optimizing tool-calling decisions tend to increase confidence in sampled high-reward actions, while unintentionally weakening the uncertainty separation between correct and incorrect decisions. As illustrated in Fig. [1](https://arxiv.org/html/2606.06976#S1.F1)(a), unsupported tool calls and unjustified direct answers increasingly overlap with low-uncertainty regions after decision-oriented RL optimization, impairing the model's original calibration, under which higher uncertainty indicates a greater likelihood of an incorrect decision. Take Qwen3-4B-Thinking (Yang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib23)) as an example: the proportion of “Wrong Decision But Low Uncertainty” cases rises from 34.50% to 70.21% after RL optimization for tool-calling decisions. This observation suggests that existing decision-oriented objectives primarily optimize action correctness without preserving the uncertainty structure underlying those decisions. Consequently, overconfident but incorrect actions receive limited optimization pressure, reducing policy exploration toward more reliable alternatives.

To address this issue, we propose Tool-calling decision Reward with Uncertainty-Separated post-Training (TRUST), a framework that leverages UQ as a repulsive force within the reward, rather than using uncertainty merely as a post-hoc diagnostic. Specifically, we introduce a UQ-aligned reward that jointly models action correctness and a certainty margin derived from the uncertainty gap between negative and ground-truth decisions. As shown in Fig. [1](https://arxiv.org/html/2606.06976#S1.F1)(b), our reward encourages the model to assign lower uncertainty to correct decisions while maintaining comparatively higher uncertainty for incorrect or counterfactual decisions. This mechanism promotes exploration away from uncertain or unreliable decisions and provides a stronger optimization signal for policy updates.

Furthermore, we extend this training paradigm from standalone decision instances to multi\-turn agent trajectories\. Instead of exhaustively relabeling entire trajectories, we annotate lightweight key\-turn decision points, enabling unified post\-training for both trajectory\-level task success and turn\-level tool\-calling calibration\. The trajectory\-level outcome reward supervises overall task completion and tool execution quality, while our turn\-level UQ\-aligned reward explicitly calibrates the timing and appropriateness of tool\-calling decisions\.

Empirically, TRUST yields substantial improvements across diverse tool-use benchmarks while preserving the uncertainty structure underlying agent decisions. In turn-level tool-calling decision optimization, TRUST improves When2Call task accuracy by over 11%, while also strengthening performance in complex multi-turn interactions and tool-use trajectories. Combined with trajectory-level tool-calling post-training, TRUST outperforms the CM2 baseline by 6.33% on BFCL-V4 and 7.07% on ToolSandbox. The gains are particularly pronounced in challenging decision-intensive scenarios, including Multi-Turn and Irrelevance on BFCL-V4, as well as scenarios such as multiple user turns and distraction tools in ToolSandbox. Overall, TRUST jointly optimizes overall tool execution quality and the turn-level appropriateness of tool-use decisions, providing stronger optimization signals for policy learning and enabling more reliable multi-turn agent behavior.

## 2 Related Work

#### Uncertainty quantification for language agents\.

UQ has been widely studied for estimating the reliability of language model outputs (Kadavath et al., [2022](https://arxiv.org/html/2606.06976#bib.bib10); Lin et al., [2022](https://arxiv.org/html/2606.06976#bib.bib11); Zhou et al., [2024](https://arxiv.org/html/2606.06976#bib.bib40)). Early work mainly focuses on output-level uncertainty through model probabilities, verbalized confidence, and sampling-based consistency signals (Kadavath et al., [2022](https://arxiv.org/html/2606.06976#bib.bib10); Lin et al., [2022](https://arxiv.org/html/2606.06976#bib.bib11); Manakul et al., [2023](https://arxiv.org/html/2606.06976#bib.bib13); Kuhn et al., [2023](https://arxiv.org/html/2606.06976#bib.bib12)). More recently, researchers argue that uncertainty in agentic systems extends beyond final responses to intermediate actions, environmental observations, and multi-step trajectories (Kirchhof et al., [2025](https://arxiv.org/html/2606.06976#bib.bib14); Duan et al., [2025](https://arxiv.org/html/2606.06976#bib.bib16); Oh et al., [2026](https://arxiv.org/html/2606.06976#bib.bib15); Zhang et al., [2026b](https://arxiv.org/html/2606.06976#bib.bib17); Stoisser et al., [2025](https://arxiv.org/html/2606.06976#bib.bib34); Lymperopoulos and Sarathy, [2025](https://arxiv.org/html/2606.06976#bib.bib36)). Several approaches further exploit uncertainty to regulate agent behavior, such as triggering clarification under ambiguous instructions, controlling memory and reflection, or guiding exploration through structured rewards (Suri et al., [2025](https://arxiv.org/html/2606.06976#bib.bib9); Zhang et al., [2026b](https://arxiv.org/html/2606.06976#bib.bib17), [a](https://arxiv.org/html/2606.06976#bib.bib18)).
Different from prior work that primarily treats uncertainty as a post-hoc signal for diagnosis or behavior control, we explicitly integrate uncertainty into policy optimization and align decision correctness with model certainty through uncertainty-aware rewards.

![Refer to caption](https://arxiv.org/html/2606.06976v1/x2.png)

Figure 2: Overview of TRUST. It consists of two components: (a) a turn-level UQ-aligned reward that uses uncertainty as a repulsive signal to align decision correctness and confidence; (b) a trajectory-level unified post-training framework that augments key turns with decision annotations and integrates $R_{\text{UQ}}$ with task-level rewards for joint optimization of execution quality and tool-calling decisions.

#### Tool\-calling decision learning and optimization\.

Tool-augmented LLMs have been extensively studied for tool selection, argument generation, and large-scale API utilization (Schick et al., [2023](https://arxiv.org/html/2606.06976#bib.bib2); Qin et al., [2024](https://arxiv.org/html/2606.06976#bib.bib3); Patil et al., [2025](https://arxiv.org/html/2606.06976#bib.bib20); Lu et al., [2025](https://arxiv.org/html/2606.06976#bib.bib19)). Beyond execution correctness, recent work increasingly focuses on the preceding decision process: whether an agent should invoke tools, ask follow-up questions, answer directly, or abstain under current observations. Benchmarks and analyses such as When2Call and related studies isolate this decision layer and reveal common failures, including unnecessary tool invocation and hallucinated direct answers (Ross et al., [2025](https://arxiv.org/html/2606.06976#bib.bib4); Wang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib21); Wu et al., [2026](https://arxiv.org/html/2606.06976#bib.bib22); Sun et al., [2026](https://arxiv.org/html/2606.06976#bib.bib27)). Concurrent work further explores RL-based optimization of tool-calling behavior using calibrated rewards or decision supervision (Suri et al., [2025](https://arxiv.org/html/2606.06976#bib.bib9); Modecrua et al., [2026](https://arxiv.org/html/2606.06976#bib.bib28); Zhong et al., [2026](https://arxiv.org/html/2606.06976#bib.bib29)). In contrast to these approaches that mainly optimize decision correctness, our work studies how RL reshapes the uncertainty structure of tool-calling policies and leverages it to improve both decision quality and calibration.

## 3 TRUST Method

### 3.1 Problem Formulation

Given the state space $\mathcal{S}$, the next-action decision is made over a four-way agentic action space (Ross et al., [2025](https://arxiv.org/html/2606.06976#bib.bib4)):

$$\mathcal{A}=\{\textsc{Direct},\textsc{Tool},\textsc{Ask},\textsc{Unable}\},\tag{1}$$

where $\textsc{Direct}$ means answering directly, $\textsc{Tool}$ means invoking an external tool, $\textsc{Ask}$ means requesting missing information from the user, and $\textsc{Unable}$ means admitting that the request cannot be answered. Specifically, at decision turn $t$, the agent state is $s_t=(h_t,\mathcal{T})$, where $h_t=(o_1,a_1,\ldots,o_t,a_t)$ denotes the observed trajectory and $\mathcal{T}$ denotes the available tool set with schemas. A policy $\pi_\theta:\mathcal{S}\rightarrow\mathcal{Z}$ maps a state to an agent response, generating a structured response $z_t=(a_t,y_t)$ with $a_t\in\mathcal{A}$ and $y_t\in\mathcal{Y}$, where $\mathcal{Y}$ is the set of surface realizations. For $\textsc{Tool}$, $y_t$ is a tool-call payload with a tool name and arguments; otherwise, $y_t$ is natural language. The environment updates the state after tool execution or user interaction.

Each decision point provides a ground-truth pair $z_t^\star=(a_t^\star,y_t^\star)$. A tool-calling decision failure occurs when the selected action is unsupported by the current state, i.e., $a_t^\star\in\mathcal{A}\setminus\{\textsc{Tool}\},\ a_t=\textsc{Tool}$ or $a_t^\star=\textsc{Tool},\ a_t=\textsc{Direct}$. Thus, the training objective is not only to increase the probability of correct final trajectories, but also to calibrate the policy over intermediate decisions:

$$\max_{\theta}\ \pi_\theta(a_t^\star,y_t^\star\mid s_t),\quad \min_{\theta}\ \pi_\theta(a_t^-,y_t^-\mid s_t)\tag{2}$$

for unsupported or counterfactual decisions $z_t^-=(a_t^-,y_t^-)$. In our experiments, this policy is optimized with GRPO post-training. For each state $s$, the current model samples a group of responses $\mathcal{Z}^G=\{z_i\}_{i=1}^{G}$, and the policy is updated according to their relative rewards.
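The failure condition and the group-relative update signal above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Action`, `is_decision_failure`, and `group_relative_advantages` are hypothetical, and the mean/standard-deviation normalization is the commonly used GRPO formulation, which this section only summarizes as "relative rewards".

```python
from enum import Enum
from statistics import mean, pstdev


class Action(Enum):
    """Four-way action space of Eq. (1)."""
    DIRECT = "Direct"
    TOOL = "Tool"
    ASK = "Ask"
    UNABLE = "Unable"


def is_decision_failure(gt: Action, pred: Action) -> bool:
    """Tool-calling decision failure as defined in Sec. 3.1."""
    # Case 1: a tool is called although the ground truth is not Tool.
    unsupported_tool_call = gt is not Action.TOOL and pred is Action.TOOL
    # Case 2: a direct answer is given although a tool call was required.
    hallucinated_direct = gt is Action.TOOL and pred is Action.DIRECT
    return unsupported_tool_call or hallucinated_direct


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (assumed GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero for uniform groups.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses whose reward is above the group mean receive a positive advantage and the others a negative one, so only the ordering and spread within the group matter, not the absolute reward scale.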

### 3.2 Turn-level UQ-Aligned Decision Reward

Directly rewarding the selected action does not by itself make the agent's uncertainty better calibrated, as shown in Fig. [3](https://arxiv.org/html/2606.06976#S3.F3). We therefore use UQ as a repulsive force in the reward, as illustrated in Fig. [2](https://arxiv.org/html/2606.06976#S2.F2)(a). We instantiate uncertainty with sequence perplexity (Kuhn et al., [2023](https://arxiv.org/html/2606.06976#bib.bib12)). Given a prompt state $s$ and a candidate response $z=(a,y)$, the perplexity is

$$\mathrm{PPL}_\theta(z\mid s)=\exp\left(-\frac{1}{|z|}\sum_{j=1}^{|z|}\log p_\theta(z_{(j)}\mid s)\right),\tag{3}$$

where $z_{(j)}$ is the $j$-th sequence in $z$, and $p_\theta$ computes the average value over the tokens in $z_{(j)}$.
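As a small worked instance of Eq. (3), the sketch below computes perplexity from a list of log-probabilities. It assumes the caller has already obtained one natural-log probability per element $z_{(j)}$ from the model; the function name `perplexity` is illustrative and not taken from the paper's code.

```python
import math


def perplexity(logprobs):
    """Eq. (3): exp of the negative mean log-probability.

    `logprobs` holds log p_theta(z_(j) | s) for each element of the
    response (assumed to be natural logarithms).
    """
    if not logprobs:
        raise ValueError("perplexity is undefined for an empty response")
    return math.exp(-sum(logprobs) / len(logprobs))
```

For example, a response whose elements each have probability 0.5 has a perplexity of 2, and a response the model considers certain (probability 1 everywhere) has a perplexity of 1, the lowest possible value.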

For each decision point, we compare the perplexity of the ground-truth decision with that of a negative decision:

$$m(s)=\mathbb{E}_{\mathcal{Z}^G}\big(\mathrm{PPL}_\theta(z^-\mid s)\big)-\mathbb{E}_{\mathcal{Z}^G}\big(\mathrm{PPL}_\theta(z^\star\mid s)\big),\tag{4}$$

where $\mathbb{E}_{\mathcal{Z}^G}\big(\mathrm{PPL}_\theta(z^-\mid s)\big)=1$ if $\{(a^\star,\cdot)\in\mathcal{Z}^G\}=\emptyset$. The margin is converted into a bounded certainty coefficient,

$$c(s)=\sigma\left(\frac{m(s)}{\tau}\right),\tag{5}$$

where $\tau=0.1$ is a temperature and $\sigma(\cdot)$ is the sigmoid function. A large margin means that the model assigns lower perplexity to the correct decision than to the negative decision. Conversely, a small or negative margin indicates that uncertainty is not aligned with decision correctness.
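The margin of Eq. (4) and the coefficient of Eq. (5) can be sketched as follows. The inputs are the perplexities of the sampled responses in one group, split by whether their action matches the ground truth. One point is an assumption rather than the paper's stated rule: the text gives a fallback value of 1 for an expectation when the group lacks the required samples, and this sketch applies that fallback to whichever side is empty. The names `sigmoid` and `certainty_coefficient` are illustrative.

```python
import math
from statistics import mean


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def certainty_coefficient(neg_ppls, pos_ppls, tau=0.1, fallback=1.0):
    """c(s) = sigmoid(m(s) / tau), with m(s) the mean PPL gap of Eq. (4).

    neg_ppls: perplexities of sampled negative decisions z^-.
    pos_ppls: perplexities of sampled ground-truth decisions z^*.
    fallback: value used for an empty side (assumption, see lead-in).
    """
    e_neg = mean(neg_ppls) if neg_ppls else fallback
    e_pos = mean(pos_ppls) if pos_ppls else fallback
    margin = e_neg - e_pos
    return sigmoid(margin / tau)
```

With $\tau=0.1$ the sigmoid saturates quickly: a perplexity gap of 0.5 already gives $c(s)\approx 0.993$, while a gap of $-0.5$ gives $c(s)\approx 0.007$, so the coefficient behaves almost like a soft indicator of whether uncertainty is ordered correctly.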

The reward combines format validity, decision correctness, and the UQ margin:

$$R_{\mathrm{UQ}}(z)=R_{\mathrm{fmt}}(z)+R_{\mathrm{ans}}(z,z^\star)+c\,R_{\mathrm{cls}}(a,a^\star).\tag{6}$$

Here, $R_{\mathrm{fmt}}$ rewards the structured output format (\<think\> ... \</think\>)\<$\mathcal{A}$\>\<answer\>$\mathcal{Y}$\</answer\> and internal consistency between the suggested action $a$ and the final answer $y$. In trajectory-level unified training, $R_{\mathrm{fmt}}$ is set to 0. $R_{\mathrm{ans}}(z,z^\star)$ rewards consistency between $z$ and $z^\star$. $R_{\mathrm{cls}}$ rewards both action correctness and, via its constant offset of 1, the certainty coefficient $c$ itself:

$$R_{\mathrm{cls}}(a,a^\star)=1+\begin{cases}2, & a^\star=a,\\ 1, & \{a,a^\star\}=\{\textsc{Direct},\textsc{Tool}\}\ \text{or}\ \{\textsc{Ask},\textsc{Unable}\},\\ 0, & \text{otherwise}.\end{cases}\tag{7}$$

An exact action match receives the highest $R_{\mathrm{cls}}$, while a weaker partial reward is assigned when the prediction preserves the coarse decision direction between action and non-execution. $R_{\mathrm{UQ}}$ can be computed without an LLM judge when the structured output format (\<think\> ... \</think\>)\<$\mathcal{A}$\>\<answer\>$\mathcal{Y}$\</answer\> is used; otherwise, a lightweight LLM judge is needed to determine the action $a\in\mathcal{A}$. Overall, $c\,R_{\mathrm{cls}}$ acts as a repulsive force in $R_{\mathrm{UQ}}$, pushing wrong decisions toward higher uncertainty and pulling correct decisions toward lower uncertainty. This drives the policy to expand its exploration space when decisions are incorrect and makes the correct decisions easier to sample.
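Eqs. (6) and (7) translate directly into a short sketch. Because this section does not give numeric definitions for $R_{\mathrm{fmt}}$ and $R_{\mathrm{ans}}$, they are passed in as plain numbers here; the function names and the string encoding of actions are illustrative assumptions.

```python
# Coarse decision groups of Eq. (7): a mismatch inside one group still
# earns the partial reward.
COARSE_GROUPS = (
    frozenset({"Direct", "Tool"}),
    frozenset({"Ask", "Unable"}),
)


def r_cls(a: str, a_star: str) -> float:
    """Eq. (7): 1 + {2 exact match, 1 same coarse group, 0 otherwise}."""
    if a == a_star:
        return 3.0
    if frozenset({a, a_star}) in COARSE_GROUPS:
        return 2.0
    return 1.0


def r_uq(r_fmt: float, r_ans: float, c: float, a: str, a_star: str) -> float:
    """Eq. (6): R_fmt + R_ans + c * R_cls.

    r_fmt and r_ans are supplied by the caller (their exact scoring is
    not specified in this section); c is the certainty coefficient c(s).
    """
    return r_fmt + r_ans + c * r_cls(a, a_star)
```

Note that even a completely wrong action receives $c\cdot 1$, so every sample in the group is rewarded for a well-ordered uncertainty gap, while the spread between 3, 2, and 1 is amplified when $c$ is close to 1 and damped when $c$ is close to 0.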

### 3.3 Unified Post-Training for Trajectory-Level Performance and Turn-Level Decision

Standalone decision examples provide clean supervision for when an agent should answer, call tools, ask the user, or stop. However, realistic agent failures often arise inside multi-turn trajectories, where an early unsupported decision can corrupt later states. It is therefore important to unify the classical task-performance-driven rewards and the next-action decision rewards for RL training. To realize this unification, we augment CM2 trajectories (Zhang et al., [2026c](https://arxiv.org/html/2606.06976#bib.bib8)) with lightweight tool-call decision annotations at key turns, and combine our $R_{\mathrm{UQ}}$ with the CM2 checklist reward $R_{\text{CM2}}$.

#### Trajectory augmentation with tool\-call decision annotation\.

As shown in Fig\.[2](https://arxiv.org/html/2606.06976#S2.F2)\(b\), we construct trajectory\-level decision supervision from trajectories without relabeling the full conversation, only augmenting each trajectory with a small set of key\-turn decision annotations\.

A labeling model reads the full payload and selects no more than 2 decision-critical turns where a next-action signal from $\mathcal{A}$ is useful. Each selected turn receives exactly one label from the action space $\mathcal{A}$. To reduce the annotation bias of trajectories toward tool execution ($\textsc{Tool}$), the prompt for the labeling model includes running label-count statistics and asks the model to prefer underrepresented valid actions when multiple turns are equally useful.

The annotation output is a strict JSON object. Each annotation has the following schema:

$$\begin{aligned}\texttt{annotation}=\{\ &\texttt{turn\_idx}:\ \text{key turn idx},\\ &\texttt{gt\_action}=a^\star\in\mathcal{A},\\ &\texttt{rationale}:\ \text{Reason for }\texttt{gt\_action},\\ &\texttt{target\_answer}=y^\star\in\mathcal{Y}\ \}\end{aligned}\tag{8}$$

which is used by the mixed reward manager during GRPO post-training. For each annotated turn, the trajectory prefix up to the user message defines the state $s_t$, the annotated label defines $a_t^\star$, and the annotated target answer defines $y_t^\star$.
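A labeler output following Eq. (8) could be checked with a small validator such as the one below. Several details are assumptions made only for illustration: the top-level wrapper key `annotations`, the function name `parse_annotations`, and all field values in `EXAMPLE` are hypothetical. Only the four field names, the action set, and the limit of two key turns come from the text.

```python
import json

ACTIONS = {"Direct", "Tool", "Ask", "Unable"}
REQUIRED_KEYS = {"turn_idx", "gt_action", "rationale", "target_answer"}
MAX_KEY_TURNS = 2  # "no more than 2 decision-critical turns"


def parse_annotations(raw: str) -> list:
    """Parse and validate labeler output against the schema of Eq. (8)."""
    payload = json.loads(raw)
    annotations = payload["annotations"]  # hypothetical wrapper key
    if len(annotations) > MAX_KEY_TURNS:
        raise ValueError("more than 2 key turns annotated")
    for ann in annotations:
        missing = REQUIRED_KEYS - ann.keys()
        if missing:
            raise ValueError(f"missing keys: {sorted(missing)}")
        if ann["gt_action"] not in ACTIONS:
            raise ValueError(f"unknown action: {ann['gt_action']}")
        if not isinstance(ann["turn_idx"], int) or ann["turn_idx"] < 0:
            raise ValueError("turn_idx must be a non-negative integer")
    return annotations


# Illustrative payload; every value here is made up for the example.
EXAMPLE = json.dumps({
    "annotations": [{
        "turn_idx": 3,
        "gt_action": "Ask",
        "rationale": "A required argument has not been provided yet.",
        "target_answer": "Which city should I look up?",
    }]
})
```

Rejecting malformed labels before training keeps the sparse turn-level reward from being attached to turns with an undefined $a_t^\star$.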

![Refer to caption](https://arxiv.org/html/2606.06976v1/x3.png)

Figure 3: Uncertainty calibration of tool-calling decisions on When2Call. Green denotes correct decisions, and orange denotes wrong decisions. Lower PPL means higher certainty. Compared with Qwen3-4B-Thinking and decision GRPO training without TRUST, training with TRUST yields a clearer separation between correct and wrong decisions, assigning lower PPL to correct decisions and higher PPL to wrong ones.

#### Reward combination for unified RL\.

To unify the next-action decision with task completion and accuracy aggregation in one post-training process, trajectories are optimized with a mixed reward, as shown in Fig. [2](https://arxiv.org/html/2606.06976#S2.F2)(b). The original checklist reward $R_{\text{CM2}}$ continues to supervise task completion and tool execution quality over the full trajectory. Our turn-level decision reward $R_{\text{UQ}}$ is added sparsely at the end positions of the annotated turns:

$$R=R_{\text{CM2}}+\sum_{t\in\mathcal{K}}R_{\mathrm{UQ}}(z_t),\tag{9}$$

where $\mathcal{K}$ is the set of annotated key turns. This design preserves the original CM2 objective for execution quality while adding a targeted next-action decision signal for tool-call timing and action appropriateness. TRUST thus yields a lightweight trajectory-level training signal without relabeling entire conversations or redesigning the performance-driven rewards, unifying training for task performance, tool-calling efficiency, and mitigation of tool-calling hallucination.
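Eq. (9) amounts to adding a handful of turn-level terms to the trajectory-level score. The sketch below assumes the per-turn $R_{\mathrm{UQ}}$ values have already been computed and are keyed by the annotated turn index; `mixed_reward` is an illustrative name, not the paper's reward manager.

```python
def mixed_reward(r_cm2: float, r_uq_by_turn: dict) -> float:
    """Eq. (9): R = R_CM2 + sum of R_UQ(z_t) over annotated key turns.

    r_cm2: trajectory-level checklist reward.
    r_uq_by_turn: maps each annotated key-turn index t in K to R_UQ(z_t);
                  turns without an annotation contribute nothing.
    """
    return r_cm2 + sum(r_uq_by_turn.values())
```

For instance, `mixed_reward(0.5, {3: 2.5, 7: 1.0})` gives `4.0`, and a trajectory with no annotated turns falls back to the plain CM2 reward.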

## 4 Experiments

Table 1: Performance comparison on When2Call. FDAR is the False Direct Answer Rate. Boldface and underlining denote the best and second-best results, respectively, within each corresponding baseline group. The transparency intensity of green-colored cells is (performance of TRUST) - (performance of the baseline with the same “Training”).

### 4.1 Setup

#### Benchmarks\.

We conduct experiments on three benchmarks: When2Call (Ross et al., [2025](https://arxiv.org/html/2606.06976#bib.bib4)) for turn-level when-to-call training and evaluation, and the tool-use benchmarks ToolSandbox (Lu et al., [2025](https://arxiv.org/html/2606.06976#bib.bib19)) and BFCL-V4 (Patil et al., [2025](https://arxiv.org/html/2606.06976#bib.bib20)) for evaluating overall multi-turn performance.

#### Baselines\.

The closed models MiniMax-M2.5 (MiniMax, [2026](https://arxiv.org/html/2606.06976#bib.bib24)) and GPT-4o-mini (Hurst et al., [2024](https://arxiv.org/html/2606.06976#bib.bib26)), and the open-source models Qwen3-235B-A22B, Qwen3-4B-Thinking, and Qwen3-8B-Thinking (Yang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib23)) are tested on the three benchmarks for comparison with our model post-trained with TRUST. Moreover, we compare against two representative UQ baselines for agentic tool calling, AUQ (Zhang et al., [2026b](https://arxiv.org/html/2606.06976#bib.bib17)) and SAGE-Agent (Suri et al., [2025](https://arxiv.org/html/2606.06976#bib.bib9)), both implemented on Qwen3-4B-Thinking and Qwen3-8B-Thinking. Note that we do not report results of the vanilla base model or training-free methods for Qwen3-8B-Base, since it is not instruction-tuned under this evaluation protocol (Zhang et al., [2026c](https://arxiv.org/html/2606.06976#bib.bib8)). Instead, we provide the corresponding training-free experiments on Qwen3-8B-Thinking as informative references. For turn-level and trajectory-level post-training, GRPO using only the tool-calling decision reward and CM2 (Zhang et al., [2026c](https://arxiv.org/html/2606.06976#bib.bib8)), each starting from the same model checkpoint as TRUST, serve as the training-based baselines. See Appendix [A](https://arxiv.org/html/2606.06976#A1) for further details of benchmarks and baselines.

#### Training setups\.

We select Qwen3-4B-Thinking and Qwen3-8B-Base as the agent backbone models (Yang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib23)) for group relative policy optimization (GRPO) post-training (Shao et al., [2024](https://arxiv.org/html/2606.06976#bib.bib7)). Qwen3-4B-Thinking is trained on the When2Call training dataset to validate the effectiveness of calibrating the turn-level tool-calling decision only, while Qwen3-8B-Base undergoes a complete trajectory-level post-training, including cold-start SFT and the unified RL, following Zhang et al. ([2026c](https://arxiv.org/html/2606.06976#bib.bib8)). In trajectory augmentation, we use Qwen3-235B-A22B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib23)) as the labeler. In unified RL training, Qwen3-30B-A3B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2606.06976#bib.bib23)) serves as the LLM judge. Prompts and more settings are in Appendix [B](https://arxiv.org/html/2606.06976#A2).

### 4.2 TRUST Successfully Aligns Correctness and Uncertainty

Table 2: Performance comparison on BFCL-V4.

Table 3: Performance comparison on the ToolSandbox benchmark across scenario categories (STC to II) and tool augmentations (0-DT to ATS).

| Model / Method | Overall Score | STC | MTC | SUT | MUT | SD | C | II | 0-DT | 3-DT | 10-DT | AT | TNS | TDS | ADS | ATS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Open-source Baselines** | | | | | | | | | | | | | | | | |
| 235B-A22B-Instruct | 69.88 | 62.46 | 62.68 | 63.85 | 59.48 | 62.68 | 57.63 | 90.98 | 73.84 | 73.44 | 74.98 | 70.75 | 74.85 | 70.99 | 74.73 | 74.79 |
| 30B-A3B-Instruct | 61.57 | 62.08 | 60.07 | 62.88 | 54.09 | 62.34 | 57.24 | 58.45 | 64.68 | 62.12 | 62.25 | 61.72 | 63.61 | 64.94 | 63.54 | 63.55 |
| 4B-Thinking | 52.89 | 57.68 | 38.09 | 43.15 | 38.18 | 53.03 | 28.71 | 90.82 | 58.71 | 54.75 | 53.21 | 52.61 | 53.82 | 53.64 | 56.40 | 60.54 |
| 8B-Thinking | 59.74 | 64.32 | 44.79 | 50.16 | 44.06 | 61.55 | 35.76 | 94.93 | 59.05 | 62.95 | 65.13 | 60.74 | 62.35 | 61.08 | 65.23 | 64.00 |
| **From Qwen3-4B-Thinking** | | | | | | | | | | | | | | | | |
| AUQ | 55.47 | 48.23 | 47.92 | 49.20 | 44.79 | 51.59 | 41.71 | 84.87 | 56.12 | 58.09 | 60.55 | 56.98 | 60.41 | 59.03 | 55.51 | 56.99 |
| SAGE | 52.54 | 48.22 | 46.93 | 49.14 | 42.04 | 54.38 | 40.99 | 71.06 | 50.14 | 56.03 | 54.33 | 51.60 | 60.46 | 51.49 | 55.90 | 55.45 |
| Turn-level GRPO | 39.37 | 46.92 | 21.15 | 28.67 | 19.04 | 23.37 | 19.05 | 95.98 | 43.36 | 45.57 | 40.93 | 38.86 | 44.28 | 37.90 | 41.82 | 43.71 |
| Turn-level TRUST | 56.35 | 51.18 | 46.26 | 52.83 | 37.68 | 55.59 | 42.27 | 94.44 | 57.42 | 58.53 | 59.05 | 58.68 | 57.66 | 56.90 | 58.02 | 58.77 |
| **From Qwen3-8B-Thinking** | | | | | | | | | | | | | | | | |
| AUQ | 64.08 | 56.46 | 56.68 | 57.85 | 53.48 | 56.68 | 51.63 | 87.98 | 67.84 | 67.44 | 68.98 | 64.75 | 68.85 | 64.99 | 68.73 | 68.79 |
| SAGE | 59.72 | 54.85 | 51.56 | 53.95 | 47.56 | 59.41 | 43.68 | 83.48 | 59.70 | 65.72 | 60.27 | 58.29 | 62.97 | 66.61 | 62.22 | 65.52 |
| **From Qwen3-8B-Base** | | | | | | | | | | | | | | | | |
| Traj.-level CM2 | 61.21 | 70.57 | 58.55 | 64.22 | 51.93 | 56.64 | 53.70 | 55.60 | 60.12 | 65.04 | 65.02 | 65.03 | 61.80 | 61.30 | 64.14 | 64.51 |
| Traj.-level TRUST | 68.28 | 68.66 | 59.70 | 63.10 | 56.93 | 61.58 | 52.04 | 90.70 | 73.01 | 76.41 | 74.94 | 69.16 | 66.74 | 68.77 | 71.26 | 71.24 |

![Refer to caption](https://arxiv.org/html/2606.06976v1/x4.png)

Figure 4: Ablation study of $R_{\text{UQ}}$ on When2Call. Tool Hall + FDAR measures the overall tool-calling hallucination. The setting w/o $c(s)$ represents $c(s)=1$, i.e., $R_{\text{UQ}}=R_{\mathrm{fmt}}+R_{\mathrm{ans}}+R_{\mathrm{cls}}$.

![Refer to caption](https://arxiv.org/html/2606.06976v1/x5.png)

Figure 5: Ablation study results of the unified post-training across three evaluation benchmarks.

To isolate whether the proposed reward calibrates the tool-calling decision itself, we train on the When2Call training set and compare the PPL distributions of correct and wrong decisions. As shown in Fig. [3](https://arxiv.org/html/2606.06976#S3.F3)(a), the original Qwen3-4B-Thinking already exhibits a correlation between correctness and uncertainty, where correct decisions tend to have lower PPL than wrong ones, forming two distinct distributions. After GRPO decision training without TRUST, however, this separation becomes less reliable, as shown in Fig. [3](https://arxiv.org/html/2606.06976#S3.F3)(b). Wrong decisions are also assigned low PPL, indicating that direct optimization can make the model confident in both correct and incorrect decisions. In contrast, training with TRUST again produces a clearer uncertainty gap, as depicted in Fig. [3](https://arxiv.org/html/2606.06976#S3.F3)(c). Correct decisions concentrate in the lower-PPL region, while wrong decisions are shifted toward higher PPL.

We measure the distribution overlap in Fig. [3](https://arxiv.org/html/2606.06976#S3.F3) by

$$\text{IoU}=\frac{\text{Correct}\cap\text{Wrong}}{\text{Correct}\cup\text{Wrong}}.\tag{10}$$

This yields an initial IoU of 34.50%, which increases to 70.21% after GRPO. Conversely, integrating our UQ-aligned reward brings the IoU back down to 35.29% while delivering superior accuracy. This demonstrates that TRUST explicitly aligns the model's internal uncertainty with decision correctness, making erroneous tool-calling decisions easier to identify and less likely to be reinforced as confident actions.
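The text does not spell out how the intersection and union in Eq. (10) are computed from the two PPL distributions. One plausible reading, sketched below purely as an assumption, bins both sets of perplexities into normalized histograms over a shared range and takes the ratio of the summed bin-wise minima to the summed bin-wise maxima. The function names and the default of 20 bins are illustrative choices.

```python
def _normalized_histogram(values, lo, hi, bins):
    """Fraction of `values` falling into each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp the right edge into the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def distribution_iou(correct_ppl, wrong_ppl, bins=20):
    """Histogram-based overlap of two PPL samples (assumed reading of Eq. 10)."""
    if not correct_ppl or not wrong_ppl:
        raise ValueError("both samples must be non-empty")
    lo = min(min(correct_ppl), min(wrong_ppl))
    hi = max(max(correct_ppl), max(wrong_ppl))
    if hi == lo:
        return 1.0  # all mass sits at a single value, so the overlap is total
    h_correct = _normalized_histogram(correct_ppl, lo, hi, bins)
    h_wrong = _normalized_histogram(wrong_ppl, lo, hi, bins)
    intersection = sum(min(c, w) for c, w in zip(h_correct, h_wrong))
    union = sum(max(c, w) for c, w in zip(h_correct, h_wrong))
    return intersection / union
```

Under this reading, an IoU near 0 means correct and wrong decisions occupy disjoint perplexity ranges, while an IoU near 1 means perplexity carries almost no information about correctness.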

### 4\.3Performance ofTRUSTon Turn\-Level Tool\-Calling Decision Post\-Training

To evaluate the effectiveness ofTRUST, we conduct turn\-level GRPO and evaluation on Qwen3\-4B\-Thinking, where the results demonstrate substantial performance gains across all benchmarks\.

First, onWhen2Calltest set \(Table[1](https://arxiv.org/html/2606.06976#S4.T1)\),TRUSTachieves the highest Acc Norm compared to various baseline methods, outperforming the runner\-up by 7\.47%, while yielding the lowest False Direct Answer Rate \(FDAR\) of 5\.07%\. Compared to the vanilla base model Qwen3\-4B\-Thinking,TRUSTimproves the Acc Norm by 11\.47%\. When compared directly against turn\-level GRPO training,TRUSTnot only achieves an 8\.37% absolute improvement in Acc Norm but also reduces the overall hallucination metric, defined as a sum of Tool Hallucination and FDAR, from GRPO’s 30\.49% down to 22\.90%\. Overall,TRUSTsurpasses most open\-source and closed\-source models, reaching a level comparable to the top\-tier closed\-source Claude\-Sonnet\-4\.

Crucially, we find that optimizing tool\-call decisions solely at the turn level directly boosts generalized performance in complex multi\-turn interactions and tool\-use trajectories\. On BFCL\-V4 \(Table [2](https://arxiv.org/html/2606.06976#S4.T2)\), TRUST delivers an Overall Score of 48\.04%, a substantial 9\.43% gain over the base model that surpasses all baseline methods and notably exceeds the much larger 30B\-A3B\-Instruct \(41\.00%\)\. Specifically, TRUST shows a clear advantage in dimensions such as Multi\-Turn \(49\.62%\) and Irrelevance \(84\.84%\), indicating high precision in handling multi\-turn tool calling and detecting irrelevant tools\.

Similarly, on ToolSandbox \(Table [3](https://arxiv.org/html/2606.06976#S4.T3)\), TRUST achieves the best performance within its group, outperforming the base model by 3\.46% and turn\-level GRPO by 16\.98%\. Although turn\-level GRPO has the highest score in detecting Insufficient Information \(II\), it drives the agent toward a conservative tool\-calling strategy, which hurts its tool\-calling and task\-completion capability\. Furthermore, under the tool\-level perturbations 3\-DT and 10\-DT as well as the specialized robustness evaluations \(TNS, TDS, ADS, and ATS\), TRUST maintains stable scores\. This underscores the cross\-task benefits that precise turn\-level decision training provides for long\-dependency tool\-chain workflows\.

### 4\.4 Performance of TRUST on Trajectory\-Level Unified Post\-Training

On When2Call \(Table [1](https://arxiv.org/html/2606.06976#S4.T1)\), TRUST clearly outperforms the trajectory\-level RL baseline CM2\. Specifically, TRUST raises Acc Norm from 43\.75% to 62\.32% and the F1 score from 36\.94% to 60\.62%, a substantial gain in execution accuracy\.

Furthermore, the empirical results on multi\-turn benchmarks confirm that TRUST, which applies GRPO on trajectories augmented with key\-turn decision annotations, uses tools more accurately and effectively\. On BFCL\-V4 \(Table [2](https://arxiv.org/html/2606.06976#S4.T2)\), TRUST achieves an Overall Score of 44\.48%, significantly outperforming Traj\.\-level CM2 \(38\.15%\) and surpassing 30B\-A3B\-Instruct \(41\.00%\)\. This advantage is mirrored in scenarios requiring higher agent capabilities, such as the Web Search, Memory, and Multi\-Turn dimensions\. Similarly, on ToolSandbox \(Table [3](https://arxiv.org/html/2606.06976#S4.T3)\), TRUST delivers an Overall Score of 68\.28%, exceeding the trajectory\-level GRPO baseline by 7\.07%\. This score not only leaves the 30B model far behind but also closely approaches the performance of 235B\-A22B\-Instruct \(69\.88%\)\. Moreover, TRUST achieves an Insufficient Information \(II\) score of 90\.70%, drastically mitigating the hallucination that plagues the GRPO baseline CM2 \(55\.60%\)\.

Importantly, unlike training\-free alternatives \(e\.g\., SAGE, AUQ\) that often rely on multi\-step prompting or trial\-and\-error reasoning loops, TRUST internalizes the decision\-making logic directly into the model weights, thereby introducing no additional inference latency or computational overhead during deployment\.

### 4\.5 Ablation Study

#### Ablation for $R_{\text{UQ}}$\.

As shown in Fig\. [4](https://arxiv.org/html/2606.06976#S4.F4), all variants trained with GRPO improve over the vanilla model baseline, while our full method achieves the best overall performance\. Among all reward components, removing the uncertainty repulsive reward, i\.e\., w/o $c(s)$, leads to the most significant degradation, reducing Acc Norm from 80\.83% to 72\.46%\. Furthermore, excluding the answer reward \(w/o $R_{\text{ans}}$\) or the format reward \(w/o $R_{\text{fmt}}$\) also degrades performance to varying degrees, with lower accuracy and increased hallucination compared with the complete method\. Notably, removing $c(s)$ also substantially increases the overall hallucination \(Tool Hall \+ FDAR\) from 22\.90% to 30\.49%, indicating that $c(s)$ plays a crucial role in suppressing unsupported tool calls and false direct answers\. These results suggest that the reward components contribute complementary effects: $R_{\text{ans}}$ and $R_{\text{fmt}}$ mainly improve answer quality and output validity, while $c(s)$ is particularly important for reliable tool\-calling decisions\.
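The ablation variants can be read as switches on a single reward composition\. The sketch below assumes the multiplicative form $R_{\text{UQ}}=R_{\text{fmt}}+R_{\text{ans}}+c(s)\,R_{\text{cls}}$, which is inferred from the Fig\. 4 caption \(setting $c(s)=1$ recovers $R_{\text{fmt}}+R_{\text{ans}}+R_{\text{cls}}$\); the function name `uq_reward`, its flags, and the example values are hypothetical, and the actual definition of $c(s)$ is given in the method section rather than here\.

```python
def uq_reward(r_fmt: float, r_ans: float, r_cls: float, c_s: float = 1.0,
              use_fmt: bool = True, use_ans: bool = True) -> float:
    """Compose the reward as R_fmt + R_ans + c(s) * R_cls (assumed form).

    c_s is the uncertainty coefficient c(s); its computation is outside
    this sketch. c_s = 1.0 reproduces the "w/o c(s)" ablation.
    """
    total = c_s * r_cls
    if use_fmt:
        total += r_fmt
    if use_ans:
        total += r_ans
    return total


# Hypothetical values, only to show how the ablation settings differ.
full = uq_reward(1.0, 1.0, 1.0, c_s=0.5)
without_c = uq_reward(1.0, 1.0, 1.0)                # c(s) fixed to 1
without_ans = uq_reward(1.0, 1.0, 1.0, c_s=0.5, use_ans=False)
print(full, without_c, without_ans)
```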

#### Ablation for unified post\-training\.

As shown in Fig\. [5](https://arxiv.org/html/2606.06976#S4.F5), in trajectory\-level unified post\-training, removing $c(s)$ also leads to a drastic performance degradation across all evaluated benchmarks, validating that the repulsive force in TRUST is vital for calibrating tool\-call decisions\. Compared with SFT and with the GRPO baseline that lacks the tool\-calling decision annotations and the combined rewards of TRUST, the full unified training framework achieves the highest overall scores and the lowest hallucination across the board\. The empirical results consistently show that every designed module contributes, and that their unification yields the best performance\. More experimental details and case studies can be found in Appendix [C](https://arxiv.org/html/2606.06976#A3) and Appendix [D](https://arxiv.org/html/2606.06976#A4)\.

## 5 Conclusions

In this paper, we propose TRUST, an uncertainty\-aware reward optimization framework that improves agentic tool\-calling decisions in both turn\-level and trajectory\-level unified post\-training\. By integrating uncertainty quantification information directly into the reward function, TRUST aligns model certainty with decision correctness and mitigates hallucinated tool\-use behaviors\. We further introduce lightweight key\-turn annotations to unify turn\-level decision calibration with trajectory\-level task optimization\. Extensive experiments on When2Call, BFCL\-V4, and ToolSandbox demonstrate that TRUST consistently improves tool\-calling reliability, multi\-turn interaction quality, and overall agent performance without additional inference overhead\.

## Limitations

Despite its effectiveness, several limitations remain for future exploration\. First, the current framework mainly relies on perplexity\-based uncertainty estimation, while more advanced semantic or trajectory\-level uncertainty modeling could further improve calibration\. Second, our experiments focus on text\-based tool\-use benchmarks with predefined action spaces\. Extending TRUST to more dynamic settings, such as embodied agents or open\-world tool ecosystems, is an important direction for future work\.

## Ethical Considerations

This work aims to improve the tool\-calling decision capability and reliability of LLM\-based agents by proposing TRUST, an uncertainty\-aware reward framework for post\-training tool\-calling decisions\. By aligning decision confidence with correctness, our method improves the turn\-level tool\-calling decision accuracy and the trajectory\-level task performance, while reducing hallucinated or unsupported tool use and mitigating error propagation in multi\-turn interactions\. All datasets and reproduced baselines used in this work are publicly available, properly cited, and follow their original licenses and terms of use\. While we do not anticipate direct severe societal risks from the proposed method itself, more capable agentic systems may still be misused for generating misleading content or unsafe automated behaviors\. Future work should further strengthen safety mechanisms such as content moderation and risk\-aware deployment protocols\.

## References

- A\. Bercovich, I\. Levy, I\. Golan, M\. Dabbah, R\. El\-Yaniv, O\. Puny, I\. Galil, Z\. Moshe, T\. Ronen, N\. Nabwani, I\. Shahaf, O\. Tropp, E\. Karpas, R\. Zilberstein, J\. Zeng, S\. Singhal, A\. Bukharin, Y\. Zhang, T\. Konuk, G\. Shen, A\. S\. Mahabaleshwarkar, B\. Kartal, Y\. Suhara, O\. Delalleau, Z\. Chen, Z\. Wang, D\. Mosallanezhad, A\. Renduchintala, H\. Qian, D\. Rekesh, F\. Jia, S\. Majumdar, V\. Noroozi, W\. U\. Ahmad, S\. Narenthiran, A\. Ficek, M\. Samadi, J\. Huang, S\. Jain, I\. Gitman, I\. Moshkov, W\. Du, S\. Toshniwal, G\. Armstrong, B\. Kisacanin, M\. Novikov, D\. Gitman, E\. Bakhturina, P\. Varshney, M\. Narsimhan, J\. P\. Scowcroft, J\. Kamalu, D\. Su, K\. Kong, M\. Kliegl, R\. K\. Mahabadi, Y\. Lin, S\. Satheesh, J\. Parmar, P\. Gundecha, B\. Norick, J\. Jennings, S\. Prabhumoye, S\. N\. Akter, M\. Patwary, A\. Khattar, D\. Narayanan, R\. Waleffe, J\. Zhang, B\. Su, G\. Huang, T\. Kong, P\. Chadha, S\. Jain, C\. Harvey, E\. Segal, J\. Huang, S\. Kashirsky, R\. McQueen, I\. Putterman, G\. Lam, A\. Venkatesan, S\. Wu, V\. Nguyen, M\. Kilaru, A\. Wang, A\. Warno, A\. Somasamudramath, S\. Bhaskar, M\. Dong, N\. Assaf, S\. Mor, O\. U\. Argov, S\. Junkin, O\. Romanenko, P\. Larroy, M\. Katariya, M\. Rovinelli, V\. Balas, N\. Edelman, A\. Bhiwandiwalla, M\. Subramaniam, S\. Ithape, K\. Ramamoorthy, Y\. Wu, S\. V\. Velury, O\. Almog, J\. Daw, D\. Fridman, E\. Galinkin, M\. Evans, S\. Ghosh, K\. Luna, L\. Derczynski, N\. Pope, E\. Long, S\. Schneider, G\. Siman, T\. Grzegorzek, P\. Ribalta, M\. Katariya, C\. Alexiuk, J\. Conway, T\. Saar, A\. Guan, K\. Pawelec, S\. Prayaga, O\. Kuchaiev, B\. Ginsburg, O\. Olabiyi, K\. Briski, J\. Cohen, B\. Catanzaro, J\. Alben, Y\. Geifman, and E\. Chung \(2025\)Llama\-nemotron: efficient reasoning models\.External Links:2505\.00949,[Link](https://arxiv.org/abs/2505.00949)Cited by:[§B\.2](https://arxiv.org/html/2606.06976#A2.SS2.p1.4)\.
- Geometrically\-constrained agent for spatial reasoning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 38689–38699\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- J\. Duan, J\. Diffenderfer, S\. Madireddy, T\. Chen, B\. Kailkhura, and K\. Xu \(2025\)Uprop: investigating the uncertainty propagation of llms in multi\-step agentic decision\-making\.arXiv preprint arXiv:2506\.17419\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Healy, B\. Srinivasan, V\. Madathil, and J\. Wu \(2026\)Internal representations as indicators of hallucinations in agent tool selection\.arXiv preprint arXiv:2601\.05214\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson,et al\.\(2022\)Language models \(mostly\) know what they know\.arXiv preprint arXiv:2207\.05221\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Kirchhof, G\. Kasneci, and E\. Kasneci \(2025\)Position: uncertainty quantification needs reassessment for large\-language model agents\.arXiv preprint arXiv:2505\.22655\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\)Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\.arXiv preprint arXiv:2302\.09664\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2606.06976#S3.SS2.p1.2)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)Teaching models to express their uncertainty in words\.arXiv preprint arXiv:2205\.14334\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Lin, Y\. Ning, J\. Zhang, Y\. Dong, Y\. Liu, Y\. Wu, X\. Qi, N\. Sun, Y\. Shang, P\. Cao,et al\.\(2025\)LLM\-based agents suffer from hallucinations: a survey of taxonomy, methods, and directions\.arXiv preprint arXiv:2509\.18970\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- J\. Lu, T\. Holleis, Y\. Zhang, B\. Aumayer, F\. Nan, H\. Bai, S\. Ma, S\. Ma, M\. Li, G\. Yin,et al\.\(2025\)Toolsandbox: a stateful, conversational, interactive evaluation benchmark for llm tool use capabilities\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 1160–1183\.Cited by:[2nd item](https://arxiv.org/html/2606.06976#A1.I1.i2.p1.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px1.p1.1)\.
- X\. Lu, Z\. Chen, X\. Hu, Y\. Zhou, W\. Zhang, D\. Liu, L\. Sheng, and J\. Shao \(2026a\)Is\-bench: evaluating interactive safety of vlm\-driven embodied agents in daily household tasks\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.40,pp\. 35680–35688\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- X\. Lu, Y\. Zhou, Z\. Chen, R\. Wang, B\. Sima, E\. Zhou, L\. Sheng, D\. Liu, and J\. Shao \(2026b\)HomeGuard: vlm\-based embodied safeguard for identifying contextual risk in household task\.arXiv preprint arXiv:2603\.14367\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- P\. Lymperopoulos and V\. Sarathy \(2025\)Tools in the loop: quantifying uncertainty of llm question answering systems that use tools\.arXiv preprint arXiv:2505\.16113\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Manakul, A\. Liusie, and M\. Gales \(2023\)Selfcheckgpt: zero\-resource black\-box hallucination detection for generative large language models\.InProceedings of the 2023 conference on empirical methods in natural language processing,pp\. 9004–9017\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- MiniMax \(2026\)MiniMax\-m2\.5\.MiniMax\.Note:Hugging Face RepositoryExternal Links:[Link](https://huggingface.co/MiniMaxAI/MiniMax-M2.5)Cited by:[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px2.p1.1)\.
- W\. Modecrua, K\. Kaewtawee, K\. Pachtrachai, and T\. Kraisingkorn \(2026\)Multi\-turn reinforcement learning for tool\-calling agents with iterative reward calibration\.arXiv preprint arXiv:2604\.02869\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Nathawani, I\. Gitman, S\. Majumdar, E\. Bakhturina, A\. S\. Mahabaleshwarkar, J\. Zhang, and J\. P\. Scowcroft \(2025\)Nemotron\-post\-training\-dataset\-v1\.Nvidia\.Note:Hugging Face RepositoryExternal Links:[Link](https://huggingface.co/datasets/nvidia/Nemotron-Post-Training-Dataset-v1)Cited by:[§B\.2](https://arxiv.org/html/2606.06976#A2.SS2.p1.4)\.
- C\. Oh, S\. Park, T\. E\. Kim, J\. Li, W\. Li, S\. Yeh, X\. Du, H\. Hassani, P\. Bogdan, D\. Song, and S\. Li \(2026\)Uncertainty quantification in llm agents: foundations, emerging challenges, and opportunities\.arXiv preprint arXiv:2602\.05073\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- S\. G\. Patil, H\. Mao, F\. Yan, C\. C\. Ji, V\. Suresh, I\. Stoica, and J\. E\. Gonzalez \(2025\)The berkeley function calling leaderboard \(BFCL\): from tool use to agentic evaluation of large language models\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 48371–48392\.Cited by:[3rd item](https://arxiv.org/html/2606.06976#A1.I1.i3.p1.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px1.p1.1)\.
- Y\. Qin, S\. Liang, Y\. Ye, K\. Zhu, L\. Yan, Y\. Lu, Y\. Lin, X\. Cong, X\. Tang, B\. Qian,et al\.\(2024\)Toolllm: facilitating large language models to master 16000\+ real\-world apis\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 9695–9717\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Ross, A\. S\. Mahabaleshwarkar, and Y\. Suhara \(2025\)When2Call: when \(not\) to call tools\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 3391–3409\.Cited by:[1st item](https://arxiv.org/html/2606.06976#A1.I1.i1.p1.1),[§1](https://arxiv.org/html/2606.06976#S1.p1.1),[§1](https://arxiv.org/html/2606.06976#S1.p2.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.06976#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px1.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.External Links:2302\.04761,[Link](https://arxiv.org/abs/2302.04761)Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px3.p1.1)\.
- J\. L\. Stoisser, M\. B\. Martell, L\. Phillips, G\. Mazzoni, L\. M\. Harder, P\. Torr, J\. Ferkinghoff\-Borg, K\. Martens, and J\. Fauqueur \(2025\)Towards agents that know when they don’t know: uncertainty as a control signal for structured reasoning\.External Links:2509\.02401,[Link](https://arxiv.org/abs/2509.02401)Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p2.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Su, J\. Luo, C\. Liu, X\. Yang, Y\. Zhang, Y\. Dong, and J\. Zhu \(2025\)A survey on autonomy\-induced security risks in large model\-based agents\.arXiv preprint arXiv:2506\.23844\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- C\. Sun, L\. Liu, G\. Yan, Z\. Wang, and T\. Weng \(2026\)LLM agents already know when to call tools–even without reasoning\.arXiv preprint arXiv:2605\.09252\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Suri, P\. Mathur, N\. Lipka, F\. Dernoncourt, R\. A\. Rossi, and D\. Manocha \(2025\)Structured uncertainty guided clarification for llm agents\.arXiv preprint arXiv:2511\.08798\.Cited by:[2nd item](https://arxiv.org/html/2606.06976#A1.I2.i2.p1.1),[§1](https://arxiv.org/html/2606.06976#S1.p2.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px2.p1.1)\.
- W\. Wang, S\. Juluan, Z\. Ling, Y\. Chan, C\. Wang, C\. Lee, Y\. Yuan, J\. Huang, W\. Jiao, and M\. R\. Lyu \(2025\)Learning to ask: when LLM agents meet unclear instruction\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 21773–21784\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1104)Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Wu, S\. Das, M\. Amani, A\. Nag, S\. Lee, K\. P\. Gummadi, A\. Ravichander, and M\. B\. Zafar \(2026\)To call or not to call: a framework to assess and optimize llm tool calling\.arXiv preprint arXiv:2605\.00737\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p3.2),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px3.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\)ReAct: synergizing reasoning and acting in language models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- D\. Zhang, X\. Liu, L\. Cheng, Y\. Wang, K\. Murray, and H\. Wei \(2026a\)SELAUR: self evolving llm agent via uncertainty\-aware rewards\.arXiv preprint arXiv:2602\.21158\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zhang, P\. K\. Choubey, K\. Huang, C\. Xiong, and C\. Wu \(2026b\)Agentic uncertainty quantification\.arXiv preprint arXiv:2601\.15703\.Cited by:[1st item](https://arxiv.org/html/2606.06976#A1.I2.i1.p1.1),[§1](https://arxiv.org/html/2606.06976#S1.p2.1),[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Zhang, K\. Song, X\. Wang, Y\. Hu, W\. Yan, C\. Zhao, H\. P\. Zou, H\. Deng, S\. R\. Indurthi, S\. Liu, S\. Ma, X\. Wang, X\. E\. Wang, and S\. Wang \(2026c\)CM2: reinforcement learning with checklist rewards for multi\-turn and multi\-step agentic tool use\.External Links:2602\.12268,[Link](https://arxiv.org/abs/2602.12268)Cited by:[§B\.2](https://arxiv.org/html/2606.06976#A2.SS2.p1.4),[§3\.3](https://arxiv.org/html/2606.06976#S3.SS3.p1.2),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.06976#S4.SS1.SSS0.Px3.p1.1)\.
- Z\. Zhang, S\. Cui, Y\. Lu, J\. Zhou, J\. Yang, H\. Wang, and M\. Huang \(2025\)Agent\-safetybench: evaluating the safety of llm agents\.arXiv preprint arXiv:2412\.14470\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.
- H\. Zhong, J\. Zhai, L\. Song, J\. Bian, Q\. Liu, and T\. Tan \(2026\)RC\-grpo: reward\-conditioned group relative policy optimization for multi\-turn tool calling agents\.arXiv preprint arXiv:2602\.03025\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Zhou, Y\. Ge, W\. Xie, L\. Zeng, X\. Dong, and Y\. Wang \(2024\)How out\-of\-distribution detection learning theory enhances transformer: learnability and reliability\.arXiv preprint arXiv:2406\.12915\.Cited by:[§2](https://arxiv.org/html/2606.06976#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhou, X\. Lu, D\. Liu, J\. Yan, and J\. Shao \(2026\)INFA\-guard: mitigating malicious propagation via infection\-aware safeguarding in llm\-based multi\-agent systems\.arXiv preprint arXiv:2601\.14667\.Cited by:[§1](https://arxiv.org/html/2606.06976#S1.p1.1)\.

## Appendix A Details of Benchmarks and Baselines

### A\.1 Benchmarks

We evaluate the TRUST framework on three benchmarks:

- When2Call \(Ross et al\., [2025](https://arxiv.org/html/2606.06976#bib.bib4)\): a natural benchmark for next\-action decision training and testing, as it explicitly evaluates whether an agent should answer directly, call a tool, ask a follow\-up question, or admit that the provided tools cannot answer the request\.
- ToolSandbox \(Lu et al\., [2025](https://arxiv.org/html/2606.06976#bib.bib19)\): evaluates stateful, conversational, and interactive tool use, with scenarios such as canonicalization, tool scrambled, and insufficient information that require the agent to decide when and how to execute tools versus ask or wait for more context\.
- BFCL\-V4 \(Patil et al\., [2025](https://arxiv.org/html/2606.06976#bib.bib20)\): evaluates function\-calling reliability across single, multiple, parallel, multi\-turn, and relevance\-detection settings, where irrelevant or unsuitable function sets directly test whether the model can abstain from unsupported tool calls\.

### A\.2 Baselines

We compare against two training\-free uncertainty\-aware agent baselines:

- AUQ \(Zhang et al\., [2026b](https://arxiv.org/html/2606.06976#bib.bib17)\): uses verbalized confidence and explanations to control long\-horizon agent execution, propagating uncertainty through memory and invoking reflection only when confidence is low\.
- SAGE \(Suri et al\., [2025](https://arxiv.org/html/2606.06976#bib.bib9)\): a structured uncertainty\-guided clarification framework that models uncertainty over tool calls and argument domains, then uses EVPI\-based scoring to decide whether to ask a targeted clarification question or execute the best tool call\.

## Appendix B Detailed Setups

### B\.1 Prompts

Prompts for the LLM labeler can be found in Box [D\.3](https://arxiv.org/html/2606.06976#A4.SS3), and prompts for the LLM judger used in assigning the tool\-calling decision reward can be found in Box [D\.3](https://arxiv.org/html/2606.06976#A4.SS3)\. The judger only determines the decision category of the agent’s response; it does not produce the reward $R_{\text{UQ}}$ in one step\.

### B\.2 Training Settings

For turn\-level training based on Qwen3\-4B\-Thinking, we train GRPO for 1 epoch\. We sample 4 rollouts per question as the group size\. The learning rate is set to $3\times 10^{-5}$, and the KL divergence loss coefficient is set to 0\.001\.

For trajectory\-level training based on Qwen3\-8B\-Base, we follow the training settings in Zhang et al\. \([2026c](https://arxiv.org/html/2606.06976#bib.bib8)\) for unified post\-training, including cold\-start SFT and RL\. All training data is from the nvidia/Nemotron\-Post\-Training\-Dataset\-v1 dataset Bercovich et al\. \([2025](https://arxiv.org/html/2606.06976#bib.bib35)\); Nathawani et al\. \([2025](https://arxiv.org/html/2606.06976#bib.bib25)\) and filtered by Zhang et al\. \([2026c](https://arxiv.org/html/2606.06976#bib.bib8)\)\. nvidia/Nemotron\-Post\-Training\-Dataset\-v1 contains 310k synthetic tool\-use dialogues across diverse domains\. For SFT, we use the cold\-start training set in Zhang et al\. \([2026c](https://arxiv.org/html/2606.06976#bib.bib8)\), which has 8k trajectories\. We apply SFT as a cold start with a learning rate of $3\times 10^{-6}$ and a warmup ratio of 0\.1\. For RL, the corresponding training set in Zhang et al\. \([2026c](https://arxiv.org/html/2606.06976#bib.bib8)\) is further annotated by TRUST and filtered to balance the turn\-level decision distribution, which leaves 4k trajectories from the original 8k\. We then optimize from the cold\-start SFT checkpoint using GRPO based on VeRL\. The mini\-batch size is set to 128, the KL divergence loss coefficient to 0\.001, and the learning rate to $3\times 10^{-6}$ for GRPO\. The group size is 16 per trajectory\. GRPO training takes 400 GPU hours\. All experiments are conducted on 8$\times$ NVIDIA H200 GPUs using DeepSpeed ZeRO\-3 optimization\.
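To make the role of the group size concrete, the sketch below shows the group\-relative advantage commonly used in GRPO: each rollout's reward is normalized by the mean and standard deviation of its own group\. The function name `group_relative_advantages`, the population standard deviation, and the `eps` stabilizer are assumptions of this sketch; whether the authors' VeRL configuration uses exactly this normalization is not stated here\.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against its own group.

    `rewards` holds one scalar reward per rollout of the same question,
    e.g. 4 rollouts in the turn-level setting described above.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for one question with a group of 4 rollouts.
advantages = group_relative_advantages([3.0, 1.0, 2.0, 2.0])
print(advantages)
```

Rollouts that score above their group's mean receive positive advantages and are reinforced, so a group in which all rollouts earn the same reward contributes no learning signal\.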

## Appendix C Additional Experimental Results

Detailed experimental results on BFCL\-V4 can be found in Tables [4](https://arxiv.org/html/2606.06976#A4.T4)–[7](https://arxiv.org/html/2606.06976#A4.T7), where BFCL\-V4 is decomposed into four categories: Non\-Live \(Table [4](https://arxiv.org/html/2606.06976#A4.T4)\), Live \(Table [5](https://arxiv.org/html/2606.06976#A4.T5)\), Multi\-Turn \(Table [6](https://arxiv.org/html/2606.06976#A4.T6)\), and Agentic \(Table [7](https://arxiv.org/html/2606.06976#A4.T7)\)\. Ablation study results on When2Call are presented in Table [8](https://arxiv.org/html/2606.06976#A4.T8)\. For trajectory\-level post\-training from Qwen3\-8B\-Base, ablation results on BFCL\-V4 are further broken down into Non\-Live \(Table [9](https://arxiv.org/html/2606.06976#A4.T9)\), Live \(Table [10](https://arxiv.org/html/2606.06976#A4.T10)\), Multi\-Turn \(Table [11](https://arxiv.org/html/2606.06976#A4.T11)\), and Agentic \(Table [12](https://arxiv.org/html/2606.06976#A4.T12)\) categories\. Ablation results on ToolSandbox are provided in Table [13](https://arxiv.org/html/2606.06976#A4.T13)\.

### C\.1 Results on BFCL\-V4

Tables [4](https://arxiv.org/html/2606.06976#A4.T4)–[7](https://arxiv.org/html/2606.06976#A4.T7) report detailed results on BFCL\-V4 across Non\-Live, Live, Multi\-Turn, and Agentic categories\. Overall, TRUST consistently improves tool\-calling performance under both turn\-level and trajectory\-level training settings, with particularly large gains on complex multi\-turn and agentic scenarios\.

#### Non\-Live and Live tool calling\.

On relatively direct tool\-calling categories, such as Non\-Live and Live, the benefit of trajectory\-level optimization is already evident\. As shown in Tables [4](https://arxiv.org/html/2606.06976#A4.T4) and [5](https://arxiv.org/html/2606.06976#A4.T5), Traj\.\-level TRUST improves over Traj\.\-level CM2 from 80\.12% to 82\.27% on Non\-Live, and from 74\.58% to 75\.57% on Live\. The gains are moderate but consistent, suggesting that these categories can be effectively optimized with trajectory\-level feedback\.

The ablation results further show that $c(s)$ is less critical for these relatively simple single\-turn settings\. For example, in Tables [9](https://arxiv.org/html/2606.06976#A4.T9) and [10](https://arxiv.org/html/2606.06976#A4.T10), \+TRUST w/o $c(s)$ already reaches 78\.90% on Non\-Live and 74\.17% on Live, which are close to the SFT or CM2 baselines\. On some subcategories, such as Live Parallel Multiple, \+TRUST w/o $c(s)$ even matches the full TRUST score of 54\.17%\. These results suggest that for direct tool\-calling tasks where the decision boundary is relatively clear, trajectory\-level answer and format rewards can already provide useful learning signals, even without an explicit classification reward\.

#### Multi\-turn tool calling\.

The advantage of TRUST becomes more pronounced in Multi\-Turn scenarios, which require maintaining context, resolving missing information, and making correct tool\-use decisions across turns\. As shown in Table [6](https://arxiv.org/html/2606.06976#A4.T6), Traj\.\-level TRUST improves over Traj\.\-level CM2 from 22\.50% to 29\.88%, a substantial gain of 7\.38 points\. The improvement is consistent across all subcategories: Multi Turn Base improves from 34\.00% to 40\.50%, Missing Function from 21\.00% to 26\.50%, Missing Parameter from 18\.00% to 28\.00%, and Long Context from 17\.00% to 24\.50%\.
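The per\-subcategory gains quoted above can be checked with a few lines of arithmetic\. The scores are copied from the paragraph; that the overall Multi\-Turn score is an unweighted mean of the four subcategories is an inference from these numbers \(the means match the reported 22\.50% and 29\.88% after rounding\), not something this section states, and the variable names are illustrative\.

```python
from typing import Iterable

# Multi-Turn subcategory scores (%) quoted in the paragraph above.
cm2 = {"Multi Turn Base": 34.00, "Missing Function": 21.00,
       "Missing Parameter": 18.00, "Long Context": 17.00}
trust = {"Multi Turn Base": 40.50, "Missing Function": 26.50,
         "Missing Parameter": 28.00, "Long Context": 24.50}


def mean(values: Iterable[float]) -> float:
    """Unweighted mean; assumed here to be the aggregation rule."""
    items = list(values)
    return sum(items) / len(items)


# Absolute gain of TRUST over CM2 for each subcategory, in points.
gains = {name: trust[name] - cm2[name] for name in cm2}

print(gains)
print(mean(cm2.values()), mean(trust.values()))
```

Missing Parameter shows the largest gain, which fits the claim that the method helps most when the model must recognize incomplete user intent\.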

These gains indicate that TRUST is particularly effective in higher\-order tool\-calling settings where the model must reason over dialogue history and incomplete user intents\. Compared with simpler Non\-Live and Live categories, Multi\-Turn tasks rely more heavily on correctly identifying whether, when, and how to invoke tools\. Therefore, the uncertainty repulsive reward $cR_{\text{cls}}$ becomes more important: without it, the model may still learn from successful trajectories, but it lacks a direct signal for distinguishing tool\-use versus non\-tool\-use decisions across turns\.

#### Agentic tool calling\.

Large trajectory\-level gains also appear in the Agentic category, which includes Web Search and Memory tasks\. As shown in Table [7](https://arxiv.org/html/2606.06976#A4.T7), Traj\.\-level TRUST significantly improves over Traj\.\-level CM2 from 21\.97% to 28\.79% on the overall Agentic score, a gain of 6\.82 points\. The improvement is especially clear on Memory, where TRUST improves from 21\.94% to 30\.57%\. Within Memory, Key Value Store increases from 9\.68% to 21\.00%, Vector Store from 14\.19% to 24\.55%, and Rec Sum from 41\.94% to 46\.16%\.

These results highlight the importance of the proposed reward design for high\-level agentic tool use\. Agentic tasks require not only producing correct tool arguments, but also deciding when external search, memory retrieval, or memory update is necessary\. In such settings, $cR_{\text{cls}}$ provides an essential signal for tool\-use decision making\. Compared with Non\-Live and Live categories, where \+TRUST w/o $cR_{\text{cls}}$ can already obtain reasonable performance, Agentic tasks benefit much more from the full reward\. This suggests that explicit classification supervision is critical for complex agent behaviors involving planning, memory, and external information access\.

### C\.2 Ablation Study on When2Call

Table [8](https://arxiv.org/html/2606.06976#A4.T8) presents the ablation study on the When2Call test set\. Since When2Call directly evaluates whether the model calls tools at the right time, it provides a focused benchmark for analyzing the contribution of different reward components, especially $c(s)$\.

#### Effect of reward components under turn\-level training\.

Starting from Qwen3-4B-Thinking, turn-level TRUST achieves the best overall Acc Norm of 80.83% and F1 of 82.84%, outperforming the base checkpoint by 11.47 points in Acc Norm and 7.07 points in F1. Removing different reward components leads to different failure modes. Without $R_{\text{ans}}$, the model still obtains a strong Acc Norm of 78.97%, suggesting that classification and format rewards can already guide the model toward better tool-use decisions. However, without $c(s)$, Acc Norm drops to 72.46% and FDAR increases to 24.76%, indicating that the model becomes worse at deciding when not to call tools. Without $R_{\text{fmt}}$, the model obtains 75.90% Acc Norm, but Tool Hall remains relatively high at 24.03%.

These results show that each reward component contributes differently: $R_{\text{ans}}$ improves final task correctness, $R_{\text{fmt}}$ stabilizes valid tool-call formatting, and $c(s)$ is especially important for suppressing incorrect tool-use decisions. The sharp FDAR increase after removing $c(s)$ confirms that the $c(s)$-weighted classification reward plays a key role in calibrating when tools should or should not be invoked.

#### Trajectory\-level ablation from Qwen3\-8B\-Base\.

The importance of $c(s)$ is even clearer in the trajectory-level setting. Starting from Qwen3-8B-Base, SFT alone achieves only 19.20% Acc Norm and 27.29% F1, showing that supervised fine-tuning is insufficient under this evaluation protocol. Adding Traj.-level CM2 improves Acc Norm to 43.75%, but it also produces a very high Tool Hall score of 75.43%, suggesting severe over-calling or hallucinated tool use.

In contrast, + Traj.-level TRUST w/o $c(s)$ reaches only 31.43% Acc Norm and 26.19% F1, with FDAR increasing to 27.14%. This indicates that trajectory-level optimization without the $c(s)$ weighting provides limited guidance for deciding whether tools should be used. The full + Traj.-level TRUST substantially improves Acc Norm to 62.32% and F1 to 60.62%, while reducing Tool Hall to 49.87% and FDAR to 8.39%. Compared with the w/o $c(s)$ variant, the full reward improves Acc Norm by 30.89 points and F1 by 34.43 points.

These results demonstrate that $c(s)$ is crucial for When2Call-style decision making. While the w/o $c(s)$ variant can perform reasonably well on simpler BFCL categories such as Non-Live and Live, it is insufficient for benchmarks that explicitly require accurate tool-use timing. The full TRUST reward is therefore necessary for learning robust tool-use policies that call tools when needed while avoiding unnecessary or hallucinated tool calls.
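The ablation can be made concrete with a small sketch. The Table 8 caption states that the w/o $c(s)$ setting fixes $c(s)=1$, giving $R_{\text{UQ}} = R_{\text{fmt}} + R_{\text{ans}} + R_{\text{cls}}$. The code below assumes, based only on the notation $cR_{\text{cls}}$, that the full reward multiplies the classification term by $c(s)$. That composition, the `RewardTerms` container, and the numeric values are illustrative assumptions; how $c(s)$ is computed is not covered in this appendix, so it is passed in as a plain number.

```python
from dataclasses import dataclass


@dataclass
class RewardTerms:
    r_fmt: float  # format reward R_fmt
    r_ans: float  # answer reward R_ans
    r_cls: float  # classification reward R_cls


def r_uq(terms: RewardTerms, c_s: float = 1.0) -> float:
    """ASSUMED form: R_UQ = R_fmt + R_ans + c(s) * R_cls.

    With c_s = 1.0 this reduces to the 'w/o c(s)' ablation described
    in the Table 8 caption: R_fmt + R_ans + R_cls.
    """
    return terms.r_fmt + terms.r_ans + c_s * terms.r_cls


# Hypothetical reward values for a single turn (not taken from the paper).
turn = RewardTerms(r_fmt=1.0, r_ans=1.0, r_cls=1.0)

ablated = r_uq(turn)            # c(s) fixed to 1
weighted = r_uq(turn, c_s=0.5)  # hypothetical uncertainty coefficient

print(ablated, weighted)
```

Under this reading, the ablation keeps the classification reward but removes its dependence on the state-level coefficient, which is the difference the When2Call results isolate.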

## Appendix D Case Study

### D.1 Case Study 1: Correct Tool Parameter Usage

Fig. [6](https://arxiv.org/html/2606.06976#A4.F6) shows a failure mode in which the agent reaches a superficially correct final decision while still failing the task at the trajectory level. After correctly entering the `test` directory and identifying `test_file1.txt` and `test_file2.txt`, the baseline incorrectly applies `wc` to the directory name `test` rather than to the two discovered files. As a result, the required intermediate evidence, namely the character counts of the relevant text files, is never obtained. Nevertheless, the model proceeds to update the ticket priority to 2, which happens to coincide with the correct final value because neither of the true counts (20 and 18) exceeds 20. In contrast, TRUST executes the intended file-level counting operations and therefore reaches the same final priority assignment through a valid evidential chain.

### D.2 Case Study 2: Timely Tool Invocation

The second case study (Fig. [7](https://arxiv.org/html/2606.06976#A4.F7)) demonstrates a complementary failure pattern in the baseline: the agent abstains from calling tools when tool use is necessary to advance the task. A missing tool call in an early turn can destabilize the entire downstream interaction and produce a globally incorrect final state. In the first turn, the baseline requests zip code information from the user instead of invoking the available city-to-zipcode and distance estimation tools. This omission prevents the model from grounding the subsequent reasoning about travel feasibility. The error compounds in later turns: the model does not execute the requested fuel-filling action, and by the final turn, the vehicle state still reports 10.0 gallons, directly contradicting the user's requirement to add 30 gallons and end with 40 gallons. By comparison, TRUST correctly performs the tool chain for geographic lookup, distance estimation, fuel feasibility assessment, tank filling, and engine start, uses the right tool in key turns, and completes the task.

### D.3 Case Study 3: Deferring Tool Use under Missing Information

Fig. [8](https://arxiv.org/html/2606.06976#A4.F8) provides a clear example of inappropriate tool invocation under incomplete user specification. In the first turn, the user expresses the intent to add a company's stock to the watchlist but does not provide the company name. The benchmark's reference trajectory therefore leaves this turn empty, indicating that the correct behavior is to defer action and elicit the missing information. Instead, the baseline hallucinates the company as Apple, invokes `get_symbol_by_name`, and adds `AAPL` to the watchlist. Although the model later adds `ZETA` after the user explicitly specifies "Zeta Corp", the watchlist has already been irreversibly contaminated with an incorrect extra entry. The agent trained with TRUST, in contrast, handles the underspecified first turn conservatively by requesting the missing company name and performs tool calls only after the parameter becomes available.

Box 1: Prompt for Labeling JSON Annotation

\#\# You are annotating multi-turn tool-use trajectories for when2call training, namely supervision for deciding when to answer directly, invoke tools, request missing information, or abstain.

\#\# Task:

1. Read the full trajectory, the available tool specifications, and the checklist metadata.
2. Identify decision-critical turns at which a when2call-style supervision signal is both informative and well supported by the observed trajectory.
3. For each selected turn, assign exactly one ground-truth action from the following action space:
   - `direct_answer`: the request can be correctly addressed from the existing conversational context or general knowledge, without additional user input and without tool use.
   - `tool_call`: the appropriate next step is to invoke one or more tools, and the necessary arguments are already available.
   - `request_for_info`: the assistant lacks essential user-provided information, and clarification is the correct immediate next action.
   - `cannot_answer`: the assistant should refuse, abstain, or acknowledge incapability or lack of access, such that further clarification would not resolve the problem.
4. Prefer substantive reasoning over shallow lexical heuristics.
5. Maintain broad coverage over the four action categories; do not overproduce `tool_call` annotations merely because the data originate from a tool-use setting.

\#\# Annotation principles:

- Prioritize turns that are genuinely decision-critical, including missing arguments, unsupported or hallucinated tool usage, inappropriate tool invocation, clarification after tool failure, directly answerable requests, and authentic inability cases.
- When multiple `turn_idx` values are defensible within the same trajectory, prefer the one whose correct action would most improve action-space coverage under the provided balance guidance.
- Avoid producing many redundant annotations of the same class from a single trajectory, especially repeated `request_for_info` cases.
- Do not default to `request_for_info` as a conservative fallback. If the assistant already has sufficient information to answer, choose `direct_answer`. If the assistant already has sufficient information to act, choose `tool_call`. If no realistic clarification would resolve the issue, choose `cannot_answer`.
- Use the entire trajectory, including earlier tool outputs and prior assistant behavior. Later turns often support `direct_answer` or `tool_call` because the required information has already been established upstream.
- The `target_answer` must remain compatible with the when2call reward interface: (1) For `tool_call`, return one or more concrete tool-call XML blocks in the form `<tool_call>{"name": "…", "arguments": {…}}</tool_call>`. (2) For `request_for_info`, return the realized clarification question. (3) For `direct_answer`, return the realized answer text. (4) For `cannot_answer`, return the realized refusal or inability statement.

\#\# Illustrative cases:

- `direct_answer`: the user requests an explanation, rewrite, or straightforward factual transformation that can already be completed without tools.
- `tool_call`: the user has supplied all required slots, and the correct next action is immediate tool execution.
- `request_for_info`: a decisive slot required for answering or tool execution is missing from the user.
- `cannot_answer`: the available toolset cannot solve the request, or the assistant lacks the necessary authority, capability, or access, and clarification would not remedy the limitation.

\#\# Return strict JSON:

{ "turn_idx": Current `turn_idx` in data, "gt_action": Action for this turn in `direct_answer`/`tool_call`/`request_for_info`/`cannot_answer`, "rationale": Your explanation for `gt_action`, "target_answer": Real response in data }

\#\# Here are two illustrative annotation examples:

- Example 1: { "turn_idx": 2, "gt_action": "request_for_info", "rationale": "Missing province ID required to continue the request correctly.", "target_answer": "Could you please provide me with the specific province ID for Bangkok?" }
- Example 2: { "turn_idx": 0, "gt_action": "tool_call", "rationale": "All required parameters are available. The assistant should call the tool immediately.", "target_answer": "<tool\_call\>{"name": "a", "arguments": {"account\_id": "123"}}</tool\_call\>" }

\#\# Dataset balance guidance:

- Current accepted annotation counts: `direct_answer`={DIRECT_ANSWER_COUNT}, `tool_call`={TOOL_CALL_COUNT}, `request_for_info`={REQUEST_FOR_INFO_COUNT}, `cannot_answer`={CANNOT_ANSWER_COUNT}
- Currently underrepresented action categories: {UNDERREPRESENTED_ACTIONS}
- Currently overrepresented action categories: {OVERREPRESENTED_ACTIONS}
- If multiple `turn_idx` values within the same trajectory are equally well justified, prefer the one whose correct `gt_action` belongs to an underrepresented category.
- Prefer valid non-`tool_call` annotations when `tool_call` is already overrepresented.
- Favor a small number of high-value annotations per trajectory rather than many redundant ones.
- Do not force an annotation whose action is not supported by the trajectory evidence.

\#\# Trajectory payload:

{ "tools": {TOOLS_JSON}, "turn_summaries": {TURN_SUMMARIES_JSON}, "checklists": {CHECKLISTS_JSON} }
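Because the prompt asks for strict JSON with a fixed action space, annotator output can be checked mechanically before it is used for training. The following is a minimal sketch of such a check. The `validate_annotation` helper and its error messages are hypothetical and not part of the paper's pipeline; only the four keys, the four action names, and the `<tool_call>` convention come from the prompt above.

```python
import json

# The action space and required keys are taken from the Box 1 prompt.
ACTIONS = {"direct_answer", "tool_call", "request_for_info", "cannot_answer"}
REQUIRED_KEYS = {"turn_idx", "gt_action", "rationale", "target_answer"}


def validate_annotation(raw: str) -> dict:
    """Parse one annotation and raise ValueError if it violates the schema."""
    record = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(record, dict):
        raise ValueError("annotation must be a JSON object")

    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    turn_idx = record["turn_idx"]
    if isinstance(turn_idx, bool) or not isinstance(turn_idx, int) or turn_idx < 0:
        raise ValueError("turn_idx must be a non-negative integer")

    if record["gt_action"] not in ACTIONS:
        raise ValueError(f"unknown gt_action: {record['gt_action']!r}")

    # The prompt requires tool_call targets to contain <tool_call> blocks.
    if record["gt_action"] == "tool_call" and "<tool_call>" not in record["target_answer"]:
        raise ValueError("tool_call annotations need a <tool_call> block")

    return record


# Example 1 from the prompt, written as strict JSON.
example = json.dumps({
    "turn_idx": 2,
    "gt_action": "request_for_info",
    "rationale": "Missing province ID required to continue the request correctly.",
    "target_answer": "Could you please provide me with the specific province ID for Bangkok?",
})
checked = validate_annotation(example)
print(checked["gt_action"])
```

Rejecting malformed records at this stage keeps invalid action labels from silently entering the key-turn annotation set.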

Box 2: Prompt for LLM Judger in TRUST

\#\# You are a structured parser for assistant turns in multi-turn tool-use trajectories.

\#\# Given the history of completed assistant turns within a multi-turn trajectory:

1. Judge the assistant's next action from the current observed turn.
2. Map the action to exactly one normalized action category:
   - `direct_answer`: the assistant provides a substantive answer directly, without invoking a tool and without requesting additional user information.
   - `tool_call`: the assistant initiates one or more tool calls as the immediate next action.
   - `request_for_info`: the assistant asks the user for missing information that is necessary before answering or executing a tool call.
   - `cannot_answer`: the assistant indicates inability, refusal, lack of capability, or lack of access such that further clarification would not resolve the issue.
3. Extract the realized response content in a when2call-compatible representation:
   - If the action is `tool_call`, return only the tool-call JSON content, without any surrounding prose.
   - Otherwise, return the assistant's realized natural-language response exactly as expressed in the turn.
4. Ground the decision in the realized turn behavior rather than in hypothetical alternatives.

\#\# The annotation of this turn: {JSON_ANNOTATION}

\#\# If the assistant has conducted `gt_action` in `JSON_ANNOTATION` in this turn, choose `pred_action` the same as `gt_action`.

\#\# Return strict JSON. You MUST follow this Output Schema:

{ "pred_action": Action for this turn in `direct_answer`/`tool_call`/`request_for_info`/`cannot_answer`, "pred_answer": Response of the assistant in this turn, "rationale": "The turn is a clarification request because a required argument is not yet available. The assistant requests a missing identifier before any tool execution." }

\#\# Here is an example:

{ "pred_action": "request_for_info", "pred_answer": "Could you provide the account ID?", "rationale": "The turn is a clarification request because a required argument is not yet available. The assistant requests a missing identifier before any tool execution." }
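The judger's output shares the action space of the key-turn annotation, so the two can be compared directly. The sketch below shows one way this comparison might look. The binary `action_match` score is an illustrative assumption and should not be read as the paper's definition of $R_{\text{cls}}$; the helper names are hypothetical, and only the field names and action labels come from the two prompts.

```python
import json

# Action labels shared by the annotation prompt and the judger prompt.
ACTIONS = {"direct_answer", "tool_call", "request_for_info", "cannot_answer"}


def parse_judge_output(raw: str) -> dict:
    """Parse the judger's strict-JSON reply and check the output schema."""
    out = json.loads(raw)
    for key in ("pred_action", "pred_answer", "rationale"):
        if key not in out:
            raise ValueError(f"judger output is missing {key!r}")
    if out["pred_action"] not in ACTIONS:
        raise ValueError(f"unknown pred_action: {out['pred_action']!r}")
    return out


def action_match(judge_raw: str, annotation: dict) -> float:
    """ASSUMED scoring: 1.0 if the judged action equals gt_action, else 0.0."""
    pred = parse_judge_output(judge_raw)
    return 1.0 if pred["pred_action"] == annotation["gt_action"] else 0.0


# The example reply from the judger prompt, paired with hypothetical annotations.
judge_reply = json.dumps({
    "pred_action": "request_for_info",
    "pred_answer": "Could you provide the account ID?",
    "rationale": "The turn is a clarification request because a required "
                 "argument is not yet available.",
})
matching = {"turn_idx": 2, "gt_action": "request_for_info"}
conflicting = {"turn_idx": 2, "gt_action": "tool_call"}

print(action_match(judge_reply, matching), action_match(judge_reply, conflicting))
```

Keeping the parser separate from the scoring function makes it easy to swap the binary match for a graded signal without touching schema validation.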

Table 4: Detailed comparison experiment results of the Non-Live category on BFCL-V4.

Table 5: Detailed comparison experiment results of the Live category on BFCL-V4.

Table 6: Detailed comparison experiment results of the Multi-Turn category on BFCL-V4.

Table 7: Detailed comparison experiment results of the Agentic category on BFCL-V4. The Agentic category contains two subcategories, i.e., Web Search and Memory.

| Models | Agentic | Web Search | Snippet | No Snippet | Memory | Key Value Store | Vector Store | Rec Sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Open-source Baselines* | | | | | | | | |
| 235B-A22B-Instruct | 35.80 | 44.50 | 57.00 | 32.00 | 27.10 | 14.19 | 16.13 | 50.97 |
| 30B-A3B-Instruct | 20.07 | 22.50 | 21.00 | 24.00 | 17.63 | 9.03 | 9.03 | 34.84 |
| 4B-Thinking | 9.19 | 6.00 | 8.00 | 4.00 | 12.37 | 10.65 | 15.81 | 10.65 |
| 8B-Thinking | 13.31 | 12.00 | 15.00 | 9.00 | 14.62 | 5.16 | 7.10 | 31.61 |
| *From Qwen3-5B-Thinking* | | | | | | | | |
| AUQ | 11.72 | 3.00 | 4.00 | 2.00 | 20.43 | 13.55 | 19.35 | 28.39 |
| SAGE | 8.26 | 3.50 | 4.00 | 3.00 | 13.02 | 6.57 | 15.81 | 16.67 |
| Turn-level GRPO | 13.76 | 3.00 | 4.00 | 2.00 | 24.52 | 12.90 | 20.00 | 40.65 |
| Turn-level TRUST | 13.98 | 3.00 | 4.00 | 2.00 | 24.95 | 12.90 | 23.87 | 38.06 |
| *From Qwen3-4B-Thinking* | | | | | | | | |
| AUQ | 14.85 | 14.00 | 17.00 | 11.00 | 15.70 | 9.03 | 14.84 | 23.23 |
| SAGE | 15.61 | 9.50 | 12.00 | 7.00 | 21.72 | 7.10 | 17.42 | 40.65 |
| *From Qwen3-8B-Base* | | | | | | | | |
| Traj.-level CM2 | 21.97 | 22.00 | 31.00 | 13.00 | 21.94 | 9.68 | 14.19 | 41.94 |
| Traj.-level TRUST | 28.79 | 27.00 | 32.00 | 22.00 | 30.57 | 21.00 | 24.55 | 46.16 |
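The reported numbers in Table 7 appear consistent with simple unweighted averaging: Web Search as the mean of Snippet and No Snippet, Memory as the mean of its three subtasks, and the Agentic score as the mean of Web Search and Memory. This is an observation from the reported values rather than a stated definition, so the sketch below treats it as an assumption and checks it on the two Qwen3-8B-Base rows. The dictionary keys and the `is_consistent` helper are illustrative names.

```python
from statistics import mean

# Rows copied from Table 7 (Qwen3-8B-Base group); keys are illustrative.
rows = {
    "Traj.-level CM2": {
        "agentic": 21.97, "web": 22.00, "snippet": 31.00, "no_snippet": 13.00,
        "memory": 21.94, "kv": 9.68, "vector": 14.19, "rec_sum": 41.94,
    },
    "Traj.-level TRUST": {
        "agentic": 28.79, "web": 27.00, "snippet": 32.00, "no_snippet": 22.00,
        "memory": 30.57, "kv": 21.00, "vector": 24.55, "rec_sum": 46.16,
    },
}


def is_consistent(row: dict, tol: float = 0.01) -> bool:
    """Check the ASSUMED aggregation: unweighted means at each level."""
    web = mean([row["snippet"], row["no_snippet"]])
    memory = mean([row["kv"], row["vector"], row["rec_sum"]])
    agentic = mean([web, memory])
    return (
        abs(web - row["web"]) <= tol
        and abs(memory - row["memory"]) <= tol
        and abs(agentic - row["agentic"]) <= tol
    )


consistency = {name: is_consistent(row) for name, row in rows.items()}
print(consistency)
```

If the assumption holds, an improvement in a single Memory subtask contributes one sixth of its size to the overall Agentic score, which helps when reading the subtask-level gains.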

Table 8: Detailed ablation study results on the When2Call test set. We do not report results for the base checkpoint of Qwen3-8B-Base since it is not instruction-tuned under this evaluation protocol. The setting w/o $c(s)$ represents $c(s)=1$, i.e., $R_{\text{UQ}} = R_{\mathrm{fmt}} + R_{\mathrm{ans}} + R_{\mathrm{cls}}$.

Table 9: Detailed ablation study results of trajectory-level post-training on the Non-Live category of BFCL-V4.

Table 10: Detailed ablation study results of trajectory-level post-training on the Live category of BFCL-V4.

Table 11: Detailed ablation study results of trajectory-level post-training on the Multi-Turn category of BFCL-V4.

Table 12: Detailed ablation study results of the Agentic category on BFCL-V4. The Agentic category contains two subcategories, i.e., Web Search and Memory.

Table 13: Detailed ablation study results on ToolSandBox.

![Refer to caption](https://arxiv.org/html/2606.06976v1/x6.png)

Figure 6: Example of tool-calling parameter failure in baselines and success in our method.

![Refer to caption](https://arxiv.org/html/2606.06976v1/x7.png)

Figure 7: Example of tool-calling decision failure of missed tool invocation and downstream task collapse, and success in our method.

![Refer to caption](https://arxiv.org/html/2606.06976v1/x8.png)

Figure 8: Example of tool-calling decision failure of unwarranted tool use under information insufficiency, and success in our method.
