Uncertainty-Aware Clarification in LLM Agents with Information Gain

arXiv cs.AI Papers

Summary

Proposes a goal-oriented clarification framework using Information Gain Reward to train LLM agents to ask effective clarification questions under underspecified user instructions, improving task success rate by 3.7% with minimal interaction overhead.

arXiv:2606.03135v1 Announce Type: new Abstract: Large Language Model (LLM) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions. To address this challenge, we propose a goal-oriented clarification framework that aligns clarification behavior with ambiguity resolution. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground-truth goal induced by the clarification exchange. We train the clarifier (LLM) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent-tool-user environment. We validate our framework within a clarification-enhanced $\tau$-Bench environment, conducting cross-agent evaluations across five heterogeneous backbones. Empirical results demonstrate that our method consistently improves the success rate by 3.7\% over the no-clarification baseline, while adding only 0.3 total interaction steps on average.

Cached at: 06/03/26, 09:42 AM

# Uncertainty-Aware Clarification in LLM Agents with Information Gain
Source: [https://arxiv.org/html/2606.03135](https://arxiv.org/html/2606.03135)
###### Abstract

Large Language Model \(LLM\) agents often operate under underspecified user instructions, where latent uncertainty over user intent leads to erroneous tool actions\. To address this challenge, we propose a goal\-oriented clarification framework that aligns clarification behavior with ambiguity resolution\. Central to our approach is the Information Gain Reward, a metric that quantifies the utility of clarification questions by measuring the Bayesian belief update towards the ground\-truth goal induced by the clarification exchange\. We train the clarifier \(LLM\) using this reward to optimize for high information gain, ensuring that clarifications effectively reduce uncertainty and improve task completion within the agent\-tool\-user environment\. We validate our framework within a clarification\-enhanced $\tau$\-Bench environment, conducting cross\-agent evaluations across five heterogeneous backbones\. Empirical results demonstrate that our method consistently improves the success rate by 3\.7% over the no\-clarification baseline, while adding only 0\.3 total interaction steps on average\.

Machine Learning, ICML

## 1 Introduction

Large Language Model \(LLM\) agents have recently shown significant proficiency in utilizing external tools and performing multi\-step decision\-making across a diverse set of real\-world applications \(Qu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib20); Yao et al\., [2022](https://arxiv.org/html/2606.03135#bib.bib107); Wölflein et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib35)\)\. Through their interaction with various external tools, these agents are capable of completing complex tasks that involve planning, reasoning, and iterative feedback \(Wölflein et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib35)\)\. In practical interactive contexts, however, user instructions are often *underspecified* or vague, leaving essential aspects of the user’s true intent implicit \(Yehudai et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib26); Qi et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib36)\)\. Such ambiguous user demands present a fundamental challenge for tool\-using agents: premature or inaccurate tool actions may lead to irreversible mistakes, fragile execution paths, and task failure \(Wang et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib33)\)\.

![Refer to caption](https://arxiv.org/html/2606.03135v1/image.png) Figure 1: An example of clarification in a $\tau$\-retail trajectory\. When an initial tool call fails due to missing or underspecified information, the Clarifier poses a targeted follow\-up question \(highlighted\) to elicit the required constraints from the user\. The additional information provided through this exchange allows the agent to proceed with a corrected tool invocation and complete the task\.

A common approach to addressing underspecified user instructions is issuing clarification questions, as shown in Figure [1](https://arxiv.org/html/2606.03135#S1.F1), to resolve ambiguity before action \(Clark, [1996](https://arxiv.org/html/2606.03135#bib.bib32); Zou et al\., [2023](https://arxiv.org/html/2606.03135#bib.bib28)\)\. While human assistants naturally identify missing information and seek clarification, this ability remains underdeveloped in LLM\-based agents \(Rao and Daumé III, [2018](https://arxiv.org/html/2606.03135#bib.bib27)\)\. A key challenge is that LLMs are primarily trained to answer questions rather than to determine when clarification is necessary or how to formulate it\. Consequently, existing training pipelines provide scant guidance on the necessity or effectiveness of clarification \(Fu et al\., [2020](https://arxiv.org/html/2606.03135#bib.bib22); Ouyang et al\., [2022](https://arxiv.org/html/2606.03135#bib.bib94)\), often leaving such behaviors implicit or entangled with internal reasoning \(Suri et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib24)\)\. Moreover, the difficulty in training LLMs for clarification stems from a misalignment in supervision: existing annotations typically reward surface\-level fluency while failing to capture a question’s capacity to reduce requirement uncertainty \(Zhang et al\., [2023](https://arxiv.org/html/2606.03135#bib.bib154)\)\. Without an explicit signal for uncertainty reduction, models default to favoring generic, safe questions that provide limited informational value \(Szymanski et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib150); Son et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib151)\), struggling to evolve toward truly targeted information\-seeking behaviors\.

To address these challenges and move beyond subjective supervision, we reframe the clarification problem within the context of uncertainty\-aware learning\. Motivated by this perspective, we propose an information\-theoretic framework that quantifies the utility of a clarification question by measuring its impact on reducing uncertainty about the user’s true goal\. Specifically, we model the clarification process as a Bayesian belief update and quantify the utility of a question by measuring the shift in the model’s probability mass towards the ground\-truth goal following the clarification exchange\. This formulation provides a performance\-oriented objective that directly links questioning strategies to effective ambiguity resolution\. We then utilize this intrinsic reward signal to drive Decoupled Clip and Dynamic Sampling Policy Optimization \(DAPO\) \(Yu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib4)\), guiding the agent to evolve questioning strategies that are explicitly optimized for missing information recovery rather than surface\-level conversational behaviors\.

This paper studies clarification as an execution\-grounded information acquisition problem, where the agent intervenes only when asking a question is expected to reduce uncertainty about the latent user goal and improve downstream tool use\. We evaluate this formulation in a Clarifier\-augmented $\tau$\-Bench setting, where user intent is partially observable and tool feedback exposes execution failures, requiring the agent to recover through interactive clarification rather than direct access to the full goal\. We further validate the framework through comprehensive analyses of training dynamics, the upper bounds of agent and clarification effectiveness, trigger frequency, and cross\-agent generalization\.

## 2 Related Work

##### Tool\-Augmented Agents\.

Large language models have increasingly been framed as tool\-using agents capable of performing multi\-step reasoning and decision\-making\. These agents utilize various processes, such as planning \(Yao et al\., [2022](https://arxiv.org/html/2606.03135#bib.bib107); Schick et al\., [2023](https://arxiv.org/html/2606.03135#bib.bib88)\), tool invocation \(Yuan et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib5); Zhu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib17); Wu et al\., [2024b](https://arxiv.org/html/2606.03135#bib.bib18)\), and feedback\-driven execution \(Lu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib11); Wu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib14); Liu et al\., [2023b](https://arxiv.org/html/2606.03135#bib.bib9)\)\. To evaluate agent behavior in complex, tool\-rich settings, various benchmarks and environments have been proposed\. For example, Qu et al\. \([2025](https://arxiv.org/html/2606.03135#bib.bib20)\) focus on tool efficiency, Liu et al\. \([2023a](https://arxiv.org/html/2606.03135#bib.bib37)\) provide a broad framework for agent task evaluation, and Andriushchenko et al\. \([2024](https://arxiv.org/html/2606.03135#bib.bib13)\) address robustness in adversarial contexts\. These diverse benchmarks \(Wang et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib33); Andriushchenko et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib13); Xu et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib12); Li et al\., [2024b](https://arxiv.org/html/2606.03135#bib.bib16); Wu et al\., [2024a](https://arxiv.org/html/2606.03135#bib.bib10)\) offer valuable insights into agent performance across different application domains\. However, while existing agent frameworks primarily optimize tool invocation for well\-defined tasks, they lack explicit mechanisms to handle ambiguity in real\-world instructions, often entangling clarification with execution\. To address this, our work leverages the dynamic interaction capabilities of $\tau$\-Bench \(Yao et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib31)\) to explicitly optimize clarification strategies\.

##### Clarification Learning and Alignment Signals\.

Clarification questions serve as a critical mechanism for resolving ambiguity\. While foundational studies emphasized their necessity for grounding in human communication \(Clark, [1996](https://arxiv.org/html/2606.03135#bib.bib32); Traum and Larsson, [2003](https://arxiv.org/html/2606.03135#bib.bib21); Yizhou et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib141); Zhang et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib140)\), recent advances have extended this mechanism to conversational search \(Aliannejadi et al\., [2019](https://arxiv.org/html/2606.03135#bib.bib139)\) and active uncertainty reduction strategies \(Zou et al\., [2023](https://arxiv.org/html/2606.03135#bib.bib28); Xia et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib7); Li et al\., [2024a](https://arxiv.org/html/2606.03135#bib.bib6)\)\. While current efforts primarily focus on evaluation \(Zhou et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib138)\) or static question selection \(Choudhury et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib143)\), they often overlook the learning of clarification policies for tool\-using agents\. In parallel, IGPO \(Wang et al\., [2025a](https://arxiv.org/html/2606.03135#bib.bib142)\) formulates information\-gain\-based optimization for multi\-turn agents under the assumption that the target answer is known in advance\. Unlike existing methods that operate with predefined answer targets or fixed clarification candidates, our setting considers tool\-using agents that infer latent user goals online, condition on tool state and execution feedback, handle free\-form user responses, and learn an amortized clarification policy rather than only selecting from static questions or optimizing against a known answer slot\.

Although emerging policy optimization frameworks like GRPO \(Shao et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib144); Wang et al\., [2025b](https://arxiv.org/html/2606.03135#bib.bib29)\) and DAPO \(Yu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib4)\) offer a robust mechanism to learn such behaviors, applying them to clarification is challenging: standard alignment approaches typically rely on human or LLM\-based judges that prioritize surface\-level text quality over informational value \(Chaudhari et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib30)\)\. To address this, we propose an information\-theoretic framework that explicitly measures a question’s ability to resolve underspecified user intent\. Specifically, we optimize for Expected Information Gain \(EIG\) to quantify goal uncertainty reduction\. This enables policy optimization to explicitly target uncertainty reduction, maximizing information gain to facilitate successful task execution\.

![Refer to caption](https://arxiv.org/html/2606.03135v1/framework2.png) Figure 2: Overview of the Amortized Bayesian Experimental Design framework\. The model performs on\-policy sampling to generate candidate questions\. These candidates are evaluated by our Belief Update Reward, which quantifies efficacy as the shift in the teacher\-forced log\-likelihood of the ground\-truth goal $G^{*}$ \(Bayesian Belief Update\)\. Acting as an amortized surrogate for the intractable expected information gain, this signal guides the DAPO update, steering the policy towards clarifications that maximally reduce goal uncertainty\.

## 3 Method

In this section, we present a framework for learning alignment\-driven clarification policies\. We first introduce the execution\-grounded interaction protocol within $\tau$\-Bench \(Section [3\.1](https://arxiv.org/html/2606.03135#S3.SS1)\), followed by the data construction process \(Section [3\.2](https://arxiv.org/html/2606.03135#S3.SS2)\)\. Section [3\.3](https://arxiv.org/html/2606.03135#S3.SS3) details the Amortized Bayesian Experimental Design and its role in formulating the Information Gain Reward, while Section [3\.4](https://arxiv.org/html/2606.03135#S3.SS4) explains how the DAPO algorithm uses this reward to optimize the clarifier for resolving task ambiguity\.

### 3\.1 Environment and Interaction Protocol

We conduct our study on $\tau$\-Bench \(Yao et al\., [2024](https://arxiv.org/html/2606.03135#bib.bib31)\), which provides a partially observable Markov decision process \(POMDP\) based environment for tool\-agent interaction\. To handle user ambiguity, we extend this interaction loop with a pluggable, LLM\-based clarifier module, as illustrated in Figure [1](https://arxiv.org/html/2606.03135#S1.F1)\. When the agent issues a tool call $a_t \in \mathcal{A}_{\text{tool}}$ and receives the corresponding observation $o_t$, the clarifier assesses whether the execution feedback reveals latent ambiguity or missing task\-critical information\. If so, it generates a concise follow\-up question; otherwise, no clarification turn is produced\. We focus on learning a clarification policy that optimizes the decision of whether to intervene and what information to solicit based on execution feedback\. This process yields an interaction trajectory defined as:

$$\tau = \Big(U,\; (a_t,\; o_t,\; Q_t,\; A_t)_{t=1}^{T},\; G\Big),$$

where $U$ denotes the initial fuzzy instruction and $G$ represents the user’s latent ground\-truth goal\. While $G$ remains unobserved by the agent during interaction, it guides the user simulator’s responses and serves as the reference for computing our information\-gain rewards\. At each step $t$, the agent executes a tool action $a_t$ and observes $o_t$\. Conditioned on this execution feedback, the clarifier may issue a clarification question $Q_t$ and receive a corresponding response $A_t$ from the user simulator before the next action is taken\. If no clarification is made at step $t$, we set $Q_t = A_t = \varnothing$\. These trajectories serve as the foundation for optimizing our clarifier policy\.
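The interaction protocol above can be sketched as a small data structure plus a rollout loop. This is only an illustrative sketch: the names `Step`, `Trajectory`, and `rollout`, the callable signatures, and the use of `None` for the empty clarification are assumptions made here, not the paper's or $\tau$\-Bench's actual API, and the default of 16 steps is simply borrowed from the step cap mentioned in Section 3\.2\.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Step:
    action: str              # tool call a_t
    observation: str         # tool feedback o_t
    question: Optional[str]  # Q_t, or None when the clarifier abstains
    answer: Optional[str]    # A_t, or None when the clarifier abstains


@dataclass
class Trajectory:
    instruction: str         # U, the underspecified instruction
    goal: str                # G, hidden from the agent and the clarifier
    steps: List[Step] = field(default_factory=list)


def rollout(
    instruction: str,
    goal: str,
    agent: Callable[[str, List[Step]], Optional[str]],
    tool: Callable[[str], str],
    clarifier: Callable[[str, List[Step], str, str], Optional[str]],
    user: Callable[[str, str], str],
    max_steps: int = 16,
) -> Trajectory:
    """Run one episode; only the (hypothetical) user simulator sees the goal."""
    traj = Trajectory(instruction, goal)
    for _ in range(max_steps):
        action = agent(instruction, traj.steps)
        if action is None:  # the agent signals that it is done
            break
        observation = tool(action)
        # The clarifier conditions on execution feedback and may abstain.
        question = clarifier(instruction, traj.steps, action, observation)
        answer = user(question, goal) if question is not None else None
        traj.steps.append(Step(action, observation, question, answer))
    return traj
```

Keeping the goal out of the `agent` and `clarifier` signatures mirrors the partial observability described above: the goal only reaches the user simulator and, later, the reward computation.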

### 3\.2 Data Construction

To curate the dataset for policy optimization, we process interaction trajectories from $\tau$\-Bench, restricting our scope to the first 16 agent steps to focus on ambiguity resolution\. From this filtered subset, we retain a final dataset of 2,676 step\-level instances, where each data point consists of the dialogue history $x$ and a task\-defined user goal $G$ provided by $\tau$\-Bench\. We further normalize the raw user goals to obtain a clean, canonical goal representation $G^{*}$, which serves as the supervision target for reward computation throughout this paper\. Details of the data construction process and the goal transfer from $G$ to $G^{*}$ are provided in Appendix [6\.1](https://arxiv.org/html/2606.03135#S6.SS1)\.

### 3\.3 Amortized Bayesian Experimental Design

To learn *when* and *what* to clarify, we formulate clarification generation from a Bayesian Experimental Design \(BED\) \(Chaloner and Verdinelli, [1995](https://arxiv.org/html/2606.03135#bib.bib3)\) perspective and train a clarifier to propose questions that are expected to reduce uncertainty about the user’s latent goal $G^{*}$\. Rather than directly optimizing this objective, which is intractable in interactive settings, we approximate the experimental design principle through an amortized training pipeline\.

##### Theoretical Objective\.

The amortized training pipeline described above can be viewed as an approximation to the following idealized Bayesian objective\. Formally, let $G^{*}$ be the ground\-truth goal, $x$ be the interaction context, and $P(A \mid Q, x)$ denote the answer distribution induced by the user simulator after question $Q$ is asked in context $x$\. An optimal clarification $Q^{*}$ maximizes the Expected Information Gain \(EIG\):

$$I(G^{*}; A \mid Q, x) = \mathbb{E}_{A \sim P(\cdot \mid Q, x)}\Big[H(G^{*} \mid x) - H(G^{*} \mid x, Q, A)\Big] \tag{1}$$

where $H(\cdot)$ denotes entropy\. Direct maximization of this objective is computationally intractable\. Specifically, calculating the EIG requires integrating over the vast, open\-ended space of latent user goals and potential responses to estimate the posterior entropy\. Such high\-dimensional inference is prohibitively expensive for every clarification candidate\.
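To make Eq\. 1 concrete, the following toy sketch computes the EIG exactly when the goal space is a small finite set, which is the opposite of the open\-ended setting the paper targets and is intended only to build intuition\. The function names, the dictionary\-based representation, and the example goals and answers are all hypothetical\.

```python
import math
from collections import defaultdict
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def expected_information_gain(
    prior: Dict[str, float],
    likelihood: Dict[str, Dict[str, float]],
) -> float:
    """EIG of one question.

    prior[g]          = P(g | x)
    likelihood[g][a]  = P(a | g, Q, x), the simulated user's answer model
    """
    # Marginal answer distribution P(a | Q, x).
    marginal = defaultdict(float)
    for goal, p_goal in prior.items():
        for ans, p_ans in likelihood[goal].items():
            marginal[ans] += p_goal * p_ans

    # Expected posterior entropy E_A[H(G | x, Q, A)] via Bayes' rule.
    expected_posterior_entropy = 0.0
    for ans, p_ans in marginal.items():
        if p_ans == 0:
            continue
        posterior = {
            g: prior[g] * likelihood[g].get(ans, 0.0) / p_ans for g in prior
        }
        expected_posterior_entropy += p_ans * entropy(posterior)

    return entropy(prior) - expected_posterior_entropy


# Four equally likely goals (illustrative only).
prior = {g: 0.25 for g in ("g1", "g2", "g3", "g4")}
# A specific question whose answer splits the goals into two groups.
specific = {"g1": {"shoes": 1.0}, "g2": {"shoes": 1.0},
            "g3": {"lamp": 1.0}, "g4": {"lamp": 1.0}}
# A generic question that every goal answers identically.
generic = {g: {"yes": 1.0} for g in prior}
```

In this toy case the specific question halves the hypothesis space and yields $\log 2$ nats, while the generic question yields zero\. Even here the computation enumerates every goal and every answer, which illustrates why the exact objective does not scale to free\-form goals and responses\.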

##### Amortized Optimization via Belief Update\.

To address this, we propose an amortized approach\. We optimize a question policy $\pi_\theta(Q \mid x)$ to internalize the experimental design process and directly generate high\-information questions\. Separately, we use the same model in teacher\-forcing mode as a belief scorer $P_\theta(G^{*} \mid \cdot)$, with gradients detached during reward computation\. We define the reward $R_t$ as a pointwise information\-gain proxy\. To ensure the reward is invariant to goal complexity, we formulate it as the shift in the length\-normalized teacher\-forced log\-likelihood of $G^{*}$:

$$\begin{split} R_t(x_t, Q_t, A_t) &= \underbrace{\frac{1}{L}\sum_{j=1}^{L}\log P_\theta(g^{*}_j \mid x_t, Q_t, A_t, g^{*}_{<j})}_{\text{Posterior Belief}} \\ &\quad - \underbrace{\frac{1}{L}\sum_{j=1}^{L}\log P_\theta(g^{*}_j \mid x_t, g^{*}_{<j})}_{\text{Prior Belief}}. \end{split} \tag{2}$$

Here, $G^{*} = (g^{*}_1, \dots, g^{*}_L)$ represents the token sequence of the ground\-truth goal\. The term $\log P_\theta(\cdot)$ denotes the log\-probability of the next token computed via teacher forcing\. By normalizing these log\-probabilities \(dividing by length $L$\), $R_t$ quantifies the Bayesian Belief Update: a positive value indicates that the clarification exchange $(Q_t, A_t)$ has effectively concentrated the model’s probability mass on the true goal sequence\. In principle, any sufficiently capable LLM could be used to estimate these likelihood terms\. In this work, we use the same backbone as the clarifier policy to maintain consistency between belief representation and policy behavior\. Since the policy is trained to reduce its own uncertainty over the user goal, evaluating likelihood shifts within the same model ensures that the reward reflects the policy’s internal belief update\.
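As a minimal sketch of Eq\. 2, assume the per\-token log\-probabilities of the goal tokens have already been obtained from two teacher\-forced passes of the scorer, one with and one without the clarification exchange in the context\. How those log\-probabilities are extracted from a particular model is left out, and the function names below are hypothetical\.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood (1/L) * sum_j log P(g_j | ...)."""
    if not token_logprobs:
        raise ValueError("goal must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def belief_update_reward(
    posterior_logprobs: Sequence[float],
    prior_logprobs: Sequence[float],
) -> float:
    """Posterior belief minus prior belief over the same goal tokens.

    posterior_logprobs[j] = log P(g_j | x_t, Q_t, A_t, g_<j)
    prior_logprobs[j]     = log P(g_j | x_t, g_<j)
    """
    if len(posterior_logprobs) != len(prior_logprobs):
        raise ValueError("both passes must score the same goal tokens")
    return mean_logprob(posterior_logprobs) - mean_logprob(prior_logprobs)


# Illustrative numbers only: the exchange raises the mean log-prob from -2.0 to -1.0.
reward = belief_update_reward([-1.0, -0.5, -1.5], [-2.0, -1.0, -3.0])
```

Because both terms are averaged over the same $L$ goal tokens, a long goal does not receive a larger reward merely for having more tokens, and an exchange that leaves every token's log\-probability unchanged scores exactly zero\.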

Mathematically, as derived in Appendix [6\.2](https://arxiv.org/html/2606.03135#S6.SS2), this reward corresponds to the Pointwise Mutual Information \(PMI\) between the response and the ground truth\. It can be expressed as the log\-importance weight $\log \frac{P(A \mid G^{*}, Q, x)}{P(A \mid Q, x)}$, which quantifies how much the observed answer $A$ distinguishes the true goal $G^{*}$ from the general hypothesis space\. Crucially, when the clarifier decides not to intervene \(i\.e\., it emits the `No need to ask` token\), the posterior belief remains identical to the prior, naturally yielding a reward of zero \($R_t = 0$\)\.
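The appendix derivation is not reproduced in this section, so the following is only a short sketch of why the belief shift takes the PMI form\. It ignores the $1/L$ length normalization of Eq\. 2 and relies on one added assumption: the question alone carries no information about the goal, i\.e\. $P(G^{*} \mid x, Q) = P(G^{*} \mid x)$\.

```latex
\begin{aligned}
P(G^{*} \mid x, Q, A)
  &= \frac{P(A \mid G^{*}, Q, x)\, P(G^{*} \mid x, Q)}{P(A \mid Q, x)}
  && \text{(Bayes' rule)} \\[4pt]
\log P(G^{*} \mid x, Q, A) - \log P(G^{*} \mid x)
  &= \log \frac{P(A \mid G^{*}, Q, x)}{P(A \mid Q, x)}
  && \text{(using } P(G^{*} \mid x, Q) = P(G^{*} \mid x)\text{)}
\end{aligned}
```

Averaging the right\-hand side over goals and answers gives the conditional mutual information between goal and answer, which is the quantity in Eq\. 1\. In that sense the per\-sample reward can be read as a single\-sample view of the expected information gain\.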

##### Strict User Simulator for Training\.

The validity of the belief update relies heavily on the quality of the simulated response $A_t$\. If the user simulator is too helpful \(revealing the goal even for generic questions\), the clarifier tends to hack the reward without learning valid clarification strategies\. Therefore, we employ a Strict User Simulator during training\. Unlike the relaxed user prompt used in $\tau$\-Bench, our strict simulator reveals task\-specific information only when the question $Q_t$ is specific and relevant\. This ensures that high rewards are assigned exclusively to questions that genuinely necessitate clarification\. A detailed comparison of the simulator prompts is provided in Appendix [6\.3\.4](https://arxiv.org/html/2606.03135#S6.SS3.SSS4)\.

### 3\.4 Optimization via DAPO

As illustrated in Figure [2](https://arxiv.org/html/2606.03135#S2.F2), the clarifier operates on the current interaction context $x$, which consists of the accumulated dialogue and execution history and serves as the input to the clarifier LLM\. Conditioned on this context, the policy performs on\-policy sampling to generate a set of candidate clarification questions $\{Q_1, \dots, Q_K\}$\. Each candidate is answered by the Strict User Simulator, yielding responses $\{A_1, \dots, A_K\}$\. We evaluate the resulting trajectories using the Belief Update Reward $R_t$ derived in Eq\. [2](https://arxiv.org/html/2606.03135#S3.E2)\. Finally, these rewards are converted into group\-relative advantages to update the policy via DAPO \(Yu et al\., [2025](https://arxiv.org/html/2606.03135#bib.bib4)\), maximizing the expected cumulative information gain\.

![Refer to caption](https://arxiv.org/html/2606.03135v1/training_dynamics_logprob_fill_reward.png) \(a\) Log probabilities with and without clarification; the shaded difference denotes the information gain reward\.
![Refer to caption](https://arxiv.org/html/2606.03135v1/training_dynamics_reward.png) \(b\) Training dynamics of the mean reward under the proposed information gain\.

Figure 3: Training dynamics of DAPO under the information gain reward objective\.

##### On\-Policy Monte Carlo Estimation\.

Our training process can be interpreted as a Monte Carlo estimation of the expected Bayesian utility\. The policy $\pi_\theta(\cdot \mid x_t)$ functions as a proposal distribution, sampling a group of $K$ candidate questions $\{Q^{(k)}\}_{k=1}^{K}$, and $A^{(k)}$ is the response returned by the strict user simulator for the sampled question $Q^{(k)}$\. The empirical average of their rewards provides an unbiased estimate of the expected utility:

$$\mathcal{U}(\theta) \approx \mathbb{E}_{Q \sim \pi_\theta}[R_t] \approx \frac{1}{K}\sum_{k=1}^{K} R_t(x_t, Q^{(k)}, A^{(k)}). \tag{3}$$

As training progresses, $\pi_\theta$ increasingly concentrates probability mass on the regions of the question space that yield high information gain\.
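The estimator in Eq\. 3 amounts to a short sampling loop\. In the sketch below, `sample_question`, `simulate_answer`, and `reward` are hypothetical callables standing in for the clarifier policy, the strict user simulator, and the belief update reward; the default group size `k=8` is an arbitrary illustrative value, not a hyperparameter reported in this section\.

```python
from typing import Callable, List, Tuple


def estimate_utility(
    context: str,
    sample_question: Callable[[str], str],
    simulate_answer: Callable[[str, str], str],
    reward: Callable[[str, str, str], float],
    k: int = 8,
) -> Tuple[float, List[float]]:
    """Monte Carlo estimate of E_{Q ~ pi}[R_t] from k sampled questions."""
    if k < 1:
        raise ValueError("k must be at least 1")
    rewards: List[float] = []
    for _ in range(k):
        question = sample_question(context)           # Q^(k) ~ pi_theta(. | x_t)
        answer = simulate_answer(context, question)   # A^(k) from the simulator
        rewards.append(reward(context, question, answer))
    # Return the per-candidate rewards too, since the advantage step needs them.
    return sum(rewards) / k, rewards
```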

##### Advantage Computation\.

To reduce variance and stabilize training, we compute the group\-relative advantage for each candidate $k$:

$$Adv_k = \frac{R_t^{(k)} - \mu_K(R_t)}{\sigma_K(R_t) + \epsilon}, \tag{4}$$

where $\mu_K$ and $\sigma_K$ denote the mean and standard deviation across the sampled group, and $\epsilon$ is a small constant for numerical stability\. Since $R_t$ is a sequence\-level metric derived from the final belief update, we broadcast the scalar advantage $Adv_k$ to all tokens within the question $Q^{(k)}$\.
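A minimal sketch of Eq\. 4 and the token\-level broadcast follows\. The text does not say whether $\sigma_K$ is the population or the sample standard deviation; the population form is assumed here, and both function names and the default `eps` are illustrative choices\.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Standardize rewards within one sampled group of K candidates."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std: an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(
    advantages: Sequence[float], question_lengths: Sequence[int]
) -> List[List[float]]:
    """Copy each candidate's scalar advantage to every token of its question."""
    return [[adv] * n for adv, n in zip(advantages, question_lengths)]
```

One consequence of the group\-relative form is that a group whose candidates all receive the same reward, for example when every sample abstains and scores zero, produces all\-zero advantages and therefore contributes no gradient for that context\.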

##### DAPO Objective\.

We optimize the policy using the decoupled clipping objective, which is particularly effective for the sparse reward nature of clarification tasks:

$$\begin{split} \mathcal{J}(\theta) &= \frac{1}{\sum_{k} |Q^{(k)}|}\sum_{k=1}^{K}\sum_{j=1}^{|Q^{(k)}|}\min\Big(\rho_{k,j}\, Adv_k, \\ &\qquad \text{clip}(\rho_{k,j},\, 1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}})\, Adv_k\Big) \end{split} \tag{5}$$

where $\rho_{k,j}$ is the importance sampling ratio\. The clipping function restricts $\rho_{k,j}$ within $[1-\epsilon_{\text{low}},\, 1+\epsilon_{\text{high}}]$ \(with $\epsilon_{\text{high}} > \epsilon_{\text{low}}$\), effectively raising the ceiling for exploration on high\-potential tokens\. By maximizing this objective, the clarifier learns to align its clarifying behavior with the goal of maximizing information gain, effectively amortizing the cost of optimal experimental design\.
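Eq\. 5 can be sketched in plain Python over per\-token log\-probabilities\. The ratio is computed as $\exp(\log \pi_\theta - \log \pi_{\theta_{\text{old}}})$, which is the usual definition of the importance ratio but is an assumption here since the section does not spell it out\. The clip values `eps_low=0.2` and `eps_high=0.28` are placeholders chosen for illustration, not settings confirmed by this section, and a real implementation would operate on tensors with automatic differentiation rather than Python lists\.

```python
import math
from typing import Sequence


def dapo_objective(
    logp_new: Sequence[Sequence[float]],
    logp_old: Sequence[Sequence[float]],
    advantages: Sequence[float],
    eps_low: float = 0.2,    # illustrative value
    eps_high: float = 0.28,  # illustrative value, larger than eps_low
) -> float:
    """Token-level clipped surrogate, averaged over all tokens in the group.

    logp_new[k][j] / logp_old[k][j]: log-prob of token j of question k under
    the current / sampling policy. advantages[k]: scalar Adv_k, shared by
    every token of question k.
    """
    total, n_tokens = 0.0, 0
    for new_k, old_k, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_k, old_k):
            ratio = math.exp(lp_new - lp_old)  # rho_{k,j}
            clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
            total += min(ratio * adv, clipped * adv)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("no tokens to average over")
    return total / n_tokens
```

Normalizing by the total token count, rather than averaging per question first, follows the $1/\sum_k |Q^{(k)}|$ factor in Eq\. 5, so longer questions contribute proportionally more terms\.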

## 4 Experiments

Table 1: Ablation study on $\tau$\-Bench \(Retail and Airline\) based on training with Qwen3\-1\.7B\. Pass@1 \(%\) with deterministic decoding \(temperature $= 0$\), averaged over three runs\. Numbers in parentheses indicate changes relative to None\. Baseline is the untrained Qwen3\-1\.7B clarifier; LLM as a Judge replaces the information\-gain reward with a question\-quality judge; w/o Information Gain optimizes the posterior likelihood term without the prior\-posterior contrast\. Notes\. $\uparrow$ \($\downarrow$\) indicates higher \(lower\) is better\. Average denotes the mean across domains\. Average \(w/o Clarifier\) counts agent and user steps excluding clarifier turns; Average Clarify denotes the average number of clarifier invocations; Sum counts all Average \(w/o Clarifier\) and Average Clarify steps\. w/o Information Gain removes the uncertainty reduction term and optimizes $\log P(G \mid x_t, Q_t, A_t)$ only\.

Building upon the framework introduced in Section [3](https://arxiv.org/html/2606.03135#S3), we evaluate our approach within the clarifier\-augmented $\tau$\-Bench environment\. To ensure a clean separation between training and evaluation, all models are fine\-tuned *exclusively* on a dataset of 500 $\tau$\-retail training trajectories\. Testing is subsequently conducted on two distinct sets: 115 held\-out retail tasks, which represent the in\-domain setting, and 50 airline tasks, which represent the out\-of\-distribution \(OOD\) setting\. The airline domain serves as a particularly challenging testbed for transfer learning, as it features task structures, tool APIs, and user interaction patterns that differ markedly from those observed during training\. Further implementation details and specific training hyperparameters are provided in Appendix [6\.6](https://arxiv.org/html/2606.03135#S6.SS6)\.

Table 2: Exploration Boundaries of Agents and Clarifier: Impact of Different Clarification Strategies on Task Completion Performance\. Numbers in parentheses indicate changes relative to the agent’s performance under the None condition \(fuzzy user intent without clarification\)\. $\uparrow$ \($\downarrow$\) indicates higher \(lower\) is better\. None \(Full User Intent\) represents the condition where full user intent is provided without clarification, reflecting the agent’s performance boundaries\.

### 4\.1 Training Dynamics and Performance

We first analyze the training dynamics to assess whether optimization steers the agent toward uncertainty\-reducing behaviors\. Figure [3](https://arxiv.org/html/2606.03135#S3.F3)\(a\) decomposes our Information Gain Reward into two teacher\-forced log\-likelihood terms: the prior confidence in the ground\-truth goal given the interaction context alone, $\log P(G^{*} \mid x)$, and the posterior confidence after incorporating the clarification exchange, $\log P(G^{*} \mid x, Q, A)$\. Notably, the likelihood conditioned on the interaction context without clarification, $\log P(G^{*} \mid x)$, remains mildly oscillatory throughout training, reflecting variation in sampled contexts during on\-policy training\. However, $\log P(G^{*} \mid x)$ remains substantially below the $\log P(G^{*} \mid x, Q, A)$ posterior curve, and the gap between the two curves tends to widen with training\. This pattern suggests that the learned policy improves primarily by eliciting additional task\-relevant evidence, not by simply increasing prior confidence in ambiguous contexts\.

Consistent with this interpretation, Figure [3](https://arxiv.org/html/2606.03135#S3.F3)\(b\) shows that the mean Information Gain Reward increases steadily over training, rising from a near\-zero baseline to approximately 0\.15 by step 200 and continuing to improve thereafter\. This monotonic trend, without signs of collapse or oscillatory behavior, suggests that the reward provides a stable optimization signal\. Together with the divergence observed in Figure [3](https://arxiv.org/html/2606.03135#S3.F3)\(a\), these results indicate that DAPO effectively exploits the information gain objective to refine the clarification policy, progressively improving the agent’s ability to acquire informative feedback\.

![Refer to caption](https://arxiv.org/html/2606.03135v1/success.png) \(a\) Success rate on the airline and retail domains, and their average, under different maximum clarification budgets\.
![Refer to caption](https://arxiv.org/html/2606.03135v1/steps.png) \(b\) Average number of interaction steps, including clarification steps \(shaded area\), across the airline and retail domains\.

Figure 4: Effect of clarification budget on task success and interaction efficiency under forced sampling\.

Table [1](https://arxiv.org/html/2606.03135#S4.T1) reports an ablation study comparing different clarification training strategies on $\tau$\-Bench\. Compared to the pre\-trained Qwen3\-1\.7B clarifier, our method consistently improves success rates on both domains, increasing performance from 16\.5% to 18\.3% on retail and from 13\.3% to 17\.3% on airline, while simultaneously reducing interaction cost\. In particular, the average number of clarifier invocations drops sharply from 4\.2 to 1\.3 per task, resulting in a 3 reduction in total interaction steps, indicating that the learned policy asks fewer but more targeted clarification questions\. When the available information is already sufficient, the prior belief over user intent is already highly concentrated\. In this case, additional clarification often introduces weakly informative or irrelevant context, making the prediction of the ground\-truth action sequence under teacher forcing more difficult and leading to a decrease in log\-likelihood\. Intuitively, unnecessary questions introduce distracting information without providing meaningful constraints on the underlying intent, and are therefore less favorable than abstention \(reward = 0\)\.

In contrast, removing the information gain term and optimizing with a likelihood\-only objective \(w/o Information Gain\) leads to a substantial performance degradation, especially on the airline domain, where the success rate falls from 17\.3% to 10%, suggesting that conditional likelihood alone is insufficient to encourage informative clarifications across domains\. The LLM\-as\-a\-judge variant further exhibits unstable behavior: although it matches our method on retail \(18\.3%\), its performance collapses on airline \(8%\) and it incurs higher interaction costs, highlighting the limitation of stylistic or preference\-based supervision in capturing the task\-specific informational value of clarification questions\. Details of the LLM\-as\-a\-judge prompt template are provided in Appendix [6\.3\.5](https://arxiv.org/html/2606.03135#S6.SS3.SSS5)\.

### 4\.2Exploration of Clarifier Boundaries

Table [2](https://arxiv.org/html/2606.03135#S4.T2) compares our specialized 1\.7B clarifier against a wide range of large\-scale LLMs \(from 8B to 671B\)\. Despite the significant disparity in parameter scale, our method achieves a task success rate of 17\.8%, approaching the best\-performing large backbone \(DeepSeek\-V3\.1, 18\.1%\) with a negligible margin of merely 0\.3%\. This result demonstrates that a small, information\-driven model can effectively match the utility of significantly larger models in resolving task\-oriented ambiguity\.

Crucially, this performance is achieved with superior interaction efficiency\. While general\-purpose models tend to over\-clarify with 3\.9 to 5\.1 turns per task, our policy converges with just 1\.3 turns on average\. This efficient strategy is vital for real\-world deployment, as it mitigates the deleterious effects of over\-clarification, including interactional redundancy, user fatigue, and the risk of task state pollution \(where excessive dialogue history distracts the agent\)\. By optimizing for Information Gain, our clarifier strikes a favorable balance, facilitating ambiguity resolution while reducing the cognitive and temporal costs of interaction\.

Comparing against the Full User Intent setting offers insight into information modality \(see Appendix [6\.3\.6](https://arxiv.org/html/2606.03135#S6.SS3.SSS6) for the specific prompt template\)\. While this static oracle serves as an upper bound on the agent’s capability, our interactive approach surprisingly outperforms it in the complex Airline domain \(17\.3% vs\. 16%\)\. This suggests that in intricate scenarios, interactive ambiguity resolution can be superior to static information loading\. By breaking down complex constraints into sequential turns, our clarifier mitigates the risk of information overload, enabling the agent to attend to task\-critical details more effectively than when overwhelmed by a monolithic instruction dump\. Overall, our learned clarifier effectively approximates this theoretical performance boundary \(17\.8% vs\. 18% on average\) while incurring a marginal overhead of only 2\.8 total steps\.

Table 3: Agent and clarifier generalization performance with the clarifier decoded deterministically\. Different agent backbones are compared under varying clarification strategies, including no clarification \(None\), a Qwen3\-8B clarifier, and a learned 1\.7B clarifier \(Ours\) trained with our DAPO setting\. $\uparrow$ \($\downarrow$\) indicates higher \(lower\) is better\. The best results for each model are bolded\. The values in parentheses indicate the change relative to the performance with no clarification \(None\)\.

### 4\.3Impact of Clarification Trigger Frequency

Figure [4](https://arxiv.org/html/2606.03135#S4.F4)\(a\) examines how the maximum clarification budget affects Pass@1 success rates on the airline and retail domains\. Introducing clarification consistently improves performance over the no\-clarification setting, demonstrating its effectiveness in resolving underspecified user intents\. Increasing the budget from 0 to a moderate level yields a substantial gain in average success rate, rising from approximately 15% to over 21%\. However, the relationship between clarification budget and performance is clearly non\-monotonic: success rates peak at intermediate budgets and plateau or decline as the budget increases further, with the effect being most pronounced in the airline domain\. This finding challenges the common design assumption that more clarification is inherently beneficial\. Instead, it suggests that excessive clarification introduces context pollution and redundant interactions, which can disrupt the agent’s reasoning flow without adding task\-relevant constraints\.

Figure [4](https://arxiv.org/html/2606.03135#S4.F4)\(b\) reveals a non\-monotonic relationship between clarification budget and task efficiency\. Initially, increasing the budget from 0 to 2 reduces total interaction steps from 24\.2 to 22\.1 without improving the success rate, indicating that insufficient questioning fails to resolve core task ambiguities\. A significant efficiency gain occurs as the budget scales from 6 to 16: while the success rate grows consistently, total steps peak at 26\.5 \(budget 6\) before declining to 23\.6 \(budget 16\)\. This trend indicates that effective mid\-stage clarification converts into substantial performance gains while streamlining the execution path by preempting redundant tool actions\. Beyond budget 16, interaction steps rise to 25\.1 with no commensurate gain in success rate\. This reversal suggests that excessive clarification becomes counterproductive, introducing informational noise that distracts the agent rather than facilitating goal resolution\. To further understand the remaining failure modes, we conduct a qualitative audit of representative trajectories and distinguish clarification failures from downstream execution errors; detailed analyses and case studies are provided in Appendix [6\.5](https://arxiv.org/html/2606.03135#S6.SS5)\.

### 4\.4Cross\-Agent Generalization Ability

Table [3](https://arxiv.org/html/2606.03135#S4.T3) presents the cross\-agent generalization performance under deterministic decoding across a diverse set of agents, including Qwen3\-32B, DeepSeek\-R1, DeepSeek\-V3\.2, and GLM\-4\-32B\. We evaluate three clarification strategies: no clarification \(None\), a publicly released model \(Qwen3\-8B\) as a clarifier, and our learned clarifier \(Ours\), which is trained using information\-gain\-driven DAPO\.

##### Overall trends\.

Averaged across all five agent backbones, our method consistently improves task success relative to the baseline\. Ours achieves the highest average success rate \(27\.3%\), outperforming both None \(23\.6%\) and the Qwen3\-8B clarifier \(23%\)\. In particular, Qwen3\-8B fails to generalize effectively, resulting in a slight regression in average performance \(\-0\.6%\), whereas Ours delivers a robust gain of 3\.7%\. This divergence is most pronounced in the out\-of\-distribution Airline domain, where our method boosts success by 5\.4%, confirming that explicitly resolving underspecified user intent improves performance across heterogeneous downstream agents\.

##### Comparison across clarification strategies\.

The Qwen3\-8B clarifier exhibits high variance: while it benefits specific models like DeepSeek\-V3\.2, it adversely impacts others, such as Qwen3\-32B and GLM\-4\-32B\. In contrast, Ours delivers stable positive outcomes across all backbones while maintaining a significantly lower clarification frequency \(1\.2 vs\. 2\.6 turns on average\)\. This efficiency stems from a policy trained to intervene selectively, only when clarification is expected to significantly reduce information asymmetry, thereby avoiding redundant or disruptive questioning\.

##### Agent\-level success rate analysis\.

We observe a clear positive correlation between the agent’s intrinsic capability and the magnitude of improvement\. For mid\-sized agents like Qwen3\-32B, our approach improves the average success rate from 18\.7% to 21\.2% \(\+2\.5%\)\. Notably, these benefits scale significantly with stronger models: the larger\-scale DeepSeek\-R1 sees a robust boost from 35\.7% to 40% \(\+4\.3%\), while the massive 671B\-parameter DeepSeek\-V3\.2 achieves the most substantial improvement, rising from 33\.3% to 39\.7% \(\+6\.4%\)\. This trend suggests that more capable agents are better equipped to leverage the high\-quality additional constraints provided by our clarifier, effectively unlocking their full potential in ambiguous scenarios without disrupting their intrinsic reasoning processes\. We further examine whether the proposed training framework generalizes across different clarifier backbones, including Qwen2\.5\-1\.5B and Qwen3\-4B; detailed results are provided in Appendix [6\.4](https://arxiv.org/html/2606.03135#S6.SS4)\.

##### Interaction efficiency across agents\.

We find that improved task performance does not necessarily come with increased interaction costs\. For agents like Qwen3\-32B and DeepSeek\-V3\.2, clarification leads to a net reduction in total steps \(e\.g\., 24 to 19\.5 for Qwen3\-32B\), suggesting that early ambiguity resolution effectively preempts longer, erroneous execution trajectories\. On average, across all agent backbones, our method adds just 1\.2 clarification turns and 0\.3 total steps per task, demonstrating a surgical intervention strategy that achieves favorable success–cost trade\-offs\.

## 5Conclusion

We present an information\-gain\-driven Clarifier\-augmented framework to optimize the timing and content of questioning\. Through a comprehensive evaluation spanning training dynamics, clarifier boundaries, trigger frequency, and cross\-agent generalization, we demonstrate the robustness of our method across diverse settings\. Empirical results show that our approach improves the average success rate by 3\.7% over baselines with minimal overhead \(averaging 0\.3 additional steps\), demonstrating the effectiveness of information gain for efficient ambiguity resolution\.

Discussion and Future Work\. In practical agent systems, clarification should be treated as a selectively routed capability rather than an always\-on dialogue behavior\. A base agent or external controller needs to determine whether the current state contains unresolved intent uncertainty, whether clarification is permitted under domain\-specific policies, and whether the expected informational benefit outweighs the associated user burden\. Our current setup also leaves several directions for future work\. The training reward relies on access to ground\-truth user goals in controlled environments, while real users may provide noisy, incomplete, or ambiguous feedback over multiple turns\. Future work could therefore explore weaker supervision signals and more realistic interaction settings\. In addition, the current experiments keep the base agent fixed for clearer attribution of clarification behavior, whereas jointly optimizing the agent and clarifier may further improve coordination between action selection and information seeking\.
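The three routing conditions named in the discussion can be sketched as a simple gate. This is only an illustration under stated assumptions and is not part of the evaluated method: the class `ClarificationContext`, the function `should_clarify`, the scalar scores, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ClarificationContext:
    """Hypothetical signals a controller might inspect before routing to a clarifier."""
    intent_uncertainty: float     # assumed score in [0, 1]; higher means more ambiguity
    policy_allows_question: bool  # whether the domain policy permits asking the user
    expected_info_gain: float     # assumed estimate of the benefit of asking
    user_burden: float            # assumed cost of one extra turn, in the same units


def should_clarify(ctx: ClarificationContext, uncertainty_threshold: float = 0.5) -> bool:
    """Route to the clarifier only if all three conditions from the discussion hold."""
    if not ctx.policy_allows_question:
        return False  # clarification is not permitted in this state
    if ctx.intent_uncertainty < uncertainty_threshold:
        return False  # no unresolved intent uncertainty worth asking about
    return ctx.expected_info_gain > ctx.user_burden  # benefit must outweigh burden
```

How the uncertainty and benefit scores would be estimated in a deployed system is left open here, as it is in the discussion itself.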

## Acknowledgement

This work is supported by Advanced Materials\-National Science and Technology Major Project \(Grant No\. 2025ZD0620100\), HKUST\(GZ\)\-IEIP\-RoP \(G01RF000256\), National Key R&D Program of China \(No\. 2024YFA1012700\), and Guangdong Provincial Key Lab of Integrated Communication, Sensing and Computation for Ubiquitous Internet of Things \(No\. 2023B1212010007\)\.

## Impact Statement

This paper presents work aimed at improving the reliability and efficiency of tool\-using LLM agents through a Bayesian Information Gain framework\. Our primary goal is to reduce ambiguity in human\-agent interaction, thereby preventing erroneous tool executions\. Our method focuses on optimizing clarification strategies within controlled environments and we do not foresee any specific negative societal consequences or ethical concerns that require immediate highlighting, as the proposed method operates strictly within the scope of user\-initiated tasks and grounded tool definitions\.

## References

- M\. Aliannejadi, H\. Zamani, F\. Crestani, and W\. B\. Croft \(2019\) Asking clarifying questions in open\-domain information\-seeking conversations\. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp\. 475–484\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Andriushchenko, A\. Souly, M\. Dziemian, D\. Duenas, M\. Lin, J\. Wang, D\. Hendrycks, A\. Zou, Z\. Kolter, M\. Fredrikson, et al\. \(2024\) Agentharm: a benchmark for measuring harmfulness of llm agents\. arXiv preprint arXiv:2410\.09024\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Chaloner and I\. Verdinelli \(1995\) Bayesian experimental design: a review\. Statistical Science, pp\. 273–304\. Cited by: [§3\.3](https://arxiv.org/html/2606.03135#S3.SS3.p1.1)\.
- S\. Chaudhari, P\. Aggarwal, V\. Murahari, T\. Rajpurohit, A\. Kalyan, K\. Narasimhan, A\. Deshpande, and B\. Castro da Silva \(2025\) Rlhf deciphered: a critical analysis of reinforcement learning from human feedback for llms\. ACM Computing Surveys 58\(2\), pp\. 1–37\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p2.1)\.
- D\. Choudhury, S\. Williamson, A\. Goliński, N\. Miao, F\. B\. Smith, M\. Kirchhof, Y\. Zhang, and T\. Rainforth \(2025\) Bed\-llm: intelligent information gathering with llms and bayesian experimental design\. arXiv preprint arXiv:2508\.21184\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- H\. H\. Clark \(1996\) Using language\. Cambridge University Press\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1), [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Fu, Y\. Qiu, C\. Tang, Y\. Li, H\. Yu, and J\. Sun \(2020\) A survey on complex question answering over knowledge base: recent advances and challenges\. arXiv preprint arXiv:2007\.13069\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- D\. Li, Y\. Zhang, Z\. Wang, S\. Tan, S\. Kosugi, and M\. Okumura \(2024a\) Active learning for abstractive text summarization via llm\-determined curriculum and certainty gain maximization\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 8959–8971\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Li, S\. Zhao, Q\. Wang, K\. Wang, Y\. Zhou, S\. Srivastava, C\. Gokmen, T\. Lee, E\. L\. Li, R\. Zhang, et al\. \(2024b\) Embodied agent interface: benchmarking llms for embodied decision making\. Advances in Neural Information Processing Systems 37, pp\. 100428–100534\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, et al\. \(2023a\) Agentbench: evaluating llms as agents\. arXiv preprint arXiv:2308\.03688\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Liu, Y\. Zhang, P\. Li, Y\. Liu, and D\. Yang \(2023b\) Dynamic llm\-agent network: an llm\-agent collaboration framework with agent team optimization\. arXiv preprint arXiv:2310\.02170\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Lu, T\. Holleis, Y\. Zhang, B\. Aumayer, F\. Nan, H\. Bai, S\. Ma, S\. Ma, M\. Li, G\. Yin, et al\. \(2025\) Toolsandbox: a stateful, conversational, interactive evaluation benchmark for llm tool use capabilities\. In Findings of the Association for Computational Linguistics: NAACL 2025, pp\. 1160–1183\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, et al\. \(2022\) Training language models to follow instructions with human feedback\. Advances in Neural Information Processing Systems 35, pp\. 27730–27744\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- Y\. Qi, H\. Peng, X\. Wang, A\. Xin, Y\. Liu, B\. Xu, L\. Hou, and J\. Li \(2025\) Agentif: benchmarking instruction following of large language models in agentic scenarios\. arXiv preprint arXiv:2505\.16944\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p1.1)\.
- C\. Qu, S\. Dai, X\. Wei, H\. Cai, S\. Wang, D\. Yin, J\. Xu, and J\. Wen \(2025\) Tool learning with large language models: a survey\. Frontiers of Computer Science 19\(8\), pp\. 198343\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p1.1), [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Rao and H\. Daumé III \(2018\) Learning to ask good questions: ranking clarification questions using neural expected value of perfect information\. arXiv preprint arXiv:1805\.04655\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- T\. Schick et al\. \(2023\) Toolformer: language models can teach themselves to use tools\. In ICLR\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Shao, Z\. Liu, W\. Zhang, et al\. \(2024\) DeepSeekMath: pushing the limits of mathematical reasoning in open language models\. arXiv preprint arXiv:2402\.03300\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p2.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2024\) HybridFlow: a flexible and efficient rlhf framework\. arXiv preprint arXiv:2409\.19256\. Cited by: [§6\.6](https://arxiv.org/html/2606.03135#S6.SS6.p1.1)\.
- G\. Son, H\. Ko, H\. Lee, Y\. Kim, and S\. Hong \(2024\) Llm\-as\-a\-judge & reward model: what they can and cannot do\. arXiv preprint arXiv:2409\.11239\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- M\. Suri, P\. Mathur, N\. Lipka, F\. Dernoncourt, R\. A\. Rossi, and D\. Manocha \(2025\) Structured uncertainty guided clarification for llm agents\. arXiv preprint arXiv:2511\.08798\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- A\. Szymanski, N\. Ziems, H\. A\. Eicher\-Miller, T\. J\. Li, M\. Jiang, and R\. A\. Metoyer \(2025\) Limitations of the llm\-as\-a\-judge approach for evaluating llm outputs in expert knowledge tasks\. In Proceedings of the 30th International Conference on Intelligent User Interfaces, pp\. 952–966\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- D\. R\. Traum and S\. Larsson \(2003\) The information state approach to dialogue management\. In Current and new directions in discourse and dialogue, pp\. 325–353\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Wang, S\. Dai, G\. Ye, Z\. Gan, W\. Yao, Y\. Deng, X\. Wu, and Z\. Ying \(2025a\) Information gain\-based policy optimization: a simple and effective approach for multi\-turn llm agents\. arXiv preprint arXiv:2510\.14967\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wang, M\. Zerun, Y\. Li, S\. Zhang, C\. Chen, K\. Chen, and X\. Le \(2024\) GTA: a benchmark for general tool agents\. Advances in Neural Information Processing Systems 37, pp\. 75749–75790\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p1.1), [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- P\. Wang, T\. Liu, C\. Wang, Z\. Li, Y\. Wang, S\. Yan, C\. Jia, X\. Liu, X\. Chen, J\. Xu, et al\. \(2025b\) A survey on large language models for mathematical reasoning\. ACM Computing Surveys\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p2.1)\.
- G\. Wölflein, D\. Ferber, D\. Truhn, O\. Arandjelovic, and J\. N\. Kather \(2025\) Llm agents making agent tools\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 26092–26130\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p1.1)\.
- C\. Wu, Z\. R\. Tam, C\. Lin, Y\. Chen, and H\. Lee \(2024a\) Streambench: towards benchmarking continuous improvement of language agents\. Advances in Neural Information Processing Systems 37, pp\. 107039–107063\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wu, J\. Zhu, Y\. Liu, M\. Xu, and Y\. Jin \(2025\) Agentic reasoning: a streamlined framework for enhancing llm reasoning with agentic tools\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 28489–28503\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Wu, S\. Zhao, Q\. Huang, K\. Huang, M\. Yasunaga, K\. Cao, V\. Ioannidis, K\. Subbian, J\. Leskovec, and J\. Y\. Zou \(2024b\) Avatar: optimizing llm agents for tool usage via contrastive reasoning\. Advances in Neural Information Processing Systems 37, pp\. 25981–26010\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Xia, S\. Mukherjee, Z\. Xie, J\. Wu, X\. Li, R\. Aponte, H\. Lyu, J\. Barrow, H\. Chen, F\. Dernoncourt, et al\. \(2025\) From selection to generation: a survey of llm\-based active learning\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), pp\. 14552–14569\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- F\. F\. Xu, Y\. Song, B\. Li, Y\. Tang, K\. Jain, M\. Bao, Z\. Z\. Wang, X\. Zhou, Z\. Guo, M\. Cao, et al\. \(2024\) Theagentcompany: benchmarking llm agents on consequential real world tasks\. arXiv preprint arXiv:2412\.14161\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, N\. Shinn, P\. Razavi, and K\. Narasimhan \(2024\) $\tau$\-Bench: a benchmark for tool\-agent\-user interaction in real\-world domains\. URL https://arxiv\.org/abs/2406\.12045\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1), [§3\.1](https://arxiv.org/html/2606.03135#S3.SS1.p1.3)\.
- S\. Yao, R\. Yang, N\. Cui, K\. Narasimhan, et al\. \(2022\) ReAct: synergizing reasoning and acting in language models\. arXiv preprint arXiv:2210\.03629\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p1.1), [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Yehudai, L\. Eden, A\. Li, G\. Uziel, Y\. Zhao, R\. Bar\-Haim, A\. Cohan, and M\. Shmueli\-Scheuer \(2025\) Survey on evaluation of llm\-based agents\. arXiv preprint arXiv:2503\.16416\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p1.1)\.
- C\. Yizhou, L\. Jessy, L\. Kevin, and K\. Dan \(2024\) Clarinet: augmenting language models to ask clarification questions for retrieval\. arXiv preprint arXiv:2405\.15784\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, T\. Fan, G\. Liu, L\. Liu, X\. Liu, et al\. \(2025\) Dapo: an open\-source llm reinforcement learning system at scale\. URL https://arxiv\.org/abs/2503\.14476\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p3.1), [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p2.1), [§3\.4](https://arxiv.org/html/2606.03135#S3.SS4.p1.4)\.
- S\. Yuan, K\. Song, J\. Chen, X\. Tan, Y\. Shen, K\. Ren, D\. Li, and D\. Yang \(2025\) Easytool: enhancing llm\-based agents with concise tool instruction\. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), pp\. 951–972\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Zhang, L\. Dong, X\. Li, S\. Zhang, X\. Sun, S\. Wang, J\. Li, R\. Hu, T\. Zhang, G\. Wang, et al\. \(2023\) Instruction tuning for large language models: a survey\. ACM Computing Surveys\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1)\.
- X\. Zhang, Y\. Deng, Z\. Ren, S\. K\. Ng, and T\. Chua \(2024\) Ask\-before\-plan: proactive language agents for real\-world planning\. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp\. 10836–10863\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Zhou, X\. Feng, Z\. Zhu, J\. Yao, S\. Koyejo, and B\. Han \(2025\) From passive to active reasoning: can large language models ask the right questions under incomplete information?\. arXiv preprint arXiv:2506\.08295\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Zhu, S\. Hsia, X\. Hu, Z\. Liu, J\. Shi, and K\. Ramani \(2025\) AgentAR: creating augmented reality applications with tool\-augmented llm\-based autonomous agents\. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pp\. 1–23\. Cited by: [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Zou, M\. Aliannejadi, E\. Kanoulas, M\. S\. Pera, and Y\. Liu \(2023\) Users meet clarifying questions: toward a better understanding of user interactions for search clarification\. ACM Transactions on Information Systems 41\(1\), pp\. 1–25\. Cited by: [§1](https://arxiv.org/html/2606.03135#S1.p2.1), [§2](https://arxiv.org/html/2606.03135#S2.SS0.SSS0.Px2.p1.1)\.

## 6Appendix

### 6\.1Data Construction

Our data construction pipeline focuses on collecting interaction trajectories from a Clarifier\-augmented $\tau$\-Bench environment and extracting step\-level interaction histories for training the clarification policy via DAPO\. Unlike prior approaches that rely on supervised labels or preference annotations, our method does not use SFT or DPO data\. Instead, learning is driven entirely by on\-policy rollouts and reward signals defined over interaction histories\.

##### Interaction Trajectory Generation\.

All interaction trajectories are collected from the Clarifier\-augmented environment described in Section [3\.1](https://arxiv.org/html/2606.03135#S3.SS1)\. During data collection, the downstream task agent is Qwen3\-8B, which interacts with domain\-specific tools under the standard $\tau$\-Bench execution protocol\. A separate Clarifier module, implemented using Qwen3\-32B, is invoked only after tool calls to determine whether clarification is necessary and, if so, to generate a follow\-up question grounded in the observed execution feedback\. User responses to clarification questions are produced by an LM\-simulated user instantiated as Qwen3\-8B\.
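The collection loop described above can be sketched as follows. This is a simplified illustration, not the actual environment code: the callables `agent_act`, `call_tool`, `clarifier`, and `simulated_user`, the `max_steps` cap, and the convention that the agent returns `None` when finished are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

# One recorded step: (action, observation, question, answer)
Step = Tuple[str, str, Optional[str], Optional[str]]


def run_episode(
    instruction: str,
    agent_act: Callable[[str, List[Step]], Optional[str]],
    call_tool: Callable[[str], str],
    clarifier: Callable[[str, List[Step], str, str], Optional[str]],
    simulated_user: Callable[[str], str],
    max_steps: int = 30,  # assumed cap, not taken from the paper
) -> List[Step]:
    """Collect one trajectory; the clarifier is consulted only after a tool call."""
    steps: List[Step] = []
    for _ in range(max_steps):
        action = agent_act(instruction, steps)
        if action is None:  # assumed convention: the agent signals completion
            break
        observation = call_tool(action)
        # The clarifier sees the fresh tool feedback and may return a question.
        question = clarifier(instruction, steps, action, observation)
        answer = simulated_user(question) if question is not None else None
        steps.append((action, observation, question, answer))
    return steps
```

In the setup described above, the three model roles would be filled by Qwen3\-8B (agent and simulated user) and Qwen3\-32B (clarifier); here they are plain callables so the control flow can be read in isolation.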

This setup yields multi\-turn interaction trajectories of the form

$$\tau=\{\,U,\;(a_{t},o_{t},Q_{t},A_{t})_{t=1}^{T},\;G\,\},$$

where $U$ denotes the initial user instruction, $(a_{t},o_{t})$ are the agent’s tool actions and corresponding observations, $(Q_{t},A_{t})$ are optional clarification question–answer pairs, and $G$ is the task\-defined ground\-truth user goal provided by $\tau$\-Bench\. Clarification turns are recorded only when triggered; otherwise, $Q_{t}=A_{t}=\varnothing$\.
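One way to hold such a trajectory in code is a pair of small record types. This is a sketch for illustration only; the class and field names (`Step`, `Trajectory`, `num_clarifications`) are hypothetical, and `None` stands in for the empty clarification turn.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    action: str                     # a_t: tool call issued by the agent
    observation: str                # o_t: feedback returned by the tool
    question: Optional[str] = None  # Q_t: None when no clarification is triggered
    answer: Optional[str] = None    # A_t: None when no clarification is triggered


@dataclass
class Trajectory:
    instruction: str                # U: initial user instruction
    goal: str                       # G: ground-truth user goal from the benchmark
    steps: List[Step] = field(default_factory=list)

    def num_clarifications(self) -> int:
        """Count the steps at which a clarification turn was recorded."""
        return sum(1 for s in self.steps if s.question is not None)
```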

##### Step\-Level Instance Extraction\.

From the collected trajectories, we construct training instances at the step level\. Each instance corresponds to a decision point $t$ and consists solely of the interaction history

$$x_{t}=\big(U,\;(a_{i},o_{i},Q_{i},A_{i})_{i<t}\big),$$

together with the associated ground\-truth goal $G$\. We restrict our scope to the first 16 agent steps of each trajectory to emphasize ambiguity resolution, where clarification is most critical\. Importantly, we do not retain or supervise on the clarifier outputs generated during data collection; all clarification decisions and questions used for training are produced by the Qwen3\-1\.7B model during on\-policy rollouts\.
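The extraction step can be sketched as a short function that follows the formula for the interaction history literally, keeping only steps with index $i<t$. The function name `extract_instances` and the tuple layout are hypothetical; the cutoff of 16 comes from the text above.

```python
from typing import List, Optional, Tuple

# One recorded step: (a_i, o_i, Q_i, A_i)
Step = Tuple[str, str, Optional[str], Optional[str]]
Instance = Tuple[Tuple[str, List[Step]], str]  # ((U, history), G)


def extract_instances(
    instruction: str,
    steps: List[Step],
    goal: str,
    max_agent_steps: int = 16,
) -> List[Instance]:
    """Build one (x_t, G) pair per decision point t = 1 .. min(T, max_agent_steps)."""
    instances: List[Instance] = []
    for t in range(1, min(len(steps), max_agent_steps) + 1):
        history = list(steps[: t - 1])  # all steps with index i < t
        instances.append(((instruction, history), goal))
    return instances
```

With a 20\-step trajectory this yields 16 instances, the first of which has an empty history.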

##### From $G$ to $G^{*}$\.

A critical component of our pipeline for reward learning is the normalization of the user goal\. The raw goals in $\tau$\-Bench often contain extraneous personality descriptions \(e\.g\., “logical,” “shy”\) and complex JSON syntax that can distract smaller models\. To address this, we employ an LLM\-based rewriting step to transform the raw user goals into a flat, natural language format for efficient supervision\. The rewriting strictly enforces the following rules:

- First\-person normalization: converting second\-person identifiers to first\-person intent \(e\.g\., “Your name is” to “My name is”\)\.
- Structure flattening: transforming JSON key–value pairs into cohesive natural language phrases\.
- Noise removal: removing non\-functional attributes while strictly preserving specific values such as addresses and IDs\.

This process yields a clean, unambiguous normalized goal $G^{*}$, which serves as the supervision target for reward computation throughout this paper\. The detailed prompt template and representative rewriting examples are provided in Appendix [6\.3\.3](https://arxiv.org/html/2606.03135#S6.SS3.SSS3)\.
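To make the three rewriting rules concrete, the toy below applies them with plain string operations. It is a rule\-based stand\-in for illustration only: the paper performs this rewriting with an LLM, and the input layout (keys `identity`, `personality`, `order_id`, `payment_method`) and the names `NOISE_KEYS`, `to_first_person`, and `normalize_goal` are hypothetical rather than the benchmark's actual schema.

```python
import re
from typing import Dict

NOISE_KEYS = {"personality"}  # hypothetical non-functional attributes to drop


def to_first_person(text: str) -> str:
    """Rewrite "Your ..." / "your ..." identifiers as "My ..." / "my ..."."""
    text = re.sub(r"\bYour\b", "My", text)
    return re.sub(r"\byour\b", "my", text)


def normalize_goal(raw: Dict[str, str]) -> str:
    """Flatten a raw goal record into one natural-language line."""
    parts = []
    for key, value in raw.items():
        if key in NOISE_KEYS:
            continue  # noise removal: skip personality descriptions
        if key == "identity":
            # first-person normalization; the trailing period is dropped before joining
            parts.append(to_first_person(value).rstrip("."))
        else:
            # structure flattening: key names become words, values are kept verbatim
            parts.append(f"{key.replace('_', ' ')} {value}")
    return "; ".join(parts)
```

Values such as order IDs and payment identifiers pass through unchanged, which mirrors the requirement that specific values be strictly preserved.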

### 6\.2Derivation of the Information Gain Reward

In Section [3\.3](https://arxiv.org/html/2606.03135#S3.SS3), we defined the reward $R_{t}$ as the difference between the posterior and prior log\-likelihoods of the ground\-truth goal $G^{*}$\. Here, we derive its connection to the Bayesian Importance Weight\.

Let $x$ be the context, $Q$ be the clarification question, and $A$ be the user’s response\. The reward is given by:

$$R_{t}=\log P(G^{*}\mid x,Q,A)-\log P(G^{*}\mid x). \tag{6}$$

Using Bayes’ theorem, we can expand the posterior term $P(G^{*}\mid x,Q,A)$:

$$P(G^{*}\mid x,Q,A)=\frac{P(A\mid G^{*},Q,x)\,P(G^{*}\mid x,Q)}{P(A\mid Q,x)}. \tag{7}$$

Since the goal $G^{*}$ exists independently of the question $Q$ \(the user’s intent is fixed before the question is asked\), we have $P(G^{*}\mid x,Q)=P(G^{*}\mid x)$\. Substituting this back into the equation:

$$P(G^{*}\mid x,Q,A)=\frac{P(A\mid G^{*},Q,x)\,P(G^{*}\mid x)}{P(A\mid Q,x)}. \tag{8}$$

Now, substituting this expression into the reward definition:

$$\begin{aligned}
R_{t} &= \log\left(\frac{P(A\mid G^{*},Q,x)\,P(G^{*}\mid x)}{P(A\mid Q,x)}\right)-\log P(G^{*}\mid x) \\
&= \log\left(\frac{P(A\mid G^{*},Q,x)}{P(A\mid Q,x)}\right)+\log P(G^{*}\mid x)-\log P(G^{*}\mid x) \\
&= \log\left(\frac{P(A\mid G^{*},Q,x)}{P(A\mid Q,x)}\right).
\end{aligned} \tag{9}$$

The term inside the log is the Importance Weight $w$:

$$w=\frac{P(A\mid G^{*},Q,x)}{P(A\mid Q,x)}. \tag{10}$$

This ratio compares the likelihood of the answer $A$ under the specific ground\-truth goal $G^{*}$ versus its likelihood under the general marginal distribution\. A high weight $w$ \(and thus high $R_{t}$\) implies that the answer $A$ is highly specific to $G^{*}$ and unlikely to occur by chance, thereby providing strong discriminative evidence\.
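The identity between the reward and the log importance weight can be checked numerically on a toy discrete example. The three candidate goals and all probabilities below are made up for illustration; the marginal uses the same independence assumption as the derivation, namely that the prior over goals does not depend on the question.

```python
import math

# Toy prior over three candidate goals, P(G | x); values are illustrative only.
prior = {"g_star": 0.2, "g_alt1": 0.5, "g_alt2": 0.3}
# Likelihood of the observed answer under each goal, P(A | G, Q, x).
likelihood = {"g_star": 0.9, "g_alt1": 0.1, "g_alt2": 0.2}

# Marginal likelihood of the answer, P(A | Q, x), summing over goals.
marginal = sum(likelihood[g] * prior[g] for g in prior)

# Posterior of the true goal via Bayes' theorem.
posterior_star = likelihood["g_star"] * prior["g_star"] / marginal

# Reward as log-posterior minus log-prior.
reward = math.log(posterior_star) - math.log(prior["g_star"])
# Log of the importance weight.
log_weight = math.log(likelihood["g_star"] / marginal)

assert math.isclose(reward, log_weight)
```

Because the answer is much more likely under the true goal than under the alternatives, the weight exceeds 1 and the reward is positive.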

### 6\.3Case Studies and Prompt Templates

#### 6\.3\.1Reward Case Studies

We present a detailed comparison of two clarification trajectories from the same interaction state\. This case study illustrates how the Information Gain Proxy \($R_{t}$\) effectively penalizes redundant dialogue while rewarding interactions that actively elicit missing constraints required by the ground truth\.

Ground Truth User Goal \($G^{*}$\):

> My name is Olivia Lopez and my zip code is 76171\. For \#W7449508, exchange Sneakers size 6, black, synthetic to size 10, gray, leather; via gift\_card\_7711863\.

Table [4](https://arxiv.org/html/2606.03135#S6.T4) contrasts a redundant confirmation loop against a high\-quality clarification turn\.

Table 4:Detailed comparison of clarification candidates\.Case Bdemonstrates a high\-quality question that specifically targets the missing payment information\. This targeted inquiry successfully recovers thegift\_carddetail, leading to a significant increase in the ground truth’s log\-probability\.Mechanism Discussion:

- •High Information Gain \(Case B\):The agent asks a high\-quality question targeting the missingpayment method\. This prompts the user to reveal the token sequence “gift\_card\_7711863”\. Since this specific ID is a mandatory part of the Ground TruthG∗G^\{\*\}, its recovery drastically reduces the perplexity of the target goal, shifting the average log\-probability from−3\.26\-3\.26to−2\.62\-2\.62\.
- •Low Information Gain \(Case A\):In contrast, the confirmation loop in Case A adds length to the context without adding semantic signal\. As a result, the average log\-probability slightly degrades \(−3\.26→−3\.32\-3\.26\\rightarrow\-3\.32\) due to the length penalty, correctly yielding a negative reward \(−0\.05\-0\.05\)\. This demonstrates the metric’s ability to penalize safe but useless questions\.
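The arithmetic behind the two cases can be sketched as follows, assuming the reward is the difference between the posterior and prior average (length-normalized) log-probability of the goal text. The function name is hypothetical, and because per-token values are not given in the text, each list holds a single placeholder entry equal to the reported average.

```python
def information_gain_reward(prior_token_logps, posterior_token_logps):
    """Gain in average per-token log-probability of the goal text G*.

    Both arguments are per-token log-probabilities of the same goal text,
    scored without (prior) and with (posterior) the clarification exchange.
    """
    prior_avg = sum(prior_token_logps) / len(prior_token_logps)
    posterior_avg = sum(posterior_token_logps) / len(posterior_token_logps)
    return posterior_avg - prior_avg


# Placeholder inputs: one entry per list, set to the averages quoted above.
case_b = information_gain_reward([-3.26], [-2.62])
case_a = information_gain_reward([-3.26], [-3.32])
print(round(case_b, 2), round(case_a, 2))
```

With the rounded averages quoted above, Case B yields a gain of about 0.64, and Case A comes out at about -0.06 rather than the reported -0.05. The small gap is presumably due to rounding in the reported averages.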

#### 6.3.2 Clarifier Prompt Templates

##### Clarifier Prompt Template\.

The following prompt template defines the behavior of the clarification model, which conditions on the dialogue interaction history to determine whether a follow\-up question is required and, if so, generates a concise clarification question\.

Prompt for Clarifier

> Role: You are a Clarification Assistant responsible for determining whether the agent needs to ask a clarifying question.
>
> History: {history_text}
>
> Instructions: Analyze the dialogue and decide whether clarification is needed.
>
> - If not needed, answer NO.
> - If needed, answer: YES \[QUESTION\]Your clarification question here\[/QUESTION\]
> - Ask only one concise question about user intent.
> - Do not ask about data retrievable from tools.
> - Do not repeat previous questions.
>
> expected_output: Either:
>
> - NO
>
> or
>
> - YES \[QUESTION\]…\[/QUESTION\]
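A minimal sketch of how a caller might parse the clarifier's output under this format is shown below. The function name and the choice to treat malformed outputs as "no clarification" are assumptions made for illustration, not details from the paper.

```python
import re

# Matches the "YES [QUESTION]...[/QUESTION]" form described in the template.
_QUESTION_RE = re.compile(r"YES\s*\[QUESTION\](.*?)\[/QUESTION\]", re.DOTALL)


def parse_clarifier_output(text):
    """Return the clarification question, or None if the clarifier declined.

    Outputs that do not contain a well-formed question block are treated
    as "no clarification" (an assumption of this sketch).
    """
    match = _QUESTION_RE.search(text)
    if match is None:
        return None
    question = match.group(1).strip()
    return question or None


print(parse_clarifier_output("NO"))
print(parse_clarifier_output("YES [QUESTION]Which payment method should I use?[/QUESTION]"))
```

Searching rather than matching from the start of the string tolerates extra text before the decision, such as reasoning traces, at the cost of being more permissive than the template strictly requires.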

##### Teacher Forcing $G^{*}$ Prompt Template

The first template conditions only on the dialogue and execution history up to the current step. It prompts the model to infer the user’s hidden intent based solely on the observed interaction context, producing a goal summary that reflects the prior belief before clarification.

Teacher Forcing $G^{*}$ Prompt Template without QA

> Role: You are an expert analyst. Summarize the user’s hidden profile and intent based on the dialogue.
>
> History: {history_text}
>
> User Intent Summary:

The second template augments the same history with an explicit clarification question and the corresponding user response. This template estimates the posterior belief after incorporating the additional information provided by the clarification exchange.

Teacher Forcing $G^{*}$ Prompt Template with QA

> Role: You are an expert analyst. Summarize the user’s hidden profile and intent based on the dialogue.
>
> History: {history_text}
>
> Question and User Response: {Clarification and User Response}
>
> User Intent Summary:
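The two templates can be combined into a reward computation roughly as sketched below. This assumes a scorer that teacher-forces the goal text after a prompt and returns its mean per-token log-probability. Here `avg_logprob` is a hypothetical callable, `toy_scorer` is a stand-in used only so the sketch runs without a model, and the exact prompt layout is simplified.

```python
ROLE = ("You are an expert analyst. Summarize the user's hidden profile "
        "and intent based on the dialogue.")


def build_prior_prompt(history_text):
    """Template without QA: context for the prior belief over G*."""
    return f"{ROLE}\nHistory: {history_text}\nUser Intent Summary:"


def build_posterior_prompt(history_text, qa_text):
    """Template with QA: context for the posterior belief over G*."""
    return (f"{ROLE}\nHistory: {history_text}\n"
            f"Question and User Response: {qa_text}\nUser Intent Summary:")


def teacher_forced_reward(avg_logprob, history_text, qa_text, goal_text):
    """Posterior minus prior average log-probability of the goal text.

    `avg_logprob(prompt, target)` is a hypothetical callable that scores
    `target` as a continuation of `prompt` under a language model.
    """
    prior = avg_logprob(build_prior_prompt(history_text), goal_text)
    post = avg_logprob(build_posterior_prompt(history_text, qa_text), goal_text)
    return post - prior


def toy_scorer(prompt, target):
    """Illustration only: scores higher when the prompt mentions goal words."""
    words = target.split()
    hits = sum(1 for w in words if w in prompt)
    return -3.0 + hits / len(words)


r = teacher_forced_reward(
    toy_scorer,
    history_text="User: I want to exchange my sneakers.",
    qa_text="Q: Which payment method? A: Use gift_card_7711863.",
    goal_text="exchange sneakers via gift_card_7711863",
)
print(r)
```

Passing the scorer in as an argument keeps the reward logic independent of any particular model-serving stack. A real scorer would query a language model instead of counting word overlap.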

#### 6.3.3 Rewriting Prompt Template and Examples

The following prompt template converts structured or raw user requirement data into a more readable natural-language format.

Rewriting $G^{*}$ Prompt Template

> Role: You are a data converter. Your goal is to rewrite raw user detailed data into a flat, natural language format optimized for training smaller language models.
>
> Transformation Rules:
>
> 1. Person: Change “Your name is” to “My name is”.
> 2. Remove Noise: Remove all personality descriptions (e.g., “logical, shy, organized”).
> 3. Flatten JSON (Crucial): Remove all JSON syntax ({, }, ’, :). Convert key-value pairs into natural phrases. Use connectors like “with”, “and”, “set to”. Example: {’color’: ’red’, ’size’: ’M’} → with color red and size M.
> 4. Handle “(same as …)”: Keep the reference intact. Ensure it flows logically with the details. Example: {’zip’: ’10228’} (same as #W123) → with zip 10228, which is the same as #W123.
> 5. Preserve Content: Do NOT delete specific values (like ‘144 Lakeview Drive’). Keep them just in case the user mentions them.
>
> \#\#\# Example:
>
> Input: “Your name is Alex. You are shy and loud. For #W1, change address to {’street’: ’123 Main St’, ’zip’: ’90210’} (same as #W2).”
>
> Output: “My name is Alex. For order #W1, change address to street 123 Main St and zip 90210, which is the same as #W2.”
>
> Input: “{raw_ground_truth}”
>
> Output:

Here’s a side-by-side example of how raw Ground Truth data can be transformed into a more readable format using the rewriting template:

Input \(Raw Ground Truth\):

Your name is Mei Martin and your zip code is 32124\. You are messy, creative, outgoing, rigid, cautious\. For \#W5564375, exchange LED Light Bulb ’brightness’: ’60W equivalent’, ’color temperature’: ’daylight’, ’connectivity’: ’none’ to ’brightness’: ’75W equivalent’, ’connectivity’: ’Wi\-Fi’; Office Chair ’material’: ’fabric’, ’color’: ’black’, ’armrest’: ’none’, ’backrest height’: ’high\-back’ to ; via paypal\_2299608\.

Output (Cleaned Version):

My name is Mei Martin and my zip code is 32124. For order #W5564375, exchange LED Light Bulb with brightness of 60W equivalent and color temperature of daylight to a bulb with brightness of 75W equivalent and Wi-Fi connectivity; Office Chair with fabric material, black color, and high-back to none; via paypal_2299608.

#### 6.3.4 User Prompt Template Discussion

##### User Prompt Template.

The following prompt template is used to simulate a human user during interaction in the $\tau$-Bench environment, specifying the behavioral constraints and response style of the LM-based user simulator.

User Prompt Template for Main Evaluation Results

> Role: You are a user interacting with an agent. Your behavior simulates a human user following a hidden instruction.
>
> {History}
>
> {User Detail Requirements}
>
> Instructions:
>
> - Generate only one message at a time.
> - Reveal information gradually instead of all at once.
> - Do not invent facts missing from the instruction; if the agent asks for unavailable details, say you don’t remember.
> - When the instruction’s goal is completed, output STOP.
> - Do not repeat the instruction verbatim; use natural conversational phrasing.
> - Maintain a natural, human-like conversation style.
>
> Expected Output: A single user utterance per step.

During our preliminary experiments, we observed that training directly with the standard template resulted in a form of reward hacking\. Since the standard user simulator is designed to be cooperative, it tends to interpret generic queries \(e\.g\., Is there any additional information you would like to provide?\) as a signal to reveal the next part of the hidden goal\. As a result, the agent quickly converges to a degenerate policy, repeatedly issuing these vague, non\-specific questions\. While this strategy yields high Information Gain rewards during training, it fails to generalize to more complex scenarios where precise constraint elicitation is required, resulting in unnatural dialogue loops\.

##### Strict User Prompt Template for Training\.

We employ a strict prompt template for training, while testing follows the standard template, ensuring that the agent learns to handle ambiguity more effectively and without relying on simplistic, generic queries\.

User Prompt Template for DAPO Training

> Role: You are a user interacting with an agent. Your behavior simulates a human user following a hidden instruction.
>
> {History}
>
> {User Detail Requirements}
>
> Instructions:
>
> conversation_guidelines:
>
> - Respond with one message at a time using first-person statements.
> - Do not invent details that are not in the instruction; if something is unknown, say you do not remember it.
> - Rephrase the instruction in your own words and maintain a natural, human-like tone.
>
> IMPORTANT - Handling vague or generic questions:
>
> - If the agent asks a vague, overly broad, or generic question (e.g., “Is there any additional information?”, “Can you tell me more?”, “Anything else?”), reply with: “No, that’s all.” or “Just do what I asked.”
> - If the agent outputs a placeholder or acts out of character (e.g., text like “Your concise and specific clarifying question to the user here”), reply with: “Who are you talking to?”
> - For such questions, only provide information if it is DIRECTLY relevant to the current step of the instruction.
> - If the question is too vague to answer meaningfully, respond with something like “I’m not sure what specific information you need” or “Could you be more specific?”
> - Prefer specific, targeted questions that help the agent understand your exact needs.
>
> Question: {question}
>
> expected_output: A single user utterance that follows the instruction and guidelines.

#### 6.3.5 LLM as a Judge Prompt Template

This prompt template is used in the Ablation Study to compare the impact of different reward signals in the DAPO training process, specifically examining the effect of the Information Gain Reward.

LLM as a Judge Prompt Template

> Role: You are a professional question quality evaluator.
>
> Original Dialogue: {history}
>
> Instructions:
>
> - Rank the following {n} clarification questions from BEST to WORST based on their quality and relevance.
> - You must consider the user requirement and original dialogue context carefully.
> - Rank all {n} questions from best (1) to worst ({n}).
> - Assign a score in \[0, 10\] to each question (higher is better).
>
> Questions List: {Rollout Questions}
>
> Expected output:
>
> Ranking: {Ranking_output}
>
> Scores: {Scores_output}

#### 6.3.6 Full User Intent Prompt Template

Full User Intent Prompt Template

> Role: You are a user interacting with an agent.
>
> {History}: {User Detail Requirements}
>
> Rules:
>
> - Generate a single message block to simulate the user’s utterance.
> - Provide all the information contained in the instruction in your very first message.
> - Do not hold back any details.
> - Fully explain your goal, constraints, and preferences clearly and comprehensively at the start.
> - If the instruction goal is satisfied, generate ###STOP### as a standalone message without anything else to end the conversation.
> - Do not repeat the exact instruction verbatim. Instead, use your own words to convey the same information naturally.
> - Stick to the personalities defined in the instruction.
>
> Expected Output: A single user utterance per step.

### 6.4 Additional Backbone Generalization Results

To further validate that our framework is model-agnostic, we extend the clarification training experiments to additional backbones, including Qwen2.5-1.5B and Qwen3-4B. Table [5](https://arxiv.org/html/2606.03135#S6.T5) reports the performance before and after applying our information-gain-driven clarification training objective. The improvements are consistent across different backbones, suggesting that the proposed clarification training framework is not tied to a specific model architecture. We also observe slightly larger gains on the higher-capacity Qwen3-4B model, indicating that increased model capacity may help better utilize the additional information provided by clarification.

Table 5: Generalization across different clarifier backbones. Values in parentheses denote improvements over the corresponding base model.

### 6.5 Failure Analysis

Case 1: Clarification Enables Correct Constraint Resolution

Clarification resolves a key ambiguity that determines the action space; without it, the agent acts incorrectly. It enables correct grounding and policy-compliant behavior:

\(1\) The user reports that travel insurance is “already added but not showing,” creating ambiguity between a display error and missing insurance\.

\(2\) The clarifier asks whether insurance is present for reservation PEP4E0, resolving this ambiguity\.

\(3\) The agent checks the reservation and determines that insurance is not present\.

\(4\) It identifies that insurance cannot be added post\-booking and avoids invalid modification\.

\(5\) The case is escalated to a human agent\.

Case 2: Clarification Successful but Followed by Execution Failure

Clarification resolves the budget ambiguity and enables correct cost reasoning, but an error arises from the execution order and a policy violation. This shows that improved understanding does not guarantee policy-compliant action, and it separates clarification quality from execution correctness:

\(1\) The user requests a cabin upgrade with a budget constraint \(“up to $600”\), leaving ambiguity about whether it applies to the upgrade only or the total cost\.

\(2\) The agent retrieves the reservation and prematurely executes the upgrade, with the clarification issued only afterward by the clarifier\.

\(3\) After clarification, it correctly verifies the cost and budget\. However, the upgrade was already executed without explicit user confirmation, violating policy\.

\(4\) The outcome appears correct but contains a policy\-level execution error\.

The above two cases suggest that clarification alone is insufficient to address downstream failures, motivating future work on jointly optimizing clarification and execution\.

### 6.6 Hyperparameters

This section details the hyperparameter settings used for training the clarifier with DAPO under the VERL framework (Sheng et al., [2024](https://arxiv.org/html/2606.03135#bib.bib157)). All values reported below correspond to parameters explicitly specified in the training configuration and are held fixed across experiments unless otherwise noted.

##### Model and System Configuration\.

The base policy model is Qwen3\-1\.7B\. FlashAttention\-2 is enabled for attention computation, and gradient checkpointing is employed to reduce memory consumption\. Rollouts are generated using vLLM with chunked prefill enabled\. The maximum token budget per sequence is 3160 tokens, consisting of up to 1400 prompt tokens and 1760 response tokens\. GPU memory utilization for vLLM is set to 0\.60\. All experiments are conducted on a single node equipped with four GPUs\.

##### Optimization Settings\.

We adopt the DAPO advantage estimator with PPO-style policy optimization. The learning rate is set to $5\times 10^{-7}$ with a linear warmup of 60 steps. The PPO clipping range is asymmetric, with lower and upper bounds of 0.20 and 0.28, respectively. An entropy coefficient of 0.001 is applied to encourage exploration. Weight decay is set to 0.01, and gradients are clipped to a maximum norm of 1.0. Losses are aggregated using token-level mean reduction. KL regularization is incorporated into the reward computation, while no explicit KL penalty is applied during policy updates.
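To make the asymmetric clipping and token-level mean concrete, here is a minimal pure-Python sketch of the clipped surrogate loss with the bounds quoted above. It omits the entropy bonus and any KL term, the function names are hypothetical, and it is not the VERL implementation.

```python
import math

CLIP_LOW, CLIP_HIGH = 0.20, 0.28  # asymmetric bounds reported above


def clipped_token_loss(logp_new, logp_old, advantage):
    """PPO surrogate loss for one token, ratio clipped to [1 - 0.20, 1 + 0.28]."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - CLIP_LOW), 1.0 + CLIP_HIGH)
    # Pessimistic (minimum) objective, negated to obtain a loss.
    return -min(ratio * advantage, clipped * advantage)


def token_mean_loss(tokens):
    """Mean loss over all (logp_new, logp_old, advantage) triples in a batch."""
    losses = [clipped_token_loss(*t) for t in tokens]
    return sum(losses) / len(losses)


batch = [
    (math.log(1.5), 0.0, 1.0),   # ratio 1.5, positive advantage: capped at 1.28
    (math.log(0.5), 0.0, -1.0),  # ratio 0.5, negative advantage: floored at 0.80
]
print(token_mean_loss(batch))
```

The wider upper bound leaves more room to raise the probability of tokens with positive advantage than the lower bound leaves to suppress tokens with negative advantage. Averaging over tokens rather than over sequences gives every token in the batch equal weight regardless of response length.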

##### Data and Sequence Lengths\.

Training and validation data are loaded from Parquet files, with the input prompt provided under the prompt field. Dialogue histories are truncated from the left when exceeding the maximum length. The maximum prompt length is 1400 tokens, and the maximum response length is 1760 tokens. Training data are shuffled at each epoch. The chat template is configured with thinking enabled.

##### Rollout and Sampling\.

At each training step, 16 prompts are sampled. For each prompt, 8 candidate responses are generated using vLLM, resulting in 128 sampled responses per step. Sampling is performed with temperature 0.8 and top-$p = 1.0$, without top-$k$ truncation. Validation rollouts use identical sampling parameters but generate a single response per prompt. Logging of rollout statistics is disabled to reduce overhead, and dynamic batch sizing is used for log-probability computation.
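The sketch below shows one way the 8 rollouts of a single prompt could be turned into advantages, assuming the group-relative normalization that DAPO builds on. This is an assumption, since the exact estimator is not spelled out in this section. The rewards are made up, and the population standard deviation is one possible choice of scale.

```python
from statistics import mean, pstdev

PROMPTS_PER_STEP = 16
RESPONSES_PER_PROMPT = 8


def group_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards to zero mean and unit scale."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up information-gain rewards for the 8 rollouts of a single prompt.
rewards = [0.64, -0.05, 0.10, 0.00, 0.30, -0.20, 0.64, 0.05]
advantages = group_advantages(rewards)

print(PROMPTS_PER_STEP * RESPONSES_PER_PROMPT)  # 128 responses per step
print([round(a, 2) for a in advantages])
```

Because advantages are relative to the group, a clarification question is reinforced only when it yields more information gain than the other candidates sampled for the same dialogue state. If all 8 rollouts receive identical rewards, every advantage is zero and the group contributes no gradient.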

##### Batching and Policy Updates\.

Policy optimization is performed for 4 PPO epochs per batch\. Each PPO mini\-batch contains 2 prompts, and updates are further split into micro\-batches of size 1 per GPU\. The training framework internally accumulates gradients across micro\-batches to realize the effective batch size implied by the rollout and mini\-batch configuration\.

##### Training Schedule and Checkpointing\.

Training is conducted for 20 epochs, corresponding to a total of 2010 training steps\. Model checkpoints are saved every 670 steps, and unless otherwise specified, results are reported using the checkpoint at step 670\.

##### Reward Function\.

Rewards are computed using a custom scoring function implemented in simple_reward.py. No learned reward model is used, and the reward manager is configured to support DAPO-style optimization. To ensure reliable reward computation via Bayesian belief updates, we employ Qwen3-14B as the strict user simulator during training rollouts. This simulator is configured to provide task-critical information only when prompted by specific and relevant clarification questions.

##### Runtime Environment\.

All experiments are performed on a single node equipped with four NVIDIA RTX A6000 GPUs\. Under this setup, each training run completes within approximately 7–12 hours\.

##### Evaluation Setting\.

During evaluation, the Clarifier is decoded deterministically with temperature set to $0$ to ensure stable and reproducible clarification behavior. The Agent and User LLM follow the default decoding settings provided by $\tau$-Bench: the Agent operates with a low temperature of $0.01$ to reduce stochasticity in tool execution, while the User Simulator uses a temperature of $1$ to maintain underspecified and potentially ambiguous user responses. Across all experiments, the User LLM is instantiated with Qwen3-8B. While Figure 4 evaluates performance under varying clarification budgets, all other experiments adopt an unconstrained setting where the clarifier is free to intervene as needed.

### 6.7 Training Dynamics

We analyze the training dynamics of DAPO to assess the numerical stability and optimization behavior induced by the proposed information gain reward. Figure [5](https://arxiv.org/html/2606.03135#S6.F7) reports the evolution of the gradient norm throughout training. The gradient norm increases smoothly during early optimization, indicating that the policy begins to exploit informative reward signals, and subsequently stabilizes within a moderate range without abrupt spikes or divergence. The gradual decline observed in later stages suggests convergence toward a stable solution rather than gradient collapse or instability.

![Refer to caption](https://arxiv.org/html/2606.03135v1/training_dynamics_gradnorm.png)
Figure 5: Gradient Norm Evolution During DAPO Training

![Refer to caption](https://arxiv.org/html/2606.03135v1/training_dynamics_entropy.png)
Figure 6: Policy Entropy During DAPO Training

![Refer to caption](https://arxiv.org/html/2606.03135v1/training_dynamics_kl_clip.png)
Figure 7: KL Divergence and Clip Fraction During DAPO Optimization

Figure [6](https://arxiv.org/html/2606.03135#S6.F7) shows the evolution of policy entropy over training steps. Entropy decreases during the initial phase as the policy becomes more confident in its clarification decisions, followed by a period of stabilization with mild fluctuations. This behavior indicates that the policy maintains sufficient exploration while avoiding premature collapse to deterministic actions, which is critical for effective on-policy optimization.

Figure [7](https://arxiv.org/html/2606.03135#S6.F7) presents the KL divergence between successive policies together with the PPO clip fraction. Both quantities increase smoothly during training and remain within conservative ranges, reflecting well-controlled trust-region updates. The absence of sustained KL spikes or excessively high clip fractions suggests that DAPO enforces stable policy updates and avoids overly aggressive optimization.

Taken together, these training dynamics demonstrate that the information gain reward integrates seamlessly with DAPO, yielding stable gradients, controlled policy updates, and sustained exploration throughout training\.
