Uncertainty Decomposition for Clarification Seeking in LLM Agents
Summary
This paper proposes a prompt-based uncertainty decomposition method for LLM agents that separates action confidence from request uncertainty, enabling proactive clarification seeking in underspecified tasks. The method is evaluated on new clarification-augmented benchmarks across five LLM backbones, showing significant improvements.
# Uncertainty Decomposition for Clarification Seeking in LLM Agents
Source: [https://arxiv.org/html/2606.19559](https://arxiv.org/html/2606.19559)
Gregory Matsnev AI Talent Hub, ITMO University Saint Petersburg 197101, Russia gregory\.matsnev@niuitmo\.ru
###### Abstract
Recent position papers argue that the classical aleatoric/epistemic uncertainty framework is insufficient for interactive large language model \(LLM\) agents and call for underspecification\-aware, decomposed, and communicable uncertainty representations that can unlock new agent capabilities such as proactive clarification seeking and shared mental\-model building\. Practical deployment constraints \(black\-box APIs, interactive latency budgets, and the absence of labeled trajectories\) rule out logprob\-based, multi\-sampling, and training\-based methods, leaving prompt\-based estimation as the most viable family for surfacing such signals at deployment time\. We answer this call with a simple prompt\-based decomposition that separates action confidence from request uncertainty \($u$\), enabling the agent to ask for clarification when the task specification is ambiguous\. To evaluate it, we introduce two clarification\-augmented benchmarks \(WebShop\-Clarification and ALFWorld\-Clarification\) in which 50% of tasks are deliberately underspecified, and systematically compare the proposed decomposition against ReAct\+UE and Uncertainty\-Aware Memory \(UAM\) across five LLM backbones \(GPT\-5\.1, DeepSeek\-v3\.2\-exp, GLM\-4\.7, Qwen3\.5\-35B, GPT\-OSS\-120B\) on these variants together with the standard WebShop, ALFWorld, and REAL benchmarks for fault detection\. Averaged across the five backbones, the proposed decomposition improves clarification F1 on ALFWorld\-Clarification by 73% over ReAct\+UE and by 36% over UAM, and leads clarification F1 on every backbone on WebShop\-Clarification and on four of five backbones on ALFWorld\-Clarification, indicating that the gains generalize beyond a single LLM\.
*Keywords* clarification seeking · interactive benchmarks · LLM agents · prompt\-based methods · uncertainty quantification
## 1 Introduction
Large language models \(LLMs\) trained on web\-scale corpora have become general\-purpose reasoners, exhibiting in\-context learning \[[3](https://arxiv.org/html/2606.19559#bib.bib1)\], instruction following \[[31](https://arxiv.org/html/2606.19559#bib.bib2)\], and chain\-of\-thought reasoning \[[38](https://arxiv.org/html/2606.19559#bib.bib3)\], with capabilities that scale predictably with model size \[[18](https://arxiv.org/html/2606.19559#bib.bib4)\]\. Building on these foundations, a growing body of work repurposes LLMs as the controllers of *interactive agents* that plan, invoke tools, and act in external environments such as web interfaces and household simulations \[[45](https://arxiv.org/html/2606.19559#bib.bib5)\], \[[15](https://arxiv.org/html/2606.19559#bib.bib6)\], \[[39](https://arxiv.org/html/2606.19559#bib.bib7)\], \[[32](https://arxiv.org/html/2606.19559#bib.bib8)\]\. Unlike single\-turn question answering, these agents operate under partial observability: they receive underspecified natural\-language instructions, observe noisy environment states, and must chain multiple reasoning steps to complete a task\. Small errors at early steps – misinterpreting an ambiguous request, over\-trusting a noisy observation, or selecting a suboptimal action – can propagate along the trajectory and produce a confidently wrong outcome \[[47](https://arxiv.org/html/2606.19559#bib.bib9)\], \[[7](https://arxiv.org/html/2606.19559#bib.bib10)\]\.
Uncertainty estimation is a natural tool for mitigating such failures\[[1](https://arxiv.org/html/2606.19559#bib.bib11)\],\[[40](https://arxiv.org/html/2606.19559#bib.bib12)\]\. However, recent position papers argue that existing uncertainty frameworks are fundamentally inadequate for interactive agents\. Kirchhof et al\.\[[20](https://arxiv.org/html/2606.19559#bib.bib13)\]demonstrate that the traditional aleatoric/epistemic dichotomy breaks down in agent settings: when a chatbot decides whether to ask a follow\-up question, the same uncertainty can be classified as aleatoric \(irreducible at the current time point\) or epistemic \(reducible by asking\), depending on the modeler’s perspective\. They propose three research directions: underspecification uncertainties that arise when users do not provide complete information, interactive learning through follow\-up questions to reduce context uncertainty, and rich output uncertainties communicated as natural language rather than scalar scores\. Kim et al\.\[[19](https://arxiv.org/html/2606.19559#bib.bib14)\]complement this vision with agentic interpretability – a paradigm where agents proactively assist human understanding through multi\-turn interaction, developing and leveraging mutual mental models\. Together, these position papers call for uncertainty methods that are decomposed by source, communicable to users, and capable of enabling new agent capabilities beyond simple abstention\.
Practical deployment constraints further shape the design space\. Black\-box API access precludes logprob\-based methods\[[8](https://arxiv.org/html/2606.19559#bib.bib16)\],\[[29](https://arxiv.org/html/2606.19559#bib.bib17)\]\. Multi\-sampling at every agent step introduces prohibitive latency and cost in long\-horizon settings\[[37](https://arxiv.org/html/2606.19559#bib.bib18)\],\[[29](https://arxiv.org/html/2606.19559#bib.bib17)\]\. Training\-based calibrators require labeled data and model access\[[23](https://arxiv.org/html/2606.19559#bib.bib19)\],\[[35](https://arxiv.org/html/2606.19559#bib.bib20)\], and white\-box methods that read internal hidden states\[[41](https://arxiv.org/html/2606.19559#bib.bib15)\]are similarly precluded by closed APIs\. This leaves prompt\-based methods – where the agent emits uncertainty estimates as structured text alongside its actions – as the most practically viable approach for real\-world agentic deployments, despite their known limitations\.
Existing prompt\-based methods, however, produce a single scalar confidence per step\. This conflates fundamentally different sources of uncertainty\. An agent may report low confidence because the action is difficult \(e\.g\., many similar products to choose from\) or because the user request is ambiguous \(e\.g\., “find me a shirt” without specifying color or size\)\. These two situations call for different responses: the former suggests the agent should proceed cautiously, while the latter suggests it should ask the user for clarification\.
In this paper, we propose a simple decomposition of prompt\-based uncertainty into two components:
- Action confidence \($c_t$\): the agent’s confidence that its chosen action moves toward task completion, given the current understanding of the task\.
- Request uncertainty \($u_t$\): the agent’s estimate of whether the user’s goal is fully specified, ranging from 0 \(fully specified\) to 1 \(critical details missing\)\.
Unlike clarifiers that require training on labeled trajectories, our method is prompt\-only and runs on black\-box LLMs; we characterize both the capabilities and the limitations of eliciting this decomposition through prompting alone\.
We evaluate this decomposition against two prompt\-based baselines – ReAct with a simple uncertainty estimation prompt suffix \(ReAct\+UE\) and Uncertainty\-Aware Memory \(UAM\) \[[46](https://arxiv.org/html/2606.19559#bib.bib21)\] – across five LLM backbones \(GPT\-5\.1, DeepSeek\-v3\.2\-exp, GLM\-4\.7, Qwen3\.5\-35B, GPT\-OSS\-120B\), on three standard interactive benchmarks \(WebShop, ALFWorld, REAL\) and two clarification\-augmented variants \(WebShop\-Clarification, ALFWorld\-Clarification\) in which 50% of tasks are deliberately underspecified, so the agent must recognize the gap and ask the user\. We also run a sensitivity analysis over the $u_t$ clarification threshold\.
Our contributions are:
1. A comparative analysis of uncertainty estimation approaches for LLM agents, showing that practical deployment constraints leave prompt\-based methods as the only viable family and motivating a systematic study of them for proactive agent capabilities\.
2. Two clarification\-augmented benchmarks \(WebShop\-Clarification and ALFWorld\-Clarification\) in which 50% of tasks are deliberately underspecified, enabling evaluation of clarification seeking as a binary classification task\.
3. A prompt\-based decomposition method that separates action confidence from request uncertainty, enabling proactive clarification seeking, evaluated against ReAct\+UE and UAM across five LLM backbones on the clarification\-augmented benchmarks\.
4. An empirical evaluation of prompt\-based methods on fault detection across standard interactive benchmarks, surfacing their capabilities and limitations and promising directions for future work\.
## 2 Related Work
We review prior work in three threads\. We first cover single\-turn uncertainty estimation methods for LLMs, where most of the technical machinery originates\. We then turn to methods that propagate per\-step uncertainty along multi\-step agent trajectories, and summarize the families in a comparison table that motivates our focus on prompt\-based approaches\. Finally, we discuss uncertainty decomposition and clarification seeking, which form the immediate context for the proposed method\.
### 2\.1 Uncertainty Estimation in LLMs
Uncertainty estimation methods for LLMs span several families \[[40](https://arxiv.org/html/2606.19559#bib.bib12)\], \[[29](https://arxiv.org/html/2606.19559#bib.bib17)\]\. *Multi\-sampling* methods draw multiple responses and measure their disagreement: self\-consistency \[[37](https://arxiv.org/html/2606.19559#bib.bib18)\] selects the majority answer among chain\-of\-thought samples, while semantic entropy \[[21](https://arxiv.org/html/2606.19559#bib.bib43)\], \[[9](https://arxiv.org/html/2606.19559#bib.bib22)\], kernel language entropy \[[28](https://arxiv.org/html/2606.19559#bib.bib23)\], semantic\-embedding variants \[[11](https://arxiv.org/html/2606.19559#bib.bib40)\], and similarity\-based dispersion measures for black\-box NLG \[[22](https://arxiv.org/html/2606.19559#bib.bib44)\] quantify disagreement in semantic space\. Joint entropy modeling over LLM and tool contributions extends these measures to tool\-using QA systems \[[25](https://arxiv.org/html/2606.19559#bib.bib25)\], and input clarification ensembling \[[14](https://arxiv.org/html/2606.19559#bib.bib26)\] decomposes uncertainty by generating and ensembling multiple clarified versions of the input\. *Logprob\-based* methods exploit the model’s output token probabilities: perplexity \[[10](https://arxiv.org/html/2606.19559#bib.bib46)\] and predictive entropy \[[26](https://arxiv.org/html/2606.19559#bib.bib45)\] aggregate per\-token log\-probabilities over a generation, while relevance\-weighted variants such as Shifting Attention to Relevance \[[6](https://arxiv.org/html/2606.19559#bib.bib47)\] reweight tokens by semantic importance before aggregating\. *Prompt\-based* methods ask the model to express confidence directly \[[12](https://arxiv.org/html/2606.19559#bib.bib24)\], optionally calibrating the elicited scores through multi\-agent deliberation \[[42](https://arxiv.org/html/2606.19559#bib.bib41)\]\. *Training\-based* methods learn a separate model – a probe or recalibrator over the base model’s output logprobs and/or internal hidden representations, supervised by answer\-correctness labels: ProbeCal \[[23](https://arxiv.org/html/2606.19559#bib.bib19)\] recalibrates a tool\-using agent’s internal token probabilities, while MICE \[[35](https://arxiv.org/html/2606.19559#bib.bib20)\] trains a classifier on model\-internal activations; some such methods further leverage internal hidden states from intermediate layers \[[41](https://arxiv.org/html/2606.19559#bib.bib15)\]\. Engineering toolkits such as LM\-Polygraph \[[8](https://arxiv.org/html/2606.19559#bib.bib16)\] provide unified implementations across these families\.
These methods primarily target single\-turn prediction\. For multi\-step agents, uncertainty arises and evolves at each think\-act\-observe step, requiring propagation mechanisms\.
### 2\.2 Uncertainty Propagation in Agent Trajectories
Two recent frameworks address step\-wise uncertainty propagation\. SAUP\[[47](https://arxiv.org/html/2606.19559#bib.bib9)\]attaches per\-step uncertainty estimates using plug\-in estimators and propagates them via HMM\-based situational weights, achieving up to 20% AUROC improvement over final\-step\-only baselines\. UProp\[[7](https://arxiv.org/html/2606.19559#bib.bib10)\]formalizes propagation using pointwise mutual information over sampled trajectories\. Both methods, however, rely on resources that are typically unavailable in black\-box API deployments—multi\-sampling, output logprobs, or labeled trajectories for training—making them impractical in this setting\.
BrowseConf\[[30](https://arxiv.org/html/2606.19559#bib.bib27)\]uses confidence scores for test\-time scaling in web agents, dynamically allocating compute based on the agent’s self\-assessed uncertainty\. Uncertainty\-Aware Memory \(UAM\)\[[46](https://arxiv.org/html/2606.19559#bib.bib21)\]includes the agent’s confidence score and natural\-language explanation in the action history, allowing subsequent steps to reason about accumulated uncertainty\. This prompt\-based propagation requires no additional API calls, making it suitable for practical deployment\.
### 2\.3 Comparative Overview of Approaches
Table [1](https://arxiv.org/html/2606.19559#S2.T1) provides a systematic comparison of uncertainty estimation approaches for LLM agents\. Methods differ in their requirements \(logprob access, model\-internals access, multiple inference passes, training data\) and capabilities \(multi\-step support, clarification\)\. The full taxonomy is included for completeness and to motivate the design choices we formalize in Section [3](https://arxiv.org/html/2606.19559#S3)\.
### 2\.4 Uncertainty Decomposition and Clarification
Position papers argue that the aleatoric/epistemic split is insufficient for interactive agents \[[20](https://arxiv.org/html/2606.19559#bib.bib13)\], \[[34](https://arxiv.org/html/2606.19559#bib.bib28)\], and empirical analyses of uncertainty sources in LLMs and multimodal models reach similar conclusions \[[43](https://arxiv.org/html/2606.19559#bib.bib42)\]\. Kirchhof et al\. \[[20](https://arxiv.org/html/2606.19559#bib.bib13)\] introduce *underspecification uncertainty* – uncertainty arising when users do not provide complete information – as a category distinct from model knowledge gaps, and advocate for interactive learning via follow\-up questions and rich natural\-language uncertainty outputs\. Smith et al\. \[[34](https://arxiv.org/html/2606.19559#bib.bib28)\] formalize that the aleatoric/epistemic boundary is inherently modeler\-dependent\. These arguments have begun to be operationalized in concrete agent systems\. SAGE\-Agent \[[36](https://arxiv.org/html/2606.19559#bib.bib29)\] trains a POMDP\-guided clarifier with GRPO, and Hao et al\. \[[13](https://arxiv.org/html/2606.19559#bib.bib31)\] trigger human\-in\-the\-loop refinement in GUI agents when uncertainty is high; in parallel, decision\-theoretic frameworks such as DeLLMa \[[24](https://arxiv.org/html/2606.19559#bib.bib32)\] and PlanU \[[5](https://arxiv.org/html/2606.19559#bib.bib33)\] connect uncertainty to action selection through utility modeling\. A recurring caveat across this line of work is that agents tend to be systematically overconfident \[[17](https://arxiv.org/html/2606.19559#bib.bib30)\], limiting the reliability of the very signals these methods depend on\.
Table 1: Comparative Overview of Uncertainty Estimation Approaches for LLM Agents\. Requirements indicate what each method needs beyond a single forward pass\. “Multi\-step” indicates native support for trajectory\-level uncertainty\. “Clarification” indicates whether the method can trigger user clarification\. Prompt\-based methods are the only family requiring neither logprob access nor model internals nor multiple inference passes, motivating the experimental focus of this paper\.

| Family | Representative Methods | Requirements | Multi\-step | Clarification | Key Limitation |
| --- | --- | --- | --- | --- | --- |
| Logprob\-based | Perplexity \[[10](https://arxiv.org/html/2606.19559#bib.bib46)\], Pred\. Entropy \[[26](https://arxiv.org/html/2606.19559#bib.bib45)\], SAR \[[6](https://arxiv.org/html/2606.19559#bib.bib47)\], LM\-Polygraph \[[8](https://arxiv.org/html/2606.19559#bib.bib16)\] | Output token logprobs | No | No | Not applicable to black\-box APIs |
| Multi\-sampling | Self\-Consistency \[[37](https://arxiv.org/html/2606.19559#bib.bib18)\], Sem\. Entropy \[[9](https://arxiv.org/html/2606.19559#bib.bib22)\], KLE \[[28](https://arxiv.org/html/2606.19559#bib.bib23)\], Sem\. Embed\. \[[11](https://arxiv.org/html/2606.19559#bib.bib40)\], Deliberation \[[42](https://arxiv.org/html/2606.19559#bib.bib41)\], Clarif\. Ensembling \[[14](https://arxiv.org/html/2606.19559#bib.bib26)\] | $N$ forward passes per step | Partial∗ | No | $N\times$ cost/latency; not applicable to black\-box APIs |
| Training\-based | ProbeCal \[[23](https://arxiv.org/html/2606.19559#bib.bib19)\], MICE \[[35](https://arxiv.org/html/2606.19559#bib.bib20)\], Int\. Belief \[[41](https://arxiv.org/html/2606.19559#bib.bib15)\], SAGE\-Agent \[[36](https://arxiv.org/html/2606.19559#bib.bib29)\], GUI\-Agent \[[4](https://arxiv.org/html/2606.19559#bib.bib34)\] | Logprobs or model internals, labeled data | Yes | Yes† | Requires labeled data and training |
| Prompt\-based | ReAct\+UE \[[12](https://arxiv.org/html/2606.19559#bib.bib24)\], UAM \[[46](https://arxiv.org/html/2606.19559#bib.bib21)\], BrowseConf \[[30](https://arxiv.org/html/2606.19559#bib.bib27)\], Proposed | Single forward pass, prompt only | Yes | Yes‡ | Overconfidence, capability dilution |

∗ SAUP \[[47](https://arxiv.org/html/2606.19559#bib.bib9)\] and UProp \[[7](https://arxiv.org/html/2606.19559#bib.bib10)\] add multi\-step propagation atop multi\-sampling\. † SAGE\-Agent only\. ‡ Proposed method only\.
## 3 Problem Statement
We formalize the problem of uncertainty estimation for interactive LLM agents under practical deployment constraints\. We first define the agent setting and notation, then argue that these constraints leave prompt\-based methods as the only viable family, scoping our experimental comparison accordingly\. We then introduce two evaluation objectives together with the metrics used to measure them: fault detection, the standard task that uncertainty methods are designed to solve, and clarification seeking, the proactive capability the proposed decomposition is intended to enable\. Finally, we describe the two existing prompt\-based methods, ReAct\+UE and UAM, against which the proposed method is compared; we introduce them here rather than in Section [4](https://arxiv.org/html/2606.19559#S4) because they are prior work, not contributions of this paper\.
### 3\.1 Task Setting and Notation
An LLM agent operates in an environment with observation space $\mathcal{O}$ and action space $\mathcal{A}$\. A task is specified by a natural\-language goal $g \in \mathcal{G}$\. At each step $t = 1, \ldots, T$, the agent receives an observation $o_t \in \mathcal{O}$ and must produce an action $a_t \in \mathcal{A}$\. The agent is implemented as an LLM module $\pi$ that generates structured output:

$$(r_t, a_t, s_t) \sim \pi(\cdot \mid g, H_t, o_t; \phi), \tag{1}$$

where $r_t$ is the chain\-of\-thought reasoning, $s_t \in [0,1]^k$ is a vector of uncertainty signals, $H_t = \{(o_i, r_i, a_i, s_i)\}_{i=1}^{t-1}$ is the interaction history retained in context, and $\phi$ denotes the instrumentation prompt that specifies which uncertainty signals the agent should emit\. A trajectory $\tau = (o_1, a_1, \ldots, o_T, a_T)$ receives a binary success label $y(\tau) \in \{0,1\}$ determined by the environment\. For tasks drawn from clarification\-augmented benchmarks, each task also carries an underspecification label $z \in \{0,1\}$, where $z = 1$ indicates the goal is underspecified\.
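The notation above maps naturally onto a small set of record types. The following is a minimal sketch, not the paper's implementation: the class names `Step` and `Trajectory`, their field names, and the `history` helper are all hypothetical and chosen only to mirror the symbols $o_t$, $r_t$, $a_t$, $s_t$, $H_t$, $y(\tau)$, and $z$.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Step:
    """One think-act-observe step (hypothetical container, not from the paper)."""
    observation: str            # o_t
    reasoning: str              # r_t
    action: str                 # a_t
    signals: Tuple[float, ...]  # s_t, a vector in [0, 1]^k

    def __post_init__(self) -> None:
        # Enforce the range stated in the text: every signal lies in [0, 1].
        if not all(0.0 <= s <= 1.0 for s in self.signals):
            raise ValueError("uncertainty signals must lie in [0, 1]")


@dataclass
class Trajectory:
    """A goal plus its steps and the two labels used for evaluation."""
    goal: str                              # g
    steps: List[Step] = field(default_factory=list)
    success: Optional[bool] = None         # y(tau), set by the environment
    underspecified: Optional[bool] = None  # z, only on clarification benchmarks

    def history(self, t: int) -> List[Step]:
        """Return H_t, the steps 1..t-1 visible at (1-indexed) step t."""
        if t < 1:
            raise ValueError("steps are 1-indexed")
        return self.steps[: t - 1]
```

With this layout, ReAct\+UE and UAM would both use a one-element `signals` tuple (the confidence), while a decomposed method would use two elements.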
### 3\.2 Why Prompt\-Based Methods?
The families in Table [1](https://arxiv.org/html/2606.19559#S2.T1) rely on resources that are typically unavailable when deploying an agent on top of a commercial LLM API\. As Oh et al\. \[[29](https://arxiv.org/html/2606.19559#bib.bib17)\] observe, “probability\-based methods cannot be applied to most frontier LLMs” and “consistency\-based methods become infeasible due to their prohibitively high inference cost in long\-horizon, multi\-turn settings\.” Logprob\-based methods require access to output token probabilities, and training\-based methods that leverage internal hidden states require white\-box access – neither of which most production APIs expose\. Multi\-sampling methods scale compute and latency as $\mathcal{O}(NT)$ across a trajectory of length $T$, which is rarely acceptable at interactive\-agent budgets\. Training\-based methods require labeled trajectories to train a separate probe or recalibrator \[[23](https://arxiv.org/html/2606.19559#bib.bib19)\], \[[35](https://arxiv.org/html/2606.19559#bib.bib20)\], and in some cases the ability to fine\-tune the underlying model itself \[[36](https://arxiv.org/html/2606.19559#bib.bib29)\]\. Prompt\-based methods are the only family that runs in a single forward pass on a black\-box API, and they natively compose with multi\-step reasoning since the uncertainty signal is just part of the generated text\. We therefore restrict our experimental comparison to the prompt\-based family \(ReAct\+UE, UAM, and the proposed method\) and treat the other families as the context that motivates this restriction\.
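To make the $\mathcal{O}(NT)$ cost argument concrete, the sketch below counts LLM calls per trajectory. The function name and the example values ($T = 15$ steps, $N = 10$ samples per step) are illustrative assumptions, not figures from the paper.

```python
def llm_calls_per_trajectory(num_steps: int, samples_per_step: int = 1) -> int:
    """Count forward passes for a trajectory of length T with N samples per step.

    Prompt-based methods correspond to samples_per_step == 1; multi-sampling
    methods draw N > 1 responses at every step, giving N * T calls in total.
    """
    if num_steps < 0 or samples_per_step < 1:
        raise ValueError("need num_steps >= 0 and samples_per_step >= 1")
    return num_steps * samples_per_step


# Hypothetical example: a 15-step episode.
prompt_based_calls = llm_calls_per_trajectory(15)        # one pass per step
multi_sampling_calls = llm_calls_per_trajectory(15, 10)  # ten passes per step
```

Under these assumed values the multi-sampling agent issues ten times as many calls, and the gap grows linearly with trajectory length.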
### 3\.3 Fault Detection Objective
Fault detection – using a trajectory’s uncertainty signals to predict whether it will fail – is the conventional task on which uncertainty estimation methods are evaluated\. Given per\-step uncertainty signals $\{s_t\}_{t=1}^{T}$, we obtain a trajectory\-level score $S(\tau) \in [0,1]$ via an aggregation function $\mathrm{Agg}$, and evaluate how well $S$ predicts the binary success $y$\. Concretely, with $N$ trajectories and predictions $\hat{p}_i = S(\tau_i)$:

$$\text{ROC-AUC} = \Pr\!\left(\hat{p}_i > \hat{p}_j \mid y_i = 1, y_j = 0\right), \tag{2}$$

$$\text{Brier} = \frac{1}{N}\sum_{i=1}^{N}(\hat{p}_i - y_i)^2, \tag{3}$$

$$\text{ECE} = \sum_{b=1}^{B}\frac{|I_b|}{N}\,\bigl|\mathrm{acc}(I_b) - \mathrm{conf}(I_b)\bigr|, \tag{4}$$

where $\Pr(\cdot)$ denotes the probability over a uniformly drawn pair $(i, j)$ with $y_i = 1$ and $y_j = 0$, $I_b$ is the set of trajectories whose predicted score falls in confidence bin $b$, $\mathrm{acc}(I_b)$ is the empirical success rate in bin $b$, and $\mathrm{conf}(I_b)$ is the mean predicted score in bin $b$\. ROC\-AUC captures how well $S$ discriminates failing from succeeding trajectories, ECE captures how closely its values match empirical success rates \(calibration\), and the Brier score reflects both discrimination and calibration quality jointly\.
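The three metrics in Eqs. (2) to (4) can be computed directly from a list of trajectory scores and success labels. The sketch below is a plain-Python illustration under two assumptions that the text does not specify: ECE uses $B$ equal-width bins over $[0,1]$, and tied pairs in ROC-AUC receive a configurable credit (the common convention of 0.5 by default; `tie_credit=0.0` recovers the strict inequality of Eq. (2)). Function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def roc_auc(scores: Sequence[float], labels: Sequence[int],
            tie_credit: float = 0.5) -> float:
    """Eq. (2): fraction of (success, failure) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    total = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += tie_credit  # assumption: ties get partial credit
    return total / (len(pos) * len(neg))


def brier(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Eq. (3): mean squared error between score and success label."""
    return sum((p - y) ** 2 for p, y in zip(scores, labels)) / len(scores)


def ece(scores: Sequence[float], labels: Sequence[int],
        num_bins: int = 10) -> float:
    """Eq. (4) with equal-width bins (the binning scheme is an assumption)."""
    n = len(scores)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(num_bins)]
    for p, y in zip(scores, labels):
        b = min(int(p * num_bins), num_bins - 1)  # p == 1.0 goes in the last bin
        bins[b].append((p, y))
    total = 0.0
    for items in bins:
        if not items:
            continue
        acc = sum(y for _, y in items) / len(items)   # acc(I_b)
        conf = sum(p for p, _ in items) / len(items)  # conf(I_b)
        total += len(items) / n * abs(acc - conf)
    return total
```

The quadratic pairwise loop in `roc_auc` is kept for readability because it mirrors the probabilistic definition; a rank-based implementation would be preferable for large $N$.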
### 3\.4 Clarification\-Seeking Objective
Clarification seeking is the distinctive objective we evaluate beyond standard fault detection: rather than only scoring its own reliability, the agent must recognize an underspecified goal and act on it\. On clarification\-augmented benchmarks the agent has the option to emit the special action $a_t =$ request\_clarification\. Let $D(\tau) = \mathbb{1}[\exists t:\, a_t = \texttt{request\_clarification}]$ be the trajectory\-level indicator that the agent asked for clarification\. We evaluate $D$ as a binary predictor of the underspecification label $z$\. Over the $N$ trajectories, define the confusion counts

$$\begin{aligned}
\mathrm{TP} &= |\{i : D(\tau_i) = 1,\ z_i = 1\}|, & \mathrm{FP} &= |\{i : D(\tau_i) = 1,\ z_i = 0\}|,\\
\mathrm{FN} &= |\{i : D(\tau_i) = 0,\ z_i = 1\}|, & \mathrm{TN} &= |\{i : D(\tau_i) = 0,\ z_i = 0\}|.
\end{aligned}$$

The classification metrics are then

$$\text{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \tag{5}$$

$$\text{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \tag{6}$$

$$\text{F1} = \frac{2\,\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \tag{7}$$

$$\text{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{N}. \tag{8}$$

Because $D$ is a hard decision derived from the agent’s action sequence, Precision/Recall/F1/Accuracy depend on the method and threshold but not on the aggregation function $\mathrm{Agg}$\.
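The indicator $D(\tau)$ and Eqs. (5) to (8) translate into a few lines of code. This is a minimal sketch with hypothetical function names; returning 0.0 when a denominator is zero is an assumed convention that the text does not specify.

```python
from typing import Dict, Sequence

CLARIFY = "request_clarification"


def asked_for_clarification(actions: Sequence[str]) -> int:
    """D(tau): 1 if any action in the trajectory is request_clarification."""
    return int(any(a == CLARIFY for a in actions))


def clarification_metrics(asked: Sequence[int],
                          underspecified: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, F1, and accuracy of D as a predictor of z."""
    tp = sum(1 for d, z in zip(asked, underspecified) if d == 1 and z == 1)
    fp = sum(1 for d, z in zip(asked, underspecified) if d == 1 and z == 0)
    fn = sum(1 for d, z in zip(asked, underspecified) if d == 0 and z == 1)
    tn = sum(1 for d, z in zip(asked, underspecified) if d == 0 and z == 0)
    n = tp + fp + fn + tn
    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / n if n else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

Note that an agent that never asks gets recall 0 and, under the assumed convention, F1 of 0, even though its accuracy on a 50% underspecified benchmark would be 0.5; this is why F1 is the more informative headline number for this task.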
### 3\.5 Baseline: ReAct \+ Uncertainty Estimation
The first prompt\-based baseline is ReAct\+UE, formulated by Zhang et al\. \[[46](https://arxiv.org/html/2606.19559#bib.bib21)\]\. It augments the standard ReAct agent \[[45](https://arxiv.org/html/2606.19559#bib.bib5)\] with per\-step confidence elicitation in the spirit of verbalized uncertainty estimation \[[12](https://arxiv.org/html/2606.19559#bib.bib24)\]\. At each step $t$, the agent outputs:

$$\begin{aligned}
&\texttt{<think>}\ r_t\ \texttt{</think>}\\
&\texttt{<action>}\ a_t\ \texttt{</action>}\\
&\texttt{<confidence>}\ c_t \in [0,1]\ \texttt{</confidence>}\\
&\texttt{<explanation>}\ e_t\ \texttt{</explanation>}
\end{aligned} \tag{9}$$

Crucially, $c_t$ and $e_t$ are *not* written back into the agent’s history for subsequent steps\. The agent has no memory of its past uncertainty, preventing it from reasoning about accumulated confidence\. The confidence elicitation instructions are appended to every user turn and ask the agent to report a confidence value in $[0,1]$ together with a natural\-language explanation of what makes it confident, what concerns it has, what information might be missing, and what alternative actions it considered\. The full prompt for ReAct\+UE is given in Appendix [A\.1](https://arxiv.org/html/2606.19559#A1.SS1)\.
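Because the uncertainty signal arrives as tagged text, the agent harness has to parse it out of the model response. The sketch below shows one way to do this with regular expressions. It is an illustration only: the function names are hypothetical, and the handling of malformed output (a missing confidence becomes `None`, out-of-range values are clamped to $[0,1]$) is an assumption rather than the behavior described in the paper.

```python
import re
from typing import Dict, Optional


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped content of the first <tag>...</tag> block, if any."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_react_ue(text: str) -> Dict[str, object]:
    """Parse the four fields of the ReAct+UE output format in Eq. (9)."""
    action = extract_tag(text, "action")
    if action is None:
        raise ValueError("no <action> block found")
    raw_conf = extract_tag(text, "confidence")
    try:
        confidence: Optional[float] = float(raw_conf)  # TypeError if None
    except (TypeError, ValueError):
        confidence = None  # assumption: tolerate a missing or malformed score
    if confidence is not None:
        confidence = min(1.0, max(0.0, confidence))  # assumption: clamp to [0, 1]
    return {
        "reasoning": extract_tag(text, "think"),
        "action": action,
        "confidence": confidence,
        "explanation": extract_tag(text, "explanation"),
    }
```

A stricter harness might instead reject malformed responses and re-prompt; which choice is appropriate depends on how often a given backbone violates the format.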
### 3\.6 Baseline: Uncertainty\-Aware Memory
The second prompt\-based baseline is Uncertainty\-Aware Memory \(UAM\), introduced by Zhang et al\. \[[46](https://arxiv.org/html/2606.19559#bib.bib21)\]\. UAM uses the same output format and confidence elicitation prompt as ReAct\+UE, but now propagates the confidence score and explanation through the agent’s history:

$$H_t^{\text{UAM}} = \{(o_i, r_i, a_i, c_i, e_i)\}_{i=1}^{t-1}. \tag{10}$$

This allows the agent to reason about past uncertainty levels, adjust confidence based on accumulated evidence, and detect patterns of decreasing or increasing certainty\. Following Zhang et al\. \[[46](https://arxiv.org/html/2606.19559#bib.bib21)\], we use their “Variant B: Semantic Propagation”\. UAM therefore differs from ReAct\+UE only in whether uncertainty is retained in context; it shares the same single\-scalar confidence signal $s_t = c_t$\. The full prompt for UAM is given in Appendix [A\.2](https://arxiv.org/html/2606.19559#A1.SS2)\.
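The only difference between the two baselines is what gets serialized back into the context. The sketch below makes that difference explicit with a single flag. The text layout of the rendered history and the dictionary keys are hypothetical; the actual prompts are in the paper's appendices and are not reproduced here.

```python
from typing import Dict, List


def render_history(steps: List[Dict], propagate_uncertainty: bool) -> str:
    """Serialize past steps into prompt text.

    propagate_uncertainty=False mimics ReAct+UE (confidence is elicited but
    dropped from history); True mimics UAM, whose history in Eq. (10) also
    carries each step's confidence c_i and explanation e_i.
    """
    lines: List[str] = []
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i} observation: {step['observation']}")
        lines.append(f"Step {i} reasoning: {step['reasoning']}")
        lines.append(f"Step {i} action: {step['action']}")
        if propagate_uncertainty:
            lines.append(f"Step {i} confidence: {step['confidence']:.2f}")
            lines.append(f"Step {i} explanation: {step['explanation']}")
    return "\n".join(lines)
```

Since both variants make the same number of model calls, any difference in behavior comes purely from the extra context tokens that UAM carries forward.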
## 4 Proposed Method
We now introduce the proposed method\. Its defining feature is a decomposition of the single confidence scalar used by ReAct\+UE and UAM into two semantically distinct signals: an action confidence $c_t$ and a request uncertainty $u_t$\. Figure [1](https://arxiv.org/html/2606.19559#S4.F1) sketches the per\-step data flow and Algorithm [1](https://arxiv.org/html/2606.19559#alg1) gives the step\-level pseudocode for the proposed method\. The remainder of this section describes the two signals, the clarification trigger, the history propagation, and the trajectory\-level aggregation strategies\. The full prompt for the proposed method is given in Appendix [A\.3](https://arxiv.org/html/2606.19559#A1.SS3)\.
Figure 1: Proposed method at step $t$\. The LLM module $\pi$ \(blue\) consumes the goal $g$, current observation $o_t$, and history $H_t$ in one forward pass and emits two uncertainty signals: request uncertainty $u_t$ with explanation $x_t$ \(orange\), and action confidence $c_t$ with explanation $e_t$ alongside the reasoning $r_t$ and proposed action $a_t$ \(green\)\. The deterministic routing test $u_t \geq \theta$ switches between request\_clarification and execution of $a_t$\. All emitted fields are appended to the history for subsequent steps\. Assessing $u_t$ before emitting $a_t$ ensures underspecification is caught prior to any action, giving the agent a dedicated channel for goal ambiguity that a single confidence score cannot provide\.

Algorithm 1: Proposed step: decomposed prompt\-based uncertainty with clarification routing\. The LLM module $\pi$ consumes the goal $g$, current observation $o_t$, and history $H_t$ in a single forward pass, and emits reasoning $r_t$, request uncertainty $u_t$ with explanation $x_t$, action $a_t$, and action confidence $c_t$ with explanation $e_t$\. The $u_t \geq \theta$ test deterministically routes the agent to request\_clarification when the goal is judged underspecified\. All emitted fields are appended to history so later steps can reason over past uncertainty\.

Require: goal $g$, observation $o_t$, history $H_t$, threshold $\theta$

Ensure: action $a_t^{\star}$, updated history $H_{t+1}$

1. $(r_t, u_t, x_t, a_t, c_t, e_t) \sim \pi(\cdot \mid g, H_t, o_t; \phi_{\text{proposed}})$
2. if $u_t \geq \theta$ then
3. $\quad a_t^{\star} \leftarrow$ request\_clarification
4. else
5. $\quad a_t^{\star} \leftarrow a_t$
6. end if
7. $H_{t+1} \leftarrow H_t \cup \{(o_t, r_t, u_t, x_t, a_t^{\star}, c_t, e_t)\}$
8. return $a_t^{\star}, H_{t+1}$
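Algorithm 1 can be expressed as a short function. The following is a minimal sketch under stated assumptions: `llm` is a hypothetical callable standing in for $\pi$ with the instrumentation prompt already applied, and it is assumed to return an already-parsed dictionary with the keys shown in the comments. None of these names come from the paper.

```python
from typing import Callable, Dict, List, Tuple

CLARIFY = "request_clarification"

# Hypothetical signature for pi: (goal, history, observation) -> parsed fields.
LLMFn = Callable[[str, List[Dict], str], Dict]


def proposed_step(llm: LLMFn, goal: str, observation: str,
                  history: List[Dict], theta: float = 0.5
                  ) -> Tuple[str, List[Dict]]:
    """One step of Algorithm 1: emit signals, route on u_t, update history."""
    # Line 1: a single forward pass yields all six fields. Assumed keys:
    # "reasoning", "u", "u_explanation", "action", "confidence", "explanation".
    out = llm(goal, history, observation)

    # Lines 2-6: deterministic routing on the request uncertainty.
    executed = CLARIFY if out["u"] >= theta else out["action"]

    # Line 7: append every emitted field, recording the executed action a_t*.
    record = {"observation": observation, **out, "action": executed}
    new_history = history + [record]  # build a new list; the input is untouched

    # Line 8.
    return executed, new_history
```

Because the test is `>=`, the middle anchor $u_t = 0.5$ triggers clarification at the default threshold $\theta = 0.5$; raising $\theta$ above 0.5 restricts clarification to goals the agent rates as more severely underspecified.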
### 4\.1 Decomposed Uncertainty Signals
The proposed method extends UAM by replacing the single $c_t$ scalar with the two\-scalar signal $(u_t, c_t)$\. At each step the agent emits, in order:

$$\begin{aligned}
&\texttt{<think>}\,r_t\,\texttt{</think>}\\
&\texttt{<u\_request>}\,u_t \in [0,1]\,\texttt{</u\_request>}\\
&\texttt{<u\_request\_explanation>}\,x_t\,\texttt{</\ldots>}\\
&\texttt{<action>}\,a_t\,\texttt{</action>}\\
&\texttt{<confidence>}\,c_t \in [0,1]\,\texttt{</confidence>}\\
&\texttt{<explanation>}\,e_t\,\texttt{</explanation>}
\end{aligned} \tag{11}$$
The two signals serve distinct purposes and have different intellectual origins\.
##### Action confidence $c_t$.
Estimates how likely the chosen action $a_t$ is to make progress toward task completion, conditioned on the agent’s current understanding of the goal. This signal is unchanged from the baselines; we retain it so that the proposed method can be directly compared against them on the fault detection task.
##### Request uncertainty $u_t$.
Estimates the degree to which the user’s goal is underspecified, on a three-point anchored scale:
- • $u_t = 0$: The goal fully specifies every relevant parameter; there is exactly one correct interpretation.
- • $u_t = 0.5$: At least one parameter is left open; the user likely has a specific preference that is not stated.
- • $u_t = 1$: Critical details are missing; many equally valid interpretations exist.
This scale operationalizes the “underspecification uncertainty” category argued for by Kirchhof et al\.\[[20](https://arxiv.org/html/2606.19559#bib.bib13)\]and conceptually mirrors input clarification ensembling\[[14](https://arxiv.org/html/2606.19559#bib.bib26)\], but realizes it at the prompt level rather than through multi\-sampling\.
##### Clarification trigger\.
The agent is instructed that when $u_t \geq \theta$ (with $\theta = 0.5$ unless stated otherwise), the action it emits must be request_clarification. This creates a direct, deterministic link between the uncertainty estimate and an observable behavior, and is what allows the decomposition to be evaluated as a binary classifier over the underspecification label $z$.
##### History propagation.
The two signals $u_t$ and $c_t$, together with their explanations $x_t$ and $e_t$, are included in the history, following UAM’s semantic propagation:

$$
H_t^{\text{proposed}} = \{(o_i, r_i, u_i, x_i, a_i, c_i, e_i)\}_{i=1}^{t-1}. \tag{12}
$$

Compared to UAM’s history in Eq. ([10](https://arxiv.org/html/2606.19559#S3.E10)), this adds the $(u_i, x_i)$ pair so the agent can reason about request-level uncertainty across steps.
##### Output field ordering\.
Within the structured output of Eq. ([11](https://arxiv.org/html/2606.19559#S4.E11)), the agent emits the $u_t$ assessment before the action $a_t$. The agent first decides whether the goal is sufficiently specified, and this judgment is allowed to influence the action choice (clarification vs. task action).
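To make the ordering constraint concrete, the following sketch parses a step output and rejects responses whose fields are missing or out of order. It is an assumption-laden illustration rather than the paper's parser: `parse_step` and `FIELD_ORDER` are hypothetical names, and the closing tag of the request-explanation field, which Eq. (11) abbreviates, is assumed here to mirror its opening tag.

```python
import re
from typing import Dict

# Field order of Eq. (11); the closing tag of u_request_explanation is an
# assumption, since the equation abbreviates it.
FIELD_ORDER = [
    "think",
    "u_request",
    "u_request_explanation",
    "action",
    "confidence",
    "explanation",
]


def parse_step(text: str) -> Dict[str, object]:
    """Extract the six fields in order; raise ValueError on any violation."""
    fields: Dict[str, object] = {}
    cursor = 0
    for tag in FIELD_ORDER:
        pattern = re.compile(rf"<{tag}>(.*?)</{tag}>", re.DOTALL)
        # Searching from the cursor enforces that each field follows the last.
        match = pattern.search(text, cursor)
        if match is None:
            raise ValueError(f"missing or out-of-order field: <{tag}>")
        fields[tag] = match.group(1).strip()
        cursor = match.end()
    for tag in ("u_request", "confidence"):
        value = float(fields[tag])  # raises ValueError if not numeric
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"<{tag}> must lie in [0, 1], got {value}")
        fields[tag] = value
    return fields
```

Scanning forward from a cursor, rather than matching each tag anywhere in the text, is what turns "emit $u_t$ before $a_t$" from a prompt instruction into a property the harness can check.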
### 4\.2Prompt
The proposed method instructs the agent, before selecting an action, to assess request uncertainty $u_t$ on the anchored 0/0.5/1 scale defined above and to explain its assessment in a free-text field. The full prompt for the proposed method is given in Appendix [A.3](https://arxiv.org/html/2606.19559#A1.SS3).
### 4\.3Trajectory\-Level Aggregation
Every method except ReAct+UE already aggregates uncertainty across steps implicitly, by propagating the per-step estimates through the agent’s history (Eqs. ([10](https://arxiv.org/html/2606.19559#S3.E10)) and ([12](https://arxiv.org/html/2606.19559#S4.E12))). For completeness and comparability with other work on LLM-agent uncertainty, we additionally apply, on top of this built-in propagation, a set of explicit trajectory-level aggregation functions $\mathrm{Agg}$ to the per-step signals. For action confidence $c_1, \ldots, c_T$ we consider four strategies:

$$S_c^{\text{last}} = c_T, \tag{13}$$

$$S_c^{\text{avg}} = \frac{1}{T}\sum_{t=1}^{T} c_t, \tag{14}$$

$$S_c^{\text{min}} = \min_{t} c_t, \tag{15}$$

$$S_c^{\text{prod}} = \Bigl(\prod_{t=1}^{T} c_t\Bigr)^{1/T}. \tag{16}$$

The *last* strategy ($S_c^{\text{last}} = c_T$) reports the final-step confidence, which for the methods that propagate uncertainty through history (UAM and the proposed method) already summarizes the trajectory via the aggregation built into the method itself. The *product* strategy is the geometric mean of the per-step confidences and instantiates the joint-validity estimate that Zhang et al.\[[46](https://arxiv.org/html/2606.19559#bib.bib21)\] use to formalize the “Spiral of Hallucination,” in which a single low-confidence step compounds multiplicatively and collapses the whole-trajectory score. For request uncertainty $u_1, \ldots, u_T$ we report four strategies:

$$S_u^{\text{first}} = u_1, \tag{17}$$

$$S_u^{\text{avg}} = \frac{1}{T}\sum_{t=1}^{T} u_t, \tag{18}$$

$$S_u^{\text{max}} = \max_{t} u_t, \tag{19}$$

$$S_u^{\text{prod}} = 1 - \Bigl(\prod_{t=1}^{T} (1 - u_t)\Bigr)^{1/T}. \tag{20}$$

These four strategies mirror the action-confidence aggregations of Eqs. ([13](https://arxiv.org/html/2606.19559#S4.E13))–([16](https://arxiv.org/html/2606.19559#S4.E16)) with $1 - u_t$ playing the role of the confidence $c_t$: averaging (Eq. ([18](https://arxiv.org/html/2606.19559#S4.E18))) and the geometric mean (Eq. ([20](https://arxiv.org/html/2606.19559#S4.E20)), applied to $\prod_t (1 - u_t)$) carry over directly, the conservative aggregation flips from the minimum confidence to the maximum uncertainty (Eq. ([19](https://arxiv.org/html/2606.19559#S4.E19))), and the single-step score is taken at the first step (Eq. ([17](https://arxiv.org/html/2606.19559#S4.E17))), where underspecification is typically most apparent, rather than the last. As we report in Table [6](https://arxiv.org/html/2606.19559#A2.T6) and discuss in Section [7](https://arxiv.org/html/2606.19559#S7), the choice of aggregation strategy introduces a substantial hyperparameter that can dominate the effect of the uncertainty method itself.
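The eight aggregations of Eqs. (13)–(20) are short enough to state directly in code. The sketch below is illustrative only; the function names are hypothetical, and returning 0 for the geometric mean when any factor is 0 is a convention chosen here to avoid taking the logarithm of zero.

```python
import math
from typing import Dict, Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Geometric mean of values in [0, 1]; defined as 0 if any value is 0."""
    if not values:
        raise ValueError("need at least one step")
    if any(v <= 0.0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def aggregate_confidence(c: Sequence[float]) -> Dict[str, float]:
    """Eqs. (13)-(16): last, avg, min, and geometric-mean product."""
    if not c:
        raise ValueError("need at least one step")
    return {
        "last": c[-1],
        "avg": sum(c) / len(c),
        "min": min(c),
        "prod": geometric_mean(c),
    }


def aggregate_request_uncertainty(u: Sequence[float]) -> Dict[str, float]:
    """Eqs. (17)-(20): first, avg, max, and the complement-product form."""
    if not u:
        raise ValueError("need at least one step")
    return {
        "first": u[0],
        "avg": sum(u) / len(u),
        "max": max(u),
        "prod": 1.0 - geometric_mean([1.0 - x for x in u]),
    }
```

For example, confidences `[1.0, 0.25]` give a product score of 0.5 while the average is 0.625, showing how one low step pulls the geometric mean down more than the arithmetic mean.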
## 5Experimental Setup
We describe the benchmarks used for evaluation \(Section[5\.1](https://arxiv.org/html/2606.19559#S5.SS1)\), the metrics reported \(Section[5\.2](https://arxiv.org/html/2606.19559#S5.SS2)\), the models and prompt configurations used for data collection \(Section[5\.3](https://arxiv.org/html/2606.19559#S5.SS3)\), and the evaluation protocol that ties them together \(Section[5\.4](https://arxiv.org/html/2606.19559#S5.SS4)\)\.
### 5\.1Benchmarks
We evaluate on five benchmark configurations, grouped into standard benchmarks for fault detection and clarification\-augmented variants for clarification seeking\.
##### Standard benchmarks \(fault detection\)
We use three standard interactive benchmarks, with no deliberately injected ambiguity\. On these we evaluate fault detection, the conventional task on which uncertainty estimation methods are evaluated:
- •WebShop\[[44](https://arxiv.org/html/2606.19559#bib.bib35)\]: Online shopping over 1,000\+ products, where the agent searches, filters, and selects items matching natural\-language instructions\.
- •ALFWorld\[[33](https://arxiv.org/html/2606.19559#bib.bib36)\]: Household embodied agent performing tasks \(e\.g\., “put a clean mug on the desk”\) in text\-based simulations across 6 task types\.
- •REAL\[[2](https://arxiv.org/html/2606.19559#bib.bib37)\]: Practical multi\-turn tasks on deterministic simulations of 11 real websites\.
##### Clarification\-augmented variants
Starting from the standard benchmarks above, we construct two clarification\-augmented variants by deliberately underspecifying 50% of tasks\. This lets us evaluate whether the agent can distinguish underspecified from fully specified goals and triggerrequest\_clarificationaccordingly:
- •WebShop\-Clarification: Starting from WebShop, for underspecified tasks we strip attribute words \(e\.g\., “black”, “leather”\) and option clauses \(e\.g\., “with color: black”\) from the instruction, keeping only the base product type and price constraint\. The modification preserves task feasibility \(the environment still contains valid products\) while removing the information the user would normally provide\.
- •ALFWorld\-Clarification: Starting from ALFWorld, for underspecified tasks we randomly remove either the object or the receptacle from the goal \(e\.g\., “put something in container” or “put a mug somewhere”\)\.
In both variants, the agent can emit request_clarification to flag the task as ambiguous. When it does so on an underspecified task, the original fully specified goal is revealed and the episode continues, simulating a user who supplies the missing details, so that task success rate is measured fairly even on tasks where clarification was required. Each task in these variants carries a binary underspecification label $z$ that records whether it was deliberately underspecified ($z = 1$) or left fully specified ($z = 0$); this label is the ground truth against which the agent’s clarification decision is scored.
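The reveal rule described above can be sketched as a small helper. This is a hypothetical illustration: `ClarificationTask` and `goal_after_action` are names introduced here, and leaving the goal unchanged when clarification is requested on a fully specified task is an assumption, since the text only specifies the behavior for underspecified tasks.

```python
from dataclasses import dataclass

REQUEST_CLARIFICATION = "request_clarification"


@dataclass(frozen=True)
class ClarificationTask:
    full_goal: str   # original, fully specified instruction
    shown_goal: str  # instruction initially shown to the agent
    z: int           # 1 if deliberately underspecified, else 0


def goal_after_action(task: ClarificationTask, current_goal: str, action: str) -> str:
    """Return the goal visible to the agent after it takes `action`."""
    if action == REQUEST_CLARIFICATION and task.z == 1:
        # Simulated user supplies the missing details.
        return task.full_goal
    # Assumption: any other case leaves the visible goal unchanged.
    return current_goal
```

Keeping `z` on the task object, separate from what the agent sees, lets the same record serve both as environment state and as the ground-truth label for scoring.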
We note that evaluation of uncertainty\-aware agents remains an open challenge\[[27](https://arxiv.org/html/2606.19559#bib.bib38)\]; our benchmarks specifically isolate the clarification\-seeking capability\.
### 5\.2Metrics
We report two metric families together with task success rate: the fault-detection metrics from Eqs. ([2](https://arxiv.org/html/2606.19559#S3.E2))–([4](https://arxiv.org/html/2606.19559#S3.E4)) (ROC-AUC, ECE, Brier) and the clarification-seeking metrics from Eqs. ([5](https://arxiv.org/html/2606.19559#S3.E5))–([8](https://arxiv.org/html/2606.19559#S3.E8)) (Precision, Recall, F1, Accuracy). The fault-detection metrics gauge how well a method solves the conventional task on which uncertainty estimation methods are evaluated (predicting from a trajectory’s uncertainty signals whether it will fail), while the clarification-seeking metrics gauge the distinctive capability the proposed decomposition is meant to enable: recognizing an underspecified goal and acting on it. Fault-detection metrics use the trajectory-level score $S(\tau)$ paired with the success label $y(\tau)$ and are reported per (method, aggregation) pair on all five benchmarks; clarification-seeking metrics use the binary clarification decision $D(\tau)$ paired with the underspecification label $z(\tau)$ and are reported per method on the two clarification-augmented benchmarks.
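For reference, the clarification-seeking metrics reduce to confusion-matrix counts over $(D(\tau), z(\tau))$ pairs. The sketch below assumes the standard definitions of precision, recall, F1, and accuracy; the function name and the convention of returning 0 when a denominator is empty are choices made here, not taken from the paper.

```python
from typing import Dict, Sequence


def clarification_metrics(decisions: Sequence[int], labels: Sequence[int]) -> Dict[str, float]:
    """Score binary clarification decisions D against underspecification labels z."""
    if len(decisions) != len(labels) or not labels:
        raise ValueError("decisions and labels must be non-empty and equal length")
    pairs = list(zip(decisions, labels))
    tp = sum(1 for d, z in pairs if d == 1 and z == 1)  # asked, was underspecified
    fp = sum(1 for d, z in pairs if d == 1 and z == 0)  # asked, was fully specified
    fn = sum(1 for d, z in pairs if d == 0 and z == 1)  # stayed silent, was underspecified
    tn = sum(1 for d, z in pairs if d == 0 and z == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}
```

An agent that never asks for clarification scores recall 0 and therefore F1 0 regardless of its accuracy on fully specified tasks, which is why F1 rather than accuracy is the headline number on a 50/50 benchmark.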
### 5\.3Models and Implementation Details
We report results for all five backbones (GPT-5.1, DeepSeek-v3.2-exp, GLM-4.7, Qwen3.5-35B, GPT-OSS-120B), so we can check whether the qualitative findings persist across models. Because the choice of trajectory-level aggregation is critical for fault detection, the fault-detection deep-dive in Section [6.2](https://arxiv.org/html/2606.19559#S6.SS2) reports only GPT-5.1 for legibility; the corresponding fault-detection metrics for the remaining four backbones are given in Appendix Table [5](https://arxiv.org/html/2606.19559#A2.T5). For the proposed method we use $\theta = 0.5$ as the default clarification threshold and report a dedicated sensitivity ablation in Section [6.6](https://arxiv.org/html/2606.19559#S6.SS6).
### 5\.4Evaluation Protocol
We evaluate each of the three methods (ReAct+UE, UAM, and the proposed method) on each of the five benchmark configurations described in Section [5.1](https://arxiv.org/html/2606.19559#S5.SS1). For every task in each benchmark we run the method once with the corresponding prompt instrumentation $\phi$; this yields a trajectory $\tau = (o_1, a_1, \ldots, o_T, a_T)$, a success label $y(\tau)$, and per-step signals $(s_1, \ldots, s_T)$. Each (method, benchmark) pair is run on 100 tasks.
The trajectory-level continuous score $S(\tau) = \mathrm{Agg}(s_1, \ldots, s_T)$, computed for each aggregation in Section [4.3](https://arxiv.org/html/2606.19559#S4.SS3), is paired with the success label $y(\tau)$ to produce the fault-detection metrics of Eqs. ([2](https://arxiv.org/html/2606.19559#S3.E2))–([4](https://arxiv.org/html/2606.19559#S3.E4)); these are computed on all five benchmarks, since every task carries a success label. The binary clarification decision $D(\tau)$ of Section [3.4](https://arxiv.org/html/2606.19559#S3.SS4) is paired with the underspecification label $z(\tau)$ to produce the clarification-seeking metrics of Eqs. ([5](https://arxiv.org/html/2606.19559#S3.E5))–([8](https://arxiv.org/html/2606.19559#S3.E8)); these are computed only on the two clarification-augmented variants.
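A dependency-free sketch of two of the fault-detection metrics may help fix the pairing of $S(\tau)$ with $y(\tau)$. It assumes the standard definitions (ROC-AUC as the probability that a successful trajectory outscores a failed one, with ties counted as one half, and the Brier score as mean squared error) and that higher scores indicate higher confidence in success; the function names are hypothetical.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise ROC-AUC of confidence-like scores against success labels."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def brier_score(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between the score and the 0/1 success label."""
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal length")
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)
```

The quadratic pairwise loop is adequate at the scale described here (100 tasks per method and benchmark); a rank-based implementation would be preferable for much larger evaluations.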
## 6Results
We present the quantitative results, organized around the two task families: Sections[6\.1](https://arxiv.org/html/2606.19559#S6.SS1)and[6\.2](https://arxiv.org/html/2606.19559#S6.SS2)report the headline results on the clarification\-augmented and standard benchmarks respectively\. Section[6\.3](https://arxiv.org/html/2606.19559#S6.SS3)then consolidates the task success\-rate trend that spans both benchmark families, and Section[6\.4](https://arxiv.org/html/2606.19559#S6.SS4)reports the calibration of the per\-step confidence signals\. Section[6\.5](https://arxiv.org/html/2606.19559#S6.SS5)presents a diagnostic showing that product aggregation largely tracks trajectory length, which informs how we interpret its fault\-detection scores; Section[6\.6](https://arxiv.org/html/2606.19559#S6.SS6)reports a sensitivity analysis over the clarification thresholdθ\\theta\.
### 6\.1Clarification Seeking on Modified Benchmarks
Figure [2](https://arxiv.org/html/2606.19559#S6.F2) reports clarification F1 and task success rate across all backbones on both clarification-augmented benchmarks. The proposed method leads F1 on most (backbone, benchmark) pairs: on every backbone on WebShop-Clar. and on four of five on ALFWorld-Clar. Averaged across the five backbones on each benchmark, its margin is roughly 0.28 in absolute F1 over ReAct+UE and 0.21 over UAM, indicating that the method generalizes beyond a single backbone LLM.
Figure 2:Clarification F1 \(top\) and task success rate \(bottom\) on the two clarification\-augmented benchmarks across all five LLM backbones\. Bars are grouped by method\. The proposed method leads clarification F1 on every backbone on WebShop\-Clar\. and on four of five on ALFWorld\-Clar\., confirming that the decomposition enables clarification seeking where scalar\-confidence baselines cannot\.
### 6\.2Fault Detection on Standard Benchmarks
Figure[3](https://arxiv.org/html/2606.19559#S6.F3)reports fault\-detection ROC\-AUC under all four aggregations, together with task success rate, on the three standard benchmarks for GPT\-5\.1\. The proposed method preserves discrimination: it reaches the highest*last*\- and*avg*\-aggregation ROC\-AUC on WebShop and REAL and remains within 0\.08 of the best ROC\-AUC on ALFWorld\. On ALFWorld, product aggregation attains the highest fault\-detection ROC\-AUC of any aggregation–benchmark combination; we quantify this effect in Section[6\.5](https://arxiv.org/html/2606.19559#S6.SS5)and discuss its implications in Section[7\.4](https://arxiv.org/html/2606.19559#S7.SS4)\.
The decomposition does not come at the cost of fault detection: across backbones the proposed method solves this conventional uncertainty task on par with the ReAct\+UE and UAM baselines, with full per\-\(backbone, method, aggregation\) metrics in Appendix Table[5](https://arxiv.org/html/2606.19559#A2.T5)\.
Figure 3:Fault\-detection ROC\-AUC across the four trajectory\-level aggregations \(top\) and task success rate \(bottom\) on the three standard benchmarks \(GPT\-5\.1\)\. Bars are grouped by method\. Across aggregations and benchmarks the three methods achieve comparable fault\-detection ROC\-AUC, confirming that adding the request\-uncertainty decomposition does not sacrifice the conventional uncertainty objective\.
### 6\.3Task Success Rate Across Methods
The success-rate panels of Figures [2](https://arxiv.org/html/2606.19559#S6.F2) and [3](https://arxiv.org/html/2606.19559#S6.F3) show that task success rate tends to decline as the agent is given more uncertainty instrumentation. Averaged across all five benchmarks and all five backbones, mean success rate falls monotonically from $28.6\%$ for ReAct+UE to $27.8\%$ for UAM and $27.0\%$ for the proposed method; success rate for every backbone and method is reported in Appendix Tables [5](https://arxiv.org/html/2606.19559#A2.T5) and [4](https://arxiv.org/html/2606.19559#A2.T4). We call this effect *capability dilution* and examine it in Section [7.2](https://arxiv.org/html/2606.19559#S7.SS2).
### 6\.4Calibration
Figure[4](https://arxiv.org/html/2606.19559#S6.F4)reports reliability diagrams for the three methods across all five benchmarks for GPT\-5\.1\. Across every method and benchmark the curve lies below the diagonal—predicted confidence systematically exceeds observed success rate—with per\-panel ECE ranging from 0\.24 to 0\.66\. We examine the implications of this overconfidence in Section[7\.3](https://arxiv.org/html/2606.19559#S7.SS3); the corresponding diagrams for the remaining four backbones are given in Appendix[C](https://arxiv.org/html/2606.19559#A3)\.
Figure 4:Reliability diagrams for GPT\-5\.1: the three methods \(rows\) across the five benchmarks \(columns\), under last\-step aggregation\. Each point bins trajectories by action confidence and plots observed success rate against the bin’s predicted confidence; the dashed line marks perfect calibration\. All curves lie below the diagonal, indicating systematic overconfidence for every method and benchmark\.
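The points of a reliability diagram, and the ECE derived from them, can be computed as follows. This is a sketch under assumptions: equal-width bins and `n_bins=10` are choices made here (the binning used for Figure 4 is not stated in this section), and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def reliability_bins(
    confidences: Sequence[float], successes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean confidence, observed success rate, count) per non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, successes):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 falls in the top bin
        bins[idx].append((c, y))
    rows = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            success_rate = sum(y for _, y in members) / len(members)
            rows.append((mean_conf, success_rate, len(members)))
    return rows


def expected_calibration_error(
    confidences: Sequence[float], successes: Sequence[int], n_bins: int = 10
) -> float:
    """Count-weighted mean absolute gap between confidence and success rate."""
    rows = reliability_bins(confidences, successes, n_bins)
    total = sum(n for _, _, n in rows)
    return sum(n / total * abs(conf - acc) for conf, acc, n in rows)
```

A bin whose success rate is lower than its mean confidence plots below the diagonal, which is the overconfidence pattern the caption describes.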
### 6\.5Product Aggregation and Trajectory Length
Table [2](https://arxiv.org/html/2606.19559#S6.T2) reports fault-detection metrics for trajectories in which the real per-step confidences are replaced by two confidence-free surrogates, with product aggregation re-applied to each: i.i.d. $\mathcal{U}(0,1)$ draws of matched trajectory length (averaged over 50 seeds) and the deterministic $1/\text{length}$ score. On ALFWorld both surrogates reach high ROC-AUC (0.92–0.99), matching or exceeding the real-product score for every method, with the lone exception of the random surrogate under UAM (0.94 vs. 0.96); the effect is weaker but present on WebShop and small on REAL. We discuss what this length confound implies for interpreting product-aggregation scores in Section [7.4](https://arxiv.org/html/2606.19559#S7.SS4).
Table 2: Product aggregation as a length proxy on the standard benchmarks. *random* draws per-step values from $\mathcal{U}(0,1)$ matched to each trajectory length, averaged over 50 seeds; $1/\text{length}$ uses the trajectory step count directly. Best per (method, benchmark, metric) in bold (GPT-5.1). On ALFWorld both confidence-free surrogates reach ROC-AUC of 0.92–0.99, matching or exceeding the real-product score for nearly every method (the lone exception being the random surrogate under UAM), confirming that high product-aggregation ROC-AUC on that benchmark reflects trajectory-length confounding rather than an informative confidence signal.
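The two surrogates can be reproduced in outline with the sketch below. It assumes the geometric-mean form of Eq. (16) for product aggregation and averages ROC-AUC over seeds; the function names, the seeding scheme, and the choice to average the metric rather than the scores are assumptions made here, not details confirmed by the text.

```python
import math
import random
from typing import Sequence


def geometric_mean(values: Sequence[float]) -> float:
    """Product aggregation in the geometric-mean form; 0 if any value is 0."""
    if any(v <= 0.0 for v in values):
        return 0.0
    return math.exp(sum(math.log(v) for v in values) / len(values))


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise ROC-AUC with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def random_surrogate_auc(
    lengths: Sequence[int], labels: Sequence[int], n_seeds: int = 50
) -> float:
    """Mean ROC-AUC when per-step confidences are i.i.d. U(0,1) draws."""
    aucs = []
    for seed in range(n_seeds):
        rng = random.Random(seed)
        scores = [geometric_mean([rng.random() for _ in range(n)]) for n in lengths]
        aucs.append(roc_auc(scores, labels))
    return sum(aucs) / len(aucs)


def inverse_length_auc(lengths: Sequence[int], labels: Sequence[int]) -> float:
    """ROC-AUC of the deterministic 1/length score."""
    return roc_auc([1.0 / n for n in lengths], labels)
```

Both functions take only trajectory lengths and success labels as input, so any discrimination they achieve cannot come from the agent's reported confidence.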
### 6\.6Clarification Threshold Sensitivity
Table [3](https://arxiv.org/html/2606.19559#S6.T3) reports clarification-seeking metrics for the proposed method under three values of the clarification threshold, $\theta \in \{0.25, 0.5, 0.75\}$, across all five backbones on both clarification-augmented benchmarks.
Table 3: Clarification-threshold sensitivity for the proposed method across all five backbones. Best per (backbone, benchmark, metric) in bold across the three $\theta$ rows. Averaged across all five backbones, $\theta = 0.25$ attains the highest mean clarification F1 on both benchmarks (by at most 0.03 over $\theta = 0.5$), though no single $\theta$ dominates every (model, benchmark) pair; we report $\theta = 0.5$ as a balanced default in the main results.

No single $\theta$ dominates every (model, benchmark) pair in Table [3](https://arxiv.org/html/2606.19559#S6.T3); averaged across all five backbones, the lowest threshold $\theta = 0.25$ attains the highest mean clarification F1 on both benchmarks (WebShop-Clar.: 0.464, vs. 0.455 at $\theta = 0.5$ and 0.291 at $\theta = 0.75$; ALFWorld-Clar.: 0.71, vs. 0.68 at $\theta = 0.5$ and 0.50 at $\theta = 0.75$), though $\theta = 0.25$ and $\theta = 0.5$ differ by at most 0.03; we report $\theta = 0.5$ as a balanced default in the main results (Table [4](https://arxiv.org/html/2606.19559#A2.T4)). We discuss why we threshold a scalar uncertainty and where this leaves the method in Section [7.5](https://arxiv.org/html/2606.19559#S7.SS5).
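A threshold sweep of this kind can be approximated offline from logged scores, as in the hypothetical sketch below. It assumes one request-uncertainty score per task (for instance the first-step $u_1$) and re-thresholds it after the fact; this is only an approximation of the ablation, because in the actual protocol the threshold is part of the agent's instructions and a clarification changes the rest of the episode.

```python
from typing import Dict, Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], theta: float) -> float:
    """Clarification F1 when the agent asks whenever its score is >= theta."""
    decisions = [1 if s >= theta else 0 for s in scores]
    tp = sum(1 for d, z in zip(decisions, labels) if d == 1 and z == 1)
    fp = sum(1 for d, z in zip(decisions, labels) if d == 1 and z == 0)
    fn = sum(1 for d, z in zip(decisions, labels) if d == 0 and z == 1)
    if tp == 0:
        return 0.0  # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def sweep_thresholds(
    scores: Sequence[float],
    labels: Sequence[int],
    thetas: Sequence[float] = (0.25, 0.5, 0.75),
) -> Dict[float, float]:
    """Map each candidate threshold to its clarification F1."""
    return {theta: f1_at_threshold(scores, labels, theta) for theta in thetas}
```

Lower thresholds can only add clarification requests, so recall is non-decreasing as $\theta$ falls; whether F1 improves depends on how many of the added requests land on fully specified tasks.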
## 7Discussion
Section[7\.1](https://arxiv.org/html/2606.19559#S7.SS1)first asks what kind of signal the decomposition actually captures\. Section[7\.2](https://arxiv.org/html/2606.19559#S7.SS2)then analyzes how extending the agent’s prompt with additional uncertainty objectives affects task success rate\. Section[7\.3](https://arxiv.org/html/2606.19559#S7.SS3)examines the overconfidence pattern surfaced by the calibration results of Section[6\.4](https://arxiv.org/html/2606.19559#S6.SS4)\. Section[7\.4](https://arxiv.org/html/2606.19559#S7.SS4)explains why the aggregation choice can dominate the method choice, using the length\-proxy finding from Table[2](https://arxiv.org/html/2606.19559#S6.T2)\. Finally, Section[7\.5](https://arxiv.org/html/2606.19559#S7.SS5)argues that these patterns motivate moving uncertainty estimation from the prompt into the model itself\.
### 7\.1Why Decomposition Helps
The key advantage of separating request uncertainty from action confidence is that it gives the agent a dedicated channel for expressing goal ambiguity\. With a single confidence score, an agent that encounters an underspecified task must either \(a\) report low confidence on its actions, which is indistinguishable from genuinely difficult but fully specified tasks, or \(b\) proceed with an arbitrary interpretation and report high confidence, which leads to silent failure\. The decomposition resolves this ambiguity, directly addressing the call by Kirchhof et al\.\[[20](https://arxiv.org/html/2606.19559#bib.bib13)\]for “underspecification uncertainty” as a distinct category\.
This aligns with the broader agentic interpretability vision of Kim et al\.\[[19](https://arxiv.org/html/2606.19559#bib.bib14)\]: by explicitly communicating why it is uncertain \(task difficulty vs\. goal ambiguity\), the agent helps users build a mental model of its reasoning, rather than presenting an opaque scalar score\.
### 7\.2Capability Dilution
Inspection of trajectories where the proposed method fails but ReAct+UE succeeds reveals a recurring pattern: the proposed agent spends a disproportionate share of its reasoning budget debating whether the goal is underspecified, producing a long $u_t$ explanation and only a perfunctory action rationale. This is consistent with a bounded reasoning budget that must be split across task-solving, confidence estimation, and request-uncertainty assessment. The proposed method asks the LLM to do all three in a single forward pass with no additional tokens of reasoning allocated to compensate. The drop from UAM ($27.8\%$) to the proposed method ($27.0\%$) isolates this effect: since the two methods differ only in the addition of the request-uncertainty signal $u_t$ and its explanation $x_t$, the $0.8$ percentage-point SR gap is directly attributable to the enlarged prompt rather than to any other architectural change.
We call this *capability dilution*: each additional uncertainty objective added to the prompt degrades the primary task objective. The degradation is concentrated on benchmarks where the task itself is already demanding (REAL: $32.1\% \to 19.2\%$; WebShop: $31.9\% \to 26.1\%$) and is absent on the more structured ALFWorld, where success rate is essentially unchanged ($57.5\% \to 58.9\%$). This is a limitation that cannot be resolved by better prompt engineering within the prompt-only regime, because the root cause is a fixed shared reasoning budget rather than a specific phrasing failure.
### 7\.3Overconfidence and Calibration
The calibration results of Section[6\.4](https://arxiv.org/html/2606.19559#S6.SS4)\(with the full per\-backbone reliability diagrams in Appendix[C](https://arxiv.org/html/2606.19559#A3)\) show that all three methods sit well below the diagonal: predicted confidence consistently exceeds observed success rate\. We interpret this as a structural bias of prompt\-based self\-reported confidence – an agent that has already committed to an action has an incentive to justify rather than critique it – rather than an issue specific to any one method\. This matches the independent findings of Kaddour et al\.\[[17](https://arxiv.org/html/2606.19559#bib.bib30)\]and the survey observation by Oh et al\.\[[29](https://arxiv.org/html/2606.19559#bib.bib17)\]that “dynamically expanding context memory results in increasingly inflated and unreliable verbalized confidence\.” The practical implication is that while the confidence scores are useful as*ranking*signals \(ROC\-AUC\), they cannot be interpreted as probabilities without a post\-hoc recalibration step\. Evaluation methodology itself introduces additional uncertainty: the choice of correctness function can substantially affect UQ method rankings\[[16](https://arxiv.org/html/2606.19559#bib.bib39)\]\.
### 7\.4Aggregation as Hidden Hyperparameter
Across Appendix Tables[5](https://arxiv.org/html/2606.19559#A2.T5)and[6](https://arxiv.org/html/2606.19559#A2.T6), the best aggregation strategy differs by method and benchmark: product excels on ALFWorld, avg excels on REAL for both UAM and the proposed method, and avg/last excel on WebShop for the proposed method\. In practice this means a practitioner tuning only the aggregation can produce arbitrarily large differences between methods without changing the underlying uncertainty signal\. This matches Oh et al\.’s\[[29](https://arxiv.org/html/2606.19559#bib.bib17)\]observation that naive cascade aggregations cannot robustly distinguish successful from failed trajectories\.
Product aggregation (Eq. ([16](https://arxiv.org/html/2606.19559#S4.E16))) was introduced as an operationalization of the “Spiral of Hallucination” formalized by Zhang et al.\[[46](https://arxiv.org/html/2606.19559#bib.bib21)\]. The length-proxy experiment of Table [2](https://arxiv.org/html/2606.19559#S6.T2) shows, however, that on ALFWorld this aggregation does not capture that mechanism but instead behaves as a trajectory-length proxy: replacing real confidences with i.i.d. $\mathcal{U}(0,1)$ draws, or with a deterministic $1/\text{length}$ value, matches or exceeds the real-product ROC-AUC for every method, the one exception being the random surrogate under UAM (0.94 vs. 0.96). Failed ALFWorld tasks are systematically longer than successful ones, and the geometric mean of $T$ values in $[0,1]$ decreases with $T$, so the length signal dominates. The effect is weaker but present on WebShop and essentially absent on REAL. The qualitative takeaway is that high product-aggregation ROC-AUC should not be interpreted as evidence that the confidence signal is informative; it can simply be evidence that the agent took more steps to fail.
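The dependence on $T$ can be made explicit for the random surrogate. Assuming the per-step values $c_t$ are i.i.d. $\mathcal{U}(0,1)$ (the surrogate setting, not a claim about real confidences), independence gives closed forms for the expected geometric mean of Eq. (16) and, for comparison, the expected unnormalized product:

```latex
\mathbb{E}\!\left[\Bigl(\prod_{t=1}^{T} c_t\Bigr)^{1/T}\right]
  = \prod_{t=1}^{T} \mathbb{E}\bigl[c_t^{1/T}\bigr]
  = \Bigl(\int_0^1 c^{1/T}\,dc\Bigr)^{T}
  = \Bigl(\frac{T}{T+1}\Bigr)^{T},
\qquad
\mathbb{E}\!\left[\prod_{t=1}^{T} c_t\right] = 2^{-T}.
```

The first expression equals $1/2$ at $T = 1$, $4/9 \approx 0.444$ at $T = 2$, about $0.402$ at $T = 5$, and decreases toward $1/e \approx 0.368$ as $T \to \infty$. The expected geometric mean therefore does fall with length, but within a bounded range, whereas the unnormalized product shrinks exponentially in $T$.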
### 7\.5Toward Native Uncertainty Estimation
The limitations identified above – capability dilution, systematic overconfidence, and aggregation sensitivity – are not specific to the proposed method but are fundamental to the prompt\-based paradigm\. Prompt\-based methods bolt uncertainty estimation onto a model that was not designed for it: the model must simultaneously solve the task and accurately assess its own confidence, competing objectives that share a fixed reasoning budget\.
To compare the decomposition against the scalar\-confidence baselines on equal footing, we have the agent emit a numeric request\-uncertainty value and route on a fixed threshold, which is what the ablation of Section[6\.6](https://arxiv.org/html/2606.19559#S6.SS6)probes\. We also agree with the position papers that motivate this work that the more promising direction is to move beyond such scalar thresholds toward proactive interaction grounded in human\-readable, interpretable explanations of the agent’s uncertainty\[[20](https://arxiv.org/html/2606.19559#bib.bib13)\],\[[19](https://arxiv.org/html/2606.19559#bib.bib14)\]\.
Training-based approaches offer a compelling alternative. Suri et al.\[[36](https://arxiv.org/html/2606.19559#bib.bib29)\] demonstrate this with SAGE-Agent, where GRPO-based fine-tuning improved a 3B-parameter model’s clarification accuracy from 36.5% to 65.2%, a relative improvement of roughly 79%. Their certainty-weighted reward function aligns the model to produce calibrated uncertainty estimates natively, eliminating the need for prompt-based elicitation and the associated capability dilution. Similarly, Chen et al.\[[4](https://arxiv.org/html/2606.19559#bib.bib34)\] show that uncertainty-aware self-training for GUI agents produces better-calibrated confidence estimates than prompting alone.
We argue that the most promising path toward practical agentic uncertainty lies in alignment\-based approaches that natively integrate uncertainty estimation, decomposition, and addressing \(clarification, abstention\) into the model itself\. Specifically:
- •Native decomposition. Rather than prompting for separate $c_t$ and $u_t$ scores, models should be trained (via RLHF, DPO, or GRPO) to internally distinguish between task difficulty and goal ambiguity, producing decomposed uncertainty as a natural part of their output.
- •Integrated aggregation\.Instead of applying post\-hoc aggregation strategies, models should learn to maintain and propagate trajectory\-level uncertainty internally, eliminating the aggregation hyperparameter entirely\.
- •Calibrated communication\.Following Kirchhof et al\.’s\[[20](https://arxiv.org/html/2606.19559#bib.bib13)\]call for rich output uncertainties, aligned models should communicate uncertainty in natural language – explaining what is uncertain and why – rather than producing poorly calibrated scalar scores\.
This vision aligns with the agentic interpretability paradigm of Kim et al\.\[[19](https://arxiv.org/html/2606.19559#bib.bib14)\], where agents proactively build shared understanding with users\. The proposed prompt\-based decomposition demonstrates that the concept of separating request uncertainty from action confidence is effective; the challenge now is to move this decomposition from the prompt into the model itself\.
## 8Limitations
The analysis exposes three limitations of the prompt\-based paradigm that motivated this work\. First, adding uncertainty instructions consistently degrades task\-solving ability \(*capability dilution*, Section[7\.2](https://arxiv.org/html/2606.19559#S7.SS2)\)\. Second, all methods suffer from systematic overconfidence visible in the calibration plots \(Section[7\.3](https://arxiv.org/html/2606.19559#S7.SS3)\)\. Third, trajectory\-level aggregation introduces a consequential hyperparameter, and its product variant can act as a trajectory\-length proxy rather than a confidence signal \(Section[7\.4](https://arxiv.org/html/2606.19559#S7.SS4)\)\. Together these limit the regime in which scalar prompt\-based confidence can be relied upon\.
## 9 Future Work
The clarification\-augmented benchmarks rely on synthetic underspecification – stripping attributes or objects – and score clarification as a single binary action\. Future evaluations should curate tasks in which ambiguity is organic rather than procedurally generated, and assess the linguistic quality and informativeness of the agent’s clarifying questions, following the methodology of SAGE\-Agent \[[36](https://arxiv.org/html/2606.19559#bib.bib29)\]\. The decomposition itself should also move out of the prompt and into the model, as argued in Section [7\.5](https://arxiv.org/html/2606.19559#S7.SS5); operationalizing it via RLHF, DPO, or GRPO would test whether its benefits survive once capability dilution is removed as a confound\.
## 10 Conclusion
We presented a prompt\-based decomposition of agentic uncertainty into action confidence and request uncertainty, enabling proactive clarification seeking by giving the agent a dedicated channel for goal ambiguity that a single confidence score conflates with task difficulty\. Alongside the method, we contributed two clarification\-augmented benchmarks \(WebShop\-Clarification and ALFWorld\-Clarification\) and a systematic comparison of the prompt\-based family \(ReAct\+UE, UAM, and the proposed method\) across five LLM backbones\. The proposed method leads on the clarification\-augmented benchmarks across multiple backbones, while the limitations summarized above show that prompt\-based methods are best viewed as proofs of concept and that the next step is to move the decomposition out of the prompt and into the model via alignment\-based training\.
## Acknowledgments
The author thanks Danil Silantyev \(NDDev, Kazakhstan; ITMO University, St\. Petersburg, Russia\) for his help with the code and for financially supporting the experiments\.
## Appendix A Prompts
Each prompt consists of a system message and a user message that is rebuilt at every step\. The user message is a sequence of blocks: *Goal*, *Observation*, *Action Space* \(runtime context\); optional *History* and *Error*; *Next\-Action* instruction; *Confidence Elicitation* suffix\. ALFWorld concatenates these into one prose block\. Below, for each method and benchmark, we list the static instructional blocks \(system message, next\-action / action\-selection block, history\-entry template, confidence elicitation suffix\); runtime\-only context blocks are not shown\.
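The block ordering described above can be sketched as a small assembly function. This is a hedged illustration rather than the released code: the function name `build_user_message`, the `Title:` label format, and the blank-line separator are all assumptions; only the block order and the optional status of History and Error come from the description.

```python
def build_user_message(goal, observation, action_space, next_action,
                       confidence_suffix, history=None, error=None):
    """Assemble the per-step user message from its blocks, in order."""
    # Runtime context blocks come first.
    blocks = [
        ("Goal", goal),
        ("Observation", observation),
        ("Action Space", action_space),
    ]
    # History and Error are optional and only added when present.
    if history:
        blocks.append(("History", "\n".join(history)))
    if error:
        blocks.append(("Error", error))
    # Static instructional blocks close the message.
    blocks.append(("Next-Action", next_action))
    blocks.append(("Confidence Elicitation", confidence_suffix))
    # Hypothetical rendering: one labelled block per paragraph.
    return "\n\n".join(f"{title}:\n{body}" for title, body in blocks)


first_step = build_user_message(
    goal="buy a red mug",
    observation="search page",
    action_space="search[keywords], click[value], request_clarification",
    next_action="You are now at step 1.",
    confidence_suffix="Report a value between 0.0 and 1.0.",
)
print(first_step)
```

Because the message is rebuilt at every step, later steps would call the same function with the accumulated `history` entries and, where applicable, the most recent `error`.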
### A\.1 ReAct\+UE
#### A\.1\.1 WebShop
##### System message
> You are a shopping agent\. Your goal is to find and buy a product that matches the given instruction on a simulated web store\.
>
> Available actions:
> - search\[keywords\] – search for products using keywords\.
> - click\[value\] – click a button or link; value must exactly match one of the available clickables listed in the observation\.
> - request\_clarification – request a more specified goal if the request is missing key details or has multiple valid solutions \(e\.g\. color, size\)\.
>
> Output format \(required\): <think\>\.\.\.</think\> <action\>search\[…\] or click\[…\] or request\_clarification</action\> <confidence\>0\.0–1\.0</confidence\> <explanation\>\.\.\.</explanation\>
##### Next\-action block
> You are now at step \{t\}\. Prior to this step, you have already taken \{t\} step\(s\)\. Now it’s your turn to take an action\. If the goal is ambiguous or missing key details, you should seek clarification before acting\.
##### History\-entry template
> Step \{i\}: Observation: \{$o_i$\} Action: <think\>\{$r_i$\}</think\><action\>\{$a_i$\}</action\>
##### Confidence elicitation suffix
> After your action, you MUST provide:
> 1. Your confidence level \(0\.0–1\.0\) in <confidence\>\.\.\.</confidence\> tags\.
> 2. An explanation of your confidence in <explanation\>\.\.\.</explanation\> tags:
>    - Explain what makes you confident\.
>    - Explain what concerns or uncertainties you have\.
>    - What information might be missing or unclear\.
>    - What alternative actions you considered\.
>    - DO NOT output empty <explanation\></explanation\> tags – you MUST provide actual text inside\.
#### A\.1\.2 ALFWorld
##### System message
> You are an expert agent operating in the ALFRED Embodied Environment\.
##### Action\-selection block
> Now it’s your turn to take an action\. You should first reason step\-by\-step about the current situation\. This reasoning process MUST be enclosed within <think\> </think\> tags\. Once you’ve finished your reasoning, you should choose an admissible action for the current step and present it within <action\> </action\> tags\. If the goal is ambiguous or missing key details, you should seek clarification before acting\.
##### History\-entry template
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
#### A\.1\.3 REAL
##### System message
> \# Instructions
>
> Review the current state of the page and all other information to find the best possible next action to accomplish your goal\. Your answer will be interpreted and executed by a program, make sure to follow the formatting instructions\. You should first reason step\-by\-step about the current situation\. This reasoning process MUST be enclosed within <think\> </think\> tags\. Once you’ve finished your reasoning, you should choose an admissible action for the current step and present it within <action\> </action\> tags\.
##### Next\-action block
> You are now at step \{t\}\. Prior to this step, you have already taken \{t\} step\(s\)\. Now it’s your turn to take an action\.
##### History\-entry template
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
### A\.2 UAM
#### A\.2\.1 WebShop
##### System message
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
##### Next\-action block
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
##### History\-entry template
> Step \{i\}: Observation: \{$o_i$\} Action: <think\>\{$r_i$\}</think\><action\>\{$a_i$\}</action\> <confidence\>\{$c_i$\}</confidence\> <explanation\>\{$e_i$\}</explanation\>
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
#### A\.2\.2 ALFWorld
##### System message
Identical to Appendix [A\.1\.2](https://arxiv.org/html/2606.19559#A1.SS1.SSS2)\.
##### Action\-selection block
Identical to Appendix [A\.1\.2](https://arxiv.org/html/2606.19559#A1.SS1.SSS2)\.
##### History\-entry template
Identical to Appendix [A\.2\.1](https://arxiv.org/html/2606.19559#A1.SS2.SSS1)\.
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
#### A\.2\.3 REAL
##### System message
Identical to Appendix [A\.1\.3](https://arxiv.org/html/2606.19559#A1.SS1.SSS3)\.
##### Next\-action block
Identical to Appendix [A\.1\.3](https://arxiv.org/html/2606.19559#A1.SS1.SSS3)\.
##### History\-entry template
Identical to Appendix [A\.2\.1](https://arxiv.org/html/2606.19559#A1.SS2.SSS1)\.
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
### A\.3 Proposed Method
#### A\.3\.1 WebShop
##### System message
> You are a shopping agent\. Your goal is to find and buy a product that matches the given instruction on a simulated web store\.
>
> Available actions:
> - search\[keywords\] – search for products using keywords\.
> - click\[value\] – click a button or link; value must exactly match one of the available clickables listed in the observation\.
> - request\_clarification – request a more specified goal if the request is missing key details or has multiple valid solutions \(e\.g\. color, size\)\.
>
> Output format \(required\): <think\>\.\.\.</think\> <u\_request\>0\.0–1\.0</u\_request\> <u\_request\_explanation\>\.\.\.</u\_request\_explanation\> <action\>search\[…\] or click\[…\] or request\_clarification</action\> <confidence\>0\.0–1\.0</confidence\> <explanation\>\.\.\.</explanation\>
##### Next\-action block
> You are now at step \{t\}\. Prior to this step, you have already taken \{t\} step\(s\)\. Now it’s your turn to take an action\. After thinking, you MUST assess your request uncertainty \(0\.0–1\.0\) in <u\_request\>\.\.\.</u\_request\> tags\. 0\.0 = the goal fully specifies every parameter – there is exactly one correct solution\. 0\.5 = the goal leaves open at least one choice where the user likely has a specific preference they did not state – you would be guessing on their behalf\. 1\.0 = critical details are missing, many equally valid interpretations exist\. Be meticulous: if the goal leaves ANY parameter open\-ended, ask yourself – would a real user genuinely be satisfied with ANY valid option, or do they most likely have a specific intent they failed to communicate? If you find yourself choosing one option among several equally plausible alternatives without a clear basis, that is a sign u\_request should be high\. Then explain your assessment in <u\_request\_explanation\>…</u\_request\_explanation\> tags\. If u\_request \>= $\theta$, your action MUST be request\_clarification\.
##### History\-entry template
> Step \{i\}: Observation: \{$o_i$\} <think\>\{$r_i$\}</think\> <u\_request\>\{$u_i$\}</u\_request\> <u\_request\_explanation\>\{$x_i$\}</u\_request\_explanation\> <action\>\{$a_i$\}</action\> <confidence\>\{$c_i$\}</confidence\> <explanation\>\{$e_i$\}</explanation\>
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
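The tagged output format and the threshold rule stated in the next-action block above can be sketched as a small parser. This is a hypothetical illustration: the helper names, the clamping of scores to the unit interval, and the value 0.5 used for the threshold are assumptions, and the prompt itself only instructs the model to follow the rule. Whether the original pipeline also overrides the action in code is not stated in this appendix.

```python
import re


def extract_tag(text, tag):
    """Return the stripped content of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def to_unit_interval(value):
    """Parse a score and clamp it to [0, 1]; None if missing or malformed."""
    try:
        return min(1.0, max(0.0, float(value)))
    except (TypeError, ValueError):
        return None


def parse_response(text):
    """Collect the fields used by the decomposition from a model response."""
    return {
        "u_request": to_unit_interval(extract_tag(text, "u_request")),
        "action": extract_tag(text, "action"),
        "confidence": to_unit_interval(extract_tag(text, "confidence")),
    }


def select_action(parsed, theta):
    """Assumed post-hoc check: ask for clarification when u_request >= theta."""
    u_request = parsed["u_request"]
    if u_request is not None and u_request >= theta:
        return "request_clarification"
    return parsed["action"]


response = (
    "<think>The goal does not state a color.</think> "
    "<u_request>0.8</u_request> "
    "<u_request_explanation>Color is unspecified.</u_request_explanation> "
    "<action>search[mug]</action> "
    "<confidence>0.6</confidence> "
    "<explanation>Search terms are a guess.</explanation>"
)
parsed = parse_response(response)
print(select_action(parsed, theta=0.5))
```

Keeping the two scores in separate fields mirrors the separation the method relies on: `u_request` gates clarification, while `confidence` remains available for fault detection.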
#### A\.3\.2 ALFWorld
##### System message
Identical to Appendix [A\.1\.2](https://arxiv.org/html/2606.19559#A1.SS1.SSS2)\.
##### Action\-selection block
> Now it’s your turn to take an action\. You should first reason step\-by\-step about the current situation\. This reasoning process MUST be enclosed within <think\> </think\> tags\. After thinking, you MUST assess your request uncertainty \(0\.0–1\.0\) in <u\_request\>\.\.\.</u\_request\> tags\. 0\.0 = the goal fully specifies every parameter – there is exactly one correct solution\. 0\.5 = the goal leaves open at least one choice where the user likely has a specific preference they did not state – you would be guessing on their behalf\. 1\.0 = critical details are missing, many equally valid interpretations exist\. Be meticulous: if the goal leaves ANY parameter open\-ended, ask yourself – would a real user genuinely be satisfied with ANY valid option, or do they most likely have a specific intent they failed to communicate? If you find yourself choosing one option among several equally plausible alternatives without a clear basis, that is a sign u\_request should be high\. Then explain your assessment in <u\_request\_explanation\>…</u\_request\_explanation\> tags\. If u\_request \>= $\theta$, your action MUST be request\_clarification\. Once you’ve finished your reasoning, you should choose an admissible action for the current step and present it within <action\> </action\> tags\.
##### History\-entry template
Identical to Appendix [A\.3\.1](https://arxiv.org/html/2606.19559#A1.SS3.SSS1)\.
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
#### A\.3\.3 REAL
##### System message
> \# Instructions
>
> Review the current state of the page and all other information to find the best possible next action to accomplish your goal\. Your answer will be interpreted and executed by a program, make sure to follow the formatting instructions\. You should first reason step\-by\-step about the current situation\. This reasoning process MUST be enclosed within <think\> </think\> tags\. After thinking, assess request uncertainty in <u\_request\>…</u\_request\> tags and explain in <u\_request\_explanation\>…</u\_request\_explanation\> tags\. Once you’ve finished your reasoning, you should choose an admissible action for the current step and present it within <action\> </action\> tags\. After your action, provide confidence in <confidence\>…</confidence\> tags and an explanation in <explanation\>…</explanation\> tags\.
##### Next\-action block
> You are now at step \{t\}\. Prior to this step, you have already taken \{t\} step\(s\)\. Now it’s your turn to take an action\. After thinking, you MUST assess your request uncertainty \(0\.0–1\.0\) in <u\_request\>\.\.\.</u\_request\> tags\. 0\.0 = the goal fully specifies every parameter – there is exactly one correct solution\. 0\.5 = the goal leaves open at least one choice where the user likely has a specific preference they did not state – you would be guessing on their behalf\. 1\.0 = critical details are missing, many equally valid interpretations exist\. Be meticulous: if the goal leaves ANY parameter open\-ended, ask yourself – would a real user genuinely be satisfied with ANY valid option, or do they most likely have a specific intent they failed to communicate? If you find yourself choosing one option among several equally plausible alternatives without a clear basis, that is a sign u\_request should be high\. Then explain your assessment in <u\_request\_explanation\>…</u\_request\_explanation\> tags\.
##### History\-entry template
Identical to Appendix [A\.3\.1](https://arxiv.org/html/2606.19559#A1.SS3.SSS1)\.
##### Confidence elicitation suffix
Identical to Appendix [A\.1\.1](https://arxiv.org/html/2606.19559#A1.SS1.SSS1)\.
## Appendix B Full Result Tables
Tables [4](https://arxiv.org/html/2606.19559#A2.T4)–[6](https://arxiv.org/html/2606.19559#A2.T6) report the complete per\-backbone results summarized by the figures of Section [6](https://arxiv.org/html/2606.19559#S6)\. Table [4](https://arxiv.org/html/2606.19559#A2.T4) lists the clarification\-seeking metrics \(success rate, precision, recall, F1, accuracy\) for all five backbones on the two clarification\-augmented benchmarks, expanding Figure [2](https://arxiv.org/html/2606.19559#S6.F2)\. Table [5](https://arxiv.org/html/2606.19559#A2.T5) lists the fault\-detection metrics \(ROC\-AUC, ECE, Brier\) and success rate for every \(backbone, method, aggregation\) combination on the three standard benchmarks, expanding the GPT\-5\.1\-only view of Figure [3](https://arxiv.org/html/2606.19559#S6.F3) to all five backbones\. Table [6](https://arxiv.org/html/2606.19559#A2.T6) reports the same fault\-detection breakdown on the two clarification\-augmented benchmarks\. For the proposed method, both fault\-detection tables additionally include the request\-uncertainty \($u_r$\) aggregations of Section [4\.3](https://arxiv.org/html/2606.19559#S4.SS3) alongside the action\-confidence \($c$\) aggregations\.
Table 4: Clarification seeking: full metrics across backbones for both clarification\-augmented benchmarks\. Best per \(backbone, benchmark, metric\) in bold\. The proposed method leads clarification F1 on every backbone on WebShop\-Clar\. and on four of five on ALFWorld\-Clar\., confirming that the decomposition enables clarification seeking where scalar\-confidence baselines cannot\.
Table 5: Full fault\-detection results on the standard benchmarks across all backbones, methods, and trajectory\-level aggregations\. Success rate is method\-level and shown once per \(backbone, method\)\. Best per \(backbone, benchmark, metric\) in bold\. All three methods achieve comparable fault\-detection ROC\-AUC across backbones and aggregations, confirming that the request\-uncertainty decomposition preserves the conventional uncertainty objective while additionally enabling clarification seeking\.
Table 6: Full fault\-detection results on the clarification\-augmented benchmarks across all backbones, methods, and aggregations\. For the proposed method we additionally report the $u_r$\-based score variants\. Success rate is method\-level and shown once per \(backbone, method\)\. Best per \(backbone, benchmark, metric\) in bold\. $u_r$\-based aggregations are systematically weaker on fault\-detection metrics than $c$\-based aggregations, confirming that request uncertainty is designed for clarification seeking rather than predicting trajectory failure\.
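For readers reproducing the clarification metrics of Table 4, the sketch below computes precision, recall, F1, and accuracy from binary labels. It assumes one decision per task (whether the agent requested clarification) scored against whether the task was underspecified; the function name and this per-task scoring are assumptions, since the exact scoring code is not given here.

```python
def clarification_metrics(asked, underspecified):
    """Score binary clarification decisions against underspecification labels."""
    pairs = list(zip(asked, underspecified))
    tp = sum(1 for a, u in pairs if a and u)          # asked, and needed to
    fp = sum(1 for a, u in pairs if a and not u)      # asked unnecessarily
    fn = sum(1 for a, u in pairs if not a and u)      # should have asked
    tn = sum(1 for a, u in pairs if not a and not u)  # correctly did not ask
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical labels for four tasks, half of them underspecified.
asked = [True, True, True, False]
underspecified = [True, True, False, False]
print(clarification_metrics(asked, underspecified))
```

In this toy example the agent asks on both underspecified tasks and on one fully specified task, so recall is perfect while precision is penalized for the unnecessary question.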
## Appendix C Calibration Plots
Reliability diagrams for the three methods \(rows\) across the five benchmarks \(columns\) under last\-step aggregation, one figure per backbone\. Each point bins trajectories by action confidence and plots the observed success rate against the bin’s predicted confidence; points below the diagonal indicate overconfidence\. The GPT\-5\.1 diagram appears in Section [6\.4](https://arxiv.org/html/2606.19559#S6.SS4); the remaining four backbones are shown here\.
Figure 5: Reliability diagrams for DeepSeek\-v3\.2\-exp: the three methods \(rows\) across the five benchmarks \(columns\), under last\-step aggregation\. All curves lie below the diagonal, indicating systematic overconfidence for every method and benchmark\.
Figure 6: Reliability diagrams for GLM\-4\.7: the three methods \(rows\) across the five benchmarks \(columns\), under last\-step aggregation\. All curves lie below the diagonal, indicating systematic overconfidence for every method and benchmark\.
Figure 7: Reliability diagrams for Qwen3\.5\-35B: the three methods \(rows\) across the five benchmarks \(columns\), under last\-step aggregation\. All curves lie below the diagonal, indicating systematic overconfidence for every method and benchmark\.
Figure 8: Reliability diagrams for GPT\-OSS\-120B: the three methods \(rows\) across the five benchmarks \(columns\), under last\-step aggregation\. All curves lie below the diagonal, indicating systematic overconfidence for every method and benchmark\.
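The points of a reliability diagram such as those in Figures 5–8 can be computed as in the sketch below. It assumes ten equal-width bins and uses the mean confidence within each bin as the bin's predicted confidence; both choices, and the function names, are assumptions rather than details confirmed by the paper.

```python
def reliability_points(confidences, successes, n_bins=10):
    """Return (mean confidence, success rate, count) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, success in zip(confidences, successes):
        # Equal-width binning; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, success))
    points = []
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        success_rate = sum(s for _, s in members) / len(members)
        points.append((mean_conf, success_rate, len(members)))
    return points


def expected_calibration_error(points):
    """Count-weighted mean gap between confidence and observed success rate."""
    total = sum(count for _, _, count in points)
    return sum(count / total * abs(conf - rate) for conf, rate, count in points)


# Hypothetical trajectories: high stated confidence, mixed outcomes.
confidences = [0.9, 0.9, 0.3, 0.3]
successes = [1, 0, 0, 0]
points = reliability_points(confidences, successes)
print(points, expected_calibration_error(points))
```

A point whose success rate is lower than its mean confidence lies below the diagonal, which is the overconfidence pattern described in the captions above.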
## References
- \[1\] \(2023\) Uncertainty in natural language generation: from theory to applications\. arXiv preprint arXiv:2307\.15703\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p2.1)\.
- \[2\] J\. Baek, H\.\-Y\. Ha, J\. Ha, et al\. \(2025\) REAL: benchmarking LLM agents on deterministic simulations of real websites\. arXiv preprint arXiv:2504\.11543\. Cited by: [3rd item](https://arxiv.org/html/2606.19559#S5.I1.i3.p1.1)\.
- \[3\] T\. B\. Brown, B\. Mann, N\. Ryder, et al\. \(2020\) Language models are few\-shot learners\. In Proc\. NeurIPS, pp\. 1877–1901\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[4\] G\. Chen, L\. Jie, L\. Zou, W\. Guan, M\. Zhang, and L\. Nie \(2025\) Enhancing GUI agent with uncertainty\-aware self\-trained evaluator\. In Proc\. NeurIPS\. Cited by: [Table 1](https://arxiv.org/html/2606.19559#S2.T1.4.4.3.1.1), [§7\.5](https://arxiv.org/html/2606.19559#S7.SS5.p3.1)\.
- \[5\] Z\. Deng, C\. Ma, Q\. Chen, et al\. \(2025\) PlanU: large language model reasoning through planning under uncertainty\. In Proc\. NeurIPS\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1)\.
- \[6\] J\. Duan, H\. Cheng, S\. Wang, et al\. \(2024\) Shifting attention to relevance: towards the predictive uncertainty quantification of free\-form large language models\. In Proc\. ACL, pp\. 5050–5063\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.8.10.2.2.1.1)\.
- \[7\] J\. Duan, Y\. Sun, L\. Mao, et al\. \(2025\) UProp: investigating the uncertainty propagation of LLMs in multi\-step decision\-making\. In Proc\. NeurIPS\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1), [§2\.2](https://arxiv.org/html/2606.19559#S2.SS2.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.8.8.3.3.3)\.
- \[8\] E\. Fadeeva, A\. Vashurin, A\. Tsvigun, et al\. \(2023\) LM\-Polygraph: uncertainty estimation for language models\. In Proc\. EMNLP: System Demonstrations, pp\. 446–461\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.8.10.2.2.1.1)\.
- \[9\] S\. Farquhar, J\. Kossen, L\. Kuhn, and Y\. Gal \(2024\) Detecting hallucinations in large language models using semantic entropy\. Nature 630\(8017\), pp\. 625–630\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.3.3.5.1.1)\.
- \[10\] M\. Fomicheva, S\. Sun, L\. Yankovskaya, et al\. \(2020\) Unsupervised quality estimation for neural machine translation\. Trans\. Assoc\. Comput\. Linguistics 8, pp\. 539–555\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.8.10.2.2.1.1)\.
- \[11\] Y\. S\. Grewal, E\. V\. Bonilla, and T\. D\. Bui \(2024\) Improving uncertainty quantification in large language models via semantic embeddings\. arXiv preprint arXiv:2410\.22685\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.3.3.5.1.1)\.
- \[12\] J\. Han, W\. Buntine, and E\. Shareghi \(2024\) Towards uncertainty\-aware language agent\. In Findings of ACL, pp\. 6662–6685\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.5.5.3.1.1), [§3\.5](https://arxiv.org/html/2606.19559#S3.SS5.p1.1)\.
- \[13\] C\. Hao, S\. Wang, and K\. Zhou \(2025\) Uncertainty\-aware GUI agent: adaptive perception through component recommendation and human\-in\-the\-loop refinement\. arXiv preprint arXiv:2508\.04025\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1)\.
- \[14\] B\. Hou, Y\. Liu, K\. Qian, J\. Andreas, S\. Chang, and Y\. Zhang \(2024\) Decomposing uncertainty for large language models through input clarification ensembling\. In Proc\. ICML, Vol\. 235, pp\. 19023–19042\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.3.3.5.1.1), [§4\.1](https://arxiv.org/html/2606.19559#S4.SS1.SSS0.Px2.p1.2)\.
- \[15\] X\. Huang, W\. Liu, X\. Chen, X\. Wang, J\. Wang, and H\. Dong \(2024\) Understanding the planning of LLM agents: a survey\. arXiv preprint arXiv:2402\.02716\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[16\] M\. Ielanskyi, K\. Schweighofer, L\. Aichberger, and S\. Hochreiter \(2025\) Addressing pitfalls in the evaluation of uncertainty estimation methods for natural language generation\. In ICLR Workshop\. Cited by: [§7\.3](https://arxiv.org/html/2606.19559#S7.SS3.p1.1)\.
- \[17\] J\. Kaddour, S\. Patel, G\. Dovonon, L\. Richter, P\. Minervini, and M\. J\. Kusner \(2026\) Agentic uncertainty reveals agentic overconfidence\. arXiv preprint arXiv:2602\.06948\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1), [§7\.3](https://arxiv.org/html/2606.19559#S7.SS3.p1.1)\.
- \[18\] J\. Kaplan, S\. McCandlish, T\. Henighan, et al\. \(2020\) Scaling laws for neural language models\. arXiv preprint arXiv:2001\.08361\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[19\] B\. Kim, J\. Hewitt, N\. Nanda, N\. Fiedel, and O\. Tafjord \(2025\) Because we have LLMs, we can and should pursue agentic interpretability\. arXiv preprint arXiv:2506\.12152\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p2.1), [§7\.1](https://arxiv.org/html/2606.19559#S7.SS1.p2.1), [§7\.5](https://arxiv.org/html/2606.19559#S7.SS5.p2.1), [§7\.5](https://arxiv.org/html/2606.19559#S7.SS5.p4.2)\.
- \[20\] M\. Kirchhof, G\. Kasneci, and E\. Kasneci \(2025\) Position: uncertainty quantification needs reassessment for large language model agents\. In Proc\. ICML \(Position Paper Track\)\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p2.1), [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1), [§4\.1](https://arxiv.org/html/2606.19559#S4.SS1.SSS0.Px2.p1.2), [3rd item](https://arxiv.org/html/2606.19559#S7.I1.i3.p1.1), [§7\.1](https://arxiv.org/html/2606.19559#S7.SS1.p1.1), [§7\.5](https://arxiv.org/html/2606.19559#S7.SS5.p2.1)\.
- \[21\] L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\. In Proc\. ICLR\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1)\.
- \[22\] Z\. Lin, S\. Trivedi, and J\. Sun \(2024\) Generating with confidence: uncertainty quantification for black\-box large language models\. Trans\. Mach\. Learn\. Res\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1)\.
- \[23\] H\. Liu, Z\.\-Y\. Dou, Y\. Wang, N\. Peng, and Y\. Yue \(2024\) Uncertainty calibration for tool\-using language agents\. In Findings of EMNLP, pp\. 16781–16805\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.4.4.3.1.1), [§3\.2](https://arxiv.org/html/2606.19559#S3.SS2.p1.2)\.
- \[24\] O\. Liu, D\. Fu, D\. Yogatama, and W\. Neiswanger \(2025\) DeLLMa: decision making under uncertainty with large language models\. In Proc\. ICLR\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1)\.
- \[25\] P\. Lymperopoulos and V\. Sarathy \(2025\) Tools in the loop: quantifying uncertainty of LLM question answering systems that use tools\. In Proc\. AAMAS, pp\. 2645–2647\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1)\.
- \[26\] A\. Malinin and M\. Gales \(2021\) Uncertainty estimation in autoregressive structured prediction\. In Proc\. ICLR\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.8.10.2.2.1.1)\.
- \[27\] M\. Mohammadi, Y\. Li, J\. Lo, and W\. Yip \(2025\) Evaluation and benchmarking of LLM agents: a survey\. In Proc\. KDD, pp\. 6129–6139\. Cited by: [§5\.1](https://arxiv.org/html/2606.19559#S5.SS1.SSS0.Px2.p2.1)\.
- \[28\] A\. V\. Nikitin, J\. Kossen, Y\. Gal, and P\. Marttinen \(2024\) Kernel language entropy: fine\-grained uncertainty quantification for LLMs from semantic similarities\. In Proc\. NeurIPS\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.3.3.5.1.1)\.
- \[29\] C\. Oh, S\. Lim, T\. Bae, et al\. \(2026\) Uncertainty quantification in LLM agents: foundations, emerging challenges, and opportunities\. arXiv preprint arXiv:2602\.05073\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [§3\.2](https://arxiv.org/html/2606.19559#S3.SS2.p1.2), [§7\.3](https://arxiv.org/html/2606.19559#S7.SS3.p1.1), [§7\.4](https://arxiv.org/html/2606.19559#S7.SS4.p1.1)\.
- \[30\] L\. Ou, K\. Li, F\. Lin, et al\. \(2025\) BrowseConf: confidence\-guided test\-time scaling for web agents\. arXiv preprint arXiv:2510\.23458\. Cited by: [§2\.2](https://arxiv.org/html/2606.19559#S2.SS2.p2.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.5.5.3.1.1)\.
- \[31\] L\. Ouyang, J\. Wu, X\. Jiang, et al\. \(2022\) Training language models to follow instructions with human feedback\. In Proc\. NeurIPS, pp\. 27730–27744\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[32\] J\. S\. Park, J\. C\. O’Brien, C\. J\. Cai, M\. R\. Morris, P\. Liang, and M\. S\. Bernstein \(2023\) Generative agents: interactive simulacra of human behavior\. In Proc\. UIST, pp\. 1–22\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[33\] M\. Shridhar, X\. Yuan, M\.\-A\. Cote, Y\. Bisk, A\. Trischler, and M\. Hausknecht \(2021\) ALFWorld: aligning text and embodied environments for interactive learning\. In Proc\. ICLR\. Cited by: [2nd item](https://arxiv.org/html/2606.19559#S5.I1.i2.p1.1)\.
- \[34\] F\. B\. Smith, J\. Kossen, E\. Trollope, M\. van der Wilk, A\. Foster, and T\. Rainforth \(2025\) Rethinking aleatoric and epistemic uncertainty\. In Proc\. ICML\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1)\.
- \[35\] N\. Subramani, J\. Eisner, J\. Svegliato, B\. Van Durme, Y\. Su, and S\. Thomson \(2025\) MICE for CATs: model\-internal confidence estimation for calibrating agents with tools\. In Proc\. NAACL, pp\. 12362–12375\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.4.4.3.1.1), [§3\.2](https://arxiv.org/html/2606.19559#S3.SS2.p1.2)\.
- \[36\] M\. Suri, P\. Mathur, N\. Lipka, F\. Dernoncourt, R\. A\. Rossi, and D\. Manocha \(2025\) Structured uncertainty guided clarification for LLM agents\. arXiv preprint arXiv:2511\.08798\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.4.4.3.1.1), [§3\.2](https://arxiv.org/html/2606.19559#S3.SS2.p1.2), [§7\.5](https://arxiv.org/html/2606.19559#S7.SS5.p3.1), [§9](https://arxiv.org/html/2606.19559#S9.p1.1)\.
- \[37\] X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\) Self\-consistency improves chain of thought reasoning in language models\. In Proc\. ICLR\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.3.3.5.1.1)\.
- \[38\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\) Chain\-of\-thought prompting elicits reasoning in large language models\. In Proc\. NeurIPS, pp\. 24824–24837\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[39\] Z\. Xi, W\. Chen, X\. Guo, et al\. \(2023\) The rise and potential of large language model based agents: a survey\. arXiv preprint arXiv:2309\.07864\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1)\.
- \[40\] Z\. Xia, J\. Xu, Y\. Zhang, and H\. Liu \(2025\) A survey of uncertainty estimation methods on large language models\. In Findings of ACL, pp\. 21381–21396\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p2.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1)\.
- \[41\] Z\. Xiao, D\. Dou, B\. Xiong, Y\. Chen, and G\. Chen \(2025\) Enhancing uncertainty estimation in LLMs with expectation of aggregated internal belief\. arXiv preprint arXiv:2509\.01564\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p3.1), [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.4.4.3.1.1)\.
- \[42\] R\. Yang, D\. Rajagopal, S\. A\. Hayati, B\. Hu, and D\. Kang \(2024\) Confidence calibration and rationalization for LLMs via multi\-agent deliberation\. In ICLR Workshop on Reliable and Responsible Foundation Models\. Cited by: [§2\.1](https://arxiv.org/html/2606.19559#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.3.3.5.1.1)\.
- \[43\] Z\. Yang, S\. Hao, H\. Sun, L\. Jiang, Q\. Gao, Y\. Ma, and Z\. Hu \(2025\) Understanding the sources of uncertainty for large language and multimodal models\. In ICLR Workshop\. Cited by: [§2\.4](https://arxiv.org/html/2606.19559#S2.SS4.p1.1)\.
- \[44\] S\. Yao, H\. Chen, J\. Yang, and K\. Narasimhan \(2022\) WebShop: towards scalable real\-world web interaction with grounded language agents\. In Proc\. NeurIPS\. Cited by: [1st item](https://arxiv.org/html/2606.19559#S5.I1.i1.p1.1)\.
- \[45\] S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. In Proc\. ICLR\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1), [§3\.5](https://arxiv.org/html/2606.19559#S3.SS5.p1.1)\.
- \[46\] J\. Zhang, P\. K\. Choubey, K\.\-H\. Huang, C\. Xiong, and C\.\-S\. Wu \(2026\) Agentic uncertainty quantification\. arXiv preprint arXiv:2601\.15703\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p6.1), [§2\.2](https://arxiv.org/html/2606.19559#S2.SS2.p2.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.5.5.3.1.1), [§3\.5](https://arxiv.org/html/2606.19559#S3.SS5.p1.1), [§3\.6](https://arxiv.org/html/2606.19559#S3.SS6.p1.1), [§3\.6](https://arxiv.org/html/2606.19559#S3.SS6.p1.2), [§4\.3](https://arxiv.org/html/2606.19559#S4.SS3.p1.4), [§7\.4](https://arxiv.org/html/2606.19559#S7.SS4.p2.5)\.
- \[47\] Q\. Zhao, Y\. Liu, Z\. Gao, E\. Chen, and L\. Meng \(2025\) Uncertainty propagation on LLM agent\. In Proc\. ACL, pp\. 6064–6073\. Cited by: [§1](https://arxiv.org/html/2606.19559#S1.p1.1), [§2\.2](https://arxiv.org/html/2606.19559#S2.SS2.p1.1), [Table 1](https://arxiv.org/html/2606.19559#S2.T1.8.8.3.3.3)\.