Knowing When to Ask: Self-Gated Clarification for Hierarchical Language Agents
Summary
This paper proposes ActionRating, a formulation that places clarification inside an agent's action space on a shared ordinal scale with navigation, enabling two information-seeking modes (mandatory and opportunistic). On hierarchical taxonomy classification benchmarks, experiments with 9 LLMs show that opportunistic clarification improves accuracy and information-seeking effectiveness.
# Self-Gated Clarification for Hierarchical Language Agents
Source: [https://arxiv.org/html/2606.11349](https://arxiv.org/html/2606.11349)
Aijing Gao, Yiming Kang, Mengdie Flora Wang, Jae Oh Woo Amazon Web Services \{gaijing, ymkang, florawan, jaeohwoo\}@amazon\.com
###### Abstract
In hierarchical reasoning, failures often originate at intermediate decision points where the agent commits to a wrong branch without recognizing that it lacks critical information. Rather than treating clarification as an external uncertainty trigger, we propose ActionRating, a formulation that places it inside the agent's action space on a shared ordinal scale with navigation, so that asking competes directly with acting at every decision point and help-seeking becomes observable at intermediate states. Two structurally distinct information-seeking modes emerge from the agent's own ratings: *mandatory* (no viable branch) and *opportunistic* (residual uncertainty despite a leading candidate). On Harmonized Tariff Schedule classification (30,000-node taxonomy, three benchmarks, 9 LLMs across 4 families), we observe a regime shift from mandatory to opportunistic clarification, with Information-Seeking Effectiveness (ISE), a local diagnostic defined as the fraction of help interactions followed by a correct next navigation step (not a final-task metric), rising from 50% to 74%. Three diagnostic contrasts fail to reproduce this structure. A separability test shows that the information-seeking pattern (mode split, ISE ranking) persists when answer quality is degraded (−18.8% accuracy), supporting an empirical separation between *where* an agent seeks help and *the quality* of the help it receives. Under the controlled answer channel, accuracy gains reach +16.2% at 10-digit; we read this as an upper bound on what better localization could unlock, not a deployment estimate.
Figure 1: Overview of ActionRating. (A) Pipeline. The agent rates all candidate actions, including need_clarify, on a shared $[0,100]$ ordinal scale; clarification fires when its score is at or above threshold $\tau$. Two modes emerge: *mandatory* (no viable branch) and *opportunistic* (a leading branch exists but clarify still scores at or above $\tau$). A controlled answer channel fixes answer quality to isolate *where* help is sought from *what* is received. (B, C) Findings (§[5.1](https://arxiv.org/html/2606.11349#S5.SS1), §[5.4](https://arxiv.org/html/2606.11349#S5.SS4); previewed here, not part of the pipeline). At $\tau=10$, opportunistic share $0 \to 88.7\%$, ISE $50\% \to 74\%$, accuracy $50.8\% \to 67.0\%$ (B). Replacing controlled with auto-generated answers collapses accuracy (−18.8%) yet preserves the mode split and ISE ranking (C), supporting an empirical separation between help localization and answer-source quality under controlled degradation.
## 1 Introduction
Language agents that reason over hierarchical structures (medical codes, legal statutes, product taxonomies) face a recurring failure mode: once the agent commits to a wrong intermediate branch, every subsequent step merely elaborates an error that should have been caught earlier Yao et al. ([2023b](https://arxiv.org/html/2606.11349#bib.bib1), [a](https://arxiv.org/html/2606.11349#bib.bib2)); Shinn et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib3)); Press et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib29)); Dziri et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib50)). Final-answer accuracy tells us *that* the system failed, but not *where*, that is, at which decision point the agent lacked the information to proceed safely. The core question is deceptively simple: *when should the agent ask for help instead of committing?*
#### Why current approaches fall short\.
Existing designs treat clarification as external to the reasoning trajectory: a confidence threshold Kadavath et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib33)), a prompt instruction (“ask if unsure”), or sampling-based disagreement Kuhn et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib34)). These mechanisms decouple the decision to ask from the decision to act, leaving two problems unsolved. First, they do not make information-seeking behavior *structurally observable*: we cannot distinguish an agent that asks because no viable branch exists from one that asks to reduce residual uncertainty. Second, they confound *where* help was sought with *what* help was received: an agent that asks more may perform better simply because it gets better information.
#### Clarification as action\.
We propose ActionRating, a formulation that addresses both problems by placing clarification inside the agent's own action space (Figure [1](https://arxiv.org/html/2606.11349#S0.F1)). The agent scores candidate next actions, including a dedicated clarification action, on a shared $[0,100]$ ordinal scale, so that asking competes directly with acting at each decision point. This shared-scale competition makes the local need for help observable without any external uncertainty estimator. Two structurally distinct modes emerge from the agent's own ratings: *mandatory* help, where clarification is top-ranked and no navigation branch is viable, and *opportunistic* help, where a leading branch exists but a targeted question can reduce residual uncertainty before commitment.
#### Isolating help localization\.
To analyze information-seeking behavior cleanly, we must separate *where* help was sought from *what* was received. We pair ActionRating with a controlled answer channel that fixes answer quality, analogous to holding one experimental factor constant to analyze another. We also track *Information-Seeking Effectiveness* (ISE), the fraction of help interactions after which the agent's next navigation lands on the correct path, as a local utility probe (§[5.2](https://arxiv.org/html/2606.11349#S5.SS2)). Mode shift alone shows structural change but not utility; ISE alone measures local usefulness but not global structure; accuracy alone is confounded by answer quality. Together the three provide converging evidence.
#### Test bed\.
We evaluate on Harmonized Tariff Schedule \(HTS\) classification, a language\-mediated taxonomy of 30,000\+ nodes where item descriptions are free\-text, taxonomy headings are natural\-language definitions, and clarification is itself a language\-generation act\. HTS provides the structural prerequisites \(deep branching, repeated intermediate commitments, genuine information gaps, and verifiable ground truth\) that make the measurement question nontrivial \(§[4\.1](https://arxiv.org/html/2606.11349#S4.SS1)\)\.
#### Contributions\.
(1) Framework. We formulate clarification as a selectable action competing with navigation on a shared ordinal scale, yielding a self-gated mechanism that makes information-seeking behavior directly observable. (2) Behavioral analysis. The framework reveals a regime shift, not more questions but a structural transition from mandatory to opportunistic clarification, with ISE rising from 50% to 74%. Three diagnostic contrasts (prompt-level, sampling-based, rating-only) do not reproduce this shift. (3) Separability. When answer quality is degraded, accuracy collapses (−18.8%) while the information-seeking pattern (mode split, ISE ranking) is preserved, supporting an empirical separation between help localization and answer-source quality under controlled degradation. Accuracy gains under the controlled answer channel (+16.2% at 10-digit) are read as an upper bound on what better localization could unlock, not a deployment estimate. Evaluation spans 9 LLMs (4 families), three benchmarks, component ablation, and threshold sensitivity.
## 2 Related Work
Our work intersects LLM agents for structured reasoning Yao et al. ([2023b](https://arxiv.org/html/2606.11349#bib.bib1), [a](https://arxiv.org/html/2606.11349#bib.bib2)); Shinn et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib3)); Zhou et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib21)); Schick et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib19)); Liu et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib22)); Sumers et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib23)), self-evaluation and uncertainty Wang et al. ([2023a](https://arxiv.org/html/2606.11349#bib.bib4)); Madaan et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib6)); Cobbe et al. ([2021](https://arxiv.org/html/2606.11349#bib.bib7)); Lightman et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib25)); Kadavath et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib33)); Kuhn et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib34)); Lin et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib36)); Zheng et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib49)), information-seeking and clarification Settles ([2012](https://arxiv.org/html/2606.11349#bib.bib8)); Wang et al. ([2023b](https://arxiv.org/html/2606.11349#bib.bib9)); Aliannejadi et al. ([2019](https://arxiv.org/html/2606.11349#bib.bib10)); Zamani et al. ([2020](https://arxiv.org/html/2606.11349#bib.bib41)); Rao and III ([2018](https://arxiv.org/html/2606.11349#bib.bib40)); Rahmani et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib42)), selective prediction and abstention Geifman and El-Yaniv ([2017](https://arxiv.org/html/2606.11349#bib.bib18)); El-Yaniv and Wiener ([2010](https://arxiv.org/html/2606.11349#bib.bib39)); Kamath et al. ([2020](https://arxiv.org/html/2606.11349#bib.bib38)), hierarchical classification Jr. and Freitas ([2011](https://arxiv.org/html/2606.11349#bib.bib43)); Kowsari et al. ([2017](https://arxiv.org/html/2606.11349#bib.bib44)); Shimura et al. ([2018](https://arxiv.org/html/2606.11349#bib.bib11)); Banerjee et al. ([2019](https://arxiv.org/html/2606.11349#bib.bib46)); Zhou et al. ([2020](https://arxiv.org/html/2606.11349#bib.bib45)); Mao et al. ([2019](https://arxiv.org/html/2606.11349#bib.bib12)), and multi-step reasoning Wei et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib5)); Zhou et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib27)); Khot et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib28)); Gao et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib54)); Nye et al. ([2021](https://arxiv.org/html/2606.11349#bib.bib30)); Zelikman et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib26)); Hao et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib31)); Besta et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib32)); Huang et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib51)); Dua et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib55)). A full discussion is in Appendix [H](https://arxiv.org/html/2606.11349#A8). Three distinctions position our contribution. *First*, existing agent frameworks address general reasoning over flat or lightly structured spaces; we target deep hierarchical taxonomies where each step narrows the search space. *Second*, self-evaluation methods rate final answers or sample agreement; we rate candidate *actions*, including clarification, on a shared ordinal scale, so that clarification competes directly with navigation rather than being triggered by final-answer confidence or sampling disagreement. *Third*, prior clarification work assumes external uncertainty estimators or human interlocutors; our mechanism is entirely *self-gated* from the agent's own action ratings.
## 3 Framework
### 3.1 Hierarchical Navigation as MDP
We model hierarchical reasoning as an episodic Markov Decision Process Puterman ([1994](https://arxiv.org/html/2606.11349#bib.bib48)) $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{T},\mathcal{R})$. States are taxonomy nodes augmented with the item description and navigation history. Actions comprise five types: traverse_child, backtrack, need_clarify$(q)$, jump$(c)$, and confirm. Transitions induced by a selected taxonomy action are deterministic (the navigation environment is a fixed tree); stochasticity enters through the LLM policy $\pi(a \mid s)$ and through the answer-generation channel invoked by need_clarify. Rewards assign $+1$/$-1$ for correct/incorrect classification.
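The environment side of this MDP can be sketched as a small tree with deterministic transitions. This is a minimal illustration under stated assumptions, not the paper's knowledge-graph implementation: the `Node` class, the `transition` and `reward` helpers, and the semantics chosen for `jump` (move to any node by code) and `backtrack` (move to the parent) are hypothetical readings of the five action names.

```python
from dataclasses import dataclass, field
from typing import Optional

ACTIONS = ("traverse_child", "backtrack", "need_clarify", "jump", "confirm")

@dataclass(eq=False)  # identity comparison; avoids recursing through parent/children
class Node:
    code: str
    parent: Optional["Node"] = None
    children: dict = field(default_factory=dict)

    def add(self, code: str) -> "Node":
        """Attach and return a child node with the given code."""
        child = Node(code, parent=self)
        self.children[code] = child
        return child

def find(root: Node, code: str) -> Optional[Node]:
    """Depth-first search for a node by code (assumed semantics of jump(c))."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.code == code:
            return node
        stack.extend(node.children.values())
    return None

def transition(root: Node, node: Node, action: str, arg: str = "") -> Node:
    """Deterministic next node for a taxonomy action on a fixed tree."""
    if action == "traverse_child":
        return node.children[arg]
    if action == "backtrack":
        return node.parent if node.parent is not None else node
    if action == "jump":
        target = find(root, arg)
        if target is None:
            raise KeyError(arg)
        return target
    if action in ("need_clarify", "confirm"):
        return node  # assumption: neither action moves the agent in the tree
    raise ValueError(f"unknown action: {action}")

def reward(node: Node, gold_code: str) -> int:
    """+1 for a correct classification, -1 otherwise."""
    return 1 if node.code == gold_code else -1
```

In this sketch all stochasticity stays outside `transition`, matching the statement that randomness enters only through the policy and the answer channel.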
### 3.2 ActionRating: Asking as a Selectable Action
The core idea is to make help-needed states observable by placing clarification inside the agent's own action space rather than treating it as an external decision (a worked example of the self-gated reentry cycle is in Figure [4](https://arxiv.org/html/2606.11349#A1.F4), Appendix [A](https://arxiv.org/html/2606.11349#A1)). ActionRating asks the agent to rate its top-$K$ candidate actions, including a dedicated need_clarify action, on a $[0,100]$ ordinal relevance scale before committing (full rating prompt in Appendix [P.3](https://arxiv.org/html/2606.11349#A16.SS3)). At step $t$, the agent produces:
$$\{(a_i, s_i, r_i)\}_{i=1}^{K}, \quad a^{*} = \arg\max_{i} s_i$$
where $a_i$ is the $i$-th candidate action, $s_i \in [0,100]$ its ordinal score, $r_i$ a one-sentence rationale, and $a^{*}$ the selected action. The action-rating step itself is implemented within a single navigation call per step. However, when clarification is triggered, the full system incurs additional sub-agent and reentry calls at that step (see the accuracy–cost analysis in §[6](https://arxiv.org/html/2606.11349#S6)).
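The rating output and the argmax selection can be sketched as follows. This is an illustration only: the `RatedAction` container, the action-name strings, and the tie-breaking rule (earliest candidate wins) are assumptions, since the text specifies only the triple $(a_i, s_i, r_i)$ and the argmax.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RatedAction:
    action: str     # candidate action a_i, e.g. "need_clarify" (names are illustrative)
    score: int      # ordinal score s_i in [0, 100]
    rationale: str  # one-sentence rationale r_i

def select_action(candidates: list[RatedAction]) -> RatedAction:
    """Return the candidate with the highest score (a* = argmax_i s_i).

    Ties resolve to the earliest candidate, which is an assumption of this sketch.
    """
    if not candidates:
        raise ValueError("need at least one rated candidate")
    for c in candidates:
        if not 0 <= c.score <= 100:
            raise ValueError(f"score out of range for {c.action}: {c.score}")
    return max(candidates, key=lambda c: c.score)
```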
The rating serves two functions: (1) it makes help-needed states directly observable via the ask-vs-act competition, producing the mandatory/opportunistic distinction that is our primary analytical object; and (2) it forces deliberative comparison between candidates before committing. In our experiments (Appendix [I](https://arxiv.org/html/2606.11349#A9)), the behavioral change comes primarily from self-gated help-seeking rather than from rating alone, indicating that the rating's main value lies in enabling observation and gating of help-needed states.
#### Controlled answer channel\.
To isolate help localization from answer quality, we use a controlled answer channel, analogous to holding one experimental factor fixed while analyzing another. Two paired conditions complete the design: (1) a controlled condition that fixes answer quality high, so behavioral differences primarily reflect help localization rather than variation in answer quality; and (2) a degraded condition that removes privileged access as a *separability test*: information-seeking patterns that survive while accuracy collapses provide evidence for an empirical separation between help localization and answer-source quality (§[5.4](https://arxiv.org/html/2606.11349#S5.SS4)). The controlled channel simulates a knowledgeable product owner who can provide authoritative attribute facts (material composition, intended use, manufacturing method) while explicit classification codes are masked (see Appendix [P.1](https://arxiv.org/html/2606.11349#A16.SS1)). A post-hoc audit confirms that 96% of answers contain only domain or technical-specification knowledge (§[5.4](https://arxiv.org/html/2606.11349#S5.SS4)). Accuracy numbers are therefore upper bounds, not deployment estimates.
### 3.3 Self-Gated Information Seeking
A key property of the rating is that need_clarify competes directly with navigation actions on the same scale. When need_clarify appears among the top-$K$ with score $\geq \tau$ (*clarification threshold*), the agent invokes a clarification sub-agent *at the current node*.
The procedure has four stages: (1) Detect: identify $\exists\, i \leq K$ such that $a_i = \texttt{need\_clarify} \wedge s_i \geq \tau$; (2) Clarify: invoke sub-agent $\hat{a} = \text{ClarifyAgent}(q_i, \text{item})$; (3) Inject: add the answer to observation $o_t$; (4) Re-select: run action selection again with the enriched observation (*reentry*).
This *self-gated reentry* requires no changes to the outer navigation loop: clarification is absorbed within the step. The threshold $\tau$ controls aggressiveness: lower values trigger more clarifications. We analyze sensitivity in §[5.3](https://arxiv.org/html/2606.11349#S5.SS3).
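The four stages can be sketched as a single step function. This is a minimal sketch under assumptions, not the authors' implementation: `rate` and `clarify_agent` stand in for the LLM calls, `Candidate` and `gated_step` are hypothetical names, the answer is injected as plain text, and the `max_reentries` cap with a fallback to the best navigation action is one possible reading of "bounded reentry".

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Candidate:
    name: str           # action name, e.g. "need_clarify" (illustrative)
    score: int          # ordinal score in [0, 100]
    question: str = ""  # clarification question q_i, used only by need_clarify

def gated_step(
    item: str,
    observation: str,
    rate: Callable[[str, str], list[Candidate]],  # stand-in for the rating call
    clarify_agent: Callable[[str, str], str],     # stand-in for ClarifyAgent(q, item)
    tau: int = 10,
    max_reentries: int = 1,  # assumed bound on reentry within one step
) -> tuple[str, str]:
    """Run one navigation step with self-gated clarification and reentry."""
    candidates = rate(item, observation)
    for _ in range(max_reentries):
        # (1) Detect: is need_clarify among the candidates with score >= tau?
        ask = next(
            (c for c in candidates if c.name == "need_clarify" and c.score >= tau),
            None,
        )
        if ask is None:
            break
        # (2) Clarify: query the sub-agent at the current node.
        answer = clarify_agent(ask.question, item)
        # (3) Inject: enrich the observation with the answer.
        observation = f"{observation}\nClarification: {answer}"
        # (4) Re-select: rate again with the enriched observation (reentry).
        candidates = rate(item, observation)
    navigation = [c for c in candidates if c.name != "need_clarify"]
    if not navigation:
        raise ValueError("no navigation action available after reentry")
    best = max(navigation, key=lambda c: c.score)
    return best.name, observation
```

Because the loop lives inside the step, the caller's outer navigation loop sees only the returned action, which is the sense in which clarification is absorbed within the step. Passing `tau=101` disables gating, since scores never exceed 100.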
#### Two help\-needed modes\.
We define two modes purely from the rank of need_clarify within the top-$K$ list. *Mandatory help*: need_clarify is the top-ranked action ($\text{rank}=1$); no navigation branch scores above it. *Opportunistic help*: a navigation action leads the ranking, but need_clarify appears among positions 2–$K$ with score $\geq \tau$.
The definition is operational: it depends only on the rank, not on any interpretation of why the agent ranked it there. Empirically, rank-1 placement tends to correspond to states where no navigation branch appears viable, while lower-ranked placement corresponds to states with a preferred branch but residual uncertainty, consistent with the classical notion of value of information Howard ([1966](https://arxiv.org/html/2606.11349#bib.bib15)); Raiffa and Schlaifer ([1961](https://arxiv.org/html/2606.11349#bib.bib47)). Both modes are self-triggered from the same ordinal action-rating signal, requiring no external uncertainty estimator. Together they form an observational taxonomy, not a classification of true epistemic need, but a structured partition that produces a consistent behavioral signature (mode shift, ISE improvement, separability) absent under simpler triggers (§[4](https://arxiv.org/html/2606.11349#S4)).
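The rank-based definition translates directly into a small classifier. The sketch below assumes the top-$K$ list arrives already sorted by descending score; the function name `help_mode` and the string labels are hypothetical. It follows the definition literally, so the threshold is applied only to the opportunistic case; whether a rank-1 clarification can score below $\tau$ in practice is not stated in the text.

```python
from typing import Optional

def help_mode(ranked: list[tuple[str, int]], tau: int = 10) -> Optional[str]:
    """Classify a rated top-K list as 'mandatory', 'opportunistic', or None.

    ranked: (action name, score) pairs sorted by descending score (assumed).
    """
    for rank, (name, score) in enumerate(ranked, start=1):
        if name != "need_clarify":
            continue
        if rank == 1:
            return "mandatory"  # no navigation branch is ranked above clarification
        return "opportunistic" if score >= tau else None
    return None  # need_clarify is not among the top-K candidates

# Scores reported for the cough-drop example in Section 5.5 (clarify 68, next branch 31);
# the branch name is illustrative.
print(help_mode([("need_clarify", 68), ("traverse_child:pharmaceuticals", 31)]))
```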
Structural properties (monotone trigger sets, bounded reentry) and a threshold-policy sanity check stating a sufficient single-crossing condition under which a threshold policy is optimal are deferred to Appendix [A](https://arxiv.org/html/2606.11349#A1). We treat this as a conceptual sanity check, not a guarantee for LLM-emitted scores, which are not assumed to be calibrated.
## 4 Experiments
### 4.1 HTS as a Language-Mediated Testbed
HTS classification is a *language-mediated* hierarchical reasoning task: item descriptions are free-text and inherently ambiguous (e.g., “cough drops” could be medicament or confectionery), taxonomy nodes are defined by natural-language headings with legal-text qualifications, and clarification is itself a language-generation act; the agent must formulate a discriminative question in natural language and interpret a textual answer. The entire reasoning chain, from item description through intermediate decisions to clarification, operates in language space.
We require four structural preconditions for the measurement question to be nontrivial: (i) *deep branching* so that early errors compound, (ii) *repeated intermediate commitments* so that information-seeking varies across steps, (iii) *information gaps* severe enough that targeted clarification carries real value, and (iv) *verifiable ground truth*. Flat classification fails (i)–(ii); shallow QA benchmarks fail (i); open-ended reasoning fails (iv). HTS, a hierarchical taxonomy used in international trade to assign 10-digit codes to imported goods, satisfies all four: 30,000+ nodes across 5 levels (branching factors up to 50+), General Rules of Interpretation requiring special-case protocols (Appendix [E](https://arxiv.org/html/2606.11349#A5)), systematically incomplete product descriptions, and verified ground-truth codes from U.S. Customs rulings (U.S. Customs and Border Protection, [2025b](https://arxiv.org/html/2606.11349#bib.bib13)). We construct a knowledge graph (U.S. Customs and Border Protection, [2025a](https://arxiv.org/html/2606.11349#bib.bib14); Pan et al., [2024](https://arxiv.org/html/2606.11349#bib.bib52)) as the MDP environment (Appendix [B](https://arxiv.org/html/2606.11349#A2)). We separate Layer 1 (domain instantiation: KG, GRI protocols, answer channel; HTS-specific, must be re-implemented per domain) from Layer 2 (measurement protocol: action-space formulation, mandatory/opportunistic analysis, ISE, threshold sweep; portable to any tree-structured reasoning task).
#### Datasets and metrics\.
We evaluate on three benchmarks: CBP-NY (1,181 products extracted from public CBP rulings (U.S. Customs and Border Protection, [2025b](https://arxiv.org/html/2606.11349#bib.bib13)) via an LLM-based pipeline; Appendix [O](https://arxiv.org/html/2606.11349#A15)), ATLAS (200 samples; Yuvraj and Devarakonda ([2025](https://arxiv.org/html/2606.11349#bib.bib16))), and HSCodeComp (632 expert-annotated records; Yang et al. ([2025](https://arxiv.org/html/2606.11349#bib.bib17))). We report hierarchical accuracy at each depth level (2–10 digits), success rate, average navigation steps, and ISE (§[5.2](https://arxiv.org/html/2606.11349#S5.SS2)).
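Hierarchical accuracy at a given depth can be read as prefix agreement between predicted and gold codes. The sketch below is one plausible reading, not the paper's evaluation script: the depth grid (2, 4, 6, 8, 10 digits) is an assumption based on the five levels and 10-digit codes mentioned above, the function name is hypothetical, and the code strings in the usage example are made up for illustration.

```python
def hierarchical_accuracy(
    predicted: list[str],
    gold: list[str],
    depths: tuple[int, ...] = (2, 4, 6, 8, 10),  # assumed digit depths
) -> dict[int, float]:
    """Fraction of records whose first d digits match the gold code, per depth d."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("need at least one record")

    def digits(code: str) -> str:
        # Drop separators such as dots so "1704.90" and "170490" compare equal.
        return "".join(ch for ch in code if ch.isdigit())

    result = {}
    for d in depths:
        hits = 0
        for p, g in zip(predicted, gold):
            p_digits, g_digits = digits(p), digits(g)
            # A prediction shorter than d digits counts as a miss at depth d.
            if len(p_digits) >= d and len(g_digits) >= d and p_digits[:d] == g_digits[:d]:
                hits += 1
        result[d] = hits / len(gold)
    return result

# Illustrative strings only, not records from any benchmark.
acc = hierarchical_accuracy(
    ["1704.90.35.50", "3004.90.92.90"],
    ["1704.90.35.90", "1704.90.35.50"],
)
print(acc)
```

Under this reading, accuracy is non-increasing with depth, because a match at depth $d$ implies a match at every shallower depth.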
### 4.2 Setup
We evaluate 9 LLMs across 4 families as MDP navigators; ActionRating ($\tau=10$) is applied to 5 of them. The baseline uses greedy action selection without scoring or gating (Appendix [P.2](https://arxiv.org/html/2606.11349#A16.SS2)). The controlled answer channel uses CBP ruling attributes with codes masked (Appendix [P.1](https://arxiv.org/html/2606.11349#A16.SS1), [C](https://arxiv.org/html/2606.11349#A3)). Two *diagnostic contrasts* test alternative explanations: H1, that a prompt-level instruction suffices (CoT-Ask-if-Unsure; Appendix [F](https://arxiv.org/html/2606.11349#A6)); and H2, that a sampling-based trigger suffices (Self-Consistency, $N=3$; Wang et al. ([2023a](https://arxiv.org/html/2606.11349#bib.bib4)); Appendix [G](https://arxiv.org/html/2606.11349#A7)). A third hypothesis, H3 (deliberation without actioning), is tested by the rating-only ablation ($\tau=101$), which disables gating because scores never exceed 100.
## 5 Results
### 5.1 Information-Seeking Behavior under ActionRating
Table 1: Hierarchical accuracy (%) across models and confidence-triggering methods on HTS classification ($N=1{,}181$). Cell shading: high to low. CoT-Ask-if-Unsure: single prompt instruction to ask clarification when uncertain. Self-Consistency ($N=3$): trigger clarification on action disagreement ($\dagger$ $\approx 19$ LLM calls/record). AR ($\tau=10$): Claude Opus 4.6 with ActionRating. Significance markers on $\Delta$ rows are based on non-parametric paired bootstrap ($n_{\mathrm{boot}}=5{,}000$): ∗∗∗ 95% CI strictly positive; ns CI includes zero.
Table [1](https://arxiv.org/html/2606.11349#S5.T1) presents results across all 9 baseline models, two diagnostic contrasts (CoT-Ask-if-Unsure and Self-Consistency, $N=3$), and Claude Opus 4.6 with ActionRating. The primary signals are *what changes in information-seeking behavior*, not the accuracy numbers themselves (which are upper bounds under the controlled answer channel).
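The significance markers rest on a paired bootstrap over records. A minimal sketch of such a test is given below, assuming per-record 0/1 correctness vectors for the baseline and the treated condition on the same records; the percentile interval, the fixed seed, and the function name are assumptions, since the text states only that the bootstrap is non-parametric, paired, and uses 5,000 resamples.

```python
import random

def paired_bootstrap_ci(
    baseline: list[int],
    treated: list[int],
    n_boot: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,  # fixed seed for reproducibility (an assumption of this sketch)
) -> tuple[float, float]:
    """Percentile CI for the mean paired difference (treated - baseline)."""
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("need two non-empty vectors of equal length")
    rng = random.Random(seed)
    n = len(baseline)
    # Pairing: resample records, keeping each record's two outcomes together.
    diffs = [t - b for b, t in zip(baseline, treated)]
    means = []
    for _ in range(n_boot):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi
```

Under this reading, a row would receive the positive marker when the returned lower bound is strictly greater than zero, and "ns" when the interval contains zero.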
A distinctive clarification pattern emerges. ActionRating ($\tau=10$) produces a three-part behavioral signature absent from all comparison conditions: (i) a regime shift from mandatory to opportunistic clarification (35.2% → 13.9% mandatory; 0% → 88.7% opportunistic), (ii) rising local utility (ISE: 50% → 74%), and (iii) accuracy co-movement at every hierarchy depth. This is a structural change in *where* the agent seeks clarification, not merely *how often*; the volume increase is a consequence of opportunistic mode activation, not its cause (§[5.2](https://arxiv.org/html/2606.11349#S5.SS2)–[5.4](https://arxiv.org/html/2606.11349#S5.SS4)).
Gating, not scoring, drives the regime shift. Rating-only ($\tau=101$; Appendix [I](https://arxiv.org/html/2606.11349#A9)) retains the full scoring apparatus but disables the gating threshold so that no clarification ever fires. Despite having the deliberation step, it yields −0.9% and produces no opportunistic events, directly isolating *gating* as the mechanism behind the regime shift. CoT-Ask-if-Unsure (H1) and Self-Consistency (H2) similarly fail to reproduce the three-part signature, arguing against prompt-level instruction and sampling disagreement as sufficient triggers. Among the tested alternatives, only shared-scale action competition, where asking competes with acting at the same local decision point, produces the observed structure.
Upper-bound accuracy under controlled answers. Under the controlled answer channel, Claude Opus 4.6 reaches 67.0% 10-digit accuracy (+16.2% over baseline 50.8%), with gains increasing monotonically with depth, consistent with the claim that deeper hierarchy levels benefit most from better help localization. These numbers bound what better localization *could* unlock when high-quality answers are available. Cost rises from 6.0 to 10.4 calls/record (Figure [3](https://arxiv.org/html/2606.11349#S6.F3)). The behavioral signature generalizes across 4 LLM families (Table [8](https://arxiv.org/html/2606.11349#A10.T8), Appendix [J](https://arxiv.org/html/2606.11349#A10)) and two additional benchmarks (ATLAS and HSCodeComp) with $\tau=10$ locked from CBP-NY, providing an out-of-sample generalization test (Table [9](https://arxiv.org/html/2606.11349#A11.T9), Appendix [K](https://arxiv.org/html/2606.11349#A11)).
### 5.2 Information-Seeking Effectiveness (ISE)
A central question is whether self-gating merely increases the *volume* of help-seeking or also improves its local usefulness; i.e., whether help was requested at states that genuinely needed it. We define Information-Seeking Effectiveness (ISE) as a *localized help-utility probe*: the fraction of help interactions after which the agent's next navigation action lands on the correct path:
$$\text{ISE} = \frac{\#\,\text{QA followed by correct traverse}}{\#\,\text{total QA interactions}}$$
ISE tests whether each identified help interaction was *locally useful*: if help was requested and the next action was correct, the interaction was productive at the one-step level. Because the controlled answer channel fixes answer quality, ISE variation across conditions reflects differences in *where* the agent sought help, not *what* it received.
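The ISE ratio, overall and split by clarification mode, can be computed from a log of help interactions. The sketch below assumes each interaction is recorded as a `(mode, next_step_correct)` pair; that log format, the function name, and the toy data in the test are hypothetical.

```python
from collections import defaultdict
from typing import Iterable

def ise(interactions: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """ISE overall and per mode.

    interactions: (mode, next_step_correct) pairs, where next_step_correct says
    whether the navigation action taken right after the help interaction
    landed on the correct path.
    """
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for mode, correct in interactions:
        for key in ("overall", mode):
            totals[key] += 1
            hits[key] += int(correct)
    if not totals:
        raise ValueError("ISE is undefined without any help interaction")
    return {key: hits[key] / totals[key] for key in totals}
```

Because the denominator counts help interactions rather than records, ISE can fall while question volume rises.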
Self-gating improves quality, not just volume. At $\tau=10$, the agent issues 6× more clarifications (2.4 vs. 0.4/record), yet ISE rises from 50% to 74% (Figure [2](https://arxiv.org/html/2606.11349#S5.F2)a). Virtually all additional volume is opportunistic, and both modes attain ≈75% ISE (Figure [2](https://arxiv.org/html/2606.11349#S5.F2)b). At $\tau=1$, volume increases further (5.9 QA/record) but ISE *drops* to 62%: the additional questions are less likely to land on the correct path. This dissociation between volume and quality is the clearest evidence that the framework distinguishes *asking at the right place* from *asking more often*.
Opportunistic clarification corrects misaligned trajectories. Table [12](https://arxiv.org/html/2606.11349#A13.T12) further dissects opportunistic ISE by the agent's traversal state at clarification time. Even when the agent was already on the correct HTS path, ISE is 83.6%, suggesting that on-path clarifications refine rather than derail correct trajectories. More tellingly, when the agent was *off* the correct path, where clarification must actively steer the agent back, ISE remains 67.3%. This argues against the confound that high aggregate ISE is driven purely by easy, already-correct cases: the clarification sub-agent appears to meaningfully correct misaligned trajectories.
Figure 2: Clarification behavior on CBP-NY ($n=1{,}181$), Claude Opus 4.6. (a) QA Volume & ISE. Stacked bars: average QA per record, split by mode (orange: mandatory; green: opportunistic) and outcome (solid: effective; light: ineffective). Right axis: 10-digit accuracy (blue diamonds). Annotations above each bar: total QA/record and overall ISE (%). The $\tau=10$ condition (outlined) shifts volume from mandatory to opportunistic while maximising accuracy. (b) ISE by Clarification Mode. Grouped bars compare mandatory (orange) vs. opportunistic (green) ISE at each threshold; at $\tau=10$, both reach ≈75%, while at $\tau=1$ both drop to 62% (over-triggering).
### 5.3 Threshold Sensitivity
We do not tune $\tau$ on target benchmarks; instead we use threshold sweeps to characterize behavior on CBP-NY only, and lock $\tau=10$ before any ATLAS or HSCodeComp evaluation (selection protocol in Appendix [L](https://arxiv.org/html/2606.11349#A12)). We treat $\tau$ as a *behavioral phase control*: sweeping it maps out a phase diagram of help-seeking structure (Table [10](https://arxiv.org/html/2606.11349#A12.T10), Appendix [L](https://arxiv.org/html/2606.11349#A12)). The $\tau=50 \to 30$ transition marks the clearest phase boundary (opportunistic rate: 9.7% → 51.9%; accuracy: 51.2% → 59.8%), corresponding to the onset of widespread opportunistic mode activation. At $\tau=10$, 90.9% of records involve help-seeking (76.8% opportunistic), achieving 74% ISE at only 2.4 QA/record. Navigation steps stay within a narrow range across all settings (5.4–5.7), indicating that gating alters *where* the agent seeks help, not *how* it navigates.
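The effect of sweeping $\tau$ on the trigger set can be illustrated with a toy computation. The clarification scores below are invented for illustration and are not taken from the paper; the sketch only shows that, for fixed scores, the set of states that fire at a lower threshold contains the set that fires at a higher one, and that $\tau=101$ fires nowhere on a $[0,100]$ scale. It ignores the fact that reentry can change later scores.

```python
def trigger_rate(clarify_scores: list[int], tau: int) -> float:
    """Fraction of decision points whose need_clarify score is >= tau."""
    if not clarify_scores:
        raise ValueError("need at least one score")
    return sum(score >= tau for score in clarify_scores) / len(clarify_scores)

# Hypothetical need_clarify scores at six decision points (illustration only).
scores = [0, 5, 12, 40, 68, 100]

# Thresholds listed from most to least permissive.
sweep = {tau: trigger_rate(scores, tau) for tau in (1, 10, 30, 50, 101)}
rates = list(sweep.values())

# Monotone trigger sets: raising tau can only remove triggering states.
assert all(a >= b for a, b in zip(rates, rates[1:]))
print(sweep)
```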
### 5.4 Separability: Help Localization vs. Answer Quality
The regime shift and ISE improvement above are observed under controlled answers. The critical test is whether these patterns reflect the agent's information-seeking ability or are artifacts of answer quality. A separability test (Table [2](https://arxiv.org/html/2606.11349#A3.T2) in Appendix [C](https://arxiv.org/html/2606.11349#A3)) replaces the controlled channel with fully-automated answers derived from the product description alone.
Accuracy collapses; information-seeking pattern is preserved. ActionRating loses nearly all accuracy gain at 10-digit (67.0% → 48.2%) while the baseline is largely unaffected (50.8% → 49.2%): a −18.8% gap attributable to answer quality. Crucially, the information-seeking pattern itself (mandatory/opportunistic split, ISE ranking across thresholds) is preserved even when answer quality is degraded.
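As a quick check on the figures above, the reported differences can be recomputed from the four 10-digit accuracies, reading them as percentage-point differences (an interpretation, since the text writes them with a percent sign):

```latex
\begin{align*}
\text{gain under controlled answers} &= 67.0 - 50.8 = 16.2 \\
\text{ActionRating, controlled} \to \text{degraded} &= 67.0 - 48.2 = 18.8 \\
\text{baseline, controlled} \to \text{degraded} &= 50.8 - 49.2 = 1.6
\end{align*}
```

With degraded answers, ActionRating therefore ends 1.0 point below the degraded baseline (48.2 vs. 49.2), which is why the gain is read as dependent on answer quality rather than on localization alone.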
Interpretation\.The agent can locate states where reasoning needs help even when the answer source is weak; what it cannot do without a knowledgeable respondent is*realize*the downstream accuracy benefit\. This dissociation between localization and realization provides evidence that the two can be analyzed as separable factors under controlled degradation, with localization behavior appearing more stable than realized accuracy across answer\-quality conditions\. The behavioral signature additionally satisfies four validity criteria:*stability*\(replicates across 4 LLM families and 3 benchmarks\),*interpretability*\(mandatory vs\. opportunistic modes have clear semantic content\),*contrastiveness*\(three diagnostic contrasts fail to produce the same structure\), and*predictive local utility*\(ISE indicates that identified help states are productive at the one\-step level\)\.
Knowledge-channel audit. We audit all 2,875 Q/A pairs from the CBP-NY run with action rating ($\tau{=}10$) along two axes: six question categories (e.g., Material/Composition, Function/Use) and five answer types (Product Attribute, PA-Essential Character, Classification Criteria (CC), Unavailable, Deflected). Table [3](https://arxiv.org/html/2606.11349#A4.T3) shows the full cross-tabulation: 80% of answers are plain product attributes, and only 3.5% of questions (102 pairs) explicitly name an HTS chapter, heading, or note. Table [4](https://arxiv.org/html/2606.11349#A4.T4) isolates this HTS-referencing subset: the guardrail deflects 34% outright; 37% yield plain product-attribute answers; and only 23 pairs (0.8% of all 2,875) reach a *Classification Criteria* answer (i.e., the oracle directly affirms or denies a named chapter note or heading criterion, such as “yes, it qualifies as rubber under Chapter 40, Note 1”). Table [5](https://arxiv.org/html/2606.11349#A4.T5) then shows that even those 23 CC answers produce *lower* navigation success than plain product-attribute answers (ISE 62.5% vs. 76.2%), arguing against oracle leakage as the main driver of the accuracy gain (Q&A examples and per-level breakdown in Appendix [D](https://arxiv.org/html/2606.11349#A4)).
Trajectory-level analysis (Appendix [M](https://arxiv.org/html/2606.11349#A13)) shows that the score gap between the top-ranked navigation action and clarification narrows at deeper hierarchy levels, consistent with increasing local ambiguity at finer classification granularity, while navigation steps and backtracking rates remain comparable to the baseline.
### 5.5 Qualitative Clarification Behavior
To illustrate what the two modes look like in practice, we present representative examples from the CBP\-NY evaluation\.
#### Mandatory clarification\.
For *oval-shaped sugar confectionery cough drops containing 10 mg menthol*, the baseline agent commits to pharmaceuticals based on the dosage language (“10 mg per dose”). Under ActionRating, no navigation branch scores above $\tau$ at the Lv.2 node; `need_clarify` is top-ranked (score 68, next branch 31). The agent asks: “Is this product put up in measured doses or for retail sale as a medicament?” The answer (“No, this is sugar confectionery for retail consumption”) resolves the ambiguity, and the re-scored navigation correctly routes to confectionery.
#### Opportunistic clarification\.
For a *granular copolymer: 69% butadiene, 20% methyl methacrylate, 9% methacrylic acid, 2% divinylbenzene*, the agent’s top-ranked navigation action at Lv.3 is “polymers of olefins” (score 72), but `need_clarify` appears at rank 2 (score 48, above $\tau{=}10$). The question (“Is butadiene considered an olefin for classification purposes?”) targets a domain convention: chemically, butadiene is a conjugated diene, but classification convention treats butadiene polymers as olefin polymers. The controlled answer confirms the convention; without it, the automated system applies strict chemistry and misroutes to “other resins.”
#### Pattern\.
Mandatory questions tend to target*missing essential attributes*\(composition, primary use, form\) at early levels where no branch is preferred, while opportunistic questions target*fine\-grained disambiguation*\(domain conventions, threshold values\) at deeper levels where a leading branch exists\. The two modes capture structurally different information needs, not merely different confidence levels\.
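The rank-based reading of the two modes can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a mode is assigned from a single rating round, with *mandatory* meaning `need_clarify` outranks every navigation branch and *opportunistic* meaning a branch leads while `need_clarify` still scores at or above $\tau$. The function name, the dictionary layout, the tie-breaking rule, and the `next_branch` label are hypothetical; the 68/31 and 72/48 scores are taken from the two examples above.

```python
from typing import Dict

CLARIFY = "need_clarify"


def clarification_mode(scores: Dict[str, float], tau: float = 10.0) -> str:
    """Label one rating round as 'mandatory', 'opportunistic', or 'none'."""
    clarify = scores[CLARIFY]
    nav_scores = [s for action, s in scores.items() if action != CLARIFY]
    best_nav = max(nav_scores) if nav_scores else float("-inf")
    if clarify < tau:
        return "none"           # gate not met: navigate without asking
    if clarify > best_nav:
        return "mandatory"      # asking outranks every navigation branch
    return "opportunistic"      # a branch leads (or ties), asking still clears tau


# Scores from the two qualitative examples; branch labels are placeholders.
cough_drops = {"next_branch": 31, CLARIFY: 68}
copolymer = {"polymers of olefins": 72, CLARIFY: 48}

print(clarification_mode(cough_drops))
print(clarification_mode(copolymer))
```

Under these assumptions the cough-drop round is labeled mandatory and the copolymer round opportunistic, matching the narrative above.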
## 6 Discussion
Figure 3: 10-digit accuracy vs. inference cost (LLM calls/sample). Every model improves with ActionRating (gray arrows); simpler triggers (orange) do not reproduce the regime shift. Accuracy values are upper bounds under the controlled answer channel (product-owner simulation).

Figure [3](https://arxiv.org/html/2606.11349#S6.F3) shows that the behavioral signature is model-agnostic and that simpler triggers lie below the ActionRating Pareto frontier. We highlight three implications.
#### Portability\.
The two\-layer architecture separates domain instantiation \(Layer 1: knowledge graph, classification protocols, answer channel\) from the analysis protocol \(Layer 2: shared ordinal scale, threshold policy, ISE, separability test\)\. Layer 2 requires only a tree\-structured action space with potential information gaps; candidate domains include medical coding \(ICD\-10\), product classification \(CPC\), and legal statute navigation\.
#### Cost–accuracy trade\-off\.
ActionRating increases inference cost from 6.0 to 10.4 LLM calls per record (73% overhead), with the increase sublinear in accuracy gain. Self-Consistency at $N{=}3$ achieves +8.7% at 19 calls (3.2$\times$ cost), placing it below the ActionRating Pareto frontier (Figure [3](https://arxiv.org/html/2606.11349#S6.F3)).
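The overhead figures follow from simple ratios, reproduced here as a quick check. The sketch assumes that both the 73% overhead and the 3.2× multiplier are measured against the 6.0-call baseline; the variable names are illustrative.

```python
baseline_calls = 6.0           # LLM calls per record without clarification
action_rating_calls = 10.4     # LLM calls per record with ActionRating
self_consistency_calls = 19.0  # LLM calls per record for Self-Consistency, N=3

# Relative overhead of ActionRating over the baseline.
overhead = (action_rating_calls - baseline_calls) / baseline_calls
# Cost multiplier of Self-Consistency relative to the same baseline.
sc_multiplier = self_consistency_calls / baseline_calls

print(f"ActionRating overhead: {overhead:.0%}")
print(f"Self-Consistency cost: {sc_multiplier:.1f}x baseline")
```

Both rounded values agree with the numbers quoted in the text (73% and 3.2×).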
#### Toward realistic answer sources\.
The controlled answer channel is a clean measurement baseline. A natural next step is an *answer-source ladder* varying answer quality from the controlled channel through retrieval and weaker LLMs to noisy human-like responses, to map how localization benefits degrade as answer quality decreases. The separability result (§[5.4](https://arxiv.org/html/2606.11349#S5.SS4)) predicts that information-seeking *patterns* remain stable across this ladder even as accuracy varies.
#### Decomposing help\-seeking\.
Separability motivates decomposing help\-seeking into three factors: localization \(where to ask\), question quality \(what to ask\), and answer\-source quality \(who answers\)\. This paper isolates the first\.
## 7 Conclusion
ActionRating places clarification inside a language agent’s action space on a shared ordinal scale with navigation, yielding a self-gated mechanism that requires no external uncertainty estimator. We frame the contribution as a measurement protocol for self-gated information-seeking behavior under a controlled answer channel, not a deployment system; reported accuracy gains are upper bounds. The framework reveals a regime shift from mandatory to opportunistic information-seeking (ISE: 50% $\to$ 74%), stable across four LLM families, three benchmarks, and a range of thresholds, with the two modes serving distinct linguistic functions. Three diagnostic contrasts fail to reproduce this structure, and a separability test (−18.8% accuracy under degraded answers) supports an empirical separation between help localization and answer-source quality. Generalization beyond HTS and evaluation with realistic answer sources remain open directions.
## Limitations
Controlled answer channel, not deployment. Accuracy numbers are upper bounds under a controlled answer channel; deployment with realistic information sources would yield lower gains. We do not yet systematically vary answer quality across intermediate regimes.
Single domain. Evaluation covers three independent HTS benchmarks, but generalization to structurally distinct taxonomies (ICD-10, CPC, legal statutes) requires re-implementing the domain layer.
Action scores are not calibrated. ActionRating assumes the LLM produces meaningful ordinal scores but does not claim calibrated confidence estimation (Guo et al. [2017](https://arxiv.org/html/2606.11349#bib.bib35); Xiong et al. [2024](https://arxiv.org/html/2606.11349#bib.bib37)); $\tau$ may require re-tuning across models.
Observational taxonomy. The mandatory/opportunistic distinction is derived from the agent’s own ratings, not from ground-truth epistemic states.
Practical constraints. Opportunistic mode issues multiple inline QA rounds per step; latency may offset accuracy gains. Evaluation uses English-language descriptions only.
## Ethics Statement
HTS classification has direct financial and regulatory implications: incorrect codes can lead to improper duty assessment, and automation errors may have downstream financial or regulatory consequences\. Our system is intended as a decision\-support tool and should be validated by domain experts before deployment\. The benchmark data is derived from publicly available CBP rulings and does not contain personally identifiable information\.
## References
- M\. Aliannejadi, H\. Zamani, F\. Crestani, and W\. B\. Croft \(2019\)Asking clarifying questions in open\-domain information\-seeking conversations\.InProceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval,SIGIR ’19,pp\. 475–484\.External Links:[Link](http://dx.doi.org/10.1145/3331184.3331265),[Document](https://dx.doi.org/10.1145/3331184.3331265)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- S\. Banerjee, C\. Akkaya, F\. Perez\-Sorrosal, and K\. Tsioutsiouliklis \(2019\)Hierarchical transfer learning for multi\-label text classification\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 6295–6300\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- M\. Besta, N\. Blach, A\. Kubicek, R\. Gerstenberger, M\. Podstawski, L\. Gianinazzi, J\. Gajda, T\. Lehmann, H\. Niewiadomski, P\. Nyczyk, and T\. Hoefler \(2024\)Graph of thoughts: solving elaborate problems with large language models\.External Links:2308\.09687Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- D\. Dua, S\. Gupta, S\. Singh, and M\. Gardner \(2022\)Successive prompting for decomposing complex questions\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 1251–1265\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- N\. Dziri, X\. Lu, M\. Sclar, X\. L\. Li, L\. Jiang, B\. Y\. Lin, S\. Welleck, P\. West, C\. Bhagavatula, R\. L\. Bras, J\. D\. Hwang, S\. Sanyal, X\. Ren, A\. Ettinger, Z\. Harchaoui, and Y\. Choi \(2024\)Faith and fate: limits of transformers on compositionality\.External Links:2305\.18654Cited by:[§1](https://arxiv.org/html/2606.11349#S1.p1.1)\.
- R\. El\-Yaniv and Y\. Wiener \(2010\)On the foundations of noise\-free selective classification\.Journal of Machine Learning Research11,pp\. 1605–1641\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)PAL: program\-aided language models\.External Links:2211\.10435Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- Y\. Geifman and R\. El\-Yaniv \(2017\)Selective classification for deep neural networks\.External Links:1705\.08500,[Link](https://arxiv.org/abs/1705.08500)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\)On calibration of modern neural networks\.InProceedings of the 34th International Conference on Machine Learning,pp\. 1321–1330\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[Limitations](https://arxiv.org/html/2606.11349#Sx1.p3.1)\.
- S\. Hao, Y\. Gu, H\. Ma, J\. J\. Hong, Z\. Wang, D\. Z\. Wang, and Z\. Hu \(2023\)Reasoning with language model is planning with world model\.External Links:2305\.14992Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- R\. A\. Howard \(1966\)Information value theory\.IEEE Transactions on systems science and cybernetics2\(1\),pp\. 22–26\.Cited by:[§3\.3](https://arxiv.org/html/2606.11349#S3.SS3.SSS0.Px1.p2.1)\.
- J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. W\. Yu, X\. Song, and D\. Zhou \(2024\)Large language models cannot self\-correct reasoning yet\.External Links:2310\.01798Cited by:[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- C\. N\. Silla Jr\. and A\. A\. Freitas \(2011\)A survey of hierarchical classification across different application domains\.Data Mining and Knowledge Discovery22\(1\),pp\. 31–72\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tyre, Z\. Zhu, L\. Lovitt, J\. Kernion, A\. Jones, B\. Mann, S\. McCandlish, J\. Kaplan, C\. Olah, D\. Amodei, and T\. Brown \(2022\)Language models \(mostly\) know what they know\.External Links:2207\.05221Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.11349#S1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- A\. Kamath, R\. Jia, and P\. Liang \(2020\)Selective question answering under domain shift\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 5684–5696\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px4.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- T\. Khot, H\. Trivedi, M\. Finlayson, Y\. Fu, K\. Richardson, P\. Clark, and A\. Sabharwal \(2023\)Decomposed prompting: a modular approach for solving complex tasks\.External Links:2210\.02406Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- K\. Kowsari, D\. E\. Brown, M\. Heidarysafa, K\. J\. Meimandi, M\. S\. Gerber, and L\. E\. Barnes \(2017\)HDLTex: hierarchical deep learning for text classification\.External Links:1709\.09839Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- L\. Kuhn, Y\. Gal, and S\. Farquhar \(2023\)Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation\.External Links:2302\.09664Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.11349#S1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.External Links:2305\.20050Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\)Teaching models to express their uncertainty in words\.External Links:2205\.14334Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- X\. Liu, H\. Yu, H\. Zhang, Y\. Xu, X\. Lei, H\. Lai, Y\. Gu, H\. Ding, K\. Men, K\. Yang, S\. Zhang, X\. Deng, A\. Zeng, Z\. Du, C\. Zhang, S\. Shen, T\. Zhang, Y\. Su, H\. Sun, M\. Huang, Y\. Dong, and J\. Tang \(2023\)AgentBench: evaluating llms as agents\.External Links:2308\.03688Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang, S\. Gupta, B\. P\. Majumder, K\. Hermann, S\. Welleck, A\. Yazdanbakhsh, and P\. Clark \(2023\)Self\-refine: iterative refinement with self\-feedback\.External Links:2303\.17651,[Link](https://arxiv.org/abs/2303.17651)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- Y\. Mao, J\. Tian, J\. Han, and X\. Ren \(2019\)Hierarchical text classification with reinforced label assignment\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),pp\. 445–455\.External Links:[Link](http://dx.doi.org/10.18653/v1/D19-1042),[Document](https://dx.doi.org/10.18653/v1/d19-1042)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- M\. Nye, A\. J\. Andreassen, G\. Gur\-Ari, H\. Michalewski, J\. Austin, D\. Biber, D\. Dohan, A\. Lewkowycz, M\. Bosma, D\. Luan, C\. Sutton, and A\. Odena \(2021\)Show your work: scratchpads for intermediate computation with language models\.External Links:2112\.00114Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- S\. Pan, L\. Luo, Y\. Wang, C\. Chen, J\. Wang, and X\. Wu \(2024\)Unifying large language models and knowledge graphs: a roadmap\.External Links:2306\.08302Cited by:[§4\.1](https://arxiv.org/html/2606.11349#S4.SS1.p2.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\)Measuring and narrowing the compositionality gap in language models\.External Links:2210\.03350Cited by:[§1](https://arxiv.org/html/2606.11349#S1.p1.1)\.
- M\. L\. Puterman \(1994\)Markov decision processes: discrete stochastic dynamic programming\.John Wiley & Sons,New York\.Cited by:[§3\.1](https://arxiv.org/html/2606.11349#S3.SS1.p1.6)\.
- H\. A\. Rahmani, X\. Wang, Y\. Feng, Q\. Zhang, E\. Yilmaz, and A\. Lipani \(2023\)A survey on asking clarification questions datasets in conversational systems\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics,pp\. 2698–2716\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- H\. Raiffa and R\. Schlaifer \(1961\)Applied statistical decision theory\.Harvard University Press,Cambridge, MA\.Cited by:[§3\.3](https://arxiv.org/html/2606.11349#S3.SS3.SSS0.Px1.p2.1)\.
- S\. Rao and H\. Daumé III \(2018\)Learning to ask good questions: ranking clarification questions using neural expected value of perfect information\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics,pp\. 2737–2746\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2024\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- B\. Settles \(2012\)Active learning\.Morgan & Claypool Publishers\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- K\. Shimura, J\. Li, and F\. Fukumoto \(2018\)HFT\-CNN: learning hierarchical category structure for multi\-label short text categorization\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 811–816\.External Links:[Link](https://aclanthology.org/D18-1093/),[Document](https://dx.doi.org/10.18653/v1/D18-1093)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- N\. Shinn, F\. Cassano, E\. Berman, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.External Links:2303\.11366,[Link](https://arxiv.org/abs/2303.11366)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.11349#S1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- T\. R\. Sumers, S\. Yao, K\. Narasimhan, and T\. L\. Griffiths \(2024\)Cognitive architectures for language agents\.External Links:2309\.02427Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- U\.S\. Customs and Border Protection \(2025a\)2025 HTS Revision 26\.Note:[https://www\.usitc\.gov/2025\_hts\_revision\_26](https://www.usitc.gov/2025_hts_revision_26)Accessed: 2025\-01\-01Cited by:[Appendix B](https://arxiv.org/html/2606.11349#A2.p1.6),[§4\.1](https://arxiv.org/html/2606.11349#S4.SS1.p2.1)\.
- U\.S\. Customs and Border Protection \(2025b\)Customs Rulings Online Search System \(CROSS\)\.Note:[https://rulings\.cbp\.gov/home](https://rulings.cbp.gov/home)Accessed: 2025\-01\-01Cited by:[§4\.1](https://arxiv.org/html/2606.11349#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.11349#S4.SS1.p2.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023a\)Self\-consistency improves chain of thought reasoning in language models\.External Links:2203\.11171,[Link](https://arxiv.org/abs/2203.11171)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1),[§4\.2](https://arxiv.org/html/2606.11349#S4.SS2.p1.3)\.
- Z\. Wang, G\. Zhang, K\. Yang, N\. Shi, W\. Zhou, S\. Hao, G\. Xiong, Y\. Li, M\. Y\. Sim, X\. Chen, Q\. Zhu, Z\. Yang, A\. Nik, Q\. Liu, C\. Lin, S\. Wang, R\. Liu, W\. Chen, K\. Xu, D\. Liu, Y\. Guo, and J\. Fu \(2023b\)Interactive natural language processing\.External Links:2305\.13246,[Link](https://arxiv.org/abs/2305.13246)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2023\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903,[Link](https://arxiv.org/abs/2201.11903)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- M\. Xiong, Z\. Hu, X\. Lu, Y\. Li, J\. Fu, J\. He, and B\. Hooi \(2024\)Can llms express their uncertainty? an empirical evaluation of confidence elicitation in llms\.External Links:2306\.13063Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[Limitations](https://arxiv.org/html/2606.11349#Sx1.p3.1)\.
- Y\. Yang, T\. Lan, Q\. Jia, L\. Zhu, H\. Jiang, H\. Zhu, L\. Wang, W\. Luo, and K\. Zhang \(2025\)HSCodeComp: a realistic and expert\-level benchmark for deep search agents in hierarchical rule application\.External Links:2510\.19631,[Link](https://arxiv.org/abs/2510.19631)Cited by:[§4\.1](https://arxiv.org/html/2606.11349#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023a\)Tree of thoughts: deliberate problem solving with large language models\.External Links:2305\.10601,[Link](https://arxiv.org/abs/2305.10601)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.11349#S1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2023b\)ReAct: synergizing reasoning and acting in language models\.External Links:2210\.03629,[Link](https://arxiv.org/abs/2210.03629)Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.11349#S1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- P\. Yuvraj and S\. Devarakonda \(2025\)ATLAS: benchmarking and adapting llms for global trade via harmonized tariff code classification\.External Links:2509\.18400,[Link](https://arxiv.org/abs/2509.18400)Cited by:[§4\.1](https://arxiv.org/html/2606.11349#S4.SS1.SSS0.Px1.p1.1)\.
- H\. Zamani, S\. Dumais, N\. Craswell, P\. Bennett, and G\. Lueck \(2020\)Generating clarifying questions for information retrieval\.InProceedings of The Web Conference 2020,pp\. 418–428\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- E\. Zelikman, Y\. Wu, J\. Mu, and N\. D\. Goodman \(2022\)STaR: bootstrapping reasoning with reasoning\.External Links:2203\.14465Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\.External Links:2306\.05685Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- A\. Zhou, K\. Yan, M\. Shlapentokh\-Rothman, H\. Wang, and Y\. Wang \(2024\)Language agent tree search unifies reasoning acting and planning in language models\.External Links:2310\.04406Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- D\. Zhou, N\. Schärli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. Le, and E\. Chi \(2023\)Least\-to\-most prompting enables complex reasoning in large language models\.External Links:2205\.10625Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px6.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
- J\. Zhou, C\. Ma, D\. Long, G\. Xu, N\. Ding, H\. Zhang, P\. Xie, and G\. Liu \(2020\)Hierarchy\-aware global model for hierarchical text classification\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 1106–1117\.Cited by:[Appendix H](https://arxiv.org/html/2606.11349#A8.SS0.SSS0.Px5.p1.1),[§2](https://arxiv.org/html/2606.11349#S2.p1.1)\.
## Appendix Roadmap
The appendix is organized to support specific reviewer questions rather than to extend the main argument. App. [A](https://arxiv.org/html/2606.11349#A1) provides formal sanity checks (threshold-policy single-crossing condition, bounded reentry) and a worked example of the self-gated cycle. App. [B](https://arxiv.org/html/2606.11349#A2) documents the knowledge graph and classification protocols underlying CBP-NY. App. [C](https://arxiv.org/html/2606.11349#A3)–[D](https://arxiv.org/html/2606.11349#A4) audit the controlled answer channel and argue against oracle leakage as the primary driver of accuracy. App. [E](https://arxiv.org/html/2606.11349#A5)–[G](https://arxiv.org/html/2606.11349#A7) cover MDP component ablations and the diagnostic-contrast baselines (CoT-Ask-if-Unsure, Self-Consistency). App. [J](https://arxiv.org/html/2606.11349#A10)–[K](https://arxiv.org/html/2606.11349#A11) report multi-model and cross-benchmark transfer under the locked $\tau{=}10$ protocol. App. [L](https://arxiv.org/html/2606.11349#A12)–[M](https://arxiv.org/html/2606.11349#A13) present threshold sensitivity and trajectory-level mechanism analysis. App. [O](https://arxiv.org/html/2606.11349#A15)–[P.3](https://arxiv.org/html/2606.11349#A16.SS3) release dataset-construction and full prompt templates for reproducibility.
## Appendix A Formal Properties and Proofs
### A.1 Worked Example of the Self-Gated Reentry Cycle
Figure [4](https://arxiv.org/html/2606.11349#A1.F4) traces a single record through the self-gated reentry cycle on a CBP-NY example, illustrating how clarification competes with navigation on a shared ordinal scale rather than being externally triggered.
#### Top\-down traversal\.
The agent enters the root of the HTS tree and scores every candidate action (each child branch as well as `need_clarify`) on the shared $[0,100]$ scale. At the 2-digit and 4-digit levels (left panel), one navigation branch dominates, so the agent commits without invoking the clarification action. This corresponds to the standard hierarchical decoding regime: the rating layer is active at every step, but the gating condition is not met.
#### Reentry loop at the 8\-digit node\.
Descent stalls at the 8-digit node (dashed box), where no single branch exceeds the others by a clear margin and `clarify` rises to score 45, above $\tau{=}10$. This triggers the reentry loop: a clarification sub-agent generates a question, queries the controlled answer channel, and the resulting $\langle Q,A\rangle$ pair is appended to the observation. The agent then re-scores all actions at the same node with the augmented context (right panel, Round 2). The clarification action drops, and the leading `traverse` action now dominates at $\geq 92$, allowing the agent to commit and continue toward the 10-digit leaf.
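The two-round cycle at a single node can be sketched as a small loop. This is an illustrative sketch under stated assumptions, not the paper's code: `rate` and `ask` stand in for the LLM rating call and the clarification sub-agent, the per-node cap of 2 mirrors the duplicate guard $C=2$ from Appendix A.3, and committing to the best branch once the cap is reached is our assumption. The scores 45 and 92 come from the worked example; all other values and names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Scores = Dict[str, float]
QAPair = Tuple[str, str]
CLARIFY = "need_clarify"


def rate_with_reentry(
    rate: Callable[[List[QAPair]], Scores],
    ask: Callable[[List[QAPair]], QAPair],
    tau: float = 10.0,
    max_clarify: int = 2,
) -> Tuple[str, List[QAPair]]:
    """Re-score the same node after each clarification, then pick a branch."""
    qa_log: List[QAPair] = []
    scores = rate(qa_log)
    while scores[CLARIFY] >= tau and len(qa_log) < max_clarify:
        qa_log.append(ask(qa_log))  # sub-agent queries the answer channel
        scores = rate(qa_log)       # reentry: same node, augmented context
    nav = {a: s for a, s in scores.items() if a != CLARIFY}
    return max(nav, key=nav.get), qa_log


def toy_rate(qa_log: List[QAPair]) -> Scores:
    # Round 1: no clear leader, clarification at 45. Round 2: one branch at 92.
    if not qa_log:
        return {"branch_a": 40.0, "branch_b": 38.0, CLARIFY: 45.0}
    return {"branch_a": 92.0, "branch_b": 5.0, CLARIFY: 3.0}


def toy_ask(qa_log: List[QAPair]) -> QAPair:
    return ("hypothetical question", "hypothetical answer")


action, qa_log = rate_with_reentry(toy_rate, toy_ask)
print(action, len(qa_log))
```

With the toy stubs, the loop asks once, re-scores, and commits to `branch_a`.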
#### Why this matters for measurement\.
The two\-round score trace exposes the behavioral object the rest of the paper measures: a state where clarification temporarily out\-competes navigation, the agent acts on that signal, and a re\-scored navigation step follows\. Mode taxonomy \(mandatory vs\. opportunistic\) is read from the rank structure of the first round; ISE is read from whether the post\-injection action is correct\. Without the shared\-scale rating, none of these quantities is observable from the trace alone\.
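Given logged traces, ISE (the fraction of help interactions followed by a correct next navigation step) reduces to a simple ratio. The sketch below assumes a hypothetical record layout with one entry per clarification event; the class name, field names, and toy data are illustrative and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HelpEvent:
    mode: str                # "mandatory" or "opportunistic", from round-1 ranks
    next_step_correct: bool  # was the post-injection navigation step correct?


def ise(events: List[HelpEvent], mode: Optional[str] = None) -> float:
    """Fraction of help events followed by a correct next navigation step."""
    selected = [e for e in events if mode is None or e.mode == mode]
    if not selected:
        return float("nan")  # undefined when no help event was logged
    return sum(e.next_step_correct for e in selected) / len(selected)


# Toy trace, for illustration only.
events = [
    HelpEvent("mandatory", True),
    HelpEvent("mandatory", False),
    HelpEvent("opportunistic", True),
    HelpEvent("opportunistic", True),
]
print(ise(events), ise(events, "mandatory"), ise(events, "opportunistic"))
```

Because the metric only looks one step past the injected answer, it stays a local diagnostic and does not measure final-task accuracy.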
Figure 4: Self-gated clarification cycle within hierarchical taxonomy navigation. Left: Top-down traversal of the taxonomy tree. At each internal node the agent scores all candidate actions, including `need_clarify`, on a shared $[0,100]$ ordinal scale. When the clarification score exceeds threshold $\tau$ at the 8-digit node (dashed box), the agent enters a *reentry loop*: a sub-agent resolves the query and injects $\langle Q,A\rangle$ before re-scoring. Right: Two rounds of action rating. Round 1: `clarify` scores 45 ($>\tau$), triggering the sub-agent. Round 2: after answer injection the leading `traverse` action dominates ($\geq 92$); the episode terminates at the 10-digit leaf via `confirm`.
### A.2 Threshold-Policy Sanity Check (Theorem [1](https://arxiv.org/html/2606.11349#Thmtheorem1))
#### Scope\.
This subsection states a sufficient condition under which a threshold policy on $q(s)$ is optimal within the threshold family. We do *not* assume LLM-emitted scores satisfy this condition; the result is a conceptual sanity check rather than an empirical guarantee.
#### Setup\.
Let $\mathcal{S}$ be the set of decision states encountered by the navigator, and let $q(s)\in\mathbb{R}$ be the agent’s rating assigned to the clarification action at state $s$. Define the *net downstream gain of clarification*:

$$\Delta(s) := V^{\mathrm{ask}}(s) - V^{\mathrm{act}}(s),$$

where $V^{\mathrm{ask}}(s)$ is the expected downstream return from clarifying before acting, and $V^{\mathrm{act}}(s)$ is the expected downstream return from acting immediately. For a threshold $\tau$, define the clarification policy $\pi_{\tau}(s)=\mathbf{1}\{q(s)\geq\tau\}$, with expected utility $U(\tau):=\mathbb{E}\bigl[\Delta(s)\,\mathbf{1}\{q(s)\geq\tau\}\bigr]$.
###### Proposition 1 (Monotonicity of trigger sets).
For $A_{\tau}:=\{s: q(s)\geq\tau\}$ and any $\tau_{1}<\tau_{2}$, we have $A_{\tau_{2}}\subseteq A_{\tau_{1}}$. Hence lowering the threshold can only add clarification-triggered states, and any nonnegative clarification-cost functional is weakly decreasing in $\tau$.
Assumption A (Single-crossing conditional gain). Let $m(t):=\mathbb{E}[\Delta(s)\mid q(s){=}t]$. Assume $m$ is integrable and there exists $\tau^{\star}$ such that $m(t)\leq 0$ for $t<\tau^{\star}$ and $m(t)\geq 0$ for $t\geq\tau^{\star}$.
###### Theorem 1\(Optimal threshold under single crossing\)\.
Under Assumption A, the thresholdτ⋆\\tau^\{\\star\}maximizesU\(τ\)U\(\\tau\)over the threshold family\{πτ\}τ\\\{\\pi\_\{\\tau\}\\\}\_\{\\tau\}\.
### A.3 Bounded Reentry
#### Bounded reentry \(stated in §[3\.3](https://arxiv.org/html/2606.11349#S3.SS3)\)\.
Each node $v$ permits at most $C$ clarifications (duplicate guard). Let $\mathcal{V}_{\text{ep}}$ be the set of nodes visited during the episode. The total number of clarification events is bounded by $N_{\text{clarify}}\leq C\cdot|\mathcal{V}_{\text{ep}}|$. Each clarification incurs exactly 2 additional LLM calls (one sub-agent call + one reentry re-selection), so clarification adds at most $2N_{\text{clarify}}$ calls. Navigation actions are bounded by $H$ (the episode step limit). Therefore the total call count satisfies $N_{\text{total}}\leq H+2C|\mathcal{V}_{\text{ep}}|$, which is finite since $|\mathcal{V}_{\text{ep}}|\leq H$ (each navigation step visits at most one new node). In our implementation, $C=2$ and $H=20$, giving $N_{\text{total}}\leq 20+4\cdot 20=100$.
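As a quick arithmetic check of the bound, the sketch below evaluates $H+2C|\mathcal{V}_{\text{ep}}|$ directly. The function and parameter names are illustrative and do not come from the paper's code.

```python
from typing import Optional


def max_llm_calls(step_limit: int, clarify_cap: int, visited_nodes: Optional[int] = None) -> int:
    """Upper bound H + 2*C*|V_ep| on LLM calls in one episode."""
    if visited_nodes is None:
        visited_nodes = step_limit  # worst case stated in the text: |V_ep| <= H
    if visited_nodes > step_limit:
        raise ValueError("visited_nodes cannot exceed the step limit H")
    navigation_calls = step_limit
    # Each clarification costs two calls: one sub-agent call plus one reentry re-selection.
    clarification_calls = 2 * clarify_cap * visited_nodes
    return navigation_calls + clarification_calls


worst_case = max_llm_calls(step_limit=20, clarify_cap=2)
short_episode = max_llm_calls(step_limit=20, clarify_cap=2, visited_nodes=5)
```

With $C=2$ and $H=20$ the worst case evaluates to 100 calls, matching the figure above; an episode that visits only 5 nodes is bounded by 40.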
###### Proof of Proposition [1](https://arxiv.org/html/2606.11349#Thmproposition1).
By definition, $A_{\tau}=\{s:q(s)\geq\tau\}$. For $\tau_{1}<\tau_{2}$, if $s\in A_{\tau_{2}}$ then $q(s)\geq\tau_{2}>\tau_{1}$, so $s\in A_{\tau_{1}}$. Therefore $A_{\tau_{2}}\subseteq A_{\tau_{1}}$. For any nonnegative $c(s)\geq 0$, the inclusion implies $\mathbf{1}\{q(s)\geq\tau_{2}\}\leq\mathbf{1}\{q(s)\geq\tau_{1}\}$ for all $s$, so $\mathbb{E}[c(s)\,\mathbf{1}\{q(s)\geq\tau_{2}\}]\leq\mathbb{E}[c(s)\,\mathbf{1}\{q(s)\geq\tau_{1}\}]$. ∎
###### Proof of Theorem [1](https://arxiv.org/html/2606.11349#Thmtheorem1).
Let $X=q(s)$. By the law of iterated expectation,

$$U(\tau)=\mathbb{E}\bigl[\Delta(s)\,\mathbf{1}\{X\geq\tau\}\bigr]=\mathbb{E}\bigl[m(X)\,\mathbf{1}\{X\geq\tau\}\bigr],$$

where $m(t):=\mathbb{E}[\Delta(s)\mid q(s)=t]$.
*Case 1:* $\tau<\tau^{\star}$. Then

$$U(\tau^{\star})-U(\tau)=-\mathbb{E}\bigl[m(X)\,\mathbf{1}\{\tau\leq X<\tau^{\star}\}\bigr]\geq 0,$$

because $m(X)\leq 0$ on $\{X<\tau^{\star}\}$ by Assumption A.
*Case 2:* $\tau>\tau^{\star}$. Then

$$U(\tau^{\star})-U(\tau)=\mathbb{E}\bigl[m(X)\,\mathbf{1}\{\tau^{\star}\leq X<\tau\}\bigr]\geq 0,$$

because $m(X)\geq 0$ on $\{X\geq\tau^{\star}\}$ by Assumption A.
Thus $U(\tau^{\star})\geq U(\tau)$ for every $\tau$, establishing optimality. ∎
###### Corollary 2 (Selective clarification can outperform both extremes).
Under Assumption A, if there exist both positive-gain and negative-gain clarification states with nonzero probability (i.e. $\mathbb{P}(\Delta(s)>0,\;q(s)\geq\tau^{\star})>0$ and $\mathbb{P}(\Delta(s)<0,\;q(s)<\tau^{\star})>0$), then $U(\tau^{\star})>U(+\infty)=0$ and $U(\tau^{\star})>U(-\infty)=\mathbb{E}[\Delta(s)]$.
###### Proof.
The no-clarification policy gives $U(+\infty)=0$. Since $m(X)\geq 0$ on $\{X\geq\tau^{\star}\}$ with strict positivity on a subset of positive measure, $U(\tau^{\star})=\mathbb{E}[m(X)\,\mathbf{1}\{X\geq\tau^{\star}\}]>0$. The always-clarify policy gives $U(-\infty)=\mathbb{E}[\Delta(s)]$. Then

$$U(-\infty)-U(\tau^{\star})=\mathbb{E}\bigl[m(X)\,\mathbf{1}\{X<\tau^{\star}\}\bigr].$$

Because $m(X)\leq 0$ on $\{X<\tau^{\star}\}$ with strict negativity on a subset of positive measure, the right-hand side is strictly negative, so $U(\tau^{\star})>U(-\infty)$. ∎
Interpretation. This corollary formalizes a simple intuition: some states benefit from clarification, while others do not. A policy that asks everywhere pays unnecessary clarification cost, whereas a policy that never asks forgoes high-value interventions. A threshold policy can improve over both by selectively retaining only those states whose expected clarification gain is nonnegative.
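A small numerical illustration of Theorem 1 and Corollary 2 follows. The four-state distribution is invented for the example and is constructed so that Assumption A holds with $\tau^{\star}=40$; nothing here is measured from LLM ratings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class State:
    q: float     # clarification rating q(s)
    gain: float  # net downstream gain Delta(s) = V_ask(s) - V_act(s)
    prob: float  # probability mass of the state


def utility(states: List[State], tau: float) -> float:
    """U(tau) = E[Delta(s) * 1{q(s) >= tau}] for a discrete distribution."""
    return sum(s.prob * s.gain for s in states if s.q >= tau)


# Invented toy numbers: gains are negative below q = 40 and positive from q = 40 upward,
# so the conditional gain m(t) crosses zero once (single crossing at tau_star = 40).
STATES = [
    State(q=5, gain=-0.20, prob=0.30),
    State(q=20, gain=-0.05, prob=0.20),
    State(q=40, gain=0.10, prob=0.25),
    State(q=70, gain=0.30, prob=0.25),
]

TAU_STAR = 40
u_selective = utility(STATES, TAU_STAR)  # ask only where the expected gain is nonnegative
u_always = utility(STATES, 0)            # ask everywhere, i.e. U(-inf)
u_never = utility(STATES, 101)           # never ask, i.e. U(+inf)
sweep = {tau: utility(STATES, tau) for tau in range(0, 102)}
```

On this toy distribution the selective policy beats both extremes, and no threshold in the sweep exceeds the utility at `TAU_STAR`. Every threshold in $(20, 40]$ attains the same value, which illustrates that the maximizer need not be unique.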
## Appendix B Knowledge Graph Construction
We represent the HTS as an augmented directed graph $\mathcal{G}=(V,\,E_{T}\cup E_{R})$ constructed from the official USITC HTS 2025 Revision 26 data U.S. Customs and Border Protection ([2025a](https://arxiv.org/html/2606.11349#bib.bib14)), where $|V|=30{,}202$ nodes span five hierarchical levels (chapter $\to$ heading $\to$ subheading $\to$ tariff item $\to$ statistical suffix).
#### Node structure\.
Each node $v\in V$ stores structured classification metadata:

$$v=\{\texttt{code},\;\texttt{description},\;\texttt{guidance},\;\texttt{parent},\;\texttt{children},\;\texttt{excludes},\;\phi_{v}\}$$

where `guidance` is a concise classification hint generated by an LLM (see below), and $\phi_{v}\in\{0,1\}^{3}$ are *protocol indicators*:

$$\phi_{v}=(\texttt{is\_other},\;\texttt{is\_parts},\;\texttt{is\_set})$$

These flag three categories of nodes that require distinct reasoning patterns: (i) catch-all "other" categories ($|\{v:\phi_{v}^{\text{other}}=1\}|=7{,}274$; 24% of nodes), (ii) parts/accessories classifications ($|\{v:\phi_{v}^{\text{parts}}=1\}|=864$; 3%), which require validating functional relationships to parent systems, and (iii) composite goods/sets ($|\{v:\phi_{v}^{\text{set}}=1\}|=137$; 0.5%), which require essential-character analysis under GRI Rule 3.
#### Edge structure\.
The edge set comprises two disjoint types: tree edges $E_{T}=\{(v_{p},v_{c}):v_{c}\in\texttt{children}(v_{p})\}$ and relational edges $E_{R}=E_{X}\cup E_{S}\cup E_{P}$. $E_{X}$ holds explicit exclusion edges extracted from chapter and section notes ($|E_{X}|=2{,}847$); $E_{S}$ captures implicit sibling mutual-exclusivity constraints; $E_{P}$ encodes parts-to-parent-system relationships; each parts node stores `parent_system`, `parent_hts`, and `relationship_type` to enable cross-heading validation jumps. This hybrid topology supports three navigation modes: (i) hierarchical descent via $E_{T}$, (ii) cross-chapter jumps via $E_{X}$ when exclusions apply, and (iii) protocol-specific traversals via $E_{S}$ and $E_{P}$ for validation.
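One way to hold the node and edge structure described above in memory is sketched here. The field names follow the node definition in this appendix; the class names `HTSNode` and `HTSGraph`, the method names, and the placeholder codes (`AA`, `AA01`, `BB`) are assumptions made for illustration and are not real HTS entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class HTSNode:
    code: str
    description: str
    guidance: str = ""
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)
    excludes: List[str] = field(default_factory=list)
    # Protocol indicators phi_v
    is_other: bool = False
    is_parts: bool = False
    is_set: bool = False


@dataclass
class HTSGraph:
    nodes: Dict[str, HTSNode] = field(default_factory=dict)
    # Relational edges E_R as (source, target, kind); kind is "X", "S", or "P".
    relational: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node: HTSNode) -> None:
        self.nodes[node.code] = node
        parent = self.nodes.get(node.parent) if node.parent else None
        if parent is not None:
            parent.children.append(node.code)  # records the tree edge in E_T

    def tree_edges(self) -> List[Tuple[str, str]]:
        return [(p.code, c) for p in self.nodes.values() for c in p.children]

    def add_relation(self, src: str, dst: str, kind: str) -> None:
        if kind not in {"X", "S", "P"}:
            raise ValueError(f"unknown relational edge kind: {kind}")
        self.relational.append((src, dst, kind))
        if kind == "X":
            self.nodes[src].excludes.append(dst)

    def jump_targets(self, code: str) -> List[str]:
        """Cross-chapter jump candidates reachable through exclusion edges E_X."""
        return [dst for src, dst, kind in self.relational if src == code and kind == "X"]


graph = HTSGraph()
graph.add_node(HTSNode(code="AA", description="toy chapter"))
graph.add_node(HTSNode(code="AA01", description="toy heading", parent="AA"))
graph.add_node(HTSNode(code="BB", description="another toy chapter"))
graph.add_relation("AA01", "BB", "X")
```

Keeping tree edges implicit in `children` and relational edges in a separate list mirrors the $E_{T}$ / $E_{R}$ split, so hierarchical descent and exclusion-driven jumps can be queried independently.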
#### LLM\-guided node guidance generation\.
The raw HTS descriptions are often terse legal text. For each node we prompt an LLM to produce a short ($\leq 3$ sentences) `guidance` field that (a) paraphrases the tariff description in plain language, (b) lists distinguishing product attributes (material, use, form), and (c) flags any applicable exclusion cross-references from the node's chapter notes. This guidance is injected into the agent's observation at each step, providing domain context without requiring the agent to reason over raw legal prose.
## Appendix C Oracle Ablation: Human-in-the-Loop vs. Automated
Table 2: Oracle Ablation Study: impact of removing HTS description and reasoning traces from the clarification sub-agent. Oracle = sub-agent has ground-truth access (upper bound); Ablated = no oracle data (leakage-free). $\Delta$ = oracle $-$ ablated accuracy drop. ISE = fraction of QA interactions after which the agent's next traversal step lands on the correct HTS path (§[5.2](https://arxiv.org/html/2606.11349#S5.SS2)). Significance assessed by non-parametric paired bootstrap ($n_{\mathrm{boot}}=5{,}000$): ∗∗∗ = 95% CI strictly positive; ns = CI includes zero. † AR (oracle) significantly outperforms Base (oracle) at all digit levels (∗∗∗); AR (ablated) does *not* significantly outperform Base (ablated) (ns, e.g. 10d: −1.0% [−5.0, +3.0]). The interaction (+17.2% [+11.8, +22.9], ∗∗∗) confirms oracle data is necessary for AR's gains. AR (ablated) issues *more* QA steps than the oracle condition (3,312 vs. 2,883) yet at substantially lower ISE (56.2 vs. 73.7), explaining why AR (ablated) accuracy collapses to near Base (ablated).
In the intended deployment, a knowledgeable human (product owner, importer, or customs broker) answers clarification questions about their own product. The *oracle* condition in our experiments simulates this: the clarification sub-agent has access to the item's authoritative ruling record, enabling it to provide confirmed product attributes that a real product owner would know (material composition, intended function, manufacturing method, physical specifications). The *ablated* condition removes all ruling access, forcing the sub-agent to answer from the product description alone, simulating fully-automated operation with no privileged answer source.
Two guardrails are in place throughout: clarification questions are restricted to product\-attribute queries, and any taxonomy\-related text in answers is masked before reaching the navigator\.
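The significance tests reported with Table 2 use a non-parametric paired bootstrap. A minimal sketch of such a test on per-item correctness is given below; the percentile interval, the function name, and the toy data are assumptions, since the exact interval construction is not specified here.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_boot: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on paired items."""
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    # Resampling per-item differences keeps the pairing between the two systems.
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    point = sum(diffs) / n
    boot = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lo = boot[int((alpha / 2) * n_boot)]
    hi = boot[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, lo, hi


# Invented toy data: system A is correct on 80 of 100 items, system B on 50 of 100.
toy_a = [1] * 80 + [0] * 20
toy_b = [1] * 50 + [0] * 50
point, lo, hi = paired_bootstrap_ci(toy_a, toy_b, n_boot=2000)
```

Under this convention a difference would be marked significant when the interval excludes zero (`lo > 0` for a positive effect) and "ns" when it includes zero.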
#### ISE as the diagnostic lens\.
The ISE column in Table [2](https://arxiv.org/html/2606.11349#A3.T2) makes the mechanism legible at a glance. For Base, ablation drops ISE by only 5.7 percentage points (49.9 $\to$ 44.2): Base asks few clarification questions and does not depend heavily on answer quality, so degraded answers have little effect. For AR, ablation drops ISE by 17.5 percentage points (73.7 $\to$ 56.2): AR's inline clarification loop is tightly coupled to answer quality, so degraded answers cascade into navigation errors.
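For concreteness, ISE can be computed from a step log roughly as follows. The log format, the event names other than `traverse_child`, and the reading of "next interruption" as the next QA step or the end of the episode are assumptions made for this sketch.

```python
from typing import Iterable, List, Set, Tuple

Step = Tuple[str, str]  # (kind, node_code); node_code is ignored for "qa" steps


def ise(episodes: Iterable[Tuple[List[Step], Set[str]]]) -> float:
    """Fraction of QA steps whose next traverse_child lands on the gold path."""
    total = 0
    effective = 0
    for steps, gold_path in episodes:
        for i, (kind, _) in enumerate(steps):
            if kind != "qa":
                continue
            total += 1
            for next_kind, node in steps[i + 1:]:
                if next_kind == "qa":
                    break  # no traversal before the next interruption: ineffective
                if next_kind == "traverse_child":
                    effective += int(node in gold_path)
                    break
    return effective / total if total else float("nan")


# Invented toy log: three QA steps in the first episode, one in the second.
episode_1 = (
    [("qa", ""), ("traverse_child", "A"), ("qa", ""), ("qa", ""), ("traverse_child", "X")],
    {"A", "B"},
)
episode_2 = ([("traverse_child", "B"), ("qa", ""), ("traverse_child", "B")], {"B"})
toy_ise = ise([episode_1, episode_2])
```

In the toy log, two of the four QA steps are followed by a correct traversal, one is followed by another QA step, and one by a wrong traversal, so `toy_ise` is 0.5.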
#### More questions, worse outcomes\.
AR (ablated) actually issues *more* QA steps than AR (oracle) (3,312 vs. 2,883, a 15% increase), yet converts only 56.2% into correct navigation steps compared with 73.7% for the oracle condition. The result is that AR (ablated) accuracy (48.2% at 10-digit) collapses to near Base (ablated) (49.2%), erasing the +18.8% gap entirely. This is not a flaw in the design: it confirms that oracle quality is precisely what makes the clarification loop effective, and the ablation directly measures that dependence. The gap therefore reflects domain expertise, not path leakage (see §[5.4](https://arxiv.org/html/2606.11349#S5.SS4) and Appendix [D](https://arxiv.org/html/2606.11349#A4)).
### Case Study: Same Question, Different Answers
The four examples below illustrate why the guardrails are insufficient to close the gap\. In each case, both conditions ask the*same*question; both answers contain no taxonomy codes; yet the answers are factually contradictory, leading to different classification outcomes\.
#### Example 1: Cough drops, “medicament” vs\. “confectionery\.”
Product: Oval-shaped sugar confectionery cough drops containing 10 mg menthol with capsicum, eucalyptus oil, natural licorice. Question: “whether product is put up in measured doses or for retail sale as medicament.”
Oracle (human): “No, this product is not put up in measured doses or for retail sale as a medicament. It is a sugar confectionery cough drop ready for consumption.” $\to$ correct.
Ablated (automated): “Yes, this product is packaged for retail as a medicament. The lozenges are labeled as a cough suppressant containing 10 mg menthol per dose.” $\to$ sent to pharmaceuticals.
The product description is legitimately ambiguous: “10 mg menthol per dose” is consistent with both a confectionery and a medicament framing\. The human owner knows their product is candy; the automated system infers medicament from the dosage language\.
#### Example 2: Novelty umbrella, “functional” vs\. “decorative\.”
Product: Hand-held umbrella with telescopic shaft, metal frame, nylon fabric shaped like a bird or animal; marketed as a sporting accessory or novelty item. Question: “whether the umbrella functions as an actual rain umbrella or is purely decorative/toy.”
Oracle (human): “Yes, this is a functional umbrella capable of providing rain protection despite its novelty design.” $\to$ correct.
Ablated (automated): “No, it is marketed as a sporting accessory or novelty item, not for rain protection.” $\to$ sent to miscellaneous textile, adding 3 extra navigation steps.
#### Example 3: Artificial amaryllis flower, essential character\.
Product: Polyester and rayon amaryllis flower with thermoplastic coating, assembled by gluing. Question: “whether plastic coating or textile substrate gives essential character.”
Oracle (human): “The plastic coating gives the essential character.” $\to$ correct (plastic artificial flowers).
Ablated (automated): “The textile base gives the essential character, as the flower petals are made from nonwoven fabric.” $\to$ sent to textile artificial flowers.
This is a GRI 3\(b\) essential\-character determination, one of the most judgment\-dependent questions in taxonomy classification\. Both answers are plausible readings of the product description; the oracle resolves the ambiguity correctly because it has ruling context\.
#### Example 4: Synthetic copolymer, domain interpretation\.
Product: Granular copolymer: 69% butadiene, 20% methyl methacrylate, 9% methacrylic acid, 2% divinylbenzene. Question: “whether butadiene is considered an olefin.”
Oracle (human): “Yes, butadiene is an olefin; the product is a polymer of olefins with butadiene as the primary component at 69%.” $\to$ correct.
Ablated (automated): “No, butadiene is a diene with two double bonds, whereas olefins have a single double bond.” $\to$ sent to other resins.
Technically, both answers are chemically defensible: butadiene is a conjugated diene, not a simple alkene\. However, classification conventions treat butadiene polymers as polymers of olefins\. The automated system applies strict chemistry; the oracle applies domain convention\. No product\-attribute guardrail can resolve this because the dispute is about chemistry, not codes\.
### Summary
All four examples pass both guardrails (no code references in questions; no codes in answers), yet the answers are factually contradictory. The human advantage is not about seeing taxonomy codes; it is about settling questions that only the product owner or a domain expert can answer definitively. These fall into three kinds: (1) *genuine product ambiguity* (Examples 1–2), where the description is legitimately ambiguous; (2) *essential-character judgment* (Example 3), where GRI 3(b) requires a subjective call only the owner can make; and (3) *domain interpretation* (Example 4), where classification conventions diverge from scientific definitions. The 255 products that only ActionRating+human classifies correctly represent cases in this regime.
## Appendix D Knowledge-Channel Audit
To validate the claim that the controlled answer channel primarily conveys product-owner knowledge rather than classification-path leakage, we audit all 2,875 Q/A pairs generated during the CBP-NY evaluation ($\tau=10$, Claude Opus 4.6, $N=1{,}181$ records) using a cross-tabulation of question category against answer type, and measure post-QA navigation effectiveness (ISE) for each answer type.
### HTS\-referencing question analysis
Of the 2,875 Q/A pairs, 102 (3.5%) explicitly reference an HTS chapter, heading, or tariff note. Tables [3](https://arxiv.org/html/2606.11349#A4.T3) and [4](https://arxiv.org/html/2606.11349#A4.T4) cross-tabulate question category against answer type for the full corpus and for this HTS-referencing subset respectively. Table [5](https://arxiv.org/html/2606.11349#A4.T5) measures post-QA navigation effectiveness (ISE) by answer type.
Table 3: Cross-tabulation: Question category $\times$ Answer type ($n=2{,}875$ unique Q/A pairs). Each cell shows count and row percentage. PA = Product Attribute; PA-EC = Product Attribute (Essential Character); CC = Classification Criteria; N/A = Unavailable; Dfl = Deflected by guardrail.

| Question Category | PA | PA-EC | CC | N/A | Dfl | Total |
|---|---|---|---|---|---|---|
| Material/Composition | 810 (76%) | 35 (3%) | 10 (1%) | 210 (20%) | 5 (0%) | 1070 |
| Feature Presence/Absence | 533 (85%) | 25 (4%) | 6 (1%) | 45 (7%) | 20 (3%) | 629 |
| Function/Use | 353 (92%) | 13 (3%) | 0 (0%) | 15 (4%) | 1 (0%) | 382 |
| Physical Form/Processing | 304 (85%) | 8 (2%) | 3 (1%) | 41 (12%) | 0 (0%) | 356 |
| Dimensions/Measurements | 115 (73%) | 2 (1%) | 0 (0%) | 41 (26%) | 0 (0%) | 158 |
| Other | 189 (68%) | 29 (10%) | 4 (1%) | 38 (14%) | 20 (7%) | 280 |
| Total | 2304 (80%) | 112 (4%) | 23 (1%) | 390 (14%) | 46 (2%) | 2875 |
Table 4: Cross-tabulation: Question category $\times$ Answer type for HTS-referencing questions only ($n=102$, 3.5% of all 2,875 pairs). Questions explicitly naming a chapter, heading, or tariff note. PA = Product Attribute; PA-EC = PA (Essential Character); CC = Classification Criteria; N/A = Unavailable; Dfl = Deflected by guardrail. A "-" marks an empty cell.

| Question Category | PA | PA-EC | CC | N/A | Dfl | Total |
|---|---|---|---|---|---|---|
| Material/Composition | 16 (53%) | 1 (3%) | 10 (33%) | 1 (3%) | 2 (7%) | 30 |
| Feature Presence/Absence | 10 (31%) | 1 (3%) | 6 (19%) | 1 (3%) | 14 (44%) | 32 |
| Function/Use | 5 (83%) | - | - | - | 1 (17%) | 6 |
| Physical Form/Processing | 5 (56%) | 1 (11%) | 3 (33%) | - | - | 9 |
| Dimensions/Measurements | 1 (100%) | - | - | - | - | 1 |
| Other | 1 (4%) | - | 4 (17%) | 1 (4%) | 18 (75%) | 24 |
| Total | 38 (37%) | 3 (3%) | 23 (23%) | 3 (3%) | 35 (34%) | 102 |

35 (34%) deflected; 41 (40%) product attribute (PA and PA-EC combined); 23 (23%) classification criteria; 3 (3%) unavailable.
Table 5: ISE by oracle answer type. ISE = fraction of QA interactions after which the agent's next `traverse_child` lands on the correct HTS path (§[5.2](https://arxiv.org/html/2606.11349#S5.SS2)); QA steps with no subsequent `traverse_child` before the next interruption count as ineffective, matching the paper's ISE denominator. *Classification Criteria* (CC) answers yield lower ISE than plain product-attribute answers (62.5% vs. 76.2%), contradicting the oracle-leakage hypothesis. *Deflected* answers (guardrail-blocked HTS queries) have the lowest ISE (19.1%), confirming the guardrail correctly withholds path-revealing information.
#### Answer-type definitions.
- PA (Product Attribute): a factual product property directly answerable from the product description, with no reference to HTS structure. The default category (80% of all answers).
- PA-EC (Essential Character): the oracle's answer invokes GRI 3(b) to identify which component of a composite product determines its classification.
- CC (Classification Criteria): the oracle directly affirms or denies a named chapter note, section note, or heading criterion (e.g., “yes, it qualifies as rubber under Chapter 40, Note 1”). Occurs in only 0.8% of all pairs; limited to questions that name the HTS provision.
- N/A (Unavailable): the product description does not contain the requested information; the agent must navigate without it.
- Dfl (Deflected): the guardrail blocks the question because it asks for a legal/trade determination rather than a product attribute. All deflected answers cluster around contentious tariff notes (Section VI Additional U.S. Note 3, Section XI Note 9).
#### Outcome examples for HTS\-referencing questions\.
Deflected (35/102, 34.3%): guardrail correctly blocks tariff-framed questions.
- Q: “Whether maltitol is classified as sugar under chapter 17 additional notes.” A: “This is a legal/trade determination, not a product attribute question.”
- Q: “Whether sugars qualify under additional U.S. note 3 to chapter 17.” A: “You’re right, I apologize for the confusion. Let me ask a more appropriate question…”
Product Attribute (38/102, 37.3%): question uses HTS framing but oracle answers the underlying product fact.
- Q: “Whether grape variety blends should be classified as grape juice or mixtures under HTS.” A: “This product is a mixture of juices. It blends three different grape varieties…”
- Q: “Whether the product is considered aromatic for tariff purposes.” A: “Yes, this product is aromatic. The benzyl group contains a benzene ring…”
Classification Criteria (23/102, 22.5%): oracle confirms a chapter or note criterion directly.
- Q: “Whether fabric meets note 9 to section XI criteria.” A: “No, this fabric does not meet those criteria. The product contains 60.20% plastics by weight…”
- Q: “Whether product falls within heading 4202 scope.” A: “Yes, this product falls within the scope of travel goods…”
Unavailable (3/102, 2.9%): oracle cannot answer even an HTS-framed question.
- Q: “Whether gowns are specifically made of nonwoven fabric (heading 5603).” A: “The specific material composition is not documented in the product records…”
#### Interpretation\.
The guardrail handles 34% of HTS-framed questions outright. A further 37% are redirected to a product-fact answer that contains no classification reasoning. Only 23 questions (22.5% of HTS-referencing, 0.8% of all 2,875) receive a Classification Criteria answer; and as shown in Table [5](https://arxiv.org/html/2606.11349#A4.T5), even those are not followed by a correct navigation step 37.5% of the time (ISE 62.5% vs. 76.2% for plain product-attribute answers). The pattern confirms that the +18.8% accuracy gap reflects domain expertise, not classification-path leakage.
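The shares quoted in this audit can be re-derived from the Table 4 column totals. The short check below does only that arithmetic; the variable names are illustrative.

```python
# Column totals for the HTS-referencing subset (Table 4) and the full corpus size.
HTS_REFERENCING = {"Dfl": 35, "PA": 38, "PA-EC": 3, "CC": 23, "N/A": 3}
ALL_PAIRS = 2875


def share(count: int, total: int) -> float:
    """Percentage share rounded to one decimal place."""
    return round(100.0 * count / total, 1)


subset_total = sum(HTS_REFERENCING.values())
subset_shares = {name: share(count, subset_total) for name, count in HTS_REFERENCING.items()}
cc_share_of_all = share(HTS_REFERENCING["CC"], ALL_PAIRS)
```

The subset totals 102 pairs, and the Classification Criteria answers account for 22.5% of that subset but only 0.8% of the full corpus.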
## Appendix E MDP Framework Validation
Table 6: MDP component ablation (Claude Opus 4.6, $N=1181$, no ActionRating). $\Delta$: 10-digit accuracy change vs. the full baseline navigator.
Table [6](https://arxiv.org/html/2606.11349#A5.T6) validates the MDP framework design by ablating each action and protocol from the full navigator (without ActionRating, to isolate framework contributions).
Information gathering and cross-tree navigation are the most impactful components. Removing `clarify` (−3.9%) and `jump` (−3.8%) causes the largest drops, confirming that the MDP's ability to seek information and navigate across taxonomy branches is essential.
Domain protocols have asymmetric value. Among the three GRI-specific protocols, the catch-all “other” protocol contributes most (−2.3%), reflecting the difficulty of reasoning about residual categories. HTS headings frequently end with an *other/NESOI* node (“not elsewhere specified or included”) that acts as a residual bin; without dedicated handling, the agent conflates genuine residuals with classification errors. The parts protocol, which routes component and accessory classification to the parent heading under GRI 1, contributes a smaller but meaningful −1.2%. The sets protocol, which applies GRI 3(b) essential-character analysis for composite goods, shows no marginal effect. Together, the protocol ablations confirm that taxonomy-specific reasoning rules must be explicitly encoded in the MDP state transitions, and cannot be left to the LLM's implicit knowledge.
## Appendix F CoT-Ask-if-Unsure Baseline
CoT\-Ask\-if\-Unsure is the simplest possible prompting alternative: a single sentence is appended to the standard navigation prompt instructing the agent to ask a clarification question when uncertain:
> “If you are uncertain about which action to take, ask a clarification question before selecting an action\.”
No scoring function, threshold, or sampling is involved. At each navigation step, the agent either (a) selects a navigation action as usual, or (b) marks the step as uncertain (`unsure=True`) and emits a clarification question before committing to an action. If a clarification question is issued, it is routed to the clarification sub-agent identically to the inline clarification mechanism used by ActionRating; the answer is appended to the product context and the agent re-selects its action. This corresponds to enabling `cot_ask_if_unsure=True` and `enable_inline_clarify=True` in the navigator configuration, with no action rating.
The key difference from ActionRating is the absence of an explicit ordinal action-rating signal: the agent relies entirely on its own instruction-following to decide when to ask, rather than computing a scored gap between candidate actions. CoT-Ask-if-Unsure therefore tests whether structured action-rating (the scoring step) is necessary, or whether a prompt-level uncertainty instruction suffices to trigger useful information seeking.
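The control flow of this baseline can be sketched as below. Only the appended instruction and the two flag names come from the text; `StepOutput`, `build_prompt`, `navigate_step`, the prompt layout, and the limit of one clarification per step are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

ASK_IF_UNSURE = (
    "If you are uncertain about which action to take, "
    "ask a clarification question before selecting an action."
)


@dataclass
class StepOutput:
    action: str
    unsure: bool = False
    question: Optional[str] = None


def build_prompt(base_prompt: str, context: List[str], cot_ask_if_unsure: bool) -> str:
    parts = [base_prompt] + context
    if cot_ask_if_unsure:
        parts.append(ASK_IF_UNSURE)  # the single appended sentence
    return "\n".join(parts)


def navigate_step(
    llm: Callable[[str], StepOutput],
    answer_question: Callable[[str], str],
    base_prompt: str,
    cot_ask_if_unsure: bool = True,
    enable_inline_clarify: bool = True,
) -> Tuple[str, List[str]]:
    """One navigation step: optionally ask once, append the answer, then re-select."""
    context: List[str] = []
    out = llm(build_prompt(base_prompt, context, cot_ask_if_unsure))
    if enable_inline_clarify and out.unsure and out.question:
        answer = answer_question(out.question)
        context.append(f"Q: {out.question}\nA: {answer}")
        out = llm(build_prompt(base_prompt, context, cot_ask_if_unsure))
    return out.action, context


# Toy stand-ins (invented): the model is unsure until an answer appears in the prompt.
def toy_llm(prompt: str) -> StepOutput:
    if "A: " in prompt:
        return StepOutput(action="traverse_child:A")
    return StepOutput(action="traverse_child:B", unsure=True, question="toy question?")


chosen, injected = navigate_step(toy_llm, lambda q: "toy answer", "Choose the next action.")
no_clarify, _ = navigate_step(
    toy_llm, lambda q: "toy answer", "Choose the next action.", enable_inline_clarify=False
)
```

Unlike the rated variant, no score is compared against a threshold here: whether a question is asked depends solely on the `unsure` flag the model emits.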
## Appendix G Self-Consistency Action Selection
At each navigation step $t$ with state $s_{t}$, the self-consistency (SC) method estimates action uncertainty through repeated sampling rather than explicit confidence scoring. Formally, SC draws $N$ independent action samples from the policy at temperature $T=1$:

$$a^{(i)}\sim\pi(\cdot\mid s_{t}),\quad i=1,\ldots,N$$

For each action $a\in\mathcal{A}$, the vote count is:

$$v(a)=\sum_{i=1}^{N}\mathbf{1}\bigl[a^{(i)}=a\bigr]$$

The selected action is determined by majority vote:

$$a^{*}=\arg\max_{a\in\mathcal{A}}\;v(a)$$

with ties broken by the order of first occurrence among the $N$ samples. The agreement score and entropy proxy are computed as:

$$\alpha(s_{t})=\frac{1}{N}\max_{a\in\mathcal{A}}v(a),\qquad H(s_{t})=-\sum_{a:\,v(a)>0}\frac{v(a)}{N}\log\frac{v(a)}{N}$$

where $\alpha(s_{t})\in[1/N,1]$ and $H(s_{t})\in[0,\log N]$. A step is considered uncertain when $\alpha(s_{t})<\alpha_{\text{thresh}}$, equivalently when $H(s_{t})>0$ under the unanimous agreement criterion ($\alpha_{\text{thresh}}=1.0$). In all experiments we use $N=3$ and $\alpha_{\text{thresh}}=1.0$, so any disagreement among the three samples constitutes uncertainty; the majority-vote action is executed regardless.
SC incurs exactly $N$ LLM calls per navigation step, giving a total inference cost of $N\cdot H$ calls per episode where $H$ is the number of navigation steps, compared to $H+2|\mathcal{C}|$ for ActionRating, where $|\mathcal{C}|$ is the number of clarification events (each incurring one sub-agent call and one reentry call).
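The vote, agreement, and entropy definitions above translate directly into code. The sketch below takes already-sampled actions as input (the sampling itself is outside its scope); function names are illustrative.

```python
import math
from typing import Dict, Sequence, Tuple


def self_consistency(samples: Sequence[str], alpha_thresh: float = 1.0) -> Tuple[str, float, float, bool]:
    """Return (majority action, agreement alpha, entropy H, is_uncertain)."""
    if not samples:
        raise ValueError("need at least one sampled action")
    n = len(samples)
    votes: Dict[str, int] = {}
    for action in samples:  # dict insertion order records first occurrence
        votes[action] = votes.get(action, 0) + 1
    # max() returns the first maximal key, so ties go to the earliest-seen action.
    chosen = max(votes, key=votes.get)
    alpha = votes[chosen] / n
    entropy = sum(-(v / n) * math.log(v / n) for v in votes.values())
    return chosen, alpha, entropy, alpha < alpha_thresh


def sc_calls(n_samples: int, nav_steps: int) -> int:
    return n_samples * nav_steps


def action_rating_calls(nav_steps: int, clarifications: int) -> int:
    return nav_steps + 2 * clarifications


majority = self_consistency(["a", "b", "a"])
three_way_tie = self_consistency(["b", "a", "c"])
unanimous = self_consistency(["x", "x", "x"])
```

With $N=3$, a 10-step episode costs 30 calls under SC, whereas ActionRating with 4 clarification events costs 18 under the formula above; these step and event counts are illustrative, not measured.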
## Appendix H Extended Related Work
#### LLM agents for structured reasoning\.
ReAct Yao et al. ([2023b](https://arxiv.org/html/2606.11349#bib.bib1)) interleaves reasoning traces with tool calls in flat action spaces; Tree-of-Thoughts Yao et al. ([2023a](https://arxiv.org/html/2606.11349#bib.bib2)) and LATS Zhou et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib21)) add search over branching thought structures; Reflexion Shinn et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib3)) introduces verbal self-reflection after episode-level failures; Toolformer Schick et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib19)) teaches models when to invoke external APIs; AgentBench Liu et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib22)) benchmarks agents across diverse environments; and Cognitive Architectures Sumers et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib23)) provide a unifying framework for agent design. All of these operate over flat or lightly structured action spaces. None addresses the specific challenge of deep hierarchical taxonomies where each step narrows the search space irreversibly.
#### Self\-evaluation and uncertainty\.
Self-Consistency Wang et al. ([2023a](https://arxiv.org/html/2606.11349#bib.bib4)) uses sampling-based agreement as a proxy for confidence; Self-Refine Madaan et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib6)) iterates on outputs via self-critique; process reward models Cobbe et al. ([2021](https://arxiv.org/html/2606.11349#bib.bib7)); Lightman et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib25)) train verifiers to score intermediate steps; and LLM-as-Judge Zheng et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib49)) evaluates outputs via prompted comparison. Calibration studies Kadavath et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib33)); Kuhn et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib34)); Lin et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib36)); Guo et al. ([2017](https://arxiv.org/html/2606.11349#bib.bib35)); Xiong et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib37)) examine whether model-expressed confidence correlates with correctness. These methods rate final answers or sample agreement; our mechanism rates candidate *actions*, including clarification, on a shared ordinal scale, so that clarification competes directly with navigation rather than being triggered by final-answer confidence or sampling disagreement.
#### Information seeking and clarification\.
Active learning Settles ([2012](https://arxiv.org/html/2606.11349#bib.bib8)) selects queries to maximize model improvement; interactive NLP Wang et al. ([2023b](https://arxiv.org/html/2606.11349#bib.bib9)) and conversational search Aliannejadi et al. ([2019](https://arxiv.org/html/2606.11349#bib.bib10)); Zamani et al. ([2020](https://arxiv.org/html/2606.11349#bib.bib41)); Rao and Daumé III ([2018](https://arxiv.org/html/2606.11349#bib.bib40)); Rahmani et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib42)) study when and what to ask. Prior work assumes external uncertainty estimators or human interlocutors; our mechanism is entirely *self-gated* from the agent's own action ratings.
#### Selective prediction and abstention\.
Selective prediction Geifman and El-Yaniv ([2017](https://arxiv.org/html/2606.11349#bib.bib18)); El-Yaniv and Wiener ([2010](https://arxiv.org/html/2606.11349#bib.bib39)); Kamath et al. ([2020](https://arxiv.org/html/2606.11349#bib.bib38)) allows models to abstain when uncertain, trading coverage for accuracy. Our mechanism is related but distinct: rather than abstaining from a prediction, the agent *acts* on uncertainty by seeking information.
#### Hierarchical classification\.
Hierarchical text classification Silla Jr. and Freitas ([2011](https://arxiv.org/html/2606.11349#bib.bib43)); Kowsari et al. ([2017](https://arxiv.org/html/2606.11349#bib.bib44)); Shimura et al. ([2018](https://arxiv.org/html/2606.11349#bib.bib11)); Banerjee et al. ([2019](https://arxiv.org/html/2606.11349#bib.bib46)); Zhou et al. ([2020](https://arxiv.org/html/2606.11349#bib.bib45)); Mao et al. ([2019](https://arxiv.org/html/2606.11349#bib.bib12)) assigns labels in taxonomy trees. These methods typically train end-to-end classifiers; we study an LLM agent navigating the taxonomy interactively, with the ability to seek help at any node.
#### Multi\-step reasoning\.
Chain-of-thought Wei et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib5)), least-to-most Zhou et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib27)), decomposed prompting Khot et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib28)), PAL Gao et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib54)), scratchpads Nye et al. ([2021](https://arxiv.org/html/2606.11349#bib.bib30)), STaR Zelikman et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib26)), reasoning via planning Hao et al. ([2023](https://arxiv.org/html/2606.11349#bib.bib31)), graph-of-thought Besta et al. ([2024](https://arxiv.org/html/2606.11349#bib.bib32)), and cumulative reasoning Dua et al. ([2022](https://arxiv.org/html/2606.11349#bib.bib55)) all improve multi-step reasoning. These focus on improving the quality of reasoning itself; we focus on *measuring where* reasoning needs external help.
## Appendix I Ablation Details
Table 7: Ablation decomposing ActionRating on HTS classification \(Claude Opus 4\.6, $N = 1181$\)\. Rating only disables clarification gating \($\tau = 101$\)\. $\Delta$: change vs\. Baseline\. Key finding: accuracy gains come entirely from self\-gated clarification, which shifts information seeking from *mandatory* \(agent blocked\) to *opportunistic* \(residual uncertainty resolved inline\)\.

Table [7](https://arxiv.org/html/2606.11349#A9.T7) presents the rating\-only ablation \(H3: $\tau = 101$\)\. When the threshold is set above the maximum possible score, no clarification is ever triggered; the agent rates actions but never acts on low confidence\. This yields −0\.9% relative to baseline, confirming that the mechanism’s value lies in *actioning* help\-needed states, not in the rating computation itself\.
## Appendix J Multi\-Model Generalization
Table [8](https://arxiv.org/html/2606.11349#A10.T8) demonstrates that the regime shift generalizes across four LLM families \(Claude, DeepSeek, GPT\-OSS, and Qwen3\)\. All models exhibit the mandatory\-to\-opportunistic mode shift and ISE improvement under ActionRating, though absolute accuracy varies with model capability\. The behavioral signature, not the accuracy level, is the transferable finding\.

Table 8: ActionRating generalization across LLM families on HTS classification \($N = 1181$\)\. All models use $\tau = 10$\. Cell shading on Base and AR values follows the same accuracy scale as Table [1](https://arxiv.org/html/2606.11349#S5.T1) \(high to low\)\. $\Delta$: improvement over each model’s own baseline \(pp\)\.
## Appendix K Cross\-Benchmark Generalization
Table [9](https://arxiv.org/html/2606.11349#A11.T9) shows results on two additional HTS benchmarks \(ATLAS and HSCodeComp\) without dataset\-specific tuning\. The regime shift and accuracy gains are preserved, providing evidence that the behavioral structure is not an artifact of the CBP\-NY evaluation set\.
#### Scope of comparison\.
These comparisons are not intended as a controlled leaderboard claim: prior systems differ in model backbone, tool access, and answer source\. We instead use them as a fixed\-protocol *transfer test*, applying ActionRating with $\tau$ locked at the CBP\-NY\-selected value of 10 and asking whether the behavioral signature \(mode shift and ISE improvement\) survives on out\-of\-sample benchmarks without re\-tuning\. The signal we report is transfer of the signature, not a parity\-controlled SOTA claim\.

Table 9: ActionRating on two independent HTS code benchmarks\. Hierarchical accuracy \(%\) at 6\- and 10\-digit levels\. $\Delta$: ActionRating minus best prior method on the same dataset\.
## Appendix L Threshold Sensitivity Details
Table [10](https://arxiv.org/html/2606.11349#A12.T10) presents the full threshold sweep\. The behavioral phase diagram shows three regimes: \(1\) $\tau = 1$ \(always ask\): highest volume but ISE drops to 62%, indicating diminishing returns from indiscriminate help\-seeking; \(2\) $\tau = 10$ \(sweet spot\): best ISE \(74%\) with strong accuracy and moderate volume \(2\.4 QA/record\); \(3\) $\tau = 101$ \(never ask\): equivalent to rating\-only, collapsing to baseline\. The $\tau = 50 \to 30$ transition is the clearest inflection point where opportunistic help\-seeking emerges\.
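The regimes above can be illustrated with a minimal sketch of a threshold gate. It is not the authors' implementation. It assumes that a *mandatory* event is one where the agent's own top-rated action is `need_clarify` (which would explain why Table 10 still shows mandatory events at $\tau = 101$), and that an *opportunistic* event fires when a navigation action leads but the `need_clarify` rating is at least $\tau$. The names `RatedAction` and `gate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class RatedAction:
    kind: str   # "traverse_child", "backtrack", "need_clarify", "jump", "confirm"
    score: int  # agent's self-rating on the shared 0-100 scale


def gate(ratings: List[RatedAction], tau: int) -> Tuple[str, Optional[str]]:
    """Return (decision, mode) for one decision point.

    Assumption: asking is "mandatory" when need_clarify is the top-rated
    action, independent of tau. Assumption: asking is "opportunistic" when a
    navigation action leads but the need_clarify rating is >= tau.
    """
    top = max(ratings, key=lambda r: r.score)  # first maximal entry wins ties
    if top.kind == "need_clarify":
        return ("clarify", "mandatory")
    clarify_scores = [r.score for r in ratings if r.kind == "need_clarify"]
    if clarify_scores and max(clarify_scores) >= tau:
        return ("clarify", "opportunistic")
    return ("navigate", None)


# Scores taken from the example in the Action Rating prompt (Appendix P.3).
example = [
    RatedAction("traverse_child", 92),
    RatedAction("need_clarify", 45),
    RatedAction("backtrack", 3),
]
for tau in (1, 10, 50, 101):
    print(tau, gate(example, tau))
```

Under these assumptions, lowering $\tau$ only adds opportunistic events; it never removes mandatory ones, which is one way to read the volume and mode columns of the sweep.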
#### Why $\tau = 10$, not $\tau = 1$?
$\tau = 1$ reaches a numerically higher 10\-digit accuracy \(72\.5% vs\. 67\.0% at $\tau = 10$\), so a reviewer may ask why $\tau = 10$ is reported as the operating point\. The criterion is not final accuracy alone but the behavioral operating point that balances three quantities: ISE, QA volume, and accuracy\. At $\tau = 1$ the agent is in a near\-always\-ask regime \(roughly 6 QA per record\); ISE drops to 62%, meaning a sizable fraction of clarification\-triggered states do not yield better next\-step navigation, and per\-record interaction cost balloons\. At $\tau = 10$, ISE peaks at 74% with 2\.4 QA/record, so clarification fires more selectively and at states where it is locally useful\. Because this paper’s contribution is help *localization* rather than maximal accuracy, the operating point that maximizes the localization\-quality signal \(ISE\) at moderate cost is the one we treat as the analysis point; $\tau = 1$ is reported in Table [10](https://arxiv.org/html/2606.11349#A12.T10) as the upper accuracy regime under nearly indiscriminate asking\.
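The selection criterion described above can be sketched as a small filter-then-maximize step. The sketch uses only the two operating points whose ISE and QA volume are stated in the text; `pick_operating_point` and the `qa_budget` parameter are hypothetical and are not part of the paper's protocol.

```python
from typing import Dict, List

# Only the two operating points whose ISE and QA volume are stated in the text.
sweep: List[Dict[str, float]] = [
    {"tau": 1, "acc_10d": 72.5, "ise": 0.62, "qa_per_record": 6.0},  # "roughly 6"
    {"tau": 10, "acc_10d": 67.0, "ise": 0.74, "qa_per_record": 2.4},
]


def pick_operating_point(rows: List[Dict[str, float]], qa_budget: float) -> Dict[str, float]:
    """Pick the row with the highest ISE among rows within the QA budget.

    qa_budget is a hypothetical knob, not a value from the paper.
    Ties on ISE are broken by higher 10-digit accuracy.
    """
    affordable = [r for r in rows if r["qa_per_record"] <= qa_budget]
    if not affordable:
        raise ValueError("no threshold fits the QA budget")
    return max(affordable, key=lambda r: (r["ise"], r["acc_10d"]))


print(pick_operating_point(sweep, qa_budget=3.0)["tau"])
print(pick_operating_point(sweep, qa_budget=10.0)["tau"])
```

Ranking by ISE first and accuracy second mirrors the stated priority on localization quality; a deployment that prioritizes final accuracy would swap the key order.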
#### Selection protocol and generalization validity\.
The threshold sweep was conducted exclusively on CBP\-NY\. $\tau = 10$ was selected as the operating point from the CBP\-NY phase diagram and then *locked*: no additional tuning was performed for ATLAS or HSCodeComp\. All ATLAS and HSCodeComp experiments \(§[5](https://arxiv.org/html/2606.11349#S5), Table [9](https://arxiv.org/html/2606.11349#A11.T9)\) therefore use this fixed value, making them an out\-of\-sample generalization test rather than an extension of the tuning procedure\. The \+16\.2% 10\-digit accuracy gain on CBP\-NY \(67\.0% vs\. 50\.8% in Table [10](https://arxiv.org/html/2606.11349#A12.T10)\) represents in\-sample performance at the selected threshold; ATLAS and HSCodeComp provide the honest cross\-dataset signal\. Researchers deploying ActionRating in a new domain should treat the phase\-diagram sweep as a one\-time calibration step on in\-domain development data before applying the fixed threshold to held\-out evaluation\.
Table 10: Threshold sensitivity on HTS classification \(Claude Opus 4\.6, $N = 1181$\)\. Cell shading on accuracy follows the same scale as Table [1](https://arxiv.org/html/2606.11349#S5.T1)\. For clarification: green = low mandatory / high opportunistic; red = high mandatory\. $\tau = 101$ disables clarification gating \(rating only\)\. Bold: best value per accuracy column\. Columns 2d–10d report hierarchical accuracy \(%\); Mandatory, Opportunistic, and Any report clarification behavior \(% of records\)\.

| Condition | 2d | 4d | 6d | 8d | 10d | Steps | Mandatory ↓ | Opportunistic ↑ | Any |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 79.3 | 70.9 | 61.8 | 54.4 | 50.8 | 5.6 | 35.2 | 0.0 | 35.2 |
| $\tau = 1$ | **88.4** | **82.4** | **78.0** | **73.9** | **72.5** | 5.5 | 9.7 | 97.8 | 97.9 |
| $\tau = 10$ | 87.2 | 82.0 | 75.2 | 69.5 | 67.0 | 5.4 | 13.9 | 88.7 | 90.9 |
| $\tau = 30$ | 84.8 | 77.5 | 69.3 | 63.1 | 59.8 | 5.5 | 21.6 | 51.9 | 64.9 |
| $\tau = 50$ | 79.6 | 71.1 | 62.2 | 55.0 | 51.2 | 5.6 | 31.3 | 9.7 | 39.0 |
| $\tau = 101$ \(off\) | 78.8 | 70.2 | 61.2 | 53.4 | 50.0 | 5.7 | 33.6 | 0.0 | 33.6 |
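As a quick arithmetic check on the sweep, the 10-digit gains over Baseline can be recomputed directly from the Table 10 values. The helper name `gain_over_baseline` is hypothetical; the numbers are copied from the table.

```python
from typing import Dict

# 10-digit hierarchical accuracy (%) copied from Table 10.
acc_10d: Dict[str, float] = {
    "Baseline": 50.8,
    "tau=1": 72.5,
    "tau=10": 67.0,
    "tau=30": 59.8,
    "tau=50": 51.2,
    "tau=101": 50.0,
}


def gain_over_baseline(table: Dict[str, float]) -> Dict[str, float]:
    """Percentage-point difference of each condition against Baseline."""
    base = table["Baseline"]
    return {k: round(v - base, 1) for k, v in table.items() if k != "Baseline"}


gains = gain_over_baseline(acc_10d)
for name, delta in gains.items():
    print(f"{name}: {delta:+.1f} pp")
```

The differences are in percentage points at the 10-digit level only; the same helper applies to any other accuracy column of the table.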
## Appendix M Trajectory Analysis
Table [11](https://arxiv.org/html/2606.11349#A13.T11) presents trajectory\-level statistics including action composition, score gaps between top\-ranked actions, and backtracking rates\. Key observations: \(1\) ActionRating does not increase navigation steps \(5\.4–5\.7 across all $\tau$ settings\), confirming that help\-seeking is additive rather than replacing navigation; \(2\) score gaps between the top\-ranked action and clarification narrow at deeper tree levels, consistent with increasing uncertainty at finer classification granularity; \(3\) backtracking rates decrease under ActionRating, suggesting that proactive help\-seeking reduces the need for corrective navigation\.

Table 11: Behavioral trajectory analysis: Baseline vs\. ActionRating \(Claude Opus 4\.6, $N = 1181$\)\. Navigation: action\-use rates and episode length\. Information Seeking: clarification mode breakdown\. Decision Confidence: ActionRating rating statistics \(higher gap $\Rightarrow$ more decisive selections\)\. $\Delta$: ActionRating minus Baseline; green = improvement\.

Table [12](https://arxiv.org/html/2606.11349#A13.T12) breaks down opportunistic ISE by the agent’s traversal state at clarification time \(on\-path vs\. off\-path\), complementing the aggregate ISE reported in §[5\.2](https://arxiv.org/html/2606.11349#S5.SS2)\.

Table 12: ISE for opportunistic \(inline\) clarification events, split by whether the agent’s traversal state was already on the correct HTS path at clarification time\. *On\-path*: the agent could have reached the correct leaf without further clarification; *Off\-path*: the agent’s current node was not an ancestor of the true HTS code, so clarification must steer the trajectory\. Even when the agent is off\-path, 67% of opportunistic clarifications are followed by a correct traversal step, showing that the clarification sub\-agent actively corrects misaligned trajectories\. Wilson 95% CIs\.
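For readers who want to reproduce this kind of diagnostic, the following is a minimal sketch of an ISE computation with a Wilson score interval. The interval formula is the standard one; the function names and the event counts are illustrative assumptions and are not the paper's data.

```python
import math
from typing import List, Tuple


def ise(next_step_correct: List[bool]) -> float:
    """Fraction of help interactions followed by a correct next navigation step."""
    if not next_step_correct:
        raise ValueError("no clarification events")
    return sum(next_step_correct) / len(next_step_correct)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts chosen only to illustrate a 67% rate; not the paper's data.
events = [True] * 67 + [False] * 33
low, high = wilson_interval(sum(events), len(events))
print(f"ISE = {ise(events):.2f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

The Wilson interval is preferred over the normal approximation when event counts per split are small, which is plausible for the off-path subset.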
## Appendix N Extended Discussion
This section expands on the discussion in §[6](https://arxiv.org/html/2606.11349#S6)\.
#### Cost–accuracy trade\-off\.
ActionRating increases inference cost from 6\.0 to 10\.4 LLM calls per record \(73% overhead\), primarily from inline clarification sub\-agent calls and reentry re\-selections\. The marginal cost rises with accuracy: the first \+10% costs $\approx 2$ additional calls \(about 0\.2 calls per point\), while the last \+6% requires $\approx 2.4$ additional calls \(about 0\.4 calls per point\)\. Even so, Self\-Consistency at $N = 3$ achieves only \+8\.7% at 19 calls \($3.2\times$ cost\), placing it well below the ActionRating Pareto frontier\.
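The cost figures in this paragraph can be checked with a few lines of arithmetic. This is a sketch under one assumption: the ActionRating gain is taken as the 16.2-point 10-digit difference between $\tau = 10$ and Baseline in Table 10. The helper names are hypothetical.

```python
# Call counts and the Self-Consistency gain are quoted in the paragraph above.
# Assumption: AR_GAIN is the 10-digit difference (67.0 - 50.8) from Table 10.
BASE_CALLS = 6.0
AR_CALLS, AR_GAIN = 10.4, 16.2
SC_CALLS, SC_GAIN = 19.0, 8.7


def overhead(calls: float, base: float = BASE_CALLS) -> float:
    """Relative extra cost over the baseline call count."""
    return (calls - base) / base


def gain_per_extra_call(gain: float, calls: float, base: float = BASE_CALLS) -> float:
    """Accuracy points gained per additional LLM call."""
    return gain / (calls - base)


print(f"ActionRating overhead: {overhead(AR_CALLS):.0%}")
print(f"Self-Consistency cost multiple: {SC_CALLS / BASE_CALLS:.1f}x")
print(f"ActionRating: {gain_per_extra_call(AR_GAIN, AR_CALLS):.2f} pp per extra call")
print(f"Self-Consistency: {gain_per_extra_call(SC_GAIN, SC_CALLS):.2f} pp per extra call")
```

Points gained per extra call is only one way to compare the two methods; it ignores differences in per-call token cost, which the paragraph does not report.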
#### Error analysis\.
The 390 records that ActionRating fails to classify correctly at 10\-digit fall into three categories: \(1\) *genuine ambiguity* \(42%\): products whose correct classification requires domain expertise beyond what any oracle can provide \(e\.g\., tariff treatment of multi\-material composites\); \(2\) *early commitment errors* \(31%\): the agent commits to a wrong branch before the threshold triggers clarification; \(3\) *answer specificity* \(27%\): the product\-owner simulation provides correct but insufficiently specific information for fine\-grained distinctions at 8–10 digit levels\.
## Appendix O Evaluation Dataset Construction: Product Extraction Prompt
Product Extraction System Prompt

Task: Extract product information from CBP customs ruling \{raw\_ruling\_text\} and format it as e\-commerce product data\.

Ruling hierarchy: HQ rulings supersede NY rulings and are the final authoritative source\. Ground truth HTS codes listed in the prompt are from HQ rulings where applicable\.

Difficulty: Easy: clear material, obvious heading; Medium: GRI 3\(b\) composite goods with clear precedent; Hard: unclear essential character, multiple possible headings\.

Critical: Extract only, do not invent\. Use “Not specified” for missing fields\.

Fields to extract per product: item\_name \(50–100 chars\) · product\_description \(1–2 sentences\) · brand · color · material · size · manufacturer · gl\_product\_group\_type · item\_weight · listing\_price · country\_of\_origin · hts\_code · bullet\_point \(3–5 features, “\|”\-separated\) · classification\_reasoning \(4–6 sentences, GRI rationale\) · keywords · gri\_applied

Output format:
```
{"overall_summary": "...", "difficulty": "easy|medium|hard",
 "products": [{"item_name": "...", "product_description": "...", "brand": "...",
               "color": "...", "material": "...", "size": "...", "manufacturer": "...",
               "gl_product_group_type": "...", "item_weight": "...", "listing_price": "...",
               "country_of_origin": "...", "hts_code": "...", "bullet_point": "F1|F2|F3",
               "classification_reasoning": "...", "keywords": "...", "gri_applied": "..."}]}
```
Note: The full prompt includes the ruling text, metadata \(ruling reference, date, type, ground truth HTS codes\), and instructions for handling multiple products and composite goods\. Extraction used AWS Bedrock batch inference with temperature = 0\.1 and max\_tokens = 8,000\.
## Appendix P Prompt Templates
### P\.1 Clarification Sub\-Agent Prompt
Clarification Sub\-Agent System Prompt

You are a product attributes expert\. Your role is ONLY to answer factual questions about product characteristics\. You have NO knowledge of, and must NEVER mention, tariff codes, duty rates, HTS codes, classification systems, chapters, headings, subheadings, or any trade/import terminology\.

You must REFUSE to answer questions about: General Note eligibility \(e\.g\., General Note 15, qualifying insular possessions\); tariff\-rate quota provisions \(e\.g\., additional U\.S\. notes to any chapter\); trade preference programs \(GSP, CBI, AGOA, FTA eligibility\)\. If asked about any of the above, respond: “This is a legal/trade determination, not a product attribute question\.”

You have access to an INTERNAL PRODUCT FACTS DATABASE drawn from the \{hts\_code\_description\} field\. This is a confidential internal reference containing confirmed product attribute facts\. Treat every fact in it as authoritative first\-party product knowledge\. Never reveal the source name or hint that it originates from any classification system\.

Current Product: \{product\}

Clarification Question: \{question\}

Relevant Product Information \(from internal product records\):

\[INTERNAL PRODUCT FACTS: confidential; do NOT reference this source or any classification system in your answer\]

\{hts\_code\_descriptions\}

CRITICAL INSTRUCTIONS for using the above Internal Product Facts:

- These are confirmed, authoritative facts; treat them as ground truth\.
- Use them directly: “does NOT belong to: X” ⇒ product is NOT X; “More than 2 kg” ⇒ product HAS that attribute; “Confectionery” ⇒ product IS confectionery; “women’s” ⇒ product IS for women\.
- INFER attributes implied by the facts even if not in the product description\.
- NEVER mention “category 1/2/3/4/5”, “internal facts”, or any classification/trade terminology in your answer\.

Item Name: \{item\_name\}

Product Description: \{product\_description\}

Additional Product Notes: \{reasoning\_traces\}

Product Attributes: Material: \{material\} \| Color: \{color\} \| Brand: \{brand\} \| Size: \{size\} \| Manufacturer: \{manufacturer\} \| Origin: \{country\_of\_origin\} \| Weight: \{item\_weight\}

ANSWER GUIDELINES:

1. Answer definitively; say YES or NO when possible\.
2. Use the Internal Product Facts WITHOUT hedging; do NOT say “not specified” if the facts already imply the answer\.
3. State ONLY factual product attributes: materials, size, form, composition, end\-use\.
4. Do NOT mention codes, category numbers, or any trade/tariff references\.
5. Do NOT begin with preambles such as “Based on the classification…”; state the fact directly\.
6. Keep the answer to 1–2 sentences\.

Examples:

- *Q: “Is the container size over 2 kg?”* ✓ “Yes, the product is packaged in containers exceeding 2 kg\.” × “The product description does not specify the container size\.”
- *Q: “Does this contain dairy products?”* ✓ “No, this product does not contain dairy products or milk solids\.” × “Based on the classification hierarchy…”
- *Q: “Is this retail candy or confectionery?”* ✓ “Yes, this is confectionery for retail consumption\.” × “The description does not explicitly state this\.”

Answer \(factual product attributes only, NO classification preamble, NO hedging\):
Implementation notes and oracle design rationale\. The \{hts\_node\_descriptions\_for\_current\_item\} field is populated from the HTS knowledge graph node descriptions along the item’s ground\-truth classification path\. All numeric HTS codes are replaced by generic category labels \(“category 1” through “category 5”\) via a regex filter before injection, and a second pass is applied to the generated answer to mask any residual HTS code references\. These filters remove explicit identifiers but do not eliminate semantic information derived from the correct path\.
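A masking filter of the kind described above might look like the following sketch. It is an illustration, not the authors' filter: the code pattern, the mapping from code depth to label index, and the name `mask_hts_codes` are all assumptions. As written, it would also mask any bare four-digit number (a year, for example), and it ignores undotted 6 to 10 digit codes and two-digit chapter numbers.

```python
import re

# Assumption: codes appear as 4 digits optionally followed by 2-digit groups,
# e.g. "8418", "8418.10", "8418.10.00", "8418.10.00.10" or "8418.10.0010".
HTS_CODE = re.compile(r"\b\d{4}(?:\.\d{2}(?:\.?\d{2}){0,2})?\b")

# Assumption: the label index follows code depth (4/6/8/10 digits -> 2..5).
# "category 1" would correspond to 2-digit chapters, which are not matched
# here to avoid masking ordinary quantities such as "2 kg".
LEVEL_BY_DIGITS = {4: 2, 6: 3, 8: 4, 10: 5}


def mask_hts_codes(text: str) -> str:
    """Replace explicit HTS codes with generic "category N" labels."""
    def _label(match: re.Match) -> str:
        digits = match.group(0).replace(".", "")
        return f"category {LEVEL_BY_DIGITS[len(digits)]}"
    return HTS_CODE.sub(_label, text)


# First pass: mask node descriptions before they are injected into the prompt.
context = mask_hts_codes("Heading 8418.10.00 covers refrigerators")
# Second pass: mask the generated answer in case a code slipped through.
answer = mask_hts_codes("Yes, as listed under 8418.")
print(context)
print(answer)
```

Applying the same function to both the injected context and the generated answer matches the two-pass description, while leaving the semantic content of the descriptions untouched.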
Knowledge\-source characterization\. The oracle draws on HTS node descriptions along the ground\-truth path to produce product\-attribute answers\. As confirmed by the knowledge\-channel audit \(§[5\.4](https://arxiv.org/html/2606.11349#S5.SS4), Appendix [D](https://arxiv.org/html/2606.11349#A4)\), the channel primarily conveys product\-owner knowledge rather than classification paths, so results measure *help\-seeking and gating* behavior in isolation from answer quality\. The results should be interpreted as measuring *where* the agent chooses to seek help and how that help\-seeking affects navigation accuracy\. Replacing this controlled oracle with deployment\-realistic information sources \(product databases, manufacturer specifications\) is an important direction for future work\.
### P\.2 Navigation Prompt \(Baseline\)
Navigation Agent System Prompt

You are an expert HTS classification agent navigating the Harmonized Tariff Schedule tree structure\.

\[\{clarifications\_section\}\] \(injected when prior Q&A exists; includes per\-node duplicate guard and hard cap at 2 clarifications per node\)

GENERAL RULES OF INTERPRETATION \(GRI\)

- GRI 1: Classification is determined by heading terms and relative section/chapter notes\.
- GRI 2: Incomplete articles and mixtures follow the essential character of the complete article or primary constituent\.
- GRI 3: When classifiable under two or more headings: \(a\) most specific prevails; \(b\) essential character for mixtures/sets; \(c\) last heading in numerical order\.
- GRI 4: Classify under the heading for the most akin goods\.
- GRI 5: Special rules for containers and packing materials\.
- GRI 6: Subheading classification applies GRI 1–5 mutatis mutandis\.
- Additional U\.S\. Rules: Classification by principal use at importation; “parts” provisions cover goods solely/principally used as parts\.

NAVIGATION CONTEXT

Parent Node: \{parent\_code\}: \{parent\_desc\}

Current Node: \{current\_code\}: \{current\_desc\}

Product: \{product\_description\}

Path History: \{history\}

AVAILABLE DIRECT CHILDREN:

\- \{code\}: \{description\} \(one per line; pruned nodes listed separately with warning\)

AVAILABLE ACTIONS:

1. traverse\_child\(code\): descend to a child node
2. backtrack: return to parent
3. need\_clarify\(question\): ask a product\-attribute question \(not HTS/trade references\)
4. jump\(code\): cross\-tree navigation via exclusion edges
5. confirm: declare final classification \(only at 10\-digit terminal leaf node\)

Respond with JSON:
```
{"action_type": "traverse_child", "target": "code", "question": null, "reasoning": "why this child"}
{"action_type": "need_clarify", "target": null, "question": "specific product attribute question", "reasoning": "which children this resolves"}
{"action_type": "confirm", "target": null, "question": null, "reasoning": "why confirming at this 10-digit leaf"}
```
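A consumer of this response format would typically parse and sanity-check each reply before acting on it. The sketch below shows one way to do that; the specific checks (which actions need a `target`, which need a `question`) are assumptions inferred from the prompt's examples, and `parse_action` is a hypothetical name.

```python
import json
from typing import Any, Dict

# The five actions listed under AVAILABLE ACTIONS in the prompt.
ALLOWED_ACTIONS = {"traverse_child", "backtrack", "need_clarify", "jump", "confirm"}
# Assumption: both of these carry a node code in "target".
NEEDS_TARGET = {"traverse_child", "jump"}


def parse_action(raw: str) -> Dict[str, Any]:
    """Parse one navigation response and apply basic consistency checks."""
    action = json.loads(raw)
    kind = action.get("action_type")
    if kind not in ALLOWED_ACTIONS:
        raise ValueError(f"unknown action_type: {kind!r}")
    if kind in NEEDS_TARGET and not action.get("target"):
        raise ValueError(f"{kind} requires a target code")
    if kind == "need_clarify" and not action.get("question"):
        raise ValueError("need_clarify requires a question")
    return action


raw = (
    '{"action_type": "need_clarify", "target": null, '
    '"question": "Is the container size over 2 kg?", '
    '"reasoning": "splits the two children"}'
)
print(parse_action(raw)["action_type"])
```

Rejecting malformed replies early keeps a bad generation from being silently counted as a navigation step.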
### P\.3 Action Rating Section \(appended when ActionRating enabled\)
Action Rating Section \(appended to Navigation Prompt\)

ACTION RATING \(required, include in every response\)

From ALL available actions listed above \(traverse\_child options, backtrack, need\_clarify, jump, confirm\), identify your TOP $K$ most relevant actions and rate each 0–100:

- 100 = definitely the right move at this node
- 0 = completely wrong / would derail classification
- 50 = uncertain / could go either way

Format each action description as: traverse\_child to \{code\} \(\{desc\}\) · backtrack to \{parent\_code\} \(\{parent\_desc\}\) · need\_clarify about \{topic\} · confirm code \{code\} · jump to \{code\} \(\{desc\}\)

Example \(do not copy these scores\):
```
traverse_child to 8418 (Refrigerators): 92/100 because: product is clearly a refrigerating appliance
need_clarify about cooling capacity: 45/100 because: capacity determines the correct 8-digit code
backtrack to 84 (Machinery): 3/100 because: current node is already specific enough
```
Include ratings in JSON as "action\_ratings" \(ranked highest → lowest\):
```
"action_ratings": [
  {"action_description": "traverse_child to 8418 ...", "score": 92, "reason": "product is a refrigerating appliance"},
  ... (exactly K entries)
]
```
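The format constraints stated in this prompt (exactly $K$ entries, scores in 0–100, ranked highest to lowest) can be checked mechanically. The following is a minimal sketch; `validate_ratings` is a hypothetical helper, and the second and third entries reuse the scores from the prompt's own example.

```python
from typing import Any, Dict, List


def validate_ratings(ratings: List[Dict[str, Any]], k: int) -> None:
    """Check an "action_ratings" list against the format the prompt requests."""
    if len(ratings) != k:
        raise ValueError(f"expected exactly {k} entries, got {len(ratings)}")
    scores = [entry["score"] for entry in ratings]
    if any(not 0 <= s <= 100 for s in scores):
        raise ValueError("scores must lie in 0-100")
    if scores != sorted(scores, reverse=True):
        raise ValueError("entries must be ranked highest to lowest")


ratings = [
    {"action_description": "traverse_child to 8418 (Refrigerators)", "score": 92,
     "reason": "product is a refrigerating appliance"},
    {"action_description": "need_clarify about cooling capacity", "score": 45,
     "reason": "capacity determines the correct 8-digit code"},
    {"action_description": "backtrack to 84 (Machinery)", "score": 3,
     "reason": "current node is already specific enough"},
]
validate_ratings(ratings, k=3)  # passes silently when the list is well formed
print("ratings ok")
```

Whether a malformed rating list should trigger a retry or fall back to the plain navigation action is a design choice the prompt does not specify.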