Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

arXiv cs.CL Papers

Summary

Introduces Inquisitive Conversational Agents (ICAs) for proactive information extraction in legal dialogue, proposing a Dual Hierarchical Reinforcement Learning framework that learns when and how to ask probing questions, evaluated on U.S. Supreme Court oral arguments.

arXiv:2605.14057v1 Announce Type: new Abstract: Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce \emph{Inquisitive Conversational Agents (ICAs)} and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications.

# Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
Source: [https://arxiv.org/html/2605.14057](https://arxiv.org/html/2605.14057)
Xubo Lin (Georgetown University, xl524@georgetown.edu) · Zezhi Deng (Georgetown University, zd127@georgetown.edu) · Shihao Wang (Georgetown University, sw1379@georgetown.edu) · Grace Hui Yang (Georgetown University, Grace.yang@georgetown.edu) · Yang Deng (Singapore Management University, ydeng@smu.edu.sg)

###### Abstract

Most existing dialogue systems are user-driven, primarily designed to fulfill user requests. However, in many critical real-world scenarios, a conversational agent must proactively extract information to achieve its own objectives rather than merely respond. To address this gap, we introduce *Inquisitive Conversational Agents (ICAs)* and develop an ICA specifically tailored to U.S. Supreme Court oral arguments. We propose a Dual Hierarchical Reinforcement Learning framework featuring two cooperating RL agents, each with its own policy, to coordinate strategic dialogue management and fine-grained utterance generation. By learning when and how to ask probing questions, the agent emulates judicial questioning patterns and systematically uncovers crucial information to fulfill its legal objectives. Evaluations on a U.S. Supreme Court dataset show that our method outperforms various baselines across multiple metrics. It represents an important first step toward broader high-stakes, domain-specific applications. (Code: [Git repository](https://github.com/infosenselab/Dual-Hierarchical-Dialogue-Policy-Learning-for-Legal-Inquisitive-Conversational-Agents))


## 1 Introduction

Conversational AI has long focused on user-driven systems suited to tasks like customer service or digital assistants. These systems excel when the discourse is close-ended and user-driven. They are less suited, however, to settings like Supreme Court oral argument, where a justice does not passively absorb information; instead, justices prod, reframe, and challenge, creating a line of inquiry that tests the attorney's narrative and hunts for latent inconsistencies. The dialogue has a moving target of questions and counters, and it is this information-seeking dynamic that we call inquisitive dialogue.

Much of the literature that characterizes itself as "task-oriented dialogue" in fact captures only one slice of the space: collaborative dialogue, where system and user share a goal. Datasets such as MultiWOZ (Budzianowski et al., [2020](https://arxiv.org/html/2605.14057#bib.bib13)), Schema-Guided Dialogue (Rastogi et al., [2020](https://arxiv.org/html/2605.14057#bib.bib10)), and Taskmaster (Byrne et al., [2019](https://arxiv.org/html/2605.14057#bib.bib11)) canonize that slice by framing the agent as a benevolent assistant whose sole duty is to satisfy explicit user requests. Their well-formulated slot ontologies, crowd-written templates, and short conversational arcs make them ideal for supervised learning but simultaneously ill-suited to settings where the agent, not the interlocutor, steers the agenda. Treating these resources as the entirety of task-oriented dialogue (TOD) therefore overstates their scope and leaves the inquisitive and negotiation spectrum virtually unmapped. For example, in Figure [1](https://arxiv.org/html/2605.14057#S1.F1), the utterance "The Sixth Amendment only protects your money up until the point where there's a judgment?" is a task-oriented question, yet it would not appear in collaborative or negotiation dialogue.

Inquisitive dialogue poses multiple challenges. First, initiative and relevance are context contingent: asking "Which soda do you prefer?" in an interview can be incisive or irrelevant depending on the preceding exchange, a nuance that traditional conversational agents can struggle to capture. Second, the interaction horizon is long. Supreme Court transcripts routinely exceed 5,000 tokens per round, stretching the capacity of mainstream encoder–decoder models that underpin many collaborative agents (Su et al., [2022](https://arxiv.org/html/2605.14057#bib.bib12); Shu et al., [2019](https://arxiv.org/html/2605.14057#bib.bib14)). Additionally, the dialogue participants do not share a common goal, and in many cases may be actively working against each other to reach their own goals. Therefore, any agent participating in inquisitive dialogue must learn long-term dialogue and questioning strategies in a non-cooperative context.

To meet these challenges, we propose a Dual Hierarchical Reinforcement Learning (RL) framework that splits inquisitive reasoning between two tightly coupled agents. An Appraisal Agent evaluates each attorney response in real time and converts those judgments into scalar rewards that shape the next turn; a Hierarchical Dialogue-Policy Agent then autoregressively generates up to three levels of actions conditioned on the Appraisal Agent's output.

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/TOD_overview.png)
Figure 1: While this paper focuses on inquisitive dialogue in the context of U.S. Supreme Court hearings, we rethink and propose a broader categorization of task-oriented dialogue into three types: collaborative, negotiation (Lewis et al., [2017](https://arxiv.org/html/2605.14057#bib.bib15)), and inquisitive. In prior works, non-collaborative types of TOD remain underexplored.
## 2 Related Work

#### Proactive Conversational Agents\.

The development of conversational agents (CAs) has been largely driven by breakthroughs in natural language processing and machine learning. Key approaches include *sequence-to-sequence (Seq2Seq) modeling* (Sutskever et al., [2014](https://arxiv.org/html/2605.14057#bib.bib16)), *pretrained language models (PLMs)* (Radford et al., [2019](https://arxiv.org/html/2605.14057#bib.bib47); Liu et al., [2024](https://arxiv.org/html/2605.14057#bib.bib48)), *retrieval-augmented generation (RAG)* (Gao et al., [2024](https://arxiv.org/html/2605.14057#bib.bib49); Izacard and Grave, [2021](https://arxiv.org/html/2605.14057#bib.bib50)), and *reinforcement learning (RL)* approaches (Schulman et al., [2017](https://arxiv.org/html/2605.14057#bib.bib51)). Among them, RL provides an optimization paradigm for dialogue strategies, particularly in *task-oriented* settings (Budzianowski et al., [2020](https://arxiv.org/html/2605.14057#bib.bib13)), where reward-based learning aligns agent behavior with desired outcomes. For instance, Li et al. ([2016b](https://arxiv.org/html/2605.14057#bib.bib52)) introduced deep RL to incorporate dialogue-level rewards, while Zhao and Eskenazi ([2016](https://arxiv.org/html/2605.14057#bib.bib53)) proposed an end-to-end system that learns both dialogue state tracking and strategy.

While conventional CAs typically respond to user-initiated requests, a growing line of research focuses on *proactive conversational agents* (Liao et al., [2023](https://arxiv.org/html/2605.14057#bib.bib54)), which actively *initiate topics* (Tang et al., [2019](https://arxiv.org/html/2605.14057#bib.bib55)), provide *context-aware recommendations* (Zhou et al., [2020](https://arxiv.org/html/2605.14057#bib.bib56)), and *guide* users rather than simply reacting (Deng et al., [2023](https://arxiv.org/html/2605.14057#bib.bib57)). Proactive agents often leverage *reinforcement learning* (Deng et al., [2024](https://arxiv.org/html/2605.14057#bib.bib58)), *strategic planning* (Zhang et al., [2024](https://arxiv.org/html/2605.14057#bib.bib59)), or *question generation* (Guo et al., [2024](https://arxiv.org/html/2605.14057#bib.bib60)) to address the limitations of purely reactive systems, enabling richer support for tasks such as exploratory search and decision-making. ICAs take this concept further by focusing on *steering* the conversation and *gathering insights* from the user to achieve the system's own objectives. They go beyond offering guidance or recommendations and actively *probe* for information, making them especially suited to domains like legal or investigative dialogues where deeper fact-finding is critical.

#### Legal Conversational Agents\.

While much of the research on conversational agents has focused on open-domain or task-oriented contexts, a growing body of work explores their application in the legal domain. For instance, Sharma et al. ([2021](https://arxiv.org/html/2605.14057#bib.bib62)) build a retrieval-based legal chatbot to address frequently asked legal questions. Although these systems provide valuable assistance, they predominantly adopt a reactive, FAQ-style approach, leaving room for more proactive or inquisitive dialogue models, an area our work aims to advance.

## 3 Problem Formulation

### 3.1 Inquisitive Conversations

In this paper, we address the problem of *inquisitive conversation*, where a conversational agent actively probes for critical information to achieve its own objectives, rather than merely responding to user queries. Specifically, we frame this challenge in the context of Supreme Court judicial dialogue.

Inquisitive conversations exhibit several key differences from everyday casual conversations.

Conversational Control: In typical conversations, the user initiates queries and drives the topic. In judicial dialogues, the justice initiates each round of questioning and controls the direction of the discussion.

Purpose: Casual dialogues often serve social or informative purposes, whereas in judicial questioning each question aims to clarify legal uncertainty, probe for consistency, or expose logical flaws.

Strategy: Justice questioning is deliberate and strategic, employing techniques such as testing hypotheticals, challenging premises, and verifying doctrinal consistency.

To model these differences in inquisitive behavior, we propose the Inquisitive Conversational Agent (ICA), which mimics these questioning patterns using a dual-agent hierarchical reinforcement learning framework.

### 3.2 Dialogue Formulation

We model the justice–attorney interaction as a Markov Decision Process (MDP), defined by the tuple $M=(S,A,R,\gamma)$, where $S$ is the dialogue state space, $A$ the action space, $R$ the reward function, and $\gamma$ the discount factor. Each dialogue round $t$ begins with a justice utterance $u_j^t$, followed by an attorney response $u_a^t$, forming an interaction pair $(u_j^t, u_a^t)$. The state $s^t \in S$ encodes the dialogue context up to round $t$.

In our formulation, the justice utterance $u_j^t$ is treated as the action $a^t$, which transitions the environment to a new state $s^{t+1}$ after observing $u_a^{t+1}$ and yields a scalar reward $r^t = R(s^t, a^t)$.

Appraisal Signal: In inquisitive dialogue, agents operate with their own information-seeking goals. Rather than waiting for user input to guide the exchange, they actively evaluate each response to determine whether it advances their investigative objective. To reflect this feature of inquisitive dialogue, we introduce an appraisal signal $p^t$ at each turn. It encodes the justice's judgment of the attorney's prior response (e.g., evasive, incomplete, satisfactory) under dialogue state $s^t$. In our dataset, the appraisal of the justice in each turn $t$ can be inferred from an utterance tuple spanning two rounds:

$$p^{t} = f\bigl(u_j^{t-1},\, u_a^{t},\, u_j^{t}\bigr). \tag{1}$$
For instance, if the justice issues a near-identical utterance across two consecutive turns, it often indicates dissatisfaction with the attorney's prior response. Accordingly, we augment the standard transition tuple to $\mathcal{D} \sim (s^t, p^t, a^t, r^t, s^{t+1})$. The Appraisal Agent treats $p^t$ as the action selected in state $s^t$, while the Dialogue Agent operates on an augmented state representation $s_{\text{aug}}^t = \mathrm{concat}(s^t, p^t)$. This design enables the Dialogue Agent to condition its next action on its internal assessment of that history as well.
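To make the augmented transition concrete, the following is a minimal sketch (field names are illustrative, not taken from the paper) of how a single turn could be packaged for offline training:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    """One augmented transition D ~ (s^t, p^t, a^t, r^t, s^{t+1})."""
    state: Any        # s^t: embedding of the dialogue context up to round t
    appraisal: int    # p^t: discrete appraisal of the attorney's prior response
    action: Any       # a^t: the justice utterance (or its dialogue-act path)
    reward: float     # r^t = R(s^t, a^t), computed from the attorney's next reply
    next_state: Any   # s^{t+1}: context after observing u_a^{t+1}
```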

While many dialogue systems treat utterance generation as an open-ended natural language generation (NLG) task with a vast action space (Zhao and Eskenazi, [2016](https://arxiv.org/html/2605.14057#bib.bib53); Sharma et al., [2017](https://arxiv.org/html/2605.14057#bib.bib78); Wang et al., [2022](https://arxiv.org/html/2605.14057#bib.bib79)), domain-specific agents can often reduce complexity by operating on a finite set of *dialogue acts* (Peng et al., [2018](https://arxiv.org/html/2605.14057#bib.bib80); Su et al., [2018](https://arxiv.org/html/2605.14057#bib.bib81)). In the Supreme Court domain, for instance, justices frequently perform recurrent yet distinct high-level actions, such as asking questions, making hypotheses, or making declarations (Cichowicz, [2019](https://arxiv.org/html/2605.14057#bib.bib87)), which lend themselves to a more structured formulation. After choosing a high-level intent, such as questioning, hypothesizing, or declaring, a justice may refine it into a more specific subtype, such as a probing or clarifying question, before producing the actual utterance.

Motivated by this, we adopt a hierarchical action space that separates policy decisions (i.e., *which dialogue act to take next*) from lower-level surface text realization (i.e., *how to verbalize that act*). Our approach discretizes justices' interactions into a three-level taxonomy (see Table [3](https://arxiv.org/html/2605.14057#A1.T3) in the appendix) that captures both top-level acts (e.g., a "question") and their subtypes (e.g., probing for clarity vs. challenging an argument).

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/Ours_state3.png)
Figure 2: System architecture of the proposed Dual Hierarchical Inquisitive Conversational Agent.
### 3.3 Reward Definition

Unlike conventional dialogue rewards that primarily assess the agent's own utterance, our inquisitive setting focuses on how effectively the *justice's utterance* elicits information from the *attorney's subsequent response*. In this work, each justice utterance $u_j^t$ receives a reward comprising the following components.

(1) Solicitation of Goal-Relevant Information. One objective of the agent in an inquisitive dialogue is to gather useful and relevant information aligned with its goal. We therefore introduce a goal-relevance reward to incentivize probing related to the agent's goal. To capture how effectively the justice's utterance $u_j^t$ compels an attorney's response $u_a^{t+1}$ to include legally significant information, we measure the response's relevance to the case's conclusion $C$. Using Llama-3-8B ([35](https://arxiv.org/html/2605.14057#bib.bib95)) as a semantic similarity evaluator, we compute the maximum similarity between the attorney's response $u_a^{t+1}$ and each sub-conclusion $C[i]$, with scores bounded by 5. Formally,

$$R_{\mathrm{rel}}^{t+1}(s^t, u_j^t) = \max_i \,\mathrm{sim}\bigl(C[i],\, u_a^{t+1}\bigr), \tag{2}$$

where $C[i]$ denotes an individual sub-conclusion of the case's conclusion. This reward encourages inquisitive justice utterances that steer the dialogue toward legally relevant insights.

(2) Solicitation of Novel Information. A key goal of an ICA is not only to ask questions but to drive the conversation toward uncovering information that has not yet surfaced. To capture this behavior, we introduce a *novelty reward* that measures how effectively the justice's utterance $u_j^t$ prompts the attorney's next response $u_a^{t+1}$ to contribute new and informative content beyond what has already been discussed. This reward encourages the agent to formulate more strategic and context-aware inquiries that elicit additional legal details or perspectives.

Formally, we compute this reward using the *Expectation-Adjusted Distinct (EAD)* metric (Liu et al., [2022](https://arxiv.org/html/2605.14057#bib.bib85)), a length-normalized variant of *Distinct-N* (Li et al., [2016a](https://arxiv.org/html/2605.14057#bib.bib86)) that evaluates lexical novelty while accounting for utterance length:

$$R_{\mathrm{nov}}^{t+1}(s^t, u_j^t) = \frac{N_{\mathrm{attorney}}^{t+1}}{V\left(1-\left(\tfrac{V-1}{V}\right)^{|u_a^{t+1}|}\right)}, \tag{3}$$

where $N_{\mathrm{attorney}}^{t+1}$ is the number of newly introduced tokens in $u_a^{t+1}$ that have not appeared in prior turns (in the original EAD (Liu et al., [2022](https://arxiv.org/html/2605.14057#bib.bib85)), $N$ counts distinct tokens; we adapt it to track newly introduced tokens relative to the dialogue history), $V$ is the cumulative vocabulary size up to time $t$, and $|\cdot|$ denotes the token count of the utterance.

(3) Solicitation of Succinct Answers. In Supreme Court dialogues, justices often prefer brief, direct answers (e.g., "yes," "no") from the attorney (Cichowicz, [2019](https://arxiv.org/html/2605.14057#bib.bib87)), as such answers can swiftly confirm or deny a point and thus aid the justice's decision-making. Succinct answers from the attorney also help the justice keep control of the dialogue, and conversational control is an important consideration in building an ICA. We reward this succinctness, treating it as evidence that the justice's utterance $u_j^t$ was well targeted:

$$R_{\mathrm{clarity}}^{t+1}(s^t, u_j^t) = -\log\bigl(\lvert u_a^{t+1}\rvert\bigr), \tag{4}$$

where $\lvert u_a^{t+1}\rvert$ is the token length of the attorney's response. This measure complements the previous two components by explicitly encouraging *clarity* in judicial exchanges.

During training, we combine the three reward components into an aggregated numerical reward via a weighted sum, which allows the agent to balance legal relevance, novelty, and clarity in its inquisitive dialogue\.
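As a concrete illustration, the sketch below implements the three reward terms and their weighted combination. The `sim` scorer (the paper uses Llama-3-8B as the evaluator), the tokenization, and the exact treatment of the vocabulary $V$ as the pre-turn vocabulary are assumptions on our part; the weights follow those reported in Appendix A.

```python
import math

def relevance_reward(sub_conclusions, attorney_response, sim, cap=5.0):
    """Eq. (2): maximum similarity between the attorney's reply and any
    sub-conclusion C[i]; the paper bounds the score by 5."""
    return min(cap, max(sim(c, attorney_response) for c in sub_conclusions))

def novelty_reward(attorney_tokens, history_vocab):
    """Eq. (3): Expectation-Adjusted Distinct, with N counting tokens newly
    introduced relative to the dialogue history (here: distinct new tokens)."""
    n_new = sum(1 for tok in set(attorney_tokens) if tok not in history_vocab)
    v = max(1, len(history_vocab))                 # cumulative vocabulary size up to t
    length = len(attorney_tokens)                  # |u_a^{t+1}|
    expected_distinct = v * (1.0 - ((v - 1) / v) ** length)
    return n_new / max(expected_distinct, 1e-8)

def succinctness_reward(attorney_tokens):
    """Eq. (4): negative log of the attorney response length."""
    return -math.log(len(attorney_tokens))

def turn_reward(rel, nov, suc, weights=(0.2, 0.7, 0.1)):
    """Weighted sum of the three components used during training."""
    w_rel, w_nov, w_suc = weights
    return w_rel * rel + w_nov * nov + w_suc * suc
```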

## 4 Proposed Method: A Dual-Agent Framework for Legal Inquiry

Building an ICA, which actively uncovers information rather than merely answering queries, poses distinct challenges, especially in complex domains like Supreme Court hearings. To tackle this, we propose a Dual-Agent Hierarchical RL framework, depicted in Figure [2](https://arxiv.org/html/2605.14057#S3.F2), designed to emulate the judicial exchange process. Our approach comprises two coordinated agents, each focusing on a different aspect of the conversation.

Rather than viewing dialogue as a single flat policy, we employ a three-level hierarchical RL dialogue agent that determines *when* to probe further, *how* to frame questions, and *whether* the discussion should shift topics. By decomposing each turn into layers, ranging from broad subtopic planning to fine-grained utterance generation, the Dialogue Agent can optimize information elicitation while maintaining coherence and legal formality.

### 4.1 Appraisal Agent

We introduce an Appraisal Agent to *evaluate* each attorney response. If the response appears evasive, contradictory, or insufficiently detailed, the Appraisal Agent flags the need for deeper inquiry. This mechanism mimics a justice's tendency to monitor counsel's answers on the fly, ensuring that the Dialogue Agent adapts its questioning in real time rather than blindly following a predefined script.

Why two agents? Separating response appraisal and dialogue control into two specialized agents enables more modular and interpretable decision-making. The Dialogue Agent focuses exclusively on planning and generating inquisitive moves, while the Appraisal Agent independently assesses whether the information obtained justifies continued exploration.

Similar to dialogue acts, the appraisals can be discretized for a specific domain as well. We summarized nine appraisal types from Supreme Court transcripts (see Table [4](https://arxiv.org/html/2605.14057#A2.T4) in the appendix). These appraisals allow the justice to evaluate attorney responses, identifying flaws, seeking clarification, or prompting further inquiry, and help ensure the dialogue remains focused, responsive, and inquisitive.

In our proposed method, the Appraisal Agent employs a Q-network to choose the appraisal $p$ that maximizes its Q-value estimate:

$$p(s) = \arg\max_{p}\, Q_{\text{Appraisal}}(s, p; \theta), \tag{5}$$

where $s$ is the current state embedding and $\theta$ denotes the Q-network parameters. The selected appraisal $p$ is then represented as a one-hot vector and merged into the Dialogue Agent's augmented state, guiding subsequent decisions to probe further or shift to the next subtopic as needed.

In our Dialogue Agent, we augment the overall dialogue state $s^t$ with $p^t$ to yield $s_{\text{aug}}^t = \mathrm{concat}(s^t, p^t)$. By treating the Appraisal Agent's output as an internal *state variable* rather than a separate action, the ICA can better track whether deeper probing is needed or whether the conversation should transition to a new subtopic.
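A minimal PyTorch sketch of this step, assuming a small Q-network over the nine discrete appraisal types (the network width and depth are illustrative choices of ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AppraisalQNet(nn.Module):
    """Q-network scoring each discrete appraisal type for a state embedding."""
    def __init__(self, state_dim: int, num_appraisals: int = 9, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, num_appraisals),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)                     # (batch, num_appraisals)

def augmented_state(qnet: AppraisalQNet, state: torch.Tensor) -> torch.Tensor:
    """Eq. (5): pick p(s) = argmax_p Q_Appraisal(s, p; theta), encode it as a
    one-hot vector, and concatenate it with the state to form s_aug."""
    with torch.no_grad():
        p = qnet(state).argmax(dim=-1)             # greedy appraisal index
    one_hot = F.one_hot(p, num_classes=qnet.net[-1].out_features).float()
    return torch.cat([state, one_hot], dim=-1)
```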

### 4.2 Dialogue Agent

To emulate Supreme Court justices, our Hierarchical Dialogue Agent first decides *which* conversational act to perform (e.g., clarify, probe, or challenge), then determines *how* to realize that act. We formalize these choices in a three-level action taxonomy (Table [3](https://arxiv.org/html/2605.14057#A1.T3)). Level 1 defines high-level dialogue acts, such as *Questioning*, *Hypothesis Testing*, or *Declaration*. Level 2 refines each act into subcategories (e.g., *Clarification*, *Probing*, *Comparison*), while Level 3 specifies the final utterance.

Poincaré Embedding. To capture the hierarchical structure of judicial dialogue acts, we represent each action in a Poincaré embedding space (Nickel and Kiela, [2017](https://arxiv.org/html/2605.14057#bib.bib99)). Poincaré embeddings are defined in a hyperbolic geometry that naturally preserves hierarchical and tree-like relationships, where parent nodes lie closer to the origin and child nodes are positioned exponentially farther away. By embedding our three-level taxonomy in this hyperbolic space, the Dialogue Agent can learn smoother transitions across levels, leverage proximity between related actions (e.g., sibling subacts), and better generalize across hierarchically related behaviors. The training objective is:

$$\mathcal{L} = \sum_{(u,v)\in D} \log \frac{e^{-d(u,v)}}{\sum_{v'\in\mathcal{N}(u)} e^{-d(u,v')}}. \tag{6}$$
Here $d(u,v)$ denotes the hyperbolic distance between the embeddings of nodes $u$ and $v$; $D$ is the set of observed positive pairs (e.g., parent–child or sibling relations) derived from the dialogue act hierarchy; and $\mathcal{N}(u)$ is a set of negatively sampled nodes unrelated to $u$.
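A sketch of the distance and the per-pair loss in PyTorch. Whether the positive node itself is included in the normalizer, and the Riemannian projection step needed to keep embeddings inside the unit ball, are conventions we assume from Nickel and Kiela rather than details stated in the paper:

```python
import torch

def poincare_distance(u: torch.Tensor, v: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Hyperbolic distance in the Poincare ball:
    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    sq_u = torch.clamp(torch.sum(u * u, dim=-1), 0.0, 1.0 - eps)
    sq_v = torch.clamp(torch.sum(v * v, dim=-1), 0.0, 1.0 - eps)
    sq_diff = torch.sum((u - v) ** 2, dim=-1)
    return torch.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))

def poincare_pair_loss(u: torch.Tensor, v_pos: torch.Tensor, v_neg: torch.Tensor) -> torch.Tensor:
    """Per-pair term of Eq. (6): a softmax over the positive node and the
    negatively sampled nodes N(u), scored by negative hyperbolic distance."""
    d_pos = poincare_distance(u, v_pos).reshape(1)        # distance to the positive node
    d_neg = poincare_distance(u.unsqueeze(0), v_neg)      # (num_neg,) distances to negatives
    logits = -torch.cat([d_pos, d_neg])                   # closer = higher score
    return -torch.log_softmax(logits, dim=0)[0]           # -log p(v | u)
```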

Multi-Hierarchy Action Selection. The three-level hierarchical action taxonomy (Table [3](https://arxiv.org/html/2605.14057#A1.T3)) allows our Dialogue Agent to operate at varying degrees of granularity. A single full-level action $\{a_0, a_1, a_2\}$ may yield up to three transition tuples: $(s, a_0, r, s')$, $(s, a_1, r, s')$, and $(s, a_2, r, s')$. The agent may terminate at any level if the chosen sub-action has no additional children.

The categories are chosen sequentially: the Level 1 action is selected based on the augmented state, a Level 2 action is then selected from the subcategories of the chosen Level 1 action, and the Level 3 action is selected based on the Level 2 action in the same way (e.g., choose "question" as $a_0$, "probing question" as $a_1$, and "probe the assumption" as $a_2$). The three selection steps correspond to the three transition tuples above. We use the selected actions to prompt an LLM ([35](https://arxiv.org/html/2605.14057#bib.bib95)) under a unified template (Table [6](https://arxiv.org/html/2605.14057#A3.T6)) to generate the justice's response.
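The sketch below shows this greedy top-down selection. The taxonomy entries shown are only the examples mentioned above, and `q_value` stands in for the Dialogue Agent's Q-network evaluated on the augmented state and a candidate action:

```python
# Partial view of the three-level taxonomy (Table 3); leaves have no children.
TAXONOMY = {
    "ROOT": ["Questioning", "Hypothesis Testing", "Declaration"],
    "Questioning": ["Probing question", "Clarifying question"],
    "Probing question": ["Probe the assumption"],
    # ... remaining branches omitted
}

def select_action_path(q_value, s_aug, taxonomy=TAXONOMY):
    """Choose a_0, a_1, a_2 sequentially: at each level, pick the child of the
    current action with the highest Q(s_aug, child); stop at a leaf."""
    path, current = [], "ROOT"
    while taxonomy.get(current):
        current = max(taxonomy[current], key=lambda act: q_value(s_aug, act))
        path.append(current)
    return path   # e.g. ["Questioning", "Probing question", "Probe the assumption"]
```

The returned path is then inserted into the unified prompt template to produce the justice's surface utterance.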

### 4.3 Algorithm

For both the Appraisal Agent and the Dialogue Agent, we use Double DQN (DDQN) as the backbone. The DDQN target for the Appraisal Agent is:

$$Y_{\text{App}} = r + \gamma\, Q\bigl(s, \arg\max_{p'} Q(s', p'; \theta_{App});\, \theta_{App}^{-}\bigr), \tag{7}$$

and the corresponding DDQN loss is:

$$\mathcal{L}_{\text{App}}^{\text{DDQN}} = \mathbb{E}_{(s,p,s')\sim\mathcal{D}} \bigl(Q(s,p;\theta_{App}) - Y_{\text{App}}\bigr)^{2}, \tag{8}$$

where $\theta_{App}$ and $\theta_{App}^{-}$ denote the weights of the main network and the target network of the Appraisal Agent.
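A compact sketch of this update, written as the standard Double-DQN rule in which the online network selects the greedy appraisal at the next state and the target network evaluates it; the same rule is reused for the Dialogue Agent's target (Eq. 9):

```python
import torch
import torch.nn.functional as F

def ddqn_loss(q_main, q_target, state, appraisal, reward, next_state, gamma=0.9):
    """Eqs. (7)-(8): Y = r + gamma * Q(., argmax_p' Q(s', p'; theta); theta^-),
    regressed onto Q(s, p; theta) with a squared error."""
    with torch.no_grad():
        greedy_p = q_main(next_state).argmax(dim=-1, keepdim=True)        # online net picks p'
        target_q = q_target(next_state).gather(-1, greedy_p).squeeze(-1)  # target net evaluates it
        y = reward + gamma * target_q
    q_sp = q_main(state).gather(-1, appraisal.unsqueeze(-1)).squeeze(-1)  # Q(s, p; theta)
    return F.mse_loss(q_sp, y)
```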

For the Dialogue Agent, we train one Q-network that sequentially generates Q-values for all possible next-level actions, conditioned on the augmented state and the parent action. Its DDQN target is:

$$Y_{\text{Dia}} = r + \gamma\, Q\bigl(s, \arg\max_{a'} Q(s', a'; \theta_{Dia});\, \theta_{Dia}^{-}\bigr), \tag{9}$$
where $\theta_{Dia}$ and $\theta_{Dia}^{-}$ denote the weights of the main network and the target network.

We assume that the definition of dialogue actions in the dataset $\mathcal{D}$ is complete. For any single full-level action $\{a_0, a_1, a_2\}$, we have:

$$Q(s, a_0) = \max_{a_1} Q(s, a_1), \qquad Q(s, a_1) = \max_{a_2} Q(s, a_2), \tag{10}$$

where $a_1$ ranges over all child actions of $a_0$ and $a_2$ over all child actions of $a_1$. That is, the Q-value of a parent action can be represented by the Q-value of its best child action. The corresponding loss is

$$\mathcal{L}_{\text{Dia}}^{\text{hier}} = \bigl(Q(s,a_0) - \max_{a_1} Q(s,a_1)\bigr)^{2} + \bigl(Q(s,a_1) - \max_{a_2} Q(s,a_2)\bigr)^{2}. \tag{11}$$
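A small sketch of this consistency loss, assuming `q_value(s_aug, a)` returns a scalar tensor and `children_of` maps a parent action to its children in the taxonomy; anchoring the second term on the $a_1$ of the observed action path is our reading of Eq. (11):

```python
import torch

def hierarchical_consistency_loss(q_value, s_aug, a0, a1, children_of):
    """Eqs. (10)-(11): a parent's Q-value should match the max Q-value of its children."""
    max_q_level1 = torch.stack([q_value(s_aug, c) for c in children_of[a0]]).max()
    max_q_level2 = torch.stack([q_value(s_aug, c) for c in children_of[a1]]).max()
    return (q_value(s_aug, a0) - max_q_level1) ** 2 \
         + (q_value(s_aug, a1) - max_q_level2) ** 2
```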
A well-documented challenge in offline reinforcement learning is the overestimation of Q-values for state–action pairs that are insufficiently represented in the dataset (Fujimoto et al., [2019](https://arxiv.org/html/2605.14057#bib.bib93); Kumar et al., [2020](https://arxiv.org/html/2605.14057#bib.bib94)). To mitigate this issue in our setting, we introduce a simple yet effective conservative regularization strategy. For each state $s$, we define $R_1(s) = \max_{a\in\mathcal{A}} Q(s,a)$, the maximum Q-value across all possible actions, which is the value most likely to be overestimated. We penalize it by adding $R_1(s)$ to the optimization objectives. However, when these high-Q actions are well represented in the dataset, applying the penalty uniformly can lead to underestimation. To address this, we introduce a compensatory term $R_2(s) = Q(s,a)$ with $(s,a)\in\mathcal{D}$, to restore value estimates for observed transitions.

The resulting regularization terms for the Appraisal and Dialogue Agents are defined as:

$$\mathcal{L}_{\text{App}}^{\text{Reg}} = R_1(s) - R_2(s), \qquad \mathcal{L}_{\text{Dia}}^{\text{Reg}} = R_1(s_{\text{aug}}) - R_2(s_{\text{aug}}). \tag{12}$$
These terms are incorporated into the final optimization objectives for both agents as follows:

$$\mathcal{L}_{\text{App}} = \mathcal{L}_{\text{App}}^{\text{DDQN}} + \alpha\,\mathcal{L}_{\text{App}}^{\text{Reg}}, \qquad \mathcal{L}_{\text{Dia}} = \mathcal{L}_{\text{Dia}}^{\text{DDQN}} + \beta\,\mathcal{L}_{\text{Dia}}^{\text{Reg}} + \lambda\,\mathcal{L}_{\text{Dia}}^{\text{hier}}, \tag{13}$$

where $\alpha$, $\beta$, and $\lambda$ are regularization coefficients.

When $(s, \arg\max_{a} Q(s,a)) \in \mathcal{D}$, the regularization term equals zero. When $(s, \arg\max_{a} Q(s,a)) \notin \mathcal{D}$, the term raises the value estimates of observed pairs $(s,a)\in\mathcal{D}$ relative to the potentially overestimated pairs in $R_1$. This regularization term therefore biases the derived policy toward the policy that generated the dataset $\mathcal{D}$ and away from potentially overestimated values. By choosing appropriate $\alpha$ and $\beta$, we can reduce variance without losing performance.
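A sketch of the regularizer and the combined objectives; `dataset_action` is the action actually observed for the state in $\mathcal{D}$, and reducing to a batch mean is our assumption:

```python
import torch

def conservative_reg(q_net, state, dataset_action):
    """Eq. (12): R1(s) - R2(s), penalizing the (possibly overestimated) maximum
    Q-value while compensating with the Q-value of the in-dataset action."""
    q_all = q_net(state)                                                  # Q(s, .)
    r1 = q_all.max(dim=-1).values                                         # R1(s) = max_a Q(s, a)
    r2 = q_all.gather(-1, dataset_action.unsqueeze(-1)).squeeze(-1)       # R2(s) = Q(s, a), (s, a) in D
    return (r1 - r2).mean()

def combined_objectives(l_app_ddqn, l_app_reg, l_dia_ddqn, l_dia_reg, l_dia_hier,
                        alpha, beta, lam):
    """Eq. (13): final losses for the Appraisal and Dialogue Agents."""
    loss_app = l_app_ddqn + alpha * l_app_reg
    loss_dia = l_dia_ddqn + beta * l_dia_reg + lam * l_dia_hier
    return loss_app, loss_dia
```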

Implementation details of the algorithm can be found in Algorithm [1](https://arxiv.org/html/2605.14057#alg1) in the appendix.

## 5 Experiment

### 5.1 Experiment Setup

#### Dataset\.

We evaluate our work on the publicly available U.S. Supreme Court Oral Argument Transcript Dataset. In these transcripts, *justices* actively probe *attorneys* for information critical to deciding a case, closely reflecting the objectives of an ICA. In particular, we use a subset of appellate cases (spanning 1955–2023) from [www.Oyez.org](https://arxiv.org/html/2605.14057v1/www.Oyez.org). Each transcript in this dataset contains metadata such as the case name, argument date, and speaker identifiers. The main textual content comprises a background of the case, an argued question, the complete dialogue transcript, and the final conclusion. Table [5](https://arxiv.org/html/2605.14057#A2.T5) in the appendix summarizes the distribution of cases across legal domains. Our experiments are carried out *offline*; we split training and evaluation data by the year in which the argument took place.

#### Evaluation Metrics\.

We employ two complementary evaluation strategies to assess our system's performance. We prompt a legally pretrained model, SaulLM-7B (Colombo et al., [2024](https://arxiv.org/html/2605.14057#bib.bib8)), to score each utterance generated by the agents (see Table [6](https://arxiv.org/html/2605.14057#A3.T6) in the appendix for the prompts). In parallel, we collect *manual* ratings from human reviewers, applying the same metrics to each utterance.

We focus on the following metrics. Both the LLM and human judges assign scores on a *1–5* scale, where higher values indicate stronger performance:

Conformity Score (CS). Measures how closely each utterance $u_i$ reflects judicial norms (e.g., formality, legal phrasing).

Progression Score (PS). Assesses whether $u_i$ *advances* the discussion rather than stalling or digressing.

Outcome Relevance Score (OS). Evaluates each utterance's consistency with the broader objective, such as reaching a legal conclusion or a coherent final ruling.

Probing Effectiveness Score (PES). Captures how effectively $u_i$ *prompts* new information from the interlocutor.

#### Multi\-turn Dialogue Metrics\.

We introduce two metrics to evaluate the multi-turn dialogue capabilities of the systems. We segment the original transcript of each case into topics and construct an attorney agent using SeCom (Pan et al., [2025](https://arxiv.org/html/2605.14057#bib.bib98)); the ICA and the attorney agent then engage in a simulated courtroom debate based on the question of the case. Because the oral argument stage has a time limit, we cap the simulated conversation at 10 rounds.

We compute a Coverage Score from the simulated debate, which measures how many of the topics in the original transcript were covered by the ICA. Let $t_i$ be the original topics, $T$ be the set of topics from the simulated debate, and $t_i' \in T$ be a topic in the simulated debate. The Coverage Score is computed as:

$$\sum_{t_i' \in T} \max_{t_i} \bigl(\mathrm{Sim}(t_i, t_i')\bigr). \tag{14}$$
We also introduce a Marginal Relevance (MR) Score, based on Maximal Marginal Relevance (Carbonell and Goldstein, [1998](https://arxiv.org/html/2605.14057#bib.bib100)). The Marginal Relevance Score evaluates the ICA's ability to probe for new information while staying relevant to the topic of debate. For every round of dialogue, let $u_j$ be the justice's last utterance and $u_{i<j}$ the justice's previous utterances. Let $q$ be the question of the case and $n$ the number of dialogue rounds. The Marginal Relevance Score is then computed as:

$$\frac{1}{n}\sum_{u_j}\Bigl[\gamma\,\mathrm{Sim}(u_j, q) - (1-\gamma)\max_{u_i}\bigl(\mathrm{Sim}(u_i, u_j)\bigr)\Bigr]. \tag{15}$$

We use cosine similarity for $\mathrm{Sim}$ in both metrics. Additionally, we set $\gamma = 0.7$ to reward the justice for staying on topic while still encouraging exploration of new topics.

Together, these two metrics evaluate the ability of the agents to cover all the necessary topics while probing for new information in multi\-turn dialogues\.
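For reference, a small sketch of the two metrics over pre-computed embeddings; topics and utterances are assumed to already be embedded as vectors, with cosine similarity as in the paper:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def coverage_score(original_topics, simulated_topics):
    """Eq. (14): for each topic t' raised in the simulated debate, credit its best
    match among the original transcript's topics t_i, then sum."""
    return sum(max(cosine(t, t_prime) for t in original_topics)
               for t_prime in simulated_topics)

def marginal_relevance_score(justice_utts, question, gamma=0.7):
    """Eq. (15): average, over rounds, of gamma * Sim(u_j, q) minus (1 - gamma)
    times the similarity to the most similar earlier justice utterance."""
    scores = []
    for j, u_j in enumerate(justice_utts):
        redundancy = max((cosine(u_i, u_j) for u_i in justice_utts[:j]), default=0.0)
        scores.append(gamma * cosine(u_j, question) - (1.0 - gamma) * redundancy)
    return float(np.mean(scores))
```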

#### Baselines\.

We compare our dual-agent hierarchical RL approach against several representative conversational systems. Vanilla Llama3 ([35](https://arxiv.org/html/2605.14057#bib.bib95)) is a straightforward *prompt-only* approach that queries Llama3-8B-Instruct with no hierarchical actions or appraisals, gauging the off-the-shelf capabilities of an LLM on Supreme Court discourse. SFT Llama3 ([35](https://arxiv.org/html/2605.14057#bib.bib95)) fine-tunes the same base model on our dataset, testing whether domain-specific training alone meets inquisitive dialogue demands. We also include SaulLM-7B (Colombo et al., [2024](https://arxiv.org/html/2605.14057#bib.bib8)) to assess how specialized LLMs perform when no hierarchical or appraisal mechanisms are present. For more structured pipeline approaches, Hudeček et al. (2023) integrates domain detection, belief-state tracking, and database querying for task-oriented dialogues, while VaRMI (Shea and Yu, [2023](https://arxiv.org/html/2605.14057#bib.bib84)) employs offline policy gradients and importance sampling to maintain role consistency in RL-based CAs. ArCHer (Zhou et al., [2024](https://arxiv.org/html/2605.14057#bib.bib97)) utilizes a hierarchical actor-critic framework for multi-turn, goal-oriented dialogues; we employ its offline variant. Further details on hyperparameters and implementation are available in Appendix [A](https://arxiv.org/html/2605.14057#A1).

| Method | CS | PS | OS | PES | Overall |
| --- | --- | --- | --- | --- | --- |
| Vanilla Llama3 | 3.99 | 3.94 | 4.70 | 3.92 | 4.14 |
| SFT Llama3 | 3.98 | 3.81 | 4.45 | 3.38 | 3.91 |
| SaulLM-7B | 4.01 | 3.91 | 4.56 | 3.75 | 4.06 |
| Hudeček | 3.99 | 3.97 | 4.77 | 3.63 | 4.09 |
| VaRMI | 4.00 | 3.94 | 4.71 | 3.93 | 4.15 |
| ArCHer | 3.96 | 3.79 | 4.17 | 4.22 | 4.04 |
| Ours | 4.01 | 3.98 | 4.89 | 4.47 | 4.34 |

Table 1: Main Experimental Results

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/Cov_score.png)
Figure 3: Coverage Score results

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/MR_score.png)
Figure 4: MR Score results

| Variant | CS | PS | OS | PES | Overall |
| --- | --- | --- | --- | --- | --- |
| Full Model | 4.01 | 3.98 | 4.89 | 4.47 | 4.34 |
| w/o Appraisal Agent | 4.03 | 4.0 | 4.74 | 4.30 | 4.27 |
| w/o Succinct Reward | 4.01 | 3.97 | 4.85 | 4.39 | 4.31 |
| w/o Novelty Reward | 4.01 | 3.97 | 4.82 | 4.34 | 4.29 |
| w/o Goal Relevance | 4.00 | 3.97 | 4.83 | 4.32 | 4.28 |

Table 2: Ablation Study

### 5.2 Main Results

In this section, we test our method and all baselines on the US Supreme Court dataset and compare their effectiveness on the evaluation metrics. Detailed results are shown in Table [1](https://arxiv.org/html/2605.14057#S5.T1). Our fine-tuning-free method achieves the best performance across all metrics, confirming that our dual-agent method captures the justice's goals and inquisitive nature well. The Appraisal Agent contributes most to the PES metric, which is also the metric where our method outperforms the baselines by the largest margin.

It is worth noting that although US Supreme Court transcripts are included in the training set of SaulLM-7B, it is still outperformed by generic models. The reasons for this phenomenon are two-fold: first, the model was not trained for dialogue tasks; second, our task is substantially more challenging than those captured by the metrics used to evaluate SaulLM-7B.

The results for the Coverage Score and MR Score are presented in Figure [3](https://arxiv.org/html/2605.14057#S5.F3) and Figure [4](https://arxiv.org/html/2605.14057#S5.F4). In Figure [3](https://arxiv.org/html/2605.14057#S5.F3), our method consistently achieves the highest Coverage Score across all round settings, indicating that our dual-agent framework is more effective at expanding the discussion to cover a broader range of case-related topics. Figure [4](https://arxiv.org/html/2605.14057#S5.F4) shows a similar pattern for the MR Score: our method maintains the strongest marginal relevance throughout, suggesting that it is better able to introduce new information while remaining aligned with the central question of the case.

Due to quality issues in the Supreme Court dataset, the fine-tuning methods are not effective on this dataset (see examples in Table [7](https://arxiv.org/html/2605.14057#A3.T7)). SFT and ArCHer achieve ideal results in CS and PES; however, their results were affected by the widespread presence of low-quality data, while our approach effectively bypasses low-quality snippets.

### 5.3 Ablation Study

We conducted four ablations to clarify the role of each reward component and the appraisal agent: \(i\) w/o the appraisal agent, \(ii\) w/o the succinct reward, \(iii\) w/o the novelty reward, and \(iv\) w/o the goal relevance reward\.

Table[2](https://arxiv.org/html/2605.14057#S5.T2)shows that each omission reduces at least one key metric relative to our full model, which yields the highest overall score \(4\.34\), confirming that all components contribute to overall effectiveness\. For example, removing the novelty reward reduces OS from 4\.34 to 4\.29, suggesting that without encouraging fresh information, the dialogue risks becoming less directional\.

Figure[8a](https://arxiv.org/html/2605.14057#A3.F8.sf1)\(130 epochs\) and[8b](https://arxiv.org/html/2605.14057#A3.F8.sf2)\(1600 epochs\) plots the cumulative reward during offline RL\. Early in training, the full model quickly surpasses ablations, reflecting the synergy of dual\-agent oversight and the combination of all reward signals\.

### 5.4 Human Evaluation

We conducted a human evaluation, giving annotators the metadata of each Supreme Court case along with its dialogue context. Evaluators scored the Conformity Score (CS), Progression Score (PS), Outcome Relevance Score (OS), and Probing Effectiveness Score (PES) on a 1–5 scale (Section [5.1](https://arxiv.org/html/2605.14057#S5.SS1.SSS0.Px2)). To ensure consistency, all methods were evaluated on the same set of case transcripts.

Table[8](https://arxiv.org/html/2605.14057#A3.T8)presents the average ratings\. Our full model achieves the highest overall score \(4\.53\), outperforming both SaulLM\-7BColomboet al\.\([2024](https://arxiv.org/html/2605.14057#bib.bib8)\)and all ablated versions\. This underscores the importance of every component in the agent in improving performance\.

## 6 Conclusion

In this paper, we revisit the scope of TOD and propose a three\-way categorization—collaborative, negotiation, and inquisitive dialogue—to better capture the diversity of goal\-driven conversation\. Our study centers on the inquisitive dialogue setting, using U\.S\. Supreme Court oral arguments as a representative domain\.

We presented a dual-agent hierarchical RL approach for inquisitive conversation, focusing on U.S. Supreme Court oral arguments as a high-stakes domain. By integrating a Hierarchical Dialogue Agent that decomposes conversation control across multiple levels with an Appraisal Agent that proactively evaluates attorney responses, our framework captures the justice's goal-driven and probing style. We also present a regularization term that efficiently reduces the variance of our offline RL method. Empirical results on diverse Supreme Court cases show that the dual-agent design, coupled with carefully designed reward components, yields more effective and context-aware dialogue strategies than multiple baselines.

While our current work centers on Supreme Court interactions, the underlying principles, such as active inquiry, structured dialogue management, and reward\-driven question formulation, are broadly applicable to other high\-stakes or domain\-specific settings where deeper questioning is crucial\. Future directions include expanding the reward model to capture even more nuanced legal strategies, and adapting the framework to other inquisitive domains such as investigative journalism or medical consultations\.

## 7 Limitation

The simulated justice's responses are generated by prompting an LLM, so our agent's capability relies heavily on the capability of the underlying LLM. When the LLM has a very low probability of generating the desired optimal sequence, our method cannot reach optimal performance either.

Although our work outperforms the other baselines on the US Supreme Court dataset, the effectiveness of our method on other legal-domain dialogue datasets remains unclear. Our reward signals and action types are designed for this dataset; for other datasets, they would have to be redesigned. The policy that generated the US Supreme Court dataset is close to the optimal policy; when a dataset contains a large amount of data generated by a poor policy, our regularization term could be less effective.

## 8 Ethical Statement

This study uses publicly available transcripts and metadata from U.S. Supreme Court oral arguments; the data in its original format can be downloaded from the [official website](https://www.supremecourt.gov/) ([Supreme Court of the United States](https://arxiv.org/html/2605.14057#bib.bib96)). The Court releases transcripts as part of its routine transparency practices. These datasets do not reveal any identifiable information about the raters. We did not collect any personal information during the labeler selection and labeling process, and we do not include any personalized information in data processing. All examples used in prompting were randomly selected.

## Acknowledgments

This research was supported by U.S. National Science Foundation grant number IIS-2336768. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect those of the sponsor.

## References

- P. Budzianowski et al. (2020) MultiWOZ – a large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling. arXiv preprint arXiv:1810.00278.
- B. Byrne, K. Krishnamoorthi, C. Sankar, A. Neelakantan, D. Duckworth, S. Yavuz, B. Goodrich, A. Dubey, A. Cedilnik, and K. Kim (2019) Taskmaster-1: toward a realistic and diverse dialog dataset. arXiv:1909.05358, [Link](https://arxiv.org/abs/1909.05358).
- J. Carbonell and J. Goldstein (1998) The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 335–336.
- C. Cichowicz (2019) Oral argument tactics on the Supreme Court bench: a comparative analysis of verbal tools used by Justices Sotomayor, Kagan, and Gorsuch. [Link](https://digitalcommons.ursinus.edu/pol_hon/8/).
- P. Colombo, T. P. Pires, M. Boudiaf, D. Culver, R. Melo, C. Corro, A. F. T. Martins, F. Esposito, V. L. Raposo, S. Morgado, and M. Desa (2024) SaulLM-7B: a pioneering large language model for law. arXiv:2403.03883, [Link](https://arxiv.org/abs/2403.03883).
- Y. Deng, L. Liao, L. Chen, H. Wang, W. Lei, and T. Chua (2023) Prompting and evaluating large language models for proactive dialogues: clarification, target-guided, and non-collaboration. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10602–10621. [Link](https://doi.org/10.18653/v1/2023.findings-emnlp.711).
- Y. Deng, W. Zhang, W. Lam, S. Ng, and T. Chua (2024) Plug-and-play policy planner for large language model powered dialogue agents. In The Twelfth International Conference on Learning Representations, ICLR 2024. [Link](https://openreview.net/forum?id=MCNqgUFTHI).
- S. Fujimoto, D. Meger, and D. Precup (2019) Off-policy deep reinforcement learning without exploration. arXiv:1812.02900, [Link](https://arxiv.org/abs/1812.02900).
- Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang (2024) Retrieval-augmented generation for large language models: a survey. arXiv:2312.10997, [Link](https://arxiv.org/abs/2312.10997).
- S. Guo, L. Liao, J. Zhang, C. Li, and H. Chen (2024) PCQPR: proactive conversational question planning with reflection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, pp. 11266–11278. [Link](https://aclanthology.org/2024.emnlp-main.631).
- E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, and W. Chen (2022) LoRA: low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/2106.09685).
- G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. arXiv:2007.01282, [Link](https://arxiv.org/abs/2007.01282).
- A. Kumar, A. Zhou, G. Tucker, and S. Levine (2020) Conservative Q-learning for offline reinforcement learning. arXiv:2006.04779, [Link](https://arxiv.org/abs/2006.04779).
- M. Lewis, D. Yarats, Y. N. Dauphin, D. Parikh, and D. Batra (2017) Deal or no deal? End-to-end learning for negotiation dialogues. arXiv:1706.05125, [Link](https://arxiv.org/abs/1706.05125).
- J. Li, M. Galley, C. Brockett, J. Gao, and B. Dolan (2016a) A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 110–119. [Link](https://aclanthology.org/N16-1014/).
- J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky (2016b) Deep reinforcement learning for dialogue generation. arXiv:1606.01541, [Link](https://arxiv.org/abs/1606.01541).
- L. Liao, G. H. Yang, and C. Shah (2023) Proactive conversational agents in the post-ChatGPT world. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 3452–3455. [Link](https://dl.acm.org/doi/10.1145/3539618.3594250).
- N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui (2024) From LLM to conversational agent: a memory enhanced architecture with fine-tuning of large language models. arXiv:2401.02777, [Link](https://arxiv.org/abs/2401.02777).
- S. Liu, S. Sabour, Y. Zheng, P. Ke, X. Zhu, and M. Huang (2022) Rethinking and refining the distinct metric. arXiv:2202.13587, [Link](https://arxiv.org/abs/2202.13587).
- M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. arXiv:1705.08039, [Link](https://arxiv.org/abs/1705.08039).
- Z. Pan, Q. Wu, H. Jiang, X. Luo, H. Cheng, D. Li, Y. Yang, C. Lin, H. V. Zhao, L. Qiu, and J. Gao (2025) On memory construction and retrieval for personalized conversational agents. arXiv:2502.05589, [Link](https://arxiv.org/abs/2502.05589).
- B. Peng, X. Li, J. Gao, J. Liu, K. Wong, and S. Su (2018) Deep Dyna-Q: integrating planning for task-completion dialogue policy learning. arXiv:1801.06176, [Link](https://arxiv.org/abs/1801.06176).
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9.
- A. Rastogi, X. Zang, S. Sunkara, R. Gupta, and P. Khaitan (2020) Towards scalable multi-domain conversational agents: the schema-guided dialogue dataset. arXiv:1909.05855, [Link](https://arxiv.org/abs/1909.05855).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
- M. Sharma, T. Russell-Rose, L. Barakat, and A. Matsuo (2021) Building a legal dialogue system: development process, challenges and opportunities. arXiv:2109.00381, [Link](https://arxiv.org/abs/2109.00381).
- S. Sharma, J. He, K. Suleman, H. Schulz, and P. Bachman (2017) Natural language generation in dialogue using lexicalized and delexicalized data. arXiv:1606.03632, [Link](https://arxiv.org/abs/1606.03632).
- R. Shea and Z. Yu (2023) Building persona consistent dialogue agents with offline reinforcement learning. arXiv:2310.10735, [Link](https://arxiv.org/abs/2310.10735).
- L. Shu, P. Molino, M. Namazifar, H. Xu, B. Liu, H. Zheng, and G. Tur (2019) Flexibly-structured model for task-oriented dialogues. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, Stockholm, Sweden. [Link](https://aclanthology.org/W19-5922/).
- S. Su, X. Li, J. Gao, J. Liu, and Y. Chen (2018) Discriminative deep Dyna-Q: robust planning for dialogue policy learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3813–3823. [Link](https://aclanthology.org/D18-1416/).
- Y. Su, L. Shu, E. Mansimov, A. Gupta, D. Cai, Y. Lai, and Y. Zhang (2022) Multi-task pre-training for plug-and-play task-oriented dialogue system. arXiv:2109.14739, [Link](https://arxiv.org/abs/2109.14739).
- [32] Supreme Court of the United States. Official website of the U.S. Supreme Court. [Link](https://www.supremecourt.gov/).
- I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. arXiv:1409.3215.
- J. Tang, T. Zhao, C. Xiong, X. Liang, E. P. Xing, and Z. Hu (2019) Target-guided open-domain conversation. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, pp. 5624–5634. [Link](https://doi.org/10.18653/v1/p19-1565).
- [35] (2024) The Llama 3 herd of models. arXiv:2407.21783, [Link](https://arxiv.org/abs/2407.21783).
- L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei (2024) Improving text embeddings with large language models. arXiv:2401.00368, [Link](https://arxiv.org/abs/2401.00368).
- W. Wang, Z. Zhang, J. Guo, Y. Dai, B. Chen, and W. Luo (2022) Task-oriented dialogue system as natural language generation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22. [Link](http://dx.doi.org/10.1145/3477495.3531920).
- T. Zhang, C. Huang, Y. Deng, H. Liang, J. Liu, Z. Wen, W. Lei, and T. Chua (2024) Strength lies in differences! Improving strategy planning for non-collaborative dialogues via diversified user simulation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, pp. 424–444. [Link](https://aclanthology.org/2024.emnlp-main.26).
- T. Zhao and M. Eskenazi (2016) Towards end-to-end learning for dialog state tracking and management using deep reinforcement learning. arXiv:1606.02560, [Link](https://arxiv.org/abs/1606.02560).
- Y. Zheng, R. Zhang, J. Zhang, Y. Ye, Z. Luo, Z. Feng, and Y. Ma (2024) LlamaFactory: unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand. [Link](http://arxiv.org/abs/2403.13372).
- K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, and J. Wen (2020) Towards topic-guided conversational recommender system. In Proceedings of the 28th International Conference on Computational Linguistics, COLING 2020, pp. 4128–4139. [Link](https://doi.org/10.18653/v1/2020.coling-main.365).
- Y. Zhou, A. Zanette, J. Pan, S. Levine, and A. Kumar (2024) ArCHer: training language model agents via hierarchical multi-turn RL. arXiv:2402.19446.

## Appendix A: Implementation Details

#### Implementation Details.

We define the agent's state $S$ at each time step as the dialogue context up to the current turn. We transform the components into a dense vector, $S = \text{Embed}(s_{c}, s_{h})$, using a fine-tuned Mistral-7B model Wang et al. ([2024](https://arxiv.org/html/2605.14057#bib.bib88)). Our embedding model produces 4096-dimensional vectors. In both the hierarchical dialogue agent and the reward model, we compress these embeddings to 32 dimensions before concatenating them with appraisals and actions. This compression uses fully connected layers with batch normalization and Leaky ReLU activations.
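To make the state-encoding pipeline concrete, the following is a minimal PyTorch sketch of the compression step described above. The module name `EmbeddingCompressor` and the hidden width of 256 are illustrative assumptions; only the 4096-to-32 reduction, batch normalization, and Leaky ReLU activations come from the text.

```python
import torch
import torch.nn as nn

class EmbeddingCompressor(nn.Module):
    """Compress 4096-d context embeddings to 32 dims before concatenating
    them with appraisal and action features (hidden width is illustrative)."""

    def __init__(self, in_dim: int = 4096, hidden_dim: int = 256, out_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.LeakyReLU(),
            nn.Linear(hidden_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.LeakyReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Usage: a batch of dialogue-context embeddings from the fine-tuned encoder.
state_emb = torch.randn(8, 4096)
compressed = EmbeddingCompressor()(state_emb)  # shape: (8, 32)
```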

During training, our Dual Hierarchical Dialogue Agent relies on ground-truth appraisals and ground-truth actions at all three hierarchy levels from the dataset, so the appraisal agent and the three levels of the dialogue agent can be trained simultaneously from the same data.

To stabilize training, we employ Polyak updates of the target networks with $\tau = 0.005$ and empirically set the discount factor $\gamma$ to 0.9. The weights for the relevance, novelty, and succinctness rewards are set to 0.2, 0.7, and 0.1, respectively. We use exponential learning-rate decay for both agents, with the two schedules decaying from 1e-6 to 3e-9 and from 1e-6 to 1e-8. The appraisal agent and the dialogue policy agent each have fewer than 2M parameters, and the whole training takes approximately 70 hours.
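As a concrete illustration of these hyperparameters, here is a minimal PyTorch-style sketch of the Polyak target update and the weighted reward combination. The function names are illustrative; only the constants ($\tau$, $\gamma$, and the three reward weights) come from the text.

```python
import torch

TAU = 0.005                           # Polyak coefficient for target-network updates
GAMMA = 0.9                           # discount factor
W_REL, W_NOV, W_SUC = 0.2, 0.7, 0.1   # relevance / novelty / succinctness weights

def polyak_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = TAU) -> None:
    """Soft update: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    with torch.no_grad():
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)

def combined_reward(relevance: float, novelty: float, succinctness: float) -> float:
    """Weighted sum of the three turn-level reward components."""
    return W_REL * relevance + W_NOV * novelty + W_SUC * succinctness
```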

Our method, the ablations, SFT Llama3, VaRMI, Hudeček's method, and the vanilla baseline all use Llama-3.1-8B-Instruct as their base model. For ArCHer, the actor uses Llama-3.2-1B-Instruct due to memory constraints, and the critic uses RoBERTa, in line with the implementation presented by the authors. We run SFT Llama3 via LLaMA-Factory Zheng et al. ([2024](https://arxiv.org/html/2605.14057#bib.bib82)) on 2,000 examples from the Supreme Court dataset, training for three epochs with LoRA Hu et al. ([2022](https://arxiv.org/html/2605.14057#bib.bib83)) at a per-device batch size of 4, gradient accumulation of 8, and a learning rate of 1e-4. For VaRMI Shea and Yu ([2023](https://arxiv.org/html/2605.14057#bib.bib84)), we fine-tune with a 1e-6 learning rate for one epoch. In Hudeček's method, the training dataset serves as the retrieval database, and cosine similarity over embedded contexts is used as the retrieval similarity function. For ArCHer, we train on 700 dialogue trajectories, each with 6 to 7 turns, for 8 epochs, with an actor learning rate of 1e-4 and a critic learning rate of 1e-5.

#### Inference Phase.

At inference, the Appraisal Agent is invoked first to generate an appraisal, which is subsequently fed into the dialogue agent. The dialogue agent first selects a top-level macro-action and then refines it through second- and third-level choices according to Table [3](https://arxiv.org/html/2605.14057#A1.T3), stopping if the current sub-action has no successor. Formally, each level $l$ solves:

$$a_{l} = \arg\max_{a_{l}} Q^{(l)}\bigl(s_{\text{aug}},\, a_{0}, \ldots, a_{l-1}, a_{l}\bigr). \tag{16}$$

The final action vector $\{a_{0}, a_{1}, a_{2}\}$ thus encodes a context-aware dialogue strategy, navigating the conversation at multiple levels of granularity.
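The greedy top-down selection in Eq. (16) can be sketched in a few lines of Python. This is an illustrative sketch rather than the released implementation; the `children` mapping stands in for the successor structure of Table 3, and the state is assumed to be folded into `q_value`.

```python
from typing import Callable, Dict, List

def select_hierarchical_action(
    q_value: Callable[[List[str]], float],   # Q^{(l)}(s_aug, a_0, ..., a_l); the state is folded in
    top_level_actions: List[str],            # macro-actions (level 0 of Table 3)
    children: Dict[str, List[str]],          # action -> finer sub-actions; missing/empty = no successor
    max_levels: int = 3,
) -> List[str]:
    """Greedy top-down selection: take the argmax action at each level,
    conditioning on the actions already chosen at the levels above."""
    chosen: List[str] = []
    candidates = top_level_actions
    for _ in range(max_levels):
        if not candidates:                   # stop when the current sub-action has no successor
            break
        best = max(candidates, key=lambda a: q_value(chosen + [a]))
        chosen.append(best)
        candidates = children.get(best, [])
    return chosen                            # e.g. [a0, a1, a2]
```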

### A.1 Reward Model for Offline Evaluation

In this work, we use an offline RL setting, so the environment is not accessible for providing real-time feedback. We thus learn a *Reward Model* to approximate the environment's true reward function from the offline training data. In particular, we employ a feed-forward neural network (FFN) to model $R_{\phi}$ and predict a scalar reward $\hat{r}$. The network takes data tuples $(s, p, a, r) \sim \mathcal{D}$. A standard mean-squared error (MSE) loss measures the discrepancy between the predicted reward $\hat{r} = R_{\phi}(z, p, a)$ and the ground-truth reward $r$.
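A minimal sketch of such a reward model and one MSE training step is shown below, assuming PyTorch. The layer widths and the appraisal/action input dimensions are illustrative; only the 32-dimensional compressed state follows the text.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Feed-forward reward model R_phi(z, p, a) -> scalar reward.
    Input dimensions other than the 32-d compressed state are illustrative."""

    def __init__(self, state_dim: int = 32, appraisal_dim: int = 9, action_dim: int = 3, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + appraisal_dim + action_dim, hidden),
            nn.LeakyReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, z, p, a):
        return self.net(torch.cat([z, p, a], dim=-1)).squeeze(-1)

# One MSE training step on an offline mini-batch (z, p, a, r) ~ D.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
z, p, a, r = torch.randn(16, 32), torch.randn(16, 9), torch.randn(16, 3), torch.randn(16)
loss = nn.functional.mse_loss(model(z, p, a), r)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```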

Algorithm 1: Dual Hierarchical RL (Offline Training Mode with Regularization)

1. **Input:** dataset $\mathcal{D} \sim ((s,p),\,a,\,r,\,s')$
2. **Output:** policies for the dual agents, $\theta_{\text{App}}$ and $\theta_{\text{Dia}}$
3. **Initialization:** Build the dataset $\{(s,p,r,s')\}$ for the appraisal agent, $\{((s,p,a_{0},\ldots,a_{l-1}),\,a_{l},\,r,\,s') \mid l \in \{0,1,2\}\}$ for the hierarchical agent, and $\{(s,p,a,r)\}$ for the reward model. Initialize the policy and target networks $Q_{\text{App}}^{\theta}$, $Q_{\text{Dia}}$, and the reward model $R_{\phi}$.
4. **Train Reward Model:**
5. **for** each training iteration **do**
6. Sample a mini-batch $(s,p,a,r)$
7. Update $R_{\phi}$ by minimizing $\mathcal{L}_{RM} = \mathbb{E}_{(s,p,a,r)\sim\mathcal{D}}\bigl[(\hat{r}-r)^{2}\bigr]$
8. **end for** when $R_{\phi}$ converges
9. **Train Appraisal Agent:**
10. **for** each training iteration **do**
11. Sample $(s,p,r,s')$
12. Compute the DDQN target: $Y = r + \gamma\, Q_{\text{App}}\bigl(s',\, \arg\max_{p'} Q_{\text{App}}(s',p';\theta);\,\theta^{-}\bigr)$
13. Compute the regularization terms: $R_{1}(s) = \max_{p'} Q_{\text{App}}(s,p')$, $\;R_{2}(s) = Q_{\text{App}}(s,p)$ where $(s,p)\in\mathcal{D}$
14. Update $Q_{\text{App}}$ by minimizing $\mathcal{L}_{\text{App}} = \bigl(Q_{\text{App}}(s,p)-Y\bigr)^{2} + \alpha\bigl(R_{1}(s)-R_{2}(s)\bigr)$
15. **end for** when $Q_{\text{App}}$ converges
16. **Train Hierarchical Dialogue Agent:**
17. **for** each training iteration **do**
18. Sample transitions $\{((s,p,a_{0},\ldots,a_{l-1}),\,a_{l},\,r,\,s') \mid l\in\{0,1,2\}\}$ from $\mathcal{D}$
19. Compute the DDQN target: $y^{l} = r + \gamma\, Q_{\text{Dia}}\bigl(s',\, \arg\max_{p'} Q_{\text{Dia}}(s',p';\theta);\,\theta^{-}\bigr)$
20. Compute the conservative terms: $R_{1}(s_{\text{aug}}) = \max_{p'} Q_{\text{Dia}}(s_{\text{aug}},p')$, $\;R_{2}(s_{\text{aug}}) = Q_{\text{Dia}}(s_{\text{aug}},p)$
21. Compute $Q(s,a_{0})$, $\max_{a_{1}} Q(s,a_{1})$, $Q(s,a_{1})$, $\max_{a_{2}} Q(s,a_{2})$ (depending on hierarchy depth)
22. Update $Q_{\text{Dia}}$ with the accumulated regularized loss $\mathcal{L} = \sum_{l=1}^{3}\mathcal{L}^{l}$, where $\mathcal{L}^{l} = \bigl(Q_{\text{Dia}}(s_{\text{aug}},p)-y^{l}\bigr)^{2} + \beta\bigl(R_{1}(s_{\text{aug}})-R_{2}(s_{\text{aug}})\bigr)$
23. Offline evaluation: $\hat{r} = \mathbb{E}_{(s,p,a,r)\sim\mathcal{D}}\, R_{\phi}(s,\,p^{*},\,a_{0}^{*},\,a_{1}^{*},\,a_{2}^{*})$
24. **end for** when $\mathcal{L}$ converges and $\hat{r}$ stabilizes
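For readers who prefer code, the following is a minimal PyTorch sketch of the per-level update used in Algorithm 1 (lines 12–14 and 19–22): a DDQN target computed with the online network for action selection and the target network for evaluation, plus the conservative regularizer. The interface (`q_net(s)` returning a `[batch, num_actions]` tensor) and the weight `alpha` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def ddqn_conservative_loss(q_net, target_net, s, p_idx, r, s_next, gamma=0.9, alpha=0.1):
    """DDQN target plus conservative regularizer (illustrative sketch of one update step).

    q_net(s) / target_net(s) return a [batch, num_actions] tensor of Q-values;
    p_idx is a LongTensor holding the indices of the logged dataset actions;
    alpha is an assumed regularization weight.
    """
    with torch.no_grad():
        best = q_net(s_next).argmax(dim=-1, keepdim=True)                 # action selection: online net
        y = r + gamma * target_net(s_next).gather(-1, best).squeeze(-1)   # evaluation: target net

    q_all = q_net(s)                                                      # Q(s, .) over all candidates
    q_logged = q_all.gather(-1, p_idx.unsqueeze(-1)).squeeze(-1)          # Q(s, p) for logged actions
    r1 = q_all.max(dim=-1).values                                         # R1 = max_{p'} Q(s, p')
    td = F.mse_loss(q_logged, y)                                          # (Q(s, p) - Y)^2
    reg = (r1 - q_logged).mean()                                          # R1 - R2, with R2 = Q(s, p)
    return td + alpha * reg
```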

| Primary action | Secondary action | Tertiary action |
| --- | --- | --- |
| **Question** | Clarification question | Clarify important aspect of the case; Clarify legal arguments or issues; Clarify definition of concept |
| **Question** | Probing question | Probe the consistency between the attorney’s arguments and established legal principles or precedents; Probe the assumption underlying the attorney’s arguments |
| **Question** | Leading question | Ask for the attorney’s position; Lead the attorney toward a particular conclusion; Lead the attorney to certain aspects |
| **Make hypothesis** | Present hypothesis | Present hypothetical situations to test legal limits; Present hypothetical situations to test legal issues in the case |
| **Make hypothesis** | Compare hypothesis | Compare to hypothetical situations to assess legal principles; Highlight key differences from hypothetical situations |
| **Make hypothesis** | Conclude hypothesis | Explore different types of consequences |
| **Declaration** | Confirmation | Acknowledge the attorney’s arguments; Prompt for information that would support the attorney’s arguments |
| **Declaration** | Rejection | Oppose the attorney’s arguments; Provide counterexample to challenge the attorney’s arguments |
| **Declaration** | Declaration (non-questions) for more details | Lead attorneys by examples (non-questions) for detailed explanation of a concept |
| **Declaration** | Declaration with Time Pressure | Pressure a rash response from the attorney |

Table 3: Proposed Hierarchy of Dialogue Acts

## Appendix B: Three-Level Taxonomy of Justice's Actions

The hierarchies of dialogue actions are listed in Table [3](https://arxiv.org/html/2605.14057#A1.T3). Actions in the primary hierarchy are bolded; actions in the second and third hierarchies are listed in the middle and right columns of the table, respectively.

| Appraisal | Explanation |
| --- | --- |
| Sense ambiguity | The justice finds the attorney’s arguments ambiguous |
| Find deviates | The justice believes the conversation has strayed into irrelevant territory or unproductive arguments |
| Find redundancy | The justice finds the attorney’s arguments repetitive or unproductive |
| Spot weakness | The justice spots a weakness in the attorney’s arguments |
| Identify flaws | The justice identifies logical flaws in the attorney’s arguments |
| Identify chances | The justice identifies chances to influence the attorney |
| Keep challenging | The justice wants to challenge the attorney from another aspect |
| Dive deeper | The justice wants to dive deeper into the dialogue |
| Otherwise | All other kinds of justice intents |

Table 4: Appraisal Actions

| Domain | Turns/case | Words/utterance | Words/case | #Cases |
| --- | --- | --- | --- | --- |
| Regulatory | 192.6 | 47.7 | 9183.2 | 589 |
| Civil Rights | 206.0 | 44.0 | 9074.5 | 337 |
| Criminal | 200.6 | 45.0 | 9035.7 | 418 |
| IP | 173.4 | 52.5 | 9101.2 | 19 |
| Commerce | 199.3 | 46.4 | 9252.2 | 107 |
| Labor | 262.3 | 54.9 | 14407.3 | 101 |
| Immigration | 176.4 | 55.3 | 9762.3 | 16 |
| Environment | 262.3 | 54.9 | 14407.3 | 3 |
| Others | 200.4 | 47.6 | 9541.2 | 18 |
| Total | 198.6 | 45.9 | 9121.5 | 1608 |

Table 5: Supreme Court Dataset Statistics

## Appendix C: Details Regarding Measurements

We recruit qualified students to perform the human evaluations and collect their ratings by distributing Google Forms. The estimated payment is $20 per hour.

- **Instruction:** You are a duteous, respectful, and honest AI justice assistant. You are given a clip of dialogue that happens in a US Supreme Court appeal case, ending with the utterance of the justice; your job is to provide analysis and score the last utterance of the justice in the dialogue from different aspects, with a maximum score of 5. In each task, you are given the background, the argued question of the case, the conclusion of the case, and the justice utterance with related dialogue context. Provide a snippet of analysis of the role of the last sentence before giving the score.
- **Metric Explanation:** The metric is: {Metric}: {Explanation of metric}. The scoring format should be: {Metric}: {Comment about {metric}} ?/5
- **Score Definition:** The definitions of the scores are: Score 1/5 ({Descriptor}): {Explanation}; Score 2/5 ({Descriptor}): {Explanation}; Score 3/5 ({Descriptor}): {Explanation}; Score 4/5 ({Descriptor}): {Explanation}; Score 5/5 ({Descriptor}): {Explanation}

Table 6: Prompt Structure for LLM Evaluation

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/google_form1.png)

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/google_form2.png)

Figure 5: A template of the Google Form used for manual labeling; the text has been streamlined for typographical purposes.

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/perception2.png)

Figure 6: The justice uses a counterexample to challenge the attorney's position and the arguments presented previously.

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/perception3.png)

Figure 7: The justice continuously presses the attorney with a rapid succession of questions, cutting him off and restricting him to one-word responses before the next question is initiated.

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/eval_reward_comparison130_15.png)

(a) First 130 epochs

![Refer to caption](https://arxiv.org/html/2605.14057v1/image/eval_reward_comparison1600_15.png)

(b) First 1600 epochs

Figure 8: Learning curves from the ablation study. (a) Cumulative reward during the early training stage; (b) cumulative reward over longer-term training (1,600 epochs). The full model (blue) outperforms all ablated versions.

**Sub-optimal utterance**

Attorney: So I – so I don’t think there’s any statement in the legislative history that says we’re not forcing employers to give benefits for non-work-related injuries. What – there are three statements in the legislative history that – that Respondent draws a negative inference from.
Justice: I’m so relieved.

**Frequent interruption**

Justice: So you really can’t… there’s no analytical distinction, then–
Attorney: Well–
Justice: –between the fact and the feeling.
Attorney: –That’s why we believe this should be a question for the district judge, who can balance all of these factors. In your hypothetical–
Justice: Yes, but even on your balancing theory I thought the judge was supposed to draw… maybe I misunderstood you. I thought the judge was supposed to draw a line between fact and feeling, and what he was supposed to be balancing–
Attorney: –No, I–
Justice: –was the appropriateness of admitting the fact as against other interests.
Attorney: –I think that’s one of the things that the trial judge could be balancing, whether it’s fact or feeling, but also the need for the evidence. If we had a hypothetical where the–
Justice: I don’t understand that, the need for the evidence?

**Missing data**

Attorney: That’s correct. It may be applied in the discretion of the agency head…
Justice: (Inaudible)
Attorney: Yes, sir, I think there is a substantial difference and I think that’s …

Table 7: Examples of Low-quality Snippets in the Dataset

| Method | CS | PS | OS | PES | Overall |
| --- | --- | --- | --- | --- | --- |
| Full Model | 3.99 | 4.53 | 4.32 | 4.63 | 4.37 |
| w/o Appraisal Agent | 3.83 | 3.89 | 4.42 | 3.89 | 4.01 |
| w/o Succinct Reward | 3.45 | 4.05 | 4.11 | 4.05 | 3.92 |
| w/o Novelty Reward | 3.74 | 4.32 | 4.26 | 4.37 | 4.17 |
| w/o Goal Relevance | 3.77 | 4.26 | 4.21 | 4.42 | 4.17 |
| SaulLM-7B | 3.73 | 3.21 | 4.05 | 3 | 3.5 |

Table 8: Human Evaluation

## Appendix D: Transferability Discussion

Here, we provide a discussion of how the framework could be adapted to other inquisitive domains\.

Inquisitive conversations usually have an ultimate result\. For the Supreme Court, it is a conclusion; for journalism, it is a summary; and for medical interviews, it can be highlights\.

**Appraisal taxonomy:** To adapt to other domains, we can keep the same turn-level appraisal mechanism but broaden the label space to a domain-general core plus domain-specific refinements. For journalism, refinements emphasize attribution and verifiability (e.g., “claim lacks source,” “timeline inconsistent,” “needs evidence/documents”). For medicine, refinements emphasize clinical completeness and safety (e.g., “missing onset/duration/severity,” “red-flag unaddressed,” “contraindication risk”). Practically, the domain-specific appraisal set can be selected from a larger universal pool, or induced with weak supervision or clustering over (question, answer, follow-up) triples, which reduces reliance on handcrafted legal notions while preserving the same control interface to the dialogue policy.

**Dialogue act hierarchy:** The hierarchical decision structure is a general property of inquisitive interviewing, so adaptation does not require redesigning the hierarchy; it only requires swapping the act inventory. In journalism, high-level acts such as clarify, verify, challenge, request evidence, reconcile contradictions, and summarize decompose naturally into finer acts (e.g., “ask for a document,” “ask for source identity,” “pin down time and place”).

Rather than hand\-crafting these for each new domain, a practical adaptation path is to derive the hierarchy via hierarchical clustering of question/response embeddings or other cues, then map clusters to interpretable parent nodes while letting leaves remain domain\-specific\.

**Reward design:** The reward template also generalizes with a simple substitution: replace the Supreme Court “case conclusion” target with a domain-specific “goal artifact” that represents the interview’s intended end product. For journalism, this can be a set of story claims or a summary of the dialogue. For medical interviews, it can be a set of clinical highlights.

Generally speaking, goal relevance then rewards answers that add content aligned with these goal elements, novelty rewards information that was not already established earlier in the conversation, and clarity is recalibrated per domain (for journalism: specificity and the presence of attribution or evidence). Crucially, the training objective and reward combination remain unchanged; only the goal artifact and the clarity proxy are swapped. In this way, adaptation can be interpreted as a “plug-in” process rather than a substantial redesign, as the sketch below illustrates.
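The following is a minimal Python sketch of this plug-in view: a reward factory in which only the goal artifact and the clarity proxy change across domains. The function names are illustrative assumptions, and the default weights reuse the legal-domain values from Appendix A, which would likely need retuning per domain.

```python
from typing import Callable, List, Tuple

def make_domain_reward(
    goal_artifact: List[str],                          # e.g. case-conclusion points, story claims, clinical highlights
    relevance_fn: Callable[[str, List[str]], float],   # how much an answer advances the goal artifact
    novelty_fn: Callable[[str, List[str]], float],     # information not already in the dialogue history
    clarity_fn: Callable[[str], float],                # domain-specific clarity / succinctness proxy
    weights: Tuple[float, float, float] = (0.2, 0.7, 0.1),  # legal-domain defaults; retune per domain
) -> Callable[[str, List[str]], float]:
    """Plug-in reward: only the goal artifact and the clarity proxy change across domains."""
    w_rel, w_nov, w_cla = weights

    def reward(answer: str, history: List[str]) -> float:
        return (w_rel * relevance_fn(answer, goal_artifact)
                + w_nov * novelty_fn(answer, history)
                + w_cla * clarity_fn(answer))

    return reward
```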
