UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems
Summary
This paper proposes UP-NRPA, an online framework that integrates user portraits with nested rollout policy adaptation using large language models to dynamically customize dialogue strategies without offline training, achieving 100% success on multiple dialogue tasks.
# UP-NRPA: User Portrait based Nested Rollout Policy Adaptation for Planning with Large Language Models in Goal-oriented Dialogue Systems
Source: [https://arxiv.org/html/2606.13683](https://arxiv.org/html/2606.13683)
Fafa Zhang 1,2; Meng Liu 1,2; Xiangyu Chen 1,2; Chaoxu Mu 1,2,3

1 School of Artificial Intelligence, Anhui University; 2 Anhui Provincial Key Laboratory of Security Artificial Intelligence, Anhui University; 3 Pengcheng Laboratory, Shenzhen, China

h\.wang\.13@ahu\.edu\.cn, \{wa23301160, w125221177, xiangyu0113\}@stu\.ahu\.edu\.cn, cxmu@tju\.edu\.cn
###### Abstract
To address the challenge that current dialogue policy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes User Portrait based Nested Rollout Policy Adaptation \(UP\-NRPA\), an online framework built on Large Language Models\. In contrast to conventional approaches, which depend on model training and require offline reinforcement learning of policy models for user groups, UP\-NRPA customizes dialogue strategies dynamically through an adaptive mechanism\. It does so by leveraging real\-time user feedback alongside the personality, preferences, and objectives mapped from the current user portrait, thereby adapting to user characteristics without offline reinforcement learning\. On collaborative and non\-collaborative dialogue benchmarks, UP\-NRPA achieved a 100% success rate on multiple dialogue tasks\. In negotiation tasks in particular, the sale\-to\-list ratio \(SL\) increased by 56\.41%\. These results indicate that UP\-NRPA can adapt to diverse user needs and characteristics without requiring a training mechanism\.
Figure 1: The overview of the UP\-NRPA framework\. This framework integrates a user portrait\-driven simulator with a Nested Rollout Policy Adaptation planner\. Through multi\-level Monte Carlo simulation and reward\-based policy adaptation, the agent dynamically optimizes dialogue strategies by simulating interactions with diverse user personas\.

## 1 Introduction
With the development of Large Language Models \(LLMs\), goal\-oriented dialogue systems have made substantial progress Zhou et al\. \([2024](https://arxiv.org/html/2606.13683#bib.bib2)\); Synekop et al\. \([2024](https://arxiv.org/html/2606.13683#bib.bib1)\); Algherairy and Ahmed \([2024](https://arxiv.org/html/2606.13683#bib.bib3)\)\. These systems excel in scenarios where goals align with user interests, such as restaurant reservations, emotional support, or multi\-step task guidance Deng et al\. \([2025](https://arxiv.org/html/2606.13683#bib.bib4)\); Li \([2024](https://arxiv.org/html/2606.13683#bib.bib6)\); Deng et al\. \([2024a](https://arxiv.org/html/2606.13683#bib.bib5)\), demonstrating robust collaborative dialogue capabilities\. However, their performance degrades significantly when dialogue objectives conflict with user interests, such as in negotiation or persuasion scenarios Deng et al\. \([2025](https://arxiv.org/html/2606.13683#bib.bib4)\)\. Therefore, the system must balance goal achievement and user sentiment in conversations to achieve optimal interaction outcomes\.
With the advancement of dialogue systems, numerous novel solutions have emergedDenget al\.\([2024a](https://arxiv.org/html/2606.13683#bib.bib5)\)\. Approaches based on prompt engineering optimize decision\-making processes by directly guiding model planning through well\-crafted instructions and contextual prompts, or by integrating external policy planners to collaborate with LLMs, thereby constructing efficient dialogue agentsZhanget al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib34)\)\. However, while these approaches show progress, they have limitations\. Offline reinforcement learning methods perform well for single users but suffer from poor generalization capabilities, with performance heavily dependent on high\-quality dialogue data, leading to costly training\. Online search methods like Monte Carlo Tree Search \(MCTS\) can generate natural responses but fail to achieve objectives in goal\-oriented dialogue tasksYuet al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib42)\)\. Existing approaches fall short in modeling user personas\. In real\-world dialogue scenarios, each user possesses unique personality traits, yet current methods show poor performance in integrating these personas\. Offline reinforcement learning struggles to train policy planners capable of generalizing across diverse user populations, resulting in dialogue agents exhibiting rigid behavioral strategies when encountering different users\. In complex multi\-user scenarios like persuasionWanget al\.\([2019](https://arxiv.org/html/2606.13683#bib.bib29)\), negotiationHeet al\.\([2018](https://arxiv.org/html/2606.13683#bib.bib28)\), or emotional supportLiuet al\.\([2021](https://arxiv.org/html/2606.13683#bib.bib27)\); Zhenget al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib30)\), existing dialogue agents lack dynamic adaptability to adjust strategies based on user feedback\. 
This limitation constrains their effectiveness in applications requiring deep interaction, empathy building, and trust establishment\. Furthermore, existing approaches fail to maintain dialogue coherence and goal\-oriented focus in non\-collaborative tasks\. They cannot capture behavioral shifts across different users and adapt conversational strategies accordingly\. Collectively, these issues constrain the performance of LLM\-based dialogue systems\.
To tackle the challenges in dialogue planning posed by diverse user portraits, several innovative solutions have been proposed\. Among them, the Tailored Strategic Planning \(TRIP\)Zhanget al\.\([2024](https://arxiv.org/html/2606.13683#bib.bib41)\)method achieves effective adaptation to personalized dialogue scenarios by deeply integrating user characteristics into the strategy planning process\. Its innovation lies in integrating a user perception strategy planning module with a group\-based training paradigm, systematically enhancing the agent’s customized strategy planning capabilities and overcoming the limitations of traditional methods under single user simulators\. Building upon this foundation, User\-Tailored Dialogue Policy Planning \(UDP\)Heet al\.\([2025b](https://arxiv.org/html/2606.13683#bib.bib38)\)further expands the TRIP framework by employing advanced diffusion models to dynamically infer and model user profiles\. This approach introduces a Brownian Bridge\-inspired mechanism, enabling precise prediction of users’ response patterns and behavioral tendencies\. UDP not only captures real\-time shifts in user characteristics, but also dynamically adjusts policy planning during conversations, achieving significant performance improvements over TRIP\.
Although TRIP and UDP have advanced user\-portrait\-based dialogue systems by enabling strategy formulation for different user characteristics, both approaches still show limitations on dialogue task metrics\. Their dialogue success rates are relatively low, and they need more turns to reach the dialogue target\. By contrast, the Nested Rollout Policy Adaptation for Goal\-oriented Dialogue \(NRPA\-GD\) Wang et al\. \([2025a](https://arxiv.org/html/2606.13683#bib.bib31)\) method employs a policy adaptation mechanism to achieve efficient dialogue planning within a single user simulator environment, significantly improving dialogue success rates\. Therefore, we propose the User Portrait\-based Nested Rollout Policy Adaptation \(UP\-NRPA\) method\. This approach integrates user characteristics into nested rollout planning, dynamically adjusting dialogue strategies through online search and user feedback\. This enables UP\-NRPA to adaptively customize strategies across diverse user scenarios, effectively addressing the need for personalized dialogue planning across different user populations\. Our contributions are summarized as follows:
- Current dialogue strategy planning requires offline reinforcement learning, which cannot dynamically adjust strategies in real time for unseen user personas\. In contrast, UP\-NRPA dynamically plans strategies through real\-time feedback without requiring model training\.
- UP\-NRPA combines user profiling with online strategy optimization, enabling the system to continuously enhance interaction strategies based on distinct user personas, thereby improving dialogue success rates\.
- In collaborative and non\-collaborative dialogue tasks, UP\-NRPA achieved a 100% success rate across multiple tasks\. With the Qwen2\.5 14B backbone, it improved the sale\-to\-list ratio \(SL\) in negotiation by 56\.41% compared to existing state\-of\-the\-art approaches, validating the effectiveness of UP\-NRPA\.
## 2 Related Work
Existing prompt engineering methods, such as Ask\-an\-Expert \(AnE\), integrate active prompting, self\-reflection, and self\-play to enhance LLMs’ planning capabilities by learning from context and history through predefined instruction promptsZhanget al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib34)\); Chenet al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib16)\); Fuet al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib35)\)\. However, these prompting techniques typically prioritize user satisfaction\. User simulations often employ fixed neutral roles, struggling to reflect individual characteristicsDenget al\.\([2023a](https://arxiv.org/html/2606.13683#bib.bib17)\)\. Plug\-and\-Play Dialogue Policy Planner \(PPDPP\)Denget al\.\([2024b](https://arxiv.org/html/2606.13683#bib.bib36)\)and Dual\-Process Dialogue Planning \(DPDP\)Heet al\.\([2024](https://arxiv.org/html/2606.13683#bib.bib37)\)achieve stronger policy capabilities through offline reinforcement learning\. Additionally, TRIPZhanget al\.\([2024](https://arxiv.org/html/2606.13683#bib.bib41)\)and UDPHeet al\.\([2025b](https://arxiv.org/html/2606.13683#bib.bib38)\)methods incorporate user profiles to identify user types and adopt customized dialogue strategies\. However, these approaches remain reliant on substantial data and offline training, limiting their ability to adapt strategies for new users during dialogue planning\. Latent Dialogue Policy Planning \(LDPP\)Heet al\.\([2025a](https://arxiv.org/html/2606.13683#bib.bib40)\)is based on data\-driven autonomous policy discovery\. In dialogue scenarios such as negotiation, persuasion, and emotional support, the interaction process contains rich contextual information\. The system must autonomously adjust dialogue policies to achieve predetermined interaction goals\. LDPP implements the entire process from policy mining in dialogue records to policy planning learning\. 
Based on an offline hierarchical reinforcement learning algorithm in latent space, it constructs efficient policy planning capabilities\. Goal\-oriented Dialogue Planning with Zero training \(GDP\-Zero\) utilizes LLMs to simultaneously process prior strategies, value functions, and user/system roles, enabling MCTS planning for unknown scenariosYuet al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib42)\)\. NRPA, a variant of Nested Monte Carlo Search, is applied to goal\-oriented dialogue systems via multi\-level policy adaptationWanget al\.\([2025b](https://arxiv.org/html/2606.13683#bib.bib19)\)\. NRPA\-GD addresses the computational overhead of offline reinforcement learning by introducing an online NRPA search algorithm, thereby improving dialogue success rates\. DialogXpert generates a small set of high\-quality action candidates for each dialogue turn using frozen LLM models\. It then leverages a compact Q\-network based on fixed BERT embeddings trained via temporal difference learning to select optimal actions within a reduced feature space\. By tracking user sentiment, DialogXpert advances tasks while making customized decisions to establish genuine empathetic connectionsRakibet al\.\([2025](https://arxiv.org/html/2606.13683#bib.bib32)\)\. The proposed UP\-NRPA model combines user role modeling with nested rollout, iteratively optimizing optimal action sequences based on user feedback to enable dialogue agent to select optimal strategies\.
Table 1: Comparison of dialogue planning methods on the CraigslistBargain and ESConv benchmarks\.

Algorithm 1: UP\-NRPA

1. LLM $M_\theta$
2. Initial policy $\pi$
3. Number of iterations $N$
4. Learning rate $\alpha$
5. Action space $\mathcal{A}$
6. Initial state $s$
7. User portrait $U$
8. function UP\-NRPA\($\mathit{level}$, $\pi$, $s$\)
9. if $\mathit{level} = 0$ then
10. return Playout\($s$, $\pi$, $U$\)
11. else
12. $\mathit{bestScore} \leftarrow -\infty$
13. $\mathit{bestSequence} \leftarrow \emptyset$
14. for $\mathit{iteration} = 1$ to $N$ do
15. $(\mathit{score}, \mathit{sequence}) \leftarrow$ NRPA\($\mathit{level} - 1$, $\pi$, $s$\)
16. if $\mathit{score} > \mathit{bestScore}$ then
17. $\mathit{bestScore} \leftarrow \mathit{score}$
18. $\mathit{bestSequence} \leftarrow \mathit{sequence}$
19. $\pi \leftarrow$ Adapt\($\pi$, $\mathit{bestSequence}$, $\alpha$, $s$\)
20. return $(\mathit{bestScore}, \mathit{bestSequence})$
21. function Adapt\($\pi$, $\mathit{sequence}$, $\alpha$, $s$\)
22. $\pi' \leftarrow \pi$
23. $\mathit{currentState} \leftarrow s$
24. for each action $a$ in $\mathit{sequence}$ do
25. $z \leftarrow \sum_{a' \in \mathcal{A}} e^{\pi'(a')}$
26. for each action $a' \in \mathcal{A}$ do
27. $\pi'(a') \leftarrow \pi'(a') - \alpha \cdot \frac{1}{z} e^{\pi'(a')}$
28. $\pi'(a) \leftarrow \pi'(a) + \alpha$
29. $\mathit{currentState} \leftarrow \mathit{play}(\mathit{currentState}, a)$
30. return $\pi'$
## 3 User\-Specific Planning Evaluation
To address the challenge that existing dialogue strategy planning methods struggle to dynamically adapt to diverse user characteristics, this paper proposes the UP\-NRPA online framework to explore its planning adaptation capabilities\. Two task categories are examined: collaborative and non\-collaborative tasks\. Collaborative tasks include the cooperative dialogue tasks ESConv and ExTES for emotional support scenarios Liu et al\. \([2021](https://arxiv.org/html/2606.13683#bib.bib27)\); Zheng et al\. \([2023](https://arxiv.org/html/2606.13683#bib.bib30)\), while non\-collaborative tasks encompass the non\-cooperative dialogue task P4G for persuasion scenarios Wang et al\. \([2019](https://arxiv.org/html/2606.13683#bib.bib29)\) and the CB non\-cooperative dialogue task for negotiation scenarios He et al\. \([2018](https://arxiv.org/html/2606.13683#bib.bib28)\)\. First, user profiles are customized using Big Five personality traits and decision\-making styles\. Then, GPT\-5 generates user descriptions with fine\-grained characteristics based on these profiles\. Finally, comparative experiments validate the task performance of UP\-NRPA\.
### 3\.1 User Persona Designing
Building upon existing research, we combine persona portraits with user simulation and, for non\-cooperative tasks, select non\-cooperative behaviors from a set of resisting strategies\. The generation and integration process of persona portraits follows the research framework of TRIP Zhang et al\. \([2024](https://arxiv.org/html/2606.13683#bib.bib41)\), similarly designing two role types, each equipped with coherent setting descriptions generated by large language models\. These descriptions encompass two key dimensions: Big Five personality traits Goldberg \([1992](https://arxiv.org/html/2606.13683#bib.bib24)\) and decision\-making styles Scott and Bruce \([1995](https://arxiv.org/html/2606.13683#bib.bib25)\)\. Simultaneously, we employ the resisting strategy proposed by Dutt et al\. \([2021](https://arxiv.org/html/2606.13683#bib.bib26)\) to guide simulator behavior patterns\. Hybrid active role\-playing prompts designed for each agent integrate specific character settings with dialogue context information\. For each evaluation task, we constructed 300 diverse user simulators to ensure comprehensive and systematic test coverage\. For details on the resisting strategy, see Appendix E\.
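The persona construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Big Five trait names are standard, but the four decision\-making style labels, the split of 300 simulators into 15 per trait and style combination, and the helper names `build_personas` and `role_play_prompt` are all hypothetical\. The actual prompts and resisting strategies are given in the paper's appendices, which are not reproduced here\.

```python
import itertools
import random

BIG_FIVE = ["openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]
# Assumed style labels, used only for illustration.
DECISION_STYLES = ["directive", "analytical", "conceptual", "behavioral"]


def build_personas(per_combination=15, seed=0):
    """Return one persona dict per simulated user.

    With 5 traits x 4 styles x 15 copies this yields 300 personas, which
    matches the simulator count in the text; the split is an assumption.
    """
    rng = random.Random(seed)
    personas = []
    for trait, style in itertools.product(BIG_FIVE, DECISION_STYLES):
        for _ in range(per_combination):
            personas.append({"trait": trait, "style": style})
    rng.shuffle(personas)  # avoid evaluating personas in a fixed order
    return personas


def role_play_prompt(persona, dialogue_context, resisting_strategy=None):
    """Combine the character setting with the dialogue context (hypothetical wording)."""
    lines = [
        f"You are a user whose dominant personality trait is {persona['trait']}.",
        f"Your decision-making style is {persona['style']}.",
    ]
    if resisting_strategy is not None:  # only used for non-cooperative tasks
        lines.append(f"When you disagree, use this resisting strategy: {resisting_strategy}.")
    lines.append("Dialogue so far:")
    lines.append(dialogue_context)
    return "\n".join(lines)
```

In a real setup, the prompt returned by `role_play_prompt` would be sent to the LLM that plays the user; the description generated for each persona would replace the two template sentences used here\.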
## 4 Methodology
### 4\.1 Problem Definition
Given existing research, the dialogue planning process can be formalized as a Markov Decision Process \(MDP\) Zhang et al\. \([2024](https://arxiv.org/html/2606.13683#bib.bib41)\); Rakib et al\. \([2025](https://arxiv.org/html/2606.13683#bib.bib32)\), represented as the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{R}, \mathcal{T})$, where $\mathcal{S}$ denotes the dialogue state space, $\mathcal{A}$ denotes the dialogue action space, $\mathcal{R}$ is the reward function, and $\mathcal{T}$ is the state transition function\. At each dialogue time step $t$, the dialogue state $s_t \in \mathcal{S}$ encompasses the complete dialogue context and historical record\. The agent selects an action $a_t \in \mathcal{A}$ based on the current state, triggering a state transition $s_{t+1} = \mathcal{T}(s_t, a_t)$ and receiving an immediate reward $\mathcal{R}_t$\. The core objective of the dialogue agent is to learn an optimal policy\. The reward function $\mathcal{R}$ is designed based on the NRPA\-GD reward mechanism, calculating rewards according to the dialogue termination state \(1 or 0\), dialogue turn number, and corresponding penalty terms Wang et al\. \([2025a](https://arxiv.org/html/2606.13683#bib.bib31)\)\. The output space of this planner is a predefined set of policies based on existing research, where each policy is accompanied by pre\-designed natural language instruction descriptions\.
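The objective that the paragraph above leaves implicit can be written compactly\. The form below is one standard way to state it and rests on assumptions not given in the text: an undiscounted return, a maximum horizon of $T$ turns, and an expectation taken over the stochastic policy and the simulated user's responses\.

```latex
\pi^{*} = \arg\max_{\pi} \; \mathbb{E}\left[ \sum_{t=1}^{T} \mathcal{R}_t \;\middle|\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} = \mathcal{T}(s_t, a_t) \right]
```

Because the reward depends on the termination state and the number of turns, most of the signal in this sum arrives at the end of a dialogue, which is why complete simulated dialogues are needed to score a candidate action sequence\.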
Table 2: Comparison of dialogue planning methods on the P4G and ExTES benchmarks\.
### 4\.2 UP\-NRPA
The proposed UP\-NRPA framework integrates user modeling with online search algorithms to achieve real\-time policy optimization in goal\-oriented dialogue\. As shown in Figure [1](https://arxiv.org/html/2606.13683#S0.F1), the system first samples from a diverse user population to construct structured user profiles\. These profiles then drive the Role\-Play User Simulator\. Within the NRPA Planner, the system employs a nested search mechanism for online planning\. At level 2, the agent preliminarily selects a policy based on the current dialogue state\. It then proceeds to level 1, where Monte Carlo simulation is used to conduct multiple turns of complete dialogue simulation\. During this process, the user simulator provides feedback on the agent’s actions based on the predefined portrait, and the policy distribution is continuously updated based on the computed rewards\. In this way, optimal policy output is achieved in complex scenarios without relying on offline reinforcement learning training\. In Algorithm [1](https://arxiv.org/html/2606.13683#alg1), the UP\-NRPA process begins at the top level of the nested hierarchy and recursively searches for improved action sequences that maximize dialogue rewards\. It leverages LLMs to generate system responses and user\-portrait\-driven user replies, appending dialogue pairs to the state until termination conditions are met, thereby simulating a complete dialogue trajectory\. After reward computation, if a candidate result outperforms the current best outcome, the best score and sequence are updated, and the policy $\pi$ is adjusted toward the new best sequence\. For more implementation details, see Appendix A\.
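The nested search and the softmax\-style policy update of Algorithm 1 can be sketched in a few lines\. This is a toy illustration under stated assumptions rather than the paper's implementation: the LLM and the role\-play user simulator are replaced by a rule\-based stand\-in, the action labels, turn limit, and reward shape are hypothetical, the recursive call is assumed to invoke the same function one level down, and the adaptation step is assumed to run after every iteration, as in standard NRPA\.

```python
import math
import random

ACTIONS = ["greet", "empathize", "propose", "concede"]  # hypothetical strategy labels
MAX_TURNS = 5  # hypothetical turn limit


def sample_action(policy, rng):
    """Sample an action with probability proportional to exp(policy[action])."""
    weights = [math.exp(policy[a]) for a in ACTIONS]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


def playout(policy, portrait, rng):
    """Level 0: simulate one dialogue and return (score, action sequence).

    Toy stand-in for the LLM user simulator: the user is won over as soon
    as the agent plays the portrait's preferred action.
    """
    sequence = []
    for turn in range(1, MAX_TURNS + 1):
        action = sample_action(policy, rng)
        sequence.append(action)
        if action == portrait["preferred"]:
            return 1.0 - 0.1 * (turn - 1), sequence  # toy reward: fewer turns is better
    return 0.0, sequence


def adapt(policy, sequence, alpha):
    """Shift policy weights toward the actions of the given sequence."""
    new_policy = dict(policy)  # the caller's policy is left untouched
    for action in sequence:
        z = sum(math.exp(new_policy[a]) for a in ACTIONS)
        probs = {a: math.exp(new_policy[a]) / z for a in ACTIONS}
        for a in ACTIONS:
            new_policy[a] -= alpha * probs[a]  # lower every weight by its softmax mass
        new_policy[action] += alpha  # raise the weight of the chosen action
    return new_policy


def up_nrpa(level, policy, portrait, n_iter, alpha, rng):
    """Nested search: each level runs n_iter searches one level below."""
    if level == 0:
        return playout(policy, portrait, rng)
    best_score, best_sequence = -math.inf, []
    for _ in range(n_iter):
        score, sequence = up_nrpa(level - 1, policy, portrait, n_iter, alpha, rng)
        if score > best_score:
            best_score, best_sequence = score, sequence
        policy = adapt(policy, best_sequence, alpha)  # local rebinding only
    return best_score, best_sequence


if __name__ == "__main__":
    uniform = {a: 0.0 for a in ACTIONS}
    result = up_nrpa(2, uniform, {"preferred": "empathize"},
                     n_iter=5, alpha=1.0, rng=random.Random(0))
    print(result)
```

Each adaptation step subtracts $\alpha$ times the softmax probability from every weight and adds $\alpha$ to the chosen action, so the sum of the weights stays constant while probability mass moves toward the best sequence found so far\. Unlike this toy, a state\-dependent policy would index the weights by dialogue state as well as by action\.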
## 5 Experimental Setup
### 5\.1 Evaluation Tasks
We evaluated the performance of the proposed method on collaborative and non\-collaborative tasks\. Specifically, on CraigslistBargain \(CB\) He et al\. \([2018](https://arxiv.org/html/2606.13683#bib.bib28)\), evaluations were conducted using 3,290 training samples, 188 validation samples, and 188 test samples containing bargaining dialogues between buyers and sellers\. For ESConv Liu et al\. \([2021](https://arxiv.org/html/2606.13683#bib.bib27)\), which focuses on emotional support, we used 1,040 training samples, 130 validation samples, and 130 test samples\. For P4G Wang et al\. \([2019](https://arxiv.org/html/2606.13683#bib.bib29)\), centered on donation persuasion, we employed 817 training samples, with 100 samples each for validation and testing\. Additionally, ExTES, the expanded version of ESConv, was utilized Zheng et al\. \([2023](https://arxiv.org/html/2606.13683#bib.bib30)\)\. This dataset contains richer data, comprising 10,717 training samples, 200 validation samples, and 200 test samples\. Since the UP\-NRPA method does not require offline reinforcement learning, it was evaluated directly on the test set\. Appendix D provides a more detailed introduction to these four dialogue tasks\.
### 5\.2 Evaluation Metrics
Following previous studies He et al\. \([2024](https://arxiv.org/html/2606.13683#bib.bib37)\), we selected Average Turns \(AT\) and Success Rate \(SR\) as evaluation metrics\. AT measures the efficiency of goal completion by calculating the average number of dialogue turns required to reach the target\. SR measures the effectiveness of goal completion as the percentage of goals successfully achieved within a predetermined maximum number of turns\. The Sale\-to\-List Ratio \(SL\) is used to evaluate buyers’ transaction outcomes\. A higher SL value indicates greater buyer benefit from the transaction; if the transaction fails, SL is recorded as 0\. It is defined as $\mathrm{SL}\% = \frac{\text{deal price} - \text{seller target price}}{\text{buyer target price} - \text{seller target price}}$\. Additionally, we adopt the Soft Success Rate \(SSR\) evaluation method proposed by LDPP He et al\. \([2025a](https://arxiv.org/html/2606.13683#bib.bib40)\) to further assess the effectiveness of UP\-NRPA\. SSR complements SR: SR maps the final\-turn reward of each dialogue to a binary success or failure, whereas SSR directly averages the final\-turn rewards\. Taking P4G as an example, persuasion outcomes are scored as: refused $\rightarrow$ \-1\.0, neutral $\rightarrow$ \-0\.5, positive inclination $\rightarrow$ 0\.1, and agreed to donate $\rightarrow$ 1\.0\. Detailed information about the ESConv task is provided in Appendix C\.
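The four metrics above can be computed as in the following sketch\. It is a minimal illustration, assuming that AT is averaged over all evaluated dialogues and that, for P4G, a dialogue counts as successful only when the final outcome is "agreed to donate"; the function names and the example values are hypothetical, while the SL formula and the P4G reward mapping are taken from the text\.

```python
# Final-turn reward mapping for P4G, as listed in the text.
P4G_REWARD = {
    "refused": -1.0,
    "neutral": -0.5,
    "positive inclination": 0.1,
    "agreed to donate": 1.0,
}


def average_turns(turn_counts):
    """AT: mean number of dialogue turns (assumed to cover all dialogues)."""
    return sum(turn_counts) / len(turn_counts)


def success_rate(outcomes, success_label="agreed to donate"):
    """SR: fraction of dialogues whose final outcome is a success.

    Treating only `success_label` as success is an assumption.
    """
    return sum(1 for o in outcomes if o == success_label) / len(outcomes)


def soft_success_rate(outcomes, reward_map=P4G_REWARD):
    """SSR: plain average of the final-turn rewards, without binarizing."""
    return sum(reward_map[o] for o in outcomes) / len(outcomes)


def sale_to_list(deal_price, buyer_target, seller_target):
    """SL: how far the deal moved from the seller's target toward the buyer's.

    Returns 0.0 when no deal was reached (deal_price is None).
    """
    if deal_price is None:
        return 0.0
    return (deal_price - seller_target) / (buyer_target - seller_target)


if __name__ == "__main__":
    outcomes = ["agreed to donate", "neutral", "agreed to donate", "refused"]
    print(success_rate(outcomes))       # 0.5
    print(soft_success_rate(outcomes))  # 0.125
    print(sale_to_list(70.0, buyer_target=60.0, seller_target=100.0))  # 0.75
```

The example shows why SSR is informative alongside SR: two of four dialogues succeed, giving SR = 0\.5, but the refused and neutral outcomes pull the averaged reward down to 0\.125\.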
### 5\.3 Baselines
Dialogue models based on fine\-tuning technology are represented by DialoGPTZhanget al\.\([2020](https://arxiv.org/html/2606.13683#bib.bib33)\), a pre\-trained dialogue generation model whose core function is to automatically generate natural, coherent, and information\-rich responses given a dialogue context\. Prompt engineering approaches, such as Standard PromptHeet al\.\([2024](https://arxiv.org/html/2606.13683#bib.bib37)\), drive LLMs to generate responses through foundational prompts; ProactiveDenget al\.\([2023b](https://arxiv.org/html/2606.13683#bib.bib39)\)and ProCoTDenget al\.\([2023b](https://arxiv.org/html/2606.13683#bib.bib39)\)introduce explicit goal planning chains within prompts, Ask\-an\-ExpertZhanget al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib34)\)simulates expert standard reasoning strategies through predefined prompts, while ICL\-AIFFuet al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib35)\)generates text feedback for context learning without parameter updates via model self\-play\. GDPZeroYuet al\.\([2023](https://arxiv.org/html/2606.13683#bib.bib42)\)employs MCTS to find optimal solutions\. Offline reinforcement learning\-based approaches PPDPPDenget al\.\([2024b](https://arxiv.org/html/2606.13683#bib.bib36)\)and DPDPHeet al\.\([2024](https://arxiv.org/html/2606.13683#bib.bib37)\)combine offline reinforcement learning with real\-time MCTS search optimization\. TRIPZhanget al\.\([2024](https://arxiv.org/html/2606.13683#bib.bib41)\)incorporates user portraits and Theory\-of\-Mind \(ToM\) to simulate more realistic scenarios\. UDPHeet al\.\([2025b](https://arxiv.org/html/2606.13683#bib.bib38)\)builds upon TRIP by using diffusion models to construct user portraits and predicting user feedback via a Brownian\-bridge mechanism\. 
LDPPHeet al\.\([2025a](https://arxiv.org/html/2606.13683#bib.bib40)\)employs a Variational Autoencoder \(VAE\) to extract latent strategies from real dialogues, then offline\-trains a hierarchical strategy planner within this latent space\. DialogXpertRakibet al\.\([2025](https://arxiv.org/html/2606.13683#bib.bib32)\)employs a Q\-network trained on BERT embeddings for rapid optimal decision\-making\.
Figure 2: Comparison of different methods’ performance on SSR using ESConv and P4G\. Panels: \(a\) ESConv, \(b\) P4G\.
## 6 Experiments and Analysis
### 6\.1 Main Results
Table [1](https://arxiv.org/html/2606.13683#S2.T1) presents performance comparisons of various methods across multiple challenging dialogue planning benchmarks\. We evaluated the proposed UP\-NRPA planner on both collaborative dialogue planning benchmarks \(ESConv, ExTES\) and non\-collaborative dialogue planning benchmarks \(CB, P4G\)\. The comparison covers diverse baseline approaches, including various planners and recent policy\-based language model methods\. The UP\-NRPA planner was tested using Qwen2\.5 14B and GPT\-4o\-mini backbone models\. The experimental setup follows the user portrait construction approach of the TRIP method and employs a user simulator with resistance policies on non\-collaborative tasks to construct complex, realistic interaction scenarios\. Although UP\-NRPA performed slightly worse on GPT\-4o\-mini than on Qwen2\.5 14B, it achieved the highest results on SL, improving by approximately 41\.22% over NRPA\-GD at level 2\. In contrast, DialogXpert was evaluated on Qwen2\.5 14B in a relatively simpler environment\. Our approach achieves exceptionally high SR across both tasks\. For CB, while its AT only slightly outperforms that of DialogXpert under more complex settings, our method demonstrates superior performance on the critical negotiation metrics SR and SL\. It achieves an SR of 1\.0000 at both levels and elevates SL from 0\.4389 to 0\.6856 at level 2\. In the ESConv task, it still maintained an SR of 1\.0000, despite slightly more dialogue turns\.
Table [2](https://arxiv.org/html/2606.13683#S4.T2) presents experimental results for the P4G and ExTES tasks, demonstrating performance differences among various methods\. On the P4G task, the TRIP method using the backbone GPT\-3\.5 achieves AT = 8\.20 and SR = 0\.495\. Meanwhile, the UDP method using the backbone GPT\-4o\-mini achieves AT = 7\.705 and SR = 0\.598 on P4G\. As the current state\-of\-the\-art baseline, DialogXpert achieves AT = 3\.34 and SR = 0\.9129 on the P4G task using Qwen2\.5 14B, and attains AT = 2\.57 and SR = 0\.9782 on the ExTES task\. Our proposed UP\-NRPA planner, based on Qwen2\.5 14B, achieves SR = 0\.9184 and AT = 3\.40 for Level 1 on the P4G task, and SR = 1\.0000 with AT = 3\.29 on the ExTES task\. Level 2 further optimizes AT to 3\.12 and improves SR to 0\.9849 on the P4G task; it maintains SR = 1\.0000 while achieving AT = 2\.69 on the ExTES task\. Compared to methods like TRIP and UDP that utilize user portraits, our UP\-NRPA method demonstrates better performance across all key metrics\. Compared to DialogXpert, which does not use user portraits, our method achieves higher SR values, indicating that user portraits contribute to improved dialogue generation performance\.
We compared the SSR performance of different methods on ESConv and P4G, as shown in Figure [2](https://arxiv.org/html/2606.13683#S5.F2)\. On ESConv, UP\-NRPA achieved an SSR of 0\.798, outperforming the previously well\-performing UDP \(0\.774\) and TRIP \(0\.744\)\. On the non\-collaborative task P4G, UP\-NRPA demonstrated even more pronounced advantages, achieving an SSR of 0\.958 and significantly outperforming all comparison baselines, including LDPP \(0\.733\)\. This indicates that in non\-cooperative tasks, the method can stably steer conversations toward desired reward objectives, substantially enhancing the overall performance and task completion quality of dialogue systems\.


Figure 3: Performance of non\-collaborative tasks at different iteration N in the level 1 of UP\-NRPA\.
### 6\.2 Ablation Study
This section analyzes the impact of iteration N on UP\-NRPA through ablation studies to validate the algorithm’s performance in both collaborative and non\-collaborative tasks\.
#### 6\.2\.1 Analysis of Iteration N
The ablation results for UP\-NRPA level 1 on two non\-collaborative tasks are shown in Figure[3](https://arxiv.org/html/2606.13683#S6.F3)\. The overall performance of the UP\-NRPA planner exhibits a significant positive correlation with its iteration N\. As N increases, AT shows a substantial downward trend across both tasks, while SR and SL metrics steadily improve\. At N = 5, the SR values on CB and P4G were 0\.7234 and 0\.7542, respectively\. As N increased to 20, the SR values reached 0\.9096 and 0\.9750, respectively\. This validates that increasing the number of iterations for UP\-NRPA effectively enhances the model’s exploration capability within the search space, thereby achieving higher\-quality negotiation outcomes within fewer dialogue turns\.
For collaborative tasks, as shown in Table [3](https://arxiv.org/html/2606.13683#S6.T3), we conducted ablation experiments on both the ESConv and ExTES tasks\. Similar to the non\-collaborative tasks in Figure [3](https://arxiv.org/html/2606.13683#S6.F3), the number of iterations N also positively impacts performance\. On the ESConv task, UP\-NRPA already achieved an SR of 1\.0000 at N = 5\. As N increases, AT drops to 3\.76 at N = 10, yielding the optimal test result\. As an extension of the ESConv dataset, ExTES exhibits similar behavior, achieving SR $\geq$ 0\.98 as early as level 1\. We observe that N = 10 maintains high SR values while avoiding excessive simulation time consumption\. This demonstrates that UP\-NRPA significantly enhances the success rate of collaborative dialogues\.
Table 3: Performance of collaborative tasks at different iteration N in the level 1 of UP\-NRPA\.

Figure 4: Human evaluation results on ESConv and CraigslistBargain\. Panels: \(a\) ESConv, \(b\) CraigslistBargain\.
### 6\.3 Human Evaluation
We conducted human evaluations of responses generated by UP\-NRPA and NRPA\-GD\. Following prior research Deng et al\. \([2024b](https://arxiv.org/html/2606.13683#bib.bib36)\); Wang et al\. \([2025a](https://arxiv.org/html/2606.13683#bib.bib31)\), 50 test samples were randomly selected from ESConv and CB\. Three annotators independently assessed responses generated on ESConv across four dimensions: Suggestion \(Sug\., comparing the quality of suggestions\), Identification \(Ide\., comparing the proactivity in addressing emotional issues\), Comforting \(Com\., comparing the quality of comfort\), and Overall \(Ove\.\)\. For responses generated on CB, the dimensions are Effectiveness \(Eff\., comparing the effectiveness in achieving negotiation outcomes\), Negotiation \(Neg\., comparing the strength of negotiation skills and tactics\), and Overall \(Ove\., comparing the overall negotiation capabilities\)\. As shown in Figure [5](https://arxiv.org/html/2606.13683#S2.F5), on the ESConv task, UP\-NRPA demonstrated superior ability to provide high\-quality suggestions, outperforming NRPA\-GD in Overall\. On the CB task, UP\-NRPA surpassed NRPA\-GD across all metrics\. Consistent with the results in Table [1](https://arxiv.org/html/2606.13683#S2.T1), UP\-NRPA particularly excelled in SL, indicating that it outperforms NRPA\-GD in negotiation\. This further validates the effectiveness of UP\-NRPA in both collaborative and non\-collaborative dialogue tasks\. We also compared UP\-NRPA’s level 1 \(N = 20\) and level 2 \(N = 5\) on the P4G and ExTES tasks to evaluate performance differences across levels\. Detailed human evaluation and results are provided in Appendix B\.
## 7 Conclusion
This paper introduces the UP\-NRPA framework to address two limitations of existing dialogue strategy planning systems: difficulty adapting to diverse user tasks and low success rates\. By integrating user modeling with strategy adaptation, the framework enables real\-time policy adjustment without relying on offline reinforcement learning\. During conversations, UP\-NRPA optimizes dialogue strategies in real time in response to user feedback, enhancing the performance of dialogue agents in both collaborative and non\-collaborative tasks, with particular strengths in negotiation scenarios\. This framework offers a viable solution for developing dialogue systems adaptable to diverse user types\. For future work, we plan to further optimize computational efficiency in complex dialogue scenarios and to extend the framework to multimodal dialogue environments\.
## Acknowledgments
The authors acknowledge the financial support of the National Natural Science Foundation of China, No\. 62236002, and the Hefei Key Science and Technology Special Projects under Grant 2024SZD006\.
## References
- A\. Algherairy and M\. Ahmed \(2024\)A review of dialogue systems: current trends and future directions\.Neural Computing and Applications36\(12\),pp\. 6325–6351\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p1.1)\.
- M\. Chen, X\. Yu, W\. Shi, U\. Awasthi, and Z\. Yu \(2023\)Controllable mixed\-initiative dialogue generation through prompting\.arXiv preprint arXiv:2305\.04147\.Cited by:[§2](https://arxiv.org/html/2606.13683#S2.p1.1)\.
- Y\. Deng, W\. Lei, M\. Huang, and T\. Chua \(2023a\)Rethinking conversational agents in the era of llms: proactivity, non\-collaborativity, and beyond\.InProceedings of the Annual international ACM SIGIR conference on research and development in information retrieval in the Asia Pacific region,pp\. 298–301\.Cited by:[§2](https://arxiv.org/html/2606.13683#S2.p1.1)\.
- Y\. Deng, L\. Liao, L\. Chen, H\. Wang, W\. Lei, and T\. Chua \(2023b\)Prompting and evaluating large language models for proactive dialogues: clarification, target\-guided, and non\-collaboration\.arXiv preprint arXiv:2305\.13626\.Cited by:[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.10.5.1),[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.12.7.1),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.7.2.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- Y\. Deng, L\. Liao, W\. Lei, G\. H\. Yang, W\. Lam, and T\. Chua \(2025\)Proactive conversational ai: a comprehensive survey of advancements and opportunities\.ACM Transactions on Information Systems43\(3\),pp\. 1–45\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p1.1)\.
- Y\. Deng, L\. Liao, Z\. Zheng, G\. H\. Yang, and T\. Chua \(2024a\)Towards human\-centered proactive conversational agents\.InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval,pp\. 807–818\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p1.1),[§1](https://arxiv.org/html/2606.13683#S1.p2.1)\.
- Y\. Deng, W\. Zhang, W\. Lam, S\. Ng, and T\. Chua \(2024b\)Plug\-and\-play policy planner for large language model powered dialogue agents\.InICLR,Cited by:[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.11.6.1),[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.13.8.1),[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.15.10.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.11.6.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1),[§6\.3](https://arxiv.org/html/2606.13683#S6.SS3.p1.1)\.
- R\. Dutt, S\. Sinha, R\. Joshi, S\. S\. Chakraborty, M\. Riggs, X\. Yan, H\. Bao, and C\. Rose \(2021\)Resper: computationally modelling resisting strategies in persuasive conversations\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,pp\. 78–90\.Cited by:[§3\.1](https://arxiv.org/html/2606.13683#S3.SS1.p1.1)\.
- Y\. Fu, H\. Peng, T\. Khot, and M\. Lapata \(2023\)Improving language model negotiation with self\-play and in\-context learning from ai feedback\.arXiv preprint arXiv:2305\.10142\.Cited by:[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.14.9.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.8.3.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- L\. R\. Goldberg \(1992\)The development of markers for the big\-five factor structure\.\.Psychological assessment4\(1\),pp\. 26\.Cited by:[§3\.1](https://arxiv.org/html/2606.13683#S3.SS1.p1.1)\.
- H\. He, D\. Chen, A\. Balakrishnan, and P\. Liang \(2018\)Decoupling strategy and generation in negotiation dialogues\.arXiv preprint arXiv:1808\.09637\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p2.1),[§3](https://arxiv.org/html/2606.13683#S3.p1.1),[§5\.1](https://arxiv.org/html/2606.13683#S5.SS1.p1.1)\.
- T\. He, L\. Liao, Y\. Cao, Y\. Liu, M\. Liu, Z\. Chen, and B\. Qin \(2024\)Planning like human: a dual\-process framework for dialogue planning\.arXiv preprint arXiv:2406\.05374\.Cited by:[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.18.13.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[§5\.2](https://arxiv.org/html/2606.13683#S5.SS2.p1.4),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- T\. He, L\. Liao, Y\. Cao, Y\. Liu, Y\. Sun, Z\. Chen, M\. Liu, and B\. Qin \(2025a\)Simulation\-free hierarchical latent policy planning for proactive dialogues\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 24032–24040\.Cited by:[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.15.10.1),[§5\.2](https://arxiv.org/html/2606.13683#S5.SS2.p1.4),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- T\. He, L\. Liao, M\. Liu, and B\. Qin \(2025b\)Simulating before planning: constructing intrinsic user world model for user\-tailored dialogue policy planning\.arXiv preprint arXiv:2504\.13643\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p3.1),[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.23.18.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.12.7.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- X\. Li \(2024\)A review of prominent paradigms for llm\-based agents: tool use \(including rag\), planning, and feedback learning\.arXiv preprint arXiv:2406\.05804\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p1.1)\.
- S\. Liu, C\. Zheng, O\. Demasi, S\. Sabour, Y\. Li, Z\. Yu, Y\. Jiang, and M\. Huang \(2021\)Towards emotional support dialog systems\.arXiv preprint arXiv:2106\.01144\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p2.1),[§3](https://arxiv.org/html/2606.13683#S3.p1.1),[§5\.1](https://arxiv.org/html/2606.13683#S5.SS1.p1.1)\.
- T\. B\. A\. Rakib, A\. Mehrish, L\. Soon, W\. H\. Lim, and S\. Poria \(2025\)DialogXpert: driving intelligent and emotion\-aware conversations through online value\-based reinforcement learning with llm priors\.arXiv preprint arXiv:2505\.17795\.Cited by:[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.28.23.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[§4\.1](https://arxiv.org/html/2606.13683#S4.SS1.p1.13),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.18.13.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- S\. G\. Scott and R\. A\. Bruce \(1995\)Decision\-making style: the development and assessment of a new measure\.Educational and psychological measurement55\(5\),pp\. 818–831\.Cited by:[§3\.1](https://arxiv.org/html/2606.13683#S3.SS1.p1.1)\.
- O\. Synekop, I\. Lytovchenko, Y\. Lavrysh, and V\. Lukianenko \(2024\)Use of chat gpt in english for engineering classes: are students’ and teachers’ views on its opportunities and challenges similar?\.International Journal of Interactive Mobile Technologies18\(3\)\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p1.1)\.
- H\. Wang, F\. Zhang, X\. Zhang, and C\. Mu \(2025a\)A general highly accurate online planning method integrating large language models into nested rollout policy adaptation for dialogue tasks\.arXiv preprint arXiv:2511\.21706\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p4.1),[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.26.21.1),[§4\.1](https://arxiv.org/html/2606.13683#S4.SS1.p1.13),[§6\.3](https://arxiv.org/html/2606.13683#S6.SS3.p1.1)\.
- H\. Wang, X\. Zhang, and C\. Mu \(2025b\)Planning of heuristics: strategic planning on large language models with monte carlo tree search for automating heuristic optimization\.arXiv preprint arXiv:2502\.11422\.Cited by:[§2](https://arxiv.org/html/2606.13683#S2.p1.1)\.
- X\. Wang, W\. Shi, R\. Kim, Y\. Oh, S\. Yang, J\. Zhang, and Z\. Yu \(2019\)Persuasion for good: towards a personalized persuasive dialogue system for social good\.arXiv preprint arXiv:1906\.06725\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p2.1),[§3](https://arxiv.org/html/2606.13683#S3.p1.1),[§5\.1](https://arxiv.org/html/2606.13683#S5.SS1.p1.1)\.
- X\. Yu, M\. Chen, and Z\. Yu \(2023\)Prompt\-based monte\-carlo tree search for goal\-oriented dialogue policy planning\.arXiv preprint arXiv:2305\.13660\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p2.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.9.4.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- Q\. Zhang, J\. Naradowsky, and Y\. Miyao \(2023\)Ask an expert: leveraging language models to improve strategic reasoning in goal\-oriented dialogue models\.arXiv preprint arXiv:2305\.17878\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p2.1),[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.8.3.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- T\. Zhang, C\. Huang, Y\. Deng, H\. Liang, J\. Liu, Z\. Wen, W\. Lei, and T\. Chua \(2024\)Strength lies in differences\! improving strategy planning for non\-collaborative dialogues via diversified user simulation\.arXiv preprint arXiv:2403\.06769\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p3.1),[§2](https://arxiv.org/html/2606.13683#S2.p1.1),[§3\.1](https://arxiv.org/html/2606.13683#S3.SS1.p1.1),[§4\.1](https://arxiv.org/html/2606.13683#S4.SS1.p1.13),[Table 2](https://arxiv.org/html/2606.13683#S4.T2.4.10.5.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- Y\. Zhang, S\. Sun, M\. Galley, Y\. Chen, C\. Brockett, X\. Gao, J\. Gao, J\. Liu, and W\. B\. Dolan \(2020\)Dialogpt: large\-scale generative pre\-training for conversational response generation\.InProceedings of the 58th annual meeting of the association for computational linguistics: system demonstrations,pp\. 270–278\.Cited by:[Table 1](https://arxiv.org/html/2606.13683#S2.T1.5.7.2.1),[§5\.3](https://arxiv.org/html/2606.13683#S5.SS3.p1.1)\.
- Z\. Zheng, L\. Liao, Y\. Deng, and L\. Nie \(2023\)Building emotional support chatbots in the era of llms\.arXiv preprint arXiv:2308\.11584\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p2.1),[§3](https://arxiv.org/html/2606.13683#S3.p1.1),[§5\.1](https://arxiv.org/html/2606.13683#S5.SS1.p1.1)\.
- H\. Zhou, C\. Hu, Y\. Yuan, Y\. Cui, Y\. Jin, C\. Chen, H\. Wu, D\. Yuan, L\. Jiang, D\. Wu,et al\.\(2024\)Large language model \(llm\) for telecommunications: a comprehensive survey on principles, key techniques, and opportunities\.IEEE Communications Surveys & Tutorials\.Cited by:[§1](https://arxiv.org/html/2606.13683#S1.p1.1)\.
## A Implementation Details
#### A\.0\.1 Experimental Details
In our experimental setup, we employed prompts from TRIP and NRPA\-GD and implemented the framework on top of NRPA\-GD\. To ensure comparability and consistency across experiments, we followed established practice for dataset selection: the ESConv, P4G, ExTES, and CraigslistBargain datasets strictly follow DialogXpert’s test set divisions\. For model configuration, both the dialogue system and the user model employed Qwen2\.5\-14B and GPT\-4o\-mini as backbone models\. Critical hyperparameters such as temperature were kept identical to NRPA\-GD’s original configuration to ensure fairness and reproducibility of experimental results\.
#### A\.0\.2 Nested Rollout Policy Adaptation
The NRPA algorithm modifies the sampling mechanism in the MCTS simulation phase by integrating online policy learning within the recursive framework of Nested Monte Carlo Search \(NMCS\)\. Traditional methods employ fixed probabilities or predefined rules for rollouts, whereas NRPA maps the action space to weighted parameters and uses a Boltzmann distribution to generate action probabilities\. At each nested level, the algorithm uses high\-reward sequences obtained through search as supervisory signals\. It then adjusts the parameter distribution via gradient ascent, biasing sampling toward historically optimal paths\. This mechanism eliminates the need for explicit node storage, instead guiding search space convergence through weight evolution\. The simulation policy thus shifts from a static distribution to dynamic, feedback\-driven adaptation\. Specifically, let the state at step $t$ be $s_t$, and let the set of legal actions be denoted as $\mathcal{A}(s_t)$\. We parameterize the strategy as a vector $\pi \in \mathbf{R}^{|\mathcal{A}|}$, where the component $\pi(a)$ corresponds directly to the weight of action $a$\. Given the optimal sequence of actions $(a_1, a_2, \dots, a_T)$ from a high\-scoring rollout, the following updates are performed for each step $t$:
Calculate the softmax normalization factor\.
$$z = \sum_{a' \in \mathcal{A}} e^{\pi(a')} \tag{1}$$
Calculate the probability of each action\.
$$P(a) = \frac{e^{\pi(a)}}{z} \tag{2}$$
Update the weights for all actions $a' \in \mathcal{A}$, and add an extra $\alpha$ to the optimal action $a$\.
$$\begin{cases} \pi(a') \leftarrow \pi(a') - \alpha \cdot \frac{1}{z} e^{\pi(a')}, & \forall a' \in \mathcal{A} \\ \pi(a) \leftarrow \pi(a) + \alpha \end{cases} \tag{3}$$
The net increase in weight of the optimal action $a$ is $\alpha - \alpha \cdot \frac{1}{z} e^{\pi(a)} = \alpha (1 - P(a))$, and the net decrease in weight of the remaining actions is $\alpha \cdot \frac{1}{z} e^{\pi(a')} = \alpha \cdot P(a')$, which transforms the original random simulation that was performed blindly into an adaptive sampling that continuously concentrates on the optimal direction\.
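The update in Eqs\. \(1\)\-\(3\) and the nested search around it can be sketched in a few lines of Python\. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy action names `a0`, `a1`, `a2`, and the toy reward are hypothetical, the policy is a single state\-independent weight vector as in the text, and applying the update sequentially for each step of the best sequence is one possible reading of "for each step $t$"\. In the actual framework the rollout reward would come from LLM\-simulated dialogues\.

```python
import math
import random


def action_probs(policy):
    """Boltzmann (softmax) distribution over the action weights, Eqs. (1)-(2)."""
    z = sum(math.exp(w) for w in policy.values())            # Eq. (1)
    return {a: math.exp(w) / z for a, w in policy.items()}   # Eq. (2)


def adapt_step(policy, best_action, alpha=1.0):
    """Apply Eq. (3) once: lower every weight by alpha * P(a'), then add alpha to best_action."""
    probs = action_probs(policy)  # probabilities from the weights before the update
    updated = {a: w - alpha * probs[a] for a, w in policy.items()}
    updated[best_action] += alpha
    return updated


def adapt(policy, best_sequence, alpha=1.0):
    """ASSUMPTION: the update is applied sequentially for each step of the best sequence."""
    for action in best_sequence:
        policy = adapt_step(policy, action, alpha)
    return policy


def rollout(policy, horizon, reward_fn, rng):
    """Level 0: sample an action sequence from the current policy and score it."""
    probs = action_probs(policy)
    actions = list(probs)
    seq = rng.choices(actions, weights=[probs[a] for a in actions], k=horizon)
    return reward_fn(seq), seq


def nrpa(level, policy, n_iter, horizon, reward_fn, rng, alpha=1.0):
    """Each level runs n_iter lower-level searches and adapts its policy toward the best sequence."""
    if level == 0:
        return rollout(policy, horizon, reward_fn, rng)
    best_score, best_seq = -math.inf, []
    for _ in range(n_iter):
        score, seq = nrpa(level - 1, dict(policy), n_iter, horizon, reward_fn, rng, alpha)
        if score >= best_score:
            best_score, best_seq = score, seq
        policy = adapt(policy, best_seq, alpha)
    return best_score, best_seq


# Uniform weights, best action "a0", alpha = 1: every action has P = 1/3.
uniform = {"a0": 0.0, "a1": 0.0, "a2": 0.0}
print({a: round(w, 3) for a, w in adapt_step(uniform, "a0").items()})
# {'a0': 0.667, 'a1': -0.333, 'a2': -0.333}

# Toy reward (hypothetical): how often "a0" appears in the sampled sequence.
score, seq = nrpa(1, uniform, n_iter=10, horizon=4,
                  reward_fn=lambda s: s.count("a0"), rng=random.Random(0))
print(score, seq)
```

In the uniform case the best action gains $\alpha(1 - P(a)) = 2/3$ and each other action loses $\alpha P(a') = 1/3$, so the weights still sum to zero after the update, matching the net changes derived above\.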
## B Human Evaluation Details
To evaluate the quality of responses generated by our model, we organized a controlled human evaluation in accordance with LDPP protocols, inviting three expert annotators with backgrounds in natural language processing and computer science to participate\. Each annotator reviewed 50 dialogue scenarios\. Preference results for each dimension were aggregated item by item through majority voting among the three annotators\. This evaluation process ensures that our assessment of the quality of both emotional support dialogues and negotiation dialogues is grounded in professional expertise\. As shown in Figure [5](https://arxiv.org/html/2606.13683#S2.F5), we compared different levels of UP\-NRPA\. Level 2 demonstrated superiority in two tasks, but its effect was not significant in the emotional support task\.
Figure 5: Human evaluation results on \(a\) ExTES and \(b\) P4G\.
As for ExTES, we measure four main metrics of the generated dialogues as follows:

- Identification: Which assistant is more helpful in exploring and identifying the problem?
- Comforting: Which assistant is more skilled at comforting you?
- Suggestion: Which assistant provides more helpful suggestions for solving the problem?
- Overall: Which assistant can better solve the patient’s problem?

As for P4G, we measure three main metrics of the responses as follows:

- Informative: Which assistant’s introduction to the charity was more engaging?
- Persuasive: Which assistant takes the more persuasive approaches?
- Overall: Which assistant has stronger persuasive capabilities?
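
The item\-by\-item majority voting described in this appendix could be implemented along the following lines\. This is a hedged sketch: the vote labels \(`"UP-NRPA"`, `"NRPA-GD"`, `"tie"`\), the function names, and the rule that an item with no majority counts as a tie are assumptions made for illustration, since the text does not specify them\.

```python
from collections import Counter


def majority_preference(votes):
    """Return the label chosen by more than half of the annotators.

    ASSUMPTION: if no label has a strict majority (e.g. three different
    votes), the item is recorded as a tie.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "tie"


def aggregate_dimension(items):
    """Aggregate per-item votes for one dimension into preference shares."""
    outcomes = Counter(majority_preference(votes) for votes in items)
    total = len(items)
    return {label: n / total for label, n in outcomes.items()}


# Hypothetical votes from three annotators on four items of one dimension.
overall_votes = [
    ["UP-NRPA", "UP-NRPA", "NRPA-GD"],
    ["UP-NRPA", "tie", "NRPA-GD"],
    ["NRPA-GD", "NRPA-GD", "NRPA-GD"],
    ["UP-NRPA", "UP-NRPA", "UP-NRPA"],
]
print(aggregate_dimension(overall_votes))
# {'UP-NRPA': 0.5, 'tie': 0.25, 'NRPA-GD': 0.25}
```

Running the same aggregation once per dimension \(for example Identification, Comforting, Suggestion, and Overall on ExTES\) yields the per\-dimension preference shares that a figure of this kind reports\.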
Table 4: The resisting strategies for P4G and CB tasks\.
## C Detailed Evaluation of Soft Success Rate
Following the soft success rate \(SSR\) proposed by LDPP, we report two related metrics: SR computes the success rate by mapping final\-round rewards to binary values of 0 or 1, while SSR directly averages all final\-round rewards\. We therefore regard SSR as a soft success rate metric\. Each dataset employs a task\-specific reward mapping scheme\.
- P4G: Persuasion success is rated as $\text{refused} \rightarrow -1.0$, $\text{neutral} \rightarrow -0.5$, $\text{positive inclination} \rightarrow 0.1$, and $\text{agreed to donate} \rightarrow 1.0$\.
- ESConv: Emotion trajectories are scored as $\text{worse} \rightarrow -1.0$, $\text{same} \rightarrow -0.5$, $\text{better} \rightarrow 0.5$, and $\text{solved} \rightarrow 1.0$\.
These mappings enable consistent supervision across diverse tasks while adapting to domain\-specific success criteria\.
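
The difference between SR and SSR can be made concrete with a short Python sketch that uses the reward mappings listed above\. The function names are hypothetical, and the binarization rule for SR \(an episode counts as successful only when its final\-round reward reaches 1\.0\) is an assumption, since the text states only that rewards are mapped to 0 or 1\.

```python
from statistics import mean

# Reward mappings as listed in this appendix.
P4G_REWARDS = {"refused": -1.0, "neutral": -0.5,
               "positive inclination": 0.1, "agreed to donate": 1.0}
ESCONV_REWARDS = {"worse": -1.0, "same": -0.5, "better": 0.5, "solved": 1.0}


def soft_success_rate(final_labels, reward_map):
    """SSR: average of the final-round rewards over all dialogues."""
    return mean(reward_map[label] for label in final_labels)


def success_rate(final_labels, reward_map, threshold=1.0):
    """SR: share of dialogues whose final-round reward is binarized to 1.

    ASSUMPTION: a dialogue is a success only if its reward >= threshold.
    """
    return mean(1.0 if reward_map[label] >= threshold else 0.0
                for label in final_labels)


# Hypothetical final-round outcomes of four P4G dialogues.
outcomes = ["agreed to donate", "neutral", "refused", "agreed to donate"]
print(success_rate(outcomes, P4G_REWARDS))       # 0.5
print(soft_success_rate(outcomes, P4G_REWARDS))  # 0.125
```

In this example both metrics see the same two successful dialogues, but SSR also penalizes the refused and neutral outcomes, which is why it can separate systems that share the same SR\.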
## D Dataset Details
- ESConv: Emotional support and therapy\. The goal, as a therapist, is to help the patient resolve their emotional issues\.
- CB: Price negotiation\. Role\-playing as the buyer in the conversation, the goal is to buy a given product at a price as close as possible to the buyer’s target price in order to maximize profit\.
- ExTES: Emotional support and therapy\. Similar to ESConv but more diverse and larger in sample size\. The goal, as a therapist, is to help the patient resolve their emotional issues\.
- P4G: Persuasion for donation\. The goal, as a role player, is to persuade a persuadee to donate to a charity called “Save the Children”\.
## E Resisting Strategy
We employ the resisting strategies proposed in the TRIP study to simulate users’ non\-cooperative behavior\. Table [4](https://arxiv.org/html/2606.13683#S2.T4) provides detailed descriptions of these resisting strategies\.
## F Comprehensive Prompting
By integrating character descriptions with resistance strategies, we constructed a comprehensive prompt framework for our user simulator\. Specifically, our prompts comprise two components: task context and dialogue history\. Within the task context, we guide the large language model to simulate designated characters through role\-playing instruction sets and resistance strategies\. Tables [7](https://arxiv.org/html/2606.13683#S6.T7), [8](https://arxiv.org/html/2606.13683#S6.T8) and [9](https://arxiv.org/html/2606.13683#S6.T9) respectively present complete user simulator prompt templates for the two task categories\.
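
The two\-part prompt structure described here \(task context plus dialogue history\) might be assembled as in the sketch below\. Everything in it is illustrative: the function name, the template wording, the persona text, and the placeholder strategy entry are hypothetical and are not the prompts from Tables 7 to 9\.

```python
def build_user_simulator_prompt(persona, resisting_strategies, dialogue_history):
    """Combine task context (role-play instructions, resisting strategies)
    with the dialogue history into one prompt string.

    All template wording here is a hypothetical stand-in for the real prompts.
    """
    strategy_lines = "\n".join(
        f"- {name}: {description}"
        for name, description in resisting_strategies.items()
    )
    task_context = (
        "You are role-playing the user described below.\n"
        f"Persona: {persona}\n"
        "You may resist the assistant using these strategies:\n"
        f"{strategy_lines}"
    )
    history = "\n".join(
        f"{speaker}: {utterance}" for speaker, utterance in dialogue_history
    )
    return f"{task_context}\n\nDialogue history:\n{history}\nUser:"


prompt = build_user_simulator_prompt(
    persona="A cautious buyer who dislikes pressure (hypothetical persona).",
    resisting_strategies={"example strategy": "placeholder description"},
    dialogue_history=[("Assistant", "Hello, how can I help?"),
                      ("User", "I am interested in the bike.")],
)
print(prompt)
```

Keeping the task context and the dialogue history as separate pieces makes it straightforward to reuse one persona and strategy set across turns while only the history grows\.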
Table 5: The prompt of user persona generation\.
Table 6: The prompt of user persona rephrase\.
Table 7: The comprehensive prompt of user simulators in the price negotiation task\.
Table 8: The comprehensive user simulator prompt for the charity persuasion task\.
Table 9: The comprehensive user simulator prompt for the emotional support task\.