Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging
Summary
This paper proposes a method to enhance target-guided proactive dialogue systems by jointly modeling user profiles and domain knowledge as conversational scenarios and employing intent-keyword bridging to predict future dialogue turns.
# Enhancing Target-Guided Proactive Dialogue Systems via Conversational Scenario Modeling and Intent-Keyword Bridging
Source: [https://arxiv.org/html/2605.11964](https://arxiv.org/html/2605.11964)
Maodong Li (1,2), Yancui Li (3), Fang Kong (1,2)
(1) School of Computer Science and Technology, Soochow University, China; (2) Jiangsu Key Lab of Language Computing, Suzhou 215123, China; (3) School of Computer and Information Engineering, Henan Normal University, China
{20254027002@stu, kongfang@}suda.edu.cn, liyancui@htu.edu.cn
###### Abstract
A target\-guided proactive dialogue system aims to steer conversations proactively toward pre\-defined targets, such as designated keywords or specific topics\. During guided conversations, dynamically modeling conversational scenarios and intent keywords to guide system utterance generation is beneficial; however, existing work largely overlooks this aspect, resulting in a mismatch with the dynamics of real\-world conversations\. In this paper, we jointly model user profiles and domain knowledge as conversational scenarios to introduce a scenario bias that dynamically influences system utterances, and employ intent\-keyword bridging to predict intent keywords for upcoming dialogue turns, providing higher\-level and more flexible guidance\. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent–keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target\-guided proactive dialogue systems, thereby narrowing the gap with real\-world interactions\.
Corresponding author: Fang Kong.
## 1 Introduction
Target-guided proactive dialogue systems are designed to proactively steer conversations toward pre-defined targets, such as designated keywords or specific topics Wang et al. ([2024](https://arxiv.org/html/2605.11964#bib.bib1)); Zhang et al. ([2025](https://arxiv.org/html/2605.11964#bib.bib8)); Kang et al. ([2026](https://arxiv.org/html/2605.11964#bib.bib41)). Compared to passively responding to users, proactive guidance better aligns with real-world interaction patterns and enhances user engagement Wu et al. ([2025a](https://arxiv.org/html/2605.11964#bib.bib9)). Such systems have long been a central focus in natural language processing and have been applied across diverse domains, such as recommendation systems, emotional dialogue, and medical consultations Dao et al. ([2024](https://arxiv.org/html/2605.11964#bib.bib10)); Xu et al. ([2024](https://arxiv.org/html/2605.11964#bib.bib12)); Hao and Kong ([2025](https://arxiv.org/html/2605.11964#bib.bib11)); Wu et al. ([2025b](https://arxiv.org/html/2605.11964#bib.bib40)).
Figure 1: An example of a target-guided proactive dialogue system, where the red highlights denote snippets relevant to the current dialogue, which we analyze in detail in App. [H](https://arxiv.org/html/2605.11964#A8).
Figure [1](https://arxiv.org/html/2605.11964#S1.F1) presents an example of a target-guided proactive dialogue system. When a user produces a new utterance, the system predicts intent keywords for the next few steps based on the dialogue context, represented as a sequence (e.g., Music Recommendation – Piano of Sorrow → Play Music – Piano of Sorrow → Say Goodbye). In this paper, we refer to these as intent keywords, which indicate the system's expected future behavior. At the same time, the system dynamically selects points of interest based on user profiles and domain knowledge, and correspondingly influences the system utterance to ensure alignment with the current context, thereby guiding the conversation toward achieving the pre-defined target while maintaining engagement. (We use the term "utterance" instead of "response" to better capture the proactive guidance in our setting.)
Most previous studies on target-guided proactive dialogue systems focus on static entity-based keyword planning Tang et al. ([2019](https://arxiv.org/html/2605.11964#bib.bib14)); Yang et al. ([2022](https://arxiv.org/html/2605.11964#bib.bib13)). These keywords indicate the entities that the next system utterance should focus on, but they provide limited information about the system's expected utterance. Subsequently, Wang et al. ([2023b](https://arxiv.org/html/2605.11964#bib.bib16)); Dao et al. ([2023](https://arxiv.org/html/2605.11964#bib.bib15)) employ more semantic keywords to guide the generation of system utterances, which we refer to as intent keywords, as they indicate the system's expected behavior and offer higher-level guidance. Although Wang et al. ([2023a](https://arxiv.org/html/2605.11964#bib.bib2), [2024](https://arxiv.org/html/2605.11964#bib.bib1)) utilize intent keywords, they focus only on the next-turn intent keyword when guiding utterance generation, overlooking the fact that subsequent turns and their intent keywords are typically consistent and coherent. Considering the intent keywords of multiple subsequent turns simultaneously would provide greater flexibility. Furthermore, the system's utterances should remain consistent with user profiles and domain knowledge. Zhang et al. ([2025](https://arxiv.org/html/2605.11964#bib.bib8)) rely on an external model for this; instead, we jointly model user profiles and domain knowledge, maintaining such consistency to improve proactivity and enhance engagement.
To this end, we introduce the concept of conversational scenarios, which capture the conversational background of the current interaction, and intent–keyword bridging, which dynamically predicts intent keywords for the next few dialogue turns\. As shown in Figure[1](https://arxiv.org/html/2605.11964#S1.F1), we jointly model user profiles and domain knowledge as conversational scenarios and dynamically influence system utterances, ensuring that the generated utterances remain consistent with the ongoing scenarios\. Unlike static planning, our intent–keyword bridging dynamically predicts intent keywords for upcoming dialogue turns, providing higher\-level and more flexible guidance\. Extensive automatic and human evaluations demonstrate the effectiveness of conversational scenario modeling and intent–keyword bridging, yielding substantial improvements in proactivity, fluency, and informativeness for target\-guided proactive dialogue systems\. Our contributions are summarized as follows:
- We present conversational scenarios by jointly modeling user profiles and domain knowledge, introducing a bias that dynamically influences system utterances, thereby enabling more precise initiative and enhancing user engagement.
- We propose intent–keyword bridging to dynamically predict intent keywords for upcoming dialogue turns, offering higher-level and more flexible guidance. Our code is available at https://github.com/imaodong/EnTarget-Guided_Proactive_Dialog.
## 2 Methodology
Our framework is illustrated in Figure [2](https://arxiv.org/html/2605.11964#S2.F2). It comprises two components: Conversational Scenario Modeling (CSM) and Intent-Keyword Bridging (IKB). In the CSM, we jointly model user profiles and domain knowledge as conversational scenarios, introducing a bias that dynamically influences system utterances. This ensures that generated utterances remain consistent with the ongoing scenarios, enabling more precise initiative and enhancing user engagement. In the IKB, intent keywords for the next few turns are dynamically predicted based on the conversational scenario and dialogue history, capturing the expected behavior in the next turn and anticipating actions several steps ahead, thereby providing higher-level and flexible guidance.
### 2.1 Task Formulation and Notation
Suppose $D=\{(r^{(i)},h^{(i)},g^{(i)},S^{(i)},Z^{(i)0:m})\}_{i=1}^{N}$ is a target-guided dialogue dataset containing $N$ samples, where $r^{(i)}$ denotes the system utterance, $h^{(i)}$ denotes the dialogue history, and $g^{(i)}$ denotes the pre-defined dialogue target. $S^{(i)}=(u^{(i)},k^{(i)})$ represents the conversational scenario, comprising user profiles $u^{(i)}$ and domain knowledge $k^{(i)}$, and $Z^{(i)0:m}$ denotes the keyword bridging sequence, with $m$ representing the number of future-turn intent keywords, which is determined empirically. The task is to generate system utterances $r^{(i)}$ that guide the conversation toward achieving $g^{(i)}$, while maintaining proactivity, engagement, and naturalness.
### 2.2 Conversational Scenario Modeling
The conversational scenario captures the conversational background of the current interaction. To our knowledge, this work is the first to jointly model user profiles and domain knowledge as conversational scenarios, as they together determine the current interaction state for both the user and the system. The conversational scenario introduces a bias that dynamically influences system utterances, enabling the system to take more precise initiative, enhancing user engagement, and ensuring that generated utterances remain consistent with the ongoing scenario. Let $\text{Enc}(\cdot)$ denote the T5 encoder Chung et al. ([2022](https://arxiv.org/html/2605.11964#bib.bib3)) backbone; the modeling process can thus be formalized as follows:
$$\mathbf{b}=\text{Softmax}(\mathbf{B}\cdot(\mathcal{F}_{k}(\mathbf{H}^{k})+\mathcal{F}_{u}(\mathbf{H}^{u}))) \tag{1}$$
$$\mathbf{H}^{k},\mathbf{H}^{u},\mathbf{H}^{h}=\text{Enc}(k),\text{Enc}(u),\text{Enc}([h;g]) \tag{2}$$
where $\mathbf{H}^{k}\in\mathbb{R}^{l_{k}\times d}$, $\mathbf{H}^{u}\in\mathbb{R}^{l_{u}\times d}$, and $\mathbf{H}^{h}\in\mathbb{R}^{l_{h}\times d}$ denote the hidden states of domain knowledge, user profiles, and dialogue history, respectively. Here, $l_{k}$ and $l_{u}$ denote the lengths of the domain knowledge and user profiles, respectively, while $l_{h}$ denotes the length of the dialogue history and dialogue target. The variable $d$ represents the hidden dimension. $\mathcal{F}_{k}(\cdot)$ and $\mathcal{F}_{u}(\cdot)$ are mapping functions implemented using average pooling followed by a multi-layer perceptron. Then $\mathcal{F}_{k}(\mathbf{H}^{k})$ and $\mathcal{F}_{u}(\mathbf{H}^{u})$ jointly constitute the proposed conversational scenario bias $\mathbf{b}\in\mathbb{R}^{1\times\mathcal{V}}$ via element-wise addition. Here, $\mathbf{B}$ denotes trainable parameters, and $\mathcal{V}$ denotes the vocabulary size.
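As a rough illustration of Eqs. (1) and (2), the sketch below computes a scenario bias from already-encoded knowledge and profile states. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the two-layer ReLU perceptron, the parameter names (`Wk1`, `Wk2`, `Wu1`, `Wu2`), the random initialization, and the toy sizes are all hypothetical, and the T5 encoder outputs are replaced by random matrices.

```python
import math
import random

def mean_pool(H):
    """Average token hidden states H (l x d) into one d-dim vector."""
    return [sum(col) / len(H) for col in zip(*H)]

def linear(W, x):
    """Multiply matrix W (out x in) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(W1, W2, x):
    """Two-layer perceptron with ReLU; a stand-in for F_k / F_u (assumed shape)."""
    hidden = [max(0.0, v) for v in linear(W1, x)]
    return linear(W2, hidden)

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def scenario_bias(H_k, H_u, params):
    """Eq. (1): b = Softmax(B . (F_k(H^k) + F_u(H^u)))."""
    f_k = mlp(params["Wk1"], params["Wk2"], mean_pool(H_k))
    f_u = mlp(params["Wu1"], params["Wu2"], mean_pool(H_u))
    fused = [a + b for a, b in zip(f_k, f_u)]  # element-wise addition
    return softmax(linear(params["B"], fused))

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, vocab = 4, 6  # toy hidden size and vocabulary size
params = {name: rand_matrix(d, d, rng) for name in ("Wk1", "Wk2", "Wu1", "Wu2")}
params["B"] = rand_matrix(vocab, d, rng)
H_k = rand_matrix(5, d, rng)  # stand-in for Enc(k): 5 knowledge tokens
H_u = rand_matrix(3, d, rng)  # stand-in for Enc(u): 3 profile tokens
b = scenario_bias(H_k, H_u, params)
print(len(b), round(sum(b), 6))
```

The resulting vector has one entry per vocabulary item and sums to one, which matches the shape $\mathbf{b}\in\mathbb{R}^{1\times\mathcal{V}}$ stated above.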
### 2.3 Intent-Keyword Bridging
Intent–keyword bridging serves as a connection between the dialogue history and the system's future behaviors Sevegnani et al. ([2021](https://arxiv.org/html/2605.11964#bib.bib29)), and we employ it to dynamically predict intent keywords for upcoming turns. We use intent keywords that consist of both keyword-type and keyword-topic Liu et al. ([2021](https://arxiv.org/html/2605.11964#bib.bib24)). Let $A=\{a_{1},a_{2},\dots,a_{x_{a}}\}$ and $T=\{t_{1},t_{2},\dots,t_{x_{t}}\}$ denote the sets of keyword-types and keyword-topics, respectively, where $x_{a}$ and $x_{t}$ represent their corresponding cardinalities. We use $\zeta(\cdot)$ to denote the selection of the corresponding keyword-type/topic index, and the extraction of keyword-type/topic can be formalized as:
$$\mathbf{E}^{a},\mathbf{E}^{t}=\text{Emb}_{a}(\zeta(A)),\text{Emb}_{t}(\zeta(T)) \tag{3}$$
$$A=\text{CLS}_{a}(\text{IF}(\mathbf{H}^{h},\mathcal{F}_{k}(\mathbf{H}^{k}),\mathcal{F}_{u}(\mathbf{H}^{u}))) \tag{4}$$
$$T=\text{CLS}_{t}(\text{IF}(\mathbf{H}^{h},\mathcal{F}_{k}(\mathbf{H}^{k}),\mathcal{F}_{u}(\mathbf{H}^{u}))) \tag{5}$$
where $\mathbf{E}^{a}\in\mathbb{R}^{m\times d}$ and $\mathbf{E}^{t}\in\mathbb{R}^{m\times d}$ denote the embeddings of keyword-type and keyword-topic, respectively, and $m$ denotes the hyperparameter specifying the number of future turns to be predicted. $\text{CLS}_{a}(\cdot)$ and $\text{CLS}_{t}(\cdot)$ represent the classification heads for keyword-type and keyword-topic, respectively, and $\text{IF}(\cdot)$ denotes the information fusion mechanism, following Wang et al. ([2023a](https://arxiv.org/html/2605.11964#bib.bib2)). The resulting bridging intent keywords are formalized as follows:
$$\mathbf{H}^{z}=\text{CONCAT}(\mathbf{H}^{a};\mathbf{H}^{t}) \tag{6}$$
$$\mathbf{H}^{a},\mathbf{H}^{t}=\text{MP}(\mathbf{E}^{a}),\text{MP}(\mathbf{E}^{t}) \tag{7}$$
where $\mathbf{H}^{a}\in\mathbb{R}^{1\times d}$, $\mathbf{H}^{t}\in\mathbb{R}^{1\times d}$, and $\mathbf{H}^{z}\in\mathbb{R}^{2\times d}$ denote the hidden states of the keyword-type, keyword-topic, and intent keyword, respectively. $\text{MP}(\cdot)$ denotes the max-pooling operation. By dynamically predicting $m$ intent keywords and applying max pooling, we obtain the most relevant bridging intent keyword $\mathbf{H}^{z}$ for the next turn, considering the next $m$ turns. The final step in our framework, generating the system utterance, can be represented as follows (implementation details are provided in App. [B](https://arxiv.org/html/2605.11964#A2)):
$$r_{t}=\arg\max P(r_{t}\mid r_{<t},\mathbf{H}^{z},\mathbf{H}^{h},\mathbf{b}) \tag{8}$$
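The pooling and concatenation in Eqs. (6) and (7) can be sketched as follows. This is an illustrative pure-Python sketch, not the released code: the function names `max_pool` and `bridge` and the toy embedding values are assumptions, and the classification heads and fusion mechanism of Eqs. (4) and (5) are omitted.

```python
def max_pool(E):
    """Column-wise max over m keyword embeddings (m x d) -> one d-dim vector."""
    return [max(col) for col in zip(*E)]

def bridge(E_a, E_t):
    """Eqs. (6)-(7): pool each m x d matrix, then stack the two into 2 x d."""
    return [max_pool(E_a), max_pool(E_t)]

# Toy embeddings for m = 3 predicted keyword-types and keyword-topics, d = 2.
E_a = [[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]]
E_t = [[-1.0, 0.0], [0.5, -0.5], [0.2, 0.1]]
H_z = bridge(E_a, E_t)
print(H_z)  # [[0.4, 0.9], [0.5, 0.1]]
```

Because the pooling runs over the $m$ axis, the output shape stays $2\times d$ regardless of how many future turns are predicted.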
Figure 2: Our framework.
### 2.4 Learning Objectives
The first learning objective is to predict the intent keywords for the next $m$ turns. We minimize the following negative log-likelihood (NLL):
$$\mathcal{L}_{cls}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{|V|}y_{j}^{z(i)}\log p_{\theta}(j\mid S^{(i)},h^{(i)},g^{(i)}) \tag{9}$$
where $\theta$ denotes the model parameters, and $y^{z(i)}\in\{0,1\}^{|V|}$ is a multi-hot vector with $|V|=x_{a}x_{t}$ entries, among which $2m$ entries are set to 1.0, corresponding to the ground-truth intent keywords. For generation, the probability distribution over the vocabulary at step $t$ is computed as $r_{t}^{(i)}=\text{softmax}(\mathbf{W}\mathbf{H}_{t}^{\text{last}}+\mathbf{b})$, where $\mathbf{H}_{t}^{\text{last}}$ denotes the final hidden state of the T5 decoder at step $t$. Let $c^{(i)}=(S^{(i)},h^{(i)},g^{(i)})$. We jointly optimize the conversational scenarios by minimizing the following NLL over the target utterance of length $T_{i}$:
$$\mathcal{L}_{lm}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T_{i}}\log p_{\theta}(y_{t}^{(i)}\mid y_{<t}^{(i)},c^{(i)},Z^{(i)}) \tag{10}$$
The overall training objective is:
$$\mathcal{L}_{total}=\mathcal{L}_{lm}+\mathcal{L}_{cls} \tag{11}$$
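To make the arithmetic of Eqs. (9) to (11) concrete, the sketch below evaluates both losses on a single toy sample. It assumes the model's probabilities are already available as plain lists; the function names `cls_loss` and `lm_loss` and all numeric values are hypothetical and chosen only so the result can be checked by hand.

```python
import math

def cls_loss(probs, labels):
    """Eq. (9): -(1/N) * sum_i sum_j y_j * log p(j) over multi-hot labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum(yj * math.log(pj) for pj, yj in zip(p, y))
    return -total / len(probs)

def lm_loss(gold_token_probs):
    """Eq. (10): -(1/N) * sum_i sum_t log p(y_t | ...).

    Each inner list holds the probability assigned to the gold token
    at every step of one target utterance.
    """
    total = sum(math.log(p) for sample in gold_token_probs for p in sample)
    return -total / len(gold_token_probs)

probs = [[0.5, 0.25, 0.25]]     # predicted keyword probabilities, N = 1
labels = [[1, 0, 1]]            # multi-hot ground-truth intent keywords
gold_token_probs = [[0.5, 0.5]] # a two-token target utterance

# Eq. (11): unweighted sum of the two objectives.
total = lm_loss(gold_token_probs) + cls_loss(probs, labels)
print(round(total, 4))  # 3.4657, i.e. 5 * ln 2
```

Here the classification term contributes $\ln 2+\ln 4=3\ln 2$ and the generation term contributes $2\ln 2$, so the total is $5\ln 2$.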
### 2.5 Hard and Soft Modes
Our framework operates in two modes: hard and soft. In the hard mode, the framework predicts the intent keywords for the next $m$ turns during training and selects the top-$m$ intent keywords during inference. In contrast, the soft mode predicts intent keywords for the next $m$ turns during training, but during inference, it selects intent keywords only when their probability exceeds a threshold $\delta$, thereby accounting for prediction uncertainty. For simplicity, $\text{Emb}_{a}(\cdot)$ and $\text{Emb}_{t}(\cdot)$ are collectively denoted as $\text{Emb}_{z}(\cdot)$. In the hard mode, the $m$ intent keywords with the highest probabilities are selected:
$$\mathbf{E}^{z}=\text{Emb}_{z}(\text{Top}_{m}(P_{z})) \tag{12}$$
where $\mathbf{E}^{z}\in\mathbb{R}^{m\times d}$ denotes the embeddings of intent keywords, and $P_{z}$ denotes the probability distribution over intent keywords predicted by the classification head. $\text{Top}_{m}(\cdot)$ selects the $m$ most probable intent keywords. However, this hard mode does not account for prediction uncertainty. In the soft mode, we select the intent keywords whose probabilities exceed a threshold $\delta$, yielding a dynamic index set $\mathcal{I}_{\delta}=\{j\mid P_{z}(j)\geq\delta\}$. The embeddings are then weighted by their corresponding probabilities:
$$\mathbf{E}^{z}_{j}=P_{z}(j)\cdot\text{Emb}_{z}(j),\quad\forall j\in\mathcal{I}_{\delta} \tag{13}$$
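The difference between the two inference modes in Eqs. (12) and (13) can be seen in a small sketch. The helper names `hard_select` and `soft_select`, the toy probabilities, and the embedding table are assumptions made for illustration; the paper's actual implementation is described in its appendix and repository.

```python
def hard_select(P_z, m):
    """Eq. (12): indices of the m most probable intent keywords."""
    ranked = sorted(range(len(P_z)), key=lambda j: P_z[j], reverse=True)
    return ranked[:m]

def soft_select(P_z, emb, delta):
    """Eq. (13): keep keywords with P_z(j) >= delta and scale each
    embedding by its probability. Returns {index: weighted embedding}."""
    return {
        j: [P_z[j] * v for v in emb[j]]
        for j in range(len(P_z))
        if P_z[j] >= delta
    }

P_z = [0.9, 0.1, 0.5, 0.25]  # toy per-keyword probabilities
emb = [[1.0, 2.0], [3.0, 4.0], [2.0, 0.0], [4.0, 4.0]]  # toy Emb_z table

hard = hard_select(P_z, m=2)
soft = soft_select(P_z, emb, delta=0.2)
print(hard)          # [0, 2]
print(sorted(soft))  # [0, 2, 3]
```

With these toy values, the hard mode always returns exactly $m$ keywords, whereas the soft mode returns a variable number (three here) and down-weights the less certain ones.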
Table 1: Results compared across advanced models. Bold text indicates the best performance, and underlined text indicates the second-best performance. Significant improvements are marked with † (t-test, $p<0.01$) and ‡ (t-test, $p<0.05$). Results for advanced models labeled ⋄ are from our reproduction.
## 3 Experiments
### 3.1 Experimental Setting
#### Datasets
In our setup, we require the dataset to be proactively guided and to incorporate elements such as user profiles and domain knowledge. After careful examination, we identify the DuRecDial Liu et al. ([2020](https://arxiv.org/html/2605.11964#bib.bib23)) and DuRecDial2.0 Liu et al. ([2021](https://arxiv.org/html/2605.11964#bib.bib24)) datasets as suitable benchmarks for our experiments. To further investigate the performance of target-guided dialogue systems, the test set is additionally divided into two subsets: In-Domain (ID) and Out-of-Domain (OOD). The statistics are summarized in Table [4](https://arxiv.org/html/2605.11964#A1.T4). For additional details on data processing, please refer to App. [A](https://arxiv.org/html/2605.11964#A1).
#### Implementation Details
Please refer to App\.[B](https://arxiv.org/html/2605.11964#A2)\.
#### Baselines and Metrics
We compare our framework with advanced models relevant to our task, namely TPDial Wang et al. ([2023a](https://arxiv.org/html/2605.11964#bib.bib2)) and TRIPDial Wang et al. ([2024](https://arxiv.org/html/2605.11964#bib.bib1)). We also compare our framework with large language models (LLMs), including LoRA fine-tuning Hu et al. ([2022](https://arxiv.org/html/2605.11964#bib.bib32)) (LLaMA-1B, LLaMA-3B, and Qwen-3B) and prompt-based methods (LLaMA-8B, Qwen-14B, and Qwen-32B). All models are obtained from the Hugging Face repository. Details of the prompts used in this paper are provided in App. [I](https://arxiv.org/html/2605.11964#A9). We adopt T5 Raffel et al. ([2020](https://arxiv.org/html/2605.11964#bib.bib35)) (Chinese: T5-Zh Zhao et al. ([2023](https://arxiv.org/html/2605.11964#bib.bib37)); English: T5-Flan Chung et al. ([2024](https://arxiv.org/html/2605.11964#bib.bib36))) as our backbone model due to its lightweight design and strong empirical performance.
Following prior work Wang et al. ([2024](https://arxiv.org/html/2605.11964#bib.bib1)), we adopt the following widely used automatic evaluation metrics: Perplexity (PPL); Word F1 (W. F1) Wang et al. ([2023b](https://arxiv.org/html/2605.11964#bib.bib16)); BLEU (BLEU-1/2) Papineni et al. ([2002](https://arxiv.org/html/2605.11964#bib.bib34)); Distinct (DIST-1/2) Li et al. ([2016](https://arxiv.org/html/2605.11964#bib.bib33)); Knowledge F1 (K. F1) Liu et al. ([2020](https://arxiv.org/html/2605.11964#bib.bib23)); and Failure (Fail.) Liu et al. ([2021](https://arxiv.org/html/2605.11964#bib.bib24)). See App. [C](https://arxiv.org/html/2605.11964#A3) and App. [D](https://arxiv.org/html/2605.11964#A4) for more details.
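For readers unfamiliar with two of these metrics, the sketch below shows a common way to compute a word-level F1 and Distinct-n. It assumes whitespace tokenization and the standard textbook definitions (token-overlap F1; unique n-grams divided by total n-grams); the exact tokenization and averaging used in this paper are specified in its appendices and may differ, and the function names are hypothetical.

```python
from collections import Counter

def word_f1(pred, ref):
    """Token-overlap F1 between a generated and a reference utterance."""
    p, r = pred.split(), ref.split()
    overlap = sum((Counter(p) & Counter(r)).values())  # clipped matches
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)

def distinct_n(utterances, n):
    """Ratio of unique n-grams to total n-grams across all utterances."""
    ngrams = []
    for utterance in utterances:
        tokens = utterance.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

f1 = word_f1("play the piano song", "play the song now")
dist1 = distinct_n(["a b a", "a b"], 1)
print(f1, dist1)  # 0.75 0.4
```

In the first example three of four tokens match on each side, giving precision and recall of 0.75; in the second, two unique unigrams occur among five tokens.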
### 3.2 Results and Analysis
Table [1](https://arxiv.org/html/2605.11964#S2.T1) presents the results compared with advanced models, and details of the values of $m$ and $\delta$ are provided in §[3.3](https://arxiv.org/html/2605.11964#S3.SS3). Our framework achieves highly competitive performance on both the DuRecDial and DuRecDial2.0 datasets, demonstrating clear advantages on metrics measuring dialogue quality and informativeness. Specifically, on the ID and OOD test sets of both datasets, our framework (in both soft and hard modes) significantly outperforms strong baseline models, including TPDial and TRIPDial, on W. F1 and BLEU-1/2 metrics, while achieving competitive PPL results, indicating improved dialogue fluency. Furthermore, on the K. F1 metric, which measures knowledge utilization, our framework shows substantial improvement over the backbone model (e.g., on the DuRecDial2.0 ID test set, the soft mode achieves 61.17, while the base T5-Flan achieves 55.26). This demonstrates that by jointly modeling user profiles and domain knowledge to introduce dynamic scenario biases, the system generates more informative utterances. Our framework also exhibits superior robustness in target achievement (as reflected by the Failure metric), highlighting its proactivity and generalization ability. Advanced baseline models (e.g., TPDial and TRIPDial), while performing reasonably on the ID test set, show a dramatic increase in failure rate on the OOD test set (e.g., TPDial's failure rate reaches 99.69% on the DuRecDial2.0 OOD test set). In contrast, our framework maintains competitive target achievement on OOD data, with failure rates of 20.20% and 23.68%, primarily attributed to the IKB, which predicts intent keywords for future dialogue turns. The soft mode, which better accommodates prediction uncertainty, outperforms the hard mode on most core metrics.
These results further validate that higher\-level, flexible dynamic guidance is crucial for target\-guided proactive dialogue systems\.
We also conduct comparative experiments with several recent LLMs using both fine-tuning and prompt-based approaches, as summarized in Table [2](https://arxiv.org/html/2605.11964#S3.T2). Our lightweight 0.3B framework demonstrates strong competitiveness against LLMs ranging from 1B to 32B parameters, indicating its capability to perform on par with much larger models. Ultra-large prompt-based models (e.g., Qwen-32B), which generate utterances via prompting, often struggle to preserve lexical overlap and dialogue naturalness, leading to substantially lower W. F1 and BLEU scores compared to fine-tuned models. In contrast, our framework excels at maintaining high-quality dialogue flow. Notably, on the ID test set of the DuRecDial2.0 dataset, our soft mode achieves state-of-the-art W. F1 (44.87) and BLEU-1/2 (0.416/0.310), outperforming even parameter-efficient fine-tuned LLaMA-3B and Qwen-3B models. Moreover, it sustains highly competitive knowledge utilization (K. F1), demonstrating that our framework effectively integrates domain knowledge. Our framework exhibits a superior trade-off between dialogue quality and target achievement. On the DuRecDial2.0 dataset, fine-tuned LLMs show high target failure rates (e.g., LLaMA-3B reaches 40.30% on the ID test set), indicating that fine-tuning alone is insufficient for mastering complex proactive planning. In contrast, large prompt-based models (e.g., Qwen-32B) achieve extremely low failure rates (3.80%) but often at the cost of dialogue quality, abruptly forcing target achievement, as reflected by a sharp drop in dialogue-related metrics. By leveraging multi-turn dynamic guidance, our framework achieves a low failure rate of 19.77% on the ID test set while maintaining high dialogue coherence.
This demonstrates that our framework is not merely a computationally constrained compromise, but effectively balances precise target orientation with natural, human\-like interaction, maintaining unique architectural advantages even compared with large\-scale LLMs\.
Table 2: Results compared with LLMs. Models labeled ♢ denote fine-tuning, while models labeled ♡ denote the prompt-based method. Bold text indicates the best performance among fine-tuning methods, and underlined text indicates the best performance among prompt-based methods. Significant improvements are marked with † (t-test, $p<0.01$) and ‡ (t-test, $p<0.05$).
Table 3: Ablation study on the DuRecDial dataset.
Figure 3: Variation of W. F1 w.r.t. $m$ and $\delta$.
### 3.3 Analysis of Parameters
We analyze two key hyperparameters, $m$ and $\delta$. Specifically, $m$ specifies the number of future turns for which the framework predicts intent keywords, while $\delta$ defines the probability threshold for selecting intent keywords during inference in the soft mode, where only intent keywords with probabilities exceeding $\delta$ are retained, as detailed in §[2.5](https://arxiv.org/html/2605.11964#S2.SS5). We adopt W. F1 as the primary performance metric, as it captures exact word-level overlap. The performance trends measured by W. F1 on the DuRecDial and DuRecDial2.0 datasets, as $m$ and $\delta$ vary, are presented in Figures [3](https://arxiv.org/html/2605.11964#S3.F3) and [6](https://arxiv.org/html/2605.11964#A1.F6), respectively. We choose $\delta$ in the range 0.0–0.4, since under the sigmoid output, intent keywords with probabilities above 0.5 already have a high probability of being selected. Figure [3](https://arxiv.org/html/2605.11964#S3.F3) shows that our framework remains stable across both ID and OOD test sets of the DuRecDial dataset. Performance peaks at $m=4$: an excessively small $m$ fails to provide sufficient forward-looking guidance, while an overly large $m$ (e.g., $m=5$) introduces noise from distant future turns. Using $m=4$, we further examine performance variations with respect to $\delta$, resulting in a selected value of $\delta=0.2$, which effectively filters low-confidence noise without sacrificing informative keywords. Similarly, Figure [6](https://arxiv.org/html/2605.11964#A1.F6) presents performance variations on the DuRecDial2.0 dataset, where the framework achieves the best balance of guidance at $m=3$. Accordingly, we set $m=3$ and analyze performance changes with respect to $\delta$. Considering the overall results on both ID and OOD test sets, we set $\delta=0.2$. We also conduct more comprehensive experiments and analyses, as reported in App. [E](https://arxiv.org/html/2605.11964#A5).
### 3.4 Ablation Study
We conduct ablation studies to verify the effects of the components proposed in this paper. Specifically, we investigate the following variants: (1) without IKB (w/o IKB); (2) without CSM (w/o CSM); (3) without domain knowledge modeling ($-\mathcal{F}_{k}(\mathbf{H}^{k})$); and (4) without user profile modeling ($-\mathcal{F}_{u}(\mathbf{H}^{u})$). The results on the DuRecDial and DuRecDial2.0 datasets, reported in Tables [3](https://arxiv.org/html/2605.11964#S3.T3) and [9](https://arxiv.org/html/2605.11964#A6.T9), respectively, clearly demonstrate the core utility of CSM and IKB in improving system utterance informativeness, fluency, and proactivity. For informativeness and fluency, removing CSM (w/o CSM) significantly degrades the framework's knowledge utilization (K. F1) on both the ID and OOD test sets across the two datasets. For example, on the DuRecDial2.0 dataset, K. F1 drops sharply by 7.71 and 6.97, respectively. Fine-grained analysis further shows that removing domain knowledge modeling ($-\mathcal{F}_{k}(\mathbf{H}^{k})$) directly leads to notable decreases in K. F1 and BLEU scores, while removing user profile modeling ($-\mathcal{F}_{u}(\mathbf{H}^{u})$) weakens lexical diversity (DIST-1/2) and increases the target failure rate. These results indicate that CSM effectively integrates contextual information through dynamic scenario bias, thereby avoiding the homogenization of utterances. For proactivity, IKB serves as a key guiding engine. Removing IKB leads to a substantial increase in the target failure rate on both the DuRecDial and DuRecDial2.0 datasets, with increases of 7.39 and 6.73 on the ID and OOD test sets of the DuRecDial dataset, respectively. Moreover, the absence of IKB also degrades dialogue quality metrics such as W. F1 and BLEU.
This suggests that IKB does not mechanically enforce topic transitions; instead, it enables smooth and coherent proactive guidance through flexible multi\-step intent keyword predictions, thereby narrowing the gap between target\-guided proactive dialogue systems and natural real\-world interactions\.
### 3.5 Human Evaluation
We perform pairwise human evaluation to further validate our framework. Specifically, we compare our framework with the baseline T5-Flan. "Win," "Lose," and "Tie" indicate that our framework outperforms T5-Flan, underperforms relative to T5-Flan, and performs comparably to T5-Flan, respectively. Figure [4](https://arxiv.org/html/2605.11964#S3.F4) presents the human evaluation results. In terms of coherence, which measures dialogue fluency, our framework achieves win rates of 24.4% and 28.4% on the ID and OOD test sets, respectively, substantially outperforming the baseline model's 1.8% and 2.4%. For appropriateness and informativeness, which reflect how well the content fits the dialogue context (intent keywords, domain knowledge, and user profiles), our framework maintains its advantage on the ID test set and achieves an impressive win rate of 58.2% in appropriateness on the OOD test set (baseline: 15.2%). These results suggest that CSM effectively leverages domain knowledge and user profiles to enrich utterance content under unseen data distributions. In terms of proactivity, our framework achieves clear net wins over the baseline on both the ID (16.2% vs. 6.4%) and OOD (17.6% vs. 10.8%) test sets. Some ties occur because the backbone model is already strong; however, by leveraging the higher-level multi-step intent keyword guidance provided by IKB, our framework reduces mechanical or abrupt topic shifts, thereby narrowing the gap between target-guided proactive dialogue systems and natural interactions. Details of the human evaluation procedure are provided in App. [G](https://arxiv.org/html/2605.11964#A7).
Figure 4: Human evaluation. "App." denotes "appropriateness," "Inf." denotes "informativeness," "Pro." denotes "proactivity," and "Coh." denotes "coherence."
Figure 5: Case study (ID test set). (a) T5-Flan; (b) Our framework; (c) Intent keyword transitions.
### 3.6 Case Study
Figure [5](https://arxiv.org/html/2605.11964#S3.F5) presents an example showing how our framework outperforms the baseline T5-Flan in real-world dialogues. When recommending songs, T5-Flan's utterances lack detail and awkwardly repeat the question "Do you want me to play it for you now?" in subsequent turns, revealing interaction rigidity. In contrast, CSM effectively integrates underlying knowledge. Consistent with the ablation results, in which removing CSM leads to a sharp drop in knowledge utilization (K. F1), our framework generates more detailed utterances such as "It's a sunny day" and "combines with the traditional Chinese music", enhancing informativeness and user engagement. IKB provides the system with a flexible, multi-step forward-looking perspective, allowing the framework to naturally connect with the user's expectations in the fifth turn, producing "Yes, I'm sure you'll like it" before logically proposing a playback request. This guidance trajectory is smooth, avoiding abrupt topic shifts and maintaining dialogue quality. The case study demonstrates the effectiveness of our framework, narrowing the gap between target-guided dialogue systems and natural real-world interactions. A more detailed analysis is provided in App. [H](https://arxiv.org/html/2605.11964#A8).
## 4 Related Work
We focus on the proactivity and target\-guided aspects of dialogue; thus, we review related work on proactive and target\-guided dialogue systems\.
Proactive dialogue differs from traditional dialogue systems that merely respond passively to users; it requires the system to actively steer the conversation toward accomplishing specific tasks, emphasizing initiative Tang et al\. \([2025](https://arxiv.org/html/2605.11964#bib.bib25)\); Deng et al\. \([2025](https://arxiv.org/html/2605.11964#bib.bib22)\)\. One line of research in proactive dialogue focuses on developing various types of dialogue systems, including target\-guided dialogue Zhang et al\. \([2025](https://arxiv.org/html/2605.11964#bib.bib8)\), prosocial dialogue Ziems et al\. \([2022](https://arxiv.org/html/2605.11964#bib.bib26)\), non\-collaborative dialogue Jin et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib27)\), and user preference elicitation Zhang et al\. \([2024b](https://arxiv.org/html/2605.11964#bib.bib28)\)\. These systems primarily leverage domain knowledge or user profiles to enhance proactivity Liu et al\. \([2021](https://arxiv.org/html/2605.11964#bib.bib24)\), or employ planning\-based strategies Wang et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib1)\)\. In this work, we enhance the system’s proactivity by modeling conversational scenarios and employing intent–keyword bridging for dynamic guidance\.
Target\-guided dialogue requires the system to generate utterances that fulfill a pre\-determined target, such as incorporating a keyword into the system utterance\. Accordingly, a keyword is typically designated as the target, making keyword planning the natural primary strategy\. To this end, entity\-based keywords have been widely adopted, and a variety of planning strategies have been proposed Tang et al\. \([2019](https://arxiv.org/html/2605.11964#bib.bib14)\); Zhong et al\. \([2021](https://arxiv.org/html/2605.11964#bib.bib17)\)\. Zhong et al\. \([2021](https://arxiv.org/html/2605.11964#bib.bib17)\) further leveraged ConceptNet Speer et al\. \([2017](https://arxiv.org/html/2605.11964#bib.bib18)\) to support next\-turn keyword planning, while both local and global keyword planning approaches have demonstrated substantial progress Yang et al\. \([2022](https://arxiv.org/html/2605.11964#bib.bib13)\)\. However, the semantic information conveyed by entity\-based keywords is limited, motivating the adoption of keywords with richer and more abstract semantics Dao et al\. \([2023](https://arxiv.org/html/2605.11964#bib.bib15)\); Wang et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib1)\); Zhang et al\. \([2025](https://arxiv.org/html/2605.11964#bib.bib8)\)\. As a result, planning sequences of such abstract keywords has become a primary research focus\. For example, Wang et al\. \([2023b](https://arxiv.org/html/2605.11964#bib.bib16)\) proposed Brownian bridge planning, Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2)\) introduced target\-driven planning, Dao et al\. \([2023](https://arxiv.org/html/2605.11964#bib.bib15)\) developed global reinforcement learning, Zhang et al\. \([2024a](https://arxiv.org/html/2605.11964#bib.bib20)\) presented graph\-interaction planning, and Wang et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib1)\); Kang et al\. \([2026](https://arxiv.org/html/2605.11964#bib.bib41)\) proposed bidirectional planning\.
However, the resulting keyword sequences remain static, relying only on one\-turn keywords without looking ahead, leading to a notable mismatch with the dynamics of real\-world conversations\. To address this, we jointly model user profiles and domain knowledge as conversational scenarios, and leverage dynamic intent–keyword bridging to further reduce this gap\.
## 5 Conclusion
In this work, we jointly model user profiles and domain knowledge as conversational scenarios, which introduce a bias that dynamically influences system utterances, and we propose intent\-keyword bridging to predict intent keywords for upcoming turns, providing higher\-level and more flexible guidance\. Extensive experiments show that our framework substantially enhances target\-guided proactive dialogue systems, narrowing the gap with real\-world interactions\.
## Limitations
We validate the effectiveness of our framework through extensive automatic and human evaluations; nevertheless, several limitations remain\. The datasets used in this work are built upon Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2), [2024](https://arxiv.org/html/2605.11964#bib.bib1)\), from which we introduce the concept of intent keywords\. We argue that these intent keywords capture higher\-level abstract semantics, providing more effective guidance\. However, we do not further categorize the intent keywords in the datasets following Wang et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib1)\), which constitutes a potential limitation of this study\. In addition, while our framework does not consistently outperform LLM\-based models, it integrates the strengths of fine\-tuned LLMs and prompt\-based LLMs\. For instance, in terms of dialogue quality and failure rate, it achieves competitive results with only 0\.3B parameters\. Although its contribution to advancing state\-of\-the\-art performance is limited, it remains valuable for research in low\-resource settings\. The experiments collectively demonstrate the effectiveness of our proposed mechanisms, particularly CSM and IKB\. In summary, potential directions for improvement include: \(1\) further categorizing and refining intent keywords while exploring the adoption of a unified and powerful reasoning mechanism under unpredictable user behavior; \(2\) adopting more specialized extraction methods for conversational scenario modeling, where bias introduction may require a softer approach; and \(3\) exploring LLM\-based backbone models, which are not employed in this work due to resource constraints\. We believe that addressing these issues will further advance research on target\-guided proactive dialogue systems\.
Furthermore, given the proactive nature of dialogue in guiding users and its potential negative social impacts, such as political interference and abuse, we strongly recommend minimizing such misuse and strictly adhering to relevant laws\. This study is limited to academic research\.
## References
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma,et al\.\(2024\)Scaling instruction\-finetuned language models\.Journal of Machine Learning Research25\(70\),pp\. 1–53\.Cited by:[Appendix B](https://arxiv.org/html/2605.11964#A2.p2.3),[Appendix C](https://arxiv.org/html/2605.11964#A3.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p1.1)\.
- H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, and Y\. Li \(2022\)Scaling instruction\-finetuned language models\.External Links:2210\.11416,[Link](https://arxiv.org/abs/2210.11416)Cited by:[§2\.2](https://arxiv.org/html/2605.11964#S2.SS2.p1.1)\.
- H\. Dao, L\. Liao, D\. Le, and Y\. Nie \(2023\)Reinforced target\-driven conversational promotion\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 12583–12596\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.775/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.775)Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- H\. Q\. Dao, Y\. Deng, K\. Bui, D\. D\. Le, and L\. Liao \(2024\)Experience as source for anticipation and planning: experiential policy learning for target\-driven recommendation dialogues\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 14179–14198\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.829/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.829)Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1)\.
- Y\. Deng, L\. Liao, W\. Lei, G\. H\. Yang, W\. Lam, and T\. Chua \(2025\)Proactive conversational ai: a comprehensive survey of advancements and opportunities\.ACM Transactions on Information Systems43\(3\),pp\. 1–45\.Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p2.1)\.
- J\. Hao and F\. Kong \(2025\)Enhancing emotional support conversations: a framework for dynamic knowledge filtering and persona extraction\.InProceedings of the 31st International Conference on Computational Linguistics,O\. Rambow, L\. Wanner, M\. Apidianaki, H\. Al\-Khalifa, B\. D\. Eugenio, and S\. Schockaert \(Eds\.\),Abu Dhabi, UAE,pp\. 3193–3202\.External Links:[Link](https://aclanthology.org/2025.coling-main.214/)Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, W\. Chen,et al\.\(2022\)Lora: low\-rank adaptation of large language models\.ICLR1\(2\),pp\. 3\.Cited by:[Appendix B](https://arxiv.org/html/2605.11964#A2.p1.4),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p1.1)\.
- C\. Jin, K\. Ren, L\. Kong, X\. Wang, R\. Song, and H\. Chen \(2024\)Persuading across diverse domains: a dataset and persuasion large language model\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 1678–1706\.External Links:[Link](https://aclanthology.org/2024.acl-long.92/),[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.92)Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p2.1)\.
- X\. Kang, M\. Li, Y\. Zheng, and F\. Kong \(2026\)Pseudo\-siamese network for planning in target\-oriented proactive dialogues\.InICASSP 2026\-2026 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 16977–16981\.Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- J\. Li, M\. Galley, C\. Brockett, J\. Gao, and B\. Dolan \(2016\)A diversity\-promoting objective function for neural conversation models\.InProceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies,K\. Knight, A\. Nenkova, and O\. Rambow \(Eds\.\),San Diego, California,pp\. 110–119\.External Links:[Link](https://aclanthology.org/N16-1014/),[Document](https://dx.doi.org/10.18653/v1/N16-1014)Cited by:[4th item](https://arxiv.org/html/2605.11964#A4.I1.i4.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p2.1)\.
- Z\. Liu, H\. Wang, Z\. Niu, H\. Wu, W\. Che, and T\. Liu \(2020\)Towards conversational recommendation over multi\-type dialogs\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 1036–1049\.External Links:[Link](https://aclanthology.org/2020.acl-main.98/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.98)Cited by:[Appendix A](https://arxiv.org/html/2605.11964#A1.p1.1),[5th item](https://arxiv.org/html/2605.11964#A4.I1.i5.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p2.1)\.
- Z\. Liu, H\. Wang, Z\. Niu, H\. Wu, and W\. Che \(2021\)DuRecDial 2\.0: a bilingual parallel corpus for conversational recommendation\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 4335–4347\.External Links:[Link](https://aclanthology.org/2021.emnlp-main.356/),[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.356)Cited by:[Appendix A](https://arxiv.org/html/2605.11964#A1.p1.1),[6th item](https://arxiv.org/html/2605.11964#A4.I1.i6.p1.1),[§2\.3](https://arxiv.org/html/2605.11964#S2.SS3.p1.5),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p2.1),[§4](https://arxiv.org/html/2605.11964#S4.p2.1)\.
- K\. Papineni, S\. Roukos, T\. Ward, and W\. Zhu \(2002\)Bleu: a method for automatic evaluation of machine translation\.InProceedings of the 40th Annual Meeting of the Association for Computational Linguistics,P\. Isabelle, E\. Charniak, and D\. Lin \(Eds\.\),Philadelphia, Pennsylvania, USA,pp\. 311–318\.External Links:[Link](https://aclanthology.org/P02-1040/),[Document](https://dx.doi.org/10.3115/1073083.1073135)Cited by:[3rd item](https://arxiv.org/html/2605.11964#A4.I1.i3.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p2.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of machine learning research21\(140\),pp\. 1–67\.Cited by:[Appendix B](https://arxiv.org/html/2605.11964#A2.p2.3),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p1.1)\.
- K\. Sevegnani, D\. M\. Howcroft, I\. Konstas, and V\. Rieser \(2021\)OTTers: one\-turn topic transitions for open\-domain dialogue\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),C\. Zong, F\. Xia, W\. Li, and R\. Navigli \(Eds\.\),Online,pp\. 2492–2504\.External Links:[Link](https://aclanthology.org/2021.acl-long.194/),[Document](https://dx.doi.org/10.18653/v1/2021.acl-long.194)Cited by:[§2\.3](https://arxiv.org/html/2605.11964#S2.SS3.p1.5)\.
- R\. Speer, J\. Chin, and C\. Havasi \(2017\)Conceptnet 5\.5: an open multilingual graph of general knowledge\.InProceedings of the AAAI conference on artificial intelligence,Vol\.31\.Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- J\. Tang, S\. Shen, Z\. ZhipengWang, G\. Zhi, X\. Feng, Z\. Sun, H\. Tan, and X\. Chen \(2025\)KAPA: a deliberative agent framework with tree\-structured knowledge base for multi\-domain user intent understanding\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 6150–6166\.External Links:[Link](https://aclanthology.org/2025.findings-acl.319/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.319),ISBN 979\-8\-89176\-256\-5Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p2.1)\.
- J\. Tang, T\. Zhao, C\. Xiong, X\. Liang, E\. Xing, and Z\. Hu \(2019\)Target\-guided open\-domain conversation\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,A\. Korhonen, D\. Traum, and L\. Màrquez \(Eds\.\),Florence, Italy,pp\. 5624–5634\.External Links:[Link](https://aclanthology.org/P19-1565/),[Document](https://dx.doi.org/10.18653/v1/P19-1565)Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- J\. Wang, D\. Lin, and W\. Li \(2023a\)A target\-driven planning approach for goal\-directed dialog systems\.IEEE Transactions on Neural Networks and Learning Systems35\(8\),pp\. 10475–10487\.Cited by:[Appendix A](https://arxiv.org/html/2605.11964#A1.p2.1),[Appendix B](https://arxiv.org/html/2605.11964#A2.p2.3),[Appendix C](https://arxiv.org/html/2605.11964#A3.p1.1),[Appendix D](https://arxiv.org/html/2605.11964#A4.p1.1),[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§2\.3](https://arxiv.org/html/2605.11964#S2.SS3.p1.11),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1),[Limitations](https://arxiv.org/html/2605.11964#Sx1.p1.1)\.
- J\. Wang, D\. Lin, and W\. Li \(2023b\)Dialogue planning via brownian bridge stochastic process for goal\-directed proactive dialogue\.InFindings of the Association for Computational Linguistics: ACL 2023,A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 370–387\.External Links:[Link](https://aclanthology.org/2023.findings-acl.25/),[Document](https://dx.doi.org/10.18653/v1/2023.findings-acl.25)Cited by:[2nd item](https://arxiv.org/html/2605.11964#A4.I1.i2.p1.1),[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p2.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- J\. Wang, D\. Lin, and W\. Li \(2024\)Target\-constrained bidirectional planning for generation of target\-oriented proactive dialogue\.ACM Transactions on Information Systems42\(5\),pp\. 1–27\.Cited by:[Appendix A](https://arxiv.org/html/2605.11964#A1.p2.1),[Appendix B](https://arxiv.org/html/2605.11964#A2.p2.3),[Appendix C](https://arxiv.org/html/2605.11964#A3.p1.1),[Appendix D](https://arxiv.org/html/2605.11964#A4.p1.1),[§1](https://arxiv.org/html/2605.11964#S1.p1.1),[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p2.1),[§4](https://arxiv.org/html/2605.11964#S4.p2.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1),[Limitations](https://arxiv.org/html/2605.11964#Sx1.p1.1)\.
- B\. Wu, W\. Wang, L\. Lihaoran, Y\. Deng, Y\. Li, J\. Yu, and B\. Wang \(2025a\)Interpersonal memory matters: a new task for proactive dialogue utilizing conversational history\.InProceedings of the 29th Conference on Computational Natural Language Learning,G\. Boleda and M\. Roth \(Eds\.\),Vienna, Austria,pp\. 47–67\.External Links:[Link](https://aclanthology.org/2025.conll-1.4/),[Document](https://dx.doi.org/10.18653/v1/2025.conll-1.4),ISBN 979\-8\-89176\-271\-8Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1)\.
- S\. Wu, Y\. Zhu, W\. Hsu, M\. Lee, and Y\. Deng \(2025b\)From personas to talks: revisiting the impact of personas on LLM\-synthesized emotional support conversations\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 5439–5453\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.277/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.277),ISBN 979\-8\-89176\-332\-6Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1)\.
- K\. Xu, Y\. Cheng, W\. Hou, Q\. Tan, and W\. Li \(2024\)Reasoning like a doctor: improving medical dialogue systems via diagnostic reasoning process alignment\.InFindings of the Association for Computational Linguistics: ACL 2024,L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\),Bangkok, Thailand,pp\. 6796–6814\.External Links:[Link](https://aclanthology.org/2024.findings-acl.406/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.406)Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1)\.
- Z\. Yang, B\. Wang, J\. Zhou, Y\. Tan, D\. Zhao, K\. Huang, R\. He, and Y\. Hou \(2022\)TopKG: target\-oriented dialog via global planning on knowledge graph\.InProceedings of the 29th International Conference on Computational Linguistics,N\. Calzolari, C\. Huang, H\. Kim, J\. Pustejovsky, L\. Wanner, K\. Choi, P\. Ryu, H\. Chen, L\. Donatelli, H\. Ji, S\. Kurohashi, P\. Paggio, N\. Xue, S\. Kim, Y\. Hahm, Z\. He, T\. K\. Lee, E\. Santus, F\. Bond, and S\. Na \(Eds\.\),Gyeongju, Republic of Korea,pp\. 745–755\.External Links:[Link](https://aclanthology.org/2022.coling-1.62/)Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- D\. Zhang, Y\. Fan, P\. Li, and Q\. Zhu \(2025\)Enhancing goal\-oriented proactive dialogue systems via consistency reflection and correction\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 21656–21672\.External Links:[Link](https://aclanthology.org/2025.acl-long.1050/),[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.1050),ISBN 979\-8\-89176\-251\-0Cited by:[§1](https://arxiv.org/html/2605.11964#S1.p1.1),[§1](https://arxiv.org/html/2605.11964#S1.p3.1),[§4](https://arxiv.org/html/2605.11964#S4.p2.1),[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- X\. Zhang, X\. Jia, H\. Liu, X\. Liu, and X\. Zhang \(2024a\)A goal interaction graph planning framework for conversational recommendation\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.38,pp\. 19578–19587\.Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- X\. Zhang, R\. Xie, Y\. Lyu, X\. Xin, P\. Ren, M\. Liang, B\. Zhang, Z\. Kang, M\. de Rijke, and Z\. Ren \(2024b\)Towards empathetic conversational recommender systems\.InProceedings of the 18th ACM Conference on Recommender Systems,pp\. 84–93\.Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p2.1)\.
- Z\. Zhao, Y\. Li, C\. Hou, J\. Zhao, R\. Tian, W\. Liu, Y\. Chen, N\. Sun, H\. Liu, W\. Mao, and H\. Guo \(2023\)TencentPretrain: a scalable and flexible toolkit for pre\-training models of different modalities\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 3: System Demonstrations\),D\. Bollegala, R\. Huang, and A\. Ritter \(Eds\.\),Toronto, Canada,pp\. 217–225\.External Links:[Link](https://aclanthology.org/2023.acl-demo.20/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-demo.20)Cited by:[Appendix B](https://arxiv.org/html/2605.11964#A2.p2.3),[Appendix C](https://arxiv.org/html/2605.11964#A3.p1.1),[§3\.1](https://arxiv.org/html/2605.11964#S3.SS1.SSS0.Px3.p1.1)\.
- P\. Zhong, Y\. Liu, H\. Wang, and C\. Miao \(2021\)Keyword\-guided neural conversational model\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.35,pp\. 14568–14576\.Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p3.1)\.
- C\. Ziems, J\. Yu, Y\. Wang, A\. Halevy, and D\. Yang \(2022\)The moral integrity corpus: a benchmark for ethical dialogue systems\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),S\. Muresan, P\. Nakov, and A\. Villavicencio \(Eds\.\),Dublin, Ireland,pp\. 3755–3773\.External Links:[Link](https://aclanthology.org/2022.acl-long.261/),[Document](https://dx.doi.org/10.18653/v1/2022.acl-long.261)Cited by:[§4](https://arxiv.org/html/2605.11964#S4.p2.1)\.
## Appendix A Datasets
Both the original DuRecDial Liu et al\. \([2020](https://arxiv.org/html/2605.11964#bib.bib23)\) and DuRecDial2\.0 Liu et al\. \([2021](https://arxiv.org/html/2605.11964#bib.bib24)\) datasets were collected via crowdsourced human–human conversations, where one participant acted as the user and the other as the system\. In these conversations, the user was provided with profile information such as preferences and age, while the system received domain knowledge consisting of domain\-specific topics and their associated attributes \(e\.g\., movies, music\)\. The original DuRecDial dataset contains approximately 10k Chinese dialogues and 156k utterances, whereas the DuRecDial2\.0 dataset includes 8\.2k dialogues in both Chinese and English\. Both datasets offer keyword\-type and keyword\-topic annotations throughout the conversations, making them well suited for the target\-guided proactive dialogue task investigated in this work\. In this type of dataset, the dialogue target often corresponds to the keyword\-topics appearing in the last system utterance, and this is how we determine whether the system has achieved its dialogue target\.
To better evaluate target\-guided proactive dialogue systems, Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2)\) repurposed these two datasets by filtering out dialogues that introduced no new keyword\-topics, thereby ensuring that each intent keyword was grounded in domain\-specific knowledge within the conversation\. They further enriched the domain knowledge by sampling additional information from triplets within a two\-hop range of the target keyword\-topic in the domain knowledge graph\. To more thoroughly investigate the performance of target\-guided proactive dialogue systems, the test set was additionally divided into two subsets: In\-Domain \(ID\), where the target keyword\-topics in the test set were allowed to appear in the training data, and Out\-of\-Domain \(OOD\), where none of the target keyword\-topics in the test set appeared in the training set\. After processing, the DuRecDial dataset yielded 13 keyword\-types and 640 keyword\-topics, while the processed DuRecDial2\.0 dataset contained 13 keyword\-types and 628 keyword\-topics\. On average, each dialogue exhibited approximately 4\.3–4\.8 intent keyword transitions from the beginning to the end of the conversation\. For the DuRecDial2\.0 dataset, we used only the English version, resulting in two dataset variants for a more comprehensive evaluation\. Since Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2), [2024](https://arxiv.org/html/2605.11964#bib.bib1)\) provided a further cleaned and processed version of the two datasets, we conducted our experiments using these processed releases\.
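The ID/OOD distinction described above can be sketched as a simple membership check on target keyword\-topics\. This is only an illustration: the record layout and the `target_topic` field name are hypothetical, and the processed releases we use already ship with fixed ID and OOD subsets, so no such partitioning is performed in our experiments\.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: each dialogue carries its target keyword-topic.
Dialogue = Dict[str, str]


def split_id_ood(
    train: List[Dialogue], test: List[Dialogue]
) -> Tuple[List[Dialogue], List[Dialogue]]:
    """Partition test dialogues by whether their target topic occurs in training."""
    seen_topics = {d["target_topic"] for d in train}
    id_set = [d for d in test if d["target_topic"] in seen_topics]
    ood_set = [d for d in test if d["target_topic"] not in seen_topics]
    return id_set, ood_set


# Toy records for illustration only (not taken from the datasets).
train = [{"target_topic": "topic_a"}, {"target_topic": "topic_b"}]
test = [{"target_topic": "topic_a"}, {"target_topic": "topic_c"}]
id_set, ood_set = split_id_ood(train, test)
print(len(id_set), len(ood_set))
```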
Table 4: Statistics of the DuRecDial and DuRecDial2\.0 datasets\. Here, "Dial\." denotes "dialogue," "Utter\." denotes "utterance," and "Trans\." denotes "keyword transition\."

Figure 6: Variation of Word F1 with respect to $m$ and $\delta$ on the DuRecDial2\.0 dataset\.
## Appendix B Implementation Details
We use AdamW as the optimizer and employ both a warmup strategy and gradient clipping\. For our framework, training is performed for 50 epochs with a batch size of 8 and a learning rate of $3\times 10^{-5}$\. The hyperparameter $m$ and the threshold $\delta$ are determined empirically; see § [3\.3](https://arxiv.org/html/2605.11964#S3.SS3) for details\. We fine\-tune LLaMA\-1B, LLaMA\-3B, and Qwen\-3B using the LoRA method Hu et al\. \([2022](https://arxiv.org/html/2605.11964#bib.bib32)\), with a LoRA rank of 8 and a LoRA alpha of 16\. Training is conducted with a batch size of 1, a learning rate of $1\times 10^{-4}$, and 10 epochs, following common community practice\. For the DuRecDial dataset, the maximum decoding length during generation is set to 100, whereas for the DuRecDial2\.0 dataset it is set to 80\. During model training, all experiments are conducted on a GeForce RTX 3090 GPU\. For the prompt\-based methods, when GPU memory is insufficient, the model parameters are distributed across a GeForce RTX 3090 GPU and a GeForce RTX 4090 GPU for inference\.
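As a rough illustration of the training recipe above, the sketch below implements a warmup learning\-rate schedule, gradient clipping by global norm, and the standard LoRA scaling factor in plain Python\. The linear shape of the warmup, the number of warmup steps, and the clipping norm are assumptions chosen for illustration; only the learning rate, LoRA rank, and LoRA alpha come from the settings reported here\.

```python
import math
from typing import List


def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup to base_lr, then constant (schedule shape is an assumption)."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def lora_scaling(rank: int, alpha: int) -> float:
    """Standard LoRA scaling: the low-rank update is multiplied by alpha / rank."""
    return alpha / rank


BASE_LR = 3e-5       # learning rate reported for our framework
WARMUP_STEPS = 100   # hypothetical value, not reported in the paper
MAX_GRAD_NORM = 1.0  # hypothetical value, not reported in the paper

print(warmup_lr(0, BASE_LR, WARMUP_STEPS))            # small LR at the first step
print(clip_by_global_norm([3.0, 4.0], MAX_GRAD_NORM))  # norm 5 rescaled to norm 1
print(lora_scaling(rank=8, alpha=16))                  # rank and alpha from the text
```

With a rank of 8 and an alpha of 16, the LoRA update is scaled by a factor of 2\.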
Our framework employs different base models as backbones for the Chinese and English datasets\. For the DuRecDial dataset, we adopt T5\-Zh Zhao et al\. \([2023](https://arxiv.org/html/2605.11964#bib.bib37)\) as the backbone, since the original T5 Raffel et al\. \([2020](https://arxiv.org/html/2605.11964#bib.bib35)\) does not support Chinese corpora, whereas the DuRecDial dataset is a linguistically rich and complex Chinese dataset\. For the DuRecDial2\.0 dataset, we use T5\-Flan Chung et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib36)\) as the backbone, as instruction tuning enables T5\-Flan to substantially outperform T5, despite supporting only English\. We select these models as backbones in our experiments because they have been strong foundation models in recent years and are well aligned with the design requirements of our framework\. The intent keyword hidden states $\mathbf{H}^{z}$ are concatenated with $\mathbf{K}$ and $\mathbf{V}$ in the cross\-attention of the T5 decoder, as illustrated in Figure [2](https://arxiv.org/html/2605.11964#S2.F2)\. During training, each backbone model is initialized with its pre\-trained parameters and further fine\-tuned\. For the two state\-of\-the\-art models, TPDial Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2)\) and TRIPDial Wang et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib1)\), which are highly relevant to our task, we reproduce them using the code released by their authors, with parameter settings consistent with the original descriptions\.
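The concatenation of the intent keyword hidden states with the keys and values can be sketched as follows\. This is a simplified single\-head version in plain Python and makes several assumptions that are not spelled out in this appendix: the states are appended along the sequence axis, they already share the key/value dimension, and no extra projection is applied\. The function names are illustrative only\.

```python
import math
from typing import List

Matrix = List[List[float]]  # rows are positions, columns are hidden dimensions


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Single-head scaled dot-product attention."""
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append(
            [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        )
    return out


def bridged_cross_attention(Q: Matrix, K: Matrix, V: Matrix, H_z: Matrix) -> Matrix:
    """Append intent-keyword states H_z as extra key/value positions (assumed layout)."""
    return cross_attention(Q, K + H_z, V + H_z)


# One decoder query, one encoder position, one intent-keyword state.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0]]
V = [[1.0, 0.0]]
H_z = [[0.0, 1.0]]
print(bridged_cross_attention(Q, K, V, H_z))  # -> [[0.5, 0.5]]
```

In this toy case the zero query attends uniformly, so the output is the average of the encoder value and the intent\-keyword state, which shows how the appended states can shift the decoder’s context\.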
## Appendix C Baselines
TPDial Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2)\) is a target\-driven keyword planning model for guided system\-utterance generation, whereas TRIPDial Wang et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib1)\) is a target\-constrained bidirectional planning approach for guided system\-utterance generation\. Since TPDial and TRIPDial encompass multiple models in their original works, we cite only the relatively better\-performing variants\. Specifically, for TRIPDial, we reference the model in its controlled mode, while for TPDial, we cite the variant with a GPT\-2 backbone\. The LLaMA and Qwen families are advanced language models developed in recent years, and their 1B/3B variants, known for strong performance and a favorable trade\-off, are adopted as our fine\-tuning baseline models\. We also adopt larger models from these families, namely LLaMA\-8B, Qwen\-14B, and Qwen\-32B, as prompt\-based models; the specific prompts are provided in App\. [I](https://arxiv.org/html/2605.11964#A9)\. This setup considers both efficient fine\-tuning \(details of which are provided in App\. [B](https://arxiv.org/html/2605.11964#A2)\) and the increasingly popular prompt\-based methods, allowing for a more comprehensive comparison\. T5\-Zh Zhao et al\. \([2023](https://arxiv.org/html/2605.11964#bib.bib37)\) is a Chinese\-adapted T5 variant pretrained on large\-scale Chinese corpora, whereas T5\-Flan Chung et al\. \([2024](https://arxiv.org/html/2605.11964#bib.bib36)\) is an instruction\-tuned extension of T5 that exhibits strong zero\-shot and instruction\-following capabilities across diverse tasks\. We adopt T5\-Zh as the backbone for the DuRecDial dataset and T5\-Flan as the backbone for the DuRecDial2\.0 dataset\.
## Appendix D Evaluation Metrics
Following prior work Wang et al\. \([2023a](https://arxiv.org/html/2605.11964#bib.bib2), [2024](https://arxiv.org/html/2605.11964#bib.bib1)\) and our formalized setting for evaluating target\-guided proactive dialogue generation, we adopt the following widely used automatic evaluation metrics:
- Perplexity \(PPL\) evaluates the language model’s uncertainty in predicting the token sequence; lower values indicate better fluency and generation quality;
- Word F1 \(W\. F1\) Wang et al\. \([2023b](https://arxiv.org/html/2605.11964#bib.bib16)\) measures the exact word\-level overlap between generated and reference utterances;
- BLEU \(BLEU\-1/2\) Papineni et al\. \([2002](https://arxiv.org/html/2605.11964#bib.bib34)\) evaluates the n\-gram overlap between generated and reference utterances;
- Distinct \(DIST\-1/2\) Li et al\. \([2016](https://arxiv.org/html/2605.11964#bib.bib33)\) evaluates the lexical diversity of generated utterances;
- Knowledge F1 \(K\. F1\) Liu et al\. \([2020](https://arxiv.org/html/2605.11964#bib.bib23)\) measures the correctness of generated knowledge against the ground\-truth domain knowledge;
- Failure \(Fail\.\) Liu et al\. \([2021](https://arxiv.org/html/2605.11964#bib.bib24)\) quantifies the probability that the target is not achieved by the end of the conversation\.
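Several of the metrics above have simple closed forms, which the following sketch illustrates for Word F1, Distinct\-n, perplexity, and the failure rate\. It uses textbook definitions on pre\-tokenized input and is not the evaluation script used in our experiments; tokenization, smoothing, and corpus\-level aggregation details may differ, and all function names are illustrative\.

```python
import math
from collections import Counter
from typing import List, Sequence


def word_f1(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Harmonic mean of word-level precision and recall (multiset overlap)."""
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def distinct_n(utterances: List[Sequence[str]], n: int) -> float:
    """Ratio of unique n-grams to total n-grams over all utterances."""
    ngrams = [tuple(u[i:i + n]) for u in utterances for i in range(len(u) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the mean negative log-likelihood (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def failure_rate(target_achieved: Sequence[bool]) -> float:
    """Fraction of dialogues in which the target was not achieved."""
    return 1.0 - sum(target_achieved) / len(target_achieved)


# Toy inputs for illustration only.
print(word_f1("play this song now".split(), "play the song now".split()))
print(distinct_n(["a b a".split()], 1))
print(perplexity([math.log(0.5)] * 4))
print(failure_rate([True, True, False, True]))
```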
## Appendix E Analysis of Parameters
In the main text, we determine the values of $m$ and $\delta$ based on changes in Word F1\. Here, we report all results to provide a more comprehensive analysis\.
Tables [5](https://arxiv.org/html/2605.11964#A5.T5) and [7](https://arxiv.org/html/2605.11964#A5.T7) present how the performance on the DuRecDial and DuRecDial2\.0 datasets varies with different values of $m$\. On the DuRecDial dataset, most of the best results are achieved at $m=4$ for both the ID and OOD test sets, whereas on the DuRecDial2\.0 dataset, most of the best results occur at $m=3$ for both test sets\. For the DuRecDial dataset, when $m=2$ or $m=3$, predicting intent keywords for the next two or three steps may not provide sufficiently informative guidance\. As $m$ increases, performance generally improves and reaches a peak at $m=4$\. When $m=5$, additional noise is introduced by predicting intent keywords too far ahead, resulting in a performance drop; nevertheless, the results remain better than those observed with $m=2$ or $m=3$\. It is worth noting that on the DuRecDial dataset, when $m=2$, the performance drops dramatically, falling even below that of the vanilla T5\-Zh model\. This is because predicting only the next two intent keywords provides very limited guidance and further increases the sparsity of the already sparse embeddings \($\text{Emb}_{a}(\cdot)$ and $\text{Emb}_{t}(\cdot)$\)\. This effect is more pronounced on the DuRecDial dataset than on the DuRecDial2\.0 dataset\. Consequently, this phenomenon is not observed on the DuRecDial2\.0 dataset, indicating that our framework exhibits more stable performance on this dataset\. For the DuRecDial2\.0 dataset, competitive performance is obtained at $m=2$ and peaks at $m=3$\. Performance then declines at $m=4$ and $m=5$\. These results indicate that the sparsity of embeddings has only a limited effect on the stability of performance on the DuRecDial2\.0 dataset\.
Notably, when $m=2$, the framework achieves the best lexical diversity, which suggests that stronger intent\-keyword bridging \(larger $m$\) may over\-constrain the system's utterances and thereby reduce lexical diversity\. When $m=3$, the provided guidance strikes a balance across multiple metrics\. In contrast, when $m=4$ or $m=5$, the introduced noise leads to an overall decline in performance\.
Tables [6](https://arxiv.org/html/2605.11964#A5.T6) and [8](https://arxiv.org/html/2605.11964#A5.T8) show how performance on the DuRecDial and DuRecDial2\.0 datasets varies with $\delta$ when $m$ is fixed to its optimal setting\. On the DuRecDial dataset, most of the best results are achieved at $\delta=0.2$ on both the ID and OOD test sets, and on the DuRecDial2\.0 dataset the majority of the best results also occur at $\delta=0.2$ on both test sets\. In the soft mode, $\delta$ accounts for the uncertainty of the prediction\. In general, a larger $\delta$ yields more concise guidance at the cost of losing certain details, whereas a smaller $\delta$ offers richer information but inevitably introduces noise\.

For the DuRecDial dataset, across all evaluation metrics, the second\-best results are primarily observed at $\delta=0.1$ on the ID test set and at $\delta=0.3$ on the OOD test set\. This is because, on the ID test set, keyword\-topics may also appear in the training set, so providing more information, even if noisy, still has a positive effect; on the OOD test set, more concise information typically offers clearer and more direct guidance\. For the DuRecDial2\.0 dataset, the second\-best results on the ID test set are mostly observed at $\delta=0.1$ across all metrics; on the OOD test set the pattern differs slightly, with the best results mostly appearing at $\delta=0.1$ and the second\-best results at $\delta=0.2$\. The explanation is that the ID test set can benefit from additional information, even if noisy, which may positively influence the results, while the OOD test set requires more concise information to provide more direct guidance\. Although the specific values of $\delta$ differ between the DuRecDial and DuRecDial2\.0 datasets, the overall trend remains consistent\.
From the ablation study \(§[3\.4](https://arxiv.org/html/2605.11964#S3.SS4)\), we observe that intent\-keyword bridging, which connects dialogue history to upcoming utterances, affects multiple aspects of the generated utterances, including dialogue fluency, lexical diversity, user engagement, and target achievement, highlighting the benefits of higher\-level and flexible guidance\. The magnitude of this effect varies with the values of $m$ and $\delta$\.
Table 5: Full experiments on the DuRecDial dataset with different $m$ values\. Bold text highlights the best results, while underlined text indicates the second\-best\.

Table 6: Full experiments on the DuRecDial dataset with different $\delta$ values \(when $m=4$\)\. Bold text highlights the best results, while underlined text indicates the second\-best\.

Table 7: Full experiments on the DuRecDial2\.0 dataset with different $m$ values\. Bold text highlights the best results, while underlined text indicates the second\-best\.

Table 8: Full experiments on the DuRecDial2\.0 dataset with different $\delta$ values \(when $m=3$\)\. Bold text highlights the best results, while underlined text indicates the second\-best\.
## Appendix F Ablation Study
We report the ablation experiments on the DuRecDial2\.0 dataset here, as shown in Table [9](https://arxiv.org/html/2605.11964#A6.T9)\.
Table 9: Ablation study on the DuRecDial2\.0 dataset\.

Table 10: LLM\-as\-a\-judge scores of advanced models on the DuRecDial dataset\. "Pro\." = Proactivity, "Coh\." = Coherence, "App\." = Appropriateness, "Inf\." = Informativeness\. Bold text indicates the best results, and underlined text indicates the second\-best\.

Table 11: LLM\-as\-a\-judge scores of advanced models on the DuRecDial2\.0 dataset\. "Pro\." = Proactivity, "Coh\." = Coherence, "App\." = Appropriateness, "Inf\." = Informativeness\. Bold text indicates the best results, and underlined text indicates the second\-best\.

Table 12: LLM\-as\-a\-judge scores of LLMs on DuRecDial\. "Pro\." = Proactivity, "Coh\." = Coherence, "App\." = Appropriateness, "Inf\." = Informativeness\. Models labeled ♢ are fine\-tuned, while models labeled ♡ use the prompt\-based method\. Bold text indicates the best\-performing model; underlined text indicates the best\-performing model among the fine\-tuning methods\.

Table 13: LLM\-as\-a\-judge scores of LLMs on DuRecDial2\.0\. "Pro\." = Proactivity, "Coh\." = Coherence, "App\." = Appropriateness, "Inf\." = Informativeness\. Models labeled ♢ are fine\-tuned, while models labeled ♡ use the prompt\-based method\. Bold text indicates the best\-performing model; underlined text indicates the best\-performing model among the fine\-tuning methods\.
## Appendix G Human and LLM\-as\-a\-Judge Evaluations
Since automatic metrics cannot fully capture the quality of system utterances, they are complemented by additional evaluations: \(i\) we conduct pairwise human evaluations comparing our framework with the strong backbone models T5\-Zh and T5\-Flan on the DuRecDial and DuRecDial2\.0 datasets, respectively; \(ii\) an LLM\-as\-a\-judge approach is employed to further assess the performance of our framework against the baseline models\.
For pairwise human evaluation, we conduct both turn\-level and dialogue\-level evaluations\. For the turn\-level evaluation, we consider two aspects: appropriateness \(App\.\) and informativeness \(Inf\.\)\. \(1\) Appropriateness measures whether the system utterance aligns with the current intent keyword; \(2\) informativeness evaluates whether the system utterance incorporates relevant user profiles and domain knowledge\. For the dialogue\-level evaluation, we consider two aspects: proactivity \(Pro\.\) and coherence \(Coh\.\)\. \(1\) Proactivity evaluates whether the system can proactively introduce new keyword\-topics; \(2\) coherence evaluates the overall fluency and naturalness of the generated dialogue\. For the turn\-level evaluation, we randomly sample 500 utterances and evaluate each utterance independently, without considering the overall dialogue context\. For the dialogue\-level evaluation, 500 utterances are also sampled, but the entire dialogue process is considered, i\.e\., whether the dialogue progresses coherently\. All samples are labeled by three graduate students with backgrounds in natural language processing\. The labels "Win," "Tie," and "Lose" indicate whether our framework performs better, comparably, or worse, respectively, compared with the strong backbone models T5\-Zh and T5\-Flan\. We average the labels provided by the three annotators as the final results and report the corresponding percentages in Figures [7](https://arxiv.org/html/2605.11964#A7.F7) and [4](https://arxiv.org/html/2605.11964#S3.F4) \(we collect responses from the forms using Tencent Questionnaire and perform unified calculations to derive the final results\)\. Figure [7](https://arxiv.org/html/2605.11964#A7.F7) further validates our framework's substantial improvements in proactivity, fluency, and informativeness\.
In the evaluations of conversational coherence and proactivity, our framework demonstrates a clear advantage: on the ID test set, it significantly outperforms the baseline’s 4\.0% and 4\.3% with win rates of 36\.0% and 44\.7%, respectively; on the OOD test set, it also maintains a strong lead with win rates of 35\.0% and 39\.0%\. More importantly, when facing the OOD test set, our framework achieves an impressive win rate of 66\.0% in the appropriateness metric \(compared to the baseline’s 9\.0%\), while consistently maintaining a net advantage in informativeness\. This demonstrates that the synergy between CSM and IKB can advance the dialogue trajectory and enrich contextual content in a highly natural manner, effectively avoiding the mechanical and rigid utterances produced by traditional systems in pursuit of their targets\.
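The aggregation of "Win," "Tie," and "Lose" labels from the three annotators can be sketched as follows\. This is a minimal illustration, not the authors' script: it assumes that each annotator's labels are first converted to percentages and that these percentages are then averaged across annotators; the function names and the toy labels are hypothetical\.

```python
from collections import Counter

LABELS = ("Win", "Tie", "Lose")


def annotator_percentages(labels):
    """Percentage of each label among one annotator's judgments."""
    if not labels:
        raise ValueError("an annotator must provide at least one label")
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(labels)
    return {lab: 100.0 * counts[lab] / len(labels) for lab in LABELS}


def aggregate(annotations):
    """Average the per-annotator percentages (assumed aggregation rule)."""
    if not annotations:
        raise ValueError("need at least one annotator")
    per_annotator = [annotator_percentages(labels) for labels in annotations]
    n = len(per_annotator)
    return {lab: sum(p[lab] for p in per_annotator) / n for lab in LABELS}


# Toy example with three annotators and four samples each (made-up labels).
toy = [
    ["Win", "Win", "Tie", "Lose"],
    ["Win", "Tie", "Tie", "Lose"],
    ["Win", "Win", "Win", "Lose"],
]
result = aggregate(toy)
```

When every annotator labels the same number of samples, this is equivalent to pooling all labels and computing the percentages once\.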
For the LLM\-as\-a\-judge approach, the prompt template used in this work is shown in Figure [9](https://arxiv.org/html/2605.11964#A7.F9)\(E\) \(we use the latest Qwen\-plus LLM\), and the corresponding results are presented in Tables [10](https://arxiv.org/html/2605.11964#A6.T10), [11](https://arxiv.org/html/2605.11964#A6.T11), [12](https://arxiv.org/html/2605.11964#A6.T12), and [13](https://arxiv.org/html/2605.11964#A6.T13)\.
Tables[10](https://arxiv.org/html/2605.11964#A6.T10)and[11](https://arxiv.org/html/2605.11964#A6.T11)present the automatic evaluation results based on the LLM\-as\-a\-judge, compared with advanced models on the DuRecDial and DuRecDial2\.0 datasets, respectively\. These results are highly consistent with the objective metrics in Table[1](https://arxiv.org/html/2605.11964#S2.T1)from the main experiment, further solidifying the core advantages of our framework in improving proactivity, coherence, and informativeness\. In direct comparisons with advanced models \(such as TPDial and TRIPDial\), our framework achieves state\-of\-the\-art performance across almost all evaluation dimensions on the ID and OOD test sets of both the DuRecDial and DuRecDial2\.0 datasets\. In particular, for proactivity and coherence—two metrics directly measuring the naturalness of interaction—our framework significantly and consistently outperforms strong baseline models\. This demonstrates that our framework provides multi\-step dynamic look\-ahead guidance, enabling smooth and natural topic transitions and effectively avoiding the abrupt utterances produced by traditional systems in pursuit of their targets\. Meanwhile, the absolute lead in informativeness further indicates that by introducing scenario bias, our framework successfully integrates user profiles and domain knowledge into the utterance generation process, greatly enriching the contextual information of the dialogue\.
Tables [12](https://arxiv.org/html/2605.11964#A6.T12) and [13](https://arxiv.org/html/2605.11964#A6.T13) compare our lightweight framework \(Ours 0\.3B\) with LLMs of various sizes on the DuRecDial and DuRecDial2\.0 datasets, respectively\. The conclusions align with the findings in Table [2](https://arxiv.org/html/2605.11964#S3.T2) from the main experiment, demonstrating the strength of the proposed architecture even against LLMs with far more parameters\. Among all LLMs trained with parameter\-efficient fine\-tuning \(including LLaMA$_{\text{1B/3B}}$ and Qwen$_{\text{3B}}$\), our framework, with only 0\.3B parameters, consistently ranks highest in proactivity, coherence, appropriateness, and informativeness on the ID and OOD test sets of both datasets\. Large prompt\-based models \(such as Qwen$_{\text{14B/32B}}$\) achieve higher absolute scores owing to their extensive internal knowledge; in addition, prompt\-based methods typically produce longer utterances, to which the LLM\-as\-a\-judge usually assigns higher scores\. However, as shown in our main experiment \(Table [2](https://arxiv.org/html/2605.11964#S3.T2)\), these models often compromise dialogue quality \(W\. F1\) and produce abrupt topic transitions\. In contrast, our framework balances generation quality, proactive guidance, and natural interaction, while maintaining minimal computational overhead and a lightweight footprint\. This not only demonstrates the substantial potential of our framework in resource\-constrained scenarios but also highlights that integrating conversational scenario modeling with higher\-level, flexible intent keyword guidance offers an effective path toward bringing target\-guided proactive dialogue systems closer to real\-world human interaction\.
Figure 7: Human evaluation on the DuRecDial dataset \(our framework vs\. T5\-Zh\)\. "App\." denotes "appropriateness," "Inf\." denotes "informativeness," "Pro\." denotes "proactivity," and "Coh\." denotes "coherence\."

Figure 8: Case study \(OOD test set\)\. \(a\) Results of T5\-Flan; \(b\) Results of our framework; \(c\) Dialogue\-level intent keyword transitions across turns\.

Figure 9: The prompts used in this work\.

| Field | Value |
| --- | --- |
| Name | Xinling Chen |
| Age Range | 26\-35 |
| Gender | Female |
| Residence | Haikou |
| Occupation | Employed |
| Accepted Celebrities | Hsu Chi; Leehom Wang |
| Accepted Music | A Simple Song; Heroes of Earth; All the Things You Never Knew; Secret Garden; Bosom Friend |
| Rejected Music | KISS GOODBYE; That Year; Change Me |
| Accepted POI | One Plus One Northeastern Chinese Dishes and Dumplings \(Renmin Avenue Store\) |
| etc\. | |

Table 14: Visualization of user profile modeling $\mathcal{F}_{u}(\mathbf{H}^{u})$ in the Case Study \(ID\)\. Bold highlights the information identified as relevant by the user profile modeling\.

- <Leehom Wang, Birthday, 1976 \- 5 \- 17\>
- <Leehom Wang, Achievement, The Most Popular Male Singer of MTV Asian Music in Taiwan\>
- <Leehom Wang, Achievement, Best Male Singer of Global Chinese Golden Chart\>
- <Leehom Wang, Achievement, the 15th Global Chinese Music Award for Best Male Singer in Hong Kong and Taiwan\>
- <Leehom Wang, Awards, The Chinese Film Media Award for Most Popular Actor voted by the audience\>
- <Leehom Wang, Sings, The Sun Washed by Spring Rain\>
- etc\.

Table 15: Visualization of domain knowledge modeling $\mathcal{F}_{k}(\mathbf{H}^{k})$ in the Case Study \(ID\)\.

| Field | Value |
| --- | --- |
| Name | Pingshan Han |
| Age Range | 18\-25 |
| Gender | Female |
| Residence | Nantong |
| Occupation | Employed |
| Accepted Celebrities | Leslie Cheung |
| Accepted Movies | Days of Being Wild; Moonlight Express; Happy Together |
| Rejected Movies | Double Tap |
| Accepted Food | Marinated Fish |
| Accepted POI | Lahuangshang Spicy Pot Roasted Fish |
| Favorite News | Leslie Cheung's News |
| etc\. | |

Table 16: Visualization of user profile modeling $\mathcal{F}_{u}(\mathbf{H}^{u})$ in the Case Study \(OOD\)\. Bold highlights the information identified as relevant by the user profile modeling\.

- <Leslie Cheung, Stars, Days of Being Wild\>
- <Days of Being Wild, Director, Wong Kar Wai\>
- <Days of Being Wild, Reputation, The reputation is good\. Its rating is in the top 5 this year\. Its rating is in the top 10 this year\>
- <Leslie Cheung, Achievement, Three times shortlisted for the Best Actor Award in the Cannes International Film Festival\>
- <Leslie Cheung, Achievement, introduced in Encyclopedia Britannica\>
- <Leslie Cheung, Stars, He Is a Woman, She Is a Man\>
- etc\.

Table 17: Visualization of domain knowledge modeling $\mathcal{F}_{k}(\mathbf{H}^{k})$ in the Case Study \(OOD\)\.

| Dataset | Model | ID: F1 \(Keyword\-type\) | ID: F1 \(Keyword\-topic\) | OOD: F1 \(Keyword\-type\) | OOD: F1 \(Keyword\-topic\) |
| --- | --- | --- | --- | --- | --- |
| DuRecDial | LLaMA | 98\.23 | 97\.43 | 98\.45 | 97\.31 |
| DuRecDial2\.0 | LLaMA | 98\.42 | 96\.68 | 98\.37 | 92\.47 |

Table 18: Intent keyword prediction performance\.
## Appendix H Case Study
To more clearly and intuitively evaluate the performance of our framework, we conduct another case study on the ID and OOD test sets, as shown in Figure[8](https://arxiv.org/html/2605.11964#A7.F8)\. Notably, the case study adopts the hard mode, enabling a clearer comparison with T5\-Flan\. Using examples from the OOD test set, Figure[8](https://arxiv.org/html/2605.11964#A7.F8)further validates that even when faced with target topics not encountered during training, our framework significantly improves dialogue proactivity, fluency, and informativeness\. At the beginning of the dialogue, the baseline model T5\-Flan makes a factual error \(It’s Andy Lau\.\) and subsequently offers only general terms \(It’s a classic movie about love\.\), revealing its shortcomings in knowledge coverage and planning in unseen domains\. Conversely, thanks to the strong generalization ability of our framework \(as revealed by the ablation experiments, where removing CSM leads to a sharp drop in knowledge utilization K\. F1 on the OOD test set\), our framework not only accurately generates \(It’s Leslie Cheung\.\) but also proactively introduces deeper domain knowledge such as \(shortlisted for the Best Actor Award of Cannes\), greatly enriching the dialogue\. Meanwhile, IKB continues to demonstrate outstanding dynamic guidance on the OOD test set, ensuring natural and coherent logical transitions\. This aligns with its decisive role in reducing the high failure rate of OOD targets observed in the ablation experiments, further confirming that our framework effectively bridges the gap between target\-guided dialogue systems and natural real\-world interactions\.
At the same time, we examine the content emphasized by conversational scenario modeling, namely user profile modeling $\mathcal{F}_{u}(\mathbf{H}^{u})$ and domain knowledge modeling $\mathcal{F}_{k}(\mathbf{H}^{k})$, which are jointly employed to capture conversational scenario bias in this work \(see §[2\.2](https://arxiv.org/html/2605.11964#S2.SS2)\)\. We examine the user profile and domain knowledge items associated with tokens that receive higher modeling probabilities within a conversation\. For the example in Figure [5](https://arxiv.org/html/2605.11964#S3.F5), the information highlighted in the user profiles and domain knowledge is presented in Table [14](https://arxiv.org/html/2605.11964#A7.T14) and Table [15](https://arxiv.org/html/2605.11964#A7.T15), respectively\. Similarly, for the example in Figure [8](https://arxiv.org/html/2605.11964#A7.F8), the information highlighted in the user profiles and domain knowledge is presented in Table [16](https://arxiv.org/html/2605.11964#A7.T16) and Table [17](https://arxiv.org/html/2605.11964#A7.T17), respectively\. For user profiles, bold is used to mark the emphasized information\. For domain knowledge, since the domain knowledge is pre\-processed and all items are considered by the domain knowledge modeling, we only visualize the items relevant to the current conversation\.
As shown in Table [14](https://arxiv.org/html/2605.11964#A7.T14), in this conversation the user profile modeling focuses on *Accepted Celebrities* as well as items related to *Accepted Music* and *Rejected Music*\. These indicate the celebrity and music categories that the user is interested in; however, this does not necessarily imply that the user must accept or reject a specific item, as the actual dialogue flow must also be considered\. Overall, users tend to discuss celebrities first, followed by music\-related topics\. From Table [15](https://arxiv.org/html/2605.11964#A7.T15), we can observe that all triples beginning with *Leehom Wang* are highlighted\. His birthday \(1976\-5\-17\) contributes to the first turn; his achievements and awards play a role in the third and fourth turns; and his song *The Sun Washed by Spring Rain* informs the fourth turn\. Together, these elements provide additional information that enriches the dialogue, thereby enhancing the fluency and informativeness of the utterances\.

As shown in Table [16](https://arxiv.org/html/2605.11964#A7.T16), in this user profile the modeling highlights *Accepted Movies* and *Accepted POI*, even though no point\-of\-interest information appears in the conversation\. In contrast, *Accepted Celebrities: Leslie Cheung*, which is directly relevant to the dialogue, is not emphasized in the user profile\. These observations suggest that, on the one hand, the content identified by the modeling is meaningful only when interpreted within the specific dialogue context; on the other hand, on the OOD test set, keyword\-topics absent from the training data negatively affect user profile modeling\. From Table [17](https://arxiv.org/html/2605.11964#A7.T17), we observe that all triples beginning with *Leslie Cheung* are highlighted\. The movie he stars in, *Days of Being Wild*, plays a role in the first turn, and *He Is a Woman, She Is a Man* contributes to the fourth turn; his achievements and awards inform the second turn\. Moreover, the *Director* and *Reputation* aspects of *Days of Being Wild* further enrich the conversation, as evidenced in the second turn\.
Whether in the user profiles or domain knowledge, much of the information is either overlooked or, although emphasized, may not be fully reflected in a single conversation\. This is expected, as such information assumes a more meaningful role when combined with intent keywords and the specific dialogue context\. Items previously ignored may be highlighted due to the intent keywords, whereas items that were previously focused but are irrelevant to the current conversation will be disregarded to avoid adverse effects, thereby generating more proactive, fluent, and informative utterances\. It is worth noting that both user profile modeling and domain knowledge modeling operate at the conversational level and cannot be applied to individual dialogue turns\. The above descriptions are intended to illustrate their respective roles\. At the conversational level, user profile and domain knowledge modeling \(i\.e\., conversational scenario modeling\) jointly influence and enrich the generated utterances, enabling precise initiative and enhanced user engagement\.
To sum up, user profile modeling renders the utterance more consistent with user profiles, thereby increasing lexical diversity, whereas domain knowledge modeling aligns the utterance with domain knowledge, improving knowledge utilization and indirectly enhancing user engagement\. Taken together with the preceding ablation studies \(§[3\.4](https://arxiv.org/html/2605.11964#S3.SS4)\), these examples further validate the effectiveness of conversational scenario modeling for utterance generation, demonstrating substantial gains across multiple aspects—particularly in lexical diversity and user engagement—while enabling precise initiative\.
## Appendix I Prompting Template
This section presents the prompt templates used in this work\. For fine\-tuning LLMs, Figures [9](https://arxiv.org/html/2605.11964#A7.F9)\(A\) and [9](https://arxiv.org/html/2605.11964#A7.F9)\(B\) correspond to the DuRecDial and DuRecDial2\.0 datasets, respectively\. For the prompt\-based LLM approach, Figures [9](https://arxiv.org/html/2605.11964#A7.F9)\(C\) and [9](https://arxiv.org/html/2605.11964#A7.F9)\(D\) are used for the DuRecDial and DuRecDial2\.0 datasets, respectively\. Notably, the prompt\-based method also requires an intent keyword field\. We employ LLaMA\-8B as the predictor, and its performance is reported in Table [18](https://arxiv.org/html/2605.11964#A7.T18)\. We adopt the micro\-averaged F1 score \(F1\), computed from the precision and recall of the predicted keyword types and keyword topics\. As shown in Table [18](https://arxiv.org/html/2605.11964#A7.T18), intent keyword prediction is highly accurate, indicating that the prompt\-based LLM method compared in this paper \(see Table [2](https://arxiv.org/html/2605.11964#S3.T2)\) is reliable\.
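The micro\-averaged F1 used for intent keyword prediction can be illustrated with the sketch below\. It is an assumption\-laden example rather than the paper's evaluation code: it treats the predicted and gold keywords of each example as sets and pools true positives, false positives, and false negatives over all examples; the function name and the toy keyword lists are hypothetical\.

```python
def micro_prf(predicted, gold):
    """Micro-averaged precision, recall, and F1 over per-example keyword sets.

    Assumption: each element of `predicted` and `gold` is an iterable of
    keyword strings for one example, compared as sets.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        pred_set, ref_set = set(pred), set(ref)
        tp += len(pred_set & ref_set)   # correctly predicted keywords
        fp += len(pred_set - ref_set)   # predicted but not in the gold set
        fn += len(ref_set - pred_set)   # gold keywords that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy keyword-type predictions (made-up values for illustration).
pred_types = [["movie", "music"], ["food"]]
gold_types = [["movie", "news"], ["food"]]
p, r, f1 = micro_prf(pred_types, gold_types)
```

Pooling counts before computing precision and recall is what distinguishes micro\-averaging from macro\-averaging, which would average per\-class or per\-example scores instead\.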