STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

arXiv cs.CL Papers

Summary

STRIDE-ED is a strategy-grounded reasoning framework for empathetic dialogue systems that uses structured multi-stage reasoning combined with a data refinement pipeline and two-stage training (supervised fine-tuning + multi-objective RL) to improve emotional understanding and response generation. The framework demonstrates consistent improvements across open-source LLMs on both automatic metrics and human evaluations.

arXiv:2604.07100v2

# STRIDE-ED: A Strategy-Grounded Stepwise Reasoning Framework for Empathetic Dialogue Systems

Source: https://arxiv.org/html/2604.07100

Hongru Ji¹, Yuyin Fan¹, Meng Zhao², Xianghua Li¹, Lianwei Wu³ and Chao Gao¹

¹School of Artificial Intelligence, OPtics and ElectroNics (iOPEN), Northwestern Polytechnical University, China
²School of Artificial Intelligence and Big Data, Henan University of Technology, China
³School of Computer Science, Northwestern Polytechnical University, China

{jihongru,fanyuyin}@mail.nwpu.edu.cn, [email protected], {li_xianghua,wlw,cgao}@nwpu.edu.cn

## Abstract

Empathetic dialogue requires not only recognizing a user's emotional state but also making strategy-aware, context-sensitive decisions throughout response generation. However, the lack of a comprehensive empathy strategy framework, explicit task-aligned multi-stage reasoning, and high-quality strategy-aware data fundamentally limits existing approaches, preventing them from effectively modeling empathetic dialogue as a complex, multi-stage cognitive and decision-making process. To address these challenges, we propose STRIDE-ED, a STRategy-grounded, Interpretable, and DEep reasoning framework that models Empathetic Dialogue through structured, strategy-conditioned reasoning. To support effective learning, we develop a strategy-aware data refinement pipeline integrating LLM-based annotation, multi-model consistency-weighted evaluation, and dynamic sampling to construct high-quality training data aligned with empathetic strategies. Furthermore, we adopt a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning to better align model behaviors with target emotions, empathetic strategies, and response formats. Extensive experiments demonstrate that STRIDE-ED generalizes across diverse open-source LLMs and consistently outperforms existing methods on both automatic metrics and human evaluations.

## 1 Introduction

**Figure 1:** Illustration of the STRIDE-ED reasoning process. Given a user utterance, the framework performs scenario summarization, emotion recognition, strategy inference, and strategy-guided response generation.

Empathetic dialogue, a cornerstone of human social interaction, requires not only the recognition of another's emotional state but also the formulation of a response that conveys understanding, validation, and appropriate support (Batson, 2009). From a psychological perspective, this is a complex, multi-stage cognitive and decision-making process (Davis, 1983; Gao et al., 2023). This highlights the strategic and context-sensitive nature of empathy, making empathetic dialogue a challenging task beyond surface-level text generation.

Early studies focused on implicitly enhancing dialogue models through external commonsense knowledge or affective lexicons. For example, prior works (Liu et al., 2022; Cai et al., 2023) incorporate emotional commonsense graphs or commonsense knowledge selection into neural architectures to provide latent cues for more emotion-aware generation. However, without explicit modeling of reasoning or decision processes (Zhang et al., 2026c, b; Chen et al., 2025a; Luo et al., 2026; Wang et al., 2026), these approaches offer limited insight into the mechanisms through which emotional understanding informs response generation.

To address this lack of transparency, recent research has turned to Large Language Models (LLMs), which enable more explicit modeling of intermediate reasoning processes through Chain-of-Thought (CoT) prompting. Building on this direction, prior work such as Hu et al. (2025) further encourages LLMs to generate explicit intermediate rationales before producing responses, thereby enhancing interpretability (Zhang et al., 2024; Chen et al., 2025b). However, a fundamental limitation persists in these CoT-based approaches: the reasoning process lacks grounding in a rigorous strategic framework. Existing strategies (Liu et al., 2021; Zhang et al., 2025), often derived from specific domains such as psychological support, primarily address negative emotions and are confined to low-level responses. Emotions in dialogue typically span positive, negative, and neutral states (Ji et al., 2025), yet existing methods fail to fully capture this spectrum and lack support for higher-order cognitive strategies, as shown in Figure 1.

Consequently, while these existing works are structured in form, the model's reasoning steps frequently remain superficial and exhibit inconsistent strategic coherence. In summary, existing approaches suffer from three key limitations:

**(1) Incomplete Empathy Strategy Coverage.** Prior methods lack a comprehensive strategy system covering diverse emotional states and higher-order cognition, limiting principled decision-making.

**(2) Lack of Task-Aligned Multi-Stage Reasoning.** CoT-based methods do not explicitly model dialogue as a task-specific multi-stage reasoning process.

**(3) Insufficient Strategy-Aware Supervision.** Training data lack sufficient high-quality annotations aligned with empathetic strategies and reasoning.

To address these issues, we propose STRIDE-ED, a framework that proceeds by first establishing a comprehensive strategy system, which then enables task-aligned reasoning through a dedicated data pipeline. At its core, we construct a unified strategy system covering positive, neutral, and negative emotions to guide response generation. Leveraging this system, we automatically annotate the EMPATHETIC DIALOGUES dataset (Rashkin et al., 2019) with strategy types and rationales using authoritative LLMs. Subsequently, a rigorous data refinement process employs multi-LLM evaluation with consistency-weighted scoring and strategy-aware sampling to curate high-quality training subsets. Finally, the model is optimized through a two-stage training paradigm that combines supervised fine-tuning with multi-objective reinforcement learning, ensuring alignment with strategic, emotional, and structural correctness.

Our contributions can be summarized as follows:

- We propose an interpretable framework for empathetic dialogue, STRIDE-ED, featuring a comprehensive empathy strategy system and stepwise decision-making.
- To support effective training, a strategy-aware data refinement pipeline is constructed, combining LLM-based annotation, multi-model weighted evaluation, and dynamic sampling to regulate strategy distribution and difficulty.
- A two-stage training paradigm is introduced, in which supervised fine-tuning establishes reasoning, followed by reinforcement learning to improve emotional alignment, strategy execution, and response consistency.
- Extensive experiments show that our framework generalizes across diverse open-source LLMs and achieves superior performance on both automatic and human evaluations.

## 2 Related Works

### 2.1 Implicit Knowledge-Driven Empathy

Early empathetic dialogue models lacked sufficient prior knowledge, limiting their understanding of users' emotions and contexts. To address this, some studies incorporate external knowledge to broaden the model's perspective (Zhang et al., 2026a) and enhance reasoning. For example, Ghosal et al. (2020) leveraged structured commonsense knowledge from ATOMIC (Sap et al., 2019), such as mental states and causal relations, to model interlocutor interactions for improved emotion understanding. Zhong et al. (2021) integrated commonsense-aware emotional latent concepts to generate emotionally appropriate responses, while Sabour et al. (2022) inferred users' situational contexts through commonsense-based reasoning. Further, Cai et al. (2023) introduced an adaptive commonsense knowledge selection mechanism to refine contextual cognition, and Qiao et al. (2025) constructed multi-hop reasoning graphs to incorporate external knowledge during response generation. However, these methods did not explicitly model strategy-guided decision-making.

### 2.2 Explicit Reasoning with CoT for Empathy

LLMs possess extensive knowledge, and CoT prompting (Wei et al., 2022) enables stepwise reasoning, facilitating more explicit decision-making in empathetic dialogue. Building on this paradigm, Tu et al. (2022) inferred fine-grained user emotions and employed a mixed strategy mechanism for response generation, while Chen and Liu (2023) dynamically generated counseling strategies via zero-shot prompting to guide personalized responses. Subsequent work further enhanced empathy through data and structural designs: Chen et al. (2023) fine-tuned LLMs on consultant-style multi-turn dialogues, and Ye et al. (2025) adopted a strategy-enhanced role-playing framework with multiple interacting roles to generate diverse training data. Finally, Zhang et al. (2025) introduced an intention-centered framework that mapped inferred supporter intentions to support strategies using a chain-of-thought mechanism. Despite improved interpretability, these CoT-based approaches lack comprehensive strategy coverage and do not explicitly model the full, structured reasoning process underlying human empathetic decision-making.

## 3 Methodology

STRIDE-ED is a general-purpose framework for empathetic dialogue applicable to diverse open-source LLMs. It implements a comprehensive empathy strategy system and a task-aligned, multi-step CoT paradigm to model dialogue as a progressive cognitive and decision-making process. At the data and training levels, STRIDE-ED integrates LLM-based automatic annotation, strategy-aware data refinement, and two-stage optimization. An overview is shown in Figure 2.

**Figure 2:** The architecture of the STRIDE-ED framework, illustrating the complete pipeline from data preparation and refinement to model training.
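The second training stage is described as multi-objective reinforcement learning that aligns the model with target emotions, empathetic strategies, and response formats. The exact reward formulation is not given in this excerpt, so the following is only a minimal sketch of one common way to scalarize such objectives: a weighted sum of three binary terms. The names `RewardWeights` and `composite_reward`, the exact-match scoring, and the default weights are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights for the three objectives (illustrative values)."""
    emotion: float = 1.0
    strategy: float = 1.0
    format: float = 0.5


def _same_label(predicted: str, gold: str) -> bool:
    """Case- and whitespace-insensitive label comparison."""
    return predicted.strip().lower() == gold.strip().lower()


def composite_reward(
    pred_emotion: str,
    gold_emotion: str,
    pred_strategy: str,
    gold_strategy: str,
    format_ok: bool,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of emotion, strategy, and format terms (assumed form)."""
    emotion_term = 1.0 if _same_label(pred_emotion, gold_emotion) else 0.0
    strategy_term = 1.0 if _same_label(pred_strategy, gold_strategy) else 0.0
    format_term = 1.0 if format_ok else 0.0
    return (
        weights.emotion * emotion_term
        + weights.strategy * strategy_term
        + weights.format * format_term
    )
```

A weighted sum keeps each objective inspectable on its own, which makes it easy to see which term a policy is failing on; other scalarizations (for example, gating the emotion and strategy terms on a valid format) would be equally plausible under the description given here.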

### 3.1 Task Formulation

Empathetic dialogue consists of a multi-turn interaction between a user and a conversational agent. We denote the dialogue history as $C = \{u_1, u_2, \ldots, u_{t-1}\}$, where $u_i$ represents the utterance at the $i$-th turn. Each utterance $u_i$ is a sequence of tokens, i.e., $u_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,n_i})$. The goal of empathetic dialogue is to generate an empathetic response $u_t$ while appropriately recognizing the user's emotional state $e$. In this work, we introduce auxiliary objectives and model empathetic dialogue as a sequential generation process. Specifically, conditioned on $C$, the model generates a dialogue scenario summary $\mathit{sum}$, infers the target emotional state $e$, determines an empathy strategy $\mathit{stra}$ along with its execution actions $\mathit{acts}$, and finally produces the empathetic response $u_t$, corresponding to modeling the conditional distribution $P(\mathit{sum}, e, \mathit{stra}, \mathit{acts}, u_t \mid C)$.
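Because the stages are generated in sequence, the joint distribution above can be expanded with the chain rule of probability. The expansion below is an exact identity for any joint distribution; the only assumption is the stage ordering (summary, emotion, strategy, actions, response), which follows the order in which the text lists the stages.

```latex
\begin{aligned}
P(\mathit{sum}, e, \mathit{stra}, \mathit{acts}, u_t \mid C)
  ={}& P(\mathit{sum} \mid C) \\
     & \cdot P(e \mid C, \mathit{sum}) \\
     & \cdot P(\mathit{stra} \mid C, \mathit{sum}, e) \\
     & \cdot P(\mathit{acts} \mid C, \mathit{sum}, e, \mathit{stra}) \\
     & \cdot P(u_t \mid C, \mathit{sum}, e, \mathit{stra}, \mathit{acts})
\end{aligned}
```

Read this way, each factor corresponds to one reasoning stage conditioned on everything produced before it, which is what an autoregressive LLM computes when it emits the stages left to right.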

### 3.2 Empathy Strategy System

In empathetic dialogue modeling, response strategies form a crucial intermediate stage between emotion understanding and response generation. Rather than generating surface-level replies directly, effective models must explicitly decide how to respond based on the user's emotional state and dialogue context. Liu et al. (2021) proposed a widely adopted taxonomy of eight empathy strategies, primarily applied in counseling-oriented settings that focus on negative emotions. Higher-order cognitive strategies, however, remain underexplored, limiting the effectiveness of current systems in complex dialogues requiring sophisticated reasoning.

Motivated by these observations, we first analyze the emotional distributions in the EMPATHETIC DIALOGUES dataset and conduct a fine-grained examination of response content. Based on these analyses, we expand the original empathy strategy set to better accommodate positive, neutral, and negative emotional contexts, and assign a three-level difficulty rating (I–III) to each strategy to reflect the cognitive complexity involved in its application. In particular, the original Question strategy is further subdivided into three distinct *Exploring*-type strategies to capture more nuanced cognitive and emotional interactions. The structured strategy system guides both annotation and model training (full details are provided in Appendix A).

### 3.3 Stepwise Cognitive CoT Design

Inspired by insights from cognitive psychology, we model the human thought process during empathetic dialogue as a stepwise, incremental deliberation. Upon receiving a speaker's utterance, individuals first infer missing information or make educated guesses about the described situation based on prior knowledge and contextual understanding, forming a high-level situational representation. This situational comprehension allows them to adopt the speaker's perspective and facilitates accurate recognition of the speaker's emotional state. Next, humans deliberate on which response strategy to employ, guided by the speaker's emotional state and contextual cues. Because strategies represent generalized methods, their concrete execution may differ across scenarios. To account for this variability, we introduce an action inference stage that bridges strategy selection and the generation of the final response. This stepwise CoT design captures the structured cognitive progression in human empathetic reasoning, providing a principled framework for multi-stage, strategy-aware response generation. Implementation leverages structured intermediate tags, such as `<scenario>`, `<emotion>`, and `<strategy>`, to guide the model's internal reasoning, culminating in the final response.
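To make the tagged output format concrete, here is a minimal parsing sketch. Only the `scenario`, `emotion`, and `strategy` tags are named in the text; the `action` and `response` tags, the closing-tag convention, the function name `parse_stepwise_output`, and the example values are assumptions introduced for illustration.

```python
import re
from typing import Dict, List

# Stage order follows the stepwise design described above.
# "action" and "response" are assumed tag names, not confirmed by the source.
STAGES: List[str] = ["scenario", "emotion", "strategy", "action", "response"]


def parse_stepwise_output(text: str) -> Dict[str, str]:
    """Extract each stage's content, requiring the stages to appear in order."""
    fields: Dict[str, str] = {}
    cursor = 0
    for stage in STAGES:
        pattern = re.compile(rf"<{stage}>(.*?)</{stage}>", re.DOTALL)
        # Search only after the previous stage, so out-of-order output fails.
        match = pattern.search(text, cursor)
        if match is None:
            raise ValueError(f"missing or out-of-order <{stage}> block")
        fields[stage] = match.group(1).strip()
        cursor = match.end()
    return fields


# Illustrative model output; all values are made up for this sketch.
example_output = (
    "<scenario>The user failed a driving test after weeks of practice.</scenario>\n"
    "<emotion>disappointed</emotion>\n"
    "<strategy>Affirmation and Reassurance</strategy>\n"
    "<action>Acknowledge the effort and normalize the setback.</action>\n"
    "<response>That sounds really frustrating after all that practice.</response>"
)

parsed = parse_stepwise_output(example_output)
```

Enforcing the stage order at parse time gives a simple structural check: an output that skips a stage or emits the response before the strategy is rejected instead of being silently accepted.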

### 3.4 LLM-Based Automated Data Annotation

Building upon the aforementioned theoretical framework, we address the absence of explicit annotations...
