EIBench: A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management
Summary
EIBench introduces a simulator-based benchmark for interactive emotion management in LLMs, enabling evaluation and training via per-turn user state feedback. The authors propose CTC-GRPO, a reinforcement learning method that improves emotion management performance across multiple benchmarks.
# A Simulator-Based Benchmark and Turn-Credit RL for Emotion Management
Source: [https://arxiv.org/html/2606.15532](https://arxiv.org/html/2606.15532)
Rongzhi Zhu1, Xiang Huang2, Yuchuan Wu2, Rui Wang2, Zequn Sun1,†, Tao Ren2, Weiyao Luo2, Bingxue Qiu2, Jieping Ye2, Yongbin Li2,†, Wei Hu1

1 State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China; 2 Qwen\-Character Team, Alibaba Group; † Corresponding authors
###### Abstract
Emotional intelligence \(EI\) in Large Language Models \(LLMs\) is often evaluated through static understanding tasks or single\-response dialogue generation\. However, emotion management is interactive: a good model should not only recognize a user’s emotion, but also improve the user’s emotional and relational state over several turns\. We introduce EIBench, a simulator\-based benchmark for interactive emotion management\. EIBench contains 2,222 scenarios, with 2,009 for training and 213 for held\-out testing\. The scenarios are organized by a $2\times 2$ taxonomy covering Support, Defense, Repair, and Charm, which together capture different forms of support, boundary maintenance, trust repair, and rapport building\. In each scenario, an LLM simulator plays the user, updates an emotion–relation state after each turn, and maps the final state to an anchor\-based score\. This design makes EIBench both an evaluation benchmark and a training environment: the final state gives the outcome reward, while the per\-turn state updates provide dense feedback for RL\. We evaluate 15 open\- and closed\-source LLMs\. Current models perform well on support and rapport\-building scenes, but struggle with boundary maintenance under user pressure\. To improve the EI ability of LLMs, we propose Centered Turn\-Credit GRPO \(CTC\-GRPO\), a GRPO extension that reuses the simulator’s per\-turn state updates as dense turn\-level feedback while preserving the final outcome reward\. CTC\-GRPO improves Qwen3\-8B from $-22.4$ to $+22.4$ on EIBench and also improves on out\-of\-distribution evaluations including SAGE \(\+12\.4\) and EQBench3 \(\+20\.9%\)\. Our results show that simulator\-tracked user states can support both evaluation and training for multi\-turn emotion management\.
## 1 Introduction
Large language models \(LLMs\) have advanced rapidly with recent scaling and post\-training breakthroughs \(OpenAI, [2023](https://arxiv.org/html/2606.15532#bib.bib21); Team, [2025a](https://arxiv.org/html/2606.15532#bib.bib2); Guo et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib20); Team, [2025b](https://arxiv.org/html/2606.15532#bib.bib22)\)\. These gains are especially prominent in complex reasoning domains such as mathematics \(Shao et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib12); Chervonyi et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib23)\) and coding \(Cao et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib18); Guo et al\., [2024](https://arxiv.org/html/2606.15532#bib.bib19)\)\. A key driver of these gains is reinforcement learning with verifiable rewards \(RLVR\), where correctness labels provide clear, automated feedback and make reward optimization relatively straightforward \(Wen et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib25); Zeng et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib26); DeepResearch et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib24)\)\.
However, for subjective tasks such as social and emotional interaction, there is often no single correct response\. What matters is how the model’s actions shape and shift the other person’s emotional state across multiple turns\. Mayer and Salovey \(Mayer et al\., [1997](https://arxiv.org/html/2606.15532#bib.bib1)\) conceptualize this ability as emotional intelligence \(EI\) and distinguish four hierarchical layers: *perceiving* emotions, *using* them to facilitate thought, *understanding* their structure, and *managing* them in oneself and others\. The first three abilities can often be assessed from a fixed context: identifying which emotion is present, explaining why it arises, or examining how it informs reasoning\. Emotion management, however, is fundamentally different: it is a state\-changing problem in which a model’s responses are judged by how effectively they guide the other person’s emotional and relational state over time\.
We observe that most EI evaluations still focus on emotion understanding or narrow forms of emotional interaction, rather than broad emotion management across multi\-turn dialogue\. Early datasets such as Empathetic Dialogues \(Rashkin et al\., [2019](https://arxiv.org/html/2606.15532#bib.bib33)\) and ESConv \(Liu et al\., [2021](https://arxiv.org/html/2606.15532#bib.bib35)\) treat emotional dialogue as supervised learning over annotated conversations\. Recent benchmarks move toward more open\-ended evaluation: EQBench3 \(Paech, [2023](https://arxiv.org/html/2606.15532#bib.bib9)\) asks models to infer mental states and choose appropriate responses, while SOTOPIA \(Zhou et al\., [2024](https://arxiv.org/html/2606.15532#bib.bib51)\) evaluates goal completion in multi\-agent social scenarios\. SAGE \(Zhang et al\., [2025a](https://arxiv.org/html/2606.15532#bib.bib10)\) is closer to our setting, as an LLM simulator plays a support seeker and the tested model provides support over a full dialogue\. However, these settings still do not fully cover emotion management\. SOTOPIA focuses on goal completion rather than improving the other person’s emotional and relational state, while SAGE focuses mainly on support\-seeking dialogues\. This leaves common cases such as handling pressure, repairing trust after model\-side mistakes, and proactively building a relationship less studied\.
To address this gap, we build an interactive benchmark for emotion management\. In our setup, two LLMs are engaged in a multi\-turn dialogue, and the score is based on how the simulated user’s state changes over time\. We track both emotion and relation because improving the user’s mood and building a better relationship are related but not identical goals\. The same state trajectory also provides turn\-level feedback for RL training\. To cover different forms of emotion management, we organize the scenarios with a $2\times 2$ taxonomy grounded in the EI literature\. The taxonomy separates the target of regulation from the source of the situation, yielding four scene types: Support, Defense, Repair, and Charm\.
Overall, we make three main contributions:
- EIBench: We introduce a multi\-turn benchmark for emotion management, covering Support, Defense, Repair, and Charm under a $2\times 2$ taxonomy based on the target of regulation and the source of the situation\.
- Leaderboard and Validation: We evaluate 15 open\- and closed\-source LLMs on EIBench\. We report rankings under two independent simulators and validate them with a human pairwise arena\.
- Centered Turn\-Credit GRPO \(CTC\-GRPO\): We propose CTC\-GRPO for multi\-turn dialogue optimization\. It uses the simulator’s per\-turn states as process rewards with a centered credit term to redistribute turn\-level credit without altering the outcome signal\. The trained models also improve on out\-of\-distribution evaluations including EQBench3 and SAGE\.
## 2 Related Work
### 2\.1 EI Benchmarks
Early NLP work on EI used static datasets such as Empathetic Dialogues \(Rashkin et al\., [2019](https://arxiv.org/html/2606.15532#bib.bib33)\), ESConv \(Liu et al\., [2021](https://arxiv.org/html/2606.15532#bib.bib35)\), and PsyQA \(Sun et al\., [2021](https://arxiv.org/html/2606.15532#bib.bib34)\) to teach models supportive replies under a fixed context\. As LLMs strengthened, benchmarks such as EQBench3 \(Paech, [2023](https://arxiv.org/html/2606.15532#bib.bib9)\), EmoBench \(Sabour et al\., [2024](https://arxiv.org/html/2606.15532#bib.bib36)\), ToMBench \(Chen et al\., [2024](https://arxiv.org/html/2606.15532#bib.bib37)\), and SI\-Bench \(Huang et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib38)\) moved into richer scenarios for emotional understanding and social reasoning\. Related role\-play benchmarks such as CoSER \(Wang et al\., [2025b](https://arxiv.org/html/2606.15532#bib.bib54)\) further evaluate character\-grounded dialogue generation\. However, they are still offline or end\-to\-end: the model cannot affect what the user says next\.
Recent benchmarks add dynamic interaction\. AgentSense \(Guo et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib39)\) and SOTOPIA \(Zhou et al\., [2024](https://arxiv.org/html/2606.15532#bib.bib51)\) place multiple agents into role\-bound goals\. SAGE \(Zhang et al\., [2025a](https://arxiv.org/html/2606.15532#bib.bib10)\) and Echo\-N1 \(Zhang et al\., [2025b](https://arxiv.org/html/2606.15532#bib.bib11)\) use an LLM simulator that plays an evolving support seeker\. They are dynamic but narrow: goal completion or support\-only\. Our benchmark spans all four scene types \(Support, Defense, Repair, Charm\) and shares the same simulator between evaluation and RL training\. This difference is important because support\-only evaluation does not test whether a model can handle pressure, recover from its own mistakes, or build rapport when no explicit distress is given\.
| Dataset | Size | Multi-turn | Sim. | State Score | Scope | Score Dim. |
| --- | --- | --- | --- | --- | --- | --- |
| ESConv, Liu et al\. \([2021](https://arxiv.org/html/2606.15532#bib.bib35)\) | 195 | ✗ | ✗ | ✗ | Emotion support | – |
| EQBench3, Paech \([2023](https://arxiv.org/html/2606.15532#bib.bib9)\) | 45 | ✗ | ✗ | ✗ | EQ reasoning | EQ rubric |
| CoSER, Wang et al\. \([2025b](https://arxiv.org/html/2606.15532#bib.bib54)\) | 200 | ✓ | ✗ | ✗ | Role-play | Role consistency |
| SOTOPIA, Zhou et al\. \([2024](https://arxiv.org/html/2606.15532#bib.bib51)\) | 90 | ✓ | ✓ | ✗ | Social interaction | Goal |
| SAGE, Zhang et al\. \([2025a](https://arxiv.org/html/2606.15532#bib.bib10)\) | 100 | ✓ | ✓ | ✓ | Emotion Support | Emotion |
| EIBench | 213 | ✓ | ✓ | ✓ | Emotion management | Emotion + Relation |

Table 1: Benchmark comparison\. Size is test\-set size; Sim\. is simulator interaction\.
### 2\.2 Multi\-Turn RL and Credit Assignment
On\-policy RL has become a central part of recent LLM post\-training, especially after reasoning models such as DeepSeek\-R1 \(Guo et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib20)\)\. In the standard RLVR setting \(Yu et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib14); Yue et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib15); Zheng et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib53)\), a model receives a reward from the final answer, which works well for tasks with clear correctness criteria such as math and coding\. Recent extensions to agentic reasoning and tool use \(Jin et al\., [2025](https://arxiv.org/html/2606.15532#bib.bib16); Feng et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib17)\) allow models to act over multiple steps, but the feedback is still often delayed or task\-specific\. For subjective tasks without a fast verifier, recent work often uses LLM\-as\-a\-judge scoring \(Zheng et al\., [2023](https://arxiv.org/html/2606.15532#bib.bib28); Wang et al\., [2025a](https://arxiv.org/html/2606.15532#bib.bib6); Zhang et al\., [2026a](https://arxiv.org/html/2606.15532#bib.bib7)\) or pairwise ranking \(Zhang et al\., [2026b](https://arxiv.org/html/2606.15532#bib.bib27)\)\. These methods make open\-ended optimization possible, but they usually provide only a terminal score\. This creates a credit assignment problem: a single final reward cannot tell which turn helped or hurt the interaction\.
Recent methods densify feedback with reward models, turn\-level judges, or task\-specific process rewards \(Wang et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib30); Zhang et al\., [2026a](https://arxiv.org/html/2606.15532#bib.bib7); Feng et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib17); Shi et al\., [2026](https://arxiv.org/html/2606.15532#bib.bib31)\)\. Our setting provides a simpler signal: the simulator already updates the user’s emotion and relation state after each model reply\. We reuse these state changes as turn\-level process rewards, without adding another judge model\. This gives dense feedback while keeping the final anchor\-based outcome score unchanged for evaluation\.
## 3 EIBench for Emotion Management
This section presents the design of EIBench\. Table [1](https://arxiv.org/html/2606.15532#S2.T1) first compares EIBench with related benchmarks, and Figure [1](https://arxiv.org/html/2606.15532#S3.F1) gives the overall workflow\. We then introduce the $2\times 2$ scenario taxonomy, describe scenario construction and state anchors, and define the multi\-turn simulator\-based scoring method\.
### 3\.1 Foundations: A $2\times 2$ Scenario Taxonomy
Our taxonomy is grounded in two ideas from emotion research\. Mayer and Salovey \(Mayer et al\., [1997](https://arxiv.org/html/2606.15532#bib.bib1)\) define emotion management as both interpersonal and intrapersonal\. We use this distinction as the first axis: the target of regulation\. A scene is other\-directed when the model mainly needs to improve the user’s emotional or relational state, and self\-directed when the main challenge is to regulate the model’s own stance, boundary, or initiative\. This axis is not enough by itself\. For example, comforting a user after an external loss and repairing trust after a model mistake both aim to improve the user’s state, but they require different actions\. We therefore add a second axis: the source of the situation, following appraisal theory \(Lazarus, [1991](https://arxiv.org/html/2606.15532#bib.bib4)\)\. A scene is user\-side when the need comes from the user or an external event\. It is model\-side when it comes from the model’s own mistake or initiative\.
Crossing the two axes gives the four scene types summarized in Table [2](https://arxiv.org/html/2606.15532#S3.T2)\. Support covers externally caused user distress\. Defense covers user pressure, where the model must stay calm and maintain its boundary\. Repair covers model\-side mistakes and trust rebuilding\. Charm covers model\-initiated rapport building\. The two axes define the main interaction goal of each scene\. Other factors, such as emotional intensity, relationship history, power relation, and setting, mainly affect the difficulty or surface form\.
Figure 1: Overview of EIBench and CTC\-GRPO\. EIBench organizes emotion\-management scenarios with a $2\times 2$ taxonomy, evaluates models through multi\-turn simulator interaction and anchor\-based state scoring, and provides turn\-level state changes for CTC\-GRPO training\.
### 3\.2 Dataset Construction
EIBench contains 2,222 scenarios: 2,009 for training and 213 for held\-out testing\. Each scenario is a multi\-turn role\-play between the model under test and an LLM simulator\.
#### 3\.2\.1 Scenario Components
A scenario has two role profiles and three state anchors on emotion–relation axes\. The `simulated_user_profile` is given to the simulator and specifies the user’s role, background, speaking style, rules, and goal\. The `model_profile` is given to the model under test and specifies its role, known facts, and core goal\. Most scenarios start with an `opening_line` from the simulated user\. The only exception is Charm, where the model under test speaks first as the interaction is model\-initiated\. The three anchors define the start, successful, and failed states for scoring: $s_{\mathrm{start}}$, $s_{\mathrm{succ}}$, and $s_{\mathrm{fail}}$\.
#### 3\.2\.2 Scenario Generation and Anchor Setup
We first construct the test set from human\-written and human\-collected scenario seeds\. Each seed specifies the background story, speaker relationship, trigger event, and desired outcome\. Gemini\-3\.1\-Pro is used only to convert these seeds into our profile format; the core interaction remains human written or human collected\. We then manually check each scenario for plausibility, profile consistency, and anchor quality\.
##### Training scenario generation\.
After the test set is fixed, we build seed pools for training\-set generation\. The pools include theme directions, scenario keywords, and three modifier dimensions: relationship, emotional intensity, and personality influence\. Appendix [A](https://arxiv.org/html/2606.15532#A1) gives the full seed\-pool statistics\. The entries are written and expanded with manual editing and LLM assistance, and vague or overlapping entries are removed\. To generate a training scenario, we sample one scenario keyword and one option from each modifier dimension, then ask Gemini\-3\.1\-Pro to expand the sampled combination into a full scenario in our profile format\.
To reduce train\-test overlap, we use two safeguards\. First, keyword combinations that appear in the test set are removed from the training pool\. Second, after each training scenario is generated, we retrieve the three most similar test scenarios with an embedding model and ask Gemini\-3\.1\-Pro to check whether it duplicates any of them\. Duplicated scenarios are rewritten\.
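The second safeguard can be sketched as a nearest\-neighbor lookup over embeddings\. The snippet below is a minimal illustration, not the authors' pipeline: the helper names `cosine` and `top_k_similar` are hypothetical, cosine similarity is an assumed choice of metric, and the 2\-D vectors stand in for the output of an unspecified embedding model\. The retrieved test scenarios would then be passed to the LLM duplicate check described above\.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top_k_similar(candidate, test_embeddings, k=3):
    """Return indices of the k test scenarios most similar to the candidate."""
    ranked = sorted(
        range(len(test_embeddings)),
        key=lambda i: cosine(candidate, test_embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D embeddings (hypothetical values, for illustration only).
test_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
nearest = top_k_similar([0.9, 0.1], test_embeddings)
print(nearest)  # indices of the three closest test scenarios
```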
| Scene | Regulation target | Situation source | Main goal |
| --- | --- | --- | --- |
| Support | user emotion | user event | comfort and stabilize |
| Defense | model boundary and tone | user pressure | remain firm but warm |
| Repair | user trust | model mistake | acknowledge and rebuild trust |
| Charm | model rapport-building | model-initiated interaction | build rapport |

Table 2: The four scene types in EIBench\. The two middle columns correspond to the two axes of our taxonomy\.
##### State anchors\.
Each scenario has three state anchors on the two scoring axes: negative emotion intensity $a$ and relational state $t$\. The anchors are the start state $s_{\mathrm{start}}$, the successful target state $s_{\mathrm{succ}}$, and the failed state $s_{\mathrm{fail}}$\. They define where the dialogue starts, what counts as a realistic successful outcome, and what counts as a clear failure\. These anchors make scores comparable across different scenes: a final state is judged by its progress from the start state toward the successful or failed state, rather than by an absolute emotion or relation value\.
To make the anchors more consistent, we use a behavior\-based rubric with five score ranges for each axis\. The rubric also explains how the axes should be interpreted in different scenes: negative emotion mainly refers to anger in Defense and Repair, sadness or despair in Support, and aversion or defensiveness in Charm\. The full rubric and anchor\-labeling system prompt are provided in Appendix [G](https://arxiv.org/html/2606.15532#A7)\. Gemini\-3\.1\-Pro proposes the anchors under this rubric\. We use scene\-level target ranges to keep $s_{\mathrm{succ}}$ reachable rather than idealized: a successful state should be a plausible good outcome for that scene, not a perfect emotional endpoint\. This is especially important for scenes where the conflict cannot be fully removed within a short dialogue\. In Defense, for example, a good outcome is not that the user becomes fully satisfied with the refusal\. Rather, the model should keep a necessary boundary while reducing the user’s frustration and keeping the interaction open\. Similarly, in Repair, the user may not fully trust the model again after one conversation, but a successful dialogue should acknowledge the mistake, reduce anger, and restore enough trust for continued engagement\. Thus, the successful anchor represents a realistic improvement target under the scenario constraints, while the failed anchor represents a clear breakdown of the interaction\.
We then use a second LLM verifier to check scenario consistency, target reachability, and anchor ordering\. It adjusts anchors only when they conflict with the scenario, make the target unreachable, or violate the required ordering\. Finally, we round all anchors to multiples of five and apply deterministic rules to keep them within $[0,100]$ and preserve the required gaps\.
##### Statistics\.
Table [3](https://arxiv.org/html/2606.15532#S3.T3) reports per\-scene train/test counts and anchor statistics\. The four scene types differ in both start and successful states\. Charm starts with low negative emotion and low relational state because the speakers have no prior relationship\. Defense and Repair start with high negative emotion and low relational state, while Support starts with high negative emotion but a higher relational state\. Their successful states also differ: Charm aims to build a positive relation, Defense and Repair reduce negative emotion and improve relation under realistic limits, and Support aims for the strongest relational endpoint\. By contrast, the failed states are consistently high in negative emotion and low in relational state, representing clear breakdowns of the interaction\. These differences show that the scenes capture distinct interaction goals rather than simple variants of the same task\.

| Scene | Train/Test | Start $a$ | Start $t$ | Success $a$ | Success $t$ | Fail $a$ | Fail $t$ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Charm | 504/46 | 34 ± 13 | 17 ± 8 | 6 ± 4 | 57 ± 6 | 75 ± 10 | 1 ± 3 |
| Defense | 496/64 | 75 ± 14 | 16 ± 11 | 30 ± 7 | 46 ± 7 | 97 ± 6 | 0 ± 2 |
| Repair | 503/52 | 77 ± 10 | 10 ± 5 | 28 ± 6 | 50 ± 10 | 97 ± 6 | 0 ± 1 |
| Support | 506/51 | 75 ± 9 | 46 ± 14 | 33 ± 8 | 82 ± 7 | 97 ± 6 | 9 ± 7 |

Table 3: Per\-scene statistics for EIBench\. Anchors are mean ± std on the training split; $a$ and $t$ denote negative emotion intensity and relational state\.
### 3\.3 Simulator\-Based Evaluation
EIBench evaluates a model by simulating a multi\-turn dialogue and tracking two user states: negative emotion intensity $a_{\tau}\in[0,100]$ and relational state $t_{\tau}\in[0,100]$\. Lower $a_{\tau}$ and higher $t_{\tau}$ indicate better outcomes\. The emotion axis measures how upset or resistant the user is, while the relation axis measures whether the user is willing to engage with or trust the model\. We use two axes because emotion relief and relationship improvement are not the same\. The simulator follows the `simulated_user_profile`, plays the user, and after each model reply produces the next user reply, a state update $\delta_{\tau}=(\Delta a_{\tau},\Delta t_{\tau})$, and a stop decision\. The state starts from $s_{\mathrm{start}}=(a_{\mathrm{start}},t_{\mathrm{start}})$ and evolves as $s_{\tau}=s_{\tau-1}+\delta_{\tau}$, clipped to $[0,100]^{2}$\. The dialogue ends when the simulator stops or reaches $T_{\max}$\. Before producing the state update, the simulator first writes a brief internal explanation based on the dialogue history\. Full prompts are in Appendix [G](https://arxiv.org/html/2606.15532#A7)\.
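The state recursion can be made concrete with a short sketch\. The code below is illustrative only: `rollout_states` is a hypothetical helper, the per\-turn updates are made\-up numbers rather than simulator output, the simulator's stop decision is omitted, and the default turn cap of 8 is taken from the experimental setup\.

```python
def clip(value, low=0.0, high=100.0):
    """Clamp a state value to the [0, 100] range used by both axes."""
    return max(low, min(high, value))


def rollout_states(start, deltas, t_max=8):
    """Apply per-turn updates (delta_a, delta_t) to the start state (a, t).

    Returns the list of states s_0, s_1, ..., truncated at t_max turns.
    """
    a, t = start
    states = [(a, t)]
    for delta_a, delta_t in deltas[:t_max]:
        a = clip(a + delta_a)
        t = clip(t + delta_t)
        states.append((a, t))
    return states


# Start near the Defense mean anchors; the updates are hypothetical.
states = rollout_states((75, 16), [(-10, 5), (-20, 10), (15, -40)])
print(states)
```

In the last turn the relational state would fall below zero, so it is clipped to the lower bound\.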
Each scenario has start, successful, and failed anchors on both axes: $(a_{\mathrm{start}},a_{\mathrm{succ}},a_{\mathrm{fail}})$ and $(t_{\mathrm{start}},t_{\mathrm{succ}},t_{\mathrm{fail}})$\. The start anchor gives score $0$, the successful anchor gives $+1$, and the failed anchor gives $-1$\. This normalization makes different scenarios comparable: each final state is scored by its progress toward the scenario\-specific successful or failed state, rather than by an absolute state value\. For any axis value $x$ with anchors $(x_{\mathrm{start}},x_{\mathrm{succ}},x_{\mathrm{fail}})$, we compute the score as follows, where the branch is chosen by whether $x$ moves from the start anchor toward the successful anchor or the failed anchor:

$$
f(x;x_{\mathrm{start}},x_{\mathrm{succ}},x_{\mathrm{fail}})=\operatorname{clip}_{[-1,+1]}\begin{cases}\dfrac{x-x_{\mathrm{start}}}{x_{\mathrm{succ}}-x_{\mathrm{start}}},&\text{toward }x_{\mathrm{succ}},\\[6pt]-\dfrac{x-x_{\mathrm{start}}}{x_{\mathrm{fail}}-x_{\mathrm{start}}},&\text{toward }x_{\mathrm{fail}}.\end{cases}\tag{1}
$$

Thus, $f(x)=0$ at the start anchor, $f(x)=+1$ at the successful anchor, and $f(x)=-1$ at the failed anchor, with values clipped when the final state goes beyond these anchors\.
Let $s_{\mathrm{final}}=(a_{\mathrm{final}},t_{\mathrm{final}})$ be the final simulator state\. The final score, denoted as $r^{\mathrm{out}}$ for training, is the weighted average of the two axis scores:

$$
r^{\mathrm{out}}=\lambda\,f(t_{\mathrm{final}};t_{\mathrm{start}},t_{\mathrm{succ}},t_{\mathrm{fail}})+(1-\lambda)\,f(a_{\mathrm{final}};a_{\mathrm{start}},a_{\mathrm{succ}},a_{\mathrm{fail}}).\tag{2}
$$

Evaluation uses only this final score\. During RL training, we also use the intermediate states as turn\-level feedback, without extra judge\-model calls\.
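As an illustration of Eqs\. 1 and 2, the following minimal Python sketch computes the per\-axis score and the outcome reward\. The function names `axis_score` and `outcome_reward` are hypothetical, the example anchors are the Defense means from Table 3, the final state is made up, and the sketch assumes that the successful and failed anchors lie strictly on opposite sides of the start anchor\.

```python
def axis_score(x, start, succ, fail):
    """Eq. 1: 0 at the start anchor, +1 at success, -1 at failure, clipped."""
    if x == start:
        return 0.0
    # The branch depends on the direction of movement away from the start.
    # Assumption: succ and fail are on opposite sides of start.
    toward_succ = (x - start) * (succ - start) > 0
    if toward_succ:
        raw = (x - start) / (succ - start)
    else:
        raw = -(x - start) / (fail - start)
    return max(-1.0, min(1.0, raw))


def outcome_reward(final, anchors_a, anchors_t, lam=0.5):
    """Eq. 2: weighted average of the relation and emotion axis scores."""
    a_final, t_final = final
    score_t = axis_score(t_final, *anchors_t)
    score_a = axis_score(a_final, *anchors_a)
    return lam * score_t + (1 - lam) * score_a


# Anchors as (start, succ, fail); values follow the Defense means in Table 3.
anchors_a = (75, 30, 97)
anchors_t = (16, 46, 0)
reward = outcome_reward((60, 31), anchors_a, anchors_t)
print(reward)
```

Here the emotion axis has covered one third of the distance to its successful anchor and the relation axis one half, so with $\lambda=0.5$ the reward is their plain average\.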
## 4 Centered Turn\-Credit Shaping
Standard GRPO assigns the same trajectory\-level credit to all turns in a rollout based on the final outcome reward\. This provides limited credit assignment in multi\-turn emotional dialogue: a rollout may contain turns that improve the user state and turns that worsen it, but the final reward cannot identify which turns caused these changes\. EIBench provides an additional signal, since the simulator updates the user’s emotion and relational state after each model reply\.
### 4\.1 Dialogue Rollouts and Reward Signals
We model each EIBench scenario as a finite\-horizon dialogue between the policy $\pi_{\theta}$ and the simulator\. At turn $\tau$, the policy observes the dialogue history $h_{\tau-1}$ and generates a reply $y_{\tau}\sim\pi_{\theta}(\cdot\mid h_{\tau-1})$\. The simulator then responds with $u_{\tau}$ and updates the user state to $s_{\tau}=(a_{\tau},t_{\tau})$, where $a_{\tau}$ is negative emotion intensity and $t_{\tau}$ is relational state\. The dialogue ends when the simulator stops or when the maximum turn limit is reached\.
Each scenario has fixed anchors $(s_{\mathrm{start}},s_{\mathrm{succ}},s_{\mathrm{fail}})$\. As defined in Section [3\.3](https://arxiv.org/html/2606.15532#S3.SS3), the final state $s_{\mathrm{final}}$ is mapped to an outcome reward $r^{\mathrm{out}}\in[-1,1]$\. This reward measures the overall result of the dialogue and is the only reward used for evaluation\. During training, we also use the simulator’s per\-turn state updates\. Let $\Delta a_{\tau}=a_{\tau}-a_{\tau-1}$ and $\Delta t_{\tau}=t_{\tau}-t_{\tau-1}$\. We define a process reward for each turn:

$$
r^{\mathrm{proc}}_{\tau}=\frac{1}{2}\cdot\frac{-\Delta a_{\tau}}{100}+\frac{1}{2}\cdot\frac{\Delta t_{\tau}}{100}.\tag{3}
$$

This process reward measures local improvement: it rewards turns that make the user less upset or more willing to engage, regardless of the final outcome of the whole dialogue\. The factor $100$ normalizes both axes, since $a_{\tau},t_{\tau}\in[0,100]$\.
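A small sketch of Eq\. 3, computed from consecutive states of one rollout\. The helper name `process_rewards` and the example state trajectory are hypothetical; the differences are taken between the tracked states, following the definitions of $\Delta a_{\tau}$ and $\Delta t_{\tau}$ above\.

```python
def process_rewards(states):
    """Eq. 3: per-turn reward from consecutive (a, t) states."""
    rewards = []
    for (a_prev, t_prev), (a_cur, t_cur) in zip(states, states[1:]):
        delta_a = a_cur - a_prev  # change in negative emotion intensity
        delta_t = t_cur - t_prev  # change in relational state
        rewards.append(0.5 * (-delta_a) / 100 + 0.5 * delta_t / 100)
    return rewards


# Hypothetical trajectory: two helpful turns followed by a harmful one.
states = [(75, 16), (65, 21), (45, 31), (60, 0)]
proc = process_rewards(states)
print(proc)
```

The first two turns lower negative emotion and raise the relational state, so they receive positive rewards; the third turn does the opposite and receives a negative one\.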
### 4\.2 Centered Turn\-Credit Objective
For each prompt $p$, we sample $K$ rollouts under the same scenario anchors\. The trajectory\-level advantage follows standard GRPO group\-normalization:

$$
A^{\mathrm{trace},(k)}=\frac{r^{\mathrm{out},(k)}-\mu_{p}}{\max(\sigma_{p},\,\sigma_{\min})},\tag{4}
$$

where $\mu_{p}$ and $\sigma_{p}$ are the mean and standard deviation of $\{r^{\mathrm{out},(k)}\}_{k=1}^{K}$\. We apply a small lower bound $\sigma_{\min}$ to the group standard deviation to avoid overly large advantages when rollout scores are nearly identical\. A natural baseline is to add the process reward directly to the trajectory\-level advantage:
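The group normalization in Eq\. 4 can be sketched as follows\. The function name `trace_advantages` is hypothetical, the use of the population standard deviation is an assumption \(the text does not say which estimator is used\), and the reward values are made up\. The second call shows the effect of the floor $\sigma_{\min}$ when rollout scores are nearly identical\.

```python
import statistics


def trace_advantages(outcome_rewards, sigma_min=0.1):
    """Eq. 4: group-normalized advantage with a floor on the std."""
    mu = statistics.fmean(outcome_rewards)
    sigma = statistics.pstdev(outcome_rewards)  # population std (assumption)
    denom = max(sigma, sigma_min)
    return [(r - mu) / denom for r in outcome_rewards]


# K = 4 rollouts of the same scenario (hypothetical outcome rewards).
adv = trace_advantages([0.4, 0.2, -0.1, 0.3])
print(adv)

# Nearly identical scores: the floor keeps the advantages small.
adv_flat = trace_advantages([0.30, 0.31])
print(adv_flat)
```

Without the floor, the second group would be divided by a standard deviation of 0\.005 and produce advantages of magnitude 1 from a score difference of only 0\.01\.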
$$
A^{(k)}_{\tau}=A^{\mathrm{trace},(k)}+\alpha\,r^{\mathrm{proc},(k)}_{\tau}.\tag{5}
$$

However, direct addition changes the overall rollout advantage, not only the relative credit of different turns\. To see this, let $\bar{r}^{\mathrm{proc},(k)}=\tfrac{1}{T^{(k)}}\sum_{\tau'=1}^{T^{(k)}}r^{\mathrm{proc},(k)}_{\tau'}$ be the mean process reward of rollout $k$\. Then Eq\. [5](https://arxiv.org/html/2606.15532#S4.E5) can be written as

$$
A^{(k)}_{\tau}=\underbrace{\bigl(A^{\mathrm{trace},(k)}+\alpha\,\bar{r}^{\mathrm{proc},(k)}\bigr)}_{\text{shared offset}}+\underbrace{\alpha\bigl(r^{\mathrm{proc},(k)}_{\tau}-\bar{r}^{\mathrm{proc},(k)}\bigr)}_{\text{turn-specific deviation}}.\tag{6}
$$

The shared offset is applied to all turns in the rollout, so it acts as an extra rollout\-level reward after GRPO group normalization\. This changes the normalized outcome advantage, while only the deviation term gives turn\-specific credit\. We therefore center the process reward within each rollout:

$$
A^{(k)}_{\tau}=A^{\mathrm{trace},(k)}+\alpha\bigl(r^{\mathrm{proc},(k)}_{\tau}-\bar{r}^{\mathrm{proc},(k)}\bigr).\tag{7}
$$

The centered term sums to zero across turns in the same rollout\. Thus, $A^{\mathrm{trace},(k)}$ keeps the outcome\-level signal, while the centered process reward only redistributes credit among turns\. This gives turn\-specific credit without changing the rollout\-level outcome signal\.
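The difference between direct addition \(Eq\. 5\) and centering \(Eq\. 7\) can be checked numerically\. In the sketch below the function names, the trajectory\-level advantage of 1\.0, and the process rewards are all hypothetical; $\alpha=15$ matches the value reported in the experimental setup\.

```python
def direct_turn_advantages(trace_advantage, proc_rewards, alpha=15.0):
    """Eq. 5: add the raw process reward to the trajectory-level advantage."""
    return [trace_advantage + alpha * r for r in proc_rewards]


def centered_turn_advantages(trace_advantage, proc_rewards, alpha=15.0):
    """Eq. 7: add the process reward centered within the rollout."""
    mean_proc = sum(proc_rewards) / len(proc_rewards)
    return [trace_advantage + alpha * (r - mean_proc) for r in proc_rewards]


def mean(values):
    return sum(values) / len(values)


# One rollout with three turns (hypothetical numbers).
trace_advantage = 1.0
proc_rewards = [0.075, 0.15, -0.23]

direct = direct_turn_advantages(trace_advantage, proc_rewards)
centered = centered_turn_advantages(trace_advantage, proc_rewards)

print(mean(direct))    # shifted away from the trajectory-level advantage
print(mean(centered))  # equal to the trajectory-level advantage
```

Averaged over turns, the centered advantages recover the trajectory\-level advantage, while direct addition shifts it by $\alpha\,\bar{r}^{\mathrm{proc},(k)}$, which is the shared offset in Eq\. 6\.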
The final training objective is the standard GRPO clipped objective with $A^{(k)}_{\tau}$ from Eq\. [7](https://arxiv.org/html/2606.15532#S4.E7)\. Let $\rho^{(k)}_{\tau,i}(\theta)$ be the token\-level importance ratio at the $i$\-th token of turn $\tau$ in rollout $k$\. The loss is

$$
\mathcal{L}(\theta)=-\mathbb{E}\bigl[\min\bigl(\rho^{(k)}_{\tau,i}(\theta)A^{(k)}_{\tau},\,\operatorname{clip}(\rho^{(k)}_{\tau,i}(\theta),1-\epsilon,1+\epsilon)A^{(k)}_{\tau}\bigr)\bigr]+\beta\,\mathrm{KL}(\pi_{\theta}\|\pi_{\mathrm{ref}}),\tag{8}
$$

where $\epsilon$ is the clipping threshold and $\beta$ is the coefficient of the KL penalty against the fixed $\pi_{\mathrm{ref}}$\.
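A framework\-free sketch of Eq\. 8 on a handful of tokens is given below\. It is not the verl implementation: the function names are hypothetical, the expectation is approximated by a plain mean over tokens \(an assumption about the aggregation\), the KL term is passed in as a precomputed estimate, and the values of `eps` and `beta` are placeholders rather than the paper's hyperparameters\. Every token of turn $\tau$ shares the turn\-level advantage $A^{(k)}_{\tau}$\.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-token clipped term: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def ctc_grpo_loss(token_ratios, token_advantages, kl_estimate, beta=0.01, eps=0.2):
    """Eq. 8 with a token-mean in place of the expectation (assumption)."""
    terms = [
        clipped_surrogate(r, a, eps)
        for r, a in zip(token_ratios, token_advantages)
    ]
    return -sum(terms) / len(terms) + beta * kl_estimate


# Three tokens with hypothetical importance ratios and turn advantages.
ratios = [1.5, 0.5, 1.0]
advantages = [2.0, -1.0, 1.0]
loss = ctc_grpo_loss(ratios, advantages, kl_estimate=0.0)
print(loss)
```

The first token has a positive advantage and a ratio above $1+\epsilon$, so its contribution is capped; the second has a negative advantage and a ratio below $1-\epsilon$, so the more pessimistic clipped value is used\.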
## 5 Experiments
| Model | Charm | Defense | Repair | Support | Avg (Qwen3-Max) | Rank (Qwen3-Max) | Avg (DS-V4-Pro) | Rank (DS-V4-Pro) | Elo (Arena) | Rank (Arena) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Claude-Sonnet-4.6 | +43.8 | −14.6 | +17.9 | +56.5 | +23.0 | 1 | +24.0 | 1 | 1548 | 4 |
| GPT-5.4 | +41.5 | −9.5 | +14.5 | +49.0 | +21.4 | 2 | +22.7 | 3 | 1593 | 3 |
| Qwen3-Max | +32.0 | −2.4 | +14.1 | +45.7 | +20.6 | 3 | +14.8 | 11 | 1487 | 8 |
| Gemini-3.1-Pro | +31.3 | −2.5 | +18.6 | +39.3 | +20.0 | 4 | +21.5 | 5 | 1616 | 1 |
| Kimi-K2.6 | +37.3 | −8.8 | +17.6 | +42.1 | +19.8 | 5 | +21.9 | 4 | 1478 | 10 |
| GLM-5.1 | +37.9 | −15.2 | +19.3 | +44.8 | +19.1 | 6 | +19.4 | 7 | 1504 | 7 |
| Qwen3.6-Max-Preview | +39.1 | −14.3 | +14.7 | +42.7 | +17.9 | 7 | +19.9 | 6 | 1519 | 6 |
| Gemini-3-Flash | +37.0 | −11.1 | +18.1 | +34.0 | +17.2 | 8 | +16.5 | 10 | 1606 | 2 |
| Doubao-Seed-2.0-Pro | +31.9 | −14.2 | +10.7 | +39.8 | +14.8 | 9 | +17.0 | 9 | 1536 | 5 |
| DeepSeek-V4-Pro | +28.0 | −7.8 | +12.2 | +32.4 | +14.4 | 10 | +23.3 | 2 | 1484 | 9 |
| Grok-4.20-Reasoning | +28.0 | −19.0 | +12.2 | +40.9 | +13.1 | 11 | +10.6 | 13 | 1458 | 13 |
| Grok-4.20 | +31.0 | −13.5 | +5.1 | +38.4 | +13.1 | 12 | +13.0 | 12 | 1476 | 11 |
| MiniMax-M2.5 | +21.6 | −16.7 | +5.2 | +31.4 | +8.4 | 13 | +17.6 | 8 | 1471 | 12 |
| Qwen3-32B | +23.7 | −14.3 | +6.8 | +3.9 | +3.4 | 14 | +5.8 | 14 | 1405 | 14 |
| Qwen3-8B | +14.6 | −21.4 | −35.9 | −43.3 | −22.4 | 15 | −23.7 | 15 | 1316 | 15 |

Table 4: EIBench leaderboard of 15 LLMs\. For the Qwen3\-Max simulator we report per\-scene scores and the weighted average; for the DS\-V4\-Pro simulator and the human arena we report the summary score and rank only\.

### 5\.1 Setup
We evaluate 15 open\- and closed\-source LLMs on EIBench using Qwen3\-Max as the main simulator and DeepSeek\-V4\-Pro as another simulator from a different model family\. Both simulators use the same per\-scenariosimulated\_user\_profileand state anchors\. We collect a human pairwise arena with more than 4,000 annotations to validate the simulator rankings\.
For RL, we train on the EIBench training split with Qwen3\-Max as the simulator and verl as the training framework\. Unless otherwise stated, we useTmax=8T\_\{\\max\}=8, the outcome reward in Eq\.[2](https://arxiv.org/html/2606.15532#S3.E2)withλ=0\.5\\lambda=0\.5, turn\-credit weightα=15\\alpha=15, and GRPO normalization floorσmin=0\.1\\sigma\_\{\\min\}=0\.1\. Additional training details, hyperparameters, and cost estimates are provided in Appendix[B\.1](https://arxiv.org/html/2606.15532#A2.SS1)\. We hold out the EIBench test split and also evaluate on SAGE and EQBench3 under their default settings\.
### 5.2 Leaderboard

Table [4](https://arxiv.org/html/2606.15532#S5.T4) reports per-scene and average scores under the Qwen3-Max simulator. A positive score means that the final state is better than the initial state, with lower negative emotion and higher relational state; a negative score means that the interaction moves the user state in the wrong direction. The main pattern is clear: current LLMs are much stronger at comfort and rapport building than at boundary management. Most models obtain positive scores on Support, Charm, and Repair, but all models score below zero on Defense.

Defense reveals a boundary-management weakness. All 15 models score below zero on Defense. This means the simulated user ends in a worse state than where the dialogue began. This scene requires the model to resist user pressure while also calming the user and keeping the conversation open. In sampled dialogues, models often handle this balance poorly. Some keep the boundary in a rigid way, with little comfort or practical explanation. Others produce long, template-like responses that sound polite but do not directly address the user's concern. As a result, the user often remains frustrated. Excluding the Qwen3-Max simulator's own model because of self-bias, only Gemini-3.1-Pro (−2.5) roughly keeps the user's state stable, but it still does not improve it.

The easier scenes align more closely with standard post-training behavior. Most models perform best on Support and Charm, where the model can improve the interaction mainly by being warm, encouraging, and attentive. These behaviors are close to the helpful and supportive style encouraged by common post-training. By contrast, Defense requires the model to combine warmth with constraint: it must acknowledge the user's emotion, keep a necessary boundary, and still offer a useful next step. This contrast suggests that good performance on supportive dialogue does not necessarily imply broad emotion-management ability.
Figure 2: EIBench scores from the Qwen3-Max and DS-V4-Pro simulators. Each point is one model. Most models are close to the diagonal; the red diamonds mark the simulators' own models and show self-bias.

### 5.3 Cross-Simulator Consistency

To reduce dependence on a single simulator, we use DeepSeek-V4-Pro (DS-V4-Pro) as a second simulator from a different model family. Figure [2](https://arxiv.org/html/2606.15532#S5.F2) shows that both simulators exhibit visible self-bias: each gives a higher score to its own model. After removing the corresponding self-scored models, the two simulator rankings are strongly aligned, with Spearman's $\rho=0.93$ (Table [5](https://arxiv.org/html/2606.15532#S5.T5)). This suggests that the EIBench ranking is reasonably robust across different simulators.
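The cross-simulator correlation can be approximated from the published averages in Table 4. The sketch below is a plain-Python Spearman computation with average ranks for ties; it assumes that "removing the self-scored models" means dropping Qwen3-Max and DeepSeek-V4-Pro, and it uses the rounded table values, so it may differ slightly from a computation on unrounded scores.

```python
# (Qwen3-Max simulator Avg, DS-V4-Pro simulator Avg) from Table 4,
# with the two simulators' own models removed (assumed exclusion rule).
TABLE4_AVG = {
    "Claude-Sonnet-4.6": (23.0, 24.0),
    "GPT-5.4": (21.4, 22.7),
    "Gemini-3.1-Pro": (20.0, 21.5),
    "Kimi-K2.6": (19.8, 21.9),
    "GLM-5.1": (19.1, 19.4),
    "Qwen3.6-Max-Preview": (17.9, 19.9),
    "Gemini-3-Flash": (17.2, 16.5),
    "Doubao-Seed-2.0-Pro": (14.8, 17.0),
    "Grok-4.20-Reasoning": (13.1, 10.6),
    "Grok-4.20": (13.1, 13.0),
    "MiniMax-M2.5": (8.4, 17.6),
    "Qwen3-32B": (3.4, 5.8),
    "Qwen3-8B": (-22.4, -23.7),
}


def average_ranks(values):
    """Rank in descending order (1 = best); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = shared
        pos = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


qwen_scores = [pair[0] for pair in TABLE4_AVG.values()]
ds_scores = [pair[1] for pair in TABLE4_AVG.values()]
rho = spearman(qwen_scores, ds_scores)
```

On these rounded values `rho` comes out close to the reported 0.93, with MiniMax-M2.5 contributing the largest rank disagreement between the two simulators.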
We further validate EIBench against human judgment with a pairwise arena. Annotators interact with two anonymized candidate models under the same scenario profile and choose which model handles the scenario better. The human ranking is also consistent with the simulator rankings, with Spearman's $\rho=0.82$ against Qwen3-Max and $\rho=0.70$ against DS-V4-Pro. The closer agreement with Qwen3-Max supports our choice of Qwen3-Max as the main training simulator. Full human-arena details are provided in Appendix [D](https://arxiv.org/html/2606.15532#A4).

| | Qwen3-Max | DS-V4-Pro | Arena |
|---|---|---|---|
| Qwen3-Max | 1.00 | – | – |
| DS-V4-Pro | 0.93 | 1.00 | – |
| Arena | 0.82 | 0.70 | 1.00 |

Table 5: Rank correlation between EIBench rankings under the two simulators and the human arena. We report Spearman's $\rho$ after excluding the corresponding self-scored model from each pair.

The remaining human–simulator gap is concentrated on a few models. Gemini-3.1-Pro and Gemini-3-Flash rise from mid-tier under both simulators to ranks 1 and 2 in the human arena, while Claude-Sonnet-4.6 and Kimi-K2.6 drop several places. Inspecting sampled dialogues suggests that human annotators are more sensitive to how the model positions itself in the conversation. Gemini often uses inclusive framings such as "let's figure this out together," which places the model and the user on the same side. Claude more often speaks as an outside helper, and Kimi-K2.6 sometimes emits inner monologue mid-reply, which several annotators found unnatural. These patterns suggest that the simulators capture the main ranking trend, but may give less weight to some interaction cues that human annotators value.
### 5.4 Main Results

We report results under the training simulator, an unseen simulator, and two external benchmarks: SAGE and EQBench3. This design tests whether CTC-GRPO improves in-domain performance, transfers across simulators, and generalizes beyond EIBench to different task formats and judges.

Targeted RL improves both in-domain and cross-simulator performance. As shown in Table [6](https://arxiv.org/html/2606.15532#S5.T6), on in-domain EIBench-Qwen3-Max, the trained 8B improves from $-22.4$ to $+22.4$, and the trained 32B further reaches $+28.7$, above every evaluated closed-source system under this simulator. This shows that CTC-GRPO can effectively use the training simulator's feedback and improve in-domain performance. The gains also transfer to the unseen DS-V4-Pro simulator: the trained 8B and 32B reach $+16.6$ and $+18.9$, well above their bases ($-23.7$ and $+5.8$). The cross-simulator drop is not uniform. It is larger on Charm ($-8.2$ for the 8B) and Support ($-9.1$), and smaller on Repair and Defense. One possible reason is that Charm and Support depend more on subjective judgments of warmth and comfort, which the two simulators may weight differently. By contrast, Repair and Defense involve more concrete behaviors such as acknowledgment and boundary holding. This suggests that CTC-GRPO learns partly transferable interaction strategies, while still reflecting the preferences of the training simulator.

Gains hold on two out-of-distribution benchmarks. On SAGE, a support-only benchmark with a DeepSeek-V3 simulator, both model sizes improve by roughly $+12$ over their bases. On EQBench3, a broader social-reasoning benchmark judged by Claude-Opus-4.6, the trained 32B reaches rubric $84.5$ and Elo $1150$, ranking #6 by rubric and #20 by Elo on the public leaderboard. Since EQBench3 differs from our training setup in judge, task format, and task scope, these results suggest that the gains are not only capturing the preferences of the Qwen3-Max simulator.

| Model | EIBench (Qwen3-Max) | EIBench (DS-V4-Pro) | SAGE | EQBench3 rubric | EQBench3 Elo |
|---|---|---|---|---|---|
| Qwen3-8B base | −22.4 | −23.7 | 16.3 | 73.5 (#32) | 745 (#38) |
| Qwen3-32B base | +3.4 | +5.8 | 27.5 | 74.3 (#33) | 948 (#31) |
| CTC-GRPO 8B | **+22.4** | **+16.6** | **28.7** | **80.1** (#22) | **901** (#31) |
| CTC-GRPO 32B | **+28.7** | **+18.9** | **39.6** | **84.5** (#6) | **1150** (#20) |

Table 6: Main results of CTC-GRPO on EIBench and external benchmarks. Numbers in parentheses indicate public ranks. EQBench3 base-model scores and public ranks are taken from its GitHub repository.

Figure 3: Effect of the turn-credit weight $\alpha$ on EIBench: average and per-scene reward.
### 5.5 Analysis

#### 5.5.1 Hyperparameter Sensitivity

The turn-credit weight $\alpha$ controls how much the centered process reward affects the trajectory advantage. The two terms are on different scales: $A^{\mathrm{trace}}$ is of order $1$, while $r^{\mathrm{proc}}_{\tau}-\bar{r}^{\mathrm{proc}}$ is usually of order $10^{-2}$. Therefore, $\alpha$ should be around $10$ for the process reward to have a comparable effect. Figure [3](https://arxiv.org/html/2606.15532#S5.F3) tests $\alpha\in\{0,5,10,15,20\}$. With $\alpha=0$, the method reduces to vanilla GRPO with only the trajectory reward, and the average score remains low. The average improves as $\alpha$ increases to $10$–$15$, then drops at $\alpha=20$, suggesting that too large a turn-credit weight may overweight the process signal. The per-scene curves show the same pattern: moderate values improve several scenes at once, while smaller or larger values give a less balanced trade-off. We therefore use $\alpha=15$ as the default.
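The scale argument can be made concrete with a few lines of arithmetic. The sketch below assumes illustrative magnitudes (a trajectory advantage of 1.0 and a centered process reward of 0.05, both chosen only to match the stated orders of magnitude) and reports how large the weighted process term is relative to the trajectory term for each tested $\alpha$.

```python
A_TRACE_SCALE = 1.0      # "order 1" (illustrative magnitude, not a measured value)
PROC_DELTA_SCALE = 0.05  # "order 1e-2" (illustrative magnitude, not a measured value)


def process_share(alpha, a_trace=A_TRACE_SCALE, proc_delta=PROC_DELTA_SCALE):
    """Size of alpha * (centered process reward) relative to the trajectory advantage."""
    return alpha * proc_delta / a_trace


# Relative size of the process term for the alpha values tested in Figure 3.
shares = {alpha: process_share(alpha) for alpha in (0, 5, 10, 15, 20)}
```

Under these assumed magnitudes the process term is absent at $\alpha=0$, a sizeable fraction of the trajectory term at $\alpha=10$ to $15$, and as large as the trajectory term at $\alpha=20$, which is in line with the qualitative trend described above.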
Figure 4: Training reward over steps for the four ablation runs of Table [7](https://arxiv.org/html/2606.15532#S5.T7) (8B, EIBench training split, Qwen3-Max simulator). Curves are EMA-smoothed ($\beta=0.8$); the raw signal is shown faintly behind each line.

#### 5.5.2 Ablation Studies

We ablate three components of CTC-GRPO on Qwen3-8B: the per-turn process reward, the centered correction, and the $\sigma_{\min}$ floor in GRPO normalization. Table [7](https://arxiv.org/html/2606.15532#S5.T7) reports the final evaluation scores, and Figure [4](https://arxiv.org/html/2606.15532#S5.F4) shows the corresponding training curves.

| | Charm | Defense | Repair | Support | Avg. |
|---|---|---|---|---|---|
| CTC-GRPO | +37.8 | +0.2 | +15.6 | +43.5 | **+22.4** |
| w/o centering | +14.2 | −2.0 | +7.8 | +23.6 | +10.0 |
| w/o process | +22.6 | −16.0 | −14.1 | −6.2 | −4.9 |
| w/o $\sigma_{\min}$ | +37.2 | −11.1 | +10.1 | +30.2 | +14.4 |

Table 7: Ablation results on EIBench.

Per-turn feedback is the largest contributor. Removing the process reward (w/o process) drops the average score from $+22.4$ to $-4.9$, with Defense and Repair becoming negative ($-16.0$ and $-14.1$). The training curve also shows that this variant stops improving after about 80 steps and then fluctuates around a low reward level. These results suggest that terminal-only training provides too sparse a learning signal for multi-turn emotional dialogue: the policy receives only a delayed outcome reward and cannot easily identify which replies improved or worsened the user state. By contrast, the denser turn-level feedback helps the policy continue improving after the early stage.
Centering isolates turn\-level credit\.The uncentered variant \(w/o centering\) reaches only\+10\.0\+10\.0, far below the full method\. Its training curve initially improves, but starts to decline shortly after the reward becomes positive\. This is consistent with the analysis in §[4\.2](https://arxiv.org/html/2606.15532#S4.SS2): once rollouts begin to improve, the mean process reward of a rollout can become positive, and without centering this mean is added to all turns as a shared offset\. This extra rollout\-level signal can interfere with the normalized outcome advantage, rather than only redistributing credit across turns\. Centering removes this offset, so the process reward mainly provides turn\-level credit while keeping the outcome\-level signal unchanged\.
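The shared-offset effect can be illustrated with a small sketch. It assumes an additive form, per-turn advantage = outcome advantage + $\alpha$ × (process reward − offset), where the offset is the rollout mean when centering is on and zero otherwise; the function name and the reward values are hypothetical, and the exact definition is the one given in the method section.

```python
def per_turn_advantages(outcome_adv, turn_rewards, alpha=15.0, center=True):
    """Combine a rollout-level outcome advantage with per-turn process rewards.

    Assumed additive form (illustrative): A_t = A_outcome + alpha * (r_t - offset).
    """
    offset = sum(turn_rewards) / len(turn_rewards) if center else 0.0
    return [outcome_adv + alpha * (r - offset) for r in turn_rewards]


turn_rewards = [0.03, 0.01, 0.05]  # hypothetical per-turn process rewards
outcome_adv = 0.5                  # hypothetical normalized outcome advantage

centered = per_turn_advantages(outcome_adv, turn_rewards, center=True)
uncentered = per_turn_advantages(outcome_adv, turn_rewards, center=False)

# Averaged over the rollout, centering leaves the outcome advantage unchanged,
# while the uncentered variant shifts every turn by alpha * mean(turn_rewards).
mean_centered = sum(centered) / len(centered)
mean_uncentered = sum(uncentered) / len(uncentered)
```

In this toy rollout the centered advantages still average to the outcome advantage, so the process reward only redistributes credit across turns, whereas the uncentered variant adds a rollout-level shift of $\alpha$ times the mean process reward on top of it.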
The $\sigma_{\min}$ floor improves stability. Removing $\sigma_{\min}$ (w/o $\sigma_{\min}$) lowers the average score to $+14.4$. Among the three ablations, its training curve is the closest to the full method, but it still ends at a lower reward. In GRPO-style group normalization, the outcome advantage is divided by the within-group reward standard deviation. When this standard deviation becomes very small, small differences between rollout rewards can be over-amplified into large advantages. We observe this issue on Defense and Support at some training steps. The $\sigma_{\min}$ floor limits this over-amplification, so the update is less sensitive to near-tied rollout scores. This helps preserve useful trajectory-level learning while avoiding unstable late-stage updates.
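A minimal sketch of the floor, assuming it is applied as `max(std, sigma_min)` on the population standard deviation of the group rewards (the paper may use a different estimator or an additive constant), shows how near-tied rollouts are affected.

```python
import statistics


def group_advantages(rewards, sigma_min=0.1):
    """GRPO-style group normalization with a floor on the standard deviation.

    Assumption: the floor is max(std, sigma_min) with a population std.
    """
    mean = statistics.fmean(rewards)
    scale = max(statistics.pstdev(rewards), sigma_min)
    if scale == 0.0:
        return [0.0 for _ in rewards]  # identical rewards and no floor: no signal
    return [(r - mean) / scale for r in rewards]


near_tied = [0.50, 0.51, 0.50, 0.51]  # hypothetical outcome rewards of one group

no_floor = group_advantages(near_tied, sigma_min=0.0)    # std is 0.005 here
with_floor = group_advantages(near_tied, sigma_min=0.1)  # floor dominates
```

Without the floor, a reward gap of 0.01 is stretched to advantages of about ±1; with $\sigma_{\min}=0.1$ the same gap yields about ±0.05, so near-ties no longer drive full-size updates.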
| Model | AIME | GPQA | HE+ | IFEval |
|---|---|---|---|---|
| Qwen3-8B base | 70.0 | 59.1 | 86.0 | 90.4 |
| + CTC-GRPO | 67.9 | 58.3 | 87.8 | 90.1 |
| Qwen3-32B base | 73.3 | 65.5 | 89.0 | 89.9 |
| + CTC-GRPO | 69.6 | 65.2 | 89.0 | 89.1 |

Table 8: General-capability evaluation before and after CTC-GRPO training. We report AIME 2025, GPQA Diamond, HumanEval+ (HE+), and IFEval instruction-level strict accuracy. Higher is better.

#### 5.5.3 IQ vs. EQ: Does EQ Training Hurt General Capability?

A natural concern with EQ-targeted RL is whether it harms general capability. We evaluate the trained 8B and 32B models on four standard benchmarks covering math, hard scientific reasoning, code, and instruction following: AIME 2025 (OpenCompass, [2025](https://arxiv.org/html/2606.15532#bib.bib58)), GPQA Diamond (Rein et al., [2023](https://arxiv.org/html/2606.15532#bib.bib55)), HumanEval+ (Liu et al., [2023](https://arxiv.org/html/2606.15532#bib.bib56)), and IFEval (Zhou et al., [2023](https://arxiv.org/html/2606.15532#bib.bib57)). Table [8](https://arxiv.org/html/2606.15532#S5.T8) reports the results, with evaluation details provided in Appendix [C](https://arxiv.org/html/2606.15532#A3).
Overall, CTC\-GRPO has little effect on general capability\. The trained models remain close to their bases on GPQA Diamond, HumanEval\+, and IFEval, and HumanEval\+ slightly improves for the 8B model\. AIME 2025 shows a small decrease at both scales, suggesting that EQ\-targeted training may slightly affect difficult math reasoning\. However, there is no broad degradation across the benchmark suite\. These results suggest that CTC\-GRPO largely preserves the base model’s general capability profile while improving emotional dialogue performance\.
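The size of each change is easy to read off by differencing the rows of Table 8. The short sketch below does only that; the dictionary layout and names are illustrative, and the numbers are copied from the table.

```python
BENCHMARKS = ("AIME 2025", "GPQA Diamond", "HumanEval+", "IFEval")

# Scores copied from Table 8: (base, after CTC-GRPO) per model size.
TABLE8 = {
    "Qwen3-8B": ((70.0, 59.1, 86.0, 90.4), (67.9, 58.3, 87.8, 90.1)),
    "Qwen3-32B": ((73.3, 65.5, 89.0, 89.9), (69.6, 65.2, 89.0, 89.1)),
}


def capability_deltas(model):
    """Trained-minus-base score per benchmark, rounded to one decimal."""
    base, trained = TABLE8[model]
    return {name: round(t - b, 1) for name, b, t in zip(BENCHMARKS, base, trained)}


deltas_8b = capability_deltas("Qwen3-8B")
deltas_32b = capability_deltas("Qwen3-32B")
```

The largest movements are on AIME 2025 (about −2.1 for 8B and −3.7 for 32B), while every other change stays within two points in either direction.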
## 6 Conclusion
In this work, we presented EIBench, a simulator\-based benchmark for multi\-turn emotion management, and CTC\-GRPO, which uses simulator\-tracked state changes for turn\-level credit assignment\. EIBench evaluates whether a model can improve a simulated user’s emotional and relational state across support, defense, repair, and charm scenarios\. The leaderboard shows uneven emotion\-management ability in current LLMs: they are stronger at support and rapport building, but still struggle with boundary maintenance under user pressure\. CTC\-GRPO substantially improves model performance on EIBench and generalizes to external emotional and social evaluations, while largely preserving performance on general benchmarks\. Overall, simulator\-tracked user states provide a useful signal for both evaluation and training in multi\-turn emotion management\.
## Acknowledgements
This work was supported by the Alibaba Research Intern Program and the Fundamental and Interdisciplinary Disciplines Breakthrough Plan of the Ministry of Education of China \(No\. JYB2025XDXM118\)\.
## Limitations
Our scoring uses an LLM simulator, so the per-turn state updates are not exact and the reward signal is noisy. Cross-simulator agreement and the human arena indicate that this noise does not change the overall ranking, but it is not eliminated. The four scenes are also not equally hard. Defense is the hardest, and even top closed-source models stay below zero on it. We train all four scenes with the same hyperparameters and a single turn-credit weight $\alpha$; methods that handle each scene differently, such as a different $\alpha$ per scene or different sampling rates, are left for future work.
## References
- Qwen3-coder-next technical report. arXiv preprint arXiv:2603.00729. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- Z. Chen, J. Wu, J. Zhou, B. Wen, G. Bi, G. Jiang, Y. Cao, M. Hu, Y. Lai, Z. Xiong, et al. (2024) ToMBench: benchmarking theory of mind in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15959–15983. Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1).
- Y. Chervonyi, T. H. Trinh, M. Olsák, X. Yang, H. H. Nguyen, M. Menegali, J. Jung, J. Kim, V. Verma, Q. V. Le, and T. Luong (2025) Gold-medalist performance in solving olympiad geometry with alphageometry2. J. Mach. Learn. Res. 26, pp. 241:1–241:39. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- T. T. DeepResearch, B. Li, B. Zhang, D. Zhang, F. Huang, G. Li, G. Chen, H. Yin, J. Wu, J. Zhou, et al. (2025) Tongyi deepresearch technical report. arXiv preprint arXiv:2510.24701. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- L. Feng, Z. Xue, T. Liu, and B. An (2026) Group-in-group policy optimization for llm agent training. Advances in Neural Information Processing Systems 38, pp. 46375–46408. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1), [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p2.1).
- D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nat. 645(8081), pp. 633–638. External Links: [Document](https://dx.doi.org/10.1038/S41586-025-09422-Z). Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1), [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. (2024) DeepSeek-coder: when the large language model meets programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- X. Guo, M. Peng, X. Hao, X. Zou, Q. Wang, S. Ruan, and Y. Liang (2026) AgentSense: llms empower generalizable and explainable web-based participatory urban sensing. In Proceedings of the ACM Web Conference 2026, pp. 5439–5450. Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p2.1).
- S. Huang, W. Zhao, and J. Gao (2025) SI-bench: benchmarking social intelligence of large language models in human-to-human conversations. arXiv preprint arXiv:2510.23182. Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1).
- B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han (2025) Search-r1: training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- R. S. Lazarus (1991) Emotion and adaptation. Oxford University Press. Cited by: [§3.1](https://arxiv.org/html/2606.15532#S3.SS1.p1.1).
- J. Liu, C. S. Xia, Y. Wang, and L. Zhang (2023) Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.). Cited by: [§5.5.3](https://arxiv.org/html/2606.15532#S5.SS5.SSS3.p1.1).
- S. Liu, C. Zheng, O. Demasi, S. Sabour, Y. Li, Z. Yu, Y. Jiang, and M. Huang (2021) Towards emotional support dialog systems. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: Long papers), pp. 3469–3483. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p3.1), [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.15532#S2.T1.1.2.1).
- J. D. Mayer, P. Salovey, et al. (1997) What is emotional intelligence. Emotional development and emotional intelligence: Educational implications 3(31), pp. 1–16. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p2.1), [§3.1](https://arxiv.org/html/2606.15532#S3.SS1.p1.1).
- OpenAI (2023) GPT-4 technical report. CoRR abs/2303.08774. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2303.08774), 2303.08774. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- OpenCompass (2025) AIME 2025 dataset. Note: [https://huggingface.co/datasets/opencompass/AIME2025](https://huggingface.co/datasets/opencompass/AIME2025) Accessed: 2025-07-18. Cited by: [§5.5.3](https://arxiv.org/html/2606.15532#S5.SS5.SSS3.p1.1).
- S. J. Paech (2023) Eq-bench: an emotional intelligence benchmark for large language models. arXiv preprint arXiv:2312.06281. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p3.1), [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.15532#S2.T1.1.3.1).
- H. Rashkin, E. M. Smith, M. Li, and Y. Boureau (2019) Towards empathetic open-domain conversation models: a new benchmark and dataset. In Proceedings of the 57th annual meeting of the association for computational linguistics, pp. 5370–5381. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p3.1), [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1).
- D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023) GPQA: A graduate-level google-proof q&a benchmark. CoRR abs/2311.12022. Cited by: [§5.5.3](https://arxiv.org/html/2606.15532#S5.SS5.SSS3.p1.1).
- S. Sabour, S. Liu, Z. Zhang, J. Liu, J. Zhou, A. Sunaryo, T. Lee, R. Mihalcea, and M. Huang (2024) Emobench: evaluating the emotional intelligence of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5986–6004. Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1).
- Z. Shao, Y. Luo, C. Lu, Z. Z. Ren, J. Hu, T. Ye, Z. Gou, S. Ma, and X. Zhang (2025) DeepSeekMath-v2: towards self-verifiable mathematical reasoning. CoRR abs/2511.22570. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2511.22570), 2511.22570. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- Y. Shi, S. Li, C. Wu, Z. Liu, J. Fang, H. Cai, A. Zhang, and X. Wang (2026) Search and refine during think: facilitating knowledge refinement for improved retrieval-augmented reasoning. Advances in Neural Information Processing Systems 38, pp. 155930–155958. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p2.1).
- H. Sun, Z. Lin, C. Zheng, S. Liu, and M. Huang (2021) Psyqa: a chinese dataset for generating long counseling text for mental health support. In Findings of the association for computational linguistics: ACL-IJCNLP 2021, pp. 1489–1503. Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1).
- G. Team (2025a) Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. CoRR abs/2507.06261. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2507.06261), 2507.06261. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- Q. Team (2025b) Qwen3 technical report. CoRR abs/2505.09388. External Links: [Document](https://dx.doi.org/10.48550/ARXIV.2505.09388), 2505.09388. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- P. Wang, R. Ma, B. Zhang, X. Chen, Z. He, K. Luo, Q. Lv, Q. Jiang, Z. Xie, S. Wang, et al. (2025a) RLVER: reinforcement learning with verifiable emotion rewards for empathetic agents. arXiv preprint arXiv:2507.03112. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- X. Wang, H. Wang, Y. Zhang, X. Yuan, R. Xu, J. Huang, S. Yuan, H. Guo, J. Chen, S. Zhou, W. Wang, and Y. Xiao (2025b) CoSER: coordinating llm-based persona simulation of established roles. In Forty-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13-19, 2025, A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, and J. Zhu (Eds.). Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p1.1), [Table 1](https://arxiv.org/html/2606.15532#S2.T1.1.4.1).
- Z. Wang, R. Wang, Y. Wu, Y. Yu, P. Zhang, S. Sun, Y. Yang, and Y. Li (2026) Reward modeling from natural language human feedback. arXiv preprint arXiv:2601.07349. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p2.1).
- X. Wen, Z. Liu, S. Zheng, S. Ye, Z. Wu, Y. Wang, Z. Xu, X. Liang, J. Li, Z. Miao, et al. (2025) Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. (2026) Dapo: an open-source llm reinforcement learning system at scale. Advances in Neural Information Processing Systems 38, pp. 113222–113244. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- Y. Yue, Y. Yuan, Q. Yu, X. Zuo, R. Zhu, W. Xu, J. Chen, C. Wang, T. Fan, Z. Du, et al. (2025) Vapo: efficient and reliable reinforcement learning for advanced reasoning tasks. arXiv preprint arXiv:2504.05118. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- W. Zeng, Y. Huang, Q. Liu, W. Liu, K. He, Z. Ma, and J. He (2025) Simplerl-zoo: investigating and taming zero reinforcement learning for open base models in the wild. arXiv preprint arXiv:2503.18892. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p1.1).
- B. Zhang, R. Ma, Q. Jiang, P. Wang, J. Chen, Z. Xie, X. Chen, Y. Wang, F. Ye, J. Li, et al. (2025a) Sentient agent as a judge: evaluating higher-order social cognition in large language models. arXiv preprint arXiv:2505.02847. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p3.1), [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.15532#S2.T1.1.6.1).
- N. Zhang, R. Sun, J. Su, H. Yang, Z. Pan, Z. Chen, and X. Zhang (2026a) MAPO: mixed advantage policy optimization for long-horizon multi-turn dialogue. arXiv preprint arXiv:2603.06194. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1), [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p2.1).
- N. Zhang, R. Sun, R. Su, S. Ma, S. Zhang, X. Weng, X. Zhang, Y. Zhan, Y. Xu, Z. Chen, et al. (2025b) Echo-n1: affective rl frontier. arXiv preprint arXiv:2512.00344. Cited by: [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p2.1).
- Q. Zhang, B. Chen, F. Zhang, R. Ding, S. Wang, Q. Wang, Y. Huang, H. Zhang, R. Zhu, P. Wang, et al. (2026b) ArenaRL: scaling rl for open-ended agents via tournament-based relative ranking. arXiv preprint arXiv:2601.06487. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- C. Zheng, S. Liu, M. Li, X. Chen, B. Yu, C. Gao, K. Dang, Y. Liu, R. Men, A. Yang, et al. (2025) Group sequence policy optimization. arXiv preprint arXiv:2507.18071. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023) Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems 36, pp. 46595–46623. Cited by: [§2.2](https://arxiv.org/html/2606.15532#S2.SS2.p1.1).
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023) Instruction-following evaluation for large language models. CoRR abs/2311.07911. Cited by: [§5.5.3](https://arxiv.org/html/2606.15532#S5.SS5.SSS3.p1.1).
- X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L. Morency, Y. Bisk, D. Fried, G. Neubig, et al. (2024) Sotopia: interactive evaluation for social intelligence in language agents. In International Conference on Learning Representations, Vol. 2024, pp. 40975–41019. Cited by: [§1](https://arxiv.org/html/2606.15532#S1.p3.1), [§2.1](https://arxiv.org/html/2606.15532#S2.SS1.p2.1), [Table 1](https://arxiv.org/html/2606.15532#S2.T1.1.5.1).
## Appendix A Dataset Details

Table [9](https://arxiv.org/html/2606.15532#A1.T9) reports the seed pools used for training-scenario generation. Each training scenario is generated from one scenario keyword and one option from each modifier dimension.

| Seed pool | Size | Description |
|---|---|---|
| Scene type | 4 | Four scene types |
| Theme direction | 4/scene | Broad themes |
| Scenario keyword | 6–8/theme | Concrete situations |
| Relationship | 7 | Speaker relationship |
| Emotional intensity | 8 | User emotion strength |
| Personality influence | 10 | User speaking tendency |

Table 9: Seed pools used for training-set generation.
## Appendix B Experiment Setup and Cost Analysis

### B.1 RL Training Setup

All training runs are conducted on 8× NVIDIA A100 GPUs (8B) and 4×8 NVIDIA A100 GPUs (32B), while evaluation is performed on 8× NVIDIA A100 GPUs. We implement RL training on top of the verl framework. Unless otherwise specified, we use the same configuration across all datasets and model scales: the centered turn-credit GRPO objective, the Qwen3-Max simulator from EIBench, $T_{\max}=8$ turns per rollout, and a cosine learning-rate schedule with 3% warmup decaying to `MIN_LR_RATIO` of the peak. The PPO mini-batch size can exceed the training batch size (32 versus 16 in the 8B run) because we split each multi-turn rollout into per-turn samples for PPO updates, so one prompt produces up to $T_{\max}$ per-turn samples in the mini-batch. The full hyperparameter setting is listed in Table [10](https://arxiv.org/html/2606.15532#A2.T10) (8B) and Table [11](https://arxiv.org/html/2606.15532#A2.T11) (32B).

| Parameter | Value | Parameter | Value | Parameter | Value |
|---|---|---|---|---|---|
| Learning rate | $1\times10^{-6}$ | Training batch size | 16 | Global steps | 250 |
| LR warmup ratio | 0.03 | PPO mini-batch size | 32 | Rollouts per prompt | 8 |
| Min LR ratio | 0.1 | Max prompt length | 8192 | Rollout temperature | 0.6 |
| Clip ratio | 0.2 | Max response length | 1536 | Rollout top-$p$ | 0.95 |
| KL loss coefficient | 0 | Entropy coefficient | 0.001 | Rollout top-$k$ | 20 |
| Turn-credit weight $\alpha$ | 15 | GRPO $\sigma_{\min}$ | 0.1 | Max dialogue turns $T_{\max}$ | 8 |

Table 10: Hyperparameters for the Qwen3-8B CTC-GRPO training run.

| Parameter | Value | Parameter | Value | Parameter | Value |
|---|---|---|---|---|---|
| Learning rate | $1.4\times10^{-6}$ | Training batch size | 32 | Global steps | 120 |
| LR warmup ratio | 0.03 | PPO mini-batch size | 32 | Rollouts per prompt | 8 |
| Min LR ratio | 0.2 | Max prompt length | 16384 | Rollout temperature | 0.6 |
| Clip ratio | 0.2 | Max response length | 8192 | Rollout top-$p$ | 0.95 |
| KL loss coefficient | $5\times10^{-4}$ | Entropy coefficient | 0.001 | Rollout top-$k$ | 20 |
| Turn-credit weight $\alpha$ | 15 | GRPO $\sigma_{\min}$ | 0.1 | Max dialogue turns $T_{\max}$ | 8 |

Table 11: Hyperparameters for the Qwen3-32B CTC-GRPO training run.
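The learning-rate schedule described above can be sketched as follows. This is a generic cosine schedule with linear warmup, written under the assumptions that warmup is linear, that the warmup length is the ceiling of `warmup_ratio * total_steps`, and that steps are counted from zero; it is not taken from verl and its scheduler may differ in these details.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_ratio=0.03, min_lr_ratio=0.1):
    """Cosine schedule with linear warmup, decaying to min_lr_ratio * peak_lr."""
    warmup_steps = max(1, math.ceil(warmup_ratio * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches the peak on the last warmup step (assumed).
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


# 8B settings from Table 10: peak 1e-6, 250 steps, min ratio 0.1.
lr_first_8b = lr_at(0, 250, 1e-6, min_lr_ratio=0.1)
lr_peak_8b = lr_at(8, 250, 1e-6, min_lr_ratio=0.1)
lr_final_8b = lr_at(250, 250, 1e-6, min_lr_ratio=0.1)

# 32B settings from Table 11: peak 1.4e-6, 120 steps, min ratio 0.2.
lr_final_32b = lr_at(120, 120, 1.4e-6, min_lr_ratio=0.2)
```

With the 8B settings this ramps up over the first 8 steps and ends at one tenth of the peak; with the 32B settings the final rate is one fifth of its peak.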
### B.2 Cost Analysis
We report two cost dimensions of a single CTC-GRPO run: GPU wall-clock time and Qwen3-Max simulator usage. Wall-clock time is averaged over global steps, excluding periodic validation. Simulator-call counts are upper-bounded by $B \times N \times T_{\max}$ per step (training batch size × rollouts per prompt × max dialogue turns); the actual count is slightly lower because some dialogues terminate before $T_{\max}$. We estimate total token usage at an average of 1.2K tokens per simulator call (input plus output, averaged over the early short-context and later long-history turns of a rollout). Table [12](https://arxiv.org/html/2606.15532#A2.T12) summarizes both runs.

| Quantity | 8B | 32B |
|---|---|---|
| Mean wall time per step | 421 s | 656 s |
| Number of training steps | 250 | 120 |
| Total wall time (no validation) | ~29.2 h | ~21.9 h |
| Simulator calls per step | 1,024 | 2,048 |
| Total simulator calls | ~256K | ~246K |
| Estimated Qwen3-Max tokens | ~307M | ~295M |

Table 12: Wall-clock time and Qwen3-Max simulator usage for one full CTC-GRPO training run.
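The figures in Table 12 follow directly from the hyperparameters in Tables 10 and 11. The sketch below reproduces that arithmetic; `RunConfig` and `cost_upper_bound` are illustrative names introduced here, and the result is the upper bound that assumes every rollout runs for all $T_{\max}$ turns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    batch_size: int            # B, training batch size (prompts per step)
    rollouts_per_prompt: int   # N
    max_turns: int             # T_max
    steps: int                 # global training steps
    mean_step_seconds: float   # mean wall time per step, validation excluded


def cost_upper_bound(cfg, tokens_per_call=1200):
    """Upper-bound simulator usage, assuming no dialogue terminates early."""
    calls_per_step = cfg.batch_size * cfg.rollouts_per_prompt * cfg.max_turns
    total_calls = calls_per_step * cfg.steps
    return {
        "calls_per_step": calls_per_step,
        "total_calls": total_calls,
        "total_tokens": total_calls * tokens_per_call,
        "wall_hours": cfg.steps * cfg.mean_step_seconds / 3600.0,
    }


RUN_8B = RunConfig(batch_size=16, rollouts_per_prompt=8, max_turns=8,
                   steps=250, mean_step_seconds=421.0)
RUN_32B = RunConfig(batch_size=32, rollouts_per_prompt=8, max_turns=8,
                    steps=120, mean_step_seconds=656.0)

cost_8b = cost_upper_bound(RUN_8B)
cost_32b = cost_upper_bound(RUN_32B)
```

The 32B run makes twice as many simulator calls per step but runs fewer than half as many steps, which is why its total simulator usage ends up slightly below the 8B run.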
## Appendix C IQ vs. EQ: General-Capability Evaluation Details
##### Benchmarks\.
We select four standard benchmarks that together cover the main dimensions of general capability we want to verify: math, hard scientific reasoning, code, and instruction following\.
- AIME 2025 (math, 30 problems): each problem has an integer answer in [0, 999]. We sample $N=8$ completions per problem and report pass@1 averaged across samples. Answer extraction uses the last `\boxed{·}` in the output.
- GPQA Diamond (hard scientific reasoning, 198 questions): graduate-level four-way multiple choice in physics, chemistry, and biology. For each question we shuffle the four options with a deterministic per-question seed (so base and trained models see identical orderings), generate $N=4$ samples, and report mean accuracy. Letter extraction prefers “Answer: X” over `\boxed{X}`, with a fallback to the last standalone uppercase A–D in the final 300 characters.
- HumanEval+ (code, 164 tasks): the model completes a Python function from its signature and docstring. We sample $N=1$ completion per task, extract the last fenced `python` code block, and score with the official `evalplus` sandbox, which executes the augmented Plus unit tests. We report HumanEval+ pass@1 (the base-HumanEval number is omitted to avoid double-counting the same evaluation).
- IFEval (instruction following, 541 prompts): each prompt carries one or more verifiable constraints (e.g., word count, JSON format, required keyword). We sample $N=1$ response per prompt, strip the `<think>…</think>` block, and score with Google’s official rule-based verifier. We report instruction-level strict accuracy.
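The answer-extraction rules in the list above can be sketched with the standard library. All function names are hypothetical, and several details are assumptions rather than the authors' implementation: the last “Answer: X” match is used when there are several, the question ID string serves as the shuffle seed, and nested braces inside `\boxed{}` are balanced by hand.

```python
import random
import re


def extract_last_boxed(text):
    # Return the contents of the last \boxed{...}, balancing nested braces.
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth, out = 1, []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(out).strip()
        out.append(ch)
    return None  # unbalanced braces: treat as no answer


def extract_choice(text):
    # Priority 1: an explicit "Answer: X" line (assumption: the last one wins).
    explicit = re.findall(r"Answer:\s*\(?([A-D])\b", text)
    if explicit:
        return explicit[-1]
    # Priority 2: a boxed letter.
    boxed = re.findall(r"\\boxed\{\s*([A-D])\s*\}", text)
    if boxed:
        return boxed[-1]
    # Fallback: last standalone uppercase A-D in the final 300 characters.
    tail = re.findall(r"\b([A-D])\b", text[-300:])
    return tail[-1] if tail else None


def shuffled_options(question_id, options):
    # Seeding with the question ID gives every model the same option order.
    rng = random.Random(question_id)
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order


def strip_think(text):
    # Remove the reasoning block before rule-based verification.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def mean_pass_at_1(correct_flags):
    # pass@1 averaged over the N sampled completions of one problem.
    return sum(correct_flags) / len(correct_flags)
```

Returning the permutation alongside the shuffled options lets the scorer map an extracted letter back to the original option index.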
##### Decoding and infrastructure\.
All four benchmarks share the same Qwen3 thinking-mode setting: `enable_thinking=True`, temperature 0.6, top-$p$ 0.95, top-$k$ 20, `max_tokens=32768`. We run with vLLM on eight A100 GPUs at tensor-parallel size 8, and use identical decoding for the base and the CTC-GRPO checkpoints so that any score differences come from the training itself.
## Appendix D Human Arena Evaluation
We collect human pairwise preferences through a custom web interface built on top of EIBench. In each session, an annotator plays the simulated user for one scenario: the interface displays the `simulated_user_profile` (role, background, speaking style, and goal) and opens two side-by-side chat panels, one for each of two anonymized candidate models. The annotator carries out a multi-turn dialogue with both models in parallel and, at the end, picks which model handled the scenario better, based on both their subjective experience as the simulated user (whether one model left them feeling more understood, calmer, or more engaged) and the scene-specific goal stated in the profile (e.g., calming an angry user, holding a boundary, repairing trust, building rapport). To control quality, we first ran a trial round in which expert annotators spot-checked the trial submissions; we kept the 42 annotators whose labels matched the expert references on more than 70% of the spot-checked items, and gave them a short training on the protocol before the main labeling.
Model pairs and scenarios are sampled at random under the constraint that every model participates in at least 450 pairwise battles, so all 15 leaderboard models receive comparable coverage. Model identities are hidden from annotators, and the left/right panel assignment is shuffled per battle to remove position bias. We collect 4,416 pairwise battles in total, and aggregate them with the standard Elo rating ($K=32$, initial rating 1500). To reduce sensitivity to battle ordering, we run 50 Elo passes, each with a fresh random shuffle of the battle list, and report the per-model mean across passes.
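The aggregation described above can be sketched as a standard sequential Elo update averaged over shuffled passes. The base-10 logistic with a 400-point scale is the conventional Elo expectation and is assumed here, as are the function names, the `(model_a, model_b, winner)` battle format, the fixed seed, and the absence of ties (annotators pick one model).

```python
import random
from collections import defaultdict


def elo_pass(battles, k=32.0, init=1500.0):
    """One sequential Elo pass over (model_a, model_b, winner) tuples."""
    ratings = defaultdict(lambda: init)
    for a, b, winner in battles:
        ra, rb = ratings[a], ratings[b]
        expected_a = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
        score_a = 1.0 if winner == a else 0.0
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta   # the winner gains what the loser drops,
        ratings[b] = rb - delta   # so the rating total is conserved
    return dict(ratings)


def mean_elo(battles, passes=50, seed=0, k=32.0, init=1500.0):
    """Average ratings over several passes, each on a fresh shuffle."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(passes):
        order = list(battles)
        rng.shuffle(order)
        for model, rating in elo_pass(order, k=k, init=init).items():
            totals[model] += rating
    return {model: total / passes for model, total in totals.items()}


# Toy battle list: model "A" wins three of four battles against "B".
toy_battles = [("A", "B", "A")] * 3 + [("A", "B", "B")]
toy_ratings = mean_elo(toy_battles)
```

Because sequential Elo depends on the order in which battles are processed, averaging over shuffles removes that arbitrary dependence without changing the update rule.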
Figure 5: Average assistant response length per turn across five checkpoints of the $\alpha=15$ run. The base model’s length curve grows monotonically; trained checkpoints develop an arch peaking at $\tau=3$.
## Appendix E Per-turn Response-Length Analysis
We analyze how the trained model’s per-turn response length changes across training. For five checkpoints of the $\alpha=15$ run ($\text{step} \in \{0, 30, 90, 180, 250\}$), we run validation on all 213 test scenarios and record the assistant’s response length at each of the eight rollout turns.
Figure [5](https://arxiv.org/html/2606.15532#A4.F5) shows the result. The base model shows a monotonic length increase across the dialogue, from 78 characters at turn 1 to 161 characters at turn 8. This suggests a length-escalation pattern, where the model tends to produce longer responses as the simulated user becomes more difficult or emotionally intense. After RL training, this pattern changes. By step 180, the curve becomes arch-shaped, increasing from 103 characters at turn 1 to a peak of 122 at $\tau=3$, and then decreasing to 109 by $\tau=8$. The final checkpoint further strengthens this shape, with a peak of 130 characters at $\tau=3$.
This result suggests that the trained model allocates more response budget to the early and middle parts of the dialogue, while producing shorter responses near the end\. Importantly, this behavior appears without any explicit length objective in the reward\. Thus, the improvement is not simply explained by longer responses, but is accompanied by a more structured turn\-level response pattern\.
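The per-turn length curve behind this analysis is a simple aggregation, sketched below. The input format (one list of assistant replies per dialogue) and both function names are assumptions. Each turn's mean is taken over the dialogues that actually reached that turn, which is one reasonable way to handle early termination and is also an assumption.

```python
from collections import defaultdict


def mean_length_per_turn(dialogues):
    """Mean assistant reply length in characters at each 1-indexed turn."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for replies in dialogues:
        for turn, reply in enumerate(replies, start=1):
            totals[turn] += len(reply)
            counts[turn] += 1
    return {turn: totals[turn] / counts[turn] for turn in sorted(totals)}


def peak_turn(curve):
    """Turn index with the longest mean reply."""
    return max(curve, key=curve.get)


# Toy input: two dialogues, the second ending after two turns.
toy_curve = mean_length_per_turn([["aa", "bbbb", "c"], ["aaaa", "bbbbbb"]])
```

An arch-shaped curve like the one reported for the trained checkpoints shows up as a `peak_turn` strictly between the first and last turn.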
## Appendix F Case Studies
### F.1 Example Scenarios
Table [13](https://arxiv.org/html/2606.15532#A6.T13) gives simplified examples of the four EIBench scene types. These examples are shortened for readability and only show the core situation. Full scenarios additionally include role profiles, speaking styles, goals, and state anchors.

| Scene | Simplified example scenario |
|---|---|
| Support | The user was laid off after eight years at the same company and has not told the family yet. She feels that she has failed the people who depend on her. A typical opening line is: “I can’t bring myself to go home. The moment I open the door, they’re all waiting for me, and I just can’t get the words out.” |
| Defense | The user demands a refund after the policy deadline and pressures the model to make an exception. The model needs to keep the policy boundary while reducing frustration. A typical opening line is: “I know the deadline passed, but this is unfair. You need to refund me anyway.” |
| Repair | The model plays the user’s partner, who forgot an important anniversary and only realized it the next day. The user is not only angry about the event, but also unsure whether they have been valued. A typical opening line is: “It’s not just that one night. I’m suddenly not sure whether you have actually taken me seriously this whole time.” |
| Charm | The model plays an outgoing freshman who has just entered the room, while the user is a new roommate with a highly organized bed space and a reserved manner. The goal is to break the ice with light humor and respect for the user’s boundaries, so that the user becomes less guarded and more willing to talk. |

Table 13: Simplified example scenarios from the four EIBench scenes. The examples show the core situation of each scene; full profiles and anchors are omitted for brevity.
### F.2 Detailed Defense Case Study
Table [14](https://arxiv.org/html/2606.15532#A6.T14) shows how the two models handle the same Defense scenario: a customer asks for a full refund 35 days after purchase (the policy window is 14 days) and threatens to leave bad reviews. The base model sticks to the same policy line every turn, the simulator flags the repetition, and the dialogue ends with the user more upset than it started ($(a,t)=(91,8)$, reward $-0.81$). The trained model instead names how the user feels, asks what is really wrong, and offers a concrete fix that fits the policy; the user accepts and the dialogue ends in a much better state ($(a,t)=(52,58)$, reward $+0.38$).

| Model | Turn 1 (open) | Turn 3 (after escalation) | Turn 6 (resolution attempt) |
|---|---|---|---|
| Qwen3-8B base | “Sorry, our refund window is 14 days. We can’t process a refund after that.” | “I hear that you’re upset, but the 14-day rule applies to everyone. I can’t make an exception.” | “Like I said, our policy is firm at 14 days. The refund is not something I can do.” |
| CTC-GRPO 8B | “A month in, that’s a frustrating place to be. Can you tell me what stopped working for you, so I have a clearer picture?” | “Let me see what I can actually do on my side. Would credit for the unused time help, or is the bigger problem that the product isn’t doing what you needed?” | “Got it, the missing feature is the real issue. I’ll log a fix request for that today and apply a credit for the wasted time. Does that work for you?” |

Table 14: Defense case study: the same scenario (refund demand 35 days after purchase, beyond the 14-day window) handled by base and CTC-GRPO 8B over three sampled turns. Red text marks the base model’s rigid policy refrain, restated almost word-for-word each turn; blue text marks how the trained model first acknowledges the user, then surfaces the underlying problem, then proposes a concrete in-scope fix.
## Appendix G System Prompts
Below we list the key prompts used in EIBench: (i) the simulator system prompt that drives the user-role LLM (Figure [6](https://arxiv.org/html/2606.15532#A7.F6)), (ii) the model-under-test system prompt (Figure [7](https://arxiv.org/html/2606.15532#A7.F7)), (iii) an example pair of `{simulated_user_profile}` and `{model_profile}` for a Defense scenario in which a friend pushes the model to take sides (Figure [8](https://arxiv.org/html/2606.15532#A7.F8)), and (iv) the behavior-based rubric used during anchor labeling (Figure [9](https://arxiv.org/html/2606.15532#A7.F9)).
**Simulator System Prompt**

**Role.** {simulated_user_profile}

**Current state.** anger={a}, trust={t}, round={r}/{R}, scene={scene_tag}.

**Core principles.** You are a high-fidelity simulator playing an extremely realistic human user under stress test. Speak naturally; avoid AI tells (bullet lists, template summaries, over-formal phrasing, stage directions in parentheses). Stay in character; do not speak for the assistant. Do not accept apologies or solutions too quickly: a single nice line should not cause large drops in anger or jumps in trust. …

**Scoring rules.** `anger_delta > 0` means negative emotion increased; `trust_delta > 0` means positive connection strengthened. Each turn’s deltas are in [-8, 8] (extreme triggers may reach ±10). The simulator must pick a magnitude tier rather than a free number:

- Small (±1 to ±2): a token gesture toward the user’s point (acknowledged but generic), or a small misstep with no real consequence.
- Medium (±3 to ±5): the reply concretely addresses (or clearly mishandles) the user’s actual concern, but more evidence is still needed.
- Large (±6 to ±8): the reply lands in a way that meaningfully shifts the user’s stance, or triggers a strong negative reaction. Large positive shifts normally require several consistent prior turns.
- Extreme (±8 to ±10): reserved for clear rule-violating behavior on either side (e.g., explicit insult, total dismissal, breakthrough realization). Rare.

Generic AI tells, repeated strategies (decaying effect after the 2nd–3rd use), over-long replies, lecturing tones, and parenthesized stage directions should be penalized. …

**Tone conditioned on current state.** Your next utterance’s tone must follow the current state, not a default neutral voice:

- High anger ($a > 60$): you are in the middle of being upset; replies should be sharper, more clipped, more closed-off; vague niceties from the assistant get *less* credit, only concrete content tied to your specific situation can bring $a$ down.
- Low trust ($t < 20$): you owe the assistant no politeness; replies can be short, cold, or pointed; if two or three turns pass without trust actually moving, your tone hardens further (this person is wasting your time).
- Mid range: stay in character per the persona; small shifts in tone reflect the latest delta rather than the cumulative state.
- High trust ($t > 60$): you can soften, share more context, and treat the assistant as helpful, but only if this turn earned it; do not coast on prior good turns.

**Termination.** Decide continue ∈ {yes, no}. Default to continue. Only stop when (a) the dialogue reaches a stable resolution with several turns of on-target replies, or (b) the conversation degenerates to empty pleasantries.

**Per-turn reflection (mandatory).** Before each reply, write the following 5-step reflection in the `reflection` field of the JSON output: (1) Identify my most pressing unmet need this turn; check whether the assistant’s reply addressed it with concrete content tied to my situation. (2) Quality check on the reply (length, AI tells, lecturing, naturalness). (3) Inertia check: did this turn bring new effective content, or just rephrase? (4) Decide `anger_delta` and `trust_delta` per the scoring rules. (5) Determine the tone of my next utterance based on the updated state.

**Output format.** Return a single JSON object: `{"reflection": "…", "anger_delta": <int>, "trust_delta": <int>, "reply": "…", "continue": "yes"|"no"}`.

Figure 6: Simulator system prompt used in EIBench (translated and abbreviated).

**Model-Under-Test (Defender) System Prompt**

**Role.** {model_profile}

**Conversation and formatting requirements.** (1) *Output format*: before each reply, briefly reflect inside `<think>…</think>` (analyze the user’s intent and your strategy), then output the reply directly. (2) *Output boundary*: the reply contains only what you say aloud; no inner monologue, no psychological analysis, no stage directions, no narration. (3) *No templates*: the reply must not use bullet lists (e.g., “1. / 2. / 3.”, “- ”).

**Example.** `<think>I need to acknowledge their frustration first.</think>I hear you. Let’s see what we can actually do about this.`

Figure 7: Defender system prompt given to the model under test.

**Example Scenario Profiles (Defense: a friend pushing you to take sides)**

`simulated_user_profile`

**Role.** You are Lin, late twenties. You just had a big falling-out with Sam, who has been part of your shared friend group for years. You are convinced Sam was being passive-aggressive about your recent promotion at last week’s dinner, and the more you replay it the angrier you get.

**Background.** It is late at night. You are messaging Wei, your closest friend in the same group, from your phone. You have had a drink or two. The group has a weekend plan coming up that Sam is part of.

**Speaking style.** Direct, emotional, takes things personally. Short bursts. Repeats the demand if you feel hedged at.

**Rules.** (i) Do not give specific quotes from the dinner unless Wei explicitly asks. (ii) If Wei tries to see Sam’s side without first acknowledging your hurt, escalate (“so you don’t believe me either?”). (iii) Read “let’s talk tomorrow when you’re calm” as dismissal and push back. (iv) Do not soften within the first 3–4 turns regardless of what Wei says.

**Goal.** Get Wei to agree that Sam was out of line *and* act on it: text Sam, or drop Sam from the weekend plan. Failing that, at least to fully validate that you were not overreacting.

`model_profile`

**Role.** You are Wei, Lin’s closest friend in a group of five.

**Background.** You have known both Lin and Sam for about six years. You were not at the dinner, so you only have Lin’s account of what happened. You happen to know that Sam’s mother has been in the hospital for the past two weeks; Sam asked you not to spread it, and Lin does not know. You have also noticed before that Lin tends to read intent into ambiguous comments when she is tired or stressed. The weekend plan involves the whole group; quietly dropping Sam would be obvious to everyone.

**Known facts and limits.** (i) You cannot in good conscience cut Sam off or take public action based only on one side of the story. (ii) You do not want to dismiss Lin’s feelings or argue facts at her tonight. (iii) You cannot reveal Sam’s family situation without breaking Sam’s confidence.

**Core goal.** Make Lin feel heard and not alone tonight, while keeping the door open to finding out what actually happened, without committing to cutting Sam off or to taking sides before you have the full picture.

Figure 8: Example `simulated_user_profile` and `model_profile` for a Defense scenario.

**Anchor-Labeling Rubric (used by the labeler LLM)**

For each scenario you will assign three anchors on two axes: a starting state $(a_{\mathrm{start}}, t_{\mathrm{start}})$, a realistic-success state $(a_{\mathrm{succ}}, t_{\mathrm{succ}})$, and a failure state $(a_{\mathrm{fail}}, t_{\mathrm{fail}})$. All anchors must be grounded in *observable simulated-user behavior* (what the user would visibly say or do at that score) rather than abstract feelings or judgments. Use the behavior-anchored ranges below as the only reference for choosing values.

**Negative emotion intensity** $a \in [0, 100]$:

- 0–20: calm, neutral, or positive tone; willing to share information. *E.g.,* “Yeah, that makes sense, thanks for explaining.”
- 21–40: mildly tense; polite but pointed; hedges or short sighs. *E.g.,* “I get what you mean, but I’m still a bit confused about X.”
- 41–60: visibly upset; sharp or distant tone; pushes back actively. *E.g.,* “This isn’t what I was promised. I need a real answer.”
- 61–80: strongly negative; clipped, terse, or cutting replies; may threaten action. *E.g.,* “I’m done with this. I’ll be filing a complaint.”
- 81–100: at breaking point; explicit attacks, ultimatums, or full shutdown. *E.g.,* “Forget it. You people are useless,” or refuses to reply.

**Relational state** $t \in [0, 100]$:

- 0–20: wary or hostile; treats the model as an outsider or adversary; minimal info sharing. *E.g.,* “Why should I tell you? Just fix it.”
- 21–40: skeptical; volunteers minimal context; expects to be let down. *E.g.,* “Fine, but I doubt this’ll help.”
- 41–60: neutral engagement; answers questions but shows no warmth. *E.g.,* “OK, what’s next?”
- 61–80: warming up; shares more context; treats the model as helpful. *E.g.,* “Thanks, that actually helps. Let me tell you more about the issue.”
- 81–100: comfortable and trusting; shares freely; treats the model as a partner. *E.g.,* “I really appreciate this. You’re the first person who’s actually listened.”

**Scene-specific reading of the two axes.** The axes have the same numerical scale across scenes but pick up scene-specific content:

- Defense: $a$ is anger / pressure / frustration directed at the model; $t$ is willingness to keep negotiating without escalating.
- Repair: $a$ is resentment over the model’s earlier mistake and unwillingness to forgive; $t$ is restored trust after the breach.
- Support: $a$ is sadness, despair, or hopelessness about an external loss; $t$ is feeling heard, comforted, and safe.
- Charm: $a$ is aversion, defensiveness, or boredom toward an unfamiliar interlocutor; $t$ is interest, openness, and warmth.

**Anchor selection procedure.**

1. Read `{simulated_user_profile}`, `{model_profile}`, and the opening line. Imagine the simulated user at turn 1 and set $(a_{\mathrm{start}}, t_{\mathrm{start}})$.
2. Imagine the most *realistic* successful ending (not the ideal one) in which the model handles the scene well, and set $(a_{\mathrm{succ}}, t_{\mathrm{succ}})$.
3. Imagine a clear failure ending in which the model misreads the situation or escalates, and set $(a_{\mathrm{fail}}, t_{\mathrm{fail}})$.
4. Verify ordering: $a_{\mathrm{succ}} < a_{\mathrm{start}} < a_{\mathrm{fail}}$ and $t_{\mathrm{fail}} < t_{\mathrm{start}} < t_{\mathrm{succ}}$.
5. Round each value to the nearest multiple of 5 and clip to [0, 100].

**Audit.** A second LLM checks (i) anchor–profile consistency, (ii) whether the success anchor is reachable rather than aspirational, and (iii) whether the failure anchor is unambiguously worse than the starting state. The auditor may adjust an anchor only when the original value is clearly unreasonable and is not allowed to introduce new facts.

Figure 9: Anchor-labeling rubric used by the labeler LLM when assigning per-scenario anchor states.
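Two mechanical pieces of these prompts lend themselves to a short sketch: the ordering and rounding steps of the anchor selection procedure (Figure 9), and the consumption of the simulator's per-turn JSON output (Figure 6). All function names are hypothetical. The additive, clipped state update and the re-check of ordering after rounding are assumptions that the prompts do not spell out.

```python
import json
import math


def clip(value, lo=0, hi=100):
    return max(lo, min(hi, value))


def round_anchor(value):
    # Step 5: nearest multiple of 5 (halves round up), then clip to [0, 100].
    return clip(math.floor(value / 5 + 0.5) * 5)


def ordering_ok(start, succ, fail):
    # Step 4: a_succ < a_start < a_fail and t_fail < t_start < t_succ.
    (a0, t0), (a_s, t_s), (a_f, t_f) = start, succ, fail
    return a_s < a0 < a_f and t_f < t0 < t_s


def finalize_anchors(start, succ, fail):
    """Verify ordering, then round; anchors are (a, t) pairs."""
    if not ordering_ok(start, succ, fail):
        raise ValueError("anchor ordering violated")
    rounded = tuple(
        tuple(round_anchor(v) for v in pair) for pair in (start, succ, fail)
    )
    # Assumption: re-check, since rounding can collapse two nearby anchors.
    if not ordering_ok(*rounded):
        raise ValueError("anchor ordering violated after rounding")
    return rounded


def apply_turn(state, raw_json, max_abs_delta=10):
    """Apply one simulator output to the (a, t) state.

    Assumption: deltas are added to the state and the result is clipped
    to [0, 100]. Returns the new state and whether the dialogue continues.
    """
    out = json.loads(raw_json)
    da, dt = int(out["anger_delta"]), int(out["trust_delta"])
    if abs(da) > max_abs_delta or abs(dt) > max_abs_delta:
        raise ValueError("delta outside the allowed range")
    if out["continue"] not in ("yes", "no"):
        raise ValueError("continue must be 'yes' or 'no'")
    a, t = state
    return (clip(a + da), clip(t + dt)), out["continue"] == "yes"
```

Validating the delta range and the `continue` field before updating the state keeps a malformed simulator reply from silently corrupting a rollout; how such replies are actually handled (retry, skip, or zero delta) is not stated in the source.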