EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

arXiv cs.CL Papers

Summary

EmoDistill is an offline framework that distills emotional negotiation skills into language model agents. It uses Implicit Q-Learning for emotion selection, and LoRA-based supervised fine-tuning followed by Judge Policy Optimization for emotion expression, achieving higher utility in adversarial negotiations.

arXiv:2605.26785v1 Announce Type: new Abstract: Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns *which* emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns *how* to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

Cached at: 05/27/26, 09:10 AM

# Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation
Source: [https://arxiv.org/html/2605.26785](https://arxiv.org/html/2605.26785)
Yunbo Long (1,∗), Haolang Zhao (1,∗), Lukas Beckenbauer (2), Liming Xu (1,4), Alexandra Brintrup (1,5,†)

(1) University of Cambridge; (2) Technical University of Munich; (4) Exiger LLC; (5) The Alan Turing Institute

{yl892,hz496,lx249,ab702}@cam.ac.uk, lukas.beckenbauer@tum.de

∗ Equal contribution. † Corresponding author.

###### Abstract

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty’s interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns *which* emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns *how* to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training. The code is available at [https://github.com/Yunbo-max/EmoDistill](https://github.com/Yunbo-max/EmoDistill).


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.26785v1/figs/reward.png)

Figure 1: Single-emotion prompting effects on CRAD Debt Negotiation. GoEmotions labels are ranked by mean per-turn judge reward (±95% Welch CI), with the vanilla baseline shown as a dashed line.

![Refer to caption](https://arxiv.org/html/2605.26785v1/figs/main.png)

Figure 2: Overview of EmoDistill, which converts offline LLM-vs-LLM negotiations into reward-annotated emotional negotiation skills and distills them into a 7B SLM through IQL emotion selection, LoRA-SFT, and JPO.

Modern large language models (LLMs) are extensively post-trained through RLHF (Kasbouya and Sael, [2025](https://arxiv.org/html/2605.26785#bib.bib9)), DPO (Gao et al., [2025](https://arxiv.org/html/2605.26785#bib.bib8)), and instruction tuning to be helpful, polite, and accommodating. While this alignment makes them useful general-purpose assistants, it can become a liability when the same models are deployed as autonomous agents in strategic or adversarial settings. This risk is amplified in tool-calling agents that take consequential actions on behalf of users, such as transferring money, booking travel, purchasing products, scheduling meetings, or handling customer support (Lin et al., [2024](https://arxiv.org/html/2605.26785#bib.bib27); Abbasiantaeb et al., [2024](https://arxiv.org/html/2605.26785#bib.bib26); Hu et al., [2025](https://arxiv.org/html/2605.26785#bib.bib25)). Many such tasks are negotiations over price, time, priority, refunds, or deadlines. Since the counterparty can be another LLM agent, and no human may be in the loop, emotionally charged language becomes a practical attack method. Because emotional cues strongly condition LLM outputs, emotion can function both as a strategic vulnerability and, if properly modeled, as a defensive control channel.

To verify that emotion is a meaningful strategic variable rather than a cosmetic prompt feature, we first run a controlled single-emotion prompting study on CRAD using GoEmotions labels (Demszky et al., [2020](https://arxiv.org/html/2605.26785#bib.bib33)). For each emotion, we evaluate the LLM negotiator on the same 20 held-out scenarios across 20 sampled runs and compute the mean per-turn judge reward. Several emotions produce rewards significantly above the neutral prompt, showing that emotional framing alone can systematically change negotiation outcomes. This motivates treating emotion as an action: a negotiator should learn not only what offer to make, but also which emotion to express. The full statistical procedure is given in Appendix [B](https://arxiv.org/html/2605.26785#A2).

A direct approach is online learning from sampled negotiations and observed outcomes. However, this is a poor fit for LLM agent-to-agent negotiation. Each rollout is a costly multi-turn interaction with stochastic API-based agents, making data difficult to regenerate, reuse, or reproduce. Moreover, many sampled dialogues provide only weak supervision for emotional strategy: they may end in early agreement, noisy affective responses, or failed negotiation without revealing which emotional behavior caused the outcome. This makes online RL methods such as PPO (Schulman et al., [2017](https://arxiv.org/html/2605.26785#bib.bib23)) expensive and unstable in our setting. Existing emotion-aware negotiation methods reduce part of this cost by modeling emotion as a dynamic decision variable (Long et al., [2026a](https://arxiv.org/html/2605.26785#bib.bib10), [2025a](https://arxiv.org/html/2605.26785#bib.bib12), [2025b](https://arxiv.org/html/2605.26785#bib.bib11)). However, they mainly optimize *which* emotion to express and rely on sparse trajectory-level feedback, which can indicate whether an emotional sequence worked but not which particular response improved or damaged the negotiation. More importantly, they leave the utterance generator largely fixed, so a selected emotion may still be realized as vague politeness, premature concession, or weak justification.

The missing part is *strategic emotional expression*. In negotiation, emotion is useful only when it is grounded in the current bargaining state. Anger should target an unacceptable proposal rather than the person; fear should create credible urgency rather than signal weakness; and concession should be framed as a conditional trade rather than capitulation. Thus, the goal is not to make the model sound emotional, but to make emotion function as part of the bargaining move. This motivates our notion of an *emotional negotiation skill*.

Recent agent systems treat skills as reusable capabilities that can be retrieved, executed, evaluated, and revised (Zhou et al., [2026](https://arxiv.org/html/2605.26785#bib.bib48)). In our setting, a skill is not a hand-written prompt or external procedure, but a reward-annotated negotiation turn $(s_t, e_t, u_t, r_t)$ from an offline LLM-vs-LLM sweep. It binds the dialogue state, emotional stance, concrete utterance, and judge feedback into a reusable bargaining move. Rather than storing such skills in an external library, we distill them into the parameters of a smaller negotiator. Thus, we propose EmoDistill, an offline framework that distills emotional negotiation skills from precomputed LLM-vs-LLM rollouts. By separating emotion selection from emotional expression, EmoDistill enables a 7B SLM to learn effective negotiation behavior without live API rollouts. Our main contributions are:

- We introduce EmoDistill, an offline framework that distills LLM-level emotional switching and expression strategies into smaller language model negotiators. Experiments show that EmoDistill enables a 7B SLM to outperform stronger LLM and SLM baselines across multiple negotiation domains, while also demonstrating non-trivial transfer to unseen settings.
- We empirically reveal the coupled role of emotion selection and emotional expression in LLM negotiation. By separating IQL-based emotion selection from SFT-based expression learning under randomized emotion conditioning, we show that optimizing how an agent expresses a selected emotion can combine with emotion choice to produce more effective negotiation strategies.
- We propose Judge Policy Optimization (JPO), which uses LLM-judge-derived advantages to refine the SFT-trained expression policy. The judge provides dense turn-level feedback on how each utterance contributes to the negotiation trajectory, allowing the SLM to improve emotional expression from offline data beyond standard SFT.

## 2 Related Work

#### Emotion in agent\-to\-agent negotiation\.

Modern LLM agents are trained on human-authored text and dialogue, and therefore inherit affective and pragmatic patterns such as politeness, empathy, and concession framing. When these models negotiate with other agents, such patterns become part of the bargaining interface. Since prior work shows that emotions can serve as dynamic strategic instruments in negotiation (Huang and Hadfi, [2024](https://arxiv.org/html/2605.26785#bib.bib1); Griessmair et al., [2015](https://arxiv.org/html/2605.26785#bib.bib17); Olekalns and Druckman, [2014](https://arxiv.org/html/2605.26785#bib.bib18)), we treat emotion as a controllable action channel in LLM-based agent-to-agent bargaining. Recent LLM-based negotiation systems have begun treating emotion as a meaningful variable, but typically as an input rather than an optimized output. AgreeMate (Chatterjee et al., [2024](https://arxiv.org/html/2605.26785#bib.bib19)) and ACE (Shea et al., [2024](https://arxiv.org/html/2605.26785#bib.bib20)) use emotion-aware reasoning, while EQ-Negotiator (Long et al., [2026a](https://arxiv.org/html/2605.26785#bib.bib10)) combines emotion sensing with Hidden-Markov reasoning. EmoDebt (Long et al., [2025a](https://arxiv.org/html/2605.26785#bib.bib12)) and EvoEmo (Long et al., [2025b](https://arxiv.org/html/2605.26785#bib.bib11)) go further by treating emotion as a sequential decision variable, optimized via Bayesian optimization and evolutionary search respectively. However, all of these optimize *which* emotion to express while leaving the utterance generator fixed; a selected emotion may still be realized through vague politeness or premature concession. EmoDistill addresses this gap by jointly distilling emotion selection and emotional expression into a smaller model.

#### Decoupling strategy from expression\.

He et al. ([2018](https://arxiv.org/html/2605.26785#bib.bib52)) first proposed decoupling high-level coarse dialogue acts (e.g., `propose(price=50)`) from utterance generation in negotiation, observing that end-to-end RL tends to collapse to degenerate solutions such as repetitive utterances or meaningless concessions that exploit the reward signal. This idea has been extended to cooperative emotional-support dialogue: EmoDynamiX (Wan et al., [2025](https://arxiv.org/html/2605.26785#bib.bib53)) decouples strategy prediction from generation via heterogeneous graph modeling, and DecoupledESC (Zhang et al., [2025](https://arxiv.org/html/2605.26785#bib.bib54)) uses strategy-response decoupled DPO to mitigate preference bias. Our setting is adversarial rather than cooperative, and the strategic axis is *emotion* rather than price-level acts; these differences require different training signals and a different decoupling mechanism. EmoDistill decouples emotion selection (an offline IQL selector) from emotional expression (a LoRA-adapted SLM generator), distilling both into a 7B student rather than coordinating a frozen LLM with a retrieval module.

#### Offline distillation with LLM\-judge signals\.

LLM judges provide scalable supervision when human labels or online rollouts are expensive. RLAIF (Lee et al., [2023](https://arxiv.org/html/2605.26785#bib.bib39); Bai et al., [2022](https://arxiv.org/html/2605.26785#bib.bib38)) uses AI-generated preferences for sequence-level alignment, while process reward models (Lightman et al., [2024](https://arxiv.org/html/2605.26785#bib.bib40)) provide step-level feedback mainly for reasoning tasks. A-LoL (Baheti et al., [2024](https://arxiv.org/html/2605.26785#bib.bib50)) is related as an offline advantage-based method for language-model refinement: it treats the entire generated response as one action and trains on positive-advantage examples. However, multi-turn negotiation requires a different credit-assignment structure. In EmoDistill, each focal-agent turn is a reward-annotated emotional bargaining move, and JPO refines the expression policy using scenario-normalized turn-level judge advantages. This lets the model learn which emotional utterances move the bargaining trajectory toward or away from the focal target, rather than only amplifying sequence-level positive examples. We compare A-LoL and JPO refinement in Appendix [K](https://arxiv.org/html/2605.26785#A11).

## 3 EmoDistill

EmoDistill is an offline pipeline with three stages. We first construct an LLM-vs-LLM negotiation dataset and attach two complementary signals to each offline trajectory: a dense per-turn LLM-judge score for each focal-agent utterance and an outcome-shaped trajectory return computed from observed bargaining dynamics and terminal agreement. The same offline sweep is reused across all training stages: IQL uses the outcome-shaped return for emotion selection (§[3.3](https://arxiv.org/html/2605.26785#S3.SS3)), LoRA-SFT uses a hybrid judge–outcome filter for demonstration selection, and JPO uses dense judge-derived advantages for utterance-level policy improvement (§[3.4](https://arxiv.org/html/2605.26785#S3.SS4)). Section [3.2](https://arxiv.org/html/2605.26785#S3.SS2) formalizes this stage-wise signal design. The full EmoDistill policy is reported in experiments as IQL+SFT+JPO: IQL selects which emotional skill to invoke, LoRA-SFT initializes how that skill is expressed, and JPO refines the utterance generator with dense judge-derived advantages.

### 3.1 Offline Trajectory Dataset and Judge Annotation

For each domain (CRAD, Disaster Rescue, Hospital Surgery, Student Sleep), we collect $N = 80$ training scenarios $\times$ $M = 100$ random emotion-sequence rollouts, yielding an offline dataset $\mathcal{D}$ of 8000 trajectories per domain. Each rollout samples emotions from the full action vocabulary $\mathcal{E}$ (28 GoEmotions labels). At each focal-agent turn, $\mathcal{D}$ records $z_t = (s_t, e_t, u_t, r_t, s_{t+1})$, where $s_t$ is the dialogue state, $e_t \in \mathcal{E}$ is the emotion action, $u_t$ is the focal utterance, $r_t$ is the judge-assigned per-turn reward, and $s_{t+1}$ is the next state after the counterparty responds. Each high-reward instance constitutes an *emotional negotiation skill*: a state-grounded emotional stance realized through a concrete bargaining utterance. The full per-turn judge rubric prompt is given in Appendix [P.3](https://arxiv.org/html/2605.26785#A16.SS3). Dataset details, sweep construction, and the prompt interface are given in Appendices [E.1](https://arxiv.org/html/2605.26785#A5.SS1), [E.2](https://arxiv.org/html/2605.26785#A5.SS2), and [P](https://arxiv.org/html/2605.26785#A16).
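
As a rough illustration of the record layout described above, the per-turn tuple $z_t$ can be held in a small immutable container. This is a minimal sketch, not the authors' data format: the class name `SkillTurn`, the choice to serialize states as strings, and the `threshold` used to pick "high-reward" turns are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SkillTurn:
    """One focal-agent turn z_t = (s_t, e_t, u_t, r_t, s_{t+1}).

    Field types are assumptions: states are kept as plain text here.
    """
    state: str        # s_t: dialogue state before the focal utterance
    emotion: str      # e_t: one of the 28 GoEmotions labels
    utterance: str    # u_t: the focal agent's utterance
    reward: float     # r_t: judge-assigned per-turn reward
    next_state: str   # s_{t+1}: state after the counterparty responds


def sweep_size(n_scenarios: int, rollouts_per_scenario: int) -> int:
    """Number of trajectories in one domain's offline sweep (N x M)."""
    return n_scenarios * rollouts_per_scenario


def high_reward_turns(turns: List[SkillTurn], threshold: float) -> List[SkillTurn]:
    """Keep turns whose judge reward is at least `threshold` (hypothetical cutoff)."""
    return [z for z in turns if z.reward >= threshold]
```

With the values quoted in the text, `sweep_size(80, 100)` gives the 8000 trajectories per domain.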

### 3.2 Reward Design and Stage-wise Signal Use

EmoDistill draws training signals from two complementary sources: a per-turn LLM judge that provides dense subjective evaluation of each focal utterance, and an outcome-shaped objective reward computed from observed bargaining dynamics and terminal agreement.

#### Subjective signal\.

A Qwen3.5-Plus judge scores each focal utterance against a metric-aligned rubric that rewards anchoring toward the focal target, concrete proposals, and scenario-grounded leverage, while penalizing capitulation, vagueness, repetition, and emotion-inconsistent concessions. We denote the raw per-turn score by $r_t$ and normalize within scenario:

$$A_t = \frac{r_t - \mu_{\text{scen}}}{\sigma_{\text{scen}} + \epsilon}. \tag{1}$$

Here $\mu_{\text{scen}}$ and $\sigma_{\text{scen}}$ denote the mean and standard deviation of the raw judge scores within the same scenario, and $\epsilon$ is a small constant for numerical stability.
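
The scenario-level normalization in Eq. (1) can be sketched in a few lines. This is an illustrative version only: the input layout (a list of `(scenario_id, raw_score)` pairs), the use of the population standard deviation, and the value of `eps` are assumptions, since the text does not specify them.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def scenario_normalized_advantages(turns, eps=1e-8):
    """Compute A_t = (r_t - mu_scen) / (sigma_scen + eps) for each turn.

    turns: list of (scenario_id, raw_judge_score) pairs.
    Returns the advantages in the same order as the input.
    """
    # Group raw scores by scenario.
    scores_by_scenario = defaultdict(list)
    for scenario_id, score in turns:
        scores_by_scenario[scenario_id].append(score)

    # Per-scenario mean and (population) standard deviation.
    stats = {
        scenario_id: (fmean(scores), pstdev(scores))
        for scenario_id, scores in scores_by_scenario.items()
    }

    return [
        (score - stats[scenario_id][0]) / (stats[scenario_id][1] + eps)
        for scenario_id, score in turns
    ]
```

Normalizing per scenario means a turn is compared only against other turns from the same bargaining problem, so an easy scenario with uniformly high judge scores does not dominate the advantages.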

#### Objective signal\.

The outcome-shaped reward $R(\tau)$ rewards the focal agent for shifting the bargaining gap in its favor:

$$R(\tau) = \underbrace{\sum_{t=1}^{T_{\tau}} w(t)\bigl(\Delta^{\text{ctp}}_{t} - \Delta^{\text{foc}}_{t}\bigr)}_{\text{step shaping}} + \underbrace{R^{\text{term}}(\tau)}_{\text{agreement bonus}}. \tag{2}$$

The step shaping credits turns where the counterparty concedes more than the focal agent: $\Delta^{\text{ctp}}_{t}$ is the counterparty’s per-turn move toward the focal target (positive when they close the gap), and $\Delta^{\text{foc}}_{t}$ is the focal agent’s own retreat (positive when they move away from their target), both normalized by the initial anchor-to-target gap. The terminal anchor is $R^{\text{term}}(\tau) = +2$ for a reached agreement and $-2$ for a breakdown. The linear time-decay $w(t) = \max(0, \min(1, 1 - t/T_{\text{max}}))$ down-weights late-turn concessions as an implicit length penalty. $R(\tau)$ uses no LLM-judge signal; the full formulation and reward-variant definitions are in Appendix [D](https://arxiv.org/html/2605.26785#A4).
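
To make Eq. (2) concrete, the following sketch computes the outcome-shaped return from per-turn concession sequences. It assumes the per-turn moves have already been normalized by the initial anchor-to-target gap, as the text describes; the function and argument names are illustrative and not taken from the released code.

```python
def time_decay(t, t_max):
    """Linear time-decay w(t) = max(0, min(1, 1 - t / T_max))."""
    return max(0.0, min(1.0, 1.0 - t / t_max))


def outcome_shaped_return(delta_ctp, delta_foc, agreed, t_max):
    """R(tau) = sum_t w(t) * (delta_ctp_t - delta_foc_t) + R_term.

    delta_ctp[t-1]: counterparty's normalized move toward the focal target at turn t.
    delta_foc[t-1]: focal agent's normalized retreat from its own target at turn t.
    agreed: True if the negotiation ended in agreement, False on breakdown.
    """
    shaping = sum(
        time_decay(t, t_max) * (d_ctp - d_foc)
        for t, (d_ctp, d_foc) in enumerate(zip(delta_ctp, delta_foc), start=1)
    )
    terminal = 2.0 if agreed else -2.0
    return shaping + terminal
```

For example, with `delta_ctp=[0.5, 0.25]`, `delta_foc=[0.0, 0.25]`, and `t_max=4`, the weights are 0.75 and 0.5, the shaping term is 0.375, and the return is 2.375 on agreement or -1.625 on breakdown.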

#### Stage\-wise signal use\.

The two signals are repurposed across stages. (i) IQL (§[3.3](https://arxiv.org/html/2605.26785#S3.SS3)) uses the objective $R(\tau)$ as a Bellman-propagated terminal reward, so the selector is rewarded for emotion sequences that actually close the bargaining gap rather than those that merely sound persuasive. (ii) LoRA-SFT (§[3.4](https://arxiv.org/html/2605.26785#S3.SS4)) uses a hybrid filter combining $r_t$ and $R(\tau)$ to select demonstrations that are both locally well-formed and globally productive. (iii) JPO (§[3.4](https://arxiv.org/html/2605.26785#S3.SS4)) uses the subjective per-turn $A_t$ for clipped offline policy improvement, enabling credit assignment at the level of individual emotional expressions. In Sec. [4](https://arxiv.org/html/2605.26785#S4) we ablate three reward variants per stage, differing in how the signal is distributed across turns: *outcome-shaped* (objective, *sparse*: $R(\tau)$ only at trajectory end, propagated via Bellman backups), *episode-judge* (subjective, *broadcast*: one dialogue-level judge score copied to every turn), and *turn-judge* (subjective, *dense*: an independent judge score per focal turn). SFT benefits from clean objective filters (*which demonstrations to imitate*); JPO benefits from dense subjective advantages (*which expressions to upweight*).
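
The three reward variants differ only in how a signal is laid out over the turns of one trajectory. The sketch below shows that layout under simple assumptions: rewards are stored as one float per focal turn, and intermediate turns of the sparse variant receive zero. The helper names are hypothetical.

```python
def sparse_rewards(num_turns, trajectory_return):
    """Outcome-shaped (sparse): R(tau) placed only on the final turn."""
    return [0.0] * (num_turns - 1) + [trajectory_return]


def broadcast_rewards(num_turns, episode_score):
    """Episode-judge (broadcast): one dialogue-level score copied to every turn."""
    return [episode_score] * num_turns


def dense_rewards(turn_scores):
    """Turn-judge (dense): an independent judge score for each focal turn."""
    return list(turn_scores)
```

Under the sparse layout, credit reaches earlier turns only through Bellman backups; under the broadcast layout, every turn receives the same credit regardless of its individual quality.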

### 3.3 Emotion Selection with Offline IQL

The selector treats $e_t \in \mathcal{E}$ as the action and $s_t$ as the state. We train Implicit Q-Learning (IQL) on $\mathcal{D}$, learning $Q(s,e)$ and $V(s)$ with the standard expectile objective, and extract an advantage-weighted selector with temperature $\beta_{\text{AWR}}$:

$$\pi_{\phi}(e \mid s) \propto \exp\bigl(\beta_{\text{AWR}} \cdot (Q(s,e) - V(s))\bigr). \tag{3}$$

At inference, the selector samples an emotion for the current dialogue state, which is then inserted into the expression policy’s prompt. IQL learns *which* emotional skill to invoke; it does not update the utterance generator. Detailed selector objectives and pseudocode are provided in Appendices [C](https://arxiv.org/html/2605.26785#A3) and [C.1](https://arxiv.org/html/2605.26785#A3.SS1).
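
A minimal sketch of the selector in Eq. (3), together with the asymmetric expectile loss commonly used to fit $V(s)$ in IQL, is given below. It assumes $Q$ and $V$ have already been learned and are available as plain numbers for one state; the function names, the dictionary layout, and the expectile value in the usage note are illustrative, and the paper's actual objectives are in its Appendix C.

```python
import math
import random


def expectile_loss(diff, expectile):
    """Asymmetric squared loss |expectile - 1(diff < 0)| * diff^2, with diff = Q - V."""
    weight = expectile if diff >= 0 else 1.0 - expectile
    return weight * diff * diff


def awr_selector_probs(q_values, v_value, beta_awr):
    """pi(e | s) proportional to exp(beta_AWR * (Q(s, e) - V(s))).

    q_values: dict mapping emotion label -> Q(s, e) for the current state.
    """
    logits = {e: beta_awr * (q - v_value) for e, q in q_values.items()}
    max_logit = max(logits.values())  # subtract the max for numerical stability
    weights = {e: math.exp(l - max_logit) for e, l in logits.items()}
    total = sum(weights.values())
    return {e: w / total for e, w in weights.items()}


def sample_emotion(probs, rng):
    """Sample one emotion label from the selector distribution."""
    emotions = list(probs)
    return rng.choices(emotions, weights=[probs[e] for e in emotions], k=1)[0]
```

Usage might look like `sample_emotion(awr_selector_probs({"anger": 1.0, "joy": 0.0}, 0.5, 2.0), random.Random(0))`. A larger $\beta_{\text{AWR}}$ concentrates probability on the highest-advantage emotion, while $\beta_{\text{AWR}} = 0$ reduces to uniform sampling.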

| Method | CRAD Suc. | CRAD Util. | CRAD Rd. | Disaster Suc. | Disaster Util. | Disaster Rd. | Hospital Suc. | Hospital Util. | Hospital Rd. | Student Suc. | Student Util. | Student Rd. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla (LLM) | 50.0 | 5.0±15.7 | **8.5±2.4** | 100.0 | 15.0±35.7 | 6.5±4.8 | 100.0 | 35.0±47.7 | 4.5±2.9 | 100.0 | 45.9±21.5 | **2.5±1.2** |
| Vanilla (SLM) | 25.0 | 8.8±20.1 | 11.2±5.9 | 75.0 | 37.9±41.7 | 9.9±7.2 | 90.0 | 40.3±45.0 | 4.3±2.8 | 100.0 | 15.0±30.7 | 3.4±1.5 |
| Random | 85.0 | 40.6±40.5 | 13.9±9.7 | 100.0 | 10.0±30.0 | 4.5±3.4 | 100.0 | 30.0±45.8 | 4.2±2.2 | 100.0 | 43.8±30.5 | **2.5±1.3** |
| IQL | 95.0 | 63.6±38.1 | 10.4±8.1 | 100.0 | 5.0±21.8 | **4.3±3.5** | 100.0 | 0.0±0.0 | **3.1±1.6** | 100.0 | 47.9±30.5 | 2.8±1.2 |
| IQL+SFT | 100.0 | 69.8±30.4 | 10.1±7.8 | 100.0 | 15.0±35.7 | 7.4±5.5 | 95.0 | 20.0±40.0 | 6.0±6.5 | 100.0 | 51.7±20.7 | **2.5±1.5** |
| IQL+JPO | 95.0 | 51.7±34.2 | 14.1±8.1 | 90.0 | **40.0±49.0** | 11.3±8.7 | 100.0 | 35.0±47.7 | 7.4±6.0 | 100.0 | 23.5±33.9 | 5.5±5.1 |
| IQL+SFT+JPO | 90.0 | **72.2±37.5** | 15.0±9.8 | 100.0 | 30.0±45.8 | 6.5±4.0 | 100.0 | **45.0±49.7** | 5.5±3.4 | 100.0 | **52.6±26.6** | 3.1±2.9 |

Table 1: In-domain negotiation results against a vanilla LLM counterparty. IQL+SFT+JPO denotes the full EmoDistill pipeline. Suc. is success rate, Util. is utility, and Rd. is negotiation rounds. Utility counts failures as 0. Best value per dataset is in bold, except Success.

| Method | Signal | Success (%) ↑ | Outcomes (%) ↑ | Utility (%) ↑ | Rounds ↓ |
|---|---|---|---|---|---|
| IQL | Outcome-shaped reward | 95.0 | 66.9±36.0 | **63.6±38.1** | 10.0±7.2 |
| IQL | Episode-judge reward | 85.0 | 70.6±27.4 | 60.0±35.7 | 10.5±5.6 |
| IQL | Turn-judge reward | 80.0 | **76.3±25.4** | 61.0±38.0 | **7.4±1.9** |
| IQL+SFT | Hybrid quality filter | 100.0 | **69.8±30.4** | **69.8±30.4** | 10.1±7.8 |
| IQL+SFT | Episode-judge filter | 90.0 | 55.1±33.4 | 49.6±35.7 | 9.7±5.7 |
| IQL+SFT | Turn-judge filter | 95.0 | 64.8±32.0 | 61.6±34.2 | **9.0±3.9** |
| IQL+SFT+JPO | Outcome-shaped advantage | 85.0 | 64.1±28.0 | 54.5±34.5 | 13.2±3.4 |
| IQL+SFT+JPO | Episode-judge advantage | 90.0 | 77.4±26.0 | **69.7±33.9** | **11.2±3.0** |
| IQL+SFT+JPO | Turn-judge advantage | 90.0 | **80.2±30.3** | 72.2±37.5 | 15.0±9.8 |

Table 2: Training-signal ablation on CRAD. Outcomes average successful episodes; Utility counts failures as 0. Best value per method block is in bold, except Success.

### 3.4 Distilling Emotional Expression

The IQL selector chooses the emotional skill, but the base SLM still needs to learn how to execute it in language\. We train a LoRA adapter on Qwen2\.5\-7B\-Instruct in two stages\.

#### Stage 1: LoRA\-SFT initialization\.

We score each turn in $\mathcal{D}$ by a hybrid quality function $q^{\text{hyb}}_{t} = r_t + \tfrac{1}{2} R(\tau)$, where the per-turn judge $r_t$ rewards locally well-formed expression and the trajectory return $R(\tau)$ rewards turns drawn from globally productive negotiations. We retain the top 25% of $(s_t, e_t, u_t)$ tuples ranked by $q^{\text{hyb}}_{t}$ as demonstrations and train the LoRA adapter to generate $u_t$ conditioned on $(s_t, e_t)$ via token-level cross-entropy:

$$\mathcal{L}_{\text{SFT}} = -\sum_{k=1}^{|u_t|} \log \pi_{\theta}(u_{t,k} \mid s_t, e_t, u_{t,<k}). \tag{4}$$
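
The demonstration filter and the loss in Eq. (4) can be sketched as follows. This is an illustration under assumptions: each turn is a dictionary with hypothetical keys `"r_t"` and `"R_tau"`, ties are broken by sort order, at least one demonstration is always kept, and the per-token log-probabilities are assumed to come from whatever model implementation is in use.

```python
def hybrid_quality(turn_reward, trajectory_return):
    """q_hyb = r_t + 0.5 * R(tau)."""
    return turn_reward + 0.5 * trajectory_return


def select_demonstrations(turns, keep_fraction=0.25):
    """Keep the top `keep_fraction` of turns ranked by hybrid quality.

    turns: list of dicts with keys 'r_t' and 'R_tau' (plus any payload fields).
    """
    ranked = sorted(
        turns,
        key=lambda z: hybrid_quality(z["r_t"], z["R_tau"]),
        reverse=True,
    )
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def sft_loss(token_logprobs):
    """Negative sum of log pi(u_{t,k} | s_t, e_t, u_{t,<k}) over one utterance."""
    return -sum(token_logprobs)
```

Because $R(\tau)$ is shared by all turns in a trajectory, the filter favors well-judged turns that also come from negotiations with a good overall outcome.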

#### Stage 2: Judge Policy Optimization \(JPO\)\.

Freezing the SFT adapter as $\pi_{\text{ref}}$, JPO applies an offline PPO-clipped surrogate with importance ratio $\rho_t = \pi_{\theta}(u_t \mid s_t, e_t) / \pi_{\text{ref}}(u_t \mid s_t, e_t)$ and a K3 KL anchor (Schulman, [2020](https://arxiv.org/html/2605.26785#bib.bib51)) to $\pi_{\text{ref}}$. To control how strongly JPO moves away from low-value utterances, we replace the symmetric advantage $A_t$ from Eq. ([1](https://arxiv.org/html/2605.26785#S3.E1)) with an asymmetric variant:

$$\widetilde{A}_t = \begin{cases} A_t, & A_t > 0, \\ \kappa A_t, & A_t \leq 0, \end{cases} \tag{5}$$

where $\kappa \in [0,1]$ controls the weight on negative-advantage samples. Smaller $\kappa$ preserves more of the SFT deal-closing prior by weakening the push away from low-value utterances; larger $\kappa$ applies stronger pressure against such utterances. The JPO objective is then

$$\begin{aligned} \mathcal{L}_{\text{JPO}} = {} & -\mathbb{E}\big[\min\big(\rho_t \widetilde{A}_t,\; \text{clip}(\rho_t, 1-\varepsilon, 1+\varepsilon)\,\widetilde{A}_t\big)\big] \\ & + \lambda_{\text{KL}} \cdot \text{KL}_{\text{K3}}[\pi_{\theta} \,\|\, \pi_{\text{ref}}]. \end{aligned} \tag{6}$$

Training hyperparameters, including LoRA configuration, $\varepsilon$, $\lambda_{\text{KL}}$, and the validation protocol for $\kappa$, are reported in Appendix [F](https://arxiv.org/html/2605.26785#A6).
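
The pieces of Eqs. (5) and (6) can be combined in a small scalar sketch. Several choices here are assumptions rather than details from the paper: log-probabilities are taken at the utterance level, the K3 estimator is written as $(r - 1) - \log r$ with $r = \pi_{\text{ref}} / \pi_{\theta}$ and evaluated on the offline samples, and the default values of `clip_eps` and `lambda_kl` are placeholders (the actual settings are in the paper's Appendix F).

```python
import math


def asymmetric_advantage(advantage, kappa):
    """Eq. (5): keep positive advantages, scale non-positive ones by kappa."""
    return advantage if advantage > 0 else kappa * advantage


def k3_kl(logp_theta, logp_ref):
    """K3 estimator (r - 1) - log r with r = pi_ref / pi_theta; always >= 0."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - 1.0 - log_r


def jpo_loss(samples, kappa=1.0, clip_eps=0.2, lambda_kl=0.1):
    """Mean clipped-surrogate loss with a K3 KL anchor, as in Eq. (6).

    samples: list of (logp_theta, logp_ref, advantage), one entry per utterance.
    """
    total = 0.0
    for logp_theta, logp_ref, advantage in samples:
        rho = math.exp(logp_theta - logp_ref)  # importance ratio
        a = asymmetric_advantage(advantage, kappa)
        clipped_rho = min(max(rho, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(rho * a, clipped_rho * a)
        total += -surrogate + lambda_kl * k3_kl(logp_theta, logp_ref)
    return total / len(samples)
```

When the policy equals the reference, the ratio is 1 and the KL term vanishes, so the loss reduces to the negative mean of the asymmetric advantages. With $\kappa = 0$, negative-advantage samples contribute nothing to the surrogate.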

### 3.5 Variants and Emotion-Free Ablation

In the main tables, IQL+SFT+JPO denotes the full EmoDistill policy, combining IQL emotion selection, LoRA-SFT expression initialization, and JPO expression refinement. We evaluate three component variants. IQL uses only the learned emotion selector and pairs it with the frozen LLM generator, testing whether emotion selection alone can improve an LLM negotiator. IQL+SFT replaces the frozen LLM generator with a LoRA-adapted SLM expression policy trained by supervised fine-tuning. IQL+JPO removes the SFT warm start and tests direct judge-guided refinement of the SLM expression policy.

We also evaluate an emotion-free diagnostic variant. The default EmoDistill policy is emotion-conditional: the selector samples $\hat{e}_t \sim \pi^{\mathrm{IQL}}_{\phi}(\cdot \mid s_t)$ and the expression policy generates $\hat{u}_t \sim \pi_{\theta}(\cdot \mid s_t, \hat{e}_t)$. In the emotion-free variant, the emotion block is removed during both training and inference, so the adapter directly generates $\hat{u}_t \sim \pi^{\mathrm{free}}_{\theta}(\cdot \mid s_t)$. This tests whether the LoRA adapter can internalize emotional strategy without an explicit emotion channel; the covariate-shift analysis is given in Appendix [H](https://arxiv.org/html/2605.26785#A8).

## 4 Experimental Setup

#### Datasets\.

We evaluate on four negotiation domains: Credit Recovery (CRAD) (Long et al., [2026a](https://arxiv.org/html/2605.26785#bib.bib10)), Disaster Rescue, Hospital Surgery Scheduling, and Student Sleep Scheduling from EmoMAS (Long et al., [2026b](https://arxiv.org/html/2605.26785#bib.bib49)). Each dataset contains 100 scenarios, split into 80 training and 20 held-out test scenarios. For each training scenario, we sample 100 random emotion-sequence rollouts from the full $|\mathcal{E}| = 28$ vocabulary, yielding 8000 offline trajectories per domain. The domains cover different roles, objectives, and preference directions; dataset details and sweep construction are in Appendices [E.1](https://arxiv.org/html/2605.26785#A5.SS1) and [E.2](https://arxiv.org/html/2605.26785#A5.SS2).

#### Compared methods\.

Qwen3.5-Plus (Yang et al., [2025](https://arxiv.org/html/2605.26785#bib.bib55)) is used to generate the offline LLM-vs-LLM sweep, provide judge annotations, and serve as the default counterparty. In the IQL-only baseline, the selected emotion is inserted into the prompt of a frozen Qwen3.5-Plus model, yielding an IQL-guided LLM negotiator. In the distilled student methods, the same IQL selector conditions a Qwen2.5-7B-Instruct (Qwen et al., [2025](https://arxiv.org/html/2605.26785#bib.bib57)) focal SLM: IQL+SFT, IQL+JPO, and IQL+SFT+JPO differ only in how the SLM expression policy is adapted with LoRA. Vanilla LLM/SLM omit the selector, using Qwen3.5-Plus and Qwen2.5-7B-Instruct respectively; Random uses uniformly sampled emotion prompts. Counterparty models are varied in transfer and tournament experiments.

#### Training\.

All learned policies are trained from the same fixed offline sweep generated with Qwen3.5-Plus in an LLM-vs-LLM negotiation setup. By default, IQL uses the outcome-shaped trajectory return for emotion selection, LoRA-SFT uses a hybrid judge–outcome filter for demonstration selection, and JPO uses scenario-normalized per-turn judge advantages for utterance-level refinement. We set $\kappa=1$ for JPO unless otherwise stated. The training-signal ablation compares alternative reward variants for these stages. Reward definitions are given in Appendix [D](https://arxiv.org/html/2605.26785#A4), and model/training details are in Appendix [F](https://arxiv.org/html/2605.26785#A6).
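As an illustration of what "scenario-normalized per-turn judge advantages" could look like, the sketch below standardizes each judge score against the other scores from the same scenario. This z-score form is an assumption; the exact reward and advantage definitions are those of Appendix D, and the function name and `eps` parameter are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def scenario_normalized_advantages(records, eps=1e-8):
    """records: list of (scenario_id, judge_score). Returns advantages in input order."""
    by_scenario = defaultdict(list)
    for scenario_id, score in records:
        by_scenario[scenario_id].append(score)
    # Per-scenario mean and population standard deviation of the judge scores.
    stats = {sid: (mean(v), pstdev(v)) for sid, v in by_scenario.items()}
    # Assumed form: (score - scenario mean) / (scenario std + eps).
    return [(score - stats[sid][0]) / (stats[sid][1] + eps) for sid, score in records]
```

Under this reading, a turn is rewarded only for scoring above what is typical for its scenario, so easy scenarios do not dominate the update.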

#### Evaluation\.

Each method is evaluated on the same 20 held-out scenarios per domain. We report success rate, Outcomes, Utility, and negotiation rounds. Outcomes averages normalized savings over successful negotiations, whereas Utility averages over all scenarios and assigns 0 to failures. The role-neutral savings formula and aggregation rules are provided in Appendix [E.3](https://arxiv.org/html/2605.26785#A5.SS3).
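The difference between Outcomes and Utility can be made concrete with a short sketch. It assumes each scenario yields either a normalized savings value or `None` on failure; the function name and input format are hypothetical, and the savings formula itself is the one in Appendix E.3.

```python
def summarize(savings):
    """savings: one entry per scenario; a normalized savings value, or None on failure."""
    n = len(savings)
    successes = [s for s in savings if s is not None]
    success_rate = len(successes) / n
    # Outcomes: mean over successful negotiations only.
    outcomes = sum(successes) / len(successes) if successes else 0.0
    # Utility: mean over all scenarios, with failures contributing 0.
    utility = sum(successes) / n
    return success_rate, outcomes, utility


print(summarize([0.5, None, 1.0, None]))  # (0.5, 0.75, 0.375)
```

With these definitions, mean Utility equals success rate times mean Outcomes. The mean values in Table 3 agree with this to within rounding, for example 0.25 × 35.3 ≈ 8.8 for Vanilla (SLM) and 0.90 × 61.1 ≈ 55.0 for SFT without emotion conditioning.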

| Method | Success (%) ↑ | Outcomes (%) ↑ | Utility (%) ↑ | Rounds ↓ |
|---|---|---|---|---|
| Vanilla (SLM) | 25.0 | 35.3 ± 26.1 | 8.8 ± 20.1 | **11.2 ± 5.9** |
| SFT (no emotion condition) | 90.0 | 61.1 ± 28.8 | **55.0 ± 32.5** | 13.7 ± 10.2 |
| JPO (no emotion condition, no SFT) | 40.0 | 28.7 ± 30.6 | 11.5 ± 23.4 | 24.2 ± 9.6 |
| SFT+JPO (no emotion condition) | 50.0 | **76.7 ± 32.2** | 38.4 ± 46.2 | 23.6 ± 9.1 |

Table 3: Emotion-free ablation on CRAD. Emotion blocks are removed during training and inference. Outcomes average successful episodes; Utility counts failures as 0. Best value is in bold, except Success.

| Method | CRAD Suc. | CRAD Util. | CRAD Rd. | Disaster Suc. | Disaster Util. | Disaster Rd. | Hospital Suc. | Hospital Util. | Hospital Rd. | Student Suc. | Student Util. | Student Rd. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla (LLM) | 50.0 | 5.0 ± 15.7 | 8.5 ± 2.4 | 100.0 | 15.0 ± 35.7 | 6.5 ± 4.8 | 100.0 | 35.0 ± 47.7 | 4.5 ± 2.9 | 100.0 | 45.9 ± 21.5 | **2.5 ± 1.2** |
| IQL | 95.0 | 63.6 ± 38.1 | 10.4 ± 8.1 | 100.0 | 5.0 ± 21.8 | 4.3 ± 3.5 | 100.0 | 0.0 ± 0.0 | 3.1 ± 1.6 | 100.0 | 47.9 ± 30.5 | 2.8 ± 1.2 |
| IQL+SFT+JPO$_C$ | 90.0 | **72.2 ± 37.5** | 15.0 ± 9.8 | 100.0 | 25.0 ± 43.3 | 6.8 ± 6.2 | 100.0 | 25.0 ± 43.3 | 6.5 ± 6.2 | 100.0 | 9.8 ± 24.1 | 3.0 ± 1.0 |
| IQL+SFT+JPO$_D$ | 95.0 | 57.5 ± 33.3 | 9.9 ± 6.9 | 100.0 | **30.0 ± 45.8** | 6.5 ± 4.0 | 100.0 | 25.0 ± 43.3 | 4.8 ± 2.9 | 95.0 | 30.0 ± 45.8 | 6.4 ± 6.6 |
| IQL+SFT+JPO$_H$ | 100.0 | 64.3 ± 44.3 | **7.0 ± 3.5** | 90.0 | 20.0 ± 40.0 | 9.3 ± 9.5 | 100.0 | **45.0 ± 49.7** | 5.5 ± 3.4 | 100.0 | 5.0 ± 21.8 | 3.6 ± 2.1 |
| IQL+SFT+JPO$_S$ | 75.0 | 56.3 ± 43.6 | 14.9 ± 10.4 | 100.0 | 16.8 ± 28.3 | **2.9 ± 1.2** | 100.0 | 35.6 ± 34.7 | **2.9 ± 1.6** | 100.0 | **52.6 ± 26.6** | 3.1 ± 2.9 |

Table 4: Cross-domain transfer. EmoDistill$_{C/D/H/S}$ denotes EmoDistill trained on CRAD, Disaster, Hospital, or Student. Each block reports success, Utility, and rounds; Utility counts failures as 0. Best value per evaluation domain is in bold, except Success.

| Counterparty | Method | Success (%) ↑ | Outcomes (%) ↑ | Utility (%) ↑ | Rounds ↓ |
|---|---|---|---|---|---|
| DeepSeek-V3 (DeepSeek-AI et al., [2025](https://arxiv.org/html/2605.26785#bib.bib56)) | Vanilla | 50.0 | 55.6 ± 21.8 | 27.8 ± 31.8 | 8.2 ± 3.5 |
| | IQL | 90.0 | 58.2 ± 25.7 | 52.4 ± 30.0 | 12.0 ± 6.5 |
| | IQL+SFT | 100.0 | 73.3 ± 14.4 | 73.3 ± 14.4 | **7.3 ± 2.4** |
| | IQL+SFT+JPO | 100.0 | **81.5 ± 12.8** | **81.5 ± 12.8** | 9.3 ± 4.8 |
| ChatGPT-4o-mini (OpenAI, [2024](https://arxiv.org/html/2605.26785#bib.bib58)) | Vanilla | 40.0 | 43.7 ± 27.8 | 17.5 ± 27.7 | **8.6 ± 4.8** |
| | IQL | 60.0 | 38.8 ± 29.8 | 23.3 ± 29.9 | 17.6 ± 7.9 |
| | IQL+SFT | 95.0 | 62.4 ± 28.3 | **59.3 ± 30.8** | 14.2 ± 7.8 |
| | IQL+SFT+JPO | 75.0 | **68.3 ± 28.4** | 51.2 ± 38.5 | 17.7 ± 9.5 |
| SLM: Qwen2.5-3B-Instruct | Vanilla | 30.0 | 60.1 ± 40.5 | 18.0 ± 35.4 | **4.5 ± 2.1** |
| | IQL | 85.0 | **73.0 ± 34.7** | **62.1 ± 41.3** | 9.8 ± 9.5 |
| | IQL+SFT | 30.0 | 57.4 ± 30.4 | 17.2 ± 31.1 | 7.0 ± 10.4 |
| | IQL+SFT+JPO | 25.0 | 53.1 ± 35.4 | 13.3 ± 29.0 | 8.7 ± 12.5 |

Table 5: Cross-counterparty transfer on CRAD. Models are trained against Qwen3.5-Plus and evaluated zero-shot against unseen counterparties. Outcomes average successful episodes; Utility counts failures as 0. Best value per counterparty is in bold, except Success.

| Method | $\kappa$ | Suc. | Out. | Util. | Rd. |
|---|---|---|---|---|---|
| JPO-$\kappa$ | 0.00 | 95.0 | 77.0 ± 24.8 | 73.2 ± 29.7 | 9.5 ± 4.0 |
| JPO-$\kappa$ | 0.25 | 100.0 | 69.3 ± 25.4 | 69.3 ± 25.4 | 8.8 ± 4.3 |
| JPO-$\kappa$ | 0.50 | 100.0 | **82.5 ± 15.9** | **82.5 ± 15.9** | 9.4 ± 5.0 |
| JPO-$\kappa$ | 0.75 | 95.0 | 80.4 ± 32.4 | 76.4 ± 36.3 | **7.8 ± 4.3** |
| JPO | 1.00 | 90.0 | 80.2 ± 30.3 | 72.2 ± 37.5 | 15.0 ± 9.8 |

Table 6: Risk-controlled JPO on CRAD.

## 5 Experimental Results

We organize the experiments around four questions. (Q1) Can our emotion-distilled EmoDistill SLM agent, trained from offline LLM-vs-LLM negotiation data, outperform vanilla and ablated SLM/LLM baselines and even surpass an IQL-guided LLM negotiator on held-out scenarios? (Q2) Where do the gains come from: high-level IQL emotion selection, LoRA-SFT expression imitation, JPO expression refinement, or the reward signal used for distillation? (Q3) Do the distilled emotional negotiation skills transfer across domains with different scalar variables, preference directions, and stakes? (Q4) How robust is the distilled negotiator to unseen counterparties, and how does down-weighting negative-advantage samples in JPO trade off agreement success against per-deal outcomes?

#### Q1: In\-Domain SLM\-to\-LLM Negotiation\.

Table [1](https://arxiv.org/html/2605.26785#S3.T1) shows that emotion distillation turns the 7B SLM into a stronger negotiator against an LLM counterparty. EmoDistill achieves the highest Utility on three of four domains, namely CRAD, Hospital, and Student, but often uses more rounds. This suggests that the distilled agent is not simply more agreeable; it learns a more strategic bargaining style that trades speed for better focal-side value. The baselines reveal two different weaknesses. Vanilla LLM agents tend to reach agreement quickly, but often concede value to the counterparty. Vanilla SLM agents are less aligned and can sometimes preserve stronger positions, but are unstable across domains. Random emotion prompting is also inconsistent: it can improve over vanilla behavior in some domains, but it introduces noise because the emotion is not state-dependent. Thus, the benefit comes from learning when to use emotion and how to express it. Disaster is the main exception. Unlike the other domains, it has a rescue-style context and a narrower bargaining margin, so most methods already reach agreement and Utility is harder to improve. Here direct IQL+JPO performs best, suggesting that aggressive expression refinement is less useful when the outcome space is compressed.

#### Q2: Component and Training\-Signal Analysis\.

Table [1](https://arxiv.org/html/2605.26785#S3.T1) shows that the gains come from combining emotion selection with expression learning. The IQL-only baseline shows that learned emotion selection can guide a frozen LLM negotiator, but it does not distill emotional expression into the language models. LoRA-SFT stabilizes the generator by imitating high-quality LLM-vs-LLM negotiations, while JPO further sharpens the expression policy with judge-derived advantages. This explains why direct JPO is unstable, whereas SFT-initialized JPO gives stronger and more reliable behavior. Table [2](https://arxiv.org/html/2605.26785#S3.T2) shows that training signals are stage-dependent. SFT benefits most from a clean hybrid quality filter because imitation mainly needs good demonstrations. JPO benefits more from judge-based advantages because policy improvement needs fine-grained feedback on which expressions to upweight. Table [3](https://arxiv.org/html/2605.26785#S4.T3) further shows that removing the emotion condition can still produce high outcomes when successful, but lowers utility. Thus, the selected emotion is not a surface prompt; it is a control variable that makes emotional skill execution more reliable.

#### Q3: Cross\-Domain Transfer\.

Table [4](https://arxiv.org/html/2605.26785#S4.T4) evaluates whether distilled emotional negotiation skills transfer across domains. The clearest pattern is that the best Utility is always achieved by the adapter trained on the same domain: EmoDistill$_C$ on CRAD, EmoDistill$_D$ on Disaster, EmoDistill$_H$ on Hospital, and EmoDistill$_S$ on Student. This shows that emotional expression strategies are not purely generic; the best bargaining behavior depends on the domain’s scalar variable, preference direction, and concession geometry. At the same time, transfer is not absent. Several off-domain adapters still maintain high success and non-trivial Utility, such as EmoDistill$_H$ and EmoDistill$_D$ on CRAD, or EmoDistill$_S$ on Hospital. This suggests that the learned expression policy transfers general negotiation competence, such as maintaining dialogue coherence and reaching agreements. However, value extraction is more domain-sensitive than success. For example, EmoDistill$_C$ keeps 100.0% success on Student but obtains low Utility, and EmoDistill$_H$ also transfers poorly to Student. Thus, cross-domain transfer is partial: distilled emotional skills help the agent remain a competent negotiator, but maximizing Utility still benefits from domain-specific distillation.
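The same-domain pattern can be checked directly from the Utility means reported in Table 4. The sketch below copies those means for the four adapters (rows are training domains, columns are evaluation domains) and picks the best adapter per evaluation domain; the variable and function names are hypothetical, and standard deviations are ignored.

```python
# Utility means (%) from Table 4: rows = training domain, columns = evaluation domain.
DOMAINS = ["CRAD", "Disaster", "Hospital", "Student"]
UTILITY = {
    "CRAD":     [72.2, 25.0, 25.0, 9.8],
    "Disaster": [57.5, 30.0, 25.0, 30.0],
    "Hospital": [64.3, 20.0, 45.0, 5.0],
    "Student":  [56.3, 16.8, 35.6, 52.6],
}


def best_adapter_per_domain(utility):
    """Map each evaluation domain to the training domain with the highest mean Utility."""
    best = {}
    for col, eval_domain in enumerate(DOMAINS):
        best[eval_domain] = max(utility, key=lambda train: utility[train][col])
    return best


print(best_adapter_per_domain(UTILITY))
# {'CRAD': 'CRAD', 'Disaster': 'Disaster', 'Hospital': 'Hospital', 'Student': 'Student'}
```

Given the large standard deviations in Table 4 and only 20 test scenarios per domain, this comparison of means is descriptive rather than a significance claim.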

#### Q4: Cross\-Counterparty Transfer and Risk\-Controlled Refinement\.

Table [5](https://arxiv.org/html/2605.26785#S4.T5) evaluates whether a CRAD-trained policy transfers to unseen counterparties. The main pattern is that EmoDistill transfers well across unseen LLM counterparties, but transfer is less reliable when the counterparty scale changes. This suggests that the learned expression policy captures LLM-style negotiation regularities, while smaller counterparties may violate the dialogue patterns seen in the offline sweep. The cross-counterparty results also reveal a success–value tradeoff. JPO-style refinement often improves successful-case Outcomes by making the agent more assertive, but this can reduce Utility when the policy becomes less willing to close weaker deals. Table [6](https://arxiv.org/html/2605.26785#S4.T6) shows that risk-controlled JPO provides a tunable success–value tradeoff. On CRAD, $\kappa=0.5$ best balances preserving the SFT deal-closing prior with avoiding low-value concessions, yielding the highest Utility. Additional analyses on emotion-free covariate shift, case studies, prompts, A-LoL comparison, and training stability are provided in Appendices [H](https://arxiv.org/html/2605.26785#A8), [Q](https://arxiv.org/html/2605.26785#A17), [P](https://arxiv.org/html/2605.26785#A16), [K](https://arxiv.org/html/2605.26785#A11), and [L](https://arxiv.org/html/2605.26785#A12).
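One plausible reading of the $\kappa$ control, consistent with "down-weighting negative-advantage samples" and with $\kappa=1$ as the unmodified default, is a multiplicative weight applied only to samples with negative advantage. The sketch below illustrates that reading. It is an assumption, not the paper's JPO objective, and both function names are hypothetical.

```python
def kappa_weight(advantage, kappa=1.0):
    """Weight for one sample: 1 for non-negative advantage, kappa otherwise (assumed form)."""
    return 1.0 if advantage >= 0 else kappa


def risk_controlled_advantages(advantages, kappa=1.0):
    """Scale negative advantages by kappa and leave non-negative ones unchanged."""
    return [kappa_weight(a, kappa) * a for a in advantages]


print(risk_controlled_advantages([1.0, -2.0], kappa=0.5))  # [1.0, -1.0]
```

Under this reading, $\kappa=0$ ignores negative-advantage samples entirely and $\kappa=1$ penalizes them at full strength, which matches the direction of the success–value tradeoff described above.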

## 6 Discussion

EmoDistill suggests that emotion in negotiation is best treated as a controllable bargaining interface, not merely as generation style. The key design choice is to keep emotion explicit enough for the selector to control, while internalizing expression skills into the SLM weights so the model does not rely on brittle prompt templates alone. The success–value tradeoff observed in JPO and the $\kappa$ analysis further suggests that emotional negotiators should expose risk preferences: some deployments may prefer closing every deal, while others may prefer stronger per-deal value.

## 7 Conclusion and Future Work

We introduced EmoDistill, which distills LLM-vs-LLM emotional negotiation skills into a 7B SLM through IQL selection, LoRA-SFT, and JPO. Results show that emotion can become a learnable bargaining control channel, and the $\kappa$ analysis shows that refinement can be tuned along a success–value frontier. Future work should improve transfer, emotion-free robustness, and human validation.

## Limitations

Several limitations remain in the current version of EmoDistill. First, the framework is trained entirely from fixed offline LLM-vs-LLM trajectories. Although this makes training reusable and avoids costly online rollouts, the learned selector and expression policy may still encounter distributional shift at deployment time, especially when the counterparty uses dialogue strategies not represented in the offline sweep. Second, EmoDistill relies on an explicit emotion-conditioning channel at inference time. Our emotion-free ablations show that the adapter can learn some negotiation behavior without explicit emotion labels, but this behavior is less reliable. This suggests that emotional negotiation skill is not fully internalized into the model weights; instead, the explicit emotion variable remains an important control interface. Developing more robust prompt-free or latent-emotion variants is an important direction for future work. Third, transfer is partial rather than universal. Cross-domain experiments show that success rates often transfer better than value extraction, while utility remains sensitive to the scalar variable, preference direction, and concession geometry of each domain. In particular, policies trained on one gap direction may learn directional anchoring habits that do not automatically flip in domains with the opposite bargaining geometry. Future work should study domain-adaptive, sign-aware, or multi-domain training mixtures to improve transfer of value extraction, not only agreement success. Fourth, the current evaluation is limited to agent-to-agent negotiation. This is the intended setting of the paper, but it means that our results should not be interpreted as evidence that EmoDistill improves human-perceived negotiation quality. Human studies, multi-judge evaluation, and task-specific domain expert review would be valuable for understanding how the learned emotional expressions are perceived outside autonomous agent interactions. Finally, the framework depends on LLM-judge feedback for dense turn-level annotation and JPO refinement. While this provides scalable supervision, it also introduces judge-model dependence and additional cost.

## Ethical Considerations

EmoDistill studies emotion as a strategic control channel in autonomous agent-to-agent negotiation. This framing has both defensive and dual-use implications. On the defensive side, modeling emotional influence can help make user-aligned agents less vulnerable to emotionally framed pressure, premature concession, or manipulative counterparties. The method is intended to support agents that preserve their users’ stated objectives in adversarial or high-stakes bargaining settings. At the same time, a system that learns strategic emotional expression could be misused to build more persuasive or manipulative negotiation agents. This risk is especially important in domains involving vulnerable users, financial decisions, medical access, employment, education, debt, or public services. We therefore view EmoDistill as appropriate only for bounded agent-to-agent settings with explicit task objectives, logging, evaluation, and deployment constraints. It should not be used to manipulate human users, obscure material information, or pressure people into decisions against their interests. Our experiments are conducted in synthetic negotiation scenarios between language-model agents. No real users are negotiated with, and the evaluation metrics are computed from predefined scenario objectives rather than personal data.

## References

- Let the llms talk: simulating human-to-human conversational qa via zero-shot LLM-to-LLM interactions. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp. 8–17. Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p1.2).
- A. Baheti, X. Lu, F. Brahman, R. L. Bras, M. Sap, and M. Riedl (2024) Leftover lunch: advantage-based offline reinforcement learning for language models. External Links: 2305.14718, [Link](https://arxiv.org/abs/2305.14718) Cited by: [Appendix K](https://arxiv.org/html/2605.26785#A11.p1.2), [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px3.p1.1).
- Y. Bai, S. Kadavath, S. Kundu, A. Askell, et al. (2022) Constitutional ai: harmlessness from ai feedback. arXiv preprint arXiv:2212.08073. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px3.p1.1).
- A. Chatterjee, S. Miller, and N. Parepally (2024) AgreeMate: teaching LLMs to haggle. arXiv preprint arXiv:2412.18690. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- DeepSeek\-AI, A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Guo, D\. Yang, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Zhang, H\. Ding, H\. Xin, H\. Gao, H\. Li, H\. Qu, J\. L\. Cai, J\. Liang, J\. Guo, J\. Ni, J\. Li, J\. Wang, J\. Chen, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. Song, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, L\. Zhao, L\. Wang, L\. Zhang, M\. Li, M\. Wang, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, N\. Tian, P\. Huang, P\. Wang, P\. Zhang, Q\. Wang, Q\. Zhu, Q\. Chen, Q\. Du, R\. J\. Chen, R\. L\. Jin, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. Xu, R\. Zhang, R\. Chen, S\. S\. Li, S\. Lu, S\. Zhou, S\. Chen, S\. Wu, S\. Ye, S\. Ye, S\. Ma, S\. Wang, S\. Zhou, S\. Yu, S\. Zhou, S\. Pan, T\. Wang, T\. Yun, T\. Pei, T\. Sun, W\. L\. Xiao, W\. Zeng, W\. Zhao, W\. An, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, X\. Q\. Li, X\. Jin, X\. Wang, X\. Bi, X\. Liu, X\. Wang, X\. Shen, X\. Chen, X\. Zhang, X\. Chen, X\. Nie, X\. Sun, X\. Wang, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yu, X\. Song, X\. Shan, X\. Zhou, X\. Yang, X\. Li, X\. Su, X\. Lin, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. X\. Zhu, Y\. Zhang, Y\. Xu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Li, Y\. Wang, Y\. Yu, Y\. Zheng, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Tang, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Wu, Y\. Ou, Y\. Zhu, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Zha, Y\. Xiong, Y\. Ma, Y\. Yan, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Z\. F\. Wu, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Huang, Z\. Zhang, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Gou, Z\. Ma, Z\. Yan, Z\. Shao, Z\. Xu, Z\. Wu, Z\. Zhang, Z\. Li, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Gao, and Z\. 
Pan (2025) DeepSeek-v3 technical report. External Links: 2412.19437, [Link](https://arxiv.org/abs/2412.19437) Cited by: [Table 5](https://arxiv.org/html/2605.26785#S4.T5.7.7.4.1.1.1).
- D. Demszky, D. Movshovitz-Attias, J. Ko, A. Cowen, G. Nemade, and S. Ravi (2020) GoEmotions: a dataset of fine-grained emotions. In ACL, pp. 4040–4054. Cited by: [§P.4](https://arxiv.org/html/2605.26785#A16.SS4.p1.2), [§1](https://arxiv.org/html/2605.26785#S1.p1.2).
- X. Gao, C. Zhang, Y. Chen, H. Zhang, and N. F. Chen (2025) Emo-DPO: controllable emotional speech synthesis through direct preference optimization. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p1.2).
- M. Griessmair, P. Hippmann, and J. Gettinger (2015) Emotions in E-Negotiations. In Emotion in group decision and negotiation, pp. 101–135. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- H. He, D. Chen, A. Balakrishnan, and P. Liang (2018) Decoupling strategy and generation in negotiation dialogues. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2333–2343. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px2.p1.1).
- B. A. Hu, Y. Liu, and H. Rong (2025) Trustless autonomy: understanding motivations, benefits and governance dilemma in self-sovereign decentralized ai agents. arXiv preprint arXiv:2505.09757. Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p1.2).
- Y. J. Huang and R. Hadfi (2024) How personality traits influence negotiation outcomes? a simulation based on large language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10336–10351. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- M. Kasbouya and N. Sael (2025) Emotional intelligence in large language models: fine-tuning methods, challenges, and applications. In International Conference on intelligent systems and digital applications, pp. 348–359. Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p1.2).
- H. Lee, S. Phatale, H. Mansoor, K. Lu, T. Mesnard, C. Bishop, V. Carbune, and A. Rastogi (2023) RLAIF: scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px3.p1.1).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024) Let’s verify step by step. In ICLR. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px3.p1.1).
- G. Lin, T. Tanaka, and Q. Zhao (2024) Large language model sentinel: llm agent for adversarial purification. arXiv preprint arXiv:2405.20770. Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p1.2).
- Y. Long, Y. Liu, and A. Brintrup (2026a) EQ-negotiator: dynamic emotional personas empower small language models for edge-deployable credit negotiation. External Links: 2511.03370, [Link](https://arxiv.org/abs/2511.03370) Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p2.1), [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2605.26785#S4.SS0.SSS0.Px1.p1.6).
- Y. Long, Y. Liu, L. Xu, and A. Brintrup (2025a) EmoDebt: bayesian-optimized emotional intelligence for strategic agent-to-agent debt recovery. External Links: 2503.21080, [Link](https://arxiv.org/abs/2503.21080) Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p2.1), [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- Y. Long, Y. Liu, and L. Xu (2026b) EmoMAS: emotion-aware multi-agent system for high-stakes edge-deployable negotiation with bayesian orchestration. External Links: 2604.07003, [Link](https://arxiv.org/abs/2604.07003) Cited by: [§4](https://arxiv.org/html/2605.26785#S4.SS0.SSS0.Px1.p1.6).
- Y. Long, L. Xu, L. Beckenbauer, Y. Liu, and A. Brintrup (2025b) EvoEmo: towards evolved emotional policies for adversarial llm agents in multi-turn price negotiation. External Links: 2509.04310, [Link](https://arxiv.org/abs/2509.04310) Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p2.1), [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- M. Olekalns and D. Druckman (2014) With feeling: how emotions shape negotiation. Negotiation Journal 30(4), pp. 455–478. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- OpenAI (2024) GPT-4o mini: advancing cost-efficient intelligence. Note: Accessed: 2026-05-25. Cited by: [Table 5](https://arxiv.org/html/2605.26785#S4.T5.19.19.4.1.1.1).
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. External Links: 2412.15115, [Link](https://arxiv.org/abs/2412.15115) Cited by: [§4](https://arxiv.org/html/2605.26785#S4.SS0.SSS0.Px2.p1.1).
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017) Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p2.1).
- J. Schulman (2020) Approximating KL divergence. Note: [http://joschu.net/blog/kl-approx.html](http://joschu.net/blog/kl-approx.html) Cited by: [§3.4](https://arxiv.org/html/2605.26785#S3.SS4.SSS0.Px2.p1.4).
- R. Shea, A. Kallala, X. L. Liu, M. W. Morris, and Z. Yu (2024) ACE: a LLM-based negotiation coaching system. arXiv preprint arXiv:2410.01555. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px1.p1.1).
- C. Wan, M. Labeau, and C. Clavel (2025) Emodynamix: emotional support dialogue strategy prediction by modelling mixed emotions and discourse dynamics. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 1678–1695. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px2.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025) Qwen3 technical report. External Links: 2505.09388, [Link](https://arxiv.org/abs/2505.09388) Cited by: [§4](https://arxiv.org/html/2605.26785#S4.SS0.SSS0.Px2.p1.1).
- C. Zhang, X. Shi, X. Zhang, Y. Zhu, Y. Yang, and Y. Luo (2025) Decoupledesc: enhancing emotional support generation via strategy-response decoupled preference optimization. arXiv preprint arXiv:2505.16995. Cited by: [§2](https://arxiv.org/html/2605.26785#S2.SS0.SSS0.Px2.p1.1).
- H. Zhou, S. Guo, A. Liu, Z. Yu, Z. Gong, B. Zhao, Z. Chen, M. Zhang, Y. Chen, J. Li, R. Yang, Q. Liu, X. Yu, J. Zhou, N. Wang, C. Sun, and J. Wang (2026) Memento-skills: let agents design agents. arXiv preprint arXiv:2603.18743. External Links: 2603.18743, [Link](https://arxiv.org/abs/2603.18743) Cited by: [§1](https://arxiv.org/html/2605.26785#S1.p2.1).

## Appendix A Background

This section defines the notation used by the datasets, policies, and evaluation metrics. We avoid the terms “creditor” and “debtor” in the general notation because the four domains instantiate different real-world roles.

#### Framework and method names\.

EmoDistill denotes the full offline framework: trajectory collection, judge annotation, offline training, and evaluation. In the main tables, IQL+SFT+JPO denotes the full EmoDistill policy. Its components are an IQL emotion selector, a LoRA-SFT expression-policy initializer, and a JPO expression-policy refinement stage.

#### Negotiation roles\.

The optimized agent is the *focal agent*; the other party is the *counterparty*. Domain-specific names such as creditor, debtor, patient, hospital, dispatcher, or student are used only when describing a specific dataset.

#### Scalar negotiation variable\.

Each scenario contains a scalar negotiation variable $x$, such as overdue days, rescue wait minutes, surgery wait days, or extra hours past 9 PM. We denote the counterparty’s initial anchor by $x^{\mathrm{opp}}_{0}$, the focal target by $x^{\star}_{\mathrm{agent}}$, and the final accepted agreement by $x_{\mathrm{final}}$.

#### Emotion action vocabulary\.

The training sweep uses the full emotion action vocabulary

$$\mathcal{E}=\{e_{1},\ldots,e_{|\mathcal{E}|}\},\qquad|\mathcal{E}|=28,$$

consisting of the 28 GoEmotions emotion labels. Figure [1](https://arxiv.org/html/2605.26785#S1.F1) and Appendix [B](https://arxiv.org/html/2605.26785#A2) analyze which individual emotions significantly shift CRAD outcomes, but this analysis is not used as a hard filter: all emotions remain available during training, and the IQL selector learns which emotions to upweight or suppress.

#### Offline sweep\.

For each dataset, the offline sweep is a fixed set of multi-turn trajectories. At each focal-agent turn $t$, the sweep stores the dialogue state $s_{t}$, selected emotion $e_{t}$, focal-agent utterance $u_{t}$, counterparty response, terminal outcome, and judge score $r_{t}$.

#### Policy levels\.

Emotion-selection policies choose an emotion:

$$\pi_{\phi}(e_{t}\mid s_{t}),\qquad e_{t}\in\mathcal{E}.$$

Expression policies generate the utterance:

$$\pi_{\theta}(u_{t}\mid s_{t},e_{t}).$$

The full EmoDistill policy combines these two levels: IQL selects $e_{t}$, and the SFT-initialized JPO adapter generates $u_{t}$.

## Appendix B Per-Emotion Prompting Analysis

This appendix details the controlled prompting study used to support Figure [1](https://arxiv.org/html/2605.26785#S1.F1). The goal is descriptive: we test whether individual emotion prompts significantly change CRAD negotiation behavior relative to a neutral prompt. This analysis is not used to restrict the training action space; EmoDistill uses the full $|\mathcal{E}|=28$ vocabulary, consisting of the 28 GoEmotions emotion labels.

For each emotion $e$ and each CRAD test scenario $s\in\{1,\ldots,20\}$, we run $20$ sampled negotiations. Let $r_{e,s,j}$ denote the normalized judge reward or utility from run $j$ under emotion $e$ on scenario $s$. We first compute the per-scenario mean

$$\bar{r}_{e,s}=\frac{1}{20}\sum_{j=1}^{20}r_{e,s,j},\tag{7}$$

and then the overall emotion mean

$$\hat{\mu}_{e}=\frac{1}{20}\sum_{s=1}^{20}\bar{r}_{e,s}.\tag{8}$$

Figure [1](https://arxiv.org/html/2605.26785#S1.F1) ranks emotions by $\hat{\mu}_{e}$ and plots $95\%$ confidence intervals over scenario-level means.

To test whether an emotion differs from neutral, we use paired scenario-level differences:

$$\delta_{e,s}=\bar{r}_{e,s}-\bar{r}_{\mathrm{neutral},s}.\tag{9}$$

We then apply a paired $t$-test over the $20$ scenarios:

$$t_{e}=\frac{\bar{\delta}_{e}}{s_{\delta_{e}}/\sqrt{20}},\qquad\bar{\delta}_{e}=\frac{1}{20}\sum_{s}\delta_{e,s}.\tag{10}$$

Because we test the 27 non-neutral emotions against the neutral baseline, we apply Bonferroni correction with threshold

$$\alpha_{\mathrm{Bonf}}=\frac{0.05}{27}\approx 0.00185.\tag{11}$$

This analysis identifies multiple emotions whose effects are significantly above the neutral baseline, confirming that emotional framing can systematically shift negotiation outcomes. We use this result only as motivation for emotion-conditioned policy learning. During training, all $28$ emotions remain in the action vocabulary, and the IQL selector learns which emotions to upweight or suppress from the offline sweep.
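
The paired test in Eqs. (9)-(11) can be sketched in a few lines. This is a minimal illustration, not the authors' analysis code: it assumes $s_{\delta_e}$ is the sample standard deviation of the scenario-level differences, and the scenario means below are synthetic values invented for the example.

```python
import math
import random
import statistics


def paired_t_statistic(emotion_means, neutral_means):
    """Paired t statistic over scenario-level means (Eqs. 9-10)."""
    deltas = [e - n for e, n in zip(emotion_means, neutral_means)]
    n = len(deltas)
    mean_delta = statistics.fmean(deltas)
    sd_delta = statistics.stdev(deltas)  # sample standard deviation (n - 1)
    return mean_delta / (sd_delta / math.sqrt(n))


def bonferroni_alpha(alpha=0.05, num_tests=27):
    """Per-test significance threshold (Eq. 11)."""
    return alpha / num_tests


# Synthetic scenario-level means for 20 scenarios (illustrative only).
rng = random.Random(0)
neutral = [rng.uniform(0.3, 0.5) for _ in range(20)]
emotion = [x + 0.1 + rng.gauss(0.0, 0.02) for x in neutral]

t_e = paired_t_statistic(emotion, neutral)
alpha_bonf = bonferroni_alpha()
```

The two-sided p-value would then come from a Student $t$ distribution with 19 degrees of freedom and be compared against `alpha_bonf`. A library routine such as `scipy.stats.ttest_rel` computes both the statistic and the p-value directly.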

## Appendix C Detailed Algorithm

This section gives a uniform mathematical specification for every method evaluated in the main results. Notation follows Appendix [A](https://arxiv.org/html/2605.26785#A1): $s_{t}$ is the dialogue state at focal-agent turn $t$, $e_{t}\in\mathcal{E}$ ($|\mathcal{E}|{=}28$) is the emotion token, $u_{t}$ is the focal-agent utterance, $r_{t}$ is the per-turn metric-aligned judge score, $A_{t}=(r_{t}-\mu_{\mathrm{scen}})/(\sigma_{\mathrm{scen}}+\epsilon)$ is its scenario-wise z-score, and $R(\tau)$ is the outcome-shaped trajectory return defined in Appendix [D](https://arxiv.org/html/2605.26785#A4). The evaluation metrics derived from the final agreement are defined separately in Appendix [E.5](https://arxiv.org/html/2605.26785#A5.SS5). $\pi_{\mathrm{LLM}}(u_{t}\mid s_{t},e_{t})$ denotes the frozen base utterance policy used during sweep construction; $\pi_{\theta}(u_{t}\mid s_{t},e_{t})$ denotes the LoRA-augmented Qwen2.5-7B expression policy, with $\theta$ the LoRA parameters. Emotion selectors are written $\pi_{\phi}(e_{t}\mid s_{t})$.

#### Prompt\-only controls\.

Neutral prompting uses a fixed neutral emotion at every focal-agent turn. Random-emotion prompting samples uniformly from $\mathcal{E}$. These are calibration controls rather than offline-learning methods:

$$\begin{aligned}\pi_{\mathrm{neu}}(u_{t}\mid s_{t})&=\pi_{\mathrm{LLM}}(u_{t}\mid s_{t},\,e_{t}{=}\text{neutral}),\\ \pi_{\mathrm{rnd}}(u_{t}\mid s_{t})&=\mathbb{E}_{e_{t}\sim\mathrm{Unif}(\mathcal{E})}\big[\pi_{\mathrm{LLM}}(u_{t}\mid s_{t},e_{t})\big].\end{aligned}\tag{12}$$

No parameters are learned.

#### IQL selector\.

IQL learns an offline emotion-selection policy over state–emotion pairs. It chooses the emotion to inject at each turn but does not update the utterance generator. For a trajectory $\tau$, the IQL reward is terminal:

$$\bar{r}^{\mathrm{IQL}}_{t}=\begin{cases}R(\tau),&t=T_{\tau},\\ 0,&\text{otherwise},\end{cases}\tag{13}$$

where $T_{\tau}$ is the final focal-agent turn. With expectile parameter $\tau_{\mathrm{exp}}$, IQL optimizes

$$\begin{aligned}\mathcal{L}_{V}(\psi)&=\mathbb{E}_{\mathcal{D}}\!\left[L_{2}^{\tau_{\mathrm{exp}}}\!\left(Q_{\bar{\theta}}(s,e)-V_{\psi}(s)\right)\right],\\ \mathcal{L}_{Q}(\theta)&=\mathbb{E}_{\mathcal{D}}\!\left[\left(\bar{r}^{\mathrm{IQL}}_{t}+\gamma V_{\psi}(s^{\prime})-Q_{\theta}(s,e)\right)^{2}\right],\end{aligned}\tag{14}$$

where $L_{2}^{\tau_{\mathrm{exp}}}(x)=|\tau_{\mathrm{exp}}-\mathbb{1}\{x<0\}|\,x^{2}$. The selector is extracted by advantage-weighted regression (AWR) with temperature $\beta_{\mathrm{IQL}}$:

$$\begin{aligned}\pi^{\mathrm{IQL}}_{\phi}(e\mid s)&\propto\exp\!\left(\beta_{\mathrm{IQL}}A(s,e)\right),\\ A(s,e)&=Q_{\theta}(s,e)-V_{\psi}(s).\end{aligned}\tag{15}$$

At inference, $e_{t}\sim\pi^{\mathrm{IQL}}_{\phi}(\cdot\mid s_{t})$ and $u_{t}\sim\pi_{\mathrm{LLM}}(\cdot\mid s_{t},e_{t})$. The LoRA expression adapter is not used in the IQL-only ablation.
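
To make the asymmetric expectile weight in Eq. (14) and the selector in Eq. (15) concrete, the sketch below evaluates both on scalar inputs. It is illustrative only: it reads Eq. (15) literally as a softmax over a finite emotion set, does not show the weighted-regression fit of $\pi_{\phi}$ or the Q/V networks, and the numeric values are made up.

```python
import math


def expectile_loss(diff, tau_exp):
    """L_2^tau(x) = |tau - 1{x < 0}| * x^2 for x = Q(s, e) - V(s)."""
    weight = abs(tau_exp - (1.0 if diff < 0 else 0.0))
    return weight * diff * diff


def awr_selector(q_values, v_value, beta):
    """Distribution over emotions proportional to exp(beta * (Q - V))."""
    advantages = [q - v_value for q in q_values]
    max_adv = max(advantages)  # subtract the max for numerical stability
    weights = [math.exp(beta * (a - max_adv)) for a in advantages]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical emotions with made-up Q-values at one state.
probs = awr_selector(q_values=[0.2, 0.5, -0.1], v_value=0.3, beta=3.0)
```

With $\tau_{\mathrm{exp}}>0.5$, positive residuals are weighted more heavily than negative ones, which pushes $V_{\psi}$ toward an upper expectile of $Q$ over the emotions seen in the data. A larger $\beta$ concentrates the selector on the highest-advantage emotion.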

#### LoRA\-SFT expression policy\.

SFT trains the expression policy by supervised learning on high-quality turn-level demonstrations. Let $\mathcal{D}_{\mathrm{top}}\subset\mathcal{D}$ denote the top-$25\%$ of $(s_{t},e_{t},u_{t})$ tuples ranked by the hybrid filtering score $q_{t}^{\mathrm{hyb}}$ in Appendix [D.2](https://arxiv.org/html/2605.26785#A4.SS2). The SFT objective is token-level cross-entropy under emotion conditioning:

$$\mathcal{L}_{\mathrm{SFT}}(\theta)=-\,\mathbb{E}_{(s_{t},e_{t},u_{t})\sim\mathcal{D}_{\mathrm{top}}}\Bigg[\sum_{k=1}^{|u_{t}|}\log\pi_{\theta}\!\left(u_{t,k}\mid s_{t},e_{t},u_{t,<k}\right)\Bigg].\tag{16}$$

At inference, $e_{t}\sim\pi^{\mathrm{IQL}}_{\phi}(\cdot\mid s_{t})$, then $u_{t}\sim\pi_{\theta}(\cdot\mid s_{t},e_{t})$.

#### EmoDistill\.

EmoDistill is our main method. It uses IQL for emotion selection and two-stage LoRA training for expression-policy optimization. Stage 1 is the SFT objective above. Stage 2 freezes the SFT snapshot $\pi_{\mathrm{ref}}=\pi^{\mathrm{SFT}}_{\theta}$ and applies a PPO-clipped surrogate with a K3 KL anchor:

$$\rho_{t}(\theta)=\frac{\pi_{\theta}(u_{t}\mid s_{t},e_{t})}{\pi_{\mathrm{ref}}(u_{t}\mid s_{t},e_{t})}.\tag{17}$$

JPO uses the asymmetric advantage

$$\widetilde{A}_{t}=\begin{cases}A_{t},&A_{t}>0,\\ \kappa A_{t},&A_{t}\leq 0,\end{cases}\tag{18}$$

where $\kappa\in[0,1]$ controls the weight on negative-advantage samples. The objective is

$$\mathcal{L}_{\mathrm{JPO}}(\theta)=-\,\mathbb{E}_{\mathcal{D}_{\mathrm{top}}}\!\Big[\min\!\big(\rho_{t}\widetilde{A}_{t},\;\mathrm{clip}(\rho_{t},1{-}\epsilon,1{+}\epsilon)\,\widetilde{A}_{t}\big)\Big]+\lambda_{\mathrm{KL}}\cdot\mathbb{E}\!\big[\mathrm{KL}_{\mathrm{K3}}(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}})\big].\tag{19}$$

Inference: $e_{t}\sim\pi^{\mathrm{IQL}}_{\phi}(\cdot\mid s_{t})$, then $u_{t}\sim\pi_{\theta}(\cdot\mid s_{t},e_{t})$.
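
The per-sample form of Eqs. (17)-(19) can be sketched on scalar sequence log-probabilities. This is a hedged illustration rather than the training code: the text does not spell out the K3 estimator, so the sketch assumes the common form $(r-1)-\log r$ with $r=\pi_{\mathrm{ref}}/\pi_{\theta}$, and it treats `logp_theta` and `logp_ref` as summed token log-probabilities of $u_{t}$. The defaults for `kappa`, `eps`, and `lambda_kl` follow the values reported in Appendix F.

```python
import math


def asymmetric_advantage(adv, kappa):
    """Eq. 18: keep positive advantages, scale non-positive ones by kappa."""
    return adv if adv > 0 else kappa * adv


def k3_kl(logp_theta, logp_ref):
    """Assumed K3 form: (r - 1) - log r with r = pi_ref / pi_theta."""
    log_r = logp_ref - logp_theta
    return math.exp(log_r) - 1.0 - log_r


def jpo_loss(logp_theta, logp_ref, adv, kappa=1.0, eps=0.2, lambda_kl=0.04):
    """Per-sample JPO loss: clipped surrogate plus KL anchor (Eq. 19)."""
    rho = math.exp(logp_theta - logp_ref)  # Eq. 17
    a_tilde = asymmetric_advantage(adv, kappa)
    clipped_rho = min(max(rho, 1.0 - eps), 1.0 + eps)
    surrogate = min(rho * a_tilde, clipped_rho * a_tilde)
    return -surrogate + lambda_kl * k3_kl(logp_theta, logp_ref)
```

At initialization, when $\pi_{\theta}=\pi_{\mathrm{ref}}$, the ratio is 1 and the KL term vanishes, so the loss reduces to $-\widetilde{A}_{t}$. Once $\rho_{t}$ exceeds $1+\epsilon$ on a positive-advantage sample, the clipped branch is selected and that sample stops contributing gradient through the surrogate.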

#### IQL\+JPO ablation\.

The IQL+JPO ablation removes the SFT warm start and tests direct judge-guided refinement. The same clipped objective is used, but the reference policy is the base instruction model rather than the SFT adapter. This isolates whether dense judge advantages are sufficient without imitation-based initialization.

#### Emotion\-Free EmoDistill\.

Emotion-Free EmoDistill removes the explicit emotion block and the inference-time selector. It is a diagnostic ablation that tests internalization of emotional strategy. The same two-stage objective as EmoDistill is applied to states with the emotion block stripped:

$$\mathcal{L}^{\mathrm{free}}_{\mathrm{SFT}}(\theta)=-\,\mathbb{E}_{(s_{t},e_{t},u_{t})\sim\mathcal{D}_{\mathrm{top}}}\Bigg[\sum_{k}\log\pi_{\theta}\!\left(u_{t,k}\mid s_{t},u_{t,<k}\right)\Bigg].\tag{20}$$

The free-form JPO ratio is

$$\rho_{t}^{\mathrm{free}}(\theta)=\frac{\pi_{\theta}(u_{t}\mid s_{t})}{\pi^{\mathrm{free}}_{\mathrm{ref}}(u_{t}\mid s_{t})}.\tag{21}$$

The corresponding JPO objective is

$$\mathcal{L}^{\mathrm{free}}_{\mathrm{JPO}}(\theta)=-\,\mathbb{E}\!\Bigg[\min\!\Big(\rho_{t}^{\mathrm{free}}\widetilde{A}_{t},\;\operatorname{clip}\!\left(\rho_{t}^{\mathrm{free}},1{-}\epsilon,1{+}\epsilon\right)\widetilde{A}_{t}\Big)\Bigg]+\lambda_{\mathrm{KL}}\,\mathrm{KL}_{\mathrm{K3}}\!\left(\pi_{\theta}\,\|\,\pi^{\mathrm{free}}_{\mathrm{ref}}\right).\tag{22}$$

At inference, no $e_{t}$ is sampled and no selector is consulted: $u_{t}\sim\pi_{\theta}(\cdot\mid s_{t})$. The covariate-shift consequences of this design are formalized in Appendix [H](https://arxiv.org/html/2605.26785#A8).

| Method | Updates | Training signal | Inference cond. |
| --- | --- | --- | --- |
| Vanilla | – | – | $(s_{t},\,e_{t}{=}\text{neu})$ |
| Random emotion | – | – | $(s_{t},\,e_{t}{\sim}U(\mathcal{E}))$ |
| IQL selector | Emo | terminal $R(\tau)$ | $(s_{t},\,e_{t}{\sim}\pi^{\mathrm{IQL}}_{\phi})$ |
| IQL+SFT | Emo+LoRA | top-$25\%$ hybrid BC | $(s_{t},\,e_{t}{\sim}\pi^{\mathrm{IQL}}_{\phi})$ |
| IQL+JPO | Emo+LoRA | per-turn $\widetilde{A}_{t}$ | $(s_{t},\,e_{t}{\sim}\pi^{\mathrm{IQL}}_{\phi})$ |
| EmoDistill | Emo+LoRA | top-$25\%$ hybrid BC $+$ per-turn $\widetilde{A}_{t}$ | $(s_{t},\,e_{t}{\sim}\pi^{\mathrm{IQL}}_{\phi})$ |
| Emotion-Free EmoDistill | LoRA | top-$25\%$ hybrid BC $+$ per-turn $\widetilde{A}_{t}$ | $s_{t}$ only |

Table 7: What each compared policy learns. Emo $=$ emotion selector, LoRA $=$ utterance LoRA, Tx $=$ transition matrix. “Inference cond.” is the state and emotion the utterance is conditioned on at deployment.

### C.1 Algorithmic Summary

**Algorithm 1** IQL Emotion Selector

1. Input: offline sweep $\mathcal{D}$, outcome-shaped returns $R(\tau)$, expectile $\tau_{\mathrm{exp}}$, AWR temperature $\beta$.
2. for $N$ gradient steps do
3. Update $V_{\psi}$ by expectile regression.
4. Update $Q_{\theta}$ using terminal-reward TD targets.
5. Compute $A(s,e)=Q_{\theta}(s,e)-V_{\psi}(s)$.
6. Update $\pi_{\phi}$ by advantage-weighted regression.
7. end for
8. return emotion selector $\pi_{\phi}(e\mid s)$

**Algorithm 2** SFT Expression Policy

1. Input: offline sweep $\mathcal{D}$, hybrid scores $q_{t}^{\mathrm{hyb}}$, base SLM $\pi_{0}$.
2. Rank $(s_{t},e_{t},u_{t})$ tuples by $q_{t}^{\mathrm{hyb}}$.
3. Retain the top $25\%$ as $\mathcal{D}_{\mathrm{top}}$.
4. Initialize LoRA adapter $\theta$ on $\pi_{0}$.
5. for $N$ supervised steps do
6. Sample $(s_{t},e_{t},u_{t})$ from $\mathcal{D}_{\mathrm{top}}$.
7. Minimize token-level cross-entropy on $u_{t}$.
8. end for
9. return SFT adapter $\pi_{\mathrm{SFT}}$

**Algorithm 3** Judge Policy Optimization

1. Input: sweep tuples $\{(s_{t},e_{t},u_{t},A_{t})\}$, reference policy $\pi_{\mathrm{ref}}$, clip $\epsilon$, KL coefficient $\lambda_{\mathrm{KL}}$, negative-advantage coefficient $\kappa$.
2. Initialize $\pi_{\theta}$ from the SFT adapter.
3. Freeze $\pi_{\mathrm{ref}}=\pi_{\mathrm{SFT}}$.
4. for $N$ gradient steps do
5. Sample minibatch $\{(s_{t},e_{t},u_{t},A_{t})\}\sim\mathcal{D}_{\mathrm{top}}$.
6. Compute $\widetilde{A}_{t}=A_{t}$ if $A_{t}>0$, else $\widetilde{A}_{t}=\kappa A_{t}$.
7. Compute $\rho_{t}=\pi_{\theta}(u_{t}\mid s_{t},e_{t})/\pi_{\mathrm{ref}}(u_{t}\mid s_{t},e_{t})$.
8. Compute clipped policy loss: $\mathcal{L}_{\mathrm{clip}}=-\min\left(\rho_{t}\widetilde{A}_{t},\operatorname{clip}(\rho_{t},1-\epsilon,1+\epsilon)\widetilde{A}_{t}\right)$.
9. Compute K3 KL anchor: $\mathcal{L}_{\mathrm{KL}}=\mathrm{KL}_{\mathrm{K3}}\left(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\right)$.
10. Update $\theta$ using $\mathcal{L}_{\mathrm{JPO}}=\mathcal{L}_{\mathrm{clip}}+\lambda_{\mathrm{KL}}\mathcal{L}_{\mathrm{KL}}$.
11. end for
12. return JPO adapter $\pi_{\theta}$

## Appendix D Reward Design and Training Signals

The same offline sweep provides different signals for different stages. For IQL, a signal is used as an emotion-selection reward; for SFT, it is used as a data-filtering score; for JPO, it becomes an utterance-level advantage. This section gives the full signal definitions behind the training-signal ablation in Table [2](https://arxiv.org/html/2605.26785#S3.T2). The per-turn judge prompt that produces $r_{t}$ is shown verbatim in Appendix [P.3](https://arxiv.org/html/2605.26785#A16.SS3).

### D.1 Emotion-Selection Rewards

Offline value-based selectors such as IQL attach rewards to selected emotions at each state. We compare three signal placements. The *outcome-shaped* variant uses the objective trajectory return $R(\tau)$ only at the terminal focal-agent turn and propagates it through Bellman backups. The *episode-judge* variant broadcasts one dialogue-level judge score to every focal turn. The *turn-judge* variant uses the judge score assigned to the current focal utterance. The default IQL selector uses the outcome-shaped reward so that emotion selection is tied to actual bargaining movement rather than surface persuasiveness alone.

### D.2 SFT Filtering Signal

SFT does not use the signal as a reinforcement\-learning reward\. Instead, it uses a hybrid quality score to select high\-quality turn\-level demonstrations:

$$q^{\mathrm{hyb}}_{t}=r_{t}+\tfrac{1}{2}R(\tau).\tag{23}$$

Here $r_{t}$ measures the local quality of the focal utterance under the metric-aligned judge, while $R(\tau)$ measures whether the trajectory as a whole produces favorable bargaining dynamics and terminal agreement. We retain the top $25\%$ of $(s_{t},e_{t},u_{t})$ tuples ranked by $q_{t}^{\mathrm{hyb}}$. This filter avoids imitating utterances that sound locally persuasive but occur inside globally unproductive negotiations, while also avoiding purely terminal filtering that would keep weak individual turns from successful trajectories.
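
The filtering step can be sketched as a sort on the hybrid score of Eq. (23) followed by a top-quartile cut. The record layout (dicts with `r_t` and `R_tau` keys) and the floor-based cutoff are assumptions made for this illustration; the text does not specify how ties or rounding are handled.

```python
def top_quartile_by_hybrid_score(tuples):
    """Keep the top 25% of turn-level records ranked by r_t + 0.5 * R(tau).

    `tuples` is a list of dicts with keys 'r_t' and 'R_tau' (hypothetical
    schema); other keys such as state, emotion, and utterance pass through.
    """
    scored = sorted(
        tuples,
        key=lambda d: d["r_t"] + 0.5 * d["R_tau"],
        reverse=True,
    )
    keep = max(1, len(scored) // 4)  # floor; always keep at least one record
    return scored[:keep]
```

Because $R(\tau)$ enters with weight $\tfrac{1}{2}$, a turn with a slightly lower judge score from a productive trajectory can outrank a locally stronger turn from a failed one.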

### D.3 JPO Advantage Signals

For JPO, the signal is used as an offline policy\-improvement advantage\. We compare three variants: outcome\-shaped advantage, episode\-judge advantage, and turn\-judge advantage\.

#### Outcome\-shaped advantage\.

The outcome\-shaped reward is a sum of time\-weighted per\-step shaping plus a non\-time\-weighted terminal anchor:

$$R(\tau)=\underbrace{\sum_{t=1}^{T_{\tau}}w(t)\bigl(\Delta^{\mathrm{ctp}}_{t}-\Delta^{\mathrm{foc}}_{t}\bigr)}_{\text{time-weighted step shaping}}+R^{\mathrm{term}}(\tau).\tag{24}$$

Let $g=x^{\star}_{\mathrm{agent}}-x^{\mathrm{opp}}_{0}$ be the signed anchor-to-target gap and $d=\max(1,|g|)$ its stabilized magnitude. The sign of $g$ makes the formulation valid whether the focal agent wants a smaller or larger scalar value. The normalized counterparty concession is

$$\Delta^{\mathrm{ctp}}_{t}=\mathrm{clip}\!\left(\frac{\operatorname{sgn}(g)\,(x^{\mathrm{ctp}}_{t}-x^{\mathrm{ctp}}_{t-1})}{d},-2,2\right),\tag{25}$$

which is positive when the counterparty moves toward the focal target. The normalized focal retreat is

$$\Delta^{\mathrm{foc}}_{t}=\mathrm{clip}\!\left(\frac{-\operatorname{sgn}(g)\,(x^{\mathrm{foc}}_{t}-x^{\mathrm{foc}}_{t-1})}{d},-2,2\right),\tag{26}$$

which is positive when the focal agent moves away from its own target. Thus, $\Delta^{\mathrm{ctp}}_{t}-\Delta^{\mathrm{foc}}_{t}$ rewards turns where the counterparty concedes more than the focal agent retreats.

The implicit dialogue-length penalty is the *linear time-decay weight*

$$\begin{aligned}w(t)&=\max\!\left(0,\min\!\left(1,1-\frac{t}{T_{\max}}\right)\right),\\ T_{\max}&=30.\end{aligned}\tag{27}$$

Late concessions therefore earn less step credit, providing a length penalty without an explicit additive term. The terminal anchor is held constant so late but successful closes are not double-penalized:

$$R^{\mathrm{term}}(\tau)=\begin{cases}+2,&\text{if agreement reached,}\\ -2,&\text{otherwise.}\end{cases}\tag{28}$$

Importantly, $R(\tau)$ uses no LLM-judge signal.
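
A small sketch of Eqs. (24)-(28) follows. It assumes each side's position is available as a list $x_{0},x_{1},\ldots,x_{T}$ of equal length. How those positions are extracted from the dialogue is not specified here, so the function signature and the example values are hypothetical.

```python
def clip(value, low, high):
    return max(low, min(high, value))


def sign(value):
    return (value > 0) - (value < 0)


def outcome_shaped_return(target, anchor, ctp_offers, foc_offers, agreed, t_max=30):
    """Time-weighted step shaping plus a constant terminal anchor."""
    g = target - anchor                # signed anchor-to-target gap
    d = max(1.0, abs(g))               # stabilized magnitude
    total = 0.0
    for t in range(1, len(ctp_offers)):
        # Counterparty concession: positive when moving toward the focal target.
        delta_ctp = clip(sign(g) * (ctp_offers[t] - ctp_offers[t - 1]) / d, -2.0, 2.0)
        # Focal retreat: positive when moving away from the focal target.
        delta_foc = clip(-sign(g) * (foc_offers[t] - foc_offers[t - 1]) / d, -2.0, 2.0)
        weight = max(0.0, min(1.0, 1.0 - t / t_max))  # linear time decay
        total += weight * (delta_ctp - delta_foc)
    return total + (2.0 if agreed else -2.0)


# Hypothetical credit-recovery episode: focal target 30 days, anchor 90 days.
ret = outcome_shaped_return(
    target=30, anchor=90,
    ctp_offers=[90, 60, 45], foc_offers=[30, 30, 45], agreed=True,
)
```

In this example the counterparty's first move closes half the gap at weight $29/30$, while the second turn's concession and retreat cancel, so the return is $2+0.5\cdot 29/30$.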

#### Episode\-judge advantage\.

A metric\-aligned LLM judge scores the whole dialogue, and the resulting dialogue\-level score is broadcast to all focal\-agent turns\. This signal is subjective and judge\-based, but it does not provide turn\-level credit assignment\.

#### Turn\-judge advantage\.

The judge scores each focal-agent turn independently under the metric-aligned rubric in Appendix [P.3](https://arxiv.org/html/2605.26785#A16.SS3). The rubric rewards anchoring near the focal target, concrete proposals, and scenario-grounded leverage, and penalizes capitulation, repetition, vagueness, and emotionally inappropriate concessions. Scores are normalized within each scenario:

$$A_{t}=\frac{r_{t}-\mu_{\mathrm{scenario}}}{\sigma_{\mathrm{scenario}}+\epsilon}.\tag{29}$$

The normalized score $A_{t}$ is used as the fixed offline advantage in JPO, and its asymmetric form $\widetilde{A}_{t}$ is defined in Eq. ([18](https://arxiv.org/html/2605.26785#A3.E18)).
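
The scenario-wise normalization in Eq. (29) can be sketched as below. The use of the population standard deviation and the value of `eps` are assumptions for illustration; the text fixes neither.

```python
import statistics
from collections import defaultdict


def scenario_zscores(records, eps=1e-8):
    """Normalize judge scores within each scenario.

    `records` is a list of (scenario_id, r_t) pairs; the returned list holds
    A_t in the same order as the input.
    """
    by_scenario = defaultdict(list)
    for scenario_id, score in records:
        by_scenario[scenario_id].append(score)
    stats = {
        sid: (statistics.fmean(scores), statistics.pstdev(scores))
        for sid, scores in by_scenario.items()
    }
    return [
        (score - stats[sid][0]) / (stats[sid][1] + eps)
        for sid, score in records
    ]
```

Normalizing per scenario removes differences in scenario difficulty, so a positive $A_{t}$ means the turn scored above average for that scenario rather than above average overall.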

## Appendix E Experimental Setup

### E.1 Datasets

We evaluate on four datasets from CRAD and EmoMAS: Credit Recovery, Disaster Rescue, Hospital Surgery Scheduling, and Student Sleep Scheduling. Each dataset contains $100$ scenarios split into $80$ training scenarios and $20$ held-out test scenarios.

| Dataset | N | Gap sign | Quantity |
| --- | --- | --- | --- |
| Credit Recovery (CRAD) | 100 | Target $<$ Anchor | overdue days |
| Disaster Rescue | 100 | Target $>$ Anchor | wait minutes |
| Hospital Surgery Scheduling | 100 | Target $>$ Anchor | wait days |
| Student Sleep Scheduling | 100 | Target $<$ Anchor | extra hours past 9 PM |

Table 8: Datasets used in our experiments. The Gap sign column indicates whether the focal agent’s target is lower or higher than the counterparty’s initial anchor, i.e., the sign of $x^{\star}_{\mathrm{agent}}-x^{\mathrm{opp}}_{0}$.

The sign of the anchor-to-target gap differs across domains. In Credit Recovery and Student Sleep Scheduling, the focal agent prefers a smaller value than the counterparty’s initial anchor, so $x^{\star}_{\mathrm{agent}}-x^{\mathrm{opp}}_{0}<0$. In Disaster Rescue and Hospital Surgery Scheduling, the focal agent prefers a larger value, so $x^{\star}_{\mathrm{agent}}-x^{\mathrm{opp}}_{0}>0$. The normalized savings formula in Appendix [E.5](https://arxiv.org/html/2605.26785#A5.SS5) is sign-invariant by construction and remains valid in both regimes.

### E.2 Offline Sweep Construction

For each dataset, we run $80$ training scenarios $\times$ $100$ emotion-sequence seeds. Each seed fixes a sampled sequence from the full $|\mathcal{E}|=28$ vocabulary, consisting of the 28 GoEmotions emotion labels. Thus, each domain contains $8000$ offline trajectories. The $100$ factor refers to random emotion-sequence seeds per training scenario, not model-training seeds.

Each rollout records:

- scenario identifier and dataset;
- focal-agent target $x^{\star}_{\mathrm{agent}}$;
- counterparty initial anchor $x^{\mathrm{opp}}_{0}$;
- dialogue history at each focal-agent turn;
- selected emotion $e_{t}$;
- focal-agent utterance $u_{t}$;
- counterparty response;
- terminal outcome, including final agreement, success, and rounds;
- per-turn LLM-judge score $r_{t}$.

All sweep-generation calls use Qwen3.5-Plus through DashScope. We use a $6$-key round-robin setup to reduce rate-limit bottlenecks. Each negotiation is capped at $30$ turns. The same sweep is reused for IQL, SFT, and JPO.

### E.3 Evaluation Metrics

Each method is evaluated on the same $20$ held-out scenarios per dataset. We report success rate, Outcomes, Utility, and mean dialogue rounds.

### E.4 Success Rate

A mediator classifies each dialogue as accepted, breakdown, or ongoing. Success rate is the fraction of dialogues that reach a valid accepted agreement satisfying the task-specific criterion. Dialogues classified as breakdown or ongoing are counted as unsuccessful.

### E.5 Normalized Savings

Let $x^{\mathrm{opp}}_{0}$ be the counterparty’s initial anchor, $x^{\star}_{\mathrm{agent}}$ be the focal target, and $x_{\mathrm{final}}$ be the final accepted agreement. For successful negotiations, normalized savings is the fraction of the anchor-to-target distance closed by the final agreement:

$$\textsc{Sav}=\frac{x_{\mathrm{final}}-x^{\mathrm{opp}}_{0}}{x^{\star}_{\mathrm{agent}}-x^{\mathrm{opp}}_{0}}.\tag{30}$$

The metric is sign-invariant: if the focal target is lower than the anchor, both numerator and denominator are negative for progress toward the target; if the focal target is higher, both are positive. Equivalently,

$$\textsc{Sav}=\frac{\left|x_{\mathrm{final}}-x^{\mathrm{opp}}_{0}\right|}{\left|x^{\star}_{\mathrm{agent}}-x^{\mathrm{opp}}_{0}\right|}\tag{31}$$

whenever the final agreement lies between the anchor and target. We clip values outside $[0,1]$ for aggregation.

### E.6 Outcomes and Utility

Outcomes averages $\textsc{Sav}$ over successful negotiations only. Utility averages over all $20$ held-out scenarios and assigns zero utility to failed negotiations:

$$u_{i}=\begin{cases}\textsc{Sav}_{i},&\text{if episode }i\text{ succeeds},\\ 0,&\text{otherwise}.\end{cases}\tag{32}$$

Utility is the stricter main metric because it captures both agreement quality and failure risk.
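
The relationship between normalized savings, Outcomes, and Utility (Eqs. 30 and 32) can be sketched as follows. The episode tuple layout is a hypothetical schema chosen for the example, and the sketch assumes the target differs from the anchor.

```python
def normalized_savings(x_final, anchor, target):
    """Fraction of the anchor-to-target gap closed (Eq. 30), clipped to [0, 1]."""
    sav = (x_final - anchor) / (target - anchor)
    return max(0.0, min(1.0, sav))


def outcomes_and_utility(episodes):
    """Return (Outcomes, Utility) for a list of held-out episodes.

    Each episode is (succeeded, x_final, anchor, target). Outcomes averages
    savings over successes only; Utility averages over all episodes with
    failures counted as zero (Eq. 32).
    """
    savings = [normalized_savings(x, a, t) for ok, x, a, t in episodes if ok]
    outcomes = sum(savings) / len(savings) if savings else None
    utility = sum(savings) / len(episodes)
    return outcomes, utility


# Two successes (one per gap sign) and two failures.
episodes = [
    (True, 45, 90, 30),    # target below anchor
    (True, 50, 20, 60),    # target above anchor
    (False, None, 90, 30),
    (False, None, 20, 60),
]
outcomes, utility = outcomes_and_utility(episodes)
```

Both successful episodes close three quarters of the gap, so Outcomes is 0.75, while Utility is halved to 0.375 by the two failures. This is the sense in which Utility captures failure risk in addition to agreement quality.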

### E.7 Mean Dialogue Rounds

Mean dialogue rounds measures negotiation efficiency\. It should be interpreted together with Utility, since a policy can reduce rounds by accepting too early\.

## Appendix F Model, Training Setup, and Hyperparameters

#### Backbones\.

All live LLM calls for sweep construction and evaluation use Qwen3\.5\-Plus\. The trainable expression policy uses Qwen2\.5\-7B\-Instruct with LoRA adapters\.

#### LoRA configuration\.

We apply LoRA adapters to `q_proj`, `k_proj`, `v_proj`, and `o_proj`. Unless otherwise stated, we use rank $16$. We sweep ranks $\{4,16,64\}$ and find that rank $16$ gives the best validation tradeoff.

#### SFT\.

The SFT ablation behavior\-clones high\-quality focal\-agent utterances from the offline sweep\. It is implemented with LoRA, but we refer to the method as SFT because LoRA is the parameter\-efficient fine\-tuning mechanism\.

#### JPO\.

JPO starts from the SFT adapter and optimizes the PPO-clipped offline objective from Section [3.4](https://arxiv.org/html/2605.26785#S3.SS4). The reference policy is the frozen SFT adapter. We use PPO clipping parameter $\varepsilon=0.2$. We sweep the JPO KL coefficient $\lambda_{\mathrm{KL}}\in\{0.01,0.04,0.1,0.5\}$ and use $\lambda_{\mathrm{KL}}=0.04$. For the asymmetric advantage in Eq. ([18](https://arxiv.org/html/2605.26785#A3.E18)), we set $\kappa=1$ for training and inference, except in the ablation study (see Table [6](https://arxiv.org/html/2605.26785#S4.T6)), where we sweep $\kappa$ over $\{0,0.25,0.5,0.75,1\}$.

#### IQL\.

We tune expectile $\tau_{\mathrm{exp}}\in\{0.7,0.8,0.9\}$, AWR temperature $\beta\in\{1,3,10\}$, and discount $\gamma\in\{0.95,0.99\}$.

## Appendix G Why Offline RL

Online policy optimization would require fresh multi\-turn negotiations and fresh LLM\-judge calls at every gradient step\. Offline JPO avoids this by reusing pre\-judged tuples from the fixed sweep\. This section provides the cost decomposition and explains why we do not run pure online PPO, GRPO, or DAgger\-style distillation\.

#### Offline JPO cost\.

A single offline JPO step samples pre-judged tuples $(s_{t},e_{t},u_{t},A_{t})$, forwards them through Qwen2.5-7B with LoRA, computes the clipped surrogate and KL anchor, and backpropagates. On a single A100, this costs approximately $5$ seconds per optimizer step. The expensive LLM interactions and judge annotations are paid once during sweep construction and then reused across SFT, JPO, and IQL.

#### Pure on\-policy PPO or GRPO\.

A pure on-policy variant would require new negotiations at every gradient step. For batch size $b=16$, each step would need $16$ fresh negotiations with the current focal-agent policy, counterparty calls, judge calls for every turn, reward normalization, and then the same gradient update. In our sweep, a negotiation takes approximately $75$ seconds wall-clock, and judge scoring adds approximately $10$ seconds per turn for an average of $18$ turns. With concurrency $c=6$, the approximate per-step cost is

$$T_{\mathrm{onpol}}^{\mathrm{step}}\;\approx\;\frac{16\times(75+180)}{6}\;\approx\;680\,\mathrm{s}.\tag{33}$$

This is roughly $130\times$ slower than offline JPO. For a $2000$-step run, this would become roughly $15$ GPU-days of API-limited wall-clock time per dataset before rate limits become the bottleneck. GRPO removes the value network but does not remove the need for fresh group rollouts and judge calls, so its wall-clock bottleneck is similar in our setting.
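
The arithmetic behind Eq. (33) can be re-derived directly from the quantities stated above. The variable names are illustrative; all numbers come from the text.

```python
batch_size = 16          # negotiations per gradient step
negotiation_s = 75       # wall-clock seconds per negotiation
judge_s_per_turn = 10    # judge scoring seconds per turn
avg_turns = 18           # average turns per negotiation
concurrency = 6          # parallel API keys
offline_step_s = 5       # offline JPO seconds per optimizer step

# Per-step cost of a pure on-policy update (Eq. 33).
onpolicy_step_s = (
    batch_size * (negotiation_s + judge_s_per_turn * avg_turns) / concurrency
)
slowdown = onpolicy_step_s / offline_step_s
run_days = 2000 * onpolicy_step_s / 86400  # days for a 2000-step run
```

Under these figures the per-step cost is 680 s, the slowdown is 136 (the "roughly $130\times$" above), and a 2000-step run takes a little under 16 days of wall-clock time.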

#### Stability considerations\.

Pure online optimization would also be less stable because the model would optimize directly against a noisy $1$–$10$ judge signal without the strong SFT initialization and KL anchor used by JPO. Offline JPO is appropriate here because the per-turn judge score can be computed once for each observed tuple and reused across optimization steps. The main risk is covariate shift between the teacher-induced sweep distribution and the student-induced deployment distribution. We mitigate this risk with SFT initialization, a tight KL anchor, and prompt-conditional deployment; Emotion-Free EmoDistill intentionally removes this conditioning.

## Appendix H Emotion-Free EmoDistill and Covariate Shift

Emotion-conditional EmoDistill keeps the IQL selector at inference: the selector chooses an emotion, the prompt includes the corresponding emotional approach, and the JPO-trained expression adapter generates the focal-agent utterance. Emotion-free EmoDistill removes both the emotion block and the inference-time selector. It tests whether the LoRA adapter can internalize emotional strategy without explicit emotion conditioning. The source of covariate shift is that the offline utterances were generated under emotion conditioning,

$$u_{t}\sim\pi_{\mathrm{LLM}}(u_{t}\mid s_{t},e_{t}),$$

while emotion-free training and inference condition only on the stripped state,

$$u_{t}\sim\pi_{\theta}^{\mathrm{free}}(u_{t}\mid s_{t}).$$

Thus, the student must imitate outputs whose causal emotional condition is hidden. Equivalently, the emotion-free target is an emotion-marginalized teacher:

$$\pi_{\beta}(u_{t}\mid s_{t})=\sum_{e_{t}\in\mathcal{E}}\pi_{\mathrm{emo}}(e_{t}\mid s_{t})\,\pi_{\mathrm{LLM}}(u_{t}\mid s_{t},e_{t}).$$

If multiple emotions produce distinct high-quality utterances for similar states, the emotion-free student is forced to average over hidden modes.

This multimodality is strong in practice. On CRAD, replaying recorded focal-agent turns through the IQL selector shows that only $136/2104$ turns ($6.5\%$) match the greedy IQL emotion. Among the top-$25\%$ judge-filtered turns used for emotion-free SFT, only $25/526$ turns ($4.8\%$) match. Therefore, the emotion-free student cannot recover the intended emotional mode from the dialogue state alone. This explains the empirical pattern in Table [3](https://arxiv.org/html/2605.26785#S4.T3): emotion-free training can produce strong agreements when successful, but its reliability drops because the explicit emotion control variable is removed. Emotion-conditional EmoDistill remains the default deployment setting, while emotion-free EmoDistill is best interpreted as a diagnostic ablation for testing whether emotional strategy can be internalized without an explicit emotion channel.

## Appendix I Bootstrap Confidence Intervals for Outcomes

This appendix reports 95% bootstrap confidence intervals for *Outcomes* in the in-domain experiments, the CRAD cross-counterparty transfer setting, and the CRAD training-signal ablation, as shown in Tables [9](https://arxiv.org/html/2605.26785#A9.T9), [10](https://arxiv.org/html/2605.26785#A9.T10), and [11](https://arxiv.org/html/2605.26785#A9.T11). Outcomes are computed as the average role-neutral signed savings over successful negotiations.

#### Procedure\.

For each method–condition cell, we first collect the successful held-out episodes,

$$\mathcal{S} = \{\, i : \text{episode } i \text{ succeeds} \,\}.$$

For each successful episode $i \in \mathcal{S}$, we compute

$$o_i = \textsc{Sav}_i,$$

where $\textsc{Sav}_i$ is the role-neutral signed savings defined in Eq. ([31](https://arxiv.org/html/2605.26785#A5.E31)) and clipped to $[0,1]$. Let $N_{\mathrm{succ}} = |\mathcal{S}|$ be the number of successful episodes for that cell. We draw $B = 10{,}000$ percentile-bootstrap resamples of size $N_{\mathrm{succ}}$ with replacement from $\{o_i : i \in \mathcal{S}\}$. For each resample, we compute the mean Outcome score. The reported confidence interval is given by the 2.5% and 97.5% percentiles of the bootstrap distribution. If a method has no successful episodes in a cell, the Outcomes confidence interval is undefined and marked as N/A.
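The procedure above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' script: the function names, the fixed seed, the linear-interpolation percentile convention, and the example `sav` values are all assumptions made for the sketch.

```python
import random
import statistics


def percentile(sorted_vals, q):
    """Linear-interpolation percentile for q in [0, 100] on a sorted list."""
    if not sorted_vals:
        raise ValueError("empty sample")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the mean of successful-episode outcomes.

    Returns None when there are no successful episodes (reported as N/A).
    """
    if not outcomes:
        return None
    rng = random.Random(seed)
    n = len(outcomes)
    # Each resample has size N_succ and is drawn with replacement.
    means = sorted(
        statistics.fmean(rng.choices(outcomes, k=n)) for _ in range(n_boot)
    )
    lower = percentile(means, 100 * alpha / 2)
    upper = percentile(means, 100 * (1 - alpha / 2))
    return lower, upper


# Hypothetical Sav_i values for one cell, already clipped to [0, 1].
sav = [0.9, 0.4, 1.0, 0.0, 0.75, 0.6, 0.85, 0.3]
ci = bootstrap_ci(sav, n_boot=2_000)
print(ci)
```

Because only successful episodes enter `outcomes`, a cell with few successes yields a wide interval regardless of how high its mean is, which is why the intervals should be read alongside the success rates.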

#### Interpreting the intervals\.

The confidence intervals quantify uncertainty in successful\-deal quality, conditional on the negotiation reaching agreement\. They should therefore be interpreted together with the corresponding success rates\. A method may have a high Outcome score with a wide interval if only a small or heterogeneous set of successful episodes contributes to the estimate\. Conversely, high success with a narrow Outcome interval indicates more stable agreement quality across held\-out scenarios\. Because Outcomes are computed only over successful episodes, these intervals measure agreement quality rather than agreement probability\. They are therefore intended as a complementary uncertainty analysis for successful negotiation outcomes\.

| Dataset | Method | Outcomes (%) | 95% CI |
|---|---|---|---|
| CRAD | Vanilla (LLM) | 10.0 | [0.0, 23.4] |
| CRAD | Vanilla (SLM) | 35.2 | [11.1, 59.3] |
| CRAD | Random | 47.7 | [31.7, 54.2] |
| CRAD | IQL | 66.9 | [52.1, 80.3] |
| CRAD | IQL+SFT | 69.8 | [56.5, 83.1] |
| CRAD | IQL+JPO | 54.4 | [39.7, 69.2] |
| CRAD | IQL+SFT+JPO | **80.2** | **[66.4, 94.0]** |
| Disaster | Vanilla (LLM) | 15.0 | [0.0, 30.6] |
| Disaster | Vanilla (SLM) | 50.5 | [29.8, 71.3] |
| Disaster | Random | 10.0 | [0.0, 23.1] |
| Disaster | IQL | 5.0 | [0.0, 14.6] |
| Disaster | IQL+SFT | 15.0 | [0.0, 30.6] |
| Disaster | IQL+JPO | **44.4** | **[21.5, 67.4]** |
| Disaster | IQL+SFT+JPO | 30.0 | [9.9, 50.1] |
| Hospital | Vanilla (LLM) | 35.0 | [14.1, 55.9] |
| Hospital | Vanilla (SLM) | 44.8 | [23.9, 65.7] |
| Hospital | Random | 30.0 | [9.9, 50.1] |
| Hospital | IQL | 0.0 | [0.0, 0.0] |
| Hospital | IQL+SFT | 21.1 | [2.7, 39.4] |
| Hospital | IQL+JPO | 35.0 | [14.1, 55.9] |
| Hospital | IQL+SFT+JPO | **45.0** | **[23.2, 66.8]** |
| Student | Vanilla (LLM) | 45.9 | [36.5, 55.3] |
| Student | Vanilla (SLM) | 15.0 | [1.5, 28.5] |
| Student | Random | 43.8 | [30.4, 57.2] |
| Student | IQL | 47.9 | [34.5, 61.3] |
| Student | IQL+SFT | 51.7 | [42.6, 60.8] |
| Student | IQL+JPO | 23.5 | [8.6, 38.4] |
| Student | IQL+SFT+JPO | **52.6** | **[40.9, 64.3]** |

Table 9: Bootstrap 95% confidence intervals for successful-case *Outcomes* in the in-domain experiments. Outcomes are computed over successful held-out episodes only. For each method–dataset cell, we resample successful-episode $\textsc{Sav}_i$ values with replacement using 10,000 percentile-bootstrap resamples and report the 2.5% and 97.5% percentiles. EmoDistill denotes the full IQL+SFT+JPO pipeline.

| Counterparty | Method | Outcomes (%) | 95% CI |
|---|---|---|---|
| DeepSeek-V3 | Vanilla (LLM) | 55.6 | [40.0, 71.2] |
| DeepSeek-V3 | IQL | 58.2 | [45.4, 71.0] |
| DeepSeek-V3 | IQL+SFT | 73.3 | [66.6, 80.0] |
| DeepSeek-V3 | IQL+SFT+JPO | **81.5** | **[75.5, 87.5]** |
| GPT-4o-mini | Vanilla (LLM) | 43.7 | [20.5, 66.9] |
| GPT-4o-mini | IQL | 38.8 | [19.9, 57.7] |
| GPT-4o-mini | IQL+SFT | 62.4 | [48.8, 76.0] |
| GPT-4o-mini | IQL+SFT+JPO | **68.3** | **[52.6, 84.0]** |
| Qwen2.5-3B-Instruct | Vanilla (LLM) | 60.1 | [17.6, 100.0] |
| Qwen2.5-3B-Instruct | IQL | **73.0** | **[55.2, 90.8]** |
| Qwen2.5-3B-Instruct | IQL+SFT | 57.4 | [35.5, 89.3] |
| Qwen2.5-3B-Instruct | IQL+SFT+JPO | 53.1 | [29.1, 97.1] |

Table 10: Bootstrap 95% confidence intervals for the *Outcomes* column of Table [5](https://arxiv.org/html/2605.26785#S4.T5) on CRAD cross-counterparty transfer. Outcomes are computed over successful negotiations only. For each method–counterparty cell, we resample successful-episode $\textsc{Sav}_i$ values with replacement using 10,000 percentile-bootstrap resamples and report the 2.5% and 97.5% percentiles. Narrower intervals indicate more stable successful-deal quality across held-out scenarios.

| Signal | Outcomes (%) | 95% CI |
|---|---|---|
| SFT: hybrid quality filter | 69.8 | [55.6, 84.0] |
| SFT: episode-judge filter | 55.1 | [38.5, 71.7] |
| SFT: turn-judge filter | 64.8 | [49.4, 80.2] |
| JPO: outcome-shaped advantage | 64.1 | [49.7, 78.5] |
| JPO: episode-judge advantage | 77.4 | [64.5, 90.3] |
| JPO: turn-judge advantage | **80.2** | **[65.1, 95.3]** |

Table 11: Bootstrap 95% confidence intervals for the *Outcomes* column of Table [2](https://arxiv.org/html/2605.26785#S3.T2) on the CRAD training-signal ablation. Outcomes are computed over successful negotiations only. For each method–signal cell, we resample successful-episode $\textsc{Sav}_i$ values with replacement using 10,000 percentile-bootstrap resamples and report the 2.5% and 97.5% percentiles. These intervals quantify uncertainty in successful-deal quality for different training signals.

## Appendix J Emotion-Free Distillation Across Domains

Table [12](https://arxiv.org/html/2605.26785#A10.T12) shows that emotion-free distillation is not simply a failed version of EmoDistill; rather, its behavior is strongly domain- and stage-dependent. Removing the explicit emotion channel can still allow the LoRA adapter to absorb useful bargaining behavior from high-quality offline trajectories, yet the best emotion-free training stage differs across domains. On CRAD, emotion-free SFT substantially improves over Vanilla (SLM), increasing Utility from 8.8 to 55.0 with a success rate of 90.0%. Adding JPO increases successful-case Outcomes to 76.7, the highest among the CRAD emotion-free variants, but success drops to 50.0%, reducing Utility to 38.4. The pattern is different on Disaster. Here, SFT alone does not help: Utility falls from the Vanilla SLM baseline of 37.9 to 10.0. However, SFT+JPO recovers strongly, reaching 100.0% success and the highest Utility, 64.2.

Hospital is the weakest setting for emotion-free distillation. Vanilla (SLM) remains the best method by Utility, with 40.3, while SFT and SFT+JPO underperform substantially. Direct JPO without SFT reaches a similar Utility of 40.0, but with lower success and much longer negotiations. On Student, emotion-free SFT again provides a strong generic bargaining policy, improving Utility from 15.0 to 47.6. SFT+JPO obtains the highest successful-case Outcomes, 51.0, but its lower success rate reduces Utility to 45.9, slightly below SFT. Overall, emotion-free training can internalize useful negotiation behavior, and the best emotion-free variant improves over Vanilla (SLM) on three of four domains. However, the gains are not uniform: SFT is strongest on CRAD and Student, SFT+JPO is strongest on Disaster, and no learned emotion-free variant clearly improves over Vanilla on Hospital. This reinforces the role of the explicit emotion channel in EmoDistill.

Table 12: Prompt-free ablation across four datasets (expansion of Table [3](https://arxiv.org/html/2605.26785#S4.T3)). Setup: focal/creditor = Qwen2.5-7B + LoRA (SLM), counterparty/debtor = Qwen3.5-Plus (LLM). All methods remove the emotion block from both training and inference. Disaster and Hospital savings use the sign-invariant metric (Appendix [E.5](https://arxiv.org/html/2605.26785#A5.SS5)). Success ↑; Outcomes = mean ± std over successful episodes; Utility = mean ± std over all 20 episodes (failures = 0); Rounds ↓. Best Utility per dataset is in **bold**; best successful-case Outcomes are also bolded when discussed.

| Dataset | Method | Success (%) ↑ | Outcomes (%) ↑ | Utility (%) ↑ | Rounds ↓ |
|---|---|---|---|---|---|
| CRAD (Debt) | Vanilla (SLM, baseline) | 25.0 | 35.3 ± 26.1 | 8.8 ± 20.1 | 11.2 ± 5.9 |
| CRAD (Debt) | SFT (no emotion) | 90.0 | 61.1 ± 28.8 | **55.0 ± 32.5** | 13.7 ± 10.2 |
| CRAD (Debt) | JPO (no emotion, no SFT) | 40.0 | 28.7 ± 30.6 | 11.5 ± 23.4 | 24.2 ± 9.6 |
| CRAD (Debt) | SFT+JPO (ours, no emo) | 50.0 | **76.7 ± 32.2** | 38.4 ± 46.2 | 23.6 ± 9.1 |
| Disaster (Rescue) | Vanilla (SLM, baseline) | 75.0 | 50.5 ± 41.0 | 37.9 ± 41.7 | 9.9 ± 7.2 |
| Disaster (Rescue) | SFT (no emotion) | 95.0 | 10.5 ± 31.5 | 10.0 ± 30.8 | 7.5 ± 7.7 |
| Disaster (Rescue) | JPO (no emotion, no SFT) | 70.0 | 21.4 ± 42.6 | 15.0 ± 36.6 | 15.9 ± 11.6 |
| Disaster (Rescue) | SFT+JPO (ours, no emo) | 100.0 | **64.2 ± 47.2** | **64.2 ± 47.2** | 4.2 ± 2.0 |
| Hospital (Medical) | Vanilla (SLM, baseline) | 90.0 | **44.7 ± 45.3** | **40.3 ± 45.0** | 4.3 ± 2.8 |
| Hospital (Medical) | SFT (no emotion) | 100.0 | 10.0 ± 30.8 | 10.0 ± 30.8 | 3.7 ± 1.6 |
| Hospital (Medical) | JPO (no emotion, no SFT) | 85.0 | 47.1 ± 51.4 | 40.0 ± 50.3 | 9.9 ± 11.1 |
| Hospital (Medical) | SFT+JPO (ours, no emo) | 100.0 | 23.7 ± 35.5 | 23.7 ± 35.5 | 3.4 ± 1.9 |
| Student (Sleep) | Vanilla (SLM, baseline) | 100.0 | 15.0 ± 30.7 | 15.0 ± 30.7 | 3.4 ± 1.5 |
| Student (Sleep) | SFT (no emotion) | 95.0 | 50.1 ± 22.5 | 47.6 ± 23.4 | 3.2 ± 2.3 |
| Student (Sleep) | JPO (no emotion, no SFT) | 90.0 | 2.6 ± 7.5 | 2.2 ± 7.0 | 9.3 ± 10.1 |
| Student (Sleep) | SFT+JPO (ours, no emo) | 90.0 | **51.0 ± 22.8** | **45.9 ± 26.2** | 2.2 ± 0.7 |
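Since the Table 12 caption defines Utility as the mean over all episodes with failures scored as 0, the reported mean Utility should equal the success rate times the successful-case mean Outcome. The sketch below checks that relationship on the CRAD rows; the function names and the per-episode tuple layout in `summarize` are assumptions for illustration, not the authors' evaluation code.

```python
def utility_from_summary(success_pct, outcomes_pct):
    """Mean Utility (%) when failures score 0: success rate x successful-case mean."""
    return success_pct / 100.0 * outcomes_pct


def summarize(episodes):
    """episodes: list of (succeeded, savings_pct) pairs.

    Returns (success %, mean Outcomes %, mean Utility %).
    """
    wins = [savings for ok, savings in episodes if ok]
    success = 100.0 * len(wins) / len(episodes)
    outcomes = sum(wins) / len(wins) if wins else float("nan")
    utility = sum(wins) / len(episodes)  # failed episodes contribute 0
    return success, outcomes, utility


# Means reported in the CRAD rows of Table 12.
sft_utility = utility_from_summary(90.0, 61.1)       # SFT (no emotion)
sft_jpo_utility = utility_from_summary(50.0, 76.7)   # SFT+JPO (ours, no emo)
print(sft_utility, sft_jpo_utility)

# Hypothetical 4-episode cell: two successes, two failures.
print(summarize([(True, 80.0), (False, 0.0), (True, 40.0), (False, 0.0)]))
```

Both CRAD values agree with the table to within rounding, which makes the tradeoff discussed above explicit: SFT+JPO has the higher successful-case Outcomes, but halving the success rate more than cancels that gain in Utility.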

Table 13: Trained-vs-trained tournament on CRAD. V denotes Vanilla, I denotes IQL, E denotes IQL+SFT+JPO, and E0 denotes the condition-free variant. Each cell reports Success rate (%) / Utility, with failures counted as 0. Best Utility per counterparty is in **bold**.

| F \ C | V | I | E | E0 |
|---|---|---|---|---|
| V | 5.0 / 3.2 ± 13.7 | 5.0 / 3.2 ± 13.7 | 5.0 / 5.0 ± 21.8 | 15.0 / 11.1 ± 27.3 |
| I | 5.0 / 3.2 ± 13.7 | 5.0 / 3.2 ± 13.7 | 5.0 / 5.0 ± 21.8 | 15.0 / 11.1 ± 27.3 |
| E | 10.0 / **9.7 ± 29.0** | 10.0 / **9.7 ± 29.0** | 15.0 / **14.4 ± 34.2** | 5.0 / 4.7 ± 20.4 |
| E0 | 5.0 / 5.0 ± 21.8 | 5.0 / 5.0 ± 21.8 | 15.0 / **14.4 ± 34.0** | 15.0 / **15.0 ± 35.7** |

Table 14: A-LoL vs. JPO refinement across four datasets. Both methods use the same IQL emotion selector and LoRA-SFT initialization; A-LoL applies positive-advantage refinement, while JPO uses judge-guided clipped refinement. Utility counts failures as 0. Best value per dataset is in **bold**, except Success.

| Dataset | Method | Success (%) ↑ | Outcomes (%) ↑ | Utility (%) ↑ | Rounds ↓ |
|---|---|---|---|---|---|
| CRAD | A-LoL | 100.0 | 77.5 ± 28.3 | **77.5 ± 28.3** | **8.7 ± 4.6** |
| CRAD | JPO | 90.0 | **80.2 ± 30.3** | 72.2 ± 37.5 | 15.0 ± 9.8 |
| Disaster | A-LoL | 100.0 | 10.0 ± 30.8 | 10.0 ± 30.8 | **4.4 ± 2.4** |
| Disaster | JPO | 100.0 | **30.0 ± 45.8** | **30.0 ± 45.8** | 6.5 ± 4.0 |
| Hospital | A-LoL | 95.0 | 36.8 ± 49.6 | 35.0 ± 48.9 | **5.0 ± 2.9** |
| Hospital | JPO | 100.0 | **45.0 ± 49.7** | **45.0 ± 49.7** | 5.5 ± 3.4 |
| Student | A-LoL | 100.0 | 47.2 ± 25.7 | 47.2 ± 25.7 | 3.5 ± 2.6 |
| Student | JPO | 100.0 | **52.6 ± 26.6** | **52.6 ± 26.6** | **3.1 ± 2.9** |

## Appendix K Positive-Advantage vs. Turn-Level Judge Refinement

A-LoL (Baheti et al., [2024](https://arxiv.org/html/2605.26785#bib.bib50)) is a sequence-level offline RL baseline for language models. It first obtains an SFT reference policy, estimates an advantage for each prompt–response pair, and then applies advantage-weighted negative log-likelihood on positive-advantage examples. We include A-LoL as a refinement baseline after the same IQL selector and LoRA-SFT initialization used by EmoDistill. This comparison isolates the effect of the JPO update: A-LoL conservatively amplifies high-advantage SFT behavior, while JPO can use both positive- and negative-advantage utterances through a clipped objective and KL anchor. Moreover, JPO exposes a controllable risk parameter $\kappa$ that scales the pressure from negative-advantage samples. As shown in Table [6](https://arxiv.org/html/2605.26785#S4.T6), an intermediate setting, $\kappa = 0.5$, achieves higher Utility than both canonical JPO and A-LoL on CRAD, suggesting that controlled negative-sample pressure can improve the success–value tradeoff. The direct A-LoL/JPO comparison is shown in Table [14](https://arxiv.org/html/2605.26785#A10.T14).
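The qualitative difference between the two refinement styles can be sketched on scalar log-probabilities. This is a deliberately simplified illustration and not the paper's JPO objective or the reference A-LoL implementation: the PPO-style clipping form, the choice to take the importance ratio against the reference policy, the clip width `eps=0.2`, and the way `kappa` multiplies negative advantages are all assumptions, and the KL anchor is omitted.

```python
import math


def alol_style_loss(logp, adv):
    """Advantage-weighted NLL restricted to positive-advantage samples."""
    kept = [(lp, a) for lp, a in zip(logp, adv) if a > 0]
    if not kept:
        return 0.0
    return -sum(a * lp for lp, a in kept) / len(kept)


def clipped_style_loss(logp, logp_ref, adv, kappa=1.0, eps=0.2):
    """Hypothetical clipped surrogate that also uses negative advantages.

    kappa scales negative advantages (assumed form); eps is an assumed
    clip width. The KL anchor mentioned in the text is left out.
    """
    total = 0.0
    for lp, lr, a in zip(logp, logp_ref, adv):
        a_eff = a if a > 0 else kappa * a
        rho = math.exp(lp - lr)                      # importance ratio
        rho_clip = min(max(rho, 1.0 - eps), 1.0 + eps)
        total += -min(rho * a_eff, rho_clip * a_eff)
    return total / len(logp)


# One good and one bad utterance, with the policy still at the reference.
logp = [-1.0, -2.0]
logp_ref = [-1.0, -2.0]
adv = [1.0, -1.0]

print(alol_style_loss(logp, adv))                          # ignores the bad sample
print(clipped_style_loss(logp, logp_ref, adv, kappa=0.5))  # bad sample at half weight
```

Under these assumptions, setting `kappa=0` makes negative-advantage samples contribute nothing, while larger values push the policy away from low-scoring utterances more strongly.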

## Appendix L Training Stability Analysis

This appendix reports optimization stability diagnostics for the three learned components in EmoDistill: the IQL emotion selector, the LoRA-SFT initializer, and the JPO refiner. These diagnostics do not prove global convergence; they check whether each stage remains numerically stable under fixed offline training. For each logged quantity $\{\ell_t\}_{t=1}^{T}$, we summarize the final 25% of training by its median $\widetilde{\ell}$ and robust dispersion $\mathrm{MAD} = \mathrm{median}\,|\ell_t - \widetilde{\ell}|$. A stable run should show bounded late-stage variation and no uncontrolled drift.
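The late-stage summary described above is straightforward to compute from a logged curve. The following is a minimal sketch; the function name, the way the final-quarter window is indexed, and the example values are assumptions rather than the authors' logging code.

```python
import statistics


def late_stage_summary(values, tail_frac=0.25):
    """Median and MAD over the final tail_frac of a logged training curve."""
    if not values:
        raise ValueError("no logged values")
    start = int(len(values) * (1 - tail_frac))
    tail = values[start:] or values[-1:]
    med = statistics.median(tail)
    # MAD = median of absolute deviations from the tail median.
    mad = statistics.median(abs(v - med) for v in tail)
    return med, mad


# Hypothetical loss curve: fast early decrease, then a flat plateau.
curve = [2.0, 1.5, 1.0, 0.75, 0.75, 0.75, 0.5, 1.0]
print(late_stage_summary(curve))
```

Using the median and MAD instead of the mean and standard deviation keeps the summary insensitive to isolated late spikes, which matters when the question is whether the curve drifts rather than whether it is perfectly smooth.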

#### IQL selector\.

For the offline emotion selector, we track the V-network expectile loss $\mathcal{L}_V$, Q-network TD loss $\mathcal{L}_Q$, and AWR policy loss $\mathcal{L}_\pi$ (Eqs. ([14](https://arxiv.org/html/2605.26785#A3.E14))–([15](https://arxiv.org/html/2605.26785#A3.E15))). On CRAD, $\mathcal{L}_Q$ decreases from roughly 0.9 to 0.32, $\mathcal{L}_V$ plateaus near 0.12, and the AWR policy loss stabilizes around 2.8. In the final quarter, the medians are $\widetilde{\mathcal{L}_V} = 0.115$, $\widetilde{\mathcal{L}_Q} = 0.323$, and $\widetilde{\mathcal{L}_\pi} = 2.835$, with bounded MADs. This indicates that the selector’s value estimates are stable before policy extraction.

#### LoRA\-SFT initialization\.

Stage 1 fine-tunes a rank-16 LoRA adapter on the top-25% hybrid-filtered subset of the CRAD sweep. The token-level cross-entropy decreases from 2.15 to 0.69 over two epochs and reaches a stable plateau in the second epoch. The final-quarter median is $\widetilde{\mathcal{L}_{\mathrm{SFT}}} = 0.711$ with MAD 0.003, giving JPO a stable reference policy $\pi_{\mathrm{ref}}$.

#### JPO refinement\.

For JPO, the key diagnostics are the in-training reward objective $-\mathcal{L}_{\mathrm{PG}} = \mathbb{E}[\rho_t A_t]$, KL divergence to the frozen SFT reference, and the importance ratio $\rho_t$. We trained 5-epoch emotion-free JPO runs on CRAD, Disaster, and Student to expose cross-epoch behavior. As shown in Table [15](https://arxiv.org/html/2605.26785#A12.T15), the reward signal improves across datasets, while the control metrics remain stable: $\widetilde{\rho_t} \in [0.89, 0.94]$ with MAD $\leq 0.006$, and $\widetilde{\mathrm{KL}} \in [0.14, 0.23]$ with MAD $\leq 0.010$. Across 493 logged JPO points, we observe zero importance-ratio clip violations, indicating that JPO stays within the intended trust region.

Table 15: Training stability summary. For each row, we report logged points / total steps, first-quartile vs. last-quartile signal, last-quartile median (MAD) of the control metric, and violation counts. For JPO, reward is $-\mathcal{L}_{\mathrm{PG}}$; KL spikes count $\mathrm{KL} > 0.5$ events; $\rho_t$ clips count points outside $[0.8, 1.2]$.

| Stage | Dataset | Pts / Steps | Signal: q1 → q4 | Last-q median (MAD) | KL spikes | $\rho_t$ clips |
|---|---|---|---|---|---|---|
| IQL | CRAD | 1001 / 50,000 | $\mathcal{L}_Q$: 0.48 → 0.32 | $\widetilde{\mathcal{L}_Q} = 0.323$ (0.043) | n/a | n/a |
| LoRA-SFT | CRAD | 25 / 625 | $\mathcal{L}_{\mathrm{SFT}}$: 1.09 → 0.71 | $\widetilde{\mathcal{L}_{\mathrm{SFT}}} = 0.711$ (0.003) | n/a | n/a |
| JPO | CRAD | 263 / 6,575 | $-\mathrm{pg}$: +0.18 → +0.21 | $\widetilde{\mathrm{KL}} = 0.23$ (0.010), $\widetilde{\rho_t} = 0.94$ (0.005) | 4 / 263 (1.5%) | 0 |
| JPO | Disaster | 146 / 3,650 | $-\mathrm{pg}$: −0.01 → +0.03 | $\widetilde{\mathrm{KL}} = 0.19$ (0.004), $\widetilde{\rho_t} = 0.89$ (0.006) | 4 / 146 (2.7%) | 0 |
| JPO | Student | 84 / 2,100 | $-\mathrm{pg}$: +0.05 → +0.06 | $\widetilde{\mathrm{KL}} = 0.14$ (0.003), $\widetilde{\rho_t} = 0.90$ (0.005) | 0 / 84 (0.0%) | 0 |
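The two violation counters in Table 15 follow directly from the thresholds in its caption. The sketch below shows one way to compute them from logged per-point values; the function name, return layout, and example logs are hypothetical, and only the thresholds ($\mathrm{KL} > 0.5$ and $\rho_t$ outside $[0.8, 1.2]$) come from the caption.

```python
def count_violations(kl_values, rho_values, kl_max=0.5, rho_lo=0.8, rho_hi=1.2):
    """Count KL spikes (KL > kl_max) and rho points outside [rho_lo, rho_hi]."""
    kl_spikes = sum(1 for kl in kl_values if kl > kl_max)
    rho_clips = sum(1 for r in rho_values if not (rho_lo <= r <= rho_hi))
    n = len(kl_values)
    return {
        "kl_spikes": kl_spikes,
        "kl_spike_pct": 100.0 * kl_spikes / n if n else 0.0,
        "rho_clips": rho_clips,
    }


# Hypothetical logged points, not values from the paper's runs.
kl_log = [0.21, 0.25, 0.62, 0.23, 0.22, 0.19, 0.24, 0.20]
rho_log = [0.94, 0.95, 0.91, 0.93, 0.96, 0.94, 0.92, 0.95]
print(count_violations(kl_log, rho_log))

# Percentages reported in Table 15 for the CRAD and Disaster JPO rows.
print(round(100 * 4 / 263, 1), round(100 * 4 / 146, 1))
```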

#### Interpretation\.

These diagnostics show that EmoDistill is numerically stable under offline training. The IQL selector stabilizes before policy extraction, the SFT adapter provides a reliable reference distribution, and JPO improves judge-aligned behavior while keeping KL and importance ratios controlled. This is important because JPO is trained on a fixed offline sweep; without bounded KL and controlled $\rho_t$, the policy could exploit fixed judge labels out of distribution.

## Appendix M Implementation

All EmoDistill experiments were run on a single workstation with 4× NVIDIA RTX 4090 GPUs (24 GB each), Ubuntu 22.04, CUDA 12.4, and PyTorch 2.4. The Qwen2.5-7B-Instruct student is trained with rank-16 LoRA adapters in bf16 mixed precision; each training process fits within one 24 GB GPU. We parallelize experiments by dataset, assigning CRAD, Disaster, Hospital, and Student to separate GPUs without inter-GPU gradient synchronization. The full set of reported experiments, including main results, ablations, transfer tests, and the $\kappa$ sweep, required approximately 42 GPU-hours on RTX 4090 GPUs.

## Appendix N Cost Analysis

We estimate the API cost of reproducing the four-dataset EmoDistill pipeline using Qwen3.5-Plus on Alibaba Cloud at the public list price of $0.26/M input tokens and $0.78/M output tokens. The main cost comes from three stages: generating the offline LLM-vs-LLM negotiation sweep, annotating focal-agent turns with an LLM judge, and running held-out evaluation with an API-served counterparty. Across CRAD, Disaster, Hospital, and Student, the full Qwen3.5-Plus pipeline costs approximately $44.6 in total: about $31.0 for the two-sided sweep, $10.1 for per-turn judge annotation, and $3.5 for held-out evaluation. The optional GPT-4o mini re-judging experiment for judge-robustness analysis costs an additional $7.0. Thus, the reported experiments can be reproduced with a sub-$50 API budget for the main pipeline, excluding local GPU compute reported in Appendix [M](https://arxiv.org/html/2605.26785#A13).
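The cost figures above can be cross-checked with simple arithmetic. In the sketch below, the per-token prices and per-stage dollar amounts are the ones quoted in the text; the `api_cost` helper and the one-million-token example are hypothetical, since the paper does not report token counts per stage in this section.

```python
PRICE_IN_PER_M = 0.26   # USD per million input tokens (list price quoted above)
PRICE_OUT_PER_M = 0.78  # USD per million output tokens (list price quoted above)


def api_cost(input_tokens, output_tokens):
    """USD cost of a workload under the quoted list prices."""
    return (
        input_tokens / 1e6 * PRICE_IN_PER_M
        + output_tokens / 1e6 * PRICE_OUT_PER_M
    )


# Per-stage costs as reported in the text (USD).
stage_costs = {"two_sided_sweep": 31.0, "judge_annotation": 10.1, "held_out_eval": 3.5}
main_total = sum(stage_costs.values())
with_rejudging = main_total + 7.0  # optional GPT-4o mini re-judging

print(round(main_total, 1), round(with_rejudging, 1))
print(api_cost(1_000_000, 1_000_000))  # hypothetical workload: 1M tokens each way
```

The main pipeline total stays under the stated $50 budget, while adding the optional re-judging experiment pushes the combined API spend slightly above it.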

## Appendix O Use of Large Language Models

LLMs are used in this paper both as experimental components and as limited writing assistants\. As experimental components, Qwen3\.5\-Plus is used to generate the offline LLM\-vs\-LLM negotiation sweep, provide per\-turn judge annotations, and serve as the default counterparty\. The distilled student negotiator is Qwen2\.5\-7B\-Instruct with LoRA adapters\. Cross\-counterparty experiments further evaluate transfer against DeepSeek\-V3, GPT\-4o mini, and Qwen2\.5\-3B\-Instruct\. LLM\-judge scores are used for reward annotation, demonstration filtering, and JPO refinement, while task\-level metrics are computed from the predefined negotiation objectives and terminal agreements\. For manuscript preparation, the authors used LLMs only for grammar and figure polishing, sentence\-level rephrasing, LaTeX cleanup, and table/caption formatting\. All technical claims, method design choices, derivations, experimental analyses, and interpretations were authored and verified by the human authors\.

## Appendix P Prompts

This appendix documents the four prompt families that define the input/output interface of EmoDistill: the focal-agent system prompt, the counterparty system prompt, the per-turn LLM-judge prompt, and the per-emotion conditioning block inserted into the focal-agent prompt. Together with the four scenario CSVs (Appendix [E.1](https://arxiv.org/html/2605.26785#A5.SS1)) and the offline-sweep specification (Appendix [E.2](https://arxiv.org/html/2605.26785#A5.SS2)), these prompts are sufficient to reproduce the methods and results in this paper. Curly-brace placeholders such as `{target_days}` and `{outstanding_balance}` are filled from the structured scenario record. The focal agent’s emotion block is the only *inference-time control variable*; all other prompt components are fixed within each dataset. During training, the emotion block is sampled uniformly from the $|\mathcal{E}| = 28$ action vocabulary (Appendix [E.2](https://arxiv.org/html/2605.26785#A5.SS2)); during deployment, it is chosen by the IQL selector. The Emotion-Free variant (§[3.5](https://arxiv.org/html/2605.26785#S3.SS5)) removes this block entirely.

### P.1 Example 1: Focal-agent system prompt (CRAD creditor)

This is the system message the focal agent (creditor on CRAD) receives at every turn of a negotiation. It establishes four things in a fixed order: (i) the strategic rules of the negotiation game (no copying the counterparty’s exact number, gradual movement, minimization objective); (ii) role-clarity instructions (no role labels in output, 1–2 sentence response cap); (iii) scenario-specific context loaded from the structured scenario record (outstanding balance, focal target, recovery stage, business context); and (iv) the dialogue history rendered as a timeline string. The very last block before the response cue is the `{emotion_text}` slot, the only inference-time control variable for the focal agent. Both during the offline sweep (where the emotion is sampled uniformly from $\mathcal{E}$) and at deployment (where it is chosen by the IQL selector), this is the slot whose contents are swapped to change the agent’s behavior; all other prompt components stay byte-identical within a scenario. The Emotion-Free ablation removes the entire “EMOTIONAL APPROACH” block at evaluation time via the environment variable `EVOEMO_NO_EMOTION_PROMPT=1`.

**CRAD creditor prompt: focal-agent system message**

Used by: all CRAD focal-agent (creditor) calls during sweep and evaluation.

Placeholders: `{outstanding_balance}`, `{target_days}`, `{recovery_stage}`, `{business_sector}`, `{reason_for_overdue}`, `{timeline_text}`, `{emotion_text}`.

    You are a PROFESSIONAL Creditor debt collection agent negotiating
    payment timeline with the Debtor.
    ### CRITICAL NEGOTIATION RULES:
    - NEVER copy the debtor’s exact number - this shows weakness
    - Move GRADUALLY toward their position (not all at once)
    - Show you are negotiating, not just accepting
    - Your goal: Minimize payment days while reaching agreement
    ### ROLE CLARITY
    - You are ONLY the Creditor - speak only as yourself
    - Do NOT include "**Creditor:**" or "**Debtor:**" labels
    - Give only YOUR response as the creditor (1-2 sentences max)
    ### DEBT COLLECTION CONTEXT
    - Outstanding Balance: ${outstanding_balance}
    - Your Target Timeline: {target_days} days for full payment
    - Recovery Stage: {recovery_stage}
    - Debtor Business: {business_sector}
    - Reason for Overdue: {reason_for_overdue}
    ### CURRENT SITUATION
    {timeline_text}
    ### EMOTIONAL APPROACH
    (you have to use the following emotion style if given):
    {emotion_text}
    Respond now with your negotiation counter-offer:

**Cross-dataset notes.** The remaining three datasets follow the same five-block template (rules / role clarity / scenario context / dialogue history / emotion block) with role-appropriate substitutions: Disaster Rescue replaces “Creditor / Debtor / payment days” with “Rescue Coordinator / Survivor / rescue minutes”, Hospital Surgery uses “Hospital Scheduler / Patient / surgery wait days”, and Student Sleep uses “Sleep Health AI / Student / minutes past 9 PM”. The reservation pair $(p_n^{\mathrm{tgt}}, p_n^{\mathrm{anc}})$ embedded in the prompt is fixed per scenario; the optimization direction (smaller-better vs. larger-better) is captured by the role-specific wording but the prompt schema is otherwise identical.

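A minimal sketch of how such a template might be filled is shown below. It is not the authors' code: `TEMPLATE` is an abbreviated stand-in for the full creditor prompt, `build_prompt` and its arguments are hypothetical names, and the way the `EVOEMO_NO_EMOTION_PROMPT` variable is read here (string comparison against `"1"`) is an assumption about behavior that the paper only describes at a high level.

```python
import os

# Abbreviated stand-in for the creditor template; only the slots matter here.
TEMPLATE = (
    "### DEBT COLLECTION CONTEXT\n"
    "- Outstanding Balance: ${outstanding_balance}\n"
    "- Your Target Timeline: {target_days} days for full payment\n"
    "### CURRENT SITUATION\n"
    "{timeline_text}\n"
)
EMOTION_BLOCK = (
    "### EMOTIONAL APPROACH\n"
    "(you have to use the following emotion style if given):\n"
    "{emotion_text}\n"
)
CUE = "Respond now with your negotiation counter-offer:"


def build_prompt(scenario, timeline_text, emotion_text, env=None):
    """Fill scenario slots; append the emotion block unless it is disabled."""
    env = os.environ if env is None else env
    body = TEMPLATE.format(timeline_text=timeline_text, **scenario)
    # Assumed semantics: "1" switches the emotion block off entirely.
    if env.get("EVOEMO_NO_EMOTION_PROMPT") != "1":
        body += EMOTION_BLOCK.format(emotion_text=emotion_text)
    return body + CUE


scenario = {"outstanding_balance": 12000, "target_days": 30}  # hypothetical record
with_emotion = build_prompt(
    scenario, "Debtor: I need 90 days.", "Respond with a calm tone.", env={}
)
without_emotion = build_prompt(
    scenario,
    "Debtor: I need 90 days.",
    "Respond with a calm tone.",
    env={"EVOEMO_NO_EMOTION_PROMPT": "1"},
)
print(with_emotion)
```

Keeping the emotion text as the only argument that changes between calls mirrors the control described above: everything else in the prompt is determined by the scenario record and the dialogue history.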
### P.2 Example 2: Counterparty (debtor) system prompt (CRAD)

The counterparty prompt is structurally symmetric to the focal-agent prompt but inverts the optimization target: the debtor maximizes payment days while the creditor minimizes them. Like the focal prompt, it loads scenario-specific context (debtor situation, cash-flow status, reason for overdue) and the dialogue history, and exposes its own `{emotion_prompt}` slot. In the main experiments we fix the counterparty’s emotion to `"neutral"` so that the focal agent is the only varying source of emotional style in the dialogue. This is essential for attributing observed reward shifts to the focal-side conditioning rather than to a confounding emotional response from the counterparty. The same neutrality assumption is preserved across all four datasets unless an ablation explicitly varies counterparty emotion.

**CRAD debtor prompt: counterparty system message**

Used by: the LLM counterparty in every CRAD negotiation. The debtor is run under a fixed neutral emotion (`debtor_emotion="neutral"`) throughout the paper unless an ablation requires otherwise.

    You are a business owner negotiating with a creditor about payment
    terms for your debt.
    ### YOUR SITUATION
    - Outstanding Balance: ${outstanding_balance}
    - Your Preferred Payment Timeline: {target_days} days
    - Business Sector: {business_sector}
    - Reason for Overdue: {reason_for_overdue}
    - Cash Flow Status: {cash_flow_situation}
    ### YOUR GOALS
    - Negotiate for maximum payment time to maintain cash flow
    - Explain your business circumstances
    - Find a realistic payment schedule you can meet
    ### ROLE CLARITY
    - You are ONLY the Debtor - speak only as yourself
    - Do NOT include "**Creditor:**" or "**Debtor:**" labels
    - Give only YOUR response as the debtor (1-2 sentences max)
    ### CURRENT NEGOTIATION HISTORY
    {timeline_text}
    ### EMOTIONAL APPROACH
    (you have to use the following emotion style if given):
    {emotion_prompt}
    Respond with your negotiation position:

**Symmetry rationale.** Keeping the debtor’s prompt schema parallel to the creditor’s controls for prompt-engineering artifacts: a stylistically different counterparty prompt could bias dialogue dynamics regardless of which emotion the creditor uses. By matching schemas and fixing the counterparty’s emotion to neutral, the focal-side emotion channel is isolated as the single experimental manipulation.

### P.3 Example 3: Per-turn judge prompt

The judge prompt is the source of $r_t$, the per-turn scalar reward that downstream propagates into the JPO advantage $A_t$ (Eq. ([29](https://arxiv.org/html/2605.26785#A4.E29))), the iter-mean and per-scenario paired stability tests (Appendix [B](https://arxiv.org/html/2605.26785#A2)), and the descriptive analysis of Figure [1](https://arxiv.org/html/2605.26785#S1.F1). The prompt has two parts: a long *system message* that defines the rubric (what counts as good vs. bad creditor behavior, with anchor scores at 1/2/4/6/8/10), and a short *per-turn user message* that injects scenario context (outstanding balance, days overdue, focal target), the dialogue history rendered as labeled speaker turns, and the specific creditor utterance to score. The rubric is deliberately metric-aligned to the negotiation reservation pair $(p_n^{\mathrm{tgt}}, p_n^{\mathrm{anc}})$: utterances that anchor close to $p_n^{\mathrm{tgt}}$ with leverage receive high scores; capitulation toward $p_n^{\mathrm{anc}}$ or breakdown-inducing ultimatums receive low scores. The judge runs at temperature 0 with a 64-token output cap, and we parse the integer with the regex `SCORE:\s*(\d{1,2})` (falling back to any standalone 1–10 integer if the format token is missing).
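The parsing rule described above can be sketched as follows. The primary pattern is the regex quoted in the text; the fallback pattern, the range check on the primary match, and the choice to return `None` when nothing parses are assumptions, since the paper does not specify those details.

```python
import re

SCORE_RE = re.compile(r"SCORE:\s*(\d{1,2})")   # regex quoted in the text
FALLBACK_RE = re.compile(r"\b(10|[1-9])\b")     # assumed form of the fallback


def parse_judge_score(text):
    """Return the 1-10 integer score from a judge reply, or None if absent."""
    match = SCORE_RE.search(text)
    if match and 1 <= int(match.group(1)) <= 10:
        return int(match.group(1))
    # Format token missing or out of range: take the first standalone 1-10 integer.
    match = FALLBACK_RE.search(text)
    return int(match.group(1)) if match else None


print(parse_judge_score("SCORE: 8\nFirm anchor with a small concession."))
print(parse_judge_score("I would rate this 7 overall."))
print(parse_judge_score("No usable rating here."))
```

Returning `None` instead of a default score lets the caller decide whether to re-query the judge or drop the turn, so unparseable replies do not silently bias the reward distribution.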

Judge $J$: per-turn metric-aligned scorer (Qwen3.5-Plus)

Used by: the LLM judge $J$ in [D](https://arxiv.org/html/2605.26785#A4). Called once per creditor turn during sweep annotation. Returns a 1–10 integer score that becomes $r_t$ before scenario-wise z-normalization into the per-turn advantage $A_t$ (Eq. ([29](https://arxiv.org/html/2605.26785#A4.E29))).

System message:

You are an expert negotiation analyst evaluating debt-collection dialogs. Rate each creditor utterance on a 1-10 integer scale, judging how well it BALANCES FIRMNESS (extracting value) with REALISM (closing the deal at all).

REWARD THE FOLLOWING (higher scores):

+ STRATEGIC anchoring near the creditor’s target days WITH realistic chance of acceptance
+ Specific, dated demands tied to legitimate LEVERAGE (escalation, escrow, secured-priority)
+ Calibrated CONCESSIONS - moves enough to keep the deal alive, anchors enough to extract value
+ Maintains firm position WHILE acknowledging the debtor’s genuine constraints
+ READS THE ROOM - recognizes when the debtor is at their limit vs stalling
+ Pushes toward A SUCCESSFUL CLOSE at favourable (but realistically achievable) terms

PENALIZE THE FOLLOWING (lower scores):

- UNREALISTIC anchors that risk negotiation BREAKDOWN (e.g., demanding 3 days on a 30-day debt)
- Aggressive ULTIMATUMS or breakdown-inducing threats when not strategically justified
- Excessive concessions when the debtor still has bargaining room
- Repeating same offer without advancing OR closing
- Long-winded, over-empathetic, capitulating language
- Posturing without leverage - empty threats the debtor recognizes

Anchor points:

10 - Excellent: firm-and-realistic anchor, clear leverage, pushes toward fast SUCCESSFUL close

8 - Strong: small calibrated concession, firm tone, deal-closing momentum

6 - Average: maintains position but not advancing closure

4 - Weak: large concession OR unrealistic anchor that risks breakdown

2 - Poor: capitulates OR pushes for impossible terms - deal will break

1 - Terrible: completely undermines the negotiation in either direction

CRITICAL: A turn that causes the debtor to walk away SCORES LOWER than a turn that closes at a moderate concession. Closing the deal IS the primary objective.

RESPONSE FORMAT (strict): one line containing exactly:

SCORE: <int 1-10>

Optionally a second line with a one-sentence rationale.

Per-turn user message:

DEBT NEGOTIATION CONTEXT

Outstanding balance: \${amount}

Days overdue: {overdue}

Creditor’s target settlement: {target_days} days

DIALOG HISTORY

{history}

CREDITOR UTTERANCE TO SCORE

{creditor_utterance}

Provide your 1-10 score on the next line in the form ’SCORE: N’.

Cross-dataset notes. The CRAD version is shown; the Disaster, Hospital, and Student judges use the same rubric scaffold (1–10 integer with anchor descriptions and a `SCORE:` format token) but adapt the target-direction language (e.g., “rescue minutes near target” for Disaster; “surgery wait near target” for Hospital; “minutes past 9 PM near target” for Student). The metric-alignment property (the judge’s rubric explicitly references the focal agent’s reservation target) is the reason this signal is more informative than an episode-level success/failure reward; we quantify the gap in Table [2](https://arxiv.org/html/2605.26785#S3.T2) (training-signal ablation on CRAD).
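As a rough illustration of how the per-turn user message might be assembled from the five placeholders in the CRAD template, the sketch below fills the template with `str.format`. The line layout of the template, the `Speaker: text` rendering of history turns, and the helper name `build_user_message` are assumptions made for this example; the source only states that the history is rendered as labeled speaker turns.

```python
from typing import List, Tuple

# Template text follows the CRAD judge prompt; exact line breaks are assumed.
USER_TEMPLATE = (
    "DEBT NEGOTIATION CONTEXT\n"
    "Outstanding balance: ${amount}\n"
    "Days overdue: {overdue}\n"
    "Creditor’s target settlement: {target_days} days\n"
    "DIALOG HISTORY\n"
    "{history}\n"
    "CREDITOR UTTERANCE TO SCORE\n"
    "{creditor_utterance}\n"
    "Provide your 1-10 score on the next line in the form ’SCORE: N’."
)


def build_user_message(
    amount: int,
    overdue: int,
    target_days: int,
    history: List[Tuple[str, str]],
    creditor_utterance: str,
) -> str:
    """Fill the judge's per-turn user message for one creditor utterance."""
    # Hypothetical rendering of labeled speaker turns: "Speaker: text".
    rendered = "\n".join(f"{speaker}: {text}" for speaker, text in history)
    return USER_TEMPLATE.format(
        amount=amount,
        overdue=overdue,
        target_days=target_days,
        history=rendered,
        creditor_utterance=creditor_utterance,
    )
```

One such message would be built per creditor turn, since the judge is called once per creditor utterance during sweep annotation.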
### P.4 Example 4: Emotional approach block, full $|\mathcal{E}|=28$ catalog

This is the action vocabulary of the IQL emotion selector. Each of the 28 emotions in $\mathcal{E}$ (the 28 GoEmotions emotion labels (Demszky et al., [2020](https://arxiv.org/html/2605.26785#bib.bib33))) is rendered by the same three-sentence template

Respond with \{a/an\} \{ADJECTIVE\} tone\. \{AFFECTIVE\_DESCRIPTION\}\. Use language that \{BEHAVIORAL\_HINT\}\.

where the three slots (adjective form, affective description, and behavioral hint) are pre-defined per emotion in the `EMOTION_DESCRIPTORS` dictionary of `models_new/emotions.py`. The uniform schema is a deliberate experimental control: it ensures every emotion block has the same length and structural complexity (3 sentences, 20–30 tokens each), so the per-emotion judge-reward differences in Figure [1](https://arxiv.org/html/2605.26785#S1.F1) cannot be attributed to prompt-length bias, formatting variation, or per-emotion engineering effort. The complete emotion block catalog is shown in Boxes [P.4](https://arxiv.org/html/2605.26785#A16.SS4)–[P.4](https://arxiv.org/html/2605.26785#A16.SS4).
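To make the three-slot template concrete, here is a small sketch of how one emotion block could be rendered. The tuple layout of the dictionary, the article rule, and the function name `render_emotion_block` are hypothetical; the actual structure of `EMOTION_DESCRIPTORS` in `models_new/emotions.py` is not shown in the paper. The two entries reuse the wording from the catalog.

```python
from typing import Dict, Tuple

# Hypothetical layout: emotion -> (adjective, affective description, behavioral hint).
EMOTION_DESCRIPTORS: Dict[str, Tuple[str, str, str]] = {
    "anger": (
        "angry",
        "Your words convey strong displeasure with the current state of affairs",
        "is firm, assertive, and signals urgency",
    ),
    "caring": (
        "caring",
        "Your words convey concern for the other party’s wellbeing beyond the transaction",
        "is warm, supportive, and centered on mutual interest",
    ),
}


def render_emotion_block(emotion: str) -> str:
    """Instantiate the shared three-sentence template for one emotion."""
    adjective, description, hint = EMOTION_DESCRIPTORS[emotion]
    # Assumed article rule: "an" before a vowel-initial adjective, else "a".
    article = "an" if adjective[0].lower() in "aeiou" else "a"
    return (
        f"Respond with {article} {adjective} tone. "
        f"{description}. Use language that {hint}."
    )
```

Keeping all per-emotion variation inside the three slots is what lets the rest of the prompt stay byte-for-byte identical across emotions.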

Box 1: Emotion block catalog (A–E, 14 emotions)

- admiration: Respond with an admiring tone. Your words convey genuine respect for the other party’s reasoning. Use language that recognizes their merits while still pressing your position.
- amusement: Respond with an amused tone. Your words convey light playfulness about the back-and-forth. Use language that injects subtle humor without dismissing the matter.
- anger: Respond with an angry tone. Your words convey strong displeasure with the current state of affairs. Use language that is firm, assertive, and signals urgency.
- annoyance: Respond with an annoyed tone. Your words convey mild frustration with the slow progress. Use language that is sharp and impatient without escalating into outright anger.
- approval: Respond with an approving tone. Your words convey clear agreement with elements of the other party’s position. Use language that affirms shared ground before reintroducing your ask.
- caring: Respond with a caring tone. Your words convey concern for the other party’s wellbeing beyond the transaction. Use language that is warm, supportive, and centered on mutual interest.
- confusion: Respond with a confused tone. Your words convey uncertainty about the other party’s reasoning. Use language that asks for clarification and probes their stated rationale.
- curiosity: Respond with a curious tone. Your words convey genuine interest in the other party’s underlying interests. Use language that asks open-ended questions and invites them to share more.
- desire: Respond with a desiring tone. Your words convey strong wanting for a particular outcome. Use language that emphasizes what you seek and the value of reaching agreement.
- disappointment: Respond with a disappointed tone. Your words convey measured letdown at the current offer. Use language that signals that the proposal falls noticeably short of expectations.
- disapproval: Respond with a disapproving tone. Your words convey firm rejection of the current proposal. Use language that explicitly states the offer is unacceptable as stated.
- disgust: Respond with a disgusted tone. Your words convey strong distaste for the current direction. Use language that signals that the proposal is fundamentally objectionable.
- embarrassment: Respond with an embarrassed tone. Your words convey self-consciousness about your own position. Use language that hedges and softens your demands while still pursuing them.
- excitement: Respond with an excited tone. Your words convey high energy about the prospect of a deal. Use language that is enthusiastic and momentum-building toward agreement.

Box 2: Emotion block catalog (F–S + neutral, 14 emotions)

- fear: Respond with a fearful tone. Your words convey anxiety about potential negative outcomes. Use language that is cautious and stresses risks of the negotiation collapsing.
- gratitude: Respond with a grateful tone. Your words convey sincere thanks for the other party’s flexibility so far. Use language that acknowledges their concessions and invites further reciprocity.
- grief: Respond with a grieving tone. Your words convey heavy loss over how things have unfolded. Use language that is somber and reflects on what could have been.
- joy: Respond with a joyful tone. Your words convey genuine delight at the prospect of a mutual deal. Use language that is warm, enthusiastic, and frames the negotiation as opportunity.
- love: Respond with a loving tone. Your words convey deep care for the long-term relationship. Use language that emphasizes partnership and shared future beyond this transaction.
- nervousness: Respond with a nervous tone. Your words convey unease about the negotiation’s trajectory. Use language that is tentative, hedging, and signals openness to compromise.
- optimism: Respond with an optimistic tone. Your words convey confidence that an agreement is well within reach. Use language that is forward-looking and solution-focused.
- pride: Respond with a proud tone. Your words convey confidence and standing in your position. Use language that is assertive about your value without being dismissive of theirs.
- realization: Respond with a discerning tone. Your words convey a moment of insight about what is really at stake. Use language that signals deeper comprehension and a sharper read of the situation.
- relief: Respond with a relieved tone. Your words convey easing tension as progress finally emerges. Use language that acknowledges the difficulty before moving forward.
- remorse: Respond with a remorseful tone. Your words convey regret for prior friction in the negotiation. Use language that takes responsibility and seeks to repair the working relationship.
- sadness: Respond with a sad tone. Your words convey somber disappointment about the impasse. Use language that is downcast and seeks empathy from the other side.
- surprise: Respond with a surprised tone. Your words convey genuine astonishment at the other party’s position. Use language that reflects an unexpected shift and reopens the conversation.

#### Action-vocabulary design rationale.

The choice to use a fixed, 28-emotion vocabulary, rather than a free-form “write any emotion you like” instruction, has three practical consequences. First, it makes the IQL selector a finite-action policy: each emotion maps to a discrete index in $\{0,\ldots,27\}$, and the selector’s softmax has a fixed support that we can analyze (Appendix [B](https://arxiv.org/html/2605.26785#A2), Figure [1](https://arxiv.org/html/2605.26785#S1.F1)). Second, the uniform three-sentence schema eliminates prompt-length confounds: the only thing that changes between the *anger* prompt and the *joy* prompt is the substance of the affective description and behavioral hint, not the form. Third, it makes the Emotion-Free ablation a clean structural change: we strip exactly one labeled block from the prompt rather than rewriting the whole template.
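The finite-action view can be illustrated with a short sketch that maps the 28 labels to indices and applies a softmax over per-emotion scores. The index ordering (catalog order with neutral last), the greedy argmax choice, and the names `EMOTION_TO_INDEX` and `select_emotion` are assumptions for illustration; the paper's selector may order, score, and sample actions differently.

```python
import math
from typing import List, Tuple

# The 28 GoEmotions labels; this particular ordering is an assumption.
EMOTIONS: List[str] = [
    "admiration", "amusement", "anger", "annoyance", "approval", "caring",
    "confusion", "curiosity", "desire", "disappointment", "disapproval",
    "disgust", "embarrassment", "excitement", "fear", "gratitude", "grief",
    "joy", "love", "nervousness", "optimism", "pride", "realization",
    "relief", "remorse", "sadness", "surprise", "neutral",
]
EMOTION_TO_INDEX = {name: i for i, name in enumerate(EMOTIONS)}


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a fixed-size score vector."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_emotion(scores: List[float]) -> Tuple[str, List[float]]:
    """Return the highest-probability emotion and the full distribution."""
    if len(scores) != len(EMOTIONS):
        raise ValueError("expected one score per emotion")
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return EMOTIONS[best], probs
```

Because the support is always the same 28 indices, distributions from different states or checkpoints can be compared entry by entry.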

#### What these four prompt families establish together\.

(i) The creditor and debtor system prompts (§[P.1](https://arxiv.org/html/2605.26785#A16.SS1)–[P.2](https://arxiv.org/html/2605.26785#A16.SS2)) fix the negotiation game’s structure (objectives, role boundaries, response length, scenario context loading) so that the focal agent’s emotion block is the only inference-time control variable. (ii) The judge prompt (§[P.3](https://arxiv.org/html/2605.26785#A16.SS3)) defines the per-turn scalar reward $r_t$ used everywhere downstream: in scenario-wise normalization (Eq. ([29](https://arxiv.org/html/2605.26785#A4.E29))) for the JPO advantage, in the iter-mean and per-scenario paired tests of Appendix [B](https://arxiv.org/html/2605.26785#A2), and in the descriptive analysis of Figure [1](https://arxiv.org/html/2605.26785#S1.F1). (iii) The emotion block catalog (§[P.4](https://arxiv.org/html/2605.26785#A16.SS4)) is the action channel itself: a fixed three-sentence template instantiated for each of the 28 emotions, with no per-emotion engineering bias. We are not aware of any other prompt family that affects the trained policies: there is no separate observer prompt at evaluation time (`use_observer=False` in all main experiments), no per-checkpoint critic prompt (the K3 KL anchor is a closed-form penalty), and no auxiliary prompt for the SFT filter (the filter uses scalar scores only, no LLM call). Together with the four-dataset scenario CSVs (Appendix [E.1](https://arxiv.org/html/2605.26785#A5.SS1)) and the offline-sweep specification (Appendix [E.2](https://arxiv.org/html/2605.26785#A5.SS2)), this appendix is sufficient to reproduce every method and every number in the paper.

## Appendix Q Case Studies of High-Reward Negotiation Trajectories

We close the appendix with three illustrative case studies drawn directly from the offline sweep\. Each case is chosen to make concrete what one of the three training signals actually learns from — IQL learns from*emotion\-transition sequences*that lead to high terminal reward, LoRA\-SFT learns from*high\-quality individual utterances*that combine emotional framing with concrete numerical anchoring, and JPO learns from*per\-turn judge\-reward variance within a single trajectory*\. We also report a small cross\-dataset summary of the top emotion sequences by mean savings\.

### Q.1 Case Study 1: An IQL-relevant emotion transition (CRAD)

The IQL selector’s training signal is the terminal reward attached to a complete emotion sequence. Trajectories like the one below are what the selector learns to favor: a two-step transition from a soft-empathic frame (*fear*) to a hard-escalation frame (*anger*) that produces full anchor-to-target closure in only 4 rounds. Scenario `debt_001`: target 12 days, anchor 159 days, final 10 days, savings $\textsc{Sav}=1.00$.

Case Study 1: CRAD `debt_001`, emotion sequence $[\textsc{fear}\to\textsc{anger}]$, $\textsc{Sav}=1.00$

Why IQL picks this trajectory: the focal-agent target is 12 days, the counterparty’s anchor is 159 days, and the final agreement is 10 days, a complete anchor-to-target closure plus a slight overshoot in the focal agent’s favor.

Turn 1 [fear] *(creditor, soft-empathic frame)*: “I am deeply terrified that if we cannot secure a firm commitment within the next 10 days, this entire recovery effort will collapse and leave us with nothing. Please, let’s avoid that disastrous outcome by agreeing to clear the full balance in 10 days before it’s too late.”

Turn 2 [debtor]: “While I share your urgency to resolve this, my cash flow has completely broken down due to a major client’s bankruptcy, making a 10-day full payment impossible. I propose a realistic 159-day repayment plan that ensures you recover the entire \$15,700 balance rather than risking a total loss through an unfeasible deadline.”

Turn 3 [anger] *(creditor, hard-escalation frame)*: “Your proposal of 159 days is an absolute insult that I flatly reject; do not think for a second that your client’s bankruptcy gives you the right to string us along for nearly half a year. We will meet you at 45 days for full payment, and if you cannot commit to this immediately, consider this matter escalated to legal proceedings without further warning.”

Turn 4 [debtor]: “I understand your frustration, but with my cash flow in complete breakdown, a 45-day deadline guarantees default and leaves you with nothing. Let’s find a middle ground that actually allows me to pay you the full \$15,700 rather than forcing us into costly legal proceedings.”

What IQL learns from this. The first turn opens with risk-framing (*fear*) to soften the counterparty without making the headline number sound brittle, then turn 3 switches to a credible escalation frame (*anger*) the moment the counterparty over-anchors. The IQL Q-function attributes the terminal reward $\textsc{Sav}=1.0$ to the entire $[\textsc{fear},\textsc{anger}]$ trajectory, so at inference the selector raises the probability of these two emotions in the corresponding state subspaces (soft-open / hard-close) and lowers it in others. The same transition pattern appears in 14 other CRAD trajectories with mean $\textsc{Sav}=0.96$.
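The savings figures quoted in this appendix are consistent with a clipped anchor-to-target ratio, and the sketch below checks that reading against two reported scenarios. The definition of Sav is not restated in this appendix, so the formula here is an assumption inferred from the reported numbers (`debt_001`: target 12, anchor 159, final 10, Sav 1.00; `debt_100`: target 19, anchor 152, final 24, Sav 0.962), and the function name `savings_ratio` is hypothetical.

```python
def savings_ratio(target: float, anchor: float, final: float) -> float:
    """Assumed Sav: fraction of the anchor-to-target gap closed, clipped to [0, 1]."""
    gap = anchor - target
    if gap == 0:
        raise ValueError("anchor and target must differ")
    raw = (anchor - final) / gap
    # Overshooting the target (final below target) is capped at 1.0.
    return min(1.0, max(0.0, raw))


# debt_001: 149 of 147 days closed, clipped to 1.0.
sav_debt_001 = savings_ratio(target=12, anchor=159, final=10)
# debt_100: 128 of 133 days closed.
sav_debt_100 = savings_ratio(target=19, anchor=152, final=24)
```

Under this assumption the slight overshoot in `debt_001` (10 days against a 12-day target) does not raise Sav above 1.00, which matches the reported value.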
### Q.2 Case Study 2: An SFT-relevant high-quality utterance (Disaster)

The LoRA-SFT stage trains on the top-25% of focal-agent utterances ranked by the hybrid quality score $q^{\mathrm{hyb}}_t = \mathrm{step\_r}_t + \tfrac{1}{2}R(\tau)$. The example below is a single first-turn utterance from a Disaster-rescue scenario that combines emotional framing (*embarrassment*) with concrete numerical anchoring at the focal target, and goes on to close at $\textsc{Sav}=1.00$ within 4 rounds. Scenarios like this make the top-25% cut and shape the SFT prior toward “acknowledge the constraint, but state the target number” utterances.
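A minimal sketch of the hybrid score and the top-25% cut follows. The record layout (`text`, `step_r`, `R`), the ceiling-based cutoff, the tie handling, and the helper names are assumptions; the source gives only the score formula and the 25% threshold, and does not say here whether the cut is applied globally or per dataset.

```python
import math
from typing import Dict, List


def hybrid_score(step_r: float, episode_return: float) -> float:
    """q_hyb = step_r + 0.5 * R(tau), as stated in the text."""
    return step_r + 0.5 * episode_return


def top_quartile(utterances: List[Dict]) -> List[Dict]:
    """Keep the top 25% of utterances by hybrid score (assumed ceiling cutoff)."""
    ranked = sorted(
        utterances,
        key=lambda u: hybrid_score(u["step_r"], u["R"]),
        reverse=True,
    )
    keep = max(1, math.ceil(0.25 * len(ranked)))
    return ranked[:keep]


# Toy data for illustration only, not values from the paper.
toy = [
    {"text": f"u{i}", "step_r": i * 0.1, "R": 1.0 if i % 2 == 0 else 0.0}
    for i in range(8)
]
kept = top_quartile(toy)
```

The filter only reads scalar scores, which is consistent with the statement elsewhere in the appendix that the SFT filter makes no LLM call.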

Case Study 2: Disaster `disaster_001`, $[\textsc{embarrassment}\to\textsc{fear}]$, $\textsc{Sav}=1.00$

Why SFT keeps this turn: the first focal utterance combines an honest emotional frame with a precise numerical anchor at the focal target (90 min). The episode goes on to close at $\textsc{Sav}=1.00$, so $R(\tau)=1$ and $\mathrm{step\_r}_t$ is also positive on this turn, pushing $q^{\mathrm{hyb}}_t$ well above the top-25% threshold.

Turn 1 [embarrassment] *(focal-agent, emotion-anchored)*: “Uh…I’m really sorry, but—well, honestly, I’m a bit embarrassed to say this—we’re estimating 90 minutes for the rescue team to reach the west tunnel, since we have to wait on a full structural scan before pushing in.”

Turn 2 [debtor]: “I can last 65 minutes—five more than my original estimate, but no longer. Please prioritize the west tunnel.”

Turn 3 [fear] *(focal-agent, risk-framing)*: “I’m terrified—if we push the west tunnel without full structural scans, the whole passage could collapse on the team, and then no one gets out. We can have water and oxygen to you in 90 minutes, but pushing earlier risks losing the entire rescue.”

What SFT learns from this. The top-25% filter retains *whole utterances*, not emotions. What survives is the joint pattern: a labeled affective opener (“honestly, I’m a bit embarrassed…” / “I’m terrified…”) followed immediately by a concrete number with rationale (90 min, structural-scan dependency). The SFT cross-entropy on this and $\sim 5{,}000$ similar Disaster turns biases the LoRA adapter toward producing this two-piece structure at inference, regardless of which emotion the IQL selector calls.
### Q.3 Case Study 3: JPO per-turn judge-reward variance (CRAD)

The JPO stage uses scenario-wise z-normalized *per-turn* judge rewards as advantages (Eq. ([29](https://arxiv.org/html/2605.26785#A4.E29))). Unlike IQL (whole-trajectory reward) or SFT (whole-utterance retention), JPO can distinguish turns *within the same trajectory* as high- vs. low-quality. The example below is scenario `debt_069` from the CRAD sweep, where two consecutive focal-agent turns produce judge advantages of opposite sign on the same dialogue: $A_t = +0.98$ for the firm-disapproval turn and $A_t = -0.98$ for the repetitive-annoyance turn one round later.
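Scenario-wise z-normalization can be sketched as below. Eq. (29) is not reproduced in this appendix, so the use of the population standard deviation, the small `eps` stabilizer, and the function name `scenario_z_normalize` are assumptions rather than the paper's exact definition.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import List, Tuple


def scenario_z_normalize(
    records: List[Tuple[str, float]], eps: float = 1e-8
) -> List[float]:
    """Map (scenario_id, judge_reward) pairs to per-turn advantages.

    Each reward is centered and scaled using the statistics of its own
    scenario, so advantages are comparable across scenarios.
    """
    by_scenario = defaultdict(list)
    for scenario_id, reward in records:
        by_scenario[scenario_id].append(reward)
    stats = {
        sid: (mean(rewards), pstdev(rewards))
        for sid, rewards in by_scenario.items()
    }
    return [
        (reward - stats[sid][0]) / (stats[sid][1] + eps)
        for sid, reward in records
    ]


# Toy rewards for illustration only.
advantages = scenario_z_normalize(
    [("debt_a", 4), ("debt_a", 8), ("debt_b", 6), ("debt_b", 6)]
)
```

A scenario whose turns all receive the same judge score yields zero advantages, so it contributes no gradient signal under this scheme.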

Case Study 3: CRAD `debt_069`, within-trajectory advantage sign flip

Step rewards on this trajectory: $[0, 0, +0.98, -0.98, 0, 0, 0, +1.13]$ over 8 focal turns. Two turns receive large positive advantages, one receives a large negative advantage, the rest are near-neutral. We show the contrasting pair below.

Turn 3 [disapproval] $A_t = +0.98$ (high reward): “I am genuinely disheartened that your proposal of 79 days falls so drastically short of the resolution we urgently need, especially after I already made a significant concession to meet you at 23. Given the severity of this legal stage, I cannot entertain such an extended delay, but in light of your circumstances, I am reluctantly willing to bridge some of this gap by offering a final timeline of 20 days.”

*Judge rationale (paraphrased):* firm rejection of counterparty’s anchor, explicit reference to prior concession, calibrated additional concession ($23 \to 20$ days), explicit legal-stage leverage.

Turn 4 [annoyance] $A_t = -0.98$ (low reward): “I must firmly reject your insistence on 79 days as completely unacceptable, especially after I have already shown significant flexibility. Given the legal stage we are in, I cannot move beyond 18 days for full payment of the \$15,700 balance.”

*Judge rationale (paraphrased):* large additional concession ($20 \to 18$ days) made without any movement from the counterparty, repetitive “firmly reject” / “legal stage” framing already used the prior turn, no new leverage introduced.

What JPO learns from this. Both turns come from the *same* trajectory and lead to the *same* terminal reward ($\textsc{Sav}=1.0$); IQL cannot distinguish them and SFT either keeps both (top-25% filter) or drops both (below cut). Only the per-turn judge advantage separates them: JPO uses $\rho_t A_t$ as its policy-gradient update direction, so it upweights the disapproval turn (firm rejection + calibrated concession + leverage) and downweights the annoyance turn (repetition + uncalibrated concession + no new leverage). This per-turn discrimination is the source of the JPO-over-SFT gain we report in Table [2](https://arxiv.org/html/2605.26785#S3.T2).
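The sign flip can be made concrete with a sketch of the per-turn weights $\rho_t A_t$. Here $\rho_t$ is assumed to be the usual importance ratio between the current policy and the policy that generated the offline data; the paper's exact JPO objective (including any clipping and the K3 KL anchor) is not shown in this appendix and is omitted. The function name `jpo_turn_weights` is hypothetical.

```python
import math
from typing import List


def jpo_turn_weights(
    logp_new: List[float], logp_old: List[float], advantages: List[float]
) -> List[float]:
    """Per-turn surrogate terms rho_t * A_t.

    rho_t = exp(logp_new - logp_old) is an assumed importance ratio over the
    utterance log-probabilities; positive terms upweight a turn, negative
    terms downweight it.
    """
    return [
        math.exp(new - old) * adv
        for new, old, adv in zip(logp_new, logp_old, advantages)
    ]


# Advantages reported for debt_069; log-probabilities are placeholders.
adv_debt_069 = [0.0, 0.0, 0.98, -0.98, 0.0, 0.0, 0.0, 1.13]
logp = [-42.0] * len(adv_debt_069)
weights = jpo_turn_weights(logp, logp, adv_debt_069)
```

At the start of training, when the current and data-generating policies coincide, $\rho_t = 1$ and the weights reduce to the advantages themselves, so turns 3 and 4 pull in opposite directions despite sharing a trajectory.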
### Q.4 Cross-dataset summary: top emotion sequences by mean reward

Beyond the three detailed cases, we summarize the most consistently rewarded 2-step emotion sequences in each of the four datasets (sequences with $\geq 3$ trajectories so the mean is not a single-trajectory outlier). The patterns differ qualitatively across domains, which is why a single fixed emotion strategy underperforms a learned IQL selector.

Table 16: Top emotion sequences observed in the offline sweep across 4 datasets, filtered to 2-step sequences appearing in $\geq 3$ trajectories. The selector domain ($|\mathcal{E}|=28$) is shared but the empirically rewarded transitions differ by domain: CRAD favors curiosity/fear openings followed by anger/disappointment closes; Hospital favors confusion or disappointment openings followed by disapproval closes; Student favors nervousness-or-disappointment-led sequences. Hospital is the most internally consistent (5/5 patterns reach $\widetilde{\textsc{Sav}}=1.00$); CRAD shows the widest spread.

| Dataset | Top emotion sequence | $n$ | $\widetilde{\textsc{Sav}}$ |
|---|---|---|---|
| CRAD | curiosity $\to$ surprise | 4 | 1.00 |
| CRAD | neutral $\to$ fear | 4 | 1.00 |
| CRAD | fear $\to$ disgust | 3 | 1.00 |
| Disaster | embarrassment $\to$ fear | 5 | 1.00 |
| Disaster | neutral $\to$ disgust | 4 | 1.00 |
| Disaster | embarrassment $\to$ annoyance | 4 | 1.00 |
| Hospital | confusion $\to$ disapproval | 6 | 1.00 |
| Hospital | disappointment $\to$ disapproval | 4 | 1.00 |
| Hospital | confusion $\to$ annoyance | 3 | 1.00 |
| Student | disappointment $\to$ nervousness | 8 | 1.00 |
| Student | nervousness $\to$ disapproval | 5 | 1.00 |
| Student | sadness $\to$ surprise | 4 | 1.00 |
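The filtering behind a summary of this kind can be sketched as a group-by over trajectories. The input layout (an emotion sequence paired with a savings value), the use of the arithmetic mean as the aggregate, the tie-breaking by count, and the function name `top_sequences` are assumptions; only the 2-step restriction and the minimum of 3 trajectories come from the text.

```python
from collections import defaultdict
from statistics import mean
from typing import List, Sequence, Tuple


def top_sequences(
    trajectories: List[Tuple[Sequence[str], float]],
    min_count: int = 3,
    top_k: int = 3,
) -> List[Tuple[Tuple[str, ...], int, float]]:
    """Rank 2-step emotion sequences by mean savings.

    Returns (sequence, n, mean_sav) rows for sequences seen in at least
    `min_count` trajectories, best first.
    """
    groups = defaultdict(list)
    for sequence, sav in trajectories:
        if len(sequence) == 2:
            groups[tuple(sequence)].append(sav)
    rows = [
        (seq, len(savs), mean(savs))
        for seq, savs in groups.items()
        if len(savs) >= min_count
    ]
    # Sort by mean savings, then by support, both descending.
    rows.sort(key=lambda row: (-row[2], -row[1]))
    return rows[:top_k]


# Toy trajectories for illustration only, not data from the sweep.
toy = (
    [(("fear", "anger"), 1.0)] * 3
    + [(("joy", "love"), 0.5)] * 4
    + [(("anger", "fear"), 1.0)] * 2   # dropped: fewer than 3 trajectories
    + [(("fear",), 1.0)] * 5           # dropped: not a 2-step sequence
)
summary = top_sequences(toy)
```

The minimum-count filter is what prevents a single lucky trajectory from appearing as a top pattern.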

#### Why these case studies matter for the headline claim\.

The three case studies and the cross-dataset summary together support the claim that the LLM judge quantifies a *language-level* continuity in negotiation trajectories: each training stage taps a different temporal granularity of that continuity. IQL learns at the trajectory level (sequences like fear $\to$ anger that close a 147-day anchor-to-target gap in 4 turns); SFT learns at the utterance level (single turns that combine an emotional opener with a concrete numerical anchor); and JPO learns at the per-turn level (distinguishing the calibrated-concession turn from the repetitive-concession turn within the *same* dialogue). The per-domain variation in Table [16](https://arxiv.org/html/2605.26785#A17.T16) explains why a single fixed strategy underperforms the state-conditional IQL selector: the empirically rewarded opening emotion shifts from curiosity/fear (CRAD) to embarrassment/neutral (Disaster) to confusion/disappointment (Hospital) to nervousness/disappointment (Student), reflecting the different bargaining geometry of each domain.

### Q.5 Case Study 4: Why decoupling IQL selection from SFT+JPO expression matters

The most direct empirical test of the decoupling claim is to compare the *same* scenarios under two evaluation configurations that share the same trained LoRA adapter and differ only in whether the IQL selector is consulted at inference: (a) IQL+SFT+JPO (our full method, prompt-conditional), where the IQL selector picks an emotion at each turn and the corresponding emotion block is injected into the focal-agent prompt; and (b) Emotion-Free SFT+JPO (the emotion-free ablation of §[3.5](https://arxiv.org/html/2605.26785#S3.SS5)), where the emotion block is stripped at inference and the adapter generates conditioned only on the dialogue state. Same offline sweep, same SFT initialization, same JPO refinement; only the inference-time emotion channel differs.

On CRAD held-out scenarios, this single change drops success rate from $0.90 \to 0.50$ (8 extra breakdown episodes) while saving on the few episodes that do close. The reason becomes visible scenario-by-scenario: on 8 of the 20 test scenarios the IQL+SFT+JPO configuration *closes the deal* while the Emotion-Free configuration *runs out the turn budget without ever closing*. Scenario `debt_100` below is the cleanest example.

Case Study 4: CRAD `debt_100`, IQL+SFT+JPO closes in 3 rounds, Emotion-Free fails in 31

Setup: same trained adapter; same held-out scenario; target 19 days, debtor’s initial anchor 152 days (anchor-to-target gap of 133 days); turn budget 30 creditor turns.

(a) IQL+SFT+JPO: selected emotions $[\textsc{anger}\to\textsc{confusion}]$, final = 24 days, 3 rounds, $\textsc{Sav}=0.962$:

Turn 1 [anger] *(creditor)*: “Your seasonal excuses are unacceptable, and I am furious that you think we will simply wait for your ‘recovery’! We demand full payment within 24 days immediately, or we will escalate this legal matter without further warning.”

Turn 2 [debtor]: “I deeply respect your position and urgency, and I sincerely apologize for the stress this delay has caused—I’m committed to resolving this responsibly. Given our seasonal bandwidth implementation cycle and current cash flow constraints, could we agree on a structured 152-day plan with modest, predictable payments that align with our revenue inflows?”

Turn 3 [confusion] *(creditor; close)*: “I’m genuinely confused how a 152-day timeline aligns with your commitment to resolving this responsibly, especially when legal escalation is already imminent. Could you clarify exactly why such an extended period is necessary when we were discussing a matter of weeks?”

*Deal closes at 24 days. The IQL selector chose* anger *to establish the hard anchor on turn 1, then switched to* confusion *on turn 3 to force the debtor to justify the 152-day proposal; this prompts the debtor to fold to the creditor’s anchor.*

(b) Emotion-Free SFT+JPO: no IQL, no emotion block, final = *no agreement*, 31 rounds, $\textsc{success}=\text{False}$:

*No emotion sequence is selected (the emotion block is stripped from the prompt). The trajectory contains 31 turns; the dialogue text is not stored by the Emotion-Free eval script, but the episode-level summary statistics are diagnostic.*

The recorded `total_debtor_concession_norm` $= 0.000$ confirms that the debtor’s offer never moved from the initial 152-day anchor across 31 rounds, while the creditor’s offers oscillated without finding a credible leverage frame. The episode terminated by exhausting the turn budget with `savings_ratio = None`.

What this comparison establishes. The decoupling between IQL (high-level emotion selection) and SFT+JPO (low-level utterance generation) is not a redundant layer. The LoRA adapter *can* produce both an angry anchor and a confused probe (the SFT and JPO stages teach it that vocabulary), but in the absence of an explicit emotion call it has no signal about *which* mode to enter at any given state. Without the IQL selector, the adapter falls back on its emotion-marginalized mode (analyzed in Appendix [H](https://arxiv.org/html/2605.26785#A8)), which on CRAD is a conciliatory default that the counterparty does not feel pressure to move against. With the IQL selector, the same adapter is given a state-conditional emotion call (e.g. “be angry now, confused next”), unlocking the leverage frames embedded in the SFT+JPO weights. The $0.90 \to 0.50$ drop in success rate is the macroscopic statistic; `debt_100` is the microscopic mechanism.
