Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction
Summary
This paper introduces RPCL, a training-only framework for robust pair confidence learning in multimodal emotion-cause pair extraction, which improves discriminative separation of gold pairs from hard negatives and achieves significant gains in Pair F1 and AUPRC on three datasets.
# Learning Robust Pair Confidence for Multimodal Emotion-Cause Pair Extraction
Source: [https://arxiv.org/html/2606.18893](https://arxiv.org/html/2606.18893)
- [Zhuangzhuang Pan](https://orcid.org/0009-0009-0451-2162), Institute for Advanced Studies, Universiti Malaya, Kuala Lumpur 50603, Malaysia, 23078403@siswa.um.edu.my
- [Ning Dong](http://orcid.org/0000-0003-3045-9798), School of Information Engineering, Suqian University, Suqian 223800, China, dongning@squ.edu.cn
- [Yingna Su](http://orcid.org/0000-0003-2348-5082), School of Information Engineering, Suqian University, Suqian 223800, China, suyingna@squ.edu.cn
- [Yan Xia](https://orcid.org/0009-0006-3559-4680), Digitization Department, Suzhou University of Technology, Suzhou 215500, China, 23072126@siswa.um.edu.my
###### Abstract
Multimodal emotion-cause pair extraction (MECPE) requires reliable pair confidence over candidate pairs. Existing pair scorers commonly use pair-level cross entropy over valid candidates, which treats links mostly independently. This leaves the relative confidence geometry among competing causes under-constrained, allowing gold pairs to stay close to hard negatives or rely on incidental non-gold context. We study this vulnerability as pair-confidence brittleness and propose RPCL (Robust Pair Confidence Learning), a training-only framework for pair-confidence learning. RPCL encourages pair confidence to be both discriminative and stable: gold pairs are separated from row-wise hard negatives through a confidence-difference margin constraint, and clean pair predictions are aligned with predictions from a corrupted view where non-gold contextual utterance representations are partially corrupted. The original clean pair scorer and decoding pipeline are used unchanged at inference time. On ECF, MECAD, and MEC4, RPCL improves the three-seed mean Pair F1 over a matched base model by 2.58–2.83 percentage points in the full text-audio-video setting, and improves mean Pair AUPRC on all three datasets. Diagnostic analysis further shows larger gold-negative confidence gaps and lower margin-violation severity. These results suggest that explicitly shaping pair confidence is an effective training strategy for MECPE.
*Keywords* Multimodal emotion-cause pair extraction · MECPE · Pair-confidence learning · Row-conditioned margin ranking · Corrupted-context pair stability
## 1 Introduction
Multimodal emotion-cause pair extraction (MECPE) in conversations aims to identify which utterances express emotions and which utterances cause them, forming emotion-cause pairs over a dialogue (Xia and Ding, [2019](https://arxiv.org/html/2606.18893#bib.bib1); Wang et al., [2023a](https://arxiv.org/html/2606.18893#bib.bib4)). Compared with text-based emotion-cause pair extraction, it makes pair decisions inside a conversational structure where emotions, causes, speakers, and background turns are interleaved (Li et al., [2023b](https://arxiv.org/html/2606.18893#bib.bib2); Jeong and Bak, [2023](https://arxiv.org/html/2606.18893#bib.bib3); Hu et al., [2024c](https://arxiv.org/html/2606.18893#bib.bib20)). The relevant cause may be separated from the emotion by several turns, spoken by another participant, or supported unevenly by textual, acoustic, and visual cues (Wang et al., [2024b](https://arxiv.org/html/2606.18893#bib.bib5); Wu et al., [2025](https://arxiv.org/html/2606.18893#bib.bib6); Li et al., [2023a](https://arxiv.org/html/2606.18893#bib.bib26); Yu et al., [2025a](https://arxiv.org/html/2606.18893#bib.bib25)). These properties make the task a structured pair decision problem: for a given emotion utterance, multiple candidate causes can be locally plausible, while only a small subset corresponds to annotated causal relations.
A common training practice is to supervise candidate pairs as positive or negative pair instances, often with cross-entropy-based objectives over valid candidates (Li et al., [2023b](https://arxiv.org/html/2606.18893#bib.bib2); Cheng et al., [2023](https://arxiv.org/html/2606.18893#bib.bib12); Li et al., [2025](https://arxiv.org/html/2606.18893#bib.bib8), [2024](https://arxiv.org/html/2606.18893#bib.bib17)). This supervision is necessary, but it mainly evaluates each candidate through its own label. It does not directly enforce the relative confidence geometry needed when several causes compete for the same emotion. In difficult cases, a gold pair can remain close to non-gold candidates that share speaker, topic, temporal proximity, or multimodal affective evidence with the true cause (Wang et al., [2025](https://arxiv.org/html/2606.18893#bib.bib19); Ju et al., [2025](https://arxiv.org/html/2606.18893#bib.bib9); Ma et al., [2025](https://arxiv.org/html/2606.18893#bib.bib23)). This vulnerability is referred to as pair-confidence brittleness.
This paper studies multimodal emotion-cause pair extraction in conversations from the perspective of reliable pair confidence. A useful pair score should satisfy two complementary requirements. First, for a fixed emotion utterance, the score of a gold cause should be separated from the strongest non-gold alternatives for the same emotion. Second, the pair score should remain stable when contextual utterances outside annotated gold pairs are partially perturbed. These requirements complement recent progress in multimodal interaction, label constraints, memory-inspired modeling, and graph-based structure by directly shaping how a pair scorer allocates confidence among plausible links (Li et al., [2025](https://arxiv.org/html/2606.18893#bib.bib8); Wu et al., [2025](https://arxiv.org/html/2606.18893#bib.bib6); Liang et al., [2025](https://arxiv.org/html/2606.18893#bib.bib7)).
To this end, this paper proposes RPCL \(Robust Pair Confidence Learning\), a training\-only framework for pair\-scoring emotion\-cause models\. RPCL adds no inference\-time fusion module, decoder, or post\-processing step\. During training, it encourages two behaviors: gold pairs should stand apart from strong competing causes for the same emotion, and pair predictions should remain consistent when non\-gold contextual evidence is partially corrupted\. At inference time, the original clean pair scorer and the same decoding pipeline are used unchanged\.
Evaluation is conducted on ECF, MECAD, and MEC4 using matched base scorers, identical input features, and unchanged decoding pipelines (Wang et al., [2023a](https://arxiv.org/html/2606.18893#bib.bib4); Wu et al., [2025](https://arxiv.org/html/2606.18893#bib.bib6); Liang et al., [2025](https://arxiv.org/html/2606.18893#bib.bib7)). Overall, the contributions are:
- We identify pair-confidence brittleness in MECPE and formulate reliable pair-confidence learning as a training problem beyond independent candidate-pair classification.
- We propose RPCL, a training-only framework that improves pair confidence by encouraging separation from strong non-gold alternatives and stability under label-preserving context perturbation.
- We verify the proposed mechanism through controlled comparisons and confidence diagnostics, showing improved pair extraction and better gold-negative confidence separation.
## 2 Related Work
##### Structured Emotion-Cause Pair Extraction
Emotion-cause pair extraction (ECPE) recasts affect analysis as link prediction between emotion and cause utterances rather than separate emotion/cause detection (Xia and Ding, [2019](https://arxiv.org/html/2606.18893#bib.bib1); Li et al., [2023b](https://arxiv.org/html/2606.18893#bib.bib2)). Conversational extensions add speaker turns and dialogue context, while recent ECPE systems explore guided experts, commonsense generation, and semantic structure for more explicit causal reasoning (Jeong and Bak, [2023](https://arxiv.org/html/2606.18893#bib.bib3); Wang et al., [2023b](https://arxiv.org/html/2606.18893#bib.bib10); Yu et al., [2025b](https://arxiv.org/html/2606.18893#bib.bib18); Wang et al., [2025](https://arxiv.org/html/2606.18893#bib.bib19)). These lines define the extraction space, but leave confidence geometry largely implicit.
##### Multimodal Emotion-Cause Pair Modeling
Multimodal ECPE further binds causal links to textual, acoustic, and visual evidence, with ECF, SemEval-2024, MECAD, and MEC4 providing representative benchmarks (Wang et al., [2023a](https://arxiv.org/html/2606.18893#bib.bib4), [2024b](https://arxiv.org/html/2606.18893#bib.bib5); Liang et al., [2025](https://arxiv.org/html/2606.18893#bib.bib7); Wu et al., [2025](https://arxiv.org/html/2606.18893#bib.bib6)). Existing systems strengthen pair modeling through holistic cross-modal interaction, causal prompting, memory-inspired aggregation, heterogeneous graphs, or LLM-enhanced generation (Hu et al., [2024c](https://arxiv.org/html/2606.18893#bib.bib20); Cheng et al., [2024](https://arxiv.org/html/2606.18893#bib.bib21); Luo et al., [2024](https://arxiv.org/html/2606.18893#bib.bib22); Ju et al., [2025](https://arxiv.org/html/2606.18893#bib.bib9); Wang et al., [2024a](https://arxiv.org/html/2606.18893#bib.bib24)). They improve evidence encoding, whereas RPCL studies the confidence surface after scoring.
##### Training Objectives for Pair Reliability
Several ECPE studies move beyond ordinary pair classification by making supervision more structurally consistent across emotion detection, cause detection, and pair extraction (Feng et al., [2023](https://arxiv.org/html/2606.18893#bib.bib13); Cheng et al., [2023](https://arxiv.org/html/2606.18893#bib.bib12); Hu et al., [2024b](https://arxiv.org/html/2606.18893#bib.bib14)). Another line improves the training signal through stronger representations or sampling strategies for imbalanced candidate pairs (Hu et al., [2024a](https://arxiv.org/html/2606.18893#bib.bib15); Su et al., [2024](https://arxiv.org/html/2606.18893#bib.bib16); Li et al., [2024](https://arxiv.org/html/2606.18893#bib.bib17)). Recent reliability-oriented studies further revisit confidence calibration, negative-sample regularization, and consistency under noisy views (Huang et al., [2026](https://arxiv.org/html/2606.18893#bib.bib28); Luo et al., [2026](https://arxiv.org/html/2606.18893#bib.bib29); He et al., [2026](https://arxiv.org/html/2606.18893#bib.bib30)). However, these objectives regularize labels, tasks, examples, or representations rather than the row-conditioned confidence geometry in which a gold cause must outrank hard alternatives for the same emotion. RPCL adds this missing row-wise pressure and corrupted-context stability while preserving clean inference.
## 3 Method
### 3.1 Problem Formulation
Given a dialogue $D=\{u_i\}_{i=1}^{n}$, each utterance $u_i$ may contain textual, acoustic, and visual information. The task of multimodal emotion-cause pair extraction is to identify the set of emotion-cause pairs
$$Y=\{(i,j): u_i \text{ expresses an emotion and } u_j \text{ is its cause}\}. \tag{1}$$
Let $\mathcal{V}\subseteq\{1,\ldots,n\}^{2}$ denote the valid candidate pair set under the adopted decoding scheme, and let
$$y_{ij}=\mathbf{1}[(i,j)\in Y],\qquad (i,j)\in\mathcal{V}, \tag{2}$$
be the pair label.
We build on a general multimodal ECPE backbone. For each dialogue, the backbone first produces a multimodal utterance representation $h_t$ for each utterance $u_t$. Based on these representations, it outputs emotion logits $z_i^{e}$ for utterance $u_i$, cause logits $z_j^{c}$ for utterance $u_j$, and pair logits $s_{ij}\in\mathbb{R}^{2}$ for each valid candidate pair $(i,j)\in\mathcal{V}$. The pair scorer can be viewed as a module that consumes the dialogue-level utterance representations and the candidate indices:
$$s_{ij}=f_{\mathrm{pair}}(\{h_t\}_{t=1}^{n},i,j). \tag{3}$$
We denote the pair distribution and the positive pair confidence by
$$\boldsymbol{\pi}_{ij}=\operatorname{softmax}(s_{ij}),\qquad p_{ij}=\boldsymbol{\pi}_{ij,1}. \tag{4}$$
Here, $p_{ij}$ is the confidence used by the pair scorer to decide whether $u_j$ is the cause of the emotion in $u_i$. The same pair-scoring interface is later used by the corrupted branch, where $\{h_t\}_{t=1}^{n}$ is replaced with the corrupted representations $\{\tilde{h}_t\}_{t=1}^{n}$.
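As a small illustration of Eq. (4), the sketch below turns two-way pair logits into the positive pair confidence $p_{ij}$. It is plain Python rather than the authors' implementation; the dictionary layout, the function names, and the example logits are assumptions made only for this example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pair_confidence(pair_logits):
    """Map {(i, j): [s_neg, s_pos]} to {(i, j): p_ij}, where p_ij = softmax(s_ij)[1]."""
    return {pair: softmax(s)[1] for pair, s in pair_logits.items()}

# Hypothetical logits for two candidate causes of emotion utterance 3.
logits = {(3, 1): [0.0, 0.0], (3, 2): [-1.0, 1.0]}
conf = pair_confidence(logits)
```

Equal logits give a confidence of 0.5, and a larger positive-class logit pushes $p_{ij}$ toward 1.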
### 3.2 Overview of Robust Pair-Confidence Learning
We view multimodal ECPE as learning a structured *pair-confidence surface* over valid emotion-cause candidates. Since all candidate pairs in a dialogue share the same conversational context, reliable pair confidence is shaped by two coupled factors: competition among alternative causes for the same emotion and stability when non-causal context is perturbed.
Standard pair-level cross entropy supervises each valid pair by its binary label, but it does not explicitly shape this row-wise confidence geometry and trains only on clean dialogues. Consequently, a gold pair may remain close to hard negatives in the same emotion row or depend on incidental non-gold context. RPCL addresses this with two training-only constraints: (i) row-conditioned margin ranking, which separates gold pairs from top-$k$ hard negatives within the same row, and (ii) corrupted-context pair stability, which preserves gold-pair evidence while perturbing non-gold utterances and aligning the resulting pair predictions. Both constraints act on the original pair scorer, and the inference pipeline remains unchanged. Figure [1](https://arxiv.org/html/2606.18893#S3.F1) summarizes the framework.
Figure 1: Overview of RPCL. CDMR (Confidence-Difference Margin Ranking) separates gold pairs from row-wise hard negatives, and CCPS (Corrupted-Context Pair Stability) aligns clean/corrupted predictions after protected context corruption.
### 3.3 Row-Conditioned Margin Ranking
We first make pair confidence discriminative within each emotion row. For an emotion utterance $u_i$, let
$$P_i=\{j:(i,j)\in\mathcal{V},\,y_{ij}=1\},\qquad N_i=\{j:(i,j)\in\mathcal{V},\,y_{ij}=0\}, \tag{5}$$
where $P_i$ and $N_i$ are the gold cause set and the non-gold candidate set for row $i$, respectively. The constraint is applied only to rows where both sets are non-empty.
Among all non-gold candidates, the most informative ones are those that the current model already considers plausible. We therefore mine the top-$k$ hard negatives according to the current pair confidence:
$$H_i=\operatorname{TopK}_{j\in N_i}(p_{ij}), \tag{6}$$
where $H_i$ contains the indices of the selected negatives. If fewer than $k$ negatives are available, all negatives are used. The $\operatorname{TopK}$ operation is used only to select negative candidates in the current forward pass. We do not back-propagate through the discrete selection itself.
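The row-wise mining in Eqs. (5) and (6) can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the function name `mine_hard_negatives`, the dictionary-based inputs, and the toy confidences are assumptions.

```python
def mine_hard_negatives(confidence, labels, k):
    """Return {i: [j, ...]} holding the k highest-confidence non-gold causes per row.

    confidence and labels are dicts keyed by valid pairs (i, j).
    Rows that lack either a gold cause or a negative are skipped.
    """
    rows = {}
    for (i, j), y in labels.items():
        pos, neg = rows.setdefault(i, ([], []))
        (pos if y == 1 else neg).append(j)
    hard = {}
    for i, (pos, neg) in rows.items():
        if not pos or not neg:
            continue
        ranked = sorted(neg, key=lambda j: confidence[(i, j)], reverse=True)
        hard[i] = ranked[:k]  # if fewer than k negatives exist, all are kept
    return hard

# Toy example: row 2 has one gold cause (j=1); row 4 has no gold cause.
labels = {(2, 0): 0, (2, 1): 1, (2, 2): 0, (4, 3): 0}
confidence = {(2, 0): 0.7, (2, 1): 0.6, (2, 2): 0.1, (4, 3): 0.9}
top1 = mine_hard_negatives(confidence, labels, k=1)
top8 = mine_hard_negatives(confidence, labels, k=8)
```

The selection returns indices only, which mirrors the statement that no gradient flows through the discrete choice.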
For each gold cause $j^{+}\in P_i$ and hard negative $j^{-}\in H_i$, the model is encouraged to satisfy
$$p_{ij^{+}}-p_{ij^{-}}\geq m_{i,j^{+},j^{-}}. \tag{7}$$
The margin should be larger when the hard negative also appears cause-like. We use the cause classifier as a confidence signal. Let
$$q_j^{c}=\operatorname{softmax}(z_j^{c})_{1} \tag{8}$$
be the probability that utterance $u_j$ is a cause utterance. The adaptive margin is defined as
$$m_{i,j^{+},j^{-}}=m_0\exp\!\left(\operatorname{sg}\!\left(q_{j^{-}}^{c}-q_{j^{+}}^{c}\right)\right), \tag{9}$$
where $m_0$ is a base margin and $\operatorname{sg}(\cdot)$ denotes stop-gradient. Thus, the cause-confidence contrast determines the required pair-confidence gap, but the margin value itself does not back-propagate into the cause classifier.
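To make the scale of Eq. (9) concrete, the sketch below evaluates the adaptive margin for a few cause-probability contrasts. It is a minimal illustration: the function name is hypothetical, the default $m_0=0.05$ is the value reported in the experimental settings, and the stop-gradient has no counterpart in plain Python, so the inputs are simply treated as constants.

```python
import math

def adaptive_margin(q_neg, q_pos, m0=0.05):
    """m = m0 * exp(q_neg - q_pos), with both cause probabilities treated as constants."""
    return m0 * math.exp(q_neg - q_pos)

same = adaptive_margin(0.5, 0.5)      # equal cause confidence: margin equals m0
harder = adaptive_margin(0.9, 0.2)    # cause-like negative: margin grows
easier = adaptive_margin(0.0, 1.0)    # extreme case: smallest possible margin
```

Because both probabilities lie in $[0,1]$, their difference lies in $[-1,1]$, so the margin always stays within $[m_0/e,\; m_0 e]$.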
This gives the Confidence-Difference Margin Ranking constraint:
$$\mathcal{L}_{\mathrm{CDMR}}=\frac{1}{|\Omega|}\sum_{(i,j^{+},j^{-})\in\Omega}\left[m_{i,j^{+},j^{-}}-\left(p_{ij^{+}}-p_{ij^{-}}\right)\right]_{+}, \tag{10}$$
where $[x]_{+}=\max(0,x)$ and
$$\Omega=\{(i,j^{+},j^{-}): j^{+}\in P_i,\; j^{-}\in H_i\}. \tag{11}$$
If $\Omega$ is empty in a mini-batch, we set $\mathcal{L}_{\mathrm{CDMR}}=0$.
The effect of Eq. ([10](https://arxiv.org/html/2606.18893#S3.E10)) is local and row-conditioned. It does not replace pair classification; rather, it focuses additional pressure on cases where the pair confidence is most likely to be brittle: gold pairs competing with high-confidence false causes for the same emotion utterance.
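A minimal plain-Python sketch of the hinge loss in Eqs. (9) to (11) is given below. It is not the authors' implementation: the function name `cdmr_loss`, the dictionary inputs, and the toy numbers are assumptions, and in an autodiff framework the margin would additionally be detached from the graph.

```python
import math

def cdmr_loss(p, q_cause, gold, hard, m0=0.05):
    """Mean hinge loss over triples (i, j_pos, j_neg).

    p: {(i, j): pair confidence}; q_cause: {j: cause probability};
    gold: {i: [gold causes]}; hard: {i: [mined hard negatives]}.
    """
    terms = []
    for i, positives in gold.items():
        for jp in positives:
            for jn in hard.get(i, []):
                margin = m0 * math.exp(q_cause[jn] - q_cause[jp])
                gap = p[(i, jp)] - p[(i, jn)]
                terms.append(max(0.0, margin - gap))
    # An empty triple set contributes zero loss.
    return sum(terms) / len(terms) if terms else 0.0

# Toy row: the gold cause (j=1) is outscored by negative j=0 but beats j=2.
p = {(2, 1): 0.6, (2, 0): 0.7, (2, 2): 0.1}
q_cause = {0: 0.5, 1: 0.5, 2: 0.5}
loss = cdmr_loss(p, q_cause, gold={2: [1]}, hard={2: [0, 2]})
```

Only the violating triple contributes here: $0.05-(0.6-0.7)=0.15$ for $j^{-}=0$ and zero for $j^{-}=2$, so the mean is 0.075.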
### 3.4 Corrupted-Context Pair Stability
The second constraint targets the stability of pair confidence under label-preserving context corruption. A model may classify a pair correctly in the clean dialogue but rely on incidental non-gold context to do so. To discourage this behavior, RPCL constructs a corrupted view that preserves annotated pair evidence while perturbing non-gold utterance representations.
Let $h_t$ denote the multimodal utterance representation of $u_t$ consumed by the pair scorer. We first identify utterances that participate in at least one gold pair:
$$G=\{t:\exists(i,j)\in Y,\; t=i \text{ or } t=j\}. \tag{12}$$
Utterances in $G$ are protected. For each utterance outside $G$, we sample a Bernoulli corruption variable $r_t\sim\operatorname{Bernoulli}(\rho)$ and construct
$$\tilde{h}_t=\begin{cases}h_t, & t\in G,\\ (1-r_t)\,h_t, & t\notin G.\end{cases} \tag{13}$$
Thus, each non-gold utterance representation is zeroed out with probability $\rho$, while all utterances involved in annotated gold pairs remain unchanged. The corrupted dialogue view is used only during training.
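The protected corruption of Eqs. (12) and (13) can be sketched as below. This is an illustrative plain-Python version under stated assumptions: utterances are 0-indexed list positions (the paper indexes from 1), representations are lists of floats, and the function name and the explicit `rng` argument are hypothetical.

```python
import random

def corrupt_context(h, gold_pairs, rho, rng):
    """Zero non-gold utterance vectors with probability rho; keep gold ones intact.

    h: list of per-utterance vectors (lists of floats), indexed by position.
    gold_pairs: iterable of (emotion_index, cause_index) tuples.
    """
    protected = {t for pair in gold_pairs for t in pair}
    corrupted = []
    for t, vec in enumerate(h):
        drop = t not in protected and rng.random() < rho
        corrupted.append([0.0] * len(vec) if drop else list(vec))
    return corrupted

h = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
gold_pairs = [(2, 0)]  # utterances 0 and 2 are protected
all_dropped = corrupt_context(h, gold_pairs, rho=1.0, rng=random.Random(0))
none_dropped = corrupt_context(h, gold_pairs, rho=0.0, rng=random.Random(0))
```

With `rho=1.0` only the unprotected utterance 1 is zeroed, which shows that gold-pair evidence survives even the most aggressive corruption rate.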
Running the same pair scorer on the corrupted representations yields pair logits $\tilde{s}_{ij}$ and pair distributions
$$\tilde{\boldsymbol{\pi}}_{ij}=\operatorname{softmax}(\tilde{s}_{ij}). \tag{14}$$
Because the gold pair evidence is protected, the original pair labels remain valid for the corrupted view. We therefore supervise the corrupted view with the same pair labels and, at the same time, align its pair distribution with the clean prediction:
$$\mathcal{L}_{\mathrm{CCPS}}=\lambda_{\mathrm{cor}}\,\operatorname{CE}_{\mathcal{V}}(\tilde{s},y)+\lambda_{\mathrm{ali}}\,\frac{1}{|\mathcal{V}|}\sum_{(i,j)\in\mathcal{V}}\left\|\tilde{\boldsymbol{\pi}}_{ij}-\operatorname{sg}(\boldsymbol{\pi}_{ij})\right\|_{2}^{2}. \tag{15}$$
Here, $\operatorname{CE}_{\mathcal{V}}$ is the average pair cross-entropy over valid candidates, and $\lambda_{\mathrm{cor}}$ and $\lambda_{\mathrm{ali}}$ control the two parts of the constraint. The stop-gradient operation makes the clean prediction the reference distribution. Consequently, the corrupted branch is trained to preserve clean pair confidence, without forcing the clean branch to move toward a noisier corrupted prediction.
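The two terms of Eq. (15) can be evaluated numerically with the short sketch below. It is an assumption-laden illustration, not the authors' code: it takes already-normalized pair distributions instead of logits (the cross-entropy of a softmax output reduces to $-\log\tilde{\pi}_{ij,y_{ij}}$), the default weights are the values reported in the experimental settings, and the stop-gradient is not represented because plain Python has no gradients.

```python
import math

def ccps_loss(pi_clean, pi_corrupt, labels, lam_cor=0.75, lam_ali=0.2):
    """lam_cor * mean CE(corrupted, y) + lam_ali * mean squared L2 distance to clean.

    pi_clean, pi_corrupt: {(i, j): [p_neg, p_pos]}; labels: {(i, j): 0 or 1}.
    """
    n = len(labels)
    ce = sum(-math.log(pi_corrupt[k][y]) for k, y in labels.items()) / n
    align = sum(
        sum((a - b) ** 2 for a, b in zip(pi_corrupt[k], pi_clean[k]))
        for k in labels
    ) / n
    return lam_cor * ce + lam_ali * align

# One gold pair whose confidence drops from 0.8 to 0.5 under corruption.
pi_clean = {(1, 0): [0.2, 0.8]}
pi_corrupt = {(1, 0): [0.5, 0.5]}
labels = {(1, 0): 1}
loss = ccps_loss(pi_clean, pi_corrupt, labels)
```

In this toy case the cross-entropy term is $\log 2$ and the alignment term is $0.3^2+0.3^2=0.18$, so the loss is $0.75\log 2+0.2\times 0.18$.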
### 3.5 Training Objective and Inference
We train the backbone with the standard supervised extraction objective on the clean dialogue:
$$\mathcal{L}_{\mathrm{sup}}=\frac{1}{n}\sum_{i=1}^{n}\operatorname{CE}(z_i^{e},y_i^{e})+\frac{1}{n}\sum_{j=1}^{n}\operatorname{CE}(z_j^{c},y_j^{c})+\operatorname{CE}_{\mathcal{V}}(s,y), \tag{16}$$
where $y_i^{e}$ and $y_j^{c}$ are the utterance-level emotion and cause labels used by the backbone, and $\operatorname{CE}_{\mathcal{V}}(s,y)$ denotes the average pair cross-entropy over valid candidate pairs.
The full RPCL objective is
$$\mathcal{L}_{\mathrm{RPCL}}=\mathcal{L}_{\mathrm{sup}}+\lambda_{\mathrm{row}}\mathcal{L}_{\mathrm{CDMR}}+\mathcal{L}_{\mathrm{CCPS}}, \tag{17}$$
where $\lambda_{\mathrm{row}}$ controls the strength of row-conditioned margin ranking, while the internal weights of $\mathcal{L}_{\mathrm{CCPS}}$ are defined in Eq. ([15](https://arxiv.org/html/2606.18893#S3.E15)).
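Substituting Eq. (15) into Eq. (17) makes every weight explicit. The second line plugs in the values reported in the experimental settings ($\lambda_{\mathrm{row}}=0.3$, $\lambda_{\mathrm{cor}}=0.75$, $\lambda_{\mathrm{ali}}=0.2$); it is only a restatement of the equations above, not an additional result.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{RPCL}}
  &= \mathcal{L}_{\mathrm{sup}}
   + \lambda_{\mathrm{row}}\,\mathcal{L}_{\mathrm{CDMR}}
   + \lambda_{\mathrm{cor}}\,\operatorname{CE}_{\mathcal{V}}(\tilde{s},y)
   + \lambda_{\mathrm{ali}}\,\frac{1}{|\mathcal{V}|}\sum_{(i,j)\in\mathcal{V}}
     \left\|\tilde{\boldsymbol{\pi}}_{ij}-\operatorname{sg}(\boldsymbol{\pi}_{ij})\right\|_{2}^{2} \\
  &= \mathcal{L}_{\mathrm{sup}}
   + 0.3\,\mathcal{L}_{\mathrm{CDMR}}
   + 0.75\,\operatorname{CE}_{\mathcal{V}}(\tilde{s},y)
   + 0.2\,\frac{1}{|\mathcal{V}|}\sum_{(i,j)\in\mathcal{V}}
     \left\|\tilde{\boldsymbol{\pi}}_{ij}-\operatorname{sg}(\boldsymbol{\pi}_{ij})\right\|_{2}^{2}
\end{aligned}
```

The clean pair cross-entropy inside $\mathcal{L}_{\mathrm{sup}}$ keeps unit weight, so the corrupted-view cross-entropy acts as a down-weighted copy of the same supervision.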
All RPCL\-specific operations are training\-time constraints\. At inference time, we use the original clean dialogue, the backbone pair scorer, and the same thresholding or decoding pipeline as the base ECPE model\. No hard\-negative mining, corrupted view, gold\-utterance protection, or additional post\-processing is introduced during inference\.
## 4 Experiments and Analysis
### 4.1 Datasets and Experimental Setup
#### 4.1.1 Datasets
We evaluate on three multimodal ECPE benchmarks: ECF, MECAD, and MEC4. ECF is an English benchmark, whereas MECAD and MEC4 are Chinese benchmarks (Wang et al., [2023a](https://arxiv.org/html/2606.18893#bib.bib4); Wu et al., [2025](https://arxiv.org/html/2606.18893#bib.bib6); Liang et al., [2025](https://arxiv.org/html/2606.18893#bib.bib7)). All three datasets provide text, audio, and video modalities under the multimodal emotion-cause pair extraction formulation, with split-wise statistics summarized in Table [1](https://arxiv.org/html/2606.18893#S4.T1). We report Pair F1 and Pair AUPRC on the test split.
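For readers unfamiliar with AUPRC, the sketch below computes average precision, one common estimator of the area under the precision-recall curve. The paper does not state which estimator it uses, so this choice, the function name, and the toy scores are assumptions; ties between scores are not handled specially here.

```python
def average_precision(scores, labels):
    """Average precision: mean of precision values at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    hits, ap = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            ap += hits / rank
    return ap / total_pos

# Toy candidate pairs sorted by confidence: positives sit at ranks 1 and 3.
scores = [0.9, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0]
ap = average_precision(scores, labels)
```

Unlike Pair F1, this quantity does not depend on a decision threshold, which is why it complements the thresholded metric.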
Table 1: Split-wise statistics for ECF, MECAD, and MEC4. Avg. pair distance denotes the mean absolute utterance-index distance $\lvert i-j\rvert$ within annotated pairs.
#### 4.1.2 Experimental Settings
The compared variants use the same datasets, splits, input features, validation-based threshold search, and decoding pipeline. Base denotes the same backbone and pair scorer trained only with $\mathcal{L}_{\mathrm{sup}}$, while RPCL adds the proposed CDMR and CCPS constraints during training. Unless otherwise stated, results are averaged over three seeds: 42, 345, and 456. Thresholds are selected on the validation split to maximize Pair F1 and are then fixed for test evaluation.
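The validation-based threshold selection described above can be sketched as a simple grid search. This is a hedged illustration: the grid values, the function names, and the simplification that F1 is computed over scored candidate pairs only are assumptions, since the paper does not spell out these details.

```python
def pair_f1(scores, labels, threshold):
    """F1 of the positive class when pairs with score >= threshold are predicted."""
    pred = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(pred, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(pred, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(pred, labels) if not p and y == 1)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(val_scores, val_labels, grid):
    """Pick the grid threshold with the highest validation F1 (first one on ties)."""
    return max(grid, key=lambda t: pair_f1(val_scores, val_labels, t))

val_scores = [0.9, 0.6, 0.4, 0.2]
val_labels = [1, 0, 1, 0]
best = select_threshold(val_scores, val_labels, grid=[0.3, 0.5, 0.7])
```

The selected threshold would then be frozen and applied unchanged to the test split, so that Base and RPCL are compared at operating points chosen by the same procedure.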
Implementation details are grouped as follows. (i) For model inputs, we use RoBERTa-base for text, wav2vec 2.0 features for audio, and CLIP features for vision (Liu et al., [2019](https://arxiv.org/html/2606.18893#bib.bib11); Baevski et al., [2020](https://arxiv.org/html/2606.18893#bib.bib27); Radford et al., [2021](https://arxiv.org/html/2606.18893#bib.bib31)), with a task-specific hidden size of 400. Dialogues are truncated to at most 35 utterances, and utterances are truncated to at most 512 tokens. (ii) For optimization, all models are trained for up to 30 epochs with early stopping patience 5, batch size 32, weight decay $10^{-4}$, gradient clipping at norm 1.0, learning rate $10^{-5}$ for RoBERTa, and $5\times 10^{-5}$ for non-backbone parameters. (iii) For RPCL, we use one hyperparameter setting across datasets, modality settings, and random seeds: $m_0=0.05$, $k=8$, corruption probability $\rho=0.30$, $\lambda_{\mathrm{row}}=0.3$, $\lambda_{\mathrm{cor}}=0.75$, and $\lambda_{\mathrm{ali}}=0.2$. Experiments are implemented with PyTorch 2.10.0 and Transformers 5.7.0 on NVIDIA A100-SXM4-80GB GPUs.
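For reference, the reported settings can be collected in a single configuration object. The values below are copied from the paragraph above; the class name and field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RPCLConfig:
    # RPCL-specific hyperparameters (shared across datasets, modalities, seeds)
    m0: float = 0.05               # base margin in Eq. (9)
    top_k: int = 8                 # hard negatives mined per emotion row
    corruption_prob: float = 0.30  # rho in Eq. (13)
    lambda_row: float = 0.3
    lambda_cor: float = 0.75
    lambda_ali: float = 0.2
    # Optimization
    max_epochs: int = 30
    patience: int = 5
    batch_size: int = 32
    weight_decay: float = 1e-4
    grad_clip_norm: float = 1.0
    lr_roberta: float = 1e-5
    lr_other: float = 5e-5
    # Inputs
    hidden_size: int = 400
    max_utterances: int = 35
    max_tokens: int = 512
    seeds: tuple = (42, 345, 456)

cfg = RPCLConfig()
```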
### 4.2 Main Results on Complete Multimodal Evidence
Across the complete TAV setting, RPCL consistently improves controlled pair extraction over Base, with mean Pair F1 gains of +2.58 on ECF, +2.59 on MECAD, and +2.83 on MEC4, together with threshold-independent AUPRC gains on all three datasets (Table [2](https://arxiv.org/html/2606.18893#S4.T2)). Because Base and RPCL share the same validation-based threshold search and inference pipeline, these gains are attributable to the training objective rather than decoding or operating-point changes. The largest F1 and AUPRC gains occur on MEC4, where Base has the lowest F1, suggesting that the confidence constraints are most useful in the hardest evaluated setting.
Table 2: Main text-audio-video (TAV) results. Values are mean Pair F1/AUPRC percentages over three seeds, with standard deviation shown as subscript; deltas are computed from the reported means.
Table 3: Standard training-objective controls under text-audio-video (TAV). Values are mean Pair F1/AUPRC percentages over three seeds; subscripts show standard deviation.
### 4.3 Comparison with Standard Training Objectives
Against two conventional objective controls in the same TAV setting, RPCL gives the best Pair F1 on all three datasets and the best AUPRC on MECAD and MEC4; the only exception is ECF AUPRC, where Fixed-margin ranking is best (Table [3](https://arxiv.org/html/2606.18893#S4.T3)). Fixed-margin ranking keeps row-wise ranking but replaces the cause-confidence-aware adaptive margin with a fixed margin and removes corrupted-context stability. Utterance-dropout consistency removes row-wise margin ranking, corrupts utterance-level representations without gold-utterance protection, and aligns clean/corrupted pair distributions without corrupted-view pair supervision. Both controls improve over Base, but their weaker cross-dataset profile suggests that adaptive row-wise separation and protected corrupted-context pair stability are complementary rather than reducible to either standard objective alone. The ECF AUPRC exception suggests fixed margins can sharpen ranking, although RPCL yields stronger balanced extraction.
### 4.4 Comparison with Published Systems
The published-system comparison provides contextual positioning rather than an isolated component test, because the compared ECPE and MECPE systems differ in architectures, modalities, input features, training protocols, and evaluation settings (Table [4](https://arxiv.org/html/2606.18893#S4.T4)). This caveat is especially relevant for MEC4, where M3F remains stronger under a different architecture. The selected systems cover heuristic and two-stage extraction methods, multimodal interaction and label-constraint models, graph-based methods, generative frameworks, and large-model-based approaches. We therefore use this comparison as broader literature context, while the controlled analyses below assess the effect of robust pair-confidence learning.
Table 4: Published ECPE/MECPE comparison on official splits. Baselines use reported best modality settings; RPCL reports three-seed means.
### 4.5 Pair-Confidence Diagnostics
The confidence diagnostics support the proposed mechanism beyond the downstream F1 gains: RPCL increases the mean gold-minus-negative pair-probability gap by 4.72, 1.69, and 3.46 percentage points on ECF, MECAD, and MEC4, respectively (Figure [2](https://arxiv.org/html/2606.18893#S4.F2)). The precision-recall movements differ by dataset, with ECF and MEC4 mainly gaining recall and MECAD mainly gaining precision, suggesting that RPCL does not simply bias the model toward more positive predictions. Gold-pair confidence rises on all three datasets, while hard-negative and all-candidate margin-violation severity decrease, which is consistent with better separation from competing candidates.
Figure 2: Pair-confidence diagnostics under TAV. Results show gold-negative confidence gaps, precision-recall operating points, and desired-direction changes in gold confidence and margin-violation severity.
Figure 3: RPCL gains over matched Base across modality settings. Cells show Pair F1 and Pair AUPRC changes in percentage points, averaged over three seeds.
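One plausible way to compute a gold-minus-negative confidence gap is sketched below. The paper does not give the exact definition, so treating the gap as the mean confidence of gold pairs minus the mean confidence of non-gold pairs is an assumption, as are the function name and the toy values.

```python
def confidence_gap(p, labels):
    """Mean confidence of gold pairs minus mean confidence of non-gold pairs.

    p and labels are dicts keyed by candidate pair (i, j).
    Returns None when either group is empty.
    """
    gold = [p[k] for k, y in labels.items() if y == 1]
    neg = [p[k] for k, y in labels.items() if y == 0]
    if not gold or not neg:
        return None
    return sum(gold) / len(gold) - sum(neg) / len(neg)

p = {(2, 1): 0.6, (2, 0): 0.7, (2, 2): 0.1}
labels = {(2, 1): 1, (2, 0): 0, (2, 2): 0}
gap = confidence_gap(p, labels)
```

In this toy row the gap is $0.6-0.4=0.2$, that is, 20 percentage points; comparing this quantity between Base and RPCL would yield a difference in the units reported above.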
### 4.6 Modality-Specific Behavior
Across modality configurations, including text-only input without acoustic or visual evidence, the pair-confidence objective remains useful beyond the complete TAV setting: matched RPCL-Base gains are positive for Pair F1 and Pair AUPRC under T, T+A, T+V, and T+A+V (Figure [3](https://arxiv.org/html/2606.18893#S4.F3)). The T+A+V column corresponds to the complete-evidence setting in Table [2](https://arxiv.org/html/2606.18893#S4.T2). The other columns show that the objective is not tied to a single input configuration, without isolating the relative contribution of each modality or the design of fusion modules.
### 4.7 Ablation Study
The ablations indicate that discriminative separation and stability both contribute to the final TAV performance: CDMR alone and CCPS alone improve over Base on all datasets, and the full RPCL objective yields the best deltas on all reported metrics (Table [5](https://arxiv.org/html/2606.18893#S4.T5)). Partial removals show the same pattern internally: removing consistency or corrupted-view supervision weakens the CCPS path, while removing the adaptive margin or top-$k$ negative selection weakens CDMR. Overall, removing either hard-negative separation or clean/corrupted stability reduces the RPCL gains.
Table 5: TAV ablation deltas over Base. Rows isolate or remove CDMR/CCPS components; positive values indicate percentage-point gains.
## 5 Limitations
RPCL is limited in three aspects. First, it is a training objective for pair-scoring backbones rather than a new multimodal encoder or decoder, so it may be complementary to stronger architectures. Second, the corrupted-context constraint uses representation-level perturbation and does not fully cover real-world noise such as ASR errors, missing visual frames, domain shift, or cultural variation. Third, because the task involves conversational multimodal emotion data, predictions should be interpreted only as annotated emotion-cause links, rather than as evidence of the underlying internal causes of emotion or as a basis for high-stakes decisions.
## 6 Conclusion
This paper addresses pair-confidence brittleness in multimodal emotion-cause pair extraction by framing pair extraction as reliable confidence learning over competing candidate pairs. We propose RPCL, a training-only framework that combines row-conditioned margin ranking with corrupted-context pair stability while leaving the original inference pipeline unchanged. Across ECF, MECAD, and MEC4, RPCL consistently improves Pair F1 and Pair AUPRC over a matched base model, with diagnostics showing larger gold-negative confidence gaps and reduced margin-violation severity. These results show that explicitly shaping the pair-confidence surface, rather than only enriching representations or decoders, provides an effective and lightweight strategy for multimodal ECPE.
## References
- A. Baevski, Y. Zhou, A. Mohamed, and M. Auli (2020) Wav2vec 2.0: a framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems 33, pp. 12449–12460.
- Z. Cheng, F. Niu, Y. Lin, Z. Cheng, X. Peng, and B. Zhang (2024) MIPS at semeval-2024 task 3: multimodal emotion-cause pair extraction in conversations with multimodal language models. In Proceedings of the 18th International Workshop on Semantic Evaluation, SemEval@NAACL 2024, Mexico City, Mexico, June 20-21, 2024, A. Kr. Ojha, A. S. Dogruöz, H. T. Madabushi, G. D. S. Martino, S. Rosenthal, and A. Rosá (Eds.), pp. 667–674. External Links: [Document](https://dx.doi.org/10.18653/V1/2024.SEMEVAL-1.97)
- Z. Cheng, Z. Jiang, Y. Yin, C. Wang, S. Ge, and Q. Gu (2023) A consistent dual-mrc framework for emotion-cause pair extraction. ACM Trans. Inf. Syst. 41(4), pp. 105:1–105:27. External Links: [Document](https://dx.doi.org/10.1145/3558548)
- H. Feng, J. Liu, J. Zheng, H. Chen, X. Shang, and Q. Ma (2023) Joint constrained learning with boundary-adjusting for emotion-cause pair extraction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Toronto, Canada, pp. 1118–1131. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.acl-long.62)
- Q. He, Y. Li, H. Ye, J. Wang, X. Liao, P. Heng, S. Ermon, J. Zou, and A. Yao (2026) reAR: rethinking visual autoregressive models via token-wise consistency regularization. In International Conference on Learning Representations.
- G. Hu, Y. Zhao, and G. Lu (2024a) Improving representation with hierarchical contrastive learning for emotion-cause pair extraction. IEEE Transactions on Affective Computing 15(4), pp. 1997–2011. External Links: [Document](https://dx.doi.org/10.1109/TAFFC.2024.3391854)
- G. Hu, Y. Zhao, and G. Lu (2024b) Unifying emotion-oriented and cause-oriented predictions for emotion-cause pair extraction. Neural Networks 178, pp. 106431. External Links: [Document](https://dx.doi.org/10.1016/J.NEUNET.2024.106431)
- G. Hu, Z. Zhu, D. Hershcovich, L. Hu, H. Seifi, and J. Xie (2024c) UniMEEC: towards unified multimodal emotion recognition and emotion cause. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 5248–5261. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.302)
- J\. Huang, J\. Xu, X\. Shi, P\. Hu, L\. Feng, and X\. Zhu \(2026\)Revisiting confidence calibration for misclassification detection in VLMs\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px3.p1.1)\.
- D\. Jeong and J\. Bak \(2023\)Conversational emotion\-cause pair extraction with guided mixture of experts\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,Dubrovnik, Croatia,pp\. 3288–3298\.External Links:[Document](https://dx.doi.org/10.18653/v1/2023.eacl-main.240)Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Ju, D\. Zhang, J\. Li, S\. Li, and G\. Zhou \(2025\)Enhanced generative framework with LLMs for multimodal emotion\-cause pair extraction in conversations\.IEEE Transactions on Multimedia27,pp\. 4924–4935\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p2.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Li, H\. Fei, F\. Li, T\. Chua, and D\. Ji \(2025\)Multimodal emotion\-cause pair extraction with holistic interaction and label constraint\.ACM Transactions on Multimedia Computing, Communications, and Applications21\(11\),pp\. 307:1–307:19\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p2.1),[§1](https://arxiv.org/html/2606.18893#S1.p3.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.20.16.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.9.5.1.1.1)\.
- M\. Li, H\. Zhao, T\. Gu, D\. Ying, and B\. Liao \(2024\)Class imbalance mitigation: A select\-then\-extract learning framework for emotion\-cause pair extraction\.Expert Syst\. Appl\.236,pp\. 121386\.External Links:[Document](https://dx.doi.org/10.1016/J.ESWA.2023.121386)Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p2.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px3.p1.1)\.
- Q\. Li, P\. Huang, J\. Chen, J\. Wu, Y\. Xu, and P\. Lin \(2023a\)Multimodal emotion recognition in conversation with mutual information maximization and contrastive loss\.InProceedings of the 22nd Chinese National Conference on Computational Linguistics,M\. Sun, B\. Qin, X\. Qiu, J\. Jiang, and X\. Han \(Eds\.\),Harbin, China,pp\. 264–276\(zho\)\.Note:In ChineseCited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1)\.
- W\. Li, Y\. Li, V\. Pandelea, M\. Ge, L\. Zhu, and E\. Cambria \(2023b\)ECPEC: emotion\-cause pair extraction in conversations\.IEEE Transactions on Affective Computing14\(3\),pp\. 1754–1765\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1),[§1](https://arxiv.org/html/2606.18893#S1.p2.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px1.p1.1)\.
- Q\. Liang, Y\. Shen, T\. Chen, and L\. Zhang \(2025\)M3HG: multimodal, multi\-scale, and multi\-type node heterogeneous graph for emotion cause triplet extraction in conversations\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 11416–11431\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p3.1),[§1](https://arxiv.org/html/2606.18893#S1.p5.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.18893#S4.SS1.SSS1.p1.2),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.1.1.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.14.10.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.15.11.1.1.1)\.
- Y\. Liu, M\. Ott, N\. Goyal, J\. Du, M\. Joshi, D\. Chen, O\. Levy, M\. Lewis, L\. Zettlemoyer, and V\. Stoyanov \(2019\)RoBERTa: a robustly optimized BERT pretraining approach\.CoRRabs/1907\.11692\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.18893#S4.SS1.SSS2.p2.9)\.
- J\. Luo, Y\. Yin, Y\. Xie, J\. Ru, X\. Zhuang, M\. He, A\. Liu, Z\. Xiong, and D\. Yang \(2026\)SupCLAP: controlling optimization trajectory drift in audio\-text contrastive learning with support vector regularization\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Luo, H\. Zhang, S\. Wu, B\. Li, H\. Han, and H\. Fei \(2024\)NUS\-emo at SemEval\-2024 task 3: instruction\-tuning LLM for multimodal emotion\-cause analysis in conversations\.InProceedings of the 18th International Workshop on Semantic Evaluation \(SemEval\-2024\),A\. Kr\. Ojha, A\. S\. Doğruöz, H\. Tayyar Madabushi, G\. Da San Martino, S\. Rosenthal, and A\. Rosá \(Eds\.\),Mexico City, Mexico,pp\. 1589–1596\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.semeval-1.226)Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Ma, J\. Yu, F\. Wang, H\. Cao, and R\. Xia \(2025\)From extraction to generation: multimodal emotion\-cause pair generation in conversations\.IEEE Trans\. Affect\. Comput\.16\(2\),pp\. 586–597\.External Links:[Document](https://dx.doi.org/10.1109/TAFFC.2024.3446646)Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p2.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§4\.1\.2](https://arxiv.org/html/2606.18893#S4.SS1.SSS2.p2.9)\.
- X\. Su, Z\. Huang, Y\. Su, B\. D\. Trisedya, Y\. Dou, and Y\. Zhao \(2024\)Hierarchical shared encoder with task\-specific transformer layer selection for emotion\-cause pair extraction\.IEEE Transactions on Affective Computing15\(4\),pp\. 1934–1948\.External Links:[Document](https://dx.doi.org/10.1109/TAFFC.2024.3390223)Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px3.p1.1)\.
- F\. Wang, Z\. Ding, R\. Xia, Z\. Li, and J\. Yu \(2023a\)Multimodal emotion\-cause pair extraction in conversations\.IEEE Transactions on Affective Computing14\(3\),pp\. 1832–1844\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1),[§1](https://arxiv.org/html/2606.18893#S1.p5.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.18893#S4.SS1.SSS1.p1.2),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.19.15.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.7.3.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.8.4.1.1.1)\.
- F\. Wang, H\. Ma, X\. Shen, J\. Yu, and R\. Xia \(2024a\)Observe before generate: emotion\-cause aware video caption for multimodal emotion cause generation in conversations\.InProceedings of the 32nd ACM International Conference on Multimedia, MM 2024, Melbourne, VIC, Australia, 28 October 2024 \- 1 November 2024,J\. Cai, M\. S\. Kankanhalli, B\. Prabhakaran, S\. Boll, R\. Subramanian, L\. Zheng, V\. K\. Singh, P\. César, L\. Xie, and D\. Xu \(Eds\.\),pp\. 5820–5828\.External Links:[Document](https://dx.doi.org/10.1145/3664647.3681601)Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1)\.
- F\. Wang, H\. Ma, R\. Xia, J\. Yu, and E\. Cambria \(2024b\)SemEval\-2024 task 3: multimodal emotion cause analysis in conversations\.InProceedings of the 18th International Workshop on Semantic Evaluation,pp\. 2039–2050\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1)\.
- F\. Wang, J\. Yu, and R\. Xia \(2023b\)Generative emotion cause triplet extraction in conversations with commonsense knowledge\.InFindings of the Association for Computational Linguistics: EMNLP 2023,pp\. 3952–3963\.Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px1.p1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.14.10.1.1.1)\.
- Y\. Wang, Y\. Li, K\. Yu, and J\. Yang \(2025\)A semantic structure\-based emotion\-guided model for emotion\-cause pair extraction\.Pattern Recognition161,pp\. 111296\.External Links:[Document](https://dx.doi.org/10.1016/J.PATCOG.2024.111296)Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p2.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Wu, X\. Ju, D\. Zhang, S\. Li, E\. Cambria, and G\. Zhou \(2025\)Emotion across modalities and cultures: multilingual multimodal emotion\-cause analysis with memory\-inspired framework\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 5775–5783\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1),[§1](https://arxiv.org/html/2606.18893#S1.p3.1),[§1](https://arxiv.org/html/2606.18893#S1.p5.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px2.p1.1),[§4\.1\.1](https://arxiv.org/html/2606.18893#S4.SS1.SSS1.p1.2),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.19.15.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.20.16.1.1.1),[Table 4](https://arxiv.org/html/2606.18893#S4.T4.3.3.1.1.1)\.
- R\. Xia and Z\. Ding \(2019\)Emotion\-Cause Pair Extraction: a new task to emotion analysis in texts\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,pp\. 1003–1012\.Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1),[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px1.p1.1)\.
- F\. Yu, J\. Guo, Z\. Wu, and X\. Dai \(2025a\)Beyond verbal cues: emotional contagion graph network for causal emotion entailment\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 1755–1767\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.88),ISBN 979\-8\-89176\-256\-5Cited by:[§1](https://arxiv.org/html/2606.18893#S1.p1.1)\.
- Z\. Yu, X\. Xiao, and W\. Mao \(2025b\)One unified model for diverse tasks: emotion cause analysis via self\-promote cognitive structure modeling\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),L\. Chiruzzo, A\. Ritter, and L\. Wang \(Eds\.\),Albuquerque, New Mexico,pp\. 10278–10293\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.naacl-long.516),ISBN 979\-8\-89176\-189\-6Cited by:[§2](https://arxiv.org/html/2606.18893#S2.SS0.SSS0.Px1.p1.1)\.Similar Articles