MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy

arXiv cs.AI 06/29/26, 04:00 AM Papers
Summary
This paper introduces MER-R1, a reinforcement learning framework that synergizes fast and slow thinking for multimodal emotion recognition. It achieves state-of-the-art performance by jointly optimizing recall and precision through dual-objective disentanglement and slow-fast confidence calibration.
arXiv:2606.27652v1 Announce Type: new Abstract: We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by triggering direct answers often outperforms slow thinking after deliberative reasoning. Our empirical analyses show that fast thinking improves recall with broader and more confident predictions, whereas slow thinking favors precision through conservative filtering of incorrect categories. Building on these insights, we propose MER-R1, a reinforcement learning framework that turns slow-fast complementarity into explicit optimization. Dual-objective disentanglement separates recall and precision into two optimization signals, allowing them to be jointly optimized rather than traded off against each other. Slow-fast confidence calibration further aligns the final slow-thinking answer with fast-thinking intuition, strengthening correct emotions while suppressing incorrect ones. In this way, MER-R1 unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. We further provide theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization. Extensive experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves state-of-the-art performance and makes reasoning genuinely benefit emotion recognition.
Cached at: 06/29/26, 05:27 AM
# MER-R1: Multimodal Emotion Reasoning via Slow-Fast Thinking Synergy
Source: [https://arxiv.org/html/2606.27652](https://arxiv.org/html/2606.27652)
Zhiyuan Han$^{1,2,4}$, Beier Zhu$^{1,\dagger}$, Wenwen Tong$^{2}$, Chengwei Qin$^{3}$, Xinyi Wang$^{1}$, Jiayu Zhang$^{1}$, Jiangnan Chen$^{2}$, Hewei Guo$^{2}$, Dongchuan Ran$^{2}$, Lewei Lu$^{2}$, Xun Yang$^{1,\dagger}$

$^{1}$University of Science and Technology of China, $^{2}$SenseTime Research, $^{3}$The Hong Kong University of Science and Technology (Guangzhou), $^{4}$Institute of Artificial Intelligence, Hefei Comprehensive National Science Center

beier.zhu@ustc.edu.cn, xyang21@ustc.edu.cn

$^{\dagger}$Corresponding authors.

###### Abstract

We find that explicit reasoning does not necessarily translate into better multimodal emotion recognition (MER) accuracy, even though it makes predictions more interpretable. Specifically, for reasoning-based MLLMs, fast thinking by triggering direct answers often outperforms slow thinking after deliberative reasoning. Our empirical analyses show that fast thinking improves recall with broader and more confident predictions, whereas slow thinking favors precision through conservative filtering of incorrect categories. Building on these insights, we propose MER-R1, a reinforcement learning framework that turns slow-fast complementarity into explicit optimization. Dual-objective disentanglement separates recall and precision into two optimization signals, allowing them to be jointly optimized rather than traded off against each other. Slow-fast confidence calibration further aligns the final slow-thinking answer with fast-thinking intuition, strengthening correct emotions while suppressing incorrect ones. In this way, MER-R1 unifies the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking. We further provide theoretical justification for this synergy, showing that it mitigates variance-induced interference during optimization. Extensive experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves state-of-the-art performance and makes reasoning genuinely benefit emotion recognition.

## 1 Introduction

Recent advances in Multimodal Large Language Models (MLLMs) \[[6](https://arxiv.org/html/2606.27652#bib.bib11),[28](https://arxiv.org/html/2606.27652#bib.bib12),[34](https://arxiv.org/html/2606.27652#bib.bib13)\] are transforming affective computing \[[24](https://arxiv.org/html/2606.27652#bib.bib17),[12](https://arxiv.org/html/2606.27652#bib.bib34),[27](https://arxiv.org/html/2606.27652#bib.bib18)\], shifting multimodal emotion recognition \[[14](https://arxiv.org/html/2606.27652#bib.bib45),[15](https://arxiv.org/html/2606.27652#bib.bib46),[25](https://arxiv.org/html/2606.27652#bib.bib36),[1](https://arxiv.org/html/2606.27652#bib.bib37)\] from fixed-label prediction to Open-Vocabulary Multimodal Emotion Recognition (OV-MER) \[[11](https://arxiv.org/html/2606.27652#bib.bib50),[13](https://arxiv.org/html/2606.27652#bib.bib10)\], where models recognize diverse and open-ended human emotions. To improve trustworthiness \[[16](https://arxiv.org/html/2606.27652#bib.bib49)\], recent RLVR-based approaches \[[4](https://arxiv.org/html/2606.27652#bib.bib19),[17](https://arxiv.org/html/2606.27652#bib.bib21),[36](https://arxiv.org/html/2606.27652#bib.bib20)\] further equip emotion MLLMs with explicit reasoning chains that ground emotion predictions in visual, acoustic, and textual cues.

Despite this promise, we find that such reasoning does not necessarily lead to stronger emotion recognition. Specifically, we consider two thinking modes for an R1-like emotion reasoning model \[[34](https://arxiv.org/html/2606.27652#bib.bib13),[17](https://arxiv.org/html/2606.27652#bib.bib21)\]: (1) slow thinking produces an answer after a standard reasoning process, and (2) fast thinking elicits an answer directly without reasoning. Despite well-formed reasoning traces (Figure [1](https://arxiv.org/html/2606.27652#S1.F1)(a)), slow thinking consistently underperforms fast thinking on current OV-MER benchmarks such as MER-UniBench \[[11](https://arxiv.org/html/2606.27652#bib.bib50),[13](https://arxiv.org/html/2606.27652#bib.bib10)\] (Figure [1](https://arxiv.org/html/2606.27652#S1.F1)(b)). This reveals a thinking paradox in MER:

Reasoning improves interpretability, yet fails to improve recognition accuracy itself\.

To understand this paradox, we analyze fast and slow thinking in Section [3](https://arxiv.org/html/2606.27652#S3) from prediction and confidence perspectives. At the prediction level, fast thinking exhibits stronger System-1 intuition \[[10](https://arxiv.org/html/2606.27652#bib.bib9)\], producing broader emotion coverage and higher recall, whereas slow thinking makes more concentrated predictions and favors precision. At the confidence level, fast thinking is more confident on correct emotions, while slow thinking more conservatively suppresses incorrect categories but may also lower confidence in correct ones. These findings suggest two slow-fast synergy desiderata for effective MER reasoning: preserving the recall-oriented coverage and correct-category confidence of fast thinking, while retaining the precision-oriented selectivity of slow thinking.

![Refer to caption](https://arxiv.org/html/2606.27652v1/x1.png)
Figure 1: Thinking paradox and slow-fast synergy in MER. (a) Motivating example: slow thinking is cautious but may miss valid emotions, whereas fast thinking improves coverage but introduces noise. Slow-fast synergy combines their complementary strengths. (b) Evaluation summary on 9 datasets: fast thinking counter-intuitively outperforms slow thinking.

To realize these desiderata, we propose MER-R1, a reinforcement learning (RL) framework with two components. First, dual-objective disentanglement separates recall and precision into two optimization targets and preserves them in both reward and advantage spaces. Unlike prior MER reasoning methods that directly optimize an $F_1$ reward \[[17](https://arxiv.org/html/2606.27652#bib.bib21)\], our design prevents recall and precision from interfering with each other during advantage normalization. Our theoretical analysis further shows that standard $F_1$ optimization can be biased toward the objective with larger group-wise variation, whereas the dual-objective advantage establishes a balanced coupling to both recall and precision signals. Second, slow-fast confidence calibration transfers fast thinking's confidence behavior to the final slow-thinking answer while preserving slow thinking's selectivity. Specifically, we compare category-level confidence between the two thinking modes and calibrate correct and incorrect categories in opposite directions: correct categories are encouraged to retain or exceed fast-thinking confidence, whereas incorrect categories are encouraged to remain suppressed. This enables MER-R1 to strengthen correct emotions without absorbing the noisy over-coverage of fast thinking.

Our contributions are three-fold: (1) Findings: We reveal a thinking paradox in MER: for reasoning-based MLLMs, fast thinking can outperform slow thinking. We diagnose this paradox from prediction and confidence perspectives and derive two desiderata for slow-fast synergy. (2) Methodology: We propose MER-R1, an RL framework with dual-objective disentanglement and slow-fast confidence calibration, enabling joint recall–precision optimization and confidence calibration over correct and incorrect categories. (3) Performance: MER-R1 achieves state-of-the-art results on MER-UniBench and MME-Emotion, and validates that satisfying the two slow-fast synergy desiderata makes reasoning genuinely beneficial for emotion recognition rather than merely improving interpretability.

## 2 Related work

MLLM reasoning. Recent multimodal large language models (MLLMs) \[[6](https://arxiv.org/html/2606.27652#bib.bib11),[35](https://arxiv.org/html/2606.27652#bib.bib16),[34](https://arxiv.org/html/2606.27652#bib.bib13),[18](https://arxiv.org/html/2606.27652#bib.bib48),[19](https://arxiv.org/html/2606.27652#bib.bib138)\] have acquired reasoning ability through reinforcement learning with verifiable rewards (RLVR) \[[4](https://arxiv.org/html/2606.27652#bib.bib19),[10](https://arxiv.org/html/2606.27652#bib.bib9),[32](https://arxiv.org/html/2606.27652#bib.bib3)\], allowing them to explicitly reason over visual, acoustic, and textual signals. Building on this capability, recent work \[[23](https://arxiv.org/html/2606.27652#bib.bib31),[33](https://arxiv.org/html/2606.27652#bib.bib29),[22](https://arxiv.org/html/2606.27652#bib.bib32),[31](https://arxiv.org/html/2606.27652#bib.bib26)\] seeks to tackle increasingly complex multimodal reasoning tasks. However, recent studies suggest that slow thinking is not always better than fast thinking. VideoAuto-R1 \[[20](https://arxiv.org/html/2606.27652#bib.bib5)\] observes this phenomenon on video understanding tasks and addresses it with an answer-think-answer paradigm that rewards both the early and final answers while assigning a larger weight to the latter. Yet this design remains at the output level and does not model why fast and slow thinking differ, let alone how to realize synergy between them. In contrast, our work studies their underlying complementarity and turns it into an explicit synergy framework for multimodal reasoning, thereby making slow thinking genuinely more effective.

Multimodal emotion understanding. Multimodal emotion understanding has recently shifted from closed-set classification to more open and reasoning-oriented settings \[[16](https://arxiv.org/html/2606.27652#bib.bib49),[2](https://arxiv.org/html/2606.27652#bib.bib51),[5](https://arxiv.org/html/2606.27652#bib.bib22),[11](https://arxiv.org/html/2606.27652#bib.bib50)\]. Open-Vocabulary Multimodal Emotion Recognition (OV-MER) \[[13](https://arxiv.org/html/2606.27652#bib.bib10)\] requires models to predict free-form emotion words beyond a fixed label space, while MME-Emotion \[[41](https://arxiv.org/html/2606.27652#bib.bib8)\] further evaluates both recognition and reasoning ability with LLM-based judges. Recent emotion MLLMs \[[42](https://arxiv.org/html/2606.27652#bib.bib54),[36](https://arxiv.org/html/2606.27652#bib.bib20),[17](https://arxiv.org/html/2606.27652#bib.bib21),[26](https://arxiv.org/html/2606.27652#bib.bib4)\] increasingly introduce reasoning for better interpretability and reinforcement learning in an attempt to further improve emotion recognition. However, whether reasoning itself truly improves emotion recognition remains unclear. We identify and systematically study the *thinking paradox* in multimodal emotion reasoning, where fast thinking can outperform slow thinking on recognition, and resolve it through an explicit slow-fast thinking synergy framework.

## 3 Understanding the thinking paradox

This section investigates the counter-intuitive finding that fast thinking outperforms slow thinking in MER tasks (Figure [1](https://arxiv.org/html/2606.27652#S1.F1)(b)). We show that fast thinking improves recall with more diverse and confident predictions, whereas slow thinking is more conservative, favoring precision by suppressing incorrect categories but lowering confidence in correct ones. Under current evaluation, which mainly relies on recall or F1 metrics, the advantage of slow thinking is not fully reflected. We begin by introducing the emotion reasoning formulation and the emotion-wheel-based evaluation protocol.

### 3.1 Problem formulation and emotion wheel-based evaluation

![Refer to caption](https://arxiv.org/html/2606.27652v1/x2.png)
Figure 2: Emotion wheel.

Given a multimodal input $x$, the policy $\pi_{\theta}$ generates an output $y=(y^{\mathsf{cot}},y^{\mathsf{ans}})$, where $y^{\mathsf{cot}}$ denotes the chain-of-thought reasoning trace and $y^{\mathsf{ans}}$ denotes the final emotion answer. Slow thinking follows the standard think-then-answer format, while fast thinking is elicited with an answer-only prompt that directly produces $y^{\mathsf{ans}}$ without reasoning, *i.e.*, $y^{\mathsf{cot}}=\emptyset$. Since both the model answer $y^{\mathsf{ans}}$ and the ground-truth emotions $y^{\mathsf{gt}}$ may contain multiple free-form emotion words, we follow prior work \[[11](https://arxiv.org/html/2606.27652#bib.bib50),[13](https://arxiv.org/html/2606.27652#bib.bib10)\] and map them through the emotion wheel to level-1 categories. For example, as shown in Figure [2](https://arxiv.org/html/2606.27652#S3.F2), the innermost sectors define level-1 emotion categories, such as *angry*, *happy*, and *sad*. Fine-grained emotion words in the outer sectors are then mapped to these categories; for instance, *joyful*, *content*, and *cheerful* are mapped to the level-1 category *happy*.

Let $\hat{\mathcal{Y}}$ and $\mathcal{G}$ denote the predicted and ground-truth level-1 category sets obtained from $y^{\mathsf{ans}}$ and $y^{\mathsf{gt}}$, respectively. We evaluate the prediction by set-level matching. Specifically, recall, precision, and $F_1$ are defined as:

$$\mathsf{R}=\frac{|\hat{\mathcal{Y}}\cap\mathcal{G}|}{|\mathcal{G}|},\qquad \mathsf{P}=\frac{|\hat{\mathcal{Y}}\cap\mathcal{G}|}{|\hat{\mathcal{Y}}|},\qquad \mathsf{F}=\frac{2\,\mathsf{P}\,\mathsf{R}}{\mathsf{P}+\mathsf{R}}.$$

Current OV-MER evaluation mainly relies on recall-based metrics \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] or F1-based metrics \[[13](https://arxiv.org/html/2606.27652#bib.bib10)\]. We further examine precision and category-level confidence to diagnose the thinking paradox.
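The set-level metrics above can be illustrated with a minimal Python sketch. The word-to-category dictionary `EMOTION_WHEEL` is a hypothetical, heavily truncated stand-in for the real emotion wheel (only the *joyful*, *content*, *cheerful* → *happy* mapping comes from the text; the other entries are assumptions), and returning 0 for empty sets is also an assumption, since the paper does not specify that case.

```python
# Hypothetical, truncated word-to-category map; the real emotion wheel is much larger.
EMOTION_WHEEL = {
    "joyful": "happy", "content": "happy", "cheerful": "happy", "happy": "happy",
    "furious": "angry", "angry": "angry",   # assumed entries
    "gloomy": "sad", "sad": "sad",          # assumed entries
}


def to_level1(words):
    """Map free-form emotion words to a set of level-1 categories (unknown words are dropped)."""
    return {EMOTION_WHEEL[w.lower()] for w in words if w.lower() in EMOTION_WHEEL}


def set_metrics(pred_words, gt_words):
    """Return (recall, precision, F1) from set-level matching of level-1 categories."""
    pred, gt = to_level1(pred_words), to_level1(gt_words)
    hits = len(pred & gt)
    recall = hits / len(gt) if gt else 0.0        # assumption: 0 when the ground truth is empty
    precision = hits / len(pred) if pred else 0.0  # assumption: 0 when nothing is predicted
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f1


# A broad prediction covers the ground truth but also adds a spurious category.
recall, precision, f1 = set_metrics(["joyful", "furious"], ["cheerful"])
print(recall, precision, round(f1, 4))
```

In this toy case the prediction reaches full recall but only half precision, which is the kind of over-coverage that a recall-only metric would not penalize.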

![Refer to caption](https://arxiv.org/html/2606.27652v1/x3.png)
Figure 3: Analyses of the thinking paradox on five MER benchmarks. (a) Performance gaps in recall ($\mathsf{R}$), precision ($\mathsf{P}$), and $F_1$ ($\mathsf{F}$), where $\Delta\mathsf{M}=\mathsf{M}_{\mathsf{fast}}-\mathsf{M}_{\mathsf{slow}}$ for $\mathsf{M}\in\{\mathsf{R},\mathsf{P},\mathsf{F}\}$. (b) Prediction diversity by the average number of emotion categories. (c) Confidence on ground-truth emotion categories. (d) Confidence margin between ground-truth and hard-negative emotion categories.

### 3.2 Precision–recall trade-off

In Figure [3](https://arxiv.org/html/2606.27652#S3.F3)(a), we report the performance gaps between fast and slow thinking in terms of precision, recall, and F1 on five widely used datasets \[[13](https://arxiv.org/html/2606.27652#bib.bib10),[12](https://arxiv.org/html/2606.27652#bib.bib34),[14](https://arxiv.org/html/2606.27652#bib.bib45),[15](https://arxiv.org/html/2606.27652#bib.bib46),[25](https://arxiv.org/html/2606.27652#bib.bib36),[1](https://arxiv.org/html/2606.27652#bib.bib37)\]. Fast thinking consistently leads in recall, while slow thinking leads in precision; overall, fast thinking retains a slight advantage in F1. Figure [3](https://arxiv.org/html/2606.27652#S3.F3)(b) reports the average number of emotion categories predicted by fast and slow thinking. The results show a clear difference in prediction diversity: fast thinking covers more emotion categories, whereas slow thinking produces more concentrated predictions. Together, these two analyses lead to the following finding:

Finding 1: Fast thinking improves recall with broader emotion coverage, whereas slow thinking favors precision with more concentrated predictions.

Previous benchmarks such as MER-UniBench \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] mainly use hitrate as the evaluation metric, which reflects recall but does not penalize over-prediction (low precision). This helps explain why fast thinking appears more favorable than slow thinking under current evaluation.

### 3.3 Confident vs. conservative predictions

To compare the confidence behavior, we first aggregate the probability mass of free-form emotion words mapped to the level-1 category (Definition [1](https://arxiv.org/html/2606.27652#Thmdefinition1)). We then contrast this confidence between ground-truth and hard-negative categories to define a confidence margin (Definition [2](https://arxiv.org/html/2606.27652#Thmdefinition2)).

###### Definition 1 (Category-level log-confidence).

For a level-1 emotion category $c$, let $\mathcal{V}(c)$ denote the set of emotion words associated with $c$ defined by the emotion wheel \[[13](https://arxiv.org/html/2606.27652#bib.bib10)\]. Given a thinking mode $m\in\{\mathsf{fast},\mathsf{slow}\}$, we define the category-level confidence of $c$ as:

$$\mathsf{conf}^{m}(c)=\log\sum_{e\in\mathcal{V}(c)}p^{m}(e),\tag{1}$$

where $p^{m}(e)$ denotes the probability assigned to the first token of emotion word $e$.

###### Definition 2 (Confidence margin).

Let $\mathcal{G}$ denote the set of ground-truth level-1 categories for a sample, and let $\mathcal{H}^{m}$ denote the set of top-3 hard-negative categories. We define $\mathsf{conf}_{\mathsf{gt}}^{m}=\frac{1}{|\mathcal{G}|}\sum_{c\in\mathcal{G}}\mathsf{conf}^{m}(c)$ and $\mathsf{conf}_{\mathsf{neg}}^{m}=\frac{1}{|\mathcal{H}^{m}|}\sum_{c\in\mathcal{H}^{m}}\mathsf{conf}^{m}(c)$ as the average confidence assigned to correct and incorrect categories, respectively. Their relative confidence margin is defined as:

$$\gamma^{m}=\mathsf{conf}_{\mathsf{gt}}^{m}-\mathsf{conf}_{\mathsf{neg}}^{m}.\tag{2}$$
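Definitions 1 and 2 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the paper does not say how the top-3 hard negatives are chosen, so the sketch assumes they are the non-ground-truth categories with the highest confidence; the probability floor `eps` and the toy `wheel` and `probs` inputs are likewise hypothetical.

```python
import math


def category_log_conf(first_token_probs, words, eps=1e-12):
    """Eq. (1): log of the summed first-token probability over the words in V(c).

    eps is an assumed floor that avoids log(0) when no word receives any mass.
    """
    mass = sum(first_token_probs.get(w, 0.0) for w in words)
    return math.log(max(mass, eps))


def confidence_margin(first_token_probs, wheel, gt_categories, k=3):
    """Eq. (2): mean confidence on ground-truth categories minus mean on hard negatives.

    Assumption: hard negatives are the k most confident non-ground-truth categories.
    """
    conf = {c: category_log_conf(first_token_probs, words) for c, words in wheel.items()}
    negatives = sorted(
        (c for c in wheel if c not in gt_categories), key=conf.get, reverse=True
    )[:k]
    conf_gt = sum(conf[c] for c in gt_categories) / len(gt_categories)
    conf_neg = sum(conf[c] for c in negatives) / len(negatives)
    return conf_gt - conf_neg


# Toy inputs (hypothetical): category -> associated words, word -> first-token probability.
wheel = {"happy": ["joyful", "cheerful"], "sad": ["gloomy"], "angry": ["furious"]}
probs = {"joyful": 0.4, "cheerful": 0.1, "gloomy": 0.2, "furious": 0.05}
print(confidence_margin(probs, wheel, {"happy"}))
```

A larger margin means the ground-truth categories are better separated from their strongest competitors, which is the quantity plotted in Figure 3(d).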

Figure [3](https://arxiv.org/html/2606.27652#S3.F3)(c) reports the category-level log-confidence on ground-truth emotion categories, showing that fast thinking consistently assigns higher confidence to correct emotions. Figure [3](https://arxiv.org/html/2606.27652#S3.F3)(d) reports the confidence margin between ground-truth and hard-negative categories, showing that slow thinking yields larger margins by suppressing hard negatives. Together, these results lead to Finding 2:

Finding 2: Fast thinking is more confident on correct emotions, while slow thinking is more conservative, better suppressing incorrect categories but being under-confident on correct ones.

### 3.4 Desiderata for slow-fast thinking synergy

These observations suggest that an effective emotion reasoning model should not simply prefer one thinking mode over the other\. Instead, it should satisfy two desiderata\.

- Desideratum I: At the prediction level, the model should preserve the recall-oriented coverage of fast thinking while maintaining the precision-oriented selectivity of slow thinking.
- Desideratum II: At the confidence level, the model should preserve fast thinking's strong confidence on correct categories, while retaining slow thinking's suppression of incorrect ones.

These desiderata motivate our slow-fast thinking synergy framework in Section [4](https://arxiv.org/html/2606.27652#S4).

## 4 Method

We propose MER-R1, an RL framework for slow-fast thinking synergy. Section [4.1](https://arxiv.org/html/2606.27652#S4.SS1) introduces the GRPO baseline and its limitations. Section [4.2](https://arxiv.org/html/2606.27652#S4.SS2) proposes dual-objective disentanglement to preserve the recall-oriented coverage and the precision-oriented selectivity in the advantage space (Desideratum I). Section [4.3](https://arxiv.org/html/2606.27652#S4.SS3) introduces slow-fast confidence calibration to strengthen correct categories while suppressing incorrect ones (Desideratum II). The overall pipeline is illustrated in Figure [4](https://arxiv.org/html/2606.27652#S4.F4).

### 4.1 Preliminary on GRPO baseline

Reasoning-oriented MLLMs are commonly trained with a two-stage recipe: a supervised fine-tuning (SFT) stage, followed by GRPO-style RL. Given input $x$, GRPO samples $K$ outputs $\{y_{i}\}_{i=1}^{K}$ from $\pi_{\theta_{\mathsf{old}}}$, computes rewards $\{R_{i}\}_{i=1}^{K}$, normalizes them into relative advantages $A_{i}=\tfrac{R_{i}-\mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of the rewards within the group, and optimizes:

$$\mathcal{L}_{\mathsf{GRPO}}(\theta)=-\frac{1}{K}\sum_{i=1}^{K}\min\!\big(r_{i}A_{i},\;\mathrm{clip}(r_{i},1-\eta,1+\eta)A_{i}\big)+\beta D_{\mathrm{KL}}(\pi_{\theta}\,\|\,\pi_{\mathsf{ref}}),\tag{3}$$

where $r_{i}=\tfrac{\pi_{\theta}(y_{i}\mid x)}{\pi_{\theta_{\mathsf{old}}}(y_{i}\mid x)}$ is the importance ratio, $\eta$ is the clipping range, and $\beta\geq 0$ controls the KL regularization strength. As a typical reward design \[[17](https://arxiv.org/html/2606.27652#bib.bib21)\], the baseline combines an emotion-wheel-based $F_1$ reward with a format reward: $R_{\mathsf{base}}=R_{\mathsf{F}}+R_{\mathsf{fmt}}$, where $R_{\mathsf{fmt}}$ encourages the `<think>` then `<answer>` structure.
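The two ingredients of Eq. (3) that matter for the rest of this section, group normalization and the clipped surrogate, can be sketched as follows. This is a sequence-level toy version that omits the KL term; the use of the population standard deviation, the stabilizer `eps`, and the default `eta=0.2` are assumptions for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """A_i = (R_i - mu) / sigma over one group of K sampled outputs.

    eps is an assumed stabilizer for groups whose rewards are all identical.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratios, advantages, eta=0.2):
    """First term of Eq. (3); the KL penalty is left out of this sketch."""
    terms = []
    for r, a in zip(ratios, advantages):
        clipped_ratio = min(max(r, 1.0 - eta), 1.0 + eta)
        terms.append(min(r * a, clipped_ratio * a))
    return -mean(terms)


rewards = [1.0, 0.0, 0.5, 0.5]   # e.g. base rewards of K = 4 sampled answers
advantages = group_advantages(rewards)
ratios = [1.1, 0.9, 1.0, 1.3]    # pi_theta / pi_theta_old for each sample
print(advantages, clipped_surrogate(ratios, advantages))
```

Because every reward is reduced to a single scalar before normalization, the advantage carries no information about which component of the reward produced it, which is the limitation discussed next.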

However, the GRPO baseline has two limitations: the $F_1$-style reward entangles recall and precision, allowing the model to trade off one for the other, and the training objective does not explicitly preserve the recall-oriented intuition of fast thinking.

![Refer to caption](https://arxiv.org/html/2606.27652v1/x4.png)
Figure 4: Framework of MER-R1. Our MER-R1 consists of dual-objective disentanglement for Desideratum I and slow-fast confidence calibration for Desideratum II.

### 4.2 Dual-objective disentanglement for Desideratum I

To avoid the ambiguous trade-off induced by a single $F_1$-style reward, we explicitly separate the two optimization goals behind emotion recognition: covering correct categories and avoiding spurious ones. We first decompose the task reward into recall and precision rewards, and then compute their group-normalized advantages separately as the optimization signals.

Reward-level disentanglement. We decompose the $F_1$ reward into recall and precision rewards:

$$R_{\mathsf{R}}=\frac{|\hat{\mathcal{Y}}\cap\mathcal{G}|}{|\mathcal{G}|},\qquad R_{\mathsf{P}}=\frac{|\hat{\mathcal{Y}}\cap\mathcal{G}|}{|\hat{\mathcal{Y}}|}.\tag{4}$$

This decomposition makes the hidden bi-objective structure of $F_1$ explicit: $R_{\mathsf{R}}$ promotes coverage of correct categories, while $R_{\mathsf{P}}$ penalizes spurious predictions. Since different recall–precision trade-offs can yield the same $F_1$ value, separating $R_{\mathsf{R}}$ and $R_{\mathsf{P}}$ avoids an ambiguous optimization target.

Advantage-level disentanglement. Given the two rewards $R_{\mathsf{R}}$ and $R_{\mathsf{P}}$, a straightforward GRPO pipeline would first aggregate them into a scalar reward and then apply group normalization to obtain the advantage. However, when the two rewards have different group-wise variances, the combined scalar reward is dominated by the higher-variance objective, causing the lower-variance objective to be underrepresented in the final advantage. We therefore first compute group-normalized advantages for $R_{\mathsf{R}}$ and $R_{\mathsf{P}}$ separately, and then compute the final advantage $A_{\mathsf{dual}}$:

$$A_{\mathsf{R},i}=\frac{R_{\mathsf{R},i}-\mu_{\mathsf{R}}}{\sigma_{\mathsf{R}}},\qquad A_{\mathsf{P},i}=\frac{R_{\mathsf{P},i}-\mu_{\mathsf{P}}}{\sigma_{\mathsf{P}}},\qquad A_{\mathsf{dual},i}=A_{\mathsf{R},i}+A_{\mathsf{P},i}.\tag{5}$$
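The difference between normalizing after aggregation and normalizing each reward separately (Eq. (5)) can be seen on a toy group. The sketch below is illustrative only: it uses a plain sum as the aggregated scalar reward (an assumption; the baseline in the paper uses an $F_1$ reward), the population standard deviation, and made-up reward values.

```python
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Group normalization; eps is an assumed stabilizer for zero-variance groups."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def aggregated_advantage(recall_rewards, precision_rewards):
    """Aggregate first (here: a simple sum), then normalize once."""
    return normalize([r + p for r, p in zip(recall_rewards, precision_rewards)])


def dual_advantage(recall_rewards, precision_rewards):
    """Eq. (5): normalize each reward within the group, then add the advantages."""
    a_r, a_p = normalize(recall_rewards), normalize(precision_rewards)
    return [r + p for r, p in zip(a_r, a_p)]


# Toy group of K = 4 samples: recall varies a lot, precision only slightly.
recall = [1.0, 0.0, 1.0, 0.0]
precision = [0.5, 0.6, 0.6, 0.5]
print(aggregated_advantage(recall, precision))
print(dual_advantage(recall, precision))
```

In this toy group the first sample is best on recall and worst on precision, and the second sample is the reverse. The aggregated advantage still ranks the first clearly above the second because recall has the larger spread, whereas the dual advantage treats them as tied.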
Theoretical justification. The following proposition shows that standard $F_1$ optimization can be biased toward the objective with larger normalized variation, whereas our dual-objective disentanglement removes this correlation imbalance (proof in Appendix [A](https://arxiv.org/html/2606.27652#A1)).

###### Proposition 1\.

Let $\mu_{\mathsf{R}},\mu_{\mathsf{P}}$ and $\sigma_{\mathsf{R}},\sigma_{\mathsf{P}}$ denote the group means and standard deviations of the recall and precision rewards, and let $\tau=\frac{\sigma_{\mathsf{R}}/\mu_{\mathsf{R}}^{2}}{\sigma_{\mathsf{P}}/\mu_{\mathsf{P}}^{2}}$ be the normalized group-wise variance ratio. The correlations between each reward and the standard $F_1$ advantage $A_{\mathsf{F}}$ satisfy

$$\begin{cases}\mathsf{corr}(R_{\mathsf{R}},A_{\mathsf{F}})>\mathsf{corr}(R_{\mathsf{P}},A_{\mathsf{F}}),&\text{if }\tau>1,\\[2.84526pt] \mathsf{corr}(R_{\mathsf{R}},A_{\mathsf{F}})\leq\mathsf{corr}(R_{\mathsf{P}},A_{\mathsf{F}}),&\text{if }\tau\leq 1.\end{cases}\tag{6}$$

In contrast, our dual-objective advantage gives balanced correlations:

$$\mathsf{corr}(R_{\mathsf{R}},A_{\mathsf{dual}})=\mathsf{corr}(R_{\mathsf{P}},A_{\mathsf{dual}}).\tag{7}$$
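A short sanity check of the balanced-correlation identity in Eq. (7), offered as a sketch and not as a reproduction of the Appendix A proof. It assumes both group standard deviations are positive and writes $\rho$ (a symbol introduced here for illustration) for the within-group correlation between $R_{\mathsf{R}}$ and $R_{\mathsf{P}}$, with $\rho>-1$. Since $A_{\mathsf{R}}$ and $A_{\mathsf{P}}$ are standardized versions of the two rewards, correlations with the rewards equal correlations with the corresponding advantages:

```latex
\begin{aligned}
\mathrm{Var}(A_{\mathsf{R}}) &= \mathrm{Var}(A_{\mathsf{P}}) = 1,
\qquad \mathrm{Cov}(A_{\mathsf{R}}, A_{\mathsf{P}}) = \rho, \\
\mathrm{Cov}(A_{\mathsf{R}}, A_{\mathsf{dual}}) &= 1 + \rho
 = \mathrm{Cov}(A_{\mathsf{P}}, A_{\mathsf{dual}}), \\
\mathrm{Var}(A_{\mathsf{dual}}) &= 2\,(1 + \rho), \\
\mathsf{corr}(R_{\mathsf{R}}, A_{\mathsf{dual}})
 &= \frac{1 + \rho}{\sqrt{2\,(1 + \rho)}}
 = \sqrt{\frac{1 + \rho}{2}}
 = \mathsf{corr}(R_{\mathsf{P}}, A_{\mathsf{dual}}).
\end{aligned}
```

The expression is symmetric in the two rewards, so neither objective dominates the combined advantage regardless of how different $\sigma_{\mathsf{R}}$ and $\sigma_{\mathsf{P}}$ are.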

### 4.3 Slow-fast confidence calibration for Desideratum II

For Desideratum II, we propose slow-fast confidence calibration, which guides the final slow-thinking answer to preserve fast thinking's confidence on correct categories while retaining slow thinking's suppression of incorrect ones. For each response $y_{i}=(y_{i}^{\mathsf{cot}},y_{i}^{\mathsf{ans}})$, we construct two answer-generation distributions, one conditioned on the reasoning trace and one without it:

$$p_{i}^{\mathsf{slow}}(e)=\pi_{\theta}(\cdot\mid x,y_{i}^{\mathsf{cot}},y_{i,<e}^{\mathsf{ans}}),\qquad p_{i}^{\mathsf{fast}}(e)=\pi_{\theta}(\cdot\mid x,y_{i,<e}^{\mathsf{ans}}).\tag{8}$$

For each generated emotion word $e\in y_{i}^{\mathsf{ans}}$, we map it to its level-1 category $c(e)$ and compute its slow-fast confidence gap using Eq. ([1](https://arxiv.org/html/2606.27652#S3.E1)):

$$\Delta_{i}(e)=\mathsf{conf}_{i}^{\mathsf{slow}}(c(e))-\mathsf{conf}_{i}^{\mathsf{fast}}(c(e)).\tag{9}$$
We split the emotion words parsed from $y_{i}^{\mathsf{ans}}$ into correct ($\mathcal{E}_{i}^{+}$) and incorrect ($\mathcal{E}_{i}^{-}$) sets:

$$\mathcal{E}_{i}^{+}=\{e\in y_{i}^{\mathsf{ans}}\mid c(e)\in\mathcal{G}\},\qquad\mathcal{E}_{i}^{-}=\{e\in y_{i}^{\mathsf{ans}}\mid c(e)\notin\mathcal{G}\}.\tag{10}$$
We use the slow-fast confidence gap to calibrate the answer in two opposite directions. For emotion words mapped to correct categories, we reward a positive gap so that slow thinking preserves or exceeds the strong confidence of fast thinking. For incorrect emotion words, we reward a negative gap so that slow thinking retains its conservative suppression of spurious predictions:

$$R_{i}^{+}=\frac{1}{|\mathcal{E}_{i}^{+}|}\sum_{e\in\mathcal{E}_{i}^{+}}\Delta_{i}(e),\qquad R_{i}^{-}=\frac{1}{|\mathcal{E}_{i}^{-}|}\sum_{e\in\mathcal{E}_{i}^{-}}-\Delta_{i}(e).\tag{11}$$
Following the same disentanglement strategy, we first compute group-normalized advantages for $R_{i}^{+}$ and $R_{i}^{-}$ separately, denoted as $A_{i}^{+}$ and $A_{i}^{-}$, and then combine them as $A_{\mathsf{cali},i}=A_{i}^{+}+A_{i}^{-}$. The overall advantage is combined as:

$$A_{i}=A_{\mathsf{dual},i}+\lambda_{\mathsf{cali}}A_{\mathsf{cali},i}+\lambda_{\mathsf{fmt}}A_{\mathsf{fmt},i}.\tag{12}$$

Finally, we substitute $A_{i}$ into Eq. ([3](https://arxiv.org/html/2606.27652#S4.E3)) for policy optimization.
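The calibration rewards of Eqs. (9)-(11) and the combination in Eq. (12) can be sketched as below. This is a hedged illustration, not the authors' implementation: the per-category log-confidences are passed in as plain dictionaries instead of being computed from a model, returning 0 when a response has no correct or no incorrect words is an assumption (the paper does not state how empty sets are handled), and the weights used in the example are arbitrary.

```python
from statistics import mean


def calibration_rewards(answer_words, word_to_category, conf_slow, conf_fast, gt_categories):
    """Return (R_plus, R_minus) for one response, following Eqs. (9)-(11)."""
    pos_gaps, neg_gaps = [], []
    for word in answer_words:
        category = word_to_category[word]
        gap = conf_slow[category] - conf_fast[category]  # Eq. (9)
        if category in gt_categories:
            pos_gaps.append(gap)   # word belongs to the correct set E+
        else:
            neg_gaps.append(gap)   # word belongs to the incorrect set E-
    r_plus = mean(pos_gaps) if pos_gaps else 0.0                   # assumed value for empty E+
    r_minus = mean([-g for g in neg_gaps]) if neg_gaps else 0.0    # assumed value for empty E-
    return r_plus, r_minus


def overall_advantage(a_dual, a_plus, a_minus, a_fmt, lam_cali, lam_fmt):
    """Eq. (12), with A_cali = A_plus + A_minus; inputs are per-sample advantage lists."""
    return [
        d + lam_cali * (p + m) + lam_fmt * f
        for d, p, m, f in zip(a_dual, a_plus, a_minus, a_fmt)
    ]


# Toy response (hypothetical values): one correct word and one spurious word.
word_to_category = {"joyful": "happy", "gloomy": "sad"}
conf_slow = {"happy": -0.2, "sad": -3.0}
conf_fast = {"happy": -0.5, "sad": -1.0}
print(calibration_rewards(["joyful", "gloomy"], word_to_category, conf_slow, conf_fast, {"happy"}))
```

Both rewards are positive in this toy case: slow thinking is more confident than fast thinking on the correct category and less confident on the spurious one, which is the behavior the calibration term is meant to encourage.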

## 5 Experiments

Table 1: Main results on MER-UniBench. MER-R1 achieves the best overall performance under the official benchmark protocol. Entries marked with † are our reproduced results. Column groups: OV-MERD+ is Fine (F1); MER23, MER24, MELD, and IEMOCAP are Basic Emotion (Hitrate); MOSI, MOSEI, SIMS, and SIMSv2 are Sentiment Analysis (WAF).

| Model | OV-MERD+ | MER23 | MER24 | MELD | IEMOCAP | MOSI | MOSEI | SIMS | SIMSv2 | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen-Audio \[[3](https://arxiv.org/html/2606.27652#bib.bib140)\] | 32.36 | 41.85 | 31.61 | 49.09 | 35.47 | 70.09 | 46.90 | 70.73 | 65.26 | 49.26 |
| SALMONN \[[30](https://arxiv.org/html/2606.27652#bib.bib93)\] | 45.00 | 55.53 | 45.38 | 45.62 | 46.84 | 81.00 | 67.03 | 68.69 | 65.93 | 57.89 |
| VideoChat2 \[[8](https://arxiv.org/html/2606.27652#bib.bib92)\] | 39.21 | 33.67 | 54.50 | 36.64 | 48.70 | 66.84 | 54.32 | 69.49 | 70.66 | 52.67 |
| LLaMA-VID \[[9](https://arxiv.org/html/2606.27652#bib.bib80)\] | 45.01 | 50.72 | 57.60 | 42.75 | 46.02 | 61.78 | 63.89 | 69.35 | 67.48 | 56.07 |
| Chat-UniVi \[[7](https://arxiv.org/html/2606.27652#bib.bib81)\] | 48.00 | 57.62 | 65.67 | 45.61 | 52.37 | 54.53 | 63.18 | 68.15 | 66.36 | 57.94 |
| mPLUG-Owl \[[37](https://arxiv.org/html/2606.27652#bib.bib94)\] | 48.18 | 56.86 | 59.89 | 49.11 | 55.54 | 72.40 | 72.91 | 72.13 | 75.00 | 62.45 |
| PandaGPT \[[29](https://arxiv.org/html/2606.27652#bib.bib131)\] | 37.12 | 40.21 | 51.89 | 37.88 | 44.04 | 61.92 | 67.61 | 68.38 | 67.23 | 52.92 |
| R1-Omni \[[42](https://arxiv.org/html/2606.27652#bib.bib54)\] | 55.24 | 64.17 | 67.43 | 43.20 | 51.58 | 58.02 | 56.48 | 71.82 | 68.58 | 59.61 |
| Emotion-LLaMA \[[2](https://arxiv.org/html/2606.27652#bib.bib51)\] | 52.97 | 59.38 | 73.62 | 46.76 | 55.47 | 66.13 | 67.66 | 78.32 | 77.23 | 64.17 |
| AffectGPT \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] | 62.52 | 78.54 | 78.80 | 55.65 | 60.54 | 81.30 | 80.90 | 88.49 | 86.18 | 74.77 |
| AffectGPT-R1 \[[17](https://arxiv.org/html/2606.27652#bib.bib21)\] | 68.39 | 84.51 | 93.13 | 66.71 | 74.26 | 79.65 | 80.18 | 87.26 | 85.75 | 79.98 |
| AffectGPT† | 59.89 | 73.69 | 76.90 | 52.09 | 57.74 | 78.30 | 78.51 | 85.28 | 85.31 | 71.96 |
| AffectGPT-R1† | 64.08 | 81.88 | 80.89 | 57.53 | 65.71 | 80.29 | 80.64 | 85.70 | 83.75 | 75.60 |
| Baseline | 66.86 | 78.16 | 82.70 | 61.45 | 67.28 | 83.48 | 86.31 | 87.78 | 86.84 | 77.87 |
| MER-R1 | 70.68 | 88.60 | 92.96 | 71.90 | 78.10 | 86.06 | 86.73 | 88.04 | 88.48 | 83.50 |

Table 2: Main results on MME-Emotion. We report the CoT score for each task and all three metrics for the overall average. MER-R1 achieves the best overall performance.

| Methods | ER-Lab | ER-Wild | FG-ER | FG-SA | ML-ER | Noise-ER | IR | SA | Mean Rec. | Mean Rea. | Mean CoT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PandaGPT \[[29](https://arxiv.org/html/2606.27652#bib.bib131)\] | 28.5 | 23.2 | 25.2 | 41.4 | 28.9 | 26.3 | 29.2 | 37.7 | 21.6 | 38.4 | 30.0 |
| Emotion-LLaMA \[[2](https://arxiv.org/html/2606.27652#bib.bib51)\] | 31.7 | 19.7 | 31.7 | 36.4 | 29.5 | 49.3 | 27.3 | 40.7 | 21.1 | 42.4 | 31.8 |
| AffectGPT \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] | 35.0 | 32.8 | 32.6 | 44.2 | 31.3 | 50.7 | 30.0 | 47.9 | 19.7 | 57.4 | 38.6 |
| Baseline | 47.0 | 40.8 | 35.0 | 35.6 | 37.7 | 59.0 | 32.4 | 58.7 | 27.9 | 62.7 | 45.3 |
| VideoAuto-R1 \[[20](https://arxiv.org/html/2606.27652#bib.bib5)\] | 49.4 | 44.4 | 40.2 | 33.1 | 37.3 | 64.8 | 32.8 | 56.7 | 34.5 | 58.8 | 46.6 |
| MER-R1 | 57.3 | 49.6 | 38.7 | 37.2 | 42.2 | 72.1 | 33.9 | 62.9 | 38.4 | 64.6 | 51.5 |

Table 3: Slow Thinking vs. Fast Thinking on MER-UniBench under unified F1 metrics. The four basic emotion recognition datasets are re-evaluated by F1. R-Mean averages the two emotion recognition tasks, *i.e.*, fine-grained and basic emotion recognition. MER-R1 makes the final slow-thinking answer outperform its fast-thinking counterpart. Column groups: OV-MERD+ is Fine (F1); MER23, MER24, MELD, and IEMOCAP are Basic Emotion (F1); MOSI, MOSEI, SIMS, and SIMSv2 are Sentiment Analysis (WAF).

| Model | Mode | OV-MERD+ | MER23 | MER24 | MELD | IEMOCAP | MOSI | MOSEI | SIMS | SIMSv2 | R-Mean | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Fast | 66.87 | 61.92 | 64.37 | 46.18 | 53.13 | 86.03 | 86.47 | 87.48 | 87.60 | 58.49 | 71.11 |
| Baseline | Slow | 66.86 | 60.58 | 65.06 | 46.20 | 51.86 | 83.48 | 86.31 | 87.78 | 86.84 | 58.11 | 70.52 |
| VideoAuto-R1 | Fast | 69.67 | 65.09 | 67.38 | 46.77 | 56.33 | 86.21 | 86.05 | 86.20 | 87.93 | 61.05 | 72.40 |
| VideoAuto-R1 | Slow | 69.49 | 64.20 | 67.20 | 46.36 | 55.32 | 85.75 | 85.86 | 86.27 | 87.83 | 60.51 | 72.03 |
| MER-R1 | Fast | 70.10 | 64.41 | 67.21 | 46.89 | 55.71 | 85.30 | 87.43 | 88.57 | 87.49 | 60.86 | 72.56 |
| MER-R1 | Slow | 70.68 | 65.43 | 68.46 | 48.70 | 55.73 | 86.06 | 86.73 | 88.04 | 88.48 | 61.80 | 73.14 |

![Refer to caption](https://arxiv.org/html/2606.27652v1/x5.png)

Figure 5: Effect of slow-fast thinking synergy. (a) Recall. (b) Precision. (c) Confidence on ground-truth categories. (d) Confidence margin. MER-R1 improves recall, preserves precision, boosts confidence on ground-truth categories, and strengthens confidence discrimination.

Table 4: Main ablation study on MER-UniBench. RD, AD, and SFCC denote Reward Disentanglement, Advantage Disentanglement, and Slow-Fast Confidence Calibration, respectively. Mean$_{\text{official}}$ uses the original bench metrics, while Mean$_{\text{f1}}$ uses F1 for the basic emotion recognition task.

| Variant | RD | AD | SFCC | Fine (F1) | Basic (Hitrate) | Basic (F1) | Sentiment (WAF) | Mean$_{\text{official}}$ | Mean$_{\text{f1}}$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | 66.86 | 72.40 | 55.93 | 86.10 | 77.87 | 70.55 |
| Ex1 | ✓ | | | 68.06 | 77.50 | 57.18 | 85.89 | 80.18 | 71.15 |
| Ex2 | ✓ | ✓ | | 69.77 | 82.40 | 58.55 | 85.11 | 82.20 | 71.60 |
| Ex3 | ✓ | ✓ | ✓ | 70.68 | 82.90 | 59.58 | 87.32 | 83.50 | 73.15 |

Table 5: Ablation on slow-fast confidence calibration. The first block compares different calibration formulations, including confidence granularity and the role of the precision calibration term. The second block compares different optimization spaces.

| Variant | Fine (F1) | Basic (Hitrate) | Basic (F1) | Sentiment (WAF) | Mean$_{\text{official}}$ | Mean$_{\text{f1}}$ |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 66.86 | 72.40 | 55.93 | 86.10 | 77.87 | 70.55 |
| *Study on calibration formulation* | | | | | | |
| A1: word-level confidence | 69.07 | 80.67 | 58.89 | 86.14 | 81.81 | 72.13 |
| A2: w/o precision calibration | 69.48 | 82.51 | 59.09 | 86.58 | 82.87 | 72.46 |
| *Study on optimization space* | | | | | | |
| B1: reward-space calibration | 69.03 | 81.83 | 59.06 | 85.19 | 81.91 | 71.78 |
| B2: mixed-space calibration | 67.81 | 83.62 | 58.44 | 86.42 | 83.11 | 71.91 |
| MER-R1 | 70.68 | 82.90 | 59.58 | 87.32 | 83.50 | 73.15 |

Datasets and metrics. We evaluate MER-R1 on two multimodal emotion benchmarks: MER-UniBench \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] and MME-Emotion \[[41](https://arxiv.org/html/2606.27652#bib.bib8)\]. On MER-UniBench, we follow its three-task setting with 9 datasets: fine-grained emotion recognition on OV-MERD+ \[[13](https://arxiv.org/html/2606.27652#bib.bib10)\], evaluated by the F1-score metric; basic emotion recognition on MER23 \[[14](https://arxiv.org/html/2606.27652#bib.bib45)\], MER24 \[[15](https://arxiv.org/html/2606.27652#bib.bib46)\], MELD \[[25](https://arxiv.org/html/2606.27652#bib.bib36)\], and IEMOCAP \[[1](https://arxiv.org/html/2606.27652#bib.bib37)\], evaluated by Hitrate; and sentiment analysis on MOSI \[[39](https://arxiv.org/html/2606.27652#bib.bib39)\], MOSEI \[[40](https://arxiv.org/html/2606.27652#bib.bib40)\], SIMS \[[38](https://arxiv.org/html/2606.27652#bib.bib41)\], and SIMSv2 \[[21](https://arxiv.org/html/2606.27652#bib.bib42)\], evaluated by Weighted Average F1-score (WAF). Since Hitrate mainly reflects recall and does not penalize over-prediction, we re-evaluate the four basic emotion recognition datasets using F1-score. We further evaluate on MME-Emotion, which covers 8 tasks: lab, wild, noisy, fine-grained, and multi-label emotion recognition (ER-Lab, ER-Wild, Noise-ER, FG-ER, ML-ER), together with sentiment analysis (SA), fine-grained sentiment analysis (FG-SA), and intent recognition (IR). MME-Emotion reports three metrics: Recognition for emotion recognition, Reasoning for emotion reasoning, and CoT as their average. Since the original benchmark used GPT-4o as the evaluator, which is no longer available, we instead use gemini-3.1-flash-lite-preview. More details are in Appendix [B](https://arxiv.org/html/2606.27652#A2).
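To make the difference between Hitrate and F1 concrete, the following minimal sketch scores a single sample with a set-valued prediction. It assumes that Hitrate counts a sample as a hit whenever the ground-truth label appears anywhere in the predicted set; this is our reading of the sentence above, not the benchmark's official scorer, and the function names `hitrate` and `set_f1` are hypothetical.

```python
def hitrate(pred: set[str], gold: str) -> float:
    """Assumed Hitrate: 1.0 if the gold label appears in the prediction set."""
    return 1.0 if gold in pred else 0.0


def set_f1(pred: set[str], gold: set[str]) -> float:
    """Set-level F1: harmonic mean of precision and recall over labels."""
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# A focused prediction and an over-predicting one; both contain the gold label.
focused = {"happy"}
broad = {"happy", "sad", "angry", "surprised"}

hit_focused, hit_broad = hitrate(focused, "happy"), hitrate(broad, "happy")
f1_focused, f1_broad = set_f1(focused, {"happy"}), set_f1(broad, {"happy"})
```

Under this assumed definition both predictions receive the same Hitrate, while F1 drops for the broad prediction because three of its four labels are wrong.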

Implementation details. We use Qwen2.5-Omni \[[34](https://arxiv.org/html/2606.27652#bib.bib13)\] as the backbone and train it on MER-Caption+ \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] with a two-stage pipeline. We first perform SFT on 5k randomly sampled instances for 2 epochs with a learning rate of $2\times 10^{-5}$, and then apply RL on the remaining data for 1 epoch with a learning rate of $2\times 10^{-6}$, sampling 4 responses per prompt. We set both $\lambda_{\mathsf{cali}}$ and $\lambda_{\mathsf{fmt}}$ to $0.1$. Experiments are conducted on 16 NVIDIA H100 GPUs. More details are in Appendix [C](https://arxiv.org/html/2606.27652#A3).
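For readers who want the hyperparameters in one place, the sketch below collects the values stated above in a small configuration object. The class and field names (`StageConfig`, `TrainConfig`, `responses_per_prompt`, and so on) are hypothetical and do not come from the authors' code; only the numeric values are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    epochs: int
    learning_rate: float


@dataclass(frozen=True)
class TrainConfig:
    backbone: str = "Qwen2.5-Omni"
    train_set: str = "MER-Caption+"
    sft_samples: int = 5_000       # randomly sampled SFT subset
    sft: StageConfig = StageConfig(epochs=2, learning_rate=2e-5)
    rl: StageConfig = StageConfig(epochs=1, learning_rate=2e-6)
    responses_per_prompt: int = 4  # responses sampled per prompt during RL
    lambda_cali: float = 0.1       # weight of the calibration term
    lambda_fmt: float = 0.1        # weight of the format term
    num_gpus: int = 16             # NVIDIA H100


cfg = TrainConfig()
```

Note that the RL learning rate is one tenth of the SFT learning rate.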

### 5.1 Main results

SoTA performance on MER-UniBench. Table [1](https://arxiv.org/html/2606.27652#S5.T1) reports the main results under the official MER-UniBench protocol. MER-R1 achieves a new state-of-the-art mean score of 83.50, outperforming the strongest reproduced baseline (the Baseline row, 77.87) by 5.63 points. The improvement is consistent across fine-grained emotion recognition, basic emotion recognition, and sentiment analysis. Notably, MER-R1 improves OV-MERD+ from 66.86 to 70.68, obtains large gains on MER23, MELD, and IEMOCAP, and achieves the best results on MOSI, MOSEI, and SIMSv2. These results demonstrate the effectiveness of slow-fast thinking synergy across diverse multimodal emotion tasks.
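The headline numbers can be checked directly from Table 1. The sketch below assumes that the Mean column is the unweighted average over the nine datasets; the paper does not state this explicitly here, but the assumption reproduces the reported means up to rounding of the table entries.

```python
# Per-dataset scores copied from Table 1, in column order:
# OV-MERD+, MER23, MER24, MELD, IEMOCAP, MOSI, MOSEI, SIMS, SIMSv2.
baseline = [66.86, 78.16, 82.70, 61.45, 67.28, 83.48, 86.31, 87.78, 86.84]
mer_r1 = [70.68, 88.60, 92.96, 71.90, 78.10, 86.06, 86.73, 88.04, 88.48]


def dataset_mean(scores: list[float]) -> float:
    """Unweighted average over the datasets (assumed definition of Mean)."""
    return sum(scores) / len(scores)


gain = dataset_mean(mer_r1) - dataset_mean(baseline)
print(f"baseline={dataset_mean(baseline):.3f}  "
      f"mer_r1={dataset_mean(mer_r1):.3f}  gain={gain:.3f}")
```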

SoTA performance on MME-Emotion. Table [2](https://arxiv.org/html/2606.27652#S5.T2) reports the results on MME-Emotion. MER-R1 achieves the best overall CoT mean of 51.5, outperforming all compared methods on this broader benchmark. Compared with the baseline, MER-R1 substantially improves the overall Recognition score from 27.9 to 38.4, with clear gains on ER-Lab, ER-Wild, ML-ER, Noise-ER, and SA. Meanwhile, the overall Reasoning score also increases from 62.7 to 64.6, indicating that the improvement does not come at the cost of reasoning quality. These results show that slow-fast thinking synergy generalizes beyond MER-UniBench and consistently strengthens both emotion recognition and reasoning.

Fulfilling the Desiderata. MER-R1 fulfills the Desiderata defined in Section [3.4](https://arxiv.org/html/2606.27652#S3.SS4). Table [3](https://arxiv.org/html/2606.27652#S5.T3) compares slow and fast thinking under a unified F1-based evaluation, where the four basic emotion recognition datasets are re-evaluated by F1 rather than Hitrate. Under this setting, the baseline still exhibits the thinking paradox: its final slow-thinking answer is worse than its fast-thinking counterpart. VideoAuto-R1 \[[20](https://arxiv.org/html/2606.27652#bib.bib5)\] trains an answer-think-answer format with separate answer rewards, but still leaves the final answer weaker than the early answer on emotion recognition, with its R-Mean decreasing from 61.05 to 60.51, as it lacks a mechanism for slow-fast complementarity. In contrast, MER-R1 successfully reverses this trend: the slow-thinking answer outperforms fast thinking on all five recognition datasets, improving the R-Mean from 60.86 to 61.80 and demonstrating that MER-R1 turns slow thinking into stronger final recognition.
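The slow-versus-fast comparison can be reproduced from the Table 3 entries. The sketch below copies the F1 scores of the five recognition datasets and computes R-Mean as their plain average, which matches the reported R-Mean values; the dictionary layout and the helper `r_mean` are illustrative choices, not the authors' evaluation code.

```python
# F1 scores copied from Table 3 for the five recognition datasets:
# OV-MERD+, MER23, MER24, MELD, IEMOCAP.
table3 = {
    "Baseline": {"fast": [66.87, 61.92, 64.37, 46.18, 53.13],
                 "slow": [66.86, 60.58, 65.06, 46.20, 51.86]},
    "VideoAuto-R1": {"fast": [69.67, 65.09, 67.38, 46.77, 56.33],
                     "slow": [69.49, 64.20, 67.20, 46.36, 55.32]},
    "MER-R1": {"fast": [70.10, 64.41, 67.21, 46.89, 55.71],
               "slow": [70.68, 65.43, 68.46, 48.70, 55.73]},
}


def r_mean(scores):
    """Average over the five emotion recognition datasets."""
    return sum(scores) / len(scores)


# A positive gap means the final slow-thinking answer beats the fast one.
gaps = {m: r_mean(v["slow"]) - r_mean(v["fast"]) for m, v in table3.items()}

# True only if slow thinking wins on every single recognition dataset.
slow_wins_everywhere = {
    m: all(s > f for s, f in zip(v["slow"], v["fast"]))
    for m, v in table3.items()
}
```

Only MER-R1 has a positive gap and a slow-thinking answer that wins on every dataset; the Baseline and VideoAuto-R1 both have negative gaps.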

Figure [5](https://arxiv.org/html/2606.27652#S5.F5) further explains how MER-R1 satisfies the two desiderata. At the prediction level, Figure [5](https://arxiv.org/html/2606.27652#S5.F5)(a) and (b) show that MER-R1 improves recall over the baseline while preserving precision. At the confidence level, Figure [5](https://arxiv.org/html/2606.27652#S5.F5)(c) and (d) show that MER-R1 assigns higher confidence to ground-truth categories and enlarges the margin between ground-truth and hard-negative categories. Together with Table [3](https://arxiv.org/html/2606.27652#S5.T3), these results demonstrate that MER-R1 fulfills the desiderata, turning slow-fast complementarity into a stronger final slow-thinking answer.

### 5.2 Ablation studies and analyses

Main ablation study. Table [4](https://arxiv.org/html/2606.27652#S5.T4) presents the main ablation study on MER-UniBench by progressively adding Reward Disentanglement (RD), Advantage Disentanglement (AD), and Slow-Fast Confidence Calibration (SFCC). RD already brings clear improvements over the baseline, and AD further boosts both fine-grained and basic emotion recognition, showing the benefit of preserving separated optimization signals in the advantage space. Adding SFCC yields the best performance across all metrics, improving Mean$_{\text{official}}$ from 77.87 to 83.50 and Mean$_{\text{f1}}$ from 70.55 to 73.15. These results show that all three components contribute to the final gain of MER-R1.

Ablation on slow-fast confidence calibration. Table [5](https://arxiv.org/html/2606.27652#S5.T5) ablates the design of Slow-Fast Confidence Calibration. A1 uses word-level confidence instead of category-level confidence and performs worse, showing the importance of calibrating in the emotion-wheel category space. A2 removes precision calibration and is also inferior to the full model, indicating that recall-side calibration alone cannot preserve slow-thinking selectivity. For optimization space, B1 applies calibration in the reward space, while B2 adds corresponding dual and calibration rewards before advantage normalization. Both are weaker than MER-R1 in overall performance, demonstrating that SFCC works best with category-level bidirectional calibration and fully disentangled advantage-space optimization.

## 6 Conclusion and limitation

Conclusion. We reveal a thinking paradox in MER, where explicit reasoning does not necessarily improve recognition accuracy. Our analysis shows that fast thinking provides recall-oriented coverage and correct-category confidence, while slow thinking offers precision-oriented selectivity. Based on this complementarity, we propose MER-R1, an RL framework that realizes slow-fast thinking synergy through dual-objective disentanglement and slow-fast confidence calibration. Experiments on MER-UniBench and MME-Emotion show that MER-R1 achieves SoTA performance and fulfills the Desiderata, making slow thinking genuinely beneficial to emotion recognition.

Limitation. This work is evaluated mainly on MER benchmarks, so its generalization to broader multimodal reasoning tasks remains open \[[43](https://arxiv.org/html/2606.27652#bib.bib2), [44](https://arxiv.org/html/2606.27652#bib.bib1)\]. Moreover, MER-R1 depends on emotion-wheel-based category mapping and requires an additional fast-thinking forward pass during training, which may limit coverage of fine-grained emotions and increase computational cost.

## References

- [1] C. Busso, M. Bulut, C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan (2008) IEMOCAP: interactive emotional dyadic motion capture database. Language resources and evaluation 42(4), pp. 335–359.
- [2] (2024) Emotion-llama: multimodal emotion recognition and reasoning with instruction tuning. NeurIPS.
- [3] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou (2023) Qwen-audio: advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
- [4] D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. (2025) DeepSeek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645(8081), pp. 633–638.
- [5] Z. Han, B. Zhu, Y. Xu, P. Song, and X. Yang (2025) Benchmarking and bridging emotion conflicts for multimodal emotion reasoning. In ACM MM.
- [6] A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford, et al. (2024) Gpt-4o system card. arXiv preprint arXiv:2410.21276.
- [7] P. Jin, R. Takanobu, W. Zhang, X. Cao, and L. Yuan (2024) Chat-univi: unified visual representation empowers large language models with image and video understanding. In CVPR.
- [8] K. Li, Y. Wang, Y. He, Y. Li, Y. Wang, Y. Liu, Z. Wang, J. Xu, G. Chen, P. Luo, et al. (2024) Mvbench: a comprehensive multi-modal video understanding benchmark. In CVPR.
- [9] Y. Li, C. Wang, and J. Jia (2024) Llama-vid: an image is worth 2 tokens in large language models. In ECCV.
- [10] Z. Li, D. Zhang, M. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, P. Wang, X. Chen, et al. (2026) From system 1 to system 2: a survey of reasoning large language models. TPAMI.
- [11] Z. Lian, H. Chen, L. Chen, H. Sun, L. Sun, Y. Ren, Z. Cheng, B. Liu, R. Liu, X. Peng, et al. (2025) Affectgpt: a new dataset, model, and benchmark for emotion understanding with multimodal large language models. In ICML.
- [12] Z. Lian, R. Liu, K. Xu, B. Liu, X. Liu, Y. Zhang, X. Liu, Y. Li, Z. Cheng, H. Zuo, et al. (2025) Mer 2025: when affective computing meets large language models. In ACM MM.
- [13] Z. Lian, H. Sun, L. Sun, H. Chen, L. Chen, H. Gu, Z. Wen, S. Chen, S. Zhang, H. Yao, et al. (2025) Ov-mer: towards open-vocabulary multimodal emotion recognition. In ICML.
- [14] Z. Lian, H. Sun, L. Sun, K. Chen, M. Xu, K. Wang, K. Xu, Y. He, Y. Li, J. Zhao, et al. (2023) Mer 2023: multi-label learning, modality robustness, and semi-supervised learning. In ACM MM.
- [15] Z. Lian, H. Sun, L. Sun, Z. Wen, S. Zhang, S. Chen, H. Gu, J. Zhao, Z. Ma, X. Chen, et al. (2024) Mer 2024: semi-supervised learning, noise robustness, and open-vocabulary multimodal emotion recognition. In Proceedings of the 2nd International Workshop on Multimodal and Responsible Affective Computing.
- [16] Z. Lian, L. Sun, M. Xu, H. Sun, K. Xu, Z. Wen, S. Chen, B. Liu, and J. Tao (2023) Explainable multimodal emotion reasoning. CoRR.
- [17] Z. Lian, F. Zhang, Y. Zhang, J. Tao, R. Liu, H. Chen, and X. Li (2025) AffectGPT-r1: leveraging reinforcement learning for open-vocabulary multimodal emotion recognition. arXiv preprint arXiv:2508.01318.
- [18] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- [19] H. Liu, C. Li, Q. Wu, and Y. J. Lee (2024) Visual instruction tuning. NeurIPS.
- [20] S. Liu, M. Zhuge, C. Zhao, J. Chen, L. Wu, Z. Liu, C. Zhu, Z. Cai, C. Zhou, H. Liu, et al. (2026) VideoAuto-r1: video auto reasoning via thinking once, answering twice. arXiv preprint arXiv:2601.05175.
- [21] Y. Liu, Z. Yuan, H. Mao, Z. Liang, W. Yang, Y. Qiu, T. Cheng, X. Li, H. Xu, and K. Gao (2022) Make acoustic and visual cues matter: ch-sims v2. 0 dataset and av-mixup consistent module. In ICMI.
- [22] Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang (2025) Visual-rft: visual reinforcement fine-tuning. In ICCV.
- [23] F. Meng, L. Du, Z. Liu, Z. Zhou, Q. Lu, D. Fu, T. Han, B. Shi, W. Wang, J. He, et al. (2025) Mm-eureka: exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365.
- [24] R. W. Picard (2000) Affective computing. MIT press.
- [25] S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea (2019) Meld: a multimodal multi-party dataset for emotion recognition in conversations. In ACL.
- [26] H. Rha, J. H. Yeo, Y. Kim, and Y. M. Ro (2026) Emotion-coherent reasoning for multimodal llms via emotional rationale verifier. In AAAI.
- [27] Y. Shou, T. Meng, W. Ai, and K. Li (2025) Multimodal large language models meet multimodal emotion recognition and reasoning: a survey. arXiv preprint arXiv:2509.24322.
- [28] A. Singh, A. Fry, A. Perelman, A. Tart, A. Ganesh, A. El-Kishky, A. McLaughlin, A. Low, A. Ostrow, A. Ananthram, et al. (2025) OpenAI gpt-5 system card. arXiv preprint arXiv:2601.03267.
- [29] Y. Su, T. Lan, H. Li, J. Xu, Y. Wang, and D. Cai (2023) Pandagpt: one model to instruction-follow them all. In TLLM.
- [30] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang (2024) Salmonn: towards generic hearing abilities for large language models. In ICLR.
- [31] H. Wang, C. Qu, Z. Huang, W. Chu, F. Lin, and W. Chen (2025) Vl-rethinker: incentivizing self-reflection of vision-language models with reinforcement learning. In NeurIPS.
- [32] R. Wang, Z. Li, B. Zhu, L. Yuan, H. Zhang, X. Yang, X. Chang, and C. Zhang (2026) Parallel diffusion solver via residual dirichlet policy optimization. IEEE TPAMI, pp. 1–17.
- [33] Z. Wang, X. Guo, S. Stoica, H. Xu, H. Wang, H. Ha, X. Chen, Y. Chen, M. Yan, F. Huang, et al. (2026) Perception-aware policy optimization for multimodal reasoning. In ICLR.
- [34] J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, et al. (2025) Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215.
- [35] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu, Y. Lv, Y. Wang, D. Guo, H. Wang, L. Ma, P. Zhang, X. Zhang, H. Hao, Z. Guo, B. Yang, B. Zhang, Z. Ma, X. Wei, S. Bai, K. Chen, X. Liu, P. Wang, M. Yang, D. Liu, X. Ren, B. Zheng, R. Men, F. Zhou, B. Yu, J. Yang, L. Yu, J. Zhou, and J. Lin (2025) Qwen3-omni technical report. arXiv preprint arXiv:2509.17765.
- [36] Q. Yang, S. Yao, W. Chen, S. Fu, D. Bai, J. Zhao, B. Sun, B. Yin, X. Wei, and J. Zhou (2025) HumanOmniV2: from understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277.
- [37] Q. Ye, H. Xu, G. Xu, J. Ye, M. Yan, Y. Zhou, J. Wang, A. Hu, P. Shi, Y. Shi, et al. (2023) Mplug-owl: modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
- [38] W. Yu, H. Xu, F. Meng, Y. Zhu, Y. Ma, J. Wu, J. Zou, and K. Yang (2020) Ch-sims: a chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In ACL.
- [39] A. Zadeh, R. Zellers, E. Pincus, and L. Morency (2016) Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos. In CoRR.
- [40] A. B. Zadeh, P. P. Liang, S. Poria, E. Cambria, and L. Morency (2018) Multimodal language analysis in the wild: cmu-mosei dataset and interpretable dynamic fusion graph. In ACL.
- [41] F. Zhang, Z. Cheng, C. Deng, H. Li, Z. Lian, Q. Chen, H. Liu, W. Wang, Y. Zhang, R. Zhang, et al. (2026) Mme-emotion: a holistic evaluation benchmark for emotional intelligence in multimodal large language models. In ICLR.
- [42] J. Zhao, X. Wei, and L. Bo (2025) R1-omni: explainable omni-multimodal emotion recognition with reinforcing learning. arXiv preprint arXiv:2503.05379.
- [43] K. Zhao, B. Zhu, Q. Sun, and H. Zhang (2025) Unsupervised visual chain-of-thought reasoning via preference optimization. In ICCV.
- [44] X. Zhu, K. Zhao, L. Yi, S. Wang, Z. Wang, B. Zhu, H. Zhang, and X. He (2026) Look carefully: adaptive visual reinforcements in multimodal large language models for hallucination mitigation. In ICLR.

Appendix

## Appendix A Theoretical analysis of dual-objective disentanglement

We provide a theoretical analysis of the dual-objective disentanglement in Section [4.2](https://arxiv.org/html/2606.27652#S4.SS2). We first show that the standard $F_1$-based advantage is affected by both the local sensitivities of $F_1$ and the reward variances. We then show that our two-step disentanglement removes these factors by separating recall and precision before group normalization.

### A.1 Limitation of the $F_1$ advantage

For the $i$-th response in a GRPO group, let $R_{\mathsf{R},i}$ and $R_{\mathsf{P},i}$ denote its recall and precision rewards. The baseline first combines them into a scalar $F_1$ reward:

$$R_{\mathsf{F},i}=\frac{2R_{\mathsf{R},i}R_{\mathsf{P},i}}{R_{\mathsf{R},i}+R_{\mathsf{P},i}},\tag{13}$$

and then computes the group-normalized advantage:

$$A_{\mathsf{F},i}=\frac{R_{\mathsf{F},i}-\mu_{\mathsf{F}}}{\sigma_{\mathsf{F}}},\tag{14}$$

where $\mu_{\mathsf{F}}$ and $\sigma_{\mathsf{F}}$ are the group mean and standard deviation of $R_{\mathsf{F}}$.
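The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' training code: the reward values are made up, the zero-denominator convention and the small `eps` guard are our own assumptions, and the population standard deviation is used for the group statistics.

```python
import math


def f1_reward(recall: float, precision: float) -> float:
    """Eq. (13): harmonic mean of the recall and precision rewards."""
    if recall + precision == 0:
        return 0.0  # assumed convention; the text does not cover this case
    return 2 * recall * precision / (recall + precision)


def group_normalize(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Eq. (14): subtract the group mean and divide by the group std."""
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sigma + eps) for r in rewards]  # eps is an added guard


# Hypothetical (recall, precision) rewards for a group of 4 sampled responses.
group = [(1.0, 0.5), (0.5, 1.0), (1.0, 1.0), (0.5, 0.5)]
f1_rewards = [f1_reward(r, p) for r, p in group]
advantages = group_normalize(f1_rewards)
```

In this toy group, the recall-heavy response `(1.0, 0.5)` and the precision-heavy response `(0.5, 1.0)` receive exactly the same advantage, so the scalar $F_1$ signal alone cannot tell the policy which of the two objectives a response satisfied.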

###### Proposition 2\.

For $\mathsf{k}\in\{\mathsf{R},\mathsf{P}\}$ and $\mathsf{l}\neq\mathsf{k}$, the correlation between reward $R_{\mathsf{k}}$ and the $F_1$ advantage $A_{\mathsf{F}}$ is approximated by:

$$\mathrm{Corr}(R_{\mathsf{k}},A_{\mathsf{F}})\approx\frac{\alpha_{\mathsf{k}}\sigma_{\mathsf{k}}+\alpha_{\mathsf{l}}\rho\sigma_{\mathsf{l}}}{\sigma_{\mathsf{F}}}.\tag{15}$$

Here, $\rho$ is the correlation between $R_{\mathsf{R}}$ and $R_{\mathsf{P}}$; $\alpha_{\mathsf{k}}$ is the local $F_1$ sensitivity to $R_{\mathsf{k}}$; and $\sigma_{\mathsf{k}}$ is the standard deviation of $R_{\mathsf{k}}$.

###### Proof\.

For simplicity, we omit the response index $i$ when no ambiguity arises. Let $\mu_{\mathsf{R}},\mu_{\mathsf{P}}$ be the group means of $R_{\mathsf{R}}$ and $R_{\mathsf{P}}$ respectively. By the first-order Taylor expansion around $(\mu_{\mathsf{R}},\mu_{\mathsf{P}})$, the scalar $F_1$ reward can be locally approximated as:

$$R_{\mathsf{F}}\approx\frac{2\mu_{\mathsf{R}}\mu_{\mathsf{P}}}{\mu_{\mathsf{R}}+\mu_{\mathsf{P}}}+\alpha_{\mathsf{R}}(R_{\mathsf{R}}-\mu_{\mathsf{R}})+\alpha_{\mathsf{P}}(R_{\mathsf{P}}-\mu_{\mathsf{P}}),\tag{16}$$

where:

$$\alpha_{\mathsf{R}}=\left.\frac{\partial R_{\mathsf{F}}}{\partial R_{\mathsf{R}}}\right|_{(\mu_{\mathsf{R}},\mu_{\mathsf{P}})}=\frac{2\mu_{\mathsf{P}}^{2}}{(\mu_{\mathsf{R}}+\mu_{\mathsf{P}})^{2}},\qquad\alpha_{\mathsf{P}}=\left.\frac{\partial R_{\mathsf{F}}}{\partial R_{\mathsf{P}}}\right|_{(\mu_{\mathsf{R}},\mu_{\mathsf{P}})}=\frac{2\mu_{\mathsf{R}}^{2}}{(\mu_{\mathsf{R}}+\mu_{\mathsf{P}})^{2}}.\tag{17}$$

Subtracting the group mean from the local approximation and normalizing by $\sigma_{\mathsf{F}}$ give:

$$R_{\mathsf{F}}-\mu_{\mathsf{F}}\approx\alpha_{\mathsf{R}}(R_{\mathsf{R}}-\mu_{\mathsf{R}})+\alpha_{\mathsf{P}}(R_{\mathsf{P}}-\mu_{\mathsf{P}}),\qquad A_{\mathsf{F}}\approx\frac{\alpha_{\mathsf{R}}(R_{\mathsf{R}}-\mu_{\mathsf{R}})+\alpha_{\mathsf{P}}(R_{\mathsf{P}}-\mu_{\mathsf{P}})}{\sigma_{\mathsf{F}}}.\tag{18}$$

We first compute the correlation between $R_{\mathsf{R}}$ and $A_{\mathsf{F}}$. By definition:

$$\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{F}})=\frac{\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{F}})}{\sigma_{\mathsf{R}}\sigma_{A_{\mathsf{F}}}}=\frac{\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{F}})}{\sigma_{\mathsf{R}}},\tag{19}$$

where the equality follows from the group normalization of $A_{\mathsf{F}}$, which gives $\sigma_{A_{\mathsf{F}}}=1$. Using the expression of $A_{\mathsf{F}}$, together with $\mathrm{Var}(R_{\mathsf{R}})=\sigma_{\mathsf{R}}^{2}$ and $\mathrm{Cov}(R_{\mathsf{R}},R_{\mathsf{P}})=\rho\sigma_{\mathsf{R}}\sigma_{\mathsf{P}}$, we obtain:

$$\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{F}})\approx\frac{\alpha_{\mathsf{R}}\mathrm{Var}(R_{\mathsf{R}})+\alpha_{\mathsf{P}}\mathrm{Cov}(R_{\mathsf{R}},R_{\mathsf{P}})}{\sigma_{\mathsf{F}}}=\frac{\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}^{2}+\alpha_{\mathsf{P}}\rho\sigma_{\mathsf{R}}\sigma_{\mathsf{P}}}{\sigma_{\mathsf{F}}}.\tag{20}$$

By symmetry, the corresponding result for $R_{\mathsf{P}}$ can be obtained in the same way. Substituting the two covariance results into the correlation definition gives:

$$\begin{aligned}\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{F}})&\approx\frac{\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}+\alpha_{\mathsf{P}}\rho\sigma_{\mathsf{P}}}{\sigma_{\mathsf{F}}},\\\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{F}})&\approx\frac{\alpha_{\mathsf{P}}\sigma_{\mathsf{P}}+\alpha_{\mathsf{R}}\rho\sigma_{\mathsf{R}}}{\sigma_{\mathsf{F}}}.\end{aligned}\tag{21}$$

This completes the proof. ∎

Implication. Proposition 2 shows that the correlation between each reward and the $F_1$ advantage depends on both the local $F_1$ sensitivity $\alpha_{\mathsf{k}}$ and the reward standard deviation $\sigma_{\mathsf{k}}$, especially when $\rho$ is small. Therefore, the baseline tends to emphasize the reward with larger local sensitivity or larger standard deviation, even when both recall and precision are important.
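The first-order prediction in Eq. (21) can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it draws synthetic recall and precision rewards (the means, standard deviations, correlation, and sample size are illustrative assumptions), treats the whole sample as a single group, and takes $\alpha_{\mathsf{R}}$ and $\alpha_{\mathsf{P}}$ to be the partial derivatives of $F_1$ evaluated at the group means. The helper names `pearson` and `check_prop2` are hypothetical.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def std(xs):
    """Population standard deviation, as used for group normalization."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def pearson(xs, ys):
    """Sample Pearson correlation."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (std(xs) * std(ys))


def check_prop2(mu_r=0.7, mu_p=0.5, sd_r=0.02, sd_p=0.05, rho=0.2,
                n=20000, seed=0):
    """Compare empirical Corr(R_k, F1) with the first-order prediction."""
    rng = random.Random(seed)
    recall, precision = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        recall.append(mu_r + sd_r * z1)
        # Mix z1 into precision so that Corr(recall, precision) is about rho.
        precision.append(mu_p + sd_p * (rho * z1 + math.sqrt(1 - rho**2) * z2))
    f1 = [2 * r * p / (r + p) for r, p in zip(recall, precision)]

    m_r, m_p = mean(recall), mean(precision)
    s_r, s_p, s_f = std(recall), std(precision), std(f1)
    rho_hat = pearson(recall, precision)
    # Assumed local sensitivities: partial derivatives of F1 at the means.
    alpha_r = 2 * m_p**2 / (m_r + m_p) ** 2
    alpha_p = 2 * m_r**2 / (m_r + m_p) ** 2

    pred_r = (alpha_r * s_r + alpha_p * rho_hat * s_p) / s_f
    pred_p = (alpha_p * s_p + alpha_r * rho_hat * s_r) / s_f
    # A_F is a positive affine transform of F1, so the correlations coincide.
    return (pearson(recall, f1), pred_r), (pearson(precision, f1), pred_p)


(emp_r, pred_r), (emp_p, pred_p) = check_prop2()
print(f"recall:    empirical {emp_r:.3f} vs predicted {pred_r:.3f}")
print(f"precision: empirical {emp_p:.3f} vs predicted {pred_p:.3f}")
```

Under these synthetic settings the precision reward has the larger product of sensitivity and standard deviation, so it should correlate far more strongly with the $F_1$ advantage than the recall reward does, which is the imbalance the proposition describes.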

### A.2 Effect of dual-objective disentanglement

###### Proposition 3\.

The correlation between each reward and the disentangled advantage is:

$$\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{dual}})=\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{dual}})=\sqrt{\frac{1+\rho}{2}}. \tag{22}$$

###### Proof\.

By definition, $A_{\mathsf{R}}$ and $A_{\mathsf{P}}$ are standardized rewards:

$$A_{\mathsf{R}}=\frac{R_{\mathsf{R}}-\mu_{\mathsf{R}}}{\sigma_{\mathsf{R}}},\qquad A_{\mathsf{P}}=\frac{R_{\mathsf{P}}-\mu_{\mathsf{P}}}{\sigma_{\mathsf{P}}}. \tag{23}$$

Thus, both have zero mean and unit variance, and their covariance is $\mathrm{Cov}(A_{\mathsf{R}},A_{\mathsf{P}})=\mathrm{Corr}(R_{\mathsf{R}},R_{\mathsf{P}})=\rho$. Since $A_{\mathsf{dual}}=A_{\mathsf{R}}+A_{\mathsf{P}}$, its variance is:

$$\mathrm{Var}(A_{\mathsf{dual}})=\mathrm{Var}(A_{\mathsf{R}})+\mathrm{Var}(A_{\mathsf{P}})+2\,\mathrm{Cov}(A_{\mathsf{R}},A_{\mathsf{P}})=2+2\rho. \tag{24}$$

We first compute the correlation between $R_{\mathsf{R}}$ and $A_{\mathsf{dual}}$. The covariance term can be written as:

$$\begin{aligned}\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{dual}})&=\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{R}})+\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{P}})\\&=\mathrm{Cov}\left(R_{\mathsf{R}},\frac{R_{\mathsf{R}}-\mu_{\mathsf{R}}}{\sigma_{\mathsf{R}}}\right)+\mathrm{Cov}\left(R_{\mathsf{R}},\frac{R_{\mathsf{P}}-\mu_{\mathsf{P}}}{\sigma_{\mathsf{P}}}\right)\\&=\sigma_{\mathsf{R}}+\rho\sigma_{\mathsf{R}}=(1+\rho)\sigma_{\mathsf{R}}.\end{aligned} \tag{25}$$

Combining this with $\sigma_{A_{\mathsf{dual}}}=\sqrt{2+2\rho}$, we obtain:

$$\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{dual}})=\frac{\mathrm{Cov}(R_{\mathsf{R}},A_{\mathsf{dual}})}{\sigma_{\mathsf{R}}\sigma_{A_{\mathsf{dual}}}}=\frac{(1+\rho)\sigma_{\mathsf{R}}}{\sigma_{\mathsf{R}}\sqrt{2+2\rho}}=\sqrt{\frac{1+\rho}{2}}. \tag{26}$$

By symmetry, the same derivation gives:

$$\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{dual}})=\sqrt{\frac{1+\rho}{2}}. \tag{27}$$

This completes the proof. ∎

Implication. Proposition 3 shows that dual-objective disentanglement removes the local-sensitivity and standard-deviation factors identified in Proposition 2 by normalizing recall and precision separately. As a result, both rewards are equally correlated with $A_{\mathsf{dual}}$, preventing optimization from improving one objective at the expense of the other.
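The balanced-correlation identity can be checked on a single group of rewards. The sketch below is illustrative only: the four reward values are made up (one group of $G=4$ responses), and the function names are hypothetical. It standardizes recall and precision separately, sums them into $A_{\mathsf{dual}}$, and compares both correlations against $\sqrt{(1+\rho)/2}$, where $\rho$ is the sample correlation within the group.

```python
import math


def standardize(xs):
    """Zero-mean, unit-variance version of a group of rewards."""
    m = sum(xs) / len(xs)
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / s for x in xs]


def pearson(xs, ys):
    """Sample Pearson correlation (mean of products of z-scores)."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


def dual_advantage(recall_rewards, precision_rewards):
    """A_dual = A_R + A_P, each reward normalized within the group."""
    a_r = standardize(recall_rewards)
    a_p = standardize(precision_rewards)
    return [r + p for r, p in zip(a_r, a_p)]


# Hypothetical rewards for one group of four sampled responses.
recall_rewards = [0.50, 0.75, 1.00, 0.75]
precision_rewards = [0.60, 0.40, 0.50, 1.00]

a_dual = dual_advantage(recall_rewards, precision_rewards)
rho = pearson(recall_rewards, precision_rewards)
corr_r = pearson(recall_rewards, a_dual)
corr_p = pearson(precision_rewards, a_dual)
target = math.sqrt((1 + rho) / 2)
print(corr_r, corr_p, target)
```

Because the identity only uses within-group means, variances, and the correlation, it holds for the sample statistics up to floating-point error, regardless of how different the two reward scales are.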

### A.3 Proof of Proposition 1

###### Restated Proposition (Proposition [1](https://arxiv.org/html/2606.27652#Thmproposition1)).

Let $\mu_{\mathsf{R}},\mu_{\mathsf{P}}$ and $\sigma_{\mathsf{R}},\sigma_{\mathsf{P}}$ denote the group means and standard deviations of the recall and precision rewards, and let $\tau=\frac{\sigma_{\mathsf{R}}/\mu_{\mathsf{R}}^{2}}{\sigma_{\mathsf{P}}/\mu_{\mathsf{P}}^{2}}$ be the normalized group-wise variance ratio. The correlations between each reward and the standard $F_1$ advantage satisfy

$$\begin{cases}\mathsf{corr}(R_{\mathsf{R}},A_{\mathsf{F}})>\mathsf{corr}(R_{\mathsf{P}},A_{\mathsf{F}}),&\text{if }\tau>1,\\[2.84526pt] \mathsf{corr}(R_{\mathsf{R}},A_{\mathsf{F}})\leq\mathsf{corr}(R_{\mathsf{P}},A_{\mathsf{F}}),&\text{if }\tau\leq 1.\end{cases} \tag{28}$$

In contrast, our dual-objective advantage gives balanced correlations:

$$\mathsf{corr}(R_{\mathsf{R}},A_{\mathsf{dual}})=\mathsf{corr}(R_{\mathsf{P}},A_{\mathsf{dual}}). \tag{29}$$

###### Proof\.

From Proposition 2, the correlations between recall/precision rewards and the standard $F_1$ advantage are:

$$\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{F}})\approx\frac{\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}+\alpha_{\mathsf{P}}\rho\sigma_{\mathsf{P}}}{\sigma_{\mathsf{F}}},\qquad \mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{F}})\approx\frac{\alpha_{\mathsf{P}}\sigma_{\mathsf{P}}+\alpha_{\mathsf{R}}\rho\sigma_{\mathsf{R}}}{\sigma_{\mathsf{F}}}. \tag{30}$$

Taking their difference gives:

$$\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{F}})-\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{F}})\approx\frac{(1-\rho)(\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}-\alpha_{\mathsf{P}}\sigma_{\mathsf{P}})}{\sigma_{\mathsf{F}}}. \tag{31}$$

In the non-degenerate case with $\rho<1$ and $\sigma_{\mathsf{F}}>0$, the sign of this difference is determined by $\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}-\alpha_{\mathsf{P}}\sigma_{\mathsf{P}}$. Using the local sensitivities in Proposition 2, we have:

$$\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}-\alpha_{\mathsf{P}}\sigma_{\mathsf{P}}=\frac{2}{(\mu_{\mathsf{R}}+\mu_{\mathsf{P}})^{2}}\left(\mu_{\mathsf{P}}^{2}\sigma_{\mathsf{R}}-\mu_{\mathsf{R}}^{2}\sigma_{\mathsf{P}}\right). \tag{32}$$

Therefore:

$$\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}>\alpha_{\mathsf{P}}\sigma_{\mathsf{P}}\Longleftrightarrow\frac{\sigma_{\mathsf{R}}}{\mu_{\mathsf{R}}^{2}}>\frac{\sigma_{\mathsf{P}}}{\mu_{\mathsf{P}}^{2}}\Longleftrightarrow\tau>1. \tag{33}$$

Thus, when $\tau>1$, $\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{F}})>\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{F}})$; otherwise, $\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{F}})\leq\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{F}})$.

Finally, Proposition 3 directly gives:

$$\mathrm{Corr}(R_{\mathsf{R}},A_{\mathsf{dual}})=\mathrm{Corr}(R_{\mathsf{P}},A_{\mathsf{dual}})=\sqrt{\frac{1+\rho}{2}}, \tag{34}$$

which proves the balanced-correlation result for the dual-objective advantage. ∎
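As a quick illustration of the $\tau$ criterion, the sketch below computes the variance ratio from group statistics and checks that its position relative to 1 agrees with the sign of $\alpha_{\mathsf{R}}\sigma_{\mathsf{R}}-\alpha_{\mathsf{P}}\sigma_{\mathsf{P}}$ from Eq. (32). The function names and the numeric inputs are hypothetical; the sensitivities are assumed to be the partial derivatives of $F_1$ at the group means, which is what Eq. (32) implies.

```python
def f1_sensitivities(mu_r, mu_p):
    """Partial derivatives of F1 = 2RP/(R+P) at (mu_r, mu_p)."""
    denom = (mu_r + mu_p) ** 2
    return 2 * mu_p**2 / denom, 2 * mu_r**2 / denom


def variance_ratio(mu_r, sd_r, mu_p, sd_p):
    """tau = (sd_r / mu_r^2) / (sd_p / mu_p^2)."""
    return (sd_r / mu_r**2) / (sd_p / mu_p**2)


def sensitivity_gap(mu_r, sd_r, mu_p, sd_p):
    """alpha_R * sd_r - alpha_P * sd_p, the sign-determining term."""
    alpha_r, alpha_p = f1_sensitivities(mu_r, mu_p)
    return alpha_r * sd_r - alpha_p * sd_p


def dominant_reward(mu_r, sd_r, mu_p, sd_p):
    """Which reward the standard F1 advantage follows more closely."""
    tau = variance_ratio(mu_r, sd_r, mu_p, sd_p)
    return "recall" if tau > 1 else "precision_or_tie"


# Hypothetical group statistics: (mu_r, sd_r, mu_p, sd_p).
cases = [(0.5, 0.2, 0.5, 0.1), (0.5, 0.1, 0.5, 0.2), (0.8, 0.1, 0.4, 0.1)]
for case in cases:
    print(case, variance_ratio(*case), dominant_reward(*case))
```

The third case shows that equal standard deviations do not imply balance: the reward with the smaller mean gets the larger normalized term, so the gap can still be far from zero.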

## Appendix B Task details

We evaluate MER-R1 on two multimodal emotion benchmarks: MER-UniBench \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] and MME-Emotion \[[41](https://arxiv.org/html/2606.27652#bib.bib8)\]. MER-UniBench evaluates open-vocabulary multimodal emotion recognition under emotion-wheel-based metrics, while MME-Emotion provides an LLM-as-a-judge evaluation of both emotion recognition and emotion reasoning.

### B.1 MER-UniBench

MER-UniBench \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] unifies nine multimodal emotion datasets into three task groups: fine-grained emotion recognition, basic emotion recognition, and sentiment analysis. Since model outputs are free-form emotion words, direct string matching is insufficient. Following prior work \[[11](https://arxiv.org/html/2606.27652#bib.bib50),[13](https://arxiv.org/html/2606.27652#bib.bib10)\], we evaluate recognition outputs with an emotion-wheel-based normalization protocol.

Emotion-wheel normalization. The emotion wheel maps diverse fine-grained emotion words into a shared category space. We apply three normalization steps to both predictions and ground-truth labels. First, morphological variants are reduced to their base form, *e.g.*, *happier* and *happiness* are normalized to *happy*. Second, semantically equivalent emotion words are mapped to the same canonical form, *e.g.*, *joyful* and *happy*. Third, fine-grained emotion words are further mapped to their corresponding inner emotion categories according to each emotion wheel. Following prior work \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\], we adopt $K=5$ emotion wheels, as shown in Figure [6](https://arxiv.org/html/2606.27652#A2.F6). In each wheel, the outer labels correspond to fine-grained emotions, and the inner sectors define the corresponding level-1 emotion categories.

![Refer to caption](https://arxiv.org/html/2606.27652v1/EW_image/wheel1.jpg)\(a\)W1
![Refer to caption](https://arxiv.org/html/2606.27652v1/EW_image/wheel2.jpg)\(b\)W2
![Refer to caption](https://arxiv.org/html/2606.27652v1/EW_image/wheel3.jpg)\(c\)W3
![Refer to caption](https://arxiv.org/html/2606.27652v1/EW_image/wheel4.jpg)\(d\)W4
![Refer to caption](https://arxiv.org/html/2606.27652v1/EW_image/wheel5.jpg)\(e\)W5

Figure 6: Emotion wheels used in MER-UniBench evaluation. We adopt five emotion wheels (W1–W5) to map free-form emotion words into level-1 emotion categories.

For the $k$-th wheel $w_k$, the complete mapping function is defined as:

$$G_{w_k}(\cdot)=F_{l_3}^{w_k}\bigl(F_{l_2}(F_{l_1}(\cdot))\bigr),\quad k\in\{1,\ldots,K\}, \tag{35}$$

where $F_{l_1}$, $F_{l_2}$, and $F_{l_3}^{w_k}$ denote morphological normalization, semantic normalization, and wheel-specific category mapping, respectively.
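The three-stage composition in Eq. (35) can be sketched with plain lookup tables. This is a toy illustration, not the benchmark's implementation: the dictionaries `MORPH`, `SYNONYM`, and `TOY_WHEEL` are hypothetical and hold only a few entries (the *happier*, *happiness*, and *joyful* examples come from the text above; the category names are made up), and dropping words that the wheel does not cover is an assumption.

```python
# Stage 1 (F_l1): morphological variants -> base form. Toy entries only.
MORPH = {"happier": "happy", "happiness": "happy"}

# Stage 2 (F_l2): semantically equivalent words -> canonical form.
SYNONYM = {"joyful": "happy"}

# Stage 3 (F_l3 for one wheel): canonical word -> level-1 category.
# The category names here are hypothetical.
TOY_WHEEL = {"happy": "joy", "angry": "anger", "sad": "sadness"}


def normalize(word):
    """Apply morphological, then semantic normalization to one word."""
    w = word.strip().lower()
    w = MORPH.get(w, w)
    return SYNONYM.get(w, w)


def map_to_wheel(words, wheel):
    """G_w: map a collection of emotion words to a set of categories.

    Assumption: words the wheel does not cover are silently dropped.
    """
    categories = set()
    for word in words:
        category = wheel.get(normalize(word))
        if category is not None:
            categories.add(category)
    return categories


print(map_to_wheel(["happier", "joyful", "unknownword"], TOY_WHEEL))
```

Returning a set means that several surface forms of the same emotion collapse into one category, which is what lets the set-level metrics compare predictions and labels without string matching.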

Set-level precision, recall, and F1. For sample $i$, let $\mathcal{Y}_i$ denote the ground-truth emotion set and $\hat{\mathcal{Y}}_i$ denote the predicted emotion set extracted from the `<answer>` field, with duplicate emotion words removed. For each emotion wheel $w_k$, we compute the set-level precision and recall as:

$$\mathsf{P}_{s}^{k}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|G_{w_k}(\mathcal{Y}_i)\cap G_{w_k}(\hat{\mathcal{Y}}_i)\right|}{\left|G_{w_k}(\hat{\mathcal{Y}}_i)\right|}, \tag{36}$$

$$\mathsf{R}_{s}^{k}=\frac{1}{N}\sum_{i=1}^{N}\frac{\left|G_{w_k}(\mathcal{Y}_i)\cap G_{w_k}(\hat{\mathcal{Y}}_i)\right|}{\left|G_{w_k}(\mathcal{Y}_i)\right|}. \tag{37}$$

The corresponding F1 score is defined as:

$$\mathsf{F}_{s}^{k}=\frac{2\,\mathsf{P}_{s}^{k}\mathsf{R}_{s}^{k}}{\mathsf{P}_{s}^{k}+\mathsf{R}_{s}^{k}}. \tag{38}$$

The final emotion-wheel-based F1 score is obtained by averaging over all emotion wheels:

$$\mathsf{F}_{\mathsf{EW}}=\frac{1}{K}\sum_{k=1}^{K}\mathsf{F}_{s}^{k}. \tag{39}$$
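Equations (36) to (39) translate fairly directly into code. The sketch below is a hedged illustration rather than the official scorer: the two toy wheels and their category names are hypothetical, words are assumed to be already morphologically and semantically normalized, and treating an empty mapped set as a zero contribution is an assumption, since the text does not specify that edge case.

```python
# Two hypothetical wheels mapping canonical words to level-1 categories.
WHEEL_A = {"happy": "joy", "grateful": "joy", "angry": "anger",
           "frustrated": "anger", "sad": "sadness"}
WHEEL_B = {"happy": "positive", "grateful": "positive", "angry": "negative",
           "frustrated": "negative", "sad": "negative"}


def to_categories(words, wheel):
    """Map normalized words to the wheel's categories (unknown words dropped)."""
    return {wheel[w] for w in words if w in wheel}


def set_level_f1(gt_sets, pred_sets, wheel):
    """F_s^k: harmonic mean of sample-averaged precision and recall."""
    n = len(gt_sets)
    precision_sum = recall_sum = 0.0
    for gt, pred in zip(gt_sets, pred_sets):
        g, p = to_categories(gt, wheel), to_categories(pred, wheel)
        overlap = len(g & p)
        # Assumption: an empty set contributes 0 rather than raising.
        precision_sum += overlap / len(p) if p else 0.0
        recall_sum += overlap / len(g) if g else 0.0
    precision, recall = precision_sum / n, recall_sum / n
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def emotion_wheel_f1(gt_sets, pred_sets, wheels):
    """F_EW: average of the per-wheel F1 scores."""
    return sum(set_level_f1(gt_sets, pred_sets, w) for w in wheels) / len(wheels)


gt_sets = [{"angry"}, {"happy"}]
pred_sets = [{"frustrated", "sad"}, {"happy", "grateful"}]
score = emotion_wheel_f1(gt_sets, pred_sets, [WHEEL_A, WHEEL_B])
print(round(score, 4))
```

Note that precision and recall are averaged over samples before the harmonic mean is taken, following Eq. (38), so the result generally differs from averaging per-sample F1 scores.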
Fine-grained emotion recognition. Fine-grained emotion recognition requires the model to predict one or more open-vocabulary emotion descriptors. We evaluate this task on OV-MERD+ \[[13](https://arxiv.org/html/2606.27652#bib.bib10)\], which extends OV-MERD \[[16](https://arxiv.org/html/2606.27652#bib.bib49)\]. Following MER-UniBench, we use the emotion-wheel-based set-level F1 score as the primary metric.

Basic emotion recognition. Basic emotion recognition asks the model to infer a basic emotion label from multimodal input. We evaluate this task on MER2023 \[[14](https://arxiv.org/html/2606.27652#bib.bib45)\], MER2024 \[[15](https://arxiv.org/html/2606.27652#bib.bib46)\], MELD \[[25](https://arxiv.org/html/2606.27652#bib.bib36)\], and IEMOCAP \[[1](https://arxiv.org/html/2606.27652#bib.bib37)\]. Each sample is associated with a single majority-voted emotion label. Under the official MER-UniBench protocol, this task is evaluated by Hitrate (HIT). For each wheel $w_k$, HIT is defined as:

$$\mathrm{HIT}^{k}=\frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left[G_{w_k}(y_i)\in G_{w_k}(\hat{\mathcal{Y}}_i)\right], \tag{40}$$

where $y_i$ is the ground-truth basic emotion label and $\mathbb{I}[\cdot]$ denotes the indicator function. The final HIT score is averaged over the five emotion wheels. Since HIT mainly reflects recall and does not penalize over-prediction, we additionally report an F1-based evaluation for these four datasets after the same emotion-wheel normalization.
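A minimal sketch of the HIT computation in Eq. (40) follows. The wheels and category names are hypothetical toy data, and the words are assumed to be already normalized to canonical form; only the wheel-specific category mapping is shown.

```python
# Hypothetical wheels mapping canonical words to level-1 categories.
WHEEL_A = {"happy": "joy", "grateful": "joy", "angry": "anger",
           "sad": "sadness"}
WHEEL_B = {"happy": "positive", "grateful": "positive", "angry": "negative",
           "sad": "negative"}


def hit_rate(gt_labels, pred_sets, wheels):
    """Average over wheels of the fraction of samples whose mapped
    ground-truth label appears among the mapped predictions."""
    per_wheel = []
    for wheel in wheels:
        hits = 0
        for label, pred in zip(gt_labels, pred_sets):
            target = wheel.get(label)
            predicted = {wheel[w] for w in pred if w in wheel}
            hits += int(target is not None and target in predicted)
        per_wheel.append(hits / len(gt_labels))
    return sum(per_wheel) / len(per_wheel)


gt_labels = ["happy", "sad"]
pred_sets = [{"grateful", "angry"}, {"angry"}]
hit = hit_rate(gt_labels, pred_sets, [WHEEL_A, WHEEL_B])
print(hit)
```

In the first sample the extra prediction *angry* does not lower the score, which illustrates why HIT behaves like a recall measure and why an F1-based evaluation is a useful complement.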

Sentiment analysis. Sentiment analysis requires the model to determine the polarity of the multimodal input. We evaluate this task on CMU-MOSI \[[39](https://arxiv.org/html/2606.27652#bib.bib39)\], CMU-MOSEI \[[40](https://arxiv.org/html/2606.27652#bib.bib40)\], CH-SIMS \[[38](https://arxiv.org/html/2606.27652#bib.bib41)\], and CH-SIMS v2 \[[21](https://arxiv.org/html/2606.27652#bib.bib42)\]. Since our model produces free-form emotion words rather than sentiment labels, we follow the MER-UniBench protocol and use Qwen2.5-7B-Instruct to map the extracted emotion words into a sentiment category, selected from {*positive*, *negative*, *neutral*}. We report the Weighted Average F-score (WAF) as the primary metric. The prompt used for this mapping is shown below.

Sentiment Mapping Prompt: Please assume the role of an expert in emotions. We provide a set of emotion words describing a character. Please choose the most likely sentiment from the candidates: \[positive, negative, neutral\].

### B.2 MME-Emotion

MME-Emotion \[[41](https://arxiv.org/html/2606.27652#bib.bib8)\] is a holistic benchmark for evaluating emotional intelligence in multimodal large language models. Unlike MER-UniBench, which mainly evaluates open-vocabulary emotion recognition with emotion-wheel-based metrics, MME-Emotion uses LLM-based judging to assess both the final emotion prediction and the reasoning process.

Tasks. MME-Emotion contains eight evaluation tasks: emotion recognition in laboratory scenarios (ER-Lab), emotion recognition in the wild (ER-Wild), emotion recognition under noisy conditions (Noise-ER), fine-grained emotion recognition (FG-ER), multi-label emotion recognition (ML-ER), sentiment analysis (SA), fine-grained sentiment analysis (FG-SA), and intent recognition (IR). These tasks cover both controlled and in-the-wild scenarios, and evaluate different aspects of affective understanding, including categorical emotion recognition, fine-grained affective interpretation, sentiment polarity, and intent inference.

Evaluation metrics. MME-Emotion reports three metrics. *Recognition* measures whether the final predicted emotion or affective label is correct. *Reasoning* evaluates whether the generated explanation is coherent, grounded, and supportive of the final answer. *CoT* is the average of Recognition and Reasoning. Following the benchmark protocol, task-level results are judged by an evaluator model. Since the original GPT-4o evaluator used by the benchmark is no longer available in our setting, we use `gemini-3.1-flash-lite-preview` as the evaluator for all compared methods to ensure a consistent evaluation protocol. In Table [2](https://arxiv.org/html/2606.27652#S5.T2), we report the CoT score for each task and the averaged Recognition, Reasoning, and CoT scores across all tasks.

## Appendix C Implementation details

We use Qwen2.5-Omni \[[34](https://arxiv.org/html/2606.27652#bib.bib13)\] as the backbone and train only its Thinker module. All models are trained on MER-Caption+ \[[11](https://arxiv.org/html/2606.27652#bib.bib50)\] with a two-stage SFT-RL pipeline based on GRPO-style optimization. We consider two reasoning baselines. The reproduced AffectGPT-R1 \[[17](https://arxiv.org/html/2606.27652#bib.bib21)\] follows its original SFT/RL data allocation, while keeping the backbone and optimization framework consistent with our implementation. Our stronger *Baseline* adopts an RL-heavy allocation: we randomly sample 5k MER-Caption+ instances for SFT and use the remaining samples for RL. For this *Baseline* and MER-R1, we train the SFT stage for 2 epochs with a learning rate of $2\times 10^{-5}$, and train the RL stage for 1 epoch with a learning rate of $2\times 10^{-6}$. During RL, we sample $G=4$ responses per prompt, train for 3262 optimization steps, set the batch size to 1, and use gradient accumulation with 2 steps. We set both $\lambda_{\mathsf{cali}}$ and $\lambda_{\mathsf{fmt}}$ to $0.1$. All experiments are conducted on 16 NVIDIA H100 GPUs, and the RL stage takes approximately 18 hours. For fair comparison, we evaluate all datasets using a single final checkpoint without dataset-specific checkpoint selection.

![Refer to caption](https://arxiv.org/html/2606.27652v1/x6.png) Figure 7: Training dynamics on dual rewards.
![Refer to caption](https://arxiv.org/html/2606.27652v1/x7.png) Figure 8: Retained hitrate comparison.

## Appendix D Additional analysis

Training dynamics on dual rewards. Figure 7 compares the recall and precision reward dynamics of the *Baseline* and MER-R1. MER-R1 achieves a substantially higher recall reward, reaching around $0.75$ compared with about $0.67$ for the *Baseline*. Meanwhile, both methods maintain similar precision rewards, fluctuating around $0.65$. This shows that MER-R1 improves recall without sacrificing precision. The dual-objective design therefore better balances the recall–precision trade-off and preserves the recall-oriented coverage that is often suppressed during slow thinking.

Retained hitrate analysis. We define retained hitrate as the fraction of ground-truth emotion categories that are correctly predicted by fast thinking and still preserved in the final slow-thinking answer. As shown in Figure [8](https://arxiv.org/html/2606.27652#A3.F8), MER-R1 consistently retains more correct fast-thinking predictions than the *Baseline* across all five recognition benchmarks. For example, the retained hitrate increases from $83.5$ to $94.7$ on OV-MERD+, from $86.5$ to $95.4$ on MER2023, and from $89.4$ to $97.2$ on MER2024, with similar gains on MELD and IEMOCAP. These results show that MER-R1 better preserves useful fast-thinking intuition during slow thinking, instead of filtering out correct categories as the *Baseline* often does. This supports the role of slow-fast confidence calibration in selectively retaining recall-oriented signals from fast thinking.
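One plausible reading of the retained hitrate definition can be sketched as follows. This is an interpretation, not the authors' evaluation script: it assumes the denominator counts ground-truth categories hit by fast thinking, pools counts over samples, and compares already-normalized category strings. The example values are taken from the fast-intuition retention case in Table 6, with the slow-thinking nouns rewritten by hand in adjective form.

```python
def retained_hitrate(gt_sets, fast_sets, slow_sets):
    """Share of fast-thinking hits that survive in the slow-thinking answer.

    Assumptions: all sets hold normalized category strings, and counts
    are pooled over samples before dividing.
    """
    fast_hits = retained = 0
    for gt, fast, slow in zip(gt_sets, fast_sets, slow_sets):
        hit = gt & fast              # ground-truth categories found by fast thinking
        fast_hits += len(hit)
        retained += len(hit & slow)  # ...that are still in the final answer
    return retained / fast_hits if fast_hits else 0.0


# Table 6 example, compared at the string level for illustration.
gt = [{"angry", "dissatisfied"}]
fast = [{"angry", "frustrated", "anxious", "disappointed"}]
baseline_slow = [{"disappointed", "frustrated"}]
mer_r1_slow = [{"angry", "frustrated", "anxious", "stressed"}]

baseline_score = retained_hitrate(gt, fast, baseline_slow)
mer_r1_score = retained_hitrate(gt, fast, mer_r1_slow)
print(baseline_score, mer_r1_score)
```

On this single example the baseline drops the one fast-thinking hit (*angry*) while MER-R1 keeps it, which mirrors the qualitative case; the benchmark-level numbers in Figure 8 would additionally depend on the emotion-wheel mapping.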

## Appendix E Qualitative analysis

We provide qualitative examples to further illustrate how MER-R1 realizes slow-fast thinking synergy in multimodal emotion recognition. Specifically, we present three types of cases. (i) *Fast-intuition retention* refers to cases where fast thinking already predicts a correct emotion, but the baseline slow-thinking answer removes it after deliberation; MER-R1 keeps this useful fast-thinking intuition in the final answer. (ii) *Spurious emotion suppression* refers to cases where fast thinking introduces incorrect emotions due to over-coverage; MER-R1 suppresses these spurious predictions through slow-thinking selectivity. (iii) *Missed emotion recovery* refers to cases where baseline fast thinking fails to cover the ground-truth emotion, while MER-R1 recovers it after slow reasoning.

For each case type, we show examples from both fine-grained emotion recognition and basic emotion recognition. The fast-intuition retention examples show that MER-R1 does not simply discard the useful predictions produced by fast thinking; instead, it preserves correct categories such as *angry* or *happy* that are missed by the baseline slow-thinking answer. The spurious emotion suppression examples show that MER-R1 can remove unsupported emotions such as *angry* or *resentment*, keeping the final prediction focused on the ground-truth emotions. The missed emotion recovery examples further show that MER-R1 can recover emotions overlooked by baseline fast thinking, such as *surprised*, *worried*, and other positive fine-grained emotions. Together, these examples qualitatively demonstrate that MER-R1 selectively combines the recall-oriented intuition of fast thinking with the precision-oriented selectivity of slow thinking.

Table 6: Fine-grained emotion recognition example of fast-intuition retention.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x8.png)

| Example of fast-intuition retention | |
| --- | --- |
| Subtitle (En) | I tell you, it’s not easy for a woman who has divorced and has a child to find a partner, it’s not as easy as you think. |
| Ground Truth | Labels: angry, dissatisfied |
| Baseline Fast Thinking | <think\></think\><answer\>angry, frustrated, anxious, disappointed</answer\> |
| Baseline Slow Thinking | <think\>In the text, the subtitle content says, "I’ll tell you, it’s not easy for a woman who has divorced and is raising a child to find a partner." Based on the woman’s relaxed and comfortable posture and facial expression in the video clues, as well as the individual’s high level of emotional stress and frustration in the audio clues, we can infer that this sentence may carry a negative or depressed tone. This sentence may be the woman’s evaluation or complaint about the current situation or some difficulties. The woman’s positive emotions and comfortable posture in the video clues indicate that she may be able to cope with the current situation or solve problems in some way, but based on the individual’s negative emotions in the audio clues, we can infer that this sentence may carry a sense of disappointment or frustration. Therefore, this sentence may be the woman’s negative evaluation of the current situation, aligning with her overall display of negative emotions.</think\><answer\>disappointment, frustration</answer\> |
| MER-R1 Slow Thinking | <think\>In the text, the subtitle reads: "It’s difficult for a woman to find a partner after divorce." This sentence may be the speaker’s response to a certain problem or situation. Based on the audio clues describing the speaker’s aggressive tone and high-pitched voice, as well as the audio clue of the speaker expressing feelings of anger or frustration, we can infer that this sentence may carry a tone of complaint or dissatisfaction. The speaker may be angry or dissatisfied about the difficulties or troubles of a woman finding a partner after divorce, hence using this statement to express their emotions.</think\><answer\>angry, frustrated, anxious, stressed</answer\> |
| Explanation | Baseline slow thinking misses the ground-truth emotion *angry*, while MER-R1 slow thinking retains it. |

Table 7: Basic emotion recognition example of fast-intuition retention.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x9.png)

| Example of fast-intuition retention | |
| --- | --- |
| Subtitle (En) | Thank you all, thank you for not treating me as an outsider. Actually, before this, I thought I only had one friend in my life, Old Tan. I didn’t expect to meet you all on the 22nd floor. |
| Ground Truth | Labels: happy |
| Baseline Fast Thinking | <think\></think\><answer\>happy, surprised, grateful</answer\> |
| Baseline Slow Thinking | <think\>In the text, the subtitle reads: "Thank you all, thank you for not treating me as an outsider. Actually, before this, I thought I only had one friend in my life, Old Tan, but I didn’t expect to meet you all on the 22nd floor." This sentence may be the woman expressing gratitude to the people around her. Based on the audio clue of the woman’s voice being filled with excitement and enthusiasm, as well as the video clue of the woman smiling and laughing while driving, we can infer that this sentence may carry a tone of excitement and gratitude. The woman may be surprised and grateful for the people around her, feeling that she has made more friends than she thought. Therefore, this sentence aligns with the overall positive emotions displayed by the woman.</think\><answer\>excitement, gratitude, surprise</answer\> |
| MER-R1 Slow Thinking | <think\>In the text, the subtitle reads: "Thank you all, thank you for not treating me as an outsider. Actually, before this, I thought I only had one friend, Mr. Tan. But I didn’t expect to meet you all on the 22nd floor." This sentence might be a woman’s expression of gratitude and excitement to those around her. Based on the audio clues, with the woman’s voice having a high pitch and slightly faster pace, as well as the video clues showing the woman’s happiness and excitement, we can infer that this sentence might be delivered with a positive, appreciative, and joyful tone. The woman might be sharing her happiness and gratitude with those around her, while also feeling unexpected and fortunate for meeting new friends.</think\><answer\>happy, grateful, appreciative, positive</answer\> |
| Explanation | Baseline slow thinking misses the ground-truth emotion *happy*, while MER-R1 slow thinking retains it. |

Table 8: Fine-grained emotion recognition example of spurious emotion suppression.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x10.png)

| Example of spurious emotion suppression | |
| --- | --- |
| Subtitle (En) | You would feel better this way, wouldn’t you? |
| Ground Truth | Labels: concerned, negative, sad |
| Baseline Fast Thinking | <think\></think\><answer\>angry, frustrated, anxious, sad, disappointed</answer\> |
| MER-R1 Slow Thinking | <think\>In the text, the subtitle reads, "Can you feel better this way?" This sentence may be a woman’s response or concern for someone. Based on the audio clue describing the woman’s crying and heavy breathing, as well as the video clue of the woman’s sad and crying facial expression, we can infer that this sentence may carry a tone of concern or comfort. The woman may be responding to someone’s distress or negative emotions, hoping that the other person can feel better or cope with the situation. Therefore, this sentence may express the woman’s concern and care for the other person.</think\><answer\>concerned, worried, sad, anxious</answer\> |
| Explanation | Baseline fast thinking predicts the spurious emotion *angry*, whereas MER-R1 slow thinking suppresses it and keeps the prediction focused on ground-truth emotions. |

Table 9: Basic emotion recognition example of spurious emotion suppression.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x11.png)

| Example of spurious emotion suppression | |
| --- | --- |
| Subtitle (En) | Grandpa. |
| Ground Truth | Labels: sad |
| Baseline Fast Thinking | <think\></think\><answer\>anxiety, nervousness, worry, confusion, frustration, disappointment, stress, sadness, anger, resentment, urgency, despair, panic</answer\> |
| MER-R1 Slow Thinking | <think\>In the text, the caption reads: "Grandpa." This phrase might be a man’s way of addressing someone or a title. Based on the video hints, with the man’s facial expression showing contemplation and determination, and his body language suggesting relaxation, we can infer that this statement could be the man expressing concern, worry, or asking for help or assistance to his grandfather. The audio clues describing his voice with a slight tremble, at a slightly faster pace, and at a higher pitch, suggest he might be experiencing intense emotions such as anxiety, excitement, or stress. Therefore, this sentence could be the man expressing his emotional distress or seeking help.</think\><answer\>anxiety, sadness, concern, worry</answer\> |
| Explanation | Baseline fast thinking predicts the spurious emotions *anger* and *resentment*, whereas MER-R1 slow thinking suppresses them and keeps the prediction focused on ground-truth emotions. |

Table 10: Fine-grained emotion recognition example of missed emotion recovery.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x12.png)

| Example of missed emotion recovery | |
| --- | --- |
| Subtitle (En) | It’s hard to see it. |
| Ground Truth | Labels: relaxed, shy, happy, teasing |
| Baseline Fast Thinking | <think\></think\><answer\>excited, anxious, frustrated, stressed, stressed out</answer\> |
| MER-R1 Slow Thinking | <think\>In the text, caption content: "It’s hard to see it." This sentence might be a woman’s response or expression towards the man. Based on the audio clues, which include a woman’s laughter and her voice filled with excitement and anticipation, it can be inferred that the woman is experiencing positive emotions and is comfortable around the man. The video hints at the woman’s comfortable and relaxed state. From the woman’s voice in the audio clues being high-pitched and enthusiastic, it can be speculated that she is experiencing positive emotions and might be responding to the man’s words or actions in a playful or teasing manner. Therefore, this sentence could be a woman’s sarcastic or humorous way of responding, fitting well with her overall positive emotions.</think\><answer\>happy, content, playful, teasing, amused</answer\> |
| Explanation | Baseline fast thinking misses all ground-truth emotions, whereas MER-R1 slow thinking recovers the correct positive emotions, including *happy*, *content*, and *teasing*. |

Table 11: Fine-grained emotion recognition example of missed emotion recovery.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x13.png)

| Example of missed emotion recovery | |
| --- | --- |
| Subtitle (En) | Ah! We won’t see each other on Sunday? |
| Ground Truth | Labels: surprised |
| Baseline Fast Thinking | <think\></think\><answer\>disappointment, frustration, anxiety</answer\> |
| MER-R1 Slow Thinking | <think\>In the text, the subtitle reads: "Ah, we’re not going to see each other on Sunday." Based on the audio clues, which describe the speaker’s voice with a high pitch and excited tone, along with the speaker’s use of "ah" to express emotional distress, we can infer that this sentence might convey a sense of anxiety or surprise. The speaker’s high pitch and excited tone suggest that they are experiencing strong emotions, while the "ah" in the subtitle could be the expression of surprise or confusion. Therefore, this sentence might reflect the speaker’s concern or disappointment over the lack of a Sunday meeting.</think\><answer\>confused, anxious, surprised, frustrated</answer\> |
| Explanation | Baseline fast thinking misses the ground-truth emotion *surprised*, while MER-R1 slow thinking successfully recovers this missed emotion in its final prediction. |

Table 12: Basic emotion recognition example of missed emotion recovery.

![[Uncaptioned image]](https://arxiv.org/html/2606.27652v1/x14.png)

| | Example of missed emotion recovery |
| --- | --- |
| Subtitle (En) | Are you afraid that my presence here will affect your interaction with Cao Xiaoqiang? |
| Ground Truth | Labels: worried |
| Baseline (Fast Thinking) | `<think></think><answer>angry, anxious, frustrated</answer>` |
| MER-R1 (Slow Thinking) | `<think>In the text, the caption reads: "Are you really that afraid of me here." This sentence might be a man’s response to someone’s concern or unease. Based on the audio cues describing a personal voice with a flat and monotonous tone, a higher pitch and slightly more aggressive nature, as well as the video clues showing the man’s serious and thoughtful expression, we can infer that the statement might convey a tone of worry or concern. The man might feel uneasy or stressed because of the other person’s question or response, thus this sentence could be his way of responding to express his own worry or anxiety.</think><answer>concerned, fearful, worried, anxious</answer>` |
| Explanation | Baseline fast thinking misses the ground-truth emotion worried, whereas MER-R1 slow thinking recovers precise emotions such as worried and concerned, with other predictions remaining contextually plausible. |
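
The recall gap illustrated in Tables 10 to 12 can be checked mechanically. The following is a minimal sketch, assuming exact string matching between predicted and ground-truth labels; this is not necessarily the matching rule used by the paper's benchmarks, and the helper names `parse_answer` and `set_recall_precision` are hypothetical. The reasoning text inside `<think>` is elided because only the `<answer>` span is scored here.

```python
from __future__ import annotations

import re


def parse_answer(output: str) -> set[str]:
    """Return the lower-cased labels found between <answer> and </answer>."""
    match = re.search(r"<answer>(.*?)</answer>", output, flags=re.DOTALL)
    if match is None:
        return set()
    return {part.strip().lower() for part in match.group(1).split(",") if part.strip()}


def set_recall_precision(pred: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-level recall and precision under exact label matching (an assumption)."""
    hits = pred & gold
    recall = len(hits) / len(gold) if gold else 0.0
    precision = len(hits) / len(pred) if pred else 0.0
    return recall, precision


# (ground truth, baseline fast-thinking output, MER-R1 slow-thinking output),
# copied from Tables 10 to 12 with the <think> content omitted.
EXAMPLES = {
    "Table 10": (
        "relaxed, shy, happy, teasing",
        "<think></think><answer>excited, anxious, frustrated, stressed, stressed out</answer>",
        "<think>(omitted)</think><answer>happy, content, playful, teasing, amused</answer>",
    ),
    "Table 11": (
        "surprised",
        "<think></think><answer>disappointment, frustration, anxiety</answer>",
        "<think>(omitted)</think><answer>confused, anxious, surprised, frustrated</answer>",
    ),
    "Table 12": (
        "worried",
        "<think></think><answer>angry, anxious, frustrated</answer>",
        "<think>(omitted)</think><answer>concerned, fearful, worried, anxious</answer>",
    ),
}

results = {}
for name, (gold_text, baseline_out, mer_r1_out) in EXAMPLES.items():
    gold = {label.strip().lower() for label in gold_text.split(",")}
    results[name] = {
        "baseline": set_recall_precision(parse_answer(baseline_out), gold),
        "mer_r1": set_recall_precision(parse_answer(mer_r1_out), gold),
    }
    print(name, results[name])
```

Under this exact-match assumption, the baseline scores zero recall in all three examples, while the MER-R1 answers reach recall 0.5 (Table 10) and 1.0 (Tables 11 and 12). Exact matching gives no credit to near-synonyms such as content versus relaxed in Table 10, so a synonym-aware metric would likely report values at least this high.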