SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs

arXiv cs.LG Papers

Summary

SHALA-LLM is a reinforcement learning framework that enables LLMs to learn directly from annotator distributions and dynamically prioritize highly ambiguous samples during alignment, improving agreement with human label distributions and classification performance.

arXiv:2606.05376v1 Announce Type: new Abstract: Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet, existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called SMARTLY HANDLING AMBIGUOUS LABELS IN ALIGNING LLMS (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions, e.g. on ChaosNLI, it reduces Jensen-Shannon Distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance.

# SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs
Source: [https://arxiv.org/html/2606.05376](https://arxiv.org/html/2606.05376)
Jingyao Wu¹\*, Ashley Wang¹\*, Keane Ong¹,², Paul Pu Liang¹, Rosalind W. Picard¹ (\*equal contribution). ¹MIT Media Lab, Massachusetts Institute of Technology; ²National University of Singapore. {jingyaow,ashley25,keaneong,ppliang}@mit.edu, picard@media.mit.edu

###### Abstract

Many human-centered tasks, including natural language inference (NLI) and emotion recognition (ER), have multiple plausible interpretations, leading to label ambiguity and challenging disagreements across human annotators. As LLMs are increasingly deployed in real-world settings, faithfully modeling such ambiguity is essential to identify contested inputs, preserve variability in ambiguous cases, and capture the full distribution of human judgments. Yet, existing LLM alignment approaches have predominantly assumed a single correct label, excluding annotator disagreement during optimization. Instead of treating this ambiguity as noise, we show how to treat it as information that improves model behavior through a new algorithm called Smartly Handling Ambiguous Labels in Aligning LLMs (SHALA-LLM). This reinforcement learning framework provides a new way for LLMs to learn directly from annotator distributions while dynamically prioritizing highly ambiguous samples during optimization. Experiments on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI, GoEmotions, and MSP-Podcast, demonstrate that SHALA-LLM improves agreement with annotator label distributions; e.g., on ChaosNLI, it reduces Jensen–Shannon Distance by up to 62.1%. At the same time, SHALA-LLM improves F1 by up to 16.7%, showing that modeling annotator disagreement can also strengthen classification performance. Code will be available upon publication.

## 1 Introduction

In many human-centered tasks, different individuals may interpret the same text, speech, or interaction differently depending on contextual understanding, personal experience, cultural background, or emotional perception Uma et al. ([2021](https://arxiv.org/html/2606.05376#bib.bib43)). This phenomenon is especially common in natural language inference (NLI) Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)); Chen et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib12)) and emotion recognition (ER) Sethu et al. ([2019](https://arxiv.org/html/2606.05376#bib.bib40)); Wu et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib22)), leading to disagreement across human annotators and complicating the alignment of large language models (LLMs). Most existing LLM alignment paradigms treat disagreement as annotation noise, collapsing ambiguous annotations into single target labels through majority voting, label averaging, or calibration methods that assume a single gold standard Radharapu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib44)). In doing so, they discard the disagreement embedded within human annotations, particularly in highly ambiguous settings where annotator uncertainty and conflicting judgments are prominent Baan et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib15)). Although recent studies have explored distributional and pluralistic alignment in LLMs, including uncertainty-aware distribution elicitation and alignment with diverse human viewpoints or preferences Sorensen et al. ([2024](https://arxiv.org/html/2606.05376#bib.bib45)); Meister et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib46)), they primarily operate at inference time through prompting or distribution estimation, rather than directly optimizing LLMs with annotator disagreement during training, limiting learning dynamics and model adaptation under highly ambiguous settings Baan et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib15)).

In this paper, we present SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs, a new ambiguity-aware distributional alignment framework that directly learns from human disagreement under ambiguous supervision. SHALA-LLM dynamically reweights rollout rewards according to the degree of annotator disagreement, enabling highly ambiguous samples containing richer disagreement structures to exert greater influence during learning while remaining aligned with the underlying distributions of human judgments under ambiguous supervision.

Evaluations on ambiguity-sensitive NLI and ER benchmarks, including ChaosNLI Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)), GoEmotions Demszky et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib20)), and MSP-Podcast Busso et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib21)), demonstrate that SHALA-LLM consistently improves distributional alignment and classification results compared with majority-label supervision. On ChaosNLI, SHALA-LLM reduces Jensen-Shannon Distance (JSD) by 62.1% while improving F1 by 16.6%; on MSP-Podcast, SHALA-LLM reduces JSD by 6.2% and improves F1 by 29.2%. Systematic analyses reveal that SHALA-LLM exhibits stronger robustness under highly ambiguous conditions, provides particularly strong benefits for inherently ambiguous semantic categories, and encourages reasoning behaviors that better reflect uncertainty and multiple plausible interpretations under ambiguous supervision. This demonstrates that preserving disagreement structures leads to more robust and human-aligned LLM behavior under ambiguity.

## 2 Related Work

##### Modeling ambiguity in NLP\.

Annotations for human-centered tasks from multiple human annotators may disagree, reflecting important information Aroyo and Welty ([2015](https://arxiv.org/html/2606.05376#bib.bib16)); Davani et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib10)); Chen et al. ([2024b](https://arxiv.org/html/2606.05376#bib.bib17)); Plank ([2022](https://arxiv.org/html/2606.05376#bib.bib28)). This phenomenon has been studied across multiple domains, including semantic reasoning tasks such as natural language inference (NLI) Jayaweera and Dorr ([2025](https://arxiv.org/html/2606.05376#bib.bib14)); Pavlick and Kwiatkowski ([2019](https://arxiv.org/html/2606.05376#bib.bib23)) and subjective perception tasks such as emotion recognition and affective computing Niu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib51)); Chou et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib26)); Wu et al. ([2024b](https://arxiv.org/html/2606.05376#bib.bib41)); Sethu et al. ([2019](https://arxiv.org/html/2606.05376#bib.bib40)). Researchers increasingly recognize the importance of addressing annotation ambiguity, with prior work in categorical representation settings such as NLI and discrete ER investigating soft-label supervision Wu et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib22)); Fard et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib52)), multi-label formulations Ando et al. ([2019](https://arxiv.org/html/2606.05376#bib.bib36)); Ju et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib37)), and disagreement-aware learning approaches Chou and Lee ([2019](https://arxiv.org/html/2606.05376#bib.bib50)). In continuous affect prediction settings, others have explored Gaussian distributions Han et al. ([2017](https://arxiv.org/html/2606.05376#bib.bib33)), Beta distributions Bose et al. ([2024](https://arxiv.org/html/2606.05376#bib.bib34)); Wu et al. ([2024a](https://arxiv.org/html/2606.05376#bib.bib42)), and non-parametric label distributions Wu et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib35)). However, these approaches have been developed primarily for conventional supervised learning, while their use in LLM alignment still needs to be advanced.

![Refer to caption](https://arxiv.org/html/2606.05376v1/x1.png)

Figure 1: Overview of the proposed SHALA-LLM: Smartly Handling Ambiguous Labels in Aligning LLMs. Given ambiguity-sensitive tasks such as NLI and ER (a), annotations from multiple annotators are aggregated into empirical label distributions. SHALA-LLM prompts the LLM to verbalize probability distributions over candidate classes (b-a), which are directly aligned with annotator distributions through the SHALA reward (b-b). Rollout rewards are dynamically reweighted according to annotator disagreement during GRPO optimization (b-c), producing distributions that better capture uncertainty and diverse human interpretations under ambiguous supervision (b-d).

##### LLM alignment for ambiguous labels.

As LLM use expands in human-centered applications, models increasingly encounter ambiguous labels. Recent studies have therefore begun analyzing how LLMs behave under annotator disagreement and subjective supervision settings Lu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib18)); Jia et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib19)). Efforts have explored ambiguity-aware instruction fine-tuning Hong et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib27)), soft-label supervision Mostafazadeh Davani et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib9)); Chen et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib12), [2024a](https://arxiv.org/html/2606.05376#bib.bib13)), distributional alignment Meister et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib46)), and disagreement-aware alignment strategies Baan et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib15)) to better align LLM outputs with distributions of human judgments. Smart handling of ambiguous labels for alignment has yet to be extended to RL reasoning methods such as Group Relative Policy Optimization (GRPO) Shao et al. ([2024](https://arxiv.org/html/2606.05376#bib.bib8)).

## 3 Proposed SHALA-LLM Framework

### 3.1 Problem Overview

Fig. [1](https://arxiv.org/html/2606.05376#S2.F1) gives an overview of our problem. We consider human subjective tasks (Fig. [1](https://arxiv.org/html/2606.05376#S2.F1)a) where annotations are derived from human perceptions, interpretations, and opinions. Labels are collected from multiple annotators, and we treat ambiguity that arises between them as information, not noise.

Assume the labels collected from $N$ human annotators for sample $q$ are denoted as:

$$\{l_{(q,1)}, l_{(q,2)}, \dots, l_{(q,N)}\}, \tag{1}$$

where each $l_{(q,i)} \in \{1, \dots, C\}$ corresponds to one of $C$ candidate classes.

Instead of collapsing annotations into a single majority label, we derive an Empirical Label Distribution over annotator judgments, which serves as the supervision signal:

$$p_{q,c} = \frac{n_{q,c}}{N}, \quad c = 1, \dots, C, \quad \sum_{c=1}^{C} p_{q,c} = 1, \tag{2}$$

where $n_{q,c}$ denotes the number of annotators assigning class $c$ to sample $q$, and $\mathbf{p}_q \in \mathbb{R}^{C}$ represents the corresponding annotator distribution.
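Eq. (2) can be illustrated with a minimal Python sketch. The function name `empirical_label_distribution` and the ten example annotations are hypothetical and are not taken from the paper; classes are indexed from 1 to $C$ as in Eq. (1).

```python
from collections import Counter

def empirical_label_distribution(labels, num_classes):
    """Eq. (2): p_{q,c} = n_{q,c} / N for classes c = 1..C."""
    if not labels:
        raise ValueError("at least one annotation is required")
    counts = Counter(labels)          # n_{q,c} for every class that was chosen
    n = len(labels)                   # N, the number of annotators
    return [counts.get(c, 0) / n for c in range(1, num_classes + 1)]

# Hypothetical sample: 10 annotators, C = 3 classes.
labels = [1, 1, 1, 1, 1, 1, 2, 2, 2, 3]
p_q = empirical_label_distribution(labels, num_classes=3)
print(p_q)  # [0.6, 0.3, 0.1]
```

A majority vote would keep only class 1 here, whereas the distribution retains the 30% and 10% minority readings as part of the supervision signal.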

Given an input sample $x_q$, the LLM generates a sampled textual response $o_{(q,i)}$ containing predicted probabilities over the candidate label space:

$$\hat{\mathbf{p}}_{(q,i)} = [\hat{p}_{(q,i),1}, \hat{p}_{(q,i),2}, \dots, \hat{p}_{(q,i),C}]. \tag{3}$$

$$\hat{p}_{(q,i),c} = P(\hat{y} = c \mid x_q, o_{(q,i)}). \tag{4}$$

We formulate the learning objective within a reward-based optimization framework (Fig. [1](https://arxiv.org/html/2606.05376#S2.F1)b). Instead of optimizing toward a single collapsed label, our goal is to encourage generated predictions that better align with the underlying distribution of human interpretations. Based on this formulation, we develop SHALA-LLM, which enables LLMs to learn directly from annotator distributions under ambiguous human annotations. We adopt GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05376#bib.bib8)) as the optimization backbone due to its flexibility in reward design and its effectiveness in LLM alignment. This formulation allows smart handling of ambiguous labels to be naturally incorporated in reward-based learning.

### 3.2 Verbalized Distributions from LLM

To obtain distributional outputs from LLMs, we adopt a verbalized distribution prediction formulation (Fig. [1](https://arxiv.org/html/2606.05376#S2.F1)b-a), where the model directly generates probability estimates over candidate classes in textual form. Such verbalized distributions have been shown to provide an effective mechanism for eliciting uncertainty-aware predictions from LLMs Radharapu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib44)).

The sampled textual response $o_{(q,i)}$ for the input sample $x_q$ is represented as a verbalized distribution over candidate classes:

$$o_{(q,i)} = \{(c_1, \hat{p}_{(q,i),1}), \dots, (c_C, \hat{p}_{(q,i),C})\}, \tag{5}$$

where $\hat{p}_{(q,i),c}$ denotes the predicted probability assigned to candidate class $c$ for rollout $o_{(q,i)}$. The resulting probability estimates are then parsed into the predicted label distribution $\hat{\mathbf{p}}_{(q,i)}$ defined in Eq. ([3](https://arxiv.org/html/2606.05376#S3.E3)).
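The parsing step can be sketched as follows. This section does not specify the textual format of the verbalized distribution, so the `label: probability` layout, the renormalization, and the choice to return `None` for malformed rollouts are all assumptions made for illustration; `parse_verbalized_distribution` is a hypothetical helper.

```python
import re

def parse_verbalized_distribution(text, class_names):
    """Parse 'label: prob' pairs from a rollout into a probability vector.

    Assumed format (not from the paper): each class name is followed by
    ':' or '=' and a non-negative number. Returns None if any class is
    missing or the total mass is zero.
    """
    probs = []
    for name in class_names:
        pattern = rf"\b{re.escape(name)}\b\s*[:=]\s*([0-9]*\.?[0-9]+)"
        match = re.search(pattern, text, re.IGNORECASE)
        if match is None:
            return None
        probs.append(float(match.group(1)))
    total = sum(probs)
    if total <= 0:
        return None
    # Renormalize so the parsed vector sums to 1, as Eq. (3) expects.
    return [p / total for p in probs]

classes = ["entailment", "neutral", "contradiction"]
rollout = "The premise is vague. entailment: 0.5, neutral: 0.3, contradiction: 0.2"
print(parse_verbalized_distribution(rollout, classes))
```

How malformed rollouts are rewarded is a separate design decision; one possible choice, again an assumption, is to give them the minimum reward.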

### 3.3 Group Relative Policy Optimization

We consider an LLM parameterized by $\theta$ as the policy model $\pi_\theta(a \mid s)$, where $s$ denotes the input prompt and $a$ denotes the generated output. Following GRPO (Shao et al., [2024](https://arxiv.org/html/2606.05376#bib.bib8)), for each sample $q$, a rollout group $G_{(q)}$ consisting of multiple sampled outputs $\{o_{(q,i)}\}$ is generated, where $i \in G_{(q)}$ indexes individual rollouts with corresponding rewards $r_{(q,i)}$. GRPO then computes the group-normalized advantage:

$$\hat{A}_{(q,i)} = \frac{r_{(q,i)} - \hat{\mu}_{G_{(q)}}}{\hat{\sigma}_{G_{(q)}} + \varepsilon} \tag{6}$$

where $\hat{\mu}_{G_{(q)}}$ and $\hat{\sigma}_{G_{(q)}}$ are the mean and standard deviation of $\{r_{(q,i)}\}_{i=1}^{|G_{(q)}|}$. With a PPO clipped surrogate $\tilde{A}_{(q,i):k}(\theta)$ constructed from $\hat{A}_{(q,i)}$, GRPO then optimizes $\pi_\theta(a \mid s)$ using a PPO-style trust-region objective (the full formulation of GRPO is provided in App. [A](https://arxiv.org/html/2606.05376#A1)):

$$J_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(q)\sim\mathcal{D}}\,\mathbb{E}_{\{o_{(q,i)}\}\sim\pi_{\theta_{\mathrm{old}}}}\!\left[\frac{1}{|G_{(q)}|}\sum_{i\in G_{(q)}}\frac{1}{n_{o_{(q,i)}}}\sum_{k=1}^{n_{o_{(q,i)}}}\tilde{A}_{(q,i):k}(\theta)\right] - \beta\,\mathbb{E}\!\left[D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)\right] \tag{7}$$
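The group-normalized advantage of Eq. (6) can be sketched in a few lines of Python. The reward values are hypothetical, and the use of the population standard deviation and of $\varepsilon = 10^{-6}$ are assumptions; the text above does not state which estimator or constant is used.

```python
import statistics

def group_normalized_advantages(rewards, eps=1e-6):
    """Eq. (6): A_i = (r_i - mean) / (std + eps) within one rollout group.

    Assumption: population standard deviation; the source does not say
    whether the population or the sample estimator is used.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four rollouts of the same sample q.
rewards = [0.9, 0.7, 0.5, 0.7]
advantages = group_normalized_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={a:+.3f}")
```

Rollouts above the group mean receive positive advantages and those below receive negative ones; a group in which every rollout earns the same reward yields zero advantage for all of them.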

### 3.4 SHALA Reward

Our hypothesis is that ambiguous samples often contain disagreement structures that provide informative signals regarding nuance, interpretational diversity, uncertainty, and conflicting human perceptions. We present a novel way to use this hypothesis to improve LLM alignment, SHALA-LLM, and describe its reward-based optimization algorithm within GRPO.

SHALA-LLM consists of two key components (Fig. [1](https://arxiv.org/html/2606.05376#S2.F1)b-b): First, it aligns LLM predictions with annotator distributions, preserving the structure of human judgments during optimization rather than collapsing supervision into a dominant label. Second, it dynamically reweights rollout contributions according to the degree of annotator disagreement, enabling highly ambiguous samples to exert greater influence during policy optimization.

Following the GRPO formulation in Section [3.3](https://arxiv.org/html/2606.05376#S3.SS3), for each sampled rollout output $o_{(q,i)}$ generated from the LLM, we define the ambiguity-enhanced reward:

$$r_{(q,i)}^{\mathrm{SHALA}} = \tilde{H}(\mathbf{p}_q)\left[1 - D_{\mathrm{JS}}\left(\hat{\mathbf{p}}_{(q,i)}, \mathbf{p}_q\right)\right], \tag{8}$$

where $\hat{\mathbf{p}}_{(q,i)}$ denotes the predicted label distribution generated from rollout $o_{(q,i)}$ following Section [3.2](https://arxiv.org/html/2606.05376#S3.SS2), and $\mathbf{p}_q$ denotes the corresponding annotator distribution for sample $q$ defined in Section [3.1](https://arxiv.org/html/2606.05376#S3.SS1).

The component $1 - D_{\mathrm{JS}}(\hat{\mathbf{p}}_{(q,i)}, \mathbf{p}_q)$ measures agreement between the predicted distribution and the annotator distribution using Jensen-Shannon (JS) Distance. Since $D_{\mathrm{JS}}(\cdot,\cdot) \in [0,1]$ and the modulation factor $\tilde{H}(\mathbf{p}_q)$ also lies in $[0,1]$, the resulting reward is bounded within $[0,1]$.

Importantly, not all disagreement structures are equally informative during learning. To explicitly prioritize highly ambiguous samples containing richer uncertainty and interpretational diversity, we introduce ambiguity-enhanced reward modulation based on the normalized entropy of the annotator distribution:

$$\tilde{H}(\mathbf{p}_q) = \frac{-\sum_{c=1}^{C} p_{q,c} \log p_{q,c}}{\log C}, \tag{9}$$

where $\tilde{H}(\mathbf{p}_q) \in [0,1]$ quantifies the degree of annotator disagreement in sample $q$, while $p_{q,c}$ refers to the empirical label distribution defined in Eq. ([2](https://arxiv.org/html/2606.05376#S3.E2)).
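Eqs. (8) and (9) can be combined into a short Python sketch. Two points are assumptions rather than statements from the paper: the JS distance is computed as the square root of the base-2 JS divergence (which is consistent with the stated $[0,1]$ range), and terms with zero probability are skipped under the usual convention $0 \log 0 = 0$. All function names are hypothetical.

```python
import math

def normalized_entropy(p):
    """Eq. (9): Shannon entropy of p divided by log C (requires C >= 2)."""
    return sum(-pi * math.log(pi) for pi in p if pi > 0) / math.log(len(p))

def js_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))   # guard against tiny negative round-off

def shala_reward(p_hat, p_q):
    """Eq. (8): entropy-weighted agreement between prediction and annotators."""
    return normalized_entropy(p_q) * (1.0 - js_distance(p_hat, p_q))

p_q = [0.5, 0.4, 0.1]                          # hypothetical annotator distribution
print(shala_reward([0.5, 0.4, 0.1], p_q))      # exact match
print(shala_reward([0.9, 0.05, 0.05], p_q))    # over-confident prediction
print(shala_reward([0.2, 0.3, 0.5], [1.0, 0.0, 0.0]))  # unanimous annotators
```

One direct consequence of Eq. (9) is visible in the last call: when all annotators agree, $\tilde{H}(\mathbf{p}_q) = 0$, so Eq. (8) assigns zero reward to every rollout for that sample regardless of the prediction.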

##### Distributional alignment under annotator ambiguity\.

The proposed reward is incorporated into the GRPO framework through the rollout reward term in Eq. ([6](https://arxiv.org/html/2606.05376#S3.E6)). Since GRPO computes policy updates using group-normalized rollout advantages, the reward distribution directly influences rollout contribution dynamics. Unlike discrete majority-label rewards that produce sparse and highly concentrated signals, SHALA-LLM generates continuous rewards proportional to distributional agreement with human annotations. As a result, sampled outputs exhibiting partial alignment with annotator distributions can still contribute positively to policy learning. This reshapes the reward geometry within each rollout group, yielding smoother advantage distributions and more informative policy updates under ambiguous supervision. Compared with conventional majority-label optimization, SHALA-LLM preserves the full structure of human judgments, enabling the model to learn from diverse and potentially conflicting labels rather than only the dominant view.

##### Dynamic ambiguity\-modulated policy optimization\.

While distributional alignment preserves the full structure of human judgments, SHALA-LLM further introduces reward modulation that dynamically reweights rollout contributions utilizing annotator ambiguity. Since GRPO computes policy updates through rollout advantage magnitudes, SHALA reward scaling directly modulates the relative contribution strength of sampled outputs. Samples with higher annotator ambiguity receive proportionally amplified rollout rewards and therefore exert greater influence on the resulting policy updates. Compared with uniform distributional supervision, the ambiguity-conditioned optimization by SHALA potentially enables the policy model to better learn variability while maintaining alignment with the underlying annotator distributions.

## 4 Experimental Setup

##### Datasets.

We evaluate the proposed SHALA-LLM framework on ambiguity-sensitive human-centered tasks exhibiting different forms of annotator disagreement, including semantic ambiguity in NLI and affective ambiguity in ER. For NLI, we conduct experiments on ChaosNLI Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)), which contains 100 annotations per sample to capture diverse semantic interpretations and disagreement, as well as its underlying ChaosNLI-M Williams et al. ([2018](https://arxiv.org/html/2606.05376#bib.bib25)) and ChaosNLI-S Bowman et al. ([2015](https://arxiv.org/html/2606.05376#bib.bib24)) subsets. For ER, we evaluate on MSP-Podcast Busso et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib21)), one of the largest naturalistic speech emotion corpora, and GoEmotions Demszky et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib20)), which contains diverse emotion categories and intentionally ambiguous samples. Both emotion datasets include 5–12 annotations per sample, reflecting subjective emotional perception and interpretation. For MSP-Podcast, we include both speech and text modalities to evaluate SHALA-LLM under multimodal ambiguity.

##### Model\.

All experiments are conducted using Qwen2.5-Omni-7B Xu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib32)) as the LLM. We adopt a unified GRPO-based optimization framework across all experiments, using the GRPOTrainer provided by TRL von Werra et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib38)).

##### Baselines\.

We compare the proposed SHALA-LLM against a range of baselines and ablation settings. These include: (1) a Zero-Shot (ZS) inference model without task-specific training; (2) Majority-Label Supervision (MLS) (refer to App. [D.4](https://arxiv.org/html/2606.05376#A4.SS4)), which encourages the model to assign high probability mass to the dominant annotation and serves as a reward-based analogue of conventional majority-label training; (3) recent state-of-the-art methods reported for each dataset, as justified in App. [D.5](https://arxiv.org/html/2606.05376#A4.SS5); and (4) an ablation variant without ambiguity enhancement, denoted as SHALA-LLM (w/o Ambi-En), which removes reward modulation by setting $\tilde{H}(\mathbf{p}_q) = 1$.

##### Evaluation\.

To evaluate distributional alignment between model predictions and annotator distributions, we report Jensen-Shannon Distance (JSD) (↓) and Bhattacharyya Coefficient (BC) (↑). Both metrics lie in [0,1]. We additionally report conventional classification metrics, including Accuracy (↑), macro F1-score (↑), and Weighted F1 (W-F1) (↑), to assess whether distributional supervision maintains competitive performance under standard evaluation protocols. We note that the latter metrics do not account for ambiguity. Full details of the experimental settings are in App. [D](https://arxiv.org/html/2606.05376#A4).
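As a complement to the metric definitions, the following sketch computes the Bhattacharyya Coefficient and derives a single label for the conventional metrics. Taking the argmax of each distribution as the class label is an assumption; this section does not state how predictions are mapped to labels for Accuracy and F1. The two distributions are hypothetical.

```python
import math

def bhattacharyya_coefficient(p, q):
    """BC(p, q) = sum_c sqrt(p_c * q_c); 1 for identical, 0 for disjoint."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))

def argmax_label(p):
    """Index of the most probable class (ties resolve to the lowest index)."""
    return max(range(len(p)), key=lambda c: p[c])

p_q = [0.6, 0.3, 0.1]      # hypothetical annotator distribution
p_hat = [0.5, 0.3, 0.2]    # hypothetical model prediction
print(round(bhattacharyya_coefficient(p_hat, p_q), 3))  # 0.989
print(argmax_label(p_hat) == argmax_label(p_q))         # True
```

The example shows why both metric families are reported: the prediction matches the majority label exactly, while BC still registers the remaining gap between the two distributions.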

## 5 Results and Discussion

### 5.1 Overall Performance Comparison

Table 1: Performance comparisons of the proposed SHALA-LLM framework with baselines across NLI and ER datasets. Relative percentage changes are computed with respect to Zero-shot inference (ZS). Best and second-best results are highlighted in bold and underline. Relative performance improvement (+) and degradation (-) compared with the ZS are reported in brackets.

| Dataset | Method | JSD (↓) | BC (↑) | Acc (↑) | F1 (↑) | W-F1 (↑) |
|---|---|---|---|---|---|---|
| ChaosNLI | Zero-shot | 0.375 | 0.850 | 0.603 | 0.473 | 0.547 |
| | MLS | 0.477 (-27.2%) | 0.751 (-11.6%) | 0.699 (+15.9%) | 0.650 (+37.4%) | 0.684 (+25.0%) |
| | SHALA-LLM (w/o Ambi-En) | 0.192 (+48.8%) | 0.964 (+13.4%) | 0.736 (+22.1%) | 0.686 (+45.0%) | 0.721 (+31.8%) |
| | SHALA-LLM | 0.181 (+51.7%) | 0.966 (+13.6%) | 0.768 (+27.4%) | 0.758 (+60.3%) | 0.767 (+40.2%) |
| ChaosNLI-M | Zero-shot | 0.376 | 0.845 | 0.649 | 0.511 | 0.607 |
| | MLS | 0.510 (-35.6%) | 0.731 (-13.5%) | 0.695 (+7.1%) | 0.637 (+24.7%) | 0.683 (+12.5%) |
| | SHALA-LLM (w/o Ambi-En) | 0.188 (+50.0%) | 0.968 (+14.6%) | 0.701 (+8.0%) | 0.588 (+15.1%) | 0.671 (+10.5%) |
| | SHALA-LLM | 0.173 (+54.0%) | 0.972 (+15.0%) | 0.760 (+17.1%) | 0.737 (+44.2%) | 0.758 (+24.9%) |
| ChaosNLI-S | Zero-shot | 0.375 | 0.855 | 0.557 | 0.436 | 0.490 |
| | MLS | 0.445 (-18.7%) | 0.770 (-9.9%) | 0.703 (+26.2%) | 0.654 (+50.0%) | 0.686 (+40.0%) |
| | SHALA-LLM (w/o Ambi-En) | 0.195 (+48.0%) | 0.960 (+12.3%) | 0.769 (+38.1%) | 0.746 (+71.1%) | 0.763 (+55.7%) |
| | SHALA-LLM | 0.191 (+49.1%) | 0.961 (+12.4%) | 0.775 (+39.1%) | 0.767 (+75.9%) | 0.775 (+58.2%) |
| MSP-Podcast (Speech + Text) | Zero-shot | 0.640 | 0.508 | 0.421 | 0.266 | 0.388 |
| | MLS | 0.580 (+10.3%) | 0.585 (+7.7%) | 0.488 (+13.7%) | 0.233 (-3.3%) | 0.415 (+2.7%) |
| | SHALA-LLM (w/o Ambi-En) | 0.550 (+14.1%) | 0.658 (+29.5%) | 0.482 (+14.5%) | 0.276 (+3.8%) | 0.433 (+11.6%) |
| | SHALA-LLM | 0.544 (+15.0%) | 0.694 (+36.7%) | 0.496 (+17.8%) | 0.301 (+13.2%) | 0.455 (+17.3%) |
| GoEmotions (Text) | Zero-shot | 0.681 | 0.480 | 0.361 | 0.345 | 0.377 |
| | MLS | 0.542 (+20.4%) | 0.638 (+32.9%) | 0.595 (+64.8%) | 0.591 (+71.3%) | 0.595 (+57.8%) |
| | SHALA-LLM (w/o Ambi-En) | 0.449 (+34.1%) | 0.750 (+56.3%) | 0.611 (+69.3%) | 0.544 (+57.7%) | 0.603 (+59.9%) |
| | SHALA-LLM | 0.465 (+31.7%) | 0.756 (+57.5%) | 0.600 (+66.2%) | 0.589 (+70.7%) | 0.595 (+57.8%) |

Table 2: Performance comparison between the new SHALA-LLM framework and existing ambiguity-aware approaches across NLI and ER datasets. Best results are highlighted in bold.

| Dataset | Method | JSD (↓) | BC (↑) | Acc (↑) | F1 (↑) | WF1 (↑) |
|---|---|---|---|---|---|---|
| ChaosNLI | LLM-Explain Chen et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib12)) | 0.207 | – | – | – | 0.645 |
| | LLM-MJD Chen et al. ([2024a](https://arxiv.org/html/2606.05376#bib.bib13)) | 0.208 | – | – | – | 0.621 |
| | SHALA-LLM | 0.181 | 0.966 | 0.768 | 0.758 | 0.767 |
| ChaosNLI-M | Chaos-Benchmark Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)) | 0.306 | – | 0.635 | – | – |
| | Dist. NLI Zhou et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib31)) | 0.197 | – | 0.637 | – | – |
| | AmbiNLI Meissner et al. ([2021](https://arxiv.org/html/2606.05376#bib.bib29)) | 0.223 | – | 0.584 | – | – |
| | Flan-T5 Lee et al. ([2023](https://arxiv.org/html/2606.05376#bib.bib30)) | 0.260 | – | 0.726 | – | – |
| | SHALA-LLM | 0.173 | 0.972 | 0.760 | 0.737 | 0.758 |
| ChaosNLI-S | Chaos-Benchmark Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)) | 0.220 | – | 0.787 | – | – |
| | Dist. NLI Zhou et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib31)) | 0.181 | – | 0.794 | – | – |
| | AmbiNLI Meissner et al. ([2021](https://arxiv.org/html/2606.05376#bib.bib29)) | 0.189 | – | 0.755 | – | – |
| | Flan-T5 Lee et al. ([2023](https://arxiv.org/html/2606.05376#bib.bib30)) | 0.231 | – | 0.738 | – | – |
| | SHALA-LLM | 0.191 | 0.961 | 0.775 | 0.767 | 0.775 |
| MSP-Podcast | TTS-Benchmark Jia et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib19)) | 0.285 | 0.621 | 0.423 | 0.253 | – |
| | SHALA-LLM | 0.544 | 0.649 | 0.496 | 0.301 | 0.455 |
| GoEmotions | AER-LLM (ZS) Hong et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib27)) | 0.49 | 0.54 | 0.371 | – | 0.357 |
| | AER-LLM (FS) Hong et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib27)) | 0.44 | 0.70 | 0.505 | – | 0.511 |
| | SHALA-LLM | 0.47 | 0.76 | 0.60 | 0.59 | 0.60 |

Tables [1](https://arxiv.org/html/2606.05376#S5.T1) and [2](https://arxiv.org/html/2606.05376#S5.T2) show the performance across ambiguity-sensitive tasks under both distributional and conventional single-label evaluation metrics.

##### Comparison with baselines\.

As shown in Table [1](https://arxiv.org/html/2606.05376#S5.T1), SHALA-LLM consistently improves both distributional alignment and conventional classification performance compared with ZS and MLS across all datasets and evaluation metrics. On ChaosNLI, SHALA-LLM reduces JSD from 0.477 to 0.181 (62.1%↓) while improving BC from 0.751 to 0.966 (28.6%↑) compared with MLS. SHALA-LLM also substantially improves conventional classification performance, increasing Acc from 0.699 to 0.768 (9.9%↑), F1 from 0.650 to 0.758 (16.6%↑), and W-F1 from 0.684 to 0.767 (12.1%↑). Similarly strong trends are observed on its subsets ChaosNLI-M and ChaosNLI-S and across the ER tasks. On MSP-Podcast, SHALA-LLM reduces JSD from 0.580 to 0.544 (6.2%↓) and improves F1 from 0.233 to 0.301 (29.2%↑) compared with MLS. Although the performance differences on GoEmotions between SHALA-LLM and MLS remain relatively moderate on conventional classification metrics, the overall results demonstrate that SHALA-LLM remains highly effective across ambiguous ER settings. Overall, these findings suggest that preserving annotator disagreement during optimization not only improves alignment with human judgment distributions, but can also benefit dominant-label prediction performance under ambiguous supervision.

Notably, for the NLI datasets, although MLS generally improves conventional classification performance over ZS, it simultaneously leads to substantially worse distributional alignment. For example, on ChaosNLI, MLS increases JSD from 0.375 to 0.477 (27.2%↑) while reducing BC from 0.850 to 0.751 (11.6%↓). These findings suggest that MLS encourages the model to collapse ambiguity into dominant labels, thereby discarding variability and disagreement structures embedded within annotator distributions.
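The relative changes quoted in this comparison follow from the Table 1 values by simple arithmetic, which the sketch below reproduces. The helper `relative_change_pct` is hypothetical; the input numbers are the rounded Table 1 entries, so the check only confirms that the rounded values are consistent with the reported percentages.

```python
def relative_change_pct(baseline, value):
    """Signed relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline

# (description, baseline value, compared value), all taken from Table 1.
checks = [
    ("ChaosNLI JSD, MLS -> SHALA-LLM", 0.477, 0.181),
    ("ChaosNLI F1, MLS -> SHALA-LLM", 0.650, 0.758),
    ("MSP-Podcast JSD, MLS -> SHALA-LLM", 0.580, 0.544),
    ("MSP-Podcast F1, MLS -> SHALA-LLM", 0.233, 0.301),
    ("ChaosNLI JSD, ZS -> MLS", 0.375, 0.477),
    ("ChaosNLI BC, ZS -> MLS", 0.850, 0.751),
]
for name, base, val in checks:
    print(f"{name}: {relative_change_pct(base, val):+.1f}%")
```

For JSD a negative change is an improvement, since lower is better; for BC and F1 a positive change is an improvement.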

##### Ablation studies\.

We further compare SHALA-LLM against its ablation variant, SHALA-LLM (w/o Ambi-En), which removes ambiguity-enhanced reward modulation and optimizes only the distributional alignment objective. While SHALA-LLM (w/o Ambi-En) already demonstrates strong improvements over ZS and MLS, the full SHALA-LLM framework consistently achieves further gains across both distributional and classification metrics. On ChaosNLI, SHALA-LLM further reduces JSD from 0.192 to 0.181 (5.4%↓) while improving BC from 0.964 to 0.966 (0.2%↑) compared with SHALA-LLM (w/o Ambi-En). SHALA-LLM also improves Acc from 0.736 to 0.768 (4.3%↑), F1 from 0.686 to 0.758 (10.5%↑), and W-F1 from 0.721 to 0.767 (6.4%↑).

On GoEmotions, although the performance differences remain relatively moderate, the fullSHALA\-LLMframework still demonstrates competitive and consistently stronger performance across several evaluation metrics compared withSHALA\-LLM\(w/o\) Ambi\-En, including improvements on BC \(0\.8%↑\\uparrow\) and F1 \(8\.3%↑\\uparrow\)\. These findings suggest that ambiguity\-enhanced reward modulation provides additional optimization benefits beyond distributional alignment alone, while the primary gains are already largely achieved through ambiguity\-aware distributional supervision\.

##### Comparison with state of the art\.

![Refer to caption](https://arxiv.org/html/2606.05376v1/figures/nli_all_ambiguity.png)

Figure 2: Performance comparison across different ambiguity levels on the ChaosNLI dataset\.

We further compare SHALA\-LLM with previously reported ambiguity\-aware approaches\. While direct comparisons should be interpreted cautiously due to differences in model architectures and experimental settings, SHALA\-LLM achieves competitive or superior performance on both distributional and conventional metrics\. For example, on ChaosNLI, SHALA\-LLM achieves lower JSD and substantially higher W\-F1 than recent ambiguity\-aware methods, including LLM\-Explain Chen et al\. \([2025](https://arxiv.org/html/2606.05376#bib.bib12)\) and LLM\-MJD Chen et al\. \([2024a](https://arxiv.org/html/2606.05376#bib.bib13)\)\. Similar trends are observed on MSP\-Podcast and GoEmotions\. These findings further support the effectiveness and generalizability of SHALA\-LLM across diverse ambiguity\-sensitive tasks\.

Collectively, these findings suggest that highly ambiguous samples provide particularly informative supervision signals during optimization\. By dynamically prioritizing samples with substantial annotator disagreement, SHALA\-LLM encourages the model to better capture uncertainty and diverse human interpretations embedded in ambiguous annotations\. Overall, the results demonstrate that ambiguity\-aware optimization in reinforcement learning–based alignment leads to stronger agreement with human judgment distributions under subjective supervision\.

### 5\.2Analysis at Different Ambiguity Levels

To gain deeper insight into the robustness and effectiveness of ambiguity\-aware optimization, we further analyze model performance across varying levels of annotator disagreement\. Specifically, we partition samples according to the entropy of annotator distributions and evaluate different supervision strategies under low\-, medium\-, and high\-ambiguity settings\. Details of the data partition procedure can be found in App\. [B](https://arxiv.org/html/2606.05376#A2)\.
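As an illustration of this kind of partition, the sketch below computes the Shannon entropy of each annotator distribution and splits the samples into three equally sized bins\. The tercile rule and the helper names `entropy` and `partition_by_entropy` are assumptions made for illustration; the actual procedure is the one described in App\. B\.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a label distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)


def partition_by_entropy(dists):
    """Split sample indices into low/medium/high ambiguity bins.

    Assumption: bins are entropy terciles. The paper's own thresholds
    are given in its appendix and may differ.
    """
    order = sorted(range(len(dists)), key=lambda i: entropy(dists[i]))
    n = len(order)
    cut1, cut2 = n // 3, 2 * n // 3
    return {
        "low": order[:cut1],
        "medium": order[cut1:cut2],
        "high": order[cut2:],
    }


# Hypothetical annotator distributions over (E, N, C).
dists = [
    [1.0, 0.0, 0.0],
    [0.8, 0.1, 0.1],
    [0.45, 0.17, 0.38],
    [0.34, 0.33, 0.33],
    [0.9, 0.1, 0.0],
    [0.5, 0.5, 0.0],
]
bins = partition_by_entropy(dists)
```

Sorting by entropy first keeps the bins balanced in size, which is convenient for per\-bin comparisons; fixed entropy thresholds would be an equally plausible alternative\.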

##### Overall performance at all ambiguity levels\.

Fig\. [2](https://arxiv.org/html/2606.05376#S5.F2) presents the results on the ChaosNLI dataset\. Overall, the proposed SHALA\-LLM framework consistently achieves superior performance compared with ZS and MLS baselines across all ambiguity levels\. Specifically, for distributional evaluation \(Fig\. [2](https://arxiv.org/html/2606.05376#S5.F2)a\-b\), both SHALA\-LLM and its ablation variant SHALA\-LLM \(w/o\) exhibit consistently better median performance together with reduced variance across ambiguity levels compared with baseline methods\. Similar observations are also seen for conventional classification metrics \(Fig\. [2](https://arxiv.org/html/2606.05376#S5.F2)c\-d\), where SHALA\-LLM \(green lines\) consistently achieves the strongest overall performance across all ambiguity levels\.

##### Robustness as ambiguity level increases\.

Importantly, across the boxplots, the performance of SHALA\-LLM and SHALA\-LLM \(w/o\) remains relatively stable as ambiguity increases, with no statistically significant degradation observed across ambiguity levels \($p > 0.05$\)\. In contrast, the baseline methods exhibit substantially larger degradation under higher ambiguity conditions\. On the conventional classification metrics, all methods show lower performance in highly ambiguous settings, but the degradation remains smaller for SHALA\-LLM\. For example, the F1 of SHALA\-LLM decreases by 32\.0% from low\- to high\-ambiguity, whereas MLS shows a larger drop of 40\.2%\. These findings suggest that preserving annotator disagreement during optimization improves robustness under increasingly subjective supervision conditions\.

Interestingly, MLS \(purple boxplots\) demonstrates competitive and occasionally superior performance under low\-ambiguity settings, where dominant consensus labels are clearer\. However, its performance drops substantially under medium\- and high\-ambiguity conditions\. For example, its BC decreases by 28\.56% from the low\-ambiguity level \(0\.970\) to the high\-ambiguity level \(0\.693\), where it falls below even ZS \(0\.844\)\. This suggests that conventional majority\-label optimization struggles to generalize once supervision becomes increasingly subjective and disagreement structures become more prominent\.

![Refer to caption](https://arxiv.org/html/2606.05376v1/figures/nli_majority_class_row.png)

Figure 3: Category\-level performance analysis on the ChaosNLI dataset across different classes\.

Finally, comparing SHALA\-LLM against its ablation variant SHALA\-LLM \(w/o\), we observe that SHALA\-LLM consistently achieves stronger robustness under higher ambiguity levels, reflected by both higher overall performance and smaller performance degradation as ambiguity increases\. These findings indicate that dynamically prioritizing highly ambiguous samples during optimization further strengthens the model’s ability to learn from rich disagreement structures embedded within human annotations\.

### 5\.3Interpreting Model Behavior under Ambiguous Supervision

We further investigate how ambiguity\-aware optimization influences model behavior under ambiguous supervision\. We analyze category\-level performance patterns and representative reasoning examples to better understand how SHALA\-LLM captures disagreement structures and multiple plausible interpretations in ambiguous settings\.

##### Quantitative analysis across different semantic categories\.

Fig\. [3](https://arxiv.org/html/2606.05376#S5.F3) presents a category\-level analysis on the ChaosNLI dataset across the Entailment, Neutral, and Contradiction classes\. Overall, both SHALA\-LLM and its ablation variant SHALA\-LLM \(w/o\) consistently improve performance across all semantic categories compared with ZS and MLS baselines\. The largest improvements are observed for the Contradiction class, which also exhibits the highest ambiguity and annotator disagreement\.

While all methods achieve relatively strong performance on Entailment and Neutral, baseline approaches struggle substantially on Contradiction samples\. In contrast, SHALA\-LLM achieves dramatic improvements across all evaluation metrics for contradiction cases \(Fig\. [3](https://arxiv.org/html/2606.05376#S5.F3)c\-d\), improving F1 from 0\.09 \(ZS\) and 0\.50 \(MLS\) to 0\.83, while ACC increases from 0\.05 \(ZS\) and 0\.33 \(MLS\) to 0\.72\. Similar trends are also observed on distributional metrics, with substantially lower JSD and higher BC\.
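For readers who want to reproduce class\-level scores of this kind, the sketch below computes F1 and within\-class accuracy for a single label from dominant\-label predictions\. It assumes that the per\-class ACC is the fraction of samples with that gold label that are predicted correctly; the function name `per_class_scores` and the toy labels are hypothetical and are not data from the paper\.

```python
def per_class_scores(gold, pred, label):
    """Return (within-class accuracy, F1) for one label.

    Assumption: within-class accuracy is the share of samples whose gold
    label is `label` and whose prediction is also `label`.
    """
    pairs = list(zip(gold, pred))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    acc = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return acc, f1


# Toy dominant labels (E, N, C); purely illustrative.
gold = ["C", "C", "C", "E", "N", "E"]
pred = ["C", "N", "C", "E", "N", "C"]
acc_c, f1_c = per_class_scores(gold, pred, "C")
```

Under this reading, a model that rarely predicts Contradiction has few true positives for that class, which lowers both scores at once\.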

These findings suggest that ambiguity\-aware optimization is particularly beneficial for semantic categories with greater uncertainty and conflicting interpretations\. By preserving disagreement structures during optimization, SHALA\-LLM better captures multiple plausible semantic relationships under ambiguous supervision\.

##### Interpreting reasoning behavior under high ambiguity\.

We further examine the reasoning behavior of the model under SHALA\-LLM training\. Table [3](https://arxiv.org/html/2606.05376#S5.T3) presents a representative highly ambiguous example from the ChaosNLI dataset\. While the ZS baseline gives an overconfident prediction, SHALA\-LLM generates probability estimates and reasoning patterns that more closely reflect the underlying annotator disagreement \(refer to the bold text in the table\)\. Rather than collapsing the sample into a single interpretation, SHALA\-LLM reasons over multiple plausible semantic relationships, producing a predicted distribution that better aligns with human judgment distributions\. These findings suggest that ambiguity\-aware optimization influences not only final predictions, but also the model’s underlying reasoning behavior\.

Table 3: Premise: "In short, we all got tired of clever analyses of what might happen; and throughout economics there was a shift in focus away from theorizing, toward data collection and careful statistical analysis"; Hypothesis: "We all got tired of data collection and clever analyses of what might happen"; Labels: Entailment \(E\), Neutral \(N\), or Contradiction \(C\)\.

| Model | Reasoning | E | N | C |
| --- | --- | --- | --- | --- |
| Ground Truth | | \.45 | \.17 | \.38 |
| Zero\-shot | The premise and hypothesis are in direct contradiction\. … There are **no multiple valid interpretations** here\. … | \.00 | \.00 | 1\.00 |
| SHALA\-LLM | There are multiple valid interpretations: 1\. The premise and hypothesis could be seen as contradictory because… 2\. Alternatively, the premise could be interpreted as a shift… Given these interpretations, the probability of entailment is relatively **low**, as… The probability of neutrality is **moderate**, as… The probability of contradiction is **high**, as… | \.20 | \.50 | \.30 |
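To make the comparison in Table 3 concrete, the sketch below scores the two predicted distributions against the ground\-truth annotator distribution\. It assumes that JSD is the Jensen\-Shannon distance \(the square root of the divergence, here with base\-2 logarithms\) and that BC denotes the Bhattacharyya coefficient; the function names are illustrative and not taken from the paper\.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the JS divergence (log base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))


def bhattacharyya_coefficient(p, q):
    """Overlap between two distributions; 1.0 means identical."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Distributions over (E, N, C) as listed in Table 3.
ground_truth = [0.45, 0.17, 0.38]
zero_shot = [0.00, 0.00, 1.00]
shala = [0.20, 0.50, 0.30]

for name, pred in [("zero-shot", zero_shot), ("SHALA-LLM", shala)]:
    jsd = js_distance(ground_truth, pred)
    bc = bhattacharyya_coefficient(ground_truth, pred)
    print(f"{name}: JSD={jsd:.3f}, BC={bc:.3f}")
```

Under these assumptions, the SHALA\-LLM prediction yields a lower distance and a higher coefficient than the zero\-shot prediction for this example\.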

## 6Conclusion

In this work, we introduced Smartly Handling Ambiguous Labels in Aligning LLMs \(SHALA\-LLM\), a reinforcement learning\-based alignment framework that enables large language models to learn directly from annotator distributions under ambiguous supervision\. By preserving disagreement structures and dynamically prioritizing highly ambiguous samples during optimization, SHALA\-LLM consistently improves both distributional alignment and conventional classification performance across natural language inference and emotion recognition tasks, while demonstrating stronger robustness under highly ambiguous conditions\. Overall, our findings highlight the value of treating human disagreement and ambiguity as informative supervision signals rather than annotation noise, suggesting a promising direction toward more robust and human\-centered large language models under subjective real\-world settings\.

## Limitations

One limitation of this work is that SHALA\-LLM is primarily evaluated on tasks with relatively structured label distributions\. While our approach demonstrates improved alignment with distributions of human judgments, the current experiments do not yet capture more complex forms of ambiguity that arise in open\-ended generative tasks, long\-form reasoning, or real\-world multi\-turn interactions\. In addition, our reward formulation assumes that annotator disagreement can be adequately represented through empirical label distributions, which may not fully reflect underlying factors such as annotator expertise, demographic variation, or contextual uncertainty\. Moreover, our experiments on Qwen2\.5\-Omni\-7B disabled stochastic sampling to improve output consistency and parsing stability for verbalized distribution generation\. Future work should further investigate the uncertainty of model\-generated distributions and how sampling variability may influence ambiguity\-aware optimization under subjective supervision\. Finally, although we focus on GRPO\-based reasoning alignment, the generalization of the proposed ambiguity\-aware reward design to other reinforcement learning and preference optimization frameworks remains future work\.

## Ethical Considerations

This work focuses on ambiguity\-aware alignment for subjective human\-centered tasks, including natural language inference and emotion recognition, where disagreement between annotators often reflects diverse interpretations and perceptions rather than annotation noise\. By preserving annotator disagreement distributions during optimization, the proposed framework aims to better capture uncertainty and interpretational diversity instead of enforcing a single dominant label\. Nevertheless, the underlying datasets may still contain societal, cultural, or demographic biases inherited from human annotations, which could influence model behavior and potentially amplify biased disagreement patterns\. Our framework partially mitigates this issue by explicitly modeling disagreement distributions rather than collapsing annotations into majority labels, thereby preserving diverse interpretations under subjective supervision\. Additionally, outputs from ambiguity\-aware models should not be interpreted as definitive representations of human emotions, intentions, or beliefs, particularly in high\-stakes decision\-making settings\.

Potential risks of this work include the possibility that societal, cultural, or demographic biases embedded within human annotations may also be preserved or amplified through ambiguity\-aware optimization\. Additionally, outputs from ambiguity\-aware models could be misinterpreted as definitive representations of human emotions, intentions, or beliefs, particularly in high\-stakes decision\-making settings\. Our framework partially mitigates these risks by explicitly modeling disagreement distributions rather than collapsing annotations into majority labels, thereby preserving diverse interpretations and uncertainty under subjective supervision\.

## Acknowledgments

## References

- A\. Ando, R\. Masumura, H\. Kamiyama, S\. Kobashikawa, and Y\. Aono \(2019\)Speech Emotion Recognition Based on Multi\-Label Emotion Existence Model\.InInterspeech 2019,pp\. 2818–2822\.External Links:[Document](https://dx.doi.org/10.21437/Interspeech.2019-2524),ISSN 2958\-1796Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Aroyo and C\. Welty \(2015\)Truth is a lie: crowd truth and the seven myths of human annotation\.AI magazine36\(1\),pp\. 15–24\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Baan, W\. Aziz, B\. Plank, and R\. Fernandez \(2022\)Stop measuring calibration when humans disagree\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,pp\. 1892–1915\.Cited by:[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1)\.
- D\. Bose, V\. Sethu, and E\. Ambikairajah \(2024\)Continuous emotion ambiguity prediction: modeling with beta distributions\.IEEE Trans\. Affect\. Comput\.15\(3\),pp\. 1684–1695\.External Links:ISSN 1949\-3045,[Link](https://doi.org/10.1109/TAFFC.2024.3367371),[Document](https://dx.doi.org/10.1109/TAFFC.2024.3367371)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- S\. R\. Bowman, G\. Angeli, C\. Potts, and C\. D\. Manning \(2015\)A large annotated corpus for learning natural language inference\.InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,L\. Màrquez, C\. Callison\-Burch, and J\. Su \(Eds\.\),Lisbon, Portugal,pp\. 632–642\.External Links:[Link](https://aclanthology.org/D15-1075/),[Document](https://dx.doi.org/10.18653/v1/D15-1075)Cited by:[§B\.1](https://arxiv.org/html/2606.05376#A2.SS1.p1.1),[§B\.1](https://arxiv.org/html/2606.05376#A2.SS1.p2.1),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px1.p1.1)\.
- C\. Busso, R\. Lotfian, K\. Sridhar, A\. N\. Salman, W\. Lin, L\. Goncalves, S\. Parthasarathy, A\. R\. Naini, S\. Leem, L\. Martinez\-Lucas,et al\.\(2025\)The msp\-podcast corpus\.arXiv preprint arXiv:2509\.09791\.Cited by:[§B\.2](https://arxiv.org/html/2606.05376#A2.SS2.p1.1),[§1](https://arxiv.org/html/2606.05376#S1.p3.1),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px1.p1.1)\.
- B\. Chen, S\. Peng, A\. Korhonen, and B\. Plank \(2025\)A rose by any other name: LLM\-generated explanations are good proxies for human explanations to collect label distributions on NLI\.InFindings of the Association for Computational Linguistics: ACL 2025,W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\),Vienna, Austria,pp\. 10777–10802\.External Links:[Link](https://aclanthology.org/2025.findings-acl.562/),[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.562),ISBN 979\-8\-89176\-256\-5Cited by:[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p1.1),[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.05376#S5.SS1.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.6.2.2.1.2.1)\.
- B\. Chen, X\. Wang, S\. Peng, R\. Litschko, A\. Korhonen, and B\. Plank \(2024a\)“Seeing the big through the small”: can LLMs approximate human judgment distributions on NLI from a few explanations?\.InFindings of the Association for Computational Linguistics: EMNLP 2024,Y\. Al\-Onaizan, M\. Bansal, and Y\. Chen \(Eds\.\),Miami, Florida, USA,pp\. 14396–14419\.External Links:[Link](https://aclanthology.org/2024.findings-emnlp.842/),[Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.842)Cited by:[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p1.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2606.05376#S5.SS1.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.7.1.2.1.2.1)\.
- B\. Chen, X\. Wang, S\. Peng, R\. Litschko, A\. Korhonen, and B\. Plank \(2024b\)“Seeing the big through the small”: can llms approximate human judgment distributions on nli from a few explanations?\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 14396–14419\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Chou, L\. Goncalves, S\. Leem, A\. N\. Salman, C\. Lee, and C\. Busso \(2025\)Minority views matter: evaluating speech emotion classifiers with human subjective annotations by an all\-inclusive aggregation rule\.IEEE Transactions on Affective Computing16\(1\),pp\. 41–55\.External Links:[Document](https://dx.doi.org/10.1109/TAFFC.2024.3411290)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Chou and C\. Lee \(2019\)Every rating matters: joint learning of subjective labels and individual annotators for speech emotion classification\.InICASSP 2019\-2019 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 5886–5890\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- A\. M\. Davani, M\. Díaz, and V\. Prabhakaran \(2022\)Dealing with disagreements: looking beyond the majority vote in subjective annotations\.Transactions of the Association for Computational Linguistics10,pp\. 92–110\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- D\. Demszky, D\. Movshovitz\-Attias, J\. Ko, A\. Cowen, G\. Nemade, and S\. Ravi \(2020\)GoEmotions: a dataset of fine\-grained emotions\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,D\. Jurafsky, J\. Chai, N\. Schluter, and J\. Tetreault \(Eds\.\),Online,pp\. 4040–4054\.External Links:[Link](https://aclanthology.org/2020.acl-main.372/),[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.372)Cited by:[§B\.3](https://arxiv.org/html/2606.05376#A2.SS3.p1.1),[§B\.3](https://arxiv.org/html/2606.05376#A2.SS3.p2.1),[Table 4](https://arxiv.org/html/2606.05376#A2.T4),[§1](https://arxiv.org/html/2606.05376#S1.p3.1),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px1.p1.1)\.
- A\. P\. Fard, M\. M\. Hosseini, T\. D\. Sweeny, and M\. H\. Mahoor \(2025\)Affectnet\+: a database for enhancing facial expression recognition with soft\-labels\.IEEE Transactions on Affective Computing\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Han, Z\. Zhang, M\. Schmitt, M\. Pantic, and B\. Schuller \(2017\)From hard to soft: towards more human\-like emotion recognition by modelling the perception uncertainty\.InProceedings of the 25th ACM International Conference on Multimedia,MM ’17,New York, NY, USA,pp\. 890–897\.External Links:ISBN 9781450349062,[Link](https://doi.org/10.1145/3123266.3123383),[Document](https://dx.doi.org/10.1145/3123266.3123383)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Hong, Y\. Gong, V\. Sethu, and T\. Dang \(2025\)AER\-llm: ambiguity\-aware emotion recognition leveraging large language models\.InICASSP,pp\. 1–5\.External Links:[Link](https://doi.org/10.1109/ICASSP49660.2025.10888198)Cited by:[§B\.2](https://arxiv.org/html/2606.05376#A2.SS2.p1.1),[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p3.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.21.2.2.1.2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.22.1.2.1.2.1)\.
- J\. Hu, Y\. Zhang, Q\. Han, D\. Jiang, X\. Zhang, and H\. Shum \(2025\)Open\-reasoner\-zero: an open source approach to scaling up reinforcement learning on the base model\.External Links:2503\.24290,[Link](https://arxiv.org/abs/2503.24290)Cited by:[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p1.2)\.
- C\. Jayaweera and B\. Dorr \(2025\)From disagreement to understanding: the case for ambiguity detection in nli\.InProceedings of the The 4th Workshop on Perspectivist Approaches to NLP,pp\. 37–46\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Jia, W\. Li, J\. Wu, X\. Yu, Y\. Gao, J\. Cheng, X\. Tang, F\. Xia, and T\. Dang \(2026\)Decoding ambiguous emotions with test\-time scaling in audio\-language models\.arXiv preprint arXiv:2602\.03873\.Cited by:[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p3.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.19.2.2.1.2.1)\.
- X\. Ju, D\. Zhang, J\. Li, and G\. Zhou \(2020\)Transformer\-based label set generation for multi\-modal multi\-label emotion detection\.InProceedings of the 28th ACM International Conference on Multimedia,MM ’20,New York, NY, USA,pp\. 512–520\.External Links:ISBN 9781450379885,[Link](https://doi.org/10.1145/3394171.3413577),[Document](https://dx.doi.org/10.1145/3394171.3413577)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- N\. Lee, N\. M\. An, and J\. Thorne \(2023\)Can large language models capture dissenting human voices?\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,H\. Bouamor, J\. Pino, and K\. Bali \(Eds\.\),Singapore,pp\. 4569–4585\.External Links:[Link](https://aclanthology.org/2023.emnlp-main.278/),[Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.278)Cited by:[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p1.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.12.1.2.1.2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.17.1.2.1.2.1)\.
- Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025\)Understanding r1\-zero\-like training: a critical perspective\.External Links:2503\.20783,[Link](https://arxiv.org/abs/2503.20783)Cited by:[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p1.2)\.
- J\. Lu, K\. Ma, K\. Wang, K\. Xiao, R\. K\. Lee, B\. Xu, L\. Yang, and H\. Lin \(2025\)Is llm an overconfident judge? unveiling the capabilities of llms in detecting offensive language with annotation disagreement\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 5609–5626\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1)\.
- J\. M\. Meissner, N\. Thumwanit, S\. Sugawara, and A\. Aizawa \(2021\)Embracing ambiguity: Shifting the training target of NLI models\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 2: Short Papers\),Online,pp\. 862–869\.External Links:[Link](https://aclanthology.org/2021.acl-short.109),[Document](https://dx.doi.org/10.18653/v1/2021.acl-short.109)Cited by:[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.11.1.2.1.2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.16.1.2.1.2.1)\.
- N\. Meister, C\. Guestrin, and T\. B\. Hashimoto \(2025\)Benchmarking distributional alignment of large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 24–49\.Cited by:[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Mostafazadeh Davani, M\. Díaz, and V\. Prabhakaran \(2022\)Dealing with disagreements: looking beyond the majority vote in subjective annotations\.Transactions of the Association for Computational Linguistics10,pp\. 92–110\.External Links:[Link](https://aclanthology.org/2022.tacl-1.6/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00449)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Nie, X\. Zhou, and M\. Bansal \(2020\)What can we learn from collective human opinions on natural language inference data?\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),B\. Webber, T\. Cohn, Y\. He, and Y\. Liu \(Eds\.\),Online,pp\. 9131–9143\.External Links:[Link](https://aclanthology.org/2020.emnlp-main.734/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.734)Cited by:[§B\.1](https://arxiv.org/html/2606.05376#A2.SS1.p1.1),[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p1.1),[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§1](https://arxiv.org/html/2606.05376#S1.p3.1),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px1.p1.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.14.2.2.1.2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.9.2.2.1.2.1)\.
- M\. Niu, Y\. El\-Tawil, A\. Romana, and E\. M\. Provost \(2025\)Rethinking emotion annotations in the era of large language models\.IEEE transactions on affective computing\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- E\. Pavlick and T\. Kwiatkowski \(2019\)Inherent disagreements in human textual inferences\.Transactions of the Association for Computational Linguistics7,pp\. 677–694\.External Links:[Link](https://aclanthology.org/Q19-1043/),[Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00293)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Plank \(2022\)The “problem” of human label variation: on ground truth in data, modeling and evaluation\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\),Abu Dhabi, United Arab Emirates,pp\. 10671–10682\.External Links:[Link](https://aclanthology.org/2022.emnlp-main.731/),[Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.731)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Radharapu, M\. Revel, M\. Ung, S\. Ruder, and A\. Williams \(2025\)Arbiters of ambivalence: challenges of using llms in no\-consensus tasks\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 4677–4731\.Cited by:[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§3\.2](https://arxiv.org/html/2606.05376#S3.SS2.p1.1)\.
- A\. Rouditchenko, S\. Bhati, E\. Araujo, S\. Thomas, H\. Kuehne, R\. Feris, and J\. Glass \(2025\)Omni\-r1: do you really need audio to fine\-tune your audio llm?\.External Links:2505\.09439,[Link](https://arxiv.org/abs/2505.09439)Cited by:[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p1.2)\.
- V\. Sethu, E\. M\. Provost, J\. Epps, C\. Busso, N\. Cummins, and S\. Narayanan \(2019\)The ambiguous world of emotion representation\.arXiv preprint arXiv:1909\.00360\.Cited by:[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p1.2),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.05376#S3.SS1.p9.1),[§3\.3](https://arxiv.org/html/2606.05376#S3.SS3.p1.9)\.
- T\. Sorensen, J\. Moore, J\. Fisher, M\. Gordon, N\. Mireshghallah, C\. M\. Rytting, A\. Ye, L\. Jiang, X\. Lu, N\. Dziri,et al\.\(2024\)A roadmap to pluralistic alignment\.arXiv preprint arXiv:2402\.05070\.Cited by:[§1](https://arxiv.org/html/2606.05376#S1.p1.1)\.
- A\. N\. Uma, T\. Fornaciari, D\. Hovy, S\. Paun, B\. Plank, and M\. Poesio \(2021\)Learning from disagreement: a survey\.Journal of Artificial Intelligence Research72,pp\. 1385–1470\.Cited by:[§1](https://arxiv.org/html/2606.05376#S1.p1.1)\.
- L\. von Werra, Y\. Belkada, L\. Tunstall, E\. Beeching, T\. Thrush, N\. Lambert, S\. Huang, K\. Rasul, and Q\. Gallouédec \(2020\)TRL: Transformers Reinforcement LearningExternal Links:[Link](https://github.com/huggingface/trl)Cited by:[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p1.2),[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p2.1),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px2.p1.1)\.
- A\. Williams, N\. Nangia, and S\. Bowman \(2018\)A broad\-coverage challenge corpus for sentence understanding through inference\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),pp\. 1112–1122\.External Links:[Link](http://aclweb.org/anthology/N18-1101)Cited by:[§B\.1](https://arxiv.org/html/2606.05376#A2.SS1.p1.1),[§B\.1](https://arxiv.org/html/2606.05376#A2.SS1.p3.1),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px1.p1.1)\.
- J\. Wu, T\. Dang, V\. Sethu, and E\. Ambikairajah \(2022\)A novel sequential monte carlo framework for predicting ambiguous emotion states\.ICASSP 2022 \- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 8567–8571\.External Links:[Link](https://api.semanticscholar.org/CorpusID:249436756)Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wu, T\. Dang, V\. Sethu, and E\. Ambikairajah \(2024a\)Dual\-constrained dynamical neural odes for ambiguity\-aware continuous emotion prediction\.arXiv preprint arXiv:2407\.21344\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wu, T\. Dang, V\. Sethu, and E\. Ambikairajah \(2024b\)Emotion recognition systems must embrace ambiguity\.In2024 12th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos \(ACIIW\),pp\. 166–170\.Cited by:[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Wu, G\. Lin, Y\. Song, and R\. Picard \(2026\)AmbER2: dual ambiguity\-aware emotion recognition applied to speech and text\.arXiv preprint arXiv:2601\.18010\.Cited by:[§B\.2](https://arxiv.org/html/2606.05376#A2.SS2.p1.1),[§C\.1\.2](https://arxiv.org/html/2606.05376#A3.SS1.SSS2.p1.1),[§1](https://arxiv.org/html/2606.05376#S1.p1.1),[§2](https://arxiv.org/html/2606.05376#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Xu, Z\. Guo, J\. He, H\. Hu, T\. He, S\. Bai, K\. Chen, J\. Wang, Y\. Fan, K\. Dang, B\. Zhang, X\. Wang, Y\. Chu, and J\. Lin \(2025\)Qwen2\.5\-omni technical report\.arXiv preprint arXiv:2503\.20215\.Cited by:[§D\.2](https://arxiv.org/html/2606.05376#A4.SS2.p1.2),[§4](https://arxiv.org/html/2606.05376#S4.SS0.SSS0.Px2.p1.1)\.
- X\. Zhou, Y\. Nie, and M\. Bansal \(2022\)Distributed nli: learning to predict human opinion distributions for language reasoning\.InFindings of the Association for Computational Linguistics: ACL 2022,Cited by:[§D\.5](https://arxiv.org/html/2606.05376#A4.SS5.p2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.10.1.2.1.2.1),[Table 2](https://arxiv.org/html/2606.05376#S5.T2.5.15.1.2.1.2.1)\.

## Appendix A Full GRPO Formulation

For completeness, we provide the full formulation of GRPO, including the surrogate objective and importance-sampling formulation. For task $m$ and sample $q$, GRPO samples a rollout group $G_{(m,q)}$ of responses $\{o_{(m,q,i)}\}$, where $i \in G_{(m,q)}$ indexes individual rollouts (i.e., sampled responses) with rewards $r_{(m,q,i)}$, and computes the group-normalized advantage:

$$
\hat{A}_{(m,q,i)} = \frac{r_{(m,q,i)} - \hat{\mu}_{G_{(m,q)}}}{\hat{\sigma}_{G_{(m,q)}} + \varepsilon},
\tag{10}
$$

where $\hat{\mu}_{G_{(m,q)}}$ and $\hat{\sigma}_{G_{(m,q)}}$ are the empirical mean and standard deviation of $\{r_{(m,q,i)}\}_{i=1}^{|G_{(m,q)}|}$. GRPO then optimizes $\pi_{\theta}(a \mid s)$ by performing a PPO-style trust-region update. At token position $k$ of response $o_{(m,q,i)}$, $\varphi_{(m,q,i):k}(\theta)$ denotes the importance sampling ratio between $\pi_{\theta}$ and the old policy $\pi_{\theta_{\mathrm{old}}}$, $\tilde{A}_{(m,q,i):k}(\theta)$ denotes the PPO-clipped surrogate using $\hat{A}_{(m,q,i)}$, and $J_{\mathrm{GRPO}}(\theta)$ averages this surrogate over tokens and rollout samples with an optional KL penalty to a reference policy $\pi_{\mathrm{ref}}$ (with weight $\beta$). We summarize these with a compact objective:

$$
\begin{aligned}
\varphi_{(m,q,i):k}(\theta) &= \frac{\pi_{\theta}\big(o_{(m,q,i):k} \mid q,\, o_{(m,q,i):<k}\big)}{\pi_{\theta_{\mathrm{old}}}\big(o_{(m,q,i):k} \mid q,\, o_{(m,q,i):<k}\big)} \\
\tilde{A}_{(m,q,i):k}(\theta) &= \min\Big(\varphi_{(m,q,i):k}(\theta)\,\hat{A}_{(m,q,i)},\ \operatorname{clip}\big(\varphi_{(m,q,i):k}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_{(m,q,i)}\Big) \\
J_{\mathrm{GRPO}}(\theta) &= \mathbb{E}_{q \sim \mathcal{D}_{m}}\,\mathbb{E}_{\{o_{(m,q,i)}\} \sim \pi_{\theta_{\mathrm{old}}}}\Bigg[\frac{1}{|G_{(m,q)}|}\sum_{i=1}^{|G_{(m,q)}|}\frac{1}{n_{o_{(m,q,i)}}}\sum_{k=1}^{n_{o_{(m,q,i)}}}\tilde{A}_{(m,q,i):k}(\theta)\Bigg] - \beta\,\mathbb{E}\big[D_{\mathrm{KL}}\big(\pi_{\theta}\,\|\,\pi_{\mathrm{ref}}\big)\big].
\end{aligned}
\tag{11}
$$
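As an illustration, the group-normalized advantage of Eq. (10) and the clipped surrogate of Eq. (11) can be sketched in plain Python for a single rollout group. This is a minimal sketch rather than the TRL implementation: the function names, the use of the population standard deviation, and the defaults `eps=1e-6` and `clip_eps=0.2` are assumptions made for the example, and the KL term is omitted (i.e., $\beta = 0$).

```python
import math


def group_normalized_advantages(rewards, eps=1e-6):
    """Eq. (10): (r - mean) / (std + eps), computed within one rollout group."""
    n = len(rewards)
    mean = sum(rewards) / n
    # Population standard deviation; the exact estimator is an assumption here.
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_token_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-clipped surrogate for one token, given log-probs under both policies."""
    ratio = math.exp(logp_new - logp_old)  # importance sampling ratio
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(group, clip_eps=0.2):
    """Average the surrogate over tokens, then over rollouts (no KL term).

    `group` is a list of (token_logps_new, token_logps_old, advantage) tuples,
    one tuple per rollout in the group.
    """
    total = 0.0
    for logps_new, logps_old, advantage in group:
        token_terms = [
            clipped_token_surrogate(new, old, advantage, clip_eps)
            for new, old in zip(logps_new, logps_old)
        ]
        total += sum(token_terms) / len(token_terms)
    return total / len(group)


advantages = group_normalized_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the advantage is shared by every token of a rollout, only the ratio varies across token positions; the clip limits how far a single update can move the policy on any one token.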

## Appendix B Dataset Description

### B.1 ChaosNLI Dataset

ChaosNLI Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)) is a text-based dataset designed to capture human disagreement in natural language inference tasks. It consists of premise–hypothesis pairs drawn from established NLI benchmarks and reannotated by 100 independent human raters per example. Raters are asked to determine whether the hypothesis is entailed by, contradicted by, or neutral with respect to the premise. In this work, we restrict our analysis to examples originating from the Stanford Natural Language Inference Corpus (SNLI) Bowman et al. ([2015](https://arxiv.org/html/2606.05376#bib.bib24)) and the Multi-Genre Natural Language Inference Corpus (MNLI) Williams et al. ([2018](https://arxiv.org/html/2606.05376#bib.bib25)).

ChaosNLI-S is drawn from SNLI Bowman et al. ([2015](https://arxiv.org/html/2606.05376#bib.bib24)), a large-scale natural language inference dataset constructed from image captions in the Flickr30k corpus. Human annotators are presented with a caption describing an image and asked to write three corresponding hypotheses: one that is definitely true given the caption (entailment), one that may be true but is uncertain (neutral), and one that is definitely false (contradiction). This process produces sentence pairs with relatively simple linguistic structure and keeps the data balanced across the three classes.

ChaosNLI-M is drawn from MNLI Williams et al. ([2018](https://arxiv.org/html/2606.05376#bib.bib25)), which is modeled on the SNLI corpus but extends the task to a broader range of spoken and written text. It collects premise sentences from ten distinct genres, including fiction, government reports, travel guides, telephone speech, and spoken conversations. Annotators are then asked to label premise–hypothesis pairs according to the same entailment, neutral, and contradiction categories, resulting in a dataset with more diverse language and more challenging inference phenomena.

### B.2 MSP Podcast

The MSP Podcast (v1.12) Busso et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib21)) dataset is a large-scale naturalistic speech corpus constructed from podcast audio. Each short audio segment is annotated by at least five human raters for perceived emotion labels. To ensure consistency across samples, we follow prior work in restricting the label space to a standard set of eight emotion categories: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise Hong et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib27)); Wu et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib22)). Examples containing annotations outside this label set are discarded.

### B.3 GoEmotions

GoEmotions (Demszky et al., [2020](https://arxiv.org/html/2606.05376#bib.bib20)) is a large-scale text emotion dataset consisting of English Reddit comments annotated with fine-grained emotion labels. Each example is initially annotated by 3–5 human raters, and annotators may assign multiple emotion labels to a single example. If no two annotators agree on any emotion label, two additional raters are assigned to the example. Annotators may also assign no label if they consider the example too difficult to classify. Rather than using the aggregated binary labels released with the original dataset, we use the raw annotator-level responses to construct human annotation distributions.

We apply several preprocessing steps before constructing these distributions. First, we discard examples where fewer than three annotators provide valid emotion labels, to ensure a reliable estimate of the underlying distribution. Second, the original dataset contains 27 fine-grained emotion categories and a neutral label, which makes the target distributions sparse and distribution prediction substantially more difficult. Following the Ekman taxonomy proposed in (Demszky et al., [2020](https://arxiv.org/html/2606.05376#bib.bib20)), we map the original labels into six broader emotion categories, keeping neutral as its own category, as shown in Table [4](https://arxiv.org/html/2606.05376#A2.T4). For each example, annotator labels are mapped to the corresponding Ekman categories, aggregated across raters, and normalized to produce a probability distribution over emotions.

Finally, analysis of the majority label of resulting annotation distributions revealed substantial class imbalance\. We observed that a large proportion of examples were concentrated in the joy and neutral classes\. To reduce bias toward high\-frequency classes and improve evaluation across emotions, we apply stratified sampling to select approximately equal numbers of examples for each dominant emotion category\.

Table 4: Mapping of GoEmotions 27 fine-grained emotion labels to Ekman's six basic emotion categories (Demszky et al., [2020](https://arxiv.org/html/2606.05376#bib.bib20)).

| Ekman Category | GoEmotions Fine-Grained Labels |
| --- | --- |
| Joy | Admiration, Amusement, Approval, Caring, Desire, Excitement, Gratitude, Joy, Love, Optimism, Pride, Relief |
| Sadness | Disappointment, Embarrassment, Grief, Remorse, Sadness |
| Anger | Anger, Annoyance, Disapproval |
| Surprise | Confusion, Curiosity, Realization, Surprise |
| Fear | Fear, Nervousness |
| Disgust | Disgust |
| Neutral | Neutral |
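The aggregation step described in this subsection (map each annotator label to its Ekman category, sum across raters, normalize) might look as follows. This is a sketch under stated assumptions: the helper name `ekman_distribution` and the lowercase label strings are hypothetical, and because the text does not specify how multi-label annotations are weighted, the sketch simply counts every assigned label as one vote.

```python
from collections import Counter

# Mapping taken from Table 4 (label strings lowercased for this example).
EKMAN_MAP = {
    "joy": ["admiration", "amusement", "approval", "caring", "desire",
            "excitement", "gratitude", "joy", "love", "optimism", "pride",
            "relief"],
    "sadness": ["disappointment", "embarrassment", "grief", "remorse",
                "sadness"],
    "anger": ["anger", "annoyance", "disapproval"],
    "surprise": ["confusion", "curiosity", "realization", "surprise"],
    "fear": ["fear", "nervousness"],
    "disgust": ["disgust"],
    "neutral": ["neutral"],
}
FINE_TO_EKMAN = {
    fine: coarse for coarse, fines in EKMAN_MAP.items() for fine in fines
}


def ekman_distribution(annotations):
    """Turn per-rater fine-grained labels into a distribution over Ekman categories.

    `annotations` is a list with one list of fine-grained labels per rater.
    Assumption: each assigned label counts as one vote.
    """
    counts = Counter()
    for rater_labels in annotations:
        for label in rater_labels:
            counts[FINE_TO_EKMAN[label]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid emotion labels for this example")
    return {category: counts[category] / total for category in EKMAN_MAP}


# Three raters; the second rater assigned two labels.
example = ekman_distribution([["joy"], ["admiration", "surprise"], ["annoyance"]])
```

Here `example` assigns 0.5 to joy and 0.25 each to surprise and anger, since two of the four votes map to the joy category.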

## Appendix C Data Processing

### C.1 Dataset statistics

#### C.1.1 ChaosNLI

The ChaosNLI dataset consists of 3,113 examples, including 1,514 examples sourced from SNLI and 1,599 examples sourced from MNLI. Prior to training, the dataset is randomly shuffled and partitioned into training, validation, and test sets using a 65/15/20 split. Table [5](https://arxiv.org/html/2606.05376#A3.T5) summarizes the distribution of the majority label across the dataset, where the majority label for an example is defined as the label receiving the largest proportion of annotations.

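Table 5: ChaosNLI dataset statistics by majority label.

| Dataset | Majority Label | Count |
| --- | --- | --- |
| ChaosNLI | Entailment | 1,168 |
| | Neutral | 1,397 |
| | Contradiction | 548 |
| ChaosNLI-S | Entailment | 424 |
| | Neutral | 811 |
| | Contradiction | 279 |
| ChaosNLI-M | Entailment | 744 |
| | Neutral | 586 |
| | Contradiction | 269 |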
#### C.1.2 MSP Podcast

After preprocessing and filtering, the resulting dataset contains 12,955 examples. Following prior work Wu et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib22)), we adopt an even five-fold partitioning strategy rather than the standard speaker-based splits to achieve a more balanced distribution of ambiguous examples across data partitions. We divide the dataset into five equal-sized folds, using three folds for training, one for validation, and one for testing, resulting in an approximate 60/20/20 split. Table [6](https://arxiv.org/html/2606.05376#A3.T6) summarizes the distribution of majority emotion labels across the dataset.

Table 6: MSP Podcast dataset statistics by majority label.

| Majority Label | Count |
| --- | --- |
| Angry | 1,053 |
| Contempt | 1,053 |
| Disgust | 806 |
| Fear | 603 |
| Happy | 3,519 |
| Neutral | 4,609 |
| Sad | 432 |
| Surprise | 880 |
#### C.1.3 GoEmotions

After preprocessing and filtering, we collect 6,567 examples in the GoEmotions dataset. Examples are randomly shuffled and partitioned into training, validation, and test sets using a 70/15/15 split. Table [7](https://arxiv.org/html/2606.05376#A3.T7) summarizes the distribution of the majority emotion label across the dataset.

Table 7: GoEmotions dataset statistics by majority label.

| Majority Label | Count |
| --- | --- |
| Anger | 1,000 |
| Disgust | 1,000 |
| Fear | 640 |
| Joy | 927 |
| Sadness | 1,000 |
| Surprise | 1,000 |
| Neutral | 1,000 |

### C.2 Data partition across different ambiguity levels

In addition to overall dataset statistics, we analyze how examples are distributed across different levels of ambiguity.

#### C.2.1 ChaosNLI

For ChaosNLI, ambiguity is quantified using the entropy of the ground-truth annotation distribution. Lower entropy corresponds to stronger agreement among annotators, while higher entropy indicates greater disagreement. We partition examples into three ambiguity levels (low, medium, and high) using normalized entropy ranges of $[0, 0.33)$, $[0.33, 0.66)$, and $[0.66, 1]$, respectively. Table [8](https://arxiv.org/html/2606.05376#A3.T8) summarizes the resulting distribution of examples across these ambiguity levels.

Table 8: ChaosNLI dataset statistics by ambiguity level.

| Dataset | Ambiguity Level | Count |
| --- | --- | --- |
| ChaosNLI | Low | 377 |
| | Medium | 1,493 |
| | High | 1,243 |
| ChaosNLI-S | Low | 327 |
| | Medium | 865 |
| | High | 322 |
| ChaosNLI-M | Low | 50 |
| | Medium | 628 |
| | High | 921 |

An interesting observation is that ChaosNLI-M has a much larger share of high-ambiguity examples than ChaosNLI-S: 921 of 1,599 (57.6%) versus 322 of 1,514 (21.3%). This difference likely reflects characteristics of the underlying source datasets. ChaosNLI-S is constructed from image captions and therefore tends to contain shorter, more concrete descriptions with relatively straightforward relationship structures. In contrast, ChaosNLI-M spans a broader range of genres and linguistic styles, introducing more complex language and inference patterns that naturally lead to greater annotator disagreement.
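The entropy-based partition used for ChaosNLI can be sketched as below. The normalization is an assumption: the sketch divides Shannon entropy by $\log K$ for $K$ classes so that the value lies in $[0, 1]$, which makes the choice of logarithm base irrelevant. The function names are hypothetical.

```python
import math


def normalized_entropy(dist):
    """Shannon entropy divided by log(K), so the result lies in [0, 1].

    Normalizing by log(K) is an assumption of this sketch.
    """
    entropy = -sum(p * math.log(p) for p in dist if p > 0)
    return entropy / math.log(len(dist))


def ambiguity_level(dist):
    """Bucket a distribution using the ranges [0, 0.33), [0.33, 0.66), [0.66, 1]."""
    h = normalized_entropy(dist)
    if h < 0.33:
        return "low"
    if h < 0.66:
        return "medium"
    return "high"


# Ground-truth (E, N, C) distributions from the examples in Tables 14, 15, and 16.
levels = [
    ambiguity_level([0.45, 0.17, 0.38]),
    ambiguity_level([0.00, 0.77, 0.23]),
    ambiguity_level([0.95, 0.03, 0.02]),
]
```

Under this assumption, the three ground-truth distributions have normalized entropies of roughly 0.94, 0.49, and 0.21, which fall into the high, medium, and low ranges and agree with the ambiguity levels stated in the captions of Tables 14 to 16.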

#### C.2.2 MSP-Podcast

For MSP-Podcast, we characterize ambiguity using the number of active labels present in the ground-truth distribution rather than entropy directly. We define an active label as an emotion category receiving non-zero probability mass in the annotation distribution. Intuitively, the lowest ambiguity occurs when all annotators agree on a single emotion label. Examples with two active labels correspond to cases where annotators collectively select exactly two distinct emotion categories, while progressively larger numbers of active labels indicate greater disagreement among raters. We partition examples into categories of one, two, three, and four-or-more active labels. Table [9](https://arxiv.org/html/2606.05376#A3.T9) presents the distribution of examples across these groups along with the average normalized entropy of the ground-truth distribution.

Table 9: MSP Podcast dataset statistics by ambiguity level.

| Active Labels | Count | Average Entropy |
| --- | --- | --- |
| 1 label | 2,963 | 0.0000 |
| 2 labels | 2,410 | 0.2537 |
| 3 labels | 3,152 | 0.4661 |
| 4+ labels | 4,430 | 0.6604 |
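Counting active labels is straightforward; the following sketch shows one way to assign an example to the groups used in Table 9. The helper name and the dictionary representation of a distribution are assumptions for illustration.

```python
def active_label_group(dist):
    """Return the ambiguity group of a distribution by its number of active labels.

    An active label is a category with non-zero probability mass.
    """
    active = sum(1 for p in dist.values() if p > 0)
    if active == 0:
        raise ValueError("distribution has no probability mass")
    if active == 1:
        return "1 label"
    if active >= 4:
        return "4+ labels"
    return f"{active} labels"


# Five raters: three chose "happy", two chose "neutral".
group = active_label_group({"happy": 0.6, "neutral": 0.4, "sad": 0.0})
```

In this example `group` is `"2 labels"`. Unlike entropy, this grouping ignores how the mass is split among the active labels, which is why Table 9 reports average entropy alongside it.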
#### C.2.3 GoEmotions

Similar to MSP-Podcast, ambiguity in GoEmotions is characterized using the number of active labels within the ground-truth distribution. Examples with a single active label represent strong annotator agreement, whereas examples with multiple active labels indicate increasing diversity in annotation responses. We group examples into categories containing one, two, three, and four-or-more active labels. Table [10](https://arxiv.org/html/2606.05376#A3.T10) summarizes the distribution of examples across these ambiguity levels and the average entropy of the corresponding ground-truth distributions.

Table 10: GoEmotions dataset statistics by ambiguity level.

| Active Labels | Count | Average Entropy |
| --- | --- | --- |
| 1 label | 1,434 | 0.0000 |
| 2 labels | 1,828 | 0.3140 |
| 3 labels | 1,714 | 0.5138 |
| 4+ labels | 1,591 | 0.6875 |

## Appendix D Experimental Setup

### D.1 Artifact Usage and Licensing

All datasets and pretrained models used in this work are publicly available for research purposes and were used in accordance with their intended usage conditions and licenses.

### D.2 Model and GRPO Configuration

We conduct all experiments using the Qwen2.5-Omni-7B model Xu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib32)), which contains approximately 7 billion parameters, and implement training with the GRPOTrainer from the TRL framework von Werra et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib38)). We follow the GRPO configuration settings of Rouditchenko et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib47)), who also perform GRPO fine-tuning on Qwen2.5-Omni, specifically to improve the model's audio interpretation capabilities. We use the AdamW optimizer with an initial learning rate of $1 \times 10^{-6}$. Following this GRPO setup, we set the number of rollouts to 4, the temperature to 1.2, and the maximum completion length to 128 tokens. We additionally set $\beta = 0$, removing the KL divergence regularization term from the original GRPO formulation Shao et al. ([2024](https://arxiv.org/html/2606.05376#bib.bib8)). This choice is motivated by recent studies Liu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib48)); Hu et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib49)) showing that the KL divergence term is not necessary for effective GRPO training.

For generation, we set do_sample to False, resulting in deterministic decoding where the model selects the highest-probability token at each generation step. This improves reproducibility and ensures that evaluation results are not affected by sampling variation. Training is performed on a single compute node with two NVIDIA H200 GPUs and 400GB of memory. The batch size per GPU is set to 1 with gradient accumulation over two steps, resulting in an effective batch size of 4 prompts per optimization step. We employ DeepSpeed ZeRO Stage 3 optimization for efficient distributed training and memory management. Training progress and evaluation metrics are logged using WandB. Unless otherwise specified, all remaining hyperparameters follow the default settings defined in GRPOConfig within TRL von Werra et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib38)). Results are reported as single-run experiments due to the computational cost of GRPO-based LLM fine-tuning.
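For reference, the hyperparameters stated in this subsection can be collected in one place, as in the sketch below. The dictionary keys are intended to mirror field names of TRL's `GRPOConfig`, but that correspondence is an assumption and should be checked against the installed TRL version; the sketch deliberately does not instantiate the trainer and only verifies the effective batch size arithmetic.

```python
# Hyperparameters reported in this appendix, gathered for convenience.
# Key names are assumed to match TRL GRPOConfig fields; verify before use.
grpo_settings = {
    "learning_rate": 1e-6,             # AdamW initial learning rate
    "num_generations": 4,              # rollouts sampled per prompt
    "temperature": 1.2,                # rollout sampling temperature
    "max_completion_length": 128,      # maximum completion length in tokens
    "beta": 0.0,                       # KL penalty weight (KL term removed)
    "per_device_train_batch_size": 1,  # batch size per GPU
    "gradient_accumulation_steps": 2,
    "report_to": "wandb",              # experiment logging
}

NUM_GPUS = 2  # two H200 GPUs on a single node

# Per-GPU batch size x accumulation steps x number of GPUs: 1 * 2 * 2 = 4.
effective_batch_size = (
    grpo_settings["per_device_train_batch_size"]
    * grpo_settings["gradient_accumulation_steps"]
    * NUM_GPUS
)
```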

### D.3 Prompts

Below, we provide the prompt templates used for training across each dataset. Table [11](https://arxiv.org/html/2606.05376#A4.T11) presents the prompt template used for ChaosNLI, Table [12](https://arxiv.org/html/2606.05376#A4.T12) shows the template for MSP Podcast, and Table [13](https://arxiv.org/html/2606.05376#A4.T13) shows the template used for GoEmotions. While the task-specific context and label spaces differ across datasets, all prompts follow a common structure consisting of background information, the target utterance to evaluate, task instructions, and output formatting constraints.

Table 11: Prompt template used for ChaosNLI with example target utterance.

| Section | Prompt Content |
| --- | --- |
| Background | |
| Target Utterance | Premise: A child in a red jacket, waist deep in a pit on the beach<br>Hypothesis: A child is building a sandcastle on the beach |
| Task | Given a premise and a hypothesis, predict the probability of the relationship between them from the following options: entailment, neutral, contradiction.<br>1. entailment: the hypothesis logically follows from the premise<br>2. neutral: neither entailment nor contradiction can be determined<br>3. contradiction: the hypothesis conflicts with the premise<br>You MUST produce a calibrated probability distribution. |
| Output Constraints | Before outputting, check if the format of your output is in accordance with the requirements I provided.<br>1. Generate the label probabilities in EXACTLY this JSON structure: `{{"entailment": float, "neutral": float, "contradiction": float}}`.<br>2. The sum of all probabilities must be exactly 1.0.<br>3. Do not include any explanations or text besides the dictionary. |

Table 12: Prompt template used for MSP Podcast with example target utterance.

| Section | Prompt Content |
| --- | --- |
| Background | Two speakers are having a conversation. |
| Target Utterance | that's right. spilling my load of liberty all over your faces. it's the golden stallion of the tech- |
| Task | Predict the probability of the emotion in the target utterance from the following options: angry, contempt, disgust, fear, happy, neutral, sad, surprise.<br>You MUST produce a calibrated probability distribution. Identify any emotional cues expressed by the speaker, even subtle ones. For each of the following emotions from angry, contempt, disgust, fear, happy, neutral, sad, surprise, evaluate whether the emotion is present. If only one emotion present, assign 1.0 to that emotion. If multiple emotions are present, estimate the relative strength and assign a probability to each emotion.<br>The probability should reflect how much time, intensity, and presence each emotion has in the conversation. |
| Output Constraints | Before outputting, check if the format of your output is in accordance with the requirements I provided.<br>1. Generate the emotion probabilities in EXACTLY this JSON structure: `{{"Angry": float, "Contempt": float, "Disgust": float, "Fear": float, "Happy": float, "Neutral": float, "Sad": float, "Surprise": float}}`.<br>2. The sum of all probabilities must be exactly 1.0.<br>3. Do not include any explanations or text besides the dictionary. |

Table 13: Prompt template used for GoEmotions with example target utterance.

| Section | Prompt Content |
| --- | --- |
| Background | This is a text comment extracted from Reddit. |
| Target Utterance | Dear [NAME] man! (Irony intended) You didn't say "proof" you said "evidence"! |
| Task | Assign a calibrated probability distribution over the following emotion categories: anger, disgust, fear, joy, sadness, surprise, neutral.<br>These categories follow the Ekman taxonomy:<br>• anger: annoyance, disapproval, hostility<br>• disgust: contempt, revulsion<br>• fear: nervousness, anxiety, dread<br>• joy: happiness, admiration, gratitude, excitement, love, optimism, pride, relief, amusement<br>• sadness: disappointment, grief, remorse, embarrassment<br>• surprise: curiosity, confusion, realization<br>• neutral: no clear emotion expressed<br>Predict the probability of the emotion in the target utterance from the following options: anger, disgust, fear, joy, sadness, surprise, neutral.<br>You MUST produce a calibrated probability distribution.<br>For each of the following emotions from anger, disgust, fear, joy, sadness, surprise, neutral, evaluate whether the emotion is present. If only one emotion present, assign 1.0 to that emotion. If multiple emotions are present, estimate the relative strength and assign a probability to each emotion.<br>The probability should reflect how much intensity and presence each emotion has in the comment. |
| Output Constraints | Before outputting, check if the format of your output is in accordance with the requirements I provided.<br>1. Generate the emotion probabilities in EXACTLY this JSON structure: `{{"anger": float, "disgust": float, "fear": float, "joy": float, "sadness": float, "surprise": float, "neutral": float}}`.<br>2. The sum of all probabilities must be exactly 1.0.<br>3. Do not include any explanations or text besides the dictionary. |
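Since the prompts request a JSON dictionary of probabilities, a completion has to be parsed and checked before it can be scored. The sketch below shows one way to do this; it is a hypothetical helper, not the parsing code used in the experiments, and both the tolerance on the probability sum and the choice to return `None` for malformed outputs are assumptions.

```python
import json

NLI_LABELS = ("entailment", "neutral", "contradiction")


def parse_distribution(completion, labels, tol=1e-2):
    """Extract a probability dictionary from a completion, or return None.

    The completion must contain a JSON object whose keys are exactly `labels`,
    with non-negative values summing to 1 within `tol` (an assumed tolerance).
    """
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        raw = json.loads(completion[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict) or set(raw) != set(labels):
        return None
    try:
        probs = {label: float(raw[label]) for label in labels}
    except (TypeError, ValueError):
        return None
    if any(p < 0 for p in probs.values()):
        return None
    if abs(sum(probs.values()) - 1.0) > tol:
        return None
    return probs


parsed = parse_distribution(
    '{"entailment": 0.1, "neutral": 0.8, "contradiction": 0.1}', NLI_LABELS
)
```

The same function applies to the emotion prompts by passing the corresponding label tuple, with the capitalization used in each template's JSON structure.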
### D.4 Majority-label supervision (MLS) Baseline

As a reference baseline, we first consider conventional Majority\-label supervision \(MLS\), where the annotator distribution is reduced to its majority label:

$$
y_{q}^{*} = \arg\max_{c} p_{q,c},
\tag{12}
$$

where $p_{q,c}$ denotes the annotator distribution defined in Eq. [2](https://arxiv.org/html/2606.05376#S3.E2).

Given the predicted distribution $\hat{\mathbf{p}}_{(q,i)}$ defined in Eq. [3](https://arxiv.org/html/2606.05376#S3.E3), the rollout reward is defined as the probability assigned to the majority label:

$$
r_{(q,i)}^{\mathrm{maj}} = \hat{p}_{(q,i),\,y_{q}^{*}}.
\tag{13}
$$
This formulation encourages the model to assign high probability mass to the dominant annotation and serves as a reward\-based analogue of conventional MLS\.
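A minimal sketch of this reward, using the majority label of Eq. (12) and the reward of Eq. (13), is given below. The function names and the dictionary representation are assumptions for illustration; ties in the annotator distribution are broken by key order here, which the text does not specify.

```python
def majority_label(annotator_dist):
    """Eq. (12): the label with the largest annotator probability.

    Ties are resolved by dictionary order (an assumption of this sketch).
    """
    return max(annotator_dist, key=annotator_dist.get)


def mls_reward(predicted_dist, annotator_dist):
    """Eq. (13): probability the rollout assigns to the majority label."""
    return predicted_dist.get(majority_label(annotator_dist), 0.0)


# Annotator distribution and a predicted distribution for one example.
annotators = {"entailment": 0.00, "neutral": 0.77, "contradiction": 0.23}
prediction = {"entailment": 0.10, "neutral": 0.80, "contradiction": 0.10}
reward = mls_reward(prediction, annotators)
```

Here the majority label is neutral, so the reward is 0.80. A prediction that placed all of its mass on neutral would earn the maximum reward of 1.0, even though 23% of annotators chose contradiction, which illustrates how MLS discards the disagreement signal.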

### D.5 Justification of Included Baselines

We include a diverse range of recent studies and strong baselines for comparison across both ambiguity-aware learning and large language model alignment settings. On ChaosNLI (including ChaosNLI-M and ChaosNLI-S), we include the original benchmark framework proposed together with the dataset collection process, denoted as Chaos-Benchmark Nie et al. ([2020](https://arxiv.org/html/2606.05376#bib.bib11)). We further include recent studies that explicitly model annotator disagreement or distributional supervision under the same benchmark settings, including distribution-aware LLM alignment and ambiguity-aware learning approaches Chen et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib12), [2024a](https://arxiv.org/html/2606.05376#bib.bib13)); Lee et al. ([2023](https://arxiv.org/html/2606.05376#bib.bib30)).

Additionally, we include prior ambiguity-aware fine-tuning methods such as Zhou et al. ([2022](https://arxiv.org/html/2606.05376#bib.bib31)); Meissner et al. ([2021](https://arxiv.org/html/2606.05376#bib.bib29)), which are based on BERT-style architectures rather than LLM backbones. Although these methods do not perform reinforcement-learning-based alignment, they are included because they optimize directly on annotator distributions and report evaluation results using the same distributional metrics adopted in our work, enabling meaningful comparison under ambiguity-aware supervision.

For emotion recognition datasets, relatively few prior studies explicitly address ambiguity-aware distributional supervision. On MSP-Podcast, we include the recent benchmark study on decoding ambiguous emotions using test-time scaling, denoted as TTS-Benchmark Jia et al. ([2026](https://arxiv.org/html/2606.05376#bib.bib19)). On GoEmotions, we include recent LLM-based ambiguity-aware emotion recognition approaches Hong et al. ([2025](https://arxiv.org/html/2606.05376#bib.bib27)) that report distributional emotion outputs under both zero-shot (ZS) and few-shot (FS) settings, enabling direct comparison under ambiguity-aware evaluation protocols.

## Appendix E Evaluation Metrics

**Jensen-Shannon Distance.** The Jensen-Shannon distance (JS) is a metric that measures how far apart two probability distributions are. It is derived from a symmetric and smoothed version of the Kullback–Leibler divergence (KL), and its values are finite and bounded between 0 and 1 when using base-2 logarithms. A smaller JS distance indicates greater similarity between two probability distributions, with zero indicating identical distributions. Given probability distributions $P$ and $Q$, let $M = \frac{P+Q}{2}$ and let $D_{\mathrm{KL}}(P\|Q)$ be the KL divergence given by

$$
D_{\mathrm{KL}}(P\|Q) = \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}.
$$

The JS distance is defined as

$$
D_{\mathrm{JS}}(P\|Q) = \sqrt{\frac{1}{2}D_{\mathrm{KL}}(P\|M) + \frac{1}{2}D_{\mathrm{KL}}(Q\|M)}.
$$

By comparing both distributions to a central mixture distribution, the JS distance is more stable than KL divergence. In our analysis, we set $Q$ to be the ground-truth probability distribution from the annotations and $P$ to be the model's predicted probability distribution.
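The JS distance defined above can be computed in a few lines. The sketch below uses base-2 logarithms so that the result lies in $[0, 1]$; the function names are hypothetical, and the `max(0.0, ...)` guard against tiny negative values from floating-point rounding is an implementation choice of this example.

```python
import math


def kl_divergence(p, q):
    """KL divergence in bits; terms with p(x) = 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logarithms)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against rounding below zero


predicted = [0.10, 0.80, 0.10]     # P: model prediction
ground_truth = [0.00, 0.77, 0.23]  # Q: annotator distribution
distance = js_distance(predicted, ground_truth)
```

Because the mixture $M$ is non-zero wherever $P$ or $Q$ is non-zero, the ratios inside the logarithm are always defined, which is the stability advantage over computing $D_{\mathrm{KL}}(P\|Q)$ directly when $Q$ contains zeros.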

**Bhattacharyya Coefficient.** The Bhattacharyya Coefficient (BC) measures the amount of overlap or statistical similarity between two probability distributions. It is bounded between 0 and 1, where a value of 1 indicates identical distributions and 0 indicates no overlap. For discrete probability distributions, BC is defined by

$$
BC(P,Q) = \sum_{x\in\mathcal{X}} \sqrt{P(x)\cdot Q(x)}.
$$

**Accuracy.** For classification accuracy, we first derive a single label, $\hat{y}$, from the model's predicted probability distribution, $P$, by taking the argmax over the label space $\mathcal{Y}$:

$$
\hat{y} = \arg\max_{y\in\mathcal{Y}} P(y).
$$

The ground-truth label, $y^{*}$, is similarly derived from the ground-truth distribution, $Q$, by selecting the label with maximum probability mass:

$$
y^{*} = \arg\max_{y\in\mathcal{Y}} Q(y).
$$

Given $N$ evaluation examples, accuracy is then defined as

$$
\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[\hat{y}_{i} = y^{*}_{i}].
$$

**F1 Score.** The F1 score is the harmonic mean of precision and recall for a given class $c$:

$$
\text{F1}_{c} = \frac{2\cdot\text{Precision}_{c}\cdot\text{Recall}_{c}}{\text{Precision}_{c}+\text{Recall}_{c}},
$$

where

$$
\text{Precision}_{c} = \frac{TP_{c}}{TP_{c}+FP_{c}}, \qquad \text{Recall}_{c} = \frac{TP_{c}}{TP_{c}+FN_{c}},
$$

and $TP_{c}$, $FP_{c}$, and $FN_{c}$ denote the true positives, false positives, and false negatives for class $c$, respectively. Hard predicted labels $\hat{y}$ and ground-truth labels $y^{*}$ are derived from $P$ and $Q$, respectively, via argmax as described above. The macro-averaged F1 score averages $\text{F1}_{c}$ uniformly across all $|\mathcal{Y}|$ classes:

$$
\text{F1}_{\text{macro}} = \frac{1}{|\mathcal{Y}|}\sum_{c\in\mathcal{Y}}\text{F1}_{c}.
$$

**Weighted F1 Score.** The weighted F1 score extends the macro-averaged F1 by accounting for class imbalance through support-weighted averaging. Specifically, the F1 score for each class $c$ is weighted by its support $N_{c}$, defined as the number of ground-truth instances belonging to class $c$. The weighted F1 is then computed as:

$$
\text{F1}_{\text{weighted}} = \frac{\sum_{c\in\mathcal{Y}} N_{c}\cdot\text{F1}_{c}}{\sum_{c\in\mathcal{Y}} N_{c}},
$$

where $N_{c} = TP_{c}+FN_{c}$ represents the total number of true instances of class $c$. This metric therefore reflects both per-class performance and the empirical class distribution, giving higher influence to more frequent classes in the dataset.
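The remaining metrics can be sketched together, since accuracy and both F1 variants share the same argmax step. The function names below are hypothetical. The sketch uses the identity $\text{F1}_{c} = 2TP_{c} / (2TP_{c} + FP_{c} + FN_{c})$, which follows from the definitions above, and assigns an F1 of 0 to a class with no predictions and no ground-truth instances; that convention is an assumption.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Sum over classes of sqrt(P(x) * Q(x))."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def argmax_label(dist):
    """Index of the largest probability (first index wins ties)."""
    return max(range(len(dist)), key=dist.__getitem__)


def classification_metrics(predicted, ground_truth, num_classes):
    """Accuracy, macro F1, and weighted F1 from lists of distributions."""
    y_hat = [argmax_label(p) for p in predicted]
    y_star = [argmax_label(q) for q in ground_truth]
    pairs = list(zip(y_hat, y_star))
    accuracy = sum(a == b for a, b in pairs) / len(pairs)

    f1_scores, supports = [], []
    for c in range(num_classes):
        tp = sum(a == c and b == c for a, b in pairs)
        fp = sum(a == c and b != c for a, b in pairs)
        fn = sum(a != c and b == c for a, b in pairs)
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
        supports.append(tp + fn)  # N_c = TP_c + FN_c

    macro_f1 = sum(f1_scores) / num_classes
    weighted_f1 = sum(n * f for n, f in zip(supports, f1_scores)) / sum(supports)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "weighted_f1": weighted_f1}


P = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.6, 0.3, 0.1]]    # predictions
Q = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.7, 0.1]]  # annotations
metrics = classification_metrics(P, Q, num_classes=3)
```

In this toy example the hard labels are $\hat{y} = (0, 1, 0)$ and $y^{*} = (0, 1, 1)$, so accuracy is 2/3, macro F1 is 4/9 because the unused third class contributes 0, and weighted F1 is 2/3 because that class has zero support.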

## Appendix F Ambiguity Analysis Results for GoEmotions and MSP-Podcast

In this section, we expand on the results in Table [1](https://arxiv.org/html/2606.05376#S5.T1) by providing an analysis of model performance at different ambiguity levels for each dataset, as defined in App. [C.2](https://arxiv.org/html/2606.05376#A3.SS2).

Table 14: Observing model reasoning on a ChaosNLI example with a High level of ambiguity. Given the Premise: "In short, we all got tired of clever analyses of what might happen; and throughout economics there was a shift in focus away from theorizing, toward data collection and careful statistical analysis" and Hypothesis: "We all got tired of data collection and clever analyses of what might happen", annotators are asked to give a vote on entailment (E), neutral (N), or contradiction (C).

| Model | Reasoning | E | N | C |
| --- | --- | --- | --- | --- |
| Ground Truth | | .45 | .17 | .38 |
| Baseline | 1. \*\*Premise Analysis\*\*: - The premise states that there was a shift in focus away from theorizing and clever analyses towards data collection and careful statistical analysis. - It implies that people got tired of clever analyses and moved towards data collection.<br>2. \*\*Hypothesis Analysis\*\*: - The hypothesis states that people got tired of data collection and clever analyses of what might happen. - This is the opposite of what the premise suggests.<br>3. \*\*Relationship Between Premise and Hypothesis\*\*: - The premise and hypothesis are in direct contradiction. The premise suggests a shift towards data collection, while the hypothesis suggests a shift away from data collection.<br>4. \*\*Multiple Valid Interpretations\*\*: - There are no multiple valid interpretations here. The premise and hypothesis are clearly in conflict.<br>5. \*\*Probability Assignment\*\*: - Since the premise and hypothesis are in direct contradiction, the probability of entailment is 0. - The probability of neutrality is also 0 because the relationship is not neutral. - The probability of contradiction is 1 because the hypothesis directly contradicts the premise. | .00 | .00 | 1.00 |
| $R_{1}$ | The relationship between the premise and the hypothesis is somewhat complex. The premise states that there was a shift in focus away from theorizing and towards data collection and statistical analysis, which implies that the focus on clever analyses of what might happen was lessened. However, the hypothesis suggests that we got tired of data collection and clever analyses of what might happen, which is somewhat contradictory to the premise.<br>There are multiple valid interpretations: 1. The premise and hypothesis could be seen as contradictory because the premise indicates a shift away from clever analyses, while the hypothesis suggests tiredness with both data collection and clever analyses. 2. Alternatively, the premise could be interpreted as a shift in focus, which might imply that the tiredness with clever analyses was a result of the shift towards data collection and statistical analysis.<br>Given these interpretations, the probability of entailment is relatively low, as the hypothesis suggests a contradiction to the premise. The probability of neutrality is moderate, as the premise and hypothesis could be seen as having different focuses. The probability of contradiction is high, as the hypothesis directly contradicts the premise. | .20 | .50 | .30 |

Table 15: Observing model reasoning on a ChaosNLI example with a Medium level of ambiguity. Given the Premise: A child in a red jacket, waist deep in a pit on the beach and Hypothesis: A child is building a sandcastle on the beach, annotators are asked to give a vote on entailment (E), neutral (N), or contradiction (C).

| Model | Reasoning | E | N | C |
| --- | --- | --- | --- | --- |
| Ground Truth | | .00 | .77 | .23 |
| Baseline | 1. \*\*Premise Analysis\*\*: - The premise describes a child in a red jacket, waist deep in a pit on the beach. This suggests the child is engaged in some form of activity in the sand, but it does not specify what that activity is.<br>2. \*\*Hypothesis Analysis\*\*: - The hypothesis states that a child is building a sandcastle on the beach. This is a specific activity that involves creating a sandcastle.<br>3. \*\*Relationship Between Premise and Hypothesis\*\*: - The premise does not provide enough information to confirm or deny the hypothesis. The child could be building a sandcastle, but they could also be doing other activities like digging, playing, or just standing in the sand.<br>4. \*\*Multiple Valid Interpretations\*\*: - There are multiple valid interpretations of the premise. The child could be building a sandcastle, but they could also be doing other activities. This lack of specificity in the premise means that the hypothesis is neither entailed nor contradicted by the premise.<br>5. \*\*Probability Assignment\*\*: - Since the premise does not provide enough information to confirm or deny the hypothesis, the relationship between the two statements is neutral. The probability of entailment is low because the premise does not logically follow the hypothesis. The probability of contradiction is also low because the premise does not conflict with the hypothesis. The probability of neutrality is high because the premise does not provide enough information to determine the relationship. | .05 | .90 | .05 |
| $R_{1}$ | The relationship between the premise and the hypothesis is neutral. The premise describes a child in a red jacket, waist deep in a pit on the beach, but it does not provide enough information to determine if the child is building a sandcastle. The child could be doing various activities such as playing, digging, or even just standing in the pit. There are multiple valid interpretations of the premise, which makes it difficult to assign a high probability to either entailment or contradiction. Therefore, the probability of entailment is low, the probability of neutrality is high, and the probability of contradiction is also low. | .10 | .80 | .10 |

Table 16: Observing model reasoning on a ChaosNLI example with a Low level of ambiguity. Given the Premise: Linda Tripp was indicted for illegally taping telephone conversations with Monica Lewinsky and Hypothesis: Monica Lewinsky was illegally taped by Linda Tripp, annotators are asked to give a vote on entailment (E), neutral (N), or contradiction (C).

| Model | Reasoning | E | N | C |
| --- | --- | --- | --- | --- |
| Ground Truth | | .95 | .03 | .02 |
| Baseline | 1. \*\*Premise Analysis\*\*: - The premise states that "Linda Tripp was indicted for illegally taping telephone conversations with Monica Lewinsky." - This implies that Linda Tripp was the one who performed the illegal act of taping.<br>2. \*\*Hypothesis Analysis\*\*: - The hypothesis states that "Monica Lewinsky was illegally taped by Linda Tripp." - This suggests that Monica Lewinsky was the one who was taped by Linda Tripp.<br>3. \*\*Relationship Between Premise and Hypothesis\*\*: - The premise and hypothesis are describing the same event but from different perspectives. - The premise focuses on Linda Tripp as the tapper, while the hypothesis focuses on Monica Lewinsky as the taped person. - Since the premise explicitly states that Linda Tripp was the one who taped, the hypothesis directly contradicts this by suggesting that Monica Lewinsky was the one who was taped.<br>4. \*\*Multiple Valid Interpretations\*\*: - There are no multiple valid interpretations here. The premise and hypothesis are in direct conflict.<br>5. \*\*Probability Assignment\*\*: - Given the direct contradiction between the premise and the hypothesis, the probability of the hypothesis being true given the premise is 0. - The probability of the hypothesis being neutral or entailment is also 0 because the hypothesis directly contradicts the premise. | .00 | .00 | 1.00 |

$R_{1}$ The relationship between the premise and the hypothesis is that the hypothesis logically follows from the premise.
The premise states that Linda Tripp was indicted for illegally taping telephone conversations with Monica Lewinsky, which directly implies that Monica Lewinsky was the subject of the illegal taping by Linda Tripp\.There is no ambiguity in this relationship, and the hypothesis is a direct consequence of the premise\.Given this clear relationship, the probability of entailment is high\. The hypothesis is neutral in terms of additional information not provided in the premise, and there is no contradiction since the hypothesis is a logical outcome of the premise\.\.85\.10\.05![Refer to caption](https://arxiv.org/html/2606.05376v1/figures/msp_ambiguity_row.png)Figure 4:Performance comparison across different ambiguity levels on theMSP Podcastdataset\. x\-axis denotes the number of active labels in the ground truth distribution which reflects ambiguity levels from low \(left\) to high \(right\) explained in App\.[C\.2\.2](https://arxiv.org/html/2606.05376#A3.SS2.SSS2)We provide an in\-depth analysis of model performance across different ambiguity levels on GoEmotions and MSP\-Podcast in Fig\.[5](https://arxiv.org/html/2606.05376#A6.F5)and[4](https://arxiv.org/html/2606.05376#A6.F4)\. BothSHALA\-LLMand its variant,SHALA\-LLM\(w/o Ambi\-En\), consistently outperform the ZS and MLS baselines across all ambiguity levels and evaluation metrics, particularly on the distributional metrics JSD and BC\. Importantly, as ambiguity increases,SHALA\-LLMexhibits the smallest performance degradation, demonstrating the robustness of ambiguity\-aware training under highly ambiguous conditions\. AlthoughSHALA\-LLMandSHALA\-LLM\(w/o Ambi\-En\) do not show substantial gains over MLS on conventional classification metrics compared with the improvements over ZS, this is likely because MLS is primarily optimized for dominant single\-label classification\. 
Overall,SHALA\-LLMdemonstrates strong capability in both capturing annotator distributions, as evidenced by JSD and BC, and maintaining competitive performance on dominant\-label classification, further supporting the discussion in Section[5\.2](https://arxiv.org/html/2606.05376#S5.SS2)\.
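
To make the distributional metrics concrete, the sketch below scores the two Table 14 predictions against the annotator distribution. It assumes that JSD is the Jensen-Shannon distance (the square root of the base-2 Jensen-Shannon divergence) and that BC denotes the Bhattacharyya coefficient; this appendix does not spell out either definition, and the function names are illustrative rather than the authors' implementation.

```python
import math

def js_distance(p, q):
    """Jensen-Shannon distance: sqrt of the base-2 JS divergence, in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    # max() guards against tiny negative values from floating-point error.
    return math.sqrt(max(0.0, 0.5 * kl(p, m) + 0.5 * kl(q, m)))

def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions; 1.0 means identical."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

# (E, N, C) values copied from Table 14.
ground_truth = [0.45, 0.17, 0.38]
baseline = [0.00, 0.00, 1.00]
r1 = [0.20, 0.50, 0.30]

for name, pred in [("Baseline", baseline), ("R1", r1)]:
    print(name,
          round(js_distance(ground_truth, pred), 3),
          round(bhattacharyya_coefficient(ground_truth, pred), 3))
```

Under these assumptions, the $R_{1}$ output sits closer to the annotator distribution than the baseline's one-hot prediction on both metrics (lower JSD, higher BC).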

![Refer to caption](https://arxiv.org/html/2606.05376v1/figures/goemotions_ambiguity_row.png)
Figure 5: Performance comparison across different ambiguity levels on the GoEmotions dataset\. x\-axis denotes the number of active labels in the ground truth distribution, which reflects ambiguity levels from low \(left\) to high \(right\), explained in App\. [C\.2\.3](https://arxiv.org/html/2606.05376#A3.SS2.SSS3)\.
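
The grouping behind Figures 4 and 5 can be sketched as follows. This is a minimal illustration that assumes a label is "active" when its annotator probability exceeds a threshold (zero by default); the precise definition is given in App. C.2, which is not reproduced here. The helper names and the sample scores are hypothetical.

```python
from collections import defaultdict

def active_label_count(distribution, threshold=0.0):
    """Number of labels whose annotator probability exceeds `threshold`."""
    return sum(1 for prob in distribution if prob > threshold)

def mean_metric_by_ambiguity(samples, threshold=0.0):
    """Average a per-sample metric within each active-label-count bucket.

    `samples` is an iterable of (ground_truth_distribution, metric_value) pairs.
    """
    buckets = defaultdict(list)
    for distribution, value in samples:
        buckets[active_label_count(distribution, threshold)].append(value)
    return {count: sum(vals) / len(vals) for count, vals in sorted(buckets.items())}

# Hypothetical per-sample scores, for illustration only.
samples = [
    ([1.0, 0.0, 0.0], 0.10),
    ([0.6, 0.4, 0.0], 0.20),
    ([0.5, 0.5, 0.0], 0.30),
    ([0.4, 0.3, 0.3], 0.40),
]
print(mean_metric_by_ambiguity(samples))
```

Each key in the returned dictionary corresponds to one position on the x-axis, and its value to the averaged metric plotted there.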
## Appendix G Per\-Class Analysis Results

In this section, we expand on the results in Table [1](https://arxiv.org/html/2606.05376#S5.T1) by providing an analysis of model performance at the per\-class level\.

### G\.1 MSP Podcast Per\-Class Analysis

Fig\. [6](https://arxiv.org/html/2606.05376#A7.F6) presents the per\-class analysis across different models\. On distributional evaluation metrics, both SHALA\-LLM and its ablation variant SHALA\-LLM \(w/o\) consistently outperform the ZS and MLS baselines, while still maintaining relatively strong performance on conventional classification metrics\. The largest improvements are concentrated in the Happy class, likely due to class imbalance, as this class contains substantially more training samples\. The Contempt class exhibits relatively poor performance under MLS and SHALA\-LLM \(w/o\), whereas the full SHALA\-LLM framework substantially improves both ACC and F1\. This suggests that ambiguity\-enhanced optimization helps the model better capture highly ambiguous or underrepresented emotional categories by amplifying informative disagreement structures during learning\.
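
As a rough illustration of the conventional per-class metrics, the sketch below computes one-vs-rest F1 for each class. It assumes that the dominant label of a distribution is its argmax; the exact evaluation protocol is not described in this appendix. The label set, distributions, and helper names are hypothetical.

```python
def dominant_label(distribution, labels):
    """Label with the highest probability (ties go to the earliest label)."""
    best = max(range(len(labels)), key=lambda i: distribution[i])
    return labels[best]

def per_class_f1(gold, predicted, labels):
    """One-vs-rest F1 for each label; 0.0 when a class has no true positives."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Hypothetical label set and distributions, for illustration only.
labels = ["happy", "contempt", "neutral"]
gold_dists = [[0.7, 0.1, 0.2], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
pred_dists = [[0.8, 0.1, 0.1], [0.5, 0.3, 0.2], [0.2, 0.1, 0.7], [0.5, 0.4, 0.1]]

gold = [dominant_label(d, labels) for d in gold_dists]
predicted = [dominant_label(d, labels) for d in pred_dists]
print(per_class_f1(gold, predicted, labels))
```

In this toy data the minority class is never predicted, so its F1 is zero even though overall accuracy is high. A per-class breakdown exposes that kind of gap, which an aggregate score would hide.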

![Refer to caption](https://arxiv.org/html/2606.05376v1/figures/msp_perclass_2x2.png)
Figure 6: Performance comparison across different emotion classes on the MSP Podcast dataset\.

### G\.2 GoEmotions Per\-Class Analysis

Fig\. [7](https://arxiv.org/html/2606.05376#A7.F7) presents the per\-class analysis on the GoEmotions dataset\. Both SHALA\-LLM and SHALA\-LLM \(w/o Ambi\-En\) demonstrate strong performance compared with the ZS and MLS baselines on distributional evaluation metrics, while also maintaining generally competitive classification performance\. Notably, improvements are particularly evident for highly ambiguous categories such as Surprise, which was intentionally designed by the dataset creators to exhibit ambiguity\. Although these improvements are less apparent on conventional classification metrics due to the inherently ambiguous nature of the category, the distributional metrics reveal that SHALA\-LLM better captures the underlying annotator disagreement\.

![Refer to caption](https://arxiv.org/html/2606.05376v1/figures/goemotions_perclass_2x2.png)
Figure 7: Performance comparison across different emotion classes on the GoEmotions dataset\.

## Appendix H Model Reasoning Under Different Ambiguity Levels

We further include model reasoning outputs for representative low\-, medium\-, and high\-ambiguity examples in Tables [14](https://arxiv.org/html/2606.05376#A6.T14) to [16](https://arxiv.org/html/2606.05376#A6.T16)\. For low\- and medium\-ambiguity cases, we do not observe substantial differences between SHALA\-LLM and the baseline\. However, for highly ambiguous cases, SHALA\-LLM demonstrates a stronger capability to reason over multiple valid interpretations, as discussed in detail in Section [5\.3](https://arxiv.org/html/2606.05376#S5.SS3) of the main paper\.

We include the following prompt during inference to encourage the model to explain its reasoning\.

Before producing your final answer, explain your reasoning:

- What is the relationship between the premise and hypothesis?
- Are there multiple valid interpretations? If so, describe them\.
- How does your reasoning inform the probability you assign to each label?

Then output EXACTLY this JSON on its own line: \{\{"entailment": float, "neutral": float, "contradiction": float\}\}
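
Because the prompt asks for free-form reasoning followed by a single JSON line, the label distribution has to be recovered from the response text. The following is a minimal parsing sketch, not the authors' code. It assumes that the doubled braces in the prompt are format-string escapes, so the model emits single braces, and that the last well-formed JSON line is the answer. The function name and the sample response are hypothetical.

```python
import json

LABELS = ("entailment", "neutral", "contradiction")

def parse_label_distribution(response):
    """Return a normalized {label: probability} dict, or None if no valid JSON line exists."""
    # Scan from the end, since the JSON is requested after the reasoning.
    for line in reversed(response.strip().splitlines()):
        line = line.strip()
        if not (line.startswith("{") and line.endswith("}")):
            continue
        try:
            data = json.loads(line)
            values = [float(data[label]) for label in LABELS]
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue
        total = sum(values)
        if total <= 0 or any(v < 0 for v in values):
            continue
        # Renormalize in case the model's probabilities do not sum to one.
        return {label: v / total for label, v in zip(LABELS, values)}
    return None

# Hypothetical model response, for illustration only.
response = (
    "The premise and hypothesis partly conflict, but two readings are plausible.\n"
    '{"entailment": 0.2, "neutral": 0.5, "contradiction": 0.3}'
)
print(parse_label_distribution(response))
```

Returning `None` for unparseable responses lets the caller decide whether to retry, skip the sample, or fall back to a uniform distribution.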

## Appendix I Use of AI Assistants

AI assistants were used for language refinement and editing support during manuscript preparation, and for debugging code\. All technical content, experimental design, implementation, and scientific claims were developed and verified by the authors\.
