Multimodal Hidden Markov Models for Persistent Emotional State Tracking
Summary
This paper proposes a lightweight framework using sticky factorial HDP-HMMs to model conversational emotion as latent regimes from multimodal valence-arousal trajectories, aiming for interpretable and computationally efficient emotional state tracking.
Cached at: 05/14/26, 06:14 AM
# Multimodal Hidden Markov Models for Persistent Emotional State Tracking
Source: [https://arxiv.org/html/2605.12838](https://arxiv.org/html/2605.12838)
###### Abstract
Tracking an interpretable emotional arc of a conversation via the sentiment of individual utterances processed as a whole is central to both understanding and guiding communication in applied, especially clinical, conversational contexts\. Existing approaches to emotion recognition operate at the utterance level, obscuring the persistent phases that characterize real conversational dynamics\. We propose a lightweight framework that models conversational emotion as a sequence of latent emotional regimes using sticky factorial HDP\-HMMs over multimodal valence\-arousal representations derived from simultaneous video, audio and textual input\. We evaluate the quality of regime prediction using LLM\-as\-a\-Judge, geometric, and temporal consistency metrics, demonstrating that the sticky HDP\-HMM produces more interpretable regime sequences than the baseline Gaussian HMM at a fraction of the computational cost of LLM\-based dialogue state tracking methods\. In addition, Question\-Answer experiments in a clinical dataset suggest that meaningful emotional phases can reliably be recovered from multimodal valence\-arousal trajectories and used to improve the quality of LLM responses in unstable affective regimes via context augmentation\. This framework thus opens a path toward interpretable, lightweight, and actionable analysis of conversational emotion dynamics at scale\.
Bayesian Nonparametric HMMs, Multimodal Question Answering, Factorial HDP\-HMMs, Affective Computing, Conversational Emotion Recognition, Compute\-Efficient Multimodal Fusion, LLM\-as\-a\-Judge
## 1 Introduction
In conversational settings, addressing a speaker’s emotional needs in a timely and nuanced manner is strongly linked to effective communication; in clinical contexts, it is further associated with improved care outcomes and therapeutic alliance. As such, research interest in dialogue state tracking and emotion recognition within conversational applications has grown substantially over the past decade, especially in the clinical context (Poria et al., [2019b](https://arxiv.org/html/2605.12838#bib.bib1)). In practice, clinicians respond to evolving patterns of distress, reassurance, and engagement over the course of an interaction rather than to single utterances. When emotion is instead inferred at the level of short utterances, as is the prevailing convention, the resulting label sequences lack an explicit notion of persistence and obscure the temporal structure of real conversations (Lee, [2022](https://arxiv.org/html/2605.12838#bib.bib2)). This motivates our first research question: to what extent can we detect stable and persistent regimes of emotion over the course of a conversation?
Moreover, emotion recognition systems have largely relied on discrete categorical labels (e.g. happy, sad, angry), reflecting annotative convenience rather than the underlying continuity of affect (Lee, [2022](https://arxiv.org/html/2605.12838#bib.bib2)). While valence-arousal (VA) representations offer a more psychologically grounded alternative, there has been comparatively little work on modeling the higher-level temporal dynamics of these representations using purely numerical methods. This raises our second research question: to what extent can continuous emotional trajectories be modeled and interpreted in a way that is both temporally coherent and useful for a conversational agent?
In this work, we address these questions by treating a conversation as a time series governed by a sequence of latent emotional regimes. We propose a lightweight, multimodal framework that segments continuous VA trajectories into persistent states, leveraging Hidden Markov Models (HMMs), which are well-suited for modeling sequential data with underlying latent structural and temporal dependencies. While standard Gaussian HMMs provide a natural baseline for such segmentation tasks, they tend to over-segment emotional trajectories, producing unrealistically rapid switching between states. To address this, we employ a truncated sticky factorial HDP-HMM, which introduces an explicit bias toward self-transitions, allows the number of active regimes in a conversation to be inferred from data rather than fixed a priori, and flexibly merges simultaneous multimodal inputs into coherent regimes.
Figure 1: Comparison of conversational textual Valence-Arousal regimes identified by Gaussian HMMs (top) vs. sticky HDP-HMMs (bottom). Regimes are denoted by differently colored shading across the respective utterance indices. The Gaussian HMM is specified to have 4 regimes (n=4), and the sticky HDP-HMM is set to a maximum of 8 regimes (K=8), with 3 effective regimes identified.

We show that continuous multimodal emotion trajectories can be segmented into stable, interpretable emotional regimes using sticky factorial HDP-HMMs (Figure [1](https://arxiv.org/html/2605.12838#S1.F1)), yielding coherent conversational structure. Crucially, our approach operates directly on numerical VA representations derived from audio, visual, and textual signals, enabling efficient inference without reliance on expensive large language model (LLM) calls at runtime.
Beyond this computational efficiency, we emphasize interpretability in dialogue state tracking. By aligning inferred regimes with intuitive, labeled interpretations of the dynamics in a given dataset, we move toward structured representations and guidance of conversational dynamics. We further explore mapping these regimes to higher-level communication strategies that an LLM can use to guide the conversation, bridging low-level affective signals and conversationally meaningful interaction patterns. We compare LLM responses in a clinical Question-Answer setting with and without access to the interpretable regimes, and find that augmenting the LLM context with the inferred regime improves response quality. Since our method for inferring these regimes is lightweight and "online", we thus contribute an efficient method that leverages multimodal signals to enhance question-answer dynamics.
## 2 Background & Related Works
### 2.1 Emotion Recognition Models
As the field of multimodal emotion recognition has expanded, speech and text have become central modalities. A major reason is that emotion is often carried not only in facial cues but also in prosody, lexical choice, and discourse context; more recent multimodal systems fuse these signals to improve robustness, but the dominant pipeline still often performs utterance-level prediction over independently scored segments rather than modeling longer-range affective structure (Ramaswamy and Palaniswamy, [2024](https://arxiv.org/html/2605.12838#bib.bib3)). Further, as an improvement over discrete representations of emotion, modeling emotion as continuous coordinates (Ong et al., [2006](https://arxiv.org/html/2605.12838#bib.bib4)) is especially useful to our study because it preserves graded differences between nearby states and creates a structured numeric space that is well suited to downstream sequential models. For agentic conversational applications in particular, such nuance is critical: distinguishing between different types and intensities of negative affect can directly inform response strategies and interventions (Ong et al., [2006](https://arxiv.org/html/2605.12838#bib.bib4)).
### 2.2 Modern Limitations of Dialogue State Tracking
More recently in Dialogue State Tracking (DST), schema-driven and LLM-based DST has become the dominant high-performance paradigm, especially in open-domain and zero-shot settings, because LLMs can generate slot values or state descriptions directly from dialogue history. However, this gain in flexibility comes at a substantial computational cost, and it does not fully solve the underlying robustness problem: multimodal and noisy dialogues still require a principled way to map heterogeneous inputs into stable state estimates (Carranza and Rojas, [2025](https://arxiv.org/html/2605.12838#bib.bib5)). This is why DST is increasingly treated as a problem of robust state inference under uncertainty, especially when the dialogue’s crucial signals have important temporal dimensions and are only partially observable in the presence of noisy data (Balaraman et al., [2021](https://arxiv.org/html/2605.12838#bib.bib6)).
### 2.3 Hidden Markov Models
Hidden Markov Models (HMMs) provide a natural framework for modeling sequential data with latent structure, and their conceptual appeal lies in positing unobserved states that govern the behavior of observed phenomena over time. The notion of inferring the internal states of time-series data from noisy observations allows us to treat conversational emotion as contiguous regimes with persistent affective dynamics and not as independent signals. Standard Gaussian HMMs, however, exhibit significant limitations when applied to granular sequential data such as utterance-level emotions. Maximum likelihood estimation in the model tends to favor rapid switching between states, leading to the over-segmentation of utterances, in which regimes fluctuate wildly at nearly every timestep. Attempts to stabilize these models by manually fixing the number of states or regularizing transitions often introduce rigidity, collapsing the probabilistic model into a more deterministic system and limiting its ability to adapt to the natural variability of conversational structure (Fox et al., [2008](https://arxiv.org/html/2605.12838#bib.bib7)). These limitations have motivated our development of more flexible models which can effectively adapt to real-world conversations.
## 3 Methods
### 3.1 Utterance-level Emotion extraction
Emotional state is represented along two continuous dimensions, valence (positive vs. negative polarity) and arousal (high vs. low activation) (Russell, [1980](https://arxiv.org/html/2605.12838#bib.bib8)). The VA estimates produced at the utterance level serve as the foundational observations for subsequent temporal modeling and multimodal analysis of a given conversation. We consider text, audio, and video-based emotion detection from textual sentiment, tone of voice, and facial expressions, respectively.
VA scores for written transcripts are produced by a fine-tuned DistilBERT model adapted for continuous regression (Mavdol, [2025](https://arxiv.org/html/2605.12838#bib.bib9)). Specifically, the model maps each utterance to a two-dimensional Valence–Arousal space following Russell’s circumplex framework (Russell, [1980](https://arxiv.org/html/2605.12838#bib.bib8)). Transcripts fed to the model are segmented at the utterance level using time-aligned boundaries; additionally, we remove disfluencies and non-speech artifacts within the text. After this preprocessing step, each cleaned segment is passed independently to the model for inference.
VA scores for audio data are produced by a pruned Wav2Vec 2.0 model fine-tuned for dimensional speech emotion recognition (Wagner et al., [2023](https://arxiv.org/html/2605.12838#bib.bib11)) on MSP-Podcast (Busso et al., [2025](https://arxiv.org/html/2605.12838#bib.bib12)), a large collection of natural conversational speech that is richly annotated with continuous valence, arousal, and dominance scores. With raw 16 kHz audio as input, a regression head is applied to pooled transformer representations to generate these continuous outputs. Wav2Vec 2.0 is particularly well-suited for this task because it captures prosodic and acoustic features (such as pitch, energy, and temporal dynamics) that carry affective information independently of lexical content.
VA scores for visual data are produced by EmoNet, a facial affect model trained to estimate continuous valence and arousal from face images captured in naturalistic conditions (Toisoul et al., [2021](https://arxiv.org/html/2605.12838#bib.bib10)). In our pipeline, video is first segmented at the utterance level and represented as a sequence of extracted frames.
### 3.2 Multimodal feature construction
For each utterance $t$, let $\mathbf{x}^{\text{txt}}_{t}=(v^{\text{txt}}_{t},a^{\text{txt}}_{t})\in\mathbb{R}^{2}$ and $\mathbf{x}^{\text{aud}}_{t}=(v^{\text{aud}}_{t},a^{\text{aud}}_{t})\in\mathbb{R}^{2}$ denote the text and audio VA estimates, respectively. Rather than concatenating these features into a single observation vector, we treat each modality as a distinct observation channel generated by a shared latent emotional regime.

Formally, at each time step $t$, the observation is represented as a collection of modality-specific variables:

$$\mathcal{X}_{t}=\left\{\mathbf{x}^{\text{txt}}_{t},\;\mathbf{x}^{\text{aud}}_{t}\right\}. \tag{1}$$

Given a latent regime $z_{t}$, we assume that modalities are conditionally independent:

$$p(\mathcal{X}_{t}\mid z_{t})=p(\mathbf{x}^{\text{txt}}_{t}\mid z_{t})\;p(\mathbf{x}^{\text{aud}}_{t}\mid z_{t}), \tag{2}$$

where each modality is modeled with its own Gaussian emission distribution.

This factorized emission model allows each modality to exhibit distinct noise characteristics and variability within the same latent regime, avoiding the need to model a potentially complex joint distribution over linearly concatenated features. The overall log-likelihood decomposes additively across modalities:

$$\log p(\mathcal{X}_{t}\mid z_{t})=\log p(\mathbf{x}^{\text{txt}}_{t}\mid z_{t})+\log p(\mathbf{x}^{\text{aud}}_{t}\mid z_{t}). \tag{3}$$

This formulation extends directly to additional modalities. We incorporate visual affect by introducing a third modality $\mathbf{x}^{\text{vid}}_{t}=(v^{\text{vid}}_{t},a^{\text{vid}}_{t})\in\mathbb{R}^{2}$, computed from frame-level estimates and aggregated at the utterance level. The observation at time $t$ becomes

$$\mathcal{X}_{t}=\left\{\mathbf{x}^{\text{txt}}_{t},\;\mathbf{x}^{\text{aud}}_{t},\;\mathbf{x}^{\text{vid}}_{t}\right\}, \tag{4}$$

and the conditional independence assumption extends as

$$p(\mathcal{X}_{t}\mid z_{t})=\prod_{m\in\{\text{txt},\,\text{aud},\,\text{vid}\}}p(\mathbf{x}^{(m)}_{t}\mid z_{t}). \tag{5}$$

All modality streams are standardised independently to zero mean and unit variance prior to modeling to ensure comparable scaling across modalities.
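The factorized emission in Eqs. (2) to (5) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the dictionary layout, and the toy data (random streams, identity covariances, three states) are assumptions made purely for illustration.

```python
import numpy as np

def standardise(stream):
    """Scale one modality stream (T x 2) to zero mean and unit variance per dimension."""
    stream = np.asarray(stream, dtype=float)
    return (stream - stream.mean(axis=0)) / stream.std(axis=0)

def gaussian_logpdf(x, mean, cov):
    """Log-density of a multivariate Gaussian evaluated at each row of x."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ti,ij,tj->t", diff, np.linalg.inv(cov), diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def factorised_loglik(streams, means, covs):
    """Return a (T, K) matrix of log p(X_t | z_t = k), summed over modalities.

    streams: dict modality -> (T, 2) array of VA observations
    means:   dict modality -> (K, 2) array of per-state emission means
    covs:    dict modality -> (K, 2, 2) array of per-state covariances
    """
    T = len(next(iter(streams.values())))
    K = len(next(iter(means.values())))
    total = np.zeros((T, K))
    for m, x in streams.items():
        for k in range(K):
            # Conditional independence: log-likelihoods add across modalities.
            total[:, k] += gaussian_logpdf(x, means[m][k], covs[m][k])
    return total

# Toy data only (hypothetical values, not taken from the paper).
rng = np.random.default_rng(0)
streams = {m: standardise(rng.normal(size=(50, 2))) for m in ("txt", "aud", "vid")}
means = {m: rng.normal(size=(3, 2)) for m in streams}
covs = {m: np.stack([np.eye(2)] * 3) for m in streams}
loglik = factorised_loglik(streams, means, covs)
```

Under this layout, adding a modality amounts to adding one more dictionary entry; the resulting `(T, K)` matrix is what a forward-backward or Viterbi routine would consume.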
### 3.3 Temporal regime modeling
#### 3.3.1 Gaussian HMM baseline
A standard Gaussian Hidden Markov Model (Rabiner, [1989](https://arxiv.org/html/2605.12838#bib.bib13)) provides a tractable baseline for latent regime detection. Given a conversation represented as a sequence of $T$ utterance-level observations $\mathbf{X}=\{\mathbf{x}_{1},\dots,\mathbf{x}_{T}\}$, the model assumes that each observation is generated by a discrete latent state $z_{t}\in\{1,\dots,K\}$ evolving according to a first-order Markov process:

$$z_{1}\sim\boldsymbol{\pi}, \tag{6}$$

$$z_{t}\mid z_{t-1}\sim\mathrm{Categorical}(\mathbf{A}_{z_{t-1}}), \tag{7}$$

$$\mathbf{x}_{t}\mid z_{t}=k\sim\mathcal{N}(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k}), \tag{8}$$

where $\boldsymbol{\pi}$ is the initial state distribution, $\mathbf{A}\in[0,1]^{K\times K}$ is the transition matrix, and $(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k})$ are the mean and covariance of the Gaussian emission for state $k$.

Model parameters are estimated via Expectation-Maximization, and the most probable latent sequence is recovered by Viterbi decoding. The number of states $K$ is fixed as a hyperparameter. We use a tied covariance structure ($\boldsymbol{\Sigma}_{k}=\boldsymbol{\Sigma}\ \forall k$) to reduce overfitting given the limited number of utterances per consultation.
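As an illustration of the decoding step, the following is a minimal log-space Viterbi sketch. It assumes the parameters have already been estimated (EM itself is not shown), and the two-state toy parameters below are hypothetical values chosen only to make the behavior easy to follow.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most probable state path for an HMM, computed in log space.

    log_pi: (K,)   log initial state distribution
    log_A:  (K, K) log transition matrix, rows index the previous state
    log_B:  (T, K) log emission likelihoods log p(x_t | z_t = k)
    """
    T, K = log_B.shape
    delta = np.empty((T, K))            # best log-score of any path ending in each state
    back = np.zeros((T, K), dtype=int)  # argmax predecessor for backtracking
    delta[0] = log_pi + log_B[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: end in i, then move i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path

# Hypothetical two-state example: persistent transitions, emissions that
# favor state 0 for three steps and state 1 for the next three.
log_pi = np.log([0.5, 0.5])
log_A = np.log([[0.9, 0.1], [0.1, 0.9]])
log_B = np.log([[0.9, 0.1]] * 3 + [[0.1, 0.9]] * 3)
path = viterbi(log_pi, log_A, log_B)
```

The `log_B` argument is the same kind of per-state log-likelihood matrix that the emission model produces, so the decoder is agnostic to how many modalities contributed to it.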
#### 3.3.2 Sticky HDP-HMM
The standard Gaussian HMM requires $K$ to be specified in advance and imposes no prior preference for temporal persistence. Because utterance-level VA scores are estimated independently and carry no temporal information of their own, the model is free to switch states at every step. We address both limitations with a sticky Hierarchical Dirichlet Process HMM (sticky HDP-HMM) (Fox et al., [2008](https://arxiv.org/html/2605.12838#bib.bib7)), which infers the effective number of states from data and places an explicit self-transition bias that encourages temporally coherent regime occupancy.
##### HDP prior over transition distributions\.
The HDP-HMM places a hierarchical nonparametric prior over the rows of the transition matrix. A global mixing measure $\boldsymbol{\beta}\sim\mathrm{GEM}(\gamma)$ is drawn from a stick-breaking process with concentration $\gamma$, and each state-specific transition distribution is drawn as:

$$\boldsymbol{\pi}_{k}\sim\mathrm{DP}\!\left(\alpha,\,\boldsymbol{\beta}\right),\quad k=1,2,\dots \tag{9}$$

where $\alpha$ is a per-state concentration parameter. Because all rows share the same base measure $\boldsymbol{\beta}$, states that are never visited receive negligible probability mass, and the effective number of active states is inferred rather than presupposed.
##### Sticky self\-transition bias\.
The standard HDP-HMM does not penalize rapid state switching. The sticky extension (Fox et al., [2008](https://arxiv.org/html/2605.12838#bib.bib7)) augments each transition distribution with an additional mass $\kappa>0$ on the self-transition:

$$\boldsymbol{\pi}_{k}\sim\mathrm{DP}\!\left(\alpha+\kappa,\,\frac{\alpha\boldsymbol{\beta}+\kappa\boldsymbol{e}_{k}}{\alpha+\kappa}\right), \tag{10}$$

where $\boldsymbol{e}_{k}$ is the unit vector on state $k$. Increasing $\kappa$ raises the prior probability of remaining in the current state, directly encoding the assumption that emotional regimes are persistent rather than fleeting. This is particularly appropriate for clinical conversation, where affective phases typically span multiple consecutive utterances.
##### Truncated inference\.
Exact inference under the HDP-HMM is intractable. We employ a truncated weak-limit approximation (Blei and Jordan, [2004](https://arxiv.org/html/2605.12838#bib.bib15); Ishwaran and James, [2001](https://arxiv.org/html/2605.12838#bib.bib14)), fixing an upper bound $K_{\max}$ on the state space while allowing the posterior to concentrate on a smaller active subset. $K_{\max}$ is informed by the number of states at which the quality of Gaussian HMM modeling degrades, as determined by hyperparameter tuning (see Appendix [B](https://arxiv.org/html/2605.12838#A2)).
Inference proceeds via collapsed Gibbs sampling, alternating between sampling the latent state sequence $z_{1:T}$ via the forward-backward algorithm and resampling the model hyperparameters $(\alpha,\kappa,\gamma)$ and emission parameters $\{(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k})\}_{k=1}^{K_{\max}}$ from their conjugate posteriors. The most probable state sequence for reporting is taken as the Viterbi path under the posterior mean parameters.
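To make the role of $\kappa$ concrete, the sketch below draws a transition matrix from the sticky prior under the standard weak-limit form, in which $\boldsymbol{\beta}\sim\mathrm{Dir}(\gamma/K_{\max},\dots,\gamma/K_{\max})$ and row $k$ is $\mathrm{Dir}(\alpha\boldsymbol{\beta}+\kappa\boldsymbol{e}_{k})$. It samples from the prior only and does not perform posterior inference; the function names and the hyperparameter values are illustrative assumptions, not settings reported in the paper.

```python
import numpy as np

def sample_sticky_transitions(K_max, alpha, gamma, kappa, rng):
    """Draw (beta, P) from the weak-limit sticky HDP-HMM prior.

    beta ~ Dirichlet(gamma / K_max, ..., gamma / K_max)
    P[k] ~ Dirichlet(alpha * beta + kappa * e_k)
    """
    beta = rng.dirichlet(np.full(K_max, gamma / K_max))
    conc = alpha * beta[None, :] + kappa * np.eye(K_max)
    conc = np.maximum(conc, 1e-12)  # numerical floor (an assumption of this sketch)
    P = np.vstack([rng.dirichlet(row) for row in conc])
    return beta, P

def expected_self_transition(beta, alpha, kappa):
    """Prior mean of P[k, k]: (alpha * beta_k + kappa) / (alpha + kappa)."""
    return (alpha * beta + kappa) / (alpha + kappa)

# Hypothetical hyperparameters, chosen only for illustration.
rng = np.random.default_rng(0)
beta, P_sticky = sample_sticky_transitions(K_max=8, alpha=4.0, gamma=2.0, kappa=20.0, rng=rng)
stay_sticky = expected_self_transition(beta, alpha=4.0, kappa=20.0)
stay_plain = expected_self_transition(beta, alpha=4.0, kappa=0.0)
```

With $\kappa=0$ the expected self-transition mass reduces to $\beta_{k}$, recovering the non-sticky HDP-HMM; as $\kappa$ grows relative to $\alpha$, it approaches 1 for every state.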
##### Emission model\.
As with the Gaussian HMM baseline, each active state $k$ is associated with a Gaussian emission $\mathcal{N}(\boldsymbol{\mu}_{k},\boldsymbol{\Sigma}_{k})$ over the $d$-dimensional observation space. Because the emission structure is otherwise identical to the baseline, any differences in regime quality between the two models are attributable solely to the nonparametric prior and the sticky self-transition mechanism.
### 3.4 Evaluation
Evaluating unsupervised temporal segmentation quality is non-trivial in the absence of ground-truth regime annotations. We therefore adopt an LLM-as-a-Judge approach: a large language model (LLM) is prompted to segment each conversation into coherent emotional phases and assign a descriptive label to each, and its output serves as the reference labeling. Formally, given a conversation comprising $T$ utterances $\mathbf{u}=(u_{1},u_{2},\ldots,u_{T})$, the LLM assigns each utterance a regime label, producing a reference label sequence $\mathbf{r}\in\{1,\ldots,K\}^{T}$ over $K$ distinct labels, where $K$ is not fixed in advance. The prompt provides the full utterance-level transcript and instructs the model to identify utterances sharing a consistent affective character; labels may recur freely across the sequence, permitting regimes to alternate and re-emerge. The prompt instructs the model to identify transitions significant enough to require a clinician to meaningfully change their therapeutic approach, with regimes consolidated to a maximum of 8 distinct labels per consultation. Reference labels were generated using GPT-5.4 via the OpenAI API, and the full prompt is available in Appendix [C.1](https://arxiv.org/html/2605.12838#A3.SS1).
These LLM-derived labels are treated as approximate ground truth for the purposes of quantitative evaluation; our sticky HDP-HMM independently produces a predicted label sequence $\mathbf{p}\in\{1,\ldots,M\}^{T}$, where the number of predicted regimes $M$ need not equal $K$. Because both label sequences are produced independently, their cluster indices are arbitrary: predicted label $i$ carries no semantic correspondence to reference label $j$. This is the label permutation problem. To resolve it, predicted regime sequences are aligned to reference labels via the Hungarian algorithm (Kuhn, [1955](https://arxiv.org/html/2605.12838#bib.bib16)), with details in Appendix [A](https://arxiv.org/html/2605.12838#A1).
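One way to sketch this alignment is with SciPy's `linear_sum_assignment`, which solves the same assignment problem that the Hungarian algorithm addresses. The cost used here (maximizing the number of co-labeled utterances) and the handling of surplus predicted labels are assumptions of this sketch; the paper's exact procedure is described in its Appendix A and is not reproduced here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(pred, ref):
    """Relabel `pred` so that its cluster ids best match those of `ref`.

    Returns the relabeled sequence and the pred -> ref mapping.
    """
    pred, ref = np.asarray(pred), np.asarray(ref)
    pred_ids, ref_ids = np.unique(pred), np.unique(ref)
    # overlap[i, j]: number of utterances with predicted id i and reference id j
    overlap = np.zeros((len(pred_ids), len(ref_ids)), dtype=int)
    for i, p in enumerate(pred_ids):
        for j, r in enumerate(ref_ids):
            overlap[i, j] = np.sum((pred == p) & (ref == r))
    rows, cols = linear_sum_assignment(overlap, maximize=True)
    mapping = {int(pred_ids[i]): int(ref_ids[j]) for i, j in zip(rows, cols)}
    # If M > K, surplus predicted labels receive fresh ids (a choice made in this sketch).
    next_id = int(max(pred_ids.max(), ref_ids.max())) + 1
    for p in pred_ids:
        if int(p) not in mapping:
            mapping[int(p)] = next_id
            next_id += 1
    aligned = np.array([mapping[int(p)] for p in pred])
    return aligned, mapping

# Toy sequences: same segmentation, permuted label ids.
ref = [0, 0, 0, 1, 1, 2, 2, 2]
pred = [2, 2, 2, 0, 0, 1, 1, 1]
aligned, mapping = align_labels(pred, ref)
```

Permutation-invariant metrics such as NMI do not need this step, whereas label-sensitive metrics such as segment-level F1 do.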
We report segment-level F1, boundary F1 (within a tolerance of $\pm 1$ utterance), and normalized mutual information (NMI) between predicted and reference label sequences. In addition, we evaluate regime stability directly via the mean regime duration (in utterances) and the proportion of single-utterance regimes; a temporally coherent model should exhibit a high mean duration and a low single-utterance proportion.
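The stability and boundary metrics can be sketched in plain Python as follows. Two details are assumptions of this sketch rather than definitions taken from the paper: the single-utterance proportion is computed over regimes (runs) rather than over utterances, and boundary matching is greedy and one-to-one within the tolerance window.

```python
def run_lengths(labels):
    """Lengths of maximal runs of identical consecutive labels."""
    if not labels:
        return []
    runs, count = [], 1
    for prev, cur in zip(labels, labels[1:]):
        if cur == prev:
            count += 1
        else:
            runs.append(count)
            count = 1
    runs.append(count)
    return runs

def stability_metrics(labels):
    """Mean regime duration and fraction of regimes lasting one utterance."""
    runs = run_lengths(labels)
    mean_duration = sum(runs) / len(runs)
    single_fraction = sum(1 for r in runs if r == 1) / len(runs)
    return mean_duration, single_fraction

def boundaries(labels):
    """Indices t at which the label differs from the one at t - 1."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

def boundary_f1(pred, ref, tol=1):
    """F1 over boundaries, counting a hit when |t_pred - t_ref| <= tol."""
    pb, rb = boundaries(pred), boundaries(ref)
    if not pb and not rb:
        return 1.0
    unmatched = list(rb)
    hits = 0
    for b in pb:
        match = next((r for r in unmatched if abs(r - b) <= tol), None)
        if match is not None:
            unmatched.remove(match)  # each reference boundary is matched at most once
            hits += 1
    if hits == 0:
        return 0.0
    precision, recall = hits / len(pb), hits / len(rb)
    return 2 * precision * recall / (precision + recall)

# Toy sequences (hypothetical).
pred = [0, 0, 0, 1, 1, 1, 1, 2, 0, 0]
ref = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
mean_duration, single_fraction = stability_metrics(pred)
f1 = boundary_f1(pred, ref, tol=1)
```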
For the Question-Answer augmentation evaluation, we construct a paired evaluation over mid-conversation prompts drawn from the PriMock57 dataset (Papadopoulos Korfiatis et al., [2022](https://arxiv.org/html/2605.12838#bib.bib21)). For each consultation, the first half of the dialogue is provided as context, and the LLM is tasked with generating the next response in the interaction.
We compare two conditions: (1) Baseline: the LLM is prompted with the conversation history only. (2) Regime-augmented: the LLM is prompted with the same conversation history, along with a compact structured summary of the current inferred emotional regime (valence, arousal, regime label, persistence, and consultation phase), derived from the multimodal sticky HDP-HMM.
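A rough sketch of how the two conditions might be assembled is given below. The five fields follow the summary described above, but the class name, field types, prompt layout, and example values are hypothetical; the actual prompts are those in the paper's Appendix C.2.

```python
from dataclasses import dataclass

@dataclass
class RegimeSummary:
    """Compact description of the current inferred regime (illustrative schema)."""
    regime_label: str
    valence: float
    arousal: float
    persistence: int          # assumed meaning: utterances spent in the current regime
    consultation_phase: str

def baseline_prompt(history):
    """Condition 1: conversation history only."""
    return "[Conversation history]\n" + "\n".join(history)

def augmented_prompt(history, summary):
    """Condition 2: the same history, preceded by the structured regime summary."""
    block = (
        "[Current emotional regime]\n"
        f"label: {summary.regime_label}\n"
        f"valence: {summary.valence:+.2f}\n"
        f"arousal: {summary.arousal:+.2f}\n"
        f"persistence: {summary.persistence} utterances\n"
        f"phase: {summary.consultation_phase}\n"
    )
    return block + "\n" + baseline_prompt(history)

# Hypothetical example values.
history = ["Doctor: What brings you in today?", "Patient: I have had chest pain since Monday."]
summary = RegimeSummary("guarded distress", -0.42, 0.31, 6, "history taking")
prompt_a = baseline_prompt(history)
prompt_b = augmented_prompt(history, summary)
```

Keeping the history identical across the two prompts means the regime block is the only difference between conditions.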
No additional task-specific instructions are introduced beyond this structured signal. The generation model is GPT-5.4, with three candidate responses generated per condition and a canonical response selected via a separate model call. Evaluation uses an LLM-as-a-judge framework (Zheng et al., [2023](https://arxiv.org/html/2605.12838#bib.bib17)), also implemented with GPT-5.4, comprising two complementary judgment types: pairwise A/B preference (with randomized condition-to-label assignment to control for position bias) and absolute rubric scoring across four dimensions. Ground truth is defined as the next three actual clinician turns in the transcript, used as a reference anchor for the judge. For each prompt, the judge model is asked to select the better response between baseline and regime-augmented across four criteria: Contextual Appropriateness, Emotional Attunement, Helpfulness, and Conversational Coherence. Full prompts for all LLM calls are provided in Appendix [C.2](https://arxiv.org/html/2605.12838#A3.SS2).
## 4 Results
The results are organized around five claims: that HMMs can segment clinical conversations into coherent emotional regimes (Section [4.1](https://arxiv.org/html/2605.12838#S4.SS1)); that the sticky HDP-HMM produces meaningfully better regime structure than the Gaussian HMM baseline (Section [4.2](https://arxiv.org/html/2605.12838#S4.SS2)); that the framework extends to multimodal observation spaces (Section [4.3](https://arxiv.org/html/2605.12838#S4.SS3)); that these recovered regimes are interpretable in affective and clinical terms (Section [4.4](https://arxiv.org/html/2605.12838#S4.SS4)); and that augmenting a question-answering LLM with the current interpreted regime significantly improves the quality of generated conversational responses (Section [4.5](https://arxiv.org/html/2605.12838#S4.SS5)).
### 4.1 HMM Regime Segmentation Baseline
As a prerequisite for all subsequent claims, latent state models must recover structured regime sequences from utterance\-level VA trajectories\. The key question is whether the resulting segmentation reflects genuine conversational structure rather than noise\.
Across all 57 PriMock57 consultations, using the transcripts alone, both the Gaussian HMM and the sticky HDP-HMM recover regime sequences with sustained occupancy. The Gaussian HMM produces a mean regime duration of 3.04 utterances, while the sticky HDP-HMM produces a mean duration of 9.66 utterances. Both are substantially above the duration of 1.0 expected under random assignment. Representative decoded sequences show that latent states persist across contiguous conversational spans. The emission means recovered by both models occupy distinct and interpretable regions of the VA plane. Across consultations, states partition the VA space into recognizable affective quadrants, indicating that the latent states capture meaningful affective modes. The mean inter-regime centroid (Euclidean) distance of 0.469 for the Gaussian HMM and 0.339 for the sticky HDP-HMM further supports this separation, indicating that both models identify geometrically distinct affective clusters.
### 4.2 Sticky HDP-HMM Comparison to Gaussian HMM
Both models recover non-trivial regime structure, but differ in how realistically they represent temporal dynamics. The sticky HDP-HMM captures temporally coherent phases that reflect persistent affective structure, whereas the Gaussian HMM over-segments what may be real emotional states in conversation, generating more noise than interpretable insight. Table [1](https://arxiv.org/html/2605.12838#S4.T1) summarizes temporal coherence metrics across all consultations. The sticky HDP-HMM reduces single-utterance regimes by nearly one full order of magnitude and reduces regime shifts by a factor of four, while achieving substantially longer regime durations. These differences indicate a strong improvement in temporal coherence.
Table 1: Quantitative metrics averaged across 57 PriMock57 consultations. Better values are bolded.

Notably, the Gaussian HMM achieves a marginally higher mean log-likelihood (45.61 vs. 44.19), which at first glance suggests a better fit. This is expected, because maximizing log-likelihood fits the observations without any temporal regularization. The difference is small relative to the large gains in temporal coherence, and log-likelihood alone is not a reliable indicator of regime quality in this setting. For this reason, we extend our evaluation methodology to include LLM-as-a-Judge, and find that the sticky HDP-HMM substantially outperforms the Gaussian HMM across all three reference-based metrics. The LLM judge assigned a mean of 3.61 distinct regimes per consultation with an average duration of 14.18 utterances, a granularity considerably closer to the sticky model’s behavior than the Gaussian’s.
The most compelling result is Normalized Mutual Information (NMI), where the sticky model scores 0.431 against 0.229, winning on 48 of 57 consultations, indicating that its regime assignments are substantially more aligned with the LLM judge’s partition of the affective space. Boundary F1 corroborates this, with the sticky model scoring 0.302 against 0.266. Segment-level F1 is low for both models, which is expected given the metric’s requirement for exact coincidence of both boundaries and labels; we therefore treat NMI and boundary F1 as the primary judge-based metrics.
Taken together, these results confirm that the Gaussian HMM’s lack of a temporal persistence prior causes it to over\-segment\. The sticky HDP\-HMM, in contrast, recovers sustained affective structure, and we conclude that temporal regularization via the sticky prior is essential for detecting clinically meaningful regime shifts in this setting\.
### 4.3 Extension to multimodal observation spaces
Before evaluating the combined model, we examine the correlation between the text and audio streams to check our assumption of conditionally independent emission channels (Section [3.2](https://arxiv.org/html/2605.12838#S3.SS2)). The Pearson correlation between text valence and audio valence is $r=0.361$, and between text arousal and audio arousal it is $r=0.188$. These values indicate a weak relationship; this divergence reflects the known phenomenon of semantic-prosodic decoupling in clinical speech, where patients frequently describe distressing content in a controlled vocal register (Cummins et al., [2015](https://arxiv.org/html/2605.12838#bib.bib19); Schuller and Batliner, [2014](https://arxiv.org/html/2605.12838#bib.bib18)). Table [2](https://arxiv.org/html/2605.12838#S4.T2) summarizes regime statistics across all three configurations of the sticky HDP-HMM.
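The per-dimension correlation check can be sketched as below. The synthetic streams are hypothetical stand-ins for the text and audio valence series and are not intended to reproduce the reported values.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length 1-D series."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Synthetic stand-ins: audio valence is only loosely coupled to text valence.
rng = np.random.default_rng(0)
text_valence = rng.normal(size=200)
audio_valence = 0.4 * text_valence + rng.normal(size=200)
r_valence = pearson_r(text_valence, audio_valence)
```

Note that this is a marginal correlation across utterances; it indicates how much the two channels agree overall, which is related to but not the same as independence conditional on the latent regime.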
Table 2: Sticky HDP-HMM regime statistics by modality configuration. Best values are bolded.

The results reveal a clear modality-specific trade-off. The text-only model recovers the richest affective structure, with the highest effective regime count (4.46) and a comparatively balanced dominant regime share (0.48), suggesting it captures a wider repertoire of emotional states. The audio-only model, by contrast, favors temporal stability: it produces the fewest regime shifts (4.00), the longest mean regime duration (12.82 utterances), and the lowest single-utterance fraction (0.03), consistent with the smoothly-varying nature of prosodic signals in clinical conversation. The combined 4D model presents a mixed picture. It achieves the lowest transition entropy (0.49), indicating that its between-regime dynamics are the most structured and predictable of the three configurations, yet it also exhibits the highest regime-shift rate (8.05), the shortest mean duration (7.50), and the largest single-utterance fraction (0.09). This apparent tension is interpretable: when text and audio disagree, as the low inter-modal correlations suggest they frequently do, the model must resolve conflicting emission signals, which can induce brief excursions into transient regimes that would not arise under either unimodal model alone.
By construction, the model scales transparently to an arbitrary number of modality channels: adding a new modality requires only the specification of an additional emission distribution $p(\mathbf{x}_t^{(m)} \mid z_t)$ for that channel, while the HDP prior over the transition matrix and the inference procedure remain unchanged. We exploit this flexibility to extend the evaluation to a trimodal setting using MELD (Poria et al., [2019a](https://arxiv.org/html/2605.12838#bib.bib20)), a multimodal corpus of emotionally labeled dialogue segments drawn from the television series Friends, where text, audio, and video valence-arousal streams are all available.
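To make the modularity concrete: under conditional independence, the joint emission log-likelihood of a regime is the sum of per-modality log-densities, so a new channel contributes one additional term. The sketch below assumes diagonal-covariance Gaussian emissions over (valence, arousal) for simplicity; the covariance structure, function names, and parameter values are illustrative assumptions rather than the paper's implementation.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian evaluated at x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def factorial_log_emission(obs, params):
    """log p(x_t | z_t = k) as a sum over modalities of log p(x_t^(m) | z_t = k).

    obs:    {modality: [valence, arousal]}
    params: {modality: (mean, var)} for a single regime k
    """
    return sum(diag_gauss_logpdf(obs[m], *params[m]) for m in obs)

# Hypothetical parameters of one regime: (mean, variance) per modality.
regime_k = {
    "text": ([0.0, 0.0], [1.0, 1.0]),
    "audio": ([0.0, 0.0], [1.0, 1.0]),
}
x_t = {"text": [0.0, 0.0], "audio": [0.0, 0.0]}
two_modalities = factorial_log_emission(x_t, regime_k)

# Adding a video channel only adds one more term to the sum.
regime_k["video"] = ([0.0, 0.0], [1.0, 1.0])
x_t["video"] = [0.0, 0.0]
three_modalities = factorial_log_emission(x_t, regime_k)
```

Nothing in `factorial_log_emission` depends on the number of channels, which mirrors the claim that the transition prior and inference procedure are untouched when a modality is added.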
Table [3](https://arxiv.org/html/2605.12838#S4.T3) summarizes regime statistics across all four configurations on MELD. The unimodal models are broadly comparable in effective regime count (2.00–2.30) and show near-zero single-utterance fractions, indicating stable, persistent affective states within each modality. The combined 6D model highlights the framework's flexibility to integrate richer multimodal signals, yielding a more expressive affective decomposition (2.70 effective regimes, dominant regime share of 0.60). This added granularity comes with brief transitions that capture transient disagreements between modalities. As in the 4D setting, this behavior is consistent with adaptive segmentation, where increased input complexity leads to finer-grained, but not noisy, affective structure.
Table 3: Sticky (Factorial) HDP-HMM regime statistics by modality configuration on MELD.
### 4.4 Interpretation of Regimes
Statistical coherence alone is insufficient; the recovered regimes must also correspond to meaningful affective states\. We therefore analyze the geometry of emission means, their consistency across modalities, and their temporal realization in case studies\.
Across both PriMock57 and MELD, the learned regimes organize into a structured trajectory in VA space rather than forming arbitrary clusters. The elliptical structure visible in the text and audio emission plots in Figure [2](https://arxiv.org/html/2605.12838#S4.F2) supports this interpretation: individual modalities exhibit elongated, overlapping distributions, while the combined emissions collapse these into intermediate, interpretable distributions. The combined VA means lie between modalities, indicating that the factorial emission model resolves these discrepancies into a consensus representation rather than privileging a single modality. This behavior is critical for interpretability: regimes reflect shared affective structure while retaining modality-specific nuance.
Figure 2: Multimodal regime emission structure (Gaussian) across datasets. Each point denotes the mean VA centroid of an inferred regime, and each ellipse shows the corresponding 1-$\sigma$ emission covariance, representing the spread of utterance-level observations assigned to that regime. Regime labels (R0, R1, R2) are assigned in ascending order of valence. PriMock57 regimes cluster near neutral valence with low arousal, consistent with the affective range of primary care consultations. MELD regimes show greater separation and higher arousal, reflecting the more expressive emotional range of scripted TV dialogue.

Temporal structure further reinforces this interpretation. In PriMock57 (day2_cons10), the combined sticky HDP-HMM is dominated by the lowest-valence regime, with R0 occupying 61.3% of utterances, compared with 21.0% for R1 and 17.7% for R2. The sequence begins with a brief high-valence segment (R2), then settles into a long dwell in the dominant negative regime (R0), before shifting to an intermediate-valence regime (R1) and ending with a short return to the highest-valence regime (R2). In MELD (dialogue 0), the early interaction is likewise governed by a single high-occupancy regime, but here that regime is the highest-valence state: R2 occupies 71.4% of the clip, whereas R1 accounts for 21.4% and R0 only 7.1%. After this extended R2 segment, the clip transitions briefly into a low-valence, high-arousal state (R0) and then ends in a short intermediate-valence state (R1). These transitions occur in contiguous blocks of utterances, consistent with temporally extended affective phases. Importantly, regime-specific variance provides additional evidence that these states are meaningful. In PriMock57, lower-valence regimes (R0, R1) exhibit tighter variance, while more positive regimes (R2) show increased dispersion, suggesting greater heterogeneity in affective expression. In MELD, high-arousal regimes are both extreme in mean VA and relatively stable within modalities, indicating distinct emotional peaks rather than noise-driven outliers.
Taken together, these findings show that the sticky factorial HDP\-HMM recovers regimes that are geometrically structured, temporally persistent, and consistent across modalities\. The factorized emission model is central to this interpretability: by explicitly modeling modality\-specific contributions, it exposes systematic cross\-modal differences while producing unified regime representations that align with coherent phases of interaction\.
### 4.5 Multimodal QnA Augmentation with Regimes
We evaluate whether inferred emotional regimes can serve as a compact conditioning signal for large language models (LLMs) in clinical question-answer (QnA) settings. Figure [3](https://arxiv.org/html/2605.12838#S4.F3) shows a representative comparison. Regime-augmented responses tend to incorporate brief affective acknowledgments calibrated to the inferred emotional state before continuing the clinical thread, whereas baseline responses proceed directly to clinical content without tonal modulation.
Inferred regime: negative valence with slightly elevated arousal (prolonged fatigue/frustration), stable for 5 consecutive turns.

Baseline: “Yes, that can happen with this sort of illness. Have you noticed feeling mentally slower than usual? And how’s your sleep been?”

Regime-augmented: “That sounds really uncomfortable, especially with it keeping you up at night. Can you tell me whether the itch is worse at any particular time of day?”
Figure 3: Representative comparison of baseline and regime-augmented responses. The augmented response adds a brief affective acknowledgment and manageable clinical inquiry.

Table 4: Absolute rubric scores (mean, scale 1–5) under baseline and regime-augmented conditions across $N=57$ PriMock57 consultations. $p$-values from two-sided paired $t$-tests. Stratified results show regime-dependent effects. Better values are bolded.

Overall

| Dimension | Baseline | Augmented | $\Delta$ | $p$ |
| --- | --- | --- | --- | --- |
| Affective Attunement | 1.87 | **2.07** | +0.20 | 0.104 |
| Clinical Appropriateness | **4.04** | 4.02 | −0.02 | 0.866 |
| Contextual Coherence | 3.95 | **4.10** | +0.15 | 0.106 |
| Specificity | **4.00** | 3.88 | −0.12 | 0.082 |
| Ground Truth Alignment | 2.09 | **2.28** | +0.19 | 0.076 |

Stratified by Regime Persistence

| Condition | $n$ | Win Rate | $p$ |
| --- | --- | --- | --- |
| Unstable regimes ($\leq 5$ turns) | 28 | 0.714 | 0.036 |
| Stable regimes ($> 5$ turns) | 11 | 0.636 | 0.549 |
Table [4](https://arxiv.org/html/2605.12838#S4.T4) summarizes absolute rubric scores across all $N=57$ consultations. Each dimension captures a distinct aspect of response quality: Affective Attunement measures sensitivity to the patient's emotional state; Clinical Appropriateness reflects adherence to plausibly sound medical practice; Contextual Coherence assesses logical consistency with the preceding dialogue; Specificity captures the precision and detail of the response; and Ground Truth Alignment measures agreement with the reference clinician response. Across dimensions, the regime-augmented condition exhibits consistent directional improvements in affective attunement, contextual coherence, and ground truth alignment, though none reaches conventional significance thresholds at this sample size. Clinical appropriateness remains unchanged, and specificity shows a small negative shift. Overall, these results suggest modest but consistent gains in affective and contextual alignment without degradation of core clinical quality.
Although the effects are not significant at the aggregate conversation level, stratifying by regime persistence and stability localizes the effect. Under rubric-based evaluation, augmentation yields a significant improvement in emotionally unstable consultations, but not in stable ones. This pattern is consistent with the hypothesis that regime summaries are most informative when the patient's emotional state is shifting and not yet fully inferable from local context. These results provide qualified support for emotional regimes as a useful intermediate representation for conditioning LLM behavior in specific segments of conversational dialogues. In addition, unstable regimes are often where an appropriate response matters most, whereas steady-state conversation segments are less complex and are often handled well by both the baseline and augmented models.
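The stratified $p$-values in Table 4 can be sanity-checked with an exact sign test. The sketch below is a reconstruction under stated assumptions, not the authors' procedure: the win counts (20 of 28 and 7 of 11) are inferred from the rounded win rates, and the choice of a two-sided binomial test against a chance win rate of 0.5 is ours. Under those assumptions the test gives roughly 0.036 and 0.549, in line with the tabulated values.

```python
from math import comb

def two_sided_sign_test(wins, n):
    """Exact two-sided binomial test of H0: P(win) = 0.5."""
    # Under the symmetric null, the two-sided p-value is twice the
    # probability of a result at least as extreme in one direction.
    k = max(wins, n - wins)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Win counts inferred from the reported win rates (assumption).
p_unstable = two_sided_sign_test(20, 28)  # 20/28 = 0.714
p_stable = two_sided_sign_test(7, 11)     # 7/11 = 0.636
print(round(p_unstable, 3), round(p_stable, 3))
```

With only 11 stable consultations, even a 0.636 win rate is well within what chance would produce, so the non-significant result there says little about the absence of an effect.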
## 5 Conclusion
In this work, we presented a lightweight framework for modeling conversational emotion as a sequence of persistent latent regimes\. By applying truncated sticky factorial HDP\-HMMs to multimodal valence–arousal representations derived from audio and textual signals, we showed that it is possible to recover temporally coherent and interpretable affective structure that is obscured by standard utterance\-level approaches\. Relative to Gaussian HMM baselines, the proposed model produces substantially more stable regime sequences while retaining meaningful structure in the underlying affective space\.
Beyond modeling, we explored the use of inferred regimes as a compact representation of conversational states for downstream systems. In a clinical question-answer setting, we find that incorporating regime-level summaries into LLM context has no significant impact on response quality for steady-state affective trajectories, but it does lead to significantly better-aligned responses in unstable affective regimes where the user is emotionally fluctuating.
This suggests a broader view of regime inference as a form of state compression, where it serves as a low\-dimensional, persistent signal that allows LLMs to respond in alignment with conversational dynamics without repeatedly processing the full dialogue history\. Such representations offer a practical pathway toward more efficient and stable multimodal question\-answer systems\.
Future work will focus on strengthening both evaluation and interpretability. On the evaluation side, replacing the LLM-as-a-Judge with human raters and developing datasets with regime annotations grounded in real cognition would enable more rigorous benchmarking of the proposed framework. On the interpretability side, projecting continuous VA trajectories, as defined by the regimes' emission distributions, onto discrete emotion representations may yield regime sequences that are more directly legible to both clinicians and downstream systems. Taken together, these directions build towards a broader vision of conversational emotion analysis that is simultaneously lightweight, interpretable across fields, and clinically actionable.
## References
- V. Balaraman, S. Sheikhalishahi, and B. Magnini (2021). Recent neural methods on dialogue state tracking for task-oriented dialogue systems: a survey. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, H. Li, G. Levow, Z. Yu, C. Gupta, B. Sisman, S. Cai, D. Vandyke, N. Dethlefs, Y. Wu, and J. J. Li (Eds.), Singapore and Online, pp. 239–251. DOI: [10.18653/v1/2021.sigdial-1.25](https://dx.doi.org/10.18653/v1/2021.sigdial-1.25).
- D. M. Blei and M. I. Jordan (2004). Variational methods for the Dirichlet process. In Proceedings of the Twenty-First International Conference on Machine Learning, ICML ’04, New York, NY, USA, pp. 12. ISBN 1581138385. DOI: [10.1145/1015330.1015439](https://dx.doi.org/10.1145/1015330.1015439).
- C. Busso, R. Lotfian, K. Sridhar, A. N. Salman, W. Lin, L. Goncalves, S. Parthasarathy, A. R. Naini, S. Leem, L. Martinez-Lucas, et al. (2025). The MSP-Podcast corpus. arXiv preprint arXiv:2509.09791.
- R. Carranza and M. A. Rojas (2025). Interpretable and robust dialogue state tracking via natural language summarization with LLMs. arXiv preprint arXiv:2503.08857.
- N. Cummins, S. Scherer, J. Krajewski, S. Schnieder, J. Epps, and T. F. Quatieri (2015). A review of depression and suicide risk assessment using speech analysis. Speech Commun. 71(C), pp. 10–49. ISSN 0167-6393. DOI: [10.1016/j.specom.2015.03.004](https://dx.doi.org/10.1016/j.specom.2015.03.004).
- E. B. Fox, E. B. Sudderth, M. I. Jordan, and A. S. Willsky (2008). An HDP-HMM for systems with state persistence. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 312–319. ISBN 9781605582054. DOI: [10.1145/1390156.1390196](https://dx.doi.org/10.1145/1390156.1390196).
- H. Ishwaran and L. F. James (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association 96(453), pp. 161–173. DOI: [10.1198/016214501750332758](https://dx.doi.org/10.1198/016214501750332758).
- H. W. Kuhn (1955). The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2(1-2), pp. 83–97. DOI: [10.1002/nav.3800020109](https://dx.doi.org/https%3A//doi.org/10.1002/nav.3800020109).
- J. Lee (2022). The emotion is not one-hot encoding: learning with grayscale label for emotion recognition in conversation. arXiv preprint arXiv:2206.07359.
- Mavdol (2025).
- A. D. Ong, C. S. Bergeman, T. L. Bisconti, and K. A. Wallace (2006). Psychological resilience, positive emotions, and successful adaptation to stress in later life. Journal of Personality and Social Psychology 91(4), pp. 730–749. DOI: [10.1037/0022-3514.91.4.730](https://dx.doi.org/10.1037/0022-3514.91.4.730).
- A. Papadopoulos Korfiatis, F. Moramarco, R. Sarac, and A. Savkov (2022). PriMock57: a dataset of primary care mock consultations. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 588–598. DOI: [10.18653/v1/2022.acl-short.65](https://dx.doi.org/10.18653/v1/2022.acl-short.65).
- S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea (2019a). MELD: a multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 527–536.
- S. Poria, N. Majumder, R. Mihalcea, and E. Hovy (2019b). Emotion recognition in conversation: research challenges, datasets, and recent advances. IEEE Access 7, pp. 100943–100953.
- L. R. Rabiner (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2), pp. 257–286. DOI: [10.1109/5.18626](https://dx.doi.org/10.1109/5.18626).
- M. P. A. Ramaswamy and S. Palaniswamy (2024). Multimodal emotion recognition: a comprehensive review, trends, and challenges. WIREs Data Mining and Knowledge Discovery 14(6), pp. e1563. DOI: [10.1002/widm.1563](https://dx.doi.org/https%3A//doi.org/10.1002/widm.1563).
- J. A. Russell (1980). A circumplex model of affect. Journal of Personality and Social Psychology 39(6), pp. 1161–1178. DOI: [10.1037/h0077714](https://dx.doi.org/10.1037/h0077714).
- B. Schuller and A. Batliner (2014). Computational paralinguistics: emotion, affect and personality in speech and language processing. Wiley.
- A. Toisoul, J. Kossaifi, A. Bulat, G. Tzimiropoulos, and M. Pantic (2021). Estimation of continuous valence and arousal levels from faces in naturalistic conditions. Nature Machine Intelligence 3(1), pp. 42–50. DOI: [10.1038/s42256-020-00280-0](https://dx.doi.org/10.1038/s42256-020-00280-0).
- J. Wagner, A. Triantafyllopoulos, H. Wierstorf, M. Schmitt, F. Burkhardt, F. Eyben, and B. W. Schuller (2023). Dawn of the transformer era in speech emotion recognition: closing the valence gap. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(9), pp. 10745–10759. DOI: [10.1109/TPAMI.2023.3263585](https://dx.doi.org/10.1109/TPAMI.2023.3263585).
- L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica (2023). Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
## Appendix A Hungarian Algorithm for Label Alignment
To match the LLM-as-a-Judge regime indices with our model's predicted indices, we first construct an $M\times K$ cost matrix $\mathbf{C}$ whose entries are the negated utterance-level agreement between each pair of predicted and reference labels:

$$C_{ij} = -\sum_{t=1}^{T} [p_t = i]\cdot[r_t = j], \quad i \in \{1,\ldots,M\},\ j \in \{1,\ldots,K\} \tag{11}$$

where $[\cdot]$ denotes the indicator function, so that $-C_{ij}$ counts the number of utterances on which predicted cluster $i$ and reference label $j$ co-occur. Setting $n=\max(M,K)$, we zero-pad $\mathbf{C}$ to an $n\times n$ matrix and find the optimal permutation $\pi^{*}\in S_n$ minimizing total assignment cost:

$$\pi^{*} = \underset{\pi\in S_n}{\arg\min} \sum_{i=1}^{n} C_{i,\pi(i)} \tag{12}$$

where assignments involving padded dummy rows or columns incur zero cost and are discarded after solving. When $M\neq K$, $\pi^{*}$ therefore constitutes a partial assignment, matching each label in the smaller set to its most overlapping counterpart in the larger. The optimal $\pi^{*}$ equivalently maximizes total utterance-level overlap between predicted and reference labels. Applying $\pi^{*}$ to $\mathbf{p}$ yields a remapped prediction sequence $\tilde{\mathbf{p}}$ defined by $\tilde{p}_t = \pi^{*}(p_t)$, which now shares a common label space with $\mathbf{r}$.
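The alignment step can be illustrated with a small, dependency-free sketch. Note the substitution: instead of the Hungarian algorithm, it searches all permutations by brute force, which reaches the same optimum and is tractable only because $n$ is small here (at most 8 labels). Labels are 0-indexed, and the function name and toy sequences are hypothetical.

```python
from itertools import permutations

def align_labels(pred, ref, M, K):
    """Map predicted labels 0..M-1 onto reference labels 0..K-1 by maximal overlap."""
    n = max(M, K)
    # Zero-padded n x n cost matrix: C[i][j] = -(co-occurrence count of i and j).
    C = [[0] * n for _ in range(n)]
    for p, r in zip(pred, ref):
        C[p][r] -= 1
    # Brute-force argmin over S_n (stand-in for the Hungarian algorithm).
    best = min(
        permutations(range(n)),
        key=lambda pi: sum(C[i][pi[i]] for i in range(n)),
    )
    # Discard assignments that involve padded dummy rows or columns.
    return {i: best[i] for i in range(M) if best[i] < K}

# Toy example: three predicted regimes, two reference labels.
pred = [0, 0, 1, 1, 2, 2, 2]
ref = [1, 1, 0, 0, 0, 0, 0]
mapping = align_labels(pred, ref, M=3, K=2)
remapped = [mapping.get(p) for p in pred]  # None marks an unmatched prediction
print(mapping, remapped)
```

In the toy example predicted regime 1 is left unmatched because regime 2 overlaps reference label 0 on more utterances, which is the partial-assignment case described above for $M \neq K$.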
## Appendix B Selection of $K_{\max}$ from Hyperparameter Tuning of the Gaussian HMM
Figure 4: Model selection behavior across different numbers of hidden states. Log-likelihood increases monotonically with $K$, but structural metrics such as mean regime duration and transition entropy indicate reduced interpretability beyond $K\approx 6$. Beyond $K=8$, regimes become increasingly short-lived and fragmented, motivating the choice of $K_{\max}=8$ as a balance between fit and temporal coherence.

Figure [4](https://arxiv.org/html/2605.12838#A2.F4) shows model behavior as the number of states increases, evaluated via log-likelihood, mean utterances per regime, and transition matrix entropy. While log-likelihood improves monotonically with additional states, this gain is accompanied by diminishing structural coherence: mean regime duration drops sharply beyond $K\approx 4$, and transition entropy remains high, indicating increasingly diffuse and less interpretable dynamics.
Notably, beyond $K=8$, additional states primarily capture short-lived or singleton regimes rather than stable affective structure, reflecting over-fragmentation of the observation space. Transition entropy also begins to rise again after that point. This aligns with the point at which improvements in Gaussian HMM fit are driven more by hyperparameter flexibility than by meaningful structure in the data. Accordingly, we set $K_{\max}=8$ as a balance between representational richness and temporal coherence.
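As an illustration of the structural metric used in this comparison, the sketch below computes a transition entropy from a decoded state sequence. It assumes one plausible definition, the mean Shannon entropy (in nats) of the rows of the empirical transition matrix; the definition used for Figure 4 may differ, and the function name and toy sequences are hypothetical.

```python
import math
from collections import Counter, defaultdict

def transition_entropy(z):
    """Mean row-wise Shannon entropy (nats) of the empirical transition matrix of z."""
    if len(z) < 2:
        raise ValueError("need at least two observations")
    # Count transitions a -> b between consecutive states.
    rows = defaultdict(Counter)
    for a, b in zip(z, z[1:]):
        rows[a][b] += 1
    entropies = []
    for counts in rows.values():
        total = sum(counts.values())
        entropies.append(
            -sum(c / total * math.log(c / total) for c in counts.values())
        )
    return sum(entropies) / len(entropies)

sticky = [0] * 10 + [1] * 10   # a single switch between two long dwells
diffuse = [0, 0, 1, 1] * 5     # staying and leaving are about equally likely
h_sticky = transition_entropy(sticky)
h_diffuse = transition_entropy(diffuse)
print(round(h_sticky, 3), round(h_diffuse, 3))
```

Persistent dynamics concentrate each row on the self-transition and give low entropy, while rows that spread mass across destinations approach $\ln 2$ in this two-state example.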
## Appendix C LLM Prompts
### C.1 For comparison between Gaussian and sticky HDP-HMMs
This prompt is used to obtain LLM-based regime annotations for comparison with Gaussian and sticky HDP-HMM segmentations. All calls use GPT-5.4 with temperature $=0.0$ to encourage deterministic, consistent labeling.
> You are an expert clinical supervisor reviewing a patient’s utterances from a therapy session\. You will be given a numbered list of the patient’s utterances\. Your task is to segment the conversation into emotional regimes\. A regime is a sustained affective state that persists across multiple utterances\. A new regime only begins when the patient’s emotional state shifts so substantially that a clinician would need to meaningfully change their approach to meet the patient’s needs — for example, shifting from active listening to crisis intervention, or from psychoeducation to emotional validation\. Minor variations in wording, topic, or intensity within the same underlying emotional state do NOT constitute a new regime\. A patient venting about different topics while remaining equally distressed is in one regime\. A patient who moves from distress into flat detachment has crossed into a new one\. Rules: \- Use a STRICT MAXIMUM of 8 distinct regime labels total\. \- Label each regime with a short phrase \(2\-\-5 words\) describing the dominant affective character, e\.g\. "acute distress", "guarded withdrawal", "calm reflection", "resigned hopelessness"\. \- The same label may recur if the patient returns to a prior state\. \- Every utterance must receive a label\. First, write 2\-\-3 sentences describing the overall emotional arc of the conversation and identifying the major regime transitions you observe\. Then output a JSON array, one object per utterance, in this exact format: \["t": <utterance index\>, "label": "<regime label\>", \.\.\.\] Output only the reasoning followed by the JSON\. No other commentary\. Transcript: \{transcript\}
### C.2 QnA Task/Evaluation Prompts
This section documents all prompts used in the QnA evaluation pipeline (Section [4.5](https://arxiv.org/html/2605.12838#S4.SS5)). All generation and evaluation calls use GPT-5.4. Judge calls use temperature $=0.0$; generation calls use temperature $=0.7$.
#### A\.2\.1 Response Generation — Baseline System Prompt
> You are an experienced primary care clinician conducting a medical consultation\. Given the conversation history, generate the next clinician response\. Be concise, empathetic, and clinically appropriate\. Respond with one turn only\.
#### A\.2\.2 Response Generation — Regime\-Augmented System Prompt
The augmented system prompt appends the following block to the baseline prompt above, where \{regime\_block\} is instantiated per consultation:
> During history\-taking phases, use the following regime signal to calibrate empathic acknowledgment only\. Reserve diagnostic and management responses for assessment phases\.: \{regime\_block\}
The regime block itself is structured as follows, with fields derived from the multimodal sticky HDP\-HMM at the consultation midpoint:
> \[Emotional Regime Summary\] Consultation phase: \{history\-taking \| assessment/management\} Current regime: \{label\} \(valence: \{v\}, arousal: \{a\}\) Regime persistence: \{n\} consecutive turns \(\{stable \| unstable\}\) Regime shifts so far: \{k\}
Consultation phase is assigned as history-taking when the midpoint turn index falls below 60% of the total turn count, and assessment/management otherwise. Stability is defined as persistence $>5$ consecutive turns.
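The two rules above are simple enough to sketch directly. The following is a minimal illustration of how the regime block could be filled in; the function and argument names, the line breaks inside the block, and the two-decimal number formatting are assumptions, since only the field template and the thresholds are specified here.

```python
def build_regime_block(label, valence, arousal, persistence, shifts,
                       midpoint_idx, total_turns):
    """Fill the regime summary template from per-consultation regime statistics."""
    # Phase rule: history-taking if the midpoint index is below 60% of all turns.
    phase = ("history-taking" if midpoint_idx < 0.6 * total_turns
             else "assessment/management")
    # Stability rule: more than 5 consecutive turns in the current regime.
    stability = "stable" if persistence > 5 else "unstable"
    return (
        "[Emotional Regime Summary]\n"
        f"Consultation phase: {phase}\n"
        f"Current regime: {label} (valence: {valence:.2f}, arousal: {arousal:.2f})\n"
        f"Regime persistence: {persistence} consecutive turns ({stability})\n"
        f"Regime shifts so far: {shifts}"
    )

# Hypothetical values for one consultation.
block = build_regime_block(
    label="negative valence, slightly elevated arousal",
    valence=-0.31, arousal=0.12, persistence=5, shifts=3,
    midpoint_idx=20, total_turns=40,
)
print(block)
```

Note that a persistence of exactly 5 turns falls on the unstable side of the threshold under the strict inequality.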
#### A\.2\.3 Response Generation — User Prompt \(Both Conditions\)
The user prompt is identical for both conditions:
> \#\# Consultation Context \{context\}
where \{context\} is the interleaved patient-clinician transcript up to and including the midpoint clinician turn, formatted as:
> Clinician: \{utterance\} Patient: \{utterance\} \.\.\.
#### A\.2\.4 Canonical Response Selection Prompt
Three candidate responses are generated per condition\. A separate model call selects the most representative candidate for downstream evaluation:
> You are choosing the most representative candidate response for downstream evaluation\. Select the single candidate that best represents the set as a typical output for the task\. Return exact JSON only: \{"choice": 1 \| 2 \| 3, "reasoning": "<1 sentence\>"\}
#### A\.2\.5 Pairwise Preference Judgment Prompt
> You are an expert evaluator of clinical consultation quality\. \#\# Consultation Context \(first half of consultation\) \{context\} \#\# Ground Truth Next Three Clinician Turns \{ground\_truth\} \#\# Response A \{response\_a\} \#\# Response B \{response\_b\} Which response better reflects the patient’s current emotional state and is more clinically appropriate as the next turn in this consultation? Respond in this exact JSON format: \{ "preference": "A" \| "B" \| "TIE", "confidence": 1 \| 2 \| 3, "reasoning": "<2\-3 sentences\>" \} Use the full scale of judgments\. Do not treat ties or extremely strong preferences as defaults\. Do not include any text outside the JSON\.
Condition\-to\-label assignment \(A/B\) is randomized per consultation using a deterministic seed combining a global seed and the consultation identifier, ensuring reproducibility while balancing assignment across the dataset\. The ground truth is presented as a reference anchor representing the actual clinician’s next three turns; judges are not instructed to prefer responses closer to ground truth, but rather to use it as contextual scaffolding for assessing clinical appropriateness\.
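One way to obtain this kind of reproducible per-consultation randomization is sketched below. It is an assumption about the mechanism rather than the authors' code: the global seed value, the use of SHA-256 to combine it with the consultation identifier, and the function name are all illustrative. A cryptographic hash is used instead of Python's built-in `hash()` because string hashes are salted per process and would not be reproducible across runs.

```python
import hashlib
import random

GLOBAL_SEED = 1234  # hypothetical value

def assign_labels(consultation_id, global_seed=GLOBAL_SEED):
    """Return which condition is shown as Response A and which as Response B."""
    # Derive a stable integer seed from the global seed and the consultation id.
    digest = hashlib.sha256(f"{global_seed}:{consultation_id}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    conditions = ["baseline", "augmented"]
    rng.shuffle(conditions)
    return {"A": conditions[0], "B": conditions[1]}

first = assign_labels("day1_cons01")
second = assign_labels("day1_cons01")
print(first)
```

Independent per-consultation flips balance the A/B positions only in expectation; an exact split would require shuffling a pre-balanced list of assignments instead.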
#### A\.2\.6 Absolute Rubric Scoring Prompt
Each response is scored independently on four dimensions via separate model calls \(one call per dimension per response\):
> You are an expert evaluator of clinical consultation quality\. \#\# Consultation Context \{context\} \#\# Ground Truth Next Three Clinician Turns \{ground\_truth\} \#\# Response to Evaluate \{response\} Rate the response on the following dimension only: Dimension: \{dimension\_name\} Description: \{dimension\_description\} Scoring guidance: \- Use the full 1\-\-5 scale\. \- A score of 5 should be reserved for responses that satisfy the dimension exceptionally well with little or no meaningful room for improvement\. \- Strong but imperfect responses should usually receive a 4 rather than a 5\. \- Do not avoid giving a 5 when it is clearly warranted\. Respond in this exact JSON format: \{"score": <integer 1\-\-5\>, "reasoning": "<1\-2 sentences\>"\} Do not include any text outside the JSON\.
The four dimensions and their descriptions are given in Table [5](https://arxiv.org/html/2605.12838#A3.T5).
Table 5: Rubric dimensions used in absolute scoring.
#### A\.2\.7 Ground Truth Alignment Scoring Prompt
> Rate how closely Response \[\{label\}\] matches the Ground Truth in terms of communicative intent and emotional attunement, on a scale of 1\-\-5\. You are not evaluating whether the ground truth is optimal, only measuring alignment with it\. Treat the ground truth as a short reference trajectory across the next three clinician turns, not as a word\-for\-word target\. \#\# Ground Truth \{ground\_truth\} \#\# Response \{response\} Scoring guidance: \- Use the full 1\-\-5 scale\. \- Reserve a 5 for near\-matches in communicative intent and emotional attunement\. \- Good but not especially close matches should usually receive a 4 rather than a 5\. \- Do not avoid giving a 5 when the match is truly very close\. Respond in this exact JSON format: \{"score": <integer 1\-\-5\>, "reasoning": "<1 sentence\>"\}