Emotion Recognition in Sign Language Conversation

arXiv cs.CL 05/25/26, 04:00 AM Papers
emotion-recognition sign-language conversation dataset benchmarking multimodal affective-computing
Summary
This paper introduces the eJSL Dialog dataset for emotion recognition in sign language conversations, addressing the lack of conversational context in existing datasets. Benchmarking shows a domain gap when applying generic multimodal models, highlighting the need for context-aware visual extractors for sign language.
arXiv:2605.23328v1 Announce Type: new Abstract: Emotion Recognition in Conversation is a core component of affective computing, while current resources of sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances demonstrate degraded performance in real world scenarios because they cannot utilize historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre-training is a necessary next step for future research.
Original Article
View Cached Full Text
Cached at: 05/25/26, 09:01 AM
# Emotion Recognition in Sign Language Conversation
Source: [https://arxiv.org/html/2605.23328](https://arxiv.org/html/2605.23328)
Yusong Wang1, Keyu Mao1, Takao Obi1, Minghao Shao2and Kotaro Funakoshi11Yusong Wang, Keyu Mao, Takao Obi, and Kotaro Funakoshi are with Institute of Science Tokyo, Yokohama 230\-0045, Japan\.\{wangyi, maokeyu, obi, funakoshi\}@lr\.first\.iir\.isct\.ac\.jp2Minghao Shao is with New York University, New York, 11201, USA\.shao\.minghao@nyu\.edu

###### Abstract

Emotion Recognition in Conversation is a core component of affective computing, while current resources of sign language emotion datasets primarily focus on isolated sentences and lack conversational context\. Models trained exclusively on these isolated utterances demonstrate degraded performance in real world scenarios because they cannot utilize historical dialogue flow\. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset\. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues\. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures\. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language\. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre\-training is a necessary next step for future research\.

## IINTRODUCTION

Emotion recognition enables machines to perceive users’ affective states and provide empathetic responses, which makes it a core component in applications such as virtual assistants and emotionally aware assistive technologies\[[20](https://arxiv.org/html/2605.23328#bib.bib11),[21](https://arxiv.org/html/2605.23328#bib.bib13),[27](https://arxiv.org/html/2605.23328#bib.bib12)\]\. Most current systems primarily focus on spoken languages and standard facial expressions from non\-signing populations\. Extending these technologies to sign language, which is often ignored in mainstream research, is a necessary step\[[17](https://arxiv.org/html/2605.23328#bib.bib21),[1](https://arxiv.org/html/2605.23328#bib.bib14)\]\. In sign language communication, the emotion recognition task becomes highly complex due to the visual nature of the language itself\. As visual languages, sign languages rely primarily on hand signs, facial expressions, and upper body movements to convey linguistic structure and emotional content simultaneously\[[6](https://arxiv.org/html/2605.23328#bib.bib16),[8](https://arxiv.org/html/2605.23328#bib.bib15)\]\. This overlap between grammatical features and emotional features introduces significant ambiguity to automatic emotion recognition systems\[[6](https://arxiv.org/html/2605.23328#bib.bib16),[9](https://arxiv.org/html/2605.23328#bib.bib17)\]\. Accurately capturing and modeling emotional dynamics in sign language remains a practical challenge in the field of both computer vision and language processing\.

Existing resources for sign language emotion analysis focus primarily on isolated sentences or unidirectional expressions\. For instance, datasets such as eJSL Solo\[[10](https://arxiv.org/html/2605.23328#bib.bib4)\]consist of sign language video clips detached from conversational context\. Similarly, the EmoSign dataset\[[5](https://arxiv.org/html/2605.23328#bib.bib18)\]concentrates on capturing emotional expressions within single video utterances\. These datasets have advanced research in isolated sign language emotion recognition, but they omit the conversational context present in real communication\.

In general, emotion recognition models trained exclusively on these isolated utterances cannot utilize historical context\. Consequently, these models demonstrate degraded performance when applied to real world scenarios where emotional meaning depends on the continuous dialogue flow\[[13](https://arxiv.org/html/2605.23328#bib.bib31),[18](https://arxiv.org/html/2605.23328#bib.bib32),[11](https://arxiv.org/html/2605.23328#bib.bib35)\]\. In authentic bidirectional interactions, emotional states undergo a dynamic evolution process\. An individual’s emotional shift is influenced by their own emotional history and also occurs as a direct response to the state of their interlocutor\. The absence of dialogue history limits the ability of existing models to understand complex emotional evolution\. Therefore, exploring emotion recognition in sign language within multiple turn dialogue scenarios is a core step to advance this field\.

To address this structural limitation and the lack of dialogue data in existing research, we propose the eJSL Dialog dataset for the Emotion Recognition in Conversation \(ERC\) task in sign language\. This represents the first exploration into this specific task\. The eJSL Dialog dataset is constructed using the dialogue scripts from the STUDIES Japanese Empathetic Dialogue Speech Corpus\[[23](https://arxiv.org/html/2605.23328#bib.bib3)\]\. Each line of the scripts has a designated emotion category label\. The constructed dataset contains a total of 1,920 video samples divided into 480 unique dialogues centered around teacher and student interactions\.[Table I](https://arxiv.org/html/2605.23328#S1.T1)presents a dialogue example from STUDIES, illustrating the dynamic emotional exchange between the student and the teacher\.

TABLE I:Example of dialogue lines from STUDIES\[[23](https://arxiv.org/html/2605.23328#bib.bib3)\]by a teacher and a male student used in our eJSL Dialog dataset\.To establish an objective evaluation benchmark, we applied and compared multiple baseline methods on this dataset\. These models span purely visual emotion recognition networks, a text based conversational emotion recognition model, and multimodal conversational emotion recognition architectures\. Our benchmark evaluations confirm that visual models lacking contextual awareness fail to capture dynamic emotional transitions\. Furthermore, the results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language\. These findings demonstrate the explicit need for context aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large scale pre training is a necessary next step for future research\.

The main contributions of this paper are as follows:

- •We formally define the ERC task for sign language video analysis to establish an objective evaluation benchmark for bidirectional interaction scenarios\.
- •We construct and release the eJSL Dialog dataset, providing sign language video samples with explicit multiple turn dialogue context and corresponding emotion annotations to address the structural limitations of isolated utterance datasets\.
- •We conduct systematic benchmarking on this dataset using models\. We demonstrate the limitations of isolated visual models and generic multimodal conversational emotion recognition models, confirming the explicit need for context aware visual extractors specific to sign language and indicating that expanding the scale of conversational datasets to support large scale pre\-training is a necessary next step\.

## IIBACKGROUND

### II\-ASign Language & Emotion

Sign language utilizes visual\-manual modality and non\-manual markers, such as facial expressions and upper body movements, to convey information\[[24](https://arxiv.org/html/2605.23328#bib.bib19),[3](https://arxiv.org/html/2605.23328#bib.bib20)\]\. A key challenge in sign language emotion recognition is that facial expressions serve dual purposes\[[6](https://arxiv.org/html/2605.23328#bib.bib16),[9](https://arxiv.org/html/2605.23328#bib.bib17)\]\. They encode both grammatical structures, such as questions, and affective states, such as surprise\. This overlap introduces ambiguity for automatic recognition systems because the same facial movement can indicate either a linguistic function or an emotional response\[[25](https://arxiv.org/html/2605.23328#bib.bib1)\]\. Indeed, the research on emotion in sign language recognition is quite scarce\. For example, while a survey\[[17](https://arxiv.org/html/2605.23328#bib.bib21)\]refers to over 200 relevant papers, no emotion recognition work is included in the survey\.

### II\-BVisual Emotion Recognition

Current methods address visual emotion recognition by integrating facial and hand gesture features into multimodal frameworks\[[12](https://arxiv.org/html/2605.23328#bib.bib25),[26](https://arxiv.org/html/2605.23328#bib.bib23),[4](https://arxiv.org/html/2605.23328#bib.bib24)\]\. However, these approaches primarily focus on classifying isolated video sequences\[[30](https://arxiv.org/html/2605.23328#bib.bib26),[31](https://arxiv.org/html/2605.23328#bib.bib27),[28](https://arxiv.org/html/2605.23328#bib.bib28)\]\. They typically rely on frame level spatial feature aggregation and short term temporal tracking\. Consequently, they do not possess the structural capacity to model the long term emotional dynamics and conversational context dependencies present in continuous and interactive communication\.

### II\-CEmotion Recognition in Conversation \(ERC\)

ERC involves identifying the affective states of participants across multiple turns in a dialogue\[[21](https://arxiv.org/html/2605.23328#bib.bib13),[19](https://arxiv.org/html/2605.23328#bib.bib30)\]\. In realistic interactions, emotion evolves based on personal history and interlocutor responses\[[13](https://arxiv.org/html/2605.23328#bib.bib31),[18](https://arxiv.org/html/2605.23328#bib.bib32)\]\. Current ERC research primarily focuses on spoken languages using textual and acoustic modalities\. These frameworks often include visual cues, they are optimized for non signing populations where facial expressions predominantly reflect affective states\[[12](https://arxiv.org/html/2605.23328#bib.bib25),[26](https://arxiv.org/html/2605.23328#bib.bib23),[4](https://arxiv.org/html/2605.23328#bib.bib24)\]\. As explained, in sign language, facial expressions serve dual roles by encoding grammatical structures and emotional content simultaneously\. This introduces a significant domain gap where generic models struggle to distinguish linguistic markers from emotional transitions\. Exploring ERC in sign language is necessary to model dynamic transitions in actual communication\.

### II\-DExisting Emotion Sign Language Datasets

Existing datasets for sign language emotion recognition primarily consist of isolated sentences or single expressions\. For example, the eJSL Solo dataset\[[10](https://arxiv.org/html/2605.23328#bib.bib4)\]contains individual video clips of signers performing specific sentences under instructed emotion categories in Japanese Sign Language \(JSL\)\. Similarly, the EmoSign dataset\[[5](https://arxiv.org/html/2605.23328#bib.bib18)\]focuses on capturing affective expressions within single American Sign Language video utterances\. However, because these recordings are detached from any conversational flow, they lack the multiple turn dialogue history required for conversational analysis\. The absence of sequential interaction means these datasets cannot represent how a signer adjusts their emotional expression based on the previous statements of a partner\. This limitation prevents researchers from developing models that understand context dependent emotional changes in sign language

To address the limitations of existing datasets and the absence of conversational context in current research, we construct the eJSL Dialog dataset\.[Table II](https://arxiv.org/html/2605.23328#S2.T2)compares EmoSign, eJSL Solo and eJSL Dialog\. eJSL Solo and Dialog are perfectly and nearly balanced respectively but both are not spontaneous \(acted\), while smaller EmoSign is spontaneous but not balanced\.

TABLE II:Comparison of existing sign language datasets with emotion annotations\.

## IIIMETHODS

### III\-AData Source and Script Selection

We derived the linguistic content for our dataset from the short dialogue subset of the STUDIES corpus\[[23](https://arxiv.org/html/2605.23328#bib.bib3)\]\. In the original corpus, the short dialogue scripts were collected through a microtask crowdsourcing mechanism\. Specifically, the data collection process involved launching 12 microtasks and recruiting 100 participants for each task\. After an initial screening, this process yielded a total of 720 short dialogue texts with 4 emotion types\. Every single utterance within these dialogues is equipped with an explicit emotion label\. The original text scripts underwent manual revision to remove typographical errors and inappropriate expressions, yielding a clean text baseline for sign language adaptation\.

From this collection, we selected 480 dialogues to construct our dataset\. The selection was aimed to balance instance numbers among 4 emotion types and binary genders\. We specifically utilized the structure consisting of four consecutive utterances for each dialogue\. This length provides sufficient conversational history to model emotional transitions and is suitable for a simplified evaluation of empathetic dialogue systems\.

![Refer to caption](https://arxiv.org/html/2605.23328v1/img/pipeline.png)Figure 1:Illustration of the sign language video recording process\. The actors perform the dialogue by sign language based on the provided script and emotion labels\. An RGB camera for each actor records interleaved performances to produce the final video dataset\.
### III\-BSign Language Video Recording

The overall recording pipeline is illustrated in[Figure 1](https://arxiv.org/html/2605.23328#S3.F1)\. We recorded the sign language videos using the selected scripts\. The dialogues follow a scenario set in a tutoring school, involving a female teacher interacting with a male or female student\. Following the original corpus, the male student was acted by a male signer and the female student was acted by a female signer\. On the other hand, the teacher role was acted by both of them because there were only one male and one female signers\. For each dialogue line, the actors were provided with the text and the corresponding utterance level emotion labels on computer screens simultaneously, and only the corresponding actor made a recording for the line interchangeably\. Based on these instructions, the actors expressed the semantic meaning of the text and the assigned emotional state in JSL\.

The participated actors are native JSL signers who work as vocational deaf actors\. They read and write fluently in Japanese, so all instructions and utterances were textually presented in Japanese\. The recording was conducted in 2025\. We obtained the explicit consent of the signers using standard consent forms of the institution, and the signers were paid for their participation\. Ethical reviewing was exempted based on the prescreening of the institution\.

The videos were recorded in a controlled indoor environment featuring a pure white background to eliminate visual distractions\. We used a single RGB camera positioned at a fixed height for the recording process of each signer\. The final videos are processed at a resolution of1440×10801440\\times 1080and a frame rate of 30 frames per second\. Each resulting clip is a complete JSL utterance conveying a single intended emotion\.

## IVeJSL Dialog STRUCTURE

### IV\-ADataset Configuration

The eJSL Dialog dataset comprises 1,920 video clips organized into 480 unique dialogues\. Spanning eight distinct scenes with exactly 60 dialogues per scene, the dataset is strictly structured so that each dialogue consists of four consecutive utterances\. The total duration of the recorded video data is approximately 4\.65 hours\. At the utterance level, the clips have an average duration of 8\.73 seconds, ranging from a minimum of 2\.94 seconds to a maximum of 43\.98 seconds\. The accompanying Japanese text transcripts contain a total of 134,416 characters, with an average of 70\.0 characters per utterance\. All video files are stored in mp4 format\. We adopted a standardized file naming convention that explicitly encodes the scene identifier \(SDnn\), dialogue number \(01–60\), utterance sequence \(01–04\), and the corresponding emotion label for each line\. Specific examples of this naming structure include SD09\-38\-03A\.mp4 and SD10\-26\-03S\.mp4\. Emotion labels consist of A: Angry, H: Happy, N: Neutral, and S: Sad\.

eJSL Dialog is exclusively designed as a benchmark dataset\. It is intended strictly for the evaluation and testing of emotion recognition models and is not expected to be used for training purposes\. This decision ensures that the dataset serves as a gold standard for assessing the generalization capabilities of models trained on other resources\.

### IV\-BEmotion Categories and Annotation

Each video sample in the dataset is explicitly annotated with one of four emotion categories: Neutral, Happy, Sad, and Angry\.[Table III](https://arxiv.org/html/2605.23328#S4.T3)details the quantitative distribution of these labels across the dataset\. The higher proportion of Neutral labels aligns with the designated conversational setting, where the teacher predominantly responds in an objective state\. The student utterances present a relatively balanced distribution across all four emotion categories\. Moreover,[Figure 2](https://arxiv.org/html/2605.23328#S4.F2)summarizes the emotional dynamics of eJSL Dialog dataset\. The count matrix \(Panel A\) reflects the empirical transition frequencies between emotion pairs, while the probability matrix \(Panel B\) normalizes each source row to highlight the conditional transition patterns independent of source class frequency\. This structural characteristic reflects empathetic dialogue dynamics\.

TABLE III:Distribution of utterance level emotion labels in the eJSL Dialog dataset\.![Refer to caption](https://arxiv.org/html/2605.23328v1/img/transition_heatmap.png)Figure 2:Emotion transition structure on eJSL Dialogue\. Panel \(A\) reports absolute transition counts between the previous and current utterance labels\. Panel \(B\) shows the corresponding row\-normalized transition probabilities\.

## VPROBLEM FORMULATION

We formalize the emotion recognition in sign conversation task\. Given a dialogue sequenceD=\{u1,u2,…,uN\}D=\\\{u\_\{1\},u\_\{2\},\\dots,u\_\{N\}\\\}consisting ofNNconsecutive utterances, each utteranceuiu\_\{i\}contains visual modality featuresViV\_\{i\}and textual modality featuresTiT\_\{i\}\. HereViV\_\{i\}conveys the original sign language information andTiT\_\{i\}is an utterance\-level spoken language translation or a word\-for\-word translation \(i\.e\., gloss\) ofViV\_\{i\}\. The objective of the model is to learn a mapping functionffto predict the emotion labelyt∈Ey\_\{t\}\\in Eof the target utteranceutu\_\{t\}, whereEErepresents the predefined set of emotion categories\. The complete prediction process relies on the current multimodal inpututu\_\{t\}and the historical context setC=\{u1,…,ut−1\}C=\\\{u\_\{1\},\\dots,u\_\{t\-1\}\\\}\. The task is formally expressed as:

yt=f\(Vt,Tt,C\)y\_\{t\}=f\(V\_\{t\},T\_\{t\},C\)Note that this formulation defines the comprehensive multimodal conversational task\. Specific evaluation models may operate on a constrained subset of these inputs\. For example, traditional visual recognition models omit text and context variables, reducing the mapping toyt=f\(Vt\)y\_\{t\}=f\(V\_\{t\}\)\. Text based conversational models omit visual features, operating asyt=f\(Tt,C\)y\_\{t\}=f\(T\_\{t\},C\)\. Additionally, isolated multimodal models ignore historical context information entirely, relying solely on the current utterance to predict the emotion asyt=f\(Vt,Tt\)y\_\{t\}=f\(V\_\{t\},T\_\{t\}\)\.

## VIBaseline Evaluation

### VI\-ABaselines

To establish an objective benchmark for the eJSL Dialog dataset and investigate the specific contributions of different modalities and historical context, we evaluate five emotion recognition models:

- •EmoAffectNet is a purely visual emotion recognition model that processes frame level facial features\[[22](https://arxiv.org/html/2605.23328#bib.bib2)\]\.
- •EANwH is an extension of EmoAffectNet that concatenates facial and hand skeletal features at the frame level and utilizes an LSTM network to capture temporal emotional dynamics across the video sequence\[[10](https://arxiv.org/html/2605.23328#bib.bib4)\]\.
- •TelME is a conversational emotion recognition architecture that utilizes cross modal knowledge distillation to transfer information from a textual teacher model to non verbal student networks, and combines these multimodal features through a shifting fusion approach to capture dialogue context\[[29](https://arxiv.org/html/2605.23328#bib.bib8)\]\.
- •EmoTrans is an emotional transition based model for ERC\. It concatenates recent utterances to capture emotional state changes and operates exclusively on textual input\[[15](https://arxiv.org/html/2605.23328#bib.bib6)\]\.
- •MMGCN is a multimodal fused graph convolutional network that leverages speaker information to model inter speaker and intra speaker dependencies across consecutive utterances\[[14](https://arxiv.org/html/2605.23328#bib.bib5)\]\.

### VI\-BExperiment Setup

TABLE IV:Distribution of emotion labels in the processed BOBSL pre training corpus\.TABLE V:Evaluation results on the eJSL Dialog dataset\. The evaluation utilizes the weighted F1 score for overall performance and reports individual F1 scores for the four emotion categories\.As EmoAffectNet and EANwH require fundamental sign language visual representations, we followed the pre\-training strategy of\[[10](https://arxiv.org/html/2605.23328#bib.bib4)\]and utilized the BBC Oxford British Sign Language dataset \(BOBSL\)\[[2](https://arxiv.org/html/2605.23328#bib.bib7)\]for pre\-training\. Although BOBSL represents a different linguistic system than our target JSL, both languages share fundamental physical mechanisms for conveying emotions through facial expressions and hand trajectories\. This shared foundation allows the models to learn generic visual features before fine\-tuning\. We applied the emotion recognition modelFine\-tuned DistilRoBERTa\-base for Emotion Classification111[michelleli99/emotion\_text\_classifier](https://huggingface.co/michelleli99/emotion_text_classifier)to the automatically aligned English subtitles to obtain emotion labels\. We selected 893 videos from the total collection to balance data quality and computational efficiency\. To align with the emotion categories of the eJSL Dialog dataset, we filtered the samples to retain four emotion labels: Angry, Neutral, Joy, and Sad\. We resampled the remaining data to achieve a balanced label distribution\.[Table IV](https://arxiv.org/html/2605.23328#S6.T4)details the quantitative distribution of the processed pre\-training corpus across the training and validation sets\.

To prepare the eJSL Dialog dataset for the baseline evaluation, we applied a consistent visual preprocessing pipeline\. We uniformly sampled each video clip at 2 frames per second to extract the frame sequence\. This sampling rate balances temporal resolution and computational efficiency, ensuring that key visual expressions are captured without processing redundant frames\. For facial feature extraction, we utilized the RetinaFace detector\[[7](https://arxiv.org/html/2605.23328#bib.bib10)\]to identify and crop the face region in each extracted frame\. The cropped facial images were resized to a spatial resolution of 224 by 224 pixels and converted to RGB format\. For the hand skeletal feature extraction required by the EANwH, we employed the Wholebody landmarker from the rtmlib library\[[16](https://arxiv.org/html/2605.23328#bib.bib9)\]\. We extracted 21 spatial keypoints for both the left and right hands\. The hand keypoint coordinates were mathematically normalized relative to the wrist position and concatenated into an 84 dimensional feature vector for each frame\. This standardized processing ensures that the visual inputs from the target dataset strictly match the architectural requirements of the evaluated baseline models\. To account for the unequal distribution of emotion categories, we utilize the weighted F1 score as the primary overall metric\. Furthermore, we report the individual F1 scores for the four emotion categories to analyze specific recognition capabilities\. For the text inputs required by the conversational baseline models, we translated the Japanese transcripts into English using Google Translate to match their pre\-training conditions\.

### VI\-CEvaluation Results

We present the evaluation results of the baseline models on the eJSL Dialog dataset in[Table V](https://arxiv.org/html/2605.23328#S6.T5)\. The visual baseline EmoAffectNet achieves a weighted F1 score of 29\.06\. Building upon this, the extended visual model EANwH yields a higher weighted F1 score of 33\.31\. This performance improvement occurs because EANwH incorporates hand skeletal features and utilizes an LSTM network to capture temporal information across the frame sequence\. This confirms that sign language visual features contain explicit affective information independent of the text\. However, recognizing the Sad category remains challenging using only visual cues, with EmoAffectNet recording an F1 score of 17\.03 and EANwH recording 13\.24\. The text only model EmoTrans achieves the highest overall performance with a weighted F1 score of 55\.40\. This performance demonstrates that textual semantics are generalizable and provide direct emotional cues across different domains\. Note that, however, the dialogue lines of STUDIES used to construct eJSL dialog are created so that they are highly correlated with the assigned emotion labels\.

In contrast, the multimodal conversational emotion recognition model exhibit performance degradation\. For these models, we excluded the audio modality to match our visual and text dataset\. Under this configuration, TelME achieves a weighted F1 score of 10\.46, and MMGCN records a weighted F1 score of 8\.50\. These multimodal models are trained on non signing datasets\. In sign language, visual features contain grammatical markers and manual signs that differ from the visual expressions of hearing individuals\. Consequently, during the multimodal fusion process, these cross domain visual features interfere with the text modality and degrade the overall performance\. This phenomenon demonstrates that generic multimodal conversational emotion recognition model cannot transfer to sign language tasks\. It highlights the necessity of developing visual extractors specific to emotion recognition in sign language conversation rather than relying on generic architectures\.

\(a\) Emotion Transfer with Context![Refer to caption](https://arxiv.org/html/2605.23328v1/img/0037.png)![Refer to caption](https://arxiv.org/html/2605.23328v1/img/0001.png)PreviousCurrentPrevious \(Teacher,Positive Feedback\):私は、かっこいいし清潔感あると思うけどなあ。
\(I think you look cool and give a clean impression\.\)Current \(Student,Joy;Visually Neutral\):そう！清潔感なんですよ！大事なのは！
\(Exactly\! Cleanliness is what matters\!\)Predictions:
TelME: Joy \(✓\) EmoTrans: Joy \(✓\)
EANwH:Neutral \(×\\times\)\(b\) Visual Emotion Recognition![Refer to caption](https://arxiv.org/html/2605.23328v1/img/0000.png)![Refer to caption](https://arxiv.org/html/2605.23328v1/img/0015.png)EarlyLateCurrent \(Student,Joy\):コンクールで、僕がかいた絵が入選したんです！
\(My drawing got selected in a competition\!\)Predictions:
TelME: Joy \(✓\) EANwH: Joy \(✓\)
EmoTrans:Neutral \(×\\times\)Figure 3:Case studies illustrating complementary roles of conversational context and visual cues\. \(a\) Context\-aware models correctly capture emotion transfer, while the non\-conversational model fails\. \(b\) Visual cues enable correct emotion recognition, while the text\-only model fails\.
### VI\-DCase Study

To better understand the contribution of conversational and visual information in ERC task for sign language, we present two representative cases\. Each case corresponds to a short sign clip, visualized by representative frames to approximate temporal dynamics\.

#### VI\-D1Case 1: Importance of Conversational Context

As shown in[Figure 3](https://arxiv.org/html/2605.23328#S6.F3)\(a\), the student expresses Joy in response to the positive feedback from the teacher\. Although the facial expressions and manual signs of the student appear relatively restrained, the preceding positive feedback from the teacher, combined with the affirmative and excited textual semantics of the student utterance, jointly enable the text based model EmoTrans and the multimodal model TelME to make the correct Joy prediction\. In contrast, the visual model EANwH, which relies exclusively on the current visual frames and does not model conversational context, fails to capture this transition and incorrectly outputs a Neutral prediction\. This demonstrates that contextual reasoning and textual semantics are critical for modeling emotion dynamics in dialogue based sign language scenarios\.

#### VI\-D2Case 2: Importance of Visual Cues

As shown in[Figure 3](https://arxiv.org/html/2605.23328#S6.F3)\(b\), the emotion can convey through the signer’s facial expressions and visual cues rather than textual content alone\. Although the text describes a positive event, this factual statement is misinterpreted by the language model as a neutral report because it lacks explicit emotional modifiers\. The multimodal model TelME and the visual model EANwH successfully capture these dynamic visual signals and correctly classify the emotion as Joy\. This highlights the importance of visual information in sign language emotion recognition, where facial expressions play a crucial role in confirming the true affective state of the speaker\.

Overall, these cases reveal three complementary factors in sign language emotion understanding: conversational context, textual semantics, and visual cues\. Conversational context provides the necessary history for tracking emotional shifts across turns\. Furthermore, while textual semantics help capture implicit emotion transitions when visual expressions are restrained, visually grounded signals remain essential for resolving ambiguity in factual textual expressions\. Integrating these three elements into a unified multimodal architecture offers a clear direction for further improvement\.

## VIIBROADER IMPACT, LIMITATIONS, AND FUTURE WORK

### VII\-ABroader Impact

In real sign language conversation, emotional states evolve dynamically across conversational turns\. By providing a benchmark for bidirectional interactions, this research supports the future development of empathetic dialogue systems that can track and respond to the emotional history of deaf users\. This capability is a requirement for creating virtual assistants that can engage in natural and context aware human computer interaction for deaf people\.

### VII\-BLimitations

We identify several limitations in the current study\. Regarding the data scope, the eJSL Dialog dataset features only two actors and was recorded in a controlled laboratory environment with a pure white background\. Consequently, models positively evaluated on this dataset might still struggle with generalization when applied to in the wild environments that contain diverse backgrounds, lighting conditions, and signer demographics\. Regarding the physical modalities, the dataset provides two dimensional RGB video clips but lacks depth information and fine grained facial or hand mesh data\. This absence restricts the ability of models to capture subtle spatial variations in sign language articulation\.

### VII\-CFuture Work

In future research, we plan to address the current limitations by expanding the dataset scale, collecting longer dialogue sessions, and incorporating spontaneous sign language conversations\. Additionally, we aim to integrate large language models to process and understand the multiple turn dialogue history\. This integration will enhance the contextual modeling of emotional transitions across these extended conversations and improve the overall performance of empathetic dialogue systems\.

## VIIICONCLUSIONS

In this work, we introduced the ERC task for sign language and proposed the eJSL Dialog dataset\. By providing multiple turn dialogue history, this dataset addresses the limitations of isolated sign language emotion recognition\. Our baseline evaluation on the proposed dataset exposes the domain gap in existing methods\. This finding confirms the necessity of developing visual extractors specific to sign language and constructing large scale sign language emotion conversational datasets\.

## ACKNOWLEDGMENT

This work was supported by Japan Science and Technology Agency \(JST\) as part of Adopting Sustainable Partnerships for Innovative Research Ecosystem \(ASPIRE\), Grant Number JPMJAP25B3\.

## References

- \[1\]\(2021\)Deep learning for sign language recognition: current techniques, benchmarks, and open issues\.IEEE Access9,pp\. 126917–126951\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1)\.
- \[2\]S\. Albanie, G\. Varol, L\. Momeni, H\. Bull, T\. Afouras, H\. Chowdhury, N\. Fox, B\. Woll, R\. Cooper, A\. McParland, and A\. Zisserman\(2021\)BOBSL: BBC\-Oxford British Sign Language Dataset\.arXiv\.Note:[https://www\.robots\.ox\.ac\.uk/~vgg/data/bobsl](https://www.robots.ox.ac.uk/~vgg/data/bobsl)Cited by:[§VI\-B](https://arxiv.org/html/2605.23328#S6.SS2.p1.1)\.
- \[3\]D\. Brentari\(2010\)Sign languages\.Cambridge University Press\.Cited by:[§II\-A](https://arxiv.org/html/2605.23328#S2.SS1.p1.1)\.
- \[4\]L\. Chen, M\. Li, M\. Wu, W\. Pedrycz, and K\. Hirota\(2024\)Coupled multimodal emotional feature analysis based on broad\-deep fusion networks in human–robot interaction\.IEEE Transactions on Neural Networks and Learning Systems35\(7\),pp\. 9663–9673\.External Links:[Document](https://dx.doi.org/10.1109/TNNLS.2023.3236320)Cited by:[§II\-B](https://arxiv.org/html/2605.23328#S2.SS2.p1.1),[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[5\]P\. Chua, C\. M\. Fang, T\. Ohkawa, R\. Kushalnagar, S\. Nanayakkara, and P\. Maes\(2025\)Emosign: a multimodal dataset for understanding emotions in american sign language\.arXiv preprint arXiv:2505\.17090\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p2.1),[§II\-D](https://arxiv.org/html/2605.23328#S2.SS4.p1.1)\.
- \[6\]D\. P\. Corina, U\. Bellugi, and J\. Reilly\(1999\)Neuropsychological studies of linguistic and affective facial expressions in deaf signers\.Language and Speech42\(2\-3\),pp\. 307–331\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1),[§II\-A](https://arxiv.org/html/2605.23328#S2.SS1.p1.1)\.
- \[7\]J\. Deng, J\. Guo, E\. Ververas, I\. Kotsia, and S\. Zafeiriou\(2020\)Retinaface: single\-shot multi\-level face localisation in the wild\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 5203–5212\.Cited by:[§VI\-B](https://arxiv.org/html/2605.23328#S6.SS2.p2.1)\.
- \[8\]E\. A\. Elliott and A\. M\. Jacobs\(2013\)Facial expressions, emotions, and sign languages\.Frontiers in psychology4,pp\. 39013\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1)\.
- \[9\]F\. A\. Freitas, S\. M\. Peres, C\. A\. Lima, and F\. V\. Barbosa\(2017\)Grammatical facial expression recognition in sign language discourse: a study at the syntax level\.Information Systems Frontiers19\(6\),pp\. 1243–1259\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1),[§II\-A](https://arxiv.org/html/2605.23328#S2.SS1.p1.1)\.
- \[10\]K\. Funakoshi and Y\. Zhu\(2025\)Emotion recognition in signers\.arXiv preprint arXiv:2512\.15376\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p2.1),[§II\-D](https://arxiv.org/html/2605.23328#S2.SS4.p1.1),[2nd item](https://arxiv.org/html/2605.23328#S6.I1.i2.p1.1),[§VI\-B](https://arxiv.org/html/2605.23328#S6.SS2.p1.1)\.
- \[11\]D\. Ghosal, N\. Majumder, A\. Gelbukh, R\. Mihalcea, and S\. Poria\(2020\)Cosmic: commonsense knowledge for emotion identification in conversations\.InFindings of the association for computational linguistics: EMNLP 2020,pp\. 2470–2481\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p3.1)\.
- \[12\]Y\. Gu, X\. Zhang, H\. Yan, J\. Huang, Z\. Liu, M\. Dong, and F\. Ren\(2023\)Wife: wifi and vision based unobtrusive emotion recognition via gesture and facial expression\.IEEE Transactions on Affective Computing14\(4\),pp\. 2567–2581\.Cited by:[§II\-B](https://arxiv.org/html/2605.23328#S2.SS2.p1.1),[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[13\]D\. Hazarika, S\. Poria, R\. Mihalcea, E\. Cambria, and R\. Zimmermann\(2018\)Icon: interactive conversational memory network for multimodal emotion detection\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2594–2604\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p3.1),[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[14\]J\. Hu, Y\. Liu, J\. Zhao, and Q\. Jin\(2021\)MMGCN: multimodal fusion via deep graph convolution network for emotion recognition in conversation\.InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing \(Volume 1: Long Papers\),pp\. 5666–5675\.Cited by:[5th item](https://arxiv.org/html/2605.23328#S6.I1.i5.p1.1)\.
- \[15\]Z\. Jian, A\. Wang, J\. Su, J\. Yao, M\. Wang, and Q\. Wu\(2024\)Emotrans: emotional transition\-based model for emotion recognition in conversation\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),pp\. 5723–5733\.Cited by:[4th item](https://arxiv.org/html/2605.23328#S6.I1.i4.p1.1)\.
- \[16\]T\. Jiang, P\. Lu, L\. Zhang, N\. Ma, R\. Han, C\. Lyu, Y\. Li, and K\. Chen\(2023\)RTMPose: real\-time multi\-person pose estimation based on mmpose\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2303.07399),[Link](https://arxiv.org/abs/2303.07399)Cited by:[§VI\-B](https://arxiv.org/html/2605.23328#S6.SS2.p2.1)\.
- \[17\]O\. Koller\(2020\)Quantitative survey of the state of the art in sign language recognition\.arXiv preprint arXiv:2008\.09918\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1),[§II\-A](https://arxiv.org/html/2605.23328#S2.SS1.p1.1)\.
- \[18\]N\. Majumder, S\. Poria, D\. Hazarika, R\. Mihalcea, A\. Gelbukh, and E\. Cambria\(2019\)Dialoguernn: an attentive rnn for emotion detection in conversations\.InProceedings of the AAAI conference on artificial intelligence,Vol\.33,pp\. 6818–6825\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p3.1),[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[19\]P\. Pereira, H\. Moniz, and J\. P\. Carvalho\(2024\)Deep emotion recognition in textual conversations: a survey\.Artificial Intelligence Review58\(1\),pp\. 10\.Cited by:[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[20\]R\. W\. Picard\(2000\)Affective computing\.MIT press\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1)\.
- \[21\]S\. Poria, E\. Cambria, R\. Bajpai, and A\. Hussain\(2017\)A review of affective computing: from unimodal analysis to multimodal fusion\.Information fusion37,pp\. 98–125\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1),[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[22\]E\. Ryumina, D\. Dresvyanskiy, and A\. Karpov\(2022\)In search of a robust facial expressions recognition model: a large\-scale visual cross\-corpus study\.Neurocomputing\.External Links:[Document](https://dx.doi.org/10.1016/j.neucom.2022.10.013),[Link](https://www.sciencedirect.com/science/article/pii/S0925231222012656)Cited by:[1st item](https://arxiv.org/html/2605.23328#S6.I1.i1.p1.1)\.
- \[23\]Y\. Saito, Y\. Nishimura, S\. Takamichi, K\. Tachibana, and H\. Saruwatari\(2022\)STUDIES: corpus of japanese empathetic dialogue speech towards friendly voice agent\.arXiv preprint arXiv:2203\.14757\.Cited by:[TABLE I](https://arxiv.org/html/2605.23328#S1.T1),[§I](https://arxiv.org/html/2605.23328#S1.p4.1),[§III\-A](https://arxiv.org/html/2605.23328#S3.SS1.p1.1)\.
- \[24\]W\. Sandler and D\. C\. Lillo\-Martin\(2006\)Sign language and linguistic universals\.Cambridge University Press\.Cited by:[§II\-A](https://arxiv.org/html/2605.23328#S2.SS1.p1.1)\.
- \[25\]E\. P\. d\. Silva, P\. D\. P\. Costa, K\. M\. O\. Kumada, J\. M\. De Martino, and G\. A\. Florentino\(2020\)Recognition of affective and grammatical facial expressions: a study for brazilian sign language\.InECCV 2020 Workshops,pp\. 218–236\.External Links:[Document](https://dx.doi.org/10.1007/978-3-030-66096-3%5F16),[Link](https://doi.org/10.1007/978-3-030-66096-3_16)Cited by:[§II\-A](https://arxiv.org/html/2605.23328#S2.SS1.p1.1)\.
- \[26\]J\. Wei, G\. Hu, X\. Yang, A\. T\. Luu, and Y\. Dong\(2024\)Learning facial expression and body gesture visual information for video emotion recognition\.Expert Systems with Applications237,pp\. 121419\.Cited by:[§II\-B](https://arxiv.org/html/2605.23328#S2.SS2.p1.1),[§II\-C](https://arxiv.org/html/2605.23328#S2.SS3.p1.1)\.
- \[27\]Y\. Wu, Q\. Mi, and T\. Gao\(2025\)A comprehensive review of multimodal emotion recognition: techniques, challenges, and future directions\.Biomimetics10\(7\),pp\. 418\.Cited by:[§I](https://arxiv.org/html/2605.23328#S1.p1.1)\.
- \[28\]J\. Xue, J\. Wang, X\. Liu, Q\. Zhang, and X\. Wu\(2024\)Affective video content analysis: decade review and new perspectives\.Big Data Mining and Analytics8\(1\),pp\. 118–144\.Cited by:[§II\-B](https://arxiv.org/html/2605.23328#S2.SS2.p1.1)\.
- \[29\]T\. Yun, H\. Lim, J\. Lee, and M\. Song\(2024\-06\)TelME: teacher\-leading multimodal fusion network for emotion recognition in conversation\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),K\. Duh, H\. Gomez, and S\. Bethard \(Eds\.\),Mexico City, Mexico,pp\. 82–95\.External Links:[Link](https://aclanthology.org/2024.naacl-long.5/),[Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.5)Cited by:[3rd item](https://arxiv.org/html/2605.23328#S6.I1.i3.p1.1)\.
- \[30\]Z\. Zhang, L\. Wang, and J\. Yang\(2023\)Weakly supervised video emotion detection and prediction via cross\-modal temporal erasing network\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 18888–18897\.Cited by:[§II\-B](https://arxiv.org/html/2605.23328#S2.SS2.p1.1)\.
- \[31\]Z\. Zhang, P\. Zhao, E\. Park, and J\. Yang\(2024\)Mart: masked affective representation learning via masked temporal distribution distillation\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 12830–12840\.Cited by:[§II\-B](https://arxiv.org/html/2605.23328#S2.SS2.p1.1)\.
Emotion Recognition in Sign Language Conversation

Similar Articles

EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding

SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space

Sentiment Analysis of German Sign Language Fairy Tales

Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition

Evaluating multimodal emotion recognition in proactive conversational agents: A user study

Submit Feedback

Similar Articles

EmoS: A High-Fidelity Multimodal Benchmark for Fine-grained Streaming Emotional Understanding
SignX: Continuous Sign Recognition in Compact Pose-Rich Latent Space
Sentiment Analysis of German Sign Language Fairy Tales
Leveraging Self-Paced Curriculum Learning for Enhanced Modality Balance in Multimodal Conversational Emotion Recognition
Evaluating multimodal emotion recognition in proactive conversational agents: A user study