Generative Learning as a Tool to Improve Perception of Emotional Body Motion Expressions
Summary
This paper investigates using a Transformer-based generative model to learn emotional body motions from motion-capture data of Japanese actors, generating motions conditioned on discrete emotion labels. Evaluations show the generated motions improve emotion recognition when used for data augmentation and enable smooth transitions between emotion intensities.
# Generative Learning as a Tool to Improve Perception of Emotional Body Motion Expressions
Source: [https://arxiv.org/html/2606.28769](https://arxiv.org/html/2606.28769)
Miao Cheng, Victor Schneider, Yoshifumi Kitamura, Xin Wei, Hideaki Uchiyama, Monica Perusquia\-Hernandez
###### Abstract
Emotional body motion expressions are an essential element of non\-verbal communication\. Effectively conveying these expressions through technology is of utmost importance, for example, with virtual reality avatars and in social robotics\. Recent advances in generative models have opened new opportunities for advancing research on emotional body motion learning\. However, generating accurate emotional expression representations is challenging, given the subtlety of emotional cues, individual variability, and cultural differences\. We investigate whether a generative model can implicitly learn emotional body motions directly from culturally grounded motion\-capture data, without explicit emotion\-motion guidance\. Using a dataset of emotional performances by 49 Japanese actors, we trained a Transformer\-based generative model to generate expressive motions conditioned on 13 discrete emotion labels\. We evaluate the generated motions from two perspectives: \(1\) an LSTM\-based classifier to assess recognizability by machine observers, achieving a recognition accuracy of 22\.80%, and \(2\) a human perception study with Japanese raters to assess alignment with human affective interpretations, yielding a recognition accuracy of 24\.91%\. Beyond these, we evaluate the utility of generative modeling for three practical tasks: augmenting emotion recognition models, extracting representative emotion\-specific motion patterns, and synthesizing smooth transitions between emotion intensities\. Our findings highlight the potential of implicit, data\-driven generative modeling to enhance affective computing applications and our understanding of emotion expressions\.
## I Introduction
Figure 1: Overview of our study on emotion\-conditioned motion generation and its implications for emotion recognition and interpolation\. Left: The generative model produces whole\-body emotional motions from discrete emotion labels, capturing expressive cues such as the forward\-leaning posture for gratitude\. Middle: Generated motions, supervised by an emotion recognition model trained on real data, are used to augment the training set, resulting in improved recognition accuracy over using real data alone\. Right: The latent space supports smooth interpolation between intensity levels within the same emotion, enabling fine\-grained control over expressive variations\.

Emotion expressions have been investigated primarily for facial and vocal signals, while body motion affective expressions remain under\-investigated \[[12](https://arxiv.org/html/2606.28769#bib.bib55),[38](https://arxiv.org/html/2606.28769#bib.bib57),[23](https://arxiv.org/html/2606.28769#bib.bib54)\]\. Body motion is vital in conveying affect, especially in non\-verbal or physically distant communication\. This is increasingly important in applications such as virtual reality \(VR\), social robotics, and telepresence systems, where full\-body motion is a critical channel for interaction\. Generating body motions that express emotions is crucial for creating engaging and naturalistic user experiences and may encourage richer nonverbal behavior than traditional face\-focused video platforms \[[8](https://arxiv.org/html/2606.28769#bib.bib56),[10](https://arxiv.org/html/2606.28769#bib.bib50)\]\. Beyond animation and interactive systems, generating emotional body motions also provides a compelling opportunity to study affective motions\. Generative models can uncover core expressive patterns and improve embodied emotion expression recognition by providing data augmentation \[[29](https://arxiv.org/html/2606.28769#bib.bib53)\]\.
Motion generation has become increasingly popular, achieving impressive results in tasks such as action synthesis and text\-to\-motion generation\[[34](https://arxiv.org/html/2606.28769#bib.bib10),[33](https://arxiv.org/html/2606.28769#bib.bib19),[20](https://arxiv.org/html/2606.28769#bib.bib4)\]\. However, emotional body motion generation, which aims to synthesize whole\-body motions to convey emotions, remains underexplored\. Emotion expressions tend to be context\-dependent\[[3](https://arxiv.org/html/2606.28769#bib.bib58)\], and highly variable among individuals due to personal expressive style\[[14](https://arxiv.org/html/2606.28769#bib.bib59)\]\. Furthermore, physical conditions, cultural norms, and individual expectations complicate the production and perception of emotional motion\[[21](https://arxiv.org/html/2606.28769#bib.bib63)\]\. This makes the mapping from emotion to body motion inherently challenging\.
Previous studies addressing emotional body motion generation typically utilize explicit emotional cues and structured supervision to guide the generative process, relying heavily on manually defined emotion\-motion relationships\. Prior work mapped particular emotional states to specific limb movements, such as associating sadness with a lowered head posture and a slightly bent torso, and subsequently injecting these handcrafted associations into human limb generations\[[43](https://arxiv.org/html/2606.28769#bib.bib5)\]\. While effective in controlled scenarios, such approaches do not capture nuanced expressions of emotion\.
We explored the potential of implicitly generating emotionally expressive human motions directly from culturally grounded motion data\. Rather than relying on predefined mappings or handcrafted emotion\-to\-motion rules, we investigate whether a generative model trained solely on expressive performances can learn meaningful motion patterns associated with emotion categories\. To this end, we leverage a rich motion dataset of acted emotional performances by 49 Japanese actors, explicitly capturing both individual variability and Japanese\-specific cues\. We adopt a Transformer\-based variational autoencoder \(VAE\) to learn a latent representation of emotion\-conditioned motions\. During generation, latent vectors are sampled from each emotion category’s learned distribution and decoded into pose sequences represented by a human body model\. We evaluate the generated motions from two complementary perspectives: \(1\) an LSTM\-based emotion classifier trained to predict emotion labels, to quantify the model’s ability to generate emotional patterns; and \(2\) a human perception study with Japanese raters, to examine whether the synthesized motions align with how lay observers interpret emotional expressions\. Furthermore, we explore the practical utility of the generative model in three tasks \(Fig\.[1](https://arxiv.org/html/2606.28769#S1.F1)\):

1. Data augmentation for emotion recognition\. Recognizing emotions from human body motion presents challenges due to the inherent variability in expressive styles, limited labeled data, and the subtlety of emotion\-specific cues compared to more explicit control signals like action labels\. Generative models offer a promising solution by synthesizing anonymous, emotionally expressive motions that can be used to both evaluate recognition systems and augment training datasets\.
2. Extraction of representative motion patterns\. While emotional expressions vary across individuals, we hypothesize that the generative model encodes shared, prototypical features within each emotion category\. By decoding central latent vectors, we aim to uncover common motion tendencies that may support behavioral analysis or synthesis\.
3. Interpolation across emotional intensities\. Emotion expressions often vary in intensity\. Leveraging the continuity of the learned latent space, we test whether smooth transitions can be generated between different\-intensity emotional expressions, offering a way to model graded affective behavior\.

Through these analyses, our study clarifies both the opportunities and limitations of implicit, data\-driven generative modeling as a tool for advancing affective science research on emotional body motion understanding\.
## II Related Works
### II\-A Motion generation
Recent advances in motion generation have led to diverse approaches for synthesizing human motions from structured inputs such as action categories and text descriptions\. Early work used recurrent networks for pose prediction \[[13](https://arxiv.org/html/2606.28769#bib.bib25)\], while others improved temporal coherence with hierarchical RNNs \[[28](https://arxiv.org/html/2606.28769#bib.bib26)\]\. Later, a temporal VAE was used to generate 3D human motions from action labels \[[17](https://arxiv.org/html/2606.28769#bib.bib16)\]\. Building on this, an action\-conditioned transformer VAE \(ACTOR\) was proposed to sample from a sequence\-level latent space conditioned on action labels and duration \[[31](https://arxiv.org/html/2606.28769#bib.bib1)\], with benefits for motion denoising and action recognition\. Taking another approach, Tevet et al\. \[[39](https://arxiv.org/html/2606.28769#bib.bib3)\] employ a classifier\-free transformer\-based diffusion model that operates in joint space\. Later, performing diffusion in the motion latent space was proposed to reduce computational overhead while preserving generation quality, resulting in more efficient conditional generation \[[5](https://arxiv.org/html/2606.28769#bib.bib17)\]\. In parallel to action\-category inputs, efforts in text\-to\-motion synthesis map language embeddings to motion \[[1](https://arxiv.org/html/2606.28769#bib.bib29),[32](https://arxiv.org/html/2606.28769#bib.bib18),[44](https://arxiv.org/html/2606.28769#bib.bib28),[16](https://arxiv.org/html/2606.28769#bib.bib2),[20](https://arxiv.org/html/2606.28769#bib.bib4)\], with a focus on action semantics\.
We adopted ACTOR as our backbone generation model, which has been widely used across various motion generation tasks\[[32](https://arxiv.org/html/2606.28769#bib.bib18),[33](https://arxiv.org/html/2606.28769#bib.bib19)\], and offers a strong balance between temporal coherence, computational efficiency, and controllability in sequence\-level synthesis\[[34](https://arxiv.org/html/2606.28769#bib.bib10)\]\.
### II\-B Emotional motion generation
Emotion expression has typically been studied through facial \[[24](https://arxiv.org/html/2606.28769#bib.bib44)\] and vocal \[[2](https://arxiv.org/html/2606.28769#bib.bib45)\] modalities, leading to early success in generating emotion\-aware outputs across multiple modalities \[[42](https://arxiv.org/html/2606.28769#bib.bib46),[4](https://arxiv.org/html/2606.28769#bib.bib21),[25](https://arxiv.org/html/2606.28769#bib.bib22),[15](https://arxiv.org/html/2606.28769#bib.bib47),[36](https://arxiv.org/html/2606.28769#bib.bib23),[37](https://arxiv.org/html/2606.28769#bib.bib24)\]\. In contrast, the generation of whole\-body emotional motion remains relatively underexplored, despite evidence that emotions can be conveyed through body movement alone \[[30](https://arxiv.org/html/2606.28769#bib.bib40),[41](https://arxiv.org/html/2606.28769#bib.bib41)\]\. Early rule\-based approaches map body features to emotional states \[[9](https://arxiv.org/html/2606.28769#bib.bib31)\], but they limit expressive richness by failing to capture the wide variability in how emotions are physically expressed across individuals, contexts, and cultures\. Recent work has attempted to address the data representation heterogeneity and scarcity by leveraging large language models \(LLMs\)\. Emotion\-rich textual prompts \(e\.g\., “a man, filled with sadness, walks forward”\) were used with a fine\-tuned LLM to infer how emotional states influence specific body parts \[[43](https://arxiv.org/html/2606.28769#bib.bib5)\]\. Importantly, emotion\-to\-limb mappings are manually defined in advance and used as supervision signals during LLM training\. These mappings include terms such as “head: lowered, looking downward” or “torso: slightly bent\.” However, this mix of data\-driven and rule\-based methods may lack flexibility, notably regarding how cultural background and individual differences shape emotional perception and execution \[[11](https://arxiv.org/html/2606.28769#bib.bib49),[22](https://arxiv.org/html/2606.28769#bib.bib32)\]\. Therefore, we explored whether a generative model can learn to produce expressive whole\-body emotional motion directly from performance data, without relying on manually predefined emotion\-to\-motion guidance\.
## III Methods
Figure 2: Overview of the ACTOR\-based emotional motion generation model \[[31](https://arxiv.org/html/2606.28769#bib.bib1)\]\. The model consists of a Transformer\-based encoder and decoder trained as a conditional variational autoencoder\. The encoder receives a sequence of linearly projected motion inputs $H_{1},\ldots,H_{T}$, concatenated with emotion\-specific tokens $\mu_{e}$ and $\Sigma_{e}$ corresponding to the given label $e$, and outputs the latent distribution parameters $\hat{\mu}$ and $\hat{\Sigma}$\. A latent vector $z$ is sampled and passed to the decoder, along with the emotion embedding $b_{e}$ and positional encodings, to reconstruct the motion sequence $\hat{H}_{1},\ldots,\hat{H}_{T}$\. During inference, generation is performed by randomly sampling $z\sim\mathcal{N}(\bar{\mu}_{e},\bar{\Sigma}_{e})$ from the learned distribution for emotion $e$\.

### III\-A Dataset
We use a subset of the Diverse Intercultural E\-Motion Database of Asian Performers \(DIEM\-A\) \[[7](https://arxiv.org/html/2606.28769#bib.bib62)\], consisting of motion data from 49 Japanese professional performers \(27 female, 22 male; mean age = 38\.7 years; mean performing experience = 19\.6 years\)\. Each performer was instructed to prepare performances for 13 emotion categories: joy, sadness, anger, surprise, fear, disgust, contempt, gratitude, guilt, jealousy, shame, pride, and neutral\. For the 12 non\-neutral emotions, performers created three emotion\-eliciting scenarios per category, performed at three different intensity levels \(low, medium, and high\), while the neutral emotion required three scenarios without specified intensities\. This protocol resulted in 111 motion sequences per performer\. No constraints were imposed on how performers expressed their emotions to ensure diversity in expression styles and motion\.
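The per-performer count follows directly from the protocol. The short check below is only a worked version of that arithmetic; the constant names are illustrative, and all numbers are taken from the text.

```python
# Recording protocol as described in the text (all counts come from the paper).
NON_NEUTRAL_EMOTIONS = 12
SCENARIOS_PER_EMOTION = 3
INTENSITY_LEVELS = 3      # low, medium, high
NEUTRAL_SCENARIOS = 3     # neutral is performed without intensity levels
PERFORMERS = 49

sequences_per_performer = (
    NON_NEUTRAL_EMOTIONS * SCENARIOS_PER_EMOTION * INTENSITY_LEVELS
    + NEUTRAL_SCENARIOS
)
total_sequences = PERFORMERS * sequences_per_performer

print(sequences_per_performer)  # 111
print(total_sequences)          # 5439
```

The total of 5,439 matches the number of sequences reported for the final dataset.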
The performances, originally captured using motion tracking, were transformed into the SMPL body model \[[26](https://arxiv.org/html/2606.28769#bib.bib9)\] to represent human motions\. SMPL provides a detailed and expressive mesh\-based, surface\-level representation of the full human body using two sets of parameters: \(1\) pose parameters $\boldsymbol{\theta}\in\mathbb{R}^{24\times 3}$, which define the relative rotations of 23 body joints and a global root orientation in axis\-angle format, and \(2\) shape parameters $\boldsymbol{\beta}\in\mathbb{R}^{10}$, which account for person\-specific body shape variations\. The model uses a linear blend skinning function to produce a mesh\. To produce realistic motion, the deformed mesh is then posed using joint locations\.
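To make the parameter layout concrete, the sketch below builds one pose array of shape 24 x 3 and converts each axis-angle vector to a rotation matrix with Rodrigues' formula. It is a minimal NumPy illustration, not the SMPL implementation: the function name and the example root rotation are assumptions, and no mesh is produced.

```python
import numpy as np

def axis_angle_to_matrix(rotvec):
    """Convert one axis-angle vector (3,) to a 3x3 rotation matrix (Rodrigues' formula)."""
    rotvec = np.asarray(rotvec, dtype=float)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-8:
        return np.eye(3)
    x, y, z = rotvec / angle
    # Skew-symmetric cross-product matrix of the unit rotation axis.
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# One pose: row 0 is the global root orientation, rows 1..23 are the
# relative joint rotations. Shape uses 10 coefficients.
theta = np.zeros((24, 3))
theta[0] = [0.0, 0.0, np.pi / 2]   # illustrative: rotate the root 90 degrees about z
beta = np.zeros(10)                # all zeros corresponds to the mean body shape

rotations = np.stack([axis_angle_to_matrix(v) for v in theta])  # (24, 3, 3)
```

A motion sequence of length T then stacks T such pose arrays, together with one root translation per frame.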
To prepare the data for analysis and generation, we first cleaned the raw recordings, downsampled from 120 Hz to 20 Hz, and exported them into C3D format\. These cleaned sequences were then converted into SMPL representations using MoSh\+\+\[[27](https://arxiv.org/html/2606.28769#bib.bib12)\]\. After conversion, we applied a T\-pose removal classifier trained on the AMASS dataset\[[27](https://arxiv.org/html/2606.28769#bib.bib12),[35](https://arxiv.org/html/2606.28769#bib.bib13)\]to detect and remove calibration T\-poses and surrounding neutral transitions automatically\. All results were manually verified, and only segments corresponding to the main expressive content of each scenario were retained\. As a result, the final dataset used in this study contains 5,439 motion sequences, totaling approximately 13\.6 hours of data\. Sequence durations range from 0\.9 to 59\.9 seconds \(mean = 9\.0 s, SD = 6\.0 s\)\.
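The text does not say how the recordings were downsampled. As one possible reading, the sketch below keeps every sixth frame (120 Hz / 20 Hz = 6) without any low-pass filtering; the function name, the marker count of 40, and the 9-second example clip are assumptions made for illustration.

```python
import numpy as np

SOURCE_HZ = 120
TARGET_HZ = 20

def decimate_frames(frames, source_hz=SOURCE_HZ, target_hz=TARGET_HZ):
    """Keep every (source_hz // target_hz)-th frame; no anti-alias filtering."""
    if source_hz % target_hz != 0:
        raise ValueError("source rate must be an integer multiple of target rate")
    step = source_hz // target_hz
    return frames[::step]

# Hypothetical 9-second recording laid out as (frames, markers, xyz).
raw = np.zeros((9 * SOURCE_HZ, 40, 3))
downsampled = decimate_frames(raw)
print(downsampled.shape[0])  # 180
```

At 20 Hz, the reported duration range of 0.9 to 59.9 seconds corresponds to roughly 18 to 1,198 frames per sequence.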
### III\-B Emotion\-conditioned motion generation
Problem definition\. Our goal is to generate emotionally expressive human body motion, conditioned on a target emotion label\. We focus solely on pose generation, assuming a fixed average body shape\. Formally, given a discrete emotion label $e\in\mathcal{E}$ from a predefined set of 13 categories, we generate a sequence of motion parameters $H_{1:T}=(\boldsymbol{\theta}_{t},\boldsymbol{p}_{t})_{t=1}^{T}$, where $\boldsymbol{\theta}_{t}$ and $\boldsymbol{p}_{t}$ denote the pose parameters and root joint translations at time step $t$, respectively\. The sequence length $T$ is randomly sampled from an emotion\-specific duration distribution\. Since emotional expression varies between individuals and is culturally influenced, we only used data from Japanese performers\.
Generation model\. As shown in Fig\.[2](https://arxiv.org/html/2606.28769#S3.F2), we adopt ACTOR \[[31](https://arxiv.org/html/2606.28769#bib.bib1)\], a Transformer\-based conditional variational autoencoder \(VAE\), to generate human motion sequences conditioned on discrete emotion labels\. Unlike autoregressive methods that generate poses frame\-by\-frame \[[19](https://arxiv.org/html/2606.28769#bib.bib52)\], ACTOR samples a latent vector representing the entire motion sequence from a latent distribution and generates the full motion in a single forward pass\. This design improves computational efficiency and reduces error accumulation over time\. The architecture consists of Transformer\-based encoder and decoder modules\. The encoder takes sequences of parameters $H_{1:T}$ along with their corresponding emotion labels as input, encoding them into latent distribution parameters $(\hat{\boldsymbol{\mu}},\hat{\boldsymbol{\Sigma}})$\. Emotion\-specific learnable tokens $\boldsymbol{\mu}_{e},\boldsymbol{\Sigma}_{e}$ explicitly condition the encoder, capturing the emotion variability in the latent space\. The decoder reconstructs motion sequences from a latent vector $\hat{\boldsymbol{z}}$ sampled from this latent distribution\. Specifically, the decoder receives the latent vector, the emotion\-specific embedding $\boldsymbol{b}_{e}$, and sinusoidal positional encodings representing the desired sequence duration as inputs\. The output is a reconstructed sequence $\hat{H}_{1:T}$, which can be directly converted into SMPL body meshes for visualization and further evaluation\.
During generation, given an emotion label $e$, a target sequence duration $T$ is sampled from the emotion\-specific duration distribution\. The model then selects the corresponding emotion embedding $\boldsymbol{b}_{e}$ to condition the latent vector $\boldsymbol{z}\sim\mathcal{N}(\bar{\boldsymbol{\mu}}_{e},\bar{\boldsymbol{\Sigma}}_{e})$, where $(\bar{\boldsymbol{\mu}}_{e},\bar{\boldsymbol{\Sigma}}_{e})$ represent the average latent distribution parameters computed from all training data labeled with emotion $e$\. This latent vector, combined with the emotion embedding $\boldsymbol{b}_{e}$ and positional encodings, is fed to the decoder to generate the final motion sequence $\hat{H}_{1:T}$\.
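The sampling step can be sketched as follows. This is a minimal NumPy illustration under several assumptions that the text does not confirm: the latent distribution is treated as a diagonal Gaussian parameterized by standard deviations, the latent size of 256 is made up, the duration distribution is taken to be the empirical set of training durations, and the encoder outputs are random placeholders. The decoder itself is not implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 256  # assumed; the text does not state the latent size

def class_latent_stats(mu_hat, sigma_hat):
    """Average per-sequence encoder outputs into (mu_bar_e, sigma_bar_e)."""
    return mu_hat.mean(axis=0), sigma_hat.mean(axis=0)

def sample_latent(mu_bar, sigma_bar, rng):
    """Draw z ~ N(mu_bar, diag(sigma_bar**2)) via z = mu + sigma * eps."""
    eps = rng.standard_normal(mu_bar.shape)
    return mu_bar + sigma_bar * eps

def sample_duration(train_durations, rng):
    """Sample T from the empirical durations (in frames) of one emotion."""
    return int(rng.choice(train_durations))

# Placeholder encoder outputs for 100 training sequences of one emotion.
mu_hat = rng.normal(size=(100, LATENT_DIM))
sigma_hat = np.abs(rng.normal(size=(100, LATENT_DIM)))
durations = rng.integers(18, 1198, size=100)  # about 0.9 s to 59.9 s at 20 Hz

mu_bar, sigma_bar = class_latent_stats(mu_hat, sigma_hat)
z = sample_latent(mu_bar, sigma_bar, rng)
T = sample_duration(durations, rng)
# z, the emotion embedding b_e, and T positional encodings would be
# passed to the decoder to produce the motion sequence.
```

Decoding the class mean `mu_bar` directly, instead of a random sample, corresponds to the "mean latent vector" motions used later in the human study.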
## IV Evaluation and Discussion
TABLE I: Quantitative evaluation of model variants with and without vertex loss $\mathcal{L}_{v}$\.

We evaluated the generative model from technical and perceptual perspectives to assess its capability in capturing and synthesizing emotionally expressive body motions\. Our analysis covers reconstruction accuracy and generation diversity \(Sec\.[IV\-A](https://arxiv.org/html/2606.28769#S4.SS1)\), emotion recognizability by machine classifiers and human observers \(Sec\.[IV\-B](https://arxiv.org/html/2606.28769#S4.SS2)\), and utility in tasks such as emotion recognition \(Sec\.[IV\-B](https://arxiv.org/html/2606.28769#S4.SS2)\), representative motion extraction \(Sec\.[IV\-C](https://arxiv.org/html/2606.28769#S4.SS3)\), and intensity interpolation \(Sec\.[IV\-D](https://arxiv.org/html/2606.28769#S4.SS4)\)\.
### IV\-A Reconstruction and generation
We compared two variants of the model: one with the mesh vertex loss term $\mathcal{L}_{v}$ and one without \(see supplementary materials for details\), as this is the primary factor influencing reconstruction fidelity and surface\-level consistency \[[34](https://arxiv.org/html/2606.28769#bib.bib10)\]\. We evaluated their performance in both reconstruction and generation settings\. For reconstruction, we assessed how accurately the model can reproduce input motion sequences, thereby capturing the details of observed motions\. For generation, we analyzed the quality and diversity of motions sampled from the learned latent space, conditioned only on emotion labels\. The focus of our evaluation is whether the model can generate diverse, semantically consistent motions and how such data might be used in emotion\-related tasks\.
For the reconstruction evaluation, we computed three metrics: angular error, which measures the difference in joint rotations between the reconstructed and ground\-truth poses; and mesh error and joint error, calculated as the Euclidean distance between corresponding vertices and joints, respectively, of the reconstructed and ground\-truth SMPL body meshes\. For the generation evaluation, we assessed the realism of sampled motions using the Fréchet Inception Distance \(FID\), which compares the distribution of generated motions to that of real motions in a learned feature space\. To compute FID, we adopted an RNN\-based emotion recognition model trained on the real motion dataset\. We extracted feature embeddings from the model’s final layer and computed FID between the feature distributions of 2,600 randomly sampled real and generated motions\. This process was repeated 20 times and averaged to ensure robust estimation\. To promote statistical stability, we computed FID jointly across all emotion categories\. A lower FID indicates greater similarity to real data\. We further assessed the variability of generated motions using two metrics: diversity and multimodality\. For reference, we also computed these metrics on real motion data\. Diversity is measured as the average pairwise feature distance between randomly sampled motion sequences, reflecting the overall variation across the generation space\. Values closer to those of real data are preferred\. Multimodality quantifies the average feature distance among multiple samples conditioned on the same emotion label, capturing the model’s ability to express within\-class variability\. Higher values indicate better within\-category variation\. To capture semantically meaningful variations, we computed generation metrics only on generated samples that were correctly classified by the emotion recognition model\. This is the same filtered subset as in the one\-time supervised augmentation setup \(Sec\.[IV\-B](https://arxiv.org/html/2606.28769#S4.SS2)\); the filtering mitigates the risk of misleading metric values due to off\-target generations and ensures that measured values reflect expressive variation within an emotion class\.

As shown in Table[I](https://arxiv.org/html/2606.28769#S4.T1), both model variants \(one trained with the vertex loss term $\mathcal{L}_{v}$ and one without\) achieve reconstruction quality comparable to that of models used in action\- or text\-to\-motion generation tasks \[[5](https://arxiv.org/html/2606.28769#bib.bib17),[45](https://arxiv.org/html/2606.28769#bib.bib61)\], with mesh errors within 4 cm\. The results reveal no substantial difference between the two variants in emotional body motion generation, suggesting that the vertex loss $\mathcal{L}_{v}$, while providing stronger surface\-level supervision, does not significantly enhance reconstruction accuracy in the emotional motion generation task\. Notably, both variants achieved higher multimodality than that computed on real data, indicating that the generative model has a strong capacity to generate diverse expressions within each emotion category\. This highlights its potential for augmenting motion\-based emotion recognition tasks\.
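The three generation metrics can be sketched on arbitrary feature arrays as below. This is one common way to compute them, not necessarily the authors' exact implementation: the function names, the random-pair sampling scheme, and the use of Euclidean distance are assumptions. The Fréchet distance fits a Gaussian to each feature set and uses the identity that the trace of the matrix square root of the covariance product equals the sum of the square roots of its eigenvalues.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # tr(sqrt(cov_a cov_b)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

def diversity(feats, n_pairs, rng):
    """Mean distance between randomly paired samples, ignoring labels."""
    i = rng.integers(0, len(feats), size=n_pairs)
    j = rng.integers(0, len(feats), size=n_pairs)
    return float(np.linalg.norm(feats[i] - feats[j], axis=1).mean())

def multimodality(feats, labels, n_pairs, rng):
    """Mean distance between random pairs that share an emotion label."""
    per_class = []
    for label in np.unique(labels):
        cls = feats[labels == label]
        i = rng.integers(0, len(cls), size=n_pairs)
        j = rng.integers(0, len(cls), size=n_pairs)
        per_class.append(np.linalg.norm(cls[i] - cls[j], axis=1).mean())
    return float(np.mean(per_class))
```

In the setup described above, `feats_a` and `feats_b` would hold final-layer embeddings of 2,600 real and 2,600 generated motions, and the Fréchet distance would be averaged over 20 random draws.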
### IV\-B Emotion expression recognition from human body motion
We investigated whether a generative model can support emotion expression recognition in two complementary roles: \(1\) generating emotionally expressive motions that are recognizable by machines and humans, and \(2\) providing additional training data to improve the performance of data\-driven recognition models\. To ensure model\-agnostic evaluation, we used a unidirectional multi\-layer long short\-term memory \(LSTM\)\[[18](https://arxiv.org/html/2606.28769#bib.bib15)\]recurrent neural network as our recognition model\. This allows us to assess motion discriminability without bias toward any specific architecture\. To train the baseline recognition model, the DIEM\-A data was split into 80% training set, 10% validation set, and 10% test set\. The generation model was trained on the same training set, ensuring no data overlap or information leakage during the evaluation\.
Machine\-based emotion expression recognition on real and generated motions\. As shown in Table[I](https://arxiv.org/html/2606.28769#S4.T1), real body motion sequences achieve the highest recognition accuracy at 34\.28%, compared to a chance level of 7\.69%, establishing a baseline for the model’s ability to extract affective cues from real body motions\. This also highlights the difficulty of emotion expression recognition from body motion alone in the DIEM\-A dataset, given the freely performed and highly variable expressive body motions\. In comparison, generated motions achieve an accuracy of 22\.80% on 2,000 randomly generated motions, indicating that the synthesized sequences retain emotion\-specific patterns recognizable by a downstream classifier\. However, a noticeable gap of approximately 11\.5 percentage points remains between real and generated data, consistent with prior findings in motion synthesis and recognition \[[34](https://arxiv.org/html/2606.28769#bib.bib10)\]\. This gap may arise from distributional discrepancies, where generated motions partially deviate from the true data distribution and lack cues critical for emotion expression recognition\.
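For reference, the chance level and the real-versus-generated gap follow from the reported figures. The variable names below are illustrative; the numbers are those given in the text.

```python
NUM_CLASSES = 13
real_acc, generated_acc = 34.28, 22.80   # reported accuracies in percent

chance = 100.0 / NUM_CLASSES             # uniform guessing over 13 labels
gap_points = real_acc - generated_acc    # difference in percentage points

print(f"{chance:.2f}")      # 7.69
print(f"{gap_points:.2f}")  # 11.48
```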
Figure 3: Confusion matrices of the emotion recognition model evaluated without augmentation on real motion data \(left\), generated motion data \(middle\), and with 2,000 generated samples used for augmentation \(right\), with top\-3 predictions annotated in each row\.

Fig\.[3](https://arxiv.org/html/2606.28769#S4.F3) shows the confusion matrix of recognition results\. For real motions, emotion categories such as fear, gratitude, and sadness show relatively high recognition accuracies, because of their distinct and culturally consistent body cues, such as withdrawn or protective motions for fear\. In contrast, emotions with more subtle expressions, such as shame and guilt, exhibit greater confusion in both real and generated data\. While the confusion patterns are broadly aligned between real and generated motions, predictions on generated data are less concentrated and show reduced classification accuracy\. This highlights the challenge of modeling subtle expressive differences and suggests that further refinement of the generative model is needed to improve the distinctiveness of emotion\- and context\-specific motion patterns\.
Figure 4: Comparison of recognition accuracy under different data augmentation strategies and sample sizes\.

One\-time vs\. iterative augmentation strategies\. We compared two augmentation strategies: one\-time and iterative augmentation\. In the one\-time approach, a fixed number of generated samples per emotion category is selected using the baseline classifier, and added to the training set in a single step\. In the iterative approach, the classifier is incrementally retrained from the baseline model, where each new model builds on the previous one by adding newly selected high\-confidence samples to the training set\. For example, the model trained with 200 augmented samples is used to select additional data, which is then used to train the next model with 400 samples, and so on\. As shown in Fig\.[4](https://arxiv.org/html/2606.28769#S4.F4), both strategies yield improvements over the no\-augmentation baseline\. However, the one\-time augmentation consistently outperforms the iterative approach, especially when a sufficient number of samples is available\. This suggests that while iterative selection may introduce additional diversity, it also risks reinforcing early misclassifications or redundant patterns\. By contrast, selecting a batch of synthetic samples using a strong initial classifier leads to more stable and substantial improvements in recognition performance\.
Importance of Sample Quality in Augmentation\. We compared supervised and unsupervised strategies for selecting generated samples to augment the training of emotion expression recognition models\. The supervised approach filters samples using the baseline classifier, retaining only those correctly classified and thus semantically aligned with the intended emotion\. In contrast, the unsupervised strategy selects samples randomly\. As shown in Fig\.[4](https://arxiv.org/html/2606.28769#S4.F4), supervised augmentation consistently outperforms the unsupervised one across all augmentation sizes\. The unsupervised method resulted in a limited improvement, suggesting that inaccurate or ambiguous samples introduce noise and degrade recognition model performance\. These results demonstrate the importance of quality control in sample selection and highlight the benefit of leveraging classifier feedback to guide data augmentation\.
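The selection and augmentation strategies compared above can be sketched with a toy classifier. Everything in the block is an assumption for illustration: a nearest-centroid model stands in for the LSTM recognizer, samples are plain feature vectors instead of motion sequences, selection is global and not balanced per emotion category, and "high-confidence" selection in the iterative variant is simplified to "predicted label matches the conditioning label".

```python
import numpy as np

def fit_centroids(x, y, num_classes):
    """Toy stand-in for the recognition model: one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, x):
    """Assign each sample to its nearest class centroid."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def select_correct(centroids, gen_x, gen_y, pool, n):
    """Supervised selection: up to n pool indices whose prediction matches the label."""
    pool = np.asarray(pool)
    ok = predict(centroids, gen_x[pool]) == gen_y[pool]
    return pool[ok][:n]

def select_random(gen_x, n, rng):
    """Unsupervised selection: n generated samples chosen at random."""
    return rng.choice(len(gen_x), size=n, replace=False)

def augment(real_x, real_y, gen_x, gen_y, idx, num_classes):
    """Retrain on the real data plus the selected generated samples."""
    return fit_centroids(np.concatenate([real_x, gen_x[idx]]),
                         np.concatenate([real_y, gen_y[idx]]), num_classes)

def one_time(real_x, real_y, gen_x, gen_y, n, num_classes):
    """Select all n samples once with the baseline model, then retrain once."""
    base = fit_centroids(real_x, real_y, num_classes)
    idx = select_correct(base, gen_x, gen_y, np.arange(len(gen_x)), n)
    return augment(real_x, real_y, gen_x, gen_y, idx, num_classes), idx

def iterative(real_x, real_y, gen_x, gen_y, n, step, num_classes):
    """Add `step` samples per round, selecting each batch with the latest model."""
    model = fit_centroids(real_x, real_y, num_classes)
    chosen = np.array([], dtype=int)
    while len(chosen) < n:
        pool = np.setdiff1d(np.arange(len(gen_x)), chosen)
        new = select_correct(model, gen_x, gen_y, pool, min(step, n - len(chosen)))
        if len(new) == 0:
            break  # no further samples pass the filter
        chosen = np.concatenate([chosen, new])
        model = augment(real_x, real_y, gen_x, gen_y, chosen, num_classes)
    return model, chosen
```

The structural difference is where the filter comes from: `one_time` always filters with the baseline model, whereas `iterative` filters each batch with a model that has already been trained on earlier selections, which is how early mistakes can propagate.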
Effect of the Number of Augmented Samples\. We evaluated how the quantity of synthetic samples affects recognition performance\. We incrementally added varying numbers of generated samples to the training set and evaluated the resulting classification accuracy\. As shown in Fig\.[4](https://arxiv.org/html/2606.28769#S4.F4), accuracy generally improves with the number of augmented samples, peaking at 42\.3% with 2,000 generated motions, well above the no\-augmentation baseline of 34\.28%\. Notably, performance begins to plateau beyond 1,200 samples and fluctuates slightly, suggesting a saturation point where additional synthetic data offers diminishing returns or introduces redundant information\. The confusion matrices in Fig\.[3](https://arxiv.org/html/2606.28769#S4.F3) highlight the effect of augmentation on individual emotion categories\. With 2,000 augmented samples, benefiting from the increased diversity and quantity of training samples, improvements are observed in most categories, particularly joy, anger, gratitude, shame, and neutral, which show clear increases in class\-wise accuracy\. These results demonstrate that augmenting with a moderate number of high\-quality generated emotional motions enhances recognition performance, particularly in categories with distinctive expressive cues\. However, the improvement becomes increasingly limited as more samples are added\.
Human\-based emotional expression recognition on generated motions\.
Figure 5: Human perception accuracy of generated motions\. The average accuracies for motions correctly classified, misclassified, and decoded from mean latent vectors are 24\.91%, 10\.99%, and 21\.98%, respectively\.

To evaluate the perceptual validity of the generated motions, we conducted a human evaluation study with 20 Japanese raters \(16 males, 4 females; mean age 43\.5 years, closely matching that of the motion performers\)\. Participants were asked to identify emotions from 65 generated motion sequences \(total duration: 7\.02 minutes, mean duration per sequence: 6\.47 seconds, SD = 3\.04 seconds\)\. Of these sequences, 13 were generated from the mean latent vector for each emotion category \(see Sec\. [IV\-C](https://arxiv.org/html/2606.28769#S4.SS3)\); 26 were correctly classified by the trained recognition classifier; and 26 were misclassified by the classifier\. This setup enabled us to analyze how machine classification correctness relates to human perception and whether it can serve as a proxy for perceptual quality\. Participants viewed each motion and selected the most appropriate emotion label from the 13 options, without additional cues\. The average completion time was 23\.80 minutes\. Correlation analysis revealed no significant relationship between response time and accuracy \(Pearson’s $r=-0.23$, $p=0.33$\)\. Fig\. [5](https://arxiv.org/html/2606.28769#S4.F5) summarizes the perceptual evaluation results\. Motions correctly classified by the machine classifier achieved the highest human recognition accuracy at 24\.91%\. Conversely, misclassified motions yielded a lower accuracy of 10\.99%, suggesting that trained classifier predictions can serve as a useful indicator for selecting higher\-quality synthetic data\. However, the overall accuracy across the three conditions was only 18\.75%, indicating that generated motions still struggle to convey clearly recognizable expressive cues\.
These findings underscore both the promise of generative models for synthesizing perceptually meaningful emotional motion data and their current limitations in capturing the expressive nuances necessary for reliable human recognition\. To contextualize these results, a recent study \[[6](https://arxiv.org/html/2606.28769#bib.bib11)\] reported 41\.6% human accuracy on real motion data from the same dataset, highlighting a gap between emotion production and perception, perhaps due to the performer’s acting ability\. Therefore, our observed results fall within a realistic and interpretable range, given the broader challenges of expressing emotions through body motion and accurately perceiving that motion as the intended affective message\.
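As a rough consistency check on the figures above, the sketch below recomputes an overall accuracy as a sequence\-count\-weighted average of the three reported per\-condition accuracies. The weighting scheme is an assumption (the text does not state how the overall value was aggregated). The chance level also assumes uniform guessing over the 13 labels.

```python
# (number of sequences, reported human accuracy in %) per condition, taken from the text.
conditions = {
    "correctly_classified": (26, 24.91),
    "misclassified": (26, 10.99),
    "mean_latent": (13, 21.98),
}

total_sequences = sum(n for n, _ in conditions.values())

# Assumed aggregation: weight each condition by its number of sequences.
weighted_overall = sum(n * acc for n, acc in conditions.values()) / total_sequences

# Chance level if a rater picked one of the 13 emotion labels uniformly at random.
chance_level = 100.0 / 13

print(f"sequences: {total_sequences}")
print(f"weighted overall accuracy: {weighted_overall:.2f}%")
print(f"chance level: {chance_level:.2f}%")
```

Under these assumptions the weighted average comes out at roughly 18.76%, close to the reported 18.75%; the small difference may be due to rounding in the per\-condition values. The chance level of about 7.69% gives a reference point for reading the three accuracies.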
### IV\-C Common motion extraction
Figure 6: Motion decoded from the mean latent vector computed across all training samples labeled as angry \(top\), and from the learned emotion\-specific bias vector $b_{\text{gratitude}}$ \(bottom\)\.

Furthermore, to explore whether the generative model captures shared patterns of emotional expression, we analyzed the latent space by extracting motion sequences that represent common patterns within each emotion category\. We adopted two strategies\. First, we computed the mean latent vector for each emotion category by averaging all latent vectors obtained from training sequences with the same label\. We then decoded each mean vector into a motion sequence, representing the central tendency of the learned distribution\. This approach produces motions that reflect shared characteristics across multiple performers and scenarios\. In particular, individual variations are suppressed while the common motion patterns associated with each emotion are retained\. Second, we decoded from the learned emotion\-specific bias vectors $b_{e}$ used for conditioning the latent distribution\. These vectors act as a distilled representation of each category, independent of any specific input sequence\. By decoding them directly, we obtain a complementary perspective on what the model has implicitly learned as the core features of each emotional expression\.
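The first strategy, averaging latent vectors per emotion label, can be sketched as below. This is a minimal illustration with toy two\-dimensional latents. The function name `mean_latents` is hypothetical, and the encoder and decoder of the trained model are not shown; each mean vector would subsequently be passed to the decoder.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def mean_latents(latents: Sequence[Sequence[float]],
                 labels: Sequence[int]) -> Dict[int, List[float]]:
    """Average, per emotion label, the latent vectors of all sequences with that label."""
    groups: Dict[int, List[Sequence[float]]] = defaultdict(list)
    for z, y in zip(latents, labels):
        groups[y].append(z)
    # zip(*vectors) iterates over latent dimensions, so each mean is taken element-wise.
    return {y: [sum(dim) / len(vectors) for dim in zip(*vectors)]
            for y, vectors in groups.items()}


# Toy latent vectors (illustrative values only) with their emotion labels.
toy_latents = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
toy_labels = [0, 0, 1]

means = mean_latents(toy_latents, toy_labels)
# Each entry of `means` would then be decoded into a representative motion
# by the trained decoder (not included in this sketch).
```

Averaging in latent space rather than in pose space lets the decoder produce a coherent motion from the mean, which is why individual variation is suppressed while shared patterns remain.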
Fig\. [6](https://arxiv.org/html/2606.28769#S4.F6) shows two qualitative results for the angry and gratitude emotions, generated from the mean latent vector and the emotion\-specific bias vector, respectively\. We observed that both averaging over expressive motions and learning emotion embeddings tend to suppress individual variations\. However, the decoded motions still retain a few key features that are representative of each emotional category\. For example, the gratitude motion displays a forward\-leaning posture with hands held close to the chest, which is an expressive cue commonly associated with appreciation in Japanese culture\. These observations suggest that both the mean latent vectors and the learned emotion\-specific bias vectors encode semantically meaningful motion priors for each emotion\. Fig\. [5](https://arxiv.org/html/2606.28769#S4.F5) shows the results of the human perceptual evaluation on motions decoded from mean latent vectors\. These representative motions achieved a recognition accuracy of 21\.98%, slightly lower than motions correctly classified by the recognition model \(24\.91%\), but substantially higher than misclassified samples \(10\.99%\)\. Notably, motions for sadness and gratitude yielded relatively high perceptual accuracy, suggesting that these emotions are more consistently expressed across performers\. However, the overall moderate accuracy also suggests that these motions may lack the expressive detail necessary for clearly recognizable expressions of emotion\.
### IV\-D Emotional motion interpolation
Figure 7: Angry motion interpolation between low and high intensity sequences\. The interpolated sequence exhibits motion characteristics that fall between the two endpoints, demonstrating a smooth transition in intensity\.

The DIEM\-A dataset provides a unique opportunity to explore emotional intensity interpolation, as it includes multiple performances of each emotion expressed at three distinct intensity levels\. Thus, instead of generating new emotional body motion expressions through random sampling, we synthesized new motions by interpolating between latent vectors of motions that share the same emotion label\. We encoded two motion sequences of the same emotion but with different intensities using the trained encoder to obtain their latent representations\. We then computed a new latent vector by taking a weighted average of the two, simulating an intermediate emotional intensity\. The resulting vector is decoded into a motion sequence\. Fig\. [7](https://arxiv.org/html/2606.28769#S4.F7) illustrates a qualitative example of emotional motion interpolation between low\- and high\-intensity performances of the “angry” emotion\. The top and bottom rows show the source motions with high and low intensity, respectively, while the middle row presents the interpolated motion generated from an average of their latent vectors\. The interpolated motion exhibits a smooth transition in both posture and dynamics\. It clearly reflects characteristics that lie between the subtle, minimal movement of the low\-intensity motion and the energetic movement of the high\-intensity motion\. This result suggests that the model captures a meaningful internal representation of emotional intensity, allowing it to generate expressive motions along a continuous spectrum rather than relying on discrete intensity labels\.
Such interpolation ability opens new opportunities for emotion\-aware animation control and continuous emotion modeling, enabling more flexible and expressive applications in affective computing and virtual character design\.
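The weighted average of two latent vectors described in this section can be written as $z_{\alpha} = (1-\alpha)\,z_{\text{low}} + \alpha\,z_{\text{high}}$ with $\alpha \in [0, 1]$. The sketch below is a minimal illustration under that assumption. The symbol $\alpha$, the function name `interpolate_latents`, and the toy vectors are not from the paper, and the trained encoder and decoder are omitted.

```python
from typing import List, Sequence


def interpolate_latents(z_low: Sequence[float],
                        z_high: Sequence[float],
                        alpha: float = 0.5) -> List[float]:
    """Linear interpolation between two latent vectors.

    alpha = 0 returns z_low, alpha = 1 returns z_high, and values in between
    give a weighted average meant to simulate an intermediate intensity.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(z_low) != len(z_high):
        raise ValueError("latent vectors must have the same dimensionality")
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_low, z_high)]


# Toy latent codes standing in for encoded low- and high-intensity motions.
z_low = [0.0, 2.0, -1.0]
z_high = [4.0, 2.0, 1.0]

z_mid = interpolate_latents(z_low, z_high, alpha=0.5)
# z_mid would then be decoded into a motion by the trained decoder (not shown).
```

Sweeping `alpha` from 0 to 1 in small steps would yield a sequence of latent codes along the line between the two endpoints, which is one way to obtain graded intensities without discrete intensity labels.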
## V Conclusions and Future Work
Our results show the feasibility of generating emotional body motions using a model originally developed for action\-based generation\. Although the task is challenging, the generated motions improved machine recognition of body expressions of emotion\. Machine recognition also supports the pre\-selection of body motions that can effectively convey the intended message to human observers, as evidenced by the higher human recognition accuracy for motions that the machine classifier labeled correctly\. However, the generated motions still lag behind real data in terms of classification accuracy and perceptual clarity\. Furthermore, in our current implementation, generated data are either randomly sampled or filtered using a pre\-trained classifier\. This static selection may propagate misleading samples\. To address this, future work could incorporate a generative adversarial network to iteratively refine the generated motions by integrating feedback from recognition models or human evaluators in a data\-centric loop\.
## Ethical impact statement
This research was a data analysis of the Diverse Intercultural E\-Motion Database of Asian Performers \(DIEM\-A\) \[[7](https://arxiv.org/html/2606.28769#bib.bib62)\]\. All data are anonymous and were obtained following the Declaration of Helsinki after ethical approval\. Our results have several limitations\. First, the emotions were expressed by actors\. Even though the actors were asked to reminisce about a scenario where they would feel a particular emotion, the expressions were posed and not spontaneous \[[40](https://arxiv.org/html/2606.28769#bib.bib60)\]\. Therefore, our results need to be interpreted from a perception perspective as a plausible communication message, as opposed to identifying how a person truly feels\. Second, the emotion recognition study involved 20 Japanese raters\. The cultural and gender composition of this sample may limit the generalizability of the results\. Perception of emotion can be shaped by cultural norms and individual background, and the use of a culturally homogeneous rater group should be considered when interpreting the findings\. Moreover, our models provide baseline results that still need to be improved\. Our human rating results show that the human interpretation of an expression might differ from that of a machine recognition model\. Individual differences in how emotions are expressed remain a challenge, and this work is only a first attempt to find the commonalities in body expressions of emotion\.
## Acknowledgment
This work was supported by the Japan Science and Technology Agency under the Broadening Opportunities for Outstanding Young Researchers and Doctoral Students in Strategic Areas \(BOOST\) JPMJBS2423 and the RIEC Nation\-Wide Cooperative Research Projects Grant Number R05/A33\.
## References
- \[1\]C\. Ahuja and L\. Morency\(2019\-09\)Language2Pose: Natural Language Grounded Pose Forecasting\.In2019 International Conference on 3D Vision \(3DV\),Québec City, QC, Canada,pp\. 719–728\.External Links:ISBN 978\-1\-72813\-131\-3,[Document](https://dx.doi.org/10.1109/3DV.2019.00084)Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[2\]M\. J\. Al\-Dujaili and A\. Ebrahimi\-Moghadam\(2023\-04\)Speech Emotion Recognition: A Comprehensive Survey\.Wireless Personal Communications129\(4\),pp\. 2525–2561\(en\)\.External Links:ISSN 0929\-6212, 1572\-834X,[Document](https://dx.doi.org/10.1007/s11277-023-10244-3)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[3\]L\. F\. Barrett, R\. Adolphs, S\. Marsella, A\. M\. Martinez, and S\. D\. Pollak\(2019\-07\)Emotional Expressions Reconsidered: Challenges to Inferring Emotion From Human Facial Movements\.Psychological Science in the Public Interest20\(1\),pp\. 1–68\(en\)\.External Links:ISSN 1529\-1006,[Document](https://dx.doi.org/10.1177/1529100619832930)Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p2.1)\.
- \[4\]U\. Bhattacharya, N\. Rewkowski, A\. Banerjee, P\. Guhan, A\. Bera, and D\. Manocha\(2021\)Text2gestures: a transformer\-based network for generating emotive body gestures for virtual agents\.In2021 IEEE virtual reality and 3D user interfaces \(VR\),pp\. 1–10\.Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[5\]X\. Chen, B\. Jiang, W\. Liu, Z\. Huang, B\. Fu, T\. Chen, and G\. Yu\(2023\)Executing your commands via motion diffusion in latent space\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 18000–18010\.Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1),[§IV\-A](https://arxiv.org/html/2606.28769#S4.SS1.p2.2)\.
- \[6\]M\. Cheng, C\. Tseng, K\. Fujiwara, S\. Higashiyama, A\. Weng, and Y\. Kitamura\(2024\)Toward an asian\-based bodily movement database for emotional communication\.Behavior Research Methods57\(1\),pp\. 10\.Cited by:[§IV\-B](https://arxiv.org/html/2606.28769#S4.SS2.p9.2)\.
- \[7\]M\. Cheng, C\. Tseng, K\. Fujiwara, V\. Schneider, and Y\. Kitamura\(2025\)Asian emotional body movement database: diverse intercultural e\-motion database of asian performers \(diem\-a\)\.In2025 13th International Conference on Affective Computing and Intelligent Interaction \(ACII\),Cited by:[§III\-A](https://arxiv.org/html/2606.28769#S3.SS1.p1.1),[Ethical impact statement](https://arxiv.org/html/2606.28769#Sx1.p1.1)\.
- \[8\]B\. De Gelder, A\. W\. de Borst, and R\. Watson\(2015\)The perception of emotion in body expressions\.Wiley Interdisciplinary Reviews: Cognitive Science6\(2\),pp\. 149–158\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p1.1)\.
- \[9\]M\. De Meijer\(1989\-12\)The contribution of general features of body movement to the attribution of emotions\.Journal of Nonverbal Behavior13\(4\),pp\. 247–268\(en\)\.External Links:ISSN 0191\-5886, 1573\-3653,[Document](https://dx.doi.org/10.1007/BF00990296)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[10\]L\. Elansary, Z\. Taha, and W\. Gad\(2024\)Survey on emotion recognition through posture detection and the possibility of its application in virtual reality\.arXiv preprint arXiv:2408\.01728\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p1.1)\.
- \[11\]H\. A\. Elfenbein and N\. Ambady\(2002\)On the universality and cultural specificity of emotion recognition: A meta\-analysis\.\.Psychological Bulletin128\(2\),pp\. 203–235\(en\)\.External Links:ISSN 1939\-1455, 0033\-2909,[Document](https://dx.doi.org/10.1037/0033-2909.128.2.203)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[12\]B\. Fasel and J\. Luettin\(2003\)Automatic facial expression analysis: a survey\.Pattern recognition36\(1\),pp\. 259–275\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p1.1)\.
- \[13\]K\. Fragkiadaki, S\. Levine, P\. Felsen, and J\. Malik\(2015\)Recurrent Network Models for Human Dynamics\.arXiv\.Note:Version Number: 2\. Other: International Conference on Computer Vision 2015\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.1508.00271)Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[14\]H\. S\. Friedman, R\. E\. Riggio, and D\. O\. Segall\(1980\-09\)Personality and the enactment of emotion\.Journal of Nonverbal Behavior5\(1\),pp\. 35–48\(en\)\.External Links:ISSN 1573\-3653,[Document](https://dx.doi.org/10.1007/BF00987053)Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p2.1)\.
- \[15\]S\. Goyal, S\. Bhagat, S\. Uppal, H\. Jangra, Y\. Yu, Y\. Yin, and R\. R\. Shah\(2023\-10\)Emotionally Enhanced Talking Face Generation\.InProceedings of the 1st International Workshop on Multimedia Content Generation and Evaluation: New Methods and Practice,Ottawa ON Canada,pp\. 81–90\(en\)\.External Links:ISBN 9798400702785,[Document](https://dx.doi.org/10.1145/3607541.3616812)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[16\]C\. Guo, S\. Zou, X\. Zuo, S\. Wang, W\. Ji, X\. Li, and L\. Cheng\(2022\)Generating diverse and natural 3d human motions from text\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 5152–5161\.Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[17\]C\. Guo, X\. Zuo, S\. Wang, S\. Zou, Q\. Sun, A\. Deng, M\. Gong, and L\. Cheng\(2020\)Action2motion: conditioned generation of 3d human motions\.InProceedings of the 28th ACM International Conference on Multimedia,pp\. 2021–2029\.Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[18\]S\. Hochreiter\(1997\)Long short\-term memory\.Neural Computation MIT\-Press\.Cited by:[§IV\-B](https://arxiv.org/html/2606.28769#S4.SS2.p1.1)\.
- \[19\]Y\. Hou, H\. Yao, X\. Sun, and H\. Li\(2020\)Soul dancer: emotion\-based human action generation\.ACM Transactions on Multimedia Computing, Communications, and Applications \(TOMM\)15\(3s\),pp\. 1–19\.Cited by:[§III\-B](https://arxiv.org/html/2606.28769#S3.SS2.p2.6)\.
- \[20\]B\. Jiang, X\. Chen, W\. Liu, J\. Yu, G\. Yu, and T\. Chen\(2023\)Motiongpt: human motion as a foreign language\.Advances in Neural Information Processing Systems36,pp\. 20067–20079\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[21\]A\. Kleinsmith and N\. Bianchi\-Berthouze\(2013\)Affective body expression perception and recognition: a survey\.IEEE Transactions on Affective Computing4\(1\),pp\. 15–33\.External Links:[Document](https://dx.doi.org/10.1109/T-AFFC.2012.16)Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p2.1)\.
- \[22\]A\. Kleinsmith, P\. R\. De Silva, and N\. Bianchi\-Berthouze\(2006\-12\)Cross\-cultural differences in recognizing affect from body posture\.Interacting with Computers18\(6\),pp\. 1371–1389\(en\)\.External Links:ISSN 09535438,[Document](https://dx.doi.org/10.1016/j.intcom.2006.04.003)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[23\]S\. Li and W\. Deng\(2020\)Deep facial expression recognition: a survey\.IEEE transactions on affective computing13\(3\),pp\. 1195–1215\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p1.1)\.
- \[24\]S\. Li and W\. Deng\(2022\-07\)Deep Facial Expression Recognition: A Survey\.IEEE Transactions on Affective Computing13\(3\),pp\. 1195–1215\.External Links:ISSN 1949\-3045, 2371\-9850,[Document](https://dx.doi.org/10.1109/TAFFC.2020.2981446)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[25\]H\. Liu, Z\. Zhu, N\. Iwamoto, Y\. Peng, Z\. Li, Y\. Zhou, E\. Bozkurt, and B\. Zheng\(2022\)Beat: a large\-scale semantic and emotional multi\-modal dataset for conversational gestures synthesis\.InEuropean conference on computer vision,pp\. 612–630\.Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[26\]M\. Loper, N\. Mahmood, J\. Romero, G\. Pons\-Moll, and M\. J\. Black\(2015\-10\)SMPL: a skinned multi\-person linear model\.ACM Trans\. Graph\.34\(6\)\.External Links:ISSN 0730\-0301,[Document](https://dx.doi.org/10.1145/2816795.2818013)Cited by:[§III\-A](https://arxiv.org/html/2606.28769#S3.SS1.p2.2)\.
- \[27\]N\. Mahmood, N\. Ghorbani, N\. F\. Troje, G\. Pons\-Moll, and M\. J\. Black\(2019\-10\)AMASS: archive of motion capture as surface shapes\.InInternational Conference on Computer Vision,pp\. 5442–5451\.Cited by:[§III\-A](https://arxiv.org/html/2606.28769#S3.SS1.p3.1)\.
- \[28\]J\. Martinez, M\. J\. Black, and J\. Romero\(2017\-07\)On Human Motion Prediction Using Recurrent Neural Networks\.In2017 IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\),Honolulu, HI,pp\. 4674–4683\.External Links:ISBN 978\-1\-5386\-0457\-1,[Document](https://dx.doi.org/10.1109/CVPR.2017.497)Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[29\]S\. Mousavi\(2025\)Synthetic data generation by supervised neural gas network for physiological emotion recognition data\.arXiv preprint arXiv:2501\.16353\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p1.1)\.
- \[30\]P\. M\. Niedenthal\(2007\-05\)Embodying Emotion\.Science316\(5827\),pp\. 1002–1005\(en\)\.External Links:ISSN 0036\-8075, 1095\-9203,[Document](https://dx.doi.org/10.1126/science.1136930)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[31\]M\. Petrovich, M\. J\. Black, and G\. Varol\(2021\)Action\-conditioned 3d human motion synthesis with transformer vae\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 10985–10995\.Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1),[Figure 2](https://arxiv.org/html/2606.28769#S3.F2),[§III\-B](https://arxiv.org/html/2606.28769#S3.SS2.p2.6)\.
- \[32\]M\. Petrovich, M\. J\. Black, and G\. Varol\(2022\)Temos: generating diverse human motions from textual descriptions\.InEuropean Conference on Computer Vision,pp\. 480–497\.Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[33\]M\. Petrovich, M\. J\. Black, and G\. Varol\(2023\)Tmr: text\-to\-motion retrieval using contrastive 3d human motion synthesis\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 9488–9497\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[34\]M\. Petrovich, M\. J\. Black, and G\. Varol\(2021\)Action\-conditioned 3D human motion synthesis with transformer VAE\.InInternational Conference on Computer Vision \(ICCV\),Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1),[§IV\-A](https://arxiv.org/html/2606.28769#S4.SS1.p1.1),[§IV\-B](https://arxiv.org/html/2606.28769#S4.SS2.p2.1)\.
- \[35\]A\. R\. Punnakkal, A\. Chandrasekaran, N\. Athanasiou, A\. Quiros\-Ramirez, and M\. J\. Black\(2021\-06\)BABEL: bodies, action and behavior with english labels\.InProceedings IEEE/CVF Conf\. on Computer Vision and Pattern Recognition \(CVPR\),pp\. 722–731\.Cited by:[§III\-A](https://arxiv.org/html/2606.28769#S3.SS1.p3.1)\.
- \[36\]X\. Qi, C\. Liu, L\. Li, J\. Hou, H\. Xin, and X\. Yu\(2024\)Emotiongesture: audio\-driven diverse emotional co\-speech 3d gesture generation\.IEEE Transactions on Multimedia\.Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[37\]X\. Qi, J\. Pan, P\. Li, R\. Yuan, X\. Chi, M\. Li, W\. Luo, W\. Xue, S\. Zhang, Q\. Liu,et al\.\(2024\)Weakly\-supervised emotion transition learning for diverse 3d co\-speech gesture generation\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 10424–10434\.Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[38\]K\. R\. Scherer\(2003\)Vocal communication of emotion: a review of research paradigms\.Speech communication40\(1\-2\),pp\. 227–256\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p1.1)\.
- \[39\]G\. Tevet, S\. Raab, B\. Gordon, Y\. Shafir, D\. Cohen\-Or, and A\. H\. Bermano\(2022\)Human motion diffusion model\.arXiv preprint arXiv:2209\.14916\.Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[40\]L\. Tian, J\. D\. Moore, and C\. Lai\(2015\)Emotion recognition in spontaneous and acted dialogues\.In2015 international conference on affective computing and intelligent interaction \(ACII\),pp\. 698–704\.Cited by:[Ethical impact statement](https://arxiv.org/html/2606.28769#Sx1.p1.1)\.
- \[41\]J\. L\. Tracy and D\. Matsumoto\(2008\-08\)The spontaneous expression of pride and shame: Evidence for biologically innate nonverbal displays\.Proceedings of the National Academy of Sciences105\(33\),pp\. 11655–11660\(en\)\.External Links:ISSN 0027\-8424, 1091\-6490,[Document](https://dx.doi.org/10.1073/pnas.0802686105)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[42\]O\. Wiles, A\. S\. Koepke, and A\. Zisserman\(2018\)X2Face: A Network for Controlling Face Generation Using Images, Audio, and Pose Codes\.InComputer Vision – ECCV 2018,V\. Ferrari, M\. Hebert, C\. Sminchisescu, and Y\. Weiss \(Eds\.\),Vol\.11217,pp\. 690–706\(en\)\.Note:Series Title: Lecture Notes in Computer ScienceExternal Links:ISBN 978\-3\-030\-01260\-1 978\-3\-030\-01261\-8,[Document](https://dx.doi.org/10.1007/978-3-030-01261-8%5F41)Cited by:[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[43\]T\. Yu, J\. Wang, J\. Wang, J\. Luo, and G\. Zhou\(2024\)Towards emotion\-enriched text\-to\-motion generation via llm\-guided limb\-level emotion manipulating\.InProceedings of the 32nd ACM International Conference on Multimedia,pp\. 612–621\.Cited by:[§I](https://arxiv.org/html/2606.28769#S1.p3.1),[§II\-B](https://arxiv.org/html/2606.28769#S2.SS2.p1.1)\.
- \[44\]Y\. Zhang, M\. J\. Black, and S\. Tang\(2021\-06\)We are More than Our Joints: Predicting how 3D Bodies Move\.In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),Nashville, TN, USA,pp\. 3371–3381\.External Links:ISBN 978\-1\-66544\-509\-2,[Document](https://dx.doi.org/10.1109/CVPR46437.2021.00338)Cited by:[§II\-A](https://arxiv.org/html/2606.28769#S2.SS1.p1.1)\.
- \[45\]Z\. Zhou, Y\. Wan, and B\. Wang\(2023\)A unified framework for multimodal, multi\-part human motion synthesis\.arXiv preprint arXiv:2311\.16471\.Cited by:[§IV\-A](https://arxiv.org/html/2606.28769#S4.SS1.p2.2)\.