Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

arXiv cs.CL 06/16/26, 04:00 AM Papers
Summary
This paper evaluates deep learning models (LSTM, TCN, Transformer) on the WESAD dataset for multimodal emotion recognition from physiological signals, showing that an ensemble achieves 98.91% accuracy.
arXiv:2606.15026v1
# Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals
Source: [https://arxiv.org/html/2606.15026](https://arxiv.org/html/2606.15026)
###### Abstract

Physiological stress and emotion recognition are important for health monitoring and affective computing\. In this work, we present a comprehensive evaluation of deep learning models such as Long Short\-Term Memory \(LSTM\), Temporal Convolutional Networks \(TCN\), and Transformer on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals\. We perform ablation studies to assess the individual contributions of each modality by training models on wrist\-only and chest\-only inputs\. In addition, we implement a late\-fusion ensemble strategy that combines predictions from all three architectures trained on multimodal input\. We also employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model\. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist\-only configuration\. The ensemble method yields the highest overall accuracy \(98\.91 ± 0\.13%\) and macro\-F1 score \(98\.56 ± 0\.17%\)\. These findings demonstrate the effectiveness of sensor fusion and ensemble\-based fusion in developing robust systems for physiological emotion recognition\.

Note: An extended version containing supplementary analyses is included in the appendices\.

Keywords: Physiological signal processing, Multimodal emotion recognition, Deep learning, LSTM, TCN, Transformer, Sensor fusion, Wearable computing

Conference: 17th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics; June 30\-July 03, 2026; Rende \(CS\), Italy\. Copyright: cc\. CCS concepts: Computing methodologies → Neural networks; Computing methodologies → Ensemble methods; Applied computing → Health informatics\.

## 1\. Introduction

The ability to detect and classify human emotional states from physiological signals is increasingly important across domains including mental health monitoring, wearable computing, human\-computer interaction, and affective computing \(Calvo and D’Mello, [2010](https://arxiv.org/html/2606.15026#bib.bib2)\)\. Stress is a growing public health concern associated with chronic diseases, reduced quality of life, and decreased productivity \(Schneiderman et al\., [2005](https://arxiv.org/html/2606.15026#bib.bib3)\)\. The accurate and real\-time detection of stress and affective states can pave the way for timely interventions and personalized wellness solutions \(Healey and Picard, [2005](https://arxiv.org/html/2606.15026#bib.bib4)\)\. Traditional methods for detecting affect rely heavily on self\-reports or interviews, which are subjective and impractical for continuous monitoring \(Schmidt et al\., [2019](https://arxiv.org/html/2606.15026#bib.bib5); Sheikh et al\., [2021](https://arxiv.org/html/2606.15026#bib.bib6)\)\. Recent advances in wearable sensors have allowed the collection of rich physiological signals, such as electrodermal activity \(EDA\), respiration, body temperature, heart rate, and accelerometry data \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\)\. These signals enable real\-time inference of emotional states using data\-driven models \(Rissler et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib8); Larradet et al\., [2020](https://arxiv.org/html/2606.15026#bib.bib9)\)\. Deep learning models are particularly effective for modeling time\-series data\.
Architectures such as LSTM \(Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2606.15026#bib.bib10)\), TCN \(Lea et al\., [2017](https://arxiv.org/html/2606.15026#bib.bib1); Bai et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib15)\), and Transformers \(Vaswani et al\., [2017](https://arxiv.org/html/2606.15026#bib.bib11)\) are well\-suited to learning complex temporal dependencies in multimodal physiological data\. However, the effective use of different sensor modalities, understanding their relative contributions, and optimizing their fusion remain open challenges in wearable affect recognition \(Kulvicius et al\., [2025](https://arxiv.org/html/2606.15026#bib.bib12); Li et al\., [2024](https://arxiv.org/html/2606.15026#bib.bib13)\)\.

The goal of this work is to systematically evaluate temporal deep learning architectures \(LSTM, TCN, Transformer\) under unimodal, multimodal, and ensemble fusion settings to identify robust and generalizable approaches for physiological emotion recognition\. We benchmark these architectures on the WESAD dataset, a multimodal benchmark for wearable stress and affect detection, performing extensive ablation studies using wrist\-only and chest\-only modalities to understand their standalone performance and how each model adapts to unimodal input\. Finally, we propose an ensemble fusion approach that integrates predictions from all three models to improve classification robustness and accuracy\.

Our findings provide practical insight into the effectiveness of each model and modality configuration, offering guidance for the development of efficient, reliable, and scalable emotion recognition systems based on wearable sensor data\. To our knowledge, no previous study provides a unified and controlled comparison of LSTM, TCN, and Transformer architectures across unimodal, multimodal, and ensemble settings under a leave\-one\-subject\-out cross\-validation \(LOSO\-CV\) protocol on WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\)\. Existing works typically examine a single architecture, a single sensor location, or a single fusion strategy in isolation\. Our contribution lies in establishing a systematic and architecture\-agnostic benchmarking framework that clarifies when and why specific temporal models or sensor modalities are advantageous, effectively isolating the effects of architecture choice, sensor modality, and fusion design on affect recognition performance\.

## 2\. Motivation

Stress and affective disorders are among the leading contributors to the global burden of disease, affecting both mental and physical health across populations \(Calvo and D’Mello, [2010](https://arxiv.org/html/2606.15026#bib.bib2); Schneiderman et al\., [2005](https://arxiv.org/html/2606.15026#bib.bib3)\)\. Early and accurate detection of emotional states, such as stress and amusement, can be important in preventing burnout, improving productivity, and enabling timely interventions in clinical and workplace settings \(Schneiderman et al\., [2005](https://arxiv.org/html/2606.15026#bib.bib3); Sonnentag and Fritz, [2015](https://arxiv.org/html/2606.15026#bib.bib16)\)\. Wearable devices enable continuous and non\-invasive monitoring of physiological signals\. The increasing availability of multimodal sensors embedded in smartwatches, chest straps, and fitness trackers demands intelligent systems capable of interpreting these signals as meaningful emotional labels \(Healey and Picard, [2005](https://arxiv.org/html/2606.15026#bib.bib4); Schmidt et al\., [2019](https://arxiv.org/html/2606.15026#bib.bib5)\)\. However, developing such systems involves challenges in modeling temporal dependencies, handling heterogeneous signal modalities, and ensuring generalization across users and contexts \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\)\.
Previous studies have applied LSTMs \(Rostami et al\., [2024](https://arxiv.org/html/2606.15026#bib.bib19); Malviya et al\., [2023](https://arxiv.org/html/2606.15026#bib.bib20); Zitouni et al\., [2022](https://arxiv.org/html/2606.15026#bib.bib21)\), TCNs \(Ding et al\., [2024](https://arxiv.org/html/2606.15026#bib.bib25); Ingolfsson et al\., [2021](https://arxiv.org/html/2606.15026#bib.bib18); Alghoul et al\., [2025](https://arxiv.org/html/2606.15026#bib.bib26)\), and Transformers \(Vaswani et al\., [2017](https://arxiv.org/html/2606.15026#bib.bib11); Li and Zhang, [2025](https://arxiv.org/html/2606.15026#bib.bib22); Wu et al\., [2023](https://arxiv.org/html/2606.15026#bib.bib23); Vazquez\-Rodriguez et al\., [2022](https://arxiv.org/html/2606.15026#bib.bib24)\) to physiological affect recognition, but few have systematically compared these architectures under controlled and consistent conditions\. Recent works by Liao et al\. \(Liao et al\., [2025](https://arxiv.org/html/2606.15026#bib.bib31)\) and Choi \(Choi, [2025](https://arxiv.org/html/2606.15026#bib.bib32)\) explore ensemble and fusion strategies but focus on specific model designs rather than a unified cross\-architecture evaluation\. We adopt WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) as our benchmark because it enables a direct and reproducible comparison with previous work under a consistent LOSO\-CV protocol\.

## 3\. Dataset and Methodology

We adopt a strategy that combines advanced temporal deep learning architectures with both unimodal and multimodal data perspectives to develop a robust system for affect recognition using physiological signals\. Our pipeline consists of three main stages: \(1\) data preparation and segmentation, \(2\) architecture design with ablation\-based evaluation, and \(3\) multimodal ensemble fusion\.

### 3\.1\. Dataset Description

We use the publicly available WESAD dataset \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\), a widely adopted benchmark for physiological stress and affect recognition using wearable sensors\. The dataset contains multimodal physiological recordings collected from 15 participants \(12 male, 3 female\) during controlled experimental conditions designed to induce baseline, stress, and amusement states\. Physiological signals are acquired using two wearable devices: the Empatica E4 wristband and the RespiBAN chest sensor\. The wrist modality includes EDA, blood volume pulse \(BVP\), body temperature \(TEMP\), and accelerometer \(ACC\) signals, while the chest modality includes respiration \(RES\), electrocardiography \(ECG\), and ACC signals\. A summary of the physiological signals and their corresponding devices is provided in Table [1](https://arxiv.org/html/2606.15026#S3.T1)\. Each participant undergoes three affective conditions: baseline \(neutral state\), stress \(induced by the Trier Social Stress Test\), and amusement \(elicited through video stimuli\)\. These conditions serve as ground\-truth labels for supervised learning, where the task is to classify each time segment into one of the three affective states\. To evaluate generalization across individuals, we adopt a leave\-one\-subject\-out cross\-validation \(LOSO\-CV\) protocol, where models are trained on data from all but one participant and tested on the held\-out participant\. This process is repeated for all participants to ensure subject\-independent evaluation\. The full dataset contains over 5\.3 million samples across all modalities\. As the dataset exhibits moderate class imbalance among the three affective states, we employ stratified batching during training and report class\-wise performance metrics to ensure fair and reliable evaluation\.
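The LOSO-CV protocol described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `loso_folds` and the placeholder subject identifiers are assumptions made for the example.

```python
from typing import Iterator, List, Tuple


def loso_folds(subject_ids: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training_subjects, held_out_subject), one pair per subject."""
    unique = sorted(set(subject_ids))
    for held_out in unique:
        # Train on every subject except the one being held out.
        train = [s for s in unique if s != held_out]
        yield train, held_out


# Hypothetical identifiers: 15 placeholders, one per participant.
subjects = [f"P{i:02d}" for i in range(1, 16)]
folds = list(loso_folds(subjects))

for train_subjects, test_subject in folds[:2]:
    print(test_subject, len(train_subjects))
```

Each fold trains on 14 participants and tests on the remaining one, so no data from the test subject is seen during training.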

Table 1\. Physiological signals collected from wrist \(Empatica E4\) and chest \(RespiBAN\) devices in the WESAD dataset\.

| Device | Signals | Sampling Rate |
| --- | --- | --- |
| Empatica E4 (Wrist) | EDA, BVP, TEMP, ACC | 4–64 Hz |
| RespiBAN (Chest) | RES, ECG, ACC | 700 Hz |

### 3\.2\. Data Preparation

To prepare the data, we resampled all signals to a common rate of 4 Hz, temporally aligning wrist signals \(EDA, TEMP, ACC, BVP\) and chest signals \(RES, ECG, ACC\) to a consistent temporal resolution before segmentation\. This resampling focuses the models on low\-frequency temporal dynamics associated with affective states, as commonly adopted in affective computing studies \(Tanwar et al\., [2024](https://arxiv.org/html/2606.15026#bib.bib33)\)\.

The continuous signals were then segmented into non\-overlapping windows of 10 time steps, where each time step corresponds to one preprocessed multimodal sample at 4 Hz\. This window length balances temporal resolution and computational efficiency for affective state classification\. A sensitivity analysis evaluating the effect of resampling rate on model performance is provided in the supplementary material\. Categorical labels were one\-hot encoded, and stratified sampling was applied within the LOSO\-CV folds to ensure class balance and reproducibility during training\. Window segmentation and normalization were performed independently within each LOSO fold to avoid information leakage between training and testing subjects\. To ensure a fair comparison across architectures and modality configurations, all models used the same window length, normalization procedure, and segmentation strategy, so that differences in performance reflect architectural design rather than preprocessing choices\.
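As a rough illustration of the segmentation step, the sketch below splits a `(time, channels)` recording into non-overlapping windows of 10 steps. The helper name `segment` and the choice to drop trailing samples that do not fill a window are assumptions; the paper does not state how a partial final window is handled.

```python
from typing import List, Sequence

SAMPLING_RATE_HZ = 4   # common resampling rate stated in the text
WINDOW_STEPS = 10      # window length stated in the text


def segment(samples: Sequence[Sequence[float]],
            window: int = WINDOW_STEPS) -> List[List[List[float]]]:
    """Split a (time, channels) sequence into non-overlapping windows.

    Assumption: trailing samples that do not fill a whole window are dropped.
    """
    n_windows = len(samples) // window
    return [
        [list(row) for row in samples[i * window:(i + 1) * window]]
        for i in range(n_windows)
    ]


# Toy recording: 25 time steps, 2 channels.
recording = [[float(t), float(-t)] for t in range(25)]
windows = segment(recording)
window_seconds = WINDOW_STEPS / SAMPLING_RATE_HZ
print(len(windows), window_seconds)
```

At 4 Hz, a 10-step window covers 2.5 seconds of signal.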

### 3\.3\. Model Architectures

We adopt three model architectures known for their effectiveness in time\-series learning\. Each model is trained in unimodal and multimodal settings\.

LSTM\. These networks are a class of Recurrent Neural Networks \(RNNs\) designed to model long\-range temporal dependencies in sequences through memory cells and gating mechanisms \(Hochreiter and Schmidhuber, [1997](https://arxiv.org/html/2606.15026#bib.bib10)\)\. They have shown strong performance in physiological signal modeling, particularly for emotion and stress recognition\. LSTMs are well\-suited for capturing the temporal continuity and gradual transitions characteristic of affective states, including stress onset and relaxation periods\. We implement a two\-layer bi\-directional LSTM followed by a fully connected output layer with dropout regularization to mitigate overfitting\. The bi\-directional structure enables the model to leverage both past and future context, enhancing its ability to learn complex temporal patterns in physiological signals\. Despite the added depth, the model remains relatively lightweight, making it suitable for wearable applications\.

TCN\. These models are fully convolutional architectures designed for sequence modeling using dilated causal convolutions, allowing the model to efficiently capture long\-range temporal dependencies \(Lea et al\., [2017](https://arxiv.org/html/2606.15026#bib.bib1); Bai et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib15)\)\. Unlike LSTMs, TCNs process all time steps in parallel, enabling faster training and improved stability\. Their ability to capture multi\-scale temporal features makes them particularly well\-suited for modeling physiological signals of varying durations\. We implemented a stack of 1D convolutional layers with progressively increasing dilation rates, combined with residual connections and batch normalization to improve training stability and enhance generalization\.
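To make the idea of a dilated causal convolution concrete, here is a small pure-Python sketch. It is not the authors' TCN: the function names are illustrative, and the receptive-field formula assumes one convolution per dilation level (pass `convs_per_level=2` for residual blocks with two convolutions each).

```python
from typing import List, Sequence


def causal_dilated_conv1d(x: Sequence[float], weights: Sequence[float],
                          dilation: int) -> List[float]:
    """y[t] = sum_j weights[j] * x[t - j * dilation], zero-padded on the left."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation
            if idx >= 0:          # causal: never look at future samples
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int],
                    convs_per_level: int = 1) -> int:
    """Time steps visible to the last layer of a stacked dilated network."""
    return 1 + convs_per_level * (kernel_size - 1) * sum(dilations)


# An impulse at t=0 reappears every `dilation` steps, once per kernel tap.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = causal_dilated_conv1d(impulse, weights=[1.0, 1.0, 1.0], dilation=2)

# Kernel size 7 and dilations (1, 2, 4, 8) are the values reported in the paper.
rf = receptive_field(kernel_size=7, dilations=(1, 2, 4, 8))
print(response, rf)
```

Under the one-convolution-per-level assumption, the reported kernel size and dilation factors give a receptive field of 91 time steps, which comfortably covers a 10-step input window.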

Transformer\. Transformers rely on self\-attention mechanisms rather than recurrence or convolution, allowing them to model dependencies over arbitrary time steps without assuming local structure \(Vaswani et al\., [2017](https://arxiv.org/html/2606.15026#bib.bib11)\)\. Transformers identify salient signal patterns regardless of position\. In our implementation, we use a compact encoder with three layers of multi\-head self\-attention, learnable positional embeddings, and dropout regularization\. The model concludes with global average pooling across the temporal dimension to aggregate sequence\-level representations, followed by a softmax classification layer\. This architecture is particularly well\-suited for multimodal emotion recognition, as its self\-attention mechanism can capture global dependencies across asynchronous signal streams and dynamically focus on the most informative temporal segments, an essential capability when fusing heterogeneous biosignals\.
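The core operation can be illustrated with a deliberately simplified sketch: single-head scaled dot-product self-attention followed by global average pooling over time. Unlike the encoder described above, this example uses identity query/key/value projections, one head, and no positional embeddings; those simplifications and all function names are assumptions made for brevity.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head attention with identity Q/K/V projections (simplification)."""
    d = len(x[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in x]
              for qi in x]
    weights = [softmax(r) for r in scores]
    # Each output step is a weighted mix of all time steps.
    out = [[sum(w * v[c] for w, v in zip(wr, x)) for c in range(d)]
           for wr in weights]
    return out, weights


def global_average_pool(seq: Matrix) -> List[float]:
    """Average each feature over the time dimension."""
    return [sum(col) / len(seq) for col in zip(*seq)]


# Toy window: 4 time steps, 3 features.
window = [[0.1, 0.2, 0.3], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5], [1.0, 0.0, 0.2]]
attended, attn_weights = self_attention(window)
pooled = global_average_pool(attended)
print(len(attended), len(pooled))
```

Every row of `attn_weights` sums to one, so each output step is a convex combination of all time steps in the window, regardless of their distance.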

Fusion Strategies\. We adopt two complementary fusion strategies\. First, for multimodal input\-level fusion \(early fusion\), we concatenate wrist\- and chest\-derived features along the channel dimension before feeding them into the model\. This enables the temporal architectures \(LSTM, Transformer, TCN\) to learn joint representations across sensor types\. Second, for output\-level fusion \(late fusion\), we perform ensemble classification by averaging the *softmax* probability outputs of three independently trained models\. This dual\-fusion design enables the system to benefit from both the complementary nature of multimodal physiological signals and the diverse learning capacities of the architectures, which in turn reduces overfitting and enhances generalization in affect recognition tasks\. Throughout this paper, late fusion refers to the ensemble\-learning approach in which model\-level *softmax* probabilities are averaged to produce the final prediction\. This is distinct from the traditional multimodal\-learning definition of late fusion, where separate models operate on different modalities\. In our ensemble, all three architectures receive the same multimodal input\.
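Both fusion strategies reduce to simple array operations, sketched below in plain Python. The function names and the example probability vectors are hypothetical; only the operations themselves (channel-wise concatenation and averaging of softmax outputs) come from the text.

```python
from typing import List, Sequence

CLASSES = ("baseline", "stress", "amusement")


def early_fusion(wrist_window: Sequence[Sequence[float]],
                 chest_window: Sequence[Sequence[float]]) -> List[List[float]]:
    """Concatenate wrist and chest channels at every time step."""
    if len(wrist_window) != len(chest_window):
        raise ValueError("windows must have the same number of time steps")
    return [list(w) + list(c) for w, c in zip(wrist_window, chest_window)]


def late_fusion(prob_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the softmax outputs of independently trained models."""
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    return [sum(p[k] for p in prob_vectors) / n_models for k in range(n_classes)]


# Early fusion on a toy window: 2 time steps, 2 wrist + 1 chest channel.
fused = early_fusion([[0.1, 0.2], [0.3, 0.4]], [[9.0], [8.0]])

# Late fusion with made-up softmax outputs for one window.
lstm_p = [0.6, 0.3, 0.1]
tcn_p = [0.2, 0.7, 0.1]
transformer_p = [0.3, 0.6, 0.1]
avg = late_fusion([lstm_p, tcn_p, transformer_p])
predicted = CLASSES[avg.index(max(avg))]
print(fused, predicted)
```

In this toy case the LSTM alone would predict baseline, but the averaged distribution favours stress, which shows how the ensemble can override a single model's vote.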

### 3\.4\. Implementation Details

All models were implemented in Python using TensorFlow 2\.16\.2 and trained on two NVIDIA GeForce RTX 4090 GPUs \(24 GB GDDR6X each\) with fixed random seeds to ensure reproducibility\. Architectural configurations were selected based on empirical tuning guided by validation performance and previous work in physiological signal processing, exploring variations in layer depth, hidden units, kernel sizes, and regularization strength for each model family\. We report the mean and standard deviation of all evaluation metrics across the 15 LOSO folds to provide a reliable estimate of model performance and inter\-subject variability\. Training was performed using the Adam optimizer with early stopping based on validation loss to prevent overfitting\. Input features were standardized on a per\-subject basis within each LOSO fold to prevent data leakage, and categorical labels were one\-hot encoded\. For multimodal configurations, wrist\- and chest\-derived features were concatenated at the feature level \(early fusion\), while ensemble fusion was performed via averaging of *softmax* probabilities across independently trained models \(late fusion\)\.
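One plausible reading of per-subject standardization is that each subject's channel is z-scored with that subject's own mean and standard deviation, so statistics from the held-out subject never mix with those of the training subjects. The sketch below follows that reading; the helper name, the use of the population standard deviation, and the toy EDA values are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize_subject(channel: Sequence[float]) -> List[float]:
    """Z-score one channel using only this subject's own statistics."""
    mu = fmean(channel)
    sigma = pstdev(channel)
    if sigma == 0.0:
        # A constant channel carries no information; map it to zeros.
        return [0.0 for _ in channel]
    return [(v - mu) / sigma for v in channel]


# Made-up EDA samples for a single subject.
eda = [0.8, 1.0, 1.2, 1.4, 1.6]
z = standardize_subject(eda)
print([round(v, 3) for v in z])
```

After this step the channel has zero mean and unit variance for that subject, which removes subject-specific offsets such as differing baseline skin conductance.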

Hyperparameters\. The Transformer used embedding dimension 64, 4 attention heads, 3 encoder blocks, and dropout 0\.3\. The TCN used 128 filters, kernel size 7, and dilation factors \(1, 2, 4, 8\)\. The LSTM used two stacked bidirectional layers with 64 units each\. All models used Adam \(learning rate = 0\.001\), early stopping \(patience = 3\), and ReduceLROnPlateau \(factor = 0\.5, patience = 2\)\. Batch size was 32 for Transformer and 64 for LSTM and TCN\.
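The interplay between the two patience values can be traced with a small simulation. This is a simplified stand-in for the Keras callbacks, not their implementation: it ignores `min_delta`, `cooldown`, and `min_lr`, and the validation-loss sequence is invented for illustration.

```python
from typing import List, Sequence, Tuple


def simulate_schedule(val_losses: Sequence[float], lr: float = 1e-3,
                      stop_patience: int = 3, lr_patience: int = 2,
                      factor: float = 0.5) -> List[Tuple[int, float, float]]:
    """Return (epoch, val_loss, lr_after_epoch) until early stopping fires."""
    best = float("inf")
    stop_wait = 0   # epochs without improvement, for early stopping
    lr_wait = 0     # epochs without improvement, for LR reduction
    history = []
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stop_wait = 0
            lr_wait = 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor      # halve the learning rate on a plateau
                lr_wait = 0
        history.append((epoch, loss, lr))
        if stop_wait >= stop_patience:
            break
    return history


# Hypothetical validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
history = simulate_schedule(losses)
for epoch, loss, lr in history:
    print(epoch, loss, lr)
```

With patience 2 for the learning-rate reduction and patience 3 for early stopping, a stalled run gets exactly one halving of the learning rate and one further epoch to recover before training stops.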

## 4\. Experimental Results and Analysis

We present a comprehensive evaluation of the proposed models on the WESAD dataset \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) to assess their effectiveness in recognizing affective states from physiological signals\. Our experiments compare three model architectures \(LSTM, Transformer, and TCN\) in both unimodal and multimodal configurations under a LOSO\-CV protocol\. In addition, we conducted ablation studies to analyze the contribution of individual sensor modalities and examined an ensemble strategy that integrates complementary strengths across models\. These results provide insights into the impact of architectural choices and modality fusion on performance, generalization, and the design of robust affective computing systems\.

### 4\.1\. Ensemble Fusion Performance

We applied a late fusion ensemble strategy integrating predictions from LSTM, TCN, and Transformer models, each trained independently on the same multimodal input\. Instead of concatenating features, the fusion is performed by averaging the *softmax* probability outputs of the three models\. This approach leverages the complementary strengths of the architectures: LSTM captures long\-term dependencies; TCN models local and mid\-range patterns using dilated convolutions; and the Transformer is effective at identifying salient temporal segments through attention mechanisms\. As shown in Table [2](https://arxiv.org/html/2606.15026#S4.T2), the ensemble achieved the highest performance on all metrics, with 98\.91 ± 0\.13% accuracy, 98\.56 ± 0\.17% macro\-F1, and 98\.90 ± 0\.13% weighted\-F1\. This performance gain reflects the combined effect of variance reduction through ensemble averaging and the complementary inductive biases of the three architectures\. While variance reduction alone contributes meaningfully to ensemble gains, the diversity of temporal modeling strategies, namely sequential modeling in LSTM, local pattern extraction in TCN, and global attention in Transformer, provides complementary representations that contribute to the stability and robustness of the ensemble across subjects\. These results demonstrate the benefit of combining early sensor fusion with late model ensemble strategies, as the approach leverages both cross\-sensor synergy and variance reduction to improve performance\.

The ensemble provides superior macro\-F1 performance and markedly lower variance across LOSO folds, with a standard deviation of ±0\.13% in accuracy and ±0\.17% in macro\-F1, compared to ±0\.51%–0\.59% and ±0\.64%–0\.76% for individual models, respectively, demonstrating stable and consistent generalization across subjects\. Because macro\-F1 is more sensitive to class imbalance and reflects consistency across subjects, we report the ensemble as the best overall model, while noting that the Transformer remains the strongest standalone architecture\.

Table 2\. Ensemble fusion results combining LSTM, TCN, and Transformer predictions on the multimodal WESAD dataset\. All metrics are reported in percentage \(%\)\.

| Class | Precision | Recall | F1-score | Accuracy |
| --- | --- | --- | --- | --- |
| Baseline | 98.87 | 99.36 | 99.12 | 98.91 ± 0.13 |
| Stress | 99.50 | 99.75 | 99.62 | |
| Amusement | 97.94 | 95.96 | 96.94 | |
| *Macro Avg* | 98.77 | 98.36 | 98.56 ± 0.17 | |
| *Weighted Avg* | 98.90 | 98.91 | 98.90 ± 0.13 | |
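As a quick arithmetic check on Table 2, the macro-averaged F1 is the unweighted mean of the three per-class F1 scores. The sketch below recomputes it from the tabulated values; the helper `f1` is the standard harmonic-mean definition. Values recomputed from rounded precision and recall may differ from the table in the last digit, since the reported figures may be aggregated across folds.

```python
from statistics import fmean


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# Per-class F1 scores as reported in Table 2 (percent).
per_class_f1 = {"Baseline": 99.12, "Stress": 99.62, "Amusement": 96.94}

macro_f1 = fmean(per_class_f1.values())
stress_f1 = f1(99.50, 99.75)   # stress precision and recall from Table 2

print(round(macro_f1, 2), round(stress_f1, 2))
```

Because macro-F1 weights all classes equally, the lower amusement score pulls it below the weighted-F1, which is dominated by the larger baseline and stress classes.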

### 4\.2\. Model Architecture Comparison

Model performance varied across modalities and configurations\. In the multimodal setting \(Table [4](https://arxiv.org/html/2606.15026#S4.T4)\), Transformer models achieved the highest overall performance, likely due to their ability to model temporal dependencies through self\-attention mechanisms\. In modality\-constrained scenarios, the ablation study \(Table [5](https://arxiv.org/html/2606.15026#S4.T5)\) shows that TCN performed best in the wrist\-only setting, while LSTM achieved the strongest results for chest\-only inputs\. Table [3](https://arxiv.org/html/2606.15026#S4.T3) indicates that LSTM achieved the highest accuracy and F1\-scores in the unimodal wrist\-only configuration, followed by Transformer and TCN\. These findings suggest that architectural performance is protocol\-specific and reflects the interaction between input representation and model configuration\. Within the ablation protocol, TCN retained the strongest wrist\-only performance, while within the primary unimodal protocol, the bidirectional LSTM performed best\. These are setting\-specific observations and should not be interpreted as a universal architectural ranking across protocols\.

Table [3](https://arxiv.org/html/2606.15026#S4.T3) reports results from our primary unimodal pipeline, where wrist\-only inputs are taken from a dedicated preprocessed wrist representation pipeline and each architecture uses its primary configuration, including a bidirectional stacked LSTM, the main TCN settings, and the main Transformer configuration\. Table [5](https://arxiv.org/html/2606.15026#S4.T5), by contrast, reconstructs wrist\-only inputs by slicing wrist features from the fused multimodal tensor representation and rebuilding sequences from that sliced representation, resulting in a different effective input distribution\. The ablation models also use architecture\-specific configurations that differ from the primary benchmark, including differences in LSTM depth and directionality, TCN kernel size, dilation pattern, and filter count, as well as batch size, early stopping schedule, and learning rate scheduling\. Within each table, all model comparisons are conducted under identical conditions, which is the basis for our architectural conclusions\. These findings suggest that, while attention\-based architectures are particularly advantageous in fused or complex input settings, convolutional and recurrent models remain more effective in low\-signal or sensor\-limited scenarios\. Multimodal fusion consistently outperforms unimodal configurations across all architectures, with wrist\-only models offering competitive performance in resource\-constrained settings \(Table [3](https://arxiv.org/html/2606.15026#S4.T3) vs\. Table [5](https://arxiv.org/html/2606.15026#S4.T5)\)\.
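The slicing used by the ablation protocol can be pictured as selecting a contiguous block of channels from each fused window. In the sketch below, the channel names, their order (wrist first, then chest), and the three-axis accelerometer layout are all assumptions for illustration; the paper does not specify the tensor layout.

```python
from typing import List, Sequence

# Assumed channel layout of the fused window: wrist channels first, then chest.
WRIST_CHANNELS = ["EDA", "BVP", "TEMP", "ACC_x", "ACC_y", "ACC_z"]
CHEST_CHANNELS = ["RES", "ECG", "ACC_x", "ACC_y", "ACC_z"]


def slice_modality(window: Sequence[Sequence[float]],
                   modality: str) -> List[List[float]]:
    """Keep only the wrist or chest channels of a fused (time, channels) window."""
    n_wrist = len(WRIST_CHANNELS)
    if modality == "wrist":
        return [list(row[:n_wrist]) for row in window]
    if modality == "chest":
        return [list(row[n_wrist:]) for row in window]
    raise ValueError(f"unknown modality: {modality!r}")


# Toy fused window: 10 time steps, 11 channels, values encode (step, channel).
n_channels = len(WRIST_CHANNELS) + len(CHEST_CHANNELS)
fused_window = [[t + c / 100 for c in range(n_channels)] for t in range(10)]

wrist_only = slice_modality(fused_window, "wrist")
chest_only = slice_modality(fused_window, "chest")
print(len(wrist_only[0]), len(chest_only[0]))
```

Because the slices come from the already fused and normalized tensor, their value distribution can differ from that of a dedicated wrist-only preprocessing pipeline, which is consistent with the differences reported between the two tables.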

Table 3\. Unimodal classification performance on wrist signals\. All evaluation metrics are reported in percentage \(%\)\.

| Model | Class | Precision | Recall | F1-score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| LSTM | Baseline | 95.92 | 96.66 | 96.29 | 95.81 ± 0.48 |
| | Stress | 97.03 | 98.37 | 97.70 | |
| | Amusement | 93.14 | 88.54 | 90.78 | |
| | *Macro Avg* | 95.36 | 94.52 | 94.92 ± 0.53 | |
| | *Weighted Avg* | 95.79 | 95.81 | 95.79 ± 0.46 | |
| Transformer | Baseline | 93.53 | 94.46 | 94.00 | 93.33 ± 1.68 |
| | Stress | 93.79 | 98.62 | 96.15 | |
| | Amusement | 91.56 | 80.27 | 85.54 | |
| | *Macro Avg* | 92.96 | 91.12 | 91.87 ± 1.93 | |
| | *Weighted Avg* | 93.28 | 93.33 | 93.22 ± 1.70 | |
| TCN | Baseline | 88.90 | 95.46 | 92.06 | 91.06 ± 0.46 |
| | Stress | 94.84 | 92.22 | 93.51 | |
| | Amusement | 92.03 | 75.11 | 82.72 | |
| | *Macro Avg* | 91.92 | 87.60 | 89.43 ± 0.64 | |
| | *Weighted Avg* | 91.21 | 91.06 | 90.92 ± 0.48 | |

Table 4\. Multimodal fusion classification performance\. All evaluation metrics are reported in percentage \(%\)\.

| Model | Class | Precision | Recall | F1-score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| LSTM | Baseline | 96.62 | 99.29 | 97.93 | 97.70 ± 0.55 |
| | Stress | 99.37 | 99.00 | 99.18 | |
| | Amusement | 98.29 | 90.34 | 94.15 | |
| | *Macro Avg* | 98.09 | 96.21 | 97.09 ± 0.72 | |
| | *Weighted Avg* | 97.73 | 97.70 | 97.67 ± 0.56 | |
| Transformer | Baseline | 99.29 | 98.94 | 99.11 | 99.02 ± 0.51 |
| | Stress | 99.00 | 99.62 | 99.31 | |
| | Amusement | 98.21 | 98.21 | 98.21 | |
| | *Macro Avg* | 98.83 | 98.92 | 98.88 ± 0.64 | |
| | *Weighted Avg* | 99.02 | 99.02 | 99.02 ± 0.51 | |
| TCN | Baseline | 97.44 | 97.09 | 97.26 | 96.76 ± 0.59 |
| | Stress | 97.16 | 98.50 | 97.82 | |
| | Amusement | 93.85 | 92.58 | 93.21 | |
| | *Macro Avg* | 96.15 | 96.06 | 96.10 ± 0.76 | |
| | *Weighted Avg* | 96.75 | 96.76 | 96.75 ± 0.60 | |

Table 5\. Ablation study results for wrist\-only and chest\-only inputs\. All evaluation metrics are reported in percentage \(%\)\.

| Input | Model | Class | Precision | Recall | F1-score | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Wrist-Only | LSTM | Baseline | 96.41 | 93.47 | 94.92 | 94.38 ± 0.88 |
| | | Stress | 97.63 | 97.99 | 97.81 | |
| | | Amusement | 83.33 | 90.81 | 86.91 | |
| | | *Macro Avg* | 92.46 | 94.09 | 93.21 ± 0.97 | |
| | | *Weighted Avg* | 94.58 | 94.38 | 94.44 ± 0.85 | |
| | Transformer | Baseline | 88.00 | 92.12 | 90.01 | 88.73 ± 1.79 |
| | | Stress | 91.60 | 95.73 | 93.62 | |
| | | Amusement | 84.88 | 65.47 | 73.92 | |
| | | *Macro Avg* | 88.16 | 84.44 | 85.85 ± 2.46 | |
| | | *Weighted Avg* | 88.56 | 88.73 | 88.39 ± 1.90 | |
| | TCN | Baseline | 98.72 | 98.30 | 98.51 | 98.20 ± 0.41 |
| | | Stress | 98.88 | 99.37 | 99.13 | |
| | | Amusement | 95.30 | 95.73 | 95.51 | |
| | | *Macro Avg* | 97.63 | 97.80 | 97.72 ± 0.52 | |
| | | *Weighted Avg* | 98.20 | 98.20 | 98.20 ± 0.41 | |
| Chest-Only | LSTM | Baseline | 93.40 | 89.43 | 91.37 | 87.93 ± 0.70 |
| | | Stress | 85.68 | 89.34 | 87.47 | |
| | | Amusement | 76.27 | 80.72 | 78.43 | |
| | | *Macro Avg* | 85.12 | 86.50 | 85.76 ± 1.04 | |
| | | *Weighted Avg* | 88.20 | 87.93 | 88.02 ± 0.74 | |
| | Transformer | Baseline | 88.48 | 89.43 | 88.95 | 84.80 ± 0.84 |
| | | Stress | 81.50 | 81.81 | 81.65 | |
| | | Amusement | 78.74 | 75.56 | 77.12 | |
| | | *Macro Avg* | 82.91 | 82.26 | 82.57 ± 1.30 | |
| | | *Weighted Avg* | 84.75 | 84.80 | 84.77 ± 0.87 | |
| | TCN | Baseline | 95.28 | 85.95 | 90.37 | 86.35 ± 0.53 |
| | | Stress | 81.64 | 90.97 | 86.05 | |
| | | Amusement | 71.81 | 79.37 | 75.40 | |
| | | *Macro Avg* | 82.91 | 85.43 | 83.94 ± 0.71 | |
| | | *Weighted Avg* | 87.23 | 86.35 | 86.56 ± 0.62 | |

### 4\.3\. Ablation Studies on Sensor Modalities

To quantify each modality’s contribution, we selectively removed wrist\- or chest\-derived signals from the multimodal models\. Comparing these unimodal variants against the full multimodal configuration shows the benefit of multimodal fusion and clarifies how each modality contributes to model performance and generalization across subjects\.

Component and Signal Importance\. To understand what drives model performance, we systematically evaluated the contributions of key architectural components, such as the fusion layer, attention mechanism, and the temporal encoder, as well as individual physiological signals by selectively ablating them from the full model\. This analysis helps quantify the importance of each component and validate which design elements are critical to performance and which offer only marginal benefits for robust affect recognition\.

Modality Discrimination\. Although the ablation experiments isolate the technical contribution of each modality, we also analyze their physiological relevance\. To assess the contribution of each sensor modality, we trained models using only wrist\- or chest\-based inputs\. The wrist\-worn sensors primarily capture EDA, temperature, and motion through accelerometry, while the chest\-mounted sensors measure respiration and ECG\. This contextual analysis helps interpret why combining both modalities enhances affect recognition\.

Practical Deployment\. Wrist\-based devices, such as smartwatches, are more practical for real\-world use because of their convenience and non\-invasiveness\. Demonstrating competitive performance using wrist\-only input shows the feasibility of lightweight, wearable affect recognition systems\.

Findings\. As shown in Table [5](https://arxiv.org/html/2606.15026#S4.T5), our ablation results indicate that removing either sensor modality or disabling temporal modeling consistently leads to a drop in performance across all models, reinforcing the importance of multimodal fusion and temporal context\. The results in this table are produced under the ablation protocol, in which wrist\-only and chest\-only inputs are reconstructed by slicing features from the fused multimodal tensor, and model configurations are intentionally simplified and standardized relative to the primary benchmark\. Conclusions drawn here therefore reflect within\-ablation comparisons and should not be interpreted as a direct replication of the primary unimodal results in Table [3](https://arxiv.org/html/2606.15026#S4.T3)\. The TCN model achieved the strongest wrist\-only performance within this ablation setting \(98\.20 ± 0\.41% accuracy, 97\.72 ± 0\.52% macro\-F1 score\), outperforming both LSTM and Transformer\. LSTM also maintained high performance using wrist\-only input \(94\.38 ± 0\.88% accuracy\), with particularly high F1\-scores for stress \(97\.81\) and baseline \(94\.92\), while amusement remained more difficult to classify\. Across all models, amusement consistently resulted in the lowest scores, highlighting its subtle physiological expression and class imbalance\. In chest\-only configurations, performance declined further; for example, the Transformer model achieved 84\.80 ± 0\.84% accuracy, suggesting that wrist signals may carry more discriminative information for affect recognition in this dataset\. These findings indicate that while both sensor modalities contribute significantly, wrist signals alone can support accurate emotion recognition in constrained settings\. TCN and LSTM architectures, in particular, remain effective even without multimodal input, reinforcing their suitability for real\-time, wearable applications\.

### 4\.4\. Saliency\-Based Interpretation

Gradient\-based saliency maps show that ACC and temperature dominate wrist\-only predictions, while the multimodal model assigns greater importance to RES, HR, and EDA under stress conditions\. These observations are illustrative and subject\-specific\. Full saliency maps and population\-level analysis are provided in the supplementary material\.

### 4\.5\. Summary

Our analysis yields several findings with practical and methodological implications\. The ensemble achieves the highest overall performance \(98\.91 ± 0\.13% accuracy\), with markedly lower inter\-subject variance than individual models\. Wrist\-only configurations offer competitive performance in resource\-constrained settings, while multimodal fusion consistently improves robustness\. A detailed comparison with previous work on WESAD, including summary\-statistics baselines and earlier classical machine learning and deep learning results, is provided in the supplementary material, where our Transformer and ensemble models consistently outperform previously reported results under the same LOSO\-CV protocol\.

## 5\. Conclusion

We evaluated the effectiveness of various deep learning architectures and sensor modalities for automatic emotion recognition using physiological signals\. We conducted a comprehensive comparison of three model architectures, LSTM, TCN, and Transformer, across unimodal \(wrist\-only and chest\-only\) and multimodal input settings using the WESAD dataset\. Our results show that early fusion of chest and wrist signals improves performance over unimodal baselines, demonstrating the complementary nature of the two sensor types\. We implemented a late\-fusion ensemble strategy by averaging the predictions of the three architectures trained on fused inputs, which yielded the highest overall accuracy \(98\.91 ± 0\.13%\) and macro\-F1 score \(98\.56 ± 0\.17%\)\. Through ablation studies and ensemble experiments, we analyzed how architectural and modality choices influence performance, generalization, and practical applicability\. Our findings show that Transformer\-based models achieved strong generalization under the LOSO\-CV protocol and competitive performance in multimodal settings\. Wrist\-only models offer a practical trade\-off suitable for lightweight, wearable applications\. These results show the practical value of ensemble\-based fusion in wearable affect recognition, particularly for systems designed to generalize across diverse users\. Amusement remained the most difficult emotion to classify, likely due to overlapping physiological patterns and limited training samples\.
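As a rough sketch of late fusion by averaging, the snippet below averages per\-class probabilities from three models for a single window and picks the most probable class\. The probability values and helper names are illustrative assumptions\. Note that Table 9 labels the ensemble as majority voting, so the exact fusion rule used in the experiments may differ from this sketch\.

```python
LABELS = ("baseline", "stress", "amusement")


def average_probabilities(per_model_probs):
    """Average class probabilities from several models for one window."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [sum(p[k] for p in per_model_probs) / n_models
            for k in range(n_classes)]


def fuse_and_predict(per_model_probs, labels=LABELS):
    """Return the label with the highest averaged probability."""
    avg = average_probabilities(per_model_probs)
    best = max(range(len(avg)), key=avg.__getitem__)
    return labels[best]


# Hypothetical softmax outputs for one window (not taken from the paper).
probs = [
    [0.50, 0.30, 0.20],  # stand-in for the LSTM
    [0.20, 0.70, 0.10],  # stand-in for the TCN
    [0.30, 0.60, 0.10],  # stand-in for the Transformer
]
fused_probs = average_probabilities(probs)
fused_label = fuse_and_predict(probs)
```

Averaging keeps the confidence of each model in play, whereas a vote over hard labels would discard it; which of the two behaves better on WESAD is not something this sketch can show\.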

## 6\. Limitations and Future Work

Limitations\. One main limitation of this study is the use of a single dataset collected under controlled laboratory conditions\. This setting may not fully capture the variability, noise, and complexity of real\-world emotional states and physiological sensor readings\. Although WESAD enables a fair and controlled cross\-architecture comparison, which is the primary focus of this work, future evaluations on datasets with greater diversity in subjects, devices, and real\-world conditions are needed to further validate the generalizability of the proposed framework\. Furthermore, class imbalance and potential variability across sensor recordings introduce additional challenges to generalization\.

Future Work\. Future work will focus on validating the proposed framework on more diverse datasets, including ambulatory and real\-world settings, to better assess generalizability beyond controlled laboratory conditions\. We will also explore multi\-resolution representations that combine model predictions across temporal resolutions, preserving both low\-frequency affective dynamics and high\-frequency morphological features, as our sensitivity analysis shows that different architectures respond differently to resampling rate\. Improving model interpretability through population\-level techniques such as SHAP and Integrated Gradients remains an important direction to support clinical and behavioral applications\.

## References

- K\. Alghoul, H\. Al Osman, and A\. El Saddik \(2025\) Enhancing Generalization in PPG\-Based Emotion Measurement with a CNN\-TCN\-LSTM Model\. In 2025 IEEE International Instrumentation and Measurement Technology Conference \(I2MTC\), pp\. 1–6\.
- S\. Bai, J\. Z\. Kolter, and V\. Koltun \(2018\) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling\. arXiv preprint arXiv:1803\.01271\.
- B\. Behinaein, A\. Bhatti, D\. Rodenburg, P\. Hungler, and A\. Etemad \(2021\) A transformer architecture for stress detection from ECG\. In Proc\. of the 2021 ACM International Symposium on Wearable Computers, pp\. 132–134\.
- R\. A\. Calvo and S\. D’Mello \(2010\) Affect detection: An interdisciplinary review of models, methods, and their applications\. IEEE Transactions on affective computing 1\(1\), pp\. 18–37\.
- H\. Choi \(2025\) Emotion recognition using a Siamese model and a late fusion\-based multimodal method in the WESAD dataset with hardware accelerators\. Electronics 14\(4\), pp\. 723\.
- Y\. Ding, S\. Zhang, C\. Tang, and C\. Guan \(2024\) MASA\-TCN: Multi\-Anchor Space\-Aware Temporal Convolutional Neural Networks for Continuous and Discrete EEG Emotion Recognition\. IEEE Journal of Biomedical and Health Informatics 28\(7\), pp\. 3953–3964\.
- P\. Garg, J\. Santhosh, A\. Dengel, and S\. Ishimaru \(2021\) Stress detection by machine learning and wearable sensors\. In Companion Proceedings of the 26th International Conference on Intelligent User Interfaces, pp\. 43–45\.
- J\. A\. Healey and R\. W\. Picard \(2005\) Detecting stress during real\-world driving tasks using physiological sensors\. IEEE Transactions on intelligent transportation systems 6\(2\), pp\. 156–166\.
- S\. Hochreiter and J\. Schmidhuber \(1997\) Long Short\-Term Memory\. Neural computation 9\(8\), pp\. 1735–1780\.
- T\. M\. Ingolfsson, X\. Wang, M\. Hersche, A\. Burrello, L\. Cavigelli, and L\. Benini \(2021\) ECG\-TCN: Wearable cardiac arrhythmia detection with a temporal convolutional network\. In 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems \(AICAS\), pp\. 1–4\.
- T\. Kulvicius, D\. Zhang, L\. Poustka, S\. Bölte, L\. Jahn, S\. Flügge, M\. Kraft, M\. Zweckstetter, K\. Nielsen\-Saines, F\. Wörgötter, et al\. \(2025\) Deep learning empowered sensor fusion boosts infant movement classification\. Communications medicine 5\(1\), pp\. 16\.
- F\. Larradet, R\. Niewiadomski, G\. Barresi, D\. G\. Caldwell, and L\. S\. Mattos \(2020\) Toward emotion recognition from physiological signals in the wild: approaching the methodological issues in real\-life data collection\. Frontiers in psychology 11, pp\. 1111\.
- C\. Lea, M\. D\. Flynn, R\. Vidal, A\. Reiter, and G\. D\. Hager \(2017\) Temporal convolutional networks for action segmentation and detection\. In Proc\. of the IEEE Conference on Computer Vision and Pattern Recognition, pp\. 156–165\.
- F\. Li and D\. Zhang \(2025\) Transformer\-Driven Affective State Recognition from Wearable Physiological Data in Everyday Contexts\. Sensors 25\(3\)\.
- Y\. Li, M\. E\. H\. Daho, P\. Conze, et al\. \(2024\) A review of deep learning\-based information fusion techniques for multimodal medical image classification\. Computers in Biology and Medicine 177, pp\. 108635\.
- L\. Liakopoulos, N\. Stagakis, E\. I\. Zacharaki, and K\. Moustakas \(2021\) CNN\-based stress and emotion recognition in ambulatory settings\. In 2021 12th international conference on information, intelligence, systems & applications \(IISA\), pp\. 1–8\.
- Y\. Liao, Y\. Gao, F\. Wang, L\. Zhang, Z\. Xu, and Y\. Wu \(2025\) Emotion recognition with multiple physiological parameters based on ensemble learning\. Scientific Reports 15\(1\), pp\. 19869\.
- L\. Malviya, S\. Mal, R\. Kumar, B\. Roy, U\. Gupta, et al\. \(2023\) Mental stress level detection using LSTM for WESAD dataset\. In Proceedings of data analytics and management: ICDAM 2022, pp\. 243–250\.
- P\. Prajod and E\. André \(2022\) On the generalizability of ECG\-based stress detection models\. In 2022 21st IEEE ICMLA, pp\. 549–554\.
- R\. Rissler, M\. Nadj, M\. X\. Li, M\. T\. Knierim, and A\. Maedche \(2018\) Got Flow? Using Machine Learning on Physiological Data to Classify Flow\. In Extended abstracts of the 2018 CHI conference on human factors in computing systems, pp\. 1–6\.
- A\. Rostami, K\. Motaman, B\. Tarvirdizadeh, K\. Alipour, and M\. Ghamari \(2024\) LSTM\-based real\-time stress detection using PPG signals on Raspberry Pi\. IET Wireless Sensor Systems 14\(6\), pp\. 333–347\.
- P\. Schmidt, A\. Reiss, R\. Duerichen, C\. Marberger, and K\. Van Laerhoven \(2018\) Introducing WESAD, a Multimodal Dataset for Wearable Stress and Affect Detection\. In Proceedings of the 20th ACM international conference on multimodal interaction, pp\. 400–408\.
- P\. Schmidt, A\. Reiss, R\. Dürichen, and K\. Van Laerhoven \(2019\) Wearable\-based affect recognition—A review\. Sensors 19\(19\), pp\. 4079\.
- N\. Schneiderman, G\. Ironson, and S\. D\. Siegel \(2005\) Stress and health: psychological, behavioral, and biological determinants\. Annu\. Rev\. Clin\. Psychol\. 1\(1\), pp\. 607–628\.
- M\. Sheikh, M\. Qassem, and P\. A\. Kyriacou \(2021\) Wearable, environmental, and smartphone\-based passive sensing for mental health monitoring\. Frontiers in digital health 3, pp\. 662811\.
- S\. Sonnentag and C\. Fritz \(2015\) Recovery from Job Stress: The Stressor\-Detachment Model as an Integrative Framework\. Journal of organizational behavior 36\(S1\), pp\. S72–S103\.
- R\. Tanwar, O\. C\. Phukan, G\. Singh, et al\. \(2024\) Attention based hybrid deep learning model for wearable based stress recognition\. Engineering Applications of Artificial Intelligence 127, pp\. 107391\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention Is All You Need\. Advances in neural information processing systems 30\.
- J\. Vazquez\-Rodriguez, G\. Lefebvre, J\. Cumin, and J\. L\. Crowley \(2022\) Emotion recognition with pre\-trained transformers using multimodal signals\. In 2022 10th international conference on affective computing and intelligent interaction \(ACII\), pp\. 1–8\.
- Y\. Wu, M\. Daoudi, and A\. Amad \(2023\) Transformer\-based self\-supervised multimodal representation learning for wearable emotion recognition\. IEEE Transactions on Affective Computing 15\(1\), pp\. 157–172\.
- M\. S\. Zitouni, C\. Y\. Park, U\. Lee, L\. J\. Hadjileontiadis, and A\. Khandoker \(2022\) LSTM\-modeling of emotion recognition using peripheral physiological signals in naturalistic conversations\. IEEE Journal of Biomedical and Health Informatics 27\(2\), pp\. 912–923\.

## Appendix A\. Supplementary Material

This supplementary material provides additional experimental results, baseline comparisons, and interpretability analyses referenced in the main paper\. All experiments follow the same LOSO\-CV protocol and preprocessing pipeline described in Section 3 of the main paper\.

### A\.1\. Sensitivity Analysis: Resampling Rate

To empirically assess the impact of the 4 Hz resampling choice on model performance, we evaluated all three architectures under a 16 Hz resampling rate with proportionally scaled windows of 40 time steps, maintaining the same 2\.5\-second temporal coverage as the primary 4 Hz setting\. The results are summarized in Table [6](https://arxiv.org/html/2606.15026#A1.T6)\. LSTM and TCN improve at 16 Hz \(98\.30% ± 0\.26% and 98\.60% ± 0\.24% accuracy respectively\), while the Transformer performs better at 4 Hz \(99\.02% ± 0\.51% vs 97\.21% ± 0\.64%\)\. The standard deviation values decrease at 16 Hz for LSTM and TCN, suggesting more consistent generalization across subjects at higher resolution\. These architecture\-specific differences are modest, staying below 2 percentage points in accuracy, although the mean ± standard deviation ranges in Table [6](https://arxiv.org/html/2606.15026#A1.T6) overlap only for LSTM; overall, performance remains broadly robust across resampling rates\. The 4 Hz setting was therefore retained as the primary configuration to maintain temporal alignment across modalities, reduce computational cost, and ensure consistency across models\. Our sensitivity analysis shows that different architectures respond differently to resampling rate, supporting multi\-resolution modeling as an important direction for future work\.
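The window scaling described above follows from steps = rate × duration\. The sketch below checks that arithmetic and cuts a signal into windows; the non\-overlapping stride in `segment` is an assumption made for brevity, since the stride used in the experiments is not stated here\.

```python
def window_steps(rate_hz, duration_s):
    """Number of samples needed to cover duration_s seconds at rate_hz."""
    steps = rate_hz * duration_s
    if abs(steps - round(steps)) > 1e-9:
        raise ValueError("rate_hz * duration_s must be a whole number of samples")
    return int(round(steps))


def segment(signal, steps):
    """Split a 1-D signal into consecutive, non-overlapping windows."""
    return [signal[i:i + steps]
            for i in range(0, len(signal) - steps + 1, steps)]


steps_4hz = window_steps(4, 2.5)    # primary setting
steps_16hz = window_steps(16, 2.5)  # sensitivity setting
windows = segment(list(range(25)), steps_4hz)  # trailing partial window is dropped
```

Holding the duration fixed and scaling the step count is what keeps the two settings comparable: each window spans the same 2\.5 seconds, and only the sampling density changes\.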

Table 6\. Sensitivity analysis comparing multimodal classification performance at 4 Hz and 16 Hz resampling rates on the WESAD dataset\. Windows of 10 and 40 time steps are used respectively, maintaining identical 2\.5\-second temporal coverage\. All metrics are reported in percentage \(%\)\.

| Model | Rate | Accuracy | Macro\-F1 | Weighted\-F1 |
| --- | --- | --- | --- | --- |
| LSTM | 4 Hz | 97\.70 ± 0\.55 | 97\.09 ± 0\.72 | 97\.67 ± 0\.56 |
| LSTM | 16 Hz | 98\.30 ± 0\.26 | 97\.98 ± 0\.35 | 98\.30 ± 0\.26 |
| Transformer | 4 Hz | 99\.02 ± 0\.51 | 98\.88 ± 0\.64 | 99\.02 ± 0\.51 |
| Transformer | 16 Hz | 97\.21 ± 0\.64 | 96\.61 ± 0\.76 | 97\.20 ± 0\.65 |
| TCN | 4 Hz | 96\.76 ± 0\.59 | 96\.10 ± 0\.76 | 96\.75 ± 0\.60 |
| TCN | 16 Hz | 98\.60 ± 0\.24 | 98\.23 ± 0\.33 | 98\.61 ± 0\.24 |

### A\.2\. Summary\-Statistics Baselines

To assess whether the observed performance gains of temporal deep learning models can be explained by simple signal\-level differences, we evaluated two summary\-statistics\-based baselines under the same preprocessing pipeline and LOSO\-CV protocol used in our multimodal experiments\. For each 10\-step window, we extracted summary features including the mean, standard deviation, minimum, maximum, and first\-to\-last differences across all signal channels, resulting in a 24\-dimensional feature vector per window\.
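A minimal sketch of this kind of per\-window feature extraction is given below\. It computes the five named statistics for every channel; it does not attempt to reproduce the exact 24\-dimensional layout, because the channel count and the full feature list are not given here, and the use of the population standard deviation is likewise an assumption\.

```python
import statistics


def summary_features(window):
    """Flatten a (time_steps x channels) window into per-channel statistics."""
    features = []
    for channel in zip(*window):  # iterate over channels, not time steps
        features += [
            statistics.fmean(channel),
            statistics.pstdev(channel),  # population std (assumed variant)
            min(channel),
            max(channel),
            channel[-1] - channel[0],    # first-to-last difference
        ]
    return features


# Toy window with 4 time steps and 2 channels (placeholder values).
window = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
feats = summary_features(window)
```

Every statistic except the first\-to\-last difference is invariant to the order of samples inside the window, which is why such features serve as a baseline with little temporal information\.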

A Logistic Regression model trained on these features achieved 59\.68% ± 16\.86% accuracy and 52\.93% ± 20\.47% macro\-F1 across LOSO folds, indicating that linear relationships over aggregated features are insufficient to capture the discriminative structure of the data\. A Random Forest classifier using the same feature representation achieved 67\.50% ± 10\.89% accuracy and 57\.16% ± 13\.82% macro\-F1, showing modest improvement over the linear model but remaining substantially below all temporal deep learning models\. Beyond the mean performance gap, both baselines show high inter\-subject variance across LOSO folds: the standard deviation of accuracy across subjects is 16\.86% for Logistic Regression and 10\.89% for Random Forest, compared to 0\.13% for our ensemble\. Summary\-statistics representations therefore fail to generalize consistently across individuals, whereas temporal architectures do\. This suggests that temporal architectures learn generalizable affective representations rather than subject\-specific signal patterns, which is the main empirical claim of our work\.

These results confirm that the performance gains of our temporal models are not attributable to simple signal\-level differences between affective conditions\. Instead, they reflect the ability of temporal architectures to capture ordering, transition dynamics, and cross\-modal dependencies that cannot be recovered from per\-window summary statistics\. Detailed classification metrics are reported in Table [7](https://arxiv.org/html/2606.15026#A1.T7), and confusion matrices are provided in Table [8](https://arxiv.org/html/2606.15026#A1.T8)\. Mean ± standard deviation values are computed across the 15 LOSO folds, while per\-class metrics are derived from pooled predictions across all test folds\. These results are summarized in Section 4\.5 of the main paper and provide empirical support for the contribution of temporal modeling\.
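The difference between fold\-level and pooled reporting can be illustrated with a small leave\-one\-subject\-out loop\. Subject IDs, labels, and predictions below are toy placeholders, the model\-fitting step is omitted, and the population standard deviation is an assumed choice\.

```python
import statistics


def loso_folds(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: three subjects with different numbers of windows.
subjects = ["S2", "S2", "S3", "S3", "S3", "S4", "S4"]
y_true = [0, 1, 0, 1, 2, 0, 2]
y_pred = [0, 1, 0, 2, 2, 0, 0]  # stand-in for held-out predictions

fold_acc = []
for held_out, train_idx, test_idx in loso_folds(subjects):
    # A real pipeline would fit a model on train_idx before this point.
    fold_acc.append(accuracy([y_true[i] for i in test_idx],
                             [y_pred[i] for i in test_idx]))

mean_acc = statistics.fmean(fold_acc)   # reported as "mean"
std_acc = statistics.pstdev(fold_acc)   # reported as "± std" across folds
pooled_acc = accuracy(y_true, y_pred)   # pooled over all test windows
```

The fold mean weights every subject equally, while the pooled value weights every window equally, so the two can differ whenever subjects contribute different numbers of windows\.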

Table 7\. Multimodal classification performance of summary\-statistics baselines on the WESAD dataset\. All metrics are reported in percentage \(%\)\.

| Model | Class | Precision | Recall | F1\-score | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Logistic Regression | Baseline | 65\.61 | 72\.86 | 69\.05 | 59\.68 ± 16\.86 |
| | Stress | 55\.82 | 51\.06 | 53\.33 | |
| | Amusement | 40\.87 | 32\.82 | 36\.41 | |
| | *Macro Avg* | 54\.10 | 52\.25 | 52\.93 ± 20\.47 | |
| | *Weighted Avg* | 58\.52 | 59\.59 | 58\.85 ± 19\.90 | |
| Random Forest | Baseline | 73\.95 | 77\.82 | 75\.84 | 67\.50 ± 10\.89 |
| | Stress | 62\.34 | 76\.41 | 68\.66 | |
| | Amusement | 44\.83 | 19\.29 | 26\.97 | |
| | *Macro Avg* | 60\.37 | 57\.84 | 57\.16 ± 13\.82 | |
| | *Weighted Avg* | 65\.58 | 67\.57 | 65\.48 ± 11\.84 | |

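Table 8\. Confusion matrices for summary\-statistics baselines on the multimodal WESAD dataset\.

| Setting | Model | Actual Class | Predicted Baseline | Predicted Stress | Predicted Amusement |
| --- | --- | --- | --- | --- | --- |
| Multimodal \(LOSO\) | Logistic Regression | Baseline | 5135 | 1173 | 740 |
| | | Stress | 1632 | 2032 | 316 |
| | | Amusement | 1059 | 435 | 730 |
| | Random Forest | Baseline | 5485 | 1245 | 318 |
| | | Stress | 729 | 3041 | 210 |
| | | Amusement | 1203 | 592 | 429 |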
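The per\-class values in Table 7 can be recomputed from the pooled confusion matrix in Table 8\. The sketch below does this for Logistic Regression, with rows as actual classes and columns as predicted classes; the function name is illustrative\.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) per class.

    cm[i][j] is the number of windows of actual class i predicted as class j.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        actual = sum(cm[k])                          # row sum
        predicted = sum(cm[i][k] for i in range(n))  # column sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Logistic Regression counts from Table 8 (baseline, stress, amusement).
cm_logreg = [
    [5135, 1173, 740],
    [1632, 2032, 316],
    [1059, 435, 730],
]
metrics = per_class_metrics(cm_logreg)
macro_f1 = sum(m[2] for m in metrics) / len(metrics)
pooled_accuracy = sum(cm_logreg[k][k] for k in range(3)) / sum(map(sum, cm_logreg))
```

The pooled accuracy obtained this way matches the weighted\-average recall in Table 7 rather than the reported accuracy, which is a mean over LOSO folds\.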

### A\.3\. Confusion Matrices for Deep Learning Models

Table [9](https://arxiv.org/html/2606.15026#A1.T9) presents the confusion matrices for all deep learning models across wrist\-only, chest\-only, multimodal, and ensemble fusion settings\. Each row shows predicted label counts per actual class\.

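Table 9\. Confusion matrices for deep learning models across all input settings on the WESAD dataset\.

| Setting | Model | Actual Class | Predicted Baseline | Predicted Stress | Predicted Amusement |
| --- | --- | --- | --- | --- | --- |
| Wrist\-Only | LSTM | Baseline | 1317 | 13 | 79 |
| | | Stress | 14 | 781 | 2 |
| | | Amusement | 35 | 6 | 405 |
| | Transformer | Baseline | 1298 | 59 | 52 |
| | | Stress | 34 | 763 | 0 |
| | | Amusement | 143 | 11 | 292 |
| | TCN | Baseline | 1385 | 4 | 20 |
| | | Stress | 4 | 793 | 1 |
| | | Amusement | 14 | 5 | 426 |
| Chest\-Only | LSTM | Baseline | 1260 | 77 | 72 |
| | | Stress | 45 | 712 | 40 |
| | | Amusement | 44 | 42 | 360 |
| | Transformer | Baseline | 1260 | 104 | 45 |
| | | Stress | 99 | 652 | 46 |
| | | Amusement | 65 | 44 | 337 |
| | TCN | Baseline | 1211 | 102 | 96 |
| | | Stress | 29 | 725 | 43 |
| | | Amusement | 31 | 61 | 354 |
| Multimodal | LSTM | Baseline | 1399 | 4 | 6 |
| | | Stress | 7 | 790 | 1 |
| | | Amusement | 42 | 1 | 402 |
| | Transformer | Baseline | 1394 | 8 | 7 |
| | | Stress | 2 | 794 | 1 |
| | | Amusement | 8 | 0 | 438 |
| | TCN | Baseline | 1368 | 17 | 24 |
| | | Stress | 9 | 786 | 3 |
| | | Amusement | 27 | 6 | 412 |
| Ensemble Fusion | Majority Voting | Baseline | 1400 | 1 | 8 |
| | | Stress | 1 | 796 | 1 |
| | | Amusement | 15 | 3 | 427 |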

### A\.4\. Comparison with Previous Work

Tables [10](https://arxiv.org/html/2606.15026#A1.T10) and [11](https://arxiv.org/html/2606.15026#A1.T11) compare our models with previously reported results on WESAD under the same LOSO\-CV protocol\. Our Transformer and ensemble models consistently outperform all previous work across both multimodal and unimodal settings by a substantial margin\. Minor methodological differences, such as window lengths and preprocessing steps, may affect the metrics reported across studies, but results are compared under the same LOSO\-CV protocol to ensure consistency\.

Table 10\. Comparison with reported results on the WESAD dataset \(multimodal, 3\-class classification\)\. All metrics are reported as originally stated\.

| Model | Setting | F1\-score \(%\) | Accuracy \(%\) | Study | Cross\-Validation Protocol |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | All modalities | 58\.05 ± 1\.61 | 63\.56 ± 1\.73 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Random Forest | All modalities | 64\.08 ± 1\.68 | 74\.97 ± 1\.11 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| AdaBoost | All modalities | 68\.85 ± 0\.89 | 79\.57 ± 0\.93 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Linear Discriminant Analysis \(LDA\) | All modalities | 71\.56 | 75\.80 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| k\-Nearest Neighbour \(kNN\) | All modalities | 48\.70 | 56\.14 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Random Forest | All modalities | 65\.73 | 67\.56 | Garg et al\. \([2021](https://arxiv.org/html/2606.15026#bib.bib27)\) | LOSO\-CV |
| Support Vector Machine \(SVM\) | All modalities | 59\.64 | 59\.56 | Garg et al\. \([2021](https://arxiv.org/html/2606.15026#bib.bib27)\) | LOSO\-CV |
| k\-Nearest Neighbour \(kNN\) | All modalities | 58\.14 | 65\.00 | Garg et al\. \([2021](https://arxiv.org/html/2606.15026#bib.bib27)\) | LOSO\-CV |
| Linear Discriminant Analysis \(LDA\) | All modalities | 50\.44 | 67\.06 | Garg et al\. \([2021](https://arxiv.org/html/2606.15026#bib.bib27)\) | LOSO\-CV |
| AdaBoost | All modalities | 63\.82 | 64\.34 | Garg et al\. \([2021](https://arxiv.org/html/2606.15026#bib.bib27)\) | LOSO\-CV |
| **Our approaches** | | | | | |
| LSTM | All modalities | 97\.09 ± 0\.72 | 97\.70 ± 0\.55 | This work | LOSO\-CV |
| Transformer | All modalities | 98\.88 ± 0\.64 | 99\.02 ± 0\.51 | This work | LOSO\-CV |
| TCN | All modalities | 96\.10 ± 0\.76 | 96\.76 ± 0\.59 | This work | LOSO\-CV |
| Ensemble Fusion | All modalities | 98\.56 ± 0\.17 | 98\.91 ± 0\.13 | This work | LOSO\-CV |

Table 11\. Performance comparison of unimodal models on the WESAD Dataset\. All metrics are reported as originally stated\.

| Model | Setting | F1\-score \(%\) | Accuracy \(%\) | Study | Cross\-Validation Protocol |
| --- | --- | --- | --- | --- | --- |
| Decision Tree | All wrist | 43\.62 ± 1\.33 | 53\.98 ± 1\.79 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Random Forest | All wrist | 62\.86 ± 0\.65 | 74\.85 ± 0\.20 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| AdaBoost | All wrist | 64\.12 ± 0\.98 | 75\.21 ± 0\.77 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Linear Discriminant Analysis \(LDA\) | All wrist | 63\.24 | 70\.74 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| k\-Nearest Neighbour \(kNN\) | All wrist | 37\.20 | 45\.54 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Decision Tree | All chest | 53\.06 ± 0\.50 | 57\.68 ± 0\.40 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Random Forest | All chest | 60\.80 ± 1\.00 | 68\.76 ± 1\.35 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| AdaBoost | All chest | 64\.89 ± 0\.81 | 74\.74 ± 0\.94 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| Linear Discriminant Analysis \(LDA\) | All chest | 72\.49 | 76\.50 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| k\-Nearest Neighbour \(kNN\) | All wrist | 38\.39 | 46\.18 | WESAD \(Schmidt et al\., [2018](https://arxiv.org/html/2606.15026#bib.bib7)\) | LOSO\-CV |
| LDA | ECG\-based | 81\.3 | 85\.4 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| CNN \(spectrogram\) \(Liakopoulos et al\., [2021](https://arxiv.org/html/2606.15026#bib.bib30)\) | ECG\-based | 79\.4 | 82\.4 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| Transformer \(Behinaein et al\., [2021](https://arxiv.org/html/2606.15026#bib.bib29)\) \(without fine\-tuning\) | ECG\-based | 69\.7 | 80\.4 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| Random Forest | ECG\-based | 81\.3 | 86\.3 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| SVM | ECG\-based | 83\.2 | 87\.1 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| Multi\-Layer Perceptron \(MLP\) | ECG\-based | 85\.9 | 89\.5 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| ECG Emotion Model | ECG\-based | 85\.8 | 89\.7 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| Deep ECGNet | ECG\-based | 85\.7 | 90\.8 | Prajod and André \([2022](https://arxiv.org/html/2606.15026#bib.bib28)\) | LOSO\-CV |
| **Our approaches** | | | | | |
| LSTM | All wrist | 94\.92 ± 0\.53 | 95\.81 ± 0\.48 | This work | LOSO\-CV |
| Transformer | All wrist | 91\.87 ± 1\.93 | 93\.33 ± 1\.68 | This work | LOSO\-CV |
| TCN | All wrist | 89\.43 ± 0\.64 | 91\.06 ± 0\.46 | This work | LOSO\-CV |
| **Ablation study** | | | | | |
| LSTM | All wrist | 93\.21 ± 0\.97 | 94\.38 ± 0\.88 | This work | LOSO\-CV |
| Transformer | All wrist | 85\.85 ± 2\.46 | 88\.73 ± 1\.79 | This work | LOSO\-CV |
| TCN | All wrist | 97\.72 ± 0\.52 | 98\.20 ± 0\.41 | This work | LOSO\-CV |
| LSTM | All chest | 85\.76 ± 1\.04 | 87\.93 ± 0\.70 | This work | LOSO\-CV |
| Transformer | All chest | 82\.57 ± 1\.30 | 84\.80 ± 0\.84 | This work | LOSO\-CV |
| TCN | All chest | 83\.94 ± 0\.71 | 86\.35 ± 0\.53 | This work | LOSO\-CV |

### A\.5\. Saliency Maps

Figures [1](https://arxiv.org/html/2606.15026#A1.F1) and [2](https://arxiv.org/html/2606.15026#A1.F2) show gradient\-based saliency maps for representative samples from the baseline and stress classes using the unimodal and multimodal LSTM models respectively\. In the wrist\-only setting, accelerometer \(ACC\) and temperature signals dominate the saliency maps, particularly for baseline detection\. The multimodal model assigns greater importance to respiration \(RES\), heart rate \(HR\), and EDA, especially under stress conditions\. These observations are illustrative and subject\-specific, and should not be interpreted as population\-level findings\. A robust population\-level interpretation would require techniques such as SHAP or Integrated Gradients across all subjects\.
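The idea behind a gradient\-based saliency map can be sketched on a toy model\. The example below uses a small linear\-softmax classifier as a stand\-in, because its input gradient has a closed form; the saliency maps discussed above come from LSTM models, where the same quantity would be obtained by automatic differentiation\. All weights and inputs are made\-up values\.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def predict_proba(x, W, b):
    """Class probabilities of a linear-softmax model on a flat input x."""
    logits = [sum(w * v for w, v in zip(W[k], x)) + b[k] for k in range(len(W))]
    return softmax(logits)


def saliency(x, W, b, target):
    """Absolute gradient of p[target] with respect to each input feature.

    For a linear-softmax model: dp_t/dx_i = p_t * (W[t][i] - sum_j p_j W[j][i]).
    """
    p = predict_proba(x, W, b)
    scores = []
    for i in range(len(x)):
        expected_w = sum(p[j] * W[j][i] for j in range(len(W)))
        scores.append(abs(p[target] * (W[target][i] - expected_w)))
    return scores


# Toy input with three features and a three-class model (made-up numbers).
x = [0.5, -1.0, 2.0]
W = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [0.2, -0.3, 0.1]]
b = [0.0, 0.1, -0.1]
sal = saliency(x, W, b, target=1)
most_salient = max(range(len(sal)), key=sal.__getitem__)
```

A large score means that a small change in that input feature would move the target\-class probability the most for this particular sample, which is why such maps describe individual samples and not the population\.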

![Refer to caption](https://arxiv.org/html/2606.15026v1/x1.png)

Figure 1\. Saliency map for a sample from the baseline class using the unimodal LSTM model\.

![Refer to caption](https://arxiv.org/html/2606.15026v1/x2.png)

Figure 2\. Saliency map for a sample from the stress class using the multimodal LSTM model\.