MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation

arXiv cs.AI 06/11/26, 04:00 AM Papers
Summary
This paper introduces MA-DLE, a memory-based feature augmentation method for speech-based automatic depression level estimation, achieving state-of-the-art performance on the DAIC-WOZ and E-DAIC datasets.
arXiv:2606.11197v1 Announce Type: cross Abstract: Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.
# MA-DLE: Speech-based Automatic Depression Level Estimation via Memory Augmentation
Source: [https://arxiv.org/html/2606.11197](https://arxiv.org/html/2606.11197)
Xuzhi Wang¹, Xinran Wu¹, Ziping Zhao¹, Jianhua Tao², Björn W\. Schuller³,⁴

¹Tianjin Normal University, ²Tsinghua University, ³Technical University of Munich, ⁴Imperial College London

Xuzhi Wang, Xinran Wu and Ziping Zhao are with the School of Computer and Information Engineering, Tianjin Normal University \(TJNU\), Tianjin 300387, China \(e\-mail: wangxuzhi@tjnu\.edu\.cn; wxr@stu\.tjnu\.edu\.cn; ztianjin@126\.com\)\. Jianhua Tao is with the Department of Automation, Tsinghua University, Beijing 100084, China \(e\-mail: jhtao@tsinghua\.edu\.cn\)\. Björn W\. Schuller is the Chair of Health Informatics at the Technical University of Munich, Munich, Germany, and a Professor of Artificial Intelligence with the Department of Computing at Imperial College London, U\.K\. \(e\-mail: bjoern\.schuller@imperial\.ac\.uk\)\. Corresponding author: Ziping Zhao\.

This work was supported by the National Natural Science Foundation of China under Grants No\. 62071330, 61831022, U21B2020, and 62471249, the Humanities and Social Science Foundation of China Ministry of Education under Grant No\. 24YJC740076, and the DFG \(German Research Foundation\) Reinhart Koselleck\-Project AUDI0NOMOUS \(Grant No\. 442218748\)\.

###### Abstract

Speech\-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource\-constrained mental health settings\. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment\. Most existing approaches rely on RNN\-based architectures \(such as LSTM and GRU\) to model temporal information for depression estimation\. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long\-range dependencies\. To overcome this limitation, we introduce a memory\-based feature augmentation method that enhances the representational capacity of GRU\-extracted features\. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: \(1\) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and \(2\) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms\. To effectively fuse the memory\-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion \(HAF\) module\. Our method is evaluated on the widely used DAIC\-WOZ and E\-DAIC datasets, achieving state\-of\-the\-art performance\.

## I Introduction

Depression is a common mental disorder that can cause individuals to experience persistent low mood, making it difficult for them to engage in daily social activities\. In severe cases, it may even lead to suicide\. Data released by the World Health Organization in 2017 indicates that approximately 350 million people worldwide suffer from depression\. Furthermore, depression is projected to become the second leading cause of death by 2030\[[39](https://arxiv.org/html/2606.11197#bib.bib20)\]\. Traditional depression detection mainly relies on health questionnaires, which heavily depend on the subjective judgment of psychologists\. This approach is not only time\-consuming but also has limited accuracy, resulting in many patients failing to receive timely detection and treatment in the early stages\. Moreover, especially in remote and economically underdeveloped areas, there is a severe shortage of psychological professionals, and the insufficient number of specialists makes it difficult for many individuals with depressive symptoms to obtain timely and professional diagnosis and treatment\. Therefore, it is particularly urgent to develop automated depression monitoring systems to assist doctors in diagnosis\. Such automated systems can efficiently process large volumes of data and conduct preliminary screening of large populations in a short period, significantly improving detection efficiency and coverage\.

Many researchers have applied deep learning to the field of depression detection, primarily using GRU, LSTM and CNN models to capture temporal variations in speech\. However, these approaches have certain limitations in modeling speech sequences, as they struggle to effectively capture long\-range dependencies across time steps\. To illustrate this, we compute the cosine similarity between speech features from different time segments and the final output of the GRU, as shown in Fig\.[1](https://arxiv.org/html/2606.11197#S3.F1)\. The results indicate that the final output focuses mainly on a few adjacent speech segments, lacking the ability to model speech information over longer temporal spans\. This may lead to incomplete extraction of critical speech features, thereby affecting the accuracy of depression detection\.

Inspired by the above findings, we propose a memory\-based approach to capture long\-term dependencies across speech segments\. Such long\-range dependencies are crucial for accurate depression level estimation for the following reasons: 1\) The speech patterns of individuals with depression often exhibit long\-term dependencies\. For example, variations in speaking rate, intonation, and pauses may evolve gradually over extended periods\. 2\) Depression is typically characterized by a sustained low emotional state, which cannot be adequately captured by a single short speech segment\. 3\) Short speech segments are susceptible to interference from environmental noise or momentary emotional fluctuations\.

One major challenge in applying memory mechanisms to depression level prediction lies in the fact that speech signals contain a large amount of information unrelated to depression\. If such irrelevant signals are incorporated without proper filtering, they may contaminate subsequent predictions\. Although the output features extracted by GRU may not be sufficiently expressive, they still carry certain discriminative cues for depression\. Therefore, on the one hand, it is beneficial to focus on historical features that are highly correlated with the current frame to provide supplementary information; on the other hand, even frames with relatively low correlation may contain critical discriminative signals and should not be completely discarded\.

With these considerations in mind, we introduce a Memory\-Augmented Automatic Depression Level Estimation method \(MA\-DLE\), which aims to enhance the representational capability of GRU features and thereby improve depression level estimation\. Unlike the internal memory units in GRU, we propose an external structure with independent parameters to store various long\-term features in the data that are beneficial for depression level estimation\. Specifically, we first select features that are highly similar to the GRU output based on cosine similarity, treating them as semantic complements\. Then, from the relatively dissimilar features, we extract temporal variation patterns to identify potentially informative cues for depression detection\. Finally, we design a Hierarchical Attention Fusion \(HAF\) module to effectively leverage the complementary information embedded in the GRU outputs, similarity\-retrieved features, and dynamic features\.

The main contributions of our paper are as follows:

- We propose a novel framework for speech\-based depression level estimation\. To the best of our knowledge, this is the first work to introduce the memory bank mechanism into this task\.
- We propose a similarity\-based feature retrieval approach to condense the memory bank and enhance it with dynamic features\. These features are specifically designed to capture depressive cues, thereby improving the model’s ability to understand and predict depression levels\.
- We design a Hierarchical Attention Fusion \(HAF\) module to effectively integrate the features from the memory bank and the GRU\.
- Our method achieves state\-of\-the\-art performance on the DAIC\-WOZ and E\-DAIC datasets\.

## II Related Works

In recent years, as mental health issues have become increasingly prevalent, the early automatic detection of depression has emerged as a key research focus in multimodal affective computing\. Emotional cues embedded in modalities such as speech, facial expressions, and textual language offer new opportunities for objective assessment\. The following sections provide an overview of related work, primarily focusing on depression level estimation, while also covering aspects of depression detection\. We discuss both traditional approaches based on handcrafted feature extraction and recent advancements in deep learning\-based methods\[[46](https://arxiv.org/html/2606.11197#bib.bib7),[74](https://arxiv.org/html/2606.11197#bib.bib8),[13](https://arxiv.org/html/2606.11197#bib.bib9),[3](https://arxiv.org/html/2606.11197#bib.bib10),[64](https://arxiv.org/html/2606.11197#bib.bib11),[65](https://arxiv.org/html/2606.11197#bib.bib12),[6](https://arxiv.org/html/2606.11197#bib.bib13),[61](https://arxiv.org/html/2606.11197#bib.bib14)\]\.

### II\-A Traditional Approaches Based on Handcrafted Features

Handcrafted feature extraction has played an essential role in early research on automatic depression detection\. These features are designed based on domain knowledge to capture relevant cues from different modalities\. This section reviews representative studies based on handcrafted features from both speech and other modalities\.

In speech analysis, handcrafted features are designed to capture acoustic and prosodic variations that correlate with depressive symptoms\. Prior work \[[15](https://arxiv.org/html/2606.11197#bib.bib53)\] has explored the application of five types of handcrafted audio features in depression detection, including spectral features, cepstral features, glottal features, prosodic features, and voice quality features\. These features can describe the low\-frequency variations, intonation, speech rate, rhythm, and quality of speech, providing support for automatic depression detection\. Shin et al\. \[[43](https://arxiv.org/html/2606.11197#bib.bib51)\] employed a manual feature extraction method to extract four types of features from speech signals for depression detection, including glottal features, time\-frequency features, formant features, and other physical features\. These features are extracted individually within each speech segment and then averaged across the segment for subsequent analysis\.

Visual cues, particularly facial expressions and movements, also provide valuable information for depression detection\. Previous works have extracted dynamic facial features using methods such as Local Phase Quantization \(LPQ\)\[[51](https://arxiv.org/html/2606.11197#bib.bib44),[50](https://arxiv.org/html/2606.11197#bib.bib45)\], Local Binary Pattern Three Orthogonal Planes \(LBP\-TOP\)\[[5](https://arxiv.org/html/2606.11197#bib.bib48)\], Median Robust Local Binary Pattern \(MRLBP\)\[[12](https://arxiv.org/html/2606.11197#bib.bib47)\], and sparse coding\[[59](https://arxiv.org/html/2606.11197#bib.bib49)\]to capture subtle non\-verbal indicators of depression\.

Although handcrafted features have proven effective for depression detection and severity estimation, they heavily depend on expert design and may overlook subtle depression cues\. Moreover, they often lack robustness across diverse individuals and fail to capture the temporal dynamics that are crucial for accurate depression assessment\.

### II\-B Data\-Driven Methods with Deep Neural Networks

In recent years, with the rapid development of deep learning, an increasing number of studies have focused on depression prediction using architectures such as Convolutional Neural Networks \(CNNs\), Recurrent Neural Networks \(RNNs\), and Transformers\. These approaches enable automatic feature learning from raw multimodal data, such as speech, text, and video, and have demonstrated promising performance in modeling both spatial and temporal patterns associated with depressive symptoms\. These approaches offer a powerful and flexible alternative to traditional handcrafted feature\-based techniques\[[68](https://arxiv.org/html/2606.11197#bib.bib77),[19](https://arxiv.org/html/2606.11197#bib.bib78),[27](https://arxiv.org/html/2606.11197#bib.bib79),[26](https://arxiv.org/html/2606.11197#bib.bib80),[9](https://arxiv.org/html/2606.11197#bib.bib81),[76](https://arxiv.org/html/2606.11197#bib.bib82),[18](https://arxiv.org/html/2606.11197#bib.bib83),[45](https://arxiv.org/html/2606.11197#bib.bib84),[20](https://arxiv.org/html/2606.11197#bib.bib85),[54](https://arxiv.org/html/2606.11197#bib.bib86),[53](https://arxiv.org/html/2606.11197#bib.bib87),[57](https://arxiv.org/html/2606.11197#bib.bib88),[56](https://arxiv.org/html/2606.11197#bib.bib89),[55](https://arxiv.org/html/2606.11197#bib.bib91),[22](https://arxiv.org/html/2606.11197#bib.bib92),[60](https://arxiv.org/html/2606.11197#bib.bib93)\]\.

Some prior works have explored depression prediction based on speech\. Han et al\. \[[11](https://arxiv.org/html/2606.11197#bib.bib59)\] introduced STFN, which employs VQWTNet for feature mapping and stacked gated residual blocks for multi\-scale information\. Chen et al\. \[[2](https://arxiv.org/html/2606.11197#bib.bib57)\] proposed TTFNet, which encodes log\-Mel spectrograms and their derivatives into quaternions, extracts frequency and temporal features, fuses them through XConformer blocks, and balances training with GradNorm\. Zhang et al\. \[[71](https://arxiv.org/html/2606.11197#bib.bib24)\] proposed DEPA, a self\-supervised audio embedding for depression detection, extracted using an encoder\-decoder network on both in\-domain \(DAIC, MDD\) and out\-of\-domain \(Switchboard, Alzheimer’s\) datasets\.

In addition to single\-modality approaches, many studies have focused on multimodal fusion strategies to capture complementary information from different data sources\[[27](https://arxiv.org/html/2606.11197#bib.bib79),[76](https://arxiv.org/html/2606.11197#bib.bib82),[20](https://arxiv.org/html/2606.11197#bib.bib85)\]\. Guramritpal et al\.\[[38](https://arxiv.org/html/2606.11197#bib.bib2)\]introduced DepressNet, a multimodal framework employing a hierarchical attention mechanism for depression detection\. Their approach fuses multiscale temporal features from audio, video, and text modalities, leveraging a Bidirectional LSTM network and attention mechanisms for effective feature fusion\. Marriwala et al\.\[[23](https://arxiv.org/html/2606.11197#bib.bib3)\]developed a hybrid deep learning model for depression detection, integrating text and audio features\. The model combines Text CNN, Audio CNN, and hybrid LSTM/Bi\-LSTM architectures for robust feature extraction and classification\. Zhang et al\.\[[70](https://arxiv.org/html/2606.11197#bib.bib32)\]proposed DepITCM, which integrates audio\-visual features using an ITCM encoder, fuses time\-channel\-space information, and employs multi\-task learning\.

Different from existing works, our approach is the first to introduce the memory mechanism into depression level estimation, aiming to address the forgetting issue commonly observed in GRU/LSTM\-based models\. Specifically, we propose enhancing the memory features through a combination of similarity\-based feature retrieval and dynamic feature augmentation\.

### II\-C Memory\-Augmented Networks

Existing memory\-augmented recurrent neural networks \(RNNs\) can be broadly categorized into two types\. The first type consists of models based on internal states, such as LSTMs and GRUs, which utilize hidden states and gating mechanisms to retain short\-term and partially long\-term information during sequence modeling\. To our knowledge, most existing methods \[[11](https://arxiv.org/html/2606.11197#bib.bib59),[2](https://arxiv.org/html/2606.11197#bib.bib57),[36](https://arxiv.org/html/2606.11197#bib.bib1),[69](https://arxiv.org/html/2606.11197#bib.bib36),[49](https://arxiv.org/html/2606.11197#bib.bib39)\] for depression level estimation or detection fall into this category\. The second type comprises models enhanced with external memory structures \[[62](https://arxiv.org/html/2606.11197#bib.bib73),[16](https://arxiv.org/html/2606.11197#bib.bib74),[52](https://arxiv.org/html/2606.11197#bib.bib75),[31](https://arxiv.org/html/2606.11197#bib.bib76)\], which read from and write to external memory units to expand the network’s memory capacity, thereby enabling more effective capture of long\-range dependencies or storage of complex structured information\. Our approach belongs to this second category\. External memory mechanisms have been relatively underexplored in speech processing\. Emformer \[[42](https://arxiv.org/html/2606.11197#bib.bib71)\] introduces an efficient external memory that compresses long\-range historical context into an augmented memory bank, significantly reducing self\-attention computation for streaming ASR\. Chen et al\. \[[41](https://arxiv.org/html/2606.11197#bib.bib72)\] address the Face\-Driven Zero\-Shot Voice Conversion task by employing a memory\-based face\-voice alignment module, in which memory slots act as a bridge to align the two modalities, enabling the extraction of voice characteristics from face images\. While Emformer \[[42](https://arxiv.org/html/2606.11197#bib.bib71)\] primarily focuses on computational efficiency, Chen et al\. \[[41](https://arxiv.org/html/2606.11197#bib.bib72)\] emphasize aligning multimodal face and audio information\. In contrast, our approach leverages memory to capture rich emotional and temporal dynamics, which goes beyond modality alignment or efficiency optimization\.

To the best of our knowledge, our method is the first attempt to apply external memory mechanisms in affective computing\. Motivated by the limitations of GRU\-based models in retaining long\-range dependencies and the importance of temporal speech dynamics for depression detection, we introduce a similarity\-based feature retrieval mechanism and a dynamic\-feature memory module\. Furthermore, we design a hierarchical attention fusion strategy to effectively integrate different types of memory representations\.

## III Preliminary and Analysis

![Refer to caption](https://arxiv.org/html/2606.11197v1/x1.png)

Figure 1: The similarity between the GRU output and individual frame features\. We can observe from these representative examples that the later frames tend to exhibit higher similarity with the GRU output\. The x\-axis represents the temporally downsampled time\-dimension features, and the y\-axis denotes the cosine similarity\.

Recurrent Neural Networks \(RNNs\) are specifically designed to model temporal dependencies in sequential data\. By maintaining a hidden state that captures information from previous time steps, RNNs are capable of learning patterns and relationships across time, making them well\-suited for tasks involving time\-series analysis, natural language processing, and other domains where context and order are crucial\.

The Gated Recurrent Unit \(GRU\) is a variant of Recurrent Neural Networks \(RNNs\) that introduces gating mechanisms to better capture long\-term dependencies and alleviate the vanishing gradient problem\. Due to its efficiency and ability to model temporal patterns in sequential data, GRU has been widely applied in speech\-based depression detection, where capturing subtle temporal cues in vocal signals is crucial for identifying depressive symptoms\. The formulations of GRU are shown as follows:

$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]), \tag{1}$$

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]), \tag{2}$$

$$\tilde{h}_t = \tanh(W \cdot [r_t * h_{t-1}, x_t]), \tag{3}$$

$$h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t. \tag{4}$$

In Equation \(4\), the current hidden state $h_t$ is computed as a weighted sum of the previous hidden state $h_{t-1}$ and the current candidate state $\tilde{h}_t$, where the weights are determined by the update gate $z_t$\. If $z_t$ approaches 1, the current input \(and the candidate state generated from it\) has a stronger influence; conversely, if $z_t$ is close to 0, the influence of the historical state becomes dominant\. As GRU updates its state step by step, historical information is progressively overwritten or updated by new inputs, especially those from recent time steps\. This leads the model to become more sensitive to recent features, effectively assigning them greater weight\.
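
To make the update rule concrete, the following is a minimal NumPy sketch of a single GRU step written directly from Eqs. (1) to (4). The function name `gru_step`, the weight shapes, and the toy sizes are illustrative assumptions and are not taken from the paper; biases are omitted because the equations above omit them.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(h_prev, x_t, W_z, W_r, W_h):
    """One GRU update following Eqs. (1)-(4); biases omitted as in the text."""
    hx = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx)                        # update gate, Eq. (1)
    r_t = sigmoid(W_r @ hx)                        # reset gate, Eq. (2)
    h_cand = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # Eq. (3)
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand      # Eq. (4)
    return h_t, z_t

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
hidden, feat = 4, 3
W_z, W_r, W_h = (rng.normal(size=(hidden, hidden + feat)) for _ in range(3))

h = np.zeros(hidden)
for x in rng.normal(size=(5, feat)):               # five input frames
    h, z = gru_step(h, x, W_z, W_r, W_h)
```

Because each step blends the old state with a new candidate, the contribution of an early frame is rescaled by a factor of `(1 - z_t)` at every later step, which is one way to read the recency bias discussed above.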

We present a visualization of the cosine similarity between features extracted from different temporal segments and the GRU output\. As shown in Fig\.[1](https://arxiv.org/html/2606.11197#S3.F1), features from early temporal segments generally exhibit lower similarity with the GRU output and are more likely to show negative correlations, whereas features from later temporal segments tend to have higher similarity and exhibit more positive correlations\. This phenomenon suggests that during sequential modeling, the GRU has a relatively limited capacity to retain information from early inputs and instead relies more heavily on the contextual information contained in later segments to form its final representation\. This observation indirectly confirms the presence of a ‘forgetting’ mechanism in GRU when processing long sequences, and further highlights the temporal imbalance in sequence modeling — that is, different temporal segments contribute unequally to the final output\.

Notably, since the GRU output and the features of each input frame represent different levels of semantic abstraction, their similarity tends to be low\. Nevertheless, a consistent pattern can still be observed\.

![Refer to caption](https://arxiv.org/html/2606.11197v1/x2.png)

Figure 2: The overall architecture of our method\. The input audio signals are first transformed into Mel spectrograms, which are then processed by an audio encoder to extract high\-level audio embeddings\. These embeddings are subsequently fed in parallel into two branches\. The upper branch incorporates a Memory Bank module, while the lower branch consists of a stack of ConvGRU modules\. Then, the outputs of the two branches are fused through a Hierarchical Attention Fusion \(HAF\) module to obtain the final representation, which is passed through a regression head to predict the PHQ\-8 score\. ‘Orig\. Memory’ represents the original memory and ‘Repr\. Memory’ represents the representative memory\.

## IV Methods

In this section, we first provide an overview of the problem formulation and our method\. Next, we delve into the Memory Augmentation methods and transformer\-based fusion mechanisms, which are the core of our method\. Lastly, we introduce the loss function for model training\.

### IV\-A Overview

Problem Formulation\. Speech\-based depression level estimation can be formulated as follows\. Let $\{A_i\}_{i=0}^{n}$ denote the input audio sequence consisting of frames from time step 0 to $n$\. Let $y \in [0, 24]$ represent the PHQ\-8 score, a clinically validated indicator of depression severity\. The goal is to learn a regression function $f$ that maps the input sequence to the target score, expressed as $y = f(\{A_i\}_{i=0}^{n})$\.

Overall Architecture\. We propose a novel method for depression level estimation based on memory augmentation\. The overall framework of our approach is illustrated in Fig\.[2](https://arxiv.org/html/2606.11197#S3.F2)\. Initially, audio signals are transformed into Mel spectrograms, which are then passed through NetVLAD to extract audio embeddings\. The extracted speech feature embeddings are fed in parallel into two branches: the lower branch consists of a stack of ConvGRU modules, while the upper branch incorporates a Memory Bank module\. The ConvGRU in the lower branch models the temporal variations of the speech signal, capturing sequential features that are closely associated with depressive symptoms\. Meanwhile, the Memory Bank in the upper branch stores task\-relevant similar features and dynamic features, serving as a crucial semantic supplement to the ConvGRU output and enhancing the model’s ability to detect depression\-related cues\. Then, the GRU features, along with the task\-relevant similar features and dynamic features, are fused using the proposed Transformer\-based Hierarchical Attention Fusion \(HAF\) module\. Finally, the fused representation is passed through a regression head to predict the PHQ\-8 score\.
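
The fusion step can be pictured with a generic attention operation. The sketch below uses plain single-head scaled dot-product attention as a stand-in: it is not the HAF module, whose internal design is not described in this section. The function `attention_fuse`, the residual addition, the shared feature size `d`, and the random placeholder features are all assumptions made for illustration.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract the max for numerical stability
    return e / e.sum()

def attention_fuse(q, memory):
    """Single-head scaled dot-product attention: q (d,) attends over memory (M, d)."""
    scores = memory @ q / np.sqrt(q.size)
    weights = softmax(scores)        # one weight per memory slot
    context = weights @ memory       # weighted sum of memory slots, shape (d,)
    return q + context, weights      # residual fusion is an assumption

rng = np.random.default_rng(0)
d = 8
gru_out = rng.normal(size=d)         # placeholder for the ConvGRU branch output
similar = rng.normal(size=(3, d))    # placeholder for similarity-retrieved features
dynamic = rng.normal(size=(4, d))    # placeholder for dynamic features
fused, w = attention_fuse(gru_out, np.concatenate([similar, dynamic]))
```

Using the GRU output as the query means the memory contributes only in proportion to how relevant each slot is to the recurrent summary, which matches the role of the memory bank as a supplement rather than a replacement.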

### IV\-B Memory Augmentation

In this paper, we introduce an explicitly designed external memory module—memory bank—into the task of depression level estimation, aiming to store long\-term or global speech information\. In preliminary experiments, we followed existing approaches by either writing all temporal speech features directly into the memory bank or updating its content dynamically using a first\-in\-first\-out \(FIFO\) mechanism as shown in Fig\.[3](https://arxiv.org/html/2606.11197#S4.F3)\. However, the former tends to introduce a large amount of redundant information, while the latter fails to retain long\-term context, both of which limit the overall performance of the model\.
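
The two baseline strategies from the preliminary experiments can be sketched as follows. This is a minimal illustration, assuming the memory bank is a simple list of per-step features; the function names and the `capacity` parameter are hypothetical and do not come from the paper.

```python
from collections import deque

def build_full_memory(features):
    """Baseline (a): write every time step into the memory bank."""
    return list(features)

def build_fifo_memory(features, capacity):
    """Baseline (b): first-in-first-out; only the newest `capacity` steps survive."""
    memory = deque(maxlen=capacity)  # the oldest entry is evicted automatically
    for f in features:
        memory.append(f)
    return list(memory)

steps = list(range(10))              # integers stand in for per-step feature vectors
full = build_full_memory(steps)
fifo = build_fifo_memory(steps, capacity=3)
```

In this toy run the FIFO bank ends up holding only the last three steps, which illustrates why it cannot retain long-term context, while the full bank keeps all ten steps regardless of relevance.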

To address the above issues, we filter the candidate features written into the memory bank based on their similarity to the GRU output features, effectively removing information irrelevant to depression assessment and retaining complementary representations\. In addition, we extract the temporal variation patterns of speech features and incorporate them into the memory bank, thereby enhancing its ability to capture dynamic characteristics\. The detailed method is described as follows\.

Augmenting with Similarity\-Based Feature Retrieval\. The features output by the GRU are closely related to depression level estimation\. However, due to the GRU’s tendency to emphasize later frames in sequence modeling, important information in the earlier part of the sequence may be forgotten\. To address this, we compute the similarity between early\-frame features and the GRU output, and select those early features that exhibit high similarity\. These selected features, which contain valuable information for depression level estimation, serve as an effective complement to the GRU output\.

Formally, let $X = \{x_1, x_2, \ldots, x_T\}$ denote the original speech feature sequence \(the output of the audio encoder in Fig\.[2](https://arxiv.org/html/2606.11197#S3.F2)\), where each $x_t \in \mathbb{R}^d$\. We first calculate the cosine similarity between the output of the GRU and all the temporal speech features as:

$$s_i = \text{sim}(\mathbf{q}, \mathbf{x}_i) = \frac{\mathbf{q}^{\top}\mathbf{x}_i}{\|\mathbf{q}\| \cdot \|\mathbf{x}_i\|}, \quad \text{for } i = 1, 2, \dots, T, \tag{5}$$

where $\mathbf{q}$ denotes the output feature of the GRU, and $s_i$ represents the similarity between $\mathbf{q}$ and the $i$\-th feature in the speech feature sequence\.

Then, we select the top\-K most similar features across time, which can be interpreted as the most relevant information for depression evaluation and a complementary representation to the GRU\-extracted features:

$$\mathcal{M}_K = \text{TopK}\left(\{s_i\}_{i=1}^{T}, K\right), \tag{6}$$

where $\mathcal{M}_K$ denotes the set of the top\-K most similar features, and $\text{TopK}(\cdot, K)$ refers to the function that selects the K features with the highest similarity scores\.
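
The retrieval step in Eqs. (5) and (6) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions: the function names, the small `eps` added to avoid division by zero, and the choice to return the selected features in temporal order are not specified in the text.

```python
import numpy as np

def cosine_similarities(q, X, eps=1e-8):
    """Eq. (5): cosine similarity between q (d,) and each row of X (T, d)."""
    q_norm = np.linalg.norm(q)
    x_norms = np.linalg.norm(X, axis=1)
    return (X @ q) / (q_norm * x_norms + eps)   # eps guards against zero vectors

def top_k_features(q, X, k):
    """Eq. (6): return the k rows of X most similar to q, plus their indices."""
    s = cosine_similarities(q, X)
    idx = np.argsort(-s)[:k]         # indices of the k highest similarities
    idx = np.sort(idx)               # restore temporal order (an assumption)
    return X[idx], idx

# Toy example: four 2-D frame features and a GRU output q.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
q = np.array([1.0, 0.0])
memory, idx = top_k_features(q, X, k=2)
```

Here the first and third frames are retrieved: one points in the same direction as `q` and the other is at 45 degrees to it, while the orthogonal and opposite frames are filtered out.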

![Refer to caption](https://arxiv.org/html/2606.11197v1/x3.png)

Figure 3: Different memory bank construction methods\. \(a\) Speech features containing all time steps\. \(b\) First\-in\-first\-out \(FIFO\) strategy\. \(c\) Augmenting with similarity\-based feature retrieval, which first computes the similarity between each temporal feature segment and the GRU output, and then selects the top\-k most similar features\. \(d\) Augmenting with dynamic features, which capture temporal variations in speech that are indicative of depressive symptoms\.

Augmenting with Dynamic Features\. Besides incorporating features similar to the GRU outputs as complementary information, the temporal dynamics of speech features provide another important source of information\. Temporal changes in speech features can reveal behavioral and emotional fluctuations, which are indicative of depressive symptoms\.

Let $X = \{x_1, x_2, \ldots, x_T\}$ denote the original speech feature sequence \(the output of the audio encoder in Fig\.[2](https://arxiv.org/html/2606.11197#S3.F2)\), where each $x_t \in \mathbb{R}^d$\. To capture temporal variation, we perform frame\-wise differencing to compute the changes in speech features, as shown below\.

$$\Delta X = \{\Delta x_t = x_{t+1} - x_t \mid t = 1, 2, \ldots, T-1\}, \tag{7}$$

where $\Delta X$ represents the sequence of frame\-wise differences\.
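
As a small worked example of Eq. (7), the differencing step reduces to subtracting each frame from its successor. The sketch assumes the feature sequence is stored as a `(T, d)` array; the function name is illustrative.

```python
import numpy as np

def frame_differences(X):
    """Eq. (7): differences between adjacent frames. X: (T, d) -> (T-1, d)."""
    return X[1:] - X[:-1]

# Three 2-D frames give two difference vectors.
X = np.array([[0.0, 1.0], [1.0, 1.0], [3.0, 0.0]])
dX = frame_differences(X)
```

The result is equivalent to `np.diff(X, axis=0)`, and a sequence of `T` frames always yields `T - 1` differences.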

While frame\-wise differencing provides a straightforward representation of first\-order temporal changes between adjacent frames, it lacks the capacity to capture long\-range dependencies and complex dynamic patterns, such as progressively increasing pitch or emotional shifts\. To address this limitation, we design a lightweight temporal variation encoder to model the temporal dynamics of speech features\.

In our experiments, we observe that directly feeding the entire difference sequence into the temporal variation encoder fails to yield satisfactory results\. This is mainly because the dynamic variations in depressive speech are often extremely subtle and localized\. When the difference features are encoded as a whole sequence, these transient cues tend to be smoothed out by adjacent frames, making them harder to detect\.

To address this issue, we propose a more fine\-grained modeling approach that focuses on capturing local temporal dynamics\.

\(1\) Frame\-wise Modeling Strategy\. To better preserve and model these local dynamics, we separate the difference features along the temporal dimension and feed each time\-step individually into the temporal variation encoder\. This frame\-wise modeling strategy allows the encoder to focus on capturing fine\-grained, localized temporal variations that are critical for depression detection\. The formulation is shown below:

$$z_t=f_{dyn}(\Delta x_t),\quad t=1,2,\ldots,T-1,\tag{8}$$

where $f_{dyn}$ denotes the temporal variation encoder, which is designed to model higher\-order dynamics from the frame\-wise difference sequence $\Delta x_t$, and $z_t$ is the resulting latent representation that captures the temporal variation patterns of the speech features\.

In addition to its general architectural layout, the Temporal Variation Encoder integrates two important component\-level considerations:

\(2\) Max Pooling\. Depressive speech often exhibits subtle dynamic variations that can be easily masked by environmental noise or speaker\-specific characteristics\. To mitigate this, we insert a max pooling layer after the convolutional blocks\. This helps suppress low\-amplitude or unstable fluctuations while preserving informative variations that may carry depressive cues\.

\(3\) Batch Normalization\. We remove batch normalization layers from the encoder\. Due to the small batch sizes typically used in depression detection tasks and the high variability of individual speech patterns, batch normalization may introduce instability and hinder generalization\. Eliminating it enables more stable and consistent feature extraction under diverse input conditions\.

Then, we stack the frame\-wise features to form the final dynamic representation $Z$\.

$$Z=\mathrm{Concat}(z_1,z_2,\ldots,z_{T-1})\in\mathbb{R}^{(T-1)\times D'},\tag{9}$$

where $D'$ represents the dimensionality of the output from the dynamic encoder and $\mathrm{Concat}(\cdot)$ denotes the concatenation of multiple vectors along the temporal dimension\.
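The dynamic-feature branch in Eqs. (7) to (9) can be illustrated with a small pure-Python sketch. It is only an approximation under stated assumptions: the convolution uses no padding and stride 1, the pooling stride equals the pooling size, the 12 output channels are flattened into one vector per frame, and the kernel weights are random stand-ins for learned parameters. All function names are hypothetical and do not come from the authors' code.

```python
import random


def frame_differences(frames):
    """Eq. (7): first-order differences between adjacent frames (T frames -> T-1)."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(frames, frames[1:])]


def conv1d_relu(signal, kernels):
    """1D cross-correlation (1 input channel -> len(kernels) channels), then ReLU.

    Assumption: no padding and stride 1, so each output row is shorter than the input.
    """
    k = len(kernels[0])
    return [[max(0.0, sum(w[j] * signal[i + j] for j in range(k)))
             for i in range(len(signal) - k + 1)]
            for w in kernels]


def max_pool1d(channels, size):
    """Max pooling along each channel. Assumption: stride == size, remainder dropped."""
    return [[max(row[i:i + size]) for i in range(0, len(row) - size + 1, size)]
            for row in channels]


def encode_frame(delta, kernels, pool_size=7):
    """Eq. (8): encode ONE difference vector; no batch normalization is used."""
    pooled = max_pool1d(conv1d_relu(delta, kernels), pool_size)
    return [v for row in pooled for v in row]  # flatten channels into z_t


def dynamic_features(frames, kernels, pool_size=7):
    """Eq. (9): encode every difference frame independently, then stack along time."""
    return [encode_frame(d, kernels, pool_size) for d in frame_differences(frames)]


random.seed(0)
T, d = 6, 32  # toy sizes; the paper reports 256-dimensional features
frames = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
# 12 kernels of size 3, matching the reported 1 -> 12 channel expansion.
kernels = [[random.gauss(0, 1) for _ in range(3)] for _ in range(12)]

Z = dynamic_features(frames, kernels)
print(len(Z), len(Z[0]))  # (T - 1) rows, each of dimension D'
```

Encoding each difference vector on its own, rather than the whole difference sequence at once, is what the frame-wise modeling strategy refers to: no operation in `encode_frame` mixes information across time steps.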

![Refer to caption](https://arxiv.org/html/2606.11197v1/x4.png)

Figure 4: Hierarchical Attention Fusion \(HAF\) module\. The GRU output feature $q$, the similarity\-retrieved feature $M_K$, and the dynamic feature $Z$ are first processed by independent Transformer blocks\. Their resulting representations are then concatenated and fed into a global Transformer to enable hierarchical fusion of heterogeneous feature types\.

### IV\-C Hierarchical Attention Fusion

Due to the inherently different nature of the three types of features, including the GRU output, the similarity retrieved features, and the dynamic features, directly adding or concatenating them often leads to sub\-optimal performance\. Each type of feature captures distinct aspects of the input sequence: the GRU output encodes the global temporal context, the similarity\-retrieved features emphasize historically relevant patterns, and the dynamic features highlight temporal variations indicative of depressive symptoms\.

To address this issue, we propose a Hierarchical Attention Fusion \(HAF\) mechanism, as shown in Fig\. [4](https://arxiv.org/html/2606.11197#S4.F4)\. Specifically, the GRU output feature $q$, the similarity\-based retrieved feature $M_K$, and the dynamic feature $Z$ are first processed by three independent Transformer blocks, yielding the enhanced representations $q'$, $M_K'$, and $Z'$, respectively\.

$$q'=\mathcal{T}_q(q),\quad M_K'=\mathcal{T}_m(M_K),\quad Z'=\mathcal{T}_z(Z),\tag{10}$$

where $\mathcal{T}_q$, $\mathcal{T}_m$, and $\mathcal{T}_z$ denote three independent Transformer blocks\.

Then, the processed features are concatenated and passed through another Transformer layer to enable global self\-attention interaction and fusion\.

$$H=\mathcal{T}_{\mathrm{global}}\Big(\mathrm{Concat}(q',M_K',Z')\Big),\tag{11}$$

where $H$ denotes the globally fused output feature, and $\mathrm{Concat}$ represents the concatenation operation along the feature dimension\. This hierarchical design allows the model to fully exploit the complementary information embedded in each feature stream, leading to more discriminative representations for depression level estimation\.

### IV\-D Loss Function

We introduce Smooth L1 Loss into depression level estimation\. Smooth L1 Loss, which is closely related to the Huber Loss \(the two differ only by a constant scale factor\), behaves like Mean Squared Error \(MSE\) when the error is small, and like Mean Absolute Error \(MAE\) when the error is large\. In the DAIC\-WOZ and E\-DAIC datasets, there exist a small number of extreme samples with PHQ\-8 scores greater than 15, and in some cases, even exceeding 20\. Introducing Smooth L1 Loss helps alleviate the instability in model training caused by these outlier samples\. The formulation of the Smooth L1 Loss is shown as follows:

$$\mathrm{SmoothL1}(x)=\begin{cases}0.5\cdot\frac{x^2}{\beta},&\text{if }|x|<\beta\\ |x|-0.5\cdot\beta,&\text{otherwise},\end{cases}\tag{12}$$

where $x$ represents the difference between the predicted value and the true value, and $\beta$ is the smoothing parameter that determines the threshold between L2 \(quadratic\) and L1 \(linear\) behavior\.
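As a minimal sketch, Eq. (12) can be written directly in plain Python. The function names are hypothetical, the mean reduction over samples is an assumption, and treating $\beta=0$ as a pure L1 loss is a convention adopted here so that the quadratic branch never divides by zero.

```python
def smooth_l1(error, beta=1.0):
    """Eq. (12) for a single prediction error x = prediction - target."""
    abs_err = abs(error)
    if beta > 0 and abs_err < beta:
        return 0.5 * abs_err ** 2 / beta   # quadratic (L2-like) region
    return abs_err - 0.5 * beta            # linear (L1-like) region; |x| when beta == 0


def smooth_l1_loss(preds, targets, beta=1.0):
    """Mean Smooth L1 over a batch (mean reduction is an assumption)."""
    return sum(smooth_l1(p - t, beta) for p, t in zip(preds, targets)) / len(preds)


# A small error is penalized quadratically, a large one only linearly.
print(smooth_l1(0.5))                                  # 0.125
print(smooth_l1(3.0))                                  # 2.5
print(smooth_l1_loss([10.0, 4.0], [10.5, 20.0]))       # 7.8125
```

Because the penalty grows linearly for large errors, a single participant with a very high PHQ-8 score contributes far less to the gradient than it would under a squared-error loss.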

## V Experiments

TABLE I: PHQ\-8 Score Interpretation

| PHQ\-8 Total Score | Depression Severity | Description |
| --- | --- | --- |
| 0–4 | None or Minimal Depression | Generally considered within the normal emotional range |
| 5–9 | Mild Depression | Possible depressive symptoms |
| 10–14 | Moderate Depression | Clinically significant; may benefit from counselling or intervention |
| 15–19 | Moderately Severe Depression | Professional help is advisable, such as therapy or medication |
| 20–24 | Severe Depression | Strongly suggests major depressive disorder |

We conduct experiments on the widely used DAIC\-WOZ and E\-DAIC datasets, comparing our method with state\-of\-the\-art approaches for depression level estimation\. Subsequently, we provide a detailed analysis of the evaluation results, including comprehensive ablation studies\. In addition, to further validate the effectiveness of the memory module, we perform t\-SNE visualizations on the extracted features with and without memory\.

TABLE II: Comparison with Previous Unimodal and Multimodal Methods on the DAIC\-WOZ test set\.

| Method | Modality | Year | MAE ↓ | RMSE ↓ |
| --- | --- | --- | --- | --- |
| Milintsevich et al\. \[[24](https://arxiv.org/html/2606.11197#bib.bib61)\] | T | 2023 | 5\.51 | \- |
| Niu et al\. \[[25](https://arxiv.org/html/2606.11197#bib.bib27)\] | T | 2021 | 3\.73 | 4\.80 |
| Rumahorbo et al\. \[[37](https://arxiv.org/html/2606.11197#bib.bib60)\] | V | 2023 | 5\.39 | 6\.27 |
| Rathi et al\. \[[33](https://arxiv.org/html/2606.11197#bib.bib29)\] | V | 2019 | 4\.64 | 5\.98 |
| Zhao et al\. \[[77](https://arxiv.org/html/2606.11197#bib.bib30)\] | A\+T | 2019 | 4\.20 | 5\.66 |
| Lin et al\. \[[21](https://arxiv.org/html/2606.11197#bib.bib31)\] | A\+T | 2020 | 3\.75 | 5\.44 |
| Pan et al\. \[[30](https://arxiv.org/html/2606.11197#bib.bib62)\] | A\+V | 2023 | 4\.62 | 5\.78 |
| Li et al\. \[[17](https://arxiv.org/html/2606.11197#bib.bib58)\] | A\+V | 2025 | 4\.25 | 5\.34 |
| Zhang et al\. \[[73](https://arxiv.org/html/2606.11197#bib.bib67)\] | A\+V\+T | 2022 | 4\.97 | 6\.45 |
| Wei et al\. \[[58](https://arxiv.org/html/2606.11197#bib.bib90)\] | A\+V\+T | 2022 | 4\.92 | 5\.86 |
| Rasipuram et al\. \[[32](https://arxiv.org/html/2606.11197#bib.bib68)\] | A\+V\+T | 2022 | 4\.83 | 5\.76 |
| Zhang et al\. \[[72](https://arxiv.org/html/2606.11197#bib.bib69)\] | A\+V\+T | 2024 | 4\.48 | 5\.57 |
| Fang et al\. \[[8](https://arxiv.org/html/2606.11197#bib.bib65)\] | A\+V\+T | 2023 | \- | 5\.44 |
| Jayawardena et al\. \[[14](https://arxiv.org/html/2606.11197#bib.bib70)\] | A | 2023 | \- | 6\.84 |
| Zhang et al\. \[[71](https://arxiv.org/html/2606.11197#bib.bib24)\] | A | 2021 | 5\.60 | 6\.47 |
| Han et al\. \[[11](https://arxiv.org/html/2606.11197#bib.bib59)\] | A | 2023 | 5\.38 | 6\.36 |
| Zhang et al\. \[[70](https://arxiv.org/html/2606.11197#bib.bib32)\] | A | 2025 | 5\.21 | 6\.10 |
| Qureshi et al\. \[[28](https://arxiv.org/html/2606.11197#bib.bib21)\] | A | 2021 | 5\.14 | 6\.56 |
| Alhanai et al\. \[[1](https://arxiv.org/html/2606.11197#bib.bib22)\] | A | 2018 | 5\.13 | 6\.50 |
| Chen et al\. \[[2](https://arxiv.org/html/2606.11197#bib.bib57)\] | A | 2025 | 5\.09 | 6\.01 |
| Stepanov et al\. \[[44](https://arxiv.org/html/2606.11197#bib.bib25)\] | A | 2018 | 4\.96 | 6\.32 |
| Yang et al\. \[[66](https://arxiv.org/html/2606.11197#bib.bib26)\] | A | 2020 | 4\.63 | 5\.52 |
| Niu et al\. \[[27](https://arxiv.org/html/2606.11197#bib.bib79)\] | A | 2025 | 4\.62 | 5\.61 |
| Ours | A | 2025 | 4\.31 | 5\.49 |

TABLE III: Comparison with Previous Unimodal and Multimodal Methods on the E\-DAIC test set\.

| Method | Modality | Year | MAE ↓ | RMSE ↓ |
| --- | --- | --- | --- | --- |
| Zhang et al\. \[[75](https://arxiv.org/html/2606.11197#bib.bib33)\] | T | 2020 | \- | 4\.66 |
| Gönç et al\. \[[9](https://arxiv.org/html/2606.11197#bib.bib81)\] | T | 2025 | 3\.46 | 4\.37 |
| Xu et al\. \[[63](https://arxiv.org/html/2606.11197#bib.bib63)\] | V | 2024 | \- | 5\.99 |
| Shen et al\. \[[40](https://arxiv.org/html/2606.11197#bib.bib43)\] | V | 2024 | \- | 5\.83 |
| Rodrigues et al\. \[[36](https://arxiv.org/html/2606.11197#bib.bib1)\] | V | 2019 | \- | 5\.74 |
| Fan et al\. \[[7](https://arxiv.org/html/2606.11197#bib.bib64)\] | A\+T | 2019 | \- | 5\.91 |
| Ringeval et al\. \[[34](https://arxiv.org/html/2606.11197#bib.bib34)\] | A\+V | 2019 | \- | 6\.37 |
| Fang et al\. \[[8](https://arxiv.org/html/2606.11197#bib.bib65)\] | A\+V | 2023 | \- | 5\.17 |
| Li et al\. \[[17](https://arxiv.org/html/2606.11197#bib.bib58)\] | A\+V | 2025 | 4\.41 | 5\.10 |
| Pan et al\. \[[29](https://arxiv.org/html/2606.11197#bib.bib66)\] | A\+V | 2024 | 4\.32 | 5\.35 |
| Yin et al\. \[[67](https://arxiv.org/html/2606.11197#bib.bib37)\] | A\+V\+T | 2019 | \- | 5\.50 |
| Saggu et al\. \[[38](https://arxiv.org/html/2606.11197#bib.bib2)\] | A\+V\+T | 2022 | \- | 5\.36 |
| Zhang et al\. \[[75](https://arxiv.org/html/2606.11197#bib.bib33)\] | A\+V\+T | 2020 | \- | 4\.47 |
| Sun et al\. \[[48](https://arxiv.org/html/2606.11197#bib.bib38)\] | A\+V\+T | 2022 | 4\.37 | \- |
| Yuan et al\. \[[69](https://arxiv.org/html/2606.11197#bib.bib36)\] | A\+V\+T | 2024 | 3\.98 | 4\.91 |
| Sun et al\. \[[47](https://arxiv.org/html/2606.11197#bib.bib35)\] | A | 2021 | \- | 8\.67 |
| Ringeval et al\. \[[34](https://arxiv.org/html/2606.11197#bib.bib34)\] | A | 2019 | \- | 8\.00 |
| Rodrigues et al\. \[[36](https://arxiv.org/html/2606.11197#bib.bib1)\] | A | 2019 | \- | 6\.71 |
| Uddin et al\. \[[49](https://arxiv.org/html/2606.11197#bib.bib39)\] | A | 2022 | \- | 5\.78 |
| Han et al\. \[[11](https://arxiv.org/html/2606.11197#bib.bib59)\] | A | 2023 | 5\.38 | 6\.29 |
| Chen et al\. \[[2](https://arxiv.org/html/2606.11197#bib.bib57)\] | A | 2025 | 5\.00 | 5\.76 |
| Ours | A | 2025 | 4\.68 | 5\.72 |

### V\-A Datasets

DAIC\-WOZ Dataset\. The DAIC\-WOZ \[[34](https://arxiv.org/html/2606.11197#bib.bib34),[35](https://arxiv.org/html/2606.11197#bib.bib40),[4](https://arxiv.org/html/2606.11197#bib.bib41),[10](https://arxiv.org/html/2606.11197#bib.bib42)\] dataset is a multimodal conversational dataset specifically designed for emotion and mental health analysis, primarily aimed at automatic depression detection and assessment\. It was developed by the University of Southern California’s Institute for Creative Technologies to facilitate mental health evaluation through recordings of human\-computer interviews, including audio, video, and textual data\. The dataset contains a total of 189 interview sessions, including question\-and\-answer interactions from 59 depressed patients and 130 non\-depressed patients\. These samples are divided into three subsets: 107 sessions for training, 35 for validation, and 47 for testing\. For technical reasons, only 182 audio recordings are used\. For depression level estimation, the dataset provides scores based on the Patient Health Questionnaire\-8 \(PHQ\-8\)\. PHQ\-8 is a widely used clinical questionnaire consisting of eight items that measure depressive symptoms over the past two weeks\. In this dataset, PHQ\-8 scores are obtained from participants’ responses during structured interviews, serving as standardized labels for depression severity\. The interpretation of the PHQ\-8 scores is summarized in Table [I](https://arxiv.org/html/2606.11197#S5.T1)\. A PHQ\-8 score $\geq 10$ is considered to indicate clinically significant depression and is commonly used as the threshold for determining the presence of depression\.

E\-DAIC Dataset\. The E\-DAIC dataset is an extended version of DAIC\-WOZ, including 275 respondents\. The training set includes 163 samples, the validation set includes 56 samples, and the test set includes 56 samples\. The dataset also includes data such as facial action units \(AUs\) and gaze coordinates\. We use these two datasets in our experiments because both are multimodal and contain a large amount of patient information in various forms\.

### V\-BEvaluation Metrics

In our experiments, we follow existing work and use Mean Absolute Error \(MAE\) and Root Mean Square Error \(RMSE\) as evaluation metrics for the regression task\. MAE is comparatively robust to outliers, while RMSE amplifies the impact of larger errors through squaring, making it more sensitive to outliers\. Smaller values of both metrics indicate that the predictions are closer to the true values and that the model performs better\. The formulations for MAE and RMSE are shown as follows:

$$\mathrm{MAE}=\frac{1}{T}\sum_{j=1}^{T}|y_j-\bar{y}_j|,\tag{5}$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{T}\sum_{j=1}^{T}(y_j-\bar{y}_j)^2},\tag{6}$$

where $T$ represents the total number of samples, $y_j$ denotes the true value of the $j$\-th sample, and $\bar{y}_j$ denotes its predicted value\.
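The two metrics are straightforward to compute. The sketch below uses made-up PHQ-8 scores purely for illustration (they are not taken from either dataset), and the function names are hypothetical.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error over paired true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error over paired true and predicted scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Illustrative values only: the third sample has a large error of 6 points.
y_true = [3, 12, 20, 7]
y_pred = [5, 10, 14, 7]

print(mae(y_true, y_pred))   # 2.5
print(rmse(y_true, y_pred))  # sqrt(11), roughly 3.32
```

The single 6-point error lifts RMSE well above MAE, which is the sensitivity to larger errors described in the text.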

### V\-C Experimental Setup and Training Configuration\.

Our implementation is based on PyTorch\. We train the model for 500 epochs using the Adam optimizer with a learning rate of $1\times 10^{-3}$ and a weight decay of $1\times 10^{-5}$, with the learning rate scheduled by CosineAnnealingLR and a batch size of 2\. Temporal features are extracted using an 8\-layer stacked GRU, and the number of Mel filter banks is set to 80 during preprocessing to capture perceptually relevant frequency components\.
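To give a sense of how the learning rate evolves under cosine annealing, the sketch below evaluates the standard closed-form schedule in plain Python rather than calling PyTorch. Two values are assumptions, since the paper does not report them: the annealing period `t_max` is set to the 500 training epochs, and the minimum learning rate `eta_min` is set to 0.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-3, t_max=500, eta_min=0.0):
    """Closed-form cosine annealing: decays from base_lr to eta_min over t_max epochs.

    t_max=500 and eta_min=0.0 are assumptions; only base_lr=1e-3 is reported.
    """
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 125, 250, 375, 500):
    print(epoch, f"{cosine_annealing_lr(epoch):.6f}")
```

The rate stays close to its initial value early on, falls fastest around the midpoint, and flattens out near the end of training.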

### V\-D Overview of the Model Architecture and Hyperparameter Settings\.

Table [IV](https://arxiv.org/html/2606.11197#S5.T4) summarizes the key components of our model\. The backbone ConvGRU module consists of an 8\-layer unidirectional GRU with a hidden size of 256, an input dimension of 256, and a dropout rate of 0\.7\. The similarity\-based memory module retrieves relevant historical features using cosine similarity followed by top\-$K$ sampling, where $K=5$\. The dynamic memory branch conducts frame\-wise local modeling using a 1D convolution with a kernel size of 3 that expands the channel dimension from 1 to 12, followed by a ReLU activation and a max\-pooling operation with a kernel size of 7\. For the Independent Transformer, the hidden sizes are configured as follows: 256 for the GRU output feature $q$, 512 for the similarity\-retrieved feature $M_K$, and 1024 for the dynamic feature $Z$\. The Global Transformer block uses a hidden size of 512\. After the Independent Transformer processing, the GRU output feature, as well as the similarity\-retrieved and dynamic features, are further passed through a convolutional module to adjust their dimensionality\.

TABLE IV: Details of the Key Components of Our Method\.

| Module | Operations | Key Hyperparameters |
| --- | --- | --- |
| ConvGRU | 8\-Layer Unidirectional GRU | Hidden Size = 256; Input Dimension = 256; Dropout Rate = 0\.7 |
| Similarity\-Based Memory | Cosine Similarity; Top\-K Sampling | K = 5 |
| Dynamic Memory | Frame\-Wise Modeling; 1D Conv \+ ReLU \+ Max Pooling | Conv Kernel Size = 3; Channels: 1 → 12; MaxPool Kernel Size = 7 |
| HAF | Independent Transformer; Global Transformer | Hidden Size = 256 / 512 / 1024 \(Independent\); Hidden Size = 512 \(Global\) |
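The similarity-based memory described above (cosine similarity followed by top-K selection) can be sketched in a few lines of plain Python. The function names are hypothetical, and returning the selected features in their original temporal order is an assumption made here for illustration; the paper does not say how the retrieved features are ordered.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_top_k(query, history, k=5):
    """Return indices and features of the k historical frames most similar to the query."""
    ranked = sorted(range(len(history)),
                    key=lambda i: cosine_similarity(query, history[i]),
                    reverse=True)
    top = sorted(ranked[:k])  # assumption: keep the original temporal order
    return top, [history[i] for i in top]


# Toy 2-D example: the query plays the role of the GRU output.
query = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.1], [1.0, 1.0]]
indices, features = retrieve_top_k(query, history, k=2)
print(indices)  # [0, 3]
```

Because cosine similarity ignores vector magnitude, frame 3 is retrieved even though its norm is twice that of the query.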
### V\-E Experimental Results

To verify the effectiveness of the proposed method, we evaluate it on two widely used depression detection datasets, DAIC\-WOZ and E\-DAIC, and compare it with existing methods\. The performance of our method and other state\-of\-the\-art approaches on the DAIC\-WOZ dataset is shown in Table [II](https://arxiv.org/html/2606.11197#S5.T2), and the results on the E\-DAIC dataset are reported in Table [III](https://arxiv.org/html/2606.11197#S5.T3)\.

It is important to note that our method is based solely on audio input\. As many recent approaches rely on multimodal data, we include in our evaluation representative methods that utilize other modalities—including video\-based, text\-based, and multimodal approaches—to enable a more comprehensive and fair comparison\.

As shown in Table [II](https://arxiv.org/html/2606.11197#S5.T2), our method achieves an MAE of 4\.31 and an RMSE of 5\.49 on the DAIC\-WOZ dataset, outperforming all listed audio\-based methods\. Moreover, it also yields competitive results compared to approaches using other input modalities\. As shown in Table [III](https://arxiv.org/html/2606.11197#S5.T3), which presents the performance on the E\-DAIC dataset, our method achieves an MAE of 4\.68 and an RMSE of 5\.72, outperforming all audio\-based methods listed in the table and surpassing several methods based on other modalities\.

### V\-F Ablation Studies

TABLE V: Ablation study on the proposed memory bank on DAIC\-WOZ test set\.

| Method | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| Baseline | 4\.85 | 6\.05 |
| \+Full | 4\.78 | 6\.10 |
| \+FIFO | 4\.93 | 6\.24 |
| \+Sim | 4\.56 | 5\.89 |
| \+Sim & Dyn | 4\.31 | 5\.49 |

TABLE VI: Ablation study on the proposed memory bank on E\-DAIC test set\.

| Method | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| Baseline | 5\.16 | 6\.32 |
| w/o Sim | 4\.89 | 5\.99 |
| w/o Dyn | 4\.82 | 5\.87 |
| w/o HAF | 4\.79 | 5\.81 |
| Ours | 4\.68 | 5\.72 |

In this section, we conduct ablation studies to evaluate the effectiveness of the core components of our method\. Specifically, we focus on the following aspects: \(1\) the impact of the proposed memory banks, \(2\) the design choices within the temporal variation encoder, \(3\) the design choices on loss function and \(4\) the design choices on fusion mechanisms\.

The Impact of the Proposed Memory Banks\. To investigate the contribution of the proposed memory banks, we conduct an ablation study by removing or modifying this component from our framework\. The results on the DAIC\-WOZ dataset are summarized in Table [V](https://arxiv.org/html/2606.11197#S5.T5), while the results on the E\-DAIC dataset are presented in Table [VI](https://arxiv.org/html/2606.11197#S5.T6)\. ‘Baseline’ refers to a GRU\-based model without any memory bank modules\. ‘Full’ represents the variant where the memory bank is constructed using features from all frames in the sequence, without temporal selection or compression\. ‘FIFO’ represents the variant where the memory bank is updated in a first\-in\-first\-out manner, maintaining a fixed number of the most recent features\. ‘Sim’ denotes the variant where the memory bank is constructed by retrieving features from the sequence based on their similarity to the current query representation\. ‘Dyn’ represents the variant where inter\-frame variation features are integrated into the memory bank\. ‘HAF’ denotes the use of the Hierarchical Attention Fusion mechanism to integrate the GRU output features, the similarity\-based retrieved features, and the dynamic features\.
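For contrast with similarity-based retrieval, the ‘FIFO’ variant can be sketched with a bounded queue from the Python standard library. The class name and the capacity of 3 are hypothetical choices for illustration only.

```python
from collections import deque


class FIFOMemory:
    """Fixed-capacity memory bank that always keeps the most recent features."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def update(self, feature):
        self.buffer.append(feature)

    def contents(self):
        return list(self.buffer)


memory = FIFOMemory(capacity=3)
for t in range(6):          # integers stand in for per-frame feature vectors
    memory.update(t)
print(memory.contents())    # [3, 4, 5]
```

The bank only ever contains the latest frames regardless of their relevance to the current query, so early frames become unreachable once they are evicted.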

As shown in Table [V](https://arxiv.org/html/2606.11197#S5.T5), on the DAIC\-WOZ dataset, the baseline model that uses only GRU features achieves an MAE of 4\.85 and an RMSE of 6\.05\. Including all frame features in the memory bank results in an MAE of 4\.78 and an RMSE of 6\.10, a marginal MAE gain offset by a worse RMSE, so it does not lead to an overall performance improvement\. Using a FIFO strategy to update the memory bank results in an MAE of 4\.93 and an RMSE of 6\.24, leading to a decrease in performance\. This may be due to the inclusion of redundant and irrelevant information\. When features are retrieved based on similarity before being fed to the memory bank, the model achieves an MAE of 4\.56 \(\-0\.29\) and an RMSE of 5\.89 \(\-0\.16\), a clear improvement over the baseline\. After incorporating dynamic features into the memory bank, the model achieves an MAE of 4\.31 \(\-0\.54\) and an RMSE of 5\.49 \(\-0\.56\)\.

As shown in Table [VI](https://arxiv.org/html/2606.11197#S5.T6), on the E\-DAIC dataset, the baseline achieves an MAE of 5\.16 and an RMSE of 6\.21\. Without incorporating the similarity\-based feature retrieval, the method yields an MAE of 4\.89 \(\+0\.21\) and an RMSE of 5\.99 \(\+0\.27\)\. Excluding the dynamic features from our method results in an MAE of 4\.82 \(\+0\.14\) and an RMSE of 5\.87 \(\+0\.15\)\. When the Hierarchical Attention Fusion is removed, the model obtains an MAE of 4\.79 \(\+0\.11\) and an RMSE of 5\.81 \(\+0\.09\)\. Overall, removing any of these three components leads to a performance drop, showing consistent effects on both the E\-DAIC and DAIC\-WOZ datasets\.
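The deltas quoted in parentheses can be reproduced from the table values with a short arithmetic check. The sketch below copies the numbers from Tables V and VI; DAIC-WOZ deltas are taken relative to the baseline, and E-DAIC deltas relative to the full model. The helper name is hypothetical.

```python
def deltas(rows, reference):
    """Difference of every row from the reference row, rounded to two decimals."""
    ref_mae, ref_rmse = rows[reference]
    return {name: (round(m - ref_mae, 2), round(r - ref_rmse, 2))
            for name, (m, r) in rows.items() if name != reference}


# (MAE, RMSE) pairs copied from Table V (DAIC-WOZ).
table_v = {"Baseline": (4.85, 6.05), "+Full": (4.78, 6.10), "+FIFO": (4.93, 6.24),
           "+Sim": (4.56, 5.89), "+Sim & Dyn": (4.31, 5.49)}
# (MAE, RMSE) pairs copied from Table VI (E-DAIC), ablated variants and full model.
table_vi = {"w/o Sim": (4.89, 5.99), "w/o Dyn": (4.82, 5.87),
            "w/o HAF": (4.79, 5.81), "Ours": (4.68, 5.72)}

for name, (d_mae, d_rmse) in deltas(table_v, "Baseline").items():
    print(f"{name}: MAE {d_mae:+.2f}, RMSE {d_rmse:+.2f}")
for name, (d_mae, d_rmse) in deltas(table_vi, "Ours").items():
    print(f"{name}: MAE {d_mae:+.2f}, RMSE {d_rmse:+.2f}")
```

Negative values on DAIC-WOZ mean a variant improves on the baseline, while positive values on E-DAIC show how much each removed component costs.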

The Design Choices within the Temporal Variation Encoder\. To evaluate the impact of different design choices within the Temporal Variation Encoder, we perform a series of ablation studies\. The experimental results are shown in Table [VII](https://arxiv.org/html/2606.11197#S5.T7)\. ‘Split’ indicates that features from each frame are individually processed by the Temporal Variation Encoder\. ‘Pooling’ refers to introducing max pooling layers in the network to suppress low\-amplitude or unstable fluctuations\. ‘BN’ represents the batch normalization layer\.

As shown in Table [VII](https://arxiv.org/html/2606.11197#S5.T7), when each frame is not individually processed by the Temporal Variation Encoder, the model performance degrades, resulting in an MAE of 4\.56 \(\+0\.25\) and an RMSE of 5\.81 \(\+0\.32\)\. After removing the max pooling operation, the model performance declines, with the MAE increasing to 4\.61 \(\+0\.30\) and the RMSE to 5\.67 \(\+0\.18\)\. After introducing batch normalization, the model performance also declines, with the MAE increasing to 4\.49 \(\+0\.18\) and the RMSE to 5\.56 \(\+0\.07\)\.

TABLE VII: The Design Choices within the Temporal Variation Encoder on DAIC\-WOZ test set\.

| Method | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| With Split | 4\.31 | 5\.49 |
| W/o Split | 4\.56 | 5\.81 |
| With Pooling | 4\.31 | 5\.49 |
| W/o Pooling | 4\.61 | 5\.67 |
| With BN | 4\.49 | 5\.56 |
| W/o BN | 4\.31 | 5\.49 |

TABLE VIII: Ablation study on different fusion mechanisms on DAIC\-WOZ test set\.

| Method | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| Addition | 4\.57 | 5\.74 |
| Concatenation | 4\.64 | 5\.81 |
| Self\-attention | 4\.69 | 5\.88 |
| HAF | 4\.31 | 5\.49 |

The Design Choices on Fusion Mechanisms\. We conducted ablation studies to investigate the impact of different fusion mechanisms for integrating the GRU outputs with the memory\-enhanced features\. Our fusion strategy incorporates three types of features: the original GRU output, the similarity\-based retrieved features, and the dynamic feature representations\. ‘Addition’ refers to the direct summation of the three types of features\. ‘Concatenation’ refers to the sequential concatenation of the three types of features\. ‘Self\-attention’ refers to concatenating the three types of features first, followed by applying a Transformer\-based self\-attention mechanism over the combined feature representation\. ‘HAF’ refers to the proposed Hierarchical Attention Fusion mechanism, which first applies local self\-attention to each of the three feature types individually, and then concatenates these features to perform a global self\-attention over the combined representation\. From Table [VIII](https://arxiv.org/html/2606.11197#S5.T8), we observe that the HAF method achieves the best results\. By employing a hierarchical design with staged fusion, it leads to more discriminative representations for depression level estimation, reducing the MAE by 0\.26 and the RMSE by 0\.25 compared to the best results of other methods\.

![Refer to caption](https://arxiv.org/html/2606.11197v1/x5.png)

Figure 5: The t\-SNE visualization results of features with and without memory across different checkpoints\. \(a\) shows the t\-SNE visualization results at the middle stage of training, while \(b\) corresponds to the late stage\. It can be observed that at both stages, the inclusion of memory features leads to better clustering of samples with similar levels of depression\. This effect becomes more pronounced in the later stage of training, indicating that memory mechanisms help the model to learn more discriminative and compact feature representations over time\.

The Design Choices on Loss Function\. To investigate how the choice of loss function affects overall model performance, we conducted an ablation study comparing alternative training losses\. In Table [IX](https://arxiv.org/html/2606.11197#S5.T9), MAE indicates the use of mean absolute error loss, RMSE denotes the use of root mean square error loss, and Smooth L1 refers to the use of Smooth L1 loss\. From Table [IX](https://arxiv.org/html/2606.11197#S5.T9), we can observe that using the MAE loss results in an MAE of 4\.39 and an RMSE of 5\.84, while using the RMSE loss yields an MAE of 4\.51 and an RMSE of 5\.62\. In comparison, the Smooth L1 loss achieves an MAE of 4\.31 and an RMSE of 5\.49, demonstrating improvements in both metrics\.

TABLE IX: Ablation study on Loss Function on DAIC\-WOZ test set\.

| Method | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| MAE | 4\.39 | 5\.84 |
| RMSE | 4\.51 | 5\.62 |
| Smooth L1 | 4\.31 | 5\.49 |

Ablation Analysis of Memory Mechanisms on Different Backbone Architectures\. We conduct extensive ablation studies across various backbone architectures, including GRUs with different depths, BiLSTMs, LSTMs, and Transformers\. As shown in Table [X](https://arxiv.org/html/2606.11197#S5.T10), the results consistently show that incorporating the memory\-augmentation mechanism leads to improvements in model performance\.

TABLE X: Ablation studies of the memory mechanism across different backbone architectures on the DAIC\-WOZ test set\.

| Methods | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| GRU 1 Layer w/o Memory | 4\.86 | 6\.37 |
| GRU 1 Layer w/ Memory | 4\.51 | 6\.13 |
| GRU 2 Layers w/o Memory | 4\.91 | 6\.28 |
| GRU 2 Layers w/ Memory | 4\.42 | 6\.01 |
| GRU 4 Layers w/o Memory | 4\.89 | 6\.13 |
| GRU 4 Layers w/ Memory | 4\.45 | 5\.91 |
| LSTM w/o Memory | 4\.57 | 6\.01 |
| LSTM w/ Memory | 4\.35 | 5\.68 |
| BiLSTM w/o Memory | 4\.65 | 6\.14 |
| BiLSTM w/ Memory | 4\.44 | 6\.07 |
| Transformer w/o Memory | 4\.74 | 6\.11 |
| Transformer w/ Memory | 4\.68 | 5\.98 |
| GRU 8 Layers w/ Memory \(Ours\) | 4\.31 | 5\.49 |

Sensitivity Analysis of the Top\-K Parameter\. As shown in Table [XI](https://arxiv.org/html/2606.11197#S5.T11), the experimental results show that the model achieves the best performance when $k=5$\. When $k$ is too small, the retrieved information is insufficient to cover key depression\-related cues\. However, when $k$ is too large, it introduces excessive irrelevant or noisy information, which weakens the model’s ability to focus on informative features\.

TABLE XI: Sensitivity Analysis of the Top\-K Parameter in Similarity Retrieval on the DAIC\-WOZ test set\.

| Methods | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| K=1 | 4\.56 | 5\.99 |
| K=3 | 4\.41 | 5\.68 |
| K=5 \(Ours\) | 4\.31 | 5\.49 |
| K=7 | 4\.33 | 5\.65 |
| K=9 | 4\.45 | 5\.66 |

Sensitivity Analysis of the $\beta$ Parameter in Smooth\-L1 Loss\. As shown in Table [XII](https://arxiv.org/html/2606.11197#S5.T12), the experimental results indicate that the model achieves the best MAE performance when $\beta=0.5$, while the best RMSE performance is obtained when $\beta=1$\. When $\beta$ is set to either a smaller or larger value beyond these ranges, both MAE and RMSE degrade modestly\. This suggests that an inappropriate $\beta$ either over\-emphasizes the L1 region \(leading to insufficient sensitivity to moderate errors\) or over\-expands the L2 region \(causing the model to over\-penalize large errors\), ultimately harming overall performance\.

TABLE XII: Sensitivity Analysis of the $\beta$ Parameter in Smooth\-L1 Loss on the DAIC\-WOZ test set\.

| Methods | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| $\beta=0.0$ | 4\.39 | 5\.84 |
| $\beta=0.5$ | 4\.27 | 5\.56 |
| $\beta=1.0$ \(Ours\) | 4\.31 | 5\.49 |
| $\beta=1.5$ | 4\.42 | 5\.67 |
| $\beta=2.0$ | 4\.68 | 5\.98 |

Impact of Historical Memory Length on Our Method\. In our ablation study, the GRU processes the full input sequence, while the memory module is evaluated using historical features of varying lengths\. As shown in Table [XIII](https://arxiv.org/html/2606.11197#S5.T13), when the memory length increases from 5 to 25, the model performance generally improves\. Specifically, with a memory length of 5, the model achieves an MAE of 4\.68 and an RMSE of 5\.98, while increasing the memory length to 25 results in an MAE of 4\.35 and an RMSE of 5\.66\. Once the memory length exceeds 25, the performance tends to stabilize, with only marginal gains observed\.

TABLE XIII: Sensitivity Analysis of the Historical Sequence Length on the DAIC\-WOZ test set\.

| Memory Length | MAE ↓ | RMSE ↓ |
| --- | --- | --- |
| L=5 | 4\.73 | 5\.98 |
| L=10 | 4\.56 | 5\.79 |
| L=15 | 4\.41 | 5\.67 |
| L=20 | 4\.46 | 5\.65 |
| L=25 | 4\.35 | 5\.66 |
| L=30 | 4\.30 | 5\.51 |
| L=35 | 4\.33 | 5\.47 |
| L=40 | 4\.29 | 5\.56 |
### V\-G Model Complexity\.

The proposed model contains 9\.00M learnable parameters and requires only 0\.72 GFLOPs for a single forward pass\. Since few open\-source methods are available, we compare only with Wei et al\. \[[58](https://arxiv.org/html/2606.11197#bib.bib90)\], which uses 7\.17M parameters but requires 7\.18 GFLOPs per forward pass\. The higher computational cost of Wei et al\. is mainly attributed to its use of a larger number of sample slicing windows, which increases the overall processing overhead\. This comparison highlights that our architecture achieves efficient inference at roughly one tenth of the computational cost while maintaining strong predictive performance\.

Moreover, our method exhibits excellent scalability to long input sequences, primarily because all core operations are designed to be computationally lightweight\. For instance, each time\-step audio feature and its corresponding GRU output both have a dimensionality of 256\. Computing the cosine similarity between them requires only about 2k FLOPs, which constitutes a negligible computational cost in the overall pipeline\.

TABLE XIV: Model complexity and performance comparison\.

| Methods | Parameter ↓ | FLOPs ↓ | MAE ↓ | RMSE ↓ |
| --- | --- | --- | --- | --- |
| Wei et al\. \[[58](https://arxiv.org/html/2606.11197#bib.bib90)\] | 7\.17M | 7\.18G | 4\.92 | 5\.86 |
| Ours | 9\.00M | 0\.72G | 4\.31 | 5\.49 |
### V\-H Visualizing the Impact of Memory Features\.

In this section, we employ t\-SNE to visualize the feature representations extracted by the model, aiming to compare the representational capacity of the model in the feature space with and without the incorporation of memory features\. Since the depression severity scores are continuous in regression tasks and not directly suitable for categorical visualization, we categorize the samples into five levels based on standard depression severity criteria: none or minimal depression, mild depression, moderate depression, moderately severe depression, and severe depression\. This categorization enables a more intuitive observation of the distribution of features across different severity levels\.
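The five-level categorization follows the PHQ-8 bands listed in Table I and can be sketched as a small lookup. The function names are hypothetical, and the handling of fractional scores (each band extends up to the next lower bound) is an assumption, since Table I only defines integer ranges.

```python
from bisect import bisect_right

# Lower bounds of the Mild, Moderate, Moderately Severe, and Severe bands (Table I).
_BAND_EDGES = [5, 10, 15, 20]
_BAND_LABELS = [
    "None or Minimal Depression",
    "Mild Depression",
    "Moderate Depression",
    "Moderately Severe Depression",
    "Severe Depression",
]


def phq8_severity(score):
    """Map a PHQ-8 total score in [0, 24] to its severity band."""
    if not 0 <= score <= 24:
        raise ValueError("PHQ-8 total score must lie between 0 and 24")
    return _BAND_LABELS[bisect_right(_BAND_EDGES, score)]


def is_clinically_significant(score, threshold=10):
    """Binary depression label using the common PHQ-8 >= 10 cut-off."""
    return score >= threshold


for s in (3, 9, 10, 17, 22):
    print(s, phq8_severity(s), is_clinically_significant(s))
```

Such a mapping is only used to color the t-SNE points by severity; the model itself is still trained and evaluated as a regressor.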

As shown in Fig\.[5](https://arxiv.org/html/2606.11197#S5.F5), the visualization results reveal that, without memory features, samples of different depression levels tend to overlap significantly in the feature space, resulting in blurred class boundaries\. In contrast, after incorporating memory features, samples within the same category are more tightly clustered, and inter\-class separation becomes more distinct\. These observations indicate that the memory module enhances the model’s ability to capture temporal dependencies in the input sequences, thereby improving the quality of the learned feature representations and strengthening the model’s capability to discriminate between different levels of depression severity\.

## VI Conclusion

In this work, we proposed a memory\-enhanced framework to address the limitation of traditional GRU and LSTM models in retaining long\-term temporal information, particularly the tendency to forget early\-stage features\. To this end, we introduced two key components: a similarity\-based feature retrieval mechanism and a dynamic feature bank, designed to capture critical depression\-related cues and enhance the model’s capability in estimating depression severity levels\. Our method achieved state\-of\-the\-art performance on both the DAIC\-WOZ and E\-DAIC datasets, demonstrating its effectiveness and generalizability\. In future work, we plan to further explore the integration of multi\-modal memory structures and investigate the applicability of our approach to broader mental health assessment tasks\.

## References

- [1] (2018) Detecting depression with audio/text sequence modeling of interviews. In Proceedings of the Interspeech, pp. 1716–1720.
- [2] X. Chen, Z. Shao, Y. Jiang, R. Chen, Y. Wang, B. Li, M. Niu, H. Chen, Q. Hu, J. Wu, C. Yang, and Y. Shang (2025) TTFNet: temporal-frequency features fusion network for speech based automatic depression recognition and assessment. IEEE Journal of Biomedical and Health Informatics 29(10), pp. 7536–7548.
- [3] A. K. Das and R. Naskar (2024) A deep learning model for depression detection based on MFCC and CNN generated spectrogram features. Biomedical Signal Processing and Control 90, pp. 105898.
- [4] D. DeVault, R. Artstein, G. Benn, T. Dey, E. Fast, A. Gainer, K. Georgila, J. Gratch, A. Hartholt, M. Lhommet, et al. (2014) SimSensei kiosk: a virtual human interviewer for healthcare decision support. In Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 1061–1068.
- [5] A. Dhall and R. Goecke (2015) A temporally piece-wise Fisher vector approach for depression analysis. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 255–259.
- [6] H. Fan, X. Zhang, Y. Xu, J. Fang, S. Zhang, X. Zhao, and J. Yu (2024) Transformer-based multimodal feature enhancement networks for multimodal depression detection integrating video, audio and remote photoplethysmograph signals. Information Fusion 104, pp. 102161.
- [7] W. Fan, Z. He, X. Xing, B. Cai, and W. Lu (2019) Multi-modality depression detection via multi-scale temporal dilated CNNs. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 73–80.
- [8] M. Fang, S. Peng, Y. Liang, C. Hung, and S. Liu (2023) A multimodal fusion model with multi-level attention mechanism for depression detection. Biomedical Signal Processing and Control 82, pp. 104561.
- [9] K. Gönç and H. Dibeklioğlu (2025) Affect and personality aided modeling of transcribed speech for depression severity estimation. IEEE Transactions on Affective Computing.
- [10] J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsella, et al. (2014) The distress analysis interview corpus of human and computer interviews. In Proceedings of the 14th International Conference on Language Resources and Evaluation (LREC), Vol. 14, pp. 3123–3128.
- [11] Z. Han, Y. Shang, Z. Shao, J. Liu, G. Guo, T. Liu, H. Ding, and Q. Hu (2023) Spatial-temporal feature network for speech-based depression recognition. IEEE Transactions on Cognitive and Developmental Systems 16(1), pp. 308–318.
- [12] L. He, D. Jiang, and H. Sahli (2018) Automatic depression analysis using dynamic facial appearance descriptor and Dirichlet process Fisher encoding. IEEE Transactions on Multimedia 21(6), pp. 1476–1486.
- [13] L. He, M. Niu, P. Tiwari, P. Marttinen, R. Su, J. Jiang, C. Guo, H. Wang, S. Ding, Z. Wang, et al. (2022) Deep learning for depression recognition with audiovisual cues: a review. Information Fusion 80, pp. 56–86.
- [14] S. Jayawardena, J. Epps, and E. Ambikairajah (2023) Ordinal logistic regression with partial proportional odds for depression prediction. IEEE Transactions on Affective Computing 14(1), pp. 563–577.
- [15] H. Jiang, B. Hu, Z. Liu, G. Wang, L. Zhang, X. Li, and H. Kang (2018) Detecting depression using an ensemble logistic regression model based on multiple speech features. Computational and Mathematical Methods in Medicine 2018(1), pp. 6508319.
- [16] S. Lee, H. G. Kim, D. H. Choi, H. Kim, and Y. M. Ro (2021) Video prediction recalling long-term motion context via memory alignment learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3054–3063.
- [17] S. Li, Z. Shao, R. Qin, Y. Huang, P. Liang, X. Li, Y. Jiang, Y. Deng, T. Liu, and X. Tan (2025) Audio-visual feature disentanglement and fusion network for automatic depression severity prediction. IEEE Transactions on Affective Computing, pp. 1–15. External Links: [Document](https://dx.doi.org/10.1109/TAFFC.2025.3611238).
- [18] S. Li, Z. Xie, and S. M. Naqvi (2025) Efficient long speech sequence modelling for time-domain depression level estimation. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5.
- [19] Y. Li, S. Qu, and X. Zhou (2025) Conformal depression prediction. IEEE Transactions on Affective Computing.
- [20] Y. Li, S. Kumbale, Y. Chen, T. Surana, E. S. Chng, and C. Guan (2025) Automated depression detection from text and audio: a systematic review. IEEE Journal of Biomedical and Health Informatics.
- [21] L. Lin, X. Chen, Y. Shen, and L. Zhang (2020) Towards automatic depression detection: a BiLSTM/1D CNN-based model. Applied Sciences 10(23), pp. 8701.
- [22] S. Luitel, Y. Liu, and M. Anwar (2025) Investigating fairness in machine learning-based audio sentiment analysis. AI and Ethics 5(2), pp. 1099–1108.
- [23] N. Marriwala, D. Chaudhary, et al. (2023) A hybrid model for depression detection using deep learning. Measurement: Sensors 25, pp. 100587.
- [24] K. Milintsevich, K. Sirts, and G. Dias (2023) Towards automatic text-based estimation of depression through symptom prediction. Brain Informatics 10(1), pp. 4.
- [25] M. Niu, K. Chen, Q. Chen, and L. Yang (2021) HCAG: a hierarchical context-aware graph attention model for depression detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4235–4239.
- [26] M. Niu, J. Tao, Y. He, S. Zhang, and M. Li (2025) Examining the Fourier spectrum of speech signal from a time-frequency perspective for automatic depression level prediction. IEEE Transactions on Affective Computing.
- [27] M. Niu, X. Wang, J. Gong, B. Liu, J. Tao, and B. W. Schuller (2025) Depression scale dictionary decomposition framework for multimodal automatic depression level prediction. IEEE Transactions on Circuits and Systems for Video Technology.
- [28] S. A. Qureshi, G. Dias, S. Saha, and M. Hasanuzzaman (2021) Gender-aware estimation of depression severity level in a multimodal setting. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8.
- [29] Y. Pan, J. Jiang, K. Jiang, and X. Liu (2024) Disentangled-multimodal privileged knowledge distillation for depression recognition with incomplete multimodal data. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 5712–5721.
- [30] Y. Pan, Y. Shang, Z. Shao, T. Liu, G. Guo, and H. Ding (2023) Integrating deep facial priors into landmarks for privacy preserving multimodal depression recognition. IEEE Transactions on Affective Computing.
- [31] Z. Pang, F. Sener, and A. Yao (2025) Context-enhanced memory-refined transformer for online action detection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 8700–8710.
- [32] S. Rasipuram, J. H. Bhat, A. Maitra, B. Shaw, and S. Saha (2022) Multimodal depression detection using task-oriented transformer-based embedding. In Proceedings of the 2022 IEEE Symposium on Computers and Communications (ISCC), pp. 1–4.
- [33] S. Rathi, B. Kaur, and R. K. Agrawal (2019) Enhanced depression detection from facial cues using univariate feature selection techniques. In Proceedings of the International Conference on Pattern Recognition and Machine Intelligence, pp. 22–29.
- [34] F. Ringeval, B. Schuller, M. Valstar, N. Cummins, R. Cowie, L. Tavabi, M. Schmitt, S. Alisamir, S. Amiriparian, E. Messner, et al. (2019) AVEC 2019 workshop and challenge: state-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 3–12.
- [35] F. Ringeval, B. Schuller, M. Valstar, J. Gratch, R. Cowie, S. Scherer, S. Mozgai, N. Cummins, M. Schmitt, and M. Pantic (2017) AVEC 2017: real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pp. 3–9.
- [36] M. Rodrigues Makiuchi, T. Warnita, K. Uto, and K. Shinoda (2019) Multimodal fusion of BERT-CNN and gated CNN representations for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 55–63.
- [37] B. N. Rumahorbo, B. Pardamean, and G. N. Elwirehardja (2023) Exploring recurrent neural network models for depression detection through facial expressions: a systematic literature review. In Proceedings of the 2023 6th International Conference of Computer and Informatics Engineering (IC2IE), pp. 209–214.
- [38] G. S. Saggu, K. Gupta, K. V. Arya, and C. R. Rodriguez (2022) Depressnet: a multimodal hierarchical attention mechanism approach for depression detection. International Journal of Engineering Science 15(1), pp. 24–32.
- [39] G. Sharma, A. M. Joshi, R. Gupta, and L. R. Cenkeramaddi (2023) DepCap: a smart healthcare framework for EEG based depression detection using time-frequency response and deep neural network. IEEE Access 11, pp. 52327–52338.
- [40] H. Shen, S. Song, and H. Gunes (2024) Multi-modal human behaviour graph representation learning for automatic depression assessment. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), pp. 1–10.
- [41] Z. Sheng, Y. Ai, Y. Chen, and Z. Ling (2023) Face-driven zero-shot voice conversion with memory-based face-voice alignment. In Proceedings of the 31st ACM International Conference on Multimedia, pp. 8443–8452.
- [42] Y. Shi, Y. Wang, C. Wu, C. Yeh, J. Chan, F. Zhang, D. Le, and M. Seltzer (2021) Emformer: efficient memory transformer based acoustic model for low latency streaming speech recognition. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6783–6787.
- [43] D. Shin, W. I. Cho, C. H. K. Park, S. J. Rhee, M. J. Kim, H. Lee, N. S. Kim, and Y. M. Ahn (2021) Detection of minor and major depression through voice as a biomarker using machine learning. Journal of Clinical Medicine 10(14), pp. 3046.
- [44] E. A. Stepanov, S. Lathuiliere, S. A. Chowdhury, A. Ghosh, R. Vieriu, N. Sebe, and G. Riccardi (2018) Depression severity estimation from multiple modalities. In Proceedings of the 2018 IEEE 20th International Conference on E-Health Networking, Applications and Services (HealthCom), pp. 1–6.
- [45] R. Su, C. Xu, H. Yu, X. Wu, F. Xu, X. Chen, L. Wang, and N. Yan (2026) Investigating acoustic-textual emotional inconsistency information for automatic depression detection. IEEE Transactions on Affective Computing.
- [46] C. Sun, M. Jiang, L. Gao, Y. Xin, and Y. Dong (2024) A novel study for depression detecting using audio signals based on graph neural network. Biomedical Signal Processing and Control 88, pp. 105675.
- [47] H. Sun, J. Liu, S. Chai, Z. Qiu, L. Lin, X. Huang, and Y. Chen (2021) Multi-modal adaptive fusion transformer network for the estimation of depression level. Sensors 21(14), pp. 4764.
- [48] H. Sun, H. Wang, J. Liu, Y. Chen, and L. Lin (2022) CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 3722–3729.
- [49] M. A. Uddin, J. B. Joolee, and K. Sohn (2022) Deep multi-modal network based automated depression severity estimation. IEEE Transactions on Affective Computing 14(3), pp. 2153–2167.
- [50] M. Valstar, B. Schuller, K. Smith, T. Almaev, F. Eyben, J. Krajewski, R. Cowie, and M. Pantic (2014) AVEC 2014: 3D dimensional affect and depression recognition challenge. In Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge, pp. 3–10.
- [51] M. Valstar, B. Schuller, K. Smith, F. Eyben, B. Jiang, S. Bilakhia, S. Schnieder, R. Cowie, and M. Pantic (2013) AVEC 2013: the continuous audio/visual emotion and depression recognition challenge. In Proceedings of the 3rd ACM International Workshop on Audio/Visual Emotion Challenge, pp. 3–10.
- [52] J. Videnovic, A. Lukezic, and M. Kristan (2025) A distractor-aware memory for visual object tracking with SAM2. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 24255–24264.
- [53] X. Wang, D. Lin, and L. Wan (2022) FFNet: frequency fusion network for semantic scene completion. In AAAI Conference on Artificial Intelligence, Vol. 36, pp. 2550–2557.
- [54] X. Wang et al. (2025) NUC-Net: non-uniform cylindrical partition network for efficient LiDAR semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology 35(9), pp. 9090–9104.
- [55] X. Wang, L. Wan, D. Lin, and W. Feng (2023) Phase-based fine-grained change detection. Expert Systems with Applications 227, pp. 120181.
- [56] X. Wang, X. Wu, S. Wang, L. Kong, and Z. Zhao (2026) AdaSFormer: adaptive serialized transformers for monocular semantic scene completion from indoor environments. arXiv:2603.25494.
- [57] X. Wang, X. Wu, S. Wang, et al. (2025) Monocular semantic scene completion via masked recurrent networks. In IEEE/CVF International Conference on Computer Vision.
- [58] P. Wei, K. Peng, A. Roitberg, K. Yang, J. Zhang, and R. Stiefelhagen (2022) Multi-modal depression estimation based on sub-attentional fusion. In European Conference on Computer Vision, pp. 623–639.
- [59] L. Wen, X. Li, G. Guo, and Y. Zhu (2015) Automated depression diagnosis based on facial dynamic analysis and sparse coding. IEEE Transactions on Information Forensics and Security 10(7), pp. 1432–1441.
- [60] C. Wu, Y. Cai, Y. Liu, P. Zhu, Y. Xue, Z. Gong, J. Hirschberg, and B. Ma (2025) Multimodal emotion recognition in conversations: a survey of methods, trends, challenges and prospects. arXiv preprint arXiv:2505.20511.
- [61] Y. Xia, L. Liu, T. Dong, J. Chen, Y. Cheng, and L. Tang (2024) A depression detection model based on multimodal graph neural network. Multimedia Tools and Applications 83(23), pp. 63379–63395.
- [62] C. Xiao, P. Zhang, X. Han, G. Xiao, Y. Lin, Z. Zhang, Z. Liu, and M. Sun (2024) InfLLM: training-free long-context extrapolation for LLMs with an efficient context memory. In Proceedings of the Advances in Neural Information Processing Systems, pp. 119638–119661.
- [63] J. Xu, H. Gunes, K. Kusumam, M. Valstar, and S. Song (2024) Two-stage temporal modelling framework for video-based depression recognition using graph representation. IEEE Transactions on Affective Computing 16(1), pp. 161–178.
- [64] X. Xu, Y. Wang, X. Wei, F. Wang, and X. Zhang (2024) Attention-based acoustic feature fusion network for depression detection. Neurocomputing 601, pp. 128209.
- [65] J. Xue, R. Qin, X. Zhou, H. Liu, M. Zhang, and Z. Zhang (2024) Fusing multi-level features from audio and contextual sentence embedding from text for interview-based depression detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6790–6794.
- [66] L. Yang, D. Jiang, and H. Sahli (2020) Feature augmenting networks for improving depression severity estimation from speech signals. IEEE Access 8, pp. 24033–24045.
- [67] S. Yin, C. Liang, H. Ding, and S. Wang (2019) A multi-modal hierarchical recurrent neural network for depression detection. In Proceedings of the 9th International on Audio/Visual Emotion Challenge and Workshop, pp. 65–71.
- [68] J. Yu and H. Kaya (2025) Using emotionally rich speech segments for depression prediction. In ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5.
- [69] C. Yuan, X. Liu, Q. Xu, Y. Li, Y. Luo, and X. Zhou (2024) Depression diagnosis and analysis via multimodal multi-order factor fusion. In Proceedings of the International Conference on Artificial Neural Networks, pp. 56–70.
- [70] L. Zhang, Z. Liu, Y. Wan, Y. Fan, D. Chen, Q. Wang, K. Zhang, and Y. Zheng (2025) DepITCM: an audio-visual method for detecting depression. Frontiers in Psychiatry 15, pp. 1466507.
- [71] P. Zhang, M. Wu, H. Dinkel, and K. Yu (2021) DEPA: self-supervised audio embedding for depression detection. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 135–143.
- [72] W. Zhang, K. Mao, and J. Chen (2024) A multimodal approach for detection and assessment of depression using text, audio and video. Phenomics 4(3), pp. 234–249.
- [73] W. Zhang (2022) Biomedical engineering application: disease diagnosis and treatment.
- [74] Z. Zhang, S. Zhang, D. Ni, Z. Wei, K. Yang, S. Jin, G. Huang, Z. Liang, L. Zhang, L. Li, et al. (2024) Multimodal sensing for depression risk detection: integrating audio, video, and text data. Sensors 24(12), pp. 3714.
- [75] Z. Zhang, W. Lin, M. Liu, and M. Mahmoud (2020) Multimodal deep learning framework for mental disorder recognition. In Proceedings of the 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pp. 344–350.
- [76] Y. Zhao, H. Zhang, J. Li, S. Song, C. Lian, Y. Liu, Y. Wang, and C. Fu (2025) Multimodal depression assessment framework integrating personality and gait for older adults with medical conditions. IEEE Transactions on Affective Computing.
- [77] Z. Zhao, Z. Bao, Z. Zhang, J. Deng, N. Cummins, H. Wang, J. Tao, and B. Schuller (2019) Automatic assessment of depression from speech via a hierarchical attention transfer network and attention autoencoders. IEEE Journal of Selected Topics in Signal Processing 14(2), pp. 423–434.