Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

arXiv cs.CL 06/10/26, 04:00 AM Papers
depression-diagnosis mental-health llm training-free chain-of-thought confidence-analysis clinical-interview
Summary
Dep-LLM is a training-free framework that uses frozen large language models to diagnose depression from clinical interviews by decomposing dialogue into five clinically aligned themes with evidence-grounded reasoning and confidence modulation, outperforming zero-shot and some supervised methods on DAIC-WOZ and E-DAIC datasets.
arXiv:2606.10796v1 Announce Type: new
Cached at: 06/10/26, 06:12 AM
# Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning
Source: [https://arxiv.org/html/2606.10796](https://arxiv.org/html/2606.10796)
Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

Yiqing Lyu is with School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China (e-mail: kosmischer@stu.hit.edu.cn). Xianbing Zhao is with School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China, Harbin Institute of Technology, Shenzhen, Guangdong, China, and Guangdong Provincial Key Laboratory of Intelligent Information Processing (e-mail: zhaoxianbing_hitsz@163.com). Buzhou Tang (Corresponding Author) is with School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China, Pengcheng Laboratory, Guangdong Provincial Key Laboratory of Intelligent Information Processing (e-mail: tangbuzhou@hit.edu.cn). Ronghuan Jiang (Corresponding Author) is with Chinese People’s Liberation Army General Hospital, Beijing, China (e-mail: jangrh55@126.com).

This study is partially supported by National Key R&D Program of China (2023YFC3502900), National Natural Science Foundation of China (62276082), Shenzhen Science and Technology Research and Development Fund (KJZD20240903102802003), Shenzhen Science and Technology Research and Development Fund for Sustainable Development Project (GXWD20231128103819001, 20230706140548006) and Guangdong Provincial Key Laboratory Grant (2023B1212060076).

###### Abstract

Automatic Depression Detection \(ADD\) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1\) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi\-topic clinical interviews, leading to superficial and unreliable reasoning; 2\) scarcity of labeled data due to clinical privacy, together with high cost of training and fine\-tuning, limiting the deployment of supervised ADD systems\. To jointly address these challenges, we propose Dep\-LLM, a training\-free framework that mirrors the step\-by\-step reasoning of clinical psychiatrists and operates entirely on frozen off\-the\-shelf foundation LLMs\. Dep\-LLM comprises three stages\. First, a Chain\-of\-Thought \(CoT\) Depression Multi\-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence\-grounded rationales, effectively handling long\-context dependencies\. Second, we introduce Confidence Analysis and Modulation module that quantifies the epistemic reliability from token\-level entropy of each rationale and applies an intra\-label and inter\-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training\. Third, a Collaborative Multi\-factor Prediction module dynamically integrates multi\-factor signals weighted by confidence into the final diagnosis\. Extensive experiments on the DAIC\-WOZ and E\-DAIC datasets demonstrate the effectiveness and generalizability of Dep\-LLM: it surpasses zero\-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted\-average F1, and further outperforms state\-of\-the\-art supervised domain\-specific LLMs as well as the latest closed\-source commercial LLMs, while requiring no extra training\.

## I Introduction

![Refer to caption](https://arxiv.org/html/2606.10796v1/x1.png)

Figure 1: Dep-LLM decomposes and analyzes dialogues via structural multi-factor schema and verifies their reliability by confidence mechanism. Without extra training, Dep-LLM outperforms zero-shot settings on a range of foundation LLMs.

Mental disorders, particularly depression, have become a major global health challenge. According to recent statistics from WHO, depression affects millions of people worldwide, standing as a leading cause of disability and contributing significantly to the global burden of disease \[[47](https://arxiv.org/html/2606.10796#bib.bib10),[27](https://arxiv.org/html/2606.10796#bib.bib11)\]. The authoritative psychiatric literature DSM-5 \[[1](https://arxiv.org/html/2606.10796#bib.bib9)\] establishes the gold standard for depression diagnosis, which is operationalized through Structured Clinical Interview for DSM-5 (SCID), where clinicians assess a patient’s psychiatric state via complex dialogue interaction. Psychiatric works \[[4](https://arxiv.org/html/2606.10796#bib.bib41),[41](https://arxiv.org/html/2606.10796#bib.bib42)\] also propose a synthetic methodology for SCID design by structurally incorporating themes such as family relationship, work satisfaction, medical history and so on. While this synthetic approach is clinically validated, the manual administration and analysis of such interviews are resource-intensive and difficult to scale to meet the growing demand for mental health services \[[43](https://arxiv.org/html/2606.10796#bib.bib12)\]. Consequently, Automatic Depression Detection (ADD) from clinical interview transcripts has attracted extensive attention, aiming to assist clinicians by objectively identifying depressive risks from natural language \[[58](https://arxiv.org/html/2606.10796#bib.bib75),[11](https://arxiv.org/html/2606.10796#bib.bib13)\].

Early ADD research focused on supervised deep learning over interview transcripts\[[37](https://arxiv.org/html/2606.10796#bib.bib45),[19](https://arxiv.org/html/2606.10796#bib.bib50),[63](https://arxiv.org/html/2606.10796#bib.bib16)\], which suffers from limited interpretability and heavy reliance on labeled clinical data\[[56](https://arxiv.org/html/2606.10796#bib.bib74)\]\. The rise of Large Language Models \(LLMs\) has shifted ADD toward generative reasoning, domain\-adaptive pre\-training\[[18](https://arxiv.org/html/2606.10796#bib.bib40)\], instruction fine\-tuning\[[50](https://arxiv.org/html/2606.10796#bib.bib25),[52](https://arxiv.org/html/2606.10796#bib.bib26)\], retrieval\-augmented generation\[[57](https://arxiv.org/html/2606.10796#bib.bib76),[58](https://arxiv.org/html/2606.10796#bib.bib75)\], and multi\-agent pipelines\[[16](https://arxiv.org/html/2606.10796#bib.bib38),[62](https://arxiv.org/html/2606.10796#bib.bib65)\]\. Despite the progress, ADD remains challenging due to two bottlenecks that persist across both supervised and LLM\-based methods: difficulty in modeling sparse depression clues within long clinical interview dialogues, and the practical barriers of data scarcity and high training cost\.

Challenge 1: Difficulty in modeling depression clues within long context. Clinical interviews are inherently lengthy and cover multiple themes, in which depression clues are complex but sparsely distributed, and intertwined with content irrelevant to depression diagnosis \[[53](https://arxiv.org/html/2606.10796#bib.bib17),[19](https://arxiv.org/html/2606.10796#bib.bib50)\] (e.g., greetings and transitional sentences). This complex semantic nature makes it difficult for off-the-shelf LLMs to capture the subtle symptoms associated with depression. As illustrated in Figure [1](https://arxiv.org/html/2606.10796#S1.F1), zero-shot LLMs tend to collapse the entire dialogue into a holistic judgment and produce superficial reasons such as “a generally positive outlook”, overlooking specific clues that are clinically relevant \[[57](https://arxiv.org/html/2606.10796#bib.bib76),[38](https://arxiv.org/html/2606.10796#bib.bib77)\]. Moreover, even when LLMs produce detailed rationales, these may be locally plausible yet clinically unreliable due to medical hallucination \[[21](https://arxiv.org/html/2606.10796#bib.bib19),[2](https://arxiv.org/html/2606.10796#bib.bib83)\]. Therefore, a structured analysis schema well-aligned with the clinical standard (e.g., SCID) is needed to decompose these interview dialogues with multiple factors into evidence-grounded rationales \[[59](https://arxiv.org/html/2606.10796#bib.bib46),[40](https://arxiv.org/html/2606.10796#bib.bib84),[24](https://arxiv.org/html/2606.10796#bib.bib85)\], accompanied by a reliability verification mechanism to distinguish trustworthy rationales from uncertain ones.

Challenge 2: Data scarcity and high training cost. Supervised methods \[[63](https://arxiv.org/html/2606.10796#bib.bib16),[30](https://arxiv.org/html/2606.10796#bib.bib15)\] and fine-tuned LLMs \[[18](https://arxiv.org/html/2606.10796#bib.bib40),[52](https://arxiv.org/html/2606.10796#bib.bib26),[50](https://arxiv.org/html/2606.10796#bib.bib25)\] rely heavily on large-scale labeled clinical data, which is scarce and hard to access due to privacy and ethical concerns. Without such data, these models are difficult to optimize and deploy in domain-specific clinical settings. Furthermore, training or fine-tuning LLMs calls for expensive computational resources, creating a prohibitive barrier for clinical institutions with limited budgets \[[61](https://arxiv.org/html/2606.10796#bib.bib23),[31](https://arxiv.org/html/2606.10796#bib.bib24)\]. Consequently, training-free methods that leverage frozen foundation LLMs without extra data or training overhead are urgently needed, while their performance must remain competitive against supervised models.

To jointly address these two challenges, we propose Dep-LLM, a novel training-free framework for ADD that mirrors the step-by-step reasoning process of clinical psychiatrists. As illustrated in Figure [1](https://arxiv.org/html/2606.10796#S1.F1) and Figure [2](https://arxiv.org/html/2606.10796#S3.F2), Dep-LLM is a three-stage pipeline operating entirely on frozen off-the-shelf foundation LLMs: 1) The *CoT Depression Multi-factor Analysis* module uses Chain-of-Thought prompting \[[46](https://arxiv.org/html/2606.10796#bib.bib30),[44](https://arxiv.org/html/2606.10796#bib.bib32)\] to structurally decompose the long dialogue into five SCID-aligned themes \[[38](https://arxiv.org/html/2606.10796#bib.bib77),[59](https://arxiv.org/html/2606.10796#bib.bib46)\] (family relationship, work satisfaction, mental state, medical history, overall evaluation) and extract evidence-based rationales over a fine-grained possibility space, effectively handling semantic dependencies in long context; 2) The *Semantic Confidence Analysis and Modulation* module quantifies the epistemic reliability of each rationale from token-level entropy and applies an intra-label and inter-theme contrastive modulation that amplifies reliable signals while suppressing uncertain ones \[[9](https://arxiv.org/html/2606.10796#bib.bib71),[33](https://arxiv.org/html/2606.10796#bib.bib81),[35](https://arxiv.org/html/2606.10796#bib.bib82)\]; 3) The *Collaborative Multi-factor Prediction* module dynamically integrates these confidence-weighted signals into a diagnostic decision. The entire pipeline introduces no learnable parameters and requires no labeled clinical data, making Dep-LLM accessible in deployment under the constrained data and computational budgets typical of real-world clinical settings.

Extensive experiments on the widely used DAIC-WOZ \[[13](https://arxiv.org/html/2606.10796#bib.bib48)\] and E-DAIC \[[8](https://arxiv.org/html/2606.10796#bib.bib49)\] datasets demonstrate the effectiveness of Dep-LLM. Against zero-shot baselines, Dep-LLM yields significant improvements on nearly all of the 21 tested foundation LLMs (Llama-2/3/4, Qwen-2.5/3/3.5, and Gemma-2/3 families, with parameters ranging from 4B to 17B) and further outperforms representative supervised domain-specific LLMs (e.g., MentalBERT, MentaLLaMA, BioMistral, Meditron) as well as the latest closed-source commercial LLMs (GPT-5.5, Gemini-3.1-Pro, Claude-Opus-4.6, Grok-4.3, DeepSeek-V4) across most metrics. Our main contributions are three-fold:

- We propose Dep-LLM, a novel depression detection framework that addresses the challenge of reasoning over long SCID dialogues by introducing a structured multi-factor analysis schema and LLM reliability verification.
- We implement the Dep-LLM framework under a fully training-free setting, eliminating the overhead of expensive model training and the need for scarce data while retaining strong performance in clinical scenarios.
- We realize the Dep-LLM architecture by integrating CoT reasoning, Semantic Confidence analysis, and Collaborative Multi-factor fusion, which jointly reinforce the interpretability and rationality of the automatic diagnosis.

## II Related Work

### II-A Automatic Depression Detection

The methodology for Automatic Depression Detection \(ADD\) has followed a clear trajectory from supervised deep learning and multimodal fusion to the current paradigm of generative reasoning with Large Language Models \(LLMs\)\.

Early ADD research leveraged neural networks like CNNs and RNNs for sequential dialogue feature modeling\[[12](https://arxiv.org/html/2606.10796#bib.bib52),[37](https://arxiv.org/html/2606.10796#bib.bib45)\], alongside frameworks like TFN\[[55](https://arxiv.org/html/2606.10796#bib.bib51)\], MulT\[[45](https://arxiv.org/html/2606.10796#bib.bib54)\], and MISA\[[14](https://arxiv.org/html/2606.10796#bib.bib55)\]to perform multimodal fusion\. HAN\[[28](https://arxiv.org/html/2606.10796#bib.bib53)\]further refined this via hierarchical attention\. Recent advances prioritize complex fusion: DepMSTAT\[[42](https://arxiv.org/html/2606.10796#bib.bib60)\]and TTFNet\[[5](https://arxiv.org/html/2606.10796#bib.bib58)\]utilize spatio\-temporal and frequency\-domain networks, while DepMamba\[[54](https://arxiv.org/html/2606.10796#bib.bib56)\]employs state space models for efficiency\. Methods like MMPF\[[51](https://arxiv.org/html/2606.10796#bib.bib57)\]and WavFace\[[10](https://arxiv.org/html/2606.10796#bib.bib59)\]focus on signal filtration and alignment\. Structural approaches like SEGA\[[7](https://arxiv.org/html/2606.10796#bib.bib61)\]and HiQuE\[[19](https://arxiv.org/html/2606.10796#bib.bib50)\]focus on reconstructing semantic structure via embedding networks\. However, these supervised models universally suffer from low interpretability and high dependency for training data\[[56](https://arxiv.org/html/2606.10796#bib.bib74)\]\.

The advancement of LLMs has shifted the focus toward generative reasoning\. Initial efforts leveraged pre\-trained models, such as MentalBERT\[[18](https://arxiv.org/html/2606.10796#bib.bib40)\], or instruction fine\-tuning models, such as MentalAlpaca\[[50](https://arxiv.org/html/2606.10796#bib.bib25)\], to align models with mental health tasks\. More recent supervised domain\-specific LLMs like PsycoLLM\[[15](https://arxiv.org/html/2606.10796#bib.bib87)\]and DepressLLM\[[32](https://arxiv.org/html/2606.10796#bib.bib86)\]further inject psychological knowledge and interpretable confidence into the backbone\. Beyond text, multimodal LLMs fuse linguistic clues with acoustic and facial signals for interview\-based assessment\[[39](https://arxiv.org/html/2606.10796#bib.bib88),[25](https://arxiv.org/html/2606.10796#bib.bib89)\]\. Meanwhile, agentic frameworks decompose diagnosis across collaborative roles, including doctor\-patient\-family interaction\[[62](https://arxiv.org/html/2606.10796#bib.bib65)\], multi\-agent guided interviewing\[[3](https://arxiv.org/html/2606.10796#bib.bib90),[16](https://arxiv.org/html/2606.10796#bib.bib38)\], knowledge\-guided psychiatric reasoning\[[49](https://arxiv.org/html/2606.10796#bib.bib91)\], and realistic patient simulation\[[26](https://arxiv.org/html/2606.10796#bib.bib66)\]\. To address the hallucination and the lack of clinical grounding of off\-the\-shelf LLMs\[[34](https://arxiv.org/html/2606.10796#bib.bib62)\], Retrieval\-Augmented Generation \(RAG\) methods such as SpeechT\-RAG\[[58](https://arxiv.org/html/2606.10796#bib.bib75)\]and RED\[[57](https://arxiv.org/html/2606.10796#bib.bib76)\]ground diagnosis on external evidence\. 
Apart from that, Chain\-of\-Thought \(CoT\) frameworks such as Doris\[[23](https://arxiv.org/html/2606.10796#bib.bib64)\], EMDRC\[[60](https://arxiv.org/html/2606.10796#bib.bib63)\], and emotion\-to\-reasoning prompting\[[44](https://arxiv.org/html/2606.10796#bib.bib32)\]structure the reasoning process around psychiatric criteria\. Yet most of these methods either lack an internal mechanism to verify the reliability of their own reasoning, or intensively depend on scarce labeled data and computational resources for fine\-tuning, retrieval, or simulation\.

Dep-LLM addresses these gaps with a training-free framework that mirrors clinical assessment through CoT-driven structured multi-factor analysis and employs a semantic confidence mechanism to mathematically reinforce reasoning reliability, ensuring both interpretability and accuracy without training overhead.

## III Method

In this section, we describe the architecture of Dep-LLM, a fully training-free framework that mirrors the step-by-step reasoning process of clinical psychiatrists. Section [I](https://arxiv.org/html/2606.10796#S1) identified two bottlenecks: inaccurate clue reasoning in long SCID-style dialogues and the prohibitive cost of supervised adaptation. To address both, Dep-LLM is designed as a three-stage pipeline that operates entirely on a frozen off-the-shelf foundation LLM, requiring no task-specific fine-tuning, no labeled depression corpus, and no auxiliary trainable parameters. As illustrated in Figure [2](https://arxiv.org/html/2606.10796#S3.F2), the pipeline 1) structurally decomposes the raw interview dialogue into a multi-theme, multi-label evidence space through CoT-driven prompting; 2) quantifies the epistemic reliability of every generated rationale from its token-level entropy, with intra-label and inter-theme modulation; and 3) synthesizes a diagnostic prediction through a confidence-weighted collaborative fusion strategy. All three stages use the same frozen LLM at inference time only, so the entire framework can be deployed under the constrained data and compute of clinical settings without any extra training overhead.

### III-A Problem Formulation

Given a clinical interview dialogue transcript $\mathcal{S}$, our goal is to assess the participant’s depressive state by extracting and analyzing multi-factor evidence from $\mathcal{S}$ in a training-free manner. To this end, we define a collaborative multi-factor space spanning two orthogonal dimensions: the multi-theme set $\mathcal{D}$, aligned with the SCID standard \[[4](https://arxiv.org/html/2606.10796#bib.bib41),[41](https://arxiv.org/html/2606.10796#bib.bib42)\], and the multi-label set $\mathcal{L}$ for fine-grained diagnostic analysis:

$$\mathcal{D} = \{\text{family}, \text{work}, \text{mental}, \text{medical}, \text{overall}\}, \tag{1}$$

$$\mathcal{L} = \{+\,(\text{depressive}),\; -\,(\text{healthy}),\; =\,(\text{neutral})\}. \tag{2}$$

For each theme $i \in \mathcal{D}$ and each potential label $j \in \mathcal{L}$, Dep-LLM derives a possibility score $P_{ij}$, an evidence-grounded rationale $R_{ij}$, and a semantic confidence $C_{ij}$ that measures the reliability of $R_{ij}$. The final diagnosis $\hat{y}$ is obtained by dynamically fusing these factors in $\mathcal{D} \times \mathcal{L}$, weighted by their confidences. Throughout the pipeline, the foundation LLM is invoked only as a frozen decoder, which is the key property that enables Dep-LLM to remain training-free.
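To make the shape of the multi-factor space concrete, the following minimal sketch enumerates $\mathcal{D} \times \mathcal{L}$ and attaches the three quantities $(P_{ij}, R_{ij}, C_{ij})$ to each cell. The `Factor` class, its field names, and `empty_factor_space` are illustrative assumptions, not part of the paper.

```python
from dataclasses import dataclass
from itertools import product

THEMES = ("family", "work", "mental", "medical", "overall")  # D, Eq. (1)
LABELS = ("+", "-", "=")  # L, Eq. (2): depressive, healthy, neutral


@dataclass
class Factor:
    """One cell of D x L. Field names are hypothetical, chosen for illustration."""
    theme: str
    label: str
    possibility: float = 0.0  # P_ij
    rationale: str = ""       # R_ij
    confidence: float = 0.0   # C_ij


def empty_factor_space():
    """Return a dict keyed by (theme, label) covering every cell of D x L."""
    return {(i, j): Factor(i, j) for i, j in product(THEMES, LABELS)}


space = empty_factor_space()
print(len(space))  # 5 themes x 3 labels = 15 factors
```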

![Refer to caption](https://arxiv.org/html/2606.10796v1/x2.png)

Figure 2: Schematic illustration of the proposed Dep-LLM framework with three components. The CoT Depression Multi-factor Analysis applies CoT techniques to prompt LLMs to assess possibilities and rationales for each theme and each label. Confidence Analysis & Modulation introduces a confidence mechanism to evaluate the reliability of those rationales. Collaborative Prediction incorporates possibility, rationale and confidence across the multi-factor space.

### III-B CoT Depression Multi-factor Analysis

Standard zero\-shot prompting tends to collapse the long multi\-turn dialogue into a single holistic judgment, which often yields vague and superficial rationales and overlooks sparsely distributed depression clues\[[53](https://arxiv.org/html/2606.10796#bib.bib17),[38](https://arxiv.org/html/2606.10796#bib.bib77)\]\. To mitigate this, the first stage of Dep\-LLM employs a Chain\-of\-Thought \(CoT\) scheme that decomposes the reasoning process into three sequential prompting steps, all executed on the same frozen LLM without parameter update\.

#### III-B1 Step 1: Thematic Content Extraction

For a multi-turn clinical interview dialogue $\mathcal{S}$ and a frozen large language model $\mathrm{LLM}(\cdot)$, we employ an in-context learning (ICL) template $X_1$ (provided in the supplemental file) to prompt the LLM to extract the narrative thematic content of the four atomic themes $\mathcal{D}^{\circ}_{\{\text{family,work,mental,medical}\}}$:

$$\mathcal{T}^{\circ} = \bigl\{\, t_i \mid t_i = \mathrm{LLM}(\mathcal{S}, X_1),\, i \in \mathcal{D}^{\circ} \bigr\}, \tag{3}$$

where $t_i$ retains only the clinically relevant content for theme $i$ and filters out trivial chitchat such as greetings and transitional sentences. This decomposition explicitly realigns the unstructured dialogue with the structured multi-theme schema corresponding to SCID.

#### III-B2 Step 2: Integrative Overall Synthesis

We then apply a second instruction $X_2$ that prompts the LLM to integrate the four atomic themes into an *overall* narrative summary, capturing global clues (e.g., cross-theme emotional consistency) that a single theme cannot express in isolation:

$$\mathcal{T} = \mathcal{T}^{\circ} \cup \bigl\{ t_{\text{overall}} \mid t_{\text{overall}} = \mathrm{LLM}(\mathcal{T}^{\circ}, X_2) \bigr\}. \tag{4}$$

The overall theme contributes integrative evidence rather than a redundant factor, which is verified through experiments in the ablation study (Section [IV-C](https://arxiv.org/html/2606.10796#S4.SS3)).

#### III-B3 Step 3: Depression Multi-factor Analysis

Finally, we employ a third ICL template $X_3$ to prompt the LLM to perform a fine-grained assessment over the full multi-factor space $\mathcal{D} \times \mathcal{L}$. For every (theme, label) pair, this step produces a possibility score $P_{ij}$ and an explicit justifying rationale $R_{ij}$:

$$(P_{ij}, R_{ij}) = \mathrm{LLM}(\mathcal{T}, X_3), \quad i \in \mathcal{D},\, j \in \mathcal{L}. \tag{5}$$

The three-step decomposition ensures that the LLM never has to simultaneously perform topic filtering, cross-theme integration, and label-conditioned reasoning in a single forward pass, thereby substantially mitigating the superficial summary observed in the zero-shot baseline (Figure [1](https://arxiv.org/html/2606.10796#S1.F1)). Instead, the multi-factor rationales are strictly evidence-grounded and directed by the CoT-driven structured analysis schema.
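The three prompting steps in Equations (3) to (5) can be sketched as a short driver around any text-in, text-out model. This is a sketch under stated assumptions, not the authors' implementation: the prompt strings stand in for the templates $X_1$, $X_2$, $X_3$ (which live in the paper's supplemental file), the `llm` callable is hypothetical, issuing one call per theme is an assumption, and the JSON reply format in Step 3 is an assumption made only so the output can be parsed.

```python
import json
from typing import Callable, Dict, Tuple

ATOMIC_THEMES = ("family", "work", "mental", "medical")
LABELS = ("+", "-", "=")

# Placeholder prompts; the real templates X1, X2, X3 are not reproduced here.
X1 = "Extract the content about '{theme}' from this interview:\n{dialogue}"
X2 = "Summarize the overall picture from these theme summaries:\n{themes}"
X3 = ("For theme '{theme}', return JSON mapping each label in {labels} to "
      "an object with 'possibility' and 'rationale':\n{content}")


def analyze(dialogue: str, llm: Callable[[str], str]) -> Dict[Tuple[str, str], Tuple[float, str]]:
    """Run the three CoT steps and return {(theme, label): (P_ij, R_ij)}."""
    # Step 1 (Eq. 3): keep only theme-relevant content for each atomic theme.
    themes = {i: llm(X1.format(theme=i, dialogue=dialogue)) for i in ATOMIC_THEMES}
    # Step 2 (Eq. 4): synthesize the 'overall' theme from the four atomic ones.
    joined = "\n".join(f"{i}: {t}" for i, t in themes.items())
    themes["overall"] = llm(X2.format(themes=joined))
    # Step 3 (Eq. 5): score every (theme, label) pair with a rationale.
    factors = {}
    for i, content in themes.items():
        reply = json.loads(llm(X3.format(theme=i, labels=list(LABELS), content=content)))
        for j in LABELS:
            factors[(i, j)] = (float(reply[j]["possibility"]), reply[j]["rationale"])
    return factors


def fake_llm(prompt: str) -> str:
    """Stub standing in for a frozen LLM so the sketch runs offline."""
    if prompt.startswith("For theme"):
        return json.dumps({j: {"possibility": 0.5, "rationale": "stub"} for j in LABELS})
    return "stub summary"


factors = analyze("Interviewer: how are you? Participant: tired lately.", fake_llm)
print(len(factors))  # one entry per (theme, label) pair
```

In a real deployment the stub would be replaced by a call to the frozen model, and the same call would also need to expose per-token probabilities for the confidence stage that follows.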

### III-C Semantic Confidence Analysis and Modulation

The rationales generated in Stage [III-B](https://arxiv.org/html/2606.10796#S3.SS2) are linguistically fluent but not all equally trustworthy, since training-free LLMs are prone to producing locally plausible yet clinically unreliable explanations \[[21](https://arxiv.org/html/2606.10796#bib.bib19),[48](https://arxiv.org/html/2606.10796#bib.bib36)\]. Instead of verifying these rationales via supervised fine-tuning, which demands substantial compute and labeled data, we exploit epistemic reliability, a quantity that is inherent to generation and can be computed from the model’s token-level predictive distribution.

#### III-C1 Theoretical Grounding

The motivation behind our confidence mechanism is straightforward: when an LLM is genuinely confident in what it is saying, its next-token probability distribution should be sharply peaked on a small set of candidates; when it is fabricating content, by contrast, the distribution tends to flatten because no single continuation is strongly supported by the model’s internal knowledge. Prior work supports this motivation. In general, \[[20](https://arxiv.org/html/2606.10796#bib.bib68)\] showed that the predictive entropy of an LLM is inversely correlated with the quality (i.e., confidence) of its generation, establishing entropy as a robust proxy for the model’s self-knowledge. In medical fields specifically, where hallucinations are high-stakes, \[[29](https://arxiv.org/html/2606.10796#bib.bib70),[9](https://arxiv.org/html/2606.10796#bib.bib71)\] verified that high-entropy (i.e., low-confidence) states are a reliable signal of non-factual generation, and \[[48](https://arxiv.org/html/2606.10796#bib.bib36)\] further demonstrated that entropy-based uncertainty estimation is critical for distinguishing trustworthy medical advice from hazardous hallucinations. Mathematically, we define the rationale confidence $C$ as the reciprocal of the Shannon entropy $H$ of the LLM’s token distribution, in line with the geometric framework proposed by \[[36](https://arxiv.org/html/2606.10796#bib.bib72)\], which shows that low uncertainty corresponds to a tighter concentration of semantic embeddings, indicating stable and trustworthy evidence.

#### III-C2 Token-Level Entropy and Rationale-Level Confidence

Formally, for a generated token $y_k$ at step $k$, the token-level entropy is:

$$H^{\text{token}}(y_k) = -\sum_{v \in \mathcal{V}} p(v \mid y_{<k}, x) \log_2 p(v \mid y_{<k}, x), \tag{6}$$

where $\mathcal{V}$ is the token sampling space and $p(v \mid \cdot)$ is the probability of candidate token $v$. A low $H^{\text{token}}$ indicates that the model is decisive in its selection, reflecting internalized knowledge rather than hallucinated medical details. We then define the token confidence as the reciprocal of the token entropy:

$$C^{\text{token}}(y_k) = \frac{1}{H^{\text{token}}(y_k) + \epsilon}, \tag{7}$$

where $\epsilon$ is a small constant for numerical stability. Then we average it along the sequence to obtain the rationale-level average confidence:

$$C^{\text{avg}}(y) = \frac{1}{|y|} \sum_{k=1}^{|y|} C^{\text{token}}(y_k). \tag{8}$$

In practice, we apply Equation [8](https://arxiv.org/html/2606.10796#S3.E8) to every rationale produced in Equation [5](https://arxiv.org/html/2606.10796#S3.E5):

$$C^{\text{avg}}_{ij} = C^{\text{avg}}(R_{ij}), \quad i \in \mathcal{D},\, j \in \mathcal{L}, \tag{9}$$

yielding a $|\mathcal{D} \times \mathcal{L}|$ raw average confidence matrix that measures the extent to which each rationale is internally consistent with the LLM’s own probabilistic beliefs. Notably, this process is totally training-free and introduces no learnable parameters since the entropy is derived directly from the same forward pass that produces the rationale.
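Equations (6) to (8) reduce to a few lines once the per-step next-token distributions are available. The sketch below assumes those distributions are already given as lists of probabilities (how they are obtained from a particular inference stack is left out), and the value of `EPS` is an assumption since the paper only calls $\epsilon$ a small constant.

```python
import math
from typing import Sequence

EPS = 1e-6  # epsilon in Eq. (7); the concrete value is an assumption


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of one next-token distribution (Eq. 6)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def token_confidence(probs: Sequence[float], eps: float = EPS) -> float:
    """Reciprocal of the token entropy, stabilized by eps (Eq. 7)."""
    return 1.0 / (token_entropy(probs) + eps)


def rationale_confidence(step_probs: Sequence[Sequence[float]], eps: float = EPS) -> float:
    """Mean token confidence over all generated tokens of a rationale (Eq. 8)."""
    if not step_probs:
        raise ValueError("rationale must contain at least one token")
    return sum(token_confidence(p, eps) for p in step_probs) / len(step_probs)


peaked = [0.97, 0.01, 0.01, 0.01]  # decisive step: low entropy
flat = [0.25, 0.25, 0.25, 0.25]    # undecided step: entropy of exactly 2 bits

print(token_entropy(flat))  # 2.0
print(rationale_confidence([peaked, peaked]) > rationale_confidence([flat, flat]))  # True
```

One consequence of averaging reciprocals is worth keeping in mind: a step with near-zero entropy contributes a value close to `1 / EPS`, so a handful of almost deterministic tokens can dominate $C^{\text{avg}}$. The modulation step that follows rescales these raw values before fusion.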

#### III-C3 Intra-Label and Inter-Theme Confidence Modulation

Raw confidence scores may exhibit high variance across different themes due to varying text lengths and inherent semantic complexity, so naively using $C^{\text{avg}}_{ij}$ would let scale-driven artifacts dominate the fusion. To address this bias, we apply a contrastive modulation based on the principle that a factor should weigh more when its own rationale is confident and its competing themes are uncertain. Concretely, we first compute a harmonic-mean normalizer:

$$\alpha^{\text{norm}}_{j} = \frac{1}{\frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \frac{1}{C^{\text{avg}}_{ij}}}, \quad j \in \mathcal{L}, \tag{10}$$

and then derive the modulated confidence as:

$$C^{\text{mod}}_{ij} = \frac{\alpha^{\text{norm}}_{j}}{\prod_{i' \in \mathcal{D},\, i' \neq i} C^{\text{avg}}_{i'j}}, \quad j \in \mathcal{L}. \tag{11}$$

By construction, for theme $i \in \mathcal{D}$, $C^{\text{mod}}_{ij}$ is monotonically increasing in its own raw confidence $C^{\text{avg}}_{ij}$ and decreasing in the confidences of the competing themes $i' \in \mathcal{D}, i' \neq i$ under the same label $j \in \mathcal{L}$. This effectively amplifies the signal of the most reliable diagnostic factor while suppressing uncertain ones, a behaviour verified quantitatively in the case study in Section [IV-D](https://arxiv.org/html/2606.10796#S4.SS4).
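A small numerical sketch of Equations (10) and (11) for a single label $j$ may help check the stated monotonicity. The function name `modulate` and the example confidences are hypothetical; only the two formulas come from the text. Note that dividing by the product over the other themes is algebraically the same as multiplying $C^{\text{avg}}_{ij}$ by the per-label factor $\alpha^{\text{norm}}_{j} / \prod_{i' \in \mathcal{D}} C^{\text{avg}}_{i'j}$, which is how the code computes it.

```python
import math
from typing import Dict


def modulate(c_avg: Dict[str, float]) -> Dict[str, float]:
    """Apply Eqs. (10)-(11) for one label j; c_avg maps theme -> C^avg_ij (> 0)."""
    n = len(c_avg)
    # Eq. (10): harmonic mean of the raw confidences under this label.
    alpha = n / sum(1.0 / c for c in c_avg.values())
    total = math.prod(c_avg.values())
    # Eq. (11): the product over competing themes equals total / own confidence.
    return {i: alpha / (total / c) for i, c in c_avg.items()}


# Hypothetical raw confidences for one label.
raw = {"family": 2.0, "work": 1.0, "mental": 4.0, "medical": 1.0, "overall": 2.0}
mod = modulate(raw)

# Raising a competitor's confidence lowers the modulated confidence of 'mental'.
raw_competitor_up = dict(raw, work=2.0)
mod_competitor_up = modulate(raw_competitor_up)

print(mod["mental"] > mod["family"] > mod["work"])    # True
print(mod_competitor_up["mental"] < mod["mental"])    # True
```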

### III-D Collaborative Multi-factor Prediction

In the final stage, Dep-LLM synthesizes the multi-factor assessments into a single diagnostic decision via a confidence-weighted fusion. Notably, this stage involves no learned weights: all coefficients are either derived analytically from the modulated confidence (Equation [11](https://arxiv.org/html/2606.10796#S3.E11)) or fixed by design and exposed to the clinician as transparent hyperparameters.

#### III-D1 Confidence-Weighted Aggregation

First, for each label $j\in\mathcal{L}$, the support score $p^{j}$ is computed by fusing the possibilities with the modulated confidences of the five themes:

$$p^{j}=\sum_{i\in\mathcal{D}}P_{ij}\cdot C^{\text{mod}}_{ij},\quad j\in\mathcal{L}_{\{+,-,=\}}. \tag{12}$$

In this formulation, the possibility of each theme contributes to the final decision in proportion to how reliably the LLM justified it, rather than through a flat summation of the possibilities alone.
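As a small illustration of Equation (12), the sketch below sums possibility times modulated confidence over themes for each label. The function name `support_scores`, the label order `(+, -, =)` and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def support_scores(possibility, c_mod):
    """Eq. (12): p^j = sum_i P_ij * C^mod_ij for every label j.

    Both arguments are nested lists indexed as [theme][label].
    """
    n_labels = len(possibility[0])
    return [
        sum(p_row[j] * c_row[j] for p_row, c_row in zip(possibility, c_mod))
        for j in range(n_labels)
    ]


# Hypothetical inputs: two themes, label order (+, -, =).
P = [[0.65, 0.10, 0.25],
     [0.10, 0.70, 0.20]]
C_mod = [[2.0, 1.0, 1.0],
         [1.0, 0.5, 1.0]]

p_pos, p_neg, p_neu = support_scores(P, C_mod)
# p_pos = 0.65*2.0 + 0.10*1.0, p_neg = 0.10*1.0 + 0.70*0.5, p_neu = 0.25 + 0.20
```

Here the second theme's large healthy possibility (0.70) is discounted by its low modulated confidence (0.5), so it contributes less to $p^{-}$ than a flat sum would suggest.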

#### III-D2 Collaborative Prediction

Afterwards, we define the depressive coefficient $\delta$ to reflect the predictive risk of depression. While the depressive evidence $p^{+}$ and healthy evidence $p^{-}$ vote for their own labels, the neutral evidence $p^{=}$ is intrinsically ambiguous yet should not be discarded outright. We therefore introduce neutral label calibrators $\lambda^{+},\lambda^{-}$ that explicitly incorporate the neutral evidence:

$$\delta=\frac{p^{+}+\lambda^{+}\cdot p^{=}}{p^{-}+\lambda^{-}\cdot p^{=}}, \tag{13}$$

where a higher $\delta$ indicates a higher predictive risk of depression. The final prediction is obtained by comparing $\delta$ against the decision threshold $\delta^{\ast}$:

$$\hat{y}=\begin{cases}\text{depressive}\;(+),&\text{if }\delta\geq\delta^{\ast},\\ \text{healthy}\;(-),&\text{otherwise}.\end{cases} \tag{14}$$
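Equations (13) and (14) amount to one ratio and one comparison. The following sketch uses hypothetical function names and input values; the guard against a non-positive denominator is an assumption of this example and is not described in the paper.

```python
def depressive_coefficient(p_pos, p_neg, p_neu, lam_pos=1.0, lam_neg=1.0):
    """Eq. (13): ratio of calibrated depressive to calibrated healthy evidence."""
    denominator = p_neg + lam_neg * p_neu
    if denominator <= 0:
        # Assumption of this sketch: refuse to divide by a non-positive value.
        raise ValueError("healthy-side evidence must be positive")
    return (p_pos + lam_pos * p_neu) / denominator


def predict(delta, delta_star=1.0):
    """Eq. (14): threshold the depressive coefficient."""
    return "depressive" if delta >= delta_star else "healthy"


# Hypothetical support scores.
delta_default = depressive_coefficient(0.5, 0.6, 0.2)               # (0.5+0.2)/(0.6+0.2)
delta_shifted = depressive_coefficient(0.5, 0.6, 0.2, lam_pos=2.0)  # (0.5+0.4)/(0.6+0.2)

label_default = predict(delta_default)
label_shifted = predict(delta_shifted)
```

With the symmetric calibrators the example lands below the threshold, while giving the neutral evidence more weight on the depressive side pushes it above, which is the kind of risk preference a clinician could express through $\lambda^{+}$ and $\lambda^{-}$.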

#### III-D3 Default Configuration

In our standard configuration, we set $\lambda^{+}=\lambda^{-}=1$ and $\delta^{\ast}=1$, assuming balanced sensitivity towards the two diagnostic poles. As $\lambda^{\pm}$ and $\delta^{\ast}$ are decision-side hyperparameters rather than learnable weights, they can be calibrated on a small held-out validation set or adjusted directly by clinicians to reflect domain-specific risk preferences. We systematically explore this property in the hyperparameter sensitivity analysis in Section [IV-E](https://arxiv.org/html/2606.10796#S4.SS5).

TABLE I: Performance comparison on DAIC-WOZ between the zero-shot baseline and Dep-LLM. Ma\* refers to macro metrics, WA\* refers to weighted-average metrics, and F1-Depr. and F1-Heal. refer to per-class binary metrics. For readability, gray shading marks the performance of Dep-LLM and bold marks the better of the two.

| Foundation LLM | Method | Acc. | Ma\*Prec. | Ma\*Rec. | Ma\*F1. | WA\*Prec. | WA\*Rec. | WA\*F1. | F1-Depr. | F1-Heal. |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama2-7B-Base | Zero-shot | 0.468 | 0.516 | 0.518 | 0.464 | 0.600 | 0.468 | 0.483 | 0.419 | 0.510 |
| Llama2-7B-Base | Dep-LLM | 0.574 | 0.579 | 0.594 | 0.558 | 0.661 | 0.574 | 0.592 | 0.474 | 0.643 |
| Llama2-7B-Chat | Zero-shot | 0.574 | 0.562 | 0.574 | 0.550 | 0.642 | 0.574 | 0.592 | 0.444 | 0.655 |
| Llama2-7B-Chat | Dep-LLM | 0.638 | 0.618 | 0.640 | 0.613 | 0.695 | 0.638 | 0.653 | 0.514 | 0.712 |
| Llama2-13B-Base | Zero-shot | 0.510 | 0.506 | 0.508 | 0.487 | 0.588 | 0.511 | 0.531 | 0.378 | 0.596 |
| Llama2-13B-Base | Dep-LLM | 0.574 | 0.545 | 0.553 | 0.539 | 0.624 | 0.574 | 0.591 | 0.412 | 0.667 |
| Llama2-13B-Chat | Zero-shot | 0.553 | 0.549 | 0.558 | 0.532 | 0.630 | 0.553 | 0.572 | 0.432 | 0.632 |
| Llama2-13B-Chat | Dep-LLM | 0.532 | 0.595 | 0.605 | 0.530 | 0.687 | 0.532 | 0.542 | 0.500 | 0.560 |
| Llama3-8B-Base | Zero-shot | 0.532 | 0.554 | 0.521 | 0.450 | 0.638 | 0.532 | 0.550 | 0.450 | 0.593 |
| Llama3-8B-Base | Dep-LLM | 0.681 | 0.677 | 0.711 | 0.666 | 0.757 | 0.681 | 0.694 | 0.595 | 0.737 |
| Llama3-8B-Instruct | Zero-shot | 0.532 | 0.519 | 0.523 | 0.505 | 0.600 | 0.532 | 0.552 | 0.389 | 0.621 |
| Llama3-8B-Instruct | Dep-LLM | 0.660 | 0.647 | 0.675 | 0.640 | 0.725 | 0.660 | 0.674 | 0.556 | 0.724 |
| Llama4-17B-Base | Zero-shot | 0.596 | 0.609 | 0.630 | 0.584 | 0.694 | 0.596 | 0.612 | 0.513 | 0.655 |
| Llama4-17B-Base | Dep-LLM | 0.681 | 0.635 | 0.649 | 0.639 | 0.702 | 0.681 | 0.689 | 0.516 | 0.762 |
| Llama4-17B-Instruct | Zero-shot | 0.617 | 0.589 | 0.604 | 0.585 | 0.665 | 0.617 | 0.631 | 0.471 | 0.700 |
| Llama4-17B-Instruct | Dep-LLM | 0.744 | 0.719 | 0.756 | 0.724 | 0.788 | 0.745 | 0.754 | 0.647 | 0.800 |
| Qwen2.5-7B-Base | Zero-shot | 0.660 | 0.619 | 0.634 | 0.621 | 0.689 | 0.660 | 0.670 | 0.500 | 0.742 |
| Qwen2.5-7B-Base | Dep-LLM | 0.681 | 0.635 | 0.649 | 0.639 | 0.702 | 0.681 | 0.689 | 0.516 | 0.762 |
| Qwen2.5-7B-Instruct | Zero-shot | 0.702 | 0.690 | 0.726 | 0.685 | 0.767 | 0.702 | 0.715 | 0.611 | 0.759 |
| Qwen2.5-7B-Instruct | Dep-LLM | 0.787 | 0.763 | 0.807 | 0.770 | 0.829 | 0.787 | 0.795 | 0.706 | 0.833 |
| Qwen2.5-14B-Base | Zero-shot | 0.617 | 0.558 | 0.563 | 0.559 | 0.633 | 0.617 | 0.624 | 0.400 | 0.719 |
| Qwen2.5-14B-Base | Dep-LLM | 0.660 | 0.619 | 0.634 | 0.621 | 0.689 | 0.660 | 0.670 | 0.500 | 0.742 |
| Qwen2.5-14B-Instruct | Zero-shot | 0.809 | 0.773 | 0.761 | 0.766 | 0.805 | 0.809 | 0.806 | 0.667 | 0.866 |
| Qwen2.5-14B-Instruct | Dep-LLM | 0.809 | 0.780 | 0.740 | 0.755 | 0.802 | 0.809 | 0.801 | 0.640 | 0.870 |
| Qwen3-4B-Base | Zero-shot | 0.681 | 0.635 | 0.649 | 0.639 | 0.702 | 0.681 | 0.689 | 0.516 | 0.762 |
| Qwen3-4B-Base | Dep-LLM | 0.681 | 0.661 | 0.690 | 0.659 | 0.736 | 0.681 | 0.694 | 0.571 | 0.746 |
| Qwen3-8B-Base | Zero-shot | 0.766 | 0.722 | 0.731 | 0.726 | 0.771 | 0.766 | 0.768 | 0.621 | 0.831 |
| Qwen3-8B-Base | Dep-LLM | 0.787 | 0.754 | 0.787 | 0.763 | 0.812 | 0.787 | 0.794 | 0.688 | 0.839 |
| Qwen3-14B-Base | Zero-shot | 0.702 | 0.644 | 0.644 | 0.644 | 0.702 | 0.702 | 0.702 | 0.500 | 0.788 |
| Qwen3-14B-Base | Dep-LLM | 0.702 | 0.664 | 0.685 | 0.668 | 0.730 | 0.702 | 0.711 | 0.563 | 0.774 |
| Qwen3.5-4B-Base | Zero-shot | 0.723 | 0.681 | 0.700 | 0.687 | 0.743 | 0.723 | 0.730 | 0.581 | 0.794 |
| Qwen3.5-4B-Base | Dep-LLM | 0.745 | 0.701 | 0.715 | 0.706 | 0.756 | 0.745 | 0.749 | 0.600 | 0.813 |
| Qwen3.5-9B-Base | Zero-shot | 0.723 | 0.666 | 0.659 | 0.662 | 0.718 | 0.723 | 0.720 | 0.518 | 0.806 |
| Qwen3.5-9B-Base | Dep-LLM | 0.787 | 0.754 | 0.787 | 0.763 | 0.812 | 0.787 | 0.794 | 0.688 | 0.839 |
| Gemma2-9B-Base | Zero-shot | 0.660 | 0.647 | 0.675 | 0.640 | 0.725 | 0.660 | 0.674 | 0.556 | 0.724 |
| Gemma2-9B-Base | Dep-LLM | 0.745 | 0.732 | 0.777 | 0.730 | 0.808 | 0.745 | 0.755 | 0.667 | 0.793 |
| Gemma2-9B-Instruct | Zero-shot | 0.681 | 0.635 | 0.649 | 0.639 | 0.702 | 0.681 | 0.689 | 0.516 | 0.762 |
| Gemma2-9B-Instruct | Dep-LLM | 0.787 | 0.748 | 0.766 | 0.755 | 0.798 | 0.787 | 0.791 | 0.667 | 0.844 |
| Gemma3-12B-Base | Zero-shot | 0.702 | 0.653 | 0.665 | 0.657 | 0.715 | 0.702 | 0.707 | 0.533 | 0.781 |
| Gemma3-12B-Base | Dep-LLM | 0.787 | 0.763 | 0.807 | 0.770 | 0.829 | 0.787 | 0.795 | 0.706 | 0.833 |
| Gemma3-12B-Instruct | Zero-shot | 0.702 | 0.690 | 0.726 | 0.685 | 0.767 | 0.702 | 0.715 | 0.611 | 0.759 |
| Gemma3-12B-Instruct | Dep-LLM | 0.851 | 0.826 | 0.812 | 0.818 | 0.849 | 0.851 | 0.849 | 0.741 | 0.896 |

## IV Experiment

### IV-A Experimental Settings

#### IV-A1 Datasets and Evaluation Metrics.

We evaluate Dep-LLM on two widely adopted clinical interview datasets for depression detection. The primary dataset is DAIC-WOZ \[[13](https://arxiv.org/html/2606.10796#bib.bib48)\], which contains semi-structured Wizard-of-Oz interviews collected for distress analysis. To further assess the generalizability of Dep-LLM beyond a single corpus, extended experiments are conducted on E-DAIC \[[8](https://arxiv.org/html/2606.10796#bib.bib49)\], an extended version of DAIC-WOZ in which the human operator behind the interviewer is fully replaced by an AI agent, leading to a noticeable distribution shift in interview style and participant behaviour.

To obtain a comprehensive and clinically meaningful view of performance, we report a panel of nine metrics: Accuracy (Acc.), Macro Precision (Ma\*Prec.), Macro Recall (Ma\*Rec.), Macro F1 (Ma\*F1.), Weighted-Average Precision (WA\*Prec.), Weighted-Average Recall (WA\*Rec.), Weighted-Average F1 (WA\*F1.), Binary F1-Depressive (F1-Depr.) and Binary F1-Healthy (F1-Heal.). Because the depressive and healthy labels are imbalanced in both datasets, we follow previous work \[[37](https://arxiv.org/html/2606.10796#bib.bib45),[19](https://arxiv.org/html/2606.10796#bib.bib50),[59](https://arxiv.org/html/2606.10796#bib.bib46)\] in reporting complementary views: Accuracy and Weighted-Average metrics reflect overall correctness in proportion to class support, Macro metrics weigh the two classes equally and thus display class-balanced behaviour, and per-class Binary metrics expose whether a method gains overall correctness at the cost of failing on the minority depressive class.
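To make the relationship between the nine metrics concrete, the sketch below derives all of them from a single binary confusion matrix using the standard definitions of macro and support-weighted averaging. The function names and the confusion counts are hypothetical; they are not taken from either dataset.

```python
def safe_div(a, b):
    """Return a / b, or 0.0 when the denominator is zero."""
    return a / b if b else 0.0


def nine_metrics(tp, fp, fn, tn):
    """Compute the nine reported metrics from binary confusion counts.

    The depressive class is treated as positive: tp/fn count depressive
    samples, tn/fp count healthy samples. Assumes at least one sample.
    """
    support = {"depr": tp + fn, "heal": tn + fp}
    total = support["depr"] + support["heal"]
    prec = {"depr": safe_div(tp, tp + fp), "heal": safe_div(tn, tn + fn)}
    rec = {"depr": safe_div(tp, tp + fn), "heal": safe_div(tn, tn + fp)}
    f1 = {c: safe_div(2 * prec[c] * rec[c], prec[c] + rec[c]) for c in support}

    def macro(scores):
        # Unweighted mean over the two classes.
        return sum(scores.values()) / len(scores)

    def weighted(scores):
        # Mean over the two classes, weighted by class support.
        return sum(scores[c] * support[c] for c in support) / total

    return {
        "Acc.": (tp + tn) / total,
        "Ma*Prec.": macro(prec), "Ma*Rec.": macro(rec), "Ma*F1.": macro(f1),
        "WA*Prec.": weighted(prec), "WA*Rec.": weighted(rec), "WA*F1.": weighted(f1),
        "F1-Depr.": f1["depr"], "F1-Heal.": f1["heal"],
    }


# Hypothetical confusion counts with a minority depressive class (30 of 100).
metrics = nine_metrics(tp=20, fp=10, fn=10, tn=60)
```

Under these definitions the weighted-average recall coincides with accuracy, and in this example the macro F1 is lower than the weighted F1 because the weaker minority class counts as much as the majority class.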

#### IV-A2 Implementation Details.

We implement Dep-LLM in PyTorch on four NVIDIA H100 GPUs using the HuggingFace Transformers library. To verify that the performance gains brought by Dep-LLM are agnostic to the underlying foundation LLM, we evaluate it on a wide spectrum of open-source foundation LLMs that differ in family, scale and alignment status. Specifically, we cover the Llama-2/3/4, Qwen-2.5/3/3.5 and Gemma-2/3 series, with parameter sizes ranging from 4B to 17B and with both base and instruction-tuned variants whenever available, yielding 21 foundation LLMs on DAIC-WOZ in total. For each foundation LLM, we compare Dep-LLM against a vanilla zero-shot prompting baseline under identical generation hyperparameters (temperature, top-K, etc.). To ensure a fair comparison, we also design a structured prompt for the zero-shot baselines; both this prompt and the Dep-LLM prompt are provided in the supplemental file.

### IV-B Comparison Experiment

To demonstrate the effectiveness of Dep-LLM, we conduct three groups of comparison experiments: 1) Dep-LLM versus vanilla zero-shot prompting on 21 foundation LLMs on DAIC-WOZ; 2) Dep-LLM versus vanilla zero-shot prompting on 6 representative foundation LLMs on E-DAIC to verify generalizability; 3) Dep-LLM versus supervised domain-specific LLMs adapted to the mental health field and the latest closed-source commercial LLMs on DAIC-WOZ.

#### IV-B1 Performance Against Zero-Shot Baselines on DAIC-WOZ

Table [I](https://arxiv.org/html/2606.10796#S3.T1) reports the head-to-head comparison between Dep-LLM and the zero-shot baseline on DAIC-WOZ across all 21 foundation LLMs. For readability, rows corresponding to Dep-LLM are shaded in gray, and bold marks the better result of the two methods for each foundation LLM. Three observations stand out: 1) Dep-LLM is superior to the zero-shot baseline on nearly all models and metrics. For example, Dep-LLM outperforms the baseline on 19 out of 21 foundation LLMs on Ma\*F1. and WA\*F1., with only slight disadvantages for the two exceptions. Similar improvements are observed on the remaining metrics, indicating the comprehensive superiority of Dep-LLM over zero-shot baselines. 2) Regarding best performance, Dep-LLM paired with Gemma3-12B-Instruct attains the best scores on all metrics, reaching Ma\*F1. of 0.818, WA\*F1. of 0.849, F1-Depr. of 0.741 and F1-Heal. of 0.896. The average absolute improvement of 12.6% across the 9 metrics is remarkable for this foundation LLM. 3) The improvements are also significant on weaker models. For example, on Llama3-8B-Base, Dep-LLM improves Ma\*F1. by over 20% in absolute value, from 0.450 to 0.666, with the other metrics improving greatly as well. This suggests that the structured analysis schema and confidence modulation of Dep-LLM compensate for the limited reasoning ability of weaker models rather than merely amplifying already-strong ones. Overall, the table confirms that Dep-LLM is a generalizable and backbone-agnostic augmentation that systematically strengthens vanilla LLMs in long-context depression reasoning.

TABLE II: Extended experiments on the E-DAIC dataset based on representative foundation models to demonstrate the generalizability of Dep-LLM.

| Foundation LLM | Method | Acc. | Ma\*Prec. | Ma\*Rec. | Ma\*F1. | WA\*Prec. | WA\*Rec. | WA\*F1. | F1-Depr. | F1-Heal. |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama4-17B-Base | Zero-shot | 0.571 | 0.538 | 0.543 | 0.533 | 0.612 | 0.571 | 0.586 | 0.400 | 0.667 |
| Llama4-17B-Base | Dep-LLM | 0.643 | 0.623 | 0.644 | 0.619 | 0.695 | 0.643 | 0.656 | 0.524 | 0.714 |
| Llama4-17B-Instruct | Zero-shot | 0.571 | 0.524 | 0.526 | 0.522 | 0.599 | 0.617 | 0.631 | 0.471 | 0.700 |
| Llama4-17B-Instruct | Dep-LLM | 0.661 | 0.604 | 0.607 | 0.605 | 0.666 | 0.661 | 0.663 | 0.457 | 0.753 |
| Qwen2.5-14B-Base | Zero-shot | 0.679 | 0.638 | 0.653 | 0.642 | 0.702 | 0.679 | 0.687 | 0.526 | 0.757 |
| Qwen2.5-14B-Base | Dep-LLM | 0.661 | 0.661 | 0.690 | 0.647 | 0.738 | 0.661 | 0.674 | 0.578 | 0.716 |
| Qwen2.5-14B-Instruct | Zero-shot | 0.696 | 0.662 | 0.683 | 0.666 | 0.725 | 0.696 | 0.705 | 0.564 | 0.767 |
| Qwen2.5-14B-Instruct | Dep-LLM | 0.732 | 0.708 | 0.741 | 0.711 | 0.774 | 0.732 | 0.742 | 0.634 | 0.789 |
| Gemma3-12B-Base | Zero-shot | 0.643 | 0.623 | 0.644 | 0.619 | 0.695 | 0.643 | 0.656 | 0.524 | 0.714 |
| Gemma3-12B-Base | Dep-LLM | 0.661 | 0.635 | 0.657 | 0.634 | 0.705 | 0.661 | 0.673 | 0.537 | 0.732 |
| Gemma3-12B-Instruct | Zero-shot | 0.625 | 0.599 | 0.615 | 0.596 | 0.671 | 0.625 | 0.639 | 0.488 | 0.704 |
| Gemma3-12B-Instruct | Dep-LLM | 0.750 | 0.732 | 0.771 | 0.733 | 0.799 | 0.750 | 0.760 | 0.667 | 0.800 |

TABLE III: Comparison experiment against supervised domain-specific LLMs and closed-source commercial LLMs. For every metric, bold marks the best performance, and underline marks the second best performance.

| Group | Model | DAIC-WOZ Acc. | DAIC-WOZ Ma\*F1. | DAIC-WOZ WA\*F1. | DAIC-WOZ F1-Depr. | DAIC-WOZ F1-Heal. | E-DAIC Acc. | E-DAIC Ma\*F1. | E-DAIC WA\*F1. | E-DAIC F1-Depr. | E-DAIC F1-Heal. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Domain-specific LLMs | MentalBERT | 0.766 | 0.726 | 0.768 | 0.621 | 0.831 | 0.714 | 0.701 | 0.726 | 0.636 | 0.765 |
| Domain-specific LLMs | MentalRoBERTa | 0.702 | 0.699 | 0.712 | 0.667 | 0.731 | 0.732 | 0.711 | 0.742 | 0.634 | 0.789 |
| Domain-specific LLMs | ClinicalBERT | 0.702 | 0.677 | 0.714 | 0.588 | 0.767 | 0.696 | 0.666 | 0.705 | 0.564 | 0.767 |
| Domain-specific LLMs | MentaLLaMA | 0.660 | 0.631 | 0.673 | 0.529 | 0.733 | 0.696 | 0.646 | 0.699 | 0.514 | 0.779 |
| Domain-specific LLMs | MentalAlpaca | 0.681 | 0.650 | 0.692 | 0.545 | 0.754 | 0.678 | 0.663 | 0.691 | 0.591 | 0.735 |
| Domain-specific LLMs | BioMistral | 0.745 | 0.730 | 0.755 | 0.667 | 0.793 | 0.714 | 0.635 | 0.702 | 0.467 | 0.805 |
| Domain-specific LLMs | Meditron | 0.745 | 0.695 | 0.745 | 0.571 | 0.818 | 0.714 | 0.689 | 0.724 | 0.600 | 0.778 |
| Closed-source commercial LLMs | GPT-5.5 | 0.702 | 0.691 | 0.715 | 0.632 | 0.750 | 0.696 | 0.679 | 0.708 | 0.605 | 0.754 |
| Closed-source commercial LLMs | Gemini-3.1-Pro | 0.702 | 0.668 | 0.711 | 0.563 | 0.774 | 0.732 | 0.725 | 0.742 | 0.681 | 0.769 |
| Closed-source commercial LLMs | Claude-Opus-4.6 | 0.787 | 0.763 | 0.794 | 0.688 | 0.839 | 0.714 | 0.705 | 0.726 | 0.652 | 0.758 |
| Closed-source commercial LLMs | Grok-4.3 | 0.723 | 0.696 | 0.733 | 0.606 | 0.786 | 0.642 | 0.626 | 0.657 | 0.545 | 0.706 |
| Closed-source commercial LLMs | DeepSeek-V4 | 0.702 | 0.657 | 0.707 | 0.533 | 0.781 | 0.714 | 0.689 | 0.724 | 0.600 | 0.778 |
| Our method | Dep-LLM\* | 0.851 | 0.818 | 0.849 | 0.741 | 0.896 | 0.750 | 0.733 | 0.750 | 0.667 | 0.800 |

#### IV-B2 Performance Against Zero-Shot Baselines on E-DAIC

To validate the generalizability of Dep-LLM across different corpora, we further evaluate representative foundation LLMs from the Llama, Qwen and Gemma series on the E-DAIC dataset. The results are reported in Table [II](https://arxiv.org/html/2606.10796#S4.T2): 1) Despite the differences in interview method and data quality, Dep-LLM continues to outperform the zero-shot baselines on all six foundation models on Ma\*Prec., Ma\*Rec., Ma\*F1. and WA\*Prec., and on 5 out of 6 models on Acc., WA\*Rec., WA\*F1., F1-Depr. and F1-Heal. according to the table. 2) The largest gain is observed on Gemma3-12B-Instruct, where Dep-LLM raises Ma\*F1. from 0.596 to 0.733, WA\*F1. from 0.639 to 0.760, and F1-Depr. from 0.488 to 0.667. These results provide strong evidence for the effectiveness and generalizability of Dep-LLM across different foundation LLMs and different corpora.

TABLE IV: Ablation study to explore the importance of CoT reasoning, the included themes, and the multi-factor collaborative strategy. For every metric, bold marks the best performance, and underline marks the second best performance.

| Group | Model Variant | Acc. | Ma\*Prec. | Ma\*Rec. | Ma\*F1. | WA\*Prec. | WA\*Rec. | WA\*F1. | F1-Depr. | F1-Heal. |
|---|---|---|---|---|---|---|---|---|---|---|
| | zero-shot | 0.702 | 0.690 | 0.726 | 0.685 | 0.767 | 0.702 | 0.715 | 0.611 | 0.759 |
| Importance of Chain-of-Thought Reasoning | w/o Multi-factor CoT | 0.702 | 0.644 | 0.644 | 0.644 | 0.702 | 0.702 | 0.702 | 0.500 | 0.788 |
| Importance of Chain-of-Thought Reasoning | w/o Thinking Steps | 0.702 | 0.664 | 0.685 | 0.668 | 0.730 | 0.702 | 0.711 | 0.563 | 0.774 |
| Importance of Themes | w/o Family | 0.787 | 0.748 | 0.766 | 0.755 | 0.798 | 0.787 | 0.791 | 0.667 | 0.844 |
| Importance of Themes | w/o Work | 0.808 | 0.771 | 0.781 | 0.776 | 0.813 | 0.809 | 0.810 | 0.690 | 0.862 |
| Importance of Themes | w/o Mental | 0.745 | 0.719 | 0.756 | 0.724 | 0.788 | 0.745 | 0.754 | 0.647 | 0.800 |
| Importance of Themes | w/o Medical | 0.787 | 0.748 | 0.725 | 0.734 | 0.780 | 0.787 | 0.782 | 0.615 | 0.853 |
| Importance of Themes | w/o Overall | 0.830 | 0.804 | 0.776 | 0.787 | 0.825 | 0.830 | 0.825 | 0.692 | 0.882 |
| Multi-factor Collaborative Strategy | Majority Voting | 0.702 | 0.676 | 0.706 | 0.678 | 0.747 | 0.702 | 0.714 | 0.588 | 0.767 |
| Multi-factor Collaborative Strategy | High Possibility First | 0.702 | 0.676 | 0.706 | 0.678 | 0.747 | 0.702 | 0.714 | 0.588 | 0.767 |
| Multi-factor Collaborative Strategy | High Confidence First | 0.745 | 0.709 | 0.736 | 0.716 | 0.771 | 0.745 | 0.752 | 0.625 | 0.806 |
| Multi-factor Collaborative Strategy | Fusion w/o Confidence | 0.723 | 0.704 | 0.741 | 0.704 | 0.777 | 0.723 | 0.735 | 0.629 | 0.780 |
| Multi-factor Collaborative Strategy | Fusion w/o Modulation | 0.808 | 0.773 | 0.802 | 0.783 | 0.825 | 0.809 | 0.813 | 0.710 | 0.857 |
| Full Version | Dep-LLM\* | 0.851 | 0.826 | 0.812 | 0.818 | 0.849 | 0.851 | 0.849 | 0.741 | 0.896 |

#### IV-B3 Performance Against Other LLM Methods

To evaluate Dep-LLM within the broader landscape of LLM-based depression detection, we further compare it with two representative groups of strong competitors, as shown in Table [III](https://arxiv.org/html/2606.10796#S4.T3): 1) supervised domain-specific LLMs that are explicitly pre-trained and fine-tuned on mental-health and clinical corpora, including MentalBERT \[[18](https://arxiv.org/html/2606.10796#bib.bib40)\], MentalRoBERTa \[[18](https://arxiv.org/html/2606.10796#bib.bib40)\], ClinicalBERT \[[17](https://arxiv.org/html/2606.10796#bib.bib92)\], MentaLLaMA \[[52](https://arxiv.org/html/2606.10796#bib.bib26)\], MentalAlpaca \[[50](https://arxiv.org/html/2606.10796#bib.bib25)\], BioMistral \[[22](https://arxiv.org/html/2606.10796#bib.bib93)\] and Meditron \[[6](https://arxiv.org/html/2606.10796#bib.bib94)\]; 2) closed-source commercial LLMs including GPT-5.5, Gemini-3.1-Pro, Claude-Opus-4.6, Grok-4.3, and DeepSeek-V4, called through their public APIs. For a fair and reproducible comparison, Dep-LLM uses its best open-source foundation LLM, Gemma3-12B-Instruct, and is denoted as Dep-LLM\* in the table. Table [III](https://arxiv.org/html/2606.10796#S4.T3) reports five metrics (Acc., Ma\*F1., WA\*F1., F1-Depr., and F1-Heal.) for each method on the DAIC-WOZ and E-DAIC datasets.

##### Versus Domain-Specific LLMs.

Although supervised domain-specific LLMs benefit from extensive mental-health-oriented pre-training or supervised fine-tuning, Dep-LLM outperforms nearly every one of them by a clear margin on both datasets. On DAIC-WOZ, we compare Dep-LLM\* against the best domain-specific score on each metric: Acc. (0.851 vs. 0.766), Ma\*F1. (0.818 vs. 0.730), WA\*F1. (0.849 vs. 0.768), F1-Depr. (0.741 vs. 0.667), F1-Heal. (0.896 vs. 0.831). On E-DAIC this advantage remains, except for a tiny disadvantage (0.5%) on F1-Heal. against BioMistral; yet Dep-LLM\* exceeds BioMistral on F1-Depr. by a large gap (20.0%), which suggests that BioMistral performs unevenly across the two classes on E-DAIC. Crucially, Dep-LLM\* achieves this without any task-specific training or fine-tuning, demonstrating that a well-structured training-free pipeline can rival and in fact surpass supervised domain-specific models that depend on scarce labeled clinical data.

##### Versus Commercial LLMs.

The selected closed-source commercial LLMs represent the current frontier of general-purpose reasoning. On DAIC-WOZ, the strongest competitor is Claude-Opus-4.6, yet Dep-LLM\* still wins on every metric, with +6.4% Acc., +5.5% Ma\*F1., +5.5% WA\*F1., +5.3% F1-Depr., and +5.7% F1-Heal. On E-DAIC, Gemini-3.1-Pro emerges as the strongest commercial baseline. Dep-LLM\* nevertheless outperforms Gemini-3.1-Pro on all metrics except F1-Depr., where the difference is slight, indicating that Gemini-3.1-Pro is stronger on one class but not in overall and balanced capability. It is worth noting that Dep-LLM\* is built on a 12B open-source foundation LLM, far smaller than the commercial systems in the comparison. Overall, the lightweight training-free framework beats the frontier commercial LLMs on most metrics and displays a more balanced performance, suggesting that for high-stakes clinical scenarios, structured CoT decomposition and entropy-grounded confidence modulation are more critical than mere parameter scale.

### IV-C Ablation Study

To demonstrate the effectiveness and indispensability of each component of Dep-LLM, we conduct a comprehensive ablation study on the foundation LLM Gemma3-12B-Instruct, reporting all nine metrics in Table [IV](https://arxiv.org/html/2606.10796#S4.T4). For readability, bold marks the best score for each metric (all achieved by the full Dep-LLM\*) and underline marks the second best performance. The ablation study is organized into three groups, which respectively examine the contribution of the Chain-of-Thought reasoning, each individual theme in the multi-factor schema, and the collaborative fusion strategy.

#### IV-C1 Importance of Chain-of-Thought Reasoning

We consider two decremental variants: *w/o Multi-factor CoT*, which holistically evaluates the depressive/healthy/neutral possibilities and rationales from the raw clinical interview dialogue without structured multi-factor decomposition; and *w/o Thinking Steps*, which removes the explicit step-by-step thinking prompt and directly generates multi-factor possibilities and rationales. The performance of both variants collapses to the level of the zero-shot baseline, with WA\*F1. of 0.702 and 0.711 versus 0.849 for the full Dep-LLM\*. F1-Depr. drops even more sharply, from 0.741 to 0.500 and 0.563, which is worse than the zero-shot baseline (0.611). This result confirms that the CoT-driven structured multi-factor reasoning is an essential foundation of the Dep-LLM framework because it decomposes the long and complex raw dialogues; without it, Dep-LLM falls back to superficial holistic judgment.

#### IV-C2 Importance of Themes

We remove one theme at a time and run prediction using the remaining four themes. The results show that every removal leads to a performance drop. Averaged over the five themes, Acc. drops by 5.96%, Ma\*F1. drops by 6.28% and WA\*F1. drops by 5.66%. Among these themes, removing mental state causes the biggest performance drop on most metrics, followed by medical history, family relationship and work satisfaction, which aligns with clinical intuition. Notably, although its effect is smaller, removing the overall evaluation theme still hurts performance, especially on Ma\*Rec. and F1-Depr., meaning that it is important for recognizing depressive samples. This demonstrates that the overall theme contributes supplementary integrative evidence from the four atomic themes rather than merely restating them.
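The averaged drops quoted above can be recomputed directly from the theme-ablation rows of Table IV. The short script below does only that arithmetic; the dictionary layout and the helper name `mean_drop` are choices made for this example.

```python
# Values copied from Table IV (Gemma3-12B-Instruct on DAIC-WOZ).
full = {"Acc.": 0.851, "Ma*F1.": 0.818, "WA*F1.": 0.849}
ablated = {
    "w/o Family":  {"Acc.": 0.787, "Ma*F1.": 0.755, "WA*F1.": 0.791},
    "w/o Work":    {"Acc.": 0.808, "Ma*F1.": 0.776, "WA*F1.": 0.810},
    "w/o Mental":  {"Acc.": 0.745, "Ma*F1.": 0.724, "WA*F1.": 0.754},
    "w/o Medical": {"Acc.": 0.787, "Ma*F1.": 0.734, "WA*F1.": 0.782},
    "w/o Overall": {"Acc.": 0.830, "Ma*F1.": 0.787, "WA*F1.": 0.825},
}


def mean_drop(metric):
    """Average absolute drop of one metric over the five theme removals."""
    drops = [full[metric] - row[metric] for row in ablated.values()]
    return sum(drops) / len(drops)


# Average drop in percentage points, rounded to two decimals.
avg_drop = {m: round(100 * mean_drop(m), 2) for m in full}
print(avg_drop)  # {'Acc.': 5.96, 'Ma*F1.': 6.28, 'WA*F1.': 5.66}
```

The percentages are therefore absolute differences in percentage points between the full model and the mean of the five ablated variants.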

#### IV-C3 Multi-factor Collaborative Strategy

We compare our confidence-weighted fusion against three rule-based alternatives and two decremental variants. The rule-based strategies are *Majority Voting*, which adopts the label endorsed by the most themes; *High Possibility First*, which adopts the label of the theme with the highest possibility; and *High Confidence First*, which adopts the label of the theme with the highest confidence. They all underperform the full Dep-LLM\* significantly, indicating that superficial rules that do not make comprehensive use of possibilities and confidences are unreliable in complex clinical applications. The two decremental variants are *Fusion w/o Confidence*, which integrates themes using possibilities only, and *Fusion w/o Modulation*, which integrates themes using raw confidence without intra-label and inter-theme modulation. Focusing on Acc., we can see that possibility alone contributes +2.1% over the zero-shot baseline, raw confidence contributes +8.5% over possibility, and modulation contributes the remaining +4.3% towards the full Dep-LLM\*. This trajectory is further explored in the following case study (Figure [4](https://arxiv.org/html/2606.10796#S4.F4)), which demonstrates quantitatively that the confidence mechanism and its modulation are of great significance to the fusion process.

### IV-D Case Study

![Refer to caption](https://arxiv.org/html/2606.10796v1/x3.png)

Figure 3: Possibility, rationale and confidence in the multi-factor analysis of two example cases.

#### IV-D1 Multi-factor Content Analysis

To illustrate how Dep-LLM structurally decomposes clinical interviews and produces evidence-grounded reasoning, we visualize two representative thematic cases in Figure [3](https://arxiv.org/html/2606.10796#S4.F3), supporting a healthy and a depressive diagnosis respectively. For each theme, Dep-LLM generates label-specific possibilities and rationales, and the confidences are calculated during the generation process itself, mirroring the analysis procedure of clinicians.

In the *Family Relationship* case, the participant describes a close bond with their parents and son. Dep-LLM correctly identifies this as a protective signal and assigns the highest possibility (70%) to the healthy label. The depressive (10%) and neutral (20%) possibilities are supported by rationales that acknowledge the underlying struggles without overgeneralizing them, showing a clinically faithful attitude. In the *Work Satisfaction* case, the participant reports trouble finding a job and work-related stress. Dep-LLM assigns the dominant possibility to the depressive label (65%) and precisely captures clues such as 'strained work environment' rather than producing superficial summaries. Meanwhile, the healthy rationale honestly admits the lack of positive evidence, and the neutral rationale appropriately frames the situation as 'a factual description' rather than a confirming diagnostic factor.

These two cases jointly demonstrate three merits of Dep\-LLM\. First, the structured multi\-factor schema enables the framework to analyze based on clinically meaningful themes \(family, work, etc\.\) rather than unclear holistic impressions\. Second, the multi\-label decomposition forces the LLM to argue for each diagnostic hypothesis, suppressing biased reasoning leaning towards one side\. Third, the generated rationales are strictly evidence\-anchored and can be traced back to specific clues in dialogue, which is essential for clinical reliability\.

![Refer to caption](https://arxiv.org/html/2606.10796v1/x4.png)

Figure 4: Quantitative impact of confidence and modulation on the prediction.

#### IV-D2 Confidence Mechanism

To further show the role of the confidence mechanism in shaping the final diagnosis, Figure [4](https://arxiv.org/html/2606.10796#S4.F4) numerically visualizes the fusion process under three different confidence settings: *No Confidence* (all confidences degenerate to 1), *Average Confidence* (using the raw average confidence $C^{\text{avg}}$ specified in Equation [8](https://arxiv.org/html/2606.10796#S3.E8)) and *Modulated Confidence* (using $C^{\text{mod}}$ produced by the intra-label and inter-theme modulation specified in Equation [11](https://arxiv.org/html/2606.10796#S3.E11)). These three settings are aligned with the ablation study (Table [IV](https://arxiv.org/html/2606.10796#S4.T4)), corresponding to *Fusion w/o Confidence*, *Fusion w/o Modulation* and *Full Version* respectively.

Figure [4](https://arxiv.org/html/2606.10796#S4.F4) shows that fusion using only possibilities (*No Confidence*) yields a depressive coefficient $\delta=0.714<1$, which mistakenly supports a healthy diagnosis because numerically large possibilities from less trustworthy themes outweigh other informative depressive clues. *Average Confidence* re-weights the possibilities with $C^{\text{avg}}$ and produces $\delta=0.813<1$. Although the prediction is still incorrect, $\delta$ is nearer to the threshold 1, indicating that average confidence carries a positive but insufficient signal when not normalized, since longer or more complex themes tend to dominate due to scale rather than reliability. Fusion under *Modulated Confidence* yields $\delta=1.013>1$ and correctly predicts the depressive diagnosis by allowing inter-theme interaction, amplifying a theme's weight when its own rationale is confident and other themes are uncertain.

The progression from incorrect, to incorrect but closer, to correct offers a clear quantitative trajectory of how confidence and modulation reshape the fusion process. The results support the reasoning in Section [III-C](https://arxiv.org/html/2606.10796#S3.SS3) that raw average confidence derived from token entropy is a proxy for epistemic reliability, while contrastive modulation across labels and themes produces meaningful weights for fusion.

### IV-E Hyperparameter Sensitivity Analysis

![Refer to caption](https://arxiv.org/html/2606.10796v1/x5.png)

Figure 5: Impact of shifting $\delta^{\ast}$ on the WA\*F1. score of different models.

![Refer to caption](https://arxiv.org/html/2606.10796v1/x6.png)

Figure 6: Impact of shifting $\lambda^{+},\lambda^{-}$ on the WA\*F1. score of different models.

Since Dep-LLM is a training-free framework, the two decision hyperparameters introduced in Section [III-D](https://arxiv.org/html/2606.10796#S3.SS4), namely the depression threshold $\delta^{\ast}$ and the neutral label calibrators $\lambda^{+},\lambda^{-}$, are not learned from data but configured by design. To verify the robustness of the default configuration $\delta^{\ast}=1,\lambda^{+}=\lambda^{-}=1$ and to provide practical calibration guidance for real-world deployment, we conduct a sensitivity analysis on these two hyperparameters across representative foundation LLMs from the Qwen, Llama and Gemma series.

#### IV-E1 Depression Threshold $\delta^{\ast}$.

We sweep $\delta^{\ast}$ over the interval $[0.8,1.2]$ and report the resulting WA\*F1. in Figure [5](https://arxiv.org/html/2606.10796#S4.F5). We draw three conclusions from the results. First, although the exact peak location varies across models (Llama4-17B-Base and Gemma3-12B-Instruct perform best slightly below 1 and the others peak slightly above 1), all peaks are tightly clustered around 1, confirming that $\delta^{\ast}=1$ is a near-optimal and well-justified default. Second, the WA\*F1. curves remain relatively smooth around the optimum, indicating that Dep-LLM is robust to small perturbations of $\delta^{\ast}$. Third, the discrepancy in the optimum across models can be attributed to the intrinsic semantic preference of different LLMs: differences in architecture, pre-training corpora and alignment strategies introduce subtle but systematic biases towards a specific diagnosis. Consequently, once the foundation LLM is fixed in deployment, practitioners may further calibrate $\delta^{\ast}$ on a small held-out validation set to gain additional accuracy at marginal effort.
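A threshold sweep of this kind can be sketched as follows, assuming the depressive coefficients $\delta$ of a small validation set have already been computed. The step count, the function names and the toy data are all assumptions of this example, and the weighted F1 is implemented from its standard definition rather than taken from the authors' code.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted F1 over labels 1 (depressive) and 0 (healthy)."""
    total = len(y_true)
    score = 0.0
    for cls in (0, 1):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        score += f1 * (tp + fn) / total  # weight by class support
    return score


def sweep_threshold(deltas, y_true, lo=0.8, hi=1.2, steps=41):
    """Evaluate WA*F1. for evenly spaced thresholds in [lo, hi]."""
    curve = []
    for k in range(steps):
        delta_star = lo + (hi - lo) * k / (steps - 1)
        y_pred = [1 if d >= delta_star else 0 for d in deltas]
        curve.append((delta_star, weighted_f1(y_true, y_pred)))
    return curve


# Hypothetical validation set: depressive coefficients and gold labels.
deltas = [0.70, 0.85, 0.95, 1.05, 1.10, 1.30]
labels = [0, 0, 1, 0, 1, 1]

curve = sweep_threshold(deltas, labels)
best_delta_star, best_score = max(curve, key=lambda item: item[1])
```

Because the sweep only re-thresholds precomputed coefficients, it requires no additional LLM calls, which is what makes this calibration inexpensive once the foundation LLM is fixed.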

#### IV-E2 Neutral Label Calibrators $\lambda^{+},\lambda^{-}$.

Analogously, we jointly sweep $\lambda^{+},\lambda^{-}$ over $[0,2]\times[0,2]$ and visualize the resulting WA\*F1. as heatmaps in Figure [6](https://arxiv.org/html/2606.10796#S4.F6). Two patterns emerge. First, the high-performance regions concentrate along the diagonal $\lambda^{+}=\lambda^{-}$, with slight shifts for different models; this diagonal passes through our default symmetric setting $\lambda^{+}=\lambda^{-}=1$ from Section [III-D](https://arxiv.org/html/2606.10796#S3.SS4). Second, there are many iso-performance bands with unit slope. This is predictable from the structure of Equation [13](https://arxiv.org/html/2606.10796#S3.E13): at the default threshold $\delta^{\ast}=1$, equal shifts of $\lambda^{+}$ and $\lambda^{-}$ change the value of $\delta$ in general but leave the comparison $\delta\geq\delta^{\ast}$, and hence the prediction, unchanged. In practice, $\lambda^{+}=\lambda^{-}=1$ remains a universal choice, yet it can be fine-calibrated for better performance.
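The unit-slope bands can be checked with a short derivation from Equation (13). It assumes the default threshold $\delta^{\ast}=1$ and a positive denominator $p^{-}+\lambda^{-}p^{=}$; the shift constant $c$ is notation introduced only for this example.

```latex
% Decision rule of Eq. (14) at the default threshold \delta^{\ast} = 1,
% assuming p^{-} + \lambda^{-} p^{=} > 0.
\begin{aligned}
\delta \ge 1
  &\iff p^{+} + \lambda^{+} p^{=} \;\ge\; p^{-} + \lambda^{-} p^{=} \\
  &\iff (p^{+} - p^{-}) + (\lambda^{+} - \lambda^{-})\, p^{=} \;\ge\; 0 .
\end{aligned}
% A common shift c of both calibrators cancels in the difference:
% (\lambda^{+} + c) - (\lambda^{-} + c) = \lambda^{+} - \lambda^{-}.
```

The prediction therefore depends on the calibrators only through $\lambda^{+}-\lambda^{-}$ when $\delta^{\ast}=1$, so every line of unit slope in the $(\lambda^{+},\lambda^{-})$ plane yields the same predictions and the same WA\*F1. For thresholds other than 1 this cancellation no longer holds exactly.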

Taken together, the sensitivity analysis on $\delta^{\ast},\lambda^{+},\lambda^{-}$ demonstrates that Dep-LLM is robust by design, with a default hyperparameter configuration supported by empirical analysis. Furthermore, Dep-LLM still allows clinicians to calibrate these hyperparameters for deployment-specific adaptation in a transparent and lightweight manner, which is often critical in high-stakes mental health applications.

## V Conclusion

In this paper, we present Dep-LLM, a training-free framework for automatic depression detection that jointly addresses two long-standing challenges: inaccurate reasoning over sparse clues in long dialogues, and the data scarcity and high training cost of supervised systems. Mirroring the diagnostic workflow of clinicians, Dep-LLM operates entirely on frozen LLMs through three coordinated stages: 1) a CoT Depression Multi-factor Analysis module that decomposes each interview into SCID-aligned themes and extracts evidence-grounded rationales; 2) a Confidence Analysis and Modulation module that estimates the epistemic reliability of each rationale from token-level entropy and modulates it across labels and themes; 3) a Collaborative Multi-factor Prediction module that integrates these confidence-weighted signals into a transparent diagnosis. Since the pipeline introduces no learnable parameters and requires no labeled data, it can be deployed under the data and computational constraints typical of real clinical settings. Extensive experiments on DAIC-WOZ and E-DAIC show that Dep-LLM consistently improves diverse foundation LLMs over the zero-shot baseline and outperforms both supervised domain-specific LLMs and the latest commercial LLMs, while retaining high interpretability.
