A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

arXiv cs.CL 05/26/26, 04:00 AM Papers

depression-detection clinical-interviews benchmark-audit multi-probe evaluation daic cmdc

Summary

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across five datasets, finding that standard evaluation protocols may overestimate model performance and that leaderboard rankings lack stability.

arXiv:2605.23977v1 Announce Type: new Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
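
The subject-disjoint leave-one-subject-out protocol named in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the helper names (`loso_folds`, `macro_f1`, `majority_label`), the toy subject IDs and labels, and the majority-class stand-in for a real model are all assumptions made for illustration. It shows only how each subject is held out in turn and how pooled out-of-fold predictions are scored with macro-F1.

```python
from collections import defaultdict


def loso_folds(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx), one fold per subject."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    for sid, test_idx in by_subject.items():
        # Every sample from the held-out subject is excluded from training.
        train_idx = [i for i, s in enumerate(subject_ids) if s != sid]
        yield sid, train_idx, test_idx


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def majority_label(labels):
    """Hypothetical stand-in for a trained model: predict the most common label."""
    return max(sorted(set(labels)), key=labels.count)


# Toy data (hypothetical): six interview segments from four subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3", "s4"]
labels = [1, 1, 0, 0, 0, 1]

folds = list(loso_folds(subject_ids))
out_of_fold = [None] * len(labels)
for sid, train_idx, test_idx in folds:
    pred = majority_label([labels[i] for i in train_idx])
    for i in test_idx:
        out_of_fold[i] = pred

# Score once over the pooled out-of-fold predictions.
score = macro_f1(labels, out_of_fold)
```

Because every prediction comes from a model that never saw the held-out subject, the pooled score does not depend on any fixed holdout split, which is the property the abstract relies on for its reference point.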

Original Article

Cached at: 05/26/26, 09:00 AM

# A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks
Source: [https://arxiv.org/abs/2605.23977](https://arxiv.org/abs/2605.23977)
[View PDF](https://arxiv.org/pdf/2605.23977)

> Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
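
The ranking-stability quantities in the abstract (rank position under two protocols, top-k overlap, and the share of subject bootstraps in which a configuration is rank-1) can be sketched as follows. This is a hedged illustration, not the paper's code: the configuration names, the toy scores, and the per-subject hit vectors are hypothetical, and per-subject accuracy stands in for macro-F1 to keep the example short.

```python
import random


def ranks(scores):
    """Rank 1 = highest score; ties broken by name so the result is deterministic."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: r for r, name in enumerate(ordered, start=1)}


def top_k_overlap(scores_a, scores_b, k):
    """Number of configurations that are in the top k under both protocols."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    top_a = {name for name, r in ra.items() if r <= k}
    top_b = {name for name, r in rb.items() if r <= k}
    return len(top_a & top_b)


def spearman(scores_a, scores_b):
    """Spearman rho via the no-ties formula 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[name] - rb[name]) ** 2 for name in ra)
    return 1 - 6 * d2 / (n * (n * n - 1))


def rank1_frequency(correct, winner, n_boot=2000, seed=0):
    """Fraction of subject bootstraps in which `winner` is the sole top scorer.

    `correct` maps each configuration to a list of 0/1 hits, one per test subject.
    Accuracy is used here as a simplification (an assumption, not the paper's metric).
    Ties for first place are not counted as wins.
    """
    rng = random.Random(seed)
    n_subjects = len(correct[winner])
    wins = 0
    for _ in range(n_boot):
        # Resample subjects with replacement, using the same draw for every configuration.
        sample = [rng.randrange(n_subjects) for _ in range(n_subjects)]
        hits = {name: sum(h[i] for i in sample) for name, h in correct.items()}
        best = max(hits.values())
        leaders = [name for name, h in hits.items() if h == best]
        wins += leaders == [winner]
    return wins / n_boot


# Toy scores (hypothetical) for four configurations under two protocols.
cv_scores = {"A": 0.70, "B": 0.68, "C": 0.66, "D": 0.60}
test_scores = {"A": 0.61, "B": 0.66, "C": 0.70, "D": 0.69}

cv_rank_of_test_winner = ranks(cv_scores)["C"]
overlap_at_2 = top_k_overlap(cv_scores, test_scores, k=2)
rho = spearman(cv_scores, test_scores)

# Toy per-subject hits (hypothetical) on an eight-subject test set.
correct = {
    "C": [1, 1, 1, 0, 1, 0, 1, 1],
    "D": [1, 1, 0, 1, 1, 0, 1, 1],
    "B": [1, 0, 1, 1, 0, 0, 1, 1],
}
stability = rank1_frequency(correct, winner="C")
```

With a small test set, two configurations that differ on only a few subjects swap places often across resamples, so a low rank-1 frequency signals that the leaderboard winner is not well separated from its neighbours.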

## Submission history

From: Takehiro Ishikawa. **[v1]** Wed, 13 May 2026 17:32:41 UTC (347 KB)

Similar Articles

MedBench v5: A Dynamic, Process-Oriented, and Hallucination-Aware Benchmark for Clinical Multimodal Models

LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV

When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
