A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

arXiv cs.CL Papers

Summary

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across five datasets, finding that standard evaluation protocols may overestimate model performance and that leaderboard rankings lack stability.



# A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks
Source: [https://arxiv.org/abs/2605.23977](https://arxiv.org/abs/2605.23977)
[View PDF](https://arxiv.org/pdf/2605.23977)

> Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH. First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout. Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps. Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker. Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.
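
The ranking-stability checks described in the second probe (top-k overlap between two rankings, and how often the apparent winner stays rank-1 under subject-level bootstrap resampling) can be illustrated with a minimal sketch. This is not the authors' code. It assumes one hard binary prediction per subject for each configuration, and the function names (`macro_f1`, `top_k_overlap`, `rank1_frequency`) and the rule of splitting ties evenly are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 for binary labels in {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    per_class = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        # A class that is absent from both labels and predictions scores 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(per_class))


def top_k_overlap(scores_a, scores_b, k=3):
    """Number of configurations in the top-k of both score lists."""
    top_a = set(np.argsort(-np.asarray(scores_a))[:k].tolist())
    top_b = set(np.argsort(-np.asarray(scores_b))[:k].tolist())
    return len(top_a & top_b)


def rank1_frequency(y_true, preds, n_boot=1000, seed=0):
    """Fraction of subject bootstraps in which each configuration is rank-1.

    preds has shape (n_configs, n_subjects); subjects are resampled with
    replacement, and ties for first place are split evenly (an assumption).
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    preds = np.asarray(preds)
    n_subjects = y_true.shape[0]
    wins = np.zeros(preds.shape[0])
    for _ in range(n_boot):
        idx = rng.integers(0, n_subjects, size=n_subjects)
        f1 = np.array([macro_f1(y_true[idx], p[idx]) for p in preds])
        winners = np.flatnonzero(f1 == f1.max())
        wins[winners] += 1.0 / winners.size
    return wins / n_boot
```

Calling `rank1_frequency(y_test, test_preds)` returns one value per configuration; a low value for the configuration with the best point estimate suggests that its lead depends on which subjects happen to be in the test set. `top_k_overlap(cv_scores, test_scores, k=3)` compares the cross-validation and official-test leaderboards in the same spirit.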

## Submission history

From: Takehiro Ishikawa **[v1]** Wed, 13 May 2026 17:32:41 UTC (347 KB)
