A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks
Summary
This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across five datasets, finding that standard evaluation protocols may overestimate model performance and that leaderboard rankings lack stability.
# A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

Source: [https://arxiv.org/abs/2605.23977](https://arxiv.org/abs/2605.23977) [View PDF](https://arxiv.org/pdf/2605.23977)

> Abstract: This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across DAIC/E-DAIC, CMDC, ANDROIDS, MODMA, and PDCH.
>
> First, we re-evaluate E-DAIC under strict subject-disjoint leave-one-subject-out cross-validation. A lightweight hybrid text-plus-LLM-score model reaches macro-F1 = 0.723 (the highest reported under this protocol, to our knowledge), providing a conservative out-of-fold reference point that does not depend on the privileged official holdout.
>
> Second, we test whether the E-DAIC official split supports fine-grained leaderboard rankings by sweeping 96 model configurations across modality bundles, pooling strategies, and learners. Development-side cross-validation and official-test rankings align only moderately: the best cross-validation configuration ranks twentieth on the official test, the official-test winner ranks forty-first by cross-validation, top-3 overlap is zero, and the apparent winner is rank-1 in only 32.3% of subject bootstraps.
>
> Third, we externally validate strong public CMDC and ANDROIDS baselines that achieve near-ceiling in-domain performance. Zero-shot transfer to external corpora is substantially weaker.
>
> Finally, we stress-test E-DAIC text and audio models using paired symptom-dense versus symptom-light interview slices defined by an SRDS-based annotator. Text scores rise sharply on symptom-dense slices, whereas audio scores remain nearly flat; the text-minus-audio gap is positive across all five seeds.

## Submission history

From: Takehiro Ishikawa

**[v1]** Wed, 13 May 2026 17:32:41 UTC (347 KB)
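The ranking-stability probe described in the abstract (top-3 overlap between two rankings, and how often the apparent winner stays rank-1 under subject bootstraps) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `macro_f1`, `top_k_overlap`, and `rank1_frequency` are hypothetical, and the sketch assumes binary labels, exactly one prediction per subject (so resampling indices is resampling subjects), and ties broken in favor of the first-listed configuration.

```python
import random


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the binary labels 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class absent from both labels and predictions scores 0.0 here
        # (an assumption of this sketch, not a convention from the paper).
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def top_k_overlap(scores_a, scores_b, k=3):
    """Count configurations that are in the top k under both score tables."""
    top_a = set(sorted(scores_a, key=scores_a.get, reverse=True)[:k])
    top_b = set(sorted(scores_b, key=scores_b.get, reverse=True)[:k])
    return len(top_a & top_b)


def rank1_frequency(y_true, preds_by_config, n_boot=1000, seed=0):
    """Fraction of subject-level bootstrap resamples won by each configuration.

    y_true: one label per subject.
    preds_by_config: dict mapping a configuration name to its per-subject
        predictions, aligned with y_true.
    """
    rng = random.Random(seed)
    n = len(y_true)
    wins = {name: 0 for name in preds_by_config}
    for _ in range(n_boot):
        # Resample subjects with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        resampled_true = [y_true[i] for i in idx]
        scores = {
            name: macro_f1(resampled_true, [preds[i] for i in idx])
            for name, preds in preds_by_config.items()
        }
        # max() returns the first-listed configuration on ties.
        best = max(scores, key=scores.get)
        wins[best] += 1
    return {name: count / n_boot for name, count in wins.items()}
```

A rank-1 frequency well below 1.0 for the configuration that wins on the full test set indicates that the top position depends heavily on which subjects happen to be in that set.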
Similar Articles
ClinicalBench: Stress-Testing Assertion-Aware Retrieval for Cross-Admission Clinical QA on MIMIC-IV
This paper introduces ClinicalBench and the EpiKG system, evaluating assertion-aware retrieval for clinical question answering on MIMIC-IV data across multiple LLMs. It demonstrates that handling negation and temporality in retrieval significantly improves performance over standard baselines.
When Symptoms Are Not Enough: Evidence-Weighting Patterns in Large Language Model Psychiatric Screening
This paper introduces a SCID-anchored benchmark of 555 interviews to evaluate five LLMs for psychiatric screening, finding that while models show potential, they tend to discount symptom evidence in the presence of preserved functioning or protective context, requiring careful validation.
Expert-Level Crisis Detection in Mental Health Conversations
Introduces CRADLE-Dialogue, a clinician-annotated benchmark for turn-level crisis detection in mental health conversations, along with an Alert–Confirm evaluation protocol and a synthetic training corpus plus a 32B model that outperforms existing open-source and proprietary models.
Unsteady Metrics and Benchmarking Cultures of AI Model Builders
This paper introduces Benchmarking-Cultures-25, a dataset analyzing how AI model builders selectively highlight benchmarks in press releases. It finds a fragmented evaluation landscape with limited cross-model comparability, arguing that benchmarks are used as narrative devices for market positioning rather than standardized scientific measurement.
ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder
The article introduces ASD-Bench, a comprehensive benchmark evaluating AI models for Autism Spectrum Disorder screening across four axes: predictive performance, calibration, interpretability, and robustness. It analyzes various models across different age cohorts using AQ-10 data, highlighting the importance of multi-metric evaluation in clinical AI applications.