benchmark-audit

#benchmark-audit

A Multi-Probe Audit of Clinical-Interview Depression Detection Benchmarks

arXiv cs.CL · 2026-05-26

This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across five datasets, finding that standard evaluation protocols may overestimate model performance and that leaderboard rankings lack stability.

#benchmark-audit

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

arXiv cs.CL · 2026-05-15

This paper audits multimodal physics evaluation pipelines, revealing issues such as train-eval contamination, translation drift, and MCQ saturation. It releases three new datasets (PhysCorp-A, PhysR1Corp, PhysOlym-A) and a training recipe (Physics-R1) that significantly improves performance on held-out olympiad problems.
