Tag
This paper audits benchmark evaluation in clinical-interview depression detection through four complementary probes across five datasets, finding that standard evaluation protocols may overestimate model performance and that leaderboard rankings lack stability.
This paper audits multimodal physics evaluation pipelines, revealing issues such as train-eval contamination, translation drift, and multiple-choice question (MCQ) saturation. It releases new datasets (PhysCorp-A, PhysR1Corp, PhysOlym-A) and a training recipe (Physics-R1) that significantly improves performance on held-out olympiad problems.