multi-dimensional-evaluation

Search Discipline for Long-Horizon Research Agents

arXiv cs.AI · 19h ago

This paper identifies a failure mode in long-horizon research agents: optimizing an aggregate metric can select candidates that improve the headline number while breaking critical subgroups (inversion). It proposes a search-discipline protocol in which an external control loop audits each candidate on its disaggregated behavior rather than on the aggregate score.
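
To make the inversion idea concrete, here is a minimal Python sketch of a disaggregated audit. It is an illustration under stated assumptions, not the paper's protocol: the `Candidate` layout, the function names, the sample-weighted accuracy metric, the 0.02 tolerance, and the toy numbers are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    # Hypothetical layout: subgroup name -> (accuracy, sample count).
    subgroups: dict[str, tuple[float, int]]


def aggregate(c: Candidate) -> float:
    """Sample-weighted mean accuracy across subgroups (the headline number)."""
    total = sum(n for _, n in c.subgroups.values())
    return sum(acc * n for acc, n in c.subgroups.values()) / total


def inverted_subgroups(
    baseline: Candidate, candidate: Candidate, tolerance: float = 0.02
) -> list[str]:
    """Subgroups where the candidate trails the baseline by more than tolerance."""
    return sorted(
        group
        for group, (base_acc, _) in baseline.subgroups.items()
        if candidate.subgroups[group][0] < base_acc - tolerance
    )


def audit(baseline: Candidate, candidate: Candidate, tolerance: float = 0.02) -> bool:
    """Accept only if the aggregate improves AND no subgroup regresses."""
    improves = aggregate(candidate) > aggregate(baseline)
    return improves and not inverted_subgroups(baseline, candidate, tolerance)


# Toy data (invented for illustration): the candidate gains on the large
# subgroup and loses badly on the small one.
baseline = Candidate("baseline", {"common": (0.80, 900), "rare": (0.70, 100)})
candidate = Candidate("candidate", {"common": (0.86, 900), "rare": (0.50, 100)})

if __name__ == "__main__":
    print("aggregate improves:", aggregate(candidate) > aggregate(baseline))
    print("regressed subgroups:", inverted_subgroups(baseline, candidate))
    print("accepted by audit:", audit(baseline, candidate))
```

In this toy case the candidate wins on the headline number because the large subgroup dominates the weighted mean, yet the audit rejects it because the small subgroup regresses. Keeping the check outside the scoring function mirrors the external control loop described in the summary, though the paper's actual acceptance criteria may differ.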
