Tag
DailyReport is an open-ended benchmark for evaluating search agents on daily search tasks, featuring 150 tasks and 3,546 rubrics for interpretable, user-centric evaluation.
StatefulDiscovery is a framework for open-ended scientific discovery that uses externalized investigation state to calibrate evidence and claims, outperforming baselines in producing well-supported, high-value claims.
AgentOdyssey is an open-ended text-game engine built for agents that learn continually, erasing the line between training and testing.
GTA-2 is a hierarchical benchmark for evaluating general tool agents across atomic tool use and open-ended workflows, revealing a significant capability cliff: frontier models achieve only 14.39% success on complex tasks despite reasonable atomic performance.