open-ended

#open-ended

DailyReport: An Open-ended Benchmark for Evaluating Search Agents on Daily Search Tasks

arXiv cs.AI · 15h ago

DailyReport is an open-ended benchmark for evaluating search agents on daily search tasks, featuring 150 tasks and 3,546 rubrics for interpretable, user-centric evaluation.



StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

arXiv cs.AI · yesterday

StatefulDiscovery is a framework for open-ended scientific discovery that uses externalized investigation state to calibrate evidence and claims. It outperforms baselines in producing well-supported, high-value claims.



@zheyuanzhang99: Introducing AgentOdyssey — an open-ended, long-horizon text-game engine for test-time continual…

X AI KOLs Timeline · 2026-04-20

AgentOdyssey is an open-ended text-game engine built for agents that learn continually, erasing the line between training and testing.



GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Hugging Face Daily Papers · 2026-04-17

GTA-2 is a hierarchical benchmark for evaluating general tool agents across atomic tool-use and open-ended workflows. It reveals a significant capability cliff: frontier models achieve only 14.39% success on complex tasks despite reasonable atomic performance.

