Tag
This paper proposes offline preference-based trajectory evaluation for agentic systems, which compares trajectories via temporal preferences rather than binary success metrics. It shows that this approach reduces the share of tied comparisons from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.
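To make the tie-rate claim concrete, here is a minimal sketch of how a tie rate could be measured over trajectory pairs. The summary does not define the temporal preference, so everything here is an assumption for illustration: the `(succeeded, steps_taken)` representation, the toy data, and the rule "prefer success, then fewer steps" are hypothetical and are not the paper's method or results.

```python
from itertools import combinations

# Hypothetical trajectories as (succeeded, steps_taken); toy data, not from the paper.
trajectories = [
    (True, 12),
    (True, 8),
    (False, 20),
    (False, 20),
    (True, 12),
]

def binary_compare(a, b):
    """Return 1 if a wins, -1 if b wins, 0 for a tie, using success only."""
    return (a[0] > b[0]) - (a[0] < b[0])

def temporal_compare(a, b):
    """Compare by success first, then by fewer steps (an assumed tie-breaker)."""
    outcome = binary_compare(a, b)
    if outcome != 0:
        return outcome
    return (a[1] < b[1]) - (a[1] > b[1])

def tie_rate(items, compare):
    """Fraction of unordered pairs that the comparison cannot separate."""
    pairs = list(combinations(items, 2))
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)

binary_ties = tie_rate(trajectories, binary_compare)
temporal_ties = tie_rate(trajectories, temporal_compare)
print(binary_ties, temporal_ties)
```

On this toy data the binary comparison ties whenever both trajectories share the same outcome, while the assumed step-count tie-breaker separates some of those pairs. That is the general mechanism by which a finer-grained preference can lower the tie rate.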
BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones. The resulting benchmarks are more challenging yet remain valid and diverse, and they support model self-improvement and better training performance.
This paper introduces a population coupling trend and an h-field diagnostic to analyze the relationship between coding and reasoning capabilities across frontier AI models, finding that the two capabilities cooperate, though the emphasis varies by lab. It provides a playbook for measurement and predicts benchmark saturation trends.