benchmark-saturation

#benchmark-saturation

Offline Preference-Based Trajectory Evaluation

arXiv cs.LG · 2026-06-17

This paper proposes offline preference-based trajectory evaluation for agentic systems: instead of scoring each trajectory with a binary success metric, it compares trajectories through temporal preferences. The authors report that this reduces ties from roughly 75% to 35%, improving discriminative power and data efficiency across diverse benchmarks.
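
To illustrate why a binary success metric produces many ties, here is a minimal sketch on toy data. Everything in it is an assumption made for illustration and is not the paper's method: the `progress` field, the `temporal_compare` rule (break equal outcomes by the first step where progress differs), and the four trajectories are all hypothetical, and the resulting tie rates are unrelated to the figures reported above.

```python
from itertools import combinations

def tie_rate(trajectories, compare):
    """Fraction of unordered trajectory pairs that the comparator cannot separate."""
    pairs = list(combinations(trajectories, 2))
    ties = sum(1 for a, b in pairs if compare(a, b) == 0)
    return ties / len(pairs)

def binary_compare(a, b):
    # Looks only at the final outcome: two successes (or two failures) always tie.
    return (a["success"] > b["success"]) - (a["success"] < b["success"])

def temporal_compare(a, b):
    # Hypothetical preference rule: outcome first, then the per-step progress
    # sequence compared lexicographically (the earliest differing step decides).
    outcome = binary_compare(a, b)
    if outcome != 0:
        return outcome
    return (a["progress"] > b["progress"]) - (a["progress"] < b["progress"])

# Toy trajectories (made up): final outcome plus per-step progress in [0, 1].
trajectories = [
    {"success": True,  "progress": [0.2, 0.6, 1.0]},
    {"success": True,  "progress": [0.5, 1.0, 1.0]},
    {"success": False, "progress": [0.1, 0.3, 0.3]},
    {"success": False, "progress": [0.1, 0.3, 0.3]},
]

binary_ties = tie_rate(trajectories, binary_compare)
temporal_ties = tie_rate(trajectories, temporal_compare)
print(f"binary tie rate:   {binary_ties:.2f}")
print(f"temporal tie rate: {temporal_ties:.2f}")
```

On this toy set the binary comparator ties on two of the six pairs (both successes, both failures), while the finer-grained rule ties only on the pair with identical progress.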

#benchmark-saturation

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Hugging Face Daily Papers · 2026-05-31

BenchEvolver is an evolutionary framework that automatically generates harder coding problems from existing ones. The resulting benchmarks stay valid and diverse, and the generated problems also support model self-improvement and better training performance.

#benchmark-saturation

The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next

arXiv cs.LG · 2026-05-20

This paper introduces a population coupling trend and an h-field diagnostic to analyze how coding and reasoning capabilities relate across frontier AI models. It finds that the two capabilities cooperate, with the emphasis varying by lab. It also provides a measurement playbook and predicts benchmark saturation trends.
