benchmark-design

Rethinking the Evaluation of Harness Evolution for Agents

arXiv cs.AI · 2d ago

This paper re-examines how automatic harness evolution for LLM agents is evaluated. It argues that the reported gains may stem from additional test-time search rather than improved harness design, and that evaluating on the same benchmark risks overfitting. In its experiments, harness evolution does not consistently outperform simpler test-time scaling methods.

The Coin Flip Judge? Reliability and Bias in LLM-as-a-Judge Evaluation

arXiv cs.CL · 2026-06-15

This paper investigates the run-to-run reliability of LLM-as-a-Judge evaluations. It finds that pairwise preferences flip 13.6% of the time on average and that GPT-4o-mini shows significant first-position bias, and it recommends multi-trial aggregation and position randomization.
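
The two recommendations in this summary (multi-trial aggregation and position randomization) can be illustrated with a minimal sketch. It is not the paper's implementation: the `Judge` signature, the `aggregate_preference` helper, and the default of five trials are assumptions made for illustration, and a real judge would wrap an LLM call.

```python
import random
from collections import Counter
from typing import Callable, Tuple

# Hypothetical judge signature (an assumption, not from the paper): it receives
# two responses in presentation order and returns "first" or "second".
Judge = Callable[[str, str], str]


def aggregate_preference(
    judge: Judge, a: str, b: str, trials: int = 5, seed: int = 0
) -> Tuple[str, Counter]:
    """Majority vote over several trials with randomized presentation order.

    Use an odd number of trials so that the vote cannot tie.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(trials):
        # Position randomization: decide per trial which response is shown first.
        a_first = rng.random() < 0.5
        first, second = (a, b) if a_first else (b, a)
        picked_first = judge(first, second) == "first"
        # Map the positional verdict back to the underlying response.
        winner = "A" if picked_first == a_first else "B"
        votes[winner] += 1
    majority = votes.most_common(1)[0][0]
    return majority, votes
```

A judge that always answers "first" has its votes split by the shuffled order instead of all favoring one response, while a judge with a real preference produces the same winner regardless of position.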

Design and Report Benchmarks for Knowledge Work

arXiv cs.AI · 2026-05-25

This paper proposes a three-step framework for designing and reporting benchmarks for AI that performs knowledge work, emphasizing alignment between benchmark tasks and real-world work activities. It derives 18 work activities from the O*NET database and analyzes three existing benchmarks (GDPval, OfficeQA Pro, APEX-SWE) to demonstrate gaps between benchmark scores and actual work capability.

The Evaluation Trap: Benchmark Design as Theoretical Commitment

arXiv cs.AI · 2026-05-15

This paper identifies the 'evaluation trap', in which AI benchmarks inadvertently stabilize dominant paradigms by narrowing what counts as progress. It introduces Epistematics, a meta-evaluative methodology intended to ensure that evaluation criteria discriminate true capability from proxy behaviors.
