benchmark-design

Design and Report Benchmarks for Knowledge Work

arXiv cs.AI · 2026-05-25

This paper proposes a three-step framework for designing and reporting benchmarks for knowledge-work AI, emphasizing alignment between benchmark tasks and real-world work activities. It derives 18 work activities from the O*NET database and analyzes three existing benchmarks (GDPval, OfficeQA Pro, APEX-SWE) to demonstrate gaps between benchmark scores and actual work capability.

The Evaluation Trap: Benchmark Design as Theoretical Commitment

arXiv cs.AI · 2026-05-15

This paper identifies the 'evaluation trap', in which AI benchmarks inadvertently stabilize dominant paradigms by narrowing what counts as progress, and introduces Epistematics, a meta-evaluative methodology for ensuring that evaluation criteria discriminate true capability from proxy behaviors.

