benchmark-design


Design and Report Benchmarks for Knowledge Work

arXiv cs.AI · 2026-05-25

This paper proposes a three-step framework for designing and reporting benchmarks for knowledge-work AI, emphasizing alignment between benchmark tasks and real-world work activities. It derives 18 work activities from the O*NET database and analyzes three existing benchmarks (GDPval, OfficeQA Pro, and APEX-SWE) to demonstrate gaps between benchmark scores and actual work capability.


The Evaluation Trap: Benchmark Design as Theoretical Commitment

arXiv cs.AI · 2026-05-15

This paper identifies the 'evaluation trap', in which AI benchmarks inadvertently stabilize dominant paradigms by narrowing what counts as progress. It introduces Epistematics, a meta-evaluative methodology meant to ensure that evaluation criteria distinguish true capability from proxy behaviors.
