This paper adapts paired binary sample-size calculations to 4-bit quantization benchmarks, providing a conservative minimum detectable effect (MDE) bound that helps benchmark designers determine reliability before running experiments. A pilot audit shows that much of the observed variance across small subsamples is binomial sampling noise, not true model unreliability.
OpenAI shares lessons and recommended approaches for designing trustworthy third-party evaluations of frontier models, emphasizing the critical role of evaluation harnesses and validity checks.
This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings, and provides guidance on choosing sample sizes for human and LLM reviews using a doubly robust estimator from the missing-data literature.
This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, and proposes Item Response Theory (IRT) as a robust alternative for recovering ground-truth rankings.
A new paper from PwC challenges the intuition that 'earlier is better' for agent clarification, showing via a forced-injection framework that goal clarification loses value quickly while input clarification remains useful longer. The study provides quantitative demand curves for when agents should ask questions, revealing that current frontier models often mistime their clarifications.
Anthropic reports that infrastructure configuration and resource enforcement significantly affect scores on agentic coding benchmarks such as Terminal-Bench 2.0, with effects that often exceed the margins between top models.
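
The binomial-noise point in the quantization benchmark summary can be illustrated with a rough sketch. It is not the paper's MDE bound: it only computes the textbook standard error of an accuracy estimate over independent pass/fail items, and the 0.7 accuracy and the subsample sizes are hypothetical values chosen for illustration.

```python
import math


def binomial_standard_error(accuracy: float, n_items: int) -> float:
    """Standard error of an accuracy estimated from n_items independent pass/fail items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def noise_band(accuracy: float, n_items: int, z: float = 1.96) -> float:
    """Approximate 95% half-width of the sampling noise (normal approximation)."""
    return z * binomial_standard_error(accuracy, n_items)


if __name__ == "__main__":
    assumed_accuracy = 0.7  # hypothetical true accuracy, not taken from the paper
    for n in (50, 200, 1000):  # hypothetical subsample sizes
        half_width = noise_band(assumed_accuracy, n)
        print(f"n={n:5d}  accuracy is uncertain by about +/-{half_width:.3f}")
```

Under these assumptions, the band shrinks with the square root of the subsample size, so score swings across small subsamples can be large even when the model's underlying accuracy is unchanged.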