evaluation-methodology

#evaluation-methodology

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

arXiv cs.LG · 6d ago

This paper adapts paired binary sample-size calculations to 4-bit quantization benchmarks, providing a conservative minimum detectable effect (MDE) bound that lets benchmark designers check, before running experiments, how small an effect their benchmark can reliably detect. A pilot audit shows that much of the observed variance across small subsamples is binomial sampling noise, not true model unreliability.

A shared playbook for trustworthy third party evaluations

OpenAI Blog · 6d ago

OpenAI shares lessons and recommended approaches for designing trustworthy third-party evaluations of frontier models, emphasizing the critical role of evaluation harnesses and validity checks.

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv cs.LG · 2026-05-19

This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings, and gives guidance on choosing sample sizes for human and LLM reviews using a doubly robust estimator from the missing-data literature.

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

arXiv cs.LG · 2026-05-13

This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, and proposes Item Response Theory (IRT) as a robust alternative for recovering ground-truth rankings.

@dair_ai: Cool paper from PwC. "Earlier is always better" is the default intuition for agent clarification. New paper claims that…

X AI KOLs Following · 2026-05-11

A new paper from PwC challenges the intuition that "earlier is always better" for agent clarification, showing via a forced-injection framework that goal clarification loses value quickly while input clarification remains useful longer. The study provides quantitative demand curves for when agents should ask questions and finds that current frontier models often mistime their clarifications.

Quantifying infrastructure noise in agentic coding evals

Anthropic Engineering · 2026-05-08

Anthropic reports that infrastructure configuration and resource enforcement significantly affect scores on agentic coding benchmarks such as Terminal-Bench 2.0, often by more than the margins between top models.
