evaluation-methodology

#evaluation-methodology

Best-of-$N$ TTS Evaluation is Confounded by ASR Family Alignment

arXiv cs.CL · 2026-07-10

This paper identifies a confound in best-of-$N$ TTS evaluation: the apparent quality of an ASR verifier depends strongly on which ASR family serves as the evaluator. The authors propose cross-family rank ensembles that achieve lower word error rates across multiple evaluators.
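
As a rough illustration of the rank-ensemble idea, the sketch below averages each candidate's rank across several verifiers and selects the candidate with the lowest mean rank. This is a generic mean-rank ensemble, not necessarily the construction used in the paper; the verifier names and WER values are hypothetical.

```python
from statistics import mean

def rank_positions(scores):
    """Return the rank of each candidate (0 = best, i.e. lowest WER)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0] * len(scores)
    for position, idx in enumerate(order):
        ranks[idx] = position
    return ranks

def rank_ensemble_pick(wer_by_verifier):
    """Pick the candidate index with the lowest mean rank across verifiers."""
    per_verifier_ranks = [rank_positions(s) for s in wer_by_verifier.values()]
    n = len(per_verifier_ranks[0])
    mean_ranks = [mean(r[i] for r in per_verifier_ranks) for i in range(n)]
    return min(range(n), key=lambda i: mean_ranks[i])

# Hypothetical WERs for N = 4 TTS candidates under three made-up ASR families.
wer_by_verifier = {
    "family_a": [0.10, 0.05, 0.07, 0.20],
    "family_b": [0.12, 0.09, 0.04, 0.25],
    "family_c": [0.11, 0.08, 0.03, 0.22],
}

single_family_pick = rank_positions(wer_by_verifier["family_a"]).index(0)
best = rank_ensemble_pick(wer_by_verifier)
print(single_family_pick, best)
```

In this toy data a single verifier (`family_a`) and the ensemble select different candidates, which is the kind of disagreement the summary describes.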

Discourse-Role Labels as Presentation-Time Variables for Context Use in Language Models

arXiv cs.CL · 2026-06-04

This paper investigates how the discourse-role labels (e.g., 'Reference:', 'Instruction:', 'Example:') used to wrap context in RAG systems affect how readily language models adopt misleading information, observing shifts of 56–84 percentage points across GPT-5.5, DeepSeek V4 Pro, Llama-3-8B-Instruct, and Qwen2.5-7B-Instruct. The authors argue that wrapper labels should be treated as presentation-time variables and reported and controlled in context-utilization benchmarks.

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

arXiv cs.LG · 2026-05-29

This paper adapts paired binary sample-size calculations to 4-bit quantization benchmarks, providing a conservative minimum detectable effect (MDE) bound that helps benchmark designers determine reliability before running experiments. A pilot audit shows that much of the observed variance across small subsamples is binomial sampling noise, not true model unreliability.
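
To give a sense of how large binomial sampling noise is on small subsamples, the sketch below computes the standard error of an accuracy estimate, $\sqrt{p(1-p)/n}$, for a few subsample sizes. It is not the paper's paired-MDE bound; the accuracy of 0.7 and the sample sizes are assumed values chosen for illustration.

```python
import math

def binomial_se(p, n):
    """Standard error of an accuracy estimate from n independent binary items."""
    return math.sqrt(p * (1.0 - p) / n)

# Assumed true accuracy and hypothetical subsample sizes.
true_accuracy = 0.7
for n in (50, 200, 1000):
    se = binomial_se(true_accuracy, n)
    # 1.96 * se is the approximate 95% half-width under a normal approximation.
    print(f"n={n:5d}  se={se:.4f}  95% half-width={1.96 * se:.4f}")
```

Because the standard error shrinks only as $1/\sqrt{n}$, score swings of several points between small subsamples are expected from sampling alone.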

A shared playbook for trustworthy third party evaluations

OpenAI Blog · 2026-05-29

OpenAI shares lessons and recommended approaches for designing trustworthy third-party evaluations of frontier models, emphasizing the critical role of evaluation harnesses and validity checks.

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

arXiv cs.LG · 2026-05-19

This paper proposes a two-stage sampling design in which LLM evaluations augment, rather than replace, human ratings. It gives guidance on choosing sample sizes for human and LLM reviews using a doubly robust estimator from the missing-data literature.
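
One common form of such an estimator, sketched below under the assumption that the human-reviewed items are a simple random subsample, takes the mean LLM score over all items and adds the mean human-minus-LLM difference on the reviewed subset. This is a simplified illustration, not the paper's exact estimator; the function name and all scores are hypothetical.

```python
from statistics import mean

def augmented_mean(llm_scores, human_scores):
    """Estimate the mean human rating from LLM scores plus a human correction.

    llm_scores: LLM rating for every item.
    human_scores: dict mapping item index -> human rating, for the
        randomly chosen subset that received human review.
    """
    llm_mean_all = mean(llm_scores)
    # Average LLM error on the items where a human rating is available.
    correction = mean(human_scores[i] - llm_scores[i] for i in human_scores)
    return llm_mean_all + correction

# Hypothetical binary ratings: 8 items scored by the LLM, 4 also by humans.
llm_scores = [1, 1, 0, 1, 0, 1, 1, 0]
human_scores = {0: 1, 2: 1, 5: 1, 7: 0}

estimate = augmented_mean(llm_scores, human_scores)
print(estimate)
```

The correction term removes systematic LLM bias measured on the reviewed subset, while the cheap LLM scores on all items reduce variance relative to using the human subsample alone.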

The Scaling Law of Evaluation Failure: Why Simple Averaging Collapses Under Data Sparsity and Item Difficulty Gaps, and How Item Response Theory Recovers Ground Truth Across Domains

arXiv cs.LG · 2026-05-13

This paper argues that simple averaging in AI benchmarks fails under data sparsity and difficulty heterogeneity, proposing Item Response Theory (IRT) as a robust alternative to recover ground truth rankings.

@dair_ai: Cool paper from PwC. "Earlier is always better" is the default intuition for agent clarification. New paper claims that…

X AI KOLs Following · 2026-05-11

A new paper from PwC challenges the intuition that 'earlier is better' for agent clarification, showing via a forced-injection framework that goal clarification loses value quickly while input clarification remains useful longer. The study provides quantitative demand curves for when agents should ask questions, revealing that current frontier models often mistime their clarifications.

Quantifying infrastructure noise in agentic coding evals

Anthropic Engineering · 2026-05-08

Anthropic reports that infrastructure configuration and resource enforcement significantly affect scores on agentic coding benchmarks such as Terminal-Bench 2.0, often by more than the margins separating top models.
