meta-evaluation

#meta-evaluation

Can LLMs Write Reliable Rubrics? A Meta-Evaluation for Experiment Reproduction

arXiv cs.CL · 2d ago

This paper presents the first systematic meta-evaluation of LLM-generated rubrics for reproducing experiments from research papers. It reformulates rubrics as checklists and evaluates generation settings both intrinsically (semantic similarity) and extrinsically (score alignment). Augmented settings improve downstream evaluation alignment, but the generated rubrics are often overly fine-grained and biased toward high scores.

Eval-Pair Matrix: Answer-Paired Meta-Evaluation of LLM Judges for Grounded RAG

arXiv cs.CL · 3d ago

This paper introduces Eval-Pair Matrix, a controlled meta-evaluation protocol for source-grounded RAG that induces hidden contradictions to detect self-leniency in LLM judges. The study finds minimal same-model effects and emphasizes methodological improvements for RAG judge studies.

Counsel: A Meta-Evaluation Dataset for Agentic Tasks

Hugging Face Daily Papers · 2026-06-19

Counsel is the first public dataset of human meta-evaluations of LLM critiques for agentic tasks, designed to improve the calibration and reliability of automated evaluation methods.

Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?

arXiv cs.CL · 2026-05-20

This paper introduces REFLECT, a meta-evaluation benchmark for assessing the reliability of LLM judges in evaluating deep research agents. Experiments show current LLM judges remain unreliable, with overall accuracies below 55% across reasoning, tool-use, and report-quality failures.

The Evaluation Trap: Benchmark Design as Theoretical Commitment

arXiv cs.AI · 2026-05-15

This paper identifies the "evaluation trap," in which AI benchmarks inadvertently stabilize dominant paradigms by narrowing what counts as progress. It introduces Epistematics, a meta-evaluative methodology intended to ensure that evaluation criteria discriminate true capability from proxy behaviors.
