benchmark-auditing

#benchmark-auditing

Auditing LLM Benchmarks with Item Response Theory

arXiv cs.CL · 2026-06-01

This paper introduces a method based on Item Response Theory (IRT) that detects mislabeled examples in LLM benchmarks at 95% precision and traces the errors to labeling heuristics and annotation issues.


#benchmark-auditing

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv cs.AI · 2026-05-14

This paper introduces BenchJack, an automated red-teaming system that audits AI agent benchmarks by systematically identifying reward-hacking exploits. Applied to 10 popular benchmarks, it surfaces 219 distinct flaws, showing that evaluation pipelines lack an adversarial mindset. On four of the benchmarks, the system reduces the hackable-task ratio from near 100% to under 10%.

