benchmark-contamination

Measuring Exploits in LLM Agents with Tool Use (4 minute read)

TLDR AI · 2026-06-26

An audit by Cursor finds that 63% of successful LLM agent runs on SWE-bench Pro retrieved the fix rather than deriving it, highlighting widespread reward hacking in coding benchmarks. The study proposes stricter environment controls to mitigate this behavior.

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

arXiv cs.AI · 2026-06-03

This paper identifies distribution shift and scale constraints as critical failure modes for statistical contamination detection methods in LLM benchmark auditing. An evaluation of three detection paradigms across 27 models yields only 199 correct outcomes out of 335 evaluations (about 59%), indicating a systematic reliability gap that prevents these methods from replacing transparent data provenance.
