benchmark-contamination

The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection

arXiv cs.AI · 19h ago

This paper identifies distribution shift and scale constraints as critical failure modes of statistical contamination detection methods used in LLM benchmark auditing. An evaluation of three detection paradigms across 27 models yields only 199 correct outcomes out of 335 evaluations (about 59%), indicating a systematic reliability gap that keeps these methods from replacing transparent data provenance.
