Tag
A detailed explanation of why training on benchmarks, evals, or test sets is a cardinal sin in ML: it corrupts the ability to measure generalization. The article stresses clean evaluation protocols and warns against benchmaxxing.
This paper proposes recall-based prompting strategies (Self-Recall and Question-Recall) to improve LLMs' adherence to a knowledge cutoff. The strategies outperform existing methods on counterfactual questions, and the paper introduces a Multi-cutoff Historical Event Benchmark (MHEB) for robustness evaluation.
LaRA is a layer-wise representation analysis framework that detects data contamination in RL post-trained LLMs by measuring geometric deviations across model layers, outperforming output-level baselines.
This paper introduces TSFMAudit, the first method for auditing pretraining data contamination in time series foundation models, using probe adaptation dynamics to detect unusually efficient fine-tuning that indicates prior exposure.
A unified survey of pretraining data exposure (PDE) in large language models, covering membership inference, data contamination, and security implications, with a review of attack and defense methods.
Proposes Joint Envelope Conformal Selection (JECS), a conformal procedure for multi-model benchmark decontamination that provably controls global contamination rate while maintaining higher power than baselines.
This paper introduces Zero-CoT Probe (ZCP), a black-box method for detecting evasive data contamination in LLMs. It truncates chain-of-thought reasoning and compares performance on perturbed datasets, robustly detecting both direct and indirect contamination.
This paper investigates LLM-based generative error correction (GER) for low-resource West Frisian ASR, using a contamination-aware evaluation with a private dataset to show that GPT-5.1 reduces errors beyond oracle levels.
This paper empirically studies LLMs' legal reasoning in tax law, showing that data contamination inflates performance and that neuro-symbolic hybrid systems offer more reliable and robust generalization than monolithic LLMs.
Hugging Face announces the addition of private, high-quality datasets from Appen and DataoceanAI to the Open ASR Leaderboard to prevent benchmaxxing and test-set contamination, while maintaining public data for the default average WER calculation.