The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Summary
This paper identifies distribution shift and scale constraints as critical failure modes for statistical contamination detection methods in LLM benchmark auditing. Evaluating three paradigms across 27 models reveals only 199 correct outcomes out of 335 evaluations, indicating a systematic reliability gap that prevents these methods from replacing transparent data provenance.
# The Reliability Gap in Benchmark Auditing: Distribution Shift and Scale as Failure Modes of Contamination Detection
Source: [https://arxiv.org/html/2606.03305](https://arxiv.org/html/2606.03305)
Jan Dubiński, Sebastian Cygert \(✉\)

1 NASK National Research Institute, Warsaw, Poland; 2 Warsaw University of Technology, Warsaw, Poland; 3 Gdańsk University of Technology, Gdańsk, Poland
###### Abstract
Benchmark contamination, where evaluation examples appear in a model’s training data, threatens the validity of LLM assessment\. Statistical tools for detecting training\-data membership exist, but have been validated almost exclusively in controlled academic regimes: large, homogeneous pre\-training corpora and transparent, single\-stage training pipelines\. Whether these methods remain reliable in realistic auditing scenarios is unclear\. We identify two under\-studied failure modes: distribution shift, which arises when suspect and validation sets violate the IID assumption, and scale constraints, which arise because benchmarks are orders of magnitude smaller than pre\-training corpora\. We systematically evaluate three leading paradigms \(LLM Dataset Inference, Post\-Hoc Dataset Inference, and CoDeC\) across 27 models from multiple families \(including Pythia, OLMo 2, and specialised cultural and medical LLMs\) and scales \(up to 27B\)\. We then extend our analysis to frontier industry models\. Across 335 evaluations, only 199 yield correct outcomes\. LLM Dataset Inference results in false positives under distribution shift, Post\-Hoc Dataset Inference is underpowered at benchmark scale, and CoDeC provides only coarse provenance signals that are insufficient to verify individual benchmark splits\. Our results reveal a systematic reliability gap between controlled validation and practical benchmark auditing, and show that statistical detection cannot yet replace transparent data provenance\. We [open\-source](https://anonymous.4open.science/r/reliability-gap-benchmark-auditing/README.md) our benchmark for further research\.
## 1 Introduction
Figure 1: Summary of our evaluation of methods for detecting whether a model was trained on a benchmark, across three tasks: Task 1 evaluates vulnerability to limited reference data \([Table 1](https://arxiv.org/html/2606.03305#S5.T1)\); Task 2 evaluates split\-level benchmark exposure on instruction\-tuned OLMo 2 \([Table 2](https://arxiv.org/html/2606.03305#S5.T2)\); and Task 3 evaluates specialized post\-training datasets for medical QA and Polish\-language LLMs \([Tables 3](https://arxiv.org/html/2606.03305#S5.T3) and [4](https://arxiv.org/html/2606.03305#S5.T4)\)\. The counts aggregate detection outcomes from these tables\. Overall, only 199 of the 335 detection challenges \(nearly 60%\) are solved, suggesting that current approaches are not yet reliable for certifying benchmark integrity\.

In the era of rapidly scaling large language models, both model size and training data volume have increased by orders of magnitude\. This scaling has unlocked new capabilities in reasoning, mathematics, and multi\-domain knowledge, pushing performance on many public benchmarks to saturation\. At the same time, benchmark contamination has emerged as a serious threat to the validity of LLM evaluation\. When benchmark splits leak into scraped or aggregated corpora, benchmark performance may reflect prior exposure rather than genuine generalization\. As models saturate widely used benchmarks, distinguishing between reasoning capability and memorization has become a central challenge of modern LLM assessment\.
Existing literature addresses this issue by developing principled statistical tools for detecting training data, including LLM Dataset Inference \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\], Post\-Hoc Dataset Inference \[[28](https://arxiv.org/html/2606.03305#bib.bib28)\], and direct contamination measures such as CoDeC \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\]\. However, these methods are primarily validated in restricted academic settings, for example using the Pythia \[[3](https://arxiv.org/html/2606.03305#bib.bib3)\] suite trained on The Pile \[[10](https://arxiv.org/html/2606.03305#bib.bib10)\], where full data transparency and strict distributional assumptions hold\. Academic model suites typically rely on relatively simple, single\-stage pre\-training on large, homogeneous corpora, which simplifies contamination analysis\.
It remains unclear whether these methods remain reliable in more realistic settings\. In practice, benchmark auditing concerns evaluation sets that are orders of magnitude smaller than pre\-training corpora, and models that undergo multiple post\-training stages, including instruction tuning on curated mixtures\. To study this beyond academic model suites, our evaluation includes openly available instruction\-tuned models \(26 models in total\) with sizes reaching 27B parameters\. Therefore, in this work, we investigate whether existing detection mechanisms function effectively *in the wild*\. We define this setting through two constraints: \(1\) the target of detection is a benchmark rather than a massive pre\-training corpus, and \(2\) the audited models are instruction\-tuned variants that have undergone multi\-stage post\-training\.
Our main contributions are as follows\.

First in\-the\-wild evaluation of benchmark auditing\. We provide the first systematic evaluation of three state\-of\-the\-art detection paradigms, LLM Dataset Inference, Post\-Hoc Dataset Inference, and CoDeC, beyond controlled academic corpora and in realistic benchmark auditing scenarios, for which we open\-source our code\.
Throughout our extensive evaluation, we find that:
1. Current auditing methods are not yet reliable enough to consistently verify benchmark exposure\. As summarized in [Figure 1](https://arxiv.org/html/2606.03305#S1.F1), correct detection outcomes are matched by many failures across our evaluations\.
2. At limited dataset scale, only LLM DI works, but only under favorable conditions\. When the investigated dataset is as small as a typical benchmark, LLM DI can still detect training exposure if it has access to a genuinely unseen and approximately IID reference set\. Post\-Hoc DI does not have enough data to build a reliable synthetic reference set, and CoDeC assigns similar scores to closely matched train and test splits \([Section 5\.1](https://arxiv.org/html/2606.03305#S5.SS1)\)\.
3. None of the methods can reliably tell which split of a benchmark was used for training\. In split\-level auditing, LLM DI can mistake distribution differences for training exposure, Post\-Hoc DI remains unstable, and CoDeC does not clearly separate seen and unseen splits from the same benchmark \([Sections 5\.2](https://arxiv.org/html/2606.03305#S5.SS2) and [5\.5](https://arxiv.org/html/2606.03305#S5.SS5)\)\.
4. For specialized post\-training datasets and industry models, we find that the signals become less stable and harder to interpret\. For domain\-specific post\-training datasets, detection performance degrades\. For industry models, the methods can detect differences at the level of model families \([Sections 5\.3](https://arxiv.org/html/2606.03305#S5.SS3) and [5\.4](https://arxiv.org/html/2606.03305#S5.SS4)\)\.
These findings suggest several practical recommendations\. LLM DI is only usable when a genuinely unseen and approximately IID validation set is available\. Post\-Hoc DI appears less suitable for standard benchmark\-sized datasets unless the quality of the synthetic held\-out data can be independently assessed\. CoDeC is best interpreted as a comparative contamination indicator rather than a certification tool\. Our results suggest that transparent data provenance remains the most reliable basis for benchmark\-integrity claims, with statistical auditing serving as complementary evidence\.
## 2 Related Work
Membership and dataset inference\. Membership inference attacks \(MIAs\) study whether a particular record was part of a model’s training set, typically by exploiting systematic differences in model behavior on seen versus unseen examples \[[25](https://arxiv.org/html/2606.03305#bib.bib25)\]\. For large language models, single\-example MIAs can be noisy, motivating *dataset inference* methods that aggregate weak per\-sample signals over a suspect set and compare them to an unseen reference set \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\]\. A key practical challenge for these tests is obtaining a reference or validation set that is truly unseen and sufficiently IID with the suspect set; violations of this assumption can lead to confounded detections\.
Benchmark contamination\. Benchmark contamination refers to the overlap between a benchmark dataset and a model’s training data \[[4](https://arxiv.org/html/2606.03305#bib.bib4)\]\. This overlap may occur when benchmark train or test splits are directly included in pre\-training corpora, or when near\-duplicate or highly overlapping variants are present in large scraped mixtures \[[19](https://arxiv.org/html/2606.03305#bib.bib19)\]\. In such cases, benchmark performance may reflect prior exposure rather than generalization to unseen data\. As benchmarks are repeatedly reused and increasingly incorporated into post\-training mixtures, contamination becomes both more likely and more difficult to detect \[[2](https://arxiv.org/html/2606.03305#bib.bib2)\]\.
In parallel to documenting contamination risk, recent efforts have focused on developing methods for detecting it\. Retrieval\-based and behavioral auditing approaches test whether benchmark items can be recovered from candidate corpora or elicited from the model itself \[[8](https://arxiv.org/html/2606.03305#bib.bib8),[11](https://arxiv.org/html/2606.03305#bib.bib11)\]\. More recently, performance\-based methods have framed contamination detection as a statistical problem, asking whether unusually strong benchmark results are better explained by prior exposure than by genuine generalization \[[7](https://arxiv.org/html/2606.03305#bib.bib7)\]\. Another recent direction is in\-context contamination detection: CoDeC \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\] measures how adding examples from the same dataset changes model confidence, using these shifts to produce an interpretable dataset\-level contamination score\. The method requires only gray\-box access to token probabilities and assumes no prior knowledge of the training corpus, making it particularly attractive for practical benchmark auditing\.
## 3 Methods
In our experiments, we use the following methods: Post\-Hoc Dataset Inference, LLM Dataset Inference, and CoDeC, which are described in this section\.
LLM Dataset Inference\. This method \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\] builds on membership inference attacks, but shifts the unit of analysis from *individual samples* to *datasets*\. For LLMs, a single\-sample MIA is often too noisy to be reliably useful: many sequences are “easy” \(low loss\) even if they were never seen during training, and the membership signal for any particular example becomes faint as models scale\.
Dataset inference addresses this by *aggregating weak membership signals across many samples* and testing whether, in aggregate, a *suspect set* looks more train\-like than an *unseen validation set*\. Concretely, one computes a collection of MIA\-derived scores \(e\.g\., loss or perplexity\-based and related features\) for samples from both sets, learns a simple aggregation of these features, and then performs a statistical test, typically a one\-sided $t$\-test, on held\-out samples to decide whether the suspect set exhibits systematically stronger membership evidence than the validation set\.
The output of dataset inference is a $p$\-value\. Under the null hypothesis $H_0$ that the model was not trained on the suspect set, the observed separation between suspect and validation should be explainable by chance\. A small $p$\-value \(e\.g\., $p < 0.1$ as in \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\]\) is interpreted as evidence against $H_0$, that is, as a detection of training on the suspect dataset\. Importantly, this $p$\-value is only meaningful under a key assumption: the validation set must be unseen during training and IID with the suspect set\. If the two sets differ in distribution, for example a benchmark train split versus its test split, the test can become confounded and may yield false positives\.
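The test step described above can be sketched as follows\. This is a minimal illustration and not the reference implementation of \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\]: it assumes every sample has already been reduced to one aggregated membership score \(higher meaning more train\-like\), uses Welch's form of the $t$ statistic, and approximates the $p$\-value with a normal distribution, which is only reasonable for large samples\. The function names and the synthetic scores are hypothetical\.

```python
import math
import random
from statistics import NormalDist, mean, variance


def one_sided_p_value(suspect_scores, validation_scores):
    """Approximate p-value for H0: the suspect set is not more train-like.

    Scores are assumed to be aggregated membership scores where higher
    means more train-like. Uses Welch's t statistic and a normal
    approximation to its null distribution (large-sample assumption).
    """
    n_s, n_v = len(suspect_scores), len(validation_scores)
    std_err = math.sqrt(variance(suspect_scores) / n_s
                        + variance(validation_scores) / n_v)
    t_stat = (mean(suspect_scores) - mean(validation_scores)) / std_err
    return 1.0 - NormalDist().cdf(t_stat)


def detect(suspect_scores, validation_scores, alpha=0.1):
    """Declare a detection when the approximate p-value falls below alpha."""
    return one_sided_p_value(suspect_scores, validation_scores) < alpha


# Synthetic scores, for illustration only.
rng = random.Random(0)
validation = [rng.gauss(0.0, 1.0) for _ in range(1000)]
member_like = [rng.gauss(0.3, 1.0) for _ in range(1000)]

# A suspect set with a clear membership signal is detected.
print(detect(member_like, validation))

# Confound: an unseen suspect set compared against a validation set that is
# merely harder (every score shifted down by 0.3) is flagged as well,
# although no training exposure is involved.
harder_validation = [s - 0.3 for s in validation]
print(detect(validation, harder_validation))
```

The second call illustrates the IID caveat: the test only measures separation between the two score distributions, so any systematic difference between suspect and validation data is indistinguishable from membership\.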
Post\-Hoc Dataset Inference\. Post\-hoc DI is designed to relax the practical requirement of LLM dataset inference for an unseen, IID validation set\. Instead of requiring a naturally occurring validation set, post\-hoc DI constructs a synthetic validation set that is intended to be distribution\-matched to the suspect data\. The procedure has two stages\.
\(1\) Held\-out data generation\. Starting from a corpus of documents for a given dataset, the method segments documents into short snippets and splits them into two disjoint pools\. A small generator model is trained with a causal language modeling objective on one pool\. The second pool is converted into prefix–suffix pairs; the generator is then used to produce artificial suffixes conditioned on the prefixes\. The resulting dataset contains real suffixes, which form the suspect set, and synthetic suffixes, which serve as the validation set\. Both sets share surface\-level distributional properties, but only the suspect set could contain memorized training data from the audited model\.
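The data\-preparation part of stage \(1\) can be sketched as below; training the generator and sampling synthetic suffixes are omitted\. The snippet length, the prefix fraction, the word\-level splitting, and all function names are illustrative assumptions rather than values taken from \[[28](https://arxiv.org/html/2606.03305#bib.bib28)\]\.

```python
import random


def segment(document, snippet_words=32):
    """Split a document into consecutive snippets of at most snippet_words words."""
    words = document.split()
    return [" ".join(words[i:i + snippet_words])
            for i in range(0, len(words), snippet_words)]


def split_pools(snippets, seed=0):
    """Shuffle snippets and split them into two disjoint, near-equal pools."""
    shuffled = snippets[:]
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def to_prefix_suffix(snippet, prefix_fraction=0.5):
    """Cut a snippet into a prefix (generator prompt) and its real suffix."""
    words = snippet.split()
    cut = max(1, int(len(words) * prefix_fraction))
    return " ".join(words[:cut]), " ".join(words[cut:])


# Toy corpus: 4 documents of 100 distinct words each.
docs = [" ".join(f"w{d}_{i}" for i in range(100)) for d in range(4)]
snippets = [s for doc in docs for s in segment(doc)]

# Pool A would train the generator; pool B yields prefix-suffix pairs.
generator_pool, pair_pool = split_pools(snippets)
pairs = [to_prefix_suffix(s) for s in pair_pool]

# Real suffixes form the suspect set; a generator conditioned on the
# prefixes would supply the synthetic validation suffixes.
suspect_set = [suffix for _, suffix in pairs]
```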
\(2\) Post\-hoc calibration\. The method trains two classifiers to distinguish suspect versus synthetic validation examples: \(i\) a text\-only classifier $c_{\text{text}}$, and \(ii\) a combined classifier $c_{\text{comb}}$ that augments text with MIA\-derived features such as loss statistics\. The central statistical question is whether the MIA features provide additional separability beyond what is already achievable from text alone\. Concretely, one computes per\-example classification scores and performs a one\-sided $t$\-test to determine whether the combined classifier improves separation relative to the text\-only baseline\. The output is a $p$\-value under the null hypothesis $H_0$ that MIA features do not add meaningful signal once text is accounted for\. A small $p$\-value, for example $p < 0.1$, is interpreted as evidence that the audited model was trained on the suspect distribution\.
Context\-based contamination signals\. Complementary to likelihood\- and classifier\-based auditing, CoDeC \(Contamination Detection via Context\) detects dataset\-level contamination by measuring how same\-dataset in\-context examples affect next\-token likelihood \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\]\. In\-context learning often improves predictions on unseen data, but can become unhelpful or harmful when memorization is triggered\. This yields an interpretable contamination score defined as the fraction of samples for which in\-context examples reduce model confidence\.
The key observation is that in\-context learning is typically helpful when the model has not internalized the target distribution\. Adding a few examples provides dataset\-specific cues such as format or vocabulary and improves next\-token likelihood\. In contrast, if the model has already been trained on the dataset or a close variant, in\-context examples provide little new information\.
Operationally, for each sample $x$ from a dataset $\mathcal{D}$, CoDeC compares the model’s average log\-likelihood on $x$ in two settings: \(i\) a baseline setting where the model predicts $x$ directly, and \(ii\) an in\-context setting where $x$ is preceded by a small number of other samples from $\mathcal{D}$\. The per\-sample score is the log\-likelihood difference $\Delta(x)$, defined as in\-context minus baseline\. The dataset contamination score is the fraction of samples for which $\Delta(x) < 0$\. Higher values indicate that in\-context learning is frequently unhelpful or harmful, which is consistent with stronger contamination\.
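The score definition above reduces to a few lines once the two average log\-likelihoods per sample are available\. The sketch below assumes they have already been computed by the audited model; the function name and the numbers are hypothetical\.

```python
def codec_score(baseline_ll, in_context_ll):
    """Fraction of samples whose average log-likelihood drops with context.

    baseline_ll[i]   : average log-likelihood of sample i predicted directly
    in_context_ll[i] : average log-likelihood of sample i when preceded by
                       other samples from the same dataset
    """
    if not baseline_ll or len(baseline_ll) != len(in_context_ll):
        raise ValueError("need two non-empty lists of equal length")
    # Delta(x) = in-context minus baseline, as in the text.
    deltas = [ctx - base for base, ctx in zip(baseline_ll, in_context_ll)]
    return sum(delta < 0 for delta in deltas) / len(deltas)


# Toy values: context helps samples 0 and 2, hurts samples 1 and 3.
baseline = [-2.0, -1.5, -3.0, -0.8]
in_context = [-1.6, -1.7, -2.2, -0.9]
print(codec_score(baseline, in_context))  # 0.5
```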
## 4 Experimental setup
In this section, we describe the experimental setup used to evaluate contamination\-detection methods across multiple model families, datasets, and auditing scenarios\. We create a publicly available [codebase](https://anonymous.4open.science/r/reliability-gap-benchmark-auditing/README.md) that adapts each method to a shared interface for benchmarking, and we detail the method configurations, audited models and benchmarks, and the notation used to interpret detection outcomes in the experiments that follow\.
### 4\.1 Detection Methods Configuration
LLM DI: This method mirrors the MIA metrics, statistical test, outlier handling, and normalization strategy of Post\-Hoc DI\. Datasets are evenly split: set A is used to compute metrics and train a linear classifier using 250 samples per side, while set B is used to evaluate the final $t$\-test using at least 1,000 texts, the maximum number of texts per split evaluated by the authors of the method \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\]\.
Post\-Hoc DI: We follow the original method as closely as possible in our benchmark\-auditing setting\. We use perplexity, $k$\-min\-probs \($k = 0.05$\), reversed $k$\-max\-probs \($k = 0.05$\), and zlib ratio as our MIA metrics\. We apply a one\-sided independent\-samples $t$\-test with a significance threshold of $p < 0.1$, taking the mean over 3 independent runs\. \(Footnote 1: the threshold is increased from $p < 0.05$ because of high $p$\-values for members, where an ideal method should yield very low $p$\-values, and for consistency with LLM DI\.\)
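Metrics of this kind are commonly computed from per\-token log\-probabilities\. The sketch below shows one common formulation of three of them; the exact definitions used in this work may differ, the reversed $k$\-max\-probs variant is omitted because its definition is not given here, and the function names are hypothetical\.

```python
import math
import zlib


def perplexity(token_logprobs):
    """Exponential of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def min_k_prob(token_logprobs, k=0.05):
    """Mean log-probability of the fraction k of least likely tokens."""
    n = max(1, int(len(token_logprobs) * k))
    return sum(sorted(token_logprobs)[:n]) / n


def zlib_ratio(text, token_logprobs):
    """Total negative log-likelihood divided by zlib-compressed size in bytes.

    Assumed formulation: the compressed size serves as a model-free proxy
    for how predictable the text is.
    """
    nll = -sum(token_logprobs)
    return nll / len(zlib.compress(text.encode("utf-8")))


# Toy input: 20 tokens with hand-picked log-probabilities.
logprobs = [-1.0, -2.0, -3.0, -4.0] * 5
text = "the quick brown fox jumps over the lazy dog"

features = {
    "perplexity": perplexity(logprobs),
    "min_k_prob": min_k_prob(logprobs),  # lowest 5% of 20 tokens = 1 token
    "zlib_ratio": zlib_ratio(text, logprobs),
}
```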
CoDeC: The contamination score is calculated as a binary per\-sample classification, assigning a value of 1 if confidence drops when context is added\. To avoid prompt artifacts, log\-probabilities are averaged from the 10th token onward, and we use exactly one randomly sampled context example, following the official description of the CoDeC pipeline \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\]\.
### 4\.2 Audited Models and Datasets
Open Model Suites: Fully open model suites enable rigorous contamination studies by allowing inspection of training data to support ground\-truth membership claims\. We evaluate the Pythia suite, known for its tight coupling with The Pile reference corpus\. We also audit OLMo 2, which details its broad knowledge pre\-training \[[21](https://arxiv.org/html/2606.03305#bib.bib21)\], rigorous post\-training \(incorporating datasets like UltraFeedback \[[6](https://arxiv.org/html/2606.03305#bib.bib6)\], tulu3 sft math \[[18](https://arxiv.org/html/2606.03305#bib.bib18)\], Competition Math \[[13](https://arxiv.org/html/2606.03305#bib.bib13)\], and GSM8K \[[5](https://arxiv.org/html/2606.03305#bib.bib5)\]\), and extensive evaluation phases\.
Evaluation Benchmarks: For the open\-model experiments, we evaluate benchmark exposure on OLMo 2 using high\-impact evaluation sets frequently audited for leakage, as their train splits are widely available and exert strong optimization pressure\. Representative benchmarks include GSM8K \[[5](https://arxiv.org/html/2606.03305#bib.bib5)\] \(grade\-school math word problems\), MATH and Competition Math \[[13](https://arxiv.org/html/2606.03305#bib.bib13)\] \(multi\-step competition mathematics\), DROP \[[9](https://arxiv.org/html/2606.03305#bib.bib9)\] \(discrete reasoning over paragraphs\), and MMLU \[[12](https://arxiv.org/html/2606.03305#bib.bib12)\] \(multi\-domain knowledge evaluation\)\.
PLLuM: A family of Polish language LLMs \[[17](https://arxiv.org/html/2606.03305#bib.bib17)\], which undergoes supervised fine\-tuning using automatic and manual datasets, followed by preference\-based alignment\. We audit the Llama\-PLLuM\-8B, PLLuM\-12B, and non\-commercial PLLuM\-12B\-nc variants\.
Medical LLMs: We evaluate domain\-specialized models fine\-tuned on public medical QA benchmarks\. Tested benchmarks include MedQA\-USMLE \[[14](https://arxiv.org/html/2606.03305#bib.bib14)\], MedMCQA \[[22](https://arxiv.org/html/2606.03305#bib.bib22)\], PubMedQA \[[15](https://arxiv.org/html/2606.03305#bib.bib15)\], and MedExpQA \[[1](https://arxiv.org/html/2606.03305#bib.bib1)\]\. Audited models include variants of MedGemma \[[24](https://arxiv.org/html/2606.03305#bib.bib24)\] \(4B, 27B\), Meditron3 \[[23](https://arxiv.org/html/2606.03305#bib.bib23)\] \(8B, 9B\), Meerkat \[[16](https://arxiv.org/html/2606.03305#bib.bib16)\] \(7B, 8B\), and Neeto\-1\.0\-8B \[[26](https://arxiv.org/html/2606.03305#bib.bib26)\]\.
### 4\.3 Notation for Detection Outcomes
Throughout the tables in this section, we report detection outcomes using colored symbols to indicate correctness at a glance:
- \+ \(green\): *true positive*; the method correctly identifies that the model was trained on the given data\.
- − \(green\): *true negative*; the method correctly determines that the model was not trained on the given data\.
- \+ \(red\): *false positive*; the method incorrectly indicates that the model was trained on the data when it was not\.
- − \(red\): *false negative*; the method fails to detect that the model was in fact trained on the data\.
The detection decision \(\+ vs\. −\) is derived differently for each method\. For *LLM Dataset Inference*, a detection \(\+\) is declared when the one\-sided $t$\-test $p$\-value falls below $0.1$, following the threshold used in \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\]\. For *CoDeC*, a detection \(\+\) is declared when the contamination score is higher than $0.8$, indicating strong contamination; no detection \(−\) when the contamination score is below $0.6$; and inconclusive \(?\) when the score is between $0.6$ and $0.8$, as suggested in the original work \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\]\.
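These decision rules can be written down directly\. In the sketch below the function names are hypothetical, scores exactly at $0.6$ or $0.8$ are treated as inconclusive, and counting an inconclusive "?" as not correct is an assumption of this sketch\.

```python
def di_decision(p_value, alpha=0.1):
    """LLM DI rule: detection when the one-sided p-value is below alpha."""
    return "+" if p_value < alpha else "-"


def codec_decision(score, low=0.6, high=0.8):
    """CoDeC rule: '+' above high, '-' below low, '?' in between."""
    if score > high:
        return "+"
    if score < low:
        return "-"
    return "?"


def is_correct(decision, trained_on):
    """A decision is correct when it matches the ground-truth membership."""
    return decision == ("+" if trained_on else "-")


# Toy audit: (CoDeC score, was the split actually used for training?)
audit = [(0.85, True), (0.85, False), (0.70, True), (0.40, False)]
decisions = [codec_decision(score) for score, _ in audit]
n_correct = sum(is_correct(d, truth) for d, (_, truth) in zip(decisions, audit))
print(decisions, f"{n_correct}/{len(audit)}")
```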
## 5 Experiments
We first test whether current contamination detection methods remain effective when only benchmark\-scale reference sets are available for inspection, rather than full pre\-training corpora \([Section 5\.1](https://arxiv.org/html/2606.03305#S5.SS1)\)\. We then examine whether these methods can distinguish between seen and unseen splits within the same benchmark \([Section 5\.2](https://arxiv.org/html/2606.03305#S5.SS2)\)\. Next, we evaluate the most effective approaches on specialized post\-training datasets \([Section 5\.3](https://arxiv.org/html/2606.03305#S5.SS3)\)\. Finally, we apply a comparative analysis on industry models \([Section 5\.4](https://arxiv.org/html/2606.03305#S5.SS4)\) and analyze the failure modes underlying these results \([Section 5\.5](https://arxiv.org/html/2606.03305#S5.SS5)\)\.
### 5\.1 Task 1: Benchmark\-Scale Auditing with Limited Reference Data
A fundamental challenge in applying current detection methods “in the wild” is the vast disparity in data scale: while pre\-training corpora typically encompass gigabytes of text, specific evaluation benchmarks often consist of only a few megabytes\. To rigorously evaluate detection performance under these constrained conditions, we simulate the data scarcity characteristic of benchmark auditing\. We find that Post\-Hoc DI becomes ineffective when trained on only a few thousand examples, while CoDeC cannot distinguish split\-level membership when train and test distributions are too similar\. Only LLM DI, which requires access to an IID validation set, remains reliable at benchmark scale\.
Task Setup\. Instead of utilizing the full corpus, we restrict the reference data to $n = 2{,}000$ documents per dataset, subsampled from various subsets of The Pile\. We then evaluate the detection methods across multiple Pythia model sizes, effectively mimicking a scenario where the auditor has access to only a fraction of the data distribution, comparable to the size of a standard evaluation benchmark, rather than the massive datasets used in prior academic validations\.
Table 1: Detection of training on train and test splits of the PILE \(Limited Data\)\. We reorganize the results into a grid to highlight consistency across diverse subsets\. Each cell lists outcomes for Pythia 0\.4B / 1\.4B / 2\.8B / 6\.9B\. Symbols: \+ = detected as trained on, − = detected as *not* trained on, ? = uncertain signal\. A \+ under Train or a − under Test is correct; a \+ under Test is a false positive and a − under Train is a false negative\.

| Method | Pile CC: Train | Pile CC: Test | Pile Europarl: Train | Pile Europarl: Test | Corr\./Tot\. |
| --- | --- | --- | --- | --- | --- |
| LLM DI | − / − / − / − | − / − / − / − | \+ / \+ / \+ / \+ | − / − / − / − | 12/16 |
| Post\-Hoc DI | − / − / − / − | − / − / − / − | − / − / − / − | − / − / − / − | 8/16 |
| CoDeC | \+ / \+ / \+ / \+ | \+ / \+ / \+ / \+ | \+ / \+ / \+ / \+ | \+ / \+ / \+ / \+ | 8/16 |

| Method | Pile Hacker News: Train | Pile Hacker News: Test | Pile Stack Exchange: Train | Pile Stack Exchange: Test | Corr\./Tot\. |
| --- | --- | --- | --- | --- | --- |
| LLM DI | \+ / \+ / \+ / \+ | − / − / \+ / − | \+ / \+ / \+ / \+ | − / − / − / − | 15/16 |
| Post\-Hoc DI | − / − / − / − | \+ / \+ / \+ / \+ | − / − / − / − | − / − / − / − | 4/16 |
| CoDeC | \+ / \+ / \+ / \+ | \+ / \+ / \+ / \+ | \+ / \+ / \+ / \+ | \+ / \+ / \+ / \+ | 8/16 |
Findings\. As shown in [Table 1](https://arxiv.org/html/2606.03305#S5.T1), LLM Dataset Inference correctly distinguishes train from test splits in most cases \(27 of 32\), with outcomes largely consistent across model sizes; the main exception is the Pile CC train split, which it misses at every size\. This indicates that our implementation broadly reproduces prior academic findings under controlled conditions\. Post\-Hoc Dataset Inference, however, does not reject the null hypothesis of non\-membership in most cases\. At benchmark scale, the generator trained on only 2,000 samples per subset is too weak to produce realistic synthetic validation data, leading to uninformative calibration statistics\. This interpretation is supported by the appendix comparison to the full\-scale held\-out generation setting, where access to much larger Pile subsets yields better\-matched synthetic data and more informative calibration results \([Tables 7](https://arxiv.org/html/2606.03305#Pt0.A1.T7) and [8](https://arxiv.org/html/2606.03305#Pt0.A1.T8)\)\. CoDeC assigns high contamination scores to both splits when distributions are similar\. As a result, high scores may reflect exposure to either the train split alone or both splits, indicating sensitivity to distributional overlap rather than precise membership\.
### 5\.2 Task 2: Detecting Split\-Level Benchmark Exposure
In this task, we move from controlled Pythia/Pile simulations to realistic benchmark auditing on instruction\-tuned OLMo 2, and assess if current methods can distinguish between seen and unseen splits within the same benchmark\. We find that none of the methods can robustly distinguish seen from unseen splits once the setting shifts from corpus\-level datasets to split\-level benchmark exposure\.
Task Setup\. We utilize OLMo 2, whose post\-training mixture explicitly includes the training splits of GSM8K and Competition Math, while their test splits are reserved for evaluation\. As controls for entirely unseen benchmark data, we include MMLU and DROP: although OLMo 2 is evaluated on these benchmarks, it was not trained on their training splits\. Our objective is to verify whether detection mechanisms can correctly flag the train splits of GSM8K and Competition Math as contaminated, while refraining from flagging their test splits or the entirely held\-out benchmarks \(MMLU and DROP\)\.
Table 2: Detection of training on train and test splits for OLMo 2\. Each cell lists outcomes for OLMo 2 1B / 7B / 13B\. Symbols: \+ = detected as trained on, − = detected as *not* trained on, ? = uncertain signal, n/a = not run\. GSM 8K Train is the only split shown that was part of OLMo 2 post\-training, so a \+ is correct only there \(elsewhere it is a false positive\), a − there is a missed detection, and a − is correct in every other column\. The test set of GSM 8K was not large enough to run Post\-Hoc DI\.

| Method | GSM 8K: Train | GSM 8K: Test | MMLU: Train | MMLU: Test | DROP: Train | DROP: Test | Corr\./Tot\. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLM DI | \+ / \+ / \+ | − / − / − | \+ / \+ / \+ | − / \+ / − | \+ / \+ / \+ | − / − / − | 11/18 |
| Post\-Hoc DI | − / \+ / \+ | n/a / n/a / n/a | − / − / − | − / − / − | − / − / − | − / − / − | 14/18 |
| CoDeC | \+ / − / − | \+ / \+ / ? | − / ? / − | ? / ? / − | \+ / ? / − | \+ / ? / − | 6/18 |
Figure 2: CoDeC contamination scores for the benchmarks in Task 2\. Hatched bars indicate that the split was used for training\. Scores show a trend across model sizes; however, training splits do not receive higher scores\.

Findings\. The results in [Table 2](https://arxiv.org/html/2606.03305#S5.T2) reveal clear failure modes for each method\. LLM Dataset Inference produces false positives on DROP and MMLU even though OLMo 2 was not trained on their train splits\. One possible explanation is that, although train and test splits of the same benchmark appear to be natural IID proxies, they can still differ in difficulty, style, or construction\. In such cases, LLM DI can be driven by split\-level distribution differences rather than by training exposure\. Post\-Hoc Dataset Inference again fails to cleanly separate train and test splits\. As in Task 1, synthetic validation data introduces artifacts that dominate the statistical test\. CoDeC assigns elevated contamination scores to GSM8K and Competition Math, consistent with known post\-training exposure\. However, DROP and MMLU receive only slightly lower scores, and the difference between GSM8K and Competition Math is comparable to the difference between Competition Math and MMLU\. The signal therefore does not provide clear evidence of split\-level membership\. CoDeC scores decrease consistently with model size, but they are insensitive to whether a given benchmark split was used for training, as shown in [Figure 2](https://arxiv.org/html/2606.03305#S5.F2)\.
### 5.3 Task 3: Auditing Specialized Post-Training Datasets
In this task, we examine whether contamination detection transfers from controlled benchmark settings to specialized post\-training domains, specifically medical LLMs and mid\-resource language models\. We find that detection performance is highly inconsistent across these settings and deteriorates substantially when the available benchmark splits are small\.
Task Setup. We apply LLM DI and CoDeC, the two methods that remained most effective in earlier experiments and were also the most practical to apply, to two downstream settings. For medical question answering, we evaluate MedGemma \[[24](https://arxiv.org/html/2606.03305#bib.bib24)\], Meditron3 \[[23](https://arxiv.org/html/2606.03305#bib.bib23)\], Meerkat \[[16](https://arxiv.org/html/2606.03305#bib.bib16)\], and Neeto-1.0 \[[26](https://arxiv.org/html/2606.03305#bib.bib26)\] on the train and test splits of MedQA-USMLE \[[14](https://arxiv.org/html/2606.03305#bib.bib14)\] and MedMCQA \[[22](https://arxiv.org/html/2606.03305#bib.bib22)\]. For Polish culture-aware language models, we evaluate the PLLuM-12B and Llama-PLLuM-8B families on train/test (or train/validation) splits from Automatic SFT, Manual SFT, and Alignment datasets. Because some held-out splits are very small, we subsample the larger split to match the smaller one, yielding evaluation sets of approximately 1,600 examples for Automatic SFT and 1,150 for Manual SFT. This setting tests whether contamination signals remain informative on small, domain-specific post-training datasets.
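The size-matching step described in the setup can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `match_split_sizes`, the fixed seed, and the placeholder examples are all assumptions introduced here.

```python
import random

def match_split_sizes(train, held_out, seed=0):
    """Subsample the larger split so both splits contain the same number of examples."""
    rng = random.Random(seed)            # fixed seed (an assumption) for reproducibility
    n = min(len(train), len(held_out))   # target size is the size of the smaller split
    # random.sample draws without replacement, so no example is duplicated
    train_sub = train if len(train) == n else rng.sample(train, n)
    held_sub = held_out if len(held_out) == n else rng.sample(held_out, n)
    return train_sub, held_sub

# Toy usage with placeholder strings (not real PLLuM data)
train = [f"train-{i}" for i in range(10_000)]
val = [f"val-{i}" for i in range(1_600)]
train_sub, val_sub = match_split_sizes(train, val)
print(len(train_sub), len(val_sub))  # prints: 1600 1600
```

Matching sizes keeps the two sides of the statistical comparison balanced, at the cost of discarding most of the larger split.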
Table 3: Detection of training on train and test splits for medical LLMs. Symbols: + = detected as trained on (a correct detection or a false positive, depending on ground truth); − = detected as not trained on (a correct rejection or a missed detection, i.e. a false negative); ? = uncertain signal. Cells list one result per model size.

| Model (sizes) | Method | MedQA-USMLE Train | MedQA-USMLE Test | MedMCQA Train | MedMCQA Test | Corr./Tot. |
| --- | --- | --- | --- | --- | --- | --- |
| medgemma (4B / 27B) | LLM DI | − / − | − / − | + / + | − / − | 6/8 |
| | CoDeC | − / − | − / − | − / − | − / − | 4/8 |
| Meditron3 (8B / 9B) | LLM DI | − / − | − / − | + / + | − / − | 6/8 |
| | CoDeC | ? / − | ? / − | − / − | − / − | 3/8 |
| meerkat (7B / 8B) | LLM DI | − / − | − / − | + / + | − / − | 6/8 |
| | CoDeC | ? / ? | ? / ? | − / − | − / − | 2/8 |
| Neeto-1.0 (8B) | LLM DI | − | + | + | − | 2/4 |
| | CoDeC | − | − | − | − | 2/4 |

Findings. The results in [Tables 3](https://arxiv.org/html/2606.03305#S5.T3) and [4](https://arxiv.org/html/2606.03305#S5.T4) show that detection performance degrades substantially in downstream settings. For medical QA, LLM DI succeeds on MedMCQA across all model families, correctly distinguishing train from test splits, but consistently fails on MedQA-USMLE, missing the train split for every model and producing a false positive on the Neeto-1.0 test split. CoDeC is less stable: it correctly identifies some train splits, but its behavior varies across models and does not consistently separate seen from unseen data.
For PLLuM, both methods are further limited by the small size of the evaluation splits\. LLM DI often misses training on Automatic and Manual SFT, yielding repeated false negatives on train splits, while also producing false positives on Alignment\. CoDeC shows partial success for the PLLuM\-12B family, but performs substantially worse for Llama\-PLLuM\-8B and again tends to assign elevated scores to both train and held\-out splits from the same dataset\.
Table 4: Detection of training on SFT and alignment data for PLLuM models. Symbols: + = detected as trained on (a correct detection or a false positive, depending on ground truth); − = detected as not trained on (a correct rejection or a missed detection, i.e. a false negative); ? = uncertain signal. base* = updated base checkpoint (250801). Cells list one result per model variant.

| Model family | Variants | Method | Automatic SFT Train | Automatic SFT Val | Manual SFT Train | Manual SFT Val | Alignment Train | Alignment Test | Corr./Tot. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PLLuM-12B base | base / base* / nc-base | LLM DI | − / − / − | − / − / − | − / − / − | − / − / − | + / + / + | − / − / − | 15/18 |
| | | CoDeC | + / − / − | + / − / − | ? / − / − | ? / − / − | − / − / − | − / − / − | 14/18 |
| PLLuM-12B SFT | chat / inst. / nc-chat / nc-inst. | LLM DI | + / − / − / − | − / − / − / − | − / − / − / − | − / − / − / − | + / + / + / + | − / − / − / − | 15/24 |
| | | CoDeC | + / + / + / ? | + / + / ? / − | ? / ? / ? / − | ? / ? / + / ? | ? / − / + / − | ? / − / + / − | 9/24 |
| Llama-PLLuM 8B base | base / base* | LLM DI | − / − | − / − | − / − | − / − | + / + | − / − | 10/12 |
| | | CoDeC | + / − | + / − | ? / ? | + / − | − / − | − / − | 7/12 |
| Llama-PLLuM 8B SFT | chat / inst. | LLM DI | − / − | − / − | − / − | − / − | + / + | − / − | 7/12 |
| | | CoDeC | + / + | + / + | + / ? | + / + | + / − | + / ? | 5/12 |
### 5.4 Task 4: Comparative Auditing of Industry Models
Figure 3: Application of CoDeC and LLM Dataset Inference to industry models. We try to detect training on train splits of frontier benchmarks. While DROP and MMLU show mixed results, we suspect that Qwen models were trained on GSM8K, unlike Google models.

In this task, we extend the analysis to closed or partially disclosed industry models. We ask whether LLM DI and CoDeC provide consistent signals of prior exposure to benchmark data for frontier evaluations. We find that the signals are suggestive rather than definitive: most model families are flagged on at least some benchmarks, while Google’s Gemma-3 family stands out as the only series not flagged by LLM DI on GSM8K and having consistently lower CoDeC scores.
Task Setup. We evaluate LLM Dataset Inference and CoDeC on several industry model families (Google: Gemma 3 1B, 4B, 12B; Alibaba: Qwen 3 1.4B, 3B, 14B; Meta: Llama 3.2 1B, 3B) using public benchmarks: GSM8K, MMLU, and DROP. Because training data for these models is not fully disclosed, ground-truth membership is unavailable. Our goal is therefore comparative: to assess whether contamination signals differ systematically across model families.
Findings. As shown in [Figure 3](https://arxiv.org/html/2606.03305#S5.F3), most model families receive elevated CoDeC scores across benchmarks, and LLM DI frequently flags benchmark training. Gemma-3 is the main exception: it consistently shows lower CoDeC scores, and on GSM8K it is the only family not flagged by LLM DI on the train split. Based on this evidence, we infer that the Gemma-3 series was likely not trained on the GSM8K training set.
### 5.5 Diagnosis of Failure
The methods evaluated in the previous tasks do not yield consistent outcomes. As shown in [Figure 1](https://arxiv.org/html/2606.03305#S1.F1), the number of successful detections is comparable to the number of failures; overall, only about 60% of the outcomes (199 of 335) are correct. To understand these results, we examine the assumptions and failure modes of each method.
##### LLM Dataset Inference\.
LLM DI critically depends on the availability of a validation set that is both unseen and truly IID with the suspect dataset. This explains the contrast between Task 1 and the later tasks: LLM DI works in the controlled setting of Task 1, but the benchmark splits used in the later tasks may break the IID assumption. These splits may differ in difficulty or construction, so the test captures those differences rather than training exposure.
##### Post\-Hoc Dataset Inference\.
On benchmark-sized subsets, Post-Hoc DI is dominated by the mismatch between real and synthetic held-out text rather than by memorization signal. As shown in [Table 5](https://arxiv.org/html/2606.03305#S5.T5), the text-only classifier is already highly discriminative, indicating strong separability between real and synthetic suffixes even without MIA features. In a well-calibrated Post-Hoc DI setup, MIA features should provide an additional discriminative signal beyond the text-only baseline. Here, however, adding MIA features yields only marginal improvements. For example, Common Crawl (410M) increases from 0.880 to 0.899, and Europarl (Train) slightly decreases from 0.895 to 0.891. This shows that the classification task is driven primarily by real-versus-synthetic distribution shift, not by training membership signal.
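To make the size of the MIA contribution concrete, the sketch below computes the Comb minus Base AUC gain for the Common Crawl, Europarl, and Hacker News rows of Table 5 (Pythia 410M column only). The numbers are copied from the table; the helper name `mia_gain` is introduced here for illustration.

```python
# (Base, Comb) AUCs for Pythia 410M, copied from Table 5
table5_410m = {
    "CC (Train)": (0.880, 0.899),
    "CC (Val)": (0.882, 0.900),
    "Europarl (Train)": (0.895, 0.891),
    "Europarl (Val)": (0.887, 0.890),
    "Hacker News (Train)": (0.899, 0.912),
    "Hacker News (Val)": (0.899, 0.928),
}

def mia_gain(base_auc, comb_auc):
    """AUC added by MIA features on top of the text-only classifier."""
    return round(comb_auc - base_auc, 3)

gains = {name: mia_gain(base, comb) for name, (base, comb) in table5_410m.items()}
for name, gain in gains.items():
    print(f"{name:20s} gain = {gain:+.3f}")
```

For these rows the gain never exceeds 0.029, and the validation (non-member) rows gain about as much as the train rows, which is consistent with the reading that the gain does not reflect membership.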
Table 5: Post-hoc DI classifier AUCs on Pile subsets, benchmark-scale subsampling. Base uses text only; Comb adds MIA-derived features. High Base AUCs ($>0.78$) across all splits indicate that distribution shift between real and synthetic suffixes dominates the signal, leaving little room for MIA features to contribute.

| Dataset (Split) | 410M Base | 410M Comb | 1.4B Base | 1.4B Comb | 2.8B Base | 2.8B Comb | 6.9B Base | 6.9B Comb |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CC (Train) | 0.880 | 0.899 | 0.880 | 0.901 | 0.880 | 0.902 | 0.880 | 0.902 |
| CC (Val) | 0.882 | 0.900 | 0.882 | 0.901 | 0.882 | 0.902 | 0.882 | 0.901 |
| Europarl (Train) | 0.895 | 0.891 | 0.895 | 0.894 | 0.895 | 0.895 | 0.895 | 0.895 |
| Europarl (Val) | 0.887 | 0.890 | 0.887 | 0.892 | 0.887 | 0.893 | 0.887 | 0.894 |
| Hacker News (Train) | 0.899 | 0.912 | 0.899 | 0.912 | 0.899 | 0.916 | 0.899 | 0.914 |
| Hacker News (Val) | 0.899 | 0.928 | 0.899 | 0.929 | 0.899 | 0.929 | 0.899 | 0.930 |
| Stack Exchange (Train) | 0.540 | 0.554 | 0.540 | 0.562 | 0.540 | 0.567 | 0.540 | 0.574 |
| Stack Exchange (Val) | 0.530 | 0.528 | 0.530 | 0.539 | 0.530 | 0.537 | 0.530 | 0.541 |

Calibration test $p$-values in [Table 6](https://arxiv.org/html/2606.03305#S5.T6) are correspondingly inconsistent. For Common Crawl (410M), train yields $p=0.642$ while validation yields $p=0.838$, both above the rejection threshold. Europarl produces $p \approx 1.000$ across splits and model sizes. Hacker News (Val) produces extremely small values such as $p=0.001$ for 1.4B despite being a non-member split. Under the original large-scale synthetic setting, performance improves. As [Table 7](https://arxiv.org/html/2606.03305#Pt0.A1.T7) (Appendix) shows, calibration-stage classifier AUCs move closer to chance (e.g., Common Crawl Base: 0.548), indicating better synthetic-real alignment and fewer distributional artifacts in Post-Hoc DI. Accordingly, the $p$-values in [Table 8](https://arxiv.org/html/2606.03305#Pt0.A1.T8) (Appendix) more closely match the ground truth. This suggests that Post-Hoc DI works well only when enough data is available to train a strong synthetic held-out set generator.
Table 6: Post-hoc DI calibration test $p$-values on Pile subsets under benchmark-scale subsampling ($n=2{,}000$). Symbols: + = detected as trained on ($p<0.1$; a correct detection or a false positive, depending on ground truth); − = detected as not trained on (a correct rejection or a false negative). The method fails to reject the null for nearly all train splits. Compared with the full-scale held-out generation results in the Appendix ([Table 8](https://arxiv.org/html/2606.03305#Pt0.A1.T8)), this degradation is consistent with the claim that benchmark-scale data is insufficient to train a reliable synthetic held-out set generator necessary for Post-hoc DI.

| Dataset | 410M Train | 410M Val | 1.4B Train | 1.4B Val | 2.8B Train | 2.8B Val | 6.9B Train | 6.9B Val | Corr./Tot. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pile CC | 0.642 − | 0.838 − | 0.191 − | 0.207 − | 0.604 − | 0.458 − | 0.493 − | 0.334 − | 4/8 |
| Europarl | 1.000 − | 1.000 − | 1.000 − | 1.000 − | 1.000 − | 0.996 − | 1.000 − | 1.000 − | 4/8 |
| Hacker News | 0.133 − | 0.060 + | 0.761 − | 0.001 + | 0.482 − | 0.002 + | 0.387 − | 0.000 + | 0/8 |
| Stack Exch. | 0.964 − | 0.996 − | 0.927 − | 0.779 − | 0.958 − | 0.972 − | 0.994 − | 0.986 − | 4/8 |
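The Corr./Tot. column of Table 6 follows from applying the $p<0.1$ rule to each cell. The short sketch below recomputes it from the tabulated $p$-values, assuming (as the table does) that train splits are members and validation splits are non-members; the function name `count_correct` is introduced here.

```python
ALPHA = 0.1  # rejection threshold stated in the Table 6 caption

# p-values copied from Table 6: dataset -> [(train_p, val_p)] for
# Pythia 410M, 1.4B, 2.8B, 6.9B
table6 = {
    "Pile CC": [(0.642, 0.838), (0.191, 0.207), (0.604, 0.458), (0.493, 0.334)],
    "Europarl": [(1.000, 1.000), (1.000, 1.000), (1.000, 0.996), (1.000, 1.000)],
    "Hacker News": [(0.133, 0.060), (0.761, 0.001), (0.482, 0.002), (0.387, 0.000)],
    "Stack Exch.": [(0.964, 0.996), (0.927, 0.779), (0.958, 0.972), (0.994, 0.986)],
}

def count_correct(pairs, alpha=ALPHA):
    """Count outcomes that agree with ground truth under the p < alpha rule."""
    correct = 0
    for train_p, val_p in pairs:
        correct += train_p < alpha   # train split is a member: should be flagged
        correct += val_p >= alpha    # val split is a non-member: should not be flagged
    return correct

results = {name: count_correct(pairs) for name, pairs in table6.items()}
for name, correct in results.items():
    print(f"{name}: {correct}/8")
```

No train split is ever flagged, so every correct outcome in the table is a non-member split that was (trivially) not rejected.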
Figure 4: CoDeC contamination scores on Pythia across multiple datasets. Evaluation-only benchmarks consistently receive lower scores than pre-training corpora, reproducing the separation reported in prior work. However, train and test splits within the same corpus yield nearly identical scores, indicating that CoDeC cannot distinguish split-level membership.
##### CoDeC\.
CoDeC exhibits a different limitation than the DI-based methods: it preserves broad provenance differences, but lacks the resolution needed for split-level auditing. In [Figure 4](https://arxiv.org/html/2606.03305#S5.F4), contamination scores for known pre-training corpora are substantially higher than for evaluation-only benchmarks, reproducing prior findings. However, train and test splits within the same dataset yield only marginal differences, indicating difficulty distinguishing near-IID conditions.
In instruction-tuned models ([Figure 5](https://arxiv.org/html/2606.03305#S5.F5)), aggregation by dataset provenance reveals a coarse ordering: pre-training sources receive the highest scores, post-training datasets are intermediate, and evaluation-only benchmarks are lowest. At the level of individual datasets and model sizes, however, the scores remain variable and overlapping. CoDeC is therefore best interpreted as a coarse contamination indicator rather than a precise split-level detector.
Figure 5: CoDeC scores for OLMo 2 (instruction-tuned) grouped by data provenance. Pre-training sources receive the highest contamination scores, post-training datasets used during instruction tuning are intermediate, and evaluation-only benchmarks score lowest. This coarse ordering is consistent with known data membership, but the overlap between adjacent groups limits the method’s utility for certifying individual benchmark integrity.
## 6 Conclusions
In this work, we examined the reliability of contamination detection methods by moving from the controlled settings of prior literature to the realistic, "in-the-wild" regime of modern instruction-tuned models (benchmark: [https://anonymous.4open.science/r/reliability-gap-benchmark-auditing/README.md](https://anonymous.4open.science/r/reliability-gap-benchmark-auditing/README.md)). We stress-tested three leading detection paradigms against the complexities of post-training mixtures and opaque data provenance. Our findings reveal that the transition from academic validation to practical auditing is fraught with challenges:
- The I.I.D. assumption is a critical vulnerability. We found that LLM Dataset Inference \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\] is highly sensitive to distribution shifts. While effective when a perfect I.I.D. validation set exists, it yields significant false positives when using standard benchmark test splits as validation proxies (as seen with DROP and MMLU). This makes it dangerous to use as a sole arbiter for leaderboard integrity without verifying distributional alignment.
- Benchmark-scale datasets are too small for generative methods. While Post-Hoc Dataset Inference \[[28](https://arxiv.org/html/2606.03305#bib.bib28)\] theoretically removes the need for natural validation data, we showed it is impractical for standard benchmarks. Unlike massive pre-training corpora, benchmarks like GSM8K lack the text volume needed to train the required generators effectively.
- Detection granularity is limited. CoDeC \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\] provides a valuable coarse-grained signal, effectively separating pre-training sources from held-out tasks in aggregate. However, its high variance at the level of individual datasets ([Figure 5](https://arxiv.org/html/2606.03305#S5.F5)) limits its utility for certifying specific benchmark results.
We conclude that there is currently no "silver bullet" for detecting contamination in the wild. Consequently, statistical auditing cannot yet fully replace transparent data provenance.
## Acknowledgements
This research was supported by the Polish National Science Centre (NCN) within grant no. 2025/57/N/ST6/04025. We gratefully acknowledge Polish high-performance computing infrastructure PLGrid, HPC Center: ACK Cyfronet AGH, for providing computer facilities and support within computational grant no. PLG/2024/017781.
## References
- \[1\] Alonso, I., Oronoz, M., Agerri, R.: Medexpqa: Multilingual benchmarking of large language models for medical question answering. ArXiv abs/2404.05590 (2024)
- \[2\] Balloccu, S., Schmidtová, P., Lango, M., Dušek, O.: Leak, cheat, repeat: Data contamination and evaluation malpractices in closed-source llms. ArXiv abs/2402.03927 (2024)
- \[3\] Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O’Brien, K., et al.: Pythia: A suite for analyzing large language models across training and scaling. ArXiv abs/2304.01373 (2023)
- \[4\] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al.: Language models are few-shot learners. ArXiv abs/2005.14165 (2020)
- \[5\] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., et al.: Training verifiers to solve math word problems. ArXiv abs/2110.14168 (2021)
- \[6\] Cui, G., Yuan, L., Ding, N., Yao, G., Zhu, W., et al.: Ultrafeedback: Boosting language models with high-quality feedback. ArXiv abs/2310.01377 (2023)
- \[7\] Dekoninck, J., Müller, M.N., Vechev, M.: Constat: Performance-based contamination detection in large language models. ArXiv abs/2405.16281 (2024)
- \[8\] Deng, C., Zhao, Y., Tang, X., et al.: Investigating data contamination in modern benchmarks for large language models. ArXiv abs/2311.09783 (2024)
- \[9\] Dua, D., Wang, Y., Dasigi, P., Stanovsky, G., Singh, S., et al.: DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. ArXiv abs/1903.00161 (2019)
- \[10\] Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., et al.: The pile: An 800gb dataset of diverse text for language modeling. ArXiv abs/2101.00027 (2020)
- \[11\] Golchin, S., Surdeanu, M.: Time travel in llms: Tracing data contamination in large language models. ArXiv abs/2308.08493 (2024)
- \[12\] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., et al.: Measuring massive multitask language understanding. ArXiv abs/2009.03300 (2021)
- \[13\] Hendrycks, D., Burns, C., Kadavath, S., Arora, A., et al.: Measuring mathematical problem solving with the MATH dataset. ArXiv abs/2103.03874 (2021)
- \[14\] Jin, D., Pan, E., Oufattole, N., Weng, W.H., Fang, H., et al.: What disease does this patient have? a large-scale open domain question answering dataset from medical exams. ArXiv abs/2009.13081 (2020)
- \[15\] Jin, Q., Dhingra, B., Liu, Z., Cohen, W., Lu, X.: PubMedQA: A dataset for biomedical research question answering. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) (Nov 2019)
- \[16\] Kim, H., Hwang, H., Lee, J., Park, S., Kim, D., et al.: Small language models learn enhanced reasoning skills from medical textbooks. ArXiv abs/2404.00376 (2024)
- \[17\] Kocoń, J., Piasecki, M., Janz, A., Ferdinan, T., Łukasz Radliński, et al.: Pllum: A family of polish large language models. ArXiv abs/2511.03823 (2025)
- \[18\] Lambert, N., Morrison, J.D., Pyatkin, V., Huang, S., et al.: Tülu 3: Pushing frontiers in open language model post-training. ArXiv abs/2411.15124 (2024)
- \[19\] Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., et al.: Deduplicating training data makes language models better. ArXiv abs/2107.06499 (2022)
- \[20\] Maini, P., Jia, H., Papernot, N., Dziedzic, A.: Llm dataset inference: Did you train on my dataset? ArXiv abs/2406.06443 (2024)
- \[21\] OLMo, T., Walsh, P., Soldaini, L., Groeneveld, D., Lo, K., et al.: Olmo 2: Furious. ArXiv abs/2501.00656 (2025)
- \[22\] Pal, A., Umapathi, L.K., Sankarasubbu, M.: Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. ArXiv abs/2203.14371 (2022)
- \[23\] Sallinen, A., Solergibert, A.J., Zhang, M., Boyé, G.B., et al.: Llama-3-meditron: An open-weight suite of medical LLMs based on llama-3.1. In: Workshop on Large Language Models and Generative AI for Health at AAAI 2025 (2025)
- \[24\] Sellergren, A., Kazemzadeh, S., Jaroensri, T., Kiraly, A., Traverse, M., et al.: Medgemma technical report. ArXiv abs/2507.05201 (2025)
- \[25\] Shokri, R., Stronati, M., Song, C., Shmatikov, V.: Membership inference attacks against machine learning models. ArXiv abs/1610.05820 (2017). https://doi.org/10.1109/SP.2017.41
- \[26\] Verma, S.: Neeto: A specialized medical llm for neet-pg/ukmle/usmle preparation (2025), [https://huggingface.co/S4nfs/Neeto-1.0-8b](https://huggingface.co/S4nfs/Neeto-1.0-8b)
- \[27\] Zawalski, M., Boubdir, M., Bałazy, K., Nushi, B., Ribalta, P.: Detecting data contamination in llms via in-context learning. ArXiv abs/2510.27055 (2025)
- \[28\] Zhao, B., Maini, P., Boenisch, F., Dziedzic, A.: Unlocking post-hoc dataset inference with synthetic data. ArXiv abs/2506.15271 (2025)
## Appendix A
### A.1 Detailed Method Overviews
This section provides step\-by\-step procedural summaries for the three detection paradigms evaluated in the main text\.
#### A.1.1 LLM Dataset Inference
LLM Dataset Inference \[[20](https://arxiv.org/html/2606.03305#bib.bib20)\] detects training-data membership at the *dataset* level by aggregating weak per-sample membership inference signals and applying a statistical test. The procedure consists of four stages.
Stage 0 (Data preparation). The auditor requires two datasets drawn from the same distribution: a *suspect set* $\mathcal{D}_{\text{sus}}$, hypothesised to have been used for training, and a *validation set* $\mathcal{D}_{\text{val}}$, known to be unseen by the model. Both sets are randomly partitioned into non-overlapping A and B splits.
Stage 1 (Feature extraction). For every sample in the A-partitions, a battery of membership inference attacks (MIAs) is executed against the suspect LLM $f_{\theta}$ to produce a feature vector. These features span loss-based metrics (perplexity, Min-$k$%), perturbation-based scores (synonym substitution, character-level edits), reference-model comparisons, and compression-based ratios, 52 scores in total.
Stage 2 (Correlation learning). A linear regressor is trained on the A-split features, assigning label 0 to suspect samples and label 1 to validation samples. The learned weights identify which MIA features carry positive membership signal for the given data distribution, since no single MIA works across all distributions.
Stage 3 (Statistical test). The regressor produces per-sample scores on the held-out B-splits. A one-sided $t$-test assesses whether the suspect-set scores are systematically lower (more member-like) than the validation-set scores, yielding a $p$-value under the null hypothesis $H_0$ that the suspect set was not used for training.
Critical assumption. The validation set must be both IID with the suspect set and genuinely unseen during training. When this assumption is violated (for example, when using a benchmark test split as validation for its train split), the method can produce false positives, as demonstrated in Task 2 of our evaluation.
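The effect of an IID violation on the Stage 3 test can be illustrated with a toy example. The sketch below is not the authors' implementation: it uses a Welch-style statistic with a normal approximation to the $p$-value (reasonable only for large samples) in place of an exact $t$-test, and the scores are deterministic placeholder numbers rather than real regressor outputs.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def one_sided_p_value(suspect_scores, val_scores):
    """Approximate p-value for H0: suspect scores are not lower than validation scores."""
    se = sqrt(variance(suspect_scores) / len(suspect_scores)
              + variance(val_scores) / len(val_scores))
    t_stat = (mean(suspect_scores) - mean(val_scores)) / se
    # Normal approximation to the null distribution; small when the
    # suspect scores are systematically lower than the validation scores.
    return NormalDist().cdf(t_stat)

# Placeholder regressor outputs on the B splits (not real MIA scores)
val_scores = [i / 100 for i in range(100)]

iid_suspect = list(val_scores)                    # same distribution, no membership signal
shifted_suspect = [s - 0.3 for s in val_scores]   # unseen data from a shifted distribution

p_iid = one_sided_p_value(iid_suspect, val_scores)
p_shifted = one_sided_p_value(shifted_suspect, val_scores)
print(f"IID proxy:     p = {p_iid:.3f}")
print(f"Shifted proxy: p = {p_shifted:.3g}")
```

In this toy setting neither suspect set carries any membership signal, yet the shifted one is rejected at any conventional threshold: the test only sees that the scores are lower, not why.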
#### A.1.2 Post-Hoc Dataset Inference
Post-Hoc Dataset Inference \[[28](https://arxiv.org/html/2606.03305#bib.bib28)\] removes the requirement for a natural IID validation set by synthetically generating one. The method operates in two phases.
Phase 1: Held-out data generation. The suspect documents are segmented into short text snippets, shuffled, and randomly split into a *generator training set* and a *generator inference set*. A small causal language model is fine-tuned (via LoRA) on the training set using a next-token prediction objective. Each snippet in the inference set is then split into a prefix and a suffix. The trained generator produces synthetic suffixes conditioned on the prefixes. The original (real) suffixes constitute the suspect set $\mathcal{D}_{\text{sus}}$; the synthetic suffixes constitute the held-out set $\mathcal{D}_{\text{val}}$. Because every real–synthetic pair shares a common prefix, the two sets are positionally aligned, minimising distributional shift.
Phase 2: Post-hoc calibration. Both sets are split into train and test subsets. Two classifiers are trained on the training subsets: (i) a *text-only classifier* $c_{\text{text}}(x)$ that distinguishes real from synthetic text based on textual features alone, and (ii) a *combined classifier* $c_{\text{comb}}(x, \mathrm{MIA}(f(x)))$ that additionally incorporates MIA-derived features from the suspect LLM $f$. The key insight is that any performance advantage of $c_{\text{comb}}$ over $c_{\text{text}}$ must stem from genuine membership signals rather than the real-vs-synthetic distributional shift. A one-sided $t$-test on the test subsets checks whether this advantage is statistically significant; rejection of $H_0$ indicates that the suspect set was used for training.
Critical assumption. The generator must be trained on sufficient data to produce realistic synthetic text. When trained on only ${\sim}2{,}000$ benchmark-sized samples, the generator is too weak: the text-only classifier achieves high AUC, the real-vs-synthetic gap dominates, and the calibration test becomes uninformative.
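The data preparation in Phase 1 can be sketched as below. This is an illustrative outline only: the function name, word-level segmentation, snippet and prefix lengths, and the 50/50 split are assumptions made here and are not taken from the method's implementation, and the generator fine-tuning itself is omitted.

```python
import random

def build_generator_splits(documents, snippet_words=64, prefix_words=32,
                           train_fraction=0.5, seed=0):
    """Segment documents into snippets and prepare generator train / inference sets."""
    snippets = []
    for doc in documents:
        words = doc.split()
        # keep only full-length snippets so each one has both a prefix and a suffix
        for start in range(0, len(words) - snippet_words + 1, snippet_words):
            snippets.append(words[start:start + snippet_words])

    random.Random(seed).shuffle(snippets)
    cut = int(len(snippets) * train_fraction)
    generator_train = [" ".join(s) for s in snippets[:cut]]
    # Each inference snippet yields (prefix, real_suffix); a fine-tuned generator
    # would later produce a synthetic suffix conditioned on the same prefix.
    inference_pairs = [(" ".join(s[:prefix_words]), " ".join(s[prefix_words:]))
                       for s in snippets[cut:]]
    return generator_train, inference_pairs

# Toy documents of 200 placeholder words each
docs = [" ".join(f"w{d}_{i}" for i in range(200)) for d in range(10)]
gen_train, pairs = build_generator_splits(docs)
print(len(gen_train), len(pairs))  # prints: 15 15
```

With only a few thousand benchmark examples, the generator training set produced this way is small, which is the scale constraint the critical assumption refers to.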
#### A.1.3 CoDeC: Contamination Detection via Context
CoDeC \[[27](https://arxiv.org/html/2606.03305#bib.bib27)\] detects dataset-level contamination by measuring how in-context examples from the same dataset affect the model’s prediction confidence. The procedure operates on each sample independently before aggregating.
Step 1 (Baseline prediction). For each sample $x \in \mathcal{D}$, compute the model’s average per-token log-likelihood on $x$ without any preceding context.
Step 2 (In-context prediction). Sample $n$ additional examples from $\mathcal{D} \setminus \{x\}$, prepend them to $x$ as in-context demonstrations, and recompute the model’s average log-likelihood on $x$ in this extended context.
Step 3 (Score computation). Compute the per-sample confidence difference $\Delta(x) = \text{logprob}_{\text{ICL}}(x) - \text{logprob}_{\text{baseline}}(x)$. For unseen data, in-context examples typically improve confidence ($\Delta > 0$), since they provide useful distributional cues such as format, vocabulary, and style. For memorised data, context can disrupt memorisation patterns, *reducing* confidence ($\Delta < 0$).
Step 4 (Aggregation). The dataset contamination score is

$$S(\mathcal{D}) \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\bigl[\Delta(x_i) < 0\bigr],$$

the fraction of samples for which context reduces confidence. Scores above 80% indicate strong contamination evidence; scores below 60% suggest no contamination; intermediate values require cross-model comparison for reliable interpretation.
Key properties. CoDeC requires no external reference data, held-out sets, or dataset-specific calibration. It produces percentage-based scores that are directly interpretable and model-agnostic, needing only gray-box access to token log-probabilities. The method is also computationally efficient, requiring only two forward passes per sample.
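Steps 1 to 4 can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: `avg_logprob` stands in for a real model call returning the average per-token log-likelihood, `toy_avg_logprob` is a hypothetical mock that hard-codes the expected behaviour, and the function names and the choice of one in-context example are introduced here.

```python
import random

def codec_score(dataset, avg_logprob, n_context=1, seed=0):
    """Fraction of samples whose average log-likelihood drops once in-context
    examples from the same dataset are prepended (the score S(D) of Step 4)."""
    rng = random.Random(seed)
    drops = 0
    for i, x in enumerate(dataset):
        others = dataset[:i] + dataset[i + 1:]        # D without x
        context = rng.sample(others, n_context)
        baseline = avg_logprob(x, context=[])         # Step 1
        in_context = avg_logprob(x, context=context)  # Step 2
        delta = in_context - baseline                 # Step 3
        drops += delta < 0
    return drops / len(dataset)

def interpret(score):
    """Apply the thresholds quoted in Step 4."""
    if score > 0.80:
        return "strong contamination evidence"
    if score < 0.60:
        return "no contamination suggested"
    return "intermediate: compare across models"

# Hypothetical stand-in for a model: "seen" samples lose confidence with
# context, unseen samples gain it.
def toy_avg_logprob(x, context):
    base = -1.0 if x.startswith("seen") else -3.0
    if not context:
        return base
    return base - 0.2 if x.startswith("seen") else base + 0.5

data = [f"seen-{i}" for i in range(9)] + ["unseen-0"]
score = codec_score(data, toy_avg_logprob)
print(score, interpret(score))  # prints: 0.9 strong contamination evidence
```

Because the score is a fraction over the whole dataset, two splits drawn from the same source tend to receive similar values, which matches the coarse resolution reported in the main text.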
### A.2 Details of hyperparameters of Post-hoc DI
We subsample 2,000 texts per set and clip outliers symmetrically at 2.5% per tail. A Meta-Llama-3.1-8B model fine-tuned with LoRA generates 128-token suffixes for paraphrasing. Calibration relies on a GPT-2 baseline text classifier, while combined features are evaluated using a linear classifier trained for 1,000 epochs with Adam. MIA metrics are $z$-score normalized prior to training.
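The feature preprocessing mentioned above (symmetric tail clipping followed by $z$-score normalization) could look roughly like this. The sketch assumes that clipping means clamping values to the 2.5% and 97.5% nearest-rank cut points and that the population standard deviation is used; the source does not specify either choice, and the function names are introduced here.

```python
from statistics import mean, pstdev

def clip_tails(values, tail=0.025):
    """Clamp values below the `tail` quantile and above the `1 - tail` quantile."""
    ordered = sorted(values)
    k = int(len(ordered) * tail)              # number of values in each tail
    low, high = ordered[k], ordered[-k - 1]   # nearest-rank cut points
    return [min(max(v, low), high) for v in values]

def z_score(values):
    """Standardise to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Toy MIA metric with one extreme value in each tail
raw = [float(i) for i in range(40)]
raw[0], raw[-1] = -1000.0, 1000.0

clipped = clip_tails(raw)
normalized = z_score(clipped)
print(min(clipped), max(clipped))  # prints: 1.0 38.0
```

Clipping before standardising keeps a handful of extreme metric values from dominating the mean and variance used for normalization.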
### 0\.A\.3Extra Experiments for Post\-Hoc Dataset Inference
Table 7: Post\-hoc DI classifier AUCs on five Pile subsets \(train/val\), full\-scale held\-out generation\. Base uses text only; Comb adds MIA\-derived features\. With generators trained on nearly the full Pile subsets, text\-only AUCs drop to around 55%, confirming that larger training data produces better\-matched synthetic distributions and restores meaningful calibration\.

| Dataset | Pythia 410M Base | Pythia 410M Comb | Pythia 1.4B Base | Pythia 1.4B Comb | Pythia 2.8B Base | Pythia 2.8B Comb | Pythia 6.9B Base | Pythia 6.9B Comb |
|---|---|---|---|---|---|---|---|---|
| CC (Train) | 0.548 | 0.557 | 0.548 | 0.556 | 0.548 | 0.564 | 0.548 | 0.567 |
| CC (Val) | 0.506 | 0.513 | 0.506 | 0.515 | 0.506 | 0.517 | 0.506 | 0.523 |
| Europarl (Train) | 0.509 | 0.532 | 0.509 | 0.555 | 0.509 | 0.567 | 0.509 | 0.575 |
| Europarl (Val) | 0.514 | 0.516 | 0.514 | 0.523 | 0.514 | 0.526 | 0.514 | 0.533 |
| Hacker News (Train) | 0.525 | 0.567 | 0.525 | 0.578 | 0.525 | 0.585 | 0.525 | 0.591 |
| Hacker News (Val) | 0.570 | 0.569 | 0.570 | 0.571 | 0.570 | 0.578 | 0.570 | 0.577 |
| Stack Exchange (Train) | 0.540 | 0.554 | 0.540 | 0.562 | 0.540 | 0.567 | 0.540 | 0.574 |
| Stack Exchange (Val) | 0.530 | 0.528 | 0.530 | 0.539 | 0.530 | 0.537 | 0.530 | 0.541 |
| USPTO Backgrounds (Train) | 0.512 | 0.510 | 0.512 | 0.511 | 0.512 | 0.514 | 0.512 | 0.512 |
| USPTO Backgrounds (Val) | 0.527 | 0.517 | 0.527 | 0.521 | 0.527 | 0.525 | 0.527 | 0.525 |

Table 8: Post\-hoc DI calibration test $p$\-values \(Diff\), full\-scale held\-out generation\. Markers encode ground\-truth membership: train rows are \[\+\], val rows are \[\-\]\. Green markers indicate agreement with ground truth \($p < 0.1$ for train, $p \geq 0.1$ for val\); red markers indicate failure \($p \geq 0.1$ for train, or $p < 0.1$ for val\)\. Extended MIA variants improve detection on several subsets \(e\.g\., Hacker News and Europarl\), though inconsistencies remain for others such as arXiv and PubMed Central\.

| Dataset | Pythia 410M | Pythia 1.4B | Pythia 2.8B | Pythia 6.9B |
|---|---|---|---|---|
| CC (Train) | 0.134 [+] | 0.053 [+] | 0.003 [+] | 0.005 [+] |
| CC (Val) | 1.000 [-] | 0.846 [-] | 0.668 [-] | 0.017 [-] |
| PubMed Central (Train) | 0.307 [+] | 0.058 [+] | 0.024 [+] | 0.000 [+] |
| PubMed Central (Val) | 0.065 [-] | 0.093 [-] | 0.011 [-] | 0.001 [-] |
| arXiv (Train) | 0.823 [+] | 0.349 [+] | 0.169 [+] | 0.030 [+] |
| arXiv (Val) | 0.145 [-] | 0.011 [-] | 0.004 [-] | 0.000 [-] |
| PhilPapers (Train) | 0.505 [+] | 0.630 [+] | 0.675 [+] | 0.632 [+] |
| PhilPapers (Val) | 0.317 [-] | 0.720 [-] | 0.223 [-] | 0.887 [-] |
| FreeLaw (Train) | 0.590 [+] | 0.261 [+] | 0.024 [+] | 0.007 [+] |
| FreeLaw (Val) | 0.909 [-] | 0.513 [-] | 0.434 [-] | 0.007 [-] |
| USPTO (Train) | 0.334 [+] | 0.489 [+] | 0.514 [+] | 0.242 [+] |
| USPTO (Val) | 0.968 [-] | 0.962 [-] | 0.965 [-] | 0.685 [-] |

[Tables 7](https://arxiv.org/html/2606.03305#Pt0.A1.T7) and [8](https://arxiv.org/html/2606.03305#Pt0.A1.T8) report the post\-hoc calibration stage evaluated on held\-out sets generated and released by the method authors, where the generator is trained on nearly the full corresponding Pile subsets\. The text\-only classifier yields AUCs around 55%, consistent with the hypothesis discussed in [Section 5\.5](https://arxiv.org/html/2606.03305#S5.SS5.SSS0.Px2)\. The resulting $p$\-values are also more aligned with the expected outcomes than in our benchmark\-sized subsampling experiment \([Tables 5](https://arxiv.org/html/2606.03305#S5.T5) and [6](https://arxiv.org/html/2606.03305#S5.T6)\)\.
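The agreement criterion stated in the Table 8 caption can be expressed as a short check\. The sketch below is illustrative only: the helper names `is_correct` and `count_correct` are hypothetical, and only the CC and arXiv rows of Table 8 are copied in as sample input\. A $p$\-value below 0\.1 is read as evidence of membership, so an outcome is correct when that reading matches the ground\-truth split\.

```python
ALPHA = 0.1  # threshold stated in the Table 8 caption


def is_correct(p_value, is_member, alpha=ALPHA):
    """True when the test outcome agrees with ground-truth membership.

    Train (member) sets are correct when p < alpha;
    val (non-member) sets are correct when p >= alpha.
    """
    flagged_as_member = p_value < alpha
    return flagged_as_member == is_member


def count_correct(rows, alpha=ALPHA):
    """Count agreeing outcomes over {(dataset, is_member): [p-values]}."""
    return sum(
        is_correct(p, is_member, alpha)
        for (_, is_member), p_values in rows.items()
        for p in p_values
    )


# p-values copied from Table 8 (Pythia 410M / 1.4B / 2.8B / 6.9B).
sample_rows = {
    ("CC", True): [0.134, 0.053, 0.003, 0.005],
    ("CC", False): [1.000, 0.846, 0.668, 0.017],
    ("arXiv", True): [0.823, 0.349, 0.169, 0.030],
    ("arXiv", False): [0.145, 0.011, 0.004, 0.000],
}

if __name__ == "__main__":
    total = sum(len(p_values) for p_values in sample_rows.values())
    print(f"{count_correct(sample_rows)} of {total} outcomes agree")
```

Keeping the threshold as a parameter makes it straightforward to see how sensitive the agreement count is to the choice of significance level\.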
### 0\.A\.4 Numerical values for evaluation experiments
Table 9: Numeric contamination\-detection scores on train and test splits of the Pile \(Limited Data\)\. LLM DI and Post\-Hoc DI entries are $p$\-values; CoDeC entries are contamination scores\. Each cell lists values for Pythia 0\.4B / 1\.4B / 2\.8B / 6\.9B\.

| Method | Pile CC, Train | Pile CC, Test | Pile Europarl, Train | Pile Europarl, Test |
|---|---|---|---|---|
| LLM DI | 0.85 / 0.21 / 0.14 / 0.42 | 0.98 / 0.95 / 0.84 / 0.95 | 0.00 / 0.00 / 0.00 / 0.00 | 0.93 / 0.97 / 0.83 / 0.70 |
| Post-Hoc DI | 0.64 / 0.19 / 0.60 / 0.49 | 0.84 / 0.21 / 0.46 / 0.33 | 1.00 / 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 / 1.00 |
| CoDeC | 0.98 / 0.98 / 0.98 / 0.99 | 0.97 / 0.96 / 0.96 / 0.96 | 0.94 / 0.95 / 0.97 / 0.98 | 0.94 / 0.96 / 0.98 / 0.98 |

| Method | Pile Hacker News, Train | Pile Hacker News, Test | Pile Stack Exchange, Train | Pile Stack Exchange, Test |
|---|---|---|---|---|
| LLM DI | 0.00 / 0.00 / 0.00 / 0.00 | 0.65 / 0.40 / 0.05 / 0.77 | 0.00 / 0.00 / 0.00 / 0.00 | 0.75 / 0.90 / 0.76 / 0.21 |
| Post-Hoc DI | 0.13 / 0.76 / 0.48 / 0.39 | 0.06 / 0.00 / 0.00 / 0.00 | 0.96 / 0.93 / 0.96 / 0.99 | 1.00 / 0.78 / 0.97 / 0.99 |
| CoDeC | 0.90 / 0.95 / 0.97 / 0.97 | 0.91 / 0.94 / 0.95 / 0.96 | 0.94 / 0.96 / 0.95 / 0.95 | 0.88 / 0.87 / 0.85 / 0.84 |
Table 10: Numeric contamination\-detection scores for OLMo 2\. LLM DI and Post\-Hoc DI entries are $p$\-values; CoDeC entries are contamination scores\. Each cell lists values for OLMo 2 1B / 7B / 13B\.

| Method | GSM8K, Train | GSM8K, Test | Competition Math, Train | Competition Math, Test | MMLU, Train | MMLU, Test | DROP, Train | DROP, Test |
|---|---|---|---|---|---|---|---|---|
| LLM DI | 0.00 / 0.00 / 0.00 | 0.99 / 0.83 / 0.35 | 0.00 / 0.00 / 0.00 | 0.96 / 0.36 / 0.91 | 0.00 / 0.00 / 0.00 | 0.11 / 0.05 / 0.16 | 0.00 / 0.00 / 0.00 | 0.55 / 0.82 / 0.51 |
| Post-Hoc DI | 0.29 / 0.02 / 0.01 | n/a / n/a / n/a | n/a / n/a / n/a | n/a / n/a / n/a | 0.98 / 0.81 / 0.42 | 1.00 / 0.90 / 0.92 | 1.00 / 1.00 / 1.00 | 1.00 / 1.00 / 1.00 |
| CoDeC | 0.87 / 0.55 / 0.18 | 0.86 / 0.91 / 0.75 | 0.70 / 0.70 / 0.53 | 0.73 / 0.74 / 0.61 | 0.59 / 0.72 / 0.40 | 0.63 / 0.61 / 0.51 | 0.86 / 0.62 / 0.43 | 0.84 / 0.66 / 0.46 |
Table 11: Contamination detection scores for medical LLMs \(numeric values\)\. Each cell lists values for the model sizes given in the Sizes column\.

| Model | Sizes | Method | MedQA-USMLE, Train | MedQA-USMLE, Test | MedMCQA, Train | MedMCQA, Test |
|---|---|---|---|---|---|---|
| medgemma | 4B / 27B | LLM DI | 0.17 / 0.25 | 0.28 / 0.73 | 0.00 / 0.01 | 0.77 / 0.13 |
| medgemma | 4B / 27B | CoDeC | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 | 0.00 / 0.00 |
| Meditron3 | 8B / 9B | LLM DI | 0.13 / 0.16 | 0.12 / 0.36 | 0.00 / 0.04 | 0.63 / 0.60 |
| Meditron3 | 8B / 9B | CoDeC | 0.66 / 0.29 | 0.64 / 0.31 | 0.38 / 0.32 | 0.34 / 0.28 |
| meerkat | 7B / 8B | LLM DI | 0.48 / 0.14 | 0.90 / 0.70 | 0.01 / 0.00 | 0.93 / 0.99 |
| meerkat | 7B / 8B | CoDeC | 0.71 / 0.65 | 0.72 / 0.62 | 0.47 / 0.45 | 0.46 / 0.49 |
| Neeto-1.0 | 8B | LLM DI | 0.13 | 0.07 | 0.00 | 0.71 |
| Neeto-1.0 | 8B | CoDeC | 0.51 | 0.51 | 0.43 | 0.48 |

Table 12: Contamination detection scores for PLLuM models on SFT data \(numeric values\)\. base\* = updated base checkpoint \(250801\)\. Each cell lists values for the checkpoints given in the Variants column\.

| Model | Variants | Method | Automatic SFT, Train | Automatic SFT, Val | Manual SFT, Train | Manual SFT, Val |
|---|---|---|---|---|---|---|
| PLLuM-12B base | base / base\* / nc-base | LLM DI | 0.92 / 0.62 / 0.85 | 0.71 / 0.26 / 0.90 | 0.30 / 0.34 / 0.80 | 0.62 / 0.22 / 0.31 |
| PLLuM-12B base | base / base\* / nc-base | CoDeC | 0.85 / 0.50 / 0.38 | 0.82 / 0.49 / 0.37 | 0.65 / 0.54 / 0.48 | 0.69 / 0.52 / 0.50 |
| PLLuM-12B SFT | chat / inst. | LLM DI | 0.00 / 0.32 | 0.50 / 0.74 | 0.91 / 0.82 | 0.39 / 0.29 |
| PLLuM-12B SFT | chat / inst. | CoDeC | 0.90 / 0.89 | 0.86 / 0.89 | 0.78 / 0.66 | 0.79 / 0.77 |
| PLLuM-12B SFT | nc-chat / nc-inst. | LLM DI | 0.46 / 0.81 | 0.18 / 0.16 | 0.91 / 0.51 | 0.27 / 0.60 |
| PLLuM-12B SFT | nc-chat / nc-inst. | CoDeC | 0.81 / 0.61 | 0.79 / 0.59 | 0.80 / 0.59 | 0.81 / 0.64 |
| Llama-PLLuM 8B base | base / base\* | LLM DI | 0.64 / 0.15 | 0.72 / 0.42 | 0.82 / 0.65 | 0.64 / 0.26 |
| Llama-PLLuM 8B base | base / base\* | CoDeC | 0.95 / 0.55 | 0.96 / 0.57 | 0.75 / 0.60 | 0.83 / 0.57 |
| Llama-PLLuM 8B SFT | chat / inst. | LLM DI | 0.23 / 0.96 | 0.48 / 0.74 | 0.91 / 0.80 | 0.65 / 0.23 |
| Llama-PLLuM 8B SFT | chat / inst. | CoDeC | 0.93 / 0.94 | 0.92 / 0.93 | 0.84 / 0.74 | 0.90 / 0.81 |

Table 13: Contamination detection scores for PLLuM models on alignment data \(numeric values\)\. base\* = updated base checkpoint \(250801\)\. Each cell lists values for the checkpoints given in the Variants column\.

| Model | Variants | Method | Alignment, Train | Alignment, Test |
|---|---|---|---|---|
| PLLuM-12B base | base / base\* / nc-base | LLM DI | 0.00 / 0.00 / 0.00 | 0.17 / 0.67 / 0.96 |
| PLLuM-12B base | base / base\* / nc-base | CoDeC | 0.28 / 0.39 / 0.25 | 0.43 / 0.45 / 0.29 |
| PLLuM-12B SFT | chat / inst. | LLM DI | 0.00 / 0.00 | 0.85 / 0.21 |
| PLLuM-12B SFT | chat / inst. | CoDeC | 0.80 / 0.39 | 0.79 / 0.56 |
| PLLuM-12B SFT | nc-chat / nc-inst. | LLM DI | 0.00 / 0.00 | 0.94 / 0.81 |
| PLLuM-12B SFT | nc-chat / nc-inst. | CoDeC | 0.85 / 0.42 | 0.84 / 0.51 |
| Llama-PLLuM 8B base | base / base\* | LLM DI | 0.00 / 0.00 | 0.12 / 0.68 |
| Llama-PLLuM 8B base | base / base\* | CoDeC | 0.37 / 0.36 | 0.47 / 0.47 |
| Llama-PLLuM 8B SFT | chat / inst. | LLM DI | 0.00 / 0.00 | 0.97 / 0.84 |
| Llama-PLLuM 8B SFT | chat / inst. | CoDeC | 0.83 / 0.49 | 0.85 / 0.64 |
### 0\.A\.5 Time analysis
Table 14: Average run duration for detection methods\. Standard deviation is reported across different model sizes \(Pythia, OLMo 2\) and datasets \(Pile subsets, GSM8K, DROP, and MMLU\)\. All values are normalized to seconds for comparability\. The runs were executed on Nvidia GH200\.

| Method | Average Duration [s] |
|---|---|
| Post-Hoc DI | $4566.6 \pm 846.4$ |
| CoDeC | $126.2 \pm 31.3$ |
| LLM DI | $187.4 \pm 14.6$ |