@TheAhmadOsman: https://x.com/TheAhmadOsman/status/2064724789952958663
Summary
A detailed explanation of why training on benchmarks, evals, or test sets is a cardinal sin in ML, corrupting the ability to measure generalization. The article emphasizes the importance of clean evaluation protocols and warns against benchmaxxing.
Cached at: 06/11/26, 01:56 PM
LLM/ML Fundamentals: Why Training on Benchmarks, Evals, or Test Sets Is a Cardinal Sin
Machine learning is not supposed to prove that a model can repeat examples it has already seen. It is supposed to tell us whether the model will work on the next case, the one that was not in the training run. That is the whole reason benchmarks, evals, and test sets exist.
This article started from the exchange below. Someone asked for a model-ranking website, and my answer was that I am not willing to publish the evals or data that make my rankings useful.
Here is the tension: a private ranking can sound like “trust me bro,” but publishing the evals can destroy the very thing that makes them work. Once the test items, scoring logic, examples, or answer patterns become public optimization targets, they stop being a clean measurement and risk becoming a tool for benchmaxxing.
A benchmark, eval suite, or hidden test is useful because it acts like an independent measuring instrument. It gives us a way to ask, “How well does this system generalize?” Training on that measuring instrument breaks the instrument. It turns the score into a record of exposure, not a record of capability.
The basic rule is simple:
Train on training data. Tune on validation data. Measure once, honestly, on test data. Once you train, tune, prompt-optimize, scaffold-optimize, filter, or select against the test set, it is not a test set anymore.
Classical machine learning understood this decades ago. You do not fit preprocessing on the full dataset before splitting. You do not choose features using the test labels. You do not pick checkpoints by looking at final test performance. The same logic applies to LLMs, but the surface area is much larger.
Training on benchmarks, evals, or test sets is not a minor procedural mistake. It is a cardinal sin because it corrupts the central object of empirical machine learning: a trustworthy estimate of generalization.
A Story About Failure, Misunderstanding, And What To Trust
The next few screenshots tell a story. It starts with a confidently wrong take about benchmark training, moves into a public model recommendation, gets polished up with a benchmark table, and then runs into the reality of a fresh task that does not care about the take.
This argument is bad on its face. If you treat benchmark scores as proof for public rankings, then turn around and treat benchmark contamination as “how models get better,” that is just benchmaxxing. Training on the test set you plan to cite as evidence is not normal.
This is where the same mistake gets more polished. A benchmark table can make a model look genuinely solid, and maybe it is. But you also have to wonder if these numbers indicate benchmaxxing. Are the evaluations independent, representative, and not optimized against?
This is where bad measurement starts looking like useful advice. People do want concrete, hardware-aware recommendations. They want to know what to run this week, what fits in 12GB or 32GB, and what is actually worth their time. A good ranking can help with that. But the Nex-N2 example is exactly why the following “benchmaxxed” screenshot matters. The same model family can look compelling in a weekly recommendation and still raise serious questions once fresh, messy tasks expose brittle behavior.
This is the same story from the other side. A model can be interesting enough to recommend for a hardware tier and still be suspicious if a related or higher-tier variant looks optimized for benchmark-shaped tasks while failing at a simple task. The screenshot is not a controlled eval, but it is the kind of field report that should make you ask the right question: did the benchmark score reflect broad capability, or did it reward a narrow pattern the model learned to imitate?
What and who to trust: not the prettiest ranking, not a benchmark number that has been turned into a target, and not the most confident account. Trust clean protocol, fresh tasks, private holdouts, disclosed limits, and real behavior under conditions the model did not get to study and train on.
What This Piece Covers
This is mainly a methodological argument about benchmark hygiene, test-set integrity, and why clean evaluation matters, with enough classical ML and LLM-specific detail to make the argument concrete.
For the deeper hardware, software, model-mechanics, and hands-on projects path, I have a five-part series teaching self-hosted LLMs / local AI:
- Part 1: GPU Memory Math for LLMs (2026 Edition).
- Part 2: Memory Bandwidth for Local AI Hardware (2026 Edition).
- Part 3: Inference Engines for LLMs & Local AI Hardware (2026 Edition).
- Part 4: LLMs 101 (2026 Edition): How Models Think One Token At A Time.
- Part 5: Step-By-Step LLM Engineering Projects (2026 Edition).
The first two cover hardware capacity and bandwidth math. The third covers the software layer that turns that hardware into usable inference for running LLMs locally. The fourth covers model-side mechanics: tokens, Transformers, attention, KV cache, decoding, context, RAG, agents, and local deployment mechanics. The fifth turns that foundation into a projects-first roadmap you can build, test, and publish.
This post adds the measurement layer. Once you can run, tune, compare, and ship LLM systems, you need to know when a benchmark is still a ruler and when you have accidentally bent it into training data.
In short: the local AI series helps you understand and build the systems. This piece helps you measure them without fooling yourself.
Benchmark names mentioned in this article will age and become outdated. The core point WILL NOT.
What ML Is Actually Trying To Do
Machine learning is a disciplined way to learn from past data so a system performs well on future cases. In supervised learning, the usual story is inputs, labels, a model, parameters, and a loss function. Training adjusts the parameters so predictions improve on the training examples.
But low training loss is not the target. Low expected loss on the deployment distribution is the target.
That distinction is everything.
A model can look good on training data for a lot of reasons: it may have learned durable structure, memorized examples, found shortcuts, exploited artifacts in the dataset, picked up leakage from future data, or learned something that works on the benchmark but fails in the world.
The point of a held-out test set is to separate “this model learned the task” from “this model learned the dataset.”
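The distinction between training loss and expected loss can be written down compactly. The notation below is an assumption introduced for illustration (it does not appear in the original): a model f_θ, a loss ℓ, n training examples, m held-out test examples, and a deployment distribution D.

```latex
\hat{R}_{\text{train}}(\theta) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_\theta(x_i), y_i\big),
\qquad
R(\theta) = \mathbb{E}_{(x,y)\sim \mathcal{D}}\big[\ell\big(f_\theta(x), y\big)\big],
\qquad
\hat{R}_{\text{test}}(\theta) = \frac{1}{m}\sum_{j=1}^{m} \ell\big(f_\theta(x'_j), y'_j\big).
```

Training minimizes the first quantity, while the second is the actual target. The third is an unbiased estimate of the second only when the test examples are drawn from D and are independent of every choice that went into θ. Any dependence between the test examples and θ breaks that property, which is the formal version of "the model learned the dataset."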
Modern ML gives data different jobs:
- Training set: Fits parameters, representations, and learned features. This is where learning is supposed to happen.
- Validation or dev set: Chooses hyperparameters, prompts, architectures, checkpoints, thresholds, scaffolds, and other design decisions. Useful, but overfittable after enough iterations.
- Test set, eval, or benchmark: Estimates final generalization after decisions are frozen. Should not be used for training, tuning, filtering, prompt iteration, model selection, or repeated optimization.
- Private or hidden test: Provides an external audit or leaderboard score. Examples, labels, answer keys, solution logic, and repeated feedback must be protected.
- Deployment monitoring data: Tracks real-world drift and failures after release. Valuable, but not the same thing as a clean pre-deployment test.
The split is not ceremony. Training error is usually optimistic, especially for systems powerful enough to memorize, interpolate, or exploit tiny regularities. Without an untouched holdout, you cannot reliably tell whether a system generalizes.
Vocabulary: Fit, Generalization, Leakage, Contamination
Before talking about benchmarks, it helps to be precise about the words.
Fitting means information from data changed the system. In classical ML, fitting includes updating weights, estimating normalization statistics, selecting features, choosing thresholds, tuning hyperparameters, and selecting models. In LLM systems, fitting also includes prompt iteration, system-message design, tool-routing changes, retrieval-index changes, agent-loop changes, rejection sampling, self-consistency settings, benchmark-specific postprocessing, and model selection.
One common defense is, “We did not update the weights, so we did not train on the test set.” That defense does not hold. If test-set results changed the system, the system was optimized against the test set.
Generalization means performance on unseen cases from the target distribution. A test set estimates generalization only while it remains independent of the training and selection process. If information from the test set leaks into the model or system, the estimate gets biased upward.
Leakage means information that should be unavailable at prediction time was used during training, validation, or evaluation. In tabular ML, that can be as obvious as a feature that encodes the target label. It can also be subtle: scaling the whole dataset before splitting, imputing with global statistics, selecting features using all labels, or allowing the same patient, customer, repository, author, or classroom to appear on both sides of a split.
The common leakage patterns:
- Preprocessing leakage: A scaler, imputer, PCA transform, tokenizer vocabulary, resampler, or feature selector is fit on all data before splitting.
- Feature leakage: A feature uses information from after the prediction time.
- Duplicate leakage: Near-identical users, documents, code files, questions, or records appear in both train and test.
- Group leakage: The same patient, customer, repository, author, school, or classroom crosses split boundaries.
- Temporal leakage: Future events leak into historical examples.
- Target leakage: A feature contains the label or a proxy created after the outcome.
- Leaderboard leakage: Repeated submissions reveal structure in a hidden test.
- Benchmark contamination: Test questions, answers, explanations, solutions, or paraphrases enter model training, tuning, retrieval, prompt design, or agent context.
Leakage is not a technicality. If you select features on the full dataset and then split, you can produce impressive accuracy even when the labels are random. Split first, fit transforms on the training fold only, then apply the learned transform to validation and test data. That one habit prevents a surprising amount of bad science.
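The random-label claim is easy to reproduce in a small simulation. The sketch below is illustrative only: the data is synthetic noise, the classifier is a simple majority vote over selected features, and all function names and sizes are assumptions chosen to keep it short and dependency-free.

```python
import random

def make_data(n, p, rng):
    """Random +/-1 features and random +/-1 labels: there is nothing real to learn."""
    X = [[rng.choice((-1, 1)) for _ in range(p)] for _ in range(n)]
    y = [rng.choice((-1, 1)) for _ in range(n)]
    return X, y

def feature_scores(X, y, rows, p):
    """Signed agreement between each feature and the label over the given rows."""
    return [sum(X[i][j] * y[i] for i in rows) for j in range(p)]

def top_k(scores, k):
    """Indices of the k features with the largest absolute score."""
    return sorted(range(len(scores)), key=lambda j: abs(scores[j]), reverse=True)[:k]

def accuracy(X, y, train, test, selected, p):
    """Fit vote directions on train rows only, then score a majority vote on test rows."""
    train_scores = feature_scores(X, y, train, p)
    signs = {j: (1 if train_scores[j] >= 0 else -1) for j in selected}
    correct = 0
    for i in test:
        vote = sum(signs[j] * X[i][j] for j in selected)  # k is odd, so no ties
        pred = 1 if vote > 0 else -1
        correct += pred == y[i]
    return correct / len(test)

def run_trial(seed, n=100, p=1000, k=21):
    rng = random.Random(seed)
    X, y = make_data(n, p, rng)
    rows = list(range(n))
    train, test = rows[: n // 2], rows[n // 2 :]
    leaky = top_k(feature_scores(X, y, rows, p), k)   # selection saw the test labels
    clean = top_k(feature_scores(X, y, train, p), k)  # selection saw train rows only
    return (accuracy(X, y, train, test, leaky, p),
            accuracy(X, y, train, test, clean, p))

results = [run_trial(seed) for seed in range(5)]
leaky_acc = sum(r[0] for r in results) / len(results)
clean_acc = sum(r[1] for r in results) / len(results)
print(f"select-then-split: {leaky_acc:.2f}   split-then-select: {clean_acc:.2f}")
```

Both pipelines train the final vote on the training half only. The only difference is whether feature selection was allowed to see the test rows, and that alone is enough to push test accuracy well above chance on pure noise.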
Contamination is leakage at benchmark scale. It happens when benchmark items, answer keys, explanations, hidden tests, or close variants appear in training data, tuning data, synthetic data, retrieval corpora, prompt-engineering loops, eval logs, or agent tools. For LLMs, contamination is often not just a dataset problem. It is a system problem.
Why LLMs Make Contamination Harder Than Classical ML
Classical ML often starts with a known table, a curated dataset, or a controlled data pipeline. LLM development usually starts with massive, mixed corpora: web pages, code, books, papers, issue threads, documentation, Stack Overflow answers, model outputs, synthetic corpora, benchmark discussions, and user preference data.
The result is that contamination becomes easier to create and harder to rule out.
LLMs Learn From Tokens At Huge Scale
LLMs turn text into token IDs, often with subword tokenization schemes like BPE, WordPiece, or Unigram. Rare words get broken into pieces; common words stay compact. The model learns to predict the next token or fill in missing tokens across enormous corpora.
Transformers made that scaling practical by replacing recurrence and convolution with attention, which supports efficient parallel training and long-range dependency modeling. Scaling-law work showed that language-model loss follows predictable trends with model size, dataset size, and compute. Chinchilla-style results made the point sharper: compute-optimal training does not mean only making models larger. It also means scaling training tokens.
The consequence is straightforward: if a benchmark has been published, mirrored, solved, translated, summarized, discussed, or embedded in derivative datasets, it has many ways to enter training.
LLM Training Is Not Only Pretraining
Modern LLM systems are usually built in layers:
- Pretraining on broad corpora.
- Continued pretraining on domain data.
- Supervised fine-tuning on instruction-response examples.
- Preference optimization, including RLHF and DPO.
- Safety tuning and refusal tuning.
- Tool use and retrieval augmentation.
- System prompts, chain-of-thought policies, decoding settings, and agent scaffolds.
- Distillation from stronger models.
- Synthetic-data generation and filtering.
InstructGPT showed how human-feedback training could make a smaller model more aligned with user intent than a larger base model. DPO later simplified preference optimization by directly optimizing a classification-style objective instead of training a separate reward model and running reinforcement learning.
Each stage can contaminate evaluation. A benchmark item might be absent from pretraining but present in instruction tuning, preference comparisons, synthetic reasoning traces, a retrieval corpus, an eval log, or a prompt-optimization spreadsheet. Clean evaluation has to account for the whole system.
LLMs Can Benefit From Semantic Contamination
Old decontamination often looked for exact strings or n-gram overlap. Necessary, but not enough.
Rephrased samples, translations, close paraphrases, and semantically equivalent problems can still teach the model the benchmark. Recent work on rephrased benchmark items showed that models can overfit paraphrased or translated test data while evading simple string matching. Soft-contamination work in 2026 pushed this further: semantic duplicates can matter even when exact duplicates are removed, and adding duplicates or near-duplicates can improve benchmark performance, including on held-out items from the same benchmark distribution.
Contamination is not a binary switch. There is exact contamination, near-duplicate contamination, semantic contamination, translation contamination, distributional contamination, procedural contamination, and feedback contamination. The farther you move from exact copies, the harder the measurement problem becomes.
What Benchmarks And Evals Are Actually For
Benchmarks are measuring instruments. They are not the task itself.
A good benchmark gives you a clear task definition, representative cases, reliable labels or scoring, stable evaluation code, meaningful metrics, uncertainty estimates, documented limitations, contamination controls, and a lifecycle plan for refresh or retirement.
What it can answer is narrow: how this system performed on this measurement under these conditions.
What it cannot prove by itself is much larger. A benchmark score does not prove that a system is generally intelligent, safe, useful, truthful, fair, robust, calibrated, or deployable.
Static Academic Benchmarks
Static benchmarks like MMLU, BIG-bench, GPQA, ARC-style tasks, and many code benchmarks helped the field because they were reproducible and cheap to run. MMLU measured multitask language understanding across 57 subjects. BIG-bench collected a large community-written task suite. HELM pushed the field to evaluate more than accuracy by including calibration, robustness, fairness, bias, toxicity, and efficiency.
But public static benchmarks age. They become known. They get discussed. They are copied into datasets. Models are optimized for them. Researchers learn the quirks. Eventually, they saturate.
By the 2026 AI Index, benchmark reliability and gaming concerns were prominent enough to call out explicitly, including invalid-question rates and platform-adaptation effects in major evaluations. A 2025 systematic study of text-based LLM benchmarks framed saturation as the loss of discriminative power among top models and found that age and scale were strong predictors.
Harder And Refreshed Benchmarks
The field responded with harder benchmarks and revised designs.
MMLU-Pro made MMLU-style testing harder with more reasoning-focused questions, more answer options, fewer noisy or trivial items, and lower prompt sensitivity. GPQA used expert-written graduate-level science questions that domain experts could answer far more reliably than skilled non-experts with web access. Humanity’s Last Exam aimed for broad, multimodal, expert-written questions that resist simple lookup. ARC-AGI-style evaluations focus on novel task induction and few-shot generalization rather than fact recall.
These moves help, but they do not remove the incentive problem. Once a public benchmark becomes valuable, people optimize for it.
Live And Dynamic Benchmarks
Dynamic benchmarks try to reduce contamination by adding new questions after model training cutoffs. LiveBench combines questions refreshed on a monthly cadence, objective ground-truth answers, and broad task coverage. LiveCodeBench evaluates code generation with newly collected programming problems over time. LiveMedBench extends the same idea to medical QA, using recent clinical cases and rubric-based scoring to test temporal generalization.
Dynamic evals are not magic. They still need label quality, sampling discipline, scoring robustness, security, and anti-leak controls. But they directly attack one of the largest LLM-era problems: static public test sets eventually become training data.
Human Preference Arenas
Some outputs cannot be scored by exact match. Open-ended chat responses, writing quality, instruction following, helpfulness, and conversational usefulness often need preference evaluation. Chatbot Arena made pairwise human preference evaluation popular by using anonymous randomized battles and a large volume of user votes across many languages.
Useful, but not universal truth. Arena-style rankings can reflect user-base bias, prompt distribution, platform adaptation, and a focus on helpfulness more than safety or domain-specific production needs.
LLM-as-Judge Evals
LLM judges are common because they are cheap and scalable. Strong judges can correlate with human preferences in many settings, but they also bring position bias, verbosity bias, self-enhancement bias, model-family bias, and weak reasoning under some rubrics.
LLM judges are useful for development, triage, rubric scoring, and qualitative comparison. They should not be treated as infallible. An eval judged by an LLM can be gamed, contaminated, biased, or miscalibrated against what humans actually need.
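Position bias in particular has a cheap partial check: ask the judge twice with the answers swapped and keep only verdicts that survive the swap. The sketch below assumes a hypothetical judge interface (a callable returning "first" or "second") and uses stub judges in place of a real model.

```python
from typing import Callable

# Hypothetical interface: a judge takes (prompt, first_answer, second_answer)
# and returns "first" or "second".
Judge = Callable[[str, str, str], str]

def debiased_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice with the answers swapped; keep only consistent verdicts."""
    forward = judge(prompt, answer_a, answer_b)    # A shown first
    backward = judge(prompt, answer_b, answer_a)   # B shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "inconsistent"  # the verdict followed the position, not the content

def always_first(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias."""
    return "first"

def prefers_longer(prompt: str, first: str, second: str) -> str:
    """Stub judge with verbosity bias; it is consistent across orderings."""
    return "first" if len(first) >= len(second) else "second"

print(debiased_verdict(always_first, "q", "short", "a much longer answer"))
print(debiased_verdict(prefers_longer, "q", "short", "a much longer answer"))
```

The swap exposes the position-biased stub as inconsistent, but the verbosity-biased stub passes cleanly and picks the longer answer both times. Order swapping addresses one bias, not all of them, which is why judge scores still need human spot checks.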
Why Training On Test Sets Is a Cardinal Sin
It Destroys The Test Set’s Reason For Existing
The test set exists to estimate performance on unseen data. If the model has seen the examples, labels, answer keys, solution writeups, rubrics, or close paraphrases, the test set is not unseen. The score no longer cleanly estimates generalization.
This is not etiquette. It is measurement validity.
If a student gets the final exam and answer key while studying, a perfect score no longer measures mastery. It measures access.
It Turns Science Into Self-Deception
When a model is trained on the benchmark, score gains may reflect memorization, distributional mimicry, or artifact exploitation rather than new capability. A team may think it improved reasoning, coding, medicine, law, math, or safety when it only improved benchmark recall.
That is dangerous, because benchmark gains often shape deployment decisions, investment, research direction, product claims, and public trust.
It Creates a Benchmark-Hacking Arms Race
When a public metric becomes the target, people optimize for the metric. In LLMs, that can mean training on benchmark questions, training on answer explanations, filtering synthetic data for benchmark-like examples, optimizing prompts on public test items, selecting checkpoints by leaderboard performance, using hidden-test feedback across many submissions, exploiting harness quirks, sampling many answers and selecting with benchmark-specific heuristics, routing benchmark inputs to special tools, or retrieving from corpora that contain solutions.
The result is not just inflated scores. It is a degraded research ecosystem.
It Breaks Comparability
Benchmarks are useful only when scores are comparable. If one model team excluded benchmark data and another trained on it, the numbers do not mean the same thing. If one system is a clean closed-book model and another uses a retrieval index containing benchmark answers, the comparison is misleading unless it is disclosed.
This is especially bad for public leaderboards. A contaminated leaderboard rewards data access, benchmark archaeology, and prompt overfitting instead of capability.
It Hides Real-World Failure
A model that memorized benchmark examples can fail on fresh examples. That matters most in high-stakes domains: medicine, law, finance, education, cybersecurity, hiring, and public services. A static medical or legal benchmark can look reassuring while testing stale exam-style recall instead of current, context-specific judgment.
The problem is not that benchmarks are useless. The problem is that contaminated benchmarks are falsely reassuring.
It Weakens Safety Evaluation
Safety evals are vulnerable to the same failure mode. If a model is trained to refuse exactly the jailbreaks in a public suite, or to pass a red-team benchmark while staying vulnerable to small variants, the safety score is inflated. Serious safety work increasingly uses private, dynamic, adversarial, and scenario-based tests. MLCommons AILuminate is one example of the field moving toward standardized safety testing across hazard categories for general-purpose chat systems.
It Can Be Accidental And Still Invalidating
Intent matters ethically. It does not rescue the statistic. An accidental duplicate in pretraining can inflate a score. A hidden benchmark answer in a scraped GitHub repo still leaks information. A synthetic-data generator that recreates benchmark items still contaminates the dataset.
Good evaluation hygiene needs prevention, detection, documentation, and disclosure. Good intentions are not enough.
The LLM-Specific Contamination Pathways
LLMs create more paths from benchmark to model than older ML pipelines.
Pretraining contamination: Published benchmarks often appear in original repositories, Hugging Face datasets, GitHub mirrors, course materials, blog posts, arXiv papers, benchmark leaderboards, model cards, solution discussions, issue threads, academic slides, and scraped Q&A sites. If those sources are in pretraining, the model may learn the benchmark directly.
Instruction-tuning contamination: Even if pretraining is clean, examples can enter supervised fine-tuning datasets. Instruction corpora often aggregate public QA, coding, math, reasoning, exam, and benchmark-style tasks.
Preference-data contamination: RLHF, DPO, and related methods use human or AI comparisons. If benchmark questions appear in preference data, or if preference labels reward benchmark-specific formats, the final model becomes benchmark-adapted without direct supervised answer training.
Synthetic-data contamination: Synthetic data can launder contamination. A teacher model may have memorized a benchmark. Prompt templates may ask for benchmark-like questions. Generated examples may paraphrase known items. Filters may keep examples that resemble benchmark tasks. Distillation can transfer benchmark-specific heuristics.
Retrieval contamination: The base model may be clean while the system is not. A RAG system can retrieve benchmark answers from indexed web pages, papers, repositories, or logs. If an eval is closed-book, that invalidates the score. If browsing is allowed, the claim must say so.
Prompt and scaffold overfitting: Prompts are part of the system. Repeatedly changing prompts after seeing test results is test-set training. The same goes for chain-of-thought policies, few-shot examples, tool-use policies, retry loops, self-consistency sample counts, answer-extraction regexes, code-execution repair loops, and benchmark-specific routers.
Code benchmark contamination: Code evals are especially vulnerable because code, tests, fixes, pull requests, issue threads, and solutions are public and easy to scrape. SWE-bench improved realism by asking systems to patch real repositories for GitHub issues, and SWE-bench Verified added a human-filtered subset for reliability. But agentic code evals introduced new failure modes: patches can pass weak validation while failing developer tests, agents can exploit repository history, and future repository state can leak solutions. SWE-bench-Live responds by making the benchmark continuously updatable.
The general lesson is not limited to SWE-bench. Realistic evals are better, but they need stronger controls.
A Practical Taxonomy Of Training On The Test Set
Not every violation looks like dragging test.csv into the training folder. The useful question is whether information from the test influenced the model or system.
These behaviors are clean when handled correctly:
- Training on a benchmark’s official training split.
- Tuning on an official validation or dev split, with restraint.
- Evaluating once on the test split after decisions are frozen.
- Training on a retired benchmark if you stop reporting it as a clean measurement.
- Using a benchmark format without protected benchmark items, with disclosure when the resemblance is close.
- Using browsing or retrieval in an eval that explicitly allows browsing, while reporting that the capability includes retrieval.
These behaviors are contamination risks or outright violations:
- Looking at test examples while designing prompts.
- Using test labels to choose checkpoints.
- Including test questions in supervised fine-tuning data.
- Including answer explanations or solutions.
- Including paraphrases or translations of test questions.
- Generating synthetic variants from test questions.
- Training a reward model on preferences over test questions.
- Repeatedly tuning from hidden-test leaderboard feedback.
- Using RAG over a corpus that contains benchmark answers in a closed-book eval.
- Running final-test error analysis and then continuing to claim that same test is fresh.
The clean principle:
Any information path from test examples, test labels, answer keys, scoring rules, hidden tests, or repeated test feedback into model or system decisions is a contamination risk.
Why Old Evals Stop Working
A benchmark can fail even without intentional cheating. It may become too easy, too familiar, too noisy, too stale, or too far from real use.
Saturation happens when top models cluster near the ceiling. Once most frontier systems score highly, the benchmark cannot distinguish them. Tiny label errors, prompt choices, scoring quirks, or run-to-run variance can dominate the ranking.
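A back-of-the-envelope calculation shows why noise can dominate near the ceiling. It assumes a simple binomial model with independent items, which is a simplification introduced here for illustration. The score and benchmark size are made-up numbers.

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}\,(1-\hat{p})}{n}},
\qquad
\hat{p} = 0.90,\; n = 1000
\;\Rightarrow\;
\mathrm{SE} \approx \sqrt{\frac{0.09}{1000}} \approx 0.0095,
\qquad
1.96 \cdot \mathrm{SE} \approx 0.019 .
```

Under those assumptions, a 90% score on 1,000 items carries roughly 1.9 points of sampling noise in either direction, so two systems a single point apart are hard to separate on that benchmark alone.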
Dataset aging happens because older benchmarks have had more time to be copied, solved, discussed, mirrored, and absorbed into training data. They may also reflect old coding practices, old medical guidance, old legal rules, old software versions, old social norms, or old user behavior.
Invalid or ambiguous items matter more as models improve. Wrong labels, stale facts, unclear questions, and multiple valid answers can distort rankings. The 2026 AI Index highlighted invalid-question rates as a major reliability problem.
Benchmark artifacts invite shortcuts. Models can exploit answer-position priors, wording patterns, boilerplate, common distractors, unit-test weakness, output-format regularities, or hidden patterns in multiple-choice options. Then the benchmark rewards shortcut learning instead of the intended capability.
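One quick artifact check for multiple-choice benchmarks is to compare chance accuracy with the accuracy of always guessing the most common answer position. This is a minimal sketch with a made-up answer key; the function name is an assumption.

```python
from collections import Counter

def position_prior_baseline(answer_keys):
    """Accuracy of always guessing the most common answer position."""
    counts = Counter(answer_keys)
    position, hits = counts.most_common(1)[0]
    return position, hits / len(answer_keys)

keys = ["C", "A", "C", "B", "C", "D", "C", "C", "A", "C"]  # made-up answer key
position, baseline = position_prior_baseline(keys)
chance = 1 / 4  # four options per question

print(f"always answer {position}: {baseline:.2f} vs chance {chance:.2f}")
```

If the constant-guess baseline sits far above chance, a model can score points from the answer-position prior alone, and shuffling option order before evaluation becomes worth the effort.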
Proper Evaluation Design For ML Systems
Classical evaluation hygiene still matters. LLM-specific controls do not replace it.
Split before preprocessing. Fit scalers, imputers, feature selectors, encoders, PCA transforms, resamplers, and other preprocessing steps on training data only. Apply the learned transform to validation and test data. Fitting preprocessing on all data leaks information from test into train.
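A minimal sketch of the split-first habit for a single numeric feature, using only the standard library. The data and helper names are made up for illustration. A real pipeline would typically use a library transformer, but the ordering of steps is the point.

```python
from statistics import mean, pstdev

def fit_scaler(values):
    """Learn centering and scaling statistics from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero for constant features
    return mu, sigma

def transform(values, scaler):
    """Apply previously learned statistics without refitting."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]

feature = [2.0, 4.0, 6.0, 8.0, 100.0, 120.0]   # made-up single feature
train, test = feature[:4], feature[4:]           # split first

scaler = fit_scaler(train)                       # statistics come from train only
train_scaled = transform(train, scaler)
test_scaled = transform(test, scaler)            # test is transformed, never fitted

leaky_scaler = fit_scaler(feature)               # the mistake: test values shape mu and sigma
print("train-only scaler:", scaler)
print("leaky scaler:     ", leaky_scaler)
```

The two scalers disagree because the leaky one has already absorbed the test values. Every downstream model trained on the leaky version has indirectly seen the test distribution.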
Use the right split. Random splits are not always appropriate. Time series, forecasting, finance, and event data usually need chronological splits. Medical records need patient-level splits. Education data may need student, classroom, or school-level splits. Recommenders may need user, item, and time-aware splits. Code repositories may need repository-level or issue-date splits. Legal and policy QA may need jurisdiction and date awareness. Document classification may need author or source-level splits. Molecular and protein ML may need scaffold or family splits. Fraud and security need time and adversary-aware splits.
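For group-level splits, one simple approach is to derive the split from a hash of the group identifier, so every record from the same patient, repository, or author lands on the same side. The sketch below is one way to do that. The salt, the fraction, and the record layout are illustrative assumptions.

```python
import hashlib

def assign_split(group_id: str, test_fraction: float = 0.2, salt: str = "v1") -> str:
    """Deterministically send every record of a group to the same side."""
    digest = hashlib.sha256(f"{salt}:{group_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"

records = [  # made-up records: (patient_id, visit)
    ("patient-01", "visit-a"), ("patient-01", "visit-b"),
    ("patient-02", "visit-a"), ("patient-03", "visit-a"),
    ("patient-03", "visit-b"), ("patient-04", "visit-a"),
]
splits = {"train": [], "test": []}
for patient_id, visit in records:
    splits[assign_split(patient_id)].append((patient_id, visit))

train_groups = {pid for pid, _ in splits["train"]}
test_groups = {pid for pid, _ in splits["test"]}
print(train_groups & test_groups)  # empty set: no patient crosses the boundary
```

Because the assignment depends only on the group identifier, new records for an existing group inherit the same side. The realized test fraction is only approximate on small datasets, and chronological splits need a date rule instead of a hash.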
Treat resampling carefully. Resampling the full dataset before splitting can change the test distribution and let test-sample information influence training. Split first, then resample only inside the training data or training folds.
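A sketch of the safe ordering for oversampling, on made-up imbalanced data. The helper name and the simple duplicate-with-replacement strategy are assumptions. The same ordering applies to any resampling method.

```python
import random

def oversample_minority(rows, rng):
    """Duplicate minority-class rows (with replacement) until classes are balanced."""
    by_label = {}
    for row in rows:
        by_label.setdefault(row["label"], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

rng = random.Random(0)
# Made-up data: 100 rows, 10% positive.
data = [{"id": i, "label": 1 if i % 10 == 0 else 0} for i in range(100)]
rng.shuffle(data)
train, test = data[:80], data[80:]                 # split first

train_balanced = oversample_minority(train, rng)   # resample the training side only

train_ids = {row["id"] for row in train_balanced}
test_ids = {row["id"] for row in test}
print(len(train_balanced), len(test), train_ids & test_ids)
```

The test rows keep their original class balance and never appear in the resampled training set. Oversampling before the split would let copies of the same row land on both sides.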
Use validation for decisions and test for measurement. Hyperparameters, prompts, thresholds, checkpoints, context windows, decoding settings, tool policies, and postprocessors should be selected on validation data. If you tune for many rounds, your validation set can become overfit too. Refresh it or add a final holdout.
Use nested validation when selection is heavy. If model selection is substantial, nested cross-validation or a separate final holdout is safer. Adaptive data-analysis theory shows that repeated holdout use can overfit the holdout, even though careful leaderboard design can reduce the damage.
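The holdout-overfitting effect shows up even when no submission has any skill. The simulation below is a deliberately extreme sketch: every "model" guesses at random, the best one is picked by its score on a reused holdout, and it is then scored on fresh data. Sizes and names are illustrative assumptions.

```python
import random

def simulate_leaderboard(n_submissions=300, n_holdout=200, n_fresh=2000, seed=0):
    """Pick the best of many coin-flip 'models' on a reused holdout, then score it on fresh data."""
    rng = random.Random(seed)
    holdout_labels = [rng.getrandbits(1) for _ in range(n_holdout)]
    fresh_labels = [rng.getrandbits(1) for _ in range(n_fresh)]

    best_holdout_acc, best_fresh_acc = -1.0, 0.0
    for _ in range(n_submissions):
        # Each submission guesses at random, so its true accuracy is 50%.
        holdout_preds = [rng.getrandbits(1) for _ in range(n_holdout)]
        fresh_preds = [rng.getrandbits(1) for _ in range(n_fresh)]
        holdout_acc = sum(p == t for p, t in zip(holdout_preds, holdout_labels)) / n_holdout
        if holdout_acc > best_holdout_acc:
            best_holdout_acc = holdout_acc
            best_fresh_acc = sum(p == t for p, t in zip(fresh_preds, fresh_labels)) / n_fresh
    return best_holdout_acc, best_fresh_acc

holdout_score, fresh_score = simulate_leaderboard()
print(f"best score on reused holdout: {holdout_score:.3f}")
print(f"same submission on fresh data: {fresh_score:.3f}")
```

The winning submission looks better than chance on the holdout purely because it was selected there, and it falls back toward 50% on data that played no part in the selection. Real tuning loops are less extreme, but the direction of the bias is the same.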
Report uncertainty. Single-number benchmark claims are easy to overread. Report confidence intervals, bootstrap intervals, category breakdowns, variance across seeds, sensitivity to prompts and harnesses, and uncertainty in pairwise preference rankings.
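One common way to attach an interval to a benchmark score is a percentile bootstrap over per-item outcomes. The sketch below uses made-up results and treats items as independent, which is an assumption that may not hold when items share passages, repositories, or templates.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

outcomes = [1] * 170 + [0] * 30   # made-up results: 170 of 200 items correct
point = sum(outcomes) / len(outcomes)
low, high = bootstrap_ci(outcomes)
print(f"accuracy {point:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting the interval next to the point estimate makes it obvious when two systems are separated by less than the resampling noise.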
Audit for leakage. Red flags include unusually high performance, large train-test discrepancies, inconsistent cross-validation results, suspicious feature importance, and performance that collapses under time-aware or group-aware splits. Feature engineering should use only information available at prediction time.
Proper Evaluation Design For LLMs
LLMs need all the classical controls plus several more.
Define the evaluated object. Are you evaluating a base model, an instruction-tuned model, a chat model, a model plus system prompt, a model plus RAG, a model plus tools, an agent scaffold, a multi-agent system, a fine-tuned domain model, or a full product deployment? “Model X scored 80” may actually mean “Model X, with this prompt, decoding setup, tool policy, context, answer extractor, and retry loop, scored 80.”
Freeze the protocol before testing. Freeze the model version, system prompt, developer prompt, tool access, retrieval corpus, decoding parameters, number of samples, self-consistency rules, answer extraction, scoring script, refusal handling, time limits, external API versions, hardware, and runtime constraints. Changing any of these after seeing test results is optimization against the test.
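One lightweight way to make the freeze auditable is to hash the full protocol before the first test run and record the hash alongside the results. This is a sketch under assumed names: every field and value below is a hypothetical placeholder, not a real model or script.

```python
import hashlib
import json

def protocol_fingerprint(protocol: dict) -> str:
    """Stable hash of an eval protocol; any change to a frozen field changes the hash."""
    canonical = json.dumps(protocol, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# All field names and values below are hypothetical placeholders.
frozen = {
    "model_version": "model-x-2026-01",
    "system_prompt": "You are a careful assistant.",
    "decoding": {"temperature": 0.0, "max_tokens": 1024},
    "num_samples": 1,
    "retrieval_corpus": None,
    "scoring_script": "score.py@abc123",
}
registered = protocol_fingerprint(frozen)   # record this before touching the test set

tweaked = dict(frozen, num_samples=5)        # e.g. self-consistency added after seeing scores
print(registered == protocol_fingerprint(frozen))    # True
print(registered == protocol_fingerprint(tweaked))   # False
```

A mismatch between the registered hash and the hash of the protocol that produced a reported score is a visible signal that something changed after the freeze.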
Maintain benchmark exclusion lists. Exclude benchmark repositories, dataset names, mirrors, canonical question text, answer keys, explanations, solution writeups, translations, paraphrase clusters, benchmark papers and appendices where needed, leaderboard logs, eval traces, and known contamination sources. The exclusion needs to cover pretraining, continued pretraining, supervised fine-tuning, preference data, synthetic data, retrieval corpora, and internal eval logs.
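An exclusion list only helps if the same check runs at every stage that ingests data. A minimal sketch of such a check follows, with hypothetical list entries and a made-up canary string. Substring matching is the crudest possible layer and is meant to be combined with the similarity checks described next.

```python
def build_exclusion_filter(blocked_terms, canary_strings):
    """Return a predicate that flags documents touching the exclusion list."""
    blocked = [term.lower() for term in blocked_terms]
    canaries = list(canary_strings)

    def is_excluded(document: str) -> bool:
        if any(canary in document for canary in canaries):   # canaries match exactly
            return True
        lowered = document.lower()
        return any(term in lowered for term in blocked)

    return is_excluded

# Hypothetical entries; a real list would be much longer and versioned.
is_excluded = build_exclusion_filter(
    blocked_terms=["github.com/example-org/example-benchmark", "examplebench test split"],
    canary_strings=["CANARY-00000000-0000-0000-0000-000000000000"],
)

corpus = [
    "A tutorial on sorting algorithms.",
    "Mirror of the ExampleBench test split with answers.",
    "eval log ... CANARY-00000000-0000-0000-0000-000000000000 ...",
]
kept = [doc for doc in corpus if not is_excluded(doc)]
print(kept)
```

The same predicate can be applied to pretraining shards, fine-tuning sets, preference data, synthetic data, retrieval indexes, and internal logs, so that one list governs every path into the system.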
Use layered decontamination. Exact matching is necessary, but not sufficient. Use exact hash matching, n-gram overlap, MinHash or SimHash, embedding similarity, translation-aware matching, code clone detection, canary strings, and black-box contamination probes when training data is unavailable. No single detector proves cleanliness. The goal is a documented risk-reduction process.
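The first two layers can be sketched in a few lines: a hash over normalized text, then word n-gram overlap. The benchmark item, the threshold, and the function names are illustrative assumptions. MinHash, embeddings, and translation-aware matching would sit on top of this.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and drop punctuation so trivial edits do not hide a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))

def exact_key(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def ngrams(text: str, n: int = 3) -> set:
    tokens = normalize(text).split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def contamination_check(candidate: str, benchmark_items, threshold: float = 0.5) -> str:
    """Layer 1: normalized exact hash. Layer 2: word 3-gram Jaccard overlap."""
    keys = {exact_key(item) for item in benchmark_items}
    if exact_key(candidate) in keys:
        return "exact"
    cand = ngrams(candidate)
    if any(jaccard(cand, ngrams(item)) >= threshold for item in benchmark_items):
        return "near-duplicate"
    return "no match found"

benchmark = ["What is the capital of France? Answer: Paris."]   # made-up item
exact = contamination_check("what is the capital of France?  answer: PARIS", benchmark)
near = contamination_check("Quiz: what is the capital of France? Answer: Paris, of course.", benchmark)
paraphrase = contamination_check("Which city serves as France's seat of government?", benchmark)
print(exact, near, paraphrase, sep=" | ")
```

The paraphrase slips through both layers, which is the reason the result is labeled "no match found" rather than "clean." Lexical detectors bound the risk from copies and light edits. They say nothing about semantic duplicates.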
Prefer fresh, private, and live tests for serious claims. For frontier claims, use private tests or freshly collected items that postdate the model’s training cutoff. LiveBench, LiveCodeBench, LiveMedBench, and SWE-bench-Live show the direction: time-aware, refreshed, and contamination-resistant evaluation.
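The time-aware idea reduces to a simple filter, sketched below. The `EvalItem` type, its fields, and the buffer are illustrative assumptions. The sketch also assumes the claimed training cutoff is accurate, which is exactly what a buffer is meant to hedge against.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass(frozen=True)
class EvalItem:
    item_id: str
    created: date  # when the item was first written or published

def fresh_items(items, training_cutoff: date, buffer_days: int = 0):
    """Keep only items created strictly after the cutoff plus a safety buffer."""
    threshold = training_cutoff + timedelta(days=buffer_days)
    return [it for it in items if it.created > threshold]

# Hypothetical items and cutoff.
items = [
    EvalItem("old", date(2025, 3, 1)),
    EvalItem("borderline", date(2025, 9, 10)),
    EvalItem("new", date(2026, 2, 1)),
]
cutoff = date(2025, 9, 1)
print([it.item_id for it in fresh_items(items, cutoff, buffer_days=30)])
```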
Separate development evals from audit evals. Development evals are for iteration. Audit evals are for final claims. A mature LLM program keeps smoke tests for regressions, development evals for iteration, canary evals for known failures, private holdouts for release decisions, external audits for credibility, and post-deployment monitoring for drift and incidents.
Do not rely on one benchmark family. A model that performs well on MMLU-style questions may still fail at tool use, long-context retrieval, real software engineering, factuality, multilingual tasks, social reasoning, calibration, adversarial robustness, or domain workflows. HELM’s multi-metric instinct remains right: accuracy alone is not enough.
Evaluate robustness, not just peak score. Test sensitivity to prompt wording, answer-option order, language, formatting, irrelevant context, misleading context, tool errors, distribution shift, adversarial examples, long-tail categories, abstention and refusal calibration, multi-turn state, stale information, and retrieval failures.
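Answer-option order is one of the easier sensitivities to probe. The sketch below reorders the options of a multiple-choice item and checks whether a predictor still picks the right one. The `predict` callable is a hypothetical stand-in for a model call that returns the index of the chosen option.

```python
import random

def shuffled_variants(question, options, answer_index, n_variants=4, seed=0):
    """Return (question, reordered_options, new_answer_index) tuples."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        order = list(range(len(options)))
        rng.shuffle(order)
        new_options = [options[i] for i in order]
        # The correct option moved to wherever its original index landed.
        variants.append((question, new_options, order.index(answer_index)))
    return variants

def order_consistency(predict, question, options, answer_index, **kwargs):
    """Fraction of shuffled variants on which `predict` picks the correct option."""
    variants = shuffled_variants(question, options, answer_index, **kwargs)
    hits = sum(predict(q, opts) == idx for q, opts, idx in variants)
    return hits / len(variants)

question = "What is the capital of France?"
options = ["Berlin", "Paris", "Rome", "Madrid"]

# Two toy predictors: one reads the options, one always picks the first slot.
content_aware = lambda q, opts: opts.index("Paris")
position_biased = lambda q, opts: 0

print(order_consistency(content_aware, question, options, answer_index=1, n_variants=8))
print(order_consistency(position_biased, question, options, answer_index=1, n_variants=8))
```

A predictor that depends on position rather than content scores well only when the correct option happens to land in its preferred slot, which is the pattern this check is meant to expose.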
What Clean Benchmark Practice Looks Like
A strong benchmark report gives readers enough detail to interpret the score.
At minimum, report: the system under test (exact model or system version, release date, base versus fine-tuned status, tool use, RAG, scaffold details); data hygiene (claimed data cutoff, known exceptions, decontamination methods, datasets checked, thresholds, limitations); protocol (prompt, sampling setup, number of attempts, answer extraction, scoring code, split discipline); results (primary and secondary metrics, confidence intervals, category subscores); label quality (human label audits, invalid-item handling, adjudication process); operational and safety behavior (cost, latency, safety failure categories, refusal behavior, jailbreak robustness when relevant); caveats (known limitations, contamination risks, domain mismatch); and reproducibility (versioned code, seeds, dependencies).
A serious benchmark claim should be able to say:
“We froze the model and evaluation protocol before touching the test set. We trained only on permitted training data. We used validation data for development. We checked pretraining, fine-tuning, preference, synthetic, and retrieval data for contamination. We did not inspect hidden test labels. We report uncertainty and limitations. We will not continue optimizing on this test and still call it fresh.”
That is the standard. Anything less needs caveats.
Common Anti-Patterns And Why They Are Wrong
“Everyone trains on benchmarks anyway.” That is an argument for stronger norms, not weaker ones. If a benchmark is widely contaminated, stop using it for clean claims or label it as a development benchmark.
“The model only saw the questions, not the answers.” Still contamination. The model may learn the distribution, recognize the item, infer the answer from surrounding web text, or benefit from repeated exposure. In multiple-choice settings, question familiarity alone can help.
“We only used benchmark-style synthetic data.” Acceptable if the items are genuinely new and not derived from protected tests. Not acceptable if the data came from benchmark questions, answer keys, solution writeups, or close paraphrases.
“We only optimized the prompt.” Prompt optimization is system optimization. If it was done against the test set, the test set became validation data.
“We used hidden-test leaderboard feedback, not labels.” Repeated leaderboard feedback can leak information adaptively. The more submissions and the more design changes based on feedback, the less independent the hidden test becomes.
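A toy simulation shows the effect in its simplest form. Every "submission" below guesses at random, so its true accuracy is 0.5, yet reporting the best leaderboard score over many submissions drifts upward. The item count, submission counts, and seed are arbitrary assumptions. Real adaptive tuning is more targeted than random guessing, so this is a deliberately mild version of the problem.

```python
import random

def best_of_k_on_hidden_test(k, n_items=200, seed=0):
    """Best observed accuracy across k random-guess submissions on one hidden test."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(k):
        # Each submission is a coin flip per item: true accuracy is 0.5.
        correct = sum(rng.random() < 0.5 for _ in range(n_items))
        best = max(best, correct / n_items)
    return best

for k in (1, 10, 100, 500):
    print(k, best_of_k_on_hidden_test(k))
```

No labels were ever revealed, and no submission is actually better than chance. The inflation comes purely from selecting on the hidden test's score.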
“Our RAG system found the answer online, so it is fair.” Only if the eval allows retrieval or browsing and the claim is about retrieval-augmented performance. For closed-book model evaluation, retrieving benchmark solutions invalidates the result.
“The benchmark is public, so it is fair game.” Public does not mean trainable for clean evaluation. Public training splits are fair game. Public test splits are not fair game if you plan to report them as evidence of generalization.
What Is Not a Sin
The rule is strict, but it is not simplistic.
It is fine to train on a benchmark’s official training split. It is fine to use a development split for tuning. It is fine to inspect test errors after a final evaluation, as long as you treat that test as consumed and stop using it for future clean claims. It is fine to train on retired benchmark tests if you no longer report those benchmarks as uncontaminated measurements. It is fine to create new training data inspired by a task format, as long as it is not derived from protected test content.
It is also important to separate contamination from ordinary capability transfer. A model trained on algebra can solve new algebra questions. That is generalization. A model trained on the exact algebra exam, its answer key, or paraphrases of its questions is contaminated.
Benchmarks As Public Goods
Benchmarks are public goods: expensive to create, easy to overuse, and fragile once exposed.
The best ones do more than rank models. They shape research priorities. MMLU pushed broad multitask knowledge. HELM pushed multi-metric evaluation. GPQA and Humanity’s Last Exam pushed harder expert-level questions. SWE-bench pushed realistic repository-level software engineering. Chatbot Arena pushed live human preference comparison. LiveBench and LiveCodeBench pushed dynamic, time-aware evaluation.
Every public benchmark has a lifecycle:
- Launch: Fresh, informative, and hard.
- Adoption: Many models report scores.
- Optimization: Prompts, scaffolds, and training data adapt.
- Saturation: Top scores cluster.
- Contamination: Public discussion and data reuse spread the items.
- Retirement or refresh: The benchmark becomes a dev set, historical reference, or component of a broader suite.
A healthy field accepts that lifecycle instead of pretending old public test sets remain pristine forever.
A Rigorous Benchmark Hygiene Checklist
Before training:
- Inventory every benchmark you plan to use for evaluation.
- Add benchmark repositories, mirrors, papers, answer keys, and known solution sources to exclusion lists.
- Exclude eval logs from future training.
- Deduplicate exact, near-duplicate, and semantic overlaps.
- Track data provenance and timestamps.
- Separate pretraining, fine-tuning, preference, synthetic, and retrieval corpora.
- Document what could not be checked.
Before evaluation:
- Freeze the model or system version.
- Freeze prompts, tools, retrieval indexes, decoding, sample counts, and scoring.
- Confirm there is no benchmark-specific routing or special casing.
- Use validation data for final adjustments, not test data.
- Run contamination scans against the final system inputs.
- Define metrics and confidence intervals in advance.
- Decide how invalid or ambiguous items will be handled before seeing results.
During evaluation:
- Log inputs, outputs, tool calls, retrieval hits, errors, and retries.
- Prevent internet or retrieval access unless it is explicitly part of the eval.
- Use deterministic settings where appropriate, or report variance.
- Avoid manual correction except under a predefined adjudication protocol.
- Keep hidden labels hidden.
After evaluation:
- Report the full protocol and limitations.
- Publish aggregate and subgroup metrics.
- Report uncertainty, not just point estimates.
- Investigate failures, but mark the test set as consumed for future development.
- Refresh, rotate, or replace tests that have influenced development.
- Do not keep tuning on the test and call later scores clean.
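For the "report uncertainty" item, one common option is a percentile bootstrap over per-item outcomes. The sketch below assumes each test item yields a 0/1 correctness value and that items are independent. The resample count, alpha, and sample outcomes are illustrative choices.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy over per-item 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample items with replacement and record the accuracy of each resample.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lo, hi

# Hypothetical results: 80 correct out of 100 items.
outcomes = [1] * 80 + [0] * 20
accuracy, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={accuracy:.2f}, 95% CI=[{lower:.2f}, {upper:.2f}]")
```

With only 100 items the interval is wide, which is the point: two models a couple of points apart on a small test may not be distinguishable.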
The Standard For Serious LLM Evaluation In 2026
A strong LLM evaluation program should combine ten habits:
- Classical split discipline: Separate training, validation, and test data before preprocessing or optimization.
- LLM-specific contamination control: Use exact, fuzzy, semantic, multilingual, code-aware, and synthetic-data-aware filtering.
- Freshness: Use private, live, or post-training-cutoff tests for frontier claims.
- Breadth: Cover multiple capabilities, domains, languages, and user scenarios.
- Depth: Include hard expert tasks, long-horizon tasks, tool use, and realistic workflows.
- Robustness: Test prompt sensitivity, distribution shift, adversarial variants, and repeated-run variance.
- Human grounding: Use expert labels, adjudication, and human preference studies where open-ended tasks need them.
- Safety and reliability: Evaluate hallucination, refusal calibration, jailbreak resistance, bias, toxicity, privacy, security, and misuse.
- System transparency: Disclose model versions, prompts, tools, retrieval corpora, scaffolds, and scoring.
- Benchmark lifecycle management: Retire, refresh, and stop relying indefinitely on saturated public tests.
The strongest evaluation cultures treat benchmarks like scientific instruments: calibrated, protected, versioned, audited, and retired when worn out.
Final Principle
Training on benchmarks, evals, or test sets is a cardinal sin because it replaces the question we actually care about:
“Can this system handle new cases?”
with a much weaker question:
“Has this system absorbed the measuring instrument?”
That swap poisons model comparison, misleads users, misdirects research, weakens safety claims, and erodes trust.
The answer is not to abandon benchmarks. The answer is to respect them. Train on training data. Tune on validation data. Evaluate on untouched tests. Use fresh, private, live, and well-audited benchmarks for major claims. Disclose contamination risk honestly. Retire benchmarks when they become development targets.
In machine learning, the test set is not just another dataset. It is the ruler. Once you bend the ruler to fit the model, the measurement is gone.
Until next time.
-Ahmad