Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
Summary
This paper introduces Know2Guess, a contamination-aware multi-zone benchmark designed to evaluate the transition from answerable knowledge to expected abstention in large language models, addressing data contamination, prompt sensitivity, and refusal behavior. The authors assess FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models, finding that stronger models show selective but incomplete abstention. The benchmark and dataset are publicly released.
# Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models
Source: [https://arxiv.org/html/2606.26101](https://arxiv.org/html/2606.26101)
Bowen Zhang∗, Jian Wang, Xican Wang, Haoyi Wu, Xuanyan Qiu, Shengan Yang
Anhui University, Hefei, China
R32314095@stu.ahu.edu.cn, wd2324041@stu.ahu.edu.cn
∗Equal contribution
###### Abstract
Reliable evaluation of large language models should separate supported answering from unsupported guessing without conflating either with data contamination, prompt idiosyncrasy, or generic refusal behavior. We present a contamination-aware, multi-zone benchmark for measuring the transition from answerable knowledge to abstention-expected unknowns under frozen build-time labels. The benchmark contains 1,200 items across five domains, explicit abstention expectations, contamination-risk metadata, and dual parsing with an official strict parser plus a normalized robustness parser. We evaluate FLAN-T5, Qwen2.5-Instruct, and Llama-3-Instruct models under locked answer-or-abstain prompts, answer-only controls, and prompt-template variants. The benchmark is not solved by generic non-answer behavior: FLAN baselines remain weak on productive abstention, while stronger instruction-tuned models expose a selective but incomplete transition from answering to abstaining. Qwen2.5-3B-Instruct achieves the best overall reliability, but answer-expected zones remain difficult, calibration remains poor, and benign-item refusal persists. Prompt and parser robustness analyses preserve the main ranking and qualitative conclusions. The benchmark therefore provides a reproducible protocol for auditing answerability, abstention, refusal, and contamination as distinct but interacting dimensions of LLM reliability. The dataset is publicly available at [https://github.com/renweimeng/Know2Guess-A-Contamination-Aware-Multi-Zone-Benchmark](https://github.com/renweimeng/Know2Guess-A-Contamination-Aware-Multi-Zone-Benchmark).
## 1 Introduction
Large language models can produce fluent answers far beyond the boundary of supported knowledge\. This behavior is useful when the model is correct, but costly when unsupported guesses are expressed with unwarranted confidence\. As a result, evaluation can no longer be reduced to raw answer accuracy alone\. Truthfulness, hallucination, calibration, and selective abstention all matter for deployment\-facing reliability, yet they are often measured in separate protocols with incompatible assumptions\[[13](https://arxiv.org/html/2606.26101#bib.bib1),[9](https://arxiv.org/html/2606.26101#bib.bib2),[12](https://arxiv.org/html/2606.26101#bib.bib3),[8](https://arxiv.org/html/2606.26101#bib.bib4),[6](https://arxiv.org/html/2606.26101#bib.bib6)\]\. A benchmark that scores only answers risks rewarding confident fabrication; a benchmark that rewards non\-answer behavior without qualification risks conflating epistemic caution with generic refusal\.
Recent evaluation work has made substantial progress, but important gaps remain\. Truthfulness and hallucination benchmarks reveal how often models generate false or unverifiable content\[[13](https://arxiv.org/html/2606.26101#bib.bib1),[9](https://arxiv.org/html/2606.26101#bib.bib2)\]\. Broad suites such as HELM show that model quality is multi\-dimensional and that accuracy alone hides important trade\-offs\[[12](https://arxiv.org/html/2606.26101#bib.bib3)\]\. Dynamic or time\-sensitive evaluations reduce leakage and measure freshness more directly\[[10](https://arxiv.org/html/2606.26101#bib.bib19),[19](https://arxiv.org/html/2606.26101#bib.bib20)\]\. Prompt\-sensitive knowledge\-boundary work further shows that a model’s apparent knowledge can shift under paraphrase or query optimization\[[23](https://arxiv.org/html/2606.26101#bib.bib21)\]\. However, these lines do not jointly provide a frozen protocol that separates answerable items from abstention\-expected unknowns, tracks contamination as metadata, distinguishes benchmark abstention from policy refusal, and reports robustness to both prompting and parsing\.
A second difficulty is conceptual\. Abstention is not simply the absence of an answer\. In selective prediction, the system should decline only when the risk of error is sufficiently high\[[4](https://arxiv.org/html/2606.26101#bib.bib5),[18](https://arxiv.org/html/2606.26101#bib.bib17)\]\. In language models, this decision is further entangled with self\-knowledge, confidence elicitation, alignment, and safety style\[[8](https://arxiv.org/html/2606.26101#bib.bib4),[6](https://arxiv.org/html/2606.26101#bib.bib6),[21](https://arxiv.org/html/2606.26101#bib.bib18)\]\. A model may know that it does not know, but it may also refuse for policy reasons unrelated to epistemic uncertainty\. Conversely, a model may produce a parser\-compliant answer that is semantically unsupported\. These cases should not be collapsed into a single “non\-answer” category if the goal is to understand the boundary between knowledge and hallucination\.
This paper addresses that problem with a contamination\-aware, multi\-zone benchmark and a locked evaluation protocol\. The benchmark contains 1,200 items across five domains and assigns each item a frozen build\-time zone\. Zones A–C are answer\-expected but differ in answer popularity and boundary difficulty; Zone D is abstention\-expected and constructed as a style\-matched synthetic unknown\. Each item is released with provenance, contamination\-risk metadata, and review notes\. Evaluation uses a structured answer\-or\-abstain prompt, answer\-only controls, an official strict parser, and a normalized parser for robustness\. The primary metric, reliability, rewards correct answering on answer\-expected items and correct abstention on abstention\-expected items, while separate reporting tracks productive abstention, refusal, calibration, and boundary sharpness\.
Our empirical claims are deliberately limited\. The benchmark is executable, auditable, and informative under a predeclared protocol\. Stronger instruction\-tuned models do exhibit a selective transition from answering to abstaining, but the behavior is neither trivial nor solved\. FLAN baselines remain weak on productive abstention\. Under answer\-or\-abstain, Qwen2\.5\-3B\-Instruct achieves the best reliability at 0\.3657 with Zone\-D productive abstention 0\.9249, while Llama\-3\-8B\-Instruct is more accurate on answerable public items but less selective on unknowns\. Under answer\-only controls, both Qwen checkpoints collapse sharply and Llama degrades materially\. Parser normalization, prompt variants, and cost\-sensitive scoring preserve the main ranking and qualitative conclusions\. At the same time, answer\-expected Zones A–C remain difficult, ECE stays high, and benign\-item refusal remains a real failure mode\. The benchmark should therefore be read not as evidence that abstention has been solved, but as a stricter protocol for measuring where current models still fail\.
## 2 Related Work
##### Truthfulness and hallucination benchmarks\.
A large body of work evaluates whether language models produce false, misleading, or unverifiable content\. TruthfulQA measures whether models reproduce popular misconceptions rather than truthful answers\[[13](https://arxiv.org/html/2606.26101#bib.bib1)\]\. HaluEval focuses on hallucination recognition across instruction\-following settings and human\-annotated outputs\[[9](https://arxiv.org/html/2606.26101#bib.bib2)\]\. HELM broadens the lens by standardizing multi\-metric evaluation across diverse scenarios\[[12](https://arxiv.org/html/2606.26101#bib.bib3)\]\. FreshQA and related dynamic factuality benchmarks emphasize fast\-changing knowledge and false\-premise questions, showing that factuality degrades when world knowledge changes after pretraining\[[19](https://arxiv.org/html/2606.26101#bib.bib20),[10](https://arxiv.org/html/2606.26101#bib.bib19)\]\. Together, these benchmarks establish the need for evaluation beyond plain accuracy, but they usually do not isolate abstention\-expected unknowns under a fixed decision protocol\.
##### Selective prediction, self\-knowledge, and abstention\.
Selective prediction studies when a model should answer and when it should abstain\[[4](https://arxiv.org/html/2606.26101#bib.bib5),[18](https://arxiv.org/html/2606.26101#bib.bib17)\]\. In language models, related work asks whether models know what they know and whether elicited confidence is calibrated enough to support abstention decisions\[[8](https://arxiv.org/html/2606.26101#bib.bib4),[6](https://arxiv.org/html/2606.26101#bib.bib6)\]\. Recent synthesis work argues that abstention should be analyzed as a distinct capability shaped by the query, the model, and human values rather than as a side effect of safety tuning alone\[[21](https://arxiv.org/html/2606.26101#bib.bib18)\]\. This literature motivates treating abstention as a first\-class output, but most methods either assume task\-specific calibration or evaluate abstention without explicitly separating policy refusal from epistemic uncertainty\.
##### Dynamic, fresh, and prompt\-sensitive knowledge evaluation\.
To reduce contamination and staleness, recent benchmarks construct questions from recent documents or evolving sources\[[10](https://arxiv.org/html/2606.26101#bib.bib19),[19](https://arxiv.org/html/2606.26101#bib.bib20)\]\. A related line argues that knowledge estimates based on a single prompt are incomplete because language models can be sensitive to paraphrase and query form\[[23](https://arxiv.org/html/2606.26101#bib.bib21),[8](https://arxiv.org/html/2606.26101#bib.bib4),[12](https://arxiv.org/html/2606.26101#bib.bib3)\]\. These studies are important because they show that apparent model knowledge depends not only on stored facts but also on how the query is posed\. However, dynamic freshness evaluation and prompt\-sensitive boundary evaluation target somewhat different problems: the former focuses on temporal novelty, while the latter emphasizes prompt dependence\. Neither by itself yields a contamination\-aware abstention protocol with frozen answerability labels and explicit refusal accounting\.
##### Contamination analysis and dataset documentation\.
Data contamination has become a central concern in LLM evaluation because benchmark leakage can inflate reported performance and blur the distinction between memorization and generalization\[[11](https://arxiv.org/html/2606.26101#bib.bib22),[15](https://arxiv.org/html/2606.26101#bib.bib23),[10](https://arxiv.org/html/2606.26101#bib.bib19)\]\. Recent work has proposed open contamination reports, taxonomies of contamination modes, and benchmark construction strategies that reduce temporal leakage\[[11](https://arxiv.org/html/2606.26101#bib.bib22),[15](https://arxiv.org/html/2606.26101#bib.bib23),[10](https://arxiv.org/html/2606.26101#bib.bib19)\]\. In parallel, dataset documentation work argues that released benchmarks should expose provenance, scope, and known risks rather than treating them as informal background knowledge\[[3](https://arxiv.org/html/2606.26101#bib.bib7)\]\. This perspective motivates explicit contamination\-risk fields, frozen build\-time rules, and auditable metadata in modern evaluation datasets\.
## 3 Problem Formulation and Task Definition
We study *knowledge-boundary evaluation*: whether a language model answers answerable items, abstains on intentionally unknown items, and avoids conflating epistemic uncertainty with policy refusal. Existing benchmarks do not jointly control contamination risk, abstention expectation, and refusal under one fixed protocol \[[13](https://arxiv.org/html/2606.26101#bib.bib1),[9](https://arxiv.org/html/2606.26101#bib.bib2),[12](https://arxiv.org/html/2606.26101#bib.bib3),[8](https://arxiv.org/html/2606.26101#bib.bib4),[2](https://arxiv.org/html/2606.26101#bib.bib8)\]. Our goal is to measure the transition from supported answering to unsupported guessing under frozen build-time labels.
Each item is
$$x_i = (q_i, G_i, z_i, e_i, o_i, r_i),$$
where $q_i$ is the question, $G_i$ the gold-answer set and grading rule, $z_i \in \{A, B, C, D\}$ the zone label, $e_i \in \{0, 1\}$ whether abstention is expected, $o_i \in \{\texttt{real\_public}, \texttt{synthetic\_unknown}\}$ the item origin, and $r_i \in \{\texttt{low}, \texttt{medium}, \texttt{high}\}$ the contamination-risk tag. Zones $A$–$C$ are answer-expected; Zone $D$ is abstention-expected. Zone assignment is fixed before evaluation.
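As an illustration only, the item tuple can be mirrored as a small typed record. The class and field names below are hypothetical and are not taken from the released dataset schema; the grading-rule part of the gold-answer set is omitted, and the only constraint enforced is the stated one that Zone D alone is abstention-expected.

```python
from dataclasses import dataclass
from typing import FrozenSet, Literal

Zone = Literal["A", "B", "C", "D"]
Origin = Literal["real_public", "synthetic_unknown"]
Risk = Literal["low", "medium", "high"]


@dataclass(frozen=True)
class Item:
    """Hypothetical container for one benchmark item (names are assumptions)."""
    question: str                 # q_i
    gold_answers: FrozenSet[str]  # G_i, without the grading rule
    zone: Zone                    # z_i
    abstain_expected: bool        # e_i
    origin: Origin                # o_i
    risk: Risk                    # r_i

    def __post_init__(self) -> None:
        # Zones A-C are answer-expected; Zone D is abstention-expected.
        if self.abstain_expected != (self.zone == "D"):
            raise ValueError("abstention expectation must match the zone")


unknown = Item(
    question="(style-matched synthetic question)",
    gold_answers=frozenset(),
    zone="D",
    abstain_expected=True,
    origin="synthetic_unknown",
    risk="low",
)
```

Freezing the dataclass mirrors the requirement that labels are fixed before evaluation: an attempt to reassign `zone` after construction raises an error.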
The parser maps each completion to
$$\hat{y}_i \in \{\texttt{ANSWER}, \texttt{ABSTAIN}, \texttt{REFUSE}\}.$$
If $\hat{y}_i = \texttt{ANSWER}$, it also extracts the final answer $\hat{a}_i$ and confidence $\hat{c}_i \in [0, 1]$. `REFUSE` denotes policy-style refusal and is not counted as productive abstention.
Our primary metric is *reliability*:
$$\mathrm{Rel} = \frac{1}{N}\sum_{i=1}^{N}\Big[\mathbb{1}(e_i = 0 \wedge \hat{y}_i = \texttt{ANSWER} \wedge \hat{a}_i \in G_i) + \mathbb{1}(e_i = 1 \wedge \hat{y}_i = \texttt{ABSTAIN})\Big].$$
It gives equal credit to correct answering on answer-expected items and correct abstention on abstention-expected items, because the benchmark evaluates *decision correctness at the knowledge boundary*. To test whether conclusions depend on this choice, Section [7](https://arxiv.org/html/2606.26101#S7) reports cost-sensitive variants.
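A minimal sketch of the reliability computation, assuming each parsed completion has already been reduced to a triple of abstention expectation, parser decision, and answer correctness. The function name and record layout are assumptions made for illustration.

```python
def reliability(records):
    """Fraction of items with the correct observable decision.

    records: list of (abstain_expected, decision, answer_correct) triples,
    where decision is "ANSWER", "ABSTAIN", or "REFUSE".
    """
    if not records:
        raise ValueError("need at least one record")
    credit = 0
    for abstain_expected, decision, answer_correct in records:
        if abstain_expected:
            # Only ABSTAIN earns credit; REFUSE does not.
            credit += int(decision == "ABSTAIN")
        else:
            # Credit requires an answer that matches the gold set.
            credit += int(decision == "ANSWER" and answer_correct)
    return credit / len(records)


toy = [
    (False, "ANSWER", True),    # correct answer: credit
    (False, "ABSTAIN", False),  # unnecessary abstention: no credit
    (True, "ABSTAIN", False),   # productive abstention: credit
    (True, "REFUSE", False),    # policy refusal: no credit
]
print(reliability(toy))  # 0.5
```

In this toy input, one correct answer and one correct abstention out of four items give a reliability of 0.5; the refusal on the abstention-expected item earns nothing.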
We additionally report answered accuracy $\mathrm{Acc}_{\mathrm{ans}}$, abstention rate, productive abstention on Zone $D$, refusal rate, expected calibration error (ECE) on answered items \[[6](https://arxiv.org/html/2606.26101#bib.bib6)\], and *boundary sharpness*
$$\mathrm{BS} = \mathrm{Rel}_D - \mathrm{Rel}_C,$$
which measures separation between difficult answerable items and abstention-expected unknowns.
## 4 Methodology
### 4.1 Source Pools and Normalization
The benchmark contains 1,200 items across five domains: commonsense, general trivia, medicine, multi\-hop reasoning, and science\. The real\-public pool is derived from CommonsenseQA, TriviaQA, MedMCQA, HotpotQA, and SciQ\[[17](https://arxiv.org/html/2606.26101#bib.bib9),[7](https://arxiv.org/html/2606.26101#bib.bib10),[14](https://arxiv.org/html/2606.26101#bib.bib11),[22](https://arxiv.org/html/2606.26101#bib.bib12),[20](https://arxiv.org/html/2606.26101#bib.bib13)\]\. All candidates are normalized into a shared schema with question text, canonical answer, aliases, grading rule, provenance, domain, contamination metadata, and review notes\. Items with unresolved ambiguity, unstable grading, or corrupted provenance are removed\.
Starting from 2,846 normalized candidates, we apply exact-match deduplication, near-duplicate filtering, and manual consolidation. The final build contains 1,200 items, split with seed 1337 into 120 development and 1,080 test items. Build-time zone counts are $A = 420$, $B = 300$, $C = 240$, and $D = 240$; the test split contains $A = 377$, $B = 272$, $C = 218$, and $D = 213$.
### 4.2 Build-Time Zone Assignment
The zone system targets a graded transition rather than a binary known/unknown split\.
##### Zone A\.
Zone $A$ contains answer-expected real-public items with relatively high answer popularity. We define
$$s_{\mathrm{pop}} = \frac{1}{3}\left(s_{\mathrm{src}} + s_{\mathrm{alias}} + s_{\mathrm{pilot}}\right),$$
where $s_{\mathrm{src}}$ is the answer-frequency percentile in the source pool, $s_{\mathrm{alias}}$ is the alias-coverage percentile, and $s_{\mathrm{pilot}}$ is pilot familiarity. The top within-domain bands are assigned to Zone $A$.
##### Zone B\.
Zone $B$ is also answer-expected, but uses lower within-domain popularity bands. Thus Zones $A$ and $B$ differ by build-time public salience, not by post hoc model difficulty.
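The popularity score and the within-domain banding can be sketched as follows. The band edges are not stated in the text, so `top_fraction` is a hypothetical cutoff, and the function names and the assumption that all three components lie in [0, 1] are illustrative only.

```python
from collections import defaultdict


def popularity_score(s_src, s_alias, s_pilot):
    """Unweighted mean of the three components, each assumed to lie in [0, 1]."""
    return (s_src + s_alias + s_pilot) / 3.0


def assign_a_or_b(items, top_fraction=0.5):
    """Rank items inside each domain and split them into Zones A and B.

    items: list of (item_id, domain, s_pop).
    top_fraction: hypothetical share of each domain assigned to Zone A.
    """
    by_domain = defaultdict(list)
    for item_id, domain, s_pop in items:
        by_domain[domain].append((s_pop, item_id))
    zones = {}
    for scored in by_domain.values():
        scored.sort(reverse=True)  # highest popularity first
        n_top = int(len(scored) * top_fraction)
        for rank, (_, item_id) in enumerate(scored):
            zones[item_id] = "A" if rank < n_top else "B"
    return zones


items = [
    ("q1", "science", popularity_score(0.9, 0.8, 0.7)),
    ("q2", "science", popularity_score(0.2, 0.3, 0.1)),
    ("q3", "medicine", popularity_score(0.6, 0.6, 0.6)),
    ("q4", "medicine", popularity_score(0.5, 0.4, 0.3)),
]
print(assign_a_or_b(items))
```

Ranking within each domain, rather than globally, keeps a domain with uniformly obscure answers from being pushed entirely into Zone B.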
##### Zone C\.
Zone $C$ contains transformed real-public items that preserve the gold answer while increasing boundary difficulty. Allowed transformations are controlled paraphrase, lexical compression, support re-ordering, distractor suppression, benign indirection, and answer-preserving alias substitution. Each transformed item is reviewed by two annotators for answer preservation and validity; disagreements are adjudicated by a third reviewer. On 180 audited candidates, Cohen’s $\kappa$ is 0.86 for answer preservation and 0.83 for validity.
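For reference, Cohen's κ for two annotators can be computed as below. This is the standard textbook formula rather than the audit script used for the benchmark, and the toy labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_observed - p_expected) / (1.0 - p_expected)


annotator_1 = ["keep", "keep", "drop", "drop"]
annotator_2 = ["keep", "keep", "drop", "keep"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Here the annotators agree on three of four items (0.75), chance agreement is 0.5, and κ is therefore 0.5.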
##### Zone D\.
Zone $D$ is abstention-expected by construction. Items are generated with syntax and domain style matched to the real-public pool, but with unsupported entities, events, or compositions. To reduce template artifacts, Zone $D$ reuses question families and discourse forms from Zones $A$–$C$. Of Zone-$D$ items, 40% come from minimal edits of real-public stems, 35% from domain-consistent recombinations under impossible provenance, and 25% from hand-authored but style-matched items. Each item must pass plausibility review and unknownness verification under a fixed search-and-reference protocol.
Table 1: Representative zone definitions and audit targets.
### 4.3 Contamination Annotation and Human Review
Contamination is treated as metadata rather than a post hoc explanation for performance\[[2](https://arxiv.org/html/2606.26101#bib.bib8)\]\. Each item receives a low/medium/high risk tag from a rubric combining lexical overlap with known benchmark phrasing, answer salience in common evaluation corpora, and stereotypy of canonical benchmark questions; labels are then reviewed by annotators\. The final build contains 457 low\-risk, 155 medium\-risk, and 588 high\-risk items\. Synthetic unknowns are labeled low\-risk by construction\. We release contamination fields together with provenance and review notes\[[3](https://arxiv.org/html/2606.26101#bib.bib7)\]\.
All items are reviewed by at least two annotators. Real-public items are checked for gold correctness, alias sufficiency, and zone consistency; Zone-$D$ items are additionally checked for plausibility and unsupportedness. On a 240-item adjudication sample, Fleiss’ $\kappa$ is 0.82 for zone validity, 0.88 for abstention-expected vs. answer-expected assignment, and 0.85 for contamination-risk agreement before adjudication.
### 4.4 Evaluation Protocol
We use two locked prompts. The main *answer-or-abstain* prompt requests `DECISION`, `CONFIDENCE`, `FINAL ANSWER`, and `REASON CODE`, and explicitly permits `ABSTAIN_DONT_KNOW`. The control *answer-only* prompt requests only the best short factual answer and instructs the model to guess if unsure.
The official parser is strict: it accepts only format-compliant decision blocks and never converts policy refusal into benchmark abstention. We also report a *normalized parser* that repairs superficial formatting defects while preserving decision semantics. The strict parser remains official; the normalized parser is used only for robustness analysis.
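The contrast between the two parsers can be sketched as follows. The exact output layout and parser rules are not given in the main text, so everything here is an assumption: the four-line `KEY: value` layout, the `REFUSE` decision token, and the choice to return `None` for non-compliant output are all hypothetical.

```python
import re

# Hypothetical layout: four "KEY: value" lines in a fixed order.
STRICT_BLOCK = re.compile(
    r"DECISION: (ANSWER|ABSTAIN_DONT_KNOW|REFUSE)\n"
    r"CONFIDENCE: ([01](?:\.\d+)?)\n"
    r"FINAL ANSWER: (.*)\n"
    r"REASON CODE: (\w+)"
)
LABELS = {"ANSWER": "ANSWER", "ABSTAIN_DONT_KNOW": "ABSTAIN", "REFUSE": "REFUSE"}


def parse_strict(completion):
    """Accept only an exactly formatted block; otherwise return None."""
    match = STRICT_BLOCK.fullmatch(completion)
    if match is None:
        return None  # assumption: non-compliant output stays unparsed
    decision, confidence, answer, _reason = match.groups()
    confidence = float(confidence)
    if confidence > 1.0:
        return None
    return LABELS[decision], answer.strip(), confidence


def parse_normalized(completion):
    """Repair whitespace and key or decision casing, then parse strictly."""
    lines = []
    for raw in completion.strip().splitlines():
        if not raw.strip():
            continue  # drop blank lines
        key, sep, value = raw.partition(":")
        if sep:
            key = " ".join(key.upper().split())
            value = value.strip()
            if key == "DECISION":
                value = value.upper()
            raw = f"{key}: {value}"
        lines.append(raw)
    return parse_strict("\n".join(lines))


messy = " decision : abstain_dont_know \n\nconfidence: 0.2\nfinal answer:\nreason code: UNKNOWN_ENTITY\n"
print(parse_strict(messy))      # None
print(parse_normalized(messy))  # ('ABSTAIN', '', 0.2)
print(parse_normalized("I'm sorry, I can't help with that."))  # None
```

The normalization step only touches surface form and then defers to the strict parser, so a free-text refusal is never promoted to `ABSTAIN` by either path.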
Figure 1: Benchmark construction and evaluation pipeline.
## 5 Experimental Setup
### 5.1 Models and Prompts
We evaluate FLAN\-T5\-Base, FLAN\-T5\-Large, FLAN\-T5\-XL\[[1](https://arxiv.org/html/2606.26101#bib.bib14)\], Qwen2\.5\-1\.5B\-Instruct, Qwen2\.5\-3B\-Instruct\[[16](https://arxiv.org/html/2606.26101#bib.bib15)\], and Llama\-3\-8B\-Instruct\[[5](https://arxiv.org/html/2606.26101#bib.bib16)\]\. This roster adds family diversity and a modest scale gradient while keeping evaluation manageable\.
All models are tested under the main *answer-or-abstain* prompt. To assess template sensitivity, we also use three close paraphrase variants: `standard`, `compact`, and `delimiter-heavy`. The *answer-only* control is run for Qwen2.5-1.5B, Qwen2.5-3B, and Llama-3-8B.
### 5.2 Inference and Uncertainty
All runs use greedy decoding with temperature 0, top-$p = 1.0$, and no sampling. Max new tokens are 128 for FLAN and Qwen2.5-1.5B, and 160 for Qwen2.5-3B and Llama-3-8B. No retrieval, tool use, self-consistency, or external verifier is allowed. The development split is used only to finalize prompts, parser rules, and reporting. All headline results are reported on the 1,080-item test split after freezing the protocol.
Because decoding is deterministic, uncertainty is estimated with nonparametric bootstrap over items\. Unless stated otherwise, confidence intervals are 95% percentile bootstrap intervals from 10,000 resamples\. Prompt\-condition comparisons use paired bootstrap over per\-item reliability indicators\.
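A percentile bootstrap over per-item reliability indicators can be sketched as below. The function name, the seed, and the index convention for the percentiles are assumptions. The example input of 395 credited items out of 1,080 is inferred from the reported reliability of 0.3657 and is not a figure stated in the paper.

```python
import random


def bootstrap_ci(indicators, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 indicators."""
    rng = random.Random(seed)
    n = len(indicators)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping the sample size fixed.
        resample = rng.choices(indicators, k=n)
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Inferred, not reported: 0.3657 * 1080 is approximately 395 credited items.
indicators = [1] * 395 + [0] * 685
low, high = bootstrap_ci(indicators)
print(f"point estimate {sum(indicators) / len(indicators):.4f}, "
      f"95% interval [{low:.4f}, {high:.4f}]")
```

Because resampling is random, the endpoints depend on the seed and will not match a published interval digit for digit; they can only be compared approximately.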
### 5.3 Metrics and Audit
We report answered accuracy $\mathrm{Acc}_{\mathrm{ans}}$, abstention rate, productive abstention on Zone $D$, refusal rate, hallucination among answered items, reliability, boundary sharpness, and ECE with 10 bins \[[6](https://arxiv.org/html/2606.26101#bib.bib6)\]. Calibration is computed only on answered items.
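A minimal ECE sketch over answered items follows. The text specifies 10 bins but not the binning scheme, so equal-width bins on [0, 1] are an assumption here, as are the function name and input layout.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (binning scheme is an assumption).

    confidences: stated confidence in [0, 1] for each answered item.
    correct: 1 if the extracted answer matched the gold set, else 0.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's gap by its share of answered items.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


# Four answers, all stated at 0.9 confidence, only one correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0]))
```

In the toy case, accuracy is 0.25 against a mean confidence of 0.9, so the gap is about 0.65; abstentions and refusals never enter the computation because they carry no answered confidence.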
The main paper includes three robustness analyses: strict vs\. normalized parser, prompt\-template sensitivity, and cost\-sensitive reliability reweighting\. We also report contamination and item\-origin slices for all main models\.
For qualitative analysis, we manually inspect 120 sampled errors from the strongest answer\-or\-abstain runs\. Each is assigned one primary label: unnecessary abstention, policy refusal on benign items, confident hallucination, partial knowledge with wrong finalization, or parser\-compliant but semantically wrong output\.
## 6 Main Results
### 6.1 Overall Comparison
Table [2](https://arxiv.org/html/2606.26101#S6.T2) shows the full-test results. Two conclusions are central. First, the benchmark is not solved by generic abstention: answer-expected zones remain hard and calibration remains weak. Second, the protocol distinguishes meaningful model and prompt differences.
The FLAN baselines remain weak on productive abstention. FLAN-T5-Base and FLAN-T5-Large achieve zero productive abstention under the official parser, and FLAN-T5-XL reaches only 0.0141 on Zone $D$. Their reliability ranges from 0.1009 to 0.1463, indicating that the benchmark does not reward generic non-answer behavior.
Qwen2.5-3B-Instruct is the strongest answer-or-abstain model, reaching reliability 0.3657 (95% CI: 0.3378–0.3939) with Zone-$D$ productive abstention 0.9249. Llama-3-8B-Instruct is competitive on answered accuracy and stronger than Qwen2.5-1.5B on answer-expected items, but is less selective on unknowns and more refusal-prone, yielding lower reliability 0.3407. Qwen2.5-1.5B-Instruct remains useful as a smaller-scale replication of the same pattern, with reliability 0.2787 and Zone-$D$ productive abstention 0.9906.
Table 2: Main benchmark results on the full test split ($N = 1080$). $\mathrm{Acc}_{\mathrm{ans}}$ and hallucination are computed over answered items; reliability is computed over all items.
The answer-only controls show that the benchmark measures decision behavior, not just factual recall. Removing explicit abstention permission collapses reliability from 0.2787 to 0.0037 for Qwen2.5-1.5B and from 0.3657 to 0.0250 for Qwen2.5-3B. Llama-3-8B also degrades, dropping from 0.3407 to 0.2148 and losing all productive abstention.
### 6.2 Zone-Wise Transition
Zone-wise results explain why Zone $D$ alone is insufficient. As shown in Table [3](https://arxiv.org/html/2606.26101#S6.T3), both Qwen2.5-3B and Llama-3-8B perform well on the abstention-expected zone, but neither solves the answer-expected boundary. Qwen2.5-3B reaches 0.9249 reliability on Zone $D$, yet only 0.3475, 0.1360, and 0.1376 on Zones $A$, $B$, and $C$. Llama-3-8B is slightly stronger on $A$ and $B$, but separates less sharply on $D$. The benchmark therefore reveals the trade-off between useful abstention and unnecessary conservatism.
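As a consistency check, the per-zone reliabilities quoted above for Qwen2.5-3B can be recombined with the test-split zone counts from Section 4.1. The helper name is hypothetical, the boundary-sharpness value is derived here rather than quoted from the paper, and the real-public slice is assumed to be Zones A–C, which matches the reported slice size of 867.

```python
TEST_COUNTS = {"A": 377, "B": 272, "C": 218, "D": 213}
QWEN_3B_REL = {"A": 0.3475, "B": 0.1360, "C": 0.1376, "D": 0.9249}


def aggregate_reliability(per_zone, counts, zones):
    """Count-weighted mean of per-zone reliability over the chosen zones."""
    n_items = sum(counts[z] for z in zones)
    return sum(counts[z] * per_zone[z] for z in zones) / n_items


overall = aggregate_reliability(QWEN_3B_REL, TEST_COUNTS, "ABCD")
real_public = aggregate_reliability(QWEN_3B_REL, TEST_COUNTS, "ABC")
boundary_sharpness = QWEN_3B_REL["D"] - QWEN_3B_REL["C"]

print(round(overall, 4))             # 0.3657
print(round(real_public, 4))         # 0.2284
print(round(boundary_sharpness, 4))  # 0.7873
```

The weighted mean reproduces the headline reliability of 0.3657, and restricting to Zones A–C gives 0.2284, which shows how a high Zone-D score can coexist with low overall reliability.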
Table 3: Zone-by-zone transition profile under answer-or-abstain for the strongest models. Reliability is reported per zone.
Figure 2: Reliability by zone for the main answer-or-abstain runs, with answer-only controls shown as dashed lines. Useful behavior is characterized not by uniformly high abstention, but by a selective increase from difficult answer-expected Zones A–C to the abstention-expected Zone D.
## 7 Analysis
### 7.1 Robustness to Parser and Prompt Form
A key concern is whether results are parser artifacts. Table [4](https://arxiv.org/html/2606.26101#S7.T4) shows that the main pattern survives under a normalized parser that repairs superficial formatting defects without changing decision semantics. Reliability changes only from 0.2787 to 0.2815 for Qwen2.5-1.5B, from 0.3657 to 0.3722 for Qwen2.5-3B, and from 0.3407 to 0.3463 for Llama-3-8B. Productive abstention rises slightly, but model ordering and conclusions do not change.
Prompt wording also matters, but much less than the answer-or-abstain versus answer-only contrast. Across the three answer-or-abstain variants, the maximum reliability swing is 0.0139 for Qwen2.5-3B and 0.0120 for Llama-3-8B. Paired bootstrap confirms decisive control gaps for both Qwen checkpoints ($p < 0.001$) and a smaller but significant gap for Llama-3-8B ($p = 0.004$).
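One way to run a paired bootstrap over per-item reliability indicators is sketched below. The text does not specify how the p-values are defined, so the one-sided convention used here (the share of resampled mean differences that are not positive) is an assumption, as are the function name and the synthetic inputs.

```python
import random


def paired_bootstrap(rel_main, rel_control, n_resamples=10_000, seed=0):
    """Paired bootstrap on per-item differences between two prompt conditions.

    rel_main, rel_control: 0/1 reliability indicators for the same items,
    in the same order. Returns (observed mean difference, one-sided p).
    """
    if len(rel_main) != len(rel_control) or not rel_main:
        raise ValueError("need two non-empty indicator lists of equal length")
    rng = random.Random(seed)
    n = len(rel_main)
    # Pairing: resample item-level differences, not each condition separately.
    diffs = [a - b for a, b in zip(rel_main, rel_control)]
    observed = sum(diffs) / n
    not_positive = 0
    for _ in range(n_resamples):
        resampled_mean = sum(rng.choices(diffs, k=n)) / n
        not_positive += int(resampled_mean <= 0)
    return observed, not_positive / n_resamples


# Synthetic indicators for illustration only.
main = [1] * 30 + [0] * 70
control = [1] * 5 + [0] * 95
print(paired_bootstrap(main, control, n_resamples=2000))
```

Resampling differences item by item preserves the fact that both conditions were scored on the same questions, which is what makes the comparison paired.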
Table 4: Robustness to parser choice and prompt-template variation under answer-or-abstain. Template range reports the minimum and maximum reliability over the three locked paraphrase variants.
### 7.2 Contamination and Item-Origin Slices
The contamination-aware design would be less convincing if gains appeared only on easy low-risk items or only on synthetic unknowns. Table [5](https://arxiv.org/html/2606.26101#S7.T5) shows otherwise. Qwen2.5-3B reaches low-risk reliability 0.6128, but its real-public reliability is only 0.2284, so the main challenge remains answer-expected public knowledge. At the same time, synthetic-unknown productive abstention is 0.9249. Llama-3-8B shows slightly better real-public reliability but weaker synthetic selectivity. FLAN-T5-XL remains weak on both slices.
These slices show that performance on synthetic unknowns alone is not enough, and neither is performance on easy public items\. The benchmark is designed to expose both\.
Table 5: Contamination and item-origin slices from the full test set. The low-risk slice has $N = 407$; real-public and synthetic-unknown slices have $N = 867$ and $N = 213$, respectively.
### 7.3 Calibration, Refusal, and Error Taxonomy
Higher reliability does not imply good calibration\. ECE remains 0\.6668 for Qwen2\.5\-1\.5B, 0\.5931 for Qwen2\.5\-3B, and 0\.4726 for Llama\-3\-8B, indicating persistent overconfidence even in the strongest answer\-or\-abstain runs\[[6](https://arxiv.org/html/2606.26101#bib.bib6),[8](https://arxiv.org/html/2606.26101#bib.bib4)\]\.
Refusal must also remain separate from epistemic abstention\. Under answer\-or\-abstain, Qwen2\.5\-1\.5B refuses on 12\.5% of items, often on benign but difficult public questions; Qwen2\.5\-3B drops to 2\.96%, while Llama\-3\-8B remains at 8\.33%\. Counting these refusals as abstentions would overstate reliability for the wrong reason\.
Our qualitative audit of 120 sampled errors from Qwen2\.5\-3B and Llama\-3\-8B shows the largest category is unnecessary abstention on answer\-expected items \(28\.3%\), followed by confident hallucination \(24\.2%\), policy refusal on benign items \(18\.3%\), partial knowledge with wrong finalization \(16\.7%\), and parser\-compliant but semantically wrong output \(12\.5%\)\. The remaining difficulty is therefore not only knowledge itself, but converting uncertainty into the correct observable action\.
Figure 3: Reliability versus expected calibration error (ECE) for the main answer-or-abstain runs, with answer-only controls shown as separate markers. Higher reliability in this benchmark does not imply well-calibrated confidence: the strongest answer-or-abstain models improve reliability while still exhibiting substantial calibration error.
### 7.4 Metric Robustness
To test whether conclusions depend on equal weighting of correct answers and correct abstentions, we recompute rankings under
$$\mathrm{Rel}_{\lambda} = \frac{\sum_{i}\big[\mathbb{1}(e_i = 0 \wedge \mathrm{correct}_i) + \lambda\,\mathbb{1}(e_i = 1 \wedge \mathrm{abstain}_i)\big]}{\sum_{i}\big[\mathbb{1}(e_i = 0) + \lambda\,\mathbb{1}(e_i = 1)\big]},$$
with $\lambda \in \{0.5, 1.0, 1.5\}$. Absolute values change, but ranking among the top answer-or-abstain models is stable: Qwen2.5-3B remains first, Llama-3-8B second, and Qwen2.5-1.5B third.
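Because the weighted metric depends only on four counts, it can be sketched in closed form. The counts used below for Qwen2.5-3B (198 correct answers out of 867 answer-expected items, 197 abstentions out of 213 abstention-expected items) are inferred by rounding the reported rates times the slice sizes; they are not figures stated in the paper, and the function name is hypothetical.

```python
def reliability_lambda(n_correct, n_answer_expected,
                       n_abstained, n_abstain_expected, lam=1.0):
    """Cost-sensitive reliability from aggregate counts.

    lam scales the credit (and the denominator weight) of
    abstention-expected items relative to answer-expected items.
    """
    numerator = n_correct + lam * n_abstained
    denominator = n_answer_expected + lam * n_abstain_expected
    return numerator / denominator


# Inferred counts, not reported: 0.2284 * 867 ~ 198 and 0.9249 * 213 ~ 197.
for lam in (0.5, 1.0, 1.5):
    value = reliability_lambda(198, 867, 197, 213, lam)
    print(f"lambda={lam}: {value:.4f}")
```

At $\lambda = 1$ the denominator reduces to the total item count, so the expression coincides with the unweighted reliability (395/1080 under these inferred counts).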
Overall, the benchmark is robust enough to expose model- and prompt-dependent transition behavior under fixed labels, parser controls, and contamination metadata. It does not show that abstention-aware reliability is solved, because answer-expected Zones $A$–$C$, calibration, and benign-item refusal remain substantial failure modes.
## 8 Limitations and Ethical Considerations
This benchmark has several limitations\. First, although we report bootstrap confidence intervals, prompt variants, and parser robustness, the evaluation remains a frozen deterministic protocol rather than a repeated training\-and\-retraining study\. The results therefore support comparative auditing under fixed conditions, not broad claims about how abstention behavior would change under different fine\-tuning recipes or alignment interventions\. Second, the benchmark focuses on short factual and quasi\-factual question answering\. It does not cover long\-form generation, interactive tool use, or open\-ended dialogue, where abstention may take different forms\[[12](https://arxiv.org/html/2606.26101#bib.bib3),[21](https://arxiv.org/html/2606.26101#bib.bib18)\]\. Third, contamination\-risk labels are useful metadata, but they are still proxies; they do not prove that any specific model has or has not memorized a given item\[[11](https://arxiv.org/html/2606.26101#bib.bib22),[15](https://arxiv.org/html/2606.26101#bib.bib23)\]\. Fourth, Zone D is synthetic by construction\. Despite style matching and reviewer checks, synthetic unknowns may still contain residual artifacts that future models could exploit\.
The ethical considerations are equally important\. Synthetic unknown items must be labeled clearly so that they are never circulated as real facts\. Public\-source items may inherit errors, omissions, or social biases from upstream datasets, especially in medicine and general trivia\[[3](https://arxiv.org/html/2606.26101#bib.bib7),[19](https://arxiv.org/html/2606.26101#bib.bib20)\]\. In addition, abstention is not automatically beneficial: excessive abstention can hide model weakness, while policy refusal can be misread as epistemic caution\. For that reason, the benchmark separates abstention from refusal and releases provenance and review metadata alongside scores\.
## 9 Conclusion
We presented a contamination\-aware, multi\-zone benchmark for measuring the transition from supported answering to unsupported guessing\. The benchmark freezes build\-time answerability labels, exposes contamination\-risk metadata, distinguishes benchmark abstention from policy refusal, and evaluates models under strict and normalized parsing with prompt controls\. The main empirical finding is not that current models have solved abstention\-aware reliability\. Instead, stronger instruction\-tuned models show a selective but incomplete transition from answering to abstaining: they can abstain productively on synthetic unknowns, yet still struggle on difficult answer\-expected public items, remain poorly calibrated, and sometimes refuse benign questions\. Robustness analyses show that the central ranking and qualitative conclusions survive parser normalization, prompt variants, and cost\-sensitive scoring\. We view the main contribution as the protocol itself\. By separating answerability, abstention, refusal, and contamination within one auditable framework, the benchmark makes it easier to diagnose whether a model is knowledgeable, overconfident, overly conservative, or simply misaligned with the decision structure required by reliable deployment\.
## Appendix A
### A\.1 Supplementary Artifacts
The appendix contains only reproducibility material omitted from the main text for space\. We include the exact wording of the three locked answer\-or\-abstain prompt variants, the normalized parser rules, full contamination and item\-origin slice tables for all evaluated models, and a compact example sheet containing one verified Zone\-C transformation pair and one verified Zone\-D synthetic unknown\. We also provide released metadata fields, bootstrap settings, and question identifiers needed to reproduce every table in the paper\. No additional claims depend on the appendix\.
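As an illustration of what a bootstrap setting involves, the sketch below computes a percentile bootstrap confidence interval over per-item scores. It is a generic construction, not the paper's released configuration: the function name `bootstrap_ci`, the resample count, the seed, and the 95% level are assumptions.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores.

    All defaults are illustrative assumptions, not the paper's settings.
    """
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample items with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(scores), lo, hi


# Usage: 1 marks a correct-or-productively-abstained item, 0 anything else.
point, lower, upper = bootstrap_ci([1] * 60 + [0] * 40)
```

Resampling at the item level with a fixed seed matches the frozen, deterministic character of the protocol: rerunning the script on the same question identifiers yields the same interval.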
#### Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article\.
## References
- \[1\] H\. W\. Chung, L\. Hou, S\. Longpre, B\. Zoph, Y\. Tay, W\. Fedus, Y\. Li, X\. Wang, M\. Dehghani, S\. Brahma, et al\. \(2022\) Scaling instruction\-finetuned language models\. arXiv preprint arXiv:2210\.11416\. External Links: [Link](https://arxiv.org/abs/2210.11416)\.
- \[2\] Y\. Dong, X\. Jiang, H\. Liu, Z\. Jin, B\. Gu, M\. Yang, and G\. Li \(2024\) Generalization or memorization: data contamination and trustworthy evaluation for large language models\. arXiv preprint arXiv:2402\.15938\. External Links: [Link](https://arxiv.org/abs/2402.15938)\.
- \[3\] T\. Gebru, J\. Morgenstern, B\. Vecchione, J\. Wortman Vaughan, H\. Wallach, H\. Daumé III, and K\. Crawford \(2021\) Datasheets for datasets\. Communications of the ACM 64\(12\), pp\. 86–92\. External Links: [Document](https://dx.doi.org/10.1145/3458723), [Link](https://arxiv.org/abs/1803.09010)\.
- \[4\] Y\. Geifman and R\. El\-Yaniv \(2019\) SelectiveNet: a deep neural network with an integrated reject option\. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 97, pp\. 2151–2159\. External Links: [Link](https://proceedings.mlr.press/v97/geifman19a.html)\.
- \[5\] A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. External Links: [Link](https://arxiv.org/abs/2407.21783)\.
- \[6\] C\. Guo, G\. Pleiss, Y\. Sun, and K\. Q\. Weinberger \(2017\) On calibration of modern neural networks\. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol\. 70, pp\. 1321–1330\. External Links: [Link](https://proceedings.mlr.press/v70/guo17a.html)\.
- \[7\] M\. Joshi, E\. Choi, D\. S\. Weld, and L\. Zettlemoyer \(2017\) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension\. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Vancouver, Canada, pp\. 1601–1611\. External Links: [Document](https://dx.doi.org/10.18653/v1/P17-1147), [Link](https://aclanthology.org/P17-1147/)\.
- \[8\] S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson, et al\. \(2022\) Language models \(mostly\) know what they know\. arXiv preprint arXiv:2207\.05221\. External Links: [Link](https://arxiv.org/abs/2207.05221)\.
- \[9\] J\. Li, X\. Cheng, X\. Zhao, J\. Nie, and J\. Wen \(2023\) HaluEval: a large\-scale hallucination evaluation benchmark for large language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp\. 6449–6464\. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.397), [Link](https://aclanthology.org/2023.emnlp-main.397/)\.
- \[10\] Y\. Li, F\. Guerin, and C\. Lin \(2024\) LatestEval: addressing data contamination in language model evaluation through dynamic and time\-sensitive test construction\. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol\. 38, pp\. 18600–18607\. External Links: [Document](https://dx.doi.org/10.1609/aaai.v38i17.29822), [Link](https://ojs.aaai.org/index.php/AAAI/article/view/29822)\.
- \[11\] Y\. Li, Y\. Guo, F\. Guerin, and C\. Lin \(2024\) An open\-source data contamination report for large language models\. In Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, pp\. 528–541\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.30), [Link](https://aclanthology.org/2024.findings-emnlp.30/)\.
- \[12\] P\. Liang, R\. Bommasani, T\. Lee, D\. Tsipras, D\. Soylu, M\. Yasunaga, Y\. Zhang, D\. Narayanan, Y\. Wu, A\. Kumar, et al\. \(2022\) Holistic evaluation of language models\. arXiv preprint arXiv:2211\.09110\. External Links: [Link](https://arxiv.org/abs/2211.09110)\.
- \[13\] S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Dublin, Ireland, pp\. 3214–3252\. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.229), [Link](https://aclanthology.org/2022.acl-long.229/)\.
- \[14\] A\. Pal, L\. K\. Umapathi, and M\. Sankarasubbu \(2022\) MedMCQA: a large\-scale multi\-subject multi\-choice dataset for medical domain question answering\. In Proceedings of the Conference on Health, Inference, and Learning, Proceedings of Machine Learning Research, Vol\. 174, pp\. 248–260\. External Links: [Link](https://arxiv.org/abs/2203.14371)\.
- \[15\] M\. Palavalli, A\. Bertsch, and M\. Gormley \(2024\) A taxonomy for data contamination in large language models\. In Proceedings of the 1st Workshop on Data Contamination \(CONDA\), Bangkok, Thailand, pp\. 22–40\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.conda-1.3), [Link](https://aclanthology.org/2024.conda-1.3/)\.
- \[16\] Qwen Team, A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, et al\. \(2025\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\. External Links: [Link](https://arxiv.org/abs/2412.15115)\.
- \[17\] A\. Talmor, J\. Herzig, N\. Lourie, and J\. Berant \(2019\) CommonsenseQA: a question answering challenge targeting commonsense knowledge\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\), Minneapolis, Minnesota, pp\. 4149–4158\. External Links: [Document](https://dx.doi.org/10.18653/v1/N19-1421), [Link](https://aclanthology.org/N19-1421/)\.
- \[18\] N\. Varshney, S\. Mishra, and C\. Baral \(2022\) Towards improving selective prediction ability of NLP systems\. In Proceedings of the 7th Workshop on Representation Learning for NLP, Dublin, Ireland, pp\. 221–226\. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.repl4nlp-1.23), [Link](https://aclanthology.org/2022.repl4nlp-1.23/)\.
- \[19\] T\. Vu, M\. Iyyer, X\. Wang, N\. Constant, J\. Wei, J\. Wei, C\. Tar, Y\. Sung, D\. Zhou, Q\. Le, and T\. Luong \(2024\) FreshLLMs: refreshing large language models with search engine augmentation\. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, pp\. 13697–13720\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.813), [Link](https://aclanthology.org/2024.findings-acl.813/)\.
- \[20\] J\. Welbl, N\. F\. Liu, and M\. Gardner \(2017\) Crowdsourcing multiple choice science questions\. In Proceedings of the 3rd Workshop on Noisy User\-generated Text, Copenhagen, Denmark, pp\. 94–106\. External Links: [Document](https://dx.doi.org/10.18653/v1/W17-4413), [Link](https://arxiv.org/abs/1707.06209)\.
- \[21\] B\. Wen, J\. Yao, S\. Feng, C\. Xu, Y\. Tsvetkov, B\. Howe, and L\. L\. Wang \(2025\) Know your limits: a survey of abstention in large language models\. Transactions of the Association for Computational Linguistics 13, pp\. 529–556\. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00754), [Link](https://aclanthology.org/2025.tacl-1.26/)\.
- \[22\] Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: a dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp\. 2369–2380\. External Links: [Document](https://dx.doi.org/10.18653/v1/D18-1259), [Link](https://aclanthology.org/D18-1259/)\.
- \[23\] X\. Yin, X\. Zhang, J\. Ruan, and X\. Wan \(2024\) Benchmarking knowledge boundary for large language models: a different perspective on model evaluation\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), Bangkok, Thailand, pp\. 2270–2286\. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.124), [Link](https://aclanthology.org/2024.acl-long.124/)\.