Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

arXiv cs.CL Papers

Summary

The paper introduces NEI-CAP, a diagnostic protocol to evaluate how 'Not Enough Information' examples are constructed in fact verification benchmarks, revealing that models trained on shortcut-prone NEI constructions fail to transfer to harder, semantically related insufficient evidence cases.

arXiv:2605.26663v1 Announce Type: new Abstract: Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.

# Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification
Source: [https://arxiv.org/html/2605.26663](https://arxiv.org/html/2605.26663)
###### Abstract

Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar\. The Not Enough Information \(NEI\) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide\. We introduce NEI\-CAP, a construction\-aware diagnostic protocol for insufficient\-evidence evaluation\. Each NEI example carries the construction family that produced it; NEI\-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions\. We instantiate the protocol in SciFact\-style scientific verification, with FEVER and HoVer as bounded external controls\. Across these settings, NEI competence does not transfer reliably: models trained on shortcut\-prone constructions fail to recognize semantically related insufficient evidence, and mixed\-construction training narrows but does not close the gap\. Fixed\-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved\.

Jingxi Qiu \(ZenWeave AI; Georgetown University\), Zeyu Han \(Georgetown University\), Cheng Huang \(ZenWeave AI; corresponding author\)

Contact: jingxi@zenweaveai\.com, chenghuang@zenweaveai\.com

## 1 Introduction

A fact verification system labels a claim as supported, refuted, or *Not Enough Information* \(NEI\) when the available evidence is inconclusive \(Thorne et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib1); Wadden et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib2); Jiang et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib3)\)\. The NEI label is meant to be evidence\-conditioned: for a claim $c$ and an evidence set $E$, it means that $E$ does not establish $c$ either way\. Building those negative evidence sets, however, is itself a design step that no formal definition covers\. An empty field, an off\-topic passage, a high\-overlap retrieval miss, and a non\-rationale sentence drawn from a cited document can all be labelled NEI, and a verifier trained on one of these constructions can predict NEI for reasons that have little to do with whether the evidence is actually sufficient\.

Prior work on fact\-verification artifacts has focused on the claim side\. Schuster et al\. \([2019](https://arxiv.org/html/2605.26663#bib.bib4)\) show that FEVER can be partially solved by claim\-only classifiers, and adversarial or contrastive verification resources further show that standard evidence\-aware accuracy can hide brittle decision rules \(Thorne and Vlachos, [2019](https://arxiv.org/html/2605.26663#bib.bib21); Schuster et al\., [2021](https://arxiv.org/html/2605.26663#bib.bib22)\)\. A separate line studies evidence sufficiency by removing parts of otherwise\-valid evidence \(Atanasova et al\., [2022](https://arxiv.org/html/2605.26663#bib.bib5); Vladika et al\., [2025](https://arxiv.org/html/2605.26663#bib.bib16)\)\. We study a complementary failure mode, namely how the negative evidence condition is built in the first place, and argue that this construction silently determines what a verifier learns and what an aggregate NEI\-F1 can hide\.

Figure [1](https://arxiv.org/html/2605.26663#S1.F1) illustrates the mechanism\. Easy NEI can be solved by recognizing absence, format, or topic mismatch; hard NEI keeps the evidence related to the claim but incomplete, which can induce false support when the model mistakes overlap for sufficiency\. NEI\-CAP targets this gap by making construction family part of the evaluation record rather than treating NEI as a construction\-free class\.

![Refer to caption](https://arxiv.org/html/2605.26663v1/x1.png)

Figure 1: Conceptual illustration of NEI construction artifacts\. Easy NEI constructions, such as placeholders or unrelated passages, can teach absence, format, or topic\-mismatch shortcuts\. Hard NEI keeps the evidence semantically related but incomplete; a verifier may therefore overpredict Support from overlap rather than recognize insufficiency\. NEI\-CAP records the construction family, audits shortcuts, validates hard examples, and stress\-tests whether NEI competence transfers\. Examples are schematic; experiments use the SciFact, FEVER, and HoVer constructions described in Sections [3](https://arxiv.org/html/2605.26663#S3)–[4](https://arxiv.org/html/2605.26663#S4)\.

NEI\-CAP makes the construction explicit\. Each NEI example carries the family of evidence condition that produced it, which lets us audit each family for shortcut features and stress\-test whether a model trained on one family recognizes insufficiency in another\. We instantiate the protocol on SciFact\-style scientific verification \(Wadden et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib2)\), with FEVER \(Thorne et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib1)\), HoVer \(Jiang et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib3)\), and the broader fact\-verification literature as context \(Augenstein et al\., [2019](https://arxiv.org/html/2605.26663#bib.bib23); Aly et al\., [2021](https://arxiv.org/html/2605.26663#bib.bib14)\)\.

The headline finding is a transfer failure that aggregate NEI\-F1 cannot detect\. A DeBERTa verifier trained on placeholder NEI reaches perfect matched\-placeholder NEI\-F1 across five seeds, yet scores zero NEI\-F1 on BM25 near\-miss and cited non\-rationale evaluation; the collapse replicates on RoBERTa and SciBERT\. Probability mass shifts to Support and Refute rather than to NEI, so the failure is not a calibration artifact\. Training on random\-irrelevant NEI fares only marginally better, showing that the problem extends beyond placeholder detection\. Mixed\-construction training narrows but does not close the gap\. A fixed\-claim diagnostic further shows that swapping the evidence shifts confidence in the reference Support or Refute label, not only NEI recall\. The construction choice therefore affects the verifier on the full three\-way task, not only the NEI corner\.

We make three contributions\. First, we recast NEI as a construction\-sensitive evidence condition rather than a single negative label\. Second, we introduce NEI\-CAP: a diagnostic protocol that treats the construction family as an explicit evaluation variable, audits its shortcut surface, and validates hard cases through human adjudication\. Third, our SciFact, FEVER, and HoVer experiments show that shortcut\-prone training fails to transfer to semantically related insufficient evidence, and that multi\-seed and mixed\-construction protocols do not remove the need for construction\-stratified reporting\.

## 2 Related Work

### 2\.1 Fact Verification Benchmarks

Fact verification is commonly formulated as a three\-way classification problem: given a claim and an evidence set, predict whether the evidence supports, refutes, or is insufficient to verify the claim\. FEVER introduced a large\-scale Wikipedia benchmark with Supported, Refuted, and NotEnoughInfo labels \(Thorne et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib1)\); SciFact extended the formulation to expert\-written scientific claims that require retrieving evidence\-containing abstracts and rationales \(Wadden et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib2)\); and HoVer added many\-hop evidence retrieval, where verification can depend on facts spread across multiple Wikipedia articles \(Jiang et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib3)\)\. MultiFC broadens fact\-checking to real\-world multi\-domain claims \(Augenstein et al\., [2019](https://arxiv.org/html/2605.26663#bib.bib23)\); HealthFC focuses on evidence\-backed medical claims \(Vladika et al\., [2023](https://arxiv.org/html/2605.26663#bib.bib28)\); FEVEROUS adds structured table evidence \(Aly et al\., [2021](https://arxiv.org/html/2605.26663#bib.bib14)\); and VitaminC creates contrastive claim–evidence pairs that require sensitivity to small factual changes \(Schuster et al\., [2021](https://arxiv.org/html/2605.26663#bib.bib22)\)\. Rationale\-centered resources further ask whether systems identify the evidence used to support predictions \(DeYoung et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib12)\)\. Across these benchmarks the NEI label is treated as a fixed third class, but how its evidence side is built is left to each benchmark’s discretion, and that choice is not made part of the evaluation protocol\.

### 2\.2 Dataset Artifacts and Behavioral Evaluation

NLP benchmarks routinely contain artifacts that let models score well without learning the intended capability\. In natural language inference, hypothesis\-only classifiers can recover the label from premise\-free input \(Gururangan et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib6); Poliak et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib17)\), and controlled challenge sets such as HANS show that high benchmark accuracy can mask reliance on lexical or syntactic heuristics \(McCoy et al\., [2019](https://arxiv.org/html/2605.26663#bib.bib18)\)\. Similar shortcut effects appear beyond NLI, including argument reasoning artifacts \(Niven and Kao, [2019](https://arxiv.org/html/2605.26663#bib.bib19)\); more broadly, shortcut learning is a known failure mode of modern neural systems \(Geirhos et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib20)\)\. In fact verification, Schuster et al\. \([2019](https://arxiv.org/html/2605.26663#bib.bib4)\) show analogous claim\-side cues in FEVER and demonstrate that claim\-only baselines remain competitive against evidence\-aware models; FEVER2\.0\-style adversarial work further stresses robustness to perturbations \(Thorne and Vlachos, [2019](https://arxiv.org/html/2605.26663#bib.bib21)\)\. A broader line of work uses contrast sets, counterfactually augmented data, and behavioral testing to expose brittle shortcut reliance \(Kaushik et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib8); Gardner et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib7); Ribeiro et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib13)\)\. These studies mostly examine claim\-side, hypothesis\-side, or local decision\-boundary artifacts\. NEI\-CAP carries the same diagnostic stance to the evidence side and asks what shortcuts the construction of NEI evidence itself can teach\.

### 2\.3 Evidence Sufficiency and Missing Evidence

A claim can be true or false in the world while the available evidence is still insufficient to settle it, so evidence sufficiency is a distinct question from veracity prediction\. Atanasova et al\. \([2022](https://arxiv.org/html/2605.26663#bib.bib5)\) make this question operational by removing parts of otherwise\-valid evidence and asking whether fact\-checking models notice the omission, and rationale evaluation benchmarks ask whether models identify supporting passages rather than only predicting labels \(DeYoung et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib12)\)\. Work on missing counter\-evidence similarly argues that evidence availability and sufficiency are assumptions built into fact\-checking datasets \(Glockner et al\., [2022](https://arxiv.org/html/2605.26663#bib.bib26)\)\. The same question appears in factuality evaluation for generation, where long\-form claims are decomposed into atomic facts and checked against supporting sources \(Min et al\., [2023](https://arxiv.org/html/2605.26663#bib.bib24)\), and in retrieval\-augmented generation or grounding evaluation, where generated claims must be supported by provided context \(Niu et al\., [2024](https://arxiv.org/html/2605.26663#bib.bib25); Jacovi et al\., [2025](https://arxiv.org/html/2605.26663#bib.bib27)\)\. These studies motivate evidence\-sensitive evaluation, but they usually treat the negative or unsupported condition as already given or derived from a valid one\. NEI\-CAP works in the opposite direction: it asks how the insufficient evidence set was built in the first place, and how that construction determines what a verifier can be said to have learned\.

## 3 NEI\-CAP: Construction\-Aware NEI Evaluation

Figure [1](https://arxiv.org/html/2605.26663#S1.F1) motivates NEI\-CAP as a way to separate shortcut recognition from evidence\-insufficiency recognition\. This section formalizes that idea with a construction variable, a compact taxonomy of evidence conditions, and the operational workflow in Protocol 1\.

### 3\.1 Evidence\-Conditioned NEI

A verification instance is $(c,E,y)$, where $c$ is a claim, $E=\{e_1,\ldots,e_k\}$ is its evidence set, and $y\in\{\textsc{Support},\textsc{Refute},\textsc{NEI}\}$\. The NEI label is a property of the pair $(c,E)$ rather than of the claim alone \(Thorne et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib1); Wadden et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib2)\): whether the evidence is insufficient depends on what evidence is provided\. We make this dependence explicit by extending each example with a construction variable,

$$x=(c,E,y,z,g),$$

where $z$ records the family of NEI evidence condition that produced $E$ and $g$ is a grouping identifier that keeps variants of the same claim within a single split\. The model never receives $z$ or $g$; they are diagnostic interventions, used only for auditing, splitting, and stratified reporting\.
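The extended record can be sketched as a small data structure. This is only an illustration under stated assumptions: the class name `NEIExample`, the field names, the `[SEP]` separator, and the example claim are hypothetical and do not come from the paper's released code. The point it shows is that $z$ and $g$ travel with the example but are excluded from the text the verifier sees.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NEIExample:
    claim: str                 # c
    evidence: Tuple[str, ...]  # E = {e_1, ..., e_k}
    label: str                 # y in {"SUPPORT", "REFUTE", "NEI"}
    construction: str          # z: diagnostic only, never shown to the model
    group_id: str              # g: diagnostic only, used for splitting


def model_input(ex: NEIExample, sep: str = " [SEP] ") -> str:
    """Build the verifier input from the claim and evidence alone."""
    return ex.claim + sep + " ".join(ex.evidence)


# Schematic example (invented, not taken from SciFact).
example = NEIExample(
    claim="Drug A lowers blood pressure.",
    evidence=("Drug A was administered to 40 patients.",),
    label="NEI",
    construction="cited_non_rationale",
    group_id="claim-0001",
)
text = model_input(example)
```

Keeping the two diagnostic fields on the record, rather than in a side table, makes it harder to lose the construction family when examples are shuffled, split, or mixed.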

### 3\.2 NEI Construction Families

NEI\-CAP separates shortcut\-prone constructions from semantically related insufficient evidence\. The former expose format, topic, position, or retrieval shortcuts; the latter test whether evidence that remains related to the claim is still recognized as insufficient\. This follows the same motivation as contrast\-set and behavioral evaluation: a benchmark should expose when a model succeeds through an unintended decision rule rather than the intended capability \(Gardner et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib7); Ribeiro et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib13); Geirhos et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib20)\)\. Table [1](https://arxiv.org/html/2605.26663#S3.T1) lists the compact taxonomy used in the rest of the paper; full definitions, metadata fields, and shortcut\-risk dimensions are in Appendix [A](https://arxiv.org/html/2605.26663#A1)\.

| Family | Evidence condition | Role |
| --- | --- | --- |
| Placeholder | Fixed/empty no\-evidence marker | Format shortcut anchor |
| Random irrelevant | Unrelated evidence | Topic\-mismatch anchor |
| Position\-biased | Predictable non\-rationales | Position/source audit |
| BM25 near\-miss | High\-overlap insufficient evidence | Hard NEI |
| Cited non\-rationale | Cited but non\-rationale evidence | Hard NEI |
| Same\-document | Same\-source non\-rationale evidence | Source\-controlled NEI |
| Fixed\-claim | Same claim, changed evidence | Evidence\-substitution diagnostic |
| Missing\-hop | Multi\-hop evidence with a required fact removed | External multi\-hop control |

Table 1: Compact NEI\-CAP construction taxonomy\.

### 3\.3 Diagnostic Protocol

Protocol 1 lists the five stages that produce the construction\-stratified evidence reported in Sections [5](https://arxiv.org/html/2605.26663#S5)–[6](https://arxiv.org/html/2605.26663#S6)\. The stages share a common output discipline: each one returns a typed artifact that the next stage can consume without re\-deriving anything from the raw text\.

Protocol 1: NEI\-CAP diagnostic workflow

Input: claim–evidence examples $(c,E,y)$ and construction rules\.

Output: audit tables, adjudicated subsets, construction\-stratified metrics, and claim boundaries\.

1. Construct\. Assign construction family $z$ and group ID $g$\.
2. Audit\. Measure evidence\-side shortcut features by label and construction\.
3. Validate\. Adjudicate candidate hard NEI used for central claims\.
4. Stress\-test\. Evaluate single\- and mixed\-construction training under construction\-stratified tests\.
5. Report\. Release construction\-specific metrics, uncertainty, and claim boundaries\.
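The Audit stage can be sketched as a table of simple evidence-side features grouped by label and construction family. This is a minimal sketch, not the paper's audit: the two features (evidence length in whitespace tokens, fraction of claim tokens that reappear in the evidence), the `[NO EVIDENCE]` placeholder string, and the toy rows are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def tokens(text):
    """Lowercased whitespace tokens; a deliberately crude tokenizer."""
    return text.lower().split()


def shortcut_features(claim, evidence):
    """Two shallow evidence-side features that a shortcut could exploit."""
    claim_set, ev = set(tokens(claim)), tokens(evidence)
    overlap = len(claim_set & set(ev)) / len(claim_set) if claim_set else 0.0
    return {"evidence_len": len(ev), "claim_overlap": overlap}


def audit(examples):
    """examples: iterable of (claim, evidence, label, construction)."""
    buckets = defaultdict(list)
    for claim, evidence, label, z in examples:
        buckets[(label, z)].append(shortcut_features(claim, evidence))
    # Average each feature within every (label, construction) cell.
    return {
        key: {name: mean(f[name] for f in feats) for name in feats[0]}
        for key, feats in buckets.items()
    }


# Invented rows; "[NO EVIDENCE]" stands in for whatever marker a benchmark uses.
rows = [
    ("aspirin reduces fever", "[NO EVIDENCE]", "NEI", "placeholder"),
    ("aspirin reduces fever", "aspirin was given to adults with fever",
     "NEI", "bm25_near_miss"),
]
table = audit(rows)
```

If a single feature such as `claim_overlap` separates one family from the others almost perfectly, a model trained on that family may be learning the feature rather than insufficiency.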

## 4 Data, Validation, and Experimental Setup

SciFact is our primary setting \(Wadden et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib2)\): it requires evidence\-sensitive verification over scientific abstracts, where reference evidence and semantically related but insufficient evidence routinely coexist in the same document\. We use FEVER and HoVer as bounded external controls \(Thorne et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib1); Jiang et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib3)\)\.

### 4\.1 SciFact Construction Suite

The main SciFact suite instantiates five of the construction families in Table [1](https://arxiv.org/html/2605.26663#S3.T1): placeholder, random irrelevant, position\-biased, BM25 near\-miss, and cited non\-rationale\. BM25 near\-miss examples are obtained with BM25 retrieval \(Robertson and Zaragoza, [2009](https://arxiv.org/html/2605.26663#bib.bib15)\) and then filtered to retain high\-overlap but insufficient evidence\. The Support and Refute portions of the task are held comparable across variants, while only the NEI evidence condition changes\. This lets us train a verifier under one NEI construction and evaluate it under another\. Splits are group\-disjoint, keyed by the original claim or claim–document grouping, so construction variants of the same claim never straddle train and test partitions\.
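A group-disjoint split of this kind can be sketched as follows. The function name `group_disjoint_split`, the 20% test fraction, and the toy data are assumptions for illustration; the paper's actual split sizes and tooling are not specified in this section.

```python
import random
from collections import defaultdict


def group_disjoint_split(examples, group_of, test_frac=0.2, seed=13):
    """Split so that all examples sharing a group ID land on the same side."""
    groups = defaultdict(list)
    for ex in examples:
        groups[group_of(ex)].append(ex)
    ids = sorted(groups)                 # sort first for reproducibility
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(test_frac * len(ids)))
    test_ids = set(ids[:n_test])
    train = [ex for g in ids if g not in test_ids for ex in groups[g]]
    test = [ex for g in ids if g in test_ids for ex in groups[g]]
    return train, test


# Toy data: ten claims, each with two NEI construction variants.
examples = [
    (f"claim-{i}", z)
    for i in range(10)
    for z in ("placeholder", "bm25_near_miss")
]
train, test = group_disjoint_split(examples, group_of=lambda ex: ex[0])
```

Splitting on group IDs rather than on individual examples is what prevents a model from seeing a claim with placeholder evidence in training and the same claim with near-miss evidence at test time.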

### 4\.2 Human\-Adjudicated Hard NEI

Candidate hard NEI examples are useful only if they are actually insufficient: BM25 near\-miss, cited non\-rationale, and same\-document non\-rationale evidence can carry implicit support or refutation that the construction rule does not catch\. NEI\-CAP therefore separates *candidate* hard NEI from *human\-adjudicated* hard NEI\.

We use two validation resources\. The SciFact hard\-NEI audit adjudicates candidate BM25/cited near\-miss examples from the construction suite; the fixed\-claim/same\-document audit covers examples where the claim or source document is held fixed while the evidence condition changes\. Two PhD annotators produced consensus adjudications; AI\-assisted checks were used only as secondary support\. The full protocol and label schema are in Appendix [D](https://arxiv.org/html/2605.26663#A4)\.

The primary human\-hard model evaluation uses the held\-out test split of the SciFact human\-adjudicated hard\-NEI audit: 54 validated hard\-NEI examples after group\-disjoint splitting\. The larger audit pool estimates label validity, while the held\-out subset supports model evaluation\. Appendix [G](https://arxiv.org/html/2605.26663#A7) reports the full evaluation table and Appendix [B](https://arxiv.org/html/2605.26663#A2) documents the sampling trail\.

| Validation resource | $n$ | Valid NEI | Contam\. | Hard outcome |
| --- | --- | --- | --- | --- |
| SciFact BM25/cited hard\-NEI audit | 250 | 89\.2% | 10\.8% | 195 hard \(87\.4% of valid\) |
| Fixed\-claim/same\-document audit | 122 | 96\.7% | 3\.3% | 114 hard \(96\.6% of valid\) |

Table 2: Human validation summary\. Contamination includes examples adjudicated as actually supported, actually refuted, ambiguous, or invalid\. The hard\-outcome column reports the absolute number of human\-adjudicated hard NEI examples that pass the audit, with the rate in parentheses computed against valid NEI\. Full adjudication details are in Appendix [D](https://arxiv.org/html/2605.26663#A4)\.

### 4\.3 Models and Metrics

Our primary verifier is DeBERTa\-v3\-base \(He et al\., [2021](https://arxiv.org/html/2605.26663#bib.bib10)\), with RoBERTa\-base and SciBERT as secondary backbones \(Liu et al\., [2019](https://arxiv.org/html/2605.26663#bib.bib9); Beltagy et al\., [2019](https://arxiv.org/html/2605.26663#bib.bib11)\)\. We use these as diagnostic probes rather than proposed architectures: training variants differ only in the NEI construction family used during training\. Mixed\-construction regimes combine the single\-construction variants into an easy mixture \(placeholder, random irrelevant, position\-biased\), a hard mixture \(BM25 near\-miss, cited non\-rationale\), and a balanced mixture over all five families\. Multi\-seed experiments use seeds 13, 17, 23, 29, and 37\.

For three\-way verification, we report Macro\-F1 and class\-specific F1, with particular attention to NEI\-F1\. For human\-validated hard\-NEI subsets, every evaluated example is adjudicated as NEI, so Macro\-F1 is not informative; we instead report NEI recall, false support rate, false refute rate, and mean predicted class probabilities\. Hyperparameters, checkpoint selection, seed aggregation, and metric details are in Appendices [E](https://arxiv.org/html/2605.26663#A5) and [L](https://arxiv.org/html/2605.26663#A12)\.
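When every gold label is NEI, the three reported rates partition the predictions and sum to one. A minimal sketch of that computation follows; the function name, label strings, and toy predictions are assumptions, and the paper's exact metric code is described in its appendices rather than here.

```python
from collections import Counter


def hard_nei_metrics(predictions):
    """One-class metrics for a subset in which every gold label is NEI."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    n = len(predictions)
    return {
        "nei_recall": counts["NEI"] / n,
        "false_support_rate": counts["SUPPORT"] / n,
        "false_refute_rate": counts["REFUTE"] / n,
    }


# Toy predictions on four validated hard-NEI examples.
metrics = hard_nei_metrics(["NEI", "SUPPORT", "NEI", "REFUTE"])
```

Reporting the two error rates separately, instead of only recall, shows which wrong label the missing NEI predictions were converted into.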

### 4\.4 External Controls

FEVER provides a Wikipedia verification control with a NotEnoughInfo label \(Thorne et al\., [2018](https://arxiv.org/html/2605.26663#bib.bib1)\), and HoVer provides a multi\-hop control in which evidence sufficiency can depend on more than one supporting fact \(Jiang et al\., [2020](https://arxiv.org/html/2605.26663#bib.bib3)\)\. We use FEVER as a non\-toy subset control and HoVer as a candidate missing\-hop control\. Together they test whether the construction\-aware perspective travels beyond SciFact; neither is intended as a full\-data or human\-validated hard\-NEI evaluation in its own right\. Full details are in Appendix [J](https://arxiv.org/html/2605.26663#A10)\.

## 5 Results

### 5\.1 Matched NEI Performance Does Not Imply Transfer

Figure [2](https://arxiv.org/html/2605.26663#S5.F2) shows the SciFact train/test construction matrix\. A placeholder\-trained verifier obtains perfect matched\-placeholder NEI\-F1, but falls to 0\.000 on BM25 near\-miss and cited non\-rationale NEI\. Position\-biased training shows the same hard\-construction collapse, while random\-irrelevant training is less extreme but still shortcut\-prone: it is nearly solved under matched evaluation yet transfers poorly to BM25 and cited hard constructions\. Full numeric matrices are in Appendix [F](https://arxiv.org/html/2605.26663#A6)\.

![Refer to caption](https://arxiv.org/html/2605.26663v1/x2.png)

Figure 2: SciFact NEI\-F1 under train/test construction shifts\. Cell values are seed\-aggregated means from the primary construction matrix; Appendix [F](https://arxiv.org/html/2605.26663#A6) reports full numeric matrices and Macro\-F1\. Matched easy NEI can yield high scores without transferring to semantically related insufficient evidence\.

The point is not that placeholder evidence is realistic; it is an intentionally shortcut\-prone anchor\. The broader finding is that different easy constructions teach different shortcuts \(format absence, position/source bias, or topic mismatch\), whereas the target of NEI\-CAP is hard NEI: semantically related evidence that remains insufficient\.
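Construction-stratified scoring of the kind shown in the matrix can be sketched as follows. The helper names and the four toy records are assumptions for illustration, imitating a placeholder-trained verifier; they are not the paper's data.

```python
from collections import defaultdict


def nei_f1(gold, pred):
    """F1 for the NEI class; returns 0.0 when NEI is never gold or predicted."""
    pairs = list(zip(gold, pred))
    tp = sum(g == "NEI" and p == "NEI" for g, p in pairs)
    fp = sum(g != "NEI" and p == "NEI" for g, p in pairs)
    fn = sum(g == "NEI" and p != "NEI" for g, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def stratified_nei_f1(records):
    """records: iterable of (test_construction, gold_label, predicted_label)."""
    gold, pred = defaultdict(list), defaultdict(list)
    for z, g, p in records:
        gold[z].append(g)
        pred[z].append(p)
    return {z: nei_f1(gold[z], pred[z]) for z in gold}


# Toy predictions from a hypothetical placeholder-trained verifier.
records = [
    ("placeholder", "NEI", "NEI"),
    ("placeholder", "SUPPORT", "SUPPORT"),
    ("bm25_near_miss", "NEI", "SUPPORT"),
    ("bm25_near_miss", "SUPPORT", "SUPPORT"),
]
per_construction = stratified_nei_f1(records)
pooled = nei_f1([g for _, g, _ in records], [p for _, _, p in records])
```

On this toy data the pooled NEI-F1 is about 0.67, while the per-construction scores are 1.0 for placeholder and 0.0 for BM25 near-miss, which is the kind of gap a single aggregate number conceals.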

### 5\.2 Human\-Adjudicated Hard NEI Exposes Easy\-Training Failure

Human validation does not rescue easy NEI training\. Table [3](https://arxiv.org/html/2605.26663#S5.T3) shows that placeholder and position\-biased training yield 0\.000 recall on human\-adjudicated hard NEI\. Random\-irrelevant training also performs poorly \(0\.216 recall\), confirming that topic\-unrelated NEI does not teach hard\-insufficiency recognition\. For these easy\-trained models, errors are dominated by false Support \(0\.574 to 0\.907\)\. BM25 near\-miss and cited non\-rationale training partially recover recall \(0\.691 and 0\.679\), with the remaining errors split between false Support and false Refute\.

| Train NEI | Recall | False SUP | False REF |
| --- | --- | --- | --- |
| BM25 near\-miss | 0\.691 | 0\.142 | 0\.167 |
| Cited non\-rationale | 0\.679 | 0\.074 | 0\.247 |
| Random irrelevant | 0\.216 | 0\.574 | 0\.210 |
| Placeholder | 0\.000 | 0\.870 | 0\.130 |
| Position\-biased | 0\.000 | 0\.907 | 0\.093 |

Table 3: Evaluation on the held\-out SciFact human\-adjudicated hard\-NEI test split \($n=54$\)\. Since all examples are validated as NEI, we report NEI recall and error rates rather than Macro\-F1; Appendix [B](https://arxiv.org/html/2605.26663#A2) gives the sampling trail from the larger audit pool\.

Appendix [G](https://arxiv.org/html/2605.26663#A7) further shows that easy training assigns near\-zero NEI probability to validated hard NEI while reallocating probability mass to Support and Refute; the failure is therefore not just low recall but high\-confidence wrong answers\.

### 5\.3 Mixed Training Helps but Remains Construction\-Stratified

Mixed training is a stronger test because real benchmarks rarely contain a single NEI construction\. Table [4](https://arxiv.org/html/2605.26663#S5.T4) summarizes DeBERTa under an easy mixture, a hard BM25/cited mixture, and a balanced all\-family mixture\. Easy\-mixture training remains weak on hard constructions; hard and balanced mixtures improve recall but still produce construction\-stratified profiles\. Mixed training changes the failure profile; it does not make NEI construction\-free\.

| Train regime | BM25 | Cited | Hard recall | Takeaway |
| --- | --- | --- | --- | --- |
| Easy mixture | 0\.379 | 0\.273 | 0\.178 | Shortcut\-heavy mixture remains weak |
| Hard mixture | 0\.662 | 0\.652 | 0\.770 | Hard training improves recall |
| Balanced mixture | 0\.802 | 0\.803 | 0\.915 | Best, but still stratified |

Table 4: DeBERTa mixed\-construction training summary\. BM25 and Cited are construction\-stratified NEI\-F1; Hard recall is one\-class recall on the SciFact human\-adjudicated hard\-NEI subset\. Easy mixture includes random\-irrelevant and position\-biased NEI, so it improves over placeholder\-only training but remains weak on hard insufficient evidence\.

The balanced mixture improves the hard\-side metrics, but the full stratified matrix in Appendix [H](https://arxiv.org/html/2605.26663#A8) still shows different performance profiles across placeholder, random\-irrelevant, BM25, and cited conditions; mixed training is therefore a stronger stress test, not a replacement for construction\-aware reporting\.

### 5\.4 Placeholder\-to\-Hard Collapse Is Stable Across Seeds and Backbones

The placeholder\-to\-hard collapse is not specific to DeBERTa or to one seed\. Across five seeds for DeBERTa, RoBERTa, and SciBERT, each backbone reaches perfect matched\-placeholder NEI\-F1 and zero NEI\-F1 on BM25 near\-miss and cited non\-rationale evaluation when trained on placeholder NEI \(Table [5](https://arxiv.org/html/2605.26663#S5.T5)\)\. That the exact same 1\.000 → 0\.000 pattern reproduces across three architecture families and fifteen training runs rules out interpreting the collapse as a training\-dynamic artifact of any one backbone\.

| Model | Seeds | PH → PH | PH → BM25 | PH → Cited | Drop |
| --- | --- | --- | --- | --- | --- |
| DeBERTa | 5 | 1\.000 | 0\.000 | 0\.000 | 1\.000 |
| RoBERTa | 5 | 1\.000 | 0\.000 | 0\.000 | 1\.000 |
| SciBERT | 5 | 1\.000 | 0\.000 | 0\.000 | 1\.000 |

Table 5: Five\-seed placeholder\-to\-hard robustness\. PH denotes placeholder NEI\. Scores of 0\.000 are exact: in all five seeds, placeholder\-trained models never predicted NEI on the BM25 or cited hard test sets, yielding zero NEI\-F1\.

### 5\.5 Source\-Controlled Diagnostics Are Useful but Bounded

Fixed\-claim diagnostics extend the finding past NEI recall: when the same claim is paired first with reference evidence and then with human\-adjudicated insufficient evidence, the probability that the verifier assigns to the reference Support/Refute label drops on the insufficient side\. Construction choice therefore affects how confidently the model commits to the non\-NEI labels, not only whether it predicts NEI\. Same\-document hard NEI provides an additional source\-controlled view that retains human adjudication; a shallow\-feature audit shows it remains constructionally distinct from BM25/cited hard NEI, so we report it as its own diagnostic family rather than as a universal hard\-NEI score\. Full fixed\-claim and same\-document diagnostics are in Appendix [I](https://arxiv.org/html/2605.26663#A9)\.
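One plausible reading of the fixed-claim probability drop is the difference in the probability assigned to the reference label before and after the evidence is swapped, averaged over claims. The sketch below assumes that reading; the exact definition is given in the paper's appendix, and the function names and probabilities here are invented for illustration.

```python
from statistics import mean


def reference_prob_drop(p_reference, p_insufficient, reference_label):
    """Drop in probability of the reference label after evidence substitution."""
    return p_reference[reference_label] - p_insufficient[reference_label]


def mean_reference_prob_drop(pairs):
    """pairs: iterable of (p_reference, p_insufficient, reference_label)."""
    return mean(reference_prob_drop(*pair) for pair in pairs)


# Invented class probabilities for two fixed claims.
pairs = [
    ({"SUPPORT": 0.90, "REFUTE": 0.05, "NEI": 0.05},
     {"SUPPORT": 0.60, "REFUTE": 0.10, "NEI": 0.30}, "SUPPORT"),
    ({"SUPPORT": 0.10, "REFUTE": 0.80, "NEI": 0.10},
     {"SUPPORT": 0.20, "REFUTE": 0.70, "NEI": 0.10}, "REFUTE"),
]
drop = mean_reference_prob_drop(pairs)
```

A drop near zero would mean the verifier stays as confident in the reference label when the decisive evidence is gone, which is the behavior the diagnostic is designed to expose.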

### 5\.6 External Controls Are Bounded

FEVER and HoVer extend the diagnostic scope to Wikipedia and to multi\-hop evidence\. In both, a placeholder\-trained baseline reaches NEI\-F1 of 1\.000 on placeholder evaluation and 0\.000 on BM25 \(FEVER\) or missing\-hop \(HoVer\) evaluation, mirroring the SciFact pattern at a different model scale and on a different domain\. We use FEVER as a non\-toy subset shortcut control and HoVer as a candidate missing\-hop control; both are narrower in scope than the SciFact suite\. Full external\-control results are in Appendix [J](https://arxiv.org/html/2605.26663#A10)\.

## 6 Analysis and Discussion

### 6\.1 NEI Is a Family of Evidence Conditions

Hard NEI should not be treated as one uniform category\. BM25 near\-miss, cited non\-rationale, same\-document non\-rationale, fixed\-claim hard NEI, and missing\-hop evidence remove different shortcuts and stress different behaviors\. Construction\-level audits show that these families differ in evidence length, claim–evidence overlap, coverage, and retrieval metadata\. These audits are not auxiliary checks: they are the mechanism by which NEI\-CAP distinguishes evidence\-insufficiency evaluation from shortcut\-sensitive construction recognition\. Surface differences do not invalidate the constructions; they show why construction family must be reported instead of hidden inside aggregate NEI\-F1\. Random irrelevant evidence is especially important: it can teach topic mismatch rather than insufficiency, demonstrating that the problem is broader than placeholder detection\.

### 6\.2 Mixed Training Changes the Profile, Not the Reporting Requirement

Mixed\-construction training addresses a legitimate concern about single\-construction probes\. If construction sensitivity disappeared under mixed training, the single\-construction matrix would be less relevant\. It does not disappear\. Easy\-mixture training remains weak on hard constructions; hard and balanced mixtures improve hard\-NEI recall but still produce construction\-stratified profiles\. Mixed training is therefore not a replacement for NEI\-CAP; it is a stronger test of whether construction\-aware reporting remains necessary\.

NEI\-CAP’s output is not a better aggregate score; it is a reporting discipline\. Every result we report carries the construction family that produced it, the audit metrics that describe its shortcut surface, and the validation status of any hard examples it relies on\.

### 6\.3What Fixed\-Claim Diagnostics Can and Cannot Show

Fixed-claim evidence substitution separates two abilities: recognizing semantically related evidence as insufficient, and assigning the reference label when decisive evidence is present. The probability-drop metric shows that replacing reference evidence with insufficient evidence lowers confidence in the reference label, so construction affects Support/Refute verification confidence as well as NEI recall.
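One plausible reading of the probability-drop metric is the mean difference between the model's reference-label probability under reference evidence and under substituted insufficient evidence, for the same claim. The sketch below assumes that definition; the exact formula used in the paper is not given in this section, and the function name and toy values are hypothetical.

```python
def probability_drop(pairs):
    """Mean drop in reference-label probability under evidence substitution.

    pairs: list of (p_ref_with_reference_evidence,
                    p_ref_with_insufficient_evidence), one per fixed claim.
    This averaging scheme is an assumption, not the paper's stated definition.
    """
    if not pairs:
        raise ValueError("need at least one fixed-claim pair")
    return sum(p_ref - p_sub for p_ref, p_sub in pairs) / len(pairs)


# Hypothetical probabilities for two fixed claims.
pairs = [(0.9, 0.4), (0.8, 0.5)]
drop = probability_drop(pairs)
print(round(drop, 3))
```

A positive value means the substituted evidence lowered confidence in the reference label; a value near zero would suggest the model's decision barely depends on the evidence side.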

The same\-document artifact audit further strengthens the taxonomy argument\. Same\-document hard NEI is human\-adjudicated, but shallow features nearly separate it from BM25/cited hard NEI and other construction families\. Human validity and constructional distinctness are different properties: even human\-valid hard NEI can carry construction\-specific signatures\.
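To make "shallow features nearly separate it" concrete, a single-feature separability check can be sketched as a pairwise AUC. This is only an illustration under assumptions: the paper's audit may use a different classifier and feature set, and the token counts below are invented, chosen only to echo the length gap between construction groups reported in Table 11.

```python
def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. A value near 1.0 or 0.0 means the feature
    alone nearly separates the two groups; 0.5 means no separation.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical evidence token counts per example.
bm25_cited_lengths = [200, 250, 180]
same_document_lengths = [20, 30, 25]
separability = auc(bm25_cited_lengths, same_document_lengths)
print(separability)
```

A family can be fully human-valid and still score close to 1.0 on a check like this, which is the distinction between human validity and constructional distinctness drawn above.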

Future benchmarks should therefore treat the construction of insufficient evidence as part of their evaluation protocol, not as an implementation detail\.

## 7 Conclusion

NEI is often treated as a single negative label, but the way it is built decides what a verifier can actually learn\. In SciFact\-style verification, a placeholder\-trained verifier earns a perfect matched\-placeholder NEI\-F1 yet scores zero on evidence that is topically related but insufficient—a pattern that reproduces across three architectures, five seeds, and two external datasets\.

NEI\-CAP makes the construction explicit\. By attaching every NEI example to the family of evidence condition that produced it and validating the hard cases through human adjudication, the protocol turns insufficient\-evidence evaluation from a single number into a construction\-stratified report\.

The construction of insufficient evidence belongs alongside the labels and metrics that define any fact\-verification benchmark\. An aggregate NEI\-F1 that hides how its NEI examples were built cannot tell a reader whether the model has learned to recognize insufficiency or simply to recognize the artifact\.

## Limitations

NEI\-CAP is a diagnostic protocol, not a complete solution to evidence\-insufficiency reasoning\. It identifies construction sensitivity and provides audit tools, but strong NEI\-CAP results do not guarantee general\-purpose sufficiency reasoning\. Different domains may require additional construction families and annotation guidelines\.

Our strongest evidence comes from SciFact-style scientific verification. FEVER and HoVer are bounded controls: FEVER is a subset control, and HoVer uses candidate missing-hop constructions rather than human-validated hard NEI. SciFact does not natively provide a construction-stratified training split; reconstructing one in model-ready form would require additional preprocessing outside the scope of this paper.

Human validation reduces label-validity risk but does not eliminate it. BM25 near-miss, cited non-rationale, and same-document examples can carry implicit support, refutation, or ambiguity. Final labels are the consensus of two PhD annotators, reached after adjudicating a small number of pre-consensus disagreements; these disagreements concerned fine-grained sub-label boundaries within the contamination categories rather than the binary hard-NEI versus contaminated distinction that the paper relies on. Appendix [D](https://arxiv.org/html/2605.26663#A4) gives the full protocol.

Fixed-claim and same-document diagnostics are decomposition tests, not proofs of clean counterfactual evidence use. Same-document hard NEI is human-adjudicated but constructionally distinct under shallow-feature audits, so we report it as a source-controlled diagnostic family rather than as an artifact-free universal hard-NEI score.

Our model probes are pretrained encoder cross-encoders with secondary-backbone checks. We use them because they support controlled construction-specific training, seed replication, and prediction logging. Recent factuality, retrieval-augmented, and grounding benchmarks evaluate whether generated atomic facts or long-form responses are supported by reliable sources, retrieved passages, or provided documents (Min et al., [2023](https://arxiv.org/html/2605.26663#bib.bib24); Niu et al., [2024](https://arxiv.org/html/2605.26663#bib.bib25); Jacovi et al., [2025](https://arxiv.org/html/2605.26663#bib.bib27)). Extending NEI-CAP from fact-verification labels to those generative evaluation settings is future work.

## References

- R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, and A. Mittal (2021) FEVEROUS: fact extraction and VERification over unstructured and structured information. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. External Links: [Link](https://arxiv.org/abs/2106.05707), [Document](https://dx.doi.org/10.48550/arXiv.2106.05707). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p4.1), [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1).
- P. Atanasova, J. G. Simonsen, C. Lioma, and I. Augenstein (2022) Fact checking with insufficient evidence. Transactions of the Association for Computational Linguistics 10, pp. 746–763. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00486), [Link](https://aclanthology.org/2022.tacl-1.43/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p2.1), [§2.3](https://arxiv.org/html/2605.26663#S2.SS3.p1.1).
- I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, and J. G. Simonsen (2019) MultiFC: a real-world multi-domain dataset for evidence-based fact checking of claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4685–4697. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1475), [Link](https://aclanthology.org/D19-1475/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p4.1), [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1).
- I. Beltagy, K. Lo, and A. Cohan (2019) SciBERT: a pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 3615–3620. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1371), [Link](https://aclanthology.org/D19-1371/). Cited by: [§4.3](https://arxiv.org/html/2605.26663#S4.SS3.p1.1).
- J. DeYoung, S. Jain, N. F. Rajani, E. Lehman, C. Xiong, R. Socher, and B. C. Wallace (2020) ERASER: a benchmark to evaluate rationalized NLP models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4443–4458. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.408), [Link](https://aclanthology.org/2020.acl-main.408/). Cited by: [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1), [§2.3](https://arxiv.org/html/2605.26663#S2.SS3.p1.1).
- M. Gardner, Y. Artzi, V. Basmov, J. Berant, B. Bogin, S. Chen, P. Dasigi, D. Dua, Y. Elazar, A. Gottumukkala, N. Gupta, H. Hajishirzi, G. Ilharco, D. Khashabi, K. Lin, J. Liu, N. F. Liu, P. Mulcaire, Q. Ning, S. Singh, N. A. Smith, S. Subramanian, R. Tsarfaty, E. Wallace, A. Zhang, and B. Zhou (2020) Evaluating models' local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 1307–1323. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.117), [Link](https://aclanthology.org/2020.findings-emnlp.117/). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1), [§3.2](https://arxiv.org/html/2605.26663#S3.SS2.p1.1).
- R. Geirhos, J. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, M. Bethge, and F. A. Wichmann (2020) Shortcut learning in deep neural networks. Nature Machine Intelligence 2, pp. 665–673. External Links: [Document](https://dx.doi.org/10.1038/s42256-020-00257-z), [Link](https://doi.org/10.1038/s42256-020-00257-z). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1), [§3.2](https://arxiv.org/html/2605.26663#S3.SS2.p1.1).
- M. Glockner, Y. Hou, and I. Gurevych (2022) Missing counter-evidence renders NLP fact-checking unrealistic for misinformation. arXiv preprint arXiv:2210.13865. External Links: [Link](https://arxiv.org/abs/2210.13865). Cited by: [§2.3](https://arxiv.org/html/2605.26663#S2.SS3.p1.1).
- S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 107–112. External Links: [Document](https://dx.doi.org/10.18653/v1/N18-2017), [Link](https://aclanthology.org/N18-2017/). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- P. He, J. Gao, and W. Chen (2021) DeBERTaV3: improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543. External Links: [Link](https://arxiv.org/abs/2111.09543). Cited by: [§4.3](https://arxiv.org/html/2605.26663#S4.SS3.p1.1).
- A. Jacovi, A. Wang, C. Alberti, C. Tao, J. Lipovetz, K. Olszewska, L. Haas, M. Liu, N. Keating, A. Bloniarz, C. Saroufim, C. Fry, D. Marcus, D. Kukliansky, G. S. Tomar, J. Swirhun, J. Xing, L. Wang, M. Gurumurthy, M. Aaron, M. Ambar, R. Fellinger, R. Wang, Z. Zhang, S. Goldshtein, and D. Das (2025) The FACTS grounding leaderboard: benchmarking LLMs' ability to ground responses to long-form input. arXiv preprint arXiv:2501.03200. External Links: [Link](https://arxiv.org/abs/2501.03200). Cited by: [§2.3](https://arxiv.org/html/2605.26663#S2.SS3.p1.1), [Limitations](https://arxiv.org/html/2605.26663#Sx1.p5.1).
- Y. Jiang, S. Bordia, Z. Zhong, C. Dognin, M. Singh, and M. Bansal (2020) HoVer: a dataset for many-hop fact extraction and claim verification. In Findings of the Association for Computational Linguistics: EMNLP 2020, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 3441–3460. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.findings-emnlp.309), [Link](https://aclanthology.org/2020.findings-emnlp.309/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p1.4), [§1](https://arxiv.org/html/2605.26663#S1.p4.1), [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1), [§4.4](https://arxiv.org/html/2605.26663#S4.SS4.p1.1), [§4](https://arxiv.org/html/2605.26663#S4.p1.1).
- D. Kaushik, E. Hovy, and Z. C. Lipton (2020) Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations. External Links: [Link](https://arxiv.org/abs/1909.12434). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. External Links: [Link](https://arxiv.org/abs/1907.11692). Cited by: [§4.3](https://arxiv.org/html/2605.26663#S4.SS3.p1.1).
- R. T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448. External Links: [Document](https://dx.doi.org/10.18653/v1/P19-1334), [Link](https://aclanthology.org/P19-1334/). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- S. Min, K. Krishna, X. Lyu, M. Lewis, W. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi (2023) FActScore: fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 12076–12100. External Links: [Document](https://dx.doi.org/10.18653/v1/2023.emnlp-main.741), [Link](https://aclanthology.org/2023.emnlp-main.741/). Cited by: [§2.3](https://arxiv.org/html/2605.26663#S2.SS3.p1.1), [Limitations](https://arxiv.org/html/2605.26663#Sx1.p5.1).
- C. Niu, Y. Wu, J. Zhu, S. Xu, K. Shum, R. Zhong, J. Song, and T. Zhang (2024) RAGTruth: a hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Bangkok, Thailand, pp. 10862–10878. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.acl-long.585), [Link](https://aclanthology.org/2024.acl-long.585/). Cited by: [§2.3](https://arxiv.org/html/2605.26663#S2.SS3.p1.1), [Limitations](https://arxiv.org/html/2605.26663#Sx1.p5.1).
- T. Niven and H. Kao (2019) Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4658–4664. External Links: [Document](https://dx.doi.org/10.18653/v1/P19-1459), [Link](https://aclanthology.org/P19-1459/). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. arXiv preprint arXiv:1805.01042. External Links: [Link](https://arxiv.org/abs/1805.01042). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.442), [Link](https://aclanthology.org/2020.acl-main.442/). Cited by: [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1), [§3.2](https://arxiv.org/html/2605.26663#S3.SS2.p1.1).
- S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3, pp. 333–389. External Links: [Document](https://dx.doi.org/10.1561/1500000019). Cited by: [§4.1](https://arxiv.org/html/2605.26663#S4.SS1.p1.1).
- T. Schuster, A. Fisch, and R. Barzilay (2021) Get your vitamin C! robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 624–643. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.52), [Link](https://aclanthology.org/2021.naacl-main.52/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p2.1), [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1).
- T. Schuster, D. Shah, Y. J. S. Yeo, D. Roberto Filizzola Ortiz, E. Santus, and R. Barzilay (2019) Towards debiasing fact verification models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 3419–3425. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1341), [Link](https://aclanthology.org/D19-1341/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p2.1), [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 809–819. External Links: [Document](https://dx.doi.org/10.18653/v1/N18-1074), [Link](https://aclanthology.org/N18-1074/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p1.4), [§1](https://arxiv.org/html/2605.26663#S1.p4.1), [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1), [§3.1](https://arxiv.org/html/2605.26663#S3.SS1.p1.5), [§4.4](https://arxiv.org/html/2605.26663#S4.SS4.p1.1), [§4](https://arxiv.org/html/2605.26663#S4.p1.1).
- J. Thorne and A. Vlachos (2019) Adversarial attacks against fact extraction and verification. arXiv preprint arXiv:1903.05543. External Links: [Link](https://arxiv.org/abs/1903.05543). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p2.1), [§2.2](https://arxiv.org/html/2605.26663#S2.SS2.p1.1).
- J. Vladika, I. Hacajova, and F. Matthes (2025) Step-by-step fact verification system for medical claims with explainable reasoning. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), L. Chiruzzo, A. Ritter, and L. Wang (Eds.), Albuquerque, New Mexico, pp. 805–816. External Links: [Link](https://aclanthology.org/2025.naacl-short.68/), [Document](https://dx.doi.org/10.18653/v1/2025.naacl-short.68), ISBN 979-8-89176-190-2. Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p2.1).
- J. Vladika, P. Schneider, and F. Matthes (2023) HealthFC: verifying health claims with evidence-based medical fact-checking. arXiv preprint arXiv:2309.08503. External Links: [Link](https://arxiv.org/abs/2309.08503). Cited by: [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1).
- D. Wadden, S. Lin, K. Lo, L. L. Wang, M. van Zuylen, A. Cohan, and H. Hajishirzi (2020) Fact or fiction: verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), B. Webber, T. Cohn, Y. He, and Y. Liu (Eds.), Online, pp. 7534–7550. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.emnlp-main.609), [Link](https://aclanthology.org/2020.emnlp-main.609/). Cited by: [§1](https://arxiv.org/html/2605.26663#S1.p1.4), [§1](https://arxiv.org/html/2605.26663#S1.p4.1), [§2.1](https://arxiv.org/html/2605.26663#S2.SS1.p1.1), [§3.1](https://arxiv.org/html/2605.26663#S3.SS1.p1.5), [§4](https://arxiv.org/html/2605.26663#S4.p1.1).

## Appendix A NEI-CAP Construction Taxonomy

This appendix expands the compact taxonomy in Table [1](https://arxiv.org/html/2605.26663#S3.T1). NEI-CAP treats NEI not as a construction-free negative label, but as an evidence-conditioned label whose interpretation depends on how the evidence set is paired with the claim.

### A.1 Taxonomy Axes

We characterize each NEI construction family along six axes: evidence availability, topical relatedness, source control, lexical or entity overlap, evidential completeness, and validation status. These axes separate absence-based shortcuts from semantically related insufficient evidence.

#### Evidence availability\.

Some NEI examples contain no substantive evidence or use a fixed placeholder. Others contain non-empty evidence that is related to the claim but still insufficient.

#### Topical relatedness\.

Random irrelevant evidence may be insufficient because it is off topic\. BM25 near\-miss, cited non\-rationale, same\-document non\-rationale, and missing\-hop examples are instead related to the claim but incomplete\.

#### Source control\.

Same\-document non\-rationales and fixed\-claim diagnostics reduce source and topic shortcuts by holding the source document or claim fixed\.

#### Lexical and entity overlap\.

Near\-miss examples preserve lexical or entity overlap while removing decisive support, testing whether models mistake overlap for sufficiency\.

#### Evidential completeness\.

Partial and missing\-hop examples contain some relevant information while omitting a necessary relation, condition, or reasoning step\.

#### Validation status\.

NEI-CAP distinguishes constructed examples, candidate hard NEI, and human-adjudicated hard NEI. The strongest hard-NEI claims are restricted to human-adjudicated subsets.

### A.2 Construction Families

| Family | Construction rule | Shortcut risk | Validation status | Paper role |
| --- | --- | --- | --- | --- |
| Placeholder | Replace evidence with a fixed or empty no-evidence marker. | Format, length, absence. | Constructed only. | Shortcut baseline. |
| Random irrelevant | Pair claim with evidence from unrelated claims or documents. | Topic mismatch, low overlap. | Constructed only. | Topic-mismatch baseline. |
| Position-biased | Select non-rationale evidence from predictable locations. | Sentence position, source distribution. | Not primary human-validated. | Artifact stress test. |
| BM25 near-miss | Retrieve high-overlap evidence that remains insufficient. | Lexical overlap mistaken for support. | Human-adjudicated in SciFact audit. | Hard NEI. |
| Cited non-rationale | Use evidence from claim-associated cited documents that is not the rationale. | Source relevance mistaken for sufficiency. | Human-adjudicated in SciFact audit. | Hard NEI. |
| Same-document non-rationale | Use non-rationale evidence from the same source document. | Residual topic/source cues. | Human-adjudicated in fixed-claim/same-document audit. | Source-controlled hard NEI. |
| Fixed-claim hard NEI | Pair the same claim with reference evidence and validated insufficient evidence. | Evidence-side differences may still contain shallow cues. | Human-adjudicated in fixed-claim/same-document audit. | Evidence-substitution diagnostic. |
| Missing-hop control | Remove a required supporting fact from multi-hop evidence. | Partial evidence over-interpreted as sufficient. | Candidate-only in HoVer. | External multi-hop control. |
| FEVER subset control | Construct bounded Wikipedia verification controls with alternative NEI evidence. | Dataset-specific lexical or retrieval artifacts. | Subset control, not human-validated hard NEI. | External shortcut control. |

Table 6: Full NEI-CAP construction taxonomy. The taxonomy separates shortcut-prone baselines, candidate hard NEI, human-adjudicated hard NEI, and bounded external controls.

### A.3 Construction Strength

We use three construction\-strength categories\.

#### Easy NEI.

Placeholder, random irrelevant, and position\-biased examples are shortcut\-prone by design\. They are useful probes of artifact sensitivity but should not be treated as strong evidence\-sufficiency tests\.

#### Candidate hard NEI.

Candidate hard examples are constructed to be semantically related but insufficient\. Before adjudication, they may contain semantic contamination\.

#### Human-adjudicated hard NEI.

Human-adjudicated hard examples are candidate hard examples labeled as truly insufficient rather than actually supportive, actually refuting, ambiguous, or invalid. These subsets support the strongest hard-NEI claims in the paper.

### A.4 Terminology and Claim Boundaries

| Avoid | Use instead | Reason |
| --- | --- | --- |
| hard-negative | hard NEI example | Avoids retrieval/contrastive ambiguity. |
| gold-truth | gold label / reference label | Standard terminology. |
| gold-side | reference-evidence side | More precise. |
| hard-side | insufficient-evidence side | Defines evidence condition. |
| counterfactual proof | fixed-claim evidence-substitution diagnostic | Avoids causal overclaim. |
| HoVer human-validated | HoVer candidate missing-hop control | No HoVer human audit. |
| FEVER full validation | FEVER subset control | Bounded external probe. |

Table 7: Terminology mapping used throughout the paper.

The taxonomy supports three bounded claims: NEI evaluation is construction-sensitive; easy NEI can inflate apparent competence; and hard insufficient evidence should be validated when it supports central claims. It does not imply that NEI-CAP solves evidence sufficiency, that all hard-NEI families are equivalent, or that fixed-claim substitution proves clean counterfactual evidence use.

## Appendix B Dataset Construction and Manifests

This appendix documents the dataset assets and manifest requirements behind NEI\-CAP\. The main paper reports only compact descriptions; the appendix records the construction metadata needed for reproducible construction\-aware evaluation\.

### B.1 Example Representation

Each example is represented as:

$$x_i = (c_i, E_i, y_i, z_i, g_i, m_i),$$

where $c_i$ is the claim, $E_i$ is the evidence set, $y_i \in \{\text{Support}, \text{Refute}, \text{NEI}\}$ is the label, $z_i$ is the construction family, $g_i$ is a group identifier, and $m_i$ stores provenance and audit metadata. The construction variable $z_i$ is not given to the model; it is used for splitting, auditing, and reporting.
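A minimal sketch of how the example tuple could be held in code, assuming a Python dataclass. The class name, field names, and label strings are hypothetical; the point is only that the construction family travels with the example while staying out of the model input.

```python
from dataclasses import dataclass, field
from typing import Any

LABELS = {"SUPPORT", "REFUTE", "NEI"}  # assumed label strings


@dataclass(frozen=True)
class Example:
    claim: str                   # c_i
    evidence: tuple[str, ...]    # E_i
    label: str                   # y_i
    construction: str            # z_i, used for splitting, auditing, reporting
    group_id: str                # g_i
    metadata: dict[str, Any] = field(default_factory=dict)  # m_i

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")

    def model_input(self) -> tuple[str, str]:
        """Return only what the verifier sees: the claim and joined evidence."""
        return self.claim, " ".join(self.evidence)


ex = Example(
    claim="Aspirin reduces fever.",
    evidence=("Aspirin is a drug.",),
    label="NEI",
    construction="bm25_near_miss",
    group_id="claim-1",
)
print(ex.model_input())
```

Keeping `construction` as a field that `model_input` never reads mirrors the requirement that the construction variable is hidden from the model.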

### B.2 Manifest Schema

| Field | Type | Description |
| --- | --- | --- |
| example ID | string | Unique claim–evidence instance ID. |
| claim ID | string | Original claim ID. |
| group ID | string | Group key for group-disjoint splitting. |
| source data | string | SciFact, FEVER, HoVer, or derived resource. |
| claim | string | Claim text. |
| evidence | string/list | Evidence text or list of evidence units. |
| label | categorical | Support, Refute, or NEI. |
| construction | categorical | Placeholder, BM25 near-miss, same-document, etc. |
| split | categorical | Train, development, test, or audit. |
| document ID | string/list | Source document IDs when available. |
| sentence IDs | string/list | Sentence or rationale IDs when available. |
| retrieval method | string | Retrieval or sampling method. |
| retrieval rank | integer | Rank of retrieved evidence. |
| BM25 score | float | Retrieval score, when applicable. |
| sentence position | integer/list | Sentence position within document. |
| validation status | categorical | Not validated, candidate, valid NEI, contaminated, or ambiguous. |
| adjudicated label | categorical | Final adjudicated label if available. |

Table 8: Recommended NEI-CAP manifest schema. Construction metadata is required to report results by evidence condition rather than by label alone.

### B.3 SciFact Suite

The core SciFact suite contains five variants: placeholder, random irrelevant, position-biased, BM25 near-miss, and cited non-rationale. The Support and Refute portions remain comparable across variants; only the NEI evidence condition changes. This design supports train/test construction-shift evaluation.

### B.4 Group-Disjoint Splitting

Construction variants derived from the same claim or claim–document group must not leak across train and test\. NEI\-CAP therefore uses group\-disjoint splitting keyed by claim, claim–document pair, or fixed\-claim substitution group where applicable\. This prevents a model from exploiting claim memorization across evidence variants\.
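One simple way to obtain a group-disjoint split is to assign each group key to a split deterministically, so every construction variant of the same claim lands on the same side. The sketch below uses a hash of the group ID; this is a stand-in technique chosen for illustration, not the paper's stated procedure, and the function names and split fractions are assumptions.

```python
import hashlib


def assign_split(group_id, dev_frac=0.1, test_frac=0.2):
    """Deterministically map a group key to 'train', 'dev', or 'test'."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + dev_frac:
        return "dev"
    return "train"


def split_examples(examples):
    """examples: iterable of dicts that each carry a 'group_id' key."""
    splits = {"train": [], "dev": [], "test": []}
    for ex in examples:
        splits[assign_split(ex["group_id"])].append(ex)
    return splits


# Hypothetical manifest rows: two construction variants per claim group.
examples = [
    {"group_id": f"claim-{i}", "construction": family}
    for i in range(50)
    for family in ("placeholder", "bm25_near_miss")
]
splits = split_examples(examples)
print({name: len(rows) for name, rows in splits.items()})
```

Because the assignment depends only on the group key, adding a new construction variant for an existing claim cannot move that claim across splits.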

### B.5 Human-Adjudicated Assets

Human-adjudicated assets are separated from automatically constructed candidate data. The SciFact audit validates BM25/cited candidate hard NEI; the fixed-claim/same-document audit validates same-claim and same-document diagnostics. These assets estimate semantic contamination and define human-adjudicated hard-NEI subsets for model evaluation.

### B.6 Human-Hard Evaluation Sampling Trail

The SciFact human-audit pool estimates label validity, while the model evaluation uses only the held-out group-disjoint test split. Table [9](https://arxiv.org/html/2605.26663#A2.T9) records the paper-facing sampling trail used to interpret the $n = 54$ hard-NEI evaluation in Table [3](https://arxiv.org/html/2605.26663#S5.T3).

| Stage | Count |
| --- | --- |
| Candidate BM25/cited hard-NEI audit pool | 250 |
| Human-valid NEI after adjudication | 223 |
| Human-adjudicated hard-NEI subtype | 195 |
| Held-out group-disjoint model-evaluation split | 54 |

Table 9: Sampling trail for the SciFact human-hard model-evaluation subset. The larger audit pool estimates label validity; the held-out split supports model evaluation.

### B.7 External Controls

FEVER and HoVer are included as bounded external controls. FEVER provides a non-toy Wikipedia subset control. HoVer provides a candidate missing-hop control for multi-hop insufficiency. Neither is treated as human-validated hard NEI.

### B.8 Construction Split Statistics and Leakage Audit

The additional experiments add construction-level split statistics for each paper-facing family. The audit records label counts, claim and document group counts, evidence length, sentence count, placeholder rate, claim–evidence overlap, coverage, retrieval metadata, duplicate counts, and missing-field checks. Table [10](https://arxiv.org/html/2605.26663#A2.T10) gives a compact test-split view for the core SciFact suite; full CSV artifacts will be released with the accompanying code and data package.

| Variant | n | SUP/REF/NEI | Avg. tok. | Coverage |
| --- | --- | --- | --- | --- |
| Placeholder | 177 | 76/40/61 | 174.7 | 0.444 |
| Random irrelevant | 177 | 76/40/61 | 236.8 | 0.474 |
| Position-biased | 177 | 76/40/61 | 181.7 | 0.501 |
| BM25 near-miss | 177 | 76/40/61 | 251.4 | 0.630 |
| Cited non-rationale | 177 | 76/40/61 | 246.9 | 0.590 |

Table 10: Compact test-split statistics for the SciFact construction suite. Each variant keeps comparable Support/Refute/NEI label counts while changing the NEI evidence condition.

The leakage audit reports zero claim-group overlap across train/development/test for the core SciFact construction variants and zero construction-variant cross-split leakage. Document overlap can occur because the same scientific paper may be relevant to multiple claims; we report it as source-distribution metadata rather than claim leakage.
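A claim-group leakage check of this kind reduces to pairwise set intersections over split membership. The sketch below is an assumed minimal form; the function name and the toy group IDs are hypothetical, and the released audit may record more than this.

```python
def group_overlap(splits):
    """Return the group IDs shared by each pair of splits.

    splits: dict mapping split name -> iterable of group IDs.
    An empty set for every pair means no claim-group leakage.
    """
    names = sorted(splits)
    sets = {name: set(splits[name]) for name in names}
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlaps[(first, second)] = sets[first] & sets[second]
    return overlaps


# Hypothetical group IDs with one deliberate leak ("c2") for illustration.
toy_splits = {"train": ["c1", "c2"], "dev": ["c3"], "test": ["c4", "c2"]}
leaks = group_overlap(toy_splits)
print(leaks)
```

The same function can be run on document IDs instead of claim groups to produce the source-distribution metadata mentioned above, where non-empty overlap is expected and reported rather than treated as leakage.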

### B.9 Asset Map

The accompanying release links each reported result to its source dataset, construction family, split policy, model configuration, seed set, prediction artifact, and evaluation output\. The full machine\-readable asset map is provided with the released manifests\.

## Appendix C Artifact Audit Statistics

NEI-CAP audits construction families before interpreting model performance. The audit asks whether NEI examples can be separated using superficial evidence-side features rather than evidence-sufficiency reasoning.

### C.1 Audited Features

We audit evidence length, number of evidence sentences, claim–evidence lexical overlap, claim–evidence coverage, placeholder rate, sentence position, source concentration, and discourse markers\. These features do not determine semantic validity; they identify shortcut risk\.
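A few of these features can be sketched in a handful of lines. The definitions below are assumptions made for illustration: tokens are lowercased alphanumeric runs, overlap is counted over unique tokens, and coverage is taken to be the fraction of unique claim tokens that also appear in the evidence. The paper's exact tokenization and coverage definition may differ.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def shallow_features(claim, evidence):
    """Evidence-side shortcut features for one claim-evidence pair."""
    evidence_tokens = tokens(evidence)
    claim_set, evidence_set = set(tokens(claim)), set(evidence_tokens)
    shared = claim_set & evidence_set
    union = claim_set | evidence_set
    return {
        "evidence_tokens": len(evidence_tokens),
        "overlap_count": len(shared),
        "jaccard": len(shared) / len(union) if union else 0.0,
        "coverage": len(shared) / len(claim_set) if claim_set else 0.0,
    }


features = shallow_features("Aspirin reduces fever", "Aspirin is a drug")
print(features)
```

Averaging such features per construction family gives the kind of table reported in this appendix; large gaps between families flag shortcut risk without saying anything about semantic validity.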

### C.2 Construction-Level Summary

| Construction group | n | Avg. sent. | Avg. tok. | Coverage | Status |
| --- | --- | --- | --- | --- | --- |
| SciFact BM25/cited hard NEI | 195 | 11.49 | 216.56 | 0.483 | Human-adjudicated |
| Same-document hard NEI | 114 | 1.04 | 23.93 | 0.233 | Human-adjudicated |
| Fixed-claim hard NEI | 114 | 1.23 | 23.93 | 0.233 | Human-adjudicated |
| Placeholder NEI | 61 | 1.00 | 2.00 | 0.005 | Constructed only |
| Random irrelevant NEI | 61 | 9.70 | 182.46 | 0.092 | Constructed only |

Table 11: Artifact audit summary by construction group. Placeholder NEI has extremely short evidence and near-zero coverage; human-adjudicated hard NEI contains substantive evidence.

### C.3 Overlap and Marker Statistics

| Construction group | Overlap count | Jaccard | Context marker | Method marker |
| --- | --- | --- | --- | --- |
| SciFact BM25/cited hard NEI | 6.24 | 0.053 | 0.210 | 0.226 |
| Same-document hard NEI | 3.06 | 0.098 | 0.316 | 0.140 |
| Fixed-claim hard NEI | 3.06 | 0.098 | 0.316 | 0.140 |
| Placeholder NEI | 0.05 | 0.005 | 0.000 | 0.000 |
| Random irrelevant NEI | 1.31 | 0.013 | 0.148 | 0.213 |

Table 12: Claim–evidence overlap and marker statistics. Random irrelevant NEI contains long evidence but low overlap, indicating a topic-mismatch shortcut rather than evidence insufficiency.

### C.4 Hard-NEI Subtypes

| Group | Broad topic | Near-miss | Partial |
| --- | --- | --- | --- |
| SciFact BM25/cited hard NEI | 62 | 102 | 31 |
| Fixed-claim/same-document hard NEI | 50 | 49 | 15 |

Table 13: Subtype distribution for human-adjudicated hard NEI. Hard NEI is heterogeneous rather than a single condition.

### C.5 Interpretation

The audit supports three conclusions. First, placeholder NEI exposes strong format and absence cues. Second, random irrelevant NEI mainly tests topic mismatch. Third, human-adjudicated hard NEI is heterogeneous across BM25/cited, same-document, and fixed-claim settings. These findings motivate construction-specific reporting rather than aggregate NEI-F1 alone.

## Appendix D Human Validation Protocol and Adjudication

Hard insufficient evidence can be noisy. Retrieved near-misses, cited non-rationales, and same-document non-rationales may contain implicit support, implicit refutation, or ambiguity. NEI-CAP therefore separates automatically constructed candidate hard NEI from human-adjudicated hard NEI.

### D\.1Validation Goal

Human validation determines whether a candidate hard\-NEIexample is truly insufficient\. The resulting labels are used to estimate semantic contamination and to define human\-adjudicated hard\-NEIsubsets for evaluation\.

### D.2 Annotation Labels

| Label | Definition |
|---|---|
| truly_insufficient | Evidence does not support or refute the claim. |
| actually_supported | Evidence supports the claim. |
| actually_contradicted | Evidence refutes the claim; reported as Refute in paper-facing terminology. |
| ambiguous | Status cannot be determined reliably. |
| invalid_or_unreadable | Claim or evidence is malformed, missing, unreadable, or out of scope. |

Table 14: Human validation label schema. Contamination includes actually supported, actually contradicted/refuted, ambiguous, invalid, and unreadable cases.

### D.3 Reporting Metrics

We report:

$$
\mathrm{valid\ NEI} = \frac{N_{\mathrm{insufficient}}}{N_{\mathrm{audited}}},
$$

and:

$$
\mathrm{contam.} = \frac{N_{\mathrm{sup}} + N_{\mathrm{ref}} + N_{\mathrm{amb/invalid}}}{N_{\mathrm{audited}}}.
$$

AI-assisted checks, where used, are treated only as triage or secondary support, not as human validation.
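
The two rates above can be sketched as a small function over adjudicated labels. This is a minimal illustration, assuming the label names from Table 14; the function name `audit_rates` is hypothetical, and the example counts are not reported in the paper but were chosen to be consistent with the rates in Table 15 for 250 audited candidates.

```python
from collections import Counter

# Label names follow the schema in Table 14.
CONTAMINATION_LABELS = {
    "actually_supported",
    "actually_contradicted",
    "ambiguous",
    "invalid_or_unreadable",
}


def audit_rates(labels):
    """Return (valid NEI rate, contamination rate) for adjudicated labels."""
    counts = Counter(labels)
    n_audited = sum(counts.values())
    if n_audited == 0:
        raise ValueError("no audited examples")
    valid = counts["truly_insufficient"] / n_audited
    contam = sum(counts[name] for name in CONTAMINATION_LABELS) / n_audited
    return valid, contam


# Hypothetical counts consistent with the rates in Table 15 (250 audited);
# the paper reports rates, not these raw counts.
example_labels = (
    ["truly_insufficient"] * 223
    + ["actually_supported"] * 14
    + ["actually_contradicted"] * 13
)
valid_rate, contam_rate = audit_rates(example_labels)
print(f"valid NEI = {valid_rate:.3f}, contamination = {contam_rate:.3f}")
```

Because every audited example receives exactly one label, the two rates sum to one.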

### D.4 SciFact Hard-NEI Audit

| Metric | Observed | 95% low | 95% high |
|---|---|---|---|
| Valid NEI rate | 0.892 | 0.852 | 0.928 |
| Contamination rate | 0.108 | 0.072 | 0.148 |
| Actually supported rate | 0.056 | 0.028 | 0.088 |
| Actually refuted rate | 0.052 | 0.028 | 0.084 |
| Ambiguous/invalid rate | 0.000 | 0.000 | 0.000 |
| Hard subtype rate among valid NEI | 0.874 | 0.826 | 0.919 |
| Topic-unrelated rate among valid NEI | 0.117 | 0.076 | 0.164 |

Table 15: SciFact hard-NEI audit validation over 250 candidate hard-NEI examples. Intervals are bootstrap 95% intervals.

### D.5 Fixed-Claim and Same-Document Validation

| Subset | Rows | Contam. | Hard rows | Hard rate |
|---|---|---|---|---|
| Final adjudicated labels | 122 | 0.0328 | 114 | 0.966 |
| Human-validated NEI | 118 | 0.0000 | 114 | 0.966 |
| Human-validated hard NEI | 114 | 0.0000 | 114 | 1.000 |
| Same-claim hard NEI | 114 | 0.0000 | 114 | 1.000 |
| Same-document hard NEI | 114 | 0.0000 | 114 | 1.000 |

Table 16: Fixed-claim/same-document validation summary. The 114 hard-NEI examples support same-claim and same-document diagnostics.

### D.6 Annotator and Adjudication Protocol

The revised protocol documentation clarifies that two PhD annotators participated in human validation and that final labels are consensus adjudications\. Annotators independently evaluated claim plus candidate evidence for semantic sufficiency\. Blinded packets removed LLM labels, gold labels, model predictions, and construction provenance from primary annotation fields\. AI\-assisted checks were used only as secondary support or triage and are never reported as standalone human validation\.

Pre-consensus agreement between the two annotators was high. The small number of disagreements concerned fine-grained sub-label boundaries within the contamination categories. The hard-NEI subsets used for paper-facing model evaluation therefore depend on a coarser decision than the one on which annotators occasionally diverged. Disagreements were resolved through joint adjudication; the resulting consensus labels are the labels used throughout the paper.

### D.7 Boundary

The paper may claim that the SciFact hard-NEI audit and the fixed-claim/same-document audit provide human-adjudicated hard-NEI subsets. It should not claim that human validation eliminates all ambiguity, that candidate-only examples are valid without adjudication, or that FEVER/HoVer controls are human-validated hard NEI.

## Appendix E Experimental Setup and Model Details

NEI-CAP uses models as diagnostic probes of construction sensitivity, not as proposed architectures.

### E.1 Task Format

All neural experiments are three-way claim–evidence classification:

$$
y \in \{\textsc{Support}, \textsc{Refute}, \textsc{NEI}\}.
$$

The model receives claim and evidence text. Construction family is not provided as model input; it defines train/evaluation variants and reporting groups.

### E.2 Primary SciFact Configuration

| Setting | Value |
|---|---|
| Primary model | deberta-v3-base |
| Input root | NEI-CAP SciFact suite |
| Train variants | placeholder, cited, random irrelevant, BM25 near-miss, position-biased |
| Evaluation variants | placeholder, cited, random irrelevant, BM25 near-miss, position-biased |
| Evaluation split | test |
| Max length | 384 |
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | $2.0 \times 10^{-5}$ |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Class weighting | enabled |
| Early stopping metric | development Macro-F1 |
| Patience | 1 |
| Precision | bf16 if available, otherwise fp32 |
| Seeds | 13, 17, 23 |
| Checkpoint policy | save best checkpoint |
| Prediction logging | enabled |

Table 17: Primary SciFact construction-matrix configuration.

### E.3 Construction Matrix

The primary matrix trains one verifier for each NEI construction and evaluates it on all five constructions. Each cell asks whether performance transfers across evidence conditions while Support and Refute portions remain comparable.
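
The shape of the matrix can be sketched by enumerating its cells. This is only an illustration: the identifier spellings and the function `matrix_cells` are hypothetical, and it assumes that each of the three primary seeds trains its own verifier per construction, which the configuration table suggests but does not state explicitly.

```python
from itertools import product

# Construction families and seeds from Table 17; identifier spellings are assumed.
CONSTRUCTIONS = (
    "placeholder",
    "cited",
    "random_irrelevant",
    "bm25_near_miss",
    "position_biased",
)
SEEDS = (13, 17, 23)


def matrix_cells(constructions=CONSTRUCTIONS, seeds=SEEDS):
    """Yield (train, eval, seed) for every cell of the construction matrix."""
    for train, seed in product(constructions, seeds):
        for eval_variant in constructions:
            yield train, eval_variant, seed


cells = list(matrix_cells())
trained = {(train, seed) for train, _, seed in cells}
matched = [cell for cell in cells if cell[0] == cell[1]]
print(len(trained), "verifiers,", len(cells), "cells,", len(matched), "matched")
```

Under that assumption the matrix has 15 trained verifiers and 75 evaluation cells, of which the 15 diagonal cells are the matched conditions used later for drop comparisons.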

### E.4 Secondary Backbones

RoBERTa-base and SciBERT are secondary robustness probes. They reuse the same construction suite and train/evaluation variants. Secondary models supplement, but do not replace, the locked primary DeBERTa matrix.

### E.5 Metrics

For three-way verification, we report accuracy, Macro-F1, class-specific F1, and especially NEI-F1. For one-class human-hard evaluation, Macro-F1 is not meaningful because every evaluated example is NEI; there we instead report NEI recall, false Support rate, false Refute rate, and mean predicted class probabilities.
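
A minimal pure-Python sketch of class-specific F1 and Macro-F1 makes the one-class caveat concrete. The function names are illustrative rather than taken from the paper's evaluation scripts, and the toy predictions are hypothetical.

```python
LABELS = ("Support", "Refute", "NEI")


def class_f1(gold, pred, label):
    """F1 for one class; 0.0 when there are no true positives."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of the per-class F1 scores."""
    return sum(class_f1(gold, pred, label) for label in labels) / len(labels)


# Hypothetical one-class subset: every gold label is NEI.
gold = ["NEI"] * 4
pred = ["NEI", "NEI", "Support", "Refute"]
print(f"NEI-F1 = {class_f1(gold, pred, 'NEI'):.3f}")
print(f"Macro-F1 = {macro_f1(gold, pred):.3f}")
```

On a subset where only NEI appears in the gold labels, the Support and Refute F1 terms are zero by construction, so Macro-F1 mostly reflects the absent classes rather than model behavior.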

### E.6 Robustness Classification

For secondary-backbone SciFact probes, construction sensitivity is considered replicated when placeholder matched NEI-F1 is high, placeholder-to-hard drop is large, and at least one hard construction has non-trivial matched NEI-F1. These thresholds are diagnostic for this study, not universal standards.

### E.7 Additional Experiments

Additional experiments add targeted checks without overwriting the primary outputs\. The new runs use five seeds, \{13,17,23,29,37\}, for expanded placeholder\-to\-hard uncertainty and mixed\-construction training\. Mixed\-construction regimes include an easy mixture over shortcut\-prone constructions, a hard mixture over BM25/cited constructions, and a balanced mixture over all five SciFact construction families\. These experiments test whether construction sensitivity persists under more realistic training distributions; they are not treated as new model architectures\.

### E.8 Result Provenance

Each reported result should be traceable to a dataset manifest, construction family, model configuration, seed set, prediction file, evaluation output, and paper table or figure\. Aggregate scores without construction metadata are insufficient for NEI\-CAP\.

## Appendix F Full SciFact Construction Matrices

This appendix reports the full locked DeBERTa SciFact construction-shift matrices.

### F.1 NEI-F1 Matrix

| Train NEI | BM25 | Cited | Placeholder | Position | Random |
|---|---|---|---|---|---|
| BM25 near-miss | 0.675 | 0.561 | 0.725 | 0.652 | 0.676 |
| Cited non-rationale | 0.508 | 0.578 | 0.768 | 0.768 | 0.599 |
| Placeholder | 0.000 | 0.000 | 1.000 | 0.947 | 0.000 |
| Position-biased | 0.000 | 0.000 | 1.000 | 1.000 | 0.063 |
| Random irrelevant | 0.445 | 0.294 | 0.995 | 0.992 | 0.995 |

Table 18: Full SciFact construction-shift matrix for NEI-F1. Rows are train constructions; columns are evaluation constructions.

### F.2 Macro-F1 Matrix

| Train NEI | BM25 | Cited | Placeholder | Position | Random |
|---|---|---|---|---|---|
| BM25 near-miss | 0.529 | 0.479 | 0.559 | 0.530 | 0.532 |
| Cited non-rationale | 0.357 | 0.386 | 0.474 | 0.474 | 0.399 |
| Placeholder | 0.277 | 0.277 | 0.684 | 0.657 | 0.276 |
| Position-biased | 0.262 | 0.262 | 0.665 | 0.665 | 0.285 |
| Random irrelevant | 0.474 | 0.415 | 0.716 | 0.715 | 0.716 |

Table 19: Full SciFact construction-shift matrix for Macro-F1. Aggregate performance can appear strong under easy matched evaluation while hiding hard-NEI failure.

### F.3 Drop Summary

| Train NEI | Matched | BM25 | Cited | Hard drop |
|---|---|---|---|---|
| Placeholder | 1.000 | 0.000 | 0.000 | 1.000 |
| Position-biased | 1.000 | 0.000 | 0.000 | 1.000 |
| Random irrelevant | 0.995 | 0.445 | 0.294 | 0.625 |
| BM25 near-miss | 0.675 | 0.675 | 0.561 | 0.057 |
| Cited non-rationale | 0.578 | 0.508 | 0.578 | 0.035 |

Table 20: Matched and hard-evaluation NEI-F1 comparisons. Hard drop compares matched performance against average BM25/cited performance.
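
The hard-drop column can be recomputed from the other three columns. The sketch below assumes that "compares" means a simple difference between matched NEI-F1 and the mean of the BM25 and cited scores, which is what the reported values are consistent with; the function name `hard_drop` is illustrative.

```python
# NEI-F1 values copied from Table 20: (matched, BM25, cited).
TABLE_20 = {
    "Placeholder": (1.000, 0.000, 0.000),
    "Position-biased": (1.000, 0.000, 0.000),
    "Random irrelevant": (0.995, 0.445, 0.294),
    "BM25 near-miss": (0.675, 0.675, 0.561),
    "Cited non-rationale": (0.578, 0.508, 0.578),
}


def hard_drop(matched, bm25, cited):
    """Matched NEI-F1 minus the mean of BM25 and cited NEI-F1."""
    return matched - (bm25 + cited) / 2


for train, scores in TABLE_20.items():
    print(f"{train:<20} hard drop = {hard_drop(*scores):.4f}")
```

The recomputed values agree with the table up to rounding of the three-decimal inputs.
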
### F.4 Interpretation

The matrix shows that NEI is not construction-free. Placeholder and position-biased training yield high matched performance but collapse on hard constructions. BM25 and cited training do not yield perfect matched scores, but they transfer more stably across hard constructions.

## Appendix G Human-Validated Hard-NEI Evaluation

This appendix reports full evaluation on the held-out SciFact human-adjudicated hard-NEI test split ($n=54$). The larger audit pool and sampling trail are documented in Appendix [B](https://arxiv.org/html/2605.26663#A2); since all evaluated examples are validated as NEI, Macro-F1 is not reported.

### G.1 Recall and Error Rates

| Train NEI | Recall | 95% low | 95% high | False SUP | False REF | n |
|---|---|---|---|---|---|---|
| BM25 near-miss | 0.691 | 0.593 | 0.790 | 0.142 | 0.167 | 54 |
| Cited non-rationale | 0.679 | 0.562 | 0.790 | 0.074 | 0.247 | 54 |
| Random irrelevant | 0.216 | 0.123 | 0.321 | 0.574 | 0.210 | 54 |
| Placeholder | 0.000 | 0.000 | 0.000 | 0.870 | 0.130 | 54 |
| Position-biased | 0.000 | 0.000 | 0.000 | 0.907 | 0.093 | 54 |

Table 21: Primary DeBERTa evaluation on human-adjudicated hard NEI. False REF maps internal contradiction errors to the paper-facing Refute label.

### G.2 Predicted Probabilities

| Train NEI | Mean P(NEI) | 95% low | 95% high | Mean P(SUP) | Mean P(REF) | n |
|---|---|---|---|---|---|---|
| BM25 near-miss | 0.398 | 0.366 | 0.431 | 0.324 | 0.277 | 54 |
| Cited non-rationale | 0.416 | 0.371 | 0.458 | 0.318 | 0.266 | 54 |
| Random irrelevant | 0.212 | 0.129 | 0.304 | 0.417 | 0.371 | 54 |
| Placeholder | 0.013 | 0.012 | 0.013 | 0.540 | 0.448 | 54 |
| Position-biased | 0.014 | 0.013 | 0.014 | 0.551 | 0.435 | 54 |

Table 22: Mean predicted probabilities on human-adjudicated hard NEI. Easy training assigns near-zero NEI probability to validated hard insufficient evidence and reallocates probability mass to Support and Refute.

### G.3 Interpretation

Placeholder and position-biased training fail to recognize validated hard NEI. Their errors are dominated by false Support, and their mean $P(\mathrm{NEI})$ is near zero while $P(\mathrm{Support})$ and $P(\mathrm{Refute})$ absorb most probability mass. Random-irrelevant training also performs poorly, showing that topic-unrelated NEI does not teach the intended hard-insufficiency behavior. BM25 and cited training partially recover hard-NEI recognition but do not solve the task.

## Appendix H Multi-Model Robustness

This appendix documents secondary\-backbone robustness and five\-seed revision checks\. RoBERTa and SciBERT are used as diagnostic probes; they do not replace the locked primary DeBERTa matrix\.

### H.1 Five-Seed Placeholder-to-Hard Drop

The main paper reports the exact five-seed placeholder-to-hard comparison. Across DeBERTa, RoBERTa, and SciBERT, placeholder-trained verifiers obtain PH$\rightarrow$PH NEI-F1 of 1.000 and PH$\rightarrow$BM25/Cited NEI-F1 of 0.000 for all five seeds. We omit the duplicate table here and use this appendix for expanded secondary-backbone, mixed-training, and same-document diagnostics.

### H.2 Secondary Backbones on Human-Adjudicated Hard NEI

| Model | Train NEI | Recall | False SUP | False REF | Mean P(NEI) |
|---|---|---|---|---|---|
| SciBERT | BM25 near-miss | 0.838 | 0.070 | 0.092 | 0.710 |
| SciBERT | Cited non-rationale | 0.875 | 0.063 | 0.062 | 0.692 |
| SciBERT | Placeholder | 0.000 | 0.574 | 0.426 | 0.011 |
| SciBERT | Position-biased | 0.007 | 0.838 | 0.156 | 0.024 |
| SciBERT | Random irrelevant | 0.099 | 0.520 | 0.381 | 0.114 |
| RoBERTa | BM25 near-miss | 0.703 | 0.116 | 0.181 | 0.404 |
| RoBERTa | Cited non-rationale | 0.638 | 0.109 | 0.253 | 0.435 |
| RoBERTa | Placeholder | 0.000 | 0.867 | 0.133 | 0.004 |
| RoBERTa | Position-biased | 0.009 | 0.932 | 0.060 | 0.014 |
| RoBERTa | Random irrelevant | 0.186 | 0.586 | 0.227 | 0.191 |

Table 23: Secondary-backbone evaluation on human-adjudicated hard NEI. Placeholder-trained RoBERTa and SciBERT both collapse on validated hard insufficient evidence.

### H.3 Mixed-Construction Training

We evaluate mixed-construction training regimes. Table [24](https://arxiv.org/html/2605.26663#A8.T24) summarizes the DeBERTa revision results. Easy mixtures remain weak on hard conditions, while hard and balanced mixtures improve hard-NEI recognition. These regimes are less artificial than single-construction training, but the results still call for construction-stratified reporting.

| Regime | BM25 | Cited | SciFact hard |
|---|---|---|---|
| Easy mixture | 0.379 | 0.273 | 0.178 |
| Hard mixture | 0.662 | 0.652 | 0.770 |
| Balanced mixture | 0.802 | 0.803 | 0.915 |

Table 24: DeBERTa mixed-construction training summary. BM25 and Cited are construction-stratified NEI-F1; SciFact hard is one-class NEI recall on human-adjudicated hard NEI. Same-document results are reported separately because that family is constructionally distinct.

| Regime | PH | Position | Random | BM25 | Cited |
|---|---|---|---|---|---|
| Easy mixture | 0.989 | 0.989 | 0.960 | 0.379 | 0.273 |
| Hard mixture | 0.754 | 0.729 | 0.694 | 0.662 | 0.652 |
| Balanced mixture | 0.903 | 0.903 | 0.901 | 0.802 | 0.803 |

Table 25: Full DeBERTa mixed-construction stratified NEI-F1 matrix. The balanced mixture improves on the hard mixture in every condition but still yields a construction-specific profile: easy conditions remain higher than BM25/cited hard conditions. PH denotes placeholder.

### H.4 Same-Document Robustness

| Model | Train NEI | Recall | False SUP | False REF |
|---|---|---|---|---|
| SciBERT | BM25 near-miss | 0.482 | 0.231 | 0.287 |
| SciBERT | Cited non-rationale | 0.731 | 0.143 | 0.126 |
| SciBERT | Placeholder | 0.000 | 0.509 | 0.491 |
| SciBERT | Position-biased | 1.000 | 0.000 | 0.000 |
| SciBERT | Random irrelevant | 0.216 | 0.415 | 0.368 |
| RoBERTa | BM25 near-miss | 0.816 | 0.023 | 0.161 |
| RoBERTa | Cited non-rationale | 0.871 | 0.000 | 0.129 |
| RoBERTa | Placeholder | 0.526 | 0.412 | 0.061 |
| RoBERTa | Position-biased | 1.000 | 0.000 | 0.000 |
| RoBERTa | Random irrelevant | 0.950 | 0.029 | 0.020 |

Table 26: Secondary-backbone same-document evaluation. Same-document hard NEI behaves differently from SciFact BM25/cited hard NEI, reinforcing hard-NEI heterogeneity.

### H.5 Boundary

The secondary models support the claim that construction sensitivity persists beyond DeBERTa\. They do not show that all model families behave identically, and they do not replace the primary SciFact matrix\.

## Appendix I Same-Claim and Same-Document Diagnostics

Fixed\-claim/same\-document diagnostics control the claim or source document while varying evidence\. They are decomposition tests, not proofs of clean counterfactual evidence use\.

### I.1 Setup

The same-claim diagnostic pairs the same claim with reference evidence and with human-adjudicated insufficient evidence. The same-document diagnostic draws insufficient evidence from the same source document as the reference evidence. We report hard-NEI recall, false Support/Refute rates, reference-side accuracy, probability-drop success, strict swap success, and the mean drop in the reference-label probability. The reference label is Support or Refute, so this metric tracks non-NEI verification confidence rather than only NEI recall.

### I.2 Primary DeBERTa Fixed-Claim Diagnostics

| Train NEI | Hard recall | Swap success | Mean $\Delta$ ref |
|---|---|---|---|
| BM25 near-miss | 0.661 | 0.535 | 0.017 |
| Cited non-rationale | 0.988 | 0.743 | 0.023 |
| Placeholder | 0.898 | 0.737 | 0.112 |
| Position-biased | 1.000 | 0.728 | 0.034 |
| Random irrelevant | 0.947 | 0.807 | 0.124 |

Table 27: Primary DeBERTa fixed-claim diagnostics. The same claim is paired with reference evidence and with insufficient evidence, so swap success and mean reference-label probability drop are defined.

### I.3 Primary DeBERTa Same-Document Hard-NEI Diagnostics

| Train NEI | Hard recall | False SUP | False REF |
|---|---|---|---|
| BM25 near-miss | 0.661 | 0.310 | 0.029 |
| Cited non-rationale | 0.988 | 0.009 | 0.003 |
| Placeholder | 0.898 | 0.085 | 0.018 |
| Position-biased | 1.000 | 0.000 | 0.000 |
| Random irrelevant | 0.947 | 0.053 | 0.000 |

Table 28: Primary DeBERTa same-document hard-NEI diagnostics. The hard subset is shared with the fixed-claim diagnostic, so hard recall matches Table [27](https://arxiv.org/html/2605.26663#A9.T27); same-document evaluation does not define swap success or reference-label probability drop, and instead reports error rates on the insufficient-evidence side.

### I.4 Secondary-Backbone Same-Claim Diagnostics

| Model | Train NEI | Ref. acc. | Hard recall | Prob. drop | Strict swap |
|---|---|---|---|---|---|
| SciBERT | Placeholder | 0.599 | 0.000 | 0.494 | 0.000 |
| SciBERT | Cited non-rationale | 0.488 | 0.731 | 0.827 | 0.365 |
| SciBERT | Random irrelevant | 0.556 | 0.216 | 0.602 | 0.140 |
| SciBERT | BM25 near-miss | 0.529 | 0.482 | 0.734 | 0.289 |
| SciBERT | Position-biased | 0.006 | 1.000 | 0.883 | 0.006 |
| RoBERTa | Placeholder | 0.401 | 0.503 | 0.643 | 0.137 |
| RoBERTa | Cited non-rationale | 0.108 | 0.833 | 0.614 | 0.053 |
| RoBERTa | Random irrelevant | 0.091 | 0.956 | 0.760 | 0.079 |
| RoBERTa | BM25 near-miss | 0.135 | 0.763 | 0.605 | 0.050 |
| RoBERTa | Position-biased | 0.009 | 1.000 | 0.538 | 0.009 |

Table 29: Secondary-backbone same-claim diagnostics. Results motivate the bounded interpretation: hard-side recognition and reference-side verification can diverge.

### I.5 Same-Document Artifact Audit

We audit whether the same-document hard-NEI family has a shallow construction signature. A shallow feature classifier nearly separates same-document hard NEI from several other construction groups using evidence length, sentence count, overlap, and coverage features. Table [30](https://arxiv.org/html/2605.26663#A9.T30) reports the compact audit result. The matched length/coverage audit is underpowered, so we do not overinterpret matched-subset comparisons.

| Comparison | Acc. | Macro-F1 | Status |
|---|---|---|---|
| Same-doc vs BM25/cited hard | 0.997 | 0.997 | completed |
| Same-doc vs placeholder | 1.000 | 1.000 | completed |
| Same-doc vs random irrelevant | 1.000 | 1.000 | completed |
| Same-doc vs constructed BM25/cited | 1.000 | 1.000 | completed |
| Matched feature audit | – | – | underpowered |

Table 30: Same-document artifact audit. Same-document hard NEI is human-adjudicated but constructionally distinct; it should be reported as a source-controlled diagnostic family, not as artifact-free universal hard NEI.

### I.6 Interpretation

Same-claim diagnostics separate two abilities: recognizing that semantically related evidence is insufficient, and assigning higher confidence to the reference label when decisive evidence is present. The positive mean reference-label deltas in Table [27](https://arxiv.org/html/2605.26663#A9.T27) show that substituting insufficient evidence lowers confidence in the original Support/Refute label. A model may still succeed on the hard-NEI side while remaining weak on the reference-evidence side. Therefore, fixed-claim evidence substitution is a useful diagnostic but not a proof of clean counterfactual evidence use.

## Appendix J External Controls: FEVER and HoVer

FEVER and HoVer are bounded external controls. They test whether construction-aware diagnostics are useful beyond SciFact, but they are not human-validated hard-NEI benchmarks.

### J.1 FEVER Subset Control

| Baseline | Train NEI | Placeholder | BM25 | Random | PH$\rightarrow$BM25 drop |
|---|---|---|---|---|---|
| Claim+evidence TF-IDF | Placeholder | 1.000 | 0.000 | 0.000 | 1.000 |
| Claim+evidence TF-IDF | BM25 near-miss | 0.614 | 0.373 | 0.327 | 0.241 |
| Claim+evidence TF-IDF | Random irrelevant | 0.237 | 0.414 | 0.476 | -0.177 |
| Evidence-only TF-IDF | BM25 near-miss | 0.826 | 0.313 | 0.274 | 0.513 |
| Length/overlap logistic regression | BM25 near-miss | 0.772 | 0.406 | 0.536 | 0.366 |

Table 31: FEVER external subset control. Placeholder-trained shallow models solve placeholder NEI while failing on BM25 near-miss and random-irrelevant evaluation.

### J.2 HoVer Candidate Missing-Hop Control

| Train | Eval | NEI-F1 | NEI recall | False SUP | $n_{\mathrm{eval}}$ |
|---|---|---|---|---|---|
| Missing-hop | Missing-hop | 0.835 | 0.756 | 0.097 | 7,978 |
| Missing-hop | Placeholder | 0.973 | 1.000 | 0.000 | 7,978 |
| Placeholder | Missing-hop | 0.000 | 0.000 | 0.431 | 7,978 |
| Placeholder | Placeholder | 1.000 | 1.000 | 0.000 | 7,978 |

Table 32: HoVer candidate missing-hop control. Values are rounded to three decimals. HoVer rows are candidate-only and are not human-validated hard NEI.

### J.3 Boundary

FEVER supports a non-toy subset-level shortcut-control claim. HoVer supports a candidate-only multi-hop construction-sensitivity claim. Neither supports full cross-dataset human-validated hard-NEI generalization.

## Appendix K Diagnostic Case Studies

This appendix provides qualitative examples from row\-level artifacts\. We include only cases with available claim/evidence text and, where model behavior is discussed, saved predictions and probabilities\. We do not reconstruct unavailable cases from aggregate statistics\.

### K.1 Case Selection Policy

Cases are selected only when: the claim and evidence are available, prediction and probability fields are available when model behavior is discussed, and human validation status is available when the case is described as human-adjudicated hard NEI. FEVER and HoVer cases are construction examples or candidate controls unless human-adjudicated.

### K.2 Fixed-Claim Hard-Side NEI Recognition

| Field | Value |
|---|---|
| Dataset | SciFact fixed-claim/same-document |
| Claim | Cholesterol loading induces KLF4 expression in vascular smooth muscle cells, resulting in pro-inflammatory cytokine expression. |
| Construction | Same-claim / same-document human-validated hard NEI |
| Train/seed | Random irrelevant / 17 |
| Ref. pred. | Support |
| Hard pred. | NEI |
| Prob. | $P(\mathrm{reference}) = 0.594$; $P(\mathrm{NEI} \mid E_{\mathrm{hard}}) = 0.978$ |
| Why it matters | The model changes prediction when evidence changes while the claim is fixed, supporting hard-side insufficiency recognition. |

Table 33: Case study: fixed-claim hard-side NEI recognition.

### K.3 Reference-Side Weakness

| Field | Value |
|---|---|
| Dataset | SciFact fixed-claim/same-document |
| Claim | T cell receptor/CD3 microdomains are required to induce the immunologic synapse. |
| Construction | Same-claim reference-side weakness |
| Train/seed | Position-biased / 13 |
| Ref. pred. | NEI |
| Hard pred. | NEI |
| Prob. | $P(\mathrm{reference}) = 0.001$; $P(\mathrm{NEI} \mid E_{\mathrm{hard}}) = 0.998$ |
| Why it matters | The hard side is recognized as NEI, but the reference-evidence side is missed, motivating the same-claim caveat. |

Table 34: Case study: fixed-claim reference-side weakness.

### K.4 Same-Document Human-Validated Hard NEI

| Field | Value |
|---|---|
| Dataset | SciFact fixed-claim/same-document |
| Claim | Removal of H3K9me3 improves reprogramming efficiency in human somatic cell nuclear transfer experiments. |
| Evidence excerpt | Aberrant epigenetic reprogramming can cause developmental defects in somatic cell nuclear transfer embryos. |
| Status | Human-validated hard NEI |
| Prediction | NEI |
| Prob. | $P(\mathrm{NEI}) = 0.998$; $P(\mathrm{Support}) = 0.001$; $P(\mathrm{Refute}) = 0.001$ |
| Why it matters | Same-document evidence controls source/topic better than random irrelevant evidence while remaining insufficient. |

Table 35: Case study: same-document human-validated hard NEI.

### K.5 HoVer Candidate Missing-Hop Control

| Field | Value |
|---|---|
| Dataset | HoVer |
| Claim | Brett Herron’s team competes in the Pro14 and the European Rugby Champions Cup. |
| Construction | Missing-one-supporting-fact candidate control |
| Status | Candidate-only, not human-validated |
| Train/seed | HoVer placeholder / 13 |
| Prediction | Support |
| Prob. | $P(\mathrm{Support}) = 0.981$; $P(\mathrm{NEI}) = 0.000$; $P(\mathrm{Refute}) = 0.018$ |
| Why it matters | Placeholder-trained behavior can fail on topically related missing-hop evidence. |

Table 36: Case study: HoVer candidate missing-hop control.

### K.6 Unavailable Row-Level Cases

| Case type | Unavailable reason |
|---|---|
| SciFact hard-audit placeholder failure on human-validated BM25/cited hard NEI | Row-level model predictions for final hard-NEI examples are unavailable; only aggregate bootstrap tables exist. |
| SciFact BM25/cited recovery case | Row-level paired recovery predictions for placeholder versus BM25/cited hard NEI are unavailable in primary outputs. |

Table 37: Unavailable case-study reasons. We do not invent row-level cases when prediction artifacts are missing.

## Appendix L Statistical Testing and Uncertainty Estimation

This appendix defines metrics and uncertainty procedures for construction matrices, one-class hard-NEI evaluation, fixed-claim diagnostics, and prediction coverage checks.

### L.1 Three-Way Verification Metrics

For standard verification, models predict:

$$
y \in \{\textsc{Support}, \textsc{Refute}, \textsc{NEI}\}.
$$

We report accuracy, Macro-F1, class-specific F1, and especially NEI-F1. Construction-shift matrices emphasize NEI-F1 because the experimental manipulation changes the NEI evidence condition.

### L.2 One-Class Human-Hard Metrics

For human-validated hard-NEI subsets, all evaluated examples are adjudicated as NEI. Macro-F1 is therefore not informative. We report:

$$
\mathrm{NEI\ recall} = \frac{\#\,\text{predicted NEI}}{\#\,\text{validated NEI}},
$$

$$
\mathrm{False\ Support} = \frac{N_{\mathrm{pred=SUP}}}{N_{\mathrm{validated\ NEI}}}, \qquad
\mathrm{False\ Refute} = \frac{N_{\mathrm{pred=REF}}}{N_{\mathrm{validated\ NEI}}}.
$$

We also report mean predicted probabilities for NEI, Support, and Refute.
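
The three one-class rates can be sketched as follows. The function name and the toy prediction list are hypothetical; the only assumption carried over from the text is that every evaluated example has been adjudicated as NEI, so the denominator is simply the number of predictions.

```python
from collections import Counter


def one_class_hard_nei_metrics(predictions):
    """Rates over predicted labels for examples that are all adjudicated NEI."""
    n = len(predictions)
    if n == 0:
        raise ValueError("empty evaluation set")
    counts = Counter(predictions)
    return {
        "nei_recall": counts["NEI"] / n,
        "false_support": counts["Support"] / n,
        "false_refute": counts["Refute"] / n,
    }


# Hypothetical predictions on ten validated hard-NEI examples.
toy_predictions = ["NEI"] * 6 + ["Support"] * 3 + ["Refute"]
metrics = one_class_hard_nei_metrics(toy_predictions)
print(metrics)
```

With three possible predictions and a single gold class, the three rates sum to one, which matches the row sums in Table 21 up to rounding.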

### L.3 Seed Aggregation and Bootstrap Intervals

The primary construction matrix uses seeds 13, 17, and 23\. Revision robustness and mixed\-construction experiments use seeds 13, 17, 23, 29, and 37\. Where seed\-level outputs are available, we report mean, standard deviation, minimum, maximum, and bounded intervals\. Human validation and human\-hard evaluations report bootstrap 95% intervals where available, computed by resampling evaluation examples or groups and recomputing the metric\.
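
A minimal sketch of an example-level bootstrap interval follows. The percentile method, the number of resamples, and the function name `bootstrap_ci` are assumptions; the text specifies only that examples or groups are resampled and the metric recomputed. The per-example values are hypothetical indicators (1 if the model predicts NEI, 0 otherwise), so their mean is NEI recall.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=13):
    """Percentile bootstrap interval for the mean of per-example values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample examples with replacement and recompute the mean each time.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical indicators for 54 hard-NEI examples: 37 predicted as NEI.
indicators = [1] * 37 + [0] * 17
low, high = bootstrap_ci(indicators)
print(f"recall = {sum(indicators) / len(indicators):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

When every indicator is zero, each resample also has mean zero and the interval collapses to a single point, which is consistent with the zero-width intervals reported for placeholder and position-biased training in Table 21.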

### L.4 Fixed-Claim Metrics

For same-claim diagnostics, reference-label probability drop is:

$$
\Delta_{i} = P_{\theta}(y_{\mathrm{ref}} \mid c_{i}, E_{i}^{\mathrm{ref}}) - P_{\theta}(y_{\mathrm{ref}} \mid c_{i}, E_{i}^{\mathrm{hard}}).
$$

Probability-drop success is the fraction of pairs with $\Delta_{i} > 0$. Strict swap success additionally requires the reference-evidence side to be predicted as the reference label and the insufficient-evidence side to be predicted as NEI.
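
The definitions above can be sketched over paired predictions. The record type `FixedClaimPair`, its field names, and the four example pairs are hypothetical and exist only to exercise the three quantities; they are not rows from the paper's prediction logs.

```python
from dataclasses import dataclass


@dataclass
class FixedClaimPair:
    ref_label: str           # "Support" or "Refute"
    p_ref_given_ref: float   # P(y_ref | claim, reference evidence)
    p_ref_given_hard: float  # P(y_ref | claim, insufficient evidence)
    pred_on_ref: str         # predicted label with reference evidence
    pred_on_hard: str        # predicted label with insufficient evidence


def fixed_claim_metrics(pairs):
    """Mean delta, probability-drop success, and strict swap success."""
    n = len(pairs)
    deltas = [p.p_ref_given_ref - p.p_ref_given_hard for p in pairs]
    prob_drop = sum(d > 0 for d in deltas) / n
    strict = sum(
        d > 0 and p.pred_on_ref == p.ref_label and p.pred_on_hard == "NEI"
        for d, p in zip(deltas, pairs)
    ) / n
    return {
        "mean_delta": sum(deltas) / n,
        "prob_drop_success": prob_drop,
        "strict_swap_success": strict,
    }


# Hypothetical pairs illustrating each outcome.
pairs = [
    FixedClaimPair("Support", 0.90, 0.10, "Support", "NEI"),  # strict swap
    FixedClaimPair("Refute", 0.40, 0.05, "NEI", "NEI"),       # drop only
    FixedClaimPair("Support", 0.30, 0.50, "NEI", "Support"),  # no drop
    FixedClaimPair("Support", 0.60, 0.20, "Support", "NEI"),  # strict swap
]
result = fixed_claim_metrics(pairs)
print(result)
```

The second pair shows why the two success rates can diverge: the reference-label probability drops, but the reference side is itself predicted as NEI, so the pair does not count as a strict swap.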

### L.5 Prediction Coverage

Prediction coverage is:

$$
\mathrm{coverage} = \frac{n_{\mathrm{predicted}}}{n_{\mathrm{expected}}}.
$$

Rows with missing or duplicate predictions are not treated as complete.

| Scope group | Expected | Predicted | Coverage |
|---|---|---|---|
| Fixed-claim reference side | 114 | 114 | 1.000 |
| Fixed-claim insufficient side | 114 | 114 | 1.000 |
| Same-document hard side | 114 | 114 | 1.000 |
| SciFact matrix regression | 885 | 885 | 1.000 |

Table 38: Prediction coverage summary. Values summarize the repeated complete-coverage pattern across train variants and seeds.
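
One way to operationalize the coverage check is sketched below. It assumes that a row counts toward the numerator only when it has exactly one prediction, which is one reading of the rule that missing or duplicate predictions are not complete; the function name and row identifiers are hypothetical.

```python
from collections import Counter


def prediction_coverage(expected_ids, predicted_ids):
    """Fraction of expected rows that have exactly one prediction."""
    if not expected_ids:
        raise ValueError("no expected rows")
    seen = Counter(predicted_ids)
    complete = sum(1 for row_id in expected_ids if seen[row_id] == 1)
    return complete / len(expected_ids)


# Hypothetical rows: "b" is duplicated, "c" and "d" are missing.
expected = ["a", "b", "c", "d"]
predicted = ["a", "b", "b"]
print(f"coverage = {prediction_coverage(expected, predicted):.3f}")
```

A simple count of prediction rows would report 0.75 here, whereas the stricter check reports 0.25, which is why duplicates need to be handled explicitly.
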
### L.6 Reporting Rules

NEI-CAP follows these rules: report construction family for every NEI result; report Macro-F1 only when all three labels are meaningful; do not report Macro-F1 on one-class hard-NEI subsets; document prediction coverage before interpreting fixed-claim diagnostics; and mark FEVER/HoVer controls as bounded unless human validation is available.
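
These rules lend themselves to a mechanical check on result records. The sketch below is purely illustrative: the record is a plain dictionary, and every key name (`construction_family`, `one_class`, `macro_f1`, `diagnostic`, `prediction_coverage`, `dataset`, `human_validated`, `bounded`) is a hypothetical schema, not the format of the released artifacts.

```python
def reporting_violations(result):
    """Return a list of reporting-rule violations for one result record."""
    problems = []
    if not result.get("construction_family"):
        problems.append("missing construction family")
    if result.get("one_class") and "macro_f1" in result:
        problems.append("Macro-F1 reported on a one-class subset")
    if (
        result.get("diagnostic") == "fixed_claim"
        and "prediction_coverage" not in result
    ):
        problems.append("fixed-claim diagnostic without prediction coverage")
    if (
        result.get("dataset") in {"FEVER", "HoVer"}
        and not result.get("human_validated")
        and not result.get("bounded")
    ):
        problems.append("external control not marked as bounded")
    return problems


# Hypothetical record that breaks two of the rules.
record = {
    "dataset": "HoVer",
    "construction_family": "missing_hop",
    "one_class": True,
    "macro_f1": 0.5,
}
print(reporting_violations(record))
```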

## Appendix M Reproducibility, Release, and Claim Boundary Checklist

This appendix records release readiness, locked\-output policy, and paper\-facing claim boundaries\.

### M.1 Release Audit

| Check | Passed | Critical | Detail |
|---|---|---|---|
| Key reports exist | yes | yes | Experiment reports and summaries are present. |
| Paper tables exist and are non-empty | yes | yes | Paper-facing tables are generated. |
| Run registry entries exist | yes | yes | Core experiment entries are tracked. |
| Fixed-claim labels are preserved | yes | yes | Final fixed-claim/same-document labels path exists. |
| No LLM labels marked human | yes | yes | Reports use boundary language. |
| No candidate-only HoVer rows marked human-validated | yes | yes | HoVer marked human_validated=no. |
| No Macro-F1 on one-class human subsets | yes | yes | One-class hard-NEI evaluations exclude Macro-F1. |
| Primary SciFact matrix primary outputs unchanged | yes | yes | Locked primary matrix is preserved. |
| External dependencies documented | yes | no | Upstream data dependencies documented. |
| Known limitations documented | yes | yes | Limitations recorded. |
| Unavailable cases documented | yes | no | Missing row-level case reasons tracked. |

Table 39: NEI-CAP release audit checklist. Critical checks concern result traceability, label provenance, metric validity, and claim boundaries.

### M.2 Locked Outputs and Traceability

Locked evidence includes the primary SciFact construction matrix, SciFact human-audit and hard-NEI evaluation, fixed-claim/same-document adjudicated labels, secondary-backbone robustness, FEVER subset controls, HoVer candidate missing-hop controls, and derived paper tables/figures. Each reported result should be traceable to a source dataset, construction family, split policy, model configuration, seed set, prediction artifact, and evaluation output.

### M.3 Release Package Contents

The accompanying release will include construction manifests, group\-disjoint split files, human\-adjudicated labels, paper\-facing prediction logs, evaluation scripts, artifact\-audit scripts, and table\-generation scripts\. Upstream datasets are referenced and documented but not redistributed\. Rows or case studies unavailable in primary outputs are documented rather than reconstructed from aggregate statistics\.

### M.4 Claim Strength Matrix

| Claim | Strength | Allowed wording | Forbidden wording |
|---|---|---|---|
| NEI-CAP is construction-aware | Strong | Protocol for constructing, auditing, and stress-testing NEI evaluation. | Do not claim it solves fact verification. |
| SciFact NEI is construction-sensitive | Strong | Scores vary substantially across construction conditions. | Do not claim all errors have the same cause. |
| Easy NEI inflates performance | Moderate | Shortcut-like constructions can inflate apparent competence. | Do not claim every model uses only shortcuts. |
| Human validation confirms hard subsets | Strong | Audited BM25/cited and fixed-claim/same-doc hard-NEI subsets are human-adjudicated. | Do not call LLM triage human validation. |
| Same-claim is diagnostic | Limited | Supports hard-side recognition with reference-side caveats. | Do not claim clean counterfactual evidence-use success. |
| HoVer is external control | Limited | Candidate-only multi-hop construction-sensitivity probe. | Do not call HoVer human-validated. |
| FEVER is external control | Limited | Non-toy subset-level shortcut control. | Do not call FEVER full-data validation. |
| Hard NEI is heterogeneous | Moderate | Different construction families stress different behaviors. | Do not collapse topic-unrelated cases into hard NEI. |
| Release is reproducible with limitations | Strong | Includes reproducibility instructions and claim boundaries. | Do not report toy, failed, or smoke outputs as paper-facing. |

Table 40: Claim strength and wording boundaries.

### M.5 Revision-Stage Release Audit and Responsible Reporting

The additional experiments add response evidence without changing primary outputs or creating new human labels\. The final audit records completed split statistics, mixed training, expanded multi\-seed checks, same\-document artifact audit, human\-protocol documentation, no Macro\-F1 on one\-class hard subsets, no LLM\-only human validation, and no candidate\-only HoVer human validation\. NEI\-CAP is an audit and reporting protocol, not a blanket declaration that a dataset is invalid\.
