From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence
Summary
This paper introduces PrimeFacts, a methodology and resource for extracting fine-grained evidence from fact-checking articles using large language models. The extracted premises improve evidence retrieval and claim verification performance by up to 30% in MRR and 10-20 points in Macro-F1.
# From Articles to Premises: Building PrimeFacts, an Extraction Methodology and Resource for Fact-Checking Evidence
Source: [https://arxiv.org/html/2605.06006](https://arxiv.org/html/2605.06006)
###### Abstract
Fact-checking articles encode rich supporting evidence and reasoning, yet this evidence remains largely inaccessible to automated verification systems due to unstructured presentation. We introduce PrimeFacts, a methodology and resource for extracting fine-grained evidence from full fact-checking articles. We compile 13,106 PolitiFact articles with claims, verdicts, and all referenced sources, and we identify 49,718 in-article hyperlinks as natural anchors to pinpoint key evidence. Our framework leverages large language models (LLMs) to rewrite these anchor sentences into stand-alone, context-independent premises and investigates the extraction of additional implicit evidence. In evaluations on cross-article evidence retrieval and claim verification, the extracted premises substantially improve performance. Decontextualized evidence yields higher retrievability, achieving up to a 30% relative gain in Mean Reciprocal Rank over verbatim sentences, and using the evidence for verdict prediction raises Macro-F1 by 10-20 points over the baseline. These gains are consistent across different verdict granularities (2-class vs. 5-class) and model architectures. A qualitative analysis indicates that the decontextualized premises remain faithful to the original sources. Our work highlights the promise of reusing fact-checkers' evidence for automation and provides a large-scale resource of structured evidence from real-world fact-checks.
Keywords: Corpus Construction, Information Extraction, Large Language Models, Resource Evaluation
Premtim Sahitaj (1,2), Jawan Kolanowski (3), Ariana Sahitaj (1,2), Veronika Solopova (1,2), Max Upravitelev (1,2), Daniel Röder (2), Iffat Maab (4), Junichi Yamagishi (4), Sebastian Möller (1,2), Vera Schmitt (1,2)

(1) Technische Universität Berlin, Quality and Usability Lab, Berlin, Germany; (2) Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Berlin, Germany; (3) Harz University of Applied Sciences, Wernigerode, Germany; (4) National Institute of Informatics, Tokyo, Japan

{sahitaj, sebastian.moeller, vera.schmitt}@tu-berlin.de, {ariana.sahitaj, veronika.solopova}@tu-berlin.de, daniel.roeder@dfki.de, {maab, jyamagis}@nii.ac.jp, u37871@hs-harz.de
## 1\. Introduction
Professional fact-checking has become a crucial process to counter misinformation and disinformation in politics and media. Journalistic fact-checkers investigate a check-worthy claim by gathering supporting and refuting evidence and then issuing a verdict on the claim's veracity Graves ([2016](https://arxiv.org/html/2605.06006#bib.bib17)); Jiang et al. ([2020b](https://arxiv.org/html/2605.06006#bib.bib21)). The results of this labor-intensive process are published as articles containing the claim, contextual background, evidence discussion, and a final verdict on truthfulness Sahitaj et al. ([2025](https://arxiv.org/html/2605.06006#bib.bib33)). Standardization efforts like the ClaimReview schema have encouraged fact-checkers to explicitly mark the claim and verdict in each article, but other critical components, namely the evidence that led to the verdict and the reasoning connecting evidence to conclusion, are rarely structured or annotated due to the extra workload this would entail Jiang et al. ([2020b](https://arxiv.org/html/2605.06006#bib.bib21)). As a result, rich supporting information in fact-check articles remains locked in unstructured text, limiting its reusability for automated verification systems and obscuring how conclusions are reached for a given claim Jiang et al. ([2020b](https://arxiv.org/html/2605.06006#bib.bib21)); Alhindi et al. ([2018](https://arxiv.org/html/2605.06006#bib.bib3)).
Figure 1: Example fragments from a fact-checking article. (Top and middle) hyperlink-anchored sentence decontextualized in Mode B. (Bottom) statistical claim identified by open extraction in Mode C.

Similar or rephrased claims tend to reappear across media, causing fact-checkers to spend effort redundantly on already verified information Nakov et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib30)); Panchendrarajan and Zubiaga ([2024](https://arxiv.org/html/2605.06006#bib.bib32)). We argue that *evidence reuse* is a valuable extension of claim matching: once supporting material is extracted and normalized, the same evidence can be associated with multiple variants of a claim, enabling future claims to be verified using pre-existing evidence. Moreover, this would enable analyses of aspects such as source diversity and reasoning consistency. For instance, one could examine how premises are assembled to support or refute a target claim, or assess whether the evidence usage reveals informational bias in an article's reasoning Wang et al. ([2025a](https://arxiv.org/html/2605.06006#bib.bib54)); Stab and Gurevych ([2017](https://arxiv.org/html/2605.06006#bib.bib58)); Maab et al. ([2024](https://arxiv.org/html/2605.06006#bib.bib59)). However, manually annotating evidence at scale is impractical. Fact-check articles are often long, densely sourced, and rhetorically complex Humprecht ([2020](https://arxiv.org/html/2605.06006#bib.bib20)); Jiang et al. ([2020a](https://arxiv.org/html/2605.06006#bib.bib52)). Exhaustively identifying all relevant evidence sentences and rewriting them as standalone statements would require significant expert effort, leading to annotator fatigue, inconsistency, and prohibitive cost Ostrowski et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib31)); Schlichtkrull et al. ([2023](https://arxiv.org/html/2605.06006#bib.bib49)). This motivates exploring automated methods to unlock the evidence within fact-check articles.
These articles contain in-text citations that serve as natural anchors for evidence and enhance the transparency of the manual verification process (see Figure [1](https://arxiv.org/html/2605.06006#S1.F1) for an illustrative example). Specifically, hyperlinks to primary sources such as data, reports, and transcripts are extensively embedded within their writing Cazzamatta ([2025a](https://arxiv.org/html/2605.06006#bib.bib44)); Humprecht ([2020](https://arxiv.org/html/2605.06006#bib.bib20)). Therefore, these reference links can be leveraged to automatically pinpoint the key supporting sentences within an article. By extracting and appropriately reformulating those anchor sentences, we aim to make the evidence both *addressable* (tied to a specific source reference) and *portable* (understandable outside the original context) for use in downstream automated retrieval and verification systems. In this work, we investigate the feasibility of automatically extracting evidence from fact-check articles by exploiting their cited sources and language models. We focus on three research questions:
- RQ1 (Addressability): To what extent can in-article hyperlinks serve as a reliable proxy for identifying the core evidence that supports or refutes a claim?
- RQ2 (Portability): Does rewriting premises into stand-alone, *decontextualized* statements improve their usefulness for retrieval and automated verification tasks?
- RQ3 (Robustness): Are the findings consistent across different evaluation settings and task granularities?
Our contributions are four-fold: (i) We curate PrimeFacts, a resource of 13,106 PolitiFact fact-checks with structured article metadata and 49,718 in-article hyperlinks that serve as evidence anchors. (ii) We introduce a three-mode extraction framework: anchored verbatim sentences, LLM-based decontextualization into stand-alone premises, and open extraction of self-contained premises with source attributions. We additionally define a lightweight evidence-type ontology to characterize premise content. (iii) We propose a score for evaluating the faithfulness of decontextualizations, combining forward textual entailment with an asymmetric lexical-overlap penalty, and conduct a targeted human study to assess self-containment and evidence typing. (iv) We evaluate *evidence reuse* on two downstream tasks, cross-article retrieval and zero-shot claim verification, across six instruction-tuned LLMs and two verdict granularities.
Figure 2: Extraction pipeline for transforming fact-checks into refined, decontextualized evidence.
## 2\. Related Work
Automated fact-checking (AFC) research aims to verify claims by retrieving evidence and predicting verdicts, often producing a textual explanation Guo et al. ([2022](https://arxiv.org/html/2605.06006#bib.bib19)). A persistent challenge in AFC is the shortage of training data that contains real-world claims paired with gold-standard evidence and detailed reasoning. Many existing datasets use artificially constructed claims or evidence from Wikipedia, e.g. FEVER Thorne et al. ([2018](https://arxiv.org/html/2605.06006#bib.bib35)), and extensions Schuster et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib34)); Aly et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib4)); Ma et al. ([2024](https://arxiv.org/html/2605.06006#bib.bib26)), or consider the referenced pages within fact-checking articles as evidence Augenstein et al. ([2019](https://arxiv.org/html/2605.06006#bib.bib5)); Khan et al. ([2022](https://arxiv.org/html/2605.06006#bib.bib23)). Other datasets target fact-checking justifications by relying on heuristic extractions Alhindi et al. ([2018](https://arxiv.org/html/2605.06006#bib.bib3)); Zeng and Gao ([2024](https://arxiv.org/html/2605.06006#bib.bib39)) or are limited in scale due to manual annotation Ostrowski et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib31)); Wang et al. ([2025b](https://arxiv.org/html/2605.06006#bib.bib37)). These approaches do not isolate or quantify the evidence-bearing sentences that journalists integrate into the article's reasoning. Instead, they ingest entire referenced pages or treat fact-checks as the evidence.
Reliance on whole fact-checking articles introduces noise and reduces system effectiveness Samarinas et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib40)); Xing et al. ([2024](https://arxiv.org/html/2605.06006#bib.bib41)); Deng et al. ([2025](https://arxiv.org/html/2605.06006#bib.bib42)); Sauchuk et al. ([2022](https://arxiv.org/html/2605.06006#bib.bib43)), while incomplete or inaccessible sources and link rot further undermine automated evidence retrieval Cazzamatta ([2025a](https://arxiv.org/html/2605.06006#bib.bib44)); Warren et al. ([2025](https://arxiv.org/html/2605.06006#bib.bib45)); Kavtaradze ([2024](https://arxiv.org/html/2605.06006#bib.bib46)); Zhou et al. ([2015](https://arxiv.org/html/2605.06006#bib.bib47)); Klein et al. ([2014](https://arxiv.org/html/2605.06006#bib.bib48)). PrimeFacts addresses this gap by focusing on evidence units within fact-checking articles that journalists already anchor through in-text hyperlinks and end-of-article references. We promote both precise addressability via these anchored spans and enhanced portability by reformulating them into context-independent premises. Our open extraction approach shares the core insight of Chen et al. ([2024](https://arxiv.org/html/2605.06006#bib.bib1)) that decomposing text into atomic, self-contained propositions improves retrieval, but differs in scope and constraints. Chen et al. ([2024](https://arxiv.org/html/2605.06006#bib.bib1)) define propositions as minimal, self-contained factoids and exhaustively decompose passages so that all propositions together recover the full passage semantics, using a compact model distilled from GPT-4 outputs. In contrast, our approach performs selective, attribution-based extraction from fact-checking articles, bounded by the article's anchor count and targeting key evidence rather than every atomic fact.
Recent work on maintaining editable knowledge bases Li et al. ([2025](https://arxiv.org/html/2605.06006#bib.bib2)) further motivates the extraction of modular, updateable evidence units, a goal that our decontextualized premises directly support.
## 3\. Data Collection
Each record in PrimeFacts corresponds to a single fact-checking article from the English-language PolitiFact archive ([https://www.politifact.com](https://www.politifact.com/)), collected from the chronological and topic-based indexes up to September 2025. We retain only fact-check articles targeting text-based claims, identified via URL and markup patterns, to preserve internal validity and consistent textual structure. The final dataset comprises 13,106 fact-check articles authored by 661 individual journalists, each containing at least one in-text hyperlink to an external source. For each article, we store a structured representation linking the canonical PolitiFact URL with its metadata, including crawl timestamp, editorial tags, claim information, author and speaker metadata, and a structured list of cited sources to ensure provenance and reproducibility. The released version of the resource ([https://huggingface.co/datasets/xplainlp/prime-facts](https://huggingface.co/datasets/xplainlp/prime-facts)) contains only derived metadata and annotations. Original fact-check article texts are not redistributed for copyright reasons. The resource is organized into three main components: (i) Article metadata, containing the article origin, verdict label, claim statement, and cited materials; (ii) Entity metadata, providing unified lookup profiles for authors and speakers to maintain consistent attribution across articles; and (iii) Annotations, linking each claim to its extracted anchor statements and aligning them with the corresponding article structure and metadata. All components are cross-referenced through their canonical URLs. A companion statistics file reports aggregate distributions to support stratified sampling and downstream evaluation.
Beyond aggregating raw PolitiFact metadata, PrimeFacts contributes several processing steps that constitute a standalone resource: layout-aware text normalization derived from web browser rendering Weichselbraun ([2021](https://arxiv.org/html/2605.06006#bib.bib60)), sentence-level segmentation with stable letter identifiers, hyperlink extraction and cross-referencing with author-provided source lists, systematic label-leak filtering to prevent verdict contamination, and the decontextualized premise annotations as structured evidence. Together, these transformations turn unstructured fact-check articles into a queryable evidence base that supports retrieval-augmented verification and comparative analysis.
## 4\. Methodology
Before describing the framework, we clarify the key terms used throughout this paper. An *anchor* is an in-article hyperlink that marks a sentence citing an external source. A *source* is the external document or page linked by an anchor (e.g., a government report or dataset). *Evidence* refers broadly to any information supporting or refuting a claim. A *premise* is a decontextualized, self-contained evidence statement derived from the article, suitable for reuse outside its original context. We use these terms consistently hereafter. We propose a framework for extracting fine-grained evidence from fact-check articles and for evaluating its usefulness in downstream verification settings. Figure [2](https://arxiv.org/html/2605.06006#S1.F2) gives an overview. The framework includes three evidence modes: hyperlink-anchored sentences, their decontextualized variants, and premises obtained through open extraction. We evaluate the resulting evidence representations in cross-article retrieval and claim verification, and we further assess their faithfulness to the original article content.
### 4\.1\. Evidence Extraction
Each article is segmented into distinct sentence-like units $u$, each assigned a stable identifier $\iota$. For downstream experiments, we remove sentences that explicitly state the verdict to prevent label leakage, ensuring that models see only evidence, not conclusions Alhindi et al. ([2018](https://arxiv.org/html/2605.06006#bib.bib3)); Glockner et al. ([2022](https://arxiv.org/html/2605.06006#bib.bib16)). We then identify anchor sentences that contain at least one hyperlink to an external source. These in-text hyperlinks typically mark factual assertions supported by external evidence Cazzamatta ([2025a](https://arxiv.org/html/2605.06006#bib.bib44)). The rationale for treating hyperlinked sentences as high-quality evidence markers is grounded in journalistic practice: fact-checkers deliberately embed links to primary sources such as government data, official records, prior reporting, and expert statements to substantiate their analysis and enhance transparency Humprecht ([2020](https://arxiv.org/html/2605.06006#bib.bib20)). This editorial convention makes hyperlinks natural proxies for evidence-bearing content. We cross-reference each hyperlink with the article's author-provided source list and retain only those pointing to sources that the author has explicitly listed by name in the article's reference section, to ensure we capture only high-quality anchors. We only consider articles with at least one anchor. This leaves us with 13,106 articles and a final set of 49,718 anchors. Using the identified anchors, we derive two sets of evidence, corresponding to Mode A and its decontextualized variation Mode B. Mode C extracts decontextualized premises without anchors. Each mode produces a collection of candidate premises per article, each with a unique identifier $\iota$ linking it back to the article text for provenance.
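The anchor-selection step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes sentence units are stored as markdown text keyed by their identifier, and that a hyperlink counts as "listed" when its URL appears verbatim in the author-provided source list. The names `Anchor`, `find_anchors`, and `listed_sources` are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches markdown links of the form [text](http...); an assumed input format.
LINK = re.compile(r"\[([^\]]+)\]\((https?://[^)\s]+)\)")


@dataclass
class Anchor:
    unit_id: str    # stable identifier of the sentence unit
    sentence: str   # verbatim sentence text (Mode A evidence)
    urls: list      # hyperlinks that also appear in the source list


def find_anchors(units, listed_sources):
    """Return sentence units with at least one hyperlink to a listed source.

    units: dict mapping unit identifier -> sentence text with markdown links.
    listed_sources: set of URLs from the article's reference section
    (exact URL matching is an assumption made for this sketch).
    """
    anchors = []
    for unit_id, text in units.items():
        urls = [url for _, url in LINK.findall(text) if url in listed_sources]
        if urls:
            anchors.append(Anchor(unit_id, text, urls))
    return anchors
```

An article for which `find_anchors` returns an empty list would be dropped under the "at least one anchor" rule.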
#### 4\.1\.1\. Mode A: Anchored Evidence
Mode A uses each anchor sentence verbatim as an evidence unit\. This yields a set of factual statements directly grounded in the fact\-checker’s cited sources\. For example, if an article states, "In 2020, the city’s homicide rate was the lowest on record" with a hyperlink to a police report, we retain that sentence unchanged as evidence\. By design, these anchor\-based evidence units are high\-precision, but this does not guarantee high recall\. However, using them as\-is can limit reusability, since anchor sentences may contain pronouns or context\-dependent references that are not self\-explanatory outside the article context\. Mode A primarily addresses*addressability*\(RQ1\) and serves as a baseline for our extraction framework\.
#### 4\.1\.2\. Mode B: Decontextualization
Mode B aims to improve the *portability* of evidence by rewriting each Mode A sentence into a stand-alone premise. We employ an LLM to *decontextualize* each anchor sentence, i.e. to make implicit references explicit so that the sentence is understandable out of context Choi et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib12)). For example, a Mode A statement "He took office in 2019" might be rewritten as "Volodymyr Zelenskyy took office as President of Ukraine in May 2019", resolving the pronoun and adding context for clarity. Concretely, for each anchor sentence the LLM receives the full letter-segmented article together with the claim statement and the target letter identifier pointing to the anchor sentence. This provides the model with sufficient surrounding context to resolve coreferences and implicit references. We use a zero-shot prompting strategy with structured JSON output guided by a Pydantic schema (see Appendix [A](https://arxiv.org/html/2605.06006#A1)), which constrains the model to return a single decontextualized sentence along with an evidence-type category in one joint generation step. The prompt explicitly instructs the model to preserve the original meaning and factual content while only integrating details necessary for standalone interpretation. This follows best practices from recent work on minimality in decontextualization Gunjal and Durrett ([2024](https://arxiv.org/html/2605.06006#bib.bib18)). Each decontextualized premise is output together with its corresponding identifier $\iota$ for traceability to the source unit $u$. Mode B produces one decontextualized premise for each Mode A evidence unit. This allows evidence from one fact-check to be more readily understood and reused in verifying other claims, addressing RQ2.
Building on prior work investigating evidence operationalization in the fact-checking newsroom by Cazzamatta ([2025c](https://arxiv.org/html/2605.06006#bib.bib10)) and consistent with comparative evidence that fact-checkers routinely add background and context to qualify and make verdicts interpretable (Cazzamatta, [2025b](https://arxiv.org/html/2605.06006#bib.bib9)), we define a novel ontology of evidence types and assign types to each decontextualized premise: QUOTE as attributed statement by a person or organization, STATISTIC as numeric fact from an official dataset or series, DOCUMENT as findings of an authoritative record (e.g., law, ruling, report, prior fact-check), CONTEXT as background attribution or qualification needed to interpret premises, and OTHER if none of the above fits. These labels do not affect the extraction framework, but help characterize what kinds of information the premises convey.
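To make the joint output of a premise and its evidence type concrete, the sketch below models one record with the five ontology labels. The paper uses a Pydantic schema (Appendix A) whose exact fields are not reproduced here, so this standard-library stand-in uses hypothetical field names (`unit_id`, `text`, `evidence_type`) and a hypothetical helper `parse_premise`.

```python
from dataclasses import dataclass
from enum import Enum


class EvidenceType(str, Enum):
    """The five evidence types named in the ontology."""
    QUOTE = "QUOTE"          # attributed statement by a person or organization
    STATISTIC = "STATISTIC"  # numeric fact from an official dataset or series
    DOCUMENT = "DOCUMENT"    # findings of an authoritative record
    CONTEXT = "CONTEXT"      # background needed to interpret premises
    OTHER = "OTHER"          # none of the above


@dataclass(frozen=True)
class Premise:
    unit_id: str                 # identifier of the source sentence unit
    text: str                    # decontextualized, stand-alone sentence
    evidence_type: EvidenceType


def parse_premise(raw: dict) -> Premise:
    """Validate one decoded JSON object from the model (hypothetical keys).

    Raises ValueError if the evidence type is not one of the five labels.
    """
    return Premise(
        unit_id=raw["unit_id"],
        text=raw["text"].strip(),
        evidence_type=EvidenceType(raw["evidence_type"]),
    )
```

Rejecting unknown labels at parse time keeps malformed generations out of the evidence base rather than silently mapping them to OTHER; whether the original pipeline does this is not stated.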
#### 4\.1\.3\. Mode C: Open Extraction
Mode C explores a more open-ended extraction using LLM generation. Instead of relying only on explicit hyperlinks, we prompt an LLM to process the entire fact-checking article and directly output a set of self-contained premises. Prompt details are provided in Appendix [A](https://arxiv.org/html/2605.06006#A1). The purpose of this approach is to capture any central premises in the article that the journalist may have implied and not explicitly linked. We constrain this process to maintain alignment with the article: the model is instructed to produce at most $n$ premises, where $n$ equals the number of Mode A anchors in that article, to ensure a fair comparison. For each generated premise, the LLM must cite the supporting unit identifier $\iota$ from which it drew the information. Comparing Mode C to the anchor-driven modes thus reveals whether important premises are missed by only looking at hyperlinks. This serves as a stress test for our framework, since any premises found in Mode C should ideally overlap with or complement Mode A/B if the anchor-based approach is comprehensive.
### 4\.2\. Evidence Reuse
#### 4\.2\.1\. Faithfulness
Decontextualization rewrites a source sentence into a stand-alone premise that preserves the original factual content while resolving contextual ambiguity Choi et al. ([2021](https://arxiv.org/html/2605.06006#bib.bib12)). Prior work in abstractive summarization has shown that Natural Language Inference (NLI) models, which determine if a premise entails, contradicts, or is neutral towards a source, correlate more strongly with human judgments of factual consistency than standard measures such as ROUGE or BERTScore Maynez et al. ([2020](https://arxiv.org/html/2605.06006#bib.bib28)). We adapt this idea and evaluate the *faithfulness* of a decontextualized premise $p$ to its source sentence $s$ as an NLI problem, using *forward* textual entailment. The forward direction captures the desideratum that a more specific, context-completed rewrite should imply the key facts of the original sentence, while the reverse direction does not need to hold. For example, the premise "The unemployment rate doubled in 2016, according to the Bureau of Labor Statistics." correctly entails the more general source statement "The rate doubled in 2016". The reverse is not necessarily true, making this a directional check of information preservation. However, NLI models can be susceptible to lexical overlap Naik et al. ([2018](https://arxiv.org/html/2605.06006#bib.bib29)). A premise that simply copies, minimally edits, or truncates $s$ can receive a high entailment score despite failing the decontextualization objective. To address this, we introduce the Decontextualization Faithfulness Score (DFS), a composite measure that balances factual entailment with an explicit lexical overlap penalty across a dataset $D \subseteq P \times S$, where $P$ and $S$ are the sets of generated premises and source sentences:
$$
\mathrm{DFS}(D) = \frac{1}{|D|} \sum_{(p,s) \in D} \mathrm{E}(p,s)\,\bigl(1 - \mathrm{O}(p,s)\bigr), \qquad \mathrm{O}(p,s) = \frac{\bigl|\operatorname{t}(p) \cap \operatorname{t}(s)\bigr|}{\bigl|\operatorname{t}(p)\bigr|},
$$
where $\mathrm{E}(p,s) \in [0,1]$ is the probability that $p$ entails $s$, $\mathrm{O}(p,s) \in [0,1]$ is the asymmetric lexical overlap of $p$ covered by $s$, and $\operatorname{t}(\cdot)$ denotes the (multi)set of tokens. The token overlap is normalized by the length of the generated premise $|\operatorname{t}(p)|$, which penalizes premises that are simple excerpts of the source while rewarding the addition of new, contextualizing tokens that improve portability without undermining factual consistency with the source. We compute DFS for both Mode B and Mode C across model outputs to evaluate the faithfulness of the decontextualizations.
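A minimal sketch of the DFS computation follows, under two stated assumptions: tokens are obtained by lowercasing and whitespace splitting (the paper does not specify the tokenizer), and the entailment probability is supplied by the caller through a hypothetical `entail_prob(premise, source)` function standing in for an NLI model.

```python
from collections import Counter


def tokens(text):
    """Token multiset; lowercased whitespace tokens are an assumption."""
    return Counter(text.lower().split())


def overlap(premise, source):
    """Asymmetric overlap O(p, s): share of premise tokens also in the source."""
    tp, ts = tokens(premise), tokens(source)
    total = sum(tp.values())
    if total == 0:
        return 0.0
    shared = sum((tp & ts).values())  # multiset intersection
    return shared / total


def dfs(pairs, entail_prob):
    """Mean of E(p, s) * (1 - O(p, s)) over (premise, source) pairs.

    entail_prob: callable returning the probability in [0, 1] that the
    premise entails the source (hypothetical stand-in for an NLI model).
    """
    if not pairs:
        return 0.0
    total = sum(entail_prob(p, s) * (1.0 - overlap(p, s)) for p, s in pairs)
    return total / len(pairs)
```

A premise that copies its source verbatim has overlap 1 and therefore contributes 0 regardless of its entailment score, which is the behavior the penalty term is meant to produce.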
#### 4\.2\.2\. Retrievability
Retrievability operationalizes portability: if a premise is self-contained enough, then, given a claim as a query, standard retrieval methods should locate it more reliably in a corpus of prior fact-checks. To assess how well the resulting premises serve as a reusable knowledge base for claim and evidence matching Panchendrarajan and Zubiaga ([2024](https://arxiv.org/html/2605.06006#bib.bib32)), we simulate the task of retrieving relevant evidence for a new claim by searching a database of verified facts from prior fact-checks. Specifically, for each mode, we compile all premises from the collected articles into a search index. We treat each article's claim statement as a query and attempt to retrieve candidate evidence from the indexed premises of all fact-checks. We compute ranked lists over the mode-specific indexes and score effectiveness with standard information retrieval metrics. By comparing retrieval performance across modes, we aim to quantify the benefit of evidence decontextualization and open extraction on evidence matching. We interpret relative gains from Mode A $\to$ B/C as evidence that decontextualization improves cross-article portability.
#### 4\.2\.3\. Verification Utility
We investigate whether surfaced premises enable zero-shot claim verification. For each claim, a model receives the premises for that article and the label schema, and must output (i) a verdict from the allowed set and (ii) a brief justification that cites the premises it used via their identifiers $\iota$ Guo et al. ([2022](https://arxiv.org/html/2605.06006#bib.bib19)). Cited IDs make decisions traceable and let us quantify evidence use Jolly et al. ([2022](https://arxiv.org/html/2605.06006#bib.bib22)). Alongside verdict accuracy, we report *citation coverage* as the fraction of presented premises that are cited: with $S_{\text{given}}$ the shown premise IDs and $S_{\text{cited}} \subseteq S_{\text{given}}$ those mentioned, $C = |S_{\text{cited}}| / |S_{\text{given}}|$. Coverage helps contrast evidence modes, since more informative premises should improve task performance while requiring fewer citations. Accordingly, we treat coverage as diagnostic rather than a target, since correctness is our primary objective.
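Citation coverage as defined above reduces to a set ratio. The short sketch below is one way to compute it; the function name and the choice to ignore cited identifiers that were never shown (so that the cited set stays a subset of the given set) are assumptions of this illustration.

```python
def citation_coverage(given_ids, cited_ids):
    """C = |S_cited| / |S_given| for one verification instance.

    given_ids: identifiers of the premises shown to the model.
    cited_ids: identifiers mentioned in the model's justification; any
    identifier not in given_ids is dropped (an assumption of this sketch).
    """
    given = set(given_ids)
    if not given:
        return 0.0
    cited = set(cited_ids) & given
    return len(cited) / len(given)
```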
## 5\. Experiments
### 5\.1\. Setup
We evaluate our approach on the collected and processed corpus of 13,106 PolitiFact fact-check articles (Section [3](https://arxiv.org/html/2605.06006#S3)). Each fact-check instance provides a query claim and its extracted evidence set per mode. Because our method does not require model fine-tuning, we do not partition the data into training splits. Instead, both the evidence extraction and the verification experiments are conducted in a zero-shot inference setting with LLMs. We compare six publicly available state-of-the-art instruction-tuned LLMs of varying scale and architecture: Qwen3 at 8B, 14B, and 32B dense and 235B mixture of experts (MoE) with 22B active during inference Yang et al. ([2025](https://arxiv.org/html/2605.06006#bib.bib55)), Llama 3.3 at 70B dense Meta AI ([2024](https://arxiv.org/html/2605.06006#bib.bib57)), and Llama 4 Scout MoE with 109B total and 17B active Meta ([2025](https://arxiv.org/html/2605.06006#bib.bib56)).
### 5\.2\. Automatic Evaluation
We operationalize the three research questions through complementary evaluations: retrieval performance measures portability \(RQ2\) by testing whether decontextualized premises are more discoverable across articles; verification performance measures whether portable evidence also improves downstream task accuracy; and faithfulness analysis ensures that LLM\-based rewriting preserves factual content\. By evaluating across six LLMs of varying scale and architecture and two verdict granularities \(binary and five\-class\), we assess robustness \(RQ3\)\. We acknowledge that model\-size differences confound model\-specific and knowledge\-driven effects and discuss this in Limitations\.
#### 5\.2\.1\. Retrieval Performance
Each claim statement is used as a query to retrieve evidence sentences from an index of all extracted premises (Section [4](https://arxiv.org/html/2605.06006#S4)), simulating cross-article evidence reuse. We use BM25 to construct an efficient retrieval index Lù ([2024](https://arxiv.org/html/2605.06006#bib.bib25)). We measure standard ranking metrics: Mean Reciprocal Rank at 10 (MRR@10), normalized Discounted Cumulative Gain (nDCG) at 3 and 10, and Recall (R) at 1, 3, and 10, treating the premises from the claim's own fact-check as the gold set of relevant items. Higher MRR and nDCG indicate that the relevant evidence is ranked near the top, while higher Recall@$k$ indicates that more of the gold premises are retrieved within the top $k$ results.
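For readers who want the metric definitions spelled out, the sketch below computes reciprocal rank at a cutoff and Recall@k for a single query, plus a mean over queries. It covers only MRR and Recall (nDCG is omitted for brevity), works on plain lists of premise identifiers, and is independent of the BM25 index itself; all function names are illustrative.

```python
def reciprocal_rank_at_k(ranked, relevant, k=10):
    """1 / rank of the first relevant item within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of the gold premises that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(ranked[:k]) & set(relevant))
    return hits / len(relevant)


def mean_metric(runs, metric, **kwargs):
    """Average a per-query metric over (ranked_list, relevant_set) pairs."""
    return sum(metric(r, g, **kwargs) for r, g in runs) / len(runs)
```

Here `relevant` would hold the identifiers of the premises extracted from the query claim's own fact-check, matching the gold-set definition in the text.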
Table 1: Retrieval results across models.

We first examine the effectiveness of evidence retrieval across different extraction modes. Table [1](https://arxiv.org/html/2605.06006#S5.T1) compares decontextualized premises from Mode B against the baseline using verbatim sentences from Mode A. Decontextualization yields substantial gains in all metrics for every model. For instance, Llama-3.3-70B achieves an MRR@10 of 0.59 with Mode B, compared to 0.43 for the baseline, which is a 37% relative improvement. Recall@10 improves from 0.21 to 0.36, meaning the self-contained premises allow 71% more of the relevant evidence to be retrieved in the top-10 results. We observe consistent improvements at rank 3 as well (nDCG@3 from 0.26 to 0.41). These results confirm that making evidence sentences context-independent greatly increases their portability and discoverability by lexical-matching methods, addressing RQ2.

Table [1](https://arxiv.org/html/2605.06006#S5.T1) also shows retrieval results for Mode C with LLM-generated premises without anchor cues. Mode C outperforms the baseline by a wide margin. Across models, MRR@10 ranges from 0.78 to 0.88, indicating that a large share of queries have a relevant premise at rank 1. For example, Qwen3-8B and Qwen3-14B reach MRR@10 scores of 0.88 and 0.81, respectively. Recall@10 more than doubles relative to the baseline, reaching up to 0.57, with all models at 0.48 or higher, meaning that Mode C premises capture a larger portion of the self-selected evidence per claim. The retrieval scores for Mode C suggest that the LLMs extract evidence that is more directly aligned with the claim than anchored sentences from Mode A. This highlights a potential trade-off. While Mode C yields high recall and may surface additional relevant premises beyond explicit hyperlink anchors, it may also benefit from more direct lexical overlap with the claim wording. We return to this point in Section [6](https://arxiv.org/html/2605.06006#S6). Overall, the trend from Mode A → B → C is one of strictly increasing retrieval effectiveness, demonstrating the benefit of evidence decontextualization and open extraction.
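The relative gains quoted above for Llama-3.3-70B follow directly from the reported scores. The short sketch below reproduces the arithmetic; the helper name `relative_gain` is an illustrative choice, and the inputs are the rounded values from the text, so results computed from unrounded scores could differ slightly.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`, as a fraction."""
    return (improved - baseline) / baseline

# Values quoted in the text for Llama-3.3-70B (Mode A baseline vs. Mode B).
print(f"MRR@10:    {relative_gain(0.43, 0.59):.0%}")
print(f"Recall@10: {relative_gain(0.21, 0.36):.0%}")
```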
#### 5\.2\.2\. Verification Performance
We evaluate label prediction using Macro-F1 for both a binary setting, omitting half-true in the binary collapse, and a fine-grained five-class setting. Macro-F1 is appropriate due to class imbalance (see Table [2](https://arxiv.org/html/2605.06006#S5.T2)) and the need to reward balanced performance across all verdict categories. We also quantify the evidence usage in each model’s explanation by computing the citation coverage. Table [3](https://arxiv.org/html/2605.06006#S5.T3) reports Macro-F1 scores for each model and evidence mode, for both the binary and five-class verdict prediction tasks. Several clear patterns emerge. First, providing any evidence (Mode A) dramatically improves performance over the majority-class baseline, which achieves a Macro-F1 of 0.39 for binary and 0.10 for five-class. Even the raw anchor sentences enable Macro-F1 in the 0.57-0.68 range (binary) depending on the model, confirming that journalist-provided reference sentences capture relevant factual information needed to judge veracity. This supports RQ1.
Second, decontextualizing the evidence consistently improves Macro-F1 over Mode A. All models see an absolute Macro-F1 gain of 4-7 points in the binary setting and up to 8 points in the five-class setting when using self-contained premises from Mode B. For example, Qwen3-32B improves from 0.59 to 0.66 (binary) and 0.27 to 0.28 (five-class). The largest jump is for Llama-3.3-70B, rising to 0.74 (binary) and 0.35 (five-class) in Mode B, a relative improvement of about 8% and 27%, respectively. This indicates that evidence portability (RQ2) is not only beneficial for retrieval, but also aids the model in understanding and applying the evidence to the claim. By reducing ambiguity through operations such as resolving pronouns or making implicit context explicit, decontextualized premises make it easier for the verification model to connect facts to the claim.
Table 2: Distribution of data for five- and two-class settings.

Third, Mode C yields the highest verification performance across the board. Qwen3-235B and Llama-3.3-70B both reach binary Macro-F1 scores of 0.81, and Llama-3.3-70B achieves the top five-class score of 0.42. Relative to Mode B, this corresponds to an additional gain of 12 points for Qwen3-235B in the binary setting and 7 points for Llama-3.3-70B in the five-class setting. Notably, the improvements from Mode B to Mode C are smaller than from Mode A to Mode B, suggesting diminishing returns and possible overlap between anchor-based and open-extracted evidence. On further investigation, we find that, on average, about 25% of the source references surfaced in Mode C overlap with anchor-based evidence from Mode A. The trend holds across all model sizes and for both binary and fine-grained tasks, supporting the robustness hypothesis in RQ3. We also observe that larger models tend to perform better overall. For instance, Llama-3.3-70B outperforms the smaller Qwen3 models in each mode, which is expected given its greater parametric knowledge.
Table 3: Results across extraction modes (A-C) and models. Bold values indicate best Macro-F1 per setting.
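To make the majority-class baseline reported above more concrete, the sketch below computes Macro-F1 from scratch and applies it to a classifier that always predicts the most frequent label. The 64/36 label split is a hypothetical illustration and not the distribution from Table 2; it only shows how a constant predictor ends up with a low Macro-F1, because every class other than the majority contributes an F1 of zero.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all classes present in `gold`."""
    scores = []
    for label in sorted(set(gold)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical binary label distribution (NOT the paper's Table 2 counts).
gold = ["false"] * 64 + ["true"] * 36
majority = max(set(gold), key=gold.count)
pred = [majority] * len(gold)  # constant majority-class predictor

print(round(macro_f1(gold, pred), 2))
```

With this assumed split, the majority class reaches an F1 of about 0.78 and the minority class 0, giving a Macro-F1 of roughly 0.39. In a five-class setting the same effect is stronger, since four of the five per-class scores are zero.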
#### 5\.2\.3\. Evidence Faithfulness
As defined in Section [4.2.1](https://arxiv.org/html/2605.06006#S4.SS2.SSS1), we quantify faithfulness with the *Decontextualization Faithfulness Score* (DFS), which combines forward textual entailment E with an explicit penalty for lexical copy-overlap. In all experiments, E is estimated by a standard DeBERTa-Large cross-encoder fine-tuned on SNLI Bowman et al. ([2015](https://arxiv.org/html/2605.06006#bib.bib6)) and MNLI Williams et al. ([2018](https://arxiv.org/html/2605.06006#bib.bib38)). Table [4](https://arxiv.org/html/2605.06006#S5.T4) reports mean forward entailment (E) and DFS for *Mode B* between anchored source sentences and their decontextualized premises and *Mode C* between the referenced source sentences and open-extracted premises. Caveat: DFS can underestimate quality when the original source sentence is already self-contained and well decontextualized, because high lexical overlap O depresses the score even if E is strong.

Across models, three patterns emerge. First, in Mode B, smaller and mid-sized models achieve very high entailment with low DFS, indicating mostly minimal edits (e.g., Qwen3-14B has E = 0.91 but DFS = 0.03, and Qwen3-8B E = 0.81 with DFS = 0.06), whereas Qwen3-235B and Llama-3.3-70B strike a better balance with higher DFS (0.19 and 0.21) at moderate E (0.76 and 0.67), suggesting more substantive, portable rewrites rather than near-copies.

Second, in Mode C, as expected for more abstractive generation, E decreases across models while DFS rises for some configurations, indicating non-trivial reformulations. Qwen3-235B attains the strongest DFS in Mode C (0.16), followed by Qwen3-8B (0.11) and Llama-3.3-70B (0.09), reflecting premises that are less verbatim yet still sufficiently supported to aid downstream use. Notably, Mode B outputs appear more faithful to their references than Mode C, plausibly because the two-step, anchor-driven pipeline, with anchor selection followed by constrained decontextualization, biases rewrites toward the cited source, whereas open extraction has more freedom to abstract and synthesize.

Third, forward entailment alone tends to overestimate trivial copy-edits, while DFS differentiates portable decontextualizations from near-verbatim text. Models with higher DFS in Mode B (Qwen3-235B, Llama-3.3-70B) also yield strong retrieval and verification results (Table [1](https://arxiv.org/html/2605.06006#S5.T1), Table [3](https://arxiv.org/html/2605.06006#S5.T3)), aligning faithfulness with utility, and although Mode C is more abstractive (lower E), its DFS indicates that many generated premises remain useful as a complement to anchor-driven evidence.
Table 4: Results for mean forward entailment and DFS for Mode B and Mode C. Bold values indicate best performance per metric.
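The interaction between entailment and copy-overlap described above can be illustrated with a small sketch. The exact DFS formula is given in Section 4.2.1 and is not reproduced here; the multiplicative form `E * (1 - O)`, the token-based overlap measure, the entailment values, and the example sentences below are all assumptions made purely for illustration. In practice E would come from the NLI cross-encoder rather than being hard-coded.

```python
import re

def tokens(text):
    """Lowercased word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def copy_overlap(source, premise):
    """Share of premise tokens that also occur in the source (0..1)."""
    src = set(tokens(source))
    prem = tokens(premise)
    if not prem:
        return 0.0
    return sum(1 for t in prem if t in src) / len(prem)

def dfs_sketch(entailment, overlap):
    """ASSUMED form (not the paper's definition): E discounted by copy-overlap."""
    return entailment * (1.0 - overlap)

# Hypothetical source sentence and two candidate premises.
source = "He said the bill would cut taxes by 10 percent."
verbatim = source
rewrite = ("Senator John Doe stated that the 2023 Tax Relief Act "
           "reduces taxes by 10 percent.")

# Hypothetical entailment scores standing in for the NLI model's output.
score_verbatim = dfs_sketch(0.99, copy_overlap(source, verbatim))
score_rewrite = dfs_sketch(0.70, copy_overlap(source, rewrite))

print(score_verbatim, score_rewrite)
```

Under these assumptions, a verbatim copy scores zero despite near-perfect entailment, which mirrors the stated caveat that DFS can underestimate quality when the source sentence is already self-contained, while a substantive rewrite with moderate entailment scores higher.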
### 5\.3\. Human Study
To complement the automatic metrics, we conducted a manual annotation study of the evidence to evaluate extraction utility. We randomly sampled 100 premises each from Mode B and Mode C outputs. The sample covered premises extracted by the strongest overall model, Qwen3-235B, to evaluate best-case outputs. Each article contributed at most one sampled item to diversify topics. Two annotators with a background in fact-checking independently labeled all items following annotation guidelines. They assessed (a) whether the statement is self-contained and interpretable without surrounding context, and (b) the evidence type assigned to the statement: Document, Statistic, Quote, or Context. Question (a) was rated on an ordinal scale from incomplete (1) to complete (3).

For Mode B (a), the results show an observed agreement rate of 0.87 and a Krippendorff’s alpha of 0.255. Alpha is low despite the high raw agreement because both annotators rated the majority of cases as self-contained (3), so the labels are heavily skewed toward one category. For Mode B (b), we measure an observed agreement of 0.58 and a Krippendorff’s alpha of 0.441, driven by significant disagreements on the Context label. For Mode C (a), the results show an observed agreement rate of 0.835 and a Krippendorff’s alpha of 0.474. For Mode C (b), we measure an observed agreement of 0.67 and a Krippendorff’s alpha of 0.561. After resolving evidence-type annotation disagreements through discussion, with the final label restricted to one of the two original annotator choices, Qwen3-235B achieves a Macro-F1 of 0.859 for Mode B and 0.857 for Mode C. Furthermore, we did not identify label leakage or factual inconsistencies in either Mode B or Mode C within the annotated samples.
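The gap between high observed agreement and low Krippendorff’s alpha for the self-containedness ratings can be reproduced with a small sketch. It implements the nominal variant of alpha for two annotators without missing data, which is a simplification, since the study rates question (a) on an ordinal scale. The label lists are hypothetical and are not the study’s annotations.

```python
from collections import Counter

def observed_agreement(a, b):
    """Share of items on which both annotators chose the same label."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def krippendorff_alpha_nominal(a, b):
    """Krippendorff's alpha for two annotators, nominal labels, no missing data."""
    n = 2 * len(a)  # total number of pairable values
    value_counts = Counter(a) + Counter(b)
    # Each disagreeing item contributes two ordered pairs to the coincidence matrix.
    observed = 2 * sum(1 for x, y in zip(a, b) if x != y)
    expected = sum(value_counts[c] * value_counts[k]
                   for c in value_counts for k in value_counts if c != k)
    if expected == 0:
        return float("nan")  # only one label in use: alpha is undefined
    return 1.0 - (n - 1) * observed / expected

# Hypothetical ratings for 100 items, heavily skewed toward "3" (self-contained).
ann_a = [3] * 85 + [3] * 7 + [2] * 6 + [2] * 2
ann_b = [3] * 85 + [2] * 7 + [3] * 6 + [2] * 2

print(observed_agreement(ann_a, ann_b))
print(round(krippendorff_alpha_nominal(ann_a, ann_b), 3))
```

Here the annotators agree on 87 of 100 items, yet alpha is below 0.2, because with nearly all labels in one category most of that agreement is expected by chance.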
## 6\. Discussion
Our results show that fact-checking articles contain reusable evidence that can be systematically unlocked. In-text hyperlinks provide a strong and scalable signal for locating evidence-bearing statements, and decontextualizing these statements into stand-alone premises consistently improves both retrieval and verification. This suggests that fact-checkers’ sourcing practices can be repurposed to build structured evidence resources for automated fact-checking. In this sense, PrimeFacts can be interpreted as an intermediate layer between document retrieval and claim verification. Instead of reasoning directly over long source documents, verification models operate on compact, decontextualized premises that explicitly encode the factual content of the evidence.
The comparison between Modes B and C reveals a clear trade-off. Mode B provides grounded, source-linked premises that remain close to the journalist’s explicitly cited evidence, while Mode C often surfaces additional relevant premises beyond hyperlink anchors and achieves the strongest downstream performance. At the same time, open extraction can introduce redundancy or produce statements with high lexical overlap with the claim itself. On average, about 25% of the source references surfaced in Mode C overlap with anchor-based evidence from Mode A, indicating partial but non-trivial complementarity rather than simple duplication. In practice, this suggests a hybrid strategy: use Mode B as a faithful foundation and supplement it with non-redundant Mode C premises to improve coverage.
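One way such a hybrid strategy could be realized is sketched below: all Mode B premises are kept, and a Mode C premise is added only if it is not a near-duplicate of anything already selected. The use of token-level Jaccard similarity, the threshold of 0.6, the function names, and the example premises are all assumptions for illustration; the paper does not prescribe a specific merging procedure.

```python
import re

def token_set(text):
    """Set of lowercased word tokens in a premise."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def merge_premises(mode_b, mode_c, threshold=0.6):
    """Keep all Mode B premises; add Mode C premises that are not near-duplicates."""
    merged = list(mode_b)
    kept_sets = [token_set(p) for p in merged]
    for premise in mode_c:
        cand = token_set(premise)
        if all(jaccard(cand, seen) < threshold for seen in kept_sets):
            merged.append(premise)
            kept_sets.append(cand)
    return merged

# Hypothetical premises for a single claim.
mode_b = ["The 2023 Tax Relief Act reduces income taxes by 10 percent."]
mode_c = [
    "The 2023 Tax Relief Act reduces income taxes by 10 percent overall.",
    "An independent budget office estimated that the act would add "
    "50 billion dollars to the deficit.",
]

for premise in merge_premises(mode_b, mode_c):
    print(premise)
```

In this example the first Mode C premise is dropped as a near-duplicate of the anchored premise, while the second is kept because it adds information not covered by Mode B. A semantic similarity measure could replace the lexical one where paraphrased duplicates are a concern.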
We also observe that decontextualized premises often support stronger predictions while requiring fewer cited items in the generated justifications\. This suggests that self\-contained premises may be individually more informative for decision\-making, although citation coverage should be interpreted cautiously because it also reflects model selection behavior\. One explanation is that decontextualization removes discourse dependencies that would otherwise require models to reconstruct context from surrounding text\. By converting evidence into self\-contained premises, the reasoning task becomes closer to structured factual inference than long\-context interpretation\. Faithfulness analyses further indicate that stronger models tend to produce supported rewrites rather than verbatim copies, which aligns with their improved downstream verification utility\.
These trends were consistent across model families and across both binary and five-class verdict settings, indicating that the gains stem from the evidence representation rather than from a single model architecture. Overall, PrimeFacts shows that evidence extracted from professional fact-checks can support retrieval-augmented and semi-automated verification workflows, although human oversight remains important when generated premises are used in high-stakes settings.
## 7\. Conclusion
We presented PrimeFacts, a methodology and resource for transforming full-length fact-checking articles into a reusable evidence resource, and demonstrated its value for automated misinformation detection. Our framework leverages fact-checkers’ own sourcing practices by using hyperlink-anchored evidence, decontextualizing these statements into stand-alone premises, and investigating whether similar evidence can also be extracted without relying on anchors. This yields structured evidence representations that are suitable for downstream retrieval and verification. Our findings support the core assumptions of the paper. First, in-article hyperlinks provide a strong and scalable signal for identifying evidence-bearing content. Second, rewriting anchored evidence into decontextualized premises substantially improves both cross-article retrieval and verdict prediction. Third, these improvements remain consistent across different verdict granularities and model architectures. Together, these results show that evidence extracted from professional fact-checks can serve as an effective intermediate representation between long-form journalistic articles and automated claim verification.
By introducing the PrimeFacts resource and extraction methodology, we aim to support future research on retrieval-augmented fact-checking, evidence reuse, and transparent decision support. More broadly, our work suggests that professional fact-checks are not only useful as final verdicts, but also as rich repositories of structured evidence that can support more transparent, reusable, and effective verification systems.
## Ethical Considerations
Our goal is to develop PrimeFacts as a structured knowledge base for intelligent decision-support systems in fact-checking and related applications. While it enables automated evidence retrieval and verdict prediction, these functions are designed to assist rather than replace human judgment, particularly in high-stakes or politically sensitive contexts. Collaborative human oversight remains essential to interpret nuance, context, and evolving facts.

The evidence extracted in PrimeFacts reflects how fact-checkers present and justify information within their articles. Both the selection and presentation of evidence may encode subtle biases from the authors, such as framing, emphasis, or omission of counterpoints, which our extraction pipeline may in turn reproduce. Similarly, while fact-checking organizations are reputable and adhere to editorial standards, their verdicts and accompanying justifications are not free from subjective interpretation and editorial policies. These judgments can be influenced by institutional perspectives, available sources, or political context. Analyses or systems built on this resource should therefore explicitly account for such potential biases.

The dataset is derived from copyrighted fact-checking articles. We publicly release only derived metadata and annotations. Original fact-check article texts are not redistributed for copyright reasons.
## Limitations
Our work has some limitations that suggest avenues for future work. One primary limitation is the reliance on explicit in-text citations (hyperlinks) to identify evidence. While PolitiFact articles are richly linked, some fact-checks or segments rely on implicit evidence or general knowledge that is not captured by a specific hyperlink. Our pipeline would miss such uncited yet important premises. In domains or languages where fact-checkers provide fewer references, a hybrid extraction strategy, combining the anchor-based method with additional open extraction, may be necessary to achieve high recall.

Another limitation lies in the scope of the extracted evidence. We isolate individual supporting sentences but do not explicitly capture the logical structure or multi-hop reasoning that a fact-checker might apply across an article. For example, an article might piece together two separate facts to reach a conclusion, but our current method would list these facts separately without representing their inferential connection. This could limit the usefulness of the evidence in tasks requiring joint reasoning. Future extensions should link premises into argumentative chains and label their roles, enabling a closer mirroring of human reasoning steps.

The use of large language models for evidence rewriting and generation introduces additional considerations. Although we took measures to preserve faithfulness, such as constrained prompting and post-hoc entailment checks, LLMs can occasionally produce subtly altered or extraneous details. In our manual evaluation we did not observe major factual errors, but there remains a risk of hallucination, especially as we push the models to be more abstractive across long contexts. Users of the PrimeFacts framework should treat decontextualized premises as suggestions to be compared against the original anchor or reference statements.

Furthermore, our evaluation is conducted exclusively on PolitiFact, a single English-language fact-checking platform with a particular editorial style and sourcing convention. While Mode C is platform-agnostic by design and Mode B requires only that articles contain hyperlinks, we have not yet validated our pipeline on other platforms (e.g., Full Fact, Snopes) or languages. Cross-platform and multilingual evaluation is planned as future work. We also note that our operationalization of portability through retrieval performance and robustness through multi-model, multi-granularity evaluation may not capture all aspects of these concepts. Model-size differences in our LLM comparisons introduce a confound, since larger models have both more parametric knowledge and better instruction-following ability; disentangling these factors is an open question. Finally, we have not tested robustness to adversarial noise or contradictory evidence in the premise set, which would be a valuable stress test for future work.
## Acknowledgments
This work is funded by the German Federal Ministry for Research, Technology and Aeronautics (BMFTR) in the scope of the projects news-polygraph (03RU2U151C) and FAR-REASONING (16IS23068). This work is supported by JST CREST Grants (JPMJCR20D3), Japan.
## References
- T. Alhindi, S. Petridis, and S. Muresan (2018) Where is Your Evidence: Improving Fact-checking by Justification Modeling. In Proceedings of the First Workshop on Fact Extraction and VERification (FEVER), J. Thorne, A. Vlachos, O. Cocarascu, C. Christodoulopoulos, and A. Mittal (Eds.), Brussels, Belgium, pp. 85–90. External Links: [Document](https://dx.doi.org/10.18653/v1/W18-5513). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p1.1), [§2](https://arxiv.org/html/2605.06006#S2.p1.1), [§4.1](https://arxiv.org/html/2605.06006#S4.SS1.p1.3).
- R. Aly, Z. Guo, M. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, and A. Mittal (2021) FEVEROUS: Fact Extraction and VERification Over Unstructured and Structured information. arXiv. External Links: 2106.05707, [Document](https://dx.doi.org/10.48550/arXiv.2106.05707). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, and J. G. Simonsen (2019) MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), K. Inui, J. Jiang, V. Ng, and X. Wan (Eds.), Hong Kong, China, pp. 4685–4697. External Links: [Document](https://dx.doi.org/10.18653/v1/D19-1475). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. arXiv. External Links: 1508.05326, [Document](https://dx.doi.org/10.48550/arXiv.1508.05326). Cited by: [§5.2.3](https://arxiv.org/html/2605.06006#S5.SS2.SSS3.p1.12).
- R. Cazzamatta (2025a) Building a cross-border european information network: hyperlink connections among fact-checking organizations. Media and Communication 13. Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p3.1), [§2](https://arxiv.org/html/2605.06006#S2.p1.1), [§4.1](https://arxiv.org/html/2605.06006#S4.SS1.p1.3).
- R. Cazzamatta (2025b) Decoding Correction Strategies: How Fact-Checkers Uncover Falsehoods Across Countries. Journalism Studies 26(7), pp. 777–799. External Links: ISSN 1461-670X, [Document](https://dx.doi.org/10.1080/1461670X.2025.2470177). Cited by: [§4.1.2](https://arxiv.org/html/2605.06006#S4.SS1.SSS2.p1.2).
- R. Cazzamatta (2025c) Redefining objectivity: Exploring types of evidence by fact-checkers in four European countries. European Journal of Communication 40(2), pp. 144–164. External Links: ISSN 0267-3231, 1460-3705, [Document](https://dx.doi.org/10.1177/02673231251319145). Cited by: [§4.1.2](https://arxiv.org/html/2605.06006#S4.SS1.SSS2.p1.2).
- T. Chen, H. Wang, S. Chen, W. Yu, K. Ma, X. Zhao, D. Yu, and H. Zhang (2024) Dense X retrieval: what retrieval granularity should we use? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 15159–15177. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- E. Choi, J. Palomaki, M. Lamm, T. Kwiatkowski, D. Das, and M. Collins (2021) Decontextualization: Making Sentences Stand-Alone. Transactions of the Association for Computational Linguistics 9, pp. 447–461. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00377). Cited by: [§4.1.2](https://arxiv.org/html/2605.06006#S4.SS1.SSS2.p1.2), [§4.2.1](https://arxiv.org/html/2605.06006#S4.SS2.SSS1.p1.6).
- X. Deng, X. Wang, and M. Stevenson (2025) The next phase of scientific fact-checking: advanced evidence retrieval from complex structured academic papers. In Proceedings of the 2025 International ACM SIGIR Conference on Innovative Concepts and Theories in Information Retrieval (ICTIR), pp. 436–448. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- M. Glockner, Y. Hou, and I. Gurevych (2022) Missing Counter-Evidence Renders NLP Fact-Checking Unrealistic for Misinformation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Y. Goldberg, Z. Kozareva, and Y. Zhang (Eds.), Abu Dhabi, United Arab Emirates, pp. 5916–5936. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.emnlp-main.397). Cited by: [§4.1](https://arxiv.org/html/2605.06006#S4.SS1.p1.3).
- L. Graves (2016) Deciding what’s true: the rise of political fact-checking in American journalism. Columbia University Press, New York. External Links: ISBN 978-0-231-17506-7 978-0-231-54222-7. Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p1.1).
- A. Gunjal and G. Durrett (2024) Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification. In Findings of the Association for Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal, and Y. Chen (Eds.), Miami, Florida, USA, pp. 3751–3768. External Links: [Document](https://dx.doi.org/10.18653/v1/2024.findings-emnlp.215). Cited by: [§4.1.2](https://arxiv.org/html/2605.06006#S4.SS1.SSS2.p1.2).
- Z. Guo, M. Schlichtkrull, and A. Vlachos (2022) A Survey on Automated Fact-Checking. Transactions of the Association for Computational Linguistics 10, pp. 178–206. External Links: ISSN 2307-387X, [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00454). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1), [§4.2.3](https://arxiv.org/html/2605.06006#S4.SS2.SSS3.p1.4).
- E. Humprecht (2020) How Do They Debunk “Fake News”? A Cross-National Comparison of Transparency in Fact Checks. Digital Journalism 8(3), pp. 310–327. External Links: ISSN 2167-0811, [Document](https://dx.doi.org/10.1080/21670811.2019.1691031). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1), [§1](https://arxiv.org/html/2605.06006#S1.p3.1), [§4.1](https://arxiv.org/html/2605.06006#S4.SS1.p1.3).
- S. Jiang, S. Baumgartner, A. Ittycheriah, and C. Yu (2020a) Factoring fact-checks: structured information extraction from fact-checking articles. In Proceedings of The Web Conference 2020, pp. 1592–1603. Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1).
- S. Jiang, S. Baumgartner, A. Ittycheriah, and C. Yu (2020b) Factoring Fact-Checks: Structured Information Extraction from Fact-Checking Articles. In Proceedings of The Web Conference 2020, WWW ’20, New York, NY, USA, pp. 1592–1603. External Links: [Document](https://dx.doi.org/10.1145/3366423.3380231), ISBN 978-1-4503-7023-3. Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p1.1).
- S. Jolly, P. Atanasova, and I. Augenstein (2022) Generating Fluent Fact Checking Explanations with Unsupervised Post-Editing. Information 13(10), pp. 500. External Links: ISSN 2078-2489, [Document](https://dx.doi.org/10.3390/info13100500). Cited by: [§4.2.3](https://arxiv.org/html/2605.06006#S4.SS2.SSS3.p1.4).
- L. Kavtaradze (2024) Challenges of automating fact-checking: a technographic case study. Emerging Media 2(2), pp. 236–258. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- K. Khan, R. Wang, and P. Poupart (2022) WatClaimCheck: A new Dataset for Claim Entailment and Inference. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), S. Muresan, P. Nakov, and A. Villavicencio (Eds.), Dublin, Ireland, pp. 1293–1304. External Links: [Document](https://dx.doi.org/10.18653/v1/2022.acl-long.92). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- M. Klein, H. Van de Sompel, R. Sanderson, H. Shankar, L. Balakireva, K. Zhou, and R. Tobin (2014) Scholarly context not found: one in five articles suffers from reference rot. PloS one 9(12), pp. e115253. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- B. Z. Li, E. Liu, A. Ross, A. Zeitoun, G. Neubig, and J. Andreas (2025) Language modeling with editable external knowledge. In Findings of the Association for Computational Linguistics: NAACL 2025, pp. 3070–3090. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- X. H. Lù (2024) BM25S: Orders of magnitude faster lexical search via eager sparse scoring. arXiv. External Links: 2407.03618, [Document](https://dx.doi.org/10.48550/arXiv.2407.03618). Cited by: [§5.2.1](https://arxiv.org/html/2605.06006#S5.SS2.SSS1.p1.2).
- H. Ma, W. Xu, Y. Wei, L. Chen, L. Wang, Q. Liu, S. Wu, and L. Wang (2024) EX-FEVER: A Dataset for Multi-hop Explainable Fact Verification. arXiv. External Links: 2310.09754, [Document](https://dx.doi.org/10.48550/arXiv.2310.09754). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- I. Maab, E. Marrese-Taylor, S. Padó, and Y. Matsuo (2024) Media bias detection across families of language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Gomez, and S. Bethard (Eds.), Mexico City, Mexico, pp. 4083–4098. External Links: [Link](https://aclanthology.org/2024.naacl-long.227/), [Document](https://dx.doi.org/10.18653/v1/2024.naacl-long.227). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1).
- J. Maynez, S. Narayan, B. Bohnet, and R. McDonald (2020) On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, D. Jurafsky, J. Chai, N. Schluter, and J. Tetreault (Eds.), Online, pp. 1906–1919. External Links: [Document](https://dx.doi.org/10.18653/v1/2020.acl-main.173). Cited by: [§4.2.1](https://arxiv.org/html/2605.06006#S4.SS2.SSS1.p1.6).
- Meta AI (2024) Llama 3.3 70B Instruct. Note: [https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct). Accessed: 2025-10-13. Cited by: [§5.1](https://arxiv.org/html/2605.06006#S5.SS1.p1.1).
- Meta (2025) Llama 4 scout (17b, 16 experts), instruct version. Note: [https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct](https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E-Instruct). Accessed: 2025-10-13. Cited by: [§5.1](https://arxiv.org/html/2605.06006#S5.SS1.p1.1).
- A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig (2018) Stress Test Evaluation for Natural Language Inference. In Proceedings of the 27th International Conference on Computational Linguistics, E. M. Bender, L. Derczynski, and P. Isabelle (Eds.), Santa Fe, New Mexico, USA, pp. 2340–2353. Cited by: [§4.2.1](https://arxiv.org/html/2605.06006#S4.SS2.SSS1.p1.6).
- P. Nakov, G. D. S. Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari, M. Hasanain, W. Mansour, B. Hamdan, Z. S. Ali, N. Babulkov, A. Nikolov, G. K. Shahi, J. M. Struß, T. Mandl, M. Kutlu, and Y. S. Kartal (2021) Overview of the CLEF–2021 CheckThat! Lab on Detecting Check-Worthy Claims, Previously Fact-Checked Claims, and Fake News. arXiv. External Links: 2109.12987, [Document](https://dx.doi.org/10.48550/arXiv.2109.12987). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1).
- W. Ostrowski, A. Arora, P. Atanasova, and I. Augenstein (2021) Multi-Hop Fact Checking of Political Claims. arXiv. External Links: 2009.06401, [Document](https://dx.doi.org/10.48550/arXiv.2009.06401). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1), [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- R. Panchendrarajan and A. Zubiaga (2024) Claim Detection for Automated Fact-checking: A Survey on Monolingual, Multilingual and Cross-Lingual Research. Natural Language Processing Journal 7, pp. 100066. External Links: 2401.11969, ISSN 29497191, [Document](https://dx.doi.org/10.1016/j.nlp.2024.100066). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1), [§4.2.2](https://arxiv.org/html/2605.06006#S4.SS2.SSS2.p1.1).
- P. Sahitaj, I. Maab, J. Yamagishi, J. Kolanowski, S. Möller, and V. Schmitt (2025) Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs. arXiv. External Links: 2502.08909, [Document](https://dx.doi.org/10.48550/arXiv.2502.08909). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p1.1).
- C. Samarinas, W. Hsu, and M. Lee (2021) Improving evidence retrieval for automated explainable fact-checking. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pp. 84–91. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- A. Sauchuk, J. Thorne, A. Halevy, N. Tonellotto, and F. Silvestri (2022) On the role of relevance in natural language processing tasks. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 1785–1789. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- M. Schlichtkrull, Z. Guo, and A. Vlachos (2023) Averitec: a dataset for real-world claim verification with evidence from the web. Advances in Neural Information Processing Systems 36, pp. 65128–65167. Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1).
- T. Schuster, A. Fisch, and R. Barzilay (2021) Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou (Eds.), Online, pp. 624–643. External Links: [Document](https://dx.doi.org/10.18653/v1/2021.naacl-main.52). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- C. Stab and I. Gurevych (2017) Parsing argumentation structures in persuasive essays. Computational Linguistics 43(3), pp. 619–659. Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1).
- J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018) FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 809–819. External Links: [Document](https://dx.doi.org/10.18653/v1/N18-1074). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- X. Wang, E. Cabrio, and S. Villata (2025a) SAFE: structured argumentation for fact-checking with explanations. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, J. Kwok (Ed.), pp. 11114–11118. Note: Demo Track. External Links: [Document](https://dx.doi.org/10.24963/ijcai.2025/1274), [Link](https://doi.org/10.24963/ijcai.2025/1274). Cited by: [§1](https://arxiv.org/html/2605.06006#S1.p2.1).
- X. Wang, E. Cabrio, and S. Villata (2025b) When automated fact-checking meets argumentation: unveiling fake news through argumentative evidence. Argument and Computation 16. External Links: [Document](https://dx.doi.org/10.1177/19462174251330980). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- G. Warren, I. Shklovski, and I. Augenstein (2025) Show me the work: fact-checkers’ requirements for explainable automated fact-checking. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–21. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- A. Weichselbraun (2021) Inscriptis - a python-based HTML to text conversion library optimized for knowledge extraction from the web. Journal of Open Source Software 6(66), pp. 3557. External Links: ISSN 2475-9066, [Document](https://dx.doi.org/10.21105/joss.03557). Cited by: [§3](https://arxiv.org/html/2605.06006#S3.p1.1).
- A. Williams, N. Nangia, and S. Bowman (2018) A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), M. Walker, H. Ji, and A. Stent (Eds.), New Orleans, Louisiana, pp. 1112–1122. External Links: [Document](https://dx.doi.org/10.18653/v1/N18-1101). Cited by: [§5.2.3](https://arxiv.org/html/2605.06006#S5.SS2.SSS3.p1.12).
- R. Xing, T. Baldwin, and J. H. Lau (2024) Evaluating transparency of machine generated fact checking explanations. arXiv e-prints, pp. arXiv–2406. Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388. Cited by: [§5.1](https://arxiv.org/html/2605.06006#S5.SS1.p1.1).
- F. Zeng and W. Gao (2024) JustiLM: Few-shot Justification Generation for Explainable Fact-Checking of Real-world Claims. Transactions of the Association for Computational Linguistics 12, pp. 334–354. External Links: [Document](https://dx.doi.org/10.1162/tacl%5Fa%5F00649). Cited by: [§2](https://arxiv.org/html/2605.06006#S2.p1.1).
- K\. Zhou, C\. Grover, M\. Klein, and R\. Tobin \(2015\)No more 404s: predicting referenced link rot in scholarly articles for pro\-active archiving\.InProceedings of the 15th ACM/IEEE\-CS joint conference on digital libraries,pp\. 233–236\.Cited by:[§2](https://arxiv.org/html/2605.06006#S2.p1.1)\.
## Appendix A Appendix
### A\.1\. Mode B: Decontextualization Prompt
For each anchor sentence, the LLM receives a system prompt followed by a user message\. The system prompt instructs the model to produce a single decontextualized sentence with an evidence\-type category via structured JSON output:
> System: “You are a careful scientific editor\. Produce ONE decontextualized sentence that stands alone, explicitly preserving or adding entities, numbers, dates that make the sentence clear even when read outside of the article\. Assign a category label using exactly one of: QUOTE, STATISTIC, DOCUMENT, CONTEXT, OTHER\. \[…category guide with definitions and examples…\] Return JSON only\.”
>
> User: “Claim: \{claim\} \\n Article \(labeled\): \{segmented\_article\} \\n Target letter: \{letter\} \\n Target sentence: \{target\_sentence\} \\n Return JSON only\.”

The JSON schema constrains the output to three fields: letter \(the segment identifier\), decontextualized \(the rewritten sentence\), and category \(one of the five evidence types\)\.
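As a rough illustration of how the Mode B prompt and its three-field output might be handled in code, the following Python sketch fills the user-message template and validates a response. The function names, the exact whitespace around line breaks, and the check that the returned letter matches the target letter are assumptions made for this example; they are not taken from the PrimeFacts implementation.

```python
import json

# The five evidence-type labels named in the system prompt.
CATEGORIES = {"QUOTE", "STATISTIC", "DOCUMENT", "CONTEXT", "OTHER"}
FIELDS = {"letter", "decontextualized", "category"}


def build_mode_b_user_message(claim, segmented_article, letter, target_sentence):
    """Fill the user-message template (hypothetical helper)."""
    return (
        f"Claim: {claim}\n"
        f"Article (labeled): {segmented_article}\n"
        f"Target letter: {letter}\n"
        f"Target sentence: {target_sentence}\n"
        "Return JSON only."
    )


def parse_mode_b_output(raw, expected_letter):
    """Parse one model response and check it against the three-field schema.

    Raises ValueError on any violation (json.JSONDecodeError is a ValueError).
    """
    obj = json.loads(raw)
    if not isinstance(obj, dict) or set(obj) != FIELDS:
        raise ValueError("expected exactly: letter, decontextualized, category")
    if obj["category"] not in CATEGORIES:
        raise ValueError(f"unknown category: {obj['category']!r}")
    # Assumption: the returned letter should echo the target segment.
    if obj["letter"] != expected_letter:
        raise ValueError("letter does not match the target segment")
    if not isinstance(obj["decontextualized"], str) or not obj["decontextualized"].strip():
        raise ValueError("decontextualized sentence is empty")
    return obj
```

In practice the same constraints would more likely be enforced by the structured-output feature of the LLM API in use; a post-hoc check like this one is only a fallback.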
### A\.2\. Mode C: Open Extraction Prompt
Mode C receives the full article and extracts multiple premises at once\. The system prompt specifies a bounded range of premises and uses the same category guide:
> System: “You are a careful scientific editor\. Extract \{min\}–\{max\} non\-redundant key premises from the labeled article\. For each premise, provide: \(a\) exactly ONE letter anchor from the article that supports it; \(b\) ONE decontextualized sentence that stands alone; and \(c\) a category using exactly one of: QUOTE, STATISTIC, DOCUMENT, CONTEXT, OTHER\. \[…category guide…\] Output JSON only\.”
>
> User: “Claim: \{claim\} \\n Article \(labeled\): \{segmented\_article\} \\n Return JSON only\.”
The output schema constrains the response to a list of premise objects, each with letter, decontextualized, and category fields\. The maximum list length is set to the number of Mode A anchors for that article\.
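The Mode C output constraint can be sketched as a JSON Schema whose `maxItems` is tied to the Mode A anchor count. This is a minimal sketch under stated assumptions: the function names, the `min_premises` default of 1 (the source leaves `{min}` unspecified), and the top-level array layout are illustrative, and some structured-output APIs require wrapping the list in an object instead.

```python
import json

CATEGORIES = ["QUOTE", "STATISTIC", "DOCUMENT", "CONTEXT", "OTHER"]


def mode_c_schema(num_mode_a_anchors, min_premises=1):
    """Build a JSON Schema for the premise list (hypothetical helper).

    maxItems follows the rule stated above: at most as many premises as
    the article has Mode A anchors. min_premises is an assumed default.
    """
    return {
        "type": "array",
        "minItems": min_premises,
        "maxItems": num_mode_a_anchors,
        "items": {
            "type": "object",
            "properties": {
                "letter": {"type": "string"},
                "decontextualized": {"type": "string"},
                "category": {"type": "string", "enum": CATEGORIES},
            },
            "required": ["letter", "decontextualized", "category"],
            "additionalProperties": False,
        },
    }


def validate_premises(raw, schema):
    """Check a raw JSON response against the schema using only the stdlib."""
    premises = json.loads(raw)
    if not isinstance(premises, list):
        raise ValueError("expected a JSON list of premises")
    if not schema["minItems"] <= len(premises) <= schema["maxItems"]:
        raise ValueError("premise count outside the allowed range")
    required = set(schema["items"]["required"])
    allowed = set(schema["items"]["properties"]["category"]["enum"])
    for premise in premises:
        if not isinstance(premise, dict) or set(premise) != required:
            raise ValueError("each premise needs exactly the three fields")
        if premise["category"] not in allowed:
            raise ValueError(f"unknown category: {premise['category']!r}")
    return premises
```

For an article with four Mode A anchors, `mode_c_schema(4)` would cap the response at four premises.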