From Snippets to Semantics: Rethinking Evidence Granularity for Multilingual Fact Verification

arXiv cs.CL 05/27/26, 04:00 AM Papers
multilingual fact-verification evidence-extraction semantic-chunking nlp llm lora
Summary
This paper introduces SEEK, a framework for semantic evidence extraction in multilingual fact verification, which constructs coherent evidence chunks from full articles and fine-tunes multilingual LLMs with LoRA, achieving up to 20% improvement in macro-F1 over baselines.
arXiv:2605.26755v1 Announce Type: new Abstract: Multilingual fact verification requires evidence that is both relevant and sufficiently complete for reliable factuality prediction. However, existing systems often rely on search snippets, sentence-level evidence, or locally segmented passages, which can miss decisive context and produce fragmented evidence. To overcome these limitations, we propose SEEK, a Semantic Evidence Extraction with an adaptive chunKing framework that constructs coherent evidence chunks from full fact-checking articles by identifying semantic topic transitions and preserving local verification context. The constructed chunks are encoded using a multilingual encoder and then multilingual LLMs are finetuned using LoRA adapter for veracity prediction. Experiments on X-FACT and RU22Fact show that SEEK improves macro-f1 by up to 10% over semantic chunking, 19% over sentence chunking, and 20% over search-snippet baselines. Evidence completeness and significance analyses further show that SEEK preserves richer verification context and enables more reliable multilingual fact-checking.
# From Snippets to Semantics: Rethinking Evidence Granularity for Multilingual Fact Verification
Source: [https://arxiv.org/html/2605.26755](https://arxiv.org/html/2605.26755)
Babu Kumar \*, Gaurav Kumar \*, Ayush Garg, Aditya Kishore, Jasabanta Patro Department of Data Science and Engineering Indian Institute of Science Education and Research, Bhopal, India \{babu21, gaurav22, ayushg24, adityak21, jpatro\}@iiserb\.ac\.in

###### Abstract

Multilingual fact verification requires evidence that is both relevant and sufficiently complete for reliable factuality prediction. However, existing systems often rely on search snippets, sentence-level evidence, or locally segmented passages, which can miss decisive context and produce fragmented evidence. To overcome these limitations, we propose SEEK, a Semantic Evidence Extraction with adaptive chunKing framework that constructs coherent evidence chunks from full fact-checking articles by identifying semantic topic transitions and preserving local verification context. The constructed chunks are encoded using a multilingual encoder, and multilingual LLMs are then fine-tuned with LoRA adapters for veracity prediction. Experiments on X-FACT and RU22Fact show that SEEK improves macro-F1 by up to 10% over semantic chunking, 19% over sentence chunking, and 20% over search-snippet baselines. Evidence completeness and significance analyses further show that SEEK preserves richer verification context and enables more reliable multilingual fact-checking.


## 1 Introduction

The multilingual nature of today’s online information ecosystem has made misinformation increasingly difficult to detect, contextualize, and verify Panchendrarajan and Zubiaga ([2024](https://arxiv.org/html/2605.26755#bib.bib15)). A single misleading claim may circulate across languages, platforms, and regional communities, often appearing in modified forms through translation, paraphrasing, or culturally specific framing Quelle et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib19)); Peng et al. ([2025b](https://arxiv.org/html/2605.26755#bib.bib17)). In such settings, fact verification is not limited to deciding whether a claim is true or false; it also requires identifying reliable evidence, preserving the meaning of the claim across languages, and reasoning over heterogeneous sources that may differ in structure, language, and level of detail Guo et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib5)); Gupta and Srikumar ([2021a](https://arxiv.org/html/2605.26755#bib.bib6)).

![Refer to caption](https://arxiv.org/html/2605.26755v1/x1.png)

Figure 1: High-level illustration of the proposed multilingual fact verification framework. Noisy multilingual snippets and full web pages are transformed through a SEEK lens into coherent evidence, which is then used by a multilingual LLM for veracity prediction.

Human fact-checkers remain central to this process because they can interpret context, compare sources, and evaluate evidence with linguistic and domain-specific judgment Nakov et al. ([2021](https://arxiv.org/html/2605.26755#bib.bib13)). However, manual fact-checking is difficult to scale against the volume and speed of online misinformation Nakov et al. ([2021](https://arxiv.org/html/2605.26755#bib.bib13)); Nanhekhan et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib14)). This has motivated automated fact verification, where NLP systems retrieve relevant evidence and predict claim veracity Guo et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib5)); Zheng et al. ([2024a](https://arxiv.org/html/2605.26755#bib.bib30)). Despite recent progress, system reliability depends strongly on evidence quality Zheng et al. ([2024a](https://arxiv.org/html/2605.26755#bib.bib30)); Nanhekhan et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib14)), especially in multilingual settings where relevant information may be scattered across long fact-checking articles, news reports, official documents, or web pages Zheng et al. ([2024a](https://arxiv.org/html/2605.26755#bib.bib30)); Gupta and Srikumar ([2021a](https://arxiv.org/html/2605.26755#bib.bib6)); Cekinel et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib2)); Peng et al. ([2025b](https://arxiv.org/html/2605.26755#bib.bib17)); Schlichtkrull et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib20)). Such sources often mix background discussion, quoted claims, contextual explanations, and final verdicts within the same document Augenstein et al. ([2019](https://arxiv.org/html/2605.26755#bib.bib1)); Schlichtkrull et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib20)). As a result, short snippets or isolated sentences may miss the complete verification context, whereas full-document processing can introduce substantial irrelevant information Zheng et al. ([2024a](https://arxiv.org/html/2605.26755#bib.bib30)); Schlichtkrull et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib20)). This creates a mismatch between the retrieved evidence unit and the evidence span actually required for reliable claim verification Zheng et al. ([2024a](https://arxiv.org/html/2605.26755#bib.bib30)); Schlichtkrull et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib20)).

Existing evidence retrieval strategies often rely on fixed-size passages, sentence-level segmentation, or externally retrieved snippets Chen et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib3)); Schlichtkrull et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib20)). While simple and efficient, these approaches may not align with the semantic structure of fact-checking documents Zhang et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib28)). Fixed windows can split reasoning across boundaries, whereas sentence-level retrieval may discard the surrounding context needed for correct interpretation Zhang et al. ([2023](https://arxiv.org/html/2605.26755#bib.bib28)); Zheng et al. ([2024a](https://arxiv.org/html/2605.26755#bib.bib30)). This limitation is amplified in multilingual settings, where translation variation, code-mixing, and language-specific discourse patterns can weaken claim-evidence alignment Panchendrarajan and Zubiaga ([2024](https://arxiv.org/html/2605.26755#bib.bib15)); Peng et al. ([2025b](https://arxiv.org/html/2605.26755#bib.bib17)).

To address these limitations, we propose SEEK, a Semantic Evidence Extraction with adaptive chunKing framework for multilingual fact verification. Rather than relying on short snippets or fixed-length passages, SEEK constructs coherent evidence chunks from full web documents by detecting semantic topic shifts in a shared multilingual embedding space. These chunks preserve complete verification context while reducing irrelevant noise, and are then retrieved for veracity prediction using LoRA-tuned Hu et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib8)) multilingual large language models. The central idea is that reliable multilingual fact verification requires not only strong language models, but also better evidence granularity. Improving how evidence is segmented and retrieved can therefore enhance the reliability of downstream veracity prediction across multilingual settings.

Our contributions are as follows:

- We introduce SEEK, a multilingual evidence construction framework that combines contextual topic-shift detection, score smoothing, adaptive thresholding, and boundary overlap to generate verification-oriented evidence chunks.
- We combine SEEK with multilingual dense retrieval and LoRA-tuned multilingual LLMs, achieving state-of-the-art veracity prediction performance across multilingual fact-checking benchmarks.
- We perform comprehensive evidence analysis on X-FACT and RU22Fact, including evidence completeness, similarity, and statistical significance studies across multilingual and generalization settings.
- We further conduct translation-based evaluation to analyze the effect of language normalization on evidence retrieval and veracity prediction in multilingual fact verification.

## 2 Related Work

### 2.1 General and Multilingual Fact-Checking

Automated fact-checking is typically framed as verifying a claim using either model-internal knowledge or external evidence. Benchmarks such as FEVER Thorne et al. ([2018](https://arxiv.org/html/2605.26755#bib.bib21)) and LIAR Wang ([2017](https://arxiv.org/html/2605.26755#bib.bib26)) established evidence-based and fine-grained fact verification in English, while multilingual datasets such as X-FACT Gupta and Srikumar ([2021b](https://arxiv.org/html/2605.26755#bib.bib7)) and RU22Fact Zeng et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib27)) extend this task across languages, domains, and resource settings. These benchmarks show that multilingual fact verification depends not only on strong veracity models, but also on evidence that is relevant, complete, and linguistically usable for the target claim.

### 2.2 Evidence Extraction and Retrieval

Snippet- and sentence-level retrieval. Many fact-checking systems rely on search snippets, individual sentences, or short passages as evidence Gupta and Srikumar ([2021a](https://arxiv.org/html/2605.26755#bib.bib6)). These units are efficient and focused, but they often fragment the verification context. In multilingual settings, this issue is amplified because snippets may omit key background, distort named entities, or miss the link between the claim, investigation, and verdict Gupta and Srikumar ([2021a](https://arxiv.org/html/2605.26755#bib.bib6)). Dense retrievers such as DPR and multilingual-e5 improve semantic matching between claims and evidence candidates Karpukhin et al. ([2020](https://arxiv.org/html/2605.26755#bib.bib10)); Wang et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib25)). Similarly, CONCRETE improves cross-lingual fact-checking by learning multilingual retrieval representations from trusted multilingual corpora Huang et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib9)). However, these retrieval methods do not directly resolve the granularity problem: the retrieved unit may still be too short to contain complete verification context or too broad to avoid irrelevant information.

Document chunking methodologies. Other approaches retrieve evidence from longer documents by splitting them into fixed-size, sentence-aware, or semantic chunks Qu et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib18)); Kiss ([2025](https://arxiv.org/html/2605.26755#bib.bib11)). Fixed and sentence-based chunking are simple, but they can cut off decisive information when verification cues appear beyond the boundary Qu et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib18)). Semantic chunking improves coherence, yet local similarity-based boundaries can still be unstable in noisy multilingual fact-checking articles that mix claims, explanations, quotes, and verdicts Qu et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib18)); Kiss ([2025](https://arxiv.org/html/2605.26755#bib.bib11)). This creates a key trade-off: shorter units reduce noise but risk context fragmentation, while longer units preserve context but may introduce irrelevant information Qu et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib18)).

### 2.3 Evidence Granularity for Multilingual Verification

Prior work has advanced multilingual fact-checking through better datasets, cross-lingual retrieval methods such as CONCRETE, translation-based normalization, and LLM-based verification Huang et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib9)); Peng et al. ([2025a](https://arxiv.org/html/2605.26755#bib.bib16)). However, the role of evidence granularity remains underexplored: retrieved text must be not only semantically related to the claim, but also complete enough to support factuality prediction Viswanathan et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib24)). This is especially important for multilingual verification, where systems must handle language variation, low-resource settings, noisy web pages, and dispersed evidence cues Peng et al. ([2025a](https://arxiv.org/html/2605.26755#bib.bib16)); Viswanathan et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib24)). This gap motivates an evidence construction strategy that operates beyond isolated snippets or fixed windows while still avoiding overly broad document segments. By aligning chunk boundaries with contextual topic shifts and retaining boundary overlap, the proposed approach preserves the continuity between the claim, supporting details, and verdict. In this way, it reduces context fragmentation in evidence retrieval and provides more complete verification evidence for multilingual fact-checking.

## 3 Dataset Details

In this section, we describe the datasets used in our evaluation. Since our work focuses on multilingual fact-checking, we consider two multilingual benchmarks: X-FACT Gupta and Srikumar ([2021b](https://arxiv.org/html/2605.26755#bib.bib7)) and RU22Fact Zeng et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib27)). These datasets allow us to evaluate the proposed framework across diverse languages, claims, and evidence sources. The statistical details of both datasets are summarized in Table [1](https://arxiv.org/html/2605.26755#S3.T1). Additional dataset descriptions are provided in Appendix [A](https://arxiv.org/html/2605.26755#A1).

Table 1: Details of multilingual datasets considered in our work.

## 4 Methodology

The overall fact-checking pipeline used in this work consists of four key components: (i) a web crawler, (ii) the SEEK chunking module, (iii) a multilingual dense retriever, and (iv) instruction-tuned large language models, namely LLaMA, Gemma, and Mistral. The individual components are described below:

Web crawler: In the X-FACT dataset Gupta and Srikumar ([2021b](https://arxiv.org/html/2605.26755#bib.bib7)), each claim is accompanied by five Google Search snippets and their corresponding source URLs. Since these snippets are often insufficient for reliable verification, we crawl the full content of each URL using Crawl4AI UncleCode ([2024](https://arxiv.org/html/2605.26755#bib.bib23)). The crawler removes boilerplate elements such as navigation bars and scripts and extracts the main textual content, yielding up to five documents per claim, with lengths ranging from hundreds to several thousand tokens. In contrast, each claim in RU22Fact is linked to a single source URL, which we retrieve and clean using the same pipeline.

Chunking Module: Due to the long and unstructured nature of crawled web pages, the extracted documents are divided into smaller textual units before evidence retrieval. Effective chunking should balance efficiency and completeness: very short chunks may miss the context needed for verification, while very long chunks may introduce irrelevant content. Existing sentence-based or semantic chunking methods can still produce incomplete evidence when verification cues are spread across nearby but separated parts of a document. To address this, SEEK uses context-window topic-shift detection, smoothing, adaptive thresholding, and overlap-aware chunk construction to form coherent evidence passages. As shown in Figure [11](https://arxiv.org/html/2605.26755#A5.F11), this helps preserve both the viral claim context and the later refuting evidence within the same retrieval unit.

#### Baseline chunking methods\.

We compare SEEK with two common document chunking strategies. Sentence-aware fixed-size chunking divides a document into consecutive groups of complete sentences under a fixed token budget of 512 tokens. This method preserves sentence boundaries and avoids cutting sentences in the middle, but it does not consider topic changes within the document. Therefore, it may either stop before the full verification context is captured or combine unrelated sentences in the same chunk. Semantic chunking LangChain ([2024](https://arxiv.org/html/2605.26755#bib.bib12)), in contrast, uses a multilingual sentence encoder to map each sentence $s_i$ into an embedding $\mathbf{e}_i = f(s_i)$ and places a boundary when the cosine similarity between neighboring sentence embeddings falls below a threshold $\tau$. Although this captures local semantic changes, it mainly relies on adjacent sentence similarity and can still produce incomplete or noisy evidence when the verification context spans multiple sentences.
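The two baselines can be sketched as follows. This is a minimal illustration, not the authors' implementation: the whitespace token counter, the default threshold `tau=0.5`, and the function names are assumptions made for the example, and the embeddings are assumed to come from any sentence encoder.

```python
import numpy as np

def sentence_chunks(sentences, budget=512, count_tokens=lambda s: len(s.split())):
    """Pack whole sentences into consecutive chunks under a token budget.

    count_tokens is a hypothetical stand-in (whitespace split) for a real tokenizer.
    """
    chunks, current, used = [], [], 0
    for sent in sentences:
        n = count_tokens(sent)
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and used + n > budget:
            chunks.append(current)
            current, used = [], 0
        current.append(sent)
        used += n
    if current:
        chunks.append(current)
    return chunks

def semantic_chunks(sentences, embeddings, tau=0.5):
    """Start a new chunk when adjacent-sentence cosine similarity drops below tau."""
    if not sentences:
        return []
    e = np.asarray(embeddings, dtype=float)
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sims = np.sum(e[:-1] * e[1:], axis=1)  # cos(e_i, e_{i+1}) for each neighbor pair
    chunks, current = [], [sentences[0]]
    for sent, sim in zip(sentences[1:], sims):
        if sim < tau:
            chunks.append(current)
            current = []
        current.append(sent)
    chunks.append(current)
    return chunks
```

Both functions look only at local information (a running token count or one neighbor pair), which is the limitation the text attributes to these baselines.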

#### SEEK\.

Although standard semantic chunking is more meaningful than fixed\-size splitting, it can still be unstable for noisy multilingual web documents\. A boundary decision based only on adjacent sentences is sensitive to local variations, translation style, and short sentence noise\. To address this limitation, we propose a strategy, SEEK, that detects topic shifts using context\-window comparison rather than only sentence\-to\-sentence similarity\.

For each possible boundary position $i$, we define a left context window and a right context window:

$$L_i = \{s_{i-w+1}, \ldots, s_i\}, \quad R_i = \{s_{i+1}, \ldots, s_{i+w}\},$$

where $w$ is the window size (set to $w = 3$ in our experiments). The embeddings of these two windows are computed by averaging sentence embeddings:

$$\mathbf{l}_i = \frac{1}{|L_i|} \sum_{s_j \in L_i} \mathbf{e}_j, \quad \mathbf{r}_i = \frac{1}{|R_i|} \sum_{s_j \in R_i} \mathbf{e}_j.$$

We then compute a semantic shift score:

$$\Delta_i = 1 - \cos(\mathbf{l}_i, \mathbf{r}_i).$$

A higher value of $\Delta_i$ indicates a stronger semantic change between the left and right contexts, suggesting a possible topic boundary. To reduce noisy fluctuations, the shift scores are smoothed:

$$\tilde{\Delta}_i = \frac{1}{k} \sum_{j = i - \lfloor k/2 \rfloor}^{i + \lfloor k/2 \rfloor} \Delta_j,$$

where $k$ (set to $k = 3$ in our experiments) is the smoothing window size.

Instead of using a fixed threshold, we apply adaptive thresholding based on the distribution of shift scores within the document:

$$\tau = \text{Percentile}(\tilde{\Delta}, p),$$

where $p$ controls the selectivity of boundary detection and is set to $p = 95\%$ in our experiments. A boundary is selected when:

$$b_i = \begin{cases} 1, & \text{if } \tilde{\Delta}_i \geq \tau, \\ 0, & \text{otherwise}. \end{cases}$$
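The boundary-detection steps (window averaging, shift score, smoothing, percentile threshold) can be sketched in NumPy as below. This is an illustrative reading of the formulas rather than the released code. Two choices are assumptions not stated in the text: windows and smoothing ranges are truncated at the document edges, and boundary index `i` is taken to mean a cut after sentence `i`.

```python
import numpy as np

def shift_scores(emb, w=3):
    """Delta_i = 1 - cos(mean of left window, mean of right window)."""
    emb = np.asarray(emb, dtype=float)
    n = len(emb)
    if n < 2:
        return np.zeros(0)
    scores = np.zeros(n - 1)
    for i in range(n - 1):  # candidate boundary between sentence i and i+1
        left = emb[max(0, i - w + 1): i + 1].mean(axis=0)   # truncated at the start (assumption)
        right = emb[i + 1: i + 1 + w].mean(axis=0)          # truncated at the end (assumption)
        denom = np.linalg.norm(left) * np.linalg.norm(right)
        scores[i] = (1.0 - float(left @ right) / denom) if denom > 0 else 0.0
    return scores

def smooth(scores, k=3):
    """Moving average of width k; edge positions average the available values (assumption)."""
    half = k // 2
    out = np.empty_like(scores)
    for i in range(len(scores)):
        out[i] = scores[max(0, i - half): i + half + 1].mean()
    return out

def boundaries(emb, w=3, k=3, p=95):
    """Return indices i with smoothed shift >= the document-level p-th percentile, plus tau."""
    smoothed = smooth(shift_scores(emb, w), k)
    if smoothed.size == 0:
        return [], None
    tau = np.percentile(smoothed, p)
    return [i for i, d in enumerate(smoothed) if d >= tau], float(tau)

# Toy document: six sentences on one topic followed by six on another.
toy = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
cuts, tau = boundaries(toy)
```

On this toy input the raw shift score peaks at the position between the sixth and seventh sentence, and the smoothed score stays highest there, so it is the only position that clears the 95th-percentile threshold.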
![Refer to caption](https://arxiv.org/html/2605.26755v1/x2.png)

Figure 2: Overview of the SEEK and retrieval-based veracity prediction pipeline.

Finally, sentences between selected boundaries are grouped into chunks under a maximum token budget. We also include a small overlap between consecutive chunks to preserve boundary-level context. This produces chunks that are semantically coherent, token-efficient, and more suitable for dense evidence retrieval in multilingual fact-checking. An example visualization of the raw semantic shift scores, smoothing process, and adaptive boundary selection is provided in Appendix [B](https://arxiv.org/html/2605.26755#A2).
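One possible way to turn selected boundaries into overlapping chunks is sketched below. The text does not give the budget or overlap size used for SEEK, so `max_tokens=512`, `overlap=1` (in sentences), and the whitespace token counter are hypothetical defaults chosen only for illustration. A boundary index `i` is interpreted as a cut after sentence `i`.

```python
def build_chunks(sentences, boundary_idx, max_tokens=512, overlap=1,
                 count_tokens=lambda s: len(s.split())):
    """Group sentences between boundaries, split oversize groups, add sentence overlap."""
    cuts = set(boundary_idx)
    groups, current, used = [], [], 0  # each group holds sentence indices
    for i, sent in enumerate(sentences):
        n = count_tokens(sent)
        # Enforce the token budget inside a topic segment.
        if current and used + n > max_tokens:
            groups.append(current)
            current, used = [], 0
        current.append(i)
        used += n
        if i in cuts:  # topic boundary after sentence i
            groups.append(current)
            current, used = [], 0
    if current:
        groups.append(current)

    chunks = []
    for g in groups:
        # Prepend the last `overlap` sentences of the preceding text to keep boundary context.
        start = max(0, g[0] - overlap)
        chunks.append(" ".join(sentences[start: g[-1] + 1]))
    return chunks

chunks = build_chunks(["s0", "s1", "s2", "s3"], boundary_idx=[1])
# chunks == ["s0 s1", "s1 s2 s3"]
```

In this sketch the overlap is added after the budget check, so a chunk can exceed `max_tokens` by the size of the overlapping sentences; a stricter variant would reserve room for the overlap up front.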

Multilingual dense retriever: After generating semantically coherent chunks with SEEK, we retrieve claim-relevant evidence using a multilingual dense retrieval pipeline. Each claim and candidate chunk are encoded with multilingual-e5-large-instruct Wang et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib25)), which maps multilingual claims and evidence passages into a shared semantic embedding space.

Given a claim $c$ and crawled source documents $\mathcal{D} = \{d_1, d_2, \ldots, d_n\}$, each document $d_j$ is segmented into a set of chunks:

$$\mathcal{P}_j = \{p_{j1}, p_{j2}, \ldots, p_{jm}\}.$$

The claim and chunk embeddings are computed as:

$$\mathbf{h}_c = \text{Encoder}(c), \qquad \mathbf{h}_{p_{ji}} = \text{Encoder}(p_{ji}),$$

where both are produced using the same multilingual encoder. Following the instruction-tuned formulation of multilingual-e5, the claim is encoded with the retrieval instruction:

> Instruct: Given a claim, retrieve relevant evidence from web documents that support or refute the claim\.

All chunks from the crawled documents are pooled into a global multilingual evidence space and indexed with FAISS Douze et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib4)). For each claim, cosine similarity is computed between the claim embedding and every candidate chunk:

$$s_{ji} = \cos(\mathbf{h}_c, \mathbf{h}_{p_{ji}}).$$

Instead of retrieving evidence separately from each document, we perform global retrieval across the complete chunk pool. We first retrieve the top-$N$ candidates using FAISS, with $N = 20$:

$$\mathcal{R} = \operatorname{TopN}\left(\cos(\mathbf{h}_c, \mathbf{h}_{p_{ji}})\right).$$

These candidates are then re-ranked using the same multilingual bi-encoder, and the final top-$K$ chunks are selected, with $K = 5$:

$$\mathcal{E}^{*} = \operatorname{TopK}(\mathcal{R}).$$

Here, $\mathcal{E}^{*}$ denotes the final evidence set passed to the downstream veracity prediction model. The selected evidence chunks are concatenated with the claim for veracity prediction. Unlike sentence-level retrieval, which may return isolated or incomplete evidence, SEEK retrieves semantically coherent chunks that preserve topic continuity and verification context while reducing irrelevant noise.
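The two-stage selection can be illustrated with a brute-force NumPy search. This is a stand-in for the FAISS index named in the text, chosen so the example runs without extra dependencies; the function names are hypothetical, and the embeddings are assumed to be precomputed by the encoder. Since the text states that re-ranking uses the same bi-encoder, the sketch assumes the re-ranking score is the same cosine similarity, so the top-$K$ set is simply the head of the top-$N$ list.

```python
import numpy as np

def normalize(x):
    """L2-normalize along the last axis so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(claim_emb, chunk_embs, n=20, k=5):
    """Return indices and cosine scores of the final top-k chunks from the global pool."""
    q = normalize(claim_emb)
    c = normalize(chunk_embs)
    scores = c @ q                       # s_ji = cos(h_c, h_p_ji) for every chunk
    top_n = np.argsort(-scores)[:n]      # candidate set R (top-N)
    top_k = top_n[:k]                    # final evidence set E* (top-K of R)
    return top_k.tolist(), scores[top_k].tolist()

# Toy pool of four 2-D "chunk embeddings" and one claim embedding.
idx, sims = retrieve([1.0, 0.0], [[1, 0], [0, 1], [1, 1], [-1, 0]], n=3, k=2)
```

With normalized vectors, an inner-product index returns the same ranking as cosine similarity, which is why normalization happens before scoring here.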

Instruction-tuning of large language models Zhang et al. ([2025](https://arxiv.org/html/2605.26755#bib.bib29)): After retrieval, the top-$K$ ranked evidence chunks $\mathcal{E}^{*} = \{p_1^{*}, p_2^{*}, \ldots, p_K^{*}\}$ are used for downstream veracity prediction. For each claim, the input is formed by concatenating the task instruction, claim, and retrieved evidence:

$$\text{Input} = \mathcal{I} \oplus \mathcal{C} \oplus p_1^{*} \oplus p_2^{*} \oplus \cdots \oplus p_K^{*},$$

where $\mathcal{I}$ denotes the dataset-specific instruction, $\mathcal{C}$ is the claim, and $\oplus$ denotes textual concatenation. The instructions for X-FACT and RU22Fact are provided in Appendix [C](https://arxiv.org/html/2605.26755#A3). For X-FACT, evidence is retrieved from up to five associated web sources, while for RU22Fact, it is retrieved from the corresponding source article.
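A small sketch of the concatenation step follows. The field labels (`Claim:`, `Evidence i:`), the blank-line separator, and the sample instruction string are hypothetical; the actual dataset-specific instructions are the ones given in the paper's appendix.

```python
def build_input(instruction, claim, evidence_chunks, k=5):
    """Concatenate instruction, claim, and up to k evidence chunks in ranked order."""
    parts = [instruction, f"Claim: {claim}"]
    parts += [
        f"Evidence {i}: {chunk}"
        for i, chunk in enumerate(evidence_chunks[:k], start=1)
    ]
    return "\n\n".join(parts)

prompt = build_input(
    "Classify the claim using the evidence.",  # placeholder instruction (assumption)
    "Example claim text.",
    ["first retrieved chunk", "second retrieved chunk"],
)
```

Keeping the chunks in rank order preserves the retriever's preference, and truncating to `k` keeps the prompt length bounded.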

We formulate veracity prediction as an instruction-following causal language modeling task, where the model generates the factuality label conditioned on the claim and evidence. We fine-tune multilingual LLMs, including LLaMA, Gemma, and Mistral, using LoRA Hu et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib8)). The base model is frozen, and only the low-rank adaptation parameters and output language modeling head are updated. Given the target label sequence $y_{1:T}$, the training loss is:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_{<t}, \text{Input}).$$

This trains the model to generate the correct veracity label using the claim and the retrieved evidence, where SEEK provides complete verification context for prediction.
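The loss can be checked numerically with a toy example. The sketch below assumes the model's per-position logits over the vocabulary are already available for the $T$ label tokens (the array shapes and the function name are illustrative); it applies a numerically stable log-softmax and sums the negative log-probabilities of the target tokens.

```python
import numpy as np

def label_nll(logits, target_ids):
    """Sum over t of -log p(y_t | y_<t, Input), given logits of shape (T, vocab)."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)            # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(target_ids)), target_ids]      # log p of each target token
    return float(-picked.sum())

# Uniform logits over a 4-token vocabulary, two label tokens: loss = 2 * log(4).
loss = label_nll(np.zeros((2, 4)), [0, 3])
```

Only the label tokens contribute to the sum, which matches the formulation where the instruction, claim, and evidence act purely as conditioning context.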

## 5 Experimental Setup

This section describes the baselines, implementation settings, and evaluation protocols used to assess both veracity prediction performance and retrieval quality in multilingual fact\-checking\.

### 5.1 Baselines

We compare SEEK with several retrieval and evidence construction baselines.

Google Search Snippets: Following the original X-FACT setup, this baseline uses the provided Google Search snippets as evidence.

CONCRETE Retriever: We compare against CONCRETE Huang et al. ([2022](https://arxiv.org/html/2605.26755#bib.bib9)), a multilingual dense retriever trained with the Cross-lingual Inverse Cloze Task (X-ICT). It retrieves evidence passages using multilingual dense similarity matching.

Sentence Chunking: This baseline splits each crawled webpage into consecutive sentence groups under a fixed token budget. This strategy is described in detail in Section [4](https://arxiv.org/html/2605.26755#S4).

Semantic Chunking: This baseline groups neighboring sentences into chunks based on embedding similarity. This strategy is also described in detail in Section [4](https://arxiv.org/html/2605.26755#S4).

LLM-Generated Evidence: For RU22Fact, we also compare with the LLM-generated evidence setting from the original RU22Fact work Zeng et al. ([2024](https://arxiv.org/html/2605.26755#bib.bib27)), where evidence is generated by an LLM instead of retrieved from crawled webpages.

### 5.2 Implementation Details

We perform multilingual dense retrieval using intfloat/multilingual-e5-large-instruct, which encodes claims and evidence chunks into a shared multilingual semantic space. The resulting chunk embeddings are indexed with FAISS to enable efficient dense similarity search. After retrieval, the top-ranked evidence chunks are concatenated with the input claim and provided to instruction-tuned large language models for veracity prediction. We fine-tune LLaMA, Gemma, and Mistral using Low-Rank Adaptation (LoRA) under a causal language modeling objective. Further details on these models are provided in Appendix [D](https://arxiv.org/html/2605.26755#A4). All experiments are implemented using LLaMAFactory Zheng et al. ([2024b](https://arxiv.org/html/2605.26755#bib.bib31)) and conducted on a single NVIDIA A100 80GB PCIe GPU.

### 5.3 Translation-based Retrieval Analysis

To study the effect of multilingual variation on retrieval, we translate all X-FACT and RU22Fact claims and crawled webpages into English before applying the same SEEK pipeline. The translated documents are chunked into semantically coherent passages, and retrieval is performed using sentence-transformers/all-MiniLM-L6-v2 Transformers ([2021](https://arxiv.org/html/2605.26755#bib.bib22)). The retrieved chunks are then combined with the translated claim for veracity prediction. This setup tests whether retrieval improves when claim-evidence matching is performed in a unified English semantic space rather than a multilingual one.

### 5.4 Evaluation Protocol

For X-FACT, we report Macro-F1 under the original benchmark settings: in-domain (ID), out-of-domain (OOD), and zero-shot (ZS), with split details provided in Appendix [A](https://arxiv.org/html/2605.26755#A1). For RU22Fact, we report Macro-F1 on the official test split. In addition to veracity prediction, we evaluate retrieval quality using claim-evidence semantic similarity, evidence completeness analysis, and qualitative retrieval analysis. To ensure fair comparison, we use the same fine-tuning, retrieval, decoding, and evaluation settings across all chunking strategies; detailed hyperparameters are provided in Appendix [D.1](https://arxiv.org/html/2605.26755#A4.SS1).
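For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of per-class F1 scores, so rare veracity labels count as much as frequent ones. The sketch below is a plain-Python illustration with made-up labels. It takes the label set as the union of gold and predicted labels and assigns F1 = 0 to a class with no true positives, false positives, or false negatives, which are conventions assumed here rather than details from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(f1s) / len(f1s)

gold = ["true", "true", "false", "false"]
pred = ["true", "false", "false", "false"]
score = macro_f1(gold, pred)  # per-class F1: "true" = 2/3, "false" = 4/5
```

Here the macro average is (2/3 + 4/5) / 2 = 11/15, even though accuracy on the same predictions is 3/4.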

## 6 Results and Discussion

Table 2: Macro-F1 (mean $\pm$ std) on X-FACT across In-Domain (ID), Out-of-Domain (OOD), and Zero-Shot (ZS) settings.

Table 3: Macro-F1 on RU22Fact across four settings: LLM, sentence chunking, semantic chunking, and SEEK.

We report veracity prediction results using Macro-F1 as the main evaluation metric. The results are summarized in Tables [2](https://arxiv.org/html/2605.26755#S6.T2) and [3](https://arxiv.org/html/2605.26755#S6.T3).

### 6.1 Findings on X-FACT

Table [2](https://arxiv.org/html/2605.26755#S6.T2) reports Macro-F1 results on X-FACT across in-domain (ID), out-of-domain (OOD), and zero-shot (ZS) settings. SEEK consistently outperforms search-snippet evidence and the CONCRETE retriever, showing that veracity prediction benefits from more complete and semantically coherent evidence. The strongest results are obtained by LLaMA with SEEK, achieving 0.67 Macro-F1 in ID, 0.41 in OOD, and 0.30 in ZS. This corresponds to gains of up to +0.21 over the best search-snippet baselines and up to +0.26 over CONCRETE. These gains are especially important in OOD and ZS settings, where domain and language shifts make evidence retrieval more challenging. Across LLaMA, Gemma, and Mistral, SEEK yields stable improvements over sentence-level and standard semantic chunking baselines. This indicates that the gains arise mainly from improved evidence construction rather than a particular LLM backbone. Overall, coherent document-level chunks provide more reliable multilingual verification signals than snippets or isolated retrieval units.

Table 4: Translation-based Macro-F1 performance of SEEK on X-FACT using different multilingual LLMs. Results are reported as Macro-F1 (mean $\pm$ std) across random seeds.

Table 5: Translation-based Macro-F1 performance of SEEK on RU22Fact.

### 6.2 Findings on RU22Fact

Table [3](https://arxiv.org/html/2605.26755#S6.T3) reports Macro-F1 results on the official RU22Fact test split. Although RU22Fact is monolingual and has a lower cross-lingual retrieval burden than X-FACT, the results still show that evidence quality strongly affects factual verification. The LLM-only setting is competitive, with LLaMA reaching 0.73 and Mistral reaching 0.72, while sentence chunking provides only marginal gains. Semantic chunking improves performance, with LLaMA and Mistral both reaching 0.79. However, SEEK gives the strongest results, with Gemma and Mistral achieving 0.90 and LLaMA reaching 0.89. This gives a gain of +0.17 over the best LLM-only result and +0.11 over the best semantic chunking result. Overall, these results show that SEEK is useful not only in multilingual or cross-domain settings, but also in monolingual fact-checking, where complete and coherent evidence helps different LLMs make more reliable predictions.

### 6.3 Findings in Translation-based Retrieval

We further evaluate translation-based retrieval with SEEK, where claims and crawled documents are translated into English before retrieval and prediction. As shown in Table [4](https://arxiv.org/html/2605.26755#S6.T4), this setting performs strongly on X-FACT, with best Macro-F1 scores of 0.66 in ID, 0.41 in OOD, and 0.37 in ZS. On RU22Fact, Table [5](https://arxiv.org/html/2605.26755#S6.T5) shows that all models achieve similarly high performance, with LLaMA and Mistral reaching 0.90 and Gemma reaching 0.89. These results suggest that translation-based retrieval is complementary to SEEK: translation reduces linguistic variation, while SEEK preserves complete verification context.

### 6.4 Evaluation of Retrieval Quality

Beyond veracity prediction, we examine whether the retrieved evidence is useful for fact verification. Since gold evidence annotations are unavailable, we compare sentence chunking, semantic chunking, and SEEK using proxy analyses.

Claim–Evidence Semantic Similarity: We compute cosine similarity between each claim and its retrieved evidence in the multilingual embedding space. Higher scores indicate stronger claim–evidence alignment, and Figure [3](https://arxiv.org/html/2605.26755#S6.F3) shows the resulting distributions across chunking strategies.
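As a rough illustration of the similarity measure described above, the sketch below computes cosine similarity between two embedding vectors in plain Python. The vectors are hypothetical three-dimensional stand-ins; in the paper the embeddings come from the multilingual encoder, which is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings for one claim and one retrieved evidence chunk.
claim_vec = [0.2, 0.7, 0.1]
evidence_vec = [0.25, 0.6, 0.05]
score = cosine_similarity(claim_vec, evidence_vec)
```

Computing this score for every claim and its retrieved chunk gives one distribution per chunking strategy, which is what Figure 3 compares.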

![Refer to caption](https://arxiv.org/html/2605.26755v1/x3.png)

Figure 3: Claim-evidence semantic similarity across different chunking strategies.

Evidence Completeness Analysis: Because similarity alone does not ensure verification sufficiency, we use a locally deployed LLaMA-based evaluator to label each retrieved evidence instance as Complete, Partial, or Irrelevant. The prompt is given in Appendix [E](https://arxiv.org/html/2605.26755#A5). As shown in Figures [4](https://arxiv.org/html/2605.26755#S6.F4) and [5](https://arxiv.org/html/2605.26755#S6.F5), SEEK retrieves more complete evidence than both baselines. Complete evidence reaches 54.3% on X-FACT, compared with 41.9% for sentence chunking and 32.8% for semantic chunking. On RU22Fact, SEEK achieves 76.0%, improving over sentence and semantic chunking by 29.0 and 15.3 points, respectively. These results indicate that the proposed chunking strategy better preserves verification-ready context while reducing incomplete evidence retrieval.
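The completeness percentages above are shares of evaluator labels, and the reported improvements are percentage-point differences. The following sketch shows that bookkeeping under simple assumptions: the helper names are hypothetical, labels outside the three categories are ignored, and only the X-FACT percentages quoted in the text are reused.

```python
from collections import Counter

LABELS = ("Complete", "Partial", "Irrelevant")


def completeness_distribution(labels):
    """Percentage of each evaluator label among the recognised labels."""
    counts = Counter(label for label in labels if label in LABELS)
    total = sum(counts.values())
    if total == 0:
        return {label: 0.0 for label in LABELS}
    return {label: 100.0 * counts[label] / total for label in LABELS}


def point_gap(ours_pct, baseline_pct):
    """Difference in percentage points, rounded to one decimal place."""
    return round(ours_pct - baseline_pct, 1)


# Percentage-point gaps implied by the X-FACT figures quoted in the text.
xfact_gap_sentence = point_gap(54.3, 41.9)
xfact_gap_semantic = point_gap(54.3, 32.8)
```

With the quoted numbers, SEEK leads sentence chunking by 12.4 points and semantic chunking by 21.5 points on X-FACT.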

![Refer to caption](https://arxiv.org/html/2605.26755v1/x4.png)

Figure 4: Evidence completeness analysis on X-FACT using an LLM-based evaluator.

![Refer to caption](https://arxiv.org/html/2605.26755v1/x5.png)

Figure 5: Evidence completeness analysis on RU22Fact using an LLM-based evaluator.

Qualitative and Significance Analysis: We further provide a qualitative comparison in Figure [11](https://arxiv.org/html/2605.26755#A5.F11), focusing on semantic continuity, investigation flow, and whether the retrieved evidence contains final supporting or refuting statements. Additional examples are included in Appendix [B](https://arxiv.org/html/2605.26755#A2). McNemar's test further confirms that SEEK yields statistically significant gains over other methods, particularly on RU22Fact and X-FACT ID. Significance tests are reported in Appendix [F](https://arxiv.org/html/2605.26755#A6).

## 7 Conclusion

In this work, we proposed SEEK, a semantic evidence extraction framework for multilingual fact-checking. SEEK moves beyond short snippets, isolated sentences, and fixed-size chunks by building coherent evidence passages from full web documents. It detects topic shifts and keeps boundary context, allowing the retrieved evidence to better preserve the claim, supporting details, and final verification signal. Experiments on X-FACT and RU22Fact show that SEEK improves veracity prediction across in-domain, out-of-domain, and zero-shot settings. The retrieval analysis further shows that SEEK retrieves evidence that is not only semantically relevant, but also more complete for verification. Overall, our findings highlight that reliable multilingual fact-checking depends strongly on evidence construction. By providing focused and context-rich evidence chunks, SEEK helps downstream LLMs make more reliable factuality predictions and reduces the limitations of retrieval based on snippets, sentences, and standard semantic chunking.

## 8 Limitations

SEEK depends on the quality of crawled web documents; incomplete or noisy pages can still lead to missing evidence. Since the method detects semantic topic shifts, it may also miss cases where verification cues are spread across distant document sections. Our experiments are limited to X-FACT and RU22Fact, so evaluation on more languages, domains, and real-time fact-checking settings is needed. Finally, human-annotated evidence completeness labels would provide a stronger evaluation signal, which we leave for future work.

## References

- Augenstein et al. (2019) Isabelle Augenstein, Christina Lioma, Dongsheng Wang, Lucas Chaves Lima, Casper Hansen, Christian Hansen, and Jakob Grue Simonsen. 2019. [MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims](https://doi.org/10.18653/v1/D19-1475). In *Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, pages 4685–4697, Hong Kong, China. Association for Computational Linguistics.
- Cekinel et al. (2024) Recep Firat Cekinel, Pinar Karagoz, and Çağrı Çöltekin. 2024. [Cross-lingual learning vs. low-resource fine-tuning: A case study with fact-checking in Turkish](https://aclanthology.org/2024.lrec-main.368/). In *Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)*, pages 4127–4142, Torino, Italia. ELRA and ICCL.
- Chen et al. (2022) Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yixing Fan, and Xueqi Cheng. 2022. [GERE: Generative evidence retrieval for fact verification](https://doi.org/10.1145/3477495.3531827). In *Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval*, SIGIR '22, pages 2184–2189, New York, NY, USA. Association for Computing Machinery.
- Douze et al. (2024) Matthijs Douze, Alexandr Guzhva, Chengqi Deng, Jeff Johnson, Gergely Szilvasy, Pierre-Emmanuel Mazaré, Maria Lomeli, Lucas Hosseini, and Hervé Jégou. 2024. [The Faiss library](http://arxiv.org/abs/2401.08281). arXiv preprint.
- Guo et al. (2022) Zhijiang Guo, Michael Schlichtkrull, and Andreas Vlachos. 2022. [A survey on automated fact-checking](https://doi.org/10.1162/tacl_a_00454). *Transactions of the Association for Computational Linguistics*, 10:178–206.
- Gupta and Srikumar (2021a) Ashim Gupta and Vivek Srikumar. 2021a. [X-fact: A new benchmark dataset for multilingual fact checking](https://doi.org/10.18653/v1/2021.acl-short.86). In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)*, pages 675–682, Online. Association for Computational Linguistics.
- Gupta and Srikumar (2021b) Ashim Gupta and Vivek Srikumar. 2021b. [X-fact: A new benchmark dataset for multilingual fact checking](https://aclanthology.org/2021.acl-short.86). In *Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)*, pages 675–682, Online. Association for Computational Linguistics.
- Hu et al. (2022) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. [Lora: Low-rank adaptation of large language models](https://www.microsoft.com/en-us/research/publication/lora-low-rank-adaptation-of-large-language-models/). In *ICLR 2022*.
- Huang et al. (2022) Kung-Hsiang Huang, ChengXiang Zhai, and Heng Ji. 2022. [CONCRETE: Improving cross-lingual fact-checking with cross-lingual retrieval](https://aclanthology.org/2022.coling-1.86/). In *Proceedings of the 29th International Conference on Computational Linguistics*, pages 1024–1035, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
- Karpukhin et al. (2020) Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. [Dense passage retrieval for open-domain question answering](https://doi.org/10.18653/v1/2020.emnlp-main.550). In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pages 6769–6781, Online. Association for Computational Linguistics.
- Kiss (2025) C. Kiss. 2025. [Max–min semantic chunking of documents for rag application](https://doi.org/10.1007/s10791-025-09638-7). *Discover Computing*, 28(117).
- LangChain (2024) LangChain. 2024. Langchain semantic chunking. [https://python.langchain.com/docs/how_to/semantic_chunker/](https://python.langchain.com/docs/how_to/semantic_chunker/). Accessed: 2024-07-04.
- Nakov et al. (2021) Preslav Nakov, David Corney, Maram Hasanain, Firoj Alam, Tamer Elsayed, Alberto Barrón-Cedeño, Paolo Papotti, Shaden Shaar, and Giovanni Da San Martino. 2021. [Automated fact-checking for assisting human fact-checkers](https://doi.org/10.24963/ijcai.2021/619). In *Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21*, pages 4551–4558. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
- Nanhekhan et al. (2025) Kevin Nanhekhan, V. Venktesh, Erik Martin, Henrik Vatndal, Vinay Setty, and Avishek Anand. 2025. [Flashcheck: Exploration of efficient evidence retrieval for fast fact-checking](https://doi.org/10.1007/978-3-031-88717-8_28). In *Advances in Information Retrieval*, volume 15575 of *Lecture Notes in Computer Science*, pages 385–399, Cham. Springer.
- Panchendrarajan and Zubiaga (2024) Rrubaa Panchendrarajan and Arkaitz Zubiaga. 2024. [Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research](https://doi.org/10.1016/j.nlp.2024.100066). *Natural Language Processing Journal*, 7:100066.
- Peng et al. (2025a) Q. Peng et al. 2025a. [Semeval-2025 task 7: Multilingual and crosslingual fact-checked claim retrieval](https://arxiv.org/abs/2505.10740). In *Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)*.
- Peng et al. (2025b) Qiwei Peng, Robert Moro, Michal Gregor, Ivan Srba, Simon Ostermann, Marian Simko, Juraj Podrouzek, Matúš Mesarčík, Jaroslav Kopčan, and Anders Søgaard. 2025b. [SemEval-2025 task 7: Multilingual and crosslingual fact-checked claim retrieval](https://aclanthology.org/2025.semeval-1.323/). In *Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)*, pages 2498–2511, Vienna, Austria. Association for Computational Linguistics.
- Qu et al. (2025) R. Qu et al. 2025. [Is semantic chunking worth the computational cost?](https://aclanthology.org/2025.findings-naacl.114/) *Findings of the Association for Computational Linguistics: NAACL 2025*.
- Quelle et al. (2025) Dorian Quelle, Calvin Yixiang Cheng, Alexandre Bovet, and Scott A. Hale. 2025. [Lost in translation: using global fact-checks to measure multilingual misinformation prevalence, spread, and evolution](https://doi.org/10.1140/epjds/s13688-025-00520-6). *EPJ Data Science*, 14(1):22.
- Schlichtkrull et al. (2023) Michael Schlichtkrull, Zhijiang Guo, and Andreas Vlachos. 2023. [Averitec: A dataset for real-world claim verification with evidence from the web](https://proceedings.neurips.cc/paper_files/paper/2023/file/cd86a30526cd1aff61d6f89f107634e4-Paper-Datasets_and_Benchmarks.pdf). In *Advances in Neural Information Processing Systems*, volume 36, pages 65128–65167. Curran Associates, Inc.
- Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. [FEVER: a large-scale dataset for fact extraction and verification](http://arxiv.org/abs/1803.05355). Updated version of NAACL2018 paper. [https://doi.org/10.48550/arXiv.1803.05355](https://doi.org/10.48550/arXiv.1803.05355).
- Transformers (2021) Sentence Transformers. 2021. sentence-transformers/all-minilm-l6-v2. [https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). Accessed: 2026-05-17.
- UncleCode (2024) UncleCode. 2024. Crawl4ai: Open-source llm friendly web crawler & scraper. [https://github.com/unclecode/crawl4ai](https://github.com/unclecode/crawl4ai). GitHub repository.
- Viswanathan et al. (2025) V. Viswanathan et al. 2025. [Claimiq at checkthat! 2025: Comparing prompted and fine-tuned language models for verifying numerical claims](https://arxiv.org/abs/2509.11492).
- Wang et al. (2024) Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Multilingual e5 text embeddings: A technical report. *arXiv preprint arXiv:2402.05672*.
- Wang (2017) William Y. Wang. 2017. “liar, liar pants on fire”: A new benchmark dataset for fake news detection. In *Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 422–426. Association for Computational Linguistics.
- Zeng et al. (2024) Yirong Zeng, Xiao Ding, Yi Zhao, Xiangyu Li, Jie Zhang, Chao Yao, Ting Liu, and Bing Qin. 2024. [Ru22fact: Optimizing evidence for multilingual explainable fact-checking on russia-ukraine conflict](https://arxiv.org/abs/2403.16662). *arXiv preprint arXiv:2403.16662*.
- Zhang et al. (2023) Hengran Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, and Xueqi Cheng. 2023. [From relevance to utility: Evidence retrieval with feedback for fact verification](https://doi.org/10.18653/v1/2023.findings-emnlp.422). In *Findings of the Association for Computational Linguistics: EMNLP 2023*, pages 6373–6384, Singapore. Association for Computational Linguistics.
- Zhang et al. (2025) Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Fei Wu, and Guoyin Wang. 2025. [Instruction tuning for large language models: A survey](http://arxiv.org/abs/2308.10792).
- Zheng et al. (2024a) Liwen Zheng, Chaozhuo Li, Xi Zhang, Yu-Ming Shang, Feiran Huang, and Haoran Jia. 2024a. [Evidence retrieval is almost all you need for fact verification](https://doi.org/10.18653/v1/2024.findings-acl.551). In *Findings of the Association for Computational Linguistics: ACL 2024*, pages 9274–9281, Bangkok, Thailand. Association for Computational Linguistics.
- Zheng et al. (2024b) Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. 2024b. [Llamafactory: Unified efficient fine-tuning of 100+ language models](http://arxiv.org/abs/2403.13372). In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)*, Bangkok, Thailand. Association for Computational Linguistics.

## Appendix A Additional Details on Datasets

This appendix provides additional details on the datasets used in our study, including their label distributions and evaluation splits.

### A.1 X-FACT (Gupta and Srikumar, 2021)

X-FACT is a multilingual fact-checking dataset containing 31,189 claims collected from 85 fact-checking websites across 25 languages and 11 language families (Gupta and Srikumar, [2021b](https://arxiv.org/html/2605.26755#bib.bib7)). Each claim is annotated with a factuality label and includes metadata such as language, source URL, claim date, and review date.

The dataset contains seven labels: True, Mostly True, Partially True/Misleading, Mostly False, False, Complicated/Hard to categorize, and Other. Table [6](https://arxiv.org/html/2605.26755#A1.T6) shows their distribution across splits.

For evaluation, X-FACT provides three test settings: In-domain, where claims come from sources and languages seen during training; Out-of-domain, where languages are seen but sources are unseen; and Zero-shot, where both languages and sources are unseen during training. Table [7](https://arxiv.org/html/2605.26755#A1.T7) shows the language-wise distribution, highlighting that several languages appear only in the zero-shot split, making this setting useful for evaluating cross-lingual generalization.
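The three settings can be read as a rule over whether a test claim's language and source were seen during training. The sketch below encodes only the three cases described above; the function name, language codes, and source names are hypothetical, and the combination of an unseen language with a seen source is rejected because the description does not cover it.

```python
def assign_split(language, source, train_languages, train_sources):
    """Map a test claim to an X-FACT style evaluation setting."""
    lang_seen = language in train_languages
    src_seen = source in train_sources
    if lang_seen and src_seen:
        return "in-domain"
    if lang_seen and not src_seen:
        return "out-of-domain"
    if not lang_seen and not src_seen:
        return "zero-shot"
    # Unseen language from a seen source is not covered by the description.
    raise ValueError("combination not described by the three settings")


# Hypothetical training inventory, for illustration only.
train_languages = {"hi", "pt"}
train_sources = {"site-a.example", "site-b.example"}
setting = assign_split("hi", "site-c.example", train_languages, train_sources)
```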

Table 6: Label-wise distribution of the X-FACT dataset across training, development, in-domain, out-of-domain, and zero-shot evaluation splits. PT/M = Partially True/Misleading, MT = Mostly True, MF = Mostly False, and C/H = Complicated/Hard to categorize.

Table 7: Language-wise distribution of the X-FACT dataset across evaluation splits. Languages appearing only in the zero-shot split are unseen during training and are used to evaluate cross-lingual transfer.

### A.2 RU22Fact (Zeng et al., 2024)

RU22Fact is a multilingual explainable fact-checking dataset focused on the 2022 Russia–Ukraine conflict (Zeng et al., [2024](https://arxiv.org/html/2605.26755#bib.bib27)). It contains 16,033 claims in four languages: English, Chinese, Russian, and Ukrainian. Each sample includes a claim, optimized evidence, a reference explanation, metadata, and one of three labels: Supported, Refuted, or Not Enough Information (NEI).

Table [8](https://arxiv.org/html/2605.26755#A1.T8) shows the label distribution, where Supported is the largest class. Table [9](https://arxiv.org/html/2605.26755#A1.T9) reports the language-wise distribution, with English forming the largest portion of the dataset.

| Class | Train | Dev | Test | Total |
|---|---:|---:|---:|---:|
| Supported | 6,836 | 1,079 | 2,166 | 10,081 |
| Refuted | 3,299 | 450 | 902 | 4,651 |
| NEI | 1,082 | 71 | 148 | 1,301 |
| Total | 11,217 | 1,600 | 3,216 | 16,033 |

Table 8: Label-wise distribution of the RU22Fact dataset across training, development, and test splits. NEI denotes Not Enough Information.

Table 9: Language-wise distribution of the RU22Fact dataset across training, development, and test splits.

## Appendix B Qualitative Analysis of Chunk Formation

To qualitatively illustrate the proposed Semantic Evidence Extraction with adaptive chunKing (SEEK) method, we use a real Hindi fact-checking example consisting of a viral claim and its corresponding source article, as shown in Figure [7](https://arxiv.org/html/2605.26755#A2.F7). The claim states that the woman inspector in the viral image is posted in the same area where her father works as a rickshaw driver. The source article investigates this claim and concludes that the viral story is misleading. Figure [6](https://arxiv.org/html/2605.26755#A2.F6) shows the raw semantic shift scores, their smoothed version, and the adaptive threshold used by SEEK to identify meaningful topic boundaries.
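To make the three curves concrete, here is a minimal sketch of one way to go from sentence embeddings to chunk boundaries using windowed shift scores, smoothing, an adaptive threshold, and boundary overlap. The specific choices are assumptions for illustration and are not taken from the paper: shift is 1 minus the cosine between mean window embeddings, smoothing is a moving average, the threshold is mean plus `k` standard deviations, and all function names and defaults are hypothetical.

```python
import math
from statistics import mean, pstdev


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def shift_scores(sent_vecs, window=2):
    """Score each gap i (between sentences i and i+1) by comparing context windows."""
    scores = []
    for i in range(len(sent_vecs) - 1):
        left = sent_vecs[max(0, i - window + 1): i + 1]
        right = sent_vecs[i + 1: i + 1 + window]
        scores.append(1.0 - cosine(mean_vector(left), mean_vector(right)))
    return scores


def smooth(scores, radius=1):
    """Moving average over neighbouring gaps to damp sentence-level noise."""
    out = []
    for i in range(len(scores)):
        span = scores[max(0, i - radius): i + radius + 1]
        out.append(sum(span) / len(span))
    return out


def find_boundaries(smoothed, k=1.0):
    """Keep local peaks that exceed an adaptive threshold (mean + k * std)."""
    if not smoothed:
        return []
    threshold = mean(smoothed) + k * pstdev(smoothed)
    return [
        i for i, s in enumerate(smoothed)
        if s > threshold
        and (i == 0 or s >= smoothed[i - 1])
        and (i == len(smoothed) - 1 or s >= smoothed[i + 1])
    ]


def build_chunks(sentences, boundaries, overlap=1):
    """Cut after each boundary gap; each later chunk repeats `overlap` earlier sentences."""
    if not sentences:
        return []
    chunks, start = [], 0
    for b in list(boundaries) + [len(sentences) - 1]:
        chunks.append(sentences[max(0, start - overlap): b + 1])
        start = b + 1
    return chunks


# Toy document: three sentences on one topic, then three on another.
toy_vecs = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
toy_boundaries = find_boundaries(smooth(shift_scores(toy_vecs)))
toy_chunks = build_chunks(list("abcdef"), toy_boundaries, overlap=1)
```

In the toy case the only boundary falls at the gap between the third and fourth sentences, and the second chunk repeats the last sentence of the first as boundary context.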

![Refer to caption](https://arxiv.org/html/2605.26755v1/x6.png)

Figure 6: Example of topic-shift boundary detection using SEEK on a Hindi fact-checking document. The smoothed semantic shift curve reduces local sentence-level noise, while the adaptive threshold selects meaningful chunk boundaries.

Figure [8](https://arxiv.org/html/2605.26755#A2.F8) shows the chunks produced by the proposed SEEK method. The first chunk preserves the complete fact-checking flow, including the viral claim, evidence investigation, corrective clarification, and final refutation, while the second chunk mainly contains trailing unrelated webpage content. In contrast, Figure [9](https://arxiv.org/html/2605.26755#A2.F9) shows that semantic chunking splits the verification context across two chunks: the first chunk contains the claim and early investigation, whereas the second chunk contains the final clarification but also mixes it with unrelated navigation content. Similarly, Figure [10](https://arxiv.org/html/2605.26755#A2.F10) shows that sentence chunking fragments the fact-checking flow into separate sentence-budgeted chunks, pushing the decisive refuting evidence into the second chunk and reducing evidence completeness within a single retrieved chunk. This example demonstrates that SEEK better separates the core verification evidence from less relevant trailing content.

![Refer to caption](https://arxiv.org/html/2605.26755v1/x7.png)

Figure 7: Example Hindi claim and source document.

![Refer to caption](https://arxiv.org/html/2605.26755v1/x8.png)

Figure 8: Example chunks produced by the proposed Semantic Evidence Extraction with adaptive chunKing (SEEK) method on a Hindi fact-checking document. The image reports English summaries and original Hindi chunks.

![Refer to caption](https://arxiv.org/html/2605.26755v1/x9.png)

Figure 9: Example chunks produced by the semantic chunking method on a Hindi fact-checking document. The image reports English summaries and original Hindi chunks.

![Refer to caption](https://arxiv.org/html/2605.26755v1/x10.png)

Figure 10: Example chunks produced by the sentence chunking method on a Hindi fact-checking document. The image reports English summaries and original Hindi chunks.

## Appendix C Dataset-Specific Instructions

We use dataset-specific task instructions to guide the language model for veracity prediction. The instruction is prepended to the claim and retrieved evidence chunks, and the exact prompts used for X-FACT and RU22Fact are reported in Table [10](https://arxiv.org/html/2605.26755#A3.T10).

Table 10: Dataset-specific task instructions used for veracity prediction.

## Appendix D Model Details

Table [11](https://arxiv.org/html/2605.26755#A4.T11) summarizes the encoder and decoder models used in our framework. The encoder models are used for multilingual claim–evidence representation and retrieval, while the instruction-tuned large language models are used for downstream veracity prediction.

Table 11: Details of encoder and decoder models used in our experiments.

We use multilingual-e5-large as the retrieval encoder because our task involves fact-checking across multiple languages. This encoder provides a shared multilingual representation space, allowing claims and evidence chunks within each language to be encoded consistently for dense retrieval. FAISS is used to perform efficient nearest-neighbour search over the global chunk pool.
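The retrieval step amounts to ranking chunk embeddings by similarity to the claim embedding. The sketch below does this by brute force in plain Python so that it stays self-contained; it is not the FAISS API, and the vectors, function names, and the choice of cosine similarity over normalized vectors are illustrative assumptions. In the paper, the embeddings come from multilingual-e5-large and the search is delegated to FAISS.

```python
import math


def normalize(v):
    """Scale a vector to unit length; all-zero vectors are returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def top_k(query_vec, chunk_vecs, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    q = normalize(query_vec)
    scored = [
        (sum(a * b for a, b in zip(q, normalize(c))), idx)
        for idx, c in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Hypothetical chunk pool and claim embedding.
chunk_pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
hits = top_k([1.0, 0.0], chunk_pool, k=2)
```

A brute-force scan is linear in the size of the chunk pool, which is why a dedicated index becomes useful once chunks from all crawled documents are pooled together.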

For veracity prediction, we evaluate three instruction-tuned LLM families: LLaMA, Gemma, and Mistral. These models are selected because they are strong open-source multilingual instruction-following models with different architectures and training strategies. Evaluating multiple model families allows us to examine whether the proposed chunking strategy consistently improves downstream fact verification performance, rather than being effective only for a single model.

### D.1 Hyperparameters

All veracity prediction models are fine-tuned for 3 epochs using LoRA adaptation. Unless otherwise specified, the LoRA rank is set to 8. The same retrieval configuration, decoding settings, and evaluation protocol are used across all experiments to ensure a fair comparison between different chunking strategies.
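To give a sense of what rank 8 means in practice, the sketch below counts the parameters a LoRA adapter adds to a single weight matrix. LoRA represents the update to a `d_out x d_in` matrix as the product of a `d_out x r` and an `r x d_in` matrix, so it trains `r * (d_in + d_out)` values. The 4096 x 4096 layer size is a hypothetical example; the adapted modules and layer shapes are not stated in this section.

```python
def lora_trainable_params(d_in, d_out, rank=8):
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical square projection layer, for illustration only.
d_in, d_out = 4096, 4096
full_params = d_in * d_out
added_params = lora_trainable_params(d_in, d_out, rank=8)
fraction = added_params / full_params
```

For this hypothetical layer the adapter trains 65,536 parameters, about 0.39% of the 16,777,216 in the frozen matrix.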

Table 12: Hyperparameter settings used across experiments.

## Appendix E Prompt for Evidence Completeness Evaluation

We use the prompt shown in Table [13](https://arxiv.org/html/2605.26755#A5.T13) to evaluate whether the retrieved evidence contains sufficient verification context.

> You are evaluating retrieved evidence for a fact-checking task.
>
> Claim: {claim}
>
> Retrieved Evidence: {evidence}
>
> Question: Does the retrieved evidence contain enough information to verify whether the claim is true or false?
>
> Choose exactly one label:
>
> Complete: The evidence contains sufficient information to verify the claim.
>
> Partial: The evidence is related to the claim but incomplete.
>
> Irrelevant: The evidence is unrelated, noisy, empty, blocked webpage text, or does not help verify the claim.
>
> Return only one word: Complete, Partial, or Irrelevant.

Table 13: Prompt used for evidence completeness evaluation.

![Refer to caption](https://arxiv.org/html/2605.26755v1/x11.png)

Figure 11: Illustration of the chunking strategies used for evidence retrieval. Sentence chunking groups consecutive sentences under a token budget but may truncate the evidence before the full verification context is reached. Existing semantic chunking uses embedding-based similarity to form coherent chunks, but can still miss later refuting evidence in noisy web documents. Our SEEK preserves a longer and more complete evidence unit by detecting topic shifts using contextual windows, smoothing, adaptive thresholding, and boundary overlap.

Table 14: McNemar significance test comparing our SEEK method against standard semantic chunking. B>O denotes examples correctly predicted only by the baseline, while O>B denotes examples correctly predicted only by our method. Net Gain is computed as O>B minus B>O. Significance is reported at α = 0.05.
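The Table 13 prompt can be filled and its one-word answer parsed with a few lines of code. The sketch below is an assumption-laden illustration: the line breaks in the template are a guess at the original layout, the helper names are hypothetical, and returning `None` for unparseable output is a choice made here, since the handling of malformed evaluator responses is not described.

```python
PROMPT_TEMPLATE = (
    "You are evaluating retrieved evidence for a fact-checking task.\n"
    "Claim: {claim}\n"
    "Retrieved Evidence: {evidence}\n"
    "Question: Does the retrieved evidence contain enough information "
    "to verify whether the claim is true or false?\n"
    "Choose exactly one label:\n"
    "Complete: The evidence contains sufficient information to verify the claim.\n"
    "Partial: The evidence is related to the claim but incomplete.\n"
    "Irrelevant: The evidence is unrelated, noisy, empty, blocked webpage text, "
    "or does not help verify the claim.\n"
    "Return only one word: Complete, Partial, or Irrelevant."
)

VALID_LABELS = ("Complete", "Partial", "Irrelevant")


def build_prompt(claim, evidence):
    """Insert one claim and one retrieved evidence chunk into the template."""
    return PROMPT_TEMPLATE.format(claim=claim, evidence=evidence)


def parse_label(raw_output):
    """Map raw evaluator text to a valid label, or None if it is not a single label."""
    cleaned = raw_output.strip().strip(".:*\"'").lower()
    for label in VALID_LABELS:
        if cleaned == label.lower():
            return label
    return None


# Hypothetical claim and evidence strings.
prompt = build_prompt("The bridge opened in 2019.", "Officials opened the bridge in 2021.")
label = parse_label(" Complete.\n")
```

Strict single-word parsing keeps verbose or hedged evaluator answers out of the label counts rather than guessing at their intent.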
## Appendix F Significance Test Results

We report all McNemar significance test results in this appendix to assess whether the gains of SEEK over different evidence construction baselines are statistically reliable. Table [14](https://arxiv.org/html/2605.26755#A5.T14) compares SEEK with semantic chunking, Table [15](https://arxiv.org/html/2605.26755#A6.T15) compares it with sentence chunking, Table [16](https://arxiv.org/html/2605.26755#A6.T16) compares it with the CONCRETE baseline, Table [17](https://arxiv.org/html/2605.26755#A6.T17) compares it with Google Search snippets, and Table [18](https://arxiv.org/html/2605.26755#A6.T18) compares it with the LLM-only baseline. Across all tables, B>O denotes examples correctly predicted only by the baseline, whereas O>B denotes examples correctly predicted only by SEEK. Net Gain is computed as O>B minus B>O, and statistical significance is reported at α = 0.05.
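McNemar's test depends only on the two discordant counts defined above. The sketch below uses the exact two-sided binomial form of the test; which variant the paper used (exact or chi-squared, with or without continuity correction) is not stated, so this is one reasonable reading. The function names and example counts are hypothetical.

```python
from math import comb


def mcnemar_exact(b_only, o_only):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = b_only + o_only
    if n == 0:
        return 1.0
    k = min(b_only, o_only)
    # Under the null, each discordant example favours either system with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


def summarize(b_only, o_only, alpha=0.05):
    """Net gain (O>B minus B>O), p-value, and the significance decision at alpha."""
    p_value = mcnemar_exact(b_only, o_only)
    return {
        "net_gain": o_only - b_only,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


# Hypothetical counts: 2 examples only the baseline got right, 12 only SEEK got right.
result = summarize(b_only=2, o_only=12)
```

Examples that both systems classify correctly, or both misclassify, do not enter the statistic, so the test isolates the cases where the evidence construction method changed the outcome.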

Table 15: McNemar significance test comparing our SEEK method against sentence chunking.

Table 16: McNemar significance test comparing our SEEK method against the CONCRETE baseline.

Table 17: McNemar significance test comparing our SEEK method against the Google snippet baseline.

Table 18: McNemar significance test comparing our SEEK method against the LLM-only baseline.