ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation
Summary
This paper introduces Tree of Evidence (ToE), a hierarchical and explainable claim verification framework that dynamically retrieves and aggregates multi-source evidence using reinforcement learning. Experiments show 4-24 percentage point improvements over baselines, especially against adversarially poisoned inputs from Generative Engine Optimization.
# ToE: A Hierarchical and Explainable Claim Verification Framework with Dynamic Multi-source Evidence Retrieval and Aggregation
Source: [https://arxiv.org/html/2606.27736](https://arxiv.org/html/2606.27736)
Zhaoqi Wang1, Zijian Zhang1, Kun Zheng1, Zhen Li1, Xin Li1, Chunlei Li2, Jiamou Liu3
###### Abstract
Content Warning: This paper contains examples of misinformation used only for research purposes.

The rapid spread of fake news poses increasing threats to information ecosystems, especially as AI-generated misinformation under Generative Engine Optimization (GEO) poisoning allows adversarially crafted content to be systematically surfaced by retrieval systems, contaminating LLM reasoning. In this paper, we propose Tree of Evidence (ToE), a hierarchical evidence reasoning framework for automated fact-checking that models each claim as a dynamically expanding argument tree. ToE integrates a reinforcement learning-driven multi-source retrieval agent, an evidence evaluation agent, and an argument tree aggregation algorithm to iteratively decompose, retrieve, and verify claims through an explainable evidence chain. We further provide a theoretical analysis of the retrieval process, deriving a formal error bound that guarantees the learned policy converges to a neighborhood of the information-theoretically optimal policy. Experiments across multiple datasets and backbone LLMs demonstrate that ToE achieves improvements ranging from 4 to 24 percentage points over competitive baselines, with particularly pronounced gains on adversarially poisoned inputs.
## I Introduction
Large language models (LLMs), such as the DeepSeek series \[[9](https://arxiv.org/html/2606.27736#bib.bib2)\] and the GPT series \[[1](https://arxiv.org/html/2606.27736#bib.bib1)\], have demonstrated impressive capabilities across a wide range of tasks. However, due to their reliance on static training data, LLMs lack access to real-time information, which makes them prone to hallucination. To address this limitation, two major lines of work have been proposed. Retrieval-Augmented Generation (RAG) constructs an external knowledge index and retrieves relevant documents at inference time to supply the model with up-to-date context \[[7](https://arxiv.org/html/2606.27736#bib.bib9)\]. Tool calling \[[15](https://arxiv.org/html/2606.27736#bib.bib7)\], on the other hand, enables LLMs to dynamically invoke external tools such as search engines during generation, thereby grounding responses in freshly retrieved information. Both approaches alleviate the hallucination problem to a certain extent. However, the introduction of third-party information also brings new risks. AI-generated or manually fabricated misinformation can lead LLMs to produce incorrect conclusions \[[21](https://arxiv.org/html/2606.27736#bib.bib46),[23](https://arxiv.org/html/2606.27736#bib.bib45),[20](https://arxiv.org/html/2606.27736#bib.bib48)\]. This risk may be further exacerbated by GEO, a technique that structures content to maximize its discoverability by retrieval algorithms such as embedding-based ranking \[[2](https://arxiv.org/html/2606.27736#bib.bib8)\]. Compared to manually written truthful content, adversarially crafted misinformation optimized via GEO can be systematically ranked higher in retrieval results, making it more likely to be consumed by LLMs.
For example, as illustrated in Figure [1](https://arxiv.org/html/2606.27736#S1.F1), when a user asks “Who is the CEO of OpenAI?”, an attacker can inject a fabricated document claiming that Tim Cook has joined OpenAI as CEO. Once this document is ranked alongside legitimate sources, it contaminates the retrieved context and causes the LLM to produce a confidently wrong answer.
Figure 1: Illustration of LLM context pollution via malicious retrieval.

To combat the spread of misinformation, researchers have proposed a range of detection methods. Early work relies on deep learning-based approaches \[[22](https://arxiv.org/html/2606.27736#bib.bib49)\], but these methods suffer from limited generalizability, as the inherent diversity of online information across topics, styles, and platforms makes it difficult to transfer models trained on one distribution to unseen data from another \[[10](https://arxiv.org/html/2606.27736#bib.bib51)\]. More recently, LLM-based fake news detection methods have been proposed \[[11](https://arxiv.org/html/2606.27736#bib.bib50),[10](https://arxiv.org/html/2606.27736#bib.bib51),[8](https://arxiv.org/html/2606.27736#bib.bib53),[13](https://arxiv.org/html/2606.27736#bib.bib52)\], which leverage the reasoning capabilities of LLMs to analyze input content and identify potential misinformation. While these methods achieve promising results on scientific or knowledge-intensive claims, they struggle with time-sensitive news, particularly in the face of AI-generated misinformation where fabricated content is fluent, coherent, and difficult to distinguish from genuine reporting without access to real-time external knowledge.
To address this challenge, we propose ToE, a hierarchical evidence reasoning framework for automated fact\-checking\. ToE models a claim as a dynamically expanding argument tree and verifies it through three core components: a reinforcement learning\-driven multi\-source retrieval agent that collects evidence across platforms according to the characteristics of the claim under investigation, an evidence evaluation agent that scores the veracity and reliability of the claim based on the collected evidence, and an argument tree aggregation agent that decomposes the claim into sub\-claims along dimensions such as who, what, when, where, why, and how, expanding the subtree for deeper verification when the current evidence is insufficient for a reliable judgment\. Node\-level scores are propagated bottom\-up through the tree, and the process terminates when the root node reaches a convergence or decisiveness threshold, yielding a final veracity score along with the full reasoning tree\.
We summarize our main contributions as follows:
- We propose ToE, a hierarchical evidence reasoning framework for automated fact-checking. To the best of our knowledge, this is the first algorithm based on dynamic evidence collection and evaluation, which produces an interpretable argument tree as an explainable evidence chain for each verdict.
- We provide a theoretical analysis of the retrieval framework by modeling evidence collection as a Partially Observable Markov Decision Process, and derive a formal error bound showing that the learned policy converges to a neighborhood of the information-theoretically optimal policy.
- We construct an adversarial dataset, AdvFact, to evaluate robustness under GEO poisoning, and conduct experiments across multiple datasets and LLMs with ablation studies on the retrieval action space, demonstrating the effectiveness and generalizability of the proposed approach.
## II Related Work
Figure 2: An overview of the ToE framework.

With the rapid development of AI, LLM-based agents address the knowledge staleness problem by invoking external search tools to retrieve relevant information as supplementary context \[[7](https://arxiv.org/html/2606.27736#bib.bib9),[15](https://arxiv.org/html/2606.27736#bib.bib7)\]. However, this reliance on third-party sources introduces new vulnerabilities. Prior work has demonstrated that AI-generated misinformation injected into retrieval results can mislead LLMs into producing incorrect conclusions, with potentially severe consequences in domains such as medicine and finance \[[23](https://arxiv.org/html/2606.27736#bib.bib45),[4](https://arxiv.org/html/2606.27736#bib.bib47)\]. While deep learning-based detection methods have been proposed to identify such content \[[22](https://arxiv.org/html/2606.27736#bib.bib49)\], they rely on surface-level stylistic features and fail to generalize to AI-generated misinformation that differs from truthful content only in subtle factual details. For instance, a claim such as “Tim Cook is Apple’s CEO” can be corrupted to “Tim Cook became OpenAI’s CEO yesterday,” a statement that is stylistically indistinguishable from genuine news yet factually false. Recent work leverages the reasoning capabilities of LLMs to assist in verifying the reliability of information. The F3 framework guides LLMs through stepwise logical analysis and evidence consistency assessment via prompt engineering techniques such as zero-shot chain-of-thought reasoning and deductive generation, producing a veracity judgment for the input claim \[[11](https://arxiv.org/html/2606.27736#bib.bib50)\].
TELLER decomposes fake news detection into a set of structured evaluation questions, directing LLMs to analyze news from dimensions including factual accuracy, contextual consistency, and source credibility, then aggregates the dimension-level scores through interpretable logical rules to reach a final verdict \[[10](https://arxiv.org/html/2606.27736#bib.bib51)\]. However, these methods predominantly rely on LLMs to analyze the content of the claim itself, and show limited performance against AI-generated misinformation. STEEL adopts a multi-round retrieval-augmented strategy in which an LLM assesses the confidence of initial retrieval results and automatically generates refined queries to iteratively collect additional web evidence when the current evidence is insufficient, alleviating the dependence on static knowledge bases \[[8](https://arxiv.org/html/2606.27736#bib.bib53)\]. AdSent rewrites input claims into sentiment-neutralized variants to force the veracity classifier to rely solely on factual content, improving robustness against sentiment-manipulated misinformation \[[17](https://arxiv.org/html/2606.27736#bib.bib54)\]. However, it provides limited defense against misinformation grounded in fabricated facts.
## III Method
To counter the potential impact of false and misleading information, we propose ToE, a hierarchical evidence reasoning framework for automated fact-checking. ToE models the verification of a claim as a dynamically growing argument tree. At each step, it collects and evaluates evidence from heterogeneous sources, including Wikipedia, arXiv, fact-checking platforms, search engine results, and social media. Based on the evaluation results, new subtrees are expanded to progressively refine the veracity judgment. This process ultimately produces an interpretable evidence tree that traces the full reasoning process.
As depicted in Fig\.[2](https://arxiv.org/html/2606.27736#S2.F2), the system initializes the argument tree with the claim under investigation as the root node and enters the main iterative loop\. At each iteration, the priority of each pending node is computed dynamically based on the uncertainty of its parent and its own estimated importance, so that nodes with the greatest expected contribution to the final judgment are processed first\. For each selected node, the system analyzes semantic features of the claim, such as its category and geographic scope, and generates three types of search queries: general background queries, supporting evidence queries, and counter\-evidence queries\. This design ensures comprehensive and objective evidence collection and reduces confirmation bias\. A retrieval agent trained via reinforcement learning then executes searches autonomously across heterogeneous sources, including Wikipedia, PolitiFact, social media, and the general web, and decides dynamically when to stop retrieval\. The collected content is parsed by an LLM to extract evidence snippets directly relevant to the claim, each annotated with attributes such as stance and source credibility, forming a structured evidence set\. Based on this evidence, the system computes two core scores for the current node: veracity, which measures the probability that the claim is true, and reliability, which measures how strongly the current evidence supports that judgment\.
After each node is evaluated, scores are propagated bottom\-up through the tree via an aggregation network, continuously updating the root node’s overall judgment\. If a node’s reliability is insufficient, indicating that the available evidence does not yet support a confident verdict, the system invokes an LLM to decompose the claim into finer\-grained, more verifiable sub\-claims along various dimensions, covering time, location, person, event, cause, and information source, and adds them as new child nodes for further processing\. Concurrently, subtrees that have reached high reliability with a sufficiently decisive verdict are pruned to avoid redundant computation\. The iterative process continues until the root node’s judgment converges, a decisiveness threshold is met, or the maximum number of iterations is reached, ultimately producing a veracity score for the claim and a complete argument tree available for traceability and review\.
### III-A Theoretical Foundations
The verification process in ToE can be understood through two complementary theoretical lenses: a decision\-theoretic formulation that characterizes the structure of the search problem, and an information\-theoretic interpretation that justifies the design of the reward signal driving the retrieval agent\.
#### POMDP Formulation\.
The core challenge in automated fact-checking is that the ground-truth veracity $v^{*} \in [0,1]$ of a claim $c$ is a latent variable that cannot be directly observed. The system can only accumulate evidence through sequential search actions and must form a judgment under persistent uncertainty. This structure maps naturally onto a Partially Observable Markov Decision Process (POMDP) \[[5](https://arxiv.org/html/2606.27736#bib.bib60)\], defined as the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{T}, \mathcal{R})$, with components as follows. The *state* $s_t = (\mathcal{E}_t, f_c)$ is the joint representation of the evidence set $\mathcal{E}_t$ accumulated up to step $t$ and the semantic features $f_c$ of the claim under investigation, where $f_c$ encodes attributes such as claim category and geographic scope. The *action* $a_t \in \mathcal{A}$ is selected from the eight-option discrete action space of the retrieval agent, covering heterogeneous source types including Wikipedia, arXiv, fact-checking platforms, social media, and a stop action. The *observation* $o_t \in \mathcal{O}$ is the set of documents retrieved after executing action $a_t$, which is subsequently processed by the evidence evaluation agent into structured evidence objects annotated with stance and credibility attributes. The *transition* $\mathcal{T}(s_{t+1} \mid s_t, a_t)$ describes how the evidence set is updated upon receiving new observations: $\mathcal{E}_{t+1} = \mathcal{E}_t \cup \{o_t\}$. The *belief state* $b_t = (v_{n,t}, r_{n,t})$ is maintained by the evaluation network, which maps the current evidence set and claim features to a veracity score $v_{n,t}$ and a reliability score $r_{n,t}$.
Crucially, the evaluation network serves as a belief state estimator rather than an independently trained value network: it is a domain-driven module that produces a posterior estimate of $v^{*}$ given all available evidence, which distinguishes ToE from standard Actor-Critic architectures where the value function is learned separately from the environment model.
#### Information\-Seeking Objective\.
The reward signal driving the retrieval agent admits a natural information-theoretic interpretation. At each step $t$, the agent receives a step reward proportional to the reliability gain $\Delta r_t = r_{n,t} - r_{n,t-1}$. We interpret this gain as an approximation of the conditional mutual information between the latent veracity $v^{*}$ and the new observation $o_t$ given prior evidence \[[6](https://arxiv.org/html/2606.27736#bib.bib61)\]:

$$
\Delta r_t \approx I(v^{*};\, o_t \mid \mathcal{E}_{t-1}) = H(v^{*} \mid \mathcal{E}_{t-1}) - H(v^{*} \mid \mathcal{E}_t). \tag{1}
$$

Each search action reduces the system’s epistemic uncertainty about the true veracity of the claim, and the reliability score $r_{n,t}$ serves as a tractable surrogate for the posterior confidence over $v^{*}$.
To establish a formal performance guarantee for the greedy retrieval policy, we assume that the reliability function $r: 2^{\mathcal{O}} \to [0,1]$ is submodular with respect to the evidence set, meaning that the marginal reliability gain from any new observation $o$ satisfies

$$
r(\mathcal{E}_A \cup \{o\}) - r(\mathcal{E}_A) \geq r(\mathcal{E}_B \cup \{o\}) - r(\mathcal{E}_B), \tag{2}
$$

for any $\mathcal{E}_A \subseteq \mathcal{E}_B$. This assumption captures the diminishing returns property of evidence accumulation: once a sufficient evidential foundation has been established, redundant or corroborating observations contribute progressively less to resolving uncertainty. This is empirically well-motivated in the fact-checking setting, where the first few high-quality sources typically determine the verdict and additional evidence yields smaller marginal gains. Under this assumption, the greedy policy that selects at each step the action maximizing the expected reliability gain is guaranteed to achieve at least a $(1 - 1/e)$ fraction of the performance of the optimal non-greedy policy, providing a formal justification for the step reward design.
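To make the diminishing-returns condition in Eq. (2) concrete, the following minimal sketch brute-force checks it on a toy coverage-style reliability function and then runs a greedy selection. The function `reliability`, the `COVERS` table, and the source names are hypothetical stand-ins chosen for illustration; in ToE the reliability score comes from the evaluation network, not from a coverage count.

```python
from itertools import combinations

# Hypothetical toy data: each observation "covers" some aspects of a claim.
# This is an illustrative stand-in, not the ToE evaluation network.
COVERS = {
    "wiki": {"who", "what"},
    "factcheck": {"who", "what", "when"},
    "social": {"when", "where"},
    "web": {"where"},
}
ALL_ASPECTS = {"who", "what", "when", "where"}


def reliability(evidence):
    """Fraction of aspects covered by the evidence set (a coverage function)."""
    covered = set()
    for obs in evidence:
        covered |= COVERS[obs]
    return len(covered) / len(ALL_ASPECTS)


def marginal_gain(evidence, obs):
    """r(E ∪ {o}) - r(E), i.e. one side of Eq. (2)."""
    return reliability(set(evidence) | {obs}) - reliability(evidence)


def is_submodular():
    """Check Eq. (2) for every pair E_A ⊆ E_B and every observation o."""
    names = list(COVERS)
    subsets = [set(c) for k in range(len(names) + 1) for c in combinations(names, k)]
    for big in subsets:
        for small in subsets:
            if not small <= big:
                continue
            for obs in names:
                if marginal_gain(small, obs) < marginal_gain(big, obs) - 1e-12:
                    return False
    return True


def greedy(budget):
    """At each step, pick the observation with the largest marginal gain."""
    chosen = []
    for _ in range(budget):
        remaining = [o for o in COVERS if o not in chosen]
        best = max(remaining, key=lambda o: marginal_gain(chosen, o))
        chosen.append(best)
    return chosen
```

Coverage functions are a standard example of submodular set functions, which is why they serve as a convenient toy model here; whether a learned reliability score satisfies Eq. (2) in practice remains the assumption stated above.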
We further note that the approximation in the mutual information interpretation is bounded. If the evaluation network’s reliability estimate satisfies $|r_{n,t} - \tilde{r}_{n,t}| \leq \epsilon$ for all $t$, where $\tilde{r}_{n,t}$ denotes the ideal reliability derived from the true posterior, then the per-step deviation of the reward signal from the true information gain is uniformly controlled, as formalized below.
###### Theorem 1\.
Let $\tilde{r}_{n,t}$ denote the ideal reliability score derived from the true posterior over $v^{*}$, and suppose the evaluation network satisfies $|r_{n,t} - \tilde{r}_{n,t}| \leq \epsilon$ for all $t$. Then the per-step deviation of the reward signal from the true conditional mutual information is uniformly bounded by $\left|\Delta r_t - I(v^{*};\, o_t \mid \mathcal{E}_{t-1})\right| \leq 2\epsilon$ for all $t \in \{1, \dots, T\}$, and consequently the average cumulative deviation over the retrieval horizon satisfies $\frac{1}{T}\sum_{t=1}^{T}\left|\Delta r_t - I(v^{*};\, o_t \mid \mathcal{E}_{t-1})\right| \leq 2\epsilon$, independently of $T$.
###### Proof\.
By the triangle inequality and the pointwise approximation $\Delta\tilde{r}_t \approx I(v^{*};\, o_t \mid \mathcal{E}_{t-1})$, where $\Delta\tilde{r}_t = \tilde{r}_{n,t} - \tilde{r}_{n,t-1}$, each term satisfies:

$$
\begin{aligned}
\left|\Delta r_t - I(v^{*};\, o_t \mid \mathcal{E}_{t-1})\right| &\leq \left|\Delta r_t - \Delta\tilde{r}_t\right| \\
&= \left|(r_{n,t} - r_{n,t-1}) - (\tilde{r}_{n,t} - \tilde{r}_{n,t-1})\right| \\
&\leq \left|r_{n,t} - \tilde{r}_{n,t}\right| + \left|r_{n,t-1} - \tilde{r}_{n,t-1}\right| \\
&\leq 2\epsilon.
\end{aligned} \tag{3}
$$

This bound holds uniformly for all $t$, so the average over $T$ steps is also bounded by $2\epsilon$. ∎
#### Reward Fidelity and Policy\-Level Guarantee\.
This per-step guarantee further implies a bound on the policy-level performance gap. Let $\pi^{*}$ denote the optimal policy under the true information gain reward $I(v^{*}; o_t \mid \mathcal{E}_{t-1})$, and let $\hat{\pi}$ denote the policy optimized under the approximate reward $\Delta r_t$. By the simulation lemma for finite-horizon MDPs, a uniform per-step reward perturbation of $\delta$ induces a value function gap of at most $2T\delta$. Substituting the per-step bound $\delta = 2\epsilon$ from Theorem [1](https://arxiv.org/html/2606.27736#Thmtheorem1), the value gap over a fixed horizon $T$ satisfies:

$$
\left|V^{\pi^{*}} - V^{\hat{\pi}}\right| \leq 4T\epsilon, \tag{4}
$$

where $V^{\pi} = \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} I(v^{*}; o_t \mid \mathcal{E}_{t-1})\right]$ denotes the expected cumulative information gain under policy $\pi$. The per-step bound of $2\epsilon$ is independent of the horizon length $T$ and ensures that the approximate reward preserves the correct greedy action ranking whenever the gap between the best and second-best actions exceeds $4\epsilon$. In our setting, the retrieval agent operates over a short fixed horizon with at most 10 retrieved documents per node and a maximum tree depth of 5, so the cumulative deviation remains small in absolute terms relative to the reliability range $[0,1]$. These results formalize the conditions under which the learned reliability estimate is a faithful proxy for the true information gain, justifying the use of the surrogate reward in policy optimization.
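As a numerical sanity check of Theorem 1 and Eq. (4), the sketch below perturbs a made-up ideal reliability trajectory by at most $\epsilon$ per step and compares the observed per-step deviation with $2\epsilon$. Like the proof, it treats the ideal gain $\Delta\tilde{r}_t$ as the reference quantity. The trajectory values, the choice $\epsilon = 0.02$, and the random seed are illustrative assumptions and do not come from the paper.

```python
import random


def max_step_deviation(ideal, estimated):
    """Largest |Δr_t - Δr̃_t| over the trajectory."""
    return max(
        abs((estimated[t] - estimated[t - 1]) - (ideal[t] - ideal[t - 1]))
        for t in range(1, len(ideal))
    )


def perturb(ideal, eps, rng):
    """Estimated reliabilities with |r_t - r̃_t| <= eps at every step."""
    return [r + rng.uniform(-eps, eps) for r in ideal]


def value_gap_bound(horizon, eps):
    """Right-hand side of Eq. (4): 2 * T * delta with delta = 2 * eps."""
    return 2 * horizon * (2 * eps)


rng = random.Random(0)
eps = 0.02
ideal = [0.10, 0.35, 0.55, 0.70, 0.78, 0.80]  # hypothetical r̃ values, T = 5 steps
estimated = perturb(ideal, eps, rng)
worst = max_step_deviation(ideal, estimated)

print(worst, "vs per-step bound", 2 * eps)      # compare with Theorem 1
print("horizon bound:", value_gap_bound(5, eps))  # 4 * T * eps for T = 5
```

With these assumed numbers the horizon-level bound is $4 \cdot 5 \cdot 0.02 = 0.4$, which illustrates why the guarantee is only informative when $\epsilon$ is small relative to the reliability range.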
### III-B RL-Driven Retrieval Agent
As one of the core components of ToE, the retrieval agent is responsible for searching the web for content relevant to the claim under investigation. The agent first analyzes the input claim to extract a set of semantic features, including its category (e.g., historical, scientific, or political) and attributes such as whether it describes a recent event, a historical fact, or a factual statement. These claim-level features are combined with step-level search state features, including the number of searches performed, the number of supporting snippets retrieved, and the number of counter-evidence snippets retrieved, to form the current state $s_t$. The state is fed into a policy network $\pi_{\theta}$, which selects an action $a_t$ from a discrete action space. The action space covers eight options: searching general web sources, searching for counter-evidence, querying arXiv for academic papers, querying Wikipedia, searching social media platforms, querying fact-checking sites such as PolitiFact and Snopes, searching Reddit discussions, and stopping the search. This design reflects the observation that different types of claims are best verified through different sources. Established facts benefit from dedicated fact-checking sites, encyclopedic knowledge is well served by Wikipedia, and time-sensitive claims, particularly statements attributed to public figures or official bodies, are often most efficiently verified via social media. The explicit counter-evidence action further ensures that the agent actively seeks disconfirming information rather than collecting only supporting content. Given the selected action, the agent uses an LLM to generate a search query tailored to the claim’s content and then invokes the corresponding tool to retrieve results from the target source. After each search step, a step reward $r_t$ is computed and stored together with the transition in a trajectory buffer.
Once the buffer reaches the designated capacity, the policy network is updated online using Group Relative Policy Optimization \(GRPO\)\[[16](https://arxiv.org/html/2606.27736#bib.bib44)\], a reinforcement learning algorithm that estimates advantages by comparing returns within a group of sampled trajectories, eliminating the need for a separate value network\.
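The group-relative advantage estimate can be sketched as follows. This follows the commonly described GRPO normalization, in which each trajectory's return is standardized against the mean and standard deviation of its group; whether ToE uses exactly this form is an assumption, as are the function names and the small stabilizing constant `eps`. The discount value 0.99 matches the one listed in Table I.

```python
import statistics


def discounted_return(rewards, gamma=0.99):
    """Sum of gamma^t * r_t over one retrieval trajectory."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def group_relative_advantages(returns, eps=1e-8):
    """Standardize each return against the mean and std of its own group."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    # eps (an assumption) avoids division by zero when all returns are equal.
    return [(ret - mean) / (std + eps) for ret in returns]


# Hypothetical step rewards for a group of three sampled trajectories.
group = [[0.2, 0.1, 0.0], [0.05, 0.0], [0.3, 0.2, 0.1, 0.05]]
returns = [discounted_return(traj) for traj in group]
advantages = group_relative_advantages(returns)
```

Trajectories with above-average return in the group receive positive advantages and the rest receive negative ones, so no learned value baseline is required.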
The concrete policy state is a 29\-dimensional vector\. It contains claim\-level attributes, including a 6\-dimensional category indicator, a 3\-dimensional locale indicator, and binary flags for recent events, historical facts, factual statements, opinions, and predictions\. It also records search progress, including the number and ratio of completed searches, and evidence status, including the total number of retrieved evidence items, their stance breakdown, per\-source search counts, and whether an official or authoritative source has been found\. The policy network is a feed\-forward model with hidden dimensions 128, 128, and 64, followed by a softmax action head\.
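A minimal sketch of a policy network with these dimensions is given below, written in plain Python so that it runs without external libraries. The ReLU activation, the uniform weight initialization, and all function names are assumptions made for illustration; the text specifies only the input size, the hidden sizes, and the softmax head over the eight actions.

```python
import math
import random

STATE_DIM, HIDDEN, N_ACTIONS = 29, (128, 128, 64), 8


def init_layer(n_in, n_out, rng):
    """One dense layer: a weight matrix (n_out x n_in) and a zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def init_policy(seed=0):
    """Stack of dense layers 29 -> 128 -> 128 -> 64 -> 8."""
    rng = random.Random(seed)
    sizes = (STATE_DIM, *HIDDEN, N_ACTIONS)
    return [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]


def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_probs(policy, state):
    """Forward pass: hidden layers with ReLU (assumed), then a softmax head."""
    h = state
    for layer in policy[:-1]:
        h = [max(0.0, z) for z in linear(layer, h)]
    return softmax(linear(policy[-1], h))


policy = init_policy()
probs = action_probs(policy, [0.5] * STATE_DIM)
action = random.Random(1).choices(range(N_ACTIONS), weights=probs)[0]
```

Sampling from `probs` rather than taking the argmax keeps exploration in the loop, which is the usual choice when the policy is trained with a policy-gradient method.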
At each search step $t$, the reward is computed as

$$
r_t = w_r \cdot \Delta\text{rel}_t + w_v \cdot |\Delta\text{ver}_t| + w_e \cdot \min(n_t^{\text{new}}, 4) \cdot \delta_e + b_t - \delta_{\text{step}}, \tag{5}
$$

where $\Delta\text{rel}_t$ and $|\Delta\text{ver}_t|$ denote the changes in reliability and veracity, $n_t^{\text{new}}$ is the number of newly retrieved evidence items, $b_t$ includes high-reliability and authoritative-source bonuses, and $\delta_{\text{step}}$ penalizes unnecessary searches. Table [I](https://arxiv.org/html/2606.27736#S3.T1) summarizes the main hyperparameters. During the first $N_{\text{warm}}$ claims, the agent follows a heuristic warm-up policy based on claim category and recency, after which GRPO updates are triggered whenever the trajectory buffer accumulates $G$ completed trajectories.
TABLE I: Hyperparameters of the Policy Network.

| Parameter | Symbol | Value |
| --- | --- | --- |
| Reliability improvement weight | $w_r$ | 1.0 |
| Veracity change weight | $w_v$ | 0.5 |
| Evidence bonus weight | $w_e$ | 0.05 |
| Per-step penalty | $\delta_{\text{step}}$ | 0.01 |
| High-confidence threshold | $\tau_h$ | 0.8 |
| Authoritative source bonus | $b_{\text{auth}}$ | 0.1 |
| High-reliability bonus | $b_{\text{rel}}$ | 0.2 |
| Group size | $G$ | 8 |
| Discount factor | $\gamma$ | 0.99 |
| Clipping range | $\varepsilon$ | 0.2 |
| Entropy coefficient | $\alpha$ | 0.01 |
| KL coefficient | $\beta$ | 0.1 |
| Learning rate | – | 1e-5 |
| Max gradient norm | – | 0.5 |
| Warm-up claims | $N_{\text{warm}}$ | 8 |
| Max search steps per claim | – | 5 |
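The step reward in Eq. (5) can be sketched as a small function using the Table I values as defaults. Several details are assumptions made only for this illustration: the value of $\delta_e$ is not listed in Table I, so it defaults to 1.0 here; the high-reliability bonus is assumed to fire when the current reliability reaches $\tau_h$; and the authoritative-source bonus is assumed to fire when such a source was found in the current step.

```python
def step_reward(delta_rel, delta_ver, n_new, reliability, found_authoritative,
                w_r=1.0, w_v=0.5, w_e=0.05, delta_e=1.0,
                delta_step=0.01, tau_h=0.8, b_auth=0.1, b_rel=0.2):
    """Eq. (5) with Table I defaults; delta_e and the bonus triggers are assumptions."""
    bonus = 0.0
    if reliability >= tau_h:      # assumed trigger for the high-reliability bonus
        bonus += b_rel
    if found_authoritative:       # assumed trigger for the authoritative-source bonus
        bonus += b_auth
    # The evidence term is capped at four new items per step, as in Eq. (5).
    evidence_term = w_e * min(n_new, 4) * delta_e
    return w_r * delta_rel + w_v * abs(delta_ver) + evidence_term + bonus - delta_step


# A step that raises reliability by 0.2, shifts veracity by -0.1,
# and retrieves six new evidence items (only four are counted).
example = step_reward(0.2, -0.1, 6, reliability=0.5, found_authoritative=False)
```

Under these assumptions the example evaluates to $0.2 + 0.05 + 0.2 - 0.01 = 0.44$; the cap on $n_t^{\text{new}}$ keeps the agent from being rewarded for retrieving many items in a single step.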
### III-C Evidence Evaluation Agent
After the retrieval agent collects relevant documents, each document $D_i$ is processed individually. To avoid exceeding the context limit of the LLM, each document is first split into chunks. Each chunk is submitted to a cleaning LLM that extracts statements directly relevant to the claim. Once all chunks of a document are processed, the extracted statements are combined with the document’s metadata, such as URL and author information, and submitted to an analysis LLM. This LLM determines the document’s stance toward the claim (supporting, refuting, or neutral), whether it originates from an official or authoritative source, and whether it constitutes a primary report or a republication, among other attributes. The result is a structured evidence object $E_i$. After all documents are processed, the full evidence set $\mathbb{E}$ and the claim’s semantic features are passed to the evaluation network, which produces two scores for the current node: a veracity score $v_n$ and a reliability score $r_n$.
Figure 3: Detail of the evaluation network.

As illustrated in Fig. [3](https://arxiv.org/html/2606.27736#S3.F3), the evaluation network jointly predicts veracity and reliability through a stance-aware multi-path attention mechanism. Claim features and evidence features are first encoded by a claim encoder and a shared evidence encoder into high-dimensional claim embeddings and evidence embeddings, respectively. In parallel, the evidence features are also passed through a quality gate, a small multilayer network that produces a per-evidence quality weight reflecting its estimated relevance and credibility. The encoded evidence then enters the stance-aware group attention module, where evidence objects are partitioned into three parallel processing tracks according to their annotated stance: supporting, neutral, and refuting. A mask generator takes as input the stance labels, the quality gate weights, and a binary evidence mask indicating which evidence slots are occupied, and produces two types of masks for each track. The safe mask is a hard binary mask that blocks invalid or cross-stance evidence from attending to the current track. The soft mask is a continuous weight derived from the quality gate output, used to scale attention outputs by evidence quality. Within each track, the evidence embeddings pass through a multi-head self-attention layer gated by the safe mask, enabling intra-stance contextual interaction. The attention outputs are then multiplied by the soft mask to inject quality priors, and mean pooling compresses the variable-length evidence set within each track into a fixed-dimensional stance aggregation vector. In the fusion stage, the original claim embedding and the three stance aggregation vectors are concatenated along the feature dimension, forming a compact unified representation that encodes the claim content alongside the evidential signal from all three stances.
This representation is passed through a shared network for global feature integration, after which it is routed to two parallel output heads: a veracity MLP and a reliability MLP, producing the final veracity score and reliability score, respectively\.
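The pooling and fusion stages can be illustrated with a deliberately simplified sketch. It omits the encoders, the multi-head self-attention, and the output heads, and shows only the stance partitioning (the role of the safe mask), the quality scaling (the role of the soft mask), mean pooling, and concatenation. The function names, the zero vector used for an empty track, and the toy inputs are assumptions.

```python
STANCES = ("supporting", "neutral", "refuting")


def pool_by_stance(embeddings, stances, quality, dim):
    """Quality-scaled mean pooling of evidence embeddings within each stance track."""
    pooled = []
    for stance in STANCES:
        # Hard selection of same-stance evidence plays the role of the safe mask.
        members = [i for i, s in enumerate(stances) if s == stance]
        if not members:
            pooled.append([0.0] * dim)  # empty track: zero vector (an assumption)
            continue
        # Scale each embedding by its quality weight (soft mask), then average.
        track = [
            sum(quality[i] * embeddings[i][d] for i in members) / len(members)
            for d in range(dim)
        ]
        pooled.append(track)
    return pooled


def fuse(claim_embedding, pooled):
    """Concatenate the claim embedding with the three stance vectors."""
    fused = list(claim_embedding)
    for track in pooled:
        fused.extend(track)
    return fused


# Toy inputs: three evidence items with 2-dimensional embeddings.
embeddings = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
stances = ["supporting", "supporting", "refuting"]
quality = [1.0, 0.5, 1.0]
pooled = pool_by_stance(embeddings, stances, quality, dim=2)
fused = fuse([9.0], pooled)
```

Because each track is pooled to a fixed size, the fused vector has the same length regardless of how many evidence items were retrieved, which is what allows a fixed-size network to follow.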
The evaluation network uses a 12-dimensional claim feature vector and a padded $20 \times 8$ evidence matrix with a binary evidence mask. Claim features encode category, locale, and whether the claim describes a recent event, historical fact, or factual statement. Each evidence item encodes stance, source type, relevance, and credibility. Training targets are produced by an LLM judge, which assigns soft veracity and reliability scores based on evidential support, source authority, evidence quantity, cross-source consistency, and directness. Given targets $(\tilde{v}, \tilde{r})$, the network minimizes

$$
\mathcal{L} = \text{BCE}(\hat{v},\, \tilde{v}) + \text{BCE}(\hat{r},\, \tilde{r}), \tag{6}
$$

where $\hat{v}$ and $\hat{r}$ are the predicted veracity and reliability scores. The model is optimized with Adam using a learning rate of $10^{-3}$ and gradient norm clipping at $1.0$.
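For reference, the loss in Eq. (6) with soft targets can be written out directly. This is a minimal single-example sketch; the clamping constant `eps` and the function names are assumptions added for numerical safety and readability.

```python
import math


def bce(pred, target, eps=1e-7):
    """Binary cross-entropy for one prediction with a soft target in [0, 1]."""
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def evaluation_loss(v_hat, r_hat, v_target, r_target):
    """Eq. (6): sum of the veracity and reliability BCE terms."""
    return bce(v_hat, v_target) + bce(r_hat, r_target)


# Soft targets such as 0.9 / 0.8 stand in for hypothetical LLM-judge scores.
loss_close = evaluation_loss(0.9, 0.8, 0.9, 0.8)
loss_far = evaluation_loss(0.5, 0.5, 0.9, 0.8)
```

With soft targets the BCE term is minimized when the prediction equals the target, although its minimum value is then the entropy of the target rather than zero.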
### III-D Tree Management
Once a node obtains its veracity score $v_{n^{*}}$ and reliability score $r_{n^{*}}$ from the evaluation network, the tree management module determines how the argument tree should evolve. Algorithm [1](https://arxiv.org/html/2606.27736#alg1) summarizes the full procedure, where three binary flags govern the control flow: the expansion flag $\delta_n$, the decisiveness flag $\phi_n$, and the convergence flag $\gamma$. These flags are defined in terms of five thresholds: the expansion reliability threshold $\tau_r^{\text{exp}}$, the decision reliability threshold $\tau_r^{\text{dec}}$, the upper and lower veracity thresholds $\tau_v^{+}$ and $\tau_v^{-}$, and the convergence tolerance $\tau_c$. Specifically:
- •δn=1\\delta\_\{n\}=1ifrnself<τrexpr\_\{n\}^\{\\text\{self\}\}<\\tau\_\{r\}^\{\\text\{exp\}\}, indicating that the node’s self\-reliability is too low for a confident judgment and further evidence is needed; the root node is always expanded on its first evaluation regardless of this condition\.
- •ϕn=1\\phi\_\{n\}=1ifrnaggr\>τrdecr\_\{n\}^\{\\text\{aggr\}\}\>\\tau\_\{r\}^\{\\text\{dec\}\}andvnaggr∉\[τv−,τv\+\]v\_\{n\}^\{\\text\{aggr\}\}\\notin\[\\tau\_\{v\}^\{\-\},\\tau\_\{v\}^\{\+\}\], indicating that the aggregated evidence is sufficiently reliable and the verdict is unambiguously true or false\.
- •γ=1\\gamma=1if the range ofv0v\_\{0\}over the last three iterations is belowτc\\tau\_\{c\}andr0aggr\>τrexpr\_\{0\}^\{\\text\{aggr\}\}\>\\tau\_\{r\}^\{\\text\{exp\}\}, indicating that the root veracity score has stabilized and further iterations are unlikely to change the outcome\.
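The three flag definitions above translate almost directly into predicates. The sketch below is one possible reading; the function names and argument layout are illustrative, and the default threshold values are the ones reported under ToE Settings in Section IV-A.

```python
def expansion_flag(r_self, tau_r_exp=0.7, is_root_first_eval=False):
    """delta_n: expand when self-reliability is low, or on the root's first evaluation."""
    return is_root_first_eval or r_self < tau_r_exp


def decisiveness_flag(r_aggr, v_aggr, tau_r_dec=0.7, tau_v_minus=0.3, tau_v_plus=0.7):
    """phi_n: reliable aggregate whose veracity lies outside [tau_v_minus, tau_v_plus]."""
    return r_aggr > tau_r_dec and not (tau_v_minus <= v_aggr <= tau_v_plus)


def convergence_flag(v0_history, r0_aggr, tau_c=0.1, tau_r_exp=0.7, window=3):
    """gamma: root veracity range over the last `window` iterations is below tau_c."""
    if len(v0_history) < window:
        return False  # not enough iterations to judge stability
    recent = v0_history[-window:]
    return (max(recent) - min(recent)) < tau_c and r0_aggr > tau_r_exp
```

Note that the veracity interval is closed, so a node with an aggregated veracity of exactly 0.7 is not treated as decisive under this reading.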
Algorithm 1: Tree of Evidence Verification

Input: Initial claim $c_0$, maximum iterations $T_{\max}$, and thresholds $\tau_r^{\text{exp}}, \tau_r^{\text{dec}}, \tau_v^{+}, \tau_v^{-}, \tau_c$.

Output: Final root veracity score $v_0$ and reliability score $r_0$.

1. Initialize tree $\mathcal{T}$ with root node $n_0 \leftarrow c_0$
2. Initialize queue $\mathcal{Q} \leftarrow \{n_0\}$
3. for $t = 1, 2, \dots, T_{\max}$ do
4. if $\mathcal{Q}$ is empty then
5. break
6. end if
7. $n^* \leftarrow \arg\max_{n \in \mathcal{Q}} \textsc{Priority}(n)$
8. $\mathcal{Q} \leftarrow \mathcal{Q} \setminus \{n^*\}$
9. $\mathcal{E} \leftarrow \textsc{GetEvidence}(n^*)$
10. $v_{n^*}, r_{n^*} \leftarrow \textsc{Evaluate}(n^*, \mathcal{E})$
11. if $\delta_{n^*} = 1$ then
12. $\mathcal{S} \leftarrow \textsc{Decompose}(n^*)$
13. $\mathcal{T}.\textsc{Add}(n^*, \mathcal{S})$
14. $\mathcal{Q} \leftarrow \mathcal{Q} \cup \mathcal{S}$
15. end if
16. $v_0, r_0 \leftarrow \textsc{Aggregate}(\mathcal{T})$
17. $\mathcal{T}, \mathcal{Q} \leftarrow \textsc{Prune}(\mathcal{T}, \mathcal{Q})$
18. if $\gamma = 1$ or $\phi_0 = 1$ then
19. break
20. end if
21. end for
22. return $(v_0, r_0)$
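To make the control flow of Algorithm 1 concrete, here is a small runnable sketch of the loop. It is not the authors' implementation: the `Node` class, the keyword-argument callables, and the `toy_*` stand-ins are hypothetical, the LLM and retrieval components are replaced by lookup tables, and the maximum tree depth is not enforced.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # identity-based equality so queue.remove() targets one node
class Node:
    claim: str
    weight: float = 1.0                  # importance weight from decomposition
    children: list = field(default_factory=list)
    v: Optional[float] = None            # veracity score
    r: Optional[float] = None            # reliability score


def run_toe(claim, *, get_evidence, evaluate, decompose, aggregate, prune,
            priority, t_max=20, tau_r_exp=0.7, tau_r_dec=0.7,
            tau_v_minus=0.3, tau_v_plus=0.7, tau_c=0.1):
    root = Node(claim)
    queue = [root]
    history = []                         # root veracity after each iteration
    v0 = r0 = None
    for _ in range(t_max):
        if not queue:
            break
        node = max(queue, key=priority)  # line 7: highest-priority node
        queue.remove(node)
        node.v, node.r = evaluate(node, get_evidence(node))
        # delta flag: low self-reliability, or the root's first evaluation
        if node.r < tau_r_exp or node is root:
            subs = decompose(node)
            node.children.extend(subs)
            queue.extend(subs)
        v0, r0 = aggregate(root)
        queue = prune(root, queue)
        history.append(v0)
        recent = history[-3:]
        gamma = len(recent) == 3 and max(recent) - min(recent) < tau_c and r0 > tau_r_exp
        phi0 = r0 > tau_r_dec and not (tau_v_minus <= v0 <= tau_v_plus)
        if gamma or phi0:
            break
    return v0, r0


# Toy stand-ins (hypothetical): fixed scores and a fixed decomposition.
SCORES = {"A and B": (0.5, 0.4), "A": (0.9, 0.9), "B": (0.8, 0.9)}
PARTS = {"A and B": [("A", 0.5), ("B", 0.5)]}


def toy_aggregate(root):
    done = [c for c in root.children if c.v is not None]
    if not done:
        return root.v, root.r
    total = sum(c.weight * c.r for c in done)
    v = sum(c.weight * c.r * c.v for c in done) / total
    r = total / sum(c.weight for c in done)
    return v, r


def demo():
    return run_toe(
        "A and B",
        get_evidence=lambda n: [],
        evaluate=lambda n, e: SCORES[n.claim],
        decompose=lambda n: [Node(c, w) for c, w in PARTS.get(n.claim, [])],
        aggregate=toy_aggregate,
        prune=lambda root, q: q,
        priority=lambda n: n.weight,
    )
```

In the toy run, the root is expanded on its first evaluation, and the loop stops as soon as the first sub-claim makes the root decisive, leaving the second sub-claim unevaluated. This mirrors the early-termination check on lines 18 to 20.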
If the expansion flag $\delta_{n^*} = 1$, indicating that the self-reliability $r_{n^*}$ falls below $\tau_r^{\text{exp}}$ and the node is eligible for expansion, the system invokes a decomposition LLM. The LLM breaks the claim into two to four sub-claims, each targeting a distinct verification dimension: geographic accuracy, entity identity, core event, causal relationship, source credibility, quantitative accuracy, or contextual background. Each sub-claim is assigned an importance weight, with all weights summing to one. The resulting sub-claims are appended to the argument tree $\mathcal{T}$ as children of the current node and added to the processing queue $\mathcal{Q}$.

After any structural change to the tree, scores are propagated bottom-up via the aggregation network. For each node, the network takes as input the node's own veracity and reliability scores alongside those of all processed children, each of which already reflects the combined evidence of its own subtree. This update refreshes the root node's scores $v_0$ and $r_0$ to reflect all available evidence. As illustrated in Fig. [4](https://arxiv.org/html/2606.27736#S3.F4), the aggregation network encodes the node's own 16-dimensional feature vector and the feature vectors of its evaluated children, applies self-attention over child representations, and uses cross-attention from the parent node to aggregate child evidence into updated veracity and reliability scores.
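The constraints on the decomposition output described above (two to four sub-claims, one of seven verification dimensions each, weights summing to one) can be checked with a small validator. This is a sketch only: the tuple layout and the choice to renormalize raw weights are assumptions, since the paper does not say how the LLM output is post-processed.

```python
DIMENSIONS = {
    "geographic accuracy", "entity identity", "core event",
    "causal relationship", "source credibility",
    "quantitative accuracy", "contextual background",
}


def normalize_subclaims(subclaims):
    """Validate (text, dimension, raw_weight) triples and rescale weights to sum to 1."""
    if not 2 <= len(subclaims) <= 4:
        raise ValueError("expected two to four sub-claims")
    for _, dim, weight in subclaims:
        if dim not in DIMENSIONS:
            raise ValueError("unknown verification dimension: " + dim)
        if weight <= 0:
            raise ValueError("importance weights must be positive")
    total = sum(weight for _, _, weight in subclaims)
    return [(text, dim, weight / total) for text, dim, weight in subclaims]
```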
Figure 4: Detail of Tree Aggregation Network.

Before neural training, ToE uses a rule-based bottom-up aggregation scheme as an interpretable fallback and a source of soft supervision. Leaf and unexpanded nodes inherit their self-assessed scores. Internal nodes compute veracity as a reliability- and importance-weighted average of child veracity scores, while reliability is adjusted by the number of available children and the consistency of their veracity estimates. The neural aggregator is then trained bottom-up on completed trees. Its loss combines a supervised root loss and an imitation loss over non-root nodes:
$$\mathcal{L}_{\text{root}} = w_{\text{root}} \left[ \text{BCE}(\hat{v}_{\text{root}}, y_v) + \text{BCE}(\hat{r}_{\text{root}}, y_r) \right], \tag{7}$$

$$\mathcal{L}_{\text{imit}} = w_{\text{imit}} \cdot \frac{1}{|\mathcal{N}|} \sum_{i \in \mathcal{N}} \left[ \text{BCE}(\hat{v}_i, \tilde{v}_i) + \text{BCE}(\hat{r}_i, \tilde{r}_i) \right], \tag{8}$$

with $w_{\text{root}} = 1.0$ and $w_{\text{imit}} = 0.1$. The aggregation network is optimized with AdamW using a learning rate of $10^{-4}$, weight decay $10^{-5}$, and gradient norm clipping at $1.0$.
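The rule-based aggregation and the two loss terms in Eqs. (7) and (8) can be sketched as follows. Several details are assumptions made for illustration and are not given in the paper: the exact reliability adjustment (here the mean child reliability scaled by a coverage factor and by one minus the spread of child veracity), the dictionary-based node layout, and the choice to sum the two loss terms into a single objective.

```python
import math


def bce(pred, target, eps=1e-7):
    p = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def rule_aggregate(node):
    """Bottom-up rule-based scores for a node given as a dict.

    Keys: 'v', 'r' (self scores), optional 'weight' and 'children'.
    """
    children = node.get("children", [])
    scored = [(c["weight"],) + rule_aggregate(c)
              for c in children if c.get("v") is not None]
    if not scored:
        return node["v"], node["r"]      # leaf or unexpanded: inherit self scores
    total = sum(w * cr for w, cv, cr in scored)
    if total == 0:
        return node["v"], node["r"]
    # Veracity: reliability- and importance-weighted average of child veracity.
    v = sum(w * cr * cv for w, cv, cr in scored) / total
    # Reliability: hypothetical adjustment by coverage and consistency.
    mean_r = sum(cr for _, _, cr in scored) / len(scored)
    coverage = len(scored) / len(children)
    child_v = [cv for _, cv, _ in scored]
    consistency = 1.0 - (max(child_v) - min(child_v))
    return v, mean_r * coverage * consistency


def aggregator_loss(root_pred, root_label, nonroot_preds, nonroot_targets,
                    w_root=1.0, w_imit=0.1):
    """Eq. (7) plus Eq. (8); each argument holds (veracity, reliability) pairs."""
    l_root = w_root * (bce(root_pred[0], root_label[0])
                       + bce(root_pred[1], root_label[1]))
    if not nonroot_preds:
        return l_root
    l_imit = w_imit * sum(
        bce(p[0], t[0]) + bce(p[1], t[1])
        for p, t in zip(nonroot_preds, nonroot_targets)
    ) / len(nonroot_preds)
    return l_root + l_imit
```

With $w_{\text{imit}} = 0.1$, the imitation term acts as a light regularizer on intermediate nodes, while the supervised root term dominates the objective.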
Following aggregation, the system performs a top-down pruning pass. For any non-leaf node whose decisiveness flag $\phi_n = 1$, meaning its aggregated reliability $r_n^{\text{aggr}}$ exceeds $\tau_r^{\text{dec}}$ and its aggregated veracity $v_n^{\text{aggr}}$ falls outside $[\tau_v^{-}, \tau_v^{+}]$, all pending descendants are removed from $\mathcal{Q}$, avoiding unnecessary computation on branches that no longer affect the outcome. The loop then checks whether the convergence flag $\gamma = 1$ or the root's decisiveness flag $\phi_0 = 1$. Convergence holds when the range of $v_0$ over the last three iterations drops below $\tau_c$ and $r_0^{\text{aggr}}$ exceeds $\tau_r^{\text{exp}}$. If either condition is met, the system terminates and returns $v_0$ and $r_0$ as the final verdict. Otherwise, the next node is selected from $\mathcal{Q}$ according to its priority score, which is computed as a weighted combination of the parent node's uncertainty and the node's own importance weight assigned at decomposition time, and the next iteration begins.
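A minimal sketch of the pruning pass and the priority score follows. The dictionary-based tree, the node `id` field, the mixing coefficient `alpha`, and the definition of parent uncertainty as one minus the parent's reliability are all assumptions for illustration; the paper states only that priority is a weighted combination of parent uncertainty and importance weight.

```python
def is_decisive(node, tau_r_dec=0.7, tau_v_minus=0.3, tau_v_plus=0.7):
    """phi_n computed from a node's aggregated scores."""
    return (node["r_aggr"] > tau_r_dec
            and not (tau_v_minus <= node["v_aggr"] <= tau_v_plus))


def descendant_ids(node):
    """Collect the ids of every node strictly below `node`."""
    ids, stack = [], list(node["children"])
    while stack:
        current = stack.pop()
        ids.append(current["id"])
        stack.extend(current["children"])
    return ids


def prune(root, queue_ids):
    """Top-down pass: drop pending descendants of decisive non-leaf nodes."""
    removed, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node["children"] and is_decisive(node):
            removed.update(descendant_ids(node))  # no need to descend further
        else:
            stack.extend(node["children"])
    return [q for q in queue_ids if q not in removed]


def priority(parent_reliability, importance_weight, alpha=0.5):
    """Hypothetical priority: mix of parent uncertainty and importance weight."""
    return alpha * (1.0 - parent_reliability) + (1.0 - alpha) * importance_weight
```

Under this reading, sub-claims of an uncertain parent are processed before equally weighted sub-claims of a parent that is already reliable.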
## IV Evaluation
### IV-A Experimental Setup
Datasets. We evaluate ToE on three publicly available fact-checking benchmarks. The LIAR dataset [[19](https://arxiv.org/html/2606.27736#bib.bib55)] contains 12,836 human-labeled short statements about political news, each annotated with one of six fine-grained veracity labels along with speaker metadata such as party affiliation and job title. PolitiFact [[12](https://arxiv.org/html/2606.27736#bib.bib56)] comprises 21,152 expert-verified statements sourced from the PolitiFact website, spanning claims made between 2008 and 2022 across six verdict categories. Check-COVID [[18](https://arxiv.org/html/2606.27736#bib.bib57)] is a domain-specific benchmark of 1,504 COVID-19 related claims drawn from news sources, each verified against evidence from scientific journal articles and labeled as supported, refuted, or lacking sufficient information. To enable consistent three-class accuracy evaluation across all datasets, we map each dataset's original labels to a unified label set: true, false, and uncertain. For LIAR and PolitiFact, the labels pants-fire, false, and barely-true are mapped to false; half-true is mapped to uncertain; and mostly-true and true are mapped to true. The two datasets differ only in that PolitiFact uses mostly-false in place of barely-true, which is likewise mapped to false. For Check-COVID, the labels SUPPORT, REFUTE, and NOTENOUGHINFO are mapped to true, false, and uncertain, respectively. To further assess the robustness of ToE against AI-generated misinformation, we construct AdvFact, an adversarial benchmark designed to simulate sophisticated false claims that are stylistically convincing yet factually incorrect. We randomly sample 100 false statements from the LIAR dataset and apply two complementary poisoning strategies to generate 100 samples each: FakeGPT [[4](https://arxiv.org/html/2606.27736#bib.bib47)] rewrites each statement into a more fluent and detail-rich false narrative, while PoisonedRAG [[23](https://arxiv.org/html/2606.27736#bib.bib45)] injects adversarially crafted passages into the retrieval corpus to mislead evidence-based reasoning. The resulting dataset contains 200 samples in total, all labeled as false, and is evaluated directly without any additional training.
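The label unification described above amounts to a per-dataset lookup table. In the sketch below, the dataset keys and the exact label spellings are assumptions based on how the labels are written in the text; the mapping itself follows the description.

```python
# Shared mapping for the six-way political labels.
_POLITICAL = {
    "pants-fire": "false",
    "false": "false",
    "half-true": "uncertain",
    "mostly-true": "true",
    "true": "true",
}

LABEL_MAP = {
    # LIAR uses barely-true; PolitiFact uses mostly-false in its place.
    "liar": dict(_POLITICAL, **{"barely-true": "false"}),
    "politifact": dict(_POLITICAL, **{"mostly-false": "false"}),
    "check-covid": {
        "SUPPORT": "true",
        "REFUTE": "false",
        "NOTENOUGHINFO": "uncertain",
    },
    # Every AdvFact sample is a poisoned false claim, so it needs no mapping.
}


def unify_label(dataset, label):
    """Map a dataset-specific label to one of: true, false, uncertain."""
    return LABEL_MAP[dataset][label]
```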
Comparison Methods. We compare ToE against a range of LLM-based fact-checking methods. Direct prompting feeds the raw claim directly to an LLM and asks for a veracity judgment without any additional context. Z-CoT and DefGen [[11](https://arxiv.org/html/2606.27736#bib.bib50)] are two structured prompting strategies proposed by the same work, eliciting step-by-step reasoning through zero-shot chain-of-thought and deductive generation, respectively, without external retrieval. AFaCTA [[13](https://arxiv.org/html/2606.27736#bib.bib52)] employs a multi-agent voting mechanism to aggregate judgments from multiple LLM instances. TELLER [[10](https://arxiv.org/html/2606.27736#bib.bib51)] models the verification process through a dual cognitive and decision-making system. STEEL [[8](https://arxiv.org/html/2606.27736#bib.bib53)] performs iterative multi-round retrieval to progressively gather evidence before reaching a verdict. AdSent [[17](https://arxiv.org/html/2606.27736#bib.bib54)] neutralizes sentiment signals in the input claim to improve veracity judgment. To assess whether the gains of ToE generalize across model scales, we instantiate all methods on two backbone LLMs: DeepSeek-V3.2 [[3](https://arxiv.org/html/2606.27736#bib.bib58)], a large-scale model, and gpt-oss-20b [[14](https://arxiv.org/html/2606.27736#bib.bib59)], a smaller open-weight reasoning model released by OpenAI.
ToE Settings. The argument tree is constrained to a maximum depth of 5 and a maximum of $T_{\max} = 20$ iterations. The thresholds are set as $\tau_r^{\text{exp}} = 0.7$ (expansion), $\tau_r^{\text{dec}} = 0.7$ (decision), $\tau_v^{+} = 0.7$, $\tau_v^{-} = 0.3$ (veracity bounds), and $\tau_c = 0.1$ (convergence). For the retrieval agent, each search action returns at most 3 results, and the total number of retrieved documents per node is capped at 10. The final veracity score $v \in [0, 1]$ is mapped to a three-way verdict: $v < 0.4$ is classified as False, $0.4 \leq v \leq 0.6$ as Uncertain, and $v > 0.6$ as True. To validate that our approach is dataset-agnostic, we train exclusively on the training split of the LIAR dataset and evaluate directly on its test split as well as all other datasets without any dataset-specific fine-tuning or adaptation.
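The settings above can be gathered into a single configuration object together with the score-to-verdict mapping. The class and field names are hypothetical; the values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToESettings:
    max_depth: int = 5
    t_max: int = 20
    tau_r_exp: float = 0.7      # expansion reliability threshold
    tau_r_dec: float = 0.7      # decision reliability threshold
    tau_v_plus: float = 0.7     # upper veracity bound
    tau_v_minus: float = 0.3    # lower veracity bound
    tau_c: float = 0.1          # convergence tolerance
    results_per_search: int = 3
    max_docs_per_node: int = 10


def verdict(v):
    """Map a final veracity score in [0, 1] to a three-way verdict."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("veracity score must lie in [0, 1]")
    if v < 0.4:
        return "False"
    if v <= 0.6:
        return "Uncertain"
    return "True"
```

Note that the verdict cut-offs (0.4 and 0.6) are narrower than the veracity bounds used for the decisiveness flag (0.3 and 0.7), so a root score can yield a True or False verdict without having triggered early termination.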
### IV-B Results and Analysis
TABLE II: Average Accuracy Comparison of Different Methods Across Datasets and LLMs

Table [II](https://arxiv.org/html/2606.27736#S4.T2) compares ToE against six baseline methods across four datasets and two LLMs. In each trial, 50 claims are sampled from each dataset using a fixed random seed, and all methods are prompted to base their judgments solely on the provided input to ensure a fair comparison. Each experiment is repeated ten times and the median accuracy is reported. The four datasets cover distinct claim domains, and ToE performs consistently across all of them. The improvement is most visible on AdvFact, indicating that the method generalizes to adversarially framed claims. Notably, the Direct baseline, which prompts the LLM to judge claims without any external retrieval, achieves non-trivial accuracy on several datasets, likely because some claims in these benchmarks overlap with knowledge seen during pretraining, enabling correct judgments from parametric memory alone. The results also suggest that backbone capability influences method effectiveness: Z-CoT, which relies heavily on chain-of-thought reasoning, scores 0.60 on PolitiFact under DeepSeek-V3.2 but drops to 0.22 under gpt-oss-20b, reflecting its sensitivity to the underlying model's reasoning capacity. ToE improves over the Direct baseline across all settings, confirming that structured retrieval and evidence aggregation provide reliable gains beyond parametric knowledge.
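The evaluation protocol (a fixed-seed sample of 50 claims, ten repetitions, median accuracy) can be sketched with the standard library. The helper names and the seed value are assumptions, and the paper does not state whether the same sample is reused across repetitions; this sketch assumes it is.

```python
import random
import statistics


def sample_claims(dataset, k=50, seed=0):
    """Draw k claims reproducibly; the seed value here is illustrative."""
    return random.Random(seed).sample(dataset, k)


def accuracy(predictions, labels):
    """Fraction of predictions that match the unified three-class labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def median_accuracy(run_accuracies):
    """Median over repeated runs, as reported in the results table."""
    return statistics.median(run_accuracies)
```

Reporting the median rather than the mean limits the influence of a single unusually good or bad run, which matters with only 50 claims per trial.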
Figure 5: Heatmap of Action Distribution.

TABLE III: Ablation study of the search tool space.

To investigate the agent's adaptive search behavior, we manually collected 30 claims across six categories. Each category consists of five claims (two true, two false, and one uncertain) to evaluate performance across different veracity labels. We then run ToE on all samples and record the frequency with which the retrieval agent selects each evidence source. Figure [5](https://arxiv.org/html/2606.27736#S4.F5) reveals clear category-specific patterns. Scientific claims show the highest Arxiv usage at 26.4%, reflecting the agent's preference for peer-reviewed literature when evaluating technical assertions. Political claims and health claims exhibit the strongest reliance on Factcheck at 35.5% and 34.5% respectively, consistent with the availability of expert fact-checking records for these domains. Celebrity claims favor Social Media at 24.5%, where public figures' statements and reactions are most readily documented. Historical claims distribute usage more evenly across Web Engine, Wiki, and Factcheck, suggesting that verifying historical statements benefits from multiple complementary sources. The Other category shows the highest Web Engine usage at 32.4%, which is expected given the heterogeneous nature of claims that do not fall into a well-defined domain.

Table [III](https://arxiv.org/html/2606.27736#S4.T3) presents an ablation study on the retrieval tool space, where we partition the available tools into four functional groups: academic sources (ArXiv), counter-evidence search (RefuteSearch), fact-checking databases (Wiki and PolitiFact), and social media platforms (Twitter and Reddit). Removing any single group consistently degrades overall accuracy, confirming that each tool category contributes complementary evidence. The most severe drop occurs when fact-checking sources are removed, reducing overall accuracy to 56.6% and true-claim accuracy to 50.0%, which suggests that structured fact-checking records provide the most direct and reliable signal for veracity assessment. These patterns indicate that the retrieval agent, trained with rule-based warm-up, learns to allocate search actions in a domain-aware manner rather than applying a uniform retrieval strategy across all claim types, and they highlight the importance of a diverse tool space.
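The per-category source frequencies behind the heatmap can be computed from a log of retrieval actions. The log format, a sequence of (category, source) pairs, is an assumption for this sketch.

```python
from collections import Counter, defaultdict


def action_distribution(action_log):
    """Return {category: {source: percent of that category's actions}}."""
    counts = defaultdict(Counter)
    for category, source in action_log:
        counts[category][source] += 1
    return {
        category: {
            source: 100.0 * n / sum(per_source.values())
            for source, n in per_source.items()
        }
        for category, per_source in counts.items()
    }
```

Each row is normalized within its category, so the percentages for one claim category sum to 100 regardless of how many search actions that category triggered.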
## V Conclusion
We presented ToE, a tree\-structured argument reasoning framework for automated fact\-checking\. By organizing evidence into a hierarchical argument tree and guiding retrieval and evidence scoring through a trained agent, ToE decomposes complex claims into verifiable sub\-arguments and aggregates evidence\. Experiments across multiple datasets and claim domains demonstrate that the approach generalizes consistently, offering a viable path toward evidence\-grounded claim verification resilient to GEO poisoning\.
## References
- [1] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) Gpt-4 technical report. arXiv preprint arXiv:2303.08774. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1).
- [2] P. Aggarwal, V. Murahari, T. Rajpurohit, A. Kalyan, K. Narasimhan, and A. Deshpande (2024) Geo: generative engine optimization. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 5–16. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1).
- [3] DeepSeek-AI (2025) DeepSeek-v3.2: pushing the frontier of open large language models. Cited by: [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [4] Y. Huang and L. Sun (2023) FakeGPT: fake news generation, explanation and detection of large language models. arXiv preprint arXiv:2310.05046. Cited by: [§II](https://arxiv.org/html/2606.27736#S2.p1.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p1.1).
- [5] L. P. Kaelbling, M. L. Littman, and A. R. Cassandra (1998) Planning and acting in partially observable stochastic domains. Artificial intelligence 101 (1-2), pp. 99–134. Cited by: [§III-A](https://arxiv.org/html/2606.27736#S3.SS1.SSS0.Px1.p1.17).
- [6] S. Kothawade, N. Beck, K. Killamsetty, and R. Iyer (2021) Similar: submodular information measures based active learning in realistic scenarios. Advances in Neural Information Processing Systems 34, pp. 18685–18697. Cited by: [§III-A](https://arxiv.org/html/2606.27736#S3.SS1.SSS0.Px2.p1.4).
- [7] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020) Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems 33, pp. 9459–9474. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1).
- [8] G. Li, W. Lu, W. Zhang, D. Lian, K. Lu, R. Mao, K. Shu, and H. Liao (2024) Re-search for the truth: multi-round retrieval-augmented large language models are strong fake news detectors. arXiv preprint arXiv:2403.09747. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p2.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [9] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024) Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1).
- [10] H. Liu, W. Wang, H. Li, and H. Li (2024) Teller: a trustworthy framework for explainable, generalizable and controllable fake news detection. In Findings of the Association for Computational Linguistics: ACL 2024, pp. 15556–15583. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p2.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [11] J. Lucas, A. Uchendu, M. Yamashita, J. Lee, S. Rohatgi, and D. Lee (2023) Fighting fire with fire: the dual role of llms in crafting and detecting elusive disinformation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 14279–14305. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p2.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [12] Cited by: [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p1.1).
- [13] J. Ni, M. Shi, D. Stammbach, M. Sachan, E. Ash, and M. Leippold (2024) Afacta: assisting the annotation of factual claim detection with reliable llm annotators. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1890–1912. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p2.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [14] OpenAI (2025) Gpt-oss-120b & gpt-oss-20b model card. arXiv:2508.10925. [Link](https://arxiv.org/abs/2508.10925). Cited by: [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [15] T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023) Toolformer: language models can teach themselves to use tools. Advances in Neural Information Processing Systems 36, pp. 68539–68551. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1).
- [16] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. (2024) Deepseekmath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300. Cited by: [§III-B](https://arxiv.org/html/2606.27736#S3.SS2.p1.4).
- [17] S. Tahmasebi, E. Müller-Budack, and R. Ewerth (2026) Robust fake news detection using large language models under adversarial sentiment attacks. arXiv preprint arXiv:2601.15277. Cited by: [§II](https://arxiv.org/html/2606.27736#S2.p1.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p2.1).
- [18] G. Wang, K. Harwood, L. Chillrud, A. Ananthram, M. Subbiah, and K. McKeown (2023) Check-covid: fact-checking covid-19 news claims with scientific evidence. In Findings of the Association for Computational Linguistics: ACL 2023, pp. 14114–14127. Cited by: [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p1.1).
- [19] W. Y. Wang (2017) “Liar, liar pants on fire”: a new benchmark dataset for fake news detection. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 2: short papers), pp. 422–426. Cited by: [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p1.1).
- [20] Z. Wang, D. He, Z. Zhang, Y. Liu, J. Liu, Z. Zeng, Z. Qin, Z. Li, X. Li, H. Yao, J. An, Y. Liu, Y. Li, Q. Sun, X. Liu, and L. Zhu (2026) Combating knowledge corruption in agent systems: a byzantine-tolerant secure collaborative rag framework. In Proceedings of the ACM Web Conference 2026, WWW '26. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1).
- [21] Z. Zhong, Z. Huang, A. Wettig, and D. Chen (2023) Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1).
- [22] X. Zhou and R. Zafarani (2020) A survey of fake news: fundamental theories, detection methods, and opportunities. ACM Computing Surveys (CSUR) 53 (5), pp. 1–40. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p2.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1).
- [23] W. Zou, R. Geng, B. Wang, and J. Jia (2024) Poisonedrag: knowledge corruption attacks to retrieval-augmented generation of large language models. arXiv preprint arXiv:2402.07867. Cited by: [§I](https://arxiv.org/html/2606.27736#S1.p1.1), [§II](https://arxiv.org/html/2606.27736#S2.p1.1), [§IV-A](https://arxiv.org/html/2606.27736#S4.SS1.p1.1).