Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection


Summary

RADAR introduces a role-anchored multi-agent debate framework where Politician and Scientist agents adversarially reason over evidence to detect misleading half-truths, outperforming baselines on omission-aware fact verification.


# Debating the Unspoken: Role-Anchored Multi-Agent Reasoning for Half-Truth Detection
Source: [https://arxiv.org/html/2604.19005](https://arxiv.org/html/2604.19005)
Yixuan Tang¹, Yirui Zhang¹, Hang Feng², Anthony K.H. Tung¹ — ¹National University of Singapore, ²Shanghai Jiao Tong University. yixuan@comp.nus.edu.sg, yirui-z@comp.nus.edu.sg, fhhh0826@sjtu.edu.cn, atung@comp.nus.edu.sg

###### Abstract

Half-truths, claims that are factually correct yet misleading due to omitted context, remain a blind spot for fact verification systems focused on explicit falsehoods. Addressing such omission-based manipulation requires reasoning not only about what is said, but also about what is left unsaid. We propose RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under realistic, noisy retrieval. RADAR assigns complementary roles to a *Politician* and a *Scientist*, who reason adversarially over shared retrieved evidence, moderated by a neutral *Judge*. A dual-threshold early termination controller adaptively decides when sufficient reasoning has been reached to issue a verdict. Experiments show that RADAR consistently outperforms strong single- and multi-agent baselines across datasets and backbones, improving omission detection accuracy while reducing reasoning cost. These results demonstrate that role-anchored, retrieval-grounded debate with adaptive control is an effective and scalable framework for uncovering missing context in fact verification. The code is available at [https://github.com/tangyixuan/RADAR](https://github.com/tangyixuan/RADAR).


## 1 Introduction

![Refer to caption](https://arxiv.org/html/2604.19005v1/fig/intro.png)

Figure 1: Overview of fact verification paradigms: (a) encoder-based classifiers over claim–evidence pairs, (b) prompt-guided LLMs, (c) multi-agent debate with fixed pro/con roles and internal knowledge, and (d) our proposed RADAR, which performs omission-aware reasoning through evidence-grounded role debate.

The rapid spread of online misinformation has intensified the demand for scalable and reliable fact verification systems Achiam et al. ([2023](https://arxiv.org/html/2604.19005#bib.bib33)); Hazra and Majumder ([2024](https://arxiv.org/html/2604.19005#bib.bib39)); Estornell et al. ([2020](https://arxiv.org/html/2604.19005#bib.bib43)). While large language models (LLMs) have made notable progress in detecting explicit falsehoods, a subtler and more pervasive form of manipulation remains difficult to identify: the *half-truth*. Half-truths are claims that are factually correct in isolation but misleading due to omitted context. For instance, stating that a politician reduced national debt by 15% may be accurate, yet misleading if it conceals that the reduction followed a prior 20% increase during the same term. Such selective framing distorts interpretation without fabricating facts, posing unique challenges for both human and automated verifiers Singamsetty et al. ([2023](https://arxiv.org/html/2604.19005#bib.bib35)); Bhattacharyya ([2025](https://arxiv.org/html/2604.19005#bib.bib37)); Sandeep and Bhattacharyya ([2023](https://arxiv.org/html/2604.19005#bib.bib38)).

Detecting half-truths differs fundamentally from standard fact verification. Rather than identifying explicit contradictions, omission-aware verification must determine whether the available evidence is insufficient to justify a claim’s implied conclusion, requiring models to reason over latent intermediate assumptions rather than surface support signals [Tang et al.](https://arxiv.org/html/2604.19005#bib.bib41). This challenge is exacerbated under realistic retrieval settings, where evidence is noisy and incomplete. Even when retrieved passages appear supportive, they may omit critical background information, causing systems to misclassify omission-based claims as true.

Most existing fact verification pipelines are not designed for this setting. As shown in Figure [1](https://arxiv.org/html/2604.19005#S1.F1), single-agent approaches, including encoder-based classifiers and instruction-tuned LLMs Liu et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib21)), perform one-pass reasoning over retrieved evidence and are prone to error propagation when key context is missing. Multi-agent debate (MAD) frameworks Kundu et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib27)); Han et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib26)) introduce reasoning diversity through argument exchange, but their fixed pro/con roles and focus on explicit contradiction make them ill-suited for omission reasoning, where the core issue is missing context rather than opposing claims. As a result, both single-agent and conventional MAD approaches struggle to reliably detect half-truths when evidence is only partially supportive. Recent work, TRACER Tang et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib40)), has begun to explicitly study omission-aware verification. However, it assumes access to gold evidence annotation and operates as a single-agent pipeline without adaptive reasoning, limiting its applicability under realistic retrieval conditions.

In this paper, we propose RADAR (Role-Anchored multi-agent Debate for hAlf-tRuth reasoning), a framework for omission-aware fact verification under incomplete retrieval. Unlike conventional debate systems with fixed pro/con stances or reliance on internal knowledge, RADAR formulates verification as a structured debate among agents with realistic expertise roles, all constrained to reason over retrieved evidence. A *Politician* agent constructs a coherent narrative supporting the claim based on available evidence, while a *Scientist* agent probes for missing, weak, or selectively presented context. Their interaction is moderated by a neutral *Judge*, which synthesizes the debate and issues the final verdict.

To enhance efficiency, RADAR incorporates a dual\-threshold early termination controller that monitors both the judge’s stop margin and verdict confidence\. The debate concludes automatically once sufficient information has been exchanged, preventing redundant interactions while preserving reasoning depth\. Through this adaptive mechanism, RADAR balances interpretability, accuracy, and computational cost, making omission reasoning more scalable for real\-world verification\.

Our contributions are as follows:

- We address omission-based half-truth detection under a realistic retrieval setting, highlighting the challenges of identifying missing context from incomplete evidence.
- We propose RADAR, a role-anchored and retrieval-grounded multi-agent reasoning framework that surfaces missing context through complementary debate, with adaptive early stopping for efficiency.
- We demonstrate across datasets and backbone LLMs that RADAR consistently improves both accuracy and efficiency over strong single-agent and multi-agent baselines.

## 2 Related Work

### 2.1 Fact Verification and Half-Truths

Fact verification is commonly formulated as determining whether a claim is supported or refuted given retrieved evidence Thorne et al. ([2018](https://arxiv.org/html/2604.19005#bib.bib30)); Alam et al. ([2021](https://arxiv.org/html/2604.19005#bib.bib13)); Zhang and Gao ([2023](https://arxiv.org/html/2604.19005#bib.bib12)); Wu et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib18)), with numerous systems exploring retrieval, reasoning, and explanation Chen et al. ([2022a](https://arxiv.org/html/2604.19005#bib.bib20)); Sriram et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib16)); Yue et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib19)); Wang et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib17)). Climinator Leippold et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib1)) studies LLM-based fact verification in the climate domain, using prompting and model internal knowledge to assess claim support. ClaimDecomp Chen et al. ([2022b](https://arxiv.org/html/2604.19005#bib.bib31)) decomposes claims into sub-questions, while FIRE Xie et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib23)) uses iterative verify-or-search loops to reformulate queries upon failure. While effective for explicit misinformation, these approaches largely assume that misleading claims manifest as direct contradictions with evidence.

Half-truths violate this assumption. Claims may be factually correct with respect to retrieved evidence yet still be misleading due to omitted context. TRACER Tang et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib40)) is the first framework to explicitly model this phenomenon through the concept of *critical hidden evidence*. However, TRACER assumes access to gold evidence annotations and follows a single-agent, static reasoning pipeline. In contrast, our work addresses omission detection under noisy retrieval and performs reasoning through structured interaction rather than one-pass inference.

![Refer to caption](https://arxiv.org/html/2604.19005v1/fig/framework.png)

Figure 2: Overview of the RADAR framework for omission-based half-truth detection. The system conducts structured multi-agent debate between expertise-grounded roles over retrieved evidence, equipped with an adaptive early termination controller to uncover missing yet critical context efficiently.
### 2.2 Multi-Agent Debate for Reasoning

Multi-agent debate (MAD) has recently emerged as a paradigm for enhancing reasoning and factuality in LLMs Du et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib24)); Wei et al. ([2022](https://arxiv.org/html/2604.19005#bib.bib25)); Liang et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib29)); Guo et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib15)); Tillmann ([2025](https://arxiv.org/html/2604.19005#bib.bib22)); Liu et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib21)). Debate-based frameworks such as D2D Han et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib26)) and TruEDebate (TED) Kundu et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib27)) employ multi-stage dialogue between a proponent, skeptic, and judge to reach veracity judgments for fake news. However, most MAD systems are designed for contradiction-oriented verification, where agents argue for or against a claim. Omission-based half-truths differ fundamentally: the key issue may be missing context rather than explicit conflict. RADAR therefore replaces polarity-based debate with complementary reasoning roles and grounds all interaction in retrieved evidence, enabling agents to surface what is absent rather than only dispute what is stated.

### 2.3 Bias by Omission in News

Ehrhardt et al. ([2021](https://arxiv.org/html/2604.19005#bib.bib2)) detect political slant by analyzing entity co-occurrence across outlets, and Zhukova et al. ([2023](https://arxiv.org/html/2604.19005#bib.bib3)) model bias by commission, omission, and source selection through the COSS framework. Other works explore rhetorical manipulation and propaganda framing Martino et al. ([2019](https://arxiv.org/html/2604.19005#bib.bib4)); Akhtar et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib5)); Piskorski et al. ([2023](https://arxiv.org/html/2604.19005#bib.bib6)), or examine linguistic cues distinguishing factual from deceptive reporting Rashkin et al. ([2017](https://arxiv.org/html/2604.19005#bib.bib7)). While these studies highlight omission as a key mechanism of bias, they primarily analyze it descriptively. Our work instead formulates omission as a reasoning and verification task, aiming to detect when incomplete context leads to misleading claims.

## 3 RADAR Framework

### 3.1 Framework Overview

We present RADAR (Role-Anchored multi-agent Debate for hAlf-tRuth reasoning), a multi-agent framework for detecting omission-based half-truths under realistic retrieval conditions. Unlike prior systems that rely on single-pass classification or assume gold evidence, RADAR performs deliberative reasoning among role-specialized agents who exchange arguments grounded in retrieved evidence. As illustrated in Figure [2](https://arxiv.org/html/2604.19005#S2.F2), the framework operates in two stages: (1) *Context Construction under Noisy Retrieval*, which builds a shared evidence pool without assuming completeness, and (2) *Role-Grounded Multi-Round Debate*, where agents reason adversarially to identify missing context and assess whether the claim is misleading.

Given an input claim *c*, RADAR first constructs an evidence pool *E* by retrieving the top-ranked text segments for the input claim from a large background corpus. The retrieved set inevitably contains irrelevant or incomplete content, thereby simulating the uncertainty of real-world fact-checking scenarios. Three agents participate in the subsequent debate: a *Politician*, a *Scientist*, and a neutral *Judge*. The Politician agent is designed to construct a persuasive narrative from the available evidence, representing an advocacy-oriented reasoning style. In contrast, the Scientist agent critically examines the same evidence for missing, weak, or selectively presented support, representing an analytical reasoning style. Although the role names are inspired by political discourse, their functional behaviors are domain-agnostic and intended to capture complementary reasoning tendencies. The Judge moderates the discussion, monitors sufficiency, and issues the final verdict. This role design transforms verification from stance competition into omission-focused reasoning: one agent constructs the strongest evidence-backed narrative, while the other probes for missing or underrepresented context. Combined with shared evidence grounding and adaptive stopping, RADAR enables structured reasoning under incomplete retrieval.

Table 1: Roles and reasoning styles (persuasive vs. analytical) in RADAR.
### 3.2 Stage I: Context Construction under Noisy Retrieval

Omission\-based misinformation often remains convincing because relevant facts are selectively presented while critical context is absent\. To evaluate claims under realistic conditions, RADAR first constructs a retrieval\-grounded evidence pool that is informative yet imperfect\. Rather than assuming complete supporting evidence, the system requires downstream reasoning to operate under uncertainty, where retrieved information may be partial, noisy, or misleading\.

##### Evidence Retrieval\.

Given a claim *c*, RADAR retrieves the top-*m* relevant snippets from a background corpus to form an evidence pool E = Retrieve(c). Because retrieval is inherently imperfect, *E* may contain irrelevant passages, fragmented facts, or only partially supportive evidence. This setting better reflects real-world fact verification, where systems must reason from incomplete retrieved context. It differs from prior omission-aware work such as TRACER Tang et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib40)), which assumes access to gold evidence, and from debate frameworks such as TED Kundu et al. ([2024](https://arxiv.org/html/2604.19005#bib.bib27)) or D2D Han et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib26)), which rely primarily on internal model knowledge.

##### Shared Grounding\.

The same evidence pool *E* is provided to all agents as a common factual basis. During debate, agents are required to justify their arguments using retrieved evidence, which improves transparency and reduces unsupported claims. Since all agents observe identical but incomplete evidence, differences in their conclusions arise from reasoning rather than information asymmetry.
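Stage I can be sketched end to end as follows. This is a minimal illustration, not the paper's implementation: `embed` stands in for a dense encoder such as bge-base-en-v1.5, using bag-of-words counts purely so the sketch runs without model weights.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a dense encoder; a real system would return a neural
    # sentence embedding. Bag-of-words counts are used here for illustration.
    return Counter(text.lower().split())

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    dot = sum(u[t] * v[t] for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(claim, corpus, m=3):
    """Stage I: build the shared evidence pool E from the top-m passages."""
    q = embed(claim)
    ranked = sorted(corpus, key=lambda d: cosine(embed(d), q), reverse=True)
    return ranked[:m]

corpus = [
    "The politician reduced national debt by 15% in 2021.",
    "National debt rose by 20% between 2018 and 2020.",
    "The city opened a new stadium last year.",
]
E = retrieve("The politician reduced national debt by 15%", corpus, m=2)
# Every agent receives the same pool E; their conclusions then differ
# through reasoning over E, not through information asymmetry.
```

Note that the pool deliberately includes a weakly related passage: downstream agents must cope with noise rather than assume every retrieved snippet is relevant.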

### 3.3 Stage II: Role-Grounded Multi-Round Debate

Once the shared context is established, RADAR conducts a structured debate in which the Politician and Scientist iteratively exchange arguments while the Judge moderates and evaluates progress. All agents operate on the same evidence pool *E* and follow a fixed debate protocol.

##### Debate Protocol\.

Each debate instance begins with an *opening round*, where the Politician presents an evidence-backed narrative supporting the claim and the Scientist highlights potential omissions or contradictory details. In the *rebuttal round*, both agents are tasked with challenging the opponent’s argument by identifying potential flaws or weaknesses. Finally, in the *closing round*, both summarize their stances, after which the Judge synthesizes the debate transcript and evidence to assign one of three labels: *true*, *half-true*, or *false*.
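The three-round protocol can be sketched as a simple loop. Everything here is illustrative: `ask` is a hypothetical placeholder for a prompted LLM call, and the role prompts are elided.

```python
# Sketch of the fixed debate protocol. `ask(role, prompt)` is a hypothetical
# wrapper around an LLM call; round and role names follow the paper.
ROUNDS = ["opening", "rebuttal", "closing"]

def ask(role, prompt):
    # Placeholder LLM call; a real system would prompt the backbone model.
    return f"[{role} {prompt.split()[0]}] ..."

def debate(claim, evidence):
    transcript = []
    context = f"Claim: {claim}\nEvidence: {' | '.join(evidence)}"
    for rnd in ROUNDS:
        # The Politician argues for the claim from the shared evidence;
        # the Scientist probes for missing or selectively framed context.
        transcript.append((rnd, "Politician", ask("Politician", f"{rnd} {context}")))
        transcript.append((rnd, "Scientist", ask("Scientist", f"{rnd} {context}")))
    # The neutral Judge synthesizes the transcript into one of three labels.
    verdict = ask("Judge", "verdict " + context + "\n" +
                  "\n".join(t[2] for t in transcript))
    return transcript, verdict

transcript, verdict = debate("Debt fell 15%.", ["Debt rose 20% earlier."])
```

In the full system, the early-termination controller described below would check after each round whether the loop can stop before the closing round.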

##### Adaptive Early Stopping\.

To balance reasoning thoroughness and efficiency, RADAR introduces a dual-threshold *early termination controller* that determines whether to continue or stop after each round. At the end of every round, the Judge model evaluates the accumulated dialogue and computes the *stop margin* as

s = p(STOP) − p(CONTINUE),

where p(·) denotes the softmax-normalized probability of the logits for the corresponding decision token. The debate proceeds to the next round if s < τ_s; otherwise, termination is additionally gated on the verdict confidence defined next.

To further prevent premature stopping on uncertain or ambiguous cases (especially half-truths), we combine this signal with the model’s veracity confidence over the final labels *True*, *False*, and *Half-True*. Let

c = max_{y ∈ {True, False, Half-True}} p(y)

denote the maximum confidence among the label probabilities. The debate is terminated early only when s ≥ τ_s and c ≥ τ_v. Both thresholds τ_s and τ_v are calibrated on a development set, using cached decision scores from a single inference pass, to optimize final classification accuracy while minimizing unnecessary debate rounds. This adaptive control balances reasoning depth with computational cost, allowing RADAR to maintain accuracy while shortening redundant interactions.
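A minimal sketch of the dual-threshold controller, assuming the Judge exposes raw logits for the decision and label tokens; the threshold values below are illustrative, not the calibrated ones.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a dict of token logits.
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: v / z for k, v in exps.items()}

def should_stop(decision_logits, label_logits, tau_s=0.2, tau_v=0.6):
    """Dual-threshold early termination: stop only when the stop margin
    s = p(STOP) - p(CONTINUE) clears tau_s AND the verdict confidence
    c = max_y p(y) clears tau_v. Threshold values here are illustrative."""
    p = softmax(decision_logits)
    s = p["STOP"] - p["CONTINUE"]
    c = max(softmax(label_logits).values())
    return s >= tau_s and c >= tau_v

# Confident judge: high stop margin and a dominant label -> terminate.
stop = should_stop({"STOP": 2.0, "CONTINUE": 0.0},
                   {"True": 0.1, "False": 0.2, "Half-True": 2.5})
# Ambiguous half-truth: stop margin is high but verdict confidence is
# spread across labels, so the controller keeps debating.
cont = should_stop({"STOP": 2.0, "CONTINUE": 0.0},
                   {"True": 1.0, "False": 1.0, "Half-True": 1.1})
```

The second case is exactly the situation the confidence gate is designed for: without it, a decisive stop margin alone would cut off debate on an uncertain half-truth.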

##### Structured Verdict and Interpretability\.

At the end of the debate, the Judge produces a structured output containing both the label and justification:

[REASON]: Concise explanation citing evidence and debate exchanges.

[VERDICT]: TRUE / FALSE / HALF-TRUE

The inclusion of debate transcripts and evidence citations provides transparent reasoning paths that expose which contextual gaps or omissions influenced the final decision\. This interpretability distinguishes RADAR from opaque end\-to\-end verifiers, enabling human fact\-checkers to trace and validate each reasoning step\.
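For illustration, the structured output can be parsed mechanically; this sketch assumes the Judge emits exactly the two-field template shown above.

```python
import re

def parse_verdict(judge_output):
    """Extract the [REASON] and [VERDICT] fields from the Judge's
    structured output, following the template in the paper."""
    reason = re.search(r"\[REASON\]:\s*(.+?)(?=\n\[VERDICT\]|$)",
                       judge_output, re.S)
    verdict = re.search(r"\[VERDICT\]:\s*(TRUE|FALSE|HALF-TRUE)", judge_output)
    if not (reason and verdict):
        raise ValueError("malformed judge output")
    return reason.group(1).strip(), verdict.group(1)

# Hypothetical judge output for the medal-sale example discussed later.
out = ("[REASON]: The sale is real, but the evidence ties it to 2012, "
       "not the 2022 war, so the implied timing misleads.\n"
       "[VERDICT]: HALF-TRUE")
reason, label = parse_verdict(out)
```

A downstream fact-checking interface could surface `reason` alongside the cited evidence, which is what makes the verdict auditable.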

By integrating retrieval-grounded context, expertise-based roles, and adaptive termination, RADAR transforms omission detection from static classification into a dynamic reasoning process. All prompt templates and implementation details are presented in Appendix [A](https://arxiv.org/html/2604.19005#A1).

## 4 Experiments

### 4.1 Experimental Setup

##### Dataset\.

We evaluate RADAR on the PolitiFact-Hidden benchmark Tang et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib40)), which contains ~15k political claims with gold evidence and three labels (*true*, *half-true*, *false*), focusing on omission-based reasoning rather than explicit contradiction. The evidence corpus includes 41,952 documents, and we follow the original train/dev/test split (11,994/1,000/2,000). As RADAR is unsupervised, thresholds are tuned on the dev set and results are reported on the test set.

| Model | Acc. | F1_mc | F1_T | F1_HT | F1_F |
|---|---|---|---|---|---|
| *Full Evidence Setting* | | | | | |
| CoT | 76.3 | 64.3 | 52.9 | 52.8 | 87.1 |
| CoT + RA | 78.5 | 68.0 | 56.7 | 60.2 | 87.1 |
| HiSS | 74.7 | 61.9 | 48.1 | 51.2 | 86.4 |
| HiSS + RA | 77.8 | 65.7 | 49.1 | 61.5 | 86.5 |
| RADAR_single | 68.7 | 62.4 | 57.6 | 51.1 | 78.6 |
| RADAR_multi | 83.6 | 70.8 | 59.3 | 61.1 | 92.0 |
| *Retrieved Evidence Setting* | | | | | |
| FIRE | 60.3 | 46.9 | 30.7 | 34.1 | 75.9 |
| D2D | 63.0 | 50.9 | 36.3 | 39.7 | 76.5 |
| RADAR_single | 58.4 | 51.0 | 41.2 | 41.5 | 70.2 |
| RADAR_multi | 77.7 | 63.3 | 45.6 | 56.5 | 87.6 |
Table 2: Main results under full and retrieved evidence settings using GPT-4o-mini. “+RA” denotes TRACER-enhanced baseline methods. RADAR_single and RADAR_multi share identical architectures and retrieval setups, differing only in whether multi-agent debate is used.

Table 3: Generalization results of strong baselines and RADAR variants across different backbone LLMs under the retrieved evidence setting.
Table 4: Performance comparison of baseline methods and RADAR_multi on AVERITEC.
To assess generalization, we further evaluate RADAR on AVERITEC Schlichtkrull et al. ([2023](https://arxiv.org/html/2604.19005#bib.bib44)). AVERITEC decomposes real-world claim verification into multi-step question-answer pairs retrieved from the open web, utilizing textual justifications to explain how diverse evidence aggregates into a final verdict. It features a “Conflicting Evidence/Cherry-picking” label to address “half-true” claims that are technically accurate but misleading due to omitted context.

##### Settings\.

We evaluate under two conditions. In the *full evidence* setting, models receive all gold evidence associated with each claim. In the *retrieved evidence* setting, we simulate realistic verification by retrieving the top-20 passages per claim from the corpus using a dense retriever (bge-base-en-v1.5). This introduces noise and incompleteness, testing whether RADAR’s debate mechanism can recover missing context when retrieval is imperfect. Unless otherwise specified, the same role prompts and debate protocol are used across all datasets and backbone models, with no dataset-specific prompt tuning or role adaptation.

To isolate the contribution of multi-agent debate, we additionally include a single-agent variant that uses the same architecture, prompts, retrieval setup, and inference procedure, but without inter-agent interaction. We refer to this variant as RADAR_single, and denote the full multi-agent system as RADAR_multi.

##### Baselines\.

We compare against representative fact verification and debate\-based methods:

- CoT Wei et al. ([2022](https://arxiv.org/html/2604.19005#bib.bib25)): Chain-of-thought prompting over provided evidence.
- HiSS Zhang and Gao ([2023](https://arxiv.org/html/2604.19005#bib.bib12)): Hierarchical structured prompting with intermediate reasoning steps.
- TRACER Tang et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib40)): An omission-aware verification framework that assumes access to complete gold evidence, including hidden context required for intent and assumption inference.
- FIRE Xie et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib23)): An iterative retrieve-then-verify framework alternating between retrieval and validation.
- D2D Han et al. ([2025](https://arxiv.org/html/2604.19005#bib.bib26)): A multi-agent debate model for news classification using fixed pro/con roles.

For all baselines, we follow the experimental settings recommended in their original papers to reflect their intended usage\. For prompting\-based methods such as CoT, we use a standard task instruction with step\-by\-step reasoning\. For framework\-specific baselines such as FIRE and D2D, we preserve their original prompting protocols\. Meanwhile, all shared components, including the backbone models, retrieval pipeline, and evidence pool, are kept identical across methods\. This controlled setup helps isolate the effect of reasoning frameworks rather than differences in external resources or implementation details\.

##### Evaluation Metrics\.

We report accuracy and macro-F1 as overall metrics, along with per-class F1 scores (*true*, *half-true*, *false*) to analyze class-level behavior. Performance changes for RADAR_multi are measured relative to its single-agent variant. Each debate proceeds for up to three rounds, with the adaptive early stopping mechanism enabled.
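The reported metrics can be computed directly from gold and predicted labels; the following self-contained sketch implements standard per-class and macro-averaged F1.

```python
def per_class_f1(gold, pred, labels=("true", "half-true", "false")):
    """Per-class F1 and macro-F1 over three-way verdicts."""
    scores = {}
    for lab in labels:
        # Count true positives, false positives, and false negatives per class.
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[lab] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Macro-F1 weights each class equally, regardless of class frequency.
    macro = sum(scores.values()) / len(labels)
    return scores, macro

# Toy example: one half-true claim is misclassified as false.
gold = ["true", "half-true", "half-true", "false"]
pred = ["true", "half-true", "false", "false"]
scores, macro = per_class_f1(gold, pred)
```

Macro averaging matters here because half-true claims are the hardest and often the rarest class; accuracy alone would understate errors on them.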

### 4.2 Main Results

Table[4\.1](https://arxiv.org/html/2604.19005#S4.SS1.SSS0.Px1)presents the main comparison results\.

##### Performance with Gold Evidence\.

When complete gold evidence is provided, most baselines achieve competitive performance, reflecting that omission-aware verification becomes easier when all relevant context is available. Nevertheless, RADAR_multi attains the best overall performance, achieving 83.6 accuracy and 70.8 macro-F1, outperforming strong baselines such as CoT+RA and HiSS+RA by clear margins. Compared to RADAR_single, multi-agent debate yields substantial gains (+14.9 accuracy and +8.4 macro-F1), indicating that structured interaction between expertise-anchored agents improves reasoning consistency even when evidence is complete.

##### Performance with Retrieved Evidence\.

The retrieved-evidence setting is more realistic and challenging due to noisy and incomplete evidence. RADAR_multi remains robust, achieving 77.7 accuracy and 63.3 macro-F1, substantially outperforming all strong baselines. Relative to its single-agent counterpart, multi-agent debate brings large improvements (+19.3 accuracy and +12.3 macro-F1), demonstrating that RADAR’s structured interaction is particularly effective when critical evidence is missing or fragmented.

##### Per\-Class Performance Analysis\.

Per-class results show that performance gains are concentrated in the *half-true* and *false* categories, where omission reasoning is most critical. Under retrieved evidence, RADAR_multi improves *half-true* F1 by 15.0 and *false* F1 by 17.4, while also maintaining stronger performance on the *true* class. The pronounced gap between RADAR_single and RADAR_multi on these classes suggests that debate helps scrutinize selectively framed narratives and surface omitted context beyond what single-pass reasoning can achieve. To further examine model behavior near class boundaries, especially the subtle distinction between *True* and *Half-True*, we complement the per-class F1 scores in Table [3](https://arxiv.org/html/2604.19005#S4.T3) with confusion matrices and class-wise precision/recall analyses in Appendix [B](https://arxiv.org/html/2604.19005#A2).

##### Generalization Across Backbone LLMs\.

Table [3](https://arxiv.org/html/2604.19005#S4.T3) reports results across three backbone LLMs. RADAR_multi consistently achieves strong performance across GPT-4o-mini, LLaMA3-8B-Instruct, and Qwen2.5-7B-Instruct. While absolute performance varies across backbones, the relative improvements over strong baselines and the single-agent variant remain stable, indicating that RADAR’s debate mechanism generalizes across architectures.

##### Generalization Across Fact Verification Datasets\.

Table [4](https://arxiv.org/html/2604.19005#S4.T4) reports results on the AVERITEC benchmark under the retrieved-evidence setting. According to the table, RADAR_multi substantially outperforms both FIRE and D2D, with consistent gains across all classes. Overall, these results show that RADAR is not limited to political claims. The same role prompts transfer effectively to AVERITEC without task-specific redesign, suggesting that the gains arise from complementary reasoning dynamics rather than domain-specific priors.

| Model | Acc. | F1 | # Avg. tokens |
|---|---|---|---|
| *LLaMA3-8B-Instruct* | | | |
| D2D | 64.1 | 45.1 | 2483.3 |
| FIRE | 50.4 | 43.4 | 754.5 (30.4%) |
| RADAR (3 rounds) | 77.2 | 62.3 | 1834.7 (73.9%) |
| RADAR (early stop) | 77.3 | 63.2 | 1551.4 (62.5%) |
| *Qwen2.5-7B-Instruct* | | | |
| D2D | 69.0 | 48.5 | 3686.9 |
| FIRE | 63.9 | 46.9 | 1026.2 (27.8%) |
| RADAR (3 rounds) | 76.5 | 62.6 | 1751.5 (47.5%) |
| RADAR (early stop) | 76.7 | 63.1 | 1723.7 (46.8%) |
Table 5: Effect of adaptive early stopping on performance and efficiency.

Table 6: Thresholds and number of instances stopped at each round for RADAR’s early-stopping controller.
**Single Agent** (Verdict: True)
Klitschko sold his medal in 2012. Funds supported his foundation. The buyer returned it. No evidence disputes the claim.

**Position-Based (Advocate vs. Critic)** (Verdict: True)
Pro Opening: Klitschko sold his medal in 2012. Proceeds went to help Ukrainian children.
Con Opening: The sale happened in 2012. It’s not tied to the 2022 war.
Pro Rebuttal: The act still helped children and shows lasting support.
Con Rebuttal: The timing and intent are unrelated to the current crisis.
Pro Closing: The sale reflects long-term commitment.
Con Closing: The event came before the war. Linking them misleads.
Judge: TRUE. The claim is factually accurate.

**Expertise-Based (Politician vs. Scientist)** (Verdict: Half-True)
Politician Opening: Klitschko auctioned his medal in 2012 to help children.
Scientist Opening: The sale was real but not linked to the 2022 war.
Politician Rebuttal: The act still shows support for Ukrainian children.
Scientist Rebuttal: The claim misrepresents time and cause.
Politician Closing: The act is real, but how it’s framed changes how people remember it.
Scientist Closing: It was not a wartime gesture. Context is key.
Judge: HALF-TRUE. The sale was charitable, but the implied timing misleads.

Table 7: Reasoning process for a temporally misleading claim across different agent setups. Expertise-based agents flag the omission by combining intent and factual analysis.
## 5 Discussions

![Refer to caption](https://arxiv.org/html/2604.19005v1/fig/evidence_comparison_large.png)

Figure 3: Performance comparison of different agent role configurations under full and retrieved evidence settings.

![Refer to caption](https://arxiv.org/html/2604.19005v1/fig/debate_rounds3.png) (a) Retrieved evidence
![Refer to caption](https://arxiv.org/html/2604.19005v1/fig/debate_rounds_full3.png) (b) Full evidence

Figure 4: Effect of varying the maximum number of debate rounds using LLaMA3-8B-Instruct.

Table 8: Performance with different numbers of agents.

### 5.1 Impact of Agent Role Configurations

To assess how role design affects reasoning quality, we compare two configurations: the proposed *expertise-based* pairing (Politician vs. Scientist) and the conventional *position-based* pairing (Advocate vs. Critic). As shown in Figure [3](https://arxiv.org/html/2604.19005#S5.F3), expertise-based agents consistently outperform position-based ones across all settings, with the largest gains observed under noisy retrieval (62.3 vs. 56.4 macro-F1). These results suggest that domain-aligned personas encourage contextually grounded and rhetorically realistic reasoning, exposing omissions that generic pro/con roles often overlook.

### 5.2 Effect of Early Stopping on Efficiency

Table [4.2](https://arxiv.org/html/2604.19005#S4.SS2.SSS0.Px5) compares RADAR with adaptive early stopping against the other debating baselines, D2D and FIRE, and the fixed three-round variant. Across backbones, early stopping preserves or slightly improves performance while reducing generation cost.

On LLaMA3-8B-Instruct, early stopping achieves comparable accuracy (77.3 vs. 77.2) and higher macro-F1 (63.2 vs. 62.3) than the three-round setting, while reducing average token usage by about 15%. On Qwen2.5-7B-Instruct, it similarly yields small performance gains with nearly identical token usage. In both cases, RADAR substantially outperforms D2D in accuracy and macro-F1.

As shown in Table [6](https://arxiv.org/html/2604.19005#S4.T6), many instances terminate after one or two rounds, indicating that the Judge adaptively stops debate once sufficient reasoning confidence is reached, while harder cases proceed to later rounds. Overall, adaptive early stopping enables efficient multi-agent debate by allocating reasoning effort in proportion to claim difficulty.

Although the optimal thresholds vary across backbones, calibration is lightweight. Each model requires only a single inference pass on the development set, after which threshold search is performed offline using cached decision scores. Additional sensitivity analysis in Appendix [C](https://arxiv.org/html/2604.19005#A3) shows that performance remains stable across broad threshold ranges, indicating that RADAR does not rely on fragile hyperparameter choices.
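The offline calibration described above can be sketched in a few lines: one dev-set pass produces cached per-round decision scores, and the two thresholds are then grid-searched without further LLM calls. The field names (`stop_score`, `judge_conf`), the rule that both thresholds must be cleared, and the tie-breaking on average rounds are illustrative assumptions, not the paper's exact controller.

```python
from itertools import product

def simulate_early_stop(rounds, t_stop, t_conf):
    # Stop at the first round where BOTH cached signals clear their
    # thresholds (hypothetical dual-threshold rule); otherwise run
    # the full debate and keep the final verdict.
    for used, r in enumerate(rounds, start=1):
        if r["stop_score"] >= t_stop and r["judge_conf"] >= t_conf:
            return r["verdict"], used
    return rounds[-1]["verdict"], len(rounds)

def search_thresholds(cached, labels, grid=(0.5, 0.6, 0.7, 0.8, 0.9)):
    # Offline grid search over cached per-round scores: pick the pair
    # maximizing dev accuracy and, on ties, minimizing average rounds
    # (a proxy for generation cost).
    best_key, best_pair = None, None
    for t_stop, t_conf in product(grid, repeat=2):
        results = [simulate_early_stop(r, t_stop, t_conf) for r in cached]
        acc = sum(v == y for (v, _), y in zip(results, labels)) / len(labels)
        cost = sum(n for _, n in results) / len(results)
        key = (acc, -cost)
        if best_key is None or key > best_key:
            best_key, best_pair = key, (t_stop, t_conf)
    return best_pair, best_key[0], -best_key[1]
```

Because the search only replays cached scores, trying a new grid costs no extra inference.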

### 5.3 Effect of Maximum Debate Rounds

We further investigate how the maximum number of debate rounds affects performance on a sampled test set of 150 examples (50 per class) using LLaMA3-8B-Instruct. As shown in Figure 4, the impact of additional rounds depends on the evidence setting rather than exhibiting a universal monotonic trend.

Under retrieved evidence, performance improves from one to three rounds, suggesting that extra interaction can help agents challenge partial evidence and uncover omitted context\. Beyond this point, gains become marginal\. In contrast, under full evidence, the effect of additional rounds is less consistent, since most relevant context is already available and further exchanges may introduce redundant reasoning\.

These observations indicate that more rounds are not always better, and the appropriate debate depth depends on task difficulty and evidence quality\. This further motivates RADAR’s adaptive early\-stopping mechanism, which allocates reasoning rounds dynamically instead of relying on a fixed budget\.

### 5.4 Effect of Number of Agents

To assess the impact of scaling up the number of agents, we vary the number of debating agents on the same 150-example set with a fixed backbone (LLaMA3-8B-Instruct) and expertise-based roles. As shown in Table [8](https://arxiv.org/html/2604.19005#S5.T8), performance peaks with two agents (64.0 accuracy, 62.8 macro-F1), confirming that the Politician–Scientist pairing provides complementary yet focused reasoning. Adding more agents yields diminishing returns and a lower F1 on the True class (dropping to 30.0 with three agents), as additional perspectives often introduce redundant or conflicting arguments. A minimal yet diverse pair thus offers efficient and stable omission reasoning.

### 5.5 Sensitivity to Retrieval Quality

Table 9: Sensitivity to retrieval quality by varying the number of retrieved passages. RADAR improves with stronger retrieval while remaining robust under limited evidence.

We further examine how RADAR behaves under different retrieval budgets by varying the number of retrieved passages (Top-n) on the sampled test set. As shown in Table [9](https://arxiv.org/html/2604.19005#S5.T9), performance improves steadily as more evidence becomes available, confirming that stronger retrieval benefits omission-aware verification.

Notably, degradation under limited retrieval is gradual rather than catastrophic. Even with Top-5 retrieval, RADAR achieves 60.5 macro-F1 and retains meaningful Half-True detection (54.8 F1), suggesting that the debate mechanism remains effective under partial evidence.

### 5.6 Case Study: Temporal Mis-framing

To illustrate how RADAR uncovers omission-based deception, we analyze the claim "Wladimir Klitschko auctioned his 1996 Olympic medal to raise money for the children of Ukraine." Although factually correct, the claim omits that the auction occurred in 2012, creating a misleading association with the 2022 war. As shown in Table [7](https://arxiv.org/html/2604.19005#S4.T7), both single-agent and position-based systems classify it as true, focusing on surface factual correctness. In contrast, the expertise-based agents highlight the temporal mismatch and rhetorical framing, producing a half-true verdict. This example demonstrates how domain-specialized roles capture both presented facts and implied context, enabling more nuanced judgment in omission-aware verification.

### 5.7 Failure Analysis

Table 10: Error analysis of 50 misclassified cases.

To better understand current limitations, we manually reviewed 50 randomly sampled misclassified cases from POLITIFACT-HIDDEN. Table [10](https://arxiv.org/html/2604.19005#S5.T10) summarizes the dominant error patterns.

The most common failures involve *claim constraint violations* (56%), where the model fails to strictly enforce explicit numerical, temporal, or scope constraints stated in the claim. Examples include overlooking date mismatches, percentage ranges, or restricted populations. The second major category is *semantic framing instability* (30%), where key terms are interpreted inconsistently across debate rounds, leading to unstable judgments. We also observe *causal attribution errors* (10%), where responsibility is incorrectly assigned by conflating local actions with broader institutional or national outcomes.

These findings suggest that further progress in omission\-aware verification requires stronger logical constraint tracking, more stable semantic grounding, and finer\-grained causal reasoning\.

## 6 Conclusion

We present RADAR, a role-anchored multi-agent debate framework for omission-aware fact verification under noisy retrieval. By grounding debate in retrieved evidence and assigning expertise-driven roles, RADAR effectively surfaces the missing context that makes claims misleading. An adaptive early-stopping controller enables efficient reasoning by allocating debate depth according to claim difficulty. Across datasets and backbone models, RADAR consistently outperforms existing baselines, particularly on half-true claims. This work underscores the importance of structured, evidence-grounded multi-agent reasoning for trustworthy fact verification.

## Limitations

While RADAR demonstrates consistent improvements across datasets and settings, several limitations remain. First, the multi-agent configuration may occasionally over-scrutinize factual claims: for statements labeled true, adversarial critique can raise minor counterpoints that shift the final decision toward half-true. This reflects a broader challenge in balancing omission sensitivity with factual precision. Second, real-world fact-checking often involves decentralized, evolving, and partially conflicting evidence sources rather than a fixed shared corpus, which current benchmarks do not fully capture. Third, multi-round debates naturally increase inference cost; while the early-stopping mechanism improves efficiency, it does not eliminate this overhead entirely. Future work will explore adaptive role calibration, open-world evidence retrieval, and more efficient debate strategies to further enhance scalability and reliability.

## References

- J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- M. M. Akhtar, R. Masood, M. Ikram, and S. S. Kanhere (2024). SoK: False information, bots and malicious campaigns: demystifying elements of social media manipulations. In AsiaCCS.
- F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. D. S. Martino, A. Abdelali, N. Durrani, K. Darwish, A. Al-Homaid, W. Zaghouani, T. Caselli, G. Danoe, F. Stolk, B. Bruntink, and P. Nakov (2021). Fighting the COVID-19 infodemic: modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society. In EMNLP (Findings), pp. 611–649.
- S. S. P. Bhattacharyya (2025). Half truth detection and mitigation: a survey.
- J. Chen, R. Zhang, J. Guo, Y. Fan, and X. Cheng (2022a). GERE: generative evidence retrieval for fact verification. In SIGIR, pp. 2184–2189.
- J. Chen, A. Sriram, E. Choi, and G. Durrett (2022b). Generating literal and implied subquestions to fact-check complex claims. In EMNLP, pp. 3495–3516. [Link](https://aclanthology.org/2022.emnlp-main.229/)
- Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch (2024). Improving factuality and reasoning in language models through multiagent debate. In ICML.
- J. Ehrhardt, T. Spinde, A. Vardasbi, and F. Hamborg (2021). Omission of information: identifying political slant via an analysis of co-occurring entities. In ISI, pp. 80–93.
- A. Estornell, S. Das, and Y. Vorobeychik (2020). Deception through half-truths. In AAAI, pp. 10110–10117.
- T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang (2024). Large language model based multi-agents: a survey of progress and challenges. In IJCAI, pp. 8048–8057.
- C. Han, W. Zheng, and X. Tang (2025). Debate-to-detect: reformulating misinformation detection as a real-world debate with large language models. CoRR abs/2505.18596.
- S. Hazra and B. P. Majumder (2024). To tell the truth: language of deception and language models. In NAACL, pp. 8506–8520.
- D. Kundu et al. (2024). TruEDebate: multi-agent structured debate for trustworthy claim verification. In ACL.
- M. Leippold, S. A. Vaghefi, D. Stammbach, V. Muccione, J. A. Bingler, J. Ni, C. C. Senni, T. Wekhof, T. Schimanski, G. Gostlow, T. Yu, J. Luterbacher, and C. Huggel (2024). Automated fact-checking of climate change claims with large language models. CoRR abs/2401.12566.
- T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu (2024). Encouraging divergent thinking in large language models through multi-agent debate. In EMNLP, pp. 17889–17904.
- Y. Liu, Y. Liu, X. Zhang, X. Chen, and R. Yan (2025). The truth becomes clearer through debate! Multi-agent systems with large language models unmask fake news. In SIGIR, pp. 504–514.
- G. D. S. Martino, S. Yu, A. Barrón-Cedeño, R. Petrov, and P. Nakov (2019). Fine-grained analysis of propaganda in news article. In EMNLP-IJCNLP, pp. 5635–5645.
- J. Piskorski, N. Stefanovitch, N. Nikolaidis, G. D. S. Martino, and P. Nakov (2023). Multilingual multifaceted understanding of online news in terms of genre, framing, and persuasion techniques. In ACL, pp. 3001–3022.
- H. Rashkin, E. Choi, J. Y. Jang, S. Volkova, and Y. Choi (2017). Truth of varying shades: analyzing language in fake news and political fact-checking. In EMNLP, pp. 2931–2937.
- S. Sandeep and P. Bhattacharyya (2023). Detecting and debunking fake news and half-truth: a survey.
- M. S. Schlichtkrull, Z. Guo, and A. Vlachos (2023). AVeriTeC: a dataset for real-world claim verification with evidence from the web. In NeurIPS.
- S. Singamsetty, N. Madaan, S. Mehta, V. Bhatnagar, and P. Bhattacharyya (2023). "Beware of deception": detecting half-truth and debunking it through controlled claim editing. CoRR abs/2308.07973.
- A. Sriram, F. Xu, E. Choi, and G. Durrett (2024). Contrastive learning to improve retrieval for real-world fact checking. CoRR abs/2410.04657.
- Y. Tang, H. T. Ng, and A. Tung (2021). Do multi-hop question answering systems know how to answer the single-hop sub-questions? In EACL, pp. 3244–3249. [Link](https://aclanthology.org/2021.eacl-main.283/)
- Y. Tang, J. Wang, and A. K. H. Tung (2025). The missing parts: augmenting fact verification with half truth detection. In EMNLP, pp. 33979–33996. [Link](https://aclanthology.org/2025.emnlp-main.1724/)
- J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal (2018). FEVER: a large-scale dataset for fact extraction and verification. In NAACL-HLT, pp. 809–819.
- A. Tillmann (2025). Literature review of multi-agent debate for problem-solving. CoRR abs/2506.00066.
- Y. Wang, H. Zhang, L. Pang, B. Guo, H. Zheng, and Z. Zheng (2025). MaFeRw: query rewriting with multi-aspect feedbacks for retrieval-augmented large language models. In AAAI, pp. 25434–25442.
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou (2022). Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS.
- Z. Wu, N. Sun, J. Jiang, and S. Yang (2025). Systematic approaches to fact verification: evidence retrieval, veracity prediction, and beyond. In PAKDD, pp. 271–289.
- Z. Xie, R. Xing, Y. Wang, J. Geng, H. Iqbal, D. Sahnan, I. Gurevych, and P. Nakov (2025). FIRE: fact-checking with iterative retrieval and verification. In NAACL (Findings), pp. 2901–2914.
- Z. Yue, H. Zeng, L. Shang, Y. Liu, Y. Zhang, and D. Wang (2024). Retrieval augmented fact verification by synthesizing contrastive arguments. In ACL, pp. 10331–10343.
- X. Zhang and W. Gao (2023). Towards LLM-based fact verification on news claims with a hierarchical step-by-step prompting method. In IJCNLP-AACL, pp. 996–1011.
- A. Zhukova, T. Ruas, F. Hamborg, K. Donnay, and B. Gipp (2023). What's in the news? Towards identification of bias by commission, omission, and source selection (COSS). In JCDL, pp. 258–259.

## Appendix A: Prompts for RADAR

To support reproducibility, we present the complete set of prompts used in our experiments. The prompts are grouped into multi-agent debate prompts and early-stopping prompts.

### A.1 Multi-Agent Debate

#### (a) Position-Based Agent Roles

In this setup, one agent *supports* the claim and the other *opposes* it. Each agent goes through three phases: *Opening* (present arguments), *Rebuttal* (respond over n rounds, n = 0, 1, 2, …), and *Closing* (summarize the position).

**Position-Based Debate — System Prompt**

Debater agent:
You are a critical thinker participating in a factual debate.

Judge agent:
You are a neutral judge who evaluates factual debates.

**Position-Based Debate — Agent A / B (Advocate / Critic)**

Opening Phase:
You [support / oppose] the following claim. Present your opening argument using the evidence.
==========
Claim: {claim}
Evidence: {evidence}
==========
Begin your argument with your position. [Highlight facts that support the claim as TRUE / Explain why the claim is FALSE or misleading], referencing specific points in the evidence.

Rebuttal Phase (n rounds for n = 0, 1, 2, …):
You are the [supporting / opposing] agent in a debate about the claim below. Your opponent has made an argument [against / supporting] the claim.
==========
Claim: {claim}
Evidence: {evidence}
Opponent's argument: {oppo_argument}
==========
Write your rebuttal, explaining why the opponent is wrong and [defending the claim / explaining why the claim is still FALSE or misleading].

Closing Phase:
You are the [supporting / opposing] agent in a debate. Summarize your final position.
==========
Claim: {claim}
Evidence: {evidence}
==========
Provide a closing statement reinforcing why the claim is [TRUE / FALSE or misleading].

**Position-Based Debate — Judge agent**

You are a neutral judge evaluating a factual debate. Based on the arguments and evidence, decide whether the claim is TRUE, FALSE, or HALF-TRUE.
==========
Claim: {claim}
Evidence: {evidence}
— Opening Statements —
Agent A: {advocate_open}
Agent B: {critic_open}
— Rebuttals —
Agent A: {advocate_rebuttal}
Agent B: {critic_rebuttal}
— Closing Statements —
Agent A: {advocate_close}
Agent B: {critic_close}
==========
Answer format:
[REASON]: your justification
[VERDICT]: TRUE / FALSE / HALF-TRUE
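Operationally, the three phases above form a simple orchestration loop. The sketch below assumes a generic `ask(role, prompt)` callable standing in for an LLM call and abbreviates the prompts; it is a minimal illustration of the phase structure, not the released implementation.

```python
def run_position_debate(claim, evidence, ask, n_rounds=1):
    """Opening -> Rebuttal (n rounds) -> Closing for Advocate/Critic,
    then a Judge verdict. `ask(role, prompt) -> str` is a placeholder
    for any LLM backend (hypothetical interface)."""
    t = {}  # transcript keyed like the Judge prompt's placeholders
    for side, stance in (("advocate", "support"), ("critic", "oppose")):
        t[f"{side}_open"] = ask(side,
            f"You {stance} the claim. Present your opening argument.\n"
            f"Claim: {claim}\nEvidence: {evidence}")
    for _ in range(n_rounds):
        for side, foe in (("advocate", "critic"), ("critic", "advocate")):
            # Rebut the opponent's latest statement (rebuttal if one
            # exists yet, otherwise the opening).
            latest = t.get(f"{foe}_rebuttal", t[f"{foe}_open"])
            t[f"{side}_rebuttal"] = ask(side,
                f"Rebut your opponent.\nClaim: {claim}\n"
                f"Opponent's argument: {latest}")
    for side in ("advocate", "critic"):
        t[f"{side}_close"] = ask(side,
            f"Summarize your final position.\nClaim: {claim}")
    debate = "\n".join(f"{k}: {v}" for k, v in t.items())
    verdict = ask("judge",
        f"Decide whether the claim is TRUE, FALSE, or HALF-TRUE.\n{debate}")
    return verdict, t
```

Swapping in the expertise-based roles only changes the role names and prompt text, not this control flow.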

#### (b) Expertise-Based Agent Roles

In the Expertise-Based Debate setup, agents are assigned distinct expert roles to reflect real-world perspectives in political discourse. The politician agent frames arguments through the lens of public opinion, electoral interests, and policy communication. Its responses prioritize political strategy, constituent concerns, and national priorities.

In contrast, the scientist agent serves as a fact-checker grounded in empirical evidence and methodological rigor. This agent emphasizes data quality, factual accuracy, and scientific limitations, avoiding political spin.

**Expertise-Based Debate — System Prompt (Politician agent)**

You are a politician in a political debate, skilled at persuasive public-facing reasoning. Your role is to construct the strongest evidence-supported narrative for the claim.
Key political characteristics:

- Focus on public opinion and voter concerns
- Emphasize policy benefits and political messaging
- Consider electoral implications and public perception
- Frame arguments in terms of public interest and national priorities
- Balance facts with political strategy and messaging

When debating, frame arguments in terms of public benefit, policy effectiveness, and political priorities. Keep responses concise and focused.

**Expertise-Based Debate — System Prompt (Scientist agent)**

You are a scientist in a political debate, skilled at analytical fact-checking. Your role is to identify missing context, weak support, or selective framing in the claim.
Key scientific characteristics:

- Rely strictly on empirical evidence and data
- Question claims that lack scientific support
- Emphasize methodology and research quality
- Focus on accuracy over political messaging
- Consider uncertainty and limitations in evidence

When debating, frame arguments in terms of scientific evidence, data analysis, and factual accuracy. Keep responses concise and focused.

==========

Judge agent:
You are a neutral judge in a political debate. Your role is to evaluate the evidence and arguments from both parties and make a clear decision.
When making your decision, consider:

- Is this claim supported by credible evidence?
- Does this align with common sense and universal human values?
- What would be the practical impact on families and communities?

Do not deviate from this format. Do not ask for more information. Make your decision based on what is provided.

For convenience, we use Green to denote the politician agent and Red for the scientist agent. However, we do not assign them predefined positions.

**Expertise-Based Debate — Agent A / B (Politician / Scientist)**

Opening Phase:
Evaluate the following claim. Based on your stance as a [politician / scientist], either support or oppose the claim. Present your opening argument using the evidence given.
==========
Claim: {claim}
Evidence: {evidence}
==========
Begin your argument with your position. Highlight facts that support your position.

Rebuttal Phase (n rounds for n = 0, 1, 2, …):
You are a [politician / scientist] in a debate about the claim below. Your opponent, the [scientist / politician], has made an argument about the veracity of the claim.
==========
Claim: {claim}
Evidence: {evidence}
Opponent's argument: {con_argument}
==========
Write your rebuttal, explaining why you agree or disagree with the opponent and defending your position.

**Expertise-Based Debate — Agent A / B (Politician / Scientist)**

Closing Phase:
You are a [politician / scientist] in a political debate about the claim below. Summarize your final position.
==========
Claim: {claim}
Evidence: {evidence}
==========
Provide a closing statement reinforcing your position about the claim.

**Expertise-Based Debate — Judge agent**

You are a neutral judge evaluating a political debate. Based on the arguments and evidence, decide whether the claim is TRUE, FALSE, or HALF-TRUE.
==========
Claim: {claim}
Evidence: {evidence}
— Opening Statements —
Agent A: {advocate_open}
Agent B: {critic_open}
— Rebuttals —
Agent A: {advocate_rebuttal}
Agent B: {critic_rebuttal}
— Closing Statements —
Agent A: {advocate_close}
Agent B: {critic_close}
==========
Answer format:
[REASON]: your justification
[VERDICT]: TRUE / FALSE / HALF-TRUE

To better align with the nature of the Politifact dataset, where ground-truth labels are determined by journalists, we introduce a journalist agent into the debate framework in the 3-agent setting. This agent plays a foundational role by offering an initial perspective grounded in professional fact-checking practices.

In each round of the debate, the journalist speaks first, providing an evidence-based assessment of the claim. Its response is then made available to both the politician and scientist agents, helping to anchor the subsequent debate in verified information while allowing the other agents to build upon or challenge the journalist's framing based on their respective expertise.

**Expertise-Based Debate — System Prompt (Journalist agent)**

You are a journalist in a political debate. You represent balanced reporting and public interest.
Key journalistic characteristics:

- Focus on extracting the most compelling evidence from both sides
- Emphasize clarity, accuracy, and public understanding
- Consider the broader context and implications
- Highlight key facts that support or contradict claims
- Maintain objectivity while identifying the strongest arguments

When debating, frame arguments in terms of what evidence best supports or contradicts the claim, focusing on the most relevant and impactful facts. Keep responses concise and focused.

**Expertise-Based Debate — Agent C (Journalist)**

Opening Phase:
Evaluate the following claim. Based on your stance as a journalist, extract and present the most compelling evidence that either supports or opposes the claim.
==========
Claim: {claim}
Evidence: {evidence}
==========
Begin your argument by identifying the strongest evidence that either supports or contradicts the claim. Focus on the most relevant and impactful facts.

Rebuttal Phase (n rounds for n = 0, 1, 2, …):
You are a journalist in a debate about the claim below. Your opponent has made an argument about the veracity of the claim.
==========
Claim: {claim}
Evidence: {evidence}
Politician's argument: {poli_argument}
Scientist's argument: {sci_argument}
==========
Write your rebuttal, focusing on the most compelling evidence that either supports or contradicts the claim, and address any gaps or weaknesses in the opponent's argument.

Closing Phase:
You are the journalist in a political debate about the claim below. Summarize your final position.
==========
Claim: {claim}
Evidence: {evidence}
Politician's rebuttal: {politician_rebuttal}
Scientist's rebuttal: {scientist_rebuttal}
==========
Provide a closing statement highlighting the most compelling evidence that either supports or contradicts the claim, and your assessment of the claim's veracity based on the key arguments presented.

In the 4-agent debate setting, we further enrich the evaluation process by introducing a domain expert agent. To ensure relevance, we first use an LLM to automatically infer the domain of the claim (e.g., economics, healthcare, climate science). Based on this inference, the debate includes a fourth agent with specialized knowledge in that domain.

**Domain Inference Prompt**

You are analyzing a claim to determine the most relevant domain specialist.
The following is the claim:
{claim}
Your task is to identify the most relevant domain specialist who would have expertise in the subject matter of this claim, and output the domain. Consider the main topic, subject area, or field of knowledge that this claim addresses.
Respond strictly using this format, only output one word:
DOMAIN: {specific domain}
Examples:

- Climate for climate-related claims
- Economy for economic claims
- Health for health claims
- Education for education claims
- Law for law enforcement claims
- Technology for tech-related claims
- Environment for environmental claims
- Public health for public health claims

Choose the most specific and relevant domain for this claim.
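Because this prompt demands a strict `DOMAIN: <domain>` reply, a small parser with a fallback is useful. The sketch below is our assumption of how such output might be parsed (the repository may do this differently); note that although the prompt asks for one word, its own examples include two-word domains like "Public health", so the regex accepts up to two words. The `default` fallback is also our assumption.

```python
import re

def parse_domain(llm_output, default="General"):
    """Extract the domain from a 'DOMAIN: <domain>' reply.

    Accepts one or two alphabetic words after 'DOMAIN:'; returns a
    default (our assumption) when the model deviates from the format.
    """
    m = re.search(r"DOMAIN:\s*([A-Za-z]+(?: [A-Za-z]+)?)", llm_output)
    return m.group(1) if m else default
```

The extracted string can then be substituted for `{domain_inferred}` in the domain expert prompts that follow.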

**Expertise-Based Debate — System Prompt (Domain Expert agent)**

You are a specialist in {domain_inferred} in a political debate. You represent specialized knowledge in your field.
Key domain expert characteristics:

- Apply deep expertise in the {domain_inferred} field
- Consider technical details and specialized evidence
- Question claims that lack domain-specific support
- Emphasize field-specific methodology and standards
- Focus on technical accuracy and domain knowledge

When debating, frame arguments in terms of expertise in the {domain_inferred} field, technical evidence, and specialized knowledge. Keep responses concise and focused.

**Expertise-Based Debate — Agent D (Domain Expert)**

Opening Phase:
As a specialist in the {domain_inferred} field, evaluate this claim. Present your opening argument.
==========
Claim: {claim}
Evidence: {evidence}
==========
Focus on: {domain_inferred} expertise, technical evidence, specialized knowledge, field-specific analysis.

**Expertise-Based Debate — Agent D (Domain Expert)**

Rebuttal Phase (n rounds for n = 0, 1, 2, …):
As a specialist in the {domain_inferred} field, rebut the opposing arguments.
==========
Claim: {claim}
Evidence: {evidence}
Politician's argument: {poli_argument}
Scientist's argument: {sci_argument}
Journalist's argument: {jour_argument}
==========
Focus on: defending {domain_inferred} expertise, technical evidence, specialized knowledge.

Closing Phase:
As an expert scientist in {domain_inferred}, summarize your final position.
==========
Claim: {claim}
Evidence: {evidence}
==========
Summarize: {domain_inferred} expertise, technical evidence, specialized knowledge.

### A.2 Early Stop

To more accurately determine the round at which the debate can be stopped for a final verdict, we incorporate both the Stop Agent and the Round Judge as decision signals.

#### \(a\) Stop Agent Roles

Debate Continuation Check - Stop Agent

You are acting as an interim adjudicator during a two-agent fact-checking debate. Your goal is to determine whether the debate should continue or stop based on the information currently available.
==========
Claim: {claim}
Evidence: {evidence}
Rounds completed so far: {summary}
You are about to listen to the {upcoming_round} round. Decide whether you already have enough information to reach a final verdict now, or whether the debate should continue and you should listen to this upcoming round.
==========
Answer format:
DECISION: CONTINUE or STOP

#### \(b\) Round Judge Roles

The Round Judge is implemented by invoking the Expertise\-Based Agent Judge after each debate round, using only the arguments accumulated up to that point as input, producing a preliminary verdict for partial debates\.
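Taken together, the two signals drive the dual-threshold early-termination controller (thresholds τ_s and τ_v, Appendix C). Below is a minimal sketch under the assumption that the debate stops only when the Stop Agent's margin p(Stop) - p(Continue) clears the stop threshold τ_s and the Round Judge's top label probability clears the confidence threshold τ_v; the function names and the exact combination rule are illustrative, not the released implementation:

```python
def should_stop(p_stop: float, p_continue: float,
                label_probs: dict, tau_s: float, tau_v: float) -> bool:
    """Dual-threshold early-termination check (sketch).

    Stop only when (1) the Stop Agent's margin p(Stop) - p(Continue)
    clears tau_s, and (2) the Round Judge's most confident preliminary
    verdict clears tau_v.
    """
    stop_margin = p_stop - p_continue
    verdict_conf = max(label_probs.values())
    return stop_margin >= tau_s and verdict_conf >= tau_v

def run_debate(stop_signal, judge_signal, tau_s, tau_v, max_rounds=3):
    """Iterate debate rounds, terminating early when both signals agree.

    stop_signal(r)  -> (p_stop, p_continue) from the Stop Agent at round r.
    judge_signal(r) -> label-probability dict from the Round Judge at round r.
    Returns (stopping round, verdict label).
    """
    for r in range(max_rounds):
        p_stop, p_cont = stop_signal(r)
        label_probs = judge_signal(r)
        if should_stop(p_stop, p_cont, label_probs, tau_s, tau_v):
            return r, max(label_probs, key=label_probs.get)
    # Budget exhausted: fall back to the last Round Judge verdict.
    return max_rounds - 1, max(label_probs, key=label_probs.get)
```

Requiring both signals to agree makes the controller conservative: a confident preliminary verdict alone does not end the debate if the Stop Agent still expects the next round to be informative.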

## Appendix B Additional Boundary Analysis

The boundary between *True* and *Half-True* is inherently subtle, since both may involve factually correct statements that differ mainly in whether important context is omitted. To better understand model behavior beyond per-class F1, we report confusion matrices together with class-wise precision, recall, and F1 under the retrieved-evidence setting using Qwen2.5-7B-Instruct.

Compared with strong baselines, RADAR substantially improves recognition of *Half-True* claims while maintaining strong performance on *False*. In particular, RADAR reduces the tendency to misclassify omission-based claims as fully *True*, suggesting that its gains arise from stronger contextual reasoning rather than annotation noise alone.
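The class-wise precision, recall, and F1 reported alongside the confusion matrices follow directly from the matrix counts. A self-contained sketch (the example counts in the usage note are fabricated for illustration and are not the paper's results):

```python
def classwise_prf(cm: dict) -> dict:
    """Per-class precision, recall, F1 from a confusion matrix.

    cm[gold][pred] holds the count of gold-label examples predicted as
    pred. Returns {label: (precision, recall, f1)}.
    """
    labels = list(cm)
    out = {}
    for c in labels:
        tp = cm[c][c]
        fp = sum(cm[g][c] for g in labels if g != c)   # predicted c, gold != c
        fn = sum(cm[c][p] for p in labels if p != c)   # gold c, predicted != c
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        out[c] = (prec, rec, f1)
    return out
```

For instance, with made-up counts `{"True": {"True": 8, "Half-True": 2, "False": 0}, ...}`, recall for *True* is 8 / (8 + 2) = 0.8, and *True*/*Half-True* confusions show up symmetrically as false negatives of one class and false positives of the other.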

Table 11: Confusion matrices and class-wise precision/recall/F1 under the retrieved-evidence setting using Qwen2.5-7B-Instruct. RADAR shows stronger discrimination on the subtle *True*/*Half-True* boundary while maintaining high performance on *False*.

## Appendix C Threshold Sensitivity Analysis

RADAR uses two stopping thresholds: τ_s for the stop margin and τ_v for verdict confidence. Although the selected values differ across backbones, tuning is computationally lightweight. For each backbone, we run inference on the development set only once, cache the logits for p(Stop) and p(Continue) along with the label probabilities, then perform offline grid search without re-running inference or retraining.
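The cache-then-search procedure can be sketched as follows. The cache layout (per-example, per-round records of p(Stop), p(Continue), the Round Judge's confidence, and its verdict) and the scoring callback are assumptions for illustration; only the overall replay-without-inference idea comes from the paper:

```python
from itertools import product

def grid_search_thresholds(cache, taus_s, taus_v, score_fn):
    """Offline grid search over (tau_s, tau_v) using cached probabilities.

    cache: list of dicts, one per dev example, each holding per-round
    records and the gold label; no model inference is re-run.
    score_fn: maps (predictions, golds) -> scalar (e.g. macro-F1).
    Returns the best (tau_s, tau_v, score) triple.
    """
    best = (None, None, float("-inf"))
    for tau_s, tau_v in product(taus_s, taus_v):
        preds, golds = [], []
        for ex in cache:
            # Replay cached rounds: stop at the first round clearing both
            # thresholds, else fall back to the final round's verdict.
            verdict = ex["rounds"][-1]["verdict"]
            for rnd in ex["rounds"]:
                margin = rnd["p_stop"] - rnd["p_continue"]
                if margin >= tau_s and rnd["confidence"] >= tau_v:
                    verdict = rnd["verdict"]
                    break
            preds.append(verdict)
            golds.append(ex["gold"])
        score = score_fn(preds, golds)
        if score > best[2]:
            best = (tau_s, tau_v, score)
    return best
```

Because each grid point only replays cached records, the search cost is linear in the grid size times the dev-set size, with no additional LLM calls.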

To assess robustness, we report coarse\-grained sensitivity results below\. Across both backbones, performance remains stable over wide threshold ranges, suggesting that RADAR does not depend on narrowly tuned stopping parameters\.

Table 12: Sensitivity to the stop threshold τ_s. Performance remains stable across a broad range of values for both backbones, with accuracy varying within a narrow band and macro-F1 showing only small fluctuation.

Table 13: Sensitivity to the confidence threshold τ_v. Results are highly stable across values.
