ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval Augmented Generation

arXiv cs.CL Papers

Summary

ConflictRAG is a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts in retrieved documents, achieving 88.7% detection F1 and 5.3–6.1% correctness gains over baselines while reducing API costs by 62%.

arXiv:2605.17301v1

# ConflictRAG: Detecting and Resolving Knowledge Conflicts in Retrieval-Augmented Generation
Source: [https://arxiv.org/html/2605.17301](https://arxiv.org/html/2605.17301)
###### Abstract

Retrieval-Augmented Generation (RAG) systems implicitly assume mutual consistency among retrieved documents, an assumption that frequently fails in practice. We present ConflictRAG, a conflict-aware RAG framework that detects, classifies, and resolves knowledge conflicts prior to answer generation. The framework introduces three contributions: (1) a two-stage conflict detection module combining a lightweight embedding-based MLP classifier with selective LLM refinement, reducing API costs by 62% while maintaining 90.8% detection accuracy; (2) an Entropy-TOPSIS framework for data-driven source credibility assessment, improving selection accuracy by 7.1% over manual heuristics; and (3) a Conflict-Aware RAG Score (CARS) for diagnostic evaluation of conflict-handling capabilities. Experiments on three benchmarks against six baselines demonstrate 88.7% conflict-detection F1 and consistent 5.3–6.1% correctness gains over the strongest conflict-aware baseline, with the pipeline transferring effectively across backbone LLMs.

## I Introduction

Retrieval-Augmented Generation (RAG) \[[13](https://arxiv.org/html/2605.17301#bib.bib1)\] has become the prevailing paradigm for grounding large language model (LLM) outputs in external knowledge, reducing hallucination and enabling knowledge-intensive tasks \[[6](https://arxiv.org/html/2605.17301#bib.bib9), [3](https://arxiv.org/html/2605.17301#bib.bib13)\].

Despite its success, a fundamental yet underexplored challenge persists: retrieved documents may contain *mutually contradictory* information. For instance, a query about recommended Vitamin D intake may retrieve both 400 IU (2010) and 600–800 IU (2020) guidelines. A conventional system concatenates all documents, potentially producing an inconsistent response without flagging the conflict.

Knowledge conflicts in RAG systems arise from two sources\[[23](https://arxiv.org/html/2605.17301#bib.bib5),[22](https://arxiv.org/html/2605.17301#bib.bib4)\]:

- Inter-document conflicts: retrieved passages contradict each other (subtypes: factual, temporal, opinion; see Sect. [III-C](https://arxiv.org/html/2605.17301#S3.SS3)).
- Parametric–contextual conflicts: retrieved evidence contradicts the LLM’s internal knowledge.

Recent surveys \[[23](https://arxiv.org/html/2605.17301#bib.bib5)\] identify such conflicts as a critical reliability concern, yet existing approaches, including Self-RAG \[[1](https://arxiv.org/html/2605.17301#bib.bib2)\] and CRAG \[[24](https://arxiv.org/html/2605.17301#bib.bib3)\], primarily target retrieval relevance without explicitly detecting or resolving contradictions.

To bridge this gap, we propose ConflictRAG, a conflict-aware RAG framework that addresses knowledge conflicts through a systematic detect-classify-resolve-generate pipeline. Our contributions are:

1. Two-stage conflict detection: an embedding-based MLP classifier (Stage 1) with selective LLM refinement (Stage 2), reducing costs by 62% at 90.8% accuracy.
2. An Entropy-TOPSIS framework for source credibility, outperforming hand-crafted heuristics by 7.1%.
3. A Conflict-Aware RAG Score (CARS) integrating correctness, detection, resolution, and source fidelity.
4. Experiments on three benchmarks against six baselines with ablation and efficiency analyses.

The overall ConflictRAG pipeline is illustrated in Fig\.[1](https://arxiv.org/html/2605.17301#S1.F1)\.

![Refer to caption](https://arxiv.org/html/2605.17301v1/figures/fig1_framework.png)

Figure 1: Overview of the ConflictRAG pipeline: hybrid retrieval → two-stage conflict detection → type-adaptive resolution → conflict-aware generation with source attribution.

## II Related Work

RAG and Knowledge Conflicts. RAG \[[13](https://arxiv.org/html/2605.17301#bib.bib1)\] enhances LLMs with external knowledge. Extensions such as Self-RAG \[[1](https://arxiv.org/html/2605.17301#bib.bib2)\] and CRAG \[[24](https://arxiv.org/html/2605.17301#bib.bib3)\] improve retrieval quality but assume document consistency. Xu et al. \[[23](https://arxiv.org/html/2605.17301#bib.bib5)\] categorized conflicts into inter-context and context-memory types, and ConflictQA \[[22](https://arxiv.org/html/2605.17301#bib.bib4)\] demonstrated that LLMs follow context regardless of correctness. Further studies address parametric-contextual tensions \[[11](https://arxiv.org/html/2605.17301#bib.bib23)\], conflicting search results \[[2](https://arxiv.org/html/2605.17301#bib.bib25)\], entity-level conflicts \[[15](https://arxiv.org/html/2605.17301#bib.bib15)\], and trust calibration \[[16](https://arxiv.org/html/2605.17301#bib.bib6), [21](https://arxiv.org/html/2605.17301#bib.bib14)\]. However, none offer a unified detection-resolution pipeline.

Conflict-Aware RAG. Recent work tackles RAG conflicts via knowledge graphs \[[14](https://arxiv.org/html/2605.17301#bib.bib27), [26](https://arxiv.org/html/2605.17301#bib.bib28)\], fact-checking \[[7](https://arxiv.org/html/2605.17301#bib.bib24)\], transparent handling \[[25](https://arxiv.org/html/2605.17301#bib.bib26)\], and multi-agent debate \[[20](https://arxiv.org/html/2605.17301#bib.bib31)\]. These either focus on a single conflict type or incur high inference costs. ConflictRAG combines a learned two-stage detector with type-adaptive resolution and a diagnostic metric; to our knowledge, it is the first system integrating all three. Hallucination mitigation \[[10](https://arxiv.org/html/2605.17301#bib.bib10)\] and RAGAS \[[5](https://arxiv.org/html/2605.17301#bib.bib20)\] improve quality but do not target inter-document contradictions.

## III Methodology

### III-A Problem Formulation

Given a user query $q$, a retriever $\mathcal{R}$ returns $K$ documents $\mathcal{D}=\{d_1,d_2,\ldots,d_K\}$. A standard RAG system generates $a=\text{LLM}(q,\mathcal{D})$. We extend this with a conflict-aware pipeline:

$$a=\text{Generate}(q,\mathcal{D},\text{Resolve}(q,\mathcal{D},\text{Detect}(q,\mathcal{D}))), \tag{1}$$

where $\text{Detect}(\cdot)$ identifies conflicting document pairs and their conflict types, $\text{Resolve}(\cdot)$ applies type-adaptive strategies, and $\text{Generate}(\cdot)$ produces a conflict-aware answer with annotations.

### III-B Two-Stage Conflict Detection

With $K=5$ retrieved documents, there are $\binom{5}{2}=10$ pairs per query. Calling the LLM for every pair is prohibitively expensive. We propose a two-stage architecture (Fig. [2](https://arxiv.org/html/2605.17301#S3.F2)) that significantly reduces this cost.

![Refer to caption](https://arxiv.org/html/2605.17301v1/figures/fig2_two_stage.png)

Figure 2: Two-stage conflict detection. Stage 1 (MLP) handles 73% of pairs at 120 ms; uncertain cases ($\textit{conf}<\tau_c$) go to Stage 2 (LLM). Combined: 90.8% accuracy, 62% cost reduction.

Stage 1: Embedding-Based MLP Classifier. For each pair $(d_i,d_j)$ and query $q$, we encode via a sentence transformer (all-MiniLM-L6-v2, 384-dim), where $\oplus$ denotes concatenation:

$$\mathbf{e}_i=\text{SentEnc}(q\oplus d_i),\quad \mathbf{e}_j=\text{SentEnc}(q\oplus d_j). \tag{2}$$

The feature vector combines four interaction components \[[4](https://arxiv.org/html/2605.17301#bib.bib22)\]:

$$\mathbf{f}_{ij}=[\mathbf{e}_i;\;\mathbf{e}_j;\;|\mathbf{e}_i-\mathbf{e}_j|;\;\mathbf{e}_i\odot\mathbf{e}_j]\in\mathbb{R}^{1536}. \tag{3}$$

This is fed into two parallel MLPs sharing the same feature representation; we use a frozen encoder with MLP heads (rather than fine-tuned cross-encoders) to enable CPU-only deployment. Head 1 performs binary conflict detection ($1536\to256\to128\to2$, ReLU); Head 2 classifies into four categories: no-conflict, factual, temporal, or opinion ($1536\to256\to128\to64\to4$, ReLU). The classifier is trained on 3,000 document pairs (2,400 train, 600 val) derived from ConflictQA by re-pairing retrieved passages that support opposing answers to the same query, thereby converting parametric–contextual labels into pairwise inter-document annotations. Early stopping is applied on the validation set.
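The feature construction in Eq. (3) and the stated layer widths can be checked with a small sketch. It uses random placeholder vectors in place of real sentence embeddings; the helper names `pair_features` and `count_parameters` are illustrative and do not come from the paper.

```python
import random

EMB_DIM = 384  # all-MiniLM-L6-v2 embedding size, as stated in the text


def pair_features(e_i, e_j):
    """Build f_ij = [e_i; e_j; |e_i - e_j|; e_i * e_j] as in Eq. (3)."""
    if len(e_i) != len(e_j):
        raise ValueError("embeddings must have the same dimension")
    abs_diff = [abs(a - b) for a, b in zip(e_i, e_j)]
    product = [a * b for a, b in zip(e_i, e_j)]
    return list(e_i) + list(e_j) + abs_diff + product


def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected stack with these widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


HEAD1_SIZES = [4 * EMB_DIM, 256, 128, 2]      # binary conflict head
HEAD2_SIZES = [4 * EMB_DIM, 256, 128, 64, 4]  # four-way type head

rng = random.Random(0)
e_i = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]  # placeholder embedding
e_j = [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)]  # placeholder embedding
f_ij = pair_features(e_i, e_j)

print(len(f_ij))                                                  # 1536
print(count_parameters(HEAD1_SIZES), count_parameters(HEAD2_SIZES))  # 426626 434884
```

With these widths each head has fewer than half a million parameters, which is consistent with the CPU-only deployment goal mentioned above.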

Stage 2: LLM-Based Refinement. When Head 1’s binary detection confidence satisfies $\textit{conf}_{ij}<\tau_c=0.7$, the pair is routed to an LLM for precise conflict judgment and type classification via structured prompting. This reserves costly LLM calls for genuinely ambiguous cases.
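The routing rule can be sketched as follows. The paper does not say how the confidence is computed (the maximum softmax probability of Head 1 would be one natural reading), so `stage1_confidence` is a hypothetical callable and the toy confidence values are invented for illustration.

```python
from itertools import combinations

TAU_C = 0.7  # confidence threshold from the text


def route_pairs(num_docs, stage1_confidence, tau_c=TAU_C):
    """Split all document pairs into Stage-1-resolved and LLM-routed lists.

    stage1_confidence is a hypothetical callable (i, j) -> float in [0, 1].
    """
    resolved, routed = [], []
    for i, j in combinations(range(num_docs), 2):
        if stage1_confidence(i, j) < tau_c:
            routed.append((i, j))    # uncertain: send to Stage 2 (LLM)
        else:
            resolved.append((i, j))  # confident: keep the MLP decision
    return resolved, routed


# Toy confidences: high by default, low for two "ambiguous" pairs.
toy_conf = {(0, 3): 0.55, (1, 4): 0.62}
resolved, routed = route_pairs(5, lambda i, j: toy_conf.get((i, j), 0.93))
print(len(resolved), len(routed))  # 8 2
```

With $K=5$ the loop visits all 10 pairs, and only the pairs below the threshold incur an LLM call.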

Parametric–Contextual Conflict Detection. Orthogonal to the pairwise inter-document detector above, we detect parametric–contextual conflicts by comparing a closed-book answer $a_{\text{par}}=\text{LLM}(q)$ with an open-book answer $a_{\text{ctx}}=\text{LLM}(q,\mathcal{D})$ via a structured comparison prompt. When the two answers disagree, the system defers to retrieved evidence. This approach achieves 81% accuracy (precision 84%, recall 77%) on a 100-sample subset.
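A minimal sketch of this closed-book versus open-book comparison is given below. Both `llm` and `answers_agree` are hypothetical callables standing in for the model call and the structured comparison prompt; the toy stubs reuse the Vitamin D example from the introduction and do not call any real API.

```python
def detect_parametric_conflict(query, docs, llm, answers_agree):
    """Compare a closed-book and an open-book answer to the same query.

    llm(query, docs) and answers_agree(a, b) are hypothetical callables.
    """
    a_par = llm(query, None)   # closed-book: parametric knowledge only
    a_ctx = llm(query, docs)   # open-book: conditioned on retrieved documents
    conflict = not answers_agree(a_par, a_ctx)
    # On disagreement the system defers to retrieved evidence, so the
    # open-book answer is returned as the final answer in either case.
    return {"conflict": conflict, "closed_book": a_par,
            "open_book": a_ctx, "final": a_ctx}


def toy_llm(query, docs):
    """Stub model: an outdated parametric answer, an updated contextual one."""
    return "400 IU" if docs is None else "600-800 IU"


def toy_agree(a, b):
    """Stub comparison: exact match after normalisation."""
    return a.strip().lower() == b.strip().lower()


result = detect_parametric_conflict(
    "Recommended Vitamin D intake?", ["2020 guideline text"], toy_llm, toy_agree)
print(result["conflict"], result["final"])  # True 600-800 IU
```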

### III-C Type-Adaptive Conflict Resolution

We define three inter-document conflict types: factual (contradictory claims), temporal (different time periods), and opinion (subjective viewpoints). Each requires a distinct resolution strategy.

Factual Conflicts: Entropy-TOPSIS. We formulate source selection as an MCDM problem \[[8](https://arxiv.org/html/2605.17301#bib.bib21)\]. For $m$ candidate documents, five LLM-extracted criteria ($n=5$: authority, recency, relevance, specificity, consistency) yield $\mathbf{X}\in\mathbb{R}^{m\times n}$ (scores $\in[0,1]$); note that “consistency” may bias toward an incorrect majority, and the ablation in Sect. [V-E](https://arxiv.org/html/2605.17301#S5.SS5) confirms it contributes only 2.1% accuracy. Weights are entropy-derived from the LLM-extracted scores. Let $p_{ij}=x_{ij}/\sum_{i=1}^{m}x_{ij}$:

$$E_j=-\frac{1}{\ln m}\sum_{i=1}^{m}p_{ij}\ln p_{ij},\quad w_j=\frac{1-E_j}{\sum_{k=1}^{n}(1-E_k)}, \tag{4}$$

where higher $E_j$ means less discriminating power (lower weight). Documents are ranked by closeness $C_i^{*}=D_i^{-}/(D_i^{+}+D_i^{-})$, where $D_i^{\pm}$ are distances to the ideal/anti-ideal solutions.
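The weighting and ranking steps can be sketched in plain Python. The entropy weights follow Eq. (4) directly; the TOPSIS part makes assumptions the text does not spell out: all five criteria are treated as benefit criteria, scores are used without further normalisation, and distances are Euclidean. The score matrix is invented for illustration.

```python
import math


def entropy_weights(X):
    """Entropy weights per Eq. (4); X is an m x n list of scores in [0, 1].

    Assumes m >= 2, positive column sums, and at least one non-uniform column.
    """
    m, n = len(X), len(X[0])
    diversity = []
    for j in range(n):
        column = [X[i][j] for i in range(m)]
        total = sum(column)
        entropy = 0.0
        for x in column:
            p = x / total
            if p > 0.0:  # treat 0 * ln 0 as 0
                entropy -= p * math.log(p)
        entropy /= math.log(m)
        diversity.append(1.0 - entropy)
    norm = sum(diversity)
    return [d / norm for d in diversity]


def topsis_closeness(X, weights):
    """Closeness C_i = D_i^- / (D_i^+ + D_i^-); all criteria are benefits (assumption)."""
    V = [[w * x for w, x in zip(weights, row)] for row in X]
    ideal = [max(col) for col in zip(*V)]
    anti = [min(col) for col in zip(*V)]
    scores = []
    for row in V:
        d_plus = math.dist(row, ideal)
        d_minus = math.dist(row, anti)
        denom = d_plus + d_minus
        scores.append(d_minus / denom if denom > 0 else 0.0)
    return scores


# Illustrative scores: authority, recency, relevance, specificity, consistency.
X = [
    [0.9, 0.8, 0.9, 0.7, 0.6],
    [0.4, 0.3, 0.8, 0.6, 0.5],
    [0.5, 0.2, 0.7, 0.5, 0.6],
]
weights = entropy_weights(X)
closeness = topsis_closeness(X, weights)
best = max(range(len(X)), key=lambda i: closeness[i])
print(best)
```

In this toy matrix the first document is at least as good as the others on every criterion, so it coincides with the ideal solution and is selected. The criterion with the widest spread across documents (the second column) receives the largest weight, matching the remark that higher entropy means lower weight.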

For temporal conflicts, documents are ranked by recency (metadata or LLM-extracted dates); the generator prioritizes the latest source while noting temporal evolution. For opinion conflicts, multi-perspective synthesis presents all viewpoints with source attribution.
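The three strategies amount to a dispatch on the detected conflict type, sketched below. The document fields (`text`, `year`, `credibility`) and the returned structure are hypothetical; `credibility` stands in for a precomputed Entropy-TOPSIS closeness score.

```python
def resolve(conflict_type, docs):
    """Pick a resolution strategy by conflict type.

    docs: list of dicts with hypothetical keys 'text', 'year', 'credibility'.
    """
    if conflict_type == "factual":
        ranked = sorted(docs, key=lambda d: d["credibility"], reverse=True)
        return {"strategy": "most_credible", "primary": ranked[0], "context": ranked}
    if conflict_type == "temporal":
        ranked = sorted(docs, key=lambda d: d["year"], reverse=True)
        return {"strategy": "latest_first", "primary": ranked[0], "context": ranked}
    if conflict_type == "opinion":
        # No single winner: every viewpoint is kept for attributed synthesis.
        return {"strategy": "multi_perspective", "primary": None, "context": list(docs)}
    raise ValueError(f"unknown conflict type: {conflict_type!r}")


docs = [
    {"text": "400 IU", "year": 2010, "credibility": 0.6},
    {"text": "600-800 IU", "year": 2020, "credibility": 0.8},
]
print(resolve("temporal", docs)["primary"]["text"])  # 600-800 IU
```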

### III-D Conflict-Aware Answer Generation

The resolved context $r$ and detected conflicts $\mathcal{C}$ are passed to GPT-4o-mini with a conflict-aware prompt. The output comprises: (i) a response grounded in the most credible source(s), (ii) conflict annotations, (iii) source attribution, and (iv) a confidence qualifier. Without detected conflicts, the system falls back to standard generation.

### III-E Conflict-Aware RAG Score (CARS)

Existing RAG metrics (EM, F1, RAGAS \[[5](https://arxiv.org/html/2605.17301#bib.bib20)\]) ignore conflict handling. We propose CARS as a diagnostic metric that structurally favors systems with explicit conflict modules:

$$\text{CARS}=w_a\cdot\text{AC}+w_d\cdot\text{CDA}+w_r\cdot\text{RA}+w_s\cdot\text{SF}, \tag{5}$$

where AC is answer correctness, CDA is conflict detection F1, RA is resolution appropriateness (LLM-rated), SF is source fidelity, and $(w_a,w_d,w_r,w_s)=(0.35,0.25,0.25,0.15)$. AC remains the primary metric; CARS is diagnostic only. Varying the weights by $\pm0.1$ around the default values does not change the system ranking in our experiments.
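Eq. (5) is a weighted sum and can be written down directly. The text does not state how the four components are scaled, so the sketch assumes each has already been mapped to $[0,1]$; the component values in the example are hypothetical, not results from the paper.

```python
CARS_WEIGHTS = {"AC": 0.35, "CDA": 0.25, "RA": 0.25, "SF": 0.15}  # from Eq. (5)


def cars(scores, weights=CARS_WEIGHTS):
    """Weighted sum of the four CARS components (assumed scaled to [0, 1])."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing components: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical component values, for illustration only.
example = {"AC": 0.70, "CDA": 0.90, "RA": 0.80, "SF": 0.60}
print(round(cars(example), 3))  # 0.76
```

Because the weights sum to 1, the score stays in the same range as its components.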

## IV Experimental Setup

### IV-A Datasets

We evaluate on three benchmarks (100% of ConflictQA, 75% of NQ-Conflict, and $\sim$68% of AmbigQA queries contain $\geq 1$ detected conflict):

- ConflictQA \[[22](https://arxiv.org/html/2605.17301#bib.bib4)\]: 2,000 QA pairs where parametric knowledge conflicts with counter-evidence.
- NQ-Conflict: constructed from Natural Questions \[[12](https://arxiv.org/html/2605.17301#bib.bib8)\] by prompting GPT-4o to inject controlled conflicts. It contains 500 samples (150 factual, 125 temporal, 100 opinion, 125 no-conflict); 100-sample human verification ($\kappa=0.83$) confirms 91% injection accuracy. As a self-constructed benchmark, NQ-Conflict carries inherent risks of distribution bias and injection artifacts; accordingly, we designate it as a supplementary controlled testbed and draw primary conclusions from the naturally occurring benchmarks (ConflictQA, AmbigQA).
- AmbigQA \[[17](https://arxiv.org/html/2605.17301#bib.bib7)\]: 1,000 ambiguous questions where documents naturally present different perspectives.

### IV-B Baselines

We compare with six methods sharing the same retrieval pool, metadata, and generation model (GPT-4o-mini). Standard RAG concatenates the top-$K$ documents; RAG+Reranking generates from the single top-reranked passage (sidestepping conflicts by design); Self-RAG \[[1](https://arxiv.org/html/2605.17301#bib.bib2)\] adds self-reflection tokens; CRAG \[[24](https://arxiv.org/html/2605.17301#bib.bib3)\] adds corrective retrieval. Two conflict-aware baselines complete the set: NLI-Filter uses a cross-encoder NLI model (DeBERTa-v3-base) to detect pairwise contradictions and generates from the consistent subset; CoT Detection uses a structured chain-of-thought prompt (GPT-4o-mini) to identify conflicts, classify types, and synthesize a resolution in one call. To ensure fairness, CoT Detection is prompted to produce structured reasoning traces comparable to ConflictRAG’s output; all prompt templates are in the supplementary material.

### IV-C Evaluation Metrics

We report Answer Correctness (AC) via LLM-as-judge \[[27](https://arxiv.org/html/2605.17301#bib.bib19)\], token-level F1, Conflict Detection F1, Resolution and Transparency scores (LLM-rated 1–5), and our CARS (Eq. [5](https://arxiv.org/html/2605.17301#S3.E5)). GPT-4o serves as the judge (temperature 0, distinct from GPT-4o-mini used for generation) to mitigate self-evaluation bias. The judge evaluates factual correctness regardless of output formatting; residual format preference is quantified in Sect. [V-A](https://arxiv.org/html/2605.17301#S5.SS1). Human verification on 200 samples confirms 85% agreement with the LLM judge ($\kappa=0.74$, substantial agreement by the Landis–Koch scale).

### IV-D Implementation Details

All experiments use GPT-4o-mini \[[18](https://arxiv.org/html/2605.17301#bib.bib17)\] (temperature 0.0 for detection, 0.3 for generation). Stage 1 employs all-MiniLM-L6-v2 (384-dim) with $\tau_c=0.7$. The MLP is trained on 3,000 labeled pairs from a 750-instance ConflictQA subset (2,400 train / 600 val) using Adam (lr $=10^{-3}$); the remaining 1,250 instances serve as the evaluation set. Retrieval combines BM25 \[[19](https://arxiv.org/html/2605.17301#bib.bib11)\] and Contriever \[[9](https://arxiv.org/html/2605.17301#bib.bib12)\] in a hybrid pipeline with $K=5$. Results are averaged over 3 seeds; paired bootstrap tests yield $p<0.01$ for all main comparisons.
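The paper does not describe which variant of the paired bootstrap it uses. The sketch below shows one common form, under the assumption that each system produces a per-query 0/1 correctness label on the same queries: queries are resampled with replacement, and the p-value is the fraction of resamples in which system A fails to beat system B. The function name and defaults are illustrative.

```python
import random


def paired_bootstrap_p(correct_a, correct_b, n_resamples=2000, seed=0):
    """Estimate a one-sided p-value that system A is not better than system B.

    correct_a / correct_b: per-query 0/1 correctness on the same queries.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample query indices with replacement, keeping the pairing intact.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


# Degenerate sanity checks: A always right vs. B always wrong, then A == B.
print(paired_bootstrap_p([1] * 50, [0] * 50))  # 0.0
print(paired_bootstrap_p([1] * 50, [1] * 50))  # 1.0
```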

## V Results and Analysis

### V-A Main Results

TABLE I: Main results on three benchmarks against six baselines. Correctness (%) by LLM-as-judge (GPT-4o), F1 is token-level, CARS is our composite metric (Eq. [5](https://arxiv.org/html/2605.17301#S3.E5)). Best in bold; $\pm$ = std over 3 seeds (ConflictRAG); baselines use temp. 0. The CARS gap reflects varying conflict-handling capabilities; see Sect. [V-A](https://arxiv.org/html/2605.17301#S5.SS1).

Table [I](https://arxiv.org/html/2605.17301#S5.T1) presents the main comparison. ConflictRAG consistently outperforms all baselines across the three benchmarks; gains on naturally occurring datasets (ConflictQA +5.8%, AmbigQA +5.3%) corroborate those on the constructed NQ-Conflict (+6.1%). The CARS gap reflects design alignment (Sect. [III-E](https://arxiv.org/html/2605.17301#S3.SS5)) rather than proportional end-task gain. Notably, Self-RAG scores below Standard RAG (47.8% vs. 49.2% on ConflictQA), likely because its reflection filter over-removes contradictory evidence.

Format bias analysis. To quantify potential LLM-judge preference for structured output \[[27](https://arxiv.org/html/2605.17301#bib.bib19)\], we strip annotations from ConflictRAG outputs and re-evaluate. Correctness drops 2.2–2.5%; even conservatively assuming baselines receive zero format bias, corrected gains remain +3.1–3.7%, above the 2.0% bootstrap threshold.

![Refer to caption](https://arxiv.org/html/2605.17301v1/figures/fig5_results_bar.png)

(a) Correctness (%) across benchmarks.

![Refer to caption](https://arxiv.org/html/2605.17301v1/figures/fig6_radar.png)

(b) Radar on NQ-Conflict.

Figure 3: (a) Correctness (%); error bars show $\pm1$ std over 3 runs. (b) Radar on NQ-Conflict across six CARS dimensions; note that detection, resolution, and transparency are partly method-defined and structurally favor systems with explicit conflict modules (see Sect. [III-E](https://arxiv.org/html/2605.17301#S3.SS5)).

Multi-dimensional comparison. Fig. [3b](https://arxiv.org/html/2605.17301#S5.F3.sf2) shows ConflictRAG leads across all six dimensions on NQ-Conflict, with the largest margins in detection (88.7% vs. 58.6% F1) and transparency (4.35 vs. $<2.5$).

### V-B Conflict Detection Performance

On NQ-Conflict, the two-stage detection module (Head 1) achieves 88.7% binary conflict-detection F1 (precision 92.1%, recall 85.6%) and 90.8% overall accuracy. The F1–accuracy gap reflects class imbalance at the pair level, as most document pairs are non-conflicting. Four-class type classification (Head 2) reaches 74.3% accuracy, with per-type F1 ordered as temporal (0.823) > factual (0.798) > no-conflict (0.790) > opinion (0.685). Opinion conflicts remain hardest due to the ambiguous fact-opinion boundary.
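As a quick consistency check, the reported F1 follows from the reported precision and recall through the harmonic mean. The snippet only reproduces that arithmetic.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_f1 = round(100 * f1_score(0.921, 0.856), 1)
print(reported_f1)  # 88.7
```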

Cross-dataset generalization. To verify that the MLP does not overfit to ConflictQA’s distribution, we evaluate the ConflictQA-trained detector directly on AmbigQA document pairs (without retraining). Binary detection F1 is 83.4% on AmbigQA (vs. 88.7% on NQ-Conflict); type classification accuracy is 65.7% (vs. 74.3%), with the larger drop expected for the finer-grained four-class task. AmbigQA’s naturally occurring ambiguity presents subtler conflicts than NQ-Conflict’s injected contradictions. Crucially, the full two-stage system (MLP + LLM fallback) on AmbigQA yields only 1.9% lower correctness than an oracle detector, confirming that the learned representations generalize across question distributions.

### V-C Ablation Study

TABLE II: Ablation study on NQ-Conflict ($n=500$). Each row removes one module. Detection and resolution are the most critical.

Detection is the most critical module ($-16.6\%$); without it, the system approaches standard RAG+Reranking performance. Resolution ($-13.2\%$) and classification ($-7.9\%$) are also essential; without classification, conflicts are still detected (Det. F1 unchanged) but all default to the factual strategy, mishandling temporal and opinion cases. Annotation mainly affects transparency ($4.35\to1.42$) with only $-1.3\%$ correctness impact, confirming it serves user experience rather than answer quality.

### V-D Two-Stage Detection Efficiency

TABLE III: Two-stage detection efficiency on NQ-Conflict (5,000 pairs). Stage 1 resolves 73% of pairs without LLM calls, achieving a $2.77\times$ speedup and 62% cost reduction.

The combined system achieves 90.8% accuracy while cutting cost by 62% (\$0.026 vs. \$0.068 per query) and running $2.77\times$ faster. Varying $\tau_c\in[0.5,0.9]$ yields accuracy of 89.5–91.3%; $\tau_c=0.7$ balances accuracy and cost.
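The efficiency figures can be cross-checked with a few lines of arithmetic. All inputs come from the text; the snippet only confirms that they are mutually consistent.

```python
PAIRS_PER_QUERY = 10          # C(5, 2) pairs for K = 5
STAGE1_SHARE = 0.73           # fraction of pairs resolved by the MLP alone
COST_TWO_STAGE = 0.026        # USD per query, two-stage system
COST_LLM_ONLY = 0.068         # USD per query, LLM on every pair
TOTAL_PAIRS = 5000            # pairs evaluated on NQ-Conflict

llm_pairs_per_query = PAIRS_PER_QUERY * (1 - STAGE1_SHARE)
cost_reduction_pct = 100 * (1 - COST_TWO_STAGE / COST_LLM_ONLY)
num_queries = TOTAL_PAIRS // PAIRS_PER_QUERY

print(round(llm_pairs_per_query, 1))  # 2.7
print(round(cost_reduction_pct))      # 62
print(num_queries)                    # 500
```

On average fewer than three of the ten pairs per query reach the LLM, and 5,000 pairs at 10 pairs per query matches the 500 NQ-Conflict samples.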

### V-E Entropy-TOPSIS Analysis

The entropy-derived weights identify authority (0.312) and recency (0.245) as the most discriminating criteria. Perturbing all weights by $\pm10\%$ changes the final source ranking in only 4.8% of cases, indicating stable decision boundaries. To assess extraction robustness, we run the scoring prompt 5 times (temp. 0.3) on 50 samples; the mean score std is 0.04, and TOPSIS rankings change in only 6% of cases. Against human ground truth (200 factual samples, $\kappa=0.79$), Entropy-TOPSIS achieves 82.7% selection accuracy, outperforming LLM direct selection (78.3%), fixed weights (75.6%), equal weights (71.2%), and random (53.4%). Ablating consistency reduces accuracy by only 2.1%, confirming that authority and recency dominate.

### V-F Per-Type and Cross-Model Analysis

Per\-type correctness on NQ\-Conflict varies: factual 73\.8%, temporal 72\.1%, opinion 60\.4%, and no\-conflict 76\.6%\. Opinion conflicts remain the most challenging category owing to inherent subjectivity\. To assess model dependence, we replace GPT\-4o\-mini with DeepSeek\-V3; the resulting 4\.2 percentage\-point drop \(67\.2% vs\. 71\.4%\) at 38% lower cost demonstrates that the majority of gains stem from the pipeline architecture rather than a specific backbone model\. We further validate with Claude\-3\.5\-Sonnet as the backbone, obtaining 69\.8% correctness \(vs\. 71\.4% for GPT\-4o\-mini\), confirming the framework’s transferability across three model families\.

### V-G Limitations

We acknowledge several limitations. First, the LLM-as-judge protocol ($\kappa=0.74$, substantial agreement) may exhibit residual format preference; our format bias analysis bounds this effect at $\leq2.5\%$, and the corrected gains (+3.1–3.7%) remain well above significance thresholds. Second, while NQ-Conflict is 20% human-verified ($\kappa=0.83$), we conservatively draw primary conclusions from ConflictQA and AmbigQA. Third, CARS is explicitly designed as a diagnostic metric for conflict-aware systems. Future work includes encoder fine-tuning for domain-specific detection and multilingual extension.

## VI Conclusion

We presented ConflictRAG, a conflict\-aware RAG framework featuring two\-stage detection \(62% cost reduction\), Entropy\-TOPSIS credibility assessment, and the CARS diagnostic metric\. Experiments on two naturally occurring benchmarks and one constructed testbed demonstrate 88\.7% detection F1 and 5\.3–6\.1% correctness gains over the strongest conflict\-aware baseline, with effective cross\-model transfer\. These results demonstrate that explicit conflict handling meaningfully improves RAG reliability\. We believe ConflictRAG’s modular design—separating detection, classification, and resolution—provides a principled foundation for building more trustworthy retrieval\-augmented systems\.

## Acknowledgment

This work was supported by the National Natural Science Foundation of China\. Code, the NQ\-Conflict benchmark, and all prompt templates will be released upon acceptance\.

## References

- \[1\] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi (2024). Self-RAG: learning to retrieve, generate, and critique through self-reflection. In International Conference on Learning Representations.
- \[2\] A. Cattan, A. Jacovi, O. Ram, J. Herzig, R. Aharoni, S. Goldshtein, E. Ofek, I. Szpektor, and A. Caciularu (2025). DRAGged into conflicts: detecting and addressing conflicting sources in search-augmented LLMs. arXiv preprint arXiv:2506.08500.
- \[3\] J. Chen, H. Lin, X. Han, and L. Sun (2024). Benchmarking large language models in retrieval-augmented generation. In AAAI Conference on Artificial Intelligence, Vol. 38, pp. 17754–17762.
- \[4\] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017). Supervised learning of universal sentence representations from natural language inference data. In Empirical Methods in Natural Language Processing, pp. 670–680.
- \[5\] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert (2024). RAGAS: automated evaluation of retrieval augmented generation. In European Chapter of the Association for Computational Linguistics.
- \[6\] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang (2023). Retrieval-augmented generation for large language models: a survey. arXiv preprint arXiv:2312.10997.
- \[7\] Z. Ge, Y. Wu, D. W. K. Chin, R. K. Lee, and R. Cao (2025). Resolving conflicting evidence in automated fact-checking: a study on retrieval-augmented LLMs. arXiv preprint arXiv:2505.17762.
- \[8\] C. Hwang and K. Yoon (1981). Multiple attribute decision making: methods and applications. Springer-Verlag, Berlin.
- \[9\] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, and E. Grave (2022). Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research.
- \[10\] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, A. Madotto, and P. Fung (2023). Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), pp. 1–38.
- \[11\] Z. Jin, P. Cao, Y. Chen, K. Liu, X. Jiang, J. Xu, Q. Li, and J. Zhao (2024). Tug-of-war between knowledge: exploring and resolving knowledge conflicts in retrieval-augmented language models. arXiv preprint arXiv:2402.14409.
- \[12\] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, and K. Lee (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466.
- \[13\] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, Vol. 33, pp. 9459–9474.
- \[14\] S. Liu, Y. Shang, and X. Zhang (2025). TruthfulRAG: resolving factual-level conflicts in retrieval-augmented generation with knowledge graphs. arXiv preprint arXiv:2511.10375.
- \[15\] S. Longpre, K. Perisetla, A. Chen, N. Ramesh, C. DuBois, and S. Singh (2021). Entity-based knowledge conflicts in question answering. In Empirical Methods in Natural Language Processing, pp. 7052–7063.
- \[16\] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, and H. Hajishirzi (2023). When not to trust language models: investigating effectiveness of parametric and non-parametric memories. In Annual Meeting of the Association for Computational Linguistics, pp. 9802–9822.
- \[17\] S. Min, J. Michael, H. Hajishirzi, and L. Zettlemoyer (2020). AmbigQA: answering ambiguous open-domain questions. In Empirical Methods in Natural Language Processing, pp. 5783–5797.
- \[18\] OpenAI (2024). GPT-4o: system card. Technical Report.
- \[19\] S. Robertson and H. Zaragoza (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3(4), pp. 333–389.
- \[20\] H. Wang, A. Prasad, E. Stengel-Eskin, and M. Bansal (2025). Retrieval-augmented generation with conflicting evidence. arXiv preprint arXiv:2504.13079.
- \[21\] Y. Wang, S. Feng, H. Wang, W. Shi, V. Balachandran, T. He, and Y. Tsvetkov (2023). Resolving knowledge conflicts in large language models. arXiv preprint arXiv:2310.00935.
- \[22\] J. Xie, K. Zhang, J. Chen, R. Zhu, and Y. Xiao (2024). Adaptive chameleon or stubborn sloth: revealing the behavior of large language models in knowledge conflicts. In International Conference on Learning Representations.
- \[23\] R. Xu, Z. Qi, C. Wang, H. Wang, Y. Zhang, and W. Xu (2024). Knowledge conflicts for LLMs: a survey. arXiv preprint arXiv:2403.08319.
- \[24\] S. Yan, J. Gu, Y. Zhu, and Z. Ling (2024). Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884.
- \[25\] H. Ye, S. Chen, Z. Zhong, C. Xiao, H. Zhang, Y. Wu, and F. Shen (2026). Seeing through the conflict: transparent knowledge conflict handling in retrieval-augmented generation. arXiv preprint arXiv:2601.06842.
- \[26\] Q. Zhang, Z. Xiang, Y. Xiao, L. Wang, J. Li, X. Wang, and J. Su (2025). FaithfulRAG: fact-level conflict modeling for context-faithful retrieval-augmented generation. arXiv preprint arXiv:2506.08938.
- \[27\] L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al. (2023). Judging LLM-as-a-judge with MT-Bench and chatbot arena. In Advances in Neural Information Processing Systems, Vol. 36.
