ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
Summary
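ConflictScore is a metric that quantifies how well a language model's response acknowledges conflicting evidence in its grounding documents. It decomposes a response into atomic claims, labels each claim against each grounding document, and aggregates the labels into two measures: the proportion of claims exhibiting conflicts (CS-C) and the balance between supporting and contradicting evidence (CS-R). The paper also introduces ConflictBench, a benchmark covering ambiguity, contradiction, and divergent opinions, and reports that ConflictScore can serve as corrective feedback that improves truthfulness on TruthfulQA.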
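The aggregation step can be illustrated with a minimal sketch. It assumes the claim decomposition and the per-document labeling have already been done, so the input is just a list of labels per claim. The label names, the function names, and above all the formulas are assumptions made for illustration: the abstract only describes CS-C as the proportion of claims exhibiting conflicts and CS-R as the balance between supporting and contradicting evidence. Here a claim counts as conflicted when it has at least one supporting and one contradicting document, and balance is taken as min/max of the two counts, which may differ from the paper's definition.

```python
from typing import Dict, List

# Hypothetical label names; the paper's actual label set may differ.
SUPPORT, CONTRADICT, NEUTRAL = "support", "contradict", "neutral"


def claim_balance(labels: List[str]) -> float:
    """Balance of one claim: 0.0 if evidence is one-sided, 1.0 if evenly split.

    Assumed formula: min(support, contradict) / max(support, contradict).
    """
    support = labels.count(SUPPORT)
    contradict = labels.count(CONTRADICT)
    if support == 0 or contradict == 0:
        return 0.0
    return min(support, contradict) / max(support, contradict)


def conflict_scores(claim_labels: List[List[str]]) -> Dict[str, float]:
    """Aggregate per-claim, per-document labels into CS-C-like and CS-R-like values."""
    if not claim_labels:
        return {"cs_c": 0.0, "cs_r": 0.0}
    balances = [claim_balance(labels) for labels in claim_labels]
    # A claim is treated as conflicted when both kinds of evidence are present.
    conflicted = sum(1 for b in balances if b > 0)
    return {
        "cs_c": conflicted / len(balances),      # share of claims with conflicts
        "cs_r": sum(balances) / len(balances),   # mean balance across claims
    }


# Three claims, each labeled against its grounding documents.
example = [
    [SUPPORT, SUPPORT, CONTRADICT],              # conflicted, balance 0.5
    [SUPPORT, NEUTRAL, NEUTRAL],                 # one-sided, balance 0.0
    [CONTRADICT, CONTRADICT, SUPPORT, SUPPORT],  # conflicted, balance 1.0
]
scores = conflict_scores(example)
print(scores)
```

In this toy input two of three claims have evidence on both sides, so the count-style value is 2/3 and the mean balance is 0.5. A response that states such claims without qualification would be a candidate for the overconfidence the metric is meant to flag.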
# ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence

Source: [https://arxiv.org/abs/2606.26437](https://arxiv.org/abs/2606.26437)

[View PDF](https://arxiv.org/pdf/2606.26437)

> Abstract: Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model's response acknowledges conflicting evidence in its grounding documents. Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric. Experiments show that ConflictScore effectively detects overconfident claims across domains and can serve as a corrective feedback mechanism that improves truthfulness on TruthfulQA.

## Submission history

From: Siyi Liu \[[view email](https://arxiv.org/show-email/461de426/2606.26437)\]

**\[v1\]** Wed, 24 Jun 2026 23:00:09 UTC (459 KB)
Similar Articles
A better method for identifying overconfident large language models
MIT researchers developed a new method for identifying overconfident LLMs by measuring cross-model disagreement across similar models, rather than relying solely on self-consistency metrics. This approach better captures epistemic uncertainty and more accurately identifies unreliable predictions in high-stakes applications.
When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering
This paper evaluates six open-weight LLMs on biomedical QA under conflicting evidence conditions, revealing accuracy drops and prediction flips, and proposes a conflict-aware abstention score that improves selective accuracy.
Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation
This paper proposes a three-regime framework to resolve empirical contradictions in how LLMs handle conflict between training knowledge and new documents, validated across five major models. It distinguishes between parametric strength and uniqueness and demonstrates how task framing and evidence coherence significantly impact model behavior.
From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs
The paper generalizes contrastive decoding to a conflict-aware paradigm that dynamically allocates authority between external context and parametric priors, proposes the TriState-Bench evaluation protocol, and introduces Adaptive Regime Routing (ARR) to resolve asymmetry between correction and resistance.
SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
SoCRATES introduces a realistic multi-domain benchmark for evaluating proactive LLM mediators, showing that top models resolve only about one-third of the consensus gap in conflict resolution.