ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence

arXiv cs.CL Papers

Summary

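ConflictScore is a new metric that quantifies how well a language model's response acknowledges conflicting evidence in its grounding documents. It decomposes each response into atomic claims, labels every claim against every grounding document, and aggregates the labels into two measures: the proportion of claims exhibiting conflicts (CS-C) and the balance between supporting and contradicting evidence (CS-R). The paper also introduces ConflictBench, a benchmark covering ambiguity, contradiction, and divergent opinions, and reports that using ConflictScore as corrective feedback improves truthfulness on TruthfulQA.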

arXiv:2606.26437v1. Abstract: Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model's response acknowledges conflicting evidence in its grounding documents. Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric. Experiments show that ConflictScore effectively detects overconfident claims across domains and can serve as a corrective feedback mechanism that improves truthfulness on TruthfulQA.
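
The aggregation step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the label names (`support`, `contradict`, `neutral`), the function names `cs_count` and `cs_ratio`, the rule that a claim is "conflicted" when at least one document supports it and at least one contradicts it, and the choice of min/max as the balance measure. The paper's actual definitions may differ.

```python
from collections import Counter
from typing import List

# labels[i][j] is the (hypothetical) label of atomic claim i against
# grounding document j: "support", "contradict", or "neutral".
Labels = List[List[str]]


def cs_count(labels: Labels) -> float:
    """Assumed CS-C: fraction of claims with both supporting and
    contradicting documents."""
    if not labels:
        return 0.0
    conflicted = sum(
        1 for row in labels if "support" in row and "contradict" in row
    )
    return conflicted / len(labels)


def cs_ratio(labels: Labels) -> float:
    """Assumed CS-R: mean over claims of min(s, c) / max(s, c), where s and c
    count supporting and contradicting documents. 1.0 means evenly split
    evidence, 0.0 means one-sided evidence. Claims with no supporting or
    contradicting document are skipped."""
    ratios = []
    for row in labels:
        counts = Counter(row)
        s, c = counts["support"], counts["contradict"]
        if s + c == 0:
            continue  # no relevant evidence for this claim
        ratios.append(min(s, c) / max(s, c))
    return sum(ratios) / len(ratios) if ratios else 0.0


# Four claims labeled against three grounding documents (made-up data).
example = [
    ["support", "contradict", "neutral"],     # conflicted, balanced
    ["support", "support", "neutral"],        # one-sided
    ["support", "contradict", "contradict"],  # conflicted, skewed
    ["neutral", "neutral", "neutral"],        # no relevant evidence
]

print(cs_count(example))  # 0.5
print(cs_ratio(example))  # 0.5
```

Under these assumptions, two of the four example claims are conflicted, so the count measure is 0.5. The three claims with relevant evidence have balances 1.0, 0.0, and 0.5, which average to 0.5. The claim decomposition and per-document labeling steps, which the paper performs upstream, are taken as given here.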

# ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
Source: [https://arxiv.org/abs/2606.26437](https://arxiv.org/abs/2606.26437)
[View PDF](https://arxiv.org/pdf/2606.26437)

## Submission history

From: Siyi Liu **[v1]** Wed, 24 Jun 2026 23:00:09 UTC (459 KB)
