ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence

arXiv cs.CL 06/26/26, 04:00 AM Papers

conflicting-evidence factuality faithfulness metrics language-models truthfulness benchmark

Summary

ConflictScore is a new metric that quantifies how well language models acknowledge conflicting evidence in their grounding documents, decomposing responses into atomic claims and measuring conflict balance. The paper also introduces ConflictBench, a benchmark covering diverse conflict forms, and shows the metric can improve truthfulness on TruthfulQA.

arXiv:2606.26437v1. Abstract: Existing metrics for factuality and faithfulness evaluate whether an answer is supported or contradicted by its grounding documents, but they fail to capture when both supporting and contradicting evidence coexist. We introduce ConflictScore, a novel metric that quantifies how well a model's response acknowledges conflicting evidence in its grounding documents. Our framework decomposes responses into atomic claims, labels each claim against each grounding document, and then aggregates these labels into two complementary measures: ConflictScore-Count (CS-C), the proportion of claims exhibiting conflicts, and ConflictScore-Ratio (CS-R), the balance between supporting and contradicting evidence. We develop ConflictBench, a benchmark covering diverse forms of conflicts such as ambiguity, contradiction, and divergent opinions, to systematically evaluate our metric. Experiments show that ConflictScore effectively detects overconfident claims across domains and can serve as a corrective feedback mechanism that improves truthfulness on TruthfulQA.
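The aggregation step described in the abstract can be illustrated with a small sketch. The abstract gives no exact formulas, so everything below is an assumption made for illustration: the label names (`support`, `contradict`, `neutral`), the rule that a claim "exhibits conflict" when at least one document supports it and at least one contradicts it, and the choice of `min(s, c) / max(s, c)` as the balance term for CS-R. The function names `cs_count` and `cs_ratio` are hypothetical and are not taken from the paper. Claim decomposition and labeling are assumed to have already happened upstream.

```python
from typing import List

# Hypothetical label set; the abstract does not name the labels it uses.
SUPPORT, CONTRADICT, NEUTRAL = "support", "contradict", "neutral"


def cs_count(claim_labels: List[List[str]]) -> float:
    """Assumed CS-C: fraction of claims that have at least one supporting
    and at least one contradicting grounding document."""
    if not claim_labels:
        return 0.0
    conflicted = sum(
        1 for labels in claim_labels
        if SUPPORT in labels and CONTRADICT in labels
    )
    return conflicted / len(claim_labels)


def cs_ratio(claim_labels: List[List[str]]) -> float:
    """Assumed CS-R: mean balance min(s, c) / max(s, c) over claims that
    have any supporting or contradicting evidence. 0.0 means one-sided
    evidence, 1.0 means evenly split evidence."""
    balances = []
    for labels in claim_labels:
        s = labels.count(SUPPORT)
        c = labels.count(CONTRADICT)
        if s + c == 0:
            continue  # claim has no relevant evidence; skip it
        balances.append(min(s, c) / max(s, c))
    return sum(balances) / len(balances) if balances else 0.0


# One inner list per atomic claim, one label per grounding document.
example_labels = [
    [SUPPORT, CONTRADICT, NEUTRAL],   # conflicted, evenly balanced
    [SUPPORT, SUPPORT, NEUTRAL],      # supported only
    [SUPPORT, SUPPORT, CONTRADICT],   # conflicted, leaning toward support
    [NEUTRAL, NEUTRAL, NEUTRAL],      # no relevant evidence
]

print(cs_count(example_labels))  # 2 of 4 claims are conflicted
print(cs_ratio(example_labels))  # mean of 1.0, 0.0, and 0.5
```

Under these assumptions, a response that states a conflicted claim flatly, without acknowledging the disagreement, would be a candidate for the overconfidence the abstract says the metric detects. The paper's actual definitions may differ from this sketch.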

# ConflictScore: Identifying and Measuring How Language Models Handle Conflicting Evidence
Source: [https://arxiv.org/abs/2606.26437](https://arxiv.org/abs/2606.26437)
[View PDF](https://arxiv.org/pdf/2606.26437)


## Submission history

From: Siyi Liu [[view email](https://arxiv.org/show-email/461de426/2606.26437)] **[v1]** Wed, 24 Jun 2026 23:00:09 UTC (459 KB)

Similar Articles

A better method for identifying overconfident large language models

When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering

Three Regimes of Context-Parametric Conflict: A Predictive Framework and Empirical Validation

From Context-Aware to Conflict-Aware: Generalizing Contrastive Decoding for Knowledge Conflict in LLMs

SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
