Evaluating Bivariate Causal Statements Based on Mutual Compatibility

arXiv cs.AI Papers

Summary

This paper introduces compatibility and incompatibility scores for evaluating collections of bivariate causal statements without relying on faithfulness, and demonstrates their applicability by analyzing causal claims from large language models.

arXiv:2606.00278v1. Abstract: For many real-world systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop methods for evaluating collections of $\binom{n}{2}$ bivariate causal statements over a set of $n$ variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it imposes substantial additional confounding to explain observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. Additionally, we define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints that are derived from acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can successfully distinguish correct from incorrect causal statements in generic settings. Moreover, we demonstrate the practical applicability of our methods by analyzing causal claims made by large language models. Our work aims to provide a foundation for assessing the reliability of causal information derived from human experts or artificial intelligence in settings where alternative forms of validation are unavailable.

Cached at: 06/02/26, 03:46 PM

# Evaluating Bivariate Causal Statements Based on Mutual Compatibility
Source: [https://arxiv.org/abs/2606.00278](https://arxiv.org/abs/2606.00278)
[View PDF](https://arxiv.org/pdf/2606.00278)

> Abstract: For many real-world systems, causal ground truth is difficult to obtain, making claims about causal effects hard to assess. We develop methods for evaluating collections of $\binom{n}{2}$ bivariate causal statements over a set of $n$ variables. In the setting of acyclic linear statements, any such collection can be extended to a unique multivariate causal model, but we argue that this induced model is implausible if it imposes substantial additional confounding to explain observed correlations. We introduce a compatibility score that quantifies this notion of plausibility, notably without relying on the faithfulness assumption. Additionally, we define an incompatibility score for purely graphical bivariate causal statements, based on global consistency constraints that are derived from acyclicity and faithfulness assumptions. We give theoretical and empirical evidence that both scores can successfully distinguish correct from incorrect causal statements in generic settings. Moreover, we demonstrate the practical applicability of our methods by analyzing causal claims made by large language models. Our work aims to provide a foundation for assessing the reliability of causal information derived from human experts or artificial intelligence in settings where alternative forms of validation are unavailable.
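
To make the setting concrete, the following minimal sketch enumerates the $\binom{n}{2}$ variable pairs and checks one of the global constraints the abstract names, acyclicity, on a collection of purely graphical statements of the form "X causes Y". It is not the paper's incompatibility score, which the abstract does not specify; the function name `has_cycle`, the variable names, and the example claims are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import comb


def has_cycle(variables, claims):
    """Return True if the directed claims (cause, effect) contain a cycle.

    Hypothetical helper: uses Kahn's algorithm, not a method from the paper.
    """
    indegree = {v: 0 for v in variables}
    children = {v: [] for v in variables}
    for cause, effect in claims:
        children[cause].append(effect)
        indegree[effect] += 1

    # Repeatedly remove variables that have no remaining causes.
    queue = [v for v in variables if indegree[v] == 0]
    removed = 0
    while queue:
        v = queue.pop()
        removed += 1
        for child in children[v]:
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)

    # If some variables could never be removed, they lie on a cycle.
    return removed < len(variables)


# Illustrative variables; one bivariate statement is made per unordered pair.
variables = ["A", "B", "C"]
pairs = list(combinations(variables, 2))
assert len(pairs) == comb(len(variables), 2)

# Two hypothetical collections of pairwise direction claims.
consistent = [("A", "B"), ("B", "C"), ("A", "C")]
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]

print(has_cycle(variables, consistent))
print(has_cycle(variables, cyclic))
```

A collection that fails this check cannot be embedded in any acyclic multivariate model, so it is a simple example of statements that are mutually incompatible even though each one may look plausible in isolation.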

## Submission history

From: Dominik Janzing. **\[v1\]** Fri, 29 May 2026 19:15:09 UTC (2,709 KB)
