CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
Summary
CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.
Source: https://huggingface.co/papers/2605.26029
Published on May 28
Submitted by Dylan (https://huggingface.co/shizhuo2) on May 29
Abstract
CausaLab evaluates LLM agents on causal discovery by requiring both accurate predictions and faithful recovery of underlying causal mechanisms through synthetic experimental scenarios.
We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithfully recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge F_1. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents’ limits as experimental causal reasoners.
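To make the setup in the abstract concrete, the following is a minimal sketch of a randomly sampled SCM, an intervention on one node, and an F_1 score over directed edges. It is an illustration only and not CausaLab's implementation: the linear-Gaussian form, the function names (`sample_linear_scm`, `simulate`, `edge_f1`), the edge probability, and the weight range are all assumptions, and the paper's "all-edge F_1" may be defined differently.

```python
import random

def sample_linear_scm(n_nodes, edge_prob, rng):
    """Sample a random DAG with linear weights (assumed form, not CausaLab's).

    Nodes are indexed in topological order, so every edge runs i -> j with i < j.
    Returns a dict mapping (parent, child) to a nonzero weight.
    """
    weights = {}
    for child in range(n_nodes):
        for parent in range(child):
            if rng.random() < edge_prob:
                weights[(parent, child)] = rng.uniform(0.5, 2.0) * rng.choice([-1, 1])
    return weights

def simulate(weights, n_nodes, rng, do=None, noise_std=0.1):
    """Draw one sample. `do` maps node -> forced value, i.e. an intervention."""
    do = do or {}
    values = []
    for node in range(n_nodes):
        if node in do:
            # An intervention overrides the structural equation for this node.
            values.append(do[node])
        else:
            total = sum(w * values[p] for (p, c), w in weights.items() if c == node)
            values.append(total + rng.gauss(0.0, noise_std))
    return values

def edge_f1(true_edges, predicted_edges):
    """F_1 over directed edges: harmonic mean of edge precision and recall."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    tp = len(true_edges & predicted_edges)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted_edges)
    recall = tp / len(true_edges)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    rng = random.Random(0)
    scm = sample_linear_scm(n_nodes=6, edge_prob=0.4, rng=rng)
    observed = simulate(scm, 6, rng)                 # observational sample
    intervened = simulate(scm, 6, rng, do={0: 3.0})  # sample under do(X0 = 3)
    print(sorted(scm), observed, intervened)
```

Under these assumptions, an agent that predicts a downstream value well can still score poorly on `edge_f1`, for example by reversing an edge: a reversed edge counts as both a false positive and a false negative, which is one way high task accuracy and low structural fidelity can coexist.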
Similar Articles
"Excuse me, may I say something..." CoLabScience, A Proactive AI Assistant for Biomedical Discovery and LLM-Expert Collaborations
CoLabScience introduces a proactive LLM assistant for biomedical research that autonomously intervenes in scientific discussions using PULI (Positive-Unlabeled Learning-to-Intervene), a novel reinforcement learning framework that determines when and how to contribute context-aware insights. The work includes BSDD, a new benchmark dataset of simulated research dialogues with intervention points derived from PubMed articles.
LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs
LLM-AutoSciLab is a closed-loop framework that uses LLMs to iteratively generate hypotheses, select informative experiments, and refine mechanisms, achieving superior accuracy and sample efficiency on physics and biology benchmarks over prior static methods.
Large Language Models for Causal Relations Extraction in Social Media: A Validation Framework for Disaster Intelligence
This paper proposes a validation framework for using Large Language Models to extract causal relations from social media posts during disasters. It evaluates the effectiveness of LLMs in identifying cause-effect relationships and compares them against expert-grounded reference graphs to assess reliability and risks.
LLM Explainability with Counterfactual Chains and Causal Graphs
This paper proposes a four-phase method for constructing causal graphs that model LLM inference processes, using counterfactual augmentation to enable stable causal discovery and provide transparent, concept-level explainability.
Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents
This paper introduces the Causal Sensitivity Score (CSS), an interventional metric that evaluates whether clinical LLMs and agents appropriately update their recommendations when patient inputs change along clinically meaningful dimensions. It reveals hidden capability profiles not captured by standard coverage-based metrics, exposing safety blind spots and structural responsiveness deficits.