Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

arXiv cs.AI Papers

Summary

This paper investigates whether real-world datasets contain natural experiments, using causal discovery and feature selection based on causal links to detect them. It finds that they do, and that exploiting these natural experiments can improve downstream model performance.

arXiv:2606.03251v1

Abstract: In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.

Cached at: 06/03/26, 09:43 AM

# Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection
Source: [https://arxiv.org/abs/2606.03251](https://arxiv.org/abs/2606.03251)
[View PDF](https://arxiv.org/pdf/2606.03251)

> Abstract: In nature, events that affect some individuals or groups but not others constitute an implicit intervention and are known as natural experiments. For example, the COVID-19 pandemic was an intervention by the coronavirus on the sub-population infected with COVID. We ask, do natural experiments occur in existing real-world datasets? If yes, how should we treat them? To detect natural experiments in data, we use causal discovery to recover the underlying causal graph and perform feature selection based on causal links. If downstream performance improves by treating the data as interventional rather than observational, we argue that this suggests the dataset contains natural experiments. We first validate this hypothesis by simulating datasets with and without natural experiments using synthetic graphs. We then perform a systematic empirical evaluation on a large suite of real-world datasets. Our results indicate that real-world datasets do contain natural experiments and we can take advantage of those natural experiments to improve model performance using causal inference. Our work represents the initial foray into this area, offering a preliminary exploration within a limited scope.
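
The abstract describes selecting features by their causal links and checking whether downstream performance improves. The following is a minimal toy sketch of that general idea, not the paper's procedure. It assumes a hypothetical three-feature linear model (`x1 -> y <- x2`, `y -> x3`), assumes the causal parents of `y` are already known rather than recovered by a causal discovery algorithm, and models the natural experiment as a hypothetical intervention that sets `x3` independently of `y` in the evaluation data. All names, coefficients, and sample sizes are illustrative assumptions.

```python
import numpy as np


def simulate(n, intervene_on_child, rng):
    """Draw n samples from a toy linear SCM: x1 -> y <- x2, y -> x3."""
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)
    if intervene_on_child:
        # Hypothetical natural experiment: x3 is set independently of y.
        x3 = rng.normal(scale=2.0, size=n)
    else:
        # Observational regime: x3 is a noisy copy of its cause y.
        x3 = y + rng.normal(scale=0.1, size=n)
    return np.column_stack([x1, x2, x3]), y


def fit_ols(X, y):
    """Ordinary least squares with an intercept column appended."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mse(coef, X, y):
    """Mean squared error of a fitted OLS model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))


rng = np.random.default_rng(0)
X_train, y_train = simulate(5000, intervene_on_child=False, rng=rng)
X_test, y_test = simulate(5000, intervene_on_child=True, rng=rng)

# Assumed result of a causal discovery step: columns 0 and 1 are parents of y.
parents = [0, 1]

coef_all = fit_ols(X_train, y_train)
coef_causal = fit_ols(X_train[:, parents], y_train)

mse_all = mse(coef_all, X_test, y_test)
mse_causal = mse(coef_causal, X_test[:, parents], y_test)

print(f"all features:        test MSE = {mse_all:.3f}")
print(f"causal parents only: test MSE = {mse_causal:.3f}")
```

In this toy setting the model trained on all features leans on the child variable `x3`, so its error grows once the intervention breaks the link from `y` to `x3`, while the parents-only model is unaffected. Real datasets would need an actual causal discovery step in place of the hard-coded `parents` list.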

## Submission history

From: Gautam Gare

**[v1]** Tue, 2 Jun 2026 07:12:30 UTC (5,942 KB)

Similar Articles

Optimal Experiments for Partial Causal Effect Identification

arXiv cs.AI

This paper introduces the 'max-potency problem' for selecting cost-constrained experiments to maximize the tightening of bounds on partial causal effects. The authors propose graphical pruning criteria to reduce the search space and demonstrate the method on NHANES health data.

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

Hugging Face Daily Papers

CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.

Causal Discovery in the Era of Agents

Hugging Face Daily Papers

This paper argues that language model agents should assist causal discovery workflows by providing contextual support and explanations rather than generating causal conclusions, and introduces the causal-learn+ platform to demonstrate this principle.

Large-scale study of curiosity-driven learning

OpenAI Blog

OpenAI presents a large-scale empirical study of curiosity-driven reinforcement learning without extrinsic rewards across 54 benchmark environments, showing strong performance and investigating the role of feature spaces in prediction-based reward signals.