Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)

Hugging Face Daily Papers

Summary

This paper challenges the claim that prediction bottlenecks in models like Mamba recover causal structure, demonstrating through a new benchmark that gains are largely due to confounds and robustness artifacts rather than true causal discovery.

A Mamba state-space model trained only for next-step prediction appears to recover Granger-causal structure through a simple readout S = |W_{out} W_{in}|, with early experiments suggesting the phenomenon generalized across architectures and benefited from interventional data at p < 10^{-5}. We package the protocol used to test that claim -- standardized synthetic generators (VAR/Lorenz/CauseMe-style), three intervention semantics (do(X=c), soft-noise, random-forcing), edge-provenance cards on three real datasets, and size-matched control arms -- as a reusable falsification benchmark, and walk the claim through it in five stages. The method-level claim does not survive: (i) a plain linear bottleneck does as well or better; (ii) tuned Lasso beats the bottleneck on synthetic CauseMe-style benchmarks, and on Lorenz-96 (the only real benchmark with unambiguous ground truth) classical PCMCI and Granger lead a tight cluster in which the bottleneck trails; (iii) the headline intervention advantage is roughly 60% a sample-size confound, and the residual disappears under standard do(X=c) interventions, surviving only under a non-standard random-forcing scheme; (iv) even that residual reproduces, with a larger effect, in classical bivariate Granger -- the effect is method-agnostic. What survives is a narrow characterization result; the benchmark is the lasting artifact, and each stage above is one of its control arms.
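
The readout S = |W_{out} W_{in}| and the "plain linear bottleneck" control can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and not the paper's code: the toy VAR(1) generator, the helper names `simulate_var1` and `linear_bottleneck_readout`, and the choice to obtain the two factors by truncating the SVD of an ordinary least-squares fit instead of training a predictor. The convention assumed is that S[i, j] scores the edge j -> i.

```python
import numpy as np

def simulate_var1(A, n_steps, rng, noise_std=0.1):
    """Simulate x_{t+1} = A x_t + noise (a toy stand-in for the VAR generator)."""
    d = A.shape[0]
    X = np.zeros((n_steps, d))
    for t in range(1, n_steps):
        X[t] = A @ X[t - 1] + noise_std * rng.standard_normal(d)
    return X

def linear_bottleneck_readout(X, rank):
    """Return S = |W_out @ W_in| for a rank-limited linear next-step predictor.

    Assumption: the bottleneck is approximated by a truncated SVD of the
    least-squares transition matrix, not by gradient training.
    """
    past, future = X[:-1], X[1:]
    # Least squares solves past @ B ~= future, so B.T maps x_t to x_{t+1}.
    B, *_ = np.linalg.lstsq(past, future, rcond=None)
    U, s, Vt = np.linalg.svd(B.T, full_matrices=False)
    W_out = U[:, :rank] * s[:rank]   # shape (d, rank): decoder
    W_in = Vt[:rank]                 # shape (rank, d): encoder
    return np.abs(W_out @ W_in)

rng = np.random.default_rng(0)
# Hypothetical rank-2 ground truth: x0 drives x0 and x1, x2 drives x2 and x3.
A = np.array([[0.6, 0.0, 0.0, 0.0],
              [0.6, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.6, 0.0],
              [0.0, 0.0, 0.6, 0.0]])
X = simulate_var1(A, 10_000, rng)
S = linear_bottleneck_readout(X, rank=2)
print(np.round(S, 2))
```

In this toy setting the readout is just a low-rank regression estimate of the transition matrix, which is consistent with the finding that a plain linear bottleneck matches the Mamba readout; the sketch says nothing about nonlinear systems such as Lorenz-96.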

Cached at: 05/13/26, 04:13 AM

Source: https://huggingface.co/papers/2605.09169

This paper falsifies the claim that next-step prediction bottlenecks, especially Mamba/SSM weight projections, recover causal structure. It shows instead that their apparent gains mostly reflect low-rank regression, sample-size confounds, intervention-semantics artifacts, and target-corruption robustness; the main durable contribution is a reusable falsification benchmark.

➡️ Key highlights of the prediction-as-causal-discovery falsification framework:

🧪 Reusable Five-Stage Falsification Benchmark: Introduces a control-heavy benchmark spanning VAR, Lorenz-96, CauseMe-style generators, real datasets with edge-provenance cards, matched-capacity architectures, size-matched observational controls, and multiple intervention semantics to stress-test claims that prediction models implicitly recover causal graphs.

🧩 Weight-Projection Causality Does Not Survive Controls: Tests the extraction rule S = |W_{out} W_{in}| for bottleneck predictors and shows that linear bottlenecks match or beat Mamba SSMs, tuned Lasso dominates on synthetic graph recovery, and classical PCMCI/Granger-style methods outperform the bottleneck on clean Lorenz-96 ground truth.

🧠 Intervention Gains Are Confounds, Not Causal Extraction: Demonstrates that the reported interventional advantage mostly comes from extra sample size and a non-standard per-step random-forcing intervention; under standard do(X_i = c) interventions the effect nearly vanishes, while the residual appears even more strongly in classical bivariate Granger, indicating method-agnostic target-corruption robustness rather than learned causal discovery.
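
The three intervention semantics and the size-matched observational control named above can be sketched on a toy VAR(1) system. The summary does not give the paper's exact definitions, so the readings below are assumptions: do(X = c) clamps the target to a constant at every step, soft-noise adds extra noise on top of the natural dynamics, and random-forcing overwrites the target with a fresh random draw at every step. The function name `simulate`, its parameters, and the sample sizes are hypothetical.

```python
import numpy as np

def simulate(A, n_steps, rng, target=None, mode="none",
             c=1.0, scale=1.0, noise_std=0.1):
    """Simulate x_{t+1} = A x_t + noise under an assumed intervention mode."""
    d = A.shape[0]
    X = np.zeros((n_steps, d))
    for t in range(1, n_steps):
        X[t] = A @ X[t - 1] + noise_std * rng.standard_normal(d)
        if mode == "do":                  # hard clamp: do(X_target = c)
            X[t, target] = c
        elif mode == "soft_noise":        # extra noise added to the target
            X[t, target] += scale * rng.standard_normal()
        elif mode == "random_forcing":    # target replaced by a fresh draw
            X[t, target] = scale * rng.standard_normal()
        elif mode != "none":
            raise ValueError(f"unknown mode: {mode}")
    return X

rng = np.random.default_rng(0)
A = np.array([[0.6, 0.0, 0.0],
              [0.4, 0.5, 0.0],
              [0.0, 0.4, 0.5]])
n_obs, n_int = 2000, 500   # hypothetical sample sizes

obs = simulate(A, n_obs, rng)
do_run = simulate(A, n_int, rng, target=0, mode="do", c=1.0)
soft = simulate(A, n_int, rng, target=0, mode="soft_noise")
forced = simulate(A, n_int, rng, target=0, mode="random_forcing")

# Size-matched control: purely observational data with the same total
# sample count as "observational + interventional", so that any gain
# from interventions can be separated from a gain from more samples.
obs_matched = simulate(A, n_obs + n_int, rng)
```

One thing the sketch makes visible: under the constant clamp the target column has no variance after the first step, whereas random forcing injects fresh variance into the target at every step. A comparison that adds interventional data without also running `obs_matched` cannot tell an intervention effect apart from a sample-size effect.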
