Prediction Bottlenecks Don't Discover Causal Structure (But Here's What They Actually Do)
Summary
This paper challenges the claim that prediction bottlenecks in models like Mamba recover causal structure, demonstrating through a new benchmark that gains are largely due to confounds and robustness artifacts rather than true causal discovery.
Cached at: 05/13/26, 04:13 AM
Source: https://huggingface.co/papers/2605.09169
This paper falsifies the claim that next-step prediction bottlenecks (especially Mamba/SSM weight projections) recover causal structure. It shows instead that their apparent gains come mostly from low-rank regression, sample-size confounds, intervention-semantics artifacts, and target-corruption robustness. The main durable contribution is a reusable falsification benchmark.
➡️ Key Highlights of the Prediction-as-Causal-Discovery Falsification Framework:
🧪 Reusable Five-Stage Falsification Benchmark: Introduces a control-heavy benchmark spanning VAR, Lorenz-96, CauseMe-style generators, real datasets with edge-provenance cards, matched-capacity architectures, size-matched observational controls, and multiple intervention semantics to stress-test claims that prediction models implicitly recover causal graphs.
🧩 Weight-Projection Causality Does Not Survive Controls: Tests the extraction rule $S = |W_{out} W_{in}|$ for bottleneck predictors and shows that linear bottlenecks match or beat Mamba SSMs, tuned Lasso dominates on synthetic graph recovery, and classical PCMCI/Granger-style methods outperform the bottleneck on clean Lorenz-96 ground truth.
🧠 Intervention Gains Are Confounds, Not Causal Extraction: Demonstrates that the reported interventional advantage mostly comes from extra sample size and a non-standard per-step random-forcing intervention. Under proper $do(X_i = c)$ interventions the effect nearly vanishes, while the residual appears even more strongly in classical bivariate Granger, indicating method-agnostic target-corruption robustness rather than learned causal discovery.
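To make the extraction rule and the two intervention semantics concrete, here is a minimal sketch in Python. It is not the paper's code. It assumes a toy VAR(1) chain, stands in for the bottleneck predictor with a rank-constrained linear regression (reduced-rank least squares), and reads the hard intervention as clamping one variable to a constant and the random forcing as overwriting it with fresh noise at every step. All function names, the matrix `A`, and the parameter values are hypothetical.

```python
import numpy as np


def simulate_var1(A, T, rng, noise=0.1, intervene=None, target=0, c=1.0):
    """Simulate x_{t+1} = A x_t + noise, where A[i, j] is the effect of j on i.

    intervene=None      observational data
    intervene="do"      clamp x[target] = c at every step (assumed do(X_i = c))
    intervene="forcing" overwrite x[target] with a fresh random draw each step
    """
    d = A.shape[0]
    X = np.zeros((T, d))
    x = rng.normal(size=d)
    for t in range(T):
        if intervene == "do":
            x[target] = c
        elif intervene == "forcing":
            x[target] = rng.normal()
        X[t] = x
        x = A @ x + noise * rng.normal(size=d)
    return X


def fit_linear_bottleneck(X, rank):
    """Rank-constrained next-step predictor: x_{t+1} ~ W_out @ (W_in @ x_t)."""
    X0, Y = X[:-1], X[1:]
    # Unconstrained least squares first: Y ~ X0 @ B.
    B, *_ = np.linalg.lstsq(X0, Y, rcond=None)
    # Keep the top-`rank` right singular directions of the fitted values.
    _, _, Vt = np.linalg.svd(X0 @ B, full_matrices=False)
    V_r = Vt[:rank].T            # shape (d, rank)
    W_in = (B @ V_r).T           # shape (rank, d): encodes x_t into the bottleneck
    W_out = V_r                  # shape (d, rank): decodes to the next step
    return W_in, W_out


def edge_scores(W_in, W_out):
    """Extraction rule S = |W_out W_in|; S[i, j] scores the edge j -> i."""
    return np.abs(W_out @ W_in)


# Hypothetical ground truth: a chain 0 -> 1 -> 2 -> 3 -> 4 with self-decay.
d = 5
A = 0.5 * np.eye(d)
for i in range(d - 1):
    A[i + 1, i] = 0.4

rng = np.random.default_rng(0)
for mode in (None, "do", "forcing"):
    X = simulate_var1(A, 2000, rng, intervene=mode)
    S = edge_scores(*fit_linear_bottleneck(X, rank=3))
    print(mode, np.round(S, 2), sep="\n")
```

Comparing the printed score matrices against `A` across the three modes gives a feel for how much the recovered scores depend on the intervention semantics rather than on the predictor. Every mode here uses the same number of samples, a small-scale analogue of the size-matched controls the benchmark calls for.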
Similar Articles
The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction
This paper evaluates the practical effectiveness of Markov boundaries for tabular prediction, finding that while theoretically optimal, current causal discovery methods fail to consistently improve predictive performance due to computational limitations and mismatched optimization goals.
Score-Based Causal Discovery of Latent Variable Causal Models
This paper introduces score-based methods for causal discovery in the presence of latent variables, offering theoretical guarantees of consistency and score equivalence, and unifies several constraint-based approaches.
A Geometric View of Counterfactual Behavior: Interaction of Boundary Proximity and Local Support
This paper examines counterfactual behavior in ML models through a geometric lens, showing that models with similar predictive performance can differ substantially in counterfactual outcomes due to the interaction between decision-boundary proximity and local data support. The findings identify counterfactual behavior as a distinct dimension from predictive performance, with implications for model selection and reliability of counterfactual explanation methods.
CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists
CausaLab is a scalable environment for evaluating LLM agents on interactive causal discovery, assessing both predictive accuracy and faithful recovery of underlying causal mechanisms. Experiments reveal a gap between prediction and mechanism recovery, highlighting limits in current LLM agents as experimental causal reasoners.
Getting good predictions without data cleaning (Why "Garbage In, Garbage Out" is sometimes a trap)
This arXiv preprint challenges the 'Garbage In, Garbage Out' heuristic, arguing that aggressive manual data cleaning can limit predictive performance on high-dimensional tabular data by removing the dimensions needed to triangulate latent drivers.