Tag
This paper challenges the claim that prediction bottlenecks in models like Mamba recover causal structure, demonstrating through a new benchmark that gains are largely due to confounds and robustness artifacts rather than true causal discovery.
An undergraduate researcher expresses disillusionment with recent mechanistic interpretability research from Anthropic, specifically criticizing their new natural language autoencoder approach as a black-box technique that lacks rigorous metric comparisons against sparse autoencoder baselines.