causal-mediation

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

arXiv cs.LG · 6d ago

This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.
