causal-mediation


When and How Long? The Readout-Mediator Angle in Temporal Reasoning

arXiv cs.LG · 6d ago

This paper introduces the readout-mediator angle to demonstrate that linear probes can decode information from language model activations that is orthogonal to the model's actual causal computation, undermining probe-based interpretability. The finding replicates across model scales and families, revealing a fundamental failure mode in using probes for mechanistic understanding or safety monitoring.
