Tag
This paper systematically tests linear probes for deception detection in large language models, finding that they fail under distribution shift but that style-augmented probes recover performance, and revealing that deception is encoded through distributed sub-threshold features.
Introduces counterfactual localization to identify when language models become committed to deception during reasoning, using five environments and a corpus of 1.46M sentences across four reasoning models. Shows that attention-based transition features generalize across environments for detecting deceptive commitment.