counterfactual-localization


The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

arXiv cs.CL · 2026-05-19

Introduces counterfactual localization, a method for identifying the point in a reasoning trace at which a language model becomes committed to deception. The study covers five environments and four reasoning models, with a corpus of 1.46M sentences. It shows that attention-based transition features for detecting deceptive commitment generalize across environments.
