Tag
This paper investigates the production-evaluation gap in large reasoning models (LRMs), finding that they fail to robustly evaluate reasoning despite near-perfect solution production, due to an answer confirmation bias.
This paper presents a methodology for delineating concepts and training linear probes to detect them in LLM embeddings, using four example concepts across three models. The work aims to enable scalable monitoring of LLM internal representations.
This paper systematically tests linear probes for deception detection in large language models, finding they fail under distributional shifts but style-augmented probes recover performance, and revealing that deception is encoded through distributed sub-threshold features.