deception-detection

Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations

Hugging Face Daily Papers · 2026-05-27 Cached

This paper systematically tests linear probes for deception detection in large language models. It finds that the probes fail under distribution shift, that style-augmented probes recover performance, and that deception is encoded through distributed sub-threshold features.

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

arXiv cs.CL · 2026-05-19 Cached

Introduces counterfactual localization to identify when language models become committed to deception during reasoning, using five environments and a corpus of 1.46M sentences across four reasoning models. Shows that attention-based transition features generalize across environments for detecting deceptive commitment.
