linear-representations

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv cs.LG · 3d ago

This paper studies synthetic dishonesty in LLMs by fine-tuning honest and deceptive variants of five transformer models. It finds that modest supervised fine-tuning can rapidly entrench robust, domain-invariant dishonesty representations, a result with implications for activation-based monitoring.
