linear-representations

Tag

Cards List
#linear-representations

When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception

arXiv cs.LG · 3d ago Cached

This paper studies synthetic dishonesty in LLMs by fine-tuning honest and deceptive variants of five transformer models and finding that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.

0 favorites 0 likes
← Back to home

Submit Feedback