contrastive-activation-additions

Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

arXiv cs.AI · 2026-05-22

This paper investigates whether off-the-shelf persona steering vectors can reduce sycophancy in large language models. It finds that these vectors achieve 68-98% of the effect of targeted Contrastive Activation Addition (CAA) without requiring sycophancy-specific training data, and concludes that sycophancy is better understood as a persona-level property.
