This paper investigates whether off-the-shelf persona steering vectors can reduce sycophancy in large language models. It finds that they achieve 68-98% of the effect of targeted Contrastive Activation Addition (CAA) without requiring sycophancy-specific training data, and that sycophancy is better understood as a persona-level property.