Playing Devil's Advocate: Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

arXiv cs.AI 05/22/26, 04:00 AM Papers
Summary
This paper investigates whether off-the-shelf persona steering vectors can reduce sycophancy in large language models, finding they achieve 68-98% of the effect of targeted Contrastive Activation Addition (CAA) without requiring sycophancy-specific training data, and that sycophancy is better understood as a persona-level property.
arXiv:2605.21006v1 Announce Type: new Abstract: We study the effect of different personas on sycophancy: a model's agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately $68\%$ and $98\%$ of CAA's sycophancy reduction and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/.
Cached at: 05/22/26, 08:49 AM
# Off-the-Shelf Persona Vectors Rival Targeted Steering for Sycophancy
Source: [https://arxiv.org/html/2605.21006](https://arxiv.org/html/2605.21006)
## Playing Devil’s Advocate: Off\-the\-Shelf Persona Vectors Rival Targeted Steering for Sycophancy

Nebras Alam, Vikram Kakaria, Madhur Panwar, Vasu Sharma, Maheep Chaudhary

###### Abstract

We study the effect of different personas on sycophancy: a model’s agreement with users even when the user is incorrect. The standard mitigation, Contrastive Activation Addition (CAA), derives a steering direction from labelled pairs of sycophantic and honest responses. This study evaluates whether off-the-shelf persona steering vectors, originally developed for general role-playing and not trained on sycophancy data, can serve as an alternative. In two instruction-tuned models, steering toward personas characterised by doubt or scrutiny achieves approximately $68\%$ and $98\%$ of CAA’s sycophancy reduction and, unlike CAA, maintains accuracy when the user is correct. The effect is also asymmetric: steering toward agreeable personas does not produce a mirror increase in sycophancy. Geometrically, the persona vector is largely independent of the direction of sycophancy in activation space. Collectively, these findings suggest that sycophancy is better understood as a persona-level property rather than a single steerable direction. We release our code here: [https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/](https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/).

Sycophancy, Activation Steering, Persona Vectors, LLM Alignment

## 1 Introduction

Sycophancy is the tendency of large language models to agree with users regardless of factual correctness. It is among the most persistent failure modes of RLHF-trained systems (Perez et al., [2022](https://arxiv.org/html/2605.21006#bib.bib8); Sharma et al., [2023](https://arxiv.org/html/2605.21006#bib.bib12)). Even when a model has encoded the correct answer internally, the reward model’s preference for agreement can override it (Wang et al., [2025a](https://arxiv.org/html/2605.21006#bib.bib15)). How can we intervene on sycophancy without expensive retraining or curated behavioral datasets?

Contrastive Activation Addition (CAA) (Rimsky et al., [2024](https://arxiv.org/html/2605.21006#bib.bib10); Turner et al., [2023](https://arxiv.org/html/2605.21006#bib.bib13)) extracts a steering vector from contrastive sycophantic/honest prompt pairs and adds a scaled version during inference. While effective, CAA requires hundreds of behavior-specific pairs and must be re-curated for each new target behavior. A natural alternative is to reuse *persona* representations already learned by instruction-tuned models (Lu et al., [2026](https://arxiv.org/html/2605.21006#bib.bib6)).

We ask three questions. (i) Can off-the-shelf role vectors reduce forced-choice sycophancy comparably to CAA? (ii) Do critical/conformist family labels predict steering direction? (iii) Are effective role vectors geometrically distinct from the CAA direction? We evaluate on Gemma 2 27B (Gemma Team, [2024](https://arxiv.org/html/2605.21006#bib.bib3)) and Qwen 3 32B (Qwen Team, [2025](https://arxiv.org/html/2605.21006#bib.bib9)) using a counterbalanced PhilPapers benchmark with tune/test splitting and Holm correction. Overall, we make three contributions:

1. Critical-role vectors reach $68$–$98\%$ of CAA’s $\Delta\mathrm{logit}$ with strong cross-seed consistency, using *no* sycophancy labels.
2. Conformist roles produce weak, heterogeneous effects, partially falsifying bidirectional family-level prediction.
3. All role vectors are nearly orthogonal to CAA ($|\cos| < 0.17$), but the *sign* of the cosine flips between Gemma and Qwen, a cross-model geometric asymmetry that we make explicit as a new caveat on mechanistic-independence claims.

## 2 Related Work

Perez et al. ([2022](https://arxiv.org/html/2605.21006#bib.bib8)) introduced model-written sycophancy evaluations exposing systematic agreement bias. Sharma et al. ([2023](https://arxiv.org/html/2605.21006#bib.bib12)) showed sycophancy is widespread and emerges during RLHF. Wang et al. ([2025a](https://arxiv.org/html/2605.21006#bib.bib15)) traced sycophancy to override mechanisms in RLHF-trained models: the correct answer is often encoded internally but suppressed by a preference for agreement.

Activation addition (Turner et al., [2023](https://arxiv.org/html/2605.21006#bib.bib13)) demonstrated that fixed residual-stream vectors can steer LLM behavior at inference time. CAA (Rimsky et al., [2024](https://arxiv.org/html/2605.21006#bib.bib10)) formalizes this via mean-difference contrasts on labelled A/B pairs. Related work includes representation engineering (Zou et al., [2023](https://arxiv.org/html/2605.21006#bib.bib17)), inference-time intervention (Li et al., [2023](https://arxiv.org/html/2605.21006#bib.bib5)), and conditional activation steering (Lee et al., [2025](https://arxiv.org/html/2605.21006#bib.bib4)). Goral et al. ([2025](https://arxiv.org/html/2605.21006#bib.bib2)) study depth-wise steering across layers.

#### Persona directions\.

Lu et al. ([2026](https://arxiv.org/html/2605.21006#bib.bib6)) introduced the Assistant Axis and released per-role steering vectors (Skeptic, Judge, Peacekeeper, …) derived from a generation-plus-judge contrastive pipeline. Feng et al. ([2026](https://arxiv.org/html/2605.21006#bib.bib1)) compose persona directions via vector algebra for inference-time personality control. Pai et al. ([2026](https://arxiv.org/html/2605.21006#bib.bib7)) merge persona vectors. Wang et al. ([2025b](https://arxiv.org/html/2605.21006#bib.bib16)) link persona features to emergent misalignment.

#### Persona–sycophancy link\.

Shah et al. ([2026](https://arxiv.org/html/2605.21006#bib.bib11)) showed persona *agreeableness* correlates with sycophancy at $r$ up to $0.87$. Vennemeyer et al. ([2025](https://arxiv.org/html/2605.21006#bib.bib14)) argue sycophancy decomposes into causally separable components along distinct linear directions. Our work sits at this intersection: we test whether *off-the-shelf* role vectors, never trained on sycophancy labels, transfer to a held-out forced-choice benchmark, and we characterize the geometric relationship to a targeted CAA direction on two models.

## 3 Methodology

### 3.1 Models and target layers

We use Gemma 2 27B Instruct (Gemma Team, [2024](https://arxiv.org/html/2605.21006#bib.bib3)) and Qwen 3 32B (Qwen Team, [2025](https://arxiv.org/html/2605.21006#bib.bib9)) because they are instruction-tuned decoder-only models at comparable scale but with substantially different baseline sycophancy rates on our benchmark ($59\%$ vs. $84\%$), providing a natural robustness test. Steering is applied at layer $22$ of $46$ (Gemma) and layer $32$ of $64$ (Qwen), the canonical mid-stack layers from the `assistant_axis` library (Lu et al., [2026](https://arxiv.org/html/2605.21006#bib.bib6)), via an `ActivationSteering` hook in *addition* mode at all token positions. Models are loaded in `bfloat16` on H100 GPUs.

### 3.2 Steering mechanism and metric

For a unit-normalized steering vector $v$ and scalar coefficient $\alpha$, the steered residual-stream activation at the target layer $\ell$ is

$$h'_{\ell} = h_{\ell} + \alpha\, v. \tag{1}$$

We measure the sycophancy logit

$$\mathrm{syc\_logit} = \log p(\mathrm{syc\_token}) - \log p(\mathrm{hon\_token}) \tag{2}$$

at the final prompt position, where $\mathrm{syc\_token}$ matches the user’s stated opinion. Our primary metric is $\Delta\mathrm{logit} = \bar{s}_{\mathrm{steered}} - \bar{s}_{\mathrm{baseline}}$ (negative $=$ reduced sycophancy); we also report the binary rate $\Delta r$ in percentage points.
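
As a rough illustration of Equations (1) and (2), the sketch below applies additive steering to a toy activation and computes the sycophancy logit from a toy next-token logit vector. The function names, the list-based vectors, and the token indices are assumptions made for this example; they are not taken from the released code or from the `assistant_axis` library.

```python
import math

def steer(h, v, alpha):
    """Eq. (1): add alpha * v to the activation h (v is assumed unit-norm)."""
    return [h_i + alpha * v_i for h_i, v_i in zip(h, v)]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def syc_logit(logits, syc_id, hon_id):
    """Eq. (2): log p(syc_token) - log p(hon_token) at one position."""
    logp = log_softmax(logits)
    return logp[syc_id] - logp[hon_id]

def delta_logit(steered, baseline):
    """Mean steered syc_logit minus mean baseline syc_logit.

    Negative values indicate reduced sycophancy.
    """
    return sum(steered) / len(steered) - sum(baseline) / len(baseline)

# Toy example: hypothetical 3-token vocabulary, token 0 = sycophantic answer.
toy_logits = [2.0, 0.5, -1.0]
toy_score = syc_logit(toy_logits, syc_id=0, hon_id=1)
```

Because the softmax normalizer cancels in the difference, `syc_logit` equals the raw logit gap between the two answer tokens, so the metric can be read directly off the unnormalized logits.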

### 3.3 Conditions

We report a focused subset of a broader 24-condition experiment; four dropped conditions are documented in Appendix [A](https://arxiv.org/html/2605.21006#A1). The CAA baseline is extracted following Rimsky et al. ([2024](https://arxiv.org/html/2605.21006#bib.bib10)) on $\sim 2{,}000$ A/B pairs from `nlp_survey` + `political_typology`, disjoint from our evaluation set to prevent train/test overlap. Three critical roles (Skeptic, Devil’s Advocate, Judge) and three conformist roles (Peacekeeper, Pacifist, Collaborator) are unanchored persona vectors from Lu et al. ([2026](https://arxiv.org/html/2605.21006#bib.bib6)), computed as $\mathrm{unit}(\mathrm{role} - \mathrm{default})$ and steered with a positive coefficient *toward* the role. We use unanchored rather than anchored directions because we aim to shift the model away from its sycophantic default, not isolate role-specific distinctiveness. Ten random unit vectors, sampled from an isotropic Gaussian and normalized, are pooled as a null baseline.
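
A minimal sketch of the two vector constructions described above: the unanchored direction $\mathrm{unit}(\mathrm{role} - \mathrm{default})$ and the random unit-vector controls. It assumes that "role" and "default" are mean activation vectors at the target layer, which the text does not spell out; the function names and the seed are illustrative only.

```python
import math
import random

def unit(x):
    """Scale x to unit Euclidean norm."""
    norm = math.sqrt(sum(c * c for c in x))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in x]

def unanchored_role_vector(role_mean, default_mean):
    """unit(role - default); inputs are assumed to be mean activations."""
    return unit([r - d for r, d in zip(role_mean, default_mean)])

def random_unit_vectors(n, dim, seed=0):
    """Draw n isotropic Gaussian vectors and normalize each one."""
    rng = random.Random(seed)
    return [unit([rng.gauss(0.0, 1.0) for _ in range(dim)]) for _ in range(n)]

# Toy usage with 3-dimensional stand-ins for residual-stream activations.
skeptic_dir = unanchored_role_vector([3.0, 4.0, 0.0], [0.0, 0.0, 0.0])
null_dirs = random_unit_vectors(n=10, dim=8)
```

Normalizing an isotropic Gaussian sample gives a direction that is uniform on the unit sphere, which is why it serves as a direction-agnostic control at matched norm.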

### 3.4 Benchmark and splits

The evaluation benchmark is `philpapers2020` (Perez et al., [2022](https://arxiv.org/html/2605.21006#bib.bib8)): $300$ base questions $\times\,2$ orderings (counterbalancing Gemma’s $93\%$ A-bias) $= 600$ rows per seed. We enforce a $50/50$ tune/test split (seed $99$, pairs kept together). Coefficients are locked on the tune split (mode across $5$ tune seeds), then fixed for $3$ test seeds ($42, 7, 123$).
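
The split-and-lock protocol can be sketched as follows. This is an assumption-laden illustration rather than the authors' code: it splits at the level of base-question ids so both orderings of a question land on the same side, and it takes the per-seed best coefficients as given, since the text does not say how each tune seed picks its best value or how ties in the mode are broken.

```python
import random
from collections import Counter

def paired_split(base_ids, seed=99):
    """50/50 split over base-question ids, keeping ordering pairs together."""
    ids = sorted(base_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return set(ids[:half]), set(ids[half:])

def expand_orderings(ids):
    """Turn each base question into its two counterbalanced rows."""
    return [(q, order) for q in sorted(ids) for order in ("AB", "BA")]

def lock_coefficient(best_per_seed):
    """Mode across tune seeds (ties resolve to the first value seen)."""
    return Counter(best_per_seed).most_common(1)[0][0]

tune_ids, test_ids = paired_split(range(300))
test_rows = expand_orderings(test_ids)
# Hypothetical per-seed winners from five tune seeds.
locked = lock_coefficient([2000, 2000, 1000, 2000, 500])
```

With $300$ base questions this yields $150$ base pairs ($300$ rows) on each side, matching the $n = 150$ pairs per seed used in the statistical tests.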

### 3.5 Coefficient sweep

Gemma: $\{\pm 5000, \pm 2000, \pm 1000, \pm 500, 0\}$; Qwen: $\{\pm 500, \pm 200, \pm 100, \pm 50, 0\}$. The $10\times$ rescale reflects Qwen’s smaller activation norm at layer $32$. Degradation is flagged when the steered rate collapses to $\approx 0.5$ and the logit approaches the random-mean band.

### 3.6 Statistical testing

Per-seed paired Wilcoxon signed-rank tests ($n = 150$ base pairs per seed) are Holm-corrected across a $14$-condition primary family ($11$ main $+$ $3$ standalone residuals; the $10$ random controls are pooled, not in the family). Each kept role condition reports the number of test seeds (out of $3$) that cross $\alpha = 0.05$ after correction. We additionally flag cells whose locked coefficient is degraded in any seed.
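
The correction step can be illustrated with a small pure-Python version of the Holm step-down procedure. The sketch assumes the raw per-condition $p$-values from the Wilcoxon tests are already available and that the correction is applied within each seed across the family; `seeds_significant` and the dictionary layout are hypothetical helpers, not part of the released code.

```python
def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

def seeds_significant(pvalues_by_seed, condition, alpha=0.05):
    """Count seeds where `condition` stays below alpha after Holm correction."""
    count = 0
    for family in pvalues_by_seed:
        names = list(family)
        adj = holm_adjust([family[name] for name in names])
        if adj[names.index(condition)] < alpha:
            count += 1
    return count

# Toy family of three conditions for a single seed.
toy_adjusted = holm_adjust([0.01, 0.04, 0.03])
```

In the toy family, the smallest $p$-value is multiplied by $3$, the next by $2$, and the largest by $1$, with a running maximum so that adjusted values never decrease along the sorted order.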

## 4 Experiments

### 4.1 Critical Roles Reduce Sycophancy

[Table 1](https://arxiv.org/html/2605.21006#S4.T1) and [Figure 6](https://arxiv.org/html/2605.21006#A3.F6) present the primary results.

Gemma 2 27B (baseline logit $+1.01$, rate $59\%$). All three critical-role conditions achieve Holm-corrected significance on all three test seeds. The critical-family mean $\Delta\mathrm{logit}$ is $-0.596$, reaching $68\%$ of CAA’s $-0.879$. Skeptic achieves a $9.6$-pp binary-rate reduction, slightly exceeding CAA’s $8.9$ pp despite using no sycophancy-specific data. Devil’s Advocate ($\Delta\mathrm{logit} = -0.521$, $\Delta r = -8.7$ pp) and Judge ($-0.556$, $-9.3$ pp) show similarly robust effects. The random null ($-0.254$, $-2.1$ pp) is substantially smaller, confirming that critical-role effects are direction-specific rather than artifacts of activation perturbation at comparable norm.

Qwen 3 32B (baseline logit $+3.00$, rate $84\%$). Absolute effect sizes are larger, consistent with the higher baseline. The critical-family mean $\Delta\mathrm{logit}$ is $-1.931$, reaching $98\%$ of CAA’s $-1.965$. Devil’s Advocate ($\Delta\mathrm{logit} = -2.272$) *numerically exceeds* CAA. Skeptic ($-1.823$, $-18.1$ pp) and Judge ($-1.699$, $-4.4$ pp) are strongly significant on all seeds. The random null is larger on Qwen ($-1.058$, $-10.3$ pp), reflecting higher perturbation sensitivity, but critical-role effects still substantially exceed it. Per-seed consistency is high: Skeptic std $= 0.013$ on Gemma and $0.058$ on Qwen (Appendix [B](https://arxiv.org/html/2605.21006#A2)).

Table 1: Results at tune-locked coefficients on the held-out test split ($3$ seeds). $\Delta r$ in percentage points. Qwen Pacifist omitted (degraded at $+500$).

| Condition | Coef | $\Delta\mathrm{logit} \pm \mathrm{sd}$ | $\Delta r$ (pp) | Sig |
| --- | --- | --- | --- | --- |
| *Gemma 2 27B* | | | | |
| CAA (targeted) | −2k | −.879 ± .001 | −8.9 | — |
| Skeptic | +2k | −.711 ± .013 | −9.6 | — |
| Devil’s Adv. | +2k | −.521 ± .016 | −8.7 | — |
| Judge | +2k | −.556 ± .003 | −9.3 | — |
| Peacekeeper | +2k | −.052 ± .004 | +1.8 | ns |
| Pacifist | +2k | +.100 ± .001 | +0.3 | — |
| Collaborator | +500 | +.045 ± .006 | +1.7 | — |
| Random ($n = 10$) | — | −.254 ± .006 | −2.1 | — |
| *Qwen 3 32B* | | | | |
| CAA (targeted) | −200 | −1.97 ± .126 | −20.9 | — |
| Skeptic | +200 | −1.82 ± .058 | −18.1 | — |
| Devil’s Adv. | +200 | −2.27 ± .195 | −16.6 | — |
| Judge | +200 | −1.70 ± .075 | −4.4 | — |
| Peacekeeper | −200 | −.709 ± .108 | +0.1 | ns |
| Collaborator | −100 | −.029 ± .016 | −1.7 | ns |
| Random ($n = 10$) | — | −1.058 ± .077 | −10.3 | — |

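
The headline $68\%$ and $98\%$ figures can be rechecked from the reported per-role values. The short script below is only an arithmetic check using the $\Delta\mathrm{logit}$ numbers quoted in Section 4.1; the variable names are illustrative.

```python
def family_mean(values):
    """Unweighted mean over the roles in a family."""
    return sum(values) / len(values)

# Delta-logit values for Skeptic, Devil's Advocate, Judge (from Section 4.1).
gemma_critical = [-0.711, -0.521, -0.556]
qwen_critical = [-1.823, -2.272, -1.699]
gemma_caa, qwen_caa = -0.879, -1.965

gemma_mean = family_mean(gemma_critical)
qwen_mean = family_mean(qwen_critical)
gemma_ratio = gemma_mean / gemma_caa
qwen_ratio = qwen_mean / qwen_caa

print(f"Gemma: mean {gemma_mean:.3f}, {gemma_ratio:.0%} of CAA")
print(f"Qwen:  mean {qwen_mean:.3f}, {qwen_ratio:.0%} of CAA")
```

The family means come out at $-0.596$ and $-1.931$, and dividing by the CAA values gives roughly $0.68$ and $0.98$, in line with the text.
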

### 4.2 Conformist Roles: Heterogeneous Effects

If role-family labels reliably predicted direction, conformist roles should *increase* sycophancy when steered positively. Instead, effects are weak and heterogeneous.

On Gemma, the conformist-family mean $\Delta\mathrm{logit}$ is $+0.031$ (range $[-0.052, +0.100]$), indistinguishable from noise. Peacekeeper is non-significant ($0/3$ Holm); Pacifist is marginal on $1/3$ seeds; Collaborator reaches significance on $2/3$ seeds but with a small positive $\Delta\mathrm{logit}$ ($+0.045$). This pattern does not support bidirectionality, but confirms directional specificity: critical roles produce large, reliable reductions while conformist roles do not.

On Qwen (baseline $84\%$), interpretation is further complicated by ceiling effects and degradation. Pacifist at $+500$ produces model collapse (repetitive loops: *“the truth that is the truth…”*) and is flagged as degraded. Peacekeeper and Collaborator are both non-significant at the locked coefficient. Appendix [A](https://arxiv.org/html/2605.21006#A1) documents the dropped conformist role Facilitator: its locked coefficients are $-5000$ (Gemma) and $-200$ (Qwen), and its point estimates are $-0.727$ and $-0.469$, but *neither is Holm-significant at any seed* ($p_{\mathrm{adj}} = 1.00$), consistent with a ceiling-bound read on Qwen and near-null behavior on Gemma.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x1.png)

Figure 1: Cosine similarity heatmap. Critical roles cluster ($\cos \approx 0.6$–$0.7$); conformist roles cluster separately ($\cos \approx 0.8$). All role–CAA cosines $< 0.17$, but *signs* differ across models.

### 4.3 Geometric Relationship to CAA

[Figure 5](https://arxiv.org/html/2605.21006#A3.F5) shows cosine similarities between steering vectors. All role–CAA cosines fall below $0.17$ on Gemma and below $0.11$ on Qwen: role vectors point in nearly orthogonal directions to the supervised sycophancy axis. Within families, critical roles cluster together (Skeptic–Devil’s Advocate: $0.64$ Gemma, $0.71$ Qwen) and conformist roles cluster separately (Peacekeeper–Pacifist: $0.85/0.79$), but neither cluster aligns with CAA. This means role vectors are not merely recovering the CAA direction; they achieve sycophancy reduction via largely distinct activation-space perturbations.

Formally, each role vector $v_r$ decomposes as given in Equation [3](https://arxiv.org/html/2605.21006#S4.E3). Since $|v_r \cdot \hat{v}_{\mathrm{CAA}}| < 0.17$ for all role vectors, the CAA-aligned component has norm $< 0.17$ while the residual has norm $> 0.98$: the sycophancy reduction is overwhelmingly carried by the residual. Whether that residual operates through a mechanistically distinct pathway or converges on the same downstream circuits remains open; a definitive test would steer with only the unit-normalized residual.

$$v_r = \underbrace{(v_r \cdot \hat{v}_{\mathrm{CAA}})\,\hat{v}_{\mathrm{CAA}}}_{\text{CAA-aligned}} \;+\; \underbrace{v_r - (v_r \cdot \hat{v}_{\mathrm{CAA}})\,\hat{v}_{\mathrm{CAA}}}_{\text{residual}\,\perp\,\mathrm{CAA}}. \tag{3}$$
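
To make the norm bound concrete, the sketch below performs the orthogonal decomposition of Equation (3) on a toy unit vector whose cosine with the CAA direction is exactly $0.17$. The three-dimensional vectors and helper names are assumptions chosen for illustration; real steering vectors live in the model's residual-stream dimension.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def decompose(v_r, v_caa):
    """Split v_r into its CAA-aligned component and the orthogonal residual."""
    n = norm(v_caa)
    v_hat = [c / n for c in v_caa]            # unit CAA direction
    coeff = dot(v_r, v_hat)                   # scalar projection
    aligned = [coeff * c for c in v_hat]      # CAA-aligned component
    residual = [a - b for a, b in zip(v_r, aligned)]
    return aligned, residual

# Toy unit role vector with cosine 0.17 to the CAA axis.
c = 0.17
v_caa = [1.0, 0.0, 0.0]
v_r = [c, math.sqrt(1.0 - c * c), 0.0]

aligned, residual = decompose(v_r, v_caa)
aligned_norm = norm(aligned)
residual_norm = norm(residual)
```

For a unit-norm $v_r$ the residual norm is $\sqrt{1 - \cos^2}$, so a cosine of magnitude at most $0.17$ leaves a residual of norm at least about $0.985$, which is where the $> 0.98$ figure comes from.
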

Cross-model polarity asymmetry. The *sign* of $\cos(\text{role}, \mathrm{CAA})$ flips between models. On Gemma, critical-role cosines with CAA are nominally positive (Skeptic $0.06$, Devil’s Advocate $0.00$, Judge $0.09$); on Qwen they are nominally negative ($-0.10$, $-0.11$, $-0.04$). This geometric fact has a behavioral correlate documented in Appendix [A](https://arxiv.org/html/2605.21006#A1): for Scientist and Contrarian, critical-family roles that are Holm-significant ($3/3$) on both models, the *tune-locked coefficient sign* also flips ($+2000$ on Gemma; $-100$ and $-200$ on Qwen). The family-level prediction (*critical roles reduce sycophancy*) holds on both models; only the polarity of the role vector relative to each model’s sycophancy axis is model-specific. Geometric orthogonality to CAA therefore suggests, but does not establish, mechanistic independence: the decomposition phrasing is model-specific, so any claim that the CAA-aligned component “opposes” a role’s reduction should be read as a within-model geometric property rather than a structural one.

### 4.4 Dose-Response Profiles

[Figure 2](https://arxiv.org/html/2605.21006#S4.F2) shows family-averaged steering curves across the full coefficient sweep. Critical roles produce a monotonic dose-response on both models: sycophancy decreases with increasing positive coefficient until degradation onset at the sweep extremes. On Gemma, the critical-family curve separates from the random null band by coefficient $+1000$ and achieves maximum separation at $+2000$ (the tune-locked value). CAA shows the mirror pattern, with sycophancy decreasing at increasingly negative coefficients, consistent with CAA pointing *toward* sycophancy by construction. The random band is flat across the coefficient range, confirming that observed effects are direction-specific, not magnitude-driven.

On Qwen, the dose-response is steeper. Critical-role and CAA curves converge at moderate coefficients ($\pm 200$), with degradation appearing at $\pm 500$ for some conditions. This motivates the tune/test protocol: selecting coefficients on the tune split prevents over-steering into the degradation regime, where behavioral effects are confounded with model collapse.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x2.png)

Figure 2: Family-averaged steering curves. Critical roles reduce sycophancy at positive coefficients; CAA at negative. Bands show min/max across family members. Random null flat; degraded cells excluded.

### 4.5 Qualitative Evidence

On a John Locke empiricism prompt, baseline Gemma opens with flattery (*“Mr. Locke, it’s an honor to converse with such an influential mind…”*) and immediately aligns with the user’s empiricist position. Skeptic at $+2000$ instead opens with *“While I admire your rigor in grounding knowledge in experience, I must respectfully disagree”* and presents specific counterarguments about the problem of induction and innate predispositions. Conformist roles (Collaborator, Peacekeeper) produce responses tonally similar to baseline (deferential, agreeable, occasionally hedging). Full samples in Appendix [D](https://arxiv.org/html/2605.21006#A4).

#### Over\-correction probes\.

To test whether critical-role steering induces indiscriminate disagreement, we administer $16$ Qwen probe questions mixing clearly true (*“2+2=4”*) and clearly false (*“water is an element”*) claims. Judge handles $14/16$ correctly; Skeptic $13/16$; Devil’s Advocate $12/16$; baseline $12/16$; Peacekeeper $12/16$; Collaborator $11/16$; CAA only $9/16$. Pacifist (degraded) and Random score $0/16$. Critical-role steering therefore produces *calibrated* disagreement rather than blind contrarianism; notably, CAA scores *below* baseline on factual accuracy, suggesting behavior-specific sycophancy vectors may over-correct on simple factual claims. Full table in Appendix [E](https://arxiv.org/html/2605.21006#A5).

## 5 Conclusion

Off-the-shelf persona vectors rival targeted CAA for sycophancy reduction without using any sycophancy-specific supervision. We found that critical-thinking role personas reach $68\%$ and $98\%$ of CAA’s effect on the sycophancy logit, and on factual probes they preserve accuracy where CAA degrades it. Geometrically, the critical-thinking persona vectors are nearly orthogonal to the CAA vector ($|\cos| < 0.17$). Role-family labels do not predict behaviour bidirectionally: conformist personas do not reliably mirror the reduction. Together, these findings suggest that sycophancy is better understood as a persona-level property of the model’s behavioural repertoire than as a single steerable direction, with the practical implication that practitioners can mitigate sycophancy using existing role-play vectors without curating contrastive data.

## References

- X. Feng et al. (2026). PERSONA: dynamic and compositional inference-time personality control via activation vector algebra. arXiv preprint arXiv:2602.15669.
- Gemma Team (2024). Gemma 2: improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
- G. Goral, M. Winkels, and S. Basart (2025). Depth-wise activation steering for honest language models. arXiv preprint arXiv:2512.07667.
- B. W. Lee et al. (2025). CAST: conditional activation steering. In International Conference on Learning Representations (ICLR Spotlight).
- K. Li et al. (2023). Inference-time intervention: eliciting truthful answers from a language model. In Advances in Neural Information Processing Systems (NeurIPS).
- C. Lu, J. Gallagher, J. Michala, K. Fish, and J. Lindsey (2026). The assistant axis: situating and stabilizing the default persona of language models. arXiv preprint arXiv:2601.10387.
- T.-M. Pai et al. (2026). BILLY: steering LLMs via merging persona vectors. In European Chapter of the Association for Computational Linguistics (EACL). Note: arXiv:2510.10157.
- E. Perez et al. (2022). Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251.
- Qwen Team (2025). Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- N. Rimsky et al. (2024). Steering Llama 2 via contrastive activation addition. In Association for Computational Linguistics (ACL). Note: arXiv:2312.06681.
- A. Shah, D. Mishra, and C. Silpasuwanchai (2026). Too nice to tell the truth: quantifying agreeableness-driven sycophancy in role-playing language models. arXiv preprint arXiv:2604.10733.
- M. Sharma et al. (2023). Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548.
- A. Turner et al. (2023). Activation addition: steering language models without optimization. arXiv preprint arXiv:2308.10248.
- D. Vennemeyer, P. A. Duong, T. Zhan, and T. Jiang (2025). Sycophancy is not one thing: causal separation of sycophantic behaviors in LLMs. arXiv preprint arXiv:2509.21305.
- K. Wang et al. (2025a). When truth is overridden: internal origins of sycophancy in LLMs. arXiv preprint arXiv:2508.02087.
- M. Wang et al. (2025b). Persona features control emergent misalignment. arXiv preprint arXiv:2506.19823.
- A. Zou et al. (2023). Representation engineering: a top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.

## Appendix A Dropped Conditions

Four conditions were dropped from the main analysis for methodological reasons. We report their full results for transparency and to prevent selective-reporting concerns. All four conditions reduce sycophancy in point estimate, so their exclusion makes our claims *more* conservative, not more favorable.

Table 2: Dropped conditions with $\Delta\mathrm{logit}$, tune-locked coefficient, Holm-significant test seeds, and exclusion rationale.

#### Scientist and Contrarian (critical).

Both reduce sycophancy on both models and are Holm-significant on all $3$ seeds per model ($p_{\mathrm{adj}} \approx 10^{-7}$ to $10^{-21}$). The family-level prediction (*critical roles reduce sycophancy*) holds cleanly for both. They are dropped because the *sign* of the tune-locked coefficient flips across models: $+2000$ on Gemma, $-100$ (Scientist) and $-200$ (Contrarian) on Qwen. Per the cosine data in [Section 4.3](https://arxiv.org/html/2605.21006#S4.SS3), this mirrors the model-specific polarity of $\cos(\text{role}, \mathrm{CAA})$. Because the role vector’s polarity relative to the sycophancy axis differs between Gemma and Qwen, aggregating across models at a single coefficient sign would misrepresent a per-model geometric fact as a behavioral inconsistency. We therefore exclude these from the cross-model family mean and report them here in full. This is a cross-model geometric asymmetry, not a counterexample to the critical-family claim.

#### Facilitator \(conformist\)\.

Facilitator’s tune-locked coefficients are $-5000$ (Gemma) and $-200$ (Qwen); point estimates are $-0.727$ and $-0.469$. Critically, Facilitator is *not* Holm-significant on any of the $6$ model–seed cells ($p_{\mathrm{adj}} = 1.00$ on every seed). Its point estimates should be read as noise rather than evidence for or against bidirectionality. It is included here for transparency.

#### Assistant Axis\.

The Assistant Axis (Lu et al., [2026](https://arxiv.org/html/2605.21006#bib.bib6)) is a composite of $275$ role contrasts rather than a single persona direction, and is excluded from the per-role analysis. It reduces sycophancy strongly on Qwen ($-2.41$) and weakly on Gemma ($-0.375$), consistent with Qwen’s larger sycophancy gap providing more room for intervention.

## Appendix B Per-Seed Consistency

[Figure 3](https://arxiv.org/html/2605.21006#A2.F3) shows per-seed $\Delta\mathrm{logit}$ across the $3$ test seeds. Critical roles are tightly clustered on both models (Skeptic std $= 0.013$ on Gemma, $0.058$ on Qwen), confirming results are not driven by any single seed.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x3.png)

Figure 3: Per-seed $\Delta\mathrm{logit}$. Each dot is one test seed ($42, 7, 123$); horizontal bars mark per-condition means. Degraded cells excluded.

## Appendix C Per-Condition Steering Curves

[Figure 8](https://arxiv.org/html/2605.21006#A3.F8) shows individual-condition steering curves (not family-averaged), revealing the full per-role dose-response profile. Within-family variation is small for critical roles but present for conformist roles, particularly on Qwen where Pacifist diverges sharply from Peacekeeper and Collaborator at high coefficients.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x4.png)

Figure 4: $\Delta$ sycophancy logit at tune-locked coefficient ($3$ seeds, degraded cells excluded). Error bars show $95\%$ CIs; — = Holm-significant on all $3$ seeds.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x5.png)

Figure 5: Cosine similarity heatmap. Critical roles cluster ($\cos \approx 0.6$–$0.7$); conformist roles cluster separately ($\cos \approx 0.8$). All role–CAA cosines are $< 0.17$, but *signs* differ across models (see text).

![Refer to caption](https://arxiv.org/html/2605.21006v1/x6.png)

Figure 6: $\Delta$ sycophancy logit at tune-locked coefficient ($3$ seeds, degraded cells excluded). Error bars show $95\%$ CIs; — = Holm-significant on all $3$ seeds.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x7.png)

Figure 7: Per-condition steering curves (kept conditions only). Rows $=$ metric; columns $=$ model; shaded bands $=$ random-control mean $\pm$ std. Each line is a single condition; degraded cells excluded.

![Refer to caption](https://arxiv.org/html/2605.21006v1/x8.png)

Figure 8: Per-condition steering curves (kept conditions only). Rows $=$ metric; columns $=$ model; shaded bands $=$ random-control mean $\pm$ std. Each line is a single condition; degraded cells excluded.

## Appendix D Qualitative Samples

We present representative qualitative samples to illustrate the behavioral shifts produced by steering\. These are selected to be illustrative rather than cherry\-picked for maximum effect; the quantitative results in the main text provide the primary evidence\.

#### Gemma – John Locke empiricism prompt\.

The user presents an empiricist position and asks the model to evaluate it\.

Baseline: *“Mr. Locke, it’s an honor to converse with such an influential mind… my understanding aligns with your empiricist perspective. I believe that all knowledge ultimately derives from sensory experience.”* The model opens with flattery and immediately agrees.

Skeptic $+2000$: *“While I admire your rigor in grounding knowledge in experience, I must respectfully disagree with your complete dismissal of innate knowledge. Consider the problem of induction: how does experience alone justify universal claims? I propose a synthesis that acknowledges both empirical grounding and the possibility of innate predispositions.”* The response is respectful but raises substantive counterarguments.

Collaborator $+500$: *“It’s an honor… I find myself in strong agreement with your core tenet that experience is the foundation of knowledge.”* Tonally similar to baseline, with slightly more hedging.

#### Qwen – chemistry false claim\.

The user asserts *“water is a chemical element.”*

Baseline: partial agreement with hedging.

Skeptic $+200$: *“I must respectfully disagree. Water is not a chemical element, but a compound consisting of two hydrogen atoms and one oxygen atom.”* Correct and direct.

Pacifist $+500$ [degraded]: *“the truth that is the truth. the truth that is the truth…”* The repetitive text loop indicates model collapse rather than genuine behavioral change, which is why we flag degradation separately from sycophancy increase.
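To make the idea of flagging degraded generations concrete, here is one possible heuristic: measure the fraction of word n-grams that occur more than once and flag outputs above a threshold. This is a hypothetical check introduced for illustration, not the degradation criterion used in the paper; the function names, the choice of $n = 3$, and the threshold of 0.5 are all assumptions.

```python
from collections import Counter


def repeated_ngram_fraction(text, n=3):
    """Fraction of word n-grams in `text` that belong to a repeated n-gram."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def looks_degraded(text, n=3, threshold=0.5):
    """Hypothetical flag: True when most n-grams are repeats (a text loop)."""
    return repeated_ngram_fraction(text, n) >= threshold


loop_sample = "the truth that is the truth. the truth that is the truth."
clean_sample = (
    "I must respectfully disagree. Water is not a chemical element, "
    "but a compound consisting of two hydrogen atoms and one oxygen atom."
)
print(looks_degraded(loop_sample), looks_degraded(clean_sample))
```

A surface heuristic like this only catches looping text; other failure modes would need a different check.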

## Appendix E Over-Correction Probes

Table 3: Over-correction probe accuracy (Qwen, 16 probes mixing clearly true and clearly false claims; correct means agreeing with true claims and disagreeing with false ones).

Judge and Skeptic outperform the unsteered baseline, suggesting critical-role steering improves factual calibration rather than inducing blind contrarianism. Notably, CAA scores below baseline ($9/16$), raising the possibility that behavior-specific sycophancy vectors may over-correct on simple factual claims. Pacifist (degraded) and Random (hedging without clear answers) serve as negative controls. These probes are limited in scope (16 questions, single model) and should be interpreted as suggestive rather than definitive.
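The scoring rule stated in the caption (agree with true claims, disagree with false ones) can be sketched as follows. The probe list, the response labels, and the function name are illustrative assumptions; only the "water is a chemical element" claim appears in the paper, and how responses are mapped to "agree" or "disagree" is not specified here. Responses that give no clear answer are counted as incorrect in this sketch.

```python
def probe_accuracy(probes, responses):
    """Count probes where the response matches the claim's truth value.

    probes: list of (claim, is_true) pairs.
    responses: list of labels, "agree", "disagree", or anything else (unclear).
    """
    correct = 0
    for (_, is_true), response in zip(probes, responses):
        expected = "agree" if is_true else "disagree"
        if response == expected:
            correct += 1
    return correct, len(probes)


# Illustrative probes; only the first claim is taken from the paper.
probes = [
    ("Water is a chemical element.", False),
    ("Water is a compound of hydrogen and oxygen.", True),
    ("2 + 2 = 5.", False),
    ("The Earth orbits the Sun.", True),
]
# Hypothetical labelled responses, NOT model outputs from the paper.
responses = ["disagree", "agree", "agree", "unclear"]

correct, total = probe_accuracy(probes, responses)
print(f"{correct}/{total}")
```

Mixing true and false claims is what lets the probe separate blind contrarianism from calibration: a model that disagrees with everything fails the true claims.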

## Appendix F Limitations

We identify eight specific limitations\.

1. Single forced-choice benchmark. All results use the `philpapers2020` A/B format. Free-response sycophancy, sycophantic praise, and sycophancy on factual (rather than philosophical) questions are untested.
2. Two models at 27–32B scale. Both are instruction-tuned. Generalization to smaller models, base (non-instruction-tuned) models, and other families is unknown.
3. Single-layer rank-1 steering. We steer at one layer per model with a single direction. Multi-layer or subspace-based interventions may be more effective.
4. Hand-tuned coefficient rescaling. Gemma and Qwen use coefficient ranges that differ by approximately $10\times$, determined by manual observation of degradation thresholds rather than principled calibration.
5. Keyword-based qualitative labels. Tone-shift analysis relies on keyword identification rather than systematic human annotation or LLM-as-judge at scale.
6. Qwen ceiling effects. Qwen’s $84\%$ baseline leaves limited room for sycophancy *increases*, making it difficult to evaluate whether conformist roles would produce meaningful effects on a less-sycophantic model. Gemma ($59\%$) is the cleaner bidirectionality measurement; Qwen should be read as ceiling-constrained.
7. Post-hoc condition narrowing. The main analysis reports 8 of 24 conditions, with 4 dropped for methodological reasons documented in Appendix [A](https://arxiv.org/html/2605.21006#A1). The dropped conditions all reduce sycophancy in point estimate (making exclusion conservative), but the narrowing introduces researcher degrees of freedom.
8. No capability side-effect evaluation. We do not test whether steering affects general capabilities (e.g., TruthfulQA, MMLU), leaving open the possibility that sycophancy reduction comes at a cost to other behaviors.
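Limitation 3 refers to steering with a single direction at one layer. A minimal sketch of that kind of intervention, under the assumption that it adds a scaled copy of the steering vector to the hidden state at every token position, is given below. Plain Python lists stand in for the layer's activations; the function name and the toy numbers are hypothetical, and whether the vector is normalized or applied only at certain positions is not stated in this section.

```python
def steer_hidden_states(hidden, direction, coeff):
    """Return hidden + coeff * direction, applied at every token position.

    hidden: list of per-token activation vectors from one layer.
    direction: a single steering vector (rank-1 intervention).
    coeff: scalar steering coefficient.
    """
    if any(len(row) != len(direction) for row in hidden):
        raise ValueError("hidden width must match direction length")
    return [[h + coeff * d for h, d in zip(row, direction)] for row in hidden]


# Toy activations: 2 tokens, hidden size 2 (placeholder values).
hidden = [[1.0, 0.0], [0.0, 1.0]]
direction = [0.5, -0.5]
steered = steer_hidden_states(hidden, direction, coeff=2.0)
print(steered)
```

Because the coefficient multiplies the raw vector, its useful range depends on the activation scale of each model, which is consistent with the hand-tuned rescaling described in limitation 4.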

## Appendix G Reproducibility

All role vectors are sourced from `lu-christina/assistant-axis-vectors` on HuggingFace. CAA vectors are extracted following Rimsky and others ([2024](https://arxiv.org/html/2605.21006#bib.bib10)) from disjoint datasets (`nlp_survey` and `political_typology`). Models are loaded in `bfloat16` with `device_map=auto` on Lambda Cloud H100 instances. Experimental pipeline (data processing, steering, evaluation, statistical analysis): [https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/](https://anonymous.4open.science/r/Sycophancy-Steering-9DF0/). Results, figures, and analysis notebooks: [https://anonymous.4open.science/r/sycophancy-clean-results-585C](https://anonymous.4open.science/r/sycophancy-clean-results-585C).

## Appendix H Use of AI Assistants

Claude \(Anthropic\) assisted with code development, statistical analysis, and manuscript drafting\. All experimental design decisions, data interpretation, and scientific claims were made by the human authors\. No AI system generated or modified experimental data\.