persona

#persona

When Roleplaying, Do Models Believe What They Say?

arXiv cs.CL · 2d ago

This paper uses linear probes to test whether role-playing in LLMs changes only the model's outputs or also its internal truth representations. It finds that role-playing shifts outputs more than internal beliefs, whereas emergent misalignment causes larger shifts in internal representations.

#persona

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

arXiv cs.CL · 2026-05-25

This paper investigates how instruction-tuned LLMs combine persona and task specifications in the residual stream. Near answer formation the combination is approximately additive, which enables substitution with minimal KL divergence. This additive regime, however, does not account for the full multi-token generation mechanism.

#persona

The butterfly effect in LLM. Persona format alone (prose vs bullets) flipped an LLM’s behavior by 76 points.

Reddit r/ArtificialInteligence · 2026-05-22

A study shows that changing only the formatting of a persona prompt (prose vs. bullet points) shifted an LLM's cooperation rate in a Prisoner's Dilemma from 96% to 20%, a 76-percentage-point swing (p < 0.001), even though the persona content was identical.
