#truth-probes

When Roleplaying, Do Models Believe What They Say?

arXiv cs.CL · 2d ago

This paper uses linear probes to test whether role-playing in LLMs changes only the model's outputs or also its internal representations of truth. It finds that role-playing shifts outputs more than it shifts internal beliefs, whereas emergent misalignment produces larger shifts in the internal representations.
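For readers unfamiliar with linear probes, the sketch below shows the general idea on synthetic data. It is not the paper's method: the summary does not say which probe the authors train, so a difference-of-means probe is swapped in here as one simple kind of linear probe. The activations are random vectors standing in for hidden states, and every name (`make_synthetic_activations`, `fit_mean_difference_probe`, and so on) is hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_mean_difference_probe(activations, labels):
    """Fit a linear probe whose direction is mean(true) - mean(false).

    The bias places the decision boundary at the midpoint of the two means.
    """
    true_rows = [a for a, y in zip(activations, labels) if y == 1]
    false_rows = [a for a, y in zip(activations, labels) if y == 0]
    mu_true = mean_vector(true_rows)
    mu_false = mean_vector(false_rows)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_score(direction, bias, activation):
    """Linear readout: positive means the probe reads the input as 'true'."""
    return sum(d * a for d, a in zip(direction, activation)) + bias


def probe_accuracy(direction, bias, activations, labels):
    """Fraction of examples where the sign of the score matches the label."""
    correct = sum(
        (probe_score(direction, bias, a) > 0) == (y == 1)
        for a, y in zip(activations, labels)
    )
    return correct / len(labels)


def make_synthetic_activations(n, dim, rng):
    """ASSUMPTION: synthetic stand-in for hidden states, not real model data.

    Dimension 0 carries an artificial 'truth' signal; the rest is noise.
    """
    data, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += 3.0 if y == 1 else -3.0
        data.append(vec)
        labels.append(y)
    return data, labels


rng = random.Random(0)
train_x, train_y = make_synthetic_activations(400, 8, rng)
test_x, test_y = make_synthetic_activations(200, 8, rng)

direction, bias = fit_mean_difference_probe(train_x, train_y)
accuracy = probe_accuracy(direction, bias, test_x, test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

In a study like the one summarized above, one would presumably fit such a probe on activations from ordinary prompts and then compare its scores on the same statements under a role-play prompt. That comparison is only described here, not implemented.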
