truth-probes

When Roleplaying, Do Models Believe What They Say?

arXiv cs.CL · 2d ago

This paper uses linear probes to investigate whether role-playing in LLMs changes only a model's outputs or also its internal representations of truth. It finds that role-play shifts outputs more than internal beliefs, whereas emergent misalignment produces larger shifts in internal representations.
