behavioral-evidence

Tag

Cards List
#behavioral-evidence

What you read before a question changes how a language model answers it — even when the question has nothing to do with what you read. Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B

Reddit r/ArtificialInteligence · 11h ago

The article reports a potential alignment vulnerability in LLMs where processing a structured passage before an unrelated question can alter the model's response, with mechanistic evidence from Gemma-3-12B showing hidden-state separation.

0 favorites 0 likes
← Back to home

Submit Feedback