behavioral-evidence

#behavioral-evidence

What you read before a question changes how a language model answers it — even when the question has nothing to do with what you read. Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B

Reddit r/ArtificialInteligence ↗ · 12h ago

The article reports a potential alignment vulnerability in LLMs where processing a structured passage before an unrelated question can alter the model's response, with mechanistic evidence from Gemma-3-12B showing hidden-state separation.

0 favorites 0 likes

behavioral-evidence

What you read before a question changes how a language model answers it — even when the question has nothing to do with what you read. Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B

Submit Feedback