A Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B . Pre-token hidden state shift as an alignment policy traversal vector in instruction-tuned LLMs
Summary
This paper investigates an alignment vulnerability in instruction-tuned LLMs, specifically Gemma-3-12B, by showing that pre-token hidden state shifts can act as an alignment policy traversal vector, potentially enabling bypass of safety measures.
Similar Articles
Hidden Latent-State Shifts in LLMs: Why Current Alignment Is Blind to Real Internal Dangers — Especially With Agents
This paper demonstrates that LLMs can enter measurably different internal latent states under coherent context while maintaining aligned outputs, revealing a blind spot in current alignment methods that only monitor surface tokens. The Gemma-3-12B-IT experiment shows strong residual stream geometry shifts that existing safety frameworks cannot detect, with implications for agentic AI deployment.
What you read before a question changes how a language model answers it — even when the question has nothing to do with what you read. Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B
The article reports a potential alignment vulnerability in LLMs where processing a structured passage before an unrelated question can alter the model's response, with mechanistic evidence from Gemma-3-12B showing hidden-state separation.
Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.
Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]
An independent researcher presents evidence that coherent context can shift LLMs into a different internal regime before producing output, bypassing surface-level safety filters. This suggests current alignment methods like RLHF may not be robust defenses.
Bypassing LLM Guardrails: How Plain Text Shifts Latent Trajectories Without Jailbreaks
The article presents a research finding that saturating an LLM's context window with benign narrative text can dominate the attention mechanism and shift latent trajectories, potentially bypassing alignment guardrails without traditional jailbreaks. It argues that current alignment methods are a superficial fix for a fundamentally fluid architecture.