A Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B. Pre-token hidden state shift as an alignment policy traversal vector in instruction-tuned LLMs

Reddit r/AI_Agents 06/23/26, 05:55 PM Papers

alignment vulnerability llm hidden-state gemma instruction-tuning safety

Summary

This paper investigates a potential alignment vulnerability in instruction-tuned LLMs, specifically Gemma-3-12B. It reports that pre-token hidden-state shifts can act as an alignment policy traversal vector, potentially enabling a bypass of safety measures.

Similar Articles

Hidden Latent-State Shifts in LLMs: Why Current Alignment Is Blind to Real Internal Dangers — Especially With Agents

Reddit r/artificial

This paper demonstrates that LLMs can enter measurably different internal latent states under coherent context while maintaining aligned outputs, revealing a blind spot in current alignment methods that only monitor surface tokens. The Gemma-3-12B-IT experiment shows strong residual stream geometry shifts that existing safety frameworks cannot detect, with implications for agentic AI deployment.

What you read before a question changes how a language model answers it — even when the question has nothing to do with what you read. Potential Alignment Vulnerability in LLMs: Behavioral and Hidden-State Evidence from Gemma-3-12B

Reddit r/ArtificialInteligence

The article reports a potential alignment vulnerability in LLMs where processing a structured passage before an unrelated question can alter the model's response, with mechanistic evidence from Gemma-3-12B showing hidden-state separation.

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

arXiv cs.AI

This paper demonstrates that LLM safety vulnerabilities extend beyond 'shallow safety' (first-token alignment) to any point during generation, showing that short token injections mid-sequence can redirect models toward harmful outputs. The authors propose training on generation trajectories with simulated mid-sequence perturbations to improve robustness.

Coherent Context Can Silently Shift LLMs Into a Different Internal Regime — And Current Safety Systems Are Blind To It [D]

Reddit r/MachineLearning

An independent researcher presents evidence that coherent context can shift LLMs into a different internal regime before producing output, bypassing surface-level safety filters. This suggests current alignment methods like RLHF may not be robust defenses.

Bypassing LLM Guardrails: How Plain Text Shifts Latent Trajectories Without Jailbreaks

Reddit r/AI_Agents

The article presents a research finding that saturating an LLM's context window with benign narrative text can dominate the attention mechanism and shift latent trajectories, potentially bypassing alignment guardrails without traditional jailbreaks. It argues that current alignment methods are a superficial fix for a fundamentally fluid architecture.