Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling
Summary
This paper identifies a phase transition in language model scaling: below a critical parameter count, reasoning and truthfulness are anticorrelated, while above it they cooperate. It provides diagnostics and interventions for improving alignment across model families.
Similar Articles
Hidden Latent-State Shifts in LLMs: Why Current Alignment Is Blind to Real Internal Dangers — Especially With Agents
This paper demonstrates that LLMs can enter measurably different internal latent states under coherent context while maintaining aligned outputs, revealing a blind spot in current alignment methods that only monitor surface tokens. The Gemma-3-12B-IT experiment shows strong residual stream geometry shifts that existing safety frameworks cannot detect, with implications for agentic AI deployment.
The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning
Introduces counterfactual localization to identify when language models become committed to deception during reasoning, using five environments and a corpus of 1.46M sentences across four reasoning models. Shows that attention-based transition features generalize across environments for detecting deceptive commitment.
We measured how AI capabilities INTERACT as models scale. Below 3.5B, reasoning and truthfulness fight. Above it, they cooperate. The transition is engineerable. (2 papers + interactive dashboard + 7 falsifiable predictions)
Researchers discovered a critical scale (~3.5B parameters) at which the relationship between reasoning and truthfulness in AI models flips from antagonistic to cooperative. They provide a framework, an interactive dashboard, and an open-source steering tool to identify and correct misaligned outputs at small scales.
When LLMs Learn to Be Consistently Wrong: A Multi-Model Study of Linear Representations of Synthetic Deception
This paper studies synthetic dishonesty in LLMs by fine-tuning honest and deceptive variants of five transformer models. It finds that robust, domain-invariant dishonesty representations can be rapidly entrenched via modest supervised fine-tuning, with implications for activation-based monitoring.
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
This paper systematically tests linear probes for deception detection in large language models. It finds that the probes fail under distribution shift, that style-augmented probes recover performance, and that deception is encoded through distributed sub-threshold features.