Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

arXiv cs.LG 05/20/26, 04:00 AM Papers

alignment scaling-laws language-models truthfulness coupling interpretability phase-transition

Summary

This paper identifies a phase transition in language model scaling: below a family-dependent critical parameter count, reasoning and truthfulness are anticorrelated; above it, they cooperate. It provides a diagnostic that needs only public benchmark scores and suggests interventions (data curation, width, benchmark rotation) for improving alignment across model families.

arXiv:2605.18838v1 Announce Type: new Abstract: Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale $N_c$, capabilities anticorrelate; above it, they cooperate. $N_c \approx 3.5$B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift $N_c$ independently: curated training eliminated the coupling dip between Qwen generations ($0.025 \to 0.830$ at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier ($r = +0.72$, 34 models, 10 labs). Code, data, and an open-source activation-steering tool for any open-weight model are released alongside an interactive dashboard that diagnoses any model's coupling phase, suggests concrete interventions (data curation, width, benchmark rotation), and provides ODE scaling predictions, frontier diagnostics, and eigenstructure analysis: https://zehenlabs.com/cape/.
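
The abstract describes a diagnostic built only from public benchmark scores across a model family, reported with a bootstrap 95% CI. As a rough illustration of that kind of measurement, the sketch below computes a Pearson correlation between per-model reasoning and truthfulness scores and a percentile bootstrap interval around it. Treating "coupling" as a plain Pearson r is an assumption made here for illustration; the paper's actual coupling metric may be defined differently. The function names and the score values are hypothetical and do not come from the paper.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (NaN if either is constant)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation, resampling models with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples with zero variance
            stats.append(r)
    if not stats:
        raise ValueError("all bootstrap resamples were degenerate")
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


if __name__ == "__main__":
    # Hypothetical benchmark scores for six models in one family (not real data).
    reasoning = [0.21, 0.25, 0.30, 0.41, 0.55, 0.63]
    truthfulness = [0.44, 0.41, 0.39, 0.42, 0.51, 0.58]
    r = pearson(reasoning, truthfulness)
    lo, hi = bootstrap_ci(reasoning, truthfulness)
    print(f"coupling (Pearson r) = {r:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

With only a handful of models per family the bootstrap interval will be wide, so a sign change in r across size buckets should be read with that uncertainty in mind.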

Similar Articles

Probing the Misaligned Thinking Process of Language Models

Do Models Fake Alignment Without Clear Consequences?

Hidden Latent-State Shifts in LLMs: Why Current Alignment Is Blind to Real Internal Dangers — Especially With Agents

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

We measured how AI capabilities INTERACT as models scale. Below 3.5B, reasoning and truthfulness fight. Above it, they cooperate. The transition is engineerable. (2 papers + interactive dashboard + 7 falsifiable predictions)
