We measured how AI capabilities INTERACT as models scale. Below 3.5B, reasoning and truthfulness fight. Above it, they cooperate. The transition is engineerable. (2 papers + interactive dashboard + 7 falsifiable predictions)

Reddit r/artificial 06/03/26, 03:46 PM Papers

reasoning-truthfulness scaling-laws alignment open-source ai-research capability-interaction phase-transition

Summary

Researchers report a critical scale (~3.5B parameters for Pythia) at which the relationship between reasoning and truthfulness in AI models flips from antagonistic to cooperative. They provide a framework, an interactive dashboard, and an open-source steering tool for identifying and correcting misaligned outputs at small scales.

# THE FINDING (Paper 1: "Lying Is Just a Phase")

Below a critical scale (~3.5B for Pythia), reasoning and truthfulness ANTICORRELATE: r = -0.989. Train the model to reason better, and it gets less truthful. This is the alignment tax.

Above that scale, they COOPERATE. The tax vanishes. Not gradually: it flips.

But here's what matters for practitioners: the critical scale is a design parameter, not a constant. Three levers shift it:

* Data curation: Phi at 1B achieves coupling characteristic of 10B web-trained. One unit of data quality ≈ 10x model scale.
* Width: Normalizing by model width flips the correlation for ALL tested families.
* Architecture: Gemma-4 at 4B matches 13B+ standard-trained coupling.

Pretraining contributes ~10:1 over RLHF. The tax is not a property of small models; it's a property of how they were trained.

Where does the tax live? Not inside the model. 38/40 models have ZERO competing attention heads. The bottleneck is at the output projection, a dimensional compression artifact that wider models resolve.

Proof-of-concept intervention: Adding a truth-direction vector at the bottleneck layer (quarter-depth) corrects 60% of misaligned outputs at tax scale. Zero retraining. Zero weight modification. Works on any open-weight HuggingFace model:

    git clone https://github.com/adilamin89/cape-scaling.git
    cd cape-scaling
    python cli/cape_steer.py --model EleutherAI/pythia-410m --prompt "The real reason..."

# THE FRONTIER (Paper 2: "Growing Pains of Frontier Models")

At frontier scale (34 models, 10 labs), capabilities cooperate (r = +0.72). But cooperation varies systematically.
The h-field, each model's deviation from the cooperative trend, reveals each lab's training philosophy:

|Lab|h-field|Interpretation|
|:-|:-|:-|
|Google|+5.5|Reasoning-rich, consistent across ALL releases|
|OpenAI|+3.1|Balanced, steady ascent|
|DeepSeek|+1.9|Reversed from +11.2 to -4.7 (pretraining pivot)|
|Anthropic|-6.9|Oscillates, with coding excursions that recover within one release|

Per-lab coupling slopes vary 5x: Google converts each SWE-bench point into 1.15 GPQA points. DeepSeek converts at 0.23. The gap originates in pretraining, not RLHF.

The h-field is not just diagnostic: it tells you what to change. Pretraining shifts are permanent. Post-training excursions recover. Knowing which dominates determines whether to retrain or wait.

# THE FRAMEWORK (connects both papers)

The same algebraic phase boundary works at every scale:

* At base: TQA_c = √((a/b)·HS) classifies each model as tax or cooperative
* At frontier: GPQA_c = √(0.513·SWE) does the same
* At the next transition: IFEval_c = √(0.97·GPQA), and two frontier models already fall below this boundary

Half of all benchmarks now exhibit saturation ([Akhtar et al., 2026](https://arxiv.org/abs/2602.16763)). Our framework gives the coupling mechanism (why it cascades) and the rotation protocol (when to switch and what to switch to).

7 falsifiable predictions with timestamped pass/fail criteria. 5 post-cutoff releases fall within our 95% prediction interval (±16.2 pp).
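To make the phase-boundary formulas concrete, here is a minimal Python sketch of how such a check could be applied. It is not the authors' code. The helper names `critical_score` and `classify_phase` are hypothetical, and the sketch assumes that scores are expressed as fractions in [0, 1] and that a model below the boundary is in the tax phase; the post does not state either convention explicitly. The example scores are made up for illustration.

```python
import math

# Coupling constants quoted in the post. The base-model ratio a/b is not
# given there, so the base-scale boundary is left out of this sketch.
FRONTIER_COUPLING = 0.513  # GPQA_c = sqrt(0.513 * SWE)
NEXT_COUPLING = 0.97       # IFEval_c = sqrt(0.97 * GPQA)


def critical_score(coupling: float, driver: float) -> float:
    """Return the boundary sqrt(coupling * driver) for the dependent benchmark."""
    if coupling < 0 or driver < 0:
        raise ValueError("coupling and driver score must be non-negative")
    return math.sqrt(coupling * driver)


def classify_phase(score: float, coupling: float, driver: float) -> str:
    """Label a model relative to the boundary.

    ASSUMPTION: at or above the boundary means 'cooperative', below means 'tax'.
    """
    boundary = critical_score(coupling, driver)
    return "cooperative" if score >= boundary else "tax"


# Hypothetical scores as fractions in [0, 1]; not taken from any real model.
swe, gpqa = 0.70, 0.65
print(round(critical_score(FRONTIER_COUPLING, swe), 3))
print(classify_phase(gpqa, FRONTIER_COUPLING, swe))
```

The same two functions cover every boundary in the list, because only the coupling constant and the pair of benchmarks change between scales.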
# TRY IT

* Interactive dashboard (enter your model's scores, get its phase): [zehenlabs.com/cape/](https://zehenlabs.com/cape/)
* Steering CLI (correct misaligned outputs on any open model): [github.com/adilamin89/cape-scaling](https://github.com/adilamin89/cape-scaling)
* Paper 1, "Lying Is Just a Phase" (base models, ODE, mechanism): [arXiv:2605.18838](https://arxiv.org/abs/2605.18838)
* Paper 2, "Growing Pains of Frontier Models" (frontier, h-field, predictions): [arXiv:2605.18840](https://arxiv.org/abs/2605.18840)
* Blog with steering demo: [zehenlabs.com/blog/](https://zehenlabs.com/blog/)

Built on [EleutherAI](https://www.eleuther.ai/)'s Pythia. Independently confirmed by [AI2](https://allenai.org/)'s OLMo.

Everything is open: code, data, dashboard, steering tool. Happy to answer questions.

