We measured how AI capabilities INTERACT as models scale. Below 3.5B, reasoning and truthfulness fight. Above it, they cooperate. The transition is engineerable. (2 papers + interactive dashboard + 7 falsifiable predictions)
Researchers discovered a critical scale (~3.5B parameters) where the trade-off between reasoning and truthfulness in AI models flips from antagonistic to cooperative. They provide a framework, interactive dashboard, and open-source steering tool to identify and correct misaligned outputs at small scales.
# THE FINDING (Paper 1: "Lying Is Just a Phase")

Below a critical scale (~3.5B for Pythia), reasoning and truthfulness ANTICORRELATE: r = -0.989. Train the model to reason better, and it gets less truthful. This is the alignment tax.

Above that scale, they COOPERATE. The tax vanishes. Not gradually: it flips.

But here's what matters for practitioners: the critical scale is a design parameter, not a constant. Three levers shift it:

* Data curation: Phi at 1B achieves coupling characteristic of 10B web-trained. One unit of data quality ≈ 10x model scale.
* Width: Normalizing by model width flips the correlation for ALL tested families.
* Architecture: Gemma-4 at 4B matches 13B+ standard-trained coupling.

Pretraining contributes ~10:1 over RLHF. The tax is not a property of small models; it's a property of how they were trained.

Where does the tax live? Not inside the model. 38/40 models have ZERO competing attention heads. The bottleneck is at the output projection, a dimensional compression artifact that wider models resolve.

Proof-of-concept intervention: Adding a truth-direction vector at the bottleneck layer (quarter-depth) corrects 60% of misaligned outputs at tax scale. Zero retraining. Zero weight modification. Works on any open-weight HuggingFace model:

    git clone https://github.com/adilamin89/cape-scaling.git
    cd cape-scaling
    python cli/cape_steer.py --model EleutherAI/pythia-410m --prompt "The real reason..."

# THE FRONTIER (Paper 2: "Growing Pains of Frontier Models")

At frontier scale (34 models, 10 labs), capabilities cooperate (r = +0.72). But cooperation varies systematically.
The h-field, each model's deviation from the cooperative trend, reveals each lab's training philosophy:

|Lab|h-field|Interpretation|
|:-|:-|:-|
|Google|+5.5|Reasoning-rich, consistent across ALL releases|
|OpenAI|+3.1|Balanced, steady ascent|
|DeepSeek|+1.9|Reversed from +11.2 to -4.7 (pretraining pivot)|
|Anthropic|-6.9|Oscillates: coding excursions that recover within one release|

Per-lab coupling slopes vary 5x: Google converts each SWE-bench point into 1.15 GPQA points. DeepSeek converts at 0.23. The gap originates in pretraining, not RLHF.

The h-field is not just diagnostic; it tells you what to change. Pretraining shifts are permanent. Post-training excursions recover. Knowing which dominates determines whether to retrain or wait.

# THE FRAMEWORK (connects both papers)

The same algebraic phase boundary works at every scale:

* At base: TQA\_c = √((a/b)·HS) classifies each model as tax or cooperative
* At frontier: GPQA\_c = √(0.513·SWE) does the same
* At the next transition: IFEval\_c = √(0.97·GPQA), and two frontier models already fall below this boundary

Half of all benchmarks now exhibit saturation ([Akhtar et al., 2026](https://arxiv.org/abs/2602.16763)). Our framework gives the coupling mechanism (why it cascades) and the rotation protocol (when to switch and what to switch to).

7 falsifiable predictions with timestamped pass/fail criteria. 5 post-cutoff releases fall within our 95% prediction interval (±16.2 pp).
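The frontier boundary quoted above can be turned into a quick check. What follows is a minimal sketch, not the papers' code. The function names are hypothetical, scores are assumed to be fractions in [0, 1] (the post does not state units), and a model at or above the critical value is assumed to count as cooperative. The example scores are made up for illustration.

```python
import math


def critical_score(coupling: float, driver_score: float) -> float:
    """Phase boundary of the form sqrt(coupling * driver_score).

    For the frontier case in the post, coupling = 0.513 and the
    driver is the SWE-bench score; for the next transition,
    coupling = 0.97 and the driver is the GPQA score.
    Assumption: scores are fractions in [0, 1].
    """
    if coupling < 0 or driver_score < 0:
        raise ValueError("coupling and driver_score must be non-negative")
    return math.sqrt(coupling * driver_score)


def phase(score: float, coupling: float, driver_score: float) -> str:
    """Label a model relative to the boundary.

    Assumption: at or above the boundary is 'cooperative',
    below it is 'tax'.
    """
    boundary = critical_score(coupling, driver_score)
    return "cooperative" if score >= boundary else "tax"


if __name__ == "__main__":
    # Hypothetical model: SWE-bench 0.70, GPQA 0.80 (illustrative only).
    swe, gpqa = 0.70, 0.80
    print("GPQA_c:", round(critical_score(0.513, swe), 4))
    print("phase:", phase(gpqa, 0.513, swe))
```

The same two functions cover the IFEval boundary by passing `coupling=0.97` and the GPQA score as the driver. The base-model boundary needs the fitted ratio a/b, which the post does not give.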
# TRY IT

* Interactive dashboard (enter your model's scores, get its phase): [zehenlabs.com/cape/](https://zehenlabs.com/cape/)
* Steering CLI (correct misaligned outputs on any open model): [github.com/adilamin89/cape-scaling](https://github.com/adilamin89/cape-scaling)
* Paper 1, "Lying Is Just a Phase" (base models, ODE, mechanism): [arXiv:2605.18838](https://arxiv.org/abs/2605.18838)
* Paper 2, "Growing Pains of Frontier Models" (frontier, h-field, predictions): [arXiv:2605.18840](https://arxiv.org/abs/2605.18840)
* Blog with steering demo: [zehenlabs.com/blog/](https://zehenlabs.com/blog/)

Built on [EleutherAI](https://www.eleuther.ai/)'s Pythia. Independently confirmed by [AI2](https://allenai.org/)'s OLMo.

Everything is open: code, data, dashboard, steering tool. Happy to answer questions.
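For anyone who wants the idea behind the steering CLI before cloning the repo, here is a toy sketch of the arithmetic described under THE FINDING: pick the layer a quarter of the way through the stack and add a scaled truth-direction vector to the hidden state there. This is not the code in `cli/cape_steer.py`. The function names, the floor rounding for quarter-depth, the unit normalization, and the `strength` parameter are all assumptions for illustration.

```python
from typing import List


def quarter_depth_layer(n_layers: int) -> int:
    """0-based index of the layer a quarter of the way through the stack.

    Assumption: floor rounding; the actual tool may pick differently.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    return n_layers // 4


def steer(hidden: List[float], truth_direction: List[float],
          strength: float) -> List[float]:
    """Return hidden + strength * unit(truth_direction).

    Model weights are untouched; only the activation is shifted.
    """
    if len(hidden) != len(truth_direction):
        raise ValueError("hidden and truth_direction must have equal length")
    norm = sum(d * d for d in truth_direction) ** 0.5
    if norm == 0:
        raise ValueError("truth_direction must be non-zero")
    return [h + strength * d / norm for h, d in zip(hidden, truth_direction)]


if __name__ == "__main__":
    # Toy numbers: a 24-layer stack and a 2-dimensional hidden state.
    layer = quarter_depth_layer(24)
    shifted = steer([1.0, 2.0], [3.0, 4.0], strength=5.0)
    print("steer at layer", layer, "->", shifted)
```

In a real run the same addition would be applied to the model's activations at that layer during inference, which is why no retraining or weight change is needed.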