@percyliang: Not only do we want to train a good model, we want to know it'll be good before we even start training. About a month a…


Summary

The Marin team preregistered a predicted loss of 2.252 for a 129B-parameter (16B active) MoE training run at 1e23 FLOPs. The run finished at 2.234, slightly lower (better) than predicted, which shows the final loss could be forecast closely before training began.


Cached at: 05/25/26, 06:32 AM

Not only do we want to train a good model, we want to know it’ll be good before we even start training.

About a month ago, the Marin team launched a 129B (16B active) 1e23 FLOPs MoE run and preregistered a loss of 2.252. The run finished this past week and landed at 2.234. https://x.com/percyliang/status/2044994822965191106…
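
To put the two numbers in the post side by side, the gap between the preregistered and final loss can be worked out directly. The following is a minimal sketch: the two loss values come from the post, while the helper name `prediction_gap` and the choice to report the gap relative to the final loss are illustrative assumptions, not anything the Marin team describes.

```python
# Loss values quoted in the post.
predicted_loss = 2.252  # preregistered before training
actual_loss = 2.234     # reported when the run finished


def prediction_gap(predicted: float, actual: float) -> tuple[float, float]:
    """Return (absolute gap, gap relative to the actual loss).

    Hypothetical helper for illustration only. A positive gap means
    the run ended with a lower loss than was predicted.
    """
    abs_gap = predicted - actual
    rel_gap = abs_gap / actual
    return abs_gap, rel_gap


abs_gap, rel_gap = prediction_gap(predicted_loss, actual_loss)
print(f"absolute gap: {abs_gap:.3f}")  # absolute gap: 0.018
print(f"relative gap: {rel_gap:.2%}")  # relative gap: 0.81%
```

Under these assumptions the prediction overshot the final loss by 0.018, a bit under 1% of the final value.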
