@percyliang: Not only do we want to train a good model, we want to know it'll be good before we even start training. About a month a…
Summary
The Marin team preregistered a predicted loss of 2.252 for a 129B-parameter (16B active) MoE training run at 1e23 FLOPs. The actual result landed at 2.234, about 0.018 (roughly 0.8%) below the prediction, showing that the loss could be forecast accurately before training began.
Cached at: 05/25/26, 06:32 AM
Not only do we want to train a good model, we want to know it’ll be good before we even start training.
About a month ago, the Marin team launched a 129B (16B active) 1e23 FLOPs MoE run and preregistered a loss of 2.252. The run finished this past week and landed at 2.234. https://x.com/percyliang/status/2044994822965191106…
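To put the two numbers in perspective, here is a minimal sketch of the arithmetic behind the gap between the preregistered and final loss. The variable names are illustrative, not taken from the Marin codebase, and the post does not say which error metric the team itself uses.

```python
# Figures quoted in the post; variable names are hypothetical.
predicted_loss = 2.252  # loss preregistered before the run started
actual_loss = 2.234     # loss reported when the run finished

# Positive values mean the run ended below (better than) the prediction.
abs_error = predicted_loss - actual_loss
rel_error = abs_error / predicted_loss

print(f"absolute gap: {abs_error:.3f}")
print(f"relative gap: {rel_error:.2%}")
```

The run finished about 0.018 below the preregistered value, a relative gap of under 1%.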
Similar Articles
@WilliamBarrHeld: To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models…
Marin AI researchers, led by William Barr Held, introduce Delphi, a methodology that pretrains small models to accurately predict the training outcomes of larger 25B-parameter runs. This research aims to establish predictable scaling for more efficient open-source AI model development.
@percyliang: For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So …
Percy Liang announces that the team is compiling a new data mix for the next Marin model and is requesting high-quality token data for pre-training, mid-training, and SFT.
@eliebakouch: one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to…
Marin is an open-source framework from Stanford for reproducible foundation model research, covering data curation, tokenization, training, and evaluation; it was used to train an 8B parameter model that outperforms Llama 3.1 8B.
@vintcessun: Pretraining can be this cost-effective? Train a usable 1B base model from scratch for ~$1000, slashing compute and data by hundreds of times. The key isn't brute-force compute, but hierarchical recursive architecture plus latent space reasoning, combined with PrefixLM packing and FA3 to maximize efficiency. Sounds insane, but the paper and code are open-sourced.
HRM-Text released a 1B-parameter base model, claiming it can be pretrained from scratch for only ~$1000, reducing compute and data volume by hundreds of times. It employs efficient techniques such as hierarchical recursive architecture, latent space reasoning, and PrefixLM packing. The paper and code are open-sourced.
@0xcherry: https://x.com/0xcherry/status/2067610347633025281
This article analyzes the reasons behind the performance leap of Zhipu GLM-5.2, suggesting that its 40B active parameters provide greater effective capacity after accounting for fixed overhead, which makes RL post-training more effective. It also reviews the history of Chinese AI model development and notes that the large-model approach ultimately prevailed.