@percyliang: For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So …
Summary
Percy Liang announces that for the next Marin model, they are compiling a new data mix and request high-quality token data for pre-training, mid-training, and SFT.
View Cached Full Text
Cached at: 05/13/26, 06:25 PM
For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So if you are sitting on some secret stash of high quality tokens, please let us know! Pre-training, mid-training, SFT data all welcome. https://t.co/49DBdzvYXE
Similar Articles
@eliebakouch: one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to…
Marin is an open-source framework from Stanford for reproducible foundation model research, covering data curation, tokenization, training, and evaluation; it was used to train an 8B parameter model that outperforms Llama 3.1 8B.
@WilliamBarrHeld: To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models…
Marin AI researchers, led by William Barr Held, introduce Delphi, a methodology that pretrains small models to accurately predict the training outcomes of larger 25B-parameter runs. This research aims to establish predictable scaling for more efficient open-source AI model development.
@percyliang: Not only do we want to train a good model, we want to know it'll be good before we even start training. About a month a…
The Marin team pre-registered a predicted loss of 2.252 for a 129B parameter MoE model training run, and the actual result landed at 2.234, demonstrating accurate loss prediction before training.
Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models
Nous Research releases Token Superposition Training (TST), a method that speeds up LLM pre-training by up to 2.5x across models from 270M to 10B parameters, reducing wall-clock time without altering architecture or data.
Want to build a custom model
A user discusses building a small autocomplete model (25M parameters) as a learning project, mentions hardware constraints (32GB VRAM), data requirements (~100M tokens), and seeks advice on datasets and data formatting for autocomplete-style training.