@percyliang: For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So …

Summary

Percy Liang announces that the team is assembling a new data mix for the next Marin model. The mix currently stands at 18T tokens, and he is asking for additional high-quality data for pre-training, mid-training, and SFT.

For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So if you are sitting on some secret stash of high quality tokens, please let us know! Pre-training, mid-training, SFT data all welcome. https://t.co/49DBdzvYXE

Cached at: 05/13/26, 06:25 PM

