@eliebakouch: one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to…
Summary
Marin is an open-source framework from Stanford for reproducible foundation model research, covering data curation, tokenization, training, and evaluation; it was used to train an 8B parameter model that outperforms Llama 3.1 8B.
Cached at: 06/08/26, 03:14 AM
one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to take risks and are fully open (even open development where you can follow everything on github!)
https://t.co/G12JfPlFJP https://t.co/pQYgKgtGNG
marin-community/marin
Source: https://github.com/marin-community/marin
Marin
“I am not afraid of storms, for I am learning how to sail my ship.”
– Louisa May Alcott
Marin is an open-source framework for the research and development of foundation models.
A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process is transparent.
Marin’s primary use case is training language models such as Llama, DeepSeek, and Qwen. This covers the full pipeline: data curation, transformation, filtering, tokenization, training, and evaluation.
We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.
The documentation for Marin is available on ReadTheDocs or in the docs/ folder.
To get started with Marin:
- Install Marin.
- Train a tiny language model using Marin.
- See how to run a much larger DCLM 1B/1x experiment using Marin.
- See a summary of the experiments we’ve run.
- Join the Marin Discord to chat with the community.
Example
Marin experiments are defined as a set of steps that can depend on each other and are executed in a topological order, like a Makefile.
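To make the Makefile analogy concrete, here is a minimal sketch of topological execution using Python's standard-library `graphlib`. It is not Marin's executor: the step names and the `execution_order` helper are hypothetical and only illustrate the ordering guarantee, namely that a step runs after everything it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical step graph: each key is a step name, each value is the set of
# steps it depends on. These names are illustrative, not Marin identifiers.
dependencies = {
    "tokenize": {"download"},
    "train": {"tokenize"},
    "evaluate": {"train"},
    "download": set(),
}


def execution_order(deps):
    """Return step names in an order where every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
for step in order:
    print(f"running {step}")
```

Because this example graph is a single chain, only one valid order exists. In a graph with independent branches, several orders are valid and a scheduler is free to pick any of them.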
As a brief example of how you can use Marin, here is a complete script for training a tiny model on TinyStories. You can check out the full script for more details.
from fray.cluster import ResourceConfig

from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main

# 1. Choose a dataset
tinystories_hf_id = "roneneldan/TinyStories"

# 2. Tokenize the dataset
tinystories_tokenized = default_tokenize(
    name=tinystories_hf_id,  # path to write tokenized files (tokenized/ will be prepended)
    dataset=tinystories_hf_id,  # HF dataset id
    tokenizer=llama3_tokenizer,
)

# 3. Define training configuration
nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=ResourceConfig.with_cpu(),
    train_batch_size=4,
    num_train_steps=100,
    # set hyperparameters
    learning_rate=6e-4,
    weight_decay=0.1,
    # keep eval quick for tutorial
    max_eval_batches=4,
)

# 4. Train the model
nano_tinystories_model = default_train(
    name="marin-nano-tinystories",
    # Steps can depend on other steps: nano_tinystories_model depends on tinystories_tokenized
    tokenized=tinystories_tokenized,
    model_config=llama_nano,
    train_config=nano_train_config,
    # wandb tags
    tags=["llama", "nano", "tinystories", "tutorial"],
    # We can run many [eval_harness](https://github.com/EleutherAI/lm-evaluation-harness) tasks in the loop
    # during training, but there's no point in running evals on such a tiny model
    eval_harness_tasks=[],
    # to keep tutorial fast, skip default validation sets
    use_default_validation=False,
)

if __name__ == "__main__":
    executor_main(steps=[
        nano_tinystories_model,
    ])
Here, we create two steps, one for tokenizing the dataset and one for training the model. The training step depends on the tokenized dataset step, so it will be executed after the tokenization step is completed.
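The dependency in the script is never declared separately: it exists because the tokenization step object is passed as an argument to the training step. The following sketch shows one way such implicit dependencies can be resolved. It is an illustration under stated assumptions, not Marin's implementation: the `Step` class, the `run` function, and the string-returning lambdas are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Step:
    """Hypothetical step: a function plus keyword inputs, some of which may be other steps."""
    name: str
    fn: Callable[..., Any]
    inputs: Dict[str, Any] = field(default_factory=dict)


def run(step: Step, cache: Dict[str, Any], log: List[str]) -> Any:
    """Run `step` after its upstream steps, executing each step at most once."""
    if step.name in cache:
        return cache[step.name]
    # Any input that is itself a Step is an upstream dependency: run it first
    # and substitute its output for the reference.
    resolved = {
        key: run(value, cache, log) if isinstance(value, Step) else value
        for key, value in step.inputs.items()
    }
    log.append(step.name)
    cache[step.name] = step.fn(**resolved)
    return cache[step.name]


# Stand-ins for the tokenize and train steps; they only build strings.
tokenized = Step(
    "tokenize",
    lambda dataset: f"tokens({dataset})",
    {"dataset": "roneneldan/TinyStories"},
)
model = Step(
    "train",
    lambda tokenized, steps: f"model({tokenized}, steps={steps})",
    {"tokenized": tokenized, "steps": 100},
)

execution_log: List[str] = []
result = run(model, {}, execution_log)
print(execution_log)
print(result)
```

Only the final step is requested, yet the tokenization stand-in runs first. This mirrors how the script passes just `nano_tinystories_model` to `executor_main`.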
With slight modifications, you can extend this to train a larger model on a larger dataset, a mixture of datasets, even scaling to very large TPU pods (or multislice TPU, and, soon, multi-node GPUs!).
Agent Skills
- See .agents/skills/ (also .claude/skills/) for loadable agent skills. For example, .agents/skills/add-dataset/ has a step-by-step guide to adding new datasets.
Lucas Beyer (bl16) (@giffmana): Do I understand it correctly that the OLMo from-scratch series is coming to an end?
If so, looks like NVIDIA stepped up just in time with Nemotron models as the only remaining fully-open (ie not just weight drop) from-scratch LLM team.