@eliebakouch: one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to…

X AI KOLs Following 06/07/26, 05:56 PM Tools

open-source framework foundation-models reproducibility training research

Summary

Marin is an open-source framework from Stanford for reproducible foundation model research, covering data curation, tokenization, training, and evaluation; it was used to train an 8B parameter model that outperforms Llama 3.1 8B.

one of my favorite projects is Marin from the stanford folks, they have a scientific approach to training, are ready to take risks and are fully open (even open development where you can follow everything on github!) https://t.co/G12JfPlFJP https://t.co/pQYgKgtGNG

Original Article

Cached at: 06/08/26, 03:14 AM

marin-community/marin

Source: https://github.com/marin-community/marin

Marin

“I am not afraid of storms, for I am learning how to sail my ship.”
– Louisa May Alcott

Marin is an open-source framework for the research and development of foundation models.

A key feature of Marin is reproducibility: every step, from raw data to the final model, is recorded, not just the end result. This includes failed experiments, so the entire research process is transparent.
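
To make the idea of recording every step concrete, including the ones that fail, here is a minimal sketch in plain Python. It is not Marin's implementation: `step_id`, `run_and_record`, and the record fields are hypothetical names chosen for illustration, and the sketch assumes a step can be identified by hashing its name and configuration.

```python
import hashlib
import json


def step_id(name, config):
    """Derive a stable identifier from a step's name and configuration (hypothetical scheme)."""
    payload = json.dumps({"name": name, "config": config}, sort_keys=True)
    return f"{name}-{hashlib.sha256(payload.encode()).hexdigest()[:8]}"


def run_and_record(name, config, fn, log):
    """Run fn(config) and append a record to log whether the step succeeds or fails."""
    record = {"id": step_id(name, config), "config": config}
    try:
        record["result"] = fn(config)
        record["status"] = "succeeded"
    except Exception as exc:
        # Failures are kept in the log rather than discarded.
        record["status"] = "failed"
        record["error"] = repr(exc)
    log.append(record)
    return record


log = []
run_and_record("tokenize", {"dataset": "roneneldan/TinyStories"}, lambda c: "ok", log)
run_and_record("train", {"learning_rate": 6e-4}, lambda c: 1 / 0, log)
print([r["status"] for r in log])
```

Because the identifier depends only on the name and the sorted configuration, rerunning a step with the same settings yields the same identifier, which is one plausible way to tie a result back to exactly what produced it.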

Marin’s primary use case is training language models like Llama, DeepSeek, and Qwen. Notably, this covers the whole pipeline: data curation, transformation, filtering, tokenization, training, and evaluation.

We used Marin to train the first open-source 8B parameter model to outperform Llama 3.1 8B. You can see the training script or read the retrospective.

The documentation for Marin is available on ReadTheDocs or in the docs/ folder.

To get started with Marin:

Install Marin.
Train a tiny language model using Marin.
See how to run a much larger DCLM 1B/1x experiment using Marin.
See a summary of the experiments we’ve run.
Join the Marin Discord to chat with the community.

Example

Marin experiments are defined as a set of steps that can depend on each other and are executed in topological order, much like targets in a Makefile.

As a brief example of how you can use Marin, here is a complete script for training a tiny model on TinyStories. You can check out the full script for more details.

from fray.cluster import ResourceConfig

from experiments.defaults import default_tokenize, default_train
from experiments.llama import llama3_tokenizer, llama_nano
from experiments.simple_train_config import SimpleTrainConfig
from marin.execution.executor import executor_main

# 1. Choose a dataset
tinystories_hf_id = "roneneldan/TinyStories"

# 2. Tokenize the dataset
tinystories_tokenized = default_tokenize(
    name=tinystories_hf_id,  # path to write tokenized files (tokenized/ will be prepended)
    dataset=tinystories_hf_id,  # HF dataset id
    tokenizer=llama3_tokenizer,
)

# 3. Define training configuration
nano_train_config = SimpleTrainConfig(
    # Here we define the hardware resources we need.
    resources=ResourceConfig.with_cpu(),
    train_batch_size=4,
    num_train_steps=100,
    # set hyperparameters
    learning_rate=6e-4,
    weight_decay=0.1,
    # keep eval quick for tutorial
    max_eval_batches=4,
)

# 4. Train the model
nano_tinystories_model = default_train(
    name="marin-nano-tinystories",
    # Steps can depend on other steps: nano_tinystories_model depends on tinystories_tokenized
    tokenized=tinystories_tokenized,
    model_config=llama_nano,
    train_config=nano_train_config,
    # wandb tags
    tags=["llama", "nano", "tinystories", "tutorial"],
    # We can run many [eval_harness](https://github.com/EleutherAI/lm-evaluation-harness) tasks in the loop
    # during training, but there's no point in running evals on such a tiny model
    eval_harness_tasks=[],
    # to keep tutorial fast, skip default validation sets
    use_default_validation=False,
)

if __name__ == "__main__":
    executor_main(steps=[
        nano_tinystories_model,
    ])

Here, we create two steps, one for tokenizing the dataset and one for training the model. The training step depends on the tokenized dataset step, so it will be executed after the tokenization step is completed.
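
The ordering described above can be sketched with Python's standard-library `graphlib`. This is only an illustration of dependency-ordered execution under assumed names, not Marin's executor: the step names and the `run_in_dependency_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical step graph: each key maps to the set of steps it depends on.
# The names mirror the tutorial script but are illustrative only.
dependencies = {
    "tokenize_tinystories": set(),
    "train_nano_model": {"tokenize_tinystories"},
}


def run_in_dependency_order(deps):
    """Return step names in an order where every step follows its dependencies."""
    executed = []
    for step in TopologicalSorter(deps).static_order():
        # A real executor would launch the step's work here; this sketch only records the order.
        executed.append(step)
    return executed


order = run_in_dependency_order(dependencies)
print(order)
```

The training step is listed only after the tokenization step it depends on, which is the same guarantee a Makefile gives when it builds prerequisites before targets.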

With slight modifications, you can extend this to train a larger model on a larger dataset or a mixture of datasets, scaling up to very large TPU pods (or multislice TPU and, soon, multi-node GPUs!).

Agent Skills

See .agents/skills/ (also .claude/skills/) for loadable agent skills. For example, .agents/skills/add-dataset/ has a step-by-step guide to adding new datasets.

Lucas Beyer (bl16) (@giffmana): Do I understand it correctly that the OLMo from-scratch series is coming to an end?

If so, looks like NVIDIA stepped up just in time with Nemotron models as the only remaining fully-open (ie not just weight drop) from-scratch LLM team.

Similar Articles

@WilliamBarrHeld: To train better open models, we need predictable scaling. Delphi is Marin’s first step: we pretrained many small models…

@harshbhatt7585: https://x.com/harshbhatt7585/status/2063593933314113587

@percyliang: For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So …

@heygurisingh: 𝑩𝒊𝒍𝒍𝒊𝒐𝒏-𝒑𝒂𝒓𝒂𝒎𝒆𝒕𝒆𝒓 𝑳𝑳𝑴𝒔 𝒖𝒔𝒆𝒅 𝒕𝒐 𝒄𝒐𝒔𝒕 $10𝑴+ 𝒕𝒐 𝒕𝒓𝒂𝒊𝒏. Someone open sourced a repo t…

@AnandButani: ml-intern by @huggingface is wild You drop a high-level prompt (“build the best scientific reasoning model” or “crush h…
