Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open-source models
Summary
A new initiative called Trace Commons aims to collect coding agent traces into an open CC-BY-4.0 dataset to help train open-weight and open-source models, countering the data advantage of proprietary models from Anthropic and OpenAI.
Similar Articles
@ClementDelangue: We need open traces so that everyone can train open agent models! cc @steipete @badlogicgames @thdxr @matanSF @hwchase17
Clement Delangue advocates for open traces to democratize training of open agent models.
@kevin_x_li: Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous lar…
SWE-ZERO-12M-trajectories is the largest open agentic trace dataset for coding, with 112B tokens across 12M trajectories from 122K pull requests and 3K repositories, enabling scalable training of agentic coding models without requiring containerized execution.
LocalLLaMA crowdsourced coding dataset
A community member proposes creating a crowdsourced coding dataset for local LLMs to enable collaborative model training and fine-tuning, addressing concerns about future availability of open-weight models.
OpenAI gives free daily tokens if you do this
OpenAI offers free daily API tokens (up to 2.5M tokens for lighter models) through its data sharing program, accessible by toggling a setting in the dashboard; the trade-off is that prompts and outputs may be used for training.
@ClementDelangue: Should we try to train an open source AI building model? We obviously have interesting datasets with HF, MLintern, tran…
Clement Delangue asks whether an open-source AI-building model should be trained, noting that relevant datasets and tools such as HF, MLintern, transformers, and trl are already available.