Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

Reddit r/LocalLLaMA 06/16/26, 09:58 AM News

open-source open-weight coding-agent dataset training-data community-initiative

Summary

A new initiative called Trace Commons aims to collect coding agent traces into an open CC-BY-4.0 dataset to help train open-weight and open-source models, countering the data advantage of proprietary models from Anthropic and OpenAI.

Anthropic and OpenAI are getting so much data from Claude Code and Codex usage, and I'm quite scared this will create an oligopoly because only their models will be trained on it, leaving open-weight and open-source models behind. So I'm trying to launch a little initiative called Trace Commons and encouraging people to donate their coding agent traces to an open dataset [https://trace-commons-web.hf.space/](https://trace-commons-web.hf.space/) so that other model labs can also train on them. Let me know if you have any feedback, and hopefully we can have a nice open dataset soon!

Similar Articles

@ClementDelangue: We need open traces so that everyone can train open agent models! cc @steipete @badlogicgames @thdxr @matanSF @hwchase17

X AI KOLs Following

Clement Delangue advocates for open traces to democratize training of open agent models.

@kevin_x_li: Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous lar…

X AI KOLs Following

SWE-ZERO-12M-trajectories is the largest open agentic trace dataset for coding, with 112B tokens across 12M trajectories from 122K pull requests and 3K repositories, enabling scalable training of agentic coding models without requiring containerized execution.

LocalLLaMA crowdsourced coding dataset

Reddit r/LocalLLaMA

A community member proposes creating a crowdsourced coding dataset for local LLMs to enable collaborative model training and fine-tuning, addressing concerns about future availability of open-weight models.

OpenAI gives free daily tokens if you do this

Reddit r/artificial

OpenAI offers free daily API tokens (up to 2.5M tokens for lighter models) through its data sharing program, accessible by toggling a setting in the dashboard; the trade-off is that prompts and outputs may be used for training.

@ClementDelangue: Should we try to train an open source AI building model? We obviously have interesting datasets with HF, MLintern, tran…

X AI KOLs Following

Clement Delangue asks whether an open source AI building model should be trained, noting available datasets and tools like HF, MLintern, transformers, and trl.
