Donate your coding sessions to an open CC-BY-4.0 dataset to help train open-weight and open source models

Reddit r/LocalLLaMA News

Summary

A new initiative called Trace Commons aims to collect coding agent traces into an open CC-BY-4.0 dataset to help train open-weight and open-source models, countering the data advantage of proprietary models from Anthropic and OpenAI.

Anthropic and OpenAI are collecting a huge amount of data from Claude Code and Codex usage, and I'm quite scared this will create an oligopoly, because only their models will be trained on it, leaving open-weight and open-source models behind. So I'm trying to launch a small initiative called Trace Commons, encouraging people to donate their coding agent traces to an open dataset, [https://trace-commons-web.hf.space/](https://trace-commons-web.hf.space/), so that other model labs can train on them too. Let me know if you have any feedback, and hopefully we'll have a nice open dataset soon!
