OpenThoughts-Agent: Data Recipes for Agentic Models

Hugging Face Daily Papers

Summary

This paper introduces OpenThoughts-Agent, an open-source data curation pipeline for training agentic language models. A Qwen3-32B model fine-tuned on the resulting data reaches 44.8% average accuracy across seven agentic benchmarks, outperforming models trained on prior open datasets.

Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open-data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.

Cached at: 06/24/26, 05:47 AM


Source: https://huggingface.co/papers/2606.24855

Abstract

An open-source data curation pipeline for training agentic language models is presented, demonstrating superior performance through systematic experimentation and scalable training data.

Agentic language models dramatically expand the applications of AI, yet little is publicly known about how to curate training data for broadly capable agents. Existing open efforts such as SWE-Smith, SERA, and Nemotron-Terminal typically target a single benchmark, leaving open the question of how to train models that generalize across diverse agentic tasks. The OpenThoughts-Agent (OT-Agent) project addresses this gap with a fully open data curation pipeline for training agentic models. We conduct more than 100 controlled ablation experiments to systematically investigate each stage of the pipeline, yielding insights on the importance of task sources and diversity. We then assemble a training set of 100K examples from our pipeline and fine-tune Qwen3-32B on this dataset, which yields an average accuracy of 44.8% across seven agentic benchmarks and a 3.9 percentage point improvement over the strongest existing open-data agentic model (Nemotron-Terminal-32B, 40.9%). Moreover, our training data exhibits strong scaling properties, outperforming alternative open datasets at every training set size in compute-controlled comparisons. We publicly release our training sets, data pipeline, experimental data, and models at openthoughts.ai to support future open research on agentic model training.
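The headline comparison above is stated in percentage points, which is easy to confuse with a relative improvement. The short Python sketch below works through the arithmetic using only the two averages quoted in the abstract (44.8% and 40.9%). The function and constant names are hypothetical, chosen for illustration, and are not taken from the OT-Agent release.

```python
# Illustrative arithmetic only: the two averages come from the abstract,
# while all function and constant names here are hypothetical.

def percentage_point_gain(new_acc: float, baseline_acc: float) -> float:
    """Absolute difference between two accuracies given in percent."""
    return round(new_acc - baseline_acc, 1)


def relative_gain(new_acc: float, baseline_acc: float) -> float:
    """Improvement expressed as a percentage of the baseline accuracy."""
    return round(100.0 * (new_acc - baseline_acc) / baseline_acc, 1)


OT_AGENT_AVG = 44.8           # average accuracy (%) reported for the fine-tuned Qwen3-32B
NEMOTRON_TERMINAL_AVG = 40.9  # average accuracy (%) reported for Nemotron-Terminal-32B

pp_gain = percentage_point_gain(OT_AGENT_AVG, NEMOTRON_TERMINAL_AVG)
rel_gain = relative_gain(OT_AGENT_AVG, NEMOTRON_TERMINAL_AVG)

print(f"Absolute gain: {pp_gain} percentage points")
print(f"Relative gain: {rel_gain}% over the baseline")
```

Under these figures the absolute gain is 3.9 percentage points, which corresponds to a relative improvement of roughly 9.5% over the baseline average.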



