Autodata: An agentic data scientist to create high quality synthetic data

Hugging Face Daily Papers

Summary

Autodata is a method that enables AI agents to act as data scientists to create high-quality synthetic training data through meta-optimization, achieving improved performance across computer science, legal reasoning, and mathematical tasks.

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

Source: https://huggingface.co/papers/2606.25996

Abstract

Autodata enables AI agents to function as data scientists who create high-quality training data through meta-optimization, demonstrating improved performance across multiple task domains.

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.
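
The abstract describes two nested ideas: an agent that builds data, and a meta-optimization step that improves the agent itself. The toy Python sketch below illustrates only that two-level shape and is not the paper's Agentic Self-Instruct implementation. Everything in it is an assumption made for illustration: the names `AgentConfig`, `generate_example`, `verify`, `create_dataset`, `proxy_score`, and `meta_optimize` are hypothetical, the "agent" is a random arithmetic generator standing in for a language model, and the proxy score stands in for downstream model performance.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Hypothetical knobs for a data-creating agent (not taken from the paper)."""
    difficulty: int       # operands are drawn from 1..10**difficulty
    max_refinements: int  # extra generation attempts allowed per example


def generate_example(cfg: AgentConfig, rng: random.Random) -> dict:
    """Stand-in for a model call: emit an addition problem, sometimes with a wrong answer."""
    a = rng.randint(1, 10 ** cfg.difficulty)
    b = rng.randint(1, 10 ** cfg.difficulty)
    answer = a + b
    if rng.random() < 0.3:  # simulated generation error
        answer += 1
    return {"question": f"{a} + {b}", "a": a, "b": b, "answer": answer}


def verify(example: dict) -> bool:
    """Toy verifier: recompute the answer and compare."""
    return example["a"] + example["b"] == example["answer"]


def create_dataset(cfg: AgentConfig, n: int, seed: int) -> list:
    """Inner loop: generate, verify, and retry; keep only verified examples."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        for _attempt in range(cfg.max_refinements + 1):
            example = generate_example(cfg, rng)
            if verify(example):
                data.append(example)
                break
    return data


def proxy_score(data: list, n_requested: int) -> float:
    """Assumed stand-in for downstream performance: fraction of requested examples kept."""
    return len(data) / n_requested


def meta_optimize(candidates: list, n: int, seed: int) -> AgentConfig:
    """Outer loop: pick the agent configuration whose dataset scores best."""
    return max(candidates, key=lambda cfg: proxy_score(create_dataset(cfg, n, seed), n))


if __name__ == "__main__":
    configs = [AgentConfig(difficulty=1, max_refinements=0),
               AgentConfig(difficulty=2, max_refinements=3)]
    best = meta_optimize(configs, n=20, seed=0)
    print("selected config:", best)
```

In a realistic setting the outer loop would presumably score each candidate agent by training and evaluating a model on its data, which is where the extra inference compute mentioned in the abstract would be spent; the selection over a fixed candidate list here is only the simplest possible placeholder for that step.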

