Autodata: An agentic data scientist to create high quality synthetic data

Hugging Face Daily Papers 06/24/26, 12:00 AM Papers

synthetic-data data-scientist agentic meta-optimization ai-agents self-instruct training-data

Summary

Autodata is a method that enables AI agents to act as data scientists to create high-quality synthetic training data through meta-optimization, achieving improved performance across computer science, legal reasoning, and mathematical tasks.

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.

Original Article

Cached at: 06/25/26, 05:17 AM

Source: https://huggingface.co/papers/2606.25996

Abstract

Autodata enables AI agents to function as data scientists who create high-quality training data through meta-optimization, demonstrating improved performance across multiple task domains.

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent, so that it learns to create even stronger data. We describe the overall formulation, and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher quality model training. Overall, we believe this direction has the potential to change the way we build AI data.
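The abstract describes two nested levels: an agent that builds a dataset, and a meta-optimization step that improves the agent itself. This page does not give the paper's actual algorithm, so the following is only a toy sketch of that two-level shape under loud assumptions. Every name here (`AgentConfig`, `generate_candidates`, `quality`, `build_dataset`, `meta_optimize`) is hypothetical, each "example" is just a number, and the quality function is an arbitrary stand-in for a real evaluation signal. It is not Agentic Self-Instruct.

```python
import random
from dataclasses import dataclass


@dataclass
class AgentConfig:
    """Hypothetical knob standing in for the agent's prompt or policy."""
    difficulty: float  # target difficulty of generated examples, in [0, 1]


def generate_candidates(config, n, rng):
    """Stub generator: each 'example' is a difficulty value near the target."""
    return [min(1.0, max(0.0, rng.gauss(config.difficulty, 0.1))) for _ in range(n)]


def quality(example):
    """Stub quality signal that peaks at difficulty 0.7 (an arbitrary choice)."""
    return 1.0 - abs(example - 0.7)


def build_dataset(config, budget, keep, rng):
    """Inner loop: spend `budget` generations, keep the `keep` best examples."""
    candidates = generate_candidates(config, budget, rng)
    return sorted(candidates, key=quality, reverse=True)[:keep]


def dataset_score(dataset):
    """Mean quality of the kept examples."""
    return sum(quality(x) for x in dataset) / len(dataset)


def meta_optimize(configs, budget, keep, seed=0):
    """Outer loop: return the (config, score) pair with the best dataset score."""
    best = None
    for config in configs:
        rng = random.Random(seed)  # same seed per config for a fair comparison
        score = dataset_score(build_dataset(config, budget, keep, rng))
        if best is None or score > best[1]:
            best = (config, score)
    return best


if __name__ == "__main__":
    configs = [AgentConfig(0.1), AgentConfig(0.4), AgentConfig(0.7)]
    best_config, best_score = meta_optimize(configs, budget=50, keep=10)
    print(best_config, round(best_score, 3))
```

In this toy setting, raising `budget` while holding `keep` and the seed fixed can only keep or raise the dataset score, because the larger candidate pool contains the smaller one. That is a loose analogue of the abstract's point about converting inference compute into data quality, not a reproduction of the paper's results.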

Get this paper in your agent:

hf papers read 2606.25996

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Agents That Build Better Training Data (25 minute read)

@rohanpaul_ai: Very important Meta paper brings Autodata, an agentic data scientist to create high quality synthetic data. The main re…

@HarveenChadha: meta releases Autodata: an agentic data scientist to create high quality synthetic data basically its a loop. given a d…

@neural_avb: https://x.com/neural_avb/status/2072294078805684613

@jaseweston: Claim: Autoresearch that moves the frontier will be about better data: we call that Autodata. 1/6 -- Paper is out! ht…
