Can Generalist Agents Automate Data Curation?

arXiv cs.AI 06/04/26, 04:00 AM Papers

data-curation ai-agents benchmarking automation vision-language llm-agents open-source

Summary

Researchers introduce Curation-Bench, a benchmark to evaluate whether generalist coding agents can automate the iterative data curation loop in AI development. Results show agents can match strong baselines within ten iterations, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone.

arXiv:2606.04261v1. Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
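The propose, implement, evaluate, revise loop described in the abstract can be pictured with a small sketch. This is not the Curation-Bench code or its API: every name here (`Proposal`, `Record`, `toy_pipeline`, `run_curation_loop`, `propose_threshold`, the `quality` field) is hypothetical, the "pipeline" is a toy stand-in that just averages a made-up quality score, and the `require_citation` flag is only a loose illustration of the idea that a scaffold can insist each iteration name a prior method.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Example = Dict[str, int]                      # hypothetical data record
Policy = Callable[[List[Example]], List[Example]]


@dataclass
class Proposal:
    """One candidate data policy (hypothetical structure)."""
    name: str
    policy: Policy
    cited_method: Optional[str] = None        # prior method the proposal adapts


@dataclass
class Record:
    """Feedback returned to the proposer after one iteration."""
    name: str
    score: float
    kept: int


def toy_pipeline(subset: List[Example]) -> float:
    """Stand-in for a fixed train/eval pipeline: mean 'quality' of the subset."""
    if not subset:
        return 0.0
    return sum(ex["quality"] for ex in subset) / len(subset)


def run_curation_loop(
    pool: List[Example],
    propose: Callable[[List[Record]], Proposal],
    pipeline: Callable[[List[Example]], float],
    iterations: int = 10,
    require_citation: bool = False,
) -> List[Record]:
    """Propose a policy, apply it, score the result, and feed history back."""
    history: List[Record] = []
    for _ in range(iterations):
        proposal = propose(history)
        # Assumed scaffold rule: reject proposals that cite no prior method.
        if require_citation and not proposal.cited_method:
            raise ValueError(f"proposal {proposal.name!r} cites no prior method")
        subset = proposal.policy(pool)
        history.append(Record(proposal.name, pipeline(subset), len(subset)))
    return history


def propose_threshold(history: List[Record]) -> Proposal:
    """Toy proposer that only tunes one knob: a local policy variant."""
    threshold = 10 * len(history)
    return Proposal(
        name=f"quality>={threshold}",
        policy=lambda pool, t=threshold: [ex for ex in pool if ex["quality"] >= t],
        cited_method="score-threshold filtering (toy label)",
    )


pool = [{"quality": q} for q in range(100)]
history = run_curation_loop(pool, propose_threshold, toy_pipeline,
                            iterations=10, require_citation=True)
best = max(history, key=lambda record: record.score)
print(best.name, best.score, best.kept)       # quality>=90 94.5 10
```

The toy proposer deliberately varies a single threshold, which loosely mirrors the local policy tuning the abstract contrasts with exploring new policy families. A proposer that switched between different selection strategies would plug into the same loop unchanged.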

Cached at: 06/05/26, 02:06 AM

# Can Generalist Agents Automate Data Curation?
Source: [https://arxiv.org/abs/2606.04261](https://arxiv.org/abs/2606.04261)
[View PDF](https://arxiv.org/pdf/2606.04261)

## Submission history

From: Feiyang Kang
**[v1]** Tue, 2 Jun 2026 22:26:53 UTC (2,150 KB)
