Can Generalist Agents Automate Data Curation?

arXiv cs.AI Papers

Summary

Researchers introduce Curation-Bench, a benchmark to evaluate whether generalist coding agents can automate the iterative data curation loop in AI development. Results show agents can match strong baselines within ten iterations, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone.

arXiv:2606.04261v1. Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

Cached at: 06/05/26, 02:06 AM

# Can Generalist Agents Automate Data Curation?
Source: [https://arxiv.org/abs/2606.04261](https://arxiv.org/abs/2606.04261)
[View PDF](https://arxiv.org/pdf/2606.04261)

> Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
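
The propose, implement, evaluate, and revise loop described in the abstract can be illustrated with a toy sketch. Everything below is hypothetical and is not the Curation-Bench API: `Example`, `threshold_policy`, `fixed_pipeline`, and `propose` are made-up names, the "pipeline" is a stand-in scoring function rather than real training and evaluation, and the "agent" is a simple local search. It is meant only to show the shape of the loop, and how an agent that merely nudges one parameter of a single policy family ends up tuning local variants.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Example:
    quality: float  # hypothetical per-example quality score in [0, 1]


Policy = Callable[[List[Example]], List[Example]]


def threshold_policy(min_quality: float) -> Policy:
    """One policy family: keep examples with quality >= min_quality."""
    def select(pool: List[Example]) -> List[Example]:
        return [ex for ex in pool if ex.quality >= min_quality]
    return select


def fixed_pipeline(selected: List[Example]) -> float:
    """Toy stand-in for a fixed train/eval pipeline (not a real benchmark)."""
    if not selected:
        return 0.0
    mean_quality = sum(ex.quality for ex in selected) / len(selected)
    coverage = min(len(selected), 50) / 50  # penalize subsets under 50 examples
    return mean_quality * coverage


def propose(history: List[Tuple[float, float]], step: float = 0.1) -> Optional[float]:
    """Revise step: try an untested neighbour of the best threshold so far."""
    if not history:
        return 0.0
    best_t = max(history, key=lambda h: h[1])[0]
    tried = {t for t, _ in history}
    for cand in (round(best_t + step, 2), round(best_t - step, 2)):
        if 0.0 <= cand <= 1.0 and cand not in tried:
            return cand
    return None  # local neighbourhood exhausted


def curation_loop(pool: List[Example], iterations: int = 10):
    history: List[Tuple[float, float]] = []
    for _ in range(iterations):
        t = propose(history)                      # propose / revise
        if t is None:
            break
        selected = threshold_policy(t)(pool)      # implement
        history.append((t, fixed_pipeline(selected)))  # evaluate
    best_t, best_score = max(history, key=lambda h: h[1])
    return best_t, best_score, history


pool = [Example(quality=i / 100) for i in range(100)]
best_t, best_score, history = curation_loop(pool)
print(best_t, round(best_score, 3), len(history))
```

In this toy setup the search climbs the threshold until the coverage penalty outweighs the quality gain, then stops. It never leaves the single threshold family, which is a small-scale analogue of the local tuning behaviour the abstract describes.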

## Submission history

From: Feiyang Kang. **[v1]** Tue, 2 Jun 2026 22:26:53 UTC (2,150 KB)

Similar Articles

Can Generalist Agents Automate Data Curation?

Hugging Face Daily Papers

This paper explores whether generalist coding agents (Claude Code, Codex, etc.) can automate data curation loops. Agents reach published baselines within 10 iterations, but they mostly tune existing policies rather than explore new methods. A scaffold that forces agents to adapt prior research yields policies that beat baselines using one-tenth of the data.

Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse

arXiv cs.LG

This paper benchmarks agentic AI systems on the task of loading, understanding, and reformatting fragmented neuroscience data, finding that while agents perform well on subtasks, they rarely achieve fully error-free end-to-end solutions and human oversight remains necessary.

AI Coding Agents Can Reproduce Social Science Findings

arXiv cs.CL

This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.

An Empirical Study of Automating Agent Evaluation

arXiv cs.CL

This paper introduces EvalAgent, a system that automates the evaluation of AI agents by encoding domain-specific expertise, addressing the limitations of standard coding assistants in this task. It also presents AgentEvalBench, a benchmark for testing evaluation pipelines, and demonstrates significant improvements in evaluation reliability.