Can Generalist Agents Automate Data Curation?
Summary
Researchers introduce Curation-Bench, a benchmark that evaluates whether generalist coding agents can automate the iterative data-curation loop in AI development. Out-of-the-box agents match strong published data-selection baselines within ten iterations, but reliable data research requires scaffolded method adaptation rather than open-ended prompting alone.
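As a rough illustration of the loop described here (propose a data policy, submit it to a fixed pipeline, read the score, revise), the following is a minimal sketch under stated assumptions. Every name in it (`fixed_pipeline`, `make_threshold_policy`, `curation_loop`, the toy data pool, the proxy score) is hypothetical and is not taken from the Curation-Bench code. The scoring function is a stand-in for a real training and evaluation run. The loop only sweeps a single threshold, so it resembles tuning local variants of one policy rather than exploring new policy families.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical; none come from the Curation-Bench code.
Example = Tuple[float, int]  # (toy quality score, toy token length)
Policy = Callable[[List[Example]], List[Example]]


@dataclass
class IterationRecord:
    iteration: int
    threshold: float
    score: float


def make_threshold_policy(threshold: float) -> Policy:
    """Build a policy that keeps examples whose quality meets the threshold."""
    return lambda pool: [ex for ex in pool if ex[0] >= threshold]


def fixed_pipeline(selected: List[Example]) -> float:
    """Toy stand-in for a fixed train/eval pipeline (an assumption).

    Rewards high mean quality and penalizes selections under 50 examples.
    """
    if not selected:
        return 0.0
    mean_quality = sum(q for q, _ in selected) / len(selected)
    coverage = min(1.0, len(selected) / 50)
    return mean_quality * coverage


def curation_loop(
    pool: List[Example],
    thresholds: List[float],
    max_iterations: int = 10,
) -> List[IterationRecord]:
    """Submit one policy per iteration and record the score it receives."""
    history: List[IterationRecord] = []
    for i, t in enumerate(thresholds[:max_iterations], start=1):
        policy = make_threshold_policy(t)      # propose and implement
        score = fixed_pipeline(policy(pool))   # submit and evaluate
        history.append(IterationRecord(i, t, score))
    return history


rng = random.Random(0)  # fixed seed so the toy pool is reproducible
pool = [(rng.random(), rng.randint(10, 200)) for _ in range(200)]
history = curation_loop(pool, [i / 10 for i in range(12)])
best = max(history, key=lambda r: r.score)
print(best)
```

The iteration cap of ten mirrors the budget mentioned in the summary. In a real setting the revise step would choose the next policy from the feedback instead of walking a preset list.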
Cached at: 06/05/26, 02:06 AM
# Can Generalist Agents Automate Data Curation?

Source: [https://arxiv.org/abs/2606.04261](https://arxiv.org/abs/2606.04261) [View PDF](https://arxiv.org/pdf/2606.04261)

> Abstract: Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.

## Submission history

From: Feiyang Kang
**[v1]** Tue, 2 Jun 2026 22:26:53 UTC (2,150 KB)
Similar Articles
Can Generalist Agents Automate Data Curation?
This paper explores whether generalist coding agents (Claude Code, Codex, etc.) can automate data-curation loops. Agents reach published baselines within ten iterations but mostly tune variants of existing policies rather than explore new methods. A scaffold that forces agents to adapt prior research yields a policy that beats those baselines at one-tenth their data budget.
Neurodata Without Boredom: Benchmarking Agentic AI for Data Reuse
This paper benchmarks agentic AI systems on the task of loading, understanding, and reformatting fragmented neuroscience data, finding that while agents perform well on subtasks, they rarely achieve fully error-free end-to-end solutions and human oversight remains necessary.
AI Coding Agents Can Reproduce Social Science Findings
This paper introduces SocSci-Repro-Bench, a benchmark of 221 tasks to evaluate AI coding agents' ability to reproduce social science findings from original data and code. It finds that frontier agents like Claude Code and Codex can reproduce a large share of results, with Claude substantially outperforming Codex, and that results are not primarily driven by memorization.
An Empirical Study of Automating Agent Evaluation
This paper introduces EvalAgent, a system that automates the evaluation of AI agents by encoding domain-specific expertise, addressing the limitations of standard coding assistants in this task. It also presents AgentEvalBench, a benchmark for testing evaluation pipelines, and demonstrates significant improvements in evaluation reliability.
A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline
This paper presents an empirical study evaluating general-purpose coding agents on a fly optogenetics data-to-discovery pipeline, finding that while agents can automate individual stages, they struggle with end-to-end tasks requiring scientific judgment and resource management.