Can Generalist Agents Automate Data Curation?

Hugging Face Daily Papers 06/02/26, 12:00 AM Papers

data-curation generalist-agents coding-agents automated-ml machine-learning data-selection fine-tuning

Summary

This paper asks whether generalist coding agents (Claude Code, Codex, etc.) can automate the data-curation loop. Out-of-the-box agents match published baselines within 10 iterations, but they mostly tune existing policies rather than explore new methods. A scaffold that forces agents to adapt prior research yields a policy that beats the baselines while using one-tenth of their data.

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
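The propose, implement, evaluate, revise loop described in the abstract can be sketched as a small driver. This is only an illustration of the loop's shape, not the Curation-Bench interface: `run_curation_loop`, `propose`, `train_and_eval`, and the toy length-threshold policies are all hypothetical names, and the scoring function is a stand-in for the fixed training/evaluation pipeline.

```python
from typing import Callable, List, Sequence, Tuple

# A policy maps the data pool to the indices of the samples it selects.
Policy = Callable[[Sequence[dict]], List[int]]
History = List[Tuple[str, float]]


def run_curation_loop(
    pool: Sequence[dict],
    propose: Callable[[History], Tuple[str, Policy]],
    train_and_eval: Callable[[List[int]], float],
    iterations: int = 10,
) -> Tuple[str, float, History]:
    """Propose a policy, score it with a fixed pipeline, record it, repeat."""
    history: History = []
    best_name, best_score = "", float("-inf")
    for _ in range(iterations):
        name, policy = propose(history)      # the agent sees all past feedback
        selected = policy(pool)              # apply the policy to the pool
        score = train_and_eval(selected)     # fixed (hypothetical) pipeline
        history.append((name, score))
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score, history


# Toy setup (all assumptions): 20 samples with lengths 10, 20, ..., 200.
pool = [{"id": i, "length": 10 * (i + 1)} for i in range(20)]


def make_length_policy(min_len: int) -> Policy:
    def policy(samples: Sequence[dict]) -> List[int]:
        return [i for i, s in enumerate(samples) if s["length"] >= min_len]
    return policy


def propose(history: History) -> Tuple[str, Policy]:
    # A deliberately "local" proposer: it only varies one threshold.
    thresholds = [50, 100, 150]
    t = thresholds[len(history) % len(thresholds)]
    return f"min_len={t}", make_length_policy(t)


def train_and_eval(selected: List[int]) -> float:
    # Stand-in score that peaks when exactly 10 samples are selected.
    return -abs(len(selected) - 10)


best_name, best_score, history = run_curation_loop(
    pool, propose, train_and_eval, iterations=3
)
```

The toy proposer only sweeps a single threshold, which is the kind of local tuning the trajectory analysis describes. A scaffolded proposer would instead be required to draw each candidate from a different prior method.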

Original Article

Cached at: 06/12/26, 02:52 AM

Paper page - Can Generalist Agents Automate Data Curation?

Source: https://huggingface.co/papers/2606.04261

Hi all. Quick summary of what we think is the interesting part:

Generalist coding agents (Claude Code, Codex, OpenHands with Kimi K2.5 / Qwen3.5-397B) can already run a full data-curation loop: inspect the pool, implement a selection policy, train, evaluate, revise. They match published data-selection baselines (ICONS, ARDS) within 10 iterations, recovering ~60% of the full-data fine-tuning gain from 1.5% of LLaVA-665K. The loop is not limited to instruction tuning: the same setup works for CLIP pretraining on DataComp-Small, where the agent clearly beats the strongest filtering baseline (top-30% CLIP L/14 score).
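A score-threshold filter such as the top-30% CLIP-score baseline mentioned above can be sketched as follows. This is a minimal illustration, assuming one precomputed image-text similarity score per sample. The function name `top_fraction_by_score` and the example scores are hypothetical, and computing the CLIP scores themselves is out of scope here.

```python
import math
from typing import List, Sequence


def top_fraction_by_score(scores: Sequence[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring `fraction` of samples."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # The small epsilon guards against float error, e.g. 10 * 0.3 landing just under 3.
    k = max(1, math.floor(len(scores) * fraction + 1e-9))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # indices returned in original pool order


# Made-up similarity scores for a 10-sample pool.
scores = [0.31, 0.12, 0.27, 0.05, 0.22, 0.35, 0.18, 0.09, 0.29, 0.15]
kept = top_fraction_by_score(scores, fraction=0.30)
```

The same helper expresses the instruction-tuning budget: taking the 665K in the dataset name at face value, a fraction of 0.015 corresponds to roughly 10K samples.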

But trajectory analysis shows what we call the execution-research gap: agents grind local knobs (source ratios, length thresholds, random seeds) instead of exploring new method families. In a typical open-prompt run, only 2/10 iterations try something genuinely new. Strategy guides and paper references don’t fix it. A scaffold requiring each iteration to cite, instantiate, and adapt a method from prior research does: the agent composed an EL2N-style top-loss + noise-filter policy, with no human design input, that beats published baselines that use 10x its data budget.
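To make "EL2N-style top-loss + noise-filter" concrete, here is one way such a policy could look. It is an illustrative sketch and not the policy the agent actually composed, whose details are not given here. EL2N scores a sample by the L2 norm of the difference between the model's predicted class probabilities and the one-hot label. The noise filter below is an assumption: it discards the very highest-scoring samples as likely label noise, then keeps the next hardest ones. The classification-style probabilities are also an assumption, since an instruction-tuning setting would need a loss-based proxy instead. `el2n_score` and `select_hard_but_clean` are hypothetical names.

```python
import math
from typing import List, Sequence


def el2n_score(probs: Sequence[float], label: int) -> float:
    """L2 norm of (predicted probabilities - one-hot label)."""
    return math.sqrt(
        sum((p - (1.0 if j == label else 0.0)) ** 2 for j, p in enumerate(probs))
    )


def select_hard_but_clean(
    scores: Sequence[float], budget: int, noise_frac: float = 0.1
) -> List[int]:
    """Drop the top `noise_frac` scores as suspected noise, keep the next `budget`."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_drop = math.floor(len(scores) * noise_frac + 1e-9)
    return sorted(ranked[n_drop:n_drop + budget])


# Made-up predictions: (class probabilities, true label) for five samples.
preds = [
    ([0.90, 0.10], 0),  # easy, confidently correct
    ([0.60, 0.40], 0),  # moderately hard
    ([0.05, 0.95], 0),  # confidently wrong: treated as suspected label noise
    ([0.50, 0.50], 1),  # hard
    ([0.20, 0.80], 1),  # fairly easy
]
scores = [el2n_score(p, y) for p, y in preds]
selected = select_hard_but_clean(scores, budget=2, noise_frac=0.2)
```

With these toy numbers the confidently wrong sample is filtered out and the two hardest remaining samples are kept, which is the intended trade-off between informativeness and noise.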

One more result we find intriguing: curation search itself scales. Extending the agent budget from 10 to 50 iterations keeps improving average outcomes with no clear plateau. Agent search iterations look like a meaningful compute axis for the finite-data regime.

Environment, trajectory diagnostics, and all scaffolds are open source: https://github.com/feiyang-k/curation-bench. Happy to answer questions.

Similar Articles

Can Generalist Agents Automate Data Curation?

Codex for Everyday Work: AI Agents Beyond Coding

@AlexGDimakis: I am very excited about this research: We show 2 things: 1. If you just do random sampling (i.e. you try to solve a pro…

AI Coding Agents Can Reproduce Social Science Findings

Why is every agent ever made just a worse Claude Code?
