Can Generalist Agents Automate Data Curation?
Summary
This paper explores whether generalist coding agents (Claude Code, Codex, etc.) can automate data curation loops. The agents match published baselines within 10 iterations, but trajectory analysis reveals a gap in exploring new methods. A scaffold that forces agents to adapt prior research yields policies that beat baselines given 10x the data budget.
Source: https://huggingface.co/papers/2606.04261
Hi all. Quick summary of what we think is the interesting part:
Generalist coding agents (Claude Code, Codex, OpenHands with Kimi K2.5 / Qwen3.5-397B) can already run a full data-curation loop: inspect the pool, implement a selection policy, train, evaluate, revise. They match published data-selection baselines (ICONS, ARDS) within 10 iterations, recovering ~60% of the full-data fine-tuning gain from 1.5% of LLaVA-665K. The loop is not limited to instruction tuning: the same setup works for CLIP pretraining on DataComp-Small, where the agent clearly beats the strongest filtering baseline (top-30% CLIP L/14 score).
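For readers who want a concrete picture of the filtering baseline and the data budget mentioned above, here is a minimal sketch. It is not code from the paper or the benchmark: `select_top_fraction` is a hypothetical helper, the scores are made-up stand-ins for image-text similarity scores, and the pool size of 665,000 is the nominal figure implied by the dataset name.

```python
import math

def select_top_fraction(scores, fraction):
    """Return indices of the highest-scoring `fraction` of examples, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.floor(len(scores) * fraction))
    # Rank example indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy similarity scores, invented for illustration only.
toy_scores = [0.31, 0.12, 0.27, 0.05, 0.22, 0.18, 0.29, 0.09, 0.25, 0.15]
kept = select_top_fraction(toy_scores, 0.30)  # top-30% filter

# Budget arithmetic: 1.5% of a pool assumed to hold 665,000 examples.
budget = round(0.015 * 665_000)

print(kept, budget)
```

Under these assumptions the 1.5% budget works out to roughly ten thousand training examples, which is the scale at which the selection policies are compared.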
But trajectory analysis shows what we call the execution-research gap: agents grind local knobs (source ratios, length thresholds, random seeds) instead of exploring new method families. In a typical open-prompt run, only 2/10 iterations try something genuinely new. Strategy guides and paper references don’t fix it. A scaffold requiring each iteration to cite, instantiate, and adapt a method from prior research does: the agent composed an EL2N-style top-loss + noise-filter policy, with no human design input, that beats published baselines given 10x its data budget.
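To make "EL2N-style top-loss + noise-filter" more tangible, here is a rough sketch of what such a policy could look like. The post does not give the agent's actual implementation, so everything below is an assumption: the EL2N score is taken in its usual form (the L2 norm of predicted probabilities minus the one-hot label), the noise filter simply discards a fixed top fraction of scores as suspected label noise, and `top_score_with_noise_filter`, `noise_frac`, and the toy predictions are hypothetical. For language-model data, a per-example loss could stand in for the score.

```python
import math

def el2n_score(probs, label):
    """L2 norm of (predicted probabilities - one-hot label)."""
    return math.sqrt(sum((p - (1.0 if c == label else 0.0)) ** 2
                         for c, p in enumerate(probs)))

def top_score_with_noise_filter(scores, budget, noise_frac):
    """Drop the top `noise_frac` of examples as suspected noise,
    then keep the next `budget` highest-scoring examples."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_drop = int(len(scores) * noise_frac)
    return ranked[n_drop:n_drop + budget]

# Toy model predictions and labels, invented for illustration only.
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.50, 0.10],
    [0.05, 0.05, 0.90],  # confidently contradicts its label: likely noise
    [0.60, 0.30, 0.10],
    [0.20, 0.70, 0.10],
]
toy_labels = [0, 0, 0, 0, 1]

scores = [el2n_score(p, y) for p, y in zip(toy_probs, toy_labels)]
selected = top_score_with_noise_filter(scores, budget=2, noise_frac=0.2)
print(selected)
```

The design intuition is that high-error examples tend to be informative, while the very highest-error ones are disproportionately mislabeled or corrupted, so trimming the extreme tail before taking the top of the ranking trades a little coverage for cleaner data.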
One more result we find intriguing: curation search itself scales. Extending the agent budget from 10 to 50 iterations keeps improving average outcomes with no clear plateau. Agent search iterations look like a meaningful compute axis for the finite-data regime.
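One plausible way to read "average outcomes" as a function of iteration budget is the best score found so far, averaged over independent runs. The sketch below assumes that reading; the post does not say how the curve is computed, the helper names are hypothetical, and the numbers are toy values rather than results.

```python
def best_so_far(scores):
    """Running maximum of per-iteration evaluation scores."""
    best, curve = float("-inf"), []
    for s in scores:
        best = max(best, s)
        curve.append(best)
    return curve

def mean_curve(runs):
    """Average the best-so-far curves of several equal-length runs."""
    curves = [best_so_far(r) for r in runs]
    return [sum(col) / len(col) for col in zip(*curves)]

# Toy per-iteration scores for two runs, invented for illustration only.
toy_runs = [
    [0.50, 0.48, 0.55, 0.53],
    [0.40, 0.52, 0.51, 0.60],
]
avg = mean_curve(toy_runs)
print(avg)
```

A curve like this is non-decreasing by construction, so the interesting question is whether it flattens as the budget grows.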
Environment, trajectory diagnostics, and all scaffolds are open source: https://github.com/feiyang-k/curation-bench. Happy to answer questions.