Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
Summary
This paper introduces an auto-research framework using specialist agents to iteratively refine training recipes through an empirical loop of code execution and feedback. The system autonomously improves performance on tasks like Parameter Golf and NanoChat without human intervention by leveraging lineage feedback.
Cached at: 05/08/26, 07:37 AM
Source: https://huggingface.co/papers/2605.05724
Abstract
Auto research operates as an empirical loop where agents iteratively refine code based on evaluation feedback, achieving improved performance across multiple tasks without human intervention.
We study auto research as a closed empirical loop driven by external measurement. Each submitted trial carries a hypothesis, an executable code edit, an evaluator-owned outcome, and feedback that shapes the next proposal. The output is not a generated paper or a single model checkpoint, but an auditable trajectory of proposals, code diffs, experiments, scores, and failure labels. We instantiate this loop with specialist agents that partition recipe surfaces and share measured lineage across trials. The central empirical finding is that lineage feedback lets agents turn evaluator outcomes, including crashes, budget overruns, size failures, and accuracy-gate misses, into later program-level recipe edits rather than one-shot suggestions. Across 1,197 headline-run trials plus 600 Parameter Golf control trials after one-time setup and launch, humans did not choose proposals, edit recipes, override scores, or repair failed trials during the search. In the three headline runs, the same submitted-trial loop reduces Parameter Golf validation bpb by 0.81%, raises NanoChat-D12 CORE by 38.7%, and reduces CIFAR-10 Airbench96 wallclock by 4.59%, with each task measured by its own external evaluator and legality checks. The trace includes a strict architecture-domain audit of 157 headline-run submissions and program rewrites such as a NanoChat attention-kernel path change. Within this scope the loop autonomously writes code, submits experiments, absorbs feedback, applies and combines known techniques inside each environment, and improves public starting recipes.
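The submitted-trial loop described in the abstract can be pictured with a small sketch. This is not the paper's implementation: every name here (`Trial`, `Lineage`, `Outcome`, `run_loop`, `toy_propose`, `toy_evaluate`) is hypothetical, the "recipe" is a single integer width, and the evaluator is a stand-in that rejects widths above an assumed size limit. It only illustrates how an evaluator-owned outcome, including a failure label, can feed back into the next proposal through a shared lineage.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    # Failure labels named in the abstract, plus a success label (assumed).
    OK = "ok"
    CRASH = "crash"
    BUDGET_OVERRUN = "budget_overrun"
    SIZE_FAILURE = "size_failure"
    ACCURACY_GATE_MISS = "accuracy_gate_miss"


@dataclass
class Trial:
    hypothesis: str                    # what the agent expects to happen
    code_diff: str                     # the executable edit (a toy string here)
    outcome: Optional[Outcome] = None  # set by the evaluator, not the agent
    score: Optional[float] = None      # lower is better in this sketch
    parent: Optional[int] = None       # index of the trial this one builds on


@dataclass
class Lineage:
    trials: List[Trial] = field(default_factory=list)

    def best(self) -> Optional[int]:
        """Index of the best-scoring legal trial, or None if there is none."""
        legal = [(t.score, i) for i, t in enumerate(self.trials)
                 if t.outcome is Outcome.OK and t.score is not None]
        return min(legal)[1] if legal else None

    def failure_labels(self) -> List[Outcome]:
        return [t.outcome for t in self.trials
                if t.outcome not in (None, Outcome.OK)]


def run_loop(propose, evaluate, n_trials: int) -> Lineage:
    """Propose, evaluate, record; no step edits or overrides a score."""
    lineage = Lineage()
    for _ in range(n_trials):
        trial = propose(lineage)
        trial.parent = lineage.best()
        trial.outcome, trial.score = evaluate(trial)
        lineage.trials.append(trial)
    return lineage


def toy_propose(lineage: Lineage) -> Trial:
    # Hypothetical agent: double the width after a success; after a failure,
    # retreat to the midpoint between the best legal width and the failed one.
    # Assumes at least one legal trial exists before the first failure.
    if not lineage.trials:
        width = 64
    else:
        last = lineage.trials[-1]
        last_width = int(last.code_diff.split("=")[1])
        if last.outcome is Outcome.OK:
            width = last_width * 2
        else:
            best = lineage.trials[lineage.best()]
            width = (int(best.code_diff.split("=")[1]) + last_width) // 2
    return Trial(hypothesis=f"width {width} lowers the score",
                 code_diff=f"width={width}")


def toy_evaluate(trial: Trial):
    # Stand-in evaluator with an assumed size limit of 300.
    width = int(trial.code_diff.split("=")[1])
    if width > 300:
        return Outcome.SIZE_FAILURE, None
    return Outcome.OK, 1.0 / width
```

Calling `run_loop(toy_propose, toy_evaluate, 7)` yields a trajectory in which three size failures are recorded as labels and the proposer narrows toward a legal width, which is the kind of auditable record (proposals, diffs, scores, failure labels) the abstract describes.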
Similar Articles
AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration
AutoResearchClaw is a multi-agent autonomous research system that improves scientific discovery through structured debate, self-healing execution, and human collaboration, outperforming previous systems on the ARC-Bench benchmark by 54.7%.
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
This paper introduces AutoLLMResearch, an agentic framework that automates the configuration of expensive LLM experiments by learning from low-fidelity environments and extrapolating to high-cost settings. It aims to reduce computational waste and reliance on expert intuition in scalable LLM research.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch is a multi-agent framework designed to personalize research automation by co-evolving skills, memory, and policy to adapt to individual user preferences and research styles.
How Far Are We From True Auto-Research?
This paper introduces ResearchArena, a scaffold for evaluating auto-research agents, and finds that while agent-generated papers appear competitive under manuscript-only review, artifact-aware review reveals severe failures in experimental rigor, with no paper meeting top-tier acceptance standards.
AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery
A survey paper examining the transition of AI from task-specific assistants to workflow-level research automators, defining AutoResearch as the spectrum of AI-powered scientific workflow automation and analyzing challenges in autonomy, reproducibility, and accountability.