AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
Summary
This paper introduces AutoLLMResearch, an agentic framework that automates the configuration of expensive LLM experiments by learning from low-fidelity environments and extrapolating to high-cost settings. It aims to reduce computational waste and reliance on expert intuition in scalable LLM research.
Source: https://huggingface.co/papers/2605.11518
Abstract
AutoLLMResearch is an agentic framework that automates high-cost large language model experiment configuration by learning in multi-fidelity experimental environments and identifying promising configurations efficiently through cross-fidelity extrapolation.
Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is enabling an agent to learn through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) a structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.
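To make the MDP framing concrete, here is a minimal Python sketch of what a multi-fidelity configuration environment and its step transition could look like. All names (MultiFidelityEnv, ConfigState, the fidelity cost table) are hypothetical illustrations of the idea, not the paper's actual LLMConfig-Gym API: the agent's state is its history of (configuration, fidelity, outcome) observations, each action picks a configuration and a fidelity at which to probe it, and a compute budget makes the horizon long but finite.

from dataclasses import dataclass, field

@dataclass
class ConfigState:
    """Agent state: configurations tried so far and their observed outcomes."""
    history: list = field(default_factory=list)  # [(config, fidelity, score), ...]
    budget: float = 1.05                         # remaining compute budget

class MultiFidelityEnv:
    """Toy stand-in for an LLMConfig-Gym-style environment (hypothetical API).

    Each fidelity level trades evaluation cost against how faithfully the
    observed score reflects the full-scale, high-cost experiment.
    """
    COSTS = {"low": 0.01, "mid": 0.1, "high": 1.0}

    def __init__(self, evaluate_fn):
        # evaluate_fn(config, fidelity) -> score; in the real setting this
        # would query precomputed, verifiable experiment outcomes.
        self.evaluate_fn = evaluate_fn

    def step(self, state, config, fidelity):
        """One MDP transition: run `config` at `fidelity`, pay its cost,
        and append the outcome to the agent's observation history."""
        cost = self.COSTS[fidelity]
        if cost > state.budget:
            raise ValueError("action exceeds remaining compute budget")
        score = self.evaluate_fn(config, fidelity)
        state.history.append((config, fidelity, score))
        state.budget -= cost
        done = state.budget < min(self.COSTS.values())  # can no longer act
        return state, score, done

# Usage: probe cheaply, then commit the budget to one high-fidelity run.
env = MultiFidelityEnv(lambda cfg, fid: -abs(cfg["lr"] - 3e-4))
state = ConfigState()
for lr in (1e-4, 3e-4, 1e-3):
    state, score, done = env.step(state, {"lr": lr}, "low")
best = max(state.history, key=lambda h: h[2])[0]
state, final_score, done = env.step(state, best, "high")

An agent trained in such a loop is rewarded for spending most of its budget on cheap probes and committing to a high-fidelity run only once its low-fidelity evidence extrapolates well, which is the cross-fidelity extrapolation behavior the paper's training pipeline incentivizes.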
Get this paper in your agent:
hf papers read 2605.11518
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.
Auto Research with Specialist Agents Develops Effective and Non-Trivial Training Recipes
This paper introduces an auto-research framework using specialist agents to iteratively refine training recipes through an empirical loop of code execution and feedback. The system autonomously improves performance on tasks like Parameter Golf and NanoChat without human intervention by leveraging lineage feedback.
@ihtesham2005: If you still think AI agents can't do real research, this paper will end that argument. Researchers from Google and Meta…
Researchers from Google and Meta propose AutoTTS, a framework using AI agents to automatically discover and refine test-time scaling strategies for LLMs without human intervention. The agent successfully identified complex, coordinated reasoning mechanisms that outperformed manual baselines at a low computational cost.
SkillMaster: Toward Autonomous Skill Mastery in LLM Agents
This paper introduces SkillMaster, a training framework that enables LLM agents to autonomously create, refine, and select skills through trajectory-informed review and counterfactual utility evaluation.
Researchers let AI Agents Optimize LLM Reasoning and Cut Tokens by 70%
Researchers developed AutoTTS, a framework where AI agents automatically design control policies to optimize LLM inference, cutting token consumption by approximately 70% while maintaining high reasoning accuracy.