Evaluation-driven Scaling for Scientific Discovery
Summary
The SimpleTES framework scales evaluation-driven discovery loops across 21 scientific problems, yielding a more than 2× speedup of the LASSO algorithm, a 24.5% reduction in quantum gate overhead, and new Erdős minimum-overlap constructions, while also enabling trajectory-level model post-training.
Source: https://huggingface.co/papers/2604.19341
Abstract
The SimpleTES framework scales evaluation-driven discovery loops for scientific problems, achieving state-of-the-art results across multiple domains through parallel exploration and feedback-driven refinement.
Language models are increasingly used in scientific discovery to generate hypotheses, propose candidate solutions, implement systems, and iteratively refine them. At the core of these trial-and-error loops lies evaluation: the process of obtaining feedback on candidate solutions via verifiers, simulators, or task-specific scoring functions. While prior work has highlighted the importance of evaluation, it has not explicitly formulated the problem of how evaluation-driven discovery loops can be scaled up in a principled and effective manner to push the boundaries of scientific discovery, a problem this paper seeks to address. We introduce Simple Test-time Evaluation-driven Scaling (SimpleTES), a general framework that strategically combines parallel exploration, feedback-driven refinement, and local selection, revealing substantial gains unlocked by scaling evaluation-driven discovery loops along the right dimensions. Across 21 scientific problems spanning six domains, SimpleTES discovers state-of-the-art solutions using gpt-oss models, consistently outperforming both frontier-model baselines and sophisticated optimization pipelines. In particular, we sped up the widely used LASSO algorithm by over 2×, designed quantum circuit routing policies that reduce gate overhead by 24.5%, and discovered new Erdős minimum-overlap constructions that surpass the best-known results. Beyond novel discoveries, SimpleTES produces trajectory-level histories that naturally supervise feedback-driven learning. When post-trained on successful trajectories, models not only improve efficiency on seen problems but also generalize to unseen problems, discovering solutions that base models fail to uncover. Together, our results establish effective evaluation-driven loop scaling as a central axis for advancing LLM-driven scientific discovery, and provide a simple yet practical framework for realizing these gains.
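To make the loop structure concrete, below is a minimal Python sketch of an evaluation-driven discovery loop that combines the three ingredients named in the abstract: parallel exploration, feedback-driven refinement, and local selection. Everything here (propose, refine, evaluate, the beam parameters) is an illustrative placeholder rather than the paper's actual implementation; in practice, propose and refine would be LLM calls and evaluate a task-specific verifier or simulator.

```python
import random
from dataclasses import dataclass, field

# Minimal sketch of an evaluation-driven discovery loop in the spirit of
# SimpleTES. All names (propose, refine, evaluate, beam_width, n_children)
# are illustrative placeholders, not the paper's API.

@dataclass
class Candidate:
    solution: str          # e.g. source code or a combinatorial construction
    score: float           # feedback from a verifier / simulator / scorer
    history: list = field(default_factory=list)  # trajectory for post-training

def evaluate(solution: str) -> float:
    """Task-specific scoring function (verifier, simulator, ...). Stubbed here."""
    return random.random()

def propose(n: int) -> list[str]:
    """Parallel exploration: sample n independent initial candidates."""
    return [f"candidate-{i}" for i in range(n)]

def refine(solution: str, feedback: float) -> str:
    """Feedback-driven refinement: revise a candidate given its score.
    In practice this would be an LLM call conditioned on the feedback."""
    return solution + "'"

def discovery_loop(n_init=8, beam_width=4, n_children=2, n_rounds=5) -> Candidate:
    # Parallel exploration: evaluate many independent starting points.
    beam = [Candidate(s, evaluate(s)) for s in propose(n_init)]
    for _ in range(n_rounds):
        children = []
        for parent in beam:
            for _ in range(n_children):
                # Feedback-driven refinement of each surviving candidate,
                # carrying the lineage forward as a trajectory.
                sol = refine(parent.solution, parent.score)
                children.append(Candidate(
                    sol, evaluate(sol),
                    history=parent.history + [(parent.solution, parent.score)]))
        # Local selection: keep only the best candidates among parents and children.
        beam = sorted(beam + children, key=lambda c: c.score, reverse=True)[:beam_width]
    return max(beam, key=lambda c: c.score)

best = discovery_loop()
print(best.score, len(best.history))
```

The history field records the (solution, score) pairs along each surviving lineage; these correspond to the trajectory-level histories the abstract describes as natural supervision for feedback-driven post-training.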
Similar Articles
TEMPO: Scaling Test-time Training for Large Reasoning Models
TEMPO introduces a test-time training framework that alternates policy refinement with critic recalibration to prevent diversity collapse and sustain performance gains in large reasoning models, boosting AIME 2024 scores for Qwen3-14B from 42.3% to 65.8%.
Scaling Test-Time Compute for Agentic Coding
A test-time scaling framework for agentic coding that compresses rollout trajectories into structured summaries and uses recursive voting/PDR to boost Claude-4.5-Opus to 77.6% on SWE-Bench Verified.
EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis
EnvScaler is an automated framework for scaling tool-interactive environments for LLM agents through programmatic synthesis, creating 191 diverse environments and 7K scenarios to improve agent performance on multi-turn, multi-tool interactions.
EvoMaster: A Foundational Agent Framework for Building Evolving Autonomous Scientific Agents at Scale
EvoMaster is a scalable, self-evolving agent framework for large-scale scientific discovery that enables iterative hypothesis refinement and knowledge accumulation across experimental cycles. It achieves state-of-the-art results on four benchmarks including Humanity's Last Exam (41.1%) and MLE-Bench Lite (75.8%), outperforming general-purpose baselines by up to 316%.
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Stargazer introduces a scalable benchmark environment with 120 astrophysics tasks to evaluate AI agents on physics-grounded model-fitting of radial-velocity data, revealing gaps between statistical optimization and physical constraint adherence.