LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs

arXiv cs.LG 05/26/26, 04:00 AM Papers
scientific-discovery llm active-learning hypothesis-generation benchmark closed-loop
Summary
LLM-AutoSciLab is a closed-loop framework that uses LLMs to iteratively generate hypotheses, select informative experiments, and refine mechanisms, achieving superior accuracy and sample efficiency on physics and biology benchmarks over prior static methods.
arXiv:2605.24043v1 Announce Type: new Abstract: Scientific discovery is a closed-loop process in which hypotheses guide data acquisition and observations refine the hypothesis space. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition. To address this, we propose LLM-AutoSciLab, a closed-loop framework that couples hypothesis generation with hypothesis-conditioned experiment selection and mechanism refinement. Rather than fitting models to passively collected data, LLM-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence. To evaluate dynamic, closed-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: ActiveSciBench-Chem with 57 enzyme-kinetics tasks and ActiveSciBench-GRN with 45 gene-regulatory-network tasks. These datasets model discovery as a budget-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms. Across NewtonBench, ActiveSciBench-Chem, and ActiveSciBench-GRN, LLM-AutoSciLab outperforms prior methods, achieving 67.6% and 35.1% symbolic accuracy on NewtonBench and ActiveSciBench-Chem, respectively, and 31.1% exact graph recovery on ActiveSciBench-GRN. Moreover, hypothesis-guided experimentation is 2-5x more sample-efficient than the strongest competing baselines. Code and data are available at: https://github.com/scientific-discovery/LLM-AutoSciLab
Cached at: 05/26/26, 08:59 AM
# LLM-AutoSciLab: Closed-Loop Scientific Discovery via Active Experimentation with LLMs
Source: [https://arxiv.org/html/2605.24043](https://arxiv.org/html/2605.24043)
Sanchit Kabra¹\*, Nikhil Abhyankar¹\*, Saaketh Desai², Prasad P\. Iyer², Chandan K\. Reddy¹

¹Virginia Tech, ²Sandia National Laboratories

###### Abstract

Scientific discovery is a closed\-loop process where hypotheses guide data acquisition and observations refine the hypothesis space\. Yet most approaches reduce discovery to supervised learning over fixed datasets, where limited observations can support multiple plausible mechanisms that fit locally but fail to generalize\. Thus, the key challenge is selecting informative observations to resolve uncertainty, shifting the focus from static inference to adaptive data acquisition\. To address this, we propose LLM\-AutoSciLab, a closed\-loop framework that couples hypothesis generation with hypothesis\-conditioned experiment selection and mechanism refinement\. Rather than fitting models to passively collected data, LLM\-AutoSciLab iteratively proposes plausible hypotheses, selects informative experiments to distinguish or refine them, and updates its state using the resulting evidence\. To evaluate dynamic, closed\-loop scientific discovery with active data acquisition, we introduce ActiveSciBench, comprising two datasets: \(i\) ActiveSciBench\-Chem \(57 enzyme\-kinetics tasks\) and \(ii\) ActiveSciBench\-GRN \(45 gene\-regulatory\-network tasks\), which model discovery as a budget\-constrained process requiring adaptive experiment design, variable selection, and recovery of true mechanisms\. Across NewtonBench, ActiveSciBench\-Chem, and ActiveSciBench\-GRN, LLM\-AutoSciLab outperforms prior methods, achieving 67\.6% and 35\.1% symbolic accuracy on NewtonBench and ActiveSciBench\-Chem, respectively, and 31\.1% exact graph recovery on ActiveSciBench\-GRN\. Moreover, hypothesis\-guided experimentation is 2–5× more sample\-efficient than the strongest competing baselines\. Code: https://github\.com/scientific\-discovery/LLM\-AutoSciLab

\*Equal contribution\. Correspondence: sanchit23@vt\.edu, nikhilsa@vt\.edu\.

## 1 Introduction

Discovering governing principles underlying physical systems remains a central challenge in science \(Udrescu and Tegmark, [2020](https://arxiv.org/html/2605.24043#bib.bib63); Petersen et al\., [2021](https://arxiv.org/html/2605.24043#bib.bib26)\)\. Recent advances in large language models \(LLMs\) have enabled systems that leverage pretrained knowledge, reasoning, and tool use to generate hypotheses, analyze observations, and accelerate scientific discovery \(Wang et al\., [2023](https://arxiv.org/html/2605.24043#bib.bib31); AI4Science and Quantum, [2023](https://arxiv.org/html/2605.24043#bib.bib30); Reddy and Shojaee, [2025](https://arxiv.org/html/2605.24043#bib.bib29)\)\. However, *existing methods treat discovery as static, supervised inference on fixed datasets* \(Cranmer, [2023](https://arxiv.org/html/2605.24043#bib.bib4); Shojaee et al\., [2025a](https://arxiv.org/html/2605.24043#bib.bib34)\)\. This static formulation creates an identifiability bottleneck: multiple competing hypotheses can fit the limited observed data equally well yet fail to generalize, making it impossible to recover the true underlying law \(Jiang et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib18)\)\.

In practice, scientific discovery is inherently a closed loop, with hypotheses guiding experiments and observations refining subsequent hypotheses \(Chen et al\., [2025a](https://arxiv.org/html/2605.24043#bib.bib42)\)\. Crucially, scientists design experiments to induce targeted variations that force competing explanations to diverge, revealing distinctions that static data cannot resolve \(Box and Hill, [1967](https://arxiv.org/html/2605.24043#bib.bib41); Ouyang et al\., [2016](https://arxiv.org/html/2605.24043#bib.bib40)\)\. Although self\-driving laboratories \(SDLs\) and active learning systems enable adaptive experimentation \(Ling et al\., [2017](https://arxiv.org/html/2605.24043#bib.bib39); Kusne et al\., [2020](https://arxiv.org/html/2605.24043#bib.bib52); Desai et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib27)\), they still require substantial human effort for hypothesis formulation and refinement\. Moreover, their acquisition strategies are typically optimized for predictive performance and uncertainty reduction, rather than mechanism identification\. Consequently, they are not designed to actively resolve competing hypotheses, limiting recovery of the true underlying law under constrained experimental budgets\.

![Refer to caption](https://arxiv.org/html/2605.24043v1/x1.png)

Figure 1: Overview of LLM\-AutoSciLab\. \(A\) An LLM generates candidate hypotheses from observations and memory\. \(B\) Experiments are actively selected in regions of maximal disagreement among the candidate hypotheses\. \(C\) Candidates are iteratively refined via domain\-specific optimization \(e\.g\., parameter fitting and constraint enforcement\), with confidence\-based feedback guiding updates\.

To address this gap, we propose LLM\-AutoSciLab, *a closed\-loop framework that models scientific discovery as active hypothesis\-conditioned experiment design rather than passive regression over fixed datasets* \(Table [1](https://arxiv.org/html/2605.24043#S1.T1)\)\. At iteration $t$, LLM\-AutoSciLab constructs a structured mechanism hypothesis set from accumulated observations and previous interactions, then identifies regions where candidate mechanisms are predicted to disagree\. New experiments are selected online using a *hypothesis\-conditioned acquisition objective that prioritizes mechanism disambiguation*, acquiring the data most informative for separating competing laws \(Figure [1](https://arxiv.org/html/2605.24043#S1.F1)\)\. The resulting observation is used to evaluate, refine, or eliminate hypotheses, and to inform the next acquisition step\. Unlike Bayesian or traditional active learning methods that acquire data to reduce uncertainty, LLM\-AutoSciLab selects experiments to maximize disagreement among explicit candidate mechanisms, enabling law recovery under constrained experimental budgets\.

Real\-world closed\-loop discovery requires evaluation settings in which data is actively acquired through experimental design\. As shown in Table [2](https://arxiv.org/html/2605.24043#S2.T2), existing benchmarks \(Udrescu and Tegmark, [2020](https://arxiv.org/html/2605.24043#bib.bib63); Cranmer, [2023](https://arxiv.org/html/2605.24043#bib.bib4); Shojaee et al\., [2025b](https://arxiv.org/html/2605.24043#bib.bib37)\) assume fully observed, fixed datasets, reducing discovery to static function fitting\. NewtonBench \(Zheng et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib36)\) introduces interactive probing of memorization\-resistant counterfactual laws, but remains limited to predefined input\-output physics variables and symbolic law recovery\. We address this gap by introducing ActiveSciBench, a benchmark suite for active experimental design grounded in two scientific domains: chemistry and gene regulatory networks\. Both datasets impose budget\-limited oracle access, in which relevant variables are hidden and must be discovered jointly with the experimental design and hypothesis refinement\. ActiveSciBench\-Chem focuses on recovering enzyme\-kinetic rate laws from selected reaction conditions with distractor variables, while ActiveSciBench\-GRN targets signed causal regulatory graphs from perturbation\-response experiments\. Together, they move evaluation beyond symbolic regression to both equation\-structured and graph\-structured discovery\. We evaluate LLM\-AutoSciLab using GPT\-4o\-mini and Qwen\-3\-4B/14B/32B, demonstrating that it discovers governing mechanisms faster across settings\. Our main contributions can be summarized as:

- We introduce LLM\-AutoSciLab, a closed\-loop scientific discovery framework coupling LLM\-guided hypothesis generation, hypothesis\-conditioned experiment design, and refinement\.
- We introduce ActiveSciBench, a benchmark suite for active sequential discovery in scientifically grounded systems, where data is acquired under budget\-limited oracle access and relevant variables must be identified\.
- We propose a hypothesis\-conditioned acquisition strategy that selects experiments maximizing disagreement among competing hypotheses, improving sample efficiency under fixed budgets\.
- We show that LLM\-AutoSciLab outperforms prior methods across benchmarks, achieving up to 67\.6% symbolic accuracy and 31\.1% exact graph recovery, improving sample efficiency by 2–5×\. Ablations confirm the importance of each component\.

Table 1: Comparison of scientific discovery frameworks across key design dimensions\.

## 2 Related Work

#### LLMs for Scientific Discovery\.

LLMs have shown strong potential for accelerating scientific discovery through embedded knowledge and reasoning for hypothesis generation \(Zhou et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib48); Jansen et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib49)\), data\-driven analysis \(Majumder et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib47); Reddy and Shojaee, [2025](https://arxiv.org/html/2605.24043#bib.bib29); Agarwal et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib45)\), and equation discovery \(Shojaee et al\., [2025a](https://arxiv.org/html/2605.24043#bib.bib34); Grayeli et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib50); Behzadifar et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib51)\)\. LLM\-based discovery frameworks have also been applied across domains such as chemistry \(Wang et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib33)\), materials discovery \(Abhyankar et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib32); Gan et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib58)\), and program synthesis \(Romera\-Paredes et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib35)\)\. However, most existing systems are both representation\-specific and passive: they search within a predetermined output space, such as equations, materials, or programs, and use LLMs primarily for post\-hoc hypothesis generation and refinement over pre\-collected, static datasets \(Table [1](https://arxiv.org/html/2605.24043#S1.T1)\)\. We extend this line of work by using the representational flexibility of LLMs beyond candidate generation: hypotheses serve as mechanism\-level objects that guide online experiment selection, closing the loop between hypothesis generation, data acquisition, and refinement\.

#### Experiment Design for Scientific Discovery\.

Experiment design formalizes discovery as selecting measurements that reduce uncertainty over hypotheses under limited budgets \(Ouyang et al\., [2016](https://arxiv.org/html/2605.24043#bib.bib40)\), with applications across materials and process optimization \(Ling et al\., [2017](https://arxiv.org/html/2605.24043#bib.bib39); Kusne et al\., [2020](https://arxiv.org/html/2605.24043#bib.bib52)\), drug discovery and molecular design \(Bailey et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib61); Kyro et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib60)\), genomics and perturbation screening \(Huang et al\., [2023](https://arxiv.org/html/2605.24043#bib.bib59); Qin et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib77)\), and applied physics \(Melnikov et al\., [2018](https://arxiv.org/html/2605.24043#bib.bib57)\)\. Self\-driving laboratories extend this paradigm to physical closed\-loop platforms by coupling adaptive decision\-making with automated synthesis and characterization \(Abolhasani and Kumacheva, [2023](https://arxiv.org/html/2605.24043#bib.bib53); MacLeod et al\., [2020](https://arxiv.org/html/2605.24043#bib.bib54)\)\. Recent systems such as AutoSciLab \(Desai et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib27)\) integrate active learning with symbolic model recovery, while broader discovery frameworks coordinate experiment selection, model construction, and revision \(Langley, [2024](https://arxiv.org/html/2605.24043#bib.bib56)\)\. However, such systems often depend on domain\-specific experimental interfaces, acquisition objectives, model classes, or predefined hypothesis spaces, limiting representation\-agnostic discovery\. LLM\-AutoSciLab instead treats acquisition as mechanism discrimination: it constructs competing hypotheses, identifies where they diverge, and selects experiments to separate, refine, or falsify them\.

#### Benchmarks for Scientific Discovery\.

Scientific discovery benchmarks have largely evaluated recovery from fixed observations, where variables are provided and the target is an equation or predictive model \(Udrescu and Tegmark, [2020](https://arxiv.org/html/2605.24043#bib.bib63); Cranmer, [2023](https://arxiv.org/html/2605.24043#bib.bib4)\) \(Table [2](https://arxiv.org/html/2605.24043#S2.T2)\)\. Recent discovery benchmarks reduce memorization through newly generated or out\-of\-distribution tasks, but still evaluate discovery as offline recovery from pre\-collected datasets rather than active acquisition of informative observations \(Shojaee et al\., [2025b](https://arxiv.org/html/2605.24043#bib.bib37); Kabra et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib62)\)\. NewtonBench \(Zheng et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib36)\) introduces active querying over counterfactual systems, but remains limited to predefined variables and closed\-form law recovery\. Other benchmarks focus on dynamical prediction \(Takamoto et al\., [2022](https://arxiv.org/html/2605.24043#bib.bib64); d’Ascoli et al\., [2024](https://arxiv.org/html/2605.24043#bib.bib65)\), condition optimization \(Häse et al\., [2021](https://arxiv.org/html/2605.24043#bib.bib66)\), or causal and gene\-regulatory inference from benchmark\-provided perturbations \(Chevalley et al\., [2025](https://arxiv.org/html/2605.24043#bib.bib67); Pratapa et al\., [2019](https://arxiv.org/html/2605.24043#bib.bib68); Schaffter et al\., [2011](https://arxiv.org/html/2605.24043#bib.bib69)\)\. In contrast, our benchmarks evaluate active mechanism discovery, where the learner must choose experiments under a fixed budget, identify relevant variables, and recover equation\- or graph\-structured mechanisms from hidden experimental systems\.

Table 2: Comparison of scientific discovery benchmarks across key properties\.

## 3 LLM\-AutoSciLab Method

We instantiate LLM\-AutoSciLab as an iterative algorithm over a dynamically maintained hypothesis set\. At each iteration, candidate mechanisms are sampled from a distribution conditioned on the current state, and the next experiment is selected by maximizing an inter\-hypothesis disagreement objective over this set\. The resulting observation is incorporated by refitting each candidate hypothesis on the augmented dataset, computing its empirical loss, and applying stability\-based filtering to retain consistent mechanisms and eliminate unstable ones\.

### 3\.1 Problem Formulation

Algorithm 1: LLM\-AutoSciLab

1: Oracle $\mathcal{O}$, Dataset $\mathcal{D}$, Budget $B$, State $\mathcal{S}_t$, Search Region $\mathcal{R}$, Memory $\mathcal{E}$, LLM $\pi_{\theta}$, Hypothesis Set $\mathcal{H}_t$, Confidence Threshold $\tau_{\rm conf}$, Confidence Score $c_t$

2: \# Initialize data and experience buffer

3: $\mathcal{D}_0, c_0, \mathcal{E}_0 \leftarrow \emptyset, \emptyset, \texttt{InitMemory}()$

4: for $t = 0, \ldots, B-1$ do

5: $\mathcal{S}_t \leftarrow (\mathcal{D}_t, \mathcal{E}_t, \mathcal{H}_t)$

6: \# Propose hypotheses and search regions

7: $\mathcal{H}_t, \mathcal{R}_t \leftarrow \texttt{GenHyp}(\pi_{\theta}^{\rm large}, \pi_{\theta}^{\rm small}, \mathcal{S}_t)$

8: \# Select acquisition mode

9: if $c_t < \tau_{\rm conf}$ then

10: $\texttt{mode} \leftarrow \texttt{Disambiguate}$

11: $\Delta_t \leftarrow \texttt{Disagree}(\mathcal{H}_t, \mathcal{D}_t)$

12: else

13: $\texttt{mode} \leftarrow \texttt{Refine}$

14: $\Delta_t \leftarrow \emptyset$

15: end if

16: \# Acquire new experiment

17: $\mathbf{x}_{t+1} \leftarrow \texttt{Acquire}(\mathcal{D}_t, \mathcal{H}_t, \mathcal{R}_t, \texttt{mode}, \Delta_t)$

18: $y_{t+1} \leftarrow \mathcal{O}(\mathbf{x}_{t+1})$

19: $\mathcal{D}_{t+1} \leftarrow \mathcal{D}_t \cup \{(\mathbf{x}_{t+1}, y_{t+1})\}$

20: \# Refine and update memory

21: $\hat{m}_{t+1}, c_{t+1} \leftarrow \texttt{RefineHyp}(\mathcal{D}_{t+1}, \mathcal{H}_t)$

22: $\tilde{m}_{t+1}, \tilde{c}_{t+1} \leftarrow \texttt{ConfGate}(\hat{m}_{t+1}, c_{t+1})$

23: $\mathcal{E}_{t+1} \leftarrow \texttt{UpdateMemory}()$

24: end for

25: return $\tilde{m}_B$

We frame scientific discovery as an active experimental design task, optimizing hypothesis selection under a fixed resource budget\. Let $\mathcal{M}$ denote a space of candidate mechanisms, where each mechanism $m \in \mathcal{M}$ defines a predictive mapping $f_m : \mathcal{X} \rightarrow \mathcal{Y}$\. The objective is to recover the unknown ground\-truth mechanism $m^{\star} \in \mathcal{M}$\. At each round $t$, the learner selects an experiment $\boldsymbol{x}_t \in \mathcal{X}$ that yields an observation $y_t \sim p(\cdot \mid \boldsymbol{x}_t, m^{\star})$, where $p$ is the observation model\. Unlike static settings, where the data are fixed a priori, this task requires active data acquisition\. Consequently, the dataset accumulated after $t$ rounds, $\mathcal{D}_t = \{(\boldsymbol{x}_i, y_i)\}_{i=1}^{t}$, grows iteratively until the total experimental budget $B$ is exhausted\. At each step, the process maintains a discovery state $\mathcal{S}_t = (\mathcal{D}_t, \mathcal{E}_t, \mathcal{H}_t)$, where $\mathcal{E}_t$ is structured memory summarizing prior hypotheses and evidence, and $\mathcal{H}_t = \{m^{(k)}\}_{k=1}^{K} \subseteq \mathcal{M}$ is the current set of plausible mechanisms\. A discovery policy $\pi$ maps the current state to the next experiment, producing a sequence of adaptive queries $\boldsymbol{x}_{t+1} = \pi_t(\mathcal{S}_t)$, observations $y_{t+1} \sim p(\cdot \mid \boldsymbol{x}_{t+1}, m^{\star})$, and updated states $\mathcal{S}_{t+1}$\. The objective is to produce a final estimate $\hat{m}_B$ that minimizes the expected mechanism error $\mathbb{E}[\mathcal{L}(\hat{m}_B, m^{\star})]$, where $\mathcal{L}$ is a domain\-dependent loss and the expectation is over trajectories induced by $\pi$ and $p$, with $m^{\star}$ fixed\.
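The formulation above can be made concrete with a small Python sketch\. It is an illustration under stated assumptions, not the authors' implementation: the `State` class, `run_loop`, the elimination tolerance `tol`, the fixed three\-candidate hypothesis set, and the scripted query policy are all hypothetical stand\-ins for the LLM\-driven components\. It shows the budgeted query loop and why an uninformative experiment \(here $x = 1$, where all candidates agree\) leaves the hypothesis set unchanged\.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Mechanism = Callable[[float], float]


@dataclass
class State:
    """Discovery state S_t = (D_t, E_t, H_t); field names are illustrative."""
    data: List[Tuple[float, float]] = field(default_factory=list)    # D_t
    memory: List[str] = field(default_factory=list)                  # E_t
    hypotheses: Dict[str, Mechanism] = field(default_factory=dict)   # H_t


def sse(f: Mechanism, data: List[Tuple[float, float]]) -> float:
    """Sum of squared errors of mechanism f on the observed data."""
    return sum((f(x) - y) ** 2 for x, y in data)


def run_loop(oracle, hypotheses, policy, budget, tol=1e-9):
    """Query the oracle `budget` times, dropping hypotheses that misfit the data."""
    state = State(hypotheses=dict(hypotheses))
    for t in range(budget):
        x = policy(state)                 # x_{t+1} = pi_t(S_t)
        y = oracle(x)                     # noiseless observation (assumption)
        state.data.append((x, y))
        survivors = {name: f for name, f in state.hypotheses.items()
                     if sse(f, state.data) <= tol}
        if survivors:                     # never discard every candidate
            state.hypotheses = survivors
        state.memory.append(f"round {t}: x={x}, kept={sorted(state.hypotheses)}")
    best = min(state.hypotheses,
               key=lambda name: sse(state.hypotheses[name], state.data))
    return best, state


# Toy hypothesis set; the hidden law is quadratic.
candidates = {
    "linear": lambda x: x,
    "quadratic": lambda x: x ** 2,
    "cubic": lambda x: x ** 3,
}
schedule = [1.0, 2.0, 3.0]                # scripted policy, not hypothesis-conditioned
best, final_state = run_loop(
    oracle=lambda x: x ** 2,
    hypotheses=candidates,
    policy=lambda s: schedule[len(s.data)],
    budget=3,
)
print(best)
for line in final_state.memory:
    print(line)
```

The first query is wasted because every candidate predicts the same value at $x = 1$; only the second query separates them, which is the identifiability issue that motivates choosing experiments where hypotheses diverge\.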

### 3\.2 Hypothesis Generation

To mitigate the path\-dependence and premature collapse of one\-shot LLM hypothesis generation under sparse data \(Chen et al\., [2025b](https://arxiv.org/html/2605.24043#bib.bib43)\), LLM\-AutoSciLab decouples hypothesis diversity from hypothesis synthesis\. As shown in Algorithm [1](https://arxiv.org/html/2605.24043#alg1), $\texttt{GenHyp}(\cdot)$ achieves this through asymmetric model roles\. A smaller LLM $\pi_{\theta}^{\rm small}$ is sampled in batches to generate candidate hypotheses conditioned on the current state $\mathcal{S}_t$\. These are grouped into structural mechanism families, and sampling continues until the distribution stabilizes, yielding a hypothesis set $\mathcal{H}_t$\. A larger LLM $\pi_{\theta}^{\rm large}$ then conditions on $\mathcal{H}_t$ to produce a structured proposal containing a primary hypothesis, alternative hypotheses, and diagnostic search regions $\mathcal{R}_t \subseteq \Theta$\. Thus, LLMs are used to define a hypothesis space for data acquisition via experimentation, rather than to produce a final answer\.
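One way to read "sampling continues until the distribution stabilizes" is as a stopping rule on the empirical distribution over mechanism families\. The sketch below assumes that reading; the stopping criterion \(total variation distance between successive batches below `eps`\), the family labels, and `sample_batch`, which replaces the small LLM with a fixed random sampler, are all hypothetical choices rather than details from the paper\.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical family labels; in practice they would come from grouping LLM outputs.
FAMILIES = ["michaelis_menten", "competitive_inhibition", "substrate_inhibition"]


def sample_batch(rng: random.Random, size: int) -> List[str]:
    """Stand-in for one batch of small-LLM proposals, already mapped to families."""
    return rng.choices(FAMILIES, weights=[0.6, 0.3, 0.1], k=size)


def family_distribution(counts: Counter) -> Dict[str, float]:
    """Normalize cumulative family counts into an empirical distribution."""
    total = sum(counts.values())
    return {fam: counts[fam] / total for fam in FAMILIES}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two distributions over FAMILIES."""
    return 0.5 * sum(abs(p[fam] - q[fam]) for fam in FAMILIES)


def sample_until_stable(rng: random.Random, batch_size: int = 20,
                        eps: float = 0.02, max_batches: int = 50
                        ) -> Tuple[Dict[str, float], int]:
    """Keep sampling batches until the family distribution stops moving."""
    counts = Counter(sample_batch(rng, batch_size))
    previous = family_distribution(counts)
    for n_batches in range(2, max_batches + 1):
        counts.update(sample_batch(rng, batch_size))
        current = family_distribution(counts)
        if total_variation(previous, current) < eps:
            return current, n_batches
        previous = current
    return previous, max_batches


dist, n_batches = sample_until_stable(random.Random(0))
print(n_batches, dist)
```

Families that retain non\-negligible mass after stabilization would then form the candidate set handed to the larger model for synthesis\.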

### 3\.3 Hypothesis\-Conditioned Experiment Selection

The data acquisition policy is governed by the disagreement among the hypotheses $\mathcal{H}_t$ within the LLM\-generated search region $\mathcal{R}_t$\. Given the confidence score $c_t$, LLM\-AutoSciLab first determines whether the current hypothesis is sufficiently stable to serve as the basis for local refinement\. If $c_t < \tau_{\rm conf}$, LLM\-AutoSciLab retains the full hypothesis set $\mathcal{H}_t$ and selects experiments for mechanism disambiguation; otherwise, it treats the current mechanism as stable and shifts acquisition to refinement\. In Disambiguate mode, LLM\-AutoSciLab computes $\Delta_t = \texttt{Disagree}(\mathcal{H}_t, \mathcal{D}_t)$, which governs the acquisition strategy, and $\texttt{Acquire}(\cdot)$ selects $\boldsymbol{x}_{t+1} \in \mathcal{R}_t$ where candidate mechanisms in $\mathcal{H}_t$ make maximally divergent predictions\. In Refine mode, $\texttt{Acquire}(\cdot)$ instead selects experiments that improve the fit or parameterization of the supported mechanism family\. Unlike Bayesian experimental design, which typically optimizes information gain under a predefined probabilistic model class, LLM\-AutoSciLab constructs and revises explicit mechanism hypotheses and uses their predicted disagreements to guide experiment selection\. This makes acquisition depend on the status of the evolving hypothesis space, rather than predictive improvement alone\.
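The Disambiguate branch can be sketched as follows\. The paper does not specify the disagreement functional in this section, so the mean squared pairwise prediction gap used here is an assumption, as are the function names, the discrete candidate region, and the two toy mechanisms\.

```python
from itertools import combinations
from typing import Callable, Sequence

Mechanism = Callable[[float], float]


def disagreement(hypotheses: Sequence[Mechanism], x: float) -> float:
    """Mean squared pairwise prediction gap at x (an assumed disagreement score)."""
    pairs = list(combinations(hypotheses, 2))
    if not pairs:
        return 0.0
    return sum((f(x) - g(x)) ** 2 for f, g in pairs) / len(pairs)


def select_mode(confidence: float, tau_conf: float) -> str:
    """Mode switch from Algorithm 1: disambiguate while confidence is low."""
    return "Disambiguate" if confidence < tau_conf else "Refine"


def acquire(hypotheses: Sequence[Mechanism], region: Sequence[float]) -> float:
    """Pick the candidate experiment where the hypotheses diverge the most."""
    return max(region, key=lambda x: disagreement(hypotheses, x))


# Two toy rate laws that are nearly identical at low substrate concentration.
first_order = lambda s: s
saturating = lambda s: s / (1.0 + 0.1 * s)

region = [0.1, 1.0, 10.0]                 # hypothetical search region R_t
mode = select_mode(confidence=0.3, tau_conf=0.8)
x_next = acquire([first_order, saturating], region)
print(mode, x_next)
```

Both candidates fit low\-concentration data almost equally well, so the score steers the next query to the high\-concentration end of the region, where saturation becomes visible\. The Refine branch is omitted because its objective is not described in enough detail here to sketch faithfully\.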

### 3\.4 Hypothesis Optimization and Confidence Feedback

After each experiment, the new observations from $\mathcal{O}$ are incorporated into $\mathcal{D}_t$ to produce the new dataset $\mathcal{D}_{t+1}$\. As shown in Algorithm [1](https://arxiv.org/html/2605.24043#alg1), $\texttt{RefineHyp}(\cdot)$ fits $\mathcal{H}_t$ to the new dataset, producing a refined mechanism $\hat{m}_{t+1}$ along with a confidence score $c_{t+1}$\. This step turns each generated hypothesis structure into a data\-evaluated mechanism on $\mathcal{D}_{t+1}$\. To assess robustness under adaptive data collection, we introduce a confidence gate applied via $\texttt{ConfGate}(\cdot)$\. Since $\mathcal{D}_{t+1}$ may be biased toward the currently selected regions, this step performs bootstrap resampling, refits the candidate mechanism on each resampled dataset, and measures agreement across the resulting fits\. Consistent hypotheses are assigned higher confidence $\tilde{c}_{t+1}$, while those exhibiting instability in structure or predictions are treated as brittle\. The confidence\-adjusted mechanism $\tilde{m}_{t+1}$ is then written back into memory through $\texttt{UpdateMemory}(\cdot)$, informing subsequent hypothesis generation and acquisition decisions\. Appendix [E\.1](https://arxiv.org/html/2605.24043#A5.SS1) provides a complete algorithmic specification and implementation details\.
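The bootstrap idea behind the confidence gate can be illustrated with a one\-parameter model\. This is a minimal sketch, not the paper's `ConfGate`: the through\-origin linear fit, the agreement score $1 / (1 + \text{relative spread})$, and all function names are assumptions chosen so the example needs only the standard library\.

```python
import random
from statistics import mean, pstdev
from typing import List, Tuple

Data = List[Tuple[float, float]]


def fit_slope(data: Data) -> float:
    """Least-squares slope for the one-parameter model y = a * x."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    return sxy / sxx if sxx > 0 else 0.0


def bootstrap_confidence(data: Data, n_boot: int = 200, seed: int = 0) -> float:
    """Refit on bootstrap resamples and score how stable the fitted parameter is.

    Returns a value in (0, 1]; 1.0 means every resample gave the same fit.
    """
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_boot):
        resample = [rng.choice(data) for _ in data]   # sample with replacement
        slopes.append(fit_slope(resample))
    center = mean(slopes)
    spread = pstdev(slopes)
    relative = spread / abs(center) if center != 0 else float("inf")
    return 1.0 / (1.0 + relative)


# Data generated by y = 2x: the linear hypothesis is stable under resampling.
linear_data = [(float(x), 2.0 * x) for x in range(1, 6)]
# Data generated by y = x^2: the same hypothesis gives a different slope per resample.
curved_data = [(float(x), float(x * x)) for x in range(1, 6)]

c_stable = bootstrap_confidence(linear_data)
c_brittle = bootstrap_confidence(curved_data)
print(c_stable, c_brittle)
```

A mis\-specified mechanism shows up as a fit that moves when the data are resampled, so its score drops below that of the correctly specified one; comparing such a score against $\tau_{\rm conf}$ is what would drive the mode switch in Algorithm 1\.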

### 3\.5 Implementation Details

We use GPT\-4o\-mini as the primary LLM backbone and Qwen/Qwen2\.5\-7B\-Instruct for the smaller models used in adaptive ensembling\. Data acquisition follows the NewtonBench setup \(Zheng et al\., [2026](https://arxiv.org/html/2605.24043#bib.bib36)\), where the oracle is a noiseless black\-box function $u \mapsto f_{\mathrm{target}}(u)$ defined over an open set $U$ of achievable target\-input values\. For equation discovery, the refinement backend uses PySR with 800 iterations per fitting call, together with direct numerical fitting of candidate mechanism skeletons when available\. For graph discovery, refinement uses BFGS with 800 iterations\. All models are used in inference mode without task\-specific finetuning; additional implementation details and hyperparameters are reported in Appendix [E\.2](https://arxiv.org/html/2605.24043#A5.SS2)\.

## 4 ActiveSciBench: Benchmark for Active Scientific Discovery

Real\-world scientific discovery requires not only inferring governing laws from observations, but also choosing experiments that yield the most informative data\. Existing benchmarks reduce discovery to passive inference on fixed datasets, bypassing the experimental\-design problem that is critical when observations are costly\. To address this gap, we introduce ActiveSciBench, a two\-dataset benchmark suite for active, closed\-loop scientific discovery \(Figure [2](https://arxiv.org/html/2605.24043#S4.F2)\)\. Each dataset is based on *physically grounded laws* and a queryable experimental system in which the underlying law and parameters are hidden, relevant variables are unknown a priori, and discovery must occur within a fixed experimental budget\.

### 4\.1 ActiveSciBench\-Chem: Active Enzyme\-Kinetic Law Discovery

#### Task Formulation\.

Enzyme\-kinetic rate laws describe how reaction rates vary as a function of experimental conditions\. In ActiveSciBench\-Chem \(Figure [2](https://arxiv.org/html/2605.24043#S4.F2)\(a\)\), each task simulates an enzyme\-catalyzed reaction governed by a hidden kinetic mechanism and hidden parameters; the learner must recover the symbolic rate law from budget\-limited experiments\. Each experiment is specified via a shared 7\-dimensional interface for substrate, inhibitor, second\-substrate, and product concentrations, enzyme loading, temperature, and pH, and returns the observed initial rate $r_0$ along with auxiliary mass\-balance observables\. The true rate law, including its functional form and which inputs actually appear, is withheld from the learner throughout\. Candidate mechanisms can produce indistinguishable behavior over restricted regions of the design space, and the correct law is often recoverable only through experiments that deliberately isolate individual dependencies\.
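A short example shows how two textbook mechanisms can be indistinguishable on a restricted slice of the design space\. The rate laws are the standard Michaelis\-Menten and competitive\-inhibition forms; the `Conditions` class, its field names and defaults, and the parameter values are hypothetical and are not the benchmark's actual interface or simulator\.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Conditions:
    """Illustrative 7-dimensional experiment specification (names are assumed)."""
    substrate: float
    inhibitor: float = 0.0
    second_substrate: float = 0.0
    product: float = 0.0
    enzyme: float = 1.0
    temperature: float = 298.15
    ph: float = 7.0


VMAX, KM, KI = 1.0, 2.0, 0.5   # arbitrary illustrative parameters


def michaelis_menten(c: Conditions) -> float:
    """r0 = Vmax * S / (Km + S); every other input is a distractor."""
    return VMAX * c.substrate / (KM + c.substrate)


def competitive_inhibition(c: Conditions) -> float:
    """r0 = Vmax * S / (Km * (1 + I / Ki) + S)."""
    return VMAX * c.substrate / (KM * (1.0 + c.inhibitor / KI) + c.substrate)


# Sweeping substrate alone (inhibitor fixed at 0) cannot separate the two laws.
substrate_only = [Conditions(substrate=s) for s in (0.5, 2.0, 10.0)]
gaps_without_inhibitor = [
    abs(michaelis_menten(c) - competitive_inhibition(c)) for c in substrate_only
]

# Adding inhibitor isolates the dependency that distinguishes them.
probe = Conditions(substrate=2.0, inhibitor=1.0)
gap_with_inhibitor = michaelis_menten(probe) - competitive_inhibition(probe)

print(gaps_without_inhibitor, gap_with_inhibitor)
```

With the inhibitor held at zero, any number of substrate sweeps yields identical predictions from both laws, so a learner must deliberately vary the inhibitor dimension to tell them apart\.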

#### Dataset Construction\.

The benchmark is drawn from a curated, mechanistically grounded hypothesis space comprising standard kinetic families, structured compositions thereof, and extended mechanisms beyond the standard textbook library, yielding 57 curated tasks in total\. We report results across three complexity tiers: Easy \(standard families: Michaelis\-Menten, competitive inhibition\), Medium \(structured compositions: mixed inhibition, substrate inhibition\), and Hard \(extended mechanisms: cooperative binding, allosteric regulation\)\.

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/benchmark_fig_new.png)

Figure 2: Overview of ActiveSciBench: \(a\) ActiveSciBench\-Chem: symbolic enzyme rate law recovery; \(b\) ActiveSciBench\-GRN: signed directed gene regulatory graph inference\.

### 4\.2 ActiveSciBench\-GRN: Active Causal Graph Discovery

#### Task Formulation\.

Gene regulatory networks are signed, directed graphs describing which genes or regulators activate or repress other genes\. In ActiveSciBench\-GRN \(Figure [2](https://arxiv.org/html/2605.24043#S4.F2)\(b\)\), each task simulates a hidden regulatory system with an unknown graph structure and nonlinear dynamics, and the learner must recover the causal graph within an experimental budget\. The system consists of intervenable gene expression nodes, upstream signals, and kinetic parameters\. Unlike ActiveSciBench\-Chem, ActiveSciBench\-GRN operates in a discrete intervention space: each experiment knocks up, knocks down, or perturbs specific nodes, and the learner observes all downstream expression changes\. The true graph, including edge presence, direction, and sign \(activation vs\. repression\), along with the governing nonlinear dynamics, is withheld from the learner throughout\. Different motifs can produce similar observations under weak interventions, and hidden parameters can make the same motif look qualitatively different across tasks\. The learner must therefore jointly identify topology, sign, and effective dynamics from sparse data, requiring informative perturbation choices rather than response\-surface fitting\. As in ActiveSciBench\-Chem, the core challenge is inferring which variables are relevant, but here the discovery target is a structured causal graph rather than a symbolic equation, extending the benchmark suite to broader coverage of scientific discovery problems\.
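To see why perturbation choice matters for graph recovery, consider a toy three\-gene cascade in which A activates B and B represses C\. The sketch below is not the benchmark simulator: the Hill\-function steady states, the parameter values, and the `clamp_b` intervention argument are assumptions made for illustration\.

```python
from typing import Dict, Optional


def hill_activation(x: float, k: float = 1.0, n: int = 2) -> float:
    """Increasing Hill response: the regulator switches the target on."""
    return x ** n / (k ** n + x ** n)


def hill_repression(x: float, k: float = 1.0, n: int = 2) -> float:
    """Decreasing Hill response: the regulator switches the target off."""
    return k ** n / (k ** n + x ** n)


def steady_state(a_level: float, clamp_b: Optional[float] = None) -> Dict[str, float]:
    """Steady state of the hidden cascade A -> B -| C (no direct A-to-C edge).

    `clamp_b` models a second intervention that fixes B regardless of A.
    """
    b = hill_activation(a_level) if clamp_b is None else clamp_b
    c = hill_repression(b)
    return {"A": a_level, "B": b, "C": c}


baseline = steady_state(1.0)
knockdown_a = steady_state(0.1)

# A single-node knockdown reveals signs of the net effects of A on B and on C ...
effect_on_b = knockdown_a["B"] - baseline["B"]   # negative: A activates B
effect_on_c = knockdown_a["C"] - baseline["C"]   # positive: net repression of C

# ... but cannot tell a direct A -| C edge from the indirect path through B.
# Clamping B while varying A removes the indirect path and exposes the difference.
c_low_a = steady_state(0.1, clamp_b=0.5)["C"]
c_high_a = steady_state(1.0, clamp_b=0.5)["C"]

print(effect_on_b, effect_on_c, c_low_a == c_high_a)
```

Perturbing A alone is consistent with both a direct repressive edge from A to C and the indirect route through B; only the two\-node experiment \(vary A with B clamped\) shows that C no longer responds, ruling out the direct edge\.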

#### Dataset Construction\.

The benchmark is curated from a small set of canonical regulatory motifs spanning increasing structural complexity, from feedforward activation to repression, feedback, and switching behavior\. It contains 5 motif families, each instantiated at 3 native difficulty levels and 3 topological versions, yielding 5 × 3 × 3 = 45 tasks per random seed, with the capacity to generate more\. The data split in ActiveSciBench\-GRN corresponds directly to the difficulty levels, reflecting progressively sharper and more nonlinear parameter regimes\. Easy tasks exhibit quasi\-linear responses where small perturbations reveal the graph clearly; Medium tasks introduce nonlinearity where saturation effects obscure weak edges; and Hard tasks feature bistability and switching behavior, where the graph is identifiable only through carefully designed multi\-node perturbations\. Appendix [B](https://arxiv.org/html/2605.24043#A2) describes the construction procedure, family definitions, filtering rules, and tasks\.

Table 3: Quantitative performance comparison of baselines across all benchmarks\. ED denotes whether a method incorporates experiment design\. Metrics for NewtonBench and ActiveSciBench\-Chem: SA = Symbolic Accuracy \(%\), Ex\. = Exact Accuracy \(%\), RMSLE = Root Mean Squared Log Error\. Metrics for ActiveSciBench\-GRN: F1 = Edge F1 \(%\), Ex\. = Exact Graph Accuracy \(%\), Sign = Sign Accuracy \(%\)\.

## 5 Experiments

### 5\.1 Experimental Setup

#### Datasets\.

We evaluate LLM\-AutoSciLab in a closed\-loop scientific discovery setting where the system iteratively designs experiments and refines hypotheses under a fixed oracle query budget\. Our study includes both quantitative comparisons with prior methods and targeted ablations\. Specifically, we conduct experiments on: NewtonBench, spanning 12 physics domains across multiple difficulty levels and variants; ActiveSciBench\-Chem, a suite of compositional enzyme kinetics tasks; and ActiveSciBench\-GRN, which focuses on graph\-structured discovery in gene regulatory networks\. For ActiveSciBench\-Chem and NewtonBench, we report \(i\) symbolic accuracy, \(ii\) predictive error via RMSLE, and \(iii\) numerical exact accuracy\. Symbolic accuracy is stricter than numerical accuracy, as approximate fits may not recover the true form\. For ActiveSciBench\-GRN \(structure discovery\), we evaluate structural recovery using edge\-level precision, recall, F1, and sign accuracy \(activation vs\. repression\), and mechanistic recovery via exact graph accuracy and motif accuracy\. Additional metric details are provided in Appendix [C](https://arxiv.org/html/2605.24043#A3)\.
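The following is a minimal sketch of how the edge-level metrics and RMSLE named above could be computed. It is not the paper's evaluation code (Appendix C holds the actual definitions). Two assumptions are made here: a graph is represented as a dictionary mapping `(source, target)` to a sign of `+1` or `-1`, and sign accuracy is scored only on edges present in both graphs. RMSLE uses the common `log(1 + y)` form, which may differ from the paper's variant.

```python
import math


def edge_metrics(true_graph, pred_graph):
    """Compare two signed directed graphs given as {(source, target): sign}."""
    true_edges, pred_edges = set(true_graph), set(pred_graph)
    shared = true_edges & pred_edges  # same source and same target
    precision = len(shared) / len(pred_edges) if pred_edges else 0.0
    recall = len(shared) / len(true_edges) if true_edges else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    # Assumption: sign accuracy is scored only on edges found in both graphs.
    sign_hits = sum(true_graph[e] == pred_graph[e] for e in shared)
    sign_accuracy = sign_hits / len(shared) if shared else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "sign_accuracy": sign_accuracy,
        # Exact recovery: every edge, direction, and sign must match.
        "exact": true_graph == pred_graph,
    }


def rmsle(y_true, y_pred):
    """Root mean squared log error using log(1 + y), so zeros are allowed."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


# Toy graphs (illustrative only): one wrong sign and one reversed edge.
true_g = {("A", "B"): +1, ("B", "C"): -1, ("A", "C"): +1}
pred_g = {("A", "B"): +1, ("B", "C"): +1, ("C", "A"): +1}
m = edge_metrics(true_g, pred_g)
print(m)
```

In the toy case, two of three predicted edges match in direction, one of those two has the wrong sign, and the reversed edge counts as both a false positive and a false negative, so the graph is not an exact match.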

#### Baselines\.

We compare against a broad set of baselines: symbolic regression methods such as PySR \(Cranmer, [2023](https://arxiv.org/html/2605.24043#bib.bib4)\), which operate on fixed datasets without experiment design, and active learning methods \(Bayesian Optimization, Bayesian Experimental Design \(BED\)\), which select experiments but do not model symbolic structure\. We further experiment with LLM\-only prompting and code\-assisted LLMs\. For ActiveSciBench\-GRN, we additionally evaluate the graph discovery baselines GENIE3 \(Huynh\-Thu et al\., [2010](https://arxiv.org/html/2605.24043#bib.bib76)\), GIES \(Hauser and Bühlmann, [2012](https://arxiv.org/html/2605.24043#bib.bib75)\), and NOTEARS \(Zheng et al\., [2018](https://arxiv.org/html/2605.24043#bib.bib74)\) \(offline\), as well as Random and Uncertainty sampling\. For ablations and diagnostic analyses, we use stratified, representative subsets to improve computational tractability \(Appendix [E\.3](https://arxiv.org/html/2605.24043#A5.SS3)\)\. For all LLM\-based methods, we use GPT\-4o\-mini as the primary model with Qwen2\.5\-7B\-Instruct as the smaller local ensemble model\. Appendix [A\.5](https://arxiv.org/html/2605.24043#A1.SS5) reports results with standard deviations\.

### 5\.2 Main Results

As shown in Table [3](https://arxiv.org/html/2605.24043#S4.T3), we evaluate LLM\-AutoSciLab using GPT\-4o\-mini across NewtonBench, ActiveSciBench\-Chem, and ActiveSciBench\-GRN under fixed oracle budgets\. Across all three benchmarks, LLM\-AutoSciLab achieves the strongest overall recovery, but the source of advantage differs by setting: NewtonBench tests symbolic identifiability with known variables, ActiveSciBench\-Chem requires discovering relevant kinetic variables, and ActiveSciBench\-GRN requires recovering graph structure from sparse interventions\. Appendix [A\.2](https://arxiv.org/html/2605.24043#A1.SS2) contains experiments using Qwen\-3 family LLMs\.

#### NewtonBench\.

NewtonBench isolates symbolic recovery when relevant variables are known\. Strong fitting baselines such as PySR and Bayesian Optimization achieve high exact accuracy \(74\.54% and 68\.52%, respectively\) but low symbolic accuracy \(24\.07% and 24\.54%\), indicating overfitting to observed regimes without recovering the underlying law\. LLM\-based methods perform poorly overall \(<8% SA\)\. In contrast, LLM\-AutoSciLab achieves 67\.60% symbolic accuracy and 81\.50% exact accuracy, substantially outperforming all baselines\. This gap highlights that hypothesis\-conditioned experimentation improves structural identification rather than merely numerical fit\.
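The gap between numerical fit and structural recovery can be illustrated with a toy example. The two functions below are illustrative assumptions and are not taken from NewtonBench: a saturating law and its second-order polynomial approximation agree closely on a narrow observed range yet diverge once the input leaves that range.

```python
def true_law(x):
    """Hypothetical hidden mechanism: a saturating response."""
    return x / (1.0 + x)


def fitted_law(x):
    """Hypothetical fit: the Taylor truncation x - x^2, a different form."""
    return x - x**2


def max_abs_error(f, g, xs):
    """Largest pointwise disagreement between two candidate laws."""
    return max(abs(f(x) - g(x)) for x in xs)


observed = [i / 100 for i in range(0, 11)]  # inputs sampled in [0.00, 0.10]
unseen = [1.0, 2.0, 5.0]                    # inputs outside the sampled regime

in_range = max_abs_error(true_law, fitted_law, observed)
out_of_range = max_abs_error(true_law, fitted_law, unseen)
print(f"in-range error: {in_range:.6f}, out-of-range error: {out_of_range:.3f}")
```

On the observed range the two laws differ by less than 0.001, so a purely numerical criterion would accept the polynomial, while a symbolic-equivalence check would reject it. Sampling inputs where the candidates disagree is what separates them.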

#### ActiveSciBench\-Chem\.

ActiveSciBench\-Chem requires identifying both relevant variables and kinetic structure through interaction with the experimental interface\. BED is the strongest baseline \(31\.58% SA\), performing well on easy tasks \(77\.78% SA\) but collapsing on hard settings \(0\.00% SA/Ex\.\), suggesting limited coverage beyond the candidate model class\. We observe that LLM\-only and code\-assisted LLM baselines consistently achieve 0\.00% SA/Ex\. Although they often produce plausible rate laws, they tend to default to generic textbook templates rather than testing variable relevance or distinguishing kinetic families\. Thus, their outputs can be locally reasonable while failing strict symbolic\-equivalence and exact\-recovery criteria\. In contrast, LLM\-AutoSciLab achieves the best overall performance \(35\.09% SA, 50\.88% Ex\.\) and remains robust on hard tasks \(42\.86% SA, 52\.38% Ex\.\)\. The results suggest that LLM\-generated hypotheses enable exploration beyond fixed mechanism libraries, which is critical for nonstandard kinetics\.

#### ActiveSciBench\-GRN\.

ActiveSciBench\-GRN evaluates graph\-structured mechanism recovery\. Offline methods recover partial structure but rarely the full graph: GIES achieves 56\.27% F1 but only 6\.67% exact accuracy\. Active baselines improve edge recovery \(e\.g\., uncertainty sampling: 50\.10% F1\) but still fail to recover full graphs \(4\.44% Ex\.\)\. LLM\-AutoSciLab significantly outperforms all baselines, achieving 72\.49% F1, 31\.11% exact graph accuracy, and 98\.15% sign accuracy\. We observe that most methods perform very well on the sign metric, indicating that it is easier to determine whether a node suppresses, activates, or has no effect on other nodes than to recover the full graph\. Taken together, these results indicate that accurate graph recovery requires targeted, hypothesis\-driven perturbations that disambiguate competing signed structures, rather than passive fitting or uncertainty\-based sampling\.

## 6 Analysis

### 6\.1 Qualitative Analysis

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/target_law_final.png)

Figure 3: Qualitative NewtonBench case study\. LLM\-AutoSciLab recovers the correct symbolic structure, while other baselines introduce spurious terms, collapse to incorrect families, or recover only simplified harmonic forms\.

Figure [3](https://arxiv.org/html/2605.24043#S6.F3) presents a qualitative NewtonBench case study illustrating failure modes under limited\-budget mechanism discovery\. The target system contains two additive components: a restoring force term and a nonlinear damping term\. Fit\-driven baselines such as PySR and Bayesian optimization achieve low local error by introducing spurious or entangled terms, but fail to recover the correct mechanistic structure\. In particular, PySR inserts an unnecessary sinusoidal component, while Bayesian optimization converges to numerically accurate but mechanistically incorrect expressions\. LLM\-only and code\-assisted LLM methods instead collapse to simplified textbook harmonic forms, identifying only partial structure and missing the nonlinear damping behavior\. Bayesian experiment design moves toward the correct components, but ultimately still converges to a fit\-driven approximation\. In contrast, LLM\-AutoSciLab recovers the correct additive structure and closely matches the hidden constants, including a recovered coefficient of 0\.4915 ≈ 1/2 and a damping exponent within roughly 1% of the ground truth, highlighting the importance of discriminative experimentation in accurate mechanism recovery\.

### 6\.2 Ablation Study

Figure [4](https://arxiv.org/html/2605.24043#S6.F4) reports single\-component removal results across all three benchmarks\. Removing hypothesis\-conditioned acquisition causes a substantial drop across all benchmarks, showing that candidate mechanisms must guide data acquisition under limited oracle budgets\. Other components contribute differently depending on the source of difficulty\. NewtonBench stresses sparse functional identification through counterfactual laws, making diverse hypothesis generation and mechanism stability important; without them, the LLM is drawn toward canonical physics laws\. ActiveSciBench\-Chem requires reasoning over the input space and identifying mechanistically relevant variables, making memory crucial for disambiguating competing mechanisms\. ActiveSciBench\-GRN relies on accumulating perturbation evidence, making evidence preservation and intervention selection more central\. These ablation patterns also support the diversity of our suite: NewtonBench, ActiveSciBench\-Chem, and ActiveSciBench\-GRN are sensitive to different removed components, suggesting they probe complementary discovery capabilities rather than one shared fitting problem\. Overall, the results show that LLM\-AutoSciLab functions as a closed\-loop pipeline across discovery settings, with each component contributing meaningfully\.
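To make the role of hypothesis-conditioned acquisition concrete, the toy loop below selects the experiment on which the surviving candidate mechanisms disagree most, queries an oracle, and discards candidates that contradict the observation. This is a sketch under strong assumptions and is not the paper's acquisition objective: the hypothesis set is a fixed, hand-written stand-in for LLM-generated candidates, disagreement is measured as prediction variance, and the oracle, candidate inputs, and tolerance are all hypothetical.

```python
from statistics import pvariance


def oracle(x):
    """Hypothetical hidden mechanism that the loop is trying to identify."""
    return 2.0 * x / (1.0 + x)


# Stand-in for LLM-proposed hypotheses (assumption: fixed and hand-written).
HYPOTHESES = {
    "linear": lambda x: 2.0 * x,
    "saturating": lambda x: 2.0 * x / (1.0 + x),
    "quadratic": lambda x: 2.0 * x - 2.0 * x**2,
}
CANDIDATE_INPUTS = [0.01, 0.1, 0.5, 1.0, 2.0]


def pick_experiment(hypotheses, candidates):
    """Choose the input where the hypotheses' predictions vary the most."""
    return max(
        candidates,
        key=lambda x: pvariance([h(x) for h in hypotheses.values()]),
    )


def run_loop(budget=3, tol=1e-6):
    alive = dict(HYPOTHESES)
    remaining = list(CANDIDATE_INPUTS)
    history = []
    for _ in range(budget):
        if len(alive) <= 1 or not remaining:
            break  # nothing left to discriminate, or no inputs left
        x = pick_experiment(alive, remaining)
        remaining.remove(x)
        y = oracle(x)  # one oracle query spent
        history.append((x, y))
        # Keep only hypotheses consistent with the new observation.
        alive = {n: h for n, h in alive.items() if abs(h(x) - y) <= tol}
    return alive, history


alive, history = run_loop()
print(sorted(alive), history)
```

In this toy setting all three candidates nearly coincide at small inputs, so a query there would eliminate little, whereas the largest input separates them in a single query.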

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/ablation_three_panels_renamed.png)

Figure 4: Ablation study across all benchmarks, removing one component from LLM\-AutoSciLab at a time\.

### 6\.3 Experiment Budget Analysis

Figure [5](https://arxiv.org/html/2605.24043#S6.F5) plots recovery versus query budget\. On NewtonBench and ActiveSciBench\-GRN, the second\-best baseline fails to match LLM\-AutoSciLab’s fixed\-budget performance even at 5× the query count\. For ActiveSciBench\-Chem, BED closes the low\-budget gap only by budget 60, using three times as many queries as the B=20 setting of LLM\-AutoSciLab\. We also measure sample efficiency as the query budget each baseline needs to match LLM\-AutoSciLab’s fixed\-budget performance\. The strongest active baselines require substantially more queries: 2\.60–3\.10× on NewtonBench, 2\.33–2\.47× on ActiveSciBench\-Chem, and 3\.90–4\.60× on ActiveSciBench\-GRN; LLM\-only and code\-assisted variants require 5\.20–14\.40× more queries depending on the benchmark \(Appendix [A\.3](https://arxiv.org/html/2605.24043#A1.SS3)\)\. By conditioning acquisition on competing hypotheses, each oracle call is more likely to resolve structural ambiguity across symbolic laws, kinetic rate mechanisms, and signed regulatory graphs\. Together, these results show that LLM\-AutoSciLab is significantly more sample efficient than the baselines\.
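The sample-efficiency ratio described above can be read off a budget curve as follows. This is a minimal sketch: the function names are hypothetical, and the baseline curve contains made-up illustrative scores rather than values from the paper. Only the budgets 60 and 20 echo the ActiveSciBench-Chem comparison in the text.

```python
from typing import Dict, Optional


def budget_to_match(curve: Dict[int, float], target: float) -> Optional[int]:
    """Smallest evaluated budget whose score reaches the target, else None."""
    for budget in sorted(curve):
        if curve[budget] >= target:
            return budget
    return None


def efficiency_ratio(
    curve: Dict[int, float], target: float, reference_budget: int
) -> Optional[float]:
    """Queries a baseline needs, relative to the reference method's budget."""
    needed = budget_to_match(curve, target)
    return None if needed is None else needed / reference_budget


# Illustrative numbers only (not from the paper): baseline score per budget.
baseline_curve = {20: 0.18, 40: 0.27, 60: 0.36, 80: 0.38}
target_score = 0.35       # hypothetical score of the reference method
reference_budget = 20     # the reference method's fixed budget B

ratio = efficiency_ratio(baseline_curve, target_score, reference_budget)
print(ratio)  # 60 / 20 = 3.0
```

A return value of `None` corresponds to a baseline that never reaches the target within the evaluated budgets, as reported for the second-best baselines on NewtonBench and ActiveSciBench-GRN at 5× the query count.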

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/active_scibench_budget_lines.png)

Figure 5: Budget ablations across benchmarks, showing the recovery metric versus query budget\.

## 7 Conclusion

We introduced LLM\-AutoSciLab, a closed\-loop algorithm for scientific discovery that formalizes discovery as iterative experimental design over an evolving hypothesis set\. LLM\-AutoSciLab instantiates a structured algorithmic loop that \(i\) generates a diverse hypothesis set, \(ii\) selects experiments via a hypothesis\-conditioned acquisition objective, and \(iii\) refines candidates through data\-driven optimization with feedback\. This formulation shifts the objective of data acquisition from predictive uncertainty reduction to hypothesis discrimination, improving true mechanism recovery under limited experimental budgets\. To overcome the lack of evaluation settings for closed\-loop discovery, we introduce ActiveSciBench\-Chem and ActiveSciBench\-GRN, benchmarks that recast scientific discovery as an active, budget\-constrained process requiring joint experiment design, variable selection, and recovery of underlying mechanisms, enabling systematic evaluation beyond static function fitting\. Across NewtonBench, ActiveSciBench\-Chem, and ActiveSciBench\-GRN, LLM\-AutoSciLab achieves higher symbolic and structural recovery with fewer queries than prior methods, demonstrating that aligning data acquisition with hypothesis discrimination rather than predictive accuracy improves both efficiency and reliability of scientific discovery\.

#### Limitations\.

LLM\-AutoSciLab uses simulator\-based oracles, so physical\-lab noise, failures, costs, and operational constraints are not fully modeled\. Performance also depends on the quality of LLM\-generated hypotheses and on the coverage of the parser and refinement backends\. Broader domains, richer refinement tools, and real\-world validation remain future work\.

## References

- \[1\]\(2026\)LLEMA: evolutionary search with LLMs for multi\-objective materials discovery\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=TIqzhBvCNB)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[2\]M\. Abolhasani and E\. Kumacheva\(2023\)The rise of self\-driving labs in chemical and materials sciences\.Nature Synthesis2,pp\. 483 – 492\.External Links:[Link](https://api.semanticscholar.org/CorpusID:256435190)Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.7.5.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[3\]D\. Agarwal, B\. P\. Majumder, R\. Adamson, M\. Chakravorty, S\. R\. Gavireddy, A\. Parashar, H\. Surana, B\. D\. Mishra, A\. McCallum, A\. Sabharwal, and P\. Clark\(2026\)AutoDiscovery: open\-ended scientific discovery via bayesian surprise\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,External Links:[Link](https://openreview.net/forum?id=kJqTkj2HhF)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[4\]M\. R\. AI4Science and M\. A\. Quantum\(2023\)The impact of large language models on scientific discovery: a preliminary study using gpt\-4\.arXiv preprint arXiv:2311\.07361\.Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p1.1)\.
- \[5\]M\. Bailey, S\. Moayedpour, R\. Li, A\. Corrochano\-Navarro, A\. Kötter, L\. Kogler\-Anele, S\. Riahi, C\. Grebner, G\. Hessler, H\. Matter, M\. Bianciotto, P\. Mas, Z\. Bar\-Joseph, and S\. Jager\(2024\-01\)Deep batch active learning for drug discovery\.External Links:[Link](http://dx.doi.org/10.7554/eLife.89679.2),[Document](https://dx.doi.org/10.7554/elife.89679.2)Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.6.4.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\]P\. Behzadifar, P\. Shojaee, S\. Kabra, K\. Meidani, and C\. K\. Reddy\(2025\)Decompose, adapt, and evolve: towards efficient scientific equation discovery with large language models\.InNeurIPS 2025 AI for Science Workshop,External Links:[Link](https://openreview.net/forum?id=iU4ddu2fgi)Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.4.2.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[7\]G\. E\. Box and W\. J\. Hill\(1967\)Discrimination among mechanistic models\.Technometrics9\(1\),pp\. 57–71\.Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p2.1)\.
- \[8\]Q\. Chen, M\. Yang, L\. Qin, J\. Liu, Z\. Yan, J\. Guan, D\. Peng, Y\. Ji, H\. Li, M\. Hu,et al\.\(2025\)Ai4research: a survey of artificial intelligence for scientific research\.arXiv preprint arXiv:2507\.01903\.Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p2.1)\.
- \[9\]T\. Chen, B\. Lin, Z\. Yuan, Q\. Zou, H\. He, A\. Goyal, Y\. Ong, and D\. Liu\(2025\)HypoSpace: evaluating llm creativity as set\-valued hypothesis generators under underdetermination\.arXiv preprint arXiv:2510\.15614\.Cited by:[§3\.2](https://arxiv.org/html/2605.24043#S3.SS2.p1.7)\.
- \[10\]M\. Chevalley, Y\. H\. Roohani, A\. Mehrjou, J\. Leskovec, and P\. Schwab\(2025\)A large\-scale benchmark for network inference from single\-cell perturbation data\.Communications Biology8\(1\),pp\. 412\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.8.6.1)\.
- \[11\]M\. Cranmer\(2023\)Interpretable machine learning for science with pysr and symbolicregression\. jl\.arXiv preprint arXiv:2305\.01582\.Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.3.1.1),[§1](https://arxiv.org/html/2605.24043#S1.p1.1),[§1](https://arxiv.org/html/2605.24043#S1.p4.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.3.1.1),[§5\.1](https://arxiv.org/html/2605.24043#S5.SS1.SSS0.Px2.p1.1)\.
- \[12\]S\. d’Ascoli, S\. Becker, P\. Schwaller, A\. Mathis, and N\. Kilbertus\(2024\)ODEFormer: symbolic regression of dynamical systems with transformers\.InThe Twelfth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=TzoHLiGVMo)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.6.4.1)\.
- \[13\]S\. Desai, S\. Addamane, J\. Y\. Tsao, I\. Brener, L\. P\. Swiler, R\. Dingreville, and P\. P\. Iyer\(2025\)AutoSciLab: a self\-driving laboratory for interpretable scientific discovery\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 146–154\.Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.8.6.1),[§1](https://arxiv.org/html/2605.24043#S1.p2.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[14\]J\. Gan, P\. Zhong, Y\. Du, Y\. Zhu, C\. Duan, H\. Wang, D\. Schwalbe\-Koda, C\. P\. Gomes, K\. A\. Persson, and W\. Wang\(2025\)MatLLMSearch: crystal structure discovery with evolution\-guided large language models\.arXiv preprint arXiv:2502\.20933\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[15\]A\. Grayeli, A\. Sehgal, O\. Costilla\-Reyes, M\. Cranmer, and S\. Chaudhuri\(2024\)Symbolic regression with a learned concept library\.Advances in Neural Information Processing Systems37,pp\. 44678–44709\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]F\. Häse, M\. Aldeghi, R\. J\. Hickman, L\. M\. Roch, M\. Christensen, E\. Liles, J\. E\. Hein, and A\. Aspuru\-Guzik\(2021\)Olympus: a benchmarking framework for noisy optimization and experiment planning\.Machine Learning: Science and Technology2\(3\),pp\. 035021\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.7.5.1)\.
- \[17\]A\. Hauser and P\. Bühlmann\(2012\)Characterization and greedy learning of interventional markov equivalence classes of directed acyclic graphs\.The Journal of Machine Learning Research13\(1\),pp\. 2409–2464\.Cited by:[§E\.2](https://arxiv.org/html/2605.24043#A5.SS2.p6.1),[§5\.1](https://arxiv.org/html/2605.24043#S5.SS1.SSS0.Px2.p1.1)\.
- \[18\]K\. Huang, R\. Lopez, J\. Hütter, T\. Kudo, A\. Rios, and A\. Regev\(2023\)Sequential optimal experimental design of perturbation screens guided by multi\-modal priors\.bioRxiv\.External Links:[Document](https://dx.doi.org/10.1101/2023.12.12.571389),[Link](https://www.biorxiv.org/content/early/2023/12/13/2023.12.12.571389),https://www\.biorxiv\.org/content/early/2023/12/13/2023\.12\.12\.571389\.full\.pdfCited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[19\]V\. A\. Huynh\-Thu, A\. Irrthum, L\. Wehenkel, and P\. Geurts\(2010\)Inferring regulatory networks from expression data using tree\-based methods\.PLoS ONE5\.External Links:[Link](https://api.semanticscholar.org/CorpusID:10420934)Cited by:[§E\.2](https://arxiv.org/html/2605.24043#A5.SS2.p5.1),[§5\.1](https://arxiv.org/html/2605.24043#S5.SS1.SSS0.Px2.p1.1)\.
- \[20\]P\. Jansen, P\. Clark, D\. Downey, and D\. S\. Weld\(2026\)Generating literature\-driven scientific theories at scale\.arXiv preprint arXiv:2601\.16282\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]N\. Jiang, M\. Nasim, and Y\. Xue\(2025\)Active symbolic discovery of ordinary differential equations via phase portrait sketching\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 17626–17634\.Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p1.1)\.
- \[22\]S\. Kabra, S\. Kriplani, P\. Shojaee, and C\. K\. Reddy\(2026\)SURFACEBENCH: a geometry\-aware benchmark for symbolic surface discovery\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=sHLTzkczSi)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1)\.
- \[23\]A\. G\. Kusne, H\. Yu, C\. Wu, H\. Zhang, J\. Hattrick\-Simpers, B\. DeCost, S\. Sarker, C\. Oses, C\. Toher, S\. Curtarolo,et al\.\(2020\)On\-the\-fly closed\-loop materials discovery via bayesian active learning\.Nature communications11\(1\),pp\. 5966\.Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.5.3.1),[§1](https://arxiv.org/html/2605.24043#S1.p2.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[24\]G\. W\. Kyro, A\. Morgunov, R\. I\. Brent, and V\. S\. Batista\(2024\-01\)ChemSpaceAL: an efficient active learning methodology applied to protein\-specific molecular generation\.Journal of Chemical Information and Modeling64\(3\),pp\. 653–665\.External Links:ISSN 1549\-960X,[Link](http://dx.doi.org/10.1021/acs.jcim.3c01456),[Document](https://dx.doi.org/10.1021/acs.jcim.3c01456)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[25\]P\. Langley\(2024\-Mar\.\)Integrated systems for computational scientific discovery\.Proceedings of the AAAI Conference on Artificial Intelligence38\(20\),pp\. 22598–22606\.External Links:[Link](https://ojs.aaai.org/index.php/AAAI/article/view/30269),[Document](https://dx.doi.org/10.1609/aaai.v38i20.30269)Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.8.6.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[26\]J\. Ling, M\. Hutchinson, E\. Antono, S\. Paradiso, and B\. Meredig\(2017\)High\-dimensional materials and process optimization using data\-driven experimental design with well\-calibrated uncertainty estimates\.Integrating Materials and Manufacturing Innovation6\(3\),pp\. 207–217\.Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.6.4.1),[§1](https://arxiv.org/html/2605.24043#S1.p2.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[27\]B\. P\. MacLeod, F\. G\. L\. Parlane, T\. D\. Morrissey, F\. Häse, L\. M\. Roch, K\. E\. Dettelbach, R\. Moreira, L\. P\. E\. Yunker, M\. B\. Rooney, J\. R\. Deeth, V\. Lai, G\. J\. Ng, H\. Situ, R\. H\. Zhang, M\. S\. Elliott, T\. H\. Haley, D\. J\. Dvorak, A\. Aspuru\-Guzik, J\. E\. Hein, and C\. P\. Berlinguette\(2020\)Self\-driving laboratory for accelerated discovery of thin\-film materials\.Science Advances6\(20\),pp\. eaaz8867\.External Links:[Document](https://dx.doi.org/10.1126/sciadv.aaz8867),[Link](https://www.science.org/doi/abs/10.1126/sciadv.aaz8867),https://www\.science\.org/doi/pdf/10\.1126/sciadv\.aaz8867Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.7.5.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[28\]B\. P\. Majumder, H\. Surana, D\. Agarwal, S\. Hazra, A\. Sabharwal, and P\. Clark\(2024\)Data\-driven discovery with large generative models\.arXiv preprint arXiv:2402\.13610\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[29\]A\. A\. Melnikov, H\. P\. Nautrup, M\. Krenn, V\. Dunjko, M\. Tiersch, A\. Zeilinger, and H\. J\. Briegel\(2018\)Active learning machine learns to create new quantum experiments\.Proceedings of the National Academy of Sciences115\(6\),pp\. 1221–1226\.External Links:[Document](https://dx.doi.org/10.1073/pnas.1714936115),[Link](https://www.pnas.org/doi/abs/10.1073/pnas.1714936115),https://www\.pnas\.org/doi/pdf/10\.1073/pnas\.1714936115Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[30\]L\. Ouyang, M\. H\. Tessler, D\. Ly, and N\. Goodman\(2016\)Practical optimal experiment design with probabilistic programs\.arXiv preprint arXiv:1608\.05046\.Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.5.3.1),[§1](https://arxiv.org/html/2605.24043#S1.p2.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[31\]B\. K\. Petersen, M\. L\. Larma, T\. N\. Mundhenk, C\. P\. Santiago, S\. K\. Kim, and J\. T\. Kim\(2021\)Deep symbolic regression: recovering mathematical expressions from data via risk\-seeking policy gradients\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=m5Qsh0kBQG)Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p1.1)\.
- \[32\]A\. Pratapa, A\. P\. Jalihal, J\. N\. Law, A\. Bharadwaj, and T\. M\. Murali\(2019\)Benchmarking algorithms for gene regulatory network inference from single\-cell transcriptomic data\.bioRxiv\.External Links:[Document](https://dx.doi.org/10.1101/642926),[Link](https://www.biorxiv.org/content/early/2019/06/04/642926),https://www\.biorxiv\.org/content/early/2019/06/04/642926\.full\.pdfCited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.9.7.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.9.7.1)\.
- \[33\]J\. Qin, H\. Wessels, C\. Fernandez\-Granda, and Y\. Hao\(2024\)Active learning for efficient discovery of optimal gene combinations in the combinatorial perturbation space\.InNeurIPS 2024 Workshop on AI for New Drug Modalities,External Links:[Link](https://openreview.net/forum?id=7aCRpxvu2N)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px2.p1.1)\.
- \[34\]C\. K\. Reddy and P\. Shojaee\(2025\)Towards scientific discovery with generative ai: progress, opportunities, and challenges\.InProceedings of the AAAI conference on artificial intelligence,Vol\.39,pp\. 28601–28609\.Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p1.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[35\]B\. Romera\-Paredes, M\. Barekatain, A\. Novikov, M\. Balog, M\. P\. Kumar, E\. Dupont, F\. J\. Ruiz, J\. S\. Ellenberg, P\. Wang, O\. Fawzi,et al\.\(2024\)Mathematical discoveries from program search with large language models\.Nature625\(7995\),pp\. 468–475\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[36\]T\. Schaffter, D\. Marbach, and D\. Floreano\(2011\-08\)GeneNetWeaver: in silico benchmark generation and performance profiling of network inference methods\.Bioinformatics27\(16\),pp\. 2263–2270\.External Links:ISSN 1367\-4803,[Document](https://dx.doi.org/10.1093/bioinformatics/btr373),[Link](https://doi.org/10.1093/bioinformatics/btr373),https://academic\.oup\.com/bioinformatics/article\-pdf/27/16/2263/48863257/bioinformatics\_27\_16\_2263\.pdfCited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.9.7.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.9.7.1)\.
- \[37\]P\. Shojaee, K\. Meidani, S\. Gupta, A\. B\. Farimani, and C\. K\. Reddy\(2025\)LLM\-SR: scientific equation discovery via programming with large language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=m2nmp8P5in)Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.4.2.1),[§1](https://arxiv.org/html/2605.24043#S1.p1.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[38\]P\. Shojaee, N\. Nguyen, K\. Meidani, A\. B\. Farimani, K\. D\. Doan, and C\. K\. Reddy\(2025\)LLM\-SRBench: a new benchmark for scientific equation discovery with large language models\.InForty\-second International Conference on Machine Learning,External Links:[Link](https://openreview.net/forum?id=SyQPiZJVWY)Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p4.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.4.2.1)\.
- \[39\]M\. Takamoto, T\. Praditia, R\. Leiteritz, D\. MacKinlay, F\. Alesiani, D\. Pflüger, and M\. Niepert\(2022\)Pdebench: an extensive benchmark for scientific machine learning\.Advances in neural information processing systems35,pp\. 1596–1611\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.6.4.1)\.
- \[40\]S\. Udrescu and M\. Tegmark\(2020\)AI feynman: a physics\-inspired method for symbolic regression\.Science advances6\(16\),pp\. eaay2631\.Cited by:[Table 1](https://arxiv.org/html/2605.24043#S1.T1.5.1.3.1.1),[§1](https://arxiv.org/html/2605.24043#S1.p1.1),[§1](https://arxiv.org/html/2605.24043#S1.p4.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.3.1.1)\.
- \[41\]H\. Wang, T\. Fu, Y\. Du, W\. Gao, K\. Huang, Z\. Liu, P\. Chandak, S\. Liu, P\. Van Katwyk, A\. Deac,et al\.\(2023\)Scientific discovery in the age of artificial intelligence\.Nature620\(7972\),pp\. 47–60\.Cited by:[§1](https://arxiv.org/html/2605.24043#S1.p1.1)\.
- \[42\]H\. Wang, M\. Skreta, C\. T\. Ser, W\. Gao, L\. Kong, F\. Strieth\-Kalthoff, C\. Duan, Y\. Zhuang, Y\. Yu, Y\. Zhu, Y\. Du, A\. Aspuru\-Guzik, K\. Neklyudov, and C\. Zhang\(2025\)Efficient evolutionary search over chemical space with large language models\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=awWiNvQwf3)Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.
- \[43\]T\. Zheng, K\. K\. W\. Tam, N\. N\. K\. H\. Nam, B\. Xu, Z\. Wang, C\. Jiayang, H\. T\. Tsang, W\. Wang, J\. Bai, T\. Fang, Y\. Song, G\. Wong, and S\. See\(2026\)NewtonBench: benchmarking generalizable scientific law discovery in LLM agents\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=Gk6umqW74m)Cited by:[§B\.3](https://arxiv.org/html/2605.24043#A2.SS3.p1.1),[§E\.2](https://arxiv.org/html/2605.24043#A5.SS2.p1.1),[§E\.2](https://arxiv.org/html/2605.24043#A5.SS2.p8.1),[§1](https://arxiv.org/html/2605.24043#S1.p4.1),[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px3.p1.1),[Table 2](https://arxiv.org/html/2605.24043#S2.T2.5.1.5.3.1),[§3\.5](https://arxiv.org/html/2605.24043#S3.SS5.p1.2)\.
- \[44\]X\. Zheng, B\. Aragam, P\. K\. Ravikumar, and E\. P\. Xing\(2018\)Dags with no tears: continuous optimization for structure learning\.Advances in neural information processing systems31\.Cited by:[§E\.2](https://arxiv.org/html/2605.24043#A5.SS2.p7.1),[§5\.1](https://arxiv.org/html/2605.24043#S5.SS1.SSS0.Px2.p1.1)\.
- \[45\]Y\. Zhou, H\. Liu, T\. Srivastava, H\. Mei, and C\. Tan\(2024\)Hypothesis generation with large language models\.InProceedings of the 1st Workshop on NLP for Science \(NLP4Science\),pp\. 117–139\.Cited by:[§2](https://arxiv.org/html/2605.24043#S2.SS0.SSS0.Px1.p1.1)\.

### Reproducibility Statement

To ensure reproducibility, we provide the relevant implementation and experimental details throughout the paper, including the overall methodology described in Section [3](https://arxiv.org/html/2605.24043#S3) and Appendix [E\.1](https://arxiv.org/html/2605.24043#A5.SS1), and the LLM prompts listed in Appendix [E\.4](https://arxiv.org/html/2605.24043#A5.SS4)\. We also document the datasets used in our experiments in Appendix [B](https://arxiv.org/html/2605.24043#A2) and release the accompanying code and data to support future research\.

### Impact Statement

LLM\-AutoSciLab accelerates scientific discovery by automating hypothesis\-driven experimentation, with potential benefits for researchers in biology, chemistry, and physics who face costly experimental budgets, reducing the number of experiments needed to recover governing mechanisms and lowering barriers for smaller research groups\. The primary risks are over\-reliance on model outputs in safety\-critical domains such as drug discovery, where a plausible but incorrect mechanistic law could have downstream consequences, and the inheritance of LLM biases that may systematically favor well\-represented mechanisms over genuinely novel ones\. The framework currently targets simulator\-based discovery rather than direct laboratory deployment, limiting immediate risk, but domain\-specific safety review remains essential before application in sensitive real\-world contexts\. We do not anticipate direct dual\-use concerns\.

## Appendix

## Appendix A Additional Results

### A\.1 Noise Sensitivity

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/noise_newton.png) \(a\) NewtonBench
![Refer to caption](https://arxiv.org/html/2605.24043v1/images/noise_chem.png) \(b\) ActiveSciBench\-Chem
![Refer to caption](https://arxiv.org/html/2605.24043v1/images/noise_grn.png) \(c\) ActiveSciBench\-GRN

Figure 6: Robustness to observation noise across NewtonBench, ActiveSciBench\-Chem, and ActiveSciBench\-GRN\. Bars report exact accuracy, while lines report the continuous error or graph recovery metric: RMSLE for NewtonBench and ActiveSciBench\-Chem, and Edge F1 for ActiveSciBench\-GRN\.

Figure [6](https://arxiv.org/html/2605.24043#A1.F6) reports a noise sensitivity analysis evaluating robustness to increasing levels of observation noise across all three benchmarks\. For each benchmark, we inject controlled Gaussian noise at varying levels and measure the effect on both primary recovery metrics and predictive fidelity\. Across all settings, LLM\-AutoSciLab shows stronger recovery than baselines as noise increases, consistent with scientific priors that constrain the search to physically plausible structures even as data quality degrades\. On NewtonBench and ActiveSciBench\-Chem, symbolic accuracy exhibits threshold behavior, degrading in a step\-function pattern while RMSLE increases smoothly\. This reflects a transition between a regime where noise blurs parameter estimates but structural identifiability is preserved, and a regime where competing hypotheses become observationally indistinguishable within the current budget\.

On ActiveSciBench\-GRN, the two recovery metrics decouple sharply under noise\. Exact graph accuracy drops severely even at moderate noise levels, while edge F1 deteriorates only gradually\. This dissociation reveals that noise primarily disrupts precise topology recovery while the broader edge\-level structure remains partially identifiable\. The result suggests that activation versus repression provides a stronger, more noise\-resilient signal than exact graph structure and points to precise topology recovery as the primary fragility of the current framework under noisy perturbation settings\.

### A\.2 Model Capability

Table 4: LLM backbone comparison\. GPT\-4o\-mini is the default backbone; Qwen3 models evaluate open\-weight scaling\.

Table [4](https://arxiv.org/html/2605.24043#A1.T4) evaluates the effect of LLM backbone capability on closed\-loop discovery\. We keep Qwen2\.5\-7B\-Instruct as the smaller local ensemble model and vary only the primary model driving the experiments\. Model scale is most beneficial on the more compositional and structured benchmarks\. ActiveSciBench\-Chem improves consistently from Qwen3\-4B to Qwen3\-32B across SA, exact accuracy, and RMSLE, while ActiveSciBench\-GRN shows clear gains in edge F1 and exact graph accuracy\. This suggests that stronger backbones provide better mechanistic priors for selecting relevant variables, proposing discriminative experiments, and reasoning over structured mechanisms\.

NewtonBench reveals a different pattern\. Qwen3\-32B achieves the lowest RMSLE and highest exact accuracy, but Qwen3\-14B obtains higher symbolic accuracy\. Thus, larger models can fit behavior more accurately without always recovering the exact symbolic form\. Since the difference is small, this should not be interpreted as evidence against larger models; rather, the non\-monotonic symbolic trend suggests that closed\-loop discovery is not determined by scale alone, but by the interaction between hypothesis generation, experiment selection, and refinement\.

### A\.3 Relative Sample Efficiency

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/relative_query_cost_threepanel_fixed.png)

Figure 7: Relative sample efficiency across benchmarks\. Each bar shows the multiplicative number of samples required by a comparison method to match the fixed\-budget performance of LLM\-AutoSciLab \(lower is better\)\. The number of samples is measured relative to the reference budgets used for LLM\-AutoSciLab: $B=20$ for NewtonBench, $B=60$ for ActiveSciBench\-Chem, and $B=20$ for ActiveSciBench\-GRN\.

Figure [7](https://arxiv.org/html/2605.24043#A1.F7) reports target\-matching relative sample efficiency across the three benchmark families\. For each benchmark, we treat the fixed\-budget performance of LLM\-AutoSciLab as the reference target and measure how many oracle queries a comparison method requires to match that target\. A relative sample efficiency of $2\times$ therefore means that the comparison method requires twice as many queries as LLM\-AutoSciLab to attain the same level of performance\.

On NewtonBench, all comparison methods require more than twice the query budget of LLM\-AutoSciLab, with symbolic regression baselines ranging from $2.17\times$ to $2.58\times$ and LLM\-only variants ranging from $2.75\times$ to $2.92\times$\. On ActiveSciBench\-Chem, the strongest non\-LLM methods remain above $2.3\times$, while the LLM\-only and code\-assisted LLM conditions require $5.97\times$ and $5.20\times$ the reference budget, respectively\. On ActiveSciBench\-GRN, the canonical graph baselines GENIE3, GIES, and NOTEARS require $2.4\times$, $1.3\times$, and $1.9\times$ the reference budget, while active and LLM\-based alternatives require between $3.9\times$ and $7.4\times$\. Taken together, these results show that the main benefit of LLM\-AutoSciLab is not only improved final recovery but also substantially better budget utilization, since conditioning acquisition on explicitly competing mechanistic explanations makes each oracle query more informative for resolving structural ambiguity\.
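The target-matching ratio described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function name `relative_sample_efficiency`, the learning-curve dictionary, and the baseline scores are all hypothetical, and the sketch takes the smallest evaluated budget that reaches the target without interpolating between budgets (the fractional ratios reported above suggest the paper uses a finer-grained procedure).

```python
from typing import Dict, Optional


def relative_sample_efficiency(
    curve: Dict[int, float],
    target: float,
    reference_budget: int,
    higher_is_better: bool = True,
) -> Optional[float]:
    """Return (smallest budget at which `curve` reaches `target`) / reference_budget.

    `curve` maps an oracle budget to the score a comparison method achieves
    at that budget. Returns None if the target is never reached.
    """
    for budget in sorted(curve):
        score = curve[budget]
        matched = score >= target if higher_is_better else score <= target
        if matched:
            return budget / reference_budget
    return None


# Hypothetical baseline learning curve (made-up numbers, for illustration only).
baseline_curve = {20: 0.41, 40: 0.55, 60: 0.69, 80: 0.74}

# Reference target: the score of the reference method at its fixed budget B = 20.
ratio = relative_sample_efficiency(baseline_curve, target=0.676, reference_budget=20)
unreached = relative_sample_efficiency(baseline_curve, target=0.90, reference_budget=20)
print(ratio, unreached)
```

For error metrics such as RMSLE, where lower is better, the same function applies with `higher_is_better=False`.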

### A\.4 Qualitative Analysis

![Refer to caption](https://arxiv.org/html/2605.24043v1/images/qual_grn.png)

Figure 8: Qualitative ActiveSciBench\-GRN case study\. LLM\-AutoSciLab exactly recovers the sparse activation chain, while baselines either add spurious auxiliary edges, reverse edge orientation, or recover only partial structure\. Green edges indicate correctly recovered relations; red edges indicate incorrect or spurious relations\.

Figure [8](https://arxiv.org/html/2605.24043#A1.F8) shows a representative ActiveSciBench\-GRN example where the target mechanism is a sparse activation chain, $S\rightarrow A\rightarrow B\rightarrow C$, with an irrelevant regulator $R$\. LLM\-AutoSciLab exactly recovers the ground\-truth chain and correctly excludes $R$, indicating that its perturbation choices isolate the causal backbone rather than merely fitting correlated responses\. In contrast, the baselines recover only parts of the structure\. LLM\-only adds shortcut and auxiliary edges, code\-assisted LLM preserves the main chain but introduces a spurious side branch, and classical graph\-discovery methods either overconnect the graph, reverse orientations, or miss key dependencies\. This example illustrates the main GRN failure mode: baselines often detect local associations but fail to distinguish direct causal edges from indirect or irrelevant perturbation effects, whereas hypothesis\-conditioned acquisition supports targeted disambiguation of the signed graph structure\.

### A\.5 Statistical Significance

Table 5: Standard deviation across benchmarks\. Values are reported as mean $\pm$ standard deviation across 3 seeds\. For NewtonBench and ActiveSciBench\-Chem, columns are SA / Exact / RMSLE\. For ActiveSciBench\-GRN, columns are F1 / Exact / Sign\.

To assess statistical significance, we repeated each benchmark configuration across three seeds and report the mean and standard deviation of the seed\-level aggregate scores\. The mean results are presented in Table [3](https://arxiv.org/html/2605.24043#S4.T3)\. For NewtonBench and ActiveSciBench\-Chem, we report symbolic accuracy \(SA\), exact accuracy, and RMSLE\. For ActiveSciBench\-GRN, we report edge $F_{1}$, exact graph accuracy, and sign accuracy\. Overall, the variance across benchmarks is modest\. The strongest methods, including LLM\-AutoSciLab, remain stable across seeds, while weaker LLM\-only or unguided baselines show somewhat larger variability, particularly on harder graph\-recovery settings\. Standard deviations of $\pm 0.00$ indicate that the seed\-level aggregate scores are identical at the displayed precision, as expected for deterministic baselines or methods run under fixed random seeds\.

## Appendix B Benchmark Details

### B\.1 ActiveSciBench\-Chem

ActiveSciBench\-Chem is an active enzyme kinetics benchmark for mechanism recovery under finite experimental budgets\. Rather than receiving a fixed dataset, the learner adaptively selects biochemical assay conditions and observes reaction rates to infer hidden kinetic mechanisms and parameters\. This setup reflects the challenge of scientific discovery, where multiple mechanisms can explain limited observations and targeted experiments are required to distinguish competing hypotheses\.

Each query specifies a seven\-variable assay condition

$$\mathbf{x}=(C_{A},C_{I},C_{B},C_{P},\mathrm{Enz},T,\mathrm{pH}),$$

covering substrate, inhibitor, secondary substrate, product concentration, enzyme loading, temperature, and pH\. While every task exposes the same interface, each mechanism depends on only a subset of variables, requiring the learner to identify both the governing law and the relevant dimensions\. Dependence on different variables corresponds to distinct mechanistic behaviors such as inhibition, bisubstrate reactions, product feedback, or environmental modulation\.

ActiveSciBench\-Chem contains 57 curated tasks organized into easy, medium, and hard tiers\. Easy tasks correspond to standard kinetic families, medium tasks introduce structured compositions, and hard tasks include weaker identifiability and nonstandard behaviors such as cooperative or allosteric kinetics\. The ActiveSciBench\-Chem benchmark evaluates both active experimentation and symbolic scientific reasoning and is organized around nine canonical base families:

- Michaelis–Menten saturation
- Competitive inhibition
- Product inhibition
- Arrhenius temperature dependence
- Ping\-pong bisubstrate kinetics
- Uncompetitive inhibition
- Substrate inhibition
- Hill cooperativity
- Noncompetitive inhibition

Beyond the nine base families, ActiveSciBench\-Chem includes structured composite mechanisms that combine substrate\-response kinetics with modifiers such as inhibition, temperature dependence, or feedback\. Examples include Michaelis–Menten with competitive inhibition and Arrhenius modulation, ping\-pong bisubstrate kinetics with noncompetitive inhibition, and Hill cooperativity with product feedback\. These composites move beyond isolated textbook mechanisms and require the learner to distinguish competing mechanistic explanations under a shared assay interface\. ActiveSciBench\-Chem also includes a targeted set of extended and nonstandard mechanism families beyond the core base library:

- Ordered sequential bisubstrate kinetics
- Allosteric activation
- Anti\-cooperative Hill behavior
- Fractal or anomalous kinetics
- Mixed inhibition
- Cooperative inhibition
- Monotonic pH dependence
- Metal\-ion activation
- Product activation / autocatalytic feedback
- Dual inhibition by inhibitor and product

Taken together, the 57 ActiveSciBench\-Chem tasks span textbook kinetic families, structured compositions, and targeted extended mechanisms beyond the standard library\. This makes ActiveSciBench\-Chem both interpretable at the mechanism level and sufficiently rich to require active experimentation, relevant\-variable identification, and mechanistic discrimination rather than simple equation retrieval\.

ActiveSciBench\-Chem is a simulator\-based oracle benchmark designed to isolate active kinetic mechanism discovery under controlled, reproducible conditions\. Observations are generated from mechanistically specified kinetic families with hidden parameters and benchmark\-defined noise, rather than physical wet\-lab experiments\. While it does not capture the full complexity of laboratory biochemistry, such as assay failures, batch effects, protocol variability, or experimental cost, it provides a clean, budget\-controlled setting for evaluating closed\-loop mechanistic recovery\.
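To make the shared-interface idea concrete, the following sketch shows what one composite oracle of the kind listed above (Michaelis–Menten with competitive inhibition and Arrhenius modulation) could look like. It is an illustration under stated assumptions, not the benchmark's simulator: the function name, the dictionary keys, all parameter values, and the reference-temperature form of the Arrhenius factor are hypothetical choices; only the textbook rate law itself is standard.

```python
import math

R_GAS = 8.314  # universal gas constant, J / (mol K)


def rate_mm_competitive_arrhenius(
    x, vmax_ref=1.0, km=0.5, ki=0.2, ea=50_000.0, t_ref=298.15
):
    """Hypothetical hidden law over the seven-variable assay interface.

    `x` is a dict with keys C_A, C_I, C_B, C_P, Enz, T, pH. Only C_A, C_I,
    Enz, and T influence the rate; C_B, C_P, and pH are exposed but irrelevant,
    so the learner has to discover which dimensions matter.
    """
    # Arrhenius factor, normalized to 1 at the reference temperature (assumption).
    arrhenius = math.exp(-ea / R_GAS * (1.0 / x["T"] - 1.0 / t_ref))
    # A competitive inhibitor raises the apparent Michaelis constant.
    km_apparent = km * (1.0 + x["C_I"] / ki)
    return vmax_ref * x["Enz"] * arrhenius * x["C_A"] / (km_apparent + x["C_A"])


base = {"C_A": 0.5, "C_I": 0.0, "C_B": 1.0, "C_P": 0.0, "Enz": 1.0, "T": 298.15, "pH": 7.0}
v_base = rate_mm_competitive_arrhenius(base)
v_inhibited = rate_mm_competitive_arrhenius({**base, "C_I": 0.2})
v_other_ph = rate_mm_competitive_arrhenius({**base, "pH": 5.0, "C_B": 3.0})
print(v_base, v_inhibited, v_other_ph)
```

Varying `pH` or `C_B` leaves the rate unchanged here, which is the kind of evidence a learner would need to collect to rule those variables out.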

### B\.2 ActiveSciBench\-GRN

ActiveSciBench\-GRN is an online gene perturbation benchmark for active causal graph discovery in gene regulation\. Unlike ActiveSciBench\-Chem, which focuses on recovering biochemical rate laws, ActiveSciBench\-GRN requires the learner to infer hidden causal structure and nonlinear dynamics from interventional data\. Each task simulates a hidden regulatory system with unknown graph structure, edge signs, and nonlinear dynamics\. The learner performs discrete interventions, such as gene knock\-up or knock\-down experiments, and observes downstream expression changes, while the underlying graph remains unobserved\. Different motifs can produce similar responses under limited interventions, and hidden parameters can make identical motifs appear qualitatively different, requiring the learner to jointly infer topology, sign, and dynamics from sparse experimental data\.

ActiveSciBench\-GRN is built from canonical regulatory motifs spanning increasing structural complexity, including activation, repression, feedback, and switching behavior\. The paper\-facing benchmark contains five core motif families, three topological variants, and three difficulty levels, yielding 45 tasks per random seed\. These motifs are standard systems biology primitives that provide a compact yet meaningful testbed for mechanistic graph discovery\. The five ActiveSciBench\-GRN motif families are:

- Activation chain\. A layered activation cascade from signal to intermediate regulators to the reporter\.
- Coherent feedforward loop\. A motif in which the input acts through both a direct and a mediated activating branch\.
- Incoherent feedforward loop\. A motif in which activation and repression act along competing paths, producing adaptation\-like or pulse\-like behavior\.
- Negative\-feedback circuit\. A self\-limiting repression architecture in which downstream activation induces a repressive branch\.
- Toggle\-switch or bistable decision circuit\. A mutually repressive switching architecture used to model bistability, state selection, and cellular decision making\.

Each motif family is instantiated across three topological variants and three difficulty levels\. The topological variants preserve the motif identity while changing the precise wiring structure, whereas the difficulty levels correspond to increasingly nonlinear and feedback\-sensitive parameter regimes\. Easy tasks exhibit near\-linear responses that reveal the graph relatively clearly, while medium and hard tasks introduce saturation, switching behavior, and bistability that obscure weak or mediated dependencies\. As a result, ActiveSciBench\-GRN is not simply a passive graph estimation problem, but a budgeted intervention design task in which the learner must select perturbations that best distinguish competing regulatory mechanisms\.

ActiveSciBench\-GRN is a simulator\-based oracle benchmark designed to isolate intervention\-driven regulatory discovery under controlled, reproducible conditions\. Responses are generated from motif\-specific nonlinear dynamics with hidden parameters and benchmark\-defined noise rather than physical biological experiments\. Although it does not capture the full complexity of real perturbation biology, including failed interventions, cell\-state heterogeneity, batch effects, off\-target effects, or experimental cost variability, it provides a clean, budget\-controlled testbed for evaluating closed\-loop mechanistic graph recovery\.
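As a rough illustration of how interventions reveal structure in such a task, the sketch below simulates an activation chain over the nodes signal (`S`), `A`, `B`, `C` with an irrelevant regulator `R`. Everything here is an assumption made for illustration: the benchmark's actual dynamics are not specified in this appendix, so the sketch uses a simple feed-forward Hill activation evaluated in one pass, with hypothetical parameter values and a hypothetical `interventions` argument that clamps a gene to a fixed level.

```python
def hill_activation(u, k=0.5, n=2.0):
    """Standard Hill activation: 0 at u = 0, saturating toward 1 for large u."""
    return u**n / (k**n + u**n)


CHAIN = ["S", "A", "B", "C"]  # hidden ground truth: S -> A -> B -> C


def simulate_chain(signal=1.0, interventions=None):
    """Return expression levels; `interventions` clamps genes (e.g. knock-down to 0)."""
    clamp = dict(interventions or {})
    levels = {"S": clamp.get("S", signal)}
    for parent, child in zip(CHAIN, CHAIN[1:]):
        # A clamped gene ignores its parent; otherwise it follows Hill activation.
        levels[child] = clamp.get(child, hill_activation(levels[parent]))
    # R is expressed but regulates nothing in this motif.
    levels["R"] = clamp.get("R", 1.0)
    return levels


baseline = simulate_chain()
knock_a = simulate_chain(interventions={"A": 0.0})  # knock-down of A
knock_r = simulate_chain(interventions={"R": 0.0})  # knock-down of irrelevant R
print(baseline, knock_a, knock_r)
```

Knocking down `A` silences `B` and `C`, while knocking down `R` changes nothing downstream. A learner comparing these two responses can keep the chain and exclude `R`.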

### B\.3 NewtonBench

NewtonBench provides the physics component of our benchmark suite\. In contrast to ActiveSciBench\-Chem and ActiveSciBench\-GRN, it isolates active symbolic law discovery in a setting where the relevant variables for each task are already known\. In our experiments, one oracle call evaluates the hidden physical law at a chosen assignment of task\-specific input variables, and the learner must recover the symbolic law under a finite query budget\. Because NewtonBench is already introduced as a standalone benchmark in \[[43](https://arxiv.org/html/2605.24043#bib.bib36)\], we refer readers there for the full benchmark construction, counterfactual law\-generation procedure, and task catalog\. Table [6](https://arxiv.org/html/2605.24043#A2.T6) summarizes the benchmark suite we use for evaluation in this work\.

Table 6: Benchmark summary\. NewtonBench, ActiveSciBench\-Chem, and ActiveSciBench\-GRN differ in scientific setting, discovery target, interface, and benchmark families\.

## Appendix C Metrics

We evaluate all three benchmarks under a unified objective of recovering the true scientific mechanism\. For NewtonBench and ActiveSciBench\-Chem, we report three metrics\. First, we measure predictive fidelity using the root mean squared logarithmic error \(RMSLE\),

$$\mathrm{RMSLE}(\hat{f},f)=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\log(1+\hat{y}_{i})-\log(1+y_{i})\right)^{2}}.$$

RMSLE is appropriate because target values in these domains can span multiple orders of magnitude\. Second, we report numerical exact accuracy,

$$\mathrm{ExAcc}(\hat{f})=\mathbf{1}\!\left[\mathrm{RMSLE}(\hat{f},f)<0.01\right],$$

which follows the paper’s exact\-recovery criterion\. Third, we report symbolic accuracy \(SA\), which measures whether the recovered law matches the ground\-truth mechanism up to algebraic rewriting, fitted constants, and variable renaming\. Symbolic accuracy is stricter than numerical exactness, since a numerically accurate approximation need not recover the correct mechanistic form\.

For ActiveSciBench\-GRN, the target is graph recovery rather than scalar law recovery\. We therefore report edge\-level precision, recall, and F1, sign accuracy for activation versus repression, exact graph accuracy, and motif accuracy\. These metrics distinguish partial structural recovery from full mechanistic recovery\.

For NewtonBench and ActiveSciBench\-Chem, symbolic accuracy is evaluated with an LLM judge that determines whether the predicted hypothesis is equivalent to the ground\-truth expression up to constant parameter values\. In this evaluation, the ground\-truth law is presented as expression $A$, and the candidate hypothesis $B$ may be represented either as an executable program or as a symbolic expression\. The judge is prompted as follows:

```
Question: Given the ground truth mathematical expression A and the
hypothesis B, determine if there exist any constant parameter values that
would make the hypothesis equivalent to the given ground truth expression.
Let’s think step by step. Explain your reasoning and
then provide the final answer as:
{
  "reasoning": "Brief step-by-step analysis",
  "answer": "Yes/No"
}
```

A prediction is counted as symbolically correct if the judge returns “Yes\.”
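The numerical metrics defined in this appendix are straightforward to compute. The sketch below is a minimal NumPy version of RMSLE, the exact-accuracy indicator with the $0.01$ threshold, and edge-level precision, recall, and F1. The function names are hypothetical, and representing a graph as a set of unsigned `(source, target)` pairs is an assumption made for the edge metrics; sign and motif accuracy are omitted because their exact definitions are not given here.

```python
import numpy as np


def rmsle(y_pred, y_true):
    """Root mean squared logarithmic error, using log(1 + y) as in the definition."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


def exact_accuracy(y_pred, y_true, tol=0.01):
    """Indicator 1[RMSLE < tol]."""
    return int(rmsle(y_pred, y_true) < tol)


def edge_precision_recall_f1(pred_edges, true_edges):
    """Edge-level metrics over sets of (source, target) pairs (signs ignored)."""
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


truth = {("S", "A"), ("A", "B"), ("B", "C")}
guess = {("S", "A"), ("A", "B"), ("S", "B")}  # one shortcut edge, one missed edge
p, r, f1 = edge_precision_recall_f1(guess, truth)
print(rmsle([1.0, 10.0], [1.0, 10.0]), exact_accuracy([1.0, 10.0], [1.1, 10.0]), p, r, f1)
```

The second call shows why exact accuracy is demanding: a 10% error on one of two points already gives an RMSLE of about 0.035, above the threshold.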

## Appendix D ActiveSciBench Design and Memorization

A key concern for LLM\-based scientific discovery is whether strong performance reflects genuine mechanistic recovery or simple recall of textbook systems\. Our benchmarks are designed to make retrieval alone insufficient\. Each run begins without observations, requiring the learner to actively select experiments under a limited oracle budget and infer the hidden mechanism from the resulting responses\. The instantiated system, not just the family label, is hidden throughout\. In NewtonBench, the learner observes the natural variables but not the hidden governing law\. In ActiveSciBench\-Chem, all tasks share a common seven\-variable assay interface while the relevant variables, mechanism class, and parameterization remain hidden\. In ActiveSciBench\-GRN, the learner must infer a hidden signed regulatory mechanism from perturbation responses rather than directly observing the graph\. At the same time, realistic scientific semantics are preserved through variables such as substrate, inhibitor, temperature, and perturbation target, ensuring the evaluation measures scientific reasoning rather than anonymized symbol matching\.

This can also be quantified through conservative lower bounds on the hidden task space\. In ActiveSciBench\-Chem, the 57 mechanism classes contain between 2 and 8 hidden continuous parameters \(median 5\)\. Even under a coarse discretization of only 100 possible values per parameter, this yields more than $6\times 10^{16}$ possible hidden instantiations\. ActiveSciBench\-GRN is broader still: the target is a signed regulatory graph over $\{\mathrm{signal},A,B,C,R\}$ with 20 possible directed non\-self edges, each absent, activating, or repressing, yielding $3^{20}\approx 3.5\times 10^{9}$ unconstrained signed graphs\. Even with a sparsity cap of six edges, this leaves more than 3 million candidate graphs before accounting for hidden continuous dynamics\. These values are conservative lower bounds, but they illustrate that the benchmarks cannot be reduced to retrieving a small library of familiar mechanisms\.
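The two graph counts quoted for ActiveSciBench-GRN can be checked directly. The short calculation below assumes that "a sparsity cap of six edges" means at most six present edges, each of which is either activating or repressing; the variable names are illustrative.

```python
from math import comb

n_nodes = 5                          # signal, A, B, C, R
n_edges = n_nodes * (n_nodes - 1)    # directed non-self edges: 20

# Each possible edge is absent, activating, or repressing.
unconstrained = 3 ** n_edges

# At most six edges present: choose which k edges exist, then a sign for each.
capped = sum(comb(n_edges, k) * 2**k for k in range(7))

print(n_edges, unconstrained, capped)  # 20 3486784401 3064209
```

Both values agree with the text: roughly $3.5\times 10^{9}$ unconstrained signed graphs and just over 3 million graphs under the six-edge cap.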

## Appendix E Implementation Details

### E\.1 LLM\-AutoSciLab

Algorithm 2 Adaptive Ensemble

1. Input: prompt context $S_{t}$, base ensemble size $K$, cap $K_{\max}$, entropy threshold $\tau_{H}$
2. $\mathcal{C}\leftarrow\emptyset,\;H_{\mathrm{prev}}\leftarrow\varnothing$
3. while $|\mathcal{C}|<K_{\max}$ do
4. $\quad b\leftarrow\min(K,K_{\max}-|\mathcal{C}|)$
5. $\quad \mathcal{B}\leftarrow\texttt{SampleHypotheses}(\pi_{\theta}^{\rm small},S_{t},b)$
6. $\quad \mathcal{C}\leftarrow\mathcal{C}\cup\texttt{FilterValid}(\mathcal{B})$
7. $\quad$ if $|\mathcal{C}|<K$ then
8. $\quad\quad$ continue
9. $\quad$ end if
10. $\quad$ \# Cluster $\mathcal{C}$ by structural skeleton and compute entropy $H$
11. $\quad$ if $H_{\mathrm{prev}}\neq\varnothing$ then
12. $\quad\quad$ if $|H-H_{\mathrm{prev}}|<\tau_{H}$ then
13. $\quad\quad\quad$ break
14. $\quad\quad$ end if
15. $\quad$ end if
16. $\quad H_{\mathrm{prev}}\leftarrow H$
17. end while
18. return $\texttt{BuildDistribution}(\mathcal{C})$

Our implementation of LLM\-AutoSciLab instantiates the closed\-loop discovery framework in a hypothesis\-driven setting with adaptive ensembling\. At each stage, the system maintains a set of candidate mechanisms, proposes informative experiments conditioned on disagreement among those candidates, updates the observation set with oracle feedback, and periodically refines the candidate laws through symbolic fitting\. The implementation uses a two\-model architecture in which a smaller local model generates a diverse hypothesis set, while the main model synthesizes these hypotheses with accumulated evidence to guide subsequent experimentation\. Symbolic refinement results are summarized in a structured memory representation and re\-injected into later reasoning steps\.

#### Prompting and Hypothesis Generation\.

The hypothesis\-generation step receives the current goal, domain description, parameter names, Python function signature, experiment history as a text table, the current working hypothesis, the remaining budget, the current phase, the best symbolic equation found so far \(if available\), and the structured memory summary\. At this stage, the model is asked to propose one primary hypothesis together with multiple alternate hypotheses, but *not* search regions\. All hypotheses are returned as pure Python expressions using the exact oracle variable names and symbolic free constants such as C0, C1, alpha, and beta\. In the paper configuration, the smaller ensemble model is sampled in parallel with base ensemble size $K=5$, and each call produces a structured hypothesis output from which the primary candidate is retained for ensemble construction\.

#### Adaptive Ensemble Construction\.

Raw ensemble hypotheses are filtered before use: expressions that cannot be executed on the observed data, produce predominantly non\-finite predictions, or are clearly nonsensical under the current dataset are discarded\. The remaining hypotheses are then clustered by structural skeleton, obtained by canonicalizing constants while preserving functional form\. LLM\-AutoSciLab uses adaptive ensemble growth in which hypotheses are sampled in batches of size $K$, reclustered after each batch, and the Shannon entropy of the structural cluster distribution is recomputed\. Sampling stops when the entropy change between successive batches falls below $0.1$, or when the hard cap of $K_{\max}=20$ samples is reached\. The resulting hypothesis distribution stores the raw hypotheses, structurally unique representatives, cluster assignments, the majority\-cluster agreement score, and a synthesis summary that is passed to the main model\.
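The adaptive growth loop of Algorithm 2 can be sketched in Python as follows. This is a simplified illustration under several assumptions: `sample_hypotheses` and `is_valid` are hypothetical callables standing in for the small-model sampler and the validity filter, the regex-based `skeleton` is a crude stand-in for the structural canonicalization (it collapses named free constants and numeric literals into one placeholder), entropy is measured in nats, and the `max_batches` guard is an addition that prevents an endless loop when the sampler keeps returning invalid hypotheses.

```python
import math
import re
from collections import Counter


def skeleton(expr):
    """Crude structural skeleton: constants become 'c', whitespace is dropped."""
    expr = re.sub(r"\b(C\d+|alpha|beta)\b", "c", expr)                  # named free constants
    expr = re.sub(r"(?<![\w.])\d+(?:\.\d*)?(?:[eE][-+]?\d+)?", "c", expr)  # numeric literals
    return re.sub(r"\s+", "", expr)


def cluster_entropy(hypotheses):
    """Shannon entropy (nats) of the skeleton-cluster distribution."""
    counts = Counter(skeleton(h) for h in hypotheses)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def adaptive_ensemble(sample_hypotheses, is_valid, K=5, K_max=20, tau_H=0.1, max_batches=50):
    """Grow the pool in batches until cluster entropy stabilizes or the cap is hit."""
    pool, h_prev = [], None
    for _ in range(max_batches):
        if len(pool) >= K_max:
            break
        b = min(K, K_max - len(pool))
        pool.extend(h for h in sample_hypotheses(b) if is_valid(h))
        if len(pool) < K:
            continue  # too few valid hypotheses to estimate entropy yet
        h = cluster_entropy(pool)
        if h_prev is not None and abs(h - h_prev) < tau_H:
            break     # entropy change below threshold: stop sampling
        h_prev = h
    return pool


# Hypothetical sampler that always proposes the same Michaelis-Menten form.
def demo_sampler(b):
    return ["C0*C_A/(C1+C_A)"] * b


pool = adaptive_ensemble(demo_sampler, is_valid=lambda h: True)
print(len(pool), skeleton("2.5*C_A/(0.1+C_A)"))
```

With a sampler that never disagrees with itself, the entropy is zero after both the first and second batch, so the loop stops after two batches rather than spending all $K_{\max}$ samples. The sketch returns the raw pool only; building the full hypothesis distribution (representatives, cluster assignments, agreement score) is left out.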

#### Hypothesis\-Conditioned Acquisition\.

The main model receives a second prompt containing the full experiment history, the current working hypothesis, the structured memory summary, and a compact summary of the current hypothesis set, including the number of sampled hypotheses, the number of unique structures, and representative candidates from each structural cluster\. It returns an updated primary hypothesis, alternate hypotheses, and a set of search regions\. In parallel, LLM\-AutoSciLab computes a falsification\-oriented disagreement score directly from the current hypothesis set\. For a candidate point $\mathbf{x}$, let $\hat{f}_{1}(\mathbf{x}),\dots,\hat{f}_{K}(\mathbf{x})$ denote the predictions induced by the current candidate mechanisms\. We define the disagreement score as

$$\Delta(\mathbf{x})=\mathrm{Std}\!\left(\log_{10}\hat{f}_{1}(\mathbf{x}),\ldots,\log_{10}\hat{f}_{K}(\mathbf{x})\right).$$

For ActiveSciBench\-GRN, disagreement is computed over fitted graph hypotheses through their predicted intervention responses\. If graph hypothesis $g_{k}$ predicts a post\-intervention response vector $\hat{\mathbf{y}}_{k}(a)\in\mathbb{R}_{>0}^{d}$ for intervention $a$, we define

$$\delta_{\mathrm{GRN}}(a)=\frac{1}{d}\sum_{\ell=1}^{d}\operatorname{Var}_{k=1,\ldots,K}\!\Bigl[\log\!\bigl(\hat{y}_{k,\ell}(a)+\varepsilon\bigr)\Bigr].$$

This means that edge and sign disagreements matter only through their falsifiable intervention consequences\. The role of the LLM at this stage is to propose search regions expected to be informative for separating competing mechanisms; the final experiment points are then selected by the active\-learning layer within those regions\. When the system is in the low\-confidence regime, candidate points are sampled from the proposed bounds, scored by $\Delta(\mathbf{x})$, and a small diverse subset is chosen for oracle evaluation\. This yields a hypothesis\-conditioned acquisition rule that preferentially queries regions where competing mechanisms make sharply different predictions\.
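Both disagreement scores reduce to a few NumPy calls. The sketch below is illustrative rather than the released implementation: the function names are hypothetical, the population (`ddof=0`) forms of standard deviation and variance are an assumption since the estimator is not specified, and clipping predictions to a small positive floor before taking $\log_{10}$ is an added guard. The diversity step that picks a spread-out subset is omitted; the example simply takes the highest-scoring candidate.

```python
import numpy as np


def disagreement(preds, floor=1e-12):
    """Delta(x): std over hypotheses of log10 predictions.

    `preds` has shape (K, n_candidates): row k holds hypothesis k's predictions.
    """
    logs = np.log10(np.clip(np.asarray(preds, dtype=float), floor, None))
    return logs.std(axis=0)


def grn_disagreement(responses, eps=1e-6):
    """delta_GRN(a): mean over d genes of the variance over K graph hypotheses.

    `responses` has shape (K, d) for a single intervention a.
    """
    logs = np.log(np.asarray(responses, dtype=float) + eps)
    return float(logs.var(axis=0).mean())


# Two competing rate laws scored on three candidate substrate concentrations.
x = np.array([0.01, 1.0, 100.0])
saturating = x / (1.0 + x)   # Michaelis-Menten-like hypothesis
linear = x                   # first-order hypothesis
scores = disagreement(np.vstack([saturating, linear]))
best = int(np.argmax(scores))
print(scores, x[best])
```

The two hypotheses are nearly indistinguishable at low concentration and differ by about two orders of magnitude at $x=100$, so the score directs the next query to the saturating regime.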

#### Hypothesis Refinement\.

Mechanism refinement is performed on accumulated observations using a domain\-specific refinement backend\. The refinement stage is run periodically rather than continuously\. Before the backend is invoked, the loop extracts the variables implicated by the current hypothesis set and uses them to focus the refinement search, optionally reintroducing variables whose residual behavior suggests missing structure\. In equation\-discovery settings, refinement combines direct fitting of candidate structural families with numerical parameter optimization and symbolic search over the accumulated observations\. In graph\-discovery settings, refinement updates the candidate signed regulatory structures and their associated dynamical parameters using the observed perturbation responses\. The refined candidates are then pooled for later arbitration and memory updates\.

#### Bootstrap Confidence\.

Confidence is computed in bootstrap mode and is used to control the acquisition regime rather than to terminate the run\. After fitting a candidate mechanism, the domain\-specific refinement backend is rerun on bootstrap resamples of the training split, and the resulting models are evaluated on a held\-out validation split\. Let $\hat{y}^{(b)}$ denote the prediction vector from bootstrap fit $b$\. We compute bootstrap confidence from the mean coefficient of variation across validation predictions,

$$\mathrm{conf}_{\mathrm{boot}}=1-\frac{1}{N}\sum_{i=1}^{N}\frac{\mathrm{std}_{b}\!\left(\hat{y}^{(b)}_{i}\right)}{\left|\mathrm{mean}_{b}\!\left(\hat{y}^{(b)}_{i}\right)\right|+\varepsilon},$$

clipped to $[0,1]$\. In our setup, this confidence determines whether acquisition emphasizes hypothesis disambiguation or parameter refinement\. When bootstrap confidence remains below the gating threshold, the system stays in a disagreement\-driven regime and selects experiments using hypothesis\-conditioned acquisition to separate competing mechanisms\. Once confidence exceeds the threshold \(0\.9 in the paper configuration\), the system switches to a refinement regime, where acquisition is driven by uncertainty sampling to reduce residual uncertainty within the current high\-confidence mechanism class\.
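A minimal NumPy version of the confidence score and the regime switch might look as follows. The function names, the value of $\varepsilon$, and the use of the population standard deviation are assumptions; the refitting of the backend on bootstrap resamples is not shown, and the input is simply the stacked validation predictions from those fits.

```python
import numpy as np


def bootstrap_confidence(preds, eps=1e-8):
    """1 minus the mean coefficient of variation across validation points, clipped to [0, 1].

    `preds` has shape (n_bootstrap_fits, n_validation_points).
    """
    preds = np.asarray(preds, dtype=float)
    cv = preds.std(axis=0) / (np.abs(preds.mean(axis=0)) + eps)
    return float(np.clip(1.0 - cv.mean(), 0.0, 1.0))


def acquisition_mode(confidence, threshold=0.9):
    """Disagreement-driven below the gate, refinement once confidence exceeds it."""
    return "refinement" if confidence > threshold else "disagreement"


stable_fits = [[1.00, 2.00, 4.00], [1.01, 1.98, 4.02], [0.99, 2.02, 3.98]]
unstable_fits = [[1.0, 1.0], [3.0, 3.0]]

conf_stable = bootstrap_confidence(stable_fits)
conf_unstable = bootstrap_confidence(unstable_fits)
print(conf_stable, acquisition_mode(conf_stable))
print(conf_unstable, acquisition_mode(conf_unstable))
```

When bootstrap refits agree to within about one percent, the score sits close to 1 and the loop would move to refinement; when refits differ by a factor of three, the score falls to about 0.5 and acquisition stays disagreement-driven.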

#### Memory and Final Selection\.

The memory injected into later prompts is a structured summary rather than a raw conversation transcript\. It can include the full symbolic\-regression history, the current best equation and its fit statistics, bootstrap confidence, mid\-run ground\-truth RMSLE when available, validated hypotheses that survived discriminating experiments, negative evidence for falsified forms, and a hypothesis scoreboard that tracks validated, failed, and uncertain structures\. At budget exhaustion, LLM\-AutoSciLab performs a final symbolic fitting pass using the full symbolic\-regression budget, reuses the final variable filter and hypothesis\-family pool, and constructs a final candidate set from domain\-specific optimization candidates, direct skeleton fits, and validated hypothesis survivors\. The main model then arbitrates among these final candidates based on scientific plausibility and consistency with the observed data, rather than selecting solely on training $R^{2}$\. If a mid\-run candidate achieved a clearly better ground\-truth validation score, that candidate is preserved as the final equation\.

### E\.2 Baselines

All methods use the same oracle, task instance, difficulty, law version, and total experiment budget as LLM\-AutoSciLab\. For the LLM\-only and Code\-assisted LLM conditions, we follow the same definitions and prompting/tool\-use setups as in NewtonBench and refer readers to \[[43](https://arxiv.org/html/2605.24043#bib.bib36)\] for those implementation details\. Below, we summarize the remaining comparison methods used in our benchmark suite\.

Random\+PySR\. This method uses a non\-adaptive experimental design followed by symbolic regression\. It serves as the non\-adaptive floor\.

BO\+PySR\. This method uses Gaussian\-process Bayesian optimization without LLM guidance\.

BED\+PySR\. This method performs Bayesian experimental design over a fixed hand\-specified mechanism library\. At each step, each candidate mechanism is fit to the current observations by nonlinear least squares in log\-space, candidate experiments are sampled from the admissible bounds, and the next experiment is chosen to maximize the disagreement between the fitted mechanisms\. After the budget is exhausted, the best\-fitting library member is retained, and PySR is run as a final symbolic refinement stage\. This uses the same general style of mechanistic library reasoning as the LLM pipeline, but with a fixed, predefined family set rather than dynamically generated hypotheses\.

GENIE3\. We evaluate GENIE3 on fixed ActiveSciBench\-GRN datasets collected under the same task budget and use the standard GENIE3 implementation \[[19](https://arxiv.org/html/2605.24043#bib.bib76)\], with a lightweight post\-processing step to convert the output into the signed\-graph format required by ActiveSciBench\-GRN evaluation\.

GIES\. We evaluate GIES on the same fixed ActiveSciBench\-GRN datasets using the standard pcalg implementation \[[17](https://arxiv.org/html/2605.24043#bib.bib75)\], with intervention labels corresponding to the benchmark perturbation environments and the same signed\-graph conversion\.

NOTEARS\. We evaluate the linear NOTEARS method on the same fixed ActiveSciBench\-GRN datasets using the reference notears implementation \[[44](https://arxiv.org/html/2605.24043#bib.bib74)\], followed by the same signed\-graph conversion used for benchmark evaluation\.

LLM\-only and Code\-assisted LLM\. We follow the implementation of these baselines as presented in NewtonBench \[[43](https://arxiv.org/html/2605.24043#bib.bib36)\]\.

### E\.3 Experimental Protocol

Unless otherwise noted, all experiments use deterministic benchmark instances with zero oracle noise and matched task manifests\. NewtonBench evaluates 12 domains across 3 difficulty levels, 3 law versions, and 3 seeds \(324 runs total\), with representative budget studies using 96 tasks\. For ActiveSciBench\-Chem, representative manifests contain 36 tasks, with standard budgets $B\in\{20,40,60,80,100\}$ and fixed\-budget comparisons at $B=60$\. For ActiveSciBench\-GRN, representative manifests contain 36 tasks for budget studies and 18 for noise studies, using $B\in\{10,20,50\}$ with fixed\-budget comparisons at $B=20$\. NewtonBench fixed\-budget comparisons also use $B=20$, with extended studies reported up to $B=100$\. Evaluation is performed on held\-out oracle outputs\. Symbolic regression tasks report RMSLE\-based recovery together with exact and symbolic recovery, while ActiveSciBench\-GRN reports edge $F_{1}$, exact graph accuracy, and sign accuracy against the hidden graph\. All LLM\-based experiments use GPT\-4o\-mini as the primary reasoning model and Qwen/Qwen2\.5\-7B\-Instruct for adaptive ensembling via local vLLM\. The main model is used without task\-specific fine\-tuning, and ensemble sampling uses a temperature of 1\.0 for structural diversity\. Symbolic regression refinement uses PySR with 800 iterations plus direct fitting of candidate mechanism skeletons when available, while ActiveSciBench\-GRN uses signed\-graph fitting with BFGS optimization\. Bootstrap confidence from held\-out validation fits is used only for acquisition\-mode switching\.
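For concreteness, the reported metrics can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's evaluation code: RMSLE is written in the common `log1p` form, sign accuracy is computed over edges present in both graphs, and a signed graph is represented as a dictionary from `(src, dst)` to `+1` or `-1`. The function names are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]
SignedGraph = Dict[Edge, int]  # (src, dst) -> +1 or -1 (assumed encoding)


def rmsle(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared log error in the log1p form (values must exceed -1)."""
    sq = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(sq) / len(sq))


def edge_f1(true: SignedGraph, pred: SignedGraph) -> float:
    """F1 over directed edges, ignoring signs."""
    tp = len(set(true) & set(pred))
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


def sign_accuracy(true: SignedGraph, pred: SignedGraph) -> float:
    """Fraction of shared edges whose predicted sign matches (assumed definition)."""
    shared = set(true) & set(pred)
    if not shared:
        return 0.0
    return sum(true[e] == pred[e] for e in shared) / len(shared)


def exact_graph_match(true: SignedGraph, pred: SignedGraph) -> bool:
    """True only when edge sets and all signs are identical."""
    return true == pred


hidden = {("signal", "A"): 1, ("A", "B"): -1, ("B", "C"): 1}
recovered = {("signal", "A"): 1, ("A", "B"): 1, ("A", "C"): 1}

f1 = edge_f1(hidden, recovered)          # 2 of 3 edges shared on each side
signs = sign_accuracy(hidden, recovered)  # 1 of 2 shared edges has the right sign
exact = exact_graph_match(hidden, recovered)
```

The example shows why the three graph metrics are reported together: a recovered graph can score a moderate edge $F_{1}$ while still failing exact recovery and getting half of the shared signs wrong.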

### E\.4 Prompt Templates

Hypothesis Generation \(Symbolic Regression\)

```
GOAL: {goal}
DOMAIN: {domain}
PARAMETERS: {param_names}
{param_description}
OBJECTIVE TYPE: {objective_type}
OBJECTIVE DIRECTION: {objective_direction}
{optional_objective_profile}
FUNCTION SIGNATURE: {function_signature}
DATA ({budget_pct}% budget used, {budget_remaining} experiments left):
{data_table}
Best symbolic equation so far: {best_equation_fit}
{memory_str}
Current hypotheses: {current_hypothesis}
CURRENT PHASE: {current_phase}
TASK: Generate one primary and 2-6 alternate EQUATION hypotheses that remain plausible under current data.
Do NOT propose search regions in this step.
Focus only on plausible competing hypotheses and concise reasoning for ambiguity.
```

Search\-Region Proposal \(Symbolic Regression\)

```
GOAL: {goal}
DOMAIN: {domain}
PARAMETERS: {param_names}
{param_description}
OBJECTIVE TYPE: {objective_type}
OBJECTIVE DIRECTION: {objective_direction}
{optional_objective_profile}
DISCOVERED LAW FUNCTION SIGNATURE (must match exactly): {function_signature}
EXPERIMENTAL DATA ({budget_pct}% of budget used, {budget_remaining} experiments remaining):
{data_table}
{best_equation_if_present}
{hypothesis_requirement_if_present}
Current best hypothesis: {current_hypothesis}
{memory_str}
{phase_instruction}
{discrimination_hints}
Based on this data, propose the most informative parameter regions to explore next.
Think carefully about what the data represents and what experiments would most help discover the governing equation via disambiguating the competing hypotheses set.
search_regions must be a list of objects with this exact structure:
{"bounds": {"p1": [lo, hi], "p2": [lo, hi], ...}, "n_experiments": 4, "priority": "high", "rationale": "why"}
```

Hypothesis\-to\-Executable Structure \(Symbolic Regression\)

```
GOAL: {goal}
DOMAIN: {domain}
PARAMETERS: {params_str}
REQUIRED FUNCTION SIGNATURE (use EXACTLY this): {function_signature}
EXPERIMENTAL DATA (raw values and [log10 values]):
{data_table}
CURRENT HYPOTHESIS: {current_hypothesis}
{previous_attempts_if_any}
Read the log10 columns to estimate exponents: \Delta log10(measurement) / \Delta log10(param) \approx exponent.
Then propose a Python function 'discovered_law' that fits this structural form.
RULES:
- Use free constants C0, C1, alpha, beta, ... — they will be fitted numerically.
- Use ONLY standard Python math (no imports). Powers: use ** not pow().
- Use the EXACT parameter names from the FUNCTION SIGNATURE above.
- Constants must appear in the return expression.
```
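The exponent heuristic in this prompt (the ratio of log10 differences) amounts to a slope in log-log space. The sketch below estimates that slope by ordinary least squares and shows a `discovered_law` written under the prompt's rules. The signature `(m1, m2, r)`, the keyword-argument handling of `C0` and `alpha`, and the helper `loglog_slope` are hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def loglog_slope(param: Sequence[float], measurement: Sequence[float]) -> float:
    """Least-squares slope of log10(measurement) against log10(param).

    Assumes all other parameters were held fixed while `param` varied,
    and that every value is strictly positive.
    """
    xs = [math.log10(p) for p in param]
    ys = [math.log10(m) for m in measurement]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den


def discovered_law(m1, m2, r, C0=1.0, alpha=-2.0):
    # Free constants C0 and alpha appear in the return expression and
    # would be fitted numerically; powers use ** as the prompt requires.
    return C0 * m1 * m2 * r ** alpha


# Synthetic sweep over r with m1 and m2 held at 1.0.
r_values = [1.0, 2.0, 4.0, 8.0]
measured = [discovered_law(1.0, 1.0, r, C0=3.0) for r in r_values]
exponent = loglog_slope(r_values, measured)
```

On this noise-free sweep the estimated exponent recovers the inverse-square dependence, which is the structural hint the prompt asks the model to read off the log10 columns before proposing a skeleton.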

Hypothesis Generation and Region Proposal \(Graph Recovery\)

```
GOAL: {goal}
DOMAIN: {oracle.param_description}
CURRENT BEST HYPOTHESIS: {current_hypothesis}
RECENT DATA: {store_to_table}
HYPOTHESIS FIT SUMMARY: {lineage_summary}
MEMORY SUMMARY: {memory_str}
CONFIDENCE SUMMARY: {confidence_str}
ENSEMBLE GRAPH DISTRIBUTION: {ensemble_summary}
{optional_diversity_requirement_block}
BUDGET REMAINING: {budget_remaining}
MAX EXPERIMENTS THIS ITERATION: {max_experiments_per_iter}
Return:
- 2-5 natural-language mechanism hypotheses with stable IDs like h1, h2, h3
- a primary_hypothesis_id naming the current best one
- 1-4 search regions for the next experiments
- confidence and done flag
```

Single\-Hypothesis Sampling \(Graph Recovery\)

```
GOAL: {goal}
DOMAIN: {oracle.param_description}
CURRENT BEST HYPOTHESIS: {current_hypothesis}
RECENT DATA: {store_to_table}
HYPOTHESIS FIT SUMMARY: {lineage_summary}
MEMORY SUMMARY: {memory_str}
CONFIDENCE SUMMARY: {confidence_str}
BUDGET REMAINING: {budget_remaining}
Return exactly one plausible natural-language mechanism hypothesis, with a short rationale and confidence.
Do not output graph edges or search regions.
```

Hypothesis\-to\-Structure Translation \(Graph Recovery\)

```
Translate this single natural-language hypothesis into one signed graph.
hypothesis_id: {hypothesis_id}
text: {hypothesis_text}
Fill this schema exactly:
{
  "translation": {
    "hypothesis_id": "{hypothesis_id}",
    "rationale": "why this graph matches the hypothesis",
    "assumptions": ["optional assumption"],
    "edges": [
      {"src": "signal", "dst": "A", "sign": 1},
      {"src": "A", "dst": "C", "sign": 1}
    ]
  }
}
Requirements:
- dst must be one of {graphs_nodes}.
- The graph must contain a directed path from signal to {final_node}.
- If the hypothesis does not mention {final_node} explicitly, add the minimal faithful chain needed so the signal reaches {final_node}.
```
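The two structural requirements in this template (allowed `dst` nodes and a directed path from `signal` to the final node) are mechanical to check. The sketch below shows one way a returned translation could be validated before fitting. The function `validate_translation` is hypothetical, and the check that `sign` is `1` or `-1` is an assumption, since the schema only shows `1`.

```python
from collections import deque
from typing import Dict, List, Set


def validate_translation(translation: dict,
                         graph_nodes: Set[str],
                         final_node: str) -> List[str]:
    """Return a list of requirement violations (empty when the graph is valid)."""
    errors: List[str] = []
    adjacency: Dict[str, List[str]] = {}
    for edge in translation["edges"]:
        if edge["dst"] not in graph_nodes:
            errors.append(f"dst {edge['dst']!r} is not an allowed node")
        if edge["sign"] not in (1, -1):  # assumed sign encoding
            errors.append(f"sign {edge['sign']!r} is not 1 or -1")
        adjacency.setdefault(edge["src"], []).append(edge["dst"])

    # Breadth-first search from the signal node to test reachability.
    seen = {"signal"}
    queue = deque(["signal"])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    if final_node not in seen:
        errors.append(f"no directed path from 'signal' to {final_node!r}")
    return errors


# The example edges from the schema above, checked against hypothetical nodes.
valid = {"edges": [{"src": "signal", "dst": "A", "sign": 1},
                   {"src": "A", "dst": "C", "sign": 1}]}
broken = {"edges": [{"src": "signal", "dst": "A", "sign": 1}]}

ok_errors = validate_translation(valid, {"A", "B", "C"}, "C")
broken_errors = validate_translation(broken, {"A", "B", "C"}, "C")
```

Returning a list of violations rather than a boolean makes it straightforward to feed the specific failure back into a retry prompt, although whether the pipeline retries in this way is not stated in the text.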

## Acknowledgements

This research was partially supported by the U\.S\. National Science Foundation \(NSF\) under Grant No\. 2416728 and Autodesk Research\. This work was supported by a Laboratory Directed Research & Development \(LDRD\) project\. This work was performed at the Center for Integrated Nanotechnologies, a U\.S\. Department of Energy Office of Science user facility\. This article was authored by an employee of National Technology & Engineering Solutions of Sandia, LLC under Contract No\. DE\-NA0003525 with the U\.S\. DOE\. The employee retains all rights to the article and is solely responsible for its contents\. The U\.S\. Government retains a non\-exclusive, paid\-up, irrevocable, worldwide license to publish or reproduce this work for government purposes\. Public access will be provided in accordance with the DOE Public Access Plan: https://www\.energy\.gov/downloads/doe\-public\-access\-plan\.